## Combining the text files in a `zip` archive into one `csv` file with Pandas
- 0. Download and inspect the zip file from the source using the `wget.download()` method.
- 1. Move the zip file to its final data directory destination.
- 2. Extract the `Year` value from each text file's name.
- 3. `Join` all dfs and create `csv` file.
- 4. Clean up.

In [1]:
%pip install wget --quiet

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import numpy as np
import os
from glob import glob
import wget
from zipfile import ZipFile 

- 0. Download and inspect the zip file from the source using the `wget.download()` method.

In [3]:
url = 'http://www.ssa.gov/OACT/babynames/names.zip'
babynames = wget.download(url)
babynames

'names.zip'
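
The download cell returns only the file name, so the "inspect" half of this step can be done with `ZipFile.namelist()`, which lists the members without extracting anything. Below is a minimal sketch. It builds a tiny stand-in archive so that it runs without a download; `list_archive`, `names_demo.zip`, and the placeholder rows are assumptions for illustration, not part of the SSA data.

```python
import os
import tempfile
from zipfile import ZipFile


def list_archive(zip_path):
    """Return the member names of a zip archive without extracting it."""
    with ZipFile(zip_path) as archive:
        return archive.namelist()


# Build a small stand-in archive (placeholder rows, not real SSA counts).
demo_dir = tempfile.mkdtemp()
demo_zip = os.path.join(demo_dir, "names_demo.zip")
with ZipFile(demo_zip, "w") as archive:
    archive.writestr("yob1880.txt", "Mary,F,10\nAnna,F,5\n")
    archive.writestr("yob1881.txt", "Mary,F,12\n")

members = list_archive(demo_zip)
print(members)
```

With the real download, `list_archive('names.zip')` should show the `yobYYYY.txt` members before anything is written to disk.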

- 1. Move the zip file to its final data directory destination.

In [4]:
!mv names.zip data
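
Note that `mv names.zip data` renames the archive to a file called `data` if no `data` directory exists yet. A portable alternative is sketched below with `os.makedirs` and `shutil.move`; the helper name `move_to_data_dir` and the scratch directory are assumptions made for the example.

```python
import os
import shutil
import tempfile


def move_to_data_dir(src, data_dir):
    """Move `src` into `data_dir`, creating the directory first if needed."""
    os.makedirs(data_dir, exist_ok=True)
    return shutil.move(src, os.path.join(data_dir, os.path.basename(src)))


# Demonstrate in a scratch directory with an empty stand-in file.
work_dir = tempfile.mkdtemp()
src = os.path.join(work_dir, "names.zip")
open(src, "wb").close()

moved_path = move_to_data_dir(src, os.path.join(work_dir, "data"))
print(moved_path)
```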

In [5]:
!unzip data/names.zip 

Archive:  data/names.zip
  inflating: yob1880.txt             
  inflating: yob1881.txt             
  inflating: yob1882.txt             
  inflating: yob1883.txt             
  inflating: yob1884.txt             
  inflating: yob1885.txt             
  inflating: yob1886.txt             
  inflating: yob1887.txt             
  inflating: yob1888.txt             
  inflating: yob1889.txt             
  inflating: yob1890.txt             
  inflating: yob1891.txt             
  inflating: yob1892.txt             
  inflating: yob1893.txt             
  inflating: yob1894.txt             
  inflating: yob1895.txt             
  inflating: yob1896.txt             
  inflating: yob1897.txt             
  inflating: yob1898.txt             
  inflating: yob1899.txt             
  inflating: yob1900.txt             
  inflating: yob1901.txt             
  inflating: yob1902.txt             
  inflating: yob1903.txt             
  inflating: yob1904.txt             
  inflating: yob1905.txt 
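
The shell `unzip` call drops every member into the current working directory, which is why a clean-up step is needed later. The same step can be done from Python with the `ZipFile` class that is already imported. This is a sketch under assumptions: `extract_txt_members` is a hypothetical helper, and the archive is a small stand-in with placeholder content.

```python
import os
import tempfile
from zipfile import ZipFile


def extract_txt_members(zip_path, target_dir):
    """Extract only the .txt members of an archive into target_dir."""
    with ZipFile(zip_path) as archive:
        txt_members = [m for m in archive.namelist() if m.endswith(".txt")]
        archive.extractall(target_dir, members=txt_members)
    return sorted(os.path.join(target_dir, m) for m in txt_members)


# Stand-in archive: two text files plus a PDF that should be skipped.
scratch = tempfile.mkdtemp()
zip_path = os.path.join(scratch, "names_demo.zip")
with ZipFile(zip_path, "w") as archive:
    archive.writestr("yob1881.txt", "Mary,F,12\n")
    archive.writestr("yob1880.txt", "Mary,F,10\n")
    archive.writestr("NationalReadMe.pdf", "placeholder")

target_dir = os.path.join(scratch, "extracted")
extracted = extract_txt_members(zip_path, target_dir)
print([os.path.basename(p) for p in extracted])
```

Extracting into a dedicated directory keeps the notebook's folder free of the unpacked files.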

In [6]:
name_files = glob('*.txt')
name_files

['yob2000.txt',
 'yob2014.txt',
 'yob1938.txt',
 'yob1910.txt',
 'yob1904.txt',
 'yob1905.txt',
 'yob1911.txt',
 'yob1939.txt',
 'yob2015.txt',
 'yob2001.txt',
 'yob2017.txt',
 'yob2003.txt',
 'yob1907.txt',
 'yob1913.txt',
 'yob1898.txt',
 'yob1899.txt',
 'yob1912.txt',
 'yob1906.txt',
 'yob2002.txt',
 'yob2016.txt',
 'yob2012.txt',
 'yob2006.txt',
 'yob1902.txt',
 'yob1916.txt',
 'yob1889.txt',
 'yob1888.txt',
 'yob1917.txt',
 'yob1903.txt',
 'yob2007.txt',
 'yob2013.txt',
 'yob2005.txt',
 'yob2011.txt',
 'yob1915.txt',
 'yob1901.txt',
 'yob1929.txt',
 'yob1928.txt',
 'yob1900.txt',
 'yob1914.txt',
 'yob2010.txt',
 'yob2004.txt',
 'yob1973.txt',
 'yob1967.txt',
 'yob1998.txt',
 'yob1999.txt',
 'yob1966.txt',
 'yob1972.txt',
 'yob1958.txt',
 'yob1964.txt',
 'yob1970.txt',
 'yob1971.txt',
 'yob1965.txt',
 'yob1959.txt',
 'yob1961.txt',
 'yob1975.txt',
 'yob1949.txt',
 'yob1948.txt',
 'yob1974.txt',
 'yob1960.txt',
 'yob1976.txt',
 'yob1962.txt',
 'yob1989.txt',
 'yob1988.txt',
 'yob196
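
As the listing shows, `glob` returns the files in no particular order, and `'*.txt'` would also match any unrelated text file in the directory. A narrower pattern wrapped in `sorted()` gives a chronological list, because the four-digit years sort correctly as strings. The sketch below uses empty stand-in files in a temporary directory; the file `notes.txt` is a made-up example of an unrelated file.

```python
import os
import tempfile
from glob import glob

# Create empty stand-in files, deliberately out of order.
demo_dir = tempfile.mkdtemp()
for year in (2000, 1880, 1938):
    open(os.path.join(demo_dir, f"yob{year}.txt"), "w").close()
open(os.path.join(demo_dir, "notes.txt"), "w").close()  # unrelated file

# Match only the yearly files and sort them by name (and so by year).
sorted_files = sorted(glob(os.path.join(demo_dir, "yob*.txt")))
print([os.path.basename(p) for p in sorted_files])
```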

Create a df from `name_files` using the `pd.read_csv()` method.

In [7]:
# sample df
pd.read_csv(name_files[0])

Unnamed: 0,Emily,F,25957
0,Hannah,F,23085
1,Madison,F,19968
2,Ashley,F,17997
3,Sarah,F,17708
4,Alexis,F,17631
...,...,...,...
29770,Zeph,M,5
29771,Zeven,M,5
29772,Ziggy,M,5
29773,Zo,M,5


In [8]:
# prevent first row from rendering as header
pd.read_csv(name_files[0], header=None)

Unnamed: 0,0,1,2
0,Emily,F,25957
1,Hannah,F,23085
2,Madison,F,19968
3,Ashley,F,17997
4,Sarah,F,17708
...,...,...,...
29771,Zeph,M,5
29772,Zeven,M,5
29773,Ziggy,M,5
29774,Zo,M,5


In [9]:
# name the columns
columns = ['Name', 'Sex', 'Count']
sample_df = pd.read_csv(name_files[0], header=None, names=columns)
sample_df

Unnamed: 0,Name,Sex,Count
0,Emily,F,25957
1,Hannah,F,23085
2,Madison,F,19968
3,Ashley,F,17997
4,Sarah,F,17708
...,...,...,...
29771,Zeph,M,5
29772,Zeven,M,5
29773,Ziggy,M,5
29774,Zo,M,5
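
The three cells above differ by one row: without `header=None`, the first record is consumed as the header and is lost from the data. A small in-memory sketch of the effect, using the first three rows shown above; `io.StringIO` stands in for the file here.

```python
import io

import pandas as pd

raw = "Emily,F,25957\nHannah,F,23085\nMadison,F,19968\n"

# Default behaviour: the first line becomes the column labels.
with_default_header = pd.read_csv(io.StringIO(raw))

# header=None plus names keeps every line as data.
with_names = pd.read_csv(
    io.StringIO(raw), header=None, names=["Name", "Sex", "Count"]
)

print(len(with_default_header), len(with_names))
print(list(with_default_header.columns))
```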


- 2. Extract the `Year` value from each text file's name.

In [10]:
year_str = name_files[0]
year_str

'yob2000.txt'

In [11]:
year_str = year_str[3:7]
year_str

'2000'

In [12]:
year = int(year_str)
year

2000
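
Slicing with `[3:7]` works only while the string is a bare file name: `'data/yob2000.txt'[3:7]` is `'a/yo'`. If the files are ever read from another directory, a regular expression is more forgiving. The following is a sketch; `year_from_filename` and `YEAR_PATTERN` are hypothetical names introduced for the example.

```python
import os
import re

# Four digits between the "yob" prefix and the ".txt" suffix.
YEAR_PATTERN = re.compile(r"yob(\d{4})\.txt$")


def year_from_filename(path):
    """Return the year encoded in a 'yobYYYY.txt' file name as an int."""
    match = YEAR_PATTERN.search(os.path.basename(path))
    if match is None:
        raise ValueError(f"no year found in {path!r}")
    return int(match.group(1))


print(year_from_filename("yob2000.txt"))
print(year_from_filename("data/yob1880.txt"))
```

A file that does not follow the naming scheme, such as `NationalReadMe.pdf`, raises a `ValueError` instead of producing a wrong year.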

In [13]:
# assign int year value to a new `Year` column
sample_df['Year'] = year 

sample_df.head()

Unnamed: 0,Name,Sex,Count,Year
0,Emily,F,25957,2000
1,Hannah,F,23085,2000
2,Madison,F,19968,2000
3,Ashley,F,17997,2000
4,Sarah,F,17708,2000


In [14]:
# check data types
sample_df.dtypes

Name     object
Sex      object
Count     int64
Year      int64
dtype: object

- 3. `Join` all dfs and create `csv` file.

In [15]:
all_dfs = []

try:
    for name_file in name_files:
        df = pd.read_csv(name_file, header=None, names=['Name', 'Sex', 'Count'])
        df['Year'] = int(name_file[3:7])
        all_dfs.append(df)
except Exception as e:
    print(e)
else:
    print(f'{len(all_dfs)} dfs are ready!')

142 dfs are ready!
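
Because the `try` wraps the whole loop, the first file that fails stops the loop, and the remaining files are never read. One possible variation is to catch errors per file and report them at the end. This is a sketch only: `load_all`, the `sources` mapping, and the in-memory rows (placeholders, not real SSA counts) are assumptions for illustration.

```python
import io

import pandas as pd

COLUMNS = ["Name", "Sex", "Count"]


def load_all(sources):
    """Read each source; return (frames, failures) instead of stopping early.

    `sources` maps a file name such as 'yob2000.txt' to a path or
    file-like object that pd.read_csv can read.
    """
    frames, failures = [], {}
    for name, source in sources.items():
        try:
            df = pd.read_csv(source, header=None, names=COLUMNS)
            df["Year"] = int(name[3:7])
        except Exception as exc:
            failures[name] = exc  # remember the problem, keep going
        else:
            frames.append(df)
    return frames, failures


sources = {
    "yob2000.txt": io.StringIO("Emily,F,10\nHannah,F,9\n"),
    "yobXXXX.txt": io.StringIO("Zeph,M,5\n"),  # malformed name on purpose
    "yob2001.txt": io.StringIO("Emily,F,8\n"),
}
frames, failures = load_all(sources)
print(len(frames), list(failures))
```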


In [16]:
# concatenate all dataframes
final_df = pd.concat(all_dfs)

final_df.shape

(2052781, 4)
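
By default `pd.concat` keeps each input frame's own row labels, so the combined index restarts at 0 for every year. That does not matter for a CSV written with `index=False`, but it can surprise label-based lookups such as `.loc[0]`. A small sketch of the difference, using toy frames rather than the real data:

```python
import pandas as pd

df_a = pd.DataFrame({"Name": ["Emily", "Hannah"], "Count": [25957, 23085]})
df_b = pd.DataFrame({"Name": ["Zeph"], "Count": [5]})

kept = pd.concat([df_a, df_b])                      # labels repeat
fresh = pd.concat([df_a, df_b], ignore_index=True)  # labels renumbered

print(list(kept.index))
print(list(fresh.index))
```

Both results have the same shape; only the index differs.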

In [17]:
final_df.to_csv('data/cleaned/SocialSecurityNamesAllYears.csv', index=False)
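
`to_csv` does not create missing directories, so the `data/cleaned` folder has to exist before the cell above runs. A hedged sketch of creating the folder and checking the result by reading the file back; the temporary directory, `demo_names.csv`, and the toy frame are assumptions for the example.

```python
import os
import tempfile

import pandas as pd

# Create the nested output folder if it is not there yet.
out_dir = os.path.join(tempfile.mkdtemp(), "data", "cleaned")
os.makedirs(out_dir, exist_ok=True)
out_path = os.path.join(out_dir, "demo_names.csv")

demo_df = pd.DataFrame(
    {
        "Name": ["Emily", "Zeph"],
        "Sex": ["F", "M"],
        "Count": [25957, 5],
        "Year": [2000, 2000],
    }
)
demo_df.to_csv(out_path, index=False)

# Read the file back as a quick sanity check.
round_trip = pd.read_csv(out_path)
print(round_trip.shape, list(round_trip.columns))
```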

- 4. Clean up.

Remove the unzipped files using `os.remove()`.

In [18]:
files = glob('*.txt')
files

['yob2000.txt',
 'yob2014.txt',
 'yob1938.txt',
 'yob1910.txt',
 'yob1904.txt',
 'yob1905.txt',
 'yob1911.txt',
 'yob1939.txt',
 'yob2015.txt',
 'yob2001.txt',
 'yob2017.txt',
 'yob2003.txt',
 'yob1907.txt',
 'yob1913.txt',
 'yob1898.txt',
 'yob1899.txt',
 'yob1912.txt',
 'yob1906.txt',
 'yob2002.txt',
 'yob2016.txt',
 'yob2012.txt',
 'yob2006.txt',
 'yob1902.txt',
 'yob1916.txt',
 'yob1889.txt',
 'yob1888.txt',
 'yob1917.txt',
 'yob1903.txt',
 'yob2007.txt',
 'yob2013.txt',
 'yob2005.txt',
 'yob2011.txt',
 'yob1915.txt',
 'yob1901.txt',
 'yob1929.txt',
 'yob1928.txt',
 'yob1900.txt',
 'yob1914.txt',
 'yob2010.txt',
 'yob2004.txt',
 'yob1973.txt',
 'yob1967.txt',
 'yob1998.txt',
 'yob1999.txt',
 'yob1966.txt',
 'yob1972.txt',
 'yob1958.txt',
 'yob1964.txt',
 'yob1970.txt',
 'yob1971.txt',
 'yob1965.txt',
 'yob1959.txt',
 'yob1961.txt',
 'yob1975.txt',
 'yob1949.txt',
 'yob1948.txt',
 'yob1974.txt',
 'yob1960.txt',
 'yob1976.txt',
 'yob1962.txt',
 'yob1989.txt',
 'yob1988.txt',
 'yob196

In [19]:
try:
    files_to_remove = glob('*.txt')
    # add readme to the files_to_remove list
    files_to_remove.append('NationalReadMe.pdf')
    for file in files_to_remove:
        os.remove(file)
except Exception as e:
    print(e)
else:
    print("All text files have been removed!")

All text files have been removed!
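
An alternative that avoids manual clean-up is to extract into a `tempfile.TemporaryDirectory`, which is deleted when the `with` block ends. This is a sketch of that idea with a stand-in archive, not a change to the cells above; the file names are assumptions.

```python
import os
import tempfile
from zipfile import ZipFile

with tempfile.TemporaryDirectory() as scratch:
    # Build and extract a small stand-in archive inside the scratch folder.
    zip_path = os.path.join(scratch, "names_demo.zip")
    with ZipFile(zip_path, "w") as archive:
        archive.writestr("yob2000.txt", "Emily,F,25957\n")
    with ZipFile(zip_path) as archive:
        archive.extractall(scratch)

    extracted = os.path.join(scratch, "yob2000.txt")
    existed_inside = os.path.exists(extracted)
    # ... read the files and build the final csv here ...

# The scratch folder and everything in it is gone at this point.
exists_after = os.path.exists(extracted)
print(existed_inside, exists_after)
```

Any output that should survive, such as the final `csv`, has to be written outside the temporary directory.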


In [20]:
os.remove('./data/names.zip')