# Project 14: Importing & Merging Many Files (Baby Names Dataset) - Part 1

## Getting the Files from the Web

1. __Go__ to https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-level-data, then __download__ and __unzip__ the archive. It contains one text file per year of birth (`yob1880.txt`, `yob1881.txt`, and so on).
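
If you prefer to unzip from Python rather than by hand, a minimal sketch using the standard library's `zipfile` module could look like the following. The helper name `extract_archive` and the file name `names.zip` are assumptions for illustration; use whatever name your downloaded archive actually has.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir="."):
    """Extract every member of a zip archive into dest_dir.

    Returns the sorted list of member names found in the archive.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the target folder if needed
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return sorted(archive.namelist())


# Hypothetical usage ('names.zip' is an assumed file name, not taken from the source):
# extracted = extract_archive("names.zip", ".")
# print(extracted[:3])
```

Extracting into the notebook's working directory keeps the relative paths used in the cells below (for example `'yob1880.txt'`) valid.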

## Importing one File & Understanding the Data Structure

2. __Load__ the file __"yob1880.txt"__ into pandas and __inspect__ it.

In [1]:
import pandas as pd 
df_test = pd.read_csv('yob1880.txt')

In [2]:
df_test.head()

Unnamed: 0,Mary,F,7065
0,Anna,F,2604
1,Emma,F,2003
2,Elizabeth,F,1939
3,Minnie,F,1746
4,Margaret,F,1578


In [3]:
# The file has no header row, so read every line as data and supply the column names ourselves
df_test = pd.read_csv('yob1880.txt', header = None, names = ['Name', 'Gender', 'Count'])

In [5]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    2000 non-null   object
 1   Gender  2000 non-null   object
 2   Count   2000 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 47.0+ KB
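
The effect of `header = None` can be reproduced without the file on disk. The sketch below reads the first three rows of the 1880 data from an in-memory string (the variable names `sample`, `with_default`, and `fixed` are made up for this illustration) and compares the default behaviour with the corrected call.

```python
import io

import pandas as pd

# First three rows of yob1880.txt, as shown in the outputs above
sample = "Mary,F,7065\nAnna,F,2604\nEmma,F,2003\n"

# Default behaviour: the first data row is consumed as the column labels
with_default = pd.read_csv(io.StringIO(sample))
print(list(with_default.columns))  # ['Mary', 'F', '7065']
print(len(with_default))           # 2

# header=None keeps every row as data; names supplies the labels
fixed = pd.read_csv(io.StringIO(sample), header=None, names=["Name", "Gender", "Count"])
print(list(fixed.columns))  # ['Name', 'Gender', 'Count']
print(len(fixed))           # 3
```

With the default settings one record (Mary, the most frequent name) silently disappears into the header, which is why the row count is worth checking after every import.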


## Importing & Merging Many Files (Easy Case)

In [8]:
df_1880 = pd.read_csv('yob1880.txt', header = None, names = ['Name', 'Gender', 'Count'])
df_1881 = pd.read_csv('yob1881.txt', header = None, names = ['Name', 'Gender', 'Count'])

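# Example of concat: keys labels each frame with its year, droplevel(-1) discards the old
# per-file row numbers, and reset_index() turns 'Year' into a regular column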
pd.concat(objs = [df_1880, df_1881], axis = 0, keys = [1880, 1881], names = ['Year']).droplevel(-1).reset_index()

Unnamed: 0,Year,Name,Gender,Count
0,1880,Mary,F,7065
1,1880,Anna,F,2604
2,1880,Emma,F,2003
3,1880,Elizabeth,F,1939
4,1880,Minnie,F,1746
...,...,...,...,...
3930,1881,Wiliam,M,5
3931,1881,Wilton,M,5
3932,1881,Wing,M,5
3933,1881,Wood,M,5
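
The chained call above does three things at once. As a sketch, the same steps can be run one at a time on two tiny frames (the frames `a` and `b` are hand-built from the first rows shown above, purely for illustration):

```python
import pandas as pd

a = pd.DataFrame({"Name": ["Mary", "Anna"], "Gender": ["F", "F"], "Count": [7065, 2604]})
b = pd.DataFrame({"Name": ["Mary", "Anna"], "Gender": ["F", "F"], "Count": [6919, 2698]})

# Step 1: keys adds an outer index level, and names labels that level 'Year'.
# The inner level still holds each frame's original row numbers (0, 1, ...).
stacked = pd.concat([a, b], axis=0, keys=[1880, 1881], names=["Year"])
print(list(stacked.index.names))  # ['Year', None]

# Step 2: drop the inner level so only 'Year' remains in the index.
by_year = stacked.droplevel(-1)

# Step 3: move 'Year' from the index into a column and get a fresh 0..n-1 index.
tidy = by_year.reset_index()
print(tidy)
```

Dropping the inner level before `reset_index()` avoids ending up with an extra column of per-file row numbers in the result.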


In [9]:
# Read each yearly file and collect the DataFrames in a list
dataframes = []
years = range(1880, 2019)  # 1880 through 2018 inclusive
for year in years:
    df = pd.read_csv('yob{}.txt'.format(year), header = None, names = ['Name', 'Gender', 'Count'])
    dataframes.append(df)
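
The loop above hard-codes the range of years. If the download gains or loses files, one possible alternative is to discover the `yobYYYY.txt` files in the folder and take the year from each file name. This is only a sketch: the helper `load_yearly_files` is hypothetical, and it assumes every matching file has the same three-column layout.

```python
import re
from pathlib import Path

import pandas as pd


def load_yearly_files(folder="."):
    """Read every yobYYYY.txt in folder and return (years, dataframes), sorted by year."""
    pattern = re.compile(r"yob(\d{4})\.txt$")
    found = []
    for path in Path(folder).iterdir():
        match = pattern.match(path.name)
        if match:  # skip anything that is not a yearly data file
            found.append((int(match.group(1)), path))
    found = sorted(found, key=lambda item: item[0])

    years = [year for year, _ in found]
    frames = [
        pd.read_csv(path, header=None, names=["Name", "Gender", "Count"])
        for _, path in found
    ]
    return years, frames


# Hypothetical usage, with the files in the current working directory:
# years, dataframes = load_yearly_files(".")
```

The returned `years` list can be passed as `keys` to `pd.concat` in the same way as the `range` object.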

In [10]:
dataframes

[           Name Gender  Count
 0          Mary      F   7065
 1          Anna      F   2604
 2          Emma      F   2003
 3     Elizabeth      F   1939
 4        Minnie      F   1746
 ...         ...    ...    ...
 1995     Woodie      M      5
 1996     Worthy      M      5
 1997     Wright      M      5
 1998       York      M      5
 1999  Zachariah      M      5
 
 [2000 rows x 3 columns],
            Name Gender  Count
 0          Mary      F   6919
 1          Anna      F   2698
 2          Emma      F   2034
 3     Elizabeth      F   1852
 4      Margaret      F   1658
 ...         ...    ...    ...
 1930     Wiliam      M      5
 1931     Wilton      M      5
 1932       Wing      M      5
 1933       Wood      M      5
 1934     Wright      M      5
 
 [1935 rows x 3 columns],
            Name Gender  Count
 0          Mary      F   8148
 1          Anna      F   3143
 2          Emma      F   2303
 3     Elizabeth      F   2186
 4        Minnie      F   2004
 ...         .

In [11]:
# Concatenate all DataFrames in the list, using the years as keys for the new 'Year' level
df = pd.concat(dataframes, axis = 0, keys = years, names = ['Year']).droplevel(-1).reset_index()
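
A different route to the same table, shown here as an alternative sketch rather than the method used in this notebook, is to add the `Year` column to each frame first and then concatenate with `ignore_index=True`. The helper name `combine_with_year_column` is made up for this example.

```python
import pandas as pd


def combine_with_year_column(frames, years):
    """Tag each frame with its year, then stack them with a fresh 0..n-1 index."""
    tagged = [frame.assign(Year=year) for frame, year in zip(frames, years)]
    combined = pd.concat(tagged, ignore_index=True)
    # Put 'Year' first so the column order matches the keys-based result
    return combined[["Year", "Name", "Gender", "Count"]]


# Hypothetical usage with the list and range built earlier:
# df_alt = combine_with_year_column(dataframes, years)
```

Both approaches should produce the same rows. The `keys` version avoids copying each frame, while the `assign` version may be easier to read for newcomers to MultiIndex.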

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1957046 entries, 0 to 1957045
Data columns (total 4 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   Year    int64 
 1   Name    object
 2   Gender  object
 3   Count   int64 
dtypes: int64(2), object(2)
memory usage: 59.7+ MB


In [14]:
# Finally, save the combined DataFrame to a single CSV file (index = False omits the row index)
df.to_csv('names.csv', index = False)

In [15]:
pd.read_csv('names.csv')

Unnamed: 0,Year,Name,Gender,Count
0,1880,Mary,F,7065
1,1880,Anna,F,2604
2,1880,Emma,F,2003
3,1880,Elizabeth,F,1939
4,1880,Minnie,F,1746
...,...,...,...,...
1957041,2018,Zylas,M,5
1957042,2018,Zyran,M,5
1957043,2018,Zyrie,M,5
1957044,2018,Zyron,M,5
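
To check that writing with `index = False` and reading back returns the same table, a small round trip can be run in memory. This is a sketch on a three-row stand-in frame (`df_small` is built by hand from rows shown above), not on the full dataset.

```python
import io

import pandas as pd

df_small = pd.DataFrame({
    "Year": [1880, 1880, 1881],
    "Name": ["Mary", "Anna", "Mary"],
    "Gender": ["F", "F", "F"],
    "Count": [7065, 2604, 6919],
})

# Round trip with index=False: the restored frame matches the original
buffer = io.StringIO()
df_small.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
pd.testing.assert_frame_equal(df_small, restored)

# Without index=False the row index is written as an unnamed first column,
# which pandas labels 'Unnamed: 0' when the file is read back
buffer_with_index = io.StringIO()
df_small.to_csv(buffer_with_index)
buffer_with_index.seek(0)
reread = pd.read_csv(buffer_with_index)
print(list(reread.columns))  # ['Unnamed: 0', 'Year', 'Name', 'Gender', 'Count']
```

On the full data the same idea applies: comparing `len(df)` with the length of the frame read from `names.csv` is a quick sanity check.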
