# Project 6: Importing & Merging many files (Baby Names Dataset) - Part 2

# Project Brief for Self-Coders

Here you'll have the opportunity to code major parts of Project 6 (Part 2) on your own. If you need any help or inspiration, have a look at the videos or the Jupyter Notebook with the full code.

Keep in mind that it's all about __getting the right results/conclusions__. It's not about finding the identical code. Things can be coded in many different ways. Even if you come to the same conclusions, it's very unlikely that we have the very same code.

## Getting the Files from the Web

1. __Go__ to https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-state-and-district-of-columbia-data and __download__ and __unzip__ the file.

In [1]:
import pandas as pd
import numpy as np

## Importing one File & Understanding the Data Structure 

2. __Load__ the file __"AK.txt"__ into Pandas and __inspect__.

In [3]:
# The file has no header row, so the column names are supplied explicitly
df = pd.read_csv('AK.txt', names=['state', 'sex', 'year', 'name', 'count'])
df

Unnamed: 0,state,sex,year,name,count
0,AK,F,1910,Mary,14
1,AK,F,1910,Annie,12
2,AK,F,1910,Anna,10
3,AK,F,1910,Margaret,8
4,AK,F,1910,Helen,7
...,...,...,...,...,...
30526,AK,M,2023,Raphael,5
30527,AK,M,2023,Richard,5
30528,AK,M,2023,Seth,5
30529,AK,M,2023,Stetson,5


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30531 entries, 0 to 30530
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   state   30531 non-null  object
 1   sex     30531 non-null  object
 2   year    30531 non-null  int64 
 3   name    30531 non-null  object
 4   count   30531 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 1.2+ MB
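The `names=` argument matters because the state files contain no header row. A minimal sketch of the difference follows, using two rows from the output above held in an in-memory buffer. The `sample`, `wrong` and `right` names are illustrative only and not part of the project code.

```python
import io
import pandas as pd

# Two raw lines as they would appear in AK.txt (no header row).
sample = "AK,F,1910,Mary,14\nAK,F,1910,Annie,12\n"

# Without names=, pandas treats the first data row as the header.
wrong = pd.read_csv(io.StringIO(sample))

# With names=, every line is kept as data and the columns get labels.
columns = ['state', 'sex', 'year', 'name', 'count']
right = pd.read_csv(io.StringIO(sample), names=columns)

print(len(wrong), list(wrong.columns))
print(len(right), list(right.columns))
```

In the first call one row of data is silently consumed as column labels, which is easy to miss when the file has tens of thousands of rows.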


## The glob module

3. __Import__ the function `glob` from the `glob` module.

In [5]:
from glob import glob

4. __Find__ all filenames with the structure __"A?.txt"__ in your current directory (? is a single character wildcard).

In [13]:
# Matching ignores case on Windows only; elsewhere, use the files' exact case ('A?.TXT')
glob('a?.txt')

['AK.TXT', 'AL.TXT', 'AR.TXT', 'AZ.TXT']
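To see how the two wildcards differ, here is a small self-contained sketch that builds a temporary directory with a few empty files and globs it. The file names `ABC.TXT` and `notes.md` are hypothetical and only serve to show what is excluded.

```python
import os
import tempfile
from glob import glob

# Build a throwaway directory with a few empty, hypothetical files.
with tempfile.TemporaryDirectory() as tmp:
    for fname in ['AK.TXT', 'AL.TXT', 'CA.TXT', 'ABC.TXT', 'notes.md']:
        open(os.path.join(tmp, fname), 'w').close()

    # '?' matches exactly one character, so 'ABC.TXT' is excluded.
    one_char = sorted(
        os.path.basename(p) for p in glob(os.path.join(tmp, 'A?.TXT'))
    )

    # '*' matches any run of characters, including none.
    any_chars = sorted(
        os.path.basename(p) for p in glob(os.path.join(tmp, '*.TXT'))
    )

print(one_char)   # ['AK.TXT', 'AL.TXT']
print(any_chars)  # ['ABC.TXT', 'AK.TXT', 'AL.TXT', 'CA.TXT']
```

The results are sorted here because the order in which `glob` returns matches depends on the file system.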

5. __Find__ all filenames with the structure __"*.txt"__ in your current directory and __save__ the resulting list in a variable.

In [11]:
filenames = glob("*.txt")  # * is a wildcard for zero or more characters
filenames

['AK.TXT',
 'AL.TXT',
 'AR.TXT',
 'AZ.TXT',
 'CA.TXT',
 'CO.TXT',
 'CT.TXT',
 'DC.TXT',
 'DE.TXT',
 'FL.TXT',
 'GA.TXT',
 'HI.TXT',
 'IA.TXT',
 'ID.TXT',
 'IL.TXT',
 'IN.TXT',
 'KS.TXT',
 'KY.TXT',
 'LA.TXT',
 'MA.TXT',
 'MD.TXT',
 'ME.TXT',
 'MI.TXT',
 'MN.TXT',
 'MO.TXT',
 'MS.TXT',
 'MT.TXT',
 'NC.TXT',
 'ND.TXT',
 'NE.TXT',
 'NH.TXT',
 'NJ.TXT',
 'NM.TXT',
 'NV.TXT',
 'NY.TXT',
 'OH.TXT',
 'OK.TXT',
 'OR.TXT',
 'PA.TXT',
 'RI.TXT',
 'SC.TXT',
 'SD.TXT',
 'TN.TXT',
 'TX.TXT',
 'UT.TXT',
 'VA.TXT',
 'VT.TXT',
 'WA.TXT',
 'WI.TXT',
 'WV.TXT',
 'WY.TXT']

In [18]:
# Strip the extension from each filename to get the state abbreviations
abbr = [name.split('.')[0] for name in filenames]
abbr

['AK',
 'AL',
 'AR',
 'AZ',
 'CA',
 'CO',
 'CT',
 'DC',
 'DE',
 'FL',
 'GA',
 'HI',
 'IA',
 'ID',
 'IL',
 'IN',
 'KS',
 'KY',
 'LA',
 'MA',
 'MD',
 'ME',
 'MI',
 'MN',
 'MO',
 'MS',
 'MT',
 'NC',
 'ND',
 'NE',
 'NH',
 'NJ',
 'NM',
 'NV',
 'NY',
 'OH',
 'OK',
 'OR',
 'PA',
 'RI',
 'SC',
 'SD',
 'TN',
 'TX',
 'UT',
 'VA',
 'VT',
 'WA',
 'WI',
 'WV',
 'WY']
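Splitting on the first dot works for plain names such as `AK.TXT`. If the filenames ever carry a directory prefix, `pathlib` may be a more robust option. The following is a sketch of that alternative, not the approach used above, and the short `filenames` list is just a sample of the full list.

```python
from pathlib import Path

# A few of the filenames returned by glob above.
filenames = ['AK.TXT', 'AL.TXT', 'WY.TXT']

# Path.stem drops any leading directories and the final suffix.
abbr = [Path(name).stem for name in filenames]
print(abbr)  # ['AK', 'AL', 'WY']
```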

## Importing & merging many Files (complex case)

6. __Load__ all files (*.txt) and __merge/concatenate__ all files into one Pandas DataFrame.

In [26]:
# Read every state file, then concatenate once (faster than growing a DataFrame inside the loop)
columns = ['state', 'sex', 'year', 'name', 'count']
frames = [pd.read_csv(filename, names=columns) for filename in filenames]
concatenated_state_df = pd.concat(frames)
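After merging, it can be worth checking that no rows were lost and that every file contributed its state. Below is a minimal sketch of such a check on in-memory stand-ins for two state files. The `fake_files` dictionary is hypothetical and reuses rows shown in this notebook's outputs.

```python
import io
import pandas as pd

columns = ['state', 'sex', 'year', 'name', 'count']

# Stand-ins for two state files (hypothetical, built from rows shown in this notebook).
fake_files = {
    'AK.TXT': "AK,F,1910,Mary,14\nAK,F,1910,Annie,12\n",
    'WY.TXT': "WY,M,2023,Parker,5\nWY,M,2023,Rhett,5\nWY,M,2023,Roman,5\n",
}

frames = [pd.read_csv(io.StringIO(text), names=columns) for text in fake_files.values()]
merged = pd.concat(frames)

# No rows lost: the merged length equals the sum of the parts.
assert len(merged) == sum(len(frame) for frame in frames)
# Every file contributed its state.
assert merged['state'].nunique() == len(fake_files)

# Each frame keeps its own index, so labels repeat until the index is reset.
print(merged.index.tolist())  # [0, 1, 0, 1, 2]
```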

7. Create a __RangeIndex__ and __save__ the DataFrame (with columns "State", "Gender", "Year", "Name", "Count") in a new csv-file.

In [27]:
concatenated_state_df.reset_index(inplace=True, drop=True)
concatenated_state_df

Unnamed: 0,state,sex,year,name,count
0,AK,F,1910,Mary,14
1,AK,F,1910,Annie,12
2,AK,F,1910,Anna,10
3,AK,F,1910,Margaret,8
4,AK,F,1910,Helen,7
...,...,...,...,...,...
6504156,WY,M,2023,Parker,5
6504157,WY,M,2023,Rhett,5
6504158,WY,M,2023,Roman,5
6504159,WY,M,2023,Ryan,5


In [30]:
concatenated_state_df.to_csv('concatenated_df_by_state.csv',index=False)

In [31]:
# Reload the file to check that it was saved correctly
a = pd.read_csv('concatenated_df_by_state.csv')
a

Unnamed: 0,state,sex,year,name,count
0,AK,F,1910,Mary,14
1,AK,F,1910,Annie,12
2,AK,F,1910,Anna,10
3,AK,F,1910,Margaret,8
4,AK,F,1910,Helen,7
...,...,...,...,...,...
6504156,WY,M,2023,Parker,5
6504157,WY,M,2023,Rhett,5
6504158,WY,M,2023,Roman,5
6504159,WY,M,2023,Ryan,5
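The brief asks for the column labels "State", "Gender", "Year", "Name", "Count", whereas the code above keeps the lowercase labels. One possible way to match the brief and verify the saved file programmatically is sketched below on a two-row toy frame. The file name `check.csv` and the temporary directory are assumptions made for the example.

```python
import os
import tempfile
import pandas as pd

# Toy frame with the lowercase labels used in the code above.
df = pd.DataFrame({
    'state': ['AK', 'WY'], 'sex': ['F', 'M'], 'year': [1910, 2023],
    'name': ['Mary', 'Ryan'], 'count': [14, 5],
})

# Map the existing labels to the ones named in the brief.
renamed = df.rename(columns={
    'state': 'State', 'sex': 'Gender', 'year': 'Year',
    'name': 'Name', 'count': 'Count',
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'check.csv')  # hypothetical file name
    renamed.to_csv(path, index=False)      # index=False avoids an extra unnamed column
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
print(reloaded.equals(renamed))
```

Comparing with `equals` avoids relying on a visual check of the first and last rows.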
