# Get Census Data

In this notebook, we work through steps 1-3 below to create a single CSV file, `cc-2003-2020.csv`, covering the years 2003-2020.

## Steps
### Step 1: Get the data from census.gov 
### Step 2a: Prepare data for years 2010-2020
### Step 2b: Prepare data for years 2003-2009
### Step 3: Merge the files  

Details for each step are included below with the code. 


### Step 1: Get the data from census.gov

The site organizes the estimates into folders by range of years. We use two download processes here, one for 2003-2009 and one for 2010-2020. All of the files download as CSV files.
 
Go to this link: [https://www2.census.gov/programs-surveys/popest/datasets/](https://www2.census.gov/programs-surveys/popest/datasets/).

1. For step 2a, download the data for years 2010-2020, which is a single file. From the link above, download the file `CC-EST2020-ALLDATA6.csv`, located under 2010-2020>>counties>>asrh. Save it under the folder `data`.
2. For step 2b, download the data for years 2003-2009, which is spread over multiple files. The 2003-2009 data is structured differently: there is one file per state, and we aggregate them. Go to the folder 2000-2009>>counties>>asrh in the link above.
For each of the following numbers (state FIPS codes): 1, 2, 4, 5, 6, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, download the file `cc-est2009-alldata-#.csv`, where `#` is the number zero-padded to two digits (for example `cc-est2009-alldata-01.csv`). Save these under the folder `data/`.

The structure of your project should look like this:

- parent_folder/
    - get-census-data.ipynb
    - data/
        -  CC-EST2020-ALLDATA6.csv
        -  cc-est2009-alldata-01.csv
        -  ...
        -  cc-est2009-alldata-56.csv
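
Once the files are in place, a quick check along the following lines can confirm that the layout above is complete. This is a minimal sketch using only the standard library; the helper names `expected_file_names` and `missing_files` are hypothetical and not part of the notebook, and the demo runs on a throwaway folder rather than on `data/`.

```python
import os
import tempfile

# State FIPS codes used in this notebook: 1-56 without 3, 7, 14, and 43.
STATE_CODES = [code for code in range(1, 57) if code not in (3, 7, 14, 43)]


def expected_file_names(codes):
    """Return the file names the notebook expects to find in data/ (hypothetical helper)."""
    per_state = [f"cc-est2009-alldata-{code:02d}.csv" for code in codes]
    return ["CC-EST2020-ALLDATA6.csv"] + per_state


def missing_files(folder, codes):
    """Return the expected file names that are not present in folder (hypothetical helper)."""
    present = set(os.listdir(folder))
    return [name for name in expected_file_names(codes) if name not in present]


# Demo on a temporary folder that holds only two of the files expected for codes 1, 2, 4.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("cc-est2009-alldata-01.csv", "cc-est2009-alldata-02.csv"):
        open(os.path.join(tmp, name), "w").close()
    demo_missing = missing_files(tmp, [1, 2, 4])

print(demo_missing)
```

In the notebook itself, the equivalent call would be `missing_files('data/', STATE_CODES)`. Any name it returns was either not downloaded or was not listed at the source URL.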

The code below goes to each URL and downloads all the files listed above into the folder `data/`, which sits alongside this notebook.

In [2]:
# first we check if there are folders for data and clean. 
# if there are not, we create them 
import os
import glob
    
def ensure_path_exists(path, delete_folder_contents=False):
    """Makes path if it doesn't exist. If it does exist, it
    deletes the contents of the path if delete_folder_contents set to True"""
    if not os.path.exists(path):
        os.makedirs(path)
    elif delete_folder_contents:
        files = glob.glob(path+'/*')
        for f in files:
            os.remove(f)


ensure_path_exists('data/', True)
ensure_path_exists('clean/', True)

In [3]:
from bs4 import BeautifulSoup as bs
import requests
import re
import os


def get_soup(URL):
    return bs(requests.get(URL).text, 'html.parser')

def download_files(URL, desired_file_names, download_folder):
    """Downloads the given desired_file_names from given URL to given download_folder location"""
    # match only links whose href ends in ".csv" (the dot is escaped so it is literal)
    for link in get_soup(URL).find_all("a", attrs={'href': re.compile(r"\.csv$")}):
        file_link = link.get('href')
        if file_link in desired_file_names:
            response = requests.get(URL + file_link)
            print(response)
            # stop here on an HTTP error instead of writing an error page to disk
            response.raise_for_status()

            with open(os.path.join(download_folder, file_link), 'wb') as file:
                file.write(response.content)
            print("Downloaded file ", file_link, "from ", URL + file_link)

# step 1: download 2010 - 2020 file [single file]
URL_step1 = "https://www2.census.gov/programs-surveys/popest/datasets/2010-2020/counties/asrh/"
desired_file_names_step1 = ['CC-EST2020-ALLDATA6.csv']
download_folder = 'data/'
download_files(URL_step1, desired_file_names_step1, download_folder)


# step 2: download 2003 - 2009 files [multiple files]
# note: despite its name, desired_counties holds state FIPS codes (one file per state)
desired_counties = [1, 2, 4, 5, 6, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56]

def get_desired_file_names(counties):
    desired_file_names = []
    for county_num in counties:
        desired_file_names.append("cc-est2009-alldata-"+ str(county_num).zfill(2) + '.csv')
    return desired_file_names

URL_step2 = "https://www2.census.gov/programs-surveys/popest/datasets/2000-2009/counties/asrh/"
desired_file_names = get_desired_file_names(desired_counties)
download_folder = 'data/'

download_files(URL_step2, desired_file_names, download_folder)



<Response [200]>
Downloaded file  CC-EST2020-ALLDATA6.csv from  https://www2.census.gov/programs-surveys/popest/datasets/2010-2020/counties/asrh/CC-EST2020-ALLDATA6.csv
<Response [200]>
Downloaded file  cc-est2009-alldata-01.csv from  https://www2.census.gov/programs-surveys/popest/datasets/2000-2009/counties/asrh/cc-est2009-alldata-01.csv
<Response [200]>
Downloaded file  cc-est2009-alldata-02.csv from  https://www2.census.gov/programs-surveys/popest/datasets/2000-2009/counties/asrh/cc-est2009-alldata-02.csv
<Response [200]>
Downloaded file  cc-est2009-alldata-04.csv from  https://www2.census.gov/programs-surveys/popest/datasets/2000-2009/counties/asrh/cc-est2009-alldata-04.csv
<Response [200]>
Downloaded file  cc-est2009-alldata-05.csv from  https://www2.census.gov/programs-surveys/popest/datasets/2000-2009/counties/asrh/cc-est2009-alldata-05.csv
<Response [200]>
Downloaded file  cc-est2009-alldata-06.csv from  https://www2.census.gov/programs-surveys/popest/datasets/2000-2009/counti

### Step 2a: Prepare data for years 2010-2020
In this step we deal with years 2010-2020, using the file `data/CC-EST2020-ALLDATA6.csv`. From this file, we keep the columns with the following headings: SUMLEV (column a), STATE (b), COUNTY (c), STNAME (d), CTYNAME (e), YEAR (f), AGEGRP (h), and any column with the suffix “_FEMALE” in the header (there should be 27 such columns, but double check).
 
We also keep only the following rows.

For STATE, keep the values 1, 2, 4, 5, 6, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56 (these correspond to the state FIPS codes for the contiguous 48 states plus Hawaii, DC, and Alaska). All COUNTY levels for each state should be kept.
 
For YEAR, keep the values 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, and 14, which correspond to July 1st of each year in the data:


- 2010 -- 3
- ...
- 2020 -- 14 
 
For AGEGRP, 4, 5, 6, 7, 8, 9, and 10 should be kept; these correspond to five year age intervals from 15-19 to 45-49.
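
The coding described above can be written down compactly. The sketch below builds the YEAR lookup for the 2010-2020 file and labels the kept AGEGRP codes; the names `YEAR_CODE_TO_YEAR` and `age_group_label` are hypothetical, and the age formula is only assumed to hold for the codes 4 through 10 that this notebook keeps.

```python
# YEAR codes kept from the 2010-2020 file: 3..12 map to 2010..2019,
# and 14 maps to 2020 (code 13 is skipped).
YEAR_CODE_TO_YEAR = {code: 2007 + code for code in range(3, 13)}
YEAR_CODE_TO_YEAR[14] = 2020


def age_group_label(code):
    """Return the five-year age interval for an AGEGRP code between 4 and 10.

    Assumption: each kept code covers ages 5*(code-1) through 5*(code-1)+4,
    which matches 4 -> 15-19 and 10 -> 45-49 as stated in the text.
    """
    if not 4 <= code <= 10:
        raise ValueError(f"AGEGRP code {code} is outside the range kept here")
    low = 5 * (code - 1)
    return f"{low}-{low + 4}"


for code in sorted(YEAR_CODE_TO_YEAR):
    print(code, "->", YEAR_CODE_TO_YEAR[code])
for code in range(4, 11):
    print(code, "->", age_group_label(code))
```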
 
Save the file as `cc-2010-2020.csv` under the folder `clean`. The project should now be organized like this:

- parent_folder/
    - get-census-data.ipynb
    - data/
        -  CC-EST2020-ALLDATA6.csv
        -  cc-est2009-alldata-01.csv
        -  ...
        -  cc-est2009-alldata-56.csv
    - clean/
        -  cc-2010-2020.csv

In [9]:
import pandas as pd

def clean_columns(df):
    """Keeps only columns 'SUMLEV', 'STATE', 'COUNTY', 'STNAME', 'CTYNAME', 'YEAR', 'AGEGRP', or those that include '_FEMALE'
      Takes in df and returns df
    """
    female_cols = [col for col in df.columns if '_FEMALE' in col]
    desired_cols = female_cols + ['SUMLEV', 'STATE', 'COUNTY', 'STNAME', 'CTYNAME', 'YEAR', 'AGEGRP']
    df = df[df.columns[df.columns.isin(desired_cols)]]
    return df

def clean_state_values(df):
    """Keeps only state rows  1, 2, 4, 5, 6, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20, 21,
       22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 44, 
    45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56
    Takes in df and returns modified df
    """
    desired_states = [1, 2, 4, 5, 6, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56]
    return df[df.STATE.isin(desired_states)]

def clean_year_values_1(df):
    """Keeps only year rows for 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14
    Takes in df and returns modified df
    """
    desired_years = [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14]
    return df[df.YEAR.isin(desired_years)]


def clean_age_grp_values(df):
    """Keeps only age group values 4, 5, 6, 7, 8, 9, 10
    Takes in df and returns modified df
    """
    desired_age_grp = [4, 5, 6, 7, 8, 9, 10]
    return df[df.AGEGRP.isin(desired_age_grp)]

In [10]:
# read in the desired file 
step1df = pd.read_csv('data/CC-EST2020-ALLDATA6.csv', encoding = "ISO-8859-1")

# clean the data 
step1df = clean_columns(step1df)
step1df = clean_state_values(step1df)
step1df = clean_year_values_1(step1df)
step1df = clean_age_grp_values(step1df)

# save to directory clean/
step1df.to_csv("clean/cc-2010-2020.csv", index=False)




### Step 2b: Prepare data for years 2003-2009
In this step we deal with years 2003-2009, using the files `cc-est2009-alldata-#.csv`. We first clean each file by removing unneeded columns and rows, then aggregate the files together. We keep exactly the same columns as in Step 2a.
 
For YEAR, the following values should be kept: 5, 6, 7, 8, 9, 10, 11, where they correspond to July 1st of each year in the data:

 - 5 -- 2003
 - ...
 - 11 -- 2009

For AGEGRP, 4, 5, 6, 7, 8, 9, and 10 should be kept; these correspond to five-year age intervals from 15-19 to 45-49.

Save the aggregated file as you go as `cc-2003-2009.csv` under the folder `clean`:

- parent_folder/
    - get-census-data.ipynb
    - data/
        -  CC-EST2020-ALLDATA6.csv
        -  cc-est2009-alldata-01.csv
        -  ...
        -  cc-est2009-alldata-56.csv
    - clean/
        -  cc-2010-2020.csv
        -  cc-2003-2009.csv
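
The filter-then-append pattern used in this step can also be illustrated without pandas. The following is a standard-library sketch, not the notebook's method: it streams rows with the `csv` module, writes the header once, and counts the rows it keeps. The function name `append_filtered` and the tiny sample files are hypothetical, and the sketch assumes every input file has the same header.

```python
import csv
import os
import tempfile


def append_filtered(src_paths, dest_path, keep_years, keep_age_groups):
    """Copy rows with wanted YEAR and AGEGRP codes from each source CSV into dest_path.

    Returns the number of data rows written (hypothetical helper).
    """
    rows_written = 0
    writer = None
    with open(dest_path, "w", newline="", encoding="utf-8") as dest:
        for src_path in src_paths:
            with open(src_path, newline="", encoding="ISO-8859-1") as src:
                reader = csv.DictReader(src)
                if writer is None:
                    # The header is written once, taken from the first file.
                    writer = csv.DictWriter(dest, fieldnames=reader.fieldnames)
                    writer.writeheader()
                for row in reader:
                    if int(row["YEAR"]) in keep_years and int(row["AGEGRP"]) in keep_age_groups:
                        writer.writerow(row)
                        rows_written += 1
    return rows_written


# Demo with two made-up files; only rows with YEAR 5-11 and AGEGRP 4-10 survive.
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        "a.csv": [(1, 5, 4), (1, 4, 4), (1, 6, 11)],
        "b.csv": [(2, 11, 10), (2, 13, 5)],
    }
    paths = []
    for name, rows in samples.items():
        path = os.path.join(tmp, name)
        with open(path, "w", newline="", encoding="ISO-8859-1") as handle:
            sample_writer = csv.writer(handle)
            sample_writer.writerow(["STATE", "YEAR", "AGEGRP"])
            sample_writer.writerows(rows)
        paths.append(path)

    out_path = os.path.join(tmp, "out.csv")
    demo_rows_written = append_filtered(paths, out_path, range(5, 12), range(4, 11))
    with open(out_path, encoding="utf-8") as handle:
        demo_output_lines = handle.read().splitlines()

print(demo_rows_written)
print(demo_output_lines)
```

Writing to one open file avoids re-reading the growing output on every iteration, at the cost of giving up the DataFrame operations the notebook relies on.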

In [11]:
# the only different cleaning is of the year values 

def clean_year_values_2(df):
    """Keeps only year rows for 5, 6, 7, 8, 9, 10, 11
    Takes in df and returns modified df"""
    desired_years = [5, 6, 7, 8, 9, 10, 11]
    return df[df.YEAR.isin(desired_years)]
    
# get the file names we want from the data folder 
file_names = list(os.listdir('data/'))
file_names = [name for name in file_names if "cc-est2009-alldata-"  in name]

length_combined_so_far = 0

# clean in case we already have this file
if os.path.exists("clean/cc-2003-2009.csv"):
    os.remove("clean/cc-2003-2009.csv")

# for each of those files
for file in file_names:
    dataframe = pd.read_csv('data/'+file, encoding = "ISO-8859-1")
    
    # clean the data 
    dataframe = clean_columns(dataframe)
    dataframe = clean_state_values(dataframe)
    dataframe = clean_year_values_2(dataframe)
    dataframe = clean_age_grp_values(dataframe)

    # save to directory clean/
    if os.path.exists("clean/cc-2003-2009.csv"):
        extended_dfs = pd.concat([pd.read_csv("clean/cc-2003-2009.csv"), dataframe])
        extended_dfs.to_csv("clean/cc-2003-2009.csv", index=False)
        print("extended with ", file)
        
        # check that the csv grew by exactly the number of rows we just added
        expected_length = length_combined_so_far + dataframe.shape[0]
        length_combined_so_far = extended_dfs.shape[0]
        if length_combined_so_far != expected_length: 
            print("issue in adding dfs")
        print("Number of rows in clean/cc-2003-2009.csv:", length_combined_so_far) 
    else: 
        dataframe.to_csv("clean/cc-2003-2009.csv", index=False)
        length_combined_so_far = dataframe.shape[0]

extended with  cc-est2009-alldata-05.csv
Number of rows in clean/cc-2003-2009.csv: 7987
extended with  cc-est2009-alldata-11.csv
Number of rows in clean/cc-2003-2009.csv: 8036
extended with  cc-est2009-alldata-10.csv
Number of rows in clean/cc-2003-2009.csv: 8183
extended with  cc-est2009-alldata-04.csv
Number of rows in clean/cc-2003-2009.csv: 8918
extended with  cc-est2009-alldata-38.csv
Number of rows in clean/cc-2003-2009.csv: 11515
extended with  cc-est2009-alldata-12.csv
Number of rows in clean/cc-2003-2009.csv: 14798
extended with  cc-est2009-alldata-06.csv
Number of rows in clean/cc-2003-2009.csv: 17640
extended with  cc-est2009-alldata-13.csv
Number of rows in clean/cc-2003-2009.csv: 25431
extended with  cc-est2009-alldata-17.csv
Number of rows in clean/cc-2003-2009.csv: 30429
extended with  cc-est20

extended with  cc-est2009-alldata-08.csv
Number of rows in clean/cc-2003-2009.csv: 147833
extended with  cc-est2009-alldata-34.csv
Number of rows in clean/cc-2003-2009.csv: 148862
extended with  cc-est2009-alldata-20.csv
Number of rows in clean/cc-2003-2009.csv: 154007


### Step 3: Merge the files  
Merge the files `cc-2010-2020.csv` and `cc-2003-2009.csv` under the folder `clean/` from steps 2a and 2b into one final output file, `cc-2003-2020.csv`, also under `clean/`, to get:

- parent_folder/
    - get-census-data.ipynb
    - data/
        -  CC-EST2020-ALLDATA6.csv
        -  cc-est2009-alldata-01.csv
        -  ...
        -  cc-est2009-alldata-56.csv
    - clean/
        -  cc-2010-2020.csv
        -  cc-2003-2009.csv
        -  cc-2003-2020.csv # this is the final file 

In [21]:
# first we fix the years 
# get the data 
early_df = pd.read_csv("clean/cc-2003-2009.csv")
late_df = pd.read_csv("clean/cc-2010-2020.csv")

# get the original values for the years 
print("original year values")
print("2003-2009 dataframe: ",early_df.YEAR.unique())
print("2010-2020 dataframe: ",late_df.YEAR.unique())

# change the years in the 2003-2009 data from [ 5  6  7  8  9 10 11] to [2003 ... 2009]
olderdict = {
    5: 2003,
    6: 2004,
    7: 2005,
    8: 2006, 
    9: 2007, 
    10: 2008, 
    11: 2009
}
# replace the values 
early_df = early_df.replace({"YEAR": olderdict})

# change the years in the 2010+ data from [ 3  4  5  6  7  8  9 10 11 12 14] to [2010 ... 2020]
newerdict = {
    3: 2010, 
    4: 2011, 
    5: 2012, 
    6: 2013, 
    7: 2014, 
    8: 2015, 
    9: 2016, 
    10: 2017, 
    11: 2018, 
    12: 2019, # we skip 13 
    14: 2020 
}

# replace the values 
late_df = late_df.replace({"YEAR": newerdict})


print("updated year values")
print("2003-2009 dataframe: ",early_df.YEAR.unique())
print("2010-2020 dataframe: ",late_df.YEAR.unique())

original year values
2003-2009 dataframe:  [ 5  6  7  8  9 10 11]
2010-2020 dataframe:  [ 3  4  5  6  7  8  9 10 11 12 14]
updated year values
2003-2009 dataframe:  [2003 2004 2005 2006 2007 2008 2009]
2010-2020 dataframe:  [2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020]


In [17]:
combined_dfs = pd.concat([early_df, late_df])
combined_dfs.to_csv("clean/cc-2003-2020.csv", index=False)


print(f"Length of clean/cc-2003-2009.csv: {early_df.shape[0]}")
print(f"Length of clean/cc-2010-2020.csv: {late_df.shape[0]}")
print(f"Length of clean/cc-2003-2020.csv: {combined_dfs.shape[0]}")

Length of clean/cc-2003-2009.csv: 154007
Length of clean/cc-2010-2020.csv: 242011
Length of clean/cc-2003-2020.csv: 396018


In [19]:
combined_dfs.YEAR.unique()

array([2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013,
       2014, 2015, 2016, 2017, 2018, 2019, 2020])
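
The two checks printed above (row counts and the set of years) can be bundled into one small function. This is a sketch in plain Python with a hypothetical helper name, `check_merge`; the numbers in the demo call are the row counts reported in the output above.

```python
def check_merge(early_years, late_years, early_rows, late_rows, combined_rows):
    """Return a list of problems found; an empty list means the merge looks consistent."""
    problems = []
    # The two inputs should not share any calendar year.
    overlap = set(early_years) & set(late_years)
    if overlap:
        problems.append(f"years present in both inputs: {sorted(overlap)}")
    # Concatenation should neither drop nor duplicate rows.
    if early_rows + late_rows != combined_rows:
        problems.append(
            f"row counts do not add up: {early_rows} + {late_rows} != {combined_rows}"
        )
    return problems


demo_problems = check_merge(range(2003, 2010), range(2010, 2021), 154007, 242011, 396018)
print(demo_problems)
```

With the notebook's DataFrames, the arguments could be taken from `early_df.YEAR.unique()`, `late_df.YEAR.unique()`, and the `shape[0]` of each frame.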