## Cleaning US Census Data

In this project, I play the role of a newly hired data analyst at the Census Bureau, which collects census data and produces visualizations and insights from it.
The person who had this job before me left all the data from the most recent census in multiple CSV files. They didn't use pandas; they just looked through the CSV files manually whenever they wanted to find something, and sometimes copied and pasted numbers into Excel to make charts.

The boss wants me to make some scatterplots and histograms by the end of the day. So let's get this data into pandas and into reasonable shape so that we can make the histograms.

We have several files named states0, states1, and so on. Let's combine these files into a single DataFrame.
We will do this by using glob to find the census files, then looping through them and loading each one into its own DataFrame.
Then, we will concatenate all of those DataFrames into one DataFrame called us_census.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# importing glob. In Python, the glob module is used to retrieve files/pathnames matching a specified pattern. 
# The pattern rules of glob follow standard Unix path expansion rules.

import glob

In [3]:
# Using a glob wildcard pattern (not a regex) to find every file whose name starts with 'states' and ends in '.csv'.
# (We have downloaded 10 files named states0.csv, states1.csv, and so on up to states9.csv; the * wildcard matches all of them.)

files = glob.glob(r"C:\Users\amanp\OneDrive\Desktop\states*.csv")


In [4]:
# Looping through the files, loading each one into a DataFrame, and concatenating them.

df_list = []
for filename in files:
    data = pd.read_csv(filename)
    df_list.append(data)
us_census = pd.concat(df_list)
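
The loop above keeps each file's own row labels, which is why the index later repeats 0 to 5. As a minimal, self-contained sketch of the same glob-and-concatenate idea, the example below writes two tiny stand-in files to a temporary folder (the folder and the file contents are assumptions for illustration, not the real census files) and passes `ignore_index=True` so the combined rows are renumbered.

```python
import glob
import os
import tempfile

import pandas as pd

# Write two small stand-in CSV files so the sketch runs anywhere (illustrative data only).
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"State": ["Alabama", "Alaska"], "TotalPop": [4830620, 733375]}).to_csv(
    os.path.join(tmp_dir, "states0.csv"), index=False
)
pd.DataFrame({"State": ["Arizona", "Arkansas"], "TotalPop": [6641928, 2958208]}).to_csv(
    os.path.join(tmp_dir, "states1.csv"), index=False
)

# glob returns matches in arbitrary order, so sort them for a deterministic load order.
paths = sorted(glob.glob(os.path.join(tmp_dir, "states*.csv")))

# ignore_index=True renumbers the rows 0..n-1 instead of repeating each file's index.
combined = pd.concat([pd.read_csv(path) for path in paths], ignore_index=True)
print(combined)
```

Whether to renumber the index is a design choice; keeping the per-file index, as the notebook does, also works as long as later steps do not rely on unique row labels.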

In [5]:
# let's check the columns of our new dataframe

us_census.columns

Index(['Unnamed: 0', 'State', 'TotalPop', 'Hispanic', 'White', 'Black',
       'Native', 'Asian', 'Pacific', 'Income', 'GenderPop'],
      dtype='object')

In [6]:
# let's also check the datatypes

us_census.dtypes

Unnamed: 0     int64
State         object
TotalPop       int64
Hispanic      object
White         object
Black         object
Native        object
Asian         object
Pacific       object
Income        object
GenderPop     object
dtype: object

In [7]:
# Look at the .head() of the DataFrame so that we can see why some of these dtypes are objects instead of integers or 
# floats.

us_census.head()

Unnamed: 0.1,Unnamed: 0,State,TotalPop,Hispanic,White,Black,Native,Asian,Pacific,Income,GenderPop
0,0,Alabama,4830620,3.75%,61.88%,31.25%,0.45%,1.05%,0.03%,"$43,296.36",2341093M_2489527F
1,1,Alaska,733375,5.91%,60.91%,2.85%,16.39%,5.45%,1.06%,"$70,354.74",384160M_349215F
2,2,Arizona,6641928,29.57%,57.12%,3.85%,4.36%,2.88%,0.17%,"$54,207.82",3299088M_3342840F
3,3,Arkansas,2958208,6.22%,71.14%,18.97%,0.52%,1.14%,0.15%,"$41,935.63",1451913M_1506295F
4,4,California,38421464,37.29%,40.22%,5.68%,0.41%,13.05%,0.35%,"$67,264.78",19087135M_19334329F


So the columns with numerical values contain either a % sign or a $ sign, which is why their dtype is object. Let's clean them and convert them to the right types so that we can work with the data later on.

In [8]:
# If we check the Income column first, we see that each entry starts with a $ and uses commas as thousands separators.
# To convert the column to numbers, we first strip those characters with a regex (replacing them with an empty string),
# then convert the cleaned strings to a numeric dtype.

us_census.Income = us_census.Income.replace(r'[\$,]', '', regex=True)
us_census.Income = pd.to_numeric(us_census.Income)
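
To see the effect of this cleaning step in isolation, here is a small sketch on a toy Series. The first two values come from the `head()` output above; the `None` entry is an assumption added to show how a missing value passes through.

```python
import pandas as pd

# Two income strings from the head() output plus one hypothetical missing entry.
income = pd.Series(["$43,296.36", "$70,354.74", None])

# The character class [\$,] matches a literal dollar sign or a comma.
cleaned = income.str.replace(r"[\$,]", "", regex=True)

# to_numeric turns the cleaned strings into floats and keeps the missing entry as NaN.
income_num = pd.to_numeric(cleaned)

print(income_num.dtype)
print(income_num)
```

Removing the characters alone leaves the column as strings; it is the `pd.to_numeric` call that makes arithmetic and plotting behave as expected.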

In [9]:
# The GenderPop column holds both the male (M) and female (F) counts. Let's separate it into two columns, Males and
# Females, using str.split, and check the results.

gender = us_census.GenderPop.str.split('_')

us_census['Males'] = gender.str.get(0)
us_census['Females'] = gender.str.get(1)
us_census.head()

Unnamed: 0.1,Unnamed: 0,State,TotalPop,Hispanic,White,Black,Native,Asian,Pacific,Income,GenderPop,Males,Females
0,0,Alabama,4830620,3.75%,61.88%,31.25%,0.45%,1.05%,0.03%,43296.36,2341093M_2489527F,2341093M,2489527F
1,1,Alaska,733375,5.91%,60.91%,2.85%,16.39%,5.45%,1.06%,70354.74,384160M_349215F,384160M,349215F
2,2,Arizona,6641928,29.57%,57.12%,3.85%,4.36%,2.88%,0.17%,54207.82,3299088M_3342840F,3299088M,3342840F
3,3,Arkansas,2958208,6.22%,71.14%,18.97%,0.52%,1.14%,0.15%,41935.63,1451913M_1506295F,1451913M,1506295F
4,4,California,38421464,37.29%,40.22%,5.68%,0.41%,13.05%,0.35%,67264.78,19087135M_19334329F,19087135M,19334329F


In [11]:
# There is still an M or an F character in each entry! Let's remove those and then convert it into numerical dtype.

us_census.Males = us_census.Males.replace('[M]', '', regex=True)
us_census.Females = us_census.Females.replace('[F]', '', regex=True)
us_census.Males = pd.to_numeric(us_census.Males)
us_census.Females = pd.to_numeric(us_census.Females)

# Let's check our dataframe now.

us_census.head()

Unnamed: 0.1,Unnamed: 0,State,TotalPop,Hispanic,White,Black,Native,Asian,Pacific,Income,GenderPop,Males,Females
0,0,Alabama,4830620,3.75%,61.88%,31.25%,0.45%,1.05%,0.03%,43296.36,2341093M_2489527F,2341093,2489527.0
1,1,Alaska,733375,5.91%,60.91%,2.85%,16.39%,5.45%,1.06%,70354.74,384160M_349215F,384160,349215.0
2,2,Arizona,6641928,29.57%,57.12%,3.85%,4.36%,2.88%,0.17%,54207.82,3299088M_3342840F,3299088,3342840.0
3,3,Arkansas,2958208,6.22%,71.14%,18.97%,0.52%,1.14%,0.15%,41935.63,1451913M_1506295F,1451913,1506295.0
4,4,California,38421464,37.29%,40.22%,5.68%,0.41%,13.05%,0.35%,67264.78,19087135M_19334329F,19087135,19334329.0


Good job! Now we have two separate numeric columns for the number of males and females, and the Income column has a numeric dtype too.
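
As an alternative to splitting and then stripping the letters, the two counts can be pulled out in one step with a regular expression. This is only a sketch of a different technique, not the approach used above; the third entry, with no female count, is a hypothetical value added to show how a gap would surface as NaN.

```python
import pandas as pd

# Two values from the head() output plus one hypothetical entry with a missing female count.
gender_pop = pd.Series(["2341093M_2489527F", "384160M_349215F", "3057895M_F"])

# Named groups become column names; \d* lets the female count be empty.
parts = gender_pop.str.extract(r"^(?P<Males>\d+)M_(?P<Females>\d*)F$")

# Convert each extracted column to numbers; an empty string becomes NaN.
parts = parts.apply(pd.to_numeric)

print(parts)
```

A column that contains NaN is stored as float, which would explain why Females shows values such as 2489527.0 while Males stays an integer column.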

#### Plotting graph between income and number of females

Since we want to plot income against the number of females, let's first check for NaN values in the Females column.

In [12]:
print(us_census.Females.count())

57


In [13]:
print(us_census.shape)

(60, 13)


The Females column has 57 non-null values out of 60 rows, so 3 values are missing. Let's fill them with the value obtained by subtracting the number of males from the total population.

In [14]:
us_census.Females = us_census.Females.fillna(us_census.TotalPop - us_census.Males)

# Now let's recheck to verify.

print(us_census.Females.count())

60


So now we have no missing values.
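
The fill works because `fillna` accepts a Series and only uses its values where the original is missing. A minimal sketch on a made-up table (all numbers below are hypothetical) shows the arithmetic:

```python
import numpy as np
import pandas as pd

# Made-up numbers, chosen so the arithmetic is easy to check by hand.
toy = pd.DataFrame({
    "TotalPop": [100, 250, 80],
    "Males": [48, 120, 41],
    "Females": [52.0, np.nan, np.nan],
})

# Rows that already have a Females value are left alone;
# missing rows receive TotalPop - Males from the same row.
toy["Females"] = toy["Females"].fillna(toy["TotalPop"] - toy["Males"])

print(toy)
```

Here the second row becomes 250 - 120 = 130 and the third becomes 80 - 41 = 39, while the existing 52 is untouched.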

Let's check for duplicates now before plotting the graph.

In [15]:
us_census.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
0    False
1    False
2    False
3    False
4    False
5    False
0    False
1    False
2    False
3    False
4    False
5    False
0    False
1    False
2    False
3    False
4    False
5    False
0    False
1    False
2    False
3    False
4    False
5    False
0    False
1    False
2    False
3    False
4    False
5    False
0    False
1    False
2    False
3    False
4    False
5    False
0    False
1    False
2    False
3    False
4    False
5    False
0    False
1    False
2    False
3    False
4    False
5    False
0    False
1    False
2    False
3    False
4    False
5    False
dtype: bool

In [16]:
# Let's drop the duplicates and check the shape of new dataframe.

us_census_new = us_census.drop_duplicates()
print(us_census_new.shape)

(60, 13)
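
With no arguments, `duplicated()` and `drop_duplicates()` treat two rows as duplicates only when every column matches, including `Unnamed: 0`. Since 60 rows is more than the number of US states, it may be worth also checking for repeats on the `State` column alone. The sketch below uses made-up rows (the values are hypothetical, not taken from the census files) to show the difference.

```python
import pandas as pd

# Hypothetical rows: the same state appears twice with different 'Unnamed: 0' values.
toy = pd.DataFrame({
    "Unnamed: 0": [5, 0, 1],
    "State": ["Colorado", "Colorado", "Connecticut"],
    "TotalPop": [100, 100, 200],
})

# Comparing every column finds nothing, because 'Unnamed: 0' differs.
full_row_dupes = toy.duplicated()

# Comparing only the State column flags the second Colorado row.
by_state = toy.duplicated(subset=["State"])

# Keep the first row for each state and renumber the index.
deduped = toy.drop_duplicates(subset=["State"]).reset_index(drop=True)

print(full_row_dupes.tolist())
print(by_state.tolist())
print(deduped)
```

Whether the real files contain such overlaps is something to verify on the data itself, for example with `us_census.duplicated(subset=['State']).sum()`.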


In [17]:
us_census.head()

Unnamed: 0.1,Unnamed: 0,State,TotalPop,Hispanic,White,Black,Native,Asian,Pacific,Income,GenderPop,Males,Females
0,0,Alabama,4830620,3.75%,61.88%,31.25%,0.45%,1.05%,0.03%,43296.36,2341093M_2489527F,2341093,2489527.0
1,1,Alaska,733375,5.91%,60.91%,2.85%,16.39%,5.45%,1.06%,70354.74,384160M_349215F,384160,349215.0
2,2,Arizona,6641928,29.57%,57.12%,3.85%,4.36%,2.88%,0.17%,54207.82,3299088M_3342840F,3299088,3342840.0
3,3,Arkansas,2958208,6.22%,71.14%,18.97%,0.52%,1.14%,0.15%,41935.63,1451913M_1506295F,1451913,1506295.0
4,4,California,38421464,37.29%,40.22%,5.68%,0.41%,13.05%,0.35%,67264.78,19087135M_19334329F,19087135,19334329.0


So now we are ready to plot income against the number of females.
We can make this graph with matplotlib, which we imported at the start as plt.
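
A minimal sketch of such a scatterplot is shown below. It uses only the five rows visible in the `head()` output rather than the full DataFrame, and the axis labels and title are my own choices. The `Agg` backend line is an assumption that lets the sketch run without a display; it can be dropped inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# The five rows shown in the head() output above.
sample = pd.DataFrame({
    "Females": [2489527, 349215, 3342840, 1506295, 19334329],
    "Income": [43296.36, 70354.74, 54207.82, 41935.63, 67264.78],
})

fig, ax = plt.subplots()
ax.scatter(sample["Females"], sample["Income"])
ax.set_xlabel("Number of females")
ax.set_ylabel("Income (USD)")
ax.set_title("Income vs. number of females by state")
```

On the full data the same call would be `ax.scatter(us_census.Females, us_census.Income)`, provided both columns are numeric.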


#### Conclusion and next steps:

In this project we have cleaned the US census data using various methods.
This data can now be used to make different plots.

1. We can make a graph between income and number of females.

2. We can make a bunch of histograms out of the race data that we have. We can look at the .columns again to see what the race categories are.

3. We can try to make a histogram for each one! We will have to get the columns into numerical format, so those percentage signs will have to go. We should also fill the NaN values with something that makes sense.

4. We can make some more interesting graphs to show the team! We may need to clean the data even more to do it, or the cleaning we have already done may be enough to make the data easy to work with.
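
As a starting point for the histogram steps above, here is a hedged sketch of converting percentage strings to numbers and plotting one column. The Hispanic values come from the `head()` output; the `None` in the Pacific column replaces a real value to illustrate a missing entry, and the bin count and the choice to drop NaN before plotting are assumptions, not a recommendation for how to fill the gaps.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Values from the head() output, with one entry set to None to mimic a missing value.
sample = pd.DataFrame({
    "Hispanic": ["3.75%", "5.91%", "29.57%", "6.22%", "37.29%"],
    "Pacific": ["0.03%", "1.06%", None, "0.15%", "0.35%"],
})

# Strip the trailing % sign and convert each column to floats; missing entries stay NaN.
for col in ["Hispanic", "Pacific"]:
    sample[col] = pd.to_numeric(sample[col].str.rstrip("%"))

# Drop NaN before plotting so the histogram range is well defined.
fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(sample["Hispanic"].dropna(), bins=5)
ax.set_xlabel("Hispanic (% of population)")
ax.set_ylabel("Number of states")
```

The same loop could run over all six race columns listed in `us_census.columns` once a fill strategy for the NaN values has been chosen.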