# Cleaning US Census Data

You just got hired as a Data Analyst at the Census Bureau, which collects census data and creates interesting visualizations and insights from it.

The person who had your job before you left you all the data they had for the most recent census. It is in multiple `csv` files. They didn't use pandas; they would just look through these `csv` files manually whenever they wanted to find something. Sometimes they would copy and paste certain numbers into Excel to make charts.

The thought of it makes you shiver. This is not scalable or repeatable.

Your boss wants you to make some scatterplots and histograms by the end of the day. Can you get this data into `pandas` and into reasonable shape so that you can make these plots?

## Inspect the Data!

1. The first visualization your boss wants you to make is a scatterplot that shows average income in a state vs proportion of women in that state.

   Open some of the census `csv` files that came with the kit you downloaded. How are they named? What kind of information do they hold? Will they help us make this graph?

All ten files are `csv` files with similar names, `states0.csv` through `states9.csv`.
The file `states0.csv` has these columns: State, TotalPop, Hispanic, White, Black, Native, Asian, Pacific, Income, GenderPop.
With the `Income` and `GenderPop` columns, I can plot average income against the proportion of women in each state.

2. It will be easier to inspect this data once we have it in a DataFrame. You can't even call `.head()` on these `csv`s! How are you supposed to read them?

   Using `glob`, loop through the census files available and load them into DataFrames. Then, concatenate all of those DataFrames together into one DataFrame, called something like `us_census`.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [27]:
import glob

# Collect one DataFrame per csv file found in the working directory
states_df = []
for path in glob.glob('*.csv'):
    states_df.append(pd.read_csv(path))

In [37]:
# Concatenate the per-file DataFrames into a single DataFrame
us_census = pd.concat(states_df)

In [38]:
print(us_census.head())

   Unnamed: 0         State  TotalPop Hispanic   White   Black Native  Asian  \
0           0          Ohio  11575977    3.67%  75.90%  16.21%  0.17%  1.62%   
1           1      Oklahoma   3849733   10.08%  66.06%   8.31%  6.72%  1.80%   
2           2        Oregon   3939233   11.44%  78.40%   1.73%  1.00%  3.59%   
3           3  Pennsylvania  12779559    6.13%  77.38%  11.63%  0.12%  2.80%   
4           4   Puerto Rico   3583073   98.89%   0.77%   0.09%  0.00%  0.08%   

  Pacific       Income          GenderPop  
0   0.02%  $49,655.25   5662893M_5913084F  
1   0.11%  $48,100.85   1906944M_1942789F  
2   0.35%  $54,271.90   1948453M_1990780F  
3   0.02%  $56,170.46   6245344M_6534215F  
4   0.00%  $20,720.54   1713860M_1869213F  


In [39]:
us_census.describe(include='all')

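|        | Unnamed: 0 | State | TotalPop  | Hispanic | White  | Black | Native | Asian | Pacific | Income     | GenderPop         |
|--------|------------|-------|-----------|----------|--------|-------|--------|-------|---------|------------|-------------------|
| count  | 60.0       | 60    | 60.0      | 60       | 60     | 60    | 60     | 60    | 55      | 60         | 60                |
| unique |            | 51    |           | 50       | 51     | 50    | 39     | 49    | 18      | 51         | 51                |
| top    |            | Ohio  |           | 3.67%    | 75.90% | 5.68% | 0.12%  | 1.62% | 0.02%   | $49,655.25 | 5662893M_5913084F |
| freq   |            | 2     |           | 2        | 2      | 3     | 4      | 4     | 12      | 2          | 2                 |
| mean   | 2.5        |       | 6238516.0 |          |        |       |        |       |         |            |                   |
| std    | 1.722237   |       | 6588488.0 |          |        |       |        |       |         |            |                   |
| min    | 0.0        |       | 626604.0  |          |        |       |        |       |         |            |                   |
| 25%    | 1.0        |       | 2030429.0 |          |        |       |        |       |         |            |                   |
| 50%    | 2.5        |       | 4701414.0 |          |        |       |        |       |         |            |                   |
| 75%    | 4.0        |       | 7303256.0 |          |        |       |        |       |         |            |                   |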


In [40]:
us_census.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 60 entries, 0 to 5
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  60 non-null     int64 
 1   State       60 non-null     object
 2   TotalPop    60 non-null     int64 
 3   Hispanic    60 non-null     object
 4   White       60 non-null     object
 5   Black       60 non-null     object
 6   Native      60 non-null     object
 7   Asian       60 non-null     object
 8   Pacific     55 non-null     object
 9   Income      60 non-null     object
 10  GenderPop   60 non-null     object
dtypes: int64(2), object(9)
memory usage: 5.6+ KB


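Observations about the columns:
- `Pacific` has 5 null values (55 non-null out of 60).
- `Unnamed: 0` looks like a row index that was saved into each `csv` file. It carries no census information, so it can probably be dropped.
- The percentage columns, `Income`, and `GenderPop` are stored as `object` (strings); only `TotalPop` is numeric.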

3. Look at the `.columns` and the `.dtypes` of the `us_census` DataFrame. Are those datatypes going to hinder you as you try to make histograms?

4. Look at the `head()` of the DataFrame so that you can understand why some of these `dtypes` are objects instead of integers or floats.

   Start to make a plan for how to convert these columns into the right types for manipulation.

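Plan for converting to the right types:
- Percentage columns: remove the `%` sign, then convert to float
- `Income`: remove the `$` and `,` characters, then convert to float (the values include cents)
- `GenderPop`: separate into (M)ale and (F)emale counts, then convert to int
- `TotalPop`: already an int; use it to compute proportions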

## Regex to the Rescue

5. Use regex to turn the `Income` column into a format that is ready for conversion into a numerical type.
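
One possible way to approach this step is sketched below. It is not the only solution: it assumes every `Income` entry looks like the ones in the `head()` output above (a leading `$`, commas as thousands separators), and it works on a small `sample` DataFrame (a name introduced here for illustration) built from those rows rather than on the full `us_census`.

```python
import pandas as pd

# Sample rows copied from the head() output above; `sample` is an
# illustrative stand-in for the full us_census DataFrame.
sample = pd.DataFrame({
    "State": ["Ohio", "Oklahoma", "Oregon"],
    "Income": ["$49,655.25", "$48,100.85", "$54,271.90"],
})

# The character class [\$,] matches a dollar sign or a comma;
# replacing every match with "" leaves only digits and the decimal point.
sample["Income"] = sample["Income"].str.replace(r"[\$,]", "", regex=True)

# The strings are now safe to convert to a numerical type.
sample["Income"] = pd.to_numeric(sample["Income"])

print(sample.dtypes)
print(sample)
```

The same two lines should apply to `us_census["Income"]` unchanged, provided the full column follows the same format.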

6. Look at the `GenderPop` column. We are going to want to separate this into two columns, the `Men` column, and the `Women` column.

   Split the column into those two new columns using `str.split` and separating out those results.

7. Convert both of the columns into numerical datatypes.

   There is still an `M` or an `F` character in each entry! We should remove those before we convert.
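
Steps 6 and 7 could be sketched as follows. The `sample` DataFrame is an illustrative stand-in built from the `head()` output above, and `errors="coerce"` is a choice made here (not something the project requires) so that any entry that cannot be parsed becomes `NaN` instead of raising an error.

```python
import pandas as pd

# Sample rows copied from the head() output above.
sample = pd.DataFrame({
    "State": ["Ohio", "Oklahoma", "Oregon"],
    "GenderPop": ["5662893M_5913084F", "1906944M_1942789F", "1948453M_1990780F"],
})

# Split "5662893M_5913084F" on the underscore; expand=True returns a
# DataFrame whose columns are labeled 0 (men) and 1 (women).
parts = sample["GenderPop"].str.split("_", expand=True)

# Strip the trailing M / F marker, then convert the digits to numbers.
# errors="coerce" turns anything unparseable into NaN.
sample["Men"] = pd.to_numeric(parts[0].str.rstrip("M"), errors="coerce")
sample["Women"] = pd.to_numeric(parts[1].str.rstrip("F"), errors="coerce")

print(sample[["State", "Men", "Women"]])
```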

8. Now you should have the columns you need to make the graph and make sure your boss does not slam a ruler angrily on your desk because you've wasted your whole day cleaning your data with no results to show!

   Use matplotlib to make a scatterplot!
   
   ```py
   plt.scatter(the_women_column, the_income_column)
   ```
   
   Remember to call `plt.show()` to see the graph!
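
A slightly fuller sketch of the scatterplot is shown here. Because the boss asked for the *proportion* of women, it plots `Women / TotalPop` on the x-axis rather than the raw count; that interpretation, the `WomenProportion` column name, and the axis labels are assumptions made for this example. The values are the five rows from the `head()` output above, already cleaned.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Cleaned values for the five states shown in the head() output above.
sample = pd.DataFrame({
    "State": ["Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Puerto Rico"],
    "TotalPop": [11575977, 3849733, 3939233, 12779559, 3583073],
    "Women": [5913084, 1942789, 1990780, 6534215, 1869213],
    "Income": [49655.25, 48100.85, 54271.90, 56170.46, 20720.54],
})

# Proportion of women in each state (hypothetical column name).
sample["WomenProportion"] = sample["Women"] / sample["TotalPop"]

plt.scatter(sample["WomenProportion"], sample["Income"])
plt.xlabel("Proportion of women")
plt.ylabel("Average income ($)")
plt.title("Income vs. proportion of women by state")
plt.show()
```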

9. You want to double check your work. You know from experience that these monstrous csv files probably have `nan` values in them! Print out your column with the number of women per state to see.

   We can fill in those `nan`s by using pandas' `.fillna()` function.
   
   You have the `TotalPop` per state, and you have the `Men` per state. As an estimate for the `nan` values in the `Women` column, you could use the `TotalPop` of that state minus the `Men` for that state.
   
   Print out the `Women` column after filling the `nan` values to see if it worked!
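
The estimate described above can be sketched like this. The numbers come from the `head()` output, except that Oklahoma's `Women` value has been blanked out on purpose to create a `NaN` for the example; which states are actually missing in the real data is not shown here.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "State": ["Ohio", "Oklahoma", "Oregon"],
    "TotalPop": [11575977, 3849733, 3939233],
    "Men": [5662893, 1906944, 1948453],
    # Oklahoma is set to NaN deliberately, for illustration only.
    "Women": [5913084, np.nan, 1990780],
})

# fillna accepts a Series and aligns it on the index, so only the rows
# where Women is NaN receive the TotalPop - Men estimate.
sample["Women"] = sample["Women"].fillna(sample["TotalPop"] - sample["Men"])

print(sample["Women"])
```

For these rows `TotalPop` equals `Men + Women` exactly, so the filled-in value matches the original figure for Oklahoma.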

10. We forgot to check for duplicates! Use `.duplicated()` on your `us_census` DataFrame to see if we have duplicate rows in there.

11. Drop those duplicates using the `.drop_duplicates()` function.
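
One thing to watch for, sketched below: `.duplicated()` compares entire rows, so if the leftover `Unnamed: 0` column differs between two copies of the same state, the copies will not be flagged. The tiny `sample` here is constructed for illustration; the differing `Unnamed: 0` values for Ohio are an assumption about how the duplicates look, not something read from the files.

```python
import pandas as pd

# Ohio appears twice on purpose. The differing "Unnamed: 0" values are
# an assumption made for this example.
sample = pd.DataFrame({
    "Unnamed: 0": [5, 0, 1],
    "State": ["Ohio", "Ohio", "Oklahoma"],
    "TotalPop": [11575977, 11575977, 3849733],
})

# Full-row comparison finds nothing, because "Unnamed: 0" differs.
print(sample.duplicated().sum())

# Dropping the leftover index column first lets the duplicates match.
deduped = (
    sample.drop(columns="Unnamed: 0")
    .drop_duplicates()
    .reset_index(drop=True)
)

print(deduped)
```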

12. Make the scatterplot again. Now, it should be perfect! Your job is secure, for now.

## Histogram of Races

13. Now your boss wants you to make a bunch of histograms out of the race data that you have. Look at the `.columns` again to see what the race categories are.

14. Try to make a histogram for each one!

    You will have to get the columns into the numerical format, and those percentage signs will have to go.
    
    Don't forget to fill the `nan` values with something that makes sense! You probably dropped the duplicate rows when making your last graph, but it couldn't hurt to check for duplicates again.
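
A minimal sketch of the histogram step follows. It uses four of the race columns from the `head()` output, with one `Pacific` value blanked out deliberately to stand in for a missing entry. Filling with the column mean is just one option chosen for this example; whether it "makes sense" for the real data is a judgment call.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

race_columns = ["Hispanic", "White", "Black", "Pacific"]

# Values copied from the head() output above; Oklahoma's Pacific entry
# is set to NaN deliberately, for illustration only.
sample = pd.DataFrame({
    "Hispanic": ["3.67%", "10.08%", "11.44%", "6.13%", "98.89%"],
    "White": ["75.90%", "66.06%", "78.40%", "77.38%", "0.77%"],
    "Black": ["16.21%", "8.31%", "1.73%", "11.63%", "0.09%"],
    "Pacific": ["0.02%", np.nan, "0.35%", "0.02%", "0.00%"],
})

for column in race_columns:
    # Remove the trailing percent sign and convert to float.
    sample[column] = pd.to_numeric(sample[column].str.rstrip("%"))
    # Fill missing values with the column mean (one possible choice).
    sample[column] = sample[column].fillna(sample[column].mean())

    # Start a fresh figure so the histograms do not overlap.
    plt.figure()
    plt.hist(sample[column])
    plt.title(f"{column} population share by state")
    plt.xlabel("Percent of state population")
    plt.ylabel("Number of states")
    plt.show()
```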

## Get Creative

15. Phew. You've definitely impressed your boss on your first day of work.

    But is there a way you can really convey the power of pandas and Python over the drudgery of `csv` and Excel?
    
    Try to make some more interesting graphs to show your boss, and the world! You may need to clean the data even more to do it, or the cleaning you have already done may give you the ease of manipulation you've been searching for.