### American Community Survey 5 Year Data (or ACS5)

To get started, let's read the ACS 5 year data for California tracts into a `DataFrame` using the pandas `read_csv` function.

As we read in the ACS data we will tell pandas to make sure that the data in the column `FIPS_11_digit` is read in as a string to preserve leading zeros in the census tract identifiers.

In [None]:
# Read in the ACS5 data for CA into a pandas DataFrame.
# Note: We force the FIPS_11_digit to be read in as a string to preserve any leading zeroes.
acs5data_df = pd.read_csv("../notebook_data/census/ACS5yr/census_variables_CA.csv", dtype={'FIPS_11_digit':str})
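
To see why the `dtype` argument matters, here is a minimal sketch that parses the same text twice. The two-row CSV and its values are made up for illustration; only the column names come from the ACS5 file.

```python
import io
import pandas as pd

# Made-up two-row CSV standing in for the real ACS5 file
csv_text = "FIPS_11_digit,c_race\n06001400100,3120\n06075010100,3901\n"

# Default parsing infers an integer column, which drops the leading zero
as_int = pd.read_csv(io.StringIO(csv_text))

# Forcing the column to str keeps all 11 characters
as_str = pd.read_csv(io.StringIO(csv_text), dtype={'FIPS_11_digit': str})

print(as_int['FIPS_11_digit'].tolist())
print(as_str['FIPS_11_digit'].tolist())
```

The first list holds 10-digit integers, while the second holds 11-character strings that still start with `0`.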

Pandas provides a number of methods to view information about a dataframe.

The pandas dataframe attribute `shape` tells us the number of rows and columns in the dataframe.

In [None]:
# Take a look at the shape of the dataframe
acs5data_df.shape

Each row in our dataframe is an observation. For the ACS5 data each observation is about a census tract.

Each column in our dataframe is a variable for that observation.

Let's use `head` to take a look at the first 5 rows in the dataframe.

In [None]:
# Take a look at the data
acs5data_df.head()

A `...` in the middle of the top row indicates that there are too many columns to display.

The pandas dataframe `columns` attribute returns the column names as a pandas `Index`.

In [None]:
acs5data_df.columns

We can see more information about the variables included in our ACS5 year data using the `info` method. This method tells us at a glance what variables (or columns) are included in the data, the data type of each variable, and which variables have values for all rows.

In [None]:
acs5data_df.info()

### Brief review of the ACS data

These variables were combined from different ACS 5 year tables. We have information for the following:

- `c_race` - Total population
- `c_white` - Total white non-Latinx
- `c_black` - Total black and African American non-Latinx
- `c_asian` - Total Asian non-Latinx
- `c_latinx` - Total Latinx
- `state_fips` - State level FIPS code
- `county_fips` - County level FIPS code
- `tract_fips` - Tract level FIPS code
- `med_rent` - Median rent
- `med_hhinc` - Median household income
- `c_tenants` - Total tenants
- `c_owners` - Total owners
- `c_renters` - Total renters
- `c_movers` - Total number of people who moved
- `c_stay` - Total number of people who stayed
- `c_movelocal` - Number of people who moved locally
- `c_movecounty` - Number of people who moved counties
- `c_movestate` - Number of people who moved states
- `c_moveabroad` - Number of people who moved abroad
- `c_commute` - Total number of commuters
- `c_car` - Number of commuters who use a car
- `c_carpool` - Number of commuters who carpool
- `c_transit` - Number of commuters who use public transit
- `c_bike` - Number of commuters who bike
- `c_walk` - Number of commuters who walk
- `year` - ACS data year
- `FIPS_11_digit` - 11-digit FIPS code

The ACS variables that start with `c_` are counts, and those that start with `med_` are medians. Variables that end in `_moe` denote the margin of error. There are also a number of derived variables that start with `p_`. These are proportions calculated by dividing each count by the table denominator (the total count for whom that variable was assessed).
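
As a rough illustration of how a `p_` variable relates to its count, the sketch below divides two counts by a denominator. The numbers are made up, and the names `p_white` and `p_asian` and the choice of `c_race` as their denominator are assumptions about how the real file was built.

```python
import pandas as pd

# Made-up counts for two tracts; count column names mirror the list above
toy = pd.DataFrame({
    'c_race':  [4000, 2500],   # assumed table denominator (total population)
    'c_white': [1000, 500],
    'c_asian': [2000, 1500],
})

# Proportion = count / table denominator
toy['p_white'] = toy['c_white'] / toy['c_race']
toy['p_asian'] = toy['c_asian'] / toy['c_race']

print(toy[['p_white', 'p_asian']])
```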

We're going to drop all of our margin of error columns, which are the ones that end with `_moe`. We can do that in two steps. First, we use `filter` to identify the columns whose names contain the string `_moe`.

In [None]:
moe_cols = acs5data_df.filter(like='_moe',axis=1).columns
moe_cols

Note how we set the filter `like=` to a value that matches the pattern of the names of the columns we want to drop. You need to make sure the pattern matches all of the columns you want to drop, and only those columns.
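
If you ever need a stricter match, `filter` also accepts a `regex=` argument. The sketch below compares the two on an empty dataframe whose column names are hypothetical, chosen only to show the difference between "contains" and "ends with".

```python
import pandas as pd

# Hypothetical column names, chosen only to illustrate the matching rules
toy = pd.DataFrame(columns=['med_rent', 'med_rent_moe', 'c_moe_flag'])

# like= keeps any column whose name contains the substring
contains_moe = toy.filter(like='_moe', axis=1).columns.tolist()

# regex= with a trailing $ keeps only names that end with the suffix
ends_with_moe = toy.filter(regex='_moe$', axis=1).columns.tolist()

print(contains_moe)
print(ends_with_moe)
```

Here `like='_moe'` also picks up `c_moe_flag`, while the regular expression keeps only `med_rent_moe`.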

<div style="display:inline-block;vertical-align:top;">
    <img src="http://www.pngall.com/wp-content/uploads/2016/03/Light-Bulb-Free-PNG-Image.png" width="30" align=left > 
</div>  
<div style="display:inline-block;">

#### Question
</div>

What do you think happens if you match `_mo` instead of `_moe` in the filter?

Now that we've got our list of moe columns, we can use `.drop()` to remove them from the dataframe. 

In [None]:
# Drop MOE columns
acs5data_df.drop(moe_cols, axis=1, inplace=True)

Check that you no longer have the moe columns in the dataframe.

In [None]:
acs5data_df.columns

### Select data for our county and year of interest

Our ACS5 data contains observations for all CA counties and two ACS 5 year periods.

The counties are identified by a unique Census FIPS code. 
- You can see the list of all CA Counties and their FIPS codes [here](https://en.wikipedia.org/wiki/List_of_counties_in_California).

Let's use `.unique` to check the unique set of county FIPS codes included in our dataframe.

In [None]:
acs5data_df['county_fips'].unique()  #what counties are in our dataframe

Now use `.unique` to see what years are included.

In [None]:
acs5data_df['year'].unique()

We are interested in Alameda County, which has the FIPS code `001`.  Moreover, we are only interested in the 2018 ACS 5 year data.  Let's filter the data to keep only the rows that match these two conditions.


In [None]:
acs5data_df_ac = acs5data_df[(acs5data_df['year']==2018) & (acs5data_df['county_fips']==1)]
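
The one-line filter above combines two boolean masks. A minimal sketch on made-up rows (only the column names come from the ACS5 data, and the mask variable names are illustrative) shows the same steps separately:

```python
import pandas as pd

# Made-up rows; only the column names come from the ACS5 dataframe
toy = pd.DataFrame({
    'year':        [2013, 2018, 2018, 2018],
    'county_fips': [1, 1, 75, 1],
})

# Each comparison yields a boolean Series with one value per row
is_2018 = toy['year'] == 2018
is_county_1 = toy['county_fips'] == 1

# & combines the masks element-wise; a row is kept only if both are True
subset = toy[is_2018 & is_county_1]
print(subset)
```

When the comparisons are written inline, each one needs its own parentheses because `&` binds more tightly than `==` in Python.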

<div style="display:inline-block;vertical-align:top;">
    <img src="http://www.pngall.com/wp-content/uploads/2016/03/Light-Bulb-Free-PNG-Image.png" width="30" align=left > 
</div>  
<div style="display:inline-block;">

#### Question
</div>

Why do we filter on `county_fips==1` instead of `county_fips==001` or `county_fips=='001'`?

In [None]:
# Write your thoughts here

Now, check the contents of our dataframe again.

In [None]:
# now what is the shape of the data when filtered for Alameda County?
print(acs5data_df_ac.shape)

In [None]:
# Take a look at the first 5 rows
acs5data_df_ac.head()

>**Pro-tip:** Checking your row and column counts with `.shape` and your values with `.head` often helps to make sure that they are consistent with your understanding of the data.

### Saving our output

It's a good idea to save your data if you have done any major processing on it. Let's save our Alameda County sub-setted ACS5 data to a CSV file.

In [None]:
# Save processed data to a csv file - give it a name that is meaningful
acs5data_df_ac.to_csv('../outdata/acs5data_2018_AC.csv')
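
One thing to keep in mind when you read a saved file back in: by default `to_csv` also writes the row index, and the string type of `FIPS_11_digit` is not stored in the CSV. The sketch below uses a made-up one-row dataframe and an in-memory buffer instead of a file to show both effects.

```python
import io
import pandas as pd

# Made-up single tract; an in-memory string stands in for the CSV file
toy = pd.DataFrame({'FIPS_11_digit': ['06001400100'], 'c_race': [3120]})

# Default index=True writes the row index as an extra, unnamed first column
with_index = pd.read_csv(io.StringIO(toy.to_csv()),
                         dtype={'FIPS_11_digit': str})

# index=False leaves the index out of the file
no_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)),
                       dtype={'FIPS_11_digit': str})

print(with_index.columns.tolist())
print(no_index.columns.tolist())
```

In both cases the `dtype` argument is needed again on read to keep the leading zero.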

Confirm that the file was saved with a [shell command](https://jakevdp.github.io/PythonDataScienceHandbook/01.05-ipython-and-shell-commands.html#Shell-Commands-in-IPython). Shell commands are prefaced by a `!` and allow you to access the file system and run commands as you would from a terminal window. (The commands may differ if you are on a Windows computer.)

In [None]:
!ls ../outdata
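
If the shell command is not available on your system, a pure Python check is one alternative. This is a sketch using the standard library's `pathlib`; the helper name `list_csv_names` is hypothetical and not part of the notebook.

```python
from pathlib import Path

def list_csv_names(directory):
    """Return the sorted names of the CSV files in `directory` (hypothetical helper)."""
    folder = Path(directory)
    if not folder.is_dir():
        return []  # missing folder: nothing to list
    return sorted(p.name for p in folder.glob('*.csv'))

# Same folder that the to_csv call above writes to
print(list_csv_names('../outdata'))
```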

#### Exercise

Now do this for the SF ACS data:
1. Find the FIPS code for [SF county](https://en.wikipedia.org/wiki/List_of_counties_in_California)
2. Subset the ACS data to keep only rows for SF county in 2018 and assign to `acs5data_df_sf`
3. Save out ACS data as `acs5data_2018_SF.csv`




In [None]:
# Your code here


*Click here for solution*

<!--- 
    # SOLUTION
    # 1 & 2 Subset ACS data for SF
    acs5data_df_sf = acs5data_df[(acs5data_df['county_fips']==75) & (acs5data_df.year==2018)]

    # SOLUTION
    acs5data_df_sf.head()

    # SOLUTION
    # 3. Save out ACS data as 'acs5data_2018_SF.csv'
    acs5data_df_sf.to_csv('../outdata/acs5data_2018_SF.csv')
--->