# 5311 and 5310 Applicants
* [Research Request](https://github.com/cal-itp/data-analyses/issues/333)

In [None]:
# Packages to import
# Pandas is the full name of the package but call it pd for short.
import pandas as pd
from calitp import *

# Formatting the notebook
# The max columns to display will be 100
pd.options.display.max_columns = 100

# This will allow you to print all the rows in your data
pd.set_option("display.max_rows", None)

# This will prevent columns from being truncated
pd.set_option("display.max_colwidth", None)

## Load the Excel Sheet
* You can read a specific sheet from the original Excel workbook. 
* Save your sheet as a Pandas dataframe - it can be called anything, but usually it's <i>something_df</i>. 
    * Dataframe = basically just a table of data. 
    * If you want to open multiple sheets, assign each one to its own object with its own name. 
* "to_snakecase" changes the column names to all lowercase and replaces any spaces with underscores.
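
To see what snake-casing does to column names, here is a rough sketch in plain pandas. It is only a stand-in: the real `to_snakecase` may handle more cases, and the toy column names below are made up for illustration.

```python
import pandas as pd

# Toy dataframe with Excel-style column names (made up for illustration)
toy_df = pd.DataFrame({"Funding Program": ["Section 5311"], "Project Year": [2019]})

# Rough stand-in for to_snakecase: lowercase the names and swap spaces for underscores
toy_df.columns = (
    toy_df.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
)

print(toy_df.columns.tolist())
```

After this, you can type `toy_df["funding_program"]` instead of remembering the exact capitalization and spacing from Excel.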

In [None]:
df = to_snakecase(
   pd.read_excel("gs://calitp-analytics-data/data-analyses/grants/Grant+Projects_7_30_2022.xlsx", sheet_name="Grant Projects")
 )

# df = pd.read_excel("./Grant+Projects_7_30_2022.xlsx")

In [None]:
# Save your dataframe to the folder you are in
# df.to_excel("./Grant+Projects_7_30_2022.xlsx", index=False)

## Explore the data 
* Let's check out our data by answering questions such as
    * How many columns and rows does it have? 
    * How many missing values are there? 
    * What are the mean/median? 
* Any time you want to do something to your data, chain the function after the object.
    * In Excel, you'd do SUM(column you want)
    * In Pandas, you'd do df['column you want'].sum()
* [Resource](https://pandas.pydata.org/docs/user_guide/basics.html)    
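
A small sketch of the "chain the function after the object" idea, using a made-up table (the agency names and the `allocation` column are hypothetical, not from the grants workbook):

```python
import pandas as pd

# Made-up data: three agencies, one with a missing amount
toy_df = pd.DataFrame(
    {
        "organization_name": ["Agency A", "Agency B", "Agency C"],
        "allocation": [100.0, 250.0, None],
    }
)

n_rows, n_cols = toy_df.shape                # how many rows and columns
missing = toy_df["allocation"].isna().sum()  # how many missing values
total = toy_df["allocation"].sum()           # like SUM() in Excel; missing values are skipped
average = toy_df["allocation"].mean()        # the mean of the non-missing values
middle = toy_df["allocation"].median()       # the median of the non-missing values

print(n_rows, n_cols, missing, total, average, middle)
```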

In [None]:
# Check out 40 random rows (use df.head() to see the first five rows instead)
# Any line with a pound symbol in front is a comment and won't be executed
df.sample(40)

In [None]:
# Check out the last five rows
df.tail()

In [None]:
# Check out how many rows and columns, # of null values in each column, and the data type of each column
df.info()

In [None]:
# The data spans 2011 to 2022. Check out how many projects were funded by year.
# df["column 1"].value_counts()

In [None]:
# Not sure what a function does? Use help()
help(sum)

In [None]:
# Get some basic stats
df.describe()

## Clean up
* [Tutorial](https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html)

### Data type is important. 
* If you have a column of monetary values presented as $139,293.92 and you want to find the mean, this won't work. 
* This column is considered an "object" column due to the dollar sign and comma - the same as if you had typed "caltrans".
    * You'll have to make sure it's a numeric type (integer or float).
* Based on df.info(), clean up any other columns that aren't the right data type.
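
Here is a minimal sketch of that clean-up on made-up dollar amounts (the `amount` column is hypothetical). It assumes the values have cents, so it converts to float rather than int. `regex=False` tells pandas to treat `$` as a plain character rather than as a regular-expression symbol.

```python
import pandas as pd

# Made-up monetary values stored as text
toy_df = pd.DataFrame({"amount": ["$139,293.92", "$1,000.00", "$50.08"]})

# Text columns are not numeric, so math like .mean() won't work yet
was_numeric = pd.api.types.is_numeric_dtype(toy_df["amount"])

# Strip the dollar signs and commas, then convert to a number
toy_df["amount"] = (
    toy_df["amount"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

is_numeric = pd.api.types.is_numeric_dtype(toy_df["amount"])
mean_amount = toy_df["amount"].mean()
print(was_numeric, is_numeric, mean_amount)
```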

In [None]:
"""
If there are columns that SHOULD be numeric but aren't, put them in the list
in this for loop. It strips the $ and commas from the columns you list, 
then changes them to a numeric data type (float, since the values may have cents).

for c in ["column_one", "column_two", "column_three"]:
    df[c] = df[c].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(float)
"""

### Beware of duplicate values
* Grants data might be manually entered by multiple people. As such, values can be inconsistent. 
* BART, Bay Area Rapid Transit, and Bay Area Rapid Transit (BART) are all the same agency. 
* However, if you are counting the number of unique agencies, these would be counted as 3 different agencies, which is inaccurate.
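
The BART example can be sketched on a toy column like this (the dataframe is made up; only the agency spellings come from the text above):

```python
import pandas as pd

# Made-up column where the same agency is spelled three different ways
toy_df = pd.DataFrame(
    {
        "organization_name": [
            "BART",
            "Bay Area Rapid Transit",
            "Bay Area Rapid Transit (BART)",
            "AC Transit",
        ]
    }
)

# Counts every spelling as its own agency
before = toy_df["organization_name"].nunique()

# Map the inconsistent spellings onto one correct value
toy_df["organization_name"] = toy_df["organization_name"].replace(
    {
        "Bay Area Rapid Transit": "BART",
        "Bay Area Rapid Transit (BART)": "BART",
    }
)

after = toy_df["organization_name"].nunique()
print(before, after)
```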


In [None]:
# Check out your agencies and see if there are any duplicates by
# sorting your column of agencies from A-Z and seeing only unique ones
# df["column"].sort_values().unique()

In [None]:
# Check out the total number of unique values
# df["column"].nunique()

In [None]:
"""
If there are duplicate values, you can replace them with a single correct value using a dictionary.
If all the agencies are only listed once, this cell is irrelevant: go up to the top where it 
says "code" and change it to "markdown". 
The code below is commented out because it sits inside the triple quotes. 
To run it, move the three quotation marks at the bottom of this cell up above the code.

df["column"] = df["column"].replace(
    {"old value 1": "correct value 1", "old value 2": "correct value 2"}
)

"""

## Filter what you want
* You don't necessarily want all the years, all the programs, etc. 
* Filter out what you are interested in.
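
As a quick sketch of filtering, here is a made-up table narrowed down first by program and then by year. "Other Program" is a hypothetical value, and the years are invented for illustration.

```python
import pandas as pd

# Made-up data with a program we don't care about and a year that is too early
toy_df = pd.DataFrame(
    {
        "funding_program": ["Section 5311", "5310 Exp", "Other Program", "5339 (State)"],
        "project_year": [2017, 2019, 2020, 2021],
    }
)

wanted = ["Section 5311", "5310 Exp", "5339 (State)"]

# Keep only rows whose program is in the list
subset = toy_df[toy_df["funding_program"].isin(wanted)]

# Then keep only rows after 2018 (project_year is an integer here, so no quotes)
recent = subset[subset["project_year"] > 2018]

print(len(toy_df), len(subset), len(recent))
```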

### Grants you want

In [None]:
"""
Create a list that contains the grants you are interested in. 
A list is great because you can go in and delete/add items. 
Line below makes it easy to grab the values.
"""
df["funding_program"].unique()

In [None]:
# Paste whatever values you want between the brackets.
# The values need to be in quotes.
grants_wanted = [
    "Section 5311",
    "5310 Exp",
    "5310 Trad",
    "5311(f) Cont",
    "5339 (National)",
    "5339 (State)",
    "CMAQ (FTA 5311)",
    "Section 5311(f)",
    "5311(f) Round 2",
]

In [None]:
"""
Keep only the grants in my list and create a NEW variable.
It's best to create new objects when you make changes, so you can always reference
the original object. 
"""
df2 = df[df["funding_program"].isin(grants_wanted)]

### Columns you want
* Drop irrelevant columns 

In [None]:
df2["funding_program"].value_counts()

In [None]:
# List out all your columns
df2.columns

In [None]:
df2.head()

In [None]:
# Copy and paste the irrelevant ones into this list below
unwanted_columns = [
    "grant_number",
    "upin",
    "description",
    "ali",
    "contract_number",
    "allocationamount",
    "encumbered_amount",
    "expendedamount",
    "activebalance",
    "closedoutbalance",
    "project_closed_by",
    "project_closed_date",
    "project_closed_time",
]

In [None]:
# Drop them - assign to a new dataframe if you wish
df2 = df2.drop(columns=unwanted_columns)

In [None]:
# Check out your hard work with 5 random rows. Is this what you want?
df2.sample(5)

In [None]:
"""
Filter by year. Check the data type of the column you are filtering on. 
If the year column is an object (text), the year needs quotes; if it's an integer, 
no quotes are necessary.
"""
df3 = df2[df2["project_year"] > 2018]

In [None]:
"""
Keep only the rows whose funding program contains "5311". 
case=False ignores the case, so a search for 'ac transit' would also match 'AC TRANSIT'.
"""
df_5311 = df3[(df3.funding_program.str.contains("5311", case=False))]
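
To see which values a substring match like the one above picks up, here is a small sketch on a few of the program names from the grants list:

```python
import pandas as pd

# A handful of the funding program names used in this notebook
programs = pd.Series(["Section 5311", "5311(f) Cont", "CMAQ (FTA 5311)", "5310 Exp"])

# True wherever "5311" appears anywhere in the text
mask = programs.str.contains("5311", case=False)
matched = programs[mask].tolist()

print(matched)
```

Note that every "flavor" of 5311 is kept, including the CMAQ one, because the match looks anywhere inside the text.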

In [None]:
df_5311["funding_program"].value_counts()

In [None]:
# Check out the length, aka # of rows after filtering
len(df_5311)

In [None]:
# Repeat the same steps for 5310 and 5339; make sure to assign each one to a different dataframe

In [None]:
df_5310 = df3[(df3.funding_program.str.contains("5310", case=False))]

In [None]:
df_5310["funding_program"].value_counts()

In [None]:
df_5339 = df3[(df3.funding_program.str.contains("5339", case=False))]

In [None]:
df_5339["funding_program"].value_counts()

In [None]:
len(df3)

In [None]:
len(df_5310) + len(df_5311) + len(df_5339)

In [None]:
len(df3) == (len(df_5310) + len(df_5311) + len(df_5339))

In [None]:
my_common_cols = df_5311.columns.tolist()

In [None]:
my_common_cols

## Airtable

### Split dataframes and merge them back together
<img src= "download.jfif"> 

In [None]:
# Grab the funds I want into a list
airtable_wanted = [
    "Section 5311",
    "5310 Exp",
    "5310 Trad",
    "5311(f) Cont",
    "5339 (National)",
    "5339 (State)",
    "CMAQ (FTA 5311)",
    "Section 5311(f)",
    "5311(f) Round 2",
]

In [None]:
# Filter out for the funds I want
airtable = df[df["funding_program"].isin(airtable_wanted)]

In [None]:
# Check that all the grants are here 
airtable["funding_program"].value_counts()

In [None]:
# Filter out for projects that are later than 2018
airtable = airtable[airtable["project_year"] > 2018]

In [None]:
# Subset df into a smaller one: since we only care if an organization appeared in 
# a grant's dataframe at any point after 2018, we don't need the year/other info
airtable = airtable[["funding_program", "organization_name"]]

In [None]:
airtable.sample(50)

In [None]:
# Subset three dfs, one for each specific grant
df_5311 = airtable[(airtable.funding_program.str.contains("5311", case=False))]

In [None]:
df_5310 = airtable[(airtable.funding_program.str.contains("5310", case=False))]

In [None]:
df_5339 = airtable[(airtable.funding_program.str.contains("5339", case=False))]

In [None]:
# Using a for loop, we can print out how many rows correspond with each "flavor" of the grant program
for i in [df_5311, df_5310, df_5339]:
    print(i["funding_program"].value_counts())
    print(len(i)) 

In [None]:
f"The original table's length is {len(airtable)} and all three of our subsetted dataframes add up to {len(df_5339) + len(df_5310)  + len(df_5311)}"

### First merge: merging 5311 and 5310 
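
Before running the real merge, here is a toy sketch of what an outer merge with `indicator=True` produces. The agency names are made up; the `_merge` column and the `_x`/`_y` suffixes are what pandas adds by default.

```python
import pandas as pd

# Made-up 5311 and 5310 tables that share one agency
left = pd.DataFrame(
    {
        "organization_name": ["Agency A", "Agency B"],
        "funding_program": ["Section 5311", "Section 5311"],
    }
)
right = pd.DataFrame(
    {
        "organization_name": ["Agency B", "Agency C"],
        "funding_program": ["5310 Exp", "5310 Trad"],
    }
)

# how="outer" keeps every agency from both tables;
# indicator=True adds a _merge column saying where each row came from
merged = pd.merge(left, right, how="outer", on=["organization_name"], indicator=True)

# Map each agency to its merge result
result = dict(zip(merged["organization_name"], merged["_merge"].astype(str)))
print(merged)
```

Because both tables have a `funding_program` column, pandas keeps both and renames them `funding_program_x` (left) and `funding_program_y` (right).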

In [None]:
# First merge: merging 5311 and 5310 
m_5311_5310 = pd.merge(
    df_5311,
    df_5310,
    how="outer",
    on=["organization_name"],
    indicator=True,
)

In [None]:
# Check out the results 
m_5311_5310["_merge"].value_counts()

In [None]:
# Preview what happens to the length when you drop the duplicates.  
len(m_5311_5310), len(m_5311_5310.drop_duplicates(subset=["organization_name"]))

In [None]:
# Actually drop the duplicates of agency name, since the same agencies appear multiple times across the years
# Which is why the df is super long.
# Dropping a subset allows you to choose which column(s) to drop the duplicates of
# When you don't specify, this looks across all the columns of a df
m2_5311_5310 = m_5311_5310.drop_duplicates(subset=["organization_name"])
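
The difference between dropping duplicates across all columns and dropping them on a subset can be sketched on a made-up table:

```python
import pandas as pd

# Made-up data: Agency A appears twice, under two different programs
toy_df = pd.DataFrame(
    {
        "organization_name": ["Agency A", "Agency A", "Agency B"],
        "funding_program": ["Section 5311", "5311(f) Cont", "Section 5311"],
    }
)

# No subset: a row is only dropped if EVERY column matches another row
all_cols = toy_df.drop_duplicates()

# With a subset: only organization_name is compared, and the first row per agency is kept
by_org = toy_df.drop_duplicates(subset=["organization_name"])

print(len(all_cols), len(by_org))
```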

In [None]:
m2_5311_5310['_merge'].value_counts()

In [None]:
# Rename the merge column to something that is a little clearer 
m2_5311_5310 = m2_5311_5310.rename(columns = {'_merge': '5311_5310_overlap'}) 

In [None]:
# Replace right only/left only with clearer definitions 
m2_5311_5310["5311_5310_overlap"] = m2_5311_5310["5311_5310_overlap"].replace(
    {"left_only": "5311 only", "right_only": "5310 only", "both": "Both 5311 and 5310"}
)

In [None]:
# Sample a few rows 
m2_5311_5310.sample(40)

### Second merge: df above with 5339

In [None]:
# Now merge in 5339 with the merged 5311 & 5310 stuff
m3_all = pd.merge(
    m2_5311_5310,
    df_5339,
    how="outer",
    on = ["organization_name"],
    indicator=True,
)

In [None]:
# Again drop the duplicates of organizations
m4 = m3_all.drop_duplicates(subset=["organization_name"])

In [None]:
m4["_merge"].value_counts()

In [None]:
m4.shape

In [None]:
# Look at organizations, sorted by their merge result
m4[['organization_name','5311_5310_overlap','_merge']].sort_values('_merge')

In [None]:
# Use a function to replace left_only/both/right_only for more clarity
# https://github.com/cal-itp/data-analyses/blob/main/grant_misc/A2_dla.ipynb
# row is the argument of the function: one row of the dataframe at a time
def recategorize(row):   
    if (row['_merge']=='left_only') and (row['5311_5310_overlap'] == '5311 only'):
        return '5311 only'
    elif (row['_merge']=='left_only') and (row['5311_5310_overlap'] == '5310 only'):
        return '5310 only'
    elif (row['_merge']=='left_only') and (row['5311_5310_overlap'] == 'Both 5311 and 5310'):
        return '5311 and 5310'
    elif (row['_merge']=='both') and (row['5311_5310_overlap'] == '5311 only'):
        return '5311 and 5339'
    elif (row['_merge']=='both') and (row['5311_5310_overlap'] == '5310 only'):
        return '5310 and 5339'
    elif row['_merge']=='right_only':
        # The organization only appears in the 5339 dataframe
        return '5339 only'
    else: 
        return '5310, 5339, 5311'

In [None]:
# Apply a function along an axis of the DataFrame. 
# Axis = 1 means across each row of the df 
# Axis = 0 means across each column of the df 
m4['_merge'] = m4.apply(recategorize, axis = 1)
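
A toy sketch of the two axis options, using a made-up dataframe and a hypothetical `label_row` function:

```python
import pandas as pd

toy_df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})


def label_row(row):
    # row is one row of the dataframe; pick values out by column name
    return "big" if row["a"] + row["b"] > 15 else "small"


# axis=1: the function sees one row at a time
toy_df["size"] = toy_df.apply(label_row, axis=1)

# axis=0: the function sees one column at a time
col_totals = toy_df[["a", "b"]].apply(lambda col: col.sum(), axis=0)

print(toy_df)
print(col_totals)
```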

In [None]:
# Rename _merge 
m4 = m4.rename(columns = {'_merge':'5310_5311_5339_overlap'}) 

In [None]:
# Remove columns 
m4 = m4.drop(columns = ["funding_program_y", "funding_program_x", "5311_5310_overlap", "funding_program"]) 

In [None]:
# Sort your dataframe by organization name, A-Z
m4.sort_values(['organization_name'])

## Save your work
* You can save all your hard work in a single Excel workbook in our [Google Cloud Storage](https://console.cloud.google.com/storage/browser/calitp-analytics-data/data-analyses/grants;tab=objects?project=cal-itp-data-infra&prefix=&forceOnObjectsSortingFiltering=false).

In [None]:
"""
with pd.ExcelWriter(
    "gs://calitp-analytics-data/data-analyses/5311-5310/5310-5311-5339-agency-overlap.xlsx"
) as writer:
    m4.to_excel(writer, sheet_name="5310-5311-years", index= False)
"""

In [None]:
# with pd.ExcelWriter("gs://calitp-analytics-data/data-analyses/grants/5311-5310.xlsx")  as writer: m2_5311_5310.to_excel(writer,sheet_name="5310-5311-years", index= False)