# 5311 and 5310 Applicants
* [Research Request](https://github.com/cal-itp/data-analyses/issues/333)

In [1]:
# Packages to import
# Pandas is the full name of the package but call it pd for short.
import pandas as pd
from calitp import *

# Formatting the notebook
# The max columns to display will be 100
pd.options.display.max_columns = 100

# This will allow you to print all the rows in your data
pd.set_option("display.max_rows", None)

# This will prevent columns from being truncated
pd.set_option("display.max_colwidth", None)



## Load the Excel Sheet
* You can read a specific sheet from the original Excel workbook.
* Save your sheet as a Pandas dataframe - it can be called anything, but usually it's <i>something_df</i>. 
    * Dataframe = basically just a table of data. 
    * If you want to open multiple sheets, you'd assign them to different objects with different names. 
* "to_snakecase" changes the column names to all lowercase and replaces any spaces with underscores.
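
As a rough illustration of the renaming described above, the sketch below lowercases column names and swaps spaces for underscores using plain pandas. It only approximates the described effect of "to_snakecase" and is not that function's actual code; `raw_df` and `snake_df` are hypothetical names with made-up values.

```python
import pandas as pd

# Hypothetical two-column table standing in for a freshly loaded Excel sheet.
raw_df = pd.DataFrame(
    {
        "Grant Fiscal Year": [2011, 2012],
        "Funding Program": ["Section 5311", "5310 Trad"],
    }
)

# Work on a copy so the original column names stay available for comparison.
snake_df = raw_df.copy()

# Lowercase every column name, then replace spaces with underscores.
snake_df.columns = snake_df.columns.str.lower().str.replace(" ", "_", regex=False)

print(list(raw_df.columns))
print(list(snake_df.columns))
```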

In [2]:
#df = to_snakecase(
#    pd.read_excel("gs://calitp-analytics-data/data-analyses/grants/Grant+Projects_7_30_2022.xlsx", sheet_name="Grant Projects")
#)

df = pd.read_excel("./Grant+Projects_7_30_2022.xlsx")

In [3]:
# Save your dataframe to the folder you are in 
# df.to_excel("./Grant+Projects_7_30_2022.xlsx", index=False)

## Explore the data 
* Let's check out our data by answering questions such as
    * How many columns and rows does it have? 
    * How many missing values are there? 
    * What are the mean/median? 
* Any time you want to do something to your data, chain the function after the object.
    * In Excel, you'd do SUM(column you want)
    * In Pandas, you'd do df['column you want'].sum()
* [Resource](https://pandas.pydata.org/docs/user_guide/basics.html)    
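
The questions above can be sketched on a tiny made-up table. `toy_df` and its values are assumptions for illustration only; the same calls should work on the real `df`.

```python
import pandas as pd

# Hypothetical table with one missing amount.
toy_df = pd.DataFrame(
    {
        "agency": ["Agency A", "Agency B", "Agency C", "Agency D"],
        "amount": [100.0, 250.0, None, 50.0],
    }
)

# How many rows and columns?
n_rows, n_cols = toy_df.shape

# How many missing values are in each column?
missing_per_column = toy_df.isna().sum()

# Excel's SUM / AVERAGE / MEDIAN become methods chained after the column.
# Missing values are skipped by default.
total_amount = toy_df["amount"].sum()
mean_amount = toy_df["amount"].mean()
median_amount = toy_df["amount"].median()

print(n_rows, n_cols)
print(missing_per_column)
print(total_amount, mean_amount, median_amount)
```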

In [4]:
# Check out the first five rows
# Any line with a pound symbol in front is a comment and won't be executed
df.head()

Unnamed: 0,grant_fiscal_year,funding_program,grant_number,project_year,organization_name,upin,description,ali,contract_number,allocationamount,encumbered_amount,expendedamount,activebalance,closedoutbalance,project_status,project_closed_by,project_closed_date,project_closed_time
0,2011,Section 5311,CA-18-X047 | 0012000083,2016,City of Chowchilla,BCG0000228,Operating Assistance,300902,64BO17-00368,53221.0,114511.0,53221.0,0.0,0,Open,,,
1,2011,Section 5311,CA-18-X047 | 0012000083,2016,Madera County,BCG0000283,Buy <30-Ft Bus For Expansion,111304,64BC17-00408,110663.0,110663.0,101352.02,9310.98,0,Open,,,
2,2011,Section 5311,CA-18-X047 | 0012000083,2016,Madera County,BCG0000284,Purchase Replacement Van,111215,64BC17-00408,20643.0,44265.0,20643.0,0.0,0,Open,,,
3,2012,Section 5311,CA-18-X052 | 0012000304,2016,Madera County,BCG0000284,Purchase Replacement Van,111215,64BC17-00408,23622.0,44265.0,22868.3,753.7,0,Open,,,
4,2012,Section 5311,CA-18-X052 | 0012000304,2016,Madera County,BCG0000286,Purchase Expansion <30ft Bus,111304,64BC17-00480,22925.0,113319.0,22655.51,269.49,0,Open,,,


In [5]:
# Check out the last five rows
df.tail()

Unnamed: 0,grant_fiscal_year,funding_program,grant_number,project_year,organization_name,upin,description,ali,contract_number,allocationamount,encumbered_amount,expendedamount,activebalance,closedoutbalance,project_status,project_closed_by,project_closed_date,project_closed_time
2760,2022,Section 5311(f),TBD | 0022000356-F,2022,Sunline Transit Agency,BCG0003870,Operating Assistance Sliding Scale - 5311(f) - Route10,300902,,257375.0,0.0,0.0,257375.0,0,Open,,,
2761,2022,Section 5311(f),TBD | 0022000356-F,2022,Trinity County Department of Transportation,BCG0003993,Operating Assistance Sliding Scale RED/LEW 22/23,300902,,173820.0,0.0,0.0,173820.0,0,Open,,,
2762,2022,Section 5311(f),TBD | 0022000356-F,2022,Trinity County Department of Transportation,BCG0003997,Operating Assistance Sliding Scale WC 22/23,300902,,152038.0,0.0,0.0,152038.0,0,Open,,,
2763,2022,Section 5311(f),TBD | 0022000356-F,2022,Yosemite Area Regional Transportation System,BCG0004056,Operating Assistance Sliding Scale,300902,,300000.0,0.0,0.0,300000.0,0,Open,,,
2764,2022,Section 5311(f),TBD | 0022000356-F,2022,Yurok Tribe Transit,BCG0004031,Operating Assistance Sliding Scale - Orleans to Willow Creek,300902,,116064.0,0.0,0.0,116064.0,0,Open,,,


In [6]:
# Check out how many rows and columns, # of null values in each column, and the data type of each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2765 entries, 0 to 2764
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   grant_fiscal_year    2765 non-null   int64  
 1   funding_program      2765 non-null   object 
 2   grant_number         2765 non-null   object 
 3   project_year         2765 non-null   int64  
 4   organization_name    2765 non-null   object 
 5   upin                 2765 non-null   object 
 6   description          2765 non-null   object 
 7   ali                  2765 non-null   object 
 8   contract_number      2498 non-null   object 
 9   allocationamount     2765 non-null   float64
 10  encumbered_amount    2765 non-null   float64
 11  expendedamount       2765 non-null   float64
 12  activebalance        2765 non-null   float64
 13  closedoutbalance     2765 non-null   int64  
 14  project_status       2765 non-null   object 
 15  project_closed_by    0 non-null      f

In [8]:
# The data spans 2011 to 2022. Check out how many projects were funded by year.
# df["column 1"].value_counts()
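
A minimal sketch of the count-by-year idea, using a made-up column of fiscal years (`toy_years` is hypothetical). On the real data the column would presumably be `grant_fiscal_year` or `project_year`.

```python
import pandas as pd

# Hypothetical fiscal years for six projects.
toy_years = pd.DataFrame({"grant_fiscal_year": [2011, 2011, 2012, 2022, 2022, 2022]})

# value_counts() counts rows per distinct value; sort_index() orders the result by year.
projects_per_year = toy_years["grant_fiscal_year"].value_counts().sort_index()

print(projects_per_year)
```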

In [None]:
# Not sure what a function does: use help
help(sum)

In [None]:
# Get some basic stats
df.describe()

## Clean up
* [Tutorial](https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html)

### Data type is important. 
* If you have a column of monetary values presented as $139,293.92 and you want to find the mean, this won't work. 
* This column is considered an "object" column due to the dollar sign and comma - same way as if you typed "caltrans".
    * You'll have to make sure it's a numeric type (an integer or a float).
* Based on df.info() clean up other columns that aren't the right data type

In [None]:
"""
If there are columns that SHOULD be numeric but aren't, add them to the list
in this for loop. This strips the $ and commas from the columns you list,
then changes them to the float data type (float keeps the cents; use int only for whole numbers).

for c in ["column_one", "column_two", "column_three"]:
    # regex=False treats "$" as a literal dollar sign instead of a regex anchor
    df[c] = df[c].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(float)
"""
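
To make the data type point concrete, here is a small sketch on made-up dollar strings. `money_df` and its column names are hypothetical; the cleaning follows the strip-then-convert approach described in this section.

```python
import pandas as pd

# Hypothetical column of monetary values stored as text.
money_df = pd.DataFrame({"allocation": ["$139,293.92", "$1,000", "$53,221.00"]})

# Remove the literal "$" and "," characters, then convert the text to floats.
# regex=False makes pandas treat "$" as a plain character.
money_df["allocation_clean"] = (
    money_df["allocation"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

# Now numeric functions such as mean() work on the cleaned column.
mean_allocation = money_df["allocation_clean"].mean()

print(money_df.dtypes)
print(mean_allocation)
```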

### Beware of duplicate values
* Grants data might be manually entered by multiple people. As such, values can be inconsistent. 
* BART, Bay Area Rapid Transit, and Bay Area Rapid Transit (BART) are all the same agency. 
* However, if you are counting the number of unique agencies, these would be counted as 3 different agencies, which is inaccurate.


In [None]:
# Check out your agencies and see if there are any duplicates by
# sorting your column of agencies from A-Z and seeing only unique ones
# df["column"].sort_values().unique()

In [None]:
# Check out the total number of unique values
# df["column"].nunique()

In [None]:
"""
If there are duplicate values, you can replace them with the correct one using a dictionary.
If this cell is irrelevant because every agency is only listed once,
go up to the top where it says "code" and change it to "markdown". 
The code below is inactive because it sits inside the three quotation marks.
To run it, move the three quotation marks at the bottom of this cell up above the code.

df["column"] = df["column"].replace(
    {"old value 1": "correct value 1", "old value 2": "correct value 2"}
)

"""
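
The BART example above can be sketched end to end on a made-up table. `agency_df` and `name_fixes` are hypothetical names; the point is only to show how the unique count changes once the spellings are standardized.

```python
import pandas as pd

# Hypothetical agency column with three spellings of the same agency.
agency_df = pd.DataFrame(
    {
        "organization_name": [
            "BART",
            "Bay Area Rapid Transit",
            "Bay Area Rapid Transit (BART)",
            "Madera County",
        ]
    }
)

# Before cleaning, every spelling counts as its own agency.
count_before = agency_df["organization_name"].nunique()

# Map each inconsistent spelling to the one name you want to keep.
name_fixes = {
    "Bay Area Rapid Transit": "BART",
    "Bay Area Rapid Transit (BART)": "BART",
}
agency_df["organization_name"] = agency_df["organization_name"].replace(name_fixes)

# After cleaning, the three spellings collapse into one agency.
count_after = agency_df["organization_name"].nunique()

print(count_before, count_after)
```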

## Filter what you want
* You don't necessarily want all the years, all the programs, etc. 
* Filter out what you are interested in.

### Grants you want

In [9]:
"""
Create a list that contains the grants you are interested in. 
A list is great because you can go in and delete/add items. 
Line below makes it easy to grab the values.
"""
df["funding_program"].unique()

array(['Section 5311', '5310 Exp', '5310 Trad', '5311(f) Cont',
       '5339 (National)', '5339 (State)', 'CMAQ (FTA 5311)',
       'Section 5311(f)', 'Toll Credits', '5311(f) Round 2', 'CARES Act',
       'CARES Act (F)', 'ARPA', 'CRRSAA'], dtype=object)

In [10]:
# Paste whatever values you want between the brackets.
# The values need to be in quotes.
grants_wanted = ['Section 5311', '5310 Exp', '5310 Trad', '5311(f) Cont',
       '5339 (National)', '5339 (State)', 'CMAQ (FTA 5311)',
       'Section 5311(f)', '5311(f) Round 2']

In [11]:
"""
Keep only the grants in my list and create a NEW variable.
It's best to create new variables when you make changes, so you can always reference
the original variable. 
"""
df2 = df[df["funding_program"].isin(grants_wanted)]
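
For readers new to `isin`, a small sketch on made-up rows shows what the filter keeps and what it leaves behind. `programs_df` and `wanted` are hypothetical; the program names are borrowed from the list printed above.

```python
import pandas as pd

# Hypothetical rows covering wanted and unwanted programs.
programs_df = pd.DataFrame(
    {
        "funding_program": ["Section 5311", "CARES Act", "5310 Trad", "ARPA"],
        "project_year": [2019, 2020, 2021, 2022],
    }
)

wanted = ["Section 5311", "5310 Trad"]

# isin() returns True for rows whose value appears in the list.
kept = programs_df[programs_df["funding_program"].isin(wanted)]

# The ~ operator flips the mask, which is handy for checking what was left out.
dropped = programs_df[~programs_df["funding_program"].isin(wanted)]

print(kept)
print(dropped)
```

Because the result is assigned to new variables, `programs_df` itself is unchanged and can still be referenced.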

### Columns you want
* Drop irrelevant columns 

In [12]:
df2['funding_program'].value_counts()

5310 Trad          986
Section 5311       720
5310 Exp           166
Section 5311(f)    140
5339 (State)       129
5339 (National)     48
CMAQ (FTA 5311)     44
5311(f) Cont        41
5311(f) Round 2     27
Name: funding_program, dtype: int64

In [13]:
# List out all your columns
df2.columns

Index(['grant_fiscal_year', 'funding_program', 'grant_number', 'project_year',
       'organization_name', 'upin', 'description', 'ali', 'contract_number',
       'allocationamount', 'encumbered_amount', 'expendedamount',
       'activebalance', 'closedoutbalance', 'project_status',
       'project_closed_by', 'project_closed_date', 'project_closed_time'],
      dtype='object')

In [14]:
df2.head()

Unnamed: 0,grant_fiscal_year,funding_program,grant_number,project_year,organization_name,upin,description,ali,contract_number,allocationamount,encumbered_amount,expendedamount,activebalance,closedoutbalance,project_status,project_closed_by,project_closed_date,project_closed_time
0,2011,Section 5311,CA-18-X047 | 0012000083,2016,City of Chowchilla,BCG0000228,Operating Assistance,300902,64BO17-00368,53221.0,114511.0,53221.0,0.0,0,Open,,,
1,2011,Section 5311,CA-18-X047 | 0012000083,2016,Madera County,BCG0000283,Buy <30-Ft Bus For Expansion,111304,64BC17-00408,110663.0,110663.0,101352.02,9310.98,0,Open,,,
2,2011,Section 5311,CA-18-X047 | 0012000083,2016,Madera County,BCG0000284,Purchase Replacement Van,111215,64BC17-00408,20643.0,44265.0,20643.0,0.0,0,Open,,,
3,2012,Section 5311,CA-18-X052 | 0012000304,2016,Madera County,BCG0000284,Purchase Replacement Van,111215,64BC17-00408,23622.0,44265.0,22868.3,753.7,0,Open,,,
4,2012,Section 5311,CA-18-X052 | 0012000304,2016,Madera County,BCG0000286,Purchase Expansion <30ft Bus,111304,64BC17-00480,22925.0,113319.0,22655.51,269.49,0,Open,,,


In [15]:
# Copy and paste the irrelevant ones into this list below
unwanted_columns = ['grant_number', 
     'upin', 'description', 'ali', 'contract_number',
       'allocationamount', 'encumbered_amount', 'expendedamount',
       'activebalance', 'closedoutbalance', 'project_closed_by', 'project_closed_date', 'project_closed_time']

In [16]:
# Drop them - assign to a new dataframe if you wish
df2 = df2.drop(columns=unwanted_columns)

In [17]:
# Check out your hard work with 5 random rows. Is this what you want?
df2.sample(5)

Unnamed: 0,grant_fiscal_year,funding_program,project_year,organization_name,project_status
2489,2021,Section 5311,2021,County of Tulare,Open
404,2017,5310 Trad,2017,Inyo-Mono Association for the Handicapped,Open
625,2017,5310 Trad,2017,"Vivalon, Inc.",Open
665,2017,5339 (State),2017,City of Rio Vista,Open
545,2017,5310 Trad,2017,S.M.O.O.T.H.,Open


## Insights
* Now that you have a clean data frame, it's time to get some insights.


### Do these organizations overlap between 5310 and 5311?
* For Airtable
* Currently our dataframe contains both 5311 and 5310. 
* To compare the agencies, we need to break apart the dataframe so each grant will have its own dataframe.

In [18]:
"""
Keep only the years you want. Check the data type of the column you are filtering on. 
Perhaps years will need quotes because the column is an object, or maybe it's an integer, so 
no quotes are necessary.
"""
df3 = df2[df2["project_year"] > 2018]

In [19]:
"""
Keep only the rows whose funding program contains "5311". 
case=False ignores the case, so 'ac transit' and 'AC TRANSIT' would both match.
"""
df_5311 = df3[(df3.funding_program.str.contains("5311", case=False))]
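
One caveat worth sketching: `str.contains` treats its pattern as a regular expression by default, so a pattern with parentheses such as "5311(f)" would be read as a regex group rather than literal text. The example below uses a made-up Series (`programs`) and passes `regex=False` for the literal match.

```python
import pandas as pd

# Hypothetical program names, borrowed from the values seen in this notebook.
programs = pd.Series(["Section 5311", "Section 5311(f)", "5311(f) Round 2", "5310 Trad"])

# A plain substring such as "5311" has no special regex characters,
# so the default behavior is fine. case=False ignores upper/lower case.
any_5311 = programs.str.contains("5311", case=False)

# To match the literal text "5311(f)", turn off regex interpretation.
only_5311f = programs.str.contains("5311(f)", regex=False)

print(programs[any_5311].tolist())
print(programs[only_5311f].tolist())
```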

In [20]:
df_5311['funding_program'].value_counts()

Section 5311       416
Section 5311(f)    112
5311(f) Round 2     27
CMAQ (FTA 5311)     24
Name: funding_program, dtype: int64

In [21]:
# Check out the length, aka # of rows after filtering
len(df_5311)

579

In [22]:
# Repeat the same steps for 5310, and make sure to assign the result to a different dataframe

In [23]:
df_5310 = df3[(df3.funding_program.str.contains("5310", case=False))]

In [24]:
df_5310['funding_program'].value_counts()

5310 Trad    547
5310 Exp      88
Name: funding_program, dtype: int64

In [25]:
df_5339 = df3[(df3.funding_program.str.contains("5339", case=False))]

In [26]:
df_5339['funding_program'].value_counts()

5339 (State)       98
5339 (National)    30
Name: funding_program, dtype: int64

In [27]:
len(df3)

1342

In [28]:
len(df_5310) + len(df_5311) + len(df_5339)

1342

In [31]:
len(df3) == (len(df_5310) + len(df_5311) + len(df_5339))

True

#### Merge the split dataframes together
<img src= "download.jfif"> 

* Merging = joining two data sets on any columns they have in common.
* [Merge types & tutorials](https://www.datasciencemadesimple.com/join-merge-data-frames-pandas-python/)
* Merge agency_info table below and merge it with our df
* Now that you have 2 dataframes, compare them with an outer merge.
     * By using an outer merge, we can find out whether an agency appears in both 5311 and 5310, only in 5311, or only in 5310, through the indicator column that we turned on.
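
A minimal sketch of an outer merge with the indicator turned on, using two made-up agency lists. `agencies_5311`, `agencies_5310`, and the agency names are hypothetical; the left table stands in for 5311 and the right for 5310.

```python
import pandas as pd

# Hypothetical one-row-per-agency tables for each program.
agencies_5311 = pd.DataFrame({"organization_name": ["Agency A", "Agency B"]})
agencies_5310 = pd.DataFrame({"organization_name": ["Agency B", "Agency C"]})

# how="outer" keeps agencies from both tables.
# indicator=True adds a _merge column saying where each row came from.
overlap = pd.merge(
    agencies_5311,
    agencies_5310,
    how="outer",
    on="organization_name",
    indicator=True,
)

# Look up the indicator value for each agency.
source_by_agency = dict(
    zip(overlap["organization_name"], overlap["_merge"].astype(str))
)

# Agencies that appear in both programs.
in_both = overlap.loc[overlap["_merge"] == "both", "organization_name"].tolist()

print(overlap)
print(in_both)
```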
     

In [30]:
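"""
Save your merge results into a new dataframe called m1.
FYI: you can merge on more than one column! 
"""
# Keep one row per agency so the merge compares agencies, not individual projects.
# Merging on organization_name is an assumption based on the question being asked.
m1 = pd.merge(
    df_5311[["organization_name"]].drop_duplicates(),
    df_5310[["organization_name"]].drop_duplicates(),
    how="outer",
    on=["organization_name"],
    indicator=True,
)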

In [None]:
"""
Indicator values are both/left_only/right_only. You can 
change the values to something like 'both 5310 and 5311',
'5311 only', etc. Scroll back up to the 'duplicate values'
section to see how to change these values with a dictionary.
"""
# Create a new copy of column _merge

In [None]:
# Drop _merge column because we can't have two _merge columns in the same dataframe

In [None]:
# Map the values 
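
The three placeholder cells above could be filled in along these lines. This is a sketch on made-up data, assuming the left table is 5311 and the right table is 5310; `merged`, `labels`, and the new column name `overlap_5310_5311` are hypothetical.

```python
import pandas as pd

# Hypothetical one-row-per-agency tables (left = 5311, right = 5310).
left = pd.DataFrame({"organization_name": ["Agency A", "Agency B"]})
right = pd.DataFrame({"organization_name": ["Agency B", "Agency C"]})

merged = pd.merge(left, right, how="outer", on="organization_name", indicator=True)

# Readable labels for each indicator value.
labels = {
    "both": "both 5310 and 5311",
    "left_only": "5311 only",
    "right_only": "5310 only",
}

# Create a new column from _merge, converting to plain text before mapping.
merged["overlap_5310_5311"] = merged["_merge"].astype(str).map(labels)

# Drop _merge so a later merge with indicator=True can create its own _merge column.
merged = merged.drop(columns=["_merge"])

label_by_agency = dict(zip(merged["organization_name"], merged["overlap_5310_5311"]))
print(merged)
```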

### Repeat this step for 5339

* [Function from here](https://github.com/cal-itp/data-analyses/blob/main/Agreement_Overlap/add_dla.ipynb)

## Save your work
* You can save all your hard work in a single Excel workbook in our [Google Cloud Storage](https://console.cloud.google.com/storage/browser/calitp-analytics-data/data-analyses/grants;tab=objects?project=cal-itp-data-infra&prefix=&forceOnObjectsSortingFiltering=false).

In [None]:
# This will be saved to our GCS bucket.
with pd.ExcelWriter(
    "gs://calitp-analytics-data/data-analyses/grants/put your file name here.xlsx"
) as writer:
    # Give each sheet its own unique name.
    your_df1.to_excel(writer, sheet_name="your sheet name 1", index=False)
    your_df2.to_excel(writer, sheet_name="your sheet name 2", index=False)