# 5311 and 5310 Applicants
* [Research Request](https://github.com/cal-itp/data-analyses/issues/333)

In [None]:
# Packages to import
# Pandas is the full name of the package but call it pd for short.
import pandas as pd
from calitp import *

# You only need to import these if you want to use something from the warehouse
from calitp.tables import tbl
from calitp import query_sql
from siuba import *
import calitp.magics

# Formatting the notebook
# The max columns to display will be 100
pd.options.display.max_columns = 100

# This will allow you to print all the rows in your data
pd.set_option("display.max_rows", None)

# This will prevent columns from being truncated
pd.set_option("display.max_colwidth", None)

## Load the Excel Sheet
* You can read a specific sheet from the original Excel workbook by passing its name to `sheet_name`. 
* Save your sheet as a Pandas dataframe - it can be called anything, but usually it's <i>something_df</i>. 
    * Dataframe = basically just a table of data. 
    * If you want to open multiple sheets, assign each one to its own object with its own name. 
* "to_snakecase" changes the column names to all lowercase and replaces any spaces with underscores.
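As a rough illustration of what snake-casing column names means, here is a minimal sketch in plain pandas. It is not the `calitp` implementation of `to_snakecase`, and the column names below are made up for the example.

```python
import pandas as pd

# Hypothetical column names standing in for the workbook's real headers.
raw = pd.DataFrame({"Grant Program": ["5311"], "Project Year": [2020]})

# Trim stray spaces, lowercase every column name, and swap spaces for underscores.
raw.columns = (
    raw.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
)

print(list(raw.columns))
```

After this step you can refer to columns with names such as `raw["grant_program"]`, which are easier to type than names with spaces and capitals.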

In [None]:
#df = to_snakecase(
#    pd.read_excel("gs://calitp-analytics-data/data-analyses/grants/Grant+Projects_7_30_2022.xlsx", sheet_name="Grant Projects")
#)

df = pd.read_excel("./Grant+Projects_7_30_2022.xlsx")

In [None]:
df.to_excel("./Grant+Projects_7_30_2022.xlsx", index=False)

## Explore the data 
* Let's check out our data by answering questions such as
    * How many columns and rows does it have? 
    * How many missing values are there? 
    * What are the mean/median? 
* Any time you want to do something to your data, chain the function after the object.
    * In Excel, you'd do SUM(column you want)
    * In Pandas, you'd do df['column you want'].sum()
* [Resource](https://pandas.pydata.org/docs/user_guide/basics.html)    
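The questions above can be sketched on a tiny, made-up dataframe. The column names `agency` and `amount` are placeholders, not columns from the grants workbook.

```python
import pandas as pd

# Made-up data: one amount is missing on purpose.
sample = pd.DataFrame(
    {
        "agency": ["A", "B", "C", "D"],
        "amount": [100.0, 250.0, None, 50.0],
    }
)

n_rows, n_cols = sample.shape      # how many rows and columns
missing = sample.isna().sum()      # missing values in each column
total = sample["amount"].sum()     # like SUM() in Excel; missing values are skipped
average = sample["amount"].mean()  # mean of the non-missing values
middle = sample["amount"].median() # median of the non-missing values

print(n_rows, n_cols)
print(missing)
print(total, average, middle)
```

Each calculation follows the same pattern: pick the column, then chain the function after it.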

In [None]:
# Check out the first five rows
# Any line with a pound symbol in front is a comment and won't be executed
df.head()

In [None]:
# Check out the last five rows
df.tail()

In [None]:
# Check out how many rows and columns, # of non-null values in each column, and the data type of each column
df.info()

In [None]:
# The data spans 2011 to 2022. Check out how many projects were funded by year.
df["column 1"].value_counts()

In [None]:
# Not sure what a function does: use help
help(sum)

In [None]:
# Get some basic stats
df.describe()

## Clean up
* [Tutorial](https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html)

### Data type is important. 
* If you have a column of monetary values presented as $139,293.92 and you want to find the mean, this won't work. 
* This column is considered an "object" column due to the dollar sign and comma - same way as if you typed "caltrans".
    * You'll have to make sure it's a number: an integer, or a float if the values have cents.
* Based on df.info(), clean up other columns that aren't the right data type.
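Here is a minimal sketch of that conversion on made-up values. The column name `award` is a placeholder, and the example converts to float because the sample values include cents.

```python
import pandas as pd

# Made-up monetary values stored as text.
money = pd.DataFrame({"award": ["$139,293.92", "$1,000", "$25.50"]})

# regex=False treats "$" as a literal character rather than the
# regular-expression "end of string" anchor.
money["award"] = (
    money["award"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

# Now that the column is numeric, the mean can be calculated.
print(money["award"].dtype)
print(money["award"].mean())
```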

In [None]:
"""
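If there are columns that SHOULD be integers but aren't, put them in the list
in the for loop below. The loop strips $ and commas from each column,
then changes the column's data type to int.
"""
for c in ["column_one", "column_two", "column_three"]:
    # regex=False makes "$" a literal dollar sign instead of a regex anchor.
    # Use .astype(float) instead if the values contain cents.
    df[c] = (
        df[c].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(int)
    )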

### Beware of duplicate values
* Grants data might be manually entered by multiple people. As such, values can be inconsistent. 
* BART, Bay Area Rapid Transit, and Bay Area Rapid Transit (BART) are all the same agency. 
* However, if you are counting the number of unique agencies, these would be counted as 3 different agencies, which is inaccurate.
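The BART case can be sketched as follows. The data is made up, and the dictionary simply maps each variant spelling to the one name you decide to keep.

```python
import pandas as pd

# Made-up agency names: three spellings of the same agency plus one other agency.
agencies = pd.DataFrame(
    {
        "agency": [
            "BART",
            "Bay Area Rapid Transit",
            "Bay Area Rapid Transit (BART)",
            "AC Transit",
        ]
    }
)

before = agencies["agency"].nunique()  # counts every spelling separately

# Map each variant spelling to a single name.
agencies["agency"] = agencies["agency"].replace(
    {
        "Bay Area Rapid Transit": "BART",
        "Bay Area Rapid Transit (BART)": "BART",
    }
)

after = agencies["agency"].nunique()   # counts actual agencies
print(before, after)
```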


In [None]:
# Check out your agencies and see if there are any duplicates by
# sorting your column of agencies from A-Z and seeing only unique ones
df["column"].sort_values().unique()

In [None]:
# Check out the total number of unique values
df["column"].nunique()

In [None]:
"""
If there are duplicate values, you can replace them with an existing one using a dictionary.
If all the agencies are only listed once, this cell is irrelevant: go up to the top where
it says "code" and change it to "markdown". 
You can also move the closing three quotation marks to the bottom of this cell to comment out the code.
"""
df["column"] = df["column"].replace(
    {"old value 1": "correct value 1", "old value 2": "correct value 2"}
)

## Filter what you want
* You don't necessarily want all the years, all the programs, etc. 
* Keep only the rows and columns you are interested in.

### Grants you want

In [None]:
"""
Create a list that contains the grants you are interested in. 
A list is great because you can go in and delete/add items. 
Line below makes it easy to grab the values.
"""
df["column 1"].unique()

In [None]:
# Paste whatever values you want between the brackets.
# The values need to be in quotes.
grants_wanted = []

In [None]:
"""
Keep only the grants in my list and create a NEW variable.
It's best to create new variables when you make changes, so you can always reference
the original variable. 
"""
df2 = df[df["column"].isin(grants_wanted)]
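To see this kind of list-based filter on concrete values, here is a small sketch with made-up data. The column names `grant` and `agency` are placeholders.

```python
import pandas as pd

# Made-up projects across three grant programs.
projects = pd.DataFrame(
    {
        "grant": ["5311", "5310", "5339", "5311"],
        "agency": ["A", "B", "C", "D"],
    }
)

grants_wanted = ["5311", "5310"]

# isin takes the list itself. The original dataframe is left untouched
# because the result is assigned to a new variable.
kept = projects[projects["grant"].isin(grants_wanted)]

print(len(projects), len(kept))
```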

### Columns you want
* Drop irrelevant columns 

In [None]:
# List out all your columns
df2.columns

In [None]:
# Copy and paste the irrelevant ones into this list below
unwanted_columns = ["column 1", "column 2", "column 3"]

In [None]:
# Drop them - assign to a new dataframe if you wish
df2 = df2.drop(columns=unwanted_columns)

In [None]:
# Check out your hard work with 5 random rows. Is this what you want?
df2.sample(5)

## Insights
* Now that you have a clean data frame, it's time to get some insights.


### Do these organizations overlap between 5310 and 5311?
* For Airtable
* Currently our dataframe contains both 5311 and 5310. 
* To compare the agencies, we need to break apart the dataframe so each grant will have its own dataframe.
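Splitting one dataframe into one dataframe per grant could look like the sketch below. The data and the `program` column are made up, and `na=False` is an extra assumption added here so that rows with a missing program name count as "no match".

```python
import pandas as pd

# Made-up program names with inconsistent wording and one missing value.
grants = pd.DataFrame(
    {
        "program": ["Section 5311", "5310 Enhanced Mobility", "section 5311(f)", None],
        "agency": ["A", "B", "C", "D"],
    }
)

# Keep rows whose program name contains the grant number.
only_5311 = grants[grants["program"].str.contains("5311", case=False, na=False)]
only_5310 = grants[grants["program"].str.contains("5310", case=False, na=False)]

print(len(only_5311), len(only_5310))
```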

In [None]:
"""
Filter by year. Check the data type of the column you are filtering on. 
Perhaps years will need quotes because it's an object or maybe it's an integer, so 
no quotes are necessary.
"""
df3 = df2[df2["column"] > year_you_want]

In [None]:
"""
Keep only the 5311 rows. 
case=False ignores capitalization: in a text search, 'ac transit' and 'AC TRANSIT' would both match.
"""
df_5311 = df3[df3["column"].str.contains("5311", case=False)]

In [None]:
# Check out the length, aka # of rows after filtering
len(df_5311)

In [None]:
# Repeat the same steps for 5310, and make sure to assign the result to a different dataframe

#### Merge the two split dataframes together
* Merging = joining two data sets on any columns they have in common.
* [Merge types & tutorials](https://www.datasciencemadesimple.com/join-merge-data-frames-pandas-python/)
* Now that you have 2 dataframes, compare them with an outer merge.
     * With an outer merge, the indicator column that we turn on (`indicator=True`) shows whether an agency appears in both 5311 and 5310, only in 5311, or only in 5310.
* Further below, we also query the agency_info table from the warehouse and merge it with our df.
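A minimal sketch of an outer merge with the indicator turned on, using made-up agencies. The label text in the dictionary is only a suggestion. pandas names the indicator column `_merge` and fills it with `both`, `left_only`, or `right_only`.

```python
import pandas as pd

# Made-up agency lists: B and C applied to both grants.
df_5311 = pd.DataFrame({"agency": ["A", "B", "C"]})
df_5310 = pd.DataFrame({"agency": ["B", "C", "D"]})

# The left dataframe is 5311 and the right dataframe is 5310.
overlap = df_5311.merge(df_5310, how="outer", on="agency", indicator=True)

# Swap the default indicator values for more readable labels.
labels = {
    "both": "both 5311 and 5310",
    "left_only": "5311 only",
    "right_only": "5310 only",
}
overlap["grant_overlap"] = overlap["_merge"].astype(str).map(labels)

counts = overlap["grant_overlap"].value_counts()
print(counts)
```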

In [None]:
"""
Save your merge results into a new dataframe called m2.
FYI: you can merge on more than one column! 
"""
m2 = your_df1.merge(
    your_df2,
    how="outer",
    on=["ALL the columns they have in common"],
    indicator=True,
)

In [None]:
"""
Indicator values are both/left_only/right_only. You can 
change the values to something like 'both 5310 and 5311',
'5311 only', etc. Scroll back up to the 'duplicate values'
section to change these values with a dictionary.
"""

In [None]:
"""
Query agency info from our warehouse
"""
agency_info = (
    tbl.gtfs_schedule.agency() 
    >> collect()
    >> distinct()
)

In [None]:
# Merge the warehouse agency info with our filtered grants data
m1 = agency_info.merge(
    df3,
    how="outer",
    on=["ALL the columns they have in common"],
    indicator=True,
)

In [None]:
"""
Once you are happy with your analysis, assign it to a 
new variable such as agg1 = df3.groupby().
When you don't assign something to a variable, the results
aren't saved.
"""
m1.groupby(["column 1", "column 2"]).agg({"column 3": "mean/nunique/count/sum/etc"})
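As a concrete version of the groupby pattern above, here is a sketch on made-up data. The column names and the choice of `sum` and `nunique` are assumptions for the example.

```python
import pandas as pd

# Made-up awards across two years and two grants.
awards = pd.DataFrame(
    {
        "year": [2020, 2020, 2021, 2021],
        "grant": ["5311", "5311", "5311", "5310"],
        "agency": ["A", "B", "A", "C"],
        "amount": [100, 300, 50, 75],
    }
)

# Assign the result to a variable so it is saved.
agg1 = (
    awards.groupby(["year", "grant"])
    .agg({"amount": "sum", "agency": "nunique"})
    .reset_index()
)

print(agg1)
```

Each key in the dictionary is a column to summarize, and each value is the single calculation to apply to it.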

## Save your work
* You can save all your hard work in a single Excel workbook in our [Google Cloud Storage](https://console.cloud.google.com/storage/browser/calitp-analytics-data/data-analyses/grants;tab=objects?project=cal-itp-data-infra&prefix=&forceOnObjectsSortingFiltering=false).

In [None]:
# This will be saved to our GCS bucket.
with pd.ExcelWriter(
    "gs://calitp-analytics-data/data-analyses/grants/put your file name here.xlsx"
) as writer:
    # Each sheet needs its own unique name
    your_df1.to_excel(writer, sheet_name="sheet name 1", index=False)
    your_df2.to_excel(writer, sheet_name="sheet name 2", index=False)