# Exercise 1: Familiarize yourself with `pandas`
If you are new to Python, check out the introductory Python courses available through Caltrans's LinkedIn Learning Library:
* https://www.linkedin.com/learning/search?keywords=python&u=36029164

Skills: 
* `pandas` is one of the base Python packages for working with tabular data.
* F-strings
* Export to Google Cloud Storage
* Practice committing on GitHub

References: 
* https://docs.calitp.org/data-infra/analytics_new_analysts/01-data-analysis-intro.html
* https://docs.calitp.org/data-infra/analytics_tools/saving_code.html

## What are we working with today? 
* Today we will be working with the Caltrans System Investment Strategy (CSIS). Per this [description](https://dot.ca.gov/programs/transportation-planning/division-of-transportation-planning/corridor-and-system-planning/csis):
> The California Department of Transportation (Caltrans) is committed to leading climate action and advancing social equity in the transportation sector set forth by the California State Transportation Agency (CalSTA) Climate Action Plan for Transportation Infrastructure (CAPTI, 2021)...Caltrans is in a significant leadership role to carry out meaningful measures that advance state’s goals and priorities through the development and implementation of the Caltrans System Investment Strategy (CSIS). The CSIS, which implements one of CAPTI’s key actions, is envisioned to be an investment framework through a data and performance-driven approach that guides transportation investments and decisions.
* One way DDS is working on CSIS is by automating the scoring of projects using Python. We score each project on how well it does across various categories (also called metrics), such as Zero Emission Vehicles, Vehicle Miles Traveled, and more.
* While the values we are working with today are all <i>fake</i>, the exercise is based on actual datasets and assignments.

In [1]:

import pandas as pd

In [2]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

## Check out the data 
* Download the Excel workbook containing all the CSIS data from Google Cloud Storage [here](https://console.cloud.google.com/storage/browser/_details/calitp-analytics-data/data-analyses/starter_kit/starter_kit_csis_scoring_workbook.xlsx;tab=live_object?project=cal-itp-data-infra). 
    * Open it up in Excel and take a look.
### Read in the data
* We are reading our Excel workbook into a `pandas` dataframe.
* While there is a very [technical definition](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) of what a dataframe is, you can think of it as an Excel sheet that holds your data. A pandas dataframe simply lets you clean the data programmatically.
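
To make the spreadsheet analogy concrete, here is a minimal sketch that builds a small dataframe by hand. The column names echo the workbook, but `example_df` and all of its values are invented for illustration and are not part of the CSIS data.

```python
import pandas as pd

# Hypothetical rows, made up for illustration only.
example_df = pd.DataFrame(
    {
        "ct_district": [3, 7, 11],
        "project_name": ["Project A", "Project B", "Project C"],
    }
)

# Each dictionary key becomes a column; each list element becomes a row.
print(example_df)
```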

In [3]:
url = "gs://calitp-analytics-data/data-analyses/starter_kit/starter_kit_csis_scoring_workbook.xlsx"

In [4]:
df = pd.read_excel(url)

### Previewing Data 
* Often, you want to get a sneak preview of your data. 
* Thankfully, `pandas` provides many methods for you to do so. 
* Below are a couple of very common methods we use. 
    * `.head()` shows the first five rows, while `.tail()` shows the last five.
    * `.sample()` shows you a random row.
    * Want to see more or fewer than five rows? Specify the number in the parentheses: `.head(10)`.
* Try everything yourself below.
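
The preview methods above can be sketched on a toy dataframe. `preview_df` is a hypothetical stand-in with 20 numbered rows, so it is easy to see which rows each method returns; swap in `df` to try the same calls on the workbook.

```python
import pandas as pd

# Hypothetical data: 20 rows numbered 0 through 19.
preview_df = pd.DataFrame({"row_number": range(20)})

first_five = preview_df.head()     # default is 5 rows
last_three = preview_df.tail(3)    # pass a number to change how many rows you get
random_two = preview_df.sample(2, random_state=0)  # random_state makes the draw repeatable

print(first_five)
print(last_three)
print(random_two)
```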

### Reviewing the Data - More Methods!
* `df.shape` gives you the number of rows and columns in your dataset.
* `df.columns` returns all of the column names.
* `df.info()` per the [pandas docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html#pandas.DataFrame.info) <i>prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.</i>
* Experiment below. 
* More food for thought:
    * `Dtype` is critical. There are integers, objects, booleans, floats...
    * Does the `dtype` of each column below make sense to you? 
    * The `object` dtype is a catchall; it usually means the column holds strings or a mix of types.
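
As a rough illustration of these attributes, the sketch below uses a hypothetical dataframe, `dtype_df`, with one column for each common kind of value. The names and numbers are made up.

```python
import pandas as pd

# Hypothetical data with one column for each common kind of value.
dtype_df = pd.DataFrame(
    {
        "district": [1, 2, 3],               # whole numbers -> int64
        "cost_millions": [1.5, 2.25, 3.0],   # decimals -> float64
        "is_funded": [True, False, True],    # True/False -> bool
        "project_name": ["A", "B", "C"],     # text
    }
)

n_rows, n_cols = dtype_df.shape  # shape is an attribute, so no parentheses
print(n_rows, n_cols)
print(list(dtype_df.columns))
print(dtype_df.dtypes)           # one dtype per column
```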

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   ct_district    29 non-null     int64 
 1   project_name   29 non-null     object
 2   Scope of Work  29 non-null     object
dtypes: int64(1), object(2)
memory usage: 824.0+ bytes


### Deeper Dive
* We now know a good amount about our dataset, but the number of rows and columns only tells us so much. 
* Let's take a look at each column.

* `.value_counts()` helps you see how many times the same value appears. 

In [6]:
df.ct_district.value_counts()

9     5
10    3
8     3
3     3
1     3
12    3
4     3
2     2
5     1
7     1
11    1
6     1
Name: ct_district, dtype: int64

* `.nunique()` displays the number of distinct values in your column
    * This is particularly useful because there are many times when the number of unique values in a column should match the number of rows in your dataset <b>exactly</b>.
    * In our case, our dataframe has 29 rows and we should have 29 unique project names and scope of work descriptions.

In [7]:
df.project_name.nunique()

29

In [8]:
# Notice that when your column name contains spaces,
# you need to refer to the column using brackets [].
df["Scope of Work"].nunique()

29
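
When the number of unique values does not match the number of rows, the next question is which rows repeat. The sketch below shows one way to check, using a hypothetical dataframe (`check_df`) that contains a deliberate duplicate.

```python
import pandas as pd

# Hypothetical data: "Project B" appears twice on purpose.
check_df = pd.DataFrame({"project_name": ["Project A", "Project B", "Project B"]})

n_rows = len(check_df)
n_unique = check_df["project_name"].nunique()
print(n_rows, n_unique)

# When the two numbers differ, duplicated(keep=False) flags every repeated row.
repeats = check_df[check_df["project_name"].duplicated(keep=False)]
print(repeats)
```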

## Something missing? 
* Open up our dataset using Excel. 
* Take a look at the sheets: how many are there in the Excel workbook? 
* Which sheet is loaded into `df` above? 

### Lists

In [9]:
# Enter in all the sheets you are interested in loading into Python.
# By the way, they always need to be strings.
my_sheets = ["projects_auto",
            "overall_score"]

In [10]:
len(my_sheets)

2

In [11]:
# Index
my_sheets[0]

'projects_auto'

In [12]:
my_sheets[1]

'overall_score'

In [13]:
# Passing a list of sheet names returns a dictionary of dataframes, keyed by sheet name.
df2 = pd.read_excel(
    url,
    sheet_name=my_sheets,
)

### Specificity is beautiful.
* Pull each sheet into its own dataframe using `df2.get(my_sheets[enter in the number])`. 
* Make sure your dataframe has a descriptive name.
* A name like `df` is not very telling. 

In [14]:
projects_df = df2.get(my_sheets[0])

In [15]:
scores_df = df2.get(my_sheets[1])
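
To see why `.get()` works here, it helps to remember that `df2` is a plain Python dictionary. The sketch below builds a stand-in dictionary by hand (`fake_workbook`, with made-up contents) so it runs without access to the workbook.

```python
import pandas as pd

# Stand-in for what pd.read_excel(url, sheet_name=my_sheets) returns:
# a dictionary mapping each sheet name to a dataframe. The contents are made up.
my_sheets = ["projects_auto", "overall_score"]
fake_workbook = {
    "projects_auto": pd.DataFrame({"project_name": ["Project A"]}),
    "overall_score": pd.DataFrame({"project_name": ["Project A"], "zev_score": [4]}),
}

projects_example = fake_workbook.get(my_sheets[0])  # same as fake_workbook["projects_auto"]
scores_example = fake_workbook[my_sheets[1]]

# .get() returns None for a missing key, while square brackets raise a KeyError.
missing = fake_workbook.get("not_a_sheet")
print(list(fake_workbook.keys()), missing)
```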

## Add a new column
* Oops! We analysts were so wrapped up in scoring that we forgot to total up all the metrics to find the overall score for each project. 
* Do so and place your results in a column called `overall_score`
* There are a couple of ways to do this.
* More food for thought:
    * What does `axis = 1` mean?
    * What happens if you do `.sum(axis=0)`?
    * Try everything once.
    * You don't always have to save everything into a dataframe. You can do something like `df.sum(axis=0)` just to see what happens. 
        * Just make sure your dataframe isn't too large or else you will run out of memory!
    * What happens when you create a new column with `scores_df.overall_score`? 
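
The difference between the two axes is easiest to see on a tiny example. This is a sketch with hypothetical scores (`toy_scores`); the column names mimic the workbook but the values are made up.

```python
import pandas as pd

# Hypothetical scores, invented for illustration.
toy_scores = pd.DataFrame(
    {
        "project_name": ["Project A", "Project B"],
        "safety_score": [9, 8],
        "vmt_score": [2, 5],
    }
)
numeric_only = toy_scores.select_dtypes(include="number")

row_totals = numeric_only.sum(axis=1)     # one total per project (adds across columns)
column_totals = numeric_only.sum(axis=0)  # one total per metric (adds down the rows)

# Bracket assignment creates a real column.
toy_scores["overall_score"] = row_totals
print(toy_scores)
print(column_totals)
```

Assigning with dot notation to a name that is not yet a column (for example `toy_scores.new_total = row_totals`) sets a plain Python attribute instead of adding a column, and pandas warns about it, which is why the bracket form is used here.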

In [16]:
scores_df["overall_score"] = scores_df.select_dtypes(include=['int64', 'float64']).sum(axis=1)

## Subsetting
* Your manager asks for the `overall_score` for each project. They do not want to see the other metrics, only the project's name and its total score.
* Subset the dataframe and <b>save</b> it into a new dataframe.
* There are many ways to do the same thing in Python. 
     * The best way is usually the one with the least amount of text and code.
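
As a small illustration of "many ways to do the same thing", the sketch below subsets a hypothetical dataframe (`toy`) by keeping columns and by dropping columns, then confirms that the two results match.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
toy = pd.DataFrame(
    {
        "project_name": ["Project A", "Project B"],
        "safety_score": [9, 8],
        "overall_score": [68, 82],
    }
)

# Option 1: list the columns to keep.
kept = toy[["project_name", "overall_score"]]

# Option 2: list the columns to drop.
dropped = toy.drop(columns=["safety_score"])

# Neither option changes `toy`; the result is only saved because it is assigned to a new name.
print(kept.equals(dropped))
```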

In [17]:
# Enter in the columns you want to keep
columns_to_keep = ["overall_score", "project_name"]

In [18]:
# Enter in the columns you want to drop
columns_to_drop = []

In [19]:

scores_df.drop(columns = columns_to_drop)

|   | project_name | accessibility_score | dac_accessibility_score | dac_traffic_impacts_score | freight_efficiency_score | freight_sustainability_score | mode_shift_score | lu_natural_resources_score | safety_score | vmt_score | zev_score | public_engagement_score | climate_resilience_score | program_fit_score | overall_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Meadow Magic Multi-Use Path | 10 | 3 | 4 | 8 | 3 | 6 | 10 | 9 | 2 | 4 | 5 | 2 | 2 | 68 |
| 1 | Bunny Hop Bike Boulevard | 8 | 9 | 5 | 8 | 7 | 8 | 10 | 8 | 5 | 1 | 1 | 3 | 9 | 82 |
| 2 | Strawberry Shortcake Sidewalks | 1 | 3 | 1 | 10 | 5 | 10 | 3 | 7 | 4 | 3 | 3 | 2 | 3 | 55 |
| 3 | River Ramble Rabbit Trail | 4 | 2 | 9 | 9 | 9 | 10 | 9 | 4 | 7 | 1 | 3 | 5 | 2 | 74 |
| 4 | Lilac Lane Dream Complete Street | 10 | 10 | 9 | 4 | 9 | 10 | 7 | 2 | 1 | 7 | 1 | 3 | 3 | 76 |
| 5 | Unicorn Expressway | 10 | 4 | 5 | 9 | 4 | 5 | 9 | 8 | 5 | 4 | 9 | 3 | 6 | 81 |
| 6 | Sunflower Gables Intermodal Facility | 2 | 8 | 9 | 10 | 1 | 6 | 5 | 3 | 9 | 9 | 1 | 7 | 4 | 74 |
| 7 | Seaside Strawberry Port Revitalization | 2 | 7 | 1 | 5 | 7 | 3 | 7 | 6 | 4 | 1 | 5 | 2 | 8 | 58 |
| 8 | Countryside Clover Rail Connector | 3 | 3 | 1 | 9 | 7 | 5 | 7 | 5 | 8 | 1 | 7 | 3 | 10 | 69 |
| 9 | Tranquil Truck Trot | 4 | 5 | 5 | 8 | 8 | 1 | 5 | 2 | 7 | 4 | 6 | 8 | 10 | 73 |


In [20]:
scores_df[columns_to_keep]

|   | overall_score | project_name |
|---|---|---|
| 0 | 68 | Meadow Magic Multi-Use Path |
| 1 | 82 | Bunny Hop Bike Boulevard |
| 2 | 55 | Strawberry Shortcake Sidewalks |
| 3 | 74 | River Ramble Rabbit Trail |
| 4 | 76 | Lilac Lane Dream Complete Street |
| 5 | 81 | Unicorn Expressway |
| 6 | 74 | Sunflower Gables Intermodal Facility |
| 7 | 58 | Seaside Strawberry Port Revitalization |
| 8 | 69 | Countryside Clover Rail Connector |
| 9 | 73 | Tranquil Truck Trot |


## Export to Google Cloud Storage (GCS)
* Our original Excel workbook's file path is `"gs://calitp-analytics-data/data-analyses/starter_kit/starter_kit_csis_scoring_workbook.xlsx"`
* Save your subsetted dataframe from above back into the `starter_kit` folder. 
* Sure, you could type out `"gs://calitp-analytics-data/data-analyses/starter_kit/aggregated_csis.xlsx"` in full, but that is an eyesore.
* Essentially, the only difference between these two file paths is the file name (`aggregated_csis.xlsx` versus `starter_kit_csis_scoring_workbook.xlsx`); the folder `gs://calitp-analytics-data/data-analyses/starter_kit/` stays the same. 
* This is where f-strings come in. What are f-strings? 
> Python f-strings provide a quick way to interpolate and format strings. They’re readable, concise, and less prone to error than traditional string interpolation and formatting tools...
    * Read more about them [here](https://realpython.com/python-f-strings/#f-strings-a-new-and-improved-way-to-format-strings-in-python).
* <b> Let's practice </b>!

In [21]:
# My file_path is always going to be `gs://calitp-analytics-data/data-analyses/starter_kit/`.
GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/starter_kit/"

In [22]:
# However my file is going to change.
# I want to give my subsetted dataframe a descriptive name and save it as an Excel workbook.
FILE = "starter_kit_example_final_scores.xlsx"

In [23]:
# Put them together using an f-string
f"{GCS_FILE_PATH}{FILE}"

'gs://calitp-analytics-data/data-analyses/starter_kit/starter_kit_example_final_scores.xlsx'

In [24]:
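# Export the subsetted dataframe to the path built with the f-string.
# What if I wanted to read back the original file? The same f-string pattern works with pd.read_excel().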
scores_df[["project_name","overall_score"]].to_excel(f"{GCS_FILE_PATH}{FILE}")
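
The payoff of keeping the folder in one variable is that any file in it is one f-string away. A minimal sketch, where `original_file`, `new_file`, `original_path`, and `new_path` are illustrative variable names rather than anything the notebook defines:

```python
GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/starter_kit/"

original_file = "starter_kit_csis_scoring_workbook.xlsx"
new_file = "starter_kit_example_final_scores.xlsx"

# The folder is written once and reused for every file.
original_path = f"{GCS_FILE_PATH}{original_file}"
new_path = f"{GCS_FILE_PATH}{new_file}"

print(original_path)
print(new_path)
# original_path could then be passed to pd.read_excel(), and new_path to .to_excel().
```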

* Export your entire dataframe with the new `overall_score` column using `df.to_parquet()`. 
    * We typically prefer saving to parquet files, and you can read why [here](https://docs.calitp.org/data-infra/analytics_new_analysts/03-data-management.html#parquet).
* Export your subsetted dataframe with only the `overall_score` and `project_name` columns using `df.to_excel()`. 
    * Open up your new Excel workbook and see if it's what you expect.
    * Hint: you will probably get a very annoying extra column! 
    * Try out some of the arguments [listed](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html#pandas.DataFrame.to_excel).
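
The extra column is most likely the dataframe's index, which pandas writes out by default. The sketch below demonstrates the effect of `index=False` using `to_csv` and an in-memory buffer so that it runs without Excel or GCS access; `to_excel` accepts the same `index` argument. The `subset` dataframe is a made-up stand-in.

```python
import io

import pandas as pd

# Hypothetical one-row subset, invented for illustration.
subset = pd.DataFrame({"project_name": ["Project A"], "overall_score": [68]})

with_index = io.StringIO()
subset.to_csv(with_index)                  # default index=True writes the row labels

without_index = io.StringIO()
subset.to_csv(without_index, index=False)  # index=False leaves the row labels out

print(with_index.getvalue())
print(without_index.getvalue())
```

The first printout starts with an unnamed column holding `0`; the second contains only the two named columns.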

In [25]:
scores_df.to_parquet(f"{GCS_FILE_PATH}starter_kit_example_final_scores.parquet")

## You're almost done!
* Name this notebook `YOURNAME_exercise1.ipynb`
    * If you already saved the notebook under a different name, rename it from the terminal.
    * `git mv OLDNAME.ipynb NEWNAME.ipynb`. 
    * The `mv` stands for move, and renaming a file is basically "moving" its path. Doing it this way stages the rename for you, which keeps the notebook's git history easy to follow. If you rename with right click, rename, Git sees a deleted file plus a new untracked file until you stage both changes.
* Use a descriptive commit message (e.g., `adding chart`). GitHub already tracks who made the commit, the date and time, and the files affected, so your commit message should say more than the metadata already stored.