# Exercise 1: Familiarize yourself with `pandas` and `python`
If you are new to Python, there are many resources!
* There are introductory Python courses available through [Caltrans's LinkedIn Learning Library](https://www.linkedin.com/learning/search?keywords=python&u=36029164).
* If videos aren't for you, [Practical Python for Data Science](https://www.practicalpythonfordatascience.com/00_python_crash_course) is an incredibly helpful book.

## Skills 
* `pandas` is one of the base Python packages for working with tabular data.
* F-strings
* Export to Google Cloud Storage
* Practice committing on GitHub

## How to use the tutorials
* The tutorials are divided by skills/concepts we are going to learn.
* There are hints and instructions on the top.
* There are links to references and it is highly recommended to read through them and practice them in this notebook, in addition to these exercises. 

## What are we working with today? 
* Today we will be working on the Caltrans System Investment Strategy (CSIS). Per this [description](https://dot.ca.gov/programs/transportation-planning/division-of-transportation-planning/corridor-and-system-planning/csis):
> <i>The California Department of Transportation (Caltrans) is committed to leading climate action and advancing social equity in the transportation sector set forth by the California State Transportation Agency (CalSTA) Climate Action Plan for Transportation Infrastructure (CAPTI, 2021)...Caltrans is in a significant leadership role to carry out meaningful measures that advance state’s goals and priorities through the development and implementation of the Caltrans System Investment Strategy (CSIS). The CSIS, which implements one of CAPTI’s key actions, is envisioned to be an investment framework through a data and performance-driven approach that guides transportation investments and decisions.</i>
* DDS is working on CSIS by automating the scoring of projects using Python. We score each project based on how well it does in various categories, also known as metrics, such as Zero Emission Vehicles, Vehicle Miles Traveled, and more. 
* While the values we are working with today are all <i>fake</i>, the exercise is based on actual datasets and assignments. 

## Import Pandas
* You are importing the package `pandas` that is the backbone of all data analysis work. 
* You can import countless packages. `numpy` and `geopandas` are also popular. 

In [1]:

import pandas as pd

In [2]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

## Jupyter Notebook
* You're using a Jupyter Notebook right now.
* Take some time to get used to this interface. 

## Check out the data 
* Download the Excel workbook containing all the CSIS data from Google Cloud Storage [here](https://console.cloud.google.com/storage/browser/_details/calitp-analytics-data/data-analyses/starter_kit/starter_kit_csis_scoring_workbook.xlsx;tab=live_object?project=cal-itp-data-infra). 
    * Open it up in Excel and take a look.
### Read in the data
* We are reading our Excel Workbook into a Pandas dataframe.
* While there is a very [technical definition](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) of what a dataframe is, you can think of it as an Excel sheet that holds your data. 
* <b> Resource</b>: [This page of the Practical Python for Data Science](https://www.practicalpythonfordatascience.com/02_loading_data)

In [3]:
url = "gs://calitp-analytics-data/data-analyses/starter_kit/starter_kit_csis_scoring_workbook.xlsx"

In [4]:
df = pd.read_excel(url)

### Previewing Data 
* Often, you want to get a sneak preview of your data. 
* Thankfully, `pandas` provides many methods for you to do so. 
* Below are a couple of very common methods we use. 
    * `.head()` shows the first five rows, while `.tail()` shows the last five.
    * `.sample()` shows you a random row.
    * Want to see more or fewer than five rows? Specify the number in the parentheses: `.head(10)` shows the first 10 rows and `.head(2)` shows the first 2.
* Try everything yourself below.
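
If you want to see these methods side by side before trying them on the CSIS data, here is a minimal sketch on a small, made-up dataframe. The name `toy_df` and all of its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data, not the CSIS workbook
toy_df = pd.DataFrame(
    {
        "project_name": [f"Project {i}" for i in range(1, 9)],
        "project_cost": [100, 250, 75, 300, 120, 90, 410, 60],
    }
)

first_five = toy_df.head()        # first 5 rows by default
last_two = toy_df.tail(2)         # last 2 rows
random_three = toy_df.sample(3)   # 3 random rows, different on each run

print(first_five)
print(last_two)
print(random_three)
```

Because `.sample()` is random, rerunning the cell will usually show different rows.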

### More Methods!
* `df.shape` gives you the number of rows and columns in your dataset.
* `df.columns` returns all of the column names.
* `df.info()` per the [pandas docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html#pandas.DataFrame.info) <i>prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.</i>
* Experiment below. 
* More food for thought:
    * `Dtype` is critical. There are integers, objects, booleans, floats...
    * Does the `dtype` of each column below make sense to you? 
    * The `dtype` of object is a catchall term.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44 entries, 0 to 43
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   ct_district    44 non-null     int64 
 1   project_name   44 non-null     object
 2   Scope of Work  44 non-null     object
 3   Project Cost   44 non-null     int64 
dtypes: int64(2), object(2)
memory usage: 1.5+ KB
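
To see how different kinds of values map to different dtypes, here is a small sketch on a made-up dataframe. The dataframe `toy_df` and its columns (`is_funded`, `score`) are hypothetical and are not part of the CSIS workbook.

```python
import pandas as pd

# Hypothetical toy data with one column per kind of value
toy_df = pd.DataFrame(
    {
        "ct_district": [3, 7, 11],                                    # whole numbers
        "project_name": ["Bike lane", "Bus shelter", "Rail upgrade"], # text
        "is_funded": [True, False, True],                             # booleans
        "score": [7.5, 3.0, 9.25],                                    # decimals
    }
)

print(toy_df.shape)          # (number of rows, number of columns)
print(list(toy_df.columns))  # column names
print(toy_df.dtypes)         # one dtype per column

# A district number is really a label, not a quantity, so you may
# prefer to store it as text. .astype() converts a column's dtype.
toy_df["ct_district"] = toy_df["ct_district"].astype(str)
print(toy_df.dtypes)
```

Whether a district should be stored as a number or as text is a judgment call. The point is to check that each dtype matches how you plan to use the column.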


### Deeper Dive
* We now know a good amount about our dataset, but the # of rows and columns are not always so thrilling. 
* Let's take a look at each column.

* `.value_counts()` helps you see how many times the same value appears. 

In [6]:
df.ct_district.value_counts()

3     7
2     6
6     6
11    6
7     5
9     3
12    3
1     2
4     2
8     2
10    1
5     1
Name: ct_district, dtype: int64
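
`.value_counts()` has a few arguments worth knowing. The sketch below uses a short, made-up Series of district numbers (an assumption for illustration) to show proportions and sorting by value instead of by count.

```python
import pandas as pd

# Hypothetical district numbers, not the real CSIS column
districts = pd.Series([3, 3, 7, 11, 3, 7], name="ct_district")

counts = districts.value_counts()                # most frequent value first
shares = districts.value_counts(normalize=True)  # proportions instead of counts
by_district = counts.sort_index()                # order by district number

print(counts)
print(shares)
print(by_district)
```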

* `.nunique()` displays the number of distinct values in your column
    * This is useful because there are many occasions when the number of unique values in a column should match the number of rows in your dataset <b>exactly</b>.
    * In our case, our dataframe has 44 rows and we should have 44 unique project names and scope of work descriptions.

In [7]:
df.project_name.nunique()

44

In [11]:
df.shape

(44, 4)

* Notice that when a column name contains spaces, you need to refer to the column using brackets `[]`. 

In [12]:
df["Scope of Work"].nunique()

44
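
One way to turn the "unique values should match the number of rows" check into code is sketched below. The dataframe `toy_df` is made up and deliberately contains a repeated project name so the check fails for one column.

```python
import pandas as pd

# Hypothetical toy data with one repeated project name
toy_df = pd.DataFrame(
    {
        "project_name": ["Bike lane", "Bus shelter", "Bike lane"],
        "Scope of Work": ["Paint", "Build", "Repave"],
    }
)

# Dot notation only works for column names without spaces
n_names = toy_df.project_name.nunique()
# Bracket notation works for any column name
n_scopes = toy_df["Scope of Work"].nunique()

# Compare the number of unique values to the number of rows
names_are_unique = n_names == len(toy_df)
scopes_are_unique = n_scopes == len(toy_df)
print(names_are_unique, scopes_are_unique)

# Show every row whose project_name appears more than once
repeated = toy_df[toy_df.duplicated(subset="project_name", keep=False)]
print(repeated)
```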

## Something missing? 
* Open up our dataset using Excel. 
* Take a look at the bottom: how many sheets are there in the Excel workbook? 
* Which sheet is loaded into `df` above? 

### Lists: An Introduction
* We can load in all of the sheets in an Excel workbook using a <b>list</b>.
* Per [Practical Python for Data Science](https://www.practicalpythonfordatascience.com/00_python_crash_course_datatypes.html?highlight=dictionary#list): <i>"lists represent a collection of objects and are constructed with square brackets, separating items with commas. A list can contain a collection of one datatype...It can also contain a collection of mixed datatypes."</i>
    * Play around with some of the examples in the link above in this notebook.
* Notice that the items in this list are <i>strings</i>. Read about strings [here](https://www.practicalpythonfordatascience.com/00_python_crash_course_datatypes.html?highlight=dictionary#string).

In [14]:

my_sheets = ["projects_auto",
            "overall_score"]

In [15]:
len(my_sheets)

2

* You can access each element of the list using an index. 

In [16]:
# Index
my_sheets[0]

'projects_auto'

In [17]:
my_sheets[1]

'overall_score'
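
A few more list operations you are likely to need are sketched below. The list contents are made up for illustration.

```python
# Hypothetical list of metric names
metrics = ["zev", "vmt", "safety"]

print(len(metrics))   # number of items
print(metrics[0])     # first item (indexing starts at 0)
print(metrics[-1])    # negative indexes count from the end

# Add an item to the end of the list
metrics.append("equity")

# Slice out the first two items into a new list
first_two = metrics[:2]

# A list can hold mixed datatypes
mixed = ["projects_auto", 44, 3.5, True]

print(metrics)
print(first_two)
print(mixed)
```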

* Read the Excel workbook into a dataframe.
* Using the argument `sheet_name`, you can open a specific sheet in an Excel workbook or multiple sheets that are held in a list. When you pass a list, `pd.read_excel` returns a dictionary of dataframes keyed by sheet name.

In [18]:
df2 = pd.read_excel(
    url,
    sheet_name=my_sheets,
)

### Specificity is beautiful.
* Grab out each individual sheet into its own dataframe using `df2.get(my_sheets[enter in the index number])`. 
* Make sure your `dataframe` is titled descriptively.
* `df` is not exactly very telling. 

In [19]:
projects_df = df2.get(my_sheets[0])

In [20]:
scores_df = df2.get(my_sheets[1])
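
Since reading several sheets at once gives you a dictionary rather than a single dataframe, it helps to see how dictionary lookups behave. The sketch below builds a stand-in dictionary by hand (the name `sheets` and its tiny dataframes are assumptions) so it runs without the workbook.

```python
import pandas as pd

my_sheets = ["projects_auto", "overall_score"]

# Hand-built stand-in for a dictionary of dataframes keyed by sheet name.
# The values here are made up.
sheets = {
    "projects_auto": pd.DataFrame({"project_name": ["A", "B"]}),
    "overall_score": pd.DataFrame({"project_name": ["A", "B"], "zev": [1, 2]}),
}

print(type(sheets))          # a dict, not a DataFrame
print(list(sheets.keys()))   # the sheet names

# Two ways to pull one dataframe out of the dictionary
projects_df = sheets.get(my_sheets[0])
scores_df = sheets["overall_score"]

# .get() returns None for a key that does not exist,
# while square brackets would raise a KeyError
missing = sheets.get("not_a_sheet")
print(missing)
```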

## Add a new column
* Oops! We analysts were so wrapped up in scoring, we forgot to total up all the metrics to find the overall score for each project. 
* Place your results in a column called `overall_score`.
* There are a couple of ways to do this: experiment!
* Food for thought:
    * What does `axis = 1` mean?
    * What happens if you do `.sum(axis=0)`?
    * You don't always have to save everything into a dataframe. You can do something like `df.sum(axis=0)` just to see what happens. 
        * Just make sure your dataframe isn't too large or else you will run out of memory!
    * What happens when you create a new column with `scores_df.overall_score` instead of `scores_df["overall_score"]`? 

In [21]:
scores_df["overall_score"] = scores_df.select_dtypes(include=['int64', 'float64']).sum(axis=1)
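
To make the difference between `axis=1` and `axis=0` concrete, here is a minimal sketch on a made-up scores table. The name `toy_scores` and its values are assumptions, and `select_dtypes(include="number")` is used as a shorthand for picking every numeric column.

```python
import pandas as pd

# Hypothetical scores, not the CSIS data
toy_scores = pd.DataFrame(
    {
        "project_name": ["A", "B", "C"],
        "zev": [1, 2, 3],
        "vmt": [4.0, 5.0, 6.5],
    }
)

# Keep only the numeric columns so the text column is not summed
numeric = toy_scores.select_dtypes(include="number")

row_totals = numeric.sum(axis=1)  # sums across columns: one total per project
col_totals = numeric.sum(axis=0)  # sums down rows: one total per metric

print(row_totals)
print(col_totals)

# Save the per-project totals as a new column
toy_scores["overall_score"] = row_totals
print(toy_scores)
```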

## Subsetting
* Your manager asks for the `overall_score` for each project. They do not want to see the other metrics, only the project's name and its total score.
* Subset the dataframe and <b>save</b> it into a new dataframe.
* Again, there are many ways to do the same thing in Python. 
* <b>Method 1:</b> Enter in all the columns you want to keep in a list and place the list in another set of brackets.

In [23]:
# Enter in the columns you want to keep
columns_to_keep = ["project_name","overall_score"]

In [30]:
subsetted_df1 = scores_df[columns_to_keep]

* <b>Method 2</b>: You can enter in all the columns in a list you want to drop and use `.drop()`

In [None]:
# Enter in the columns you want to drop
columns_to_drop = []

In [24]:

subsetted_df2 = scores_df.drop(columns = columns_to_drop)

* If you see `NameError: name 'columns_to_drop' is not defined`, run the cell that defines `columns_to_drop` first.
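
The two subsetting methods should give the same result. Here is a sketch on a made-up table, where the metric columns `zev` and `vmt` are hypothetical stand-ins for whatever metrics your dataframe holds.

```python
import pandas as pd

# Hypothetical scores, not the CSIS data
toy_scores = pd.DataFrame(
    {
        "project_name": ["A", "B"],
        "zev": [1, 2],
        "vmt": [4, 5],
        "overall_score": [5, 7],
    }
)

# Method 1: list the columns to keep
columns_to_keep = ["project_name", "overall_score"]
kept = toy_scores[columns_to_keep]

# Method 2: list the columns to drop
columns_to_drop = ["zev", "vmt"]
dropped = toy_scores.drop(columns=columns_to_drop)

# Both methods produce the same dataframe
print(kept.equals(dropped))

# Neither method changes the original dataframe
print(list(toy_scores.columns))
```

Method 1 tends to be easier when you keep only a few columns, and Method 2 when you remove only a few.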

## Export to Google Cloud Storage (GCS)
* Save your <b>subsetted dataframe</b> from above back into the `starter_kit` folder. The file path should be something like this: `"gs://calitp-analytics-data/data-analyses/starter_kit/aggregated_csis.xlsx"`.
* However, remember our original Excel workbook's file path? It was `"gs://calitp-analytics-data/data-analyses/starter_kit/starter_kit_csis_scoring_workbook.xlsx"`.
* Essentially, the only difference between these two file paths is the file name, `aggregated_csis.xlsx` versus `starter_kit_csis_scoring_workbook.xlsx`, because the folder path `gs://calitp-analytics-data/data-analyses/starter_kit/` remains the same. 
* This is where f-strings come in.
> Python f-strings provide a quick way to interpolate and format strings. They’re readable, concise, and less prone to error than traditional string interpolation and formatting tools...
    * Read more about them [here](https://realpython.com/python-f-strings/#f-strings-a-new-and-improved-way-to-format-strings-in-python).
* <b> Reference</b>
    *  [Saving Code](https://docs.calitp.org/data-infra/analytics_tools/saving_code.html)
* <b> Let's practice </b>!
    * My file_path is always going to be `gs://calitp-analytics-data/data-analyses/starter_kit/`.

In [26]:
GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/starter_kit/"

* However, the file name is going to change.
* Save the file name in a variable called `FILE`.

In [27]:

FILE = "starter_kit_example_final_scores.xlsx"

* Using an f-string, combine `GCS_FILE_PATH` and `FILE` together.

In [28]:
# Put them together using an f-string
f"{GCS_FILE_PATH}{FILE}"

'gs://calitp-analytics-data/data-analyses/starter_kit/starter_kit_example_final_scores.xlsx'
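
The payoff of keeping the folder path in one variable is that you can reuse it for many file names. The sketch below also shows that f-strings can format numbers. The second file name and the `n_projects` and `avg_score` values are made up for illustration.

```python
GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/starter_kit/"

# Hypothetical file names that share the same folder
file_names = ["aggregated_csis.xlsx", "aggregated_csis.parquet"]

# Build one full path per file name
paths = [f"{GCS_FILE_PATH}{name}" for name in file_names]
for path in paths:
    print(path)

# f-strings can also format values: .2f rounds to 2 decimal places
n_projects = 44
avg_score = 7.256
summary = f"{n_projects} projects, average score {avg_score:.2f}"
print(summary)
```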

* Export your subsetted dataframe to the path above using `.to_excel()`, then open up your new Excel workbook and see if it's what you expect.
    * Hint: you will probably get a very annoying extra column! 
    * Try out some of the arguments [listed](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html#pandas.DataFrame.to_excel) and save your file again.

In [29]:

scores_df[["project_name","overall_score"]].to_excel(f"{GCS_FILE_PATH}{FILE}")
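
If the extra column turns out to be the dataframe's index, the `index` argument is one way to leave it out. The sketch below uses `to_csv()` only so that it runs without writing a file. `to_excel()` accepts the same `index` argument. The data is made up.

```python
import pandas as pd

# Hypothetical scores, not the CSIS data
toy_scores = pd.DataFrame(
    {"project_name": ["A", "B"], "overall_score": [5, 7]}
)

# With no path given, to_csv() returns the file contents as a string
with_index = toy_scores.to_csv()
without_index = toy_scores.to_csv(index=False)

# The default output starts with an unnamed index column
print(with_index)
# index=False leaves the index out
print(without_index)
```

Applied to the export above, that would look like `.to_excel(f"{GCS_FILE_PATH}{FILE}", index=False)`.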

* Export the entire (not subsetted) dataframe with the new `overall_score` column using `df.to_parquet()`. 
    * We typically prefer saving to Parquet files. Why? Read below. Text taken from [here](https://docs.calitp.org/data-infra/analytics_new_analysts/03-data-management.html#parquet).
    * <i>Parquet is an “open source columnar storage format for use in data analysis systems.” Columnar storage is more efficient as it is easily compressed and the data is more homogenous. CSV files utilize a row-based storage format which is harder to compress, a reason why Parquets files are preferable for larger datasets. Parquet files are faster to read than CSVs, as they have a higher querying speed and preserve datatypes (i.e. Number, Timestamps, Points). They are best for intermediate data storage and large datasets (1GB+) on most any on-disk storage. This file format is also good for passing dataframes between Python and R. A similar option is feather.</i>

In [31]:
scores_df.to_parquet(f"{GCS_FILE_PATH}starter_kit_example_final_scores.parquet")

## You're almost done!
* Name this notebook `YOURNAME_exercise1.ipynb`
    * You can't right click and rename the file, since this notebook is tracked with Git. 
    * Rename it using `git mv OLDNAME.ipynb NEWNAME.ipynb`. 
    * The `mv` stands for move, and renaming a file is basically "moving" its path. 
    * Doing it this way retains the git history associated with the notebook. If you rename directly with right click, rename, you destroy the git history.
* Use a descriptive commit message (ex: adding chart, etc). GitHub already tracks who makes the commit, the date, the timestamp of it, the files being affected, so your commit message should be more descriptive than the metadata already stored.