Thank you for all the resources. Super helpful for beginners. Love your clear, line-by-line instructions. Very organized, easy to understand. Starting from GitHub is a great idea; it matches the order of our daily work.

# Exercise 1: `Git`, `pandas`, `python`, `f-strings`, Importing and Exporting data.
If you are new to Python, there are many resources to help you! Below is just a small sample of what is available.
* There are introductory Python courses available through [Caltrans's LinkedIn Learning Library](https://www.linkedin.com/learning/search?keywords=python&u=36029164).
* [Practical Python for Data Science](https://www.practicalpythonfordatascience.com/00_python_crash_course) is an incredibly helpful resource. Material from it is linked throughout.

## How to use these tutorials
* The tutorials are divided by skills/concepts we are going to learn.
* There are hints and instructions on the top.
* There are links to references. 
**It is highly recommended to read through them and practice them in this notebook.**

## What are we working with today? 
* Today we will be working on Caltrans System Investment Strategy (CSIS) data. Per this [description](https://dot.ca.gov/programs/transportation-planning/division-of-transportation-planning/corridor-and-system-planning/csis)
> <i>The California Department of Transportation (Caltrans) is committed to leading climate action and advancing social equity in the transportation sector set forth by the California State Transportation Agency (CalSTA) Climate Action Plan for Transportation Infrastructure (CAPTI, 2021)...Caltrans is in a significant leadership role to carry out meaningful measures that advance state’s goals and priorities through the development and implementation of the Caltrans System Investment Strategy (CSIS). The CSIS, which implements one of CAPTI’s key actions, is envisioned to be an investment framework through a data and performance-driven approach that guides transportation investments and decisions.</i>
* The Data Science Branch is working on CSIS by automating the scoring of projects using Python. We score each project based on how well it does on various metrics such as Zero Emission Vehicles, Vehicle Miles Traveled Reduction, and more. 
* While the values we are working with today are all <i>fake</i>, the exercise is based on the actual data and work we've done. 

## GitHub - Making a Branch
* You are probably on the `main` branch of our `data-analyses` repo. 
* The `main` branch is [here](https://github.com/cal-itp/data-analyses).
* We never work on the `main` branch. 
* You can think of the `main` branch as an area that contains our work only when it's at a good stopping point.
* We typically save (via `merging` a `pull request`) our work to the `main` branch at the end of the work week.
* The rest of the time, we work on our own branches, making frequent `commits` to save and document our work along the way. 
* Let's make (or rather `check out`) our own branch.

**Steps**
1. Go to the terminal.
2. Paste `git pull origin main` which pulls down the main branch with the latest work. 
3. Paste `git switch -c your-branch` in the terminal. Swap out `your-branch` with something else.
     * We typically name branches with all lowercase, and dashes instead of underscores. Instead of `Amanda_Branch`, write `amanda-branch`.
4. Your terminal should now show `jovyan@jupyter-your_name ~/data-analyses (your-branch) $ ` which means you successfully made your new branch!

## Import Packages
* Before doing some data cleaning and analyzing, we need to equip ourselves with the right tools.
* Part of our "toolbox" is the set of packages that you `import` into your notebook.
* **Resource**: [Importing Dependencies via Practical Python for Data Science](https://www.practicalpythonfordatascience.com/05_data_exploration.html?highlight=dependencies#importing-our-dependencies)

### `Pandas`
* Below, you are importing the package `pandas` that is the backbone of our data analysis work. 
* Other packages DDS commonly uses are `geopandas` for geospatial data work and `altair` for making charts.

In [1]:
import pandas as pd

* This block of code below adjusts the notebook's settings.
* I am setting the maximum number of columns to be displayed to be 100 because the default number of columns shown is much smaller.
* I want any `float` columns to be rounded to 2 decimal points.
* I want all of the rows in the dataframe to display. ???
* I don't want my string columns to be truncated.
    * Without this line of code, a long string value like <i>The California Department of Transportation (Caltrans) is committed to leading climate action and advancing social equity...</i> would be displayed as something like <i>The California Department of Transportation (Caltrans) is...</i>
* Adjust some of these settings if you wish to make this notebook the proper environment for you.

In [2]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)
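
To see what these settings change, here is a minimal sketch on a made-up dataframe (the `cost` column and its values are illustrative, not from the CSIS workbook). It uses `pd.option_context`, which applies a setting only inside the `with` block, to show that `float_format` changes how numbers are displayed without changing the stored values.

```python
import pandas as pd

# Made-up data for illustration only.
demo = pd.DataFrame({"cost": [1234.5678, 2.0]})

# Apply the two-decimal display format only inside this block.
with pd.option_context("display.float_format", "{:.2f}".format):
    rounded_view = demo.to_string()

print(rounded_view)

# The underlying value is untouched; only the display was rounded.
print(demo["cost"].iloc[0])
```

Setting `pd.options.display.float_format` directly, as in the cell above, applies the same format for the rest of the notebook session.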

### `calitp_data_analysis`
* DDS also has our own [internal library of functions](https://docs.calitp.org/data-infra/analytics_tools/python_libraries.html#calitp-data-analysis).
    * You can check out all the functions [here](https://github.com/cal-itp/data-infra/tree/main/packages/calitp-data-analysis/calitp_data_analysis).
* Below, we are importing only one function called `to_snakecase` from the python submodule `sql` in our package `calitp_data_analysis`. 
* `to_snakecase` allows us to change the column names of our dataset from something like `Project Description` to `project_description`. 
    * Turning the column names to lower case and replacing the spaces with underscores makes referencing specific columns much easier.

In [3]:
from calitp_data_analysis.sql import to_snakecase
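
If `calitp_data_analysis` is not available in your environment, the renaming described above can be approximated with plain `pandas` string methods. This is only a rough sketch of the idea (lowercase the names, swap spaces for underscores), not the actual implementation of `to_snakecase`, which may handle more cases. The dataframe and its values are made up.

```python
import pandas as pd

# Made-up dataframe with "messy" column names.
raw = pd.DataFrame({"Project Description": ["bike lane"], "Lead Agency": ["MBPT"]})

# Work on a copy so the original column names stay available for comparison.
cleaned = raw.copy()
cleaned.columns = (
    cleaned.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
)

print(list(raw.columns))
print(list(cleaned.columns))
```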

## Jupyter Notebook
* You're using a Jupyter Notebook right now.
* There are many benefits of using a notebook for our analysis, which you can read about here in our [DDS Docs](https://docs.calitp.org/data-infra/analytics_new_analysts/04-notebooks.html).
* Take some time to get used to this interface. 
    * Press ctrl+enter to run a cell
    * Go up to the Kernel and rerun all the cells.
    * Use the scissors at the top to delete out the cell.
    * Adjust your settings to be dark instead of light. !!! this is so cool. Did not know this before !!!
* There are many tutorials available on YouTube; just skip the installation portion. 
    * [This one looks promising](https://youtu.be/LW2Rye_l8L0?si=B8kojobCe3OIF3xg).

## Check out the data 
* Download the Excel workbook containing all the CSIS data from Google Cloud Storage [here](https://console.cloud.google.com/storage/browser/_details/calitp-analytics-data/data-analyses/starter_kit/starter_kit_csis_scoring_workbook.xlsx;tab=live_object?project=cal-itp-data-infra). 
* Open the workbook up in Excel and take a look at how many sheets it contains.

### Read in the data
* We are reading our Excel Workbook into a Pandas dataframe.
* While there is a very [technical definition](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) of what a dataframe is, you can think of it as an Excel sheet that holds your data. 
* <b> Resource</b>: [Practical Python for Data Science](https://www.practicalpythonfordatascience.com/02_loading_data)

In [5]:
FOLDER = "../starter_kit/"
FILE_NAME = "starter_kit_csis_scoring_workbook.xlsx"
df = pd.read_excel(f"{FOLDER}{FILE_NAME}")
df.head(2)

Unnamed: 0,ct_district,project_name,Scope of Work,Project Cost,lead agency
0,8,Meadow Magic Multi-Use Path,"A 2-mile Class I bike lane and multi-use path through a scenic meadow, featuring wildflower plantings, public art installations, and educational signage highlighting local wildlife.",4189348,Meadow Bunny Public Transportation (MBPT)
1,2,Bunny Hop Bike Boulevard,"A Class II bike lane with charming streetlights, benches, and bike racks designed to resemble carrot sticks, connecting residential neighborhoods to local schools and parks.",8647685,Unicorn Fairy Express Bus (UFX)


In [None]:
#url = "gs://calitp-analytics-data/data-analyses/starter_kit/starter_kit_csis_scoring_workbook.xlsx"

* Read in the dataframe without the function `to_snakecase()` first to see what happens.

In [None]:
#df_no_snakecase = (pd.read_excel(url))

In [None]:
#df_no_snakecase.head(2)

* Read in the dataframe with `to_snakecase()` now and compare the difference between the column names. 

In [6]:
# Read the data and clean the column names with the to_snakecase function
workbook = to_snakecase(pd.read_excel(f"{FOLDER}{FILE_NAME}"))
workbook.head(2)

Unnamed: 0,ct_district,project_name,scope_of_work,project_cost,lead_agency
0,8,Meadow Magic Multi-Use Path,"A 2-mile Class I bike lane and multi-use path through a scenic meadow, featuring wildflower plantings, public art installations, and educational signage highlighting local wildlife.",4189348,Meadow Bunny Public Transportation (MBPT)
1,2,Bunny Hop Bike Boulevard,"A Class II bike lane with charming streetlights, benches, and bike racks designed to resemble carrot sticks, connecting residential neighborhoods to local schools and parks.",8647685,Unicorn Fairy Express Bus (UFX)


In [None]:
#df = to_snakecase(pd.read_excel(url))

In [None]:
#df.head(2)

### Previewing Data 
* Often, you want to get a sneak preview of your data. 
* Below are a couple of very common methods we use. 
    * `.head()` shows the first five rows, while `.tail()` shows the last five.
    * `.sample()` shows you a random row.
    * Want to see more or fewer than five? Specify it in the parentheses: `.head(10)` allows you to see the first 10 rows and `.sample(2)` allows you to see two random rows.
* **Resource**: [Practical Python for Data Science: Data Inspection](https://www.practicalpythonfordatascience.com/02_loading_data)

In [None]:
#df.head(2)
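
The preview methods above can be tried on any dataframe. The sketch below uses a small made-up dataframe (eight hypothetical projects) so it runs without the workbook.

```python
import pandas as pd

# Made-up dataframe with eight rows.
demo = pd.DataFrame({
    "project_name": [f"Project {i}" for i in range(8)],
    "project_cost": range(8),
})

first_five = demo.head()       # no argument: first 5 rows
last_three = demo.tail(3)      # last 3 rows
# random_state is optional; it makes the random draw repeatable.
two_random = demo.sample(2, random_state=0)

print(first_five)
print(last_three)
print(two_random)
```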

### More Methods!
* `df.shape` gives you the number of rows and columns in your dataset.
* `df.columns` returns all of the column names.
* `df.info()` per the [pandas docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html#pandas.DataFrame.info) <i>prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.</i>
* **Experiment below.** 
* More food for thought:
    * `Dtype` is critical. There are integers, objects, booleans, floats...
    * Does the `dtype` of each column below make sense to you? 
    * The `dtype` of `object` is a catchall term. It can either contain all string values like "muffins" and "apples" or a mix of string and other data types like "6 muffins" and "3 apples."
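
As a small illustration of these attributes and of the `object` dtype, the sketch below builds a made-up dataframe in which one column mixes strings with a number. The column names and values are hypothetical.

```python
import pandas as pd

demo = pd.DataFrame({
    "item": ["muffins", "apples", 6],   # strings mixed with a number
    "quantity": [6, 3, 1],              # integers only
})

print(demo.shape)           # (number of rows, number of columns)
print(list(demo.columns))   # the column names
print(demo.dtypes)          # one dtype per column
demo.info()                 # dtypes, non-null counts, and memory usage
```

The mixed `item` column is stored as `object`, while `quantity` is stored as an integer type.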

### Deeper Dive
* Let's take a closer look at some columns.
* `.value_counts()` helps you see how many times the same value appears. 

In [34]:
workbook.dtypes

ct_district       int64
project_name     object
scope_of_work    object
project_cost      int64
lead_agency      object
dtype: object

In [35]:
workbook.ct_district.value_counts()

4     6
3     6
8     5
11    5
12    4
5     4
9     3
6     3
7     3
2     2
10    2
1     1
Name: ct_district, dtype: int64

* `.nunique()` displays the number of distinct values in your column.
    * This is useful because often the number of unique values of a column should match the number of rows of your dataset <b>exactly</b>.
    * In our case, our dataframe has 44 rows and we should have 44 unique project names and scope of work descriptions.

In [36]:
workbook.project_name.nunique()

44

In [37]:
workbook.shape

(44, 5)
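
When the two numbers do not match, it helps to see which values repeat. A minimal sketch on made-up data (the project names below are hypothetical) compares the row count with `.nunique()` and then uses `.duplicated()` to pull out the repeated rows.

```python
import pandas as pd

# Made-up data where one project name appears twice.
demo = pd.DataFrame({
    "project_name": ["Path A", "Path B", "Path A"],
    "project_cost": [100, 200, 100],
})

n_rows = len(demo)
n_names = demo["project_name"].nunique()
print(n_rows, n_names)

# keep=False flags every occurrence of a repeated value, not just the later ones.
repeats = demo[demo["project_name"].duplicated(keep=False)]
print(repeats)
```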

* You can preview a column with brackets [] as well with the column name encased in quotation marks.
* However, simply using a period . is much easier.

In [38]:
workbook["scope_of_work"].nunique()

44
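
The period shortcut has limits, which is one reason cleaning column names pays off. The sketch below, on a made-up dataframe, shows two cases where brackets are needed: a column name containing a space, and a column name that matches an existing dataframe method (here `count`).

```python
import pandas as pd

demo = pd.DataFrame({"project cost": [10, 20], "count": [1, 2]})

# A name with a space can only be reached with brackets.
total_cost = demo["project cost"].sum()

# Brackets return the column; the period returns the DataFrame.count method.
count_column = demo["count"]
count_method = demo.count

print(total_cost)
print(type(count_column))
print(callable(count_method))
```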

* `.describe()` gives you some descriptive statistics.

In [39]:
workbook.describe()

Unnamed: 0,ct_district,project_cost
count,44.0,44.0
mean,6.73,5041007.95
std,3.29,2795837.16
min,1.0,679173.0
25%,4.0,2745870.5
50%,6.5,4648018.0
75%,9.25,7324164.5
max,12.0,9971021.0


In [40]:
workbook.project_cost.describe()

count        44.00
mean    5041007.95
std     2795837.16
min      679173.00
25%     2745870.50
50%     4648018.00
75%     7324164.50
max     9971021.00
Name: project_cost, dtype: float64

## Something missing? 
* Open up our dataset using Excel. 
* Take a look at the bottom: how many sheets are there in the Excel workbook? 
* Which sheet is loaded into `df` above? 

### Lists: An Introduction
* We can load in all of the sheets in an Excel workbook using a list
* Per [Practical Python for Data Science](https://www.practicalpythonfordatascience.com/00_python_crash_course_datatypes.html?highlight=dictionary#list): <i>"lists represent a collection of objects and are constructed with square brackets, separating items with commas. A list can contain a collection of one datatype...It can also contain a collection of mixed datatypes</i>".
    * **Play around with some of the examples in the link above in this notebook.**
    * You will be using lists often in your work, so it is best to be familiar with this datatype.

### Application of Lists
* I am placing all of the sheets in our Excel Workbook in a list.
* Notice that the items in this list are <i>strings</i>. 
    * Read about strings [here](https://www.practicalpythonfordatascience.com/00_python_crash_course_datatypes.html?highlight=dictionary#string).
* You can access each element of the list using an index.
    * An index represents the location of an element with a number.
    * The index always starts at 0. What we consider the first item is not index "1", it's index "0".

In [7]:
my_sheets = ["projects_auto",
            "overall_score"]

In [8]:
len(my_sheets)

2

In [9]:
# Index 0 is projects_auto
my_sheets[0]

'projects_auto'
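
A few more list operations are worth trying here. In this short sketch, the sheet names come from the cell above, while `hypothetical_sheet` and the `mixed` list are made up for illustration.

```python
my_sheets = ["projects_auto", "overall_score"]

first = my_sheets[0]     # index 0 is the first item
last = my_sheets[-1]     # negative indexes count from the end

# A list can also hold mixed datatypes.
mixed = ["muffins", 6, 2.5, True]

# Lists can grow; this sheet name is made up and is not in the workbook.
my_sheets.append("hypothetical_sheet")

print(first, last)
print(len(my_sheets))
```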

* Read the Excel workbook into a dataframe.
* Using the argument `sheet_name`, you can open a specific sheet in an Excel workbook, or multiple sheets that are held in a list.

In [10]:
# Can't use to_snakecase here: with a list of sheet names, read_excel returns a dictionary of dataframes (keyed by sheet name), not a single dataframe
workbook2 = pd.read_excel(f"{FOLDER}{FILE_NAME}", sheet_name=my_sheets)

In [None]:
#df2 = pd.read_excel(
    #url,
    #sheet_name=my_sheets,
#)
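
When `sheet_name` is a list, `pd.read_excel` returns a dictionary of dataframes keyed by sheet name. The sketch below builds such a dictionary by hand, with made-up columns, so it runs without the workbook, and then cleans every sheet's column names in one pass. The `clean_columns` helper is hypothetical and only stands in for `to_snakecase`.

```python
import pandas as pd

# Stand-in for the result of pd.read_excel(..., sheet_name=my_sheets).
sheets = {
    "projects_auto": pd.DataFrame({"Project Name": ["Path A"], "Project Cost": [100]}),
    "overall_score": pd.DataFrame({"Project Name": ["Path A"], "ZEV Score": [5]}),
}

def clean_columns(df):
    """Hypothetical helper: lowercase column names, spaces to underscores."""
    out = df.copy()
    out.columns = out.columns.str.lower().str.replace(" ", "_", regex=False)
    return out

# Apply the helper to each dataframe, keeping the sheet names as keys.
cleaned = {name: clean_columns(df) for name, df in sheets.items()}

print(list(cleaned.keys()))
print(list(cleaned["overall_score"].columns))
```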

### Specificity is beautiful.
* Grab out each individual sheet into its own dataframe using `df2.get(my_sheets[enter in the index number])`. 
* Make sure your `dataframe` is titled descriptively.
* `df` is not exactly very telling. 
* Use the function `to_snakecase` to clean up your column names

In [11]:
project = to_snakecase(workbook2.get(my_sheets[0]))
project.head(2)

Unnamed: 0,ct_district,project_name,scope_of_work,project_cost,lead_agency
0,8,Meadow Magic Multi-Use Path,"A 2-mile Class I bike lane and multi-use path through a scenic meadow, featuring wildflower plantings, public art installations, and educational signage highlighting local wildlife.",4189348,Meadow Bunny Public Transportation (MBPT)
1,2,Bunny Hop Bike Boulevard,"A Class II bike lane with charming streetlights, benches, and bike racks designed to resemble carrot sticks, connecting residential neighborhoods to local schools and parks.",8647685,Unicorn Fairy Express Bus (UFX)


In [12]:
score = to_snakecase(workbook2.get(my_sheets[1]))
score.head(2)

Unnamed: 0,project_name,accessibility_score,dac_accessibility_score,dac_traffic_impacts_score,freight_efficiency_score,freight_sustainability_score,mode_shift_score,lu_natural_resources_score,safety_score,vmt_score,zev_score,public_engagement_score,climate_resilience_score,program_fit_score
0,Meadow Magic Multi-Use Path,10,6,1,5,9,7,5,2,9,5,8,7,8
1,Bunny Hop Bike Boulevard,2,3,1,6,5,3,9,7,5,8,2,9,5


## Add a new column
* Oops! We analysts were so wrapped up in scoring that we forgot to total up all the metrics to find the `overall_score` for each project. 
* Using the dataframe you read in from the Excel sheet "Overall Score", sum up all the metric columns into a column called `overall_score`
* There are a couple of ways to do this: experiment! 
* Here are some resources:
    * [Stackoverflow](https://stackoverflow.com/questions/22342285/summing-two-columns-in-a-pandas-dataframe)
    * [Statology](https://www.statology.org/pandas-sum-specific-columns/)
* Food for thought:
    * What happens when you create a new column with `scores_df.overall_score` instead of `scores_df["overall_score"]`? 
    * What does `axis = 1` mean?
    * What happens if you do `.sum(axis=0)`?
    * You don't always have to save everything into a dataframe. You can do something like `df.sum(axis=0)` just to see what happens. 
        * Just make sure your dataframe isn't too large or else you will run out of memory!
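
To explore the `axis` questions on something small, here is a sketch with a made-up two-project dataframe (the names and scores are hypothetical). It passes `numeric_only=True` so that the text column `project_name` is left out of the sums.

```python
import pandas as pd

demo = pd.DataFrame({
    "project_name": ["Path A", "Path B"],
    "safety_score": [2, 7],
    "vmt_score": [9, 5],
})

# axis=1 adds across the columns: one total per row (per project).
row_totals = demo.sum(axis=1, numeric_only=True)

# axis=0 adds down the rows: one total per column (per metric).
column_totals = demo.sum(axis=0, numeric_only=True)

print(row_totals)
print(column_totals)
```

Notice that `row_totals` is indexed by row number while `column_totals` is indexed by column name, which matters when you assign the result to a new column.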
    

In [17]:
# numeric_only=True leaves the text column project_name out of the sum
score['overall_score'] = score.sum(axis=1, numeric_only=True)


In [18]:
score.head(2)

Unnamed: 0,project_name,accessibility_score,dac_accessibility_score,dac_traffic_impacts_score,freight_efficiency_score,freight_sustainability_score,mode_shift_score,lu_natural_resources_score,safety_score,vmt_score,zev_score,public_engagement_score,climate_resilience_score,program_fit_score,overall_score
0,Meadow Magic Multi-Use Path,10,6,1,5,9,7,5,2,9,5,8,7,8,82
1,Bunny Hop Bike Boulevard,2,3,1,6,5,3,9,7,5,8,2,9,5,65


In [13]:
# Experiment: if the column does not exist yet, attribute-style assignment
# sets a plain attribute instead of creating a column, and pandas warns about it
score.overall_score = score.sum(axis=1, numeric_only=True)


In [14]:
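# Experiment: axis=0 sums down each column instead of across each row
score.overall_score = score.sum(axis=0)

In [15]:
# The axis=0 result is indexed by column name, so it does not line up
# with the row index and the new column is filled with NaN
score['overall_score'] = score.sum(axis=0)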

In [16]:
score.head(2)

Unnamed: 0,project_name,accessibility_score,dac_accessibility_score,dac_traffic_impacts_score,freight_efficiency_score,freight_sustainability_score,mode_shift_score,lu_natural_resources_score,safety_score,vmt_score,zev_score,public_engagement_score,climate_resilience_score,program_fit_score,overall_score
0,Meadow Magic Multi-Use Path,10,6,1,5,9,7,5,2,9,5,8,7,8,
1,Bunny Hop Bike Boulevard,2,3,1,6,5,3,9,7,5,8,2,9,5,


## Subsetting
* Your manager asks for the `overall_score` for each project in Excel format. 
* They do not want to see the other metrics, only the project's name and its `overall_score`
* Subset the dataframe and save it into a new dataframe.
* <b>Method 1:</b> Enter in all the columns you want to keep in a list and place the list in another set of brackets.

In [19]:
# Enter in the columns you want to keep
columns_to_keep = ['project_name', 'overall_score']

In [20]:
subsetted_score = score[columns_to_keep]
subsetted_score.head(2)

Unnamed: 0,project_name,overall_score
0,Meadow Magic Multi-Use Path,82
1,Bunny Hop Bike Boulevard,65


* <b>Method 2</b>: You can enter in all the columns in a list you want to drop and use `.drop()`

In [None]:
# Enter in the columns you want to drop
columns_to_drop = []

In [None]:

subsetted_df2 = scores_df.drop(columns = columns_to_drop)
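
Both methods should lead to the same result. The sketch below checks this on a made-up dataframe (the column values are hypothetical): keeping two columns and dropping the remaining one give identical dataframes.

```python
import pandas as pd

demo = pd.DataFrame({
    "project_name": ["Path A", "Path B"],
    "safety_score": [2, 7],
    "overall_score": [11, 12],
})

# Method 1: list the columns to keep inside a second set of brackets.
kept = demo[["project_name", "overall_score"]]

# Method 2: list the columns to drop.
dropped = demo.drop(columns=["safety_score"])

print(kept.equals(dropped))
```

Which method is more convenient usually depends on which list is shorter.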

## F-Strings
* Save your <b>subsetted dataframe</b> from above back into the `starter_kit` folder. 
* The file path should be something like this `"gs://calitp-analytics-data/data-analyses/starter_kit/your_file_name_here.xlsx"`.
* However, remember our original Excel workbook's file path? It was `"gs://calitp-analytics-data/data-analyses/starter_kit/starter_kit_csis_scoring_workbook.xlsx"`
* The **only** difference between these two file paths is the file name, `your_file_name_here.xlsx` versus `starter_kit_csis_scoring_workbook.xlsx`, because the folder path `gs://calitp-analytics-data/data-analyses/starter_kit/` remains the same. 
* This is where f-strings come in. 
> Python f-strings provide a quick way to interpolate and format strings. They’re readable, concise, and less prone to error than traditional string interpolation and formatting tools...
    * Excerpt from [here](https://realpython.com/python-f-strings/#f-strings-a-new-and-improved-way-to-format-strings-in-python).

#### Application of F-Strings
* My file_path is always going to be `gs://calitp-analytics-data/data-analyses/starter_kit/` so I'll set that in its own variable.


In [21]:
FOLDER = "../starter_kit/"

In [None]:
#GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/starter_kit/"

* However, the file name is going to change.
* Save the file name in a new variable called `FILE`.

In [22]:
FILE_NAME = "sub_score.xlsx"

In [None]:

#FILE = 

* Using an `f-string`, combine `GCS_FILE_PATH` and `FILE` together.

In [None]:
# Put them together using an f-string
#f"{GCS_FILE_PATH}{FILE}"
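
As a worked example of the idea, the sketch below joins a folder and a file name with an f-string and also shows that f-strings can format numbers. The file name `my_scores.xlsx` is made up; the folder path and the cost value are taken from earlier in this notebook.

```python
GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/starter_kit/"
FILE = "my_scores.xlsx"  # hypothetical file name

# Anything inside the curly braces is replaced by the variable's value.
full_path = f"{GCS_FILE_PATH}{FILE}"
print(full_path)

# A format spec after the colon controls how a number is shown.
cost = 4189348
label = f"Project cost: ${cost:,.2f}"
print(label)
```

Because only `FILE` changes from one export to the next, the folder path is typed once and reused.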

* Now go open up your new Excel workbook and see if it's what you expect.
    * Hint: you will probably get a very annoying extra column! 
    * Try out some of the arguments [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html#pandas.DataFrame.to_excel) and save your file again.

In [23]:
subsetted_score.to_excel(f"{FOLDER}{FILE_NAME}")

In [None]:

#df.to_excel(f"{GCS_FILE_PATH}{FILE}")
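
The extra column mentioned in the hint is the dataframe's index. Both `to_excel` and `to_csv` accept an `index` argument that controls whether it is written. The sketch below uses `to_csv` with an in-memory buffer, on made-up data, only so that it runs without an Excel engine installed; the same `index=False` argument can be passed to `to_excel`.

```python
import io

import pandas as pd

demo = pd.DataFrame({"project_name": ["Path A"], "overall_score": [11]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
demo.to_csv(with_index)

# index=False leaves the index out of the file.
without_index = io.StringIO()
demo.to_csv(without_index, index=False)

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(header_with)
print(header_without)
```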

### Parquets
* Export the entire (not subsetted) dataframe with the new `overall_score` column using `df.to_parquet()`. 
    * We typically prefer saving to `parquets`. Why? Read below. Text taken from [here](https://docs.calitp.org/data-infra/analytics_new_analysts/03-data-management.html#parquet).
    * <i>Parquet is an “open source columnar storage format for use in data analysis systems.” Columnar storage is more efficient as it is easily compressed and the data is more homogenous. CSV files utilize a row-based storage format which is harder to compress, a reason why Parquets files are preferable for larger datasets. Parquet files are faster to read than CSVs, as they have a higher querying speed and preserve datatypes (i.e. Number, Timestamps, Points). They are best for intermediate data storage and large datasets (1GB+) on most any on-disk storage. This file format is also good for passing dataframes between Python and R. A similar option is feather.</i>
* <b>Reference</b>
    * [DDS Docs: Saving Code](https://docs.calitp.org/data-infra/analytics_tools/saving_code.html)
* Make sure you use an f-string.

In [25]:
FILE_NAME = "overall_score.parquet"
score.to_parquet(f"{FOLDER}{FILE_NAME}")

## Git - `Committing` Code
* In the terminal, paste `git mv 2024_basics_01.ipynb your_new_notebook.ipynb`. 
    * This renames your notebook.
    * Avoid right-clicking and renaming the file, since this notebook is tracked with Git. 
    * The `mv` stands for move, and renaming a file is basically "moving" its path. 
    * If you rename with right click, Git initially sees the old file as deleted and the new file as untracked, so the rename is easy to commit incorrectly.
    * Doing it this way retains the git history associated with the notebook.
* In the terminal, paste `git add your_new_notebook.ipynb`. 
    * This adds your new notebook.
    * To add all files with a certain extension, write `git add *ipynb`.
* Continuing in the terminal, paste `git commit -m 'write a message here'`
    * This details the work you did this particular coding session. 
    * A typical message would be: `git commit -m 'added charts'` or `git commit -m 'worked on exercise 1'`
    * Git already tracks the change's date and timestamp, the files affected, who made the change, and more, so you don't need to include details like these.
* Finally, in the terminal, paste `git push origin your-branch`.
    * This pushes up your change to the remote `data-analyses` repo onto your own branch.
    * Now, all your work is safely stored on and recorded by GitHub.