# Exercise 4: Python Scripts, Display, Markdown, Preparing for the Portfolio
* Cleaning and analyzing data takes a lot of time, patience, and skill.
* However, presenting the data to stakeholders is equally important.
* At DDS, we often present our work in a Jupyter Notebook.
* This exercise will walk you through how we do so. 

In [None]:
import _starterkit_utils
import altair as alt
import numpy as np
import pandas as pd
from calitp_data_analysis import calitp_color_palette

In [None]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

## Python Scripts
* Up until now, we have been placing all of our code in the Jupyter Notebook.
* While this is convenient, it's not the best practice. 
* A notebook full of code isn't easy to digest. 
* Jupyter notebooks are also very difficult for Git to version control. Everything gets jumbled. 
* The best solution is to move the bulk of your code to a Python script once you have reached a stopping point. 
    * Read all about the benefits of scripts [here in our DDS docs](https://docs.calitp.org/data-infra/analytics_tools/scripts.html).
    * Summary points from the docs page above:
        * <i>Python scripts (.py) are plain text files. Git tracks plain text changes easily.</i>
        * <i>Scripts are robust to scaling and reproducing work.</i>
        * <i>Break out scripts by concepts / stages</i>
        * <i>All functions used in scripts should have docstrings. Type hints are encouraged!</i>
* Writing Python scripts is an art and isn't always straightforward.
* I have already populated a `.py` file called `_starterkit_utils.py` with some sample functions.
    * I have imported my Python script just like I imported my other dependencies (Pandas, Altair, Numpy).
    * Read about dependencies [here](https://www.practicalpythonfordatascience.com/05_data_exploration).
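
As a rough illustration of the summary points above, a script could look something like the sketch below. The file name `my_utils.py`, the function names, and the column names are hypothetical; they are not taken from `_starterkit_utils.py`.

```python
"""my_utils.py: hypothetical example script with one function per stage."""
import pandas as pd


def clean_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Strip whitespace from the column names and lowercase them."""
    df = df.copy()
    df.columns = df.columns.str.strip().str.lower()
    return df


def total_by_group(df: pd.DataFrame, group_col: str, value_col: str) -> pd.DataFrame:
    """Sum `value_col` for each unique value in `group_col`."""
    return df.groupby(group_col, as_index=False)[value_col].sum()
```

Each function does one thing, has a docstring, and uses type hints, so it can be imported into any notebook and reused.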

In [None]:
import _starterkit_utils

### Breakdown of a Script
#### Function 1
* Following what the DDS docs say, I create a new function every time I process the data in another stage.
* I have one function that loads in my dataset.
* Take a look at the column names: they are no longer in `snake_case` because I applied a function that capitalizes them properly.
* To use a function in a Script, write `name_of_your_script.name_of_the_function(whatever arguments)`

In [None]:
df = _starterkit_utils.load_dataset()
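
One possible way to turn `snake_case` column names into readable titles is sketched below. This is an assumption about the general approach, not the code inside `_starterkit_utils.py`; the helper name `title_case_columns` and the sample columns are made up for illustration.

```python
import pandas as pd


def title_case_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Replace underscores with spaces and capitalize each word in the column names."""
    df = df.copy()
    df.columns = df.columns.str.replace("_", " ").str.title()
    return df


# Hypothetical sample data.
example = pd.DataFrame({"project_name": ["A"], "overall_score": [7.5]})
print(title_case_columns(example).columns.tolist())
# ['Project Name', 'Overall Score']
```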

#### Function 2
* After loading in the dataset from GCS, I am entering my second stage of processing the data.
* I am aggregating my dataframe by Category, basically the same function in Exercise 3. 

In [None]:
aggregated_df = _starterkit_utils.aggregate_by_category(df)

In [None]:
aggregated_df
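
For reference, an aggregation by category might be sketched as below. The function name, the chosen statistics, and the toy values are assumptions; check `_starterkit_utils.py` for what `aggregate_by_category` actually computes.

```python
import pandas as pd


def aggregate_by_category_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Count unique projects, sum cost, and take the median score per category."""
    return (
        df.groupby("Category")
        .agg(
            total_projects=("Project Name", "nunique"),
            total_cost=("Project Cost", "sum"),
            median_score=("Overall Score", "median"),
        )
        .reset_index()
    )


# Toy data, made up for illustration.
toy = pd.DataFrame(
    {
        "Category": ["Rail", "Rail", "Bus"],
        "Project Name": ["A", "B", "C"],
        "Project Cost": [100.0, 300.0, 50.0],
        "Overall Score": [6.0, 8.0, 5.0],
    }
)
print(aggregate_by_category_sketch(toy))
```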

#### Function 3
* I want to swap my dataframe from wide to long. 
* [Read about wide to long.](https://www.statology.org/long-vs-wide-data/)
* [Pandas doc on melt](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html)

In [None]:
df2 = _starterkit_utils.wide_to_long(df)

In [None]:
df2.head(2)
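
To make the wide-to-long idea concrete, here is a minimal `melt()` sketch on a toy dataframe. The metric column names are invented for the example and may not match the starter kit dataset.

```python
import pandas as pd

# Wide: one row per project, one column per metric.
wide_df = pd.DataFrame(
    {
        "Project Name": ["A", "B"],
        "Safety Score": [3, 5],
        "Access Score": [4, 2],
    }
)

# Long: one row per project-metric pair.
long_df = wide_df.melt(
    id_vars=["Project Name"],
    value_vars=["Safety Score", "Access Score"],
    var_name="Metric",
    value_name="Score",
)
print(long_df)
```

The long layout tends to be easier to chart because the metric name becomes a single column you can map to an axis or a color.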

#### Function 4
* Now that I have my aggregated data, I want to visualize my results.
* `style_df` takes my pandas dataframe and makes it look a bit sleeker.

In [None]:
_starterkit_utils.style_df(aggregated_df)

#### Function 5 
* This is a function that creates a chart showing the scores by metric for each project.

In [None]:
d2_wide_df = df2.loc[df2["CalTrans District"] == 2].reset_index(drop=True)

In [None]:
_starterkit_utils.create_metric_chart(d2_wide_df)

### Your turn
* Create your own `.py` file with your own functions. 
* Make sure to separate out functions by theme:
    * One function that loads the dataset and does some light cleaning.
    * One (or more) functions that transform your dataframe.
        * `melt()`, `.T`, `.groupby()` are just some of the many options available through `pandas`.
    * One (or more) functions that visualize your dataframe.
        * Could be a chart, a styled dataframe, a wordcloud. 
* Other things to consider
    * [CalTrans Districts are currently integers, but they have actual names that can be mapped.](https://cwwp2.dot.ca.gov/documentation/district-map-county-chart.htm) 
    * Are the currency columns formatted with $ and commas?
    * Are all the scores formatted with the same number of decimals?
    * Are the string columns formatted with the right punctuation and capitalization?
* Please note, you will be using these functions for Exercise 5. Make sure your functions are on the <b>district</b> grain. 
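
A couple of the formatting ideas above could be sketched as follows. The helper names are hypothetical, and the district mapping is deliberately partial and uses office cities as I understand them; confirm every entry against the linked district chart before relying on it.

```python
import pandas as pd

# Partial, illustrative mapping only. Verify and complete it from the linked chart.
DISTRICT_NAMES = {3: "Marysville", 4: "Oakland", 7: "Los Angeles", 11: "San Diego"}


def add_district_name(df: pd.DataFrame, district_col: str = "CalTrans District") -> pd.DataFrame:
    """Add a readable district label; unmapped districts keep their number as text."""
    df = df.copy()
    df["District Name"] = (
        df[district_col].map(DISTRICT_NAMES).fillna(df[district_col].astype(str))
    )
    return df


def format_currency(value: float) -> str:
    """Format a number with a dollar sign, commas, and two decimals."""
    return f"${value:,.2f}"
```

For example, `format_currency(1234567.891)` returns `'$1,234,567.89'`.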

## Markdown/Display
* Although our code is now neatly stored in a Python script, a Jupyter Notebook on its own is a bit plain, even when we have beautiful charts. 
* There are many ways to jazz it up.

### Images
* You can add an image in a markdown cell
`<img src="https://raw.githubusercontent.com/cal-itp/data-analyses/refs/heads/main/portfolio/Calitp_logo_MAIN.png" width=100 height=100 />`<p>
<img src="https://raw.githubusercontent.com/cal-itp/data-analyses/refs/heads/main/portfolio/Calitp_logo_MAIN.png" width=100 height=100 />
* You can add an image in a code cell if you import the packages below.

In [None]:
from IPython.display import HTML, Image, Markdown, display, display_html

In [None]:
display(Image(filename="./19319_en_1.jpg", retina=True))

### Display
* Of course, you can write your narratives in a Markdown cell like what I'm doing right now.
* However, what do you do if you want to incorporate values from your dataframe into the narrative?
* Typing the values out by hand isn't the best idea. If the values change, you'll have to rewrite your narrative.
* The best way is to use `display` and `Markdown` like below.
* We are using District 3 as an example.

#### No hard coding
* Save each value you want to report into its own variable, especially if you need to manipulate it first.

In [None]:
d3_df = df.loc[df["CalTrans District"] == 3].reset_index(drop=True)

In [None]:
d3_median_score = d3_df["Overall Score"].median()

In [None]:
d3_total_projects = d3_df["Project Name"].nunique()

In [None]:
d3_max_project = d3_df["Project Cost"].max()

In [None]:
d3_max_project = f"${d3_max_project:,.2f}"

In [None]:
d3_max_project


* The f-string is wrapped in triple quotation marks. This allows you to write an f-string that spans multiple lines.

In [None]:
display(
    Markdown(
        f"""<h3>District 3</h3>
        The median score for projects in District 3 is <b>{d3_median_score}</b><br> 
        The total number of projects is <b>{d3_total_projects}</b><br>
        The most expensive project costs <b>{d3_max_project}</b>
        """
    )
)

* You can also run code in the same cell. Here I'm filtering for District 3 values.
* Notice the header went from `<h3>` to `<h4>`. You can also swap it out for `<h2>` or `<h1>`.

In [None]:
display(
    Markdown(
        f"""<h4>Metric Scores</h4>
        """
    )
)
d3_wide_df = df2.loc[df2["CalTrans District"] == 3].reset_index(drop=True)
display(_starterkit_utils.create_metric_chart(d3_wide_df))

### This can be a function too
* What if I wanted to generate these narratives for every district?
* I can simply turn this into a function.
* Remember to look at the code in `_starterkit_utils.py`

* I only want to print out a couple of districts, or else this notebook will become too large.

In [None]:
for district in range(10, 13):
    _starterkit_utils.create_district_summary(df, district)
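
If you want a starting point for your own version, the District 3 narrative above could be wrapped up roughly like this. It is a sketch under assumptions, not the actual `create_district_summary` code: the function name `district_summary_text` is hypothetical, and it only builds the text so that it is easy to check.

```python
import pandas as pd


def district_summary_text(df: pd.DataFrame, district: int) -> str:
    """Build the HTML/Markdown narrative for one district from the dataframe."""
    district_df = df.loc[df["CalTrans District"] == district]
    median_score = district_df["Overall Score"].median()
    total_projects = district_df["Project Name"].nunique()
    max_cost = district_df["Project Cost"].max()
    return (
        f"<h3>District {district}</h3>\n"
        f"The median score for projects in District {district} is <b>{median_score:.2f}</b><br>\n"
        f"The total number of projects is <b>{total_projects}</b><br>\n"
        f"The most expensive project costs <b>${max_cost:,.2f}</b>"
    )


# Toy data, made up for illustration.
toy_df = pd.DataFrame(
    {
        "CalTrans District": [10, 10, 11],
        "Project Name": ["A", "B", "C"],
        "Overall Score": [6.0, 8.0, 5.0],
        "Project Cost": [100.0, 300.0, 50.0],
    }
)
print(district_summary_text(toy_df, 10))
```

In a notebook, you would pass the returned string to `display(Markdown(...))` inside the loop so that each district renders as formatted text.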