# Exercise 4: Python Scripts, Concept of Grains, Display, Markdown
* After cleaning and analyzing data, it's time to present the data in a beautiful fashion.
* At DDS, we often present our work directly in a Jupyter Notebook, which has many benefits, such as:
    * We save the time it takes to copy and paste our graphs into a PowerPoint.
    * We ensure the accuracy of the data since we aren't manually retyping the data.

In [None]:
import _starterkit_utils
import altair as alt
import numpy as np
import pandas as pd
from calitp_data_analysis import calitp_color_palette

In [None]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

## Python Scripts
* Up until now, we have been placing all of our code in the Jupyter Notebook.
* While this is convenient, it's not the best practice.
* A notebook full of code also isn't easy for viewers to read - it gets chaotic, quickly!
* **The best solution is to move the bulk of your code to a Python script once you have reached a stopping point.**
* Read all about the benefits of scripts [here in our DDS docs](https://docs.calitp.org/data-infra/analytics_tools/scripts.html). Summary points from the docs page:
    * What are Python scripts?
        * <i>Python scripts (.py) are plain text files. Git tracks plain text changes easily.</i>
        * <i>Scripts are robust to scaling and reproducing work.</i>
        * <i>Break out scripts by concepts / stages</i>
        * <i>All functions used in scripts should have docstrings. Type hints are encouraged!</i>
    * Which components should a script contain?
        * <i>1 script for importing external data and changing it from shapefile/geojson/csv to parquet/geoparquet</i>
        * <i>If only using warehouse data or upstream warehouse data cached in GCS, can skip this first script</i>
        * <i>At least 1 script for data processing to produce processed output for visualization</i>
        * <i>Break out scripts by concepts / stages</i>
        * <i>Include data catalog, README for the project</i>
        * <i>All functions used in scripts should have docstrings. Type hints are encouraged!</i>
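
To make the docstring and type hint advice concrete, here is a minimal sketch of what one function in a script could look like. The function name `title_case_columns` and the toy dataframe are hypothetical and are not taken from `_starterkit_utils`.

```python
import pandas as pd


def title_case_columns(df: pd.DataFrame) -> pd.DataFrame:
    """
    Return a copy of df with snake_case column names rewritten
    for presentation, e.g. project_name becomes Project Name.
    """
    renamed = df.copy()
    renamed.columns = [col.replace("_", " ").title() for col in renamed.columns]
    return renamed


# Hypothetical toy data, not the starter kit dataset.
toy_df = pd.DataFrame({"project_name": ["A", "B"], "overall_score": [7.5, 9.0]})
presentable_df = title_case_columns(toy_df)
print(list(presentable_df.columns))
```

The type hints tell a reader what goes in and what comes out, and the docstring is the text that shows up when someone inspects the function from a notebook.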
### Sample Script 
* Making Python scripts is an art and not straightforward.
* I have already populated a `.py` file called `_starterkit_utils` with some sample functions.
* I imported my Python Script just like how I imported my other dependencies (Pandas, Altair, Numpy).

In [None]:
import _starterkit_utils

### Breakdown of the Sample Script
#### Function 1
* You can preview a function's source code by writing `script_name.function_name??`
* Following what the DDS docs say, I am creating a new function every time I process the data in another stage.
* I have one function that loads in my dataset.

In [None]:
_starterkit_utils.load_dataset??


* To use a function in a Script, write `name_of_your_script.name_of_the_function(whatever arguments)`
* Take a look at the column names: they are no longer in `snake_case` because I applied a function that capitalizes them properly.

In [None]:
df = _starterkit_utils.load_dataset()

#### Function 2
* After loading in the dataset from GCS, I am entering my second stage of processing the data.
* I am aggregating my dataframe by category. 

In [None]:
_starterkit_utils.aggregate_by_category??

In [None]:
aggregated_df = _starterkit_utils.aggregate_by_category(df)

In [None]:
aggregated_df
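
An aggregation step like this one could be sketched as follows. This is an illustration only, under the assumption of a toy dataframe with hypothetical `project_name`, `category`, and `overall_score` columns; it is not the source of `aggregate_by_category`.

```python
import pandas as pd


def aggregate_by_group(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    """Count unique projects and take the median score for each group."""
    return (
        df.groupby(group_col)
        .agg(
            total_projects=("project_name", "nunique"),
            median_score=("overall_score", "median"),
        )
        .reset_index()
    )


# Hypothetical toy data, not the starter kit dataset.
toy_df = pd.DataFrame(
    {
        "project_name": ["A", "B", "C"],
        "category": ["Transit", "Transit", "Bike"],
        "overall_score": [6.0, 8.0, 5.0],
    }
)
toy_summary = aggregate_by_group(toy_df, "category")
print(toy_summary)
```

Passing the grouping column in as an argument means the same function can be reused for other groupings without rewriting it.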

#### Function 3
* I want to process my data a second way by changing it from wide to long.
* [Read about wide to long.](https://www.statology.org/long-vs-wide-data/)
* [Pandas doc on melt](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html)

In [None]:
_starterkit_utils.wide_to_long??

In [None]:
df2 = _starterkit_utils.wide_to_long(df)

In [None]:
df2.head(2)
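
For reference, a wide-to-long reshape with `pd.melt` might look like the sketch below. The column names (`safety_score`, `equity_score`, `metric`, `score`) are made up for illustration and may differ from what `wide_to_long` actually uses.

```python
import pandas as pd

# Hypothetical wide data: one row per project, one column per metric.
wide_df = pd.DataFrame(
    {
        "project_name": ["A", "B"],
        "safety_score": [3, 5],
        "equity_score": [4, 2],
    }
)

# Long data: one row per project and metric combination.
long_df = pd.melt(
    wide_df,
    id_vars=["project_name"],
    value_vars=["safety_score", "equity_score"],
    var_name="metric",
    value_name="score",
)
print(long_df)
```

The two wide rows become four long rows, which is the shape most charting libraries prefer when you want one mark per metric.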

#### Function 4
* Now that I have my aggregated data, I want to visualize my results.
* `style_df` takes my pandas dataframe and makes it look a bit sleeker.

In [None]:
_starterkit_utils.style_df(aggregated_df)

#### Function 5 
* After aggregating and reshaping the data, the next function presents the data.
* This is a function that creates a chart that shows the scores by metric for each project.

In [None]:
_starterkit_utils.create_metric_chart??

In [None]:
_starterkit_utils.create_metric_chart(df2)

## Grains
* This is a light introduction to the concept of grains.
* Grain means the level your dataset is presented at.
* Basically, what does each row represent?
* The original dataset is presented on the project-level grain because each row represents a unique project. 

In [None]:
df[["Project Name", "Overall Score"]].head()

* If we aggregate the dataset using Caltrans District, then this dataset would be on the district grain.

In [None]:
df.groupby(["Caltrans District"]).agg({"Project Name": "nunique"}).reset_index().rename(
    columns={"Project Name": "Total Projects"}
)

* If we aggregate the dataset by lead agency, then this dataset would be on the agency grain.

In [None]:
df.groupby(["Lead Agency"]).agg({"Project Name": "nunique"}).reset_index().rename(
    columns={"Project Name": "Total Projects"}
)

* Grains can get very fine. The one below is the lead agency and category grain.

In [None]:
df.groupby(["Lead Agency", "Category"]).agg(
    {"Project Name": "nunique"}
).reset_index().rename(columns={"Project Name": "Total Projects"})
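
One way to confirm which grain a dataframe is on is to check whether the columns you believe define the grain uniquely identify each row. The helper below is a hypothetical sketch on toy data, not part of `_starterkit_utils`.

```python
import pandas as pd


def is_unique_at_grain(df: pd.DataFrame, grain_cols: list[str]) -> bool:
    """Return True if no two rows share the same values in grain_cols."""
    return not df.duplicated(subset=grain_cols).any()


# Hypothetical toy data, not the starter kit dataset.
toy_df = pd.DataFrame(
    {
        "lead_agency": ["City X", "City X", "County Y"],
        "category": ["Transit", "Bike", "Transit"],
        "project_name": ["A", "B", "C"],
    }
)

project_grain = is_unique_at_grain(toy_df, ["project_name"])
agency_grain = is_unique_at_grain(toy_df, ["lead_agency"])
agency_category_grain = is_unique_at_grain(toy_df, ["lead_agency", "category"])
print(project_grain, agency_grain, agency_category_grain)
```

Here the toy data is unique by project and by agency plus category, but not by agency alone, because City X appears in two rows.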

## Create your own Script
* **Make sure your functions make sense for the district grain. You will be using these functions for Exercise 5.**
* In your script, separate out functions by step like above. 
    * One function that loads the dataset and does some light cleaning.
    * One (or more) functions that transform your dataframe.
        * `melt()`, `.T`, `.groupby()` are just some of the many options available through `pandas`. 
    * One (or more) functions that visualize your dataframe.
        * Could be a chart, a styled dataframe, a wordcloud. 
* Other things to consider
    * Our [DDS Docs](https://docs.calitp.org/data-infra/publishing/sections/4_notebooks_styling.html#narrative) have a great guide on what "checkboxes" need to be "checked" when presenting data. The first 3 sections are the most relevant.
    * To summarize the docs, double check:
        * Are currency columns formatted with $ and commas?
        * Are all the scores formatted with the same number of decimals?
        * Are the string columns formatted with the right punctuation and capitalization?
        * Are the column names formatted properly? While `snake_case` is very handy when we are analyzing the dataframe, it is not very nice when presenting the data. We typically reverse the `snake_case` back to something like `Project Name`.
        * [Caltrans Districts are currently integers, but they have actual names that can be mapped.](https://cwwp2.dot.ca.gov/documentation/district-map-county-chart.htm) 
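
A few of the checks above could be handled in one formatting function, sketched below. The column names and the function name are hypothetical, and the district mapping is deliberately partial; please fill it in from the district chart linked above rather than relying on this sample.

```python
import pandas as pd

# Partial, illustrative mapping only. Verify names against the district chart.
DISTRICT_NAMES = {3: "Marysville", 4: "Oakland", 7: "Los Angeles"}


def format_for_presentation(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with display-friendly strings."""
    out = df.copy()
    # Currency with $ and commas.
    out["project_cost"] = out["project_cost"].map("${:,.0f}".format)
    # Same number of decimals for every score.
    out["overall_score"] = out["overall_score"].map("{:.2f}".format)
    # District integer plus its name.
    out["caltrans_district"] = out["caltrans_district"].map(
        lambda d: f"{d} - {DISTRICT_NAMES.get(d, 'Unknown')}"
    )
    return out


# Hypothetical toy data, not the starter kit dataset.
toy_df = pd.DataFrame(
    {
        "caltrans_district": [3, 7],
        "project_cost": [1500000.0, 250000.0],
        "overall_score": [7.0, 8.5],
    }
)
formatted_df = format_for_presentation(toy_df)
print(formatted_df)
```

Note that the formatted columns become strings, so it is usually best to apply this as the last step, after all calculations are done.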
   

## Markdown/Display
* Although our code is now neatly stored in a Python script, a Jupyter Notebook on its own is a bit plain, even when we have beautiful charts. 
* There are many ways to jazz it up.
* **Resource**: [Data Camp's Markdown Tutorial](https://www.datacamp.com/tutorial/markdown-in-jupyter-notebook)
### Images
#### In a Markdown Cell
* You can add an image in a markdown cell
`<img src="https://raw.githubusercontent.com/cal-itp/data-analyses/refs/heads/main/portfolio/Calitp_logo_MAIN.png" width=100 height=100 />`<p>
<img src="https://raw.githubusercontent.com/cal-itp/data-analyses/refs/heads/main/portfolio/Calitp_logo_MAIN.png" width=100 height=100 />
#### In a Code Cell
* You can add an image in a code cell if you import the packages below.

In [None]:
from IPython.display import HTML, Image, Markdown, display, display_html

In [None]:
display(Image(filename="./19319_en_1.jpg", retina=True))

### Display
* Of course, you can write your narratives in a Markdown cell like what I'm doing right now.
* However, what if you want to incorporate values from your dataframe into the narrative?
* Writing out the values manually in markdown locks you in. If the values change, you'll have to rewrite your narrative, which is time-consuming and error-prone.
* The best way is to use `display` and `Markdown` from `IPython.display`.
* We are using District 3 as an example.

#### No hard coding
* Save out your desired value into a new variable whenever you want to reference it in a narrative.

In [None]:
# Filter for D3
d3_df = df.loc[df["Caltrans District"] == 3].reset_index(drop=True)

In [None]:
# Find the median overall score
d3_median_score = d3_df["Overall Score"].median()

In [None]:
# Find total projects
d3_total_projects = d3_df["Project Name"].nunique()

In [None]:
# Find the cost of the most expensive project
d3_max_project = d3_df["Project Cost"].max()

In [None]:
# Format the cost so it reads like $1,000,000.00 instead of 1000000
d3_max_project = f"${d3_max_project:,.2f}"
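
The part after the colon in an f-string is a format specification. As a small sketch with a made-up number, here is how a few common specifications behave:

```python
# Hypothetical value for illustration.
cost = 1234567.891

# Comma as thousands separator, no decimals.
no_decimals = f"${cost:,.0f}"
# Comma as thousands separator, two decimals.
two_decimals = f"${cost:,.2f}"
# Percent with one decimal: multiplies by 100 and appends %.
share = f"{0.4567:.1%}"

print(no_decimals, two_decimals, share)
# prints: $1,234,568 $1,234,567.89 45.7%
```

Keeping the raw number in one variable and the formatted string in another can be handy if you still need the number for later calculations.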

#### Long F-String + Headers
* F-strings can be wrapped in triple quotation marks. This allows you to write an f-string that spans multiple lines.
* `<h3>` and `</h3>` display District 3 in a header.
    * Headers vary in size, 1 being the largest. 
* `<b></b>` bolds the text. 
    * `<s></s>` strikes the text.
* Notice that you always have to **close** your HTML with `</whatever_you_are_doing>`

In [None]:
display(
    Markdown(
        f"""<h3>District 3</h3>
        The median score for projects in District 3 is <b>{d3_median_score}</b><br> 
        The total number of projects is <b>{d3_total_projects}</b><br>
        <s>The most expensive project costs</s> <b>{d3_max_project}</b>
        """
    )
)

#### You can code in this cell. I'm filtering for District 3 values.
* Notice the header went from `<h3>` to `<h5>`, so it renders smaller.

In [None]:
display(
    Markdown(
        f"""<h5>Metric Scores</h5>
        """
    )
)
display(_starterkit_utils.create_metric_chart(df2))

### `Markdown` and `Display` can be worked into functions 
* What if I wanted to generate these reports for every district?
* I can simply turn this into a function.

In [None]:
_starterkit_utils.create_district_summary??

In [None]:
for district in range(10, 12):
    _starterkit_utils.create_district_summary(df, district)
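
As a rough idea of how such a function could be put together, here is a minimal sketch that builds the narrative for one district as a string. It is not the source of `create_district_summary`; the function name `build_district_summary` and the toy data are hypothetical, and only the column names are borrowed from the dataframe used above.

```python
import pandas as pd


def build_district_summary(df: pd.DataFrame, district: int) -> str:
    """Return an HTML snippet summarizing the projects in one district."""
    district_df = df.loc[df["Caltrans District"] == district]
    total_projects = district_df["Project Name"].nunique()
    median_score = district_df["Overall Score"].median()
    max_cost = district_df["Project Cost"].max()
    return (
        f"<h3>District {district}</h3>"
        f"Total projects: <b>{total_projects}</b><br>"
        f"Median overall score: <b>{median_score:.2f}</b><br>"
        f"Most expensive project cost: <b>${max_cost:,.2f}</b>"
    )


# Hypothetical toy data, not the starter kit dataset.
toy_df = pd.DataFrame(
    {
        "Caltrans District": [10, 10, 11],
        "Project Name": ["A", "B", "C"],
        "Overall Score": [6.0, 8.0, 5.0],
        "Project Cost": [1000000.0, 2500000.0, 750000.0],
    }
)

# One summary string per district.
summaries = {d: build_district_summary(toy_df, d) for d in [10, 11]}
print(summaries[10])
```

In a notebook, each string can then be passed to `display(Markdown(...))` inside the loop. Returning a string instead of displaying it inside the function keeps the function easy to test.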

## Your turn to combine all your functions into one function
* Take some inspiration from `_starterkit_utils.create_district_summary(df, district)`.
* Incorporate concepts from `Markdown` and `display` to create a polished report.