# Exercise 2: Merging, Aggregating, Filtering, and Visualizing

In [None]:
import altair as alt
import pandas as pd

In [None]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

* Read back in the `parquet` file with the `overall_score` you created from exercise 1.
* Read the Excel sheet containing the project information (scope of work, district, and project name).
* Use f-strings.
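
A minimal sketch of how an f-string builds a file path. The folder and file name below are placeholders for illustration, not the real exercise paths.

```python
# Hypothetical folder and file name, for illustration only.
folder = "gs://my-bucket/my-folder/"
file_name = "my_file.parquet"

# Anything inside the curly braces is replaced with the variable's value.
full_path = f"{folder}{file_name}"
print(full_path)
```

Keeping the folder and the file name in separate variables means you only have to change one place if either moves.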

In [None]:
GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/starter_kit/"

In [None]:
EXCEL_FILE = "starter_kit_csis_scoring_workbook.xlsx"

In [None]:
OVERALL_SCORE_FILE = "starter_kit_example_final_scores.parquet"

In [None]:
projects_df = pd.read_excel(f"{GCS_FILE_PATH}{EXCEL_FILE}")

In [None]:
projects_df.head(2)

In [None]:
overall_scores_df = pd.read_parquet(f"{GCS_FILE_PATH}{OVERALL_SCORE_FILE}")

In [None]:
overall_scores_df.head(2)

## Merging 
* Your manager asks you to aggregate the findings by district:
    * Average overall score
    * Max score
    * Min score
    * Number of unique projects
* Annoyingly enough, the `overall_score` column and the `ct_district` column are in two different dataframes. You'll have to merge them.
* Welcome to DDS! This will happen to you all the time starting now.

### Food for thought 
* Which columns do the two dataframes have in common?
    * You can merge on more than one column. In fact, it's best practice to! 
* What type of merge will achieve my goal?
    * Inner, outer, left, or right
* What do I expect out of the merge?
    * Do I expect all the values of the merge keys to be 1:1? Or m:1? 
    * Do I expect a project to correspond with multiple districts? Maybe; projects can and do cross multiple boundaries.
    * Do I expect a project to correspond with only one total cost estimate value? Yes, there shouldn't be multiple cost estimates for the same project!
* How do I go about checking the data after the merge?
    * Which arguments are available to help me per the [docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)?
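
Two `merge` arguments that help with these questions are `validate` and `indicator`. The sketch below uses made-up toy dataframes (not the exercise data) and assumes you expect each project name to appear once on each side.

```python
import pandas as pd

# Toy data, not the exercise data.
left = pd.DataFrame({"project_name": ["A", "B", "C"], "ct_district": [1, 2, 3]})
right = pd.DataFrame({"project_name": ["A", "B", "D"], "overall_score": [10, 20, 30]})

# validate raises a MergeError if the merge keys are not 1:1 as expected.
# indicator adds a _merge column recording where each row came from.
checked = pd.merge(
    left,
    right,
    on=["project_name"],
    how="left",
    validate="1:1",
    indicator=True,
)
print(checked)
```

If you expect many rows on the left to match one row on the right, `validate="m:1"` expresses that instead.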

In [None]:
m1 = pd.merge(projects_df, overall_scores_df, on=["project_name"])

### Double Checking
* How many rows do you expect?
* How many unique projects are there? 
* Hint: check your original dataframes as well
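
One way to run these checks is to compare row counts and unique project counts before and after the merge. This is a sketch on toy dataframes; the names `left`, `right`, and `merged` are placeholders.

```python
import pandas as pd

# Toy data, not the exercise data.
left = pd.DataFrame({"project_name": ["A", "B", "C"], "ct_district": [1, 2, 3]})
right = pd.DataFrame({"project_name": ["A", "B", "D"], "overall_score": [10, 20, 30]})

# pd.merge performs an inner join unless told otherwise.
merged = pd.merge(left, right, on=["project_name"])

# Print the number of rows and unique projects for each dataframe.
for label, df in [("left", left), ("right", right), ("merged", merged)]:
    print(label, len(df), df["project_name"].nunique())
```

A merged dataframe with fewer unique projects than the originals is a sign that some names did not match.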

#### The Beauty of Outer Joins 
* To save you some grief and time, `outer` joins are very useful.
* Merge your dataframes again using an `outer` join with `indicator=True`.
* Using `.value_counts()`, check how many rows are found in both dataframes, in the left only, and in the right only.

In [None]:
m2 = pd.merge(
    projects_df, overall_scores_df, on=["project_name"], indicator=True, how="outer"
)

In [None]:
m2._merge.value_counts()

### Filtering
* Filter for only the rows marked `left_only` or `right_only`.
* AH note: link to  docs page with tutorial.

In [None]:
m2.loc[m2._merge != "both", ["project_name", "_merge"]]

### Dictionaries: An Introduction 
* String data is often entered in many different ways. BART can be entered in as bart, Bay Area Rapid Transit, BaRT, and more. 
* Take a look at why these projects are not merging.
* In Excel, it's easy to go in and manually tweak everything. However, that is not reproducible. 
* Since there are essentially only a couple of names to replace, we can do it using a dictionary.
* Decide whether you want to rename the values in the left dataframe or the right one. 
    * AH: Link to docs
    * Explain what a dictionary is
* Take a look at elements such as:
    * Trailing white spaces
    * Capitalization
    * Spelling
    * Symbols
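
A dictionary maps keys to values. When you pass one to `.replace()`, each value matching a key is swapped for that key's value. The sketch below uses made-up names (not the real project names) and strips whitespace before replacing, which is one possible order of operations.

```python
import pandas as pd

# Toy values, not the real project names.
names = pd.Series(["bart ", "Bay Area Rapid Transit", "BART"])

# repr() exposes trailing spaces that a plain print hides.
print([repr(n) for n in names.unique()])

# Keys are the values as currently written; values are the replacements.
fixes = {"bart": "BART", "Bay Area Rapid Transit": "BART"}

# Strip leading/trailing whitespace first, then swap whole values using the dictionary.
cleaned = names.str.strip().replace(fixes)
print(cleaned.tolist())
```

Stripping whitespace first means the dictionary keys do not need to reproduce invisible trailing spaces.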

In [None]:
# I highly recommend you use .unique() to find the project names.
# Often there are trailing white spaces that are invisible to the human eye.

In [None]:
new_names = {
    "main street muffin top ": "Main Street Muffin Top Revitalization",
    "Bunny Lane HOV+2 heaven": "Bunny Lane HOV+2 Haven",
    "Rainbow Rush hot  Lanes": "Rainbow Rush HOT Lanes",
}

In [None]:
projects_df.project_name = projects_df.project_name.replace(new_names)

* Merge your dataframes again. This time it should work.
* Please also specify the merge type and the columns. 
* Although pandas defaults to an inner merge on the columns the dataframes share, it's good practice to write everything out.


In [None]:
final_m = pd.merge(projects_df, overall_scores_df, how="inner", on="project_name")

In [None]:
# Save this dataframe as a parquet to GCS
final_m.to_parquet(f"{GCS_FILE_PATH}starter_kit_merge.parquet")

## Groupby
* You're done merging...Oh wait, that wasn't even part of your manager's request. You still need to aggregate. 
* There are many options. Some are `groupby / agg`, `pivot_table`, and `groupby / transform`.
* Hint: rename these columns to be descriptive because we are no longer looking at the `overall_scores`
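
One way to get descriptive column names in a single step is pandas' named aggregation, where each output column is written as `new_name=(column, function)`. This is a sketch on toy data, offered as an alternative rather than the expected answer; the output names are only suggestions.

```python
import pandas as pd

# Toy data, not the exercise data.
scores = pd.DataFrame(
    {
        "ct_district": [1, 1, 2],
        "project_name": ["A", "B", "C"],
        "overall_score": [10, 30, 50],
    }
)

# Each keyword is the output column name; the tuple is (input column, function).
summary = (
    scores.groupby("ct_district")
    .agg(
        mean_score=("overall_score", "mean"),
        min_score=("overall_score", "min"),
        max_score=("overall_score", "max"),
        n_projects=("project_name", "nunique"),
    )
    .reset_index()
)
print(summary)
```

This avoids copying `overall_score` into extra columns just to aggregate it more than once.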

In [None]:
final_m["min_score"] = final_m.overall_score

In [None]:
final_m["max_score"] = final_m.overall_score

In [None]:
agg1 = (
    final_m.groupby(["ct_district"])
    .agg(
        {
            "overall_score": "median",  # the median serves as the "average" here; use "mean" for the arithmetic mean
            "min_score": "min",
            "max_score": "max",
            "project_name": "nunique",
        }
    )
    .reset_index()
)

In [None]:
agg1 = agg1.rename(
    columns={"overall_score": "median_score", "project_name": "n_projects"}
)

In [None]:
agg1

## Visualizing 
* You're done aggregating, but the dataframe looks objectively plain.
* Unfortunately in the world of data, looks <b>do</b> matter. 
* Let's explore a couple of ways to present your data.

### Styling a Dataframe
* `pandas` has quite a few options that allow you to style your dataframe.
* [This tutorial](https://betterdatascience.com/style-pandas-dataframes/) offers some great ways to jazz up your dataframe.
* You can always read the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html) for more ideas.
* Some ideas:
    * Change the font
    * Turn off the index
    * Use colors to indicate low-high values
    * Change the alignment of the values

### Altair
* While a table is great, sometimes the stakeholder prefers a chart. 
* Our preferred visualization library is `Altair`, although there are other options.
    * Their docs page is [here](https://altair-viz.github.io/).
* The code to create a simple bar chart goes something like this. 
    * `alt.Chart(source).mark_bar().encode(x='a',y='b')`
    * `source` is the dataframe you want to use for your chart.
    * `x` denotes the column you are plotting on the X-axis. Make sure your column name has quotation marks around it. 
    * `y` denotes the column you are plotting on the Y-axis. 
* If you want a line chart, simply swap out `.mark_bar()` for `.mark_line()`.
    * `alt.Chart(source).mark_line().encode(x='x',y='f(x)')`
* <b>Make your first chart below.</b>

In [None]:
alt.Chart(agg1).mark_bar().encode(x="ct_district", y="n_projects")

#### Customizing
* `altair` offers endless ways to amp up the personality of your chart.
* Additionally, the chart above without a title and legend is a data visualization "taboo" and the dull Facebook blue is uninspiring. 

#### Add a title
* You can do so within `.Chart()`

In [None]:
alt.Chart(agg1, title="your_title_here").mark_bar().encode(
    x="ct_district", y="n_projects"
)

#### Add some color
* Explain our calitp_color_palette.

In [None]:
from calitp_data_analysis import calitp_color_palette

In [None]:
# To see what is inside a module, append two question marks to its name
# From here, you can choose another color palette
# calitp_color_palette??

In [None]:
alt.Chart(agg1, title="your_title_here").mark_bar().encode(
    x="ct_district",
    y="n_projects",
    color=alt.Color(
        "n_projects",  # This is the column you want the color of your bar to be based on
        title="legend_title_here",  # This is the title of your legend
        scale=alt.Scale(
            range=calitp_color_palette.CALITP_DIVERGING_COLORS
        ),  # This is where you can customize the colors
    ),
)

#### Adjusting the Axis
* Axis domain
* Axis values:
    * Are Caltrans districts integers or strings?

In [None]:
ct_districts = {
    1: "D1",
    2: "D2",
    3: "D3",
    4: "D4",
    5: "D5",
    6: "D6",
    7: "D7",
    8: "D8",
    9: "D9",
    10: "D10",
    11: "D11",
    12: "D12",
}

In [None]:
agg1["ct_district"] = agg1["ct_district"].replace(ct_districts)

In [None]:
alt.Chart(agg1, title="your_title_here").mark_bar().encode(
    x=alt.X("ct_district"),
    y=alt.Y("n_projects", scale=alt.Scale(domain=[0, 5])),
    color=alt.Color(
        "n_projects",
        title="legend_title_here",
        scale=alt.Scale(range=calitp_color_palette.CALITP_DIVERGING_COLORS),
    ),
)

### Finishing Touches 
* Sizing
* Tooltip
* Saving to a png

In [None]:
alt.Chart(agg1, title="your_title_here").mark_bar(size=20).encode(
    x=alt.X("ct_district"),
    y=alt.Y("n_projects", scale=alt.Scale(domain=[0, 5])),
    color=alt.Color(
        "n_projects",
        title="legend_title_here",
        scale=alt.Scale(range=calitp_color_palette.CALITP_DIVERGING_COLORS),
    ),
    tooltip=list(agg1.columns),
).properties(width=400, height=250)

### We have only visualized one column of data. 
* The aggregated dataframe above has a couple of other columns to explore.
* Make a few other charts in different styles. Altair's [gallery](https://altair-viz.github.io/gallery/index.html) is a great resource to kick off your chart-making career. 