# Exercise 2: Merging, Aggregating, Filtering, and Visualizing

In [1]:
import altair as alt
import pandas as pd

In [2]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

* Read back in the `parquet` file containing the `overall_score` column you created in Exercise 1.
* Read the Excel sheet containing the project information (scope of work, district, and project name).
* Use f-strings.

In [3]:
GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/starter_kit/"

In [4]:
EXCEL_FILE = "starter_kit_csis_scoring_workbook.xlsx"

In [5]:
OVERALL_SCORE_FILE = "starter_kit_example_final_scores.parquet"

In [6]:
projects_df = pd.read_excel(f"{GCS_FILE_PATH}{EXCEL_FILE}")

In [7]:
projects_df.head(2)

Unnamed: 0,ct_district,project_name,Scope of Work
0,10,Meadow Magic Multi-Use Path,"A 2-mile Class I bike lane and multi-use path through a scenic meadow, featuring wildflower plantings, public art installations, and educational signage highlighting local wildlife."
1,8,Bunny Hop Bike Boulevard,"A Class II bike lane with charming streetlights, benches, and bike racks designed to resemble carrot sticks, connecting residential neighborhoods to local schools and parks."


In [8]:
overall_scores_df = pd.read_parquet(f"{GCS_FILE_PATH}{OVERALL_SCORE_FILE}")

In [9]:
overall_scores_df.head(2)

Unnamed: 0,project_name,accessibility_score,dac_accessibility_score,dac_traffic_impacts_score,freight_efficiency_score,freight_sustainability_score,mode_shift_score,lu_natural_resources_score,safety_score,vmt_score,zev_score,public_engagement_score,climate_resilience_score,program_fit_score,overall_score
0,Meadow Magic Multi-Use Path,10,3,4,8,3,6,10,9,2,4,5,2,2,68
1,Bunny Hop Bike Boulevard,8,9,5,8,7,8,10,8,5,1,1,3,9,82


## Merging 
* Your manager asks you to aggregate the findings by district:
    * Average overall score
    * Max score
    * Min score
    * Number of unique projects
* Annoyingly enough, the `overall_score` column and the `ct_district` column are in two different dataframes. You'll have to merge them.
* Welcome to DDS! This will happen to you all the time starting now. 

### Food for thought 
* Which columns do the two dataframes have in common?
    * You can merge on more than one column. In fact, it's best practice to! 
* What type of merge will achieve my goal?
    * Inner, outer, left, or right
* What do I expect out of the merge?
    * Do I expect all the values of the merge keys to be 1:1? Or m:1? 
    * Do I expect a project to correspond with multiple districts? Maybe. Projects can and do cross multiple boundaries.
    * Do I expect a project to correspond with only one total cost estimate value? Yes, there shouldn't be multiple cost estimates for the same project!
* How do I go about checking the data after the merge?
    * Which arguments are available to help me per the [docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)?
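
The questions above map onto two arguments of `pd.merge`: `validate` and `indicator`. Here is a minimal sketch on toy data; the dataframes and their values are hypothetical stand-ins, not the exercise files. It assumes each project should appear once on each side (`1:1`).

```python
import pandas as pd

# Hypothetical stand-ins for the two dataframes
projects = pd.DataFrame({"project_name": ["A", "B", "C"], "ct_district": [1, 2, 2]})
scores = pd.DataFrame({"project_name": ["A", "B", "D"], "overall_score": [70, 80, 90]})

# indicator=True adds a _merge column that says where each row came from
merged = pd.merge(
    projects,
    scores,
    on=["project_name"],
    how="outer",
    validate="1:1",
    indicator=True,
)
print(merged["_merge"].value_counts())

# validate raises a MergeError when the keys are not unique as promised
duplicated_scores = pd.concat([scores, scores.iloc[[0]]])
caught = False
try:
    pd.merge(projects, duplicated_scores, on=["project_name"], validate="1:1")
except pd.errors.MergeError as err:
    caught = True
    print(f"Merge check failed: {err}")
```

If you expect several projects per key on one side, `validate="m:1"` or `validate="1:m"` would be the matching check.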

In [10]:
m1 = pd.merge(projects_df, overall_scores_df, on=["project_name"])

### Double Checking
* How many rows do you expect?
* How many unique projects are there? 
* Hint: check your original dataframes as well
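
One possible way to run these checks is to print the row count and the number of unique projects for each dataframe side by side. The sketch below uses hypothetical toy data, so the numbers will differ from yours.

```python
import pandas as pd

# Hypothetical toy data: one project on each side has no match
left = pd.DataFrame({"project_name": ["A", "B", "C"], "ct_district": [1, 2, 2]})
right = pd.DataFrame({"project_name": ["A", "B", "D"], "overall_score": [70, 80, 90]})
inner = pd.merge(left, right, on=["project_name"], how="inner")

# Compare the originals against the merged result
for label, df in {"left": left, "right": right, "inner merge": inner}.items():
    n_unique = df["project_name"].nunique()
    print(f"{label}: {len(df)} rows, {n_unique} unique projects")
```

If the merged dataframe has fewer rows than you expected, some keys did not match.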

#### The Beauty of Outer Joins 
* To save you some grief and time, `outer` joins are very useful.
* Merge your dataframes again using an `outer` join with `indicator=True`.
* Using `.value_counts()`, check how many rows are found in both dataframes, the left only, and the right only.

In [11]:
m2 = pd.merge(
    projects_df, overall_scores_df, on=["project_name"], indicator=True, how="outer"
)

In [12]:
m2._merge.value_counts()

both          26
left_only      3
right_only     3
Name: _merge, dtype: int64

### Filtering
* Filter for only the rows with `left_only` and `right_only` values.
* AH note: link to  docs page with tutorial.

In [13]:
m2.loc[m2._merge != "both", ["project_name", "_merge"]]

Unnamed: 0,project_name,_merge
10,Rainbow Rush hot Lanes,left_only
12,Bunny Lane HOV+2 heaven,left_only
26,main street muffin top,left_only
29,Rainbow Rush HOT Lanes,right_only
30,Bunny Lane HOV+2 Haven,right_only
31,Main Street Muffin Top Revitalization,right_only


### Dictionaries: An Introduction 
* String data is often entered in many different ways. BART can be entered as bart, Bay Area Rapid Transit, BaRT, and more. 
* Take a look at why these projects are not merging. 
* In Excel, it's easy to go in and manually tweak everything. However, that is not reproducible. 
* Since there are essentially only a couple of names to replace, we can do it using a dictionary.
* Decide whether you want to rename the values in the left dataframe or the right one. 
    * AH: Link to docs
    * Explain what a dictionary is
* Take a look at elements such as:
    * Trailing white spaces
    * Capitalization
    * Spelling
    * Symbols
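
A dictionary stores pairs of keys and values inside curly braces, written as `{key: value}`. When you pass one to `.replace()`, pandas swaps every value that exactly matches a key for that key's value. Here is a minimal sketch using the BART spellings mentioned above; "Muni" is a hypothetical extra value added to show that unmatched entries are left alone.

```python
import pandas as pd

# Keys are the messy spellings, values are the name we want to keep
agency_names = {
    "bart": "BART",
    "Bay Area Rapid Transit": "BART",
    "BaRT": "BART",
}

# Look up a single key
print(agency_names["bart"])

# Replace every matching value in a column
agencies = pd.Series(["bart", "BaRT", "Bay Area Rapid Transit", "Muni"])
cleaned = agencies.replace(agency_names)
print(cleaned.tolist())
```

The match has to be exact, so a key that is missing a trailing space or uses different capitalization will not replace anything.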

In [14]:
# I highly recommend you use .unique() to find the project names.
# Often there are trailing white spaces that are invisible to the human eye.
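
As a rough illustration of that recommendation, converting the result of `.unique()` to a list prints each value with quotes around it, which makes stray spaces easier to spot. The names below are hypothetical, and the whitespace cleanup is one optional approach rather than part of the exercise.

```python
import pandas as pd

# Hypothetical names: trailing space, clean value, double space plus lowercase
names = pd.Series(["Main Street ", "Main Street", "main  street"])

# Quotes around each value reveal the extra spaces
print(names.unique().tolist())

# Strip the ends and collapse repeated spaces into one
normalized = names.str.strip().str.replace(r"\s+", " ", regex=True)
print(normalized.unique().tolist())
```

Whitespace cleanup alone does not fix capitalization or spelling, which is where a dictionary of replacements still helps.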

In [15]:
new_names = {
    "main street muffin top ": "Main Street Muffin Top Revitalization",
    "Bunny Lane HOV+2 heaven": "Bunny Lane HOV+2 Haven",
    "Rainbow Rush hot  Lanes": "Rainbow Rush HOT Lanes",
}

In [16]:
projects_df.project_name = projects_df.project_name.replace(new_names)

* Merge your dataframes again. This time it should work.
* Please also specify the merge type and the columns. 
* Although pandas applies defaults for these automatically, it's good practice to write everything out.

In [17]:
final_m = pd.merge(projects_df, overall_scores_df, how="inner", on="project_name")

## Groupby
* You're done merging... Oh wait, that wasn't even part of your manager's request. You still need to aggregate. 
* There are many options. Some are `groupby / agg`, `pivot_table`, and `groupby / transform`.
* Hint: rename these columns to be descriptive because they no longer hold each project's `overall_score`.
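
One option worth knowing is named aggregation, where `.agg()` takes `new_column=(source_column, function)` pairs, so the descriptive names are set in the same step. The sketch below uses hypothetical toy scores and computes the same statistics as this exercise (median, min, max, and number of unique projects).

```python
import pandas as pd

# Hypothetical toy scores
scores = pd.DataFrame(
    {
        "ct_district": [1, 1, 2],
        "project_name": ["A", "B", "C"],
        "overall_score": [60, 80, 70],
    }
)

# Each keyword becomes an output column: name=(column, aggregation)
summary = (
    scores.groupby("ct_district")
    .agg(
        median_score=("overall_score", "median"),
        min_score=("overall_score", "min"),
        max_score=("overall_score", "max"),
        n_projects=("project_name", "nunique"),
    )
    .reset_index()
)
print(summary)
```

This avoids copying `overall_score` into extra columns before grouping, although both approaches should give the same table.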

In [18]:
final_m["min_score"] = final_m.overall_score

In [19]:
final_m["max_score"] = final_m.overall_score

In [72]:
agg1 = (
    final_m.groupby(["ct_district"])
    .agg(
        {
            "overall_score": "median",  # the median serves as the "average" score here
            "min_score": "min",
            "max_score": "max",
            "project_name": "nunique",
        }
    )
    .reset_index()
)

In [73]:
agg1 = agg1.rename(
    columns={"overall_score": "median_score", "project_name": "n_projects"}
)

In [74]:
agg1

Unnamed: 0,ct_district,median_score,min_score,max_score,n_projects
0,1,69.0,68,81,3
1,2,60.5,55,66,2
2,3,72.0,68,74,3
3,4,69.0,63,83,3
4,5,73.0,73,73,1
5,6,64.0,64,64,1
6,7,78.0,78,78,1
7,8,82.0,80,89,3
8,9,74.0,65,81,5
9,10,72.0,68,76,3


## Visualizing 
* You're done aggregating, but the dataframe looks objectively plain.
* Unfortunately in the world of data, looks <b>do</b> matter. 
* Let's explore a couple of ways to present your data.

### Styling a Dataframe
* `pandas` has quite a few options that allow you to style your dataframe.
* [This tutorial](https://betterdatascience.com/style-pandas-dataframes/) offers some great ways to jazz up your dataframe.
* You can always read the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html) for more ideas.
* Some ideas:
    * Change the font
    * Turn off the index
    * Use colors to indicate low-high values
    * Change the alignment of the values

### Altair
* While a table is great, sometimes the stakeholder prefers a chart. 
* Our preferred visualization library is `Altair`, although there are other options.
    * Their docs page is [here](https://altair-viz.github.io/).
* The code to create a simple bar chart goes something like this. 
    * `alt.Chart(source).mark_bar().encode(x='a',y='b')`
    * `source` is the dataframe you want to use for your chart.
    * `x` denotes the column you are plotting on the X-axis. Make sure your column name has quotation marks around it. 
    * `y` denotes the column you are plotting on the Y-axis. 
* If you want a line chart, simply swap out `.mark_bar()` for `.mark_line()`.
    * `alt.Chart(source).mark_line().encode(x='x',y='f(x)')`
* <b>Make your first chart below.</b>

In [23]:
alt.Chart(agg1).mark_bar().encode(x="ct_district", y="n_projects")

#### Customizing
* `altair` offers endless ways to amp up the personality of your chart.
* Additionally, a chart without a title and legend, like the one above, is a data visualization "taboo", and the dull Facebook blue is uninspiring. 

#### Add a title

In [25]:
alt.Chart(agg1, title="your_title_here").mark_bar().encode(
    x="ct_district", y="n_projects"
)

#### Add some color
* Explain our calitp_color_palette.

In [35]:
from calitp_data_analysis import calitp_color_palette

In [39]:
# To see what is inside a module, put two question marks after its name
# From there, you can choose another color palette
# calitp_color_palette??

In [40]:
alt.Chart(agg1, title="your_title_here").mark_bar().encode(
    x="ct_district",
    y="n_projects",
    color=alt.Color(
        "n_projects",  # This is the column the color of your bar is based on
        title="legend_title_here",  # This is the title of your legend
        scale=alt.Scale(
            range=calitp_color_palette.CALITP_DIVERGING_COLORS
        ),  # This is where you can customize the colors
    ),
)

#### Adjusting the Axis

In [70]:
ct_districts = {
    1: "D1",
    2: "D2",
    3: "D3",
    4: "D4",
    5: "D5",
    6: "D6",
    7: "D7",
    8: "D8",
    9: "D9",
    10: "D10",
    11: "D11",
    12: "D12",
}

In [76]:
agg1["ct_district"] = agg1["ct_district"].replace(ct_districts)

In [81]:
alt.Chart(agg1, title="your_title_here").mark_bar().encode(
    x=alt.X("ct_district"),
    y=alt.Y("n_projects", scale=alt.Scale(domain=[0, 5])),
    color=alt.Color(
        "n_projects",
        title="legend_title_here",
        scale=alt.Scale(range=calitp_color_palette.CALITP_DIVERGING_COLORS),
    ),
)

### Finishing Touches 
* Sizing
* Tooltip
* Remapping
* Saving

In [83]:
alt.Chart(agg1, title="your_title_here").mark_bar(size=20).encode(
    x=alt.X("ct_district"),
    y=alt.Y("n_projects", scale=alt.Scale(domain=[0, 5])),
    color=alt.Color(
        "n_projects",
        title="legend_title_here",
        scale=alt.Scale(range=calitp_color_palette.CALITP_DIVERGING_COLORS),
    ),
    tooltip=list(agg1.columns),
).properties(width=400, height=250)

### We have only visualized one column of data. 
* There are still a few more columns. Make a couple of other charts using your code above. 

### Save your work