# Exercise 2: Merging, Aggregating, Filtering, and Visualizing

In [3]:
import altair as alt
import pandas as pd

In [4]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

## Read back in the two files you'll need using f-strings

In [5]:
GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/starter_kit/"

In [6]:
OG_FILE = "starter_kit_csis_scoring_workbook.xlsx"

In [19]:
OVERALL_SCORE_FILE = "final_scores.xlsx"
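An f-string builds the full path by substituting each `{name}` with that variable's value. The following is a minimal sketch that repeats two of the constants defined above so it can run on its own; `full_path` is a hypothetical name used only for illustration.

```python
# Constants copied from the cells above so this sketch is self-contained.
GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/starter_kit/"
OG_FILE = "starter_kit_csis_scoring_workbook.xlsx"

# The f prefix tells Python to replace each {name} with the variable's value.
# full_path is a hypothetical name, not part of the exercise.
full_path = f"{GCS_FILE_PATH}{OG_FILE}"
print(full_path)
```

Because the folder already ends in `/`, no separator is needed between the two placeholders.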

In [8]:
projects_df = pd.read_excel(f"{GCS_FILE_PATH}{OG_FILE}")

In [14]:
projects_df.head(2)

Unnamed: 0,ct_district,project_name,Scope of Work
0,10,Meadow Magic Multi-Use Path,"A 2-mile Class I bike lane and multi-use path through a scenic meadow, featuring wildflower plantings, public art installations, and educational signage highlighting local wildlife."
1,8,Bunny Hop Bike Boulevard,"A Class II bike lane with charming streetlights, benches, and bike racks designed to resemble carrot sticks, connecting residential neighborhoods to local schools and parks."


In [20]:
overall_scores_df = pd.read_excel(f"{GCS_FILE_PATH}{OVERALL_SCORE_FILE}")

In [21]:
overall_scores_df.head(2)

Unnamed: 0.1,Unnamed: 0,project_name,overall_score
0,0,Meadow Magic Multi-Use Path,136
1,1,Bunny Hop Bike Boulevard,164


## Merging 
* Your manager asks you to aggregate the average overall score, the max score, and the min score for each Caltrans District.
* Annoyingly enough, the `overall_score` column and the `ct_district` column are in two different dataframes. You'll have to merge them.
* Welcome to DDS! This will happen to you all the time starting now. 

### Food for thought 
* Which columns do the two dataframes have in common?
    * You can merge on more than one column. In fact, it's best practice to! 
* What type of merge will achieve my goal?
    * Inner, outer, left, or right
* What do I expect out of the merge?
    * Do I expect all the values of the merge keys to be 1:1? Or m:1? 
    * Do I expect a project to correspond with multiple districts? Maybe. Projects can and do cross district boundaries.
    * Do I expect a project to correspond with only one total cost estimate value? Yes, there shouldn't be multiple cost estimates for the same project!
* How do I go about checking the data after the merge?
    * Which arguments are available to help me per the [docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)?
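One argument worth knowing from the docs is `validate`, which checks the relationship you expect between the merge keys. The sketch below uses made-up toy dataframes (`left`, `right`) rather than the exercise data, and shows how a 1:1 expectation fails loudly when a key repeats.

```python
import pandas as pd

# Toy data, made up for illustration. "B" appears twice on the right side.
left = pd.DataFrame({"project_name": ["A", "B", "C"], "ct_district": [1, 2, 3]})
right = pd.DataFrame({"project_name": ["A", "B", "B"], "overall_score": [100, 120, 130]})

# validate="one_to_one" raises MergeError if either side repeats a key.
try:
    pd.merge(left, right, on="project_name", how="inner", validate="one_to_one")
    merge_ok = True
except pd.errors.MergeError:
    merge_ok = False

# "one_to_many" allows repeated keys on the right side only, so this succeeds.
merged = pd.merge(left, right, on="project_name", how="inner", validate="one_to_many")
print(merge_ok, len(merged))
```

Stating the expected relationship up front turns a silent row explosion into an error you can investigate.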

In [10]:
m1 = pd.merge(projects_df, overall_scores_df, on = ["project_name"])

### Double Checking
* How many rows do you expect?
* How many unique projects are there? 
* Hint: check your original dataframes as well
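A quick way to answer these questions is to compare row counts and unique key counts before and after the merge. This is a sketch on made-up toy data; the names `counts` and `missing_from_merge` are hypothetical.

```python
import pandas as pd

# Toy data, made up for illustration. "C" and "D" have no match.
left = pd.DataFrame({"project_name": ["A", "B", "C"], "ct_district": [1, 2, 3]})
right = pd.DataFrame({"project_name": ["A", "B", "D"], "overall_score": [100, 120, 130]})
merged = pd.merge(left, right, on="project_name", how="inner")

# Compare sizes of the inputs against the merged result.
counts = {
    "left_rows": len(left),
    "right_rows": len(right),
    "merged_rows": len(merged),
    "merged_unique_projects": merged["project_name"].nunique(),
}
print(counts)

# Projects present in the left dataframe that did not survive the merge.
missing_from_merge = set(left["project_name"]) - set(merged["project_name"])
print(missing_from_merge)
```

If the merged row count is lower than you expected, the set difference points to the keys that were dropped.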

#### The Beauty of Outer Joins 
* To save you some grief and time, `outer` joins are very useful.
* Merge your dataframes again using an `outer` join with `indicator = True`.
* Using `.value_counts()`, check how many rows are found in both dataframes, in the left only, and in the right only.

In [25]:
m2 = pd.merge(projects_df, overall_scores_df, on = ["project_name"], indicator = True, how = "outer")

In [26]:
m2._merge.value_counts()

both          26
left_only      3
right_only     3
Name: _merge, dtype: int64

### Filtering
* Filter for only the rows where `_merge` is `left_only` or `right_only`.
* AH note: link to  docs page with tutorial.
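Filtering with `.loc` works by passing a boolean mask: rows where the mask is `True` are kept. The sketch below uses made-up toy data and `.isin()`, which is one alternative to comparing against `"both"`.

```python
import pandas as pd

# Toy data, made up for illustration. "c" and "C" differ only by case.
left = pd.DataFrame({"project_name": ["A", "B", "c"]})
right = pd.DataFrame({"project_name": ["A", "B", "C"]})
merged = pd.merge(left, right, on="project_name", how="outer", indicator=True)

# Build a boolean mask, then keep only the rows where it is True.
mask = merged["_merge"].isin(["left_only", "right_only"])
unmatched = merged.loc[mask]
print(unmatched)
```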

In [27]:
m2.loc[m2._merge != "both"]

Unnamed: 0.1,ct_district,project_name,Scope of Work,Unnamed: 0,overall_score,_merge
10,8.0,Rainbow Rush hot Lanes,"High-Occupancy Toll lanes with dynamic pricing, utilizing advanced traffic management systems to optimize congestion relief and reduce emissions.",,,left_only
12,12.0,Bunny Lane HOV+2 heaven,"A High-Occupancy Vehicle lane with comfortable waiting areas, complimentary Wi-Fi, and convenient access to nearby amenities.",,,left_only
26,8.0,main street muffin top,"Pedestrian-friendly improvements to a charming town center, incorporating decorative lighting, street furniture, and enhanced storefronts.",,,left_only
29,,Rainbow Rush HOT Lanes,,10.0,178.0,right_only
30,,Bunny Lane HOV+2 Haven,,12.0,150.0,right_only
31,,Main Street Muffin Top Revitalization,,26.0,160.0,right_only


### Dictionaries: An Introduction 
* String data is often entered in many different ways. BART can be entered in as bart, Bay Area Rapid Transit, BaRT, and more. 
* Take a look at why these projects are not merging.
* In Excel, it's easy to go in and manually tweak everything. However, that is not reproducible. 
* Since there are essentially only a couple of names to replace, we can do it using a dictionary.
* Decide whether you want to rename the values in the left dataframe or the right one. 
    * AH: Link to docs
    * Explain what a dictionary is
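A Python dictionary maps keys to values and is written as `{key: value}`. For cleaning strings, each key is a messy spelling and each value is the clean name. The sketch below reuses the BART example from above with made-up data; `bart_names` and `agencies` are hypothetical names.

```python
import pandas as pd

# Each key is a spelling found in the data; each value is the name to use instead.
bart_names = {"bart": "BART", "BaRT": "BART", "Bay Area Rapid Transit": "BART"}

# Looking up a key returns its value.
print(bart_names["bart"])

# Made-up column of agency names entered inconsistently.
agencies = pd.Series(["bart", "BaRT", "Bay Area Rapid Transit", "Muni"])

# .replace() swaps any value that exactly matches a key; other values are untouched.
cleaned = agencies.replace(bart_names)
print(cleaned.tolist())
```

Because the mapping lives in code, anyone rerunning the notebook gets the same cleaned names, unlike manual edits in Excel.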


In [31]:
# I highly recommend you use .unique() to find the project names.
# Often there is trailing whitespace that is invisible to the naked eye.
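One way to spot trailing whitespace is to print each unique value with `repr()`, which wraps the string in quotes. The sketch below uses a made-up two-row Series; `has_extra_space` is a hypothetical name.

```python
import pandas as pd

# Made-up names; the first one ends with a space.
names = pd.Series(["main street muffin top ", "Bunny Hop Bike Boulevard"])

# repr() shows the surrounding quotes, so a trailing space becomes visible.
for name in names.unique():
    print(repr(name))

# Flag values that change when leading and trailing whitespace is stripped.
has_extra_space = names != names.str.strip()
print(has_extra_space.tolist())
```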


In [30]:
new_names = {'main street muffin top ':'Main Street Muffin Top Revitalization',
            'Bunny Lane HOV+2 heaven':'Bunny Lane HOV+2 Haven',
            'Rainbow Rush hot Lanes':'Rainbow Rush HOT Lanes'}

In [33]:
projects_df.project_name = projects_df.project_name.replace(new_names)

* Merge your dataframes again. This time it should work.
* Please also specify the merge type and the columns to merge on.
* Although pandas falls back on defaults when you omit them (an inner join on all shared columns), it's good practice to write everything out.

In [35]:
final_m = pd.merge(projects_df, overall_scores_df, how = "inner", on = "project_name")

In [36]:
final_m.project_name.nunique()

28

In [38]:
final_m.head(2)

Unnamed: 0.1,ct_district,project_name,Scope of Work,Unnamed: 0,overall_score
0,10,Meadow Magic Multi-Use Path,"A 2-mile Class I bike lane and multi-use path through a scenic meadow, featuring wildflower plantings, public art installations, and educational signage highlighting local wildlife.",0,136
1,8,Bunny Hop Bike Boulevard,"A Class II bike lane with charming streetlights, benches, and bike racks designed to resemble carrot sticks, connecting residential neighborhoods to local schools and parks.",1,164


## Groupby
* You're done merging...Oh wait, that wasn't even part of your manager's request.
* Let's revisit: they want you to "aggregate the average overall score, the max score, and the min score for each Caltrans District."
* `pandas` offers several options, including `groupby / agg`, `pivot_table`, and `groupby / transform`.
* Hint: rename the resulting columns to be descriptive, because they no longer hold the raw `overall_score` values.
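The cells below compute a median. For the mean, max, and min the manager originally asked for, one option is named aggregation, which computes and renames in a single step. This is a sketch on made-up toy scores; the output column names are hypothetical choices.

```python
import pandas as pd

# Toy data, made up for illustration: two projects in each of two districts.
scores = pd.DataFrame({
    "ct_district": [1, 1, 2, 2],
    "overall_score": [100, 140, 120, 160],
})

# Named aggregation: new_column_name=(source_column, aggregation_function).
summary = (
    scores.groupby("ct_district")
    .agg(
        mean_score=("overall_score", "mean"),
        max_score=("overall_score", "max"),
        min_score=("overall_score", "min"),
    )
    .reset_index()
)
print(summary)
```

Naming the outputs inside `.agg()` avoids a separate `.rename()` call and keeps each column's meaning next to its calculation.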

In [39]:
agg1 = final_m.groupby(["ct_district"]).agg({"overall_score":"median"}).reset_index()

In [41]:
agg1 = agg1.rename(columns = {"overall_score":"median_score"})

In [42]:
agg1

Unnamed: 0,ct_district,median_score
0,1,138.0
1,2,121.0
2,3,144.0
3,4,138.0
4,5,146.0
5,6,128.0
6,7,156.0
7,8,162.0
8,9,148.0
9,10,144.0


## Visualizing 
* While your manager is pleased with your work, they forgot to mention that they will be presenting this.
* Thus, they want a visualization of median scores by districts.
* Our preferred visualization library is `Altair` but there are many others. 
* Make a chart 