# Exercise 2: Merging, Aggregating, Filtering, and Visualizing

In [1]:
import altair as alt
import pandas as pd
from calitp_data_analysis.sql import to_snakecase

In [2]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

* Read back in the `parquet` file with the `overall_score` you created from exercise 1.
* Read the Excel sheet containing the project information (scope of work, district, and project name).
* Use f-strings.
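
As a quick refresher, an f-string substitutes a variable's value wherever its name appears inside curly braces. The sketch below uses a made-up bucket and file name (both hypothetical, not the real paths for this exercise) to show how a folder and a file name combine into one path.

```python
# Hypothetical folder and file name, for illustration only
folder = "gs://example-bucket/example-folder/"
file_name = "example_file.parquet"

# Anything inside {} is replaced with the variable's value
full_path = f"{folder}{file_name}"
print(full_path)
```

Keeping the folder and the file name in separate variables means you only have to change one of them if the data moves.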

In [3]:
GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/starter_kit/"

In [4]:
EXCEL_FILE = "starter_kit_csis_scoring_workbook.xlsx"

In [5]:
OVERALL_SCORE_FILE = "starter_kit_example_final_scores.parquet"

In [6]:
projects_df = to_snakecase(pd.read_excel(f"{GCS_FILE_PATH}{EXCEL_FILE}"))

In [7]:
projects_df.head(2)

Unnamed: 0,ct_district,project_name,scope_of_work,project_cost
0,2,Meadow Magic Multi-Use Path,"A 2-mile Class I bike lane and multi-use path through a scenic meadow, featuring wildflower plantings, public art installations, and educational signage highlighting local wildlife.",6265525
1,3,Bunny Hop Bike Boulevard,"A Class II bike lane with charming streetlights, benches, and bike racks designed to resemble carrot sticks, connecting residential neighborhoods to local schools and parks.",3777437


In [8]:
overall_scores_df = pd.read_parquet(f"{GCS_FILE_PATH}{OVERALL_SCORE_FILE}")

In [9]:
overall_scores_df.head(2)

Unnamed: 0,project_name,accessibility_score,dac_accessibility_score,dac_traffic_impacts_score,freight_efficiency_score,freight_sustainability_score,mode_shift_score,lu_natural_resources_score,safety_score,vmt_score,zev_score,public_engagement_score,climate_resilience_score,program_fit_score,overall_score
0,Meadow Magic Multi-Use Path,1,8,9,3,10,3,8,2,2,10,4,2,4,66
1,Bunny Hop Bike Boulevard,9,5,2,5,6,2,4,5,9,5,2,10,7,71


## Merging 
* Your manager asks you to aggregate the dataframe by Caltrans District to find:
    * Median overall score
    * Max overall score 
    * Min overall score
    * Number of unique projects
* Annoyingly enough, the `overall_score` column and the `ct_district` are in two different dataframes. 
* You'll have to <b>merge</b> them on the common column(s) the two dataframes share.
* Welcome to DDS! This will happen to you all the time starting now. 

### Relevant Resources
* If needed, read about merges before diving in. 
    * [Resource #1 is a great tutorial for beginners](https://www.practicalpythonfordatascience.com/03_cleaning_data.html?highlight=merge#merging-dataframes-together).
    * [Resource #2 is written by our own Tiffany Ku, but it contains some geospatial references so it's a bit more to digest](https://docs.calitp.org/data-infra/analytics_new_analysts/01-data-analysis-intro.html#merge-tabular-and-geospatial-data-for-data-analysis).
### Food for thought 
* Which columns do the two dataframes have in common?
* What type of merge will achieve my goal?
    * Inner, outer, left, or right
* What do I expect out of the merge?
    * Do I expect all the values of the merge keys to be 1:1? Or m:1? 
    * Do I expect a project to correspond with multiple districts? Maybe, projects can and do cross multiple boundaries.
    * Do I expect a project to correspond with only one total cost estimate value? Yes, there shouldn't be multiple cost estimates for the same project!
* How do I go about checking the data after the merge?
    * Which arguments are available to help me per the [docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)?
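
To make these questions concrete, here is a minimal sketch on two toy dataframes with made-up values (the names `left` and `right` are hypothetical). It shows two arguments of `pd.merge` that help with checking: `validate`, which raises an error when the merge keys are not in the relationship you expect, and `indicator`, which records where each row was found.

```python
import pandas as pd

# Toy dataframes with made-up values, for illustration only
left = pd.DataFrame({"project_name": ["A", "B", "C"], "ct_district": [1, 2, 3]})
right = pd.DataFrame({"project_name": ["A", "B", "D"], "overall_score": [70, 65, 80]})

# validate="1:1" raises a MergeError if project_name repeats in either dataframe
inner = pd.merge(left, right, on="project_name", how="inner", validate="1:1")

# how="left" keeps every row of the left dataframe; indicator=True adds a
# _merge column showing whether each row was found in both or only one side
left_join = pd.merge(left, right, on="project_name", how="left", indicator=True)

print(len(inner))
print(left_join[["project_name", "_merge"]])
```

An inner merge silently drops rows that do not match, so comparing row counts before and after the merge is a useful habit.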

In [10]:
m1 = pd.merge(projects_df, overall_scores_df, on=["project_name"])

### Double Checking
* How many rows do you expect?
* How many unique projects are there? 
* <b>Hint</b>: check your original dataframes as well

In [11]:
m1.shape

(41, 18)

In [12]:
m1.project_name.nunique()

41

In [13]:
projects_df.project_name.nunique()

44

### The Beauty of Outer Joins 
* As you have noticed, we are missing a couple of projects.
* This is where `outer` joins are very useful.
* Merge your dataframes again using an `outer` join with `indicator=True`.
* Use `.value_counts()` on the resulting `_merge` column to check how many rows are found in both dataframes, the left only, and the right only.

In [14]:
m2 = pd.merge(
    projects_df, overall_scores_df, on=["project_name"], indicator=True, how="outer"
)

In [15]:
m2._merge.value_counts()

both          41
left_only      3
right_only     3
Name: _merge, dtype: int64

### Filtering
* Filter for only the `left_only` and `right_only` rows.
    * `!=` means "does not equal".
    

In [16]:
m2.loc[m2._merge != "both"][["project_name", "_merge"]]

Unnamed: 0,project_name,_merge
10,Rainbow Rush hot Lanes,left_only
12,Bunny Lane HOV+2 heaven,left_only
26,main street muffin top,left_only
44,Rainbow Rush HOT Lanes,right_only
45,Bunny Lane HOV+2 Haven,right_only
46,Main Street Muffin Top Revitalization,right_only


* You could also use `isin([list of elements you want to keep])`

In [17]:
m2.loc[m2._merge.isin(["left_only","right_only"])][["project_name", "_merge"]]

Unnamed: 0,project_name,_merge
10,Rainbow Rush hot Lanes,left_only
12,Bunny Lane HOV+2 heaven,left_only
26,main street muffin top,left_only
44,Rainbow Rush HOT Lanes,right_only
45,Bunny Lane HOV+2 Haven,right_only
46,Main Street Muffin Top Revitalization,right_only


### Dictionaries
* String data is often entered in many different ways. BART can be entered as bart, Bay Area Rapid Transit, BaRT, and more.
* Often, strings are the reason why your dataframes are not merging properly.
* In Excel, it's easy to go in and manually tweak everything. However, that is time-consuming and not reproducible.
* Since there are essentially only a couple of names to replace, we can do it using a <b>dictionary</b>.

#### What is a dictionary?
* Per Practical Python for Data Science: <i>Dictionaries are used to store data values in key:value pairs. Similar to the list, a dictionary is a collection of objects. It is also mutable, meaning that you can add, remove, change values inside of it...With the list, we access elements using the index. With the dictionary, we access elements using keys.</i>
    * Read more [here](https://www.practicalpythonfordatascience.com/00_python_crash_course_datatypes.html?highlight=dictionary#dictionary) and experiment with the example in the docs in this notebook.
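
If it helps to see the definition in action first, here is a minimal sketch with a made-up dictionary (the project names and scores are hypothetical). It shows access by key and the three kinds of mutation the quote mentions.

```python
# A made-up dictionary: each key is a project, each value is its score
scores = {"Project A": 66, "Project B": 71}

# Access a value using its key
print(scores["Project A"])

# Dictionaries are mutable: add a pair, change a value, remove a pair
scores["Project C"] = 80
scores["Project B"] = 75
del scores["Project A"]

print(scores)
```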
    

In [18]:
# Practice Here

#### Replacing Values
* [Relevant Reading](https://www.practicalpythonfordatascience.com/03_cleaning_data#recoding-column-values).
* Step 1: Filter for the rows that <b>didn't</b> merge. Find the unique values of the `project_name` column using `.unique()`.
* Look closely at the names for differences in:
    * Trailing white spaces
    * Capitalization
    * Spelling
    * Symbols
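
Differences like these can be hard to see by eye. The sketch below is only a diagnostic aid on made-up names, not part of the exercise's steps: `repr()` makes stray spaces visible, and a normalized copy of the strings shows whether whitespace and capitalization are the only things that differ.

```python
import pandas as pd

# Made-up names showing the kinds of differences that break a merge
names = pd.Series(
    ["Sunny Side Bike  Path ", "sunny side bike path", "Sunny Side Bike Path"]
)

# repr() puts quotes around each string, so trailing and doubled spaces show up
for name in names:
    print(repr(name))

# Strip the ends, collapse repeated spaces, and lowercase for comparison
cleaned = (
    names.str.strip()
    .str.replace(r"\s+", " ", regex=True)
    .str.lower()
)

print(names.nunique())
print(cleaned.nunique())
```

If the cleaned names collapse to fewer unique values than the originals, whitespace or capitalization is at least part of the problem.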

In [19]:
m2.loc[m2._merge.isin(["left_only","right_only"])].project_name.unique()

array(['Rainbow Rush hot  Lanes', 'Bunny Lane HOV+2 heaven',
       'main street muffin top ', 'Rainbow Rush HOT Lanes',
       'Bunny Lane HOV+2 Haven', 'Main Street Muffin Top Revitalization'],
      dtype=object)

* Step 2: Decide whether you want to rename the values in the left dataframe or the right one. 
* Step 3: The <b>keys</b> are the values you want to replace. The <b>values</b> are what you want to replace them with.

In [20]:
new_names = {
    "main street muffin top ": "Main Street Muffin Top Revitalization",
    "Bunny Lane HOV+2 heaven": "Bunny Lane HOV+2 Haven",
    "Rainbow Rush hot  Lanes": "Rainbow Rush HOT Lanes",
}

* Step 4: Use your dictionary in `.replace()` to recode the values.

In [21]:
projects_df.project_name = projects_df.project_name.replace(new_names)

#### Merge your dataframes again. This time it should work.


In [22]:
final_m = pd.merge(projects_df, overall_scores_df, how="inner", on="project_name")

In [23]:
final_m.shape

(44, 18)

In [24]:
final_m.head(1)

Unnamed: 0,ct_district,project_name,scope_of_work,project_cost,accessibility_score,dac_accessibility_score,dac_traffic_impacts_score,freight_efficiency_score,freight_sustainability_score,mode_shift_score,lu_natural_resources_score,safety_score,vmt_score,zev_score,public_engagement_score,climate_resilience_score,program_fit_score,overall_score
0,2,Meadow Magic Multi-Use Path,"A 2-mile Class I bike lane and multi-use path through a scenic meadow, featuring wildflower plantings, public art installations, and educational signage highlighting local wildlife.",6265525,1,8,9,3,10,3,8,2,2,10,4,2,4,66


* Save this dataframe as a parquet to GCS under a new name

In [25]:

final_m.to_parquet(f"{GCS_FILE_PATH}starter_kit_example_merge.parquet")

## Groupby
* You're done merging...Oh wait, that wasn't even part of your manager's request. You still need to aggregate. 
* To refresh your memory: aggregate by Caltrans District to find
    * Median overall score
    * Max overall score 
    * Min overall score
    * Number of unique projects
* There are many options. Some are `groupby / agg`, `pivot_table`, and `groupby / transform`.
* <b>Resources</b>: [DDS Docs](https://docs.calitp.org/data-infra/analytics_new_analysts/01-data-analysis-intro.html#aggregating)
    * Use the space below and explore these tutorials. 
    * Then, apply your new knowledge to the prompt above.
* Hint: After aggregating, your column name will no longer be relevant. For example, if you use `scope_of_work` to count the number of projects, this column no longer represents `scope_of_work`. It should be renamed something like `n_projects`.
    * Rename your columns using this `df.rename(columns={"old_column_name":"new_column_name"})`
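
As one possible illustration of these options, the sketch below runs on toy data with made-up values. It uses pandas named aggregation, which sets the new column names inside `.agg()` itself, and then shows `pivot_table` producing the same median followed by a rename. Treat it as a sketch of the syntax, not the only way to answer the prompt.

```python
import pandas as pd

# Toy data with made-up values, for illustration only
df = pd.DataFrame(
    {
        "ct_district": [1, 1, 2, 2, 2],
        "project_name": ["A", "B", "C", "D", "E"],
        "overall_score": [60, 70, 50, 80, 90],
    }
)

# Named aggregation: new_column_name=(column_to_aggregate, function)
summary = (
    df.groupby("ct_district")
    .agg(
        median_score=("overall_score", "median"),
        n_projects=("project_name", "nunique"),
    )
    .reset_index()
)

# pivot_table can compute the same median, one row per district
pivot = df.pivot_table(
    index="ct_district", values="overall_score", aggfunc="median"
).reset_index()

# The aggregated column no longer holds raw scores, so rename it
pivot = pivot.rename(columns={"overall_score": "median_score"})

print(summary)
print(pivot)
```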

In [26]:
final_m["min_score"] = final_m.overall_score

In [27]:
final_m["max_score"] = final_m.overall_score

In [28]:
agg1 = (
    final_m.groupby(["ct_district"])
    .agg(
        {
            "overall_score": "median",
            "min_score": "min",
            "max_score": "max",
            "project_name": "nunique",
        }
    )
    .reset_index()
)

In [29]:
agg1 = agg1.rename(
    columns={"overall_score": "median_score", "project_name": "n_projects"}
)

In [30]:
agg1

Unnamed: 0,ct_district,median_score,min_score,max_score,n_projects
0,1,69.5,69,70,2
1,2,72.5,64,74,6
2,3,71.0,62,77,7
3,4,66.0,65,67,2
4,5,81.0,81,81,1
5,6,66.5,57,85,6
6,7,76.0,53,86,5
7,8,78.5,63,94,2
8,9,81.0,48,90,3
9,10,80.0,80,80,1


## Visualizing 
* You're done aggregating, but the dataframe looks objectively plain.
* Unfortunately in the world of data, looks <b>do</b> matter. 
* Let's explore a couple of ways to present your data.

### Styling a Dataframe
* `pandas` has quite a few options that allow you to style your dataframe.
* [This tutorial](https://betterdatascience.com/style-pandas-dataframes/) offers some great ways to jazz up your dataframe.
* You can always read the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html) for more ideas.
* Some ideas:
    * Change the font
    * Turn off the index
    * Use colors to indicate low-high values
    * Change the alignment of the values

### Altair
* While a table is great, sometimes the stakeholder prefers a chart. 
* Our preferred visualization library is `Altair`, although there are other options.
    * Their docs page is [here](https://altair-viz.github.io/).
* The code to create a simple bar chart goes something like this. 
    * `alt.Chart(source).mark_bar().encode(x='a',y='b')`
    * `source` is the dataframe you want to use for your chart.
    * `x` denotes the column you are plotting on the X-axis. Make sure your column name has quotation marks around it. 
    * `y` denotes the column you are plotting on the Y-axis. 
* <b>Make your first chart below.</b>

In [31]:
alt.Chart(agg1).mark_bar().encode(x="ct_district", y="n_projects")

#### Customizing
* `altair` offers endless ways to amp up the personality of your chart.
* Additionally, the chart above without a title and legend is a data visualization "taboo" and the dull blue is uninspiring. 

##### Add a title
* You can do so within `.Chart()`

In [32]:
alt.Chart(agg1, title="your_title_here").mark_bar().encode(
    x="ct_district", y="n_projects"
)

### Different Charts
* If you want something that isn't a bar chart, simply swap out `.mark_bar()` for `.mark_line()` or `.mark_circle()`.


In [33]:
alt.Chart(agg1, title="your_title_here").mark_circle().encode(
    x="ct_district", y="n_projects"
)

#### Add some color/DDS's Python Library
* We have some default color palettes that are already in our [internal library of functions](https://docs.calitp.org/data-infra/analytics_tools/python_libraries.html#calitp-data-analysis).

In [34]:
# Import the color palettes
from calitp_data_analysis import calitp_color_palette

* To see what is inside a module, put two question marks after it.
* From here, you can choose another color palette.

In [35]:
calitp_color_palette??

Type:        module
String form: <module 'calitp_data_analysis.calitp_color_palette' from '/opt/conda/lib/python3.9/site-packages/calitp_data_analysis/calitp_color_palette.py'>
File:        /opt/conda/lib/python3.9/site-packages/calitp_data_analysis/calitp_color_palette.py
Source:
# --------------------------------------------------------------#
# Cal-ITP style guide
# Google Drive > Cal-ITP Team > Project Resources >
# Branded Resources and External Comms Guidelines > Branded Resources > Style Guide
# --------------------------------------------------------------#
CALITP_CATEGORY_BRIGHT_COLORS = [
    "#2EA8CE",  # darker blue
    "#EB9F3C",

In [36]:
alt.Chart(agg1, title="your_title_here").mark_bar().encode(
    x="ct_district",
    y="n_projects",
    color=alt.Color(
        "n_projects",  # The column that determines the color of each bar
        title="legend_title_here",  # The title of the legend
        scale=alt.Scale(
            range=calitp_color_palette.CALITP_DIVERGING_COLORS  # Customize the colors here
        ),
    ),
)

#### Adjusting the Axis
* Sometimes, we want to adjust the axis to have a min and max value.
* You do so by passing `scale=alt.Scale(domain=[min_value, max_value])` to the X and Y encodings.
* `alt.X()` and `alt.Y()` give you many more customization options.

In [37]:
alt.Chart(agg1, title="your_title_here").mark_bar().encode(
    x=alt.X("ct_district", scale=alt.Scale(domain=[1, 12])),
    y=alt.Y("n_projects", scale=alt.Scale(domain=[0, 10])),
    color=alt.Color(
        "n_projects",
        title="legend_title_here",
        scale=alt.Scale(range=calitp_color_palette.CALITP_DIVERGING_COLORS),
    ),
)

### Finishing Touches 
* `.properties(width=400, height=250)` adjusts the size of your chart. 
* `tooltip=[columns you want]` gives you additional details on the columns you specify when you hover over each bar/circle/etc.
* `.mark_bar(size=30)` adjusts the size of the bar/circle/etc.

In [38]:
alt.Chart(agg1, title="your_title_here").mark_bar(size=10).encode(
    x=alt.X("ct_district", scale=alt.Scale(domain=[1, 12])),
    y=alt.Y("n_projects", scale=alt.Scale(domain=[0, 10])),
    color=alt.Color(
        "n_projects",
        title="legend_title_here",
        scale=alt.Scale(range=calitp_color_palette.CALITP_DIVERGING_COLORS),
    ),
    tooltip=["ct_district", "n_projects"]
).properties(width=400, height=250)

### We have only visualized one column of data. 
* So far, every chart plots `n_projects`, but `agg1` also contains `median_score`, `min_score`, and `max_score`.
* Try to customize your chart. If you can dream it, you can probably do it with Altair.
* You can turn off the grid lines, rotate the axis labels by various degrees, label the bars, add a dropdown menu to change the axis, and more. 
* Make a few other charts in different styles.
* Altair's [gallery](https://altair-viz.github.io/gallery/index.html) is a great resource for inspiration.
* DDS's [portfolio](https://analysis.calitp.org/) also contains a plethora of examples.
