# Exercise 2: Merging, Aggregating, Filtering, and Visualizing

In [1]:
import altair as alt
import pandas as pd
from calitp_data_analysis.sql import to_snakecase

In [2]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

* Read back in the `parquet` file with the `overall_score` you created from exercise 1.
* Read the Excel sheet containing the project information (scope of work, district, and project name).
* **Use f-strings.**

In [3]:
FOLDER = "../starter_kit/"
FILE_NAME = "overall_score.parquet"
score = pd.read_parquet(f"{FOLDER}{FILE_NAME}")

In [4]:
FOLDER = "../starter_kit/"
FILE_NAME = "starter_kit_csis_scoring_workbook.xlsx"

my_sheets = ["projects_auto", "overall_score"]
workbook = pd.read_excel(f"{FOLDER}{FILE_NAME}", sheet_name=my_sheets)

project = to_snakecase(workbook.get(my_sheets[0]))

In [None]:
#keep_col = ['ct_district', 'project_name', 'scope_of_work']
#sub_project = project[keep_col]

In [None]:
#GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/starter_kit/"

In [5]:
score['project_name'].nunique()

44

In [6]:
score.shape

(44, 15)

In [7]:
project['project_name'].nunique()

44

In [8]:
project.shape

(44, 5)

In [9]:
project['ct_district'].nunique()

11

In [10]:
project.ct_district.value_counts()

11    8
2     6
5     6
4     5
7     5
8     4
10    3
12    2
3     2
9     2
6     1
Name: ct_district, dtype: int64

## Merging 
* **Goal**: Your manager asks you to aggregate the dataframe by the Caltrans District grain to find
    * Median overall score
    * Max overall score 
    * Min overall score
    * Number of unique projects
* Annoyingly enough, the `overall_score` column and the `ct_district` are in two different dataframes. 
* You'll have to <b>merge</b> the dataframes on the common column(s) the two dataframes share.
* Welcome to DDS! This will happen to you all the time starting now. 

### Relevant Resources
* Read about and practice merges before continuing on the exercise. 
    * [Resource #1 is a great tutorial for beginners](https://www.practicalpythonfordatascience.com/03_cleaning_data.html?highlight=merge#merging-dataframes-together).
    * [Resource #2 is written by our own Tiffany Ku, but it contains some geospatial references so it's a bit more to digest](https://docs.calitp.org/data-infra/analytics_new_analysts/01-data-analysis-intro.html#merge-tabular-and-geospatial-data-for-data-analysis).
    

In [None]:
# Practice Here

In [11]:
score.head(2)

Unnamed: 0,project_name,accessibility_score,dac_accessibility_score,dac_traffic_impacts_score,freight_efficiency_score,freight_sustainability_score,mode_shift_score,lu_natural_resources_score,safety_score,vmt_score,zev_score,public_engagement_score,climate_resilience_score,program_fit_score,overall_score
0,Meadow Magic Multi-Use Path,10,6,1,5,9,7,5,2,9,5,8,7,8,82
1,Bunny Hop Bike Boulevard,2,3,1,6,5,3,9,7,5,8,2,9,5,65


In [12]:
project.head(2)

Unnamed: 0,ct_district,project_name,scope_of_work,project_cost,lead_agency
0,8,Meadow Magic Multi-Use Path,"A 2-mile Class I bike lane and multi-use path through a scenic meadow, featuring wildflower plantings, public art installations, and educational signage highlighting local wildlife.",4189348,Meadow Bunny Public Transportation (MBPT)
1,2,Bunny Hop Bike Boulevard,"A Class II bike lane with charming streetlights, benches, and bike racks designed to resemble carrot sticks, connecting residential neighborhoods to local schools and parks.",8647685,Unicorn Fairy Express Bus (UFX)


### Now merge your two CSIS dataframes
**Food for Thought**
* Which columns do the two dataframes have in common?
* What type of merge will achieve my goal?
    * Inner, outer, left, or right?
* What do I expect out of the merge?
    * Do I expect all the values of the merge keys to be 1:1? Or m:1? 
    * Do I expect a project to correspond with multiple districts? Maybe, projects can and do cross multiple boundaries.
    * Do I expect a project to correspond with only one total cost estimate value? Yes, there shouldn't be multiple cost estimates for the same project!
* How do I go about checking the data after the merge?
    * Which arguments are available to help me per the [docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)?

### Double Checking
* How many rows do you expect?
* How many unique projects are there? 
* <b>Hint</b>: check the lengths of your original dataframes as well

### The Beauty of Outer Joins 
* As you have noticed, we are missing a couple of projects.
* This is where `outer` joins are very useful.
* Merge your dataframes again using an `outer` join and with `indicator = True` on.
    * `m2 = pd.merge(df1, df2, on=[column], indicator=True, how="outer")`
* Using `.value_counts()` on the column named `_merge` created by `indicator=True` to check out how many rows are found in both dataframes, the left only, and the right only

In [20]:
m = pd.merge(project, score, on='project_name', indicator=True, how="outer", validate = 'm:1')
#m.head(2)

In [21]:
m.shape

(47, 20)

In [24]:
m._merge.value_counts()

both          41
left_only      3
right_only     3
Name: _merge, dtype: int64

### Filtering
* Filter out for only the `left_only` and `right_only` values.
    * `!=` means does not equal to. 
    * `==` means equal to.

* You could also use `isin([list of elements you want to keep])` to retain multiple elements you want.

In [30]:
keep_left = ["left_only"]
m_left = m[m._merge.isin(keep_left)]
#m_left.head()
m_left.shape

(3, 20)

In [29]:
keep_right = ["right_only"]
m_right = m[m._merge.isin(keep_right)]
#m_right.head()
m_right.shape

(3, 20)

* If you want to filter out multiple elements use `~df.column.isin([list of elements you don't want to keep])`

In [34]:
# ??? Is this for deleting multiple columns. not rows???
test_left = ~m._merge.isin(['both', 'right'])
test_left.shape

(47,)

### Dictionaries
* String data is often entered in many different ways. 
    * BART can be entered in as bart, Bay Area Rapid Transit, BaRT, and more. 
* Often, differing strings between two dataframes are the reason why your dataframe is not merging properly.
* In Excel, it's easy to go in and manually tweak everything. However, that is not reproducible and time consuming. 
* Luckily with Python we can automate this. 
* Since there are only a couple of names to replace, we can do it using a <b>dictionary</b>.

#### What is a dictionary?
* Per Practical Python for Data Science, a dictionary is <i>Dictionaries are used to store data values in key:value pairs. Similar to the list, a dictionary is a collection of objects. It is also mutable, meaning that you can add, remove, change values inside of it...With the list, we access elements using the index. With the dictionary, we access elements using keys.</i>
* Dictionaries are very important.
* Read more [here](https://www.practicalpythonfordatascience.com/00_python_crash_course_datatypes.html?highlight=dictionary#dictionary) and **follow its example in the cells below.**
    

In [None]:
# Practice Here

#### Application of Dictionaries: Replacing Values
* [Resource](https://www.practicalpythonfordatascience.com/03_cleaning_data#recoding-column-values)
* **Step 1**: Filter out for the rows that <b>didn't</b> merge. Find the unique values of the `project_name` column using `.unique()`
* Take a look at elements using 
    * Trailing white spaces
    * Capitalization
    * Spelling
    * Symbols

In [45]:
m.project_name.nunique()

47

In [46]:
keep_row = ["right_only", "left_only"]
m_keep = m[m._merge.isin(keep_row)]

In [47]:
# method 1
m_keep.project_name.unique()

array(['Rainbow Rush hot  Lanes', 'Bunny Lane HOV+2 heaven',
       'main street muffin top ', 'Rainbow Rush HOT Lanes',
       'Bunny Lane HOV+2 Haven', 'Main Street Muffin Top Revitalization'],
      dtype=object)

In [48]:
# method 2
keep_col = ['project_name', '_merge']
m_keep = m_keep[keep_col]

#m_keep.shape
m_keep.head(6)

Unnamed: 0,project_name,_merge
10,Rainbow Rush hot Lanes,left_only
12,Bunny Lane HOV+2 heaven,left_only
26,main street muffin top,left_only
44,Rainbow Rush HOT Lanes,right_only
45,Bunny Lane HOV+2 Haven,right_only
46,Main Street Muffin Top Revitalization,right_only


* **Step 2:** Decide whether you want to rename the values in the left dataframe or the right one. 
* **Step 3:** The <b>keys</b>, are the values you want to replace. The <b>values</b>, are what you want to replace these values with. 
    * Let's say my left value is "AC Transit" but I want it to be "Alameda Contra Costa County Transit", my dictionary would be 
    * `my_dict = {"AC Transit":"Alameda Contra Costa County Transit"}`

In [None]:
my_dict = {"Rainbow Rush hot Lanes":"Rainbow Rush HOT Lanes",
           "Bunny Lane HOV+2 heaven":"Bunny Lane HOV+2 Heaven",
          "main street muffin top":"Main Street Muffin Top Revitalization"}

* **Step 4**: Use your dictionary in `.replace()` to recode the values.

In [60]:
project.project_name = project.project_name.replace({"Rainbow Rush hot Lanes":"Rainbow Rush HOT Lanes",
           "Bunny Lane HOV+2 heaven":"Bunny Lane HOV+2 Heaven",
          "main street muffin top":"Main Street Muffin Top Revitalization"})

In [None]:
#df.project_name = df.project_name.replace(your_dictionary)

In [61]:
project.project_name.value_counts()

Meadow Magic Multi-Use Path                  1
Bunny Hop Bike Boulevard                     1
Gingerbread Village Green Complete Street    1
Trail of Treats and Transit Hub              1
main street muffin top                       1
Park and Ride Petal Paradise                 1
Waterfront Waffle Walk and Bike              1
Fairy Glen Boulevard                         1
Pixie Pathway Reconstruction                 1
Meadowbrook Magic Makeover                   1
Elven Exchange Interchange                   1
Riverbend Pixie Ramp                         1
Larkspur Loop On-Ramp to Fairytop            1
Mystic Meadow Managed Lane                   1
Sagebrush SMART Lane of the Ancients         1
Laurel Lane Enchanted Express                1
Golden Gate Glimmer Express Lane             1
Canyon Creek Sparkle Toll Lane               1
Sunset Valley Twinkle Fast Lane              1
Parkside Pixie Carpool Lane                  1
Ridgewood Ride-Share Rainbow Lane            1
Hydrogen Have

#### Merge your dataframes again. 
* This time the number of unique project names should match the rows of the merged dataframe perfectly.
* Make sure to double check that!

In [58]:
m2 = pd.merge(project, score, on='project_name', indicator=True, how="outer", validate = 'm:1')
#m2.head(2)
m2.shape

(47, 20)

#### Save this dataframe as a parquet to GCS under a new name
* Use a `f-string`

## Groupby
* You're done merging...Oh wait, that wasn't even part of your manager's request. You still need to aggregate. 
* By Caltrans District to find
    * Median overall score
    * Max overall score 
    * Min overall score
    * Number of unique projects
* There are many options Some are `groupby / agg`, `pivot_table`, `groupby / transform`
* <b>Resource</b>: 
    * [DDS Docs](https://docs.calitp.org/data-infra/analytics_new_analysts/01-data-analysis-intro.html#aggregating)

In [None]:
# Practice tutorial linked above here

### Apply your new knowledge to the prompt above.
* Hint: After aggregating, some of the column names will no longer be relevant. 
* For example, if you use `scope_of_work` to count the number of projects, this column no longer represents `scope_of_work`.
* It should be renamed something like `n_projects`.
* Rename your columns using this `df.rename(columns={"old_column_name":"new_column_name"})`

## Visualizing 
* You're done aggregating, but the dataframe looks objectively plain.
* Let's explore a couple of ways to present your data.

### Styling a Dataframe
* `pandas` has quite a few options that allow you to style your dataframe.
* [This tutorial](https://betterdatascience.com/style-pandas-dataframes/) offers some great ways to jazz up your dataframe.
* You can always read the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html) for more ideas.
* Some ideas:
    * Change the font
    * Turn off the index
    * Use colors to code low-high values
    * Change the alignment of the values

In [None]:
# Practice here 

### Altair
* While a table is great, sometimes a chart is a better way to display an insight.
* Our preferred visualization library is `Altair`.
    * Docs page is [here](https://altair-viz.github.io/).
* The code to create a simple bar chart goes something like this. 
    * `alt.Chart(source).mark_bar().encode(x='a',y='b')`
    * `source` is the dataframe you want to use for your chart.
    * `x` denotes the column you are plotting on the X-axis. Make sure your column name has quotation marks around it. 
    * `y` denotes the column you are plotting on the Y-axis. 
* <b>Make your first chart below.</b>

In [None]:
alt.Chart(agg1).mark_bar().encode(
    x="ct_district", y="n_projects"
)

#### Customizing
* `altair` offers an endless ways to amp up the personality of your chart.
* Additionally, the chart above without a title and legend is a data visualization "taboo" and the dull blue is uninspiring. 

#### Add a title
* You can do so within  `.Chart()`
`alt.Chart(source,  title="your_title_here").mark_bar().encode(x='a',y='b')`

In [None]:
alt.Chart(agg1, title="your_title_here").mark_bar().encode(
    x="ct_district", y="n_projects"
)

#### Different Charts
* If you want something that isn't a bar chart, simply swap out `.mark_bar()` for `.mark_line` or `mark_circle`.


In [None]:
alt.Chart(agg1, title="your_title_here").mark_line().encode(
    x="ct_district", y="n_projects"
)

#### Add some color/DDS's Python Library
* We have some default color palettes that are already in our [internal library of functions](https://docs.calitp.org/data-infra/analytics_tools/python_libraries.html#calitp-data-analysis).

In [None]:
# Import the color palettes
from calitp_data_analysis import calitp_color_palette

* To see what is inside a module,  put two question marks behind it.
* From here, you can choose another color palette.

In [None]:
calitp_color_palette??

* Place the column you want the colors to be based on in `color=alt.Color(column)`
* Place your color palette in the `scale` argument `scale=alt.Scale(range=your_color_palette)`.

In [None]:
alt.Chart(agg1, title="your_title_here").mark_bar().encode(
    x="ct_district",
    y="n_projects",
    color=alt.Color(
        "n_projects",  # This is the column you want the color of your bar to be based on
        title="legend_title_here",  # This is the legend of your title
        scale=alt.Scale(
            range=calitp_color_palette.CALITP_DIVERGING_COLORS  # This is where you can customize the colors,
        ),
    ),
)

#### Adjusting the Axis
* Sometimes, we want to adjust the axis to have a min and max value.
* You do so using the `scale=alt.Scale(domain=[min_value, max_value]))` argument behind the X and Y axis.

In [None]:
alt.Chart(agg1, title="your_title_here").mark_bar().encode(
    x=alt.X("ct_district", scale=alt.Scale(domain=[1, 12])),
    y=alt.Y("n_projects", scale=alt.Scale(domain=[0, 10])),
    color=alt.Color(
        "n_projects",
        title="legend_title_here",
        scale=alt.Scale(range=calitp_color_palette.CALITP_DIVERGING_COLORS),
    ),
)

### Finishing Touches 
* `.properties(width=400, height=250)` adjusts the size of your chart. 
* `tooltip=[columns you want]` allows you to create a tooltip that pops up when you hover over each bar/circle/etc.
* `.mark_bar(size=10)` adjusts the size of the bar/circle/etc.

In [None]:
alt.Chart(agg1, title="your_title_here").mark_bar(size=10).encode(
    x=alt.X("ct_district", scale=alt.Scale(domain=[1, 12])),
    y=alt.Y("n_projects", scale=alt.Scale(domain=[0, 10])),
    color=alt.Color(
        "n_projects",
        title="legend_title_here",
        scale=alt.Scale(range=calitp_color_palette.CALITP_DIVERGING_COLORS),
    ),
    tooltip=["ct_district", "n_projects"],
).properties(width=400, height=250)

### We have only visualized one column of data. 
* We have only visualized one column of data, but we have a couple of columns above. 
* Try to customize your graph. If you can dream it, you can probably do it with Altair. 
    * You can turn off the grid lines, rotate the axis labels by various degrees, label the bars, add a dropdown menu to change the axis, and more. 
* Make a few other charts in different styles.
* Inspiration
    * Altair's [gallery](https://altair-viz.github.io/gallery/index.html)
    * DDS's [portfolio](https://analysis.calitp.org/)
