1.  Download the file “MBTA_Line_and_Stop.csv” from Files on Canvas.  (Data is originally from the MBTA data portal.)

2.  First, explore the data in your application of choice (e.g., Tableau, Excel, etc.) to get a feel for the data, and clean up as necessary.

3.  Open a Jupyter notebook and load the data.

4.  Create a line chart and a scatter plot with brushing-and-linking so that if you select points in the scatter plot then the relevant points/areas are highlighted in the line chart and vice versa.  You may choose any data you wish from the MBTA_Line_and_Stop.csv file.  

**Recall that a scatter plot is most effective for plotting quantitative versus quantitative data, and a line chart is most effective for plotting quantitative versus ordinal or categorical data.  Line charts are also particularly suited for data that varies over time with time plotted along the x-axis.

5.  Add to one of the charts a details-on-demand mouse-over effect to get more information about specific data points.

6.  Add a pop-out effect (see Lecture 16) to help the Huskie viewers see where the Ruggles (Orange Line) and Northeastern (Green Line) stops are in the plot(s).

7.  Finally, make sure to add the following textual components to your PDF:

A caption for each visualization to explain what it is plotting and the most interesting insights/observations.
A text blurb at the bottom of the page to identify what kind of pop-out effect you used for Step 6 and why you chose it (1-2 sentences is sufficient).
Don’t forget to, as always, include titles on your graphs as well as axis labels, appropriate axis scaling, etc.

### Part 1
***
Dataset downloaded.

### Part 2
***
Dataset explored and cleaned in Jupyter Notebooks as shown below.

In [1]:
# Imports
import pandas as pd
import numpy as np
from datetime import datetime
from vega_datasets import data
import altair as alt
from altair import pipe, limit_rows, to_values

**CITATION** - 
The fix for the error raised when the dataset is too large for Altair's default row limit comes from:
[source](https://github.com/altair-viz/altair/issues/611)

In [2]:
# Fix issue of dataset being too big by raising the row limit to 100,000
def custom_transformer(data):
    return pipe(data, limit_rows(max_rows=100000), to_values)

alt.data_transformers.register('custom', custom_transformer)
alt.data_transformers.enable('custom')

DataTransformerRegistry.enable('custom')

In [3]:
# Read MBTA csv and create DataFrame
mbta = pd.read_csv('MBTA_Line_and_Stop.csv')
mbta

Unnamed: 0,FID,mode,season,route_id,route_name,direction_id,day_type_id,day_type_name,time_period_id,time_period_name,stop_name,stop_id,total_ons,total_offs,number_service_days,average_ons,average_offs,average_flow
0,1,0,Fall 2019,Green,Green Line,0,day_type_01,weekday,time_period_01,VERY_EARLY_MORNING,Allston Street,place-alsgr,0,17,77,0,0,4
1,2,0,Fall 2019,Green,Green Line,0,day_type_01,weekday,time_period_01,VERY_EARLY_MORNING,Arlington,place-armnl,2675,8021,77,35,104,381
2,3,0,Fall 2019,Green,Green Line,0,day_type_01,weekday,time_period_01,VERY_EARLY_MORNING,Babcock Street,place-babck,0,151,77,0,2,8
3,4,0,Fall 2019,Green,Green Line,0,day_type_01,weekday,time_period_01,VERY_EARLY_MORNING,Back of the Hill,place-bckhl,0,36,77,0,0,4
4,5,0,Fall 2019,Green,Green Line,0,day_type_01,weekday,time_period_01,VERY_EARLY_MORNING,Beaconsfield,place-bcnfd,12,67,77,0,1,44
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7915,7916,1,Fall 2017,Red,Red Line,0,day_type_02,saturday,time_period_10,OFF_PEAK,Porter,place-portr,59366,9940,16,3710,621,13842
7916,7917,1,Fall 2017,Red,Red Line,0,day_type_02,saturday,time_period_10,OFF_PEAK,Quincy Adams,place-qamnl,388,26507,16,24,1657,1549
7917,7918,1,Fall 2017,Red,Red Line,0,day_type_02,saturday,time_period_10,OFF_PEAK,Quincy Center,place-qnctr,2128,67000,16,133,4188,3182
7918,7919,1,Fall 2017,Red,Red Line,0,day_type_02,saturday,time_period_10,OFF_PEAK,Savin Hill,place-shmnl,2292,18338,16,143,1146,7237


In [4]:
# Show the time_period_name associated with each time_period_id
mbta_time = mbta[['time_period_id', 'time_period_name']]
mbta_time = mbta_time.drop_duplicates()
# sort_values returns a new DataFrame, so assign the result
mbta_time = mbta_time.sort_values(by='time_period_id')
mbta_time

Unnamed: 0,time_period_id,time_period_name
0,time_period_01,VERY_EARLY_MORNING
150,time_period_02,EARLY_AM
182,time_period_03,AM_PEAK
59,time_period_04,MIDDAY_BASE
50,time_period_05,MIDDAY_SCHOOL
81,time_period_06,PM_PEAK
247,time_period_07,EVENING
412,time_period_08,LATE_EVENING
402,time_period_10,OFF_PEAK
426,time_period_11,OFF_PEAK
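
The lookup above shows that two IDs (`time_period_10` and `time_period_11`) share the name `OFF_PEAK`, which matters when an axis is labeled by ID. A minimal sketch of how this could be checked follows. It uses a small hand-made sample rather than the real file, and the names `sample`, `lookup`, `names_per_id`, and `ids_per_name` are illustrative assumptions.

```python
import pandas as pd

# Hand-made sample mirroring a few ID/name pairs from the output above
# (an assumption for illustration, not the full dataset)
sample = pd.DataFrame({
    "time_period_id": ["time_period_10", "time_period_01",
                       "time_period_11", "time_period_01"],
    "time_period_name": ["OFF_PEAK", "VERY_EARLY_MORNING",
                         "OFF_PEAK", "VERY_EARLY_MORNING"],
})

# Build a de-duplicated lookup table sorted by ID
lookup = (
    sample[["time_period_id", "time_period_name"]]
    .drop_duplicates()
    .sort_values(by="time_period_id")
    .reset_index(drop=True)
)

# Count distinct names per ID, and distinct IDs per name
names_per_id = lookup.groupby("time_period_id")["time_period_name"].nunique()
ids_per_name = lookup.groupby("time_period_name")["time_period_id"].nunique()

print(lookup)
print(ids_per_name)
```

If every entry of `names_per_id` is 1, each ID has a single name; an entry of `ids_per_name` greater than 1 flags a name shared by several IDs.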


In [5]:
# Group the data based on season, time_period_id, and route_name
mbta_group = mbta.groupby(['season', 'time_period_id', 'route_name']).sum(numeric_only=True)
mbta_group


season,time_period_id,route_name,FID,mode,direction_id,total_ons,total_offs,number_service_days,average_ons,average_offs,average_flow
Fall 2017,time_period_01,Blue Line,163222,24,12,185003,185002,1968,2257,2256,11786
Fall 2017,time_period_01,Green Line,447530,0,66,143550,145850,10824,1748,1774,11340
Fall 2017,time_period_01,Orange Line,286463,40,20,191168,191168,3280,2330,2335,14416
Fall 2017,time_period_01,Red Line,334499,44,22,247944,247945,3608,3023,3021,16930
Fall 2017,time_period_02,Blue Line,161633,24,12,374632,374630,1968,4567,4569,23141
...,...,...,...,...,...,...,...,...,...,...,...
Fall 2019,time_period_10,Red Line,233811,44,22,989603,989606,528,82469,82469,385926
Fall 2019,time_period_11,Blue Line,111294,24,12,304726,304726,192,38092,38092,168400
Fall 2019,time_period_11,Green Line,130039,0,66,603306,605899,1056,75418,75743,357237
Fall 2019,time_period_11,Orange Line,201661,40,20,431874,431872,320,53987,53985,286739


In [6]:
# Flatten the grouped DataFrame so that there are no more hierarchical indices
mbta_sorted = mbta_group.reset_index()
mbta_sorted

Unnamed: 0,season,time_period_id,route_name,FID,mode,direction_id,total_ons,total_offs,number_service_days,average_ons,average_offs,average_flow
0,Fall 2017,time_period_01,Blue Line,163222,24,12,185003,185002,1968,2257,2256,11786
1,Fall 2017,time_period_01,Green Line,447530,0,66,143550,145850,10824,1748,1774,11340
2,Fall 2017,time_period_01,Orange Line,286463,40,20,191168,191168,3280,2330,2335,14416
3,Fall 2017,time_period_01,Red Line,334499,44,22,247944,247945,3608,3023,3021,16930
4,Fall 2017,time_period_02,Blue Line,161633,24,12,374632,374630,1968,4567,4569,23141
...,...,...,...,...,...,...,...,...,...,...,...,...
127,Fall 2019,time_period_10,Red Line,233811,44,22,989603,989606,528,82469,82469,385926
128,Fall 2019,time_period_11,Blue Line,111294,24,12,304726,304726,192,38092,38092,168400
129,Fall 2019,time_period_11,Green Line,130039,0,66,603306,605899,1056,75418,75743,357237
130,Fall 2019,time_period_11,Orange Line,201661,40,20,431874,431872,320,53987,53985,286739
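
Because `.sum()` is applied to every numeric column, identifier-like columns such as `FID`, `mode`, and `direction_id` are summed as well, and the summed `average_*` columns become totals over all rows in each group rather than averages. A possible alternative, sketched here on toy rows (the values and the names `toy`, `keys`, and `flow` are made up for illustration), is to aggregate only the measure that will be plotted and keep the group keys as ordinary columns:

```python
import pandas as pd

# Toy rows for illustration only; values are invented, not taken from the MBTA file
toy = pd.DataFrame({
    "season": ["Fall 2019"] * 4,
    "time_period_id": ["time_period_01"] * 4,
    "route_name": ["Green Line", "Green Line", "Red Line", "Red Line"],
    "FID": [1, 2, 3, 4],
    "average_flow": [4, 381, 100, 50],
})

keys = ["season", "time_period_id", "route_name"]

# as_index=False keeps the keys as columns, so no separate flattening step is needed;
# selecting [["average_flow"]] avoids summing ID columns such as FID
flow = toy.groupby(keys, as_index=False)[["average_flow"]].sum()

print(flow)
```

This is only one option; whether a sum or a mean is the right aggregate depends on what the chart is meant to show.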


### Part 3
***
Jupyter Notebook for Altair visualizations created.

### Part 4 - Scatter Plot and Line Chart
***
Code for my scatter plot and line chart is below.

This code assigns colors to the routes, creates the selection feature, and creates a dropdown menu to choose which season to look at.

In [7]:
# Assign colors to each route
routes = alt.Scale(domain=['Red Line', 'Green Line', 'Blue Line', 'Orange Line'],
                   range=['#CD2626', '#006400', '#1874CD','#FF6103'])
color = alt.Color('route_name:N', scale=routes)

# Create seasons list
seasons = mbta['season'].unique().tolist()

# Multi-click selection by route color, shared by both charts
click = alt.selection_multi(encodings=['color'])

# A dropdown filter to select the season
season_dropdown = alt.binding_select(options=seasons, name='Season')
season_select = alt.selection_single(fields=['season'], bind=season_dropdown)

### Scatter Plot
The scatter plot compares average boardings (`'average_ons'`) with average alightings (`'average_offs'`) for every stop on every route at every time of day. It gives an overview of all stops and time periods in the data, showing which routes are the most popular and when. A details-on-demand tooltip shows the `'stop_name'`, `'time_period_name'`, `'average_ons'`, `'average_offs'`, and `'route_name'`. The color represents the line: Red, Green, Orange, or Blue. Using the dropdown menu at the bottom, you can select which season to look at (Fall 2017, Fall 2018, or Fall 2019). The graph is also linked to the line chart below: when you select a point, all points on that route stay highlighted, the rest are grayed out, and the corresponding line in the line chart stays highlighted while the other lines are grayed out. The lime green squares denote the Northeastern University stop and the black squares denote the Ruggles stop, as mentioned in the title.

Looking at the scatter plot, many of the points at the high ends of the x- and y-axes are from the Red Line and Green Line. The Park Street stop on the Green Line has the highest `'average_ons'` during the `'OFF_PEAK'` time period. The Blue Line is the least popular, which makes sense to me as I have never been on it. I expected more Orange Line points to be higher on the graph, so it is interesting to see that the Green and Red Lines are more popular. It is also interesting to see how the plot changes as you select a different season.

In [8]:
# Scatter plot title
scatter_title = alt.TitleParams(
    'MBTA Average Passengers On vs. Average Passengers Off Based on Season',
    subtitle=['Compared with the Average Flow based on Time of Day',
              'Northeastern University stop shown as Lime Green Square',
              'Ruggles stop shown as Black Square'],
    anchor='middle',
    orient='top')

# Scatter plot 
scatter = alt.Chart(mbta, title=scatter_title).mark_point().encode(
    alt.X('average_ons:Q', title='Average Ons'),
    alt.Y('average_offs:Q',
          title='Average Offs'
         ),
    color=alt.condition(click, color, alt.value('lightgray')),
    opacity=alt.condition(click, alt.value(1), alt.value(0.2)),
    tooltip=['stop_name:N', 'time_period_name:N', 
             'average_ons:Q', 'average_offs:Q', 'route_name:N']
).properties(
    width=800,
    height=600
).add_selection(
    click, season_select
).transform_filter(
    season_select
)

### Pop-Out Effect
The pop-out effect allows users to easily locate the Northeastern University and Ruggles stops. The lime green squares denote the Northeastern University stop and the black squares denote the Ruggles stop, as mentioned in the title. I could not add a legend for these markers to the larger scatter plot because of interactivity restrictions, so the description is listed in the title instead. Looking at the scatter plot, it is easy to tell that the Ruggles stop has higher averages than the Northeastern stop at most times of day. This makes sense: those who use the Northeastern stop are primarily Northeastern students, while the Ruggles stop is used by the greater community.


In [9]:
# Locate Ruggles stop points
ruggles = mbta[mbta.stop_name == 'Ruggles']

# Create black squares for Ruggles stops
ruggles_scat = alt.Chart(ruggles).mark_square(color='black').encode(
    x=alt.X('average_ons:Q', title='Average Ons'),
    y=alt.Y('average_offs:Q', title='Average Offs'),
    tooltip=['stop_name:N', 'time_period_name:N', 
             'average_ons:Q', 'average_offs:Q', 'route_name:N']
).properties(
    width=800,
    height=600
)

# Locate Northeastern stop points
northeastern = mbta[mbta.stop_name == 'Northeastern University']

# Create lime green squares for Northeastern stops
northeastern_scat = alt.Chart(northeastern).mark_square(color='lime').encode(
    x=alt.X('average_ons:Q', title='Average Ons'),
    y=alt.Y('average_offs:Q', title='Average Offs'),
    tooltip=['stop_name:N', 'time_period_name:N', 
             'average_ons:Q', 'average_offs:Q', 'route_name:N']
).properties(
    width=800,
    height=600
)
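
As a possible alternative to filtering out two separate DataFrames (this is not what the notebook does), the highlighted stops could be tagged in a single derived column. The sketch below uses toy rows with invented values, and the names `stops`, `highlight_stops`, and `highlight` are assumptions for illustration.

```python
import pandas as pd

# Toy rows for illustration only; the notebook itself works on the `mbta` DataFrame
stops = pd.DataFrame({
    "stop_name": ["Ruggles", "Northeastern University", "Park Street"],
    "average_ons": [10, 5, 20],      # invented values
    "average_offs": [12, 4, 18],     # invented values
})

highlight_stops = ["Ruggles", "Northeastern University"]

# Keep the stop name where it is one of the highlighted stops, otherwise label it "Other"
is_highlighted = stops["stop_name"].isin(highlight_stops)
stops["highlight"] = stops["stop_name"].where(is_highlighted, "Other")

print(stops[["stop_name", "highlight"]])
```

A column like this could then be mapped to a shape or color encoding within one layer. Whether that would also make a legend possible alongside the selections is untested here.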

### Line Chart
This line chart plots the average flow of each route against the time period. It is a good indication of which lines have the most passengers during each part of the day. The line chart also responds to the dropdown menu, so you can see how the `'average_flow'` has changed across the years. The Green Line has the highest `'average_flow'` during the `'OFF_PEAK'` period, but the Red Line has a higher `'average_flow'` during most of the earlier time periods. As with the scatter plot, if you select a line on the line chart, only that route stays highlighted and the same route is highlighted on the scatter plot. A details-on-demand tooltip shows the `'route_name'`, `'average_ons'`, `'average_offs'`, `'total_ons'`, `'total_offs'`, and `'average_flow'`. The color represents the line (Red, Green, Orange, or Blue).

In [10]:
# Create axis labels to correspond with the time_period_id
axis_labels = ("datum.label == 'time_period_01' ? 'Very Early Morning' \n"
               ": datum.label == 'time_period_02' ? 'Early AM' \n"
               ": datum.label == 'time_period_03' ? 'AM Peak' \n"
               ": datum.label == 'time_period_04' ? 'Midday Base' \n"
               ": datum.label == 'time_period_05' ? 'Midday School' \n"
               ": datum.label == 'time_period_06' ? 'PM Peak' \n"
               ": datum.label == 'time_period_07' ? 'Evening' \n"
               ": datum.label == 'time_period_08' ? 'Late Evening' \n"
               ": datum.label == 'time_period_09' ? 'Night' \n"
               ": datum.label == 'time_period_10' ? 'Off Peak' \n"
               ": datum.label == 'time_period_11' ? 'Off Peak': 'Frequent'")


# Line chart title
line_title = alt.TitleParams(
    'Time of Day vs. Average Flow',
    anchor='middle',
    orient='top')

# Line chart
lines = alt.Chart(mbta_sorted, title=line_title).mark_line().encode(
    x=alt.X('time_period_id', axis=alt.Axis(labelExpr=axis_labels), title='Time Period'),
    y=alt.Y('average_flow:Q', title='Average Flow'),
    color=alt.condition(click, color, alt.value('lightgray')),
    opacity=alt.condition(click, alt.value(1), alt.value(0.2)),
    tooltip=['route_name:N', 'average_ons:Q', 'average_offs:Q', 
             'total_ons:Q', 'total_offs:Q', 'average_flow:Q']
).properties(
    width=550,
).add_selection(
    click, season_select
).transform_filter(
    season_select
)
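
The long ternary string assigned to `axis_labels` in the cell above is easy to mistype. One possible way to generate the same kind of expression from a dictionary is sketched below. `period_labels` and `build_label_expr` are hypothetical names, not part of Altair; the label text and the `'Frequent'` fallback are copied from the cell above.

```python
# Mapping of time_period_id values to the display labels used in the notebook
period_labels = {
    "time_period_01": "Very Early Morning",
    "time_period_02": "Early AM",
    "time_period_03": "AM Peak",
    "time_period_04": "Midday Base",
    "time_period_05": "Midday School",
    "time_period_06": "PM Peak",
    "time_period_07": "Evening",
    "time_period_08": "Late Evening",
    "time_period_09": "Night",
    "time_period_10": "Off Peak",
    "time_period_11": "Off Peak",
}


def build_label_expr(labels, default=""):
    """Build a chained ternary expression that maps datum.label to display text.

    Hypothetical helper: assumes keys and labels contain no single quotes.
    """
    if not labels:
        return f"'{default}'"
    clauses = [f"datum.label == '{key}' ? '{text}'" for key, text in labels.items()]
    return " : ".join(clauses) + f" : '{default}'"


axis_labels = build_label_expr(period_labels, default="Frequent")
print(axis_labels)
```

The resulting string could be passed to `alt.Axis(labelExpr=...)` in the same way as the hand-written version.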

### Display Charts

In [11]:
# Display all charts
chart = alt.vconcat((scatter + ruggles_scat + northeastern_scat), lines, center=True)
chart



### Part 5 - Details on Demand
***
On the scatter plot, there is a details-on-demand feature that shows the `'stop_name', 'time_period_name', 'average_ons', 'average_offs',` and `'route_name'`. On the line chart, there is a details-on-demand feature that shows the `'route_name', 'average_ons', 'average_offs', 'total_ons', 'total_offs'`, and `'average_flow'`.

### Part 6 - Pop-Out Effect
***

The pop-out effect allows users to easily locate the Northeastern University and Ruggles stops. The lime green squares denote the Northeastern University stop and the black squares denote the Ruggles stop, as mentioned in the title. I could not add a legend for these markers to the larger scatter plot because of interactivity restrictions, so the description is listed in the title instead. Looking at the scatter plot, it is easy to tell that the Ruggles stop has higher averages than the Northeastern stop at most times of day. This makes sense: those who use the Northeastern stop are primarily Northeastern students, while the Ruggles stop is used by the greater community.


### Part 7 - Explanations
***
**Pop-Out Effect**
The pop-out effect I chose uses shape and color. I changed the points for the Northeastern University and Ruggles stops to squares so that they are easily seen against the outlined circles used for the other stops. I took it one step further and changed the color of these points to make them even easier to see. Since a lot of data is being plotted, these two effects make it easy for users to locate these stops on the plot.
