# Exercise 3:
* Keep a subset of columns and clean up column names (no spaces, newlines, etc):
    * columns related to identifying the agency
    * population, passenger trips
    * transit mode
    * at least 3 service metric variables, normalized and not normalized
* Deal with duplicates - what is the unit for each row? What is the unit for desired analysis? Should an agency appear multiple times, and if so, why?
* Aggregate at least 2 ways and show an interesting comparison, after dealing with duplicates somehow (either aggregation and/or defining what the unit of analysis is)
* Calculate weighted average after the aggregation for the service metrics
* Decide on one type of chart to visualize, and generalize it as a function
* Make charts using the function
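For the weighted-average step, one possible approach is to sum the numerator and denominator within each group and divide afterwards, rather than averaging the per-row ratios. The sketch below uses a made-up toy table; the values and the `weighted_cost_per_passenger` name are illustrative assumptions, not results from the NTD file.

```python
import pandas as pd

# Hypothetical toy rows shaped like the cleaned data (values are made up).
toy = pd.DataFrame({
    "State": ["CA", "CA", "NY"],
    "Total_Operating_Expenses": [100.0, 900.0, 500.0],
    "Unlinked_Passenger_Trips": [10.0, 300.0, 250.0],
})
toy["Cost_per_Passenger"] = (
    toy["Total_Operating_Expenses"] / toy["Unlinked_Passenger_Trips"]
)

# Weighted average: sum expenses and trips per group, then divide.
# This weights each row's cost per passenger by its number of trips.
agg = toy.groupby("State", as_index=False)[
    ["Total_Operating_Expenses", "Unlinked_Passenger_Trips"]
].sum()
agg["weighted_cost_per_passenger"] = (
    agg["Total_Operating_Expenses"] / agg["Unlinked_Passenger_Trips"]
)

# Unweighted mean of the ratios, for comparison.
simple_mean = toy.groupby("State")["Cost_per_Passenger"].mean()
```

In this toy table the two CA rows have very different trip counts, so the weighted figure and the simple mean diverge; for NY, with a single row, they agree.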

### Helpful Hints for Functions
* Opportunities come from the parts of a chart that are generalizable
* These parts may need the same lines of code to clean them
* You can always further define variables within a function
* You can always use f-strings within functions to make slight modifications to the parameters you pass
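As a small illustration of the f-string hint, here is a sketch of a helper that turns the column names passed to a chart function into readable labels. The `make_labels` name and the label wording are hypothetical, chosen only for this example.

```python
def make_labels(x: str, y: str, group: str = "State") -> dict:
    """Build chart labels from column names (hypothetical helper)."""
    # Variables defined inside the function: tidy versions of the inputs.
    pretty_x = x.replace("_", " ")
    pretty_y = y.replace("_", " ")
    return {
        # f-strings make slight modifications to the parameters passed in.
        "title": f"{pretty_y} by {pretty_x}",
        "subtitle": f"Colored by {group}",
        "x_label": pretty_x,
        "y_label": pretty_y,
    }


labels = make_labels("mode_full_name", "Cost_per_Passenger")
```

A chart function could call a helper like this once and reuse the result for the title and axis labels.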

## Reading in my file & cleaning it up

In [208]:
import pandas as pd
import numpy as np
import shared_utils

pd.set_option('display.max_columns', None)

In [209]:
GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/bus_service_increase/"
FILE_NAME = "ntd_metrics_2019.csv"
df2 = pd.read_csv(f"{GCS_FILE_PATH}{FILE_NAME}")

In [210]:
#cleaning up columns
df2.columns = df2.columns.str.strip().str.replace(' ', '_')
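Replacing only spaces leaves newline characters in names such as `'Primary_UZA\n_Population'`, which is why those columns need renaming later. A possible alternative, sketched here on made-up column names that assume the raw headers look like `"Primary UZA\n Population"`, is to collapse every run of whitespace into a single underscore.

```python
import pandas as pd

# Hypothetical raw headers containing spaces and newlines.
raw = pd.DataFrame(
    columns=["Primary UZA\n Population", "Cost per\n Hour", " Agency "]
)

# \s+ matches spaces, tabs, and newlines, so one pass handles all of them.
cleaned = raw.columns.str.strip().str.replace(r"\s+", "_", regex=True)
```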

## Subsetting for the 3 performance metrics & agency identifiers.

<b>Notes on the different abbreviations</b>
* TOS: Type of Service

* MB: Bus (a transit mode)

* DO: Directly Operated (a type of service)

* VOMS: Vehicles Operated in Annual Maximum Service

In [211]:
list(df2.columns)

['Agency',
 'City',
 'State',
 'Legacy_NTD_ID',
 'NTD_ID',
 'Organization_Type',
 'Reporter_Type',
 'Primary_UZA\n_Population',
 'Agency_VOMS',
 'Mode',
 'TOS',
 'Mode_VOMS',
 'Ratios:',
 'Fare_Revenues_per_Unlinked_Passenger_Trip',
 'Fare_Revenues_per_Unlinked_Passenger_Trip_Questionable',
 'Fare_Revenues_per_Total_Operating_Expense_(Recovery_Ratio)',
 'Fare_Revenues_per_Total_Operating_Expense_(Recovery_Ratio)_Questionable',
 'Cost_per\n_Hour',
 'Cost_per_Hour_Questionable',
 'Passengers_per_Hour',
 'Passengers_per_Hour_Questionable',
 'Cost_per_Passenger',
 'Cost_per_Passenger_Questionable',
 'Cost_per_Passenger_Mile',
 'Cost_per_Passenger_Mile_Questionable',
 'Source_Data:',
 'Fare_Revenues_Earned',
 'Fare_Revenues_Earned_Questionable',
 'Total_Operating_Expenses',
 'Total_Operating_Expenses_Questionable',
 'Unlinked_Passenger_Trips',
 'Unlinked_Passenger_Trips_Questionable',
 'Vehicle_Revenue_Hours',
 'Vehicle_Revenue_Hours_Questionable',
 'Passenger_Miles',
 'Passenger_Miles_Ques

In [212]:
df3 = df2[['Agency', 'City', 'State', 'Legacy_NTD_ID', 'NTD_ID',
    'Primary_UZA\n_Population', 'Mode','Cost_per\n_Hour','Passengers_per_Hour','Cost_per_Passenger','TOS','Unlinked_Passenger_Trips','Total_Operating_Expenses']]

#### I don't get this error message below?

In [213]:
df3.rename(columns = {'Primary_UZA\n_Population': 'Primary_Population', 'Cost_per\n_Hour': 'Cost_per_hour'}, inplace=True) #renaming columns 

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().rename(
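A likely explanation for this warning: `df3` was created by selecting columns from `df2`, and then modified in place, so pandas cannot tell whether the change is meant to reach `df2` as well. One common way to avoid it, sketched on a toy frame with made-up values, is to take an explicit `.copy()` when subsetting and to assign the result of `rename` instead of using `inplace=True`.

```python
import pandas as pd

# Hypothetical toy frame standing in for df2.
toy = pd.DataFrame(
    {"Agency": ["A", "B"], "Cost_per\n_Hour": [1.0, 2.0], "Extra": [0, 0]}
)

# .copy() makes the subset an independent DataFrame.
subset = toy[["Agency", "Cost_per\n_Hour"]].copy()

# rename returns a new DataFrame; assign it back to keep the change.
subset = subset.rename(columns={"Cost_per\n_Hour": "Cost_per_hour"})
```

The original frame keeps its old column name, and the subset carries the new one.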


In [214]:
print(f"There are: {len(df3)} rows before any filtering")

There are: 3685 rows before any filtering


In [215]:
df3.head(2)

Unnamed: 0,Agency,City,State,Legacy_NTD_ID,NTD_ID,Primary_Population,Mode,Cost_per_hour,Passengers_per_Hour,Cost_per_Passenger,TOS,Unlinked_Passenger_Trips,Total_Operating_Expenses
0,MTA New York City Transit,New York,NY,2008,20008,18351295,HR,$267.97,139.6,$1.92,DO,2712521697,"$5,206,727,193"
1,MTA New York City Transit,New York,NY,2008,20008,18351295,CB,$393.55,18.6,$21.13,DO,11477164,"$242,520,835"


## Deal with Duplicates: What is the unit for each row? What is the unit for desired analysis? Should an agency appear multiple times, and if so, why?

* The unit for each row is a mode of transportation for one agency in one geography, with the cost per hour of operating the vehicle, how many passengers boarded per hour, and the cost per passenger.

* Filtering TOS to keep only Directly Operated (DO) service, to hopefully eliminate some duplicates and concentrate on analyzing one aspect first.

* An agency should appear multiple times. I am not sure why LA has the same entry, but below I saw that some agencies in different states have similar (or even the same) names, so these count as unique entries. I also guess that some agencies have multiple models of buses and might categorize them differently. I used to work on scheduling buses at UC Davis and remember they kept specific track of each bus and model when we rented them out, so maybe that's why it is split.

#### How do I drop these? Does dropna work?

In [216]:
#looking at duplicates...these are just NA values. 
duplicateRowsDF = df3[df3.duplicated()]
duplicateRowsDF

Unnamed: 0,Agency,City,State,Legacy_NTD_ID,NTD_ID,Primary_Population,Mode,Cost_per_hour,Passengers_per_Hour,Cost_per_Passenger,TOS,Unlinked_Passenger_Trips,Total_Operating_Expenses
3681,,,,,,,,,,,,,
3682,,,,,,,,,,,,,
3683,,,,,,,,,,,,,
3684,,,,,,,,,,,,,


#### Question

In [217]:
#drop nas..do I have to do inplace= true here?
df3.dropna().head(2)

Unnamed: 0,Agency,City,State,Legacy_NTD_ID,NTD_ID,Primary_Population,Mode,Cost_per_hour,Passengers_per_Hour,Cost_per_Passenger,TOS,Unlinked_Passenger_Trips,Total_Operating_Expenses
0,MTA New York City Transit,New York,NY,2008,20008,18351295,HR,$267.97,139.6,$1.92,DO,2712521697,"$5,206,727,193"
1,MTA New York City Transit,New York,NY,2008,20008,18351295,CB,$393.55,18.6,$21.13,DO,11477164,"$242,520,835"


In [218]:
print(f"Number of unique agencies after we deduplicate: {df3.Agency.nunique()}")

Number of unique agencies after we deduplicate: 2169


In [219]:
print(f"There are: {len(df3)} rows after any filtering")

There are: 3685 rows after any filtering
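The row count is 3685 both before and after, which suggests the `dropna()` result was never stored: `dropna()` returns a new DataFrame and leaves the original untouched unless the result is assigned back. A minimal sketch on a toy frame (made-up values) follows; `how="all"` is used here on the assumption that only fully empty rows should go.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the last row is entirely missing.
toy = pd.DataFrame(
    {"Agency": ["A", "B", np.nan], "TOS": ["DO", np.nan, np.nan]}
)

# Calling dropna without assigning discards the result.
toy.dropna(how="all")
rows_without_assignment = len(toy)

# Assigning the result keeps the filtered rows.
toy_clean = toy.dropna(how="all")
```

With the default `how="any"`, the row for agency B would be dropped too, since its TOS is missing.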


In [220]:
#checking for TOS 
df3.TOS.unique().tolist()

['DO', 'PT', nan]

In [221]:
#filtering to keep just DO
df4 = df3[(df3["TOS"] == "DO")]

In [222]:
#checking duplicates again
duplicates = (df4.groupby(['Agency','Mode']).size() 
   .sort_values(ascending=False) 
   .reset_index(name='count'))


In [223]:
duplicates

Unnamed: 0,Agency,Mode,count
0,Union County Transit,DR,2
1,"Advance Transit, Inc. NH",DR,2
2,Jackson County,DR,2
3,Valley Transit,DR,2
4,Valley Transit,MB,2
...,...,...,...
2537,"County of Miami-Dade , dba: Transportation & P...",MG,1
2538,"County of Muskegon, dba: Muskegon Area Transit...",DR,1
2539,"County of Muskegon, dba: Muskegon Area Transit...",MB,1
2540,"County of Placer, dba: Placer County Departmen...",DR,1


In [224]:
#looking at a few of the agencies that still have duplicates to see why...
df4[(df4.Agency.str.contains("Southern Teton Area Rapid Transit"))]

Unnamed: 0,Agency,City,State,Legacy_NTD_ID,NTD_ID,Primary_Population,Mode,Cost_per_hour,Passengers_per_Hour,Cost_per_Passenger,TOS,Unlinked_Passenger_Trips,Total_Operating_Expenses
1759,Southern Teton Area Rapid Transit,Jackson,WY,8R05-010,8R05-80188,0,DR,$22.40,1.0,$21.46,DO,5244,"$112,528"
1761,Southern Teton Area Rapid Transit,Jackson,WY,8R05-010,8R05-80188,0,MB,$71.66,20.3,$3.53,DO,1098224,"$3,875,238"
3384,Southern Teton Area Rapid Transit,Jackson,WY,8R05-010,0R01-80188,0,MB,$146.39,15.4,$9.52,DO,33383,"$317,816"


In [225]:
#looking at a few of the agencies that still have duplicates to see why...
df4[(df4.Agency.str.contains("Union County Transit")) & (df4.Agency.notna()) & 
   (df4.Mode=="DR")]

Unnamed: 0,Agency,City,State,Legacy_NTD_ID,NTD_ID,Primary_Population,Mode,Cost_per_hour,Passengers_per_Hour,Cost_per_Passenger,TOS,Unlinked_Passenger_Trips,Total_Operating_Expenses
2849,Union County Transit,LIBERTY,IN,5R02-038,5R02-50387,0,DR,$34.50,1.7,$20.13,DO,24582,"$494,890"
3550,Union County Transit,Blairsville,GA,4R03-012,4R03-41145,0,DR,$22.69,1.6,$14.24,DO,6180,"$87,989"
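Since the same agency name can belong to different reporters, one option is to count duplicates by identifier and state rather than by name alone. The sketch below rebuilds four of the rows shown in this notebook (Union County Transit and Jackson County) and compares the two groupings; treating `NTD_ID` as the distinguishing key is an assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    "Agency": ["Union County Transit", "Union County Transit",
               "Jackson County", "Jackson County"],
    "State": ["IN", "GA", "NC", "GA"],
    "NTD_ID": ["5R02-50387", "4R03-41145", "4R06-41167", "4R03-41154"],
    "Mode": ["DR", "DR", "DR", "DR"],
})

# Grouping by name and mode makes different agencies look like duplicates.
by_name = toy.groupby(["Agency", "Mode"]).size().reset_index(name="count")

# Adding the ID and state separates them again.
by_id = (
    toy.groupby(["NTD_ID", "Agency", "State", "Mode"])
    .size()
    .reset_index(name="count")
)
```

This would not settle every case: the Southern Teton Area Rapid Transit rows above share a Legacy_NTD_ID but carry two different NTD_ID values.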


In [226]:
#some agencies have VERY similar names in different states...
df4[(df4.Agency.str.contains("Jackson County")) 
   & (df4.Agency.notna())
  ]

Unnamed: 0,Agency,City,State,Legacy_NTD_ID,NTD_ID,Primary_Population,Mode,Cost_per_hour,Passengers_per_Hour,Cost_per_Passenger,TOS,Unlinked_Passenger_Trips,Total_Operating_Expenses
1526,"Jackson County Transportation, Inc.",Marianna,FL,4R02-019,4R02-41198,0,DR,$52.98,1.4,$38.23,DO,37567,"$1,436,173"
1557,"Jackson County Transportation, Inc.",Marianna,FL,4R02-019,4R02-41198,0,MB,$31.76,2.8,$11.49,DO,2195,"$25,216"
1981,"Delaware, Dubuque & Jackson County Regional Tr...",Dubuque,IA,7R01-008,7R01-70136,0,DR,$66.81,3.6,$18.55,DO,99025,"$1,837,191"
2047,Jackson County Mass Transit District,Carbondale,IL,5204,50204,67821,MB,$47.76,3.3,$14.67,DO,93691,"$1,374,554"
2614,Jackson County,Sylva,NC,4R06-023,4R06-41167,0,DR,$47.40,1.6,$28.92,DO,18663,"$539,672"
2660,Jackson County,Sylva,NC,4R06-023,4R06-41167,0,MB,$47.58,3.0,$15.69,DO,8004,"$125,576"
2953,Jackson County Council on Aging,Scottsboro,AL,4R01-016,4R01-41180,0,DR,$40.65,2.9,$14.25,DO,29346,"$418,250"
3370,Jackson County,Jefferson,GA,4R03-009,4R03-41154,0,DR,$23.39,1.7,$14.09,DO,12285,"$173,101"


In [227]:
#same thing
df4[(df4.Agency.str.contains("Valley Transit")) & (df4.Agency.notna()) ]

Unnamed: 0,Agency,City,State,Legacy_NTD_ID,NTD_ID,Primary_Population,Mode,Cost_per_hour,Passengers_per_Hour,Cost_per_Passenger,TOS,Unlinked_Passenger_Trips,Total_Operating_Expenses
688,Tri-Valley Transit Inc,Middlebury,VT,1R06-001,1R06-10143,0,MB,$72.87,5.6,$13.02,DO,105277,"$1,370,321"
689,Tri-Valley Transit Inc,Middlebury,VT,1R06-001,1R06-10143,0,DR,$28.12,1.4,$20.29,DO,119556,"$2,425,301"
690,Tri-Valley Transit Inc,Middlebury,VT,1R06-001,1R06-10143,0,CB,$80.41,3.5,$23.27,DO,57918,"$1,347,724"
771,"City of Appleton, dba: Valley Transit",Appleton,WI,5001,50001,216154,MB,$95.67,15.9,$6.03,DO,939541,"$5,661,984"
929,Concho Valley Transit District,San Angelo,TX,6102,60102,92984,DR,$79.26,2.0,$40.24,DO,88063,"$3,543,343"
932,Concho Valley Transit District,San Angelo,TX,6102,60102,92984,MB,$82.59,11.0,$7.48,DO,211728,"$1,584,665"
1326,"City of Williamsport, dba: River Valley Transit",Williamsport,PA,3026,30026,56142,MB,$141.67,23.0,$6.15,DO,1310695,"$8,063,312"
1332,"City of Williamsport, dba: River Valley Transit",Williamsport,PA,3026,30026,56142,DR,$96.00,1.4,$68.57,DO,7,$480
1598,Treasure Valley Transit,Nampa,ID,0R01-012,00373,151499,DR,$51.78,1.9,$27.52,DO,45719,"$1,258,248"
1600,Treasure Valley Transit,Nampa,ID,0R01-012,00373,151499,MB,$73.37,4.5,$16.43,DO,70882,"$1,164,627"
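`str.contains` matches substrings, which is why a search for "Valley Transit" also returns Tri-Valley Transit, Concho Valley Transit District, and others. A small sketch of the difference from an exact comparison, using agency names that appear in this notebook:

```python
import pandas as pd

agencies = pd.Series([
    "Tri-Valley Transit Inc",
    "City of Appleton, dba: Valley Transit",
    "Concho Valley Transit District",
    "Treasure Valley Transit",
    "Valley Transit",
])

# Substring match: every name containing the text is kept.
# na=False treats missing names as non-matches, so no extra notna() is needed.
contains_mask = agencies.str.contains("Valley Transit", na=False)

# Exact match: only the agency whose full name equals the text.
exact_mask = agencies == "Valley Transit"
```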


## Changing mode abbreviations
* Mapping a few of the abbreviations to the full name for ease.

* [Transit.dot.gov's data dictionary](https://www.transit.dot.gov/ntd/national-transit-database-ntd-glossary#C)

In [228]:
# Map values from abbreviations to the full name
MODE_NAMES = {
    'YR': 'Hybrid Rail', 
    'SR': 'Street Car',
    'MB': 'Bus',
    'LR': 'Light Rail',
    'CB': 'Commuter Bus',
    'CC': 'Cable Car',
    'MG': 'Monorail and Automated Guideway'
}

#### Question: is there a neater way to have a catch-all "Other" group for the remaining abbreviations? Could I have written an if/else statement?

In [229]:
#mapping "other" modes
MODE_NAMES.update(dict.fromkeys(
    ['HR', 'DR', 'RB', 'CR', 'VP', 'DT', 'FB', 'TB', 'PB', 'IP', 'TR', 'AR'],
    'Other',
))
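One possibly neater catch-all: map only the modes that get their own name, then fill whatever did not match with "Other". This is a sketch on a toy Series; `NAMED_MODES` is a shortened, hypothetical version of the dictionary above.

```python
import pandas as pd

# Only the abbreviations that should keep a specific name.
NAMED_MODES = {"MB": "Bus", "LR": "Light Rail", "CB": "Commuter Bus"}

modes = pd.Series(["MB", "DR", "LR", "HR", "CB"])

# map() returns NaN for keys missing from the dictionary;
# fillna() then labels all of those as "Other".
full_names = modes.map(NAMED_MODES).fillna("Other")
```

One caveat: rows where the mode itself is missing would also end up as "Other", so this assumes missing modes have already been handled.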

In [230]:
df4 = df4.assign(
    mode_full_name = df4.Mode.map(MODE_NAMES)
)

In [231]:
df4.mode_full_name.value_counts()

Other                              1548
Bus                                 870
Commuter Bus                         95
Light Rail                           21
Street Car                           13
Monorail and Automated Guideway       4
Cable Car                             1
Hybrid Rail                           1
Name: mode_full_name, dtype: int64

In [232]:
#just checking for cable car
df4[(df4.mode_full_name.str.contains("Cable"))]

Unnamed: 0,Agency,City,State,Legacy_NTD_ID,NTD_ID,Primary_Population,Mode,Cost_per_hour,Passengers_per_Hour,Cost_per_Passenger,TOS,Unlinked_Passenger_Trips,Total_Operating_Expenses,mode_full_name
121,"City and County of San Francisco, dba: San Fra...",San Francisco,CA,9015,90015,3281212,CC,$529.36,43.0,$12.32,DO,5703705,"$70,277,173",Cable Car


## Cleaning up columns

In [233]:
df4.columns

Index(['Agency', 'City', 'State', 'Legacy_NTD_ID', 'NTD_ID',
       'Primary_Population', 'Mode', 'Cost_per_hour', 'Passengers_per_Hour',
       'Cost_per_Passenger', 'TOS', 'Unlinked_Passenger_Trips',
       'Total_Operating_Expenses', 'mode_full_name'],
      dtype='object')

In [234]:
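#cleaning up string columns
for i in ['Agency', 'City', 'State', 'mode_full_name']:
    # .str.replace removes commas inside each value; plain .replace only swaps whole values
    df4[i] = df4[i].str.strip().str.replace(',', '', regex=False).astype(str)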

In [235]:
#optional: the next cell strips commas from every numeric column anyway
df4["Cost_per_Passenger"] = df4["Cost_per_Passenger"].replace({"1,218.00": "1218"})

In [236]:
#cleaning up the numeric columns: strip "$" and "," and convert to float
for i in ['Primary_Population', 'Cost_per_hour', 'Passengers_per_Hour',
       'Cost_per_Passenger', 'Unlinked_Passenger_Trips', 'Total_Operating_Expenses']:
    df4[i] = df4[i].replace(r'\$', '', regex=True).replace(',', '', regex=True).astype(float)

In [237]:
#making sure all the data types are accurate
df4.dtypes

Agency                       object
City                         object
State                        object
Legacy_NTD_ID                object
NTD_ID                       object
Primary_Population          float64
Mode                         object
Cost_per_hour               float64
Passengers_per_Hour         float64
Cost_per_Passenger          float64
TOS                          object
Unlinked_Passenger_Trips    float64
Total_Operating_Expenses    float64
mode_full_name               object
dtype: object
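Since the same cleaning applies to every money-like column, it could also be wrapped in a small function. The `to_number` helper below is hypothetical, and the sample strings are illustrative values in the same format as the columns above.

```python
import pandas as pd


def to_number(series: pd.Series) -> pd.Series:
    """Strip '$' and ',' and convert to float (hypothetical helper)."""
    # astype(str) lets the string methods run even on mixed-type columns.
    stripped = series.astype(str).str.replace(r"[$,]", "", regex=True)
    # errors="coerce" turns anything unparseable (e.g. missing values) into NaN.
    return pd.to_numeric(stripped, errors="coerce")


toy = pd.Series(["$267.97", "$1,218.00", "$5,206,727,193", None])
cleaned = to_number(toy)
```

Using `pd.to_numeric` with `errors="coerce"` is a design choice: it avoids a hard failure on odd values, at the cost of silently producing NaN, so it is worth checking how many NaN values appear afterwards.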

In [238]:
#dropping the old mode 
df4 = df4.drop(columns = ['Mode'])

In [239]:
df4.to_csv("./df4.csv", index= False) #just exporting to CSV so I can check it out

# Aggregation & Chart Making

In [240]:
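#making my function for a chart
import altair as alt
from shared_utils import altair_utils  # assumed import path for altair_utils


def bar_chart(df, x, y):
    """Bar chart of column y against column x, colored by State."""
    chart = (
        alt.Chart(df)
        .mark_bar()
        .encode(
            # pass the parameters themselves, not the literal strings 'x' and 'y'
            x=alt.X(x),
            y=alt.Y(y),
            color=alt.Color(
                'State',
                scale=alt.Scale(range=altair_utils.FIVETHIRTYEIGHT_CATEGORY_COLORS),
            ),
            tooltip=[alt.Tooltip(x), alt.Tooltip(y)],
        )
        .interactive()
        .properties(width=400, height=250)
    )
    return chart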

[Markdown reference](https://www.datacamp.com/community/tutorials/markdown-in-jupyter-notebook)

### First Analysis...Why did filtering out for bus not work?

In [241]:
#keep only the Bus rows
first = df4[df4["mode_full_name"] == "Bus"]

In [242]:
first= df4.drop_duplicates().groupby(['City', 'mode_full_name']).agg({'Passengers_per_Hour': 'median', 'Cost_per_Passenger': 'median' }).reset_index()
first.sort_values(by="Cost_per_Passenger", ascending= False)

Unnamed: 0,City,mode_full_name,Passengers_per_Hour,Cost_per_Passenger
1947,Vega Alta,Other,1.10,1218.00
692,Ft. Myers,Bus,2.40,438.44
624,Fallon,Bus,1.65,376.25
1884,Texarkana,Commuter Bus,0.30,312.38
456,Crow Agency,Other,0.30,297.58
...,...,...,...,...
1878,Telluride,Commuter Bus,4.00,1.20
1298,Mountain Village,Other,9.10,1.15
821,Hastings,Bus,29.60,1.15
1504,Pigeon Forge,Bus,66.70,1.15
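A likely reason the bus filter did not work: the aggregation cell starts again from `df4`, so the filtered `first` is overwritten and every mode comes back (the output still shows "Other" and "Commuter Bus"). Chaining the aggregation off the filtered frame should keep only buses. A sketch on a toy frame with made-up values:

```python
import pandas as pd

# Hypothetical toy frame standing in for df4.
toy = pd.DataFrame({
    "City": ["A", "A", "B"],
    "mode_full_name": ["Bus", "Other", "Bus"],
    "Cost_per_Passenger": [2.0, 20.0, 4.0],
})

# Filter first...
bus_only = toy[toy["mode_full_name"] == "Bus"]

# ...then aggregate the filtered frame, not the original one.
result = (
    bus_only.drop_duplicates()
    .groupby(["City", "mode_full_name"])
    .agg({"Cost_per_Passenger": "median"})
    .reset_index()
)
```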
