# Exercise 3: More tabular data wrangling

Skills:
* Looping
* Dictionary to map values
* Dealing with duplicates
* Make use of Markdown cells to write some narrative or commentary!

References:
* https://docs.calitp.org/data-infra/analytics_new_analysts/02-data-analysis-intermediate.html

In [1]:
import pandas as pd

In [2]:
GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/bus_service_increase/"
FILE_NAME = "ntd_metrics_2019.csv"

df = pd.read_csv(f"{GCS_FILE_PATH}{FILE_NAME}")
df.head(2)



Unnamed: 0,Agency,City,State,Legacy NTD ID,NTD ID,Organization Type,Reporter Type,Primary UZA\n Population,Agency VOMS,Mode,...,Passenger Miles Questionable,Vehicle Revenue Miles,Vehicle Revenue Miles Questionable,Any data questionable?,Unnamed: 39,Unnamed: 40,Unnamed: 41,1,Unnamed: 43,Unnamed: 44
0,MTA New York City Transit,New York,NY,2008,20008,"Subsidiary Unit of a Transit Agency, Reporting...",Full Reporter,18351295,10885,HR,...,,354616371,,No,,,,Hide questionable data tags,,
1,MTA New York City Transit,New York,NY,2008,20008,"Subsidiary Unit of a Transit Agency, Reporting...",Full Reporter,18351295,10885,CB,...,,9866807,,No,,,,Show questionable data tags,,


In [3]:
df.columns = df.columns.str.replace('\n', ' ').str.replace(' ', '_')

In [4]:
print(f"# obs: {len(df)}")
print(f"# unique IDs: {df.NTD_ID.nunique()}")

# obs: 3685
# unique IDs: 2183


In [5]:
# Pick an example -- see that agency provides service for different modes
# df.NTD_ID.value_counts()
df[df.NTD_ID=="10003"].Mode.value_counts()

MB    2
FB    1
DR    1
CR    1
RB    1
HR    1
TB    1
LR    1
Name: Mode, dtype: int64

### Dealing with Duplicates

* Explore why there are duplicates
* What's the analysis about? What should the unit of analysis be?
* Should duplicates be dropped? Should duplicates be aggregated into 1 entry?
* Hint: It depends on the analysis, and the answer may be a bit of both. Sometimes aggregation makes sense. Duplicates require further investigation: why do they appear in the dataset multiple times? Unless the rows are complete duplicates, it doesn't make sense to simply drop them. They may show that the analysis can be more disaggregated than previously thought.
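
One way to investigate duplicates before deciding whether to drop or aggregate them is to flag every row that shares its key columns with another row. The sketch below uses a made-up `toy` dataframe (its values and the `trips` column are hypothetical, not taken from the NTD file) to compare the two options.

```python
import pandas as pd

# Hypothetical toy data: one agency reports bus service (MB) twice
toy = pd.DataFrame({
    "NTD_ID": ["A1", "A1", "A1", "B2"],
    "Mode": ["MB", "MB", "LR", "MB"],
    "TOS": ["DO", "PT", "DO", "DO"],
    "trips": [100, 20, 50, 30],
})

key_cols = ["NTD_ID", "Mode"]

# keep=False marks every member of a duplicated group, not just the repeats
dups = toy[toy.duplicated(subset=key_cols, keep=False)]
print(dups)

# Option 1: drop repeats (this loses the PT row and its trips)
dropped = toy.drop_duplicates(subset=key_cols)

# Option 2: aggregate repeats into one row per key (this keeps all trips)
aggregated = toy.groupby(key_cols, as_index=False)["trips"].sum()
print(aggregated)
```

Dropping keeps only the first row of each group, while aggregating preserves the totals, which is why the choice depends on the unit of analysis.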

In [6]:
list(df.columns)

['Agency',
 'City',
 'State',
 'Legacy_NTD_ID',
 'NTD_ID',
 'Organization_Type',
 'Reporter_Type',
 'Primary_UZA__Population',
 'Agency_VOMS',
 'Mode',
 'TOS',
 'Mode_VOMS',
 'Ratios:',
 'Fare_Revenues_per_Unlinked_Passenger_Trip_',
 'Fare_Revenues_per_Unlinked_Passenger_Trip_Questionable',
 'Fare_Revenues_per_Total_Operating_Expense_(Recovery_Ratio)',
 'Fare_Revenues_per_Total_Operating_Expense_(Recovery_Ratio)_Questionable',
 'Cost_per__Hour',
 'Cost_per_Hour_Questionable',
 'Passengers_per_Hour',
 'Passengers_per_Hour_Questionable',
 'Cost_per_Passenger',
 'Cost_per_Passenger_Questionable',
 'Cost_per_Passenger_Mile',
 'Cost_per_Passenger_Mile_Questionable',
 'Source_Data:',
 'Fare_Revenues_Earned',
 'Fare_Revenues_Earned_Questionable',
 'Total_Operating_Expenses',
 'Total_Operating_Expenses_Questionable',
 'Unlinked_Passenger_Trips',
 'Unlinked_Passenger_Trips_Questionable',
 'Vehicle_Revenue_Hours',
 'Vehicle_Revenue_Hours_Questionable',
 'Passenger_Miles',
 'Passenger_Miles_Quest

In [7]:
keep_me = ["MB"]
df1 = df[df.Mode.isin(keep_me)]
df1


pd.options.display.max_columns = 100
df1[df1.NTD_ID=="10003"]

Unnamed: 0,Agency,City,State,Legacy_NTD_ID,NTD_ID,Organization_Type,Reporter_Type,Primary_UZA__Population,Agency_VOMS,Mode,TOS,Mode_VOMS,Ratios:,Fare_Revenues_per_Unlinked_Passenger_Trip_,Fare_Revenues_per_Unlinked_Passenger_Trip_Questionable,Fare_Revenues_per_Total_Operating_Expense_(Recovery_Ratio),Fare_Revenues_per_Total_Operating_Expense_(Recovery_Ratio)_Questionable,Cost_per__Hour,Cost_per_Hour_Questionable,Passengers_per_Hour,Passengers_per_Hour_Questionable,Cost_per_Passenger,Cost_per_Passenger_Questionable,Cost_per_Passenger_Mile,Cost_per_Passenger_Mile_Questionable,Source_Data:,Fare_Revenues_Earned,Fare_Revenues_Earned_Questionable,Total_Operating_Expenses,Total_Operating_Expenses_Questionable,Unlinked_Passenger_Trips,Unlinked_Passenger_Trips_Questionable,Vehicle_Revenue_Hours,Vehicle_Revenue_Hours_Questionable,Passenger_Miles,Passenger_Miles_Questionable,Vehicle_Revenue_Miles,Vehicle_Revenue_Miles_Questionable,Any_data_questionable?,Unnamed:_39,Unnamed:_40,Unnamed:_41,1,Unnamed:_43,Unnamed:_44
35,Massachusetts Bay Transportation Authority,Boston,MA,1003,10003,Independent Public Agency or Authority of Tran...,Full Reporter,4181019,2464,MB,DO,779,,$0.97,,0.23,,$156.04,,36.5,,$4.28,,$1.66,,,"$96,518,545",,"$424,586,999",,99301293,,2721051,,255494460,,21357273,,No,,,,,,
36,Massachusetts Bay Transportation Authority,Boston,MA,1003,10003,Independent Public Agency or Authority of Tran...,Full Reporter,4181019,2464,MB,PT,70,,$0.09,W,0.02,,$63.06,,10.8,W,$5.85,W,$2.57,W,,"$85,258",,"$5,562,842",,951692,W,88210,,2162081,W,1028451,,Yes,,,,,,


# Potential reason for duplicates
In the example above, the same `NTD_ID` has two types of service (`TOS`) within the same `Mode`: Directly Operated (`DO`) and Purchased Transportation (`PT`). An agency might run some of its transportation services itself and contract out others.





In [8]:
# But what about this case?
df[(df.Agency.str.contains("Los Angeles County Metropolitan Transportation Authority ")) 
   & (df.Agency.notna())].Mode.value_counts()



MB    2
HR    1
RB    1
LR    1
VP    1
Name: Mode, dtype: int64

In [9]:
df[(df.Agency.str.contains("Los Angeles County Metropolitan Transportation Authority ")) & (df.Agency.notna())]

Unnamed: 0,Agency,City,State,Legacy_NTD_ID,NTD_ID,Organization_Type,Reporter_Type,Primary_UZA__Population,Agency_VOMS,Mode,TOS,Mode_VOMS,Ratios:,Fare_Revenues_per_Unlinked_Passenger_Trip_,Fare_Revenues_per_Unlinked_Passenger_Trip_Questionable,Fare_Revenues_per_Total_Operating_Expense_(Recovery_Ratio),Fare_Revenues_per_Total_Operating_Expense_(Recovery_Ratio)_Questionable,Cost_per__Hour,Cost_per_Hour_Questionable,Passengers_per_Hour,Passengers_per_Hour_Questionable,Cost_per_Passenger,Cost_per_Passenger_Questionable,Cost_per_Passenger_Mile,Cost_per_Passenger_Mile_Questionable,Source_Data:,Fare_Revenues_Earned,Fare_Revenues_Earned_Questionable,Total_Operating_Expenses,Total_Operating_Expenses_Questionable,Unlinked_Passenger_Trips,Unlinked_Passenger_Trips_Questionable,Vehicle_Revenue_Hours,Vehicle_Revenue_Hours_Questionable,Passenger_Miles,Passenger_Miles_Questionable,Vehicle_Revenue_Miles,Vehicle_Revenue_Miles_Questionable,Any_data_questionable?,Unnamed:_39,Unnamed:_40,Unnamed:_41,1,Unnamed:_43,Unnamed:_44
13,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,9154,90154,Independent Public Agency or Authority of Tran...,Full Reporter,12150996,3469,HR,DO,68,,$0.73,,0.19,,$536.99,,137.3,,$3.91,,$0.81,,,"$31,426,577",,"$168,453,369",,43074277,,313697,,207664947,,6874200,,No,,,,,,
14,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,9154,90154,Independent Public Agency or Authority of Tran...,Full Reporter,12150996,3469,RB,DO,26,,$0.73,,0.19,,$231.80,,62.0,,$3.74,,$0.57,,,"$4,997,045",,"$25,666,876",,6860145,,110727,,45206002,,1719522,,No,,,,,,
15,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,9154,90154,Independent Public Agency or Authority of Tran...,Full Reporter,12150996,3469,LR,DO,198,,$0.72,,0.1,,$515.13,,68.8,,$7.48,,$0.96,,,"$42,986,478",,"$446,368,668",,59655365,,866517,,462756222,,17757242,,No,,,,,,
16,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,9154,90154,Independent Public Agency or Authority of Tran...,Full Reporter,12150996,3469,MB,DO,1784,,$0.72,,0.15,,$190.75,,40.1,,$4.75,,$1.16,,,"$182,029,213",,"$1,209,706,503",,254580163,,6341989,,1044644827,,65595822,,No,,,,,,
17,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,9154,90154,Independent Public Agency or Authority of Tran...,Full Reporter,12150996,3469,MB,PT,134,,$0.31,,0.07,,$107.12,,24.8,,$4.31,,$0.90,,,"$3,849,877",,"$53,066,904",,12307451,,495401,,59202628,,5775759,,No,,,,,,
18,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,9154,90154,Independent Public Agency or Authority of Tran...,Full Reporter,12150996,3469,VP,PT,1259,,$4.81,,1.01,,$21.49,,4.5,,$4.74,,$0.11,,,"$15,580,993",,"$15,376,446",,3240720,,715408,,142563803,,28602524,,No,,,,,,


In [10]:
# Find the column that has different values
df[(df.Agency.str.contains("Los Angeles County Metropolitan Transportation Authority ")) 
   & (df.Agency.notna()) & 
   (df.Mode=="MB")
  ]

Unnamed: 0,Agency,City,State,Legacy_NTD_ID,NTD_ID,Organization_Type,Reporter_Type,Primary_UZA__Population,Agency_VOMS,Mode,TOS,Mode_VOMS,Ratios:,Fare_Revenues_per_Unlinked_Passenger_Trip_,Fare_Revenues_per_Unlinked_Passenger_Trip_Questionable,Fare_Revenues_per_Total_Operating_Expense_(Recovery_Ratio),Fare_Revenues_per_Total_Operating_Expense_(Recovery_Ratio)_Questionable,Cost_per__Hour,Cost_per_Hour_Questionable,Passengers_per_Hour,Passengers_per_Hour_Questionable,Cost_per_Passenger,Cost_per_Passenger_Questionable,Cost_per_Passenger_Mile,Cost_per_Passenger_Mile_Questionable,Source_Data:,Fare_Revenues_Earned,Fare_Revenues_Earned_Questionable,Total_Operating_Expenses,Total_Operating_Expenses_Questionable,Unlinked_Passenger_Trips,Unlinked_Passenger_Trips_Questionable,Vehicle_Revenue_Hours,Vehicle_Revenue_Hours_Questionable,Passenger_Miles,Passenger_Miles_Questionable,Vehicle_Revenue_Miles,Vehicle_Revenue_Miles_Questionable,Any_data_questionable?,Unnamed:_39,Unnamed:_40,Unnamed:_41,1,Unnamed:_43,Unnamed:_44
16,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,9154,90154,Independent Public Agency or Authority of Tran...,Full Reporter,12150996,3469,MB,DO,1784,,$0.72,,0.15,,$190.75,,40.1,,$4.75,,$1.16,,,"$182,029,213",,"$1,209,706,503",,254580163,,6341989,,1044644827,,65595822,,No,,,,,,
17,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,9154,90154,Independent Public Agency or Authority of Tran...,Full Reporter,12150996,3469,MB,PT,134,,$0.31,,0.07,,$107.12,,24.8,,$4.31,,$0.90,,,"$3,849,877",,"$53,066,904",,12307451,,495401,,59202628,,5775759,,No,,,,,,


As in the example above, the `Los Angeles County Metropolitan Transportation Authority` has two bus (`MB`) rows that differ in type of service (`TOS`): `DO` and `PT`.
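
Instead of scanning a wide table by eye, the columns that differ between otherwise similar rows can be found by counting distinct values per column. This is a minimal sketch on hypothetical rows (the `rows` dataframe and the agency name `Metro` are made up for illustration).

```python
import pandas as pd

# Hypothetical rows standing in for one agency's two bus (MB) entries
rows = pd.DataFrame({
    "Agency": ["Metro", "Metro"],
    "Mode": ["MB", "MB"],
    "TOS": ["DO", "PT"],
    "Mode_VOMS": [1784, 134],
})

# Count distinct values per column; columns with more than one value differ across rows
n_unique = rows.nunique(dropna=False)
differing_cols = n_unique[n_unique > 1].index.tolist()
print(differing_cols)
```

The same pattern should work on a filtered subset of the real dataframe, assuming the subset contains only the rows being compared.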

In [11]:
subset_cols = [
    'Agency', 'City', 'State', 'Legacy_NTD_ID', 'NTD_ID',
    'Organization_Type', 'Reporter_Type', 'Mode'
]

subset_cols

['Agency',
 'City',
 'State',
 'Legacy_NTD_ID',
 'NTD_ID',
 'Organization_Type',
 'Reporter_Type',
 'Mode']

In [12]:
print(f"# obs: {len(df)}")

# obs: 3685


In [13]:
print(f"# obs after dropping dups: {len(df.drop_duplicates(subset=subset_cols))}")

# What does this indicate? Use Markdown cell and jot down some of the logic.

# obs after dropping dups: 3553


[Markdown reference](https://www.datacamp.com/community/tutorials/markdown-in-jupyter-notebook)

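<i>The number of dropped observations (3685 - 3553 = 132) indicates that some rows share the same `Agency`, `City`, `State`, `Legacy_NTD_ID`, `NTD_ID`, `Organization_Type`, `Reporter_Type`, and `Mode`. As in the examples above, such rows can still differ in columns outside `subset_cols`, such as `TOS`.</i>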

### Changing values by using a dictionary to map

In [14]:
# Transit mode uses a code, 
# Use a dictionary to map those codes to its full name
MODE_NAMES = {
    'MB': 'Bus', 
    'LR': 'Light Rail',
    'CB': 'Commuter Bus',
}

# What happens to the ones that aren't specified in MODE_NAMES?
df = df.assign(
    mode_full_name = df.Mode.map(MODE_NAMES)
)


In [15]:
df.mode_full_name.value_counts()

Bus             1244
Commuter Bus     177
Light Rail        23
Name: mode_full_name, dtype: int64

In [16]:
df[df.mode_full_name.isna()].Mode.value_counts()

DR    1879
VP     112
DT     103
FB      40
CR      27
SR      22
HR      15
RB      13
YR       6
MG       6
TB       5
IP       3
TR       2
PB       1
CC       1
AR       1
Name: Mode, dtype: int64

Map values from `Mode` to these categories: rail, bus, and other. 

Use `assign` and `map`.
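
A small sketch of how `map` treats codes that are missing from the dictionary, and one possible way to send them to a default category. The series `codes`, the code `"XX"`, and the partial `CATEGORY` mapping are hypothetical.

```python
import pandas as pd

# "XX" is a made-up code that is not in the mapping; None stands for a missing value
codes = pd.Series(["MB", "LR", "XX", None], name="Mode")

# Hypothetical partial mapping
CATEGORY = {"MB": "bus", "LR": "rail"}

# Codes without a dictionary entry (and missing values) become NaN
mapped = codes.map(CATEGORY)

# Optionally send everything unmapped to a default bucket
with_default = mapped.fillna("other")
print(with_default.tolist())
```

Note that `fillna("other")` also relabels rows whose code was missing to begin with, so checking the unmapped codes first (as in the cell above) is still worthwhile.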

In [17]:
MODE_NAMES = {
    'MB': 'bus', 
    'LR': 'rail',
    'CB': 'bus',
    'DR': 'rail',
    'VP': 'other',
    'DT': 'other',
    'FB': 'bus',
    'CR': 'rail',
    'SR': 'rail',
    'HR': 'rail',
    'RB': 'bus',
    'YR': 'rail',
    'MG': 'other',
    'TB': 'bus',
    'IP': 'other',
    'TR': 'rail',
    'PB': 'bus',
    'CC': 'other',
    'AR': 'rail'
}

df = df.assign(
    mode_full_name = df.Mode.map(MODE_NAMES)
)

df[df.mode_full_name.isna()].Mode.value_counts()

Series([], Name: Mode, dtype: int64)

In [18]:
def mode_full_name(row):
    # row.Mode is a single value here, so use plain string methods (no .str accessor)
    mode = str(row.Mode)
    if mode.endswith('R'):
        return 'rail'
    elif mode.endswith('B'):
        return 'bus'
    else:
        return 'other'


# df['mode_full_name1'] = df.apply(mode_full_name, axis=1)
# df

### Looping

Can loop across columns or loop across subsets of data.

Sometimes, looping can make sense if you're repeating certain steps. Use it if it makes sense.

Here, for 2 different columns, `Agency_VOMS` and `Mode_VOMS`, the values show up as strings.

Print the dtypes out for all the columns. 

In [50]:
df.dtypes
df.head(2)

Unnamed: 0,Agency,City,State,Legacy_NTD_ID,NTD_ID,Organization_Type,Reporter_Type,Primary_UZA__Population,Agency_VOMS,Mode,TOS,Mode_VOMS,Ratios:,Fare_Revenues_per_Unlinked_Passenger_Trip_,Fare_Revenues_per_Unlinked_Passenger_Trip_Questionable,Fare_Revenues_per_Total_Operating_Expense_(Recovery_Ratio),Fare_Revenues_per_Total_Operating_Expense_(Recovery_Ratio)_Questionable,Cost_per__Hour,Cost_per_Hour_Questionable,Passengers_per_Hour,Passengers_per_Hour_Questionable,Cost_per_Passenger,Cost_per_Passenger_Questionable,Cost_per_Passenger_Mile,Cost_per_Passenger_Mile_Questionable,Source_Data:,Fare_Revenues_Earned,Fare_Revenues_Earned_Questionable,Total_Operating_Expenses,Total_Operating_Expenses_Questionable,Unlinked_Passenger_Trips,Unlinked_Passenger_Trips_Questionable,Vehicle_Revenue_Hours,Vehicle_Revenue_Hours_Questionable,Passenger_Miles,Passenger_Miles_Questionable,Vehicle_Revenue_Miles,Vehicle_Revenue_Miles_Questionable,Any_data_questionable?,Unnamed:_39,Unnamed:_40,Unnamed:_41,1,Unnamed:_43,Unnamed:_44,mode_full_name
0,MTA New York City Transit,New York,NY,2008,20008,"Subsidiary Unit of a Transit Agency, Reporting...",Full Reporter,18351295,10885,HR,DO,5413,,$1.34,,0.7,,$267.97,,139.6,,$1.92,,$0.50,,,"$3,643,213,720",,"$5,206,727,193",,2712521697,,19430373,,10462782577,,354616371,,No,,,,Hide questionable data tags,,,rail
1,MTA New York City Transit,New York,NY,2008,20008,"Subsidiary Unit of a Transit Agency, Reporting...",Full Reporter,18351295,10885,CB,DO,437,,$6.66,,0.32,,$393.55,,18.6,,$21.13,,$1.58,,,"$76,398,352",,"$242,520,835",,11477164,,616233,,153389117,,9866807,,No,,,,Show questionable data tags,,,bus


Print `value_counts()` for `Agency_VOMS` and `Mode_VOMS`. What is making these values appear as strings?

In [20]:
df.Mode_VOMS.value_counts()

2      351
3      284
4      232
5      231
1      226
      ... 
66       1
384      1
456      1
744      1
163      1
Name: Mode_VOMS, Length: 270, dtype: int64

In [21]:
df.Agency_VOMS.value_counts()

2        154
6        143
8        140
5        137
3        132
        ... 
153        1
127        1
188        1
150        1
1,026      1
Name: Agency_VOMS, Length: 255, dtype: int64

<i>The thousands-separator commas (e.g. `1,026`) cause these columns to be read as strings.</i>

For those 2 columns, replace the commas with blanks and fill in missing values with `"0"` (zero, but with quotation marks to make it a string). 

Coerce these columns to be numeric.
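
As an alternative to cleaning the strings afterwards, `pd.read_csv` accepts a `thousands` argument that parses values such as `1,026` as numbers at load time. The sketch below uses a made-up in-memory CSV (`csv_text` and its values are hypothetical) rather than the NTD file.

```python
import io
import pandas as pd

# Hypothetical CSV text: one quoted value with a thousands separator, one missing value
csv_text = 'Agency,Agency_VOMS\nA,"1,026"\nB,153\nC,\n'

# thousands="," tells the parser to ignore commas inside numeric fields
df_toy = pd.read_csv(io.StringIO(csv_text), thousands=",")

# The missing value makes the column float, so fill it before casting to int
df_toy["Agency_VOMS"] = df_toy["Agency_VOMS"].fillna(0).astype(int)
print(df_toy)
```

This is a different technique from the string replacement asked for in this exercise; it is only applicable when the file is being read in and every column that uses commas is meant to be numeric.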

In [22]:
for c in ["Agency_VOMS", "Mode_VOMS"]:
    df[c] = df[c].str.replace(',', '').fillna('0').astype(int)

In [23]:
for s in ["CA", "ID"]:
    subset_df = df[df.State==s]


In [24]:
# Run after the loop: subset_df holds only the last subset (State == "ID")
display(subset_df[["Agency", "City"]].drop_duplicates().head())

Unnamed: 0,Agency,City
703,"Ada County Highway District, dba: ACHD Commute...",Boise
778,Valley Regional Transit,Meridian
1440,"City of Pocatello, dba: Pocatello Regional Tra...",Pocatello
1482,Mountain Rides Transportation Authority,Ketchum
1598,Treasure Valley Transit,Nampa
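
Anything that should happen for every subset has to sit inside the loop body; otherwise only the last subset is available once the loop finishes. A minimal sketch on hypothetical data (the `toy` dataframe and the `subsets` dictionary are made up for illustration):

```python
import pandas as pd

# Hypothetical agencies in three states
toy = pd.DataFrame({
    "State": ["CA", "CA", "ID", "NV"],
    "Agency": ["A", "B", "C", "D"],
    "City": ["Los Angeles", "Fresno", "Boise", "Reno"],
})

subsets = {}
for s in ["CA", "ID"]:
    subset_df = toy[toy.State == s]
    # This runs once per state, so each state's result is kept
    subsets[s] = subset_df[["Agency", "City"]].drop_duplicates()
    print(s, len(subsets[s]))
```

Storing each result in a dictionary keyed by state is one design choice; printing or displaying inside the loop works too when the results are not needed later.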


### To Do:
* Keep a subset of columns and clean up column names (no spaces, newlines, etc):
    * columns related to identifying the agency
    * population, passenger trips
    * transit mode
    * at least 3 service metric variables, normalized and not normalized
* Deal with duplicates - what is the unit for each row? What is the unit for desired analysis? Should an agency appear multiple times, and if so, why?
* Aggregate at least 2 ways and show an interesting comparison, after dealing with duplicates somehow (either aggregation and/or defining what the unit of analysis is)
* Calculate weighted average after the aggregation for the service metrics
* Decide on one type of chart to visualize, and generalize it as a function
* Make charts using the function


### Step by Step

These are the 3 service metrics columns to keep (in addition to the columns listed above):
1. Fare Revenues  
1. Total Operating Expenses 
1. Vehicle Revenue Miles

The normalized columns are the ones adjusted by population or volume. 
* Instead of total fare revenues, it's the fare revenues per unlinked trip.
* Instead of total cost, it's cost per passenger or cost per hour.

In [25]:
df1 = df[[
    'Agency', 'Primary_UZA__Population', 'Mode', 'Fare_Revenues_per_Unlinked_Passenger_Trip_', 'Cost_per_Passenger', 'Total_Operating_Expenses','Vehicle_Revenue_Miles','mode_full_name' 
]]

In [26]:
df1.columns = df1.columns.str.lower().str.replace('__','_')
df1

Unnamed: 0,agency,primary_uza_population,mode,fare_revenues_per_unlinked_passenger_trip_,cost_per_passenger,total_operating_expenses,vehicle_revenue_miles,mode_full_name
0,MTA New York City Transit,18351295,HR,$1.34,$1.92,"$5,206,727,193",354616371,rail
1,MTA New York City Transit,18351295,CB,$6.66,$21.13,"$242,520,835",9866807,bus
2,MTA New York City Transit,18351295,MB,$1.22,$3.88,"$2,685,918,268",86233591,bus
3,MTA New York City Transit,18351295,DR,$2.03,$106.96,"$516,470,491",37759280,rail
4,MTA New York City Transit,18351295,RB,$1.06,$3.36,"$103,071,355",3382426,bus
...,...,...,...,...,...,...,...,...
3680,,,,,,,,
3681,,,,,,,,
3682,,,,,,,,
3683,,,,,,,,


Deal with duplicates. 

For an agency with multiple modes, aggregate it across modes and get the sum for the service metrics.

Ex: For an agency with rail, bus, and ferry modes, sum the total fare revenues across those modes.

In [34]:
df1 = df1.copy()  # explicit copy, so the assignments below don't trigger SettingWithCopyWarning
for c in ['total_operating_expenses', 'fare_revenues_per_unlinked_passenger_trip_', 'cost_per_passenger']:
    cleaned = df1[c].str.replace(',', '', regex=False).str.replace('$', '', regex=False)
    df1[c] = pd.to_numeric(cleaned.fillna('0'))  # numeric, so 'sum' adds instead of concatenating strings


In [43]:
pivot = (df1.groupby(['agency','primary_uza_population','mode'])
        .agg({'total_operating_expenses': 'sum',
              'fare_revenues_per_unlinked_passenger_trip_': 'sum',
             'cost_per_passenger':'sum'}
             ).reset_index()
        )

pivot

Unnamed: 0,agency,primary_uza_population,mode,total_operating_expenses,fare_revenues_per_unlinked_passenger_trip_,cost_per_passenger
0,City of Gadsden,64172,DR,459502,1.47,14.63
1,City of Gadsden,64172,MB,477971,0.32,6.42
2,Sistersville Ferry,0,FB,47877,0.11,12.93
3,10-15 Regional Transit Agency,0,DR,2903338,0.72,13.71
4,A&C Bus Corporation & Montgomery & Westside Ow...,18351295,MB,5498147,1.35,1.36
...,...,...,...,...,...,...
3535,Yurok Tribe,0,DR,459364,0.00,19.35
3536,Yurok Tribe,0,FB,76960,0.00,171.02
3537,"Zia Therapy Center, Inc.",128600,DR,258040,1.13,18.73
3538,"Zia Therapy Center, Inc.",128600,MB,947028,0.60,8.50


Does it make sense to sum up the normalized metrics?

For an agency with 3 modes (rail, bus, ferry), does it make sense to sum up `fares_per_passenger` across those 3 modes? Why or why not?

If bus passengers make up 80% of the agency's passengers (rail 15%, ferry 5%), how do we make sure the normalized metric accounts for this? Bus fares are significantly lower than rail and ferry fares in this scenario. How do we make sure that the `fares_per_passenger` metric reflects this mix?

# Normalizing Metrics

> It does not make sense to sum up the normalized metrics across different modes. Only the totals (not normalized) can be summed directly.


> For an agency with 3 modes (rail, bus, and ferry), summing `fares_per_passenger` is misleading because the number of trips differs by mode, so each mode's ratio carries a different weight.

> If usage differs by mode, e.g. bus passengers make up 80%, we can use a weighted average: weight each mode's normalized metric by its share of passengers.


What is the correct way to calculate `fares_per_passenger` across modes for the same operator?

Show the correct way. Drop the existing normalized metrics and recalculate them across modes for each agency. The resulting dataframe should have 1 row per agency, with the service metrics aggregated to that agency across modes, as well as the normalized metrics (per passenger or per passenger trip) calculated across modes.
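
A minimal sketch of the arithmetic on made-up numbers, assuming the totals (fare revenues and trips) are available per mode. The `toy` dataframe, its column names, and its values are hypothetical; the point is that summing totals first and then dividing gives the same result as a trip-weighted average, and a very different result from summing the ratios.

```python
import pandas as pd

# Hypothetical agency with three modes; all numbers are made up
toy = pd.DataFrame({
    "agency": ["X", "X", "X"],
    "mode": ["bus", "rail", "ferry"],
    "fare_revenues": [80.0, 60.0, 40.0],
    "trips": [80, 15, 5],
})
toy["fares_per_trip"] = toy.fare_revenues / toy.trips  # 1.0, 4.0, 8.0

# Sum the totals first, then divide: one row per agency
agg = toy.groupby("agency", as_index=False)[["fare_revenues", "trips"]].sum()
agg["fares_per_trip"] = agg.fare_revenues / agg.trips  # 180 / 100 = 1.8

# Equivalent: average the per-mode ratios, weighted by each mode's trips
weighted = (toy.fares_per_trip * toy.trips).sum() / toy.trips.sum()

# Naively summing the ratios ignores the mix of trips entirely
naive_sum = toy.fares_per_trip.sum()  # 13.0
print(agg, weighted, naive_sum)
```

Because bus carries 80% of trips here, the agency-wide figure (1.8) sits close to the bus fare per trip (1.0) rather than the rail or ferry values.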

Make a bar chart for one service metric for 5 agencies (show both normalized and not normalized).

Ex: if you choose fare revenues, make a bar chart for total fare revenues and fare revenues per passenger trip. The 5 agencies should appear together on a single bar chart.


### Helpful Hints for Functions
* Opportunities come from the components of a chart that can be generalized
* Maybe these components need the same lines of code to clean them
* You can always further define variables within a function
* You can always use f-strings within functions to make slight modifications to the parameters you pass
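
To illustrate these hints, here is a sketch of a helper that prepares data for a chart and builds a title with an f-string. The function name `top_n_for_chart`, its parameters, and the toy data are hypothetical and only show one way to generalize repeated cleaning steps.

```python
import pandas as pd

def top_n_for_chart(df, group_col, value_col, n=5):
    """Aggregate value_col by group_col and keep the n largest groups."""
    top = (df.groupby(group_col, as_index=False)[value_col]
             .sum()
             .nlargest(n, value_col)
             .reset_index(drop=True))
    # The f-string turns the column name passed in into a readable title
    title = f"Top {n} by {value_col.replace('_', ' ').title()}"
    return top, title

# Hypothetical usage with made-up values
toy = pd.DataFrame({
    "agency": ["A", "A", "B", "C"],
    "total_operating_expenses": [10, 5, 30, 1],
})
top, title = top_n_for_chart(toy, "agency", "total_operating_expenses", n=2)
print(title)
print(top)
```

The returned dataframe and title could then be passed to a charting function, so the same preparation steps are reused for every metric.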

In [None]:
# Sample function
import altair as alt

def make_bar_chart(df, x_col, y_col):
    x_title = f"{x_col.title()}"
    
    chart = (alt.Chart(df)
             .mark_bar()
             .encode(
                 x=alt.X(x_col, title=x_title),
                 y=alt.Y(y_col, title=""),
             )
            )
    return chart
