# Exercise 3: More tabular data wrangling

Skills:
* Looping
* Dictionary to map values
* Dealing with duplicates
* Make use of Markdown cells to write some narrative or commentary!

References:
* https://docs.calitp.org/data-infra/analytics_new_analysts/02-data-analysis-intermediate.html

In [1]:
import pandas as pd

In [2]:
pwd

'/home/jovyan/data-analyses/starter_kit/monica_exercises'

In [3]:
GCS_FILE_PATH = "../data/"
FILE_NAME = "exercise_2_3_ntd_metrics_2019.parquet"

df = pd.read_parquet(f"{GCS_FILE_PATH}{FILE_NAME}")
df.head(2)


Unnamed: 0,Agency,City,State,Legacy NTD ID,NTD ID,Organization Type,Reporter Type,Primary UZA\n Population,Agency VOMS,Mode,...,Passenger Miles Questionable,Vehicle Revenue Miles,Vehicle Revenue Miles Questionable,Any data questionable?,Unnamed: 39,Unnamed: 40,Unnamed: 41,1,Unnamed: 43,Unnamed: 44
0,MTA New York City Transit,New York,NY,2008,20008,"Subsidiary Unit of a Transit Agency, Reporting...",Full Reporter,18351295,10885,HR,...,,354616371,,No,,,,Hide questionable data tags,,
1,MTA New York City Transit,New York,NY,2008,20008,"Subsidiary Unit of a Transit Agency, Reporting...",Full Reporter,18351295,10885,CB,...,,9866807,,No,,,,Show questionable data tags,,


In [4]:
df.columns = df.columns.str.replace('\n', ' ').str.replace(' ', '_')

In [5]:
print(f"# obs: {len(df)}")
print(f"# unique IDs: {df.NTD_ID.nunique()}")

# obs: 3685
# unique IDs: 2183


In [6]:
# Pick an example -- see that agency provides service for different modes
# df.NTD_ID.value_counts()
df[df.NTD_ID=="10003"].Mode.value_counts()

MB    2
FB    1
DR    1
CR    1
RB    1
HR    1
TB    1
LR    1
Name: Mode, dtype: int64

### Dealing with Duplicates

* Explore why there are duplicates
* What's the analysis about? What should the unit of analysis be?
* Should duplicates be dropped? Should duplicates be aggregated into 1 entry?
* Hint: It depends on the analysis, and there might be a bit of both. Sometimes aggregation makes sense. Duplicates require further investigation: why do they appear in the dataset multiple times? Unless the rows are completely identical, it doesn't make sense to just drop them. Duplicates may show that the analysis can be more disaggregated than previously thought.
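
One way to explore why rows repeat is to pull out every row in a duplicated group and count how many distinct values each column holds. This is a minimal sketch on hypothetical toy data (the `toy` frame and its values are made up for illustration), not the exercise's dataset.

```python
import pandas as pd

# Hypothetical toy data: agency "A" reports bus (MB) service on two rows.
toy = pd.DataFrame({
    "Agency": ["A", "A", "A", "B"],
    "Mode": ["MB", "MB", "LR", "MB"],
    "Vehicle_Revenue_Miles": [100, 40, 25, 60],
})

# keep=False flags every row in a duplicated group, not just the repeats,
# so the rows can be compared side by side.
dup_rows = toy[toy.duplicated(subset=["Agency", "Mode"], keep=False)]

# Count distinct values per column within each duplicated group.
# A count above 1 marks a column where the "duplicates" actually differ.
n_distinct = dup_rows.groupby(["Agency", "Mode"]).nunique()
print(n_distinct)
```

If a column shows more than one distinct value, the rows are not true duplicates, and dropping one of them would lose information.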

In [7]:
# But what about this case? -- monica answer: it looks like this agency has 2 MB modes 
df[(df.Agency.str.contains("Los Angeles County Metropolitan Transportation Authority ")) 
   & (df.Agency.notna())].Mode.value_counts()

MB    2
HR    1
RB    1
LR    1
VP    1
Name: Mode, dtype: int64

In [8]:
# Find the column that has different values - - monica answer: vehicle revenue miles, would those need to be added together?
df[(df.Agency.str.contains("Los Angeles County Metropolitan Transportation Authority ")) 
   & (df.Agency.notna()) & 
   (df.Mode=="MB")
  ]
#what does df.Agency.notna() do? Non NaN?

Unnamed: 0,Agency,City,State,Legacy_NTD_ID,NTD_ID,Organization_Type,Reporter_Type,Primary_UZA__Population,Agency_VOMS,Mode,...,Passenger_Miles_Questionable,Vehicle_Revenue_Miles,Vehicle_Revenue_Miles_Questionable,Any_data_questionable?,Unnamed:_39,Unnamed:_40,Unnamed:_41,1,Unnamed:_43,Unnamed:_44
16,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,9154,90154,Independent Public Agency or Authority of Tran...,Full Reporter,12150996,3469,MB,...,,65595822,,No,,,,,,
17,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,9154,90154,Independent Public Agency or Authority of Tran...,Full Reporter,12150996,3469,MB,...,,5775759,,No,,,,,,
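
On the question in the comment above: `notna()` returns `True` wherever a value is present (not NaN). In the filter, it likely guards against the NaN that `str.contains()` returns for rows with a missing agency name. The sketch below uses a hypothetical three-row Series to show both behaviors, plus the `na=False` argument as one alternative way to treat missing names as non-matches.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing agency name.
agency = pd.Series(["Metro Transit", np.nan, "City Bus"], name="Agency")

# notna() is True wherever a value is present.
present = agency.notna()

# str.contains() returns NaN for missing values unless na= is set;
# na=False treats a missing name as "no match".
is_metro = agency.str.contains("Metro", na=False)

print(present.tolist())   # [True, False, True]
print(is_metro.tolist())  # [True, False, False]
```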


In [9]:
subset_cols = [
    'Agency', 'City', 'State', 'Legacy_NTD_ID', 'NTD_ID',
    'Organization_Type', 'Reporter_Type', 'Mode'
]

print(f"# obs: {len(df)}")
print(f"# obs after dropping dups: {len(df.drop_duplicates(subset=subset_cols))}")

# What does this indicate? Use Markdown cell and jot down some of the logic. 
#-- monica answer: This code compares the number of rows before and after dropping duplicates.
# There are 3553 unique combinations of the subset columns (agency and mode); the other 132 rows repeat one of those combinations.

# obs: 3685
# obs after dropping dups: 3553


[Markdown reference](https://www.datacamp.com/community/tutorials/markdown-in-jupyter-notebook)

### Monica answer:
<blockquote>It looks like vehicle revenue miles could be aggregated to consolidate agency/mode duplicates. I think this also depends on whether geometry is associated with the data and you plan to display it. In the latter case, I would not drop the duplicates.</blockquote>
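
A sketch of the aggregation idea, using the two MB rows shown earlier for NTD_ID 90154. It assumes `Vehicle_Revenue_Miles` is already numeric; if the column is stored as strings, it would need to be converted first.

```python
import pandas as pd

# The two MB rows from the output above, reduced to three columns.
la_mb = pd.DataFrame({
    "NTD_ID": ["90154", "90154"],
    "Mode": ["MB", "MB"],
    "Vehicle_Revenue_Miles": [65595822, 5775759],
})

# One row per agency-mode pair, with the totals summed.
agg = (la_mb.groupby(["NTD_ID", "Mode"])
       .agg({"Vehicle_Revenue_Miles": "sum"})
       .reset_index()
)
print(agg.Vehicle_Revenue_Miles.iloc[0])  # 71371581
```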

### Changing values by using a dictionary to map

In [10]:
# Transit mode uses a code, 
# Use a dictionary to map those codes to its full name
MODE_NAMES = {
    'MB': 'Bus', 
    'LR': 'Light Rail',
    'CB': 'Commuter Bus',
}

# What happens to the ones that aren't specified in MODE_NAMES? -- Monica answer: You end up with a NaN
df = df.assign(
    mode_full_name = df.Mode.map(MODE_NAMES)
)
df

Unnamed: 0,Agency,City,State,Legacy_NTD_ID,NTD_ID,Organization_Type,Reporter_Type,Primary_UZA__Population,Agency_VOMS,Mode,...,Vehicle_Revenue_Miles,Vehicle_Revenue_Miles_Questionable,Any_data_questionable?,Unnamed:_39,Unnamed:_40,Unnamed:_41,1,Unnamed:_43,Unnamed:_44,mode_full_name
0,MTA New York City Transit,New York,NY,2008,20008,"Subsidiary Unit of a Transit Agency, Reporting...",Full Reporter,18351295,10885,HR,...,354616371,,No,,,,Hide questionable data tags,,,
1,MTA New York City Transit,New York,NY,2008,20008,"Subsidiary Unit of a Transit Agency, Reporting...",Full Reporter,18351295,10885,CB,...,9866807,,No,,,,Show questionable data tags,,,Commuter Bus
2,MTA New York City Transit,New York,NY,2008,20008,"Subsidiary Unit of a Transit Agency, Reporting...",Full Reporter,18351295,10885,MB,...,86233591,,No,,2.0,,1,,2.0,Bus
3,MTA New York City Transit,New York,NY,2008,20008,"Subsidiary Unit of a Transit Agency, Reporting...",Full Reporter,18351295,10885,DR,...,37759280,,No,,,,,,,
4,MTA New York City Transit,New York,NY,2008,20008,"Subsidiary Unit of a Transit Agency, Reporting...",Full Reporter,18351295,10885,RB,...,3382426,,No,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3680,,,,,,,,,,,...,,,No,,,,,,,
3681,,,,,,,,,,,...,,,No,,,,,,,
3682,,,,,,,,,,,...,,,No,,,,,,,
3683,,,,,,,,,,,...,,,No,,,,,,,


In [11]:
df.mode_full_name.value_counts()

Bus             1244
Commuter Bus     177
Light Rail        23
Name: mode_full_name, dtype: int64
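
The counts above cover only the three mapped codes because `map` returns NaN for any value that is not a key in the dictionary. The sketch below shows this on a hypothetical four-row Series, along with two optional follow-ups: listing the codes that were missed and filling them with a default label.

```python
import pandas as pd

# Hypothetical toy column; mode_names mirrors the partial dictionary above.
modes = pd.Series(["MB", "LR", "HR", "DR"], name="Mode")
mode_names = {"MB": "Bus", "LR": "Light Rail", "CB": "Commuter Bus"}

mapped = modes.map(mode_names)     # HR and DR are not keys, so they become NaN
labeled = mapped.fillna("Other")   # optional: give unmapped codes a default label

# Codes present in the data but absent from the dictionary.
unmapped_codes = sorted(set(modes) - set(mode_names))
print(unmapped_codes)  # ['DR', 'HR']
```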

In [12]:
df.Mode.value_counts()

DR    1879
MB    1244
CB     177
VP     112
DT     103
FB      40
CR      27
LR      23
SR      22
HR      15
RB      13
YR       6
MG       6
TB       5
IP       3
TR       2
PB       1
CC       1
AR       1
Name: Mode, dtype: int64

Map values from `Mode` to these categories: rail, bus, and other. 

Use `assign` and `map`.

In [13]:
MODE_NAMES = {
    'DR': 'Other',
    'MB': 'Bus',
    'CB': 'Bus', 
    'VP': 'Other',
    'DT': 'Other',
    'FB': 'Other',
    'CR': 'Rail',
    'LR': 'Rail', 
    'SR': 'Other',
    'HR': 'Rail',
    'RB': 'Bus',
    'YR': 'Rail',
    'MG': 'Other',
    'TB': 'Other',
    'IP': 'Other',
    'TR': 'Other',
    'PB': 'Other',
    'CC': 'Other',
    'AR': 'Rail',
}
    
df = df.assign(
    mode_full_name = df.Mode.map(MODE_NAMES)
)
df

Unnamed: 0,Agency,City,State,Legacy_NTD_ID,NTD_ID,Organization_Type,Reporter_Type,Primary_UZA__Population,Agency_VOMS,Mode,...,Vehicle_Revenue_Miles,Vehicle_Revenue_Miles_Questionable,Any_data_questionable?,Unnamed:_39,Unnamed:_40,Unnamed:_41,1,Unnamed:_43,Unnamed:_44,mode_full_name
0,MTA New York City Transit,New York,NY,2008,20008,"Subsidiary Unit of a Transit Agency, Reporting...",Full Reporter,18351295,10885,HR,...,354616371,,No,,,,Hide questionable data tags,,,Rail
1,MTA New York City Transit,New York,NY,2008,20008,"Subsidiary Unit of a Transit Agency, Reporting...",Full Reporter,18351295,10885,CB,...,9866807,,No,,,,Show questionable data tags,,,Bus
2,MTA New York City Transit,New York,NY,2008,20008,"Subsidiary Unit of a Transit Agency, Reporting...",Full Reporter,18351295,10885,MB,...,86233591,,No,,2.0,,1,,2.0,Bus
3,MTA New York City Transit,New York,NY,2008,20008,"Subsidiary Unit of a Transit Agency, Reporting...",Full Reporter,18351295,10885,DR,...,37759280,,No,,,,,,,Other
4,MTA New York City Transit,New York,NY,2008,20008,"Subsidiary Unit of a Transit Agency, Reporting...",Full Reporter,18351295,10885,RB,...,3382426,,No,,,,,,,Bus
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3680,,,,,,,,,,,...,,,No,,,,,,,
3681,,,,,,,,,,,...,,,No,,,,,,,
3682,,,,,,,,,,,...,,,No,,,,,,,
3683,,,,,,,,,,,...,,,No,,,,,,,
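
An alternative way to write the same mapping is to group the codes by category and invert that into a code-to-category dictionary, which makes repeated keys harder to introduce. This is a sketch: `CATEGORY_CODES` and `code_to_category` are hypothetical names, and the groupings copy the rail and bus assignments from the cell above.

```python
import pandas as pd

# Codes grouped by category, copied from the mapping above.
CATEGORY_CODES = {
    "Rail": ["CR", "LR", "HR", "YR", "AR"],
    "Bus": ["MB", "CB", "RB"],
}

# Invert into a code -> category dictionary.
code_to_category = {
    code: category
    for category, codes in CATEGORY_CODES.items()
    for code in codes
}

# Hypothetical toy column; any code not listed falls back to "Other".
modes = pd.Series(["HR", "CB", "DR", "MB"])
category = modes.map(code_to_category).fillna("Other")
print(category.tolist())  # ['Rail', 'Bus', 'Other', 'Bus']
```

One difference to keep in mind: `fillna("Other")` also labels rows with a missing `Mode` as "Other", whereas the explicit dictionary leaves them as NaN.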


### Looping

You can loop across columns or across subsets of data.

Sometimes, looping can make sense if you're repeating certain steps. Use it if it makes sense.

Here, for 2 different columns, `Agency_VOMS` and `Mode_VOMS`, the values show up as strings.

Print the dtypes out for all the columns. 

In [14]:
df.dtypes

Agency                                                                      object
City                                                                        object
State                                                                       object
Legacy_NTD_ID                                                               object
NTD_ID                                                                      object
Organization_Type                                                           object
Reporter_Type                                                               object
Primary_UZA__Population                                                     object
Agency_VOMS                                                                 object
Mode                                                                        object
TOS                                                                         object
Mode_VOMS                                                                   object
Rati

Print `value_counts()` for `Agency_VOMS` and `Mode_VOMS`. What is making these values appear as strings?

In [15]:
df.Agency_VOMS.value_counts()

2        154
6        143
8        140
5        137
3        132
        ... 
153        1
127        1
188        1
150        1
1,026      1
Name: Agency_VOMS, Length: 255, dtype: int64

In [16]:
df.Mode_VOMS.value_counts()

2      351
3      284
4      232
5      231
1      226
      ... 
66       1
384      1
456      1
744      1
163      1
Name: Mode_VOMS, Length: 270, dtype: int64

In [17]:
#Can the two above be chained together?
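
One possible answer to the question above: the two `value_counts()` calls can be collapsed into a loop over the column names. The sketch uses hypothetical toy values, including one entry with a thousands separator like the `1,026` seen in the output above, which is the kind of value that keeps these columns stored as strings.

```python
import pandas as pd

# Hypothetical toy data; "1,026" contains a comma, so it cannot be read as a number.
toy = pd.DataFrame({
    "Agency_VOMS": ["2", "2", "1,026"],
    "Mode_VOMS": ["2", "3", "3"],
})

counts = {}
for c in ["Agency_VOMS", "Mode_VOMS"]:
    # Same call for each column; store the result so it can be reused later.
    counts[c] = toy[c].value_counts()
    print(counts[c])
```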

For those 2 columns, replace the commas with blanks and fill in missing values with `"0"` (zero, but with quotation marks to make it a string). 

Coerce these columns to be numeric.

In [18]:
for c in ["Agency_VOMS", "Mode_VOMS"]:
    # Remove thousands separators, fill missing values with "0", then convert to integer
    df[c] = df[c].str.replace(',', '', regex=False).fillna('0').astype(int)
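
The cleaning steps can be checked on a small Series before running them on the full column. The sketch below uses hypothetical values and also shows `pd.to_numeric` with `errors="coerce"` as an alternative that turns unparseable entries into NaN instead of raising an error.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with a comma-formatted value and a missing value.
voms = pd.Series(["2", "1,026", np.nan], name="Agency_VOMS")

# Same steps as the loop above: strip commas, fill missing with "0", cast to int.
cleaned = voms.str.replace(",", "", regex=False).fillna("0").astype(int)
print(cleaned.tolist())  # [2, 1026, 0]

# Alternative: keep missing values as NaN rather than treating them as zero.
coerced = pd.to_numeric(voms.str.replace(",", "", regex=False), errors="coerce")
```

Filling with zero and keeping NaN lead to different sums and averages later, so the choice depends on whether a missing value really means zero vehicles.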

In [19]:
# A "for" loop takes a list and runs the same steps once per element.
# Here, `s` holds one state abbreviation on each pass ("ID" is Idaho).
# subset_df = df[df.State==s] keeps only the rows for that state.
# display(...) shows the first five unique Agency/City pairs for that state (.drop_duplicates(), then .head()).
# The first table is CA, the second is Idaho.
for s in ["CA", "ID"]:
    subset_df = df[df.State==s]
    display(subset_df[["Agency", "City"]].drop_duplicates().head())

Unnamed: 0,Agency,City
13,Los Angeles County Metropolitan Transportation...,Los Angeles
72,Orange County Transportation Authority,Orange
94,Access Services,El Monte
120,"City and County of San Francisco, dba: San Fra...",San Francisco
131,San Diego Metropolitan Transit System,San Diego


Unnamed: 0,Agency,City
703,"Ada County Highway District, dba: ACHD Commute...",Boise
778,Valley Regional Transit,Meridian
1440,"City of Pocatello, dba: Pocatello Regional Tra...",Pocatello
1482,Mountain Rides Transportation Authority,Ketchum
1598,Treasure Valley Transit,Nampa
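
When the loop should cover every value in a column rather than a hand-picked list, iterating over a `groupby` is one alternative. This sketch uses hypothetical toy rows and stores each subset in a dictionary keyed by state.

```python
import pandas as pd

# Hypothetical toy data with one repeated Agency/City pair.
toy = pd.DataFrame({
    "State": ["CA", "CA", "CA", "ID"],
    "Agency": ["Agency 1", "Agency 1", "Agency 2", "Agency 3"],
    "City": ["Los Angeles", "Los Angeles", "Orange", "Boise"],
})

unique_pairs = {}
for s, subset_df in toy.groupby("State"):
    # subset_df holds only the rows for state s
    unique_pairs[s] = subset_df[["Agency", "City"]].drop_duplicates()
    print(s, len(unique_pairs[s]))
```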


### To Do:
* Keep a subset of columns and clean up column names (no spaces, newlines, etc):
    * columns related to identifying the agency
    * population, passenger trips
    * transit mode
    * at least 3 service metric variables, normalized and not normalized
* Deal with duplicates - what is the unit for each row? What is the unit for desired analysis? Should an agency appear multiple times, and if so, why?
* Aggregate at least 2 ways and show an interesting comparison, after dealing with duplicates somehow (either aggregation and/or defining what the unit of analysis is)
* Calculate weighted average after the aggregation for the service metrics
* Decide on one type of chart to visualize, and generalize it as a function
* Make charts using the function


### Step by Step

These are the 3 service metrics columns to keep (in addition to the columns listed above):
1. Fare Revenues  
1. Total Operating Expenses 
1. Vehicle Revenue Miles

The normalized columns are the ones adjusted by population or volume. 
* Instead of total fare revenues, it's the fare revenues per unlinked trip.
* Instead of total cost, it's cost per passenger or cost per hour.
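
As a quick check of what "normalized" means here, the fare revenues per unlinked trip can be recomputed from the two totals. The numbers below are taken from the first row of the table shown in this notebook (MTA New York City Transit, mode HR).

```python
# Totals from the first row of the table (MTA New York City Transit, HR).
fare_revenues = 3_643_213_720
unlinked_trips = 2_712_521_697

# Normalized metric = total divided by volume.
fare_per_trip = fare_revenues / unlinked_trips
print(round(fare_per_trip, 2))  # 1.34
```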

In [20]:
#Keep a subset of columns
subset_cols2 = ['Legacy_NTD_ID', 'NTD_ID','Agency', 'City', 'State', 'Organization_Type','Fare_Revenues_Earned', 'Unlinked_Passenger_Trips',
          'Primary_UZA__Population', 'Agency_VOMS', 'Mode_VOMS', 'Fare_Revenues_per_Unlinked_Passenger_Trip_', 
          'Fare_Revenues_per_Total_Operating_Expense_(Recovery_Ratio)', 'Cost_per__Hour', 'Passengers_per_Hour',
          'Cost_per_Passenger','Total_Operating_Expenses','Cost_per_Passenger_Mile','Vehicle_Revenue_Hours','Vehicle_Revenue_Miles',
          'mode_full_name']

df2 = df[subset_cols2]

df2.head()

Unnamed: 0,Legacy_NTD_ID,NTD_ID,Agency,City,State,Organization_Type,Fare_Revenues_Earned,Unlinked_Passenger_Trips,Primary_UZA__Population,Agency_VOMS,...,Fare_Revenues_per_Unlinked_Passenger_Trip_,Fare_Revenues_per_Total_Operating_Expense_(Recovery_Ratio),Cost_per__Hour,Passengers_per_Hour,Cost_per_Passenger,Total_Operating_Expenses,Cost_per_Passenger_Mile,Vehicle_Revenue_Hours,Vehicle_Revenue_Miles,mode_full_name
0,2008,20008,MTA New York City Transit,New York,NY,"Subsidiary Unit of a Transit Agency, Reporting...","$3,643,213,720",2712521697,18351295,10885,...,$1.34,0.7,$267.97,139.6,$1.92,"$5,206,727,193",$0.50,19430373,354616371,Rail
1,2008,20008,MTA New York City Transit,New York,NY,"Subsidiary Unit of a Transit Agency, Reporting...","$76,398,352",11477164,18351295,10885,...,$6.66,0.32,$393.55,18.6,$21.13,"$242,520,835",$1.58,616233,9866807,Bus
2,2008,20008,MTA New York City Transit,New York,NY,"Subsidiary Unit of a Transit Agency, Reporting...","$846,111,742",691616614,18351295,10885,...,$1.22,0.32,$219.87,56.6,$3.88,"$2,685,918,268",$1.82,12215926,86233591,Bus
3,2008,20008,MTA New York City Transit,New York,NY,"Subsidiary Unit of a Transit Agency, Reporting...","$9,781,667",4828423,18351295,10885,...,$2.03,0.02,$129.45,1.2,$106.96,"$516,470,491",$11.92,3989579,37759280,Other
4,2008,20008,MTA New York City Transit,New York,NY,"Subsidiary Unit of a Transit Agency, Reporting...","$32,469,300",30695695,18351295,10885,...,$1.06,0.32,$199.16,59.3,$3.36,"$103,071,355",$1.81,517519,3382426,Bus


Deal with duplicates. 

For an agency with multiple modes, aggregate it across modes and get the sum for the service metrics.

Ex: for an agency with rail, bus, and ferry modes, sum the total fare revenues across those modes.

In [21]:
subset_cols = [
    'Legacy_NTD_ID', 'NTD_ID','Agency', 'City', 'State', 'Organization_Type', 'Primary_UZA__Population', 
    'Agency_VOMS', 'Mode_VOMS', 'Fare_Revenues_per_Unlinked_Passenger_Trip_','Fare_Revenues_Earned', 'Unlinked_Passenger_Trips',
    'Fare_Revenues_per_Total_Operating_Expense_(Recovery_Ratio)','Total_Operating_Expenses', 'Cost_per__Hour', 'Passengers_per_Hour',
    'Cost_per_Passenger','Cost_per_Passenger_Mile','Vehicle_Revenue_Hours','Vehicle_Revenue_Miles',
    'mode_full_name'
]
#Check duplicates
print(f"# obs: {len(df2)}")
print(f"# obs after dropping dups: {len(df2.drop_duplicates(subset=subset_cols))}")

# obs: 3685
# obs after dropping dups: 3681


In [22]:
#Drop duplicates based on subset columns
df3 = df2.drop_duplicates(subset=subset_cols2)

#Checking duplicates again
print(f"# observation: {len(df3)}")
print(f"# observations after dropping dups: {len(df3.drop_duplicates(subset=subset_cols2))}")
df3.head()

# observation: 3681
# observations after dropping dups: 3681


Unnamed: 0,Legacy_NTD_ID,NTD_ID,Agency,City,State,Organization_Type,Fare_Revenues_Earned,Unlinked_Passenger_Trips,Primary_UZA__Population,Agency_VOMS,...,Fare_Revenues_per_Unlinked_Passenger_Trip_,Fare_Revenues_per_Total_Operating_Expense_(Recovery_Ratio),Cost_per__Hour,Passengers_per_Hour,Cost_per_Passenger,Total_Operating_Expenses,Cost_per_Passenger_Mile,Vehicle_Revenue_Hours,Vehicle_Revenue_Miles,mode_full_name
0,2008,20008,MTA New York City Transit,New York,NY,"Subsidiary Unit of a Transit Agency, Reporting...","$3,643,213,720",2712521697,18351295,10885,...,$1.34,0.7,$267.97,139.6,$1.92,"$5,206,727,193",$0.50,19430373,354616371,Rail
1,2008,20008,MTA New York City Transit,New York,NY,"Subsidiary Unit of a Transit Agency, Reporting...","$76,398,352",11477164,18351295,10885,...,$6.66,0.32,$393.55,18.6,$21.13,"$242,520,835",$1.58,616233,9866807,Bus
2,2008,20008,MTA New York City Transit,New York,NY,"Subsidiary Unit of a Transit Agency, Reporting...","$846,111,742",691616614,18351295,10885,...,$1.22,0.32,$219.87,56.6,$3.88,"$2,685,918,268",$1.82,12215926,86233591,Bus
3,2008,20008,MTA New York City Transit,New York,NY,"Subsidiary Unit of a Transit Agency, Reporting...","$9,781,667",4828423,18351295,10885,...,$2.03,0.02,$129.45,1.2,$106.96,"$516,470,491",$11.92,3989579,37759280,Other
4,2008,20008,MTA New York City Transit,New York,NY,"Subsidiary Unit of a Transit Agency, Reporting...","$32,469,300",30695695,18351295,10885,...,$1.06,0.32,$199.16,59.3,$3.36,"$103,071,355",$1.81,517519,3382426,Bus


In [23]:
# Strip dollar signs, commas, and parentheses, fill missing values with "0", then convert to float
df2 = df2.copy()  # work on an explicit copy so the assignment below does not trigger SettingWithCopyWarning
for c in ['Fare_Revenues_per_Unlinked_Passenger_Trip_','Cost_per__Hour','Total_Operating_Expenses','Cost_per_Passenger',
          'Cost_per_Passenger_Mile', 'Fare_Revenues_Earned', 'Unlinked_Passenger_Trips','Primary_UZA__Population']:
    # The character class [$,()] matches each of the four symbols literally
    df2[c] = df2[c].str.replace(r'[$,()]', '', regex=True).fillna('0').astype(float)


Does it make sense to sum up the normalized metrics?
For an agency with 3 modes (rail, bus, ferry), does it make sense to sum up `fares_per_passenger` across those 3 modes? Why or why not?

If bus passengers make up 80% of the agency's passengers (rail 15%, ferry 5%), how do we make sure the normalized metric accounts for this? Bus fares are significantly lower than rail and ferry fares, in this scenario. How do we make sure that `fares_per_passenger` metric reflects this mix?
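
A worked example of the scenario above, with made-up numbers: 1,000 trips split 80/15/5 across bus, rail, and ferry, and hypothetical fares of 1, 3, and 5 dollars per trip. The sketch compares four ways of combining the per-mode metric.

```python
import pandas as pd

# Hypothetical agency: bus carries 80% of trips at the lowest fare.
toy = pd.DataFrame({
    "mode": ["bus", "rail", "ferry"],
    "trips": [800, 150, 50],
    "fare_revenue": [800.0, 450.0, 250.0],
})
toy["fares_per_passenger"] = toy.fare_revenue / toy.trips  # 1.0, 3.0, 5.0

# Summing the ratios gives 9.0, which is not a fare anyone pays.
summed = toy.fares_per_passenger.sum()

# A simple mean gives 3.0 because it treats each mode as equally important.
simple_mean = toy.fares_per_passenger.mean()

# Weighting each mode's ratio by its trips gives 1.5.
weighted = (toy.fares_per_passenger * toy.trips).sum() / toy.trips.sum()

# Dividing total revenue by total trips gives the same 1.5.
ratio_of_sums = toy.fare_revenue.sum() / toy.trips.sum()
```

The weighted average and the ratio of sums agree because weighting a per-trip fare by trips simply recovers the revenue total.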

In [24]:
df2.head()

Unnamed: 0,Legacy_NTD_ID,NTD_ID,Agency,City,State,Organization_Type,Fare_Revenues_Earned,Unlinked_Passenger_Trips,Primary_UZA__Population,Agency_VOMS,...,Fare_Revenues_per_Unlinked_Passenger_Trip_,Fare_Revenues_per_Total_Operating_Expense_(Recovery_Ratio),Cost_per__Hour,Passengers_per_Hour,Cost_per_Passenger,Total_Operating_Expenses,Cost_per_Passenger_Mile,Vehicle_Revenue_Hours,Vehicle_Revenue_Miles,mode_full_name
0,2008,20008,MTA New York City Transit,New York,NY,"Subsidiary Unit of a Transit Agency, Reporting...",3643214000.0,2712522000.0,18351295.0,10885,...,1.34,0.7,267.97,139.6,1.92,5206727000.0,0.5,19430373,354616371,Rail
1,2008,20008,MTA New York City Transit,New York,NY,"Subsidiary Unit of a Transit Agency, Reporting...",76398350.0,11477160.0,18351295.0,10885,...,6.66,0.32,393.55,18.6,21.13,242520800.0,1.58,616233,9866807,Bus
2,2008,20008,MTA New York City Transit,New York,NY,"Subsidiary Unit of a Transit Agency, Reporting...",846111700.0,691616600.0,18351295.0,10885,...,1.22,0.32,219.87,56.6,3.88,2685918000.0,1.82,12215926,86233591,Bus
3,2008,20008,MTA New York City Transit,New York,NY,"Subsidiary Unit of a Transit Agency, Reporting...",9781667.0,4828423.0,18351295.0,10885,...,2.03,0.02,129.45,1.2,106.96,516470500.0,11.92,3989579,37759280,Other
4,2008,20008,MTA New York City Transit,New York,NY,"Subsidiary Unit of a Transit Agency, Reporting...",32469300.0,30695700.0,18351295.0,10885,...,1.06,0.32,199.16,59.3,3.36,103071400.0,1.81,517519,3382426,Bus


In [None]:
#I need help with the above

What is the correct way to calculate `fares_per_passenger` across modes for the same operator?

Show the correct way. Drop the existing normalized metrics and recalculate them across modes for the agency. The resulting dataframe should have 1 row for each agency, with the service metrics aggregated to that agency across modes, as well as the normalized metrics (per passenger or per passenger trip) across modes.

In [25]:
# Aggregate the parts of a normalized column (numerator and denominator), then divide.
# Almost never aggregate the normalized column itself.

#Fare_Revenues_Earned sum
Fare_Revenues_Earned = df2.groupby(['Agency']).agg({'Fare_Revenues_Earned':'sum'}).reset_index()
Fare_Revenues_Earned.head()

Unnamed: 0,Agency,Fare_Revenues_Earned
0,City of Gadsden,69826.0
1,Sistersville Ferry,405.0
2,10-15 Regional Transit Agency,151459.0
3,A&C Bus Corporation & Montgomery & Westside Ow...,5459416.0
4,ALTRAN Transit Authority,252932.0


In [26]:
#Unlinked_Passenger_Trips sum
Unlinked_Passenger_Trips = df2.groupby(['Agency']).agg({'Unlinked_Passenger_Trips':'sum'}).reset_index()
Unlinked_Passenger_Trips.head()

Unnamed: 0,Agency,Unlinked_Passenger_Trips
0,City of Gadsden,105904.0
1,Sistersville Ferry,3702.0
2,10-15 Regional Transit Agency,211790.0
3,A&C Bus Corporation & Montgomery & Westside Ow...,4041143.0
4,ALTRAN Transit Authority,98348.0


In [27]:
#divide to normalize
# Copy first so Fare_Revenues_Earned itself is not modified. Both tables come from the same groupby on Agency, so their rows line up.
Fare_Revs_per_Passenger_tbl = Fare_Revenues_Earned.copy()
Fare_Revs_per_Passenger_tbl['Passengers'] = Unlinked_Passenger_Trips.Unlinked_Passenger_Trips
Fare_Revs_per_Passenger_tbl['Fare_Revs_per_Passenger'] = Fare_Revs_per_Passenger_tbl.Fare_Revenues_Earned / Fare_Revs_per_Passenger_tbl.Passengers
Fare_Revs_per_Passenger_tbl.head()

Unnamed: 0,Agency,Fare_Revenues_Earned,Passengers,Fare_Revs_per_Passenger
0,City of Gadsden,69826.0,105904.0,0.659333
1,Sistersville Ferry,405.0,3702.0,0.1094
2,10-15 Regional Transit Agency,151459.0,211790.0,0.715138
3,A&C Bus Corporation & Montgomery & Westside Ow...,5459416.0,4041143.0,1.350958
4,ALTRAN Transit Authority,252932.0,98348.0,2.571806
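
The same result can also be built in a single chain: sum both components in one `groupby`, then divide. This is a sketch on hypothetical toy rows, using pandas named aggregation; the output column names are chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data: agency "A" has two modes, agency "B" has one.
toy = pd.DataFrame({
    "Agency": ["A", "A", "B"],
    "Fare_Revenues_Earned": [800.0, 450.0, 100.0],
    "Unlinked_Passenger_Trips": [800.0, 150.0, 50.0],
})

by_agency = (toy.groupby("Agency")
             .agg(Fare_Revs=("Fare_Revenues_Earned", "sum"),
                  Passengers=("Unlinked_Passenger_Trips", "sum"))
             .reset_index()
             # Divide the summed components to get the agency-level metric
             .assign(Fare_Revs_per_Passenger=lambda d: d.Fare_Revs / d.Passengers)
)
print(by_agency)
```

Keeping both sums in one table avoids relying on two separate tables sharing the same row order.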


In [28]:
#compare against Fare_Revenues_per_Unlinked_Passenger_Trip_ mean
Fare_Revenues_per_Unlinked_Passenger_trip = df2.groupby("Agency").agg({"Fare_Revenues_per_Unlinked_Passenger_Trip_": "mean"}).reset_index()
Fare_Revenues_per_Unlinked_Passenger_trip

Unnamed: 0,Agency,Fare_Revenues_per_Unlinked_Passenger_Trip_
0,City of Gadsden,0.895000
1,Sistersville Ferry,0.110000
2,10-15 Regional Transit Agency,0.720000
3,A&C Bus Corporation & Montgomery & Westside Ow...,1.350000
4,ALTRAN Transit Authority,2.570000
...,...,...
2164,Yuba-Sutter Transit Authority,2.500000
2165,Yuma County Intergovernmental Public Transport...,2.056667
2166,Yurok Tribe,0.000000
2167,"Zia Therapy Center, Inc.",0.865000


Make a bar chart for one service metric for 5 agencies (show both normalized and not normalized).

Ex: if you choose fare revenues, make a bar chart for total fare revenues and fare revenues per passenger trip. The 5 agencies should appear together on a single bar chart.


In [29]:
keep_me = ['Yurok Tribe', 'Zuni Pueblo', '10-15 Regional Transit Agency','MTA New York City Transit','Orange County Transportation Authority']
Fare_Revs_per_Passenger_Agencies_tbl = Fare_Revs_per_Passenger_tbl[Fare_Revs_per_Passenger_tbl.Agency.isin(keep_me)]
Fare_Revs_per_Passenger_Agencies_tbl

Unnamed: 0,Agency,Fare_Revenues_Earned,Passengers,Fare_Revs_per_Passenger
2,10-15 Regional Transit Agency,151459.0,211790.0,0.715138
1286,MTA New York City Transit,4607975000.0,3451140000.0,1.335204
1535,Orange County Transportation Authority,54749720.0,40743650.0,1.343761
2166,Yurok Tribe,0.0,24189.0,0.0
2168,Zuni Pueblo,19430.0,43585.0,0.445796


In [30]:
Fare_Revs_per_Passenger_tbl = Fare_Revs_per_Passenger_tbl.rename(columns = {'Fare_Revenues_Earned': 'Fare_Revs'})
Fare_Revs_per_Passenger_tbl

Unnamed: 0,Agency,Fare_Revs,Passengers,Fare_Revs_per_Passenger
0,City of Gadsden,69826.0,105904.0,0.659333
1,Sistersville Ferry,405.0,3702.0,0.109400
2,10-15 Regional Transit Agency,151459.0,211790.0,0.715138
3,A&C Bus Corporation & Montgomery & Westside Ow...,5459416.0,4041143.0,1.350958
4,ALTRAN Transit Authority,252932.0,98348.0,2.571806
...,...,...,...,...
2164,Yuba-Sutter Transit Authority,1246334.0,931948.0,1.337343
2165,Yuma County Intergovernmental Public Transport...,752976.0,531761.0,1.416005
2166,Yurok Tribe,0.0,24189.0,0.000000
2167,"Zia Therapy Center, Inc.",82435.0,125215.0,0.658348


In [32]:
import altair as alt

source = Fare_Revs_per_Passenger_Agencies_tbl

alt.Chart(source).mark_bar().encode(
    x="Agency",
    y="sum(Fare_Revs_per_Passenger)",
)

In [None]:
# Sample function
import altair as alt

def make_bar_chart(df, x_col, y_col):
    x_title = f"{x_col.title()}"
    
    chart = (alt.Chart(df)
             .mark_bar()
             .encode(
                 x=alt.X(x_col, title=x_title),
                 y=alt.Y(y_col, title=""),
             )
            )
    return chart

### Helpful Hints for Functions
* Look for the parts of making a chart that are generalizable; those are the opportunities for a function
* Maybe these components need the same lines of code to clean them
* You can always further define variables within a function
* You can always use f-strings within functions to make slight modifications to the parameters you pass

In [None]:
# Sample function
import altair as alt

def make_bar_chart(df, x_col, y_col):
    x_title = f"{x_col.title()}"
    
    chart = (alt.Chart(df)
             .mark_bar()
             .encode(
                 x=alt.X(x_col, title=x_title),
                 y=alt.Y(y_col, title=""),
             )
            )
    return chart
