# Exercise 3: More tabular data wrangling.

Skills:
* Looping
* Dictionary to map values
* Dealing with duplicates
* Make use of Markdown cells to write some narrative or commentary!

References:
* https://docs.calitp.org/data-infra/analytics_new_analysts/02-data-analysis-intermediate.html

In [1]:
import pandas as pd

In [2]:
GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/bus_service_increase/"
FILE_NAME = "ntd_metrics_2019.csv"

df = pd.read_csv(f"{GCS_FILE_PATH}{FILE_NAME}")
df.shape



(3685, 45)

In [3]:
# Clean the column names: replace newlines with spaces, then spaces with underscores
df.columns = df.columns.str.replace('\n', ' ').str.replace(' ', '_')
df.shape

(3685, 45)
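To see what the chained `replace` calls do to a single header, here is a minimal sketch on plain strings. The raw header names are assumptions for illustration and were not copied from the CSV.

```python
# Hypothetical raw headers; the real CSV headers may differ.
raw_columns = ["Primary UZA\n Population", "Agency VOMS", "NTD ID"]

# Same two steps as the cell above: newline -> space, then space -> underscore.
clean_columns = [c.replace("\n", " ").replace(" ", "_") for c in raw_columns]
print(clean_columns)
```

A newline followed by a space becomes two underscores, which may be why a name like `Primary_UZA__Population` has a double underscore.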

In [4]:
print(f"# obs: {len(df)}")
print(f"# unique IDs: {df.NTD_ID.nunique()}")

# obs: 3685
# unique IDs: 2183


In [5]:
# Pick an example -- see that agency provides service for different modes
# df.NTD_ID.value_counts()
df[df.NTD_ID=="10003"].Mode.value_counts()


MB    2
FB    1
DR    1
CR    1
RB    1
HR    1
TB    1
LR    1
Name: Mode, dtype: int64

### Dealing with Duplicates

* Explore why there are duplicates
* What's the analysis about? What should the unit of analysis be?
* Should duplicates be dropped? Should duplicates be aggregated into 1 entry?
* Hint: It depends on the analysis, and the answer may be a bit of both. Sometimes aggregation makes sense. Duplicates require further investigation: why do they appear in the dataset multiple times? Unless the rows are completely identical, it doesn't make sense to just drop them. Duplicates may show that the analysis can be more disaggregated than previously thought.
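One way to start the investigation is to pull every row that shares an ID with another row, and separately check which rows are identical in every column. This is a minimal sketch on a made-up frame; the values are illustrative and do not come from the NTD file.

```python
import pandas as pd

toy = pd.DataFrame({
    "NTD_ID": ["A1", "A1", "B2", "C3", "C3"],
    "Mode": ["MB", "LR", "MB", "MB", "MB"],
    "Vehicle_Revenue_Miles": [100, 50, 70, 30, 30],
})

# keep=False flags every member of a duplicated group, not just the repeats.
same_id = toy[toy.duplicated(subset=["NTD_ID"], keep=False)]

# Rows identical in every column are candidates for dropping; the rest need a decision.
exact_dups = toy[toy.duplicated(keep=False)]

print(same_id)
print(exact_dups)
```

In this toy frame, agency `A1` repeats because it has two modes, while agency `C3` repeats because the same row appears twice.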

In [6]:
# But what about this case?
# Filter df to rows where Agency contains the LA Metro name and Agency is not missing,
# then count how many rows there are for each Mode.
# Duplicates may indicate a 1:m relationship with another table. In the warehouse, for example,
# the same record can appear more than once with a different 'current' flag or timestamp.

df[(df.Agency.str.contains("Los Angeles County Metropolitan Transportation Authority ")) 
   & (df.Agency.notna())].Mode.value_counts()

MB    2
HR    1
RB    1
LR    1
VP    1
Name: Mode, dtype: int64

In [7]:
# Find the column that has different values
# Answer: Vehicle_Revenue_Miles differs between the two MB rows below

df[(df.Agency.str.contains("Los Angeles County Metropolitan Transportation Authority ")) 
   & (df.Agency.notna()) & 
   (df.Mode=="MB")
  ]

Unnamed: 0,Agency,City,State,Legacy_NTD_ID,NTD_ID,Organization_Type,Reporter_Type,Primary_UZA__Population,Agency_VOMS,Mode,...,Passenger_Miles_Questionable,Vehicle_Revenue_Miles,Vehicle_Revenue_Miles_Questionable,Any_data_questionable?,Unnamed:_39,Unnamed:_40,Unnamed:_41,1,Unnamed:_43,Unnamed:_44
16,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,9154,90154,Independent Public Agency or Authority of Tran...,Full Reporter,12150996,3469,MB,...,,65595822,,No,,,,,,
17,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,9154,90154,Independent Public Agency or Authority of Tran...,Full Reporter,12150996,3469,MB,...,,5775759,,No,,,,,,


In [8]:
subset_cols = [
    'Agency', 'City', 'State', 'Legacy_NTD_ID', 'NTD_ID',
    'Organization_Type', 'Reporter_Type', 'Mode'
]

print(f"# obs: {len(df)}")
print(f"# obs after dropping dups: {len(df.drop_duplicates(subset=subset_cols))}")

# What does this indicate? Use Markdown cell and jot down some of the logic.

# obs: 3685
# obs after dropping dups: 3553


### Response to above
The initial dataframe has 3,685 rows. 

After `df.drop_duplicates(subset=subset_cols)`, only 3,553 rows remain, so 132 rows were dropped.
This indicates that some rows share the same values across every column in `subset_cols`, even though they were distinct rows in the initial dataframe. Those rows differ only in columns outside of `subset_cols`.
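To find which columns separate rows that look like duplicates on the key columns, one option is to count distinct values per column within each key group. This is a sketch under stated assumptions: `key_cols` is a hypothetical shortened key, the `TOS` values are invented, and only the revenue-mile figures are taken from the rows shown earlier.

```python
import pandas as pd

toy = pd.DataFrame({
    "NTD_ID": ["90154", "90154", "20008"],
    "Mode": ["MB", "MB", "HR"],
    "TOS": ["DO", "PT", "DO"],  # invented values for illustration
    "Vehicle_Revenue_Miles": [65595822, 5775759, 354616371],
})
key_cols = ["NTD_ID", "Mode"]

# Count distinct values per non-key column within each key group.
n_distinct = toy.groupby(key_cols).nunique()

# A column with more than one distinct value in any group is what separates the "duplicates".
varies = (n_distinct > 1).any()
differing_cols = varies[varies].index.tolist()
print(differing_cols)
```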


[Markdown reference](https://www.datacamp.com/community/tutorials/markdown-in-jupyter-notebook)

### Changing values by using a dictionary to map

In [9]:
# Transit mode is stored as a code.
# Use a dictionary to map each code to its full name.
MODE_NAMES = {
    'MB': 'Bus', 
    'LR': 'Light Rail',
    'CB': 'Commuter Bus',
}

# What happens to the ones that aren't specified in MODE_NAMES?
# assign() adds a new column to df, here called mode_full_name. map() looks up each Mode code in the MODE_NAMES dictionary; codes that are not keys in the dictionary become NaN.
df = df.assign(mode_full_name = df.Mode.map(MODE_NAMES))
df.columns

Index(['Agency', 'City', 'State', 'Legacy_NTD_ID', 'NTD_ID',
       'Organization_Type', 'Reporter_Type', 'Primary_UZA__Population',
       'Agency_VOMS', 'Mode', 'TOS', 'Mode_VOMS', 'Ratios:',
       'Fare_Revenues_per_Unlinked_Passenger_Trip_',
       'Fare_Revenues_per_Unlinked_Passenger_Trip_Questionable',
       'Fare_Revenues_per_Total_Operating_Expense_(Recovery_Ratio)',
       'Fare_Revenues_per_Total_Operating_Expense_(Recovery_Ratio)_Questionable',
       'Cost_per__Hour', 'Cost_per_Hour_Questionable', 'Passengers_per_Hour',
       'Passengers_per_Hour_Questionable', 'Cost_per_Passenger',
       'Cost_per_Passenger_Questionable', 'Cost_per_Passenger_Mile',
       'Cost_per_Passenger_Mile_Questionable', 'Source_Data:',
       'Fare_Revenues_Earned', 'Fare_Revenues_Earned_Questionable',
       'Total_Operating_Expenses', 'Total_Operating_Expenses_Questionable',
       'Unlinked_Passenger_Trips', 'Unlinked_Passenger_Trips_Questionable',
       'Vehicle_Revenue_Hours', 'Vehicle_R

In [10]:
# value_counts() only counts non-missing values, so only the three names defined in MODE_NAMES appear. Rows with an unmapped Mode are NaN in mode_full_name and are left out.
df.mode_full_name.value_counts()

Bus             1244
Commuter Bus     177
Light Rail        23
Name: mode_full_name, dtype: int64

In [11]:
# `isna()` flags missing values (the opposite of `notna()`).
# Keep the rows where mode_full_name is missing, then count which Mode codes those rows have.
df[df.mode_full_name.isna()].Mode.value_counts()

DR    1879
VP     112
DT     103
FB      40
CR      27
SR      22
HR      15
RB      13
YR       6
MG       6
TB       5
IP       3
TR       2
PB       1
CC       1
AR       1
Name: Mode, dtype: int64
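The behavior above can be reproduced on a small Series: `map()` with a dictionary returns NaN for any value that is not a key. The sketch also shows one possible fallback, which keeps the original code where no full name is defined. Whether that fallback suits the analysis is a judgment call.

```python
import pandas as pd

MODE_NAMES = {"MB": "Bus", "LR": "Light Rail", "CB": "Commuter Bus"}
modes = pd.Series(["MB", "DR", "LR", "VP"], name="Mode")

# map() returns NaN for any code that is not a key in the dictionary.
mapped = modes.map(MODE_NAMES)

# One option: fall back to the original code where no name was found.
mapped_with_fallback = mapped.fillna(modes)
print(mapped_with_fallback.tolist())
```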

In [12]:
# Map values from Mode to rail, bus, and other.
# Grouping rule: codes ending in R are treated as Rail, codes ending in B as Bus, and the rest as other.
# This is a naming shortcut, not an official classification (DR, for example, is demand response in NTD reporting).
# A chain like 'AR' and 'CR' evaluates to only its last string, so each code needs its own dictionary key.
# The three mappings from MODE_NAMES are included as well.
# assign() then adds a mode_cat column by mapping each Mode code through mode_fill.

rail_modes = ['AR', 'CR', 'DR', 'HR', 'SR', 'TR', 'YR']
bus_modes = ['FB', 'PB', 'RB', 'TB']
other_modes = ['CC', 'MG', 'IP', 'VP', 'DT']

mode_fill = {
    **dict.fromkeys(rail_modes, 'Rail'),
    **dict.fromkeys(bus_modes, 'Bus'),
    **dict.fromkeys(other_modes, 'other'),
    'MB': 'Bus',
    'LR': 'Light Rail',
    'CB': 'Commuter Bus',
}
df = df.assign(mode_cat = df.Mode.map(mode_fill))


In [21]:
# Count the rows in each mode_cat category

df.mode_cat.value_counts()

Rail            1952
Bus             1303
other            225
Commuter Bus     177
Light Rail        23
Name: mode_cat, dtype: int64
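When grouping several codes under one label, it helps to know that Python evaluates `'AR' and 'CR'` to a single string, so an `and` chain cannot stand in for several dictionary keys at once. This small sketch uses a shortened, illustrative list of codes; `groups` and `mode_to_category` are hypothetical names.

```python
# A chain of `and` between non-empty strings evaluates to the last string.
key = "AR" and "CR" and "YR"
print(key)

# So a dictionary written this way ends up with a single key.
broken = {"AR" and "CR" and "YR": "Rail"}
print(broken)

# One alternative: list the codes per category, then expand to one key per code.
groups = {"Rail": ["AR", "CR", "YR"], "Bus": ["FB", "RB"]}
mode_to_category = {code: cat for cat, codes in groups.items() for code in codes}
print(mode_to_category)
```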

### Looping

Can loop across columns or loop across subsets of data.

Sometimes, looping can make sense if you're repeating certain steps. Use it if it makes sense.

In [16]:
# `for c in [...]` starts the loop; c takes each column name in turn.
# Here c is "Agency_VOMS", then "Mode_VOMS".
# For each column: remove the thousands-separator commas, fill missing values with the string '0',
# then convert the cleaned strings to integers.

for c in ["Agency_VOMS", "Mode_VOMS"]:
    df[c] = df[c].str.replace(',', '').fillna('0').astype(int)

In [17]:
# For each state, create subset_df from the rows of df in that state,
# then display the first five unique Agency / City pairs

for s in ["CA", "ID"]:
    subset_df = df[df.State==s]
    display(subset_df[["Agency", "City"]].drop_duplicates().head())

Unnamed: 0,Agency,City
13,Los Angeles County Metropolitan Transportation...,Los Angeles
72,Orange County Transportation Authority,Orange
94,Access Services,El Monte
120,"City and County of San Francisco, dba: San Fra...",San Francisco
131,San Diego Metropolitan Transit System,San Diego


Unnamed: 0,Agency,City
703,"Ada County Highway District, dba: ACHD Commute...",Boise
778,Valley Regional Transit,Meridian
1440,"City of Pocatello, dba: Pocatello Regional Tra...",Pocatello
1482,Mountain Rides Transportation Authority,Ketchum
1598,Treasure Valley Transit,Nampa


### To Do:
* Keep a subset of columns and clean up column names (no spaces, newlines, etc):
    * columns related to identifying the agency
    * population, passenger trips
    * transit mode
    * at least 3 service metric variables, normalized and not normalized
* Deal with duplicates - what is the unit for each row? What is the unit for desired analysis? Should an agency appear multiple times, and if so, why?
* Aggregate at least 2 ways and show an interesting comparison, after dealing with duplicates somehow (either aggregation and/or defining what the unit of analysis is)
* Calculate weighted average after the aggregation for the service metrics
* Decide on one type of chart to visualize, and generalize it as a function
* Make charts using the function
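For the weighted-average step, one common approach is to sum the numerator and the denominator within each group and divide afterwards, rather than averaging row-level ratios. This is a minimal sketch on made-up numbers: `Total_Operating_Expenses` and `Vehicle_Revenue_Hours` are column names from this dataset, while `cost_per_hour` is a hypothetical output name.

```python
import pandas as pd

toy = pd.DataFrame({
    "State": ["CA", "CA", "ID"],
    "Total_Operating_Expenses": [1000, 3000, 500],
    "Vehicle_Revenue_Hours": [10, 20, 5],
})

# Sum numerator and denominator within each group first...
by_state = toy.groupby("State", as_index=False)[
    ["Total_Operating_Expenses", "Vehicle_Revenue_Hours"]
].sum()

# ...then divide. This weights each row by its hours, unlike a mean of row-level ratios.
by_state["cost_per_hour"] = (
    by_state.Total_Operating_Expenses / by_state.Vehicle_Revenue_Hours
)
print(by_state)
```

For CA, the row-level ratios are 100 and 150, so their simple mean is 125. The weighted figure is 4000 / 30, which leans toward the row with more hours.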

In [52]:
# Columns I want to keep
subset_col2 = [
    'Agency',
    'NTD_ID',
    'State',
    'Organization_Type',
    'Reporter_Type',
    'Primary_UZA__Population',
    'Agency_VOMS',
    'Mode',
    'Fare_Revenues_Earned',
    'Total_Operating_Expenses',
    'Vehicle_Revenue_Miles',
]

# Create df_sub from df, keeping only the columns listed in `subset_col2` (see Exercise 2)
df_sub = df[subset_col2]

# Peek at the dtypes: fare revenue and operating expenses are `object`, but I want integers
df_sub.dtypes

Agency                      object
NTD_ID                      object
State                       object
Organization_Type           object
Reporter_Type               object
Primary_UZA__Population     object
Agency_VOMS                  int64
Mode                        object
Fare_Revenues_Earned        object
Total_Operating_Expenses    object
Vehicle_Revenue_Miles       object
dtype: object

In [56]:
# Remove '$' and ',' from fare revenue and operating expenses so the columns can be converted to int.
# Both columns need the .str accessor: Series.replace without .str only swaps whole cells that equal '$',
# which would leave the '$' prefix in place. regex=False treats '$' as a literal character.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the df_sub slice.

df_sub = df_sub.assign(
    Fare_Revenues_Earned = df_sub.Fare_Revenues_Earned.str.replace('$', '', regex=False).str.replace(',', '', regex=False),
    Total_Operating_Expenses = df_sub.Total_Operating_Expenses.str.replace('$', '', regex=False).str.replace(',', '', regex=False),
)
df_sub.head()


Unnamed: 0,Agency,NTD_ID,State,Organization_Type,Reporter_Type,Primary_UZA__Population,Agency_VOMS,Mode,Fare_Revenues_Earned,Total_Operating_Expenses,Vehicle_Revenue_Miles
0,MTA New York City Transit,20008,NY,"Subsidiary Unit of a Transit Agency, Reporting...",Full Reporter,18351295,10885,HR,3643213720,5206727193,354616371
1,MTA New York City Transit,20008,NY,"Subsidiary Unit of a Transit Agency, Reporting...",Full Reporter,18351295,10885,CB,76398352,242520835,9866807
2,MTA New York City Transit,20008,NY,"Subsidiary Unit of a Transit Agency, Reporting...",Full Reporter,18351295,10885,MB,846111742,2685918268,86233591
3,MTA New York City Transit,20008,NY,"Subsidiary Unit of a Transit Agency, Reporting...",Full Reporter,18351295,10885,DR,9781667,516470491,37759280
4,MTA New York City Transit,20008,NY,"Subsidiary Unit of a Transit Agency, Reporting...",Full Reporter,18351295,10885,RB,32469300,103071355,3382426


In [50]:
# Experiment with assign(): strip '$' into two new columns, fare_rev and tl_op_exp.
# assign() returns a new DataFrame rather than changing df_sub in place. The result is not saved here,
# so the column list below is unchanged.


df_sub.assign(
    fare_rev =  df_sub.Fare_Revenues_Earned.str.replace('$','', regex=False),
    tl_op_exp = df_sub.Total_Operating_Expenses.str.replace('$','', regex=False)
)
list(df_sub.columns)



['Agency',
 'NTD_ID',
 'State',
 'Organization_Type',
 'Reporter_Type',
 'Primary_UZA__Population',
 'Agency_VOMS',
 'Mode',
 'Fare_Revenues_Earned',
 'Total_Operating_Expenses',
 'Vehicle_Revenue_Miles']

In [None]:
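# Change the dtype of fare revenue earned, total operating expenses, and vehicle revenue miles to int.
# This assumes '$' and ',' have already been stripped and that no values are missing; otherwise astype(int) raises an error.
# astype() returns a new DataFrame, so save the result back to df_sub.
## Is there a way to do that from my list of columns to keep?

df_sub = df_sub.astype({
    'Fare_Revenues_Earned': int,
    'Total_Operating_Expenses': int,
    'Vehicle_Revenue_Miles': int,
})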
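On building the `astype` dictionary from a list of columns: a dictionary comprehension can generate the `{column: dtype}` pairs. This is a minimal sketch with a toy frame; `int_cols` is a hypothetical list and the values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "Agency": ["A", "B"],
    "Fare_Revenues_Earned": ["100", "250"],
    "Vehicle_Revenue_Miles": ["10", "20"],
})

# Hypothetical list naming the columns that should become integers.
int_cols = ["Fare_Revenues_Earned", "Vehicle_Revenue_Miles"]

# Build the {column: dtype} dictionary from the list instead of typing it out.
toy = toy.astype({c: int for c in int_cols})
print(toy.dtypes)
```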


In [None]:
df_sub.drop_duplicates(subset=['NTD_ID']).shape

### Thoughts
Comparing `df_sub` with `df_sub.drop_duplicates(subset=['NTD_ID'])` shows a difference in the number of rows. Dropping duplicates based on `NTD_ID` returns one row per unique NTD agency.

However, the duplicates in the initial dataframe are due to the categorical `Mode` column. Some agencies operate several modes of transportation, and each mode has its own revenue, operating expenses, and vehicle revenue miles.

Personally, I would like to keep `df_sub` with the duplicates and consolidate the repeated agencies using aggregation.
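Consolidating by aggregation could look like the sketch below, which sums the per-mode rows up to one row per agency and records how many modes were combined. The numbers are made up, and `n_modes` is a hypothetical column name.

```python
import pandas as pd

toy = pd.DataFrame({
    "NTD_ID": ["20008", "20008", "90154"],
    "Mode": ["HR", "MB", "MB"],
    "Fare_Revenues_Earned": [300, 100, 50],
    "Vehicle_Revenue_Miles": [30, 10, 5],
})

# One row per agency: count its modes and sum the numeric columns across modes.
agency_totals = (
    toy.groupby("NTD_ID", as_index=False)
    .agg(
        n_modes=("Mode", "nunique"),
        Fare_Revenues_Earned=("Fare_Revenues_Earned", "sum"),
        Vehicle_Revenue_Miles=("Vehicle_Revenue_Miles", "sum"),
    )
)
print(agency_totals)
```

Summing only makes sense once the columns are numeric, and it changes the unit of analysis from agency-mode to agency.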

In [31]:
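# Aggregations
# What is the total revenue earned per state?
# Note: the output below concatenates text instead of adding numbers, so Fare_Revenues_Earned was still
# a string column when this cell ran. max and min are text comparisons too. Convert the column to a
# numeric type first to get real totals.
df_sub.groupby('State').Fare_Revenues_Earned.agg(['sum', 'max', 'min'])


State,sum,max,min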
AK,"$1,175,411 $3,104,663 $685,839 $233,684 $26,09...","$79,477",$0
AL,"$1,827,042 $178,505 $451,446 $0 $205,127 $625,...","$93,772",$0
AR,"$663,662 $217,238 $1,678,435 $7,371 $9,912 $24...","$99,645","$1,678,435"
AS,"$21,291 $65,738","$65,738","$21,291"
AZ,"$3,372,077 $1,617,171 $9,104,573 $27,492,333 $...","$9,104,573",$0
CA,"$31,426,577 $4,997,045 $42,986,478 $182,029,21...","$99,070",$0
CO,"$53,739,008 $4,529,322 $32,983,228 $24,775,846...","$98,959",$0
CT,"$1,069,990 $14,215,583 $1,357,670 $7,223,078 $...","$783,788",$0
DC,"$3,720,334 $533,518,013 $8,058,636 $722,503 $1...","$8,058,636",$0
DE,"$616,312 $104,336 $4,498,664 $7,326,183","$7,326,183","$104,336"


In [33]:
# What is the total revenue earned per agency?
# Fare_Revenues_Earned is still a string column in this output, so 'sum' concatenates the values instead of adding them.
df_sub.groupby('Agency').Fare_Revenues_Earned.agg(['sum','max','min'])


Agency,sum,max,min
City of Gadsden,"$46,036 $23,790","$46,036","$23,790"
Sistersville Ferry,$405,$405,$405
10-15 Regional Transit Agency,"$151,459","$151,459","$151,459"
A&C Bus Corporation & Montgomery & Westside Owners Association,"$5,459,416","$5,459,416","$5,459,416"
ALTRAN Transit Authority,"$252,932","$252,932","$252,932"
...,...,...,...
Yuba-Sutter Transit Authority,"$506,790 $569,046 $170,498","$569,046","$170,498"
Yuma County Intergovernmental Public Transportation Authority,"$2,754 $335,101 $415,121","$415,121","$2,754"
Yurok Tribe,$0 $0,$0,$0
"Zia Therapy Center, Inc.","$66,892 $15,543","$66,892","$15,543"


### Helpful Hints for Functions
* Opportunities for functions come from the components of a chart that can be generalized
* Maybe these components need the same lines of code to clean them
* You can always further define variables within a function
* You can always use f-strings within functions to make slight modifications to the parameters you pass
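As an illustration of the f-string hint, a small helper can turn a column name into a readable axis title, with an optional unit. `make_axis_title` is a hypothetical helper and is not part of the exercise code.

```python
def make_axis_title(col: str, unit: str = "") -> str:
    """Turn a column name like 'Vehicle_Revenue_Miles' into a readable axis title."""
    title = col.replace("_", " ").title()
    # The f-string lets one function cover titles with or without a unit.
    return f"{title} ({unit})" if unit else title


print(make_axis_title("Vehicle_Revenue_Miles"))
print(make_axis_title("Fare_Revenues_Earned", unit="$"))
```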

In [None]:
# Sample function
import altair as alt

def make_bar_chart(df, x_col, y_col):
    x_title = f"{x_col.title()}"
    
    chart = (alt.Chart(df)
             .mark_bar()
             .encode(
                 x=alt.X(x_col, title=x_title),
                 y=alt.Y(y_col, title=""),
             )
            )
    return chart


When removing files:
* `git rm folder/file.ipynb` if the file is in GitHub (checked in in the past)
* if it's not, you can use `rm folder/file.ipynb`
* if it's a folder that's been checked in, you can use `git rm folder/ -rf`, followed by `rm folder/ -rf`. r = recursive, f = force.