## Exercise 2: Merging and Deriving New Columns

Skills: 
* Merge 2 dataframes
* F-strings!
* Markdown cells
* Build on groupby/agg knowledge, derive new columns, and export results
* Practice committing on GitHub

References: 
* https://docs.calitp.org/data-infra/analytics_new_analysts/01-data-analysis-intro.html
* https://docs.calitp.org/data-infra/analytics_tools/saving_code.html

In [48]:
import pandas as pd

This exercise uses f-strings to build file paths. [Read more on this](https://realpython.com/python-f-strings/#f-strings-a-new-and-improved-way-to-format-strings-in-python).
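
As a minimal sketch of how an f-string works (the file name and the row count below are made up for illustration), anything inside `{}` is evaluated and placed into the string:

```python
# Hypothetical folder and file names, used only to illustrate f-strings.
FOLDER = "./data/"
FILE_NAME = "example.parquet"

# Each {} is replaced by the value of the expression inside it.
path = f"{FOLDER}{FILE_NAME}"
print(path)

# A format spec after the colon controls display; `,` adds thousands separators.
n_rows = 3685
message = f"The dataframe has {n_rows:,} rows"
print(message)
```

The same pattern is what lets you change `FOLDER` in one place and reuse it for every file you read.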

Also, double-click this Markdown cell to see the formatting syntax behind it. [Reference this](https://www.datacamp.com/community/tutorials/markdown-in-jupyter-notebook).

If you don't have access to Google Cloud Storage, change the path to pull from our truncated sample parquets stored in the repo.

We use [relative paths](https://towardsthecloud.com/get-relative-path-python) rather than absolute paths. Since we are in the `starter_kit` directory, we just need to go one level down into the `data` subfolder. To go one level up from `starter_kit`, use `../` and you'll end up in `data-analyses`.

```
FOLDER = "./data/"
FILE_NAME = "exercise_2_3_ntd_metrics_2019.parquet"
df = pd.read_parquet(f"{FOLDER}{FILE_NAME}")
```

In [49]:
FOLDER = "./data/"
FILE_NAME = "exercise_2_3_ntd_metrics_2019.parquet"
df = pd.read_parquet(f"{FOLDER}{FILE_NAME}")

In [50]:
FOLDER = "./data/"
FILE_NAME = "exercise_2_ntd_vehicles_2019.parquet"
df2 = pd.read_parquet(f"{FOLDER}{FILE_NAME}")

In [10]:
df.head()

| | Agency | City | State | Legacy NTD ID | NTD ID | Organization Type | Reporter Type | Primary UZA\n Population | Agency VOMS | Mode | ... | Passenger Miles Questionable | Vehicle Revenue Miles | Vehicle Revenue Miles Questionable | Any data questionable? | Unnamed: 39 | Unnamed: 40 | Unnamed: 41 | 1 | Unnamed: 43 | Unnamed: 44 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MTA New York City Transit | New York | NY | 2008 | 20008 | Subsidiary Unit of a Transit Agency, Reporting... | Full Reporter | 18351295 | 10885 | HR | ... | | 354616371 | | No | | | | Hide questionable data tags | | |
| 1 | MTA New York City Transit | New York | NY | 2008 | 20008 | Subsidiary Unit of a Transit Agency, Reporting... | Full Reporter | 18351295 | 10885 | CB | ... | | 9866807 | | No | | | | Show questionable data tags | | |
| 2 | MTA New York City Transit | New York | NY | 2008 | 20008 | Subsidiary Unit of a Transit Agency, Reporting... | Full Reporter | 18351295 | 10885 | MB | ... | | 86233591 | | No | | 2.0 | | 1 | | 2.0 |
| 3 | MTA New York City Transit | New York | NY | 2008 | 20008 | Subsidiary Unit of a Transit Agency, Reporting... | Full Reporter | 18351295 | 10885 | DR | ... | | 37759280 | | No | | | | | | |
| 4 | MTA New York City Transit | New York | NY | 2008 | 20008 | Subsidiary Unit of a Transit Agency, Reporting... | Full Reporter | 18351295 | 10885 | RB | ... | | 3382426 | | No | | | | | | |


In [11]:
df2.head()

| | Agency | City | State | Legacy NTD ID | NTD ID | Organization Type | Reporter Type | Primary UZA Population | Agency VOMS | Bus | ... | Trucks And Other Rubber Tire Vehicles | Trucks And Other Rubber Tire Vehicles >= ULB | Steel Wheel Vehicles | Steel Wheel Vehicles >= ULB | Total Service Vehicles | Total Service Vehicles >= ULB | Unnamed: 95 | Unnamed: 96 | Unnamed: 97 | 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MTA New York City Transit | New York | NY | 2008 | 20008 | Subsidiary Unit of a Transit Agency, Reporting... | Full Reporter | 18351295 | 10885 | 3104 | ... | 1489 | 408 | 373 | 262 | 2297 | 818 | | | | Hide Vehicles >= ULB |
| 1 | New Jersey Transit Corporation | Newark | NJ | 2080 | 20080 | Other Publicly-Owned or Privately Chartered Co... | Full Reporter | 18351295 | 3645 | 1234 | ... | 787 | 374 | 89 | 21 | 975 | 477 | | | | Show Vehicles >= ULB |
| 2 | Los Angeles County Metropolitan Transportation... | Los Angeles | CA | 9154 | 90154 | Independent Public Agency or Authority of Tran... | Full Reporter | 12150996 | 3469 | 1962 | ... | 934 | 384 | 9 | 4 | 1420 | 518 | | | | 1 |
| 3 | Washington Metropolitan Area Transit Authority | Washington | DC | 3030 | 30030 | Independent Public Agency or Authority of Tran... | Full Reporter | 4586770 | 3391 | 1607 | ... | 1522 | 472 | 199 | 24 | 1938 | 571 | | | | |
| 4 | King County Department of Metro Transit, dba: ... | Seattle | WA | 1 | 1 | City, County or Local Government Unit or Depar... | Full Reporter | 3059393 | 3233 | 551 | ... | 396 | 106 | 0 | 0 | 522 | 142 | | | | |


In [None]:
GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/bus_service_increase/"
FILE_NAME = "ntd_metrics_2019.csv"

df = pd.read_csv(f"{GCS_FILE_PATH}{FILE_NAME}")
df.head(2)

In [None]:
FILE_NAME = "ntd_vehicles_2019.csv"
df2 = pd.read_csv(f"{GCS_FILE_PATH}{FILE_NAME}")

df2.head(2)

In [5]:
len(df)

3685

In [6]:
df["NTD ID"].nunique()

2183

In [7]:
len(df2)
df2["NTD ID"].nunique()

2775

### To do:

* Start with the `ntd_metrics_2019.csv` dataset.
* Merge in the `ntd_vehicles_2019.csv` dataset from the same location within the GCS bucket, but only keep a couple of columns.
* Print out what states there are using `value_counts`
* Subset and only keep the following states: NY, CA, TX, ID, MS
* Calculate some aggregate statistics grouping by states (the point of the exercise is to aggregate, less so on whether the stats make sense):
    * Include: sum, mean, count (of operators), nunique (of city)
    * Challenge: give a per capita measure, such as total service vehicles per 100,000 residents
* Plot the per capita measure across the 5 states (some states are very populous and some are not...per capita hopefully normalizes pop differences)
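
The aggregation steps in this list can be sketched on toy data. This is only an illustration under assumptions: the numbers are made up, the column names `vehicles` and `population` are hypothetical stand-ins for the cleaned NTD columns, and counting each city's population once is just one possible way to build a state denominator.

```python
import pandas as pd

# Toy data with made-up numbers; column names are hypothetical.
toy = pd.DataFrame({
    "State": ["CA", "CA", "CA", "NY", "NY"],
    "City": ["Los Angeles", "Los Angeles", "Fresno", "New York", "Albany"],
    "Agency": ["A", "B", "C", "D", "E"],
    "vehicles": [100, 50, 20, 300, 30],
    "population": [4_000_000, 4_000_000, 500_000, 8_000_000, 100_000],
})

# sum, mean, count (of operators), nunique (of city), grouped by state
stats = (
    toy.groupby("State")
    .agg(
        total_vehicles=("vehicles", "sum"),
        mean_vehicles=("vehicles", "mean"),
        n_operators=("Agency", "count"),
        n_cities=("City", "nunique"),
    )
    .reset_index()
)

# Population repeats on every row for the same city, so count each city once.
state_pop = (
    toy.drop_duplicates(subset=["State", "City"])
    .groupby("State")
    .agg(population=("population", "sum"))
    .reset_index()
)

# Attach the population and derive vehicles per 100,000 residents.
stats = stats.merge(state_pop, on="State", how="left", validate="1:1").assign(
    vehicles_per_100k=lambda x: x.total_vehicles / x.population * 100_000
)
print(stats)
```

From here, something like `stats.plot.bar(x="State", y="vehicles_per_100k")` would draw the comparison, assuming matplotlib is installed.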

In [12]:
list(df.columns)

['Agency',
 'City',
 'State',
 'Legacy NTD ID',
 'NTD ID',
 'Organization Type',
 'Reporter Type',
 'Primary UZA\n Population',
 'Agency VOMS',
 'Mode',
 'TOS',
 'Mode VOMS',
 'Ratios:',
 'Fare Revenues per Unlinked Passenger Trip ',
 'Fare Revenues per Unlinked Passenger Trip Questionable',
 'Fare Revenues per Total Operating Expense (Recovery Ratio)',
 'Fare Revenues per Total Operating Expense (Recovery Ratio) Questionable',
 'Cost per\n Hour',
 'Cost per Hour Questionable',
 'Passengers per Hour',
 'Passengers per Hour Questionable',
 'Cost per Passenger',
 'Cost per Passenger Questionable',
 'Cost per Passenger Mile',
 'Cost per Passenger Mile Questionable',
 'Source Data:',
 'Fare Revenues Earned',
 'Fare Revenues Earned Questionable',
 'Total Operating Expenses',
 'Total Operating Expenses Questionable',
 'Unlinked Passenger Trips',
 'Unlinked Passenger Trips Questionable',
 'Vehicle Revenue Hours',
 'Vehicle Revenue Hours Questionable',
 'Passenger Miles',
 'Passenger Miles Que

In [8]:
df.columns

Index(['Agency', 'City', 'State', 'Legacy NTD ID', 'NTD ID',
       'Organization Type', 'Reporter Type', 'Primary UZA\n Population',
       'Agency VOMS', 'Mode', 'TOS', 'Mode VOMS', 'Ratios:',
       'Fare Revenues per Unlinked Passenger Trip ',
       'Fare Revenues per Unlinked Passenger Trip Questionable',
       'Fare Revenues per Total Operating Expense (Recovery Ratio)',
       'Fare Revenues per Total Operating Expense (Recovery Ratio) Questionable',
       'Cost per\n Hour', 'Cost per Hour Questionable', 'Passengers per Hour',
       'Passengers per Hour Questionable', 'Cost per Passenger',
       'Cost per Passenger Questionable', 'Cost per Passenger Mile',
       'Cost per Passenger Mile Questionable', 'Source Data:',
       'Fare Revenues Earned', 'Fare Revenues Earned Questionable',
       'Total Operating Expenses', 'Total Operating Expenses Questionable',
       'Unlinked Passenger Trips', 'Unlinked Passenger Trips Questionable',
       'Vehicle Revenue Hours', 'Vehicle

In [13]:
df.Agency.value_counts()

Massachusetts Bay Transportation Authority                           9
New Jersey Transit Corporation                                       8
King County Department of Metro Transit, dba: King County Metro      8
Metropolitan Transit Authority of Harris County, Texas               8
Maryland Transit Administration                                      8
                                                                    ..
City of Shafter                                                      1
Class LTD                                                            1
City of Onalaska, dba: Onalaska Shared Ride Taxi City of Onalaska    1
City of Monrovia                                                     1
City of Needles, dba: Needles Area Transit                           1
Name: Agency, Length: 2169, dtype: int64

### Helpful Hints and Best Practices

* Start with a comprehensive approach: write down every line of code needed to clean the data.
* Once the data cleaning is done, refine and tidy the code: see which steps can be chained together, which steps are repeated (use a function!), etc.
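
As a rough sketch of the "use a function" hint, a cleaning step that repeats across columns can be wrapped once and reused. The function name `clean_text_column` and the toy values are hypothetical, not part of the exercise data.

```python
import pandas as pd


def clean_text_column(series: pd.Series) -> pd.Series:
    """Strip leading/trailing blanks and drop literal parentheses."""
    return (
        series.str.strip()
        # regex=False treats "(" and ")" as plain characters, not regex syntax
        .str.replace("(", "", regex=False)
        .str.replace(")", "", regex=False)
    )


# Made-up values for illustration only.
toy = pd.DataFrame({
    "Agency": [" City of Example (CE) "],
    "City": [" Example "],
})

# Apply the same cleaning to every text column that needs it.
for c in ["Agency", "City"]:
    toy[c] = clean_text_column(toy[c])

print(toy)
```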

#### Chaining
Similar to **piping** in R, where you can pipe multiple operations in 1 line of code with `%>%`, you can chain methods in Python. There is also a `df.pipe` method, but that's slightly different.

Make use of parentheses to do this. Also, use `df.assign` (see below) so you don't run into the `SettingWithCopyWarning`, which may pop up if you decide to subset your data. 
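
A minimal sketch of chaining, on made-up data: wrapping the whole expression in parentheses lets each method sit on its own line, and passing a `lambda` to `assign` makes the new column refer to the already-subsetted dataframe rather than the original one.

```python
import pandas as pd

# Toy data; the values are invented for illustration.
toy = pd.DataFrame({
    "State": ["CA", "NY", "TX", "WA"],
    "population": ["1,000", "2,500", "300", "40"],
})

cleaned = (
    # 1. subset rows
    toy[toy.State.isin(["CA", "NY", "TX"])]
    # 2. derive/overwrite a column; x is the subsetted dataframe from step 1
    .assign(
        population=lambda x: x.population.str.replace(",", "", regex=False).astype(int)
    )
    # 3. keep chaining more operations
    .sort_values("population", ascending=False)
    .reset_index(drop=True)
)
print(cleaned)
```

Because every step returns a new dataframe, the original `toy` is left untouched.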

#### Assign 
You can also create new columns in place with `df["new_col"] = ...`, and the `SettingWithCopyWarning` that may come up is mostly harmless. But `assign` also lets you chain more operations afterward. [More clarification.](https://pythonguides.com/add-a-column-to-a-dataframe-in-python-pandas/)
```
states_clean = (states_clean
    # Assign is similar to R dplyr's mutate
    .assign(
        # Strip leading or trailing blanks (slightly different than replace)
        # Decide if you want to replace all blanks or just leading/trailing
        Agency = (states_clean.Agency.str.strip()
                # regex=False treats the parentheses as literal characters
                .str.replace('(', '', regex=False)
                .str.replace(')', '', regex=False)
        ),
        # Do something similar for City as above
        City = states_clean.City.str.strip(),
        # Replace blanks with nothing
        State = states_clean.State.str.replace(' ', '')
    ).astype({
        "Population": int, 
        "Fare_Revenues": int,
    })
)
```

Alternatively, try it with a loop:

```
for c in ["Agency", "City"]:
    # Agency and City are text columns, so clean them but keep them as strings
    df[c] = (df[c].str.strip()
            .str.replace('(', '', regex=False)
            .str.replace(')', '', regex=False)
            )
```

#### Using `str.contains` with some special characters
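`str.contains` treats its pattern as a regular expression by default, so use a backslash `\` to "escape" special characters such as `(`, and write the pattern as a raw string (`r"..."`) so Python passes the backslash through unchanged. [StackOverflow explanation](https://stackoverflow.com/questions/48699907/error-unbalanced-parenthesis-while-checking-if-an-item-presents-in-a-pandas-d)

`states_clean[states_clean.Fare_Revenues.str.contains(r"\(")]`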
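
A small sketch on made-up values shows two equivalent ways to look for a literal `(`: escaping it in a regex pattern, or turning regex matching off with `regex=False`.

```python
import pandas as pd

# Invented values; parentheses here mimic accounting-style negatives.
fares = pd.Series(["(1,200)", "3,400", "(50)"])

# Option 1: regex pattern with the parenthesis escaped (raw string).
has_paren_regex = fares.str.contains(r"\(")

# Option 2: plain substring match, no escaping needed.
has_paren_literal = fares.str.contains("(", regex=False)

print(fares[has_paren_regex])
```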


#### Merging
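If your merge result produces `col_x` and `col_y` columns, both dataframes share a column named `col` that was not used as a merge key. If that column holds the same values in both, add it to your list of merge columns with `on = ["col1", "col2"]`.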
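
To illustrate on toy dataframes (values invented), merging on only one of two shared columns produces the `_x` / `_y` suffixes, while listing both shared columns in `on` keeps a single copy:

```python
import pandas as pd

left = pd.DataFrame({
    "NTD ID": ["1", "2"],
    "State": ["CA", "NY"],
    "Mode": ["MB", "HR"],
})
right = pd.DataFrame({
    "NTD ID": ["1", "2"],
    "State": ["CA", "NY"],
    "vehicles": [10, 20],
})

# State exists in both but is not a merge key, so pandas keeps both copies.
narrow = pd.merge(left, right, on=["NTD ID"])
print(list(narrow.columns))

# Adding State to the keys keeps one State column.
wide = pd.merge(left, right, on=["NTD ID", "State"])
print(list(wide.columns))
```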

#### Use `isin` to filter by multiple values

```
keep_me = ["CA", "NY", "TX"]
df2 = df[df.State.isin(keep_me)]
```

In [51]:
merge1 = pd.merge(df, df2, on = ['Agency','City', 'State', 'Legacy NTD ID', 'NTD ID'],
    how = 'inner', validate = 'm:1')
# think about which df has the duplicates (left or right), and
# m:1, 1:m, 1:1, m:m
# the one with duplicates should have the m, the other should have the 1
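
A minimal sketch of what `validate` checks, using toy dataframes with invented values: when the left side repeats a key (one row per mode) and the right side has one row per key, `m:1` passes, while `1:1` raises a `MergeError`.

```python
import pandas as pd

# Left: one row per agency-mode, so the key repeats. Right: one row per agency.
metrics = pd.DataFrame({"NTD ID": ["1", "1", "2"], "Mode": ["MB", "DR", "MB"]})
vehicles = pd.DataFrame({"NTD ID": ["1", "2"], "vehicles": [10, 20]})

# m:1 matches the data, so the merge goes through.
ok = pd.merge(metrics, vehicles, on="NTD ID", how="inner", validate="m:1")

# 1:1 does not match (the left key repeats), so pandas raises an error.
try:
    pd.merge(metrics, vehicles, on="NTD ID", how="inner", validate="1:1")
    raised = False
except pd.errors.MergeError:
    raised = True

print(len(ok), raised)
```

Comparing `len(df)` with `df["NTD ID"].nunique()` for each dataframe is a quick way to see which side holds the repeated keys.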

In [52]:
keep_col = [
    'Agency', 'City', 'State', 'Legacy NTD ID', 'NTD ID',
    'Primary UZA\n Population', 'Mode', 'TOS', 'Total Service Vehicles',
    'Cost per\n Hour', 'Passenger Miles', 'Total Operating Expenses',
]

In [53]:
merge2=merge1[keep_col]

In [54]:
df.State.value_counts()

CA    436
TX    167
NC    160
WA    148
FL    144
MI    130
NY    128
GA    120
WI    114
OH    106
OR    106
KS    104
PA     96
IL     94
IN     90
CO     87
MN     87
VA     73
PR     70
NE     68
LA     60
IA     58
MT     57
MA     55
AZ     55
OK     51
KY     51
MD     50
AL     49
TN     47
NJ     46
MO     45
NM     42
SC     39
CT     38
WV     35
AK     34
ND     34
NV     34
ME     32
WY     32
MS     29
ID     28
SD     26
AR     26
VT     25
NH     20
UT     19
HI     12
DC      8
RI      5
DE      4
VI      2
GU      2
AS      2
Name: State, dtype: int64

In [55]:
keep_me = ['NY', 'CA', 'TX', 'ID', 'MS']

In [56]:
merge3 = merge2[merge2.State.isin(keep_me)]

In [57]:
merge3.dtypes

Agency                      object
City                        object
State                       object
Legacy NTD ID               object
NTD ID                      object
Primary UZA\n Population    object
Mode                        object
TOS                         object
Total Service Vehicles      object
Cost per\n Hour             object
Passenger Miles             object
Total Operating Expenses    object
dtype: object

In [58]:
merge3['Primary UZA\n Population'].value_counts()

0              222
12,150,996     134
18,351,295      35
3,281,212       32
5,121,892       25
              ... 
67,983           2
583,681          1
99,437           1
61,900           1
65,088           1
Name: Primary UZA\n Population, Length: 99, dtype: int64

In [22]:
merge3['Total Service Vehicles'].value_counts()

0      311
1       70
2       42
3       30
4       29
      ... 
183      1
825      1
156      1
54       1
25       1
Name: Total Service Vehicles, Length: 63, dtype: int64

In [46]:
#ignore. practice only
merge3 = (merge3
    # Assign is similar to R dplyr's mutate
    .assign(
        # Strip leading or trailing blanks (slightly different than replace)
        # Decide if you want to replace all blanks or just leading/trailing
        population = (merge3['Primary UZA\n Population'].str.strip()
                .str.replace('(', '').str.replace(')', '')
        ),
        # Do something similar for City as above
       vehicles = (merge3['Total Service Vehicles'].str.strip()
                .str.replace('(', '').str.replace(')', '')
        ),
        # Replace blanks with nothing
        State = merge3.State.str.replace(' ', '')
    )
)

AttributeError: 'tuple' object has no attribute 'assign'

In [63]:
# Remove the commas inside the numbers, then change the data type to int
merge3 = (merge3
    .assign(
        population = (merge3['Primary UZA\n Population'].str.strip()
                .str.replace(',', ''))
    ).astype({
        "population": int
    })
)

In [64]:
merge3.population.value_counts()

0           222
12150996    134
18351295     35
3281212      32
5121892      25
           ... 
67983         2
583681        1
99437         1
61900         1
65088         1
Name: population, Length: 99, dtype: int64

In [12]:
agg1=merge3.groupby(['State']).agg({'City':'count', 'Mode':'nunique'}).reset_index()

In [13]:
agg1

| | State | City | Mode |
|---|---|---|---|
| 0 | CA | 436 | 15 |
| 1 | ID | 28 | 3 |
| 2 | MS | 29 | 3 |
| 3 | NY | 128 | 10 |
| 4 | TX | 167 | 9 |


In [15]:
agg2=merge3.groupby(['State']).agg({'Total Operating Expenses':'sum'}).reset_index()

In [16]:
agg2

| | State | Total Operating Expenses |
|---|---|---|
| 0 | CA | $168,453,369 $25,666,876 $446,368,668 $1,209,7... |
| 1 | ID | $976,859 $10,915,400 $4,314,550 $1,581,946 $89... |
| 2 | MS | $458,218 $4,619,813 $1,465,804 $1,699,690 $2,0... |
| 3 | NY | $5,206,727,193 $242,520,835 $2,685,918,268 $51... |
| 4 | TX | $53,373,690 $11,785,438 $305,824,660 $5,891,47... |
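
The result above is strings glued together rather than a total, because `Total Operating Expenses` is still an `object` column and summing strings concatenates them. A sketch of one way to get a numeric sum, on made-up values and with a hypothetical new column name `operating_expenses`:

```python
import pandas as pd

# Toy data; the dollar amounts are invented.
toy = pd.DataFrame({
    "State": ["CA", "CA", "NY"],
    "Total Operating Expenses": ["$1,000", "$2,500", "$300"],
})

toy = toy.assign(
    # Drop "$" and "," as literal characters, then convert to numbers.
    operating_expenses=pd.to_numeric(
        toy["Total Operating Expenses"]
        .str.replace("$", "", regex=False)
        .str.replace(",", "", regex=False)
        .str.strip()
    )
)

agg = toy.groupby("State").agg({"operating_expenses": "sum"}).reset_index()
print(agg)
```

If the real column contains other non-numeric entries, `pd.to_numeric(..., errors="coerce")` would turn them into missing values instead of raising, which may or may not be what you want.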


In [65]:
agg3=merge3.groupby(['State']).agg({'population':'mean'}).reset_index()

In [66]:
agg3

| | State | population |
|---|---|---|
| 0 | CA | 4294451.0 |
| 1 | ID | 80011.93 |
| 2 | MS | 51397.1 |
| 3 | NY | 5141550.0 |
| 4 | TX | 1403881.0 |
