## Exercise 2: Merging and Deriving New Columns

Skills: 
* Merge 2 dataframes
* F-strings!
* Markdown cells
* Build on groupby/agg knowledge, derive new columns, exporting
* Practice committing on GitHub

References: 
* https://docs.calitp.org/data-infra/analytics_new_analysts/01-data-analysis-intro.html
* https://docs.calitp.org/data-infra/analytics_tools/saving_code.html

In [59]:
import pandas as pd


This exercise uses f-strings to build file paths. [Read more on this](https://realpython.com/python-f-strings/#f-strings-a-new-and-improved-way-to-format-strings-in-python).

Also, click on this Markdown cell to see how different formatting syntax works within Markdown. [Reference this](https://www.datacamp.com/community/tutorials/markdown-in-jupyter-notebook).

If you don't have access to Google Cloud Storage, change the path to pull from our truncated sample parquets stored in the repo.

We use [relative paths](https://towardsthecloud.com/get-relative-path-python) rather than absolute paths. Since we are in the `starter_kit` directory, we just need to go one level down into the `data` subfolder. To go one level up from `starter_kit`, use `../` and you'll end up in `data-analyses`.

```
FOLDER = "./data/"
FILE_NAME = "exercise_2_3_ntd_metrics_2019.parquet"
df = pd.read_parquet(f"{FOLDER}{FILE_NAME}")

```
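As a small illustration of how the f-string above assembles the path, and of the format specifiers f-strings also support, here is a minimal sketch. The `share` value is a made-up number used only to demonstrate formatting.

```python
FOLDER = "./data/"
FILE_NAME = "exercise_2_3_ntd_metrics_2019.parquet"

# Each pair of braces is replaced by the value of the variable it names.
local_path = f"{FOLDER}{FILE_NAME}"

# Format specifiers follow a colon: "," adds a thousands separator,
# ".1%" renders a fraction as a percentage with one decimal place.
n_rows = 3685
share = 0.19837  # hypothetical value, for illustration only
summary = f"{n_rows:,} rows, {share:.1%} kept"

print(local_path)
print(summary)
```

Keeping the folder and file name in separate variables means only `FILE_NAME` has to change when you read a second file from the same location.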

In [60]:
GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/bus_service_increase/"
FILE_NAME = "ntd_metrics_2019.csv"

df = pd.read_csv(f"{GCS_FILE_PATH}{FILE_NAME}")
df

Unnamed: 0,Agency,City,State,Legacy NTD ID,NTD ID,Organization Type,Reporter Type,Primary UZA\n Population,Agency VOMS,Mode,...,Passenger Miles Questionable,Vehicle Revenue Miles,Vehicle Revenue Miles Questionable,Any data questionable?,Unnamed: 39,Unnamed: 40,Unnamed: 41,1,Unnamed: 43,Unnamed: 44
0,MTA New York City Transit,New York,NY,2008,20008,"Subsidiary Unit of a Transit Agency, Reporting...",Full Reporter,18351295,10885,HR,...,,354616371,,No,,,,Hide questionable data tags,,
1,MTA New York City Transit,New York,NY,2008,20008,"Subsidiary Unit of a Transit Agency, Reporting...",Full Reporter,18351295,10885,CB,...,,9866807,,No,,,,Show questionable data tags,,
2,MTA New York City Transit,New York,NY,2008,20008,"Subsidiary Unit of a Transit Agency, Reporting...",Full Reporter,18351295,10885,MB,...,,86233591,,No,,2.0,,1,,2.0
3,MTA New York City Transit,New York,NY,2008,20008,"Subsidiary Unit of a Transit Agency, Reporting...",Full Reporter,18351295,10885,DR,...,,37759280,,No,,,,,,
4,MTA New York City Transit,New York,NY,2008,20008,"Subsidiary Unit of a Transit Agency, Reporting...",Full Reporter,18351295,10885,RB,...,,3382426,,No,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3680,,,,,,,,,,,...,,,,No,,,,,,
3681,,,,,,,,,,,...,,,,No,,,,,,
3682,,,,,,,,,,,...,,,,No,,,,,,
3683,,,,,,,,,,,...,,,,No,,,,,,


In [61]:
FILE_NAME = "ntd_vehicles_2019.csv"
df2 = pd.read_csv(f"{GCS_FILE_PATH}{FILE_NAME}")

df2

Unnamed: 0,Agency,City,State,Legacy NTD ID,NTD ID,Organization Type,Reporter Type,Primary UZA Population,Agency VOMS,Bus,...,Trucks And Other Rubber Tire Vehicles,Trucks And Other Rubber Tire Vehicles >= ULB,Steel Wheel Vehicles,Steel Wheel Vehicles >= ULB,Total Service Vehicles,Total Service Vehicles >= ULB,Unnamed: 95,Unnamed: 96,Unnamed: 97,1
0,MTA New York City Transit,New York,NY,2008,20008,"Subsidiary Unit of a Transit Agency, Reporting...",Full Reporter,18351295,10885,3104,...,1489,408,373,262,2297,818,,,,Hide Vehicles >= ULB
1,New Jersey Transit Corporation,Newark,NJ,2080,20080,Other Publicly-Owned or Privately Chartered Co...,Full Reporter,18351295,3645,1234,...,787,374,89,21,975,477,,,,Show Vehicles >= ULB
2,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,9154,90154,Independent Public Agency or Authority of Tran...,Full Reporter,12150996,3469,1962,...,934,384,9,4,1420,518,,,,1
3,Washington Metropolitan Area Transit Authority,Washington,DC,3030,30030,Independent Public Agency or Authority of Tran...,Full Reporter,4586770,3391,1607,...,1522,472,199,24,1938,571,,,,
4,"King County Department of Metro Transit, dba: ...",Seattle,WA,0001,00001,"City, County or Local Government Unit or Depar...",Full Reporter,3059393,3233,551,...,396,106,0,0,522,142,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2770,Partners in Prime,Hamilton,OH,,A0020-55641,Private-Non-Profit Corporation,Reduced Asset Reporter,0,0,0,...,3,1,0,0,16,3,,,,
2771,"Oxford Senior Citizens, Inc.",Oxford,OH,,A0020-55642,Private-Non-Profit Corporation,Reduced Asset Reporter,0,0,0,...,0,0,0,0,0,0,,,,
2772,"Marielders, Inc.",Cincinnati,OH,,A0020-55643,Private-Non-Profit Corporation,Reduced Asset Reporter,0,0,0,...,0,0,0,0,0,0,,,,
2773,Maple Knoll Communities,Cincinnati,OH,,A0020-55646,Private-Non-Profit Corporation,Reduced Asset Reporter,0,0,0,...,0,0,0,0,0,0,,,,


In [62]:
len(df)

3685

In [63]:
df["NTD ID"].nunique()

2183

In [64]:
len(df2)
df2["NTD ID"].nunique()

2775

### To do:

* Start with the `ntd_metrics_2019.csv` dataset.
* Merge in the `ntd_vehicles_2019.csv` dataset from the same location within the GCS bucket, but only keep a couple of columns.
* Print out what states there are using `value_counts`
* Subset and only keep the following states: NY, CA, TX, ID, MS
* Calculate some aggregate statistics grouping by states (the point of the exercise is to aggregate, less so on whether the stats make sense):
    * Include: sum, mean, count (of operators), nunique (of city)
    * Challenge: give a per capita measure, such as total service vehicles per 100,000 residents
* Plot the per capita measure across the 5 states (some states are very populous and some are not...per capita hopefully normalizes pop differences)

In [65]:
list(df.columns)

['Agency',
 'City',
 'State',
 'Legacy NTD ID',
 'NTD ID',
 'Organization Type',
 'Reporter Type',
 'Primary UZA\n Population',
 'Agency VOMS',
 'Mode',
 'TOS',
 'Mode VOMS',
 'Ratios:',
 'Fare Revenues per Unlinked Passenger Trip ',
 'Fare Revenues per Unlinked Passenger Trip Questionable',
 'Fare Revenues per Total Operating Expense (Recovery Ratio)',
 'Fare Revenues per Total Operating Expense (Recovery Ratio) Questionable',
 'Cost per\n Hour',
 'Cost per Hour Questionable',
 'Passengers per Hour',
 'Passengers per Hour Questionable',
 'Cost per Passenger',
 'Cost per Passenger Questionable',
 'Cost per Passenger Mile',
 'Cost per Passenger Mile Questionable',
 'Source Data:',
 'Fare Revenues Earned',
 'Fare Revenues Earned Questionable',
 'Total Operating Expenses',
 'Total Operating Expenses Questionable',
 'Unlinked Passenger Trips',
 'Unlinked Passenger Trips Questionable',
 'Vehicle Revenue Hours',
 'Vehicle Revenue Hours Questionable',
 'Passenger Miles',
 'Passenger Miles Que

In [66]:
df.Agency.value_counts()

Massachusetts Bay Transportation Authority                           9
New Jersey Transit Corporation                                       8
King County Department of Metro Transit, dba: King County Metro      8
Metropolitan Transit Authority of Harris County, Texas               8
Maryland Transit Administration                                      8
                                                                    ..
City of Shafter                                                      1
Class LTD                                                            1
City of Onalaska, dba: Onalaska Shared Ride Taxi City of Onalaska    1
City of Monrovia                                                     1
City of Needles, dba: Needles Area Transit                           1
Name: Agency, Length: 2169, dtype: int64

### Step by Step

Keep only the columns you need.

* `df` is `ntd_metrics` and `df2` is `ntd_vehicles`
* For both dfs, keep `Agency`, `City`, `State`, `Legacy NTD ID`, `NTD ID`
* For `ntd_metrics`, also keep `Primary UZA\n Population`, `Mode`, `TOS`
* For `ntd_vehicles`, also keep `Total Service Vehicles`

In [67]:
ntd_metrics_select = df.filter(['Agency','City','State','Legacy NTD ID','NTD ID','Primary UZA\n Population','Mode','TOS'], axis=1)
ntd_vehicles_select = df2.filter(['Agency','City','State','Legacy NTD ID','NTD ID','Total Service Vehicles'], axis=1)



Rename columns for both dataframes.
* replace spaces with underscores
* lowercase letters

`df.columns = df.columns.str.strip().str.replace(' ', '_').str.lower()` 

`df = df.rename(columns = {'old_name': 'new_name'})`
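Note that replacing only the space character leaves the embedded newline in `Primary UZA\n Population` untouched. One possible variant, sketched here on an empty toy dataframe, replaces every run of whitespace with a single underscore. This is an alternative to the pattern above, not what the cells below use, so the resulting column name differs from the one that appears later in this notebook.

```python
import pandas as pd

# Toy dataframe with column names shaped like the NTD ones.
toy = pd.DataFrame(
    columns=["Agency", "NTD ID", "Primary UZA\n Population", " Total Service Vehicles "]
)

toy.columns = (
    toy.columns
    .str.strip()                           # drop leading/trailing whitespace
    .str.replace(r"\s+", "_", regex=True)  # spaces, tabs, newlines -> one underscore
    .str.lower()
)

# rename changes individual columns by name.
toy = toy.rename(columns={"primary_uza_population": "population"})

print(list(toy.columns))
```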

In [68]:
ntd_metrics_select.columns = ntd_metrics_select.columns.str.strip().str.replace(' ', '_').str.lower()
ntd_vehicles_select.columns = ntd_vehicles_select.columns.str.strip().str.replace(' ', '_').str.lower()

Basic checks for any given dataframe.

* Check data types for columns: `df.dtypes`
* Get df's info: `df.info()`
* Get df's dimensions: `df.shape`
* Get df's length (number of rows): `len(df)`
* Summary stats for columns: `df.describe()`
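A notebook cell displays only the value of its last bare expression (plus anything that prints itself, such as `info()`), so stacking several checks in one cell hides most of the results. A minimal sketch on a toy dataframe, printing each check explicitly:

```python
import pandas as pd

# Toy data: the vehicles column is stored as text, as in the raw CSV.
toy = pd.DataFrame({
    "agency": ["Agency A", "Agency B", None],
    "vehicles": ["1,200", "15", "0"],
})

print(toy.dtypes)                   # data type of each column
print(toy.shape)                    # (rows, columns)
print(len(toy))                     # number of rows
toy.info()                          # info() prints on its own
print(toy.describe(include="all"))  # include="all" also summarizes text columns
```

Alternatively, put each check in its own cell so every result is displayed.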

In [69]:

ntd_vehicles_select.dtypes
ntd_metrics_select.dtypes

ntd_vehicles_select.info()
ntd_metrics_select.info()

ntd_metrics_select.shape
ntd_vehicles_select.shape

len(ntd_metrics_select)
len(ntd_vehicles_select)


ntd_metrics_select.describe()
ntd_vehicles_select.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2775 entries, 0 to 2774
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   agency                  2775 non-null   object
 1   city                    2775 non-null   object
 2   state                   2775 non-null   object
 3   legacy_ntd_id           1987 non-null   object
 4   ntd_id                  2775 non-null   object
 5   total_service_vehicles  2775 non-null   object
dtypes: object(6)
memory usage: 130.2+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3685 entries, 0 to 3684
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   agency                   3680 non-null   object
 1   city                     3680 non-null   object
 2   state                    3680 non-null   object
 3   legacy_ntd_id            3394 non-null   object
 4   ntd_id 

Unnamed: 0,agency,city,state,legacy_ntd_id,ntd_id,total_service_vehicles
count,2775,2775,2775,1987,2775,2775
unique,2761,2007,55,1985,2775,118
top,Union County Transit,Alhambra,CA,8R05-010,20008,0
freq,2,12,224,2,1,1842


Make a plan to clean columns.

* Add a Markdown cell
* Jot down which columns should be numeric, but are not.
* If the data type is `object`, it's string. If it's `float64` or `int64`, it's numeric.
* For the columns that should be numeric, do so. Use `assign` to create new columns and overwrite the existing columns.
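The conversion step could look like the following sketch, which uses a toy dataframe with a hypothetical `population` column stored as text with thousands separators. It strips the commas first, then converts, all inside `assign`.

```python
import pandas as pd

toy = pd.DataFrame({
    "agency": ["Agency A", "Agency B", "Agency C"],
    "population": ["18,351,295", "594,962", "0"],
})

toy = toy.assign(
    # Overwrite the text column with its numeric version.
    population=lambda x: pd.to_numeric(
        x["population"].str.replace(",", "", regex=False),
        errors="coerce",  # anything still unparseable becomes NaN
    )
)

print(toy.dtypes)
```

Removing the commas before calling `pd.to_numeric` matters: with `errors="coerce"`, a value such as `"594,962"` would otherwise silently become `NaN`.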

## `primary_uza\n_population` and `total_service_vehicles` should be numeric, but both are `object` (string) columns

In [70]:
ntd_metrics_select['primary_uza\n_population'] = ntd_metrics_select['primary_uza\n_population'].str.replace(',', '')

In [83]:
ntd_metrics_select=ntd_metrics_select.assign(n_population=lambda x: x[['primary_uza\n_population']].apply(pd.to_numeric))
ntd_metrics_select

Unnamed: 0,agency,city,state,legacy_ntd_id,ntd_id,primary_uza\n_population,mode,tos,n_population
0,MTA New York City Transit,New York,NY,2008,20008,18351295,HR,DO,18351295.0
1,MTA New York City Transit,New York,NY,2008,20008,18351295,CB,DO,18351295.0
2,MTA New York City Transit,New York,NY,2008,20008,18351295,MB,DO,18351295.0
3,MTA New York City Transit,New York,NY,2008,20008,18351295,DR,PT,18351295.0
4,MTA New York City Transit,New York,NY,2008,20008,18351295,RB,DO,18351295.0
...,...,...,...,...,...,...,...,...,...
3680,,,,,,,,,
3681,,,,,,,,,
3682,,,,,,,,,
3683,,,,,,,,,


Merge the 2 dataframes.
* set the validate parameter: `validate = "m:1"`. Choose from "m:1", "1:1", or "1:m"
* Put `ntd_metrics` on the left and `ntd_vehicles` on the right. What is the validate parameter?
* Put `ntd_vehicles` on the left and `ntd_metrics` on the right. What is the validate parameter?
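To see what `validate` checks, here is a minimal sketch on two toy dataframes in which the left table has several rows per `ntd_id` and the right table has exactly one. The data values are made up.

```python
import pandas as pd

# Several rows per agency (one per mode) vs. one row per agency.
metrics = pd.DataFrame({"ntd_id": ["1", "1", "2"], "mode": ["MB", "HR", "MB"]})
vehicles = pd.DataFrame({"ntd_id": ["1", "2"], "total_service_vehicles": [10, 3]})

# many (left) to one (right)
many_to_one = pd.merge(metrics, vehicles, on="ntd_id", how="left", validate="m:1")

# Swap the tables and the relationship flips to one-to-many.
one_to_many = pd.merge(vehicles, metrics, on="ntd_id", how="left", validate="1:m")

# Claiming 1:1 is wrong here, so pandas raises instead of merging silently.
try:
    pd.merge(metrics, vehicles, on="ntd_id", how="left", validate="1:1")
    one_to_one_ok = True
except pd.errors.MergeError:
    one_to_one_ok = False

print(len(many_to_one), len(one_to_many), one_to_one_ok)
```

Setting `validate` is a cheap way to catch unexpected duplicate keys before they multiply rows in the merged result.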


In [85]:
merge1 = pd.merge(
    ntd_metrics_select,
    ntd_vehicles_select,
    on = 'ntd_id',
    how = 'left',
    validate = 'm:1'
)

merge2 = pd.merge(
    ntd_vehicles_select,
    ntd_metrics_select,
    on = 'ntd_id',
    how = 'left',
    validate = '1:m'
)

merge1

Unnamed: 0,agency_x,city_x,state_x,legacy_ntd_id_x,ntd_id,primary_uza\n_population,mode,tos,n_population,agency_y,city_y,state_y,legacy_ntd_id_y,total_service_vehicles
0,MTA New York City Transit,New York,NY,2008,20008,18351295,HR,DO,18351295.0,MTA New York City Transit,New York,NY,2008,2297
1,MTA New York City Transit,New York,NY,2008,20008,18351295,CB,DO,18351295.0,MTA New York City Transit,New York,NY,2008,2297
2,MTA New York City Transit,New York,NY,2008,20008,18351295,MB,DO,18351295.0,MTA New York City Transit,New York,NY,2008,2297
3,MTA New York City Transit,New York,NY,2008,20008,18351295,DR,PT,18351295.0,MTA New York City Transit,New York,NY,2008,2297
4,MTA New York City Transit,New York,NY,2008,20008,18351295,RB,DO,18351295.0,MTA New York City Transit,New York,NY,2008,2297
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3680,,,,,,,,,,,,,,
3681,,,,,,,,,,,,,,
3682,,,,,,,,,,,,,,
3683,,,,,,,,,,,,,,


Play with merges:
* set `indicator=True`
* adjust the `how` parameter: how = 'inner', 'left', 'right', 'outer'
* look at the merge results: `df._merge.value_counts()`
* what's changing? What merge results appear when it's an `inner` join vs `left` join vs `right` join vs `outer` join?
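The comparison can be sketched on toy data by looping over the four join types and collecting the `_merge` counts for each. The IDs and values below are made up.

```python
import pandas as pd

left = pd.DataFrame({"ntd_id": ["1", "2", "3"], "mode": ["MB", "HR", "DR"]})
right = pd.DataFrame({
    "ntd_id": ["2", "3", "4", "5"],
    "total_service_vehicles": [5, 0, 7, 1],
})

merge_counts = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(left, right, on="ntd_id", how=how, indicator=True)
    # _merge records whether each row matched in both tables or only one.
    merge_counts[how] = merged["_merge"].value_counts().to_dict()

for how, counts in merge_counts.items():
    print(how, counts)
```

An `inner` join keeps only `both`; `left` adds `left_only` rows, `right` adds `right_only` rows, and `outer` keeps all three groups.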

In [54]:
merge3 = pd.merge(
    ntd_metrics_select,
    ntd_vehicles_select,
    on = 'ntd_id',
    how = 'left',
    sort=True,
    indicator = True,
    validate = 'm:1'
)

merge4 = pd.merge(
    ntd_metrics_select,
    ntd_vehicles_select,
    on = 'ntd_id',
    how = 'outer',
    sort=True,
    indicator = True,
    validate = 'm:1'
)


merge3._merge.value_counts()
merge4._merge.value_counts()

both          3680
right_only     592
left_only        5
Name: _merge, dtype: int64

### Helpful Hints and Best Practices

* Start with a comprehensive approach: write down all the lines of code needed to clean the data.
* Once the data cleaning process is done, work on refining the code and tidying it to see what steps can be chained together, what steps are done repeatedly (use a function!), etc.

#### Chaining
Similar to **piping** in R, where you can pipe multiple operations together with `%>%`, you can chain methods in Python. There is also a `df.pipe` function, but that's slightly different.

Make use of parentheses to do this. Also, use `df.assign` (see below) so you don't run into the `SettingWithCopyWarning`, which may pop up if you decide to subset your data. 
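A minimal sketch of a chain wrapped in parentheses, using a toy dataframe with made-up values. Passing a `lambda` to `assign` makes each step operate on the dataframe as it exists at that point in the chain, rather than on an outer variable.

```python
import pandas as pd

raw = pd.DataFrame({
    "Agency": [" Agency A ", "Agency B", "Agency C"],
    "State": ["CA", "NY ", "CA"],
    "Population": ["1,000", "2,500", "500"],
})

clean = (
    raw
    .assign(
        Agency=lambda x: x["Agency"].str.strip(),
        State=lambda x: x["State"].str.strip(),
        Population=lambda x: pd.to_numeric(
            x["Population"].str.replace(",", "", regex=False)
        ),
    )
    .query("State == 'CA'")   # subset rows after cleaning
    .reset_index(drop=True)
)

print(clean)
```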

#### Assign 
You can also create new columns in place with `df["new_col"] = ...`, and the `SettingWithCopyWarning` that sometimes comes up is mostly harmless. But `assign` also lets you chain more operations afterward. [More clarification.](https://pythonguides.com/add-a-column-to-a-dataframe-in-python-pandas/)
```
states_clean = (states_clean
    # Assign is similar to R dplyr's mutate
    .assign(
        # Strip leading or trailing blanks (slightly different than replace)
        # Decide if you want to replace all blanks or just leading/trailing
        Agency = (states_clean.Agency.str.strip()
                # regex=False treats the parentheses as literal characters
                .str.replace('(', '', regex=False)
                .str.replace(')', '', regex=False)
        ),
        # Do something similar for City as above
        City = states_clean.City.str.strip(),
        # Replace blanks with nothing
        State = states_clean.State.str.replace(' ', '')
    ).astype({
        "Population": int, 
        "Fare_Revenues": int,
    })
)
```

Alternatively, try it with a loop:

```
for c in ["Agency", "City"]:
    # Text columns: strip whitespace and drop literal parentheses
    df[c] = (df[c].str.strip()
            .str.replace('(', '', regex=False)
            .str.replace(')', '', regex=False)
            )

#### Using `str.contains` with some special characters
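Use a backslash `\` to "escape" special characters, and write the pattern as a raw string (`r"..."`) so that Python passes the backslash through unchanged. [StackOverflow explanation](https://stackoverflow.com/questions/48699907/error-unbalanced-parenthesis-while-checking-if-an-item-presents-in-a-pandas-d)

`states_clean[states_clean.Fare_Revenues.str.contains(r"\(")]`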
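A short sketch of the two ways to match a literal parenthesis, on a toy Series with made-up values: escape it inside a regular expression, or switch regular expressions off with `regex=False`.

```python
import pandas as pd

fares = pd.Series(["(1,200)", "3,400", "(15)"])

# Option 1: regex pattern, with the parenthesis escaped in a raw string.
has_paren_regex = fares.str.contains(r"\(")

# Option 2: plain substring search, no escaping needed.
has_paren_literal = fares.str.contains("(", regex=False)

print(fares[has_paren_regex])
```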


#### Merging
If your merge result produces a `col_x` and `col_y`, the two dataframes share a column name that was not used as a merge key. Add those columns to your list of merge columns, with `on = ["col1", "col2"]`.
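The effect can be sketched with two toy dataframes that share `agency` and `state` in addition to the key. The values are made up, and the sketch assumes the shared columns hold identical values in both tables; if they might differ, dropping them from one table before merging is a safer alternative.

```python
import pandas as pd

metrics = pd.DataFrame({
    "ntd_id": ["1", "1"],
    "agency": ["Agency A", "Agency A"],
    "state": ["CA", "CA"],
    "mode": ["MB", "DR"],
})
vehicles = pd.DataFrame({
    "ntd_id": ["1"],
    "agency": ["Agency A"],
    "state": ["CA"],
    "total_service_vehicles": [12],
})

# Merging on the ID alone duplicates the shared columns with suffixes.
suffixed = pd.merge(metrics, vehicles, on="ntd_id", how="left")

# Listing every shared column as a key keeps a single copy of each.
shared_cols = ["ntd_id", "agency", "state"]
tidy = pd.merge(metrics, vehicles, on=shared_cols, how="left", validate="m:1")

print(list(suffixed.columns))
print(list(tidy.columns))
```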

#### Use `isin` to filter by multiple conditions

```
keep_me = ["CA", "NY", "TX"]
df2 = df[df.State.isin(keep_me)]
```

Subset rows to just the states listed above. (The cell below keeps three of the five states: CA, NY, and TX.)

In [89]:
keep_me = ["CA", "NY", "TX"]
merge1 = merge1[merge1.state_x.isin(keep_me)]
merge1

Unnamed: 0,agency_x,city_x,state_x,legacy_ntd_id_x,ntd_id,primary_uza\n_population,mode,tos,n_population,agency_y,city_y,state_y,legacy_ntd_id_y,total_service_vehicles
0,MTA New York City Transit,New York,NY,2008,20008,18351295,HR,DO,18351295.0,MTA New York City Transit,New York,NY,2008,2297
1,MTA New York City Transit,New York,NY,2008,20008,18351295,CB,DO,18351295.0,MTA New York City Transit,New York,NY,2008,2297
2,MTA New York City Transit,New York,NY,2008,20008,18351295,MB,DO,18351295.0,MTA New York City Transit,New York,NY,2008,2297
3,MTA New York City Transit,New York,NY,2008,20008,18351295,DR,PT,18351295.0,MTA New York City Transit,New York,NY,2008,2297
4,MTA New York City Transit,New York,NY,2008,20008,18351295,RB,DO,18351295.0,MTA New York City Transit,New York,NY,2008,2297
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3649,Sullivan County Transportation,Monticello,NY,2R02-042,2R02-20937,0,MB,DO,0.0,Sullivan County Transportation,Monticello,NY,2R02-042,0
3664,City of Mechanicville,Mechanicville,NY,2213,20213,594962,MB,DO,594962.0,City of Mechanicville,Mechanicville,NY,2213,0
3669,Los Angeles County Dept. of Public Works - Ath...,Alhambra,CA,,90269,12150996,MB,PT,12150996.0,Los Angeles County Dept. of Public Works - Ath...,Alhambra,CA,,0
3678,"City of Needles, dba: Needles Area Transit",Needles,CA,9R02-063,9R02-91020,0,MB,PT,0.0,"City of Needles, dba: Needles Area Transit",Needles,CA,9R02-063,0


Add a new column with this metric: `service_vehicles_per_capita` (service vehicles divided by population)

* [Read more on rate metrics](https://oag.ca.gov/sites/all/files/agweb/pdfs/cjsc/stats/computational_formulas.pdf)
* `df[new_column] = df[numerator_col]/df[denominator_col]`
* `df[new_column] = df[numerator_col].divide(df[denominator_col])`
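As a worked sketch of a rate metric, the toy example below computes vehicles per capita and then scales it to vehicles per 100,000 residents. The state codes and numbers are made up. A population of zero is replaced with `NaN` first so the division yields a missing value rather than infinity.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "state": ["AA", "BB", "CC"],
    "total_service_vehicles": [50.0, 12.0, 3.0],
    "population": [1_000_000.0, 200_000.0, 0.0],
})

toy = toy.assign(
    # Treat a zero population as missing to avoid dividing by zero.
    population=lambda x: x["population"].replace(0, np.nan),
    per_capita=lambda x: x["total_service_vehicles"].divide(x["population"]),
    # per 100k = (numerator / denominator) * 100,000
    per_100k=lambda x: x["per_capita"] * 100_000,
)

print(toy)
```

The per-100k rate is the per-capita rate multiplied by 100,000; dividing the vehicle count by 100,000 alone would ignore population entirely.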

In [104]:
merge1.info()

In [106]:
merge1['total_service_vehicles'] = merge1['total_service_vehicles'].apply(pd.to_numeric, errors = 'coerce').fillna(0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  merge1['total_service_vehicles'] = merge1['total_service_vehicles'].apply(pd.to_numeric, errors = 'coerce').fillna(0)
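Two things are worth checking at this step. First, the warning above appears because `merge1` was created by subsetting another dataframe; taking an explicit `.copy()` when subsetting is one common way to avoid it. Second, the output further down shows `0.0` service vehicles for MTA New York City Transit, whose value was displayed as 2297 before conversion. That is consistent with the raw strings containing thousands separators, which `errors = 'coerce'` turns into `NaN` and `fillna(0)` then turns into 0. The sketch below assumes that comma formatting (inspect the raw values to confirm) and uses a toy dataframe.

```python
import pandas as pd

merged = pd.DataFrame({
    "state_x": ["NY", "CA", "WA"],
    # Assumed raw formatting: text with thousands separators.
    "total_service_vehicles": ["2,297", "1,420", "522"],
})

# .copy() makes the subset an independent dataframe, so later
# column assignments do not trigger SettingWithCopyWarning.
subset = merged[merged["state_x"].isin(["CA", "NY", "TX"])].copy()

# Coercing without removing commas silently turns these values into 0.
naive = pd.to_numeric(subset["total_service_vehicles"], errors="coerce").fillna(0)

# Removing the commas first preserves the real values.
subset["total_service_vehicles"] = pd.to_numeric(
    subset["total_service_vehicles"].str.replace(",", "", regex=False),
    errors="coerce",
)

print(naive.tolist())
print(subset["total_service_vehicles"].tolist())
```

If the commas are indeed present in the real data, every downstream average and rate computed from the coerced column would be understated.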


In [108]:
merge1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 731 entries, 0 to 3679
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   agency_x                 731 non-null    object 
 1   city_x                   731 non-null    object 
 2   state_x                  731 non-null    object 
 3   legacy_ntd_id_x          619 non-null    object 
 4   ntd_id                   731 non-null    object 
 5   primary_uza
_population  731 non-null    object 
 6   mode                     731 non-null    object 
 7   tos                      731 non-null    object 
 8   n_population             731 non-null    float64
 9   agency_y                 731 non-null    object 
 10  city_y                   731 non-null    object 
 11  state_y                  731 non-null    object 
 12  legacy_ntd_id_y          619 non-null    object 
 13  total_service_vehicles   731 non-null    float64
dtypes: float64(2), object(12)

In [110]:
merge1['service_vehicles_per_capita'] = merge1['total_service_vehicles']/merge1['n_population']
merge1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  merge1['service_vehicles_per_capita'] = merge1['total_service_vehicles']/merge1['n_population']


Unnamed: 0,agency_x,city_x,state_x,legacy_ntd_id_x,ntd_id,primary_uza\n_population,mode,tos,n_population,agency_y,city_y,state_y,legacy_ntd_id_y,total_service_vehicles,service_vehicles_per_capita
0,MTA New York City Transit,New York,NY,2008,20008,18351295,HR,DO,18351295.0,MTA New York City Transit,New York,NY,2008,0.0,0.0
1,MTA New York City Transit,New York,NY,2008,20008,18351295,CB,DO,18351295.0,MTA New York City Transit,New York,NY,2008,0.0,0.0
2,MTA New York City Transit,New York,NY,2008,20008,18351295,MB,DO,18351295.0,MTA New York City Transit,New York,NY,2008,0.0,0.0
3,MTA New York City Transit,New York,NY,2008,20008,18351295,DR,PT,18351295.0,MTA New York City Transit,New York,NY,2008,0.0,0.0
4,MTA New York City Transit,New York,NY,2008,20008,18351295,RB,DO,18351295.0,MTA New York City Transit,New York,NY,2008,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3649,Sullivan County Transportation,Monticello,NY,2R02-042,2R02-20937,0,MB,DO,0.0,Sullivan County Transportation,Monticello,NY,2R02-042,0.0,
3664,City of Mechanicville,Mechanicville,NY,2213,20213,594962,MB,DO,594962.0,City of Mechanicville,Mechanicville,NY,2213,0.0,0.0
3669,Los Angeles County Dept. of Public Works - Ath...,Alhambra,CA,,90269,12150996,MB,PT,12150996.0,Los Angeles County Dept. of Public Works - Ath...,Alhambra,CA,,0.0,0.0
3678,"City of Needles, dba: Needles Area Transit",Needles,CA,9R02-063,9R02-91020,0,MB,PT,0.0,"City of Needles, dba: Needles Area Transit",Needles,CA,9R02-063,0.0,


Add a new column that is `service_vehicles_per_100k`

In [112]:
# Rate per 100,000 residents: (vehicles / population) * 100,000
merge1['service_vehicles_per_100K'] = merge1['total_service_vehicles'] / merge1['n_population'] * 100000
merge1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  merge1['service_vehicles_per_100K'] = merge1['total_service_vehicles'] / merge1['n_population'] * 100000


Unnamed: 0,agency_x,city_x,state_x,legacy_ntd_id_x,ntd_id,primary_uza\n_population,mode,tos,n_population,agency_y,city_y,state_y,legacy_ntd_id_y,total_service_vehicles,service_vehicles_per_capita,service_vehicles_per_100K
0,MTA New York City Transit,New York,NY,2008,20008,18351295,HR,DO,18351295.0,MTA New York City Transit,New York,NY,2008,0.0,0.0,0.0
1,MTA New York City Transit,New York,NY,2008,20008,18351295,CB,DO,18351295.0,MTA New York City Transit,New York,NY,2008,0.0,0.0,0.0
2,MTA New York City Transit,New York,NY,2008,20008,18351295,MB,DO,18351295.0,MTA New York City Transit,New York,NY,2008,0.0,0.0,0.0
3,MTA New York City Transit,New York,NY,2008,20008,18351295,DR,PT,18351295.0,MTA New York City Transit,New York,NY,2008,0.0,0.0,0.0
4,MTA New York City Transit,New York,NY,2008,20008,18351295,RB,DO,18351295.0,MTA New York City Transit,New York,NY,2008,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3649,Sullivan County Transportation,Monticello,NY,2R02-042,2R02-20937,0,MB,DO,0.0,Sullivan County Transportation,Monticello,NY,2R02-042,0.0,,
3664,City of Mechanicville,Mechanicville,NY,2213,20213,594962,MB,DO,594962.0,City of Mechanicville,Mechanicville,NY,2213,0.0,0.0,0.0
3669,Los Angeles County Dept. of Public Works - Ath...,Alhambra,CA,,90269,12150996,MB,PT,12150996.0,Los Angeles County Dept. of Public Works - Ath...,Alhambra,CA,,0.0,0.0,0.0
3678,"City of Needles, dba: Needles Area Transit",Needles,CA,9R02-063,9R02-91020,0,MB,PT,0.0,"City of Needles, dba: Needles Area Transit",Needles,CA,9R02-063,0.0,,


Do a group by and aggregate.

* Group by state, count the number of agencies and find average of total service vehicles.
* Write a sentence to explain the result
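One way to structure the aggregation is named aggregation, sketched here on a toy dataframe with made-up values. It also shows the difference between `count`, which tallies rows, and `nunique`, which tallies distinct agencies when an agency appears once per mode.

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["CA", "CA", "CA", "NY", "NY"],
    "ntd_id": ["1", "1", "2", "3", "3"],  # agency 1 and agency 3 run two modes
    "city": ["Alhambra", "Alhambra", "Needles", "New York", "New York"],
    "total_service_vehicles": [10.0, 10.0, 4.0, 30.0, 30.0],
})

by_state = (
    toy.groupby("state")
    .agg(
        n_rows=("ntd_id", "count"),        # rows, i.e. agency-mode records
        n_agencies=("ntd_id", "nunique"),  # distinct agencies
        n_cities=("city", "nunique"),
        mean_vehicles=("total_service_vehicles", "mean"),
    )
    .reset_index()
)

print(by_state)
```

Because the vehicle count repeats on every mode row of the same agency, a mean taken over rows weights multi-mode agencies more heavily; deduplicating on `ntd_id` before aggregating is one way to get a per-agency average.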

In [114]:
pivot = (merge1.groupby(['state_x'])
         .agg({'agency_x':'count',
               'total_service_vehicles':'mean'}
             ).reset_index()
        )
pivot

Unnamed: 0,state_x,agency_x,total_service_vehicles
0,CA,436,29.220183
1,NY,128,40.609375
2,TX,167,57.245509


#### Within California, the data contain 436 agency records, which average about 29 service vehicles each. Similarly, New York and Texas have 128 and 167 records respectively, averaging about 41 and 57 service vehicles. Note that `count` tallies rows, so an agency that operates several modes is counted more than once.
