## Exercise 2: Merging and Deriving New Columns

Skills: 
* Merge 2 dataframes
* F-strings!
* Markdown cells
* Build on groupby/agg knowledge, derive new columns, exporting
* Practice committing on GitHub

References: 
* https://docs.calitp.org/data-infra/analytics_new_analysts/01-data-analysis-intro.html
* https://docs.calitp.org/data-infra/analytics_tools/saving_code.html

In [1]:
import pandas as pd

Use of f-strings. [Read more on this](https://realpython.com/python-f-strings/#f-strings-a-new-and-improved-way-to-format-strings-in-python).
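As a quick illustration of what an f-string does, here is a minimal sketch. The folder name, file name, and numbers are placeholders made up for this example, not files in the repo.

```python
# Hypothetical folder and file names, used only to illustrate f-strings.
folder = "./data/"
file_name = "example_file.parquet"

# Expressions inside the braces are evaluated and inserted into the string.
file_path = f"{folder}{file_name}"
print(file_path)

# Format specifiers go after a colon, e.g. thousands separators or percentages.
n_rows = 3685
share = 0.4567
message = f"The table has {n_rows:,} rows; {share:.1%} of them are in scope."
print(message)
```

Building the path from two variables means only `file_name` has to change when you read a second file from the same folder.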

Also, click on this Markdown cell and see how to do different formatting syntax within Markdown. [Reference this](https://www.datacamp.com/community/tutorials/markdown-in-jupyter-notebook).

If you don't have access to Google Cloud Storage, change the path to pull from our truncated sample parquets stored in the repo.

We use [relative paths](https://towardsthecloud.com/get-relative-path-python) rather than absolute paths. Since we are in the `starter_kit` directory, we only need to go one level down, into the `data` subfolder. To go one level up from `starter_kit`, use `../`, and you'll end up in `data-analyses`.

If so, replace the cell below with:
```
FOLDER = "./data/"
FILE_NAME = "exercise_2_3_ntd_metrics_2019.parquet"
df = pd.read_parquet(f"{FOLDER}{FILE_NAME}")

```

In [2]:
GCS_FILE_PATH = "../data/"
FILE_NAME = "exercise_2_3_ntd_metrics_2019.parquet"

df = pd.read_parquet(f"{GCS_FILE_PATH}{FILE_NAME}")
df.head(2)

Unnamed: 0,Agency,City,State,Legacy NTD ID,NTD ID,Organization Type,Reporter Type,Primary UZA\n Population,Agency VOMS,Mode,...,Passenger Miles Questionable,Vehicle Revenue Miles,Vehicle Revenue Miles Questionable,Any data questionable?,Unnamed: 39,Unnamed: 40,Unnamed: 41,1,Unnamed: 43,Unnamed: 44
0,MTA New York City Transit,New York,NY,2008,20008,"Subsidiary Unit of a Transit Agency, Reporting...",Full Reporter,18351295,10885,HR,...,,354616371,,No,,,,Hide questionable data tags,,
1,MTA New York City Transit,New York,NY,2008,20008,"Subsidiary Unit of a Transit Agency, Reporting...",Full Reporter,18351295,10885,CB,...,,9866807,,No,,,,Show questionable data tags,,


In [3]:
FILE_NAME = "exercise_2_ntd_vehicles_2019.parquet"
df2 = pd.read_parquet(f"{GCS_FILE_PATH}{FILE_NAME}")

df2.head(2)

Unnamed: 0,Agency,City,State,Legacy NTD ID,NTD ID,Organization Type,Reporter Type,Primary UZA Population,Agency VOMS,Bus,...,Trucks And Other Rubber Tire Vehicles,Trucks And Other Rubber Tire Vehicles >= ULB,Steel Wheel Vehicles,Steel Wheel Vehicles >= ULB,Total Service Vehicles,Total Service Vehicles >= ULB,Unnamed: 95,Unnamed: 96,Unnamed: 97,1
0,MTA New York City Transit,New York,NY,2008,20008,"Subsidiary Unit of a Transit Agency, Reporting...",Full Reporter,18351295,10885,3104,...,1489,408,373,262,2297,818,,,,Hide Vehicles >= ULB
1,New Jersey Transit Corporation,Newark,NJ,2080,20080,Other Publicly-Owned or Privately Chartered Co...,Full Reporter,18351295,3645,1234,...,787,374,89,21,975,477,,,,Show Vehicles >= ULB


In [4]:
len(df)

3685

In [5]:
df["NTD ID"].nunique()

2183

In [6]:
len(df2)
df2["NTD ID"].nunique()

2775

### To do:

* Start with the `ntd_metrics_2019` dataset (loaded above as `df`).
* Merge in the `ntd_vehicles_2019` dataset (loaded above as `df2`) from the same location within the GCS bucket, but only keep a couple of columns.
* Print out what states there are using `value_counts`
* Subset and only keep the following states: NY, CA, TX, ID, MS
* Calculate some aggregate statistics grouping by states (the point of the exercise is to aggregate, less so on whether the stats make sense):
    * Include: sum, mean, count (of operators), nunique (of city)
    * Challenge: give a per capita measure, such as total service vehicles per 100,000 residents
* Plot the per capita measure across the 5 states (some states are very populous and some are not...per capita hopefully normalizes pop differences)

In [7]:
list(df.columns)

['Agency',
 'City',
 'State',
 'Legacy NTD ID',
 'NTD ID',
 'Organization Type',
 'Reporter Type',
 'Primary UZA\n Population',
 'Agency VOMS',
 'Mode',
 'TOS',
 'Mode VOMS',
 'Ratios:',
 'Fare Revenues per Unlinked Passenger Trip ',
 'Fare Revenues per Unlinked Passenger Trip Questionable',
 'Fare Revenues per Total Operating Expense (Recovery Ratio)',
 'Fare Revenues per Total Operating Expense (Recovery Ratio) Questionable',
 'Cost per\n Hour',
 'Cost per Hour Questionable',
 'Passengers per Hour',
 'Passengers per Hour Questionable',
 'Cost per Passenger',
 'Cost per Passenger Questionable',
 'Cost per Passenger Mile',
 'Cost per Passenger Mile Questionable',
 'Source Data:',
 'Fare Revenues Earned',
 'Fare Revenues Earned Questionable',
 'Total Operating Expenses',
 'Total Operating Expenses Questionable',
 'Unlinked Passenger Trips',
 'Unlinked Passenger Trips Questionable',
 'Vehicle Revenue Hours',
 'Vehicle Revenue Hours Questionable',
 'Passenger Miles',
 'Passenger Miles Que

In [8]:
df.Agency.value_counts()

Massachusetts Bay Transportation Authority                           9
New Jersey Transit Corporation                                       8
King County Department of Metro Transit, dba: King County Metro      8
Metropolitan Transit Authority of Harris County, Texas               8
Maryland Transit Administration                                      8
                                                                    ..
City of Shafter                                                      1
Class LTD                                                            1
City of Onalaska, dba: Onalaska Shared Ride Taxi City of Onalaska    1
City of Monrovia                                                     1
City of Needles, dba: Needles Area Transit                           1
Name: Agency, Length: 2169, dtype: int64

### Step by Step

Keep only the columns you need.

* `df` is `ntd_metrics` and `df2` is `ntd_vehicles`
* For both dfs, keep `Agency`, `City`, `State`, `Legacy NTD ID`, `NTD ID`
* For `ntd_metrics`, also keep `Primary UZA\n Population`, `Mode`, `TOS`
* For `ntd_vehicles`, also keep `Total Service Vehicles`

In [9]:
ntd_metrics_tbl = df[['Agency', 'City', 'State', 'Legacy NTD ID', 'NTD ID','Primary UZA\n Population','Mode', 'TOS']]
ntd_metrics_tbl

Unnamed: 0,Agency,City,State,Legacy NTD ID,NTD ID,Primary UZA\n Population,Mode,TOS
0,MTA New York City Transit,New York,NY,2008,20008,18351295,HR,DO
1,MTA New York City Transit,New York,NY,2008,20008,18351295,CB,DO
2,MTA New York City Transit,New York,NY,2008,20008,18351295,MB,DO
3,MTA New York City Transit,New York,NY,2008,20008,18351295,DR,PT
4,MTA New York City Transit,New York,NY,2008,20008,18351295,RB,DO
...,...,...,...,...,...,...,...,...
3680,,,,,,,,
3681,,,,,,,,
3682,,,,,,,,
3683,,,,,,,,


In [10]:
ntd_vehicles_tbl = df2[['Agency', 'City', 'State', 'Legacy NTD ID', 'NTD ID','Total Service Vehicles']]
ntd_vehicles_tbl

Unnamed: 0,Agency,City,State,Legacy NTD ID,NTD ID,Total Service Vehicles
0,MTA New York City Transit,New York,NY,2008,20008,2297
1,New Jersey Transit Corporation,Newark,NJ,2080,20080,975
2,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,9154,90154,1420
3,Washington Metropolitan Area Transit Authority,Washington,DC,3030,30030,1938
4,"King County Department of Metro Transit, dba: ...",Seattle,WA,0001,00001,522
...,...,...,...,...,...,...
2770,Partners in Prime,Hamilton,OH,,A0020-55641,16
2771,"Oxford Senior Citizens, Inc.",Oxford,OH,,A0020-55642,0
2772,"Marielders, Inc.",Cincinnati,OH,,A0020-55643,0
2773,Maple Knoll Communities,Cincinnati,OH,,A0020-55646,0


Rename columns for both dataframes.
* replace spaces with underscores
* lowercase letters

`df.columns = df.columns.str.strip().str.replace(' ', '_').str.lower()` 

`df = df.rename(columns = {'old_name': 'new_name'})`
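The two snippets above can be combined: first clean every column name in bulk, then rename whatever is still awkward. A minimal sketch on a toy dataframe (values are made up; the column names only mimic the messy NTD headers):

```python
import pandas as pd

# Toy dataframe with made-up values; column names mimic the messy NTD headers.
toy = pd.DataFrame({
    "NTD ID": ["00001", "00002"],
    "Primary UZA\n Population": ["1,000", "2,500"],
    " Total Service Vehicles ": ["10", "25"],
})

# Step 1: strip outer whitespace, replace spaces with underscores, lowercase.
toy.columns = toy.columns.str.strip().str.replace(" ", "_").str.lower()

# Step 2: rename anything still awkward (here, the embedded newline).
toy = toy.rename(columns={"primary_uza\n_population": "primary_uza_pop"})

print(list(toy.columns))
```

`rename` only needs the columns that the bulk cleanup did not fix; columns not listed in the dictionary are left alone.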

In [11]:
ntd_metrics_tbl.columns = ntd_metrics_tbl.columns.str.strip().str.replace(' ', '_').str.lower()

# Rename any column that is still awkward after the cleanup above,
# e.g. the population column, which has a newline embedded in its name.
ntd_metrics_tbl = ntd_metrics_tbl.rename(columns = {'primary_uza\n_population': 'primary_uza_pop'})
ntd_metrics_tbl

Unnamed: 0,agency,city,state,legacy_ntd_id,ntd_id,primary_uza_pop,mode,tos
0,MTA New York City Transit,New York,NY,2008,20008,18351295,HR,DO
1,MTA New York City Transit,New York,NY,2008,20008,18351295,CB,DO
2,MTA New York City Transit,New York,NY,2008,20008,18351295,MB,DO
3,MTA New York City Transit,New York,NY,2008,20008,18351295,DR,PT
4,MTA New York City Transit,New York,NY,2008,20008,18351295,RB,DO
...,...,...,...,...,...,...,...,...
3680,,,,,,,,
3681,,,,,,,,
3682,,,,,,,,
3683,,,,,,,,


In [12]:
ntd_vehicles_tbl.columns = ntd_vehicles_tbl.columns.str.strip().str.replace(' ', '_').str.lower()
ntd_vehicles_tbl
# No further renaming needed here: every column name is already clean.

Unnamed: 0,agency,city,state,legacy_ntd_id,ntd_id,total_service_vehicles
0,MTA New York City Transit,New York,NY,2008,20008,2297
1,New Jersey Transit Corporation,Newark,NJ,2080,20080,975
2,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,9154,90154,1420
3,Washington Metropolitan Area Transit Authority,Washington,DC,3030,30030,1938
4,"King County Department of Metro Transit, dba: ...",Seattle,WA,0001,00001,522
...,...,...,...,...,...,...
2770,Partners in Prime,Hamilton,OH,,A0020-55641,16
2771,"Oxford Senior Citizens, Inc.",Oxford,OH,,A0020-55642,0
2772,"Marielders, Inc.",Cincinnati,OH,,A0020-55643,0
2773,Maple Knoll Communities,Cincinnati,OH,,A0020-55646,0


Basic checks for any given dataframe.

* Check data types for columns: `df.dtypes`
* Get df's info: `df.info()`
* Get df's dimensions: `df.shape`
* Get df's length (number of rows): `len(df)`
* Summary stats for columns: `df.describe()`

In [13]:
ntd_metrics_tbl.dtypes

agency             object
city               object
state              object
legacy_ntd_id      object
ntd_id             object
primary_uza_pop    object
mode               object
tos                object
dtype: object

In [14]:
ntd_metrics_tbl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3685 entries, 0 to 3684
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   agency           3680 non-null   object
 1   city             3680 non-null   object
 2   state            3680 non-null   object
 3   legacy_ntd_id    3394 non-null   object
 4   ntd_id           3680 non-null   object
 5   primary_uza_pop  3680 non-null   object
 6   mode             3680 non-null   object
 7   tos              3680 non-null   object
dtypes: object(8)
memory usage: 230.4+ KB


In [15]:
ntd_metrics_tbl.shape

(3685, 8)

In [16]:
len(ntd_metrics_tbl)

3685

In [17]:
ntd_metrics_tbl.describe()

Unnamed: 0,agency,city,state,legacy_ntd_id,ntd_id,primary_uza_pop,mode,tos
count,3680,3680,3680,3394,3680,3680,3680,3680
unique,2169,1666,55,1967,2183,444,19,2
top,Massachusetts Bay Transportation Authority,Portland,CA,1003,10003,0,DR,DO
freq,9,16,436,9,9,1686,1879,2553


In [18]:
ntd_metrics_tbl['primary_uza_pop'].value_counts()

0              1686
12,150,996      134
18,351,295       64
2,148,346        40
5,502,379        37
               ... 
81,176            1
61,900            1
70,436            1
50,440            1
59,036            1
Name: primary_uza_pop, Length: 444, dtype: int64

In [19]:
ntd_vehicles_tbl.dtypes
ntd_vehicles_tbl.info()
ntd_vehicles_tbl.shape
len(ntd_vehicles_tbl)
ntd_vehicles_tbl.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2775 entries, 0 to 2774
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   agency                  2775 non-null   object
 1   city                    2775 non-null   object
 2   state                   2775 non-null   object
 3   legacy_ntd_id           1987 non-null   object
 4   ntd_id                  2775 non-null   object
 5   total_service_vehicles  2775 non-null   object
dtypes: object(6)
memory usage: 130.2+ KB


Unnamed: 0,agency,city,state,legacy_ntd_id,ntd_id,total_service_vehicles
count,2775,2775,2775,1987,2775,2775
unique,2761,2007,55,1985,2775,118
top,Union County Transit,Alhambra,CA,8R05-010,20008,0
freq,2,12,224,2,1,1842


Make a plan to clean columns.

* Add a Markdown cell
* Jot down which columns should be numeric, but are not.
* If the data type is `object`, it's string. If it's `float64` or `int64`, it's numeric.
* For the columns that should be numeric, do so. Use `assign` to create new columns and overwrite the existing columns.
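A minimal sketch of that plan on toy data, assuming the numbers are stored as strings with thousands separators and that some values are missing. The column names and values here are made up.

```python
import pandas as pd

# Toy data (made up): numbers stored as strings with thousands separators,
# plus one missing value.
toy = pd.DataFrame({
    "agency": ["A", "B", "C"],
    "population": ["1,234", "0", None],
})

toy = toy.assign(
    # Remove commas, then parse. pd.to_numeric returns floats when a value
    # is missing, so cast to the nullable "Int64" dtype to keep integers.
    population = pd.to_numeric(
        toy["population"].str.replace(",", "", regex=False)
    ).astype("Int64")
)

print(toy.dtypes)
```

The capital-I `"Int64"` is pandas' nullable integer type; plain `int` cannot hold missing values.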

In [20]:
# Use the nullable "Int64" dtype below instead of int: the column contains nulls, which plain int cannot hold.

In [21]:
ntd_metrics_tbl = (ntd_metrics_tbl
    .assign(
        primary_uza_pop = (ntd_metrics_tbl['primary_uza_pop'].str.strip()
                .str.replace(',', ''))
        ).astype({
        "primary_uza_pop": "Int64"
        })
    )
ntd_metrics_tbl

Unnamed: 0,agency,city,state,legacy_ntd_id,ntd_id,primary_uza_pop,mode,tos
0,MTA New York City Transit,New York,NY,2008,20008,18351295,HR,DO
1,MTA New York City Transit,New York,NY,2008,20008,18351295,CB,DO
2,MTA New York City Transit,New York,NY,2008,20008,18351295,MB,DO
3,MTA New York City Transit,New York,NY,2008,20008,18351295,DR,PT
4,MTA New York City Transit,New York,NY,2008,20008,18351295,RB,DO
...,...,...,...,...,...,...,...,...
3680,,,,,,,,
3681,,,,,,,,
3682,,,,,,,,
3683,,,,,,,,


In [22]:
ntd_metrics_tbl.dtypes
#ntd_metrics_tbl.astype({'primary_uza_pop':'float'})

agency             object
city               object
state              object
legacy_ntd_id      object
ntd_id             object
primary_uza_pop     Int64
mode               object
tos                object
dtype: object

In [23]:
ntd_vehicles_tbl = (ntd_vehicles_tbl
    .assign(
        total_service_vehicles = (ntd_vehicles_tbl['total_service_vehicles'].str.strip()
                .str.replace(',', ''))
        ).astype({
        "total_service_vehicles": int
        })
    )
# Plain int is enough here because total_service_vehicles has no nulls
# (2775 non-null values in 2775 rows, per .info() above).
ntd_vehicles_tbl

Unnamed: 0,agency,city,state,legacy_ntd_id,ntd_id,total_service_vehicles
0,MTA New York City Transit,New York,NY,2008,20008,2297
1,New Jersey Transit Corporation,Newark,NJ,2080,20080,975
2,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,9154,90154,1420
3,Washington Metropolitan Area Transit Authority,Washington,DC,3030,30030,1938
4,"King County Department of Metro Transit, dba: ...",Seattle,WA,0001,00001,522
...,...,...,...,...,...,...
2770,Partners in Prime,Hamilton,OH,,A0020-55641,16
2771,"Oxford Senior Citizens, Inc.",Oxford,OH,,A0020-55642,0
2772,"Marielders, Inc.",Cincinnati,OH,,A0020-55643,0
2773,Maple Knoll Communities,Cincinnati,OH,,A0020-55646,0


Merge the 2 dataframes.
* set the validate parameter: `validate = "m:1"`. Choose from "m:1", "1:1", or "1:m"
* Put `ntd_metrics` on the left and `ntd_vehicles` on the right. What is the validate parameter?
* Put `ntd_vehicles` on the left and `ntd_metrics` on the right. What is the validate parameter?
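To see how `validate` behaves, here is a small sketch on toy tables shaped like the real ones: several rows per agency on the metrics side, one row per agency on the vehicles side. The table contents are illustrative only.

```python
import pandas as pd

# Toy tables: several rows per agency on one side,
# exactly one row per agency on the other.
metrics = pd.DataFrame({
    "ntd_id": ["10003", "10003", "20008"],
    "mode": ["MB", "HR", "HR"],
})
vehicles = pd.DataFrame({
    "ntd_id": ["10003", "20008"],
    "total_service_vehicles": [1602, 2297],
})

# Many rows on the left, one on the right -> "m:1"
left_many = metrics.merge(vehicles, on="ntd_id", validate="m:1")

# Swap the order and the relationship flips -> "1:m"
left_one = vehicles.merge(metrics, on="ntd_id", validate="1:m")

# Claiming "1:1" is wrong for these tables, so pandas raises MergeError.
try:
    metrics.merge(vehicles, on="ntd_id", validate="1:1")
except pd.errors.MergeError as err:
    print(f"validate caught it: {err}")
```

`validate` does not change the merge result. It only checks your assumption about duplicates in the merge keys and raises an error when the assumption is wrong.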


In [30]:
ntd_metrics_tbl.ntd_id.value_counts()
#ids appear multiple times
ntd_metrics_tbl[ntd_metrics_tbl.ntd_id=='10003']

Unnamed: 0,agency,city,state,legacy_ntd_id,ntd_id,primary_uza_pop,mode,tos
34,Massachusetts Bay Transportation Authority,Boston,MA,1003,10003,4181019,FB,PT
35,Massachusetts Bay Transportation Authority,Boston,MA,1003,10003,4181019,MB,DO
36,Massachusetts Bay Transportation Authority,Boston,MA,1003,10003,4181019,MB,PT
37,Massachusetts Bay Transportation Authority,Boston,MA,1003,10003,4181019,DR,PT
38,Massachusetts Bay Transportation Authority,Boston,MA,1003,10003,4181019,CR,PT
39,Massachusetts Bay Transportation Authority,Boston,MA,1003,10003,4181019,RB,DO
40,Massachusetts Bay Transportation Authority,Boston,MA,1003,10003,4181019,HR,DO
41,Massachusetts Bay Transportation Authority,Boston,MA,1003,10003,4181019,TB,DO
42,Massachusetts Bay Transportation Authority,Boston,MA,1003,10003,4181019,LR,DO


In [31]:
ntd_vehicles_tbl[ntd_vehicles_tbl.ntd_id=='10003']

Unnamed: 0,agency,city,state,legacy_ntd_id,ntd_id,total_service_vehicles
6,Massachusetts Bay Transportation Authority,Boston,MA,1003,10003,1602


In [38]:
# If you get _x and _y columns below, it's because those columns show up in both tables.
# To avoid that, include those columns in the merge by listing them in brackets in `on`.

In [37]:
merge1 = ntd_metrics_tbl.merge(ntd_vehicles_tbl, on = ['ntd_id', 'agency', 'city', 'state','legacy_ntd_id'], validate = 'm:1')
merge1

Unnamed: 0,agency,city,state,legacy_ntd_id,ntd_id,primary_uza_pop,mode,tos,total_service_vehicles
0,MTA New York City Transit,New York,NY,2008,20008,18351295,HR,DO,2297
1,MTA New York City Transit,New York,NY,2008,20008,18351295,CB,DO,2297
2,MTA New York City Transit,New York,NY,2008,20008,18351295,MB,DO,2297
3,MTA New York City Transit,New York,NY,2008,20008,18351295,DR,PT,2297
4,MTA New York City Transit,New York,NY,2008,20008,18351295,RB,DO,2297
...,...,...,...,...,...,...,...,...,...
3673,Lane County Transportation,Dighton,KS,7R02-102,7R02-70197,0,DR,DO,0
3674,Quileute Tribe Community Shuttle,La Push,WA,,00417,0,MB,DO,0
3675,Samish Indian Nation,Anacortes,WA,,00455,0,DR,DO,0
3676,"City of Needles, dba: Needles Area Transit",Needles,CA,9R02-063,9R02-91020,0,MB,PT,0


In [46]:
merge2 = ntd_vehicles_tbl.merge(ntd_metrics_tbl, on = ['ntd_id', 'agency', 'city', 'state','legacy_ntd_id'], validate = '1:m', indicator=True, how = 'outer')
# indicator=True adds a _merge column showing whether each row was found in both tables, only the left, or only the right
# how='outer' keeps the unmatched rows from both sides
merge2

Unnamed: 0,agency,city,state,legacy_ntd_id,ntd_id,total_service_vehicles,primary_uza_pop,mode,tos,_merge
0,MTA New York City Transit,New York,NY,2008,20008,2297.0,18351295,HR,DO,both
1,MTA New York City Transit,New York,NY,2008,20008,2297.0,18351295,CB,DO,both
2,MTA New York City Transit,New York,NY,2008,20008,2297.0,18351295,MB,DO,both
3,MTA New York City Transit,New York,NY,2008,20008,2297.0,18351295,DR,PT,both
4,MTA New York City Transit,New York,NY,2008,20008,2297.0,18351295,RB,DO,both
...,...,...,...,...,...,...,...,...,...,...
4274,,,,,,,,,,right_only
4275,,,,,,,,,,right_only
4276,,,,,,,,,,right_only
4277,,,,,,,,,,right_only


Play with merges:
* set `indicator=True`
* adjust the `how` parameter: how = 'inner', 'left', 'right', 'outer'
* look at the merge results: `df._merge.value_counts()`
* what's changing? What merge results appear when it's an `inner` join vs `left` join vs `right` join vs `outer` join?
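One way to compare the four join types side by side is to loop over them on two tiny tables. This is a sketch with made-up IDs: `"b"` appears in both tables, `"a"` only on the left, and `"c"` only on the right.

```python
import pandas as pd

# Toy tables (made-up IDs): "b" is in both, "a" only left, "c" only right.
left = pd.DataFrame({"ntd_id": ["a", "b"], "mode": ["MB", "DR"]})
right = pd.DataFrame({"ntd_id": ["b", "c"], "vehicles": [5, 7]})

results = {}
for how in ["inner", "left", "right", "outer"]:
    merged = left.merge(right, on="ntd_id", how=how, indicator=True)
    # _merge is categorical, so value_counts also lists categories with 0 rows.
    results[how] = merged["_merge"].value_counts().to_dict()
    print(how, results[how])
```

An `inner` join keeps only `both` rows, a `left` join adds the `left_only` rows, a `right` join adds the `right_only` rows, and an `outer` join keeps all three kinds.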

### Helpful Hints and Best Practices

* Start with a comprehensive approach: write down every line of code needed to clean the data.
* Once the data cleaning process is done, work on refining the code and tidying it to see what steps can be chained together, what steps are done repeatedly (use a function!), etc.

#### Chaining
Similar to **piping** in R, where you can pipe multiple operations in 1 line of code with `%>%`, you can do a similar method of chaining in Python. There is also a `df.pipe` function, but that's slightly different.

Make use of parentheses to do this. Also, use `df.assign` (see below) so you don't run into the `SettingWithCopyWarning`, which may pop up if you decide to subset your data. 
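A minimal sketch of a chain on toy data (all names and values are made up). Passing a `lambda` to `assign` is one way to refer to the dataframe as it exists at that point in the chain.

```python
import pandas as pd

# Toy data with made-up values.
toy = pd.DataFrame({
    "State": [" CA", "NY ", "CA", "TX"],
    "Vehicles": ["1,000", "20", "300", "4"],
})

# Wrapping the expression in parentheses lets each step sit on its own line.
by_state = (
    toy
    .assign(
        # A lambda receives the dataframe as it exists at this point in the chain.
        State = lambda d: d["State"].str.strip(),
        Vehicles = lambda d: d["Vehicles"].str.replace(",", "", regex=False).astype(int),
    )
    .query("State != 'TX'")
    .groupby("State")
    .agg(total_vehicles = ("Vehicles", "sum"))
    .reset_index()
)

print(by_state)
```

Each step returns a new dataframe, so nothing is modified in place and steps can be commented out one at a time while debugging.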

#### Assign 
You can also create new columns in place; the `SettingWithCopyWarning` that may come up is mostly harmless. But `assign` also lets you chain more operations afterward. [More clarification.](https://pythonguides.com/add-a-column-to-a-dataframe-in-python-pandas/)
```
states_clean = (states_clean
    # Assign is similar to R dplyr's mutate
    .assign(
        # Strip leading or trailing blanks (slightly different than replace)
        # Decide if you want to replace all blanks or just leading/trailing
        Agency = (states_clean.Agency.str.strip()
                .str.replace('(', '', regex=False).str.replace(')', '', regex=False)
        ),
        # Do something similar for City as above
        City = states_clean.City.str.strip(),
        # Replace blanks with nothing
        State = states_clean.State.str.replace(' ', '')
    ).astype({
        "Population": int, 
        "Fare_Revenues": int,
    })
)
```

Alternatively, try it with a loop:

```
for c in ["Agency", "City"]:
    # Agency and City are text columns, so they are cleaned but not cast to int
    df[c] = (df[c].str.strip()
            .str.replace('(', '', regex=False)
            .str.replace(')', '', regex=False)
            )
```

#### Using `str.contains` with some special characters
Use a backslash `\` to "escape" special characters, and write the pattern as a raw string (`r"..."`) so Python leaves the backslash untouched. [StackOverflow explanation](https://stackoverflow.com/questions/48699907/error-unbalanced-parenthesis-while-checking-if-an-item-presents-in-a-pandas-d)
`states_clean[states_clean.Fare_Revenues.str.contains(r"\(")]`
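
There are two common ways to search for a literal parenthesis, sketched here on a made-up series: escape it in a regular expression, or turn regex matching off with `regex=False`.

```python
import pandas as pd

# Made-up values, some wrapped in parentheses.
fares = pd.Series(["(1,200)", "3,400", "(56)", "78"])

# Option 1: escape the parenthesis. The raw string (r"...") keeps the
# backslash intact for the regex engine.
has_paren_regex = fares.str.contains(r"\(")

# Option 2: turn regex matching off and search for the literal character.
has_paren_literal = fares.str.contains("(", regex=False)

print(fares[has_paren_regex].tolist())
```

Both options select the same rows; `regex=False` is often easier to read when no pattern matching is needed.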


#### Merging
If your merge result produces a `col_x` and a `col_y`, both tables share a column that was not used as a merge key. Add those columns to your list of merge columns, with `on = ["col1", "col2"]`.
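
A small sketch of where the `_x` and `_y` suffixes come from, using toy tables with made-up values that share two columns:

```python
import pandas as pd

# Toy tables (made-up values) that share two columns: ntd_id and state.
left = pd.DataFrame({"ntd_id": ["1", "2"], "state": ["CA", "NY"], "mode": ["MB", "HR"]})
right = pd.DataFrame({"ntd_id": ["1", "2"], "state": ["CA", "NY"], "vehicles": [10, 20]})

# Merging on ntd_id alone leaves "state" duplicated as state_x and state_y.
narrow = left.merge(right, on="ntd_id")
print(list(narrow.columns))

# Listing every shared column in `on` keeps a single copy of each.
wide = left.merge(right, on=["ntd_id", "state"])
print(list(wide.columns))
```

Adding a column to `on` also makes it part of the match, so rows only join when the values agree in both tables.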

#### Use `isin` to filter by multiple conditions

```
keep_me = ["CA", "NY", "TX"]
df2 = df[df.State.isin(keep_me)]
```
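
The same pattern extends to all five states named in the To do section. A sketch on toy data (agency names are made up), which also shows how `~` inverts the filter:

```python
import pandas as pd

# Toy data with made-up agencies.
toy = pd.DataFrame({
    "agency": ["A", "B", "C", "D", "E", "F"],
    "state": ["NY", "CA", "TX", "ID", "MS", "WA"],
})

keep_states = ["NY", "CA", "TX", "ID", "MS"]

# Rows whose state is in the list.
kept = toy[toy["state"].isin(keep_states)]

# The tilde (~) negates the condition: rows whose state is NOT in the list.
dropped = toy[~toy["state"].isin(keep_states)]

print(kept["state"].tolist(), dropped["state"].tolist())
```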

Subset the rows to just the 5 states listed above (NY, CA, TX, ID, MS).

In [49]:
keep_me = ["CA", "NY", "TX"]
ntd_tbl = merge1[merge1.state.isin(keep_me)]
ntd_tbl

Unnamed: 0,agency,city,state,legacy_ntd_id,ntd_id,primary_uza_pop,mode,tos,total_service_vehicles
0,MTA New York City Transit,New York,NY,2008,20008,18351295,HR,DO,2297
1,MTA New York City Transit,New York,NY,2008,20008,18351295,CB,DO,2297
2,MTA New York City Transit,New York,NY,2008,20008,18351295,MB,DO,2297
3,MTA New York City Transit,New York,NY,2008,20008,18351295,DR,PT,2297
4,MTA New York City Transit,New York,NY,2008,20008,18351295,RB,DO,2297
...,...,...,...,...,...,...,...,...,...
3647,Sullivan County Transportation,Monticello,NY,2R02-042,2R02-20937,0,MB,DO,0
3662,City of Mechanicville,Mechanicville,NY,2213,20213,594962,MB,DO,0
3667,Los Angeles County Dept. of Public Works - Ath...,Alhambra,CA,,90269,12150996,MB,PT,0
3676,"City of Needles, dba: Needles Area Transit",Needles,CA,9R02-063,9R02-91020,0,MB,PT,0


Add a new column with this metric: `service_vehicles_per_capita` (service vehicles divided by population)

* [Read more on rate metrics](https://oag.ca.gov/sites/all/files/agweb/pdfs/cjsc/stats/computational_formulas.pdf)
* `df[new_column] = df[numerator_col]/df[denominator_col]`
* `df[new_column] = df[numerator_col].divide(df[denominator_col])`
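
Many rows in this dataset have a population of 0, so the division needs some care. Below is a sketch on toy data (values made up) that assumes a population of 0 should be treated as missing rather than producing `inf`; whether that is the right choice for the analysis is a judgment call.

```python
import pandas as pd

# Toy data (made-up values), including a population of 0 and a missing value.
toy = pd.DataFrame({
    "total_service_vehicles": [50, 10, 5],
    "primary_uza_pop": pd.array([1_000_000, 0, None], dtype="Int64"),
})

# Convert to floats and treat a population of 0 as missing, so the
# division yields NaN rather than inf for those rows.
population = toy["primary_uza_pop"].astype("float64").replace(0, float("nan"))

toy = toy.assign(
    service_vehicles_per_capita = toy["total_service_vehicles"].divide(population),
    # Scaling by 100,000 gives a rate that is easier to read.
    service_vehicles_per_100k = lambda d: d["service_vehicles_per_capita"] * 100_000,
)

print(toy)
```

The vectorized division works on whole columns at once, so no `apply` or row-by-row function is needed.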

In [100]:
# Earlier attempts failed for three reasons: column names passed to df[...]
# were not quoted (KeyError), the try/except block was indented incorrectly
# (SyntaxError), and div_zero returned the whole-column division instead of
# a single row's value.

# Convert both columns to floats so missing values become NaN.
vehicles = pd.to_numeric(ntd_tbl["total_service_vehicles"], errors="coerce").astype("float64")
population = pd.to_numeric(ntd_tbl["primary_uza_pop"], errors="coerce").astype("float64")

# A population of 0 has no meaningful per capita rate, so treat it as missing.
population = population.replace(0, float("nan"))

# assign returns a new dataframe, which avoids the SettingWithCopyWarning.
ntd_tbl = ntd_tbl.assign(service_vehicles_per_capita = vehicles.divide(population))
ntd_tbl

Add a new column that is `service_vehicles_per_100k`

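In [None]:
# Population in units of 100,000 residents (a population of 0 becomes NaN)
pop100k = pd.to_numeric(ntd_tbl.primary_uza_pop, errors="coerce").astype("float64").replace(0, float("nan")) / 100_000
ntd_tbl = ntd_tbl.assign(service_vehicles_per_100k = pd.to_numeric(ntd_tbl.total_service_vehicles, errors="coerce").astype("float64") / pop100k)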

Do a group by and aggregate.

* Group by state, count the number of agencies and find average of total service vehicles.
* Write a sentence to explain the result
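
One way to get several statistics per state in a single step is named aggregation, where each output column is written as `new_name = (source_column, aggregation)`. A sketch on toy data (all values and output column names are made up):

```python
import pandas as pd

# Toy data (made-up values): one row per agency and mode, like the merged table.
toy = pd.DataFrame({
    "state": ["CA", "CA", "CA", "NY"],
    "agency": ["A", "A", "B", "C"],
    "city": ["X", "X", "Y", "Z"],
    "total_service_vehicles": [10, 10, 30, 50],
})

by_state = (
    toy
    .groupby("state")
    .agg(
        # new_column_name = (source_column, aggregation)
        n_agencies = ("agency", "nunique"),
        n_cities = ("city", "nunique"),
        n_rows = ("agency", "count"),
        sum_vehicles = ("total_service_vehicles", "sum"),
        mean_vehicles = ("total_service_vehicles", "mean"),
    )
    .reset_index()
)

print(by_state)
```

Note that agency A appears on two rows here, so its vehicles are counted twice in the sum and the mean. `count` counts rows, while `nunique` counts distinct agencies.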

In [101]:
# The table has one row per agency and mode, so count distinct agencies with nunique
ntd_tbl_state = ntd_tbl.groupby('state').agg(
    n_agencies = ('agency', 'nunique'),
    avg_total_service_vehicles = ('total_service_vehicles', 'mean'),
).reset_index()
ntd_tbl_state