### Clean 2021 A-10 data  
This is a one-time cleaning process, since all other data will come from Black Cat (this data is from the NTD tables). We clean only enough to get started on the prototype.  
The source is in an awkward format: a) identifying columns, then b) columns for facility type, then c) columns for ownership type. However, a row can have more than one ownership type, in which case it is unclear, unless you inspect it row by row, which facilities fall under which type of ownership.  
  
* Add an `ownerships` column for the ownership type.
* Two additional columns for transport type.
* One row per agency / Mode-TOS / ownership type.
* One column per facility type.
  
This produces sample data only, purely for prototyping until the Black Cat (BC) API is complete. Rows with more than one ownership type are ignored.   

In [1]:
import pandas as pd
import numpy as np 

In [9]:
# Data found from NTD tables by Slalom team.
df = pd.read_csv("../data/a10_2021.csv") # A-10
df = df.drop(['Directly Operated Ownership Types:', 'Purchased Transportation Ownership Types:'], axis=1)
df.head(3)

Unnamed: 0,Agency,City,State,Organization Type,Reporter Type,Mode,TOS,Under 200 Vehicles,200 to 300 Vehicles,Over 300 Vehicles,Heavy Maintenance Facilities,Total Facilities,Owned,Leased from a Public Entity,Leased from a Private Entity,Owned by PT Provider,Owned by Public Agency,Leased by PT Provider,Leased by Public Agency
0,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,MB,DO,4.0,24.52,0.97,1.94,31.43,31.43,0.0,0.0,0.0,0.0,0.0,0.0
1,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,VP,PT,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,LR,DO,17.0,0.0,0.0,0.0,17.0,17.0,0.0,0.0,0.0,0.0,0.0,0.0


In [10]:
df.shape

(430, 19)

Make a new column for ownership types.


In [11]:
df.insert(7, 'ownerships', "")
df.head(3)

Unnamed: 0,Agency,City,State,Organization Type,Reporter Type,Mode,TOS,ownerships,Under 200 Vehicles,200 to 300 Vehicles,Over 300 Vehicles,Heavy Maintenance Facilities,Total Facilities,Owned,Leased from a Public Entity,Leased from a Private Entity,Owned by PT Provider,Owned by Public Agency,Leased by PT Provider,Leased by Public Agency
0,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,MB,DO,,4.0,24.52,0.97,1.94,31.43,31.43,0.0,0.0,0.0,0.0,0.0,0.0
1,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,VP,PT,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,LR,DO,,17.0,0.0,0.0,0.0,17.0,17.0,0.0,0.0,0.0,0.0,0.0,0.0


Count the non-zero values in the ownership-type columns.  
https://stackoverflow.com/questions/69605383/pandas-determine-column-labels-that-contribute-to-non-zero-values-in-each-row

In [12]:
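# Count non-zero ownership columns per row. With `ownerships` inserted at position 7, 13:19 spans
# "Owned" through "Leased by PT Provider"; "Leased by Public Agency" (position 19) is outside this slice.
df['own_type_count'] = np.count_nonzero(df.iloc[:, 13:19],axis=1)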
df.head()


Unnamed: 0,Agency,City,State,Organization Type,Reporter Type,Mode,TOS,ownerships,Under 200 Vehicles,200 to 300 Vehicles,...,Heavy Maintenance Facilities,Total Facilities,Owned,Leased from a Public Entity,Leased from a Private Entity,Owned by PT Provider,Owned by Public Agency,Leased by PT Provider,Leased by Public Agency,own_type_count
0,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,MB,DO,,4.0,24.52,...,1.94,31.43,31.43,0.0,0.0,0.0,0.0,0.0,0.0,1
1,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,VP,PT,,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
2,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,LR,DO,,17.0,0.0,...,0.0,17.0,17.0,0.0,0.0,0.0,0.0,0.0,0.0,1
3,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,MB,PT,,3.0,0.0,...,0.0,3.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,1
4,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,HR,DO,,2.0,0.0,...,0.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1


In [13]:
df['own_type_count'].value_counts()

1    325
0     99
2      6
Name: own_type_count, dtype: int64
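
Positional slices such as `iloc[:, 13:19]` depend on the exact column order, which shifts whenever a column is inserted. As a minimal sketch (toy values, not taken from the A-10 file), the same per-row count can be taken by selecting the ownership columns by name. `OWNERSHIP_COLS` and `toy` are illustrative names, not part of the notebook above.

```python
import numpy as np
import pandas as pd

# Ownership column names as they appear in the A-10 table output above.
OWNERSHIP_COLS = [
    "Owned",
    "Leased from a Public Entity",
    "Leased from a Private Entity",
    "Owned by PT Provider",
    "Owned by Public Agency",
    "Leased by PT Provider",
    "Leased by Public Agency",
]

# Toy rows (made-up values): one, zero, and two ownership types.
toy = pd.DataFrame(
    [
        [2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0],
    ],
    columns=OWNERSHIP_COLS,
)

# Selecting by label keeps the count stable if columns are inserted or reordered.
toy["own_type_count"] = np.count_nonzero(toy[OWNERSHIP_COLS].to_numpy(), axis=1)
print(toy["own_type_count"].tolist())  # [1, 0, 2]
```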

In [14]:
df['own_type_label'] = df.iloc[:, 13:19].apply(lambda r: r.index[r.ne(0)].to_list(), axis=1)
df.head()

Unnamed: 0,Agency,City,State,Organization Type,Reporter Type,Mode,TOS,ownerships,Under 200 Vehicles,200 to 300 Vehicles,...,Total Facilities,Owned,Leased from a Public Entity,Leased from a Private Entity,Owned by PT Provider,Owned by Public Agency,Leased by PT Provider,Leased by Public Agency,own_type_count,own_type_label
0,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,MB,DO,,4.0,24.52,...,31.43,31.43,0.0,0.0,0.0,0.0,0.0,0.0,1,[Owned]
1,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,VP,PT,,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,[]
2,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,LR,DO,,17.0,0.0,...,17.0,17.0,0.0,0.0,0.0,0.0,0.0,0.0,1,[Owned]
3,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,MB,PT,,3.0,0.0,...,3.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,1,[Leased by PT Provider]
4,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,HR,DO,,2.0,0.0,...,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1,[Owned]


In [15]:
pd.set_option('display.max_columns', None)

# Fill in the newly created `ownerships` column with a string of the ownership type.
df['ownerships'] = [','.join(map(str, l)) for l in df['own_type_label']] 
df.head()

Unnamed: 0,Agency,City,State,Organization Type,Reporter Type,Mode,TOS,ownerships,Under 200 Vehicles,200 to 300 Vehicles,Over 300 Vehicles,Heavy Maintenance Facilities,Total Facilities,Owned,Leased from a Public Entity,Leased from a Private Entity,Owned by PT Provider,Owned by Public Agency,Leased by PT Provider,Leased by Public Agency,own_type_count,own_type_label
0,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,MB,DO,Owned,4.0,24.52,0.97,1.94,31.43,31.43,0.0,0.0,0.0,0.0,0.0,0.0,1,[Owned]
1,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,VP,PT,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,[]
2,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,LR,DO,Owned,17.0,0.0,0.0,0.0,17.0,17.0,0.0,0.0,0.0,0.0,0.0,0.0,1,[Owned]
3,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,MB,PT,Leased by PT Provider,3.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,1,[Leased by PT Provider]
4,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,HR,DO,Owned,2.0,0.0,0.0,0.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1,[Owned]
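
To make the two steps above easier to follow, here is a small sketch on made-up data: first collect the names of the non-zero ownership columns per row, then join each list into a single string. The frame `toy` and the list `own_cols` are illustrative only.

```python
import pandas as pd

# Toy frame with three of the ownership columns (values are made up).
own_cols = ["Owned", "Leased from a Private Entity", "Leased by PT Provider"]
toy = pd.DataFrame(
    {
        "Owned": [2.0, 0.0, 1.0],
        "Leased from a Private Entity": [0.0, 0.0, 0.0],
        "Leased by PT Provider": [0.0, 0.0, 3.0],
    }
)

# For each row, keep the labels of the columns whose value is not zero.
toy["own_type_label"] = toy[own_cols].apply(
    lambda r: r.index[r.ne(0)].to_list(), axis=1
)

# Join each list into a comma-separated string; an empty list becomes "".
toy["ownerships"] = [",".join(labels) for labels in toy["own_type_label"]]
print(toy["ownerships"].tolist())
```

Rows with no ownership type end up with an empty string, and rows with several types get a comma-separated value, which is why the next step filters on `own_type_count`.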


In [16]:
# Reduce down to only those with 1 ownership type - again for prototyping purposes
one_ownership = df[df['own_type_count']==1].reset_index(drop=True)
print(one_ownership.shape)
one_ownership.head()

(325, 22)


Unnamed: 0,Agency,City,State,Organization Type,Reporter Type,Mode,TOS,ownerships,Under 200 Vehicles,200 to 300 Vehicles,Over 300 Vehicles,Heavy Maintenance Facilities,Total Facilities,Owned,Leased from a Public Entity,Leased from a Private Entity,Owned by PT Provider,Owned by Public Agency,Leased by PT Provider,Leased by Public Agency,own_type_count,own_type_label
0,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,MB,DO,Owned,4.0,24.52,0.97,1.94,31.43,31.43,0.0,0.0,0.0,0.0,0.0,0.0,1,[Owned]
1,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,LR,DO,Owned,17.0,0.0,0.0,0.0,17.0,17.0,0.0,0.0,0.0,0.0,0.0,0.0,1,[Owned]
2,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,MB,PT,Leased by PT Provider,3.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,1,[Leased by PT Provider]
3,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,HR,DO,Owned,2.0,0.0,0.0,0.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1,[Owned]
4,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,RB,DO,Owned,0.0,0.48,0.03,0.06,0.57,0.57,0.0,0.0,0.0,0.0,0.0,0.0,1,[Owned]


In [18]:
# Since this dataset now contains only rows with 1 ownership type, we don't need any columns after "Total Facilities",
# since those only indicated the ownership type.
# We cut off the columns to the right of "Total Facilities" and it's ready to be sample data.
#--------

sample = one_ownership.drop(one_ownership.columns[13:22], axis=1)
sample.head()

Unnamed: 0,Agency,City,State,Organization Type,Reporter Type,Mode,TOS,ownerships,Under 200 Vehicles,200 to 300 Vehicles,Over 300 Vehicles,Heavy Maintenance Facilities,Total Facilities
0,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,MB,DO,Owned,4.0,24.52,0.97,1.94,31.43
1,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,LR,DO,Owned,17.0,0.0,0.0,0.0,17.0
2,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,MB,PT,Leased by PT Provider,3.0,0.0,0.0,0.0,3.0
3,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,HR,DO,Owned,2.0,0.0,0.0,0.0,2.0
4,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,RB,DO,Owned,0.0,0.48,0.03,0.06,0.57
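
The positional drop above relies on `Total Facilities` sitting at position 12. A hedged alternative, shown here on a made-up frame, is to slice the columns by label, since `.loc` label slices include the end label. The frame `toy` is illustrative and much narrower than the real table.

```python
import pandas as pd

# Toy frame: a few identifying columns, then ownership and helper columns.
toy = pd.DataFrame(
    {
        "Agency": ["A", "B"],
        "ownerships": ["Owned", "Leased by PT Provider"],
        "Total Facilities": [2.0, 3.0],
        "Owned": [2.0, 0.0],
        "Leased by PT Provider": [0.0, 3.0],
        "own_type_count": [1, 1],
    }
)

# Keep everything from the first column through "Total Facilities" (inclusive).
sample_toy = toy.loc[:, :"Total Facilities"].copy()
print(sample_toy.columns.tolist())
```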


In [19]:
sample.insert(5, 'year', 2021)
sample.head(3)

Unnamed: 0,Agency,City,State,Organization Type,Reporter Type,year,Mode,TOS,ownerships,Under 200 Vehicles,200 to 300 Vehicles,Over 300 Vehicles,Heavy Maintenance Facilities,Total Facilities
0,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,2021,MB,DO,Owned,4.0,24.52,0.97,1.94,31.43
1,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,2021,LR,DO,Owned,17.0,0.0,0.0,0.0,17.0
2,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,2021,MB,PT,Leased by PT Provider,3.0,0.0,0.0,0.0,3.0


In [20]:
# Sanity check on an agency whose data I know:
sample[sample['Agency']=='Mountain Area Regional Transit Authority, dba: Mountain Transit']

Unnamed: 0,Agency,City,State,Organization Type,Reporter Type,year,Mode,TOS,ownerships,Under 200 Vehicles,200 to 300 Vehicles,Over 300 Vehicles,Heavy Maintenance Facilities,Total Facilities
190,"Mountain Area Regional Transit Authority, dba:...",Big Bear Lake,CA,Independent Public Agency or Authority of Tran...,Rural Reporter,2021,MB,DO,Owned,0.0,0.0,0.0,0.0,1.23
197,"Mountain Area Regional Transit Authority, dba:...",Big Bear Lake,CA,Independent Public Agency or Authority of Tran...,Rural Reporter,2021,CB,DO,Owned,0.0,0.0,0.0,0.0,0.36
198,"Mountain Area Regional Transit Authority, dba:...",Big Bear Lake,CA,Independent Public Agency or Authority of Tran...,Rural Reporter,2021,DR,DO,Owned,0.0,0.0,0.0,0.0,0.41


In [21]:
# write to csv. Will be imported into another notebook for validation tool prototyping
sample.to_csv('../data/2021_a10_submitted_partialdata.csv')
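
One thing to be aware of when the file is read back in the other notebook: `to_csv` writes the index as a first, unnamed column by default, and `read_csv` then labels it `Unnamed: 0`. Whether the downstream notebook wants that column is not stated here, so the following is only a sketch of the difference, using an in-memory buffer and a made-up frame.

```python
import io

import pandas as pd

toy = pd.DataFrame({"Agency": ["A", "B"], "year": [2021, 2021]})

# Default behaviour: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# With index=False only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = pd.read_csv(without_index).columns.tolist()

print(cols_default)   # ['Unnamed: 0', 'Agency', 'year']
print(cols_no_index)  # ['Agency', 'year']
```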

### Repeat the same process for 2020

In [2]:
df2020 = pd.read_csv("../data/a10_2020.csv") # A-10
df2020 = df2020.drop(['DO Ownership Types:', 'PT Ownership Types:'], axis=1)
df2020.head(3)

Unnamed: 0,Agency,City,State,Organization Type,Reporter Type,Mode,TOS,Under 200 Vehicles,200 to 300 Vehicles,Over 300 Vehicles,Heavy Maintenance Facilities,Total Facilities,Owned,Leased from a Public Entity,Leased from a Private Entity,Owned by PT Provider,Owned by Public Agency,Leased by PT Provider,Leased by Public Agency
0,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,MB,DO,4.0,24.6,0.98,1.96,31.54,31.54,0.0,0.0,0.0,0.0,0.0,0.0
1,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,VP,PT,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,LR,DO,14.0,0.0,0.0,0.0,14.0,14.0,0.0,0.0,0.0,0.0,0.0,0.0


In [3]:
df2020.insert(7, 'ownerships', "")

df2020['own_type_count'] = np.count_nonzero(df2020.iloc[:, 13:19],axis=1)
df2020.head(3)

Unnamed: 0,Agency,City,State,Organization Type,Reporter Type,Mode,TOS,ownerships,Under 200 Vehicles,200 to 300 Vehicles,...,Heavy Maintenance Facilities,Total Facilities,Owned,Leased from a Public Entity,Leased from a Private Entity,Owned by PT Provider,Owned by Public Agency,Leased by PT Provider,Leased by Public Agency,own_type_count
0,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,MB,DO,,4.0,24.6,...,1.96,31.54,31.54,0.0,0.0,0.0,0.0,0.0,0.0,1
1,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,VP,PT,,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
2,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,LR,DO,,14.0,0.0,...,0.0,14.0,14.0,0.0,0.0,0.0,0.0,0.0,0.0,1


In [4]:
df2020['own_type_count'].value_counts()

1    332
0    107
2      8
Name: own_type_count, dtype: int64

In [5]:
df2020['own_type_label'] = df2020.iloc[:, 13:19].apply(lambda r: r.index[r.ne(0)].to_list(), axis=1)
df2020['ownerships'] = [','.join(map(str, l)) for l in df2020['own_type_label']] 
df2020.head()

Unnamed: 0,Agency,City,State,Organization Type,Reporter Type,Mode,TOS,ownerships,Under 200 Vehicles,200 to 300 Vehicles,...,Total Facilities,Owned,Leased from a Public Entity,Leased from a Private Entity,Owned by PT Provider,Owned by Public Agency,Leased by PT Provider,Leased by Public Agency,own_type_count,own_type_label
0,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,MB,DO,Owned,4.0,24.6,...,31.54,31.54,0.0,0.0,0.0,0.0,0.0,0.0,1,[Owned]
1,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,VP,PT,,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,[]
2,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,LR,DO,Owned,14.0,0.0,...,14.0,14.0,0.0,0.0,0.0,0.0,0.0,0.0,1,[Owned]
3,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,MB,PT,Leased by PT Provider,3.0,0.0,...,3.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,1,[Leased by PT Provider]
4,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,HR,DO,Owned,2.0,0.0,...,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1,[Owned]


In [6]:
# Keep rows with at most 1 ownership type - again for prototyping purposes.
# Note: `<=1` also keeps rows with 0 ownership types, unlike the `==1` filter used for 2021.
one_ownership = df2020[df2020['own_type_count']<=1].reset_index(drop=True)
print(one_ownership.shape)
one_ownership.head()

(439, 22)


Unnamed: 0,Agency,City,State,Organization Type,Reporter Type,Mode,TOS,ownerships,Under 200 Vehicles,200 to 300 Vehicles,...,Total Facilities,Owned,Leased from a Public Entity,Leased from a Private Entity,Owned by PT Provider,Owned by Public Agency,Leased by PT Provider,Leased by Public Agency,own_type_count,own_type_label
0,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,MB,DO,Owned,4.0,24.6,...,31.54,31.54,0.0,0.0,0.0,0.0,0.0,0.0,1,[Owned]
1,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,VP,PT,,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,[]
2,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,LR,DO,Owned,14.0,0.0,...,14.0,14.0,0.0,0.0,0.0,0.0,0.0,0.0,1,[Owned]
3,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,MB,PT,Leased by PT Provider,3.0,0.0,...,3.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,1,[Leased by PT Provider]
4,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,HR,DO,Owned,2.0,0.0,...,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1,[Owned]


In [7]:
# Drop the ownership and helper columns to the right of "Total Facilities", then tag the year.
sample = one_ownership.drop(one_ownership.columns[13:22], axis=1)
sample.insert(5, 'year', 2020)
sample.head(3)

Unnamed: 0,Agency,City,State,Organization Type,Reporter Type,year,Mode,TOS,ownerships,Under 200 Vehicles,200 to 300 Vehicles,Over 300 Vehicles,Heavy Maintenance Facilities,Total Facilities
0,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,2020,MB,DO,Owned,4.0,24.6,0.98,1.96,31.54
1,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,2020,VP,PT,,0.0,0.0,0.0,0.0,0.0
2,Los Angeles County Metropolitan Transportation...,Los Angeles,CA,Independent Public Agency or Authority of Tran...,Full Reporter,2020,LR,DO,Owned,14.0,0.0,0.0,0.0,14.0


In [8]:
# write to csv. Will be imported into another notebook for validation tool prototyping
sample.to_csv('../data/2020_a10_submitted_partialdata.csv')
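
Since the 2020 cells repeat the 2021 steps, the cleaning could be wrapped in one function and called once per year. The sketch below is an assumption-laden outline, not the notebook's actual code: `clean_a10` and `OWNERSHIP_COLS` are hypothetical names, it selects all seven ownership columns by label rather than by position, and it keeps only rows with exactly one ownership type (the 2020 cell above uses `<=1`).

```python
import pandas as pd

# Ownership column names as shown in the table output above.
OWNERSHIP_COLS = [
    "Owned",
    "Leased from a Public Entity",
    "Leased from a Private Entity",
    "Owned by PT Provider",
    "Owned by Public Agency",
    "Leased by PT Provider",
    "Leased by Public Agency",
]


def clean_a10(df: pd.DataFrame, year: int) -> pd.DataFrame:
    """Hypothetical helper: keep rows with exactly one ownership type."""
    # Boolean matrix: True where an ownership column is non-zero.
    mask = df[OWNERSHIP_COLS].ne(0).to_numpy()
    # Per row, the names of the non-zero ownership columns.
    labels = [[c for c, hit in zip(OWNERSHIP_COLS, row) if hit] for row in mask]

    out = df.drop(columns=OWNERSHIP_COLS)
    # Place `ownerships` right after TOS and `year` right before Mode.
    out.insert(out.columns.get_loc("TOS") + 1, "ownerships",
               [",".join(names) for names in labels])
    out.insert(out.columns.get_loc("Mode"), "year", year)

    keep = mask.sum(axis=1) == 1  # exactly one ownership type
    return out[keep].reset_index(drop=True)


# Toy input (made-up values): rows with one, zero, and two ownership types.
toy = pd.DataFrame(
    {
        "Agency": ["A", "A", "B"],
        "Reporter Type": ["Full Reporter"] * 3,
        "Mode": ["MB", "VP", "DR"],
        "TOS": ["DO", "PT", "DO"],
        "Total Facilities": [2.0, 0.0, 4.0],
        **{c: [0.0, 0.0, 0.0] for c in OWNERSHIP_COLS},
    }
)
toy.loc[0, "Owned"] = 2.0
toy.loc[2, "Owned"] = 1.0
toy.loc[2, "Leased by Public Agency"] = 3.0

result = clean_a10(toy, 2021)
print(result)
```

With a helper like this, each year reduces to a `read_csv`, a call to the function, and a `to_csv`, and the two years cannot drift apart in how they filter rows.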