# Taking a first look and cleaning the data.
* [Data Source](https://airtable.com/appeVUdmRBi3K9hTS/tblLywLvMA2OTesQP/viwRRKOaZvvkSNfmU?blocks=hide)
* [Term Explanations](https://docs.calitp.org/data-infra/datasets_and_tables/transitdatabase.html)
* Customers want real-time information and easy payment on their phones and other devices. Review the technology catalogued in Transit Stacks to see which products and vendors are in use. Agencies relying on so many different vendors and products is inefficient; how can Caltrans help create some consistency?

In [1]:
import numpy as np
import pandas as pd

pd.options.display.max_columns = 50
pd.options.display.max_rows = 250
pd.set_option("display.max_colwidth", None)
pd.options.display.float_format = "{:.2f}".format

from itertools import chain

import altair as alt
from calitp import *
from siuba import *

GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/transit_stacks/"

## Products Data

In [2]:
# drop columns with tons of NAs
products = (
    to_snakecase(pd.read_csv(f"{GCS_FILE_PATH}products-Grid view (1).csv"))
    .drop(
        columns=[
            "business_model_features",
            "attachments",
            "status",
            "certifications",
            "connectivity",
            "accepted_input_components",
            "output_components",
            "input",
            "output",
        ]
    )
    .rename(columns={"name": "product_name"})
)

In [3]:
products.columns

Index(['product_name', 'components', 'vendor', 'url', 'requirements',
       'product_features', 'notes', 'organization_stack_components'],
      dtype='object')

In [5]:
# Count the comma-separated entries in organization_stack_components to see how many orgs use each product.
# https://stackoverflow.com/questions/51502263/pandas-dataframe-object-has-no-attribute-str
products["count_of_orgs_using_product"] = (
    products["organization_stack_components"]
    .str.split(",+")
    .str.len()
    .groupby(products.product_name)
    .transform("sum")
)
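A minimal sketch of the counting step above on made-up data, to show what the metric measures. The `toy` frame and its values are hypothetical, and unlike the cell above this version fills missing values with 0 explicitly rather than relying on the grouped sum.

```python
import pandas as pd

# Hypothetical toy data standing in for the products table.
toy = pd.DataFrame(
    {
        "product_name": ["A", "B", "C"],
        "organization_stack_components": ["org1,org2,org3", "org4", None],
    }
)

# Split on commas and count the pieces. Rows with no value stay NaN
# after .str.len(), so fill them with 0 before casting to int.
toy["count_of_orgs_using_product"] = (
    toy["organization_stack_components"]
    .str.split(",")
    .str.len()
    .fillna(0)
    .astype(int)
)

print(toy[["product_name", "count_of_orgs_using_product"]])
```

The count is only as good as the assumption that every comma separates two distinct organizations.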

In [14]:
products = products.fillna("N/A")

### What % of vendors with scheduling software also provide GTFS data out of the box?
* Go back and tag companies for GTFS.

In [9]:
# https://stackoverflow.com/questions/26577516/how-to-test-if-a-string-contains-one-of-the-substrings-in-a-list-in-pandas
searchfor = ["GTFS", "schedule", "Scheduling", "Schedule", "scheduling"]

In [17]:
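# NOTE: "&".join(searchfor) builds the literal pattern
# "GTFS&schedule&Scheduling&Schedule&scheduling", which no row contains,
# so the result below is empty. Joining with "|" would instead match rows
# containing any one of the terms.
gtfs_schedule_overlap = products[
    products["components"].str.contains(
        "&".join(searchfor),
        case=False,
    )
]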
gtfs_schedule_overlap

Unnamed: 0,product_name,components,vendor,url,requirements,product_features,notes,organization_stack_components,count_of_orgs_using_product
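The empty result above comes from how the search pattern is built. The sketch below, on made-up component strings, contrasts three options: the literal `&` join, regex alternation with `|` (any term), and combining one boolean mask per term (all terms). The `toy` series and the shortened `terms` list are assumptions for illustration only.

```python
import re

import pandas as pd

# Hypothetical component strings.
toy = pd.Series(
    [
        "GTFS generation,Scheduling Software",
        "Cash Farebox",
        "GTFS Alerts Publication",
        "Scheduling Software",
    ]
)
terms = ["GTFS", "schedul"]

# "&" has no special meaning in a regex, so this looks for the literal
# text "GTFS&schedul" and matches nothing.
literal_and = toy.str.contains("&".join(terms), case=False)

# "|" is regex alternation: a row matches if it contains ANY term.
any_term = toy.str.contains("|".join(map(re.escape, terms)), case=False)

# Requiring ALL terms means combining one boolean mask per term.
all_terms = toy.str.contains("GTFS", case=False) & toy.str.contains(
    "schedul", case=False
)

print(pd.DataFrame({"any": any_term, "all": all_terms, "literal_and": literal_and}))
```

For the question in this section (scheduling products that also provide GTFS), the "all terms" mask is probably the closer fit.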


### Vendor with the most products
* Uber Inc. has the most rows (26), one for each of the different products it offers.

In [61]:
no_vendor_nulls = products.loc[products["vendor"] != "N/A"]

In [62]:
no_vendor_nulls.vendor.value_counts().head(10)

Uber Inc.                     26
Luminator Technology Group    24
Genfare                       13
GMV Syncromatics Inc          10
Connexionz Inc.                9
Trapeze Group                  6
Conduent Inc                   6
UTA                            6
Avail Technologies Inc.        4
Cubic                          4
Name: vendor, dtype: int64

### Most popular products in general
* Metric: count_of_orgs_using_product counts the comma-separated entries in the organization_stack_components column for each product.
* Assume that each entry is a separate organization.
* Genfare Farebox (Unspecified) has 94 entries, making it the most popular product.

In [64]:
products[
    ["product_name", "vendor", "components", "count_of_orgs_using_product", "notes"]
].sort_values("count_of_orgs_using_product", ascending=False).head(10)

Unnamed: 0,product_name,vendor,components,count_of_orgs_using_product,notes
60,Genfare Farebox (Unspecified),Genfare,Cash Farebox,94.0,
212,Cubic NextBus Suite,Cubic,Real-time info,92.0,Link now leads to Umo Mobility Platform.\n
231,Trapeze Fixed Route Scheduling,Trapeze Group,"Run cutting,Driver Sign-up",74.0,
0,Avail - Unspecified,Avail Technologies Inc.,,64.0,"myAvail–the Enterprise Transit Management Software (ETMS) that empowers agencies to drastically improve efficiency, tracking, and compliance."
228,GMV/Syncromatics Sync,GMV Syncromatics Inc,"Real-time info,Mobile trip planning app",60.0,
111,Trillium GTFS Manager,Trillium Inc.,"GTFS generation,GTFS Schedule Publishing",58.0,
98,Excel,Microsoft,General Purpose Software,50.0,
52,Clever Devices - Unspecified,Clever Devices Ltd.,AVL Software,50.0,
6,In house activity,,,48.0,
223,Swiftly Transitime,Swiftly Inc.,"Real-time info,Arrival predictions,Alerts Content Management System,Social Alerts,Alerts Subscription Service,GTFS Alerts Publication",44.0,"Swiftly Transitime gives riders the very best in vehicle arrival predictions, and our APIs make it easy to connect them with whichever apps, websites, signage, or ADA-supportive media your riders use."


### Most popular products by component type and # of organizations 

In [107]:
popular_products = popular_products.loc[
    popular_products.groupby("components")["count_of_orgs_using_product"].idxmax()
]
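The `groupby(...).idxmax()` pattern above keeps, for each component, the single row with the highest organization count. A small sketch on hypothetical data (the `toy` frame and its values are made up):

```python
import pandas as pd

# Hypothetical products with their component and organization counts.
toy = pd.DataFrame(
    {
        "components": ["Real-time info", "Real-time info", "Cash Farebox"],
        "product_name": ["P1", "P2", "P3"],
        "count_of_orgs_using_product": [10, 25, 7],
    }
)

# idxmax returns, per group, the index label of the row with the largest
# value; .loc then pulls those rows out of the original frame.
top_idx = toy.groupby("components")["count_of_orgs_using_product"].idxmax()
top_per_component = toy.loc[top_idx]

print(top_per_component)
```

If two products tie within a component, `idxmax` keeps only the first one it encounters.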

In [110]:
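# Cast the columns first, then inspect the dtypes separately so that
# popular_products stays a DataFrame instead of being overwritten by .dtypes.
popular_products = popular_products.astype(
    {
        "components": "string",
        "product_name": "string",
        "count_of_orgs_using_product": "int64",
    }
)
popular_products.dtypes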

## Components

In [18]:
components = to_snakecase(pd.read_csv(f"{GCS_FILE_PATH}components-Grid view.csv"))

In [19]:
components.isna().sum()

name                               0
aliases                           95
system                            73
location                           1
function_group                     3
description                       88
products                          37
organization_stack_components     82
example_stacks                    94
example_stacks_copy              107
properties_+_features             96
dtype: int64

In [20]:
components.shape

(107, 11)

### Count number of products in each category 

In [21]:
# https://stackoverflow.com/questions/51502263/pandas-dataframe-object-has-no-attribute-str
components["count_of_products_in_categories"] = (
    components["products"]
    .str.split(",+")
    .str.len()
    .groupby(components.name)
    .transform("sum")
)

In [22]:
components.name.nunique()

107

In [66]:
components.loc[components["name"] == "Scheduling (Demand-Responsive)"]

Unnamed: 0,name,aliases,system,location,function_group,description,products,organization_stack_components,example_stacks,example_stacks_copy,properties_+_features,count_of_products_in_categories
73,Scheduling (Demand-Responsive),,Demand-Responsive Scheduling,Backoffice,Scheduling,,"Ecolane (Unspecified Model),TripShot - Unspecified",,,,,2.0


### 70 unique categories & top 10 "crowded" product categories
* Count number of strings in the "products" column and group by "name" column.
* Real-time info is the most "crowded" category with 32 different products.
* Most categories only have one product.
* The median category has 4 products, after filtering out categories with no products recorded.

In [27]:
f"{product_categories.name.nunique()} unique categories"

'70 unique categories'

In [29]:
f"Median number of different products in a category is {product_categories.count_of_products_in_categories.median()}"

'Median number of different products in a category is 4.0'

In [24]:
def bar_chart(df, x_col, y_col):
    chart = (
        alt.Chart(df)
        .mark_bar()
        .encode(
            x=x_col,
            y=y_col,
            color=alt.Color(x_col, scale=alt.Scale(scheme="tealblues")),
        )
    )
    return chart

In [25]:
# Keep the "name" column as-is: the other cells reference
# product_categories.name and plot the "name" column.
product_categories = components[
    ["name", "count_of_products_in_categories"]
].sort_values("count_of_products_in_categories", ascending=False)

In [26]:
# filter out any categories with 0 products
product_categories = product_categories[
    product_categories["count_of_products_in_categories"] > 0
]

In [69]:
most_saturated_category = product_categories.head(10)

In [31]:
bar_chart(most_saturated_category, "count_of_products_in_categories", "name")

### Function Groups
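* "Operations" is the largest function group, with 46 components.
* "Rider info" and "Rider Info" are counted as separate groups because of inconsistent capitalization.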

In [32]:
components.function_group.value_counts()

Operations         46
Rider info         20
Fare collection    10
Scheduling          7
Backoffice          6
Maintenance         6
IT                  4
Traffic             3
Reporting           1
Rider Info          1
Name: function_group, dtype: int64
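The split between "Rider info" and "Rider Info" in the counts above could be merged before counting. A minimal sketch on a hypothetical series, using an explicit mapping rather than a blanket case change, since something like `.str.capitalize()` would also turn "IT" into "It":

```python
import pandas as pd

# Hypothetical function_group values with inconsistent capitalization.
groups = pd.Series(["Rider info", "Rider Info", "Operations", "IT", "Rider info"])

# Map only the known variant onto the preferred label.
normalized = groups.replace({"Rider Info": "Rider info"})
counts = normalized.value_counts()

print(counts)
```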

## Contracts

In [33]:
contracts = (
    to_snakecase(pd.read_csv(f"{GCS_FILE_PATH}contracts-Grid view.csv"))
    .drop(columns=["attachments", "organization_stack_components", "name"])
    .rename(
        columns={
            "type_of_contract:_functional_category": "functional_category",
            "type_of_contract:_functions": "contract_type",
        }
    )
)

In [71]:
contracts.shape

(128, 11)

In [70]:
contracts.isna().sum()

contract_holder                0
contract_vendor                0
contract_name                 19
functional_category            0
contract_type                  0
start_date                     5
end_date                      90
renewal_option                 0
value                        119
notes                        110
duration_of_contract_year     90
dtype: int64

In [34]:
contracts.sample(2)

Unnamed: 0,contract_holder,contract_vendor,contract_name,functional_category,contract_type,start_date,end_date,renewal_option,value,notes
39,City of Turlock,Token Transit,Label used for the procurement.,"Onboard fares,Offboard fares","Alt fare validator,Mobile ticketing",2019-03-12,2020-03-11,,,
88,San Joaquin Regional Transit District,VenTek,Label used for the procurement.,Offboard fares,Ticket vending machines,2013-06-17,,,,


In [35]:
f"{ contracts.contract_holder.nunique()} organizations in contracts data set"

'51 organizations in contracts data set'

In [36]:
f"{ contracts.contract_vendor.nunique()} vendors in contracts data set"

'37 vendors in contracts data set'

### 125 contracts have no renewal option recorded; 3 auto-renew

In [38]:
contracts.renewal_option.value_counts()

None           125
Auto-renews      3
Name: renewal_option, dtype: int64

### For contracts with an end date, the median duration is about 3 years.
* Only 38 of the 128 rows have an end date populated (90 are null).

In [39]:
# Editing date time cols to the right data type
contracts = contracts.assign(
    start_date=pd.to_datetime(contracts.start_date, errors="coerce"),
    end_date=pd.to_datetime(contracts.end_date, errors="coerce"),
)

In [40]:
# new column for duration of contract year.
contracts["duration_of_contract_year"] = (
    (contracts["end_date"] - contracts["start_date"]).dt.days
) / 365

In [41]:
# Median contract length in years, for contracts with an end date
filtered_for_end_date = contracts[contracts["end_date"].notnull()]
filtered_for_end_date["duration_of_contract_year"].median()

3.0027397260273974
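A small worked example of the duration calculation, on made-up dates, to show how missing or unparseable dates drop out of the median. The `toy` frame is hypothetical; the first row reuses the 2019-03-12 to 2020-03-11 span seen in the sample above, which is exactly 365 days.

```python
import pandas as pd

# Hypothetical contracts: one complete, one with no end date,
# one with an unparseable start date.
toy = pd.DataFrame(
    {
        "start_date": ["2019-03-12", "2013-06-17", "not a date"],
        "end_date": ["2020-03-11", None, "2021-01-01"],
    }
)

# errors="coerce" turns anything unparseable into NaT instead of raising.
toy = toy.assign(
    start_date=pd.to_datetime(toy["start_date"], errors="coerce"),
    end_date=pd.to_datetime(toy["end_date"], errors="coerce"),
)

# Subtracting dates gives a timedelta; rows with NaT on either side become NaN.
toy["duration_of_contract_year"] = (toy["end_date"] - toy["start_date"]).dt.days / 365

# median() skips NaN by default, so only complete rows contribute.
print(toy["duration_of_contract_year"].median())
```

Dividing by 365 ignores leap days, which is why a span of exactly three calendar years can come out slightly above 3.0.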

### Separate contract type.
* There are 67 different contract types; splitting them on commas might make them easier to analyze.

In [37]:
f"{ contracts.contract_type.nunique()} unique contract types...various combos of stuff like GTFS, mobile ticketing, etc"

'67 unique contract types...various combos of stuff like GTFS, mobile ticketing, etc'

In [42]:
# https://stackoverflow.com/questions/52575290/how-to-separate-string-into-multiple-rows-in-pandas
contract_type = contracts["contract_type"].str.split(",")
cols = contracts.columns.difference(["contract_type"])

In [43]:
contracts_delinated = contracts.loc[
    contracts.index.repeat(contract_type.str.len()), cols
].assign(contract_type_use=list(chain.from_iterable(contract_type.tolist())))

In [73]:
contracts_delinated.loc[
    contracts_delinated["contract_holder"] == "Eastern Contra Costa Transit Authority"
]

Unnamed: 0,contract_holder,contract_name,contract_vendor,duration_of_contract_year,end_date,functional_category,notes,renewal_option,start_date,value,contract_type_use
44,Eastern Contra Costa Transit Authority,**Tri Delta Transit,AmericanEagle,,NaT,Offboard fares,,,2017-11-01,,Mobile ticketing
45,Eastern Contra Costa Transit Authority,**Tri Delta Transit,Connexionz Inc.,,NaT,"Onboard rider information,CAD/AVL,Scheduling",,,2009-05-29,,GTFS Generation
45,Eastern Contra Costa Transit Authority,**Tri Delta Transit,Connexionz Inc.,,NaT,"Onboard rider information,CAD/AVL,Scheduling",,,2009-05-29,,Vehicle Locations
45,Eastern Contra Costa Transit Authority,**Tri Delta Transit,Connexionz Inc.,,NaT,"Onboard rider information,CAD/AVL,Scheduling",,,2009-05-29,,Arrival predictions
45,Eastern Contra Costa Transit Authority,**Tri Delta Transit,Connexionz Inc.,,NaT,"Onboard rider information,CAD/AVL,Scheduling",,,2009-05-29,,Annunciator
45,Eastern Contra Costa Transit Authority,**Tri Delta Transit,Connexionz Inc.,,NaT,"Onboard rider information,CAD/AVL,Scheduling",,,2009-05-29,,Interior signage
46,Eastern Contra Costa Transit Authority,**Tri Delta Transit,TransTrack Solutions Group,,NaT,Reporting,,,2019-03-04,,Reporting software
47,Eastern Contra Costa Transit Authority,**Tri Delta Transit/Remix,Via Inc.,,NaT,Scheduling,,,2016-06-09,,Scheduling Software
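The repeat-and-chain approach above produces one row per contract element. As an alternative sketch (not the method used in this notebook), `DataFrame.explode` gives the same shape in fewer steps. The `toy` frame below is hypothetical.

```python
import pandas as pd

# Hypothetical contracts with comma-separated contract types.
toy = pd.DataFrame(
    {
        "contract_vendor": ["V1", "V2"],
        "contract_type": ["GTFS Generation,Vehicle Locations", "Mobile ticketing"],
    }
)

# Turn each string into a list, then give every list element its own row.
# The original index is repeated, as in the output above.
exploded = (
    toy.assign(contract_type_use=toy["contract_type"].str.split(","))
    .explode("contract_type_use")
    .drop(columns="contract_type")
)

print(exploded)
```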



### Most common contract element
* GTFS Generation is the most common contract element, followed by Vehicle Locations and Arrival Predictions.

In [None]:
most_common_contract_product = (
    contracts_delinated.contract_type_use.value_counts()
    .to_frame()
    .reset_index()
    .rename(
        columns={"index": "product_type", "contract_type_use": "number_of_contracts"}
    )
    .head(10)
)

In [46]:
bar_chart(most_common_contract_product, "number_of_contracts", "product_type")
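The `value_counts().to_frame().reset_index().rename(...)` chain used here relies on the column names produced by older pandas releases. From pandas 2.0, `value_counts` names its result `count` and puts the original name on the index, so the `"index"` rename no longer applies. A version-independent sketch on a hypothetical series:

```python
import pandas as pd

# Hypothetical exploded contract elements.
uses = pd.Series(
    ["GTFS Generation", "Vehicle Locations", "GTFS Generation"],
    name="contract_type_use",
)

# Name the index and the value column explicitly instead of relying on
# whatever default names value_counts() produces.
tidy = (
    uses.value_counts()
    .rename_axis("product_type")
    .reset_index(name="number_of_contracts")
)

print(tidy)
```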

### Most popular vendors by contract awarded

In [47]:
vendors2 = (
    contracts.contract_vendor.value_counts()
    .to_frame()
    .reset_index()
    .head(10)
    .rename(columns={"index": "vendor", "contract_vendor": "number_of_contracts"})
)

In [48]:
bar_chart(vendors2, "number_of_contracts", "vendor")

### Organizations with the most contracts

In [49]:
contract_holders = (
    contracts.contract_holder.value_counts()
    .to_frame()
    .reset_index()
    .rename(columns={"index": "holders", "contract_holder": "# contracts"})
    .head(10)
)

In [50]:
bar_chart(contract_holders, "# contracts", "holders")