# OMNY Analysis

This notebook analyses transit data. Given a CSV dataset exported from the NYC MTA's OMNY fare system, it can determine information such as the time of day of trips, trips per week, most-visited stations, and more.

## Getting the CSV Data

- Go to https://omny.info/ and log in
- Go to Trips (https://omny.info/account/trips) and set the filter to the past 12 months (the maximum time range available)
- Scroll to the bottom of the page, click "Download trip history", pick CSV, and download
- (Optional) Manually merge this data with previously downloaded data using Excel or Sublime
  - Alternatively, Pandas can read multiple CSV files and keep only the unique rows
- Place the CSV files in the same directory as the Jupyter notebook, i.e. this Git repo
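
Once several exports sit next to the notebook, listing them by hand can get tedious. The sketch below is one possible way to discover them automatically. The helper name `find_trip_csvs` and the `trip-history*.csv` pattern are assumptions made for illustration; adjust the pattern to however the downloaded files are actually named.

```python
from pathlib import Path


def find_trip_csvs(directory='.', pattern='trip-history*.csv'):
    """Return the sorted paths (as strings) of files in directory matching pattern.

    Both the helper and the default pattern are hypothetical, not part of OMNY or Pandas.
    """
    return sorted(str(path) for path in Path(directory).glob(pattern))


# List the matching exports in the current working directory
print(find_trip_csvs())
```

The returned list can be passed to whatever code reads the CSV files, instead of a hard-coded list of file names.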

In [None]:
# Load libraries
# pd is the standard alias:
# https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html
import pandas as pd

## Loading the CSV Data

- Read the CSV files into Pandas as separate dataframes
- Merge them together using concat
- Get the unique rows based on Reference ID, in case the OMNY exports had overlapping data
- Resources
  - https://medium.com/@harryfry/combining-multiple-csv-files-into-one-with-pandas-97f631d67960
  - https://www.geeksforgeeks.org/how-to-merge-multiple-csv-files-into-a-single-pandas-dataframe/#
  - Use lower_case_with_underscore https://peps.python.org/pep-0008/#function-and-variable-names
  - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
  - https://pandas.pydata.org/docs/reference/api/pandas.concat.html
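
As a minimal sketch of the steps above, the following uses two small in-memory frames in place of real OMNY exports. The Reference values and station names are made up for illustration; only the Reference and Location column names come from the export format.

```python
import pandas as pd

# Two hypothetical exports whose date ranges overlap: Reference 102 appears in both
export_a = pd.DataFrame({'Reference': [101, 102],
                         'Location': ['Station A', 'Station B']})
export_b = pd.DataFrame({'Reference': [102, 103],
                         'Location': ['Station B', 'Station C']})

# Stack the frames and discard the original per-file index
combined = pd.concat([export_a, export_b], ignore_index=True)

# Keep the first occurrence of each Reference and renumber the index
unique_trips = combined.drop_duplicates(subset=['Reference'], ignore_index=True)

print(len(combined), len(unique_trips))
```

Deduplicating on Reference alone, rather than on the whole row, assumes that a given Reference always describes the same trip across exports.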

In [None]:
# Edit this to update the CSV data files to be considered
files_to_read = ['trip-history.csv','trip-history2.csv']

# Read from CSV and concatenate in one line
# Index is not important and can be ignored 
df = pd.concat(map(pd.read_csv, files_to_read), ignore_index=True)

print("Raw CSV Data has row count of", len(df))

# Reference is a unique ID per fare payment, and can be used to get unique rows
df = df.drop_duplicates(subset=['Reference'], ignore_index=True)

print("Unique CSV Data has row count of", len(df))


## Massage and Format the DataFrame

- When first read from CSV, most of the DataFrame's columns have the generic object dtype
```
Data columns (total 7 columns):
 #   Column             Dtype 
---  ------             ----- 
 0   Reference          int64 
 1   Transit Account #  int64 
 2   Trip Time          object
 3   Mode               object
 4   Location           object
 5   Product Type       object
 6   Fare Amount ($)    object
dtypes: int64(2), object(5)
```
- However, many of the fields are categorical, meaning there is a finite set of possible values. Converting them to the category dtype lets Pandas store and process them more efficiently
- Some fields are also date-based or numeric, and can be interpreted as such
- The column names contain whitespace and special characters, which makes them awkward to reference, so they are renamed
- Resources
  - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html
  - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html#pandas.DataFrame.info
  - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html
  - https://lifewithdata.com/2022/02/28/how-to-convert-a-string-column-to-float-in-pandas/
  - https://stackoverflow.com/questions/32464280/converting-currency-with-to-numbers-in-python-pandas
  - https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html
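
To make the benefit of these conversions concrete, here is a rough sketch on made-up data. The station names, fares, and timestamps are invented, and the exact memory figures will vary by Pandas version, so none are quoted here.

```python
import pandas as pd

# Made-up sample using three of the export's column names
sample = pd.DataFrame({
    'Location': ['Station A', 'Station B', 'Station A'] * 1000,
    'Fare Amount ($)': ['$2.75', '$2.75', '$0.00'] * 1000,
    'Trip Time': ['2022-07-19 20:50:29',
                  '2022-07-20 08:15:00',
                  '2022-07-21 17:30:45'] * 1000,
})

# Category stores each distinct value once, plus a small integer code per row
as_object = sample['Location'].memory_usage(deep=True)
as_category = sample['Location'].astype('category').memory_usage(deep=True)
print(as_object, as_category)

# regex=False treats '$' as a literal character rather than a regex anchor
fares = sample['Fare Amount ($)'].str.replace('$', '', regex=False).astype(float)

# Parse the timestamp strings into a datetime column
times = pd.to_datetime(sample['Trip Time'])
print(fares.dtype, times.dtype)
```

With numeric fares and real timestamps, operations such as summing costs or grouping trips by week become straightforward.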

In [None]:
# Rename columns to be easier to reference
df = df.rename(columns={'Reference':'trip_id',
                   'Transit Account #':'rider_id',
                   'Trip Time':'start_time',
                   'Mode':'transit_mode',
                   'Location':'start_location',
                   'Product Type':'product_type',
                   'Fare Amount ($)':'fare_cost'})

# Convert categorical fields from object to category
df = df.astype({'rider_id': 'category',
                'transit_mode': 'category',
                'start_location': 'category',
                'product_type': 'category'})

# Convert object column with values such as "$2.75" to a float
# regex=False strips the literal $ instead of treating it as a regex end-of-string anchor
df['fare_cost'] = df['fare_cost'].astype(str).str.replace('$', '', regex=False).astype(float)


# Convert Trip Time object column with value such as 2022-07-19 20:50:29 into a datetime type
df['start_time'] = pd.to_datetime(df['start_time'])



df.info(show_counts=False)
df