### Exploratory Data Analysis of Ridesharing Dataset

#### Overview:
This Jupyter notebook contains code and analysis for exploring the New York City Ridesharing Dataset. The dataset consists of Uber pickup records, including the pickup time, longitude, latitude, and base code.

#### Dataset Used:
- Dataset Name: New York City Ridesharing Dataset
- Source: https://www.kaggle.com/datasets/fivethirtyeight/uber-pickups-in-new-york-city

#### Objective:
The primary goal of this notebook is to perform exploratory data analysis (EDA) on the New York City Ridesharing Dataset. This includes:
- Data cleaning and preprocessing
- Merging data files into training datasets
- Extracting insights and patterns from the data
- Rudimentary visualizations for quick analysis

#### Author:
- Name: Aden Letchworth
- Date: 12/17/2023




In [135]:
# Import Libraries

import pandas as pd
import plotly.express as px

import sys

# Custom Helper Functions

sys.path.append('../src')
import data_utils as ds 

In [136]:
# Load Dataset(s)

df = pd.read_csv('../data/raw/uber-raw-data-sep14.csv')

In [137]:
df.head()

          Date/Time      Lat      Lon    Base
0  9/1/2014 0:01:00  40.2201 -74.0021  B02512
1  9/1/2014 0:01:00  40.7500 -74.0027  B02512
2  9/1/2014 0:03:00  40.7559 -73.9864  B02512
3  9/1/2014 0:06:00  40.7450 -73.9889  B02512
4  9/1/2014 0:11:00  40.8145 -73.9444  B02512

In [138]:
print(df.columns)
print(df.dtypes)

Index(['Date/Time', 'Lat', 'Lon', 'Base'], dtype='object')
Date/Time     object
Lat          float64
Lon          float64
Base          object
dtype: object


In [139]:
# isnull() is an alias of isna(), so both checks report the same result
print(f'Null Values: {df.isnull().any().any()}, NA Values: {df.isna().any().any()}')

Null Values: False, NA Values: False


In [140]:
# Call the method (with parentheses) to get summary statistics for the numeric columns
df.describe()

In [141]:
# Call the method (with parentheses) to count how often each timestamp occurs
df['Date/Time'].value_counts()

#### Create Regex Function for Isolating the Date

Note: Fixed-width substrings do not work here because the date portion varies in length ('9/1/2014' versus '9/30/2014') and so does the time suffix ('0:01:00' versus '22:57:00'), so the date cannot be located by counting characters from either end of the string. A regex anchored at the start of the string is a simple way to extract the date regardless of its length.
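
To make the point above concrete, here is a minimal sketch comparing fixed-width slicing with an anchored regex on two timestamps taken from the dataset. The variable names are illustrative only, and splitting on the first space is included as a further option on the assumption that every value separates date and time with a single space.

```python
import re

samples = ['9/1/2014 0:01:00', '9/30/2014 22:57:00']

# Fixed-width slicing assumes every date is exactly 8 characters long
sliced = [text[:8] for text in samples]

# The anchored pattern accepts a one- or two-digit month and day
date_pattern = re.compile(r'^(\d{1,2}/\d{1,2}/\d{4})')
matched = [date_pattern.match(text).group(1) for text in samples]

# Splitting on the first space also works when the separator is consistent
split_first = [text.split(' ', 1)[0] for text in samples]

print(sliced)       # ['9/1/2014', '9/30/201']
print(matched)      # ['9/1/2014', '9/30/2014']
print(split_first)  # ['9/1/2014', '9/30/2014']
```

The slice truncates the year of the longer date, while the regex and the split both return the full date.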

In [142]:
# Kept here for documentation purposes; the working versions were moved to the data utility file.

import re

def find_date(text):
    # Capture a leading M/D/YYYY date with a one- or two-digit month and day
    date_pattern = r'^(\d{1,2}/\d{1,2}/\d{4})'
    match = re.search(date_pattern, text)
    return match.group(1) if match else None

def standardize_date(text):
    # Zero-pad the month and day so every date reads MM/DD/YYYY.
    # A callable replacement avoids templates such as r'\10\2', which re.sub
    # reads as a reference to group 10 rather than group 1 followed by '0'.
    def pad(match):
        month, day, year = match.groups()
        return f'{int(month):02d}/{int(day):02d}/{year}'

    # Text without a leading date is returned unchanged
    return re.sub(r'^(\d{1,2})/(\d{1,2})/(\d{4})', pad, text)

In [143]:
# Extract the date portion, then zero-pad it to MM/DD/YYYY
df['Date'] = df['Date/Time'].apply(ds.find_date)

df['Date'] = df['Date'].apply(ds.standardize_date)

In [144]:
# Perform frequency count of dates
date_counts = df['Date'].value_counts().reset_index()
date_counts.columns = ['Date', 'Frequency']

# Sort the date_counts DataFrame by 'Date' column in ascending order
date_counts = date_counts.sort_values('Date')

# Create Plotly bar chart for date frequency distribution
fig = px.bar(date_counts, x='Date', y='Frequency', labels={'Date': 'Date', 'Frequency': 'Frequency Count'})
fig.update_xaxes(type='category')  
fig.update_layout(title='Date Frequency Distribution (Sorted by Date)')
fig.show()

In [145]:
# Map each date string to its day of the week
df['Day'] = df['Date'].apply(ds.get_day_of_week_from_string)
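
The helper `ds.get_day_of_week_from_string` lives in the data utility file and is not shown in this notebook. As a rough sketch of what such a helper might look like, the hypothetical `day_of_week` function below parses a date string with the standard library and looks up the weekday name. The function name, the `DAY_NAMES` list, and the assumed 'MM/DD/YYYY' input format are illustrative assumptions, not the actual implementation.

```python
from datetime import datetime

# Index 0 is Monday, matching datetime.weekday()
DAY_NAMES = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']

def day_of_week(date_text):
    """Return the weekday name for a month/day/year date string (hypothetical helper)."""
    parsed = datetime.strptime(date_text, '%m/%d/%Y')
    return DAY_NAMES[parsed.weekday()]

print(day_of_week('09/01/2014'))  # Monday
print(day_of_week('09/30/2014'))  # Tuesday
```

Using a fixed list of names rather than `strftime('%A')` keeps the output independent of the system locale.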

In [146]:
# Perform frequency count of days
days_counts = df['Day'].value_counts().reset_index()
days_counts.columns = ['Day', 'Frequency']

# Create Plotly bar chart for day frequency distribution
fig = px.bar(days_counts, x='Day', y='Frequency', labels={'Day': 'Day', 'Frequency': 'Frequency Count'})
fig.update_xaxes(type='category')  
fig.update_layout(title='Day Frequency Distribution (Sorted by Frequency)')
fig.show()

In [150]:
files = ds.get_files_by_regex('../data/raw/uber-raw-data*.csv')

data_frames = []

for file in files:
    data_frames.append(pd.read_csv(file))

for data_frame in data_frames:
    print(data_frame.columns)

Index(['Date/Time', 'Lat', 'Lon', 'Base'], dtype='object')
Index(['Date/Time', 'Lat', 'Lon', 'Base'], dtype='object')
Index(['Dispatching_base_num', 'Pickup_date', 'Affiliated_base_num',
       'locationID'],
      dtype='object')
Index(['Date/Time', 'Lat', 'Lon', 'Base'], dtype='object')
Index(['Date/Time', 'Lat', 'Lon', 'Base'], dtype='object')
Index(['Date/Time', 'Lat', 'Lon', 'Base'], dtype='object')
Index(['Date/Time', 'Lat', 'Lon', 'Base'], dtype='object')


In [152]:
data_frames[2]

          Dispatching_base_num          Pickup_date  Affiliated_base_num  locationID
0                       B02617  2015-05-17 09:47:00               B02617         141
1                       B02617  2015-05-17 09:47:00               B02617          65
2                       B02617  2015-05-17 09:47:00               B02617         100
3                       B02617  2015-05-17 09:47:00               B02774          80
4                       B02617  2015-05-17 09:47:00               B02617          90
...                        ...                  ...                  ...         ...
14270474                B02765  2015-05-08 15:43:00               B02765         186
14270475                B02765  2015-05-08 15:43:00               B02765         263
14270476                B02765  2015-05-08 15:43:00               B02765          90
14270477                B02765  2015-05-08 15:44:00               B01899          45

In [None]:
# Drop the frame whose columns differ from the others (position 2 in the list)
del data_frames[2]
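
Removing the mismatched frame by position depends on the order in which the files happen to be listed. One alternative, sketched here under assumptions, is to check each file's header row before loading it. The names `EXPECTED_COLUMNS`, `read_header`, and `files_with_schema` are hypothetical and not part of `data_utils`.

```python
import csv
import glob

# Columns shared by the monthly pickup files, as printed earlier in the notebook
EXPECTED_COLUMNS = ['Date/Time', 'Lat', 'Lon', 'Base']

def read_header(path):
    """Return the column names from the first row of a CSV file."""
    with open(path, newline='') as handle:
        return next(csv.reader(handle), [])

def files_with_schema(pattern, expected=EXPECTED_COLUMNS):
    """Return sorted paths matching a glob pattern whose header equals `expected`."""
    return [path for path in sorted(glob.glob(pattern))
            if read_header(path) == expected]
```

The matching paths, for example from `files_with_schema('../data/raw/uber-raw-data*.csv')`, could then be loaded and stacked with `pd.concat([pd.read_csv(path) for path in files], ignore_index=True)`.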
