### Author: Allan R. Jeeboo
### Preferred Name: Vyncent S. A. van der Wolvenhuizen
### Affiliation: Data Science Student at Triple Ten
### Email: vanderwolvenhuizen.vyncent@proton.me
### Date started: 2025-03-06
### Last updated: 2025-03-06 09:27 EST

# 1.0 Introduction
In this project, we'll assume the fictional role of an analyst for Zuber, a new ride-sharing company launching in Chicago. Our task is to find patterns in the available data, understand passenger preferences, and assess the impact of external factors on rides.

### 1.1 Import Data
Let's import the necessary libraries and files for this project.

In [4]:
import pandas as pd 
import seaborn as sns 

company_trips_amount = pd.read_csv('company_name_trips_amount.csv')
dropoff_trips_avg = pd.read_csv('dropoff_trips_avg.csv')
pickup_weather_ride_duration = pd.read_csv('pickup_weather_ride_duration.csv')

Let's examine the first 10 rows of each dataset, along with the number of rows and columns in each.

In [8]:
print(company_trips_amount.head(10))
print(company_trips_amount.shape)

                        company_name  trips_amount
0                          Flash Cab         19558
1          Taxi Affiliation Services         11422
2                   Medallion Leasin         10367
3                         Yellow Cab          9888
4    Taxi Affiliation Service Yellow          9299
5          Chicago Carriage Cab Corp          9181
6                       City Service          8448
7                           Sun Taxi          7701
8          Star North Management LLC          7455
9  Blue Ribbon Taxi Association Inc.          5953
(64, 2)


In [9]:
print(dropoff_trips_avg.head(10))
print(dropoff_trips_avg.shape)

  dropoff_location_name  average_trips
0                  Loop   10727.466667
1           River North    9523.666667
2         Streeterville    6664.666667
3             West Loop    5163.666667
4                O'Hare    2546.900000
5             Lake View    2420.966667
6            Grant Park    2068.533333
7         Museum Campus    1510.000000
8            Gold Coast    1364.233333
9    Sheffield & DePaul    1259.766667
(94, 2)


In [10]:
print(pickup_weather_ride_duration.head(10))
print(pickup_weather_ride_duration.shape)

              start_ts weather_conditions  duration_seconds
0  2017-11-25 16:00:00               Good            2410.0
1  2017-11-25 14:00:00               Good            1920.0
2  2017-11-25 12:00:00               Good            1543.0
3  2017-11-04 10:00:00               Good            2512.0
4  2017-11-11 07:00:00               Good            1440.0
5  2017-11-11 04:00:00               Good            1320.0
6  2017-11-04 16:00:00                Bad            2969.0
7  2017-11-18 11:00:00               Good            2280.0
8  2017-11-11 14:00:00               Good            2460.0
9  2017-11-11 12:00:00               Good            2040.0
(1068, 3)


A few initial observations:
1. There are 64 listed taxi/ride-sharing companies, and 'Flash Cab' has noticeably more trips than its competitors (19558 versus 11422 for the runner-up).
2. There are 94 listed neighborhoods; 'Loop' and 'River North' appear to be the most popular drop-off locations.
3. We have data for 1 068 rides.
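The first observation can be made a little more concrete by comparing the leader with the runner-up. The snippet below is a minimal sketch that uses a toy frame (the name `sample` is an assumption for illustration) built from the first three rows printed above, not the full 64-row file.

```python
import pandas as pd

# Toy frame copied from the first rows printed above; the real file has 64 rows.
sample = pd.DataFrame({
    'company_name': ['Flash Cab', 'Taxi Affiliation Services', 'Medallion Leasin'],
    'trips_amount': [19558, 11422, 10367],
})

# Rank companies by trips, then compare the leader with the runner-up.
ranked = sample.sort_values('trips_amount', ascending=False).reset_index(drop=True)
lead_ratio = ranked.loc[0, 'trips_amount'] / ranked.loc[1, 'trips_amount']

print(f"{ranked.loc[0, 'company_name']} has {lead_ratio:.2f}x the trips of "
      f"{ranked.loc[1, 'company_name']}")
```

Running the same lines against `company_trips_amount` would give the ratio for the full dataset.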

### 1.2 Data Description 
Below is an explanation of what each column in each dataset represents.

company_trips_amount:

- **company_name**: Taxi company name.
- **trips_amount**: The number of rides for each taxi company on November 15-16, 2017.

dropoff_trips_avg:

- **dropoff_location_name**: Chicago neighborhoods where rides ended.
- **average_trips**: The average number of rides that ended in each neighborhood in November 2017.

pickup_weather_ride_duration:

- **start_ts**: Pickup date and time.
- **weather_conditions**: Weather conditions at the moment the ride started.
- **duration_seconds**: Ride duration in seconds.

# 2.0 Data Preprocessing

### 2.1 Checking company_trips_amount
Check for NaNs and ensure the dtypes are correct.

In [None]:
company_trips_amount.isna().sum()

company_name    0
trips_amount    0
dtype: int64

In [13]:
company_trips_amount.dtypes

company_name    object
trips_amount     int64
dtype: object
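Since the same two checks (NaN counts and dtypes) are run on every dataset, they could also be folded into one small helper. The following is only a sketch: `summarize` and `demo` are hypothetical names, and the toy frame deliberately contains one missing value so the count is visible.

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype and its NaN count."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'n_missing': df.isna().sum(),
    })

# Toy frame with one missing company name, for illustration only.
demo = pd.DataFrame({
    'company_name': ['Flash Cab', None],
    'trips_amount': [19558, 11422],
})

report = summarize(demo)
print(report)
```

Calling `summarize(company_trips_amount)` would show both checks in a single table.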

### 2.2 Checking dropoff_trips_avg
Check for NaNs and ensure the dtypes are correct.

In [23]:
dropoff_trips_avg.isna().sum()

dropoff_location_name    0
average_trips            0
dtype: int64

In [14]:
dropoff_trips_avg.dtypes

dropoff_location_name     object
average_trips            float64
dtype: object

Since you can't have a fraction of a trip, let's round the values to the nearest integer and convert the 'average_trips' column from float to int.

In [21]:
dropoff_trips_avg['average_trips'] = dropoff_trips_avg['average_trips'].round().astype(int)
dropoff_trips_avg.dtypes

dropoff_location_name    object
average_trips             int64
dtype: object
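One detail worth knowing about the rounding step: `Series.round()` follows NumPy and rounds exact halves to the nearest even integer rather than always up. None of the values printed earlier end in exactly .5, so this likely has no effect here, but the sketch below (with made-up values) shows the behavior.

```python
import pandas as pd

# Made-up values that sit exactly on a half, for illustration only.
halves = pd.Series([0.5, 1.5, 2.5, 3.5])

# round() uses round-half-to-even, so 0.5 -> 0 and 2.5 -> 2.
rounded = halves.round().astype(int)
print(rounded.tolist())
```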

### 2.3 Checking pickup_weather_ride_duration
Check for NaNs and ensure the dtypes are correct.

In [24]:
pickup_weather_ride_duration.isna().sum()

start_ts              0
weather_conditions    0
duration_seconds      0
dtype: int64

In [19]:
pickup_weather_ride_duration.dtypes

start_ts               object
weather_conditions     object
duration_seconds      float64
dtype: object

The 'duration_seconds' column is currently a float, which is unnecessary because every value in the column is a whole number. Let's change the dtype to int.

In [22]:
pickup_weather_ride_duration['duration_seconds'] = pickup_weather_ride_duration['duration_seconds'].astype(int)
pickup_weather_ride_duration.dtypes

start_ts              object
weather_conditions    object
duration_seconds       int64
dtype: object
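Because `astype(int)` silently truncates any fractional part, it can be worth confirming that the column really holds only whole numbers before casting. This is a minimal sketch on a toy Series (`durations` is a hypothetical name) built from three of the values printed earlier.

```python
import pandas as pd

# Three durations copied from the printed rows above.
durations = pd.Series([2410.0, 1920.0, 1543.0])

# True only if every value has no fractional part.
is_whole = bool((durations % 1 == 0).all())

# Cast only when nothing would be truncated.
if is_whole:
    durations = durations.astype(int)

print(is_whole, durations.tolist())
```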

There isn't much to comment on in this section. We have no NaN values, and the only changes made were converting the columns dropoff_trips_avg['average_trips'] and pickup_weather_ride_duration['duration_seconds'] from float to int.
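One dtype left untouched is 'start_ts', which is stored as object (plain strings). If later analysis needs date-based operations, such as filtering by weekday, it could be parsed with `pd.to_datetime`. The sketch below uses a hypothetical toy frame named `rides` and assumes the timestamp format matches the rows printed earlier.

```python
import pandas as pd

# Toy frame with two timestamps copied from the printed rows above.
rides = pd.DataFrame({
    'start_ts': ['2017-11-25 16:00:00', '2017-11-04 10:00:00'],
    'duration_seconds': [2410, 2512],
})

# Parse the strings into datetimes; the format string is an assumption
# based on how the values appear in the printed output.
rides['start_ts'] = pd.to_datetime(rides['start_ts'], format='%Y-%m-%d %H:%M:%S')

print(rides.dtypes)
print(rides['start_ts'].dt.day_name().tolist())
```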

# 3.0 Exploratory Data Analysis
In this section we'll perform some EDA. As before, we'll start with the company_trips_amount dataset, then move on to dropoff_trips_avg, and finish with pickup_weather_ride_duration. At the end we'll discuss what we've inferred and begin to formulate hypotheses.

### 3.1 Top 10 Drop-off Neighborhoods
First, let's identify the top 10 neighborhoods for drop-offs. dropoff_trips_avg should already be sorted by average_trips in descending order, but for good measure let's make sure that's the case.

In [34]:
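# Sort whole rows; assigning a sorted column back would realign on the index and change nothing.
dropoff_trips_avg = dropoff_trips_avg.sort_values('average_trips', ascending=False).reset_index(drop=True)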
dropoff_trips_avg.head(10)

  dropoff_location_name  average_trips
0                  Loop          10727
1           River North           9524
2         Streeterville           6665
3             West Loop           5164
4                O'Hare           2547
5             Lake View           2421
6            Grant Park           2069
7         Museum Campus           1510
8            Gold Coast           1364
9    Sheffield & DePaul           1260
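A note on the sorting step: in pandas, a sorted Series assigned back to its own column is realigned on the row index, so the row order does not change. Sorting the DataFrame itself reorders whole rows. The sketch below illustrates the difference on a hypothetical toy frame (`df`, `attempt`, and `sorted_df` are illustrative names) and uses `is_monotonic_decreasing` as a quick check.

```python
import pandas as pd

# Toy frame that is deliberately NOT sorted by trips.
df = pd.DataFrame({'name': ['a', 'b', 'c'], 'trips': [2, 9, 5]})

# Assigning a sorted Series back realigns on the index: nothing moves.
attempt = df.copy()
attempt['trips'] = attempt['trips'].sort_values(ascending=False)
print(attempt['trips'].tolist())

# Sorting the DataFrame reorders entire rows.
sorted_df = df.sort_values('trips', ascending=False).reset_index(drop=True)
print(sorted_df['name'].tolist())

# Quick check that the column is now in descending order.
print(sorted_df['trips'].is_monotonic_decreasing)
```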
