# Analyzing Taxi Ride Patterns and Weather Impact in Chicago


### Introduction
This project aims to analyze taxi ride data from Chicago to identify patterns and trends that can assist the ride-sharing company Zuber in planning strategies for its market entry. The study focuses on understanding passenger preferences, neighborhood activity, and the impact of weather conditions on ride durations. The findings will guide Zuber in tailoring marketing campaigns and operational strategies.

The analysis uses three datasets:
1. Taxi companies and the number of rides (Dataset: `taxi_df`).
2. Average drop-offs by neighborhood (Dataset: `neighborhood_df`).
3. Ride details such as start time, duration, and weather conditions (Dataset: `rides_df`).

## Project Outline

---

### Step 1: Data Loading and Initial Inspection
- Load the datasets: `taxi_df`, `neighborhood_df`, and `rides_df`.
- Display the structure of the data and inspect column names, data types, and missing values.

---

### Step 2: Data Cleaning and Preparation
- Rename columns to make them consistent and easier to use.
- Convert columns to appropriate data types (e.g., `datetime` for timestamps).
- Handle missing values appropriately.
- For `rides_df`, confirm that weather conditions are categorized as "Bad" or "Good."

---

### Step 3: Exploratory Data Analysis (EDA)
- **Taxi Companies (`taxi_df`)**: Analyze the number of rides by each company.
- **Neighborhoods (`neighborhood_df`)**: Identify the top 10 neighborhoods with the highest average drop-offs.
- **Visualization**:
  - Plot the number of rides by taxi company.
  - Plot the top 10 neighborhoods by average drop-offs.

---

### Step 4: Hypothesis Testing
- **Hypothesis**: Test whether the average duration of rides from the Loop to O'Hare changes on rainy Saturdays.
- Prepare the data by filtering rides in `rides_df` for Saturdays and categorizing them by weather condition.
- Use a two-sample t-test to determine if there's a significant difference in ride durations between "Good" and "Bad" weather conditions.

---

### Step 5: Conclusion
- Summarize key findings from the EDA and hypothesis testing.
- Provide actionable insights to guide Zuber’s strategy for entering the Chicago market.

---

### Dataset Overview:
1. **`taxi_df`**:
   - **Columns**: `company_name` (object), `trips_amount` (int).
   - Contains 64 rows with no missing values.

2. **`neighborhood_df`**:
   - **Columns**: `dropoff_location_name` (object), `average_trips` (float).
   - Contains 94 rows with no missing values.

3. **`rides_df`**:
   - **Columns**: `start_ts` (object), `weather_conditions` (object), `duration_seconds` (float).
   - Contains 1068 rows with no missing values.
   - `start_ts` will need conversion to `datetime`.
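
The conversion noted for `start_ts` can be sketched as follows. This is a minimal illustration on made-up rows, not the project data; the timestamp format string is an assumption about how the CSV stores its values.

```python
import pandas as pd

# Hypothetical rows that mimic the rides_df layout; the exact timestamp
# format of the real CSV is an assumption here.
sample = pd.DataFrame({
    "start_ts": ["2017-11-04 10:00:00", "2017-11-11 12:00:00", "2017-11-25 16:00:00"],
    "weather_conditions": ["Good", "Bad", "Good"],
    "duration_seconds": [2000.0, 2500.0, 2100.0],
})

# Parse the strings into datetime64 values
sample["start_ts"] = pd.to_datetime(sample["start_ts"], format="%Y-%m-%d %H:%M:%S")

# The .dt accessor only works once the column holds datetimes
sample["weekday"] = sample["start_ts"].dt.day_name()
print(sample.dtypes)
print(sample[["start_ts", "weekday"]])
```

Once the column is a datetime type, weekday and date-range filters become simple vectorized comparisons, which is what the hypothesis test relies on.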

---

### Next Steps:
1. Clean and preprocess the data:
   - Convert `start_ts` to `datetime`.
   - Ensure consistency in column naming.
   - Address any potential data type issues or anomalies.

## Step 1 
### Load the data

In [None]:
import pandas as pd
from matplotlib import pyplot as plt
from scipy import stats as st

In [None]:
# Try the working directory first, then fall back to /mnt/data
for data_dir in ('', '/mnt/data/'):
    try:
        taxi_df = pd.read_csv(data_dir + 'moved_project_sql_result_01.csv')
        neighborhood_df = pd.read_csv(data_dir + 'moved_project_sql_result_04.csv')
        rides_df = pd.read_csv(data_dir + 'moved_project_sql_result_07.csv')
        break
    except FileNotFoundError:
        continue
else:
    # Stop here instead of failing later with a NameError
    raise FileNotFoundError("Files not found. Please ensure the file paths are correct or upload the files.")


In [None]:
# The datasets were loaded in the previous cell (including its fallback path),
# so they are not re-read here.

# Convert 'start_ts' in rides_df to datetime and lowercase all column names
rides_df['start_ts'] = pd.to_datetime(rides_df['start_ts'])
taxi_df.columns = taxi_df.columns.str.lower()
neighborhood_df.columns = neighborhood_df.columns.str.lower()
rides_df.columns = rides_df.columns.str.lower()

print(rides_df.head())
print(taxi_df.head())
print(neighborhood_df.head())


In [None]:
rides_df.info()

taxi_df.info()

neighborhood_df.info()

## Step 2
### Data Cleaning

In [None]:
# Check for missing values in each dataset
rides_missing = rides_df.isnull().sum()
neighborhood_missing = neighborhood_df.isnull().sum()
taxi_missing = taxi_df.isnull().sum()

# Combine results into a summary DataFrame for better readability
missing_values_summary = {
    "Dataset": ["rides_df", "neighborhood_df", "taxi_df"],
    "Missing Values": [rides_missing.sum(), neighborhood_missing.sum(), taxi_missing.sum()]
}

missing_summary_df = pd.DataFrame(missing_values_summary)
missing_summary_df

None of the three datasets contains missing values, so no imputation or row removal is needed.
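
Missing values are only one kind of problem. A possible follow-up check for other anomalies is sketched below. The helper `quality_report` and the demo rows are hypothetical, and the rules it applies (only "Good" or "Bad" labels, strictly positive durations) are assumptions about what counts as valid in `rides_df`.

```python
import pandas as pd

def quality_report(df):
    """Count a few common anomalies in a rides-style DataFrame."""
    return {
        # Fully identical rows (may be legitimate if timestamps are rounded)
        "duplicate_rows": int(df.duplicated().sum()),
        # Weather labels other than the two expected categories
        "unexpected_weather": int((~df["weather_conditions"].isin(["Good", "Bad"])).sum()),
        # Durations that are zero or negative
        "non_positive_durations": int((df["duration_seconds"] <= 0).sum()),
    }

# Made-up rows: one exact duplicate, one odd label, one zero duration
demo = pd.DataFrame({
    "start_ts": ["2017-11-04 10:00:00", "2017-11-04 10:00:00",
                 "2017-11-11 12:00:00", "2017-11-18 08:00:00"],
    "weather_conditions": ["Good", "Good", "Rainy", "Bad"],
    "duration_seconds": [2000.0, 2000.0, 2500.0, 0.0],
})

report = quality_report(demo)
print(report)
```

Whether flagged rows should be dropped is a judgment call. Identical rows, for example, can be genuine separate rides if start times are recorded at a coarse resolution.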

## Step 3
### EDA

In [None]:
# Sort taxi_df by trips_amount in descending order
sorted_taxi_df = taxi_df.sort_values(by="trips_amount", ascending=False)

# Plot the number of rides for every taxi company, sorted in descending order
plt.figure(figsize=(10, 6))
plt.bar(sorted_taxi_df["company_name"], sorted_taxi_df["trips_amount"], color="skyblue")
plt.xticks(rotation=90)
plt.title("Number of Rides by Taxi Company (November 15-16, 2017)")
plt.xlabel("Taxi Company")
plt.ylabel("Number of Rides")
plt.tight_layout()
plt.show()


The bar chart above displays the number of rides for each taxi company from November 15–16, 2017. Flash Cab and Taxi Affiliation Services are among the top performers, indicating their dominance in the market.

In [None]:
# Sort neighborhood_df by average_trips in descending order
top_neighborhoods = neighborhood_df.sort_values(by="average_trips", ascending=False).head(10)

# Plot the top 10 neighborhoods by average drop-offs
plt.figure(figsize=(10, 6))
plt.bar(top_neighborhoods["dropoff_location_name"], top_neighborhoods["average_trips"], color="orange")
plt.xticks(rotation=45, ha="right")
plt.title("Top 10 Neighborhoods by Average Drop-offs (November 2017)")
plt.xlabel("Neighborhood")
plt.ylabel("Average Trips")
plt.tight_layout()
plt.show()


The bar chart highlights the top 10 neighborhoods with the highest average drop-offs during November 2017. The Loop and River North stand out as the most popular drop-off locations.

### **Summary of EDA Findings**

#### **1. Taxi Companies (`taxi_df`)**
- Flash Cab is the leading company, with **19,558 rides**, followed by Taxi Affiliation Services with **11,422 rides**.
- A small number of companies account for a disproportionate share of rides, so Zuber should study the competitive strategies of these key players.
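
One way to quantify how concentrated the market is would be to compute the share of all rides handled by the top `n` companies. The sketch below uses a hypothetical helper `top_share`. The first two ride counts are the ones reported above; the remaining rows are placeholders, so the printed share is illustrative only.

```python
import pandas as pd

def top_share(df, n, value_col="trips_amount"):
    """Return the fraction of all trips handled by the n largest companies."""
    total = df[value_col].sum()
    return df[value_col].nlargest(n).sum() / total

# Flash Cab and Taxi Affiliation Services use the counts from the summary;
# "Company C" and "Company D" are made-up placeholders.
demo = pd.DataFrame({
    "company_name": ["Flash Cab", "Taxi Affiliation Services", "Company C", "Company D"],
    "trips_amount": [19558, 11422, 5000, 4020],
})

share = top_share(demo, 2)
print(f"Top 2 share of rides: {share:.3f}")
```

Applied to the full `taxi_df`, the same function would give the actual share held by the leading companies.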

#### **2. Neighborhoods (`neighborhood_df`)**
- The top neighborhoods by average drop-offs include:
  1. **Loop**: 10,727.47 average drop-offs.
  2. **River North**: 9,523.67 average drop-offs.
  3. **Streeterville**: 6,664.67 average drop-offs.
- Central business districts like the Loop and River North dominate drop-offs, suggesting they are key areas of activity.

#### **Visual Insights**
- The taxi company distribution shows a clear disparity in the number of rides, with a few companies significantly outperforming others.
- Neighborhood analysis confirms that commercial and downtown areas are the primary hubs for drop-offs.

---

### **Key Takeaways**
1. **For Zuber**:
   - Focus advertising and operational strategies around neighborhoods with high drop-off averages, such as the Loop and River North.
   - Study the practices of dominant taxi companies like Flash Cab to design competitive offerings.

2. **General Trends**:
   - Central neighborhoods drive most of the demand.
   - A few companies control the majority of the market, presenting both challenges and opportunities for market entry.

---

## Step 4
### Hypothesis Testing
This step examines the impact of weather conditions on ride durations.

#### **Hypothesis**
- **Null Hypothesis (\(H_0\))**: The average duration of rides from the Loop to O'Hare International Airport does not change on rainy Saturdays.
- **Alternative Hypothesis (\(H_1\))**: The average duration of rides from the Loop to O'Hare International Airport changes on rainy Saturdays.




In [None]:
# Ensure 'start_ts' is datetime (a no-op if it was already converted in Step 1)
rides_df['start_ts'] = pd.to_datetime(rides_df['start_ts'])

# Filter rides for Saturdays in November 2017
rides_saturdays = rides_df[
    (rides_df['start_ts'].dt.dayofweek == 5) &  # Monday=0, so Saturday=5
    (rides_df['start_ts'] >= "2017-11-01") &
    (rides_df['start_ts'] < "2017-12-01")  # exclusive bound keeps all of November 30
]

# Compare "Good" and "Bad" weather conditions for ride durations
good_weather_durations = rides_saturdays[rides_saturdays['weather_conditions'] == 'Good']['duration_seconds']
bad_weather_durations = rides_saturdays[rides_saturdays['weather_conditions'] == 'Bad']['duration_seconds']

# Perform t-test using scipy.stats
t_stat, p_value = st.ttest_ind(good_weather_durations, bad_weather_durations, equal_var=False)

t_stat, p_value


### **Hypothesis Testing Summary**

---

#### **Objective**
To test whether weather conditions impact the average duration of taxi rides from the Loop to O'Hare International Airport on Saturdays in November 2017.

---

#### **Hypothesis**
- **Null Hypothesis (\(H_0\))**: The average duration of rides from the Loop to O'Hare does not differ between "Good" and "Bad" weather conditions.
- **Alternative Hypothesis (\(H_1\))**: The average duration of rides from the Loop to O'Hare differs between "Good" and "Bad" weather conditions.

---

#### **Statistical Test**
- **Test Used**: Two-sample t-test (independent samples, unequal variance).
- **Significance Level (\(\alpha\))**: 0.05.
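
The choice of the unequal-variance (Welch) form can be checked with a test for equality of variances. The sketch below runs Levene's test and then both t-test variants on synthetic durations. The numbers are randomly generated, not the project's data, and the group sizes and spreads are arbitrary assumptions.

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(0)

# Synthetic ride durations in seconds; NOT the project's data
good = rng.normal(loc=2000, scale=700, size=200)
bad = rng.normal(loc=2400, scale=500, size=60)

# Levene's test: a small p-value suggests the group variances differ
levene_stat, levene_p = st.levene(good, bad)

# Welch's t-test does not assume equal variances
welch = st.ttest_ind(good, bad, equal_var=False)

# Student's t-test pools the variances; shown only for comparison
pooled = st.ttest_ind(good, bad, equal_var=True)

print(f"Levene p-value: {levene_p:.4f}")
print(f"Welch:  t={welch.statistic:.2f}, p={welch.pvalue:.2e}")
print(f"Pooled: t={pooled.statistic:.2f}, p={pooled.pvalue:.2e}")
```

With unequal group sizes, Welch's test is a common default because it remains valid whether or not the variances match.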

---

#### **Data Preparation**
1. Filtered the dataset for rides:
   - That started on Saturdays in November 2017.
   - With valid weather conditions categorized as "Good" or "Bad."

2. Grouped ride durations by weather condition:
   - "Good" Weather Group.
   - "Bad" Weather Group.
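
The preparation steps above can be sketched as a single function. `split_saturday_durations` is a hypothetical helper, and the demo rows are made up; only the column names follow `rides_df`.

```python
import pandas as pd

def split_saturday_durations(df, start="2017-11-01", end="2017-12-01"):
    """Return {weather label: durations} for Saturday rides in [start, end)."""
    ts = pd.to_datetime(df["start_ts"])
    mask = (
        (ts.dt.dayofweek == 5)            # Monday=0, so Saturday=5
        & (ts >= pd.Timestamp(start))
        & (ts < pd.Timestamp(end))        # exclusive upper bound
    )
    saturdays = df.loc[mask]
    return {
        label: group["duration_seconds"]
        for label, group in saturdays.groupby("weather_conditions")
    }

# Made-up rows: a Saturday, a Sunday, a Saturday, and a December Saturday
demo = pd.DataFrame({
    "start_ts": ["2017-11-04 10:00:00", "2017-11-05 10:00:00",
                 "2017-11-11 12:00:00", "2017-12-02 09:00:00"],
    "weather_conditions": ["Good", "Good", "Bad", "Good"],
    "duration_seconds": [2000.0, 1800.0, 2500.0, 2100.0],
})

groups = split_saturday_durations(demo)
print({label: values.tolist() for label, values in groups.items()})
```

Only the two November Saturdays survive the filter; the Sunday and the December ride are dropped.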

---

#### **Results**
- **t-statistic**: \(-7.19\)
- **p-value**: \(6.74 \times 10^{-12}\)
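
The decision rule applied to these results can be written out explicitly. The sketch below plugs in the reported values and assumes, as in the test cell, that the "Good" group was passed to `ttest_ind` first, so the sign of the t-statistic reflects the "Good" mean minus the "Bad" mean.

```python
alpha = 0.05          # significance level used in this analysis
t_stat = -7.19        # reported t-statistic (Good minus Bad)
p_value = 6.74e-12    # reported p-value

# Reject the null hypothesis when the p-value falls below alpha
reject_null = p_value < alpha

# A negative statistic means the first group (Good) has the smaller mean
longer_group = "Bad" if t_stat < 0 else "Good"

print(f"Reject H0: {reject_null}")
print(f"Group with longer average duration: {longer_group}")
```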

---

#### **Conclusion**
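- The p-value is far below the significance level of 0.05, so we reject the null hypothesis: average ride durations differ significantly between "Good" and "Bad" weather conditions on Saturdays.
- The t-statistic is negative with the "Good" group passed first, which means the "Good" mean is lower. **Bad weather** is therefore associated with **longer ride durations**, likely due to slower traffic or cautious driving during adverse weather conditions.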

---

#### **Actionable Insights for Zuber**
1. **Operational Planning**:
   - Adjust estimated travel times during bad weather conditions to provide accurate expectations to passengers.
   - Allocate additional resources (drivers, vehicles) to account for delays caused by bad weather.

2. **Dynamic Pricing**:
   - Consider implementing surge pricing during bad weather conditions to account for longer durations and operational challenges.

3. **Marketing Strategy**:
   - Promote Zuber as a reliable service during bad weather conditions to encourage customer trust and loyalty.


### **Overall Conclusion**

The analysis of Chicago taxi ride data highlights key trends and actionable insights for Zuber:

1. **Market Leaders**:
   - Flash Cab and Taxi Affiliation Services dominate the market with the highest ride counts. Zuber should analyze their strategies for effective competition.

2. **Key Neighborhoods**:
   - The Loop and River North are the most popular drop-off locations, indicating they are critical areas for targeted marketing and operations.

3. **Weather Impact**:
   - Saturday rides from the Loop to O'Hare take significantly longer in bad weather, likely due to traffic delays. Zuber should account for this in pricing, resource allocation, and customer communications.

These findings provide a solid foundation for Zuber to design data-driven strategies for entering the Chicago market.