# Exploratory Data Analysis

In this project, we embark on a comprehensive exploration of taxi ride data in Chicago, focusing on November 15-16, 2017. We are equipped with multiple datasets that provide insights into various aspects of taxi services in the city. Our journey begins with exploratory data analysis, where we delve into the characteristics of taxi companies and neighborhoods, identifying patterns and trends. Subsequently, we transition into hypothesis testing, aiming to uncover whether weather conditions, particularly rainy Saturdays, influence the average duration of rides from the Loop to O'Hare International Airport. 

Through meticulous data analysis and hypothesis testing, we aim to draw meaningful conclusions that shed light on the dynamics of taxi services in Chicago. Our approach is grounded in Python programming, leveraging data visualization and statistical methods to derive actionable insights. Let's embark on this analytical expedition to unveil the nuances of Chicago's taxi landscape.

# Importing Libraries and Loading Files

We'll begin by importing the necessary libraries and loading the provided CSV files. Then, we'll explore the data, ensuring correct data types and conducting preliminary analysis to identify the top 10 neighborhoods by drop-offs. Subsequently, we'll create visualizations to depict the distribution of rides among taxi companies and the top 10 neighborhoods by drop-offs.

## Importing

In [7]:
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import ttest_ind

## Loading Data

In [9]:
# Importing the CSV files
try:
    taxi_df = pd.read_csv("/datasets/project_sql_result_01.csv")
    neighborhood_df = pd.read_csv("/datasets/project_sql_result_04.csv")
    data = pd.read_csv("/datasets/project_sql_result_07.csv")
    
except FileNotFoundError:
    print('input file not found')    

input file not found


In [10]:
# Displaying the first few rows of each dataset
print("Taxi Company Data:")
taxi_df.head()

Taxi Company Data:


NameError: name 'taxi_df' is not defined
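
The `NameError` above follows from the loading cell: when a file is missing, the `except` branch only prints a message, so the DataFrames are never created and later cells fail. A minimal sketch of one alternative, assuming a hypothetical helper `load_csv` (not part of the original notebook) that stops immediately and names the missing path:

```python
from pathlib import Path

import pandas as pd


def load_csv(path):
    """Read a CSV file, failing right away with a clear message if it is missing."""
    path = Path(path)
    if not path.exists():
        # Stop here instead of continuing with undefined DataFrames.
        raise FileNotFoundError(f"Input file not found: {path}")
    return pd.read_csv(path)


# Intended usage in the loading cell (paths as in the notebook):
# taxi_df = load_csv("/datasets/project_sql_result_01.csv")
```

With this approach the failure surfaces in the loading cell itself rather than in the first cell that uses `taxi_df`.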

In [None]:
print("Neighborhood Data:")
neighborhood_df.head()

# Studying and Exploring Data

Performing an exploratory data analysis aids in comprehending the diversity of data types present, identifying and managing duplicates effectively, as well as illuminating the extent and implications of missing values within the dataset.

## Exploring Taxi Dataframe

In [None]:
taxi_df.describe()

In [None]:
taxi_df.info()

### Checking for Null and Duplicate Values on the Taxi Dataframe

In [None]:
# Check for duplicates
print('Duplicate Values:', taxi_df.duplicated().sum())

In [None]:
# Check for null values
print('Null Values:')
taxi_df.isnull().sum()

## Exploring Neighborhood Dataframe

In [None]:
neighborhood_df.describe()

In [None]:
neighborhood_df.info()

### Checking for Null and Duplicate Values on the Neighborhood Dataframe

In [None]:
# Check for duplicates
print('Duplicate Values:', neighborhood_df.duplicated().sum())

In [None]:
# Check for null values
print('Null Values:')
neighborhood_df.isnull().sum()

### Correcting Data types

We will be rounding the 'average_trips' column to two decimals. This can be beneficial for several reasons:

Improved Readability: Rounding to two decimal places simplifies the values, making them easier to read and understand.

Enhanced Interpretation: Rounding helps in emphasizing the significant digits and removes excessive precision that may not be relevant or meaningful for interpretation.

Consistency: Rounding the 'average_trips' column to a consistent number of decimal places ensures uniformity across the dataset. Consistency in presentation facilitates comparison between different values within the same column and across different datasets or visualizations.

Overall, rounding the 'average_trips' column to two decimals promotes clarity, simplicity, and consistency in data presentation, leading to better readability, interpretation, and communication of insights derived from the dataset.
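
To illustrate the difference between rounding and an integer cast (the alternative that is commented out in the cell below), here is a small sketch. The values are hypothetical and chosen only for illustration:

```python
import pandas as pd

# Hypothetical values for illustration; not taken from the project data.
sample = pd.DataFrame({"average_trips": [10727.466667, 9523.666667, 3.5]})

# round(2) keeps the float64 dtype and two decimal places.
rounded = sample["average_trips"].round(2)

# astype('int64') changes the dtype and drops the fractional part entirely.
truncated = sample["average_trips"].astype("int64")

print(rounded.dtype, rounded.tolist())
print(truncated.dtype, truncated.tolist())
```

Rounding changes only the displayed precision, whereas the integer cast discards information, which matters for neighborhoods with small average trip counts.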

In [None]:
# Rounding 'average_trips' to two decimal places
#neighborhood_df['average_trips'] = neighborhood_df['average_trips'].astype('int64')
neighborhood_df['average_trips'] = neighborhood_df['average_trips'].round(2)

In [None]:
# Displaying changes
neighborhood_df.head()

## Identifying Top 10 Neighborhoods

In [None]:
top_10_neighborhoods = neighborhood_df.sort_values(by='average_trips', ascending=False).head(10)
top_10_neighborhoods

The data reveals that popular tourist destinations such as the Loop and River North experience exceptionally high average trip volumes, suggesting a high demand for transportation services in these areas. Conversely, neighborhoods like Sheffield & DePaul and Gold Coast, while still significant, exhibit comparatively lower average trip numbers, indicating potentially lower demand or different usage patterns. This insight could inform transportation companies in allocating resources and tailoring services to meet varying demand levels across different areas of the city.

## Visualizing Taxi Activity

In [None]:
# Identifying Top 10 taxi companies by number of rides
top_taxi = taxi_df.sort_values(by='trips_amount', ascending=False).head(10)
top_taxi

In [None]:
colors = ['#FF5733', '#FFC300']
plt.figure(figsize=(10, 6))
plt.barh(top_taxi['company_name'], top_taxi['trips_amount'], color=colors)  # Specify the colors
plt.xticks(rotation=90)
plt.ylabel('Taxi Company')
plt.xlabel('Number of Rides')
plt.title('Top 10 Taxi Companies by Number of Rides')
plt.tight_layout()
plt.show()

This graph shows the distribution of rides among the top 10 taxi companies.

Market Dominance of Established Players: Flash Cab emerges as the clear leader in terms of trip volume, significantly outpacing other companies. This suggests a strong market presence and customer preference for this particular taxi service.

Variety in Market Share: While Flash Cab leads in trip numbers, other companies like Taxi Affiliation Services, Medallion Leasing, and Yellow Cab also command substantial shares of the market. This diversity indicates a competitive landscape with multiple players catering to the transportation needs of the city.

Potential Areas for Growth: Companies with lower trip counts, such as Star North Management LLC and Blue Ribbon Taxi Association Inc., may have opportunities for growth and expansion. Understanding the factors contributing to their lower trip volumes could inform strategic initiatives aimed at increasing market share and improving competitiveness.

Additional Observations:

Importance of Brand Recognition: Recognizable brands like Yellow Cab and Chicago Carriage Cab Corp maintain solid positions within the market, likely due in part to their established brand reputation and customer trust. This underscores the importance of brand recognition and customer loyalty in the taxi industry.

Customer Preferences and Service Quality: Variations in trip amounts among different companies may also reflect differences in service quality, pricing strategies, coverage areas, and customer satisfaction levels. Analyzing customer feedback and preferences could provide insights into areas for improvement and competitive advantages for taxi companies.
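
The market-share observations above can be made more concrete by expressing trip counts as shares. A minimal sketch, using the four trip counts reported in this project's conclusion; the `share` column is an illustrative addition and is computed within this four-company subset only, not across the full market:

```python
import pandas as pd

# Trip counts as reported in this project's conclusion for four leading companies.
top_companies = pd.DataFrame({
    "company_name": [
        "Flash Cab",
        "Taxi Affiliation Services",
        "Medallion Leasing",
        "Yellow Cab",
    ],
    "trips_amount": [19558, 11422, 10367, 9888],
})

# Share of trips within this subset (an assumption of the sketch, not the whole market).
total_trips = top_companies["trips_amount"].sum()
top_companies["share"] = top_companies["trips_amount"] / total_trips

print(top_companies.sort_values("share", ascending=False))
```

In the notebook, the same calculation could be applied to `taxi_df['trips_amount']` to obtain shares across all companies.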

In [None]:
# Graph 2: Top 10 neighborhoods by number of drop-offs
plt.figure(figsize=(10, 6))
plt.bar(top_10_neighborhoods['dropoff_location_name'], top_10_neighborhoods['average_trips'])
plt.xticks(rotation=90)
plt.xlabel('Neighborhood')
plt.ylabel('Average Number of Drop-offs')
plt.title('Top 10 Neighborhoods by Drop-offs')
plt.tight_layout()
plt.show()

This graph highlights the top 10 neighborhoods in terms of drop-offs.

Urban Center Dominance: The top dropoff locations, such as Loop, River North, and Streeterville, reflect the dominance of urban centers in attracting taxi trips. These areas likely host a concentration of commercial, entertainment, and residential activities, resulting in high transportation demand.

Tourist and Business Hubs: Locations like Loop and River North, known for their bustling commercial districts and tourist attractions, exhibit the highest average trip numbers. This suggests a significant portion of taxi usage is driven by tourism, business travel, and commuting to work or entertainment venues.

Transportation Nodes: O'Hare's presence among the top dropoff locations indicates its importance as a major transportation hub. The high average trip count to O'Hare suggests a significant volume of taxi trips associated with air travel, including passenger dropoffs and pickups.

Residential and Recreational Areas: Dropoff locations like Lake View and Grant Park suggest that residential areas and recreational spaces also contribute to taxi usage. These areas may represent destinations for leisure activities, residential dropoffs, or transportation hubs connecting to other parts of the city.

Business Insights: Analyzing average trip numbers for different locations can provide valuable insights for taxi companies in resource allocation, pricing strategies, and service coverage. Understanding the demand patterns across various dropoff locations enables companies to optimize their operations and meet customer needs effectively.

Seasonal Variations: While not explicitly shown in the provided data, analyzing average trip numbers over time could reveal seasonal variations in transportation demand. For example, tourist-heavy areas might experience peaks during holiday seasons or summer months, while business districts might exhibit more consistent demand throughout the year.

PS. Some of these insights draw on web research beyond the data provided, carried out to gain a deeper understanding of the areas under investigation.

## Testing hypotheses 

To test the hypothesis "The average duration of rides from the Loop to O'Hare International Airport changes on rainy Saturdays," we need to follow these steps:

Formulate Null and Alternative Hypotheses:
Null Hypothesis (H0): The average duration of rides from the Loop to O'Hare International Airport is the same on rainy Saturdays as on non-rainy Saturdays.

Alternative Hypothesis (H1): The average duration of rides from the Loop to O'Hare International Airport differs on rainy Saturdays compared to non-rainy Saturdays.

Select Significance Level (alpha):
The significance level (alpha) is the threshold against which the p-value is compared. The p-value is the probability of observing data at least as extreme as ours if the null hypothesis is true. A common choice for alpha is 0.05, which means accepting a 5% chance of rejecting the null hypothesis when it is actually true.

Criterion for Testing Hypotheses:
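We'll use a two-sample t-test (`ttest_ind` from `scipy.stats`) to compare the mean ride durations on rainy and non-rainy Saturdays. This test is appropriate for comparing the means of two independent samples. It assumes that the data in each group are approximately normally distributed and, with the default `equal_var=True`, that the two groups have equal variances.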
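
One way to check the variance assumption before running the t-test is Levene's test. The sketch below is illustrative only: the durations are synthetic, and choosing `equal_var` from the Levene result is one possible convention, not necessarily the approach used in this notebook.

```python
import numpy as np
from scipy.stats import levene, ttest_ind

rng = np.random.default_rng(0)

# Synthetic ride durations in seconds, for illustration only.
rainy = rng.normal(loc=2400, scale=700, size=150)
dry = rng.normal(loc=2000, scale=750, size=800)

alpha = 0.05

# Levene's test: the null hypothesis is that both groups have equal variance.
_, levene_p = levene(rainy, dry)
equal_var = bool(levene_p >= alpha)

# With equal_var=False, ttest_ind performs Welch's t-test instead of Student's.
t_stat, p_value = ttest_ind(rainy, dry, equal_var=equal_var)

print(f"Levene p={levene_p:.3f}, equal_var={equal_var}, t-test p={p_value:.3e}")
```

In the notebook, `rainy` and `dry` would be replaced by the `duration_seconds` columns of the two filtered DataFrames.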

## Exploratory Data Analysis on the Data from last query

In [None]:
# Displaying dataset
data.sample(5)

In [None]:
data['weather_conditions'].unique()

We are examining the `weather_conditions` column to identify all distinct values.

In [None]:
data.describe()

In [None]:
print('Total duplicates:',data.duplicated().sum())

We have identified a total of 197 duplicates in our dataset. We will investigate these duplicates to ascertain whether they are necessary for our dataset.

In [None]:
duplicates = data[data.duplicated()]
duplicates.head(5)

Because `duplicated()` flags rows that match on every column, these rows repeat the same `start_ts`, `weather_conditions`, and `duration_seconds` values. Distinct rides can plausibly share all three (for example, if start times are recorded at a coarse granularity), so we treat them as separate rides and retain them.
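
A small sketch of how `duplicated()` behaves with and without the `subset` argument may help when interpreting the count above. The rows are hypothetical and only mimic the column names used in this dataset:

```python
import pandas as pd

# Hypothetical rows for illustration; not taken from the project data.
rides = pd.DataFrame({
    "start_ts": ["2017-11-04 10:00:00", "2017-11-04 10:00:00", "2017-11-11 10:00:00"],
    "weather_conditions": ["Good", "Good", "Good"],
    "duration_seconds": [1800, 1800, 1800],
})

# Default: a row is a duplicate only if every column matches an earlier row.
full_row = rides.duplicated()

# With subset: only the listed column has to match an earlier row.
duration_only = rides.duplicated(subset=["duration_seconds"])

print(full_row.tolist())
print(duration_only.tolist())
```

The third row shares its duration with the others but has a different `start_ts`, so it is flagged only by the `subset` check.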

In [None]:
data.isnull().sum()

We identified no null values in our dataset.

In [None]:
data.info()

We have identified that the `start_ts` data type is currently designated as `object`. It is imperative to convert this data type to 'datetime' format.

## Converting data to correct data types

In [None]:
data['start_ts'] = pd.to_datetime(data['start_ts'])

Now that we have successfully converted our `start_ts` to the appropriate data type, we will proceed to filter our data for both rainy and non-rainy Saturdays. This step is essential for further evaluating our hypothesis.
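
As a quick check of the weekday filter used in the next step, the sketch below converts a few hypothetical timestamps and shows how `dt.dayofweek` numbers the days (Monday is 0, so Saturday is 5):

```python
import pandas as pd

# Hypothetical timestamps for illustration; not taken from the project data.
ts = pd.to_datetime(pd.Series([
    "2017-11-03 23:00:00",
    "2017-11-04 09:00:00",
    "2017-11-05 09:00:00",
]))

# dt.dayofweek counts from Monday=0, so Saturday maps to 5.
day_numbers = ts.dt.dayofweek.tolist()
day_names = ts.dt.day_name().tolist()

print(day_numbers)
print(day_names)
```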

## Filtering Data and Performing the Hypothesis Test

With the dataset loaded, we now filter it to separate rides on rainy Saturdays from rides on non-rainy Saturdays.

We then will use the `ttest_ind` function from `scipy.stats` module to perform a two-sample t-test. Finally, we will compare the obtained `p-value` with the chosen significance level (alpha) to decide whether to reject the null hypothesis or not. If the `p-value` is less than `alpha`, we reject the null hypothesis, indicating that there is a significant difference in ride durations between rainy and non-rainy Saturdays. Otherwise, we fail to reject the null hypothesis.

In [None]:
# Filter data for rainy Saturdays and non-rainy Saturdays
rainy_saturdays = data[(data['weather_conditions'] == 'Bad') & (data['start_ts'].dt.dayofweek == 5)]
non_rainy_saturdays = data[(data['weather_conditions'] != 'Bad') & (data['start_ts'].dt.dayofweek == 5)]

In [None]:
#added by reviewer

display(len(rainy_saturdays), len(non_rainy_saturdays))

In [None]:
# Perform t-test
t_stat, p_value = ttest_ind(rainy_saturdays['duration_seconds'], non_rainy_saturdays['duration_seconds'])

In [None]:
print('p-value:', p_value)

The printed p-value is in scientific notation (its leading digits are 6.52, followed by a negative exponent), so it is negligibly small and far below the conventional significance level of 0.05. Consequently, we reject the null hypothesis.
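
Very small p-values are easier to report when formatted explicitly in scientific notation, so the exponent is never lost. A minimal sketch, using a hypothetical value rather than the result from this notebook:

```python
# Hypothetical value for illustration; substitute the p_value returned by ttest_ind.
p_value_example = 1.5e-7

# The '.2e' format keeps two decimals of the mantissa and always shows the exponent.
formatted = f"{p_value_example:.2e}"
print("p-value:", formatted)
```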

In [None]:
# Compare p-value to significance level
alpha = 0.05
if p_value < alpha:
    print("Reject null hypothesis: The average duration of rides from the Loop to O'Hare International Airport changes on rainy Saturdays.")
else:
    print("Fail to reject null hypothesis: There is no significant difference in the average duration of rides on rainy Saturdays.")

Based on our statistical analysis, we find evidence that the average duration of rides from the Loop to O’Hare International Airport on rainy Saturdays differs significantly from that on non-rainy Saturdays.
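
The test indicates whether the means differ, but not by how much or in which direction. A minimal sketch of reporting the group means alongside the p-value, using made-up durations; in the notebook the inputs would be `rainy_saturdays['duration_seconds']` and `non_rainy_saturdays['duration_seconds']`:

```python
import pandas as pd

# Hypothetical durations in seconds, for illustration only.
rainy = pd.Series([2400, 2550, 2300, 2700])
dry = pd.Series([1900, 2100, 2000, 2050])

# Difference in mean duration: positive means rainy rides took longer on average.
mean_diff = rainy.mean() - dry.mean()

print(f"rainy mean: {rainy.mean():.1f} s")
print(f"non-rainy mean: {dry.mean():.1f} s")
print(f"difference: {mean_diff:.1f} s")
```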

# Project Conclusion:

In this project, we conducted an exploratory data analysis (EDA) of taxi ride data in Chicago for November 15-16, 2017. By analyzing three datasets covering taxi companies, neighborhood drop-offs, and ride durations, we gained a deeper understanding of the taxi service landscape.

## Exploratory Data Analysis (EDA)

Data Import and Cleaning:
- We imported the provided CSV files and ensured correct data types.
- Our focus was on understanding taxi companies, drop-off locations, and ride durations.

Top Neighborhoods by Drop-offs:
 - Through visualization, we identified the top 10 neighborhoods with the highest number of drop-offs.
 - These neighborhoods play a crucial role in the taxi service ecosystem.

Distribution of Rides Among Taxi Companies:
 - We visualized the distribution of rides among different taxi companies.
 - The following companies stood out:
    - Flash Cab: 19,558 trips
    - Taxi Affiliation Services: 11,422 trips
    - Medallion Leasing: 10,367 trips
    - Yellow Cab: 9,888 trips

## Hypothesis Testing

We examined whether weather conditions, specifically rain on Saturdays, affect the duration of rides from the Loop to O'Hare International Airport. The test led us to reject the null hypothesis, indicating a statistically significant difference in average ride duration between rainy and non-rainy Saturdays.

## Implications and Future Research

Our study has implications for ride-sharing companies:

Resource Allocation: Knowing that rainy Saturdays significantly affect ride durations, companies can allocate resources efficiently.

Customer Satisfaction: By leveraging data-driven insights, companies can enhance service quality and improve customer satisfaction.

Further Investigation: Future research could explore additional variables (e.g., traffic patterns, special events) to refine operational efficiency.

In summary, our data-driven approach contributes to a more reliable transportation system in urban areas like Chicago.