# Zuber - Exploratory Data Analysis

Two datasets about taxi trips in Chicago will be analysed. The first contains the number of trips made by each taxi company.

**First dataset** <code>project_sql_result_01.csv</code> contains:

> <code>company_name</code>: taxi company name \
> <code>trips_amount</code>: the number of rides for each taxi company on November 15-16, 2017.

The second provides the average number of rides that ended in each dropoff location.

**Second dataset** <code>project_sql_result_04.csv</code> contains:

> <code>dropoff_location_name</code>: Chicago neighborhoods where rides ended \
> <code>average_trips</code>: the average number of rides that ended in each neighborhood in November 2017.

Both will be explored to reveal the competitor landscape and top neighborhood locations.

## 4.1 Read Data

### 4.1.1 Import libraries

In [1]:
# Import Libraries
import pandas as pd # data processing
import plotly.express as px # interactive plotting library
import scipy.stats as stats # statistical library for hypothesis testing

### 4.1.2 Read CSVs

In [2]:
# Read company data as company 
company = pd.read_csv('data/moved_project_sql_result_01.csv')

# Read location and trip average data as location_avg
location_avg = pd.read_csv('data/moved_project_sql_result_04.csv')

## 4.2 Study data

Each dataset will be studied to determine the:
- columns provided
- amount of data provided
- validity of values
- given data types
- presence of missing values
- presence of duplicate values
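
The checks listed above could also be gathered in one helper so that both dataframes are inspected the same way. The sketch below is only an illustration: `quality_report` is a hypothetical name that does not appear in this notebook, and the toy dataframe uses made-up values rather than the real CSV.

```python
import pandas as pd


def quality_report(frame: pd.DataFrame, key: str) -> dict:
    """Collect the basic checks listed above for one dataframe (hypothetical helper)."""
    return {
        "columns": list(frame.columns),                             # columns provided
        "rows": len(frame),                                         # amount of data
        "numeric_minimums": frame.min(numeric_only=True).to_dict(), # validity of values
        "dtypes": frame.dtypes.astype(str).to_dict(),               # given data types
        "missing_values": int(frame.isna().sum().sum()),            # missing values
        "duplicate_keys": int(frame.duplicated(subset=key).sum()),  # duplicate values
    }


# Toy data standing in for the real CSV (values are made up)
toy = pd.DataFrame({"company_name": ["A", "B", "B"], "trips_amount": [10, 5, 5]})
report = quality_report(toy, key="company_name")
print(report)
```

On the real data the call would look like `quality_report(company, key='company_name')`.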

### 4.2.1 Company Data

In [3]:
# Display head of company to view provided columns
company.head()

Unnamed: 0,company_name,trips_amount
0,Flash Cab,19558
1,Taxi Affiliation Services,11422
2,Medallion Leasin,10367
3,Yellow Cab,9888
4,Taxi Affiliation Service Yellow,9299


In [4]:
# Display information about company to determine data types and missing values
company.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   company_name  64 non-null     object
 1   trips_amount  64 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 1.1+ KB


In [5]:
# Display the minimum value to determine the validity of values
print(company.min())

company_name    0118 - 42111 Godfrey S.Awir
trips_amount                              2
dtype: object


In [6]:
# Determine if duplicates exist
company.duplicated(subset='company_name').sum()

0

**Analysis:** Company data contains 64 companies with their accompanying trip amounts. 

Data types are correct. Company names are strings stored as objects, and trip amounts are integers.

No missing values exist: both columns have 64 non-null entries.

The minimum trip amount is 2, so every value is positive and valid.

The count of duplicated company names is zero, so no duplicates exist.

### 4.2.2 Location and Average Trips Data

In [7]:
# Display head of location_avg
print(location_avg.head())

  dropoff_location_name  average_trips
0                  Loop   10727.466667
1           River North    9523.666667
2         Streeterville    6664.666667
3             West Loop    5163.666667
4                O'Hare    2546.900000


In [8]:
# Display info about location_avg
location_avg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94 entries, 0 to 93
Data columns (total 2 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   dropoff_location_name  94 non-null     object 
 1   average_trips          94 non-null     float64
dtypes: float64(1), object(1)
memory usage: 1.6+ KB


In [9]:
# Display the minimum value to determine the validity of values
print(location_avg.min())

dropoff_location_name    Albany Park
average_trips                    1.8
dtype: object


In [10]:
# Determine if duplicates exist
location_avg.duplicated(subset='dropoff_location_name').sum()

0

**Analysis:** 

The location data contains the average number of dropoffs for 94 neighborhoods.

Data types are correct. Dropoff locations are strings stored as objects, and average trips are floats.

No missing values exist: both columns have 94 non-null entries.

The minimum average is 1.8, so every value is positive and valid.

The count of duplicated location names is zero, so no duplicates exist.

All data is in its correct form and ready for analysis.

## 4.3 Analysis

Both dataframes will be analysed with a histogram, boxplot and bar chart. The top taxi companies and neighborhood dropoff locations will also be identified, and conclusions drawn.
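
Every plot in this section passes the same centered-title dictionary to `update_layout`. One way to avoid repeating it is a small helper, sketched below. `centered_title` is a hypothetical name, not something defined in this notebook, and the sketch only builds the dictionary, so it runs without Plotly.

```python
def centered_title(text: str) -> dict:
    """Return the title settings repeated in each update_layout call (hypothetical helper)."""
    return {
        "text": text,
        "y": 0.95,
        "x": 0.5,
        "xanchor": "center",
        "yanchor": "top",
    }


layout_title = centered_title("Distribution of Trip Number")
print(layout_title)

# Intended use, assuming a Plotly figure named fig already exists:
# fig.update_layout(title=centered_title("Distribution of Trip Number"),
#                   xaxis_title="Trip Count",
#                   yaxis_title="Number of Companies")
```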

### 4.3.1 Taxi Companies and Rides



In [11]:
# Create histogram of trips by company
fig = px.histogram(company, 
                   x="trips_amount", 
                   nbins=100)

# Update figure: center the title and set the x-axis and y-axis titles
fig.update_layout(title={'text': 'Distribution of Trip Number',
                            'y':0.95,
                            'x':0.5,
                            'xanchor': 'center',
                            'yanchor': 'top'},
                    xaxis_title='Trip Count',
                    yaxis_title='Number of Companies')

# Show plot
fig.show()

In [12]:
# Create a boxplot with flipped axes and a title
fig = px.box(company, 
             x="trips_amount", 
             orientation="h",
             hover_data=['company_name'])

# Update figure: center the title and set the x-axis title
fig.update_layout(title={'text': 'Boxplot of Company Rides',
                            'y':0.95,
                            'x':0.5,
                            'xanchor': 'center',
                            'yanchor': 'top'},
                    xaxis_title='Number of Trips')

# Show plot
fig.show()


In [13]:
# Display summary statistics of company
print(company.describe())

       trips_amount
count     64.000000
mean    2145.484375
std     3812.310186
min        2.000000
25%       20.750000
50%      178.500000
75%     2106.500000
max    19558.000000


In [14]:
# Create percentage of total column
company['percent_of_total'] = (company['trips_amount'] * 100/company['trips_amount'].sum()).round(2)

# Select top 10 taxi companies by number of trips (does not rely on the CSV row order)
top_10 = company.nlargest(10, 'trips_amount')

# Show barplot of taxi companies
fig = px.bar(top_10, x='company_name', y='percent_of_total')

# Update figure: center the title and set the x-axis and y-axis titles
fig.update_layout(title={'text': 'Marketshare of Rides of Top Taxi Companies by Number of Rides',
                            'y':0.95,
                            'x':0.5,
                            'xanchor': 'center',
                            'yanchor': 'top'},
                    xaxis_title='Taxi Company',
                    yaxis_title='Marketshare (%)')

# Show barplot
fig.show()

In [15]:
# Sum the marketshare of the top 10 companies
top_10['percent_of_total'].sum().round(2)

72.3

**Conclusion:**

Of the 64 taxi companies, half made fewer than 179 trips (the median is 178.5). 

The median and mean number of trips per company are 178.5 and 2145.5 respectively. The mean is far higher because a few large companies dominate.

The top company, Flash Cab, accounts for 14.2% of all rides. It is followed by Taxi Affiliation Services and Medallion Leasin, which hold 8.3% and 7.6% respectively.

The top 10 taxi companies by number of rides provided 72.3% of all rides in Chicago on the 15th and 16th of November, 2017.
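
The quoted shares can be cross-checked by hand from the outputs above. The sketch below copies the mean and row count from `describe()` and `info()` and the top three trip counts from `head()`. It assumes that the mean times the count recovers the exact trip total, which holds here because the product is a whole number.

```python
# Figures copied from the notebook outputs above
mean_trips = 2145.484375   # mean from company.describe()
n_companies = 64           # row count from company.info()
top_three = {
    "Flash Cab": 19558,
    "Taxi Affiliation Services": 11422,
    "Medallion Leasin": 10367,
}

# mean * count recovers the total number of trips across all companies
total_trips = round(mean_trips * n_companies)

# Share of the total for each company, in percent, rounded as in the notebook
shares = {
    name: round(trips * 100 / total_trips, 2)
    for name, trips in top_three.items()
}

print(total_trips)
print(shares)
```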

### 4.3.2 Top 10 Neighborhoods

In [16]:
# Create histogram of trips by location dropoff
fig = px.histogram(location_avg, 
                   x="average_trips", 
                   nbins=100)

# Update figure: center the title and set the x-axis and y-axis titles
fig.update_layout(title={'text': 'Histogram of Dropoff Location Averages within Neighborhoods',
                            'y':0.95,
                            'x':0.5,
                            'xanchor': 'center',
                            'yanchor': 'top'},
                    xaxis_title='Average Dropoffs',
                    yaxis_title='Number of Neighborhoods')

# Show plot
fig.show()

In [17]:
# Create a boxplot with flipped axes and a title for dropoff locations
fig = px.box(location_avg, 
             x="average_trips", 
             orientation="h",
             hover_data=['dropoff_location_name'])

# Update figure: center the title and set the x-axis title
fig.update_layout(title={'text': 'Boxplot of Dropoff Location Averages within Neighborhoods',
                            'y':0.95,
                            'x':0.5,
                            'xanchor': 'center',
                            'yanchor': 'top'},
                    xaxis_title='Average Dropoffs')

# Show plot
fig.show()

In [18]:
# Display summary statistics of location_avg
location_avg.describe()

Unnamed: 0,average_trips
count,94.0
mean,599.953728
std,1714.591098
min,1.8
25%,14.266667
50%,52.016667
75%,298.858333
max,10727.466667


In [19]:
# Create percentage of total column
location_avg['percent_of_total'] = (location_avg['average_trips'] * 100/location_avg['average_trips'].sum()).round(2)

# Select top 10 neighborhoods by average dropoffs (does not rely on the CSV row order)
top_10_neighborhoods = location_avg.nlargest(10, 'average_trips')

# Show barplot of neighborhoods
fig = px.bar(top_10_neighborhoods, x='dropoff_location_name', y='percent_of_total')

# Update figure: center the title and set the x-axis and y-axis titles
fig.update_layout(title={'text': 'Marketshare of Dropoff Locations of Top Neighborhoods by Average Dropoffs',
                            'y':0.95,
                            'x':0.5,
                            'xanchor': 'center',
                            'yanchor': 'top'},
                    xaxis_title='Neighborhood',
                    yaxis_title='Marketshare (%)')

# Show barplot
fig.show()

In [20]:
# Sum the marketshare of the top 10 neighborhoods
top_10_neighborhoods['percent_of_total'].sum().round(2)

76.7

**Conclusion:**

Of the 94 neighborhoods, 60 had fewer than 100 dropoffs on average. 

The median and mean of average dropoffs per neighborhood are 52 and 600 respectively.

The top neighborhood, Loop, accounts for 19.02% of average dropoffs. It is followed by River North and Streeterville, which hold 16.89% and 11.82% respectively.

The top 10 neighborhoods accounted for 76.7% of all average dropoffs in Chicago in November 2017.
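
No cell above produces the count of neighborhoods below 100 average dropoffs. A minimal sketch of how such a count could be obtained follows. It uses a toy dataframe with made-up names and values in place of `location_avg`, and the threshold of 100 is taken from the conclusion.

```python
import pandas as pd

# Toy stand-in for location_avg (names and values are made up)
toy_locations = pd.DataFrame({
    "dropoff_location_name": ["A", "B", "C", "D"],
    "average_trips": [1200.0, 250.0, 80.5, 12.0],
})

threshold = 100

# A boolean mask sums to the number of rows below the threshold
below = int((toy_locations["average_trips"] < threshold).sum())
share_below = below / len(toy_locations)

print(below, share_below)
```

On the real data, the same expression applied to `location_avg['average_trips']` would give the figure quoted in the conclusion.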

# Zuber - Hypothesis Testing

## 5.1 The Hypothesis

<br>

<center><b>The average duration of rides from the Loop to O'Hare International Airport changes on rainy Saturdays.</b></center>

<br>

## 5.2 Metrics

**Population**: Taxi riders in the Chicago Area at all times.

**Sample:** Taxi riders in the Chicago Area on Saturdays in November, 2017.

**Null Hypothesis**: The average duration is the same on rainy and dry Saturdays.
- The null hypothesis states that there is no difference between the two population means.
- It is obtained by restating the hypothesis above as a statement of equality.

**Alternative Hypothesis**: The average duration is not the same on rainy and dry Saturdays.
- The alternative hypothesis is the complement of the null hypothesis: the two means differ, in either direction.

**Statistical Test**: Two-sided hypothesis test on the equality of two population means. 


**Alpha**: Reject null hypothesis if p-value < **0.05**.
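
For reference, the test statistic can be written out. The form below assumes the equal-variance (Student) version of the two-sample t-test, which is the default behaviour of `scipy.stats.ttest_ind`. The subscripts $G$ and $B$ for good and bad weather are notation introduced here for illustration.

$$
t = \frac{\bar{x}_G - \bar{x}_B}{s_p \sqrt{\dfrac{1}{n_G} + \dfrac{1}{n_B}}},
\qquad
s_p^2 = \frac{(n_G - 1)\, s_G^2 + (n_B - 1)\, s_B^2}{n_G + n_B - 2}
$$

Here $\bar{x}$, $s^2$ and $n$ are the sample mean, sample variance and sample size of each group. Under the null hypothesis, $t$ follows a t-distribution with $n_G + n_B - 2$ degrees of freedom, and the two-sided p-value is the probability of observing a value at least as extreme as $|t|$.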

## 5.3 Method
The method will follow these steps:
- Read in a dataframe containing pickup times, weather conditions and ride durations.
- Split the dataframe based on weather conditions.
- Perform a two-sided hypothesis test on the equality of the two population means.

### 5.3.1 Create Dataframe

Read in the dataframe. Ensure that column values are correct. Ensure that the data is sufficient by checking for null and duplicate values.


In [23]:
# Read dataframe, parse dates on 'start_ts'
df = pd.read_csv('data/moved_project_sql_result_07.csv', parse_dates=['start_ts'])

# Show head of dataframe
df.head()

Unnamed: 0,start_ts,weather_conditions,duration_seconds
0,2017-11-25 16:00:00,Good,2410.0
1,2017-11-25 14:00:00,Good,1920.0
2,2017-11-25 12:00:00,Good,1543.0
3,2017-11-04 10:00:00,Good,2512.0
4,2017-11-11 07:00:00,Good,1440.0


In [24]:
# Show info about df to check for null values and data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1068 entries, 0 to 1067
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   start_ts            1068 non-null   datetime64[ns]
 1   weather_conditions  1068 non-null   object        
 2   duration_seconds    1068 non-null   float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 25.2+ KB


In [25]:
# Check that column values are correct
# Ensure all values in start_ts column occur on a Saturday
df[df['start_ts'].dt.dayofweek == 5]['start_ts'].count()

1068
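
The count above has to be compared by eye with the total number of rows. An alternative, sketched here on toy timestamps that are made up for illustration, is to reduce the check to a single boolean and to list any offending rows directly.

```python
import pandas as pd

# Toy timestamps (made up): 2017-11-04 and 2017-11-11 are Saturdays, 2017-11-06 is a Monday
toy = pd.DataFrame({
    "start_ts": pd.to_datetime(
        ["2017-11-04 10:00", "2017-11-11 07:00", "2017-11-06 09:00"]
    )
})

# dayofweek counts from Monday=0, so Saturday is 5
is_saturday = toy["start_ts"].dt.dayofweek == 5

all_saturdays = bool(is_saturday.all())  # True only if every row is a Saturday
non_saturdays = toy[~is_saturday]        # rows that fail the check

print(all_saturdays)
print(non_saturdays)
```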

In [26]:
# Check categorical values only include 'Good' and 'Bad'
df['weather_conditions'].value_counts()

weather_conditions
Good    888
Bad     180
Name: count, dtype: int64

In [27]:
# Check value counts of duration
df['duration_seconds'].value_counts()

duration_seconds
1380.0    41
1260.0    35
1440.0    29
1320.0    23
1200.0    23
          ..
4140.0     1
1547.0     1
2271.0     1
2141.0     1
2834.0     1
Name: count, Length: 495, dtype: int64

In [28]:
# Durations are whole seconds but stored as float64, so convert to int
df['duration_seconds'] = df['duration_seconds'].astype(int)

In [29]:
# Flag rows that repeat an earlier row across all columns (boolean mask saved as weather)
weather = df.duplicated(subset=['start_ts','weather_conditions','duration_seconds'])

# Check for duplicate values
weather.sum()

197

In [30]:
# Show duplicates
df[weather]

Unnamed: 0,start_ts,weather_conditions,duration_seconds
62,2017-11-11 06:00:00,Good,1260
74,2017-11-11 08:00:00,Good,1380
76,2017-11-04 09:00:00,Good,1380
117,2017-11-11 07:00:00,Good,1380
119,2017-11-04 14:00:00,Good,3300
...,...,...,...
1054,2017-11-11 07:00:00,Good,1380
1058,2017-11-25 12:00:00,Good,1440
1062,2017-11-11 06:00:00,Good,1500
1065,2017-11-11 13:00:00,Good,2100


In [31]:
# Check summary statistics of duration_seconds
df['duration_seconds'].describe()

count    1068.000000
mean     2071.731273
std       769.461125
min         0.000000
25%      1438.250000
50%      1980.000000
75%      2580.000000
max      7440.000000
Name: duration_seconds, dtype: float64

**Analysis:**

Start time data appears to be given to the nearest hour. This data only includes Saturdays. Weather conditions contains only the values 'Good' and 'Bad', as it should. Duration is given in whole seconds, so it has been converted to the integer type. No duration is negative, although the minimum of 0 seconds is implausible for a real ride and may reflect a recording error.

No null data exists. Duplicate data does: 197 rows repeat an earlier row across all three columns. However, these rows may be entirely valid. The start time is rounded to the hour, and weather conditions can only be good or bad at a given time, so all rides that start within the same hour share the same values for these two fields.

Duration could distinguish such rides, but it is recorded only to the nearest second, and many values fall on an exact minute. For instance, 1380 seconds is exactly 23 minutes, suggesting that these values are not measured to the exact second. If a typical route, say from the Loop to O'Hare, takes about 23 minutes, repeated rows are not unreasonable to expect. The duplicated rows may still be valid and will be kept.
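
Two quick checks could support this reasoning: the share of durations that fall on an exact minute, and how often each full row repeats. The sketch below runs on a toy dataframe with made-up rows that reuses the notebook's column names. It is an illustration, not a result from the real data.

```python
import pandas as pd

# Toy rows (made up) with the same columns as df
toy = pd.DataFrame({
    "start_ts": pd.to_datetime(["2017-11-11 07:00"] * 3 + ["2017-11-04 09:00"]),
    "weather_conditions": ["Good"] * 4,
    "duration_seconds": [1380, 1380, 1547, 1380],
})

# Share of durations that are an exact multiple of 60 seconds
whole_minute_share = float((toy["duration_seconds"] % 60 == 0).mean())

# Count how often each full row occurs; counts above 1 are repeated rides
row_counts = toy.groupby(list(toy.columns)).size()
repeated = row_counts[row_counts > 1]

print(whole_minute_share)
print(repeated)
```

A high whole-minute share on the real data would be consistent with durations being rounded before they were recorded.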

### 5.3.2 Split dataframe

Split the dataframe into good and bad weather subsets.

In [32]:
# Create dataframe of good weather
good_weather = df[df['weather_conditions'] == 'Good']

# Create dataframe of bad weather
bad_weather = df[df['weather_conditions'] == 'Bad']

### 5.3.3 Perform hypothesis test

Save the alpha value and run a two-sided t-test on the duration_seconds column from both dataframes. Print the outcome depending on the p-value in relation to alpha.

In [33]:
# Save alpha value
alpha = 0.05

# Use ttest_ind to compare two dataset samples
results = stats.ttest_ind(good_weather['duration_seconds'], bad_weather['duration_seconds'])

# Print p-value
print('p-value:', results.pvalue)

# Print condition depending on pvalue compared to alpha
if results.pvalue < alpha:
    print("The null hypothesis is rejected")
else:
    print("The null hypothesis is not rejected")

p-value: 6.517970327099473e-12
The null hypothesis is rejected


**Results**

As the p-value is lower than our alpha level, the null hypothesis is rejected in favor of the alternative: the average duration is not the same on rainy and dry Saturdays. A longer duration in bad weather would be consistent with heavier traffic in the rain, although the two-sided test by itself does not indicate the direction of the difference.
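
The direction of the difference can be read from the group means. A minimal sketch follows. It uses made-up durations in a toy dataframe and only shows the pattern, not the actual means of the dataset.

```python
import pandas as pd

# Toy durations in seconds (made up), using the notebook's column names
toy = pd.DataFrame({
    "weather_conditions": ["Good", "Good", "Bad", "Bad"],
    "duration_seconds": [1800, 2000, 2400, 2600],
})

# Mean duration per weather condition
group_means = toy.groupby("weather_conditions")["duration_seconds"].mean()

# Positive value means rides take longer in bad weather
difference = group_means["Bad"] - group_means["Good"]

print(group_means)
print(difference)
```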

## Conclusion

### Exploratory Data Analysis

Two dataframes were analyzed. The first showed that half of all taxi companies made fewer than 179 trips. The mean far exceeded the median, as the top companies dominated the market share. Flash Cab, Taxi Affiliation Services and Medallion Leasin captured 14.2%, 8.3%, and 7.6% of all rides respectively on the 15th and 16th of November 2017. The top 10 taxi companies together held 72.3% of market share.

The second dataframe showed that most neighborhoods had fewer than 100 dropoffs on average in November 2017. The top 10 neighborhoods had much higher averages, together accounting for 76.7% of all dropoffs. The top neighborhood, the Loop, alone held 19.02%.

### Hypothesis Test

The null hypothesis that the average duration is the same on rainy and dry Saturdays was rejected using a two-sided t-test. The test found that the average ride duration differs between rainy and dry Saturdays.
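
The t-test used here relies on SciPy's default assumption of equal variances in both groups. As a possible robustness check, Levene's test and Welch's t-test (`equal_var=False`) could be run alongside it. The sketch below uses synthetic durations drawn from normal distributions: the means and spreads are made up, and only the group sizes (888 and 180) are taken from the notebook.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic durations in seconds (made up); group sizes mirror the notebook
good = rng.normal(loc=2000, scale=750, size=888)
bad = rng.normal(loc=2400, scale=700, size=180)

# Levene's test: null hypothesis is that both groups have equal variances
levene_result = stats.levene(good, bad)

# Welch's t-test: does not assume equal variances
welch_result = stats.ttest_ind(good, bad, equal_var=False)

print("Levene p-value:", levene_result.pvalue)
print("Welch p-value:", welch_result.pvalue)
```

If Welch's test on the real data led to the same decision, the conclusion would not depend on the equal-variance assumption.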
