### Step 1. Parsing and Opening Data

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats as st

URL='https://code.s3.yandex.net/data-analyst-eng/chicago_weather_2017.html'
req = requests.get(URL)
soup = BeautifulSoup(req.text, 'lxml')

table = soup.find(attrs={"id": "weather_records"})

heading = []
for row in table.find_all('th'):
    heading.append(row.text)  # Add the text of each <th> tag to the heading list

content = []
for row in table.find_all('tr'):
    if not row.find_all('th'):  # Skip the header row, which contains <th> cells
        content.append([element.text for element in row.find_all('td')])

df_weather = pd.DataFrame(content, columns=heading) 
df_weather.sample(5)

Unnamed: 0,Date and time,Temperature,Description
112,2017-11-05 16:00:00,283.51,proximity thunderstorm
385,2017-11-17 01:00:00,275.96,overcast clouds
609,2017-11-26 09:00:00,271.53,sky is clear
618,2017-11-26 18:00:00,278.5,scattered clouds
548,2017-11-23 20:00:00,280.59,sky is clear


In [6]:
df_weather['Description'].value_counts()

sky is clear                        178
overcast clouds                     146
mist                                 97
broken clouds                        63
scattered clouds                     47
few clouds                           37
light rain                           31
fog                                  24
haze                                 18
light intensity drizzle              13
moderate rain                        12
light snow                           11
drizzle                               9
proximity thunderstorm                5
proximity thunderstorm with rain      2
thunderstorm with drizzle             1
thunderstorm with light rain          1
heavy intensity rain                  1
thunderstorm with rain                1
Name: Description, dtype: int64

Next, we import the CSV files produced by earlier database queries: the number of rides for each taxi company on November 15-16, 2017, and the average number of trips per drop-off location in November 2017.

In [2]:
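try:
    df_trips = pd.read_csv('../datasets/project_sql_result_01.csv')
except FileNotFoundError:
    # Fall back to the alternate location, where the file is read as tab-separated
    df_trips = pd.read_csv('/datasets/project_sql_result_01.csv', sep='\t')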

df_trips.sample(5)

Unnamed: 0,company_name,trips_amount
1,Taxi Affiliation Services,11422
54,2192 - 73487 Zeymane Corp,14
30,Setare Inc,230
40,6574 - Babylon Express Inc.,31
23,KOAM Taxi Association,1259


In [3]:
df_trips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   company_name  64 non-null     object
 1   trips_amount  64 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 1.1+ KB


In [4]:
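try:
    df_average = pd.read_csv('../datasets/project_sql_result_04.csv')
except FileNotFoundError:
    # Fall back to the alternate location, where the file is read as tab-separated
    df_average = pd.read_csv('/datasets/project_sql_result_04.csv', sep='\t')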

df_average.sample(5)

Unnamed: 0,dropoff_location_name,average_trips
55,Dunning,30.166667
90,Hegewisch,3.117647
65,Ashburn,16.133333
60,New City,22.933333
41,North Park,67.833333


In [5]:
df_average.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94 entries, 0 to 93
Data columns (total 2 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   dropoff_location_name  94 non-null     object 
 1   average_trips          94 non-null     float64
dtypes: float64(1), object(1)
memory usage: 1.6+ KB


### Step 2. Data Preprocessing

In [8]:
# fix column consistency of weather columns


In [7]:
good_weather = ['sky is clear','overcast clouds','mist','broken clouds','scattered clouds','few clouds','proximity thunderstorm']

# 1 = good weather, 0 = bad weather
df_weather['weather_condition_id'] = df_weather['Description'].isin(good_weather).astype(int)
df_weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 697 entries, 0 to 696
Data columns (total 4 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Date and time         697 non-null    object
 1   Temperature           697 non-null    object
 2   Description           697 non-null    object
 3   weather_condition_id  697 non-null    int64 
dtypes: int64(1), object(3)
memory usage: 21.9+ KB


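df_trips and df_average have lowercase column names and appropriate data types. In df_weather, however, the column names are not lowercase, and Date and time and Temperature are still stored as object (string) columns.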
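
Since the scraped weather columns arrive as strings, one way to bring them to proper types is sketched below. This is a minimal illustration on a few rows copied from the sample output above, not a step the original notebook performs; the snake_case column names are an assumption introduced here.

```python
import pandas as pd

# A few rows copied from the sample output above; after scraping, every value is a string.
sample = pd.DataFrame({
    "Date and time": ["2017-11-05 16:00:00", "2017-11-17 01:00:00", "2017-11-26 09:00:00"],
    "Temperature": ["283.51", "275.96", "271.53"],
    "Description": ["proximity thunderstorm", "overcast clouds", "sky is clear"],
})

# Parse the timestamp and temperature strings into datetime and float columns.
sample["Date and time"] = pd.to_datetime(sample["Date and time"], format="%Y-%m-%d %H:%M:%S")
sample["Temperature"] = pd.to_numeric(sample["Temperature"])

# Illustrative choice (an assumption): rename columns to snake_case for consistency.
sample.columns = sample.columns.str.lower().str.replace(" ", "_", regex=False)

print(sample.dtypes)
```

The same three statements could be applied to df_weather itself, which would also let later steps filter by weekday or hour.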

In [None]:
print("df_trips nulls: \n",df_trips.isna().sum(),"\n")
print("df_average nulls: \n",df_average.isna().sum(),'\n')
print("df_weather nulls: \n",df_weather.isna().sum())
df_trips.drop_duplicates(inplace=True)
df_average.drop_duplicates(inplace=True)
df_weather.drop_duplicates(inplace=True)

No nulls were found in any of the dataframes, and any duplicate rows were dropped.

In [None]:
df_trips.head(10).plot(x='company_name',y='trips_amount',kind="bar",figsize=(8,6))
plt.suptitle("")
plt.title("Total Trips of Top 10 Taxi Companies (November 15-16, 2017)")
plt.show()

This plain bar chart of total trips by company shows Flash Cab as the clear leader. According to its website, the company has existed since 1945, so it appears to be the established local operator. The other companies have similar trip totals.

In [None]:
df_average.head(10).plot(x='dropoff_location_name',y='average_trips',kind="bar",figsize=(8,6))
plt.suptitle("")
plt.title("Average Trips of 10 Most Popular Dropoff Areas (November 2017)")
plt.show()

A map of Chicago shows that the top four areas by average trips border one another. Together they form the downtown core.

In [None]:
(
    df_weather
    .groupby('weather_condition_id')['duration_seconds']
    .plot(kind='hist',alpha=0.4,figsize=(8,6))
)

plt.suptitle("")
plt.title("Trip Durations in Good and Bad Weather Conditions")
plt.legend(['Bad Weather','Good Weather'])
plt.show()

There is far more data for good weather, and it appears less varied, with a sharply peaked distribution. The graph also suggests some outliers in duration_seconds, which I slice out to get a more accurate picture.
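
The histogram above groups duration_seconds by weather_condition_id, but the scraped weather table shown earlier contains no duration column, so the trip data must have been attached somewhere not shown. A minimal sketch of one way to do that follows. The `rides` frame, its `start_ts` column, and all values are hypothetical and only illustrate the join.

```python
import pandas as pd

# Hypothetical hourly weather flags (1 = good, 0 = bad), shaped like df_weather.
weather = pd.DataFrame({
    "Date and time": pd.to_datetime(["2017-11-04 10:00:00", "2017-11-04 11:00:00"]),
    "weather_condition_id": [1, 0],
})

# Hypothetical ride records; `start_ts` and the values are assumptions.
rides = pd.DataFrame({
    "start_ts": pd.to_datetime([
        "2017-11-04 10:15:00",
        "2017-11-04 11:40:00",
        "2017-11-04 11:05:00",
    ]),
    "duration_seconds": [1500, 2400, 2100],
})

# Floor each ride start to the hour so it lines up with the hourly weather rows.
rides["hour"] = rides["start_ts"].dt.floor("h")

# Left join keeps every ride and attaches the weather flag for its hour.
rides_weather = (
    rides
    .merge(weather, left_on="hour", right_on="Date and time", how="left")
    .drop(columns=["hour", "Date and time"])
)

print(rides_weather)
```

A left join is used so that rides without a matching weather record stay in the result with a missing flag instead of being dropped silently.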

In [None]:
min_durations = df_weather['duration_seconds'].quantile(0.05)
max_durations = df_weather['duration_seconds'].quantile(0.95)

df_weather = df_weather.query("@min_durations < duration_seconds < @max_durations")
(
    df_weather
    .groupby('weather_condition_id')['duration_seconds']
    .plot(kind='hist',alpha=0.4,figsize=(8,6))
)

plt.suptitle("")
plt.title("Trip Durations in Good and Bad Weather Conditions")
plt.legend(['Bad Weather','Good Weather'])
plt.show()

We can now see more clearly that the good-weather trip durations have a sharply peaked, Poisson-like shape, while the bad-weather durations have a flat, platykurtic distribution.
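
Shape descriptions like these can be checked numerically with sample skewness and excess kurtosis. The sketch below uses synthetic stand-in samples (a gamma and a uniform distribution chosen only for illustration), not the notebook's data; with the real data, the two duration columns would be passed instead.

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(0)

# Synthetic stand-ins, NOT the notebook's data (assumptions for illustration):
peaked = rng.gamma(shape=2.0, scale=600.0, size=10_000)   # peaked, right-skewed
flat = rng.uniform(low=1200.0, high=3000.0, size=10_000)  # flat

for name, sample in [("peaked", peaked), ("flat", flat)]:
    skew = st.skew(sample)
    # Fisher definition: excess kurtosis is 0 for a normal distribution,
    # negative for platykurtic (flat) shapes, positive for peaked ones.
    excess_kurtosis = st.kurtosis(sample)
    print(f"{name}: skew={skew:.2f}, excess kurtosis={excess_kurtosis:.2f}")
```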

All relevant data has been imported into dataframes with no nulls or duplicates, and we can see the leading taxi companies and the most popular drop-off areas (which indicate the downtown core of Chicago).

### Step 3. Testing hypotheses 

Test the hypothesis that "the average duration of rides from the Loop to O'Hare International Airport changes on rainy Saturdays".

Before running the test, I would expect taxi rides in bad weather to take longer, because people drive more slowly and carefully, which causes more traffic. The graphs in the previous step also showed marked differences between the two distributions. The null hypothesis is that the mean trip duration is the same in good and bad weather. The alternative hypothesis is that the mean trip duration differs between good and bad weather.
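
Written in terms of population means, the pair of hypotheses takes the form below. The symbols $\mu_{\text{good}}$ and $\mu_{\text{bad}}$ are notation introduced here for the mean trip duration in good and bad weather.

```latex
H_0:\ \mu_{\text{bad}} = \mu_{\text{good}}
\qquad
H_1:\ \mu_{\text{bad}} \neq \mu_{\text{good}}
```

The alternative is two-sided, so the test looks for a difference in either direction, not only for longer rides in bad weather.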

In [None]:
df_bad = df_weather.query('weather_condition_id == 0')
df_good = df_weather.query('weather_condition_id == 1')
                
print("The variance of trip durations in bad weather = {:.2f}".format(np.var(df_bad['duration_seconds'])))
print("The variance of trip durations in good weather = {:.2f}".format(np.var(df_good['duration_seconds'])))

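Bad-weather trip durations have more variance, so the test below does not assume equal variances. I use a significance level (alpha) of 1%, a stricter threshold than the common 5%, so that rejecting the null hypothesis reflects strong evidence of a difference.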
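
Comparing two variance figures by eye can be backed up with a formal check. The sketch below applies Levene's test to synthetic samples (the means, spreads, and sizes are assumptions, not the notebook's data) and uses the outcome to set the equal_var argument of the t-test.

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(42)

# Synthetic durations in seconds, NOT the notebook's data (assumed values).
good = rng.normal(loc=2000, scale=300, size=500)
bad = rng.normal(loc=2400, scale=700, size=150)

# Levene's test: the null hypothesis is that both samples have equal variances.
levene_stat, levene_p = st.levene(good, bad)

# If equal variances are rejected, fall back to Welch's t-test (equal_var=False).
equal_var = bool(levene_p >= 0.05)
result = st.ttest_ind(bad, good, equal_var=equal_var)

print(f"Levene p-value: {levene_p:.4g}, equal_var={equal_var}")
print(f"t-test p-value: {result.pvalue:.4g}")
```

The 0.05 threshold for Levene's test is a conventional choice made here for illustration; it is separate from the 1% alpha used for the main hypothesis test.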

In [None]:
alpha = .01  # significance level: if the p-value is less than alpha, we reject the null hypothesis

# Calculate the p-value with equal_var=False (Welch's t-test),
# because the variances of the two populations the samples come from differ
results = st.ttest_ind(
    df_bad['duration_seconds'],
    df_good['duration_seconds'],
    equal_var=False)

print('p-value: ', results.pvalue)

if results.pvalue < alpha:
    print("We reject the null hypothesis")
else:
    print("We can't reject the null hypothesis")

This indicates that there is a statistically significant difference between trip durations in good and bad weather conditions. Choosing an alpha of 5% would not have made a difference in the result of the hypothesis test.

### Conclusion

We tested the hypothesis about trip durations under different weather conditions and rejected the null hypothesis, which indicates a statistically significant difference between trip durations in good and bad weather. This makes sense: weather affects the duration of a taxi ride through its effect on traffic, and the difference was also visible in the contrasting distributions on the respective graphs.