<a href="https://colab.research.google.com/github/booorayan/BluecarsAuto/blob/master/Autolib_projecte.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hypothesis Testing Project

## Defining the Question 

Determine whether the total number of bluecars taken from stations in Paris is the same on each weekday (Monday to Friday).

### Hypothesis

**Claim**: The same total number of bluecars is taken from stations in Paris on each weekday, Monday through Friday.

**Null Hypothesis**: There is no difference in the total number of bluecars taken from stations across the weekdays (Monday to Friday).

**Alternate Hypothesis**: There is a difference in the total number of bluecars taken from stations on at least one weekday (Monday to Friday).

### Context

In recent times, car users have shown increased interest in electric car sharing services. Knowledge of the use of car sharing services and the trend during workdays can help companies in planning.
This study seeks to determine if the total number of bluecars taken/shared is relatively constant or differs during the weekdays/workdays.

### Metrics for Success

*   Get one or more samples of the data
*   Determine the p-value
*   Reject or fail to reject the null hypothesis
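
The last two criteria can be illustrated with a small permutation test. This is only a sketch on synthetic data, not the test used later in this project: the helper `permutation_p_value`, the `groups` layout and the significance level of 0.05 are all assumptions made for illustration.

```python
import numpy as np

def permutation_p_value(groups, n_perm=2000, seed=0):
    """Permutation test on the spread of group means (hypothetical helper).

    groups: list of 1-D arrays, one per weekday.
    Test statistic: largest group mean minus smallest group mean.
    """
    rng = np.random.default_rng(seed)
    sizes = [len(g) for g in groups]
    pooled = np.concatenate(groups)

    def spread(values):
        # split the pooled values back into groups of the original sizes
        parts = np.split(values, np.cumsum(sizes)[:-1])
        means = [p.mean() for p in parts]
        return max(means) - min(means)

    observed = spread(pooled)
    hits = 0
    for _ in range(n_perm):
        # shuffling the labels simulates the null hypothesis of no day effect
        if spread(rng.permutation(pooled)) >= observed:
            hits += 1
    # add-one correction keeps the p-value strictly above zero
    return (hits + 1) / (n_perm + 1)

# synthetic counts, NOT the Autolib dataset: the fifth group is shifted upwards on purpose
rng = np.random.default_rng(42)
groups = [rng.poisson(100, size=200) for _ in range(4)] + [rng.poisson(120, size=200)]

alpha = 0.05  # assumed significance level
p_value = permutation_p_value(groups)
decision = 'reject H0' if p_value < alpha else 'fail to reject H0'
print(f'p-value = {p_value:.4f} -> {decision}')
```

Note that the decision is worded as "fail to reject" rather than "accept": a large p-value means the data are compatible with the null hypothesis, not that the null hypothesis has been shown to be true.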



  

### Experimental Design

The experimental design of this project follows the CRISP-DM methodology.
The project covers the first five of the six CRISP-DM phases (the sixth phase, Deployment, is not covered):


1.   Business/problem Understanding
2.   Data Understanding
3.   Data Preparation
4.   Modelling
5.   Evaluation
  





### Appropriateness of Data Provided

*   The data provided contains variables that can be considered relevant to the question and hypothesis.
*   Columns like dayofweek and bluecars_taken_sum can help in answering the question and testing the hypothesis



## Importing libraries to be used

In [0]:
# pandas allows us to organize data in table form
import pandas as pd

# numpy will enable us to work with multidimensional arrays
import numpy as np

# matplotlib will help in visualizing the data
import matplotlib.pyplot as plt
%matplotlib inline  

# seaborn will also help in data visualization
import seaborn as sns
sns.set()  #(Re)set the seaborn default

# pandas profiling provides a summary report, including descriptive statistics of the dataset 
import pandas_profiling as pp

from sklearn.linear_model import LinearRegression

from sklearn import model_selection

from sklearn.model_selection import train_test_split

from sklearn.metrics import r2_score

import statsmodels.api as sm

## Loading and previewing the dataset

In [2]:
# loading the dataset and previewing the first 5 observations 
url = 'http://bit.ly/DSCoreAutolibDataset'

autoe = pd.read_csv(url)
autoe.head()

Unnamed: 0,Postal code,date,n_daily_data_points,dayOfWeek,day_type,BlueCars_taken_sum,BlueCars_returned_sum,Utilib_taken_sum,Utilib_returned_sum,Utilib_14_taken_sum,Utilib_14_returned_sum,Slots_freed_sum,Slots_taken_sum
0,75001,1/1/2018,1440,0,weekday,110,103,3,2,10,9,22,20
1,75001,1/2/2018,1438,1,weekday,98,94,1,1,8,8,23,22
2,75001,1/3/2018,1439,2,weekday,138,139,0,0,2,2,27,27
3,75001,1/4/2018,1320,3,weekday,104,104,2,2,9,8,25,21
4,75001,1/5/2018,1440,4,weekday,114,117,3,3,6,6,18,20


In [3]:
# reading the columns of the dataframe
autoe.columns

Index(['Postal code', 'date', 'n_daily_data_points', 'dayOfWeek', 'day_type',
       'BlueCars_taken_sum', 'BlueCars_returned_sum', 'Utilib_taken_sum',
       'Utilib_returned_sum', 'Utilib_14_taken_sum', 'Utilib_14_returned_sum',
       'Slots_freed_sum', 'Slots_taken_sum'],
      dtype='object')

In [4]:
len(autoe.columns)

13

dataframe has 13 columns

In [5]:
# loading the dictionary and reading the description of columns in the dataset
dlink = 'http://bit.ly/DSCoreAutolibDatasetGlossary'

dic = pd.read_excel(dlink)
dic


Unnamed: 0,Column name,explanation
0,Postal code,postal code of the area (in Paris)
1,date,date of the row aggregation
2,n_daily_data_points,number of daily data poinst that were availabl...
3,dayOfWeek,identifier of weekday (0: Monday -> 6: Sunday)
4,day_type,weekday or weekend
5,BlueCars_taken_sum,Number of bluecars taken that date in that area
6,BlueCars_returned_sum,Number of bluecars returned that date in that ...
7,Utilib_taken_sum,Number of Utilib taken that date in that area
8,Utilib_returned_sum,Number of Utilib returned that date in that area
9,Utilib_14_taken_sum,Number of Utilib 1.4 taken that date in that area


In [6]:
# checking the number of rows and columns in the dataframe
print('No. of rows: {} \nNo. of columns: {}'.format(autoe.shape[0], autoe.shape[1]))

# checking the total number of values (cells); DataFrame.size is rows x columns
print('Total values: {}'.format(autoe.size))

# autoe dataframe has 16,085 rows (observations) and 13 columns
# dataframe has a total of 209,105 values (16,085 x 13)


No. of rows: 16085 
No. of columns: 13
Total values: 209105
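
The difference between the row count and `DataFrame.size` can be checked on a toy frame. A minimal sketch, with `toy` as a made-up dataframe:

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})  # 3 rows, 2 columns

n_rows, n_cols = toy.shape
print(n_rows, n_cols, toy.size)   # size is rows * columns, not a row count

# the same arithmetic for the shape reported above
print(16085 * 13)
```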


In [7]:
# checking the datatype of the columns and the no. of non-null values in each column
autoe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16085 entries, 0 to 16084
Data columns (total 13 columns):
Postal code               16085 non-null int64
date                      16085 non-null object
n_daily_data_points       16085 non-null int64
dayOfWeek                 16085 non-null int64
day_type                  16085 non-null object
BlueCars_taken_sum        16085 non-null int64
BlueCars_returned_sum     16085 non-null int64
Utilib_taken_sum          16085 non-null int64
Utilib_returned_sum       16085 non-null int64
Utilib_14_taken_sum       16085 non-null int64
Utilib_14_returned_sum    16085 non-null int64
Slots_freed_sum           16085 non-null int64
Slots_taken_sum           16085 non-null int64
dtypes: int64(11), object(2)
memory usage: 1.6+ MB


## Data Cleaning

In [0]:
# creating a copy of the dataframe to work on
autoel = autoe.copy()

In [9]:
# replacing whitespaces in the columns with underscores and converting column names to lowercase to ensure uniformity
autoel.columns = autoel.columns.str.replace(' ', '_').str.lower()

# confirming 
autoel.columns

Index(['postal_code', 'date', 'n_daily_data_points', 'dayofweek', 'day_type',
       'bluecars_taken_sum', 'bluecars_returned_sum', 'utilib_taken_sum',
       'utilib_returned_sum', 'utilib_14_taken_sum', 'utilib_14_returned_sum',
       'slots_freed_sum', 'slots_taken_sum'],
      dtype='object')
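
The same normalisation can be wrapped in a small helper that also strips stray whitespace. This is a sketch only: `normalise_columns` is a hypothetical name and the toy column names are chosen for illustration.

```python
import pandas as pd

def normalise_columns(df):
    """Return a copy with lowercase, underscore-separated column names (hypothetical helper)."""
    out = df.copy()
    out.columns = (out.columns.str.strip()
                              .str.lower()
                              .str.replace(' ', '_', regex=False))
    return out

# toy frame with no rows; only the column names matter here
toy = pd.DataFrame(columns=['Postal code', 'dayOfWeek', ' BlueCars_taken_sum '])
cleaned = list(normalise_columns(toy).columns)
print(cleaned)
```

Passing `regex=False` makes the replacement a literal string match, so the result does not depend on the default regex behaviour of the installed pandas version.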

In [10]:
# counting duplicate rows
autoel.duplicated().sum()

# dataframe/dataset has no duplicate rows

0

In [11]:
# checking for the sum of missing values in each column

autoel.isnull().sum()

# output reveals that dataframe has no missing values 

postal_code               0
date                      0
n_daily_data_points       0
dayofweek                 0
day_type                  0
bluecars_taken_sum        0
bluecars_returned_sum     0
utilib_taken_sum          0
utilib_returned_sum       0
utilib_14_taken_sum       0
utilib_14_returned_sum    0
slots_freed_sum           0
slots_taken_sum           0
dtype: int64

In [12]:
# checking the datatypes of the columns 

autoel.dtypes

# all but two columns in the dataframe are numerical variables

postal_code                int64
date                      object
n_daily_data_points        int64
dayofweek                  int64
day_type                  object
bluecars_taken_sum         int64
bluecars_returned_sum      int64
utilib_taken_sum           int64
utilib_returned_sum        int64
utilib_14_taken_sum        int64
utilib_14_returned_sum     int64
slots_freed_sum            int64
slots_taken_sum            int64
dtype: object

The date column is stored as object (string) and should be converted to datetime.

In [13]:
# converting date column to datetime
autoel.date = pd.to_datetime(autoel.date)

autoel.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16085 entries, 0 to 16084
Data columns (total 13 columns):
postal_code               16085 non-null int64
date                      16085 non-null datetime64[ns]
n_daily_data_points       16085 non-null int64
dayofweek                 16085 non-null int64
day_type                  16085 non-null object
bluecars_taken_sum        16085 non-null int64
bluecars_returned_sum     16085 non-null int64
utilib_taken_sum          16085 non-null int64
utilib_returned_sum       16085 non-null int64
utilib_14_taken_sum       16085 non-null int64
utilib_14_returned_sum    16085 non-null int64
slots_freed_sum           16085 non-null int64
slots_taken_sum           16085 non-null int64
dtypes: datetime64[ns](1), int64(11), object(1)
memory usage: 1.6+ MB
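
A quick way to gain confidence in the conversion is to parse with an explicit format and compare the derived weekday with the `dayofweek` column. The sketch below uses only the first five rows shown in the preview; the assumption that the whole column is month/day/year is not verified here.

```python
import pandas as pd

# first five rows of the preview, kept as raw strings
sample = pd.DataFrame({
    'date': ['1/1/2018', '1/2/2018', '1/3/2018', '1/4/2018', '1/5/2018'],
    'dayofweek': [0, 1, 2, 3, 4],
})

# assumption: month/day/year; errors='raise' stops on any value that does not fit
parsed = pd.to_datetime(sample['date'], format='%m/%d/%Y', errors='raise')

# pandas numbers weekdays 0 (Monday) to 6 (Sunday), the same convention as the glossary
matches = bool((parsed.dt.dayofweek == sample['dayofweek']).all())
print(matches)
```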


In [14]:
autoel.head()

Unnamed: 0,postal_code,date,n_daily_data_points,dayofweek,day_type,bluecars_taken_sum,bluecars_returned_sum,utilib_taken_sum,utilib_returned_sum,utilib_14_taken_sum,utilib_14_returned_sum,slots_freed_sum,slots_taken_sum
0,75001,2018-01-01,1440,0,weekday,110,103,3,2,10,9,22,20
1,75001,2018-01-02,1438,1,weekday,98,94,1,1,8,8,23,22
2,75001,2018-01-03,1439,2,weekday,138,139,0,0,2,2,27,27
3,75001,2018-01-04,1320,3,weekday,104,104,2,2,9,8,25,21
4,75001,2018-01-05,1440,4,weekday,114,117,3,3,6,6,18,20


In [15]:
# selecting rows where day_type = weekday because we will be working with weekdays only

autob = autoel[autoel['day_type'] == 'weekday']
# previewing the first ten observations in the resulting dataframe
autob.head(10)

Unnamed: 0,postal_code,date,n_daily_data_points,dayofweek,day_type,bluecars_taken_sum,bluecars_returned_sum,utilib_taken_sum,utilib_returned_sum,utilib_14_taken_sum,utilib_14_returned_sum,slots_freed_sum,slots_taken_sum
0,75001,2018-01-01,1440,0,weekday,110,103,3,2,10,9,22,20
1,75001,2018-01-02,1438,1,weekday,98,94,1,1,8,8,23,22
2,75001,2018-01-03,1439,2,weekday,138,139,0,0,2,2,27,27
3,75001,2018-01-04,1320,3,weekday,104,104,2,2,9,8,25,21
4,75001,2018-01-05,1440,4,weekday,114,117,3,3,6,6,18,20
7,75001,2018-01-08,1438,0,weekday,84,83,3,3,10,10,14,15
8,75001,2018-01-09,1439,1,weekday,81,84,1,1,4,4,15,15
9,75001,2018-01-10,1440,2,weekday,88,85,5,5,11,11,23,22
10,75001,2018-01-11,1440,3,weekday,125,125,3,4,13,13,22,22
11,75001,2018-01-12,1439,4,weekday,126,127,3,2,12,12,11,13


In [16]:
# dropping day_type column because it is constant (i.e., as weekday)
# dropping n_daily_data_points column because it is not relevant to the problem

autob = autob.drop(['day_type', 'n_daily_data_points'], axis=1)
autob.columns

Index(['postal_code', 'date', 'dayofweek', 'bluecars_taken_sum',
       'bluecars_returned_sum', 'utilib_taken_sum', 'utilib_returned_sum',
       'utilib_14_taken_sum', 'utilib_14_returned_sum', 'slots_freed_sum',
       'slots_taken_sum'],
      dtype='object')

In [0]:
# checking for outliers in the eight count columns (columns[3:])

box, axx = plt.subplots(2, 4, figsize=(15, 13))
box.suptitle('Box plots for Electric Cars Usage', fontsize=14, y=0.9)

for ax, column in zip(axx.flatten(), autob.columns[3:]):
  # pass the column as x= explicitly so the orientation does not depend on the seaborn version
  sns.boxplot(x=autob[column], ax=ax)

# boxplots indicate the presence of numerous outliers in the plotted columns
# however, we will not drop the outliers since they are reasonable/realistic
# autolib had about 3900 registered electric cars
    

In [0]:
from pandas.plotting import register_matplotlib_converters

# registers the pandas datetime converters with matplotlib; current pandas versions take no arguments here
register_matplotlib_converters()

In [19]:
# using pandas profiling to get a summarized report of the dataset
pp.ProfileReport(autob)



Overview
Statistic,Value
Number of variables,12
Number of observations,11544
Total Missing (%),0.0%
Total size in memory,1.1 MiB
Average record size in memory,96.0 B

Variable types
Type,Count
Numeric,5
Categorical,0
Boolean,0
Date,1
Text (Unique),0
Rejected,6
Unsupported,0

Per-variable summary (variable names inferred from the column order of the report; every summarised variable has 0 missing and 0 infinite values)
Variable,Distinct count,Zeros (%),Mean,Standard deviation,Minimum,Q1,Median,Q3,Maximum,Skewness,Kurtosis
index,11544,0.0%,8040,4643.5,0,4018.8,8040.5,12062.0,16084,0.00059134,-1.1997
postal_code,104,0.0%,88790,7648,75001,91330,92340,93400,95880,-1.1684,-0.54301
date,112,,,,2018-01-01,,,,2018-06-19,,
dayofweek,5,20.6%,1.9739,1.4178,0,1,2,3,4,0.027247,-1.3072
bluecars_taken_sum,789,0.4%,116.03,169.63,0,18,42,126,1093,2.3282,5.5538
utilib_taken_sum,42,35.5%,3.425,5.38,0,0,1,4,47,2.5098,7.3687

Rejected variables (highly correlated with another column)
Variable,Correlation
bluecars_returned_sum,0.99878
utilib_returned_sum,0.97947
utilib_14_taken_sum,0.93908
utilib_14_returned_sum,0.99096
slots_freed_sum,0.94576
slots_taken_sum,0.99915

Sample rows
Unnamed: 0,postal_code,date,dayofweek,bluecars_taken_sum,bluecars_returned_sum,utilib_taken_sum,utilib_returned_sum,utilib_14_taken_sum,utilib_14_returned_sum,slots_freed_sum,slots_taken_sum
0,75001,2018-01-01,0,110,103,3,2,10,9,22,20
1,75001,2018-01-02,1,98,94,1,1,8,8,23,22
2,75001,2018-01-03,2,138,139,0,0,2,2,27,27
3,75001,2018-01-04,3,104,104,2,2,9,8,25,21
4,75001,2018-01-05,4,114,117,3,3,6,6,18,20


## Exploratory Data Analysis

In [20]:
# checking for unique values in the postal_code column

print(autob.postal_code.unique())
print('\nNumber of unique values in postal_code column: %d' % len(autob.postal_code.unique()))

# there are 104 distinct postal codes in the postal_code column

[75001 75002 75003 75004 75005 75006 75007 75008 75009 75010 75011 75012
 75013 75014 75015 75016 75017 75018 75019 75020 75112 75116 78000 78140
 78150 91330 91370 91400 92000 92100 92110 92120 92130 92140 92150 92160
 92170 92190 92200 92210 92220 92230 92240 92250 92260 92270 92290 92300
 92310 92320 92330 92340 92350 92360 92370 92380 92390 92400 92410 92420
 92500 92600 92700 92800 93100 93110 93130 93150 93170 93200 93230 93260
 93300 93310 93350 93360 93370 93390 93400 93440 93500 93600 93700 93800
 94000 94100 94110 94120 94130 94140 94150 94160 94220 94230 94300 94340
 94410 94450 94500 94700 94800 95100 95870 95880]

Number of unique values in postal_code column: 104


In [21]:
# checking for unique values in the dayofweek column
autob.dayofweek.unique()


array([0, 1, 2, 3, 4])

In [22]:
# checking for the total sum of bluecars taken in each day of the week excluding the weekend

autob.groupby('dayofweek')[['bluecars_taken_sum']].sum().sort_values('bluecars_taken_sum', ascending=False)

# most bluecars (i.e., 288,546) are taken on Friday
# Monday and Thursday follow in second and third respectively

Unnamed: 0_level_0,bluecars_taken_sum
dayofweek,Unnamed: 1_level_1
4,288546
0,263893
3,263207
1,261940
2,261849
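
The weekdays do not have the same number of rows in `autob` (for example 2,374 Monday rows against 2,268 Wednesday rows), so raw totals mix two effects: how many rows a day has and how many cars are taken per row. The sketch below redoes the arithmetic from the totals above and the per-day row counts reported in this notebook; the column labels `share_%` and `mean_per_row` are illustrative names.

```python
import pandas as pd

days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
totals = pd.Series([263893, 261940, 261849, 263207, 288546], index=days)  # totals from the table above
n_rows = pd.Series([2374, 2363, 2268, 2268, 2271], index=days)            # rows per weekday in autob

share = (totals / totals.sum() * 100).round(1)   # share of all weekday rentals
per_row = (totals / n_rows).round(1)             # mean bluecars taken per (postal code, date) row

summary = pd.DataFrame({'total': totals, 'share_%': share, 'mean_per_row': per_row})
print(summary)
```

Friday stays on top under both views, which suggests its higher total is not simply a by-product of having more rows.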


In [23]:
autob.groupby('dayofweek')[['bluecars_returned_sum']].sum().sort_values('bluecars_returned_sum', ascending=False)

# most bluecars (i.e., 286,029) are returned on Friday
# Monday and Thursday follow in second and third respectively

Unnamed: 0_level_0,bluecars_returned_sum
dayofweek,Unnamed: 1_level_1
4,286029
0,264808
3,262961
2,260673
1,260470


### Univariate Analysis

#### Frequency Tables

In [24]:
# frequency table of the dayofweek column (number of rows per weekday)

autob['dayofweek'].value_counts()

0    2374
1    2363
4    2271
3    2268
2    2268
Name: dayofweek, dtype: int64

#### Measures of Central Tendency

In [25]:
# descriptive statistics of the columns in the dataframe

autob.describe()

Unnamed: 0,postal_code,dayofweek,bluecars_taken_sum,bluecars_returned_sum,utilib_taken_sum,utilib_returned_sum,utilib_14_taken_sum,utilib_14_returned_sum,slots_freed_sum,slots_taken_sum
count,11544.0,11544.0,11544.0,11544.0,11544.0,11544.0,11544.0,11544.0,11544.0,11544.0
mean,88789.959286,1.973926,116.028673,115.63938,3.424983,3.41762,7.999047,7.975485,20.945166,20.921431
std,7647.995374,1.417797,169.626905,168.344751,5.37995,5.349742,11.963164,11.88266,47.900208,47.84858
min,75001.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,91330.0,1.0,18.0,19.0,0.0,0.0,1.0,1.0,0.0,0.0
50%,92340.0,2.0,42.0,42.0,1.0,1.0,3.0,3.0,0.0,0.0
75%,93400.0,3.0,126.0,126.0,4.0,4.0,9.0,9.0,4.0,5.0
max,95880.0,4.0,1093.0,1056.0,47.0,49.0,94.0,94.0,288.0,294.0


In [26]:

# mean of bluecars_returned_sum, bluecars_taken_sum, slots_freed_sum & slots_taken_sum 

num_col = ['bluecars_returned_sum', 'bluecars_taken_sum', 'slots_freed_sum', 'slots_taken_sum']

for column in num_col:
  print('mean of {}: {:.2f}' .format(column, autob[column].mean()))

mean of bluecars_returned_sum: 115.64
mean of bluecars_taken_sum: 116.03
mean of slots_freed_sum: 20.95
mean of slots_taken_sum: 20.92


In [27]:
# median of bluecars_returned_sum, bluecars_taken_sum, slots_freed_sum, slots_taken_sum 

num_col = ['bluecars_returned_sum', 'bluecars_taken_sum', 'slots_freed_sum', 'slots_taken_sum']

for column in num_col:
  print('median of %s: %d' % (column, autob[column].median()))

median of bluecars_returned_sum: 42
median of bluecars_taken_sum: 42
median of slots_freed_sum: 0
median of slots_taken_sum: 0


In [28]:
# mode of bluecars_returned_sum, bluecars_taken_sum, slots_freed_sum, slots_taken_sum & of dayofweek

num_co = ['bluecars_returned_sum', 'bluecars_taken_sum', 'slots_freed_sum', 'slots_taken_sum', 'dayofweek']

for column in num_co:
  # mode() returns a Series (there can be several modes), so take the first value
  print('mode of %s: %d' % (column, autob[column].mode().iloc[0]))
  
# day 0/Monday has the most rows in the dataset
# the most frequent value of both returned bluecars and taken bluecars was 12


mode of bluecars_returned_sum: 12
mode of bluecars_taken_sum: 12
mode of slots_freed_sum: 0
mode of slots_taken_sum: 0
mode of dayofweek: 0
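
`Series.mode()` returns a Series rather than a single number, because a column can have several equally frequent values. A minimal sketch on toy data (not the Autolib dataset):

```python
import pandas as pd

values = pd.Series([3, 3, 7, 7, 9])   # toy data with two equally frequent values

modes = values.mode()                  # a Series, sorted in ascending order
print(modes.tolist())                  # both modes are reported
print(modes.iloc[0])                   # taking the first entry picks the smallest mode
```

When a single figure is reported, as in the cell above, it is worth checking `len(modes)` first so that ties are not silently hidden.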


#### Measures of Spread/Dispersion

In [29]:
# standard deviation of bluecars_returned_sum, bluecars_taken_sum, 
# slots_freed_sum & slots_taken_sum columns

num_col = ['bluecars_returned_sum', 'bluecars_taken_sum', 'slots_freed_sum', 'slots_taken_sum']

for column in num_col:
  print('Standard deviation of {}: {:.2f}' .format(column, autob[column].std()))
  
# the standard deviation of bluecars taken (169.63) exceeds its mean (116.03), so the values are widely dispersed

Standard deviation of bluecars_returned_sum: 168.34
Standard deviation of bluecars_taken_sum: 169.63
Standard deviation of slots_freed_sum: 47.90
Standard deviation of slots_taken_sum: 47.85


In [30]:
# variance of bluecars_returned_sum, bluecars_taken_sum, 
# slots_freed_sum & slots_taken_sum columns

num_col = ['bluecars_returned_sum', 'bluecars_taken_sum', 'slots_freed_sum', 'slots_taken_sum']

for column in num_col:
  print('Variance of {}: {:.2f}' .format(column, autob[column].var()))

# the variance of bluecars taken is high (28,773.29), consistent with its large standard deviation

Variance of bluecars_returned_sum: 28339.96
Variance of bluecars_taken_sum: 28773.29
Variance of slots_freed_sum: 2294.43
Variance of slots_taken_sum: 2289.49


In [31]:
# skewness of bluecars_returned_sum, bluecars_taken_sum, 
# slots_freed_sum & slots_taken_sum columns

num_col = ['bluecars_returned_sum', 'bluecars_taken_sum', 'slots_freed_sum', 'slots_taken_sum']

for column in num_col:
  print('Skewness of {}: {:.2f}' .format(column, autob[column].skew()))

  
# the distributions of bluecars taken and slots taken exhibit positive skewness (i.e., are skewed to the right)
# the modes of bluecars taken & slots taken are less than the means of bluecars taken and slots taken respectively.

Skewness of bluecars_returned_sum: 2.33
Skewness of bluecars_taken_sum: 2.33
Skewness of slots_freed_sum: 2.54
Skewness of slots_taken_sum: 2.54


In [32]:
# kurtosis of bluecars_returned_sum, bluecars_taken_sum, 
# slots_freed_sum & slots_taken_sum columns

for column in num_col:
  print('Kurtosis of {}: {:.2f}' .format(column, autob[column].kurt()))
  
# distribution of the sum of bluecars taken and sum of slots taken 
# has positive kurtosis indicating the presence/profusion of outliers
# distribution can be described as heavy-tailed/leptokurtic

Kurtosis of bluecars_returned_sum: 5.54
Kurtosis of bluecars_taken_sum: 5.55
Kurtosis of slots_freed_sum: 6.01
Kurtosis of slots_taken_sum: 6.00
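
When reading these values it helps to know that `Series.kurt()` reports excess kurtosis (Fisher's definition), so a normal distribution scores about 0, and `Series.skew()` is about 0 for a symmetric one. The sketch below illustrates this on simulated samples; the sample size and seed are arbitrary choices and the data are not from the Autolib dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
normal = pd.Series(rng.normal(size=100_000))         # symmetric reference sample
skewed = pd.Series(rng.exponential(size=100_000))    # right-skewed, heavy right tail

for name, s in [('normal', normal), ('exponential', skewed)]:
    print(f'{name}: skew={s.skew():.2f}, excess kurtosis={s.kurt():.2f}')
```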


In [33]:
# range of bluecars_returned_sum, bluecars_taken_sum, 
# slots_freed_sum & slots_taken_sum columns

for column in num_col:
  
  maxi = autob[column].max()
  mini = autob[column].min()
  
  range_col = maxi - mini
  
  print('Range of {}: {:.2f}' .format(column, range_col))

# range in the sum of bluecars taken is 1093
  

Range of bluecars_returned_sum: 1056.00
Range of bluecars_taken_sum: 1093.00
Range of slots_freed_sum: 288.00
Range of slots_taken_sum: 294.00


In [34]:
# first, second and third quantiles of the sum of bluecars and slots taken and returned

for column in num_col:
  quant = autob[column].quantile([0.25,0.5,0.75])
  
  print('\nfirst, second and third quantiles for {}: \n{}'.format(column, quant))


first, second and third quantiles for bluecars_returned_sum: 
0.25     19.0
0.50     42.0
0.75    126.0
Name: bluecars_returned_sum, dtype: float64

first, second and third quantiles for bluecars_taken_sum: 
0.25     18.0
0.50     42.0
0.75    126.0
Name: bluecars_taken_sum, dtype: float64

first, second and third quantiles for slots_freed_sum: 
0.25    0.0
0.50    0.0
0.75    4.0
Name: slots_freed_sum, dtype: float64

first, second and third quantiles for slots_taken_sum: 
0.25    0.0
0.50    0.0
0.75    5.0
Name: slots_taken_sum, dtype: float64
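
The quartiles also explain why the box plots flag so many points: by default a box plot draws its whiskers at 1.5 times the interquartile range beyond the quartiles and marks anything further out as an outlier. A minimal sketch of that rule, where `iqr_fences` is a hypothetical helper and `toy` is made-up data:

```python
import pandas as pd

def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# quartiles of bluecars_taken_sum reported above
lower, upper = iqr_fences(18.0, 126.0)
print(lower, upper)

# counting flagged points on toy data (not the Autolib dataset)
toy = pd.Series([5, 18, 42, 126, 300, 1093])
n_flagged = int((toy > upper).sum())
print(n_flagged)
```

Because the counts cannot be negative, only the upper fence matters for these columns.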


In [0]:
heatmap = autob.select_dtypes('number').corr()   # correlation is computed on the numeric columns only
plt.figure(figsize=(10,6))
sns.heatmap(heatmap, xticklabels=heatmap.columns, yticklabels=heatmap.columns, annot=True)
plt.title('Heatmap Showing Correlation Between Different Numerical Variables', fontsize=15, pad=2)
plt.show()

In [0]:
# histogram showing the distribution of total number of cars taken

plt.figure(figsize=(6,4))
# histplot replaces the deprecated distplot; stat='percent' makes the bar heights sum to 100,
# whereas rescaling density ticks by 100 does not give percentages
sns.histplot(autob.bluecars_taken_sum, bins=10, stat='percent', color='olive')
plt.xlabel('Total Number of Cars Taken')
plt.ylabel('Share of Observations [%]', fontsize=12)
plt.title('Distribution of Total Number of Bluecars Taken')
plt.show()

# most observations fall in the lowest bin, i.e. roughly 0-110 bluecars taken

In [37]:
autob.head()

Unnamed: 0,postal_code,date,dayofweek,bluecars_taken_sum,bluecars_returned_sum,utilib_taken_sum,utilib_returned_sum,utilib_14_taken_sum,utilib_14_returned_sum,slots_freed_sum,slots_taken_sum
0,75001,2018-01-01,0,110,103,3,2,10,9,22,20
1,75001,2018-01-02,1,98,94,1,1,8,8,23,22
2,75001,2018-01-03,2,138,139,0,0,2,2,27,27
3,75001,2018-01-04,3,104,104,2,2,9,8,25,21
4,75001,2018-01-05,4,114,117,3,3,6,6,18,20


In [38]:
oto = autob.copy()

oto['dayofweek'] = oto.dayofweek.map({0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday'})
oto.head()

Unnamed: 0,postal_code,date,dayofweek,bluecars_taken_sum,bluecars_returned_sum,utilib_taken_sum,utilib_returned_sum,utilib_14_taken_sum,utilib_14_returned_sum,slots_freed_sum,slots_taken_sum
0,75001,2018-01-01,Monday,110,103,3,2,10,9,22,20
1,75001,2018-01-02,Tuesday,98,94,1,1,8,8,23,22
2,75001,2018-01-03,Wednesday,138,139,0,0,2,2,27,27
3,75001,2018-01-04,Thursday,104,104,2,2,9,8,25,21
4,75001,2018-01-05,Friday,114,117,3,3,6,6,18,20


In [0]:
# barplot showing the distribution of entries/observations between mon-fri


oto.dayofweek.value_counts().plot(kind='bar', figsize=(6,4), color='chocolate')
plt.xticks(rotation=360)
plt.xlabel('Day of Week')
plt.ylabel('No. of Entries')
plt.title('Distribution of Entries Between Days of the Week', fontsize=15)
plt.show()

# no. of entries for each day of week are somewhat evenly distributed ranging between 2200-2400
# monday and tuesday had slightly more entries

In [0]:
# total number of bluecars taken per weekday (groupby sorts the day names alphabetically)
au = oto.groupby('dayofweek')[['bluecars_taken_sum']].sum().reset_index()

plt.figure(figsize=(7,5))
plt.pie(au.bluecars_taken_sum, labels=au.dayofweek, autopct='%1.1f%%', startangle=90)
plt.axis('equal')
plt.title('Pie Chart of Total Number of Bluecars Taken by Day Between Mon-Fri')
plt.show()

# according to the pie chart, the total number of bluecars taken was highest on Friday,
# which accounts for 21.5% of all bluecars taken during the weekdays

In [0]:
br = oto.groupby('dayofweek')['bluecars_returned_sum'].sum().reset_index()
plt.figure(figsize=(7,5))
plt.pie(br.bluecars_returned_sum, labels=br.dayofweek, autopct='%1.1f%%', startangle=90)
plt.axis('equal')
plt.title('Pie Chart of Total Number of Bluecars Returned by Day Between Mon-Fri')
plt.show()


### Bivariate Analysis

In [0]:
autm = autob.iloc[:61,:]   # selecting the first 61 observations from the autob dataframe.

# plotting a line graph of the sum of bluecars taken over time
plt.figure(figsize=(15,4))
plt.plot(autm.date, autm.bluecars_taken_sum)
plt.xticks(autm.date, rotation=90)

plt.xlabel('Date')
plt.ylabel('Sum of Bluecars Taken')
plt.title('Line Graph Showing Trend in Taking Bluecars', fontsize=15)
plt.show()

# output reveals constant fluctuation in the number of bluecars taken
# over the roughly three-month period between 2018-01-01 and 2018-03-28
# the lowest sum of bluecars taken was on 2018-01-09
# the next two lowest sums were on 2018-01-29 and 2018-02-19 (a dip roughly every twenty days)

In [0]:
plt.figure(figsize=(6,4))
plt.scatter(autob.bluecars_taken_sum, autob.bluecars_returned_sum)
plt.xlabel('Sum of Bluecars Taken')
plt.ylabel('Sum of Bluecars Returned')
plt.title('Scatter Plot for Bluecars Taken vs Bluecars Returned', fontsize=15)
plt.show()

# scatter plot reveals a near-perfect positive linear relationship between the sum of bluecars taken and the sum of bluecars returned.
# the two columns can be considered to provide the same information

In [0]:
plt.figure(figsize=(6,4))
plt.scatter(autob.bluecars_taken_sum, autob.slots_freed_sum, color='coral')
plt.xlabel('Sum of Bluecars Taken')
plt.ylabel('Sum of Recharging Slots Freed')
plt.title('Scatter Plot for Bluecars Taken vs Slots Freed', fontsize=15)
plt.show()

# scatter plot reveals that the sum of bluecars taken and the sum of slots freed are strongly positively correlated

In [0]:
plt.figure(figsize=(6,4))
plt.scatter(autob.bluecars_taken_sum, autob.utilib_14_taken_sum, color='cadetblue')
plt.xlabel('Sum of Bluecars Taken')
plt.ylabel('Sum of Utilib_14 Taken')
plt.title('Scatter Plot for Bluecars Taken vs Utilib_14 Taken', fontsize=15)
plt.show()

In [0]:
plt.figure(figsize=(6,4))
plt.scatter(autob.bluecars_taken_sum, autob.utilib_taken_sum)
plt.xlabel('Sum of Bluecars Taken')
plt.ylabel('Sum of Utilibs Taken')
plt.title('Scatter Plot for Bluecars Taken vs Taken Utilibs', fontsize=15)
plt.show()

# scatter plot shows a positive correlation between sum of utilibs taken and sum of bluecars taken

In [0]:
# color-coded scatter plots
plt.figure(figsize=(7,4))
sns.scatterplot(x=oto.bluecars_taken_sum, y=oto.slots_freed_sum, hue=oto['dayofweek'])
plt.xlabel('Sum of Bluecars Taken')
plt.ylabel('Sum of Slots Freed')
plt.title('Color-coded Scatter Plot of Sum of Bluecars Taken and Slots Freed', fontsize=14, color='blue')
plt.show()

# the color-coded scatter plot reveals that on Fridays (purple dots), the sum of
# bluecars taken is often higher, and this is accompanied by an increase in the number of recharging slots freed.

In [0]:

# plotting a bar graph of Sum of bluecars taken vs Day of Week
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']

plt.figure(figsize=(6,4))
# au is grouped by day name, so its rows are in alphabetical order;
# plot by name and fix the order explicitly instead of using the positional index
sns.barplot(x='dayofweek', y='bluecars_taken_sum', data=au, order=days)
plt.ylabel('Sum of Bluecars Taken')
plt.xlabel('Day of Week (mon-fri)')
plt.title('Sum of Bluecars Taken Grouped By Day of Week (Mon-Fri)', fontsize=15, color='blue')
plt.show()

# total sum of bluecars taken is relatively similar from Monday to Thursday
# Friday recorded the highest total number of bluecars taken
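
Grouping by day name sorts the groups alphabetically, so the positional index of the grouped frame no longer corresponds to the weekday number (position 0 is Friday, not Monday). A minimal sketch on toy rows, not the Autolib dataset, showing the pitfall and one way to restore calendar order:

```python
import pandas as pd

days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']

# toy rows for illustration only
toy = pd.DataFrame({
    'dayofweek': ['Friday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'],
    'bluecars_taken_sum': [10, 4, 5, 6, 7, 12],
})

by_day = toy.groupby('dayofweek')['bluecars_taken_sum'].sum()
print(list(by_day.index))          # alphabetical order: Friday comes first

ordered = by_day.reindex(days)     # restore calendar order before plotting
print(ordered)
```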

#### Correlation

In [49]:
# pearson/standard correlation coefficient between sum of bluecars taken and sum of utilibs taken

pearson = autob.bluecars_taken_sum.corr(autob.utilib_taken_sum, method='pearson')
print('pearson correlation coefficient: %.4f' % pearson)

# output reveals a strong positive correlation between the sum of bluecars taken and the sum of utilibs taken

pearson correlation coefficient: 0.8842


In [50]:
# pearson/standard correlation coefficient between sum of bluecars taken and sum of utilib_14 taken

pearson = autob.bluecars_taken_sum.corr(autob.utilib_14_taken_sum, method='pearson')
print('pearson correlation coefficient: %.4f' % pearson)

# output reveals a strong positive correlation between the sum of bluecars taken and the sum of utilib_14 taken

pearson correlation coefficient: 0.9387


In [51]:
# pearson/standard correlation coefficient between sum of bluecars taken and sum of recharging slots freed

pearson = autob.bluecars_taken_sum.corr(autob.slots_freed_sum, method='pearson')
print('pearson correlation coefficient: %.4f' % pearson)

# output reveals a strong positive correlation between the sum of bluecars taken and the sum of recharging slots freed

pearson correlation coefficient: 0.9457


In [52]:
# pearson/standard correlation coefficient between sum of bluecars taken and sum of bluecars returned

pearson = autob.bluecars_taken_sum.corr(autob.bluecars_returned_sum, method='pearson')
print('pearson correlation coefficient: %.4f' % pearson)

# output reveals a near-perfect positive linear correlation between the sum of bluecars taken and the sum of bluecars returned

pearson correlation coefficient: 0.9988
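
For reference, the Pearson coefficient returned by `corr(method='pearson')` is the covariance of the two columns divided by the product of their standard deviations. The sketch below computes it by hand on a small made-up dataset. The `pearson_r` helper and the `taken` and `returned` lists are hypothetical and are not rows from the Autolib data; the aim is only to illustrate what the coefficient measures.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # sum of products of deviations (covariance up to a constant factor)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # sums of squared deviations (variances up to the same constant factor)
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# hypothetical daily counts, for illustration only
taken = [10, 20, 30, 40, 50]
returned = [12, 19, 33, 38, 52]

r = pearson_r(taken, returned)
print('pearson correlation coefficient: %.4f' % r)
```

A value close to 1 means the two series rise and fall together almost linearly, which is the pattern seen above for bluecars taken and bluecars returned.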


### Multivariate Analysis

In [0]:
# pairplot of scatter plot and histogram of the columns in the dataframe

sns.pairplot(autob)
plt.show()

In [54]:
# getting the correlation of numerical variables in the dataframe

autob.corr()

Unnamed: 0,postal_code,dayofweek,bluecars_taken_sum,bluecars_returned_sum,utilib_taken_sum,utilib_returned_sum,utilib_14_taken_sum,utilib_14_returned_sum,slots_freed_sum,slots_taken_sum
postal_code,1.0,0.00013,-0.701709,-0.701877,-0.629116,-0.62919,-0.661567,-0.661256,-0.75473,-0.754422
dayofweek,0.00013,1.0,0.030798,0.028943,0.024498,0.02501,0.024469,0.024331,0.022517,0.022582
bluecars_taken_sum,-0.701709,0.030798,1.0,0.99878,0.884239,0.883856,0.93873,0.937613,0.9457,0.944855
bluecars_returned_sum,-0.701877,0.028943,0.99878,1.0,0.884533,0.884332,0.939084,0.938248,0.945758,0.946006
utilib_taken_sum,-0.629116,0.024498,0.884239,0.884533,1.0,0.979469,0.836193,0.835359,0.84548,0.844701
utilib_returned_sum,-0.62919,0.02501,0.883856,0.884332,0.979469,1.0,0.835772,0.835549,0.845527,0.845376
utilib_14_taken_sum,-0.661567,0.024469,0.93873,0.939084,0.836193,0.835772,1.0,0.99096,0.895978,0.895574
utilib_14_returned_sum,-0.661256,0.024331,0.937613,0.938248,0.835359,0.835549,0.99096,1.0,0.895244,0.895647
slots_freed_sum,-0.75473,0.022517,0.9457,0.945758,0.84548,0.845527,0.895978,0.895244,1.0,0.999154
slots_taken_sum,-0.754422,0.022582,0.944855,0.946006,0.844701,0.845376,0.895574,0.895647,0.999154,1.0


In [55]:
# dropping columns not required for this analysis
# postal_code irrelevant as analysis is based on total number of bluecars taken

autoli = autob.drop(['postal_code', 'bluecars_returned_sum', 'utilib_returned_sum', 'utilib_14_returned_sum'], axis=1)
autoli.columns


Index(['date', 'dayofweek', 'bluecars_taken_sum', 'utilib_taken_sum',
       'utilib_14_taken_sum', 'slots_freed_sum', 'slots_taken_sum'],
      dtype='object')

In [0]:
# heatmap showing correlation between variables in the autoli dataframe
sns.set()
sns.heatmap(autoli.corr(), annot=True)
plt.title('Heatmap of Autoli Variables')
plt.show()

In [57]:
# specifying our features and target variables

features = autoli.drop(['bluecars_taken_sum'], axis=1)
target = autoli['bluecars_taken_sum']

# the sum of bluecars taken is the dependent/target variable
# the remaining columns will be the independent variables

features.head()

Unnamed: 0,date,dayofweek,utilib_taken_sum,utilib_14_taken_sum,slots_freed_sum,slots_taken_sum
0,2018-01-01,0,3,10,22,20
1,2018-01-02,1,1,8,23,22
2,2018-01-03,2,0,2,27,27
3,2018-01-04,3,2,9,25,21
4,2018-01-05,4,3,6,18,20


In [58]:
# creating three new columns (year, month and day) that result from the date column

features[['year', 'month', 'day']] = features.date.apply(lambda x: pd.Series(x.strftime("%Y,%m,%d").split(',')))
features.head()

Unnamed: 0,date,dayofweek,utilib_taken_sum,utilib_14_taken_sum,slots_freed_sum,slots_taken_sum,year,month,day
0,2018-01-01,0,3,10,22,20,2018,1,1
1,2018-01-02,1,1,8,23,22,2018,1,2
2,2018-01-03,2,0,2,27,27,2018,1,3
3,2018-01-04,3,2,9,25,21,2018,1,4
4,2018-01-05,4,3,6,18,20,2018,1,5


In [59]:
# checking the datatypes of the resulting columns
features.dtypes


date                   datetime64[ns]
dayofweek                       int64
utilib_taken_sum                int64
utilib_14_taken_sum             int64
slots_freed_sum                 int64
slots_taken_sum                 int64
year                           object
month                          object
day                            object
dtype: object

In [60]:
# converting the datatypes of month, year and day columns to integers

for column in features.columns[6:]:
  features[column] = features[column].astype(int)
  
# confirming the change
features.dtypes

date                   datetime64[ns]
dayofweek                       int64
utilib_taken_sum                int64
utilib_14_taken_sum             int64
slots_freed_sum                 int64
slots_taken_sum                 int64
year                            int64
month                           int64
day                             int64
dtype: object
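
As an alternative to formatting each date as a string and converting the pieces back to integers, the date parts can be read directly from a datetime column. The sketch below uses the pandas `.dt` accessor on a small hypothetical frame (`demo` is a stand-in, not the `features` dataframe) and assumes the column already has a datetime dtype, as `date` does above.

```python
import pandas as pd

# hypothetical frame standing in for the `features` dataframe
demo = pd.DataFrame({'date': pd.to_datetime(['2018-01-01', '2018-01-02', '2018-03-09'])})

# the .dt accessor returns integer columns, so no string parsing or astype(int) is needed
demo['year'] = demo['date'].dt.year
demo['month'] = demo['date'].dt.month
demo['day'] = demo['date'].dt.day

print(demo.dtypes)
```

This approach avoids a row-by-row `apply`, which tends to matter more as the number of rows grows.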

In [61]:
features.year.unique()

array([2018])

In [62]:
# dropping the year and date columns because they are not relevant to the problem
# year column is constant (2018) and date column is redundant

features.drop(['date', 'year'], axis=1, inplace=True)
features.columns

Index(['dayofweek', 'utilib_taken_sum', 'utilib_14_taken_sum',
       'slots_freed_sum', 'slots_taken_sum', 'month', 'day'],
      dtype='object')

In [0]:
# splitting data to training and test sets using train_test_split()
# 30% of the data will make up the testing set and the rest will be the training set

features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.3, random_state=45)
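
To make the idea of a 70/30 hold-out split concrete, the sketch below shuffles row indices with a fixed seed and sets aside 30% of them for testing. It is a simplified illustration in plain Python and not scikit-learn's implementation; `simple_split` and the `rows` list are hypothetical names used only here.

```python
import random

def simple_split(rows, test_size=0.3, seed=45):
    """Shuffle row indices and hold out a fraction of them for testing."""
    indices = list(range(len(rows)))
    # a seeded generator makes the shuffle, and so the split, reproducible
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_size)
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(100))  # hypothetical stand-in for 100 observations
train, test = simple_split(rows)
print(len(train), len(test))
```

Fixing the seed plays the same role as `random_state` in the cell above: rerunning the notebook gives the same training and test rows.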


In [0]:
# creating and training the model by fitting the linear regression model on the training set
linre = LinearRegression()
res = linre.fit(features_train, target_train)

In [0]:
# grabbing predictions off/from the test set 
pred = linre.predict(features_test)

In [0]:
# visualizing the prediction using a scatter plot
plt.figure(figsize=(7,5))
plt.scatter(target_test, pred)
plt.title('Scatter plot for Test Set and Predicted Set', fontsize=14, color='blue')
plt.xlabel('Target Test')
plt.ylabel('Predicted Sum')
plt.show()

# output reveals a positive correlation between the predicted values and the actual values in the test set

In [67]:
# calculating the coefficient of determination, R2
r2_score(target_test,pred)

# output indicates that a linear model explains approx. 94.63% of response data variability
# model is a very good fit

0.9462799710579674
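
The coefficient of determination can also be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does this on made-up numbers. The `r_squared` helper and the `actual` and `predicted` lists are hypothetical and are not the notebook's `target_test` and `pred`.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # squared error left over after prediction
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # squared error of always predicting the mean
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# hypothetical actual and predicted values, for illustration only
actual = [30, 45, 60, 80, 105]
predicted = [33, 43, 62, 78, 101]

print(round(r_squared(actual, predicted), 4))
```

Perfect predictions give a score of 1, while always predicting the mean of the actual values gives 0, so a score near 0.95 means the model removes most of the variability that the mean alone leaves unexplained.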

In [68]:
# Previewing the coefficients of the equation and the y intercept of the linear model

(linre.coef_, linre.intercept_)


(array([ 0.90803644,  6.15038465,  5.48051754,  2.16568999, -0.636727  ,
        -0.72586692, -0.11484885]), 21.097293141148825)

In [69]:
fet2 = sm.add_constant(features.values)
model = sm.OLS(target, fet2).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:     bluecars_taken_sum   R-squared:                       0.946
Model:                            OLS   Adj. R-squared:                  0.946
Method:                 Least Squares   F-statistic:                 2.887e+04
Date:                Sun, 17 Nov 2019   Prob (F-statistic):               0.00
Time:                        16:02:59   Log-Likelihood:                -58794.
No. Observations:               11544   AIC:                         1.176e+05
Df Residuals:                   11536   BIC:                         1.177e+05
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         21.4748      1.211     17.731      0.0

In [0]:
# a larger (insignificant) p-value suggests that changes in the predictor are not associated with changes in the response.
# since x5 (i.e., slots_taken_sum) has p-value > 0.05, it is statistically insignificant and it can be dropped to reduce the number of features
# we will have therefore reduced our features from seven to six


## Hypothesis Testing

### Simple Random Sampling

In [70]:
# using simple random sampling to create a sample (sample 1)
# the size of the sample is 30% of the data

samp1 = autob.sample(frac=0.3, random_state=101)
samp1.head()


Unnamed: 0,postal_code,date,dayofweek,bluecars_taken_sum,bluecars_returned_sum,utilib_taken_sum,utilib_returned_sum,utilib_14_taken_sum,utilib_14_returned_sum,slots_freed_sum,slots_taken_sum
5677,92170,2018-03-09,4,83,84,0,0,5,4,0,0
3738,78140,2018-06-15,4,31,31,1,0,4,4,0,0
7153,92270,2018-05-31,3,30,24,0,0,3,5,0,0
14707,94340,2018-01-26,4,48,51,0,0,4,3,0,0
7921,92330,2018-05-08,1,44,40,1,2,5,9,4,2


In [71]:
# checking shape of sample 1
samp1.shape

# our sample (created from the autob dataframe) has 3,463 rows and 11 columns

(3463, 11)

In [72]:
# grouping sample 1 by dayofweek and displaying sum of bluecars taken

sawan = samp1.groupby('dayofweek')[['bluecars_taken_sum']].sum().reset_index()
sawan['dayofweek'] = sawan['dayofweek'].map({0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday'})
sawan

Unnamed: 0,dayofweek,bluecars_taken_sum
0,Monday,80151
1,Tuesday,81031
2,Wednesday,79700
3,Thursday,80698
4,Friday,89593


In [90]:
# we are comparing observed counts across categories (days of the week). Therefore, we will use a chi-square test to test our hypothesis
# specifically, we will use a chi-square goodness of fit test


observed = list(sawan['bluecars_taken_sum'])
total = sum(observed)
expected = [round(total/5)]

# magnitude of repetition declaration
mag = 5

# list comprehension
expected = [item for item in expected for i in range(mag)]



[80151, 81031, 79700, 80698, 89593]

In [104]:

import scipy, scipy.stats
from scipy.stats import chisquare
# scipy.array is no longer available in newer SciPy releases, so numpy arrays are used instead
ob_val = np.array(observed)
exp_val = np.array(expected)

chi = chisquare(f_obs=ob_val, f_exp=exp_val)
print('Chi square test statistic: {} \np value: {}'.format(chi[0],chi[1]))
print('p value rounded off to four dp:', round(chi[1],4))

Chi square test statistic: 835.6713078372956 
p value: 1.4398750459546759e-179
p value rounded off to four dp: 0.0
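
The goodness of fit statistic is simple enough to check by hand. The sketch below recomputes it in plain Python from the sample 1 totals shown above. It differs from the cell above in one respect, which is an assumption of this sketch: the expected count is left unrounded so that the expected values sum exactly to the observed total (recent SciPy releases may reject inputs whose sums do not agree). The p-value uses the closed form of the chi-square upper tail for 4 degrees of freedom.

```python
import math

# observed Monday-Friday totals from sample 1 above
observed = [80151, 81031, 79700, 80698, 89593]

# under the null hypothesis every day gets an equal share of the total;
# keeping the exact (unrounded) share makes the expected counts sum to the observed total
expected_each = sum(observed) / len(observed)

# chi-square statistic: sum of (observed - expected)^2 / expected
chi_stat = sum((o - expected_each) ** 2 / expected_each for o in observed)

# with 5 categories there are 4 degrees of freedom; for 4 degrees of freedom the
# chi-square upper-tail probability has the closed form exp(-x/2) * (1 + x/2)
p_value = math.exp(-chi_stat / 2) * (1 + chi_stat / 2)

print('Chi square test statistic:', chi_stat)
print('p value:', p_value)
```

Because the expected count is not rounded here, the statistic differs very slightly from the value printed above, but the conclusion is the same.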


Since the p value (approximately 1.44e-179, which rounds to 0.0) is less than the assumed level of significance of 0.05, we reject the null hypothesis that there is no difference in the number of cars taken from stations on different days of the week.

Thus, the test results indicate that there is a difference in the number of cars taken from stations on different days of the week.

In [111]:
# creating a second random sample from the autob dataframe
# size of second sample is also 30% of original data

samp2 = autob.sample(frac=0.3, random_state=30)
samp2.head()

Unnamed: 0,postal_code,date,dayofweek,bluecars_taken_sum,bluecars_returned_sum,utilib_taken_sum,utilib_returned_sum,utilib_14_taken_sum,utilib_14_returned_sum,slots_freed_sum,slots_taken_sum
12992,94000,2018-01-26,4,105,91,2,1,11,13,0,0
12419,93500,2018-03-21,2,39,36,0,1,4,7,0,0
7502,92310,2018-01-18,3,46,52,1,1,5,4,0,0
5479,92160,2018-01-23,1,23,24,1,1,7,4,0,0
889,75006,2018-04-23,0,164,169,4,5,11,9,34,34


In [112]:
# checking the shape of sample 2
samp2.shape

# second sample also has 3,463 rows and 11 columns (shape similar to sample 1)

(3463, 11)

In [115]:
# grouping sample 2 by dayofweek and displaying the sum of bluecars taken
satu = samp2.groupby('dayofweek')[['bluecars_taken_sum']].sum().reset_index()
satu

Unnamed: 0,dayofweek,bluecars_taken_sum
0,0,80222
1,1,74933
2,2,82067
3,3,77426
4,4,91088


In [116]:
obsaved = list(satu['bluecars_taken_sum'])
totol = sum(obsaved)
expekted = [round(totol/5)]

# magnitude of repetition declaration
mag = 5

# list comprehension
expekted = [item for item in expekted for i in range(mag)]

expekted

[81147, 81147, 81147, 81147, 81147]

In [117]:
obs_val = np.array(obsaved)
expt_val = np.array(expekted)

chis = chisquare(f_obs=obs_val, f_exp=expt_val)
print('Chi square test statistic: {} \np value: {}'.format(chis[0],chis[1]))
print('p value rounded off to four dp:', round(chis[1],4))

Chi square test statistic: 1885.284027752104 
p value: 0.0
p value rounded off to four dp: 0.0


For the second sample, the p value is reported as 0.0, meaning it is too small to be represented as a floating-point number. We reject the null hypothesis and accept the alternate hypothesis.

We conclude that there is a difference in the number of cars taken from stations on different days of the week. This conclusion is consistent with the test results from the first sample.

In [118]:
# creating a third random sample from the autob dataframe
# size of third sample is also 30% of original data

samp3 = autob.sample(frac=0.3, random_state=3098)
samp3.head()

Unnamed: 0,postal_code,date,dayofweek,bluecars_taken_sum,bluecars_returned_sum,utilib_taken_sum,utilib_returned_sum,utilib_14_taken_sum,utilib_14_returned_sum,slots_freed_sum,slots_taken_sum
14937,94410,2018-04-13,4,30,29,1,1,1,2,0,0
2527,75017,2018-02-01,3,690,694,17,18,34,29,204,205
12060,93400,2018-01-25,3,114,98,4,4,9,8,0,0
14001,94150,2018-04-13,4,23,25,1,1,0,1,3,3
15896,95870,2018-05-07,0,45,42,0,0,1,2,0,0


In [119]:
# grouping sample 3 by dayofweek and displaying the sum of bluecars taken
sat3 = samp3.groupby('dayofweek')[['bluecars_taken_sum']].sum().reset_index()
sat3

Unnamed: 0,dayofweek,bluecars_taken_sum
0,0,81332
1,1,79955
2,2,80672
3,3,74757
4,4,83683


In [120]:
obsvd = list(sat3['bluecars_taken_sum'])
ttal = sum(obsvd)
expd = [round(ttal/5)]

# magnitude of repetition declaration
mag = 5

# list comprehension
expd = [item for item in expd for i in range(mag)]

expd

[80080, 80080, 80080, 80080, 80080]

In [121]:
obsd_val = np.array(obsvd)
exptd_val = np.array(expd)

chisq = chisquare(f_obs=obsd_val, f_exp=exptd_val)
print('Chi square test statistic: {} \np value: {}'.format(chisq[0],chisq[1]))
print('p value rounded off to four dp:', round(chisq[1],4))

Chi square test statistic: 540.0790584415585 
p value: 1.4333597270200174e-115
p value rounded off to four dp: 0.0


The p value for the third sample is also less than the level of significance (0.05). We reject the null hypothesis and accept the alternate hypothesis.

We conclude that there is a difference in the number of cars taken from stations on different days of the week.


#### Challenge Solution

In [0]:
# grouping the population by day of week and displaying the sum of bluecars taken each day

group_cars = autob.groupby('dayofweek')[['bluecars_taken_sum']].sum()
group_cars.head()

Unnamed: 0_level_0,bluecars_taken_sum
dayofweek,Unnamed: 1_level_1
0,263893
1,261940
2,261849
3,263207
4,288546


In [0]:
# we can check the distribution of the total number of bluecars taken by day between Monday-Friday

group_cars.sort_values('bluecars_taken_sum', ascending=False)

# output reveals a difference in the total number of bluecars taken each day between Mon-Fri
# Friday recorded the highest total number of bluecars taken from stations, followed by Monday and Thursday respectively

Unnamed: 0_level_0,bluecars_taken_sum
dayofweek,Unnamed: 1_level_1
4,288546
0,263893
3,263207
1,261940
2,261849


## Conclusion


*   Since the p-value determined for each of the three samples is below the level of significance of 0.05, we reject the null hypothesis and accept the alternate hypothesis.

*   Thus, we conclude that there is a difference in the total number of cars taken from stations on different days of the week.

*   The test results are backed by the results of the groupby() function, which reveal a difference in the number of cars taken between Monday and Friday.

*   An assumption made is that highly correlated features/variables represent the same information, so one can be used in place of the other. As a result, the columns 'bluecars_returned_sum', 'utilib_returned_sum' and 'utilib_14_returned_sum' were not included in the analysis.









