# Chapter 1 Statistics and Probability

The case study for this chapter is a cycle sharing scheme: determining its brand persona.

The aim is to "_lay out a strong marketing plan for reaching out to potential customers_".

Infrastructure expansion (more bikes and docking stations) by Jason, the owner, did not increase the rate of customer sign-ups or customer retention.

Questions from Nancy, the marketer, and Eric, the data analyst:
1. Which attributes correlate best with trip duration and number of trips?
1. Which age generation adapts the most to the service?


In [None]:
# Listing 1-1

%matplotlib inline

import random
import datetime
import pandas as pd
import matplotlib.pyplot as plt
import statistics
import numpy as np
import scipy
from scipy import stats
import seaborn


In [None]:
import sys
import matplotlib
print(f"python {sys.version}")
print(f"numpy {np.__version__}")
print(f"pandas {pd.__version__}")
print(f"matplotlib {matplotlib.__version__}")
print(f"scipy {scipy.__version__}")
print(f"seaborn {seaborn.__version__}")


### Some information about the service

- The service has 500 bikes at 50 stations across Seattle. 
- Each of the stations has a dock locking system (where all bikes are parked); 
    - kiosks (so customers can get a membership key or pay for a trip); and 
    - a helmet rental service. 
- A person can choose between purchasing 
    - a membership key or 
    - short-term pass. 
- A membership key entitles an annual membership, and the key can be obtained from a kiosk. 
- Advantages for members include 
    - quick retrieval of bikes and 
    - unlimited 45-minute rentals. 
- Short-term passes offer access to bikes for 
    - a 24-hour or 
    - 3-day time interval. 
- Riders can avail and return the bikes at any of the 50 stations citywide.

Table 1-1. Data Dictionary for the Trips Data from Cycles Share Dataset

|Feature name | Description|
|:----|:----|
|trip_id|Unique ID assigned to each trip|
|Starttime|Day and time when the trip started, in PST|
|Stoptime|Day and time when the trip ended, in PST|
|Bikeid |ID attached to each bike|
|Tripduration |Time of trip in seconds|
|from_station_name | Name of station where the trip originated|
| to_station_name | Name of station where the trip terminated|
| from_station_id | ID of station where trip originated |
|to_station_id |ID of station where trip terminated |
|Usertype |Value can include either of the following: short-term pass holder or member|
|Gender |Gender of the rider |
|Birthyear|Birth year of the rider |

Dependent variable: `tripduration`

In [None]:
# Listing 1-2

data = pd.read_csv('datasets/Chapter 1/trip.csv')

In [None]:
data.shape

Dataset has 236065 records (rows) and 12 features (columns)

In [None]:
# Listing 1-3

print(len(data))
data.head()

In [None]:
data.describe()

The `describe()` output above covers only the numerical features in the dataset. Summary statistics for `trip_id` and `birthyear` are not meaningful as plain numbers, so these two features need to be converted. `birthyear` also seems to have some missing values.

`tripduration` is in seconds, whereas the service description and the metrics discussed use minutes, so it is better to convert it to minutes for analysis.
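
As a minimal sketch of that conversion, the snippet below divides a few hypothetical durations (made up for illustration, not taken from `trip.csv`) by 60. The helper name `seconds_to_minutes` is an assumption of this sketch; on the DataFrame the same arithmetic would be `data['tripduration'] / 60`.

```python
# Hypothetical trip durations in seconds (not taken from trip.csv).
durations_sec = [300, 945, 2700]


def seconds_to_minutes(seconds):
    """Convert a duration in seconds to minutes."""
    return seconds / 60


durations_min = [seconds_to_minutes(s) for s in durations_sec]
print(durations_min)  # [5.0, 15.75, 45.0]
```

The last value, 45 minutes, corresponds to the rental limit mentioned in the service description, which is easier to spot in minutes than as 2700 seconds.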

In [None]:
data.info()

Observations on the data:
- `tripduration` is in seconds
- `trip_id` is an int
- `starttime` and `stoptime` are strings. To perform datetime analysis, the date features need to be converted to DateTime format.
- `gender` and `birthyear` have missing values
- The dataset does not provide a user ID or any other identifier for users, so there is no way to tell whether trips were taken repeatedly by the same customer, by different customers, or by a new customer. It also does not contain the length of membership for members of the service.
- However, `bikeid` is available and can be used to analyse repeat trips from or to the same stations.

In [None]:
print(f"% of missing 'gender': {(data['gender'].isnull().sum())/len(data)*100:.2f}")
print(f"% of missing 'birthyear': {(data['birthyear'].isnull().sum())/len(data)*100:.2f}")

In [None]:
data.tail()

### Converting date columns to DateTime format

In [None]:
for date_col in ['starttime', 'stoptime']:
    data[date_col] = pd.to_datetime(data[date_col], format = "%m/%d/%Y %H:%M")

### Sort by `starttime`

In [None]:
data = data.sort_values(by='starttime', ascending=True)
data = data.reset_index(drop=True)  # reset_index returns a new frame; assign it so label 0 is the earliest trip

### Determine earliest and last date/time of records

Insight from below analysis:
- data covers roughly two years (spanning three calendar years), from October 2014 to September 2016
- service is operational beyond traditional 9-5 office operational hours
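
The length of the collection period can be checked with plain `datetime` arithmetic. The sketch below uses two hypothetical boundary timestamps chosen only to match the months named above (they are not the actual first and last records), parsed with the same format string the notebook passes to `pd.to_datetime`.

```python
from datetime import datetime

FMT = "%m/%d/%Y %H:%M"  # same format string used for starttime/stoptime

# Hypothetical boundary timestamps, NOT the real first/last records.
first = datetime.strptime("10/01/2014 00:00", FMT)
last = datetime.strptime("09/30/2016 00:00", FMT)

span = last - first
print(span.days)                      # 730
print(round(span.days / 365.25, 2))   # 2.0
```

A span of about 730 days is two years of data, even though it touches the calendar years 2014, 2015 and 2016.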

In [None]:
print(f"first entry in starttime series: {data['starttime'].iloc[0]}")  # .iloc selects by position, not by index label
print(f"last entry in stoptime series: {data['stoptime'].iloc[-1]}")

## Univariate Analysis

In [None]:
# Listing 1-4

# data = data.sort_values(by='starttime', ascending=True)
# data.reset_index()
# print('Date range of dataset: %s - %s'%(data.ix[1, 'starttime'], data.ix[len(data)-1, 'stoptime']))


In [None]:
print(f"Date range of dataset : {data['starttime'].iloc[0]} - {data['stoptime'].iloc[-1]}")

### Plotting Distribution of User Types

Who uses the service more: members or short-term pass holders?

`usertype` **Member** accounts for the majority of trips on this Seattle bike-sharing service.

In [None]:
# Listing 1-5

groupby_user = data.groupby('usertype').size()
groupby_user.plot.bar(title = 'Distribution of user types')

### Plotting Distribution by Gender

Similarly, are males or females the main users?

`gender` **Male** accounts for the majority of trips on this Seattle bike-sharing service.

In [None]:
# Listing 1-6

groupby_gender = data.groupby('gender').size()
groupby_gender.plot.bar(title = 'Distribution of genders')

### Plotting Distribution by BirthYear

Which age group uses the service most?

Users born between the early 1980s and the late 1990s account for the majority of trips on this Seattle bike-sharing service.

In [None]:
# Listing 1-7

data = data.sort_values(by='birthyear')
groupby_birthyear = data.groupby('birthyear').size()
groupby_birthyear.plot.bar(title = 'Distribution of birth years', figsize = (15,4))

### Plotting Distribution by Frequency of usage by Member Types for Millennials

Among Millennials who use the service, are they members or pass holders?

Among Millennials, `usertype` Member is the dominant type of rider on this Seattle bike-sharing service.

In [None]:
# Listing 1-8

data_millenials = data[(data['birthyear'] >= 1977) & (data['birthyear']<=1994)]
groupby_millenials = data_millenials.groupby('usertype').size()
groupby_millenials.plot.bar(title = 'Distribution of user types by birthyear between 1977 and 1994')

## Multivariate Analysis

### Plotting the Distribution of Birth Years by Gender Type

The graph shows that riders were mostly men in every birth year except 1947, where all recorded trips were by women. Riders born in 1947 were born after the Second World War ended, so a wartime shortage of men does not explain this; a very small number of members born in that year is a more likely cause.

In [None]:
# Listing 1-9

groupby_birthyear_gender = data.groupby(['birthyear', 'gender'])['birthyear'].count().unstack('gender').fillna(0)
groupby_birthyear_gender[['Male','Female','Other']].plot.bar(title = 'Distribution of birth years by Gender', stacked=True, figsize = (15,4))

### Plotting the Distribution of Birth Years by User Types

Only Members have their `birthyear` data recorded.

In [None]:
# Listing 1-10

groupby_birthyear_user = data.groupby(['birthyear', 'usertype'])['birthyear'].count().unstack('usertype').fillna(0)
groupby_birthyear_user.plot.bar(title = 'Distribution of birth years by Usertype', stacked=True, figsize = (15,4))

### Validation If We Don’t Have Birth Year Available for Short-Term Pass Holders

Conclusion: the loyalty of Millennials can’t be validated from the data at hand.

In [None]:
# Listing 1-11

data[data['usertype']=='Short-Term Pass Holder']['birthyear'].isnull().values.all()

### Validation If We Don’t Have Gender Available for Short-Term Pass Holders

Demographic variables for user type ‘Short-Term Pass Holder’ are not recorded either.

In [None]:
# Listing 1-12

data[data['usertype']=='Short-Term Pass Holder']['gender'].isnull().values.all()

### Converting String to datetime, and Deriving New Features

In [None]:
# Listing 1-13

List_ = list(data['starttime'])

# List_ = [datetime.datetime.strptime(x, "%m/%d/%Y %H:%M") for x in List_]
data['starttime_mod'] = pd.Series(List_,index=data.index)
data['starttime_date'] = pd.Series([x.date() for x in List_],index=data.index)
data['starttime_year'] = pd.Series([x.year for x in List_],index=data.index)
data['starttime_month'] = pd.Series([x.month for x in List_],index=data.index)
data['starttime_day'] = pd.Series([x.day for x in List_],index=data.index)
data['starttime_hour'] = pd.Series([x.hour for x in List_],index=data.index)

In [None]:
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
data['starttime_dayofweek'] = pd.Series([days[x.weekday()] for x in List_],index=data.index)

### Plotting the Distribution of Trip Duration over Daily Time

In [None]:
# Listing 1-14

data.groupby('starttime_date')['tripduration'].mean().plot.bar(title = 'Distribution of Trip duration by date', figsize = (15,4))

### Exercises

1. Determine the distribution of number of trips by *year*. Do you see a specific pattern?
    - Usage increases over the years. Can we break this down by members versus pass holders?
2. Determine the distribution of number of trips by *month*. Do you see a specific pattern?
    - More trips are made in the summer months.
3. Determine the distribution of number of trips by *day*. Do you see a specific pattern?
    - No discernible pattern by date of the month, but there is a slight bump every 7 days; perhaps days of the week need to be plotted?
4. Determine the distribution of number of trips by ~~day~~ *hour*. Do you see a specific pattern?
    - More trips during the working hours of 10am-4pm and, surprisingly, in the small hours between 1am and 3am.
5. Plot a frequency distribution of trips on a daily basis.
    - Isn't this the 'Distribution of Trip duration by date' plot above?
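
The exercises ask for the *number* of trips per period, which is a count rather than a mean of `tripduration`. As a rough sketch of the counting step, the snippet below tallies a handful of hypothetical start times (made up for illustration) by year, month and hour using only the standard library. On the DataFrame, the analogous call would be `data.groupby('starttime_year').size()`, following the `.size()` pattern of Listing 1-5.

```python
from collections import Counter
from datetime import datetime

FMT = "%m/%d/%Y %H:%M"

# Hypothetical start times (not taken from trip.csv).
start_times = [
    "10/13/2014 10:31",
    "10/13/2014 17:05",
    "07/04/2015 09:15",
    "07/05/2015 09:40",
    "07/06/2016 18:20",
]
parsed = [datetime.strptime(s, FMT) for s in start_times]

# Count trips per period instead of averaging their duration.
trips_by_year = Counter(t.year for t in parsed)
trips_by_month = Counter(t.month for t in parsed)
trips_by_hour = Counter(t.hour for t in parsed)

print(sorted(trips_by_year.items()))   # [(2014, 2), (2015, 2), (2016, 1)]
print(sorted(trips_by_month.items()))  # [(7, 3), (10, 2)]
print(sorted(trips_by_hour.items()))   # [(9, 2), (10, 1), (17, 1), (18, 1)]
```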


In [None]:
data.groupby('starttime_year')['tripduration'].mean().plot.bar(title = 'Distribution of Trip duration by Year', figsize = (15,4))

In [None]:
data.groupby('starttime_month')['tripduration'].mean().plot.bar(title = 'Distribution of Trip duration by Month', figsize = (15,4))

In [None]:
data.groupby('starttime_day')['tripduration'].mean().plot.bar(title = 'Distribution of Trip duration by Date of the Month', figsize = (15,4))

In [None]:
data.groupby('starttime_hour')['tripduration'].mean().plot.bar(title = 'Distribution of Trip duration by Hour', figsize = (15,4))

In [None]:
data.groupby('starttime_dayofweek')['tripduration'].mean().reindex(days).plot.bar(title = 'Distribution of Trip duration by Day of Week', figsize = (15,4))  # reindex(days) orders bars Monday to Sunday instead of alphabetically

### Determining the Measures of Center Using Statistics Package

Note: *determining the measures of center using the statistics package requires transforming the input data structure to a list.*

- Most trips originated from Pier 69 / Alaskan Way & Clay St.
    - This makes it the ideal location for running promotional campaigns targeted at existing customers.
- The mean (average) is greater than the median (central value).
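
To see why a mean above the median hints at a long right tail, here is a small sketch with hypothetical durations (in minutes) and hypothetical station names, none of which come from `trip.csv`. One very long trip pulls the mean up while the median barely moves.

```python
import statistics

# Hypothetical trip durations in minutes; one very long trip sits in the right tail.
durations = [5, 6, 7, 8, 9, 10, 120]
# Hypothetical station labels.
stations = ["Station A", "Station B", "Station A", "Station C"]

mean_duration = statistics.mean(durations)
median_duration = statistics.median(durations)
mode_station = statistics.mode(stations)

print(f"mean: {mean_duration:.2f}")  # mean: 23.57
print(f"median: {median_duration}")  # median: 8
print(f"mode: {mode_station}")       # mode: Station A
```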

In [None]:
# Listing 1-15

trip_duration = list(data['tripduration'])
station_from = list(data['from_station_name'])
print(f"Mean of trip duration: {statistics.mean(trip_duration)/60:.2f} mins")
print(f"Median of trip duration: {statistics.median(trip_duration)/60:.2f} mins")
print(f"Mode of station originating from: {statistics.mode(station_from)}")

### Plotting Histogram of Trip Duration

excerpt from book:
> The distribution in Figure 1-11 has only one peak (i.e., mode). The distribution is not symmetric and has majority of values toward the right-hand side of the mode. These extreme values toward the right are negligible in quantity, but their extreme nature tends to pull the mean toward themselves. Thus the reason why the mean is greater than the median.
> The distribution in Figure 1-11 is referred to as a normal distribution.

> Normal distribution, or in other words Gaussian distribution, is a continuous probability distribution that is bell shaped. The important characteristic of this distribution is that the mean lies at the center of this distribution with a spread (i.e., standard deviation) around it.

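The `tripduration` distribution is *positively skewed* (a long tail to the right), which is why the mean exceeds the median; strictly speaking it is therefore not a normal distribution. The skew might be due to the presence of outliers.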
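
The direction of skew can also be quantified rather than read off the histogram. The sketch below computes the moment-based (Fisher-Pearson) skewness coefficient on hypothetical data; the function name `skewness` and both sample lists are assumptions of this example, not part of the notebook.

```python
def skewness(values):
    """Moment-based skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Hypothetical samples (not taken from trip.csv).
right_skewed = [5, 6, 7, 8, 9, 10, 120]
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed) > 0)  # True: long tail to the right
print(skewness(symmetric))         # 0.0
```

A positive coefficient indicates a tail toward larger values, a negative one a tail toward smaller values, and a value near zero indicates rough symmetry.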

In [None]:
# Listing 1-16

data['tripduration'].plot.hist(bins=100, title='Frequency distribution of Trip duration')
plt.show()

In [None]:
# Listing 1-17

# [Q1 - 1.5 (IQR) ,  Q3 + 1.5 (IQR) ] (i.e. IQR = Q3 - Q1)

### Plotting a Box plot of Trip Duration

The thick black bar is really a dense run of individual outlier points!
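
The whiskers follow the rule noted in Listing 1-17: values outside `[Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]` are drawn as outliers. A minimal sketch of that rule on hypothetical durations is shown below. It uses `statistics.quantiles` with `method="inclusive"`, which interpolates linearly between data points in the same way as the default of `np.percentile`.

```python
import statistics

# Hypothetical trip durations in minutes (not taken from trip.csv).
durations = [10, 12, 13, 15, 16, 18, 20, 22, 95]

# Quartiles by linear interpolation between data points.
q1, _, q3 = statistics.quantiles(durations, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences counts as an outlier.
outliers = [x for x in durations if x < lower_fence or x > upper_fence]
outlier_pct = len(outliers) * 100 / len(durations)

print(q1, q3)                    # 13.0 20.0
print(lower_fence, upper_fence)  # 2.5 30.5
print(outliers)                  # [95]
print(f"{outlier_pct:.2f} percent")  # 11.11 percent
```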

In [None]:
# Listing 1-18

box = data.boxplot(column=['tripduration'], vert=False, figsize = (15,4)) #add vert=False
plt.show()

### Determining Ratio of Values in Observations of tripduration Which Are Outliers


In [None]:
# Listing 1-19

q75, q25 = np.percentile(trip_duration, [75 ,25])
iqr = q75 - q25
print(f"Proportion of values as outlier: {((len(data) - len([x for x in trip_duration if q75+(1.5*iqr) >=x>= q25-(1.5*iqr)]))*100/float(len(data))):.4f} percent")

In [None]:
# Listing 1-20

# Number of outliers values = Length of all values - Length of all non outliers values

In [None]:
# Listing 1-21

# Ratio of outliers = ( Number of outliers values / Length of all values ) * 100

### Calculating z scores for Observations Lying Within tripduration

excerpt from book:
> Nancy was relieved to see only 9.5% of the values within the dataset to be outliers. Considering the time series nature of the dataset she knew that removing these outliers wouldn’t be an option. Hence she knew that the only option she could rely on was to apply transformation to these outliers to negate their extreme nature. However, she was interested in observing the mean of the non-outlier values of trip duration. This she then wanted to compare with the mean of all values calculated earlier in Listing 1-15.
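
The transformation described in the excerpt can be sketched on a toy list: values above the upper whisker are replaced by the mean of the non-outlier values, and everything else is left alone. The data and the helper name `cap_with_mean` are hypothetical; the helper mirrors the role of `transform_tripduration` in this notebook.

```python
import statistics

# Hypothetical trip durations in minutes (not taken from trip.csv).
durations = [10, 12, 13, 15, 16, 18, 20, 22, 95]

q1, _, q3 = statistics.quantiles(durations, n=4, method="inclusive")
iqr = q3 - q1
lower_whisker = q1 - 1.5 * iqr
upper_whisker = q3 + 1.5 * iqr

# Mean of the values that lie inside the whiskers.
non_outliers = [x for x in durations if lower_whisker <= x <= upper_whisker]
mean_non_outliers = statistics.mean(non_outliers)


def cap_with_mean(x, upper, mean):
    """Replace values above the upper whisker with the non-outlier mean."""
    return mean if x > upper else x


transformed = [cap_with_mean(x, upper_whisker, mean_non_outliers) for x in durations]
print(mean_non_outliers)  # 15.75
print(transformed)        # [10, 12, 13, 15, 16, 18, 20, 22, 15.75]
```

Every row is kept, which matters for time-ordered data, but the spread of the transformed column is smaller than that of the original.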

In [None]:
# Listing 1-22

mean_trip_duration = np.mean([x for x in trip_duration if q75+(1.5*iqr) >=x>= q25-(1.5*iqr)])
upper_whisker = q75+(1.5*iqr)
print('Mean of trip duration: %f'%mean_trip_duration)

### Calculating Mean Scores for Observations Lying Within tripduration

In [None]:
# modified for repeated use - 2nd time for males_mean_trip_duration
def transform_tripduration(x, upper, mean):
    
    if x > upper:
        return mean
    return x

In [None]:
# Listing 1-23

# def transform_tripduration(x):
    
#     if x > upper_whisker:
#         return mean_trip_duration
#     return x

data['tripduration_mean'] = data['tripduration'].apply(lambda x: transform_tripduration(x, upper_whisker, mean_trip_duration))

data['tripduration_mean'].plot.hist(bins=100, title='Frequency distribution of mean transformed Trip duration')
plt.show()

### Determining the Measures of Center in Absence of Outliers

In [None]:
# Listing 1-24

print('Mean of trip duration: %f'%data['tripduration_mean'].mean())
print('Standard deviation of trip duration: %f'%data['tripduration_mean'].std())
print('Median of trip duration: %f'%data['tripduration_mean'].median())

## Exercises

1. Find the mean, median, and mode of the trip duration of gender type male.
2. By looking at the numbers obtained earlier, in your opinion is the distribution symmetric or skewed? If skewed, is it positively skewed or negatively skewed?
    - Positively skewed.
3. Plot a frequency distribution of trip duration for trips availed by gender type male. Does it validate the inference you made in the previous question?
4. Plot a box plot of the trip duration of trips taken by males. Do you think any outliers exist?
5. Apply the formula in Listing ~~1-6~~ 1-19 to determine the percentage of observations that are outliers.
6. Treat the outliers using one of the methods discussed earlier.

### Multivariate Measures of Center for `tripduration` and `gender` == 'Male'

In [None]:
# Exercise 1
mask_males = data['gender'] == 'Male'
trip_durations_of_males = list(data[mask_males]['tripduration'])
trip_durations_of_males
print(f"Mean of trip_durations taken by males: {statistics.mean(trip_durations_of_males)/60:.2f} mins")
print(f"Median of trip_durations taken by males: {statistics.median(trip_durations_of_males)/60:.2f} mins")
print(f"Mode of trip_durations taken by males: {statistics.mode(trip_durations_of_males)/60:.2f} mins")

In [None]:
# Exercise 3
data[mask_males]['tripduration'].plot.hist(bins=100, title='Frequency distribution of Trip duration by Males')
plt.show()

In [None]:
# Exercise 4
# Yes, there are outliers
df = data[mask_males]
box = df.boxplot(column=['tripduration'], vert=False, figsize = (15,4)) #add vert=False
plt.show()

In [None]:
# Exercise 5
# Listing 1-19 modified for trip_durations_of_males
q75, q25 = np.percentile(trip_durations_of_males, [75 ,25])
iqr = q75 - q25
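# use the number of male trips, not len(data), as the total; otherwise every non-male row is counted as an outlier
print(f"Proportion of values as outlier: {((len(trip_durations_of_males) - len([x for x in trip_durations_of_males if q75+(1.5*iqr) >=x>= q25-(1.5*iqr)]))*100/float(len(trip_durations_of_males))):.4f} percent")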

In [None]:
# Exercise 6-1
# Listing 1-22 modified for trip_durations_of_males
males_mean_trip_duration = np.mean([x for x in trip_durations_of_males if q75+(1.5*iqr) >=x>= q25-(1.5*iqr)])
males_upper_whisker = q75+(1.5*iqr)
print('Mean of trip duration: %f'%males_mean_trip_duration)

In [None]:
# Exercise 6-2
# Listing 1-23 modified for trip_durations_of_males
trips = df['tripduration'].apply(lambda x: transform_tripduration(x, males_upper_whisker, males_mean_trip_duration))
# df.loc['name'] = ... would add a row labelled 'name', not a column;
# assign() returns a new frame with the column and avoids SettingWithCopyWarning
df = df.assign(trip_durations_of_males_mean=trips)

trips

In [None]:
df['trip_durations_of_males_mean'].plot.hist(bins=100, title='Frequency distribution of mean transformed Trip durations availed by Males')
plt.show()

### Correlation directions

In [None]:
data.birthyear.describe()

### Before imputation of `birthyear`
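
The Python documentation for the `statistics` module warns that NaN values cause surprising or undefined behaviour in functions that sort or count, such as `median()` and `mode()`, and recommends stripping them first. A minimal sketch of median imputation on hypothetical birth years follows; the list is made up for illustration. On the DataFrame, `data['birthyear'].median()` skips missing values by default.

```python
import math
import statistics

# Hypothetical birth years with missing values encoded as NaN (not taken from trip.csv).
birth_years = [1985.0, float("nan"), 1990.0, 1978.0, float("nan")]

# Strip NaN before computing the median.
known = [y for y in birth_years if not math.isnan(y)]
median_year = statistics.median(known)

# Fill each missing value with the median of the known values.
imputed = [median_year if math.isnan(y) else y for y in birth_years]

print(median_year)  # 1985.0
print(imputed)      # [1985.0, 1985.0, 1990.0, 1978.0, 1985.0]
```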

In [None]:
birth_year = list(data['birthyear'].dropna())  # drop NaN first: the statistics functions are unreliable on lists containing NaN
print(f"Mean of birth year: {statistics.mean(birth_year)}")
print(f"Median of birth year: {statistics.median(birth_year)}")
print(f"Mode of birth year: {statistics.mode(birth_year)}")

In [None]:
birthbox = data.boxplot(column=['birthyear'], vert=False, figsize = (15,4)) #add vert=False
plt.show()

In [None]:
backup = data.copy()
mask_birthyear_nan = data['birthyear'].isnull()

data.loc[mask_birthyear_nan, 'birthyear'] = data['birthyear'].median()  # pandas median skips NaN by default
data


### After imputation of `birthyear`

In [None]:
birth_year = list(data['birthyear'])
print(f"Mean of birth year: {statistics.mean(birth_year)}")
print(f"Median of birth year: {statistics.median(birth_year)}")
print(f"Mode of birth year: {statistics.mode(birth_year)}")

In [None]:
print(data['birthyear'].isnull().sum())
print(data['starttime_year'].isna().sum())

In [None]:
# Listing 1-26

# pd.set_option('display.width', 100)
# pd.set_option('precision', 3)  # invalid, changed in newer version?

# data['age'] = data['starttime_year'] - data['birthyear']

# correlations = data[['tripduration','age']].corr(method='pearson')
# print(correlations)

### Default values before setting new values

In [None]:
print(f"display.width: {str(pd.options.display.width)}")
print(f"display.precision: {str(pd.options.display.precision)}")


### Correlation Coefficient Between trip duration and age
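
For reference, the Pearson coefficient that `corr(method='pearson')` reports is the covariance of the two variables divided by the product of their standard deviations. The sketch below implements that definition by hand on hypothetical ages and durations; `pearson_r` and the sample lists are assumptions of this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data (not taken from trip.csv).
ages = [20, 30, 40, 50]
durations_up = [10, 12, 14, 16]    # rises with age
durations_down = [16, 14, 12, 10]  # falls with age

print(pearson_r(ages, durations_up))    # 1.0
print(pearson_r(ages, durations_down))  # -1.0
```

Values near +1 or -1 indicate a strong linear relationship in that direction, and values near 0 indicate little linear relationship.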

In [None]:
pd.set_option('display.width', 100)
pd.set_option('display.precision', 3)

In [None]:
data['age'] = data['starttime_year'] - data['birthyear'] # coeffs changed from book after birthyear imputed vs dropped

correlations = data[['tripduration','age']].corr(method='pearson')
print(correlations)

### Pairplot of trip duration and age

In [None]:
# Listing 1-25
# rearranged this to be after 1.26 because it is using 'age' feature which is engineered in 1-26
# data = data.dropna()
seaborn.pairplot(data, vars=['age', 'tripduration'], kind='reg')
plt.show()

# plot shows older people take longer bike trips, and all age groups take single trips lasting between 45-83 mins
# longest use of service is about 25000 s: 25000 s / 3600 s per hour ≈ 6.9 hrs (under a third of a day), not 17 hrs

### Computing Two-Tail t-test of Categories of gender and user types
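
As a rough sketch of what the test statistic measures, the snippet below computes the pooled-variance two-sample t statistic by hand, which is the form `stats.ttest_ind` uses with its default `equal_var=True`. The daily counts are hypothetical, and the p-value is omitted because it needs the t distribution, which the standard library does not provide.

```python
import math
import statistics


def pooled_t_statistic(a, b):
    """Two-sample t statistic assuming equal variances in both groups."""
    n_a, n_b = len(a), len(b)
    var_a = statistics.variance(a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(b)
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (statistics.mean(a) - statistics.mean(b)) / std_err


# Hypothetical daily trip counts for two categories (not taken from trip.csv).
group_a = [10, 12, 14]
group_b = [20, 22, 24]

t_stat = pooled_t_statistic(group_a, group_b)
print(f"{t_stat:.4f}")  # -6.1237
```

A statistic far from zero, relative to the degrees of freedom, suggests the two group means differ by more than sampling noise alone would explain.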

In [None]:
# Listing 1-27

for cat in ['gender','usertype']:

    print(f"Category: \n{cat}")
    groupby_category = data.groupby(['starttime_date', cat])['starttime_date'].count().unstack(cat)
    # print(groupby_category)
    # groupby_category = groupby_category.dropna() # 'usertype' isna() dropped
    category_names = list(groupby_category.columns)
    print(category_names)

    for comb in [(category_names[i],category_names[j]) for i in range(len(category_names)) for j in range(i+1, len(category_names))]:

        print(f"{comb[0]}, {comb[1]}")
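        # drop dates on which a category has no trips, so NaN counts do not propagate into the result
        t_statistics = stats.ttest_ind(groupby_category[comb[0]].dropna(), groupby_category[comb[1]].dropna())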
        print(f"Statistic: {t_statistics.statistic:.6f}, P value: {t_statistics.pvalue:.6f}")
        print('\n')

# since usertype for short-term users has no gender values and is dropped, there is no other usertype to compare to, how did book do it?
# book says 'gender' an 'age', though

### Script to Validate Central Limit Theorem on Trips Dataset
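
A complementary way to look at the central limit theorem is to simulate sample *means* directly. The sketch below draws from a hypothetical right-skewed population (an exponential distribution with mean 1, standing in for skewed real data) and shows that the means of larger samples cluster more tightly around the population mean. The function name, seed, sample sizes and repeat count are all arbitrary choices for this illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def sample_means(sample_size, repeats=2000):
    """Draw `repeats` samples of `sample_size` values and return each sample's mean."""
    means = []
    for _ in range(repeats):
        sample = [random.expovariate(1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return means


means_n1 = sample_means(1)    # single draws: as skewed as the population
means_n30 = sample_means(30)  # means of 30 draws: much closer to bell shaped

print(f"spread of means, n=1:  {statistics.stdev(means_n1):.3f}")
print(f"spread of means, n=30: {statistics.stdev(means_n30):.3f}")
```

The spread of the sample means should shrink roughly in proportion to one over the square root of the sample size, and a histogram of `means_n30` would look far more symmetric than one of `means_n1`.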

In [None]:
# Listing 1-28

daily_tickets = list(data.groupby('starttime_date').size())
sample_tickets = []
checkpoints = [1, 10, 100, 300, 500, 700]
plot_count = 1

random.shuffle(daily_tickets)

plt.figure(figsize=(15,7))
binrange=np.array(np.linspace(0,500,101))

for i in range(1000):
    if daily_tickets:
        sample_tickets.append(daily_tickets.pop())

    if i+1 in checkpoints or not daily_tickets:
        plt.subplot(2,3,plot_count)
        plt.hist(sample_tickets, binrange)
        plt.title('n=%d' % (i+1),fontsize=15)        
        plot_count+=1

    if not daily_tickets:
        break

plt.show()

In [None]:
data.shape

In [None]:
print(f"No of records dropped due to 'gender' isna() {len(backup) - len(data)}") # dropped gender == NaN
print(f"Percentage of total:  {(len(backup) - len(data))/len(backup)*100}")