#### Note:

In this notebook we estimate the rate parameter of the exponential distribution using bootstrapping. It contains the results for the second dataset of the first module (i.e. when can we expect the next call to come in?).

In [1]:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import statistics 
import datetime as dt
import PCATR as pcatr
import numpy as np
import datetime
import math
from sklearn.utils import resample

In [None]:
def load(file):
    df = pd.read_csv(file)
    df.drop(['Unnamed: 0'],inplace=True,axis=1)
    df['Calldate'] = pd.to_datetime(df['Calldate'])
    df['Calltime'] = pd.to_datetime(df['Calltime'])
    return df
dataFrame = load('HU_File_Q1_S2.csv')

In [None]:
dataFrame

In [None]:
def dayOfWeek(df,day):
    newData = df[df['DayOfWeek'] == day]
    uniqueDates = newData.Calldate.unique()
    return newData,uniqueDates

In [None]:
dataFrame['ArrivalDiff'] = dataFrame.groupby(['Calldate'])['ArrivalTime'].diff().fillna(dataFrame['ArrivalTime'])

#### Standard Error:

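The standard error (SE) is related to the standard deviation (SD), but the two measure different things. The standard deviation describes how spread out the individual observations are. The standard error describes how much a sample statistic, here the sample mean, would vary from sample to sample. For the mean it is estimated as the sample standard deviation divided by the square root of the sample size: SE = s / sqrt(n). The larger the sample, the smaller the standard error, even when the spread of the data stays the same.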
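
As a minimal sketch of the relationship between the two quantities, the snippet below computes both for a small list of made-up gaps between calls. The function name `standard_error_of_mean` and the values in `example_gaps` are assumptions for illustration only and are not taken from the call data.

```python
import math
import statistics


def standard_error_of_mean(values):
    """Return s / sqrt(n), where s is the sample standard deviation."""
    n = len(values)
    return statistics.stdev(values) / math.sqrt(n)


# Hypothetical gaps between consecutive calls (illustrative values only)
example_gaps = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

example_sd = statistics.stdev(example_gaps)          # spread of the observations
example_se = standard_error_of_mean(example_gaps)    # uncertainty of the mean

print("SD:", example_sd, "SE:", example_se)
```

The standard error is always smaller than the standard deviation for n > 1, because it is the same quantity divided by sqrt(n).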

In [None]:
def standardError(df):
    days = list(df.DayOfWeek.unique())
    mean = []
    se = []
    for i in days:
        data,date = dayOfWeek(df,i)
        val = data.ArrivalDiff/60
        mu1 = sum(val)/val.shape[0]
        std = statistics.stdev(val)
        se1 = std/math.sqrt(val.shape[0])
        print(i,"Sample size: ",val.shape[0],"Square root of sample size: ",math.sqrt(val.shape[0]))
        mean.append(mu1)
        se.append(se1)
    return mean,se

Here we calculate the standard error for each day of the week. The function returns two lists: the first contains the mean of the arrival differences for each day, and the second contains the standard error of that mean for each day. We calculate these because we want to plot standard error bars.

In [None]:
mean,se = standardError(dataFrame)

#### Error Bar:

An error bar is a line through a point on a graph, parallel to one of the axes, which represents the uncertainty or variation of the corresponding coordinate of the point.

Error bars can communicate the following information about your data:
How spread out the data are around the mean value (small SD bar = low spread, data are clumped around the mean; larger SD bar = larger spread, data are more variable from the mean).

The reliability of the mean value as a representative number for the data set.  In other words, how accurately the mean value represents the data (small SD bar = more reliable, larger SD bar = less reliable).  It's important to note that just because you have a larger SD, it does not indicate your data is not valid.  

The likelihood of there being a significant difference between data sets.

#### What do Error Bars Indicate about Statistical Significance?

A "significant difference" means that the results that are seen are most likely not due to chance or sampling error. In any experiment or observation that involves sampling from a population, there is always the possibility that an observed effect would have occurred due to sampling error alone. But if a result is "significant," then the investigator may conclude that the observed effect actually reflects the characteristics of the population rather than just sampling error or chance.

The standard deviation error bars on a graph can be used to get a sense for whether or not a difference is significant.  Look for overlap between the standard deviation bars:

When standard deviation error bars overlap quite a bit, it's a clue that the difference is not statistically significant. You must actually perform a statistical test to draw a conclusion.

Similarly, when standard deviation error bars overlap only slightly, it's a clue that the difference is probably not statistically significant. You must actually perform a statistical test to draw a conclusion.

Moreover, when standard deviation error bars do not overlap, it's a clue that the difference may be significant, but you cannot be sure.  You must actually perform a statistical test to draw a conclusion.

We can see from our plots that there is no overlap.

In [None]:
# Build the plot
#%matplotlib notebook
def plotErrorBars(mean,se,df):
    days = list(df.DayOfWeek.unique())
    fig, ax = plt.subplots()
    x_pos = np.arange(len(days))
    ax.bar(x_pos, mean, yerr=se, align='center', alpha=0.7, ecolor='black', capsize=10)
    ax.set_ylabel('Mean Arrival Difference')
    ax.set_xticks(x_pos)
    ax.set_xticklabels(days)
    ax.set_title('Mean of Arrival Differences for each day of week')
    ax.yaxis.grid(True)

    # Save the figure and show
    plt.tight_layout()
    #plt.savefig('bar_plot_with_error_bars.png')
    plt.show()
    
    fig, ax = plt.subplots()
    x_pos = np.arange(len(days))
    ax.errorbar(x_pos, mean, yerr=se, ecolor='black', fmt='o', capsize=20)
    ax.set_ylabel('Mean Arrival Difference')
    ax.set_xticks(x_pos)
    ax.set_xticklabels(days)
    ax.set_title('Mean of Arrival Differences for each day of week')
    ax.yaxis.grid(True)

    # Save the figure and show
    #plt.savefig('bar_plot_with_error_bars.png')
    plt.tight_layout()
    plt.show()
plotErrorBars(mean,se,dataFrame)

We plot the call count per hour for each day of the week to see whether there is a peak hour. From the plots, the calls appear to be randomly distributed within the day, with no obvious pattern.

In [None]:
def plotCallCountPerHour(df):
    days = list(df.DayOfWeek.unique())
    hours = df.hour.unique()
    l = [i for i in hours]
    for i in days:
        d, date = dayOfWeek(df,i)
        c = d.groupby(['Calldate','hour']).count()
        c = c.reset_index()
        plt.figure(figsize=(10,7))
        # Pass x and y by keyword; recent seaborn versions reject them as positional arguments
        sns.swarmplot(x='hour', y='DialStart', data=c)
        plt.title(i)
        plt.show()
plotCallCountPerHour(dataFrame)

#### Bootstrapping:

Simulation methods in which the distribution to be sampled from is determined from data are called bootstrap methods. The bootstrap is a powerful, computer-based method for statistical inference that does not rely on many assumptions.
The basic idea of the bootstrap is to make inferences about an estimate (such as the sample mean) of a population parameter θ (such as the population mean) using only the sample data. It is a resampling method: we repeatedly draw samples with replacement from the existing sample, classically with the same sample size n, and perform inference on these resampled data.

Our bootstrapping function takes the dataFrame as input, along with the day of the week (if none is given it runs on the whole data) and the number of times to resample. Each resample has a length of 90% of the data that is fed in; this is a choice made here, whereas the classical bootstrap draws resamples of the full size n. We sample with replacement and calculate the mean, and we repeat this process 10000 times. By the central limit theorem, the sample means are approximately normally distributed. Our function outputs the list of sample means, the mean of the data fed in, the list of seed values passed to the resample function, the length of each resample, and the data that was passed into the function.

Furthermore, we calculate the mean of sample means which we call the bootstrap mean. Then we calculate the difference between the mean of arrival differences of the data and the bootstrap mean to see how much the values differ.
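
The resampling idea described above can be sketched in a few lines of NumPy. This is an illustrative variant under stated assumptions, not the function used in this notebook: it draws resamples of the full size n, uses a single random generator instead of one seed per iteration, and runs on synthetic exponential data. The names `bootstrap_means_sketch`, `synthetic_gaps` and `boot_means` are hypothetical.

```python
import numpy as np


def bootstrap_means_sketch(data, n_resamples=1000, seed=0):
    """Return the mean of each bootstrap resample (drawn with replacement)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n = data.size
    # One row of random indices per resample; repeated indices are allowed,
    # which is what "sampling with replacement" means.
    idx = rng.integers(0, n, size=(n_resamples, n))
    return data[idx].mean(axis=1)


# Synthetic gaps with a true mean of 20 (illustrative, not the call data)
synthetic_gaps = np.random.default_rng(42).exponential(scale=20.0, size=500)

boot_means = bootstrap_means_sketch(synthetic_gaps)
print("Sample mean:   ", synthetic_gaps.mean())
print("Bootstrap mean:", boot_means.mean())
print("Difference:    ", synthetic_gaps.mean() - boot_means.mean())
```

The bootstrap mean should land very close to the plain sample mean; the more useful output is the spread of `boot_means`, which estimates the uncertainty of that mean.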

In [None]:
def bootstrap(df,day=None,loop=10000):
    if day is not None:
        first, _ = dayOfWeek(df,day)
    else:
        first = df
    bootstrapMeans = []
    seeds = []
    data = np.array(first.ArrivalDiff)
    totMean = statistics.mean(data)
    samples = round(len(first)*0.9)
    seed = 0
    for i in range(loop):
        bootstrapMeans.append(statistics.mean(resample(data,replace = True, n_samples=samples, random_state=seed )))
        seeds.append(seed)
        seed+=1
        if seed%1000==0:
            print("epoch", seed)
    return (bootstrapMeans,totMean,seeds,samples,first)


For each day of the week we calculate the mean of sample means which we call bootstrap mean. Then we calculate the difference between the simple mean (of arrival differences) and the bootstrap mean.

### Monday

In [None]:
sampleMeans, meanOfData, seedList, numberofSamples, dayWiseData = bootstrap(dataFrame,'Monday')
#print('Simple Mean of data: ',meanOfData,'\n','Bootstrap Means: ', sampleMeans)

In [None]:
bootstrapMean = statistics.mean(sampleMeans)
print('Simple Mean of data: ',meanOfData,' Bootstrap Mean ', bootstrapMean, ' Difference ', meanOfData-bootstrapMean)

### Tuesday

In [None]:
sampleMeans1, meanOfData1, seedList1, numberofSamples1, dayWiseData1 = bootstrap(dataFrame,'Tuesday')
#print('Simple Mean of data: ',meanOfData1,'\n','Bootstrap Means: ', sampleMeans1)

In [None]:
bootstrapMean1 = statistics.mean(sampleMeans1)
print('Simple Mean of data: ',meanOfData1,' Bootstrap Mean ', bootstrapMean1, ' Difference ', meanOfData1-bootstrapMean1)

### Wednesday

In [None]:
sampleMeans2, meanOfData2, seedList2, numberofSamples2, dayWiseData2 = bootstrap(dataFrame,'Wednesday')
#print('Simple Mean of data: ',meanOfData2,'\n','Bootstrap Means: ', sampleMeans2)

In [None]:
bootstrapMean2 = statistics.mean(sampleMeans2)
print('Simple Mean of data: ',meanOfData2,' Bootstrap Mean ', bootstrapMean2, ' Difference ', meanOfData2-bootstrapMean2)

### Thursday

In [None]:
sampleMeans3, meanOfData3, seedList3, numberofSamples3, dayWiseData3 = bootstrap(dataFrame,'Thursday')
#print('Simple Mean of data: ',meanOfData3,'\n','Bootstrap Means: ', sampleMeans3)

In [None]:
bootstrapMean3 = statistics.mean(sampleMeans3)
print('Simple Mean of data: ',meanOfData3,' Bootstrap Mean ', bootstrapMean3, ' Difference ', meanOfData3-bootstrapMean3)

### Friday

In [None]:
sampleMeans4, meanOfData4, seedList4, numberofSamples4, dayWiseData4 = bootstrap(dataFrame,'Friday')
#print('Simple Mean of data: ',meanOfData4,'\n','Bootstrap Means: ', sampleMeans4)

In [None]:
bootstrapMean4 = statistics.mean(sampleMeans4)
print('Simple Mean of data: ',meanOfData4,' Bootstrap Mean ', bootstrapMean4, ' Difference ', meanOfData4-bootstrapMean4)

### Saturday

In [None]:
sampleMeans5, meanOfData5, seedList5, numberofSamples5, dayWiseData5 = bootstrap(dataFrame,'Saturday')
#print('Simple Mean of data: ',meanOfData5,'\n','Bootstrap Means: ', sampleMeans5)

In [None]:
bootstrapMean5 = statistics.mean(sampleMeans5)
print('Simple Mean of data: ',meanOfData5,' Bootstrap Mean ', bootstrapMean5, ' Difference ', meanOfData5-bootstrapMean5)

### Sunday

In [None]:
sampleMeans6, meanOfData6, seedList6, numberofSamples6, dayWiseData6 = bootstrap(dataFrame,'Sunday')
#print('Simple Mean of data: ',meanOfData6,'\n','Bootstrap Means: ', sampleMeans6)

In [None]:
bootstrapMean6 = statistics.mean(sampleMeans6)
print('Simple Mean of data: ',meanOfData6,' Bootstrap Mean ', bootstrapMean6, ' Difference ', meanOfData6-bootstrapMean6)

In [None]:
#We plot the sample means of Monday and we can see that the central limit theorem is working. 
plt.hist(sampleMeans,edgecolor='black')
plt.show()

We calculate the sample means of the whole dataset i.e. containing all days of week. We then calculate the bootstrap mean and then finally we calculate the difference between the simple mean and bootstrap mean. 

The purpose of calculating the mean of the whole dataset was to see if there is a difference between the means of individual days and the mean of the whole dataset. 

Furthermore, the mean of the whole dataset could be used to model any days whose means match it. Instead of finding seven parameters for the exponential random variables, we could then use fewer parameters wherever there is an overlap.

In [None]:
bootstrapMeans, totMean, seeds, samples, first = bootstrap(dataFrame,None,10000)    

In [None]:
print('Mean of whole dataset: ',totMean,'\n','Bootstrap Means: ', bootstrapMeans)

In [None]:
bootstrapMean7 = statistics.mean(bootstrapMeans)
print('Mean of whole dataset ',totMean,' Bootstrap Mean ', bootstrapMean7, ' Difference ', totMean-bootstrapMean7)

We calculate the confidence intervals for our bootstrap estimates. We use the approach described in chapter 5 of the book Statistics for Engineers and Scientists by William Navidi.
We sort the sample means, then take the mean of the sorted values at positions 250 to 260 as the lower confidence limit and the mean of the values at positions 9750 to 9760 as the upper confidence limit. With 10000 sample means this corresponds to an approximate 95% confidence interval.
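
For comparison, a common alternative is the percentile interval, which reads the 2.5th and 97.5th percentiles directly. The sketch below is an assumption-laden illustration rather than the method used in this notebook; `percentile_interval` and `stand_in_means` are hypothetical names, and the input is simply the integers 0 to 9999 so that the positions are easy to follow.

```python
import numpy as np


def percentile_interval(boot_means, confidence=0.95):
    """Return the (lower, upper) percentile bootstrap confidence limits."""
    alpha = (1.0 - confidence) / 2.0
    lower, upper = np.percentile(boot_means, [100 * alpha, 100 * (1 - alpha)])
    return float(lower), float(upper)


# Stand-in for 10000 bootstrap means: here each value equals its sorted position
stand_in_means = np.arange(10000)

pct_low, pct_high = percentile_interval(stand_in_means)

# The averaging approach described in the text, applied to the same stand-in data
avg_low = float(stand_in_means[250:261].mean())
avg_high = float(stand_in_means[9750:9761].mean())

print("Percentile limits:", pct_low, pct_high)
print("Averaged limits:  ", avg_low, avg_high)
```

On real bootstrap means the two approaches give nearly identical limits, since both pick values around the 2.5% and 97.5% positions of the sorted list.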

In [None]:
def confidenceIntervals(sampMean,tFifth,tSixth,nFifth,nSixth):
    # Sorts the list in place, then averages a small window of values
    # around the 2.5% and 97.5% positions.
    sampMean.sort()
    lConfInterval = statistics.mean(sampMean[tFifth:tSixth])
    uConfInterval = statistics.mean(sampMean[nFifth:nSixth])
    return lConfInterval,uConfInterval
# The inputs are the lists returned by bootstrap() above (sampleMeans ... sampleMeans6 for
# Monday to Sunday, bootstrapMeans for the whole week).
interval, confInterval = confidenceIntervals(sampleMeans,250,261,9750,9761)
interval1, confInterval1 = confidenceIntervals(sampleMeans1,250,261,9750,9761)
interval2, confInterval2 = confidenceIntervals(sampleMeans2,250,261,9750,9761)
interval3, confInterval3 = confidenceIntervals(sampleMeans3,250,261,9750,9761)
interval4, confInterval4 = confidenceIntervals(sampleMeans4,250,261,9750,9761)
interval5, confInterval5 = confidenceIntervals(sampleMeans5,250,261,9750,9761)
interval6, confInterval6 = confidenceIntervals(sampleMeans6,250,261,9750,9761)
interval7, confInterval7 = confidenceIntervals(bootstrapMeans,250,261,9750,9761)

To save and load data.

In [None]:
#import pickle
#with open('bootstrapmeans.pkl', 'wb') as f:
#    pickle.dump(bootstrapMeans, f)
#with open('totalmean.pkl', 'wb') as f:
#    pickle.dump(totMean, f)
    

#with open('Exponential/mondaysecondwisedatamean6.pkl', 'rb') as f:
#    meanOfData6 = pickle.load(f)



We plot the bootstrap mean of each day and the bootstrap mean of the whole dataset, along with their confidence intervals. If the confidence intervals of two means overlap, it is a clue that the difference between those two means may not be statistically significant.

From our plots we can see that the interval for the whole week overlaps with Friday, so that difference does not appear to be statistically significant. However, since this is the only overlap with the weekly mean, we still need seven parameters (seven expected values) to build an exponential distribution for each day of the week. Hence, we do not perform any hypothesis testing on the weekly mean.
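
The overlap check that the plots support visually can also be written as a small helper. This is a sketch with made-up numbers; `intervals_overlap` and the three example intervals are hypothetical and are not results from this notebook.

```python
def intervals_overlap(a, b):
    """Return True when closed intervals a=(low, high) and b=(low, high) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical confidence intervals (illustrative numbers only)
week_ci = (24.0, 26.0)
day_a_ci = (23.5, 24.5)   # overlaps the weekly interval
day_b_ci = (40.0, 43.0)   # lies entirely above the weekly interval

print(intervals_overlap(week_ci, day_a_ci))
print(intervals_overlap(week_ci, day_b_ci))
```

Overlap is only a heuristic; a formal test is still needed before concluding that two means are or are not different.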

In [None]:
from matplotlib.patches import Rectangle
plt.figure(figsize=(13,7))
plt.title('Bootstrap Means with confidence Intervals')
plt.axvline(bootstrapMean7, color='red', linestyle='-', linewidth=1)
plt.axvline(interval7, color='red', linestyle='dashed', linewidth=1)
plt.axvline(confInterval7, color='red', linestyle='dashed', linewidth=1)
plt.axvline(bootstrapMean, color='green', linestyle='-', linewidth=1)
plt.axvline(interval, color='green', linestyle='dashed', linewidth=1)
plt.axvline(confInterval, color='green', linestyle='dashed', linewidth=1)
plt.axvline(bootstrapMean1, color='yellow', linestyle='-', linewidth=1)
plt.axvline(interval1, color='yellow', linestyle='dashed', linewidth=1)
plt.axvline(confInterval1, color='yellow', linestyle='dashed', linewidth=1)
plt.axvline(bootstrapMean2, color='orange', linestyle='-', linewidth=1)
plt.axvline(interval2, color='orange', linestyle='dashed', linewidth=1)
plt.axvline(confInterval2, color='orange', linestyle='dashed', linewidth=1)
plt.axvline(bootstrapMean3, color='blue', linestyle='-', linewidth=1)
plt.axvline(interval3, color='blue', linestyle='dashed', linewidth=1)
plt.axvline(confInterval3, color='blue', linestyle='dashed', linewidth=1)
plt.axvline(bootstrapMean4, color='purple', linestyle='-', linewidth=1)
plt.axvline(interval4, color='purple', linestyle='dashed', linewidth=1)
plt.axvline(confInterval4, color='purple', linestyle='dashed', linewidth=1)
plt.axvline(bootstrapMean5, color='brown', linestyle='-', linewidth=1)
plt.axvline(interval5, color='brown', linestyle='dashed', linewidth=1)
plt.axvline(confInterval5, color='brown', linestyle='dashed', linewidth=1)
plt.axvline(bootstrapMean6, color='black', linestyle='-', linewidth=1)
plt.axvline(interval6, color='black', linestyle='dashed', linewidth=1)
plt.axvline(confInterval6, color='black', linestyle='dashed', linewidth=1)
handles = [Rectangle((0,0),1,1,color=c,ec="k") for c in ['red','green','yellow','orange','blue','purple','brown','black']]
labels= ["Week","Mon", "Tue","Wed","Thu","Fri","Sat","Sun"]
plt.legend(handles, labels,bbox_to_anchor=(1, 0.5),shadow=True)
plt.xlabel("Arrival Difference between consecutive calls")
plt.show()

In [None]:
from matplotlib.patches import Rectangle
plt.figure(figsize=(13,7))
plt.title("Bootstrap Mean of the whole Week and Friday")
plt.hist(bootstrapMeans,edgecolor='black')
plt.axvline(bootstrapMean7, color='red', linestyle='-', linewidth=1)
plt.axvline(interval7, color='red', linestyle='dashed', linewidth=1)
plt.axvline(confInterval7, color='red', linestyle='dashed', linewidth=1)
plt.axvline(bootstrapMean4, color='purple', linestyle='-', linewidth=1)
plt.axvline(interval4, color='purple', linestyle='dashed', linewidth=1)
plt.axvline(confInterval4, color='purple', linestyle='dashed', linewidth=1)
handles = [Rectangle((0,0),1,1,color=c,ec="k") for c in ['red','purple']]
labels= ["Week","Fri"]
plt.legend(handles, labels,bbox_to_anchor=(1, 0.5),shadow=True)
plt.xlabel("Arrival Difference between consecutive calls")
plt.show()

Since the confidence intervals of the bootstrap means of Thursday and Friday overlap, we perform a z-test to see whether the two means are different.

Our null hypothesis is that the two means are not different.

The z-score is -0.954 and the corresponding p-value is 0.34, so we fail to reject the null hypothesis that the two means are the same.
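
The step from the z-score to the p-value is not shown in the code, so here is a minimal sketch of it, assuming a two-sided test against the standard normal distribution. The function name `two_sided_p_value` is hypothetical; it uses only the standard library.

```python
import math


def two_sided_p_value(z):
    """Two-sided p-value for a standard normal test statistic z."""
    # P(|Z| >= |z|) = erfc(|z| / sqrt(2)) for Z ~ N(0, 1)
    return math.erfc(abs(z) / math.sqrt(2.0))


p_value = two_sided_p_value(-0.954)
print("p-value:", p_value)
```

For z = -0.954 this gives a p-value of about 0.34, well above the usual 0.05 threshold.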


In [None]:
thur = dataFrame[dataFrame.DayOfWeek=='Thursday']
fri = dataFrame[dataFrame.DayOfWeek=='Friday']
def z_test(t1,t2):
    val = t1.ArrivalDiff
    mu1 = sum(val)/val.shape[0]
    print(val.shape[0]/sum(val))
    val2 = t2
    val2 = val2.ArrivalDiff
    mu2 = sum(val2)/val2.shape[0]
    std1 = statistics.stdev(val)
    std2 = statistics.stdev(val2)
    var1 = statistics.variance(val)
    var2 = statistics.variance(val2)

    print(' sample 1 mean:',mu1,'\n sample 2 mean:', mu2, 
          '\n sample 1 st-dev:',std1, 
          '\n sample 2 st-dev:',std2, 
          '\n sample 1 variance:',var1, 
          '\n sample 2 variance:',var2) 

    n1 = val.shape[0]
    n2 = val2.shape[0]
    dof = n1+n2-2
    rootNum = (n1-1)*var1+(n2-1)*var2
    num = mu1 - mu2
    zscore = num/math.sqrt( ((var1)/n1) + ((var2)/n2) )
    print('Difference ',num,'Zscore ',zscore)
    print('Variance1 ',var1,'Variance2 ',var2)
z_test(thur,fri)

The seven bootstrap means or our expected values are as follows: 

Expected value for Monday =  18.400408229325677

Expected value for Tuesday = 19.64763040527629 

Expected value for Wednesday = 21.424941410588588 

Expected value for Thursday =  23.73590126455369 

Expected value for Friday = 23.897650829770267

Expected value for Saturday = 41.78701247135073

Expected value for Sunday = 52.66931993895529

Expected value for the whole week =  25.219208887955816 


Since the means for Thursday and Friday are not significantly different, we can use the same expected value for both.
We can calculate the rate parameter from the expected value with the formula lambda = 1/E[X].
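
As a worked example of the formula, the sketch below converts the Monday expected value listed above into a rate and then uses the exponential CDF to answer the module question. It assumes the expected values are in seconds, which is suggested by the division by 60 used elsewhere in this notebook to obtain per-minute figures; the function names are hypothetical.

```python
import math


def rate_from_mean(mean_gap):
    """Rate parameter of an exponential distribution: lambda = 1 / E[X]."""
    return 1.0 / mean_gap


def prob_next_call_within(t, rate):
    """P(X <= t) for X ~ Exponential(rate), i.e. the exponential CDF."""
    return 1.0 - math.exp(-rate * t)


monday_mean_seconds = 18.400408229325677  # assumed to be in seconds

rate_per_second = rate_from_mean(monday_mean_seconds)
rate_per_minute = 60.0 * rate_per_second

# Probability that the next call arrives within one mean gap, and within 60 seconds
p_within_mean = prob_next_call_within(monday_mean_seconds, rate_per_second)
p_within_minute = prob_next_call_within(60.0, rate_per_second)

print("Calls per minute:", rate_per_minute)
print("P(next call within the mean gap):", p_within_mean)
print("P(next call within 60 seconds):", p_within_minute)
```

For any exponential distribution the probability of an arrival within one mean gap is 1 - 1/e, roughly 63%, regardless of the rate.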


In [None]:
#calls per minute or arrival rate per minute 
lambdas = {}
mon = meanOfData/60
mon = 1/mon
lambdas['mon'] = mon
tue = meanOfData1/60
tue = 1/tue
lambdas['tue'] = tue
wed = meanOfData2/60
wed = 1/wed
lambdas['wed'] = wed
thu = meanOfData3/60
thu = 1/thu
lambdas['thu'] = thu
fri = meanOfData4/60
fri = 1/fri
lambdas['fri'] = fri
sat = meanOfData5/60
sat = 1/sat
lambdas['sat'] = sat
sun = meanOfData6/60
sun = 1/sun
lambdas['sun'] = sun
week = totMean/60
week = 1/week
lambdas['week'] = week


In [None]:
lambdas

References:


https://towardsdatascience.com/an-introduction-to-the-bootstrap-method-58bcb51b4d60

https://www.statisticshowto.datasciencecentral.com/what-is-the-standard-error-of-a-sample/

https://www.biologyforlife.com/interpreting-error-bars.html

https://www.academia.edu/35637181/_William_Navidi_Statistics_for_Engineers_and_Scie_BookFi_ 
