In [1]:
import pandas as pd
import numpy as np
import sys
import re
df_ordered = pd.read_csv("../code/ordered.csv")
df_severeWeather = pd.read_csv("../code/severeWeather.csv")

# Capstone Project I Report: Inferential Statistics

From the previous report we observe the following:

In general:

* In extreme weather days, the sales record for item 45 dropped significantly. 
* In extreme weather days, the sales record for item 9 dropped slightly. 
* In extreme weather days, the sales record for item 44 dropped slightly.
* In extreme weather days, people buy item 5 more and they often do so before extreme weather comes.
* In extreme weather days, people buy more of the other top-10 items: apart from item 44, item 45 and item 9, sales of the last 6 items in the top 10 increased by roughly 2 units per day.
* People buy item 93 more on the day of extreme weather. They also buy this item before extreme weather arrives.
* When facing a long extreme weather event, people buy less of items 5, 45 and 44.
* Even on a sunny day, sales close to bad weather still differ from the normal case, with item 5 being the best seller and item 45 in third place.

Specifically for item 5:

* Year: Sales decline steadily from year to year.
* Month: Monthly sales are more even and varied. It is hard to find a clear pattern.
* Weekday: People tend to buy more on weekends. On weekdays, Monday and Friday see higher sales than the other days.
* Rainfall/Snowfall: People tend to buy item 5 on sunny days, but they also buy it when facing major weather events.
* Temperature: During normal days people tend to buy less of item 5 when the temperature is between -16 $^{\circ}$F and 38 $^{\circ}$F, and between 64 $^{\circ}$F and 76 $^{\circ}$F. When there is a major weather event, however, the confidence interval becomes large enough to affect this conclusion from a statistical point of view.

These observations are going to be tested in this report.

First, I get tools ready for the task: 

In [2]:
# Define some useful functions first

def perm_diff(data1,data2,targetfun):
    # Permutation test for targetfun(data1) - targetfun(data2)
    conc_data = np.concatenate((data1,data2))
    value = targetfun(data1) - targetfun(data2)
    value_diff = np.empty(10000)

    for i in range(10000):
        # Shuffle the pooled data and split it into groups of the original sizes
        perm_data = np.random.permutation(conc_data)
        perm_data1 = perm_data[:len(data1)]
        perm_data2 = perm_data[len(data1):]
        
        perm_value1 = targetfun(perm_data1)
        perm_value2 = targetfun(perm_data2)
        value_diff[i] = perm_value1 - perm_value2

    # One-sided p-value in the direction of the observed difference,
    # computed once after all permutations are filled in
    if value>0:
        p = np.sum(value_diff > value)/float(len(value_diff))
    else:
        p = np.sum(value_diff < value)/float(len(value_diff))
    print "p value:",  p
    print "value", value
    print "99% null hypothesis interval:",  np.percentile(value_diff, [0.5, 99.5])
    
def bsfromfunc(observes,targetfunc):
    value = targetfunc(observes)
    bs_target = np.empty(10000)
    for i in range(10000):
        bs_sample = np.random.choice(observes,size=len(observes))
        bs_target[i] = targetfunc(bs_sample)
        
    print 'value, ', value
    print '99% interval, ', np.percentile(bs_target, [0.5, 99.5])
    


#### In extreme weather days, the sales record for item 45 dropped significantly.

I perform a permutation test for this question. The null hypothesis is: there is no difference in the sales of item 45 between extreme weather days and normal days.

In [3]:
def item_sales_inferential(item):

    mask_event = ((df_ordered['WEvent'] == 1) & (df_ordered['item_nbr'] == item))
    mask_no_event = ((df_ordered['WEvent'] == 0) & (df_ordered['item_nbr'] == item))

    data1 = list(df_ordered.loc[mask_event, 'units'])
    data2 = list(df_ordered.loc[mask_no_event, 'units'])

    perm_diff(data1,data2,np.mean)

item_sales_inferential(45)

p value: 0.0
value -8.5177407805
99% null hypothesis interval: [-1.80311776  1.90913812]


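The p-value is reported as 0 because none of the 10,000 permutations produced a difference as extreme as the observed -8.52, so p < 0.0001 and the null hypothesis is rejected. The observed difference also lies far outside the 99% null interval. Therefore, in extreme weather days, the sales of item 45 did drop significantly.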
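
A printed p-value of 0 only says that no permutation was as extreme as the observation; it is bounded by the number of permutations rather than being literally zero. One common convention, which `perm_diff` above does not use, is the add-one estimate (b + 1) / (m + 1), where b counts the extreme permutations out of m. The sketch below illustrates it on a synthetic null distribution; the helper name `add_one_p_value` is hypothetical.

```python
import numpy as np

def add_one_p_value(null_stats, observed):
    """One-sided permutation p-value with the add-one correction.

    Counts null statistics at least as extreme as `observed`, in the
    direction of its sign, and returns (b + 1) / (m + 1).
    """
    null_stats = np.asarray(null_stats, dtype=float)
    if observed > 0:
        b = np.sum(null_stats >= observed)
    else:
        b = np.sum(null_stats <= observed)
    return (b + 1.0) / (len(null_stats) + 1.0)

# Synthetic null distribution (NOT the report's data): 10,000 draws.
rng = np.random.RandomState(0)
null_stats = rng.normal(loc=0.0, scale=0.7, size=10000)

# An observed value far outside the null range yields the smallest
# attainable p-value, 1 / 10001, instead of 0.
p_min = add_one_p_value(null_stats, observed=-8.5)
print("smallest attainable p: %.6f" % p_min)
```

With this convention the result for item 45 would be reported as roughly 0.0001 instead of 0, which does not change the conclusion.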

#### In extreme weather days, the sales record for item 9 dropped slightly.

I perform a permutation test for this question. The null hypothesis is: there is no difference in the sales of item 9 between extreme weather days and normal days.

In [4]:
item_sales_inferential(9)

p value: 0.0181
value -1.78536037982
99% null hypothesis interval: [-2.2010421   2.26200586]


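The p-value is about 0.018: if the null hypothesis were true, a drop at least this large would appear in roughly 2% of permutations. At the 5% significance level we can still reject the null hypothesis, although the observed difference stays inside the 99% null interval. Therefore, we can still say the sales of item 9 dropped during extreme weather days, although only slightly.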
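
Note that `perm_diff` reports a one-sided p-value whose direction is taken from the sign of the observed difference. A two-sided version compares absolute values instead and can never be smaller. The sketch below returns both on synthetic data (not the sales table); `perm_test_mean_diff` and its parameters are hypothetical names chosen for illustration.

```python
import numpy as np

def perm_test_mean_diff(a, b, n_perm=2000, seed=0):
    """Return (observed, one_sided_p, two_sided_p) for mean(a) - mean(b)."""
    rng = np.random.RandomState(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled = np.concatenate((a, b))
    observed = a.mean() - b.mean()

    diffs = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(pooled)
        diffs[i] = perm[:len(a)].mean() - perm[len(a):].mean()

    # One-sided: same direction as the observed difference.
    if observed > 0:
        one_sided = np.mean(diffs >= observed)
    else:
        one_sided = np.mean(diffs <= observed)
    # Two-sided: at least as large in absolute value.
    two_sided = np.mean(np.abs(diffs) >= abs(observed))
    return observed, one_sided, two_sided

# Synthetic example: group a is shifted down by 0.3 on average.
rng = np.random.RandomState(1)
a = rng.normal(loc=-0.3, scale=1.0, size=80)
b = rng.normal(loc=0.0, scale=1.0, size=80)
observed, p_one, p_two = perm_test_mean_diff(a, b)
print("observed: %.3f  one-sided p: %.4f  two-sided p: %.4f"
      % (observed, p_one, p_two))
```

For a borderline result such as item 9, the choice between the two versions can decide whether the 5% threshold is met.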

#### In extreme weather days, the sales record for item 44 dropped slightly

I perform a permutation test for this question. The null hypothesis is: there is no difference in the sales of item 44 between extreme weather days and normal days.

In [5]:
item_sales_inferential(44)

p value: 0.24
value -0.66953605725
99% null hypothesis interval: [-2.36833832  2.49661429]


With a p-value of 0.24, we cannot reject the null hypothesis this time. Therefore, we cannot say the sales of item 44 actually dropped.

#### In extreme weather days, people buy more of the other top-10 items: apart from item 44, item 45 and item 9, sales of the last 6 items in the top 10 increased by roughly 2 units per day

As with the statements above, one null hypothesis of no difference is tested for each of the six items.

In [6]:
print "item 5"
item_sales_inferential(5)
print "\n"

print "item 68"
item_sales_inferential(68)
print "\n"

print "item 16"
item_sales_inferential(16)
print "\n"

print "item 25"
item_sales_inferential(25)
print "\n"

print "item 48"
item_sales_inferential(48)
print "\n"

print "item 36"
item_sales_inferential(36)
print "\n"

item 5
p value: 0.0047
value 1.78574791356
99% null hypothesis interval: [-1.72427105  1.75471021]


item 68
p value: 0.0
value 4.97743431142
99% null hypothesis interval: [-0.72214722  0.72498934]


item 16
p value: 0.0
value 2.04496793016
99% null hypothesis interval: [-0.73142313  0.77779933]


item 25
p value: 0.0
value 2.35675080833
99% null hypothesis interval: [-1.1457389   1.30350309]


item 48
p value: 0.0
value 2.5980431391
99% null hypothesis interval: [-0.86165149  0.97645239]


item 36
p value: 0.0
value 1.51021767097
99% null hypothesis interval: [-0.6277559   0.71786791]




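All six null hypotheses are rejected at the 1% level (the largest p-value is 0.0047, for item 5), meaning the statement holds: the sales of these 6 items did increase on extreme weather days. The increase ranges from about 1.5 to 5 units per day.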
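
Because six tests are run together, it may be worth checking that the rejections survive a multiple-comparison correction. A minimal sketch with a Bonferroni threshold on the p-values printed above; treating a printed 0.0 as 1/10000 (the resolution of 10,000 permutations) is an assumption made here for illustration.

```python
# p-values as printed above; 0.0 is replaced by 1e-4 (assumed upper bound
# given 10,000 permutations).
p_values = {5: 0.0047, 68: 1e-4, 16: 1e-4, 25: 1e-4, 48: 1e-4, 36: 1e-4}

alpha = 0.05
threshold = alpha / len(p_values)  # Bonferroni: 0.05 / 6

rejected = {item: p <= threshold for item, p in p_values.items()}
for item in sorted(rejected):
    print("item %d: p=%.4f, reject at Bonferroni-adjusted 5%%: %s"
          % (item, p_values[item], rejected[item]))
```

With these inputs every item stays below the adjusted threshold of about 0.0083.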

#### In extreme weather days, people buy item 5 more and they often do so before extreme weather comes.

In [7]:
mask = (df_ordered['WEvent'] == 1)

# Flag rows in the 3-row window before or after a row where Condition > 0.
# Series.rolling(...).mean() replaces the deprecated pd.rolling_mean.
rolling_condition = df_ordered['Condition'].rolling(window=3, center=False).mean()
df_ordered['Before_Event'] = (rolling_condition.shift(-3) > 0)
df_ordered['After_Event'] = (rolling_condition.shift(1) > 0)
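
To see what the two flags mark, the same rolling-window logic can be applied to a toy `Condition` series with a single bad-weather row. The sketch uses the `Series.rolling(...).mean()` form, and the toy values are made up. It also assumes one row per day; if `df_ordered` holds several rows per date, the window spans rows rather than days.

```python
import pandas as pd

# Toy series: one bad-weather row at position 4 (made-up data).
condition = pd.Series([0, 0, 0, 0, 1, 0, 0, 0])

rolled = condition.rolling(window=3).mean()

# True on the three rows leading up to the bad-weather row.
before_event = rolled.shift(-3) > 0
# True on the three rows following the bad-weather row.
after_event = rolled.shift(1) > 0

print(pd.DataFrame({'Condition': condition,
                    'Before_Event': before_event,
                    'After_Event': after_event}))
```

Rows where the shifted rolling mean is missing compare as `False`, so the edges of the series are never flagged.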


In [8]:
mask_event = ((df_ordered['After_Event'] == 0) & (df_ordered['WEvent'] == 1) & (df_ordered['item_nbr'] == 5))
mask_no_event = ((df_ordered['After_Event'] == 1) & (df_ordered['item_nbr'] == 5))

data1 = list(df_ordered.loc[mask_event, 'units'])
data2 = list(df_ordered.loc[mask_no_event, 'units'])

perm_diff(data1,data2,np.mean)

p value: 0.0708
value 2.06832645124
99% null hypothesis interval: [-3.71796514  3.75806004]


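The p-value is about 0.07, so the null hypothesis cannot be rejected at the 5% level, although it would be rejected at the 10% level. The evidence that people buy more of item 5 before extreme weather arrives is therefore suggestive but not conclusive.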

#### People buy item 93 more on the day of extreme weather. They also buy this item before an extreme weather.

Two null hypotheses are tested here.

In [9]:
mask_event = ((df_ordered['Condition'] == 1) & (df_ordered['item_nbr'] == 93))
mask_no_event = ((df_ordered['Condition'] == 0) & (df_ordered['item_nbr'] == 93))

data1 = list(df_ordered.loc[mask_event, 'units'])
data2 = list(df_ordered.loc[mask_no_event, 'units'])

perm_diff(data1,data2,np.mean)

p value: 0.0
value 4.59785472258
99% null hypothesis interval: [-0.33650113  0.55326266]


The null hypothesis is rejected (p < 0.0001). Item 93 did sell better on the day of extreme weather than on the rest of the days.

In [10]:
mask_event = ((df_ordered['Condition'] == 0) & (df_ordered['item_nbr'] == 93) & (df_ordered['Before_Event'] == 1))
mask_no_event = ((df_ordered['Condition'] == 0) & (df_ordered['item_nbr'] == 93) & (df_ordered['Before_Event'] == 0))

data1 = list(df_ordered.loc[mask_event, 'units'])
data2 = list(df_ordered.loc[mask_no_event, 'units'])

perm_diff(data1,data2,np.mean)

p value: 0.0
value 2.46105218936
99% null hypothesis interval: [-0.22196152  0.30149219]


The null hypothesis is rejected (p < 0.0001). Item 93 did sell better shortly before the day of extreme weather than on the remaining days, excluding the extreme weather days themselves.

#### When facing a long time extreme weather event, people do less shopping on item 5, 45, 44

In [11]:
def examine_long_term(item):

    # Rows where exactly one of Before_Event / After_Event is set vs. rows where both or neither are set
    mask_event = (((df_ordered['Before_Event'] ^ df_ordered['After_Event']) == 1) & (df_ordered['item_nbr'] == item) & (df_ordered['WEvent'] == 1))
    mask_no_event = (((df_ordered['Before_Event'] ^ df_ordered['After_Event']) == 0) & (df_ordered['item_nbr'] == item) & (df_ordered['WEvent'] == 1))

    data1 = list(df_ordered.loc[mask_event, 'units'])
    data2 = list(df_ordered.loc[mask_no_event, 'units'])

    perm_diff(data1,data2,np.mean)
    
print "item 5: "
examine_long_term(5)
print "item 45: "
examine_long_term(45)
print "item 44: "
examine_long_term(44)

item 5: 
p value: 0.4257
value -0.377943906968
99% null hypothesis interval: [-5.33110912  5.13037168]
item 45: 
p value: 0.1446
value 1.94136738982
99% null hypothesis interval: [-4.74552816  4.51252979]
item 44: 
p value: 0.2166
value 1.9207331672
99% null hypothesis interval: [-6.63007756  5.91249994]


None of the null hypotheses can be rejected (the p-values are 0.43, 0.14 and 0.22), meaning the observation on this issue could be a coincidence.

#### Even when it is a sunny day, the sales record close to bad weather still differ from normal case, with item 5 being the best seller and item 45 at the third place.

In [12]:
mask_event = ((df_ordered['preciptotal'] == 0) & (df_ordered['item_nbr'] == 5) & (df_ordered['WEvent'] == 1))
mask_no_event = ((df_ordered['preciptotal'] == 0) & (df_ordered['item_nbr'] == 5) & (df_ordered['WEvent'] == 0))

data1 = list(df_ordered.loc[mask_event, 'units'])
data2 = list(df_ordered.loc[mask_no_event, 'units'])

perm_diff(data1,data2,np.mean)

p value: 0.0285
value 1.80844419214
99% null hypothesis interval: [-2.25521036  2.46452891]


The null hypothesis is rejected at the 5% significance level (p ≈ 0.03), meaning the statement stands: even on sunny days, item 5 sales depend on whether the day is close to an extreme weather day.

In [13]:
mask_event = ((df_ordered['preciptotal'] == 0) & (df_ordered['item_nbr'] == 45) & (df_ordered['WEvent'] == 1))
mask_no_event = ((df_ordered['preciptotal'] == 0) & (df_ordered['item_nbr'] == 45) & (df_ordered['WEvent'] == 0))

data1 = list(df_ordered.loc[mask_event, 'units'])
data2 = list(df_ordered.loc[mask_no_event, 'units'])

perm_diff(data1,data2,np.mean)

p value: 0.0
value -7.18063549636
99% null hypothesis interval: [-2.59934386  2.71724752]


The null hypothesis is rejected (p < 0.0001), meaning the statement stands: even on sunny days, item 45 sales depend on whether the day is close to an extreme weather day.

#### For item 5: Year: Sales decline steadily from year to year.

In [14]:
df_ordered['year'] = pd.to_datetime(df_ordered['date'], infer_datetime_format=True).dt.year

mask2012 = ((df_ordered['year'] == 2012) & (df_ordered['item_nbr'] == 5))
mask2013 = ((df_ordered['year'] == 2013) & (df_ordered['item_nbr'] == 5))
mask2014 = ((df_ordered['year'] == 2014) & (df_ordered['item_nbr'] == 5))

data2012 = list(df_ordered.loc[mask2012,'units'])
data2013 = list(df_ordered.loc[mask2013,'units'])
data2014 = list(df_ordered.loc[mask2014,'units'])

# bsfromfunc prints its results and returns None, so it is called without print
print "2012 item 5 sales per day: "
bsfromfunc(data2012,np.mean)

print "2013 item 5 sales per day: "
bsfromfunc(data2013,np.mean)

print "2014 item 5 sales per day: "
bsfromfunc(data2014,np.mean)

2012 item 5 sales per day: 
value,  23.7681435507
99% interval,  [ 23.00569809  24.53114453]
2013 item 5 sales per day: 
value,  18.8006975585
99% interval,  [ 18.13004484  19.46658303]
2014 item 5 sales per day: 
value,  16.5693385352
99% interval,  [ 15.92374401  17.24055728]


The 99% confidence intervals do not overlap one another, and the mean drops from about 23.8 units per day in 2012 to 18.8 in 2013 and 16.6 in 2014. This supports the statement that item 5 sales declined steadily over the years.
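
Comparing intervals for overlap is a conservative check. Another option is to bootstrap the difference in means directly and see whether its interval excludes zero. The sketch below does this on synthetic daily sales (not the sales table); `bootstrap_mean_diff_ci` and its parameters are hypothetical names.

```python
import numpy as np

def bootstrap_mean_diff_ci(a, b, n_boot=2000, level=99.0, seed=0):
    """Percentile bootstrap interval for mean(a) - mean(b)."""
    rng = np.random.RandomState(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group with replacement, keeping its size.
        sample_a = rng.choice(a, size=len(a), replace=True)
        sample_b = rng.choice(b, size=len(b), replace=True)
        diffs[i] = sample_a.mean() - sample_b.mean()
    tail = (100.0 - level) / 2.0
    low, high = np.percentile(diffs, [tail, 100.0 - tail])
    return low, high

# Synthetic daily sales for two years with different means.
rng = np.random.RandomState(2)
year_a = rng.poisson(lam=24, size=365)
year_b = rng.poisson(lam=19, size=365)
low, high = bootstrap_mean_diff_ci(year_a, year_b)
print("99%% interval for the difference in means: [%.2f, %.2f]" % (low, high))
```

An interval that lies entirely above zero would indicate that the first year sold more per day than the second.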

#### For item 5: Weekday: People tend to buy more on weekends. On weekdays, Monday and Friday see more selling than others.

In [15]:
df_ordered['weekday'] = pd.to_datetime(df_ordered['date'], infer_datetime_format=True).dt.weekday

maskMon = ((df_ordered['weekday'] == 0) & (df_ordered['item_nbr'] == 5))
maskTue = ((df_ordered['weekday'] == 1) & (df_ordered['item_nbr'] == 5))
maskWed = ((df_ordered['weekday'] == 2) & (df_ordered['item_nbr'] == 5))
maskThu = ((df_ordered['weekday'] == 3) & (df_ordered['item_nbr'] == 5))
maskFri = ((df_ordered['weekday'] == 4) & (df_ordered['item_nbr'] == 5))
maskSat = ((df_ordered['weekday'] == 5) & (df_ordered['item_nbr'] == 5))
maskSun = ((df_ordered['weekday'] == 6) & (df_ordered['item_nbr'] == 5))

list_para = [maskMon,maskTue,maskWed,maskThu,maskFri,maskSat,maskSun]
weekdays = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']

for ind in range(len(list_para)):
    data = list(df_ordered.loc[list_para[ind],'units'])
    print 'The average sales for ' + weekdays[ind] + ':'
    # bsfromfunc prints its results and returns None, so it is called without print
    bsfromfunc(data,np.mean)

The average sales for Mon:
value,  20.9906276151
99% interval,  [ 19.91932636  22.09708954]
The average sales for Tue:
value,  18.80074551
99% interval,  [ 17.76243138  19.82431379]
The average sales for Wed:
value,  16.9823766365
99% interval,  [ 16.05638973  17.93992699]
The average sales for Thu:
value,  16.6404380792
99% interval,  [ 15.76763943  17.58113732]
The average sales for Fri:
value,  18.3176391458
99% interval,  [ 17.33327056  19.33951656]
The average sales for Sat:
value,  22.8378972279
99% interval,  [ 21.60175203  24.09754817]
The average sales for Sun:
value,  26.3804256745
99% interval,  [ 25.03401961  27.76337942]


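The statement largely holds. Sunday's interval lies above every other day's, and Saturday's interval overlaps only slightly with Monday's. Among weekdays, Monday's interval lies above those of Tuesday through Friday. However, Friday's interval overlaps with those of Tuesday, Wednesday and Thursday, so we cannot say Friday sells better than the other weekdays.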
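
The overlap between weekday intervals can be checked mechanically. The sketch below copies the 99% intervals from the output above and lists every pair of days whose intervals overlap; the helper `overlaps` is a hypothetical name.

```python
# 99% bootstrap intervals copied from the output above.
intervals = {
    'Mon': (19.91932636, 22.09708954),
    'Tue': (17.76243138, 19.82431379),
    'Wed': (16.05638973, 17.93992699),
    'Thu': (15.76763943, 17.58113732),
    'Fri': (17.33327056, 19.33951656),
    'Sat': (21.60175203, 24.09754817),
    'Sun': (25.03401961, 27.76337942),
}

def overlaps(first, second):
    """True if two closed intervals share at least one point."""
    return first[0] <= second[1] and second[0] <= first[1]

days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
overlapping = []
for i, day_a in enumerate(days):
    for day_b in days[i + 1:]:
        if overlaps(intervals[day_a], intervals[day_b]):
            overlapping.append((day_a, day_b))

print(overlapping)
```

With these intervals, Friday overlaps Tuesday, Wednesday and Thursday, Monday overlaps only Saturday, and Sunday overlaps no other day.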

#### For item 5: Rainfall/Snowfall: People tend to buy item 5 on a sunny day. But when facing major weather events people will go and buy them as well. 

In [3]:
mask_normal_sunny = ((df_ordered['preciptotal'] == 0) & (df_ordered['WEvent'] == 0))
mask_normal_not_sunny = ((df_ordered['preciptotal'] > 0) & (df_ordered['WEvent'] == 0))
mask_weather = (df_ordered['WEvent'] == 1)

data_normal_sunny = list(df_ordered.loc[mask_normal_sunny,'units'])
data_normal_not_sunny = list(df_ordered.loc[mask_normal_not_sunny,'units'])
data_weather = list(df_ordered.loc[mask_weather,'units'])

print "sunny vs non-sunny days in normal days"
perm_diff(data_normal_sunny,data_normal_not_sunny,np.mean)

print "non-sunny normal days vs weather event days"
perm_diff(data_weather,data_normal_not_sunny,np.mean)

sunny vs non-sunny days in normal days
p value: 0.0
value 0.0539291156784
99% null hypothesis interval: [-0.02658073  0.0260989 ]
non-sunny normal days vs weather event days
p value: 0.0
value 0.0929169131328
99% null hypothesis interval: [-0.05249095  0.05089615]


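Both null hypotheses are rejected. Note that the masks above do not filter on `item_nbr`, so the test covers units pooled across all items rather than item 5 alone. On that pooled data the result agrees with the statement: sales are higher on dry normal days than on normal days with rain or snow, and higher on weather event days than on normal days with rain or snow. One reading is that people cancel their shopping plans if there is light rain or snow, but go shopping and stock up during weather events.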
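
The masks in this section select rows by weather only. To restrict the comparison to item 5, an `item_nbr` condition could be combined with them, as in the sketch below. It runs on a made-up DataFrame that borrows the column names of `df_ordered`; the values are not from the sales table.

```python
import pandas as pd

# Made-up rows using the same column names as df_ordered.
toy = pd.DataFrame({
    'item_nbr':    [5, 5, 5, 5, 9, 9],
    'preciptotal': [0.0, 0.2, 0.0, 1.5, 0.0, 0.3],
    'WEvent':      [0, 0, 1, 1, 0, 0],
    'units':       [20, 15, 30, 28, 7, 6],
})

is_item = toy['item_nbr'] == 5
normal_dry = is_item & (toy['preciptotal'] == 0) & (toy['WEvent'] == 0)
normal_wet = is_item & (toy['preciptotal'] > 0) & (toy['WEvent'] == 0)
event = is_item & (toy['WEvent'] == 1)

units_normal_dry = toy.loc[normal_dry, 'units'].values
units_normal_wet = toy.loc[normal_wet, 'units'].values
units_event = toy.loc[event, 'units'].values

print(units_normal_dry, units_normal_wet, units_event)
```

The three arrays could then be passed to `perm_diff` in place of the pooled lists.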

#### For item 5: Temperature: It can be observed that during normal days people tend to buy less item 5 when the temperature is between -16 to 38 $^{\circ}$F, and between 64 $^{\circ}$F to 76 $^{\circ}$F. When there is a major weather event, however, the confidence interval becomes large enough to affect this conclusion from a statistical point of view

I have already performed inferential statistics on this issue. Please see the corresponding section in data storytelling part 2.

## Conclusion

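In conclusion: apart from the point "when facing a long time extreme weather event, people do less shopping on item 5, 45, 44", most of the observations stand. Two others are not supported at the 5% level: the drop in item 44 sales (p = 0.24) and the early buying of item 5 before extreme weather (p ≈ 0.07). The following conclusions in data storytelling therefore stand: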

* The sales pattern on normal days is different from the sales pattern during extreme weather periods.
* Item 5 sells best during extreme weather periods.
* Features indicating whether the day is a normal day, a day before bad weather or a day after bad weather could be useful, since the tests suggest that even on a sunny day, sales close to bad weather still differ from the normal case.
* Year: Year does affect the sales record.
* Weekday: Weekday does affect the sales record.
* Rainfall/Snowfall: Since we can indicate major events by the event marker (see the part 1 report), whether rain or snow is present is useful in predicting item 5 sales, but the amount of rainfall or snowfall does not matter that much.
* Temperature: During normal days, temperature and sales are correlated in general. However, this correlation is not linear. Also, the correlation between temperature and sales during major weather events is weaker than the one during normal days.


While this conclusion is not supported:

* There might be logical correlations between features that indicate whether the day is a normal day, a day before bad weather or a day after bad weather. Because of this, using a neural network on this project might be promising.