## 3.0 Statistical Test for Correlation

### Creating a composite function for Markdown success
We will define a function that returns the Markdown success state of each row: "1" denotes high Markdown success and "0" denotes low Markdown success. The mean of the Markdown sales is 4719.567115. Markdown success is taken to be any sale above the mean, while failure is any sale at or below the mean. We will then use df.apply to apply the function to each row.

In [1]:
# import the python required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# load master_dataset.xlsx
df_master = pd.read_excel('master_dataset.xlsx', sheet_name='Sheet1')  # 'sheetname' was renamed to 'sheet_name' in newer pandas
print (df_master.head())

   Store                 Date  Temperature  Fuel_Price  MarkDown1  MarkDown2  \
0      1  2010-05-02 00:00:00        42.31       2.572    10382.9    6115.67   
1      1  2010-12-02 00:00:00        38.51       2.548    10382.9    6115.67   
2      1  2010-02-19 00:00:00        39.93       2.514    10382.9    6115.67   
3      1  2010-02-26 00:00:00        46.63       2.561    10382.9    6115.67   
4      1  2010-05-03 00:00:00        46.50       2.625    10382.9    6115.67   

   MarkDown3  MarkDown4  MarkDown5         CPI     ...          Books  \
0     215.07    2406.62    6551.42  211.096358     ...       73315.81   
1     215.07    2406.62    6551.42  211.242170     ...       77280.42   
2     215.07    2406.62    6551.42  211.289143     ...       78602.71   
3     215.07    2406.62    6551.42  211.319643     ...       76091.36   
4     215.07    2406.62    6551.42  211.350143     ...       71718.48   

  Musical_Instruments  Star_Wars  Movies_TV  Video_Games  Portable_Audios  \
0  

Add a new column, Ave_MK, to the dataframe that holds the mean of the five markdown columns for each row.

In [3]:
# create new column containing the average of the markdown sales
df_master['Ave_MK'] = (df_master['MarkDown1'] + df_master['MarkDown2']+ df_master['MarkDown3']+ df_master['MarkDown4']+ df_master['MarkDown5'])/5

In [4]:
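def Markdown_Success(c):
    # Return 1 (high success) when the row's average markdown is above
    # the overall mean of 4719.567115, and 0 (low success) otherwise
    if c['Ave_MK'] <= 4719.567115:
        return 0
    else:
        return 1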


In [5]:
df_master['Markdown_Success'] = df_master.apply(Markdown_Success, axis=1)
print (df_master['Markdown_Success'].head()) 

0    1
1    1
2    1
3    1
4    1
Name: Markdown_Success, dtype: int64
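
The row-wise `apply` above works, but the same flag can also be computed in one vectorized step, with the threshold derived from the data rather than typed in. A minimal sketch, assuming a hypothetical toy frame `toy` (illustrative values, not the master dataset) that has an `Ave_MK` column:

```python
import pandas as pd

# Hypothetical toy data standing in for df_master['Ave_MK']
toy = pd.DataFrame({'Ave_MK': [1000.0, 5000.0, 3000.0, 7000.0]})

# Derive the threshold from the data instead of hard-coding it
threshold = toy['Ave_MK'].mean()

# Vectorized equivalent of Markdown_Success: 1 if above the mean, else 0
toy['Markdown_Success'] = (toy['Ave_MK'] > threshold).astype(int)

print(threshold)
print(toy['Markdown_Success'].tolist())
```

Computing the threshold from the column keeps the flag consistent if the underlying data changes.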


### 3.0.1 One-Sample T-Test (Measuring the impact of size on Markdown Success)

A one-sample t-test checks whether a sample mean differs from a reference (population) mean. Since store size has the highest correlation with our dependent variable, Markdown Success, let's test whether the average size of the stores that have high Markdown success differs from that of the stores that have low success. In practice, the test below uses the mean size of all stores as the reference value.

Hypothesis Testing: Is there a significant difference between the mean size of the stores that had high Markdown success and the mean size of those that had low Markdown success?

Null Hypothesis: (H0: pSS = pMS) The null hypothesis is that there is no difference in store size between stores that had high Markdown success and those that had low Markdown success.

Alternate Hypothesis: (HA: pSS != pMS) The alternative hypothesis is that there is a difference in store size between stores that had high Markdown success and those that had low Markdown success.
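
The hypotheses above are phrased as a comparison between two groups, while the test that follows compares one group against the overall mean. If a direct group-to-group comparison were wanted, one option (not used in this analysis) is Welch's two-sample t-test via `scipy.stats.ttest_ind`. The sketch below uses randomly generated, hypothetical store sizes, so its numbers say nothing about the real dataset:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical store sizes, NOT taken from the master dataset
high_success_sizes = rng.normal(loc=160000, scale=40000, size=200)
low_success_sizes = rng.normal(loc=120000, scale=40000, size=300)

# Welch's t-test: compares two group means without assuming equal variances
t_stat, p_value = stats.ttest_ind(high_success_sizes,
                                  low_success_sizes,
                                  equal_var=False)

print(t_stat, p_value)
```

With the real data, the two inputs would be the `Size` values where `Markdown_Success` is 1 and 0 respectively.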


In [6]:
# Let's compare the means of our store sizes for high markdown success against the store markdown population
Markdown_population = df_master['Size'].mean()
Markdown_success_sizes = df_master[df_master['Markdown_Success']==1]['Size'].mean()

print( 'The mean store size for the population is: ' + str(Markdown_population))
print( 'The mean store size for stores that had high sales success during the markdowns is: ' + str(Markdown_success_sizes))

The mean store size for the population is: 130287.6
The mean store size for stores that had high sales success during the markdowns is: 167629.0945351759


### Conducting the T-Test

Let's conduct a t-test at the 95% confidence level and see whether it rejects the null hypothesis that the sample comes from the same distribution as the store population. To conduct a one-sample t-test, we can use the stats.ttest_1samp() function:

In [7]:
import scipy.stats as stats
stats.ttest_1samp(a=  df_master[df_master['Markdown_Success']==1]['Size'], # Sample of store size for stores that had high sales success during the markdowns
                  popmean = Markdown_population)  # Mean store size across all stores (the reference value)

Ttest_1sampResult(statistic=45.538955586982418, pvalue=0.0)
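
To make the statistic reported above more concrete: a one-sample t-statistic is the difference between the sample mean and the reference mean, divided by the standard error of the sample mean (sample standard deviation over the square root of n). The sketch below recomputes it by hand on hypothetical random data and checks it against `stats.ttest_1samp`; the names `sample` and `pop_mean` are illustrative only:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=105.0, scale=10.0, size=50)  # hypothetical data
pop_mean = 100.0                                     # hypothetical reference mean

n = sample.size
std_err = sample.std(ddof=1) / np.sqrt(n)            # standard error of the mean
t_manual = (sample.mean() - pop_mean) / std_err      # t = (x_bar - mu) / (s / sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)   # two-sided p-value, n - 1 dof

result = stats.ttest_1samp(sample, popmean=pop_mean)

print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```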

### T-Test Result

The test result shows the test statistic "t" is equal to 45.54. This test statistic tells us how many standard errors the sample mean lies from the mean assumed under the null hypothesis. If the t-statistic lies outside the quantiles of the t-distribution corresponding to our confidence level and degrees of freedom, we reject the null hypothesis. We can check the quantiles with stats.t.ppf():

### T-Test Quantile

If the t-statistic value we calculated above (45.54) is outside the quantiles, then we can reject the null hypothesis.

In [8]:
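# Sample size of the high-success group, used here as the degrees of freedom.
# Strictly, a one-sample t-test has n - 1 degrees of freedom; for a sample
# this large the critical values are practically the same.
degree_freedom = len(df_master[df_master['Markdown_Success']==1])

LQ = stats.t.ppf(0.025,degree_freedom)  # Lower critical value (2.5% quantile)

RQ = stats.t.ppf(0.975,degree_freedom)  # Upper critical value (97.5% quantile)

print ('The t-distribution lower quantile is: ' + str(LQ))
print ('The t-distribution upper quantile is: ' + str(RQ))

The t-distribution lower quantile is: -1.96070932309
The t-distribution upper quantile is: 1.96070932309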
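
The decision rule can be wrapped in a small helper. This is a sketch under stated assumptions: the function names are hypothetical, the degrees of freedom are taken as n - 1 (the usual choice for a one-sample t-test), and the sample size of 3000 is a placeholder rather than the real group size:

```python
from scipy import stats

def two_sided_critical_values(n, alpha=0.05):
    """Return (lower, upper) t critical values for a sample of size n."""
    dof = n - 1  # degrees of freedom for a one-sample t-test
    lower = stats.t.ppf(alpha / 2, dof)
    upper = stats.t.ppf(1 - alpha / 2, dof)
    return lower, upper

def reject_null(t_stat, n, alpha=0.05):
    """True if t_stat falls outside the two-sided critical values."""
    lower, upper = two_sided_critical_values(n, alpha)
    return bool(t_stat < lower or t_stat > upper)

# Placeholder sample size of 3000, not the actual group size
print(two_sided_critical_values(3000))
print(reject_null(45.54, 3000))  # statistic far outside the interval
print(reject_null(0.5, 3000))    # statistic inside the interval
```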


### One-Sample T-Test Summary

T-Test = 45.54 | P-Value = 0.0 | Reject Null Hypothesis

Reject the null hypothesis because:

* T-Test score is outside the quantiles
* P-value is lower than the significance level of 5%

Based on the one-sample t-test, there appears to be a statistically significant difference between the mean store size for stores that had high sales success during the markdowns and the mean for the entire store population. The reported P-value of 0.0 (too small to be distinguished from zero in floating point) is far below the 5% significance level, which supports rejecting the null hypothesis.

But this does not necessarily mean that there is practical significance. We would have to conduct more experiments or collect more data about the stores in order to come up with a more accurate finding.
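
One common way to gauge practical significance alongside a t-test is a standardized effect size such as Cohen's d, which for a one-sample comparison is the mean difference divided by the sample standard deviation. The sketch below is an illustration only: the helper name and the store sizes are hypothetical, and this measure is not part of the original analysis:

```python
import numpy as np

def cohens_d_one_sample(sample, reference_mean):
    """Standardized distance between the sample mean and a reference mean."""
    sample = np.asarray(sample, dtype=float)
    return (sample.mean() - reference_mean) / sample.std(ddof=1)

# Hypothetical store sizes, not taken from the master dataset
sizes = np.array([150000, 170000, 160000, 180000, 140000])
d = cohens_d_one_sample(sizes, reference_mean=130000)

print(d)
```

By Cohen's conventional rule of thumb, values around 0.2, 0.5, and 0.8 are read as small, medium, and large effects; unlike the p-value, d does not grow simply because the sample is large.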

### 3.0.2 Another Hypothesis Test (Impact of the missing values)
Hypothesis Testing: Is there a significant difference between the mean of the markdowns with missing values filled and the mean of the markdowns with missing values left in place?
* Null Hypothesis: (H0: pMV = pGP) The null hypothesis would be that there is no difference in the result by filling the missing values with numbers using the back fill method.
* Alternate Hypothesis: (HA: pMV != pGP) The alternative hypothesis would be that there is a difference in the result by filling the missing values with numbers using the back fill method.
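
For readers unfamiliar with back filling, the sketch below shows what it does on a hypothetical series: each missing value takes the next valid value that follows it. It uses `Series.bfill()` from pandas; the exact call used to build the master dataset is not shown in this notebook, so treat this as an illustration rather than the original preprocessing step:

```python
import numpy as np
import pandas as pd

# Hypothetical markdown values with gaps
markdown = pd.Series([np.nan, np.nan, 10382.9, np.nan, 5000.0, np.nan])

# Back fill: propagate the next valid observation backwards
filled = markdown.bfill()

print(filled.tolist())
```

Note that a trailing gap has no later value to borrow from, so it stays missing after a back fill.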

In [9]:
# Let us load the dataframe with missing values
df_master_MV = pd.read_excel('output_final.xlsx', sheet_name='Sheet1')  # 'sheetname' was renamed to 'sheet_name' in newer pandas
print (df_master_MV.head())

   Store        Date  Temperature  Fuel_Price  MarkDown1  MarkDown2  \
0      1  05/02/2010        42.31       2.572        NaN        NaN   
1      1  12/02/2010        38.51       2.548        NaN        NaN   
2      1  19/02/2010        39.93       2.514        NaN        NaN   
3      1  26/02/2010        46.63       2.561        NaN        NaN   
4      1  05/03/2010        46.50       2.625        NaN        NaN   

   MarkDown3  MarkDown4  MarkDown5         CPI  ...         90        91  \
0        NaN        NaN        NaN  211.096358  ...   73315.81  57022.45   
1        NaN        NaN        NaN  211.242170  ...   77280.42  57845.36   
2        NaN        NaN        NaN  211.289143  ...   78602.71  59462.22   
3        NaN        NaN        NaN  211.319643  ...   76091.36  63011.44   
4        NaN        NaN        NaN  211.350143  ...   71718.48  57335.17   

          92        93        94         95        96        97        98  \
0  118966.90  58034.24  56157.83  11300

Add a new column, Ave_MK_MV, to the dataframe that holds the mean of the five markdown columns for each row.

In [10]:
# create new column containing the average of the markdown sales
df_master_MV['Ave_MK_MV'] = (df_master_MV['MarkDown1'] + df_master_MV['MarkDown2']+ df_master_MV['MarkDown3']+ 
                             df_master_MV['MarkDown4']+ df_master_MV['MarkDown5'])/5
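
Because `+` propagates `NaN`, the sum-based average above is only defined for rows where all five markdown columns are present, and a later `.mean()` on that column skips the `NaN` rows by default. The sketch below contrasts this with a row-wise `mean(axis=1)`, which averages whatever values are available; the two-column frame is hypothetical and only meant to show the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame with gaps
toy = pd.DataFrame({
    'MarkDown1': [100.0, np.nan, 300.0],
    'MarkDown2': [200.0, 50.0, np.nan],
})

# Sum-based average: any NaN in the row makes the result NaN
sum_based = (toy['MarkDown1'] + toy['MarkDown2']) / 2

# Row-wise mean: NaN values are skipped by default
mean_based = toy[['MarkDown1', 'MarkDown2']].mean(axis=1)

print(sum_based.tolist())
print(mean_based.tolist())
print(sum_based.mean())  # column mean ignores the NaN rows
```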

In [11]:
# Let's compare the means of our markdown with missing values and markdown with missing values filled
Markdown_with_missing_values_filled = df_master['Ave_MK'].mean()
Markdown_with_missing_values = df_master_MV['Ave_MK_MV'].mean()

print( 'The mean markdown with missing values filled is: ' + str(Markdown_with_missing_values_filled))
print( 'The mean markdown with missing values is: ' + str(Markdown_with_missing_values))

The mean markdown with missing values filled is: 4719.567115262533
The mean markdown with missing values is: 4730.684346031739


### Conducting the T-Test
Let's conduct a t-test at the 95% confidence level and see whether it rejects the null hypothesis that the markdowns with missing values filled have the same mean as the markdowns with missing values left in place. To conduct a one-sample t-test, we can use the stats.ttest_1samp() function:

In [12]:
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['Ave_MK'], # Sample of markdowns with missing values filled
                  popmean = Markdown_with_missing_values)  # Mean of markdowns with missing values

Ttest_1sampResult(statistic=-0.2261203126066092, pvalue=0.8211134898579352)

### T-Test Result
The test result shows the test statistic "t" is equal to -0.23. This test statistic tells us how many standard errors the sample mean lies from the mean assumed under the null hypothesis. If the t-statistic lies inside the quantiles of the t-distribution corresponding to our confidence level and degrees of freedom, we fail to reject the null hypothesis. We can check the quantiles with stats.t.ppf():

### T-Test Quantile
If the t-statistic value we calculated above (-0.23) is inside the quantiles, then we cannot reject the null hypothesis.

In [13]:
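# Note: this reuses the degrees of freedom from section 3.0.1 (the high-success
# subset). The sample tested here is the full Ave_MK column, so
# len(df_master) - 1 would be the matching choice. With samples this large the
# critical values stay close to +/-1.96 either way.
degree_freedom = len(df_master[df_master['Markdown_Success']==1])

LQ = stats.t.ppf(0.025,degree_freedom)  # Lower critical value (2.5% quantile)

RQ = stats.t.ppf(0.975,degree_freedom)  # Upper critical value (97.5% quantile)

print ('The t-distribution lower quantile is: ' + str(LQ))
print ('The t-distribution upper quantile is: ' + str(RQ))

The t-distribution lower quantile is: -1.96070932309
The t-distribution upper quantile is: 1.96070932309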


### One-Sample T-Test Summary
T-Test = -0.23 | P-Value = 0.82 | Fail to Reject Null Hypothesis

We fail to reject the null hypothesis because:

* T-Test score is inside the quantiles
* P-value is higher than the significance level of 5%

Based on the one-sample t-test, there is no statistically significant difference between the mean of the markdowns when the missing values are left as NaN and the mean of the markdowns when they are backfilled. A P-value of 0.82 is well above the 5% significance level, so the data give no reason to reject the null hypothesis.

But as mentioned in the previous test, this does not necessarily mean that there is no practical significance. We would have to conduct more experiments or collect more data about the stores in order to come up with a more accurate finding. For the purpose of this analysis, we shall go ahead with the missing values filled, since there is no suggestion that the final result will be affected by the missing values.