## 3.0 Statistical Test for Correlation

We are going to test 8 hypotheses in this section. The hypotheses are as follows:

1. Store size has an impact on Markdown Success
2. Missing values significantly affect the results of the tests
3. The difference between sales recorded on holidays and those recorded on non-holiday weeks is statistically significant
4. The difference between markdown sales on holidays and those recorded on non-holiday weeks is statistically significant
5. The difference in Markdowns for each Store Type is statistically significant
6. The difference in Sales of gift items for each Store Type is statistically significant
7. The difference in Store Size for each Store Type is statistically significant
8. Store size has an impact on Sales of Gift Items

### 3.0.1 Creating composite function for Markdown success
We will define a function that returns one of two states for markdown success: "1" denotes high markdown success and "0" denotes low markdown success. The mean of the markdown sales is 4719.567115. Markdown success is taken to be any average markdown above this mean, while failure is any value at or below it. We will then use df.apply to apply the function to each row.

In [1]:
# import the python required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# load master_dataset.xlsx (older pandas releases spelled the keyword `sheetname`)
df_master = pd.read_excel('master_dataset.xlsx', sheet_name='Sheet1')
print (df_master.head())

   Store  Temperature  Fuel_Price  MarkDown1  MarkDown2  MarkDown3  MarkDown4  \
0      1        42.31       2.572    10382.9    6115.67     215.07    2406.62   
1      1        38.51       2.548    10382.9    6115.67     215.07    2406.62   
2      1        39.93       2.514    10382.9    6115.67     215.07    2406.62   
3      1        46.63       2.561    10382.9    6115.67     215.07    2406.62   
4      1        46.50       2.625    10382.9    6115.67     215.07    2406.62   

   MarkDown5         CPI  Unemployment     ...          Books  \
0    6551.42  211.096358         8.106     ...       73315.81   
1    6551.42  211.242170         8.106     ...       77280.42   
2    6551.42  211.289143         8.106     ...       78602.71   
3    6551.42  211.319643         8.106     ...       76091.36   
4    6551.42  211.350143         8.106     ...       71718.48   

  Musical_Instruments  Star_Wars  Movies_TV  Video_Games  Portable_Audios  \
0            57022.45  118966.90   58034.24  

Assign a new column, Ave_MK, to the dataframe. It holds the row-wise mean of the five markdown columns.

In [3]:
# create new column containing the average of the markdown sales
df_master['Ave_MK'] = (df_master['MarkDown1'] + df_master['MarkDown2']+ df_master['MarkDown3']+ df_master['MarkDown4']+ df_master['MarkDown5'])/5

In [4]:
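def Markdown_Success(c):
    # Return 1 (high success) when the row's average markdown is above
    # the overall mean of 4719.567115, otherwise 0 (low success)
    if c['Ave_MK'] <= 4719.567115:
        return 0
    else:
        return 1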


In [5]:
df_master['Markdown_Success'] = df_master.apply(Markdown_Success, axis=1)
print (df_master['Markdown_Success'].head()) 

0    1
1    1
2    1
3    1
4    1
Name: Markdown_Success, dtype: int64
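
The same flag can also be built without a hard-coded threshold and without a row-wise `apply`. The sketch below is only an illustration on invented toy data (the frame `toy` and its values are assumptions, not part of the dataset): it computes the threshold from the column itself and uses a vectorized comparison. `skipna=False` keeps the behaviour of the `+` based average above, where a missing markdown makes the whole row average missing.

```python
import pandas as pd

# Toy data standing in for df_master; the values are invented for illustration
toy = pd.DataFrame({
    "MarkDown1": [100.0, 9000.0, 4000.0],
    "MarkDown2": [200.0, 8000.0, 5000.0],
    "MarkDown3": [300.0, 7000.0, 6000.0],
    "MarkDown4": [400.0, 6000.0, 7000.0],
    "MarkDown5": [500.0, 5000.0, 8000.0],
})
markdown_cols = ["MarkDown1", "MarkDown2", "MarkDown3", "MarkDown4", "MarkDown5"]

# Row-wise average of the five markdown columns (NaN if any markdown is missing)
toy["Ave_MK"] = toy[markdown_cols].mean(axis=1, skipna=False)

# Threshold computed from the data instead of typed in by hand
threshold = toy["Ave_MK"].mean()

# 1 when the row average is above the threshold, 0 otherwise
toy["Markdown_Success"] = (toy["Ave_MK"] > threshold).astype(int)
print(toy[["Ave_MK", "Markdown_Success"]])
```

Computing the threshold from the column keeps the flag consistent if the underlying data changes.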


### 3.0.2 One-Sample T-Test (Measuring the impact of size on Markdown Success)

A one-sample t-test checks whether a sample mean differs from a fixed reference value, here the mean store size over the whole dataset. Since store size has the highest correlation with our dependent variable, Markdown Success, let's test whether the average size of the stores that have high Markdown success differs from that overall mean.

Hypothesis Testing: Is there a significant difference between the mean size of the stores that had high Markdown success and the mean size of the store population?

Null Hypothesis: (H0: pSS = pMS) The null hypothesis would be that there is no difference in mean store size between stores that had high Markdown success and the store population.

Alternate Hypothesis: (HA: pSS != pMS) The alternative hypothesis would be that there is a difference in mean store size between stores that had high Markdown success and the store population.


In [6]:
# Let's compare the means of our store sizes for high markdown success against the store markdown population
Markdown_population = df_master['Size'].mean()
Markdown_success_sizes = df_master[df_master['Markdown_Success']==1]['Size'].mean()

print( 'The mean store size for the population is: ' + str(Markdown_population))
print( 'The mean store size for stores that had high sales success during the markdowns is: ' + str(Markdown_success_sizes))

The mean store size for the population is: 130287.6
The mean store size for stores that had high sales success during the markdowns is: 167629.0945351759
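
A direct comparison of the high-success group against the low-success group would be a two-sample test rather than a one-sample test. The sketch below is one possible way to do that, using Welch's t-test (`equal_var=False`), which does not assume equal variances in the two groups. The arrays `high` and `low` are invented numbers, not values from the dataset.

```python
import numpy as np
from scipy import stats

# Invented store sizes for the two groups (illustration only)
high = np.array([200.0, 210.0, 190.0, 205.0, 195.0])        # high markdown success
low = np.array([100.0, 120.0, 90.0, 110.0, 95.0, 105.0])    # low markdown success

# Welch's two-sample t-test: compares the two group means directly
t_stat, p_value = stats.ttest_ind(high, low, equal_var=False)
print("t =", t_stat, "p =", p_value)
```

On the real data, the two arrays would presumably be the `Size` values filtered by `Markdown_Success == 1` and `Markdown_Success == 0`.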


### 3.0.3 Conducting the T-Test

Let's conduct a t-test at the 95% confidence level (α = 0.05) and see whether it rejects the null hypothesis that the sample comes from the same distribution as the store population. To conduct a one-sample t-test, we can use the stats.ttest_1samp() function:

In [7]:
import scipy.stats as stats
stats.ttest_1samp(a=  df_master[df_master['Markdown_Success']==1]['Size'], # Sample: store sizes for stores that had high sales success during the markdowns
                  popmean = Markdown_population)  # Reference value: mean store size over the whole dataset

Ttest_1sampResult(statistic=45.538955586982418, pvalue=0.0)
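
To make clear what `stats.ttest_1samp()` computes, the sketch below reproduces the statistic and the two-sided p-value by hand on a small invented sample (the numbers are assumptions chosen for easy arithmetic). The statistic is the difference between the sample mean and the reference value, divided by the standard error of the mean.

```python
import numpy as np
from scipy import stats

sample = np.array([12.0, 15.0, 11.0, 14.0, 13.0])  # invented sample
popmean = 10.0                                     # invented reference value
n = sample.size

# t = (sample mean - reference) / (sample std / sqrt(n)), with ddof=1 for the sample std
t_manual = (sample.mean() - popmean) / (sample.std(ddof=1) / np.sqrt(n))

# Two-sided p-value from the t-distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Same quantities from SciPy for comparison
t_scipy, p_scipy = stats.ttest_1samp(sample, popmean)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```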

### 3.0.4 T-Test Result

The test result shows the test statistic "t" is equal to 45.54. This test statistic tells us how many standard errors the sample mean lies from the value assumed under the null hypothesis. If the t-statistic lies outside the quantiles of the t-distribution corresponding to our confidence level and degrees of freedom, we reject the null hypothesis. We can check the quantiles with stats.t.ppf():

### 3.0.5 T-Test Quantile

If the t-statistic value we calculated above (45.54) is outside the quantiles, then we can reject the null hypothesis

In [8]:
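# Row count of the high-success sample. Strictly, a one-sample t-test has n - 1 degrees of freedom,
# but at this sample size the critical values are practically the same
degree_freedom = len(df_master[df_master['Markdown_Success']==1])

LQ = stats.t.ppf(0.025,degree_freedom)  # Lower critical value (2.5% quantile)

RQ = stats.t.ppf(0.975,degree_freedom)  # Upper critical value (97.5% quantile)

print ('The t-distribution left quantile range is: ' + str(LQ))
print ('The t-distribution right quantile range is: ' + str(RQ))

The t-distribution left quantile range is: -1.96070932309
The t-distribution right quantile range is: 1.96070932309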
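
The decision rule can be wrapped in two small helpers. This is a minimal sketch, not the notebook's own code: the function names and the sample size `n = 2000` are hypothetical. It uses n - 1 degrees of freedom, which is the standard choice for a one-sample t-test.

```python
from scipy import stats

def critical_values(n, alpha=0.05):
    """Return the lower and upper two-sided critical values for a sample of size n."""
    upper = stats.t.ppf(1 - alpha / 2, n - 1)  # n - 1 degrees of freedom
    return -upper, upper

def reject_null(t_stat, n, alpha=0.05):
    """True when the t-statistic falls outside the critical values."""
    lower, upper = critical_values(n, alpha)
    return bool(t_stat < lower or t_stat > upper)

n = 2000  # hypothetical sample size
lower, upper = critical_values(n)
print(lower, upper)
print(reject_null(45.54, n))   # statistic far outside the interval
print(reject_null(-0.23, n))   # statistic inside the interval
```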


### 3.0.6 One-Sample T-Test Summary

T-Test = 45.54 | P-Value = 0.0 | Reject Null Hypothesis

Reject the null hypothesis because:

* T-Test score is outside the quantiles
* P-value is lower than the significance level of 5%

Based on the one-sample t-test, there is a statistically significant difference between the mean store size for stores that had high sales success during the markdowns and the mean store size of the entire store population. The reported P-value of 0.0 means the value is too small to be represented in floating point, not that it is exactly zero. It is far below the 5% significance level, so we reject the null hypothesis.

But this does not necessarily mean that there is practical significance. We would have to conduct more experiments or collect more data about the stores in order to come up with a more accurate finding.
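
One common way to look at practical significance is an effect size such as Cohen's d, which expresses the mean difference in units of the sample standard deviation. The sketch below uses synthetic data (the sample, its size, and the 0.03 shift are all assumptions) to show that a very large sample can give a tiny p-value even when the effect is small.

```python
import numpy as np
from scipy import stats

# Synthetic data: a large sample whose true mean sits only 0.03 standard deviations above the reference
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.03, scale=1.0, size=100_000)
reference_mean = 0.0

t_stat, p_value = stats.ttest_1samp(sample, reference_mean)

# Cohen's d for one sample: mean difference divided by the sample standard deviation
cohens_d = (sample.mean() - reference_mean) / sample.std(ddof=1)
print("t =", t_stat, "p =", p_value, "d =", cohens_d)
```

A statistically significant result with a very small d suggests the difference may matter little in practice.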

## 3.1 Another Hypothesis Test (Impact of the Missing Values)
Hypothesis Testing: Is there a significant difference between the mean of the markdowns with missing values filled and the mean of the markdowns with missing values left in place?
* Null Hypothesis: (H0: pMV = pGP) The null hypothesis would be that filling the missing values using the back fill method makes no difference to the result.
* Alternate Hypothesis: (HA: pMV != pGP) The alternative hypothesis would be that filling the missing values using the back fill method makes a difference to the result.
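
For reference, back filling replaces each missing value with the next valid value that follows it. The sketch below shows this on an invented series; whether the master dataset was produced with exactly this call is an assumption.

```python
import numpy as np
import pandas as pd

# Invented markdown series with gaps (illustration only)
markdown = pd.Series([np.nan, np.nan, 3.0, np.nan, 5.0])

# Back fill: each NaN takes the next valid observation below it
filled = markdown.bfill()
print(filled.tolist())
```

A trailing NaN with no later valid value would stay missing after a back fill.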

In [9]:
# Let us load the dataframe with missing values (older pandas releases spelled the keyword `sheetname`)
df_master_MV = pd.read_excel('output_final.xlsx', sheet_name='Sheet1')
print (df_master_MV.head())

   Store        Date  Temperature  Fuel_Price  MarkDown1  MarkDown2  \
0      1  05/02/2010        42.31       2.572        NaN        NaN   
1      1  12/02/2010        38.51       2.548        NaN        NaN   
2      1  19/02/2010        39.93       2.514        NaN        NaN   
3      1  26/02/2010        46.63       2.561        NaN        NaN   
4      1  05/03/2010        46.50       2.625        NaN        NaN   

   MarkDown3  MarkDown4  MarkDown5         CPI  ...         90        91  \
0        NaN        NaN        NaN  211.096358  ...   73315.81  57022.45   
1        NaN        NaN        NaN  211.242170  ...   77280.42  57845.36   
2        NaN        NaN        NaN  211.289143  ...   78602.71  59462.22   
3        NaN        NaN        NaN  211.319643  ...   76091.36  63011.44   
4        NaN        NaN        NaN  211.350143  ...   71718.48  57335.17   

          92        93        94         95        96        97        98  \
0  118966.90  58034.24  56157.83  11300

Assign a new column, Ave_MK_MV, to the dataframe. It holds the row-wise mean of the five markdown columns, with missing values left in place.

In [10]:
# create new column containing the average of the markdown sales
df_master_MV['Ave_MK_MV'] = (df_master_MV['MarkDown1'] + df_master_MV['MarkDown2']+ df_master_MV['MarkDown3']+ 
                             df_master_MV['MarkDown4']+ df_master_MV['MarkDown5'])/5

In [11]:
# Let's compare the means of our markdown with missing values and markdown with missing values filled
Markdown_with_missing_values_filled = df_master['Ave_MK'].mean()
Markdown_with_missing_values = df_master_MV['Ave_MK_MV'].mean()

print( 'The mean markdown with missing values filled is: ' + str(Markdown_with_missing_values_filled))
print( 'The mean markdown with missing values is: ' + str(Markdown_with_missing_values))

The mean markdown with missing values filled is: 4719.567115262533
The mean markdown with missing values is: 4730.684346031739
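
When reading these two means, it helps to remember how pandas treats NaN. Adding columns with `+` gives NaN for any row where at least one markdown is missing, and `.mean()` on the resulting column then skips those rows. The sketch below shows the effect on invented data with two columns instead of five.

```python
import numpy as np
import pandas as pd

# Invented data with gaps (illustration only)
toy = pd.DataFrame({
    "MarkDown1": [10.0, np.nan, 30.0],
    "MarkDown2": [20.0, 40.0, np.nan],
})

# Average built with '+': NaN as soon as one markdown is missing in the row
ave_plus = (toy["MarkDown1"] + toy["MarkDown2"]) / 2

# Average built with mean(axis=1): missing values are skipped within each row
ave_skip = toy[["MarkDown1", "MarkDown2"]].mean(axis=1)

print(ave_plus.tolist())   # only the complete row has a value
print(ave_plus.count())    # number of rows that contribute to ave_plus.mean()
print(ave_plus.mean())     # mean over complete rows only
print(ave_skip.tolist())
```

So a mean computed this way describes only the rows where every markdown is present.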


### 3.1.1 Conducting the T-Test
Let's conduct a t-test at the 95% confidence level (α = 0.05) and see whether it rejects the null hypothesis that the backfilled markdowns have the same mean as the markdowns with missing values left in place. To conduct a one-sample t-test, we can use the stats.ttest_1samp() function:

In [12]:
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['Ave_MK'], # Sample of markdowns with missing values filled
                  popmean = Markdown_with_missing_values)  # Mean of markdowns with missing values

Ttest_1sampResult(statistic=-0.2261203126066092, pvalue=0.8211134898579352)

### 3.1.2 T-Test Result
The test result shows the test statistic "t" is equal to -0.23. This test statistic tells us how many standard errors the sample mean lies from the value assumed under the null hypothesis. If the t-statistic lies inside the quantiles of the t-distribution corresponding to our confidence level and degrees of freedom, we fail to reject the null hypothesis. We can check the quantiles with stats.t.ppf():

### 3.1.3 T-Test Quantile
If the t-statistic value we calculated above (-0.23) is inside the quantiles, then we cannot reject the null hypothesis.

In [13]:
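# Row count of the high-success subset. Note: the sample tested above is the full Ave_MK column, so the
# matching degrees of freedom would be len(df_master) - 1; both give critical values close to +/-1.96
degree_freedom = len(df_master[df_master['Markdown_Success']==1])

LQ = stats.t.ppf(0.025,degree_freedom)  # Lower critical value (2.5% quantile)

RQ = stats.t.ppf(0.975,degree_freedom)  # Upper critical value (97.5% quantile)

print ('The t-distribution left quantile range is: ' + str(LQ))
print ('The t-distribution right quantile range is: ' + str(RQ))

The t-distribution left quantile range is: -1.96070932309
The t-distribution right quantile range is: 1.96070932309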


### 3.1.4 One-Sample T-Test Summary
T-Test = -0.23 | P-Value = 0.82 | Fail to Reject Null Hypothesis

We fail to reject the null hypothesis because:

* T-Test score is inside the quantiles
* P-value is higher than the significance level of 5%

Based on the one-sample t-test, there is no statistically significant difference between the mean of the markdowns when the missing values are left as NaN and the mean when they are backfilled. The P-value of 0.82 is well above the 5% significance level, so we cannot reject the null hypothesis.

As mentioned in the previous test, this does not necessarily mean that there is no practical difference. We would have to conduct more experiments or collect more data about the stores in order to come up with a more accurate finding. For the purpose of this analysis we shall go ahead with the missing values filled, since there is no indication that the final result will be affected by them.

## 3.2 Test Hypothesis that the difference between sales recorded on holidays and those recorded on non-holiday weeks is statistically significant

### 3.2.1 Jewelry Sales

Hypothesis Testing: Is there a significant difference between the mean Jewelry sales in the holiday period and the mean for the general population?

Null Hypothesis: (H0: pMV = pGP) The null hypothesis would be that there is no difference between the mean Jewelry sales recorded during holidays and the mean for the general population.

Alternate Hypothesis: (HA: pMV != pGP) The alternative hypothesis would be that there is a difference.

#### 3.2.1.1 Mean Estimation - Jewelry Sales
Let us estimate the mean of Jewelry Sales for the holiday period and the mean for the entire population.

In [14]:
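import pandas as pd

def Jewelry_Sales (c):
    # Map the IsHoliday flag to 0 (non-holiday) or 1 (holiday).
    # Note: this compares against the string 'FALSE', while the cells below filter with
    # IsHoliday == 1, and the function is not called in the cells that follow.
    if c['IsHoliday'] == 'FALSE':
        return 0
    else:
        return 1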

In [15]:
# Let's compare the means of Jewelry Sales for holiday period and the means for the entire population
Jewelry_Sales_for_holiday_period = df_master[df_master['IsHoliday']==1]['Jewelry'].mean()
Jewelry_Sales_for_population = df_master['Jewelry'].mean()

print('The mean Jewelry Sales during holiday period is: ' + str(Jewelry_Sales_for_holiday_period))
print('The mean Jewelry Sales for the entire population is: ' + str(Jewelry_Sales_for_population))

The mean Jewelry Sales during holiday period is: 18020.531145299243
The mean Jewelry Sales for the entire population is: 18548.270426130683


#### 3.2.1.2 T-Test - Jewelry Sales
Let's conduct a t-test at the 95% confidence level (α = 0.05) and see whether it rejects the null hypothesis that the holiday mean and the overall mean are the same. Note that the call below passes the sales of all weeks as the sample and the holiday-period mean as the fixed reference value. To conduct a one-sample t-test, we can use the stats.ttest_1samp() function:

In [16]:
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['Jewelry'], # Sample of jewelry sales 
                  popmean = Jewelry_Sales_for_holiday_period)  # means of Jewelry Sales for holiday period

Ttest_1sampResult(statistic=3.5516777740792578, pvalue=0.00038493451926068325)

#### 3.2.1.3 T-Test Result - Jewelry Sales
The test result shows the test statistic "t" is equal to 3.55. This test statistic tells us how many standard errors the sample mean lies from the value assumed under the null hypothesis. If the t-statistic lies inside the quantiles of the t-distribution corresponding to our confidence level and degrees of freedom, we fail to reject the null hypothesis. We can check the quantiles with stats.t.ppf():

#### 3.2.1.4 T-Test Quantile - Jewelry Sales
If the t-statistic value we calculated above (3.55) is outside the quantiles, then we can reject the null hypothesis.

In [17]:
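# Row count of the holiday subset. Note: the sample tested above is the full column, so the matching
# degrees of freedom would be len(df_master) - 1; both give critical values close to +/-1.96
degree_freedom = len(df_master[df_master['IsHoliday']==1])

LQ = stats.t.ppf(0.025,degree_freedom)  # Lower critical value (2.5% quantile)

RQ = stats.t.ppf(0.975,degree_freedom)  # Upper critical value (97.5% quantile)

print ('The t-distribution left quantile range is: ' + str(LQ))
print ('The t-distribution right quantile range is: ' + str(RQ))

The t-distribution left quantile range is: -1.96402740931
The t-distribution right quantile range is: 1.96402740931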


#### 3.2.1.5 One-Sample T-Test Summary - Jewelry Sales
T-Test = 3.55 | P-Value = 0.00038 | Reject Null Hypothesis

Reject the null hypothesis because:

* T-Test score is outside the quantiles
* P-value is less than the significance level of 5%

If the P-value is less than (or equal to) α, reject the null hypothesis in favor of the alternative hypothesis.

Based on the one-sample t-test, there is a statistically significant difference between the mean of the holiday Jewelry Sales and the population mean. The P-value of 0.00038 is below the 5% significance level, so we reject the null hypothesis.


### 3.2.2 Cell Phone Sales
Hypothesis Testing: Is there a significant difference between the mean Cell Phone sales in the holiday period and the mean for the general population?

Null Hypothesis: (H0: pMV = pGP) The null hypothesis would be that there is no difference between the mean Cell Phone sales recorded during holidays and the mean for the general population.

Alternate Hypothesis: (HA: pMV != pGP) The alternative hypothesis would be that there is a difference.

#### 3.2.2.1 Mean Estimation - Cell Phone Sales
Let us estimate the mean of Cell Phone Sales for the holiday period and the mean for the entire population.

In [18]:
# Let's compare the means of Cell phone Sales for holiday period and the means for the entire population
Cell_Phone_Sales_for_holiday_period = df_master[df_master['IsHoliday']==1]['Cell_Phones'].mean()
Cell_Phone_Sales_for_population = df_master['Cell_Phones'].mean()

print('The mean Cell Phone Sales during holiday period is: ' + str(Cell_Phone_Sales_for_holiday_period))
print('The mean Cell Phone Sales for the entire population is: ' + str(Cell_Phone_Sales_for_population))

The mean Cell Phone Sales during holiday period is: 25530.830683760512
The mean Cell Phone Sales for the entire population is: 25469.003758239905


#### 3.2.2.2 T-Test - Cell Phone Sales
Let's conduct a t-test at the 95% confidence level (α = 0.05) and see whether it rejects the null hypothesis that the holiday mean and the overall mean are the same. The call below passes the sales of all weeks as the sample and the holiday-period mean as the fixed reference value. To conduct a one-sample t-test, we can use the stats.ttest_1samp() function:

In [19]:
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['Cell_Phones'], # Sample of Cell Phone sales 
                  popmean = Cell_Phone_Sales_for_holiday_period)  # means of Cell Phone Sales for holiday period

Ttest_1sampResult(statistic=-0.47440054855753117, pvalue=0.63522695141238594)

#### 3.2.2.3 T-Test Result - Cell Phone Sales
The test result shows the test statistic "t" is equal to -0.47. This test statistic tells us how many standard errors the sample mean lies from the value assumed under the null hypothesis. If the t-statistic lies inside the quantiles of the t-distribution corresponding to our confidence level and degrees of freedom, we fail to reject the null hypothesis. We can check the quantiles with stats.t.ppf():

#### 3.2.2.4 T-Test Quantile - Cell Phone Sales
If the t-statistic value we calculated above (-0.47) is outside the quantiles, then we can reject the null hypothesis.

In [20]:
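# Row count of the holiday subset. Note: the sample tested above is the full column, so the matching
# degrees of freedom would be len(df_master) - 1; both give critical values close to +/-1.96
degree_freedom = len(df_master[df_master['IsHoliday']==1])

LQ = stats.t.ppf(0.025,degree_freedom)  # Lower critical value (2.5% quantile)

RQ = stats.t.ppf(0.975,degree_freedom)  # Upper critical value (97.5% quantile)

print ('The t-distribution left quantile range is: ' + str(LQ))
print ('The t-distribution right quantile range is: ' + str(RQ))

The t-distribution left quantile range is: -1.96402740931
The t-distribution right quantile range is: 1.96402740931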


#### 3.2.2.5 One-Sample T-Test Summary - Cell Phone Sales
T-Test = -0.47 | P-Value = 0.64 | Fail to Reject Null Hypothesis

We fail to reject the null hypothesis because the T-Test score is inside the quantiles and the P-value is higher than the significance level of 5%.

If the P-value is less than (or equal to) α, reject the null hypothesis in favor of the alternative hypothesis; here it is not.

Based on the one-sample t-test, there is no statistically significant difference between the mean of the holiday Cell Phone Sales and the population mean. The P-value of 0.64 is above the 5% significance level, so we cannot reject the null hypothesis.

### 3.2.3 Floral Sales
Hypothesis Testing: Is there a significant difference between the mean Floral sales in the holiday period and the mean for the general population?

Null Hypothesis: (H0: pMV = pGP) The null hypothesis would be that there is no difference between the mean Floral sales recorded during holidays and the mean for the general population.

Alternate Hypothesis: (HA: pMV != pGP) The alternative hypothesis would be that there is a difference.

#### 3.2.3.1 Mean Estimation - Floral Sales
Let us estimate the mean of Floral Sales for the holiday period and the mean for the entire population.

In [21]:
# Let's compare the means of Floral Sales for holiday period and the means for the entire population
Floral_Sales_for_holiday_period = df_master[df_master['IsHoliday']==1]['Floral'].mean()
Floral_Sales_for_population = df_master['Floral'].mean()

print('The mean Floral Sales during holiday period is: ' + str(Floral_Sales_for_holiday_period))
print('The mean Floral Sales for the entire population is: ' + str(Floral_Sales_for_population))

The mean Floral Sales during holiday period is: 10335.280273504366
The mean Floral Sales for the entire population is: 10297.530998779794


#### 3.2.3.2 T-Test - Floral Sales
Let's conduct a t-test at the 95% confidence level (α = 0.05) and see whether it rejects the null hypothesis that the holiday mean and the overall mean are the same. The call below passes the sales of all weeks as the sample and the holiday-period mean as the fixed reference value. To conduct a one-sample t-test, we can use the stats.ttest_1samp() function:

In [22]:
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['Floral'], # Sample of Floral sales 
                  popmean = Floral_Sales_for_holiday_period)  # means of Floral Sales for holiday period

Ttest_1sampResult(statistic=-0.51779191783310874, pvalue=0.60461744521598559)

#### 3.2.3.3 T-Test Result - Floral Sales
The test result shows the test statistic "t" is equal to -0.52. This test statistic tells us how many standard errors the sample mean lies from the value assumed under the null hypothesis. If the t-statistic lies inside the quantiles of the t-distribution corresponding to our confidence level and degrees of freedom, we fail to reject the null hypothesis. We can check the quantiles with stats.t.ppf():

#### 3.2.3.4 T-Test Quantile - Floral Sales
If the t-statistic value we calculated above (-0.52) is outside the quantiles, then we can reject the null hypothesis.

In [23]:
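# Row count of the holiday subset. Note: the sample tested above is the full column, so the matching
# degrees of freedom would be len(df_master) - 1; both give critical values close to +/-1.96
degree_freedom = len(df_master[df_master['IsHoliday']==1])

LQ = stats.t.ppf(0.025,degree_freedom)  # Lower critical value (2.5% quantile)

RQ = stats.t.ppf(0.975,degree_freedom)  # Upper critical value (97.5% quantile)

print ('The t-distribution left quantile range is: ' + str(LQ))
print ('The t-distribution right quantile range is: ' + str(RQ))

The t-distribution left quantile range is: -1.96402740931
The t-distribution right quantile range is: 1.96402740931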


#### 3.2.3.5 One-Sample T-Test Summary - Floral Sales
T-Test = -0.52 | P-Value = 0.60 | Fail to Reject Null Hypothesis

We fail to reject the null hypothesis because the T-Test score is inside the quantiles and the P-value is higher than the significance level of 5%.

If the P-value is higher than α, we fail to reject the null hypothesis. Based on the one-sample t-test, there is no statistically significant difference between the mean of the holiday Floral Sales and the population mean. The P-value of 0.60 is above the 5% significance level, so we cannot reject the null hypothesis.

### 3.2.4 Clearance Clothing Sales
Hypothesis Testing: Is there a significant difference between the mean Clearance Clothing sales in the holiday period and the mean for the general population?

Null Hypothesis: (H0: pMV = pGP) The null hypothesis would be that there is no difference between the mean Clearance Clothing sales recorded during holidays and the mean for the general population.

Alternate Hypothesis: (HA: pMV != pGP) The alternative hypothesis would be that there is a difference.

#### 3.2.4.1 Mean Estimation - Clearance Clothing Sales
Let us estimate the mean of Clearance Clothing Sales for the holiday period and the mean for the entire population.

In [24]:
# Let's compare the means of Clearance Clothing Sales for holiday period and the means for the entire population
Clearance_Clothing_Sales_for_holiday_period = df_master[df_master['IsHoliday']==1]['Clearance_Clothings'].mean()
Clearance_Clothing_Sales_for_population = df_master['Clearance_Clothings'].mean()

print('The mean Clearance Clothing Sales during holiday period is: ' + str(Clearance_Clothing_Sales_for_holiday_period))
print('The mean Clearance Clothing Sales for the entire population is: ' + str(Clearance_Clothing_Sales_for_population))

The mean Clearance Clothing Sales during holiday period is: 7315.417675213643
The mean Clearance Clothing Sales for the entire population is: 7607.392772894041


#### 3.2.4.2 T-Test - Clearance Clothing Sales
Let's conduct a t-test at the 95% confidence level (α = 0.05) and see whether it rejects the null hypothesis that the holiday mean and the overall mean are the same. The call below passes the sales of all weeks as the sample and the holiday-period mean as the fixed reference value. To conduct a one-sample t-test, we can use the stats.ttest_1samp() function:

In [25]:
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['Clearance_Clothings'], # Sample of Clearance Clothings sales 
                  popmean = Clearance_Clothing_Sales_for_holiday_period)  # means of Clearance Clothings Sales for holiday period

Ttest_1sampResult(statistic=5.2428242532919205, pvalue=1.6208631849271012e-07)

#### 3.2.4.3 T-Test Result - Clearance Clothing Sales
The test result shows the test statistic "t" is equal to 5.24. This test statistic tells us how many standard errors the sample mean lies from the value assumed under the null hypothesis. If the t-statistic lies inside the quantiles of the t-distribution corresponding to our confidence level and degrees of freedom, we fail to reject the null hypothesis. We can check the quantiles with stats.t.ppf():

#### 3.2.4.4 T-Test Quantile - Clearance Clothing Sales
If the t-statistic value we calculated above (5.24) is outside the quantiles, then we can reject the null hypothesis.

In [26]:
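# Row count of the holiday subset. Note: the sample tested above is the full column, so the matching
# degrees of freedom would be len(df_master) - 1; both give critical values close to +/-1.96
degree_freedom = len(df_master[df_master['IsHoliday']==1])

LQ = stats.t.ppf(0.025,degree_freedom)  # Lower critical value (2.5% quantile)

RQ = stats.t.ppf(0.975,degree_freedom)  # Upper critical value (97.5% quantile)

print ('The t-distribution left quantile range is: ' + str(LQ))
print ('The t-distribution right quantile range is: ' + str(RQ))

The t-distribution left quantile range is: -1.96402740931
The t-distribution right quantile range is: 1.96402740931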


#### 3.2.4.5 One-Sample T-Test Summary - Clearance Clothing Sales
T-Test = 5.24 | P-Value = 1.6208631849271012e-07 | Reject Null Hypothesis

Reject the null hypothesis because the T-Test score is outside the quantiles and the P-value is lower than the significance level of 5%.

If the P-value is less than (or equal to) α, reject the null hypothesis. Based on the one-sample t-test, there is a statistically significant difference between the mean of the holiday Clearance Clothing Sales and the population mean. The P-value of 1.6208631849271012e-07 is far below the 5% significance level, so we reject the null hypothesis.

### 3.2.5 Video Games Sales
Hypothesis Testing: Is there a significant difference between the mean Video Games sales in the holiday period and the mean for the general population?

Null Hypothesis: (H0: pMV = pGP) The null hypothesis would be that there is no difference between the mean Video Games sales recorded during holidays and the mean for the general population.

Alternate Hypothesis: (HA: pMV != pGP) The alternative hypothesis would be that there is a difference.

#### 3.2.5.1 Mean Estimation - Video Games Sales
Let us estimate the mean of Video Games Sales for the holiday period and the mean for the entire population.

In [27]:
# Let's compare the means of Video Games Sales for holiday period and the means for the entire population
Video_Games_Sales_for_holiday_period = df_master[df_master['IsHoliday']==1]['Video_Games'].mean()
Video_Games_Sales_for_population = df_master['Video_Games'].mean()

print('The mean Video Games Sales during holiday period is: ' + str(Video_Games_Sales_for_holiday_period))
print('The mean Video Games Sales for the entire population is: ' + str(Video_Games_Sales_for_population))

The mean Video Games Sales during holiday period is: 24575.281760683654
The mean Video Games Sales for the entire population is: 24108.858239316476


#### 3.2.5.2 T-Test - Video Games Sales
Let's conduct a t-test at the 95% confidence level (α = 0.05) and see whether it rejects the null hypothesis that the holiday mean and the overall mean are the same. The call below passes the sales of all weeks as the sample and the holiday-period mean as the fixed reference value. To conduct a one-sample t-test, we can use the stats.ttest_1samp() function:

In [28]:
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['Video_Games'], # Sample of Video Games sales 
                  popmean = Video_Games_Sales_for_holiday_period)  # means of Video Games Sales for holiday period

Ttest_1sampResult(statistic=-1.6619010942794092, pvalue=0.096570882415821027)

#### 3.2.5.3 T-Test Result - Video Games Sales
The test result shows the test statistic "t" is equal to -1.66. This test statistic tells us how many standard errors the sample mean lies from the value assumed under the null hypothesis. If the t-statistic lies inside the quantiles of the t-distribution corresponding to our confidence level and degrees of freedom, we fail to reject the null hypothesis. We can check the quantiles with stats.t.ppf():

#### 3.2.5.4 T-Test Quantile - Video Games Sales
If the t-statistic value we calculated above (-1.66) is outside the quantiles, then we can reject the null hypothesis.

In [29]:
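# Row count of the holiday subset. Note: the sample tested above is the full column, so the matching
# degrees of freedom would be len(df_master) - 1; both give critical values close to +/-1.96
degree_freedom = len(df_master[df_master['IsHoliday']==1])

LQ = stats.t.ppf(0.025,degree_freedom)  # Lower critical value (2.5% quantile)

RQ = stats.t.ppf(0.975,degree_freedom)  # Upper critical value (97.5% quantile)

print ('The t-distribution left quantile range is: ' + str(LQ))
print ('The t-distribution right quantile range is: ' + str(RQ))

The t-distribution left quantile range is: -1.96402740931
The t-distribution right quantile range is: 1.96402740931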


#### 3.2.5.5 One-Sample T-Test Summary - Video Games Sales
T-Test = -1.66 | P-Value = 0.097 | Fail to Reject Null Hypothesis

We fail to reject the null hypothesis because the T-Test score is inside the quantiles and the P-value is higher than the significance level of 5%. Based on the one-sample t-test, there is no statistically significant difference between the mean of the holiday Video Games Sales and the population mean.
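
Since the same steps repeat for every product category, they could be collected in one helper that loops over columns and returns a summary table. The sketch below is one possible arrangement under stated assumptions: the function `holiday_ttests`, the toy frame, and the extra Bonferroni-adjusted column are all hypothetical additions, not part of the notebook. It mirrors the calls above by testing the full column against the holiday mean.

```python
import pandas as pd
from scipy import stats

def holiday_ttests(df, columns, flag_col="IsHoliday", alpha=0.05):
    """Run the same one-sample t-test for each column and tabulate the results."""
    holiday = df[df[flag_col] == 1]
    bonferroni_alpha = alpha / len(columns)  # stricter per-test threshold for multiple tests
    rows = []
    for col in columns:
        popmean = holiday[col].mean()  # holiday mean used as the reference value
        t_stat, p_value = stats.ttest_1samp(df[col], popmean)
        rows.append({
            "column": col,
            "t": t_stat,
            "p": p_value,
            "reject_h0": bool(p_value < alpha),
            "reject_h0_bonferroni": bool(p_value < bonferroni_alpha),
        })
    return pd.DataFrame(rows)

# Invented data: column A is much higher in holiday weeks, column B is not
toy = pd.DataFrame({
    "IsHoliday": [1, 0, 0, 0, 1, 0, 0, 0],
    "A": [100.0, 10.0, 11.0, 9.0, 102.0, 10.0, 12.0, 8.0],
    "B": [10.0, 9.0, 11.0, 10.0, 10.0, 11.0, 9.0, 10.0],
})
results = holiday_ttests(toy, ["A", "B"])
print(results)
```

The Bonferroni column is included because running many tests at α = 0.05 raises the chance of at least one false rejection; dividing α by the number of tests is a simple, conservative adjustment.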

## 3.3 Test Hypothesis that the difference between MarkDowns recorded on holidays and those recorded on non-holiday weeks is statistically significant

### 3.3.1 MarkDown 1

Hypothesis Testing: Is there a significant difference between the mean of Markdown 1 in the holiday period and the mean for the general population?

Null Hypothesis: (H0: pMV = pGP) The null hypothesis would be that there is no difference between the mean of Markdown 1 recorded during holidays and the mean for the general population.

Alternate Hypothesis: (HA: pMV != pGP) The alternative hypothesis would be that there is a difference.

#### 3.3.1.1 Mean Estimation - Markdown 1
Let us estimate the mean of Markdown 1 for the holiday period and the mean for the entire population.

In [30]:
# Let's compare the means of Markdown 1 for holiday period and the means for the entire population
MarkDown1_for_holiday_period = df_master[df_master['IsHoliday']==1]['MarkDown1'].mean()
MarkDown1_for_population = df_master['MarkDown1'].mean()

print('The mean of MarkDown1 during holiday period is: ' + str(MarkDown1_for_holiday_period))
print('The mean of MarkDown1 for the entire population is: ' + str(MarkDown1_for_population))

The mean of MarkDown1 during holiday period is: 11338.814940170949
The mean of MarkDown1 for the entire population is: 8887.617797313967


#### 3.3.1.2 T-Test - MarkDown1
Let's conduct a t-test at the 95% confidence level (α = 0.05) and see whether it rejects the null hypothesis that the holiday mean and the overall mean are the same. The call below passes the markdowns of all weeks as the sample and the holiday-period mean as the fixed reference value. To conduct a one-sample t-test, we can use the stats.ttest_1samp() function:

In [31]:
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['MarkDown1'], # Sample of MarkDown1 
                  popmean = MarkDown1_for_holiday_period)  # means of MarkDown1 for holiday period

Ttest_1sampResult(statistic=-24.164318122881514, pvalue=1.1327427015673317e-124)

#### 3.3.1.3 T-Test Result - MarkDown1
The test result shows the test statistic "t" is equal to -24.16. This test statistic tells us how many standard errors the sample mean lies from the value assumed under the null hypothesis. If the t-statistic lies inside the quantiles of the t-distribution corresponding to our confidence level and degrees of freedom, we fail to reject the null hypothesis. We can check the quantiles with stats.t.ppf():

#### 3.3.1.4 T-Test Quantile - MarkDown1
If the t-statistic value we calculated above (-24.16) is outside the quantiles, then we can reject the null hypothesis.

In [32]:
# Degrees of freedom: taken here as the number of holiday rows.
# (Strictly, a one-sample t-test has n - 1 degrees of freedom, where n is the size of the tested sample.)
degree_freedom = len(df_master[df_master['IsHoliday']==1])

LQ = stats.t.ppf(0.025,degree_freedom)  # Lower critical value (2.5% quantile)

RQ = stats.t.ppf(0.975,degree_freedom)  # Upper critical value (97.5% quantile)

print ('The t-distribution lower quantile is: ' + str(LQ))
print ('The t-distribution upper quantile is: ' + str(RQ))

The t-distribution lower quantile is: -1.96402740931
The t-distribution upper quantile is: 1.96402740931
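
The quantile check can be wrapped in a small helper so that the degrees of freedom are always derived from the size of the tested sample. This is only a sketch: `critical_values` and `reject_null` are hypothetical helper names, and the sample size of 500 is made up.

```python
import scipy.stats as stats

def critical_values(n, alpha=0.05):
    """Two-sided critical values of the t-distribution for a sample of size n."""
    df = n - 1  # degrees of freedom of a one-sample t-test
    lower = stats.t.ppf(alpha / 2, df)
    upper = stats.t.ppf(1 - alpha / 2, df)
    return lower, upper

def reject_null(t_stat, n, alpha=0.05):
    """True when the t-statistic falls outside the two critical values."""
    lower, upper = critical_values(n, alpha)
    return bool(t_stat < lower or t_stat > upper)

print(critical_values(500))      # hypothetical sample size
print(reject_null(-24.16, 500))  # statistic far in the lower tail
print(reject_null(0.5, 500))     # statistic well inside the interval
```

For large samples the critical values settle near ±1.96, so the exact degrees of freedom matter little for the conclusions in this section.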


#### 3.3.1.5 One-Sample T-Test Summary - MarkDown1
T-Test = -24.16 | P-Value = 1.1327427015673317e-124 | Reject Null Hypothesis

Reject the null hypothesis because the t-statistic is outside the quantiles and the P-value is lower than the significance level of 5%.

The decision rule is: if the P-value is higher than (or equal to) α, we fail to reject the null hypothesis; otherwise we reject it. Based on the one-sample t-test, there is a statistically significant difference between the holiday-period mean of MarkDown1 and the population mean. The P-value of 1.1327427015673317e-124 is far below the 5% significance level, which supports rejecting the null hypothesis.
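
The one-sample test above treats the holiday mean as a fixed reference value. One alternative, not used in this notebook, is a two-sample Welch t-test that compares holiday rows directly with non-holiday rows and accounts for the variability in both groups. The sketch below uses made-up data; the group sizes, means and spreads are assumptions chosen only for illustration.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)
holiday = rng.normal(loc=120.0, scale=20.0, size=60)       # made-up holiday values
non_holiday = rng.normal(loc=100.0, scale=20.0, size=400)  # made-up non-holiday values

# equal_var=False selects Welch's t-test, which does not assume equal variances
welch = stats.ttest_ind(holiday, non_holiday, equal_var=False)

print('t-statistic:', welch.statistic)
print('P-value:', welch.pvalue)
```

With `df_master`, the two inputs would be the MarkDown1 values where `IsHoliday` is 1 and where it is 0.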

### 3.3.2 MarkDown 2
Hypothesis Testing: Is there significant difference in the means of Markdown 2 on holiday period and those of the general population?

Null Hypothesis: (H0: pMV = pGP) The null hypothesis would be that there is no difference in Markdown 2 recorded during holiday from the general population.

Alternate Hypothesis: (HA: pMV != pGP) The alternative hypothesis would be that there is a difference.

#### 3.3.2.1 Mean Estimation - Markdown 2
Let us estimate the means of Markdown 2 for holiday period and the means for the entire population

In [33]:
# Let's compare the means of Markdown 2 for holiday period and the means for the entire population
MarkDown2_for_holiday_period = df_master[df_master['IsHoliday']==1]['MarkDown2'].mean()
MarkDown2_for_population = df_master['MarkDown2'].mean()

print('The mean of MarkDown2 during holiday period is: ' + str(MarkDown2_for_holiday_period))
print('The mean of MarkDown2 for the entire population is: ' + str(MarkDown2_for_population))

The mean of MarkDown2 during holiday period is: 9794.434051282047
The mean of MarkDown2 for the entire population is: 6107.224317460187


#### 3.3.2.2 T-Test - MarkDown2
Let's conduct a one-sample t-test at the 5% significance level (95% confidence) and see whether it rejects the null hypothesis that the holiday-period mean equals the population mean. Here the full MarkDown2 column is passed as the sample and the holiday-period mean as the reference value (popmean). To conduct a one-sample t-test, we can use the stats.ttest_1samp() function:

In [34]:
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['MarkDown2'], # Sample of MarkDown2
                  popmean = MarkDown2_for_holiday_period)  # means of MarkDown2 for holiday period

Ttest_1sampResult(statistic=-37.240603802737347, pvalue=1.4303499102667072e-280)

#### 3.3.2.3 T-Test Result - MarkDown2
The test result shows that the test statistic "t" is equal to -37.24. The t-statistic measures how far the sample mean lies from the reference value, in units of standard error. If the t-statistic lies inside the quantiles of the t-distribution corresponding to our confidence level and degrees of freedom, we fail to reject the null hypothesis. We can check the quantiles with stats.t.ppf():

#### 3.3.2.4 T-Test Quantile - MarkDown2
If the t-statistic value we calculated above (-37.24) is outside the quantiles, then we can reject the null hypothesis

In [35]:
# Degrees of freedom: taken here as the number of holiday rows.
# (Strictly, a one-sample t-test has n - 1 degrees of freedom, where n is the size of the tested sample.)
degree_freedom = len(df_master[df_master['IsHoliday']==1])

LQ = stats.t.ppf(0.025,degree_freedom)  # Lower critical value (2.5% quantile)

RQ = stats.t.ppf(0.975,degree_freedom)  # Upper critical value (97.5% quantile)

print ('The t-distribution lower quantile is: ' + str(LQ))
print ('The t-distribution upper quantile is: ' + str(RQ))

The t-distribution lower quantile is: -1.96402740931
The t-distribution upper quantile is: 1.96402740931


#### 3.3.2.5 One-Sample T-Test Summary - MarkDown2
T-Test = -37.24 | P-Value = 1.4303499102667072e-280 | Reject Null Hypothesis

Reject the null hypothesis because the t-statistic is outside the quantiles and the P-value is lower than the significance level of 5%.

The decision rule is: if the P-value is higher than (or equal to) α, we fail to reject the null hypothesis; otherwise we reject it. Based on the one-sample t-test, there is a statistically significant difference between the holiday-period mean of MarkDown2 and the population mean. The P-value of 1.4303499102667072e-280 is far below the 5% significance level, which supports rejecting the null hypothesis.

### 3.3.3 MarkDown 3
Hypothesis Testing: Is there significant difference in the means of Markdown 3 on holiday period and those of the general population?

Null Hypothesis: (H0: pMV = pGP) The null hypothesis would be that there is no difference in Markdown 3 recorded during holiday from the general population.

Alternate Hypothesis: (HA: pMV != pGP) The alternative hypothesis would be that there is a difference.

#### 3.3.3.1 Mean Estimation - Markdown 3
Let us estimate the means of Markdown 3 for holiday period and the means for the entire population

In [36]:
# Let's compare the means of Markdown 3 for holiday period and the means for the entire population
MarkDown3_for_holiday_period = df_master[df_master['IsHoliday']==1]['MarkDown3'].mean()
MarkDown3_for_population = df_master['MarkDown3'].mean()

print('The mean of MarkDown3 during holiday period is: ' + str(MarkDown3_for_holiday_period))
print('The mean of MarkDown3 for the entire population is: ' + str(MarkDown3_for_population))

The mean of MarkDown3 during holiday period is: 9578.540598290585
The mean of MarkDown3 for the entire population is: 928.7852197802101


#### 3.3.3.2 T-Test - MarkDown3
Let's conduct a one-sample t-test at the 5% significance level (95% confidence) and see whether it rejects the null hypothesis that the holiday-period mean equals the population mean. Here the full MarkDown3 column is passed as the sample and the holiday-period mean as the reference value (popmean). To conduct a one-sample t-test, we can use the stats.ttest_1samp() function:

In [37]:
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['MarkDown3'], # Sample of MarkDown3 
                  popmean = MarkDown3_for_holiday_period)  # means of MarkDown3 for holiday period

Ttest_1sampResult(statistic=-103.98200072690143, pvalue=0.0)

#### 3.3.3.3 T-Test Result - MarkDown3
The test result shows that the test statistic "t" is equal to -103.98. The t-statistic measures how far the sample mean lies from the reference value, in units of standard error. If the t-statistic lies inside the quantiles of the t-distribution corresponding to our confidence level and degrees of freedom, we fail to reject the null hypothesis. We can check the quantiles with stats.t.ppf():

#### 3.3.3.4 T-Test Quantile - MarkDown3
If the t-statistic value we calculated above (-103.98) is outside the quantiles, then we can reject the null hypothesis

In [38]:
# Degrees of freedom: taken here as the number of holiday rows.
# (Strictly, a one-sample t-test has n - 1 degrees of freedom, where n is the size of the tested sample.)
degree_freedom = len(df_master[df_master['IsHoliday']==1])

LQ = stats.t.ppf(0.025,degree_freedom)  # Lower critical value (2.5% quantile)

RQ = stats.t.ppf(0.975,degree_freedom)  # Upper critical value (97.5% quantile)

print ('The t-distribution lower quantile is: ' + str(LQ))
print ('The t-distribution upper quantile is: ' + str(RQ))

The t-distribution lower quantile is: -1.96402740931
The t-distribution upper quantile is: 1.96402740931


#### 3.3.3.5 One-Sample T-Test Summary - MarkDown3
T-Test = -103.98 | P-Value = 0.0 (too small to represent as a floating-point number) | Reject Null Hypothesis

Reject the null hypothesis because the t-statistic is outside the quantiles and the P-value is lower than the significance level of 5%.

The decision rule is: if the P-value is higher than (or equal to) α, we fail to reject the null hypothesis; otherwise we reject it. Based on the one-sample t-test, there is a statistically significant difference between the holiday-period mean of MarkDown3 and the population mean. The P-value is reported as 0.0 because it underflows floating-point precision; it is far below the 5% significance level, which supports rejecting the null hypothesis.
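
A P-value printed as 0.0 is not literally zero; it is smaller than the smallest normal double-precision number. When writing up results it can be clearer to report such values as an upper bound. The helper below is a hypothetical formatting sketch (`format_p_value` is not part of SciPy) that does this.

```python
import sys

def format_p_value(p, floor=sys.float_info.min):
    """Return a printable P-value, flagging values that underflowed to zero."""
    if p < floor:
        # Below the smallest normal double: report an upper bound instead of 0
        return '< {:.1e}'.format(floor)
    return '{:.3g}'.format(p)

print(format_p_value(0.0))                     # MarkDown3 result above
print(format_p_value(1.5752096957048749e-150))
print(format_p_value(0.062034184503743747))
```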

### 3.3.4 MarkDown 4
Hypothesis Testing: Is there significant difference in the means of Markdown 4 on holiday period and those of the general population?

Null Hypothesis: (H0: pMV = pGP) The null hypothesis would be that there is no difference in Markdown 4 recorded during holiday from the general population.

Alternate Hypothesis: (HA: pMV != pGP) The alternative hypothesis would be that there is a difference.

#### 3.3.4.1 Mean Estimation - Markdown 4
Let us estimate the means of Markdown 4 for holiday period and the means for the entire population

In [39]:
# Let's compare the means of Markdown 4 for holiday period and the means for the entire population
MarkDown4_for_holiday_period = df_master[df_master['IsHoliday']==1]['MarkDown4'].mean()
MarkDown4_for_population = df_master['MarkDown4'].mean()

print('The mean of MarkDown4 during holiday period is: ' + str(MarkDown4_for_holiday_period))
print('The mean of MarkDown4 for the entire population is: ' + str(MarkDown4_for_population))

The mean of MarkDown4 during holiday period is: 4658.832564102572
The mean of MarkDown4 for the entire population is: 3130.1765555554607


#### 3.3.4.2 T-Test - MarkDown4
Let's conduct a one-sample t-test at the 5% significance level (95% confidence) and see whether it rejects the null hypothesis that the holiday-period mean equals the population mean. Here the full MarkDown4 column is passed as the sample and the holiday-period mean as the reference value (popmean). To conduct a one-sample t-test, we can use the stats.ttest_1samp() function:

In [40]:
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['MarkDown4'], # Sample of MarkDown4 
                  popmean = MarkDown4_for_holiday_period)  # means of MarkDown4 for holiday period

Ttest_1sampResult(statistic=-26.687306347499145, pvalue=1.5752096957048749e-150)

#### 3.3.4.3 T-Test Result - MarkDown4
The test result shows that the test statistic "t" is equal to -26.69. The t-statistic measures how far the sample mean lies from the reference value, in units of standard error. If the t-statistic lies inside the quantiles of the t-distribution corresponding to our confidence level and degrees of freedom, we fail to reject the null hypothesis. We can check the quantiles with stats.t.ppf():

#### 3.3.4.4 T-Test Quantile - MarkDown4
If the t-statistic value we calculated above (-26.69) is outside the quantiles, then we can reject the null hypothesis.

In [41]:
# Degrees of freedom: taken here as the number of holiday rows.
# (Strictly, a one-sample t-test has n - 1 degrees of freedom, where n is the size of the tested sample.)
degree_freedom = len(df_master[df_master['IsHoliday']==1])

LQ = stats.t.ppf(0.025,degree_freedom)  # Lower critical value (2.5% quantile)

RQ = stats.t.ppf(0.975,degree_freedom)  # Upper critical value (97.5% quantile)

print ('The t-distribution lower quantile is: ' + str(LQ))
print ('The t-distribution upper quantile is: ' + str(RQ))

The t-distribution lower quantile is: -1.96402740931
The t-distribution upper quantile is: 1.96402740931


#### 3.3.4.5 One-Sample T-Test Summary - MarkDown4
T-Test = -26.69 | P-Value = 1.5752096957048749e-150 | Reject Null Hypothesis

Reject the null hypothesis because the t-statistic is outside the quantiles and the P-value is lower than the significance level of 5%.

The decision rule is: if the P-value is higher than (or equal to) α, we fail to reject the null hypothesis; otherwise we reject it. Based on the one-sample t-test, there is a statistically significant difference between the holiday-period mean of MarkDown4 and the population mean. The P-value of 1.5752096957048749e-150 is far below the 5% significance level, which supports rejecting the null hypothesis.

### 3.3.5 MarkDown 5
Hypothesis Testing: Is there significant difference in the means of Markdown 5 on holiday period and those of the general population?

Null Hypothesis: (H0: pMV = pGP) The null hypothesis would be that there is no difference in Markdown 5 recorded during holiday from the general population.

Alternate Hypothesis: (HA: pMV != pGP) The alternative hypothesis would be that there is a difference.

#### 3.3.5.1 Mean Estimation - Markdown 5
Let us estimate the means of Markdown 5 for holiday period and the means for the entire population

In [42]:
# Let's compare the means of Markdown 5 for holiday period and the means for the entire population
MarkDown5_for_holiday_period = df_master[df_master['IsHoliday']==1]['MarkDown5'].mean()
MarkDown5_for_population = df_master['MarkDown5'].mean()

print('The mean of MarkDown5 during holiday period is: ' + str(MarkDown5_for_holiday_period))
print('The mean of MarkDown5 for the entire population is: ' + str(MarkDown5_for_population))

The mean of MarkDown5 during holiday period is: 3910.6670769230823
The mean of MarkDown5 for the entire population is: 4544.031686202577


#### 3.3.5.2 T-Test - MarkDown5
Let's conduct a one-sample t-test at the 5% significance level (95% confidence) and see whether it rejects the null hypothesis that the holiday-period mean equals the population mean. Here the full MarkDown5 column is passed as the sample and the holiday-period mean as the reference value (popmean). To conduct a one-sample t-test, we can use the stats.ttest_1samp() function:

In [43]:
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['MarkDown5'], # Sample of MarkDown5 
                  popmean = MarkDown5_for_holiday_period)  # means of MarkDown5 for holiday period

Ttest_1sampResult(statistic=5.9215134513376873, pvalue=3.3185785770162059e-09)

#### 3.3.5.3 T-Test Result - MarkDown5
The test result shows that the test statistic "t" is equal to 5.92. The t-statistic measures how far the sample mean lies from the reference value, in units of standard error; it is positive here because the population mean of MarkDown5 is higher than the holiday-period mean. If the t-statistic lies inside the quantiles of the t-distribution corresponding to our confidence level and degrees of freedom, we fail to reject the null hypothesis. We can check the quantiles with stats.t.ppf():

#### 3.3.5.4 T-Test Quantile - MarkDown5
If the t-statistic value we calculated above (5.92) is outside the quantiles, then we can reject the null hypothesis

In [44]:
# Degrees of freedom: taken here as the number of holiday rows.
# (Strictly, a one-sample t-test has n - 1 degrees of freedom, where n is the size of the tested sample.)
degree_freedom = len(df_master[df_master['IsHoliday']==1])

LQ = stats.t.ppf(0.025,degree_freedom)  # Lower critical value (2.5% quantile)

RQ = stats.t.ppf(0.975,degree_freedom)  # Upper critical value (97.5% quantile)

print ('The t-distribution lower quantile is: ' + str(LQ))
print ('The t-distribution upper quantile is: ' + str(RQ))

The t-distribution lower quantile is: -1.96402740931
The t-distribution upper quantile is: 1.96402740931


#### 3.3.5.5 One-Sample T-Test Summary - MarkDown5
T-Test = 5.92 | P-Value = 3.3185785770162059e-09 | Reject Null Hypothesis

Reject the null hypothesis because the t-statistic is outside the quantiles and the P-value is lower than the significance level of 5%.

The decision rule is: if the P-value is higher than (or equal to) α, we fail to reject the null hypothesis; otherwise we reject it. Based on the one-sample t-test, there is a statistically significant difference between the holiday-period mean of MarkDown5 and the population mean. The P-value of 3.3185785770162059e-09 is far below the 5% significance level, which supports rejecting the null hypothesis.
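
The five tests in Section 3.3 follow the same steps, so they can be run in a single loop that collects the results in one table. The sketch below mirrors the approach used above (full column as the sample, holiday mean as `popmean`) on a made-up frame; `toy`, its values and the `summary` column names are hypothetical.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

rng = np.random.default_rng(1)

# Hypothetical stand-in for df_master with the same column names
toy = pd.DataFrame({'IsHoliday': rng.integers(0, 2, size=300)})
for i in range(1, 6):
    toy['MarkDown{}'.format(i)] = rng.normal(1000 * i, 200, size=300)

rows = []
for col in ['MarkDown{}'.format(i) for i in range(1, 6)]:
    # Reference value: mean of the column over holiday rows only
    holiday_mean = toy.loc[toy['IsHoliday'] == 1, col].mean()
    res = stats.ttest_1samp(toy[col], popmean=holiday_mean)
    rows.append({
        'column': col,
        'holiday_mean': holiday_mean,
        't': res.statistic,
        'p_value': res.pvalue,
        'reject_H0': bool(res.pvalue < 0.05),
    })

summary = pd.DataFrame(rows)
print(summary)
```

Keeping the results in one table also makes it harder for a summary paragraph to quote the wrong statistic or P-value.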

## 3.4 Test Hypothesis that the difference in Markdowns for each Store Type is statistically significant

### 3.4.1 Markdowns for Store A
Assign 1 to Store A, 2 to Store B, 3 to Store C, and 0 to all others.

In [45]:
import pandas as pd

def Stores (c):
    if c['Store'] == 'A':
        return 1
    elif c['Store'] == 'B':
        return 2
    elif c['Store'] == 'C':
        return 3
    else:
        return 0
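
A row function like `Stores` only takes effect once it is applied to the frame. The sketch below shows two equivalent ways of producing the numeric codes on a made-up frame. It assumes the store letters live in a column called `Type`, which is a hypothetical name; the notebook's `Store` column appears to hold numeric store IDs, so the column name should be adjusted to match the real data.

```python
import pandas as pd

# Hypothetical frame; 'Type' is an assumed column name holding store letters
toy = pd.DataFrame({'Type': ['A', 'B', 'C', 'A', 'D']})

type_codes = {'A': 1, 'B': 2, 'C': 3}

# Vectorised version: unknown letters become NaN, then 0
toy['Type_code'] = toy['Type'].map(type_codes).fillna(0).astype(int)

# Row-wise version, in the style of the Stores function above
def store_code(row):
    return type_codes.get(row['Type'], 0)

toy['Type_code_apply'] = toy.apply(store_code, axis=1)

print(toy)
```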

Hypothesis Testing: Is there significant difference in the means of the markdowns in Store A and those of the general population?

Null Hypothesis: (H0: pMV = pGP) The null hypothesis would be that there is no difference in the Markdowns in Store A from the general population.

Alternate Hypothesis: (HA: pMV != pGP) The alternative hypothesis would be that there is a difference.

#### 3.4.1.1 Mean Estimation - Store A Markdowns
Let us estimate the means of the Markdowns for store A and the means for the entire population

In [46]:
# Let's compare the means of the Markdowns for Store A and the means for the entire population
MarkDown1_for_Store_A = df_master[df_master['Store']==1]['MarkDown1'].mean()
MarkDown1_for_population = df_master['MarkDown1'].mean()
MarkDown2_for_Store_A = df_master[df_master['Store']==1]['MarkDown2'].mean()
MarkDown2_for_population = df_master['MarkDown2'].mean()
MarkDown3_for_Store_A = df_master[df_master['Store']==1]['MarkDown3'].mean()
MarkDown3_for_population = df_master['MarkDown3'].mean()
MarkDown4_for_Store_A = df_master[df_master['Store']==1]['MarkDown4'].mean()
MarkDown4_for_population = df_master['MarkDown4'].mean()
MarkDown5_for_Store_A = df_master[df_master['Store']==1]['MarkDown5'].mean()
MarkDown5_for_population = df_master['MarkDown5'].mean()

print('The mean of MarkDown1 for Store A is: ' + str(MarkDown1_for_Store_A))
print('The mean of MarkDown1 for the entire population is: ' + str(MarkDown1_for_population))
print('The mean of MarkDown2 for Store A is: ' + str(MarkDown2_for_Store_A))
print('The mean of MarkDown2 for the entire population is: ' + str(MarkDown2_for_population))
print('The mean of MarkDown3 for Store A is: ' + str(MarkDown3_for_Store_A))
print('The mean of MarkDown3 for the entire population is: ' + str(MarkDown3_for_population))
print('The mean of MarkDown4 for Store A is: ' + str(MarkDown4_for_Store_A))
print('The mean of MarkDown4 for the entire population is: ' + str(MarkDown4_for_population))
print('The mean of MarkDown5 for Store A is: ' + str(MarkDown5_for_Store_A))
print('The mean of MarkDown5 for the entire population is: ' + str(MarkDown5_for_population))

The mean of MarkDown1 for Store A is: 9469.89093406594
The mean of MarkDown1 for the entire population is: 8887.617797313967
The mean of MarkDown2 for Store A is: 5361.7748351648315
The mean of MarkDown2 for the entire population is: 6107.224317460187
The mean of MarkDown3 for Store A is: 926.8147802197802
The mean of MarkDown3 for the entire population is: 928.7852197802101
The mean of MarkDown4 for Store A is: 3023.2740109890105
The mean of MarkDown4 for the entire population is: 3130.1765555554607
The mean of MarkDown5 for Store A is: 5501.529285714284
The mean of MarkDown5 for the entire population is: 4544.031686202577


#### 3.4.1.2 T-Test - Store A MarkDowns
Let's conduct a one-sample t-test at the 5% significance level (95% confidence) and see whether it rejects the null hypothesis that the Store A mean equals the population mean. Here the full markdown column is passed as the sample and the Store A mean as the reference value (popmean). To conduct a one-sample t-test, we can use the stats.ttest_1samp() function:

In [47]:
# markdown 1
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['MarkDown1'], # Sample of MarkDown1 
                  popmean = MarkDown1_for_Store_A)  # means of MarkDown1 Store A

Ttest_1sampResult(statistic=-5.7401475649920171, pvalue=9.7967258063166012e-09)

In [48]:
# markdown 2
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['MarkDown2'], # Sample of MarkDown2 
                  popmean = MarkDown2_for_Store_A)  # means of MarkDown2 Store A

Ttest_1sampResult(statistic=7.5289964035617958, pvalue=5.6569836311209549e-14)

In [49]:
# markdown 3
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['MarkDown3'], # Sample of MarkDown3 
                  popmean = MarkDown3_for_Store_A)  # means of MarkDown3 Store A

Ttest_1sampResult(statistic=0.023687403728779081, pvalue=0.9811025306972202)

In [50]:
# markdown 4
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['MarkDown4'], # Sample of MarkDown4 
                  popmean = MarkDown4_for_Store_A)  # means of MarkDown4 Store A

Ttest_1sampResult(statistic=1.8663067035508349, pvalue=0.062034184503743747)

In [51]:
# markdown 5
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['MarkDown5'], # Sample of MarkDown5 
                  popmean = MarkDown5_for_Store_A)  # means of MarkDown5 Store A

Ttest_1sampResult(statistic=-8.9519288448724676, pvalue=4.2650574456801189e-19)

#### 3.4.1.3 T-Test Result - Store A MarkDowns
The test result shows the test statistic "t" for each of the markdowns:
* MarkDown1: -5.74
* MarkDown2: 7.53
* MarkDown3: 0.024
* MarkDown4: 1.866
* MarkDown5: -8.95

The t-statistic measures how far the sample mean lies from the reference value, in units of standard error. If the t-statistic lies inside the quantiles of the t-distribution corresponding to our confidence level and degrees of freedom, we fail to reject the null hypothesis. We can check the quantiles with stats.t.ppf():

#### 3.4.1.4 T-Test Quantile - Store A MarkDowns
If the t-statistic value we calculated above is outside the quantiles, then we can reject the null hypothesis

#### 3.4.1.5 One-Sample T-Test Summary - Store A MarkDowns
For MarkDown1, MarkDown2 and MarkDown5, reject the null hypothesis because the t-statistic is outside the quantiles (±1.964, as computed in Section 3.3) and the P-value is lower than the significance level of 5%.

For MarkDown3 (P-value = 0.981) and MarkDown4 (P-value = 0.062), we fail to reject the null hypothesis because the t-statistic is inside the quantiles and the P-value is above 5%.
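
Because five tests are run for the same store, the chance of at least one false rejection is higher than 5%. One common and conservative safeguard, which the notebook does not apply, is the Bonferroni correction: compare each P-value with α divided by the number of tests. The sketch below applies it to the Store A P-values reported above.

```python
# P-values copied from the Store A t-test outputs above
p_values = {
    'MarkDown1': 9.7967258063166012e-09,
    'MarkDown2': 5.6569836311209549e-14,
    'MarkDown3': 0.9811025306972202,
    'MarkDown4': 0.062034184503743747,
    'MarkDown5': 4.2650574456801189e-19,
}

alpha = 0.05
# Bonferroni: divide the significance level by the number of tests
adjusted_alpha = alpha / len(p_values)

decisions = {name: p < adjusted_alpha for name, p in p_values.items()}

print('Adjusted significance level:', adjusted_alpha)
for name, reject in decisions.items():
    print(name, 'reject H0' if reject else 'fail to reject H0')
```

For Store A the conclusions are unchanged: the three rejections remain, and MarkDown3 and MarkDown4 are still not significant.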

### 3.4.2 Markdowns for Store B
Assign 2 to store B

Hypothesis Testing: Is there significant difference in the means of the markdowns in Store B and those of the general population?

Null Hypothesis: (H0: pMV = pGP) The null hypothesis would be that there is no difference in the Markdowns in Store B from the general population.

Alternate Hypothesis: (HA: pMV != pGP) The alternative hypothesis would be that there is a difference.

#### 3.4.2.1 Mean Estimation - Store B Markdowns
Let us estimate the means of the Markdowns for store B and the means for the entire population

In [52]:
# Let's compare the means of the Markdowns for Store B and the means for the entire population
MarkDown1_for_Store_B = df_master[df_master['Store']==2]['MarkDown1'].mean()
MarkDown1_for_population = df_master['MarkDown1'].mean()
MarkDown2_for_Store_B = df_master[df_master['Store']==2]['MarkDown2'].mean()
MarkDown2_for_population = df_master['MarkDown2'].mean()
MarkDown3_for_Store_B = df_master[df_master['Store']==2]['MarkDown3'].mean()
MarkDown3_for_population = df_master['MarkDown3'].mean()
MarkDown4_for_Store_B = df_master[df_master['Store']==2]['MarkDown4'].mean()
MarkDown4_for_population = df_master['MarkDown4'].mean()
MarkDown5_for_Store_B = df_master[df_master['Store']==2]['MarkDown5'].mean()
MarkDown5_for_population = df_master['MarkDown5'].mean()

print('The mean of MarkDown1 for Store B is: ' + str(MarkDown1_for_Store_B))
print('The mean of MarkDown1 for the entire population is: ' + str(MarkDown1_for_population))
print('The mean of MarkDown2 for Store B is: ' + str(MarkDown2_for_Store_B))
print('The mean of MarkDown2 for the entire population is: ' + str(MarkDown2_for_population))
print('The mean of MarkDown3 for Store B is: ' + str(MarkDown3_for_Store_B))
print('The mean of MarkDown3 for the entire population is: ' + str(MarkDown3_for_population))
print('The mean of MarkDown4 for Store B is: ' + str(MarkDown4_for_Store_B))
print('The mean of MarkDown4 for the entire population is: ' + str(MarkDown4_for_population))
print('The mean of MarkDown5 for Store B is: ' + str(MarkDown5_for_Store_B))
print('The mean of MarkDown5 for the entire population is: ' + str(MarkDown5_for_population))

The mean of MarkDown1 for Store B is: 10598.933736263738
The mean of MarkDown1 for the entire population is: 8887.617797313967
The mean of MarkDown2 for Store B is: 12464.842087912071
The mean of MarkDown2 for the entire population is: 6107.224317460187
The mean of MarkDown3 for Store B is: 1356.4093406593408
The mean of MarkDown3 for the entire population is: 928.7852197802101
The mean of MarkDown4 for Store B is: 4577.387802197806
The mean of MarkDown4 for the entire population is: 3130.1765555554607
The mean of MarkDown5 for Store B is: 5367.723681318676
The mean of MarkDown5 for the entire population is: 4544.031686202577


#### 3.4.2.2 T-Test - Store B MarkDowns
Let's conduct a one-sample t-test at the 5% significance level (95% confidence) and see whether it rejects the null hypothesis that the Store B mean equals the population mean. Here the full markdown column is passed as the sample and the Store B mean as the reference value (popmean). To conduct a one-sample t-test, we can use the stats.ttest_1samp() function:

In [53]:
# markdown 1
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['MarkDown1'], # Sample of MarkDown1 
                  popmean = MarkDown1_for_Store_B)  # means of MarkDown1 Store B

Ttest_1sampResult(statistic=-16.870443439462662, pvalue=8.4676092868986007e-63)

In [54]:
# markdown 2
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['MarkDown2'], # Sample of MarkDown2 
                  popmean = MarkDown2_for_Store_B)  # means of MarkDown2 Store B

Ttest_1sampResult(statistic=-64.21156961777622, pvalue=0.0)

In [55]:
# markdown 3
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['MarkDown3'], # Sample of MarkDown3 
                  popmean = MarkDown3_for_Store_B)  # means of MarkDown3 Store B

Ttest_1sampResult(statistic=-5.1406322725106941, pvalue=2.8014733958824994e-07)

In [56]:
# markdown 4
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['MarkDown4'], # Sample of MarkDown4 
                  popmean = MarkDown4_for_Store_B)  # means of MarkDown4 Store B

Ttest_1sampResult(statistic=-25.265442109110051, pvalue=1.0913592773713057e-135)

In [57]:
# markdown 5
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['MarkDown5'], # Sample of MarkDown5 
                  popmean = MarkDown5_for_Store_B)  # means of MarkDown5 Store B

Ttest_1sampResult(statistic=-7.7009405915278952, pvalue=1.5083947683751122e-14)

#### 3.4.2.3 T-Test Result - Store B MarkDowns
The t-statistic measures how far the sample mean lies from the reference value, in units of standard error. If the t-statistic lies inside the quantiles of the t-distribution corresponding to our confidence level and degrees of freedom, we fail to reject the null hypothesis. We can check the quantiles with stats.t.ppf():

#### 3.4.2.4 T-Test Quantile - Store B MarkDowns
If the t-statistic value we calculated above is outside the quantiles, then we can reject the null hypothesis

In [58]:
# Degrees of freedom: this reuses the holiday-row count from Section 3.3.
# (Strictly, a one-sample t-test has n - 1 degrees of freedom, where n is the size of the tested sample.)
degree_freedom = len(df_master[df_master['IsHoliday']==1])

LQ = stats.t.ppf(0.025,degree_freedom)  # Lower critical value (2.5% quantile)

RQ = stats.t.ppf(0.975,degree_freedom)  # Upper critical value (97.5% quantile)

print ('The t-distribution lower quantile is: ' + str(LQ))
print ('The t-distribution upper quantile is: ' + str(RQ))

The t-distribution lower quantile is: -1.96402740931
The t-distribution upper quantile is: 1.96402740931


#### 3.4.2.5 One-Sample T-Test Summary - Store B MarkDowns
For all the markdowns, reject the null hypothesis because the t-statistics are outside the quantiles and the P-values are lower than the significance level of 5%.

### 3.4.3 Markdowns for Store C
For this analysis, 3 has already been assigned to store C:

Hypothesis Testing: Is there significant difference in the means of the markdowns for Store C and those of the general population?

Null Hypothesis: (H0: pMV = pGP) The null hypothesis would be that there is no difference in the Markdowns for Store C from the general population.

Alternate Hypothesis: (HA: pMV != pGP) The alternative hypothesis would be that there is a difference.

#### 3.4.3.1 Mean Estimation - Store C Markdowns
Let us estimate the means of the Markdowns for store C and the means for the entire population

In [59]:
# Let's compare the means of the Markdowns for Store C and the means for the entire population
MarkDown1_for_Store_C = df_master[df_master['Store']==3]['MarkDown1'].mean()
MarkDown1_for_population = df_master['MarkDown1'].mean()
MarkDown2_for_Store_C = df_master[df_master['Store']==3]['MarkDown2'].mean()
MarkDown2_for_population = df_master['MarkDown2'].mean()
MarkDown3_for_Store_C = df_master[df_master['Store']==3]['MarkDown3'].mean()
MarkDown3_for_population = df_master['MarkDown3'].mean()
MarkDown4_for_Store_C = df_master[df_master['Store']==3]['MarkDown4'].mean()
MarkDown4_for_population = df_master['MarkDown4'].mean()
MarkDown5_for_Store_C = df_master[df_master['Store']==3]['MarkDown5'].mean()
MarkDown5_for_population = df_master['MarkDown5'].mean()

print('The mean of MarkDown1 for Store C is: ' + str(MarkDown1_for_Store_C))
print('The mean of MarkDown1 for the entire population is: ' + str(MarkDown1_for_population))
print('The mean of MarkDown2 for Store C is: ' + str(MarkDown2_for_Store_C))
print('The mean of MarkDown2 for the entire population is: ' + str(MarkDown2_for_population))
print('The mean of MarkDown3 for Store C is: ' + str(MarkDown3_for_Store_C))
print('The mean of MarkDown3 for the entire population is: ' + str(MarkDown3_for_population))
print('The mean of MarkDown4 for Store C is: ' + str(MarkDown4_for_Store_C))
print('The mean of MarkDown4 for the entire population is: ' + str(MarkDown4_for_population))
print('The mean of MarkDown5 for Store C is: ' + str(MarkDown5_for_Store_C))
print('The mean of MarkDown5 for the entire population is: ' + str(MarkDown5_for_population))

The mean of MarkDown1 for Store C is: 6037.801373626369
The mean of MarkDown1 for the entire population is: 8887.617797313967
The mean of MarkDown2 for Store C is: 1762.144835164834
The mean of MarkDown2 for the entire population is: 6107.224317460187
The mean of MarkDown3 for Store C is: 646.6121428571429
The mean of MarkDown3 for the entire population is: 928.7852197802101
The mean of MarkDown4 for Store C is: 663.3780769230764
The mean of MarkDown4 for the entire population is: 3130.1765555554607
The mean of MarkDown5 for Store C is: 1078.5266483516493
The mean of MarkDown5 for the entire population is: 4544.031686202577


#### 3.4.3.2 T-Test - Store C MarkDowns
Let's conduct a one-sample t-test at the 5% significance level (95% confidence) and see whether it rejects the null hypothesis that the Store C mean equals the population mean. Here the full markdown column is passed as the sample and the Store C mean as the reference value (popmean). To conduct a one-sample t-test, we can use the stats.ttest_1samp() function:

In [60]:
# markdown 1
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['MarkDown1'], # Sample of MarkDown1 
                  popmean = MarkDown1_for_Store_C)  # means of MarkDown1 Store C

Ttest_1sampResult(statistic=28.093974756158012, pvalue=7.0617790617063055e-166)

In [61]:
# markdown 2
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['MarkDown2'], # Sample of MarkDown2 
                  popmean = MarkDown2_for_Store_C)  # means of MarkDown2 Store C

Ttest_1sampResult(statistic=43.885049989779773, pvalue=0.0)

In [62]:
# markdown 3
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['MarkDown3'], # Sample of MarkDown3 
                  popmean = MarkDown3_for_Store_C)  # means of MarkDown3 Store C

Ttest_1sampResult(statistic=3.3921099274810258, pvalue=0.00069684929907546408)

In [63]:
# markdown 4
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['MarkDown4'], # Sample of MarkDown4 
                  popmean = MarkDown4_for_Store_C)  # means of MarkDown4 Store C

Ttest_1sampResult(statistic=43.065415848123415, pvalue=0.0)

In [64]:
# markdown 5
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['MarkDown5'], # Sample of MarkDown5 
                  popmean = MarkDown5_for_Store_C)  # means of MarkDown5 Store C

Ttest_1sampResult(statistic=32.400033719367876, pvalue=8.616019228573662e-217)

#### 3.4.3.3 T-Test Result - Store C MarkDowns
The t-statistic measures how far the sample mean lies from the reference value, in units of standard error. If the t-statistic lies inside the quantiles of the t-distribution corresponding to our confidence level and degrees of freedom, we fail to reject the null hypothesis. We can check the quantiles with stats.t.ppf():

#### 3.4.3.4 T-Test Quantile - Store C MarkDowns
If the t-statistic value we calculated above is outside the quantiles, then we can reject the null hypothesis

In [65]:
# Degrees of freedom: this reuses the holiday-row count from Section 3.3.
# (Strictly, a one-sample t-test has n - 1 degrees of freedom, where n is the size of the tested sample.)
degree_freedom = len(df_master[df_master['IsHoliday']==1])

LQ = stats.t.ppf(0.025,degree_freedom)  # Lower critical value (2.5% quantile)

RQ = stats.t.ppf(0.975,degree_freedom)  # Upper critical value (97.5% quantile)

print ('The t-distribution lower quantile is: ' + str(LQ))
print ('The t-distribution upper quantile is: ' + str(RQ))

The t-distribution lower quantile is: -1.96402740931
The t-distribution upper quantile is: 1.96402740931


#### 3.4.3.5 One-Sample T-Test Summary - Store C MarkDowns
For all the markdowns, reject the null hypothesis because the t-statistics are outside the quantiles and the P-values are lower than the significance level of 5%.
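
Section 3.4 compares each store group with the population one at a time. A different technique that addresses the same question in a single test, and which is not used in this notebook, is a one-way ANOVA across the three groups. The sketch below runs `stats.f_oneway()` on made-up samples; the group means, spreads and sizes are assumptions for illustration only.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(7)

# Made-up MarkDown1 values for three store groups
store_a = rng.normal(loc=9000.0, scale=1500.0, size=100)
store_b = rng.normal(loc=10500.0, scale=1500.0, size=100)
store_c = rng.normal(loc=6000.0, scale=1500.0, size=100)

# Null hypothesis: all three groups share the same mean
anova = stats.f_oneway(store_a, store_b, store_c)

print('F-statistic:', anova.statistic)
print('P-value:', anova.pvalue)
```

A significant result only says that at least one group mean differs; pairwise follow-up tests would be needed to say which.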

## 3.5 Test Hypothesis that the difference in Sales of Selected Gift Items for each Store Type is statistically significant

### 3.5.1 Sales of Selected Gift Items for Store A
Hypothesis Testing: Is there significant difference in the means of sales of selected gift items for Store A and those of the general population?

Null Hypothesis: (H0: pMV = pGP) The null hypothesis would be that there is no difference in the sales of selected gift items for Store A from the general population.

Alternate Hypothesis: (HA: pMV != pGP) The alternative hypothesis would be that there is a difference.

#### 3.5.1.1 Mean Estimation - Store A (Sales of Selected Gift Items)
Let us estimate the means of sales of selected gift items for store A and the means for the entire population

In [66]:
# Let's compare the means of sales of selected gift items for Store A and the means for the entire population
Jewelry_sale_for_Store_A = df_master[df_master['Store']==1]['Jewelry'].mean()
Jewelry_sale_for_population = df_master['Jewelry'].mean()
Cell_Phones_sale_for_Store_A = df_master[df_master['Store']==1]['Cell_Phones'].mean()
Cell_Phones_sale_for_population = df_master['Cell_Phones'].mean()
Floral_sale_for_Store_A = df_master[df_master['Store']==1]['Floral'].mean()
Floral_sale_for_population = df_master['Floral'].mean()
Clearance_Clothings_sale_for_Store_A = df_master[df_master['Store']==1]['Clearance_Clothings'].mean()
Clearance_Clothings_sale_for_population = df_master['Clearance_Clothings'].mean()
Video_Games_sale_for_Store_A = df_master[df_master['Store']==1]['Video_Games'].mean()
Video_Games_sale_for_population = df_master['Video_Games'].mean()

print('The mean of Jewelry sale for Store A is: ' + str(Jewelry_sale_for_Store_A))
print('The mean of Jewelry sale for the entire population is: ' + str(Jewelry_sale_for_population))
print('The mean of Cell Phones sale for Store A is: ' + str(Cell_Phones_sale_for_Store_A))
print('The mean of Cell Phones sale for the entire population is: ' + str(Cell_Phones_sale_for_population))
print('The mean of Floral sale for Store A is: ' + str(Floral_sale_for_Store_A))
print('The mean of Floral sale for the entire population is: ' + str(Floral_sale_for_population))
print('The mean of Clearance Clothings sale for Store A is: ' + str(Clearance_Clothings_sale_for_Store_A))
print('The mean of Clearance Clothings sale for the entire population is: ' + str(Clearance_Clothings_sale_for_population))
print('The mean of Video Games sale for Store A is: ' + str(Video_Games_sale_for_Store_A))
print('The mean of Video Games sale for the entire population is: ' + str(Video_Games_sale_for_population))

The mean of Jewelry sale for Store A is: 24342.963956043957
The mean of Jewelry sale for the entire population is: 18548.270426130683
The mean of Cell Phones sale for Store A is: 38867.23269230766
The mean of Cell Phones sale for the entire population is: 25469.003758239905
The mean of Floral sale for Store A is: 10055.947252747248
The mean of Floral sale for the entire population is: 10297.530998779794
The mean of Clearance Clothings sale for Store A is: 7485.268736263736
The mean of Clearance Clothings sale for the entire population is: 7607.392772894041
The mean of Video Games sale for Store A is: 64384.87785714291
The mean of Video Games sale for the entire population is: 24108.858239316476
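
The near-identical assignments above could also be produced in one step. The sketch below uses a small made-up frame (`toy`) that borrows two of the column names; its values are illustrative and do not come from `master_dataset.xlsx`.

```python
import pandas as pd

# Toy frame with hypothetical values; column names mirror those used above
toy = pd.DataFrame({
    'Store':       [1, 1, 2, 2, 3, 3],
    'Jewelry':     [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    'Cell_Phones': [5.0, 15.0, 25.0, 35.0, 45.0, 55.0],
})
items = ['Jewelry', 'Cell_Phones']

# Mean per item for store 1, next to the mean over all rows
comparison = pd.DataFrame({
    'store_mean': toy.loc[toy['Store'] == 1, items].mean(),
    'population_mean': toy[items].mean(),
})
print(comparison)
```

Keeping the item names in a list means a new gift item only needs to be added in one place.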


#### 3.5.1.2 T-Test - Store A - Selected Gift Items
Let's conduct a one-sample t-test at the 95% confidence level to see whether it rejects the null hypothesis that the mean over the whole dataset equals the Store A mean. To conduct a one-sample t-test, we can use the stats.ttest_1samp() function:

In [67]:
# Jewelry Sale
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['Jewelry'], # Sample of Jewelry Sale 
                  popmean = Jewelry_sale_for_Store_A)  # means of Jewelry Sale Store A

Ttest_1sampResult(statistic=-38.998204161385814, pvalue=2.6785830883097366e-305)

In [68]:
# Cell Phone Sale
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['Cell_Phones'], # Sample of Cell Phone Sale 
                  popmean = Cell_Phones_sale_for_Store_A)  # means of Cell Phone Sale Store A

Ttest_1sampResult(statistic=-102.80516300446031, pvalue=0.0)

In [69]:
# Floral Sale
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['Floral'], # Sample of Floral Sale 
                  popmean = Floral_sale_for_Store_A)  # means of Floral Sale Store A

Ttest_1sampResult(statistic=3.3137089940174946, pvalue=0.00092466839545339748)

In [70]:
# Clearance Clothing Sale
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['Clearance_Clothings'], # Sample of Clearance Clothing Sale 
                  popmean = Clearance_Clothings_sale_for_Store_A)  # means of Clearance Clothing Sale Store A

Ttest_1sampResult(statistic=2.1929091427355707, pvalue=0.028341953506761615)

In [71]:
# Video Games Sale
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['Video_Games'], # Sample of Video Games Sale 
                  popmean = Video_Games_sale_for_Store_A)  # means of Video Games Sale Store A

Ttest_1sampResult(statistic=-143.50640139213462, pvalue=0.0)

#### 3.5.1.3 T-Test Results - Store A - Selected Gift Items
| Items    | t-test statistic    | pvalue    |  
| ------   | :---------------:   | ------:   |
| Jewelry Sales  | -38.998   |2.68e-305     |  
| Cell Phone Sales  | -102.80   | 0.0  |    
| Floral Sales  | 3.31   | 0.0009  |    
| Clearance Clothings Sales  | 2.19   | 0.028   |    
| Video Games Sales  |   -143.50  |  0.0   |   
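
A results table like the one above can be assembled in a loop instead of being copied by hand. The following is a minimal sketch on synthetic data: the frame `toy`, its values, and the two-item list are assumptions made for the example.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Synthetic data (hypothetical); column names mirror those used above
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    'Store': np.repeat([1, 2, 3], 50),
    'Jewelry': rng.normal(100.0, 10.0, 150),
    'Floral': rng.normal(50.0, 5.0, 150),
})
items = ['Jewelry', 'Floral']

rows = []
for item in items:
    # Reference value: mean of the item for store 1
    store_mean = toy.loc[toy['Store'] == 1, item].mean()
    # Sample: the item's column over all rows, as in the cells above
    result = stats.ttest_1samp(a=toy[item], popmean=store_mean)
    rows.append({'item': item,
                 't_statistic': result.statistic,
                 'pvalue': result.pvalue})

results = pd.DataFrame(rows).set_index('item')
print(results.round(4))
```

Building the table from the test results avoids transcription slips between the cell outputs and the summary.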

In [72]:
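# NOTE: a one-sample t-test on n observations has n - 1 degrees of freedom;
# the holiday-row count is kept here because it produced the recorded output.
degree_freedom = len(df_master[df_master['IsHoliday']==1])

LQ = stats.t.ppf(0.025,degree_freedom)  # Lower (2.5%) quantile

RQ = stats.t.ppf(0.975,degree_freedom)  # Upper (97.5%) quantile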

print ('The t-distribution left quartile range is: ' + str(LQ))
print ('The t-distribution right quartile range is: ' + str(RQ))

The t-distribution left quartile range is: -1.96402740931
The t-distribution right quartile range is: 1.96402740931


#### 3.5.1.4 T-Test Summary - Store A - Selected Gift Items
For all the selected gift items we reject the null hypothesis that there is no significant difference between the Store A mean sales and those of the general population, because the t-statistics fall outside the quantiles and the p-values are below the 5% significance level.

### 3.5.2 Sales of Selected Gift Items for Store B
Hypothesis Testing: Is there a significant difference between the mean sales of selected gift items for Store B and those of the general population?

Null Hypothesis: (H0: pMV = pGP) The null hypothesis would be that there is no difference in the sales of selected gift items for Store B from the general population.

Alternate Hypothesis: (HA: pMV != pGP) The alternative hypothesis would be that there is a difference.

#### 3.5.2.1 Mean Estimation - Store B (Sales of Selected Gift Items)
Let us estimate the means of sales of selected gift items for store B and the means for the entire population

In [73]:
# Let's compare the means of sales of selected gift items for Store B and the means for the entire population
Jewelry_sale_for_Store_B = df_master[df_master['Store']==2]['Jewelry'].mean()
Jewelry_sale_for_population = df_master['Jewelry'].mean()
Cell_Phones_sale_for_Store_B = df_master[df_master['Store']==2]['Cell_Phones'].mean()
Cell_Phones_sale_for_population = df_master['Cell_Phones'].mean()
Floral_sale_for_Store_B = df_master[df_master['Store']==2]['Floral'].mean()
Floral_sale_for_population = df_master['Floral'].mean()
Clearance_Clothings_sale_for_Store_B = df_master[df_master['Store']==2]['Clearance_Clothings'].mean()
Clearance_Clothings_sale_for_population = df_master['Clearance_Clothings'].mean()
Video_Games_sale_for_Store_B = df_master[df_master['Store']==2]['Video_Games'].mean()
Video_Games_sale_for_population = df_master['Video_Games'].mean()

print('The mean of Jewelry sale for Store B is: ' + str(Jewelry_sale_for_Store_B))
print('The mean of Jewelry sale for the entire population is: ' + str(Jewelry_sale_for_population))
print('The mean of Cell Phones sale for Store B is: ' + str(Cell_Phones_sale_for_Store_B))
print('The mean of Cell Phones sale for the entire population is: ' + str(Cell_Phones_sale_for_population))
print('The mean of Floral sale for Store B is: ' + str(Floral_sale_for_Store_B))
print('The mean of Floral sale for the entire population is: ' + str(Floral_sale_for_population))
print('The mean of Clearance Clothings sale for Store B is: ' + str(Clearance_Clothings_sale_for_Store_B))
print('The mean of Clearance Clothings sale for the entire population is: ' + str(Clearance_Clothings_sale_for_population))
print('The mean of Video Games sale for Store B is: ' + str(Video_Games_sale_for_Store_B))
print('The mean of Video Games sale for the entire population is: ' + str(Video_Games_sale_for_population))

The mean of Jewelry sale for Store B is: 20354.11967032968
The mean of Jewelry sale for the entire population is: 18548.270426130683
The mean of Cell Phones sale for Store B is: 29440.757142857132
The mean of Cell Phones sale for the entire population is: 25469.003758239905
The mean of Floral sale for Store B is: 9130.248351648353
The mean of Floral sale for the entire population is: 10297.530998779794
The mean of Clearance Clothings sale for Store B is: 6417.711263736265
The mean of Clearance Clothings sale for the entire population is: 7607.392772894041
The mean of Video Games sale for Store B is: 40636.21818681316
The mean of Video Games sale for the entire population is: 24108.858239316476


#### 3.5.2.2 T-Test - Store B - Selected Gift Items
Let's conduct a one-sample t-test at the 95% confidence level to see whether it rejects the null hypothesis that the mean over the whole dataset equals the Store B mean. To conduct a one-sample t-test, we can use the stats.ttest_1samp() function:

In [74]:
# Jewelry Sale
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['Jewelry'], # Sample of Jewelry Sale 
                  popmean = Jewelry_sale_for_Store_B)  # means of Jewelry Sale Store B 

Ttest_1sampResult(statistic=-12.153339455562994, pvalue=1.0729987820738569e-33)

In [75]:
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['Cell_Phones'], # Sample of Cell Phone Sale 
                  popmean = Cell_Phones_sale_for_Store_B)  # means of Cell Phone Sale Store B

Ttest_1sampResult(statistic=-30.475427470919499, pvalue=2.5363389345099327e-193)

In [76]:
# Floral Sale
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['Floral'], # Sample of Floral Sale 
                  popmean = Floral_sale_for_Store_B)  # means of Floral Sale Store B

Ttest_1sampResult(statistic=16.011155840960438, pvalue=7.741865209723941e-57)

In [77]:
# Clearance Clothing Sale
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['Clearance_Clothings'], # Sample of Clearance Clothing Sale 
                  popmean = Clearance_Clothings_sale_for_Store_B)  # means of Clearance Clothing Sale Store B

Ttest_1sampResult(statistic=21.36240768292198, pvalue=1.4129529453060602e-98)

In [78]:
# Video Games Sale
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['Video_Games'], # Sample of Video Games Sale 
                  popmean = Video_Games_sale_for_Store_B)  # means of Video Games Sale Store B

Ttest_1sampResult(statistic=-58.88819136258445, pvalue=0.0)

#### 3.5.2.3 T-Test Results - Store B - Selected Gift Items
| Items    | t-test statistic    | pvalue    |  
| ------   | :---------------:   | ------:   |
| Jewelry Sales  | -12.15   | 1.07e-33     |  
| Cell Phone Sales  | -30.48   | 2.54e-193  |    
| Floral Sales  | 16.01   | 7.74e-57  |    
| Clearance Clothings Sales  | 21.36   |1.41e-98   |    
| Video Games Sales  |   -58.89  |  0.0   |   

In [79]:
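# NOTE: a one-sample t-test on n observations has n - 1 degrees of freedom;
# the holiday-row count is kept here because it produced the recorded output.
degree_freedom = len(df_master[df_master['IsHoliday']==1])

LQ = stats.t.ppf(0.025,degree_freedom)  # Lower (2.5%) quantile

RQ = stats.t.ppf(0.975,degree_freedom)  # Upper (97.5%) quantile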

print ('The t-distribution left quartile range is: ' + str(LQ))
print ('The t-distribution right quartile range is: ' + str(RQ))

The t-distribution left quartile range is: -1.96402740931
The t-distribution right quartile range is: 1.96402740931


#### 3.5.2.4 T-Test Summary - Store B - Selected Gift Items
For all the selected gift items we reject the null hypothesis that there is no significant difference between the Store B mean sales and those of the general population, because the t-statistics fall outside the quantiles and the p-values are below the 5% significance level.

### 3.5.3 Sales of Selected Gift Items for Store C
Hypothesis Testing: Is there a significant difference between the mean sales of selected gift items for Store C and those of the general population?

Null Hypothesis: (H0: pMV = pGP) The null hypothesis would be that there is no difference in the sales of selected gift items for Store C from the general population.

Alternate Hypothesis: (HA: pMV != pGP) The alternative hypothesis would be that there is a difference.

#### 3.5.3.1 Mean Estimation - Store C (Sales of Selected Gift Items)
Let us estimate the means of sales of selected gift items for store C and the means for the entire population

In [80]:
# Let's compare the means of sales of selected gift items for Store C and the means for the entire population
Jewelry_sale_for_Store_C = df_master[df_master['Store']==3]['Jewelry'].mean()
Jewelry_sale_for_population = df_master['Jewelry'].mean()
Cell_Phones_sale_for_Store_C = df_master[df_master['Store']==3]['Cell_Phones'].mean()
Cell_Phones_sale_for_population = df_master['Cell_Phones'].mean()
Floral_sale_for_Store_C = df_master[df_master['Store']==3]['Floral'].mean()
Floral_sale_for_population = df_master['Floral'].mean()
Clearance_Clothings_sale_for_Store_C = df_master[df_master['Store']==3]['Clearance_Clothings'].mean()
Clearance_Clothings_sale_for_population = df_master['Clearance_Clothings'].mean()
Video_Games_sale_for_Store_C = df_master[df_master['Store']==3]['Video_Games'].mean()
Video_Games_sale_for_population = df_master['Video_Games'].mean()

print('The mean of Jewelry sale for Store C is: ' + str(Jewelry_sale_for_Store_C))
print('The mean of Jewelry sale for the entire population is: ' + str(Jewelry_sale_for_population))
print('The mean of Cell Phones sale for Store C is: ' + str(Cell_Phones_sale_for_Store_C))
print('The mean of Cell Phones sale for the entire population is: ' + str(Cell_Phones_sale_for_population))
print('The mean of Floral sale for Store C is: ' + str(Floral_sale_for_Store_C))
print('The mean of Floral sale for the entire population is: ' + str(Floral_sale_for_population))
print('The mean of Clearance Clothings sale for Store C is: ' + str(Clearance_Clothings_sale_for_Store_C))
print('The mean of Clearance Clothings sale for the entire population is: ' + str(Clearance_Clothings_sale_for_population))
print('The mean of Video Games sale for Store C is: ' + str(Video_Games_sale_for_Store_C))
print('The mean of Video Games sale for the entire population is: ' + str(Video_Games_sale_for_population))

The mean of Jewelry sale for Store C is: 26851.18153846154
The mean of Jewelry sale for the entire population is: 18548.270426130683
The mean of Cell Phones sale for Store C is: 39208.43071428571
The mean of Cell Phones sale for the entire population is: 25469.003758239905
The mean of Floral sale for Store C is: 11191.472252747253
The mean of Floral sale for the entire population is: 10297.530998779794
The mean of Clearance Clothings sale for Store C is: 12887.638131868132
The mean of Clearance Clothings sale for the entire population is: 7607.392772894041
The mean of Video Games sale for Store C is: 42788.43835164836
The mean of Video Games sale for the entire population is: 24108.858239316476


#### 3.5.3.2 T-Test - Store C - Selected Gift Items
Let's conduct a one-sample t-test at the 95% confidence level to see whether it rejects the null hypothesis that the mean over the whole dataset equals the Store C mean. To conduct a one-sample t-test, we can use the stats.ttest_1samp() function:

In [81]:
# Jewelry Sale
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['Jewelry'], # Sample of Jewelry Sale 
                  popmean = Jewelry_sale_for_Store_C)  # means of Jewelry Sale Store C

Ttest_1sampResult(statistic=-55.878472437064779, pvalue=0.0)

In [82]:
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['Cell_Phones'], # Sample of Cell Phone Sale 
                  popmean = Cell_Phones_sale_for_Store_C)  # means of Cell Phone Sale Store C

Ttest_1sampResult(statistic=-105.42318949429487, pvalue=0.0)

In [83]:
# Floral Sale
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['Floral'], # Sample of Floral Sale 
                  popmean = Floral_sale_for_Store_C)  # means of Floral Sale Store C

Ttest_1sampResult(statistic=-12.261839722480998, pvalue=2.8967101621342553e-34)

In [84]:
# Clearance Clothing Sale
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['Clearance_Clothings'], # Sample of Clearance Clothing Sale 
                  popmean = Clearance_Clothings_sale_for_Store_C)  # means of Clearance Clothing Sale Store C

Ttest_1sampResult(statistic=-94.814244952119338, pvalue=0.0)

In [85]:
# Video Games Sale
import scipy.stats as stats
stats.ttest_1samp(a=  df_master['Video_Games'], # Sample of Video Games Sale 
                  popmean = Video_Games_sale_for_Store_C)  # means of Video Games Sale Store C

Ttest_1sampResult(statistic=-66.556709100677466, pvalue=0.0)

#### 3.5.3.3 T-Test Results - Store C - Selected Gift Items
| Items    | t-test statistic    | pvalue    |  
| ------   | :---------------:   | ------:   |
| Jewelry Sales  | -55.88   |0.0     |  
| Cell Phone Sales  | -105.42   | 0.0  |    
| Floral Sales  | -12.26   | 2.90e-34  |    
| Clearance Clothings Sales  | -94.81   | 0.0   |    
| Video Games Sales  |   -66.56  |  0.0   |   

In [86]:
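# NOTE: a one-sample t-test on n observations has n - 1 degrees of freedom;
# the holiday-row count is kept here because it produced the recorded output.
degree_freedom = len(df_master[df_master['IsHoliday']==1])

LQ = stats.t.ppf(0.025,degree_freedom)  # Lower (2.5%) quantile

RQ = stats.t.ppf(0.975,degree_freedom)  # Upper (97.5%) quantile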

print ('The t-distribution left quartile range is: ' + str(LQ))
print ('The t-distribution right quartile range is: ' + str(RQ))

The t-distribution left quartile range is: -1.96402740931
The t-distribution right quartile range is: 1.96402740931


#### 3.5.3.4 T-Test Summary - Store C - Selected Gift Items
For all the selected gift items we reject the null hypothesis that there is no significant difference between the Store C mean sales and those of the general population, because the t-statistics fall outside the quantiles and the p-values are below the 5% significance level.

## 3.6 Test Hypothesis that the difference in Store Size for each Store Type is statistically significant

Hypothesis Testing: Is there a significant difference between the mean store size for each store type and that of the general population?

Null Hypothesis: (H0: pMV = pGP) The null hypothesis would be that there is no difference between the mean store size for a store type and that of the general population.

Alternate Hypothesis: (HA: pMV != pGP) The alternative hypothesis would be that there is a difference.

### 3.6.1 Mean Estimation - Store Sizes w.r.t Store Type
Let us estimate the means of the store sizes with respect to store type and the means for the entire population

In [87]:
# Let's compare the means of store sizes for each store type and the means for the entire population
Size_for_Store_A = df_master[df_master['Store']==1]['Size'].mean()
Size_for_population = df_master['Size'].mean()
Size_for_Store_B = df_master[df_master['Store']==2]['Size'].mean()
Size_for_population = df_master['Size'].mean()
Size_for_Store_C = df_master[df_master['Store']==3]['Size'].mean()
Size_for_population = df_master['Size'].mean()

print('The mean of store size for Store A is: ' + str(Size_for_Store_A))
print('The mean of store size for the entire population is: ' + str(Size_for_population))
print('The mean of store size for Store B is: ' + str(Size_for_Store_B))
print('The mean of store size for the entire population is: ' + str(Size_for_population))
print('The mean of store size for Store C is: ' + str(Size_for_Store_C))
print('The mean of store size for the entire population is: ' + str(Size_for_population))

The mean of store size for Store A is: 151315.0
The mean of store size for the entire population is: 130287.6
The mean of store size for Store B is: 202307.0
The mean of store size for the entire population is: 130287.6
The mean of store size for Store C is: 37392.0
The mean of store size for the entire population is: 130287.6
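
The per-store means can also be obtained in a single `groupby` call. In the sketch below the three sizes are the store means printed above, but the frame `toy` and its two rows per store are made up, so its overall mean will not match the 130287.6 reported for the full dataset.

```python
import pandas as pd

# Toy frame: sizes taken from the output above, row counts are hypothetical
toy = pd.DataFrame({
    'Store': [1, 1, 2, 2, 3, 3],
    'Size':  [151315, 151315, 202307, 202307, 37392, 37392],
})

# One mean per store, indexed by the store identifier
size_by_store = toy.groupby('Store')['Size'].mean()
print(size_by_store)
print('Mean over all rows of the toy frame:', toy['Size'].mean())
```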


### 3.6.2 T-Test - Store Sizes w.r.t Store Type
Let's conduct a one-sample t-test at the 95% confidence level to see whether it rejects the null hypothesis that the mean store size over the whole dataset equals the mean size for each store type. To conduct a one-sample t-test, we can use the stats.ttest_1samp() function:

In [88]:
# Store Sizes w.r.t Store Type
import scipy.stats as stats
Statistic_storeA = stats.ttest_1samp(a= df_master['Size'], # Sample of population size 
                  popmean = Size_for_Store_A)  # means of size for Store A
Statistic_storeB = stats.ttest_1samp(a= df_master['Size'], # Sample of population size 
                  popmean = Size_for_Store_B)  # means of size for Store B
Statistic_storeC = stats.ttest_1samp(a= df_master['Size'], # Sample of population size 
                  popmean = Size_for_Store_C)  # means of size for Store C
print('The t-test statistic and p value for Store A are: ' + str(Statistic_storeA))
print('The t-test statistic and p value for Store B are: ' + str(Statistic_storeB))
print('The t-test statistic and p value for Store C are: ' + str(Statistic_storeC))

The t-test statistic and p value for Store A are: Ttest_1sampResult(statistic=-30.150065242291078, pvalue=1.8177534113729897e-189)
The t-test statistic and p value for Store B are: Ttest_1sampResult(statistic=-103.26476923969004, pvalue=0.0)
The t-test statistic and p value for Store C are: Ttest_1sampResult(statistic=133.19803688148681, pvalue=0.0)


### 3.6.3 T-Test Results - Store Sizes w.r.t Store Type
| Items    | t-test statistic    | pvalue    |  
| ------   | :---------------:   | ------:   |
| Store A Size  | -30.15   | 1.82e-189     |  
| Store B Size  | -103.26   | 0.0  |    
| Store C Size | 133.20  | 0.0  |    


In [89]:
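# NOTE: a one-sample t-test on n observations has n - 1 degrees of freedom;
# the holiday-row count is kept here because it produced the recorded output.
degree_freedom = len(df_master[df_master['IsHoliday']==1])

LQ = stats.t.ppf(0.025,degree_freedom)  # Lower (2.5%) quantile

RQ = stats.t.ppf(0.975,degree_freedom)  # Upper (97.5%) quantile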

print ('The t-distribution left quartile range is: ' + str(LQ))
print ('The t-distribution right quartile range is: ' + str(RQ))

The t-distribution left quartile range is: -1.96402740931
The t-distribution right quartile range is: 1.96402740931


### 3.6.4 T-Test Summary - Store Sizes w.r.t Store Type
We reject the null hypothesis that there is no significant difference between the mean store size for each store type and that of the general population, because the t-statistics fall outside the quantiles and the p-values are below the 5% significance level.

## 3.7 Test Hypothesis that Store size has impact on Sales of Gift Items

Based on the mean sizes calculated in section 3.6.1, we can categorize the store sizes into three groups:
* Small - <= 37392.0
* Medium - >37392.0 and <=130287.6
* Large - > 130287.6
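
The three categories above can be assigned with `pd.cut`. This is a sketch under the assumption that the boundaries are right-closed, as the list states; the `sizes` series holds made-up values chosen to sit on and around the two thresholds.

```python
import numpy as np
import pandas as pd

# Hypothetical store sizes, including both boundary values
sizes = pd.Series([30000.0, 37392.0, 90000.0, 130287.6, 151315.0], name='Size')

size_category = pd.cut(
    sizes,
    bins=[-np.inf, 37392.0, 130287.6, np.inf],  # intervals are right-closed by default
    labels=['Small', 'Medium', 'Large'],
)
print(size_category.tolist())
# → ['Small', 'Small', 'Medium', 'Medium', 'Large']
```

Because the intervals are right-closed, a size exactly equal to a threshold falls in the lower category, matching the `<=` conditions in the list.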

We will now test the hypothesis that the store size does not have any impact on the sales of selected gift items.

### 3.7.1 Impact of size on sales of Selected Gift Items
Hypothesis Testing: Is there a significant difference between the mean sales of selected gift items for each size category and those of the general population?

Null Hypothesis: (H0: pMV = pGP) The null hypothesis would be that there is no difference.

Alternate Hypothesis: (HA: pMV != pGP) The alternative hypothesis would be that there is a difference.
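
One possible way to test these hypotheses for a single item and a single size category is to use the item's sales column as the sample and the category mean as the reference value. The sketch below does this on synthetic data; the frame `toy`, its values, and the choice of the small-store category are assumptions for illustration, not the analysis run in this notebook.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Synthetic data (hypothetical): three store sizes with different Jewelry sales levels
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    'Size': np.repeat([37392.0, 120000.0, 200000.0], 40),
    'Jewelry': np.concatenate([
        rng.normal(230.0, 20.0, 40),   # small stores
        rng.normal(180.0, 20.0, 40),   # medium stores
        rng.normal(170.0, 20.0, 40),   # large stores
    ]),
})

# Reference value: mean Jewelry sales in the small-store category
small_mean = toy.loc[toy['Size'] <= 37392.0, 'Jewelry'].mean()

# Sample: Jewelry sales over all rows, tested against the small-store mean
result = stats.ttest_1samp(a=toy['Jewelry'], popmean=small_mean)
print(result.statistic, result.pvalue)
```

A negative statistic here means the overall mean lies below the small-store mean, which is the pattern the synthetic data were built to show.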

### 3.7.2 Mean Estimation - Size (Sales of Selected Gift Items)
Let us estimate the means of sales of selected gift items with respect to size and the means for the entire population
    

In [90]:
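import pandas as pd

def Size (c):
    # Map a row's store size to a category code: 1 = small, 2 = medium, 3 = large
    if c['Size'] <= 37392.0:
        return 1
    # `and` is needed here: `&` binds tighter than the comparisons and breaks the test
    elif c['Size'] > 37392.0 and c['Size'] <= 130287.6:
        return 2
    elif c['Size'] > 130287.6:
        return 3
    else:
        return 0  # reached only when Size is missing (NaN)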

In [91]:
# Let's compare the means of sales of selected gift items for different size categories and the means for the entire population
Jewelry_sale_for_small_sized_store = df_master[df_master['Size']<= 37392.0]['Jewelry'].mean()
Jewelry_sale_for_population = df_master['Jewelry'].mean()
Cell_Phones_sale_for_small_sized_store = df_master[df_master['Size']<= 37392.0]['Cell_Phones'].mean()
Cell_Phones_sale_for_population = df_master['Cell_Phones'].mean()
Floral_sale_for_small_sized_store = df_master[df_master['Size']<= 37392.0]['Floral'].mean()
Floral_sale_for_population = df_master['Floral'].mean()
Clearance_Clothings_sale_for_small_sized_store = df_master[df_master['Size']<= 37392.0]['Clearance_Clothings'].mean()
Clearance_Clothings_sale_for_population = df_master['Clearance_Clothings'].mean()
Video_Games_sale_for_small_sized_store = df_master[df_master['Size']<= 37392.0]['Video_Games'].mean()
Video_Games_sale_for_population = df_master['Video_Games'].mean()


Jewelry_sale_for_medium_sized_store = df_master[(df_master['Size'] >37392.0)  & (df_master['Size']<=130287.6)]['Jewelry'].mean()
Jewelry_sale_for_population = df_master['Jewelry'].mean()
Cell_Phones_sale_for_medium_sized_store = df_master[(df_master['Size'] >37392.0)  & (df_master['Size']<=130287.6)]['Cell_Phones'].mean()
Cell_Phones_sale_for_population = df_master['Cell_Phones'].mean()
Floral_sale_for_medium_sized_store = df_master[(df_master['Size'] >37392.0)  & (df_master['Size']<=130287.6)]['Floral'].mean()
Floral_sale_for_population = df_master['Floral'].mean()
Clearance_Clothings_sale_for_medium_sized_store = df_master[(df_master['Size'] >37392.0)  & (df_master['Size']<=130287.6)]['Clearance_Clothings'].mean()
Clearance_Clothings_sale_for_population = df_master['Clearance_Clothings'].mean()
Video_Games_sale_for_medium_sized_store = df_master[(df_master['Size'] >37392.0)  & (df_master['Size']<=130287.6)]['Video_Games'].mean()
Video_Games_sale_for_population = df_master['Video_Games'].mean()

Jewelry_sale_for_large_sized_store = df_master[df_master['Size']>130287.6]['Jewelry'].mean()
Jewelry_sale_for_population = df_master['Jewelry'].mean()
Cell_Phones_sale_for_large_sized_store = df_master[df_master['Size']>130287.6]['Cell_Phones'].mean()
Cell_Phones_sale_for_population = df_master['Cell_Phones'].mean()
Floral_sale_for_large_sized_store = df_master[df_master['Size']>130287.6]['Floral'].mean()
Floral_sale_for_population = df_master['Floral'].mean()
Clearance_Clothings_sale_for_large_sized_store = df_master[df_master['Size']>130287.6]['Clearance_Clothings'].mean()
Clearance_Clothings_sale_for_population = df_master['Clearance_Clothings'].mean()
Video_Games_sale_for_large_sized_store = df_master[df_master['Size']>130287.6]['Video_Games'].mean()
Video_Games_sale_for_population = df_master['Video_Games'].mean()

print('SMALL SIZE STORE MEAN')
print('The mean of Jewelry sale for small sized store is: ' + str(Jewelry_sale_for_small_sized_store))
print('The mean of Jewelry sale for the entire population is: ' + str(Jewelry_sale_for_population))
print('The mean of Cell Phones sale for small sized storeB is: ' + str(Cell_Phones_sale_for_small_sized_store))
print('The mean of Cell Phones sale for the entire population is: ' + str(Cell_Phones_sale_for_population))
print('The mean of Floral sale for small sized store is: ' + str(Floral_sale_for_small_sized_store))
print('The mean of Floral sale for the entire population is: ' + str(Floral_sale_for_population))
print('The mean of Clearance Clothings sale for small sized store is: ' + str(Clearance_Clothings_sale_for_small_sized_store))
print('The mean of Clearance Clothings sale for the entire population is: ' + str(Clearance_Clothings_sale_for_population))
print('The mean of Video Games sale for small sized store is: ' + str(Video_Games_sale_for_small_sized_store))
print('The mean of Video Games sale for the entire population is: ' + str(Video_Games_sale_for_population))

print('MEDIUM SIZE STORE MEAN')
print('The mean of Jewelry sale for medium sized store is: ' + str(Jewelry_sale_for_medium_sized_store))
print('The mean of Jewelry sale for the entire population is: ' + str(Jewelry_sale_for_population))
print('The mean of Cell Phones sale for medium  sized storeB is: ' + str(Cell_Phones_sale_for_medium_sized_store))
print('The mean of Cell Phones sale for the entire population is: ' + str(Cell_Phones_sale_for_population))
print('The mean of Floral sale for medium  sized store is: ' + str(Floral_sale_for_medium_sized_store))
print('The mean of Floral sale for the entire population is: ' + str(Floral_sale_for_population))
print('The mean of Clearance Clothings sale for medium  sized store is: ' + str(Clearance_Clothings_sale_for_medium_sized_store))
print('The mean of Clearance Clothings sale for the entire population is: ' + str(Clearance_Clothings_sale_for_population))
print('The mean of Video Games sale for medium sized store is: ' + str(Video_Games_sale_for_medium_sized_store))
print('The mean of Video Games sale for the entire population is: ' + str(Video_Games_sale_for_population))

print('LARGE SIZE STORE MEAN')
print('The mean of Jewelry sale for large sized store is: ' + str(Jewelry_sale_for_large_sized_store))
print('The mean of Jewelry sale for the entire population is: ' + str(Jewelry_sale_for_population))
print('The mean of Cell Phones sale for large  sized storeB is: ' + str(Cell_Phones_sale_for_large_sized_store))
print('The mean of Cell Phones sale for the entire population is: ' + str(Cell_Phones_sale_for_population))
print('The mean of Floral sale for large sized store is: ' + str(Floral_sale_for_large_sized_store))
print('The mean of Floral sale for the entire population is: ' + str(Floral_sale_for_population))
print('The mean of Clearance Clothings sale for large sized store is: ' + str(Clearance_Clothings_sale_for_large_sized_store))
print('The mean of Clearance Clothings sale for the entire population is: ' + str(Clearance_Clothings_sale_for_population))
print('The mean of Video Games sale for large sized store is: ' + str(Video_Games_sale_for_large_sized_store))
print('The mean of Video Games sale for the entire population is: ' + str(Video_Games_sale_for_population))

SMALL SIZE STORE MEAN
The mean of Jewelry sale for small sized store is: 23217.55057692306
The mean of Jewelry sale for the entire population is: 18548.270426130683
The mean of Cell Phones sale for small sized storeB is: 33792.770219780235
The mean of Cell Phones sale for the entire population is: 25469.003758239905
The mean of Floral sale for small sized store is: 12242.874450549445
The mean of Floral sale for the entire population is: 10297.530998779794
The mean of Clearance Clothings sale for small sized store is: 9923.779368131867
The mean of Clearance Clothings sale for the entire population is: 7607.392772894041
The mean of Video Games sale for small sized store is: 36324.51060439562
The mean of Video Games sale for the entire population is: 24108.858239316476
MEDIUM SIZE STORE MEAN
The mean of Jewelry sale for medium sized store is: 18041.006688311838
The mean of Jewelry sale for the entire population is: 18548.270426130683
The mean of Cell Phones sale for medium  sized storeB i

### 3.7.3 T-Test - Size (Sales of Selected Gift Items)
Let's conduct a one-sample t-test at the 95% confidence level to see whether it rejects the null hypothesis that there is no difference between the reference mean for each size category and the mean of the sample. To conduct a one-sample t-test, we can use the stats.ttest_1samp() function:

In [92]:
# Store sizes w.r.t. sales of selected gift items
# Each test passes the 'Size' column as the sample `a` and the mean sale of a
# gift item within one store-size group as `popmean`.
import scipy.stats as stats

# Small-sized stores
Statistic_Jewelry_Small_Size_Store = stats.ttest_1samp(a=df_master['Size'],  # sample: 'Size' column of the full dataset
                  popmean=Jewelry_sale_for_small_sized_store)  # mean Jewelry sale in small-sized stores
Statistic_Cell_Phone_Small_Size_Store = stats.ttest_1samp(a=df_master['Size'],  # sample: 'Size' column of the full dataset
                  popmean=Cell_Phones_sale_for_small_sized_store)  # mean Cell Phones sale in small-sized stores
Statistic_Floral_Small_Size_Store = stats.ttest_1samp(a=df_master['Size'],  # sample: 'Size' column of the full dataset
                  popmean=Floral_sale_for_small_sized_store)  # mean Floral sale in small-sized stores
Statistic_Clearance_Clothing_Small_Size_Store = stats.ttest_1samp(a=df_master['Size'],  # sample: 'Size' column of the full dataset
                  popmean=Clearance_Clothings_sale_for_small_sized_store)  # mean Clearance Clothings sale in small-sized stores
Statistic_Video_games_Small_Size_Store = stats.ttest_1samp(a=df_master['Size'],  # sample: 'Size' column of the full dataset
                  popmean=Video_Games_sale_for_small_sized_store)  # mean Video Games sale in small-sized stores

# Medium-sized stores
Statistic_Jewelry_Medium_Size_Store = stats.ttest_1samp(a=df_master['Size'],  # sample: 'Size' column of the full dataset
                  popmean=Jewelry_sale_for_medium_sized_store)  # mean Jewelry sale in medium-sized stores
Statistic_Cell_Phone_Medium_Size_Store = stats.ttest_1samp(a=df_master['Size'],  # sample: 'Size' column of the full dataset
                  popmean=Cell_Phones_sale_for_medium_sized_store)  # mean Cell Phones sale in medium-sized stores
Statistic_Floral_Medium_Size_Store = stats.ttest_1samp(a=df_master['Size'],  # sample: 'Size' column of the full dataset
                  popmean=Floral_sale_for_medium_sized_store)  # mean Floral sale in medium-sized stores
Statistic_Clearance_Clothing_Medium_Size_Store = stats.ttest_1samp(a=df_master['Size'],  # sample: 'Size' column of the full dataset
                  popmean=Clearance_Clothings_sale_for_medium_sized_store)  # mean Clearance Clothings sale in medium-sized stores
Statistic_Video_games_Medium_Size_Store = stats.ttest_1samp(a=df_master['Size'],  # sample: 'Size' column of the full dataset
                  popmean=Video_Games_sale_for_medium_sized_store)  # mean Video Games sale in medium-sized stores

# Large-sized stores
Statistic_Jewelry_Large_Size_Store = stats.ttest_1samp(a=df_master['Size'],  # sample: 'Size' column of the full dataset
                  popmean=Jewelry_sale_for_large_sized_store)  # mean Jewelry sale in large-sized stores
Statistic_Cell_Phone_Large_Size_Store = stats.ttest_1samp(a=df_master['Size'],  # sample: 'Size' column of the full dataset
                  popmean=Cell_Phones_sale_for_large_sized_store)  # mean Cell Phones sale in large-sized stores
Statistic_Floral_Large_Size_Store = stats.ttest_1samp(a=df_master['Size'],  # sample: 'Size' column of the full dataset
                  popmean=Floral_sale_for_large_sized_store)  # mean Floral sale in large-sized stores
Statistic_Clearance_Clothing_Large_Size_Store = stats.ttest_1samp(a=df_master['Size'],  # sample: 'Size' column of the full dataset
                  popmean=Clearance_Clothings_sale_for_large_sized_store)  # mean Clearance Clothings sale in large-sized stores
Statistic_Video_games_Large_Size_Store = stats.ttest_1samp(a=df_master['Size'],  # sample: 'Size' column of the full dataset
                  popmean=Video_Games_sale_for_large_sized_store)  # mean Video Games sale in large-sized stores

print('SMALL SIZE STORE TEST STATISTIC')
print('The t-test statistic and p value for Jewelry Sales Small Sized Store are: ' + str(Statistic_Jewelry_Small_Size_Store))
print('The t-test statistic and p value for Cell Phones Sales Small Sized Store are: ' + str(Statistic_Cell_Phone_Small_Size_Store))
print('The t-test statistic and p value for Floral Sales Small Sized Store are: ' + str(Statistic_Floral_Small_Size_Store))
print('The t-test statistic and p value for Clearance Clothing Sales Small Sized Store are: ' + str(Statistic_Clearance_Clothing_Small_Size_Store))
print('The t-test statistic and p value for Video Games Sales Small Sized Store are: ' + str(Statistic_Video_games_Small_Size_Store))

print('MEDIUM SIZE STORE TEST STATISTIC')
print('The t-test statistic and p value for Jewelry Sales Medium Sized Store are: ' + str(Statistic_Jewelry_Medium_Size_Store))
print('The t-test statistic and p value for Cell Phones Sales Medium Sized Store are: ' + str(Statistic_Cell_Phone_Medium_Size_Store))
print('The t-test statistic and p value for Floral Sales Medium Sized Store are: ' + str(Statistic_Floral_Medium_Size_Store))
print('The t-test statistic and p value for Clearance Clothing Sales Medium Sized Store are: ' + str(Statistic_Clearance_Clothing_Medium_Size_Store))
print('The t-test statistic and p value for Video Games Sales Medium Sized Store are: ' + str(Statistic_Video_games_Medium_Size_Store))

print('LARGE SIZE STORE TEST STATISTIC')
print('The t-test statistic and p value for Jewelry Sales Large Sized Store are: ' + str(Statistic_Jewelry_Large_Size_Store))
print('The t-test statistic and p value for Cell Phones Sales Large Sized Store are: ' + str(Statistic_Cell_Phone_Large_Size_Store))
print('The t-test statistic and p value for Floral Sales Large Sized Store are: ' + str(Statistic_Floral_Large_Size_Store))
print('The t-test statistic and p value for Clearance Clothing Sales Large Sized Store are: ' + str(Statistic_Clearance_Clothing_Large_Size_Store))
print('The t-test statistic and p value for Video Games Sales Large Sized Store are: ' + str(Statistic_Video_games_Large_Size_Store))

SMALL SIZE STORE TEST STATISTIC
The t-test statistic and p value for Jewelry Sales Small Sized Store are: Ttest_1sampResult(statistic=153.52202248500055, pvalue=0.0)
The t-test statistic and p value for Cell Phones Sales Small Sized Store are: Ttest_1sampResult(statistic=138.3587801353186, pvalue=0.0)
The t-test statistic and p value for Floral Sales Small Sized Store are: Ttest_1sampResult(statistic=169.25802414108639, pvalue=0.0)
The t-test statistic and p value for Clearance Clothing Sales Small Sized Store are: Ttest_1sampResult(statistic=172.58325065686893, pvalue=0.0)
The t-test statistic and p value for Video Games Sales Small Sized Store are: Ttest_1sampResult(statistic=134.72865288360433, pvalue=0.0)
MEDIUM SIZE STORE TEST STATISTIC
The t-test statistic and p value for Jewelry Sales Medium Sized Store are: Ttest_1sampResult(statistic=160.94439215368101, pvalue=0.0)
The t-test statistic and p value for Cell Phones Sales Medium Sized Store are: Ttest_1sampResult(statistic

## 3.8 Difference in MarkDown for different sizes

### 3.8.1 Mean Estimation - MarkDowns w.r.t Store Size

In [93]:
# Let's compare the means of MarkDown 1 w.r.t Store Size
MarkDown1_for_small_sized_store = df_master[df_master['Size']<= 37392.0]['MarkDown1'].mean()
MarkDown1_for_medium_sized_store = df_master[(df_master['Size'] >37392.0) & (df_master['Size']<=130287.6)]['MarkDown1'].mean()
MarkDown1_for_large_sized_store = df_master[df_master['Size']>130287.6]['MarkDown1'].mean()
MarkDown1_for_population = df_master['MarkDown1'].mean()

print('MARKDOWN 1 MEAN')
print('The mean of MarkDown1 for small sized store is: ' + str(MarkDown1_for_small_sized_store))
print('The mean of MarkDown1 for medium sized store is: ' + str(MarkDown1_for_medium_sized_store))
print('The mean of MarkDown1 for large sized store is: ' + str(MarkDown1_for_large_sized_store))
print('The mean of MarkDown1 for population is: ' + str(MarkDown1_for_population))

MARKDOWN 1 MEAN
The mean of MarkDown1 for small sized store is: 4518.781813186804
The mean of MarkDown1 for medium sized store is: 6027.216341158869
The mean of MarkDown1 for large sized store is: 12300.308464154952
The mean of MarkDown1 for population is: 8887.617797313967


In [94]:
# Let's compare the means of MarkDown 2 w.r.t Store Size
MarkDown2_for_small_sized_store = df_master[df_master['Size']<= 37392.0]['MarkDown2'].mean()
MarkDown2_for_medium_sized_store = df_master[(df_master['Size'] >37392.0) & (df_master['Size']<=130287.6)]['MarkDown2'].mean()
MarkDown2_for_large_sized_store = df_master[df_master['Size']>130287.6]['MarkDown2'].mean()
MarkDown2_for_population = df_master['MarkDown2'].mean()

print('MARKDOWN2 MEAN')
print('The mean of MarkDown2 for small sized store is: ' + str(MarkDown2_for_small_sized_store))
print('The mean of MarkDown2 for medium sized store is: ' + str(MarkDown2_for_medium_sized_store))
print('The mean of MarkDown2 for large sized store is: ' + str(MarkDown2_for_large_sized_store))
print('The mean of MarkDown2 for population is: ' + str(MarkDown2_for_population))

MARKDOWN2 MEAN
The mean of MarkDown2 for small sized store is: 1776.0314285714303
The mean of MarkDown2 for medium sized store is: 3887.289722777181
The mean of MarkDown2 for large sized store is: 8845.364644165227
The mean of MarkDown2 for population is: 6107.224317460187


In [95]:
# Let's compare the means of MarkDown 3 w.r.t Store Size
MarkDown3_for_small_sized_store = df_master[df_master['Size']<= 37392.0]['MarkDown3'].mean()
MarkDown3_for_medium_sized_store = df_master[(df_master['Size'] >37392.0) & (df_master['Size']<=130287.6)]['MarkDown3'].mean()
MarkDown3_for_large_sized_store = df_master[df_master['Size']>130287.6]['MarkDown3'].mean()
MarkDown3_for_population = df_master['MarkDown3'].mean()

print('MARKDOWN3 MEAN')
print('The mean of MarkDown3 for small sized store is: ' + str(MarkDown3_for_small_sized_store))
print('The mean of MarkDown3 for medium sized store is: ' + str(MarkDown3_for_medium_sized_store))
print('The mean of MarkDown3 for large sized store is: ' + str(MarkDown3_for_large_sized_store))
print('The mean of MarkDown3 for population is: ' + str(MarkDown3_for_population))

MARKDOWN3 MEAN
The mean of MarkDown3 for small sized store is: 559.6821978021975
The mean of MarkDown3 for medium sized store is: 703.2078946054048
The mean of MarkDown3 for large sized store is: 1200.2569911041387
The mean of MarkDown3 for population is: 928.7852197802101


In [96]:
# Let's compare the means of MarkDown 4 w.r.t Store Size
MarkDown4_for_small_sized_store = df_master[df_master['Size']<= 37392.0]['MarkDown4'].mean()
MarkDown4_for_medium_sized_store = df_master[(df_master['Size'] >37392.0) & (df_master['Size']<=130287.6)]['MarkDown4'].mean()
MarkDown4_for_large_sized_store = df_master[df_master['Size']>130287.6]['MarkDown4'].mean()
MarkDown4_for_population = df_master['MarkDown4'].mean()

print('MARKDOWN4 MEAN')
print('The mean of MarkDown4 for small sized store is: ' + str(MarkDown4_for_small_sized_store))
print('The mean of MarkDown4 for medium sized store is: ' + str(MarkDown4_for_medium_sized_store))
print('The mean of MarkDown4 for large sized store is: ' + str(MarkDown4_for_large_sized_store))
print('The mean of MarkDown4 for population is: ' + str(MarkDown4_for_population))

MARKDOWN4 MEAN
The mean of MarkDown4 for small sized store is: 776.683653846151
The mean of MarkDown4 for medium sized store is: 2038.2756843157442
The mean of MarkDown4 for large sized store is: 4498.214887493456
The mean of MarkDown4 for population is: 3130.1765555554607


In [97]:
# Let's compare the means of MarkDown 5 w.r.t Store Size
MarkDown5_for_small_sized_store = df_master[df_master['Size']<= 37392.0]['MarkDown5'].mean()
MarkDown5_for_medium_sized_store = df_master[(df_master['Size'] >37392.0) & (df_master['Size']<=130287.6)]['MarkDown5'].mean()
MarkDown5_for_large_sized_store = df_master[df_master['Size']>130287.6]['MarkDown5'].mean()
MarkDown5_for_population = df_master['MarkDown5'].mean()

print('MARKDOWN5 MEAN')
print('The mean of MarkDown5 for small sized store is: ' + str(MarkDown5_for_small_sized_store))
print('The mean of MarkDown5 for medium sized store is: ' + str(MarkDown5_for_medium_sized_store))
print('The mean of MarkDown5 for large sized store is: ' + str(MarkDown5_for_large_sized_store))
print('The mean of MarkDown5 for population is: ' + str(MarkDown5_for_population))

MARKDOWN5 MEAN
The mean of MarkDown5 for small sized store is: 1347.6482417582422
The mean of MarkDown5 for medium sized store is: 3410.1435589411017
The mean of MarkDown5 for large sized store is: 6036.331957090524
The mean of MarkDown5 for population is: 4544.031686202577
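
The five cells above repeat the same filtering for each MarkDown column. As a sketch of a more compact alternative, the store-size groups can be built once with `pd.cut` and the means computed with a single `groupby`. The small frame `df_example` and its values are hypothetical and used only for illustration; the thresholds 37392.0 and 130287.6 are the ones used in the cells above.

```python
import pandas as pd

# Tiny illustrative frame (hypothetical values); the real analysis uses df_master
df_example = pd.DataFrame({
    'Size': [30000, 35000, 50000, 120000, 150000, 200000],
    'MarkDown1': [100.0, 200.0, 300.0, 500.0, 900.0, 1100.0],
    'MarkDown2': [10.0, 30.0, 50.0, 70.0, 90.0, 110.0],
})

# Same thresholds as above: small <= 37392.0 < medium <= 130287.6 < large
bins = [-float('inf'), 37392.0, 130287.6, float('inf')]
labels = ['small', 'medium', 'large']
size_group = pd.cut(df_example['Size'], bins=bins, labels=labels)  # intervals are right-inclusive by default

markdown_cols = ['MarkDown1', 'MarkDown2']
group_means = df_example.groupby(size_group, observed=True)[markdown_cols].mean()
population_means = df_example[markdown_cols].mean()

print(group_means)
print(population_means)
```

With `df_master`, extending `markdown_cols` to all five MarkDown columns would produce the per-group and population means in one table.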


In [98]:
import scipy.stats as stats

# One-sample t-tests: MarkDown5 values of one size group vs. the mean MarkDown5 of another group
print('Difference between the sizes for MarkDown 5')
Statistic_MarkDown5_Small_vs_Medium_Size_Stores = stats.ttest_1samp(a= df_master[df_master['Size']<= 37392.0]['MarkDown5'], # sample: MarkDown5 in small-sized stores
                  popmean = MarkDown5_for_medium_sized_store)  # mean MarkDown5 in medium-sized stores
print('t-statistic is: ' + str(Statistic_MarkDown5_Small_vs_Medium_Size_Stores))

Statistic_MarkDown5_Small_vs_Large_Size_Stores = stats.ttest_1samp(a= df_master[df_master['Size']<= 37392.0]['MarkDown5'], # sample: MarkDown5 in small-sized stores
                  popmean = MarkDown5_for_large_sized_store)  # mean MarkDown5 in large-sized stores
print('t-statistic is: ' + str(Statistic_MarkDown5_Small_vs_Large_Size_Stores))

Statistic_MarkDown5_Medium_vs_Large_Size_Stores = stats.ttest_1samp(a= df_master[(df_master['Size'] >37392.0) & (df_master['Size']<=130287.6)]['MarkDown5'], # sample: MarkDown5 in medium-sized stores
                  popmean = MarkDown5_for_large_sized_store)  # mean MarkDown5 in large-sized stores
print('t-statistic is: ' + str(Statistic_MarkDown5_Medium_vs_Large_Size_Stores))

# One-sample t-tests: MarkDown5 values of the entire dataset vs. the mean MarkDown5 of each size group
print('Difference between the sizes for MarkDown 5 and the entire population')
Statistic_MarkDown5_Small_vs_Entire_Population = stats.ttest_1samp(a= df_master['MarkDown5'], # sample: MarkDown5 for the entire dataset
                  popmean = MarkDown5_for_small_sized_store)  # mean MarkDown5 in small-sized stores
print('t-statistic is: ' + str(Statistic_MarkDown5_Small_vs_Entire_Population))

Statistic_MarkDown5_Medium_vs_Entire_Population = stats.ttest_1samp(a= df_master['MarkDown5'], # sample: MarkDown5 for the entire dataset
                  popmean = MarkDown5_for_medium_sized_store)  # mean MarkDown5 in medium-sized stores
print('t-statistic is: ' + str(Statistic_MarkDown5_Medium_vs_Entire_Population))

Statistic_MarkDown5_Large_vs_Entire_Population = stats.ttest_1samp(a= df_master['MarkDown5'], # sample: MarkDown5 for the entire dataset
                  popmean = MarkDown5_for_large_sized_store)  # mean MarkDown5 in large-sized stores
print('t-statistic is: ' + str(Statistic_MarkDown5_Large_vs_Entire_Population))

Difference between the sizes for MarkDown 5
t-statistic is: Ttest_1sampResult(statistic=-26.845741585206824, pvalue=3.1417518221657581e-88)
t-statistic is: Ttest_1sampResult(statistic=-61.028594997492647, pvalue=6.1015328547630221e-193)
t-statistic is: Ttest_1sampResult(statistic=-37.236937663773894, pvalue=7.2653377698265106e-261)
Difference between the sizes for MarkDown 5 and the entire population
t-statistic is: Ttest_1sampResult(statistic=29.883936179255624, pvalue=2.4458007490587093e-186)
t-statistic is: Ttest_1sampResult(statistic=10.601056168150174, pvalue=4.3467574978202645e-26)
t-statistic is: Ttest_1sampResult(statistic=-13.951957526562992, pvalue=9.6715051517016997e-44)
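
The tests above treat the mean of the second group as a fixed value, so the spread within that group is not taken into account. One possible alternative, not used in this notebook, is a two-sample Welch's t-test via `stats.ttest_ind(..., equal_var=False)`, which compares both samples directly without assuming equal variances. The sketch below uses synthetic data; the arrays `small_stores` and `medium_stores`, their parameters and the seed are hypothetical stand-ins for the MarkDown5 values of two size groups.

```python
import numpy as np
import scipy.stats as stats

# Synthetic stand-ins for MarkDown5 in two store-size groups (hypothetical values)
rng = np.random.default_rng(42)
small_stores = rng.normal(loc=1300.0, scale=400.0, size=150)
medium_stores = rng.normal(loc=3400.0, scale=900.0, size=300)

# Welch's two-sample t-test: does not assume equal variances in the two groups
result = stats.ttest_ind(small_stores, medium_stores, equal_var=False)

alpha = 0.05  # significance level for a 95% confidence level
significant = result.pvalue < alpha
print('t-statistic:', result.statistic)
print('p-value:', result.pvalue)
print('Difference is significant at the 95% level:', significant)
```

With real data, rows with missing MarkDown values would need to be dropped first (for example with `dropna()`), otherwise the result can be NaN.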
