## Case Study: Hypothesis Testing

### Objectives:
A company has started investing in digital marketing as a new way to promote its products. To evaluate this move, it collected sales data and decided to carry out a study on it.
- The company wishes to clarify whether there is any `increase in sales` after stepping into digital marketing.
- The company needs to check whether there is any dependency between the features **Region** and **Manager**.

Help the company carry out this study using the data provided.


### Importing relevant Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

### Reading the Sales_add dataset

In [2]:
df = pd.read_csv('Data/Sales_add.csv')
df.head()

,Month,Region,Manager,Sales_before_digital_add(in $),Sales_After_digital_add(in $)
0,Month-1,Region - A,Manager - A,132921,270390
1,Month-2,Region - A,Manager - C,149559,223334
2,Month-3,Region - B,Manager - A,146278,244243
3,Month-4,Region - B,Manager - B,152167,231808
4,Month-5,Region - C,Manager - B,159525,258402


### Preliminary analysis

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 5 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   Month                           22 non-null     object
 1   Region                          22 non-null     object
 2   Manager                         22 non-null     object
 3   Sales_before_digital_add(in $)  22 non-null     int64 
 4   Sales_After_digital_add(in $)   22 non-null     int64 
dtypes: int64(2), object(3)
memory usage: 1008.0+ bytes


In [4]:
df.describe()

,Sales_before_digital_add(in $),Sales_After_digital_add(in $)
count,22.0,22.0
mean,149239.954545,231123.727273
std,14844.042921,25556.777061
min,130263.0,187305.0
25%,138087.75,214960.75
50%,147444.0,229986.5
75%,157627.5,250909.0
max,178939.0,276279.0


#### Insights:
- The dataset contains 22 rows and 5 columns
- There are no null values
- Average monthly sales before investing in digital marketing were \$149,239.95
- Average monthly sales after investing in digital marketing were \$231,123.73
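
To put the two averages side by side, the short sketch below computes the absolute and relative difference from the means reported by `df.describe()`. The variable names are illustrative, and this is only a descriptive comparison, not a significance test.

```python
# Means copied from the df.describe() output
mean_before = 149239.954545
mean_after = 231123.727273

# Absolute and relative change in average sales
abs_change = mean_after - mean_before
pct_change = 100 * abs_change / mean_before

print(f"Absolute change in mean sales: ${abs_change:,.2f}")
print(f"Relative change in mean sales: {pct_change:.1f}%")
```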

### 1. Testing for increase in sales

#### Hypothesis:
- $H_{0}$: The average sales before investing in digital marketing ($\mu_{1}$) equal the average sales after investing in digital marketing ($\mu_{2}$), i.e. $\mu_{2} = \mu_{1}$
- $H_{a}$: The average sales after investing in digital marketing ($\mu_{2}$) are greater than the average sales before investing in digital marketing ($\mu_{1}$), i.e. $\mu_{2} > \mu_{1}$
- Significance level $\alpha$: 0.05

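As the sample size is small ($n = 22 < 30$), we use a **one-tailed t-test** to test our hypothesis. The call below uses `stats.ttest_ind`, which treats the before and after columns as two independent samples.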

In [5]:
t_value, p_value = stats.ttest_ind(df['Sales_After_digital_add(in $)'], df['Sales_before_digital_add(in $)'], alternative='greater')

Assuming the null hypothesis $H_{0}$ to be true:

In [12]:
print("The t value is:", t_value)
print("The p value is: ", p_value)

The t value is: 12.995084451110877
The p value is:  1.3071840034523225e-16
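
As a sanity check, the t value can be reproduced from the summary statistics alone. The sketch below assumes the pooled-variance (equal-variance) form of the two-sample t-test, which is the default for `stats.ttest_ind`, and uses the means and standard deviations from the `df.describe()` output. The names `t_manual`, `p_manual`, and `dof` are illustrative.

```python
import math
from scipy import stats

# Summary statistics copied from the df.describe() output
n = 22
mean_before, std_before = 149239.954545, 14844.042921
mean_after, std_after = 231123.727273, 25556.777061

# Pooled variance under the equal-variance assumption (both samples have n rows)
pooled_var = ((n - 1) * std_before**2 + (n - 1) * std_after**2) / (2 * n - 2)

# Standard error of the difference between the two sample means
std_err = math.sqrt(pooled_var * (1 / n + 1 / n))

t_manual = (mean_after - mean_before) / std_err
dof = 2 * n - 2

# One-tailed (upper-tail) p-value from the t distribution
p_manual = stats.t.sf(t_manual, dof)

print("t from summary statistics:", round(t_manual, 3))
print("degrees of freedom:", dof)
print("one-tailed p value:", p_manual)
```

The manual t value should agree closely with the one returned by `stats.ttest_ind`; small differences can arise from the rounding in the copied summary statistics.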


#### Insights:
- The p-value ($\approx 1.3 \times 10^{-16}$) is far smaller than the significance level $\alpha = 0.05$
- We reject the null hypothesis

#### Conclusion:
> There is statistically significant evidence that average sales increased after investing in digital marketing.
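
Because every month has both a before and an after figure, the two columns could also be treated as paired measurements. The sketch below shows what a paired one-tailed test might look like using `stats.ttest_rel`. It is an alternative to the test used in this study, not a replacement for it. It assumes SciPy 1.6 or later (for the `alternative` argument) and, to stay self-contained, uses only the five rows shown in the `df.head()` output, so its numbers are illustrative rather than the result for the full dataset.

```python
from scipy import stats

# First five months, copied from the df.head() output (illustration only)
sales_before = [132921, 149559, 146278, 152167, 159525]
sales_after = [270390, 223334, 244243, 231808, 258402]

# Paired one-tailed test: H_a is that the mean of (after - before) is greater than 0
t_paired, p_paired = stats.ttest_rel(sales_after, sales_before, alternative='greater')

print("Paired t value:", t_paired)
print("Paired p value:", p_paired)
```

On the full dataset, the same call would take the two sales columns of `df` in place of the two lists.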

### 2. Checking for dependency between features: Manager and Region

To check for dependency between two categorical variables, we use the $\chi^{2}$ test of independence.

#### Hypothesis:
- $H_{0}$: The features **manager** and **region** are independent
- $H_{a}$: The features **manager** and **region** are dependent
- $\alpha$: 0.05

In [13]:
contingency_table = pd.crosstab(df['Manager'], df['Region'])
contingency_table

Manager,Region - A,Region - B,Region - C
Manager - A,4,4,1
Manager - B,3,1,3
Manager - C,3,2,1


In [14]:
observed_values = contingency_table.values
observed_values

array([[4, 4, 1],
       [3, 1, 3],
       [3, 2, 1]], dtype=int64)

In [15]:
x2, x2_p_value, deg_of_frdm, exp_val = stats.chi2_contingency(observed=observed_values)

In [21]:
print("The expected values are: \n\n", exp_val)
print("\nThe degrees of freedom are: ", deg_of_frdm)

The expected values are: 

 [[4.09090909 2.86363636 2.04545455]
 [3.18181818 2.22727273 1.59090909]
 [2.72727273 1.90909091 1.36363636]]

The degrees of freedom are:  4
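
The expected values and degrees of freedom returned by `stats.chi2_contingency` can be reproduced by hand. Under independence, each expected count is (row total × column total) / grand total, and the degrees of freedom are (rows − 1) × (columns − 1). The sketch below applies these formulas to the observed counts from the contingency table and then computes the Pearson $\chi^{2}$ statistic from them. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the contingency table (rows: managers, columns: regions)
observed = np.array([[4, 4, 1],
                     [3, 1, 3],
                     [3, 2, 1]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.sum()

# Expected count under independence: (row total * column total) / grand total
expected = np.outer(row_totals, col_totals) / grand_total

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Pearson chi-square statistic and its upper-tail p-value
chi2_manual = ((observed - expected) ** 2 / expected).sum()
p_manual = stats.chi2.sf(chi2_manual, dof)

print(expected.round(4))
print("dof:", dof, "chi2:", round(chi2_manual, 4), "p:", round(p_manual, 4))
```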


In [22]:
print("The chi square statistic is : ", x2)
print("\nThe p-value for the chi square statistic is: ", x2_p_value)

The chi square statistic is :  3.050566893424036

The p-value for the chi square statistic is:  0.5493991051158094


#### Insights:
- The p-value ($\approx 0.549$) is well above the significance level $\alpha = 0.05$
- We fail to reject the null hypothesis

#### Conclusion:
> There is no statistically significant evidence that the features **Region** and **Manager** are dependent.
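
One caveat worth checking: a common rule of thumb is that the $\chi^{2}$ approximation works best when the expected count in each cell is at least 5. The sketch below counts how many cells fall below that threshold, using `expected_freq` from `scipy.stats.contingency` and the observed counts from the contingency table. The threshold of 5 is a guideline rather than a hard rule, and the variable names are illustrative.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Observed counts copied from the contingency table (rows: managers, columns: regions)
observed = np.array([[4, 4, 1],
                     [3, 1, 3],
                     [3, 2, 1]])

# Expected counts under independence
expected = expected_freq(observed)

# Number of cells whose expected count is below the rule-of-thumb threshold
n_small = int((expected < 5).sum())
print(f"{n_small} of {expected.size} cells have an expected count below 5")
```

With only 22 observations spread over 9 cells, the expected counts are small, so the conclusion is best read with some caution.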