## Case Study on Testing of Hypothesis
A company has started investing in digital marketing as a new way to promote its products. It collected data and decided to carry out a study on it.

- The company wants to know whether sales increased after it moved into digital marketing.
- The company needs to check whether there is any dependency between the features “Region” and “Manager”.

Help the company carry out this study using the data provided.
 

#### Importing required libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#### Loading the data

In [3]:
data = pd.read_csv('Sales_add.csv')
data.head()

| | Month | Region | Manager | Sales_before_digital_add(in $) | Sales_After_digital_add(in $) |
|---|---|---|---|---|---|
| 0 | Month-1 | Region - A | Manager - A | 132921 | 270390 |
| 1 | Month-2 | Region - A | Manager - C | 149559 | 223334 |
| 2 | Month-3 | Region - B | Manager - A | 146278 | 244243 |
| 3 | Month-4 | Region - B | Manager - B | 152167 | 231808 |
| 4 | Month-5 | Region - C | Manager - B | 159525 | 258402 |


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 5 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   Month                           22 non-null     object
 1   Region                          22 non-null     object
 2   Manager                         22 non-null     object
 3   Sales_before_digital_add(in $)  22 non-null     int64 
 4   Sales_After_digital_add(in $)   22 non-null     int64 
dtypes: int64(2), object(3)
memory usage: 1008.0+ bytes


#### The dataset has 22 entries (index 0 to 21) and five columns: Month, Region, Manager, sales before digital advertising, and sales after digital advertising.

###### There are no null values in the dataset.

In [4]:
data.describe()  #statistical description of the data

| | Sales_before_digital_add(in $) | Sales_After_digital_add(in $) |
|---|---|---|
| count | 22.0 | 22.0 |
| mean | 149239.954545 | 231123.727273 |
| std | 14844.042921 | 25556.777061 |
| min | 130263.0 | 187305.0 |
| 25% | 138087.75 | 214960.75 |
| 50% | 147444.0 | 229986.5 |
| 75% | 157627.5 | 250909.0 |
| max | 178939.0 | 276279.0 |


## 1. Checking whether there is any increase in sales after stepping into digital marketing.

##### Hypothesis:

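Ho: Mean sales before = mean sales after

Ha: Mean sales before < mean sales after

alpha: 0.05 (5% level of significance)

Each row gives the sales before and after digital marketing for the same record, so the two samples are paired. We therefore use a paired t-test to analyse the hypothesis.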

In [7]:
from scipy import stats

In [6]:
stats.ttest_rel(data['Sales_before_digital_add(in $)'], data['Sales_After_digital_add(in $)'])

Ttest_relResult(statistic=-12.09070525287017, pvalue=6.336667004575778e-11)

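##### Since the p-value is less than the chosen significance level (0.05), the null hypothesis is rejected.

`ttest_rel` reports a two-sided p-value by default. The t-statistic is negative (sales before are lower than sales after), so the one-sided p-value for Ha is half the reported value, about 3.17e-11, which is still far below 0.05.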

##### INTERPRETATION OF THE RESULTS
A paired-sample t-test was used to compare sales before and after the adoption of digital marketing, to test whether the new way of promoting products had a significant effect on sales.

Mean sales after the adoption of digital marketing were higher ($231,123.73 ± $25,556.78, mean ± standard deviation) than before it ($149,239.95 ± $14,844.04). The increase in sales is statistically significant (t = -12.09, p < 0.001).
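
A paired t-test is a one-sample t-test on the per-row differences. The sketch below uses small made-up `before` and `after` arrays (not the company's data) to compute the statistic by hand and compare it with `stats.ttest_rel`. The `alternative='less'` argument, available in SciPy 1.6 and later, requests the one-sided test that matches Ha directly.

```python
import numpy as np
from scipy import stats

# Hypothetical paired observations, for illustration only
before = np.array([120.0, 135.0, 128.0, 140.0, 131.0, 125.0])
after = np.array([150.0, 149.0, 160.0, 158.0, 155.0, 147.0])

# Manual paired t statistic: mean difference divided by its standard error
d = before - after
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

# Two-sided test (SciPy default) and one-sided test for "before < after"
two_sided = stats.ttest_rel(before, after)
one_sided = stats.ttest_rel(before, after, alternative='less')

print("t (manual):", t_manual)
print("t (scipy): ", two_sided.statistic)
print("two-sided p:", two_sided.pvalue)
print("one-sided p:", one_sided.pvalue)
```

When the t-statistic is negative, the one-sided p-value for `alternative='less'` is exactly half the two-sided one.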

## 2. Checking whether there is any dependency between the features “Region” and “Manager”.

##### Hypothesis

Ho: Manager and Region are independent, i.e. no relationship exists.

Ha: Manager and Region are dependent, i.e. a relationship exists.

alpha: 0.05 (5% level of significance)

We use the chi-square test of independence to analyse the hypothesis.



In [4]:
data['Manager'].value_counts()

Manager - A    9
Manager - B    7
Manager - C    6
Name: Manager, dtype: int64

In [8]:
data['Region'].value_counts()

Region - A    10
Region - B     7
Region - C     5
Name: Region, dtype: int64

In [5]:
data_crosstab=pd.crosstab(data.Manager, data.Region, margins=True)  ## creating a contingency table
data_crosstab

| Manager \ Region | Region - A | Region - B | Region - C | All |
|---|---|---|---|---|
| Manager - A | 4 | 4 | 1 | 9 |
| Manager - B | 3 | 1 | 3 | 7 |
| Manager - C | 3 | 2 | 1 | 6 |
| All | 10 | 7 | 5 | 22 |
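
Under independence, the expected count of each cell is (row total × column total) / grand total. The following is a minimal sketch of that calculation, with the observed counts re-entered by hand from the table above; the array name `observed` is only illustrative.

```python
import numpy as np

# Observed counts copied from the Manager x Region table (without the 'All' margins)
observed = np.array([
    [4, 4, 1],   # Manager - A
    [3, 1, 3],   # Manager - B
    [3, 2, 1],   # Manager - C
])

row_totals = observed.sum(axis=1)   # per manager: 9, 7, 6
col_totals = observed.sum(axis=0)   # per region: 10, 7, 5
n = observed.sum()                  # grand total: 22

# Expected count for each cell if Manager and Region were independent
expected = np.outer(row_totals, col_totals) / n
print(expected.round(2))
```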


In [8]:
# Ho: Manager and Region are independent, i.e. no relationship exists.
# Ha: Manager and Region are dependent, i.e. a relationship exists.

alpha = 0.05
chi_square = 0
rows = data['Manager'].unique()
columns = data['Region'].unique()
for i in columns:
    for j in rows:
        O = data_crosstab[i][j]  # observed count
        E = data_crosstab[i]['All'] * data_crosstab['All'][j] / data_crosstab['All']['All']  # expected count under independence
        chi_square += (O-E)**2/E

# The p-value is the upper tail of the chi-square distribution (not the normal distribution)
dof = (len(rows)-1)*(len(columns)-1)
p_value = stats.chi2.sf(chi_square, dof)

print("chisquare-score is:", round(chi_square, 4), " and p value is:", round(p_value, 4))

chisquare-score is: 3.0506  and p value is: 0.5494


##### INTERPRETATION OF THE RESULTS

Degrees of freedom = (r-1)x(c-1) = 2x2 = 4. From the critical value table, the critical value for 4 degrees of freedom at the 0.05 level of significance is 9.49. The chi-square score obtained, 3.0506, is less than 9.49, so we fail to reject the null hypothesis. The data provide no evidence that Manager and Region are dependent.

The p-value (0.5494) is greater than the chosen significance level (0.05), which leads to the same conclusion: we fail to reject the null hypothesis that Manager and Region are independent. No statistically significant relationship between 'Manager' and 'Region' was found.
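
As a cross-check, SciPy's `stats.chi2_contingency` runs the whole test from the observed counts, and `stats.chi2.ppf` gives the critical value that is otherwise read from a table. This is a minimal sketch, with the counts re-entered by hand from the contingency table above (without the 'All' margins).

```python
import numpy as np
from scipy import stats

# Observed Manager x Region counts, copied from the contingency table
observed = np.array([
    [4, 4, 1],   # Manager - A
    [3, 1, 3],   # Manager - B
    [3, 2, 1],   # Manager - C
])

# Returns the statistic, the p-value from the chi-square distribution,
# the degrees of freedom and the expected counts
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)

# Critical value for alpha = 0.05 at the same degrees of freedom
alpha = 0.05
critical_value = stats.chi2.ppf(1 - alpha, dof)

print("chi-square:", round(chi2_stat, 4))
print("degrees of freedom:", dof)
print("p-value:", round(p_value, 4))
print("critical value:", round(critical_value, 2))
```

With the notebook's `data` frame loaded, the result of `pd.crosstab(data.Manager, data.Region)` (without `margins=True`) could be passed in place of the hand-entered array.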