## Case Study on ANOVA

### Objectives:
XYZ Company has offices in four different zones. The company wishes to investigate the following:
- The mean sales generated by each zone.
- Total sales generated by all the zones for each month.
- Check whether all the zones generate the same amount of sales.

### Importing relevant libraries

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

### General Analysis

In [21]:
df = pd.read_csv('Data/Sales_data_zone_wise.csv')
df.head()

|   | Month | Zone - A | Zone - B | Zone - C | Zone - D |
|---|---|---|---|---|---|
| 0 | Month - 1 | 1483525 | 1748451 | 1523308 | 2267260 |
| 1 | Month - 2 | 1238428 | 1707421 | 2212113 | 1994341 |
| 2 | Month - 3 | 1860771 | 2091194 | 1282374 | 1241600 |
| 3 | Month - 4 | 1871571 | 1759617 | 2290580 | 2252681 |
| 4 | Month - 5 | 1244922 | 1606010 | 1818334 | 1326062 |


In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Month     29 non-null     object
 1   Zone - A  29 non-null     int64 
 2   Zone - B  29 non-null     int64 
 3   Zone - C  29 non-null     int64 
 4   Zone - D  29 non-null     int64 
dtypes: int64(4), object(1)
memory usage: 1.3+ KB


In [23]:
df.describe()

|   | Zone - A | Zone - B | Zone - C | Zone - D |
|---|---|---|---|---|
| count | 29.0 | 29.0 | 29.0 | 29.0 |
| mean | 1540493.0 | 1755560.0 | 1772871.0 | 1842927.0 |
| std | 261940.1 | 168389.9 | 333193.7 | 375016.5 |
| min | 1128185.0 | 1527574.0 | 1237722.0 | 1234311.0 |
| 25% | 1305972.0 | 1606010.0 | 1523308.0 | 1520406.0 |
| 50% | 1534390.0 | 1740365.0 | 1767047.0 | 1854412.0 |
| 75% | 1820196.0 | 1875658.0 | 2098463.0 | 2180416.0 |
| max | 2004480.0 | 2091194.0 | 2290580.0 | 2364132.0 |


#### Insights:
- The dataset contains 5 columns and 29 rows: one column for the month and one sales column for each of the four zones.
- The dataset has no null values: every column reports 29 non-null entries.

### 1. Mean sales generated by each zone

In [24]:
print('\nThe average sales in Zone A is: ', round(df['Zone - A'].mean(), 2))
print('\nThe average sales in Zone B is: ', round(df['Zone - B'].mean(), 2))
print('\nThe average sales in Zone C is: ', round(df['Zone - C'].mean(), 2))
print('\nThe average sales in Zone D is: ', round(df['Zone - D'].mean(), 2))


The average sales in Zone A is:  1540493.14

The average sales in Zone B is:  1755559.59

The average sales in Zone C is:  1772871.03

The average sales in Zone D is:  1842926.76
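
The four print calls above can be condensed into a single column-wise mean. The sketch below is a minimal illustration that rebuilds only the five rows shown by `df.head()`, so its numbers differ from the full 29-month means reported above. The name `df_head` is an assumption introduced here.

```python
import pandas as pd

# Illustrative subset: the first five rows shown by df.head(),
# not the full 29-month dataset.
df_head = pd.DataFrame({
    'Month': ['Month - 1', 'Month - 2', 'Month - 3', 'Month - 4', 'Month - 5'],
    'Zone - A': [1483525, 1238428, 1860771, 1871571, 1244922],
    'Zone - B': [1748451, 1707421, 2091194, 1759617, 1606010],
    'Zone - C': [1523308, 2212113, 1282374, 2290580, 1818334],
    'Zone - D': [2267260, 1994341, 1241600, 2252681, 1326062],
})

zones = ['Zone - A', 'Zone - B', 'Zone - C', 'Zone - D']

# Selecting the zone columns and calling mean() returns one value per zone.
zone_means = df_head[zones].mean().round(2)
print(zone_means)
```

On the full dataset, the same call would be `df[zones].mean().round(2)`.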


### 2. Total sales in each month

In [25]:
df.columns

Index(['Month', 'Zone - A', 'Zone - B', 'Zone - C', 'Zone - D'], dtype='object')

In [19]:
zones = ['Zone - A', 'Zone - B', 'Zone - C', 'Zone - D']
df['Total_Sales'] = df[zones].sum(axis=1)

A list of the zone column names is used to sum sales across each row and store the result in a new column, Total_Sales.

In [20]:
df[['Month', 'Total_Sales']]

|   | Month | Total_Sales |
|---|---|---|
| 0 | Month - 1 | 7022544 |
| 1 | Month - 2 | 7152303 |
| 2 | Month - 3 | 6475939 |
| 3 | Month - 4 | 8174449 |
| 4 | Month - 5 | 5995328 |
| 5 | Month - 6 | 7151387 |
| 6 | Month - 7 | 7287108 |
| 7 | Month - 8 | 7816299 |
| 8 | Month - 9 | 6703395 |
| 9 | Month - 10 | 7128210 |


### 3. Checking whether all zones generate the same sales amount

We use one-way ANOVA to test whether the zones have the same average sales.

- $H_{0}$: The average sales are the same across the zones; that is: $\mu_{1} = \mu_{2} = \mu_{3} = \mu_{4}$
- $H_{a}$: At least one zone's average sales differs from the others; that is: $\mu_{i} \neq \mu_{j}$ for at least one pair $i \neq j$
- Significance level, $\alpha$: 0.05
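
To make the test statistic concrete, the sketch below computes the one-way ANOVA F value by hand, as the ratio of between-group to within-group mean squares, and compares it with `stats.f_oneway`. It uses only the five months shown by `df.head()`, so its F and p values are illustrative and are not the results for the full dataset. The variable names are assumptions introduced here.

```python
import numpy as np
from scipy import stats

# Illustrative subset: the five months shown by df.head(), one array per zone.
samples = [
    np.array([1483525, 1238428, 1860771, 1871571, 1244922], dtype=float),  # Zone - A
    np.array([1748451, 1707421, 2091194, 1759617, 1606010], dtype=float),  # Zone - B
    np.array([1523308, 2212113, 1282374, 2290580, 1818334], dtype=float),  # Zone - C
    np.array([2267260, 1994341, 1241600, 2252681, 1326062], dtype=float),  # Zone - D
]

k = len(samples)                          # number of groups (zones)
n_total = sum(len(s) for s in samples)    # total number of observations
grand_mean = np.concatenate(samples).mean()

# Between-group sum of squares: spread of the group means around the grand mean.
ss_between = sum(len(s) * (s.mean() - grand_mean) ** 2 for s in samples)
# Within-group sum of squares: spread of observations around their own group mean.
ss_within = sum(((s - s.mean()) ** 2).sum() for s in samples)

df_between = k - 1
df_within = n_total - k

# F is the ratio of the two mean squares; the p value is the upper tail of F.
f_manual = (ss_between / df_between) / (ss_within / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)

f_scipy, p_scipy = stats.f_oneway(*samples)
print(f_manual, f_scipy)
print(p_manual, p_scipy)
```

A large F means the zone means vary more than the month-to-month noise within each zone would explain.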

In [28]:
f_value, p_value = stats.f_oneway(df['Zone - A'], df['Zone - B'], df['Zone - C'], df['Zone - D'])

In [30]:
print('\nThe f value is: ', round(f_value, 3))
print('\nThe p value for the above f value is: ', round(p_value, 4))


The f value is:  5.672

The p value for the above f value is:  0.0012


#### Insights:
- The p value (0.0012) is smaller than the significance level $\alpha$ = 0.05.
- Therefore we reject the null hypothesis that the zone means are all equal.
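
The same decision can be reached by comparing the F value with a critical value instead of comparing the p value with $\alpha$. The sketch below assumes the reported F of 5.672 and derives the degrees of freedom from 4 zones with 29 months each. The variable names are assumptions introduced here.

```python
from scipy import stats

alpha = 0.05
f_reported = 5.672           # F value reported by the notebook output

k = 4                        # number of zones
n_total = 4 * 29             # 29 monthly observations per zone
df_between = k - 1           # 3
df_within = n_total - k      # 112

# Critical value: the F value that leaves alpha in the upper tail.
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)

# Reject H0 when the observed F exceeds the critical value.
reject_h0 = f_reported > f_critical
print('Critical F:', round(f_critical, 3))
print('Reject H0:', reject_h0)
```

Both decision rules are equivalent: F exceeds the critical value exactly when the p value falls below $\alpha$.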

### Conclusion:
> The zones do not all generate the same average sales: at least one zone's mean differs from the others.