## Case Study on ANOVA

XYZ Company has offices in four different zones. The company wishes to
investigate the following:
● The mean sales generated by each zone.
● Total sales generated by all the zones for each month.
● Check whether all the zones generate the same amount of sales.
Help the company carry out this study using the data provided.

#### Importing the required libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#### Loading the dataset

In [2]:
data = pd.read_csv('Sales_data_zone_wise.csv')
data.head()

|   | Month | Zone - A | Zone - B | Zone - C | Zone - D |
|---|---|---|---|---|---|
| 0 | Month - 1 | 1483525 | 1748451 | 1523308 | 2267260 |
| 1 | Month - 2 | 1238428 | 1707421 | 2212113 | 1994341 |
| 2 | Month - 3 | 1860771 | 2091194 | 1282374 | 1241600 |
| 3 | Month - 4 | 1871571 | 1759617 | 2290580 | 2252681 |
| 4 | Month - 5 | 1244922 | 1606010 | 1818334 | 1326062 |


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Month     29 non-null     object
 1   Zone - A  29 non-null     int64 
 2   Zone - B  29 non-null     int64 
 3   Zone - C  29 non-null     int64 
 4   Zone - D  29 non-null     int64 
dtypes: int64(4), object(1)
memory usage: 1.3+ KB


#### The dataset contains 29 entries with the columns Month, Zone - A, Zone - B, Zone - C and Zone - D. Each row gives one month's sales for each of the four zones.

### The mean sales generated by each zone.

In [4]:
pd.options.display.float_format = '{:.2f}'.format

In [5]:
data.describe()

|   | Zone - A | Zone - B | Zone - C | Zone - D |
|---|---|---|---|---|
| count | 29.0 | 29.0 | 29.0 | 29.0 |
| mean | 1540493.14 | 1755559.59 | 1772871.03 | 1842926.76 |
| std | 261940.06 | 168389.89 | 333193.72 | 375016.48 |
| min | 1128185.0 | 1527574.0 | 1237722.0 | 1234311.0 |
| 25% | 1305972.0 | 1606010.0 | 1523308.0 | 1520406.0 |
| 50% | 1534390.0 | 1740365.0 | 1767047.0 | 1854412.0 |
| 75% | 1820196.0 | 1875658.0 | 2098463.0 | 2180416.0 |
| max | 2004480.0 | 2091194.0 | 2290580.0 | 2364132.0 |


In [6]:
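# numeric_only=True skips the text column Month; recent pandas versions raise an error without it
data.mean(numeric_only=True)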

Zone - A   1540493.14
Zone - B   1755559.59
Zone - C   1772871.03
Zone - D   1842926.76
dtype: float64

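#### The mean monthly sales generated by each zone are displayed above. Zone D has the highest mean (about 1.84 million) and Zone A the lowest (about 1.54 million).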
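The per-zone means can also be computed by selecting the zone columns explicitly, which avoids relying on how pandas handles the text column `Month`. The sketch below is only illustrative: `sample`, `zone_cols` and `zone_means` are names introduced here, and `sample` holds just the five rows shown by `data.head()`, so its means differ from the full 29-month figures above.

```python
import pandas as pd

# Illustrative sample only: the first five rows printed by data.head()
sample = pd.DataFrame({
    "Month": ["Month - 1", "Month - 2", "Month - 3", "Month - 4", "Month - 5"],
    "Zone - A": [1483525, 1238428, 1860771, 1871571, 1244922],
    "Zone - B": [1748451, 1707421, 2091194, 1759617, 1606010],
    "Zone - C": [1523308, 2212113, 1282374, 2290580, 1818334],
    "Zone - D": [2267260, 1994341, 1241600, 2252681, 1326062],
})

# Pick the numeric zone columns by name instead of relying on dtype inference
zone_cols = [col for col in sample.columns if col.startswith("Zone")]

# Column-wise mean: one value per zone
zone_means = sample[zone_cols].mean()

# Show the zones from highest to lowest mean sales
print(zone_means.sort_values(ascending=False))
```

On the full dataset, the same expression with `data` in place of `sample` should reproduce the means listed above.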

### Total sales generated by all the zones for each month

In [12]:
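# Row-wise total over the four zone columns; numeric_only=True leaves out the text column Month
data.sum(axis = 1, numeric_only = True)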

0     7022544
1     7152303
2     6475939
3     8174449
4     5995328
5     7151387
6     7287108
7     7816299
8     6703395
9     7128210
10    7032783
11    6111084
12    5925424
13    7155515
14    5934156
15    6506659
16    7149383
17    7083490
18    6971953
19    7124599
20    7389597
21    7560001
22    6687919
23    7784747
24    6095918
25    6512360
26    6267918
27    7470920
28    6772277
dtype: int64

#### The total sales across all four zones for each of the 29 months are displayed above (index 0 is Month - 1).
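The totals above are labelled only by row position. A minimal sketch of attaching the month label to each total follows; it assumes the column names shown by `data.head()`, and `sample` and `monthly_total` are names introduced here for illustration. Only the first five months are included.

```python
import pandas as pd

# Illustrative sample only: the first five rows printed by data.head()
sample = pd.DataFrame({
    "Month": ["Month - 1", "Month - 2", "Month - 3", "Month - 4", "Month - 5"],
    "Zone - A": [1483525, 1238428, 1860771, 1871571, 1244922],
    "Zone - B": [1748451, 1707421, 2091194, 1759617, 1606010],
    "Zone - C": [1523308, 2212113, 1282374, 2290580, 1818334],
    "Zone - D": [2267260, 1994341, 1241600, 2252681, 1326062],
})

# Use Month as the index so each total carries its month label,
# then add the four zone columns row by row
monthly_total = sample.set_index("Month").sum(axis=1)

print(monthly_total)
```

The values for these five months agree with the first five totals in the output above.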

In [8]:
data.sum(axis = 0)

Month       Month - 1Month - 2Month - 3Month - 4Month - 5M...
Zone - A                                             44674301
Zone - B                                             50911228
Zone - C                                             51413260
Zone - D                                             53444876
dtype: object

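#### The total sales over the whole period for each of the four zones (Zone A, Zone B, Zone C and Zone D) are displayed above. The Month entry is only the concatenated month labels and can be ignored.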

### Check whether all the zones generate the same amount of sales.

#### Hypothesis

H0: Mean sales(zone A) = Mean sales(zone B) = Mean sales(zone C) = Mean sales(zone D), i.e., there is no significant difference among the mean sales of the four zones.

Ha: At least one zone's mean sales differs from the others, i.e., there is a significant difference among the mean sales of the zones.

Level of significance: alpha = 0.05 (5%)

One-way ANOVA

In [10]:
import scipy.stats as stats

In [11]:
fvalue, pvalue = stats.f_oneway(data['Zone - A'], data['Zone - B'], data['Zone - C'], data['Zone - D'])
print('f value= ',fvalue,'and p value =',pvalue)

f value=  5.672056106843581 and p value = 0.0011827601694503335


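#### The p-value (about 0.0012) is less than 0.05, so we reject the null hypothesis. At least one zone's mean sales differs significantly from the others, so the four zones do not all generate the same amount of sales. The test itself does not identify which zones differ.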
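To make the test less of a black box, the F statistic can be rebuilt by hand as the ratio of between-zone to within-zone variance and compared with `stats.f_oneway`. This is a minimal sketch on the five months shown by `data.head()`, not the full 29-month dataset, so its F and p values will not match the result above. All variable names here are introduced for illustration.

```python
import numpy as np
from scipy import stats

# Illustrative sample only: first five months of each zone from data.head()
groups = [
    np.array([1483525, 1238428, 1860771, 1871571, 1244922], dtype=float),  # Zone - A
    np.array([1748451, 1707421, 2091194, 1759617, 1606010], dtype=float),  # Zone - B
    np.array([1523308, 2212113, 1282374, 2290580, 1818334], dtype=float),  # Zone - C
    np.array([2267260, 1994341, 1241600, 2252681, 1326062], dtype=float),  # Zone - D
]

k = len(groups)                          # number of zones
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: spread of the zone means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: spread of each month around its own zone mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = k - 1
df_within = n_total - k

# F = mean square between / mean square within
f_manual = (ss_between / df_between) / (ss_within / df_within)

# p-value: upper tail of the F distribution at f_manual
p_manual = stats.f.sf(f_manual, df_between, df_within)

# Library result for comparison
f_scipy, p_scipy = stats.f_oneway(*groups)

print("manual:", f_manual, p_manual)
print("scipy: ", f_scipy, p_scipy)
```

The two printed lines should agree up to floating-point rounding. A large F means the zone means vary more than the month-to-month variation within zones would explain.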