# Big Data Real-Time Analytics with Python and Spark

## Chapter 8 - Statistical Data Analysis Part 2

### Lab 3 - Hypothesis testing in Python

In [1]:
# Python version
from platform import python_version
print('The version used in this notebook is: ', python_version())

The version used in this notebook is:  3.8.13


In [2]:
# Imports
import pandas as pd
import numpy as np
import scipy.stats as stats

In [3]:
# Package versions used in this notebook
%reload_ext watermark
%watermark -a "Bianca Amorim" --iversions

Author: Bianca Amorim

numpy : 1.22.3
scipy : 1.7.3
pandas: 1.4.2



In [4]:
# Load dataset
df = pd.read_csv(r'datasets/dataset.csv')

In [5]:
df.shape

(200, 10)

In [6]:
df.head()

| | id_cliente | genero | canal_atendimento | regiao | estado_civil | segmento | consumo_medio_mensal_antes_upgrade | consumo_medio_mes_anterior_ao_upgrade | consumo_medio_primeiro_mes_apos_upgrade | consumo_medio_segundo_mes_apos_upgrade |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 70 | 0 | 4 | 1 | 1 | 1 | 57 | 49.2 | 52 | 57.2 |
| 1 | 121 | 1 | 4 | 2 | 1 | 3 | 68 | 63.6 | 59 | 64.9 |
| 2 | 86 | 0 | 4 | 3 | 1 | 1 | 44 | 64.8 | 33 | 36.3 |
| 3 | 141 | 0 | 4 | 3 | 1 | 3 | 63 | 56.4 | 44 | 48.4 |
| 4 | 172 | 0 | 4 | 2 | 1 | 2 | 47 | 68.4 | 52 | 57.2 |


In [7]:
df.columns

Index(['id_cliente', 'genero', 'canal_atendimento', 'regiao', 'estado_civil',
       'segmento', 'consumo_medio_mensal_antes_upgrade',
       'consumo_medio_mes_anterior_ao_upgrade',
       'consumo_medio_primeiro_mes_apos_upgrade',
       'consumo_medio_segundo_mes_apos_upgrade'],
      dtype='object')

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 10 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   id_cliente                               200 non-null    int64  
 1   genero                                   200 non-null    int64  
 2   canal_atendimento                        200 non-null    int64  
 3   regiao                                   200 non-null    int64  
 4   estado_civil                             200 non-null    int64  
 5   segmento                                 200 non-null    int64  
 6   consumo_medio_mensal_antes_upgrade       200 non-null    int64  
 7   consumo_medio_mes_anterior_ao_upgrade    200 non-null    float64
 8   consumo_medio_primeiro_mes_apos_upgrade  200 non-null    int64  
 9   consumo_medio_segundo_mes_apos_upgrade   200 non-null    float64
dtypes: float64(2), int64(8)
memory usage: 15.8 KB


In [9]:
df.describe()

| | id_cliente | genero | canal_atendimento | regiao | estado_civil | segmento | consumo_medio_mensal_antes_upgrade | consumo_medio_mes_anterior_ao_upgrade | consumo_medio_primeiro_mes_apos_upgrade | consumo_medio_segundo_mes_apos_upgrade |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 |
| mean | 100.5 | 0.545 | 3.43 | 2.055 | 1.16 | 2.025 | 52.23 | 63.174 | 52.775 | 58.0525 |
| std | 57.879185 | 0.49922 | 1.039472 | 0.724291 | 0.367526 | 0.690477 | 10.252937 | 11.242137 | 9.478586 | 10.426445 |
| min | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 28.0 | 39.6 | 31.0 | 34.1 |
| 25% | 50.75 | 0.0 | 3.0 | 2.0 | 1.0 | 2.0 | 44.0 | 54.0 | 45.75 | 50.325 |
| 50% | 100.5 | 1.0 | 4.0 | 2.0 | 1.0 | 2.0 | 50.0 | 62.4 | 54.0 | 59.4 |
| 75% | 150.25 | 1.0 | 4.0 | 3.0 | 1.0 | 2.25 | 60.0 | 70.8 | 60.0 | 66.0 |
| max | 200.0 | 1.0 | 4.0 | 3.0 | 2.0 | 3.0 | 76.0 | 90.0 | 67.0 | 73.7 |


### Question 1: Was the average bandwidth consumption for the month before the upgrade greater than 50?

#### We will use a one-sample t-test
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html

The one-sample t-test compares the mean of a single group against a reference value. It tests the null hypothesis that the expected value (mean) of a sample of independent observations is equal to the given population mean, `popmean`.

Hypotheses:
- H0: The average bandwidth consumption in the month before the upgrade was 50.
- H1: The average bandwidth consumption in the month before the upgrade was different from 50.

If the p-value is less than 0.05, we reject H0. Otherwise, we fail to reject H0.

A p-value below 0.05 is conventionally considered statistically significant; a p-value above 0.05 is not.

The p-value of a t-test is the probability of obtaining a test statistic at least as extreme as the one observed, assuming that H0 is true.

In [10]:
# Run the test
stats.ttest_1samp(a = df.consumo_medio_mes_anterior_ao_upgrade, popmean = 50)

Ttest_1sampResult(statistic=16.57233752433133, pvalue=2.4963719280931583e-39)

Since the p-value is far below 0.05, we reject H0: the average bandwidth consumption in the month before the upgrade was different from 50. The t statistic is positive, so the sample mean lies above 50.

In [11]:
# Checking the mean
df.consumo_medio_mes_anterior_ao_upgrade.mean()

63.17400000000001
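
As a cross-check, the statistic returned by `ttest_1samp` can be reproduced by hand from the summary values printed by `df.describe()`. The sketch below assumes the standard one-sample formula, t = (sample mean - popmean) / (s / sqrt(n)), and uses the rounded mean, standard deviation, and count from that table, so the last decimals may differ slightly from the scipy output. The variable names are illustrative and not part of the original notebook.

```python
import math

# Summary values for consumo_medio_mes_anterior_ao_upgrade, copied from df.describe()
n = 200
sample_mean = 63.174
sample_std = 11.242137   # sample standard deviation (ddof = 1), as pandas reports it
popmean = 50             # reference value stated in H0

# Standard error of the mean, then the one-sample t statistic
standard_error = sample_std / math.sqrt(n)
t_statistic = (sample_mean - popmean) / standard_error
degrees_of_freedom = n - 1

print(f"t = {t_statistic:.4f}, degrees of freedom = {degrees_of_freedom}")
```

The result should agree with the statistic of about 16.57 reported by `stats.ttest_1samp` above.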

### Question 2: Was there a difference in bandwidth consumption before and after the upgrade, considering the first month after the upgrade?

#### We will use a paired t-test on two related samples
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html

This test is used when we have two samples that are related or dependent. This is a test for the null hypothesis that two related samples have identical (expected) mean values.

Hypotheses:
- H0: The consumption before the upgrade was the same as the consumption after the upgrade (the means are equal).
- H1: The consumption before the upgrade was different from the consumption after the upgrade (the means are not equal).

If the p-value is less than 0.05 we reject H0. Otherwise, we fail to reject H0.

In [13]:
# Run the test
stats.ttest_rel(a = df.consumo_medio_mensal_antes_upgrade, b = df.consumo_medio_primeiro_mes_apos_upgrade)

Ttest_relResult(statistic=-0.8673065458794775, pvalue=0.3868186820914985)

As the p-value is greater than 0.05, we fail to reject the null hypothesis. The data provide no evidence that the average consumption in the first month after the upgrade differed from the average consumption before the upgrade.

In [15]:
print(df.consumo_medio_mensal_antes_upgrade.mean())
print(df.consumo_medio_primeiro_mes_apos_upgrade.mean())

52.23
52.775
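
A paired t-test is equivalent to a one-sample t-test on the per-customer differences against a mean of 0. The sketch below illustrates this with synthetic data, not with the notebook's dataset: the arrays `before` and `after` and their parameters are hypothetical, chosen only to make the example self-contained.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical before/after measurements for the same 30 customers (synthetic data)
rng = np.random.default_rng(seed=42)
before = rng.normal(loc=52, scale=10, size=30)
after = before + rng.normal(loc=0.5, scale=8, size=30)

# Paired t-test on the two related samples
paired = stats.ttest_rel(a=before, b=after)

# One-sample t-test on the per-customer differences against a mean of 0
differences = before - after
one_sample = stats.ttest_1samp(a=differences, popmean=0)

# Both calls should report the same statistic and p-value
print(paired.statistic, one_sample.statistic)
print(paired.pvalue, one_sample.pvalue)
```

This is why the paired test is the appropriate choice here: each customer serves as their own control, so only the within-customer differences matter.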


### Question 3: Did the gender of the customer influence the bandwidth consumption in the first month after the upgrade?

#### We will use a t-test on two independent samples
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html

The independent two-sample t-test compares the means of two independent samples. It tests the null hypothesis that the two samples have identical expected (mean) values. By default, `ttest_ind` assumes that the populations have identical variances; passing `equal_var = False` runs Welch's t-test instead, which does not make that assumption.

Hypotheses:
- H0: Male customer consumption was equal to the female customer consumption in the first month after the upgrade (average consumption was equal between genders).
- H1: Male customer consumption was not equal to female customer consumption in the first month after the upgrade (average consumption was not equal between genders).

If the p-value is less than 0.05 we reject H0. Otherwise, we fail to reject H0.

In [16]:
df.columns

Index(['id_cliente', 'genero', 'canal_atendimento', 'regiao', 'estado_civil',
       'segmento', 'consumo_medio_mensal_antes_upgrade',
       'consumo_medio_mes_anterior_ao_upgrade',
       'consumo_medio_primeiro_mes_apos_upgrade',
       'consumo_medio_segundo_mes_apos_upgrade'],
      dtype='object')

In [17]:
# Separate the samples
consumption_male_customers = df.consumo_medio_primeiro_mes_apos_upgrade[df.genero == 0]
consumption_female_customers = df.consumo_medio_primeiro_mes_apos_upgrade[df.genero == 1]

In [19]:
print(consumption_male_customers.head())
print(consumption_female_customers.head())

0    52
2    33
3    44
4    52
5    52
Name: consumo_medio_primeiro_mes_apos_upgrade, dtype: int64
1     59
92    62
93    44
94    44
95    62
Name: consumo_medio_primeiro_mes_apos_upgrade, dtype: int64


In [20]:
print(consumption_male_customers.mean())
print(consumption_female_customers.mean())

50.120879120879124
54.99082568807339


In [21]:
print(consumption_male_customers.var())
print(consumption_female_customers.var())

106.19633699633701
66.15732246007477


In [22]:
# Run the test
stats.ttest_ind(a = consumption_male_customers, b = consumption_female_customers, equal_var = False)

Ttest_indResult(statistic=-3.6564080478875276, pvalue=0.00034088493594266187)
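
With `equal_var = False`, `ttest_ind` runs Welch's t-test, which gives each group its own variance term. The sketch below recomputes the statistic by hand from the means and variances printed above. One assumption is labeled in the code: the group sizes (91 and 109) are not printed in the notebook and are inferred here from the mean of `genero` (0.545) over 200 rows.

```python
import math

# Group summaries printed in the cells above (first month after the upgrade)
mean_male, var_male = 50.120879120879124, 106.19633699633701
mean_female, var_female = 54.99082568807339, 66.15732246007477

# Assumption: group sizes inferred from mean(genero) = 0.545 over 200 rows
n_male, n_female = 91, 109

# Variance of each group mean (sample variance divided by group size)
var_of_mean_male = var_male / n_male
var_of_mean_female = var_female / n_female

# Welch's t statistic: difference in means over the combined standard error
t_welch = (mean_male - mean_female) / math.sqrt(var_of_mean_male + var_of_mean_female)

# Welch-Satterthwaite approximation for the degrees of freedom
dof_welch = (var_of_mean_male + var_of_mean_female) ** 2 / (
    var_of_mean_male ** 2 / (n_male - 1) + var_of_mean_female ** 2 / (n_female - 1)
)

print(f"t = {t_welch:.4f}, approximate degrees of freedom = {dof_welch:.1f}")
```

The statistic should agree with the value of about -3.656 reported by `stats.ttest_ind` above.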

In [23]:
# One-way ANOVA on the same two groups (it assumes equal variances, so it matches the pooled t-test rather than Welch's)
stats.f_oneway(consumption_male_customers, consumption_female_customers)

F_onewayResult(statistic=13.94330754080599, pvalue=0.0002462546120354903)

Since the p-value is less than 0.05, we reject H0. Thus, we conclude that there was a difference between male and female bandwidth consumption in the first month after the upgrade.
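
With only two groups, one-way ANOVA is equivalent to the pooled-variance t-test (`equal_var = True`), with F = t². This explains why the F statistic above is close to, but not exactly, the square of the Welch statistic. The sketch below checks the relationship from the printed means and variances; as before, the group sizes 91 and 109 are an assumption inferred from the mean of `genero`.

```python
import math

# Group summaries printed in the cells above; sizes are inferred, not printed
mean_male, var_male, n_male = 50.120879120879124, 106.19633699633701, 91
mean_female, var_female, n_female = 54.99082568807339, 66.15732246007477, 109

# Pooled variance: weighted average of the two sample variances
pooled_var = ((n_male - 1) * var_male + (n_female - 1) * var_female) / (
    n_male + n_female - 2
)

# Pooled (equal-variance) t statistic
t_pooled = (mean_male - mean_female) / math.sqrt(
    pooled_var * (1 / n_male + 1 / n_female)
)

# For two groups, the one-way ANOVA F statistic equals the squared pooled t
f_from_t = t_pooled ** 2

print(f"pooled t = {t_pooled:.4f}, t squared = {f_from_t:.4f}")
```

The squared pooled statistic should agree with the F statistic of about 13.94 reported by `stats.f_oneway` above.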

### Question 4: Is there some relation between the region and the customer segment?

#### We will use the chi-square test of independence of variables in a contingency table
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html

The chi-square test of independence checks whether two categorical variables in a contingency table are associated. The function calculates the chi-square statistic and the p-value for the hypothesis test of independence of the observed frequencies in the contingency table. The expected frequencies are calculated from the marginal sums under the assumption of independence.

Hypotheses:
- H0: There is no relation between region and segment.
- H1: There is a relation between region and segment.

If the p-value is less than 0.05 we reject H0. Otherwise, we fail to reject H0.

In [25]:
df.head()

| | id_cliente | genero | canal_atendimento | regiao | estado_civil | segmento | consumo_medio_mensal_antes_upgrade | consumo_medio_mes_anterior_ao_upgrade | consumo_medio_primeiro_mes_apos_upgrade | consumo_medio_segundo_mes_apos_upgrade |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 70 | 0 | 4 | 1 | 1 | 1 | 57 | 49.2 | 52 | 57.2 |
| 1 | 121 | 1 | 4 | 2 | 1 | 3 | 68 | 63.6 | 59 | 64.9 |
| 2 | 86 | 0 | 4 | 3 | 1 | 1 | 44 | 64.8 | 33 | 36.3 |
| 3 | 141 | 0 | 4 | 3 | 1 | 3 | 63 | 56.4 | 44 | 48.4 |
| 4 | 172 | 0 | 4 | 2 | 1 | 2 | 47 | 68.4 | 52 | 57.2 |


In [24]:
# Contingency table
table_count = pd.crosstab(df.segmento, df.regiao, margins = True)
table_count

| segmento \ regiao | 1 | 2 | 3 | All |
|---|---|---|---|---|
| 1 | 16 | 20 | 9 | 45 |
| 2 | 19 | 44 | 42 | 105 |
| 3 | 12 | 31 | 7 | 50 |
| All | 47 | 95 | 58 | 200 |


In [26]:
# Run the test
chi2, p, dof, ex = stats.chi2_contingency(observed = table_count)

In [27]:
# P-value
p

0.055282939487992365

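The p-value is slightly greater than 0.05, so we fail to reject H0: this test gives no evidence of a relation between region and customer segment. Note, however, that `table_count` was built with `margins = True`, so the `All` row and column were passed to the test as if they were categories. That leaves the chi-square statistic unchanged but raises the degrees of freedom from 4 to 9 and inflates the p-value, so the test should be rerun on the table without margins before drawing a final conclusion.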
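
The expected frequencies and degrees of freedom can be worked through by hand on the 3×3 block of observed counts, leaving out the `All` margins. The following is a minimal sketch in plain Python rather than the notebook's own code; the counts are copied from the contingency table above, and the closed-form p-value used here holds only for 4 degrees of freedom.

```python
import math

# Observed counts from the contingency table, without the "All" margins
# (rows: segmento 1-3, columns: regiao 1-3)
observed = [
    [16, 20, 9],
    [19, 44, 42],
    [12, 31, 7],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
chi2_statistic = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2_statistic += (count - expected) ** 2 / expected

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

# For exactly 4 degrees of freedom, the chi-square survival function is
# exp(-x/2) * (1 + x/2); other values of dof need a different formula.
assert dof == 4
half = chi2_statistic / 2
p_value = math.exp(-half) * (1 + half)

print(f"chi2 = {chi2_statistic:.3f}, dof = {dof}, p-value = {p_value:.4f}")
```

On these counts the degrees of freedom are 4, not the 9 implied by a 4×4 table, and the p-value falls below 0.05. Building the table with `pd.crosstab(df.segmento, df.regiao)` (no margins) and passing it to `stats.chi2_contingency` should perform the same calculation.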

# The End