# **Hypothesis Testing**

Hypothesis testing is a core part of statistical inference. It uses sample data to decide whether there is enough evidence to reject a stated assumption (the null hypothesis) about a population. In this notebook, each hypothesis about customer churn is checked with a chi-squared test of independence: the null hypothesis is that churn is independent of a given customer attribute, and a small p-value is evidence against that independence.
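As a minimal sketch of what such a test returns, the example below runs `chi2_contingency` on a small made-up 2x2 table (the counts are hypothetical and are not taken from the churn data). Besides the statistic and the p-value, the function also returns the degrees of freedom and the counts that would be expected if the two variables were independent.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table of counts (rows: churn 0/1, columns: group A/B).
observed = np.array([[90, 60],
                     [10, 40]])

# chi2_contingency returns the statistic, the p-value, the degrees of
# freedom, and the counts expected if rows and columns were independent.
chi2_stat, p_value, dof, expected = chi2_contingency(observed)

print(f"chi2 = {chi2_stat:.2f}, p = {p_value:.4g}, dof = {dof}")
print("expected counts under independence:")
print(expected)
```

For 2x2 tables, `chi2_contingency` applies Yates' continuity correction by default, so the statistic is slightly smaller than the uncorrected Pearson value.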

In [3]:
#Import google drive
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

Mounted at /content/drive


In [1]:
# Import libraries
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

In [4]:
#Load the data
train = pd.read_csv('/content/drive/MyDrive/DAP- Capstone Project/Data/Train.csv')

In [5]:
#overview of the data
train.head()

Unnamed: 0,user_id,REGION,TENURE,MONTANT,FREQUENCE_RECH,REVENUE,ARPU_SEGMENT,FREQUENCE,DATA_VOLUME,ON_NET,ORANGE,TIGO,ZONE1,ZONE2,MRG,REGULARITY,TOP_PACK,FREQ_TOP_PACK,CHURN
0,7ee9e11e342e27c70455960acc80d3f91c1286d1,DAKAR,K > 24 month,20000.0,47.0,21602.0,7201.0,52.0,8835.0,3391.0,396.0,185.0,,,NO,62,On net 200F=Unlimited _call24H,30.0,0
1,50443f42bdc92b10388fc56e520e4421a5fa655c,,K > 24 month,,,,,,,,,,,,NO,3,,,0
2,da90b5c1a9b204c186079f89969aa01cb03c91b2,,K > 24 month,,,,,,,,,,,,NO,1,,,0
3,364ec1b424cdc64c25441a444a16930289a0051e,SAINT-LOUIS,K > 24 month,7900.0,19.0,7896.0,2632.0,25.0,9385.0,27.0,46.0,20.0,,2.0,NO,61,"Data:490F=1GB,7d",7.0,0
4,d5a5247005bc6d41d3d99f4ef312ebb5f640f2cb,DAKAR,K > 24 month,12350.0,21.0,12351.0,4117.0,29.0,9360.0,66.0,102.0,34.0,,,NO,56,All-net 500F=2000F;5d,11.0,0


## **Hypothesis 1**

**Customers with higher average revenue per user (ARPU) are less likely to churn.**

In [6]:
# Calculate the mean ARPU
mean_arpu = train['ARPU_SEGMENT'].mean()
mean_arpu

1835.3559605915466

In [9]:
# Create a column to categorize ARPU as high or low based on the mean.
# Note: rows with a missing ARPU_SEGMENT compare as False and fall into 'Low'.
train['ARPU_group'] = np.where(train['ARPU_SEGMENT'] >= mean_arpu, 'High', 'Low')

In [10]:
#Look at the changes
train.head()

Unnamed: 0,user_id,REGION,TENURE,MONTANT,FREQUENCE_RECH,REVENUE,ARPU_SEGMENT,FREQUENCE,DATA_VOLUME,ON_NET,ORANGE,TIGO,ZONE1,ZONE2,MRG,REGULARITY,TOP_PACK,FREQ_TOP_PACK,CHURN,ARPU_group
0,7ee9e11e342e27c70455960acc80d3f91c1286d1,DAKAR,K > 24 month,20000.0,47.0,21602.0,7201.0,52.0,8835.0,3391.0,396.0,185.0,,,NO,62,On net 200F=Unlimited _call24H,30.0,0,High
1,50443f42bdc92b10388fc56e520e4421a5fa655c,,K > 24 month,,,,,,,,,,,,NO,3,,,0,Low
2,da90b5c1a9b204c186079f89969aa01cb03c91b2,,K > 24 month,,,,,,,,,,,,NO,1,,,0,Low
3,364ec1b424cdc64c25441a444a16930289a0051e,SAINT-LOUIS,K > 24 month,7900.0,19.0,7896.0,2632.0,25.0,9385.0,27.0,46.0,20.0,,2.0,NO,61,"Data:490F=1GB,7d",7.0,0,High
4,d5a5247005bc6d41d3d99f4ef312ebb5f640f2cb,DAKAR,K > 24 month,12350.0,21.0,12351.0,4117.0,29.0,9360.0,66.0,102.0,34.0,,,NO,56,All-net 500F=2000F;5d,11.0,0,High


In [11]:
# Create a contingency table
contingency_table = pd.crosstab(train['CHURN'], train['ARPU_group'])

In [33]:
# Let's look at our contingency table
contingency_table

| CHURN \ Recharge_frequency_group | Higher | Lower |
|---|---|---|
| 0 | 232770 | 642261 |
| 1 | 3187 | 198806 |


In [14]:
# Perform chi-squared test
chi2, p, _, _ = chi2_contingency(contingency_table)

In [16]:
# Print the results
print(f'Mean ARPU: {mean_arpu}')
print(f'Chi-Squared Value: {chi2}')
print(f'P-Value: {p}')

Mean ARPU: 1835.3559605915466
Chi-Squared Value: 58480.70281614405
P-Value: 0.0


**Chi-Squared Value:** The chi-squared statistic is 58480.70, which is very large for a 2x2 table (1 degree of freedom). It indicates a strong departure from what independence between ARPU group and churn would predict in this dataset.

**P-Value:** The p-value is reported as 0.0, meaning it is too small to be represented in floating-point arithmetic and is far below the conventional 0.05 significance level. This is strong evidence against the null hypothesis of independence: an association this large between ARPU group and churn is very unlikely to arise by chance alone.

**Conclusion:** The test shows a statistically significant association between ARPU group and churn in this dataset, so the two groups have different churn rates. Customers with higher ARPU are less likely to churn than customers with lower ARPU; this direction is read from the churn rates in the contingency table, since the chi-squared statistic only shows that the rates differ. Customers with a missing ARPU_SEGMENT are counted in the Low group, because a comparison against NaN evaluates to false.
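One detail of the grouping step is worth checking. A comparison against `NaN` evaluates to `False`, so `np.where` places customers with a missing `ARPU_SEGMENT` in the `Low` group (rows 1 and 2 in the preview above). The sketch below uses a small made-up frame to show that behaviour, along with one possible alternative that leaves missing values unassigned so that `pd.crosstab` drops them. The frame `toy` and the column `ARPU_group_strict` are hypothetical names used only for illustration, and the alternative is an assumption rather than the notebook's method.

```python
import numpy as np
import pandas as pd

# Made-up data: two customers have no ARPU value recorded.
toy = pd.DataFrame({
    "ARPU_SEGMENT": [7201.0, np.nan, np.nan, 2632.0, 500.0],
    "CHURN": [0, 0, 1, 0, 1],
})
threshold = toy["ARPU_SEGMENT"].mean()  # mean() skips NaN values

# Same pattern as the notebook: NaN >= threshold is False, so NaN -> "Low".
toy["ARPU_group"] = np.where(toy["ARPU_SEGMENT"] >= threshold, "High", "Low")

# Alternative (an assumption, not the notebook's method): keep NaN as missing.
toy["ARPU_group_strict"] = toy["ARPU_group"].where(toy["ARPU_SEGMENT"].notna())

# The first table counts all 5 rows; the second drops the 2 rows with NaN.
print(pd.crosstab(toy["CHURN"], toy["ARPU_group"]))
print(pd.crosstab(toy["CHURN"], toy["ARPU_group_strict"]))
```

Whether missing values belong in the low group is a modelling decision; the point of the sketch is only that the choice should be explicit.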

## **Hypothesis 2**

**Customers who have been with the telecom company for a longer duration (higher regularity) are less likely to churn**

In [18]:
# Calculate the mean Regularity
mean_regularity = train['REGULARITY'].mean()
mean_regularity

28.044495758683187

In [19]:
# Create a column to categorize customers as "longer duration" or "shorter duration"
train['regularity_group'] = np.where(train['REGULARITY'] > mean_regularity, 'Longer', 'Shorter')

In [20]:
# Let's look at the new table
train.head()

Unnamed: 0,user_id,REGION,TENURE,MONTANT,FREQUENCE_RECH,REVENUE,ARPU_SEGMENT,FREQUENCE,DATA_VOLUME,ON_NET,...,TIGO,ZONE1,ZONE2,MRG,REGULARITY,TOP_PACK,FREQ_TOP_PACK,CHURN,ARPU_group,regularity_group
0,7ee9e11e342e27c70455960acc80d3f91c1286d1,DAKAR,K > 24 month,20000.0,47.0,21602.0,7201.0,52.0,8835.0,3391.0,...,185.0,,,NO,62,On net 200F=Unlimited _call24H,30.0,0,High,Longer
1,50443f42bdc92b10388fc56e520e4421a5fa655c,,K > 24 month,,,,,,,,...,,,,NO,3,,,0,Low,Shorter
2,da90b5c1a9b204c186079f89969aa01cb03c91b2,,K > 24 month,,,,,,,,...,,,,NO,1,,,0,Low,Shorter
3,364ec1b424cdc64c25441a444a16930289a0051e,SAINT-LOUIS,K > 24 month,7900.0,19.0,7896.0,2632.0,25.0,9385.0,27.0,...,20.0,,2.0,NO,61,"Data:490F=1GB,7d",7.0,0,High,Longer
4,d5a5247005bc6d41d3d99f4ef312ebb5f640f2cb,DAKAR,K > 24 month,12350.0,21.0,12351.0,4117.0,29.0,9360.0,66.0,...,34.0,,,NO,56,All-net 500F=2000F;5d,11.0,0,High,Longer


In [21]:
# Create a contingency table
contingency_table1 = pd.crosstab(train['CHURN'], train['regularity_group'])

In [32]:
# Overview of the new table
contingency_table1

| CHURN \ regularity_group | Longer | Shorter |
|---|---|---|
| 0 | 486565 | 388466 |
| 1 | 7428 | 194565 |


In [40]:
# Perform chi-squared test
chi3, p1, _, _ = chi2_contingency(contingency_table1)

In [42]:
#results
print(f'Duration Threshold: {mean_regularity} months')
print(f'Chi-Squared Value: {chi3}')
print(f'P-Value: {p1}')

Duration Threshold: 28.044495758683187 months
Chi-Squared Value: 178226.60039907804
P-Value: 0.0


**Chi-Squared Value:** The chi-squared statistic is 178226.60, which is very large for a 2x2 table (1 degree of freedom). It indicates a strong departure from what independence between the REGULARITY group and churn would predict in this dataset.

**P-Value:** The p-value is reported as 0.0, meaning it is too small to be represented in floating-point arithmetic and is far below the conventional 0.05 significance level. This is strong evidence against the null hypothesis of independence: an association this large between the REGULARITY group and churn is very unlikely to arise by chance alone.

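**Conclusion:** The test shows a statistically significant association between the REGULARITY group and churn in this dataset. The contingency table gives the direction: about 1.5% of customers in the Longer group churned (7,428 of 493,993), compared with about 33.4% in the Shorter group (194,565 of 583,031). Customers with REGULARITY above the mean of about 28.04 are therefore much less likely to churn. This test uses REGULARITY as a stand-in for duration with the company. The dataset also has a separate TENURE column that records time on the network and is not tested here, so the "months" unit in the printed threshold is an assumption rather than something the data confirms.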
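The churn rate per group can be read directly from the contingency table. The sketch below rebuilds the table from the counts shown above (so it runs without the original CSV) and divides the churned row by each column total. The names `table` and `churn_rate` are illustrative.

```python
import pandas as pd

# Counts copied from the regularity contingency table shown above.
table = pd.DataFrame(
    {"Longer": [486565, 7428], "Shorter": [388466, 194565]},
    index=pd.Index([0, 1], name="CHURN"),
)

# Divide the churned row by each column total to get the churn share per group.
churn_rate = table.loc[1] / table.sum(axis=0)
print(churn_rate.round(4))
```

On the full frame, `pd.crosstab(train['CHURN'], train['regularity_group'], normalize='columns')` should give the same proportions in one call.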

## **Hypothesis 3**

**Customers who frequently recharge (higher FREQUENCE_RECH) are less likely to churn**

In [28]:
# Calculate the mean FREQUENCE_RECH
mean_frequency = train['FREQUENCE_RECH'].mean()
mean_frequency

11.523755648018492

In [29]:
# Create a column to categorize customers as "higher frequency" or "lower frequency"
train['Recharge_frequency_group'] = np.where(train['FREQUENCE_RECH'] > mean_frequency, 'Higher', 'Lower')

In [34]:
# Create a contingency table
contingency_table2 = pd.crosstab(train['CHURN'], train['Recharge_frequency_group'])


In [35]:
contingency_table2

| CHURN \ Recharge_frequency_group | Higher | Lower |
|---|---|---|
| 0 | 232770 | 642261 |
| 1 | 3187 | 198806 |


In [39]:
# Perform chi-squared test
chi4, p2, _, _ = chi2_contingency(contingency_table2)

In [43]:
print(f'Frequency Threshold: {mean_frequency}')
print(f'Chi-Squared Value: {chi4}')
print(f'P-Value: {p2}')

Frequency Threshold: 11.523755648018492
Chi-Squared Value: 60063.342514069256
P-Value: 0.0


**Chi-Squared Value:** The chi-squared statistic is 60063.34, which is very large for a 2x2 table (1 degree of freedom). It indicates a strong departure from what independence between recharge frequency group (FREQUENCE_RECH) and churn would predict in this dataset.

**P-Value:** The p-value is reported as 0.0, meaning it is too small to be represented in floating-point arithmetic and is far below conventional significance levels (e.g., 0.05). This is strong evidence against the null hypothesis of independence: an association this large between recharge frequency group and churn is very unlikely to arise by chance alone.

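**Conclusion:** The test shows a statistically significant association between recharge frequency group and churn in this dataset. The contingency table gives the direction: about 1.4% of customers in the Higher group churned (3,187 of 235,957), compared with about 23.6% in the Lower group (198,806 of 841,067). Customers who recharge more often than the mean of about 11.52 are therefore less likely to churn. Customers with a missing FREQUENCE_RECH are counted in the Lower group, because a comparison against NaN evaluates to false.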
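With more than a million rows, even a weak association produces a very small p-value, so an effect size is a useful complement to the test. The sketch below computes Cramér's V from the recharge-frequency counts shown above. Cramér's V is not part of the original analysis and is added here as an assumption about what a reader might want next. It is usually defined on the uncorrected statistic, while this sketch uses the continuity-corrected value that `chi2_contingency` returns by default; at this sample size the difference is negligible.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts copied from the recharge-frequency contingency table above
# (rows: CHURN 0/1, columns: Higher/Lower).
observed = np.array([[232770, 642261],
                     [3187, 198806]])

chi2_stat, p_value, dof, expected = chi2_contingency(observed)

# Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1))); ranges from 0 to 1.
n = observed.sum()
cramers_v = np.sqrt(chi2_stat / (n * (min(observed.shape) - 1)))
print(f"Cramér's V = {cramers_v:.3f}")
```

Values near 0 indicate a weak association and values near 1 a strong one, regardless of sample size.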

## **Hypothesis 4**

**Customers who have a top service package (TOP_PACK) are less likely to churn**

In [45]:
# Count how many customers have each service package
unique_counts = train['TOP_PACK'].value_counts()

In [47]:
# Let's list the top 5 packages
top_5_entries = unique_counts.head(5)
top_5_entries

All-net 500F=2000F;5d             158779
On net 200F=Unlimited _call24H     76302
Data:490F=1GB,7d                   57679
Data: 100 F=40MB,24H               42515
Mixt 250F=Unlimited_call24H        33620
Name: TOP_PACK, dtype: int64

In [48]:
# Define the top service package
top_service_package = 'All-net 500F=2000F;5d'

In [49]:
# Create a column to categorize customers as "top package" or "not top package"
train['Top_Package_group'] = train['TOP_PACK'].apply(lambda x: 'Top Package' if x == top_service_package else 'Not Top Package')

In [51]:
#Overview
train.head()

Unnamed: 0,user_id,REGION,TENURE,MONTANT,FREQUENCE_RECH,REVENUE,ARPU_SEGMENT,FREQUENCE,DATA_VOLUME,ON_NET,...,ZONE2,MRG,REGULARITY,TOP_PACK,FREQ_TOP_PACK,CHURN,ARPU_group,regularity_group,Recharge_frequency_group,Top_Package_group
0,7ee9e11e342e27c70455960acc80d3f91c1286d1,DAKAR,K > 24 month,20000.0,47.0,21602.0,7201.0,52.0,8835.0,3391.0,...,,NO,62,On net 200F=Unlimited _call24H,30.0,0,High,Longer,Higher,Not Top Package
1,50443f42bdc92b10388fc56e520e4421a5fa655c,,K > 24 month,,,,,,,,...,,NO,3,,,0,Low,Shorter,Lower,Not Top Package
2,da90b5c1a9b204c186079f89969aa01cb03c91b2,,K > 24 month,,,,,,,,...,,NO,1,,,0,Low,Shorter,Lower,Not Top Package
3,364ec1b424cdc64c25441a444a16930289a0051e,SAINT-LOUIS,K > 24 month,7900.0,19.0,7896.0,2632.0,25.0,9385.0,27.0,...,2.0,NO,61,"Data:490F=1GB,7d",7.0,0,High,Longer,Higher,Not Top Package
4,d5a5247005bc6d41d3d99f4ef312ebb5f640f2cb,DAKAR,K > 24 month,12350.0,21.0,12351.0,4117.0,29.0,9360.0,66.0,...,,NO,56,All-net 500F=2000F;5d,11.0,0,High,Longer,Higher,Top Package


In [52]:
# Create a contingency table
contingency_table3 = pd.crosstab(train['CHURN'], train['Top_Package_group'])


In [53]:
#table overview
contingency_table3

| CHURN \ Top_Package_group | Not Top Package | Top Package |
|---|---|---|
| 0 | 721477 | 153554 |
| 1 | 196768 | 5225 |


In [54]:
# Perform chi-squared test
chi5, p3, _, _ = chi2_contingency(contingency_table3)

In [55]:
print(f'Top Service Package: {top_service_package}')
print(f'Chi-Squared Value: {chi5}')
print(f'P-Value: {p3}')

Top Service Package: All-net 500F=2000F;5d
Chi-Squared Value: 29226.495784757
P-Value: 0.0


**Chi-Squared Value:** The chi-squared statistic is 29226.50, which is very large for a 2x2 table (1 degree of freedom). It indicates a strong departure from what independence between having the most common service package, "All-net 500F=2000F;5d", and churn would predict in this dataset.

**P-Value:** The p-value is reported as 0.0, meaning it is too small to be represented in floating-point arithmetic and is far below conventional significance levels (e.g., 0.05). This is strong evidence against the null hypothesis of independence: an association this large between having the top service package and churn is very unlikely to arise by chance alone.

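**Conclusion:** The test shows a statistically significant association between having the top service package "All-net 500F=2000F;5d" and churn in this dataset. Here "top" means the most frequently held package. The contingency table gives the direction: about 3.3% of customers with this package churned (5,225 of 158,779), compared with about 21.4% of all other customers (196,768 of 918,245). Customers who have this package are therefore less likely to churn. The Not Top Package group also includes customers with no TOP_PACK recorded.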
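The same three steps (cross-tabulate, test, compare churn rates) are repeated for every hypothesis, so they can be wrapped in a small helper. The sketch below is one possible way to do that. The function name `chi2_by_group` and the randomly generated frame `demo` are hypothetical and only illustrate the pattern; the helper also assumes the target column is coded 0/1.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def chi2_by_group(df, group_col, target_col="CHURN"):
    """Cross-tabulate target vs group and run a chi-squared test."""
    table = pd.crosstab(df[target_col], df[group_col])
    chi2_stat, p_value, dof, _ = chi2_contingency(table)
    # Share of target == 1 within each group, to read off the direction.
    rate = table.loc[1] / table.sum(axis=0)
    return {"table": table, "chi2": chi2_stat, "p": p_value, "dof": dof, "rate": rate}


# Made-up data: the "Top Package" group churns less often by construction.
rng = np.random.default_rng(0)
group = rng.choice(["Top Package", "Not Top Package"], size=2000)
churn_prob = np.where(group == "Top Package", 0.05, 0.25)
demo = pd.DataFrame({
    "Top_Package_group": group,
    "CHURN": (rng.random(2000) < churn_prob).astype(int),
})

result = chi2_by_group(demo, "Top_Package_group")
print(result["rate"].round(3))
print(f"chi2 = {result['chi2']:.1f}, p = {result['p']:.3g}")
```

On the real data, a call such as `chi2_by_group(train, 'Top_Package_group')` would return the table, the test result, and the per-group churn rates together.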

## **Hypothesis 5**

**Customers with higher data usage (DATA_VOLUME) are less likely to churn**

In [56]:
# Calculate the mean DATA_VOLUME
mean_data = train['DATA_VOLUME'].mean()
mean_data

3368.8017216648

In [57]:
# Create a column to categorize customers as "higher data usage" or "lower data usage"
train['Data_Usage_group'] = train['DATA_VOLUME'].apply(lambda x: 'Higher Data Usage' if x > mean_data else 'Lower Data Usage')


In [58]:
# Create a contingency table
contingency_table4 = pd.crosstab(train['CHURN'], train['Data_Usage_group'])

In [59]:
#table overview
contingency_table4

| CHURN \ Data_Usage_group | Higher Data Usage | Lower Data Usage |
|---|---|---|
| 0 | 121688 | 753343 |
| 1 | 4328 | 197665 |


In [60]:
# Perform chi-squared test
chi6, p4, _, _ = chi2_contingency(contingency_table4)

In [61]:
print(f'Data Usage Threshold: {mean_data}')
print(f'Chi-Squared Value: {chi6}')
print(f'P-Value: {p4}')

Data Usage Threshold: 3368.8017216648
Chi-Squared Value: 21981.99762031759
P-Value: 0.0


**Chi-Squared Value:** The chi-squared statistic is 21982.00, which is very large for a 2x2 table (1 degree of freedom). It indicates a strong departure from what independence between data usage group (DATA_VOLUME) and churn would predict in this dataset.

**P-Value:** The p-value is reported as 0.0, meaning it is too small to be represented in floating-point arithmetic and is far below conventional significance levels (e.g., 0.05). This is strong evidence against the null hypothesis of independence: an association this large between data usage group and churn is very unlikely to arise by chance alone.

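**Conclusion:** The test shows a statistically significant association between data usage group (DATA_VOLUME) and churn in this dataset. The contingency table gives the direction: about 3.4% of customers in the Higher Data Usage group churned (4,328 of 126,016), compared with about 20.8% in the Lower Data Usage group (197,665 of 951,008). Customers with data usage above the mean of about 3368.80 are therefore less likely to churn. Customers with a missing DATA_VOLUME are counted in the Lower Data Usage group, because a comparison against NaN evaluates to false.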
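A printed p-value of 0.0 is a floating-point underflow, not an exact zero. For a 2x2 table (1 degree of freedom) the size of the p-value can still be recovered on a log scale, because a chi-squared variable with 1 degree of freedom is the square of a standard normal variable. The sketch below uses that identity with `scipy.special.log_ndtr`. The helper name `log_p_value_1dof` is hypothetical, and the approach is an illustration rather than part of the original analysis.

```python
import numpy as np
from scipy.special import log_ndtr
from scipy.stats import chi2


def log_p_value_1dof(chi2_stat):
    """Natural log of the chi-squared p-value for 1 degree of freedom.

    A chi-squared variable with 1 degree of freedom is the square of a
    standard normal Z, so P(Z**2 > x) = 2 * P(Z < -sqrt(x)).
    """
    return np.log(2.0) + log_ndtr(-np.sqrt(chi2_stat))


stat = 21981.99762031759  # statistic printed above for DATA_VOLUME
print(chi2.sf(stat, 1))                      # direct p-value underflows to 0.0
print(log_p_value_1dof(stat) / np.log(10))   # log10 of the p-value
```

Reporting the result as "p < 0.001" or on a log scale avoids suggesting that the probability is exactly zero.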