# Statistical Analysis

In this notebook we analyse our data using statistical measures.
Tasks to be performed:
- Descriptive statistics for both numerical and categorical data
- Hypothesis testing 

In [1]:
# importing libraries
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency,chisquare
from scipy.stats import shapiro
from statsmodels.stats import weightstats as test
import scipy.stats as stats

import warnings
warnings.filterwarnings('ignore')

In [2]:
# importing data
data = pd.read_csv('dropped.csv')

- Rows containing 'unknown' values were dropped before this analysis.

In [3]:
# checking the shape
data.shape

(27494, 29)

## `Descriptive Statistics`
(Overview of the data)

### 1. For Numerical data

In [4]:
data.describe()

| | Customer_id | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | age | Postal Code | Region_Code | duration | campaign | pdays | previous |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 27494.0 | 27494.0 | 27494.0 | 27494.0 | 27494.0 | 27494.0 | 27494.0 | 27494.0 | 27494.0 | 27494.0 | 27494.0 | 27494.0 | 27494.0 |
| mean | 19739.850331 | -0.068353 | 93.52436 | -40.608318 | 3.462257 | 5160.913687 | 39.060413 | 55236.138066 | 2.576198 | 259.44457 | 2.525387 | 956.527097 | 0.194188 |
| std | 10651.990621 | 1.607475 | 0.584561 | 4.778214 | 1.776449 | 75.078548 | 10.343971 | 32057.09284 | 1.161994 | 261.45828 | 2.719997 | 200.930589 | 0.523907 |
| min | 1.0 | -3.4 | 92.201 | -50.8 | 0.634 | 4963.6 | 17.0 | 1040.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 25% | 10812.25 | -1.8 | 93.075 | -42.7 | 1.313 | 5099.1 | 31.0 | 23223.0 | 2.0 | 103.0 | 1.0 | 999.0 | 0.0 |
| 50% | 20503.0 | 1.1 | 93.444 | -41.8 | 4.856 | 5191.0 | 37.0 | 56301.0 | 2.0 | 181.0 | 2.0 | 999.0 | 0.0 |
| 75% | 28987.25 | 1.4 | 93.994 | -36.4 | 4.961 | 5228.1 | 45.75 | 90008.0 | 4.0 | 321.0 | 3.0 | 999.0 | 0.0 |
| max | 37084.0 | 1.4 | 94.767 | -26.9 | 5.045 | 5228.1 | 95.0 | 99301.0 | 4.0 | 4918.0 | 43.0 | 999.0 | 7.0 |


#### **Inference:**
**Customer Demographics:**
- The median age of customers is 37 (mean about 39), with a minimum age of 17 and a maximum age of 95.
- The middle 50% of customers are between 31 and roughly 46 years old (25th to 75th percentile).
- The age range is wide, but customers are concentrated in their thirties and forties.

**Employment and Economic Indicators:**
- The mean employment variation rate is close to zero (-0.07), but the rate ranges from -3.4 to 1.4 with a standard deviation of 1.61 and a median of 1.1. The near-zero mean therefore hides considerable variation in employment conditions over the observed period.
- The mean consumer price index is about 93.52 with a small standard deviation (0.58), which suggests fairly stable consumer prices.
- The consumer confidence index ranges from -50.8 to -26.9 (standard deviation 4.78), indicating fluctuations in consumer sentiment.

**Current Campaign Insights:**
- The mean `duration` is about 259 seconds (median 181), with a maximum of 4918 seconds. The distribution is strongly right-skewed: a few values are far longer than the typical one. Since the values are in seconds, this variable most likely measures the length of a contact rather than of a whole campaign.
- The number of contacts during the campaign (`campaign`) has a median of 2 and a mean of about 2.5, with a minimum of 1 and a maximum of 43. This gives the typical outreach frequency during the campaign.

**Previous Marketing Interactions:**
- The number of contacts performed before this campaign (`previous`) has a median of 0 and a mean of 0.19, with a maximum of 7. Most customers were not contacted before the current campaign.

**Financial Market Indicators:**
- The mean three-month Euribor rate is 3.46, ranging from 0.634 to 5.045. Euribor is the average interest rate at which a large panel of European banks borrow funds from one another, so this spread suggests changing economic conditions during the campaign periods and makes the rate a useful indicator of financial market conditions.

**Postal Code and Region Information:**
- Postal codes range from 1040 to 99301, which suggests that customers are spread across many areas. Note that `Postal Code`, `Region_Code` and `Customer_id` are identifiers, so their means, medians and standard deviations have no quantitative meaning.
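One way to keep identifier-like columns out of the numeric summary is sketched below. The small DataFrame is made-up illustrative data, not the contents of `dropped.csv`; only the column names are borrowed from the output above, and the `id_like` list is our assumption about which columns are identifiers.

```python
import pandas as pd

# Made-up rows for illustration only; column names mirror the notebook's data.
toy = pd.DataFrame({
    "Customer_id": [1, 2, 3, 4, 5],
    "Postal Code": [1040, 23223, 56301, 90008, 99301],
    "Region_Code": [1, 2, 2, 4, 4],
    "age": [17, 31, 37, 46, 95],
    "duration": [0, 103, 181, 321, 4918],
})

# Assumption: these columns are identifiers or codes, not measurements.
id_like = ["Customer_id", "Postal Code", "Region_Code"]

# Summarise only the genuinely numeric measurements.
summary = toy.drop(columns=id_like).describe()

# A mean well above the median is a quick hint of right skew.
skew_hint = summary.loc["mean"] - summary.loc["50%"]

print(summary)
print(skew_hint)
```

On the real data the same pattern would be `data.drop(columns=id_like).describe()`.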

### 2. For Categorical data

In [5]:
data.describe(include=['object'])

| | job | marital | education | default | housing | loan | State_Code | City_Code | City_Name | State_Name | Region_Name | contact | month | day_of_week | poutcome | Customer_Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 27494 | 27494 | 27494 | 27494 | 27494 | 27494 | 27494 | 27494 | 27494 | 27494 | 27494 | 27494 | 27494 | 27494 | 27494 | 27494 |
| unique | 11 | 3 | 7 | 2 | 2 | 2 | 49 | 531 | 531 | 49 | 4 | 2 | 10 | 5 | 3 | 2 |
| top | admin. | married | university.degree | no | yes | no | S2 | C21 | New York City | California | West | cellular | may | thu | nonexistent | no |
| freq | 7898 | 15801 | 9387 | 27492 | 14927 | 23199 | 5526 | 2535 | 2535 | 5526 | 8862 | 18433 | 8809 | 5781 | 23307 | 23995 |


**Customer Demographics:**

- A majority of customers in the dataset are married (15,801 of 27,494, about 57%), and the most common job category is administrative (7,898 occurrences, about 29%).
- Targeted marketing campaigns could be designed to appeal to the needs and preferences of married individuals and those in administrative professions.

**Educational Background:**

- A large number of customers have a university degree (9,387 occurrences).
- The bank may consider offering educational resources.

**Credit and Loan Status:**

- Almost all customers have no credit in default (27,492 of 27,494), a majority have a housing loan (14,927 occurrences, about 54%), and most do not have a personal loan (23,199 occurrences).
- The bank's customer base therefore has a good credit history and leans towards housing loans rather than personal loans.

**Geographic Distribution:**

- Customers are distributed across various states and cities.
- Regional marketing strategies can be implemented to account for differences in customer needs and preferences in different locations.

**Contact and Response Patterns:**

- Cellular communication is the most common method (18,433 occurrences), and May is the most frequent month for contact (8,809 occurrences).

**Outcome of Previous Campaigns:**

- The majority of customers have a 'nonexistent' outcome for previous campaigns (23,307 occurrences), which most likely means they were not part of a previous campaign.
- Future campaigns may benefit from analyzing and learning from the patterns associated with successful past campaigns. Strategies can be adjusted to improve engagement with customers who have not been previously contacted.

**Customer Response to Term Deposit:**

- The target variable 'Customer_Response' indicates whether the client subscribed to a term deposit. The majority responded with 'no' (23,995 of 27,494, about 87%), so the two classes are clearly imbalanced.
- Understanding the factors influencing customer responses can guide future marketing efforts.
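The shares quoted for the most frequent categories can be reproduced with a few lines of arithmetic. This is a minimal sketch that uses only the `count` and `freq` values printed in the table above; the dictionary labels are ours.

```python
# Counts copied from the describe(include=['object']) output above.
n_rows = 27494
top_freq = {
    "marital = married": 15801,
    "job = admin.": 7898,
    "education = university.degree": 9387,
    "housing = yes": 14927,
    "contact = cellular": 18433,
    "poutcome = nonexistent": 23307,
    "Customer_Response = no": 23995,
}

# Share of all rows taken by the most frequent category of each column.
top_share = {label: freq / n_rows for label, freq in top_freq.items()}

for label, share in top_share.items():
    print(f"{label}: {share:.1%}")
```

On the full data, `data['marital'].value_counts(normalize=True)` gives the shares of every category rather than only the most frequent one.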

Note: Other analyses, such as skewness, kurtosis and the distribution of the data, are done in the univariate analysis of the EDA (task4).

## `Hypothesis Testing`

### `Non parametric: Chi square test of association/independence`
### To understand the relationship between marital status and Customer Response

- Null: There is no relationship between marital status and Customer Response. (independent)
- Alternate: There is a relationship between marital status and Customer Response. (dependent)

In [6]:
#Contingency table : 
obs=pd.crosstab(data['marital'],data['Customer_Response'])
obs

| marital | Customer_Response = no | Customer_Response = yes |
|---|---|---|
| divorced | 2827 | 370 |
| married | 13944 | 1857 |
| single | 7224 | 1272 |


In [7]:
chi_sq_Stat, p_value, deg_freedom, exp_freq=chi2_contingency(obs)
print('Chi-square statistic %3.5f P value %1.6f Degrees of freedom %d'
       %(chi_sq_Stat, p_value,deg_freedom))

Chi-square statistic 55.88770 P value 0.000000 Degrees of freedom 2


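- The p-value is below 0.000001 (it prints as 0.000000 because of rounding)
- p-value < 0.05 (alpha)
- Reject H0

- **There is statistically significant evidence that marital status and Customer Response are associated (not independent).**
- However, the test does not tell us how strong the association is or whether marital status actually influences the Customer Response; that needs further analysis.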
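With more than 27,000 rows, even a weak association produces a tiny p-value, so an effect size is a useful companion to the test. The sketch below computes Cramér's V from the observed counts shown above; it is an addition for illustration and not part of the original analysis.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Observed counts copied from the marital x Customer_Response table above.
obs = pd.DataFrame(
    {"no": [2827, 13944, 7224], "yes": [370, 1857, 1272]},
    index=["divorced", "married", "single"],
)

chi2_stat, p_value, dof, expected = chi2_contingency(obs)

# Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1))), ranging from 0 to 1.
n = obs.to_numpy().sum()
k = min(obs.shape) - 1
cramers_v = np.sqrt(chi2_stat / (n * k))

print(f"chi2 = {chi2_stat:.3f}, dof = {dof}, Cramér's V = {cramers_v:.3f}")
```

A value close to 0 points to a weak association even when the test itself is highly significant.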

### To understand the relationship between contact medium and Customer Response

- Null: There is no relationship between contact medium and Customer Response. (independent)
- Alternate: There is a relationship between contact medium and Customer Response. (dependent)

In [8]:
#Contingency table : 
obs=pd.crosstab(data['contact'],data['Customer_Response'])
obs

| contact | Customer_Response = no | Customer_Response = yes |
|---|---|---|
| cellular | 15461 | 2972 |
| telephone | 8534 | 527 |


In [9]:
chi_sq_Stat, p_value, deg_freedom, exp_freq=chi2_contingency(obs)
print('Chi-square statistic %3.5f P value %1.6f Degrees of freedom %d'
       %(chi_sq_Stat, p_value,deg_freedom))

Chi-square statistic 580.13136 P value 0.000000 Degrees of freedom 1


- The p-value is below 0.000001 (it prints as 0.000000 because of rounding)
- p-value < 0.05 (alpha)
- Reject H0
- **There is statistically significant evidence that contact medium and Customer Response are associated (not independent).**
- However, the test alone does not show whether the contact medium itself influences the Customer Response; that needs further analysis.
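To see the direction of the association, it helps to convert the contingency table to row-wise response rates. The sketch below does this for the contact counts shown above.

```python
import pandas as pd

# Observed counts copied from the contact x Customer_Response table above.
obs = pd.DataFrame(
    {"no": [15461, 8534], "yes": [2972, 527]},
    index=pd.Index(["cellular", "telephone"], name="contact"),
)

# Divide each row by its total to get the response rate per contact medium.
row_rates = obs.div(obs.sum(axis=1), axis=0)

print(row_rates.round(3))
```

Roughly 16% of cellular contacts ended in a subscription against roughly 6% of telephone contacts. On the full data, `pd.crosstab(data['contact'], data['Customer_Response'], normalize='index')` should give the same rates directly.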

### To understand the relationship between education and Customer Response

- Null: There is no relationship between education and Customer Response. (independent)
- Alternate: There is a relationship between education and Customer Response. (dependent)

In [10]:
pd.crosstab(data['education'], data['Customer_Response'])

| education | Customer_Response = no | Customer_Response = yes |
|---|---|---|
| basic.4y | 1865 | 296 |
| basic.6y | 1130 | 125 |
| basic.9y | 3505 | 338 |
| high.school | 6079 | 859 |
| illiterate | 8 | 1 |
| professional.course | 3413 | 488 |
| university.degree | 7995 | 1392 |


In [11]:
# contingency table
obs = pd.crosstab(data['education'], data['Customer_Response'])

# Chi-square test
chi2_stat, p_value, dof, expected = chi2_contingency(obs)

# results
print("Chi-square Statistic:", chi2_stat)
print("P-value:", p_value)
print("Degrees of Freedom:", dof)

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")


Chi-square Statistic: 102.24856893835508
P-value: 8.515602840152896e-20
Degrees of Freedom: 6
Reject the null hypothesis.


- **There is statistically significant evidence that education and Customer Response are associated (not independent).**
- However, the test alone does not show whether education itself influences the Customer Response; that needs further analysis.
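The 'illiterate' row contains only 9 customers, so it is worth checking the expected frequencies that `chi2_contingency` returns. The sketch below applies the common rule of thumb that expected counts should be at least 5; the threshold is a convention, not a hard limit.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Observed counts copied from the education x Customer_Response table above.
obs = pd.DataFrame(
    {
        "no": [1865, 1130, 3505, 6079, 8, 3413, 7995],
        "yes": [296, 125, 338, 859, 1, 488, 1392],
    },
    index=[
        "basic.4y", "basic.6y", "basic.9y", "high.school",
        "illiterate", "professional.course", "university.degree",
    ],
)

chi2_stat, p_value, dof, expected = chi2_contingency(obs)
expected = pd.DataFrame(expected, index=obs.index, columns=obs.columns)

# List the cells whose expected count falls below 5.
all_cells = expected.stack()
small_cells = all_cells[all_cells < 5]

print(small_cells)
```

Only the ('illiterate', 'yes') cell falls below the threshold here, so the overall conclusion is unlikely to change, but merging or dropping that category would be a reasonable robustness check.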

### To understand the relationship between housing loan and Customer Response

- Null: There is no relationship between having a housing loan and Customer Response. (independent)
- Alternate: There is a relationship between having a housing loan and Customer Response. (dependent)

In [12]:
pd.crosstab(data['housing'], data['Customer_Response'])

| housing | Customer_Response = no | Customer_Response = yes |
|---|---|---|
| no | 11018 | 1549 |
| yes | 12977 | 1950 |


In [13]:
# contingency table
obs = pd.crosstab(data['housing'], data['Customer_Response'])

# Chi-square test
chi2_stat, p_value, dof, expected = chi2_contingency(obs)

# results
print("Chi-square Statistic:", chi2_stat)
print("P-value:", p_value)
print("Degrees of Freedom:", dof)

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")


Chi-square Statistic: 3.276404681254342
P-value: 0.0702827363428754
Degrees of Freedom: 1
Fail to reject the null hypothesis.


- **we don't have enough evidence to say that housing loan and Customer response are dependent on each other.**

### To understand the relationship between Profession and Customer Response

- Null: There is no relationship between profession and Customer Response. (independent)
- Alternate: There is a relationship between profession and Customer Response. (dependent)

In [14]:
pd.crosstab(data['job'], data['Customer_Response'])

| job | Customer_Response = no | Customer_Response = yes |
|---|---|---|
| admin. | 6777 | 1121 |
| blue-collar | 4708 | 407 |
| entrepreneur | 884 | 93 |
| housemaid | 541 | 75 |
| management | 1834 | 254 |
| retired | 766 | 323 |
| self-employed | 866 | 115 |
| services | 2316 | 238 |
| student | 360 | 177 |
| technician | 4389 | 578 |


In [15]:
# contingency table
obs = pd.crosstab(data['job'], data['Customer_Response'])

# Chi-square test
chi2_stat, p_value, dof, expected = chi2_contingency(obs)

# results
print("Chi-square Statistic:", chi2_stat)
print("P-value:", p_value)
print("Degrees of Freedom:", dof)

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")


Chi-square Statistic: 656.0224619876792
P-value: 1.7186529866407562e-134
Degrees of Freedom: 10
Reject the null hypothesis.


- **There is statistically significant evidence that profession and Customer Response are associated (not independent).**
- However, the test alone does not show whether profession itself influences the Customer Response; that needs further analysis.

### To understand the relationship between defaulters and Customer Response

- Null: There is no relationship between credit default status and Customer Response. (independent)
- Alternate: There is a relationship between credit default status and Customer Response. (dependent)

In [16]:
pd.crosstab(data['default'], data['Customer_Response'])

| default | Customer_Response = no | Customer_Response = yes |
|---|---|---|
| no | 23993 | 3499 |
| yes | 2 | 0 |


In [17]:
# contingency table
obs = pd.crosstab(data['default'], data['Customer_Response'])

# Chi-square test
chi2_stat, p_value, dof, expected = chi2_contingency(obs)

# results
print("Chi-square Statistic:", chi2_stat)
print("P-value:", p_value)
print("Degrees of Freedom:", dof)

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")


Chi-square Statistic: 0.0
P-value: 1.0
Degrees of Freedom: 1
Fail to reject the null hypothesis.


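- **We don't have enough evidence to say that credit default status and Customer Response are dependent on each other.**
- This result is of limited value: only 2 customers have a default, so the expected counts in the 'yes' row are far below 5 and the chi-square approximation is unreliable for this table.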
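For a 2x2 table with such sparse cells, Fisher's exact test is a common alternative to the chi-square test. The sketch below is an illustrative addition using `scipy.stats.fisher_exact` on the counts shown above; it is not part of the original analysis.

```python
from scipy.stats import fisher_exact

# Rows: default = no / yes; columns: Customer_Response = no / yes.
table = [[23993, 3499], [2, 0]]

# Exact test of independence for a 2x2 table, no large-sample approximation.
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")

print("Odds ratio:", odds_ratio)
print("P-value:", p_value)

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")
```

With only two defaulters in the data, neither test can say much about this variable.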

### To understand the relationship between personal loan and Customer Response

- Null: There is no relationship between having a personal loan and Customer Response. (independent)
- Alternate: There is a relationship between having a personal loan and Customer Response. (dependent)

In [18]:
pd.crosstab(data['loan'], data['Customer_Response'])

| loan | Customer_Response = no | Customer_Response = yes |
|---|---|---|
| no | 20226 | 2973 |
| yes | 3769 | 526 |


In [19]:
# contingency table
obs = pd.crosstab(data['loan'], data['Customer_Response'])

# Chi-square test
chi2_stat, p_value, dof, expected = chi2_contingency(obs)

# results
print("Chi-square Statistic:", chi2_stat)
print("P-value:", p_value)
print("Degrees of Freedom:", dof)

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")


Chi-square Statistic: 1.0036552597662287
P-value: 0.31642765553716
Degrees of Freedom: 1
Fail to reject the null hypothesis.


- **we don't have enough evidence to say that personal loan and Customer response are dependent on each other.**

### To understand the relationship between Region and Customer Response

- Null: There is no relationship between Region and Customer Response. (independent)
- Alternate: There is a relationship between Region and Customer Response. (dependent)

In [20]:
pd.crosstab(data['Region_Name'], data['Customer_Response'])

| Region_Name | Customer_Response = no | Customer_Response = yes |
|---|---|---|
| Central | 5568 | 774 |
| East | 6856 | 974 |
| South | 3870 | 590 |
| West | 7701 | 1161 |


In [21]:
# contingency table
obs = pd.crosstab(data['Region_Name'], data['Customer_Response'])

# Chi-square test
chi2_stat, p_value, dof, expected = chi2_contingency(obs)

# results
print("Chi-square Statistic:", chi2_stat)
print("P-value:", p_value)
print("Degrees of Freedom:", dof)

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")


Chi-square Statistic: 4.269182884267758
P-value: 0.2338259984833805
Degrees of Freedom: 3
Fail to reject the null hypothesis.


- **we don't have enough evidence to say that region and Customer response are dependent on each other.**

### To understand the relationship between result of previous campaign and current campaign

- Null: There is no relationship between previous campaign outcome and current Customer Response. (independent)
- Alternate: There is a relationship between previous campaign outcome and current Customer Response. (dependent)

In [22]:
# contingency table
obs = pd.crosstab(data['poutcome'], data['Customer_Response'])

# Chi-square test
chi2_stat, p_value, dof, expected = chi2_contingency(obs)

# results
print("Chi-square Statistic:", chi2_stat)
print("P-value:", p_value)
print("Degrees of Freedom:", dof)

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")


Chi-square Statistic: 2895.580038101799
P-value: 0.0
Degrees of Freedom: 2
Reject the null hypothesis.


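- **There is statistically significant evidence that the previous campaign outcome and Customer Response are associated (not independent).** The p-value prints as 0.0 because it is too small to be represented as a floating-point number.
- However, the test alone does not show whether poutcome itself influences the Customer Response; that needs further analysis.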
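The chi-square cells above repeat the same steps for each variable. A small helper can run them in a loop; the sketch below is one possible refactoring, with `chi_square_summary` as a hypothetical function name, shown on counts copied from three of the tables above.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi_square_summary(obs, alpha=0.05):
    """Run a chi-square test of independence on one contingency table."""
    chi2_stat, p_value, dof, _ = chi2_contingency(obs)
    return {
        "chi2": chi2_stat,
        "p_value": p_value,
        "dof": dof,
        "reject_h0": p_value < alpha,
    }

# Observed counts copied from the contingency tables above.
tables = {
    "contact": pd.DataFrame(
        {"no": [15461, 8534], "yes": [2972, 527]},
        index=["cellular", "telephone"],
    ),
    "housing": pd.DataFrame(
        {"no": [11018, 12977], "yes": [1549, 1950]},
        index=["no", "yes"],
    ),
    "Region_Name": pd.DataFrame(
        {"no": [5568, 6856, 3870, 7701], "yes": [774, 974, 590, 1161]},
        index=["Central", "East", "South", "West"],
    ),
}

# One row of results per tested variable.
results = pd.DataFrame.from_dict(
    {name: chi_square_summary(obs) for name, obs in tables.items()},
    orient="index",
)

print(results)
```

In the notebook, each table would come from `pd.crosstab(data[col], data['Customer_Response'])` for `col` in a list of categorical columns.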

## `Normality test`

### Normality check for the 'age' variable

- Null: age data is normally distributed
- Alternate: age data is not normally distributed

In [33]:
# Testing for normality
# Note: SciPy documents that the Shapiro-Wilk p-value may be inaccurate for N > 5000
# (here N = 27,494), so treat the p-value as indicative only.
stat, p = shapiro(data.age)
print('Statistics=%.3f, p=%.3f' % (stat, p))

# interpretation
alpha = 0.05
if p < alpha:
    print('We reject the null hypothesis. Data is not normally distributed.')
else:
    print('Fail to reject the null hypothesis. No evidence against normality.')

Statistics=0.941, p=0.000
We reject the null hypothesis. Data is not normally distributed.
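Because the Shapiro-Wilk p-value is not reliable for samples this large, a skewness and kurtosis check together with the D'Agostino-Pearson test (`scipy.stats.normaltest`) can serve as a cross-check. The sketch below uses a synthetic right-skewed sample as a stand-in for `age`; the gamma parameters are arbitrary assumptions and the output says nothing about the real data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic right-skewed stand-in for `age`; NOT the notebook's data.
sample = rng.gamma(shape=14.0, scale=2.8, size=27494)

# Both are 0 for a perfectly normal distribution.
skewness = stats.skew(sample)
excess_kurtosis = stats.kurtosis(sample)

# D'Agostino-Pearson omnibus test, based on skewness and kurtosis.
k2_stat, p_value = stats.normaltest(sample)

print(f"skewness = {skewness:.3f}, excess kurtosis = {excess_kurtosis:.3f}")
print(f"K2 = {k2_stat:.3f}, p = {p_value:.3g}")
```

In the notebook, `sample` would be replaced by `data['age']`. With tens of thousands of rows, almost any formal normality test rejects, so the size of the skewness and a Q-Q plot are usually more informative than the p-value.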


## `Two Sample Z - test`
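(used because each group contains far more than 30 observations, so the sample means are approximately normal even though the variables themselves are not)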
### To test the difference in the mean age between customers who subscribed to a term deposit and those who did not.

- Null Hypothesis (H0): There is no significant difference in the mean age between customers who subscribed to a term deposit and those who did not.
- Alternative Hypothesis (H1): There is a significant difference in the mean age.

In [24]:
subscribed_age = data[data['Customer_Response'] == 'yes']['age']
not_subscribed_age = data[data['Customer_Response'] == 'no']['age']

z_statistic, p_value = test.ztest(x1=subscribed_age,x2=not_subscribed_age)

print("Z Statistic:", z_statistic)
print("P-value:", p_value)

# Comparing p-value to the significance level (0.05)
if p_value < 0.05:
    print("Reject the null hypothesis. There is a significant difference in mean ages.")
else:
    print("Fail to reject the null hypothesis.")


Z Statistic: 7.952524092288905
P-value: 1.827492777381958e-15
Reject the null hypothesis. There is a significant difference in mean ages.


- **There is statistically significant evidence that the mean age differs between customers who subscribed to the term deposit and those who did not. The positive Z statistic means that the subscribers' mean age is higher.**
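To make the test statistic less of a black box, the sketch below computes a pooled-variance two-sample z statistic by hand with NumPy and SciPy. It is intended to mirror what `ztest` does with its default `usevar='pooled'`, but `two_sample_z` is our own hypothetical helper and the two groups are synthetic, with arbitrary means and spreads.

```python
import numpy as np
from scipy.stats import norm

def two_sample_z(x1, x2, alternative="two-sided"):
    """Pooled-variance two-sample z-test for a difference in means."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n1, n2 = len(x1), len(x2)

    # Pooled variance across both groups.
    pooled_var = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
    std_err = np.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    z = (x1.mean() - x2.mean()) / std_err

    if alternative == "two-sided":
        p = 2 * norm.sf(abs(z))
    elif alternative == "larger":   # H1: mean(x1) > mean(x2)
        p = norm.sf(z)
    elif alternative == "smaller":  # H1: mean(x1) < mean(x2)
        p = norm.cdf(z)
    else:
        raise ValueError("alternative must be 'two-sided', 'larger' or 'smaller'")
    return z, p

rng = np.random.default_rng(1)

# Synthetic groups with the same sizes as the yes/no groups; NOT the real ages.
group_yes = rng.normal(loc=41, scale=13, size=3499)
group_no = rng.normal(loc=39, scale=10, size=23995)

z_two_sided, p_two_sided = two_sample_z(group_yes, group_no)
z_larger, p_larger = two_sample_z(group_yes, group_no, alternative="larger")

print("two-sided:", z_two_sided, p_two_sided)
print("larger:   ", z_larger, p_larger)
```

When the z statistic is positive, the one-sided "larger" p-value is half the two-sided one, which is why a one-sided test on the same data reports the same statistic.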

### To test the difference in the average duration of campaign between customers who subscribed to a term deposit and those who did not.

- Null Hypothesis (H0): There is no significant difference in the mean duration between customers who subscribed to a term deposit and those who did not.
- Alternative Hypothesis (H1): There is a significant difference in the mean duration.

In [25]:
subscribed_duration = data[data['Customer_Response'] == 'yes']['duration']
not_subscribed_duration = data[data['Customer_Response'] == 'no']['duration']

z_statistic, p_value = test.ztest(x1=subscribed_duration, x2=not_subscribed_duration)

print("Z Statistic:", z_statistic)
print("P-value:", p_value)

# Comparing p-value to the significance level (0.05)
if p_value < 0.05:
    print("Reject the null hypothesis. There is a significant difference in mean duration.")
else:
    print("Fail to reject the null hypothesis.")

Z Statistic: 71.46941360104718
P-value: 0.0
Reject the null hypothesis. There is a significant difference in mean duration.


- **There is statistically significant evidence that the mean duration differs between customers who subscribed to the term deposit and those who did not. The positive Z statistic means that the mean duration is higher for subscribers.**
- Hence this variable is worth investigating further.

### To test if average duration of campaign is larger for those customers who subscribed to a term deposit than who did not.

- Null Hypothesis (H0): The mean duration for customers who subscribed to a term deposit is less than or equal to that of customers who did not.
- Alternative Hypothesis (H1): The mean duration is larger for customers who subscribed than for those who did not.

In [29]:
subscribed_duration = data[data['Customer_Response'] == 'yes']['duration']
not_subscribed_duration = data[data['Customer_Response'] == 'no']['duration']

# One-sided test: H1 is mean(subscribed) > mean(not subscribed)
z_statistic, p_value = test.ztest(x1=subscribed_duration, x2=not_subscribed_duration, alternative="larger")

print("Z Statistic:", z_statistic)
print("P-value:", p_value)

# Comparing p-value to the significance level (0.05)
if p_value < 0.05:
    print("Reject the null hypothesis. Mean duration is significantly larger for those who subscribed than who did not.")
else:
    print("Fail to reject the null hypothesis.")

Z Statistic: 71.46941360104718
P-value: 0.0
Reject the null hypothesis. Mean duration is significantly larger for those who subscribed than who did not.


- **There is statistically significant evidence that the mean duration is larger for customers who subscribed to the term deposit than for those who did not.**
- Hence this variable is worth investigating further.

### To test the difference in the mean nr.employed between customers who subscribed to a term deposit and those who did not.

- Null Hypothesis (H0): There is no significant difference in the mean nr.employed between customers who subscribed to a term deposit and those who did not.
- Alternative Hypothesis (H1): There is a significant difference in the mean nr.employed.

In [27]:
subscribed_nr_employed = data[data['Customer_Response'] == 'yes']['nr.employed']
not_subscribed_nr_employed = data[data['Customer_Response'] == 'no']['nr.employed']

z_statistic, p_value = test.ztest(x1=subscribed_nr_employed, x2=not_subscribed_nr_employed)

print("Z Statistic:", z_statistic)
print("P-value:", p_value)

# Comparing p-value to the significance level (0.05)
if p_value < 0.05:
    print("Reject the null hypothesis. There is a significant difference in mean nr.employed.")
else:
    print("Fail to reject the null hypothesis.")

Z Statistic: -65.50035284214754
P-value: 0.0
Reject the null hypothesis. There is a significant difference in mean nr.employed.


- **There is statistically significant evidence that the mean nr.employed differs between customers who subscribed to the term deposit and those who did not. The negative Z statistic means that the mean nr.employed is lower for subscribers.**

### To test the difference in the average number of contacts between customers who subscribed to a term deposit and those who did not.

- Null Hypothesis (H0): There is no significant difference in the mean number of contacts between customers who subscribed to a term deposit and those who did not.
- Alternative Hypothesis (H1): There is a significant difference in the mean number of contacts.

In [28]:
subscribed_contacts = data[data['Customer_Response'] == 'yes']['campaign']
not_subscribed_contacts = data[data['Customer_Response'] == 'no']['campaign']

z_statistic, p_value = test.ztest(x1=subscribed_contacts, x2=not_subscribed_contacts)

print("Z Statistic:", z_statistic)
print("P-value:", p_value)

# Comparing p-value to the significance level (0.05)
if p_value < 0.05:
    print("Reject the null hypothesis. There is a significant difference in mean number of contacts.")
else:
    print("Fail to reject the null hypothesis.")

Z Statistic: -11.65347123368027
P-value: 2.2029862491314617e-31
Reject the null hypothesis. There is a significant difference in mean number of contacts.


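- **There is statistically significant evidence that the mean number of contacts differs between customers who subscribed to the term deposit and those who did not. The negative Z statistic means that subscribers were contacted fewer times on average.**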
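With groups this large, very small differences in means become statistically significant, so it can help to report an effect size next to each z-test. The sketch below computes Cohen's d on synthetic data; `cohens_d` is a hypothetical helper, and the group means and spreads are arbitrary assumptions rather than values from the dataset.

```python
import numpy as np

def cohens_d(x1, x2):
    """Standardised mean difference using the pooled standard deviation."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n1, n2 = len(x1), len(x2)
    pooled_var = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
    return (x1.mean() - x2.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(2)

# Synthetic stand-ins for a numeric column split by Customer_Response; NOT the real data.
subscribed = rng.normal(loc=2.0, scale=2.7, size=3499)
not_subscribed = rng.normal(loc=2.6, scale=2.7, size=23995)

d = cohens_d(subscribed, not_subscribed)
print(f"Cohen's d = {d:.3f}")
```

By the usual convention, an absolute value of d around 0.2 is a small effect, 0.5 a medium effect and 0.8 a large effect. In the notebook, the two arrays would be the same group slices that are passed to `test.ztest`.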