<a href="https://colab.research.google.com/github/YogeshGadade/Statistical-Data-Analysis/blob/main/Customer_churn_estimation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Business Problem: Estimate the proportion of customers who will churn in the next month based on the behavior of a sample of customers 

Solution:
1. Estimate a range (a confidence interval) for the proportion of customers who churn.
2. Check the calculated confidence interval with a hypothesis test of whether the sample of customers is representative of the actual customer distribution, using inferential statistics in Python.

Steps: The following code first creates synthetic customer data for this analysis (1,000 customers, each flagged as churned with probability 0.1), and then calculates the number of customers who have churned, the total number of customers, and the proportion of customers who have churned. Then, using the sample proportion and the sample size, it calculates the standard error, the margin of error, and the lower and upper bounds of the 95% confidence interval for the proportion of customers who will churn in the next month. Finally, it uses a binomial test to get the p-value for the hypothesis that the proportion of customers who will churn in the next month is equal to the proportion of customers who have churned so far.
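
For reference, the interval described in these steps is the normal-approximation (Wald) interval for a proportion. The symbols below are introduced here only for illustration: $x$ is the number of churned customers, $n$ is the total number of customers, and $\hat{p}$ is the sample proportion.

$$
\hat{p} = \frac{x}{n}, \qquad
SE = \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}, \qquad
ME = 1.96 \cdot SE, \qquad
CI_{95\%} = \left[\hat{p} - ME,\; \hat{p} + ME\right]
$$

The multiplier 1.96 is the standard normal quantile for a two-sided 95% level. As a common rule of thumb, the approximation is considered reasonable when both $n\hat{p}$ and $n(1-\hat{p})$ are at least about 10.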

In [None]:
import numpy as np
import pandas as pd
# binom_test was deprecated in SciPy 1.10 and removed in 1.12; binomtest replaces it
from scipy.stats import binomtest

# Create synthetic customer data
num_customers = 1000

np.random.seed(42)
# Note: IDs are drawn with replacement, so some of them may repeat
customer_id = np.random.randint(1, 1001, num_customers)
# Each customer churns (1) with probability 0.1
churned = np.random.binomial(1, 0.1, num_customers)

data = pd.DataFrame({'customer_id': customer_id, 'churned': churned})

# The data is already in the desired format, so no preprocessing is needed

# calculate the number of customers who have churned
num_churned = int((data['churned'] == 1).sum())

# calculate the total number of customers
num_total = data.shape[0]

# calculate the proportion of customers who have churned
prop_churned = num_churned / num_total

# calculate the standard error of the proportion
se = (prop_churned * (1 - prop_churned) / num_total) ** 0.5

# calculate the margin of error (1.96 is the z-value for a 95% confidence level)
me = 1.96 * se

# calculate the lower and upper bounds of the 95% confidence interval
lower_bound = prop_churned - me
upper_bound = prop_churned + me

print("The proportion of customers who will churn in the next month is estimated to be between", lower_bound, "and", upper_bound)

# Now that we have the interval, we can use a binomial test.
# Note: the hypothesized value here is the sample proportion itself, so the
# observed count is compared with its own estimate.
p_value = binomtest(num_churned, num_total, prop_churned, alternative='two-sided').pvalue

print("The p_value for the hypothesis test is", p_value)


Since the sample proportion is only an estimate of the actual population proportion, it could differ from the true population proportion. The p-value helps us quantify how compatible the sample is with a hypothesized value of that proportion.

A p-value is the probability of observing a sample proportion as extreme as, or more extreme than, the one observed, assuming that the null hypothesis is true. In this case, the null hypothesis is that the proportion of customers who will churn in the next month is equal to the proportion of customers who have churned so far. If the p-value is less than a chosen threshold (usually 0.05), a sample proportion this extreme would be unlikely if the null hypothesis were true, so we reject the null hypothesis and conclude that the true population proportion differs from the hypothesized value.

In other words, the p-value measures how much evidence we have against the null hypothesis. A low p-value means we have enough evidence to reject the null hypothesis, that is, to say that the true population proportion differs from the hypothesized one. A high p-value does not prove the null hypothesis; it only means that the data do not contradict it.
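
The definition of a p-value can be illustrated with a small simulation. The sketch below is not part of the original analysis: it assumes a hypothesized churn rate of 0.1 and an observed count of 96 churned customers out of 1,000 (a count inferred from the midpoint of the interval reported in this notebook), and all variable names are chosen here for illustration.

```python
import numpy as np

# Assumptions for illustration: a hypothesized churn rate of 0.1 and
# an observed count of 96 churned customers out of 1000.
n_customers = 1000
null_rate = 0.1
observed_churned = 96

rng = np.random.default_rng(0)
# Simulate many samples of 1000 customers under the null hypothesis
simulated_counts = rng.binomial(n_customers, null_rate, size=100_000)

# Two-sided: how often is a simulated count at least as far from the
# expected count (n * p) as the observed count is?
expected = n_customers * null_rate
observed_distance = abs(observed_churned - expected)
simulated_p_value = np.mean(np.abs(simulated_counts - expected) >= observed_distance)

print("Simulated two-sided p-value:", simulated_p_value)
```

The simulated fraction approximates the two-sided p-value: it is the share of samples drawn under the null hypothesis that look at least as extreme as the observed one.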

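In this case, the p-value is 1.0, which means that the sample is fully consistent with the hypothesized proportion. Note that this is expected by construction: the test above uses the sample proportion itself as the hypothesized value, so the observed count sits exactly at the null value.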

Result Analysis: The output shows that the proportion of customers who will churn in the next month is estimated to be between 7.8% and 11.4%. The 95% confidence interval gives us a range of plausible values for the true population proportion of customers who will churn in the next month, based on the sample data. The lower bound of the interval is 0.07774104563782472 and the upper bound is 0.11425895436217529.

The p-value of 1.0 means that the probability of observing the current sample proportion, or a more extreme one, assuming that the null hypothesis is true, is 1.0. The sample proportion is therefore consistent with the hypothesized proportion, and we fail to reject the null hypothesis. (A hypothesis test never proves the null hypothesis, so it is not "accepted".)

In other words, the sample proportion of customer churn is not significantly different from the hypothesized proportion. The churn probability of 0.1 used to generate the data also lies inside the 95% confidence interval, so the customer churn rate is consistent with what was expected.
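
A more informative check tests the sample against a value fixed independently of the sample. The following is a minimal sketch under stated assumptions, not the notebook's original method: it assumes 96 churned customers out of 1,000 (inferred from the midpoint of the reported interval) and a hypothesized rate of 0.1, and it uses `scipy.stats.binomtest` together with the result's `proportion_ci` method to obtain a Wilson score interval for comparison.

```python
from scipy.stats import binomtest

# Assumptions for illustration: 96 churned customers out of 1000, and a
# hypothesized churn rate of 0.1 chosen independently of the sample.
observed_churned = 96
sample_size = 1000
hypothesized_rate = 0.1

result = binomtest(observed_churned, sample_size, p=hypothesized_rate,
                   alternative='two-sided')
print("p-value against a fixed rate of 0.1:", result.pvalue)

# Wilson score interval, an alternative to the normal-approximation interval
ci = result.proportion_ci(confidence_level=0.95, method='wilson')
print("95% Wilson interval:", ci.low, ci.high)
```

With a hypothesized value that does not come from the sample, the p-value is no longer 1.0 by construction, so it says something about how compatible the data are with that value.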

Improvements in the above implementation:

In the code above, no preprocessing is required because the synthetic data is created in the desired format. However, if you are working with real-world data, some preprocessing steps might be necessary before running the analysis. These steps might include:

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# load data
data = pd.read_csv("customer_data.csv")

# Data cleaning
data.dropna(inplace=True) # remove rows with missing values
data.drop_duplicates(inplace=True) # remove duplicate rows

# Data transformation
data['churned'] = data['churned'].astype(int) # convert churned column to numeric
data = pd.get_dummies(data, columns=['customer_type']) # convert customer_type column to dummy variables

# Feature engineering
data['average_purchase_amount'] = data.groupby('customer_id')['purchase_amount'].transform('mean') # create a new variable for average purchase amount

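# Data splitting
# customer_id is an identifier rather than a predictive feature, so it is excluded from X
X = data.drop(columns=['churned', 'customer_id'])
y = data['churned']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)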

# Data normalization
scaler = StandardScaler()
X_train_norm = scaler.fit_transform(X_train)
X_test_norm = scaler.transform(X_test)


This code first loads customer data from a CSV file, and then performs several preprocessing steps:

1. Data cleaning: Removes any missing values and duplicate rows
2. Data transformation: Converts the churned column to numeric and the customer_type column to dummy variables
3. Feature engineering: Creates a new variable for the average purchase amount per customer_id
4. Data splitting: Splits the data into a training set (80%) and a test set (20%)
5. Data normalization: Standardizes the features to zero mean and unit variance using StandardScaler, fitted on the training set only.

Note that this code is just an example, and the specific preprocessing steps required will depend on the nature and quality of the data, and on the goals of the analysis. Also, it's always a good practice to check the data and do some data exploration, cleaning, and transformation when required.
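
The kind of data check mentioned above can be sketched as follows. The toy DataFrame is hypothetical and only stands in for a real customer file; the column names mirror the ones assumed in the preprocessing example.

```python
import pandas as pd

# Hypothetical toy data standing in for a real customer file
raw = pd.DataFrame({
    'customer_id': [1, 2, 2, 3, 4, 5],
    'customer_type': ['basic', 'premium', 'premium', None, 'basic', 'basic'],
    'purchase_amount': [20.0, 55.5, 55.5, 12.0, None, 8.0],
    'churned': [0, 1, 1, 0, 0, 0],
})

# Count missing values per column
missing_per_column = raw.isna().sum()
# Count fully duplicated rows
num_duplicates = int(raw.duplicated().sum())
# Share of retained (0) vs. churned (1) customers, i.e. the class balance
churn_share = raw['churned'].value_counts(normalize=True)

print(missing_per_column)
print("Duplicate rows:", num_duplicates)
print(churn_share)
```

Looking at these counts before calling `dropna` or `drop_duplicates` shows how many rows the cleaning step would remove and whether the churn classes are balanced.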

Possible challenges:
Several challenges can arise when analyzing customer churn, including:

1. Defining churn: Churn can be defined in different ways, depending on the context and the goals of the analysis. For example, churn might be defined as customers who cancel their subscription, stop using the service, or switch to a competitor. Defining churn in a clear and consistent way is important for the analysis.
2. Data availability and quality: In order to identify customers who are at risk of churning, it is important to have access to data on their behavior and demographics. However, in many cases, data might be missing, incomplete, or of poor quality, which can make it difficult to conduct the analysis.
3. Identifying the causes of churn: Churn can be caused by a variety of factors, such as poor service quality, high prices, or lack of features. Identifying the specific causes of churn can be challenging, as customers may have different reasons for churning, and a single customer may have multiple reasons.
4. Handling class imbalance: Churn is typically a rare event, which means that the number of customers who churn is much smaller than the number of customers who do not churn. This can lead to class imbalance in the data, which can make it difficult to train accurate predictive models.
5. Handling temporal data: Churn prediction often relies on time-series data, where the information on the past behavior of customers is used to predict the future behavior of customers. Such data can be challenging to handle, as it can be affected by seasonality, trends, and irregular patterns.
6. Model evaluation and selection: Churn prediction models are often evaluated based on their accuracy or AUC-ROC scores, but these metrics may not be the most appropriate for imbalanced data and the business goals. There are other metrics like precision, recall, F1-score and Lift that are more suited to evaluate the model in such scenarios.
7. Model interpretability: Model interpretability is crucial for understanding what factors are driving customer churn and how to take action to retain customers. Some models, such as decision trees or rule-based models, are more interpretable than others, such as neural networks or random forests.
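
Challenges 4 and 6 above (class imbalance and metric choice) can be illustrated with a short sketch. It is one possible approach under stated assumptions, not a recommended pipeline: the data is synthetic and hypothetical, with churners as a minority, and the model is a scikit-learn logistic regression with `class_weight='balanced'`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: two features, churners are a minority
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))
# Assumption for the demo: churn is more likely when the first feature is high
churn_prob = 1 / (1 + np.exp(-(2.0 * X[:, 0] - 3.0)))
y = rng.binomial(1, churn_prob)

# stratify=y keeps the churn share similar in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# class_weight='balanced' reweights classes inversely to their frequency
model = LogisticRegression(class_weight='balanced')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Metrics that focus on the minority (churned) class
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("precision:", precision, "recall:", recall, "F1:", f1)
```

Precision and recall are reported for the churned class, so they show how well churners are found. Accuracy can look high on imbalanced data even when few churners are identified.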