## Step:1 Dataset Description and Objective:
**Dataset Description**: The dataset, named "customer_churn," contains information about customers of a telecom company. Each row represents a unique customer, and the columns provide various attributes such as customer demographics, services subscribed, tenure, payment method, and churn status.

* Attributes in the dataset include:
    * Customer demographics (e.g., gender, senior citizenship, partner, dependents)
    * Services subscribed (e.g., internet service type)
    * Tenure (duration of subscription)
    * Payment method
    * Churn status (whether the customer has churned or not)

**Objective:**
The objective of our analysis is to gain insights from the data and develop strategies to prevent customers from churning out to competitors. We will perform data manipulation, visualization, and build predictive models to achieve this objective. 
* Key tasks include:

    * Data manipulation to extract relevant subsets of data.
    * Visualization to understand the distribution of variables.
    * Building predictive models using machine learning algorithms to identify factors influencing churn and devise retention strategies.

## Step:2 Import Necessary libraries

In [None]:
# Standard data manipulation libraries
import pandas as pd
import numpy as np

# Data visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Machine learning libraries
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Ignore UserWarning
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

# Evaluation Metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Library for Model Saving and Loading
import joblib

## Step:3 Data loading

In [None]:
df=pd.read_csv(r"E:\csv\Intel csv\customer_churn.csv")

**About columns in dataset**

* customerID: Unique identifier for each customer.
* gender: Categorical value, Male or Female.
* SeniorCitizen: Categorical value, 1 (senior citizen) or 0 (not a senior citizen).
* Partner: Indicates whether the customer has a partner (Yes/No).
* Dependents: Indicates whether the customer has dependents (Yes/No).
* tenure: Number of months the customer has been using the services of the telecom company.
* PhoneService: Indicates whether the customer has phone service (Yes/No).
* MultipleLines: Indicates whether the customer has more than one phone line
    * Yes means the customer has multiple lines
    * No means the customer has a single line
    * No phone service means the customer has no phone service
* InternetService: Type of internet service subscribed by the customer (DSL, Fiber optic, No)
    * DSL provides high-speed internet access through a wired connection, usually via a modem and telephone line infrastructure
* OnlineSecurity: Indicates if the customer has online security service (Yes/No).
* OnlineBackup: Indicates if the customer has online backup service (Yes/No).
* DeviceProtection: Indicates if the customer has device protection service (Yes/No).
* TechSupport: Indicates if the customer has tech support service (Yes/No).
* StreamingTV: Indicates if the customer has streaming TV service (Yes/No).
* StreamingMovies: Indicates if the customer has streaming movie service (Yes/No).
* Contract: Type of contract the customer has (Month-to-month, One year, Two year).
* PaperlessBilling: Indicates if the customer has opted for paperless (electronic) billing instead of a paper bill (Yes/No).
* PaymentMethod: Payment method used by the customer (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
* MonthlyCharges: Amount charged to the customer monthly.
* TotalCharges: Total amount charged to the customer.
* Churn: Indicates if the customer has churned (Yes/No).

## Step:4 Exploratory Data Analysis (EDA)

#### 4.1 Data Overview
* Checking the dimensions of the dataset (number of rows and columns).
* Inspect first few rows to understand the structure of the data.

In [None]:
# view top 5 records
df.head()

In [None]:
# view last 5 records
df.tail()

In [None]:
#Checking the shape of the dataset
df.shape

In [None]:
# To view all records
print(df.to_string())

#### 4.2 Checking for Duplicates

In [None]:
# checking for duplicated values
df.duplicated().sum()

#### 4.3 Understanding Data Types and Information

In [None]:
# Checking information about my data
df.info()

* Within this dataset, we have one column of float type named 'MonthlyCharges',
* two columns of int64 type named 'SeniorCitizen' and 'tenure', and
* eighteen columns of object type.

In [None]:
# Count no of records for each column
df.count()

In [None]:
# Count occurrences of unique value in Gender Column
df['gender'].value_counts()

In [None]:
#checking number of unique values in each column
df.nunique()

In [None]:
#Checking the unique values in each column
cols = df.columns
for i in cols:
    print(i, df[i].unique(), '\n')

#### 4.4 Descriptive Statistics

In [None]:
df.describe()

**Insights from SeniorCitizen Column:**

* Approximately 16.21% of the customers in the dataset are senior citizens.

**Insights from Tenure Column:**
* The average tenure of customers is approximately 32.37 months.
* The standard deviation of approximately 24.56 suggests variability in the tenure lengths of customers, indicating that some customers have been with the company for a shorter duration, while others have been with the company for a longer duration.
* The minimum tenure is 0 months, indicating that there are customers who are relatively new to the service, while the maximum tenure is 72 months, indicating long-term customers.

**Insights from MonthlyCharges Column:**

* The average monthly charge is approximately \$64.76.
* The standard deviation of approximately \$30.09 suggests variability in monthly charges across customers.
* Monthly charges range from \$18.25 to \$118.75, indicating a wide range of pricing plans or services offered to customers.

#### 4.5 Removing the 'customerID' column

In [None]:
# Removing the 'customerID' column as it is not required for our analysis
df.drop(columns='customerID', inplace=True)

#### 4.6 Typecasting 'TotalCharges' column from object to numeric

In [None]:
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
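
To illustrate what `errors='coerce'` does in the cell above, here is a minimal sketch on a made-up three-row column (the values are hypothetical, not taken from the dataset). Any entry that cannot be parsed as a number, such as a blank string, becomes `NaN`, which is why a null check is needed afterwards.

```python
import pandas as pd

# Toy stand-in for the 'TotalCharges' column; values are invented for illustration.
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5"]})

# Unparseable entries (here the blank string) are turned into NaN instead of raising.
converted = pd.to_numeric(toy["TotalCharges"], errors="coerce")

n_missing = int(converted.isna().sum())
print(converted)
print("Missing after conversion:", n_missing)
```

Counting the `NaN` values right after the conversion shows how many rows were affected before deciding whether to drop or impute them.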

#### 4.7 Checking for Null Values

In [None]:
# checking null values
df.isnull().sum().sum()

#### 4.8 Handling Missing Values

In [None]:
# Drop null values
df.dropna(inplace=True)

# Check if null values have been dropped
null_count = df.isnull().sum().sum()
null_count

#### 4.9 Outlier Analysis

In [None]:
# Plotting box plots to check for outliers
col = ['tenure', 'MonthlyCharges', 'TotalCharges']

for col_name in col:
    if df[col_name].dtype != 'object':
        fig = px.box(df, y=col_name)
        # The y-axis shows the values of the column itself, not a count
        fig.update_layout(title=f'Box plot of {col_name}', yaxis_title=col_name)
        fig.show()

* Based on the above plots, no points fall outside the whiskers, so none of the three numeric columns shows outliers under the box-plot rule.
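
The visual check can be backed by a number. The sketch below applies the common 1.5 × IQR rule to a made-up series; the helper name `iqr_outlier_count` is hypothetical, and pandas' default quantile interpolation may differ slightly from the quartile method Plotly uses for its whiskers, so treat it as an approximation of the box plot.

```python
import pandas as pd

def iqr_outlier_count(series: pd.Series, k: float = 1.5) -> int:
    """Count values lying more than k * IQR beyond the first or third quartile."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Invented data: one obviously extreme value and one clean series.
with_outlier = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
clean = pd.Series(range(1, 11))

print(iqr_outlier_count(with_outlier))
print(iqr_outlier_count(clean))
```

On the real data, the same helper could be applied to `df['tenure']`, `df['MonthlyCharges']`, and `df['TotalCharges']`.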

## Step: 5 Data Visualization

### 5.1 Univariate Analysis

#### 5.1.1 Visualization of Customer Demographics

In [None]:
# Create subplots
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=("Gender Distribution", "Senior Citizen Distribution", "Partner Distribution", "Dependents Distribution"),
    specs=[[{'type': 'domain'}, {'type': 'xy'}],
           [{'type': 'xy'}, {'type': 'xy'}]]
)

# Gender Distribution
gender_counts = df['gender'].value_counts()
fig.add_trace(go.Pie(labels=gender_counts.index, values=gender_counts.values, hole=.3), row=1, col=1)

# Senior Citizen Distribution
senior_counts = df['SeniorCitizen'].value_counts()
fig.add_trace(go.Bar(x=senior_counts.index, y=senior_counts.values, marker=dict(color='royalblue')), row=1, col=2)

# Partner Distribution
partner_counts = df['Partner'].value_counts()
fig.add_trace(go.Bar(x=partner_counts.index, y=partner_counts.values, marker=dict(color='royalblue')), row=2, col=1)

# Dependents Distribution
dependents_counts = df['Dependents'].value_counts()
fig.add_trace(go.Bar(x=dependents_counts.index, y=dependents_counts.values, marker=dict(color='royalblue')), row=2, col=2)

# Update layout
fig.update_layout(title_text="Distribution Plots", height=600, width=1000, showlegend=False)

fig.show()

From the above Customer Demographics plot, I can say that:

* The number of males and females is almost the same, with slightly more males than females in the dataset.
* The majority of customers are not senior citizens.
* Nearly 3,500 customers have a partner, and a similar number do not.
* Most customers don't have dependents, but a significant number do.
(Dependents are individuals who are not earning and rely on their family members for financial support, including paying for their plan recharges and other expenses.)

From these graphs, we gain insight into the customers' demographic profile: gender, senior citizenship, relationship status, and dependents.

#### 5.1.2 Visualization of Services provided by Telecom Company

In [None]:
# Create a subplot figure with 3x3 layout
fig = make_subplots(rows=3, cols=3, subplot_titles=(
    'Phone Service', 'Multiple Lines', 'Internet Service',
    'Online Security', 'Online Backup', 'Device Protection',
    'Tech Support', 'Streaming TV', 'Streaming Movies'
))

# List of columns to plot
columns_to_plot = [
    'PhoneService', 'MultipleLines', 'InternetService',
    'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
    'TechSupport', 'StreamingTV', 'StreamingMovies'
]

# Add traces for each subplot
for i, col in enumerate(columns_to_plot):
    row = i // 3 + 1
    col_pos = i % 3 + 1
    counts = df[col].value_counts().reset_index()
    counts.columns = ['category', 'count']
    trace = go.Bar(x=counts['category'], y=counts['count'], name=col)
    fig.add_trace(trace, row=row, col=col_pos)

# Update layout
fig.update_layout(
    title_text='Services Subscribed by Customers',
    showlegend=False,
    height=800,
    width=1000
)

# Show the figure
fig.show()

The above graphs visualize the services taken by customers from the telecom company.

**Inference:1** For PhoneService, MultipleLines and InternetService

* Nearly 6000 customers have taken phone service.
* Among these 6000, nearly half have opted for multiple lines from the company.
* Almost 5500 customers have taken internet services from the company, with nearly 3000 opting for fiber optic and the rest opting for DSL. Of these three major telecom services, phone service and internet service are the most popular among customers.

**Inference:2** When it comes to other services, including Online Security, Online Backup, Device Protection, Tech Support, and Streaming Services:

* Online backup and device protection services are each chosen by almost 2500 customers, highlighting customers' concerns regarding their device safety and data protection.
* Online security and tech support are the least chosen services, with almost 2000 customers each.
* Streaming services are the most popular of these add-ons, with more than 2500 customers opting for them.

From this, I conclude that apart from internet and phone services, streaming services are the most commonly chosen. Therefore, the company should focus on providing better streaming services to its customers.

#### 5.1.3 Visualization of Customer Tenure and Contract Type

In [None]:
# Create subplot figure with 1 row and 2 columns
fig = make_subplots(rows=1, cols=2, subplot_titles=('Customer Tenure in Months', 'Contract Type'))

# Plot histogram of customer tenure
fig.add_trace(go.Histogram(x=df['tenure'], name='Customer Tenure'), row=1, col=1)

# Plot count of contract types
contract_counts = df['Contract'].value_counts()
fig.add_trace(go.Bar(x=contract_counts.index, y=contract_counts.values, name='Contract Type'), row=1, col=2)

# Update layout
fig.update_layout(title_text='Customer Tenure and Contract Type', showlegend=False, height=500, width=1000)

# Show the figure
fig.show()

From the above graphs, we can see the distribution of customer tenure with the company and the count of the types of contracts the company had with customers.

* The largest group of customers has a short tenure of only a few months, and nearly 3800 customers have a month-to-month contract with the company.
* This suggests that customers with shorter tenure often have a month-to-month contract, although two separate univariate plots cannot confirm that link on their own.
* In addition, a significant number of customers have a tenure of nearly 70 months, highlighting the loyalty of these customers towards the company.
* After the month-to-month contract, the second most popular contract is the two-year contract, chosen by nearly 1700 customers. The remaining customers have a one-year contract.

#### 5.1.4 Insights into Billing Preferences and Charge Patterns

In [None]:
# Create subplot figure with 2 rows and 2 columns
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Paperless Billing', 'Payment Method', 'Monthly Charges', 'Total Charges'),
    vertical_spacing=0.3,
    horizontal_spacing=0.1
)

# Plot Paperless Billing
paperless_counts = df['PaperlessBilling'].value_counts()
fig.add_trace(go.Bar(x=paperless_counts.index, y=paperless_counts.values, name='Paperless Billing'), row=1, col=1)

# Plot Payment Method
payment_counts = df['PaymentMethod'].value_counts()
fig.add_trace(go.Bar(x=payment_counts.index, y=payment_counts.values, name='Payment Method'), row=1, col=2)

# Plot Monthly Charges
fig.add_trace(go.Histogram(x=df['MonthlyCharges'], name='Monthly Charges'), row=2, col=1)

# Plot Total Charges
fig.add_trace(go.Histogram(x=df['TotalCharges'], name='Total Charges'), row=2, col=2)

# Update layout
fig.update_layout(
    title_text='Customer Billing and Charges',
    showlegend=False,
    height=800,
    width=1000
)

# Rotate x-axis labels for Payment Method
fig.update_xaxes(tickangle=90, row=1, col=2)

# Show the figure
fig.show()

These graphs show the method of billing and the bill amounts.

**Inference 1:** Inferences from bar plots of 'Paperless Billing' and 'Payment Method' columns

* Most of the customers, nearly 4000, prefer paperless billing, and electronic check is the most common payment method.
* A significant number of customers still prefer paper bills.
* Apart from electronic checks, the other modes of payment accepted by the company include mailed checks, bank transfers, and credit cards.

**Inference 2:** Inferences from histograms of 'Monthly Charges' and 'Total Charges' columns

* A large number of customers pay around \$20 per month, and total charges are most heavily concentrated at the low end, below about \$200.
* However, a considerable number of customers have monthly charges between \$70 and \$100 and total charges between \$200 and \$800.
* Interestingly, the total charges graph shows that some customers have a total bill exceeding \$4000, and a few even reach \$8000. This is possible when a customer has a long tenure or uses many services.

In conclusion, the company mainly has customers with low charges, which means the company should focus on these customers by providing even more affordable services.

#### 5.1.5 Churn Analysis

In [None]:
# Create the pie chart
fig = px.pie(df, 
             names='Churn', 
             title='Churn Count', 
             hole=0.3,  # If you want a donut chart, otherwise remove this
             color_discrete_sequence=px.colors.sequential.RdBu)

# Display the chart
fig.show()

In the dataset, churned customers are a minority compared to non-churned customers, so the target classes are imbalanced.
* About 26.5% of customers have churned from the telecom company.
* This could indicate that the company retains most of its customers, although losing roughly one in four is still substantial.

### 5.2 Bi-Variate Analysis

* Up to this point, I've visualized the data to gain a comprehensive understanding. 
* Now, I'll examine the relationship between the independent variables and the target variable.

#### 5.2.1 Demographic Analysis and Churn Relationship

In [None]:
# Create subplot figure with 2 rows and 2 columns
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Gender Distribution', 'Senior Citizen and Churn', 'Partner and Churn', 'Dependents and Churn')
)

# Gender Distribution
gender_fig = px.histogram(df, x='gender', color='Churn', barmode='group')
for trace in gender_fig.data:
    fig.add_trace(trace, row=1, col=1)

# Senior Citizen Distribution
senior_citizen_fig = px.histogram(df, x='SeniorCitizen', color='Churn', barmode='group')
for trace in senior_citizen_fig.data:
    fig.add_trace(trace, row=1, col=2)

# Partner Distribution
partner_fig = px.histogram(df, x='Partner', color='Churn', barmode='group')
for trace in partner_fig.data:
    fig.add_trace(trace, row=2, col=1)

# Dependents Distribution
dependents_fig = px.histogram(df, x='Dependents', color='Churn', barmode='group')
for trace in dependents_fig.data:
    fig.add_trace(trace, row=2, col=2)

# Update layout
fig.update_layout(
    title_text='Customer Demographics and Churn Distribution',
    height=800,
    width=1000
)

# Show the figure
fig.show()

From these graphs, we can understand the relationship between customer demographics and customer churn.

* Males and females have nearly equal churn counts, indicating no significant relationship between gender and customer churn.
* Senior citizens have a lower churn count than non-senior citizens. However, senior citizens are also a much smaller group (about 16% of customers), so a lower count does not by itself mean a lower churn rate.
* Customers without partners have a higher churn count than those with partners. Similarly, customers without dependents have a higher churn count than those with dependents.

In conclusion, customers without a partner or without dependents have a higher churn count, while senior citizens have a lower churn count.
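
Because the plots above show counts, groups of different sizes are hard to compare directly. A minimal sketch of turning counts into per-group churn rates with `pd.crosstab(..., normalize='index')` follows; the ten-row frame is invented purely to show that a group can have the lower count and still the higher rate.

```python
import pandas as pd

# Invented toy data: 8 non-seniors (2 churned) and 2 seniors (1 churned).
toy = pd.DataFrame({
    "SeniorCitizen": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
    "Churn": ["Yes", "Yes", "No", "No", "No", "No", "No", "No", "Yes", "No"],
})

# Raw counts per group, as a grouped bar chart would show them.
counts = pd.crosstab(toy["SeniorCitizen"], toy["Churn"])

# Row-normalized table: each row sums to 1, so the 'Yes' column is the churn rate.
rates = pd.crosstab(toy["SeniorCitizen"], toy["Churn"], normalize="index")

print(counts)
print(rates)
```

Applied to the real `df`, the same two calls would show whether the lower churn count for senior citizens also corresponds to a lower churn rate.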

#### 5.2.2 Impact of Service Choices on Customer Churn

In [None]:
# Create subplots using Seaborn
fig, ax = plt.subplots(3, 3, figsize=(20, 20))

# Phone Service
sns.countplot(x='PhoneService', data=df, hue='Churn', ax=ax[0,0]).set_title('Phone Service')

# Multiple Lines
sns.countplot(x='MultipleLines', data=df, hue='Churn', ax=ax[0,1]).set_title('Multiple Lines')

# Internet Service
sns.countplot(x='InternetService', data=df, hue='Churn', ax=ax[0,2]).set_title('Internet Service')

# Online Security
sns.countplot(x='OnlineSecurity', data=df, hue='Churn', ax=ax[1,0]).set_title('Online Security')

# Online Backup
sns.countplot(x='OnlineBackup', data=df, hue='Churn', ax=ax[1,1]).set_title('Online Backup')

# Device Protection
sns.countplot(x='DeviceProtection', data=df, hue='Churn', ax=ax[1,2]).set_title('Device Protection')

# Tech Support
sns.countplot(x='TechSupport', data=df, hue='Churn', ax=ax[2,0]).set_title('Tech Support')

# Streaming TV
sns.countplot(x='StreamingTV', data=df, hue='Churn', ax=ax[2,1]).set_title('Streaming TV')

# Streaming Movies
sns.countplot(x='StreamingMovies', data=df, hue='Churn', ax=ax[2,2]).set_title('Streaming Movies')

plt.tight_layout()
plt.show()

These graphs visualize the relationship between customer churn and the services opted by customers.

* In the phone and internet services, there is no significant relationship between churn and the type of service opted.
* However, the churn rate is higher among customers who have opted for multiple lines.

Regarding other services:

* Customers who have not taken Online Backup or Device Protection services have a higher churn rate compared to those who have opted for these services.
* Additionally, customers with streaming services have a lower churn rate compared to those who have not opted for them.

Therefore, certain services are related to customer churn, specifically multiple lines, Online Backup, Device Protection, and Streaming Services.

#### 5.2.3 Relationship Between Customer Tenure, Contract Type, and Churn

In [None]:
# Creating subplots
fig = make_subplots(rows=1, cols=2, subplot_titles=('Customer Tenure (in Months) and Churn', 'Contract Type and Churn'))

# Tenure and Churn histogram
tenure_hist = px.histogram(df, x='tenure', color='Churn', barmode='stack')
for trace in tenure_hist['data']:
    fig.add_trace(trace, row=1, col=1)

# Contract Type and Churn countplot
contract_count = px.histogram(df, x='Contract', color='Churn')
for trace in contract_count['data']:
    fig.add_trace(trace, row=1, col=2)

# Updating layout
fig.update_layout(height=600, width=1000, title_text='Customer Tenure and Contract Type in Relation to Churn')

# Showing the figure
fig.show()

* There appears to be an inverse relationship between customer tenure and churn.
* Customers with shorter tenures, particularly less than 5 months, exhibit a higher churn rate.
* As tenure increases, the churn rate decreases.
* Additionally, customers with month-to-month contracts demonstrate a higher churn rate compared to those with one or two-year contracts. This suggests that customers with longer contract durations have a lower churn rate.
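
The inverse relationship described above can also be checked numerically by grouping tenure into bands and computing the churn rate in each band. This is a sketch on invented data; the band edges (12, 48, 72 months) and the column name `tenure_band` are assumptions chosen for illustration.

```python
import pandas as pd

# Invented toy data; tenure is in months.
toy = pd.DataFrame({
    "tenure": [1, 2, 3, 4, 30, 40, 50, 60, 70, 72],
    "Churn": ["Yes", "Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No"],
})

# Bucket tenure into three right-closed bands: (0, 12], (12, 48], (48, 72].
toy["tenure_band"] = pd.cut(
    toy["tenure"], bins=[0, 12, 48, 72], labels=["0-12", "13-48", "49-72"]
)

# The mean of a boolean 'churned' flag per band is the churn rate of that band.
rate = toy["Churn"].eq("Yes").groupby(toy["tenure_band"], observed=True).mean()
print(rate)
```

A steadily falling rate across the bands would support the reading of the histogram.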

#### 5.2.4 Impact of Billing Methods and Charges on Churn

In [None]:
# Create subplots
fig = make_subplots(rows=2, cols=2, subplot_titles=("Paperless Billing", "Payment Method", "Monthly Charges", "Total Charges"))

# Paperless Billing: count per category first so that x labels and y values line up
pb_churned = df[df['Churn'] == 'Yes'].groupby('PaperlessBilling').size()
pb_retained = df[df['Churn'] == 'No'].groupby('PaperlessBilling').size()
fig.add_trace(go.Bar(x=pb_churned.index, y=pb_churned.values, name='Churned', marker_color='red'), row=1, col=1)
fig.add_trace(go.Bar(x=pb_retained.index, y=pb_retained.values, name='Not Churned', marker_color='green'), row=1, col=1)

# Payment Method
pm_churned = df[df['Churn'] == 'Yes'].groupby('PaymentMethod').size()
pm_retained = df[df['Churn'] == 'No'].groupby('PaymentMethod').size()
fig.add_trace(go.Bar(x=pm_churned.index, y=pm_churned.values, name='Churned', marker_color='red'), row=1, col=2)
fig.add_trace(go.Bar(x=pm_retained.index, y=pm_retained.values, name='Not Churned', marker_color='green'), row=1, col=2)

# Monthly Charges
fig.add_trace(go.Histogram(x=df[df['Churn'] == 'Yes']['MonthlyCharges'], name='Churned', marker_color='red', opacity=0.5), row=2, col=1)
fig.add_trace(go.Histogram(x=df[df['Churn'] == 'No']['MonthlyCharges'], name='Not Churned', marker_color='green', opacity=0.5), row=2, col=1)

# Total Charges
fig.add_trace(go.Histogram(x=df[df['Churn'] == 'Yes']['TotalCharges'], name='Churned', marker_color='red', opacity=0.5), row=2, col=2)
fig.add_trace(go.Histogram(x=df[df['Churn'] == 'No']['TotalCharges'], name='Not Churned', marker_color='green', opacity=0.5), row=2, col=2)

# Update layout
fig.update_layout(title_text="Customer Churn Analysis", height=700, width=1000)

# Show plot
fig.show()

* Based on these plots, there appears to be no significant relationship between customer churn and either paperless billing or payment method.
* However, monthly and total charges exhibit interesting dynamics regarding customer churn.
* Customers with higher monthly charges tend to have a higher churn count, which aligns with expectations.
* Surprisingly, customers with higher total charges demonstrate a lower churn count. This could arise if such customers have a lengthy tenure or use numerous services.
* Therefore, reducing monthly charges for customers may be a strategic approach to mitigate churn.
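
One way to put numbers on these observations is to compare the median charges of churned and retained customers. The sketch below uses an invented eight-row frame in which total charges are simply monthly charges times tenure; that construction is an assumption made for illustration only.

```python
import pandas as pd

# Invented toy data: churners pay more per month but leave early.
toy = pd.DataFrame({
    "MonthlyCharges": [20, 25, 80, 95, 100, 30, 90, 85],
    "tenure": [60, 50, 2, 3, 5, 40, 70, 1],
    "Churn": ["No", "No", "Yes", "Yes", "Yes", "No", "No", "Yes"],
})
toy["TotalCharges"] = toy["MonthlyCharges"] * toy["tenure"]

# Median charges per churn group; medians are less sensitive to a few very large bills.
summary = toy.groupby("Churn")[["MonthlyCharges", "TotalCharges"]].median()
print(summary)
```

In this toy example the churned group has the higher median monthly charge but the lower median total charge, which mirrors the pattern described in the bullets above.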

## Step:6 Data Preprocessing

#### 6.1 Label encoding

In [None]:
# Create an empty dictionary to store the mapping
encoded_to_original = {}

for col in df.columns:
    if df[col].dtype == "object" or df[col].dtype == "bool":
        le = LabelEncoder()
        # Fit and transform the column
        df[col] = le.fit_transform(df[col]) 
        # Store the mapping in the dictionary
        encoded_to_original[col] = {i: label for i, label in enumerate(le.classes_)}

# Now you can access the original labels corresponding to each encoded value for each column
encoded_to_original
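
As a small illustration of what the stored mapping contains, the sketch below encodes a made-up list of contract types. `LabelEncoder` assigns integer codes in sorted order of the labels, and `inverse_transform` recovers the original strings.

```python
from sklearn.preprocessing import LabelEncoder

# Invented sample values resembling the 'Contract' column.
contracts = ["Month-to-month", "Two year", "One year", "Month-to-month"]

le = LabelEncoder()
codes = le.fit_transform(contracts)

# Same kind of code -> label dictionary as built in the loop above.
mapping = {int(i): str(label) for i, label in enumerate(le.classes_)}

# Decode the integer codes back to the original labels.
decoded = le.inverse_transform(codes)

print(mapping)
print(list(codes))
```

Since the codes follow alphabetical order rather than any business meaning, linear models may read them as an ordering that was not intended; one-hot encoding is a common alternative for nominal columns.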

#### 6.2 Feature Scaling

In [None]:
#Standardizing the data
sc = StandardScaler()
df[['tenure', 'MonthlyCharges', 'TotalCharges']] = sc.fit_transform(df[['tenure', 'MonthlyCharges', 'TotalCharges']])

In [None]:
df.head()
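
The scaler above is fitted on the full dataset before the train/test split, so statistics from the future test rows influence the scaling. A possible alternative, sketched here on synthetic data and not what this notebook does, is to wrap the scaler and the model in a scikit-learn pipeline so that the scaler is fitted on the training rows only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for three numeric features and a binary target.
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=20, size=(200, 3))
y = (X[:, 0] + rng.normal(scale=10, size=200) > 50).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# fit() learns the scaling statistics from X_train only;
# score() and predict() reuse those statistics on X_test.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

scaler = pipe.named_steps["standardscaler"]
test_accuracy = pipe.score(X_test, y_test)
print("Scaler means (train only):", scaler.mean_)
print("Test accuracy:", test_accuracy)
```

A fitted pipeline can also be saved as a single object, so new customers are scaled the same way at prediction time.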

## Step:7 Model Building

I will be using the following models to predict the customer churn:
1. Logistic Regression
2. Decision Tree Classifier
3. Random Forest Classifier

### 7.1 Train test split

In [None]:
X = df.drop(columns='Churn')
y= df['Churn']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
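
The split above is a plain random split. Since only about 26.5% of customers churned, one option (an assumption, not used in this notebook) is to pass `stratify=y` so that the train and test sets keep roughly the same churn proportion. A minimal sketch on synthetic labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with 26% positives, similar in spirit to the churn share.
y = np.array([0] * 74 + [1] * 26)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Both shares stay close to the overall 0.26.
print("Train churn share:", y_tr.mean())
print("Test churn share:", y_te.mean())
```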

### 7.2 Applying Logistic Regression

In [None]:
# Initialize Logistic Regression model
lg_model = LogisticRegression()

# Train the model
lg_model.fit(X_train, y_train)

# Predict using the trained model
lg_pred = lg_model.predict(X_test)

### 7.3 Applying Decision Tree Classifier

In [None]:
# Initialize Decision Tree Classifier model
dt_model = DecisionTreeClassifier()

# Train the model
dt_model.fit(X_train, y_train)

# Predict using the trained model
dt_pred = dt_model.predict(X_test)

### 7.4 Applying Random Forest Classifier

In [None]:
# Initialize Random Forest Classifier model
rf_model = RandomForestClassifier()

# Train the model
rf_model.fit(X_train, y_train)

# Predict using the trained model
rf_pred = rf_model.predict(X_test)

## Step:8 Model Evaluation and Comparison

#### 8.1 Model Evaluation (Using Accuracy, Precision, Recall, F1-score, and ROC-AUC)

In [None]:
# Logistic Regression Evaluation
lg_pred = lg_model.predict(X_test)
lg_accuracy = accuracy_score(y_test, lg_pred)
lg_precision = precision_score(y_test, lg_pred)
lg_recall = recall_score(y_test, lg_pred)
lg_f1 = f1_score(y_test, lg_pred)
lg_roc_auc = roc_auc_score(y_test, lg_model.predict_proba(X_test)[:, 1])  # ROC-AUC is computed from predicted probabilities

# Decision Tree Classifier Evaluation
dt_pred = dt_model.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_pred)
dt_precision = precision_score(y_test, dt_pred)
dt_recall = recall_score(y_test, dt_pred)
dt_f1 = f1_score(y_test, dt_pred)
dt_roc_auc = roc_auc_score(y_test, dt_model.predict_proba(X_test)[:, 1])  # ROC-AUC is computed from predicted probabilities

# Random Forest Classifier Evaluation
rf_pred = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_pred)
rf_precision = precision_score(y_test, rf_pred)
rf_recall = recall_score(y_test, rf_pred)
rf_f1 = f1_score(y_test, rf_pred)
rf_roc_auc = roc_auc_score(y_test, rf_model.predict_proba(X_test)[:, 1])  # ROC-AUC is computed from predicted probabilities

# Print or display evaluation metrics for each model
print("Logistic Regression Metrics:")
print("Accuracy:", lg_accuracy)
print("Precision:", lg_precision)
print("Recall:", lg_recall)
print("F1-score:", lg_f1)
print("ROC-AUC:", lg_roc_auc)
print()

print("Decision Tree Classifier Metrics:")
print("Accuracy:", dt_accuracy)
print("Precision:", dt_precision)
print("Recall:", dt_recall)
print("F1-score:", dt_f1)
print("ROC-AUC:", dt_roc_auc)
print()

print("Random Forest Classifier Metrics:")
print("Accuracy:", rf_accuracy)
print("Precision:", rf_precision)
print("Recall:", rf_recall)
print("F1-score:", rf_f1)
print("ROC-AUC:", rf_roc_auc)
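
ROC-AUC measures how well a model ranks positives above negatives, so it is normally computed from scores or probabilities rather than from hard 0/1 predictions. The sketch below, on six invented samples, shows that thresholding the scores first discards ranking information and can change the result.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Invented labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.45, 0.3, 0.7, 0.9])

# AUC from the continuous scores uses the full ranking.
auc_from_scores = roc_auc_score(y_true, scores)

# Thresholding at 0.5 collapses the scores to two values before computing AUC.
hard_labels = (scores >= 0.5).astype(int)
auc_from_labels = roc_auc_score(y_true, hard_labels)

print("AUC from scores:", auc_from_scores)
print("AUC from hard labels:", auc_from_labels)
```

With scikit-learn classifiers, the scores for the positive class are usually taken from `predict_proba(X)[:, 1]`.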

#### 8.2 Model Comparison

In [None]:
# Comparing model performance based on accuracy
models_accuracy = {'Logistic Regression': lg_accuracy, 'Decision Tree Classifier': dt_accuracy, 'Random Forest Classifier': rf_accuracy}
best_model = max(models_accuracy, key=models_accuracy.get)
print("Best Model based on Accuracy:", best_model)
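
Accuracy alone can be flattering when classes are imbalanced. As a reference point, the sketch below scores a majority-class baseline on synthetic labels with 26% positives; `DummyClassifier` is a scikit-learn utility, and the data here is invented.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic target: 74 non-churners and 26 churners; features are irrelevant here.
y = np.array([0] * 74 + [1] * 26)
X = np.zeros((100, 1))

# Always predicts the most frequent class (0, "no churn").
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

baseline_accuracy = accuracy_score(y, pred)
baseline_recall = recall_score(y, pred)

print("Baseline accuracy:", baseline_accuracy)
print("Baseline recall:", baseline_recall)
```

A model that never predicts churn reaches 74% accuracy on this toy target while catching no churners, so recall and F1-score on the churn class are worth weighing alongside accuracy when choosing a model.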

## Step:9 Model Saving and Loading for Scalability and Reproducibility

#### 9.1 Saving the model

In [None]:
# Step 1: Saving the model
model_file_path = '../models/customer_churn.pkl'

# Save the model to the specified file path
joblib.dump(lg_model, model_file_path)

#### 9.2 Loading the model

In [None]:
# Step 2: Loading the model
loaded_model = joblib.load(model_file_path)

#### 9.3 Perform predictions using the loaded model

In [None]:
# Step 3: Perform predictions using the loaded model
predictions = loaded_model.predict([[1, 1, 1, 0, -1.158016, 1, 2, 1, 0, 0, 0, 0, 0, 0, 0, 1, 3, 0.319168, -0.872095]])
print(predictions)

* The output is 1, which means the model predicts that this customer will churn.
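
A quick way to confirm that saving and loading preserved the model is to compare predictions before and after the round trip. The sketch below does this with a small model trained on synthetic data and a temporary directory; the file name `toy_model.pkl` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Small synthetic problem, only used to get a fitted model.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = (X[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X, y)

# Save to a temporary location, then load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print("Predictions identical after reload:", same_predictions)
```

When predicting for a new customer, the input must use the same column order, encoding, and scaling as the training data.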

## Step:10 Conclusion

This project combined exploratory data analysis (EDA) with three machine learning classifiers to predict customer churn. Here's a concise summary of the main findings and recommendations:

**Exploratory Data Analysis (EDA):**

`Demographic Insights:` Senior citizens showed a lower churn count than non-seniors, while customers without a partner or without dependents showed higher churn counts.  
`Service Choices:` Customers subscribed to streaming services showed lower churn than those without them.  
`Contractual Commitments:` Shorter tenure and month-to-month contracts were linked to higher churn rates, highlighting the importance of longer-term commitments.  
`Financial Implications:` Higher monthly charges and lower total charges were associated with increased churn, suggesting that lowering monthly fees could help mitigate churn.  

**Machine Learning Modeling:**

`Model Performance:` Logistic Regression emerged as the most effective model with an accuracy of 80%, while Decision Tree Classifier and Random Forest Classifier showed respectable accuracies of 71.7% and 78.0%, respectively.       
`Model Selection:` Despite competitive accuracies, Logistic Regression demonstrated balanced performance metrics, making it the preferred choice for churn prediction.         


**Recommendations:**

`Data-Driven Insights:` Use EDA findings to inform proactive retention strategies based on customer behavior and preferences.           
`Predictive Modeling:` Continuously refine and monitor predictive models to adapt to evolving customer dynamics, fostering long-term relationships and business sustainability.            

In conclusion, the integration of EDA insights and predictive modeling emphasizes the significance of data-driven decision-making in addressing customer churn. By leveraging these insights and implementing proactive strategies, businesses can improve customer retention efforts, drive sustainable growth, and enhance their competitive edge in dynamic market environments.