# MACHINE LEARNING INTERNSHIP @ CODE SOFT

## TASK 3 : Customer Churn Prediction 

### The dataset is available on Kaggle:

https://www.kaggle.com/datasets/shantanudhakadd/bank-customer-churn-prediction

### We will start by exploring and preprocessing our data.
1) Importing the necessary libraries
2) Loading and exploring the dataset
3) Preprocessing the data (handling missing values, encoding categorical variables, etc.)
4) Feature engineering

In [1]:
# Step 1: Importing necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

In [2]:
# Step 2: Loading and Exploring the dataset
df = pd.read_csv("Churn_Modelling.csv")

# Display the first few rows of the dataset
df.head()

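| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |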


In [3]:
# Step 3: Preprocessing the data

# Check for missing values
missing_values = df.isnull().sum()

# Check the data types of each column
data_types = df.dtypes

# Check for any duplicate rows
duplicate_rows = df.duplicated().sum()

# Summary statistics
summary_stats = df.describe()

# Display the results
missing_values, data_types, duplicate_rows, summary_stats

(RowNumber          0
 CustomerId         0
 Surname            0
 CreditScore        0
 Geography          0
 Gender             0
 Age                0
 Tenure             0
 Balance            0
 NumOfProducts      0
 HasCrCard          0
 IsActiveMember     0
 EstimatedSalary    0
 Exited             0
 dtype: int64,
 RowNumber            int64
 CustomerId           int64
 Surname             object
 CreditScore          int64
 Geography           object
 Gender              object
 Age                  int64
 Tenure               int64
 Balance            float64
 NumOfProducts        int64
 HasCrCard            int64
 IsActiveMember       int64
 EstimatedSalary    float64
 Exited               int64
 dtype: object,
 0,
          RowNumber    CustomerId   CreditScore           Age        Tenure  \
 count  10000.00000  1.000000e+04  10000.000000  10000.000000  10000.000000   
 mean    5000.50000  1.569094e+07    650.528800     38.921800      5.012800   
 std     2886.89568  7.19361

BASED ON THE OUTPUT ABOVE:
- Missing Values: The dataset does not contain any missing values; every column's count in the missing_values output is 0.
- Data Types: Most columns are integers (int64), "Balance" and "EstimatedSalary" are floats (float64), and "Surname", "Geography", and "Gender" are of type object, meaning they hold strings. "Geography" and "Gender" are categorical variables, while "Surname" is free text.
- Duplicate Rows: There are no duplicate rows in the dataset; duplicate_rows is 0.
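
To make the two checks above concrete, here is a minimal sketch on a toy DataFrame. The values are made up for illustration and are not taken from Churn_Modelling.csv; one missing value and one duplicate row are planted on purpose so that both counts are non-zero.

```python
import pandas as pd

# Toy frame (made-up values, NOT the churn dataset) with one NaN and one repeated row.
toy = pd.DataFrame({
    "CreditScore": [619, 608, None, 608],
    "Geography": ["France", "Spain", "France", "Spain"],
    "Exited": [1, 0, 1, 0],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()

# Number of rows that are exact copies of an earlier row.
duplicate_count = int(toy.duplicated().sum())

print(missing_per_column)
print("duplicate rows:", duplicate_count)
```

On the real dataset both checks came back as zero, which is why no imputation or de-duplication step follows.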

### NOW FEATURE ENGINEERING 

In [4]:
# Step 4: Feature engineering
# Encoding categorical variables
df = pd.get_dummies(df, columns=['Geography', 'Gender'], drop_first=True)

# Dropping unnecessary columns
df.drop(columns=['RowNumber', 'CustomerId', 'Surname'], inplace=True)

# Display the modified dataset
df.head()

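| | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Geography_Germany | Geography_Spain | Gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 | False | False | False |
| 1 | 608 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | False | True | False |
| 2 | 502 | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 | False | False | False |
| 3 | 699 | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 | False | False | False |
| 4 | 850 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 | False | True | False |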


EXPLANATION:
Encoding Categorical Variables: We used one-hot encoding to convert the categorical variables "Geography" and "Gender" into indicator columns. Because drop_first=True, the first category of each variable ("France" and "Female") is dropped and acts as the baseline, leaving "Geography_Germany", "Geography_Spain", and "Gender_Male". A customer from France is therefore encoded as False in both geography columns. This encoding lets machine learning models use categorical data without implying any order among the categories.

Dropping Unnecessary Columns: We dropped the "RowNumber", "CustomerId", and "Surname" columns because they are not expected to carry predictive signal for churn. "RowNumber" is simply a row index, while "CustomerId" and "Surname" identify individual customers rather than describe their behavior.
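
The effect of drop_first=True can be seen on a small example. This is a sketch on a made-up four-row frame, not the churn data; the column names mirror the notebook's, but the rows are assumptions chosen so that every category appears.

```python
import pandas as pd

# Made-up rows covering all three geographies and both genders.
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
})

# Same call as in the notebook: one indicator column per category,
# minus the first (alphabetically) category of each variable.
encoded = pd.get_dummies(toy, columns=["Geography", "Gender"], drop_first=True)

print(list(encoded.columns))
print(encoded)
```

The baseline categories "France" and "Female" get no column of their own; a row belonging to both shows up with every indicator off.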

In [5]:
# Step 1: Separating the features (X) from the target variable (y)
X = df.drop(columns=['Exited'])
y = df['Exited']

In [6]:
# Step 2: Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### We split the dataset into training and testing sets, with 80% of the data used for training and 20% for testing. We use a random state of 42 for reproducibility.
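
As a quick illustration of what an 80/20 split produces, the sketch below runs train_test_split on synthetic data. The data is randomly generated and only stands in for the churn features. The stratify=y argument is an addition that the notebook does not use; it is shown here as one possible way to keep the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1000 rows, 4 features, roughly 20% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 4))
y_demo = (rng.random(1000) < 0.2).astype(int)

# stratify=y_demo is NOT in the original notebook; it preserves the class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(len(X_tr), len(X_te))  # 800 training rows, 200 test rows
print(y_tr.mean(), y_te.mean())  # positive-class share in each part
```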

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Step 3: Define a function to train and evaluate models
def evaluate_models(models, X_train, y_train, cv=5):
    results = {}
    for name, model in models.items():
        scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='accuracy')
        results[name] = scores.mean()
    return results

# Define the models to evaluate
models = {
    'Logistic Regression': LogisticRegression(),
    'Random Forest': RandomForestClassifier(),
    'Gradient Boosting': GradientBoostingClassifier()
}

### The function evaluate_models(models, X_train, y_train, cv=5) trains and evaluates multiple models using cross-validation.

Args:
- models: A dictionary mapping model names to the models to evaluate.
- X_train: The features of the training set.
- y_train: The target variable of the training set.
- cv: Number of folds for cross-validation (default 5).

Returns:
- A dictionary mapping each model name to its mean cross-validated accuracy.
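
StandardScaler is imported in Step 1 but never applied, and logistic regression is sensitive to feature scale. The sketch below shows one possible way to pass a scaled variant through the same function: wrapping the scaler and the model in a pipeline, so the scaler is fit only on the training folds inside each cross-validation round. It runs on synthetic data from make_classification, so the score it prints says nothing about the churn dataset. The names scaled_lr, X_demo, and y_demo are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def evaluate_models(models, X_train, y_train, cv=5):
    """Return the mean cross-validated accuracy for each named model."""
    results = {}
    for name, model in models.items():
        scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='accuracy')
        results[name] = scores.mean()
    return results


# Synthetic stand-in data (assumption), not the churn dataset.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# Scaling happens inside the pipeline, so each fold is scaled independently.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

demo_results = evaluate_models({'Logistic Regression (scaled)': scaled_lr}, X_demo, y_demo)
print(demo_results)
```

Because a pipeline behaves like any other estimator, it can sit in the models dictionary next to the unscaled models without changing evaluate_models itself.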

In [8]:
# Step 4: Evaluate the models
model_results = evaluate_models(models, X_train, y_train)

# Display the results
model_results

{'Logistic Regression': 0.787625,
 'Random Forest': 0.861375,
 'Gradient Boosting': 0.8616249999999999}

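- Gradient Boosting (0.8616) and Random Forest (0.8614) are effectively tied on cross-validated accuracy, and both clearly outperform Logistic Regression (0.7876). We proceed with Gradient Boosting, which scored marginally higher, and fine-tune its parameters to optimize its performance. We'll use GridSearchCV for hyperparameter tuning.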

In [9]:
from sklearn.model_selection import GridSearchCV

# Defining the Gradient Boosting model
gb_model = GradientBoostingClassifier()

In [10]:
# Defining the hyperparameter grid for GridSearchCV
param_grid = {
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.05, 0.1, 0.2],
    'max_depth': [3, 4, 5]
}

# Performing GridSearchCV
grid_search = GridSearchCV(gb_model, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Displaying the best parameters
best_params = grid_search.best_params_
best_params

{'learning_rate': 0.1, 'max_depth': 4, 'n_estimators': 150}
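
To get a feel for the cost of this search, the number of candidate settings can be counted with ParameterGrid, which enumerates the same combinations GridSearchCV tries. The sketch below repeats the grid from the cell above; n_candidates and n_cv_fits are names introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook cell.
param_grid = {
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.05, 0.1, 0.2],
    'max_depth': [3, 4, 5]
}

# 3 x 3 x 3 parameter combinations.
n_candidates = len(ParameterGrid(param_grid))

# Each combination is fit once per cross-validation fold (cv=5).
n_cv_fits = n_candidates * 5

print(n_candidates, n_cv_fits)  # 27 135
```

With the default refit=True, GridSearchCV then fits the best combination one more time on the whole training set.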

### Now, let's evaluate the tuned Gradient Boosting model on the held-out test set, which was not used during model selection or tuning, to estimate its performance on unseen data.

In [11]:
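# Instantiating the Gradient Boosting model with the best parameters
# (grid_search.best_estimator_ already holds a model with these parameters, refit on the training set)
best_gb_model = GradientBoostingClassifier(**best_params)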

# Training the model on the training set
best_gb_model.fit(X_train, y_train)

# Evaluating the model on the test set
test_accuracy = best_gb_model.score(X_test, y_test)
test_accuracy

0.8705

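### The tuned Gradient Boosting model achieved an accuracy of 87.05% on the test set, slightly above the 86.2% cross-validated accuracy of the untuned model. Accuracy alone does not show how well the churned customers themselves are identified, so this figure is best read alongside per-class metrics such as precision and recall.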
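
One way to look beyond a single accuracy figure is a confusion matrix together with per-class precision and recall. The sketch below is an assumption-laden illustration: it trains a default GradientBoostingClassifier on synthetic, imbalanced data from make_classification, so its numbers are unrelated to the churn results above. On the notebook's own objects the same two metric calls would take y_test and best_gb_model.predict(X_test).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): about 80% negatives, 20% positives.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_pred)
print(cm)

# Precision, recall, and F1 for each class.
print(classification_report(y_te, y_pred, digits=3))
```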