# PROJECT TITLE 
## Customer Churn Prediction Model


# PROJECT OBJECTIVE 
This project is tailored to a telecommunications company. Using machine learning models and analytics, it aims to understand patterns in customer churn and to forecast churn within the company's subscriber base. It is also an opportunity to learn more about classification models and to help the client understand their data.
The results will help estimate the lifetime value of each customer and identify the factors that affect the rate at which customers stop using the company's services.



## 1. Business understanding

### EXPLANATION OF FEATURES
1. customerID: A unique identifier assigned to each customer
2. gender: Indicates the gender of the customer, categorized as male or female.
3. SeniorCitizen: Indicates whether the customer is a senior citizen. This demographic information helps segment customers based on age.
4. Partner: Indicates whether the customer has a partner
5. Dependents: Indicates whether the customer has dependents (e.g., children or others relying on their service)	
6. tenure: Represents the length of time (usually in months) that the customer has been subscribed to the service.
7. PhoneService: Indicates whether the customer has subscribed to phone services provided by the company.
8. MultipleLines: Indicates whether the customer has multiple phone lines as part of their service package.
9. InternetService: Specifies the type of internet service subscribed to by the customer 
10. OnlineSecurity: Indicates whether the customer has an online security add-on as part of their internet service.
11. OnlineBackup: Indicates whether the customer has an online backup service for data as part of their internet package.	
12. DeviceProtection: Indicates whether the customer has device protection services (e.g., insurance or warranty) for their devices.
13. TechSupport: Indicates whether the customer has technical support services included in their subscription.
14. StreamingTV: Indicates whether the customer has subscribed to streaming TV services from the provider.
15. StreamingMovies: Indicates whether the customer has subscribed to streaming movie services from the provider.
16. Contract: Specifies the type of contract the customer has.	
17. PaperlessBilling: Indicates whether the customer receives electronic bills instead of paper bills.	
18. PaymentMethod: Specifies the method the customer uses to make payments.	
19. MonthlyCharges: Represents the amount charged to the customer monthly for the subscribed services.	
20. TotalCharges: Represents the total amount charged to the customer over their entire tenure.
21. Churn: The target variable indicating whether the customer churned (left the service) or not.

### HYPOTHESIS
**Null Hypothesis** : There is no significant difference in churn rate based on monthly charges.     
**Alternative Hypothesis**: Customers with higher monthly charges are more likely to churn.

### RESEARCH QUESTIONS
1. Among customers who have churned, which type of contract is most prevalent? 
2. Which gender has the highest churn rate?
3. Is there a correlation between total charges and the type of contract? (Box plot)
4. What is the percentage breakdown of customers who have left the company? (Pie chart)
5. How does the churn rate vary based on the duration of customer subscription (tenure)? (Line chart)
6. What is the distribution of services subscribed by customers based on their tenure? (Stacked bar chart)





## 2. Data Understanding

### LIBRARY IMPORTATION

In [None]:
#Data handling 
import pandas as pd 
import numpy as np

from dotenv import dotenv_values 
import pyodbc

#Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

#machine learning
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, OrdinalEncoder
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from scipy.stats import chi2_contingency

#class imbalance
from imblearn.pipeline import Pipeline as imbpipeline
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler






import warnings
 
warnings.filterwarnings('ignore')


: 

### LOAD DATASET

The first dataset is hosted remotely in a SQL Server database.

In [None]:
# Load environment variables from .env file into a dictionary
environment_variables = dotenv_values('.env')
 
# Get the values for the credentials you set in the '.env' file
server = environment_variables.get("SERVERNAME")
database = environment_variables.get("DATABASE")
username = environment_variables.get("USERNAME")
password = environment_variables.get("PASSWORD")
 
connection_string = f"DRIVER={{ODBC Driver 18 for SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password}"

: 

In [None]:
connection = pyodbc.connect(connection_string)

: 

In [None]:
query= "select * from dbo.LP2_Telco_churn_first_3000"
data1= pd.read_sql(query, connection)
data1.head()

: 

The second dataset is stored in a GitHub repository.

In [None]:
data2= pd.read_csv('LP2_Telco-churn-last-2000.csv')
data2.head()

: 

The third dataset is the test data. It is stored on Google Drive as a CSV file named 'LP2_Telco-churn-second-2000'.

In [None]:

data3= pd.read_csv('Telco-churn-second-2000.csv')
data3.head()

: 

### Exploratory data analysis

Assessing the first dataset

In [None]:
data1.shape

: 

In [None]:
data1.info()

: 

In [None]:
data1.duplicated().sum()

: 

In [None]:
data1.isnull().sum()

: 

In [None]:
data1['Contract'].unique()


: 

In [None]:
data1['Contract'].value_counts()

: 

In [None]:
data1['InternetService'].unique()

: 

In [None]:
data1['InternetService'].value_counts()

: 

In [None]:
data1['MultipleLines'].unique()

: 

In [None]:
data1['tenure'].unique()

: 

In [None]:
data1['tenure'].max() #to check for the customers who have stayed the longest with the company 

: 

In [None]:
data1['PaymentMethod'].unique()

: 

Assessing the second dataset

In [None]:
data2.shape

: 

In [None]:
data2.info()

: 

In [None]:
data2.isna().sum()

: 

In [None]:
data2.duplicated().sum()

: 

In [None]:
data2['Contract'].unique()


: 

In [None]:
data2['Contract'].value_counts()


: 

In [None]:
data2['InternetService'].unique()

: 

In [None]:
data2['InternetService'].value_counts()

: 

In [None]:
data2['MultipleLines'].unique()

: 

In [None]:
data2['Partner'].unique()

: 

In [None]:
data2['PaymentMethod'].unique()

: 

In [None]:
data2['OnlineSecurity'].unique()

: 

In [None]:
# Convert to numeric, coerce errors to NaN
data2['TotalCharges'] = pd.to_numeric(data2['TotalCharges'], errors='coerce')


: 

In [None]:
data2['TotalCharges'].dtypes

: 

Next, we combine the two training datasets (the first and second datasets). The combined data is what we use to train the models.

In the first dataset, some categorical columns hold the boolean values True and False, whereas the second dataset uses 'Yes' and 'No'. The two formats must be aligned before the datasets can be concatenated.

In [None]:

# Replace True/False values with Yes/No
data1.replace({True: 'Yes', False: 'No'}, inplace=True)

data1.head()

: 
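Because `True == 1` and `False == 0` in Python, a frame-wide replacement can be worth double-checking against numeric columns such as `tenure`. The sketch below is one possible, more targeted variant on a toy frame: it maps only genuine boolean values and leaves numbers, strings, and missing values untouched. The helper name `bool_to_yes_no` and the toy values are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for data1; column names mirror the dataset, values are invented.
toy = pd.DataFrame({
    "Partner": [True, False, True],             # pure boolean column
    "OnlineSecurity": [True, None, False],      # boolean values with a missing entry
    "tenure": [1, 0, 24],                       # 1 and 0 must stay numeric
    "Contract": ["Month-to-month", "One year", "Two year"],
})

def bool_to_yes_no(value):
    """Map real booleans to 'Yes'/'No'; pass every other value through unchanged."""
    if isinstance(value, (bool, np.bool_)):
        return "Yes" if value else "No"
    return value

# Numeric columns are skipped entirely, so integers 0/1 are never compared to False/True.
non_numeric = toy.select_dtypes(exclude="number").columns
toy[non_numeric] = toy[non_numeric].apply(lambda col: col.map(bool_to_yes_no))

print(toy)
```

The same pattern could be applied to `data1` before concatenation, assuming its boolean columns are stored either as `bool` or as `object` columns containing `True`/`False`.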

In [None]:
trn_data= pd.concat([data1, data2], axis=0)
trn_data.reset_index(drop=True, inplace=True)
trn_data

: 

In [None]:
trn_data.shape

: 

In [None]:
(trn_data['tenure']==72).sum()  #out of 3000 customers 167 stayed the longest with the company (6 years)

: 

In [None]:
(trn_data['tenure']== 0).sum()    #customers who stayed for less than a month 

: 

In [None]:
trn_data[trn_data['tenure']== 0]

: 

In [None]:

#  set the 'totalcharges' column equal to the 'monthlycharges' column for rows where 'tenure' is equal to 0
trn_data.loc[trn_data['tenure'] == 0, 'TotalCharges'] = trn_data.loc[trn_data['tenure'] == 0, 'MonthlyCharges']


: 

In [None]:
trn_data[trn_data['tenure']== 0]

: 

In [None]:
trn_data.duplicated().sum()

: 

In [None]:
# summary statistics
trn_data.describe().T

: 

In [None]:
trn_data.describe(include='object').T

: 

In [None]:
trn_data.hist()

: 

In [None]:
sns.scatterplot(data=trn_data, x='tenure', y='MonthlyCharges')

: 

In [None]:
data = trn_data[["Churn","tenure","Contract","MonthlyCharges","TotalCharges"]]
# pairplot creates its own figure, so a separate plt.figure() call is not needed
sns.pairplot(data,hue='Churn',palette={'Yes':"purple","No":"limegreen"})
plt.show()

: 

In [None]:

# Select numerical columns for correlation
numerical_columns = trn_data.select_dtypes(include=[np.number])

# Calculate correlation matrix
correlation_matrix = numerical_columns.corr()

# Create a heatmap to visualize the correlation matrix
plt.figure(figsize=(6, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='Pastel1', fmt=".2f")
plt.title('Correlation Matrix')
plt.show()


: 

In [None]:
trn_data.to_csv('ml_dataset.csv', index=False)

: 

Answering analytical questions 

1. Among customers who have churned, which type of contract is most prevalent? (Bar chart)

In [None]:

churned_customers = trn_data[trn_data['Churn'] == 'Yes'].reset_index()


# Count the most prevalent type of contract among churned customers
contract_counts = churned_customers['Contract'].value_counts()

plt.figure(figsize=(8, 6))
contract_counts.plot(kind='bar', color='skyblue')
plt.title('Contract Distribution Among Churned Customers')
plt.xlabel('Contract Types')
plt.ylabel('Count')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.tight_layout()
plt.show()



: 

Customers on a month-to-month contract churn more than customers on one-year or two-year contracts.
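The bar chart above counts churned customers per contract type. Counts depend on how many customers each contract type has in total, so a churn rate per contract type can be a useful cross-check. The following is a minimal sketch on invented toy data; only the column names `Contract` and `Churn` come from the dataset.

```python
import pandas as pd

# Invented toy data: 4 month-to-month, 3 one-year and 3 two-year customers.
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["One year"] * 3 + ["Two year"] * 3,
    "Churn": ["Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "No"],
})

# normalize="index" makes every row sum to 1, so the 'Yes' column
# is the share of customers of that contract type who churned.
rates = pd.crosstab(toy["Contract"], toy["Churn"], normalize="index")["Yes"]

print(rates.sort_values(ascending=False))
```

Applied to `trn_data`, the same `pd.crosstab` call would give the churn rate per contract type rather than the raw count.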

2. Which gender has the highest churn rate?


In [None]:

# Calculate churn rates by gender
churn_rates = trn_data.groupby('gender')['Churn'].value_counts(normalize=True).loc[:, 'Yes']

# Create a bar plot for churn rates by gender
plt.figure(figsize=(8, 6))
bars= churn_rates.plot(kind='bar', color='skyblue')
# Add value labels on top of each bar
for bar in bars.patches:
    plt.text(bar.get_x() + bar.get_width() / 2 - 0.1,  # X coordinate for label
             bar.get_height() + 0.01,  # Y coordinate for label
             f'{bar.get_height():.2%}',  # Text to display (formatted as percentage)
             ha='center', va='bottom', color='black', fontsize=10)  # Text properties

plt.title('Churn Rate by Gender')
plt.xlabel('Gender')
plt.ylabel('Churn Rate')
plt.xticks(rotation=0)

plt.tight_layout()
plt.show()


: 

The churn rate is almost the same for both genders.

3. Is there a correlation between total charges and the type of contract? (Box plot)


In [None]:


plt.figure(figsize=(8, 6))
sns.boxplot(x='Contract', y='TotalCharges', data=trn_data)
plt.title('Distribution of Total Charges by Contract Type')
plt.xlabel('Contract Type')
plt.ylabel('Total Charges')

plt.show()


: 

Customers on a two-year contract churn less, so they stay longer and generate more revenue for the company: they have the highest average total charges.

4. What is the percentage breakdown of customers who have left the company? (Pie chart)


In [None]:

churn_percentage = trn_data['Churn'].value_counts(normalize=True) * 100


# Create a pie chart for the percentage breakdown of churned customers
plt.figure(figsize=(6, 6))
plt.pie(churn_percentage, labels=churn_percentage.index, autopct='%1.1f%%', colors=['skyblue', 'lightgreen'])
plt.title('Percentage Breakdown of Churned Customers')
plt.axis('equal') 
plt.tight_layout()
plt.show()


: 

5. How does the churn rate vary based on the duration of customer subscription (tenure)? (Line chart)


In [None]:
# Calculate churn rates for each tenure
churn_rates = trn_data.groupby('tenure')['Churn'].value_counts(normalize=True).loc[:, 'Yes'] * 100


# Create a line chart for churn rates over tenure
plt.figure(figsize=(10, 6))
plt.plot(churn_rates.index, churn_rates.values, marker='o', linestyle='-')
plt.title('Churn Rate Variation Based on Tenure')
plt.xlabel('Tenure (Months)')
plt.ylabel('Churn Rate (%)')
plt.tight_layout()
plt.show()



: 

### Testing Hypothesis 
**Null Hypothesis** : There is no significant difference in churn rate based on monthly charges.     
**Alternative Hypothesis**: Customers with higher monthly charges are more likely to churn.

In [None]:


# Create a contingency table
contingency_table = pd.crosstab(trn_data['Churn'], trn_data['MonthlyCharges'])

# Perform Chi-Square Test for Independence
chi2, p, dof, expected = chi2_contingency(contingency_table)

# Set my significance level
alpha = 0.05

# Interpret the results
if p < alpha:
    print("Reject the null hypothesis. There is a significant difference in customer churn based on monthly charges.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in customer churn based on monthly charges.")


: 
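`MonthlyCharges` is a continuous variable, so a contingency table built on its raw values has one column per distinct charge and very small expected counts, and a chi-square test does not capture the direction stated in the alternative hypothesis. As a hedged sketch of two alternatives, the example below (a) bins the charges into quartiles before the chi-square test and (b) runs a one-sided Mann-Whitney U test. The data is synthetic and the quartile binning is an assumption chosen for illustration, not the notebook's method.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, mannwhitneyu

rng = np.random.default_rng(0)

# Synthetic data: churners are deliberately drawn with higher charges.
toy = pd.DataFrame({
    "MonthlyCharges": np.concatenate([rng.normal(80, 15, 200), rng.normal(60, 20, 600)]),
    "Churn": ["Yes"] * 200 + ["No"] * 600,
})

# (a) Chi-square on binned charges: four quartile bands instead of raw values.
toy["charge_band"] = pd.qcut(toy["MonthlyCharges"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])
table = pd.crosstab(toy["Churn"], toy["charge_band"])
chi2, p_chi2, dof, _ = chi2_contingency(table)

# (b) One-sided Mann-Whitney U: are churners' charges stochastically greater?
churned = toy.loc[toy["Churn"] == "Yes", "MonthlyCharges"]
stayed = toy.loc[toy["Churn"] == "No", "MonthlyCharges"]
u_stat, p_mwu = mannwhitneyu(churned, stayed, alternative="greater")

print(f"chi-square on quartile bands: p = {p_chi2:.4g}, dof = {dof}")
print(f"one-sided Mann-Whitney U:     p = {p_mwu:.4g}")
```

The one-sided test matches the wording "customers with higher monthly charges are more likely to churn" more directly than a test of independence does.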

### Insights

   - 'TotalCharges' does not consistently equal the product of 'tenure' and 'MonthlyCharges', so it cannot be treated as a simple derived column.
   - In cases where 'tenure' is 0 (indicating new clients), 'TotalCharges' should be equal to 'MonthlyCharges' instead of having null values.


### Data preparation

In [None]:
trn_data.loc[trn_data['Churn'].isnull()]

: 

In [None]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection
trn_data['Churn'] = trn_data['Churn'].fillna(trn_data['Churn'].mode()[0])


: 

In [None]:
trn_data.head(3)

: 

In [None]:
X= trn_data.drop('Churn', axis=1)

: 

In [None]:
y= trn_data['Churn']

: 

In [None]:
y.isnull().sum()

: 

### split data into training and evaluation

In [None]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


: 
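When the target classes are imbalanced, a plain random split can leave the test set with a different churn share than the training set. One common option is the `stratify` argument of `train_test_split`, sketched here on invented toy data. It assumes the target contains no missing values, which holds above because `Churn` was filled beforehand.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented toy data with a 25% / 75% class split.
toy_X = pd.DataFrame({"tenure": range(100)})
toy_y = pd.Series(["Yes"] * 25 + ["No"] * 75, name="Churn")

# stratify=toy_y keeps the Yes/No proportions the same in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```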

In [None]:
X_train

: 

### Feature Engineering

In [None]:
num_column= X.select_dtypes(include='number').columns
num_column

: 

In [None]:
cat_column= X.drop(['customerID','SeniorCitizen'], axis=1).select_dtypes(include='object').columns
cat_column

: 

In [None]:

y_train= y_train.values.reshape(-1)


: 

In [None]:
y_test= y_test.values.reshape(-1)

: 

In [None]:
imputer = SimpleImputer(strategy='most_frequent')  # fill any missing labels with the most frequent class

# Fit on the training labels only, then reuse the fitted imputer on the test labels
y_train = imputer.fit_transform(y_train.reshape(-1, 1)).ravel()
y_test = imputer.transform(y_test.reshape(-1, 1)).ravel()


: 

### Pipelines

In [None]:
numerical_pipeline= Pipeline(steps=[
    ('num_imputer', SimpleImputer(strategy='median')),
    ('scaler', RobustScaler()),
])

categorical_pipeline= Pipeline(steps=[
    ('cat_imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder()),
])

preprocessor = ColumnTransformer([
    ('numerical_pipeline', numerical_pipeline, num_column),
    ('categorical_pipeline',categorical_pipeline, cat_column),
])

: 

In [None]:
X_transform=pd.DataFrame(preprocessor.fit_transform(X),columns=preprocessor.get_feature_names_out())
X_transform


: 

### Label Encoding

In [None]:
label_encoder= LabelEncoder()
y_train_encoded= label_encoder.fit_transform(y_train)
y_test_encoded= label_encoder.transform(y_test)

: 

### ML Pipeline - unbalanced

In [None]:
decision_tree_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier(random_state=42)),
])

decision_tree_pipeline.fit(X_train,y_train_encoded )

: 

### Evaluate

In [None]:
y_pred = decision_tree_pipeline.predict(X_test)

: 

In [None]:
y_true = y_test_encoded

print(classification_report(y_true , y_pred))

: 

In [None]:
label_encoder.classes_

: 

In [None]:
svc_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', SVC(random_state=42)),
])

svc_pipeline.fit(X_train,y_train_encoded )

: 

In [None]:
y_true= y_test_encoded

svc_pred= svc_pipeline.predict(X_test)
print(classification_report(y_true, svc_pred))

: 

In [None]:


rf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('rf_classifier', RandomForestClassifier(random_state=42))
])

rf_pipeline.fit(X_train, y_train_encoded)

: 

In [None]:
rf_pred= rf_pipeline.predict(X_test)

: 

In [None]:
print(classification_report(y_test_encoded, rf_pred))

: 

In [None]:

# Define the KNN pipeline
knn_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),        # Preprocessing steps
    ('knn_classifier', KNeighborsClassifier())  # KNN classifier
    ])
# Set random seed for the entire Python environment

# Fit the KNN pipeline on the training data
knn_pipeline.fit(X_train, y_train_encoded)


: 

In [None]:
knn_pred = knn_pipeline.predict(X_test)

: 

In [None]:
# Generate a classification report
report_knn = classification_report(y_test_encoded, knn_pred)

# Print the classification report
print(report_knn)

: 

### Compare Models

In [None]:
models= [
    ('tree_classifier', DecisionTreeClassifier(random_state=42)),
    ('svc_classifier', SVC(random_state=42, probability= True)),
    ('rf_classifier', RandomForestClassifier(random_state=42)),
    ('knn2_classifier', KNeighborsClassifier()),
]


for model_name, classifier  in models:
    pipeline= Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', classifier)  # use the classifier from the current loop iteration
    ])

    pipeline.fit(X_train, y_train_encoded)

    y_pred= pipeline.predict(X_test)

    print(f'Report for {model_name}')
    print(classification_report(y_test_encoded, y_pred))
    print('='* 50)

    

: 

### Train on unbalanced dataset 

In [None]:
unbalanced_metrics = pd.DataFrame(columns=['Model_name','Accuracy','Precision','Recall','F1_Score'])


for model_name, classifier in models:


    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', classifier)
    ])
   
    # Fit the pipeline on the training data
    pipeline.fit(X_train, y_train_encoded)


    # Make predictions on the test data
    y_pred = pipeline.predict(X_test)


   


    metrics = classification_report(y_test_encoded, y_pred, output_dict=True)
    Accuracy = metrics['accuracy']
    precision = metrics['weighted avg']['precision']
    recall = metrics['weighted avg']['recall']
    f1_score = metrics['weighted avg']['f1-score']
    unbalanced_metrics.loc[len(unbalanced_metrics)] = [model_name,Accuracy,precision,recall,f1_score]



: 

In [None]:
unbalanced_metrics

: 
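Weighted averages are dominated by the majority class, so they can look healthy even when the churn class is predicted poorly. As a complementary view, the sketch below computes precision, recall, and F1 for the positive class only, using `precision_recall_fscore_support`. The labels are invented, and treating `1` as the churn class is an assumption that should be checked against `label_encoder.classes_`.

```python
from sklearn.metrics import precision_recall_fscore_support

# Invented labels: 1 is assumed to be the encoded 'Yes' (churn) class.
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# average="binary" reports the metrics for pos_label only.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true_toy, y_pred_toy, average="binary", pos_label=1
)

print(f"churn precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case there are 2 true positives, 1 false positive, and 2 false negatives, so the churn-class recall is only 0.5 even though 7 of 10 predictions are correct.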

### Train and compare balanced data set- SMOTE

In [None]:


smote_df = pd.DataFrame(columns=['Model_name', 'Accuracy', 'Precision', 'Recall', 'F1_Score'])

for model_name, classifier in models:
    pipeline = imbpipeline(steps=[
        ('preprocessor', preprocessor),
        ('smote', SMOTE(random_state=42)),
        ('classifier', classifier),
    ])

    # Fit the pipeline on the training data
    pipeline.fit(X_train, y_train_encoded)

    # Make predictions on the test data
    smote_y_pred = pipeline.predict(X_test)

    smote_dict = classification_report(y_test_encoded, smote_y_pred, output_dict=True)

    accuracy = smote_dict['accuracy']
    precision = smote_dict['weighted avg']['precision']
    recall = smote_dict['weighted avg']['recall']
    f1_score = smote_dict['weighted avg']['f1-score']

    smote_df.loc[len(smote_df)] = [model_name, accuracy, precision, recall, f1_score]

smote_df


: 

In [None]:


sampler= RandomOverSampler(random_state=42)
X_train_resampled, y_train_resampled= sampler.fit_resample(X_train, y_train_encoded)

: 

In [None]:
X_train_resampled.shape

: 

In [None]:
pd.DataFrame(y_train_resampled).value_counts()

: 

In [None]:
models= [
    ('tree_classifier', DecisionTreeClassifier(random_state=42)),
    ('svc_classifier', SVC(random_state=42, probability = True)),
    ('rf_classifier', RandomForestClassifier(random_state=42)),
    ('knn2_classifier', KNeighborsClassifier()),
]


for model_name, classifier  in models:
    pipeline= Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', classifier)  # use the classifier from the current loop iteration
    ])

    pipeline.fit(X_train_resampled, y_train_resampled)

    y_pred= pipeline.predict(X_test)

    print(f'Report for {model_name}')
    print(classification_report(y_test_encoded, y_pred))
    print('='* 50)

    

: 

### Feature importance and Selection

In [None]:

selection = SelectKBest(mutual_info_classif,k='all')

: 

In [None]:


fi_smote_df = pd.DataFrame(columns=['Model_name', 'Accuracy', 'Precision', 'Recall', 'F1_Score'])

all_pipeline = {}

for model_name, classifier in models:

    pipeline = imbpipeline(steps=[
        ('preprocessor', preprocessor),
        ('smote', SMOTE(random_state=42)),
        ('feature_importance',selection),
        ('classifier', classifier),
    ])

    # Fit the pipeline on the training data
    pipeline.fit(X_train, y_train_encoded)
    all_pipeline[model_name] = pipeline


    # Make predictions on the test data
    smote_y_pred = pipeline.predict(X_test)

    fi_smote_dict = classification_report(y_test_encoded, smote_y_pred, output_dict=True)

    accuracy = fi_smote_dict['accuracy']
    precision = fi_smote_dict['weighted avg']['precision']
    recall = fi_smote_dict['weighted avg']['recall']
    f1_score = fi_smote_dict['weighted avg']['f1-score']

    fi_smote_df.loc[len(fi_smote_df)] = [model_name, accuracy, precision, recall, f1_score]

fi_smote_df 



: 

### Visualise ROC Curve - Overlapping

In [None]:


fig, ax = plt.subplots(figsize=(8, 8))
roc_curve_data = {}

for model_name, classifier in models:
    pipeline = imbpipeline(steps=[
        ('preprocessor', preprocessor),
        ('smote', SMOTE(random_state=42)),
        ('feature_importance', selection),
        ('classifier', classifier),
    ])

    pipeline.fit(X_train, y_train_encoded)

    y_score = pipeline.predict_proba(X_test)[:, 1]
    fpr, tpr, threshold = roc_curve(y_test_encoded, y_score)
    roc_auc = auc(fpr, tpr)

    roc_curve_df = pd.DataFrame({'False Positive Rate':fpr,'True Positive Rate':tpr,'Threshold':threshold})
    roc_curve_data[model_name] = roc_curve_df
    
    ax.plot(fpr, tpr, label=f'{model_name} (AUC = {roc_auc:.2f})')

ax.plot([0, 1], [0, 1], color='navy', linestyle='--')
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('Receiver Operating Characteristic (ROC) Curve')
ax.legend(loc='lower right')
plt.show()


: 

In [None]:
roc_curve_data['rf_classifier']

: 

In [None]:

forest_pipeline = all_pipeline['rf_classifier']
forest_y_pred = forest_pipeline.predict(X_test)
matrix = confusion_matrix(y_test_encoded, forest_y_pred)

: 

In [None]:
sns.heatmap(data=matrix,annot=True,fmt='d',cmap='coolwarm')

: 

In [None]:
threshold = 0.02
y_pred_proba = forest_pipeline.predict_proba(X_test)[:,1]

binary_prediction = (y_pred_proba > threshold)
threshold_matrix = confusion_matrix(y_test_encoded,binary_prediction )
threshold_matrix

: 
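The threshold above is set by hand. One common, data-driven alternative is to pick the ROC point that maximises Youden's J statistic (true positive rate minus false positive rate). The sketch below shows the idea on invented scores. It is an assumption about how a threshold could be chosen, not the rule used in this notebook, and the right trade-off ultimately depends on the relative cost of missed churners and false alarms.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Invented labels and predicted churn probabilities.
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])
y_score_toy = np.array([0.1, 0.2, 0.3, 0.6, 0.5, 0.65, 0.7, 0.8, 0.9])

fpr, tpr, thresholds = roc_curve(y_true_toy, y_score_toy)

# Youden's J = TPR - FPR; the best threshold is where J is largest.
j_scores = tpr - fpr
best_threshold = thresholds[np.argmax(j_scores)]

# Classify as churn when the score reaches the chosen threshold.
toy_prediction = (y_score_toy >= best_threshold).astype(int)

print("chosen threshold:", best_threshold)
print("predictions:     ", toy_prediction)
```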

In [None]:
roc_curve_data['svc_classifier']

: 

### Hyperparameter tuning 

In [None]:


# Parameter names follow the '<step name>__<parameter>' pattern, so they must
# match the steps of the pipeline being tuned. forest_pipeline has the steps
# 'preprocessor', 'smote', 'feature_importance' and 'classifier'.
param_grid = {
    'feature_importance__k': [5],
    'classifier__n_estimators': [50, 100, 200],  # Hyperparameters for Random Forest Classifier
    'classifier__max_depth': [None, 10, 20],
    'classifier__min_samples_split': [2, 5]
}

grid_search = GridSearchCV(
    forest_pipeline,
    param_grid=param_grid,
    cv=5,
    scoring='f1'
)
grid_search.fit(X_train, y_train_encoded)


: 
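`GridSearchCV` addresses pipeline parameters as `<step name>__<parameter>`, and a key that does not match an existing step raises an error when fitting. A quick way to guard against this is to compare the grid keys with `pipeline.get_params()` first, as sketched below on a small synthetic problem. The pipeline, grid values, and variable names here are illustrative assumptions, not the notebook's tuned configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X_toy, y_toy = make_classification(n_samples=150, n_features=8, random_state=42)

toy_pipeline = Pipeline(steps=[
    ("feature_importance", SelectKBest(mutual_info_classif, k="all")),
    ("classifier", RandomForestClassifier(random_state=42)),
])

toy_grid = {
    "feature_importance__k": [4, "all"],
    "classifier__n_estimators": [10, 20],
}

# Every grid key should be a parameter name the pipeline actually exposes.
valid_names = set(toy_pipeline.get_params().keys())
unknown = [name for name in toy_grid if name not in valid_names]
print("unknown grid keys:", unknown)

search = GridSearchCV(toy_pipeline, param_grid=toy_grid, cv=3, scoring="f1")
search.fit(X_toy, y_toy)
best = search.best_params_
print(best)
```

A key such as `rf_classifier__n_estimators` would show up in `unknown` here, because the step is named `classifier`.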

In [None]:
best_parametres = grid_search.best_params_
best_parametres

: 

### Retrain Model with Best Parameters

In [None]:
forest_pipeline.set_params(**best_parametres)
forest_pipeline.fit(X_train,y_train_encoded)

: 

### Save the model

In [None]:
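# Save the fitted pipeline (preprocessing + tuned classifier), not the RandomForestClassifier class
with open('Random_Model.pkl', 'wb') as file:
    pickle.dump(forest_pipeline, file)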

: 
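What gets pickled matters: a fitted pipeline object carries its learned preprocessing and model state, whereas an estimator class carries none. The sketch below round-trips a small fitted pipeline through `pickle` in memory and checks that the restored object predicts identically. The toy pipeline and data are assumptions made for illustration.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=42)

toy_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=20, random_state=42)),
]).fit(X_toy, y_toy)

# Serialise the fitted object to bytes and restore it again.
payload = pickle.dumps(toy_pipeline)
restored = pickle.loads(payload)

# The restored pipeline should reproduce the original predictions exactly.
same_predictions = np.array_equal(toy_pipeline.predict(X_toy), restored.predict(X_toy))
print("predictions identical after reload:", same_predictions)
```

Pickle files are generally tied to the library versions that produced them, so loading is most reliable with the same scikit-learn version used for saving.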

### Evaluate model with test dataset 

In [None]:
with open('Random_Model.pkl', 'rb') as file:
    loaded_model = pickle.load(file)

: 

In [None]:
pd.DataFrame(preprocessor.fit_transform(X_train),columns=preprocessor.get_feature_names_out())

: 
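To score unseen data, the raw rows can be passed straight to the fitted pipeline, which applies its own preprocessing. The sketch below illustrates this on invented data: it coerces a text `TotalCharges` column to numeric (as was done for the second dataset), predicts, and maps the encoded predictions back to 'Yes'/'No'. The toy pipeline, the reduced column set, and the assumption that new data shares the training columns are all illustrative; whether the third dataset includes a `Churn` column for scoring is not stated here.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Invented training rows using a small subset of the dataset's columns.
train = pd.DataFrame({
    "tenure": [1, 5, 40, 60, 2, 70],
    "TotalCharges": [70.0, 350.0, 2400.0, 3000.0, 150.0, 4900.0],
    "Contract": ["Month-to-month", "Month-to-month", "One year",
                 "Two year", "Month-to-month", "Two year"],
    "Churn": ["Yes", "Yes", "No", "No", "Yes", "No"],
})

encoder = LabelEncoder()
y_toy = encoder.fit_transform(train["Churn"])

prep = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["tenure", "TotalCharges"]),
    # handle_unknown="ignore" avoids errors on categories not seen during fitting
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
])
model = Pipeline(steps=[
    ("preprocessor", prep),
    ("classifier", RandomForestClassifier(n_estimators=20, random_state=42)),
])
model.fit(train.drop(columns="Churn"), y_toy)

# Unseen rows with TotalCharges stored as text, one entry blank.
new_rows = pd.DataFrame({
    "tenure": [3, 65],
    "TotalCharges": ["210.5", " "],
    "Contract": ["Month-to-month", "Two year"],
})
new_rows["TotalCharges"] = pd.to_numeric(new_rows["TotalCharges"], errors="coerce")

# Predict and translate the encoded output back to the original labels.
pred_labels = encoder.inverse_transform(model.predict(new_rows))
print(pred_labels)
```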