# Project Overview

The aim of this project is to build a model that predicts customer churn in the telecom industry. The dataset contains each customer's demographics, service usage, and churn status. We train a Random Forest classifier and use cross-validated grid search to select its hyperparameters.

In [3]:
# Import required libraries
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the dataset
df = pd.read_csv('customer_churn.csv')

# Data preprocessing
df.dropna(inplace=True)  # Drop rows with missing values
X = df.drop('churn', axis=1)  # Independent variables
y = df['churn']  # Dependent variable
X = pd.get_dummies(X)  # One-hot encoding
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # Split the dataset

# Random Forest Classifier
rfc = RandomForestClassifier()  # Instantiate the classifier
rfc.fit(X_train, y_train)  # Fit the classifier on the training data
y_pred = rfc.predict(X_test)  # Predict the labels for the test data

# Cross-Validation
scores = cross_val_score(rfc, X_train, y_train, cv=10)  # Perform 10-fold cross-validation
print("Cross-Validation Scores:", scores)
print("Average Score:", scores.mean())

# Hyperparameter Tuning
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
grid_search = GridSearchCV(rfc, param_grid, cv=10)
grid_search.fit(X_train, y_train)
print("Best Parameters:", grid_search.best_params_)

# Model Evaluation
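y_pred = grid_search.predict(X_test)  # Predict with the best estimator, refit on the full training set
print("Accuracy:", accuracy_score(y_test, y_pred))
# The churn labels are the strings 'No'/'Yes', so the positive class must be named explicitly.
# With the default pos_label=1, these calls raise the ValueError recorded in the output.
print("Precision:", precision_score(y_test, y_pred, pos_label='Yes'))
print("Recall:", recall_score(y_test, y_pred, pos_label='Yes'))
print("F1-Score:", f1_score(y_test, y_pred, pos_label='Yes'))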




Cross-Validation Scores: [1.  1.  0.5 1.  1.  1.  1.  0.5 1.  1. ]
Average Score: 0.9




Best Parameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 10}
Accuracy: 0.8333333333333334


ValueError: pos_label=1 is not a valid label. It should be one of ['No', 'Yes']

In this code, we first load the dataset and preprocess it: we drop rows with missing values, separate the churn target from the features, one-hot encode the categorical features, and split the data into training (80%) and test (20%) sets. We then instantiate a Random Forest classifier with default settings and fit it on the training data.
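The preprocessing steps can be traced on a small example. The following is a minimal sketch on a made-up table: the tenure and contract columns and all values are hypothetical, and the stratify argument is an optional addition, not something the original code uses.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature dataset; every column except 'churn' is invented.
toy = pd.DataFrame({
    "tenure": [1, 24, 3, 36, None, 12, 48, 2, 60, 5, 18],
    "contract": ["Monthly", "Yearly", "Monthly", "Yearly", "Monthly", "Monthly",
                 "Yearly", "Monthly", "Yearly", "Monthly", "Yearly"],
    "churn": ["Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes", "No"],
})

clean = toy.dropna()                       # drops the one row with a missing tenure
X_toy = pd.get_dummies(clean.drop("churn", axis=1))  # 'contract' becomes two indicator columns
y_toy = clean["churn"]

# stratify=y_toy (an assumption, not in the original) keeps the Yes/No ratio
# similar in both splits, which matters when the dataset is small.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(list(X_toy.columns))
print(len(X_tr), len(X_te))
```

Numeric columns pass through get_dummies unchanged, so only the text column is expanded.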

Next, we estimate the model's performance with 10-fold cross-validation on the training set and print the per-fold and average accuracy. The recorded fold scores take only the values 1.0 and 0.5, which suggests that each validation fold holds very few samples, so the 0.9 average is a coarse estimate. We then use grid search, again with 10-fold cross-validation, to tune the number of trees, the maximum tree depth, the minimum number of samples required to split an internal node, and the minimum number of samples required at a leaf node. We print the best combination found.
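To show how the grid search result relates to the cross-validation scores, here is a minimal sketch on synthetic data. The data from make_classification, the reduced grid, the 5 stratified folds, and the fixed seeds are all assumptions chosen to keep the example small and repeatable; they are not the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the churn data (assumption: 200 rows, 8 features).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Stratified folds keep the class ratio similar in every fold.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Reduced grid: 2 x 2 = 4 candidates, each scored on 5 folds.
small_grid = {"n_estimators": [10, 25], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),  # unfitted; GridSearchCV clones it per fit
    small_grid,
    cv=folds,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Mean CV accuracy of best candidate:", search.best_score_)
# Mean validation accuracy of every candidate, in the order of cv_results_["params"]
print(search.cv_results_["mean_test_score"])
```

Because GridSearchCV clones the estimator for every candidate and fold, the estimator passed in does not need to be fitted beforehand.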

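Finally, we evaluate the tuned model on the test set using accuracy, precision, recall, and F1-score. The recorded run reports a test accuracy of about 0.83 and then stops with a ValueError. The churn labels are the strings 'No' and 'Yes', while precision_score, recall_score, and f1_score look for a positive class labelled 1 by default. Passing pos_label='Yes' to these three functions resolves the error. No precision, recall, or F1 values were captured in the recorded output.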
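How these metrics treat string labels can be checked on a small example. The following is a minimal sketch using made-up labels, not the project's predictions, with the positive class named explicitly:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels invented for illustration (not taken from customer_churn.csv).
y_true = ["Yes", "No", "Yes", "No", "No", "Yes"]
y_hat = ["Yes", "No", "No", "No", "Yes", "Yes"]

# Name the positive class explicitly; the default pos_label=1 is not among the labels.
precision = precision_score(y_true, y_hat, pos_label="Yes")
recall = recall_score(y_true, y_hat, pos_label="Yes")
f1 = f1_score(y_true, y_hat, pos_label="Yes")

# 2 true positives, 1 false positive, 1 false negative -> all three equal 2/3
print(precision, recall, f1)
```

An alternative is to map the labels to 0 and 1 before training, in which case the default pos_label=1 applies.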

Overall, this code demonstrates how cross-validation can be used to select hyperparameters for a machine learning model, here a Random Forest classifier that predicts customer churn in the telecom industry.