# ExtraaLearn Project

## Context

The EdTech industry has grown rapidly over the past decade. According to one forecast, the online education market would be worth $286.62bn by 2023, with a compound annual growth rate (CAGR) of 10.26% from 2018 to 2023. Modern online education has fueled much of this growth and expansion. Features such as ease of information sharing, personalized learning experiences, and transparent assessment now make it preferable to traditional education.

In the present scenario, due to Covid-19, the online education sector has witnessed rapid growth and is attracting many new customers. This rapid growth has brought many new companies into the industry. With digital marketing resources readily available and easy to use, companies can reach a wider audience with their offerings. Customers who show interest in these offerings are termed leads. EdTech companies obtain leads from various sources, such as:

* The customer interacts with the marketing front on social media or other online platforms.
* The customer browses the website/app and downloads the brochure.
* The customer connects through emails for more information.

The company then nurtures these leads and tries to convert them to paid customers. To do this, a representative from the organization contacts the lead by phone or email to share further details.

## Objective

ExtraaLearn is an early-stage startup that offers programs on cutting-edge technologies to students and professionals to help them upskill/reskill. With a large number of leads being generated on a regular basis, one of the issues ExtraaLearn faces is identifying which leads are more likely to convert so that it can allocate resources accordingly. You, as a data scientist at ExtraaLearn, have been provided the leads data to:
* Analyze the data and build an ML model to help identify which leads are more likely to convert to paid customers
* Find the factors driving the lead conversion process
* Create a profile of the leads that are likely to convert


## Data Description

The data contains the different attributes of leads and their interaction details with ExtraaLearn. The detailed data dictionary is given below.


**Data Dictionary**
* ID: ID of the lead
* age: Age of the lead
* current_occupation: Current occupation of the lead. Values include 'Professional', 'Unemployed', and 'Student'
* first_interaction: How the lead first interacted with ExtraaLearn. Values include 'Website', 'Mobile App'
* profile_completed: What percentage of the profile has been filled by the lead on the website/mobile app. Values include Low (0-50%), Medium (50-75%), High (75-100%)
* website_visits: How many times has a lead visited the website
* time_spent_on_website: Total time spent on the website
* page_views_per_visit: Average number of pages on the website viewed during the visits.
* last_activity: Last interaction between the lead and ExtraaLearn. 
    * Email Activity: Sought details about the program through email, representative shared information with the lead such as the program brochure, etc.
    * Phone Activity: Had a phone conversation with a representative, had a conversation over SMS with a representative, etc.
    * Website Activity: Interacted on live chat with a representative, updated profile on the website, etc.

* print_media_type1: Flag indicating whether the lead had seen the ad of ExtraaLearn in the Newspaper.
* print_media_type2: Flag indicating whether the lead had seen the ad of ExtraaLearn in the Magazine.
* digital_media: Flag indicating whether the lead had seen the ad of ExtraaLearn on the digital platforms.
* educational_channels: Flag indicating whether the lead had heard about ExtraaLearn in the education channels like online forums, discussion threads, educational websites, etc.
* referral: Flag indicating whether the lead had heard about ExtraaLearn through a referral.
* status: Flag indicating whether the lead was converted to a paid customer or not.
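
Before any analysis, it can help to check that the loaded data matches the data dictionary above. The snippet below is a minimal sketch of such a check. The helper `check_schema` and the tiny `sample` frame are hypothetical and only illustrate the idea, and the allowed values are transcribed from the dictionary, which says "values include" and so may not be exhaustive.

```python
import pandas as pd

# Expected categories, transcribed from the data dictionary (may not be exhaustive).
EXPECTED_VALUES = {
    "current_occupation": {"Professional", "Unemployed", "Student"},
    "first_interaction": {"Website", "Mobile App"},
    "profile_completed": {"Low", "Medium", "High"},
    "last_activity": {"Email Activity", "Phone Activity", "Website Activity"},
}


def check_schema(df: pd.DataFrame) -> dict:
    """Return unexpected values per column (hypothetical helper for illustration)."""
    problems = {}
    for col, allowed in EXPECTED_VALUES.items():
        if col not in df.columns:
            problems[col] = "missing column"
            continue
        unexpected = set(df[col].dropna().unique()) - allowed
        if unexpected:
            problems[col] = sorted(unexpected)
    return problems


# Tiny made-up sample, used only to exercise the helper.
sample = pd.DataFrame({
    "current_occupation": ["Professional", "Student", "Retired"],
    "first_interaction": ["Website", "Mobile App", "Website"],
    "profile_completed": ["Low", "High", "Medium"],
    "last_activity": ["Email Activity", "Phone Activity", "Website Activity"],
})
problems = check_schema(sample)
print(problems)
```

On the real data, an empty dictionary would suggest the categorical columns contain only the documented values.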

## Importing necessary libraries and data

In [1]:
# Importing standard libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

# Set display options
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 200)

# Importing classification metrics
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
    f1_score,
    classification_report
)

# Importing classification models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Importing model selection and evaluation tools
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score

# Helper functions for EDA (adapted for classification)
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined for univariate analysis
    
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,      # Number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )                   # Creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )                   # Boxplot; the mean is marked with a triangle (matplotlib's default mean marker)
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins if bins else "auto"
    )                   # Histogram; falls back to automatic binning when bins is None
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )                   # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )                   # Add median to the histogram


def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart for classification analysis
    
    data: dataframe
    predictor: independent variable (categorical feature)
    target: target variable (classification target)
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    # Place the legend outside the plot area, to the right of the bars
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()

# Loading the leads dataset
data = pd.read_csv("../data/ExtraaLearn.csv")

# Copying data to preserve original
same_data = data.copy()

print("Data loaded successfully!")
print(f"Dataset shape: {data.shape}")


## Data Overview

- Observations
- Sanity checks

In [None]:
# Understand the shape of the data
data.shape


In [None]:
# Checking the info of the data
data.info()


**Observations:**

- Document actual data types, null counts, and memory usage
- Note which columns are numeric vs categorical
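
One way to collect the facts listed above in a single table is sketched below. The helper `column_summary` and the made-up `sample` frame are hypothetical. On the real data it would be called as `column_summary(data)`.

```python
import pandas as pd


def column_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, missing count, and unique count per column (illustrative helper)."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
        "n_unique": df.nunique(),
        "kind": [
            "numeric" if pd.api.types.is_numeric_dtype(dtype) else "categorical"
            for dtype in df.dtypes
        ],
    })


# Made-up rows, only to show the shape of the output.
sample = pd.DataFrame({
    "age": [25, 41, 57],
    "current_occupation": ["Student", None, "Professional"],
})
summary = column_summary(sample)
print(summary)
```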


In [None]:
# Univariate Analysis - Numeric Features

# Plotting numeric features using histogram_boxplot
numeric_features = ['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit']

for feature in numeric_features:
    histogram_boxplot(data, feature, kde=True, bins=30)
    plt.show()


In [None]:
# Checking the descriptive statistics of the numeric columns
data.describe().T


In [None]:
# Checking for missing values
print("Missing values in the dataset:")
print(data.isnull().sum())
print("\nTotal missing values:", data.isnull().sum().sum())


**Univariate Analysis Observations:**

- For each numeric feature plotted, document actual findings:
  - Report exact outlier values found
  - Note distribution shape (normal, skewed left/right, bimodal, etc.)
  - Report mean, median, and any notable patterns
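
To back the distribution-shape notes with numbers, the mean, median, and skewness can be computed alongside the plots. The sketch below assumes a rule-of-thumb threshold of 0.5 on the skewness for calling a distribution skewed. That threshold, the helper `describe_shape`, and the sample values are all assumptions made for illustration.

```python
import pandas as pd


def describe_shape(series: pd.Series, threshold: float = 0.5) -> dict:
    """Report mean, median, skewness, and a coarse shape label (illustrative helper)."""
    skew = series.skew()
    if skew > threshold:
        shape = "right-skewed"
    elif skew < -threshold:
        shape = "left-skewed"
    else:
        shape = "roughly symmetric"
    return {
        "mean": series.mean(),
        "median": series.median(),
        "skew": skew,
        "shape": shape,
    }


# Made-up values with one very large observation pulling the mean upward.
time_spent = pd.Series([10, 20, 30, 40, 1000], name="time_spent_on_website")
shape_summary = describe_shape(time_spent)
print(shape_summary)
```

A mean well above the median, as in this made-up series, is the usual sign of a long right tail.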


**Missing Value Treatment:**

- Document any missing values found and how they were handled
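
If missing values do turn up, one common treatment is to fill numeric columns with the median and categorical columns with the most frequent value. The sketch below shows that approach on made-up data. It is one option among several, not necessarily the treatment this dataset needs, and in a modeling pipeline the fill values would normally be computed on the training split only.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing value per column.
sample = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 55.0],
    "current_occupation": ["Student", "Professional", None, "Professional"],
})

filled = sample.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        # Median is robust to extreme values
        filled[col] = filled[col].fillna(filled[col].median())
    else:
        # Most frequent category
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(filled)
```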


In [None]:
# Outlier Detection using IQR method for numeric columns
numeric_cols = data.select_dtypes(include=[np.number]).columns.tolist()

print("Outlier Detection (using IQR method):")
print("=" * 80)

for col in numeric_cols:
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = data[(data[col] < lower_bound) | (data[col] > upper_bound)]
    outlier_count = len(outliers)
    
    if outlier_count > 0:
        print(f"\n{col}:")
        print(f"  Lower bound: {lower_bound:.2f}, Upper bound: {upper_bound:.2f}")
        print(f"  Number of outliers: {outlier_count} ({outlier_count/len(data)*100:.2f}%)")
        print(f"  Min value: {data[col].min()}, Max value: {data[col].max()}")


**Outlier Treatment:**

- Document outliers found and decision on treatment (keep, cap, remove) based on actual data analysis
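
If the decision is to cap rather than remove outliers, the same IQR bounds used for detection can be reused with `Series.clip`. This is a minimal sketch. The helper `cap_iqr` and the sample values are hypothetical, and tree-based models are often fairly insensitive to outliers, so keeping the values may also be reasonable.

```python
import pandas as pd


def cap_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] (illustrative helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Made-up visit counts with one extreme value.
visits = pd.Series([1, 2, 2, 3, 3, 4, 30], name="website_visits")
capped = cap_iqr(visits)
print(capped.tolist())
```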


In [None]:
# One-Hot Encoding for categorical variables
print("Before encoding:")
print(f"Number of columns: {len(data.columns)}")
print(f"Data types:\n{data.dtypes.value_counts()}")

# Get categorical columns, excluding the ID identifier
# (one-hot encoding ID would create one dummy column per lead)
categorical_cols = [
    col
    for col in data.select_dtypes(include=['object', 'category']).columns
    if col != 'ID'
]
print(f"\nCategorical columns to encode: {categorical_cols}")

# Perform one-hot encoding on a copy without the ID column
data_model = data.drop(columns=['ID'], errors='ignore')
data_encoded = pd.get_dummies(
    data_model,
    columns=categorical_cols,
    drop_first=True
)

print("\nAfter encoding:")
print(f"Number of columns: {len(data_encoded.columns)}")
print(f"Number of dummy variables created: {len(data_encoded.columns) - len(data_model.columns) + len(categorical_cols)}")

# Keep `data` unencoded: the EDA cells that follow group by the original
# categorical columns, so modeling uses `data_encoded` instead


**Encoding Summary:**

- Report how many dummy variables were created
- Report final feature count after encoding


In [None]:
# Preparing data for modeling
# Separate independent variables and target
X = data_encoded.drop('status', axis=1)
y = data_encoded['status']

print(f"Features (X) shape: {X.shape}")
print(f"Target (y) shape: {y.shape}")
print(f"Number of features: {X.shape[1]}")

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    shuffle=True, 
    random_state=1
)

print(f"\nTrain set - X: {X_train.shape}, y: {y_train.shape}")
print(f"Test set - X: {X_test.shape}, y: {y_test.shape}")
print(f"\nTrain set target distribution:")
print(y_train.value_counts(normalize=True))
print(f"\nTest set target distribution:")
print(y_test.value_counts(normalize=True))


**Train-Test Split Summary:**

- Note train/test split ratio and sample sizes
- Document target variable distribution in train and test sets
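
If the class proportions differ noticeably between the train and test sets, `train_test_split` accepts a `stratify` argument that preserves the target ratio in both splits. The sketch below demonstrates this on made-up data with a 30% positive rate. The names `X_demo` and `y_demo` are placeholders, not the notebook's variables.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 leads, 30 of them converted.
X_demo = pd.DataFrame({"feature": np.arange(100)})
y_demo = pd.Series([1] * 30 + [0] * 70, name="status")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.2,
    stratify=y_demo,   # keep the converted/unconverted ratio in both splits
    random_state=1,
)

print(f"Train conversion rate: {y_tr.mean():.2f}")
print(f"Test conversion rate:  {y_te.mean():.2f}")
```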


In [None]:
# Bivariate Analysis - Correlation Heatmap
numeric_data = data.select_dtypes(include=[np.number])
correlation_matrix = numeric_data.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0, square=True)
plt.title('Correlation Heatmap of Numeric Variables')
plt.tight_layout()
plt.show()


**Correlation Analysis Observations:**

- Report actual correlation values
- Note which numeric features show strongest correlation with conversion (if any)
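
Reading the strongest correlations off a heatmap can be error-prone, so it may help to rank the features by their correlation with `status` directly. The sketch below assumes `status` is coded as 0/1 and uses made-up values. With a binary target, the Pearson coefficient computed here is the point-biserial correlation.

```python
import pandas as pd

# Made-up numeric data with a 0/1 target.
sample = pd.DataFrame({
    "time_spent_on_website": [100, 200, 300, 400, 500, 600],
    "age": [30, 50, 30, 50, 30, 50],
    "status": [0, 0, 0, 1, 1, 1],
})

# Correlation of each feature with the target, strongest (by magnitude) first
corr_with_target = (
    sample.corr()["status"]
    .drop("status")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```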


In [None]:
# Question 1: How does current_occupation affect lead status?
print("=" * 80)
print("Question 1: How does current_occupation affect lead status?")
print("=" * 80)

stacked_barplot(data, "current_occupation", "status")

# Calculate conversion rates for each occupation category
occupation_conversion = data.groupby('current_occupation')['status'].agg(['count', 'sum', 'mean']).reset_index()
occupation_conversion.columns = ['Occupation', 'Total_Leads', 'Converted', 'Conversion_Rate']
occupation_conversion['Conversion_Rate'] = occupation_conversion['Conversion_Rate'] * 100
occupation_conversion = occupation_conversion.sort_values('Conversion_Rate', ascending=False)
print("\nConversion Rates by Occupation:")
print(occupation_conversion)

# Create barplot showing conversion rates by occupation
plt.figure(figsize=(10, 6))
sns.barplot(data=occupation_conversion, x='Occupation', y='Conversion_Rate', palette='viridis')
plt.title('Conversion Rate by Current Occupation')
plt.ylabel('Conversion Rate (%)')
plt.xlabel('Current Occupation')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


**Question 1 Insights:**

- Report actual conversion rates: "Analysis reveals that [Occupation A] converts at [X]%, [Occupation B] converts at [Y]%, and [Occupation C] converts at [Z]%."
- Identify which occupation has highest/lowest conversion with exact percentages
- Provide interpretation: "This [X-Y]% difference suggests [specific interpretation based on actual numbers and business context]."


In [None]:
# Question 2: Do the first channels of interaction have an impact on lead status?
print("=" * 80)
print("Question 2: Do the first channels of interaction have an impact on lead status?")
print("=" * 80)

stacked_barplot(data, "first_interaction", "status")

# Calculate conversion rates for Website vs Mobile App
first_interaction_conversion = data.groupby('first_interaction')['status'].agg(['count', 'sum', 'mean']).reset_index()
first_interaction_conversion.columns = ['First_Interaction', 'Total_Leads', 'Converted', 'Conversion_Rate']
first_interaction_conversion['Conversion_Rate'] = first_interaction_conversion['Conversion_Rate'] * 100
print("\nConversion Rates by First Interaction Channel:")
print(first_interaction_conversion)

# Create barplot
plt.figure(figsize=(8, 6))
sns.barplot(data=first_interaction_conversion, x='First_Interaction', y='Conversion_Rate', palette='Set2')
plt.title('Conversion Rate by First Interaction Channel')
plt.ylabel('Conversion Rate (%)')
plt.xlabel('First Interaction Channel')
plt.tight_layout()
plt.show()


**Question 2 Insights:**

- Report actual conversion rates: "Leads who first interacted via [Channel X] convert at [Y]%, while those via [Channel Z] convert at [W]%."
- Calculate and report the difference: "This represents a [X]% difference in conversion rates."
- Provide business insight: "This suggests [interpretation based on actual data]."
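
When reporting the difference between two conversion rates, it helps to state whether it is an absolute gap in percentage points or a relative lift, since the two can differ widely. The sketch below computes both on made-up data. The counts are invented for illustration and say nothing about the actual channels.

```python
import pandas as pd

# Made-up leads: 4 of 10 Website leads convert, 2 of 10 Mobile App leads convert.
sample = pd.DataFrame({
    "first_interaction": ["Website"] * 10 + ["Mobile App"] * 10,
    "status": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0] + [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
})

rates = sample.groupby("first_interaction")["status"].mean() * 100

# Absolute gap, in percentage points
diff_pp = rates["Website"] - rates["Mobile App"]
# Relative lift of Website over Mobile App, in percent
relative_lift = (rates["Website"] / rates["Mobile App"] - 1) * 100

print(f"Difference: {diff_pp:.1f} percentage points")
print(f"Relative lift: {relative_lift:.1f}%")
```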


In [None]:
# Question 3: Which way of interaction works best?
print("=" * 80)
print("Question 3: Which way of interaction works best?")
print("=" * 80)

stacked_barplot(data, "last_activity", "status")

# Calculate conversion rates for Email Activity, Phone Activity, and Website Activity
last_activity_conversion = data.groupby('last_activity')['status'].agg(['count', 'sum', 'mean']).reset_index()
last_activity_conversion.columns = ['Last_Activity', 'Total_Leads', 'Converted', 'Conversion_Rate']
last_activity_conversion['Conversion_Rate'] = last_activity_conversion['Conversion_Rate'] * 100
last_activity_conversion = last_activity_conversion.sort_values('Conversion_Rate', ascending=False)
print("\nConversion Rates by Last Activity:")
print(last_activity_conversion)

# Create barplot
plt.figure(figsize=(10, 6))
sns.barplot(data=last_activity_conversion, x='Last_Activity', y='Conversion_Rate', palette='muted')
plt.title('Conversion Rate by Last Activity Type')
plt.ylabel('Conversion Rate (%)')
plt.xlabel('Last Activity')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


**Question 3 Insights:**

- Report actual conversion rates for each activity type: "Email Activity: [X]%, Phone Activity: [Y]%, Website Activity: [Z]%."
- Identify the most effective interaction method with exact percentage
- Provide recommendation: "Given that [Activity X] shows [Y]% conversion vs [Z]% average, we recommend [specific action based on actual data]."


In [None]:
# Question 4: Which marketing channels have the highest lead conversion rate?
print("=" * 80)
print("Question 4: Which marketing channels have the highest lead conversion rate?")
print("=" * 80)

# Calculate conversion rates for each media flag
media_channels = {
    'print_media_type1': 'Newspaper',
    'print_media_type2': 'Magazine',
    'digital_media': 'Digital platforms',
    'educational_channels': 'Education channels',
    'referral': 'Referrals'
}

channel_conversion_rates = []

for col, name in media_channels.items():
    # Status values of the leads exposed to this channel (flag == 'Yes')
    exposed = data.loc[data[col] == 'Yes', 'status']

    channel_conversion_rates.append({
        'Channel': name,
        'Total_Leads': len(exposed),
        'Converted': int(exposed.sum()),
        # Guard against channels with no exposed leads
        'Conversion_Rate': exposed.mean() * 100 if len(exposed) > 0 else 0.0
    })

channel_df = pd.DataFrame(channel_conversion_rates).sort_values('Conversion_Rate', ascending=False)
print("\nConversion Rates by Marketing Channel:")
print(channel_df)

# Create visualization comparing conversion rates across channels
plt.figure(figsize=(12, 6))
sns.barplot(data=channel_df, x='Channel', y='Conversion_Rate', palette='rocket')
plt.title('Conversion Rate by Marketing Channel')
plt.ylabel('Conversion Rate (%)')
plt.xlabel('Marketing Channel')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()


**Question 4 Insights:**

- Report actual conversion rates for each channel: "[Channel X]: [Y]%, [Channel Z]: [W]%, etc."
- Identify which channel has highest conversion with exact percentage
- Provide recommendation: "Given that [Channel X] shows [Y]% conversion vs [Z]% average, we recommend [specific marketing action based on actual data]."


In [None]:
# Question 5: Does having more details about a prospect increase the chances of conversion?
print("=" * 80)
print("Question 5: Does having more details about a prospect increase the chances of conversion?")
print("=" * 80)

stacked_barplot(data, "profile_completed", "status")

# Calculate conversion rates for Low, Medium, and High profile completion
profile_conversion = data.groupby('profile_completed')['status'].agg(['count', 'sum', 'mean']).reset_index()
profile_conversion.columns = ['Profile_Completed', 'Total_Leads', 'Converted', 'Conversion_Rate']
profile_conversion['Conversion_Rate'] = profile_conversion['Conversion_Rate'] * 100

# Order by profile completion level
profile_order = ['Low', 'Medium', 'High']
profile_conversion['Profile_Completed'] = pd.Categorical(profile_conversion['Profile_Completed'], categories=profile_order, ordered=True)
profile_conversion = profile_conversion.sort_values('Profile_Completed')

print("\nConversion Rates by Profile Completion Level:")
print(profile_conversion)

# Create barplot
plt.figure(figsize=(8, 6))
sns.barplot(data=profile_conversion, x='Profile_Completed', y='Conversion_Rate', palette='coolwarm', order=profile_order)
plt.title('Conversion Rate by Profile Completion Level')
plt.ylabel('Conversion Rate (%)')
plt.xlabel('Profile Completion Level')
plt.tight_layout()
plt.show()


**Question 5 Insights:**

- Report actual conversion rates: "Low profile completion: [X]%, Medium: [Y]%, High: [Z]%."
- Analyze the trend: "Conversion rate [increases/decreases/stays constant] as profile completion increases from Low to High."
- Calculate the difference: "High profile completion shows [X]% higher conversion than Low profile completion."
- Provide business insight: "This indicates [interpretation based on actual data]. We should [specific recommendation]."


**Observations:**

- Report actual statistics (mean, median, min, max) for numeric features
- Identify any obvious outliers or unusual distributions


In [None]:
# Checking for duplicate values in the data
data.duplicated().sum()


**Sanity Checks:**

- Report actual count of duplicate rows found
- Document any missing values found and their implications
- Check for data quality issues


In [None]:
# Dropping ID column as it is an identifier and will not add value to the analysis
data = data.drop(columns=["ID"])


In [None]:
# Categorical value counts
categorical_cols = data.select_dtypes(include=['object']).columns.tolist()

for col in categorical_cols:
    print(f"\n{col}:")
    print(data[col].value_counts(normalize=True))
    print("-" * 80)


**Target Variable Distribution:**

- Report actual class distribution of `status` variable
- Calculate and document the exact ratio (e.g., "The target variable shows a X%/Y% split between converted (1) and unconverted (0) leads")
- Based on this actual distribution, determine which metrics are most appropriate (if imbalanced, emphasize Recall; if balanced, accuracy may be sufficient)
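
To see why accuracy alone can mislead on an imbalanced target, compare it with a baseline that always predicts the majority class. The sketch below uses a made-up 70/30 split. The actual split has to come from the data.

```python
import pandas as pd

# Made-up target: 7 unconverted (0) and 3 converted (1) leads.
status = pd.Series([0] * 7 + [1] * 3, name="status")

distribution = status.value_counts(normalize=True)
majority_class = distribution.idxmax()

# Baseline: always predict the majority class
predictions = pd.Series(majority_class, index=status.index)
baseline_accuracy = (predictions == status).mean()
baseline_recall = ((predictions == 1) & (status == 1)).sum() / (status == 1).sum()

print(distribution)
print(f"Baseline accuracy: {baseline_accuracy:.2f}, recall on converted leads: {baseline_recall:.2f}")
```

In this made-up case the baseline reaches 70% accuracy while identifying none of the converted leads, which is why recall or F1 is usually reported alongside accuracy.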


In [None]:
# View the first 5 rows of the dataset
data.head()


## Exploratory Data Analysis (EDA)

- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.

**Questions**
1. Leads will have different expectations from the outcome of the course and the current occupation may play a key role in getting them to participate in the program. Find out how current occupation affects lead status.
2. The company's first impression on the customer must have an impact. Do the first channels of interaction have an impact on the lead status? 
3. The company uses multiple modes to interact with prospects. Which way of interaction works best? 
4. The company gets leads from various channels such as print media, digital media, referrals, etc. Which of these channels have the highest lead conversion rate?
5. People browsing the website or mobile application are generally required to create a profile by sharing their personal data before they can access additional information. Does having more details about a prospect increase the chances of conversion?

## Data Preprocessing

- Missing value treatment (if needed)
- Feature engineering (if needed)
- Outlier detection and treatment (if needed)
- Preparing data for modeling 
- Any other preprocessing steps (if needed)

## Building a Decision Tree model

In [None]:
# Function to compute different metrics to check performance of a classification model
def model_performance_classification(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance
    
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    pred = model.predict(predictors)
    acc = accuracy_score(target, pred)
    recall = recall_score(target, pred)
    precision = precision_score(target, pred)
    f1 = f1_score(target, pred)
    
    # Creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Precision": precision,
            "Recall": recall,
            "F1": f1,
        },
        index=[0],
    )
    
    return df_perf

# Build Decision Tree Classifier
dt_classifier = DecisionTreeClassifier(random_state=1)

# Fit model on training data
dt_classifier.fit(X_train, y_train)

# Evaluate on test data
print("Decision Tree Model Performance on Test Set:")
print("=" * 80)
dt_perf = model_performance_classification(dt_classifier, X_test, y_test)
print(dt_perf)

# Confusion Matrix
print("\nConfusion Matrix:")
print("=" * 80)
cm = confusion_matrix(y_test, dt_classifier.predict(X_test))
print(cm)

# Visualize confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.title('Decision Tree - Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.tight_layout()
plt.show()

# Classification Report
print("\nClassification Report:")
print("=" * 80)
print(classification_report(y_test, dt_classifier.predict(X_test)))


**Decision Tree Model Performance:**

- Report actual metrics: "Decision Tree achieves Accuracy: X%, Precision: Y%, Recall: Z%, F1: W%"
- Display and interpret confusion matrix: "The model correctly predicts [X] conversions and [Y] non-conversions. It has [Z] false positives and [W] false negatives."
- Interpret model performance: "The model shows [strength/weakness] in [specific metric], which is [important/less critical] for lead conversion prediction."
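
The four counts asked for above can be read directly from the confusion matrix. For a binary problem with labels ordered `[0, 1]`, scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]`, so `ravel()` returns them in that order. The labels below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels (1 = converted, 0 = not converted).
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

precision = tp / (tp + fp)   # share of predicted conversions that were real
recall = tp / (tp + fn)      # share of real conversions that were found

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Precision={precision:.2f}, Recall={recall:.2f}")
```

For lead scoring, a false negative is a likely customer who never gets followed up, while a false positive is sales effort spent on a lead who does not convert.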


In [None]:
# Hyperparameter Tuning for Decision Tree using GridSearchCV
print("Hyperparameter Tuning for Decision Tree")
print("=" * 80)

# Define parameter grid
param_grid = {
    'max_depth': [3, 5, 7, 10, 15, None],
    'min_samples_split': [2, 5, 10, 20],
    'min_samples_leaf': [1, 2, 5, 10],
    'criterion': ['gini', 'entropy']
}

# Use f1_score as the scoring metric (balanced metric for classification)
dt_grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=1),
    param_grid=param_grid,
    scoring='f1',
    cv=5,
    n_jobs=-1,
    verbose=1
)

# Fit grid search
dt_grid.fit(X_train, y_train)

# Extract best estimator and hyperparameters
dt_tuned = dt_grid.best_estimator_
print(f"\nBest hyperparameters: {dt_grid.best_params_}")
print(f"Best cross-validation score (F1): {dt_grid.best_score_:.4f}")

# Evaluate tuned model on test set
print("\nTuned Decision Tree Model Performance on Test Set:")
print("=" * 80)
dt_tuned_perf = model_performance_classification(dt_tuned, X_test, y_test)
print(dt_tuned_perf)

# Compare with untuned model
print("\nComparison: Untuned vs Tuned Decision Tree")
print("=" * 80)
comparison_dt = pd.concat([dt_perf.add_suffix('_Untuned'), dt_tuned_perf.add_suffix('_Tuned')], axis=1)
print(comparison_dt)

# Confusion Matrix for tuned model
print("\nTuned Model - Confusion Matrix:")
print("=" * 80)
cm_tuned = confusion_matrix(y_test, dt_tuned.predict(X_test))
print(cm_tuned)

# Visualize confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm_tuned, annot=True, fmt='d', cmap='Greens', cbar=False)
plt.title('Tuned Decision Tree - Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.tight_layout()
plt.show()


## Model Performance evaluation and improvement

**Decision Tree Tuning Results:**

- Report best hyperparameters found: "GridSearchCV selected: [actual parameter values]"
- Compare tuned vs untuned performance with exact metrics: "Tuning improved [Metric] from X% to Y%"
- Report final Decision Tree performance: "After tuning, Decision Tree achieves Accuracy: X%, Precision: Y%, Recall: Z%, F1: W%"
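
Beyond `best_params_`, the fitted grid search keeps the cross-validated score of every candidate in `cv_results_`, which shows how close the runners-up were. The sketch below runs a deliberately small grid on synthetic data from `make_classification`. The data, the grid, and the variable names are assumptions for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the leads dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=1)

grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=1),
    param_grid={"max_depth": [2, 4, None]},
    scoring="f1",
    cv=5,
)
grid.fit(X_demo, y_demo)

# One row per candidate, best first
results = (
    pd.DataFrame(grid.cv_results_)[
        ["param_max_depth", "mean_test_score", "std_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
)
print(results)
```

If the top candidates differ by less than their `std_test_score`, the simpler tree is often the safer choice.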


In [None]:
# Build Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=1)

# Fit model on training data
rf_classifier.fit(X_train, y_train)

# Evaluate on test data
print("Random Forest Model Performance on Test Set:")
print("=" * 80)
rf_perf = model_performance_classification(rf_classifier, X_test, y_test)
print(rf_perf)

# Confusion Matrix
print("\nConfusion Matrix:")
print("=" * 80)
cm_rf = confusion_matrix(y_test, rf_classifier.predict(X_test))
print(cm_rf)

# Visualize confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='Oranges', cbar=False)
plt.title('Random Forest - Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.tight_layout()
plt.show()

# Classification Report
print("\nClassification Report:")
print("=" * 80)
print(classification_report(y_test, rf_classifier.predict(X_test)))

# Feature Importance Visualization
print("\nFeature Importance Analysis:")
print("=" * 80)
importances = rf_classifier.feature_importances_
feature_names = X_train.columns

# Create a dataframe for feature importances
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
}).sort_values('Importance', ascending=False)

print("\nTop 10 Most Important Features:")
print(feature_importance_df.head(10))

# Create horizontal bar chart showing top features
plt.figure(figsize=(10, 8))
top_features = feature_importance_df.head(15)
sns.barplot(data=top_features, y='Feature', x='Importance', palette='viridis')
plt.title('Top 15 Feature Importances - Random Forest')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()


## Building a Random Forest model

**Random Forest Model Performance:**

- Report actual metrics: "Random Forest achieves Accuracy: X%, Precision: Y%, Recall: Z%, F1: W%"
- Display and interpret confusion matrix
- Report top 5-10 most important features with their actual importance scores: "The model identifies [Feature 1] (importance: X), [Feature 2] (importance: Y), and [Feature 3] (importance: Z) as the strongest predictors of conversion."
- Provide business interpretation: "This indicates that [interpretation based on actual feature names and importance values]."
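
Because one-hot encoding splits each categorical column into several dummy features, its importance is spread across them. Summing the dummy importances back to the original column can make the business interpretation easier. The sketch below does this by prefix matching. The importance values and the helper `original_feature` are made up, and the approach assumes no original column name is a prefix of another followed by an underscore.

```python
import pandas as pd

# Made-up importances, indexed by encoded feature name.
importances = pd.Series({
    "time_spent_on_website": 0.40,
    "age": 0.10,
    "current_occupation_Student": 0.15,
    "current_occupation_Unemployed": 0.10,
    "first_interaction_Website": 0.25,
})
categorical_cols = ["current_occupation", "first_interaction"]


def original_feature(name: str) -> str:
    """Map a dummy column such as 'current_occupation_Student' to its source column."""
    for col in categorical_cols:
        if name.startswith(col + "_"):
            return col
    return name


# Group by the mapped index labels and add up the importances
grouped = importances.groupby(original_feature).sum().sort_values(ascending=False)
print(grouped)
```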


In [None]:
# Hyperparameter Tuning for Random Forest using GridSearchCV
print("Hyperparameter Tuning for Random Forest")
print("=" * 80)

# Define parameter grid
param_grid_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'max_features': ['sqrt', 'log2', None]
}

# Use f1_score as the scoring metric
rf_grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=1),
    param_grid=param_grid_rf,
    scoring='f1',
    cv=5,
    n_jobs=-1,
    verbose=1
)

# Fit grid search
rf_grid.fit(X_train, y_train)

# Extract best estimator and hyperparameters
rf_tuned = rf_grid.best_estimator_
print(f"\nBest hyperparameters: {rf_grid.best_params_}")
print(f"Best cross-validation score (F1): {rf_grid.best_score_:.4f}")

# Evaluate tuned model on test set
print("\nTuned Random Forest Model Performance on Test Set:")
print("=" * 80)
rf_tuned_perf = model_performance_classification(rf_tuned, X_test, y_test)
print(rf_tuned_perf)

# Compare with untuned Random Forest
print("\nComparison: Untuned vs Tuned Random Forest")
print("=" * 80)
comparison_rf = pd.concat([rf_perf.add_suffix('_Untuned'), rf_tuned_perf.add_suffix('_Tuned')], axis=1)
print(comparison_rf)

# Confusion Matrix for tuned model
print("\nTuned Random Forest - Confusion Matrix:")
print("=" * 80)
cm_rf_tuned = confusion_matrix(y_test, rf_tuned.predict(X_test))
print(cm_rf_tuned)

# Visualize confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm_rf_tuned, annot=True, fmt='d', cmap='Purples', cbar=False)
plt.title('Tuned Random Forest - Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.tight_layout()
plt.show()

# Final Model Comparison: Decision Tree (tuned) vs Random Forest (tuned)
print("\n" + "=" * 80)
print("Final Model Comparison: Decision Tree (Tuned) vs Random Forest (Tuned)")
print("=" * 80)

# Create comparison table
model_comparison = pd.DataFrame({
    'Decision Tree (Tuned)': [
        dt_tuned_perf['Accuracy'].values[0],
        dt_tuned_perf['Precision'].values[0],
        dt_tuned_perf['Recall'].values[0],
        dt_tuned_perf['F1'].values[0]
    ],
    'Random Forest (Tuned)': [
        rf_tuned_perf['Accuracy'].values[0],
        rf_tuned_perf['Precision'].values[0],
        rf_tuned_perf['Recall'].values[0],
        rf_tuned_perf['F1'].values[0]
    ]
}, index=['Accuracy', 'Precision', 'Recall', 'F1'])

print(model_comparison)

# Visualize comparison
model_comparison.T.plot(kind='bar', figsize=(10, 6))
plt.title('Model Performance Comparison')
plt.ylabel('Score')
plt.xlabel('Model')
plt.xticks(rotation=0)
plt.legend(loc='best')
plt.tight_layout()
plt.show()

# Feature Importance from tuned Random Forest
print("\nTop 10 Features from Tuned Random Forest:")
print("=" * 80)
importances_tuned = rf_tuned.feature_importances_
feature_importance_tuned_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances_tuned
}).sort_values('Importance', ascending=False)

print(feature_importance_tuned_df.head(10))


**Random Forest Tuning and Model Comparison:**

- Report best hyperparameters found: "GridSearchCV selected: [actual parameter values]"
- Compare tuned vs untuned performance: "Tuning improved [Metric] from X% to Y%"
- Compare both models: "Decision Tree (tuned) achieves Accuracy: X%, Precision: Y%, Recall: Z%, F1: W%. Random Forest (tuned) achieves Accuracy: A%, Precision: B%, Recall: C%, F1: D%."
- Make the final model selection: "Based on the comparison, [Model Name] shows the best performance with [specific metrics]. This model is chosen because [specific reason based on actual metrics and business requirements]."
- Update feature importance: Report top features from final model
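
The reporting steps above can be sketched on a small scale. This is a minimal illustration, not the notebook's actual run: it assumes a synthetic dataset from `make_classification` in place of the leads data and a deliberately reduced grid (`small_grid`, a hypothetical name) so it finishes quickly. It shows how `cv_results_` can be tabulated to report the ranked parameter combinations, and how the tuned and untuned test F1 scores can be placed side by side.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the leads data (assumption: NOT the real dataset)
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           weights=[0.7, 0.3], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

# Untuned baseline with default hyperparameters
baseline = RandomForestClassifier(random_state=1).fit(X_tr, y_tr)

# Reduced grid for illustration only (hypothetical values)
small_grid = {'n_estimators': [25, 50], 'max_depth': [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=1),
                      small_grid, scoring='f1', cv=3)
search.fit(X_tr, y_tr)

# One row per parameter combination, best-ranked first
cv_table = (pd.DataFrame(search.cv_results_)
            [['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
            .sort_values('rank_test_score'))

f1_untuned = f1_score(y_te, baseline.predict(X_te))
f1_tuned = f1_score(y_te, search.best_estimator_.predict(X_te))

print(cv_table.to_string(index=False))
print(f"GridSearchCV selected: {search.best_params_}")
print(f"Test F1: untuned {f1_untuned:.3f} vs tuned {f1_tuned:.3f}")
```

Reporting `std_test_score` alongside the mean helps judge whether a gap between two parameter combinations is larger than the fold-to-fold variation.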


## Model Performance evaluation and improvement

## Actionable Insights and Recommendations

The following insights and recommendations are based on the analysis of the ExtraaLearn leads dataset above. Every figure quoted should come directly from the analysis results.

### 1. Key Conversion Drivers

Based on the feature importance analysis from the Random Forest model, the strongest predictors of conversion are listed below (importance reflects predictive contribution, not a causal effect):

- **[Top Feature 1]** (importance: [X]) - [Interpretation]
- **[Top Feature 2]** (importance: [Y]) - [Interpretation]
- **[Top Feature 3]** (importance: [Z]) - [Interpretation]

The analysis shows that [Feature X] is the strongest predictor (importance: Y). Leads with [characteristic] show a [Z]% conversion rate vs [W]% for others.
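
The "[Z]% vs [W]%" comparison can be computed with a group-by on the target. The sketch below uses a toy DataFrame. The column names `first_interaction` and `status` are assumptions about the dataset schema, and `conversion_rate_by` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Toy data for illustration; column names are assumed, values are made up
leads = pd.DataFrame({
    'first_interaction': ['Website', 'Mobile App', 'Website', 'Website',
                          'Mobile App', 'Mobile App', 'Website', 'Mobile App'],
    'status': [1, 0, 1, 0, 0, 0, 1, 1],  # 1 = converted, 0 = not converted
})

def conversion_rate_by(df, column, target='status'):
    """Return lead count and conversion rate (%) for each category of `column`."""
    summary = df.groupby(column)[target].agg(n_leads='count',
                                             conversion_rate='mean')
    summary['conversion_rate'] = (summary['conversion_rate'] * 100).round(1)
    return summary.sort_values('conversion_rate', ascending=False)

rates = conversion_rate_by(leads, 'first_interaction')
print(rates)
```

Keeping `n_leads` next to the rate makes it easier to spot categories whose rate rests on very few leads.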

### 2. Profile of a High-Value Lead

Analysis of top-converting leads shows they have:
- **[Feature 1]** = [value/category from actual data]
- **[Feature 2]** = [value/category from actual data]
- **[Feature 3]** = [value/category from actual data]

This profile represents [X]% of converted leads.
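
One way to obtain the share of converted leads that match a profile is a boolean mask built from the profile's feature values. This is a sketch on toy data. The columns `current_occupation`, `profile_completed`, and `status`, as well as the example profile itself, are assumptions for illustration.

```python
import pandas as pd

# Toy data; column names and values are assumptions, not actual results
leads = pd.DataFrame({
    'current_occupation': ['Professional', 'Student', 'Professional',
                           'Unemployed', 'Professional', 'Student'],
    'profile_completed': ['High', 'Low', 'High', 'Medium', 'Medium', 'High'],
    'status': [1, 0, 1, 1, 0, 1],
})

# Hypothetical high-value profile to test
profile = {'current_occupation': 'Professional', 'profile_completed': 'High'}

# A lead matches only if every profile condition holds
matches = pd.Series(True, index=leads.index)
for column, value in profile.items():
    matches &= leads[column] == value

converted = leads['status'] == 1

# Share of all converted leads that fit the profile
share_of_converted = 100 * (matches & converted).sum() / converted.sum()
# Conversion rate among leads that fit the profile
conversion_within_profile = 100 * leads.loc[matches, 'status'].mean()

print(f"Profile covers {share_of_converted:.1f}% of converted leads")
print(f"Conversion rate within profile: {conversion_within_profile:.1f}%")
```

The two numbers answer different questions: the first is coverage of the converted population, the second is how reliably the profile converts.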

### 3. Marketing Channel Effectiveness

Based on the media flag analysis (Question 4), the following channels drive conversion:

- **[Channel X]**: [Y]% conversion rate
- **[Channel Z]**: [W]% conversion rate
- **[Channel A]**: [B]% conversion rate

Given that [Channel X] shows [Y]% conversion vs [Z]% average, we recommend [specific marketing action].
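
Because each channel is recorded as its own flag rather than as one categorical column, the per-channel rates are computed flag by flag and compared with the overall rate. The sketch below assumes Yes/No flag columns named `digital_media` and `referral` plus a binary `status` target. These names and the toy values are assumptions.

```python
import pandas as pd

# Toy data; flag columns and values are assumptions for illustration
leads = pd.DataFrame({
    'digital_media': ['Yes', 'No', 'No', 'Yes', 'No', 'No'],
    'referral':      ['No', 'Yes', 'No', 'No', 'Yes', 'No'],
    'status':        [0, 1, 0, 1, 1, 0],
})
channel_flags = ['digital_media', 'referral']

overall_rate = 100 * leads['status'].mean()

rows = []
for flag in channel_flags:
    exposed = leads[leads[flag] == 'Yes']      # leads reached via this channel
    rate = 100 * exposed['status'].mean()
    rows.append({
        'channel': flag,
        'n_leads': len(exposed),
        'conversion_rate': round(rate, 1),
        'vs_overall_pts': round(rate - overall_rate, 1),  # percentage points
    })

channel_summary = pd.DataFrame(rows).set_index('channel')
print(f"Overall conversion rate: {overall_rate:.1f}%")
print(channel_summary)
```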

### 4. Interaction Strategy Recommendations

Based on Question 3 analysis (last_activity vs conversion):

- **Email Activity**: [X]% conversion rate
- **Phone Activity**: [Y]% conversion rate
- **Website Activity**: [Z]% conversion rate

Given that [Activity X] shows [Y]% conversion, we recommend [specific action].

### 5. Profile Completion Strategy

Based on Question 5 analysis:

- **Low profile completion**: [X]% conversion
- **Medium profile completion**: [Y]% conversion
- **High profile completion**: [Z]% conversion

The analysis shows that leads with High profile completion convert [X]% more often than leads with Low profile completion (state whether this is a relative or a percentage-point difference). We should [specific recommendation].
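
A statement such as "[X]% higher conversion" can mean either a percentage-point gap or a relative lift, and the two can differ widely. The helper below (`conversion_lift`, a hypothetical name) computes both. The input rates in the usage lines are illustrative numbers only, not results from the dataset.

```python
def conversion_lift(rate_high, rate_low):
    """Compare two conversion rates given in percent.

    Returns (percentage-point difference, relative lift in percent).
    """
    if rate_low == 0:
        raise ValueError("relative lift is undefined when the baseline rate is 0")
    point_diff = rate_high - rate_low
    relative_lift = 100 * point_diff / rate_low
    return point_diff, relative_lift

# Illustrative rates only (NOT taken from the leads data)
points, lift = conversion_lift(30.0, 20.0)
print(f"{points:.1f} percentage points higher, a {lift:.1f}% relative lift")
```

Stating which of the two figures is being quoted avoids overstating or understating the effect in the recommendation.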

### 6. Model Deployment Strategy

**Final Model Performance:**
- Selected Model: [Decision Tree / Random Forest]
- Accuracy: [X]%
- Precision: [Y]%
- Recall: [Z]%
- F1-Score: [W]%

The model can predict the conversion probability of new leads, allowing the sales team to prioritize [X] leads and potentially improving the conversion rate by [estimated impact based on model performance]%.
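
Lead prioritization of this kind can be sketched with `predict_proba`, which returns one probability per class for each lead. The example below is a minimal illustration under stated assumptions: synthetic data stands in for the leads dataset, and `lead_id`, `top_k`, and `priority_list` are hypothetical names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for historical and incoming leads (assumption)
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=1)
X_train, X_new, y_train, _ = train_test_split(X, y, test_size=0.2,
                                              random_state=1)

model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(X_train, y_train)

# Column 1 of predict_proba is the probability of class 1 (converted)
scores = pd.DataFrame({
    'lead_id': range(len(X_new)),
    'conversion_probability': model.predict_proba(X_new)[:, 1],
}).sort_values('conversion_probability', ascending=False)

top_k = 20  # hypothetical daily capacity of the sales team
priority_list = scores.head(top_k)
print(priority_list.head())
```

Ranking by probability rather than by the hard 0/1 prediction lets the cut-off follow the team's capacity instead of a fixed 0.5 threshold.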

**Model Limitations:**
- [List any limitations identified]
- [Monitoring needs]

### 7. Resource Allocation Recommendations

Based on all findings:

1. **Marketing Budget Allocation:**
   - Allocate [X]% more resources to [Channel with highest conversion]
   - Reduce investment in [Channel with low conversion] by [Y]%

2. **Sales Team Focus:**
   - Prioritize leads with [characteristics from high-value profile]
   - Focus on [interaction method with highest conversion]

3. **Product Development:**
   - [Recommendations based on profile completion findings]
   - [Recommendations based on feature importance]

**Expected Impact:**
Given that [finding from actual analysis], allocating [X]% more resources to [specific area] is expected to improve overall conversion rate by approximately [Y]%.

