# Machine Learning Model - Car Insurance

# Part 1 - Business Problem


![](https://img.freepik.com/vetores-gratis/ilustracao-de-seguro-de-vida_53876-5312.jpg?t=st=1721189978~exp=1721193578~hmac=d9370211eaa85efe62462cb028539e787a1a7e74863c505afd3a8cc07cc8a19f&w=740)


## Business Problem

**Objective**

Predict which customers will respond positively to an automobile insurance offer using machine learning techniques. This will enable the insurance company to target its marketing campaigns more efficiently, increasing conversion rates and reducing customer acquisition costs.

**Context**

The insurance company is launching a new campaign to promote its automobile insurance product. The company has historical data on customer characteristics and their responses to previous offers. Based on this data, the company wants to build a predictive model that estimates the likelihood of a customer accepting the insurance offer.

## Expected Impact

- Increased Conversion Rate: Targeting offers to customers more likely to accept will increase the success rate of marketing campaigns.

- Cost Reduction: Focusing on customers with a higher probability of positive response reduces marketing expenses on less likely customers.

- Improved Customer Satisfaction: Customers receiving relevant and personalized offers are more likely to have a positive perception of the company.

## Data Description

The available data includes demographic and behavioral information about customers, as well as their responses to previous insurance offers. The dataset is synthetic, generated to maintain the privacy of real customer data.

## Variables in the Dataset

**1. id:** Unique identification of the customer.

**2. features:** Various customer characteristics (age, income, response history, etc.).

**3. Response:** Target variable indicating whether the customer responded positively to the insurance offer (1 for yes, 0 for no).

## Challenges

**Synthetic Data:** Although the data is synthetic, it must be representative enough to create reliable predictive models.

**Class Imbalance:** The dataset is likely imbalanced, with more negative than positive responses, requiring appropriate balancing techniques.

**Feature Engineering:** Identifying and creating relevant features to improve model performance.
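
As a minimal sketch of one balancing technique for the class imbalance challenge above, the example below applies class weighting with scikit-learn. The synthetic data, the roughly 10% positive rate, and all variable names are assumptions for illustration; they are not the competition data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Synthetic, imbalanced data (roughly 90% negative, 10% positive); illustrative only
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=42
)

# 'balanced' weights are n_samples / (n_classes * class_count)
classes = np.unique(y_demo)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)
class_weights = dict(zip(classes, weights))
print("Class counts:", np.bincount(y_demo))
print("Balanced class weights:", class_weights)

# The same weighting can be requested directly from many scikit-learn estimators
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_demo, y_demo)
```

Class weighting leaves the data untouched and only changes how much each error costs during training. Resampling is an alternative that is not shown here.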

## Methodology

**1. Exploratory Data Analysis (EDA):** Understand data distribution and identify patterns and anomalies.

**2. Data Preprocessing:** Handle missing values, encode categorical variables, and normalize the data.

**3. Data Splitting:** Separate the data into training and validation sets to evaluate model performance.

**4. Model Training:** Use different machine learning algorithms (RandomForest, XGBoost, LightGBM) and select the best model based on the AUC-ROC metric.

**5. Model Optimization:** Adjust hyperparameters and apply cross-validation techniques.

**6. Model Evaluation:** Use the ROC curve and AUC metric to evaluate model performance.

**7. Submission of Results:** Generate predictions for the test set and submit to Kaggle.
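
Steps 4 and 5 above can be sketched as a cross-validated comparison of candidate models by AUC-ROC. This is only an illustration under stated assumptions: the data is synthetic, the two candidate models and their settings are arbitrary choices, and names such as `candidates` and `cv_auc` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced data; illustrative only
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.85, 0.15], random_state=42
)

# Candidate models (arbitrary choices for the sketch)
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_auc = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=cv, scoring="roc_auc")
    cv_auc[name] = scores.mean()
    print(f"{name}: mean AUC = {scores.mean():.3f} (+/- {scores.std():.3f})")

# Select the model with the highest mean cross-validated AUC
best_name = max(cv_auc, key=cv_auc.get)
print("Best by cross-validated AUC:", best_name)
```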

## Evaluation Metric

**AUC-ROC:** The evaluation metric for the competition is the area under the ROC curve, which measures the model's ability to distinguish between classes.
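
To make the metric concrete, here is a small sketch on made-up labels and scores. It shows that AUC-ROC is computed from predicted scores rather than hard labels, and that it equals the fraction of (positive, negative) pairs the model ranks correctly. The values in `y_true` and `y_score` are invented for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy labels and predicted probabilities (illustrative values only)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.15, 0.30, 0.25, 0.40, 0.60, 0.55, 0.90])

# AUC-ROC is computed from scores, not from hard 0/1 predictions
auc = roc_auc_score(y_true, y_score)

# Equivalent view: the fraction of (positive, negative) pairs ranked correctly
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = np.mean([p > n for p in pos for n in neg])

# Always predicting the majority class gives high accuracy but no ranking ability
always_negative = np.zeros_like(y_true)
print(f"AUC-ROC: {auc:.4f} (pairwise ranking: {pairwise:.4f})")
print(f"Accuracy of always predicting 0: {accuracy_score(y_true, always_negative):.2f}")
```

This is why accuracy alone can be misleading on an imbalanced target such as `Response`, while AUC-ROC still reflects how well positives are ranked above negatives.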

## Conclusion
Accurately predicting which customers will respond positively to an insurance offer can significantly improve the efficiency of marketing campaigns, reduce costs, and increase customer satisfaction.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Installing packages
!pip install watermark

In [None]:
# Import of libraries

# System libraries
import re
import unicodedata
import itertools

# Libraries for data manipulation
import pandas as pd
import numpy as np

# Data visualization
import seaborn as sns
import matplotlib.pylab as pl
import matplotlib as m
import matplotlib as mpl
import matplotlib.pyplot as plt
import plotly.express as px

# Configuration for graph width and layout
sns.set_theme(style='whitegrid')
palette='viridis'

# Python version
from platform import python_version
print('Python version in this Jupyter Notebook:', python_version())

# Load library versions
import watermark

# Library versions
%reload_ext watermark
%watermark -a "Library versions" --iversions

# Suppress warning messages
import warnings
warnings.filterwarnings("ignore")

# Part 2 - Database

In [None]:
# Set the display.max_columns option to None
pd.set_option('display.max_columns', None)

# Data 1
# Data train
train_df = pd.read_csv("/kaggle/input/playground-series-s4e7/train.csv")

# Data test
test_df = pd.read_csv("/kaggle/input/playground-series-s4e7/test.csv")

# Data 2
df = pd.read_csv("/kaggle/input/health-insurance-cross-sell-prediction-data/train.csv")

In [None]:
# View the first 5 rows
train_df.head()

In [None]:
# View the last 5 rows
train_df.tail()

In [None]:
# Info data
train_df.info()

In [None]:
# Data types
train_df.dtypes

In [None]:
# Viewing rows and columns
train_df.shape

# Part 3 - Exploratory data analysis

In [None]:
# Exploratory data analysis (EDA)
print("\nDescriptive statistics of the training set:")
train_df.describe().T

In [None]:
# Analysis of categorical and numerical variables
categorical_features = train_df.select_dtypes(include=['object']).columns
numerical_features = train_df.select_dtypes(include=[np.number]).columns

print("\nCategorical Variables:", categorical_features)
print("Numeric Variables:", numerical_features)

In [None]:
# Analysis of categorical variables
for col in categorical_features:
    print(f"\nDistribution of categorical variable {col}:")
    print(train_df[col].value_counts())

# Part 3.1 - Analysis of categorical and numeric variable data

In [None]:
# Analysis of the target variable 'Response'
print("\nDistribution of target variable 'Response':")
print(train_df['Response'].value_counts())
plt.figure(figsize=(10, 6))
sns.countplot(data=train_df, x='Response')
plt.title("Distribution of Target Variable 'Response'")
plt.grid(False)
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='Vehicle_Age', y='Annual_Premium', hue='Response', data=train_df)
plt.title('Boxplot of Annual Premium by Vehicle Age and Response')
plt.grid(False)
plt.show()

In [None]:
# Group data by gender and calculate the total annual premium for each group
total_premium_by_gender = train_df.groupby('Gender')['Annual_Premium'].sum().reset_index()

# View total annual premium by gender
plt.figure(figsize=(10, 6))
sns.barplot(x='Gender', y='Annual_Premium', data=total_premium_by_gender)
plt.title('Total Annual Premium by Gender')
plt.xlabel('Gender')
plt.ylabel('Total Annual Premium')
plt.grid(False)
plt.show()

In [None]:
# Create age ranges; include_lowest=True keeps customers aged exactly 18 in the first bucket
train_df['Age_Bucket'] = pd.cut(train_df['Age'], bins=[18, 25, 35, 50, np.inf], labels=['18-25', '26-35', '36-50', '51+'], include_lowest=True)

# Group data by age group and gender, and calculate the average annual premium for each group
average_premium_by_age_gender = train_df.groupby(['Age_Bucket', 'Gender'])['Annual_Premium'].mean().reset_index()

# View the average annual premium by age group and gender
plt.figure(figsize=(20, 10))
sns.barplot(x='Age_Bucket', y='Annual_Premium', hue='Gender', data=average_premium_by_age_gender)
plt.title('Average Annual Premium by Age Group and Gender')
plt.xlabel('Age Range')
plt.ylabel('Average Annual Premium')
plt.legend(title='Gender')
plt.grid(False)
plt.show()

In [None]:
# Group data by gender and previous insurance status, and calculate the average annual premium for each group
average_premium_by_gender_insured = train_df.groupby(['Gender', 'Previously_Insured'])['Annual_Premium'].mean().reset_index()

# Transform the 'Previously_Insured' variable into a more readable category
average_premium_by_gender_insured['Previously_Insured'] = average_premium_by_gender_insured['Previously_Insured'].map({0: 'No', 1: 'Yes'})

# View average annual premium by gender and previous insurance status
plt.figure(figsize=(10, 6))
sns.barplot(x='Previously_Insured', y='Annual_Premium', hue='Gender', data=average_premium_by_gender_insured)
plt.title('Average Annual Premium by Previous Insurance Status and Gender')
plt.xlabel('Previous Insurance')
plt.ylabel('Average Annual Premium')
plt.legend(title='Gender')
plt.grid(False)
plt.show()

In [None]:
# Categorical variables to iterate
categorical_variables = ['Gender', 'Vehicle_Damage', 'Vehicle_Age', 'Response']

# Figure size
plt.figure(figsize=(15, 10))

# Loop over categorical variables
for i, var in enumerate(categorical_variables, 1):
    plt.subplot(2, 2, i)  # 2x2 grid of subplots
    sns.boxplot(data=train_df, x=var, y='Annual_Premium', palette='viridis')
    plt.title(f'Annual Premium by {var}')
    plt.xlabel(var)
    plt.ylabel('Annual Premium')
    plt.xticks(rotation=45)

plt.tight_layout()
plt.grid(False)
plt.show()

# Part 3.2 - Insurance indicators

In [None]:
# Group by vehicle age and gender, adding annual premiums
grouped_data = train_df.groupby(['Vehicle_Age', 'Gender'])['Annual_Premium'].sum().reset_index()

# Grouped bar chart
plt.figure(figsize=(12, 6))
sns.barplot(data=grouped_data, x='Vehicle_Age', y='Annual_Premium', hue='Gender', palette='viridis')
plt.title('Total Annual Premium by Vehicle Age and Gender')
plt.xlabel('Vehicle Age')
plt.ylabel('Total Annual Premium')
plt.legend(title='Gender')
plt.grid(False)
plt.show()

In [None]:
# List of genders
genders = train_df['Gender'].unique()

# Figure size
plt.figure(figsize=(15, 8))

# Loop over genders
for i, gender in enumerate(genders, 1):
    plt.subplot(1, 2, i)
    gender_data = train_df[train_df['Gender'] == gender]
    gender_grouped = gender_data.groupby('Vehicle_Age')['Annual_Premium'].sum().reset_index()
    sns.barplot(data=gender_grouped, x='Vehicle_Age', y='Annual_Premium', palette='viridis')
    plt.title(f'Total Annual Premium by Vehicle Age ({gender})')
    plt.xlabel('Vehicle Age')
    plt.ylabel('Total Annual Premium')

plt.tight_layout()
plt.grid(False)
plt.show()

In [None]:
# Group by vehicle age and insurance status, adding annual premiums
grouped_data = train_df.groupby(['Vehicle_Age', 'Previously_Insured'])['Annual_Premium'].sum().reset_index()

# Convert the 'Previously_Insured' column to string for better visualization
grouped_data['Previously_Insured'] = grouped_data['Previously_Insured'].astype(str)

# Grouped bar chart
plt.figure(figsize=(12, 6))
sns.barplot(data=grouped_data, x='Vehicle_Age', y='Annual_Premium', hue='Previously_Insured', palette='viridis')
plt.title('Total Annual Premium by Vehicle Age and Insurance Status')
plt.xlabel('Vehicle Age')
plt.ylabel('Annual Premium Total')
plt.legend(title='Previously Insured')
plt.grid(False)
plt.show()

In [None]:
# List of insurance status
statuses = train_df['Previously_Insured'].unique()

# Figure size
plt.figure(figsize=(15, 8))

# Loop over insurance statuses (filter train_df, the same DataFrame the statuses come from)
for i, status in enumerate(statuses, 1):
    plt.subplot(1, 2, i)
    status_data = train_df[train_df['Previously_Insured'] == status]
    status_grouped = status_data.groupby('Vehicle_Age')['Annual_Premium'].sum().reset_index()
    sns.barplot(data=status_grouped, x='Vehicle_Age', y='Annual_Premium', palette='viridis')
    plt.title(f'Total Annual Premium by Vehicle Age (Previously Insured: {status})')
    plt.xlabel('Vehicle Age')
    plt.ylabel('Total Annual Premium')

plt.tight_layout()
plt.grid(False)
plt.show()

In [None]:
# Group by age, gender and vehicle age, adding annual premiums
grouped_data = train_df.groupby(['Age', 'Gender', 'Vehicle_Age'])['Annual_Premium'].sum().reset_index()

# Relabel the 'Gender' values for display in the plot legend
grouped_data['Gender'] = grouped_data['Gender'].map({'Male': 'Man', 'Female': 'Woman'})

# Configure the size of the figure
plt.figure(figsize=(30.5, 10))

# Loop to create graphs separated by vehicle age
vehicle_ages = grouped_data['Vehicle_Age'].unique()

for i, vehicle_age in enumerate(vehicle_ages, 1):
    plt.subplot(2, 2, i)
    subset = grouped_data[grouped_data['Vehicle_Age'] == vehicle_age]
    sns.barplot(data=subset, x='Age', y='Annual_Premium', hue='Gender', palette='viridis')
    plt.title(f'Total Annual Premium by Age and Sex (Vehicle Age: {vehicle_age})')
    plt.xlabel('Age')
    plt.ylabel('Annual Premium Total')
    plt.legend(title='Gender')
    plt.xticks(rotation=50)

plt.tight_layout()
plt.grid(False)
plt.show()

### Analysis of the Total Annual Premium by Age and Sex

The provided graphs show the total annual premium by age and sex, segmented into three categories based on the age of the vehicle: less than 1 year, 1-2 years, and more than 2 years.

#### Observations:

1. **Vehicle Age: < 1 Year**
   - The total annual premium is highest for younger individuals, peaking around ages 22-25 for both men and women.
   - Women tend to have a higher total annual premium compared to men in the younger age brackets.
   - As age increases beyond the mid-30s, the total annual premium decreases steadily for both genders.

2. **Vehicle Age: 1-2 Years**
   - The peak total annual premium occurs around ages 40-45 for both men and women, with men having a slightly higher total annual premium.
   - A significant increase in total annual premium is observed from the mid-20s to the early 40s.
   - After age 45, the total annual premium declines gradually for both men and women.

3. **Vehicle Age: > 2 Years**
   - The total annual premium peaks around ages 40-45 for both men and women.
   - Men consistently have a higher total annual premium compared to women across most age ranges.
   - A steady decline in the total annual premium is observed after the peak age range.

### Key Takeaways:

- **Age Influence**: The age at which the total annual premium peaks depends on vehicle age: around 22-25 for new vehicles (< 1 year) and around 40-45 for vehicles older than 1 year. Because these are totals, they reflect the number of customers at each age as well as the premium each one pays.
- **Gender Differences**: Men tend to have a higher total annual premium than women, especially in the 1-2 years and > 2 years vehicle age categories. However, for vehicles less than 1 year old, women have a higher total annual premium in the younger age brackets.
- **Vehicle Age Impact**: The age of the vehicle influences the distribution of the total annual premium, with newer vehicles (< 1 year) showing a higher premium for younger individuals and older vehicles (> 2 years) peaking at a later age.

### Recommendations:

- **Targeted Marketing**: Insurance companies could consider targeted marketing strategies for different age groups and genders based on the observed trends in annual premiums.
- **Product Development**: Develop insurance products that cater specifically to the high-premium age groups, such as younger individuals with new vehicles and middle-aged individuals with older vehicles.
- **Pricing Strategy**: Adjust pricing strategies to reflect the differences in premium contributions across various age and gender segments.


# Part 3.3 - Correlation analysis

In [None]:
# Select the specified columns
columns_of_interest = ["id", "Age", "Driving_License", "Region_Code", "Previously_Insured", "Annual_Premium", "Policy_Sales_Channel", "Vintage", "Response"]
df_selected = df[columns_of_interest]

# Calculate the correlation matrix
correlation_matrix = df_selected.corr()

# Configure the size of the figure
plt.figure(figsize=(14, 10))

# Correlation heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='viridis', fmt='.2f')
plt.title('Correlation Matrix Heatmap')
plt.show()

### Correlation Matrix Interpretation

The correlation between variables helps to understand the linear relationship between them. Each coefficient in the correlation matrix ranges from -1 to 1, where:

- 1 indicates a perfect positive correlation.
- -1 indicates a perfect negative correlation.
- 0 indicates no linear correlation.
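
The three reference values can be reproduced with a short sketch on made-up data. The column names and the random noise below are assumptions for illustration and are unrelated to the insurance dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
x = np.arange(100, dtype=float)

demo = pd.DataFrame({
    "x": x,
    "perfect_positive": 2 * x + 1,   # exact increasing linear function of x
    "perfect_negative": -3 * x + 5,  # exact decreasing linear function of x
    "noise": rng.normal(size=100),   # generated independently of x
})

# DataFrame.corr() uses the Pearson coefficient by default
corr_with_x = demo.corr()["x"]
print(corr_with_x.round(2))
```

The Pearson coefficient only captures linear association, so a value near 0 does not rule out a non-linear relationship.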

### Observations:

1. **id**: 
   - No significant correlation with any other variable. This is expected as the id should be a unique identifier without a relationship to other features.

2. **Age**:
   - Shows a weak negative correlation with `Previously_Insured` (-0.25).
   - Shows a moderate negative correlation with `Policy_Sales_Channel` (-0.58).
   - Shows a weak positive correlation with `Response` (0.11).

3. **Driving_License**:
   - No significant correlations with other variables.

4. **Region_Code**:
   - No significant correlations with other variables.

5. **Previously_Insured**:
   - Shows a weak positive correlation with `Policy_Sales_Channel` (0.22).
   - Shows a moderate negative correlation with `Response` (-0.34).

6. **Annual_Premium**:
   - No significant correlations with other variables.

7. **Policy_Sales_Channel**:
   - Shows a moderate negative correlation with `Age` (-0.58).
   - Shows a weak positive correlation with `Previously_Insured` (0.22).
   - Shows a weak negative correlation with `Response` (-0.14).

8. **Vintage**:
   - No significant correlations with other variables.

9. **Response**:
   - Shows correlations already mentioned with `Previously_Insured` (-0.34), `Age` (0.11), and `Policy_Sales_Channel` (-0.14).

### Conclusions:

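- **Age**: Age has a weak negative correlation with `Previously_Insured` and a moderate negative correlation with `Policy_Sales_Channel`, suggesting that older customers are less likely to be previously insured and tend to be reached through channels with lower codes. Because `Policy_Sales_Channel` is a categorical code, its correlations should be read with caution.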
- **Previously_Insured**: Customers who are previously insured are less likely to respond positively (`Response`) to the new insurance offer.
- **Policy_Sales_Channel**: Certain sales channels have a negative relationship with customer age and a weak correlation with response (`Response`), indicating that some channels may be more effective for different age groups.
- **Response**: The response to the insurance offer has a moderate negative correlation with `Previously_Insured` and a weak positive correlation with customer age.

These correlations can be useful for better understanding the data and guiding marketing and sales strategies, as well as for adjustments in the machine learning model.


# Part 3.4 - Boxplot analysis identifying outliers

In [None]:
# Select numeric columns
numeric_columns = ["Age", "Annual_Premium", "Vintage"]

# Configure the size of the figure
plt.figure(figsize=(18, 6))

# Loop to create boxplots for each numeric column
for i, column in enumerate(numeric_columns, 1):
    plt.subplot(1, 3, i)
    sns.boxplot(data=train_df, y=column, palette='viridis')
    plt.title(f'Boxplot of {column}')
    plt.ylabel(column)

plt.tight_layout()
plt.grid(False)
plt.show()

In [None]:
# Select numeric columns
numeric_columns = ["Age", "Annual_Premium", "Vintage"]

# Configure the size of the figure
plt.figure(figsize=(18, 6))

# Loop to create boxplots for each numeric column, separated by Response
for i, column in enumerate(numeric_columns, 1):
    plt.subplot(1, 3, i)
    sns.boxplot(data=train_df, x='Response', y=column, palette='viridis')
    plt.title(f'Boxplot of {column} by Response')
    plt.xlabel('Response')
    plt.ylabel(column)

plt.tight_layout()
plt.grid(False)
plt.show()

# Descriptive statistics separated by Response
for column in numeric_columns:
    print(f'\nDescriptive Statistics of {column} by Response:')
    print(train_df.groupby('Response')[column].describe())

In [None]:
# Select numeric columns
numeric_columns = ["Age", "Annual_Premium", "Vintage"]

# Function to remove outliers using the IQR rule
def remove_outliers(data, columns):
    for column in columns:
        Q1 = data[column].quantile(0.25)
        Q3 = data[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        # Keep only rows inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for this column
        data = data[(data[column] >= lower_bound) & (data[column] <= upper_bound)]
    # Return only after every column has been filtered
    return data

# Remove outliers from the DataFrame
df_cleaned = remove_outliers(train_df, numeric_columns)

# Check the DataFrame dimension after removing outliers
print(f"Original dimension: {train_df.shape}")
print(f"Dimension after removing outliers: {df_cleaned.shape}")

# View the first records of the cleaned DataFrame
df_cleaned.head()

In [None]:
# Select numeric columns
numeric_columns = ["Age", "Annual_Premium", "Vintage"]

# Configure the size of the figure
plt.figure(figsize=(18, 6))

# Loop to create boxplots for each numeric column, separated by Response
for i, column in enumerate(numeric_columns, 1):
    plt.subplot(1, 3, i)
    sns.boxplot(data=df_cleaned, x='Response', y=column, hue="Response", palette='viridis')
    plt.title(f'Boxplot of {column} by Response')
    plt.xlabel('Response')
    plt.ylabel(column)

plt.tight_layout()
plt.grid(False)
plt.show()

# Descriptive statistics separated by Response
for column in numeric_columns:
    print(f'\nDescriptive Statistics of {column} by Response:')
    print(df_cleaned.groupby('Response')[column].describe())

# Part 4 - Preprocessing

In [None]:
def optimize_memory_usage(df_cleaned):
    """
    Optimizes the memory usage of a pandas DataFrame by converting categorical columns
    and adjusting the precision of numerical columns.
    
    Parameters:
    df_cleaned (pd.DataFrame): DataFrame to be optimized.
    
    Returns:
    pd.DataFrame: Optimized DataFrame.
    """
    # View memory usage before optimization
    print("Memory usage before optimization:")
    print(df_cleaned.memory_usage(deep=True))
    print()
    
    # Converting categorical columns to dtype 'category'
    categorical_columns = ['Gender', 'Driving_License', 'Region_Code', 'Previously_Insured', 
                           'Vehicle_Age', 'Vehicle_Damage', 'Policy_Sales_Channel', 
                           'Response', 'Age_Bucket']
    
    for col in categorical_columns:
        if col in df_cleaned.columns:
            df_cleaned[col] = df_cleaned[col].astype('category')
    
    # Reducing precision of integers
    # Assuming age is within 0 to 127
    if 'Age' in df_cleaned.columns:
        df_cleaned['Age'] = df_cleaned['Age'].astype('int8')
    
    # Adjust according to maximum value for Annual_Premium
    if 'Annual_Premium' in df_cleaned.columns:
        df_cleaned['Annual_Premium'] = df_cleaned['Annual_Premium'].astype('int32')
    
    # Assuming the value is within the range of int16 for Vintage
    if 'Vintage' in df_cleaned.columns:
        df_cleaned['Vintage'] = df_cleaned['Vintage'].astype('int16')
    
    # View memory usage after applying conversions
    print("Memory usage after applying conversions:")
    print(df_cleaned.memory_usage(deep=True))
    
    # Check detailed memory usage (info() prints directly and returns None)
    df_cleaned.info(memory_usage='deep')
    print()
    return df_cleaned

# Optimizing the DataFrames
df_cleaned_optimized_train = optimize_memory_usage(df_cleaned)
test_df_optimized_test = optimize_memory_usage(test_df)
df_optimized = optimize_memory_usage(df)


In [None]:
# Copy the optimized datasets
train_df = df_cleaned_optimized_train.copy()
test_df = test_df_optimized_test.copy()

# Part 5 - Data cleaning

In [None]:
# 1. Handling Missing Values
print("Number of missing values per column:")
print(df_optimized.isnull().sum())

In [None]:
# View missing values
plt.figure(figsize=(10, 6))
sns.heatmap(df_optimized.isnull(), cbar=False, cmap="viridis")
plt.title("Viewing Missing Values in the Training Set")
plt.show()

# Part 6 - Feature engineering

In [None]:
# Importing library
from sklearn.preprocessing import LabelEncoder

# Encode categorical variables
label_encoder = LabelEncoder()
df_optimized['Gender'] = label_encoder.fit_transform(df_optimized['Gender'])
df_optimized['Vehicle_Age'] = label_encoder.fit_transform(df_optimized['Vehicle_Age'])
df_optimized['Vehicle_Damage'] = label_encoder.fit_transform(df_optimized['Vehicle_Damage'])

# Viewing
label_encoder

In [None]:
# View the first DataFrame records after encoding
df_optimized

### Analysis

We applied `LabelEncoder` to the categorical variables **Gender**, **Vehicle_Age**, and **Vehicle_Damage**, replacing each category with an integer code. The columns are encoded in place, so no new features are created. **Age_Bucket** is not encoded here: it was added only to the competition training set, not to `df_optimized`. This step is needed because many machine learning algorithms require numerical input variables.

`LabelEncoder` assigns codes in sorted order of the category labels, so the resulting integers imply an order that may not be meaningful. This matters most for non-ordinal variables: tree-based models are largely insensitive to the arbitrary ordering, while linear models may be better served by one-hot encoding.

Whether this encoding helps can be checked by comparing model performance against an alternative encoding of the same variables. This comparison is part of a broader feature engineering strategy aimed at improving the accuracy and robustness of the machine learning models.
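
The following sketch illustrates how label encoding assigns integer codes, using a few toy rows that mirror the `Vehicle_Age` categories. The rows, the one-encoder-per-column choice, and the explicit `ordinal_map` are assumptions for illustration, not the notebook's pipeline.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy values mirroring the Vehicle_Age categories (illustrative rows only)
vehicle_age = pd.Series(["> 2 Years", "1-2 Year", "< 1 Year", "1-2 Year"])

# One encoder per column, so each mapping can be inspected or inverted later
encoder = LabelEncoder()
codes = encoder.fit_transform(vehicle_age)

# classes_ is sorted; the position of a label in it is its integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# An explicit mapping preserves the natural order of the categories instead
ordinal_map = {"< 1 Year": 0, "1-2 Year": 1, "> 2 Years": 2}
ordinal_codes = vehicle_age.map(ordinal_map)
print(ordinal_codes.tolist())
```

In this toy case the sorted order places `1-2 Year` before `< 1 Year`, which is why an explicit mapping can be preferable when the categories have a natural order.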

In [None]:
# Fill missing values with the previous valid observation (forward fill)
df_optimized.ffill(inplace=True)

# Part 7 - Training and testing division

In [None]:
# Select features
#features = ['Gender', 'Age', 'Driving_License', 'Previously_Insured', 'Vehicle_Age', 'Vehicle_Damage', 'Annual_Premium', 'Policy_Sales_Channel', 'Vintage']

# Split
#X = df_test_df_optimized[features]
#y = df_test_df_optimized['Response']

# Features
X = df_optimized.drop(columns=['Response'])  

# Target variable
y = df_optimized['Response']  

In [None]:
# Viewing rows and columns x
X.shape

In [None]:
# Viewing rows and columns
y.shape

Here, we split the variables into features and the target. The features (`X`) are the independent variables the model uses to learn patterns and make predictions. The target (`y`) is the dependent variable we aim to predict, `Response`. Note that `X` still contains the `id` column, which is an identifier rather than a customer characteristic. Separating features and target correctly is a prerequisite for training, because the model learns the relationship between the two.

# Part 8 - Model training

In [None]:
# Importing libraries
from sklearn.model_selection import train_test_split

# Training and testing division
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Viewing training features
print("Rows and columns of X_train:", X_train.shape)

# Viewing training target
print("Rows of y_train:", y_train.shape)

Here, we split the data into training and test sets before training the model. We adopted an 80/20 division, where 80% of the data is used for training and the remaining 20% is reserved for testing. This procedure is crucial for accurately evaluating the model's performance. The training set allows the model to learn patterns and relationships within the data, while the test set, which the model has not seen during training, is used to validate its ability to generalize to new data. This approach also helps detect overfitting, where the model memorizes the training data but performs poorly on unseen data.
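
As a hedged variation on the 80/20 split described above, the sketch below adds `stratify` so that both parts keep a similar share of positive responses. The notebook's own split does not stratify; the synthetic data and the names `X_tr`, `X_te`, `y_tr`, `y_te` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data; illustrative only
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, weights=[0.88, 0.12], random_state=42
)

# 80/20 split; stratify keeps the class ratio similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(X_tr.shape, X_te.shape)
print(f"Positive rate: train={y_tr.mean():.3f}, test={y_te.mean():.3f}")
```

With an imbalanced target, an unstratified split can leave the test set with noticeably more or fewer positives than the training set, which makes evaluation noisier.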

In [None]:
# Converting categorical columns to dummy variables
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test)

# Make sure both sets have the same columns in the same order
X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)

# Viewing training features
print("Rows and columns of X_train:", X_train.shape)

# Viewing test features
print("Rows and columns of X_test:", X_test.shape)

# 1) Section - Machine learning

# Part 9 - Machine learning model

In [None]:
%%time

# Importing machine learning model library
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier


# Models to be evaluated
models = [GaussianNB(), # Naive Bayes Model
          DecisionTreeClassifier(random_state=42), # Decision Tree Model
          RandomForestClassifier(n_estimators=100, random_state=42), # Random forest model
          LogisticRegression(random_state=50), # Logistic regression model
          AdaBoostClassifier(random_state=45), # Ada Boost Model
          XGBClassifier(), # XGBoost Model Parameter tree_method='gpu_hist' for XGBoost GPU
          LGBMClassifier()] # LightGBM Model Parameter device='gpu' for LightGBM GPU

# Evaluate each model
for i, model in enumerate(models):
    model.fit(X_train, y_train)
    train_accuracy = accuracy_score(y_train, model.predict(X_train))
    test_accuracy = accuracy_score(y_test, model.predict(X_test))
    print(model)
    print()
    print(f"Model {i+1}: {type(model).__name__}")
    print()
    print(f"Training Accuracy: {train_accuracy}")
    print(f"Testing Accuracy: {test_accuracy}")
    print("------------------")

In [None]:
# Step 6: Evaluate the last model fitted in the loop above (LGBMClassifier)
train_accuracy = accuracy_score(y_train, model.predict(X_train))
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Training Accuracy: {train_accuracy}")
print()
print(f"Testing Accuracy: {test_accuracy}")

In [None]:
# Step 7: Make predictions on the held-out test split with the same model
#test_features = test_df[features]
predictions = model.predict(X_test)

# Part 10 - Feature importances

- "Feature importances" refers to the measure of how important each feature is for a machine learning model in making predictions or classifications. In other words, it is a way to quantify the impact or contribution of each feature to the decisions made by the model.

- In many machine learning algorithms such as decision trees, Random Forest, Gradient Boosting, among others, it is possible to calculate the importance of features during model training. This is done by observing how each feature influences the decisions made by the model when dividing the data into decision tree nodes or by weighing the features in other model structures.

- Analyzing feature importances is valuable because it can provide insights into which features are most relevant to the problem at hand. This information can be used to optimize the model, remove irrelevant or redundant features, identify important factors for prediction, and even assist in interpreting the model's results.
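
The idea can be sketched on synthetic data where the truly relevant features are known in advance. The data, the column names (`signal_a`, `noise_1`, and so on), and the choice of a random forest are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the target depends only on the two "signal" columns
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(
    rng.normal(size=(1000, 5)),
    columns=["signal_a", "signal_b", "noise_1", "noise_2", "noise_3"],
)
y_demo = (X_demo["signal_a"] + X_demo["signal_b"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_demo, y_demo)

# Impurity-based importances are non-negative and sum to 1
importances = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

The noise columns still receive small non-zero importances, so a low-ranked feature is not proof of irrelevance and a high-ranked one deserves a sanity check.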

In [None]:
# Train models that support feature importances

# Set Seaborn style
sns.set_palette("Set2")

models_with_feature_importances = [("DecisionTreeClassifier", DecisionTreeClassifier(random_state=42)),
                                   ("RandomForestClassifier", RandomForestClassifier(n_estimators=100, random_state=42)),
                                   ("XGBClassifier", XGBClassifier(random_state=42)),
                                   ("LGBMClassifier", LGBMClassifier(random_state=42))]

# Iterate over models
for model_name, model in models_with_feature_importances:
    
    # Train model
    model.fit(X_train, y_train)
    
    # Get importance of features
    if hasattr(model, 'feature_importances_'):
        feature_importances = model.feature_importances_
    else:
        # If the model does not have feature_importances_, continue to the next model
        print(f"{model_name} does not support feature importances.")
        continue

    # Create DataFrame for easier viewing
    feature_importances_df = pd.DataFrame({'Feature': X_train.columns, 
                                           'Importance': feature_importances})
    
    # Sort by importance
    feature_importances_df = feature_importances_df.sort_values(by='Importance', ascending=False)
    
    # Plot
    plt.figure(figsize=(10, 6))
    sns.barplot(x='Importance', y='Feature', data=feature_importances_df[:10])
    plt.title(f"Top 10 Features - {model_name}")
    plt.xlabel('Importance')
    plt.ylabel('Feature')
    plt.grid(False)
    plt.show()

### Feature Importance Interpretation

The graph shows the top 10 most important features used by the `LGBMClassifier` model for predicting the target variable. Each feature's importance is measured by the contribution it makes to the model's predictions.

1. **Age**: This is the most important feature, contributing the most to the model's decisions.
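2. **id**: The customer identifier ranks second. Because `id` is an arbitrary row identifier, this rank more likely reflects the many split points a unique numeric column offers than real predictive signal (LightGBM's default importance counts how often a feature is used in splits).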
3. **Vintage**: The number of days the customer has been associated with the company has a considerable impact on the model.
4. **Annual_Premium**: The annual premium amount is another crucial feature influencing the model.
5. **Vehicle_Damage**: Indicates if the vehicle was damaged, which is important for the model.
6. **Vehicle_Age**: The age of the vehicle is an important factor in the predictions.
7. **Previously_Insured_0**: Indicates whether the customer was previously insured, which affects the model's output.
8. **Policy_Sales_Channel_160.0**: This sales channel feature is relevant for the model's decisions.
9. **Policy_Sales_Channel_156.0**: Another sales channel that impacts the model's performance.
10. **Policy_Sales_Channel_152.0**: Also, a significant feature among the top 10, affecting the model's predictions.

### Key Takeaways

- **Age** is the most influential feature, suggesting that the customer's age is crucial in determining the target outcome.
- **id** ranking this high is a warning sign rather than an insight: an identifier should carry no information about the response, so its importance most likely comes from the number of split points it offers or from an ordering artifact in the data.
- **Vintage**, **Annual_Premium**, and **Vehicle_Damage** are also highly relevant, indicating that the customer's tenure with the company, the premium amount, and the vehicle's damage status significantly impact the predictions.
- The presence of multiple **Policy_Sales_Channel** features in the top 10 highlights the importance of the sales channel through which the policy was sold in influencing the model's predictions.

### Recommendations

- **Further Analysis**: Investigate why **id** ranks so high, for example by retraining without it and comparing test metrics. If performance does not drop, remove it from the feature set.
- **Feature Engineering**: Consider creating new features based on the most important ones to potentially improve the model's performance.
- **Model Tuning**: Use the insights from feature importance to fine-tune the model, focusing on the most influential features.
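
One way to check whether an identifier carries real signal is to compare the importance computed during training with permutation importance measured on held-out data. The sketch below is only an illustration under stated assumptions: it uses synthetic data and scikit-learn's `DecisionTreeClassifier` instead of the notebook's LightGBM model, and the column names `signal` and `row_id` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n = 4000

# Synthetic stand-in data: one informative column and a meaningless row identifier
demo = pd.DataFrame({"signal": rng.normal(size=n), "row_id": np.arange(n)})
# The label depends only on `signal` plus noise, never on `row_id`
demo_y = (demo["signal"] + rng.normal(scale=1.0, size=n) > 0).astype(int)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    demo, demo_y, test_size=0.25, random_state=42
)

# An unpruned tree keeps splitting until the training data is separated
tree = DecisionTreeClassifier(random_state=42).fit(Xd_train, yd_train)

# Importance computed from the training splits
train_importance = pd.Series(tree.feature_importances_, index=demo.columns)

# Importance measured on unseen data by shuffling one column at a time
perm = permutation_importance(tree, Xd_test, yd_test, n_repeats=10, random_state=42)
perm_importance = pd.Series(perm.importances_mean, index=demo.columns)

print(pd.DataFrame({"train_importance": train_importance,
                    "permutation_importance": perm_importance}))
```

If the identifier shows a non-trivial training importance but a permutation importance near zero, that suggests the model is memorizing rows rather than learning from the column.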


# Part 11 - Result metrics

In [None]:
# Plot a confusion matrix for each model
from sklearn.metrics import accuracy_score, confusion_matrix

# Class names for the target variable `Response` (0 = no, 1 = yes)
class_labels = ["No Response", "Response"]

# Evaluate each model
for i, model in enumerate(models):
    model.fit(X_train, y_train)
    y_test_pred = model.predict(X_test)  # predict once and reuse below
    train_accuracy = accuracy_score(y_train, model.predict(X_train))
    test_accuracy = accuracy_score(y_test, y_test_pred)
    
    print(f"Model {i+1}: {type(model).__name__}")
    print(f"Training Accuracy: {train_accuracy}")
    print(f"Testing Accuracy: {test_accuracy}")

    # Calculate and plot the confusion matrix
    cm = confusion_matrix(y_test, y_test_pred)
    plt.figure()
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False, 
                xticklabels=class_labels, 
                yticklabels=class_labels)
    plt.xlabel("Predicted")
    plt.ylabel("True")
    plt.title(f"Confusion Matrix - Model {i+1}: {type(model).__name__}")
    plt.show()
    print("------------------")

### Confusion Matrix Details

The matrix represents the performance of the `LGBMClassifier` model on the test dataset. Here's the breakdown of the confusion matrix:

- **True Negatives (TN)**: 66,646
  - These are the instances where the actual class was "No Response" (0) and the model also predicted "No Response".

- **False Positives (FP)**: 53
  - These are the instances where the actual class was "No Response" but the model predicted "Response" (1).

- **False Negatives (FN)**: 9,505
  - These are the instances where the actual class was "Response" but the model predicted "No Response".

- **True Positives (TP)**: 18
  - These are the instances where the actual class was "Response" and the model also predicted "Response".
  
### Metrics Calculation

Based on these values, we can calculate various performance metrics:

1. **Accuracy**:
   $$
   \text{Accuracy} = \frac{\text{TN} + \text{TP}}{\text{TN} + \text{FP} + \text{FN} + \text{TP}} = \frac{66,646 + 18}{66,646 + 53 + 9,505 + 18} = \frac{66,664}{76,222} \approx 0.8746
   $$

2. **Precision** (for the positive class "Response"):
   $$
   \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} = \frac{18}{18 + 53} \approx 0.253
   $$

3. **Recall** (Sensitivity, for the positive class "Response"):
   $$
   \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} = \frac{18}{18 + 9,505} \approx 0.002
   $$

4. **F1 Score** (for the positive class "Response"):
   $$
   \text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = 2 \cdot \frac{0.253 \cdot 0.002}{0.253 + 0.002} \approx 0.004
   $$

5. **Specificity** (for the negative class "No Response"):
   $$
   \text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}} = \frac{66,646}{66,646 + 53} \approx 0.999
   $$
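
The formulas above can be checked with a few lines of plain Python. This is a minimal sketch that only re-does the arithmetic from the four counts reported in this section; the variable names are illustrative.

```python
# Confusion-matrix counts reported above
tn, fp, fn, tp = 66_646, 53, 9_505, 18

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp)          # share of predicted positives that are correct
recall = tp / (tp + fn)             # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)        # share of actual negatives that are found

for name, value in [("Accuracy", accuracy), ("Precision", precision),
                    ("Recall", recall), ("F1 score", f1),
                    ("Specificity", specificity)]:
    print(f"{name:<12}{value:.4f}")
```

Working from the raw counts avoids the rounding error that builds up when F1 is computed from already rounded precision and recall values.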

### Interpretation

- The accuracy of the model is relatively high at approximately 87.5%. However, accuracy can be misleading in imbalanced datasets: about 87.5% of the test instances belong to the negative class, so always predicting "No Response" would score about the same.
- The precision for the "Response" class is quite low, indicating that when the model predicts "Response," it is correct about 25.3% of the time.
- The recall for the "Response" class is extremely low: the model finds only 18 of the 9,523 actual "Response" cases.
- The F1 score, which considers both precision and recall, is also very low, suggesting poor performance for the "Response" class.
- The specificity is very high, meaning the model is very good at identifying "No Response" cases correctly.

### Conclusion

The model has a high ability to correctly identify "No Response" cases but struggles significantly with correctly identifying "Response" cases. This indicates a potential issue with class imbalance or model bias towards the majority class ("No Response"). Further steps could include rebalancing the dataset, using different metrics for model evaluation, or trying other algorithms better suited for imbalanced datasets.


In [None]:
# ROC curve models

# Importing library
from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score

# Models to be evaluated
models = [GaussianNB(),
          DecisionTreeClassifier(random_state=42),
          RandomForestClassifier(n_estimators=100, random_state=42),
          LogisticRegression(random_state=42),
          AdaBoostClassifier(random_state=42),
          XGBClassifier(random_state=42),
          LGBMClassifier()]

# Evaluate each model
for i, model in enumerate(models):
    model.fit(X_train, y_train)
    train_accuracy = accuracy_score(y_train, model.predict(X_train))
    test_accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"Model {i+1}: {type(model).__name__}")
    print(f"Training Accuracy: {train_accuracy}")
    print(f"Testing Accuracy: {test_accuracy}")

    # Calculate positive class probabilities
    y_probs = model.predict_proba(X_test)[:, 1]
    
    # Calculate the ROC curve
    fpr, tpr, thresholds = roc_curve(y_test, y_probs)
    
    # Calculate the area under the ROC curve (AUC)
    auc = roc_auc_score(y_test, y_probs)
    
    # Plot the ROC curve
    plt.figure()
    plt.plot(fpr, tpr, color='blue', lw=2, label=f'AUC = {auc:.2f}')
    plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(f'ROC Curve - Model {i+1}: {type(model).__name__}')
    plt.legend(loc="lower right")
    plt.grid(False)
    plt.show()
    
    print("------------------")

### ROC Curve Explanation

The provided ROC (Receiver Operating Characteristic) curve is a graphical representation of the performance of the `LGBMClassifier` model. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings.

#### Key Components of the ROC Curve:

1. **True Positive Rate (TPR)**:
   - Also known as sensitivity or recall.
   - Represents the proportion of actual positives correctly identified by the model.
   - Formula: $ \text{TPR} = \frac{\text{TP}}{\text{TP} + \text{FN}} $

2. **False Positive Rate (FPR)**:
   - Represents the proportion of actual negatives incorrectly identified as positives by the model.
   - Formula: $ \text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}} $

3. **Diagonal Line (Baseline)**:
   - The diagonal line (dashed) represents the performance of a random classifier.
   - Any point along this line indicates that the classifier is performing no better than random guessing.

4. **ROC Curve**:
   - The blue curve represents the performance of the `LGBMClassifier` model across different threshold values.
   - The closer the curve follows the left-hand border and then the top border of the ROC space, the better the model's performance.

5. **AUC (Area Under the Curve)**:
   - The area under the ROC curve (AUC) is a single scalar value that summarizes the performance of the model.
   - AUC ranges from 0 to 1, with a higher value indicating better model performance.
   - An AUC of 0.86 means that a randomly chosen positive instance receives a higher predicted score than a randomly chosen negative instance about 86% of the time.

### Interpretation of the ROC Curve for `LGBMClassifier`:

- **Model Performance**:
  - The ROC curve for the `LGBMClassifier` (blue curve) shows that the model performs well, as it stays above the diagonal baseline and close to the top left corner of the plot.
  - The AUC score of 0.86 indicates that the model has a good ability to discriminate between the positive and negative classes.

- **Threshold Selection**:
  - The ROC curve helps in selecting the optimal threshold for classification. Depending on the business requirement, you might want to maximize TPR (sensitivity) while keeping FPR low.

### Summary:

- **High AUC Score**: The model has a strong discriminative power with an AUC of 0.86.
- **Effective Classifier**: The curve's proximity to the top left corner indicates that the `LGBMClassifier` is effective at distinguishing between positive and negative instances.
- **Threshold Decision**: The ROC curve provides insights into the trade-offs between TPR and FPR for different thresholds, aiding in the selection of the most appropriate threshold based on the specific use case.
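
The threshold selection described above can be sketched with the arrays returned by `roc_curve`. The example below is a hedged illustration, not the notebook's procedure: it uses synthetic data from `make_classification`, a `LogisticRegression` stand-in model, and Youden's J statistic (TPR minus FPR) as one possible selection rule. A real campaign would more likely weigh contact costs against conversion value.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 12% positives), an assumption for illustration
Xs, ys = make_classification(n_samples=5000, n_features=10,
                             weights=[0.88, 0.12], random_state=42)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    Xs, ys, test_size=0.2, stratify=ys, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(Xs_train, ys_train)
probs = clf.predict_proba(Xs_test)[:, 1]

fpr_s, tpr_s, thresholds_s = roc_curve(ys_test, probs)

# Youden's J statistic: pick the threshold that maximizes TPR - FPR
j_scores = tpr_s - fpr_s
best = int(np.argmax(j_scores))
best_threshold = thresholds_s[best]

# Apply the chosen threshold instead of the default 0.5
custom_pred = (probs >= best_threshold).astype(int)
custom_recall = recall_score(ys_test, custom_pred)

print(f"Chosen threshold: {best_threshold:.3f}")
print(f"TPR: {tpr_s[best]:.3f}  FPR: {fpr_s[best]:.3f}")
```

On imbalanced data the selected threshold is often well below 0.5, which trades some precision for a much higher recall on the positive class.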


In [None]:
# Classification report
# Importing library - Classification report models
from sklearn.metrics import accuracy_score, classification_report

# Models to be evaluated
models = [GaussianNB(),
          DecisionTreeClassifier(random_state=42),
          RandomForestClassifier(n_estimators=100, random_state=42),
          LogisticRegression(random_state=42),
          AdaBoostClassifier(random_state=42),
          XGBClassifier(random_state=42),
          LGBMClassifier()]

# Evaluate each model
for i, model in enumerate(models):
    model.fit(X_train, y_train)
    train_accuracy = accuracy_score(y_train, model.predict(X_train))
    test_accuracy = accuracy_score(y_test, model.predict(X_test))
    print()
    
    print(f"Model {i+1}: {type(model).__name__}")
    print()
    print(f"Training Accuracy: {train_accuracy}")
    print(f"Testing Accuracy: {test_accuracy}")

    # Generate the classification report
    report = classification_report(y_test, model.predict(X_test))
    print()
    print("Classification Report:")
    print()
    print(report)
    print()
    
    print("=======================================")

### Classification Report Analysis

The classification report gives a detailed breakdown of the performance metrics for the `LGBMClassifier` model on the test dataset, alongside the training and testing accuracy printed by the cell. Here is a detailed interpretation of the results:

#### Model Performance Metrics:

1. **Training Accuracy**: 0.8781614171807915
   - This indicates that the model correctly predicted approximately 87.8% of the instances in the training dataset.

2. **Testing Accuracy**: 0.87460831329537404
   - This shows that the model correctly predicted approximately 87.4% of the instances in the testing dataset, indicating a slight drop from the training accuracy which suggests minimal overfitting.

#### Detailed Classification Report:

The classification report breaks down the performance of the model for each class:

1. **Class 0 (No Response)**:
   
   - **Precision**: 0.88
     
     - This means that 88% of the instances predicted as "No Response" were correctly classified.
   
   - **Recall**: 1.00
     
     - This value is rounded: the model identified nearly all actual "No Response" instances (about 99.9%).
   
   - **F1-Score**: 0.93
     
     - The F1-score is the harmonic mean of precision and recall, given by the formula:
       
       $$
       F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
       $$
       
     - For this class, the F1-score is 0.93.
   
   - **Support**: 66,699
     
     - The number of actual instances of "No Response" in the test set.

2. **Class 1 (Response)**:
   
   - **Precision**: 0.25
     
     - This indicates that only 25% of the instances predicted as "Response" were correctly classified.
   
   - **Recall**: 0.00
     
     - This value is rounded: the model found only a tiny fraction (about 0.2%) of the actual "Response" instances.
   
   - **F1-Score**: 0.00
     
     - The F1-score rounds to 0 because recall is close to zero.
   
   - **Support**: 9,523
     
     - The number of actual instances of "Response" in the test set.

#### Overall Metrics:

- **Accuracy**: 0.87
  - The overall accuracy of the model, given by the formula:
    $$
    \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}
    $$
  - This is consistent with the testing accuracy, showing that 87% of all instances were correctly classified.

- **Macro Average**:
  - **Precision**: 0.56
    - The unweighted average of precision for all classes.
  - **Recall**: 0.50
    - The unweighted average of recall for all classes.
  - **F1-Score**: 0.47
    - The unweighted average of F1-scores for all classes. It is pulled down by the near-zero F1-score of the minority class.

- **Weighted Average**:
  - **Precision**: 0.80
    - The average precision weighted by the number of instances in each class.
  - **Recall**: 0.87
    - The average recall weighted by the number of instances in each class.
  - **F1-Score**: 0.82
    - The average F1-score weighted by the number of instances in each class. It looks high only because the majority class dominates the weighting and hides the poor performance on the minority class.
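
The macro and weighted averages above can be reproduced from the confusion-matrix counts reported earlier in this part (TN = 66,646, FP = 53, FN = 9,505, TP = 18). The sketch below rebuilds label vectors with those counts and passes them to scikit-learn's `precision_recall_fscore_support`; the array names are illustrative.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

tn, fp, fn, tp = 66_646, 53, 9_505, 18

# Rebuild label vectors that reproduce the confusion matrix
y_true_cm = np.concatenate([np.zeros(tn + fp, dtype=int),
                            np.ones(fn + tp, dtype=int)])
y_pred_cm = np.concatenate([
    np.zeros(tn, dtype=int), np.ones(fp, dtype=int),  # actual class 0
    np.zeros(fn, dtype=int), np.ones(tp, dtype=int),  # actual class 1
])

averages = {}
for average in ("macro", "weighted"):
    p, r, f, _ = precision_recall_fscore_support(y_true_cm, y_pred_cm,
                                                 average=average)
    averages[average] = (p, r, f)
    print(f"{average:<9} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The macro average treats both classes equally, so it exposes the weak minority class, while the weighted average is dominated by the 66,699 majority-class instances.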

### Summary:

- The model performs well in predicting the majority class (No Response) but struggles significantly with the minority class (Response).
- The high recall for the majority class mostly reflects that the model predicts "No Response" for almost every customer.
- The poor performance on the minority class suggests that the model may be biased towards the majority class, likely due to class imbalance.
- The overall accuracy and weighted averages are high, but the macro averages reveal the poor performance on the minority class, highlighting the need for addressing class imbalance.

### Recommendations:

- **Class Imbalance Handling**: Consider techniques such as oversampling the minority class, undersampling the majority class, or using different evaluation metrics that account for class imbalance.
- **Model Tuning**: Experiment with different models, hyperparameters, or threshold adjustments to improve the recall and precision for the minority class.
- **Feature Engineering**: Investigate additional features or different feature transformations that might help the model better distinguish between the classes.
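
As a minimal sketch of the class-imbalance recommendation, the example below compares a plain classifier with one trained using `class_weight="balanced"`. It is an illustration on synthetic data with a `LogisticRegression` stand-in, not a result for this dataset; LightGBM exposes a similar `class_weight` parameter on `LGBMClassifier`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 12% positives), an assumption for illustration
Xi, yi = make_classification(n_samples=10000, n_features=10, n_informative=4,
                             weights=[0.88, 0.12], random_state=42)
Xi_train, Xi_test, yi_train, yi_test = train_test_split(
    Xi, yi, test_size=0.2, stratify=yi, random_state=42
)

plain = LogisticRegression(max_iter=1000).fit(Xi_train, yi_train)
# "balanced" reweights each class inversely to its frequency
balanced = LogisticRegression(max_iter=1000,
                              class_weight="balanced").fit(Xi_train, yi_train)

results = {}
for name, clf in [("plain", plain), ("balanced", balanced)]:
    pred = clf.predict(Xi_test)
    results[name] = (recall_score(yi_test, pred),
                     precision_score(yi_test, pred, zero_division=0))
    print(f"{name:<9} recall={results[name][0]:.3f} "
          f"precision={results[name][1]:.3f}")
```

Reweighting usually raises minority-class recall at the cost of precision, so the two should be judged together against the campaign's cost per contact.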


# Part 12 - Machine learning model - Parameters with GPU

- Here we train a LightGBM model configured to run on the GPU (`device='gpu'`). LightGBM is a gradient boosting library designed to handle large datasets efficiently. Before training, we re-encode the categorical variables, drop the `id` column, split the data 80/20, and one-hot encode any remaining categorical columns. The model is then evaluated with accuracy, a confusion matrix, and a ROC curve.

In [None]:
# New copy of the dataset to avoid problems with the previous one; here we will use the XGBoost and LightGBM models

# Viewing train_df dataset
train_df.head()

In [None]:
# Viewing test_df dataset
test_df.head()

In [None]:
# Importing label encoder library
from sklearn.preprocessing import LabelEncoder

# Copy the original data to avoid modifying the original DataFrame
train_df = df_cleaned_optimized_train.copy()

# Apply a fresh LabelEncoder to each categorical column
categorical_cols = ['Gender', 'Vehicle_Age', 'Vehicle_Damage', 'Age_Bucket']
for col in categorical_cols:
    train_df[col] = LabelEncoder().fit_transform(train_df[col])

# Viewing the encoded columns
train_df[categorical_cols].head()

We applied the Label Encoder to the categorical variables 'Gender', 'Vehicle_Age', 'Vehicle_Damage', and 'Age_Bucket', replacing each category with an integer code. No new features are created: the columns keep their names and only their values change. This step is needed because many machine learning algorithms require numerical inputs.

Label encoding assigns the codes in sorted order of the category labels, so the resulting numbers imply an order even when the variable is not ordinal. Tree-based models such as LightGBM usually tolerate this, but for non-ordinal variables used with linear models, one-hot encoding is the safer choice.

Whether this encoding helps can be checked by comparing the performance of models trained on the label-encoded columns with models trained on another encoding of the same columns. This comparison is part of a broader feature engineering strategy aimed at improving the accuracy and robustness of the models.
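
The ordering that `LabelEncoder` assigns is worth inspecting for an ordinal column such as `Vehicle_Age`. The sketch below assumes the category labels are `"< 1 Year"`, `"1-2 Year"`, and `"> 2 Years"`; these values are an assumption and should be checked with `train_df['Vehicle_Age'].unique()`. The explicit mapping `ordered_map` is a hypothetical alternative, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumed category labels (verify against the real column before relying on them)
values = ["< 1 Year", "1-2 Year", "> 2 Years", "1-2 Year"]
vehicle_age = pd.Series(values)

encoder = LabelEncoder()
codes = encoder.fit_transform(values)

# classes_ holds the labels in sorted order; the position is the assigned code
label_to_code = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(label_to_code)

# An explicit mapping keeps the natural order: newest vehicle -> oldest vehicle
ordered_map = {"< 1 Year": 0, "1-2 Year": 1, "> 2 Years": 2}
ordered_codes = vehicle_age.map(ordered_map)
print(ordered_codes.tolist())
```

Because the labels are sorted as strings, the middle category ends up with the lowest code, which is why an explicit mapping can be preferable when the order matters.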

In [None]:
# Delete the 'id' column
train_df.drop(columns=['id'], inplace=True)

# View the first DataFrame records after column deletion
train_df.head()

In [None]:
# Split data into feature set (X) and target variable (y)

# Features
X1 = train_df.drop(columns=['Response'])  

# Target variable
y2 = train_df['Response']  

Here we separate the target column, `Response`, from the feature columns: `X1` holds the features and `y2` holds the target that the model will learn to predict. No class balancing is applied at this step, so the classes keep their original, imbalanced distribution.

In [None]:
# Split the data into training and testing sets
X_train1, X_test1, y_train2, y_test2 = train_test_split(X1, y2, test_size=0.2, random_state=42)

# Viewing rows and columns
print("Shape of X_train1:", X_train1.shape)
print("Shape of y_train2:", y_train2.shape)

Here we split the data into training and test sets using an 80/20 division: 80% of the data is used for training and the remaining 20% is reserved for testing. The training set allows the model to learn patterns and relationships within the data, while the test set, which the model has not seen during training, is used to check its ability to generalize to new data. Comparing performance on the two sets also helps detect overfitting, where the model memorizes the training data but performs poorly on unseen data.
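
With an imbalanced target, passing `stratify` to `train_test_split` keeps the positive rate the same in both sets. The sketch below uses synthetic data with roughly 12% positives, which is an assumption for illustration; in the notebook the equivalent call would pass `stratify=y2`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
Xt = rng.normal(size=(10000, 3))
yt = (rng.random(10000) < 0.12).astype(int)  # about 12% positives

# Plain random split: class proportions can drift between the two sets
_, _, y_tr_plain, y_te_plain = train_test_split(
    Xt, yt, test_size=0.2, random_state=42
)

# Stratified split: class proportions are preserved in both sets
_, _, y_tr_strat, y_te_strat = train_test_split(
    Xt, yt, test_size=0.2, random_state=42, stratify=yt
)

print(f"Plain      train={y_tr_plain.mean():.4f} test={y_te_plain.mean():.4f}")
print(f"Stratified train={y_tr_strat.mean():.4f} test={y_te_strat.mean():.4f}")
```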

In [None]:
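# Converting categorical columns to dummy variables
X_train1 = pd.get_dummies(X_train1)
X_test1 = pd.get_dummies(X_test1)

# Align the test columns to the training columns (missing dummies are filled with 0)
X_test1 = X_test1.reindex(columns=X_train1.columns, fill_value=0)

# Viewing rows and columns
print("Shape of X_train1:", X_train1.shape)
print("Shape of X_test1:", X_test1.shape)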

Before training LightGBM, we use `pd.get_dummies` to one-hot encode any columns that are still categorical, so that every feature is numeric. Because the training and test sets are encoded separately, both must end up with the same columns in the same order; the printed shapes make it easy to check that the column counts match.
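
Encoding the two sets separately can produce different columns when a category appears in only one of them. The toy example below, with made-up `Region_Code` values, shows the mismatch and one common way to align the test columns to the training columns with `reindex`.

```python
import pandas as pd

# Toy frames with made-up values: "8" and "46" appear only in train, "3" only in test
train_part = pd.DataFrame({"Region_Code": ["28", "8", "46"], "Age": [44, 23, 51]})
test_part = pd.DataFrame({"Region_Code": ["28", "3"], "Age": [30, 62]})

train_enc = pd.get_dummies(train_part)
test_enc = pd.get_dummies(test_part)
print(list(train_enc.columns))
print(list(test_enc.columns))

# Keep exactly the training columns: add missing ones as 0, drop unseen ones
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```

Aligning to the training columns matters because a fitted model expects the same features, in the same order, that it saw during training.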

## **Model 1 - LightGBM**

In [None]:
%%time
# Importing library
from lightgbm import LGBMClassifier
import lightgbm as lgb

# Creating LGBM model
lgbm_model = LGBMClassifier(device='gpu', 
                            num_leaves=31, 
                            max_depth=100, 
                            learning_rate=0.1, 
                            n_estimators=100)

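# Model training
# Note: this fits on X_train / y_train from the earlier split (which still contains `id`),
# not on the X_train1 / y_train2 data prepared in this part
lgbm_model.fit(X_train, y_train)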

Here we train the LightGBM classifier on the GPU with fixed hyperparameters (`num_leaves=31`, `max_depth=100`, `learning_rate=0.1`, `n_estimators=100`); no hyperparameter search is run in this cell. LightGBM is known for its efficiency and scalability on large datasets. The model's training score, test accuracy, confusion matrix, and ROC curve are computed in the cells that follow.

In [None]:
# LGBM model prediction
lgbm_model_pred = lgbm_model.predict(X_test)

In [None]:
# Accuracy on the training set (score() returns the mean accuracy)
print("Training accuracy LightGBM:", lgbm_model.score(X_train, y_train))

In [None]:
# Plot the importance of features
# Feature importances
importance = lgbm_model.feature_importances_
feature_names = X_train.columns

# Sort by importance (ascending)
indices = np.argsort(importance)

# Limit the number of features to display
num_features = 30  # Number of features to display
top_indices = indices[-num_features:]

# Plot the feature importances
plt.figure(figsize=(20, 10))
plt.title('Feature Importance')
plt.barh(range(len(top_indices)), importance[top_indices], align='center')
plt.yticks(range(len(top_indices)), [feature_names[i] for i in top_indices])
plt.xlabel('Relative Importance')
plt.grid(False)
plt.show()

### Feature Importance Explanation

The provided plot illustrates the feature importance for the `LGBMClassifier` model. Feature importance is a measure of the contribution each feature makes to the prediction of the target variable. Higher values indicate greater importance.

#### Observations:

1. **Top Features**:
   - **Age**: This is the most important feature, indicating that the age of the individual significantly impacts the model's predictions.
   - **Annual_Premium**: The total annual premium amount is also highly influential in the model.
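   - **id**: The identifier has a high importance. An identifier should not carry predictive information, so this more likely reflects the many split points a unique numeric column offers, or an ordering artifact in the data, than a real relationship with the response.
   - **Vintage**: The number of days the customer has been associated with the company is another crucial feature influencing the model.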
   - **Vehicle_Damage**: Whether the vehicle was damaged plays an important role in the model's decisions.
   - **Vehicle_Age**: The age of the vehicle is also a significant factor.

2. **Policy Sales Channels**:
   - Several `Policy_Sales_Channel` features appear in the list, indicating that the channel through which the policy was sold impacts the prediction.
   - **Policy_Sales_Channel_152.0**, **Policy_Sales_Channel_156.0**, **Policy_Sales_Channel_160.0**, **Policy_Sales_Channel_26.0**, and **Policy_Sales_Channel_157.0** are notable channels.

3. **Previously_Insured**:
   - Both `Previously_Insured_0` and `Previously_Insured_1` are important, suggesting that the model considers whether the customer was previously insured.

4. **Region Codes**:
   - Multiple `Region_Code` features are present, indicating that the geographic region of the customer also plays a role in the model's predictions.

5. **Gender**:
   - Gender has a lower but still notable importance, suggesting some influence on the model's predictions.

### Key Takeaways:

- **High Impact Features**: Age and Annual_Premium are the most influential features, meaning these aspects are critical in determining the likelihood of a positive response.
- **Policy Sales Channels**: The presence of multiple policy sales channels highlights the significance of the method or channel through which the policy was sold.
- **Geographic and Demographic Factors**: Region codes and gender also contribute, indicating that geographic and demographic factors affect the prediction.

### Recommendations:

- **Feature Engineering**: Further investigate the high importance of `id`, for example by retraining without it. An identifier that does not improve held-out metrics should be dropped rather than engineered into new features.
- **Channel Strategy**: Consider the impact of different sales channels and potentially refine marketing strategies based on the channels with higher importance.
- **Geographic Insights**: Utilize the information from `Region_Code` features to develop region-specific strategies or understand regional differences in responses.


In [None]:
# Calculate model accuracy
accuracy = accuracy_score(y_test, lgbm_model_pred)
print("Accuracy model - LightGBM:", accuracy)

This cell reports the test accuracy of the LightGBM model. From the confusion matrix shown below, it works out to (66,651 + 23) / 76,222, or about 87.5%. Because 66,699 of the 76,222 test customers (also about 87.5%) belong to the negative class, always predicting "no response" would score about the same, so accuracy alone says little about how well the model finds interested customers.
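
A quick way to put an accuracy figure in context is to compare it with a majority-class baseline. The sketch below uses scikit-learn's `DummyClassifier` on synthetic labels with roughly 12.5% positives; the data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
Xb = rng.normal(size=(5000, 4))               # features are ignored by the baseline
yb = (rng.random(5000) < 0.125).astype(int)   # about 12.5% positives

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(Xb, yb)
baseline_pred = baseline.predict(Xb)

baseline_accuracy = accuracy_score(yb, baseline_pred)
baseline_recall = recall_score(yb, baseline_pred)

print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Baseline recall (positive class): {baseline_recall:.4f}")
```

A model is only useful for targeting if it clearly beats this baseline on metrics that focus on the positive class, such as recall, precision, or AUC.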

In [None]:
# Create the confusion matrix
conf_matrix2 = confusion_matrix(y_test, lgbm_model_pred)

# Confusion matrix plot
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix2, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix - LGBM Classifier')
plt.show()

### Confusion Matrix Details

The matrix represents the performance of the `LGBMClassifier` model on the test dataset. Here's the breakdown of the confusion matrix:

- **True Negatives (TN)**: 66,651
  - These are the instances where the actual class was "No Response" (0) and the model also predicted "No Response".

- **False Positives (FP)**: 48
  - These are the instances where the actual class was "No Response" but the model predicted "Response" (1).

- **False Negatives (FN)**: 9,500
  - These are the instances where the actual class was "Response" but the model predicted "No Response".

- **True Positives (TP)**: 23
  - These are the instances where the actual class was "Response" and the model also predicted "Response".
  
### Metrics Calculation

Based on these values, we can calculate various performance metrics:

1. **Accuracy**:
   $$
   \text{Accuracy} = \frac{\text{TN} + \text{TP}}{\text{TN} + \text{FP} + \text{FN} + \text{TP}} = \frac{66,651 + 23}{66,651 + 48 + 9,500 + 23} = \frac{66,674}{76,222} \approx 0.8747
   $$

2. **Precision** (for the positive class "Response"):
   $$
   \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} = \frac{23}{23 + 48} \approx 0.324
   $$

3. **Recall** (Sensitivity, for the positive class "Response"):
   $$
   \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} = \frac{23}{23 + 9,500} \approx 0.0024
   $$

4. **F1 Score** (for the positive class "Response"):
   $$
   \text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = 2 \cdot \frac{0.324 \cdot 0.0024}{0.324 + 0.0024} \approx 0.0048
   $$

5. **Specificity** (for the negative class "No Response"):
   $$
   \text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}} = \frac{66,651}{66,651 + 48} \approx 0.999
   $$

### Interpretation

- The accuracy of the model is relatively high at approximately 87.5%. However, accuracy can be misleading in imbalanced datasets: about 87.5% of the test instances belong to the negative class, so always predicting "No Response" would score about the same.
- The precision for the "Response" class is quite low, indicating that when the model predicts "Response," it is correct about 32.4% of the time.
- The recall for the "Response" class is extremely low: the model finds only 23 of the 9,523 actual "Response" cases.
- The F1 score, which considers both precision and recall, is also very low, suggesting poor performance for the "Response" class.
- The specificity is very high, meaning the model is very good at identifying "No Response" cases correctly.

### Conclusion

The model has a high ability to correctly identify "No Response" cases but struggles significantly with correctly identifying "Response" cases. This indicates a potential issue with class imbalance or model bias towards the majority class ("No Response"). Further steps could include rebalancing the dataset, using different metrics for model evaluation, or trying other algorithms better suited for imbalanced datasets.


In [None]:
# ROC curve
y_pred_proba = lgbm_model.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
auc = roc_auc_score(y_test, y_pred_proba)
print("Area under the ROC curve LGBM (AUC):", auc)

# Plot the ROC curve
plt.plot(fpr, tpr, label='ROC curve - LGBM (area = %0.2f)' % auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - LGBM')
plt.legend(loc="lower right")
plt.grid(False)
plt.show()

### ROC Curve Explanation

The provided ROC (Receiver Operating Characteristic) curve is a graphical representation of the performance of the `LGBMClassifier` model. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings.

#### Key Components of the ROC Curve:

1. **True Positive Rate (TPR)**:
   - Also known as sensitivity or recall.
   - Represents the proportion of actual positives correctly identified by the model.
   - Formula: $ \text{TPR} = \frac{\text{TP}}{\text{TP} + \text{FN}} $

2. **False Positive Rate (FPR)**:
   - Represents the proportion of actual negatives incorrectly identified as positives by the model.
   - Formula: $ \text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}} $

3. **Diagonal Line (Baseline)**:
   - The diagonal line (dashed) represents the performance of a random classifier.
   - Any point along this line indicates that the classifier is performing no better than random guessing.

4. **ROC Curve**:
   - The blue curve represents the performance of the `LGBMClassifier` model across different threshold values.
   - The closer the curve follows the left-hand border and then the top border of the ROC space, the better the model's performance.

5. **AUC (Area Under the Curve)**:
   - The area under the ROC curve (AUC) is a single scalar value that summarizes the performance of the model.
   - AUC ranges from 0 to 1; 0.5 corresponds to random guessing, and higher values indicate better ranking of positives above negatives.
   - An AUC of 0.86 means that there is an 86% chance that the model assigns a higher score to a randomly chosen positive instance than to a randomly chosen negative instance.
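
The ranking interpretation of AUC can be made concrete by counting pairs directly. The sketch below is a plain-Python illustration on made-up labels and scores (not notebook output); the function name `pairwise_auc` is hypothetical. On large datasets `roc_auc_score` gives the same quantity far more efficiently.

```python
from itertools import product


def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else (0.5 if p == n else 0.0)
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Toy labels and scores, made up for illustration.
y_toy = [0, 0, 1, 0, 1, 1, 0, 1]
s_toy = [0.10, 0.35, 0.40, 0.45, 0.60, 0.20, 0.05, 0.90]
toy_auc = pairwise_auc(y_toy, s_toy)
print(f"pairwise AUC: {toy_auc:.4f}")
```

Note that this quantity depends only on how the scores are ordered, not on any classification threshold, which is why a model can have a good AUC and still show near-zero recall at a fixed threshold.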

### Interpretation of the ROC Curve for `LGBMClassifier`:

- **Model Performance**:
  - The ROC curve for the `LGBMClassifier` (blue curve) stays above the diagonal baseline and close to the top left corner of the plot, so the model ranks customers well.
  - The AUC score of 0.86 indicates a good ability to discriminate between the positive and negative classes, even though recall at the default 0.5 threshold is only about 0.24%.

- **Threshold Selection**:
  - The ROC curve helps in selecting the classification threshold. Because almost no customers are labeled positive at the default threshold of 0.5, lowering the threshold trades a higher FPR for a much higher TPR; the right trade-off depends on the business requirement.

### Summary:

- **High AUC Score**: The model has a strong discriminative power with an AUC of 0.86.
- **Effective Ranker**: The curve's proximity to the top left corner indicates that the `LGBMClassifier` scores separate positive and negative instances well, even though the default 0.5 threshold labels almost no instances as positive.
- **Threshold Decision**: The ROC curve provides insights into the trade-offs between TPR and FPR for different thresholds, aiding in the selection of the most appropriate threshold based on the specific use case.
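
As an illustration of threshold selection, the sketch below scans candidate thresholds and keeps the one that maximizes Youden's J statistic (TPR minus FPR). This rule is an assumption chosen for the example, since the notebook does not commit to a selection rule, and the labels and scores are made up. The helper names `rates_at` and `best_threshold` are hypothetical.

```python
def rates_at(y_true, scores, threshold):
    """Return (TPR, FPR) when predicting positive for score >= threshold.

    Assumes y_true contains at least one positive and one negative label.
    """
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    return tp / (tp + fn), fp / (fp + tn)


def best_threshold(y_true, scores):
    """Pick the observed score that maximizes Youden's J = TPR - FPR."""
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tpr, fpr = rates_at(y_true, scores, t)
        if tpr - fpr > best_j:
            best_t, best_j = t, tpr - fpr
    return best_t, best_j


# Toy imbalanced data: most scores sit well below 0.5, as with rare positives.
y_toy = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
s_toy = [0.02, 0.05, 0.08, 0.10, 0.12, 0.20, 0.25, 0.30, 0.35, 0.55]

tpr_default, fpr_default = rates_at(y_toy, s_toy, 0.5)
t_star, j_star = best_threshold(y_toy, s_toy)
print(f"threshold 0.50: TPR={tpr_default:.2f}, FPR={fpr_default:.2f}")
print(f"best threshold by Youden's J: {t_star} (J={j_star:.3f})")
```

In a marketing setting the rule could instead fix a minimum precision or a contact budget; the scan over thresholds stays the same and only the selection criterion changes.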


In [None]:
# classification report model
class_report = classification_report(y_test, lgbm_model_pred)
print("Classification report - LGBM Classifier")
print(class_report)

### Classification Report Analysis

The provided classification report gives a detailed breakdown of the performance metrics for the `LGBMClassifier` model on the test dataset. Here is a detailed interpretation of the results:

#### Model Performance Metrics:

- **Accuracy**: 0.87
  - The overall accuracy of the model, given by the formula:
    $$
    \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}
    $$
  - This indicates that the model correctly predicted approximately 87% of the instances in the test dataset.

#### Detailed Classification Report:

The classification report breaks down the performance of the model for each class:

1. **Class 0 (No Response)**:
   - **Precision**: 0.88
     - This means that 88% of the instances predicted as "No Response" were correctly classified.
   - **Recall**: 1.00
     - The model identified nearly all actual "No Response" instances (66,651 of 66,699); the value rounds to 1.00.
   - **F1-Score**: 0.93
     - The F1-score is the harmonic mean of precision and recall, given by the formula:
       $$
       F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
       $$
     - For this class, the F1-score is 0.93.
   - **Support**: 66,699
     - The number of actual instances of "No Response" in the test dataset.

2. **Class 1 (Response)**:
   - **Precision**: 0.32
     - This indicates that only 32% of the instances predicted as "Response" were correctly classified.
   - **Recall**: 0.00
     - The model identified only 23 of the 9,523 actual "Response" instances (about 0.24%), which rounds to 0.00.
   - **F1-Score**: 0.00
     - The F1-score rounds to 0.00 because recall is close to zero.
   - **Support**: 9,523
     - The number of actual instances of "Response" in the test dataset.

#### Overall Metrics:

- **Accuracy**: 0.87
  - The overall accuracy is consistent with the confusion matrix, showing that about 87% of all instances were correctly classified.

- **Macro Average**:
  - **Precision**: 0.60
    - The unweighted average of precision for all classes.
  - **Recall**: 0.50
    - The unweighted average of recall for all classes.
  - **F1-Score**: 0.47
    - The unweighted average of F1-scores for all classes; it is pulled down by the near-zero F1-score of the minority class.

- **Weighted Average**:
  - **Precision**: 0.81
    - The average precision weighted by the number of instances in each class.
  - **Recall**: 0.87
    - The average recall weighted by the number of instances in each class.
  - **F1-Score**: 0.82
    - The average F1-score weighted by the number of instances in each class. It is dominated by the majority class, so it masks the poor performance on the minority class.
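
The difference between the macro and weighted averages is easiest to see by recomputing them. The sketch below uses the rounded per-class precision and support values quoted above, so results agree with the report only to two decimals; the function names are hypothetical.

```python
# Per-class precision and support as quoted above (rounded report values).
precision = {0: 0.88, 1: 0.32}
support = {0: 66_699, 1: 9_523}


def macro_average(values):
    """Unweighted mean: every class counts equally."""
    return sum(values.values()) / len(values)


def weighted_average(values, support):
    """Mean weighted by the number of true instances of each class."""
    total = sum(support.values())
    return sum(values[c] * support[c] for c in values) / total


macro_p = macro_average(precision)
weighted_p = weighted_average(precision, support)
print(f"macro precision:    {macro_p:.2f}")
print(f"weighted precision: {weighted_p:.2f}")
```

Because class 0 carries about 87.5% of the weight, the weighted average stays close to the class 0 value no matter how poorly class 1 is predicted.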

### Summary:

- The model performs well in predicting the majority class (No Response) but struggles significantly with the minority class (Response).
- The high recall and precision for the majority class largely reflect that the model predicts the majority class for almost every instance.
- The poor performance on the minority class suggests that the model may be biased towards the majority class, likely due to class imbalance.
- The overall accuracy and weighted averages are high, but the macro averages reveal the poor performance on the minority class, highlighting the need for addressing class imbalance.

### Recommendations:

- **Class Imbalance Handling**: Consider techniques such as oversampling the minority class, undersampling the majority class, or using different evaluation metrics that account for class imbalance.
- **Model Tuning**: Experiment with different models, hyperparameters, or threshold adjustments to improve the recall and precision for the minority class.
- **Feature Engineering**: Investigate additional features or different feature transformations that might help the model better distinguish between the classes.
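
As a sketch of the first recommendation, the snippet below implements random oversampling with replacement in plain Python: minority rows are duplicated at random until both classes have the same count. The function name `random_oversample` and the toy rows are hypothetical. In practice a library such as imbalanced-learn (`RandomOverSampler`) would typically be used, and resampling should be applied to the training split only, never to the test data.

```python
import random


def random_oversample(rows, labels, minority_label=1, seed=42):
    """Duplicate randomly chosen minority rows until both classes are equal in size."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    if not minority:
        raise ValueError("no minority examples to sample from")
    n_extra = max(0, len(majority) - len(minority))
    extra = [rng.choice(minority) for _ in range(n_extra)]
    keep = list(range(len(labels))) + extra
    rng.shuffle(keep)  # avoid blocks of duplicated rows at the end
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Toy training data: 6 negatives and 2 positives (made-up feature values).
rows_toy = [[25, 0], [41, 1], [33, 0], [58, 1], [47, 0], [29, 1], [36, 0], [52, 1]]
labels_toy = [0, 0, 0, 0, 0, 0, 1, 1]

rows_bal, labels_bal = random_oversample(rows_toy, labels_toy)
print("class counts after oversampling:", labels_bal.count(0), labels_bal.count(1))
```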


## **Model 2 - XGBoost**

In [None]:
from xgboost import XGBClassifier

# XGBoost model
# tree_method='gpu_hist' trains on the GPU. In XGBoost >= 2.0 this value is
# deprecated in favor of tree_method='hist' together with device='cuda'.
model_XGBoost = XGBClassifier(tree_method='gpu_hist', random_state=42)
model_XGBoost_fit = model_XGBoost.fit(X_train, y_train)
model_XGBoost

In [None]:
# Accuracy on the training data (score() returns mean accuracy for classifiers)
print("Training accuracy - XGBoost:", model_XGBoost.score(X_train, y_train))

In [None]:
# XGBoost model prediction
xgboost_model_pred = model_XGBoost.predict(X_test)

In [None]:
# Calculate model accuracy
accuracy = accuracy_score(y_test, xgboost_model_pred)
print("Accuracy model - XGBoost:", accuracy)

In [None]:
# Create the confusion matrix
conf_matrix2 = confusion_matrix(y_test, xgboost_model_pred)

# Confusion matrix plot
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix2, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix - XGBoost')
plt.show()

### Confusion Matrix Explanation

The provided confusion matrix for the `XGBoost` model gives insight into the performance of the model on the test dataset. Here is a detailed interpretation of the results:

#### Components of the Confusion Matrix:

- **True Negatives (TN)**: 66,651
  - These are the instances where the actual class was "No Response" (0) and the model also predicted "No Response" (0).

- **False Positives (FP)**: 48
  - These are the instances where the actual class was "No Response" (0) but the model predicted "Response" (1).

- **False Negatives (FN)**: 9,500
  - These are the instances where the actual class was "Response" (1) but the model predicted "No Response" (0).

- **True Positives (TP)**: 23
  - These are the instances where the actual class was "Response" (1) and the model also predicted "Response" (1).

### Metrics Derived from the Confusion Matrix:

Using the values from the confusion matrix, we can derive several key performance metrics for the `XGBoost` model:

1. **Accuracy**:
   $$
   \text{Accuracy} = \frac{\text{TN} + \text{TP}}{\text{TN} + \text{FP} + \text{FN} + \text{TP}} = \frac{66,651 + 23}{66,651 + 48 + 9,500 + 23} = \frac{66,674}{76,222} \approx 0.875
   $$
   - This indicates that the model correctly classified approximately 87.5% of the instances.

2. **Precision** (for the positive class "Response"):
   $$
   \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} = \frac{23}{23 + 48} \approx 0.324
   $$
   - Precision measures the accuracy of positive predictions. Here, about 32.4% of the instances predicted as "Response" are actually "Response".

3. **Recall** (Sensitivity or True Positive Rate for the positive class "Response"):
   $$
   \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} = \frac{23}{23 + 9,500} \approx 0.0024
   $$
   - Recall measures the model's ability to correctly identify actual positive instances. Here, the model correctly identifies about 0.24% of the "Response" instances.

4. **F1 Score**:
   $$
   \text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = 2 \cdot \frac{0.324 \cdot 0.0024}{0.324 + 0.0024} \approx 0.0048
   $$
   - The F1 score is the harmonic mean of precision and recall, providing a balance between the two metrics.

5. **Specificity** (for the negative class "No Response"):
   $$
   \text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}} = \frac{66,651}{66,651 + 48} \approx 0.999
   $$
   - Specificity measures the model's ability to correctly identify actual negative instances. Here, the model correctly identifies about 99.9% of the "No Response" instances.

### Interpretation:

- **Model Performance**:
  - The model performs well in predicting the majority class (No Response) but struggles significantly with the minority class (Response).
  - The high specificity for the majority class largely reflects that the model predicts "No Response" for almost every instance.
  - The poor performance on the minority class suggests that the model may be biased towards the majority class, likely due to class imbalance.

- **Overall Metrics**:
  - The overall accuracy is high at about 87.5%, but this can be misleading in imbalanced datasets.
  - The precision for the "Response" class is low at 32.4%, indicating that most instances predicted as "Response" are actually "No Response".
  - The recall for the "Response" class is extremely low at 0.24%, indicating that the model misses almost all actual "Response" instances.
  - The F1 score for the "Response" class is very low, reflecting poor performance in identifying this class.
  - The high specificity of 99.9% indicates that the model is very good at identifying "No Response" instances correctly.

### Recommendations:

- **Class Imbalance Handling**: Consider techniques such as oversampling the minority class, undersampling the majority class, or using different evaluation metrics that account for class imbalance.
- **Model Tuning**: Experiment with different models, hyperparameters, or threshold adjustments to improve the recall and precision for the minority class.
- **Feature Engineering**: Investigate additional features or different feature transformations that might help the model better distinguish between the classes.
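
One evaluation metric that accounts for class imbalance is balanced accuracy, the mean of recall and specificity. The sketch below computes it from the confusion-matrix counts listed in this section and compares it with plain accuracy. scikit-learn provides the same metric as `balanced_accuracy_score`; the manual version is shown here only to make the arithmetic visible.

```python
# Confusion-matrix counts as listed in this section.
tn, fp, fn, tp = 66_651, 48, 9_500, 23

recall = tp / (tp + fn)          # true positive rate
specificity = tn / (tn + fp)     # true negative rate

accuracy = (tn + tp) / (tn + fp + fn + tp)
balanced_accuracy = (recall + specificity) / 2

print(f"accuracy:          {accuracy:.4f}")
print(f"balanced accuracy: {balanced_accuracy:.4f}")
```

A balanced accuracy near 0.5 is what a classifier that ignores the features would achieve, which makes the weakness on the minority class visible in a single number.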


In [None]:
# ROC curve
y_pred_proba = model_XGBoost.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
auc = roc_auc_score(y_test, y_pred_proba)
print("Area under the ROC curve (AUC):", auc)

# Plot the ROC curve
plt.plot(fpr, tpr, label='ROC curve XGBoost (area = %0.2f)' % auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - XGBoost')
plt.legend(loc="lower right")
plt.grid(False)
plt.show()

### ROC Curve Explanation

The provided ROC (Receiver Operating Characteristic) curve is a graphical representation of the performance of the `XGBoost` model. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings.

#### Key Components of the ROC Curve:

1. **True Positive Rate (TPR)**:
   - Also known as sensitivity or recall.
   - Represents the proportion of actual positives correctly identified by the model.
   - Formula: $ \text{TPR} = \frac{\text{TP}}{\text{TP} + \text{FN}} $

2. **False Positive Rate (FPR)**:
   - Represents the proportion of actual negatives incorrectly identified as positives by the model.
   - Formula: $ \text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}} $

3. **Diagonal Line (Baseline)**:
   - The diagonal line (dashed) represents the performance of a random classifier.
   - Any point along this line indicates that the classifier is performing no better than random guessing.

4. **ROC Curve**:
   - The solid curve represents the performance of the `XGBoost` model across different threshold values.
   - The closer the curve follows the left-hand border and then the top border of the ROC space, the better the model's performance.

5. **AUC (Area Under the Curve)**:
   - The area under the ROC curve (AUC) is a single scalar value that summarizes the performance of the model.
   - AUC ranges from 0 to 1; 0.5 corresponds to random guessing, and higher values indicate better ranking of positives above negatives.
   - An AUC of 0.86 means that there is an 86% chance that the model assigns a higher score to a randomly chosen positive instance than to a randomly chosen negative instance.
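
The "area" in AUC can also be computed directly from the points of a ROC curve with the trapezoidal rule. The sketch below uses hypothetical ROC points (not read from the notebook's plot) and a hypothetical helper `trapezoid_auc`; with the arrays returned by `roc_curve`, `sklearn.metrics.auc(fpr, tpr)` performs the same integration.

```python
def trapezoid_auc(fpr, tpr):
    """Area under a ROC curve given points sorted by increasing FPR."""
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        mean_height = (tpr[i] + tpr[i - 1]) / 2
        area += width * mean_height
    return area


# Hypothetical ROC points for illustration.
fpr_toy = [0.0, 0.1, 0.3, 0.6, 1.0]
tpr_toy = [0.0, 0.5, 0.8, 0.95, 1.0]

print(f"toy AUC:      {trapezoid_auc(fpr_toy, tpr_toy):.4f}")
print(f"diagonal AUC: {trapezoid_auc([0.0, 1.0], [0.0, 1.0]):.4f}")
```

The diagonal baseline integrates to 0.5, which is why an AUC of 0.5 corresponds to random guessing.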

### Interpretation of the ROC Curve for `XGBoost`:

- **Model Performance**:
  - The ROC curve for the `XGBoost` model (solid curve) stays above the diagonal baseline and close to the top left corner of the plot, so the model ranks customers well.
  - The AUC score of 0.86 indicates a good ability to discriminate between the positive and negative classes, even though recall at the default 0.5 threshold is very low.

- **Threshold Selection**:
  - The ROC curve helps in selecting the classification threshold. Because very few customers are labeled positive at the default threshold of 0.5, lowering the threshold trades a higher FPR for a much higher TPR; the right trade-off depends on the business requirement.

### Summary:

- **High AUC Score**: The model has a strong discriminative power with an AUC of 0.86.
- **Effective Ranker**: The curve's proximity to the top left corner indicates that the `XGBoost` scores separate positive and negative instances well, even though the default 0.5 threshold labels very few instances as positive.
- **Threshold Decision**: The ROC curve provides insights into the trade-offs between TPR and FPR for different thresholds, aiding in the selection of the most appropriate threshold based on the specific use case.


In [None]:
# Classification report model
class_report = classification_report(y_test, xgboost_model_pred)
print("Classification report - XGBoost Classifier")
print(class_report)

### Classification Report Analysis

The provided classification report gives a detailed breakdown of the performance metrics for the `XGBoost` model on the test dataset. Here is a detailed interpretation of the results:

#### Model Performance Metrics:

- **Accuracy**: 0.87

  - The overall accuracy of the model, given by the formula:
  
    $$
    \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}
    $$
    
  - This indicates that the model correctly predicted approximately 87% of the instances in the test dataset.

#### Detailed Classification Report:

The classification report breaks down the performance of the model for each class:

1. **Class 0 (No Response)**:
   - **Precision**: 0.88
     - This means that 88% of the instances predicted as "No Response" were correctly classified.
   - **Recall**: 1.00
     - The model identified nearly all actual "No Response" instances; the value rounds to 1.00.
   - **F1-Score**: 0.93
     - The F1-score is the harmonic mean of precision and recall, given by the formula:
       $$
       F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
       $$
     - For this class, the F1-score is 0.93.
   - **Support**: 66,699
     - The number of actual instances of "No Response" in the test dataset.

2. **Class 1 (Response)**:
   - **Precision**: 0.47
     - This indicates that only 47% of the instances predicted as "Response" were correctly classified.
   - **Recall**: 0.02
     - This means the model correctly identified 2% of actual "Response" instances.
   - **F1-Score**: 0.03
     - The F1-score is low mainly because recall is very low.
   - **Support**: 9,523
     - The number of actual instances of "Response" in the test dataset.

#### Overall Metrics:

- **Accuracy**: 0.87
  - The overall accuracy shows that about 87% of all instances were correctly classified.

- **Macro Average**:
  - **Precision**: 0.67
    - The unweighted average of precision for all classes.
  - **Recall**: 0.51
    - The unweighted average of recall for all classes.
  - **F1-Score**: 0.48
    - The unweighted average of F1-scores for all classes; it is pulled down by the very low F1-score of the minority class.

- **Weighted Average**:
  - **Precision**: 0.83
    - The average precision weighted by the number of instances in each class.
  - **Recall**: 0.87
    - The average recall weighted by the number of instances in each class.
  - **F1-Score**: 0.82
    - The average F1-score weighted by the number of instances in each class. It is dominated by the majority class, so it masks the poor performance on the minority class.

### Summary:

- The model performs well in predicting the majority class (No Response) but struggles significantly with the minority class (Response).
- The high recall and precision for the majority class largely reflect that the model predicts the majority class for almost every instance.
- The poor performance on the minority class suggests that the model may be biased towards the majority class, likely due to class imbalance.
- The overall accuracy and weighted averages are high, but the macro averages reveal the poor performance on the minority class, highlighting the need for addressing class imbalance.

### Recommendations:

- **Class Imbalance Handling**: Consider techniques such as oversampling the minority class, undersampling the majority class, or using different evaluation metrics that account for class imbalance.
- **Model Tuning**: Experiment with different models, hyperparameters, or threshold adjustments to improve the recall and precision for the minority class.
- **Feature Engineering**: Investigate additional features or different feature transformations that might help the model better distinguish between the classes.


# Part 13 - Model result

In [None]:
# Models to be evaluated
models = [
          GaussianNB(),  # Naive Bayes Model
          DecisionTreeClassifier(random_state=42),  # Decision Tree Model
          RandomForestClassifier(n_estimators=100, random_state=42),  # Random forest model
          LogisticRegression(random_state=50),  # Logistic regression model
          AdaBoostClassifier(random_state=45),  # Ada Boost Model
          XGBClassifier(),  # XGBoost Model Parameter tree_method='gpu_hist' for XGBoost GPU
          LGBMClassifier()  # LightGBM Model Parameter device='gpu' for LightGBM GPU
         ]

# List to store results
results = []

# Evaluate each model
for model in models:
    model.fit(X_train, y_train)
    train_accuracy = accuracy_score(y_train, model.predict(X_train))
    test_accuracy = accuracy_score(y_test, model.predict(X_test))
    
    # Store results in dictionary
    results.append({'Model': type(model).__name__,
                    'Training Accuracy': train_accuracy,
                    'Testing Accuracy': test_accuracy})

# Convert results list to DataFrame
results_df = pd.DataFrame(results)

# Function to highlight the maximum value in a column
def highlight_max(s):
    is_max = s == s.max()
    return ['background-color: yellow' if v else '' for v in is_max]

# Apply the highlight function to the 'Testing Accuracy' column
results_df.style.apply(highlight_max, subset=['Testing Accuracy'])

### Model Results Summary

The table above lists the training and testing accuracies for each evaluated model. Here is a summary of the results for each model:

1. **GaussianNB (Naive Bayes)**
   - **Training Accuracy**: 0.877191
   - **Testing Accuracy**: 0.874210

2. **DecisionTreeClassifier (Decision Tree)**
   - **Training Accuracy**: 1.000000
   - **Testing Accuracy**: 0.826100

3. **RandomForestClassifier (Random Forest)**
   - **Training Accuracy**: 0.999980
   - **Testing Accuracy**: 0.868791

4. **LogisticRegression (Logistic Regression)**
   - **Training Accuracy**: 0.878030
   - **Testing Accuracy**: 0.875062

5. **AdaBoostClassifier**
   - **Training Accuracy**: 0.877945
   - **Testing Accuracy**: 0.875062

6. **XGBClassifier (XGBoost)**
   - **Training Accuracy**: 0.880356
   - **Testing Accuracy**: 0.874419

7. **LGBMClassifier (LightGBM)**
   - **Training Accuracy**: 0.878161
   - **Testing Accuracy**: 0.874603

### Interpretation

- **GaussianNB**: The training (0.877) and testing (0.874) accuracies are close, so there is little evidence of overfitting.
- **DecisionTreeClassifier**: Shows perfect training accuracy (1.000000) but a testing accuracy of only 0.826, which indicates clear overfitting.
- **RandomForestClassifier**: Training accuracy is nearly perfect (0.99998) while testing accuracy is 0.869, so this model also overfits, although less severely than the single decision tree.
- **LogisticRegression**: Training (0.878) and testing (0.875) accuracies are very close.
- **AdaBoostClassifier**: Has the same testing accuracy as Logistic Regression (0.875062) and a nearly identical training accuracy.
- **XGBClassifier**: Training (0.880) and testing (0.874) accuracies are close, with no sign of strong overfitting.
- **LGBMClassifier**: Training (0.878) and testing (0.875) accuracies are also very close.

### Conclusion

- **Best Overall Model**: Based solely on testing accuracy, `LogisticRegression`, `AdaBoostClassifier`, `XGBClassifier`, and `LGBMClassifier` score highest, but they differ by less than 0.001 and all sit at roughly the share of the majority class in the test set (66,699 of 76,222, about 0.875). Accuracy alone therefore cannot separate these models; precision, recall, and F1-score for the positive class are needed to choose between them.
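
To put these accuracies in context, the sketch below compares each testing accuracy with the majority-class baseline, that is, the accuracy of always predicting class 0. The class counts come from the "Support" rows of the classification reports above and the accuracies from the summary list; nothing is retrained here.

```python
# Test-set class counts, taken from the "Support" rows of the classification reports.
n_negative, n_positive = 66_699, 9_523
baseline = max(n_negative, n_positive) / (n_negative + n_positive)

# Testing accuracies as listed in the summary above.
test_accuracy = {
    "GaussianNB": 0.874210,
    "DecisionTreeClassifier": 0.826100,
    "RandomForestClassifier": 0.868791,
    "LogisticRegression": 0.875062,
    "AdaBoostClassifier": 0.875062,
    "XGBClassifier": 0.874419,
    "LGBMClassifier": 0.874603,
}

print(f"majority-class baseline: {baseline:.6f}")
for name, acc in sorted(test_accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<24} {acc:.6f}  ({acc - baseline:+.6f} vs. baseline)")
```

None of the listed accuracies exceeds the baseline, which is a strong hint that accuracy is the wrong yardstick for this dataset.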


In [None]:
# Import the metric functions used to evaluate the models
from sklearn.metrics import accuracy_score, recall_score, f1_score, precision_score

# Models to be evaluated
models = [GaussianNB(),  # Naive Bayes Model
          DecisionTreeClassifier(random_state=42),  # Decision Tree Model
          RandomForestClassifier(n_estimators=100, random_state=42),  # Random forest model
          LogisticRegression(random_state=50),  # Logistic regression model
          AdaBoostClassifier(random_state=45),  # Ada Boost Model
          XGBClassifier(),  # XGBoost Model
          LGBMClassifier()  # LightGBM Model
         ]

# List to store results
results = []

# Evaluate each model
for model in models:
    model.fit(X_train, y_train)
    y_test_pred = model.predict(X_test)
    
    test_accuracy = accuracy_score(y_test, y_test_pred)
    # zero_division=0 reports 0 (without a warning) when a model predicts no positives
    test_precision = precision_score(y_test, y_test_pred, average='binary', zero_division=0)
    test_recall = recall_score(y_test, y_test_pred, average='binary')
    test_f1 = f1_score(y_test, y_test_pred, average='binary', zero_division=0)
    test_support = y_test.shape[0]  # total number of test rows, not a per-class support
    
    # Store results in dictionary
    results.append({'Model': type(model).__name__,
                    'Accuracy': test_accuracy,
                    'Precision': test_precision,
                    'Recall': test_recall,
                    'F1-Score': test_f1,
                    'Support': test_support}
                  )

# Convert results list to DataFrame
results_df = pd.DataFrame(results)

# Function to highlight the maximum value in each column
def highlight_max(s):
    is_max = s == s.max()
    return ['background-color: yellow' 
            if v 
            else '' 
            for v in is_max]

# Apply the highlight function to the DataFrame
styled_results_df = results_df.style.apply(highlight_max, subset=['Accuracy', 
                                                                  'Precision', 
                                                                  'Recall', 
                                                                  'F1-Score'])

# Display the styled DataFrame
styled_results_df

### Model Results Summary

The table above lists the test-set metrics for each evaluated model; precision, recall, and F1-score refer to the positive class. Here is a summary of the results for each model:

1. **GaussianNB (Naive Bayes)**
   - **Accuracy**: 0.874210
   - **Precision**: 0.190476
   - **Recall**: 0.002100
   - **F1-Score**: 0.004155
   - **Support**: 76222

2. **DecisionTreeClassifier (Decision Tree)**
   - **Accuracy**: 0.826100
   - **Precision**: 0.302832
   - **Recall**: 0.309056
   - **F1-Score**: 0.301891
   - **Support**: 76222

3. **RandomForestClassifier (Random Forest)**
   - **Accuracy**: 0.868791
   - **Precision**: 0.393208
   - **Recall**: 0.092408
   - **F1-Score**: 0.149647
   - **Support**: 76222

4. **LogisticRegression (Logistic Regression)**
   - **Accuracy**: 0.875062
   - **Precision**: 0.000000
   - **Recall**: 0.000000
   - **F1-Score**: 0.000000
   - **Support**: 76222

5. **AdaBoostClassifier**
   - **Accuracy**: 0.875062
   - **Precision**: 0.500000
   - **Recall**: 0.000022
   - **F1-Score**: 0.000045
   - **Support**: 76222

6. **XGBClassifier (XGBoost)**
   - **Accuracy**: 0.874419
   - **Precision**: 0.433243
   - **Recall**: 0.016696
   - **F1-Score**: 0.032154
   - **Support**: 76222

7. **LGBMClassifier (LightGBM)**
   - **Accuracy**: 0.874603
   - **Precision**: 0.253521
   - **Recall**: 0.001890
   - **F1-Score**: 0.003752
   - **Support**: 76222

### Interpretation

- **GaussianNB**: Shows low precision (0.19), recall (0.002), and F1-score despite a relatively high accuracy. This indicates poor performance in identifying the positive class.
- **DecisionTreeClassifier**: Has the best balance between precision (0.30), recall (0.31), and F1-score (0.30) among these models, at the cost of the lowest accuracy (0.826); it also shows signs of overfitting.
- **RandomForestClassifier**: Presents a moderate precision (0.39) but low recall (0.09) and F1-score (0.15), indicating it misses most positive instances.
- **LogisticRegression**: Precision, recall, and F1-score are all 0, meaning the model predicts no positive instances at all.
- **AdaBoostClassifier**: Shows the highest precision (0.50), but with recall near zero this precision rests on very few positive predictions and says little about the model's usefulness.
- **XGBClassifier**: Demonstrates a moderate precision (0.43) but low recall (0.017) and F1-score (0.03), indicating it identifies some positive instances but misses most.
- **LGBMClassifier**: Shows low precision (0.25) and very low recall (0.002) and F1-score, indicating it misses almost all positive instances.

### Conclusion

- **Best Overall Model**: `AdaBoostClassifier` shows the highest precision, but its recall is effectively zero. Judged by the balance between precision and recall, `DecisionTreeClassifier` has the highest F1-score (about 0.30), followed by `RandomForestClassifier` (about 0.15). No model identifies the positive class well at the default threshold, which points to class imbalance as the main issue to address.
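
The comparison above can be reproduced by ranking the models on F1-score instead of accuracy. The sketch below copies the precision, recall, and F1 values from the summary list into a dictionary (the name `model_metrics` is hypothetical) and sorts them; it does not retrain anything.

```python
# Positive-class metrics as listed in the summary above.
model_metrics = {
    "GaussianNB": {"precision": 0.190476, "recall": 0.002100, "f1": 0.004155},
    "DecisionTreeClassifier": {"precision": 0.302832, "recall": 0.309056, "f1": 0.301891},
    "RandomForestClassifier": {"precision": 0.393208, "recall": 0.092408, "f1": 0.149647},
    "LogisticRegression": {"precision": 0.000000, "recall": 0.000000, "f1": 0.000000},
    "AdaBoostClassifier": {"precision": 0.500000, "recall": 0.000022, "f1": 0.000045},
    "XGBClassifier": {"precision": 0.433243, "recall": 0.016696, "f1": 0.032154},
    "LGBMClassifier": {"precision": 0.253521, "recall": 0.001890, "f1": 0.003752},
}

# Rank models from highest to lowest F1-score.
ranked = sorted(model_metrics, key=lambda m: model_metrics[m]["f1"], reverse=True)

for rank, name in enumerate(ranked, start=1):
    m = model_metrics[name]
    print(f"{rank}. {name:<24} F1={m['f1']:.4f}  P={m['precision']:.2f}  R={m['recall']:.4f}")
```

The ranking by F1-score is almost the reverse of the ranking by accuracy, which illustrates why the choice of metric matters on imbalanced data.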


# 2) Section - Submission

# Part 14 - LightGBM Kaggle Submission

In Part 14, I prepare the Kaggle submission using the LightGBM model, which reached a test accuracy of about 0.875 and an AUC of 0.86 above. To avoid altering the data used in the earlier parts, I reload the original CSV files and rebuild the pipeline as separate steps: handling missing values, encoding categorical variables, splitting into training and validation sets, training, and predicting. The model is trained on the full training file, with 20% held out for validation.

In [None]:
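import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder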

# Load the datasets
train_df = pd.read_csv('/kaggle/input/playground-series-s4e7/train.csv')
test_df = pd.read_csv('/kaggle/input/playground-series-s4e7/test.csv')
sample_submission = pd.read_csv('/kaggle/input/playground-series-s4e7/sample_submission.csv')

In [None]:
# Handle missing values
train_df = train_df.fillna(-1)
test_df = test_df.fillna(-1)

In [None]:
# Encode categorical variables
label_encoders = {}
for column in train_df.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    train_df[column] = le.fit_transform(train_df[column])
    test_df[column] = le.transform(test_df[column])
    label_encoders[column] = le

# View the fitted encoders (one per categorical column)
label_encoders

In [None]:
# Separate features and target
X = train_df.drop(columns=['id', 'Response'])
y = train_df['Response']
X_test = test_df.drop(columns=['id'])

In [None]:
# Splitting training and validation data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Set model parameters for GPU
params = {'objective': 'binary',
          'boosting_type': 'gbdt',
          'num_leaves': 31,
          'learning_rate': 0.05,
          'feature_fraction': 0.9,
          'n_estimators': 100,
          'device': 'gpu',
          'gpu_platform_id': 0,
          'gpu_device_id': 0,
          'metric': 'auc'}

# Create the LightGBM datasets
train_data = lgb.Dataset(X_train, label=y_train)
val_data = lgb.Dataset(X_val, label=y_val, reference=train_data)

# Train the LightGBM model
lgbm_model = lgb.train(params,
                       train_data,
                       valid_sets=[val_data])

# Viewing model
lgbm_model

In [None]:
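# Make predictions on the test set (predicted probabilities of Response = 1).
# No early stopping is configured above, so best_iteration carries no
# information and all boosted trees are used.
y_test_pred = lgbm_model.predict(X_test, num_iteration=lgbm_model.best_iteration)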

In [None]:
# Check the length of test data and predictions
print(f"Test set length: {len(test_df)}")
print(f"Prediction length: {len(y_test_pred)}")

# Prepare the submission file
if len(test_df) == len(y_test_pred):
    submission = pd.DataFrame({'id': test_df['id'],'Response': y_test_pred})
    submission.to_csv('submission2.csv', index=False)
else:
    print("Error: The length of test data and predictions do not match.")

In [None]:
# Load the saved submission file
jf = pd.read_csv("/kaggle/working/submission2.csv")

# Show the first 10 rows
jf.head(10)