<a href="https://colab.research.google.com/github/VALDE021/Prediction-of-Product-Sales/blob/main/Car_Insurance_Data_Phase1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center><u>**Car Insurance Data Core Phase 1**</u>

**Authored by:** Eric N. Valdez

**Date:** October 1st, 2023

# <u>Assignment:

# Dataset:
* ## [Car Insurance Data](https://www.kaggle.com/datasets/sagnik1511/car-insurance-data)

## `When choosing your dataset, consider the following:`


* What is the target? (You are required to complete a classification task for this project.)
* What does one row represent? (A person? A business? An event? A product?)
* How many features does the data have?
* How many rows are in the dataset?
* What, if any, challenges do you foresee in cleaning, exploring, or modeling this dataset?
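
Most of these questions can be answered directly from a loaded DataFrame. The sketch below is a minimal illustration on a tiny stand-in frame; the column names come from this dataset, but the values and the `sample` variable are assumptions made only for the example.

```python
import pandas as pd

# Tiny stand-in frame (illustrative values only, not real rows from the dataset)
sample = pd.DataFrame({
    "AGE": ["65+", "16-25", "16-25", "26-39"],
    "CREDIT_SCORE": [0.63, 0.36, None, 0.39],
    "OUTCOME": [0.0, 1.0, 0.0, 1.0],
})

target = "OUTCOME"
n_rows, n_cols = sample.shape
n_features = n_cols - 1  # every column except the target

# Share of each class in the target
class_share = sample[target].value_counts(normalize=True)

# Columns that will need attention during cleaning
cols_with_missing = sample.columns[sample.isna().any()].tolist()

print(f"{n_rows} rows, {n_features} features")
print(class_share)
print("Columns with missing values:", cols_with_missing)
```

Running the same lines on the full `df` would give the row count, feature count, class balance, and the columns that need imputation.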

## **Clean and EDA**
* ~~Choose a dataset from the provided list~~
* Explore/clean the data
* Exploratory Visualizations
  * Create exploratory visualizations to understand your data and search for trends.
* Create Explanatory Visualizations
  * Select two features from your EDA and produce explanatory visualizations showing the relationship between the feature and the target.
  * The purpose is to demonstrate key trends you found that will interest a stakeholder.
    * These visuals should be reporting-quality with titles, labels, and a short explanation of the trend.
    * Be sure to explain the insight associated with each visual in a text cell.
    * Both visualizations should be easily understood by a non-technical audience (Neither of these should be histograms, boxplots, or correlation plots).

**Start the README file**

* Create a README.md file in your GitHub repository. This README should include:
  * Your business problem and stakeholders
  * The source of your data
  * A description of your data
  * The 2 analytical insights from your data analysis.
    * Use the 2 explanatory visualizations you created above
    * Include a brief written explanation of each visual

# **Imports:**

In [1]:
# Our standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#mode
from statistics import mode

# Missingno
import missingno as msno

# New Libraries
import scipy.cluster.hierarchy as sch
import sklearn.cluster as cluster

# Preprocessing tools
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.neighbors import NearestNeighbors
from mpl_toolkits import mplot3d

# Models & evaluation metrics
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

from sklearn import set_config
set_config(transform_output='pandas')

import joblib

# setting random state for reproducibility
SEED = 321
np.random.seed(SEED)
## Matplotlib style
fav_style = ('ggplot','tableau-colorblind10')
fav_context  ={'context':'notebook', 'font_scale':1.1}
plt.style.use(fav_style)
sns.set_context(**fav_context)
plt.rcParams['savefig.transparent'] = False
plt.rcParams['savefig.bbox'] = 'tight'


# Warnings
import warnings

In [2]:
# Pandas
import pandas as pd
# Seaborn
import seaborn as sns
# Numpy
import numpy as np
# MatplotLib
import matplotlib.pyplot as plt
# Warnings
import warnings

# Preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.metrics import precision_score, recall_score, \
ConfusionMatrixDisplay, accuracy_score, classification_report
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn import set_config
set_config(transform_output='pandas')

# Models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.impute import KNNImputer

# Classification Metrics
from sklearn.metrics import accuracy_score, recall_score, precision_score, \
f1_score, classification_report, confusion_matrix, ConfusionMatrixDisplay


# Set global scikit-learn configuration
from sklearn import set_config

# Display estimators as a diagram
set_config(display='diagram') # 'text' or 'diagram'

#### **Warnings**

In [3]:
# Set filter warnings to ignore
warnings.filterwarnings('ignore')

#### **Pandas Display Configurations**

In [4]:
## Display all columns
pd.set_option('display.max_columns', None)

## Display all rows
pd.set_option('display.max_rows', None)
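
Setting `display.max_rows` to `None` globally means any accidental display of the full 10,000-row frame prints every row. As an alternative, here is a minimal sketch that scopes the options to a single block with `pd.option_context`; the `wide` frame is a made-up example.

```python
import pandas as pd

# Made-up wide frame, used only to demonstrate the display options
wide = pd.DataFrame({f"col_{i}": range(3) for i in range(30)})

default_max_cols = pd.get_option("display.max_columns")

# Options set here apply only inside the with-block and revert on exit
with pd.option_context("display.max_columns", None, "display.max_rows", 20):
    inside_max_cols = pd.get_option("display.max_columns")
    print(wide.head())

after_max_cols = pd.get_option("display.max_columns")
print("Restored:", after_max_cols == default_max_cols)
```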

#### **SK Learn Display Configuration**

In [5]:
## SK Learn Display
set_config(display='diagram')

## Transformers output as a Pandas Dataframe
set_config(transform_output='pandas')

# <u>Custom Functions:

In [6]:
def classification_metrics(y_true, y_pred, label='',
                           output_dict=False, figsize=(8,4),
                           normalize='true', cmap='Blues',
                           colorbar=False):
  # Get the classification report
  report = classification_report(y_true, y_pred)
  ## Print header and report
  header = "-"*70
  print(header, f" Classification Metrics: {label}", header, sep='\n')
  print(report)
  ## CONFUSION MATRICES SUBPLOTS
  fig, axes = plt.subplots(ncols=2, figsize=figsize)
  # create a confusion matrix  of raw counts
  ConfusionMatrixDisplay.from_predictions(y_true, y_pred,
                normalize=None, cmap='gist_gray', colorbar=colorbar,
                ax = axes[0],);
  axes[0].set_title("Raw Counts")
  # create a normalized confusion matrix (per the `normalize` argument)
  ConfusionMatrixDisplay.from_predictions(y_true, y_pred,
                normalize=normalize, cmap=cmap, colorbar=colorbar,
                ax = axes[1]);
  axes[1].set_title("Normalized Confusion Matrix")
  # Adjust layout and show figure
  fig.tight_layout()
  plt.show()
  # Return dictionary of classification_report
  if output_dict:
    report_dict = classification_report(y_true, y_pred, output_dict=True)
    return report_dict
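
When `output_dict=True`, the function returns the dictionary form of `classification_report`. The sketch below shows how that dictionary is keyed, using toy labels chosen only for illustration.

```python
from sklearn.metrics import classification_report

# Toy labels (illustrative only): one false positive for class 1
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]

report_dict = classification_report(y_true, y_pred, output_dict=True)

# Per-class entries are keyed by the label converted to a string
recall_pos = report_dict["1"]["recall"]
precision_pos = report_dict["1"]["precision"]
accuracy = report_dict["accuracy"]

print(f"recall={recall_pos:.2f}, precision={precision_pos:.2f}, accuracy={accuracy:.2f}")
```

With the float target used in this notebook (`0.0` and `1.0`), the per-class keys would be `'0.0'` and `'1.0'` rather than `'0'` and `'1'`.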

# **Load and Inspect Data:**


### **Load the Data**

In [8]:
#Defining the data source
data = '/content/drive/MyDrive/#Data Science -C.D./CodingDojo/03-AdvanceML/Week09/Data/Car_Insurance_Claim.csv'
# Reading in the Data frame with Pandas
df = pd.read_csv(data)

### **Inspect the Data**

>## <u>.head()

In [9]:
# Displaying the first (5) rows of the dataframe
df.head(5)

Unnamed: 0,ID,AGE,GENDER,RACE,DRIVING_EXPERIENCE,EDUCATION,INCOME,CREDIT_SCORE,VEHICLE_OWNERSHIP,VEHICLE_YEAR,MARRIED,CHILDREN,POSTAL_CODE,ANNUAL_MILEAGE,VEHICLE_TYPE,SPEEDING_VIOLATIONS,DUIS,PAST_ACCIDENTS,OUTCOME
0,569520,65+,female,majority,0-9y,high school,upper class,0.629027,1.0,after 2015,0.0,1.0,10238,12000.0,sedan,0,0,0,0.0
1,750365,16-25,male,majority,0-9y,none,poverty,0.357757,0.0,before 2015,0.0,0.0,10238,16000.0,sedan,0,0,0,1.0
2,199901,16-25,female,majority,0-9y,high school,working class,0.493146,1.0,before 2015,0.0,0.0,10238,11000.0,sedan,0,0,0,0.0
3,478866,16-25,male,majority,0-9y,university,working class,0.206013,1.0,before 2015,0.0,1.0,32765,11000.0,sedan,0,0,0,0.0
4,731664,26-39,male,majority,10-19y,none,working class,0.388366,1.0,before 2015,0.0,0.0,32765,12000.0,sedan,2,0,1,1.0


>## <u>.shape

In [10]:
# Displaying the number of rows and columns for the dataframe
df.shape

(10000, 19)

>## <u>.dtypes

In [11]:
# Displaying the column names and datatypes for each column
df.dtypes

ID                       int64
AGE                     object
GENDER                  object
RACE                    object
DRIVING_EXPERIENCE      object
EDUCATION               object
INCOME                  object
CREDIT_SCORE           float64
VEHICLE_OWNERSHIP      float64
VEHICLE_YEAR            object
MARRIED                float64
CHILDREN               float64
POSTAL_CODE              int64
ANNUAL_MILEAGE         float64
VEHICLE_TYPE            object
SPEEDING_VIOLATIONS      int64
DUIS                     int64
PAST_ACCIDENTS           int64
OUTCOME                float64
dtype: object

>## <u>.info()

In [12]:
# Displaying the column names, count of non-null values, and their datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   10000 non-null  int64  
 1   AGE                  10000 non-null  object 
 2   GENDER               10000 non-null  object 
 3   RACE                 10000 non-null  object 
 4   DRIVING_EXPERIENCE   10000 non-null  object 
 5   EDUCATION            10000 non-null  object 
 6   INCOME               10000 non-null  object 
 7   CREDIT_SCORE         9018 non-null   float64
 8   VEHICLE_OWNERSHIP    10000 non-null  float64
 9   VEHICLE_YEAR         10000 non-null  object 
 10  MARRIED              10000 non-null  float64
 11  CHILDREN             10000 non-null  float64
 12  POSTAL_CODE          10000 non-null  int64  
 13  ANNUAL_MILEAGE       9043 non-null   float64
 14  VEHICLE_TYPE         10000 non-null  object 
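 15  SPEEDING_VIOLATIONS  10000 non-null  int64  
 16  DUIS                 10000 non-null  int64  
 17  PAST_ACCIDENTS       10000 non-null  int64  
 18  OUTCOME              10000 non-null  float64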

>## <u>.describe()

In [13]:
# Displaying the descriptive statistics for the numerical columns
df.describe(include='number')

Unnamed: 0,ID,CREDIT_SCORE,VEHICLE_OWNERSHIP,MARRIED,CHILDREN,POSTAL_CODE,ANNUAL_MILEAGE,SPEEDING_VIOLATIONS,DUIS,PAST_ACCIDENTS,OUTCOME
count,10000.0,9018.0,10000.0,10000.0,10000.0,10000.0,9043.0,10000.0,10000.0,10000.0,10000.0
mean,500521.9068,0.515813,0.697,0.4982,0.6888,19864.5484,11697.003207,1.4829,0.2392,1.0563,0.3133
std,290030.768758,0.137688,0.459578,0.500022,0.463008,18915.613855,2818.434528,2.241966,0.55499,1.652454,0.463858
min,101.0,0.053358,0.0,0.0,0.0,10238.0,2000.0,0.0,0.0,0.0,0.0
25%,249638.5,0.417191,0.0,0.0,0.0,10238.0,10000.0,0.0,0.0,0.0,0.0
50%,501777.0,0.525033,1.0,0.0,1.0,10238.0,12000.0,0.0,0.0,0.0,0.0
75%,753974.5,0.618312,1.0,1.0,1.0,32765.0,14000.0,2.0,0.0,2.0,1.0
max,999976.0,0.960819,1.0,1.0,1.0,92101.0,22000.0,22.0,6.0,15.0,1.0


In [14]:
# Displaying the descriptive statistics for the nonnumerical columns
df.describe(exclude="number")

Unnamed: 0,AGE,GENDER,RACE,DRIVING_EXPERIENCE,EDUCATION,INCOME,VEHICLE_YEAR,VEHICLE_TYPE
count,10000,10000,10000,10000,10000,10000,10000,10000
unique,4,2,2,4,3,4,2,2
top,26-39,female,majority,0-9y,high school,upper class,before 2015,sedan
freq,3063,5010,9012,3530,4157,4336,6967,9523


# **Data Cleaning and Exploring:**

In [15]:
# Remove leading and trailing whitespace from the column names
df.columns = df.columns.str.strip()

In [16]:
# Display the column names and datatypes for each column
# Columns with mixed datatypes are identified as an object datatype
df.dtypes

ID                       int64
AGE                     object
GENDER                  object
RACE                    object
DRIVING_EXPERIENCE      object
EDUCATION               object
INCOME                  object
CREDIT_SCORE           float64
VEHICLE_OWNERSHIP      float64
VEHICLE_YEAR            object
MARRIED                float64
CHILDREN               float64
POSTAL_CODE              int64
ANNUAL_MILEAGE         float64
VEHICLE_TYPE            object
SPEEDING_VIOLATIONS      int64
DUIS                     int64
PAST_ACCIDENTS           int64
OUTCOME                float64
dtype: object

> ## <u>Duplicates

In [17]:
# Drop any duplicate rows
df = df.drop_duplicates()

In [18]:
# Display the number of duplicate rows in the dataset
print(f'There are {df.duplicated().sum()} duplicate rows.')

There are 0 duplicate rows.
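
Counting duplicates only after `drop_duplicates()` always reports zero, so it does not show how many rows were removed. A minimal sketch of counting before and after, on a toy frame with made-up values:

```python
import pandas as pd

# Toy frame with one fully duplicated row (illustrative values)
toy = pd.DataFrame({
    "ID": [1, 2, 2, 3],
    "AGE": ["16-25", "26-39", "26-39", "65+"],
})

n_dupes_before = int(toy.duplicated().sum())  # count before dropping
deduped = toy.drop_duplicates()
n_dupes_after = int(deduped.duplicated().sum())

print(f"Removed {n_dupes_before} duplicate row(s); {n_dupes_after} remain.")
```

Rows that differ only in an identifier such as `ID` are not flagged as duplicates until that column is dropped, so it may be worth repeating the check afterwards.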


> ## <u>Dropping Columns


In [19]:
# Dropping unnecessary columns
df = df.drop(['ID'], axis = 1)


In [20]:
df.head()

Unnamed: 0,AGE,GENDER,RACE,DRIVING_EXPERIENCE,EDUCATION,INCOME,CREDIT_SCORE,VEHICLE_OWNERSHIP,VEHICLE_YEAR,MARRIED,CHILDREN,POSTAL_CODE,ANNUAL_MILEAGE,VEHICLE_TYPE,SPEEDING_VIOLATIONS,DUIS,PAST_ACCIDENTS,OUTCOME
0,65+,female,majority,0-9y,high school,upper class,0.629027,1.0,after 2015,0.0,1.0,10238,12000.0,sedan,0,0,0,0.0
1,16-25,male,majority,0-9y,none,poverty,0.357757,0.0,before 2015,0.0,0.0,10238,16000.0,sedan,0,0,0,1.0
2,16-25,female,majority,0-9y,high school,working class,0.493146,1.0,before 2015,0.0,0.0,10238,11000.0,sedan,0,0,0,0.0
3,16-25,male,majority,0-9y,university,working class,0.206013,1.0,before 2015,0.0,1.0,32765,11000.0,sedan,0,0,0,0.0
4,26-39,male,majority,10-19y,none,working class,0.388366,1.0,before 2015,0.0,0.0,32765,12000.0,sedan,2,0,1,1.0


>## <u>Missing Data

In [21]:
df.isna().sum()

AGE                      0
GENDER                   0
RACE                     0
DRIVING_EXPERIENCE       0
EDUCATION                0
INCOME                   0
CREDIT_SCORE           982
VEHICLE_OWNERSHIP        0
VEHICLE_YEAR             0
MARRIED                  0
CHILDREN                 0
POSTAL_CODE              0
ANNUAL_MILEAGE         957
VEHICLE_TYPE             0
SPEEDING_VIOLATIONS      0
DUIS                     0
PAST_ACCIDENTS           0
OUTCOME                  0
dtype: int64

In [22]:
# Checking features for null values
df['CREDIT_SCORE'].isna().sum()

982

In [23]:
# Checking features for null values
df['ANNUAL_MILEAGE'].isna().sum()

957

In [24]:
# Percent of total rows missing values
percent_missing = (1 - df.dropna().shape[0] / df.shape[0]) * 100
print(f'{percent_missing:.4f} percent of rows are missing at least 1 value')

18.5100 percent of rows are missing at least 1 value
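
The row-level figure can be reconciled with the column counts: 18.51% of 10,000 rows is 1,851 rows, while the two columns have 982 + 957 = 1,939 missing values, so 88 rows are missing both. A per-column breakdown can be sketched as follows; the `toy` frame and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "CREDIT_SCORE": [0.63, np.nan, 0.49, 0.21],
    "ANNUAL_MILEAGE": [12000.0, 16000.0, np.nan, np.nan],
    "DUIS": [0, 0, 0, 1],
})

# isna().mean() gives the fraction of missing values per column
pct_missing_by_col = toy.isna().mean().mul(100)

# any(axis=1) flags rows with at least one missing value
pct_rows_missing = toy.isna().any(axis=1).mean() * 100

print(pct_missing_by_col)
print(f"{pct_rows_missing:.1f} percent of rows are missing at least 1 value")
```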


In [25]:
# Instantiate a mean imputer for CREDIT_SCORE and ANNUAL_MILEAGE
# (it is not fitted or applied here, so the null counts below are unchanged)
imp = SimpleImputer(missing_values=np.nan, strategy='mean')

In [26]:
# Check feature for null values
df['CREDIT_SCORE'].isna().sum()

982

In [27]:
# Check feature for null values
df['ANNUAL_MILEAGE'].isna().sum()

957
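
The null counts are unchanged because an imputer only fills values once it is fitted and its output is used. A minimal sketch of applying a mean imputer, on a toy frame with made-up values:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "CREDIT_SCORE": [0.6, np.nan, 0.4],
    "ANNUAL_MILEAGE": [12000.0, 16000.0, np.nan],
})

imp = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit_transform learns each column mean and returns the filled values
filled = pd.DataFrame(imp.fit_transform(toy), columns=toy.columns, index=toy.index)

print(filled)
print("Remaining nulls:", int(filled.isna().sum().sum()))
```

In a modeling workflow the imputer is usually fitted on the training split only (for example inside a pipeline, as in the preprocessing cells of this notebook) so that test-set statistics do not leak into training.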

# **EDA:**

>## Exploratory Visualization:

In [28]:
# The target we are trying to predict
y = df['OUTCOME']
# The features we will use to make the prediction
X = df.drop(columns = 'OUTCOME')
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [29]:
# Repeat the same split (identical random_state) and preview the training features
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
X_train.head()

Unnamed: 0,AGE,GENDER,RACE,DRIVING_EXPERIENCE,EDUCATION,INCOME,CREDIT_SCORE,VEHICLE_OWNERSHIP,VEHICLE_YEAR,MARRIED,CHILDREN,POSTAL_CODE,ANNUAL_MILEAGE,VEHICLE_TYPE,SPEEDING_VIOLATIONS,DUIS,PAST_ACCIDENTS
4901,40-64,male,majority,0-9y,high school,upper class,0.694461,1.0,before 2015,1.0,1.0,92101,,sedan,0,0,0
4375,16-25,female,majority,0-9y,none,poverty,0.295794,1.0,before 2015,0.0,0.0,32765,12000.0,sedan,0,0,0
6698,40-64,male,majority,10-19y,university,upper class,,1.0,before 2015,1.0,1.0,10238,,sedan,0,0,3
9805,26-39,female,majority,10-19y,university,working class,0.454836,1.0,before 2015,0.0,0.0,10238,20000.0,sedan,2,0,0
1101,16-25,female,majority,0-9y,none,poverty,0.152972,1.0,before 2015,1.0,0.0,10238,10000.0,sedan,0,0,0


<u>**Preprocessing:**

**Numerical Preprocessing Pipeline**

In [30]:
# Numerical Preprocessing Pipeline
# Save list of column names
num_cols = X_train.select_dtypes("number").columns
print("Numeric Columns:", num_cols)
# instantiate preprocessors
impute_median = SimpleImputer(strategy='median')
scaler = StandardScaler()
# Make a numeric preprocessing pipeline
num_pipe = make_pipeline(impute_median, scaler)
# Making a numeric tuple for ColumnTransformer
num_tuple = ('numeric', num_pipe, num_cols)

Numeric Columns: Index(['CREDIT_SCORE', 'VEHICLE_OWNERSHIP', 'MARRIED', 'CHILDREN',
       'POSTAL_CODE', 'ANNUAL_MILEAGE', 'SPEEDING_VIOLATIONS', 'DUIS',
       'PAST_ACCIDENTS'],
      dtype='object')


**Create the Column Transformer**

In [31]:
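# Create the preprocessing ColumnTransformer
# Only the numeric tuple is passed, so non-numeric columns are dropped (remainder='drop' by default)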
preprocessor = ColumnTransformer([num_tuple], verbose_feature_names_out=False)
preprocessor

>## Explanatory Visualization:

In [33]:
# Replacing numeric target with string labels
target_map = {1:"Customer Claimed Loans", 0:'Not Claimed'}
df['OUTCOME'] = df['OUTCOME'].replace(target_map)
df['OUTCOME'].value_counts(dropna=False)

Not Claimed               6867
Customer Claimed Loans    3133
Name: OUTCOME, dtype: int64

In [34]:
# Check how many samples of each class are present
df['OUTCOME'].value_counts(normalize=True)

Not Claimed               0.6867
Customer Claimed Loans    0.3133
Name: OUTCOME, dtype: float64

**Split the Data**

In [35]:
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

**Class Balance**

In [36]:
# Check how many samples of each class are present for train
y_train.value_counts(normalize=True)

0.0    0.686667
1.0    0.313333
Name: OUTCOME, dtype: float64
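
The training share of class `1.0` (0.313333) matches the overall share (0.3133) because the split was stratified on `y`. A minimal sketch of that effect on synthetic labels; `X_demo` and `y_demo` are made-up arrays with a 70/30 imbalance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 70/30 class imbalance (illustrative only)
y_demo = np.array([0] * 280 + [1] * 120)
X_demo = np.arange(400).reshape(-1, 1)

# stratify=y_demo keeps the class proportions the same in both splits
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

train_share = y_tr.mean()  # share of class 1 in the training split
test_share = y_te.mean()   # share of class 1 in the test split
print(f"train: {train_share:.3f}, test: {test_share:.3f}")
```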

>> <u>Feature 1:

**Recreating Numerical Preprocessing Pipeline**
`(following classification Core Metrics)`

In [37]:
# Numerical Preprocessing Pipeline
# Save list of column names
num_cols = X_train.select_dtypes("number").columns
print("Numeric Columns:", num_cols)
# instantiate preprocessors
impute_median = SimpleImputer(strategy='median')
scaler = StandardScaler()
# Make a numeric preprocessing pipeline
num_pipe = make_pipeline(impute_median, scaler)
# Making a numeric tuple for ColumnTransformer
num_tuple = ('numeric', num_pipe, num_cols)

Numeric Columns: Index(['CREDIT_SCORE', 'VEHICLE_OWNERSHIP', 'MARRIED', 'CHILDREN',
       'POSTAL_CODE', 'ANNUAL_MILEAGE', 'SPEEDING_VIOLATIONS', 'DUIS',
       'PAST_ACCIDENTS'],
      dtype='object')


**Recreating the Column Transformer** `(core classification metrics)`

In [39]:
# Create the Column Transformer
preprocessor = ColumnTransformer([num_tuple], verbose_feature_names_out=False)

In [40]:
# Instantiate the transformers
scaler = StandardScaler()
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

>> <u>Feature 2:

**RandomForestClassifier (rfc) Model**

In [41]:
# Instantiate a random forest classifier
rfc = RandomForestClassifier(random_state = 42)
# Make a pipeline to scale the data and fit a model
random_forest_pipe = make_pipeline(preprocessor, rfc)
# Fit the model on the training data
random_forest_pipe.fit(X_train, y_train)

In [42]:
# Define the predicted values
y_pred_train = random_forest_pipe.predict(X_train)
y_pred_test = random_forest_pipe.predict(X_test)
# Obtain the accuracy score (y_true first, then y_pred)
train_acc = round(accuracy_score(y_train, y_pred_train), 3)
test_acc = round(accuracy_score(y_test, y_pred_test), 3)
# Print the results
print(f'Training accuracy : {train_acc}.')
print(f'Testing accuracy  : {test_acc}.')

Training accuracy : 0.992.
Testing accuracy  : 0.775.
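
The gap between training accuracy (0.992) and testing accuracy (0.775) suggests the default forest is overfitting. One common way to investigate is to compare an unconstrained forest with a depth-limited one and to use cross-validation on the training split. The sketch below does this on synthetic data from `make_classification`; the data, `max_depth=4`, and `n_estimators=50` are assumptions chosen for illustration, not tuned values for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic, noisy classification data (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, flip_y=0.2, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

results = {}
for depth in (None, 4):  # None = fully grown trees, 4 = depth-limited trees
    rfc = RandomForestClassifier(max_depth=depth, n_estimators=50, random_state=42)
    # 5-fold cross-validated accuracy, computed on the training split only
    cv_acc = cross_val_score(rfc, X_tr, y_tr, cv=5, scoring="accuracy").mean()
    rfc.fit(X_tr, y_tr)
    results[depth] = {
        "train": rfc.score(X_tr, y_tr),
        "cv": cv_acc,
        "test": rfc.score(X_te, y_te),
    }
    print(depth, {k: round(float(v), 3) for k, v in results[depth].items()})
```

A training score far above the cross-validated score is the same warning sign as the train/test gap above, and it can be checked without touching the test set.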


# **K-Means:**

> ## <u>SimpleImputer

> ## <u>Scale the Data

In [None]:
# # define the columns you want to use
# x = df[['CREDIT_SCORE', 'ANNUAL_MILEAGE']]
# x.head()
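
The K-Means section is still an outline. As a rough sketch of the steps it names (impute, scale, then cluster), the example below uses synthetic stand-ins for `CREDIT_SCORE` and `ANNUAL_MILEAGE`; the generated values, the choice of two clusters, and the median imputation strategy are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with two loose groups (illustrative only)
rng = np.random.default_rng(321)
x = pd.DataFrame({
    "CREDIT_SCORE": np.concatenate(
        [rng.normal(0.3, 0.03, 50), rng.normal(0.7, 0.03, 50)]
    ),
    "ANNUAL_MILEAGE": np.concatenate(
        [rng.normal(9000, 500, 50), rng.normal(15000, 500, 50)]
    ),
})
x.iloc[::10, 0] = np.nan  # inject a few missing values

# K-Means is distance-based, so fill nulls first and put features on one scale
prep = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
x_scaled = prep.fit_transform(x)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=321)
labels = kmeans.fit_predict(x_scaled)

# Silhouette score ranges from -1 to 1; higher means better-separated clusters
score = silhouette_score(x_scaled, labels)
print(f"silhouette score: {score:.3f}")
```

On the real data, the number of clusters would normally be chosen by comparing inertia or silhouette scores across several values of `n_clusters`.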