<div style="width:100%;text-align: center;"> <img align=middle src="https://images.pexels.com/photos/164634/pexels-photo-164634.jpeg?auto=compress&cs=tinysrgb&w=600"> </div>

# <h1 style='background:#610C63; border:0; color:white'><center> 🚗Cars - Purchase Decision Model</center></h1>

# **<span style="color:#C689C6;">📰About the Dataset</span>**

This dataset contains details of 1000 customers who intend to buy a car, considering their annual salaries.

# **<span style="color:#C689C6;">📁About the files</span>**

The dataset contains 5 columns, namely:

> User ID

> Gender

> Age

> Annual Salary

> Purchase Decision (No = 0; Yes = 1)

In [None]:
#Imports
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import sklearn
from sklearn.ensemble import RandomForestClassifier
from imblearn.ensemble import BalancedRandomForestClassifier 
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

In [None]:
#Custom Colors
class clr:
    S = '\033[1m' + '\033[96m'
    E = '\033[0m'
    
my_colors = [ "#610C63","#937DC2","#C689C6", "#E8A0BF", "#FCC5C0"]

print(clr.S + "Notebook Color Scheme: " + clr.E)
sns.palplot(sns.color_palette(my_colors))

In [None]:
#Environment check
import os
import warnings
warnings.filterwarnings("ignore")

## **<span style="color:#610C63;">📃Get the Data</span>**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv('/kaggle/input/cars-purchase-decision-dataset/car_data.csv')

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.columns

In [None]:
df.head()

In [None]:
#Check for null values 
df.isna().any()

There are no null values in this dataset.

## **<span style="color:#610C63;">👀 Visualization</span>**

In [None]:
df.hist(bins = 50, figsize = (20,15), color = my_colors[2])

## **<span style="color:#610C63;">🧹 Data Cleaning</span>**

In [None]:
df['Gender'].unique()

In [None]:
df['Gender'] = df['Gender'].str.replace('Male', '0')
df['Gender'] = df['Gender'].str.replace('Female', '1')

In [None]:
df['Gender'] = df['Gender'].astype(int)
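
The same encoding (Male = 0, Female = 1) can also be written as a single explicit mapping. The snippet below is a minimal sketch on a toy frame, not the notebook's own cell: `toy` and `gender_codes` are hypothetical names, and the check before the cast is an added safeguard against labels that are missing from the mapping.

```python
import pandas as pd

# Toy frame standing in for the real data; only the 'Gender' column matters here.
toy = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# Explicit mapping with the same codes as the cells above (Male -> 0, Female -> 1).
gender_codes = {"Male": 0, "Female": 1}
encoded = toy["Gender"].map(gender_codes)

# Any label that is missing from the mapping becomes NaN, so check before casting.
if encoded.isna().any():
    raise ValueError("Unexpected value in 'Gender' column")

toy["Gender"] = encoded.astype(int)
print(toy["Gender"].tolist())
```

A mapping keeps both codes in one place, which makes the encoding easier to audit than two chained string replacements.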

In [None]:
X = df.iloc[:,:-1]
y = df['Purchased']

In [None]:
X.shape

In [None]:
y.shape

In [None]:
y.value_counts()

The classes are imbalanced: label 0 is more frequent than label 1 in `y` (roughly a 60/40 split, going by the class weights `[0.598, 0.402]` used later in this notebook).
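
To put a number on the class balance, it helps to look at shares as well as raw counts. This is a small sketch on made-up labels: `y_toy` is a hypothetical stand-in with a 60/40 split, so swap in the real `y` to get the actual figures.

```python
import pandas as pd

# Hypothetical labels with a 60/40 split; replace with the real `y`.
y_toy = pd.Series([0] * 6 + [1] * 4, name="Purchased")

counts = y_toy.value_counts()                  # absolute count per label
shares = y_toy.value_counts(normalize=True)    # fraction of rows per label
imbalance_ratio = counts.max() / counts.min()  # majority size / minority size

print(counts.to_dict())
print(shares.to_dict())
print(round(imbalance_ratio, 2))
```

A ratio close to 1 means the classes are nearly balanced; the larger it gets, the more a stratified split and per-class metrics matter.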

## **<span style="color:#610C63;">üóÇClassification</span>**

## **<span style="color:#C689C6;">Classification using Random Forest Classifier</span>**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.2, stratify = y)

In [None]:
# Define and fit Model

model = RandomForestClassifier(n_estimators = 100)
model.fit(X_train, y_train)


In [None]:
# Make predictions

predictions = model.predict(X_test)

In [None]:
#Mean absolute error (for 0/1 labels this equals the misclassification rate)

mae = mean_absolute_error(y_test, predictions)
print("Mean Absolute Error with Random Forest classifier:", mae)

In [None]:
#Accuracy

accuracy = sklearn.metrics.accuracy_score(y_test,predictions)
print('Accuracy for Random Forest classifier model is - ', accuracy)
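
For 0/1 labels, the mean absolute error and the accuracy carry the same information: each wrong prediction contributes an absolute error of exactly 1, so MAE = 1 - accuracy. The sketch below checks this on hypothetical labels (`y_true` and `y_pred` are made up for illustration).

```python
import numpy as np

# Hypothetical true labels and predictions (0 = no purchase, 1 = purchase).
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 0, 1, 1, 0])

# For 0/1 labels, |y_true - y_pred| is 1 exactly when the prediction is wrong.
mae_toy = np.mean(np.abs(y_true - y_pred))
accuracy_toy = np.mean(y_true == y_pred)

print(mae_toy, accuracy_toy)
```

Since the two numbers always sum to 1 here, reporting the accuracy alone is usually enough for a binary classifier.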

In [None]:
#Precision Score (default binary average: precision for the positive label 1)

precision = sklearn.metrics.precision_score(y_test, predictions)
print("Precision score for Random Forest Classifier is ", precision)

In [None]:
#Classification Report

clf_report = classification_report(y_test, predictions)
print(clf_report)

In [None]:
#Confusion Matrix

cm = confusion_matrix(y_test, model.predict(X_test), labels=model.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=model.classes_)
disp.plot()

**<span style="color:#E8A0BF;">Great work! We got a precision of 93% for label 0 and 87% for label 1. We also achieved an accuracy of 91%.</span>**
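
The per-label precision values in the classification report can be read straight off the confusion matrix: for each label, divide the diagonal entry by the sum of its column. Here is a minimal sketch with a hypothetical matrix (`cm_toy` is made up and is not the matrix plotted above).

```python
import numpy as np

# Hypothetical confusion matrix in scikit-learn's layout:
# rows are true labels, columns are predicted labels.
cm_toy = np.array([[50, 5],
                   [10, 35]])

# Precision for label k = correct predictions of k / all predictions of k,
# i.e. the diagonal entry divided by its column sum.
precision_per_label = np.diag(cm_toy) / cm_toy.sum(axis=0)

# Accuracy = all correct predictions / all predictions.
accuracy_from_cm = np.trace(cm_toy) / cm_toy.sum()

print(precision_per_label, accuracy_from_cm)
```

Dividing by row sums (`axis=1`) instead would give the recall per label.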

**<span style="color:#FCC5C0;">Let's see if we can improve it.</span>**

## **<span style="color:#C689C6;">Classification using Random Over Sampler</span>**

In [None]:
# Generate a synthetic, imbalanced two-class dataset (about 60% / 40%)
# Note: this overwrites X and y, so the car data is not used from here on

X, y = make_classification(n_classes = 2, class_sep = 2, weights = [0.598, 0.402], 
                           n_informative = 3, n_redundant = 1, flip_y = 0, n_features = 20,
                          n_clusters_per_class = 1, n_samples = 1000, random_state = 10)

print('Original dataset shape %s' % Counter(y))

In [None]:
ros = RandomOverSampler(random_state = 42)
X_res, y_res = ros.fit_resample(X, y)

print('Resampled dataset shape %s' % Counter(y_res))

In [None]:
# Train test split (done after oversampling, so duplicated rows can land in both sets)

X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, random_state = 42, test_size = 0.2, stratify = y_res)

In [None]:
# Define and Fit model

model_1 = RandomForestClassifier() 
model_1.fit(X_train, y_train)

In [None]:
#Make predictions

predictions_1 = model_1.predict(X_test)

In [None]:
#Classification Report

clf_report = classification_report(y_test, predictions_1)
print(clf_report)

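**We achieved an accuracy of 99% using RandomOverSampler with a Random Forest classifier. This figure is measured on the synthetic, oversampled data, so it is not directly comparable with the 91% obtained on the car data above.**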
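
Oversampling before the train/test split can copy the same row into both sets, which tends to inflate the test score. A common alternative is to split first and oversample only the training part. The sketch below shows that order under a few assumptions: it reuses the synthetic data settings from this section, replaces `RandomOverSampler` with a hand-written NumPy stand-in so that it has no extra dependency, and every name ending in `_demo`, `_tr`, `_te` or `_res` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Same synthetic setup as in this section.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=3, n_redundant=1,
    n_clusters_per_class=1, weights=[0.598, 0.402], flip_y=0,
    class_sep=2, random_state=10,
)

# 1. Split first, so the test set contains only original, unseen rows.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# 2. Randomly oversample the training set only (NumPy stand-in for
#    RandomOverSampler): draw extra rows with replacement until every
#    class is as large as the biggest one.
rng = np.random.default_rng(42)
labels, counts = np.unique(y_tr, return_counts=True)
target = counts.max()
index_parts = []
for label, count in zip(labels, counts):
    idx = np.flatnonzero(y_tr == label)
    extra = rng.choice(idx, size=target - count, replace=True)
    index_parts.append(np.concatenate([idx, extra]))
res_idx = np.concatenate(index_parts)
X_tr_res, y_tr_res = X_tr[res_idx], y_tr[res_idx]

# 3. Fit on the balanced training data, evaluate on the untouched test set.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_tr_res, y_tr_res)
test_accuracy = accuracy_score(y_te, clf.predict(X_te))
print("Test accuracy without leakage:", test_accuracy)
```

With this order the test set keeps the original class ratio, so the score reflects how the model would behave on new, imbalanced data.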

In [None]:
#Confusion Matrix

cm = confusion_matrix(y_test, model_1.predict(X_test), labels=model_1.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=model_1.classes_)
disp.plot()

## **<span style="color:#C689C6;">Classification using XGBoost</span>**

In [None]:
from xgboost import XGBClassifier

In [None]:
# Define Model (binary target, so log loss is the matching evaluation metric)

model_2 = XGBClassifier(eval_metric='logloss')

In [None]:
# Fit model

model_2.fit(X_train, y_train)

In [None]:
# Make predictions

predictions_2 = model_2.predict(X_test)

In [None]:
#Classification Report

clf_report = classification_report(y_test, predictions_2)
print(clf_report)

**We obtained an accuracy of 98% with XGBoost on the same synthetic, oversampled data.**

In [None]:
#Confusion Matrix

cm = confusion_matrix(y_test, model_2.predict(X_test), labels=model_2.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=model_2.classes_)
disp.plot()

# **<span style="color:#C689C6;">🤘 Conclusion</span>**
**<span style="color:#C689C6;">The best accuracy (99%) is obtained by the Random Forest classifier trained on data from RandomOverSampler, but oversampling can sometimes lead to overfitting. The results of XGBoost (98%) are also satisfactory, so we can bank on XGBoost for classification as well.</span>**

> This marks the end of 🚗Cars - Purchase Decision Model

> Stay tuned for more.

> Please share your feedback and suggestions and help me improve 😇