# **Insurance Cross-Sell Prediction**

### This is an exploratory data analysis of a Kaggle dataset. It can be found here:

https://www.kaggle.com/anmolkumar/health-insurance-cross-sell-prediction

### We are given a training dataset of health insurance customers and we are charged with predicting which customers in the test dataset will be interested in purchasing extra vehicle insurance.

# Imports

In [None]:
# Installing the HDBSCAN library

# !pip install hdbscan

In [None]:
# Importing all the libraries we need or may need

import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score
from sklearn.metrics import classification_report
from sklearn.cluster import KMeans
from xgboost import XGBClassifier
from keras.models import Sequential
from keras.layers import Dense
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

import pickle

# Beginning, Define Path and DataFrame

### Here we load our training dataset and do some basic exploration to see what kind of data we are dealing with. Each customer has an ID number, so we read that column in as the index and then reset it to a plain zero-based index.

In [None]:
# This path points to a Google Drive folder, as we are running this notebook in Google Colab

path_train = '/content/drive/My Drive/ai_project_2021/train.csv'

In [None]:
df = pd.read_csv(path_train, index_col = 'id')

In [None]:
df = df.reset_index(drop=True)
df

Unnamed: 0,Gender,Age,Driving_License,Region_Code,Previously_Insured,Vehicle_Age,Vehicle_Damage,Annual_Premium,Policy_Sales_Channel,Vintage,Response
0,Male,44,1,28.0,0,> 2 Years,Yes,40454.0,26.0,217,1
1,Male,76,1,3.0,0,1-2 Year,No,33536.0,26.0,183,0
2,Male,47,1,28.0,0,> 2 Years,Yes,38294.0,26.0,27,1
3,Male,21,1,11.0,1,< 1 Year,No,28619.0,152.0,203,0
4,Female,29,1,41.0,1,< 1 Year,No,27496.0,152.0,39,0
...,...,...,...,...,...,...,...,...,...,...,...
381104,Male,74,1,26.0,1,1-2 Year,No,30170.0,26.0,88,0
381105,Male,30,1,37.0,1,< 1 Year,No,40016.0,152.0,131,0
381106,Male,21,1,30.0,1,< 1 Year,No,35118.0,160.0,161,0
381107,Female,68,1,14.0,0,> 2 Years,Yes,44617.0,124.0,74,0


### Looking at the value counts for the 'Response' label column, we can see that we are dealing with a strongly imbalanced dataset. Checking the data types for each feature, we can see that some are string objects and should be converted to ints or floats for us to do any proper machine learning.

In [None]:
df.iloc[:,-1].value_counts()

0    334399
1     46710
Name: Response, dtype: int64
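
To put a number on the imbalance, the counts printed above can be turned into a positive rate and a negative-to-positive ratio. This is a small sketch using only those two counts; the variable names are ours and not part of the notebook.

```python
# Class counts copied from the value_counts() output above
n_negative = 334399
n_positive = 46710
n_total = n_negative + n_positive

# Share of positive responses, in percent
positive_rate = n_positive / n_total * 100

# How many negative samples there are for each positive one
imbalance_ratio = n_negative / n_positive

print(f"Positive rate: {positive_rate:.2f} %")
print(f"Negatives per positive: {imbalance_ratio:.2f}")
```

In other words, roughly one customer in eight gave a positive response.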

In [None]:
df.dtypes

Gender                   object
Age                       int64
Driving_License           int64
Region_Code             float64
Previously_Insured        int64
Vehicle_Age              object
Vehicle_Damage           object
Annual_Premium          float64
Policy_Sales_Channel    float64
Vintage                   int64
Response                  int64
dtype: object

### Since the dataset is very large and we can't examine each row individually, it is good to do a quick check and see how many unique string objects are in each of the features with string objects.

In [None]:
print(f"Number of unique elements in column 'Vehicle_Age' is : {df['Vehicle_Age'].nunique(dropna=False)}")
print('')
print(f"Unique elements in column 'Vehicle_Age' are : {df['Vehicle_Age'].unique()}")

Number of unique elements in column 'Vehicle_Age' is : 3

Unique elements in column 'Vehicle_Age' are : ['> 2 Years' '1-2 Year' '< 1 Year']


In [None]:
print(f"Number of unique elements in column 'Gender' is : {df['Gender'].nunique(dropna=False)}")
print('')
print(f"Unique elements in column 'Gender' are : {df['Gender'].unique()}")

Number of unique elements in column 'Gender' is : 2

Unique elements in column 'Gender' are : ['Male' 'Female']


In [None]:
print(f"Number of unique elements in column 'Vehicle_Damage' is : {df['Vehicle_Damage'].nunique(dropna=False)}")
print('')
print(f"Unique elements in column 'Vehicle_Damage' are : {df['Vehicle_Damage'].unique()}")

Number of unique elements in column 'Vehicle_Damage' is : 2

Unique elements in column 'Vehicle_Damage' are : ['Yes' 'No']


### The 'Policy_Sales_Channel' feature has no numerical meaning; it consists of codes representing the channel used to contact a given customer. It's good to know how many different channels are represented here. We see that there are 155 different channels.

In [None]:
print(f"Number of unique elements in column 'Policy_Sales_Channel' is : {df['Policy_Sales_Channel'].nunique(dropna=False)}")

Number of unique elements in column 'Policy_Sales_Channel' is : 155


In [None]:
pd.DataFrame(df.groupby('Policy_Sales_Channel').count().iloc[:,-1])

Policy_Sales_Channel,Response
1.0,1074
2.0,4
3.0,523
4.0,509
6.0,3
...,...
157.0,6684
158.0,492
159.0,51
160.0,21779


In [None]:
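# Checking to see how many channels there are with very few positive responses.
# Note: the counts come from rows with Response == 1 only, so a channel with no positive responses never appears and the count for 0 is always 0.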

for i in range(5):
  print(f"Policy Sales Channels with {i} positive responses : {len(np.where(df.loc[(df['Response'] == 1)]['Policy_Sales_Channel'].value_counts() == i)[0])}")

Policy Sales Channels with 0 positive responses : 0
Policy Sales Channels with 1 positive responses : 21
Policy Sales Channels with 2 positive responses : 10
Policy Sales Channels with 3 positive responses : 7
Policy Sales Channels with 4 positive responses : 5


In [None]:
print(f"Number of unique elements in column 'Response' is : {df['Response'].nunique(dropna=False)}")
print('')
print(f"Unique elements in column 'Response' are : {df['Response'].unique()}")

Number of unique elements in column 'Response' is : 2

Unique elements in column 'Response' are : [1 0]


In [None]:
# And one last check to see how many people without a driver's license in the training set gave a positive response (Response == 1). It's a very small number.

len(df.loc[(df['Driving_License'] == 0) & (df['Response'] == 1)])

41

# Converting Categorical Variables and defining training and test sets

### Here we have taken binary features and converted them to 0/1 binary, and for now we have set 'Vehicle_Age' between 0 and 1, with 0.5 representing the middle value of the three values present.

In [None]:
cleanup_categories = {'Gender': {'Male': 0, 'Female': 1},
                      'Vehicle_Age': {'< 1 Year': 0, '1-2 Year': 0.5, '> 2 Years': 1},
                      'Vehicle_Damage': {'Yes': 1, 'No': 0}}

In [None]:
df = df.replace(cleanup_categories)
df.head()

Unnamed: 0,Gender,Age,Driving_License,Region_Code,Previously_Insured,Vehicle_Age,Vehicle_Damage,Annual_Premium,Policy_Sales_Channel,Vintage,Response
0,0,44,1,28.0,0,1.0,1,40454.0,26.0,217,1
1,0,76,1,3.0,0,0.5,0,33536.0,26.0,183,0
2,0,47,1,28.0,0,1.0,1,38294.0,26.0,27,1
3,0,21,1,11.0,1,0.0,0,28619.0,152.0,203,0
4,1,29,1,41.0,1,0.0,0,27496.0,152.0,39,0


### Here we divide our dataset into a training set and a test set for validation. Kaggle already provides this data as its training set, so what we create here is in fact a train/test split of that training set.

In [None]:
# define dataset
X = df.iloc[:,:df.shape[1]-1]
y = df.iloc[:,df.shape[1]-1]

# setting up testing and training sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=7)
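
As a quick sanity check on the split sizes: with `test_size=0.15`, scikit-learn rounds the test set size up, so the 381109 rows (334399 + 46710) should split as computed here. This sketch only does the arithmetic; it assumes the rounding rule just described and does not call `train_test_split` itself.

```python
import math

n_samples = 334399 + 46710  # total rows, from the two class counts
test_size = 0.15

# Assumed rounding rule: the test set size is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

The test size obtained this way agrees with the total `support` reported for `X_test` in the classification reports.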

In [None]:
X_train

Unnamed: 0,Gender,Age,Driving_License,Region_Code,Previously_Insured,Vehicle_Age,Vehicle_Damage,Annual_Premium,Policy_Sales_Channel,Vintage
188538,0,22,1,8.0,1,0.0,0,40073.0,160.0,101
235132,1,22,1,17.0,1,0.0,0,30658.0,152.0,77
128465,0,33,1,49.0,1,0.0,0,30570.0,152.0,80
160565,0,22,1,47.0,1,0.0,0,2630.0,160.0,29
48924,1,28,1,28.0,0,0.0,1,2630.0,158.0,288
...,...,...,...,...,...,...,...,...,...,...
235075,1,26,1,33.0,1,0.0,0,27397.0,152.0,279
10742,0,50,1,48.0,1,0.5,0,2630.0,124.0,212
49689,1,32,1,2.0,0,0.0,1,2630.0,160.0,166
189636,1,21,1,27.0,1,0.0,0,27490.0,152.0,206


# Dummy Classifier

### Here we create a dummy classifier which will simply take the most common classification as a prediction for the entire dataset. One of the problems with imbalanced data is that it's very easy to get a high accuracy score with a completely worthless model. This dummy classifier will give us an idea of the absolute minimum to compare against. Because our dataset contains around 87-88% of negative classifications, we can see that the dummy classifier has a great accuracy score with essentially no effort at all. However, the recall and precision scores for positive classification are zero.

### As a reminder, precision is the number of true positives divided by the number of all predicted positives. Recall is the number of true positives divided by the sum of true positives and false negatives.

In [None]:
# DummyClassifier to predict only target 0
dummy = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
predictions = dummy.predict(X_test)

In [None]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.88      1.00      0.94     50220
           1       0.00      0.00      0.00      6947

    accuracy                           0.88     57167
   macro avg       0.44      0.50      0.47     57167
weighted avg       0.77      0.88      0.82     57167
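
The numbers in this report can be reproduced by hand from the four confusion counts. The sketch below uses the support values above (50220 negatives, 6947 positives) and the fact that the dummy model predicts 0 for every row. The helper name `safe_divide` is ours, and returning 0 for an undefined ratio is an assumption chosen to match what the report displays.

```python
def safe_divide(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the ratio is undefined."""
    return numerator / denominator if denominator else 0.0

# The dummy model predicts 0 everywhere, so it never produces a positive
true_negatives = 50220
false_negatives = 6947
true_positives = 0
false_positives = 0

total = true_negatives + false_negatives + true_positives + false_positives
accuracy = (true_positives + true_negatives) / total

# Metrics for the positive class (label 1)
precision_pos = safe_divide(true_positives, true_positives + false_positives)
recall_pos = safe_divide(true_positives, true_positives + false_negatives)

# Metrics for the negative class (label 0)
precision_neg = safe_divide(true_negatives, true_negatives + false_negatives)
recall_neg = safe_divide(true_negatives, true_negatives + false_positives)

print(f"accuracy={accuracy:.2f}")
print(f"class 1: precision={precision_pos:.2f} recall={recall_pos:.2f}")
print(f"class 0: precision={precision_neg:.2f} recall={recall_neg:.2f}")
```

The high accuracy comes entirely from the negative class, which is why precision and recall for class 1 are the numbers to watch on this dataset.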



# First Classification Tests

### Here we set up a few basic classification tests for our dataset. The models are K-Means with 2 centroids and with 10 centroids, XGBoost, and a basic neural network for binary classification.

In [None]:
# k-means k=2 clustering

# define the model
model = KMeans(n_clusters=2)
# fit the model
model.fit(X)
# assign a cluster to each example
predictions = model.predict(X)
print(classification_report(y, predictions))

              precision    recall  f1-score   support

           0       0.87      0.19      0.31    334399
           1       0.12      0.81      0.21     46710

    accuracy                           0.26    381109
   macro avg       0.50      0.50      0.26    381109
weighted avg       0.78      0.26      0.30    381109
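
One caveat when reading a classification report for K-Means: cluster ids are arbitrary, so cluster 1 need not correspond to class 1. A minimal sketch of one common workaround, mapping each cluster to the majority label of its members, is shown here on toy data. The function name and the toy lists are hypothetical and not part of the notebook.

```python
from collections import Counter

def map_clusters_to_majority_label(cluster_ids, labels):
    """Map each cluster id to the most common true label among its members."""
    votes = {}
    for cluster, label in zip(cluster_ids, labels):
        votes.setdefault(cluster, Counter())[label] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}

# Toy data: the grouping is perfect, but the clusters are numbered "backwards"
cluster_ids = [1, 1, 1, 0, 0, 1]
labels = [0, 0, 0, 1, 1, 0]

# Comparing raw cluster ids with labels makes the grouping look useless
raw_accuracy = sum(c == t for c, t in zip(cluster_ids, labels)) / len(labels)

# After mapping clusters to their majority label, the same grouping scores perfectly
mapping = map_clusters_to_majority_label(cluster_ids, labels)
mapped = [mapping[c] for c in cluster_ids]
mapped_accuracy = sum(p == t for p, t in zip(mapped, labels)) / len(labels)

print(raw_accuracy, mapped_accuracy)
```

On the real data, both clusters in the report above are still mostly negatives, so a majority mapping would assign label 0 to each of them.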



In [None]:
# XGBoost model

# fit model to training data
model = XGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.88      1.00      0.94     50220
           1       0.00      0.00      0.00      6947

    accuracy                           0.88     57167
   macro avg       0.44      0.50      0.47     57167
weighted avg       0.77      0.88      0.82     57167



In [None]:
# k-means k=10 clustering

# define the model
model = KMeans(n_clusters=10)
# fit the model
model.fit(X)
# assign a cluster to each example
predictions = model.predict(X)

In [None]:
# Checking to see the density of positive labels in each of the clusters

for j in np.unique(predictions):
  in_cluster = predictions == j  # boolean mask of the samples assigned to cluster j
  print(f"Occurrence of positives in cluster {j} : {round(y[in_cluster].mean() * 100, 2)} %   of {in_cluster.sum()} samples")

print('\n')

print(f"Occurrence of positives in data : { round (list(y).count(1) / len(y) * 100, 2)} %")

Occurrence of positives in cluster 0 : 10.78 %   of 89364 samples
Occurrence of positives in cluster 1 : 13.1 %   of 64968 samples
Occurrence of positives in cluster 2 : 14.98 %   of 10540 samples
Occurrence of positives in cluster 3 : 15.14 %   of 28268 samples
Occurrence of positives in cluster 4 : 15.31 %   of 2083 samples
Occurrence of positives in cluster 5 : 16.23 %   of 154 samples
Occurrence of positives in cluster 6 : 14.15 %   of 52359 samples
Occurrence of positives in cluster 7 : 18.57 %   of 70 samples
Occurrence of positives in cluster 8 : 9.4 %   of 54495 samples
Occurrence of positives in cluster 9 : 12.45 %   of 78808 samples


Occurrence of positives in data : 12.26 %


In [None]:
# first neural network with Keras: fit, then make predictions

# define the keras model
model = Sequential()
model.add(Dense(12, input_dim=df.shape[1]-1, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam')
# fit the keras model on the dataset
model.fit(X, y, epochs=2, batch_size=32, verbose=1)
# make class predictions with the model
predictions = (model.predict(X) > 0.5).astype("int32")
print(classification_report(y, predictions))

Epoch 1/2
Epoch 2/2
              precision    recall  f1-score   support

           0       0.88      1.00      0.93    334399
           1       0.34      0.01      0.01     46710

    accuracy                           0.88    381109
   macro avg       0.61      0.50      0.47    381109
weighted avg       0.81      0.88      0.82    381109



# MinMax Scaling and Rerunning Tests

### The next step in our exploration is to re-run the same tests with our algorithms, this time applying feature scaling with MinMaxScaler.

In [None]:
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df))
df_scaled

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,0.0,0.369231,1.0,0.538462,0.0,1.0,1.0,0.070366,0.154321,0.716263,1.0
1,0.0,0.861538,1.0,0.057692,0.0,0.5,0.0,0.057496,0.154321,0.598616,0.0
2,0.0,0.415385,1.0,0.538462,0.0,1.0,1.0,0.066347,0.154321,0.058824,1.0
3,0.0,0.015385,1.0,0.211538,1.0,0.0,0.0,0.048348,0.932099,0.667820,0.0
4,1.0,0.138462,1.0,0.788462,1.0,0.0,0.0,0.046259,0.932099,0.100346,0.0
...,...,...,...,...,...,...,...,...,...,...,...
381104,0.0,0.830769,1.0,0.500000,1.0,0.5,0.0,0.051234,0.154321,0.269896,0.0
381105,0.0,0.153846,1.0,0.711538,1.0,0.0,0.0,0.069551,0.932099,0.418685,0.0
381106,0.0,0.015385,1.0,0.576923,1.0,0.0,0.0,0.060439,0.981481,0.522491,0.0
381107,1.0,0.738462,1.0,0.269231,0.0,1.0,1.0,0.078110,0.759259,0.221453,0.0
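
For reference, min-max scaling maps each column to the range [0, 1] with (x - min) / (max - min). The sketch below applies that formula by hand to the Age column. The minimum of 20 and maximum of 85 are assumptions inferred from the scaled values in the table above, not values printed by the notebook.

```python
def min_max_scale(value, col_min, col_max):
    """Scale a single value to [0, 1] given the column's minimum and maximum."""
    return (value - col_min) / (col_max - col_min)

# Assumed range of the Age column (inferred, see the note above)
AGE_MIN, AGE_MAX = 20, 85

# Ages of the first four customers in the original table
ages = [44, 76, 47, 21]
scaled_ages = [round(min_max_scale(a, AGE_MIN, AGE_MAX), 6) for a in ages]
print(scaled_ages)
```

These values line up with the Age column of the scaled table. Note that the scaler here is fitted on the whole DataFrame, including 'Response', which is why the label also appears as 0.0 and 1.0 in the last column.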


In [None]:
# define dataset
X = df_scaled.iloc[:,:df_scaled.shape[1]-1]
y = df.iloc[:,-1]

# setting up testing and training sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=7)

In [None]:
# k-means k=2 clustering

# define the model
model = KMeans(n_clusters=2)
# fit the model
model.fit(X)
# assign a cluster to each example
predictions = model.predict(X)
print(classification_report(y, predictions))

              precision    recall  f1-score   support

           0       0.77      0.46      0.57    334399
           1       0.00      0.01      0.01     46710

    accuracy                           0.40    381109
   macro avg       0.39      0.23      0.29    381109
weighted avg       0.67      0.40      0.50    381109



In [None]:
# XGBoost model

# fit model to training data
model = XGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data
predictions = model.predict(X_test)

In [None]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.88      1.00      0.94     50220
           1       0.00      0.00      0.00      6947

    accuracy                           0.88     57167
   macro avg       0.44      0.50      0.47     57167
weighted avg       0.77      0.88      0.82     57167



In [None]:
# k-means k=10 clustering

model = KMeans(n_clusters=10)
model.fit(X)
predictions = model.predict(X)

In [None]:
for j in np.unique(predictions):
  in_cluster = predictions == j  # boolean mask of the samples assigned to cluster j
  print(f"Occurrence of positives in cluster {j} : {round(y[in_cluster].mean() * 100, 2)} %   of {in_cluster.sum()} samples")

print('\n')

print(f"Occurrence of positives in data : { round (list(y).count(1) / len(y) * 100, 2)} %")

Occurrence of positives in cluster 0 : 26.38 %   of 30168 samples
Occurrence of positives in cluster 1 : 0.08 %   of 48470 samples
Occurrence of positives in cluster 2 : 27.83 %   of 41719 samples
Occurrence of positives in cluster 3 : 0.02 %   of 58793 samples
Occurrence of positives in cluster 4 : 23.64 %   of 69096 samples
Occurrence of positives in cluster 5 : 0.07 %   of 25155 samples
Occurrence of positives in cluster 6 : 27.6 %   of 25663 samples
Occurrence of positives in cluster 7 : 3.79 %   of 23990 samples
Occurrence of positives in cluster 8 : 0.06 %   of 34223 samples
Occurrence of positives in cluster 9 : 11.44 %   of 23832 samples


Occurrence of positives in data : 12.26 %


# Reloading CSV File to Run Tests with One-Hot Encoding and Defining Dataset 

### Because the feature scaling seemed interesting, we'll continue using MinMaxScaler, but we will first use one-hot encoding for our dataset. One-hot encoding is a way of turning categorical features with more than two categories into binary features, where a new feature is added to the dataset that represents each category in the original feature. For example, in this dataset, 'Policy_Sales_Channel' contains 155 different values each represented by a number, but the numbers have nothing to do with each other, and one being higher than another means nothing. This is the perfect situation in which to use one-hot encoding.
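
To make the idea concrete, here is a minimal hand-rolled sketch of one-hot encoding on four made-up channel codes. The notebook itself relies on `pd.get_dummies` for this; the `one_hot` helper and the sample values are purely illustrative.

```python
def one_hot(values, prefix):
    """Return one 0/1 column per distinct value, keyed as '<prefix>_<value>'."""
    categories = sorted(set(values))
    return {f"{prefix}_{c}": [int(v == c) for v in values] for c in categories}

# Four hypothetical customers and the channel code each one was contacted through
channels = [26.0, 26.0, 152.0, 160.0]
encoded = one_hot(channels, prefix="Pol")

for column, flags in encoded.items():
    print(column, flags)
```

Each row ends up with exactly one 1 across the new columns, so no artificial ordering between channel codes is left for a model to pick up.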

In [None]:
path_train = '/content/drive/My Drive/ai_project_2021/train.csv'
df = pd.read_csv(path_train, index_col = 'id')

In [None]:
df = df.reset_index(drop=True)
df.head()

Unnamed: 0,Gender,Age,Driving_License,Region_Code,Previously_Insured,Vehicle_Age,Vehicle_Damage,Annual_Premium,Policy_Sales_Channel,Vintage,Response
0,Male,44,1,28.0,0,> 2 Years,Yes,40454.0,26.0,217,1
1,Male,76,1,3.0,0,1-2 Year,No,33536.0,26.0,183,0
2,Male,47,1,28.0,0,> 2 Years,Yes,38294.0,26.0,27,1
3,Male,21,1,11.0,1,< 1 Year,No,28619.0,152.0,203,0
4,Female,29,1,41.0,1,< 1 Year,No,27496.0,152.0,39,0


In [None]:
one_hot_policy = pd.get_dummies(df['Policy_Sales_Channel'], prefix='Pol', prefix_sep='_ ')
one_hot_region = pd.get_dummies(df['Region_Code'], prefix='Reg', prefix_sep='_ ')

In [None]:
df.drop(['Policy_Sales_Channel', 'Region_Code'], axis = 1, inplace=True)
df

Unnamed: 0,Gender,Age,Driving_License,Previously_Insured,Vehicle_Damage,Annual_Premium,Vintage,Response
0,Male,44,1,0,Yes,40454.0,217,1
1,Male,76,1,0,No,33536.0,183,0
2,Male,47,1,0,Yes,38294.0,27,1
3,Male,21,1,1,No,28619.0,203,0
4,Female,29,1,1,No,27496.0,39,0
...,...,...,...,...,...,...,...,...
381104,Male,74,1,1,No,30170.0,88,0
381105,Male,30,1,1,No,40016.0,131,0
381106,Male,21,1,1,No,35118.0,161,0
381107,Female,68,1,0,Yes,44617.0,74,0


In [None]:
df_one_hot = df.drop(['Response'], axis=1).join(one_hot_region).join(one_hot_policy).join(df.iloc[:,-1])
df_one_hot

Unnamed: 0,Gender,Age,Driving_License,Previously_Insured,Vehicle_Damage,Annual_Premium,Vintage,Reg_ 0.0,Reg_ 1.0,Reg_ 2.0,Reg_ 3.0,Reg_ 4.0,Reg_ 5.0,Reg_ 6.0,Reg_ 7.0,Reg_ 8.0,Reg_ 9.0,Reg_ 10.0,Reg_ 11.0,Reg_ 12.0,Reg_ 13.0,Reg_ 14.0,Reg_ 15.0,Reg_ 16.0,Reg_ 17.0,Reg_ 18.0,Reg_ 19.0,Reg_ 20.0,Reg_ 21.0,Reg_ 22.0,Reg_ 23.0,Reg_ 24.0,Reg_ 25.0,Reg_ 26.0,Reg_ 27.0,Reg_ 28.0,Reg_ 29.0,Reg_ 30.0,Reg_ 31.0,Reg_ 32.0,...,Pol_ 121.0,Pol_ 122.0,Pol_ 123.0,Pol_ 124.0,Pol_ 125.0,Pol_ 126.0,Pol_ 127.0,Pol_ 128.0,Pol_ 129.0,Pol_ 130.0,Pol_ 131.0,Pol_ 132.0,Pol_ 133.0,Pol_ 134.0,Pol_ 135.0,Pol_ 136.0,Pol_ 137.0,Pol_ 138.0,Pol_ 139.0,Pol_ 140.0,Pol_ 143.0,Pol_ 144.0,Pol_ 145.0,Pol_ 146.0,Pol_ 147.0,Pol_ 148.0,Pol_ 149.0,Pol_ 150.0,Pol_ 151.0,Pol_ 152.0,Pol_ 153.0,Pol_ 154.0,Pol_ 155.0,Pol_ 156.0,Pol_ 157.0,Pol_ 158.0,Pol_ 159.0,Pol_ 160.0,Pol_ 163.0,Response
0,Male,44,1,0,Yes,40454.0,217,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,Male,76,1,0,No,33536.0,183,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Male,47,1,0,Yes,38294.0,27,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,Male,21,1,1,No,28619.0,203,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
4,Female,29,1,1,No,27496.0,39,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
381104,Male,74,1,1,No,30170.0,88,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
381105,Male,30,1,1,No,40016.0,131,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
381106,Male,21,1,1,No,35118.0,161,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
381107,Female,68,1,0,Yes,44617.0,74,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
cleanup_categories = {'Gender': {'Male': 0, 'Female': 1},
                      'Vehicle_Damage': {'Yes': 1, 'No': 0},
                      'Vehicle_Age': {'> 2 Years': 1, '< 1 Year': 0, '1-2 Year': 0.5}}

In [None]:
df_one_hot = df_one_hot.replace(cleanup_categories)
df_one_hot.head(10)

Unnamed: 0,Gender,Age,Driving_License,Previously_Insured,Vehicle_Damage,Annual_Premium,Vintage,Reg_ 0.0,Reg_ 1.0,Reg_ 2.0,Reg_ 3.0,Reg_ 4.0,Reg_ 5.0,Reg_ 6.0,Reg_ 7.0,Reg_ 8.0,Reg_ 9.0,Reg_ 10.0,Reg_ 11.0,Reg_ 12.0,Reg_ 13.0,Reg_ 14.0,Reg_ 15.0,Reg_ 16.0,Reg_ 17.0,Reg_ 18.0,Reg_ 19.0,Reg_ 20.0,Reg_ 21.0,Reg_ 22.0,Reg_ 23.0,Reg_ 24.0,Reg_ 25.0,Reg_ 26.0,Reg_ 27.0,Reg_ 28.0,Reg_ 29.0,Reg_ 30.0,Reg_ 31.0,Reg_ 32.0,...,Pol_ 121.0,Pol_ 122.0,Pol_ 123.0,Pol_ 124.0,Pol_ 125.0,Pol_ 126.0,Pol_ 127.0,Pol_ 128.0,Pol_ 129.0,Pol_ 130.0,Pol_ 131.0,Pol_ 132.0,Pol_ 133.0,Pol_ 134.0,Pol_ 135.0,Pol_ 136.0,Pol_ 137.0,Pol_ 138.0,Pol_ 139.0,Pol_ 140.0,Pol_ 143.0,Pol_ 144.0,Pol_ 145.0,Pol_ 146.0,Pol_ 147.0,Pol_ 148.0,Pol_ 149.0,Pol_ 150.0,Pol_ 151.0,Pol_ 152.0,Pol_ 153.0,Pol_ 154.0,Pol_ 155.0,Pol_ 156.0,Pol_ 157.0,Pol_ 158.0,Pol_ 159.0,Pol_ 160.0,Pol_ 163.0,Response
0,0,44,1,0,1,40454.0,217,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,0,76,1,0,0,33536.0,183,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,47,1,0,1,38294.0,27,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,0,21,1,1,0,28619.0,203,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
4,1,29,1,1,0,27496.0,39,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
5,1,24,1,0,1,2630.0,176,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
6,0,23,1,0,1,23367.0,249,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
7,1,56,1,0,1,32031.0,72,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
8,1,24,1,1,0,27619.0,28,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
9,1,32,1,1,0,28771.0,80,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0


In [None]:
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_one_hot))
df_scaled

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215
0,0.0,0.369231,1.0,0.0,1.0,0.070366,0.716263,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.0,0.861538,1.0,0.0,0.0,0.057496,0.598616,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.415385,1.0,0.0,1.0,0.066347,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,0.0,0.015385,1.0,1.0,0.0,0.048348,0.667820,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.138462,1.0,1.0,0.0,0.046259,0.100346,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
381104,0.0,0.830769,1.0,1.0,0.0,0.051234,0.269896,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
381105,0.0,0.153846,1.0,1.0,0.0,0.069551,0.418685,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
381106,0.0,0.015385,1.0,1.0,0.0,0.060439,0.522491,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
381107,1.0,0.738462,1.0,0.0,1.0,0.078110,0.221453,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# define dataset
X = df_scaled.iloc[:,:df_scaled.shape[1]-1]
y = df_scaled.iloc[:,df_scaled.shape[1]-1]

# setting up testing and training sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=7)

In [None]:
# df_scaled.to_csv('/content/drive/My Drive/ai_project_2021/df_scaled.csv')

# Check Basic Models

In [None]:
# k-means k=2 clustering

model = KMeans(n_clusters=2)
model.fit(X)
predictions = model.predict(X)
print(classification_report(y, predictions))

              precision    recall  f1-score   support

         0.0       0.77      0.47      0.58    334399
         1.0       0.00      0.01      0.00     46710

    accuracy                           0.41    381109
   macro avg       0.39      0.24      0.29    381109
weighted avg       0.68      0.41      0.51    381109



In [None]:
# XGBoost model

model = XGBClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

         0.0       0.88      1.00      0.94     50220
         1.0       0.00      0.00      0.00      6947

    accuracy                           0.88     57167
   macro avg       0.44      0.50      0.47     57167
weighted avg       0.77      0.88      0.82     57167



In [None]:
# k-means k=10 clustering

# define the model
model = KMeans(n_clusters=10)
# fit the model
model.fit(X)
# assign a cluster to each example
predictions = model.predict(X)

### Looking in these clusters to see the percentage of positive classifications contained in each, we see that some clusters have a very low percentage while others have a percentage higher than the base percentage of positives in the dataset. This gives us an idea: perhaps it would be possible to use a neural network or XGBoost to separate our dataset into high-density and low-density classes.

In [None]:
for j in np.unique(predictions):
  in_cluster = predictions == j  # boolean mask of the samples assigned to cluster j
  print(f"Occurrence of positives in cluster {j} : {round(y[in_cluster].mean() * 100, 2)} %   of {in_cluster.sum()} samples")

print('\n')

print(f"Occurrence of positives in data : { round (list(y).count(1) / len(y) * 100, 2)} %")

Occurrence of positives in cluster 0 : 23.87 %   of 44467 samples
Occurrence of positives in cluster 1 : 0.22 %   of 60971 samples
Occurrence of positives in cluster 2 : 0.08 %   of 58264 samples
Occurrence of positives in cluster 3 : 0.3 %   of 45657 samples
Occurrence of positives in cluster 4 : 24.3 %   of 41751 samples
Occurrence of positives in cluster 5 : 24.58 %   of 28689 samples
Occurrence of positives in cluster 6 : 25.5 %   of 22500 samples
Occurrence of positives in cluster 7 : 1.99 %   of 21418 samples
Occurrence of positives in cluster 8 : 30.18 %   of 29255 samples
Occurrence of positives in cluster 9 : 12.74 %   of 28137 samples


Occurrence of positives in data : 12.26 %
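
As a rough illustration of the high-density versus low-density idea, the clusters from the run printed above can be split into two groups by comparing each cluster's positive rate with the 12.26 % base rate. The numbers are copied from that output; since `KMeans` is not seeded here, another run would give different clusters. The variable names are ours.

```python
# cluster id -> (positive rate in %, number of samples), copied from the output above
cluster_stats = {
    0: (23.87, 44467), 1: (0.22, 60971), 2: (0.08, 58264), 3: (0.30, 45657),
    4: (24.30, 41751), 5: (24.58, 28689), 6: (25.50, 22500), 7: (1.99, 21418),
    8: (30.18, 29255), 9: (12.74, 28137),
}
base_rate = 12.26  # overall share of positives in the data, in percent

# A cluster counts as high-density when its positive rate exceeds the base rate
high_density = sorted(c for c, (rate, _) in cluster_stats.items() if rate > base_rate)
low_density = sorted(c for c, (rate, _) in cluster_stats.items() if rate <= base_rate)

# Share of all samples that sit in low-density clusters
total_samples = sum(size for _, size in cluster_stats.values())
low_samples = sum(cluster_stats[c][1] for c in low_density)
low_share = low_samples / total_samples * 100

print(high_density, low_density, round(low_share, 1))
```

In this run, a little under half of the samples fall into clusters with at most about 2 % positives.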


In [None]:
# define the keras model
# We fit on all of X rather than only X_train because we're just curious whether the model is interesting at all; X_test is therefore also part of the fitting data

model = Sequential()
model.add(Dense(12, input_dim=df_scaled.shape[1]-1, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')
model.fit(X, y, epochs=2, batch_size=32, verbose=1)
predictions = (model.predict(X_test) > 0.5).astype("int32")
print(classification_report(y_test, predictions))

Epoch 1/2
Epoch 2/2
              precision    recall  f1-score   support

         0.0       0.88      1.00      0.94     50220
         1.0       0.00      0.00      0.00      6947

    accuracy                           0.88     57167
   macro avg       0.44      0.50      0.47     57167
weighted avg       0.77      0.88      0.82     57167
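
### The figures in the report above appear to match what a model that always predicts class 0 would give. The short check below is only a sketch of that arithmetic: it uses the two support counts from the report and nothing else, and the variable names are hypothetical.

```python
# Support counts copied from the classification report above
n_negative = 50220
n_positive = 6947
n_total = n_negative + n_positive

# A classifier that always predicts class 0:
accuracy = n_negative / n_total       # every negative is right, every positive is wrong
precision_0 = n_negative / n_total    # all predictions are 0, n_negative of them are correct
recall_0 = 1.0                        # no negative is missed
recall_1 = 0.0                        # no positive is ever found
macro_recall = (recall_0 + recall_1) / 2
# class 1 contributes a precision of 0, so only class 0 enters the weighted average
weighted_precision = precision_0 * (n_negative / n_total)

print(round(accuracy, 2), round(precision_0, 2),
      round(macro_recall, 2), round(weighted_precision, 2))
```

### If these rounded values agree with the report, the network has learned nothing beyond the majority class after two epochs.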



# Using K-Means k=15 for Identification of Low-Positive Clusters

In [None]:
# k-means k=15 clustering

# define the model
model = KMeans(n_clusters=15)
# fit the model
model.fit(X)
# assign a cluster to each example
predictions = model.predict(X)

In [None]:
print(model.random_state)

None


In [None]:
clusters = {}

for j in np.unique(predictions):
  positives_in_cluster = [y[i] for i in np.where(predictions == j)[0]].count(1) / len(np.where(predictions == j)[0])*100
  clusters[j] = positives_in_cluster
  print(f"Occurrence of positives in cluster {j} : {round( positives_in_cluster, 2)} %   of {len(np.where(predictions == j)[0])} samples")

print('\n')

print(f"Occurrence of positives in data : { round (list(y).count(1) / len(y) * 100, 2)} %")

Occurrence of positives in cluster 0 : 0.17 %   of 12808 samples
Occurrence of positives in cluster 1 : 24.84 %   of 32934 samples
Occurrence of positives in cluster 2 : 24.58 %   of 28689 samples
Occurrence of positives in cluster 3 : 12.36 %   of 16360 samples
Occurrence of positives in cluster 4 : 0.05 %   of 42149 samples
Occurrence of positives in cluster 5 : 0.05 %   of 21756 samples
Occurrence of positives in cluster 6 : 0.03 %   of 50426 samples
Occurrence of positives in cluster 7 : 17.02 %   of 18922 samples
Occurrence of positives in cluster 8 : 8.93 %   of 19943 samples
Occurrence of positives in cluster 9 : 23.61 %   of 9810 samples
Occurrence of positives in cluster 10 : 20.07 %   of 23339 samples
Occurrence of positives in cluster 11 : 30.18 %   of 29255 samples
Occurrence of positives in cluster 12 : 0.63 %   of 22247 samples
Occurrence of positives in cluster 13 : 27.45 %   of 30487 samples
Occurrence of positives in cluster 14 : 0.19 %   of 21984 samples


Occurrence of positives in data : 12.26 %

In [None]:
# Here we identify the clusters that contain at least 1% positive labels and those that contain less than 1%

positive_clusters = [k for k, v in clusters.items() if v >= 1]
negative_clusters = [k for k, v in clusters.items() if v < 1]

positive_indices = [i for i in range(len(predictions)) if predictions[i] in positive_clusters]
negative_indices = [i for i in range(len(predictions)) if predictions[i] in negative_clusters]

print(df.iloc[positive_indices]['Response'].value_counts())
print(df.iloc[negative_indices]['Response'].value_counts())

0    163284
1     46455
Name: Response, dtype: int64
0    171115
1       255
Name: Response, dtype: int64
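
### To put the two counts above in perspective, the following sketch works out how much of the data the low-density clusters account for and how many positives would be lost by discarding them. It uses only the four numbers printed above; the variable names are hypothetical.

```python
# Counts copied from the two value_counts() outputs above
kept = {"negatives": 163284, "positives": 46455}      # clusters with >= 1 % positives
discarded = {"negatives": 171115, "positives": 255}   # clusters with < 1 % positives

total_rows = sum(kept.values()) + sum(discarded.values())
total_positives = kept["positives"] + discarded["positives"]

share_discarded = sum(discarded.values()) / total_rows       # fraction of rows dropped
positives_retained = kept["positives"] / total_positives     # fraction of positives kept
positive_rate_kept = kept["positives"] / sum(kept.values())  # positive rate after dropping

print(f"rows discarded      : {share_discarded:.2%}")
print(f"positives retained  : {positives_retained:.2%}")
print(f"positive rate (kept): {positive_rate_kept:.2%}")
```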


In [None]:
df_responses = pd.DataFrame(df_scaled.iloc[:,df_scaled.shape[1]-1])

In [None]:
df_scaled = df_scaled.iloc[:,:-1]

### Here we create a new DataFrame whose target feature marks whether each row belongs to the high-density or the low-density clusters from earlier. We test XGBoost and a basic neural network to see if we can separate the dataset in this way.

In [None]:
df_scaled['Clustered'] = 0
df_scaled.loc[ positive_indices, 'Clustered'] = 1
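
### The same binary label can be built in one vectorized step with `np.isin`. This is a minimal sketch, assuming a dictionary of per-cluster positive rates like `clusters` above; the function name, the threshold argument and the toy values are hypothetical.

```python
import numpy as np

def high_density_mask(cluster_ids, cluster_rates, threshold=1.0):
    """True for rows whose cluster has a positive rate >= threshold (in %)."""
    keep = [c for c, rate in cluster_rates.items() if rate >= threshold]
    return np.isin(cluster_ids, keep)

# Toy stand-in values (not the real cluster assignments)
toy_cluster_ids = np.array([0, 1, 2, 1, 0, 2])
toy_rates = {0: 0.17, 1: 24.84, 2: 12.36}

mask = high_density_mask(toy_cluster_ids, toy_rates)
toy_clustered = mask.astype(int)   # plays the role of the 'Clustered' column
print(toy_clustered)
```

### In the notebook this would correspond to `df_scaled['Clustered'] = high_density_mask(predictions, clusters).astype(int)`, assuming the rows of `df_scaled` are in the same order as `predictions`.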

In [None]:
# define dataset
X = df_scaled.iloc[:,:df_scaled.shape[1]-1]
y = df_scaled.iloc[:,df_scaled.shape[1]-1]

# setting up testing and training sets
X_train, X_test, y_train, y_test, df_responses_train, df_responses_test = train_test_split(X, y, df_responses, test_size=0.15, random_state=7)

In [None]:
# k-means k=2 clustering

# define the model
model = KMeans(n_clusters=2)
# fit the model
model.fit(X)
# assign a cluster to each example
predictions = model.predict(X)
print(classification_report(y, predictions))

              precision    recall  f1-score   support

           0       0.94      0.98      0.96    171370
           1       0.98      0.95      0.97    209739

    accuracy                           0.96    381109
   macro avg       0.96      0.96      0.96    381109
weighted avg       0.96      0.96      0.96    381109



In [None]:
# XGBoost model

# fit model on training data
model = XGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     25658
           1       1.00      1.00      1.00     31509

    accuracy                           1.00     57167
   macro avg       1.00      1.00      1.00     57167
weighted avg       1.00      1.00      1.00     57167



In [None]:
# pickle.dump(model, open('/content/drive/My Drive/ai_project_2021/xgboost1.model', 'wb'))

In [None]:
# define the keras model
model = Sequential()
model.add(Dense(12, input_dim=df_scaled.shape[1]-1, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam')
# fit the keras model on the dataset
model.fit(X_train, y_train, epochs=2, batch_size=32, verbose=1)
# make class predictions with the model
predictions = (model.predict(X_test) > 0.5).astype("int32")
print(classification_report(y_test, predictions))

Epoch 1/2
Epoch 2/2
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     25658
           1       1.00      1.00      1.00     31509

    accuracy                           1.00     57167
   macro avg       1.00      1.00      1.00     57167
weighted avg       1.00      1.00      1.00     57167



In [None]:
# checking metrics on prediction of original class based on the k=15 clusters

print(classification_report(df_responses_test, predictions))

              precision    recall  f1-score   support

         0.0       1.00      0.51      0.68     50220
         1.0       0.22      0.99      0.36      6947

    accuracy                           0.57     57167
   macro avg       0.61      0.75      0.52     57167
weighted avg       0.90      0.57      0.64     57167



In [None]:
predictions_clusters = (model.predict(X_train) > 0.5).astype("int32")

In [None]:
# first neural network with keras make predictions

# define the keras model
model = Sequential()
model.add(Dense(12, input_dim=df_scaled.shape[1]-1, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam')
# fit the keras model on the dataset
model.fit(X_train, predictions_clusters, epochs=2, batch_size=32, verbose=1)
# make class predictions with the model
predictions = (model.predict(X_test) > 0.5).astype("int32")

Epoch 1/2
Epoch 2/2


In [None]:
print(classification_report(df_responses_test, predictions))

              precision    recall  f1-score   support

         0.0       1.00      0.51      0.68     50220
         1.0       0.22      0.99      0.36      6947

    accuracy                           0.57     57167
   macro avg       0.61      0.75      0.52     57167
weighted avg       0.90      0.57      0.64     57167



### We see that both XGBoost and the neural network can predict the high- and low-density clusters quite easily. We will thus use such a model to reduce the size of our dataset by discarding the part it assigns to the low-density clusters, and continue further exploration. (The code below takes these predictions from the neural network, `predictions_clusters`.)

# Creating New Dataset for Just the High-Occurrence Clusters

### In this section we create the new DataFrame, keeping only the training rows predicted to belong to the high-density clusters.

In [None]:
positive_indices = [i for i in range(len(predictions_clusters)) if predictions_clusters[i] == 1]

In [None]:
df_smaller = X_train.join(df_responses_train)
df_smaller

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215
188538,0.0,0.030769,1.0,1.0,0.0,0.069657,0.314879,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
235132,1.0,0.030769,1.0,1.0,0.0,0.052142,0.231834,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
128465,0.0,0.200000,1.0,1.0,0.0,0.051978,0.242215,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
160565,0.0,0.030769,1.0,1.0,0.0,0.000000,0.065744,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
48924,1.0,0.123077,1.0,0.0,1.0,0.000000,0.961938,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
235075,1.0,0.092308,1.0,1.0,0.0,0.046075,0.930796,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10742,0.0,0.461538,1.0,1.0,0.0,0.000000,0.698962,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
49689,1.0,0.184615,1.0,0.0,1.0,0.000000,0.539792,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
189636,1.0,0.015385,1.0,1.0,0.0,0.046248,0.678201,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# here we take only the indices corresponding to the high-density clusters

df_smaller = df_smaller.iloc[positive_indices]

In [None]:
df_smaller.iloc[:,-1].value_counts()

0.0    138512
1.0     39536
Name: 215, dtype: int64

In [None]:
# Saving our smaller DataFrame to CSV, but only keeping the 'Response' column because what we really need are the indices, and the full CSV takes up a lot of space

df_smaller.iloc[:,-1].to_csv('/content/drive/My Drive/ai_project_2021/df_smaller1.csv')

In [None]:
df_smaller

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215
48924,1.0,0.123077,1.0,0.0,1.0,0.000000,0.961938,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
325448,1.0,0.261538,1.0,0.0,1.0,0.075370,0.366782,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
253471,0.0,0.784615,1.0,0.0,1.0,0.000000,0.916955,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
51967,1.0,0.184615,1.0,0.0,1.0,0.038044,0.809689,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
44490,1.0,0.892308,1.0,0.0,1.0,0.062857,0.211073,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
126137,1.0,0.630769,1.0,0.0,1.0,0.090645,0.193772,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
361820,0.0,0.000000,1.0,0.0,1.0,0.052862,0.951557,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
328599,0.0,0.538462,1.0,0.0,1.0,0.048300,0.512111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
49689,1.0,0.184615,1.0,0.0,1.0,0.000000,0.539792,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


# Adding another model 

### Here we attempt to add another model using the same methods as before. Unfortunately, it was not fruitful.

In [None]:
df_scaled['3rd_model'] = 0
df_scaled.loc[ positive_indices, '3rd_model'] = 1

In [None]:
# define dataset
X = df_smaller.iloc[:,:df_smaller.shape[1]-1]
y = df_smaller.iloc[:,df_smaller.shape[1]-1]

# setting up testing and training sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=7)

In [None]:
# XGBoost model

model = XGBClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

         0.0       0.78      1.00      0.88     20867
         1.0       0.00      0.00      0.00      5841

    accuracy                           0.78     26708
   macro avg       0.39      0.50      0.44     26708
weighted avg       0.61      0.78      0.69     26708



In [None]:
# first neural network with keras make predictions

model = Sequential()
model.add(Dense(12, input_dim=df_smaller.shape[1]-1, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')
model.fit(X_train, y_train, epochs=2, batch_size=32, verbose=1)
predictions = (model.predict(X_test) > 0.5).astype("int32")
print(classification_report(y_test, predictions))

Epoch 1/2
Epoch 2/2
              precision    recall  f1-score   support

         0.0       0.78      1.00      0.88     20867
         1.0       0.00      0.00      0.00      5841

    accuracy                           0.78     26708
   macro avg       0.39      0.50      0.44     26708
weighted avg       0.61      0.78      0.69     26708



In [None]:
# k-means k=10 clustering

model = KMeans(n_clusters=10)
model.fit(X)
predictions = model.predict(X)

In [None]:
clusters = {}

for j in np.unique(predictions):
  positives_in_cluster = [y.values[i] for i in np.where(predictions == j)[0]].count(1) / len(np.where(predictions == j)[0])*100
  clusters[j] = positives_in_cluster
  print(f"Occurrence of positives in cluster {j} : {round( positives_in_cluster, 2)} %   of {len(np.where(predictions == j)[0])} samples")

print('\n')

print(f"Occurrence of positives in data : { round (list(y).count(1) / len(y) * 100, 2)} %")

Occurrence of positives in cluster 0 : 27.55 %   of 30570 samples
Occurrence of positives in cluster 1 : 8.95 %   of 16914 samples
Occurrence of positives in cluster 2 : 24.99 %   of 25382 samples
Occurrence of positives in cluster 3 : 27.39 %   of 18681 samples
Occurrence of positives in cluster 4 : 27.66 %   of 8367 samples
Occurrence of positives in cluster 5 : 20.08 %   of 19864 samples
Occurrence of positives in cluster 6 : 16.86 %   of 16078 samples
Occurrence of positives in cluster 7 : 27.19 %   of 17523 samples
Occurrence of positives in cluster 8 : 12.46 %   of 13674 samples
Occurrence of positives in cluster 9 : 24.19 %   of 10995 samples


Occurrence of positives in data : 22.21 %


### We see that we don't have any quick or easy progress to make here using the methods we used before. So in the next section, we will look at ways of creating a balanced dataset.

# Upsampling / Downsampling / SMOTE / ADASYN

### Here we first try upsampling. Some models do not cope well with imbalanced datasets and are better suited to balanced ones. With upsampling, we take the label with fewer occurrences and randomly copy its rows (sampling with replacement) until both classes have an equal number of instances.

In [None]:
from sklearn.utils import resample


# concatenate our training data back together
X = pd.concat([X_train, y_train], axis=1)

# separate minority and majority classes
negative = X[X.iloc[:,-1]==0]
positive = X[X.iloc[:,-1]==1]

# upsample minority
positive_upsampled = resample(positive,
                          replace=True, # sample with replacement
                          n_samples=len(negative), # match number in majority class
                          random_state=7)

# combine majority and upsampled minority
upsampled = pd.concat([negative, positive_upsampled])

# check new class counts
upsampled.iloc[:,-1].value_counts()


1.0    117645
0.0    117645
Name: 215, dtype: int64

In [None]:
X_train = upsampled.iloc[:,:-1]
y_train = upsampled.iloc[:,-1]

In [None]:
# define the keras model
model = Sequential()
model.add(Dense(12, input_dim=df_smaller.shape[1]-1, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam')
# fit the keras model on the dataset
model.fit(X_train, y_train, epochs=3, batch_size=32, verbose=1)
# make class predictions with the model
predictions = (model.predict(X_test) > 0.5).astype("int32")
print(classification_report(y_test, predictions))

Epoch 1/3
Epoch 2/3
Epoch 3/3
              precision    recall  f1-score   support

         0.0       0.90      0.51      0.65     20867
         1.0       0.31      0.79      0.45      5841

    accuracy                           0.58     26708
   macro avg       0.61      0.65      0.55     26708
weighted avg       0.77      0.58      0.61     26708



### Here we downsample. This involves taking the majority class and randomly removing instances until the dataset is balanced with both classes having an equal number.

In [None]:
# downsample majority
negative_downsampled = resample(negative,
                                replace = False, # sample without replacement
                                n_samples = len(positive), # match minority n
                                random_state = 7) # reproducible results

# combine minority and downsampled majority
downsampled = pd.concat([negative_downsampled, positive])

# checking counts
downsampled.iloc[:,-1].value_counts()

1.0    33695
0.0    33695
Name: 215, dtype: int64

In [None]:
X_train = downsampled.iloc[:,:-1]
y_train = downsampled.iloc[:,-1]

In [None]:
# define the keras model
model = Sequential()
model.add(Dense(12, input_dim=df_smaller.shape[1]-1, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam')
# fit the keras model on the dataset
model.fit(X_train, y_train, epochs=2, batch_size=32, verbose=1)
# make class predictions with the model
predictions = (model.predict(X_test) > 0.5).astype("int32")

Epoch 1/2
Epoch 2/2


In [None]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

         0.0       0.90      0.51      0.65     20867
         1.0       0.31      0.79      0.45      5841

    accuracy                           0.57     26708
   macro avg       0.60      0.65      0.55     26708
weighted avg       0.77      0.57      0.61     26708



### SMOTE stands for Synthetic Minority Oversampling Technique. It is a method that oversamples a minority class, but with the particularity that instead of copying existing data, it synthesizes new data by interpolating between an existing minority sample and one of its nearest minority-class neighbors, until the dataset is balanced.
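
### To make the synthesis step concrete, here is a small numpy sketch of SMOTE-style interpolation. It is not the imbalanced-learn implementation: the function name, its parameters and the toy points are hypothetical, and the neighbor search is brute force, so it is only suitable for small arrays.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=5, seed=7):
    """Create n_new synthetic rows by interpolating between minority rows
    and one of their k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    k = min(k, n - 1)                      # cannot have more neighbours than other rows
    # pairwise Euclidean distances between minority rows
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)        # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                      # random base rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour per base row
    lam = rng.random((n_new, 1))                               # interpolation factor in [0, 1)
    # new point lies on the segment between the base row and its neighbour
    return minority[base] + lam * (minority[picked] - minority[base])

# Toy minority class: the four corners of the unit square
toy_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(toy_minority, n_new=6, k=2)
print(synthetic)
```

### Because every synthetic row lies between two real minority rows, the new points stay inside the region the minority class already occupies.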

In [None]:
# define dataset
X = df_smaller.iloc[:,:df_smaller.shape[1]-1]
y = df_smaller.iloc[:,df_smaller.shape[1]-1]


In [None]:
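from imblearn.over_sampling import SMOTE

# Older imbalanced-learn releases spelled this SMOTE(ratio=1.0) and fit_sample;
# newer releases balance the classes by default and use fit_resample.
# Note: X_train / y_train are still the downsampled, already balanced set from
# above, which is why the count below is unchanged.
sm = SMOTE(random_state=7)
X_train, y_train = sm.fit_resample(X_train, y_train)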

In [None]:
list(y_train).count(1)

33695

In [None]:
# first neural network with keras make predictions

# define the keras model
model = Sequential()
model.add(Dense(12, input_dim=df_smaller.shape[1]-1, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam')
# fit the keras model on the dataset
model.fit(X_train, y_train, epochs=2, batch_size=32, verbose=1)
# make class predictions with the model
predictions = (model.predict(X_test) > 0.5).astype("int32")
print(classification_report(y_test, predictions))

Epoch 1/2
Epoch 2/2
              precision    recall  f1-score   support

         0.0       0.90      0.48      0.63     20867
         1.0       0.31      0.81      0.44      5841

    accuracy                           0.56     26708
   macro avg       0.60      0.65      0.54     26708
weighted avg       0.77      0.56      0.59     26708



In [None]:
# define dataset
X = df_smaller.iloc[:,:df_smaller.shape[1]-1]
y = df_smaller.iloc[:,df_smaller.shape[1]-1]

# setting up testing and training sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=7)

In [None]:
y.value_counts()

0.0    138512
1.0     39536
Name: 215, dtype: int64

### Borderline SMOTE is like SMOTE but with the particularity that it only synthesizes from minority samples that lie near the class border, that is, samples a nearest-neighbor check would tend to misclassify because most of their neighbors belong to the majority class. We really would have loved to spend more time with this.
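
### The selection of borderline samples can be sketched as follows. This follows the usual description of the method (a minority row is "in danger" when at least half, but not all, of its m nearest neighbors are majority rows); it is not the imbalanced-learn code, and the function name and toy data are hypothetical.

```python
import numpy as np

def danger_mask(X, y, minority_label=1, m=5):
    """True for minority rows with at least half, but not all, of their
    m nearest neighbours (searched over all rows) in the majority class."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # brute-force pairwise Euclidean distances over all rows
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)        # ignore the row itself
    nearest = np.argsort(dists, axis=1)[:, :m]
    majority_neighbours = (y[nearest] != minority_label).sum(axis=1)
    # all m neighbours in the majority -> treated as noise, not as borderline
    borderline = (majority_neighbours >= m / 2) & (majority_neighbours < m)
    return (y == minority_label) & borderline

# Toy 1-D data: three minority rows far from the majority, one close to it
toy_X = np.array([[0.0], [0.1], [0.25], [0.7], [0.85], [1.0], [1.35], [1.6]])
toy_y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
danger = danger_mask(toy_X, toy_y, m=3)
print(danger)
```

### Only the minority row at 0.7 is flagged here, so only that row would serve as a base for synthetic samples.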

In [None]:
from imblearn.over_sampling import BorderlineSMOTE

#Apply Borderline-SMOTE
oversample = BorderlineSMOTE()
X, y = oversample.fit_resample(X, y)

In [None]:
# setting up testing and training sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=7)

In [None]:
# first neural network with keras make predictions

# define the keras model
model = Sequential()
model.add(Dense(12, input_dim=df_smaller.shape[1]-1, activation='relu'))
model.add(Dense(12, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam')
# fit the keras model on the dataset
model.fit(X_train, y_train, epochs=4, batch_size=32, verbose=1)
# make class predictions with the model
predictions = (model.predict(X_test) > 0.5).astype("int32")
print(classification_report(y_test, predictions))

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
              precision    recall  f1-score   support

         0.0       0.76      0.53      0.63     20676
         1.0       0.64      0.84      0.73     20878

    accuracy                           0.68     41554
   macro avg       0.70      0.68      0.68     41554
weighted avg       0.70      0.68      0.68     41554
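
### Note that in the cells above the resampling is applied to `X` and `y` before the train/test split, so the test set (support 20676 / 20878) is itself balanced and contains synthetic rows, and this report is not directly comparable to the earlier ones on the imbalanced test set. A sketch of the alternative ordering follows: hold out the test rows first, then rebalance only the training rows. Plain duplication stands in for Borderline SMOTE here, and the function name and toy data are hypothetical.

```python
import numpy as np

def split_then_oversample(X, y, test_fraction=0.15, seed=7):
    """Hold out a test set first, then balance only the training rows by
    duplicating minority rows (a stand-in for SMOTE-style resampling)."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    order = rng.permutation(len(y))
    n_test = int(round(test_fraction * len(y)))
    test_idx, train_idx = order[:n_test], order[n_test:]

    train_pos = train_idx[y[train_idx] == 1]
    train_neg = train_idx[y[train_idx] == 0]
    # the shorter index array is the minority class of the training split
    minority, majority = sorted([train_pos, train_neg], key=len)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    balanced_idx = np.concatenate([train_idx, extra])
    return X[balanced_idx], y[balanced_idx], X[test_idx], y[test_idx]

# Toy data: 200 rows with unique ids as the single feature, 20 % positives
toy_X = np.arange(200).reshape(-1, 1)
toy_y = (np.arange(200) % 5 == 0).astype(int)
X_tr, y_tr, X_te, y_te = split_then_oversample(toy_X, toy_y)

print((y_tr == 1).sum(), (y_tr == 0).sum(), len(y_te))
```

### With this ordering the test set keeps the original class balance and never shares a row with the resampled training data.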



### ADASYN (Adaptive Synthetic Sampling) is an oversampling method that synthesizes new samples of the minority class, using a system of weights to put more emphasis on the samples that are harder to learn (those with more majority-class neighbors). ADASYN attempts to shift the decision boundary for classification towards the more difficult samples. It does not seem that there is a simple decision boundary in this dataset, but this method is another one we would like to explore further.
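
### The weighting idea can be sketched in a few lines of numpy. For each minority row, the share of majority rows among its k nearest neighbors is computed, the shares are normalized to sum to one, and the number of synthetic rows requested per minority row is made proportional to that weight. This is an illustration under those assumptions, not the imbalanced-learn implementation; the function name, the uniform fallback and the toy data are hypothetical.

```python
import numpy as np

def adasyn_allocation(X, y, n_new, minority_label=1, k=5):
    """Return (minority_rows, counts): how many synthetic rows an
    ADASYN-style scheme would request for each minority row."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # brute-force pairwise Euclidean distances over all rows
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)        # ignore the row itself
    nearest = np.argsort(dists, axis=1)[:, :k]

    minority_rows = np.flatnonzero(y == minority_label)
    # share of majority rows among the k neighbours of each minority row
    ratio = (y[nearest[minority_rows]] != minority_label).mean(axis=1)
    if ratio.sum() == 0:
        # assumption for this sketch: fall back to uniform weights
        weights = np.full(len(ratio), 1 / len(ratio))
    else:
        weights = ratio / ratio.sum()
    return minority_rows, np.rint(weights * n_new).astype(int)

# Toy 1-D data: only the minority row at 0.7 has majority neighbours
toy_X = np.array([[0.0], [0.1], [0.25], [0.7], [0.85], [1.0], [1.35], [1.6]])
toy_y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
rows, counts = adasyn_allocation(toy_X, toy_y, n_new=4, k=3)
print(rows, counts)
```

### All four requested rows are allocated to the hard-to-learn sample, which is the behavior that distinguishes ADASYN from plain SMOTE.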

In [None]:
from imblearn.over_sampling import ADASYN

In [None]:
# define dataset
X = df_smaller.iloc[:,:df_smaller.shape[1]-1]
y = df_smaller.iloc[:,df_smaller.shape[1]-1]


In [None]:
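# Apply ADASYN
oversample = ADASYN()
X, y = oversample.fit_resample(X, y)
# Note: X_train / X_test are not re-split after this cell, so the model below
# still trains and tests on the Borderline-SMOTE split from above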

In [None]:
# first neural network with keras make predictions

# define the keras model
model = Sequential()
model.add(Dense(12, input_dim=df_smaller.shape[1]-1, activation='relu'))
model.add(Dense(12, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam')
# fit the keras model on the dataset
model.fit(X_train, y_train, epochs=4, batch_size=32, verbose=1)
# make class predictions with the model
predictions = (model.predict(X_test) > 0.5).astype("int32")
print(classification_report(y_test, predictions))

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
              precision    recall  f1-score   support

         0.0       0.73      0.57      0.64     20676
         1.0       0.65      0.79      0.71     20878

    accuracy                           0.68     41554
   macro avg       0.69      0.68      0.68     41554
weighted avg       0.69      0.68      0.68     41554



In [None]:
'''
# Saving and Loading Models

# save
pickle.dump(model, open('test1', 'wb'))
# load
t = pickle.load(open('test1', 'rb'))
'''

"\n# Saving and Loading Models\n\n# save\npickle.dump(model, open('test1', 'wb'))\n# load\nt = pickle.load(open('test1', 'rb'))\n"