# Importing the data

In [None]:
import numpy as np
import pandas as pd


df = pd.read_csv("data.csv")
df.head()

# Preprocessing
First, I'm going to check the dataset for missing values and see whether we need to scale the features.

In [None]:
print(df.isnull().sum())

Fortunately, there aren't any missing values in our dataset. If there were, we could either drop the rows with missing values or impute them. Features that aren't on the same scale can cause problems in our model, so we need to check for that as well.
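
For reference, here's a minimal sketch of the two options mentioned above (dropping vs. imputing). The `toy` frame and its planted NaNs are made up for illustration, and median imputation is just one assumed choice among several.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the NaNs are planted for illustration.
toy = pd.DataFrame({
    "X1": [5, 3, np.nan, 4],
    "X2": [2, np.nan, 4, 4],
    "Y": [1, 0, 1, 0],
})

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: fill each missing value with the median of its column.
imputed = toy.fillna(toy.median())

print(dropped.shape)
print(imputed)
```

In a real pipeline the fill values should come from the training split only, so that nothing from the test set leaks into preprocessing.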

In [None]:
df.describe(include='all')

It seems our features are all on the same scale (1-5), so we don't need to standardize them. Multicollinearity is another issue that we need to look out for. We can inspect the correlations among our features with a heatmap.

In [None]:
# let's separate the features and the response
df_x = df.loc[:, df.columns != 'Y']
df_y = df['Y']

df_x.columns

In [None]:

%matplotlib inline
import seaborn as sns

sns.heatmap(pd.concat([df_x, df_y], axis=1).corr(), annot=True)
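
A numeric companion to the heatmap can make the strongest pairs easier to spot. This is a sketch on synthetic columns (`A`, `B`, `C` are hypothetical, not the survey features); it ranks each feature pair once by absolute correlation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "A": base,
    "B": base + rng.normal(scale=0.3, size=200),  # strongly tied to A
    "C": rng.normal(size=200),                    # independent of A and B
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna().sort_values(ascending=False)

print(pairs)
```

The same three lines after `toy.corr()` should work on `df_x.corr()` as well.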


We can see that there's a relatively high correlation between X1 and X5, and also between X1 and X6. I won't remove any of the features, but will keep this in mind. We can also compute the variance inflation factor (VIF) for each feature.

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor


vif_dataframe = pd.DataFrame()
vif_dataframe['feature'] = df_x.columns

vif_dataframe['VIF'] = [variance_inflation_factor(df_x.values, i) for i in range(len(df_x.columns))]
vif_dataframe

Again, we see there's a relatively high degree of multicollinearity among the features.
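
To make the VIF numbers less of a black box: the VIF of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. Below is a rough NumPy sketch of that definition on synthetic data. Note one assumption: this version adds an intercept to each regression, whereas the statsmodels function regresses on exactly the columns it is given, so the two can disagree unless a constant column is included there too.

```python
import numpy as np

def vif_manual(X):
    """VIF per column: 1 / (1 - R^2) from regressing that column on the rest (with intercept)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.5, size=300)  # correlated with a
c = rng.normal(size=300)                 # unrelated to a and b

vifs = vif_manual(np.column_stack([a, b, c]))
print(vifs.round(2))
```

The two correlated columns get clearly inflated values, while the unrelated one stays close to 1, which is the smallest value a VIF can take.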

In [None]:
df_x.hist()

We can see that features X1 and X6 have relatively low variance and are skewed. We need to consider this when performing feature selection.

In [None]:
# pass the Series by keyword; recent seaborn versions no longer accept it positionally as x
sns.countplot(x=df_y)

The response classes are balanced, so we don't need to worry about class imbalance.
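
The same check can be done numerically. A small sketch with hypothetical labels (`toy_y` is made up; on the real data it would be `df_y`):

```python
import pandas as pd

toy_y = pd.Series([1, 0, 1, 1, 0, 0, 1, 0], name="Y")  # hypothetical labels

# Share of each class in the response.
shares = toy_y.value_counts(normalize=True)
print(shares)

# Ratio of the rarest to the most common class; 1.0 means perfectly balanced.
balance_ratio = shares.min() / shares.max()
print(balance_ratio)
```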

# Feature Selection

In [None]:
from sklearn.model_selection import train_test_split

X = df_x.to_numpy()
y = df_y.to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

print(f"Train Set Count: {len(X_train)}\nTest Set Count: {len(X_test)}")

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

model_tree = RandomForestClassifier(random_state=123, n_estimators=50)
model_tree.fit(X_train, y_train)
print(model_tree.feature_importances_)

sel_model = SelectFromModel(estimator=model_tree, prefit=True, threshold='mean')
X_train_transformed = sel_model.transform(X_train)
X_test_transformed = sel_model.transform(X_test)
print(X_train_transformed.shape)
print(sel_model.get_support())

We can see that the model has selected X1, X2, X3, and X5.
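
What `threshold='mean'` does here is keep every feature whose importance is at least the average importance. The sketch below checks that on synthetic data from `make_classification` (a stand-in, not our dataset; the `X1`..`X6` labels are only attached for readability).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in: 6 features, of which 3 are informative.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
names = np.array([f"X{i}" for i in range(1, 7)])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)
selector = SelectFromModel(forest, prefit=True, threshold="mean")

# Reproduce the selection rule by hand: importance >= mean importance.
importances = forest.feature_importances_
manual_mask = importances >= importances.mean()

print(dict(zip(names, importances.round(3))))
print(names[selector.get_support()])
```

Indexing the column names with `get_support()` like this is also a handy way to read off the selected features instead of matching the boolean mask by eye.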

# Model Training
Since the features seem to be correlated, I'm going to apply PCA to address the multicollinearity issue.

In [None]:
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# a float in (0, 1) keeps just enough components to explain that share of the variance
pca = PCA(0.95)
pca.fit(X_train_transformed)

pca_train = pca.transform(X_train_transformed)
pca_test = pca.transform(X_test_transformed)

model = KNeighborsClassifier()
model.fit(pca_train, y_train)
model.score(pca_test, y_test)
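
Two properties of PCA matter for the cell above: passing a fraction such as `0.95` keeps the fewest components whose cumulative explained variance reaches that fraction, and the resulting components are uncorrelated with each other on the data they were fitted on. Here's a sketch on synthetic data, assuming four columns built from two latent factors (none of this is our dataset).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
z = rng.normal(size=(300, 2))

# Four correlated columns built from two latent factors plus a little noise.
X_demo = np.column_stack([
    z[:, 0],
    z[:, 0] + 0.1 * rng.normal(size=300),
    z[:, 1],
    z[:, 1] + 0.1 * rng.normal(size=300),
])

pca_demo = PCA(0.95).fit(X_demo)
scores = pca_demo.transform(X_demo)

# Cumulative share of variance explained by the retained components.
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(pca_demo.n_components_, cumulative.round(3))

# Correlation matrix of the component scores: off-diagonal entries are zero up to rounding.
score_corr = np.corrcoef(scores, rowvar=False)
print(score_corr.round(6))
```

On our data, `pca.n_components_` and `pca.explained_variance_ratio_` would show how many components were kept after feature selection.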

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, model.predict(pca_test))

Because of the nature of Principal Component Analysis, it's not easy to tell which features are more important in determining the response.
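
One partial workaround is to look at the loadings in `components_`, which show how much each original feature contributes to each component. This is only a sketch on a hypothetical three-column frame, and loadings describe the components, not the classifier, so they give a hint rather than a feature ranking.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
shared = rng.normal(size=200)

# Hypothetical features: X1 and X2 share a common factor, X3 is independent.
toy = pd.DataFrame({
    "X1": shared + 0.2 * rng.normal(size=200),
    "X2": shared + 0.2 * rng.normal(size=200),
    "X3": rng.normal(size=200),
})

pca_demo = PCA(n_components=2).fit(toy.to_numpy())

# Rows are components, columns are original features; entries are the loadings.
loadings = pd.DataFrame(
    pca_demo.components_,
    columns=toy.columns,
    index=[f"PC{i + 1}" for i in range(pca_demo.n_components_)],
)
print(loadings.round(2))
```

In this toy case the first component is dominated by X1 and X2, which is what we'd expect since they move together.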

The model doesn't achieve the target accuracy of 73%; however, achieving a high accuracy on such a small dataset (126 data points) could be an indication of overfitting. I think an accuracy of 61% is adequate for a model trained on this dataset.
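
It's also worth keeping in mind how noisy an accuracy estimate is on a test set this small. The sketch below uses a simple normal approximation and assumes about 26 test rows (20% of the 126 data points) together with the 61% quoted above; the approximation is crude at this sample size, so treat the result as a rough guide only.

```python
import math

def accuracy_interval(accuracy, n_test, z=1.96):
    """Normal-approximation interval for an accuracy measured on n_test samples."""
    se = math.sqrt(accuracy * (1.0 - accuracy) / n_test)
    return max(0.0, accuracy - z * se), min(1.0, accuracy + z * se)

# Assumed inputs: 61% accuracy and roughly 26 test rows.
low, high = accuracy_interval(0.61, 26)
print(f"{low:.2f} to {high:.2f}")
```

Under these assumptions the interval spans roughly 0.42 to 0.80, so a single train/test split can't really distinguish 61% from 73%.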