**Clustering and Multiclass Classification (Predictive Modeling / Machine Learning)**

**INTRODUCTION:**

STG (The degree of study time for goal object materials)
<br>
SCG (The degree of repetition number of user for goal object materials)
<br>
STR (The degree of study time of user for related objects with goal object)
<br>
LPR (The exam performance of user for related objects with goal object)
<br>
PEG (The exam performance of user for goal objects)

In [None]:
import os
import math
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [None]:
# Load the data from the CSV file and look at the column names
os.chdir("/kaggle/input")
orig = pd.read_csv('user-knowledge/User Knowledge.csv')
orig.columns

In [None]:
# Keep only the first five columns, which hold the student knowledge features
knowledge = orig.iloc[:,:5]
knowledge.head()

In [None]:
# Plot histograms of the features to visualize the data
knowledge.hist(bins=50, figsize = (8,8))
plt.show()

In this analysis, we will explore K-Means clustering and look closely at the elbow method.
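
For reference, the quantity plotted in the elbow method is the within-cluster sum of squares, $\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2$, where $\mu_j$ is the centroid of cluster $C_j$. This is what scikit-learn reports as `inertia_`. A minimal pure-Python sketch of the calculation follows; the helper name `wcss_for_labels` and the toy points are illustrative assumptions, not part of this notebook's data.

```python
def wcss_for_labels(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    # Group the points by cluster label
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        # Centroid = coordinate-wise mean of the cluster's members
        centroid = [sum(coord) / len(members) for coord in zip(*members)]
        for point in members:
            total += sum((p - c) ** 2 for p, c in zip(point, centroid))
    return total


# Hypothetical toy data: two tight groups far apart
toy_points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one_cluster = wcss_for_labels(toy_points, [0, 0, 0, 0])
two_clusters = wcss_for_labels(toy_points, [0, 0, 1, 1])
```

Splitting the toy data into its two natural groups reduces the value sharply, which is the kind of bend the elbow plot looks for.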

In [None]:
# Perform K-Means clustering for k from 1 to 10 and plot k vs. within-cluster sum of squares (WCSS)
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=400, n_init=20, random_state=0)
    kmeans.fit(knowledge)
    wcss.append(kmeans.inertia_)

plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

In [None]:
# K-Means Clustering with 3 clusters
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=400, n_init=20, random_state=0)
kmeans.fit(knowledge)
k_class = kmeans.predict(knowledge)

In [None]:
# Using PCA and filtering 3 principal components for data visualization
pca = PCA(n_components=3)
principalComponents = pca.fit_transform(knowledge)
PDF = pd.DataFrame(data = principalComponents, columns = ['PC1', 'PC2', 'PC3'])

In [None]:
# Add the cluster labels as 'Cluster' to the PCA data and as 'Class' to a copy of the original data
PDF.loc[:, 'Cluster'] = pd.Series(k_class)
knowledge_class = knowledge.copy()
knowledge_class['Class'] = k_class

In [None]:
# Count of points in each cluster
PDF['Cluster'].value_counts()

In [None]:
# Assign a color to each cluster
PDF['Color'] = PDF['Cluster'].map({0 : 'red', 1 : 'blue', 2 : 'green'})

In [None]:
# Plot the first 2 principal components and color by cluster
a1 = PDF['PC1']
a2 = PDF['PC2']
a3 = PDF['PC3']
c1 = PDF['Color']
plt.scatter(a1, a2, c = c1, alpha=0.3)  # colors are given explicitly, so no cmap is needed

In [None]:
# 3-D plot of the data using 3 principal components
fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # Axes3D(fig) no longer attaches itself to the figure in recent matplotlib
ax.scatter(a1, a2, a3, alpha = 0.4, c = c1)

Let us look at how the 3 classes differ by calculating their averages on each column.

In [None]:
knowledge_class.groupby(['Class']).mean()

Next, we train three classifiers to predict the cluster labels obtained above:

* KNN (K-Nearest Neighbors) - KNN uses distance as its metric, and the labels for this dataset were also obtained using distance when we applied K-Means clustering. Thus, KNN may perform well on this dataset.
* Decision Tree Classifier - We almost always want to apply a few machine learning methods to a dataset and compare them on a suitable evaluation metric rather than select one final model based only on intuition. Although decision trees may not perform best on a small dataset such as this one, they are highly interpretable.
* Naive Bayes - Based on the assumption that the variables are independent, this algorithm makes a probabilistic estimate using a maximum likelihood hypothesis and is highly efficient compared to other machine learning models.
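
To make the first point concrete: KNN assigns a sample the majority label among its k nearest training points, while K-Means assigns each sample to its nearest centroid, so both rely on Euclidean distance by default. A minimal pure-Python sketch of the KNN vote follows; `knn_predict` and the toy data are illustrative assumptions, not the scikit-learn implementation used in this notebook.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups
toy_points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (0.8, 0.9), (0.9, 0.8), (0.85, 0.95)]
toy_labels = [0, 0, 0, 1, 1, 1]
prediction_low = knn_predict(toy_points, toy_labels, (0.2, 0.2))
prediction_high = knn_predict(toy_points, toy_labels, (0.9, 0.9))
```

When the classes are compact, well-separated groups like these, a query point's nearest neighbors almost always share its label, which is why a distance-based classifier is a natural fit for labels produced by a distance-based clustering.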

In [None]:
# Split the data into train and test data sets
X = knowledge_class.iloc[:, :-1]
Y = knowledge_class.iloc[:, -1]
xTrain, xTest, yTrain, yTest = train_test_split(X, Y, test_size = 0.25, random_state = 0)

In [None]:
# KNN for various values of k and plot of k vs. accuracy
from sklearn.neighbors import KNeighborsClassifier
accuracy = []
for i in range(1,12):
    knn = KNeighborsClassifier(n_neighbors = i).fit(xTrain, yTrain)
    accuracy.append(knn.score(xTest, yTest))

plt.plot(range(1,12), accuracy)
plt.xlabel('k')
plt.ylabel('Accuracy') 
plt.title('k v/s Accuracy for KNN')

In [None]:
# KNN model and evaluation for optimal value of k (8 in this case)
knn = KNeighborsClassifier(n_neighbors = accuracy.index(max(accuracy))+1).fit(xTrain, yTrain)
knn_predictions = knn.predict(xTest)
knn_accuracy = knn.score(xTest, yTest)
knn_accuracy

In [None]:
knn_CM = confusion_matrix(yTest, knn_predictions) # KNN Confusion Matrix
knn_CM

In [None]:
# Decision Tree Classifier (max_depth = 2) and its evaluation on the test set
from sklearn.tree import DecisionTreeClassifier
dtree_model = DecisionTreeClassifier(max_depth = 2).fit(xTrain, yTrain) 
dtree_predictions = dtree_model.predict(xTest)
dt_accuracy = dtree_model.score(xTest, yTest)
dt_accuracy

In [None]:
DT_CM = confusion_matrix(yTest, dtree_predictions) # Decision Tree confusion Matrix
DT_CM

In [None]:
# Gaussian Naive Bayes model and its evaluation on the test set
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB().fit(xTrain, yTrain)
gnb_predictions = gnb.predict(xTest)
gnb_accuracy = gnb.score(xTest, yTest)
gnb_accuracy

In [None]:
NB_CM = confusion_matrix(yTest, gnb_predictions) # Naive Bayes confusion Matrix
NB_CM

We conclude that the Naive Bayes classifier outperformed the KNN and Decision Tree classifiers in terms of accuracy, which can be verified by comparing the confusion matrices.
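
To check an accuracy against its confusion matrix, divide the sum of the diagonal (the correctly classified samples) by the total number of samples. The sketch below is a minimal illustration; the function name `accuracy_from_confusion_matrix` and the matrix values are hypothetical and are not results from this notebook.

```python
def accuracy_from_confusion_matrix(cm):
    """Accuracy = correctly classified samples (diagonal) / all samples."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted class)
example_cm = [
    [30, 2, 0],
    [1, 25, 3],
    [0, 4, 35],
]
example_accuracy = accuracy_from_confusion_matrix(example_cm)
```

Passing `knn_CM`, `DT_CM`, or `NB_CM` to this helper should reproduce the corresponding `score` value computed earlier.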