<a href="https://colab.research.google.com/github/alluriharshasri/AI-ML-Lab/blob/main/AI_ML_ProjectB2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Project Objective**

**Dataset:** Kaggle's Amazon Employee Access Challenge dataset

The aim of this project is to develop a model using historical data that can effectively determine an employee's access requirements, thereby minimizing manual access transactions such as grants and revokes as the employee's attributes evolve over time. The model will take into account an employee's role information and a resource code to predict whether access should be granted or not.

The dataset comprises real historical data collected between 2010 and 2011. Access to resources has been manually approved or denied for employees over time. The task is to create an algorithm capable of learning from this historical data to predict approval or denial for a new set of employees.

**File Descriptions:**
- *train.csv:* This file contains the training set. Each row includes the ACTION (ground truth), RESOURCE, and details about the employee's role at the time of approval.
- *test.csv:* This file comprises the test set for which predictions are required. Each row in this file asks whether an employee with the listed characteristics should have access to the listed resource.

### **Project Index**

1. Data collection - Import and Read Data

2. Data Pre-processing
  
   a. Data Transformation
   
   b. Data Splitting

3. Models

   a. Model 1 - Logistic Regression Model

   b. Model 2 - Support Vector Machines (SVM)

   c. Model 3 - Decision Tree Model

   d. Model 4 - K-Nearest Neighbors (K-NN) Model

4. Model Comparison and Conclusion

## **1. Data collection - Import and Read Data**

In [None]:
# Importing all necessary libraries (imports that the notebook never uses have been dropped)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics, svm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score, roc_curve
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier


In [None]:
# Reading training dataset

train_dataframe = pd.read_csv("/content/train.csv", sep=",")
train_dataframe = train_dataframe.reindex(np.random.permutation(train_dataframe.index))

train_dataframe.head()

FileNotFoundError: [Errno 2] No such file or directory: '/content/train.csv'

In [None]:
# Reading test dataset

test_dataframe = pd.read_csv("/content/test.csv", sep=",")
test_dataframe = test_dataframe.reindex(np.random.permutation(test_dataframe.index))

test_dataframe.head()

In [None]:
#Checking dataframe shapes for both datasets

print("Training dataframe shape: ",train_dataframe.shape)
print("Test dataframe shape: ",test_dataframe.shape)

In [None]:
# Checking the count of each category of the target variable in the training dataframe

print(train_dataframe['ACTION'].value_counts())
sns.countplot(x='ACTION',data = train_dataframe, palette='hls')
plt.title("Action class distribution")
# Save before plt.show(); saving afterwards writes an empty figure
plt.savefig('count_plot_training')
plt.show()

## **2. Data Preprocessing**

In [None]:
#1. Handling Missing Data
# Checking if there is any missing value in the dataset

train_dataframe.isnull().sum()

#2. Handling Outliers
# All features are categorical IDs and the target is binary, so an outlier check does not apply:
# the numeric values are labels, not magnitudes.

In [None]:
#3. Feature selection
# With no missing values or outliers to handle, count the unique categories in each column
train_dataframe.nunique()

In [None]:
# Checking distributions of all variables
feature_columns = ['RESOURCE', 'MGR_ID', 'ROLE_ROLLUP_1',
                   'ROLE_ROLLUP_2', 'ROLE_DEPTNAME', 'ROLE_TITLE',
                   'ROLE_FAMILY_DESC', 'ROLE_FAMILY', 'ROLE_CODE']

f, axes = plt.subplots(3, 3, figsize=(15, 10), sharex=True)
sns.despine(left=True)

# One panel per feature; histplot with a KDE overlay replaces the deprecated sns.distplot
for ax, column in zip(axes.flat, feature_columns):
    sns.histplot(train_dataframe[column].values, kde=True, stat="density", ax=ax)
    ax.set_title(column + " distribution")

In [None]:
# heat map of correlation of features
# They all have weak correlation with the target variable

correlation_matrix = train_dataframe.corr()
fig = plt.figure(figsize=(12,9))
sns.heatmap(correlation_matrix,vmax=1,square = True, annot=True)
plt.show()

In [None]:
# checking correlation between ROLE_TITLE, ROLE_CODE
# They have weak correlation as well so we will not be dropping any variables.

print(train_dataframe[["ROLE_TITLE","ROLE_CODE"]].corr())
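Because these columns hold arbitrary category IDs, a Pearson coefficient treats the labels as magnitudes and may not reveal whether one column simply duplicates another. As a complementary check, the sketch below tests for a one-to-one mapping between two ID columns. The helper `is_one_to_one` and the toy pairs are illustrative assumptions, not part of the original analysis or of any library.

```python
from collections import defaultdict


def is_one_to_one(pairs):
    """Return True if every left value maps to one right value and vice versa."""
    left_to_right = defaultdict(set)
    right_to_left = defaultdict(set)
    for left, right in pairs:
        left_to_right[left].add(right)
        right_to_left[right].add(left)
    return (all(len(v) == 1 for v in left_to_right.values())
            and all(len(v) == 1 for v in right_to_left.values()))


# Made-up (title_id, code_id) pairs: the IDs are arbitrary labels, so their
# numeric correlation says little about whether one column determines the other.
redundant = [(500, 3), (120, 9), (500, 3), (777, 1)]
independent = [(500, 3), (500, 9), (120, 3)]

redundant_is_1to1 = is_one_to_one(redundant)        # True
independent_is_1to1 = is_one_to_one(independent)    # False
```

On the real data the same idea could be applied to `zip(train_dataframe["ROLE_TITLE"], train_dataframe["ROLE_CODE"])`; the outcome there is not asserted here.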

### **2.a. Data Transformation**

In [None]:
# The features are categorical IDs (see the unique-category counts above), so one-hot encode them for modelling

# The output is sparse by default; the `sparse` keyword was renamed to `sparse_output` in scikit-learn 1.2
one_hot_encoder = OneHotEncoder(dtype=np.float32, handle_unknown='ignore')

# Using One hot encoding on training dataset
X_train_columns = [x for x in train_dataframe.columns if x!="ACTION"]
X = one_hot_encoder.fit_transform(train_dataframe[X_train_columns])

# Using One hot encoding on test dataset
X_test_columns = [x for x in test_dataframe.columns if x!="id"]
X_test = one_hot_encoder.transform(test_dataframe[X_test_columns])

#Splitting target variable in y for training dataset
y = train_dataframe["ACTION"].values
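To make the encoding step concrete, here is a minimal, dependency-free sketch of what one-hot encoding with `handle_unknown='ignore'` does conceptually. The helper names `fit_categories` and `one_hot_row` and the toy rows are illustrative assumptions, not scikit-learn's implementation, and a dense list stands in for the sparse matrix to keep the example readable.

```python
def fit_categories(rows):
    """Collect the sorted unique values seen in each column of the training rows."""
    n_cols = len(rows[0])
    return [sorted({row[c] for row in rows}) for c in range(n_cols)]


def one_hot_row(row, categories):
    """Encode one row; values unseen during fitting become an all-zero block."""
    encoded = []
    for value, cats in zip(row, categories):
        encoded.extend(1.0 if value == cat else 0.0 for cat in cats)
    return encoded


# Toy (RESOURCE, ROLE_CODE)-style rows; the IDs are made up for illustration.
train_rows = [(101, 7), (102, 7), (101, 9)]
categories = fit_categories(train_rows)      # [[101, 102], [7, 9]]

seen = one_hot_row((102, 9), categories)     # [0.0, 1.0, 0.0, 1.0]
unseen = one_hot_row((999, 7), categories)   # [0.0, 0.0, 1.0, 0.0]
```

The `unseen` row shows why ignoring unknown categories matters for the test set: a RESOURCE ID that never appeared in training contributes no active feature instead of raising an error.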

In [None]:
# Checking the data after one hot encoding
print("Training data: ",X[4])
print("\n Training Target: ",y)

print("\n Test data: ",X_test[1])

### **2.b. Data Splitting**

In [None]:
# Splitting the training dataset into training and validation sets (validation = 20%, training = 80%)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=1)

In [None]:
# Checking shapes after one-hot encoding and splitting (the feature sets are now sparse matrices)

print("Training dataframe shape: ", X_train.shape)
print("Validation dataframe shape: ",X_val.shape)
print("Test dataframe shape: ",X_test.shape)

## **3. Models**


**a. Model 1 - Logistic Regression Model**

In [None]:
# Building Logistic regression model

model_logisticRegression = LogisticRegression( random_state=623,
                                               solver = 'saga',
                                               max_iter = 10000,
                                               warm_start = False,
                                               verbose = 1,
                                               tol = 1e-5)

In [None]:
# Cross validating the Logistic regression model to check the score and summary of the model
statistics_cv = cross_validate(model_logisticRegression, X_train, y_train, groups=None, scoring='roc_auc', cv=5, n_jobs=2, return_train_score = True)

# Describing the summary of the Logistic regression model
statistics_cv = pd.DataFrame(statistics_cv)
statistics_cv.describe()
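The `cv=5` argument asks for five-fold cross-validation: the training data is cut into five parts, and each part serves once as the held-out fold. The sketch below shows how such a partition can be built from sample indices. The helper `k_fold_indices` is hypothetical and uses plain contiguous folds; scikit-learn uses stratified folds for classifiers when `cv` is an integer, which additionally preserves class proportions and is omitted here for brevity.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=12, n_folds=5))
val_sizes = [len(val) for _, val in folds]   # [3, 3, 2, 2, 2]
```

Each of the five scores summarised by `describe()` comes from fitting on one `train_idx` set and scoring on the matching `val_idx` set.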

In [None]:
# Model Fitting

model_logisticRegression_history = model_logisticRegression.fit(X_train, y_train)

In [None]:
# Model evaluation

Accuracy_Logistic_Regression = model_logisticRegression_history.score(X_val, y_val)
print("Accuracy of Logistic Regression Model- Validation Dataset: %.3f%%" % (Accuracy_Logistic_Regression*100.0))

In [None]:
# Model evaluation

Accuracy_Logistic_Regression = model_logisticRegression_history.score(X_train, y_train)
print("Accuracy of Logistic Regression Model- Training Dataset: %.3f%%" % (Accuracy_Logistic_Regression*100.0))

In [None]:
# Confusion matrix for Validation dataset

y_val_predictions = model_logisticRegression_history.predict(X_val)
cm = metrics.confusion_matrix(y_val, y_val_predictions)
print(cm)

In [None]:
# Heat map for the confusion matrix

plt.figure(figsize=(9,9))
sns.heatmap(cm, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
plt.title("Accuracy- Logistic Regression", size = 15);

In [None]:
# Misclassification rate

print("Misclassification rate of Logistic Regression model: ",
      (((cm[0][1] + cm[1][0])/cm.sum())*100), "%")
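The misclassification rate is the off-diagonal share of the confusion matrix. As a sketch with made-up counts (not results from this notebook), the function below derives it together with per-class recall, assuming scikit-learn's layout where rows are actual labels and columns are predicted labels. The name `rates_from_confusion` is hypothetical.

```python
def rates_from_confusion(cm):
    """Derive summary rates from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "misclassification": (fp + fn) / total,
        "accuracy": (tn + tp) / total,
        "recall_class_0": tn / (tn + fp),   # share of actual class-0 rows predicted as 0
        "recall_class_1": tp / (tp + fn),   # share of actual class-1 rows predicted as 1
    }


# Hypothetical counts for an imbalanced validation set (illustration only).
toy_cm = [[10, 40], [10, 940]]
toy_rates = rates_from_confusion(toy_cm)
```

With these toy counts the accuracy is 95% while only 20% of class-0 rows are identified, which is why the per-class figures in `classification_report` are worth reading alongside the overall accuracy.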

In [None]:
# Checking Logistic regression model summary of validation dataset

print(classification_report(y_val, y_val_predictions))

In [None]:
# Model prediction of Test dataset

y_test = model_logisticRegression_history.predict(X_test)

In [None]:
# saving predictions in dataframe

y_test_predictions = pd.DataFrame()
y_test_predictions["id"] = test_dataframe["id"]
y_test_predictions["ACTION"] = y_test
print(y_test_predictions)

# Saving results to csv file

y_test_predictions.to_csv("submission.csv", index = False)

In [None]:
# Counting the predicted ACTION classes for the test dataframe

print(y_test_predictions['ACTION'].value_counts())
sns.countplot(x='ACTION',data = y_test_predictions, palette='hls')
# Save before plt.show(); saving afterwards writes an empty figure
plt.savefig('count_plot')
plt.show()

In [None]:
# ROC curve for the Logistic Regression model
# The receiver operating characteristic (ROC) curve is a common diagnostic for binary classifiers.
# The dashed line represents the ROC curve of a purely random classifier; a good classifier stays as far away from that line as possible (toward the top-left corner).

# Use predicted probabilities for both the AUC and the curve; hard 0/1 predictions collapse the curve to a single threshold
y_val_proba = model_logisticRegression_history.predict_proba(X_val)[:,1]
logit_roc_auc = roc_auc_score(y_val, y_val_proba)
fpr, tpr, thresholds = roc_curve(y_val, y_val_proba)
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
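The area under the ROC curve can also be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way on toy scores, to illustrate why probabilities rather than thresholded 0/1 predictions should feed the AUC. The function `pairwise_auc` and the data are illustrative assumptions, not scikit-learn code.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1, 1, 1]
proba = [0.55, 0.80, 0.60, 0.70, 0.90, 0.95]   # made-up predicted probabilities
hard = [1 if p >= 0.5 else 0 for p in proba]   # thresholded labels: all ones here

auc_from_proba = pairwise_auc(y_true, proba)   # 0.75
auc_from_hard = pairwise_auc(y_true, hard)     # 0.5
```

In this toy case the thresholded labels carry no ranking information at all, so the AUC drops to the chance level of 0.5 even though the underlying scores rank six of the eight pairs correctly.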

**b. Model 2 - Support Vector Machines (SVM)**

In [None]:
# Building Support Vector Machines (SVM) model

model_svm = svm.SVC()

In [None]:
# Cross validating the Support Vector Machines (SVM) model to check the score and summary of the model

statistics_cv = cross_validate(model_svm, X_train, y_train, groups=None, scoring='roc_auc', cv=5, n_jobs=2, return_train_score = True)

statistics_cv = pd.DataFrame(statistics_cv)
statistics_cv.describe()

In [None]:
# Model Fitting

model_svm_history = model_svm.fit(X_train, y_train)

In [None]:
#Confusion matrix for Validation dataset

y_val_predictions = model_svm_history.predict(X_val)
cm_svm = metrics.confusion_matrix(y_val, y_val_predictions)
print(cm_svm)

In [None]:
# Heat map for the confusion matrix

plt.figure(figsize=(9,9))
sns.heatmap(cm_svm, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
plt.title("Accuracy- Support Vector Machines", size = 15);

In [None]:
# Misclassification rate

print("Misclassification rate of SVM Model: ",
      (((cm_svm[0][1] + cm_svm[1][0])/cm_svm.sum())*100), "%")

In [None]:
# Checking accuracy on validation dataset using model

print(classification_report(y_val, y_val_predictions))

In [None]:
# Model accuracy

Accuracy_svm = model_svm_history.score(X_train, y_train)
print("Accuracy of SVM Model- Training dataset: %.3f%%" % (Accuracy_svm*100.0))

In [None]:
# Model evaluation

Accuracy_svm = model_svm_history.score(X_val, y_val)
print("Accuracy of SVM Model- Validation Dataset: %.3f%%" % (Accuracy_svm*100.0))

In [None]:
#Predicting test dataset

y_test = model_svm_history.predict(X_test)

In [None]:
#saving predictions in dataframe
y_test_predictions = pd.DataFrame()
y_test_predictions["id"] = test_dataframe["id"]
y_test_predictions["ACTION"] = y_test
print(y_test_predictions)

#Saving results to csv file (separate file name so the Logistic Regression submission is not overwritten)
y_test_predictions.to_csv("submission_svm.csv", index = False)

In [None]:
# Finding ACTION target variable for test dataframe

count_Class= y_test_predictions['ACTION'].value_counts()

print("Count of ACTION variable: \n",count_Class)

In [None]:
# Pie chart for counts of ACTION target variable for test dataframe

count_Class.plot(kind = 'pie',  autopct='%1.0f%%')
plt.title('Pie chart of count of ACTION variable')
plt.ylabel('')
plt.show()

**c. Model 3 - Decision Tree Model**

In [None]:
# Building Decision Tree model
model_dt = DecisionTreeClassifier()

In [None]:
# Cross validating the Decision Tree model
statistics_cv_dt = cross_validate(model_dt, X_train, y_train, groups=None, scoring='roc_auc', cv=5, n_jobs=2, return_train_score = True)
statistics_cv_dt = pd.DataFrame(statistics_cv_dt)
statistics_cv_dt.describe()

In [None]:
# Training the Decision Tree model
model_dt_history = model_dt.fit(X_train, y_train)

In [None]:
# Confusion matrix for Validation dataset
y_val_predictions_dt = model_dt_history.predict(X_val)
cm_dt = metrics.confusion_matrix(y_val, y_val_predictions_dt)
print(cm_dt)

In [None]:
# Heat map for the confusion matrix
plt.figure(figsize=(9,9))
sns.heatmap(cm_dt, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'Blues_r')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.title("Accuracy- Decision Tree", size = 15)

In [None]:
# Misclassification rate
print("Misclassification rate of Decision Tree Model: ", (((cm_dt[0][1] + cm_dt[1][0])/cm_dt.sum())*100), "%")

In [None]:
# Checking accuracy on validation dataset using model
print(classification_report(y_val, y_val_predictions_dt))

In [None]:
# Model accuracy on training dataset
Accuracy_dt_train = model_dt_history.score(X_train, y_train)
print("Accuracy of Decision Tree Model- Training dataset: %.3f%%" % (Accuracy_dt_train*100.0))

In [None]:
# Model accuracy on validation dataset
Accuracy_dt_val = model_dt_history.score(X_val, y_val)
print("Accuracy of Decision Tree Model- Validation Dataset: %.3f%%" % (Accuracy_dt_val*100.0))

In [None]:
# Predicting on test dataset
y_test_dt = model_dt_history.predict(X_test)

In [None]:
# Saving predictions in dataframe
y_test_predictions_dt = pd.DataFrame()
y_test_predictions_dt["id"] = test_dataframe["id"]
y_test_predictions_dt["ACTION"] = y_test_dt
print(y_test_predictions_dt)

# Saving results to csv file
y_test_predictions_dt.to_csv("submission_dt.csv", index=False)

In [None]:
# Finding ACTION target variable for test dataframe
count_Class_dt = y_test_predictions_dt['ACTION'].value_counts()
print("Count of ACTION variable for Decision Tree: \n", count_Class_dt)

In [None]:
# Pie chart for counts of ACTION target variable for test dataframe
count_Class_dt.plot(kind='pie', autopct='%1.0f%%')
plt.title('Pie chart of count of ACTION variable for Decision Tree')
plt.ylabel('')
plt.show()


**d. Model 4 - K-Nearest Neighbors (K-NN) Model**

In [None]:
# Building K-NN model
model_knn = KNeighborsClassifier()

In [None]:
# Cross validating the K-NN model
statistics_cv_knn = cross_validate(model_knn, X_train, y_train, groups=None, scoring='roc_auc', cv=5, n_jobs=2, return_train_score=True)
statistics_cv_knn = pd.DataFrame(statistics_cv_knn)
statistics_cv_knn.describe()

In [None]:
# Training the K-NN model
model_knn_history = model_knn.fit(X_train, y_train)

In [None]:
# Confusion matrix for Validation dataset
y_val_predictions_knn = model_knn_history.predict(X_val)
cm_knn = metrics.confusion_matrix(y_val, y_val_predictions_knn)
print(cm_knn)

In [None]:
# Heat map for the confusion matrix
plt.figure(figsize=(9,9))
sns.heatmap(cm_knn, annot=True, fmt=".3f", linewidths=.5, square=True, cmap='Blues_r')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.title("Accuracy- K-Nearest Neighbors", size=15)

In [None]:
# Misclassification rate
print("Misclassification rate of K-NN Model: ", (((cm_knn[0][1] + cm_knn[1][0]) / cm_knn.sum()) * 100), "%")

In [None]:
# Checking accuracy on validation dataset using model
print(classification_report(y_val, y_val_predictions_knn))

In [None]:
# Model accuracy on training dataset
Accuracy_knn_train = model_knn_history.score(X_train, y_train)
print("Accuracy of K-NN Model- Training dataset: %.3f%%" % (Accuracy_knn_train * 100.0))

In [None]:
# Model accuracy on validation dataset
Accuracy_knn_val = model_knn_history.score(X_val, y_val)
print("Accuracy of K-NN Model- Validation Dataset: %.3f%%" % (Accuracy_knn_val * 100.0))

In [None]:
# Predicting on test dataset
y_test_knn = model_knn_history.predict(X_test)

In [None]:
# Saving predictions in dataframe
y_test_predictions_knn = pd.DataFrame()
y_test_predictions_knn["id"] = test_dataframe["id"]
y_test_predictions_knn["ACTION"] = y_test_knn
print(y_test_predictions_knn)

# Saving results to csv file
y_test_predictions_knn.to_csv("submission_knn.csv", index=False)

In [None]:
# Finding ACTION target variable for test dataframe
count_Class_knn = y_test_predictions_knn['ACTION'].value_counts()
print("Count of ACTION variable for K-NN: \n", count_Class_knn)

In [None]:
# Pie chart for counts of ACTION target variable for test dataframe
count_Class_knn.plot(kind='pie', autopct='%1.0f%%')
plt.title('Pie chart of count of ACTION variable for K-NN')
plt.ylabel('')
plt.show()


## **4. Model Comparison and Conclusion**


In [None]:
# Plotting accuracies of all models in a box plot
# Each list holds [training accuracy, validation accuracy] in percent

Accuracy_of_allModels = {'LR': [96.03, 94.74], 'SVM': [95.76, 94.54], 'DT': [100, 94.08], 'K-NN': [96.025, 94.492]}
df = pd.DataFrame(data=Accuracy_of_allModels)
sns.boxplot(data=df).set(title = 'Model Comparison', xlabel = 'Models', ylabel = 'Accuracy' )

* All four models reach similar validation accuracy, between 94.08% (Decision Tree) and 94.74% (Logistic Regression), while training accuracy ranges from 95.76% (SVM) to 100% (Decision Tree). The gap between training and validation accuracy is smallest for the Support Vector Machine (SVM) at 1.22 percentage points, closely followed by Logistic Regression (1.29) and K-NN (1.53), indicating comparable generalization for these three. The Decision Tree shows by far the largest gap (5.92 points), and its 100% training accuracy suggests overfitting. This underscores the importance of regularization techniques in improving generalization performance while maintaining high accuracy.

* Considering both accuracy and generalization, the Support Vector Machine emerges as the best model. Its validation accuracy of 94.54% is slightly lower than that of the best-performing Logistic Regression model (94.74%), but its marginally smaller generalization gap makes it a more reliable choice for making predictions.
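The gap comparison can be reproduced directly from the accuracies listed in the comparison cell. This is a small sketch that assumes each pair is ordered as [training, validation], which matches the validation figures quoted in the conclusion; the variable names are chosen for illustration.

```python
# [training accuracy, validation accuracy] in percent, as listed in the comparison cell.
accuracy_of_all_models = {
    "LR": [96.03, 94.74],
    "SVM": [95.76, 94.54],
    "DT": [100, 94.08],
    "K-NN": [96.025, 94.492],
}

# Generalization gap = training accuracy minus validation accuracy.
gaps = {name: round(train - val, 3)
        for name, (train, val) in accuracy_of_all_models.items()}

smallest_gap = min(gaps, key=gaps.get)
best_validation = max(accuracy_of_all_models,
                      key=lambda name: accuracy_of_all_models[name][1])
```

Here `smallest_gap` evaluates to `"SVM"` and `best_validation` to `"LR"`, which are the two quantities the conclusion weighs against each other.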
