### **Hello, World!**

### **This is Task 3 of Week 04 at Buildables Data Science Fellowship.**

### **In today's task, I have been assigned to explore multiple *Classification Algorithms* on various datasets.**

### **The algorithms and datasets I will work with today are:**
1. Logistic Regression - Employee Attrition Dataset (Preprocessing and Evaluation)
2. KNN - Heart Disease Dataset (Preprocessing and Evaluation)
3. Logistic Regression - Hospital Re-admission Dataset (Preprocessing and Evaluation)
4. Decision Tree Classifier - Credit Card Fraud Detection Dataset (Preprocessing and Evaluation)
5. Decision Tree Classifier - Wine Quality Dataset (Preprocessing and Evaluation)
6. Naive Bayes Classifier - SMS Spam Collection Dataset (Preprocessing and Evaluation)
7. Random Forest - PIMA Indians Diabetes Dataset (Preprocessing and Evaluation)
8. SVM - Iris Dataset (Preprocessing and Evaluation)
9. KNN - Breast Cancer Dataset (Preprocessing and Evaluation)

In [2]:
#importing data manipulation libraries
import pandas as pd
import numpy as np

#importing models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

#importing dataset splitting and model evaluation metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    roc_auc_score,
    precision_score,
    recall_score,
    confusion_matrix,
    classification_report
)

#importing data pre-processing libraries
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler

## **1. Predicting Employee Attrition Using Logistic Regression**

In [3]:
df1 = pd.read_csv('./datasets/employee_attrition_dataset.csv')
df1.head()

Unnamed: 0,Employee_ID,Age,Gender,Marital_Status,Department,Job_Role,Job_Level,Monthly_Income,Hourly_Rate,Years_at_Company,...,Overtime,Project_Count,Average_Hours_Worked_Per_Week,Absenteeism,Work_Environment_Satisfaction,Relationship_with_Manager,Job_Involvement,Distance_From_Home,Number_of_Companies_Worked,Attrition
0,1,58,Female,Married,IT,Manager,1,15488,28,15,...,No,6,54,17,4,4,4,20,3,No
1,2,48,Female,Married,Sales,Assistant,5,13079,28,6,...,Yes,2,45,1,4,1,2,25,2,No
2,3,34,Male,Married,Marketing,Assistant,1,13744,24,24,...,Yes,6,34,2,3,4,4,45,3,No
3,4,27,Female,Divorced,Marketing,Manager,1,6809,26,10,...,No,9,48,18,2,3,1,35,3,No
4,5,40,Male,Divorced,Marketing,Executive,1,10206,52,29,...,No,3,33,0,4,1,3,44,3,No


In [4]:
cols = df1.columns
cols

Index(['Employee_ID', 'Age', 'Gender', 'Marital_Status', 'Department',
       'Job_Role', 'Job_Level', 'Monthly_Income', 'Hourly_Rate',
       'Years_at_Company', 'Years_in_Current_Role',
       'Years_Since_Last_Promotion', 'Work_Life_Balance', 'Job_Satisfaction',
       'Performance_Rating', 'Training_Hours_Last_Year', 'Overtime',
       'Project_Count', 'Average_Hours_Worked_Per_Week', 'Absenteeism',
       'Work_Environment_Satisfaction', 'Relationship_with_Manager',
       'Job_Involvement', 'Distance_From_Home', 'Number_of_Companies_Worked',
       'Attrition'],
      dtype='object')

In [5]:
for col in cols:
    print(col, ': ', df1[col].isna().sum(), df1[col].dtype)

Employee_ID :  0 int64
Age :  0 int64
Gender :  0 object
Marital_Status :  0 object
Department :  0 object
Job_Role :  0 object
Job_Level :  0 int64
Monthly_Income :  0 int64
Hourly_Rate :  0 int64
Years_at_Company :  0 int64
Years_in_Current_Role :  0 int64
Years_Since_Last_Promotion :  0 int64
Work_Life_Balance :  0 int64
Job_Satisfaction :  0 int64
Performance_Rating :  0 int64
Training_Hours_Last_Year :  0 int64
Overtime :  0 object
Project_Count :  0 int64
Average_Hours_Worked_Per_Week :  0 int64
Absenteeism :  0 int64
Work_Environment_Satisfaction :  0 int64
Relationship_with_Manager :  0 int64
Job_Involvement :  0 int64
Distance_From_Home :  0 int64
Number_of_Companies_Worked :  0 int64
Attrition :  0 object


In [6]:
# Pre-processing columns:
# One-hot encoding the nominal columns so that no artificial ordering is introduced in the df
ohe = OneHotEncoder()

trans_cols = ['Gender', 'Department', 'Overtime', 'Marital_Status']

for col in trans_cols:
    encoded_col = ohe.fit_transform(df1[[col]]).toarray()  # convert sparse matrix to array
    encoded_df = pd.DataFrame(encoded_col, columns=ohe.get_feature_names_out([col]), index=df1.index)
    df1 = pd.concat([df1.drop(columns=[col]), encoded_df], axis=1)


# Ordinal encoding for Job_Role to preserve the assumed seniority order of the roles
job_order = ['Analyst', 'Assistant', 'Manager', 'Executive']
ole = OrdinalEncoder(categories=[job_order])
df1['Job_Role'] = ole.fit_transform(df1[['Job_Role']])

# Dropping the Employee_ID column: it is only a row identifier and carries no signal for training.
df1.drop(columns='Employee_ID', inplace=True)
df1.head()

Unnamed: 0,Age,Job_Role,Job_Level,Monthly_Income,Hourly_Rate,Years_at_Company,Years_in_Current_Role,Years_Since_Last_Promotion,Work_Life_Balance,Job_Satisfaction,...,Department_Finance,Department_HR,Department_IT,Department_Marketing,Department_Sales,Overtime_No,Overtime_Yes,Marital_Status_Divorced,Marital_Status_Married,Marital_Status_Single
0,58,2.0,1,15488,28,15,4,2,1,3,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
1,48,1.0,5,13079,28,6,9,1,2,1,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
2,34,1.0,1,13744,24,24,14,8,3,2,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
3,27,2.0,1,6809,26,10,8,2,3,5,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
4,40,3.0,1,10206,52,29,10,1,2,5,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0


In [7]:
# Converting the target column into binary format (Yes -> 1, No -> 0).
df1['Attrition'] = df1['Attrition'].map({'Yes': 1, 'No': 0})


In [8]:
# Splitting dataset into features and targets
X = df1[['Age', 'Job_Role', 'Job_Level', 'Monthly_Income', 'Hourly_Rate',
       'Years_at_Company', 'Years_in_Current_Role',
       'Years_Since_Last_Promotion', 'Work_Life_Balance', 'Job_Satisfaction',
       'Performance_Rating', 'Training_Hours_Last_Year', 'Project_Count',
       'Average_Hours_Worked_Per_Week', 'Absenteeism',
       'Work_Environment_Satisfaction', 'Relationship_with_Manager',
       'Job_Involvement', 'Distance_From_Home', 'Number_of_Companies_Worked',
       'Gender_Female', 'Gender_Male', 'Department_Finance',
       'Department_HR', 'Department_IT', 'Department_Marketing',
       'Department_Sales', 'Overtime_No', 'Overtime_Yes',
       'Marital_Status_Divorced', 'Marital_Status_Married',
       'Marital_Status_Single']]
Y = df1['Attrition']


In [9]:
#Standardizing Features.
sscaler = StandardScaler()
X = sscaler.fit_transform(X)
X

array([[ 1.52953545,  0.42748113, -1.468616  , ..., -0.6749845 ,
         1.36878165, -0.71614196],
       [ 0.68021819, -0.46496382,  1.39000395, ..., -0.6749845 ,
         1.36878165, -0.71614196],
       [-0.50882597, -0.46496382, -1.468616  , ..., -0.6749845 ,
         1.36878165, -0.71614196],
       ...,
       [ 0.85008164,  1.31992608, -1.468616  , ...,  1.4815155 ,
        -0.73057671, -0.71614196],
       [-1.01841632,  1.31992608,  0.67534896, ..., -0.6749845 ,
         1.36878165, -0.71614196],
       [ 0.68021819, -1.35740877, -0.75396101, ...,  1.4815155 ,
        -0.73057671, -0.71614196]], shape=(1000, 32))

In [10]:
# Splitting the dataset into 80/20. 80% for training.
xTrain, xTest, yTrain, yTest = train_test_split(X, Y, random_state=42, test_size=0.2)
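In the cells above, `StandardScaler` is fitted on all 1,000 rows before the split, so the test rows also contribute to the scaling statistics. A common way to avoid this is to split first and let a pipeline fit the scaler on the training rows only. The block below is a hedged sketch on synthetic data from `make_classification`; the names `X_demo`, `y_demo`, and `pipe` are illustrative and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target (assumption, not the real data).
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=42)

# Split first, so the test rows never influence preprocessing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on X_tr only, then reuses those statistics on X_te.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

On a dataset of this size the difference in scores is usually small, but the pipeline version keeps the evaluation free of information from the test set.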


In [11]:
logRegModel = LogisticRegression()
logRegModel.fit(xTrain, yTrain)

Parameter,Value
penalty,'l2'
dual,False
tol,0.0001
C,1.0
fit_intercept,True
intercept_scaling,1
class_weight,None
random_state,None
solver,'lbfgs'
max_iter,100


In [12]:
yPred = logRegModel.predict(xTest)
yPred

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0])
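Every one of the 200 test predictions above is class 0, which is typical when the positive class is rare. One option worth trying is `class_weight='balanced'`, which reweights the loss in inverse proportion to class frequency. The block below is only a sketch on synthetic imbalanced data (roughly 85/15, an assumption chosen to resemble the split seen here); whether it helps on the attrition data depends on how much signal the features carry.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: about 85% negatives and 15% positives (assumption).
X_imb, y_imb = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=42, stratify=y_imb
)

# Same model twice: default weights versus balanced class weights.
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

pred_plain = plain.predict(X_te)
pred_balanced = balanced.predict(X_te)

recall_plain = recall_score(y_te, pred_plain)
recall_balanced = recall_score(y_te, pred_balanced)
print(f"Recall (default weights):  {recall_plain:.3f}")
print(f"Recall (balanced weights): {recall_balanced:.3f}")
```

Balanced weights usually trade some precision for recall, so both metrics should be checked together.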

In [13]:
precision = precision_score(yTest, yPred)
f1Score = f1_score(yTest, yPred)
recall = recall_score(yTest, yPred)
accuracy = accuracy_score(yTest, yPred)
cm = confusion_matrix(yTest, yPred)
print(f'Accuracy: {accuracy}\nPrecision: {precision}\nRecall: {recall}\nF1 Score: {f1Score}' )
print(cm)


Accuracy: 0.845
Precision: 0.0
Recall: 0.0
F1 Score: 0.0
[[169   0]
 [ 31   0]]
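The numbers above can be reproduced by hand from the confusion matrix, which makes it clear why the accuracy looks good while precision, recall, and F1 are all zero. This is a small worked calculation using only the four printed counts; it relies on scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
# Counts copied from the printed confusion matrix [[169, 0], [31, 0]].
tn, fp = 169, 0   # true class 0: predicted 0 / predicted 1
fn, tp = 31, 0    # true class 1: predicted 0 / predicted 1

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
# Accuracy of a model that always predicts the majority class.
majority_baseline = max(tn + fp, fn + tp) / total

# Recall: share of actual positives that were found.
recall = tp / (tp + fn) if (tp + fn) else 0.0
# Precision is undefined when nothing is predicted positive; it is reported as 0 here.
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(f"Accuracy:          {accuracy}")
print(f"Majority baseline: {majority_baseline}")
print(f"Recall:            {recall}")
print(f"Precision:         {precision}")
```

The accuracy of 0.845 equals the majority-class baseline, so on this split the model does no better than always answering "No".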




## **2. Predicting Heart Disease Using KNN**

In [14]:
#importing Heart Disease Dataset for training KNN Model
df2 = pd.read_csv('./datasets/heart.csv')
df2.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0


In [15]:
cols1 = df2.columns
cols1

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')

In [16]:
for col in cols1:
    print(f'{col}: {df2[col].isna().sum()} \t {df2[col].dtype}')

age: 0 	 int64
sex: 0 	 int64
cp: 0 	 int64
trestbps: 0 	 int64
chol: 0 	 int64
fbs: 0 	 int64
restecg: 0 	 int64
thalach: 0 	 int64
exang: 0 	 int64
oldpeak: 0 	 float64
slope: 0 	 int64
ca: 0 	 int64
thal: 0 	 int64
target: 0 	 int64


In [17]:
#Splitting Data:
X1 = df2[
            [
                'age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
                'exang', 'oldpeak', 'slope', 'ca', 'thal'
            ]
        ]
Y1 = df2['target']
print(f'Shape of X1: {X1.shape}\nShape of Y1: {Y1.shape}')

Shape of X1: (1025, 13)
Shape of Y1: (1025,)


In [18]:
#Standardizing features (columns) by keeping the actual dataFrame intact.
sscaler = StandardScaler()
X1 = sscaler.fit_transform(X1)
print(X1)

[[-0.26843658  0.66150409 -0.91575542 ...  0.99543334  1.20922066
   1.08985168]
 [-0.15815703  0.66150409 -0.91575542 ... -2.24367514 -0.73197147
   1.08985168]
 [ 1.71659547  0.66150409 -0.91575542 ... -2.24367514 -0.73197147
   1.08985168]
 ...
 [-0.81983438  0.66150409 -0.91575542 ... -0.6241209   0.23862459
  -0.52212231]
 [-0.4889957  -1.51170646 -0.91575542 ...  0.99543334 -0.73197147
  -0.52212231]
 [-0.04787747  0.66150409 -0.91575542 ... -0.6241209   0.23862459
   1.08985168]]


In [19]:
#Splitting dataset into test and train with 20/80 ratio respectively.
xTrain1, xTest1, yTrain1, yTest1 = train_test_split(X1, Y1, random_state=42, test_size=0.2)

In [20]:
# Training KNN Model.
knnModel = KNeighborsClassifier(n_neighbors=5)
knnModel.fit(xTrain1, yTrain1)

Parameter,Value
n_neighbors,5
weights,'uniform'
algorithm,'auto'
leaf_size,30
p,2
metric,'minkowski'
metric_params,None
n_jobs,None


In [21]:
yPred1 = knnModel.predict(xTest1)
yPred1

array([1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0,
       1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1,
       1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0,
       1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1,
       0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1,
       1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 0, 1])

In [22]:
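#KNN Evaluation:
# Note: roc_auc_score receives hard class labels here, so the value equals balanced accuracy.
# Passing knnModel.predict_proba(xTest1)[:, 1] instead would give a ranking-based AUC.
auc = roc_auc_score(yTest1, yPred1)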
acc = accuracy_score(yTest1, yPred1)
print(f'KNN Model Evaluation Results:\nAccuracy:{acc}\nROC_AUC: {auc}')

KNN Model Evaluation Results:
Accuracy:0.8341463414634146
ROC_AUC: 0.8338568437083572
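The ROC AUC above is computed from hard 0/1 predictions, which collapses the ROC curve to a single point and makes the score equal to balanced accuracy. ROC AUC is normally computed from scores or probabilities. The sketch below shows both variants side by side on synthetic data; `X_demo`, `y_demo`, and `knn` are illustrative names, and the data is not the heart disease dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data with 13 features, like the heart data (assumption).
X_demo, y_demo = make_classification(n_samples=1000, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_tr, y_tr)

labels = knn.predict(X_te)               # hard 0/1 predictions
scores = knn.predict_proba(X_te)[:, 1]   # estimated probability of class 1

auc_from_labels = roc_auc_score(y_te, labels)
auc_from_scores = roc_auc_score(y_te, scores)
balanced_acc = balanced_accuracy_score(y_te, labels)

print(f"AUC from hard labels:   {auc_from_labels:.4f}")
print(f"Balanced accuracy:      {balanced_acc:.4f}")
print(f"AUC from probabilities: {auc_from_scores:.4f}")
```

With `n_neighbors=5` the probabilities take only six distinct values (0, 0.2, ..., 1), so the curve is coarse, but it still reflects how well the model ranks positives above negatives.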


## **Wine Quality Detection via Decision Tree**

In [48]:
df3 = pd.read_csv('./datasets/WineQT.csv')
df3.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,Id
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,0
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,1
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,2
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,3
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,4


In [49]:
cols = df3.columns
cols

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality', 'Id'],
      dtype='object')

In [50]:
for col in cols:
    print(f'{col}: {df3[col].isna().sum()} \t {df3[col].dtype}')

fixed acidity: 0 	 float64
volatile acidity: 0 	 float64
citric acid: 0 	 float64
residual sugar: 0 	 float64
chlorides: 0 	 float64
free sulfur dioxide: 0 	 float64
total sulfur dioxide: 0 	 float64
density: 0 	 float64
pH: 0 	 float64
sulphates: 0 	 float64
alcohol: 0 	 float64
quality: 0 	 int64
Id: 0 	 int64


In [51]:
# Separating features and target. The Id column is left out because it is only a row identifier.
X2 = df3[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol']]

Y2 = df3['quality']

In [52]:
sscaler = StandardScaler()
X2 = sscaler.fit_transform(X2)
X2

array([[-0.52157961,  0.93933222, -1.36502663, ...,  1.27069495,
        -0.57365783, -0.96338181],
       [-0.29259344,  1.94181282, -1.36502663, ..., -0.70892755,
         0.1308811 , -0.59360107],
       [-0.29259344,  1.27349242, -1.16156762, ..., -0.32577481,
        -0.04525363, -0.59360107],
       ...,
       [-1.20853813,  0.38239855, -0.9581086 , ...,  0.88754221,
        -0.45623467,  0.05351522],
       [-1.38027776,  0.10393172, -0.8563791 , ...,  1.33455374,
         0.60057372,  0.70063152],
       [-1.38027776,  0.6330187 , -0.75464959, ...,  1.65384769,
         0.30701583, -0.22382033]], shape=(1143, 11))

In [53]:
# Splitting the dataset into training and testing sets with an 80/20 ratio respectively
xTrain2, xTest2, yTrain2, yTest2 = train_test_split(X2, Y2, random_state=42, test_size=0.2)

In [54]:
# Training Decision Tree Classifier.
dst = DecisionTreeClassifier()
dst.fit(xTrain2, yTrain2)

Parameter,Value
criterion,'gini'
splitter,'best'
max_depth,None
min_samples_split,2
min_samples_leaf,1
min_weight_fraction_leaf,0.0
max_features,None
random_state,None
max_leaf_nodes,None
min_impurity_decrease,0.0


In [55]:
yPred2 = dst.predict(xTest2)

In [58]:
#Evaluating DecisionTreeClassifier Model 
acc = accuracy_score(yTest2, yPred2)
print(f'Decision Tree Classifier Evaluations:\nAccuracy:{acc}')

Decision Tree Classifier Evaluations:
Accuracy:0.5545851528384279
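Wine quality is a multi-class target, so a single accuracy figure hides how the tree does on each quality level. The tree above is also created without `random_state`, so its score can vary between runs. The block below is a hedged sketch of a more detailed, reproducible evaluation on synthetic four-class data; the class proportions and all variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-class data with uneven class sizes (assumption, not the wine data).
X_demo, y_demo = make_classification(
    n_samples=1000,
    n_features=11,
    n_informative=6,
    n_classes=4,
    weights=[0.05, 0.40, 0.40, 0.15],
    random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fixing random_state makes the fitted tree, and therefore the scores, repeatable.
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_tr, y_tr)
pred = tree.predict(X_te)

# Macro F1 weights every class equally, so rare classes are not drowned out.
macro_f1 = f1_score(y_te, pred, average="macro")
report = classification_report(y_te, pred, zero_division=0)

print(f"Macro F1: {macro_f1:.3f}")
print(report)
```

The per-class rows of the report show which classes the tree confuses, which a single accuracy value cannot reveal.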
