# Classification Problem Lab
## Predict Stay/Leave The Company

Your boss was extremely happy with your work and has decided to entrust you with a new task. A lot of people have left the company recently, and they would like to understand why. They have collected historical data on employees and would like you to build a model that predicts which employees will leave next. They would like a model that is better than random guessing. In this first phase, they also prefer false negatives to false positives. Fields in the dataset include:

- Employee satisfaction level
- Last evaluation
- Number of projects
- Average monthly hours
- Time spent at the company
- Whether they have had a work accident
- Whether they have had a promotion in the last 5 years
- Department
- Salary
- Whether the employee has left (this is the target variable to predict)

Your goal is to predict the binary outcome variable `left` using the rest of the data. Since the outcome is binary, this is a classification problem.

This dataset comes from https://www.kaggle.com/ludobenistant/hr-analytics/ and is released under the [CC BY-SA 4.0 License](https://creativecommons.org/licenses/by-sa/4.0/).

#### Step 1 — Data Pre-processing (read in the data, look at basic statistics for it, and convert text columns to numbers or another usable form)

In [24]:
#import ....
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('https://raw.githubusercontent.com/TipGreenTea/MUICT-AST-IntroToAI_Datasets/main/HR_comma_sep.csv')
df

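| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | promotion_last_5years | sales | salary | left |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 0 | sales | low | 1 |
| 1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 0 | sales | medium | 1 |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 0 | sales | medium | 1 |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 0 | sales | low | 1 |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 0 | sales | low | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 14994 | 0.40 | 0.57 | 2 | 151 | 3 | 0 | 0 | support | low | 1 |
| 14995 | 0.37 | 0.48 | 2 | 160 | 3 | 0 | 0 | support | low | 1 |
| 14996 | 0.37 | 0.53 | 2 | 143 | 3 | 0 | 0 | support | low | 1 |
| 14997 | 0.11 | 0.96 | 6 | 280 | 4 | 0 | 0 | support | low | 1 |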


In [25]:
#CODE HERE
# The sales and salary columns can simply be dropped from X
X = df.drop(['sales','salary', 'left'],axis=1)
y = df['left']
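
The cell above drops the two text columns. If you would rather keep them, one possible approach is sketched below on a few rows copied from the table above. The ordering `low < medium < high`, the `dept_` prefix, and the names `sample`, `salary_order`, and `X_sample` are assumptions made for illustration; `high` does not appear in the rows displayed above.

```python
import pandas as pd

# A few rows copied from the displayed table (not the full dataset)
sample = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.40],
    "sales": ["sales", "sales", "sales", "support"],
    "salary": ["low", "medium", "medium", "low"],
    "left": [1, 1, 1, 1],
})

# Assumption: salary is ordinal, so map it to increasing integers
salary_order = {"low": 0, "medium": 1, "high": 2}
encoded = sample.copy()
encoded["salary"] = encoded["salary"].map(salary_order)

# Department has no natural order, so one-hot encode it instead
encoded = pd.get_dummies(encoded, columns=["sales"], prefix="dept", dtype=int)

X_sample = encoded.drop("left", axis=1)
y_sample = encoded["left"]
print(X_sample)
```

Whether the extra columns help the model is something to check on the test set rather than assume.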


#### Step 2 — Separating Your Training and Testing Datasets

In [26]:
#CODE HERE
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
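
The classes in `left` are imbalanced (the test set in Step 5 contains 2299 stayers and 701 leavers). Passing `stratify=y` to `train_test_split` keeps the class ratio the same in both splits. The sketch below uses synthetic stand-in data, so `X_demo`, `y_demo`, and the 760/240 split are assumptions and not the lab's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1000 rows, 24% positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([0] * 760 + [1] * 240)

# stratify=y_demo preserves the 76/24 ratio in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print("positive rate overall:", y_demo.mean())
print("positive rate in train:", y_tr.mean())
print("positive rate in test:", y_te.mean())
```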

#### Step 3 — Transforming the Data (optional: feature scaling may be applied or skipped; if you skip it, go straight to the next step)

In [27]:
#CODE HERE
from sklearn.preprocessing import StandardScaler, MinMaxScaler
#sc = StandardScaler()
sc = MinMaxScaler()
X_train = sc.fit_transform(X_train)  # fit the scaler on the training set only
X_train

array([[0.72527473, 0.703125  , 0.6       , ..., 0.375     , 0.        ,
        0.        ],
       [0.04395604, 0.859375  , 0.        , ..., 0.375     , 0.        ,
        0.        ],
       [0.63736264, 0.59375   , 0.2       , ..., 0.125     , 0.        ,
        0.        ],
       ...,
       [0.65934066, 0.90625   , 0.6       , ..., 0.        , 0.        ,
        0.        ],
       [0.75824176, 0.359375  , 0.6       , ..., 0.125     , 0.        ,
        0.        ],
       [0.51648352, 0.5       , 0.6       , ..., 0.        , 0.        ,
        0.        ]])
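
The scaler is fitted on the training set only, and the test set is later transformed with those same fitted parameters. A small sketch of what that means, using made-up numbers (`train_hours` and `test_hours` are hypothetical toy arrays):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy values (assumption), one feature per row
train_hours = np.array([[100.0], [200.0], [300.0]])
test_hours = np.array([[150.0], [320.0]])

scaler = MinMaxScaler()
# fit_transform learns min=100 and max=300 from the training data
train_scaled = scaler.fit_transform(train_hours)
# transform reuses that min and max; it does not refit on the test data
test_scaled = scaler.transform(test_hours)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

A test value above the training maximum scales to more than 1. That is expected, and it shows that no information from the test set leaked into the scaler.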

#### Step 4 — Building ML models (create a model and train it on the training data)

In [28]:
#CODE HERE
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
model = LogisticRegression(random_state = 0)
#model = SVC()

model.fit(X_train, y_train)
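
The brief says false negatives are preferred to false positives in this phase. One common way to lean in that direction, though not necessarily what the lab intends, is to raise the decision threshold on the predicted probability so that fewer employees are flagged as leaving. The sketch below uses synthetic data from `make_classification`; the data, the 0.7 threshold, and the helper `count_errors` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption, not the HR dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=7, weights=[0.76, 0.24], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

clf = LogisticRegression(random_state=0).fit(X_tr, y_tr)
# Column 1 holds the predicted probability of the positive class
proba_leave = clf.predict_proba(X_te)[:, 1]

def count_errors(threshold):
    """Return (false positives, false negatives) at a given threshold."""
    pred = (proba_leave >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, pred, labels=[0, 1]).ravel()
    return fp, fn

fp_default, fn_default = count_errors(0.5)  # roughly what predict() does
fp_strict, fn_strict = count_errors(0.7)    # stricter, arbitrary choice

print("threshold 0.5 -> FP:", fp_default, "FN:", fn_default)
print("threshold 0.7 -> FP:", fp_strict, "FN:", fn_strict)
```

Raising the threshold can only shrink the set of predicted leavers, so false positives cannot increase and false negatives cannot decrease. How far to move it is a judgment call to discuss with the boss.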





#### Step 5 — Running Predictions on the Test Set (report the accuracy, precision, and recall you obtained to the LA)

In [29]:
#CODE HERE
y_pred = model.predict(sc.transform(X_test))
y_pred

from sklearn.metrics import classification_report
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.80      0.92      0.86      2299
           1       0.49      0.25      0.33       701

    accuracy                           0.77      3000
   macro avg       0.65      0.59      0.59      3000
weighted avg       0.73      0.77      0.73      3000

