# People Analytics ML Model

created by: Ari Sulistiyo Prabowo

>**Objective**:
1. Build an end-to-end data science project, from data preparation to a machine learning model
2. Build an API for the machine learning model
3. Deploy the machine learning model in Streamlit

>**About the dataset**:
A human capital dataset in which the company shares its employees' performance appraisal data, including **whether each employee was promoted or not**. Students are encouraged to **build a machine learning model that helps the company decide which employees should be promoted**.

## Data Preparation

In [None]:
# import library
import pandas as pd
import numpy as np

# import visualization library
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# import machine learning model
from imblearn.over_sampling import SMOTE 
from sklearn.utils import resample 
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
import pickle

import warnings
warnings.filterwarnings('ignore')

In [None]:
dataset = pd.read_csv("https://raw.githubusercontent.com/densaiko/data_science_learning/main/dataset/Human%20Capital.csv")
dataset = dataset.iloc[:,1:]
dataset.head()

Unnamed: 0,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,awards_won,avg_training_score,is_promoted
0,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,0,49.0,0
1,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,60.0,0
2,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,50.0,0
3,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,50.0,0
4,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,73.0,0


In [None]:
# explore any duplicate values and missing values
print(f'Dataset dimensions\t: {dataset.shape}')
print(f'Rows duplicated\t\t: {dataset.duplicated().sum()}')

# DataFrame.append was removed in pandas 2.0, so the summary is built with pd.concat
type_null = pd.concat([
    dataset.dtypes.rename('Columns Type'),
    dataset.isnull().sum().rename('Amount of Null Values'),
    round(dataset.isnull().sum()/dataset.shape[0]*100, 2).rename('Percentage of Null Values'),
], axis=1)
type_null = type_null.reset_index().rename(columns={'index':'feature'})
type_null

Dataset dimensions	: (54808, 12)
Rows duplicated		: 220


Unnamed: 0,feature,Columns Type,Amount of Null Values,Percentage of Null Values
0,department,object,0,0.0
1,region,object,0,0.0
2,education,object,2409,4.4
3,gender,object,0,0.0
4,recruitment_channel,object,0,0.0
5,no_of_trainings,int64,0,0.0
6,age,int64,0,0.0
7,previous_year_rating,float64,4124,7.52
8,length_of_service,int64,0,0.0
9,awards_won,int64,0,0.0


Insights:
- There are 220 duplicated rows, which **will be removed**
- education, previous_year_rating, and avg_training_score have null values. education and previous_year_rating **will be filled with the most frequent value**, and avg_training_score **with the median**

Both issues are handled in the data pre-processing step.
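The two insights above can be sketched on a small toy frame. This is only an illustration under stated assumptions: the values are made up, and plain pandas `fillna` is used here instead of the `SimpleImputer` that the pre-processing section relies on. It also shows that filling missing values can turn previously distinct rows into duplicates.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the real dataset)
toy = pd.DataFrame({
    "education": ["Bachelor's", "Bachelor's", np.nan, "Master's & above", "Bachelor's"],
    "previous_year_rating": [3.0, 3.0, 3.0, 5.0, np.nan],
    "avg_training_score": [50.0, 50.0, 50.0, 73.0, np.nan],
})

# Duplicates counted on the raw data
dup_before = int(toy.duplicated().sum())

filled = toy.copy()
# Most frequent value for education and previous_year_rating
for col in ["education", "previous_year_rating"]:
    filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
# Median for avg_training_score
filled["avg_training_score"] = filled["avg_training_score"].fillna(
    filled["avg_training_score"].median()
)

# Duplicates counted again after the missing values were filled
dup_after = int(filled.duplicated().sum())
deduped = filled.drop_duplicates()

print(dup_before, dup_after, deduped.shape)
```

Because the count of duplicates may grow after imputation, the number of rows removed by `drop_duplicates()` depends on whether it runs before or after the fill.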



## Exploratory Data Analysis
Let's take a look at the original dataset to understand the information in each feature. There are several questions that need to be answered.
- **Univariate Analysis**
  1. What is the composition of department feature?
  2. How many regions are there in the dataset, and what is the proportion of each?
  3. What is the proportion of education?
  4. What is the composition of gender?
  5. What is the composition of recruitment channel?
  6. What is the proportion of no_of_trainings?
  7. How is the distribution of age?
  8. What is the proportion of previous_year_rating?
  9. How is the distribution of length_of_service?
  10. How is the composition of awards_won?
  11. How is the distribution of avg_training_score?
  12. How is the composition of is_promoted?

- **Bivariate Analysis** (towards the is_promoted feature)
  - Boxplot with is_promoted
    1. age
    2. avg_training_score

  - Barplot with is_promoted
    1. department
    2. education
    3. gender
    4. recruitment_channel
    5. no_of_trainings
    6. previous_year_rating
    7. length_of_service
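For the bar plots against is_promoted, one possible way to prepare the numbers is to compute the promotion rate per category. The sketch below is a hedged illustration: `promotion_rate` is a hypothetical helper that is not part of this notebook, and the toy values are invented.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "gender": ["m", "m", "m", "m", "f", "f", "f", "f"],
    "is_promoted": [0, 0, 0, 1, 0, 0, 1, 1],
})

def promotion_rate(df, feature, target="is_promoted"):
    """Share of promoted employees (in percent) per category of `feature`."""
    summary = df.groupby(feature)[target].agg(employees="count", promoted="sum")
    summary["promotion_rate"] = (100 * summary["promoted"] / summary["employees"]).round(2)
    return summary.reset_index()

rates = promotion_rate(toy, "gender")
print(rates)
```

The same helper could be called with any of the categorical features listed above, which keeps the bivariate charts comparable with each other.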

In [None]:
def pivot_table(dataset, val, ind, agg):
  """
  Function to generate the summary of the feature 
  by using pivot table
  """
  pivot = pd.pivot_table(dataset, values=val, index=ind, aggfunc=agg).reset_index()
  pivot = pivot.rename(columns={val:'values'})
  pivot['percent_of_total'] = round(pivot['values']/pivot['values'].sum()*100,2)
  pivot = pivot.sort_values(by='values', ascending=False).reset_index(drop=True)
  pivot['label'] = "(" + pivot['values'].map("{:,}".format).astype('str') + ") " + pivot['percent_of_total'].astype('str') + "%"
  return pivot

In [None]:
def bar_chart_plotly(dataset, y_, x_, w, h, title_, x_range):
  """
  Creating bar chart plotly with the custom format
  """
  fig = px.bar(dataset, y=y_, x=x_,
             text='label', title=title_, orientation='h',
             width=w, height=h)
  fig.update_traces(textposition='outside')
  fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide',
                    yaxis=dict(autorange='reversed'), plot_bgcolor='white',
                    xaxis_range=[0, x_range])
  fig.show()

In [None]:
def bar_chart_vertical_plotly(dataset, y_, x_, w, h, title_, y_range):
  """
  Creating bar chart plotly with the custom format
  """
  fig = px.bar(dataset, y=y_, x=x_,
             text='label', title=title_, orientation='v',
             width=w, height=h)
  fig.update_traces(textposition='outside')
  fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide',
                    plot_bgcolor='white',
                    yaxis_range=[0, y_range])
  fig.show()

In [None]:
def stack_bar(data, x_val, y_val, hue, titles, text):
  """
  Visualize the stack bar chart with two features
  """

  # plot the dataframe passed as `data` instead of a global variable
  fig = px.bar(data, x=x_val, y=y_val, color=hue, 
               text=text, title=titles)
  fig.update_traces(textposition='inside')
  fig.show()

In [None]:
def pie_chart(dataset, val, ind, title, w, h):
  """
  Providing the composition of the data point in 
  the specific feature
  """
  fig = px.pie(dataset, values=val, names=ind, title=title,
               width=w, height=h)
  fig.show()  

In [None]:
def ecdf_viz(dataset, input):
  """
  Show the empirical cumulative distribution function for
  continuous data
  """
  fig = px.ecdf(dataset, x=input)
  fig.show()

In [None]:
def boxplot(df, x_axis, y_axis):
  """
  For bivariate analysis to see the distribution and outlier
  towards the is_promoted variable
  """
  fig = px.box(df, x=x_axis, y=y_axis)
  return fig.show()

In [None]:
# copy_dataset = dataset.copy()
listing_age = []
rank_age = []
for i in dataset['age'].to_list():
  if i < 10:
    listing_age.append('Less than 10')
    rank_age.append(1)
  elif i >= 10 and i <= 20:
    listing_age.append('10-20')
    rank_age.append(2)
  elif i >= 21 and i <= 30:
    listing_age.append('21-30')
    rank_age.append(3)
  elif i >= 31 and i <= 40:
    listing_age.append('31-40')
    rank_age.append(4)
  elif i >= 41 and i <= 50:
    listing_age.append('41-50')
    rank_age.append(5)
  elif i >= 51 and i <= 60:
    listing_age.append('51-60')
    rank_age.append(6)
  elif i >= 61 and i <= 70:
    listing_age.append('61-70')
    rank_age.append(7)
  elif i >= 71 and i <= 80:
    listing_age.append('71-80')
    rank_age.append(8)
  else:
    listing_age.append('above 80')
    rank_age.append(9)

dataset['age_viz'] = listing_age
dataset['age_rank'] = rank_age

In [None]:
# copy_dataset = dataset.copy()
listing_avg_training_score = []
listing_avg_trainig_rank = []
for i in dataset['avg_training_score'].to_list():
  if pd.isna(i):
    # missing scores get their own bucket instead of falling into 'above 80'
    listing_avg_training_score.append('missing')
    listing_avg_trainig_rank.append(7)
  elif i <= 50:
    # the lowest score in the dataset is 39
    listing_avg_training_score.append('39-50')
    listing_avg_trainig_rank.append(1)
  elif i <= 60:
    listing_avg_training_score.append('51-60')
    listing_avg_trainig_rank.append(2)
  elif i <= 70:
    listing_avg_training_score.append('61-70')
    listing_avg_trainig_rank.append(4)
  elif i <= 80:
    listing_avg_training_score.append('71-80')
    listing_avg_trainig_rank.append(5)
  else:
    listing_avg_training_score.append('above 80')
    listing_avg_trainig_rank.append(6)

# the ranks only define the sort order of the bars
dataset['avg_training_score_viz'] = listing_avg_training_score
dataset['rank_avg_training_score'] = listing_avg_trainig_rank

### Univariate Analysis

In [None]:
#department composition
department_composition = pivot_table(dataset, val='region',ind='department',agg='count')

bar_chart_plotly(department_composition, y_='department', x_='percent_of_total', 
                 title_='Proportion of Department', 
                 w=1000, h=600, 
                 x_range=department_composition['percent_of_total'].max()+10)

In [None]:
#Education composition
education_composition = pivot_table(dataset, val='region',ind='education',agg='count')

bar_chart_plotly(education_composition, y_='education', x_='percent_of_total', 
                 title_='Proportion of education', 
                 w=1000, h=600,
                 x_range=education_composition['percent_of_total'].max()+20)

In [None]:
#Gender composition
gender_composition = pivot_table(dataset, val='region',ind='gender',agg='count')

pie_chart(gender_composition, val="percent_of_total", ind="gender", title="Composition of Gender",
          w=700, h=500)

In [None]:
#recruitment_channel composition
recruitment_channel_composition = pivot_table(dataset, val='region',ind='recruitment_channel',agg='count')

# bar_chart_plotly(recruitment_channel_composition, y_='recruitment_channel', x_='percent_of_total', 
#                  title_='Proportion of recruitment channel', 
#                  w=1000, h=500,
#                  x_range=recruitment_channel_composition['percent_of_total'].max()+20)

pie_chart(recruitment_channel_composition, val="percent_of_total", ind="recruitment_channel", title="Composition of Recruitment Channel",
          w=700, h=500)

In [None]:
# number of trainings cumulative distribution
ecdf_viz(dataset, input='no_of_trainings')

In [None]:
#age composition
age_viz_composition = pivot_table(dataset, val='region',ind=['age_viz','age_rank'],agg='count')
age_viz_composition = age_viz_composition.sort_values(by='age_rank', ascending=True)


bar_chart_vertical_plotly(age_viz_composition, y_='percent_of_total', x_='age_viz', 
                 title_='Proportion of Age', 
                 w=1000, h=550,
                 y_range=age_viz_composition['percent_of_total'].max()+10)

In [None]:
# # from google.colab import files
# age_viz_composition.to_csv("age_viz_composition.csv")
# files.download("age_viz_composition.csv")

In [None]:
age_viz_composition

Unnamed: 0,age_viz,age_rank,values,percent_of_total,label
4,10-20,2,113,0.21,(113) 0.21%
1,21-30,3,18005,32.85,"(18,005) 32.85%"
0,31-40,4,26028,47.49,"(26,028) 47.49%"
2,41-50,5,7810,14.25,"(7,810) 14.25%"
3,51-60,6,2852,5.2,"(2,852) 5.2%"


In [None]:
#awards_won composition
awards_composition = pivot_table(dataset, val='region',ind='awards_won',agg='count')

pie_chart(awards_composition, val="percent_of_total", ind="awards_won", title="Composition of Employees Who Won Awards",
          w=700, h=500)

In [None]:
#average training score composition
avg_training_score_composition = pivot_table(dataset, val='region',ind=['avg_training_score_viz','rank_avg_training_score'],agg='count')
avg_training_score_composition = avg_training_score_composition.sort_values(by='rank_avg_training_score', ascending=True)


bar_chart_vertical_plotly(avg_training_score_composition, y_='percent_of_total', x_='avg_training_score_viz', 
                 title_='Proportion of avg_training_score', 
                 w=1000, h=550,
                 y_range=avg_training_score_composition['percent_of_total'].max()+10)

### Bivariate Analysis

In [None]:
boxplot(dataset, x_axis='is_promoted', y_axis='age')

In [None]:
gender_promoted = pivot_table(dataset, val='department', ind=['gender','is_promoted'], agg='count')

gender_promoted['percent'] = round(100 * gender_promoted['values'] / gender_promoted.groupby(['gender'])['values'].transform('sum'),2)
gender_promoted

Unnamed: 0,gender,is_promoted,values,percent_of_total,label,percent
0,m,0,35295,64.4,"(35,295) 64.4%",91.68
1,f,0,14845,27.09,"(14,845) 27.09%",91.01
2,m,1,3201,5.84,"(3,201) 5.84%",8.32
3,f,1,1467,2.68,"(1,467) 2.68%",8.99


In [None]:
stack_bar(gender_promoted, x_val='gender', y_val='percent', hue='is_promoted',
          text='percent', titles='Stack Bar Chart between Gender and promotion')

## Data Pre-processing
- Remove duplicated rows
- Fill null values in:
  - education and previous_year_rating with **the most frequent value**
  - avg_training_score with **the median**
- Label encoding
- SMOTE implementation
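The class-balancing step in the list above can be sketched as follows. Note the swap: this is plain random oversampling with `sklearn.utils.resample`, used here as a simple stand-in for SMOTE, and the toy training split is made up. With imblearn, `SMOTE(random_state=43).fit_resample(X_train, y_train)` would play the same role while generating synthetic minority rows instead of repeating existing ones.

```python
import pandas as pd
from sklearn.utils import resample

# Toy training split (hypothetical): 8 not-promoted rows, 2 promoted rows
train = pd.DataFrame({
    "avg_training_score": [49, 60, 50, 50, 73, 55, 62, 58, 85, 90],
    "is_promoted":        [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

majority = train[train["is_promoted"] == 0]
minority = train[train["is_promoted"] == 1]

# Draw minority rows with replacement until both classes have the same size
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=43)

balanced = pd.concat([majority, minority_upsampled]).reset_index(drop=True)
counts = balanced["is_promoted"].value_counts().to_dict()
print(counts)
```

Whichever technique is used, it is usually applied to the training split only, so that the test split keeps the original class distribution.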

In [None]:
data_prepro = dataset.copy()

# set the imputer
imp_median = SimpleImputer(missing_values=np.nan, strategy='median')
imp_most_frequent = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# fill the missing value
education_imputer = imp_most_frequent.fit(data_prepro[['education']])
data_prepro['education'] = education_imputer.transform(data_prepro[['education']])

previous_year_imputer = imp_most_frequent.fit(data_prepro[['previous_year_rating']])
data_prepro['previous_year_rating'] = previous_year_imputer.transform(data_prepro[['previous_year_rating']])

avg_training_score_impt = imp_median.fit(data_prepro[['avg_training_score']])
data_prepro['avg_training_score'] = avg_training_score_impt.transform(data_prepro[['avg_training_score']])

# remove duplicates
data_prepro = data_prepro.drop_duplicates()
print(f'Dataset dimensions\t: {data_prepro.shape}')
print(f'Rows duplicated\t\t: {data_prepro.duplicated().sum()}')

# check any missing values
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used)
type_null = pd.concat([
    data_prepro.dtypes.rename('Columns Type'),
    data_prepro.isnull().sum().rename('Amount of Null Values'),
    (data_prepro.isnull().sum()/data_prepro.shape[0]*100).rename('Percentage of Null Values'),
], axis=1)
type_null

Dataset dimensions	: (54516, 16)
Rows duplicated		: 0


Unnamed: 0,Columns Type,Amount of Null Values,Percentage of Null Values
department,object,0,0.0
region,object,0,0.0
education,object,0,0.0
gender,object,0,0.0
recruitment_channel,object,0,0.0
no_of_trainings,int64,0,0.0
age,int64,0,0.0
previous_year_rating,float64,0,0.0
length_of_service,int64,0,0.0
awards_won,int64,0,0.0


In [None]:
data_prepro['avg_training_score'].min()

39

### Label Encoder

In [None]:
data_prepro.head()

Unnamed: 0,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,awards_won,avg_training_score,is_promoted,age_viz,age_rank,avg_training_score_viz,rank_avg_training_score
0,1,7,3,2,2,1,35,5,8,0,49,0,31-40,4,10-20,1
1,2,22,2,1,3,1,30,5,4,0,60,0,21-30,3,21-30,2
2,1,19,2,1,2,1,34,3,7,0,50,0,31-40,4,10-20,1
3,1,23,2,1,3,2,39,1,10,0,50,0,31-40,4,10-20,1
4,3,26,2,1,3,1,45,3,2,0,73,0,41-50,5,71-80,5


In [None]:
## data preprocessing 
data_prepro_new = data_prepro.copy()

columns = data_prepro_new.select_dtypes(include=['object']).columns.to_list()

label_encoding = LabelEncoder()

#encode the data into a label
for i in columns:
  data_prepro_new[i] = label_encoding.fit_transform(data_prepro_new[i])

In [None]:
pd.concat([data_prepro_new, data_prepro], axis=1)

Unnamed: 0,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,awards_won,...,age.1,previous_year_rating.1,length_of_service.1,awards_won.1,avg_training_score,is_promoted,age_viz,age_rank,avg_training_score_viz,rank_avg_training_score
0,1,7,3,2,2,1,35,5,8,0,...,35,5,8,0,49,0,31-40,4,10-20,1
1,2,22,2,1,3,1,30,5,4,0,...,30,5,4,0,60,0,21-30,3,21-30,2
2,1,19,2,1,2,1,34,3,7,0,...,34,3,7,0,50,0,31-40,4,10-20,1
3,1,23,2,1,3,2,39,1,10,0,...,39,1,10,0,50,0,31-40,4,10-20,1
4,3,26,2,1,3,1,45,3,2,0,...,45,3,2,0,73,0,41-50,5,71-80,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54803,3,14,2,1,2,1,48,3,17,0,...,48,3,17,0,78,0,41-50,5,71-80,5
54804,2,27,3,2,3,1,37,2,6,0,...,37,2,6,0,56,0,31-40,4,21-30,2
54805,4,1,2,1,3,1,27,5,3,0,...,27,5,3,0,79,0,21-30,3,71-80,5
54806,1,9,2,1,2,1,29,1,2,0,...,29,1,2,0,60,0,21-30,3,above 80,6


In [None]:
data_prepro_new.head()

Unnamed: 0,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,awards_won,avg_training_score,is_promoted,age_viz,age_rank,avg_training_score_viz,rank_avg_training_score
0,7,31,2,0,2,1,35,5.0,8,0,49.0,0,2,4,0,1
1,4,14,0,1,0,1,30,5.0,4,0,60.0,0,1,3,1,2
2,7,10,0,1,2,1,34,3.0,7,0,50.0,0,2,4,0,1
3,7,15,0,1,0,2,39,1.0,10,0,50.0,0,2,4,0,1
4,8,18,0,1,0,1,45,3.0,2,0,73.0,0,3,5,3,5


In [None]:
# Separating dependent and independent variable
X = data_prepro_new.drop(columns=["is_promoted","age_viz","avg_training_score_viz","age_rank","rank_avg_training_score"])
y = data_prepro_new["is_promoted"] 

# Performing train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)

### Manual Encoding

In [None]:
dep = {'Sales & Marketing':1, 'Operations':2, 'Technology':3, 'Analytics':4,
       'R&D':5, 'Procurement':6, 'Finance':7, 'HR':8, 'Legal':9}
edu = {'Below Secondary':1, "Bachelor's":2, "Master's & above":3}
rec = {'referred':1, 'sourcing':2, 'other':3}
gen = {'m':1, 'f':2}
reg = {'region_1':1,'region_2':2,'region_3':3,'region_4':4,'region_5':5,
       'region_6':6,'region_7':7,'region_8':8,'region_9':9,'region_10':10,
       'region_11':11,'region_12':12,'region_13':13,'region_14':14,'region_15':15,
       'region_16':16,'region_17':17,'region_18':18,'region_19':19,'region_20':20,
       'region_21':21,'region_22':22,'region_23':23,'region_24':24,'region_25':25,
       'region_26':26,'region_27':27,'region_28':28,'region_29':29,'region_30':30,
       'region_31':31,'region_32':32,'region_33':33,'region_34':34}


# replacing
data_prepro = data_prepro.replace({'department': dep, 'education':edu,
                                   'gender':gen, 'recruitment_channel':rec,
                                   'region':reg})

data_prepro[['previous_year_rating','avg_training_score']] = data_prepro[['previous_year_rating','avg_training_score']].astype('int64')
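One caveat of the manual mapping is that `DataFrame.replace` leaves any category missing from the dictionary unchanged, so a typo or a new category would pass through silently as a string. A minimal sketch of a coverage check, assuming the same `edu` dictionary as above and an invented unseen value ("PhD"):

```python
import pandas as pd

# Same mapping as in the manual encoding cell
edu = {'Below Secondary': 1, "Bachelor's": 2, "Master's & above": 3}

# Toy column; "PhD" is a hypothetical category that the mapping does not cover
toy = pd.Series(["Bachelor's", "Master's & above", "PhD"], name="education")

# Series.map returns NaN for values that are not keys of the dictionary
encoded = toy.map(edu)
unmapped = sorted(toy[encoded.isna()].unique())

print(unmapped)
```

An empty `unmapped` list would indicate that every category in the column is covered by the dictionary.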

In [None]:
# check the data types after the manual encoding

data_prepro.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 54516 entries, 0 to 54807
Data columns (total 16 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   department               54516 non-null  int64 
 1   region                   54516 non-null  int64 
 2   education                54516 non-null  int64 
 3   gender                   54516 non-null  int64 
 4   recruitment_channel      54516 non-null  int64 
 5   no_of_trainings          54516 non-null  int64 
 6   age                      54516 non-null  int64 
 7   previous_year_rating     54516 non-null  int64 
 8   length_of_service        54516 non-null  int64 
 9   awards_won               54516 non-null  int64 
 10  avg_training_score       54516 non-null  int64 
 11  is_promoted              54516 non-null  int64 
 12  age_viz                  54516 non-null  object
 13  age_rank                 54516 non-null  int64 
 14  avg_training_score_viz   54516 non-nul

In [None]:
# Separating dependent and independent variable
X = data_prepro.drop(columns=["is_promoted","age_viz","avg_training_score_viz","age_rank","rank_avg_training_score"])
y = data_prepro["is_promoted"] 

# Performing train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)

## Modelling & Evaluation

In [None]:
def basic_model(model, x_train, y_train):
  """
  Executing the machine learning model
  as a benchmark
  """

  clf = model
  return clf.fit(x_train, y_train)

In [None]:
def evaluation(model, x_train, x_test, y_train, y_test):
  """
  Print the training and testing accuracy of a fitted model,
  followed by the classification report on the test set
  """
  y_predict_train = model.predict(x_train)
  y_predict_test = model.predict(x_test)

  training_acc = accuracy_score(y_train, y_predict_train)
  testing_acc = accuracy_score(y_test, y_predict_test)

  # two decimal places ("{:.2}" would give two significant digits)
  print("Training Accuracy: {:.2f}".format(training_acc))
  print("Testing Accuracy: {:.2f}".format(testing_acc))

  print(classification_report(y_test, y_predict_test))

In [None]:
# Logistic Regression
model_log = basic_model(LogisticRegression(), X_train, y_train)

evaluation(model_log, X_train, X_test, y_train, y_test)

Training Accuracy: 0.92
Testing Accuracy: 0.92
              precision    recall  f1-score   support

           0       0.92      0.99      0.96      9997
           1       0.61      0.10      0.17       907

    accuracy                           0.92     10904
   macro avg       0.77      0.55      0.56     10904
weighted avg       0.90      0.92      0.89     10904



In [None]:
# Gradient Boosting Classifier
model_grad = basic_model(GradientBoostingClassifier(), X_train, y_train)

evaluation(model_grad, X_train, X_test, y_train, y_test)

Training Accuracy: 0.94
Testing Accuracy: 0.94
              precision    recall  f1-score   support

           0       0.94      1.00      0.97      9997
           1       0.94      0.27      0.42       907

    accuracy                           0.94     10904
   macro avg       0.94      0.64      0.69     10904
weighted avg       0.94      0.94      0.92     10904
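The two reports above show high accuracy but low recall on the promoted class. A short arithmetic sketch, using only the support and recall figures printed in those reports, may help put the accuracy into perspective. The recall values are rounded to two decimals in the reports, so the resulting counts are approximations.

```python
# Support per class on the test split, taken from the classification reports above
support_not_promoted = 9997
support_promoted = 907
total = support_not_promoted + support_promoted

# Accuracy of a trivial model that always predicts "not promoted"
baseline_accuracy = support_not_promoted / total

# Recall on the promoted class as reported above (rounded to 2 decimals)
recall_promoted = {"LogisticRegression": 0.10, "GradientBoostingClassifier": 0.27}

# Approximate number of promoted employees that each model identifies
found = {name: round(r * support_promoted) for name, r in recall_promoted.items()}

print(f"Baseline accuracy: {baseline_accuracy:.2f}")
print(found)
```

Since always predicting "not promoted" already reaches roughly the same accuracy as the logistic regression, recall and f1-score on class 1 are the more informative figures for this imbalanced target.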



## Testing Your ML Model in API

In [None]:
# pickle.dump(model_log, open('model_log.pkl', 'wb'))

In [None]:
# save the trained gradient boosting model; the context manager closes the file
with open('model_grad.pkl', 'wb') as file:
  pickle.dump(model_grad, file)

In [None]:
testing = pd.read_csv("https://raw.githubusercontent.com/densaiko/data_science_learning/main/dataset/Human%20Capital.csv")
testing.iloc[:,1:].head()

Unnamed: 0,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,awards_won,avg_training_score,is_promoted
0,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,0,49.0,0
1,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,60.0,0
2,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,50.0,0
3,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,50.0,0
4,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,73.0,0


In [None]:
!pip install pyngrok

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyngrok
  Downloading pyngrok-6.0.0.tar.gz (681 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyngrok
  Building wheel for pyngrok (setup.py) ... done
  Created wheel for pyngrok: filename=pyngrok-6.0.0-py3-none-any.whl size=19867 sha256=008fc5458a0edd80ee9ee1000b8608f08db1be1ce8df56deb976a39828f6a3df
  Stored in directory: /root/.cache/pip/wheels/5c/42/78/0c3d438d7f5730451a25f7ac6cbf4391759d22a67576ed7c2c
Successfully built pyngrok
Installing collected packages: pyngrok
Successfully installed pyngrok-6.0.0


In [None]:
!pip install flask_ngrok

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting flask_ngrok
  Downloading flask_ngrok-0.0.25-py3-none-any.whl (3.1 kB)
Installing collected packages: flask_ngrok
Successfully installed flask_ngrok-0.0.25


In [None]:
!ngrok authtoken your_token

Authtoken saved to configuration file: /root/.ngrok2/ngrok.yml


In [None]:
from flask import Flask, jsonify, request
from flask_ngrok import run_with_ngrok

app = Flask(__name__)
run_with_ngrok(app)

# feature order used when the model was trained (columns of X)
FEATURES = ['department', 'region', 'education', 'gender', 'recruitment_channel',
            'no_of_trainings', 'age', 'previous_year_rating', 'length_of_service',
            'awards_won', 'avg_training_score']

# load the pickled model once instead of on every request
with open('/content/model_grad.pkl','rb') as file:
  model = pickle.load(file)

@app.route('/', methods=['GET'])
def index():
  # simple response to check that the API is running
  return jsonify({"Usia":35})

@app.route('/predict', methods=['GET', 'POST'])
def result():
  # the request body is a JSON object with one value per feature
  data = request.json
  df1 = pd.DataFrame(data, index=[0])

  # encode the categorical inputs with the same manual mapping
  # (dep, edu, gen, rec, reg) that was applied to the training data
  df1 = df1.replace({'department': dep, 'education':edu,
                     'gender':gen, 'recruitment_channel':rec,
                     'region':reg})
  df1 = df1[FEATURES]

  prediction = model.predict(df1)

  # cast numpy int64 to a plain int so that it is JSON serializable
  return jsonify({"Status":"Complete", "Prediction":int(prediction[0])})

if __name__ == "__main__":
  app.run()

 * Serving Flask app '__main__' (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off


 * Running on http://127.0.0.1:5000
INFO:werkzeug:Press CTRL+C to quit


 * Running on http://94c5-34-125-231-135.ngrok-free.app
 * Traffic stats available on http://127.0.0.1:4040


INFO:werkzeug:127.0.0.1 - - [02/Jun/2023 12:45:31] "GET /predict HTTP/1.1" 200 -
