# ML Pipelines and Streamlit App

1. **Education:** The educational qualifications of employees, including degree, institution, and field of study.

2. **Joining Year:** The year each employee joined the company, indicating their length of service.

3. **City:** The location or city where each employee is based or works.

4. **Payment Tier:** Categorization of employees into different salary tiers.

5. **Age:** The age of each employee, providing demographic insights.

6. **Gender:** Gender identity of employees, promoting diversity analysis.

7. **Ever Benched:** Indicates if an employee has ever been temporarily without assigned work.

8. **Experience in Current Domain:** The number of years of experience employees have in their current field.

Target Column

9. **Leave or Not:** Whether the employee left the company (1 = left, 0 = stayed).

---

**What is our Key target for ML?**

Are there any patterns in leave-taking behavior among employees?

You will find the answer in the EDA and in the last stage of the ML work, where you interpret the model and find the most influential features (importance) for predicting "Leave".


### **Importing Data**

In [90]:
## Importing Required Libraries
import pandas as pd

## Reading the dataset
df = pd.read_csv("Employee.csv")

In [91]:
## Check for and drop duplicates, and apply further data-cleaning steps as required (do not drop NA rows)

In [92]:
import seaborn as sns
import matplotlib.pyplot as plt

### **Data Cleaning**

##### EDA was already done in the previous assignment

In [93]:
df.shape

(4653, 9)

In [94]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4653 entries, 0 to 4652
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Education                  4653 non-null   object
 1   JoiningYear                4653 non-null   int64 
 2   City                       4653 non-null   object
 3   PaymentTier                4653 non-null   int64 
 4   Age                        4653 non-null   int64 
 5   Gender                     4653 non-null   object
 6   EverBenched                4653 non-null   object
 7   ExperienceInCurrentDomain  4653 non-null   int64 
 8   LeaveOrNot                 4653 non-null   int64 
dtypes: int64(5), object(4)
memory usage: 327.3+ KB


In [95]:
df.describe()

| | JoiningYear | PaymentTier | Age | ExperienceInCurrentDomain | LeaveOrNot |
|---|---|---|---|---|---|
| count | 4653.0 | 4653.0 | 4653.0 | 4653.0 | 4653.0 |
| mean | 2015.06297 | 2.698259 | 29.393295 | 2.905652 | 0.343864 |
| std | 1.863377 | 0.561435 | 4.826087 | 1.55824 | 0.475047 |
| min | 2012.0 | 1.0 | 22.0 | 0.0 | 0.0 |
| 25% | 2013.0 | 3.0 | 26.0 | 2.0 | 0.0 |
| 50% | 2015.0 | 3.0 | 28.0 | 3.0 | 0.0 |
| 75% | 2017.0 | 3.0 | 32.0 | 4.0 | 1.0 |
| max | 2018.0 | 3.0 | 41.0 | 7.0 | 1.0 |


In [96]:
df.sample(10)

| | Education | JoiningYear | City | PaymentTier | Age | Gender | EverBenched | ExperienceInCurrentDomain | LeaveOrNot |
|---|---|---|---|---|---|---|---|---|---|
| 1843 | Masters | 2017 | New Delhi | 2 | 25 | Female | Yes | 3 | 1 |
| 1041 | Masters | 2013 | New Delhi | 1 | 24 | Female | No | 2 | 1 |
| 1774 | Masters | 2014 | New Delhi | 1 | 25 | Female | No | 3 | 0 |
| 2088 | Bachelors | 2014 | Bangalore | 3 | 26 | Male | No | 4 | 1 |
| 146 | Bachelors | 2014 | Bangalore | 3 | 25 | Male | No | 3 | 0 |
| 1524 | Bachelors | 2017 | Bangalore | 3 | 24 | Female | No | 2 | 0 |
| 4246 | Bachelors | 2013 | Pune | 2 | 28 | Male | No | 5 | 0 |
| 1705 | Bachelors | 2014 | Bangalore | 3 | 26 | Male | No | 4 | 0 |
| 2176 | Masters | 2016 | New Delhi | 3 | 28 | Female | No | 2 | 0 |
| 2057 | Masters | 2017 | New Delhi | 2 | 30 | Male | No | 2 | 0 |


In [97]:
df.isnull().sum()

Education                    0
JoiningYear                  0
City                         0
PaymentTier                  0
Age                          0
Gender                       0
EverBenched                  0
ExperienceInCurrentDomain    0
LeaveOrNot                   0
dtype: int64

In [98]:
print("Duplicates before: ", df.duplicated().sum())

Duplicates before: 1889


In [99]:
df  =  df.drop_duplicates()

In [100]:

df.duplicated().sum()



np.int64(0)
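A small sketch of what the duplicate-handling cells above do, on a made-up frame (the `toy` data is an assumption, not rows from `Employee.csv`). `duplicated()` flags only the repeats, not the first occurrence, and `drop_duplicates()` keeps the first occurrence together with its original index label, which is why later outputs such as `X_train.head()` show non-consecutive indices.

```python
import pandas as pd

# Tiny made-up frame: rows 0 and 2 are identical across all columns.
toy = pd.DataFrame({
    "City": ["Pune", "Bangalore", "Pune", "Pune"],
    "Age": [28, 30, 28, 31],
    "LeaveOrNot": [1, 0, 1, 1],
})

n_before = int(toy.duplicated().sum())   # counts repeats only, not first occurrences
deduped = toy.drop_duplicates()          # keeps first occurrences and their index labels
n_after = int(deduped.duplicated().sum())

print(n_before, n_after, deduped.index.tolist())
```

If a clean 0..n-1 index is preferred, `drop_duplicates(ignore_index=True)` or a later `reset_index(drop=True)` would do that; the notebook keeps the original labels.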

In [101]:
df['LeaveOrNot']= df['LeaveOrNot'].map({1:'Yes', 0:'No'})

### dropping target variable, don't want my model to cheat :)

In [102]:
X = df.drop('LeaveOrNot', axis  = 1)
y = df['LeaveOrNot']

In [103]:
X.head()

| | Education | JoiningYear | City | PaymentTier | Age | Gender | EverBenched | ExperienceInCurrentDomain |
|---|---|---|---|---|---|---|---|---|
| 0 | Bachelors | 2017 | Bangalore | 3 | 34 | Male | No | 0 |
| 1 | Bachelors | 2013 | Pune | 1 | 28 | Female | No | 3 |
| 2 | Bachelors | 2014 | New Delhi | 3 | 38 | Female | No | 2 |
| 3 | Masters | 2016 | Bangalore | 3 | 27 | Male | No | 5 |
| 4 | Masters | 2017 | Pune | 3 | 24 | Male | Yes | 2 |


In [104]:
y.head()

0     No
1    Yes
2     No
3    Yes
4    Yes
Name: LeaveOrNot, dtype: object

In [105]:
import numpy as np

In [106]:

categorical_features = list(X.select_dtypes(include=[object]).columns)

print("Categorical Features:\n ", categorical_features)



Categorical Features:
 ['Education', 'City', 'Gender', 'EverBenched']


In [107]:

numerical_features = list(X.select_dtypes(include=[np.int64, np.float64]).columns)

print("Numerical Features:\n ", numerical_features)



Numerical Features:
 ['JoiningYear', 'PaymentTier', 'Age', 'ExperienceInCurrentDomain']


In [108]:
from sklearn.model_selection import train_test_split



In [109]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size  = 0.2, random_state  = 42)



In [110]:
print('Train Data','\n',y_train.value_counts(normalize  = True),'\n','\n','Test Data','\n', y_test.value_counts(normalize = True))



Train Data 
 LeaveOrNot
No     0.607417
Yes    0.392583
Name: proportion, dtype: float64 
 
 Test Data 
 LeaveOrNot
No     0.60217
Yes    0.39783
Name: proportion, dtype: float64
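The split above is not stratified, so the class shares in train and test differ slightly (about 39.3% vs 39.8% "Yes"). If matching proportions matter, `train_test_split` accepts a `stratify` argument. A minimal sketch on synthetic labels (`X_demo` and `y_demo` are made up for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels with a 60/40 split, similar in spirit to the proportions above.
y_demo = pd.Series(["No"] * 60 + ["Yes"] * 40, name="LeaveOrNot")
X_demo = pd.DataFrame({"Age": range(100)})

# stratify=y_demo asks the splitter to preserve the class shares in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(y_tr.value_counts(normalize=True).to_dict())
print(y_te.value_counts(normalize=True).to_dict())
```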


In [111]:
def summarize_cat(df, categorical_features):
    """Return one row per categorical column with the list of its unique values."""
    results = []

    # Iterating over a DataFrame yields its column names.
    for column in df[categorical_features]:
        members = df[column].unique().tolist()
        results.append([column, members])

    return pd.DataFrame(results, columns=['Column Name', 'Members'])


summarize_cat(X_train, categorical_features)



| | Column Name | Members |
|---|---|---|
| 0 | Education | [Bachelors, Masters, PHD] |
| 1 | City | [Bangalore, New Delhi, Pune] |
| 2 | Gender | [Male, Female] |
| 3 | EverBenched | [No, Yes] |


## Schema generation:

In [112]:
summarize_cat(df,categorical_features).to_dict()

{'Column Name': {0: 'Education', 1: 'City', 2: 'Gender', 3: 'EverBenched'},
 'Members': {0: ['Bachelors', 'Masters', 'PHD'],
  1: ['Bangalore', 'Pune', 'New Delhi'],
  2: ['Male', 'Female'],
  3: ['No', 'Yes']}}

In [113]:
my_feature_dict = {'CATEGORICAL' : summarize_cat(df,categorical_features).to_dict(), 'NUMERICAL' : {'Column Name': numerical_features}}

my_feature_dict.get('NUMERICAL')

{'Column Name': ['JoiningYear',
  'PaymentTier',
  'Age',
  'ExperienceInCurrentDomain']}

In [114]:
my_feature_dict

{'CATEGORICAL': {'Column Name': {0: 'Education',
   1: 'City',
   2: 'Gender',
   3: 'EverBenched'},
  'Members': {0: ['Bachelors', 'Masters', 'PHD'],
   1: ['Bangalore', 'Pune', 'New Delhi'],
   2: ['Male', 'Female'],
   3: ['No', 'Yes']}},
 'NUMERICAL': {'Column Name': ['JoiningYear',
   'PaymentTier',
   'Age',
   'ExperienceInCurrentDomain']}}

In [115]:
import pickle

with open('my_feature_dict.pkl', 'wb') as fp:
    pickle.dump(my_feature_dict, fp)
    
    print('dictionary saved successfully to file')
    

dictionary saved successfully to file
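The pickled dictionary stores column names and members as two index-keyed dicts, which is awkward to consume directly. Below is a hedged sketch of how a consumer (for example the Streamlit app) might load it and flatten it into a `{column: allowed values}` mapping. The literal dictionary mirrors the output above, the in-memory buffer stands in for `my_feature_dict.pkl`, and the `options` name is hypothetical.

```python
import io
import pickle

# Same structure as my_feature_dict above, written out literally for the example.
feature_dict = {
    "CATEGORICAL": {
        "Column Name": {0: "Education", 1: "City", 2: "Gender", 3: "EverBenched"},
        "Members": {
            0: ["Bachelors", "Masters", "PHD"],
            1: ["Bangalore", "Pune", "New Delhi"],
            2: ["Male", "Female"],
            3: ["No", "Yes"],
        },
    },
    "NUMERICAL": {
        "Column Name": ["JoiningYear", "PaymentTier", "Age", "ExperienceInCurrentDomain"]
    },
}

# Round-trip through pickle using an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
pickle.dump(feature_dict, buffer)
buffer.seek(0)
loaded = pickle.load(buffer)

# Flatten the two index-keyed dicts into {column: allowed values}.
cat = loaded["CATEGORICAL"]
options = {cat["Column Name"][i]: cat["Members"][i] for i in cat["Column Name"]}

print(options["City"])
```

A mapping like this could feed one select box per categorical column, so the app only offers values the encoder has seen.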


In [116]:
from sklearn.pipeline import Pipeline


In [118]:
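from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Numeric columns: standardize, then fill any missing value with 0.
# StandardScaler ignores NaNs when fitting and passes them through,
# so a 0 in the scaled column corresponds to the training mean.
pipeline_num = Pipeline(steps=[
    ('scale_data', StandardScaler()),
    ('simple_imputer1', SimpleImputer(strategy='constant', fill_value=0)),
])

# Categorical columns: one-hot encode; categories unseen during fit become all zeros.
pipeline_cat = Pipeline(steps=[
    ('OneHotEncode', OneHotEncoder(handle_unknown="ignore"))
])

# Route each column group to its own sub-pipeline; any other column is dropped.
preprocessor_stage_2 = ColumnTransformer(
    transformers=[
        ('cat', pipeline_cat, categorical_features),
        ('num', pipeline_num, numerical_features),
    ], remainder='drop')

# Wrap the ColumnTransformer in a single-step Pipeline.
preprocessor_stack = Pipeline(steps=[
    ('preprocessor_stage_2', preprocessor_stage_2)
])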
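The numeric sub-pipeline scales first and imputes second, which can look backwards. It still works because `StandardScaler` disregards NaNs in `fit` and leaves them in place in `transform`, so filling with 0 afterwards amounts to mean imputation on the scaled column. A minimal sketch on one made-up column (the dataset itself has no missing values, so this only matters for future inputs):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# One made-up numeric column with a missing value in the third row.
col = np.array([[20.0], [30.0], [np.nan], [40.0]])

num_demo = Pipeline(steps=[
    ("scale_data", StandardScaler()),
    ("simple_imputer1", SimpleImputer(strategy="constant", fill_value=0)),
])

# The NaN survives scaling and is then replaced by 0, the mean of the scaled column.
out = num_demo.fit_transform(col)
print(out.ravel())
```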

In [119]:
preprocessor_stack

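**Pipeline**

| Parameter | Value |
|---|---|
| steps | [('preprocessor_stage_2', ...)] |
| transform_input | |
| memory | |
| verbose | False |

**ColumnTransformer**

| Parameter | Value |
|---|---|
| transformers | [('cat', ...), ('num', ...)] |
| remainder | 'drop' |
| sparse_threshold | 0.3 |
| n_jobs | |
| transformer_weights | |
| verbose | False |
| verbose_feature_names_out | True |
| force_int_remainder_cols | 'deprecated' |

**OneHotEncoder**

| Parameter | Value |
|---|---|
| categories | 'auto' |
| drop | |
| sparse_output | True |
| dtype | `<class 'numpy.float64'>` |
| handle_unknown | 'ignore' |
| min_frequency | |
| max_categories | |
| feature_name_combiner | 'concat' |

**StandardScaler**

| Parameter | Value |
|---|---|
| copy | True |
| with_mean | True |
| with_std | True |

**SimpleImputer**

| Parameter | Value |
|---|---|
| missing_values | |
| strategy | 'constant' |
| fill_value | 0 |
| copy | True |
| add_indicator | False |
| keep_empty_features | False |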

In [120]:
preprocessor_stack.fit(X_train)


In [142]:
pd.DataFrame(preprocessor_stack.transform(X_train),columns=preprocessor_stack[-1].get_feature_names_out())

| | cat__Education_Bachelors | cat__Education_Masters | cat__Education_PHD | cat__City_Bangalore | cat__City_New Delhi | cat__City_Pune | cat__Gender_Female | cat__Gender_Male | cat__EverBenched_No | cat__EverBenched_Yes | num__JoiningYear | num__PaymentTier | num__Age | num__ExperienceInCurrentDomain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.016242 | -1.00359 | 0.773695 | -1.639104 |
| 1 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | -0.586786 | 0.58933 | 0.382737 | -1.017268 |
| 2 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | -1.655472 | 0.58933 | -0.594661 | 1.470075 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.016242 | -1.00359 | -0.594661 | -1.017268 |
| 4 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | -0.586786 | 0.58933 | -0.985620 | 0.848239 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2206 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.016242 | 0.58933 | 0.187257 | -1.017268 |
| 2207 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.481899 | 0.58933 | -0.594661 | -1.017268 |
| 2208 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | -1.655472 | 0.58933 | -0.203702 | -0.395432 |
| 2209 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | -1.121129 | -1.00359 | -0.203702 | -1.017268 |


## mega pipeline

In [122]:
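from sklearn.ensemble import RandomForestClassifier

# Full pipeline: preprocessing followed by the classifier.
# No random_state is set, so each fit can produce a slightly different forest.
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor_stack),
    ('classifier', RandomForestClassifier())
])

pipeline.fit(X_train, y_train)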

**Pipeline**

| Parameter | Value |
|---|---|
| steps | [('preprocessor', ...), ('classifier', ...)] |
| transform_input | |
| memory | |
| verbose | False |

The nested `preprocessor` step (Pipeline, ColumnTransformer, OneHotEncoder, StandardScaler, SimpleImputer) shows the same parameters as the `preprocessor_stack` display above.

**RandomForestClassifier**

| Parameter | Value |
|---|---|
| n_estimators | 100 |
| criterion | 'gini' |
| max_depth | |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | 'sqrt' |
| max_leaf_nodes | |
| min_impurity_decrease | 0.0 |
| bootstrap | True |

In [123]:
X_train.head()

| | Education | JoiningYear | City | PaymentTier | Age | Gender | EverBenched | ExperienceInCurrentDomain |
|---|---|---|---|---|---|---|---|---|
| 4629 | Bachelors | 2017 | Bangalore | 2 | 35 | Male | No | 0 |
| 3412 | Bachelors | 2014 | Bangalore | 3 | 33 | Male | No | 1 |
| 1082 | Bachelors | 2012 | Bangalore | 3 | 28 | Male | No | 5 |
| 292 | Bachelors | 2017 | New Delhi | 2 | 28 | Male | No | 1 |
| 2595 | Bachelors | 2014 | Bangalore | 3 | 26 | Female | Yes | 4 |


In [143]:
# Predict on the training data, so the metrics below are training-set metrics.
y_train_pred = pipeline.predict(X_train)


from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print("Accuracy: ", accuracy_score(y_train,y_train_pred))
print("\nClassification Report:\n ", classification_report(y_train,y_train_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_train,y_train_pred))


Accuracy:  0.9353233830845771

Classification Report:
                precision    recall  f1-score   support

          No       0.93      0.97      0.95      1343
         Yes       0.95      0.88      0.91       868

    accuracy                           0.94      2211
   macro avg       0.94      0.93      0.93      2211
weighted avg       0.94      0.94      0.93      2211


Confusion Matrix:
 [[1300   43]
 [ 100  768]]
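The figures above are computed on the training data, so they say little about how the model generalizes. A hedged sketch of scoring the held-out split as well, using a synthetic stand-in for the employee data (random values, noise labels, and a reduced two-column pipeline are all assumptions made for the example):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the employee data (values are random, not real).
rng = np.random.default_rng(0)
n = 400
X_demo = pd.DataFrame({
    "City": rng.choice(["Bangalore", "New Delhi", "Pune"], size=n),
    "Age": rng.integers(22, 42, size=n),
})
y_demo = pd.Series(rng.choice(["No", "Yes"], size=n, p=[0.6, 0.4]))  # pure noise labels

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["City"]),
        ("num", StandardScaler(), ["Age"]),
    ])),
    ("classifier", RandomForestClassifier(random_state=0)),
])
demo_pipeline.fit(X_tr, y_tr)

# Score both splits; a large gap between them is a sign of overfitting.
train_acc = accuracy_score(y_tr, demo_pipeline.predict(X_tr))
test_acc = accuracy_score(y_te, demo_pipeline.predict(X_te))
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

In the notebook itself the equivalent would be `pipeline.predict(X_test)` compared against `y_test`, with the same `classification_report` and `confusion_matrix` calls.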


In [138]:
# Take a single test row, kept as a one-row DataFrame so the pipeline can consume it.
my_pred_array = X_test.iloc[27:28]

my_pred_array


| | Education | JoiningYear | City | PaymentTier | Age | Gender | EverBenched | ExperienceInCurrentDomain |
|---|---|---|---|---|---|---|---|---|
| 2280 | Bachelors | 2014 | Pune | 2 | 28 | Female | No | 2 |


In [139]:
pd.DataFrame(preprocessor_stack.transform(my_pred_array),columns=preprocessor_stack[0].get_feature_names_out())


| | cat__Education_Bachelors | cat__Education_Masters | cat__Education_PHD | cat__City_Bangalore | cat__City_New Delhi | cat__City_Pune | cat__Gender_Female | cat__Gender_Male | cat__EverBenched_No | cat__EverBenched_Yes | num__JoiningYear | num__PaymentTier | num__Age | num__ExperienceInCurrentDomain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | -0.586786 | -1.00359 | -0.594661 | -0.395432 |


In [140]:
y_pred = pipeline.predict(my_pred_array)

y_pred

array(['Yes'], dtype=object)

In [130]:
import dill

with open('pipeline.pkl', 'wb') as file:
    dill.dump(pipeline, file)

print('pipeline saved successfully to file')



pipeline saved successfully to file


In [131]:
import dill

with open('pipeline.pkl', 'rb') as file:
    loaded_pipeline = dill.load(file)

print('pipeline loaded successfully from file')


pipeline loaded successfully from file


In [132]:
loaded_pipeline.__getstate__()

{'steps': [('preprocessor',
   Pipeline(steps=[('preprocessor_stage_2',
                    ColumnTransformer(transformers=[('cat',
                                                     Pipeline(steps=[('OneHotEncode',
                                                                      OneHotEncoder(handle_unknown='ignore'))]),
                                                     ['Education', 'City',
                                                      'Gender', 'EverBenched']),
                                                    ('num',
                                                     Pipeline(steps=[('scale_data',
                                                                      StandardScaler()),
                                                                     ('simple_imputer1',
                                                                      SimpleImputer(fill_value=0,
                                                                                    strategy=

In [133]:
y_pred = loaded_pipeline.predict(my_pred_array)

y_pred

array(['No'], dtype=object)
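The reloaded pipeline predicts "No" for the same row that `pipeline.predict` labelled "Yes" earlier. One possible explanation, not confirmed by the notebook, is that the cells ran out of order (note the execution counts) and the unseeded `RandomForestClassifier` was refit between saving and predicting. Below is a sketch of a round-trip check, assuming a fixed `random_state` and using the standard-library `pickle` as a stand-in for `dill` (both expose `dumps` and `loads`); the data is synthetic.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with a simple made-up labelling rule.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = np.where(X_demo[:, 0] + X_demo[:, 1] > 0, "Yes", "No")

# A fixed random_state makes the fitted forest reproducible across runs.
model = RandomForestClassifier(random_state=42).fit(X_demo, y_demo)

# Serialize and restore in memory, then compare predictions row by row.
restored = pickle.loads(pickle.dumps(model))
same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print("predictions identical after round trip:", same)
```

A serialized model should reproduce the predictions of the exact object that was saved, so a mismatch usually points at the in-memory model having changed after the dump.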

### Moving to Streamlit...