### Life cycle of a machine learning project

Data analysis

- Understand the problem statement
- Data Collection
- Data checks to perform
- Exploratory data analysis

Model development

- Understand the problem statement
- Data Collection
- Data preprocessing
- Feature engineering
- Model training
- Choose best model

Model deployment

- Structure the code into modules
- Configure a Docker image to make the code deployable
- Deploy the model on AWS


### 1.0 Problem statement
Company X keeps data on its employees, such as their education level, the city they work in, their age, and other attributes. It has also recorded which employees have left the company. The company wants a program that, given an employee's data, predicts how likely that employee is to leave, and it has hired you to build it.

### 2.0 Data Collection

#### 2.1 Libraries to use

In [20]:
from sqlalchemy import create_engine
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

from xgboost import XGBClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score

#### 2.2 Data extraction from the database

In [3]:
driver = "ODBC+Driver+17+for+SQL+Server"
server_name = "localhost"
database = "BDdatasets"
UID = "sa"
PWD = "0440"

connection_string = f"mssql+pyodbc://{UID}:{PWD}@{server_name}/{database}?driver={driver}"

engine = create_engine(connection_string)

query = "SELECT * FROM Employees"

df = pd.read_sql_query(query , engine)
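The cell above hard-codes the database user and password. A common alternative is to read them from environment variables and URL-escape them before building the connection string. The sketch below assumes hypothetical variable names (`DB_USER`, `DB_PASSWORD`, `DB_SERVER`, `DB_NAME`, `DB_DRIVER`) that are not part of the original notebook; the resulting string could then be passed to `create_engine` as before.

```python
import os
from urllib.parse import quote_plus


def build_connection_string(env=os.environ):
    """Assemble an mssql+pyodbc connection string from environment variables."""
    # Hypothetical variable names; the defaults mirror the non-secret values above.
    driver = env.get("DB_DRIVER", "ODBC Driver 17 for SQL Server")
    server = env.get("DB_SERVER", "localhost")
    database = env.get("DB_NAME", "BDdatasets")
    user = env["DB_USER"]          # required: raises KeyError if missing
    password = env["DB_PASSWORD"]  # required: kept out of the notebook

    # quote_plus escapes characters such as "@" or "/" that would break the URL
    # and turns the spaces in the driver name into "+".
    return (
        f"mssql+pyodbc://{quote_plus(user)}:{quote_plus(password)}"
        f"@{server}/{database}?driver={quote_plus(driver)}"
    )


# Demo with an in-memory mapping instead of the real environment
demo_env = {"DB_USER": "sa", "DB_PASSWORD": "p@ss/word"}
print(build_connection_string(demo_env))
```

Escaping matters here because a password containing `@` or `/` would otherwise be parsed as part of the host or path.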

#### 2.3 Show the top 5 records

In [5]:
df.head(5)

|   | Education | JoiningYear | City | PaymentTier | Age | Gender | EverBenched | ExperienceInCurrentDomain | LeaveOrNot |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Bachelors | 2017 | Bangalore | 3 | 34 | Male | No | 0 | False |
| 1 | Bachelors | 2013 | Pune | 1 | 28 | Female | No | 3 | True |
| 2 | Bachelors | 2014 | New Delhi | 3 | 38 | Female | No | 2 | False |
| 3 | Masters | 2016 | Bangalore | 3 | 27 | Male | No | 5 | True |
| 4 | Masters | 2017 | Pune | 3 | 24 | Male | Yes | 2 | True |


#### 2.4 Check dataset info

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4653 entries, 0 to 4652
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Education                  4653 non-null   object
 1   JoiningYear                4653 non-null   int64 
 2   City                       4653 non-null   object
 3   PaymentTier                4653 non-null   int64 
 4   Age                        4653 non-null   int64 
 5   Gender                     4653 non-null   object
 6   EverBenched                4653 non-null   object
 7   ExperienceInCurrentDomain  4653 non-null   int64 
 8   LeaveOrNot                 4653 non-null   bool  
dtypes: bool(1), int64(4), object(4)
memory usage: 295.5+ KB
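The `info()` output shows no null values in any column. A few further checks (duplicate rows, number of distinct values per column) are often run at this stage. The following is a minimal sketch on a made-up four-row frame that reuses some of the column names above; `run_data_checks` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd


def run_data_checks(df: pd.DataFrame) -> dict:
    """Return a small summary of common data-quality checks."""
    return {
        "missing_per_column": df.isna().sum().to_dict(),   # null count per column
        "duplicate_rows": int(df.duplicated().sum()),      # rows repeating an earlier row
        "unique_per_column": df.nunique().to_dict(),       # distinct values per column
    }


# Toy rows for illustration only (the first and last rows are identical)
sample = pd.DataFrame(
    {
        "Education": ["Bachelors", "Bachelors", "Masters", "Bachelors"],
        "City": ["Bangalore", "Pune", "Bangalore", "Bangalore"],
        "Age": [34, 28, 27, 34],
        "LeaveOrNot": [False, True, True, False],
    }
)

report = run_data_checks(sample)
print(report)
```

On the real data the same function could be called as `run_data_checks(df)`.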


### 3.0 Data preprocessing

#### 3.1 Preparing X and y

In [13]:
X = df.drop(columns = ["LeaveOrNot"])
y = df["LeaveOrNot"]

#### 3.2 Creating the preprocessor

In [27]:
# Split the columns by dtype: object columns are categorical, the rest numerical
numerical_features = [feature for feature in X.columns if X[feature].dtype != "O"]
categorical_features = [feature for feature in X.columns if X[feature].dtype == "O"]

# Numerical columns: fill missing values with the mean, then standardize
num_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]
)

# Categorical columns: fill missing values with the mode, then one-hot encode
cat_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("one_hot_encoder", OneHotEncoder()),
    ]
)

# Apply each pipeline to its own group of columns
preprocessor = ColumnTransformer(
    [
        ("num_pipeline", num_pipeline, numerical_features),
        ("cat_pipeline", cat_pipeline, categorical_features),
    ]
)

#### 3.3 Preprocessing the features

In [28]:
X = preprocessor.fit_transform(X)

In [29]:
X.shape

(4653, 14)

#### 3.4 Splitting the dataset into train and test sets

In [30]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape , X_test.shape

((3722, 14), (931, 14))
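In the cells above, `preprocessor.fit_transform(X)` runs on the full dataset before `train_test_split`, so the scaling statistics and imputation values are computed with the test rows included. A frequently recommended alternative is to split first and fit the preprocessing on the training rows only. The sketch below illustrates the order of operations on made-up data, with a plain `StandardScaler` standing in for the notebook's `preprocessor`; `stratify=y` is an extra assumption that keeps the class ratio equal in both splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data for illustration only
raw = pd.DataFrame(
    {
        "Age": [22, 25, 28, 31, 34, 37, 40, 43, 46, 49],
        "LeaveOrNot": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    }
)
X_raw, y = raw[["Age"]], raw["LeaveOrNot"]

# 1. Split the raw, untransformed features
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X_raw, y, test_size=0.2, random_state=42, stratify=y
)

# 2. Fit the preprocessing on the training rows only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_raw)

# 3. Reuse the fitted statistics on the test rows
X_test = scaler.transform(X_test_raw)

print(X_train.shape, X_test.shape)
```

Applying this to the notebook would change the reported scores slightly, so the numbers in the following sections correspond to the original order of operations.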

### 4.0 Model training

#### 4.1 Creating a function to compute the metrics

In [31]:
def evaluate_model(y_true , y_pred):
    accuracy = accuracy_score(y_true , y_pred)
    precision = precision_score(y_true , y_pred)
    recall = recall_score(y_true , y_pred)
    roc_score = roc_auc_score(y_true , y_pred)
    return ( accuracy, precision ,recall , roc_score)
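`evaluate_model` passes hard 0/1 predictions to `roc_auc_score`. With hard labels the ROC curve has a single operating point, and the score reduces to the average of the true positive rate and the true negative rate. ROC AUC is usually computed from scores or probabilities instead, for example `model.predict_proba(X)[:, 1]`. The pure-Python sketch below uses made-up values to show how the two inputs can give different results; `roc_auc` is a hypothetical helper written for illustration, not the scikit-learn implementation.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    positives = [s for t, s in zip(y_true, scores) if t]
    negatives = [s for t, s in zip(y_true, scores) if not t]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


y_true = [1, 1, 1, 0, 0, 0]
probabilities = [0.9, 0.6, 0.4, 0.55, 0.3, 0.1]  # made-up scores for illustration
hard_labels = [int(p >= 0.5) for p in probabilities]  # thresholded at 0.5

auc_from_probabilities = roc_auc(y_true, probabilities)
auc_from_labels = roc_auc(y_true, hard_labels)

print(auc_from_probabilities)  # 8 of 9 positive/negative pairs ranked correctly
print(auc_from_labels)         # equals (TPR + TNR) / 2 = (2/3 + 2/3) / 2
```

Thresholding discards the ranking information inside each predicted class, which is why the label-based value is lower in this example.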

#### 4.2 Training various models

In [35]:
models = {
    "XGBClassifier":XGBClassifier() , 
    "AdaBoostClassifier":AdaBoostClassifier() ,
    "LogisticRegression":LogisticRegression()
}

model_and_score = []

for model_name , model in models.items():
    model.fit(X_train , y_train)
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    accuracy_train, precision_train ,recall_train , roc_score_train = evaluate_model(y_train , y_pred_train)
    accuracy_test, precision_test ,recall_test , roc_score_test = evaluate_model(y_test , y_pred_test)

    print(f"model : {model_name}")
    print("Model performance for training set")
    print(f"accuracy score : {accuracy_train}")
    print(f"precision score : {precision_train}")
    print(f"recall score : {recall_train}")
    print(f"roc score : {roc_score_train}")
    print("-" * 35)
    print("Model performance for test set")
    print(f"accuracy score : {accuracy_test}")
    print(f"precision score : {precision_test}")
    print(f"recall score : {recall_test}")
    print(f"roc score : {roc_score_test}")
    print("=" * 35)
    print("\n\n")
    # Keep the test-set metrics so the models can be compared later
    model_and_score.append({
        "model name": model_name,
        "accuracy score": accuracy_test,
        "precision score": precision_test,
        "recall score": recall_test,
        "roc score": roc_score_test,
    })

model : XGBClassifier
Model performance for training set
accuracy score : 0.8981730252552391
precision score : 0.9385964912280702
recall score : 0.7529319781078968
roc score : 0.8635720062459254
-----------------------------------
Model performance for test set
accuracy score : 0.8625134264232008
precision score : 0.8614232209737828
recall score : 0.7165109034267912
roc score : 0.8279275828609366


model : AdaBoostClassifier
Model performance for training set
accuracy score : 0.7960773777538958
precision score : 0.8066037735849056
recall score : 0.5347928068803753
roc score : 0.7338311148605724
-----------------------------------
Model performance for test set
accuracy score : 0.8141783029001074
precision score : 0.8303571428571429
recall score : 0.5794392523364486
roc score : 0.7585720851846176


model : LogisticRegression
Model performance for training set
accuracy score : 0.7364320257925846
precision score : 0.6965699208443272
recall score : 0.41282251759186867
roc score : 0.6593379882269617
-----------------------------------
Model performance for test set
accuracy score : 0.7432867883995704
precision score : 0.7091836734693877
recall score : 0.43302180685358255
roc score : 0.6697895919513814
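The scores above come from a single 80/20 split, so part of the difference between models may be due to which rows landed in the test set. Cross-validation is one common way to get a more stable estimate. The sketch below is a minimal example on synthetic data from `make_classification`, using only `LogisticRegression` to keep it dependency-free; the fold count, `max_iter`, and the `roc_auc` scorer are assumptions for illustration and not the notebook's method.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Five stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# scoring="roc_auc" evaluates the model's continuous scores, not hard labels
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv, scoring="roc_auc"
)

print(f"mean ROC AUC: {scores.mean():.3f} (std {scores.std():.3f})")
```

Looping over the `models` dictionary with the same `cv` object would give each model a mean and a spread instead of a single number.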





### 5.0 Choosing the best model

In [41]:
df_models_and_scores = pd.DataFrame(model_and_score ).sort_values(by = "roc score" , ascending = False)
df_models_and_scores

|   | model name | accuracy score | precision score | recall score | roc score |
|---|---|---|---|---|---|
| 0 | XGBClassifier | 0.862513 | 0.861423 | 0.716511 | 0.827928 |
| 1 | AdaBoostClassifier | 0.814178 | 0.830357 | 0.579439 | 0.758572 |
| 2 | LogisticRegression | 0.743287 | 0.709184 | 0.433022 | 0.66979 |


In [45]:
# idxmax() returns the row label of the highest recall; idxmin() would return the lowest
df_models_and_scores["recall score"].idxmax()

0
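A bare row label is easy to misread, so it can help to look up the model name directly. The sketch below is a minimal, pandas-free example that reuses the test-set scores from the table above; `pick_best` is a hypothetical helper introduced for illustration.

```python
# Test-set scores copied from the comparison table above
model_and_score = [
    {"model name": "XGBClassifier", "recall score": 0.716511, "roc score": 0.827928},
    {"model name": "AdaBoostClassifier", "recall score": 0.579439, "roc score": 0.758572},
    {"model name": "LogisticRegression", "recall score": 0.433022, "roc score": 0.669790},
]


def pick_best(results, metric):
    """Return the name of the model with the highest value for `metric`."""
    return max(results, key=lambda row: row[metric])["model name"]


best_by_roc = pick_best(model_and_score, "roc score")
best_by_recall = pick_best(model_and_score, "recall score")

print(best_by_roc, best_by_recall)
```

Here both metrics point to the same model; when they disagree, the choice depends on which kind of error costs the company more.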

# end of modeling