# Life cycle of a machine learning project
### Data analysis
- Understand the project statement
- Data collection
- Data checks to perform
- Exploratory data analysis

### Model development
- Understand the project statement
- Data collection
- Data cleaning
- Feature engineering
- Data preprocessing
- Model training
- Choose the best model
### Model deployment
- Structure the code as modular programs
- Configure the Docker image to make the code deployable
- Deploy the model on AWS

<h1>Required libraries</h1>

In [1]:
#data extraction
from sqlalchemy import create_engine
import pandas as pd
import numpy as np

#data pipelines
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# data cleaning 
from sklearn.impute import SimpleImputer

#data preprocessing
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

#data models
from sklearn.model_selection import GridSearchCV

from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier

#models metrics

from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score

import warnings
warnings.filterwarnings("ignore")

### 1.0 Problem statement

Bob has started his own mobile phone company. He wants to compete with big companies such as Apple and Samsung.

He does not know how to estimate the price of the phones his company makes. In this competitive mobile phone market, you cannot simply assume things. To solve this problem, he collects sales data for mobile phones from various companies.

Bob wants to find the relationship between the features of a mobile phone (e.g. RAM, internal memory) and its selling price. But he is not very good at machine learning, so he needs your help to solve this problem.

In this problem you do not have to predict the actual price, but a price range indicating how high the price is.

### 2.0 Data Collection

#### 2.1 Data extraction from database

In [3]:
driver = "ODBC+Driver+17+for+SQL+Server"
server_name = "localhost"
database = "BDdatasets"
UID = "sa"
PWD = "0440"

connection_string = f"mssql+pyodbc://{UID}:{PWD}@{server_name}/{database}?driver={driver}"

engine = create_engine(connection_string)

query = "SELECT * FROM MobilePrice"

df = pd.read_sql_query( query , engine )
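
The cell above hard-codes the database credentials. A minimal sketch of one alternative, assuming the credentials are exported as environment variables (the names `DB_USER` and `DB_PASSWORD` and the helper `build_connection_string` are hypothetical and not part of the original notebook), reads them at run time and URL-encodes them so that special characters do not break the connection string:

```python
import os
from urllib.parse import quote_plus


def build_connection_string(user, password, server, database, driver):
    """Build an mssql+pyodbc URL, escaping characters that are special in URLs."""
    return (
        f"mssql+pyodbc://{quote_plus(user)}:{quote_plus(password)}"
        f"@{server}/{database}?driver={quote_plus(driver)}"
    )


# Hypothetical variable names; the fallbacks are placeholders for a local test setup.
user = os.environ.get("DB_USER", "sa")
password = os.environ.get("DB_PASSWORD", "change-me")

connection_string = build_connection_string(
    user, password, "localhost", "BDdatasets", "ODBC Driver 17 for SQL Server"
)
```

The resulting string can be passed to `create_engine` in the same way as in the cell above.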

#### 2.2 Check the top 5 records

In [4]:
df.head()

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,...,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
0,842,0,2.2,0,1,0,7,0.6,188,2,...,20,756,2549,9,7,19,0,0,1,1
1,1021,1,0.5,1,0,1,53,0.7,136,3,...,905,1988,2631,17,3,7,1,1,0,2
2,563,1,0.5,1,2,1,41,0.9,145,5,...,1263,1716,2603,11,2,9,1,1,0,2
3,615,1,2.5,0,0,0,10,0.8,131,6,...,1216,1786,2769,16,8,11,1,0,0,2
4,1821,1,1.2,0,13,1,44,0.6,141,2,...,1208,1212,1411,8,2,15,1,1,0,1


#### 2.3 Check for correlations between columns

In [5]:
df.corr()

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,...,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
battery_power,1.0,0.011252,0.011482,-0.041847,0.033334,0.015665,-0.004004,0.034085,0.001844,-0.029727,...,0.014901,-0.008402,-0.000653,-0.029959,-0.021421,0.05251,0.011522,-0.010516,-0.008343,0.200723
blue,0.011252,1.0,0.021419,0.035198,0.003593,0.013443,0.041177,0.004049,-0.008605,0.036161,...,-0.006872,-0.041533,0.026351,-0.002952,0.000613,0.013934,-0.030236,0.010061,-0.021863,0.020573
clock_speed,0.011482,0.021419,1.0,-0.001315,-0.000434,-0.043073,0.006545,-0.014364,0.01235,-0.005724,...,-0.014523,-0.009476,0.003443,-0.029078,-0.007378,-0.011432,-0.046433,0.019756,-0.024471,-0.006606
dual_sim,-0.041847,0.035198,-0.001315,1.0,-0.029123,0.003187,-0.015679,-0.022142,-0.008979,-0.024658,...,-0.020875,0.014291,0.041072,-0.011949,-0.016666,-0.039404,-0.014008,-0.017117,0.02274,0.017444
fc,0.033334,0.003593,-0.000434,-0.029123,1.0,-0.01656,-0.029133,-0.001791,0.023618,-0.013356,...,-0.00999,-0.005176,0.015099,-0.011014,-0.012373,-0.006829,0.001793,-0.014828,0.020085,0.021998
four_g,0.015665,0.013443,-0.043073,0.003187,-0.01656,1.0,0.00869,-0.001823,-0.016537,-0.029706,...,-0.019236,0.007448,0.007313,0.027166,0.037005,-0.046628,0.584246,0.016758,-0.01762,0.014772
int_memory,-0.004004,0.041177,0.006545,-0.015679,-0.029133,0.00869,1.0,0.006886,-0.034214,-0.02831,...,0.010441,-0.008335,0.032813,0.037771,0.011731,-0.00279,-0.009366,-0.026999,0.006993,0.044435
m_dep,0.034085,0.004049,-0.014364,-0.022142,-0.001791,-0.001823,0.006886,1.0,0.021756,-0.003504,...,0.025263,0.023566,-0.009434,-0.025348,-0.018388,0.017003,-0.012065,-0.002638,-0.028353,0.000853
mobile_wt,0.001844,-0.008605,0.01235,-0.008979,0.023618,-0.016537,-0.034214,0.021756,1.0,-0.018989,...,0.000939,9e-05,-0.002581,-0.033855,-0.020761,0.006209,0.001551,-0.014368,-0.000409,-0.030302
n_cores,-0.029727,0.036161,-0.005724,-0.024658,-0.013356,-0.029706,-0.02831,-0.003504,-0.018989,1.0,...,-0.006872,0.02448,0.004868,-0.000315,0.025826,0.013148,-0.014733,0.023774,-0.009964,0.004399
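
The full correlation matrix is hard to read at a glance. As a sketch, assuming the question of interest is which features move with `price_range`, the target column can be extracted and ranked by absolute value. The toy data below is invented for illustration and stands in for the real table:

```python
import pandas as pd

# Toy data standing in for the real table (values are invented for illustration).
toy = pd.DataFrame(
    {
        "ram": [1, 2, 3, 4, 5, 6],
        "noise": [3, 1, 2, 3, 1, 2],
        "price_range": [0, 0, 1, 1, 2, 2],
    }
)

# Pearson correlation of every feature with the target, strongest first.
target_corr = toy.corr()["price_range"].drop("price_range")
ranking = target_corr.abs().sort_values(ascending=False)
print(ranking)
```

On the real data the same two lines would be applied to `df` instead of `toy`.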


### 3.0 Data cleaning 

#### 3.1 Set the numerical and categorical columns

In [11]:
label = ["price_range"]
numerical_columns  = [ column for column in df.columns if df[column].nunique()>30  and column not in  label ]
categorical_columns = [column for column in df.columns if df[column].nunique()<=30 and column not in label  ]
features_columns = numerical_columns + categorical_columns

#### 3.2 Set X and y

In [12]:
X = df[features_columns]
# select the target as a 1-D Series; scikit-learn estimators expect a 1-D y
y = df[label[0]]

#### 3.3 Build the cleaning pipeline

In [30]:
#features_columns = numerical_columns + categorical_columns

num_cleaning = Pipeline(steps = 
                        [
                         ("imputer", SimpleImputer(strategy = "mean"))   
                        ]  
                       ) 
cat_cleaning = Pipeline(steps = 
                        [
                         ( "imputer", SimpleImputer(strategy = "most_frequent") )  
                        ]
                       ) 

cleaning_pipeline = ColumnTransformer(
    [
      ( "num_cleaning",num_cleaning , numerical_columns ) , 
      ( "cat_cleaning",cat_cleaning , categorical_columns)
    ]
)


In [31]:
print(numerical_columns)

print(categorical_columns)

['battery_power', 'int_memory', 'mobile_wt', 'px_height', 'px_width', 'ram']
['blue', 'clock_speed', 'dual_sim', 'fc', 'four_g', 'm_dep', 'n_cores', 'pc', 'sc_h', 'sc_w', 'talk_time', 'three_g', 'touch_screen', 'wifi']
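
Note that the cardinality rule places `clock_speed` and `m_dep` in the categorical list even though they are continuous measurements, because they take 30 or fewer distinct values. The sketch below reproduces that effect on invented toy data. The helper `split_columns` is hypothetical and only mirrors the list comprehensions from section 3.1, with a lower threshold to suit the small table:

```python
import pandas as pd


def split_columns(df, label, threshold=30):
    """Split feature columns by cardinality, mirroring the heuristic above."""
    features = [c for c in df.columns if c != label]
    numerical = [c for c in features if df[c].nunique() > threshold]
    categorical = [c for c in features if df[c].nunique() <= threshold]
    return numerical, categorical


toy = pd.DataFrame(
    {
        "ram": [256, 512, 1024, 2048, 4096],       # 5 distinct values
        "clock_speed": [0.5, 0.5, 1.2, 1.2, 2.5],  # continuous, but only 3 distinct values
        "wifi": [0, 1, 0, 1, 1],                   # binary flag
        "price_range": [0, 0, 1, 2, 3],
    }
)

numerical, categorical = split_columns(toy, "price_range", threshold=3)
print(numerical)
print(categorical)
```

If this grouping is not wanted, one option is to list such columns by hand instead of relying on the threshold alone.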


In [8]:
X_cleaned = pd.DataFrame(cleaning_pipeline.fit_transform(X) , columns = features_columns )
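
To make the effect of the two imputers concrete, here is a small sketch on invented data with one missing value per column. It assumes the same strategies as the pipeline above (`mean` for numerical columns, `most_frequent` for categorical ones). The names `toy`, `toy_cleaning` and `toy_cleaned` are illustrative only:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

toy = pd.DataFrame(
    {
        "ram": [1000.0, np.nan, 3000.0, 2000.0],  # numerical column with a gap
        "wifi": [1.0, 1.0, np.nan, 0.0],          # categorical flag with a gap
    }
)

toy_cleaning = ColumnTransformer(
    [
        ("num_cleaning", SimpleImputer(strategy="mean"), ["ram"]),
        ("cat_cleaning", SimpleImputer(strategy="most_frequent"), ["wifi"]),
    ]
)

# ColumnTransformer returns a NumPy array whose columns follow the transformer order,
# so the column names have to be reattached in that same order.
toy_cleaned = pd.DataFrame(toy_cleaning.fit_transform(toy), columns=["ram", "wifi"])
print(toy_cleaned)
```

The missing `ram` value is replaced by the mean of the observed values, and the missing `wifi` value by the most frequent one.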

### 4.0 Feature engineering

In [9]:
X_cleaned["px_area"] = X_cleaned["px_height"] * X_cleaned["px_width"] 
X_cleaned["sc_area"] =  X_cleaned["sc_h"] * X_cleaned["sc_w"] 


### 5.0 Data preprocessing

#### 5.1 Create the preprocessor pipeline

In [11]:
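#features_columns = numerical_columns + categorical_columns
# Note: px_area and sc_area (section 4.0) are in neither column list, so this
# ColumnTransformer drops them (default remainder="drop"); add them to
# numerical_columns if they should reach the models.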

num_preprocessor = Pipeline(steps = [
                                    ("scaler" , StandardScaler())
                                    ])

cat_preprocessor = Pipeline(steps = [
                                ("Encoder" , OrdinalEncoder()) , 
                                ("scaler" , StandardScaler())
                            ])



preprocessor_pipeline = ColumnTransformer(
    [
        ("num_preprocessor" , num_preprocessor , numerical_columns) , 
        ("cat_preprocessor" , cat_preprocessor , categorical_columns)
    ]
)


In [17]:
X_preprocessed = preprocessor_pipeline.fit_transform(X_cleaned)
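
A `ColumnTransformer` only outputs the columns it is given, unless `remainder="passthrough"` is set. Since `px_area` and `sc_area` are not in `numerical_columns` or `categorical_columns`, they do not appear in `X_preprocessed`. The sketch below illustrates this behaviour using the first three `px_height` and `px_width` values from section 2.2. The names `listed`, `dropping` and `keeping` are illustrative:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame(
    {
        "px_height": [20.0, 905.0, 1263.0],
        "px_width": [756.0, 1988.0, 1716.0],
    }
)
toy["px_area"] = toy["px_height"] * toy["px_width"]  # engineered feature

listed = ["px_height", "px_width"]  # px_area is deliberately left out

dropping = ColumnTransformer([("scaler", StandardScaler(), listed)])
keeping = ColumnTransformer(
    [("scaler", StandardScaler(), listed)], remainder="passthrough"
)

print(dropping.fit_transform(toy).shape)  # engineered column is discarded
print(keeping.fit_transform(toy).shape)   # engineered column is kept, unscaled
```

Adding the engineered names to the numerical column list is usually preferable to `passthrough`, because the new features are then scaled like the others.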

#### 5.2 Split the dataset

In [22]:
X_train, X_test , y_train , y_test = train_test_split( X_preprocessed , y , test_size = 0.2 , random_state = 42)
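
In this notebook the preprocessor is fitted on the whole dataset before the split, so the test rows contribute to the scaling statistics. A common alternative, sketched here on synthetic data rather than the real table, is to split first and fit the scaler on the training rows only. The `stratify` argument is an optional extra that keeps the class proportions equal in both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # synthetic features
y_toy = np.repeat([0, 1, 2, 3], 25)                      # four balanced classes

# stratify keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

scaler = StandardScaler().fit(X_tr)   # statistics come from the training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # test rows reuse the training statistics
```

With this ordering the test set stays unseen until evaluation, which usually gives a more honest estimate of performance.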

###  6.0 Model training 

#### 6.1 Define the evaluation function

In [44]:
def evaluate_model(y_true , y_pred):
    accuracy = accuracy_score(y_true , y_pred )
    recall = recall_score(y_true , y_pred , average = "macro")
    precision = precision_score(y_true , y_pred , average = "macro")
    f1 = f1_score(y_true , y_pred , average = "macro")
    return (accuracy , recall, precision , f1)
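
With `average = "macro"`, each metric is computed per class and then averaged with equal weight, regardless of how many samples each class has. The following pure-Python sketch spells that out on a small invented example. The helper `macro_scores` is hypothetical and is meant only to illustrate the averaging, not to replace the scikit-learn functions:

```python
def macro_scores(y_true, y_pred):
    """Return (recall, precision, f1), each averaged with equal weight per class."""
    classes = sorted(set(y_true) | set(y_pred))
    recalls, precisions, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        actual = sum(t == c for t in y_true)     # samples that really are class c
        predicted = sum(p == c for p in y_pred)  # samples predicted as class c
        recall = tp / actual if actual else 0.0
        precision = tp / predicted if predicted else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        recalls.append(recall)
        precisions.append(precision)
        f1s.append(f1)
    n = len(classes)
    return sum(recalls) / n, sum(precisions) / n, sum(f1s) / n


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
recall, precision, f1 = macro_scores(y_true, y_pred)
print(recall, precision, f1)
```

Here the per-class recalls are 1/2, 1 and 1/2, so the macro recall is 2/3, while the per-class precisions are 1/2, 2/3 and 1, giving a macro precision of 13/18.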

#### 6.2 Define the models and parameter grids

In [32]:
models = {
                        "LogisticRegression": LogisticRegression(), 
                        "AdaBoostClassifier": AdaBoostClassifier() , 
                        "XGBClassifier":XGBClassifier() , 
                    }
models_params = {
                        "LogisticRegression": {
                                                'C': [ 0.1, 1,],
                                                'solver': ['liblinear', 'saga'],  # add solver for compatibility
                                                # note: liblinear does not support penalty 'none'; GridSearchCV scores failed fits as NaN by default
                                                'penalty': ['l1', 'l2', 'none'],
                                                'max_iter': [ 200, 500],
                                                
                        },
                        "AdaBoostClassifier": {
                                                'n_estimators': [50, 100,],
                                                'learning_rate': [0.1, 0.5,],
                        },
                        "XGBClassifier": {
                                                'n_estimators': [50, 100],
                                                'learning_rate': [ 0.1, 0.2],
                                                'max_depth': [ 5, 7],
                        }
                        
                    }
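
A grid search tries every combination of the listed values, so the cost grows multiplicatively with each added parameter. The sketch below counts the candidates for the three grids above and multiplies by the number of folds, assuming `cv = 3` as in the training cell. The names `grids`, `candidates` and `fits` are illustrative:

```python
from math import prod

# Same grids as above, copied so that the example is self-contained.
grids = {
    "LogisticRegression": {
        "C": [0.1, 1],
        "solver": ["liblinear", "saga"],
        "penalty": ["l1", "l2", "none"],
        "max_iter": [200, 500],
    },
    "AdaBoostClassifier": {
        "n_estimators": [50, 100],
        "learning_rate": [0.1, 0.5],
    },
    "XGBClassifier": {
        "n_estimators": [50, 100],
        "learning_rate": [0.1, 0.2],
        "max_depth": [5, 7],
    },
}

cv = 3  # number of cross-validation folds

# Number of parameter combinations per model, and the fits they require.
candidates = {name: prod(len(values) for values in grid.values())
              for name, grid in grids.items()}
fits = {name: n * cv for name, n in candidates.items()}

for name in grids:
    print(f"{name}: {candidates[name]} candidates, {fits[name]} cross-validation fits")
```

Each search also refits the best candidate once on the full training set, since `GridSearchCV` uses `refit=True` by default.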



#### 6.3 Train the models

In [35]:
models.items()

dict_items([('LogisticRegression', LogisticRegression()), ('AdaBoostClassifier', AdaBoostClassifier()), ('XGBClassifier', XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=None, n_jobs=None,
              num_parallel_tree=None, random_state=None, ...))])

In [46]:
model_and_score = []
print("models performance")

for model_name , model in models.items():
    # grid search over this model's hyperparameters with 3-fold cross-validation
    params = models_params[model_name]
    gs = GridSearchCV(model , params , cv = 3)
    gs.fit(X_train , y_train)
    # refit the model on the full training set with the best parameters found
    model.set_params(**gs.best_params_)
    
    model.fit(X_train , y_train)

    y_train_pred  = model.predict(X_train)
    y_test_pred  = model.predict(X_test)

    accuracy_train , recall_train, precision_train , f1_train = evaluate_model(y_train , y_train_pred)
    accuracy_test , recall_test, precision_test , f1_test = evaluate_model(y_test , y_test_pred)

    print(f"model_name {model_name}")
    print("performance in training set")
    print(f"accuracy : {accuracy_train}")
    print(f"recall: {recall_train}")
    print(f"precision: {precision_train}")
    print(f"f1_score: {f1_train}")
    print("-"*40)
    print("performance in test set")
    print(f"accuracy : {accuracy_test}")
    print(f"recall: {recall_test}")
    print(f"precision: {precision_test}")
    print(f"f1_score: {f1_test}")
    print("="*40)
    print("\n\n")

    # keep the test-set metrics for the model comparison in section 7.0
    model_and_score.append(
        { "model_name":model_name , "accuracy": accuracy_test , "recall" : recall_test , "precision":precision_test , "f1_score": f1_test}
    )    
model_and_score_df = pd.DataFrame(model_and_score)

models performance
model_name LogisticRegression
performance in training set
accuracy : 0.985
recall: 0.9851180922423163
precision: 0.9850854372341312
f1_score: 0.9850839910076417
----------------------------------------
performance in test set
accuracy : 0.975
recall: 0.9751423395445135
precision: 0.9745162412043855
f1_score: 0.9744608637076619



model_name AdaBoostClassifier
performance in training set
accuracy : 0.511875
recall: 0.5060808718136615
precision: 0.6588332105133874
f1_score: 0.46379129225071236
----------------------------------------
performance in test set
accuracy : 0.4425
recall: 0.46604753941710464
precision: 0.6163952493749177
f1_score: 0.40234506470219933



model_name XGBClassifier
performance in training set
accuracy : 1.0
recall: 1.0
precision: 1.0
f1_score: 1.0
----------------------------------------
performance in test set
accuracy : 0.895
recall: 0.8939575370281891
precision: 0.89236150796
f1_score: 0.8927623171836184
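
One way to read the output above is to compare training and test accuracy for each model: a large gap suggests the model fits the training set more closely than it generalizes. The sketch below computes that gap from the accuracies printed above. The names `reported` and `gaps` are illustrative:

```python
# Accuracies copied from the training output above.
reported = {
    "LogisticRegression": {"train": 0.985, "test": 0.975},
    "AdaBoostClassifier": {"train": 0.511875, "test": 0.4425},
    "XGBClassifier": {"train": 1.0, "test": 0.895},
}

# Difference between training and test accuracy for each model.
gaps = {name: round(acc["train"] - acc["test"], 6) for name, acc in reported.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: train-test accuracy gap = {gap}")
```

By this measure XGBClassifier shows the largest gap (a perfect training score against 0.895 on the test set), while LogisticRegression is nearly identical on both sets.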





### 7.0 Choose the best model

#### Create an average score

In [58]:
model_and_score_df["score_average"] = (model_and_score_df["accuracy"] + model_and_score_df["recall"] + model_and_score_df["precision"] + model_and_score_df["f1_score"])/4

In [59]:
model_and_score_df

Unnamed: 0,model_name,accuracy,recall,precision,f1_score,score_average
0,LogisticRegression,0.975,0.975142,0.974516,0.974461,0.97478
1,AdaBoostClassifier,0.4425,0.466048,0.616395,0.402345,0.481822
2,XGBClassifier,0.895,0.893958,0.892362,0.892762,0.89352


In [63]:
model_and_score_df.sort_values(by = "score_average" , ascending = False)

Unnamed: 0,model_name,accuracy,recall,precision,f1_score,score_average
0,LogisticRegression,0.975,0.975142,0.974516,0.974461,0.97478
2,XGBClassifier,0.895,0.893958,0.892362,0.892762,0.89352
1,AdaBoostClassifier,0.4425,0.466048,0.616395,0.402345,0.481822


#### Conclusion:
- The best model is LogisticRegression, which outperforms XGBoost (average score 0.975 vs 0.894).

#### Note
- Considering several models at once made me realize that, with a good repertoire of models, better results can be obtained with less complex models, as in this case, where logistic regression was chosen as the model.

# End of modeling