# Life cycle of a machine learning project
### Data analysis
- Understand the project statement
- Data collection
- Data checks to perform
- Exploratory data analysis

### Model development
- Understand the project statement
- Data collection
- Data cleaning
- Feature engineering
- Data preprocessing
- Model training
- Choose the best model
### Model deployment
- Structure the code into modules
- Configure a Docker image to make the code deployable
- Deploy the model on AWS

### Required libraries

In [90]:
#data extraction
from sqlalchemy import create_engine
import pandas as pd
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

#data cleaning
from sklearn.impute import SimpleImputer

# data preprocessing
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split

#models
from sklearn.model_selection import GridSearchCV

from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier

# model metrics

from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score



### 1.0 Problem statement

Company X specializes in buying second-hand cars. To decide which car to purchase, an employee must examine the car's features, such as the number of passengers it can carry and the number of doors it has. Based on these characteristics, the employee assigns the car one of four evaluation levels: unacceptable, acceptable, good, or very good. The company wants to automate this process and has given you access to its database. Your task is to create a program that, based on a car's characteristics, automatically determines whether the car is acceptable to buy.
<br>
url: https://www.kaggle.com/datasets/stealthtechnologies/car-evaluation-classification

### 2.0 Data Collection

#### 2.1 Data extraction from the database

In [7]:
driver = "ODBC+Driver+17+for+SQL+Server"
server_name = "localhost"
database = "BDdatasets"
UID = "sa"
PWD = "0440"

connection_string = f"mssql+pyodbc://{UID}:{PWD}@{server_name}/{database}?driver={driver}"

engine = create_engine(connection_string)

query  = "Select * FROM CarsBuyClassification"

df = pd.read_sql_query( query,engine )
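
The cell above hardcodes the database credentials. As a minimal sketch, assuming the values are supplied through environment variables (the names `DB_USER`, `DB_PASSWORD`, `DB_SERVER`, `DB_NAME` and `DB_DRIVER` are hypothetical, as is the helper `build_connection_string`), the same connection string could be assembled like this. The password is URL-encoded so that characters such as `@` or `/` do not break the URL.

```python
import os
from urllib.parse import quote_plus


def build_connection_string(env=None):
    """Build the mssql+pyodbc URL from environment variables.

    The variable names (DB_USER, DB_PASSWORD, ...) are hypothetical.
    """
    env = os.environ if env is None else env
    user = quote_plus(env.get("DB_USER", "sa"))
    password = quote_plus(env.get("DB_PASSWORD", ""))
    server = env.get("DB_SERVER", "localhost")
    database = env.get("DB_NAME", "BDdatasets")
    driver = quote_plus(env.get("DB_DRIVER", "ODBC Driver 17 for SQL Server"))
    return f"mssql+pyodbc://{user}:{password}@{server}/{database}?driver={driver}"


# Example with an explicit mapping instead of the real environment
connection_string = build_connection_string({"DB_USER": "sa", "DB_PASSWORD": "p@ss/word"})
print(connection_string)
```

The resulting string can be passed to `create_engine` exactly as in the cell above.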

#### 2.2 Check the top 5 records

In [8]:
df.head()

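|   | buying | maint | doors | persons | lug_boot | safety | class |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |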


#### 2.3 Contingency tables

In [10]:
for column in df.columns:
    if column != "class":
        contingency_table = pd.crosstab(df[column] , df["class"])
        print(f"relation between {column} and class")
        print(contingency_table)
        print()

relation between buying and class
class   acc  good  unacc  vgood
buying                         
high    108     0    324      0
low      89    46    258     39
med     115    23    268     26
vhigh    72     0    360      0

relation between maint and class
class  acc  good  unacc  vgood
maint                         
high   105     0    314     13
low     92    46    268     26
med    115    23    268     26
vhigh   72     0    360      0

relation between doors and class
class  acc  good  unacc  vgood
doors                         
2       81    15    326     10
3       99    18    300     15
4      102    18    292     20
5more  102    18    292     20

relation between persons and class
class    acc  good  unacc  vgood
persons                         
2          0     0    576      0
4        198    36    312     30
more     186    33    322     35

relation between lug_boot and class
class     acc  good  unacc  vgood
lug_boot                         
big       144    24    368  
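
Raw counts are easier to compare once each row is turned into shares. The sketch below, which only re-types the counts of the `persons` table printed above, divides every row by its total. On the full DataFrame, `pd.crosstab(df[column], df["class"], normalize="index")` should give the same kind of table directly.

```python
import pandas as pd

# Counts copied from the "persons" contingency table above
counts = pd.DataFrame(
    {
        "acc": [0, 198, 186],
        "good": [0, 36, 33],
        "unacc": [576, 312, 322],
        "vgood": [0, 30, 35],
    },
    index=pd.Index(["2", "4", "more"], name="persons"),
)

# Divide each row by its total so every row sums to 1
row_shares = counts.div(counts.sum(axis=1), axis=0)
print(row_shares.round(3))
```

In this table every two-seater is rated `unacc`, so `persons` alone already separates a third of the dataset.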

#### 2.4 Check for null values

In [13]:
df.isnull().sum()

buying      0
maint       0
doors       0
persons     0
lug_boot    0
safety      0
class       0
dtype: int64

### 3.0 Data cleaning

#### 3.1 Setting numerical and categorical columns

In [12]:
# From the EDA we know that all features are categorical
label = ["class"]
categorical_columns = [column for column in df.columns if column not in label]

In [130]:
categorical_columns

['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety']

#### 3.2 Getting the unique values for each column

In [17]:
for column in df.columns:
    print(f" {column}")
    print(df[column].unique())

 buying
['vhigh' 'high' 'med' 'low']
 maint
['vhigh' 'high' 'med' 'low']
 doors
['2' '3' '4' '5more']
 persons
['2' '4' 'more']
 lug_boot
['small' 'med' 'big']
 safety
['low' 'med' 'high']
 class
['unacc' 'acc' 'vgood' 'good']


In [47]:
# The categories are listed in ascending order because the encoder assigns 0 to the first category, 1 to the second, and so on
features_categories = [
    ["low" , "med" , "high" ,"vhigh"] , # buying
    ["low" , "med" , "high" ,"vhigh"],     # maint
    ['2', '3', '4', '5more'],            # doors
    ['2', '4', 'more'],                  # persons
    ['small', 'med', 'big'],             # lug_boot
    ['low', 'med', 'high']               # safety
]
label_categories = [['unacc' ,'acc' , 'good']]
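
To make the effect of this ordering concrete, here is a plain-Python illustration of the mapping. It is not how `OrdinalEncoder` is implemented, only a sketch of the result it is expected to produce; `lookups` and `encode_row` are hypothetical names.

```python
features_categories = [
    ["low", "med", "high", "vhigh"],  # buying
    ["low", "med", "high", "vhigh"],  # maint
    ["2", "3", "4", "5more"],         # doors
    ["2", "4", "more"],               # persons
    ["small", "med", "big"],          # lug_boot
    ["low", "med", "high"],           # safety
]

# One lookup table per feature: category -> position in its list
lookups = [{cat: idx for idx, cat in enumerate(cats)} for cats in features_categories]


def encode_row(row):
    """Map each categorical value to its position in the corresponding list."""
    return [lookup[value] for lookup, value in zip(lookups, row)]


# First record of the dataset
encoded = encode_row(["vhigh", "vhigh", "2", "2", "small", "low"])
print(encoded)  # [3, 3, 0, 0, 0, 0]
```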

#### 3.2 Merge the labels vgood and good into one category

In [28]:
df.loc[df["class"] == "vgood" , "class"] = "good"

In [31]:
df["class"].unique()

array(['unacc', 'acc', 'good'], dtype=object)
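
The merge leaves three classes of very different sizes. The sketch below reconstructs the class counts by summing the columns of the `persons` contingency table from section 2.3 (an assumption about where the numbers come from, not a new query) and applies the same merge.

```python
from collections import Counter

# Class counts obtained by summing the columns of the "persons" contingency table
class_counts = Counter(
    {"unacc": 576 + 312 + 322, "acc": 198 + 186, "good": 36 + 33, "vgood": 30 + 35}
)

# Merge "vgood" into "good", mirroring the df.loc assignment above
merged = Counter(class_counts)
merged["good"] += merged.pop("vgood")

total = sum(merged.values())
for label, count in merged.most_common():
    print(f"{label}: {count} ({count / total:.1%})")
```

Because `good` remains a small minority even after the merge, macro-averaged metrics are a reasonable complement to accuracy.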

#### 3.3 Setting X and y

In [33]:
X = df[categorical_columns]
y = df[label]

#### 3.4 Building the cleaning pipeline

In [71]:
# Because there are only categorical columns, the cleaning pipeline is short
cat_cleaning_pipeline = ColumnTransformer([
    ("cat_cleaning", SimpleImputer(strategy="most_frequent"), categorical_columns)
])


In [72]:
X_cleaned = pd.DataFrame(cat_cleaning_pipeline.fit_transform(X) , columns = categorical_columns)

### 4.0 Feature engineering

In [45]:
# Given the nature of the problem, I do not think feature engineering is possible here

### 5.0 Data preprocessing

#### 5.1 Create the preprocessor pipeline

In [75]:
preprocessor_pipeline = ColumnTransformer(
    [
        ("ordinal_encoder" , OrdinalEncoder(categories = features_categories) , categorical_columns)
    ]
)


In [76]:
X_preprocessed = preprocessor_pipeline.fit_transform(X_cleaned)
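
The cleaning and encoding steps are applied one after the other by hand. A possible alternative, sketched here on toy data with only two features, is to chain them in a single `Pipeline` so both steps are fitted and applied together. The array `X_toy` and the name `toy_pipeline` are made up for the illustration, and the missing value is added on purpose because the real dataset has none.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Toy data with two features (buying, persons) and one missing value
X_toy = np.array(
    [["low", "2"], ["high", "4"], [np.nan, "4"], ["low", "more"]],
    dtype=object,
)

toy_pipeline = Pipeline(
    [
        # Fill missing entries with the most frequent category of the column
        ("imputer", SimpleImputer(strategy="most_frequent")),
        # Encode each category as its position in the given ordered list
        (
            "encoder",
            OrdinalEncoder(categories=[["low", "med", "high", "vhigh"], ["2", "4", "more"]]),
        ),
    ]
)

encoded = toy_pipeline.fit_transform(X_toy)
print(encoded)
```

A single pipeline object is also easier to fit on the training split only and to ship alongside the model later.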

In [79]:
# demonstration with a single example
example = X_cleaned.iloc[0].values

print(example)

example = pd.DataFrame([example], columns=X_cleaned.columns)

print(preprocessor_pipeline.transform(example))

['vhigh' 'vhigh' '2' '2' 'small' 'low']
[[3. 3. 0. 0. 0. 0.]]


#### 5.2 Preprocess the labels

In [83]:
label_preprocessor = ColumnTransformer([
    ("label_transformer", OrdinalEncoder(categories=label_categories), label)
])

In [106]:
y_preprocessed = label_preprocessor.fit_transform(y)[: , 0]

In [107]:
y_preprocessed.shape

(1728,)

#### 5.3 Splitting the dataset

In [109]:
X_train , X_test , y_train , y_test  = train_test_split( X_preprocessed, y_preprocessed , test_size = 0.2 , random_state = 42 )

In [119]:
np.isnan(X_train).any()

np.False_
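
The split above is purely random. Since the classes are imbalanced, one option worth considering is a stratified split, which keeps the class shares nearly equal in both parts. The sketch below uses synthetic labels with class sizes derived from the contingency tables after merging `vgood` into `good` (1210, 384 and 134); the features are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the class sizes of the merged dataset:
# 1210 unacc (0), 384 acc (1), 134 good (2)
y_toy = np.repeat([0, 1, 2], [1210, 384, 134])
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Class shares should be nearly identical in both splits
train_shares = np.bincount(y_tr) / len(y_tr)
test_shares = np.bincount(y_te) / len(y_te)
print(train_shares.round(3))
print(test_shares.round(3))
```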

### 6.0 Model training

#### 6.1 Defining the evaluation function

In [110]:
def evaluate_model(y_true, y_pred):
    """Return accuracy, macro recall, macro precision, macro F1 and their mean."""
    accuracy = accuracy_score(y_true, y_pred)
    # "macro" averages the per-class scores without weighting by class size
    recall = recall_score(y_true, y_pred, average="macro")
    precision = precision_score(y_true, y_pred, average="macro")
    f1 = f1_score(y_true, y_pred, average="macro")
    average_score = (accuracy + recall + precision + f1) / 4
    return (accuracy, recall, precision, f1, average_score)
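
To see why macro averaging matters on imbalanced labels, the following hand-written sketch computes macro recall as the unweighted mean of the per-class recalls. It is meant to mirror what `recall_score(..., average="macro")` computes when every class appears in `y_true`; the helper `macro_recall` and the toy labels are illustrative only.

```python
def macro_recall(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    labels = sorted(set(y_true))
    recalls = []
    for label in labels:
        # Predictions made for the samples that truly belong to this class
        predicted = [p for t, p in zip(y_true, y_pred) if t == label]
        recalls.append(sum(p == label for p in predicted) / len(predicted))
    return sum(recalls) / len(recalls)


# Imbalanced toy labels: 8 samples of class 0, 2 samples of class 1
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                      # 0.9
print(macro_recall(y_true, y_pred))  # 0.75
```

Accuracy looks high because the majority class dominates, while macro recall exposes that half of the minority class is missed.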

#### 6.2 Defining the models and parameters

In [124]:
models = {
    "LogisticRegression": LogisticRegression(),
    "AdaBoostClassifier": AdaBoostClassifier(),
    "XGBClassifier": XGBClassifier(),
    "KNeighborsClassifier": KNeighborsClassifier()
}

# Define the hyperparameters for each model
models_params = {
    "LogisticRegression": {
        'C': [0.1, 1],
        'solver': ['liblinear', 'saga'],  # set the solver explicitly for compatibility with both penalties
        'penalty': ['l1', 'l2'],
        'max_iter': [200, 500]
    },
    "AdaBoostClassifier": {
        'n_estimators': [50, 100],
        'learning_rate': [0.1, 0.5]  ,    
        'algorithm': ['SAMME']  
    },
    "XGBClassifier": {
        'n_estimators': [50, 100],
        'learning_rate': [0.1, 0.2],
        'max_depth': [5, 7]
    },
    "KNeighborsClassifier": {
        'n_neighbors': [3, 5, 7, 9, 12],     # number of neighbors
        'weights': ['uniform', 'distance'],  # weighting of the points
        'metric': ['euclidean']              # distance metric
    }
}
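
Before launching the search it can help to know how much work each grid implies. This sketch counts the parameter combinations per model and multiplies them by the 3 folds used in the training cell; the grids are re-typed from the cell above, and `grid_sizes` is a hypothetical name.

```python
from math import prod

# Same grids as in the cell above
models_params = {
    "LogisticRegression": {"C": [0.1, 1], "solver": ["liblinear", "saga"],
                           "penalty": ["l1", "l2"], "max_iter": [200, 500]},
    "AdaBoostClassifier": {"n_estimators": [50, 100], "learning_rate": [0.1, 0.5],
                           "algorithm": ["SAMME"]},
    "XGBClassifier": {"n_estimators": [50, 100], "learning_rate": [0.1, 0.2],
                      "max_depth": [5, 7]},
    "KNeighborsClassifier": {"n_neighbors": [3, 5, 7, 9, 12],
                             "weights": ["uniform", "distance"], "metric": ["euclidean"]},
}

cv = 3  # number of folds used by GridSearchCV in this notebook

# Number of candidates = product of the lengths of all value lists
grid_sizes = {
    name: prod(len(values) for values in grid.values())
    for name, grid in models_params.items()
}
for name, candidates in grid_sizes.items():
    print(f"{name}: {candidates} candidates, {candidates * cv} cross-validation fits")
```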


#### 6.3 Training the models

In [125]:
model_and_score = []
print("models perfomance")

for model_name , model in models.items():
    params = models_params[model_name]
    gs = GridSearchCV(model , params , cv = 3)
    gs.fit(X_train , y_train)

    model.set_params(**gs.best_params_)

    model.fit(X_train , y_train)

    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    train_accuracy , train_recall , train_precision , train_f1 , train_average_score = evaluate_model(y_train , y_train_pred)
    test_accuracy , test_recall , test_precision , test_f1 , test_average_score = evaluate_model(y_test , y_test_pred)

    print(f"model_name {model_name}")
    print(f"perfomance in training set")
    print(f"train_accuracy: {train_accuracy}")
    print(f"train_recall : {train_recall}")
    print(f"train_precision : {train_precision}")
    print(f"train_f1 : {train_f1}")
    print(f"train_average_score : {train_average_score}")
    print("-" * 40)
    print(f"perfomance in test set")
    print(f"test_accuracy : {test_accuracy}")
    print(f"test_recall : {test_recall}")
    print(f"test_precision : {test_precision}")
    print(f"test_f1 : {test_f1}")
    print(f"test_average_score : {test_average_score}")
    print("="*40)
    print("\n\n")
    model_and_score.append(
        {
         "model_name": model_name,
         "test_accuracy":test_accuracy ,
         "test_recall":test_recall ,
         "test_precision":test_precision ,
         "test_f1":test_f1 ,
         "test_average_score":test_average_score 
        }
    )
model_and_score_df = pd.DataFrame(model_and_score)

models perfomance
model_name LogisticRegression
perfomance in training set
train_accuracy: 0.8393632416787264
train_recall : 0.7366841485753289
train_precision : 0.7783227952725819
train_f1 : 0.754931758320696
train_average_score : 0.7773254859618333
----------------------------------------
perfomance in test set
test_accuracy : 0.8236994219653179
test_recall : 0.7623338338155051
test_precision : 0.7795532969446013
test_f1 : 0.7662953555784426
test_average_score : 0.7829704770759667



model_name AdaBoostClassifier
perfomance in training set
train_accuracy: 0.85383502170767
train_recall : 0.7130767462755175
train_precision : 0.8636545782408335
train_f1 : 0.7086640988106075
train_average_score : 0.7848076112586571
----------------------------------------
perfomance in test set
test_accuracy : 0.8901734104046243
test_recall : 0.7442615446588787
test_precision : 0.8948294914295993
test_f1 : 0.7521416899162622
test_average_score : 0.8203515341023412



model_name XGBClassifier
perfomance i
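
The training loop copies the best parameters back into `model` and fits it again. With the default `refit=True`, `GridSearchCV` already retrains the best configuration on the whole training set and exposes it as `best_estimator_`, so the second fit can be skipped. The sketch below shows this on a small synthetic problem (not the car dataset) and also selects on macro F1 through the `scoring` argument instead of the default accuracy; whether that is preferable here is a judgment call.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic, well-separated two-class problem (not the car dataset)
rng = np.random.default_rng(42)
X_toy = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2))])
y_toy = np.array([0] * 30 + [1] * 30)

gs = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [3, 5], "weights": ["uniform", "distance"]},
    cv=3,
    scoring="f1_macro",  # select on macro F1 instead of the default accuracy
)
gs.fit(X_toy, y_toy)

# refit=True (the default) means best_estimator_ is already trained on all of X_toy
best_model = gs.best_estimator_
predictions = best_model.predict([[0.0, 0.0], [5.0, 5.0]])
print(gs.best_params_)
print(predictions)
```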

### 7.0 Choose the best model

#### 7.1 Choosing by average score

In [129]:
model_and_score_df.sort_values(by = "test_average_score" , ascending = False)

|   | model_name | test_accuracy | test_recall | test_precision | test_f1 | test_average_score |
|---|---|---|---|---|---|---|
| 2 | XGBClassifier | 0.982659 | 0.968015 | 0.943697 | 0.954321 | 0.962173 |
| 3 | KNeighborsClassifier | 0.942197 | 0.896013 | 0.913757 | 0.901865 | 0.913458 |
| 1 | AdaBoostClassifier | 0.890173 | 0.744262 | 0.894829 | 0.752142 | 0.820352 |
| 0 | LogisticRegression | 0.823699 | 0.762334 | 0.779553 | 0.766295 | 0.78297 |


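#### Conclusion
- The best model was XGBClassifier, with the highest test average score (0.962173) and the highest value on every individual test metric.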

<h1>END OF MODELING</h1>