# Life cycle of machine learning project
###  data analysis
- Understand the project statement
- Data Collection
- Data check to perform 
- Exploraroty data analysis

### model development
- Understand the project statement
- Data Collection
- Data Cleaning
- feature engineering
- Data preprocessing
- Model training
- Choose the best model
### model deployment
- structure the code in modular programming
- configure the docker image to make the code deployable
- deploy the model in aws

# libraries necesaries

In [11]:
#data extraction
from sqlalchemy import create_engine
import pandas as pd
import numpy as np

#data clearing and preprocessing 
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split

#models
from sklearn.model_selection import GridSearchCV

from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier

#models metrics

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score


### 1.0 problem statement

University X aims to analyze potential grades for students based on their characteristics. Using historical data on students, including fields such as StudentID, Age, Gender, Ethnicity, Parental Education, Weekly Study Time, Absences, Tutoring, Parental Support, Extracurricular Activities, Sports, Music, Volunteering, GPA, and GradeClass, they intend to create a program that can automatically predict GradeClass based on these student attributes. For this purpose, they have hired you to develop the predictive model.

dataset url: https://www.kaggle.com/datasets/rabieelkharoua/students-performance-dataset

### 2.0 Data collection

### 2.1 Data extraction from data base

In [12]:
driver = "ODBC+Driver+17+for+SQL+Server"
server_name = "localhost"
database = "BDdatasets"
UID = "sa"
PWD = "0440"

connection_string = f"mssql+pyodbc://{UID}:{PWD}@{server_name}/{database}?driver={driver}"

engine = create_engine(connection_string)

query = "SELECT * FROM StudentPerformance"

df = pd.read_sql_query( query ,engine )

#### 2.2 show top 5 records

In [13]:
df.head()

Unnamed: 0,StudentID,Age,Gender,Ethnicity,ParentalEducation,StudyTimeWeekly,Absences,Tutoring,ParentalSupport,Extracurricular,Sports,Music,Volunteering,GPA,GradeClass
0,1001,17,Female,Caucasian,Some College,19.833723,7,True,Moderate,False,False,True,False,2.929196,C
1,1002,18,Male,Caucasian,High School,15.408756,0,False,Low,False,False,False,False,3.042915,B
2,1003,15,Male,Asian,Bachelor's,4.21057,26,False,Moderate,False,False,False,False,0.112602,F
3,1004,17,Female,Caucasian,Bachelor's,10.02883,14,False,High,True,False,False,False,2.054218,D
4,1005,17,Female,Caucasian,Some College,4.672495,17,True,High,False,False,False,False,1.288061,F


#### 2.3 check if there is null values

In [14]:
df.isnull().sum()

StudentID            0
Age                  0
Gender               0
Ethnicity            0
ParentalEducation    0
StudyTimeWeekly      0
Absences             0
Tutoring             0
ParentalSupport      0
Extracurricular      0
Sports               0
Music                0
Volunteering         0
GPA                  0
GradeClass           0
dtype: int64

### 3.0 data cleaning 

#### 3.1 setting bool column as int

In [23]:
bool_columns = df.select_dtypes(include = ["bool"]).columns
df[bool_columns]  = df[bool_columns].astype(int)

#### 3.1 setting nominal, ordinal , numerical columns

In [93]:
label = ["GradeClass"]

numerical_columns = ['Age', 'StudyTimeWeekly', 'Absences']

nominal_columns = [ "Gender" ,"Ethnicity" ]

ordinal_columns = [ 'ParentalEducation',  'ParentalSupport', 
                 "Tutoring" , 'Extracurricular', 'Sports', 'Music', 'Volunteering' ]


categorical_columns = nominal_columns + ordinal_columns 

#### 3.2 setting ordinal column values

In [198]:
ordinal_columns_values = [
    ['None', 'High School', 'Some College', "Bachelor's", 'Higher'],  # ParentalEducation 
    ['None', 'Low', 'Moderate', 'High', 'Very High'] , 
    
]

#### 3.3 setting X and y

In [109]:
X = df.drop(columns = [ "StudentID", "GPA" , "GradeClass"])
y = df[["GradeClass"]]

#### 3.4 making cleaning pipeline

In [110]:
cleaning_pipeline = ColumnTransformer(
    [
         ("categorical_imputer" , SimpleImputer(strategy = "most_frequent") , categorical_columns ),
         ("numerical_imputer" , SimpleImputer(strategy = "mean") , numerical_columns)
    ]
)

In [186]:
X_cleaned = pd.DataFrame(cleaning_pipeline.fit_transform(X) , columns = categorical_columns + numerical_columns)

In [187]:
X_cleaned.head(1)

Unnamed: 0,Gender,Ethnicity,ParentalEducation,ParentalSupport,Tutoring,Extracurricular,Sports,Music,Volunteering,Age,StudyTimeWeekly,Absences
0,Female,Caucasian,Some College,Moderate,1,0,0,1,0,17.0,19.833723,7.0


#### 4.0 Feature enginerring 

In [194]:
new_num_columns = ["extra_activities_studies" , "extra_activities_no_studies"]
X_cleaned["extra_activities_studies"] = X_cleaned["Tutoring"] + X_cleaned["Extracurricular"]
X_cleaned["extra_activities_no_studies"] = X_cleaned["Sports"] + X_cleaned["Music"] + X_cleaned["Volunteering"]
X_cleaned = X_cleaned.drop(columns = ["Tutoring" , "Extracurricular" , "Sports" , "Music" , "Volunteering"])

In [195]:
X_cleaned

Unnamed: 0,Gender,Ethnicity,ParentalEducation,ParentalSupport,Age,StudyTimeWeekly,Absences,extra_activities_studies,extra_activities_no_studies
0,Female,Caucasian,Some College,Moderate,17.0,19.833723,7.0,1,1
1,Male,Caucasian,High School,Low,18.0,15.408756,0.0,0,0
2,Male,Asian,Bachelor's,Moderate,15.0,4.21057,26.0,0,0
3,Female,Caucasian,Bachelor's,High,17.0,10.02883,14.0,1,0
4,Female,Caucasian,Some College,High,17.0,4.672495,17.0,1,0
...,...,...,...,...,...,...,...,...,...
2387,Female,Caucasian,Bachelor's,Very High,18.0,10.680554,2.0,1,0
2388,Male,Caucasian,High School,Very High,17.0,7.583217,4.0,1,1
2389,Female,Caucasian,Some College,Moderate,16.0,6.8055,20.0,0,1
2390,Female,African American,,Moderate,16.0,12.416653,17.0,0,2


### 5.0 Data preprocessing

#### 5.1 make the preprocessor

In [199]:

numerical_columns = ['Age', 'StudyTimeWeekly', 'Absences'] + new_num_columns

nominal_columns = [ "Gender" ,"Ethnicity" ]

ordinal_columns = [ 'ParentalEducation',  'ParentalSupport']


cat_nominal_preprocessing_steps = Pipeline(
    steps = [
        ("one_hot_encoder_steps" , OneHotEncoder())
    ]
)

cat_ordinal_preprocessing_steps = Pipeline(
    steps =  [
        ("ordinal_encoder_steps" , OrdinalEncoder(categories =  ordinal_columns_values )  ) , 
        ("ordinal_scaler" , StandardScaler())
    ]
)

num_preprocessing_steps = Pipeline(
    steps = [
        ("standard_scaler_steps" , StandardScaler())
    ]
)

preprocessor = ColumnTransformer(
    [
        ("cat_nominal_preprocessor" ,cat_nominal_preprocessing_steps ,nominal_columns  ), 
        ("cat_ordinal_preprocessor",   cat_ordinal_preprocessing_steps, ordinal_columns),
        ("num_preprocessor" , num_preprocessing_steps , numerical_columns )
    ]
)


In [211]:
X_preprocessed = preprocessor.fit_transform(X_cleaned)

In [212]:
X_preprocessed.shape

(2392, 13)

### 5.2 Making the label preprocessor

In [213]:
label_values = [["F" , "D" , "C" , "B" , "A"]]
label_encoder = OrdinalEncoder(categories = label_values )

In [214]:
y_preprocessed = label_encoder.fit_transform(y)[: ,0 ]

In [215]:
y_preprocessed.shape

(2392,)

### 5.3 dividing the dataset

In [216]:
X_train , X_test , y_train , y_test =  train_test_split(X_transformed , y_preprocessed , test_size =  0.1 , random_state = 42)

In [217]:
X_train.shape , X_test.shape

((2152, 16), (240, 16))

### 6.0 Model training

#### 6.1 defining evaluating function

In [218]:
def evaluate_model(y_true,  y_pred):
    accuracy = accuracy_score( y_true , y_pred)
    recall = recall_score( y_true , y_pred , average = "macro" )
    precision=  precision_score( y_true ,y_pred , average = "macro" )
    f1 = f1_score(y_true , y_pred , average = "macro")
    average_score = ( accuracy + recall + precision+ f1 ) /4
    return ( accuracy , recall , precision , f1 , average_score )

#### 6.2 defining the models and params

In [220]:
models = {
    "LogisticRegression": LogisticRegression(),
    "AdaBoostClassifier": AdaBoostClassifier(),
    "XGBClassifier": XGBClassifier(),
    "KNeighborsClassifier": KNeighborsClassifier()
}

# Definimos los hiperparámetros para cada modelo
models_params = {
    "LogisticRegression": {
        'C': [0.1, 1],
        'solver': ['liblinear', 'saga'],  # Agrega solver para compatibilidad
        'penalty': ['l1', 'l2'],
        'max_iter': [200, 500]
    },
    "AdaBoostClassifier": {
        'n_estimators': [50, 100],
        'learning_rate': [0.1, 0.5]  ,    
        'algorithm': ['SAMME']  
    },
    "XGBClassifier": {
        'n_estimators': [50, 100],
        'learning_rate': [0.1, 0.2],
        'max_depth': [5, 7]
    },
    "KNeighborsClassifier": {
        'n_neighbors': [3, 5, 7 , 9 , 12],       # Número de vecinos
        'weights': ['uniform', 'distance'],  # Peso de los puntos
        'metric': ['euclidean', ] # Métrica de distancia
    }
}


### 6.3 training the models

In [221]:
model_and_score = []
print("models perfomances")
for model_name , model in models.items():
    params = models_params[model_name]
    gs = GridSearchCV(model , params , cv = 3)

    gs.fit(X_train , y_train)
    model.set_params(**gs.best_params_)
    
    model.fit(X_train , y_train)

    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    

    train_accuracy , train_recall , train_precision , train_f1 , train_average_score = evaluate_model( y_train , y_train_pred )
    test_accuracy ,  test_recall  , test_precision , test_f1 , test_average_score = evaluate_model(y_test,  y_test_pred)

    print(f"model_name {model_name}")
    print(f"perfomance in training set")
    print(f"train_accuracy: {train_accuracy}")
    print(f"train_recall : {train_recall}")
    print(f"train_precision : {train_precision}")
    print(f"train_f1 : {train_f1}")
    print(f"train_average_score : {train_average_score}")
    print("-" * 40)
    print(f"perfomance in test set")
    print(f"test_accuracy : {test_accuracy}")
    print(f"test_recall : {test_recall}")
    print(f"test_precision : {test_precision}")
    print(f"test_f1 : {test_f1}")
    print(f"test_average_score : {test_average_score}")
    print("="*40)
    print("\n\n")

    model_and_score.append(
        {
            "model_name":model_name ,
            "test_accuracy":test_accuracy, 
            "test_recall":test_recall, 
            "test_precision":test_precision , 
            "test_f1":test_f1 , 
            "test_average_score" :test_average_score 
        }
        
    )
model_and_score_df = pd.DataFrame(model_and_score)

models perfomances
model_name LogisticRegression
perfomance in training set
train_accuracy: 0.7434944237918215
train_recall : 0.5416425797758888
train_precision : 0.5730243526004472
train_f1 : 0.5396173534408708
train_average_score : 0.599444677402257
----------------------------------------
perfomance in test set
test_accuracy : 0.7083333333333334
test_recall : 0.48662206108547573
test_precision : 0.48938820912124587
test_f1 : 0.4779112765077677
test_average_score : 0.5405637200119556





  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


model_name AdaBoostClassifier
perfomance in training set
train_accuracy: 0.6854089219330854
train_recall : 0.4657995324219639
train_precision : 0.4611800907485539
train_f1 : 0.45772497753921193
train_average_score : 0.5175283806607038
----------------------------------------
perfomance in test set
test_accuracy : 0.6541666666666667
test_recall : 0.41471478795869043
test_precision : 0.39770397356604253
test_f1 : 0.39765270696084815
test_average_score : 0.466059533788062



model_name XGBClassifier
perfomance in training set
train_accuracy: 0.854089219330855
train_recall : 0.7637567517362991
train_precision : 0.8590381002380021
train_f1 : 0.7986614091325145
train_average_score : 0.8188863701094178
----------------------------------------
perfomance in test set
test_accuracy : 0.725
test_recall : 0.5490380001014147
test_precision : 0.6464945226917058
test_f1 : 0.5722184337738424
test_average_score : 0.6231877391417407



model_name KNeighborsClassifier
perfomance in training set
train_acc

#### 7.0 chose the best model

#### 7.1 chosing by accuracy

In [223]:
model_and_score_df.sort_values(by = "test_accuracy" , ascending = False )

Unnamed: 0,model_name,test_accuracy,test_recall,test_precision,test_f1,test_average_score
2,XGBClassifier,0.725,0.549038,0.646495,0.572218,0.623188
0,LogisticRegression,0.708333,0.486622,0.489388,0.477911,0.540564
1,AdaBoostClassifier,0.654167,0.414715,0.397704,0.397653,0.46606
3,KNeighborsClassifier,0.583333,0.329988,0.413698,0.334521,0.415385


##### conclusion
- the best model by accuracy was XGBClassifier , but due that there is not many data its performing is poor

<h1> END OF MODELING</h1>