<h4 style="text-align: center; color: #BD6C37;"> <i> Ecole Polytechnique de Thiès <br>  Département Génie Informatique et Télécommunications </i> </h4>
<h1 style="text-align: center"> MLOps Principles </h1>
<h5 style="text-align: center">DIC3-GIT, 2023-2024</h5>
<h5 style="text-align: center">Mme Mously DIAW</h5>
<h1 style="text-align: center; color:#90edaa">Course project: Natural Language Processing with Disaster Tweets</h1>
<h5 style="text-align: center"> By Kikia DIA, Mouhamadou Naby DIA, Ndeye Awa SALANE </h5>
<h3 style="text-align: center; color:#9000aa; text-decoration:underline"> II. Model Exploration </h3>


<a id="0"></a> <br>
### Table of Contents
#### [Introduction](#1)
1. [Feature Selection](#3)
1. [Vectorisation](#2)
1. [Model building](#4)
1. [Pipeline](#8)
#### [Conclusion](#5)
* <i>[References](#6)</i>
* <i>[Authors](#7)</i>

<a id="1"></a> 
#### Introduction [⏮️]()[👆🏽](#0)[⏭️](#2)

#### Library imports

In [21]:
# Add the parent directory so that project module imports resolve
import sys
sys.path.append('..')

from gensim.models import Word2Vec
import numpy as np
import matplotlib.pyplot as plt
import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient
from nltk.tokenize import word_tokenize
import pandas as pd
import pycaret
from pycaret.classification import *
from pyngrok import ngrok
import ppscore as pps
import seaborn as sns
from settings.params import MODEL_PARAMS, SEED
from sklearn.ensemble import RandomForestClassifier, VotingClassifier, StackingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from src.data.make_dataset import get_dataset
from xgboost import XGBClassifier
import warnings

In [22]:
train, test = get_dataset(raw=False)

In [23]:
train.shape

(7591, 5)

In [24]:
train.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | deed reason earthquake may forgive | 1 |
| 1 | 4 | | | forest fire near la canada | 1 |
| 2 | 5 | | | resident ask shelter place notify officer evac... | 1 |
| 3 | 6 | | | receive wildfire evacuation order | 1 |
| 4 | 7 | | | get send photo ruby smoke pour school | 1 |


In [25]:
# train = train.dropna(subset=['text'])

In [26]:
train.shape

(7591, 5)

<a id="2"></a> 
#### 1. Feature Selection [⏮️](#1)[👆🏽](#0)[⏭️](#3)

In [27]:
# Concatenate 'text' with 'keyword' and 'location' when they are present
# (note: `value or ''` does not drop NaN, because NaN is truthy)
# train['combined_text'] = train.apply(
#     # lambda row: f"{row['text']} {row['keyword'] or ''} ".strip(),
#     lambda row: f"{row['text']} {row['keyword'] or ''} {row['location'] or ''}".strip(),

#     axis=1
# )
# X = train['combined_text']
X = train["text"]
y = train['target']
X

0                      deed reason earthquake may forgive
1                              forest fire near la canada
2       resident ask shelter place notify officer evac...
3                       receive wildfire evacuation order
4                   get send photo ruby smoke pour school
                              ...                        
7586     two giant crane hold bridge collapse nearby home
7587    aria control wild fire even northern part stat...
7588                                              volcano
7589    police investigate bike collide car little bik...
7590                     late home raze northern wildfire
Name: text, Length: 7591, dtype: object
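
The commented-out cell above experiments with appending `keyword` and `location` to the tweet text. Below is a minimal sketch of a NaN-safe version of that idea in plain Python. The helper names `is_missing` and `combine_fields` are hypothetical, the keyword and location values are illustrative, and missing values are assumed to arrive as `None` or float NaN.

```python
import math


def is_missing(value):
    """Return True for None or float NaN (how empty cells usually appear)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def combine_fields(text, keyword=None, location=None):
    """Join the non-missing, non-empty fields with single spaces (hypothetical helper)."""
    parts = [text, keyword, location]
    return " ".join(str(p).strip() for p in parts if not is_missing(p) and str(p).strip())


# NaN is truthy, so `value or ''` would keep it; the explicit check drops it.
print(combine_fields("forest fire near la canada", float("nan"), None))
# Illustrative keyword and location values, not taken from the dataset
print(combine_fields("late home raze northern wildfire", "wildfire", "california"))
```

On the DataFrame this could be applied row-wise, for example `train.apply(lambda r: combine_fields(r['text'], r['keyword'], r['location']), axis=1)`.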

##### CHI2 test

In [28]:


# # Sample data - Replace this with your tweets and labels
# data = train
# tweets = data["text"]
# labels = data["target"]

# # Convert text data to numerical features
# # (stored in X_counts so the raw text in X is not overwritten)
# vectorizer = CountVectorizer(stop_words='english')
# X_counts = vectorizer.fit_transform(tweets)


# # Apply Chi-Square test
# chi2_scores, p_values = chi2(X_counts, labels)

# # Create a DataFrame to view feature names and their scores
# feature_names = vectorizer.get_feature_names_out()
# chi2_df = pd.DataFrame({'Feature': feature_names, 'Chi2 Score': chi2_scores})

# # Sort by Chi2 Score in descending order
# chi2_df = chi2_df.sort_values(by='Chi2 Score', ascending=False)

# # Display the top features
# print(chi2_df.head(20))
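
The commented cell above ranks bag-of-words features with scikit-learn's `chi2`. As a rough illustration of what that score measures, the sketch below recomputes it in plain Python for a tiny made-up count matrix: for each term, the observed per-class counts are compared with the counts expected if the term were independent of the class. The data and the function name `chi2_scores` are assumptions for illustration only.

```python
def chi2_scores(X, y):
    """Chi-square statistic per feature for non-negative counts X and labels y."""
    n_samples = len(X)
    n_features = len(X[0])
    classes = sorted(set(y))
    # Fraction of samples that belong to each class
    class_prob = {c: sum(1 for label in y if label == c) / n_samples for c in classes}

    scores = []
    for j in range(n_features):
        feature_total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            expected = class_prob[c] * feature_total
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


# Toy term counts: column 0 occurs only in class 1, column 1 is spread evenly.
X_toy = [[2, 1], [2, 1], [0, 1], [0, 1]]
y_toy = [1, 1, 0, 0]
print(chi2_scores(X_toy, y_toy))  # [4.0, 0.0]
```

A term concentrated in one class gets a high score, while a term spread evenly across classes scores zero, which is why sorting by this score surfaces class-specific vocabulary.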


<a id="2"></a> 
#### 2. Vectorisation [⏮️](#2)[👆🏽](#0)[⏭️](#4)

##### Bag of words

In [29]:
# cv = CountVectorizer()

In [30]:
# X_vect = cv.fit_transform(X)
# X_vect = X_vect.toarray()

##### TF-IDF

In [31]:
tfidf = TfidfVectorizer()

In [32]:
X_vect = tfidf.fit_transform(X)
X_vect = X_vect.toarray()
X_vect

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
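
To make the weights in this matrix less opaque, here is a minimal plain-Python sketch of the weighting that `TfidfVectorizer` documents as its default: raw term counts multiplied by a smoothed IDF, followed by L2 normalisation of each row. Tokenisation is simplified to whitespace splitting, and the function name `tfidf_rows` and the three toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) with L2-normalised TF-IDF weights per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows


# Toy corpus (assumption): two tweets from the preview plus a shortened variant
docs = ["forest fire near la canada", "receive wildfire evacuation order", "forest fire"]
vocab, rows = tfidf_rows(docs)
first = dict(zip(vocab, rows[0]))
# 'canada' appears in one document and 'fire' in two, so 'canada' weighs more.
print(round(first["canada"], 3), round(first["fire"], 3))
```

Because most tweets contain only a handful of the vocabulary's terms, nearly every entry of a row is zero, which matches the array printed above.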

In [33]:
# # X_vect is already a dense array after .toarray(), so no .T.todense() is needed
# df = pd.DataFrame(X_vect[2],
#     	index=tfidf.get_feature_names_out(), columns=["TF-IDF"])
# df = df.sort_values('TF-IDF', ascending=False)
# df

##### Word2Vec

In [34]:
# # Preprocess the text data
# def preprocess_text(text):
#     # Tokenize the text
#     tokens = word_tokenize(text.lower())
#     return tokens

# # Apply preprocessing to the text data
# X_tokenized = X.apply(preprocess_text)

# # Train Word2Vec model
# word2vec_model = Word2Vec(sentences=X_tokenized, vector_size=100, window=5, min_count=1, workers=4, sg=1)

# # Transform text data into vectors by averaging the word vectors
# def transform_text_to_vector(tokens, model):
#     word_vectors = [model.wv[word] for word in tokens if word in model.wv]
#     if len(word_vectors) == 0:
#         return np.zeros(model.vector_size)
#     return np.mean(word_vectors, axis=0)

# X_vect = X_tokenized.apply(lambda tokens: transform_text_to_vector(tokens, word2vec_model))
# X_vect = np.vstack(X_vect)
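
The averaging step in the commented cell can be illustrated without training a model. The sketch below uses a hypothetical two-dimensional embedding table in place of `word2vec_model.wv`; the function name `average_vector` and the toy vectors are assumptions, not the notebook's actual embeddings.

```python
def average_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical 2-dimensional embeddings standing in for a trained Word2Vec model
toy_embeddings = {"forest": [1.0, 0.0], "fire": [0.0, 1.0]}

print(average_vector(["forest", "fire", "unknownword"], toy_embeddings, dim=2))  # [0.5, 0.5]
print(average_vector(["volcano"], toy_embeddings, dim=2))  # [0.0, 0.0]
```

Averaging gives every tweet a fixed-length vector regardless of its length, at the cost of discarding word order.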



<a id="3"></a> 
#### 3. Model building [⏮️](#3)[👆🏽](#0)[⏭️](#5)

##### Train / Validation split


In [35]:
X_train_vect, X_val_vect, y_train, y_val = train_test_split(X_vect, y, test_size=0.15, random_state=50)

In [36]:
warnings.filterwarnings("ignore")

# Set up MLflow tracking
local_registry = "sqlite:///mlruns.db"
mlflow.set_tracking_uri(local_registry)

# Create an experiment
exp_name = "disaster_tweets_classification"
experiment = mlflow.get_experiment_by_name(exp_name)
if not experiment:
    experiment_id = mlflow.create_experiment(exp_name)
else:
    experiment_id = experiment.experiment_id

##### PyCaret

In [37]:
# # Combine features and labels into a DataFrame
# df = pd.DataFrame(X_vect)
# df['target'] = y

# # Use PyCaret for classification
# clf = setup(data=df, target='target', session_id=123)

# # Compare different models
# best_model = compare_models()

# # Pull the results
# results = pull()

# # Print the results
# print(results)

# # Finalize the best model (optional)
# final_model = finalize_model(best_model)

# # Print the best model details
# print(final_model)

##### Grid Search for some models

In [38]:
def train_and_evaluate_with_grid_search(model, param_grid, X_train, y_train, X_val, y_val, model_name):
    # Perform Grid Search
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, 
                               cv=5, n_jobs=-1, verbose=2, scoring='accuracy')
    grid_search.fit(X_train, y_train)
    
    # Get the best model from Grid Search
    best_model = grid_search.best_estimator_
    best_params = grid_search.best_params_
    
    with mlflow.start_run(run_name=model_name, experiment_id=experiment_id) as run:
        # No extra fit needed: with the default refit=True, GridSearchCV has already
        # refitted best_estimator_ on the full training set

        # Make predictions
        y_train_pred = best_model.predict(X_train)
        y_val_pred = best_model.predict(X_val)

        # Calculate metrics for training set
        train_f1 = f1_score(y_train, y_train_pred, average='weighted')
        train_precision = precision_score(y_train, y_train_pred, average='weighted')
        train_recall = recall_score(y_train, y_train_pred, average='weighted')

        # Calculate metrics for validation set
        val_f1 = f1_score(y_val, y_val_pred, average='weighted')
        val_precision = precision_score(y_val, y_val_pred, average='weighted')
        val_recall = recall_score(y_val, y_val_pred, average='weighted')

        # Log parameters and metrics
        mlflow.log_params(best_params)
        mlflow.log_metric("train_f1_score", train_f1)
        mlflow.log_metric("train_precision", train_precision)
        mlflow.log_metric("train_recall", train_recall)
        mlflow.log_metric("val_f1_score", val_f1)
        mlflow.log_metric("val_precision", val_precision)
        mlflow.log_metric("val_recall", val_recall)

        # Log the best model
        mlflow.sklearn.log_model(best_model, model_name, input_example=X_train[:30])

        # Print metrics
        print(f'{model_name} Training F1 Score: {train_f1}')
        print(f'{model_name} Training Precision: {train_precision}')
        print(f'{model_name} Training Recall: {train_recall}')
        print(f'{model_name} Validation F1 Score: {val_f1}')
        print(f'{model_name} Validation Precision: {val_precision}')
        print(f'{model_name} Validation Recall: {val_recall}')

        # Print best parameters
        print(f'Best parameters for {model_name}: {best_params}')
    return grid_search.best_estimator_



In [39]:
# Define classification models with fixed random seed
models = {
    'RandomForestClassifier': RandomForestClassifier(random_state=SEED),
    'SVC': SVC(C=100, gamma='scale', kernel='linear', random_state=SEED),
    'Naive Bayes': MultinomialNB(),
    'KNN': KNeighborsClassifier(),
    'XGBoost': XGBClassifier(eval_metric='logloss', random_state=SEED),
    'Logistic Regression': LogisticRegression(C=0.8)
}

# Parameter grids for each model
param_grids = {
    'RandomForestClassifier': [{
        'n_estimators': [10, 50, 100, 200],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4]
    }],
    'SVC': [{
        'C': [0.1, 1, 10, 100],
        'gamma': ['scale', 'auto'],
        'kernel': ['linear', 'rbf', 'poly', 'sigmoid']
    }],
    'Naive Bayes': [{
        'alpha': [0.01, 0.1, 1, 10]
    }],
    'KNN': [{
        'n_neighbors': [3, 5, 7, 9],
        'weights': ['uniform', 'distance'],
        'metric': ['euclidean', 'manhattan', 'minkowski']
    }],
    'XGBoost': [{
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 6, 10],
        'learning_rate': [0.01, 0.1, 0.2],
        'subsample': [0.6, 0.8, 1.0]
    }],
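    'Logistic Regression': [
        # One sub-grid per penalty, listing only the solvers that support it
        {'penalty': ['l2'], 'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
         'C': [0.01, 0.1, 0.8, 1, 10, 100], 'max_iter': [100, 200, 500, 1000]},
        {'penalty': ['l1'], 'solver': ['liblinear', 'saga'],
         'C': [0.01, 0.1, 0.8, 1, 10, 100], 'max_iter': [100, 200, 500, 1000]},
        {'penalty': ['elasticnet'], 'solver': ['saga'], 'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9],
         'C': [0.01, 0.1, 0.8, 1, 10, 100], 'max_iter': [100, 200, 500, 1000]},
        # Unpenalized fit: C has no effect (older scikit-learn releases spell this 'none')
        {'penalty': [None], 'solver': ['newton-cg', 'lbfgs', 'sag', 'saga'],
         'max_iter': [100, 200, 500, 1000]},
    ]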
}


In [40]:
# Perform Grid Search on each model
best_models = {}
for model_name, model in models.items():
    print(f"Performing Grid Search for {model_name}...")
    best_models[model_name] = train_and_evaluate_with_grid_search(model, param_grids[model_name], X_train_vect, y_train, X_val_vect, y_val, model_name)

Performing Grid Search for RandomForestClassifier...
Fitting 5 folds for each of 144 candidates, totalling 720 fits


KeyboardInterrupt: 
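
The log line above (144 candidates, 720 fits) follows directly from the size of the RandomForest grid. A small sketch of that arithmetic, with `count_fits` as a hypothetical helper name:

```python
from math import prod

# Same value lists as the 'RandomForestClassifier' entry of param_grids
rf_grid = {
    "n_estimators": [10, 50, 100, 200],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}


def count_fits(grid, cv):
    """Number of candidates in an exhaustive grid and total fits under cv-fold CV."""
    candidates = prod(len(values) for values in grid.values())
    return candidates, candidates * cv


print(count_fits(rf_grid, cv=5))  # (144, 720)
```

Since every added value multiplies the number of fits, one option for keeping the run time manageable is `RandomizedSearchCV` (already imported), which evaluates only `n_iter` sampled candidates.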

In [41]:
# Define a function to train and evaluate models
def train_and_evaluate(model, X_train, y_train, X_val, y_val, model_name):
    with mlflow.start_run(run_name=model_name, experiment_id=experiment_id) as run:
        # Train the model
        model.fit(X_train, y_train)

        # Make predictions
        y_train_pred = model.predict(X_train)
        y_val_pred = model.predict(X_val)

        # Calculate metrics for training set
        train_f1 = f1_score(y_train, y_train_pred, average='weighted')
        train_precision = precision_score(y_train, y_train_pred, average='weighted')
        train_recall = recall_score(y_train, y_train_pred, average='weighted')

        # Calculate metrics for validation set
        val_f1 = f1_score(y_val, y_val_pred, average='weighted')
        val_precision = precision_score(y_val, y_val_pred, average='weighted')
        val_recall = recall_score(y_val, y_val_pred, average='weighted')

        # Log parameters and metrics
        mlflow.log_params(model.get_params())
        mlflow.log_metric("train_f1_score", train_f1)
        mlflow.log_metric("train_precision", train_precision)
        mlflow.log_metric("train_recall", train_recall)
        mlflow.log_metric("val_f1_score", val_f1)
        mlflow.log_metric("val_precision", val_precision)
        mlflow.log_metric("val_recall", val_recall)

        # Log the model
        mlflow.sklearn.log_model(model, model_name)

        # Print metrics
        print(f'{model_name} Training F1 Score: {train_f1}')
        print(f'{model_name} Training Precision: {train_precision}')
        print(f'{model_name} Training Recall: {train_recall}')
        print(f'{model_name} Validation F1 Score: {val_f1}')
        print(f'{model_name} Validation Precision: {val_precision}')
        print(f'{model_name} Validation Recall: {val_recall}')
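
The metrics above all use `average='weighted'`, meaning that each class's score is weighted by its share of the true labels. The following plain-Python sketch shows that computation on made-up labels; `weighted_scores` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and F1 over the classes in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for cls, n_cls in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        predicted = sum(1 for p in y_pred if p == cls)
        p_cls = tp / predicted if predicted else 0.0
        r_cls = tp / n_cls
        f_cls = 2 * p_cls * r_cls / (p_cls + r_cls) if (p_cls + r_cls) else 0.0
        weight = n_cls / total  # class share of the true labels
        precision += weight * p_cls
        recall += weight * r_cls
        f1 += weight * f_cls
    return precision, recall, f1


# Made-up labels for illustration
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
precision, recall, f1 = weighted_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, f1, accuracy)
```

One consequence worth keeping in mind when reading the logs: support-weighted recall is algebraically equal to plain accuracy, so the `*_recall` values double as accuracy figures.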

##### Single model evaluation

In [42]:
# train_and_evaluate(model, X_train_vect, y_train, X_val_vect, y_val, model_name)

In [43]:
for model_name, model in models.items():
    train_and_evaluate(model,  X_train_vect, y_train, X_val_vect, y_val, model_name)



RandomForestClassifier Training F1 Score: 0.9816806422576758
RandomForestClassifier Training Precision: 0.9818872281650376
RandomForestClassifier Training Recall: 0.98171109733416
RandomForestClassifier Validation F1 Score: 0.7823297680413072
RandomForestClassifier Validation Precision: 0.7874803797384721
RandomForestClassifier Validation Recall: 0.7857769973661106




SVC Training F1 Score: 0.9764021832471756
SVC Training Precision: 0.9765986596552716
SVC Training Recall: 0.9764414135151891
SVC Validation F1 Score: 0.7407608782093515
SVC Validation Precision: 0.7405772973652628
SVC Validation Recall: 0.7410008779631255




Naive Bayes Training F1 Score: 0.86870304428356
Naive Bayes Training Precision: 0.8734729655368948
Naive Bayes Training Recall: 0.870272783632982
Naive Bayes Validation F1 Score: 0.7782802540920541
Naive Bayes Validation Precision: 0.7846659395802227
Naive Bayes Validation Recall: 0.7822651448639157




KNN Training F1 Score: 0.8247413837493035
KNN Training Precision: 0.8305193403409592
KNN Training Recall: 0.8273403595784253
KNN Validation F1 Score: 0.7555838177878885
KNN Validation Precision: 0.7620752382663547
KNN Validation Recall: 0.7603160667251976




XGBoost Training F1 Score: 0.8436545450751911
XGBoost Training Precision: 0.8572830449371619
XGBoost Training Recall: 0.847334159950403
XGBoost Validation F1 Score: 0.7707382424445831
XGBoost Validation Precision: 0.7806826590989302
XGBoost Validation Recall: 0.7761194029850746




Logistic Regression Training F1 Score: 0.8593260438056409
Logistic Regression Training Precision: 0.8679312620431813
Logistic Regression Training Recall: 0.8617482951022939
Logistic Regression Validation F1 Score: 0.777894968203914
Logistic Regression Validation Precision: 0.7914551062193333
Logistic Regression Validation Recall: 0.7840210711150132
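
One way to read these numbers is to compare training and validation F1 for each model. The sketch below does this with the scores printed above, rounded to four decimals; it is only a reading aid, not an additional experiment.

```python
# (train F1, validation F1), rounded from the run output above
f1_scores = {
    "RandomForestClassifier": (0.9817, 0.7823),
    "SVC": (0.9764, 0.7408),
    "Naive Bayes": (0.8687, 0.7783),
    "KNN": (0.8247, 0.7556),
    "XGBoost": (0.8437, 0.7707),
    "Logistic Regression": (0.8593, 0.7779),
}

# A large train/validation gap suggests the model fits the training set too closely
gaps = {name: round(train - val, 4) for name, (train, val) in f1_scores.items()}
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<24} val_f1={f1_scores[name][1]:.4f}  gap={gap:.4f}")
```

On this run the SVC and the random forest show the widest gaps (roughly 0.24 and 0.20), while validation F1 stays within a narrow band of about 0.74 to 0.78 across all six models.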


##### Voting & Stacking Classifiers

In [46]:
# Create a VotingClassifier from three base models.
# The grid search above was interrupted, so `models` holds the untuned estimators.
voting_clf = VotingClassifier(estimators=[
    # ('rf', models['RandomForestClassifier']),
    # ('svc', models['SVC']),
    ('nb', models['Naive Bayes']),
    # ('knn', models['KNN']),
    ('xgb', models['XGBoost']),
    ('lr',  models['Logistic Regression']),
], voting='hard')

stacking_clf = StackingClassifier(estimators=[
    # ('rf', models['RandomForestClassifier']),
    # ('svc', models['SVC']),
    ('nb', models['Naive Bayes']),
    # ('knn', models['KNN']),
    ('xgb', models['XGBoost']),
    ('lr',  models['Logistic Regression']),
], final_estimator=LogisticRegression(),
    cv=5)
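
With `voting='hard'`, the ensemble predicts the label that most base models agree on. A minimal sketch of that rule in plain Python, using hypothetical predictions from the three base models (`hard_vote` is an illustrative helper, not scikit-learn's implementation):

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Majority label per sample from a list of per-model prediction lists."""
    voted = []
    for sample_preds in zip(*predictions_per_model):
        voted.append(Counter(sample_preds).most_common(1)[0][0])
    return voted


# Hypothetical predictions from three base models on four tweets
nb_pred = [1, 0, 1, 0]
xgb_pred = [1, 1, 0, 0]
lr_pred = [0, 1, 1, 0]
print(hard_vote([nb_pred, xgb_pred, lr_pred]))  # [1, 1, 1, 0]
```

Stacking takes a different route: instead of counting votes, `StackingClassifier` trains its `final_estimator` on cross-validated predictions of the base models, so it can learn how much to trust each one.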

In [45]:
def train_and_evaluate_ensemble(model, X_train, y_train, X_val, y_val, model_name):
    with mlflow.start_run(run_name=model_name, experiment_id=experiment_id) as run:
        # Train the model
        model.fit(X_train, y_train)

        # Make predictions
        y_train_pred = model.predict(X_train)
        y_val_pred = model.predict(X_val)

        # Calculate metrics for training set
        train_f1 = f1_score(y_train, y_train_pred, average='weighted')
        train_precision = precision_score(y_train, y_train_pred, average='weighted')
        train_recall = recall_score(y_train, y_train_pred, average='weighted')

        # Calculate metrics for validation set
        val_f1 = f1_score(y_val, y_val_pred, average='weighted')
        val_precision = precision_score(y_val, y_val_pred, average='weighted')
        val_recall = recall_score(y_val, y_val_pred, average='weighted')

        # Log parameters and metrics
        # Log the actual ensemble class, since this function is also used for stacking
        mlflow.log_param("ensemble_method", type(model).__name__)
        mlflow.log_metric("train_f1_score", train_f1)
        mlflow.log_metric("train_precision", train_precision)
        mlflow.log_metric("train_recall", train_recall)
        mlflow.log_metric("val_f1_score", val_f1)
        mlflow.log_metric("val_precision", val_precision)
        mlflow.log_metric("val_recall", val_recall)

        # Log the model
        mlflow.sklearn.log_model(model, model_name, input_example=X_train[:30])

        # Print metrics
        print(f'{model_name} Training F1 Score: {train_f1}')
        print(f'{model_name} Training Precision: {train_precision}')
        print(f'{model_name} Training Recall: {train_recall}')
        print(f'{model_name} Validation F1 Score: {val_f1}')
        print(f'{model_name} Validation Precision: {val_precision}')
        print(f'{model_name} Validation Recall: {val_recall}')

# Train and evaluate the ensemble model
train_and_evaluate_ensemble(voting_clf, X_train_vect, y_train, X_val_vect, y_val, "Ensemble_VotingClassifier")



Ensemble_VotingClassifier Training F1 Score: 0.863603770780156
Ensemble_VotingClassifier Training Precision: 0.8722241400949332
Ensemble_VotingClassifier Training Recall: 0.8659330440173589
Ensemble_VotingClassifier Validation F1 Score: 0.7912193359751131
Ensemble_VotingClassifier Validation Precision: 0.8028585655908783
Ensemble_VotingClassifier Validation Recall: 0.7963125548726954


In [47]:
train_and_evaluate_ensemble(stacking_clf, X_train_vect, y_train, X_val_vect, y_val, "Ensemble_StackingClassifier")


Ensemble_StackingClassifier Training F1 Score: 0.8749792125648127
Ensemble_StackingClassifier Training Precision: 0.8790034884796407
Ensemble_StackingClassifier Training Recall: 0.8763174209547427
Ensemble_StackingClassifier Validation F1 Score: 0.7945039994239871
Ensemble_StackingClassifier Validation Precision: 0.7983152842613692
Ensemble_StackingClassifier Validation Recall: 0.797190517998244


In [None]:
import os

# Terminate open tunnels if they exist
ngrok.kill()
# Read the ngrok token from the environment instead of hard-coding a secret in the notebook
NGROK_AUTH_TOKEN = os.environ["NGROK_AUTH_TOKEN"]
ngrok.set_auth_token(NGROK_AUTH_TOKEN)
public_url = ngrok.connect(5000)
print("MLflow Tracking UI:", public_url)

# Transition models to production
client = MlflowClient()
for model_name in models.keys():
    try:
        model_versions = client.search_model_versions(f"name='{model_name}'")
        if model_versions:
            # Pick the highest version number rather than relying on the order of the search results
            latest_version = max(model_versions, key=lambda v: int(v.version))
            client.transition_model_version_stage(
                name=model_name,
                version=latest_version.version,
                stage="Production"
            )
    except Exception as e:
        print(f"Error transitioning model {model_name}: {e}")


MLflow Tracking UI: NgrokTunnel: "https://5c31-41-82-155-235.ngrok-free.app" -> "http://localhost:5000"



<a id="3"></a> 
#### 4. Pipeline [⏮️](#4)[👆🏽](#0)[⏭️](#6)

In [48]:
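from sklearn.pipeline import Pipeline

# The pipeline vectorizes internally, so it is fitted on raw text rather than on X_train_vect.
# Same split settings as above, applied to the raw 'text' column.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.15, random_state=50)

idf_pip = Pipeline(
    [
        ('tf_idf', TfidfVectorizer(ngram_range=(1, 1))),
        ('model', LogisticRegression(C=.8, solver='sag', max_iter=1000))
    ]
)
idf_pip.fit(X_train, y_train)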

<a id="5"></a> 
#### Conclusion [⏮️](#4)[👆🏽](#0)[⏭️](#6)


<a id="6"></a> 
#### <i>References</i> [⏮️](#5)[👆🏽](#0)[⏭️](#7)
Here are some references for more information on the libraries used:

- [Pandas documentation](https://pandas.pydata.org/docs/)
- [NumPy documentation](https://numpy.org/doc/stable/)
- [Tuning the hyper-parameters of an estimator](https://scikit-learn.org/stable/modules/grid_search.html)

🍀 Authors
- 🧑🏾‍💻 Kikia DIA
- 🧑🏾‍💻 Mouhamadou Naby DIA
- 🧑🏾‍💻 Ndeye Awa SALANE

🍀 Affiliations
- 🎓 Ecole Polytechnique de THIES

🍀 Department
- 💻 Génie Informatique et Télécommunications

🍀 Level
- 📚 DIC3