<a href="https://colab.research.google.com/github/aymanboufarhi/NLP-language-models/blob/main/NLP_language_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Lab** : Get familiar with NLP language models using Sklearn library

## **Part 1 :**

## Language Modeling / Regression :

Dataset : https://github.com/dbbrandt/short_answer_granding_capstone_project/blob/master/data/sag/answers.csv

Upload the dataset to the Colab runtime :

In [None]:
from google.colab import files
uploaded = files.upload()

Saving answers.csv to answers.csv


Preprocessing NLP pipeline : (tokenization, stemming, lemmatization, stop-word removal)

In [None]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Load dataset
df = pd.read_csv('answers.csv')

# Initialize NLP tools
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Tokenization
    tokens = word_tokenize(text)
    # Remove stop words
    tokens = [word for word in tokens if word.lower() not in stop_words]
    # Stemming and Lemmatization
    tokens = [ps.stem(word) for word in tokens]
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return tokens

# Apply preprocessing
df['processed_answer'] = df['answer'].apply(preprocess)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
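
The preprocessing step turns each answer into a list of tokens, which is the input shape Word2Vec expects later on. The sketch below shows that shape using only the standard library. The regular-expression tokenizer and the tiny `STOP_WORDS` set are stand-ins chosen for illustration, not what NLTK uses, and stemming and lemmatization are left out.

```python
import re

# Hypothetical, tiny stop-word list for illustration; NLTK's English list is much longer.
STOP_WORDS = {"the", "is", "a", "of", "to", "and", "in"}

def simple_preprocess(text):
    """Lowercase, keep runs of letters as tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(simple_preprocess("The stack is a LIFO structure."))
# ['stack', 'lifo', 'structure']
```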


Encode Data Vectors :

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec
import numpy as np

# Bag of Words
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(df['answer'])

# TF-IDF
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(df['answer'])

# Word2Vec
sentences = df['processed_answer'].tolist()
word2vec_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)  # CBOW
word2vec_model_sg = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)  # Skip Gram

# Transform answers into word vectors
def vectorize_text(text, model):
    vector = []
    for word in text:
        if word in model.wv:
            vector.append(model.wv[word])
    if len(vector) > 0:
        return np.mean(vector, axis=0)
    else:
        return np.zeros(model.vector_size)

df['w2v'] = df['processed_answer'].apply(lambda x: vectorize_text(x, word2vec_model))
df['w2v_sg'] = df['processed_answer'].apply(lambda x: vectorize_text(x, word2vec_model_sg))

# Ensure all vectors are of the same length and not empty
df = df[df['w2v'].apply(lambda x: len(x) == 100)]
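
The `vectorize_text` helper above represents an answer by the mean of its word vectors and falls back to a zero vector when no token is in the vocabulary. A minimal sketch of the same idea, with a plain dictionary of made-up 2-dimensional vectors standing in for `model.wv` (the names `toy_vectors` and `average_vector` are hypothetical):

```python
import numpy as np

# Hypothetical 2-dimensional "embeddings" standing in for model.wv
toy_vectors = {
    "stack": np.array([1.0, 0.0]),
    "queue": np.array([0.0, 1.0]),
}

def average_vector(tokens, vectors, size=2):
    """Mean of the vectors of known tokens; zeros if none are known."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    return np.mean(known, axis=0) if known else np.zeros(size)

print(average_vector(["stack", "queue", "unknown"], toy_vectors))  # [0.5 0.5]
print(average_vector(["unknown"], toy_vectors))                    # [0. 0.]
```

Averaging discards word order, so two answers with the same words in a different order receive the same vector.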

Train models using SVR, Naive Bayes, Linear Regression, and Decision Tree algorithms :

In [None]:
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Prepare data
X = np.array(df['w2v'].tolist())
y = df['score']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# SVR
svr = SVR()
svr.fit(X_train, y_train)
y_pred_svr = svr.predict(X_test)
mse_svr = mean_squared_error(y_test, y_pred_svr)

# Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
mse_lr = mean_squared_error(y_test, y_pred_lr)

# Decision Tree Regressor
dt = DecisionTreeRegressor()
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
mse_dt = mean_squared_error(y_test, y_pred_dt)

# Discretize the 'score' column into 5 bins so we can use Gaussian Naive Bayes
discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
y_discrete = discretizer.fit_transform(df[['score']]).ravel()

# Add the discretized scores to the dataframe
df['score_discrete'] = y_discrete

# Re-split with the discretized target (same random_state, so the same rows land in train and test)
X = np.array(df['w2v'].tolist())
y_discrete = df['score_discrete']
X_train, X_test, y_train, y_test = train_test_split(X, y_discrete, test_size=0.2, random_state=42)

# Naive Bayes
nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)

# Evaluate the model
accuracy_nb = accuracy_score(y_test, y_pred_nb)
print("Naive Bayes Accuracy:", accuracy_nb)

# Print MSE for each model
print("SVR MSE:", mse_svr)
print("Linear Regression MSE:", mse_lr)
print("Decision Tree MSE:", mse_dt)

Naive Bayes Accuracy: 0.5950920245398773
SVR MSE: 1.7126606119228087
Linear Regression MSE: 1.140539987001417
Decision Tree MSE: 2.016559304703476
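
The Naive Bayes model is trained on scores cut into 5 equal-width bins, so its accuracy is not directly comparable to the MSE values of the regressors. The sketch below approximates equal-width binning with NumPy to show which bin a score falls into. It is an illustration of the idea under that assumption, not a copy of `KBinsDiscretizer` internals, and the `scores` list is made up.

```python
import numpy as np

def uniform_bins(values, n_bins=5):
    """Assign each value to one of n_bins equal-width bins over [min, max]."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # Compare against interior edges only, so the maximum lands in the last bin.
    return np.searchsorted(edges[1:-1], values, side="right")

scores = [0.0, 1.0, 2.5, 4.0, 5.0]  # hypothetical scores
print(uniform_bins(scores))  # [0 1 2 4 4]
```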


Evaluate Models :

In [None]:
# Evaluation: RMSE is the square root of MSE, so it is in the same units as the score
rmse_svr = mse_svr ** 0.5
rmse_lr = mse_lr ** 0.5
rmse_dt = mse_dt ** 0.5

print(f"SVR MSE: {mse_svr}, RMSE: {rmse_svr}")
print(f"Linear Regression MSE: {mse_lr}, RMSE: {rmse_lr}")
print(f"Decision Tree MSE: {mse_dt}, RMSE: {rmse_dt}")

SVR MSE: 1.7126606119228087, RMSE: 1.3086865980527227
Linear Regression MSE: 1.140539987001417, RMSE: 1.0679606673475466
Decision Tree MSE: 2.016559304703476, RMSE: 1.420056092097589


Interpret the Obtained Results :

* SVR (Support Vector Regression) :

   - MSE : 1.7126606119228087
   - RMSE : 1.3086865980527227


* Linear Regression :

   - MSE : 1.140539987001417
   - RMSE : 1.0679606673475466

* Decision Tree Regressor :

   - MSE : 2.016559304703476
   - RMSE : 1.420056092097589

## Interpretation of Results
* Mean Squared Error (MSE) :
   - MSE is the average squared difference between the predicted values and the actual values.
   - A lower MSE indicates a better fit.

* Root Mean Squared Error (RMSE) :
   - RMSE is the square root of MSE, so it expresses the error in the same units as the score.
   - Like MSE, a lower RMSE indicates better model performance.
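
The two definitions above can be checked by hand on a small example. The values in `y_true` and `y_pred` are hypothetical and chosen so the arithmetic is easy to follow.

```python
import math

def mse(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical scores and predictions, for illustration only
y_true = [5.0, 3.0, 4.0, 2.0]
y_pred = [4.0, 3.0, 5.0, 4.0]

error = mse(y_true, y_pred)  # (1 + 0 + 1 + 4) / 4 = 1.5
print("MSE:", error)
print("RMSE:", math.sqrt(error))
```

The same relation holds for the reported results: the square root of the Linear Regression MSE (1.1405...) is its RMSE (1.0680...).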
   
## Analysis

* Linear Regression :

   - Achieves the lowest MSE (1.140539987001417) and RMSE (1.0679606673475466), indicating it has the best performance among the three models.
   - This suggests that, of the three models, a linear function of the averaged Word2Vec features generalizes best to the test split; it does not by itself show that the fit is good in absolute terms.

* SVR (Support Vector Regression):

   - Has a higher MSE (1.7126606119228087) and RMSE (1.3086865980527227) compared to Linear Regression.
   - SVR was trained with default hyperparameters (`SVR()`), so it may not be capturing the underlying patterns as effectively as Linear Regression here; tuning `C`, `epsilon`, or the kernel might narrow the gap.

* Decision Tree Regressor:

   - Exhibits the highest MSE (2.016559304703476) and RMSE (1.420056092097589) among the models.
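   - The tree was grown with default settings (no depth limit), and unconstrained Decision Trees can overfit the training data, which might explain the poorer test performance. Training error was not measured here, so this remains a hypothesis.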
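
One way to test the overfitting hypothesis is to compare training and test error for an unrestricted tree and a depth-limited one. The sketch below does this on synthetic data. The data, the coefficients, and the choice of `max_depth=3` are assumptions for illustration, so the numbers say nothing about the answers dataset itself.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy linear target
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

results = {}
for depth in (None, 3):  # None = grow until every leaf is pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, tree.predict(X_train))
    test_mse = mean_squared_error(y_test, tree.predict(X_test))
    results[depth] = (train_mse, test_mse)
    print(f"max_depth={depth}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

A training MSE near zero next to a much larger test MSE is the usual signature of overfitting; applying the same comparison to the notebook's `dt` model would confirm or refute it.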

## Conclusion
Based on the MSE and RMSE metrics, Linear Regression is the best-performing of the three regressors on this dataset, which indicates that a linear model captures the relationship between the averaged Word2Vec features and the score better than the alternatives tried. SVR, run with default settings, does not perform as well here, and the Decision Tree Regressor has the highest error, possibly because of overfitting.
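
To judge whether an MSE is good in absolute terms, it helps to compare it with a trivial baseline that always predicts the mean training score. The sketch below shows the computation on hypothetical scores; the notebook itself does not report such a baseline.

```python
from statistics import mean

def mse(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical scores, for illustration only
y_train = [5.0, 4.0, 3.0, 5.0]
y_test = [4.0, 5.0, 2.0]

baseline = mean(y_train)  # constant prediction: 4.25
baseline_mse = mse(y_test, [baseline] * len(y_test))
print("Baseline prediction:", baseline)
print("Baseline MSE:", baseline_mse)
```

A model is only useful if its test MSE is clearly below this baseline value computed on the same split.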

## Language Modeling / Classification :

Dataset : https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis

Upload the dataset to the Colab runtime :

In [None]:
from google.colab import files
uploaded = files.upload()

Saving twitter_training.csv to twitter_training.csv
Saving twitter_validation.csv to twitter_validation.csv


Preprocessing NLP pipeline : (tokenization, stemming, lemmatization, stop-word removal)

In [None]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Load datasets with appropriate column names
column_names = ['id', 'entity', 'sentiment', 'text']
df_train = pd.read_csv('twitter_training.csv', names=column_names, header=None, encoding='latin1')
df_val = pd.read_csv('twitter_validation.csv', names=column_names, header=None, encoding='latin1')

# Combine training and validation datasets for preprocessing
df = pd.concat([df_train, df_val])

# Initialize NLP tools
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    if isinstance(text, str):
        # Tokenization
        tokens = word_tokenize(text)
        # Remove stop words
        tokens = [word for word in tokens if word.lower() not in stop_words]
        # Stemming and Lemmatization
        tokens = [ps.stem(word) for word in tokens]
        tokens = [lemmatizer.lemmatize(word) for word in tokens]
        return ' '.join(tokens)
    return ""

# Apply preprocessing
df['processed_text'] = df['text'].apply(preprocess)

# Separate back into training and validation sets (copies avoid SettingWithCopyWarning later)
n_train = len(df_train)
df_train = df.iloc[:n_train].copy()
df_val = df.iloc[n_train:].copy()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Encode Data Vectors :

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec
import numpy as np

# Bag of Words
vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(df_train['processed_text'])
X_val_bow = vectorizer.transform(df_val['processed_text'])

# TF-IDF
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(df_train['processed_text'])
X_val_tfidf = tfidf.transform(df_val['processed_text'])

# Word2Vec
sentences = df['processed_text'].apply(word_tokenize).tolist()
word2vec_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)  # CBOW
word2vec_model_sg = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)  # Skip Gram

# Transform tweets into word vectors
def vectorize_text(text, model):
    vector = [model.wv[word] for word in text if word in model.wv]
    return np.mean(vector, axis=0) if vector else np.zeros(model.vector_size)

# Use .loc to avoid SettingWithCopyWarning
df_train.loc[:, 'w2v'] = df_train['processed_text'].apply(lambda x: vectorize_text(word_tokenize(x), word2vec_model))
df_val.loc[:, 'w2v'] = df_val['processed_text'].apply(lambda x: vectorize_text(word_tokenize(x), word2vec_model))
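
The Bag of Words and TF-IDF matrices built above are not used by the training cell that follows, which relies only on the averaged Word2Vec vectors. As a minimal sketch of how sparse TF-IDF features could feed a classifier, the example below chains a vectorizer and a logistic regression in a scikit-learn pipeline. The toy tweets and labels are hypothetical, and this is not the configuration used for the reported results.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy tweets and labels (hypothetical, for illustration only)
train_texts = [
    "love this great update",
    "great game love it",
    "hate this awful lag",
    "terrible bug hate it",
]
train_labels = ["Positive", "Positive", "Negative", "Negative"]

# The pipeline fits the vectorizer on the training text only, then the classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

pred = model.predict(["love great"])
print(pred)
```

Wrapping both steps in one pipeline keeps the vocabulary and IDF weights tied to the training data, which mirrors the `fit_transform` / `transform` split used above.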

Train Models :

In [None]:
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, f1_score, classification_report

# Prepare data
X_train = np.array(df_train['w2v'].tolist())
X_val = np.array(df_val['w2v'].tolist())
y_train = df_train['sentiment']
y_val = df_val['sentiment']

# SVM
svm = SVC()
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_val)

# Naive Bayes
nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_val)

# Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_val)

# AdaBoost
ab = AdaBoostClassifier()
ab.fit(X_train, y_train)
y_pred_ab = ab.predict(X_val)

Evaluate Models :

In [None]:
# SVM Evaluation
svm_accuracy = accuracy_score(y_val, y_pred_svm)
svm_f1 = f1_score(y_val, y_pred_svm, average='weighted')
print("SVM Accuracy:", svm_accuracy)
print("SVM F1 Score:", svm_f1)
print("\nSVM Classification Report:")
print(classification_report(y_val, y_pred_svm))

# Naive Bayes Evaluation
nb_accuracy = accuracy_score(y_val, y_pred_nb)
nb_f1 = f1_score(y_val, y_pred_nb, average='weighted')
print("Naive Bayes Accuracy:", nb_accuracy)
print("Naive Bayes F1 Score:", nb_f1)
print("\nNaive Bayes Classification Report:")
print(classification_report(y_val, y_pred_nb))

# Logistic Regression Evaluation
lr_accuracy = accuracy_score(y_val, y_pred_lr)
lr_f1 = f1_score(y_val, y_pred_lr, average='weighted')
print("Logistic Regression Accuracy:", lr_accuracy)
print("Logistic Regression F1 Score:", lr_f1)
print("\nLogistic Regression Classification Report:")
print(classification_report(y_val, y_pred_lr))

# AdaBoost Evaluation
ab_accuracy = accuracy_score(y_val, y_pred_ab)
ab_f1 = f1_score(y_val, y_pred_ab, average='weighted')
print("AdaBoost Accuracy:", ab_accuracy)
print("AdaBoost F1 Score:", ab_f1)
print("\nAdaBoost Classification Report:")
print(classification_report(y_val, y_pred_ab))

SVM Accuracy: 0.565
SVM F1 Score: 0.54482376414488

SVM Classification Report:
              precision    recall  f1-score   support

  Irrelevant       0.47      0.27      0.35       172
    Negative       0.58      0.77      0.66       266
     Neutral       0.63      0.38      0.48       285
    Positive       0.54      0.73      0.62       277

    accuracy                           0.56      1000
   macro avg       0.56      0.54      0.53      1000
weighted avg       0.57      0.56      0.54      1000

Naive Bayes Accuracy: 0.452
Naive Bayes F1 Score: 0.4540254091768486

Naive Bayes Classification Report:
              precision    recall  f1-score   support

  Irrelevant       0.26      0.44      0.33       172
    Negative       0.53      0.64      0.58       266
     Neutral       0.51      0.41      0.46       285
    Positive       0.58      0.32      0.41       277

    accuracy                           0.45      1000
   macro avg       0.47      0.45      0.44      1000
[output truncated]

Interpret the Obtained Results :

* Support Vector Machine (SVM) :

   - Accuracy : 56.5%
   - F1 Score : 54.48%
   - Precision : Ranges from 47% to 63% for different classes.
   - Recall : Ranges from 27% to 77% for different classes.
   - Interpretation : SVM performs moderately well, with accuracy roughly double the 25% expected from uniform random guessing over four classes. Recall is weak for the "Irrelevant" (27%) and "Neutral" (38%) classes.

* Naive Bayes :

   - Accuracy : 45.2%
   - F1 Score : 45.40%
   - Precision : Ranges from 26% to 58% for different classes.
   - Recall : Ranges from 32% to 64% for different classes.
   - Interpretation : Naive Bayes scores lower than SVM on both accuracy and F1. Its per-class results are uneven: precision drops to 26% on "Irrelevant" and recall to 32% on "Positive".

* Logistic Regression:

   - Accuracy : 51.8%
   - F1 Score : 49.28%
   - Precision : Ranges from 43% to 57% for different classes.
   - Recall : Ranges from 12% to 73% for different classes.
   - Interpretation : Logistic Regression performs better than Naive Bayes but worse than SVM. Its precision is moderate across classes, but recall is very low on some of them, most visibly the "Irrelevant" class.

* AdaBoost:

   - Accuracy : 50.1%
   - F1 Score : 47.51%
   - Precision : Ranges from 47% to 53% for different classes.
   - Recall : Ranges from 12% to 73% for different classes.
   - Interpretation : AdaBoost's performance is similar to Logistic Regression, with slightly lower accuracy and F1 score. It also struggles with recall, especially for the "Irrelevant" class.
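
The F1 scores quoted in this list are weighted averages: each class's F1 is weighted by its support (the number of validation examples in that class). The sketch below recomputes the macro and weighted averages from the per-class values printed in the SVM classification report. Because those inputs are rounded to two decimals, the weighted result differs slightly from the reported 0.5448.

```python
# Per-class (F1, support) pairs copied from the SVM classification report
report = {
    "Irrelevant": (0.35, 172),
    "Negative": (0.66, 266),
    "Neutral": (0.48, 285),
    "Positive": (0.62, 277),
}

total = sum(support for _, support in report.values())
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total

print(f"macro F1    = {macro_f1:.4f}")     # 0.5275
print(f"weighted F1 = {weighted_f1:.4f}")  # 0.5443
```

The macro average treats every class equally, so it drops further than the weighted average when a small class such as "Irrelevant" is classified poorly.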

Overall, SVM outperforms the other classifiers, followed by Logistic Regression and AdaBoost, while Naive Bayes has the lowest accuracy and F1 score. Every model has trouble with at least one class: in the two reports shown, "Irrelevant" has the lowest F1 for both SVM (0.35) and Naive Bayes (0.33).
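
To put these accuracies in context, they can be compared with two trivial baselines: guessing uniformly at random and always predicting the most frequent class. The sketch below derives both from the class counts in the `support` column of the classification reports.

```python
# Validation-set class counts, taken from the "support" column of the reports
support = {"Irrelevant": 172, "Negative": 266, "Neutral": 285, "Positive": 277}

total = sum(support.values())
majority_class = max(support, key=support.get)
majority_accuracy = support[majority_class] / total
uniform_accuracy = 1 / len(support)

print(majority_class, majority_accuracy)  # Neutral 0.285
print(uniform_accuracy)                   # 0.25
```

All four classifiers clear both baselines, with SVM at 56.5% and Naive Bayes at 45.2%.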