# Lab 4: Classification Models Part 1: Natural Language Processing and Classification Models




In this lab, we'll build a classifier for product reviews (restricted to the magazine category), like:

Excellent! I look forward to every issue. I had no idea just how much I didn't know. The letters from the subscribers are educational, too.

Label: ⭐️⭐️⭐️⭐️⭐️ (good)

My son waited and waited, it took the 6 weeks to get delivered that they said it would but when it got here he was so dissapointed, it only took him a few minutes to read it.

Label: ⭐️ (bad)



# Loading the data
First, let's load the train/test sets and take a look at the data.

Method 1: Basic File Import

If your dataset is in the same directory as your notebook, use:

In [None]:
import pandas as pd
train = pd.read_csv("reviews_train.csv")
test = pd.read_csv("reviews_test.csv")

test.sample(5)

Unnamed: 0,review,label
266,Love some PS. Have been ready it for years and...,good
951,"Interesting, informative articles and exotic r...",bad
463,I gave my first year college daughter The New ...,good
256,WE love this magazine. The tips in the magazin...,good
246,How can you go wrong for only $10.00. Great id...,good


Method 2: Load Data from Google Drive

If your dataset is stored in Google Drive, follow these steps:



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
test_path="/content/drive/MyDrive/Colab Notebooks/datasets/reviews_test.csv"
train_path="/content/drive/MyDrive/Colab Notebooks/datasets/reviews_train.csv"

train1 = pd.read_csv(train_path)
test1 = pd.read_csv(test_path)

train1.head()

Unnamed: 0,review,label
0,Based on all the negative comments about Taste...,good
1,I still have not received this. Obviously I c...,bad
2,</tr>The magazine is not worth the cost of sub...,good
3,This magazine is basically ads. Kindve worthle...,bad
4,"The only thing I've recieved, so far, is the b...",bad


# Training a baseline model
There are many approaches to training a classification model for text data. In this lab, we give you code that mirrors what you would find if you looked up how to [train a text classifier](https://scikit-learn.org/stable/auto_examples/text/index.html): we train an SVM on tf-idf features (numeric representations of each text field based on word occurrences).
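
To make the idea of tf-idf features concrete, here is a minimal sketch on a three-sentence toy corpus. The sentences are made up for illustration and are not from the lab dataset. Each review becomes a row of weights, one per vocabulary word, and within a review a word that appears in fewer documents receives a larger weight than a word that appears in many.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the lab dataset
toy_reviews = [
    "great magazine",
    "bad magazine",
    "great great articles",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_reviews)  # sparse matrix: one row per review

vocab = toy_vectorizer.vocabulary_  # maps each word to its column index
print(sorted(vocab))                # the learned vocabulary
print(toy_matrix.shape)             # (number of reviews, vocabulary size)

# In "bad magazine", the rarer word "bad" gets a larger weight than "magazine",
# which also appears in another review
row = toy_matrix[1].toarray().ravel()
print(row[vocab["bad"]], row[vocab["magazine"]])
```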

In [None]:
# Text preprocessing
# 1. Clean the raw text data:
#    - split text into words (tokenization)
#    - convert words to their base form (lemmatization)

# 2. Turn the text into numbers:
#    - map text to vectors (vectorization or word embedding)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

In [None]:
# Method 1: chain the vectorizer and the classifier in a Pipeline
classifier1 = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', LinearSVC())
])

classifier = classifier1.fit(train['review'], train['label'])

In [None]:
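# Method 2: vectorize and fit in separate steps
vectorizer = TfidfVectorizer()  # tf-idf: term frequency weighted by inverse document frequency
train_vectors = vectorizer.fit_transform(train['review'])  # learn the vocabulary and transform the training reviews
test_vectors = vectorizer.transform(test['review'])  # reuse the fitted vocabulary on the test reviews


classe = LinearSVC()
classe.fit(train_vectors, train['label'])  # fit the SVM on the tf-idf vectors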

# Evaluating model accuracy


In [None]:
from sklearn import metrics
from sklearn.metrics import accuracy_score

In [None]:
def evaluate(model):
  predicted = model.predict(test['review'])
  print(metrics.classification_report(test['label'], predicted))
  print("Accuracy:", accuracy_score(test['label'], predicted))  # fraction of test reviews classified correctly

In [None]:
evaluate(classifier)

# recall: of all the reviews that actually belong to a class, the fraction the model identified correctly
# precision: of all the reviews predicted to belong to a class, the fraction that actually belong to it
# f1-score: harmonic mean of precision and recall
# support: number of test instances in each class

              precision    recall  f1-score   support

         bad       0.73      0.70      0.71       500
        good       0.71      0.74      0.73       500

    accuracy                           0.72      1000
   macro avg       0.72      0.72      0.72      1000
weighted avg       0.72      0.72      0.72      1000

Accuracy: 0.72
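
As a sanity check on how the columns of the report are computed, the sketch below recomputes precision, recall, and F1 for the "bad" class by hand and compares them with scikit-learn. The eight labels are hypothetical values made up for illustration, not the lab data.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical labels for eight reviews, made up for illustration
y_true = ["bad", "bad", "bad", "bad", "good", "good", "good", "good"]
y_pred = ["bad", "bad", "good", "good", "bad", "good", "good", "good"]

# Count outcomes by hand, treating "bad" as the class of interest
tp = sum(t == "bad" and p == "bad" for t, p in zip(y_true, y_pred))   # correctly flagged as bad
fp = sum(t == "good" and p == "bad" for t, p in zip(y_true, y_pred))  # wrongly flagged as bad
fn = sum(t == "bad" and p == "good" for t, p in zip(y_true, y_pred))  # bad reviews that were missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(precision, recall, f1)

# scikit-learn computes the same quantities
sk_precision = precision_score(y_true, y_pred, pos_label="bad")
sk_recall = recall_score(y_true, y_pred, pos_label="bad")
sk_f1 = f1_score(y_true, y_pred, pos_label="bad")
print(sk_precision, sk_recall, sk_f1)
```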


# Trying another model
72% accuracy is not great for this binary classification problem. Can you do better with a different model, or by tuning the hyperparameters of the SVM?

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier, RidgeClassifier, PassiveAggressiveClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, BaggingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, classification_report


classifier2 = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', KNeighborsClassifier())
])

classifier =classifier2.fit(train['review'], train['label'])

evaluate(classifier)

              precision    recall  f1-score   support

         bad       0.81      0.77      0.79       500
        good       0.78      0.82      0.80       500

    accuracy                           0.79      1000
   macro avg       0.79      0.79      0.79      1000
weighted avg       0.79      0.79      0.79      1000

Accuracy: 0.791


In [None]:
classifier3 = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', RandomForestClassifier())
])

classifier =classifier3.fit(train['review'], train['label'])

evaluate(classifier)

              precision    recall  f1-score   support

         bad       0.82      0.89      0.85       500
        good       0.88      0.81      0.84       500

    accuracy                           0.85      1000
   macro avg       0.85      0.85      0.85      1000
weighted avg       0.85      0.85      0.85      1000

Accuracy: 0.849


In [None]:
classifier4 = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', GradientBoostingClassifier())
])

classifier=classifier4.fit(train['review'], train['label'])

evaluate(classifier)

              precision    recall  f1-score   support

         bad       0.86      0.69      0.77       500
        good       0.74      0.89      0.81       500

    accuracy                           0.79      1000
   macro avg       0.80      0.79      0.79      1000
weighted avg       0.80      0.79      0.79      1000

Accuracy: 0.792


In [None]:
classifier5 = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', SVC())
])

classifier=classifier5.fit(train['review'], train['label'])

evaluate(classifier)

              precision    recall  f1-score   support

         bad       0.83      0.83      0.83       500
        good       0.83      0.82      0.83       500

    accuracy                           0.83      1000
   macro avg       0.83      0.83      0.83      1000
weighted avg       0.83      0.83      0.83      1000

Accuracy: 0.829


In [None]:
classifier6 = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', MultinomialNB())
])

classifier=classifier6.fit(train['review'], train['label'])

evaluate(classifier)

              precision    recall  f1-score   support

         bad       0.85      0.85      0.85       500
        good       0.85      0.85      0.85       500

    accuracy                           0.85      1000
   macro avg       0.85      0.85      0.85      1000
weighted avg       0.85      0.85      0.85      1000

Accuracy: 0.853


In [None]:
classifier7 = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', DecisionTreeClassifier())
])

classifier=classifier7.fit(train['review'], train['label'])

evaluate(classifier)

              precision    recall  f1-score   support

         bad       0.78      0.79      0.78       500
        good       0.79      0.78      0.78       500

    accuracy                           0.78      1000
   macro avg       0.78      0.78      0.78      1000
weighted avg       0.78      0.78      0.78      1000

Accuracy: 0.784


In [None]:
classifier8 = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', RidgeClassifier())
])

classifier=classifier8.fit(train['review'], train['label'])

evaluate(classifier)

              precision    recall  f1-score   support

         bad       0.75      0.72      0.74       500
        good       0.73      0.75      0.74       500

    accuracy                           0.74      1000
   macro avg       0.74      0.74      0.74      1000
weighted avg       0.74      0.74      0.74      1000

Accuracy: 0.739


In [None]:
classifier9 = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', AdaBoostClassifier())
])

classifier=classifier9.fit(train['review'], train['label'])

evaluate(classifier)

              precision    recall  f1-score   support

         bad       0.88      0.51      0.64       500
        good       0.65      0.93      0.77       500

    accuracy                           0.72      1000
   macro avg       0.77      0.72      0.71      1000
weighted avg       0.77      0.72      0.71      1000

Accuracy: 0.719


In [None]:
classifier10 = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', BaggingClassifier())
])

classifier=classifier10.fit(train['review'], train['label'])

evaluate(classifier)

              precision    recall  f1-score   support

         bad       0.86      0.91      0.89       500
        good       0.90      0.85      0.88       500

    accuracy                           0.88      1000
   macro avg       0.88      0.88      0.88      1000
weighted avg       0.88      0.88      0.88      1000

Accuracy: 0.882


In [None]:
classifier11 = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', LogisticRegression())
])

classifier=classifier11.fit(train['review'], train['label'])

evaluate(classifier)

              precision    recall  f1-score   support

         bad       0.80      0.76      0.78       500
        good       0.77      0.81      0.79       500

    accuracy                           0.78      1000
   macro avg       0.78      0.78      0.78      1000
weighted avg       0.78      0.78      0.78      1000

Accuracy: 0.781


In [None]:
classifier12 = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', BernoulliNB())
])

classifier=classifier12.fit(train['review'], train['label'])

evaluate(classifier)

              precision    recall  f1-score   support

         bad       0.86      0.75      0.80       500
        good       0.78      0.88      0.83       500

    accuracy                           0.82      1000
   macro avg       0.82      0.82      0.82      1000
weighted avg       0.82      0.82      0.82      1000

Accuracy: 0.816


In [None]:
classifier13 = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', PassiveAggressiveClassifier())
])

classifier=classifier13.fit(train['review'], train['label'])

evaluate(classifier)

              precision    recall  f1-score   support

         bad       0.67      0.60      0.63       500
        good       0.63      0.70      0.67       500

    accuracy                           0.65      1000
   macro avg       0.65      0.65      0.65      1000
weighted avg       0.65      0.65      0.65      1000

Accuracy: 0.648


# Exercise 1

Can you train a more accurate model on the dataset (without changing the dataset)? You might find this [scikit-learn classifier comparison](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html) handy, as well as the documentation for [supervised learning in scikit-learn](https://scikit-learn.org/stable/supervised_learning.html).

One idea for a model you could try is a [naive Bayes classifier](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html).

You could also try experimenting with different values of the model hyperparameters, perhaps tuning them via a [grid search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).

Or you can even try training multiple different models and [ensembling](https://scikit-learn.org/stable/modules/ensemble.html#voting-classifier) their predictions, a strategy often used to win prediction competitions like Kaggle.
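
As a rough illustration of the ensembling idea, the sketch below combines three classifiers with scikit-learn's `VotingClassifier` inside the same kind of pipeline used in this lab. The toy reviews and labels are made up and only stand in for `train['review']` and `train['label']`; the choice of the three base models is an assumption, not a recommendation.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for train['review'] / train['label']
toy_reviews = [
    "love this magazine",
    "great articles every month",
    "excellent and informative",
    "never arrived",
    "mostly ads and worthless",
    "terrible waste of money",
]
toy_labels = ["good", "good", "good", "bad", "bad", "bad"]

voting_clf = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', VotingClassifier(
        estimators=[
            ('lr', LogisticRegression()),
            ('nb', MultinomialNB()),
            ('svm', LinearSVC()),
        ],
        voting='hard',  # each model casts one vote; the majority label wins
    )),
])

voting_clf.fit(toy_reviews, toy_labels)
toy_predictions = voting_clf.predict(["great magazine", "worthless ads"])
print(toy_predictions)
```

Hard voting is used here because `LinearSVC` does not provide class probabilities, which soft voting would require. On the lab data, the same pipeline could be fit on `train['review']` and `train['label']` and then passed to `evaluate`.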

Advanced: If you want to be more ambitious, you could try an even fancier model, like training a Transformer neural network. If you go with that, you'll want to fine-tune a pre-trained model. This guide from [HuggingFace](https://huggingface.co/docs/transformers/training) may be helpful.

In [None]:
# YOUR CODE HERE

# evaluate your model and see if it does better than the ones we provided

# To improve accuracy, we can use two strategies:
# 1. Model-centric AI: try many models and choose the best one
# 2. Data-centric AI: improve the quality of the data

In [None]:
# Try all models in the scikit-learn classifier comparison

In [None]:
# Model-centric AI: naive Bayes model
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

Naive_clf= Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', MultinomialNB())
])

Naive_clf.fit(train['review'], train['label'])

evaluate(Naive_clf)

              precision    recall  f1-score   support

         bad       0.85      0.85      0.85       500
        good       0.85      0.85      0.85       500

    accuracy                           0.85      1000
   macro avg       0.85      0.85      0.85      1000
weighted avg       0.85      0.85      0.85      1000

Accuracy: 0.853


In [None]:
#random forest
random_clf= Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),  # reweight the raw word counts into tf-idf scores
    ('clf', RandomForestClassifier())
])

random_clf.fit(train['review'], train['label'])

evaluate(random_clf)

              precision    recall  f1-score   support

         bad       0.81      0.86      0.84       500
        good       0.85      0.80      0.83       500

    accuracy                           0.83      1000
   macro avg       0.83      0.83      0.83      1000
weighted avg       0.83      0.83      0.83      1000

Accuracy: 0.832


In [None]:
sdg_clf= Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier())
])

sdg_clf.fit(train['review'], train['label'])

evaluate(sdg_clf)

# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV

paramgrid = {
    'clf__alpha': (1e-2, 1e-3, 1e-4, 1e-2),  # regularization strength of the SGD classifier (1e-2 is listed twice)
    'clf__penalty': ['l1', 'l2'],  # type of regularization
}
# Overfitting: the model fits the training data too closely and does poorly on new data.
# Underfitting: the model fails to capture the relationship between x and y, e.g. because training was insufficient.

gridsearch = GridSearchCV(sdg_clf, paramgrid, cv=5, verbose=1)
gridsearch.fit(train['review'], train['label'])
print(gridsearch.best_params_)
evaluate(gridsearch)

              precision    recall  f1-score   support

         bad       0.76      0.76      0.76       500
        good       0.76      0.76      0.76       500

    accuracy                           0.76      1000
   macro avg       0.76      0.76      0.76      1000
weighted avg       0.76      0.76      0.76      1000

Accuracy: 0.759
Fitting 5 folds for each of 8 candidates, totalling 40 fits
{'clf__alpha': 0.001, 'clf__penalty': 'l2'}
              precision    recall  f1-score   support

         bad       0.87      0.91      0.89       500
        good       0.90      0.87      0.89       500

    accuracy                           0.89      1000
   macro avg       0.89      0.89      0.89      1000
weighted avg       0.89      0.89      0.89      1000

Accuracy: 0.889


In [None]:
# Fit every classifier in the same pipeline and compare test accuracy (another model-centric approach)
from sklearn.base import ClassifierMixin

def compare_models(models, traindata, testdata):
    results = []
    for name, model_class in models:  # model_class is a class, not an instance
        if issubclass(model_class, ClassifierMixin):  # Check if it's a classifier
            try:
                pipeline = Pipeline([
                    ('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf', model_class())
                ])
                pipeline.fit(traindata['review'], traindata['label'])
                pred = pipeline.predict(testdata['review'])
                acc = metrics.accuracy_score(testdata['label'], pred)
                results.append((name, acc))
                print(f"{name}: {acc}")

            except Exception as e:
                print(f"Error training {name}: {e}")

    return results  # Return results after evaluating all models


In [None]:
from sklearn.utils import all_estimators

# Get all classifiers from sklearn
allclassifiers = all_estimators(type_filter='classifier')

# Run model comparison
modelComparision = compare_models(allclassifiers, train, test)

print("\nTop models", modelComparision)
#for name, acc in modelComparision:
   # print(f"{name}: {acc}")

AdaBoostClassifier: 0.719
BaggingClassifier: 0.895
BernoulliNB: 0.816
CalibratedClassifierCV: 0.724
Error training CategoricalNB: Sparse data was passed for X, but dense data is required. Use '.toarray()' to convert to a dense numpy array.
Error training ClassifierChain: ClassifierChain.__init__() missing 1 required positional argument: 'base_estimator'
ComplementNB: 0.853
DecisionTreeClassifier: 0.781
DummyClassifier: 0.5
ExtraTreeClassifier: 0.647
ExtraTreesClassifier: 0.859
Error training FixedThresholdClassifier: FixedThresholdClassifier.__init__() missing 1 required positional argument: 'estimator'
Error training GaussianNB: Sparse data was passed for X, but dense data is required. Use '.toarray()' to convert to a dense numpy array.
Error training GaussianProcessClassifier: Sparse data was passed for X, but dense data is required. Use '.toarray()' to convert to a dense numpy array.
GradientBoostingClassifier: 0.78
Error training HistGradientBoostingClassifier: Sparse data was pass

In [None]:
from sklearn.datasets import make_circles, make_classification, make_moons
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn import metrics

names = [
    "Nearest Neighbors",
    "Linear SVM",
    "RBF SVM",
    "Gaussian Process",
    "Decision Tree",
    "Random Forest",
    "Neural Net",
    "AdaBoost",
    "Naive Bayes",
    "QDA",
]

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025, random_state=42),
    SVC(gamma=2, C=1, random_state=42),
    GaussianProcessClassifier(1.0 * RBF(1.0), random_state=42),
    DecisionTreeClassifier(max_depth=5, random_state=42),
    RandomForestClassifier(
        max_depth=5, n_estimators=10, max_features=1, random_state=42
    ),
    MLPClassifier(alpha=1, max_iter=1000, random_state=42),
    AdaBoostClassifier(random_state=42),
    GaussianNB(),
    QuadraticDiscriminantAnalysis(),
]

def compareClassifiersOnReviews(names, classifiers, traindata, testdata):
    results = []
    for name, model in zip(names, classifiers):
        try:
            pipeline = Pipeline([
                ('vect', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
                ('clf', model)
            ])
            pipeline.fit(traindata['review'], traindata['label'])
            pred = pipeline.predict(testdata['review'])
            acc = metrics.accuracy_score(testdata['label'], pred)
            results.append((name, acc))
            print(f"{name}: {acc}")

        except Exception as e:
            print(f"Error training {name}: {e}")

    return results


In [None]:
model_comparision = compareClassifiersOnReviews(names, classifiers, train, test)

print("\nTop models:", model_comparision)
for name, acc in model_comparision:
    print(f"{name}: {acc}")

Nearest Neighbors: 0.773
Linear SVM: 0.877
RBF SVM: 0.85
Error training Gaussian Process: Sparse data was passed for X, but dense data is required. Use '.toarray()' to convert to a dense numpy array.
Decision Tree: 0.669
Random Forest: 0.498
Neural Net: 0.938
AdaBoost: 0.719
Error training Naive Bayes: Sparse data was passed for X, but dense data is required. Use '.toarray()' to convert to a dense numpy array.
Error training QDA: Sparse data was passed for X, but dense data is required. Use '.toarray()' to convert to a dense numpy array.

Top models: [('Nearest Neighbors', 0.773), ('Linear SVM', 0.877), ('RBF SVM', 0.85), ('Decision Tree', 0.669), ('Random Forest', 0.498), ('Neural Net', 0.938), ('AdaBoost', 0.719)]
Nearest Neighbors: 0.773
Linear SVM: 0.877
RBF SVM: 0.85
Decision Tree: 0.669
Random Forest: 0.498
Neural Net: 0.938
AdaBoost: 0.719
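
Several classifiers above fail with "Sparse data was passed for X, but dense data is required" because the tf-idf step outputs a sparse matrix. One possible workaround, sketched below on made-up toy data, is to add a pipeline step that converts the matrix to a dense array before the classifier. The helper name `to_dense` and the toy reviews are illustrative assumptions. Densifying a large vocabulary can use a lot of memory, so treat this as a sketch rather than a recommendation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def to_dense(X):
    """Convert a SciPy sparse matrix into a dense NumPy array (hypothetical helper)."""
    return X.toarray()


# Hypothetical toy data standing in for train['review'] / train['label']
toy_reviews = [
    "love this magazine",
    "great articles every month",
    "excellent and informative",
    "never arrived",
    "mostly ads and worthless",
    "terrible waste of money",
]
toy_labels = ["good", "good", "good", "bad", "bad", "bad"]

dense_clf = Pipeline([
    ('vect', TfidfVectorizer()),
    ('dense', FunctionTransformer(to_dense)),  # sparse tf-idf matrix -> dense array
    ('clf', GaussianNB()),                     # requires dense input
])

dense_clf.fit(toy_reviews, toy_labels)
dense_predictions = dense_clf.predict(["great magazine", "worthless ads"])
print(dense_predictions)
```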
