
## DBDA.X425.(12) Deep Learning and Artificial Intelligence
Spring 2023
# Instructor  Joseph Meyer
Lab3: Text Classification

Bobby Wen

Dataset: rotten_tomatoes

    1. Lemmatize text
    
    2. Vectorize text via bag of words or tf-idf (Remove stop words, set max_features to 500)
    
    3. Train test split
    
    4. Choose two algorithms (e.g., logistic regression and decision tree)
    
    5. Perform grid search on each of the algorithms
    
    6. Inference on test
    
    7. Print classification report
    
    8. Write out interpretation of report


# Install prerequisites if not already present in the kernel

In [None]:
!pip install datasets



In [None]:
!pip install transformers



In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [None]:
# Load the rotten_tomatoes dataset
from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes")




# Check dataset

In [None]:
print (dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})


In [None]:
# a. Exploratory function
#    i.Prints out
#        1. Top n most common words
#        2. Average text length
#        3. Longest text length
import nltk
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re
from collections import Counter

def explore_dataset(n=10, dataset_type="train"):
    """
    Print descriptive statistics for one split of the dataset.

    Args:
        n: number of most common words to print.

        dataset_type: which split to use ("train", "validation" or "test").

    Returns:
        None
    """
    # Use the rotten_tomatoes dataset already loaded
    reviews = dataset[dataset_type]["text"]

    # Top n most common words
    words = " ".join(reviews).split()
    word_counts = Counter(words)
    top_words = word_counts.most_common(n)

    print(f"Top {n} most common words:")
    for word, count in top_words:
        print(f"{word}: {count}")

    # Average text length
    total_length = sum(len(review) for review in reviews)
    average_length = total_length / len(reviews)
    print(f"\nAverage text length: {average_length:.0f} characters")

    # Longest text length
    longest_text = max(reviews, key=len)
    longest_length = len(longest_text)
    print(f"Longest text length: {longest_length} characters")
    print("Longest text", longest_text)

In [None]:
# Let's explore the data from the default training data set
explore_dataset(n=15)

Top 15 most common words:
.: 11197
the: 8024
,: 8001
a: 5855
and: 4914
of: 4814
to: 3415
is: 2700
in: 2111
that: 1974
it: 1799
as: 1407
but: 1322
with: 1271
this: 1184

Average text length: 114 characters
Longest text length: 267 characters
Longest text . . . spiced with humor ( 'i speak fluent flatula , ' advises denlopp after a rather , er , bubbly exchange with an alien deckhand ) and witty updatings ( silver's parrot has been replaced with morph , a cute alien creature who mimics everyone and everything around )


In [None]:
# Let's explore the data from the test data set
explore_dataset(n=15,dataset_type="test")

Top 15 most common words:
.: 1411
the: 1046
,: 1021
a: 725
and: 627
of: 596
to: 421
is: 343
in: 274
it: 243
that: 237
as: 203
but: 160
its: 147
for: 144

Average text length: 116 characters
Longest text length: 261 characters
Longest text this is a children's film in the truest sense . it's packed with adventure and a worthwhile environmental message , so it's great for the kids . parents , on the other hand , will be ahead of the plot at all times , and there isn't enough clever innuendo to fil


In [None]:
# Let's explore the data from the validation data set
#explore_dataset(n=5,dataset_type="validation")

3. Train test split

The train and test sets could be loaded directly from the dataset's predefined splits, but splitting the train split ourselves lets us demonstrate `train_test_split`.
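
As a minimal sketch of what `train_test_split` does, assuming toy data (the `texts` and `labels` below are made up for illustration): by default it holds out 25% of the rows, which matches the 6397/2133 split of the 8530 training rows in this notebook. The `random_state` and `stratify` arguments are optional additions, not used in the notebook cell, that make the split reproducible and keep the class ratio the same in both parts.

```python
from sklearn.model_selection import train_test_split

# Toy data, made up for illustration: 8 reviews with balanced labels
texts = [f"review {i}" for i in range(8)]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

# test_size=0.25 is also the default; random_state fixes the shuffle,
# stratify keeps the 50/50 label ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

print(len(X_tr), len(X_te))   # 6 2
print(sorted(y_te))           # [0, 1]
```

Without `random_state`, every run produces a different split, which is one reason the scores in this notebook vary between runs.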

In [None]:
# Using the train_test_split function, we are able to create training and testing data set from a single data set
X_train, X_test, y_train, y_test = train_test_split(dataset["train"]['text'], dataset["train"]['label'])

In [None]:
# Check the number of rows in the new training data set
print ("Number of rows in training data set:", len(X_train))
print ("Number of rows in training label set:", len(y_train))

Number of rows in training data set: 6397
Number of rows in training label set: 6397


In [None]:
# Check the number of rows in the new testing data set
print ("Number of rows in testing data set:", len(X_test))
print ("Number of rows in testing label set:", len(y_test))

Number of rows in testing data set: 2133
Number of rows in testing label set: 2133


In [None]:
# Check the training data set to make sure it still contains valid data
print(X_train[4])

scarlet diva has a voyeuristic tug , but all in all it's a lot less sensational than it wants to be .


In [None]:
# Check the training labels to make sure they still contain valid data
print(y_train[4])

0


1. Lemmatize text

In [None]:
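import nltk

nltk.download('wordnet')

# Initialize the lemmatizer
lemmatizer = nltk.WordNetLemmatizer()

# Lemmatize one review. WordNetLemmatizer.lemmatize() expects a single word,
# so split the whitespace-separated review and lemmatize each token.
def lemmatize_text(text):
    return " ".join(lemmatizer.lemmatize(token) for token in text.split())

# Lemmatize both splits so train and test get the same preprocessing
X_train = list(map(lemmatize_text, X_train))
X_test = list(map(lemmatize_text, X_test))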


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


2. Vectorize text via bag of words or tf-idf (Remove stop words, set max_features to 500)

In [None]:
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# def vectorization(dataset_name, column_name):
#     # Load the dataset
#     dataset = load_dataset(dataset_name)
#     df = dataset["train"]

#     # Get the text data
#     text_data = df[column_name]

# Create TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Create bigram vectorizer
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
X_train_bigram = bigram_vectorizer.fit_transform(X_train)

print (X_train_tfidf[0], X_train_bigram[0])

  (0, 7912)	0.3020312998274668
  (0, 1107)	0.115925612247401
  (0, 12866)	0.0704885378738042
  (0, 6765)	0.07550765814614321
  (0, 3455)	0.1511054634436645
  (0, 3212)	0.26544986818339117
  (0, 13890)	0.1622734819616994
  (0, 13969)	0.13857307861935397
  (0, 10124)	0.1640738044043707
  (0, 13976)	0.16128680668980458
  (0, 4746)	0.2452668466624723
  (0, 5970)	0.2270681140966442
  (0, 2685)	0.2484404711189431
  (0, 568)	0.0589749637004156
  (0, 7369)	0.3020312998274668
  (0, 8650)	0.12155682428892886
  (0, 11140)	0.26031698134500975
  (0, 12691)	0.10364549022785599
  (0, 6679)	0.1411258879603038
  (0, 13101)	0.27895098617497904
  (0, 5911)	0.2558706725224913
  (0, 4598)	0.20398780044415643
  (0, 9635)	0.3020312998274668
  (0, 13986)	0.16128680668980458
  (0, 4786)	0.09712310013893696   (0, 21084)	1
  (0, 66682)	1
  (0, 46217)	1
  (0, 19910)	1
  (0, 26416)	1
  (0, 63190)	1
  (0, 29765)	1
  (0, 59171)	1
  (0, 50955)	1
  (0, 41260)	1
  (0, 34399)	1
  (0, 3047)	1
  (0, 13521)	1
  (0, 26917)	


The **bag-of-words** model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.

The bag-of-words model is commonly used in methods of document classification where the (frequency of) occurrence of each word is used as a feature for training a classifier.
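
The idea can be sketched in plain Python, without scikit-learn. This is only an illustration of the counting step: the `STOP_WORDS` set and the `bag_of_words` helper are hypothetical, and `CountVectorizer` uses its own tokenizer and a much larger stop-word list.

```python
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only
STOP_WORDS = {"a", "the", "and", "of", "is", "it"}

def bag_of_words(docs, max_features=None):
    """Return (vocabulary, count rows) for a list of whitespace-tokenized texts."""
    # Keep alphabetic, non-stop-word tokens (drops punctuation tokens such as '.')
    tokenized = [
        [t for t in doc.lower().split() if t.isalpha() and t not in STOP_WORDS]
        for doc in docs
    ]
    # Keep only the max_features most frequent words across all documents
    totals = Counter(t for doc in tokenized for t in doc)
    vocab = sorted(word for word, _ in totals.most_common(max_features))
    # One row per document, one column per vocabulary word; word order is ignored
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts[word] for word in vocab])
    return vocab, rows

docs = ["the film is a good film", "a bad film and a bad script"]
vocab, rows = bag_of_words(docs, max_features=2)
print(vocab)  # ['bad', 'film']
print(rows)   # [[0, 2], [2, 1]]
```

Each row keeps how often a word occurs (multiplicity) but nothing about where it occurs, which is exactly the simplification described above.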


In [None]:
# Create the bag of words model
vectorizer = CountVectorizer(stop_words='english',max_features=500)
X_train_bag = vectorizer.fit_transform(X_train)
X_test_bag = vectorizer.transform(X_test)

In [None]:
# Check the vectorized data to make sure it is valid
print (X_train_bag[3])

  (0, 151)	1
  (0, 264)	1
  (0, 158)	1
  (0, 100)	2
  (0, 32)	1
  (0, 416)	1
  (0, 222)	1
  (0, 21)	1
  (0, 334)	1
  (0, 249)	1


4. Choose two algorithms (e.g., logistic regression and decision tree)

I tried logistic regression, but could not get the grid search to work. I am using RandomForest Classifier and DecisionTree Classifier.

**Random Forest** model, run standalone as a baseline check

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Train the Random Forest model
classifier = RandomForestClassifier()
classifier.fit(X_train_bag, y_train)
# Make predictions on the test set
predictions = classifier.predict(X_test_bag)
# Evaluate the accuracy of the model
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)


Accuracy: 0.6586966713548992


In [None]:
# Model Evaluation metrics
from sklearn.metrics import accuracy_score,recall_score,precision_score,f1_score

print('Accuracy Score : ' + str(accuracy_score(y_test,predictions)))
print('Precision Score : ' + str(precision_score(y_test,predictions)))
print('Recall Score : ' + str(recall_score(y_test,predictions)))
print('F1 Score : ' + str(f1_score(y_test,predictions)))

# Confusion matrix for the Random Forest predictions
from sklearn.metrics import confusion_matrix
print('Confusion Matrix : \n' + str(confusion_matrix(y_test,predictions)))

Accuracy Score : 0.6586966713548992
Precision Score : 0.6762430939226519
Recall Score : 0.5845272206303725
F1 Score : 0.6270491803278688
Confusion Matrix : 
[[793 293]
 [435 612]]


**DecisionTree** model, run standalone as a baseline check

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Train the DecisionTreeClassifier model
classifier = DecisionTreeClassifier()
classifier.fit(X_train_bag, y_train)
# Make predictions on the test set
predictions = classifier.predict(X_test_bag)
# Evaluate the accuracy of the model
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)


Accuracy: 0.6043131739334271


In [None]:
# Model Evaluation metrics
from sklearn.metrics import accuracy_score,recall_score,precision_score,f1_score

print('Accuracy Score : ' + str(accuracy_score(y_test,predictions)))
print('Precision Score : ' + str(precision_score(y_test,predictions)))
print('Recall Score : ' + str(recall_score(y_test,predictions)))
print('F1 Score : ' + str(f1_score(y_test,predictions)))

# Confusion matrix for the Decision Tree predictions
from sklearn.metrics import confusion_matrix
print('Confusion Matrix : \n' + str(confusion_matrix(y_test,predictions)))

Accuracy Score : 0.6043131739334271
Precision Score : 0.6178861788617886
Recall Score : 0.5081184336198663
F1 Score : 0.5576519916142557
Confusion Matrix : 
[[757 329]
 [515 532]]


5. Perform grid search on each of the algorithms
# GridSearchCV #
**Grid Search Cross Validation** is a technique for finding the best hyperparameter values within a given grid of candidate values. The estimator and the parameter grid are passed in, every combination in the grid is scored with cross-validation, and the best-scoring combination is selected. The model refit with those values is then used to make predictions.
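
To make the cost of an exhaustive search concrete, the sketch below counts the candidates and cross-validation fits for the three grids used in this notebook. The helpers `iter_candidates` and `count_fits` are hypothetical names written for this illustration; they are not part of scikit-learn, and the count leaves out the one final refit that `GridSearchCV` performs by default.

```python
from itertools import product

def iter_candidates(param_grid):
    """Yield every parameter combination in the grid as a dict."""
    keys = list(param_grid)
    for combo in product(*(param_grid[k] for k in keys)):
        yield dict(zip(keys, combo))

def count_fits(param_grid, cv):
    """Return (number of candidates, number of cross-validation fits)."""
    n_candidates = sum(1 for _ in iter_candidates(param_grid))
    return n_candidates, n_candidates * cv

# The grids used in this notebook
lr_grid = {'C': [0.1, 1, 10], 'penalty': ['l1', 'l2']}
rf_grid = {'n_estimators': [200, 500], 'max_features': ['sqrt', 'log2'],
           'max_depth': [4, 5, 6, 7, 8], 'criterion': ['gini', 'entropy']}
dt_grid = {'criterion': ['gini', 'entropy'],
           'max_depth': [4, 5, 6, 7, 8, 9, 10, 11, 12, 15, 20, 30, 40, 50, 70, 90, 120, 150]}

print(count_fits(lr_grid, cv=10))  # (6, 60)
print(count_fits(rf_grid, cv=5))   # (40, 200)
print(count_fits(dt_grid, cv=5))   # (36, 180)
```

The 60 fits for the logistic regression grid match the "out of a total of 60" reported in the warning further down.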

**Logistic Regression** Classifier grid search

Half of the fits **failed**: the default `lbfgs` solver does not support the `l1` penalty, so the grid search only scores the `l2` candidates and does not provide a fully optimized solution.

In [None]:
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import GridSearchCV
#from sklearn.cross_validation import  cross_val_score

# Define the hyperparameters to tune
param_grid = {'C': [0.1, 1, 10], 'penalty': ['l1', 'l2']}
# Perform grid search with cross-validation
grid_search = GridSearchCV(LogisticRegression(),param_grid,cv=10)
#grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid_search.fit(X_train_bag, y_train)

# Get the best hyperparameters and model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_
# Make predictions on the test set using the best model
# (stored under its own name so the standalone `predictions` are not overwritten)
predictions_lr = best_model.predict(X_test_bag)


30 fits failed out of a total of 60.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
30 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py", line 1162, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py", line 54, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.
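
The traceback above says that the default `lbfgs` solver rejects the `l1` penalty, which is why the 30 `l1` fits (3 values of `C` times 10 folds) failed. One possible fix, sketched here on synthetic data rather than on the notebook's `X_train_bag`, is to pass a solver that supports both penalties, such as `liblinear`. The data from `make_classification` and the names `X_demo`, `y_demo`, `param_grid_lr` and `grid_search_lr` are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the vectorized reviews (illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

param_grid_lr = {'C': [0.1, 1, 10], 'penalty': ['l1', 'l2']}

# liblinear supports both 'l1' and 'l2', so no candidate fails
grid_search_lr = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid_lr, cv=5)
grid_search_lr.fit(X_demo, y_demo)

scores = grid_search_lr.cv_results_['mean_test_score']
print("Any failed candidates:", bool(np.isnan(scores).any()))
print("Best params:", grid_search_lr.best_params_)
```

The `saga` solver also accepts both penalties; which solver converges faster on the real bag-of-words matrix would need to be checked by running it.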



**Random Forest** Classifier grid search

In [None]:
from sklearn.ensemble import RandomForestClassifier

param_grid_rf= {
    'n_estimators': [200, 500],
    'max_features': ['sqrt', 'log2'],
    'max_depth' : [4,5,6,7,8],
    'criterion' :['gini', 'entropy']
}
grid_search_rf = GridSearchCV(RandomForestClassifier(), param_grid_rf, cv= 5)
grid_search_rf.fit(X_train_bag, y_train)

In [None]:
print("Best: %f using %s" % (grid_search_rf.best_score_, grid_search_rf.best_params_))

# Get the best model from the Random Forest search
# (grid_search_rf, not the logistic regression grid_search) and score it on the test set
best_model_rf = grid_search_rf.best_estimator_
score_rf = best_model_rf.score(X_test_bag, y_test)

Best: 0.670313 using {'C': 0.1, 'penalty': 'l2'}


In [None]:
print("Best Model Prediction:", best_model_rf)
print ("Best Model Score: ", score_rf )

Best Model Prediction: LogisticRegression(C=0.1)
Best Model Score:  0.6694796061884669


**Decision Tree** Classifier grid search

In [None]:
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import GridSearchCV

### Grid Search Python Implementation
# Define the hyperparameters to tune
param_grid = {'criterion':['gini','entropy'],
              'max_depth':[4,5,6,7,8,9,10,11,12,15,20,30,40,50,70,90,120,150]
              }

# Perform grid search with cross-validation
grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
grid_search.fit(X_train_bag, y_train)

# Get the best hyperparameters and model
# Make predictions on the test set using the best Decision Tree model
best_params_dt = grid_search.best_params_
best_model_dt = grid_search.best_estimator_
score_dt = best_model_dt.score(X_test_bag, y_test)
predictions_dt = best_model_dt.predict(X_test_bag)

In [None]:
print (best_model_dt )
print (best_params_dt )
print ("Best Model Score: ", score_dt )

DecisionTreeClassifier(max_depth=150)
{'criterion': 'gini', 'max_depth': 150}
Best Model Score:  0.6694796061884669


6. Inference on test
Model inference is the process of using a trained model to infer a result from test or live data.
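
A minimal sketch of inference on unseen text, assuming a toy training set (the four reviews and their labels below are made up for illustration). The key point is that new text goes through `transform` on the vectorizer that was already fitted on the training data, never `fit_transform`, so the columns line up with what the model was trained on.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Toy training data, made up for illustration (1 = positive, 0 = negative)
train_texts = [
    "a wonderful , moving film",
    "great acting and a great story",
    "a dull , boring mess",
    "terrible plot and boring characters",
]
train_labels = [1, 1, 0, 0]

# Fit the vocabulary and the model on the training data only
demo_vectorizer = CountVectorizer(stop_words='english')
X_demo_train = demo_vectorizer.fit_transform(train_texts)
demo_model = DecisionTreeClassifier(random_state=0).fit(X_demo_train, train_labels)

# Inference: reuse the fitted vectorizer, then predict
new_texts = ["a great film", "boring and dull"]
X_demo_new = demo_vectorizer.transform(new_texts)
demo_predictions = demo_model.predict(X_demo_new)
print(demo_predictions)
```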

7. Print classification report
A classification report measures the quality of the predictions made by a classification algorithm: how many predictions are correct and how many are wrong. More specifically, the counts of true positives, false positives, true negatives and false negatives are used to compute the metrics of the report shown below.

In [None]:
### classification report
from sklearn.metrics import classification_report, confusion_matrix

print("classification_report \n", classification_report(y_test, predictions))
print("confusion_matrix \n",confusion_matrix(y_test, predictions))

classification_report 
               precision    recall  f1-score   support

           0       0.60      0.70      0.64      1086
           1       0.62      0.51      0.56      1047

    accuracy                           0.60      2133
   macro avg       0.61      0.60      0.60      2133
weighted avg       0.61      0.60      0.60      2133

confusion_matrix 
 [[757 329]
 [515 532]]


## Classification Report ##
The classification report is one of the performance evaluation tools for a classification model. It displays the model's precision, recall, F1 score and support for each class, which gives a better picture of the trained model's performance than accuracy alone. The metrics shown in the report are explained below:

**Precision**: A classification metric that measures the proportion of correctly predicted positive instances among all instances predicted as positive. It focuses on the accuracy of positive predictions and is calculated as the ratio of true positives to the sum of true positives and false positives.

**Recall**: Also known as sensitivity or true positive rate, it measures the proportion of correctly predicted positive instances out of all actual positive instances. Recall emphasizes the ability of a model to correctly identify positive instances and is calculated as the ratio of true positives to the sum of true positives and false negatives.

**F1 Score**: A metric that combines precision and recall into a single value to evaluate classification models. It provides a balance between precision and recall, taking into account both false positives and false negatives. F1 Score is the harmonic mean of precision and recall, and it ranges from 0 to 1, with 1 representing the best possible performance.
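
These definitions can be checked by hand against the Decision Tree confusion matrix printed earlier in this notebook. The sketch below assumes scikit-learn's layout for a binary confusion matrix, `[[TN, FP], [FN, TP]]`, with label 1 as the positive class.

```python
# Counts taken from the Decision Tree confusion matrix printed earlier
# (sklearn layout: [[TN, FP], [FN, TP]] for labels 0 and 1)
tn, fp = 757, 329
fn, tp = 515, 532

precision = tp / (tp + fp)                          # correct among predicted positives
recall = tp / (tp + fn)                             # found among actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + tn + fp + fn)          # correct among all predictions

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
# precision=0.6179 recall=0.5081 f1=0.5577 accuracy=0.6043
```

The values agree with the Precision, Recall, F1 and Accuracy scores printed for the standalone Decision Tree model.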

# 8. Write out interpretation of report #  
***The numbers may change due to the iteration or run of the model.***


The classification report printed above summarizes how the model performs on each class. Precision is the share of the predictions for a class that were correct: 60% of the reviews predicted as 0 (negative) really were 0, and 62% of the reviews predicted as 1 (positive) really were 1. The remainder are wrong predictions for that class, e.g. 40% (1 - 0.60) of the reviews predicted as 0 were actually 1.

Similarly, recall is the share of the actual members of a class that the model found: 70% of the actual 0 reviews were predicted as 0, but only 51% of the actual 1 reviews were predicted as 1. The model therefore misses about half of the positive reviews.

F1 score combines the precision and recall of a class into a single value: F1 = 2 * (precision * recall) / (precision + recall), which gives 0.64 for class 0 and 0.56 for class 1. Accuracy is a separate metric: the share of all predictions across the entire test set that were correct, 0.60 here.

Support is the number of true samples of each class in the label set: 1086 for class 0 and 1047 for class 1.

Macro average is the unweighted mean of the per-class scores, i.e. the sum of the scores divided by the number of classes. For the F1 score, the macro average is (0.64 + 0.56) / 2 = 0.60.

The weighted average is the per-class score weighted by the number of samples (support) of each class, so it accounts for the distribution of the data. For the F1 score: 0.64 * (1086 / 2133) + 0.56 * (1047 / 2133) = 0.60.

The confusion matrix gives the raw counts behind these scores. Rows are the true labels and columns are the predicted labels:
[[757 329]
 [515 532]]
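
The macro and weighted averages can be reproduced from a confusion matrix alone. The sketch below uses the matrix from the classification report cell above and assumes scikit-learn's convention that rows are true labels and columns are predicted labels. The helper `per_class_f1` is a hypothetical name for this illustration.

```python
# Confusion matrix printed with the classification report:
# rows = true label, columns = predicted label
cm = [[757, 329],
      [515, 532]]

def per_class_f1(cm, k):
    """F1 score of class k, treating k as the positive class."""
    tp = cm[k][k]
    fp = sum(cm[r][k] for r in range(len(cm))) - tp  # predicted k, actually another class
    fn = sum(cm[k]) - tp                             # actually k, predicted as another class
    return 2 * tp / (2 * tp + fp + fn)

f1_scores = [per_class_f1(cm, k) for k in range(len(cm))]
support = [sum(row) for row in cm]

# Macro: plain mean over classes. Weighted: mean weighted by support.
macro_f1 = sum(f1_scores) / len(f1_scores)
weighted_f1 = sum(f * s for f, s in zip(f1_scores, support)) / sum(support)

print([round(f, 2) for f in f1_scores])           # [0.64, 0.56]
print(support)                                    # [1086, 1047]
print(round(macro_f1, 2), round(weighted_f1, 2))  # 0.6 0.6
```

Because the two classes have almost the same support, the macro and weighted averages are nearly identical here; they diverge when the classes are imbalanced.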