
In [None]:
# Text Classification using TFIDF Vectorization and Sklearn
# Text is classified as positive or negative sentiment (binary classification)
# using labeled Yelp, IMDb (movie reviews), and Amazon datasets.
#
# Uses Libraries:  sklearn
# Runtime:  Google CoLab (cpu)
#
# Owner:  Lorrie Tomek
# 
# Data: 
# UCI Machine Learning Repository
# General URL:  https://archive.ics.uci.edu/
# Sentiment Labeled Sentences URL: 
# https://archive.ics.uci.edu/ml/datasets/sentiment+labelled+sentences
#
# Reference:  Real Python 
# URL:  https://realpython.com/python-keras-text-classification/
# The tutorial is freely available on the internet.  (Verified December 2022.)  
# It is modified in this notebook for teaching purposes, especially n-grams.

# Text Classification using Tfidf

This is a follow-on lesson to "Text classification using Bag-of-Ngrams".  We will again use sklearn in this lesson. 

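Instead of using the CountVectorizer and counts of vocabulary n-grams, we will use the TfidfVectorizer.

The TfidfVectorizer class works almost exactly like the CountVectorizer, but it assigns each word (or n-gram) a weight instead of a simple count. 
The weights are calculated as follows. For each word, the weight is the product of the term frequency and the inverse document frequency. Term frequency is the number of times the word occurs in the document. Inverse document frequency is based on the logarithm of the total number of documents divided by the number of documents where the word occurs, so a word that appears in nearly every document receives a low weight. With sklearn's default settings the inverse document frequency is smoothed, ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the word, and each document's vector is then scaled to unit length.  In this terminology, a document is a "sample utterance".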
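
As a sanity check on this weighting, the sketch below recomputes TF-IDF by hand and compares it with sklearn's result. It assumes sklearn's default settings (`smooth_idf=True`, `norm='l2'`), under which idf = ln((1 + n) / (1 + df)) + 1 and each row is scaled to unit length. The two-sentence corpus and the variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['John likes ice cream', 'John hates chocolate.']

# Unigrams only, case preserved; every other setting is the sklearn default.
vec = TfidfVectorizer(lowercase=False)
X = vec.fit_transform(docs).toarray()

# Vocabulary terms ordered by their column index.
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# Term frequency: raw count of each term in each document.
tokens = [doc.replace('.', '').split() for doc in docs]
tf = np.array([[toks.count(t) for t in terms] for toks in tokens], dtype=float)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
n_docs = len(docs)
doc_freq = (tf > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Multiply, then scale each row to unit Euclidean length.
weights = tf * idf
manual = weights / np.linalg.norm(weights, axis=1, keepdims=True)

print(np.allclose(X, manual))
```

'John' occurs in both sentences, so its idf is the minimum value of 1 and it ends up with the smallest weight in each row.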

The content is identical to the previous lesson until we get to the "Vocabulary" section.  The identical part is repeated here so that this lesson stands alone. 

In [None]:
# import python libraries
import pandas as pd
import numpy as np
from pprint import pprint
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
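# some constants
MIN_DF = 0.05         # keep only n-grams found in at least 5% of the documents
MAX_DF = 0.8          # defined for reference; not passed to any vectorizer below
MAX_FEATURES = 50000  # upper limit on the vocabulary size
LOWERCASE = False     # keep the original capitalization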

## Choose and Download a Dataset

We will use a dataset from the UCI Machine Learning Repository which has labeled sentences for Sentiment Analysis.

Investigate the UCI Machine Learning Repository at this URL: 
https://archive.ics.uci.edu/

If you have not downloaded this, please do so. 

In [None]:
# download and rename the zip file
# we can use many unix commands with an exclamation point in front of them in jupyter notebooks
!wget -c "https://archive.ics.uci.edu/ml/machine-learning-databases/00331/sentiment%20labelled%20sentences.zip" -O sentiment_labelled_sentences.zip

--2023-01-30 19:34:55--  https://archive.ics.uci.edu/ml/machine-learning-databases/00331/sentiment%20labelled%20sentences.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable

    The file is already fully retrieved; nothing to do.



In [None]:
# if you already unzipped the data, comment this out or it will stop and ask 
# if you want to replace. 
# unzip to current directory only, to make it easier to find the .txt files
#!unzip -j sentiment_labelled_sentences.zip

In [None]:
!ls -r 

yelp_labelled.txt		  sample_data  imdb_labelled.txt
sentiment_labelled_sentences.zip  readme.txt   amazon_cells_labelled.txt


In [None]:
# read the data from the 3 labeled .txt files into a python pandas dataframe
# pandas dataframes are similar to excel spreadsheets with rows and columns
filepath_dict = {'yelp':   'yelp_labelled.txt',
                 'amazon': 'amazon_cells_labelled.txt',
                 'imdb':   'imdb_labelled.txt'}

df_list = []
for source, filepath in filepath_dict.items():
    df = pd.read_csv(filepath, names = ['sentence', 'label'], sep = '\t')
    df['source'] = source  # Add another column filled with the source name
    df_list.append(df)

df = pd.concat(df_list)
print(df.iloc[0])

sentence    Wow... Loved this place.
label                              1
source                          yelp
Name: 0, dtype: object


## Explore the Data

Next, we want to explore the data, so we gain an intuitive understanding.  In our case, we have downloaded a dataset that is quite "clean", so we won't do a lot of **data cleansing**.  Our dataset is also already labeled as positive or negative sentiment, so we will be able to use **supervised machine learning**. 

In [None]:
# Look at the first 5 rows of data 
# notice that the source is 'yelp' one of our 3 text files
# and the label is 1 or 0
# Look back at the documentation at the UCI Machine Learning Repository 
# where we downloaded this file to see what the labels mean.
# In this case, 1 is positive sentiment, and 0 is negative sentiment.
df[:5]

Unnamed: 0,sentence,label,source
0,Wow... Loved this place.,1,yelp
1,Crust is not good.,0,yelp
2,Not tasty and the texture was just nasty.,0,yelp
3,Stopped by during the late May bank holiday of...,1,yelp
4,The selection on the menu was great and so wer...,1,yelp


In [None]:
# In pandas, columns have data types that are inferred;
# Here we can see that the sentence column is a general 'object' rather than 'string'.
# pandas stores python strings as 'object' by default; an 'object' column can 
# also hold mixed types or missing (NaN) data, so it is worth checking.
df.dtypes

sentence    object
label        int64
source      object
dtype: object

In [None]:
# What are all the possible values of the source column?
set(df['source'].tolist())

{'amazon', 'imdb', 'yelp'}

In [None]:
# What are all the unique values of the label?
set(df['label'].tolist())

{0, 1}
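
Knowing that the labels are 0 and 1 does not tell us how many of each there are per source. A minimal sketch of one way to count them with pandas follows; `toy_df` is a hypothetical stand-in that only borrows the lesson's column names, and its rows are made up.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's dataframe (same column names).
toy_df = pd.DataFrame({
    'sentence': ['Great food.', 'Bad service.', 'Loved it.', 'Broke in a week.'],
    'label':    [1, 0, 1, 0],
    'source':   ['yelp', 'yelp', 'amazon', 'amazon'],
})

# Count rows per (source, label) pair; unstack puts the labels in columns.
balance = toy_df.groupby(['source', 'label']).size().unstack(fill_value=0)
print(balance)
```

Running the same two lines on `df` would show whether each source has a similar number of positive and negative sentences.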

In [None]:
# What are the first 5 rows of data from the imdb source?
df[df['source']=='imdb'][:5]

Unnamed: 0,sentence,label,source
0,"A very, very, very slow-moving, aimless movie ...",0,imdb
1,Not sure who was more lost - the flat characte...,0,imdb
2,Attempting artiness with black & white and cle...,0,imdb
3,Very little music or anything to speak of.,0,imdb
4,The best scene in the movie was when Gerardo i...,1,imdb


### Vocabulary 

Finding subsets of rows and columns of data using the pandas library is quite straightforward, but what if we want to know something more complex that requires processing a column of data? 

We may want to know the single-word (1-gram) **vocabulary** of the sentence column.

Let's first try just using a simple list of 2 sentences.  
sentences = ['John likes ice cream', 'John hates chocolate.']

The **vocabulary** is ['John', 'likes', 'ice', 'cream', 'hates', 'chocolate'] 

However, in this lesson we also want to learn n-grams, that is, sequences of adjacent words, and use that word-order information to train our model. 

We will use a **TfidfVectorizer** from **sklearn** this time.

General URL to sklearn: https://scikit-learn.org/stable/ 
In the search box, you can type "TfidfVectorizer" 

**TfidfVectorizer** is in sklearn.feature_extraction.text 
In the previous lesson, we created our own custom features. In this case, we are going to use sklearn to provide some features. 

In [None]:
# Use sklearn TfidfVectorizer to find the vocabulary in our small list of sentences
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ['John likes ice cream', 'John hates chocolate.']

small_vectorizer = TfidfVectorizer(min_df=MIN_DF, ngram_range=(1, 3), 
    lowercase=LOWERCASE, max_features=MAX_FEATURES, use_idf=True)
small_vectorizer.fit(sentences)
small_vectorizer.vocabulary_

{'John': 0,
 'likes': 11,
 'ice': 9,
 'cream': 6,
 'John likes': 3,
 'likes ice': 12,
 'ice cream': 10,
 'John likes ice': 4,
 'likes ice cream': 13,
 'hates': 7,
 'chocolate': 5,
 'John hates': 1,
 'hates chocolate': 8,
 'John hates chocolate': 2}

In [None]:
# How many vocabulary words are there?  Notice our vocabulary words now 
# include bi-grams (like "ice cream") and tri-grams (like "likes ice cream").
# Our vocabulary size is therefore larger.  
len(small_vectorizer.vocabulary_)

14
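
To see how much of that growth comes from the longer n-grams, here is a small sketch that fits the same two sentences with three different `ngram_range` settings. It uses sklearn defaults apart from `lowercase=False`; the `sizes` dictionary is just an illustrative name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ['John likes ice cream', 'John hates chocolate.']

# Vocabulary size when n-grams of length 1..max_n are allowed.
sizes = {}
for max_n in (1, 2, 3):
    vec = TfidfVectorizer(ngram_range=(1, max_n), lowercase=False)
    vec.fit(sentences)
    sizes[max_n] = len(vec.vocabulary_)

print(sizes)
```

On these two sentences the sizes are 6, 11 and 14: six single words, five bi-grams and three tri-grams.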

In [None]:
# Use sklearn TfidfVectorizer to find the vocabulary of our full dataset
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = df['sentence'].tolist()

vectorizer = TfidfVectorizer(min_df=MIN_DF, ngram_range=(1,3), 
  lowercase=LOWERCASE, max_features=MAX_FEATURES,use_idf=True)
vectorizer.fit(sentences) # train the vectorizer
# uncomment the following line to see the vocabulary of our entire dataset
# warning: it can be very long, and cuts off at 5000 (you'll want to comment 
# the line out again afterwards and rerun the cell)
#pprint(vectorizer.vocabulary_, width=79, compact=True)

TfidfVectorizer(lowercase=False, max_features=50000, min_df=0.05,
                ngram_range=(1, 3))

### Create a Feature Vector using TfidfVectorizer

The TfidfVectorizer performs tokenization and creates a feature vector.  


Let's build a feature vector that describes our sample sentences using the transform() method on our trained TfidfVectorizer.  We have two sentences, so we get an array with 2 rows.  We have 14 vocabulary n-grams, so we have 14 columns. Each column holds the TF-IDF weight of a specific vocabulary n-gram in the sample sentence; the weight is 0 when the n-gram does not occur in the sentence.  The first column (index 0) is the weight of the word 'John'.  The second column (index 1) is the weight of the n-gram 'John hates', etc.





In [None]:
small_vectorizer.vocabulary_

{'John': 0,
 'likes': 11,
 'ice': 9,
 'cream': 6,
 'John likes': 3,
 'likes ice': 12,
 'ice cream': 10,
 'John likes ice': 4,
 'likes ice cream': 13,
 'hates': 7,
 'chocolate': 5,
 'John hates': 1,
 'hates chocolate': 8,
 'John hates chocolate': 2}

In [None]:
# how many vocabulary n-grams
len(small_vectorizer.vocabulary_)

14

In [None]:
# Let's build a feature vector that describes our sample sentences.
# This provides an array where each entry is the TF-IDF weight of  
# the vocabulary n-gram associated with the column (0 if it is not in the sentence)
sentences = ['John likes ice cream', 'John hates chocolate.']
small_vectorizer.transform(sentences).toarray()

array([[0.24395573, 0.        , 0.        , 0.34287126, 0.34287126,
        0.        , 0.34287126, 0.        , 0.        , 0.34287126,
        0.34287126, 0.34287126, 0.34287126, 0.34287126],
       [0.30321606, 0.4261596 , 0.4261596 , 0.        , 0.        ,
        0.4261596 , 0.        , 0.4261596 , 0.4261596 , 0.        ,
        0.        , 0.        , 0.        , 0.        ]])
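
An array like this is easier to read when each column is paired with its n-gram. The following sketch does that for the first sentence; it refits a vectorizer with the same `ngram_range` so that it runs on its own, and the `weights` dictionary is an illustrative name rather than part of the lesson.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ['John likes ice cream', 'John hates chocolate.']
vec = TfidfVectorizer(ngram_range=(1, 3), lowercase=False)
row = vec.fit_transform(sentences).toarray()[0]   # first sentence only

# n-grams ordered by their column index
ngrams = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# keep only the n-grams that actually occur in the first sentence
weights = {ng: float(w) for ng, w in zip(ngrams, row) if w > 0}
for ng, w in weights.items():
    print(f"{w:.4f}  {ng}")
```

'John' gets a lower weight than the other n-grams because it appears in both sentences.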

In [None]:
# Let's try adding a few more words to our example sentences so we have 
# duplicate words in some sentences.

sentences = ['John likes ice cream, and eats ice cream daily.', 
             'John hates, hates, hates chocolate.']
small_vectorizer.transform(sentences).toarray()

array([[0.17005267, 0.        , 0.        , 0.23900309, 0.23900309,
        0.        , 0.47800618, 0.        , 0.        , 0.47800618,
        0.47800618, 0.23900309, 0.23900309, 0.23900309],
       [0.20119468, 0.2827721 , 0.        , 0.        , 0.        ,
        0.2827721 , 0.        , 0.84831629, 0.2827721 , 0.        ,
        0.        , 0.        , 0.        , 0.        ]])

In [None]:
# We can just as easily create feature vectors for all sentences in our dataframe
# Here we see that sklearn, knowing that there are mostly zeros in the array, 
# used a sparse object to represent the array. 
vectorizer.transform(df['sentence'].tolist())

<2748x27 sparse matrix of type '<class 'numpy.float64'>'
	with 8624 stored elements in Compressed Sparse Row format>

## Build a Baseline Model

When you work with machine learning, one important step is to define a **baseline model**. The baseline model is a simple model that can be quickly created.  It serves as a point of comparison for the more advanced models that you want to experiment with.  In our case, we will compare the baseline model with more advanced methods using (deep) neural network models.  

We will start with our dataframe of labeled data, using only the data sourced from 'yelp' to train and test our model.  This is mostly an arbitrary decision.

### Train/Test Split

We will split our data into a training set and a test set.  We will use a customary split, that is 80% of our data will be for training our model, and 20% of our data for testing our model.  After we train on the training data, we will evaluate how good our model is using the set of test data that the model has not seen yet.  

We split the data randomly, so that any ordering in how the data is collected (which often occurs in real datasets) will not impact the results.   

We use the train_test_split() function from sklearn to randomly split our data.  We pass random_state = 1000 which serves as a random number seed for randomizing the selection.  By selecting a specific random number seed, we can have consistent/reproducible results.  Each time we use the same random_state, we will split the data in the same way.  The actual value chosen for random_state is arbitrary.
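
The reproducibility claim is easy to check on a toy array. This is a minimal sketch: `items` is a made-up array of ten integers, and the second seed (7) is an arbitrary choice for contrast.

```python
import numpy as np
from sklearn.model_selection import train_test_split

items = np.arange(10)

# Two splits with the same seed select exactly the same test items.
_, test_a = train_test_split(items, test_size=0.2, random_state=1000)
_, test_b = train_test_split(items, test_size=0.2, random_state=1000)

# A different seed will usually (not always) select different items.
_, test_c = train_test_split(items, test_size=0.2, random_state=7)

print(test_a, test_b, test_c)
```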

In [None]:
from sklearn.model_selection import train_test_split

# Build a pandas Dataframe for just yelp data.
df_yelp = df[df['source'] == 'yelp']

# Get a list of sentences from our 'yelp' sourced data
sentences = df_yelp['sentence'].values
# Get the corresponding 'label' values 0 or 1 for each of these sentences
y = df_yelp['label'].values

# split the data into train/test with 80% train, and 20% test
sentences_train, sentences_test, y_train, y_test = train_test_split(
  sentences, y, test_size=0.2, random_state=1000)

### Train TfidfVectorizer and Build Feature Vectors

Train the TfidfVectorizer on the sentences in our training data.  The fit() method does the training. 

Create our feature vectors (numerical arrays) for our training set and testing set using the transform() method.

We will use the variable names X_train_yelp and X_test_yelp for our numeric feature vectors.  

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# train TfidfVectorizer with ngrams=(1,3), and setting a max_features to avoid explosion of features
vectorizer_yelp = TfidfVectorizer(min_df=MIN_DF,ngram_range=(1,3),
  lowercase=LOWERCASE,max_features=MAX_FEATURES,use_idf=True)
vectorizer_yelp.fit(sentences_train)

# build our feature vectors (numerical arrays)
X_all = vectorizer_yelp.transform(sentences)
X_train_yelp = vectorizer_yelp.transform(sentences_train)
X_test_yelp = vectorizer_yelp.transform(sentences_test)
X_train_yelp

<800x29 sparse matrix of type '<class 'numpy.float64'>'
	with 2553 stored elements in Compressed Sparse Row format>
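
Because the vectorizer is fit on the training sentences only, an n-gram that appears only in the test set has no column and is silently ignored by transform(). A small sketch with made-up sentences:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train = ['great food', 'bad food']
test = ['great dessert']          # 'dessert' never appears in training

vec = TfidfVectorizer(lowercase=False)
vec.fit(train)                    # the vocabulary comes from training data only
X_test = vec.transform(test).toarray()

print(sorted(vec.vocabulary_))    # 'dessert' is not listed
print(X_test.shape)               # one row, one column per training term
print(X_test)
```

The test sentence is described only by 'great', the one word it shares with the training vocabulary.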

### Consider our Feature Vectors

We can see that X_train_yelp has 800 rows, one for each of the 800 sentences that remain in our training data after the train/test split.  The number of columns (29) is the number of vocabulary n-grams kept from our training data.  So each of the sentences is represented by a vector of 29 decimal numbers. 

Note: If we had not chosen 80% train and 20% test or if we chose a different random number seed (random_state) we would have a larger or smaller number of vocabulary words.  Also, you can see that we get a sparse matrix. This is a data type that is optimized for matrices with only a few non-zero elements, which only keeps track of the non-zero elements reducing the memory load.
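
To make "only a few non-zero elements" concrete, the sketch below measures how full a TF-IDF matrix is. It uses the two example sentences from earlier in the lesson so that it runs on its own; `density` is an illustrative name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ['John likes ice cream', 'John hates chocolate.']
X = TfidfVectorizer(ngram_range=(1, 3), lowercase=False).fit_transform(sentences)

# nnz is the number of explicitly stored (non-zero) entries.
n_cells = X.shape[0] * X.shape[1]
density = X.nnz / n_cells
print(X.shape, X.nnz, round(density, 3))
```

By the same arithmetic, the Yelp training matrix shown above stores 2553 of its 800 x 29 = 23200 cells, or about 11%.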

### Build LogisticRegression Model as Baseline

The classification model we are going to use is logistic regression, a simple yet powerful linear model. Mathematically speaking, it is a form of regression that maps the input feature vector to a value between 0 and 1. By applying a cutoff value (0.5 by default), the regression model is used for classification. 

The choice of logistic regression is fairly arbitrary. 
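
The 0.5 cutoff can be seen by comparing predict() with the class-1 probability from predict_proba(). The sketch below uses synthetic data from `make_classification`, which is an assumption made only so the example is self-contained; it is not the lesson's dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LogisticRegression().fit(X, y)

# Probability of class 1 for each sample (second column of predict_proba).
p_positive = clf.predict_proba(X)[:, 1]

# Applying the 0.5 cutoff by hand reproduces predict().
by_hand = (p_positive > 0.5).astype(int)
print(np.array_equal(by_hand, clf.predict(X)))
```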

In [None]:
from sklearn.linear_model import LogisticRegression

classifier_yelp = LogisticRegression()
classifier_yelp.fit(X_train_yelp, y_train) # train our classifier 
score = classifier_yelp.score(X_test_yelp, y_test) # score our classifier on test data

print("Accuracy of our Yelp data only model:", score)

Accuracy of our Yelp data only model: 0.59


In [None]:
df[df['source']=='yelp'][:10]

Unnamed: 0,sentence,label,source
0,Wow... Loved this place.,1,yelp
1,Crust is not good.,0,yelp
2,Not tasty and the texture was just nasty.,0,yelp
3,Stopped by during the late May bank holiday of...,1,yelp
4,The selection on the menu was great and so wer...,1,yelp
5,Now I am getting angry and I want my damn pho.,0,yelp
6,Honeslty it didn't taste THAT fresh.),0,yelp
7,The potatoes were like rubber and you could te...,0,yelp
8,The fries were great too.,1,yelp
9,A great touch.,1,yelp


In [None]:
def explore_yelp_classifier(user_sentence):
  """ explore the yelp classifier results """
  X_user  = vectorizer_yelp.transform([user_sentence])
  # We needed an array that is 1 x our vocabulary size to predict()
  # so we added [] around our user_sentence.
  # print(X_user.shape)  # uncomment this to see the shape of X_user
  # now use predict() to predict the label for our input
  result = classifier_yelp.predict(X_user)
  # predict() returns a list of predicted results, since we only have one input
  # we index on [0]
  # print(result[0])
  # We can display a nicer message since we know what these labels mean.
  if result[0] == 1:
    predicted_sentiment = "positive"
  else:
    predicted_sentiment = "negative"
  print(f"Utterance: {user_sentence}")
  print(f"Predicted Sentiment: {predicted_sentiment}")

In [None]:
# Let's try some examples
explore_yelp_classifier("I hate peas.")

Utterance: I hate peas.
Predicted Sentiment: negative


In [None]:
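# Here is a sentence that our classifier gets wrong.
# "not" and "hate" are both negative words, but "not hate" is intuitively 
# positive.  N-grams can only help if "not hate" is in the vocabulary; with 
# MIN_DF = 0.05 an n-gram must occur in at least 5% of the training sentences 
# to be kept, so a rare phrase like this one is most likely dropped.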
explore_yelp_classifier("I do not hate peas.")

Utterance: I do not hate peas.
Predicted Sentiment: negative
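
One way to investigate a misclassified sentence is to list the n-grams the vectorizer extracts from it and check which of them survive in the vocabulary. The sketch below does this with a hypothetical four-sentence training corpus and `min_df=0.5`; both are assumptions chosen so the effect is visible on tiny data, not the lesson's settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy training corpus (hypothetical); 'not' occurs in 2 of the 4 sentences.
train = ['I do not like it', 'I hate it', 'not good at all', 'I like it']

# min_df=0.5 keeps only n-grams found in at least half of the sentences.
vec = TfidfVectorizer(min_df=0.5, ngram_range=(1, 3), lowercase=False)
vec.fit(train)

# build_analyzer() returns the function that turns text into n-grams.
analyze = vec.build_analyzer()
candidates = analyze('I do not hate peas.')
kept = [ng for ng in candidates if ng in vec.vocabulary_]

print(candidates)   # every 1- to 3-gram found in the sentence
print(kept)         # the ones the model can actually see
```

Here only 'not' survives, so the classifier never sees 'not hate' as a unit. Note also that the default tokenizer drops single-character words such as "I".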


In [None]:
# Try some of your own examples ! 
explore_yelp_classifier("your text here")

Utterance: your text here
Predicted Sentiment: negative


In [None]:
# Add the Yelp model's predictions to the dataframe (all rows from all 
# 3 sources, including the sentences the model was trained on)
df['predict'] = classifier_yelp.predict(vectorizer_yelp.transform(df['sentence'].tolist()))

In [None]:
# Dataframe where the model predicted correctly
df_correct = df[df['predict'] == df['label']]

In [None]:
# Look at some correct predictions
df_correct[:10]

Unnamed: 0,sentence,label,source,predict
0,Wow... Loved this place.,1,yelp,1
1,Crust is not good.,0,yelp,0
3,Stopped by during the late May bank holiday of...,1,yelp,1
4,The selection on the menu was great and so wer...,1,yelp,1
6,Honeslty it didn't taste THAT fresh.),0,yelp,0
8,The fries were great too.,1,yelp,1
10,Service was very prompt.,1,yelp,1
11,Would not go back.,0,yelp,0
12,The cashier had no care what so ever on what I...,0,yelp,0
13,"I tried the Cape Cod ravoli, chicken,with cran...",1,yelp,1


In [None]:
# Dataframe where the model predicted incorrectly
df_incorrect = df[df['predict'] != df['label']]

In [None]:
# Look at some incorrect predictions
df_incorrect[:10]

Unnamed: 0,sentence,label,source,predict
2,Not tasty and the texture was just nasty.,0,yelp,1
5,Now I am getting angry and I want my damn pho.,0,yelp,1
7,The potatoes were like rubber and you could te...,0,yelp,1
9,A great touch.,1,yelp,0
16,Highly recommended.,1,yelp,0
21,"The food, amazing.",1,yelp,0
24,So they performed.,1,yelp,0
29,The worst was the salmon sashimi.,0,yelp,1
31,This was like the final blow!,0,yelp,1
32,I found this place by accident and I could not...,1,yelp,0
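
Browsing rows gives a feel for the errors; a confusion matrix summarizes them by type. This is a minimal sketch using sklearn.metrics on made-up labels, and `y_true` and `y_pred` are hypothetical stand-ins for the 'label' and 'predict' columns.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions (1 = positive, 0 = negative).
y_true = [1, 0, 0, 1, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0]

# Rows are true labels, columns are predicted labels, in the order [0, 1].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"false positives: {fp}, false negatives: {fn}")
print(f"accuracy: {accuracy_score(y_true, y_pred):.3f}")
```

A false positive is a negative sentence predicted as positive, like row 2 in the table above; a false negative is the reverse, like row 9.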


### Build Logistic Regression Models for each data source

Each of our data sources has different characteristics of data.  Let's build a Logistic Regression model for each of: yelp, amazon, and imdb.

You may find it inconvenient to reuse variable names for different purposes in a notebook.  You may want to take this part, and name each variable differently (as was done with yelp), so you can copy/modify the explore_yelp_classifier and thus you can explore each of these classifiers.

In [None]:
for source in df['source'].unique():
    df_source = df[df['source'] == source]
    sentences = df_source['sentence'].values
    y = df_source['label'].values

    # Train/Test split our data for the data source
    sentences_train, sentences_test, y_train, y_test = train_test_split(
        sentences, y, test_size=0.2, random_state=1000)

    # build and train a TfidfVectorizer for the data source
    vectorizer = TfidfVectorizer(min_df=MIN_DF,lowercase=LOWERCASE,
      ngram_range=(1,3),max_features=MAX_FEATURES,use_idf=True)
    vectorizer.fit(sentences_train)
    X_train = vectorizer.transform(sentences_train)
    X_test  = vectorizer.transform(sentences_test)

    # build and train our LogisticRegression classifier for the data source
    classifier = LogisticRegression()
    classifier.fit(X_train, y_train)

    # score our test data for accuracy
    score = classifier.score(X_test, y_test)
    print('Accuracy for {} data: {:.4f}'.format(source, score))

Accuracy for yelp data: 0.5900
Accuracy for amazon data: 0.6300
Accuracy for imdb data: 0.6333


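The accuracy of our classifiers (about 0.59 to 0.63) is only modestly better than the 0.5 we would expect from guessing on a two-class problem with roughly balanced labels, and it is of course not state of the art.  One likely contributor is MIN_DF = 0.05, which leaves only 29 n-grams in the Yelp vocabulary.  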
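
One setting worth experimenting with is min_df, since it controls how many n-grams reach the classifier. The sketch below shows the effect on a hypothetical four-sentence corpus; the values 1 and 0.5 are chosen for illustration and are not recommendations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): 'good' and 'food' each occur in 2 of 4 sentences.
docs = ['good food', 'good service', 'bad food', 'terrible wait']

# An integer min_df is a document count; a float is a fraction of documents.
vocab_sizes = {}
for min_df in (1, 0.5):
    vec = TfidfVectorizer(min_df=min_df, lowercase=False)
    vec.fit(docs)
    vocab_sizes[min_df] = len(vec.vocabulary_)

print(vocab_sizes)
```

With min_df=1 all six words are kept; with min_df=0.5 only 'good' and 'food' remain. Rerunning the per-source loop above with a smaller MIN_DF would show whether a larger vocabulary helps on the real data.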

Suggestion:  after the first run, comment out the !unzip line, or the cell will stall while it waits for you to confirm overwriting the files.