# Intro to Text Classification

## Skills

1. **Understand the basic vocabulary of machine learning.**
2. **Explain the importance of training and testing data.**
3. **Train and evaluate a Support Vector Machine.**
4. Build a classification pipeline.
5. Use a multilabel classifier.
6. Train and evaluate a transformer classifier.

## Vocabulary List

* **hyperparameter.** An option in a model that is not fit by training, but chosen by the user.
* **model.** A mathematical/computational function that takes in some data, outputs some other data, and has parameters that can be fit. A line ($y=mx+b$) is a very simple model to get you from $x$ to $y$.
* **parameter.** A variable in a model that is fit during training.
* **Support Vector Machine.** A simple model that attempts to put a line between two classes of observations.
* **Transformer Architecture.** A deep-learning model which takes into account word meanings and order.
* **x variables.** The inputs to a model.
* **y variable.** The output of a model, the thing we are trying to predict.
* **training set.** The data used to train a model.
* **validation set.** Data used to evaluate a model and find the best hyperparameters.
* **testing set.** Data used in a final evaluation of a model.
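
The three kinds of dataset above can be produced by splitting twice. The sketch below is one possible way to do it, assuming a 60/20/20 split and ten made-up documents; the variable names and the proportions are illustrative choices, not something this notebook prescribes.

```python
from sklearn.model_selection import train_test_split

# Hypothetical data: ten tiny "documents" and their labels.
texts = [f"document {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]

# First hold out 20% of everything as the final testing set...
texts_rest, texts_test, labels_rest, labels_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0)

# ...then split the remainder into training (75% of the rest = 60% overall)
# and validation (25% of the rest = 20% overall).
texts_train, texts_val, labels_train, labels_val = train_test_split(
    texts_rest, labels_rest, test_size=0.25, random_state=0)

print(len(texts_train), len(texts_val), len(texts_test))
```

Hyperparameters would be compared using the validation set, and the testing set would be touched only once, at the very end.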

## Additional Resources

1. [Scikit-Learn SVM Page](https://scikit-learn.org/stable/modules/svm.html)
2. [Working with Text Data in Scikit-Learn](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)
3. [Working with Text Documents](https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn import metrics

from sklearn.naive_bayes import MultinomialNB

import pandas as pd

## Motivation

Consider the following questions that we may want to answer using text analysis:

1. Can we automatically determine if a tweet about a product is positive or negative?
2. From an article's text, can we determine whether it is sports, politics, entertainment, ...?
3. Can we determine if an email or online comment is from a real user or a spam bot?
4. Based on word choice, who is the unknown author of this text?

We've looked at relative frequency and TF-IDF analysis, both of which give you an idea if certain words are more frequent or more related to certain kinds of text. So we could look at every word in a tweet and see if each one is more likely to be from a positive or negative tweet. But we want to decide *overall* whether that particular tweet is positive or negative, which is a different, but related task. That's where classification comes in.

## Machine Learning

How does this work? An analogy to linear regression may be helpful here. You've likely seen this before (perhaps in Excel) where you make a scatterplot and fit a line to the data:

<div style="text-align:center;"><img src="https://imgur.com/CdQ3Kdb.png"></div>

Here what we have is an $x$ variable which we want to use to predict a $y$ variable, using a model (the line; $y = mx + b$) which has parameters ($m$ and $b$, the slope and intercept). But in order to figure out the best-fitting line, we first need a bunch of data where we know both the $x$ and the $y$ values, called the training dataset. Once we've done that, we can take new $x$ observations and predict what their value of $y$ is.
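
To make the analogy concrete, here is a minimal sketch of fitting and using a line with NumPy. The five data points are made up for illustration (they roughly follow $y = 2x + 1$), and the names `x_known`, `y_known`, and `x_new` are placeholders rather than anything used elsewhere in this notebook.

```python
import numpy as np

# Hypothetical training data: x values for which we already know y.
x_known = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_known = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# "Training" means fitting the parameters: the slope m and the intercept b.
m, b = np.polyfit(x_known, y_known, deg=1)

# Once the parameters are fit, the model can predict y for new x values.
x_new = np.array([5.0, 6.0])
y_new = m * x_new + b

print(m, b)
print(y_new)
```

Here `np.polyfit` plays the role of training, and `m` and `b` are the fitted parameters.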

For us, those pieces are:

* **The $x$ Variable.** The text to classify.
* **The $y$ Variable.** The category that the text belongs to: spam/not spam, positive/negative, &c.
* **The Model.** We'll be using two different models: the Support Vector Machine (SVM), and a Transformer architecture.

Let's take a look at the SVM:

## Support Vector Machines (SVMs)

At its core, the SVM algorithm tries to find a line which does the best job of separating your data. Take a look at the plot below, which shows two variables, $A$ and $B$, in a dataset. The colors and shapes of the points (the blue crosses and red circles) are the classes. The line shown is the one that best splits the blue and red points.

There are variations on the SVM algorithm that add "kernels", which are effectively ways of letting the boundary curve instead of staying a straight line. They are useful for lots of classification tasks, but generally not recommended for very high-dimensional work like text classification.

<div style="text-align:center;"><img src="https://imgur.com/iqtFyZy.png"></div>
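
The same idea can be tried on a handful of made-up points. This is only a sketch: the coordinates and the `blue`/`red` labels are invented, and it uses scikit-learn's `SVC` with a linear kernel rather than the classifier used later in this notebook.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical observations: each row is one point, with columns (A, B).
toy_points = np.array([
    [1.0, 1.0], [1.5, 2.0], [2.0, 1.0],   # blue crosses
    [5.0, 5.0], [6.0, 5.5], [5.5, 6.5],   # red circles
])
toy_classes = np.array(["blue", "blue", "blue", "red", "red", "red"])

# A linear kernel keeps the boundary a straight line.
toy_svm = SVC(kernel="linear")
toy_svm.fit(toy_points, toy_classes)

# The fitted line is w_A * A + w_B * B + c = 0.
w_A, w_B = toy_svm.coef_[0]
c = toy_svm.intercept_[0]
print(w_A, w_B, c)

# New points are classified by which side of the line they fall on.
toy_guesses = toy_svm.predict([[1.2, 1.5], [5.8, 6.0]])
print(toy_guesses)
```

The weights `w_A`, `w_B` and the offset `c` are the parameters that training fits; with text, there is one weight per word in the vocabulary instead of just two.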

## Worked Example

The code below does the following steps, which are quite generic for many types of supervised machine learning. You'll see this workflow again if you do any machine learning (DIDA 325, DIDA 340, ...).

1. Split into training and testing sets
2. Perform Data Transforms
3. Fit a Model
4. Evaluate the Model

In [None]:
reviews = pd.read_csv("https://raw.githubusercontent.com/Greg-Hallenbeck/class-datasets/main/datasets/tripadvisor_hotel_reviews.csv")

reviews.head(5)

Unnamed: 0,Review,Rating
0,nice hotel expensive parking got good deal sta...,4
1,ok nothing special charge diamond member hilto...,2
2,nice rooms not 4* experience hotel monaco seat...,3
3,"unique, great stay, wonderful time hotel monac...",5
4,"great stay great stay, went seahawk game aweso...",5


In [None]:
#netflix = pd.read_csv("https://raw.githubusercontent.com/Greg-Hallenbeck/HARP-210-NLP/main/pandas/netflix.csv")

#netflix.head(2)

In [None]:
# Sample pre-processing on the netflix dataset

"""# Pre-processing to get a column with a *ton* of categories
netflix.loc[netflix["genres"].str.contains("drama"), "genre"] = "drama"
netflix.loc[netflix["genres"].str.contains("comedy"),"genre"] = "comedy"
netflix.loc[netflix["genres"].str.contains("documentation"),"genre"] = "documentary"
netflix.loc[netflix["genres"].str.contains("thriller"),"genre"] = "thriller"
netflix.loc[netflix["genres"].str.contains("action"),"genre"] = "action"
netflix.loc[netflix["genres"].str.contains("scifi"),"genre"] = "scifi"
netflix.loc[netflix["genres"].str.contains("romance"),"genre"] = "romance"
netflix.loc[netflix["genres"].str.contains("crime"),"genre"] = "crime"
netflix.loc[netflix["genres"].str.contains("reality"),"genre"] = "reality"
netflix.loc[netflix["genres"].str.contains("fantasy"),"genre"] = "fantasy"

netflix = netflix.loc[~netflix["genre"].isna()]"""

#### Split into Training/Testing Data

In [None]:
# First, we split the data into a training (~80% of the data) and a testing set (20%).

# It's very easy for text models to overfit by "memorizing" the correct answers,
# so after we train the model, we test it on a second dataset that it hasn't seen.

X = reviews["Review"]
y = reviews["Rating"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

#### Transform the Data

In [None]:
# Instead of using the text as the input directly, we'll use
# the TF-IDF of each word. We do that with some scikit-learn functions

# This is set up to have a maximum vocabulary size, and remove stop words.
# These need to be defined on the training set, as the testing set will have
# new words not in the training set
cv    = CountVectorizer(max_features=1000, stop_words='english')
X_train = cv.fit_transform(X_train)

tfidf = TfidfTransformer(use_idf=True)
X_train = tfidf.fit_transform(X_train)

In [None]:
# Transform the testing data as well
# Note that we're using .transform(), not .fit_transform()
X_test = cv.transform(X_test)
X_test = tfidf.transform(X_test)

#### Fit the Model

In [None]:
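# Create a classifier model. With its default settings (hinge loss),
# SGDClassifier fits a linear SVM using stochastic gradient descent.
classifier = SGDClassifier()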

# Then train the model using the training sets
classifier = classifier.fit(X_train, y_train)

#### Evaluate the Model

In [None]:
# First, what's the worst the model could do?
# If you just guessed the same thing for every datapoint, you'd get...44.3% accuracy.
y_train.value_counts(normalize=True)
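
The baseline idea can be checked by hand on a small example. The ten ratings below are invented for illustration, and `toy_ratings` is a placeholder name, not part of the hotel dataset.

```python
import pandas as pd

# Hypothetical ratings: 5 appears four times out of ten.
toy_ratings = pd.Series([5, 5, 5, 4, 4, 3, 2, 1, 5, 4])

# Fraction of the data belonging to each rating.
toy_shares = toy_ratings.value_counts(normalize=True)

# Always guessing the most common rating is right this often.
toy_most_common = toy_shares.idxmax()
toy_baseline = toy_shares.max()
print(toy_most_common, toy_baseline)
```

Any trained classifier should beat this number; if it does not, it has learned nothing useful from the text.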


In [None]:
# How well did the model do on the training data?
# Hopefully very well.

# Compare the predictions with the true labels
y_pred_train = classifier.predict(X_train)
metrics.accuracy_score(y_train, y_pred_train)

0.6634943875061006

In [None]:
# Now, determine how well it did on the testing set
# It will be lower, but hopefully not too much.
# If it's anywhere near random guessing, we've got a bad fit.

y_pred_test = classifier.predict(X_test)
metrics.accuracy_score(y_test, y_pred_test)

0.595023176384484

In [None]:
# Print out a report of how well the classifier does
# precision of X = If I predict something is X, what % of the time am I right?
# recall of X = what % of X do I find?
print(metrics.classification_report(y_test, y_pred_test, zero_division=0))
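
Precision and recall are easy to mix up, so here is a small worked case. The eight labels are made up for illustration, and the `toy_` names are placeholders; the sketch computes the two numbers for the `pos` class only.

```python
from sklearn import metrics

# Hypothetical true and predicted labels for eight reviews.
toy_true = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]
toy_pred = ["pos", "pos", "pos", "neg", "pos", "pos", "neg", "neg"]

# Precision of "pos": 5 reviews were predicted "pos" and 3 of them really are.
toy_precision = metrics.precision_score(toy_true, toy_pred, pos_label="pos")

# Recall of "pos": 4 reviews really are "pos" and 3 of them were found.
toy_recall = metrics.recall_score(toy_true, toy_pred, pos_label="pos")

print(toy_precision, toy_recall)
```

A classifier that predicts `pos` too eagerly tends to have high recall but low precision, and a cautious one the reverse.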


In [None]:
# We can also use it on new text whose rating we don't know.
new_obs = ["I loved this place. Will come back."]

new_obs = cv.transform(new_obs)
new_obs = tfidf.transform(new_obs)
classifier.predict(new_obs)


### Alternate Classifiers in Scikit-Learn

The scikit-learn library contains many different classification models, and it can be useful to train several and compare them. To do so, just import a different classifier from scikit-learn, then change the line `classifier = SGDClassifier()` to use the new classifier. And you're done.

At the top of the notebook, the `MultinomialNB()` classifier has already been imported for you, so give that a try. [Comparisons of several classifiers](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html) can be found on the library's website.
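
Swapping classifiers might look like the sketch below. Everything here is illustrative: the six short reviews and their labels are invented, the `demo_` names are placeholders, and for brevity the models are scored on the same data they were trained on, which overstates how well they would do on unseen text.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Hypothetical reviews and labels, for illustration only.
demo_texts = [
    "loved the room and the staff",
    "great location wonderful stay",
    "clean comfortable and friendly",
    "dirty room and rude staff",
    "terrible noisy awful stay",
    "broken shower never again",
]
demo_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

# Same transforms as in the worked example: counts, then TF-IDF.
demo_cv = CountVectorizer()
demo_tfidf = TfidfTransformer()
demo_X = demo_tfidf.fit_transform(demo_cv.fit_transform(demo_texts))

# Only the model object changes; fit/predict/score stay the same.
demo_scores = {}
for demo_model in [SGDClassifier(random_state=0), MultinomialNB()]:
    demo_model.fit(demo_X, demo_labels)
    demo_pred = demo_model.predict(demo_X)
    demo_scores[type(demo_model).__name__] = metrics.accuracy_score(demo_labels, demo_pred)

print(demo_scores)
```

In a real comparison, each model would be scored on held-out data rather than on its own training set.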

This is generally how the Scikit-Learn package works: you swap in a different classification (or regression) model, whose implementation has been taken care of for you. Your job is primarily to choose a model and do any pre-processing needed to get the data into a usable format.

This process will be simplified even more when we use pipelines in the next notebook.

### Standard Classification Tasks in Natural Language Processing

Historically, these have been given specific names, but ultimately they are all classifiers, and can use the same workflow:

* **Sentiment Analysis.** Do people like the thing? Hate the thing? This has a lot of usage in business: there are a hundred thousand social media posts and I don't have time to read them all. The example classifier above is a sentiment classifier.

* **Part-of-Speech Tagging.** Is this word in a given sentence a noun, an adjective, &c.? This is used

* **Proper Noun Detection.** Which words in this sentence are proper nouns? This is useful for identifying people and places in documents (e.g. letters, newspaper articles) when you don't have a complete list beforehand.

For each of these classification tasks, what would the training data (X and y) need to look like?