# Text Classification

Sentiment analysis is a binary classification task: it examines a sample of text, such as a review, and scores it on a scale from 0.0 to 1.0, where values closer to 0.0 indicate negative sentiment and values closer to 1.0 indicate positive sentiment. Because machine learning models work with numbers rather than text, the text must first be converted into numbers. One common approach is to build a table of word frequencies called a bag of words.

## Preparing Text for Classification

First, the text has to be converted into numbers, a process known as vectorization. One common technique is to build a table in which each row represents a text sample, such as a review, and each column represents a word in the training text. The numbers in each row are word counts, and the final number in each row is a label: 0 for negative or 1 for positive. Before text is vectorized, it is usually cleaned. Common cleaning steps include:
1. Converting all characters to lowercase
2. Removing punctuation symbols
3. Removing stop words (common words such as "the" and "and" that have little impact on the outcome)
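
The three cleaning steps above can be sketched in plain Python. This is a minimal illustration rather than what any particular library does internally: the `clean_text` helper and the very short `STOP_WORDS` set are hypothetical names introduced here, and real stop-word lists are much longer.

```python
import string

# Hypothetical, deliberately tiny stop-word list (assumption for illustration).
STOP_WORDS = {'the', 'and', 'a', 'to', 'in', 'that', 'are', 'all', 'our', 'for'}

def clean_text(text):
    """Lowercase the text, strip punctuation, and drop stop words."""
    text = text.lower()                                                # step 1
    text = text.translate(str.maketrans('', '', string.punctuation))  # step 2
    kept = [word for word in text.split() if word not in STOP_WORDS]  # step 3
    return ' '.join(kept)

cleaned = clean_text('... a new NATION, conceived in liberty $$$,')
print(cleaned)  # new nation conceived liberty
```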

Once cleaned, sentences are divided into individual words (tokenized), and the words become the table columns. N-grams are sequences of two or more consecutive words that are treated as a single token. The idea is that some words carry more meaning when they appear next to each other than when they appear far apart: "credit card", for example, means something different from "credit" and "card" counted separately. Without n-grams, the relative proximity of words is ignored.
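
To make the idea of n-grams concrete, here is a small sketch that builds them from an already tokenized sentence. The `ngrams` helper is a hypothetical name used only for this illustration.

```python
def ngrams(words, n):
    """Return every run of n consecutive words, joined by a space."""
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

words = 'new nation conceived liberty'.split()  # tokenize on whitespace

unigrams = ngrams(words, 1)
bigrams = ngrams(words, 2)
print(unigrams)  # ['new', 'nation', 'conceived', 'liberty']
print(bigrams)   # ['new nation', 'nation conceived', 'conceived liberty']
```

With scikit-learn there is no need to write this by hand: `CountVectorizer` accepts an `ngram_range` parameter, and `ngram_range=(1, 2)`, for example, produces columns for both single words and two-word sequences.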

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

lines = [
    'Four score and 7 years ago our fathers brought forth,',
    '... a new NATION, conceived in liberty $$$,',
    'and dedicated to the PrOpOsItIoN that all men are created equal',
    'One nation\'s freedom equals #freedom for another $nation!'
]

# Lowercase, strip punctuation, drop English stop words, and count words
vectorizer = CountVectorizer(stop_words='english')
word_matrix = vectorizer.fit_transform(lines)

# Label the columns with the vocabulary and the rows with line numbers
feature_names = vectorizer.get_feature_names_out()
line_names = [f'Line {i + 1}' for i in range(word_matrix.shape[0])]

df = pd.DataFrame(data=word_matrix.toarray(), index=line_names, columns=feature_names)
df.head()
```

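| | ago | brought | conceived | created | dedicated | equal | equals | fathers | forth | freedom | liberty | men | nation | new | proposition | score | years |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Line 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| Line 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
| Line 3 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| Line 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 2 | 0 | 0 | 0 | 0 |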


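Stemming and lemmatization are additional text-preparation techniques that reduce words to a common base form, so that words such as "equal" and "equals" are counted as the same word. In the table above they occupy separate columns, because CountVectorizer performs neither by default.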
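
The effect of stemming can be illustrated with a toy suffix stripper. This is only a sketch under strong simplifying assumptions: `toy_stem` is a hypothetical helper, not a real stemming algorithm, and it handles just three English suffixes.

```python
def toy_stem(word):
    """Strip one of a few common suffixes. Illustration only, not a real stemmer."""
    for suffix in ('ing', 'ed', 's'):
        # Keep at least three characters so very short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = ['equal', 'equals', 'nation', 'nations', 'created']
stems = [toy_stem(word) for word in words]
print(stems)  # ['equal', 'equal', 'nation', 'nation', 'creat']
```

Note that a stem such as "creat" need not be a dictionary word; it only has to be the same for related forms. Libraries such as NLTK provide real stemmers and lemmatizers that would normally be used instead of a hand-written rule like this one.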

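To remove numbers from text, define a function that strips digits and pass it to CountVectorizer through the `preprocessor` parameter. A custom preprocessor replaces the built-in preprocessing step, including lowercasing, which is why the function below also calls `lower()`.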

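```python
import re

def preprocess_text(text):
    # Remove runs of digits, then lowercase (the custom preprocessor
    # replaces CountVectorizer's own lowercasing step)
    return re.sub(r'\d+', '', text).lower()

vectorizer = CountVectorizer(stop_words='english', preprocessor=preprocess_text)
word_matrix = vectorizer.fit_transform(lines)
```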
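
The sample lines contain only the single-digit "7", which CountVectorizer's default token pattern already drops because it keeps only tokens of two or more characters. The following sketch shows why the preprocessor still matters for longer numbers. It uses the standard `re` module with the default token pattern documented for CountVectorizer; the `sample` string is a hypothetical input, and stop-word removal is left out for brevity.

```python
import re

# Default CountVectorizer token pattern: tokens of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def preprocess_text(text):
    # Same function as in the cell above: strip digit runs, then lowercase
    return re.sub(r'\d+', '', text).lower()

sample = 'Delivered in 1863, 7 score years ago'  # hypothetical sample line

tokens_default = TOKEN_PATTERN.findall(sample.lower())
tokens_cleaned = TOKEN_PATTERN.findall(preprocess_text(sample))
print(tokens_default)  # ['delivered', 'in', '1863', 'score', 'years', 'ago']
print(tokens_cleaned)  # ['delivered', 'in', 'score', 'years', 'ago']
```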