# Linear Text Classification

## By Brea Koenes

### Overview

I use Python libraries (NLTK and scikit-learn) to do linear text classification, a common NLP task.

NLTK is an open-source library that provides various tools and modules for working with human language data. It offers functionalities for tasks like tokenization, stemming, tagging, parsing, and even some basic machine learning implementations.

In [1]:
import nltk

### 1 - Data Preparation

Import the data into four lists of reviews:

- `neg_train` ('data/train/allneg.txt') for the negative training data
- `pos_train` ('data/train/allpos.txt') for the positive training data
- `neg_test` ('data/test/allneg.txt') for the negative testing data
- `pos_test` ('data/test/allpos.txt') for the positive testing data

Each line must have whitespace removed from its start and end.

In [3]:
# Function to read in data
def GetReviewsList(path):
    # Read one review per line, stripping leading and trailing whitespace
    with open(path) as f:
        lines = [line.strip() for line in f]

    return lines

In [4]:
# Read data
path = './data/'
neg_train = GetReviewsList(path+'train/allneg.txt')
pos_train = GetReviewsList(path+'train/allpos.txt')
neg_test = GetReviewsList(path+'test/allneg.txt')
pos_test = GetReviewsList(path+'test/allpos.txt')

### 2 - Data Preprocessing
NLTK requires text data to be preprocessed and converted into a format suitable for its classifiers. Preprocessing may include several steps, including (but not restricted to):

- tokenization
- lowercasing
- removing punctuation
- removing 'stop words'


### 2a - Tokenization
The first step in NLP Data Processing is tokenization. Tokenization is the process of breaking down a text (usually a sentence or a paragraph, and in our case, a review) into smaller units called tokens. These tokens can be individual words, phrases, or even characters, depending on the level of granularity required for the specific NLP task.

Without tokenization, NLP models have difficulty understanding the structure and meaning of a given text.
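
As a rough illustration of the idea, the sketch below splits a review into word and punctuation tokens with a regular expression. `simple_tokenize` is a hypothetical helper, not part of NLTK, and it is only a stand-in: NLTK's `word_tokenize` applies more elaborate rules (for example, it splits contractions such as "wasn't" differently).

```python
import re

def simple_tokenize(text):
    # Hypothetical helper: match words (optionally with an internal apostrophe)
    # or any single character that is neither a word character nor whitespace
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

tokens = simple_tokenize("The plot wasn't great, but the acting was superb!")
print(tokens)
# ['The', 'plot', "wasn't", 'great', ',', 'but', 'the', 'acting', 'was', 'superb', '!']
```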

Create a function called **'tokenizeReview'** that takes a list of reviews and returns a list of tokenized words for each review.

Load punkt from nltk (`nltk.download('punkt')`). This works with the tokenizer to handle abbreviations, words that often start sentences, and collocations.

In [5]:
import nltk
from nltk.tokenize import word_tokenize

# Download the tokenizer data used by word_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /Users/bkoenes/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/bkoenes/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [7]:
# Tokenize reviews function
def tokenizeReview(reviews):
    tokenized_reviews = []
    for review in reviews:
        tokens = word_tokenize(review)
        tokenized_reviews.append(tokens)
    return tokenized_reviews

### 2b - Lowercasing, removing punctuation and 'stop words'

Create a function called `cleanTokenizedReview` that takes in one list of tokenized reviews and returns an updated list with all tokens in **lowercase**, without **punctuation** and without **'stop words'**.

In [6]:
import string
from nltk.corpus import stopwords
nltk.download('stopwords')

# Clean tokenized reviews function
def cleanTokenizedReview(tokenized_reviews):
    
    # Build lookup sets once (set membership is faster than list membership)
    punctuation = set(string.punctuation)
    stop_words = set(stopwords.words('english'))

    cleaned_reviews = []

    for review in tokenized_reviews:
        cleaned_tokens = []

        for token in review:
            # Lowercase
            token = token.lower()

            # Skip punctuation and stop words
            if token in punctuation or token in stop_words:
                continue

            cleaned_tokens.append(token)

        cleaned_reviews.append(cleaned_tokens)

    return cleaned_reviews

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/bkoenes/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
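
One detail that may be worth checking: `string.punctuation` contains single characters only, so a membership test leaves multi-character punctuation tokens (such as `...`) in place. The sketch below compares the membership test with a character-wise check. The token list is made up for illustration and is not taken from the review data.

```python
import string

# Made-up tokens, similar to what a tokenizer might emit for a quoted sentence
sample_tokens = ["``", "great", "movie", "''", "...", "!", "n't"]

punctuation_chars = set(string.punctuation)

# Membership test: removes only single-character punctuation tokens
by_membership = [t for t in sample_tokens if t not in punctuation_chars]

# Character-wise test: removes any token made up entirely of punctuation
by_characters = [
    t for t in sample_tokens
    if not all(ch in punctuation_chars for ch in t)
]

print(by_membership)  # ['``', 'great', 'movie', "''", '...', "n't"]
print(by_characters)  # ['great', 'movie', "n't"]
```

Whether the stricter check helps depends on the data, so it is best treated as an option to evaluate rather than a required change.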


### 2c - Tokenize and Clean the four lists of reviews
Update 'neg_train', 'pos_train', 'neg_test', 'pos_test' by applying `tokenizeReview` and then `cleanTokenizedReview`.

In [7]:
# Update lists
tokenized_neg_train = tokenizeReview(neg_train)
tokenized_pos_train = tokenizeReview(pos_train)
tokenized_neg_test = tokenizeReview(neg_test)
tokenized_pos_test = tokenizeReview(pos_test)

neg_train = cleanTokenizedReview(tokenized_neg_train)
pos_train = cleanTokenizedReview(tokenized_pos_train)
neg_test = cleanTokenizedReview(tokenized_neg_test)
pos_test = cleanTokenizedReview(tokenized_pos_test)

### 3a - Feature Extraction

NLTK's classifiers work with features represented as dictionaries (dict), where each word is a feature and its value is typically set to `True`.

Create a function called `featureExtraction` that takes a list of cleaned and tokenized reviews and a label (**'pos'** for positive reviews and **'neg'** for negative reviews) and returns a list of tuples composed of a Python dict (with every word as key and `True` as value) and the label associated to that review.

In [8]:
# Extract features function
def featureExtraction(tokenized_reviews, label):
    feature_sets = []

    for review in tokenized_reviews:
        feature_dict = {}
        
        for word in review:
            if isinstance(word, str):
                feature_dict[word] = True
        
        feature_sets.append((feature_dict, label))
        
    return feature_sets
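
To make the output format concrete, here is a small sketch of one labeled feature set. `to_feature_dict` is a hypothetical helper that mirrors the inner loop above, and the tokens are invented. Note that a repeated word collapses into a single key, so the features record presence rather than counts.

```python
def to_feature_dict(tokens):
    # Map every token to True; repeated tokens collapse into one key
    return {token: True for token in tokens}

example_tokens = ["great", "acting", "great", "story"]
labeled_example = (to_feature_dict(example_tokens), "pos")

print(labeled_example)
# ({'great': True, 'acting': True, 'story': True}, 'pos')
```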

### 3b - Apply Feature Extraction
Update 'neg_train', 'pos_train', 'neg_test', 'pos_test' using the `featureExtraction` function.

In [9]:
# Update lists with feature extraction
neg_train = featureExtraction(neg_train, 'neg')
pos_train = featureExtraction(pos_train, 'pos')
neg_test = featureExtraction(neg_test, 'neg')
pos_test = featureExtraction(pos_test, 'pos')

### 4 - Prepare for training

Create a unified train list (joining 'neg_train' and 'pos_train'). After creating this unified dataset, shuffle the training data so that negative and positive reviews are interleaved rather than grouped by label.
Do the same for the test reviews ('neg_test', 'pos_test').


In [10]:
import random

# Unify dataset
train_features = neg_train + pos_train
test_features = neg_test + pos_test

# Shuffle
random.shuffle(train_features)
random.shuffle(test_features)
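
`random.shuffle` gives a different order on every run. If reproducible runs matter, one option is to shuffle with a seeded generator, as sketched below. The seed value, the `seeded_shuffle` helper, and the toy data are assumptions for illustration only.

```python
import random

# Stand-in data: (features, label) tuples in the same shape as the real lists
toy_features = [
    ({"good": True}, "pos"),
    ({"bad": True}, "neg"),
    ({"fine": True}, "pos"),
    ({"dull": True}, "neg"),
]

def seeded_shuffle(items, seed):
    # Shuffle a copy with a dedicated generator so the global
    # random state and the original list are left untouched
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled

first = seeded_shuffle(toy_features, seed=42)
second = seeded_shuffle(toy_features, seed=42)

print(first == second)  # True: the same seed gives the same order
```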

### 5 - Training Naive Bayes Classifier
From `nltk.classify` import the `NaiveBayesClassifier`.

Create a Naive Bayes Classifier called `nb_classifier` calling the `NaiveBayesClassifier.train` function on the 'train_features'.

In [11]:
from nltk.classify import NaiveBayesClassifier

# Call classifier function on training features
nb_classifier = NaiveBayesClassifier.train(train_features)
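
To see what the trained object can do, the sketch below trains a second, tiny classifier on four invented feature sets and asks it for a label and a probability. The names `toy_train` and `toy_classifier` and the data are illustrative assumptions; the real model above is trained on the full review set.

```python
from nltk.classify import NaiveBayesClassifier

# Invented training examples in the (feature dict, label) format
toy_train = [
    ({"great": True, "fun": True}, "pos"),
    ({"great": True, "acting": True}, "pos"),
    ({"awful": True, "boring": True}, "neg"),
    ({"awful": True, "plot": True}, "neg"),
]

toy_classifier = NaiveBayesClassifier.train(toy_train)

# Predict a single label
prediction = toy_classifier.classify({"great": True})
print(prediction)

# Inspect the probability assigned to the 'pos' label
pos_probability = toy_classifier.prob_classify({"great": True}).prob("pos")
print(pos_probability)
```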

### 6 - Naive Bayes - Model evaluation

Using the `nb_classifier` object created, separate labels from data and run the model.

In [12]:
# Separate labels
train_labels = [label for (_, label) in train_features]
train_data = [features for features, _ in train_features]    
test_labels = [label for (_, label) in test_features]
test_data = [features for features, _ in test_features]

# Run model
nb_predictions = nb_classifier.classify_many(test_data)

### 7 - Naive Bayes - Calculate the Accuracy

In [13]:
# Calculate accuracy from predictions
correct_predictions = sum(1 for pred_label, true_label in zip(nb_predictions, test_labels) if pred_label == true_label)
nb_test_accuracy = correct_predictions / len(test_labels)

# Print results
print('Naive Bayes')
print('Testing data accuracy:', nb_test_accuracy)

Naive Bayes
Testing data accuracy: 0.8414
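
Accuracy alone does not show which class the errors fall in. A possible follow-up is to count (true label, predicted label) pairs, as in this sketch. The two label lists are invented, so the numbers say nothing about the model above.

```python
from collections import Counter

# Invented labels for illustration only
toy_true = ["pos", "neg", "pos", "neg", "pos"]
toy_pred = ["pos", "neg", "neg", "neg", "pos"]

# Count each (true, predicted) pair
confusion = Counter(zip(toy_true, toy_pred))

# Accuracy is the share of pairs where both labels agree
toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)

print(confusion[("pos", "neg")])  # 1 positive review predicted as negative
print(toy_accuracy)               # 0.8
```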


### 8 - Training Perceptron Classifier

For the Perceptron, we will use `sklearn` instead of the `NLTK` library. We will need to convert the feature dictionaries to feature vectors.

In [14]:
from sklearn.linear_model import Perceptron 
from sklearn.feature_extraction import DictVectorizer

Feature dictionaries are convenient for representing features in a human-readable format but most machine learning algorithms, including the Perceptron model from scikit-learn, expect input data in the form of feature vectors, which are numerical arrays.

Create an object called `vectorizer` from the `DictVectorizer` class. Pass `sparse=True` as a parameter.

Create an object called `X_train_vec` from `vectorizer.fit_transform` using `train_data`.

Create an object called `X_test_vec` from `vectorizer.transform` using `test_data`.

In [15]:
vectorizer = DictVectorizer(sparse=True)

X_train_vec = vectorizer.fit_transform(train_data)
X_test_vec = vectorizer.transform(test_data)
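
The idea behind the conversion can be sketched in plain Python: learn a fixed column for each feature name from the training data, then fill a numeric vector per review. This is a hand-rolled illustration, not scikit-learn's implementation; `fit_vocabulary` and `to_vector` are hypothetical helpers, and the sketch builds dense lists where `DictVectorizer(sparse=True)` returns a sparse matrix.

```python
def fit_vocabulary(feature_dicts):
    # Assign one column index to each feature name seen in training
    names = sorted({name for d in feature_dicts for name in d})
    return {name: index for index, name in enumerate(names)}

def to_vector(feature_dict, vocabulary):
    # Features that were never seen during fitting are ignored
    vector = [0.0] * len(vocabulary)
    for name, value in feature_dict.items():
        if name in vocabulary:
            vector[vocabulary[name]] = float(value)
    return vector

toy_dicts = [{"great": True, "acting": True}, {"awful": True}]
vocabulary = fit_vocabulary(toy_dicts)

train_vectors = [to_vector(d, vocabulary) for d in toy_dicts]
test_vector = to_vector({"great": True, "unseen": True}, vocabulary)

print(vocabulary)     # {'acting': 0, 'awful': 1, 'great': 2}
print(train_vectors)  # [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
print(test_vector)    # [0.0, 0.0, 1.0]
```

This is also why the test data goes through `transform` rather than `fit_transform`: the columns must match the ones learned from the training data.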

Do the same for the labels.

Create an object called `y_train` from train_labels by replacing 'neg' with 0 and 'pos' with 1.

Create an object called `y_test` from test_labels by replacing 'neg' with 0 and 'pos' with 1.


In [16]:
y_train = [0 if label == 'neg' else 1 for label in train_labels]
y_test = [0 if label == 'neg' else 1 for label in test_labels]

Create and train the Perceptron model.

Create an object called `perceptron_model`. The parameter `max_iter` sets the maximum number of passes over the training data (epochs). Then fit (train) the model.

In [17]:
perceptron_model = Perceptron(max_iter=15)
perceptron_model.fit(X_train_vec, y_train)
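
For intuition about what fitting does, here is a minimal sketch of the classic perceptron update rule in plain Python. It is not scikit-learn's implementation (which, among other differences, shuffles the data between epochs by default), and `train_perceptron`, `predict`, and the toy data are assumptions made for illustration.

```python
def predict(weights, bias, features):
    # Linear score followed by a threshold at zero
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0

def train_perceptron(X, y, max_iter=15):
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(max_iter):
        mistakes = 0
        for features, target in zip(X, y):
            error = target - predict(weights, bias, features)
            if error != 0:
                # Move the weights toward the correct side for this example
                weights = [w + error * x for w, x in zip(weights, features)]
                bias += error
                mistakes += 1
        if mistakes == 0:
            # Every example is classified correctly, so stop early
            break
    return weights, bias

# Toy, linearly separable data: the label equals the first feature
toy_X = [[1, 0], [1, 1], [0, 1], [0, 0]]
toy_y = [1, 1, 0, 0]

weights, bias = train_perceptron(toy_X, toy_y)
toy_predictions = [predict(weights, bias, x) for x in toy_X]

print(toy_predictions)  # [1, 1, 0, 0]
```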



### 9 - Perceptron - Model evaluation
Score the model on the test set; `score` returns the mean accuracy.

In [18]:
accuracy = perceptron_model.score(X_test_vec, y_test)

### 10 - Perceptron - Calculate the Accuracy

In [19]:
# Print results
print('Perceptron')
print('Testing data accuracy:', accuracy)

Perceptron
Testing data accuracy: 0.85492


### 11 - Export Models
Save both trained models to disk with `pickle`.

In [21]:
import pickle

# Save the models to disk
with open('nb_classifier.pkl', 'wb') as file:
    pickle.dump(nb_classifier , file)

with open('perceptron_model.pkl', 'wb') as file:
    pickle.dump(perceptron_model , file)
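
Loading a saved model back follows the same pattern with `pickle.load`. The sketch below round-trips a stand-in object through a temporary file so that it runs without the trained models; the dictionary and the file name are placeholders, not the real artifacts.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model, so the sketch runs on its own
stand_in_model = {"weights": [2.0, -1.0], "bias": 0.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "stand_in_model.pkl")

    # Save to disk
    with open(model_path, "wb") as file:
        pickle.dump(stand_in_model, file)

    # Load from disk
    with open(model_path, "rb") as file:
        restored_model = pickle.load(file)

print(restored_model == stand_in_model)  # True
```

Two practical notes: only unpickle files from a source you trust, and to classify new reviews with the Perceptron the fitted `vectorizer` would also be needed, so it may be worth saving it the same way.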
