This notebook evaluates different methods for tokenization and stemming/lemmatization
and assesses their impact on binary sentiment classification, using a train/dev sample of 1000 reviews from the [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/). Each tokenization method is evaluated with the same learning algorithm ($\ell_2$-regularized logistic regression); the only difference is the tokenization process. For more, see: http://sentiment.christopherpotts.net/tokenizing.html
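
For reference, the $\ell_2$-regularized logistic regression objective can be written in its standard textbook form as below. The symbols are assumptions, since the notebook does not state the parameterization used inside `TokenizationTest`: labels $y_i \in \{-1, +1\}$, feature vectors $x_i$ built from the tokens of review $i$, weights $w$, and regularization strength $\lambda$.

$$\min_{w}\; \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i\, w^\top x_i\right)\right) + \lambda \lVert w \rVert_2^2$$

Changing the tokenizer changes only how each $x_i$ is built; the objective itself stays the same.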

### Cell 2: Importing Libraries
This cell imports all the necessary Python libraries and modules for the notebook's tasks. It includes libraries for natural language processing (`nltk`, `spacy`), a custom class for testing (`TokenizationTest`), and a specialized tokenizer (`happyfuntokenizing`).

In [None]:
# Import the Natural Language Toolkit library, a popular library for NLP tasks.
import nltk
# Import the spaCy library, another powerful library for advanced NLP.
import spacy
# From NLTK's stemming module, import the PorterStemmer class explicitly (avoids a wildcard import).
from nltk.stem.porter import PorterStemmer
# Import the custom class that will be used to run the evaluation tests.
from TokenizationTest import TokenizationTest
# Import a tokenizer designed for social media and sentiment analysis from a library by Christopher Potts.
from happyfuntokenizing import Tokenizer as potts

### Cell 3: Downloading the spaCy Model
This cell attempts to download a small English language model (`en_core_web_sm`) for spaCy. This model is pre-trained and required for tasks like tokenization, part-of-speech tagging, and lemmatization. **Note:** The output shows an error (`No module named spacy`), indicating that the spaCy library was not installed in the environment where this command was run. This would cause subsequent cells that rely on spaCy to fail.

In [None]:
# This is a shell command executed from the notebook to download the specified spaCy model.
!python -m spacy download en_core_web_sm
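
A missing-module error like the one noted above can be caught before any download is attempted. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of the original notebook.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Check the packages this notebook depends on before downloading any models.
for name in ("nltk", "spacy"):
    status = "available" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If `spacy` is reported as missing, installing it into the kernel's environment (for example with `%pip install spacy` in a notebook cell) before rerunning the download is the usual remedy.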

### Cell 4: Initializing Models and Tools
Here, the various tools needed for tokenization and stemming are initialized. This includes loading the spaCy language model and creating instances of the NLTK Porter Stemmer and the Potts tokenizer. Assuming spaCy was installed correctly, this cell sets up the objects that will process the text data.

In [None]:
# Load the small English spaCy model. Disable the Named Entity Recognition (ner) and parser components to speed up processing,
# as only the tagger (for lemmatization) is needed.
# Each component name must be its own list element; 'ner,parser' as a single string would match no component.
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

# Create an instance of the NLTK PorterStemmer.
stemmer = PorterStemmer()

# Create an instance of the Potts sentiment-aware tokenizer.
potts_tokenizer = potts()
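
A common pitfall when passing component names to an option such as `disable` is writing them as one comma-separated string instead of separate strings. The sketch below illustrates the difference in plain Python, without loading spaCy; `split_names` is a hypothetical helper, not a spaCy function.

```python
# Names are compared one list element at a time, so each name needs its own entry.
one_string = ['ner,parser']      # a single element: the literal text "ner,parser"
separate = ['ner', 'parser']     # two elements, one per component

print(len(one_string), 'ner' in one_string)
print(len(separate), 'ner' in separate)


def split_names(names):
    """Normalize a list that may contain comma-separated names (hypothetical helper)."""
    return [part.strip() for name in names for part in name.split(',') if part.strip()]


print(split_names(one_string))
```

Once the pipeline is loaded, inspecting `nlp.pipe_names` is a quick way to confirm which components are actually active.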

### Cell 5: Defining spaCy Wrapper Functions
This cell defines two functions that act as wrappers for spaCy's functionality. One function is for tokenization (splitting text into words), and the other is for lemmatization (reducing words to their base or dictionary form). These functions make it easy to pass spaCy's methods to the evaluation tester.

In [None]:
# Define a function that takes raw text data as input for spaCy tokenization.
def spacy_tokenizer(data):
    # Process the data with the spaCy nlp object.
    spacy_tokens = nlp(data)
    # Return a list of the token texts (the actual words).
    return [token.text for token in spacy_tokens]

# Define a function that takes raw text data as input for spaCy lemmatization.
def spacy_lemmatizer(data):
    # Process the data with the spaCy nlp object.
    spacy_tokens = nlp(data)
    # Return a list of the lemmas for each token.
    return [token.lemma_ for token in spacy_tokens]
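
Both wrappers follow the same contract: take one string, return a list of strings. Assuming that is what `TokenizationTest.evaluate` expects (its source is not shown here), a small sanity check can catch a callable that breaks the contract before it is evaluated. `check_tokenizer` is a hypothetical helper written for illustration.

```python
def check_tokenizer(tokenizer, sample="A short sample review."):
    """Run tokenizer on sample and verify that it returns a list of strings."""
    tokens = tokenizer(sample)
    is_sequence = isinstance(tokens, (list, tuple))
    if not is_sequence or not all(isinstance(t, str) for t in tokens):
        raise TypeError(f"expected a list of strings, got {type(tokens).__name__}")
    return list(tokens)


# str.split satisfies the contract: one string in, a list of strings out.
print(check_tokenizer(str.split))

# str.lower does not: it returns a single string, so the check raises TypeError.
try:
    check_tokenizer(str.lower)
except TypeError as err:
    print("rejected:", err)
```

The same check could be run on `spacy_tokenizer` and `spacy_lemmatizer` once the spaCy model is loaded.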

### Cell 6: Setting Up the Evaluation Tester
This cell initializes the `TokenizationTest` class with the paths to the training and development (validation) datasets. This `tester` object will handle the process of training the logistic regression model and evaluating its accuracy for each tokenization method.

In [None]:
# Create an instance of the TokenizationTest class.
# The constructor takes the file paths for the training data and the development (validation) data.
tester = TokenizationTest("../data/sentiment.1000.train.txt", "../data/sentiment.1000.dev.txt")
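
The internals of `TokenizationTest` are not shown in this notebook. As a rough sketch of the two pieces that depend least on implementation details, the code below assumes that tokens are turned into binary bag-of-words features and that the reported score is plain accuracy on the dev set. Both function names are hypothetical.

```python
def binary_features(text, tokenizer):
    """Map a document to the set of distinct tokens it contains."""
    return set(tokenizer(text))


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


review = "good movie , good cast"
print(sorted(binary_features(review, str.split)))   # [',', 'cast', 'good', 'movie']

# Three of the four toy predictions match the toy gold labels.
print(accuracy(["pos", "neg", "pos", "pos"], ["pos", "neg", "neg", "pos"]))   # 0.75
```

Under this assumption, swapping the tokenizer changes which features exist, while the classifier and the metric stay fixed.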

### Cell 7: Evaluation 1: Basic String Splitting
The first evaluation uses Python's built-in `str.split` method, which tokenizes text by splitting it at whitespace characters. This is a simple, baseline approach. The output shows an accuracy of **85.6%**.

In [None]:
# Call the evaluate method of the tester object, passing the built-in string split method as the tokenizer.
tester.evaluate(str.split)
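
To make the baseline's behavior concrete, the short example below shows what whitespace splitting does to punctuation and contractions. The sample sentence is made up for illustration.

```python
text = "I loved it, really!  Didn't expect that."

# With no arguments, split() breaks on runs of whitespace and drops empty strings.
tokens = text.split()
print(tokens)
# ['I', 'loved', 'it,', 'really!', "Didn't", 'expect', 'that.']
```

Punctuation stays attached to the neighboring word, so `it,` and `it` become different features.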

### Cell 8: Evaluation 2: Porter Stemming
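This evaluation uses the NLTK Porter Stemmer. Stemming heuristically strips suffixes to reduce words to a common stem (e.g., "running" becomes "run"), which can help the model generalize across inflected forms. However, it can be too aggressive, conflating unrelated words or producing stems that are not real words. Note that `stemmer.stem` is designed to take a single word and return a single string, so what this call measures depends on how `TokenizationTest.evaluate` handles that return value. The accuracy here is **82.5%**, lower than the baseline.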

In [None]:
# Evaluate the Porter Stemmer's stemming method.
tester.evaluate(stemmer.stem)
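
`PorterStemmer.stem` takes one word and returns one stemmed string, so a common pattern is to tokenize first and then stem each token. The sketch below shows that composition in plain Python; `stem_tokens` is a hypothetical helper, and `toy_stem` is a deliberately trivial stand-in for a real stemmer so the example runs without NLTK.

```python
def stem_tokens(tokenize, stem):
    """Build a tokenizer that splits text first, then stems each token."""
    def tokenizer(text):
        return [stem(token) for token in tokenize(text)]
    return tokenizer


def toy_stem(word):
    """Toy stand-in for a real stemmer: strip one trailing 's' from longer words."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


stemming_tokenizer = stem_tokens(str.split, toy_stem)
print(stemming_tokenizer("great movies with great actors"))
# ['great', 'movie', 'with', 'great', 'actor']
```

With the notebook's objects, the analogous composition would be `stem_tokens(nltk.word_tokenize, stemmer.stem)`, which returns a list of stems and could be passed to `tester.evaluate` in the same way as the other tokenizers.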

### Cell 9: Evaluation 3: NLTK Word Tokenizer
This test uses `nltk.word_tokenize`, a more sophisticated tokenizer than `str.split`. It handles punctuation and contractions more effectively (e.g., "don't" becomes "do" and "n't"). This improved tokenization results in a higher accuracy of **87.5%**.

In [None]:
# Evaluate the NLTK library's standard word tokenizer.
tester.evaluate(nltk.word_tokenize)
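
The difference from whitespace splitting can be illustrated with a rough regular-expression approximation. This is not NLTK's implementation, only a sketch of the two behaviors described above (separating punctuation and splitting off "n't"); `rough_tokenize` and its pattern are assumptions made for illustration.

```python
import re

# Order matters: the part of a word before "n't", then "n't" itself,
# then ordinary words, then any single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?=n't)|n't|\w+|[^\w\s]")


def rough_tokenize(text):
    """Split text into words, "n't" clitics, and punctuation marks."""
    return TOKEN_PATTERN.findall(text)


text = "I don't like it."
print(text.split())            # ['I', "don't", 'like', 'it.']
print(rough_tokenize(text))    # ['I', 'do', "n't", 'like', 'it', '.']
```

Separating `n't` gives the classifier a dedicated negation feature, and detaching the period lets `it.` and `it` share one feature.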

### Cell 10: Evaluation 4: spaCy Tokenizer
Here, the `spacy_tokenizer` function defined earlier is evaluated. spaCy's tokenizer is rule-based: it splits on whitespace and then applies language-specific rules for prefixes, suffixes, infixes, and special cases such as contractions. It achieves an accuracy of **87.1%**, slightly below the NLTK tokenizer in this specific test.

In [None]:
# Evaluate the custom spaCy tokenizer function.
tester.evaluate(spacy_tokenizer)

### Cell 11: Evaluation 5: spaCy Lemmatizer
This test evaluates the `spacy_lemmatizer` function. Lemmatization is similar to stemming but more linguistically informed: it maps each word to a dictionary form (lemma) rather than a truncated stem, using vocabulary and part-of-speech information (e.g., the adjective "better" can map to "good"). This results in an accuracy of **87.3%**.

In [None]:
# Evaluate the custom spaCy lemmatizer function.
tester.evaluate(spacy_lemmatizer)

### Cell 12: Evaluation 6: Potts Sentiment Tokenizer
The final evaluation uses the tokenizer from the `happyfuntokenizing` library, which is specifically designed for sentiment analysis and social media text. It keeps sentiment-bearing units such as emoticons, along with HTML tags and Twitter-style usernames and hashtags, intact as single tokens. It achieves the highest accuracy of all methods tested: **88.6%**.

In [None]:
# Evaluate the tokenize method from the Potts tokenizer instance.
tester.evaluate(potts_tokenizer.tokenize)
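
Sentiment-aware tokenizers typically try emoticon patterns before falling back to generic word and punctuation rules. The code below is a simplified sketch of that idea, not the `happyfuntokenizing` implementation; the three patterns and the function name are illustrative assumptions.

```python
import re

# Emoticons are listed first so they win over the single-character punctuation fallback.
EMOTICON = r"[:;=][-o]?[)(DPp]"
WORD = r"\w+(?:'\w+)?"
PUNCT = r"[^\w\s]"
TOKEN_PATTERN = re.compile("|".join([EMOTICON, WORD, PUNCT]))


def emoticon_aware_tokenize(text):
    """Tokenize text while keeping simple Western-style emoticons whole."""
    return TOKEN_PATTERN.findall(text)


print(emoticon_aware_tokenize("Loved it :-) but the ending... :("))
# ['Loved', 'it', ':-)', 'but', 'the', 'ending', '.', '.', '.', ':(']
```

A tokenizer that splits every punctuation character would break `:-)` into three unrelated tokens and lose a strong sentiment cue.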