### Problem Statement:
Suppose we are building a sentiment analysis system that classifies movie reviews as positive or negative. The reviews contain a mix of casual language, punctuation, numbers, special characters, and stop words, much of which contributes little to sentiment. The goal is to preprocess the text efficiently so that the sentiment classifier performs better.

For this, we need to implement several key text preprocessing steps, such as tokenization, lowercasing, removing stop words, punctuation removal, stemming, lemmatization, handling numbers, removing special characters, and text normalization.

### 1. Tokenization
Tokenization is the process of splitting a piece of text into individual words (tokens) or sentences. For this problem, we will tokenize each review into words for further processing.

In [12]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')  # Download tokenization models (newer NLTK releases may need 'punkt_tab' instead)

reviews = ["The movie was AMAZING!! I loved it! 10/10", 
           "Worst movie ever. Total waste of time. 0/10"]
           
# Tokenizing the reviews
tokenized_reviews = [word_tokenize(review) for review in reviews]
print(tokenized_reviews)


[['The', 'movie', 'was', 'AMAZING', '!', '!', 'I', 'loved', 'it', '!', '10/10'], ['Worst', 'movie', 'ever', '.', 'Total', 'waste', 'of', 'time', '.', '0/10']]


[nltk_data] Downloading package punkt to /Users/devarsh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
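
To make the splitting rule concrete, here is a minimal regex-based sketch that reproduces the token boundaries shown above for these two reviews. It is only an approximation and not how `word_tokenize` works internally; the pattern and the helper name `simple_tokenize` are assumptions made for illustration.

```python
import re

# Illustrative pattern (an assumption, not NLTK's rules):
# ratings like 10/10 first, then runs of word characters, then single punctuation marks.
TOKEN_PATTERN = re.compile(r"\d+/\d+|\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into rating, word, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

reviews = ["The movie was AMAZING!! I loved it! 10/10",
           "Worst movie ever. Total waste of time. 0/10"]

for review in reviews:
    print(simple_tokenize(review))
```

Listing the rating alternative first keeps "10/10" as one token; without it, the pattern would split the rating into "10", "/", "10".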


### 2. Lowercasing
Lowercasing ensures that words like “Movie” and “movie” are treated the same by converting everything to lowercase.

In [2]:
# Converting tokens to lowercase
lowercased_reviews = [[word.lower() for word in tokens] for tokens in tokenized_reviews]
print(lowercased_reviews)


[['the', 'movie', 'was', 'amazing', '!', '!', 'i', 'loved', 'it', '!', '10/10'], ['worst', 'movie', 'ever', '.', 'total', 'waste', 'of', 'time', '.', '0/10']]


### 3. Removing Stop Words
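Stop words (e.g., "the", "is", "in") are common words that add little meaning in most text classification tasks. Note that NLTK's English list also contains negations such as "no", "nor", and "not", which can flip the sentiment of a review, so for sentiment analysis it may be worth keeping them.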

In [14]:
from nltk.corpus import stopwords
nltk.download('stopwords')

# Defining the set of English stopwords
stop_words = set(stopwords.words('english'))

# Removing stop words
filtered_reviews = [[word for word in tokens if word not in stop_words] for tokens in lowercased_reviews]
print(filtered_reviews)


[['movie', 'amazing', '!', '!', 'loved', '!', '10/10'], ['worst', 'movie', 'ever', '.', 'total', 'waste', 'time', '.', '0/10']]


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/devarsh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
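
Because negations can reverse the polarity of a review, one option is to take them out of the stop word set before filtering. The sketch below uses a small hand-written subset of stop words so that it runs without any downloads; the names `base_stop_words`, `negations`, and `sentiment_stop_words` are hypothetical, and in the notebook the same set difference could be applied to `stop_words`.

```python
# Small illustrative subset of stop words (an assumption, not the full NLTK list)
base_stop_words = {"the", "was", "i", "it", "of", "not", "no", "nor"}

# Negations we may want to keep for sentiment analysis
negations = {"not", "no", "nor"}

# Stop words with the negations taken out
sentiment_stop_words = base_stop_words - negations

tokens = ["the", "movie", "was", "not", "good"]
kept = [t for t in tokens if t not in sentiment_stop_words]
print(kept)  # ['movie', 'not', 'good']
```

With the unmodified set, the same tokens would reduce to "movie" and "good", which reads as a positive review.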


In [13]:
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 ...

### 4. Removing Punctuation
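Punctuation tokens are usually noise for this task, so we remove them.

In [15]:
import string

# Removing punctuation tokens. A set makes the membership test exact;
# `word in string.punctuation` would be a substring check on one long string.
punctuation = set(string.punctuation)
reviews_no_punct = [[word for word in tokens if word not in punctuation] for tokens in filtered_reviews]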
print(reviews_no_punct)


[['movie', 'amazing', 'loved', '10/10'], ['worst', 'movie', 'ever', 'total', 'waste', 'time', '0/10']]
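
Some tokenizers emit multi-character punctuation tokens such as "..." or "!!", which a single-character membership test would miss. A minimal sketch of a stricter check follows; the helper name `is_punctuation_token` and the sample tokens are assumptions for illustration.

```python
import string

PUNCT_CHARS = set(string.punctuation)

def is_punctuation_token(token):
    """True when the token is non-empty and made up only of punctuation characters."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)

tokens = ["great", "...", "!!", "10/10", "!"]
kept = [t for t in tokens if not is_punctuation_token(t)]
print(kept)  # ['great', '10/10']
```

The rating "10/10" survives because only its "/" is punctuation, not the whole token.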


### 5. Stemming and Lemmatization
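Stemming strips suffixes with heuristic rules, so the result is not always a real word (e.g., "loved" becomes "love", but "movie" becomes "movi"). Lemmatization maps a word to its dictionary form, or lemma (e.g., "better" to "good"), and depends on the part of speech. `WordNetLemmatizer.lemmatize` assumes a noun unless a `pos` argument is given, which is why "loved" is left unchanged in the lemmatized output below.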

In [16]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Applying stemming and lemmatization
stemmed_reviews = [[stemmer.stem(word) for word in tokens] for tokens in reviews_no_punct]
lemmatized_reviews = [[lemmatizer.lemmatize(word) for word in tokens] for tokens in reviews_no_punct]
print(stemmed_reviews)
print(lemmatized_reviews)


[['movi', 'amaz', 'love', '10/10'], ['worst', 'movi', 'ever', 'total', 'wast', 'time', '0/10']]
[['movie', 'amazing', 'loved', '10/10'], ['worst', 'movie', 'ever', 'total', 'waste', 'time', '0/10']]


[nltk_data] Downloading package wordnet to /Users/devarsh/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
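
The difference between the two approaches can be seen in a toy example that needs no downloads. This is a deliberately simplified sketch: `toy_stem` is a naive suffix stripper (not the Porter algorithm) and `toy_lemmas` is a tiny hand-written lookup table (not WordNet); both names are hypothetical.

```python
def toy_stem(word):
    """Naive suffix stripping: cut a known suffix if at least 3 characters remain."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Hand-written dictionary of lemmas (illustrative only)
toy_lemmas = {"loved": "love", "better": "good", "movies": "movie"}

def toy_lemmatize(word):
    """Dictionary lookup: return the lemma if known, else the word itself."""
    return toy_lemmas.get(word, word)

for word in ["loved", "amazing", "movies", "better"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

Rule-based stripping yields fragments such as "lov" and "movi" and cannot relate "better" to "good", whereas the lookup returns real words but only for entries it knows.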


### 6. Handling Numbers
Whether numbers matter depends on the task. For sentiment analysis, ratings like "10/10" or "0/10" can be valuable, so we remove standalone numbers and retain sentiment-related ratings.

In [17]:
# Removing standalone numbers, while retaining sentiment ratings (e.g., "10/10").
# A rating such as "10/10" is not all digits, so str.isdigit() is False and the token is kept.
reviews_no_numbers = [[word for word in tokens if not word.isdigit()] for tokens in lemmatized_reviews]
print(reviews_no_numbers)


[['movie', 'amazing', 'loved', '10/10'], ['worst', 'movie', 'ever', 'total', 'waste', 'time', '0/10']]
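
If ratings are kept, one way to use them is to convert each one into a numeric score. The following is a minimal sketch under the assumption that ratings always look like "numerator/denominator"; the helper name `rating_score` is hypothetical.

```python
import re

# Matches tokens of the form "<digits>/<digits>", e.g. "10/10"
RATING = re.compile(r"(\d+)/(\d+)")

def rating_score(token):
    """Return the rating as a fraction, or None if the token is not a usable rating."""
    match = RATING.fullmatch(token)
    if match is None:
        return None
    numerator, denominator = int(match.group(1)), int(match.group(2))
    if denominator == 0:
        return None  # avoid division by zero on malformed ratings
    return numerator / denominator

tokens = ["movie", "amazing", "loved", "10/10"]
scores = [rating_score(t) for t in tokens if rating_score(t) is not None]
print(scores)  # [1.0]
```

A score like this could be passed to the classifier as an extra feature alongside the text tokens.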


### 7. Removing Special Characters
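Special characters such as “#”, “@”, or emoji may need to be removed or handled separately, depending on their relevance. Note that the pattern used here also strips the "/" from ratings, so "10/10" becomes "1010", and a token made up entirely of special characters becomes an empty string.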

In [7]:
import re

# Removing special characters
def remove_special_chars(tokens):
    return [re.sub(r'[^A-Za-z0-9]+', '', word) for word in tokens]

reviews_cleaned = [remove_special_chars(tokens) for tokens in reviews_no_numbers]
print(reviews_cleaned)


[['movie', 'amazing', 'loved', '1010'], ['worst', 'movie', 'ever', 'total', 'waste', 'time', '010']]
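
The output above shows the ratings collapsing to "1010" and "010". A possible variant that leaves ratings intact and drops tokens that end up empty is sketched below; the function name `remove_special_chars_keep_ratings` and the sample tokens are assumptions for illustration.

```python
import re

RATING = re.compile(r"\d+/\d+")

def remove_special_chars_keep_ratings(tokens):
    """Strip non-alphanumeric characters, but leave ratings such as 10/10 untouched."""
    cleaned = []
    for word in tokens:
        if RATING.fullmatch(word):
            cleaned.append(word)  # keep the rating exactly as written
            continue
        stripped = re.sub(r"[^A-Za-z0-9]+", "", word)
        if stripped:  # skip tokens that were only special characters
            cleaned.append(stripped)
    return cleaned

print(remove_special_chars_keep_ratings(["#movie", "@", "loved", "10/10"]))
# ['movie', 'loved', '10/10']
```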


### 8. Text Normalization
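Text normalization ensures consistency across words by transforming them into a standard format. This includes correcting spelling mistakes, expanding contractions, and dealing with abbreviations. The lookup below matches whole tokens, so it only has an effect when a contraction is still a single token at this point. `word_tokenize` splits contractions (e.g., "it's" into "it" and "'s"), and the sample reviews contain none, so the output here is unchanged.

In [20]:
# Expanding common contractions (e.g., "i've" to "i have") and informal abbreviations (e.g., "4" to "for")
contractions = {"i've": "i have", "it's": "it is", "won't": "will not", "can't": "cannot", "4": "for"}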

def expand_contractions(tokens):
    return [contractions.get(word, word) for word in tokens]

normalized_reviews = [expand_contractions(tokens) for tokens in reviews_cleaned]
print(normalized_reviews)

[['movie', 'amazing', 'loved', '1010'], ['worst', 'movie', 'ever', 'total', 'waste', 'time', '010']]
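
Contractions are easier to expand on the raw text, before tokenization and special-character removal break them apart. Here is a minimal sketch of that approach, assuming straight apostrophes and a small hand-written mapping; the function name `expand_contractions_text` and the sample sentence are hypothetical.

```python
import re

contractions = {"i've": "i have", "it's": "it is", "won't": "will not", "can't": "cannot"}

# One case-insensitive pattern that matches any key as a whole word
CONTRACTION_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in contractions) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions_text(text):
    """Replace each known contraction in the raw text with its expansion."""
    return CONTRACTION_PATTERN.sub(lambda m: contractions[m.group(0).lower()], text)

expanded = expand_contractions_text("I've seen it and it's great, but I won't rewatch it.")
print(expanded)
# i have seen it and it is great, but I will not rewatch it.
```

The replacement text is lowercase because the mapping is; that is harmless if lowercasing follows anyway. Curly apostrophes (’) would need to be normalized to straight ones first for the pattern to match.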


### Amazon Sentiment Sleuth: Analyzing Product Reviews with LLMs
Build and evaluate an NLP model for sentiment analysis of product reviews using the Amazon Customer Reviews dataset from AWS. Students will apply text preprocessing, classification techniques, and advanced NLP methods to analyze sentiment in reviews.


Project Deliverables:
- Model code
- Sentiment analysis results
- Jupyter notebook with documentation

### NLP with Python: Using NLTK and spaCy for Text Classification


In [21]:
import spacy

In [23]:
# Load English tokenizer, POS tagger, parser, NER, and word vectors.
# Assumption: the small English pipeline 'en_core_web_sm', installed with
#   python -m spacy download en_core_web_sm
# spacy.load raises OSError [E050] if the named model cannot be found.
nlp = spacy.load('en_core_web_sm')

# Function to preprocess text using spaCy
def preprocess_text_spacy(text):
    doc = nlp(text)

    # Tokenization, lowercasing, removing stop words and punctuation, and lemmatization
    tokens = [token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct]

    return tokens

# Preprocess reviews
for review in reviews:
    cleaned_tokens = preprocess_text_spacy(review)
    print(f"Original: {review}")
    print(f"Cleaned Tokens: {cleaned_tokens}\n")