----
### Key Takeaways
By the end of this module, you will understand the core techniques of sentiment analysis, be able to preprocess text data, build and evaluate sentiment models with both traditional machine learning and deep learning, and apply these skills to real-world NLP tasks.

----

### Introduction to NLP
- Understanding Natural Language Processing (NLP) and its applications.
- Overview of NLP techniques for text analysis and understanding.
- Importance of sentiment analysis in understanding customer opinions and feedback.

-----


### Text Preprocessing
- __Tokenization:__ Breaking down text into smaller units such as words or sentences.
- __Stopword Removal:__ Removing common words (e.g., "the", "is", "and") that do not carry significant meaning.
- __Lemmatization:__ Reducing words to their dictionary form, or lemma (e.g., "running" to "run", "better" to "good").

----

#### Bag of Words Model

- Introduction to the Bag of Words (BoW) model.
- Creating a vocabulary of unique words from the text corpus.
- Representing text documents as numerical vectors based on word frequency.

-----


#### Naive Bayes Classifier

- Understanding the Naive Bayes algorithm for text classification.
- Training a Naive Bayes classifier using labeled data for sentiment analysis.
- Evaluating the classifier performance using accuracy, precision, recall, and F1-score.
-------
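
To make the algorithm concrete, here is a minimal from-scratch sketch of a Bernoulli Naive Bayes classifier with Laplace smoothing. It is illustrative only: the function names `train_bernoulli_nb` and `predict_bernoulli_nb` and the toy documents are assumptions made for this example, and the notebook itself uses scikit-learn's `BernoulliNB` later on.

```python
import math

def train_bernoulli_nb(docs, labels, alpha=1.0):
    """Estimate class priors and per-class word-presence probabilities."""
    vocab = sorted({word for doc in docs for word in doc})
    classes = sorted(set(labels))
    priors, cond = {}, {}
    for c in classes:
        class_docs = [set(doc) for doc, y in zip(docs, labels) if y == c]
        priors[c] = len(class_docs) / len(docs)
        # P(word present | class), smoothed so no probability is exactly 0 or 1
        cond[c] = {
            w: (sum(w in d for d in class_docs) + alpha) / (len(class_docs) + 2 * alpha)
            for w in vocab
        }
    return vocab, priors, cond

def predict_bernoulli_nb(doc, vocab, priors, cond):
    """Return the class with the highest log-posterior for a tokenized document."""
    present = set(doc)
    scores = {}
    for c in priors:
        log_prob = math.log(priors[c])
        for w in vocab:
            p = cond[c][w]
            # Bernoulli NB also scores the absence of each vocabulary word
            log_prob += math.log(p if w in present else 1.0 - p)
        scores[c] = log_prob
    return max(scores, key=scores.get)

# Hypothetical toy training data (1 = positive, 0 = negative)
docs = [["love", "place"], ["great", "menu"], ["nasti", "textur"], ["not", "good"]]
labels = [1, 1, 0, 0]

vocab, priors, cond = train_bernoulli_nb(docs, labels)
print(predict_bernoulli_nb(["great", "place"], vocab, priors, cond))
```

The "naive" part is the assumption that words occur independently given the class, which is why the per-word log-probabilities can simply be added.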
#### LSTM (Long Short-Term Memory) Model

- Introduction to Recurrent Neural Networks (RNNs) and LSTM architecture.
- Preprocessing text data for LSTM input.
- Building and training an LSTM model for sentiment analysis.
- Fine-tuning the model and optimizing hyperparameters.
- Evaluating the LSTM model performance and comparing it with traditional methods.
--------
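
Preparing text for an LSTM usually means turning each document into a fixed-length sequence of integer ids. The sketch below shows one way to do that in plain Python. The helper names `build_word_index` and `encode_and_pad`, and the choice of reserving id 0 for padding and id 1 for unknown words, are assumptions for this example; deep learning libraries ship their own utilities for this step.

```python
from collections import Counter

def build_word_index(tokenized_docs):
    """Map each word to an integer id; 0 is reserved for padding, 1 for unknown words."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    # Most frequent words get the smallest ids; ties are broken alphabetically
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: idx for idx, word in enumerate(ordered, start=2)}

def encode_and_pad(doc, word_index, max_len):
    """Convert tokens to ids, truncate to max_len, and left-pad with zeros."""
    ids = [word_index.get(word, 1) for word in doc][:max_len]
    return [0] * (max_len - len(ids)) + ids

# Hypothetical tokenized documents
docs = [["wow", "loved", "place"], ["crust", "good"], ["loved", "crust"]]
word_index = build_word_index(docs)
padded = [encode_and_pad(doc, word_index, max_len=4) for doc in docs]

print(word_index)
print(padded)
```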

### Practical Applications and Case Studies

- Sentiment analysis on customer reviews for products or services.
- Analyzing social media sentiments for brand perception.
- Understanding public opinions on political or social issues through text analysis.

-----

### Libraries

In [77]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

import re
import nltk # NLP toolkit
# nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize
import string
from collections import Counter

# Download NLTK resources if not already downloaded
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('vader_lexicon')

import warnings
warnings.simplefilter('ignore', category=Warning, lineno=0, append=False)

### Preprocess Text Data

In [78]:
# Load the text
text_data = [
    "Wow... Loved this place.",
    "Crust is not good.",
    "Not tasty and the texture was just nasty.",
    "Stopped by during the late May bank holiday of...",
    "The selection on the menu was great and so wer..."
]

#### 1. Convert text to lowercase

In [79]:
text_lower = [text.lower() for text in text_data]
text_lower

['wow... loved this place.',
 'crust is not good.',
 'not tasty and the texture was just nasty.',
 'stopped by during the late may bank holiday of...',
 'the selection on the menu was great and so wer...']

#### 2. Remove Punctuation

In [80]:
text_punc = [text.translate(str.maketrans('', '', string.punctuation)) for text in text_lower]

#### 3. Tokenize

In [81]:
# Tokenize, using list comprehension
tokenized_data = [nltk.word_tokenize(text) for text in text_punc]
tokenized_data

[['wow', 'loved', 'this', 'place'],
 ['crust', 'is', 'not', 'good'],
 ['not', 'tasty', 'and', 'the', 'texture', 'was', 'just', 'nasty'],
 ['stopped', 'by', 'during', 'the', 'late', 'may', 'bank', 'holiday', 'of'],
 ['the', 'selection', 'on', 'the', 'menu', 'was', 'great', 'and', 'so', 'wer']]

#### 4. Stopwords

In [82]:
english_stopwords = set(stopwords.words('english'))

#### 5. Remove stopwords

In [83]:
# Remove stopwords
filtered_data = [[word for word in text if word not in english_stopwords] for text in tokenized_data]
filtered_data

[['wow', 'loved', 'place'],
 ['crust', 'good'],
 ['tasty', 'texture', 'nasty'],
 ['stopped', 'late', 'may', 'bank', 'holiday'],
 ['selection', 'menu', 'great', 'wer']]

#### 6. Lemmatization

In [84]:
filtered_data

[['wow', 'loved', 'place'],
 ['crust', 'good'],
 ['tasty', 'texture', 'nasty'],
 ['stopped', 'late', 'may', 'bank', 'holiday'],
 ['selection', 'menu', 'great', 'wer']]

In [85]:
# Instantiate Lemmatizer
lm = WordNetLemmatizer()


text = "cats are running on the roads"

# Tokenize the text
tokens = word_tokenize(text)

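# Lemmatize each token
# Note: lemmatize() treats every token as a noun by default (pos='n'),
# so verb forms such as 'running' and 'are' are left unchanged
lemmatized_tokens = [lm.lemmatize(token) for token in tokens]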
lemmatized_tokens

['cat', 'are', 'running', 'on', 'the', 'road']

In [88]:

# Lemmatize the text
lemmatized_text = [[lm.lemmatize(word) for word in text] for text in filtered_data]
lemmatized_text

[['wow', 'loved', 'place'],
 ['crust', 'good'],
 ['tasty', 'texture', 'nasty'],
 ['stopped', 'late', 'may', 'bank', 'holiday'],
 ['selection', 'menu', 'great', 'wer']]

#### 7. Stemming

In [91]:
stemmer = PorterStemmer()
stemmed_text = [[stemmer.stem(word) for word in text] for text in lemmatized_text]
stemmed_text

[['wow', 'love', 'place'],
 ['crust', 'good'],
 ['tasti', 'textur', 'nasti'],
 ['stop', 'late', 'may', 'bank', 'holiday'],
 ['select', 'menu', 'great', 'wer']]

#### Create a vocabulary

In [92]:
# Create a vocabulary
vocabulary = set([word for text in stemmed_text for word in text])
vocabulary

{'bank',
 'crust',
 'good',
 'great',
 'holiday',
 'late',
 'love',
 'may',
 'menu',
 'nasti',
 'place',
 'select',
 'stop',
 'tasti',
 'textur',
 'wer',
 'wow'}
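
With a vocabulary in hand, each document can be represented as a vector of word counts, which is the Bag of Words model described earlier. The following is a minimal sketch using only the standard library; the helper name `to_count_vector` is an assumption for this example, and the vocabulary is sorted so that every word keeps a fixed column position.

```python
from collections import Counter

# Stemmed documents from the step above
stemmed_text = [
    ["wow", "love", "place"],
    ["crust", "good"],
    ["tasti", "textur", "nasti"],
    ["stop", "late", "may", "bank", "holiday"],
    ["select", "menu", "great", "wer"],
]

# Sort the vocabulary so every word has a fixed column position
vocabulary = sorted({word for doc in stemmed_text for word in doc})

def to_count_vector(doc, vocabulary):
    """Return word counts for `doc`, one entry per vocabulary word."""
    counts = Counter(doc)
    return [counts[word] for word in vocabulary]

bow_matrix = [to_count_vector(doc, vocabulary) for doc in stemmed_text]

print(vocabulary)
print(bow_matrix[1])  # vector for ['crust', 'good']
```

Each row of `bow_matrix` has one entry per vocabulary word, so all documents end up with the same length regardless of how many tokens they contain.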

### Import the Dataset

In [96]:
df = pd.read_csv('./data/Restaurant_Reviews.tsv', delimiter = '\t')
df['Review']

0                               Wow... Loved this place.
1                                     Crust is not good.
2              Not tasty and the texture was just nasty.
3      Stopped by during the late May bank holiday of...
4      The selection on the menu was great and so wer...
                             ...                        
995    I think food should have flavor and texture an...
996                             Appetite instantly gone.
997    Overall I was not impressed and would not go b...
998    The whole experience was underwhelming, and I ...
999    Then, as if I hadn't wasted enough of my life ...
Name: Review, Length: 1000, dtype: object

In [99]:
# Function to preprocess text
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)

In [100]:
# Preprocess the data
data_series_preprocessed = df['Review'].apply(preprocess_text)
data_series_preprocessed

0                                        wow loved place
1                                             crust good
2                                    tasty texture nasty
3      stopped late may bank holiday rick steve recom...
4                            selection menu great prices
                             ...                        
995                    think food flavor texture lacking
996                              appetite instantly gone
997                      overall impressed would go back
998    whole experience underwhelming think well go n...
999    hadnt wasted enough life poured salt wound dra...
Name: Review, Length: 1000, dtype: object

*Polarity Score*: a number that summarizes how positive or negative a text is. VADER's `compound` score ranges from -1 (most negative) to 1 (most positive). Other tools use different scales, such as -100 to 100 or 0 to 1.

In [101]:
# Initialize sentiment analyzer
sid = SentimentIntensityAnalyzer()

# Perform sentiment analysis
sentiments = data_series_preprocessed.apply(
    lambda x: sid.polarity_scores(x)['compound']
)
sentiments

0      0.8271
1      0.4404
2     -0.5574
3      0.6908
4      0.6249
        ...  
995    0.0000
996    0.0000
997    0.4767
998    0.2732
999    0.3875
Name: Review, Length: 1000, dtype: float64

In [102]:
# Classify sentiment as positive, neutral, or negative based on the compound score
sentiment_class = sentiments.apply(
    lambda x: 'positive' if x > 0 else ('neutral' if x == 0 else 'negative')
)
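
The rule above labels any score other than exactly 0 as positive or negative. A convention often cited from the VADER documentation instead treats scores between -0.05 and 0.05 as neutral. Here is a small sketch of that variant; the function name `label_from_compound` and its `threshold` parameter are assumptions for this example.

```python
def label_from_compound(score, threshold=0.05):
    """Map a compound score in [-1, 1] to a sentiment label."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

# A few compound scores taken from the output above
for score in [0.8271, 0.4404, -0.5574, 0.0]:
    print(score, label_from_compound(score))
```

With a pandas Series, the same function could be applied as `sentiments.apply(label_from_compound)`.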

In [103]:
# Combine each review with its sentiment label
df_with_sentiment = pd.DataFrame({
    'Review': df['Review'],
    'Sentiment': sentiment_class
})

# Display the DataFrame with sentiment labels
df_with_sentiment

| | Review | Sentiment |
|---|---|---|
| 0 | Wow... Loved this place. | positive |
| 1 | Crust is not good. | positive |
| 2 | Not tasty and the texture was just nasty. | negative |
| 3 | Stopped by during the late May bank holiday of... | positive |
| 4 | The selection on the menu was great and so wer... | positive |
| ... | ... | ... |
| 995 | I think food should have flavor and texture an... | neutral |
| 996 | Appetite instantly gone. | neutral |
| 997 | Overall I was not impressed and would not go b... | positive |
| 998 | The whole experience was underwhelming, and I ... | positive |
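
Note that "Crust is not good." is labeled positive. A likely cause is visible in the preprocessed output: stopword removal reduced it to "crust good", because "not" is on the stopword list. The sketch below reproduces the effect with a small hand-written stopword set (an assumption for illustration; NLTK's list is much longer) and shows one possible fix, which is to keep negation words.

```python
import string

# Small hand-written stopword set for illustration only
STOP_WORDS = {"is", "the", "and", "was", "just", "this", "not"}
NEGATIONS = {"not", "no", "never"}

def preprocess(text, keep_negations=False):
    """Lowercase, strip punctuation, and drop stopwords (optionally keeping negations)."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    stop_words = STOP_WORDS - NEGATIONS if keep_negations else STOP_WORDS
    return " ".join(word for word in text.split() if word not in stop_words)

print(preprocess("Crust is not good."))                       # negation is lost
print(preprocess("Crust is not good.", keep_negations=True))  # negation survives
```

Rule-based scorers such as VADER are designed to work on raw text, so another option worth trying is to score the original reviews without stopword removal.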


### Naive Bayes Approach

### Prepare Data

In [104]:
dataset = pd.read_csv('./data/Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)
dataset

| | Review | Liked |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |
| ... | ... | ... |
| 995 | I think food should have flavor and texture an... | 0 |
| 996 | Appetite instantly gone. | 0 |
| 997 | Overall I was not impressed and would not go b... | 0 |
| 998 | The whole experience was underwhelming, and I ... | 0 |


In [105]:
# Build the stemmer and stopword set once, outside the loop
ps = PorterStemmer()
all_stopwords = set(stopwords.words('english'))
all_stopwords.discard('not')  # keep 'not' so negations survive preprocessing

corpus = []
for text in dataset['Review']:
    review = re.sub('[^a-zA-Z]', ' ', text)  # keep letters only
    review = review.lower().split()
    review = [ps.stem(word) for word in review if word not in all_stopwords]
    corpus.append(' '.join(review))

In [109]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features = 1500)

# Separate features from labels
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, -1].values

### Split

In [110]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)

### Build Model

In [111]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

In [112]:
# Build model
model = BernoulliNB()

# Fit model
model.fit(X_train, y_train)

### Evaluate

In [121]:
# Evaluate the model
y_pred = model.predict(X_test)

# Compute accuracy
acc_score = accuracy_score(y_test, y_pred)
print(f'Accuracy: {acc_score*100:.2f}%')

Accuracy: 74.00%
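
Accuracy alone can hide how a classifier behaves on each class, so precision, recall, and F1-score are worth reporting as well. The sketch below computes them from scratch to show what each metric counts. The function name `precision_recall_f1` and the demo labels are hypothetical; in the notebook you would pass `y_test` and `y_pred`, or use the equivalent functions in `sklearn.metrics`.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the `positive` class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels for illustration only
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 1, 0, 1, 0]

precision, recall, f1 = precision_recall_f1(y_true_demo, y_pred_demo)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")
```

Precision answers "of the reviews predicted positive, how many really were?", while recall answers "of the truly positive reviews, how many did the model find?". F1 is their harmonic mean.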


In [124]:
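# Make a prediction on new text
# Note: for best results, apply the same cleaning and stemming used to build `corpus`
new_text = ['perfectly fine for me']

new_vector = cv.transform(new_text)
prediction = model.predict(new_vector)

# Labels follow the `Liked` column: 1 = positive, 0 = negative
label = 'positive' if prediction[0] == 1 else 'negative'
print(f'Sentiment: {label}')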

Sentiment: positive
