## Sentiment Analysis with Python

This tutorial and its examples are based on "NLTK Sentiment Analysis Tutorial for Beginners" (https://www.datacamp.com/tutorial/text-analytics-beginners-nltk).

In [2]:
# If nltk is not installed, install it
!pip install nltk



In [3]:
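# import libraries

import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# download the NLTK resources used below (nothing is re-downloaded if they are up to date)
# note: recent NLTK releases may also require 'punkt_tab' for word_tokenize
for resource in ['vader_lexicon', 'punkt', 'stopwords', 'wordnet']:
    nltk.download(resource)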

In [4]:
df = pd.read_csv('amazon_reviews.csv')
df.head()

|  | reviewText | Positive |
|---|---|---|
| 0 | This is a one of the best apps acording to a b... | 1 |
| 1 | This is a pretty good version of the game for ... | 1 |
| 2 | this is a really cool game. there are a bunch ... | 1 |
| 3 | This is a silly game and can be frustrating, b... | 1 |
| 4 | This is a terrific game on any pad. Hrs of fun... | 1 |


In [5]:
df.shape

(20000, 2)

In [6]:
df_temp = df.sample(200)

## Text Preprocessing
The `preprocess_text` function defined at the end of this section preprocesses text in three steps: **tokenization**, **stop word removal**, and **lemmatization**. Each step is explained first.

### Tokenization

**Tokenization** in Natural Language Processing (NLP) is the process of breaking down text into smaller units called tokens. These tokens can be words, phrases, or even characters, depending on the level of granularity needed for the task.

**Steps in Tokenization:**
* Text Segmentation: The text is split based on spaces, punctuation, or other predefined rules. For example, the sentence "I love NLP!" might be tokenized into ["I", "love", "NLP", "!"].
* Handling Special Cases: Tokenizers also manage edge cases like contractions (e.g., "don't" into ["do", "n't"]), hyphenated words, and abbreviations.
* Normalization: Tokens might be converted to lowercase, numbers might be replaced with a special token, or other forms of text normalization might be applied.

**Types of Tokenization:**
* Word Tokenization: Splits text into words. Example: "Tokenization is fun" → ["Tokenization", "is", "fun"].
* Subword Tokenization: Breaks down words into meaningful subword units, useful in handling unknown words. Example: "unhappiness" → ["un", "happiness"].
* Character Tokenization: Splits text into individual characters. Example: "cat" → ["c", "a", "t"].
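
The word-level and character-level splits above can be sketched in plain Python. This is a minimal illustration built on a regular expression, not NLTK's `word_tokenize`, which applies more elaborate rules; the helper names `simple_word_tokenize` and `char_tokenize` are made up for this example.

```python
import re

def simple_word_tokenize(text):
    # Hypothetical helper: runs of word characters become tokens,
    # and each punctuation mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    # Character tokenization: one token per character.
    return list(text)

print(simple_word_tokenize("I love NLP!"))          # ['I', 'love', 'NLP', '!']
print(simple_word_tokenize("Tokenization is fun"))  # ['Tokenization', 'is', 'fun']
print(char_tokenize("cat"))                         # ['c', 'a', 't']
```

Note that this sketch splits "don't" into "don", "'", and "t", unlike the ["do", "n't"] split described above, which is one reason to prefer a dedicated tokenizer for real text.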

### Stop Word Removal
Stop word removal is the step in which common words that carry little meaningful information are filtered out of the text. These "stop words" include terms such as "the," "is," "in," and "and."

Stop words are typically removed because they appear frequently in text but don't contribute much to the understanding of the content. By removing them, the focus is placed on more significant words that carry the essence of the text.

For example, after stop word removal the sentence "The cat is sitting on the mat" might become "cat sitting mat."

**Benefits:**
* Reduces Text Size: It helps in reducing the dimensionality of the text, making processing more efficient.
* Improves Model Performance: By eliminating irrelevant words, models can focus on the words that actually contribute to the task, potentially improving performance in tasks like text classification, search engines, or sentiment analysis.
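
The cat-on-the-mat example above can be reproduced with a few lines of plain Python. This is only a sketch: the `STOP_WORDS` set is a tiny hand-picked list and `remove_stop_words` is a hypothetical helper, whereas the notebook itself relies on NLTK's much longer English list.

```python
# A tiny hand-picked stop word list, for illustration only;
# NLTK's English list (stopwords.words('english')) is much longer.
STOP_WORDS = {"the", "is", "on", "in", "and", "a", "an", "of", "to"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Compare case-insensitively, but keep the original token text.
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "The cat is sitting on the mat".split()
print(remove_stop_words(tokens))  # ['cat', 'sitting', 'mat']
```

Storing the stop words in a set keeps each membership check cheap, which matters once the filter runs over thousands of reviews.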

### Lemmatization
Lemmatization is a process in Natural Language Processing (NLP) that reduces words to their base or root form, known as the "lemma." Unlike stemming, which just cuts off word endings, lemmatization considers the word's meaning and context to transform it into a valid dictionary form.

**How It Works:**
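* Context-Aware: Lemmatization uses the context of the word (like its part of speech) to accurately reduce it to its lemma. For example, "running" tagged as a verb becomes "run," and "better" tagged as an adjective becomes "good." Without the right part of speech, a lemmatizer may leave such words unchanged.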
* Morphological Analysis: It analyzes the structure and morphology of the word to ensure that the lemma is a valid word in the language.

**Example:**
* "running" → "run"
* "was" → "be"
* "mice" → "mouse"

**Benefits:**
Lemmatization helps in reducing inflectional forms and grouping similar words together, which improves the accuracy of text processing tasks like information retrieval, text mining, and sentiment analysis.
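
To make the role of the part of speech concrete, here is a toy sketch in plain Python. It is not how WordNet-based lemmatization is implemented: `LEMMAS` is a hand-written lookup table and `toy_lemmatize` is a hypothetical helper that only mimics the idea of looking a word up together with its tag.

```python
# Toy lookup table keyed by (word, part of speech); a real lemmatizer
# consults a full dictionary such as WordNet instead.
LEMMAS = {
    ("running", "v"): "run",
    ("was", "v"): "be",
    ("mice", "n"): "mouse",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    # Fall back to the word itself when no entry matches,
    # which is also what happens when the POS tag is wrong.
    return LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("mice"))          # mouse
print(toy_lemmatize("running", "v"))  # run
print(toy_lemmatize("running"))       # running (treated as a noun)
print(toy_lemmatize("better", "a"))   # good
```

NLTK's `WordNetLemmatizer.lemmatize` likewise takes an optional `pos` argument that defaults to nouns, so calling it without a tag leaves verb forms such as "running" or "watching" unchanged.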


In [7]:
# create preprocess_text function

# Build these once, outside the function, instead of calling
# stopwords.words() for every token; a set also makes lookups fast.
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):

    # Tokenize the lowercased text
    tokens = word_tokenize(text.lower())

    # Remove stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]

    # Lemmatize the tokens (lemmatize() treats every token as a noun by default)
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

    # Join the tokens back into a string
    processed_text = ' '.join(lemmatized_tokens)

    return processed_text

In [8]:
# apply the function to the sampled DataFrame

df_temp['reviewTextProcessed'] = df_temp['reviewText'].apply(preprocess_text)
df_temp

|  | reviewText | Positive | reviewTextProcessed |
|---|---|---|---|
| 3453 | For all you Low Information people out there, ... | 1 | low information people , turn & amp ; # 34 ; p... |
| 3826 | CalenGoo is the killer app on my Android Table... | 1 | calengoo killer app android tablet idevices . ... |
| 11795 | I've been using this alarm clock for over a ye... | 1 | 've using alarm clock year thoroughly satisfie... |
| 15994 | I can't play it due to login messes and confli... | 0 | ca n't play due login mess conflicting game . ... |
| 9185 | i like watching music videos on this app, it a... | 1 | like watching music video app , also found new... |
| ... | ... | ... | ... |
| 13726 | I think this is a very effective app, i used l... | 1 | think effective app , used lot , easy use what... |
| 14883 | I really like this format. it is much easier t... | 1 | really like format . much easier type . best k... |
| 14983 | Plenty of other apps and games that require co... | 0 | plenty apps game require communication simply ... |
| 12969 | THis is a very nice bible. It is easy to use a... | 1 | nice bible . easy use find function work . tha... |


In [10]:
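# initialize NLTK sentiment analyzer

analyzer = SentimentIntensityAnalyzer()


# create get_sentiment function

def get_sentiment(text):

    # polarity_scores returns a dict with 'neg', 'neu', 'pos' and 'compound' scores
    scores = analyzer.polarity_scores(text)

    # label the text 1 (positive) if it has any positive score at all, else 0
    sentiment = 1 if scores['pos'] > 0 else 0

    return sentiment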
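
The rule above labels a review positive whenever VADER finds any positive wording at all. A commonly used alternative is to threshold the `compound` score, which summarizes the whole text on a scale from -1 to 1. The sketch below assumes the conventional cut-off of 0.05; the function name `label_from_compound` and the sample score dictionaries are made up for illustration and are not output from the analyzer.

```python
def label_from_compound(scores, threshold=0.05):
    # scores: a dict shaped like the result of analyzer.polarity_scores(),
    # with 'neg', 'neu', 'pos' and 'compound' keys.
    return 1 if scores["compound"] >= threshold else 0

# Hand-written example scores (not real VADER output).
mixed = {"neg": 0.40, "neu": 0.50, "pos": 0.10, "compound": -0.60}
happy = {"neg": 0.00, "neu": 0.40, "pos": 0.60, "compound": 0.80}

print(label_from_compound(mixed))  # 0, although 'pos' > 0
print(label_from_compound(happy))  # 1
```

Whether this stricter rule suits the review data better would need to be checked against the `Positive` labels; it is shown here only as a possible variation.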



In [12]:
# apply get_sentiment function to the original reviewText column (not reviewTextProcessed)

df_temp['sentiment'] = df_temp['reviewText'].apply(get_sentiment)

df_temp

|  | reviewText | Positive | reviewTextProcessed | sentiment |
|---|---|---|---|---|
| 3453 | For all you Low Information people out there, ... | 1 | low information people , turn & amp ; # 34 ; p... | 0 |
| 3826 | CalenGoo is the killer app on my Android Table... | 1 | calengoo killer app android tablet idevices . ... | 1 |
| 11795 | I've been using this alarm clock for over a ye... | 1 | 've using alarm clock year thoroughly satisfie... | 1 |
| 15994 | I can't play it due to login messes and confli... | 0 | ca n't play due login mess conflicting game . ... | 0 |
| 9185 | i like watching music videos on this app, it a... | 1 | like watching music video app , also found new... | 1 |
| ... | ... | ... | ... | ... |
| 13726 | I think this is a very effective app, i used l... | 1 | think effective app , used lot , easy use what... | 1 |
| 14883 | I really like this format. it is much easier t... | 1 | really like format . much easier type . best k... | 1 |
| 14983 | Plenty of other apps and games that require co... | 0 | plenty apps game require communication simply ... | 0 |
| 12969 | THis is a very nice bible. It is easy to use a... | 1 | nice bible . easy use find function work . tha... | 1 |
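
A quick way to judge the predictions is to compare the `sentiment` column with the `Positive` labels. The sketch below does this in plain Python for only the nine rows visible in the output above, so the resulting figure says nothing reliable about the full 200-row sample.

```python
# Labels copied from the nine rows visible in the table above.
actual    = [1, 1, 1, 0, 1, 1, 1, 0, 1]  # 'Positive' column
predicted = [0, 1, 1, 0, 1, 1, 1, 0, 1]  # 'sentiment' column

# Count the rows where the prediction agrees with the label.
matches = sum(a == p for a, p in zip(actual, predicted))
accuracy = matches / len(actual)

print(f"{matches} of {len(actual)} rows agree (accuracy {accuracy:.2f})")
# 8 of 9 rows agree (accuracy 0.89)
```

On the full sample, the same comparison could be written as `(df_temp['sentiment'] == df_temp['Positive']).mean()`.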
