# NLP - Sentiment Analysis
Sentiment analysis is also known as opinion mining. It lets us determine whether a piece of text expresses positive or negative sentiment. Businesses use it to reduce churn, increase product sales, build brand awareness, and analyze customer reviews in order to improve their products.

In this project, I will create an NLP machine learning model to predict whether a new incoming customer review is positive or negative. I will use Amazon’s food review dataset, available on [Kaggle](https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews).

## Feature extraction and Text pre-processing
Machines cannot understand English or any other text data by default. Text needs special preparation before you can give it to a model and predict something from it. That preparation includes several steps, such as removing stop words, correcting spelling mistakes, removing meaningless words, removing rare words, and more.

The first step in preparing text data is feature extraction and basic text pre-processing, which consists of the following steps:

- Removing Punctuations
- Removing HTML tags
- Special Characters removal
- Removing AlphaNumeric words
- Tokenization
- Removal of Stopwords
- Lower casing
- Lemmatization

### Load our dataset

Before we apply feature extraction and text pre-processing, let's first load our dataset using the pandas library.

In [1]:
"""
NLP sentiment analysis in python
"""
 
import pandas as pd
 
# Importing the dataset
dataset = pd.read_csv(r'C:\Users\user\Datasets\amazonreviews.csv')

Let's view the data.

In [2]:
dataset.head()

| Unnamed: 0 | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |


Let’s import the libraries that we will use later for the basic text pre-processing.

=> We will import bs4 for removing HTML tags from the text.

=> The re library will help in removing alphanumeric words and special characters.

=> The nltk library provides the stop-word list and the lemmatizer that we will use later in the program.

In [3]:
from bs4 import BeautifulSoup
import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


### Removing Punctuations
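Next, we will start with removing punctuation from the text. For humans it adds value, but for the machine it isn’t really useful. Apostrophes need special care: before they are stripped, contractions such as "won't" or "isn't" are expanded into full words so that their meaning is not lost.

=> The re library will be helpful to expand these contractions here.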

In [4]:
def removeApostrophe(review):
    # Expand contractions; each substitution works on the result of the previous one
    phrase = re.sub(r"won't", "will not", review)
    phrase = re.sub(r"can\'t", "can not", phrase)
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase
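
As a quick check of the contraction handling, the sketch below applies a few of the substitutions in sequence, each one operating on the output of the previous one. The helper name `expand_contractions`, the `CONTRACTIONS` table, and the sample sentence are illustrative assumptions, not part of the notebook.

```python
import re

# Hypothetical table for illustration: (pattern, replacement) pairs applied in order.
CONTRACTIONS = [
    (r"won't", "will not"),
    (r"can't", "can not"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'ve", " have"),
    (r"'m", " am"),
]

def expand_contractions(text):
    # Feed the result of each substitution into the next one.
    for pattern, replacement in CONTRACTIONS:
        text = re.sub(pattern, replacement, text)
    return text

sample = "I won't buy this again, it isn't fresh and they're overpriced"
expanded = expand_contractions(sample)
print(expanded)
```

The specific patterns ("won't", "can't") come first because the generic "n't" rule would otherwise turn them into "wo not" and "ca not".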

### Removing HTML tags
When text data comes from web scraping, it is very common to end up with HTML tags in your dataset. HTML tags mark up the structure and formatting of web pages and are not helpful in model building.

=> Here we will use the bs4 library to remove HTML tags.

=> In general, removing HTML tags is a good practice to follow.

In [5]:
def removeHTMLTags(review):
    soup = BeautifulSoup(review, 'lxml')
    return soup.get_text()
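
If bs4 or the lxml parser is not installed, the same idea can be sketched with Python's built-in `html.parser` module. This is a swapped-in alternative for illustration, not the notebook's method; the names `_TextExtractor` and `strip_html_stdlib` and the sample string are assumptions.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for the text found between tags.
        self.parts.append(data)

def strip_html_stdlib(review):
    parser = _TextExtractor()
    parser.feed(review)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

sample = "Great taffy<br /> at a <b>great</b> price."
cleaned = strip_html_stdlib(sample)
print(cleaned)
```

Tags are dropped without inserting a separator, so words separated only by a tag such as `<br />` end up joined together.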

### Special Characters removal
The dataset may contain special characters that are not helpful in NLP. The best example is the use of hashtags in comments. Note that the pattern below keeps only the letters a-z and A-Z, so digits and punctuation are replaced with spaces as well.

=> To remove Special Characters we will use the re library.

In [6]:
def removeSpecialChars(review):
    return re.sub('[^a-zA-Z]', ' ', review)
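
To see what this pattern does in practice, here is a small sketch on a made-up review. The sample string and the extra whitespace-collapsing step are assumptions added for readability.

```python
import re

sample = "Love it!!! #yummy :)"

# Same pattern as removeSpecialChars: every non-letter becomes a space.
letters_only = re.sub(r"[^a-zA-Z]", " ", sample)

# Collapse the runs of spaces left behind so the result is easier to read.
normalized = " ".join(letters_only.split())
print(normalized)
```

In the notebook the extra spaces are harmless, because the later `split()` call in the cleaning function discards them.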

### Removing AlphaNumeric words
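Again, alphanumeric words (tokens that contain digits, such as product codes) don’t help in building a predictive model. These words usually don’t carry sentiment, so it’s better to get rid of them as well.

=> To remove alphanumeric words we will again use the re library; the pattern matches any whitespace-delimited token that contains at least one digit.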

In [7]:
def removeAlphaNumericWords(review):
    return re.sub(r"\S*\d\S*", "", review).strip()
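
A short sketch of the effect on a made-up sentence follows; the sample text is an assumption. Every token containing a digit disappears, including plain numbers.

```python
import re

sample = "Ordered 2 packs of B001E4KFG0 on the 12th"

# \S*\d\S* matches a whole whitespace-delimited token that contains a digit.
without_digit_tokens = re.sub(r"\S*\d\S*", "", sample).strip()

# Splitting on whitespace hides the double spaces left by the removed tokens.
tokens = without_digit_tokens.split()
print(tokens)
```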

### Tokenization, Removing Stopwords, Lowercasing, and Lemmatization
In this section we will perform tokenization, stop-word removal, lowercasing, and lemmatization. Tokenization means splitting the text into a list of words; the other pre-processing steps, such as stop-word removal, operate on that list. Stop words are words that occur very frequently in text but carry little meaning, for example is, am, and are, so they are removed. Converting words to lower case is one of the most important steps, because it prevents the same word from being counted as several different words when it appears in different cases. Lemmatization reduces each word to its dictionary form (lemma) by removing inflectional endings, using vocabulary and morphological analysis.

=> We will create a doTextCleaning() function, which will use the functions created above.

=> In this function we will also perform tokenization, stop-word removal, lowercasing, and lemmatization.

In [8]:
# Build the stop-word set and the lemmatizer once, not once per word
stop_words = set(stopwords.words('english'))
lmtzr = WordNetLemmatizer()

def doTextCleaning(review):
    review = removeHTMLTags(review)
    review = removeApostrophe(review)
    review = removeAlphaNumericWords(review)
    review = removeSpecialChars(review)

    review = review.lower()  # Lower casing
    review = review.split()  # Tokenization

    # Removing Stopwords and Lemmatization (each word is lemmatized as a verb, pos 'v')
    review = [lmtzr.lemmatize(word, 'v') for word in review if word not in stop_words]

    review = " ".join(review)
    return review
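
The lowercasing, tokenization, and stop-word steps can be sketched without NLTK as follows. The tiny `SAMPLE_STOPWORDS` set is a hand-written stand-in for NLTK's much longer English list, the function name `tokenize_and_filter` is hypothetical, and lemmatization is left out because it needs the WordNet data.

```python
# Illustrative subset only; NLTK's English stop-word list is much longer.
SAMPLE_STOPWORDS = {"is", "am", "are", "the", "a", "and", "this"}

def tokenize_and_filter(text, stop_words=SAMPLE_STOPWORDS):
    # Lowercase first so "GREAT" and "great" become the same token,
    # then split on whitespace to get the list of tokens.
    tokens = text.lower().split()
    # Keep only the tokens that are not stop words.
    return [token for token in tokens if token not in stop_words]

tokens = tokenize_and_filter("This taffy is GREAT and the price is great")
print(tokens)
```

After lowercasing, both spellings of "great" map to one token, which is exactly the duplicate reduction described above.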

### Creating Document Corpus and Advanced text preprocessing
In this section we will make use of all the functions that we have created so far and perform advanced text preprocessing on the reviews.

Now we will create the document corpus on which we will apply the Bag of Words model. The document corpus is the collection of all cleaned reviews in the dataset.

=> In the code below we create a corpus list and loop over our dataset.

=> Inside the for loop we call the doTextCleaning() function, which returns the cleaned review text. We then append the cleaned and preprocessed text to the corpus list.

In [9]:
from tqdm import tqdm

In [None]:
corpus = []   
for index, row in tqdm(dataset.iterrows()):
    review = doTextCleaning(row['Text'])
    corpus.append(review)

333401it [1:48:04, 48.31it/s] 

The next step is to perform advanced text preprocessing on the reviews, which converts the reviews into numeric vectors that we can use to create our machine learning model.

Using the document corpus, we create a Bag of Words model with n-grams of up to three words. The Bag of Words model builds a vocabulary (a dictionary of terms) from the whole corpus. It then converts each review into a vector in which every vocabulary term is a separate dimension holding the number of times that term occurs in the review.

N-grams are sequences of n consecutive words, and each n-gram becomes one dimension of the vector. Uni-grams (single words) are the default in the BoW model, but you can specify which n-grams should be used. Here we will use uni-grams, bi-grams, and tri-grams together.

=> First things first, import the CountVectorizer transform from the scikit-learn library.

=> We create the transform using CountVectorizer with n-grams up to tri-grams.

=> You can specify which n-grams to use with the ngram_range parameter. ngram_range=(1,3) includes uni-grams, bi-grams, and tri-grams; for tri-grams only, use ngram_range=(3,3).

=> At the end, we have the feature matrix X and the label vector y to create the machine learning model.

In [None]:
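# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer

# Creating the transform with uni-, bi- and tri-grams.
# Note: max_features=2 keeps only the two most frequent n-grams; raise it for a useful model.
cv = CountVectorizer(ngram_range=(1,3), max_features = 2)

X = cv.fit_transform(corpus).toarray()
# Labels come from the Score column (column 6 here is HelpfulnessDenominator).
# Assumption: 4-5 stars count as positive (1), 1-3 stars as negative (0).
y = (dataset['Score'] > 3).astype(int).values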
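
To make the Bag of Words idea with n-grams concrete, here is a plain-Python sketch on a two-document toy corpus. It illustrates the concept only and is not how CountVectorizer is implemented; the function name `word_ngrams` and the sample documents are assumptions.

```python
from collections import Counter

def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous n-grams with min_n <= n <= max_n, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

docs = ["great taffy great price", "taste bad"]
counts = [Counter(word_ngrams(doc.split())) for doc in docs]

# The vocabulary is the sorted union of every n-gram seen in the corpus.
vocabulary = sorted(set().union(*counts))

# Each document becomes a row of counts, one column per vocabulary entry.
matrix = [[c[term] for term in vocabulary] for c in counts]

print(vocabulary)
print(matrix)
```

Even this toy corpus of six words yields eleven dimensions, which shows how quickly the vocabulary grows with n-grams and why a cap such as max_features is useful.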

### Building NLP sentiment analysis Machine learning model
The last part of the NLP sentiment analysis is to create the machine learning model. In this project, I will use the Naive Bayes classification model.

So far we have the feature matrix X and the label vector y. The first step in creating a machine learning model is to split the dataset into a training set and a test set. Using the training set I will create a Naive Bayes classification model. Then, with the test set, I can check the performance of the Naive Bayes classification model.

=> In the code below, I first import the train_test_split function to split the data into a training set and a test set.

=> I then import the GaussianNB class to create a Naive Bayes classification model.

=> After creating the Naive Bayes classifier, I fit it to the training set.

In [None]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB

# Creating Naive Bayes classifier
classifier = GaussianNB()

# Fitting the training set into the Naive Bayes classifier
classifier.fit(X_train, y_train)
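
The test-set check mentioned above can be sketched as follows. The code is plain Python so that it runs without the trained model; the label lists are made-up placeholders, and in the notebook the predictions would come from `classifier.predict(X_test)`. The helper names `accuracy` and `confusion_counts` are hypothetical; scikit-learn's `sklearn.metrics` module provides `accuracy_score` and `confusion_matrix` for the same purpose.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for t, p in zip(y_true, y_pred):
        if p == positive:
            counts["tp" if t == positive else "fp"] += 1
        else:
            counts["fn" if t == positive else "tn"] += 1
    return counts

# Placeholder labels for illustration; 1 = positive review, 0 = negative review.
y_true_demo = [1, 0, 1, 1, 0, 1]
y_pred_demo = [1, 0, 0, 1, 1, 1]

print(accuracy(y_true_demo, y_pred_demo))
print(confusion_counts(y_true_demo, y_pred_demo))
```

Looking at the confusion counts as well as the accuracy is worthwhile here, because review datasets tend to contain far more positive than negative reviews.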

### NLP sentiment analysis In Action
Now that our model is ready to predict sentiment from reviews, why not write some code to test it? Doing so shows how well our model predicts, which is our end goal as well. The steps are straightforward:

=> First we create the predictNewReview() function, which prompts for a review on the command line and then uses the classifier created above to predict the sentiment.

=> As soon as predictNewReview() receives a new review, it runs the full text cleaning process using the doTextCleaning() function.

=> Once the text cleaning is done, we use the fitted BoW transform to convert the review into a numeric vector.

=> After the conversion, the Naive Bayes classification model predicts the result using the classifier.predict() method.

In [None]:
#Predict sentiment for new Review
def predictNewReview():
    newReview = input("Type the Review: ")
    
    if newReview =='':
        print('Invalid Review')  
    else:
        newReview = doTextCleaning(newReview)
        reviewVector = cv.transform([newReview]).toarray()  
        prediction =  classifier.predict(reviewVector)
        if prediction[0] == 1:
            print( "Positive Review" )
        else:        
            print( "Negative Review")