## Sentiment Analysis using NLP (using a **naïve Bayes classifier**)


### Import the following:

In [1]:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

### Define a function to extract features:

In [2]:
def extract_features(word_list):
    # Bag-of-words features: each word that occurs maps to True
    return {word: True for word in word_list}
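To make the feature representation concrete, here is a small sketch of what `extract_features` returns for a short token list. The function is repeated so the snippet runs on its own, and the sample tokens are made up for illustration rather than taken from the corpus. Note that a repeated word collapses into a single key, so the features record only whether a word is present, not how often it occurs.

```python
def extract_features(word_list):
    # Map every token to True: presence only, no counts
    return {word: True for word in word_list}

# Hypothetical token list, not taken from the corpus
sample_tokens = ["a", "great", "great", "film"]
features = extract_features(sample_tokens)
print(features)  # {'a': True, 'great': True, 'film': True}
```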

### For training data, we will use movie reviews in NLTK:

In [3]:
import nltk
nltk.download('movie_reviews')

# Load the file IDs of the positive and negative reviews
positive_fileids = movie_reviews.fileids('pos')
negative_fileids = movie_reviews.fileids('neg')

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


### Extracting features and labeling each review as positive or negative:

In [4]:
features_positive = [(extract_features(movie_reviews.words(fileids=[f])), 'Positive')
                     for f in positive_fileids]

features_negative = [(extract_features(movie_reviews.words(fileids=[f])), 'Negative')
                     for f in negative_fileids]

### Computing the split points for the training and test data:

In [5]:
# Split the data into train and test (80/20)
threshold_factor = 0.8
threshold_positive = int(threshold_factor * len(features_positive))
threshold_negative = int(threshold_factor * len(features_negative))

### Building the training and test sets:

In [6]:
features_train = features_positive[:threshold_positive] + features_negative[:threshold_negative]
features_test = features_positive[threshold_positive:] + features_negative[threshold_negative:]  
print("\nNumber of training datapoints:", len(features_train))
print("Number of test datapoints:", len(features_test))


Number of training datapoints: 1600
Number of test datapoints: 400
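The split above takes the first 80% of each class in whatever order the corpus lists its files. A common variation is to shuffle each class with a fixed seed before slicing, so that the held-out 20% does not depend on file order. The sketch below illustrates the idea on placeholder data; `split_per_class`, the seed value, and the placeholder lists are assumptions made for illustration and are not part of the notebook.

```python
import random

def split_per_class(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of `items` and split it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cutoff = int(train_fraction * len(shuffled))
    return shuffled[:cutoff], shuffled[cutoff:]

# Placeholder stand-ins for features_positive / features_negative
positive = [({"word%d" % i: True}, "Positive") for i in range(1000)]
negative = [({"word%d" % i: True}, "Negative") for i in range(1000)]

pos_train, pos_test = split_per_class(positive)
neg_train, neg_test = split_per_class(negative)
train = pos_train + neg_train
test = pos_test + neg_test
print(len(train), len(test))  # 1600 400
```

Splitting each class separately, as the notebook does, keeps the positive/negative balance identical in the training and test sets.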


### We will use a **Naive Bayes classifier**. 
#### Define the object and train it:

In [7]:
# Train a Naive Bayes classifier
classifier = NaiveBayesClassifier.train(features_train)
print("\nAccuracy of the classifier:", nltk.classify.util.accuracy(classifier, features_test))


Accuracy of the classifier: 0.735
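The accuracy figure is the fraction of test reviews whose predicted label matches the true label. With 400 test reviews, 0.735 corresponds to 294 correct predictions. The sketch below spells out that calculation on made-up labels; `accuracy_from_labels` is a hypothetical helper written for this illustration, not an NLTK function.

```python
def accuracy_from_labels(gold, predicted):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)

# Made-up labels for illustration only
gold = ["Positive", "Positive", "Negative", "Negative"]
predicted = ["Positive", "Negative", "Negative", "Negative"]
score = accuracy_from_labels(gold, predicted)
print(score)  # 0.75
```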


### The classifier object records which words were most informative during training, that is, the words that carry the most weight in deciding whether a review is classified as positive or negative. Let’s print them out:

In [8]:
print("\nTop 10 most informative words:")
# Each item is a (feature name, feature value) pair; print only the word
for word, value in classifier.most_informative_features(10):
    print(word)


Top 10 most informative words:
outstanding
insulting
vulnerable
ludicrous
uninvolving
astounding
avoids
fascination
anna
affecting
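NLTK documents the informativeness of a feature as the highest value of P(feature=value | label) across labels divided by the lowest. The stand-alone sketch below illustrates that ratio with hypothetical document counts and a simple additive smoothing term. It is a simplified illustration, not NLTK's implementation, and `presence_probability` and `informativeness` are names invented here.

```python
def presence_probability(docs_with_word, total_docs, smoothing=0.5):
    # Smoothed estimate of P(word present | label); two outcomes: present or absent
    return (docs_with_word + smoothing) / (total_docs + 2 * smoothing)

def informativeness(counts, totals):
    """Ratio of the largest to the smallest P(word present | label)."""
    probs = [presence_probability(counts[label], totals[label]) for label in totals]
    return max(probs) / min(probs)

# Hypothetical document counts, not measured from movie_reviews
totals = {"Positive": 800, "Negative": 800}
word_counts = {
    "movie": {"Positive": 700, "Negative": 710},
    "outstanding": {"Positive": 60, "Negative": 5},
}

ranked = sorted(word_counts,
                key=lambda w: informativeness(word_counts[w], totals),
                reverse=True)
for word in ranked:
    print(word, round(informativeness(word_counts[word], totals), 2))
```

A word such as "movie" that appears about equally often in both classes gets a ratio near 1 and ranks low, while a word concentrated in one class ranks high. This also explains why an incidental word like "anna" can appear in the list: a rare word that happens to occur in only one class produces a large ratio.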


### Creating a sample input sentence:

In [9]:
# Sample input reviews
input_reviews = ["It is an awful movie"]

### Run the classifier on that input sentence and obtain the predictions:

In [10]:
print("\nPredictions:")
for review in input_reviews:
    print("\nReview:", review)
    # Probability distribution over the labels for this review
    probdist = classifier.prob_classify(extract_features(review.split()))
    # The most probable label
    pred_sentiment = probdist.max()


Predictions:

Review: It is an awful movie


### Printing the predicted sentiment and its probability:

In [11]:
print("Predicted sentiment:", pred_sentiment)
print("Probability:", round(probdist.prob(pred_sentiment), 2))

Predicted sentiment: Negative
Probability: 0.85
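The prediction step tokenizes the review with `review.split()`, which keeps capitalization and any attached punctuation, so a token such as `Awful!` would not match a corpus token `awful`. Assuming the training tokens are lowercase with punctuation split off (worth checking against the output of `movie_reviews.words()`), a slightly more forgiving tokenizer could be sketched as follows. `simple_tokenize` is a hypothetical helper, and `extract_features` is repeated so the snippet runs on its own.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and return its runs of letters or digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def extract_features(word_list):
    # Same bag-of-words representation as in the notebook
    return {word: True for word in word_list}

review = "It is an AWFUL movie!"
tokens = simple_tokenize(review)
print(tokens)  # ['it', 'is', 'an', 'awful', 'movie']
print(extract_features(tokens))
```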
