<a href="https://colab.research.google.com/github/WybeTuring/DataScience-Project2/blob/main/NLP_Lab_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Sentiment Classifier**

This is an attempt to write a crude sentiment analysis function. The basic idea is to keep two bags of words: words that are likely to appear in a positive review, and words that are likely to appear in a negative review. The task is then to count the occurrences of words labeled as positive and of words labeled as negative. If positive words occur more often than negative words, the review is classified as positive. If negative words occur more often than positive words, the review is classified as negative. If the two counts are equal, the review is classified as neutral. One way to extend this approach to numbered ratings (-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5) would be to quantify the number of negative and positive words relative to the overall number of words in the review.
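The numbered-rating idea could be sketched roughly as follows. This is only an illustration and not part of the notebook: the function name `rating`, the `saturation` parameter (the share of indicator words at which the rating reaches its extreme), and the toy word sets are all assumptions.

```python
def rating(good_words, bad_words, review, saturation=0.25):
    """Map a review to an integer rating in [-5, 5].

    `saturation` is an assumed tuning knob: the net share of indicator
    words (relative to all words) at which the rating hits +5 or -5.
    """
    words = review.lower().split()
    if not words:
        return 0
    positives = sum(1 for w in words if w in good_words)
    negatives = sum(1 for w in words if w in bad_words)
    # Net indicator count relative to the overall number of words.
    ratio = (positives - negatives) / len(words)
    scaled = round(5 * ratio / saturation)
    # Clamp to the rating scale.
    return max(-5, min(5, scaled))


# Toy word sets, chosen only for this example.
good = {"great", "fun"}
bad = {"never"}
print(rating(good, bad, "a great film about a quiet town in the north"))
print(rating(good, bad, "never again"))
```

A smaller `saturation` makes the rating more sensitive to a single indicator word, which may suit short snippets better than long reviews.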

The proposed steps are as follows:


1.   Read the reviews from a text file
2.   Create a dictionary of words that can potentially be found in a good review, and of words that can potentially be found in a bad review. Where it makes sense, care is taken to use only the lemma of each word, so that it becomes more likely that variations, and not only exact occurrences, can be detected.
3.   Prepare the data such that each review is on a line, and convert the whole document to lower case.
4.   Pass each line to a function that tries to classify the text.
5.   Write the review and its classification to a new file.
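The read, normalize, classify, and write steps above might fit together as in the sketch below. It is a minimal illustration under stated assumptions: `classify_file` and `toy_label` are hypothetical names, the tab-separated output layout is an arbitrary choice, and in-memory `io.StringIO` buffers stand in for the real input and output files.

```python
import io


def toy_label(review):
    """Placeholder classifier (hypothetical): +1, -1, or 0 from two fixed words."""
    words = review.split()
    if "great" in words:
        return 1
    if "never" in words:
        return -1
    return 0


def classify_file(infile, outfile, label_fn):
    """Read one review per line, label it, and write 'label<TAB>review' lines."""
    count = 0
    for line in infile:
        review = line.strip().lower()  # step 3: one review per line, lower case
        if not review:
            continue  # skip blank lines
        outfile.write(f"{label_fn(review)}\t{review}\n")  # steps 4 and 5
        count += 1
    return count


# In-memory stand-ins for the input and output text files.
source = io.StringIO("A GREAT film\n\nNever again\nJust a film\n")
target = io.StringIO()
written = classify_file(source, target, toy_label)
print(written)
print(target.getvalue())
```

Passing the labelling function as an argument keeps the file handling separate from the scoring logic, so a different classifier can be swapped in without touching the loop.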



In [12]:
import pandas as pd 

In [13]:
good_indicators = ["elaborate", "effective", "like", "fun", "rare", "honest", "keen", "absolute", "refreshing", "different", "worth", "tender",
                   "great", "compelling", "illuminate", "beautiful", "masterpiece", "ripe", "beauty", "true", "provocative", "thoughtful",
                   "top", "lovely", "amazing", "amusing", "nice", "dedicated", "ideal", "idealistic", "move", "guaranteed", "masterful", "unique",
                   "cute", "more", "distinguished", "distinctive", "master"]
bad_indicators = ["not", "disturbing", "frightening", "lacks", "occasionally", "forgettable", "unfortunately", "static", "never"]
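Because the indicator lists hold base forms such as "keen" and "frightening", an exact comparison will not pick up variations like "keenly" or "frighteningly". One crude way to catch them is prefix matching, sketched below. The helper name `matches_indicator` is an assumption, and this is not what the notebook's scoring functions do. Note the trade-off shown in the last example: short indicators such as "top" also match unrelated words.

```python
def matches_indicator(word, indicators):
    """Return True if `word` equals an indicator or starts with one."""
    return any(word.startswith(stem) for stem in indicators)


print(matches_indicator("keenly", ["keen", "honest"]))        # variation of "keen"
print(matches_indicator("frighteningly", ["frightening"]))    # variation of "frightening"
print(matches_indicator("dull", ["keen", "honest"]))          # no indicator matches
print(matches_indicator("topic", ["top"]))                    # false positive: "topic" starts with "top"
```

A proper lemmatizer or stemmer would avoid such false positives, at the cost of an extra dependency.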

The text file containing the reviews is hosted in a GitHub repository, so we read it from there.

In [5]:
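url = 'https://raw.githubusercontent.com/WybeTuring/DataScience-Project2/main/original_rt_snippets.txt'
# sep='\n' and error_bad_lines are no longer accepted by recent pandas releases. With a tab separator
# (assuming the reviews contain no tab characters) and quoting disabled, each line is read as one field.
df = pd.read_csv(url, sep='\t', quoting=3)  # 3 == csv.QUOTE_NONE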

In [21]:
df.shape
df.columns = ["Review"]
df.head()

Unnamed: 0,Review
0,The gorgeously elaborate continuation of ``The...
1,Effective but too-tepid biopic
2,If you sometimes like to go to the movies to h...
3,"Emerges as something rare, an issue movie that..."
4,The film provides some great insight into the ...


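This function counts the words in a sentence that appear in the list of positive indicators. Words are obtained by splitting on whitespace, and the comparison is exact and case-sensitive.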

In [14]:
def positive_score(post_dict, sentence):
  words = sentence.split()
  score = 0
  for k in words:
    if k in post_dict:
      score += 1
  return score

This function counts the words in a sentence that appear in the list of negative indicators, and returns that count as a negative number.

In [19]:
def negative_score(neg_dict, sentence):
  words = sentence.split()
  score = 0
  for k in words:
    if k in neg_dict:
      score += -1
  return score
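Both scoring functions split on whitespace only, so a token such as "fun," (with its trailing comma) or a capitalized "Effective" does not match the lower-case entries in the lists. A possible normalization step is sketched below. The names `tokenize` and `count_matches` are assumptions introduced for illustration and are not used elsewhere in the notebook.

```python
import string


def tokenize(sentence):
    """Lower-case a sentence and strip punctuation from both ends of each token."""
    tokens = (w.strip(string.punctuation) for w in sentence.lower().split())
    # Drop tokens that were pure punctuation, such as "--".
    return [t for t in tokens if t]


def count_matches(indicators, sentence):
    """Count the normalized tokens that appear in `indicators`."""
    lookup = set(indicators)  # set membership is faster than scanning a list
    return sum(1 for t in tokenize(sentence) if t in lookup)


print(tokenize("Effective but too-tepid biopic"))
print(count_matches(["like", "fun"], "If you sometimes like to go to the movies to have fun, Wasabi is a good place to start."))
```

Hyphens inside a token are kept, so "too-tepid" stays a single token. Whether to split such compounds is a separate design choice.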

With these two functions in place, we apply both to each sentence to get its positive and negative scores. Based on the sign of their sum, we declare the sentence positive, negative, or neutral.

The final function returns +1 for a positive sentence, -1 for a negative sentence, and 0 for a neutral one.

In [16]:
def classifier(post_dict, neg_dict, sentence):
  score = positive_score(post_dict, sentence) + negative_score(neg_dict, sentence)
  if score < 0:
    return -1
  elif score > 0:
    return +1
  else:
    return 0

In [17]:
sen = ["The Rock is destined to be the 21st Century's new ``Conan'' and that he's going to make a splash even greater than Arnold Schwarzenegger, Jean-Claud Van Damme or Steven Segal.",
       "The gorgeously elaborate continuation of ``The Lord of the Rings'' trilogy is so huge that a column of words cannot adequately describe co-writer/director Peter Jackson's expanded vision of J.R.R. Tolkien's Middle-earth.",
       "If you sometimes like to go to the movies to have fun, Wasabi is a good place to start.",
       "Emerges as something rare, an issue movie that's so honest and keenly observed that it doesn't feel like one.",
       "The film provides some great insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game.",
       "A disturbing and frighteningly evocative assembly of imagery and hypnotic music composed by Philip Glass."]

Examples 

In [20]:
for s in sen:
  n = classifier(good_indicators, bad_indicators, s)
  print(s)
  if n > 0:
    print("Positive review")
  elif n < 0:
    print("Negative review")
  else:
    print("Neutral review")

The Rock is destined to be the 21st Century's new ``Conan'' and that he's going to make a splash even greater than Arnold Schwarzenegger, Jean-Claud Van Damme or Steven Segal.
Neutral review
The gorgeously elaborate continuation of ``The Lord of the Rings'' trilogy is so huge that a column of words cannot adequately describe co-writer/director Peter Jackson's expanded vision of J.R.R. Tolkien's Middle-earth.
Positive review
If you sometimes like to go to the movies to have fun, Wasabi is a good place to start.
Positive review
Emerges as something rare, an issue movie that's so honest and keenly observed that it doesn't feel like one.
Positive review
The film provides some great insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game.
Positive review
A disturbing and frighteningly evocative assembly of imagery and hypnotic music composed by Philip Glass.
Negative review


In [25]:
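# Score each review separately on lower-cased text (step 3); str(df['Review']) stringified the whole column, so every row got one shared label.
df['Label'] = df['Review'].astype(str).str.lower().apply(lambda review: classifier(good_indicators, bad_indicators, review))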

In [28]:
df.head()

Unnamed: 0,Review,Label
0,The gorgeously elaborate continuation of ``The...,1
1,Effective but too-tepid biopic,1
2,If you sometimes like to go to the movies to h...,1
3,"Emerges as something rare, an issue movie that...",1
4,The film provides some great insight into the ...,1
