<a href="https://colab.research.google.com/github/atolman01/Sentiment_Analysis_Manning/blob/main/Building_Lexicon.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Segmentation breaks up larger texts into smaller segments, each with more focused information.
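
As a rough illustration of segmentation, the sketch below splits a short review into sentences with a naive regular expression. The `naive_segment` helper and the sample text are assumptions made for this example. The notebook itself uses NLTK's `sent_tokenize`, which handles abbreviations and other edge cases that this regex does not.

```python
import re

def naive_segment(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace (illustrative only)."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    # drop empty strings, e.g. when the input is empty
    return [p for p in parts if p]

sample = "The battery died fast. Support was helpful though! Would I buy again?"
segments = naive_segment(sample)
print(segments)
```

Each segment can then be scored on its own, which is the idea behind scoring a review sentence by sentence.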

A **scanner (lexer)** *is a tokenizer used for compiling computer languages.*

A **lexicon** *is the vocabulary (the set of all valid tokens) of a computer language.*

If the tokenizer is incorporated into the compiler's parser, the parser is often called a scannerless parser. Tokens are the end of the line for the context-free grammars (CFGs) used to parse computer languages.

They are called terminals because they terminate a path from the root to a leaf in the CFG's parse tree.
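
To make the scanner idea concrete, here is a small sketch that uses Python's standard-library `tokenize` module to run Python's own lexer over one line of source code. The sample line is an assumption chosen for illustration. Each token in the output loosely plays the role of a terminal that the parser consumes.

```python
import io
import tokenize

source = "total = price * 2\n"

# generate_tokens reads the source line by line and yields the lexer's tokens
tokens = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    # skip the bookkeeping tokens that mark the end of the line and of the input
    if tok.type not in (tokenize.NEWLINE, tokenize.ENDMARKER)
]
print(tokens)
```

The result is a list of (token type, token text) pairs such as a `NAME` for `total` and an `OP` for `=`.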

**A tokenizer breaks unstructured data down into chunks of information that can be counted as discrete elements**
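
A minimal sketch of that idea for natural language, assuming a simple regex tokenizer rather than NLTK's `word_tokenize`. The `simple_tokenize` helper and the sample text are hypothetical. Once the text is split into tokens, they can be counted like any other discrete items.

```python
import re
from collections import Counter

def simple_tokenize(text):
    """Lowercase the text and return runs of letters, digits and apostrophes as tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

text = "Good price, good quality. Not good packaging."
counts = Counter(simple_tokenize(text))
print(counts)
```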

---

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
import pandas as pd
import numpy as np
import altair as alt
import string
import re

import nltk
nltk.download('popular',quiet=True)
nltk.download('opinion_lexicon',quiet=True)
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords,opinion_lexicon
from collections import Counter


In [None]:
#read in the small corpus; keep only the reviews, as a list
sm_corpus = pd.read_csv("/content/drive/My Drive/NLP Python/Data/small_corpus.csv")

reviews = list(sm_corpus['reviews'])


**TOKENIZATION**

Steps to score a sentence

**NEED TO GET A SCORE BETWEEN -1 and 1**

The score does not need to be a whole number: a decimal tells us how strongly negative or positive the review is.

A fully negative review is represented by -1;
A fully positive review is represented by 1;
A neutral review is represented by 0

Use cases:

    1. The most negative words a sentence can contain is all of them (the length of the sentence)
       How do I figure out how many negative words are in the sentence?
          Use the opinion lexicon and its list of negative words
          Go through each word in the sentence to determine if it is negative
          Count as you go
          Divide the number of negative words by the total number of words in the sentence

          We need to get -1 if all the words in the sentence are negative
          If all words are negative, 0 are positive
            To get a negative number, subtract the number of negatives from the number of positives, then divide by the total number of words: (positives - negatives) / total
    2. The most positive words a sentence can contain is all of them (the length of the sentence)
       Do the same as above, but with the positive word list

    3. The fewest negative words a sentence can contain is 0
    4. The fewest positive words a sentence can contain is 0
    5. The sentence does not have any words (length of 0), so the score is set to 0 to avoid dividing by zero
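
The use cases above can be checked with a small sketch. The two toy word sets and the `toy_score` helper are hypothetical stand-ins for the opinion lexicon and the notebook's scoring function, kept tiny so the arithmetic is easy to follow.

```python
# Toy lexicons: hypothetical stand-ins for the opinion lexicon word lists
toy_pos = {"good", "great", "love"}
toy_neg = {"bad", "awful", "hate"}

def toy_score(tokens):
    """Return (positives - negatives) / total words, or 0 for an empty sentence."""
    if not tokens:
        return 0
    positives = sum(1 for t in tokens if t in toy_pos)
    negatives = sum(1 for t in tokens if t in toy_neg)
    return (positives - negatives) / len(tokens)

print(toy_score(["bad", "awful", "hate"]))           # all negative
print(toy_score(["good", "great", "love"]))          # all positive
print(toy_score(["the", "box", "arrived"]))          # no opinion words
print(toy_score([]))                                 # empty sentence
print(toy_score(["good", "but", "awful", "awful"]))  # mixed
```

The all-negative sentence scores -1.0, the all-positive one 1.0, and the sentence with no opinion words 0.0. The mixed sentence has one positive and two negatives out of four words, so it scores (1 - 2) / 4 = -0.25.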


It would probably be good to lowercase all sentences, since "There" and "there" are the same word and should not be counted as different words.
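
A short sketch of why this matters for lexicon lookups, assuming the lexicon entries are stored in lowercase. The one-word `lexicon_sample` set and the token list are made up for illustration.

```python
lexicon_sample = {"great"}  # hypothetical one-word lexicon, all lowercase

tokens = ["Great", "product", "GREAT", "price"]

# set membership is case-sensitive, so capitalized tokens are missed without lowercasing
hits_raw = sum(1 for t in tokens if t in lexicon_sample)
hits_lowered = sum(1 for t in tokens if t.lower() in lexicon_sample)
print(hits_raw, hits_lowered)
```

Without lowercasing, neither "Great" nor "GREAT" matches, so the raw count is 0 while the lowercased count is 2.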
    


---

Functions

In [None]:
#function that returns sentence clear of punctuation
def remove_punct(sentence):
  sentence = re.sub(r'[^\w\s]','',sentence)
  return sentence

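#function that returns a list of words without stopwords
#stopwords is the NLTK corpus reader, so build the actual English word set first
stop_words = set(stopwords.words('english'))
def remove_stopwords(sentence):
  return [word for word in sentence if word not in stop_words]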

#returns the score of a tokenized sentence (list of words), between -1 and 1
def score_sentence(sentence):
  #lowercase first so that "Good" and "good" both match the lexicon
  lowered = [word.lower() for word in sentence]
  positives = len([word for word in lowered if word in pos_words])
  negatives = len([word for word in lowered if word in neg_words])
  if len(lowered) > 0:
    return (positives - negatives) / len(lowered)
  else:
    return 0

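#score a review as the average of its sentence scores
def score_review(review):
    sentiment_scores = []
    sents = sent_tokenize(review)
    for sent in sents:
        wds = word_tokenize(sent)
        sent_scores = score_sentence(wds)
        sentiment_scores.append(sent_scores)
    #an empty review has no sentences; return 0 (neutral) instead of dividing by zero
    if not sentiment_scores:
        return 0
    return sum(sentiment_scores) / len(sentiment_scores)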

In [None]:
#sets of positive and negative words from opinion_lexicon module
pos_words = set(opinion_lexicon.positive())
neg_words = set(opinion_lexicon.negative())

In [None]:
sentiment_reviews = [score_review(r) for r in reviews]
ratings = list(sm_corpus['ratings'])
sentiment_df = pd.DataFrame({"rating":ratings
                        ,"review":reviews
                        ,"review dictionary sentiment":sentiment_reviews})

sentiment_df.to_csv('/content/drive/My Drive/NLP Python/Data/dictionary_sentiment.csv')
#sentiment_df.head()


In [None]:
# count how many reviews have each rating, from 1.0 to 5.0
# 1.0 : 1500, 2.0 : 500, 3.0 : 500, 4.0 : 500, 5.0 : 1500
rating_counts = Counter(ratings)

data1 = pd.DataFrame(
    {
        "ratings": [str(e) for e in list(rating_counts.keys())],
        "counts": list(rating_counts.values()),
    }
)

chart1 = alt.Chart(data1).mark_bar().encode(x="ratings", y="counts")


# density=True returns a normalized density per bin rather than raw counts
hist, bin_edges = np.histogram(sentiment_reviews, density=True)
# label each bin with its lower and upper edge
labels = [f"{lo:.3f} to {hi:.3f}" for lo, hi in zip(bin_edges, bin_edges[1:])]

data2 = pd.DataFrame({"sentiment scores": labels, "density": hist})

chart2 = (
    alt.Chart(data2)
    .mark_bar()
    .encode(x=alt.X("sentiment scores", sort=labels), y="density")
)

source = pd.DataFrame(
    {"ratings": [str(e) for e in ratings], "sentiments": sentiment_reviews}
)


chart4 = (
    alt.Chart(source)
    .mark_circle(size=60)
    .encode(
        x="ratings", y="sentiments", color="ratings", tooltip=["ratings", "sentiments"]
    )
    .interactive()
)

chart4


More Solutions to Milestone 2:

https://github.com/mariandb/Growth-Hacking-with-NLP-and-Sentiment-Analysis/blob/master/2.%20Creating%20a%20dictionary-based%20sentiment%20analyzer.ipynb