# Sentiment Analysis

Sentiment analysis is an NLP technique that determines the polarity of a given text. It classifies the text into one of three categories: Positive, Negative and Neutral. Here we use the VADER analyzer from the nltk library for sentiment analysis. <br>
We give each Wikipedia page a sentiment score and then sort the pages by that score. This shows which pages are the most positive and the most negative among the websites the crawler extracted.

### Importing the libraries:

First we import the libraries required for sentiment analysis.

In [1]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import pandas as pd
from bs4 import BeautifulSoup

In [2]:
nltk.download('vader_lexicon')
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\gaura\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\gaura\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\gaura\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Defining the functions:

sentimentAnalyzerScore() calculates the sentiment score of the text passed to it. It returns the dictionary produced by VADER's polarity_scores(), which holds the neg, neu, pos and compound values.

In [3]:
analyzer = SentimentIntensityAnalyzer()
def sentimentAnalyzerScore(text):
    score = analyzer.polarity_scores(text)
    return score
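
The three categories mentioned at the top can be derived from the compound value of that score. The following is a minimal sketch, assuming the cutoffs of 0.05 and -0.05 that are commonly used with VADER. The function name labelFromCompound and the threshold parameter are illustrative and are not part of this notebook.

```python
def labelFromCompound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a category label.

    The default threshold is an assumption (a commonly used cutoff),
    not something this notebook defines.
    """
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


# Example with hand-picked compound values.
for value in (0.6, -0.4, 0.01):
    print(value, labelFromCompound(value))
```

In the notebook this could be applied to score['compound'] for each page, if a label is wanted in addition to the numeric score.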

stopWordListDel() removes the stopwords from the dataframe. It loops over the rows and removes the stopwords from the text of each Wikipedia page. The stopword list is predefined in the nltk library.

In [4]:
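def stopWordListDel(cleanedData):
    # Build the stop-word set once; membership tests on a set are fast.
    stop_words = set(stopwords.words('english'))
    for i in range(len(cleanedData.index)):
        # Column 1 holds the page text (column 0 holds the title).
        sentence = cleanedData.iloc[i, 1]
        # Compare in lowercase because the nltk stop-word list is lowercase.
        kept = [w for w in sentence.split() if w.lower() not in stop_words]
        # Joining with single spaces avoids the stray leading spaces of string concatenation.
        cleanedData.iloc[i, 1] = " ".join(kept)
    return cleanedData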
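
The filtering step can be tried on a single string without a dataframe. This is a minimal sketch: removeStopWords is a hypothetical helper, sample_stop_words is a tiny hand-written stand-in for stopwords.words('english'), and the lowercase comparison is an assumption of this sketch.

```python
def removeStopWords(text, stop_words):
    """Return text without the words found in stop_words (case-insensitive)."""
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(kept)


# Hypothetical mini stop-word list standing in for stopwords.words('english').
sample_stop_words = {"the", "is", "a", "of", "not"}

print(removeStopWords("The history of the city is long", sample_stop_words))
print(removeStopWords("The film is not good", sample_stop_words))
```

The second call drops "not", which the nltk English list also contains. Removing negations this way can change the sentiment the analyzer sees, so it may be worth comparing scores with and without this step.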

listSentScore() loops through the whole dataframe and computes the sentiment score of each Wikipedia page. It returns a new dataframe that contains the title of the page, the full sentiment score and the compound sentiment score.

In [5]:
def listSentScore(data):
    rows = []
    for i in range(len(data)):
        # Column 0 holds the title and column 1 the page text.
        score = sentimentAnalyzerScore(data.iloc[i, 1])
        rows.append({'Title': data.iloc[i, 0], 'Sentiment Score': score, 'Compound Score': score['compound']})
    # DataFrame.append was removed in pandas 2.0, so collect the rows and build the frame once.
    return pd.DataFrame(rows, columns=['Title', 'Sentiment Score', 'Compound Score'])

sortAndPrint() sorts the Wikipedia pages by their compound sentiment score in ascending order, so the most negative pages come first and the most positive pages come last.

In [None]:
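def sortAndPrint(data):
    # Ascending sort: the most negative pages come first, the most positive last.
    data.sort_values(by=['Compound Score'], inplace=True)
    # The results of head() and tail() are discarded inside a function, so print them explicitly.
    print(data.head())
    print(data.tail())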
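
The ordering can also be checked without pandas. This is a minimal sketch on made-up records: the titles and scores are invented for illustration, and mostNegativeAndPositive is a hypothetical helper that is not part of this notebook.

```python
def mostNegativeAndPositive(records, n=2):
    """Sort records by compound score (ascending) and return
    the n most negative and the n most positive entries. Assumes n >= 1."""
    ordered = sorted(records, key=lambda r: r["Compound Score"])
    return ordered[:n], ordered[-n:]


# Invented sample data, for illustration only.
records = [
    {"Title": "Page A", "Compound Score": 0.91},
    {"Title": "Page B", "Compound Score": -0.75},
    {"Title": "Page C", "Compound Score": 0.10},
    {"Title": "Page D", "Compound Score": -0.20},
]

negative, positive = mostNegativeAndPositive(records, n=1)
print("Most negative:", negative[0]["Title"])
print("Most positive:", positive[0]["Title"])
```

As with the dataframe version, the head of the sorted list holds the most negative pages and the tail holds the most positive ones.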