# **`Part II`**

### Conduct Sentiment Analysis

Sentiment analysis, which focuses on analyzing the sentiment of text resources ranging from corporate feedback surveys to movie reviews, is probably the most popular application of text analytics. Its main goal is to analyze a body of text and decipher the opinion it expresses, including factors such as emotion, feelings, and mood.

**When does "Sentiment Analysis" work best?**


Sentiment analysis works best on text with a subjective rather than an objective context. Objective text usually reflects factual statements with no emotion or feeling attached. In contrast, subjective text includes opinions that carry the emotions and feelings of the people who wrote them. Given the proliferation of social media channels, sentiment analysis is increasingly used by a range of entities (businesses, public sector organizations, governments, etc.) to extract subjective, opinion-related information such as emotions, attitude, and mood, and to use that information to detect the sentiment of people.

**What is covered in this objective?**

In a nutshell, sentiment analysis can be framed as a classification problem: either binary classification (positive or negative) or multi-class classification (positive, negative, or neutral).

Within this objective, we will explore a range of related topics encompassing **1)** Constructing a Sentiment Analysis Model, **2)** Determining the subjectivity of text, **3)** Examining the intensity or polarity of a sentiment and **4)** Performing sentiment analysis on tweets.
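The binary and multi-class framings described above can be sketched as a mapping from a numeric sentiment score to a label. This is a minimal illustration, assuming some upstream model produces a score in the range [-1, 1]; the function names and the 0.05 neutral band are hypothetical choices, not part of any library.

```python
def binary_label(score):
    """Binary framing: every text is either positive or negative."""
    return "positive" if score >= 0 else "negative"


def multiclass_label(score, neutral_band=0.05):
    """Multi-class framing: scores close to zero are treated as neutral."""
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"


# Compare the two framings on a few example scores
for score in (0.6, 0.0, -0.44):
    print(score, binary_label(score), multiclass_label(score))
```

The only difference between the two framings is the band around zero: widening `neutral_band` makes the classifier more reluctant to call a text positive or negative.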

**Utilize a Sentiment Dictionary to decipher the sentiment of text**

A sentiment dictionary is the mapping of words to sentiment values. For example: the word awesome (which is a positive sentiment) could have a value of +3.7 and the word horrible (which is a negative sentiment) could have a value of -3.1. While using a sentiment dictionary, the values of the sentiment words are summed to get the overall sentiment of the text.

For example: I loved the ambience of the restaurant but the drive to the restaurant was horrendous. Overall, it was a good evening.

Now suppose the word loved has a value of +3.9, the word horrendous has a value of -4.2, and the word good has a value of +2.9. The overall sentiment of the text is positive, since the values of the sentiment words sum to +2.6.
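The summing procedure can be sketched in a few lines of plain Python. The dictionary below holds only the three illustrative values from the example, keyed on the word forms that appear in the sentence; `SENTIMENT_VALUES` and `dictionary_sentiment` are hypothetical names, and a real lexicon would contain thousands of entries.

```python
import re

# Toy sentiment dictionary using the illustrative values from the example.
# These numbers are examples, not entries from a real lexicon.
SENTIMENT_VALUES = {"loved": 3.9, "horrendous": -4.2, "good": 2.9}


def dictionary_sentiment(text):
    """Sum the sentiment values of every word found in the dictionary."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(SENTIMENT_VALUES.get(word, 0.0) for word in words)


review = ("I loved the ambience of the restaurant but the drive to the "
          "restaurant was horrendous. Overall, it was a good evening.")
score = dictionary_sentiment(review)
print(round(score, 1), "positive" if score > 0 else "negative or neutral")
```

Words that are missing from the dictionary contribute nothing, which is why purely factual sentences score zero under this approach.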

To decipher the sentiment of text, we will use the **VADER** sentiment tool, which is also bundled with NLTK; the cells below install the standalone `vaderSentiment` package. VADER stands for Valence Aware Dictionary and sEntiment Reasoner. Its lexicon is tuned for social media text such as tweets and contains emoticons and slang. **It also supports sentiment intensifiers (such as "incredibly" in "incredibly funny") and negations (such as "not bad", which is a slight positive sentiment)**.

How does it work? VADER checks whether any of the words in a piece of text are present in its lexicon. From the word ratings it produces four sentiment metrics: positive, neutral, negative, and compound. The compound score is the sum of all the lexicon ratings, standardized to a range between -1 and 1.
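The standardization step can be sketched as follows. This assumes the normalization has the form x / sqrt(x² + α) with α = 15, which is how the reference implementation is commonly described; treat the constant and the function name `normalize_score` as assumptions rather than the library's documented API.

```python
import math


def normalize_score(raw_sum, alpha=15.0):
    """Squash an unbounded sum of word ratings into the range [-1, 1]."""
    normalized = raw_sum / math.sqrt(raw_sum * raw_sum + alpha)
    # Clamp to guard against floating-point overshoot
    return max(-1.0, min(1.0, normalized))


# Larger raw sums move the result toward -1 or 1 without ever exceeding them
for raw_sum in (-8.0, -1.9, 0.0, 1.1, 8.0):
    print(raw_sum, round(normalize_score(raw_sum), 4))
```

Under this form a text with no sentiment words maps to exactly 0, and each additional sentiment word has a diminishing effect on the compound score.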

In [None]:
# Install the VADER Sentiment Tool

!pip install vaderSentiment

Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2


In [None]:
#Load the SentimentIntensityAnalyzer object from the VADER package
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

#Create a handle to the SentimentIntensityAnalyzer object
analyzer = SentimentIntensityAnalyzer()

#function that outputs the sentiment ratings
def print_sentiment_ratings(sentence):
    sent = analyzer.polarity_scores(sentence)
    print("{} {}".format(sentence, sent))

#Examining the sentiment ratings for different pieces of text
#No sentiment expressed

print_sentiment_ratings("I have to work on the weekend")

#Overall rating is neutral

I have to work on the weekend {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}


In [None]:
#Sentiment expressed via emoticon

print_sentiment_ratings("I have to work on the weekend :(")

#Overall rating is negative

I have to work on the weekend :( {'neg': 0.293, 'neu': 0.707, 'pos': 0.0, 'compound': -0.4404}


In [None]:
#Expressing a more intense feeling via 2 emoticons

print_sentiment_ratings("I have to work on the weekend :( :(")

#Overall rating is even more negative than the above piece of text

I have to work on the weekend :( :( {'neg': 0.453, 'neu': 0.547, 'pos': 0.0, 'compound': -0.7003}


In [None]:
#VADER handles emotion intensifiers (i.e. words such as very, really, super, etc.)

print_sentiment_ratings("I did well on the test")

#The sentiment rating for the sentence below is higher than the one above

print_sentiment_ratings("I did very well on the test")

I did well on the test {'neg': 0.0, 'neu': 0.704, 'pos': 0.296, 'compound': 0.2732}
I did very well on the test {'neg': 0.0, 'neu': 0.715, 'pos': 0.285, 'compound': 0.3384}


In [None]:
#VADER takes into consideration how the words are written - capitalization has an impact on the sentiment ratings

print_sentiment_ratings("I had a super day")

#The sentiment rating for the sentence below is higher than the one above

print_sentiment_ratings("I had a SUPER day")

I had a super day {'neg': 0.0, 'neu': 0.506, 'pos': 0.494, 'compound': 0.5994}
I had a SUPER day {'neg': 0.0, 'neu': 0.463, 'pos': 0.537, 'compound': 0.6841}


In [None]:
#Finally, VADER handles changes in sentiment intensity; specifically when a sentence contains the word "but". Higher weighting is given to the sentiment after the word "but".
#The overall rating for the sentence below is negative

print_sentiment_ratings(" I loved the ambience of the restaurant but the drive to the restaurant was horrendous")

 I loved the ambience of the restaurant but the drive to the restaurant was horrendous {'neg': 0.252, 'neu': 0.63, 'pos': 0.119, 'compound': -0.5789}
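The compound scores printed in the cells above can be reproduced by hand, which helps show what each heuristic does. The sketch below assumes word ratings of 1.1 for "well", 2.9 for "super" and "loved", -2.8 for "horrendous", and -1.9 for ":(", an intensifier boost of 0.293, a capitalization boost of 0.733, weights of 0.5 and 1.5 around "but", and the normalization x / sqrt(x² + 15). These constants are assumptions chosen because they are consistent with the outputs shown in this notebook; they are not read from the installed lexicon.

```python
import math

ALPHA = 15.0        # assumed normalization constant
BOOST = 0.293       # assumed increment for an intensifier such as "very"
CAPS_BOOST = 0.733  # assumed increment for an ALL-CAPS sentiment word


def compound(raw_sum):
    """Normalize a raw sum of adjusted word ratings, rounded to 4 places."""
    return round(raw_sum / math.sqrt(raw_sum ** 2 + ALPHA), 4)


# Each entry maps a sentence (or a shortened label) to its assumed raw sum
cases = {
    "I have to work on the weekend :(": -1.9,
    "I have to work on the weekend :( :(": -1.9 + -1.9,
    "I did well on the test": 1.1,
    "I did very well on the test": 1.1 + BOOST,
    "I had a super day": 2.9,
    "I had a SUPER day": 2.9 + CAPS_BOOST,
    # Sentiment before "but" is dampened (x0.5); after it, amplified (x1.5)
    "loved ... but ... horrendous": 0.5 * 2.9 + 1.5 * -2.8,
}

for text, raw_sum in cases.items():
    print("{:<40} {}".format(text, compound(raw_sum)))
```

Because "horrendous" comes after "but", its weight is three times that of "loved", which is why the sentence as a whole comes out negative.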


In [None]:
!pip install -U textblob

Collecting textblob
  Downloading textblob-0.18.0.post0-py3-none-any.whl (626 kB)
Installing collected packages: textblob
  Attempting uninstall: textblob
    Found existing installation: textblob 0.17.1
    Uninstalling textblob-0.17.1:
      Successfully uninstalled textblob-0.17.1
Successfully installed textblob-0.18.0.post0


In [None]:
!python -m textblob.download_corpora

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
Finished.


In [None]:
#Test drive TextBlob
from textblob import TextBlob

#Initialize a variable
txtblob = TextBlob("Lambda School is dreadful.")

#Get the POS
#txtblob.tags

#Get the polarity of the sentiment
txtblob.polarity


-1.0

In [None]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=b6483ce3336795c89287ed354557247bb98c5b0f4cf68173e80ed352ba6cda0f
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


In [None]:
!pip install matplotlib



In [None]:
# Import necessary modules
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from IPython.display import display, clear_output
import numpy as np

In [None]:

#Function to generate a simulated data stream
def generate_stream(max_iterations=10):
    import time
    import random

    for i in range(max_iterations):
        # Generate random data
        data = (random.randint(1, 10), random.random())
        yield data
        time.sleep(1)  # Simulate 1 second interval between data points

    print("Reached maximum number of iterations. Stopping stream.")

# Reuse the active SparkContext, or create one with default settings if none exists.
# getOrCreate() always returns a context, so no separate None check is needed.
sc = SparkContext.getOrCreate()

# Create a StreamingContext with batch interval of 5 seconds
ssc = StreamingContext(sc, 5)

# Create a DStream from the simulated data stream
stream = ssc.queueStream([generate_stream()])

# Process the data stream
stream.foreachRDD(lambda rdd: print(rdd.collect()))

# Start the streaming context
ssc.start()

# Wait for the stream to finish, giving up after 30 seconds in case of unexpected execution delays
ssc.awaitTerminationOrTimeout(30)

# Stop the streaming context
ssc.stop(stopSparkContext=True)



Reached maximum number of iterations. Stopping stream.
[(9, 0.2560046458826183), (9, 0.08402663900363261), (9, 0.32449373782571045), (2, 0.9124250198300456), (8, 0.14505745639498924), (9, 0.13218739909292931), (4, 0.11740313072434883), (3, 0.5881328634181099), (3, 0.24361625082552307), (4, 0.4970356207987032)]
[]
[]
[]
[]
[]
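The output above shows every record landing in the first batch, followed by empty batches. A likely explanation, stated here as an assumption rather than verified behaviour, is that the generator is consumed in full when it is turned into a single queued RDD. One way to spread records across several batches is to pre-split them into a list of lists, one per batch. `make_batches` below is a hypothetical helper, and the commented line shows how its result might be handed to `queueStream`.

```python
import random


def make_batches(n_batches=3, records_per_batch=4, seed=0):
    """Build a list of record lists, one inner list per simulated micro-batch."""
    rng = random.Random(seed)
    return [
        [(rng.randint(1, 10), rng.random()) for _ in range(records_per_batch)]
        for _ in range(n_batches)
    ]


batches = make_batches()

# With an active StreamingContext named ssc, the batches could be queued as:
# stream = ssc.queueStream(batches)

for index, batch in enumerate(batches):
    print("batch", index, batch)
```

Building the batches up front also removes the `time.sleep` calls from the driver, so the pacing comes from the streaming batch interval instead of from the generator.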


Summary
* Set up PySpark in Google Colab: install the necessary dependencies and initialize PySpark.
* Create a data streaming script: a Python generator that simulates streaming data by yielding random values.
* PySpark streaming code: set up a streaming context in PySpark to read the simulated data from a queue stream and process each batch.

# In Class Project / Discussion
**Objective**:<br>

* **To show understanding of big data analytics**
* **Instruction:**
  *  3-4 students per group
  *  work on a small project (any topic), but data is required in 1 hour
  * KPIs:
    * Team member names:
    * Topic / Problem definition
    * Why do you want to study this?
    * Who is your audience?
    * Study Design:
      * Pipeline
        * Data source
        * Data definition
        * Size
      * Methodology
        * EDA
        * Model
      * Analysis
      * Conclusion
      * Limitation
* **Evaluation Criteria**
  * Each group has 10-12 minutes for the presentation. (`50 points`)
  * Each group member should have at least one slide to present. (`10 points`)
  * Pre-presentation:
    * Post the group project's outline or content in Canvas / Day 8 Discussion (`20 points`)
  * Post-presentation:
    * Comment on other groups' presentations in Canvas / Day 8 Discussion (`20 points`)<br>


**Please note** that your comments should be professional and insightful: avoid phrases like 'I agree with you' or 'Yes, you are right,' as well as any disrespectful comments.


