# Analysis Question 4: Toxicity (Sentiment Analysis)

Toxicity: using Sentiment Analysis, determine the top 5 positive subreddits and top 5 negative subreddits based on comment sentiment.

Choose two subreddits focused on similar topics but with different views, e.g., /r/apple and /r/android. Compare the toxicity of the two.

# What is Sentiment Analysis

![alt text](https://i.imgur.com/MijPBwS.jpg "Logo Title Text 1")

Sentiment analysis is the automatic process of identifying positive, negative and neutral emotions in text. Equipped with AI, sentiment analysis allows businesses to understand how customers feel about their products and services and extract valuable insights that lead to better decision-making.

In a world where we generate 2.5 quintillion bytes of data every day, sentiment analysis has become a key tool for making sense of that data.


# Our Implementation

We downsampled the original data to 1% for the first part of this question. A 1% sample is enough to determine the top 5 positive and top 5 negative subreddits based on comment sentiment. Important note: subreddits with fewer than 100 comments in the sample are excluded from this analysis.

In [3]:
%%time
originDF = sqlContext.read.json("hdfs://orion11:11001/sampled_reddit_v3/*")

CPU times: user 2.42 ms, sys: 2.19 ms, total: 4.61 ms
Wall time: 9.64 s


Cleaning the comment data

In [5]:
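import re

def pre_process(text):
    """Lowercase a comment and strip digits, underscores and punctuation."""
    # lowercase
    text = text.lower()

    # remove digits, underscores and special characters
    # (raw strings avoid invalid-escape warnings; a "|" inside [...] would be
    # a literal pipe, so it is left out of the character class)
    text = re.sub(r"[\d_]|[^\w\s]", "", text)
    # collapse runs of whitespace into a single space
    text = re.sub(r"\s+", " ", text)
    return text.strip()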
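
To make the effect of the cleaning step concrete, here is a small standalone sketch that restates the function and runs it on two made-up comments. The sample strings are illustrative only and do not come from the Reddit data; the pattern is written with raw strings, which is an assumption about the intended behaviour.

```python
import re

def pre_process(text):
    """Lowercase, drop digits/underscores/punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[\d_]|[^\w\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

# Made-up comments, for illustration only
samples = [
    "I LOVE the iPhone 6!!! Best $700 I've spent :)",
    "snake_case  and\ttabs",
]
cleaned = [pre_process(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

Note that VADER's rules also take capitalization and punctuation such as "!" into account, so cleaning this aggressively may flatten some scores.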

We used the VADER sentiment analyzer from the NLTK library for sentiment analysis.

# Sentiment Analysis

Important note: subreddits with fewer than 100 comments in the sample are excluded from this analysis.

In [50]:
import nltk
from pyspark.sql import functions as func
from pyspark.sql import types as types

# This cell takes 4 minutes... (199923 records for soccer)
# VADER needs its lexicon; SentimentIntensityAnalyzer() raises an error if it is missing
nltk.downloader.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

def calculatue_score(listBody):
    """Sum the VADER compound scores of all comments in one subreddit."""
    # skip subreddits with fewer than 100 comments in the sample
    if len(listBody) < 100:
        # return a float so the value matches the UDF's FloatType
        return 0.0

    result = 0.0
    for message in listBody:
        # clean the comment before scoring it
        message = pre_process(message)
        ss = sid.polarity_scores(message)
        result += ss["compound"]
    return result

calculatue_score_udf = func.udf(calculatue_score, types.FloatType())



[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home4/hpbui/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
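
The aggregation rule used by the UDF (sum the per-comment scores, but ignore groups below the comment threshold) can be sketched without Spark or NLTK. In the sketch below, `total_compound`, `MIN_COMMENTS` and `toy_score` are hypothetical names introduced for illustration; `toy_score` is only a stand-in for `sid.polarity_scores(text)["compound"]` and is not how VADER scores text.

```python
MIN_COMMENTS = 100  # threshold taken from the note in this section

def total_compound(comments, score_fn, min_comments=MIN_COMMENTS):
    """Sum per-comment scores, or return 0.0 when the group is too small."""
    comments = list(comments)
    if len(comments) < min_comments:
        return 0.0
    return float(sum(score_fn(c) for c in comments))

def toy_score(text):
    """Hypothetical scorer: +0.5 per 'good', -0.5 per 'bad'."""
    words = text.lower().split()
    return 0.5 * words.count("good") - 0.5 * words.count("bad")

big_group = ["good game"] * 120 + ["bad call"] * 30
small_group = ["good game"] * 99

print(total_compound(big_group, toy_score))    # 120 * 0.5 - 30 * 0.5
print(total_compound(small_group, toy_score))  # below the threshold
```

Injecting the scorer as a parameter keeps the thresholding logic easy to check on its own; in the notebook the scorer would be the VADER analyzer created above.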


Doing sentiment analysis for each subreddit category:

In [51]:
%%time

# pyspark.sql.functions was imported as `func` above, so use that alias here
analyzeDF = (originDF.groupBy('subreddit')
             .agg(calculatue_score_udf(func.collect_list('body')).alias('analyze'),
                  func.count(func.lit(1)).alias('count')))
analyzeDF = analyzeDF.filter(analyzeDF['count'] >= 100)
analyzeDF.show()

+-------------------+---------+-----+
|          subreddit|  analyze|count|
+-------------------+---------+-----+
|            Amateur|  24.4157|  153|
|              HPMOR|  22.6543|  162|
|         MLBTheShow|  63.3394|  380|
|         MensRights|-197.9045| 3299|
|         NHLStreams|  28.2473|  171|
|         QuotesPorn|  32.2833|  289|
|       SaltLakeCity|  41.0967|  280|
|UnresolvedMysteries| -21.0719|  461|
|         WahoosTipi|  27.2657|  257|
|              anime| 1483.797| 9240|
|           lacrosse|  27.2878|  110|
|       marvelheroes|  67.5551|  320|
|         mistyfront|   0.3322|  131|
|             travel| 451.8471| 1454|
|            ukraina|  16.0944|  655|
| Anarcho_Capitalism| 120.7049| 1526|
|          BABYMETAL| 108.9668|  390|
|      ClickerHeroes|  90.6882|  357|
|             Hawaii|  45.1344|  237|
|          JUSTNOMIL|  70.3518|  735|
+-------------------+---------+-----+
only showing top 20 rows

CPU times: user 33.2 ms, sys: 7.26 ms, total: 40.5 ms
Wall tim

In [52]:
analyzeDF.schema

StructType(List(StructField(subreddit,StringType,true),StructField(analyze,FloatType,true),StructField(count,LongType,false)))

In [53]:

analyzeDF.sort(func.col('analyze').desc()).show()

+--------------------+---------+------+
|           subreddit|  analyze| count|
+--------------------+---------+------+
|           AskReddit|16001.021|284960|
|     leagueoflegends| 5536.479| 48008|
|              gaming|3953.6223| 41689|
|                pics|3703.8025| 60123|
|            gonewild|3235.5254| 13058|
|                IAmA| 3080.609| 21385|
|               funny|2940.4287| 64062|
|                 nfl|2826.1062| 32719|
|        pcmasterrace|2776.1965| 18199|
|                 nba| 2721.648| 26823|
|              soccer| 2531.825| 22251|
|               trees|2440.6042| 17487|
|Random_Acts_Of_Am...| 2357.632|  8142|
|            buildapc|2257.9685|  8402|
|               DotA2|2079.6892| 18848|
|              movies|1959.7386| 19706|
|electronic_cigarette|1865.1583|  7779|
|       SquaredCircle|1733.7367| 17271|
|       pokemontrades|1631.5997|  7578|
|              hockey|1625.2948| 17844|
+--------------------+---------+------+
only showing top 20 rows
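
Because `analyze` is a sum of compound scores, subreddits with more comments tend to rank higher regardless of tone. One possible refinement, which is not part of the original notebook, is to divide the sum by the comment count and rank by the mean score per comment. The sketch below does this in plain Python on a few rows copied from the table above; the name `mean_score` is hypothetical.

```python
# (subreddit, summed compound score, comment count), copied from the table above
rows = [
    ("AskReddit", 16001.021, 284960),
    ("leagueoflegends", 5536.479, 48008),
    ("gaming", 3953.6223, 41689),
    ("pics", 3703.8025, 60123),
    ("gonewild", 3235.5254, 13058),
    ("IAmA", 3080.609, 21385),
]

def mean_score(row):
    """Average compound score per comment for one table row."""
    _, total, count = row
    return total / count

by_total = sorted(rows, key=lambda r: r[1], reverse=True)
by_mean = sorted(rows, key=mean_score, reverse=True)

for name, total, count in by_mean:
    print(f"{name:>16}  mean={total / count:.4f}  total={total:.1f}  n={count}")
```

On these rows the order changes noticeably: AskReddit leads by total but comes last by mean. On the Spark side, the same idea could presumably be expressed as `analyzeDF.withColumn('mean', func.col('analyze') / func.col('count'))`, and sorting with `.asc()` instead of `.desc()` would list the most negative subreddits first.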



# Comparing Two Subreddits Based on Sentiment Analysis

# Basketball Sentiment Analysis


In [None]:
import nltk

# This cell takes 4 minutes...
# VADER needs its lexicon; SentimentIntensityAnalyzer() raises an error if it is missing
nltk.downloader.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# keep only /r/nba comments whose author was not deleted
newsamp = df.filter(df.subreddit.like("nba") & ~df['author'].isin(['[deleted]']))
iteratebody = newsamp.select("body").rdd.flatMap(list).collect()

sid = SentimentIntensityAnalyzer()
summary = {"positive": 0, "neutral": 0, "negative": 0}
for message in iteratebody:
    # clean the comment before scoring it
    message = pre_process(message)
    ss = sid.polarity_scores(message)

    # classify the comment by the sign of its compound score
    if ss["compound"] == 0.0:
        summary["neutral"] += 1
    elif ss["compound"] > 0.0:
        summary["positive"] += 1
    else:
        summary["negative"] += 1

print(len(iteratebody))  # total number of comments scored
print(summary)
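
Raw counts are hard to compare when two subreddits differ in size, so one option is to turn each summary into shares. The following is a minimal sketch of that idea, not the notebook's own code: `classify`, `summarize` and `shares` are hypothetical helpers, and the compound scores are made up for illustration rather than taken from the data.

```python
def classify(compound):
    """Label a compound score by its sign, as in the loop above."""
    if compound == 0.0:
        return "neutral"
    return "positive" if compound > 0.0 else "negative"

def summarize(compounds):
    """Count positive, neutral and negative scores."""
    summary = {"positive": 0, "neutral": 0, "negative": 0}
    for c in compounds:
        summary[classify(c)] += 1
    return summary

def shares(summary):
    """Convert counts into fractions of the total (all zeros if empty)."""
    total = sum(summary.values())
    if total == 0:
        return {k: 0.0 for k in summary}
    return {k: v / total for k, v in summary.items()}

# Made-up compound scores for two subreddits, for illustration only
sub_a = [0.6, 0.0, -0.4, 0.2]
sub_b = [-0.5, -0.1, 0.0, 0.3]

shares_a = shares(summarize(sub_a))
shares_b = shares(summarize(sub_b))
print("A:", shares_a)
print("B:", shares_b)
```

Under this sketch, the subreddit with the larger negative share would be read as the more toxic of the two.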