## I. Setting Up Packages

In [224]:
# import packages

# to read in csv's and work in a python dataframe format (standard)
import pandas as pd
# decode html
from html import unescape
import time
# general math processing package (standard)
import numpy as np
# standard nlp tool used for data cleaning
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.util import ngrams
# package for string manipulations (standard)
import string
# package for regular expressions (standard)
import re

# one of the state-of-the-art NLP packages (includes sentiment analysis)
import flair
from flair.data import Sentence
# textblob is a versatile package (can do sentiment analysis, but also part of speech tagging and subjectivity analysis)
from textblob import TextBlob
# using nltk's vader sentiment analysis tool
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# google translator
from googletrans import Translator
# snow NLP for simplified chinese
from snownlp import SnowNLP

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\ericluo04\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


## II. Cleaning Text

In [160]:
# set path
path = "C:/Users/ericluo04/Documents/GitHub/Bots-Project/Code/"

# read in .csv file as a dataframe
df1 = pd.read_csv(path + "2. HK Training/polarities/master_new_1.csv")
df2 = pd.read_csv(path + "2. HK Training/polarities/master_new_2.csv")
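# DataFrame.append was removed in pandas 2.0; pd.concat is the supported way to stack frames
df = pd.concat([df1, df2], ignore_index=True)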

In [161]:
def html_decode(s):
    """
    Return the given string with common HTML entities decoded. This does
    NOT remove normal HTML tags like <p>.
    """
    htmlCodes = (
            ("'", '&#39;'),
            ('"', '&quot;'),
            ('>', '&gt;'),
            ('<', '&lt;'),
            ('&', '&amp;'),
            ("'", '^'),
    )
    
    for code in htmlCodes:
        s = s.replace(code[1], code[0])
    
    return s

In [165]:
df['tweet_text'] = [html_decode(x) for x in df['tweet']]
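
The `unescape` function imported in Section I decodes the same entities, along with the other named and numeric HTML references. Here is a minimal sketch on a made-up string (the sample text is an assumption, not a row from the dataset). It does not cover the custom `^` replacement in `html_decode`, which looks specific to this data.

```python
from html import unescape

# made-up example string, not taken from the dataset
raw = "I can&#39;t believe it &amp; neither can &quot;they&quot; &lt;3"
decoded = unescape(raw)
print(decoded)
```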

In [169]:
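# resolve literal \uXXXX escapes; encoding with backslashreplace first keeps non-Latin (e.g. Chinese) characters intact
df['tweet_text'] = [x.encode('latin-1', 'backslashreplace').decode('unicode_escape', 'ignore') for x in df['tweet_text']]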


## III. Sentiment Analysis (English)

1. __NLTK’s Vader__ sentiment analysis tool uses a bag of words approach (a lookup table of positive and negative words) combined with a set of heuristics. Vader accounts for punctuation (e.g. !), capitalization, intensifiers (e.g. extremely, very, absolutely), conjunctions (e.g. but), tri-grams for negation (so that a negation only counts when it appears within three words of the word it modifies), emojis, slang, and emoticons. This means that you should NOT clean your text before scoring it: the punctuation, capitalization, and emoticons that cleaning would strip out are exactly what Vader uses. Vader has been shown to perform well on social media text, NY Times editorials, movie reviews, and product reviews.

2. __TextBlob__ works like Vader (bag of words classifier), but TextBlob also has Subjectivity Analysis (scores how factual/opinionated text is). However, it doesn’t contain all of the heuristics that Vader has, so it is less sensitive to cues that intensify or negate a sentence’s sentiment.

3. __Flair__ is based on a character-level LSTM neural network which takes sequences of letters and words into account when predicting. So, it can predict sentiment for words that the algorithm has never seen before (e.g. typos). Flair also takes into account negation and intensifiers. 
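
To make the lookup-table idea concrete, here is a toy sketch of a lexicon scorer with an intensifier rule and a three-word negation window. The word lists, weights, and the `toy_score` name are all invented for illustration; this is not Vader's actual lexicon or algorithm.

```python
# Toy lexicon-based scorer. The words and weights are invented for
# illustration only; they are NOT Vader's real lexicon or rules.
LEXICON = {"good": 2.0, "great": 3.0, "bad": -2.0, "terrible": -3.0}
INTENSIFIERS = {"very": 1.5, "extremely": 2.0}
NEGATIONS = {"not", "never", "no"}

def toy_score(text):
    """Sum the valence of known words, applying two simple heuristics."""
    tokens = text.lower().split()
    total = 0.0
    for i, tok in enumerate(tokens):
        if tok not in LEXICON:
            continue
        value = LEXICON[tok]
        # an intensifier directly before the word scales its valence
        if i > 0 and tokens[i - 1] in INTENSIFIERS:
            value *= INTENSIFIERS[tokens[i - 1]]
        # a negation within the three preceding tokens flips the sign
        if any(t in NEGATIONS for t in tokens[max(0, i - 3):i]):
            value = -value
        total += value
    return total

print(toy_score("very good"))
print(toy_score("not very good"))
```

Because the scorer only splits on whitespace, a token like `good!` would be missed. A real tool handles tokenization and punctuation far more carefully.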

Vader and TextBlob are computationally cheap and run quickly over large amounts of data. Flair, on the other hand, is much slower and has to download a large pre-trained model first, but it is probably the most accurate of the three. Vader's compound score and TextBlob's polarity both range between negative one and positive one: -1 = Negative, 0 = Neutral, 1 = Positive. Flair instead returns a POSITIVE or NEGATIVE label with a confidence between zero and one, which can be mapped onto the same scale. TextBlob's subjectivity score ranges between zero and one: 0 = very objective and 1 = very subjective.
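
Flair's pre-trained sentiment classifier reports a label (`POSITIVE` or `NEGATIVE`) plus a confidence between zero and one, rather than a signed score. To compare it with the other two tools, one option is to fold the label into the sign. The helper below is a hypothetical sketch of that mapping; `signed_score` is my own name, and the label and confidence would come from something like `sentence.labels[0].value` and `sentence.labels[0].score` after prediction.

```python
def signed_score(label, confidence):
    """Map a (label, confidence) pair onto the -1..1 scale.

    Hypothetical helper: assumes the label is the string 'POSITIVE' or
    'NEGATIVE' and the confidence is a float in [0, 1].
    """
    if label == "POSITIVE":
        return confidence
    if label == "NEGATIVE":
        return -confidence
    raise ValueError(f"unexpected label: {label!r}")

print(signed_score("POSITIVE", 0.9))
print(signed_score("NEGATIVE", 0.75))
```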

I recommend Vader for sentiment analysis and TextBlob for subjectivity analysis. 

### Vader

In [197]:
# read in vader sentiment analysis function
vaderSA = SentimentIntensityAnalyzer()

def vaderSentiment(data, column):
    """Add a 'sentiment_vader' column holding Vader's compound score."""
    compound = []
    for index, row in data.iterrows():
        # score each text once; the result has 'neg', 'neu', 'pos', and 'compound' keys
        scores = vaderSA.polarity_scores(row[column])
        # compound is the normalized overall score in [-1, 1], not the sum neg+neu+pos
        compound.append(scores['compound'])
    data['sentiment_vader'] = compound
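
As far as I know, Vader gets its compound score by summing the adjusted valence of every word and squashing that sum into the -1 to 1 range with x / sqrt(x² + α), where α = 15 by default. The sketch below reproduces only that normalization step; the function name and sample values are my own.

```python
import math

def normalize(raw_sum, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)

# larger sums approach +/-1 but never reach it
for raw in (-4.0, 0.0, 1.0, 4.0):
    print(raw, round(normalize(raw), 4))
```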

### TextBlob

In [198]:
def blobSentiment(data, column):
    """Add 'sentiment_blob' and 'subjectivity_blob' columns from TextBlob."""
    polarity = []
    subjectivity = []
    for index, row in data.iterrows():
        # build the TextBlob once and read both scores from it
        sentiment = TextBlob(row[column]).sentiment
        polarity.append(sentiment.polarity)             # sentiment in [-1, 1]
        subjectivity.append(sentiment.subjectivity)     # subjectivity in [0, 1]
    data['sentiment_blob'] = polarity
    data['subjectivity_blob'] = subjectivity

In [17]:
df.to_csv(path + 'sentimentData.csv', index=False, header=True)

## IV. Sentiment Analysis (Chinese)

1. We can simply translate the Chinese into English using Google Translate. With the translated column, we can then use any of the methods from the English sentiment analysis section.
2. SnowNLP is a text processing module for Simplified Chinese. This versatile package can do sentiment analysis, POS tagging, conversion to pinyin or simplified characters, and keyword/key-sentence extraction.
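
One thing to keep in mind when mixing the two approaches: SnowNLP's `sentiments` value is a probability of the text being positive, between zero and one, so it is not on the same scale as Vader or TextBlob. A simple linear rescaling puts it on the -1 to 1 scale. The helper name below is hypothetical, and the linear mapping is just one reasonable choice.

```python
def rescale_to_signed(p):
    """Map a positive-class probability in [0, 1] onto [-1, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("expected a probability between 0 and 1")
    return 2.0 * p - 1.0

print(rescale_to_signed(0.5))   # a 50/50 prediction becomes neutral
print(rescale_to_signed(1.0))
```

With SnowNLP installed, the input would be something like `SnowNLP(text).sentiments`.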

### Google Translate

In [206]:
translator.translate("進化版的「返校」來了，全球玩家一起光復香港").text

'Evolutionary version of "back to school" came, along with the recovery of the global players in Hong Kong'

In [226]:
# google translate function
translator = Translator()

def translate(data, column):
    # collect translations in row order
    translated = []
    count = 0
    for index, row in data.iterrows():
        print(count)    # progress indicator
        translated.append(translator.translate(row[column]).text)
        count += 1
        # pause for 10 seconds after every 20 requests to ease rate limiting
        if count % 20 == 0:
            time.sleep(10)
    data['translation'] = translated

In [228]:
translate(df, 'tweet_text')

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24


JSONDecodeError: Expecting value: line 1 column 1 (char 0)
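
A `JSONDecodeError` like this one usually means the service sent back something other than JSON. My guess is rate limiting or a blocked request, but that is an assumption. One way to keep a single bad response from aborting the whole run is to wrap each call in a retry with exponential backoff. The sketch below is generic: `call_with_retry` and its parameters are hypothetical names, and it catches every exception for simplicity, which you may want to narrow.

```python
import time

def call_with_retry(func, arg, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(arg), retrying with exponential backoff on failure.

    Returns None if every attempt fails, so one bad row does not
    abort the surrounding loop.
    """
    for attempt in range(retries):
        try:
            return func(arg)
        except Exception:
            # wait 1x, 2x, 4x, ... base_delay between attempts
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    return None

# demonstration with a stand-in function that fails twice, then succeeds
attempts = {"n": 0}

def flaky_upper(text):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ValueError("simulated non-JSON response")
    return text.upper()

print(call_with_retry(flaky_upper, "hello", base_delay=0.01))
```

Inside `translate`, the call would become something like `call_with_retry(lambda t: translator.translate(t).text, row[column])`, with `None` marking rows that still need translating.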

In [None]:
vaderSentiment(df, 'tweet_text')

In [None]:
df