# Politician Rating Prediction by Analysing Twitter Data

We start by importing the NLTK package and the twitter_samples dataset from the NLTK package.

In [1]:
import nltk
from nltk.corpus import twitter_samples

### The twitter_samples dataset contains 3 json file: 
1. negative_tweets.json 
2. positive_tweets.json
3. tweets.20150430-223406.json

As the name suggests, the first file contains negative tweets, the second contains positive tweets, and the third contains test data to test its accuracy.

In [2]:
fileids = [fileid for fileid in twitter_samples.fileids()]
print(fileids[:])

['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']


A sample of the data is show below.

In [3]:
strings = twitter_samples.strings(fileids[0])

for string in strings[:10]:
    print(string)

hopeless for tmr :(
Everything in the kids section of IKEA is so cute. Shame I'm nearly 19 in 2 months :(
@Hegelbon That heart sliding into the waste basket. :(
“@ketchBurning: I hate Japanese call him "bani" :( :(”

Me too
Dang starting next week I have "work" :(
oh god, my babies' faces :( https://t.co/9fcwGvaki0
@RileyMcDonough make me smile :((
@f0ggstar @stuartthull work neighbour on motors. Asked why and he said hates the updates on search :( http://t.co/XvmTUikWln
why?:("@tahuodyy: sialan:( https://t.co/Hv1i0xcrL2"
Athabasca glacier was there in #1948 :-( #athabasca #glacier #jasper #jaspernationalpark #alberta #explorealberta #… http://t.co/dZZdqmf7Cz


##### As you can see, the data contains tweets by real users, and the above tweets are from the <em>negative_tweets.json</em> file. It requires quite a bit of preprocessing to make it ready and trainable. This requires a few steps:

1. Remove the twitter handle mentions (as they don't serve purpose in determining whether a tweet is negative or positive and may lead to bias).
2. Word Tokenize the sentences.
3. Part of speech (pos) tagging for easier lemmatization
4. Lemmatization using the correct pos-tagging
5. Filter stop words to reduce the dataset size

The <b>remove_handle</b> functions uses a regular expression to identify twitter handle mentions (ex: @TheRealDonald) and remove them.

In [4]:
import re

def remove_handle(string):
    regex = r"(^|[^@\w])@(\w{1,15})\b"
    return re.sub(regex, '', string)

The <b>get_wordnet_pos</b> function returns the correct Wordnet POS object for the given POS tag.

In [5]:
from nltk.corpus import wordnet
def get_wordnet_pos(pos):
    if pos.startswith('J'):
        return wordnet.ADJ
    elif pos.startswith('V'):
        return wordnet.VERB
    elif pos.startswith('N'):
        return wordnet.NOUN
    elif pos.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

#### The main processing functions is given below:

The first step is to import the required packages. The <em> stopwords</em> package contains a corpus of stopwords. Stop words are words that don't particularly add meaning and is useful only for human communication and does not help computer understand meaning.

We use a special kind of tokenizer called TweetTokenizer, which can identify emoticons and group them together instead of separating the characters like a standard word_tokenizer would do. This helps in maintaining the emoticons which can give an insight into whether a tweet is negative or positive.

The WordNetLemmatizer is a powerful lemmatizer. A lemmatizer reduces a given word to its root word. <br>
<em>Example: Better -> Good and Eating->Eat </em> <br> This can prevent bias and improve reliability.

The <b>process_string</b> function performs the 5 steps mentioned previously to each string.

In [6]:
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
from nltk.stem import WordNetLemmatizer

def process_string(string):
    string = remove_handle(string) #remove twitter handle
    
    tokenizer = TweetTokenizer()
    word_tokens = tokenizer.tokenize(string)#tokenize the sentence
    
    tagged = nltk.pos_tag(word_tokens)#pos tagging
    
    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(w, get_wordnet_pos(pos)) for (w , pos) in tagged]
    
    stop_words = set(stopwords.words('english'))
    filtered_sentence = [w for w in word_tokens if not w in stop_words] #filter stop words
    
    return filtered_sentence

The <b>process_data</b> functions processes the whole document of data.

In [7]:
def process_data(data_strings):
    processed_data = []
    
    for string in data_strings:
        processed_data.append(process_string(string))
    
    return processed_data

The below code processes every tweet from both <em>negative_tweets.json</em> and <em>positive_tweets.json</em>, and adds the class for each tweet (pos/neg). Then all the data is combined together and shuffled. Shuffling has to be done so that any chance of overfitting can be reduced.

In [8]:
processed_data = []

for fileid in fileids[:2]:
    strings = twitter_samples.strings(fileid)
    
    p_s = process_data(strings)
    for p in p_s:
        processed_data.append((p, fileid[:3]))

In [9]:
import random
random.shuffle(processed_data)

In [10]:
for data in processed_data[:10]:
    print(data)

(['Shopping', 'fun', ':('], 'neg')
(['emon', ':', ')'], 'pos')
(['I', 'wanna', 'go', 'favorite', 'spot', '&', 'watch', 'sun', 'rise', "that's", 'way', 'across', 'town', ':('], 'neg')
(["That's", 'shocking', '!', 'Why', "can't", 'people', 'live', '&', 'let', 'live', '?', 'Sometimes', 'I', 'despair', 'world', '!', ':(', 'Hope', 'OK', '.'], 'neg')
(['🐢', '—', 'I', 'like', '🐢', ':)', 'http://t.co/kPl3kaPqGo'], 'pos')
(["Don't", 'get', 'hope', ',', 'must', 'travel', 'million', 'years', 'get', 'first', 'place', ':)', '#fls', '#sorry'], 'pos')
(['well', 'kids', 'broke', 'get', 'early', ':)'], 'pos')
(['OKAY', 'PLEAS', 'EDOMNT', 'FCK', 'QIH', 'MY', 'FEELINGS', 'SHXBS', 'ENOUGH', '!', '!', '!', ':(', '#ZaynIsComingBackOnJuly26'], 'neg')
(['#AskUrban', '.', 'Ask', 'us', 'question', '.', ':)'], 'pos')
(['Morning', 'All', ':)'], 'pos')
