# Preprocessing

In this notebook we want to accomplish the following (this list will be updated):
* Prepare the tweet_text so that we can use it in NLP algorithms. For example, we can:
  * Convert words to lower case (not hard, so we will do it later).
  * Apply stemming (not hard, so we will do it later).
  * Delete punctuation/special characters. We may first study the relevance of punctuation/special characters, since it might be better not to delete them.
  * Create functions to count punctuation and special characters.
  * Create functions to count and extract hashtags, tags, and emojis.
  * Assign a sentiment score to each tweet using a pretrained NLP tool.
* Study a way to process the location_geocode data.
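
The counting functions for punctuation and special characters mentioned in the list could look roughly like the sketch below. This is only one possible approach: the function names `count_punctuation` and `count_special_characters` are hypothetical, "punctuation" is assumed to mean the ASCII characters in `string.punctuation`, and "special character" is assumed to mean anything that is not alphanumeric, whitespace, or ASCII punctuation (so emojis and curly quotes would be counted there too).

```python
import string
from collections import Counter

def count_punctuation(tweet):
    """Count the ASCII punctuation characters (string.punctuation) in a tweet."""
    return Counter(ch for ch in tweet if ch in string.punctuation)

def count_special_characters(tweet):
    """Count characters that are not alphanumeric, whitespace, or ASCII punctuation."""
    return Counter(
        ch for ch in tweet
        if not ch.isalnum() and not ch.isspace() and ch not in string.punctuation
    )

# Made-up sample text; \u2019 is a curly apostrophe, which is not ASCII punctuation
sample = "Sad day!!! #auspol @someone won\u2019t forget :("
punct = count_punctuation(sample)
special = count_special_characters(sample)
print(punct)
print(special)
```

Returning a `Counter` rather than a single number keeps the per-character breakdown, which may help when studying whether individual characters are relevant; `sum(punct.values())` gives the total.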


In [1]:
#import some packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import collections
import math
import re
from datetime import datetime
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
import emoji
from nltk.stem.snowball import EnglishStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
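# Unicode ranges that we treat as emojis. The flag range matches single
# regional-indicator characters, and a flag is a pair of them.
emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags=re.UNICODE)

# Requires the NLTK stop-word corpus: nltk.download("stopwords")
stop_words = nltk.corpus.stopwords.words("english")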

sns.set_theme(style="darkgrid")

In [2]:
df_tw  = pd.read_csv("Data/CleanedData.csv")
df_geo = pd.read_csv("Data/location_geocode.csv")

In [3]:
#Tools to process the tweet text.

#Return the list of hashtags in a given tweet (without the leading '#')
def extract_hashtags(tweet):
    return [w[1:] for w in tweet.split() if w.startswith('#')]

#Return the list of tags in a given tweet (without the leading '@')
def extract_tags(tweet):
    return [w[1:] for w in tweet.split() if w.startswith('@')]

#Return the emojis in a given tweet, one character per list entry.
#Known issue: a flag is split into its two regional-indicator characters.
def extract_emojis(tweet):
    emojis_list = []
    for match in emoji_pattern.findall(tweet):
        emojis_list.extend(match)
    return emojis_list
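
One possible way around the flag issue is sketched below. A flag is encoded as two consecutive regional-indicator symbols (U+1F1E6 to U+1F1FF), so the pattern tries to match such a pair first and only then falls back to single emoji characters. The names `flag_aware_pattern` and `extract_emojis_with_flags` are hypothetical, and the sketch still ignores other multi-character emojis (for example sequences joined with a zero-width joiner).

```python
import re

# Try a pair of regional-indicator symbols (one flag) before single emojis
flag_aware_pattern = re.compile(
    "[\U0001F1E6-\U0001F1FF]{2}"   # flags: two regional indicators
    "|[\U0001F600-\U0001F64F"      # emoticons
    "\U0001F300-\U0001F5FF"        # symbols & pictographs
    "\U0001F680-\U0001F6FF]"       # transport & map symbols
)

def extract_emojis_with_flags(tweet):
    """Return emojis in order of appearance, keeping each flag as one string."""
    return flag_aware_pattern.findall(tweet)

# "Jai hind" followed by the flag of India and the folded-hands emoji,
# written with escapes so that the code points are explicit
sample = "Jai hind \U0001F1EE\U0001F1F3\U0001F64F"
flags_and_emojis = extract_emojis_with_flags(sample)
print(flags_and_emojis)
```

For this sample the result has two entries, the flag as a single two-character string and the folded hands, instead of the three entries that `extract_emojis` returns.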


In [4]:
#Print the tweets in rows 1-199 that contain at least one emoji
for i in range(1, 200):
    tweet_example = df_tw['full_text'].loc[i]
    found_emojis = extract_emojis(tweet_example)
    if found_emojis:
        print(tweet_example)
        print(found_emojis)

@narendramodi @smritiirani Coverage of indian election on SBS tv channel, Australia. Jai hind 🇮🇳🙏 https://t.co/90qplBEAf8
['🇮', '🇳', '🙏']
😭Sad day. @tanya_plibersek is the kind of leader Australia needs. https://t.co/Fjwy0arcCt
['😭']
Latest #blameafarmer hysteria 
-  approx 85000 farms in Australia (employing about 300000 in total) being blamed for an election result with an electorate of 16,855,289 😂😂😂 https://t.co/25DoEvt1tT
['😂', '😂', '😂']
@elcokolo @OsbornBrett @LyleShelton @Pontifex @jesus @ScottMorrisonMP @IzzyFolau @HorizonSydney @bradbonhomme With respect Elizabeth i hear your not interested tweet, will not tag again. Election is over g8

🇦🇺 Australia has elected a prime minister who also regularly attends ⛪️ Political social media involving Christianity won’t disappear

Jesus is either friend or foe

It ws fun

👋Bye
['🇦', '🇺', '👋']
Fraser Anning, the far-right Australian lawmaker egged by teen after New Zealand mosque terror attack, voted out of office 💁‍♂️https://t.co/bMThAsz

In [5]:
#We will use a pretrained sentiment analyzer (VADER) to assign polarity scores to our tweets.
#The first idea was to label each tweet only as positive or negative (treating compound = 0 as negative).
#We changed our mind and now also assign a neutral label, following the approach in
#https://towardsdatascience.com/comparing-vader-and-text-blob-to-human-sentiment-77068cf73982.
#I feel we should use a small threshold around zero, such as ±0.05.

tweet_example = list(df_tw['full_text'].loc[[11]])[0]
sia = SentimentIntensityAnalyzer()
sia.polarity_scores(tweet_example)['compound']

#`value` is a positive number < 1 that determines the range of neutral tweets:
#a tweet is neutral when -value < compound < value.
def get_sentiment_label(tweet, value):
    compound = sia.polarity_scores(tweet)['compound']
    if compound >= value:
        return 'positive'
    elif -value < compound < value:
        return 'neutral'
    else:
        return 'negative'
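
To see how the boundaries of the neutral band behave, the thresholding can be checked on its own, without running VADER. The helper `label_from_compound` below is hypothetical; it only mirrors the comparisons in `get_sentiment_label` and takes the compound score directly.

```python
def label_from_compound(compound, value):
    """Map a compound score in [-1, 1] to a label using a symmetric neutral band."""
    if compound >= value:
        return "positive"
    if compound <= -value:
        return "negative"
    return "neutral"

# Illustrative scores around the band edges for value = 0.2
for score in (-0.5, -0.2, 0.0, 0.19, 0.2):
    print(score, label_from_compound(score, 0.2))
```

Note that both edges of the band are excluded from neutral: a score equal to `value` is positive and a score equal to `-value` is negative.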


In [6]:
#Some examples
tweet_example = list(df_tw['full_text'].loc[[1343]])[0]
tweet_example,get_sentiment_label(tweet_example,0.2), sia.polarity_scores(tweet_example)

('Australia Votes Out Far-Right Lawmaker Egged By Teen  Dont egg vote.  https://t.co/KLN2EWRGk9',
 'neutral',
 {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0})

In [7]:
stemmer = EnglishStemmer()

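#Note: the comparison is case-sensitive and the NLTK stop words are lower case,
#so capitalized stop words (e.g. "After") are kept.
def remove_stop_words(doc):
    words = doc.split()
    words = [word for word in words if word not in stop_words]
    #words =  [word for word in words if word.isalnum()]
    processed_text = " ".join(words)
    return processed_text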

def remove_emojis(doc):
    return emoji_pattern.sub(r'', doc)

def remove_url(doc):
    return re.sub(r'http\S+', '',doc)

def remove_hashsymbol(doc):
    return doc.replace('#',"")


#The default CountVectorizer analyzer lower-cases the text and splits it into word tokens
analyzer = CountVectorizer().build_analyzer()
def stemming(doc):
    return " ".join(stemmer.stem(w) for w in analyzer(doc))

#Comment out the steps you don't want to use. I don't know which combination is best.
def tweet_preprocess(doc):
   text_temp = doc
   text_temp = remove_url(text_temp)
   text_temp = remove_emojis(text_temp)
   text_temp = remove_hashsymbol(text_temp)
   text_temp = remove_stop_words(text_temp)
   text_temp = stemming(text_temp)
   return text_temp
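
Because `remove_stop_words` compares tokens exactly as written, a capitalized stop word at the start of a tweet survives the filter. If that is not wanted, a case-insensitive variant could look like the sketch below. The name `remove_stop_words_casefold` is hypothetical, and `EXAMPLE_STOP_WORDS` is a tiny made-up set used only to keep the example self-contained; in the notebook the NLTK list would be passed instead.

```python
# Tiny illustrative stop-word set (an assumption, not the NLTK list)
EXAMPLE_STOP_WORDS = frozenset({"after", "the", "we", "it", "do", "is", "a"})

def remove_stop_words_casefold(doc, stop_words=EXAMPLE_STOP_WORDS):
    """Drop every token whose lower-cased form is in stop_words."""
    return " ".join(w for w in doc.split() if w.lower() not in stop_words)

cleaned = remove_stop_words_casefold("After the climate election: We do care")
print(cleaned)
```

Using a set instead of a list also makes each membership test constant time, which may matter with as many tweets as we have here.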

In [8]:
#Here we prepare the data for some visualization. We will include the following:
# tweet, sentiment_label, emojis, hashtags, tags, likes, retweet, date

tweets = list(df_tw['full_text'])
sentiment_label = []
emojis = []
hashtags = []
tag = []
likes = list(df_tw['favorite_count'])
retweet = list(df_tw['retweet_count'])
date = []


for i in df_tw.index:
    created_at = df_tw['created_at'].loc[i]
    #Entries that are not strings (e.g. missing values) get a placeholder
    date_temp = 'No Date'
    if isinstance(created_at, str):
        date_temp = datetime.strptime(created_at, '%Y-%m-%d %H:%M:%S')
    tweet_temp = df_tw['full_text'].loc[i]
    #append avoids copying the whole list on every iteration
    sentiment_label.append(get_sentiment_label(tweet_temp, .2))
    emojis.append(extract_emojis(tweet_temp))
    hashtags.append(extract_hashtags(tweet_temp))
    tag.append(extract_tags(tweet_temp))
    date.append(date_temp)


#TODO
#tweet_prepro = [tweet_preprocess(tw) for tw in tweets]
#number_words = [ len([w for w in tw.split()]) for tw in tweets ]
#number_emoji = [ len(tw) for tw in emojis ]
#number_hash = [ len(tw) for tw in hashtags ]
#number_tags = [ len(tw) for tw in tags ]

In [9]:
tweet_prepro = [tweet_preprocess(tw) for tw in tweets]
number_words = [ len(tw.split()) for tw in tweets ]
number_emoji = [ len(tw) for tw in emojis ]
number_hash = [ len(tw) for tw in hashtags ]
number_tags = [ len(tw) for tw in tag ]

In [10]:
df_prepro = pd.DataFrame([tweets, sentiment_label, emojis, hashtags, tag, likes, retweet, date,number_words, number_emoji, number_hash,
       number_tags, tweet_prepro]).transpose()

In [11]:
df_prepro = df_prepro.rename(columns = {0 : "tweet", 1: "sentiment_label", 2: 'emojis', 3: 'hashtags', 4: "tags", 5: "likes", 6: "retweets", 7: "date", 8: 'number_words', 9: 'number_emoji', 10:'number_hash', 11:'number_tags', 12:'tweet_prepro' })

In [12]:
df_prepro.head(10)

| | tweet | sentiment_label | emojis | hashtags | tags | likes | retweets | date | number_words | number_emoji | number_hash | number_tags | tweet_prepro |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | After the climate election: shellshocked green... | positive | [] | [] | [] | 0.0 | 0.0 | 2019-05-20 09:13:44 | 10 | 0 | 0 | 0 | after climat elect shellshock green group rema... |
| 1 | @narendramodi @smritiirani Coverage of indian ... | neutral | [🇮, 🇳, 🙏] | [] | [narendramodi, smritiirani] | 0.0 | 0.0 | 2019-05-20 09:13:43 | 15 | 3 | 0 | 2 | narendramodi smritiirani coverag indian elect ... |
| 2 | @workmanalice Do you know if Facebook is relea... | negative | [] | [] | [workmanalice] | 0.0 | 0.0 | 2019-05-20 09:13:33 | 25 | 0 | 0 | 1 | workmanalic do know facebook releas elect post... |
| 3 | @vanbadham We all understand we have a compuls... | negative | [] | [] | [vanbadham] | 0.0 | 0.0 | 2019-05-20 09:13:29 | 38 | 0 | 0 | 1 | vanbadham we understand compulsori prefer syst... |
| 4 | Shares were mixed in Asia, with India and Aust... | positive | [] | [] | [] | 0.0 | 0.0 | 2019-05-20 09:13:23 | 25 | 0 | 0 | 0 | share mix asia india australia lead gain regio... |
| 5 | Australia's pollsters to review incorrect elec... | neutral | [] | [] | [] | 0.0 | 0.0 | 2019-05-20 09:13:21 | 8 | 0 | 0 | 0 | australia pollster review incorrect elect fore... |
| 6 | It is disappointing that @tanya_plibersek has ... | negative | [] | [auspol] | [tanya_plibersek, AustralianLabor] | 0.0 | 0.0 | 2019-05-20 09:12:57 | 20 | 0 | 1 | 2 | it disappoint tanya_plibersek rule australianl... |
| 7 | @robynesc I feel like this exact thing happens... | positive | [] | [] | [robynesc] | 0.0 | 0.0 | 2019-05-20 09:12:28 | 40 | 0 | 0 | 1 | robynesc feel like exact thing happen australi... |
| 8 | 'Quiet Australians' are the latest to upset el... | negative | [] | [] | [] | 0.0 | 0.0 | 2019-05-20 09:12:04 | 11 | 0 | 0 | 0 | quiet australian latest upset elect forecast e... |
| 9 | Conservatives look set to form gov't after Aus... | neutral | [] | [] | [] | 0.0 | 0.0 | 2019-05-20 09:11:47 | 10 | 0 | 0 | 0 | conserv look set form gov australia vote |


In [13]:
set(df_prepro['sentiment_label'])

{'negative', 'neutral', 'positive'}

In [14]:
collections.Counter(list(df_prepro['sentiment_label']))

Counter({'positive': 67663, 'neutral': 65973, 'negative': 49734})
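
The raw counts are easier to compare as shares of the whole data set. The sketch below copies the counts printed above into a `Counter` and turns them into fractions; the names `label_counts` and `label_shares` are only illustrative.

```python
from collections import Counter

# Counts copied from the Counter output above
label_counts = Counter({"positive": 67663, "neutral": 65973, "negative": 49734})

total = sum(label_counts.values())
label_shares = {label: count / total for label, count in label_counts.items()}

for label, share in label_shares.items():
    print(f"{label}: {share:.1%}")
```

With the threshold of 0.2 used here, positive and neutral tweets are close to each other in number, while negative tweets are the smallest group.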

In [15]:
#This is the data set after deleting some repeated/incorrect entries
df_prepro.to_csv('Data/PreProData.csv', index=False)