# Preprocessing

In this notebook we want to accomplish the following (this list will be updated):
* Prepare the tweet_text so that we can use it in NLP algorithms. For example, we can:
  * Convert words to lower case (not hard, so we will do it later).
  * Apply stemming (not hard, so we will do it later).
  * Delete punctuation/special characters. We may first study the relevance of punctuation/special characters, since it may be better not to delete them.
  * Create functions to count punctuation and special characters.
  * Create functions to count and extract hashtags, tags, and emojis.
  * Assign a sentiment score to each tweet using a pretrained NLP tool.
* Study a way to process the location_geocode data.
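
As a rough sketch of the counting helpers listed above, the snippet below counts punctuation characters and prefixed tokens using only the standard library. The names `count_punctuation` and `count_prefixed` and the sample tweet are hypothetical and are not used elsewhere in this notebook.

```python
import string
from collections import Counter

def count_punctuation(tweet):
    """Count each ASCII punctuation character that occurs in the tweet."""
    return Counter(ch for ch in tweet if ch in string.punctuation)

def count_prefixed(tweet, prefix):
    """Count whitespace-separated tokens that start with `prefix`
    and contain at least one more character after it."""
    return sum(1 for w in tweet.split() if w.startswith(prefix) and len(w) > 1)

# Hypothetical sample tweet, for illustration only
sample = "Great news!!! #vote @someone #election :)"
punct = count_punctuation(sample)
n_hashtags = count_prefixed(sample, "#")
n_tags = count_prefixed(sample, "@")
print(punct, n_hashtags, n_tags)
```

Note that `string.punctuation` covers ASCII punctuation only, so characters such as curly quotes would need to be added separately.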


In [1]:
#import some packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import collections
import math
import re
from datetime import datetime
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
import emoji

stopwords = nltk.corpus.stopwords.words("english")

sns.set_theme(style="darkgrid")

In [2]:
df_tw  = pd.read_csv("Data/CleanedData.csv")
df_geo = pd.read_csv("Data/location_geocode.csv")

In [3]:
#We will find/build tools to process text.

#Create a list of hashtags in a given tweet
def extract_hashtags(tweet):
    return [w[1:] for w in tweet.split() if w.startswith('#') ]

#Create a list of tags in a given tweet
def extract_tags(tweet):
    return [w[1:] for w in tweet.split() if w.startswith('@') ]

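#Create a list of emojis in a given tweet
#Note: emoji.UNICODE_EMOJI is only available in emoji < 2.0; newer releases provide emoji.EMOJI_DATA instead.
def extract_emojis(tweet):
    return [ch for ch in tweet if ch in emoji.UNICODE_EMOJI['en']]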
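
The split-based helpers above only find hashtags and tags at the start of a whitespace-separated token, and they keep any trailing punctuation. One possible alternative, sketched here with regular expressions, is shown below. The names `extract_hashtags_re` and `extract_tags_re` and the sample tweet are hypothetical, and the pattern `\w+` is an assumption about what counts as a valid hashtag or handle.

```python
import re

# Assumption: a hashtag or handle is a run of word characters after '#' or '@'
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")

def extract_hashtags_re(tweet):
    """Return hashtags without the leading '#', wherever they occur."""
    return HASHTAG_RE.findall(tweet)

def extract_tags_re(tweet):
    """Return tagged handles without the leading '@', wherever they occur."""
    return MENTION_RE.findall(tweet)

# Hypothetical sample tweet, for illustration only
sample = "Thanks @narendramodi, please #vote. (#election2019)"
print(extract_hashtags_re(sample))
print(extract_tags_re(sample))
```

The trade-off is that these patterns also match inside other strings, for example the part after `@` in an e-mail address, so the results may need filtering depending on the data.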


In [4]:
tweet_example = list(df_tw['full_text'].loc[[1]])[0]

print(extract_hashtags(tweet_example))
print(extract_tags(tweet_example))
print(extract_emojis(tweet_example))

[]
['narendramodi', 'smritiirani']
['🙏']


In [5]:
#We will use a pretrained sentiment analyzer (VADER) to assign polarity scores to our tweets.
#Initially, for simplicity, the plan was to label each tweet only as positive or negative, treating compound = 0 as negative.
#On reflection, we will also assign a neutral label, following the approach in
#https://towardsdatascience.com/comparing-vader-and-text-blob-to-human-sentiment-77068cf73982.
#I think the threshold should be a small value, around ±0.05.

tweet_example = list(df_tw['full_text'].loc[[11]])[0]
sia = SentimentIntensityAnalyzer()
sia.polarity_scores(tweet_example)['compound']

#value is a positive number < 1 that sets the range of neutral tweets: a tweet is neutral when -value < compound < value.
def get_sentiment_label(tweet, value):
    compound = sia.polarity_scores(tweet)['compound']
    if compound >= value:
        return 'positive'
    elif -value < compound < value:
        return 'neutral'
    else:
        return 'negative'
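
To see how the choice of threshold changes the labels, the sketch below applies the same rule as `get_sentiment_label` directly to compound scores, so it runs without VADER. The helper name `label_from_compound` and the list of scores are hypothetical.

```python
def label_from_compound(compound, value):
    """Apply the same thresholding rule as get_sentiment_label to a raw compound score."""
    if compound >= value:
        return 'positive'
    if compound > -value:
        return 'neutral'
    # compound <= -value, including the boundary compound == -value
    return 'negative'

# Hypothetical compound scores, compared under two thresholds
for score in (0.6, 0.1, 0.0, -0.1, -0.2, -0.6):
    print(score, label_from_compound(score, 0.2), label_from_compound(score, 0.05))
```

A wider band such as 0.2 moves mildly positive or mildly negative tweets into the neutral class, which is worth keeping in mind when reading the label counts.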


In [6]:
#Some examples
tweet_example = list(df_tw['full_text'].loc[[1343]])[0]
tweet_example,get_sentiment_label(tweet_example,0.2), sia.polarity_scores(tweet_example)

('Australia Votes Out Far-Right Lawmaker Egged By Teen  Dont egg vote.  https://t.co/KLN2EWRGk9',
 'neutral',
 {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0})

In [7]:
#Here we prepare the data for some visualization. We will include the following:
# tweet, sentiment_label, emojis, hashtags, tags, likes, retweets, date

tweets = dict(df_tw['full_text'])
sentiment_label = {}
emojis = {}
hashtags = {}
tag = {}
likes = dict(df_tw['favorite_count'])
retweet = dict(df_tw['retweet_count'])
date = {}
for i in df_tw.index:
    created_at = df_tw['created_at'].loc[i]
    #Keep the placeholder 'No Date' when created_at is missing (not a string)
    date_temp = 'No Date'
    if isinstance(created_at, str):
        date_temp = datetime.strptime(created_at, '%Y-%m-%d %H:%M:%S')
    tweet_temp = df_tw['full_text'].loc[i]
    sentiment_label[i] = get_sentiment_label(tweet_temp, .2)
    emojis[i] = extract_emojis(tweet_temp)
    hashtags[i] = extract_hashtags(tweet_temp)
    tag[i] = extract_tags(tweet_temp)
    date[i] = date_temp
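
The date handling in the loop above could also be factored into a small helper, which makes the fallback easier to test on its own. This is only a sketch: the name `parse_created_at` and the example timestamp are hypothetical, and the format string is the one used in the loop.

```python
from datetime import datetime

def parse_created_at(value, fmt='%Y-%m-%d %H:%M:%S'):
    """Parse a created_at string; return the placeholder 'No Date' for non-string values such as NaN."""
    if isinstance(value, str):
        return datetime.strptime(value, fmt)
    return 'No Date'

# Hypothetical values, for illustration only
parsed = parse_created_at('2019-05-18 10:30:00')
missing = parse_created_at(float('nan'))
print(parsed, missing)
```

As in the loop, a string that does not match the format still raises a `ValueError`, so malformed timestamps are not silently dropped.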



In [8]:
df_prepro = pd.DataFrame([tweets, sentiment_label, emojis, hashtags, tag, likes, retweet, date]).transpose()

In [9]:
df_prepro = df_prepro.rename(columns = {0 : "tweet", 1: "sentiment_label", 2: 'emojis', 3: 'hashtags', 4: "tags", 5: "likes", 6: "retweets", 7: "date"})

In [10]:
set(df_prepro['sentiment_label'])

{'negative', 'neutral', 'positive'}

In [11]:
collections.Counter(list(df_prepro['sentiment_label']))

Counter({'positive': 67663, 'neutral': 65973, 'negative': 49734})
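
The counts above are easier to compare as shares of the total. A small sketch, with the counts copied from the output of the previous cell:

```python
from collections import Counter

# Counts copied from the Counter output above
counts = Counter({'positive': 67663, 'neutral': 65973, 'negative': 49734})

total = sum(counts.values())
# Percentage share of each label, rounded to one decimal place
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}
print(total)
print(shares)
```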

In [13]:
#This is the data set after deleting some repeated/incorrect entries
df_prepro.to_csv('Data/PreProData.csv', index=False)