# Preprocessing Twitter data

This notebook is to pre-process the Twitter data for topic modeling and sentiment analysis.

Data cleaning:
- Limit to tweets in English
- Transform to all lowercase
- Remove URLs and HTML reference characters
- Remove placeholders
- Remove non-letter characters
- Removes unnecessary columns
- Change date to pd.datetime

To discuss with Steve:
- Remove stop words? [This article](https://www.aclweb.org/anthology/L14-1265/) says that removing stop words might affect sentiment analysis performance, but might be necessary for topic modeling
- Stem/lemmatize the words?

### To-do:
- [X] Pre-process data
- [X] Run preliminary sentiment analysis
- [ ] Tweak sentiment analysis
- [ ] Use n-gram finder
- [ ] Split data into different dataframes based on sentiment analysis.
- [ ] Analyze by date?

In [1]:
import pandas as pd
import re
import os

import nltk
from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()

In [2]:
# Pre-process the data
def preprocess_data(df):
    '''
    Pre-processes the data as described above
    '''
    processed_df = df.loc[df.lang == "en", :].copy()
    columns_to_keep = [
        'date', 'content', 'url', 'coordinates', 'place', 'id', 'username', 
        'replyCount', 'retweetCount', 'likeCount', 'quoteCount',
        'conversationId', 'retweetedTweet', 'quotedTweet', 'outlinks', 
        'tcooutlinks', 'media', 'mentionedUsers'
    ]
    
    processed_df = processed_df[columns_to_keep]
    processed_df.date = pd.to_datetime(processed_df.date[:10], yearfirst=True, format="%Y-%m-%d")
    processed_df.content = processed_df.content.str.lower()
    processed_df.content = processed_df.content.apply(lambda x: re.sub(r'https?:\/\/\S+', '', x))
    processed_df.content = processed_df.content.apply(lambda x: re.sub(r"www\.[a-z]?\.?(com)+|[a-z]+\.(com)", '', x))
    processed_df.content = processed_df.content.apply(lambda x: re.sub(r'{link}', '', x))
    processed_df.content = processed_df.content.apply(lambda x: re.sub(r"\[video\]", '', x))
    processed_df.content = processed_df.content.apply(lambda x: re.sub(r'&[a-z]+;', '', x))
    processed_df.content = processed_df.content.apply(lambda x: re.sub(r"[^a-z\s\(\-:\)\\\/\];='#]", '', x))
    
    processed_df['tokens'] = processed_df['content'].apply(tknzr.tokenize)
    
    return processed_df

In [3]:
# Load Twitter data
path, dirs, files = next(os.walk("data/"))
df_list = []
  
# Pre-process
for file in files:
    print("Working on:", file)
    raw_df = pd.read_csv("data/" + file)
    clean_df = preprocess_data(raw_df)
    df_list.append(clean_df)

Working on: tweets_9.csv
Working on: tweets_8.csv
Working on: tweets_29.csv
Working on: tweets_15.csv
Working on: tweets_14.csv
Working on: tweets_28.csv
Working on: tweets_16.csv
Working on: tweets_17.csv
Working on: tweets_13.csv
Working on: tweets_12.csv
Working on: tweets_10.csv
Working on: tweets_38.csv
Working on: tweets_39.csv
Working on: tweets_11.csv
Working on: tweets_20.csv
Working on: tweets_34.csv
Working on: tweets_35.csv
Working on: tweets_21.csv
Working on: tweets_37.csv
Working on: tweets_23.csv
Working on: tweets_22.csv
Working on: tweets_36.csv
Working on: tweets_32.csv
Working on: tweets_26.csv
Working on: tweets_27.csv
Working on: tweets_33.csv
Working on: tweets_25.csv
Working on: tweets_31.csv
Working on: tweets_19.csv
Working on: tweets_18.csv
Working on: tweets_30.csv
Working on: tweets_24.csv
Working on: tweets_3.csv
Working on: tweets_2.csv
Working on: tweets_40.csv
Working on: tweets_1.csv
Working on: tweets_5.csv
Working on: tweets_4.csv
Working on: tweets_

In [4]:
# Append all the dataframes and save to data directory
main_df = pd.concat([df for df in df_list], ignore_index=True)
main_df.to_csv('data/preprocessed_data.csv')

In [5]:
main_df.shape

(246014, 19)

In [6]:
main_df.columns

Index(['date', 'content', 'url', 'coordinates', 'place', 'id', 'username',
       'replyCount', 'retweetCount', 'likeCount', 'quoteCount',
       'conversationId', 'retweetedTweet', 'quotedTweet', 'outlinks',
       'tcooutlinks', 'media', 'mentionedUsers', 'tokens'],
      dtype='object')

# Sentiment Analysis - Exploration

Messing around with NLTK library, following [this tutorial](https://towardsdatascience.com/step-by-step-twitter-sentiment-analysis-in-python-d6f650ade58d)

In [35]:
from textblob import TextBlob
from nltk.sentiment import SentimentIntensityAnalyzer

In [40]:
main_df[['polarity', 'subjectivity']] = main_df.content.apply(lambda Text: pd.Series(TextBlob(Text).sentiment))

for index, row in main_df.content.iteritems():
    score = SentimentIntensityAnalyzer().polarity_scores(row)
    neg = score['neg']
    neu = score['neu']
    pos = score['pos']
    comp = score['compound']
    
    if neg > pos:
        main_df.loc[index, 'sentiment'] = "negative"
    elif pos > neg:
        main_df.loc[index, 'sentiment'] = "positive"
    else:
        main_df.loc[index, 'sentiment'] = "neutral"
        
    main_df.loc[index, 'neg'] = neg
    main_df.loc[index, 'neu'] = neu
    main_df.loc[index, 'pos'] = pos
    main_df.loc[index, 'compound'] = comp

In [41]:
main_df.head(10)

Unnamed: 0,date,content,url,coordinates,place,id,username,replyCount,retweetCount,likeCount,...,media,mentionedUsers,tokens,polarity,subjectivity,sentiment,neg,neu,pos,compound
0,2021-03-30 08:50:14+00:00,louder #stopasianhate #stopaapihate,https://twitter.com/jennicav95/status/13768190...,,,1376819042892148740,jennicav95,0,0,1,...,[None],[None],"[louder, #stopasianhate, #stopaapihate]",0.0,0.0,neutral,0.0,1.0,0.0,0.0
1,2021-03-30 08:50:14+00:00,benciintears we stand against racial discrimin...,https://twitter.com/000mariaaa000/status/13768...,,,1376819041533329410,000mariaaa000,1,0,0,...,[None],"[[User(username='benciintears', displayname='b...","[benciintears, we, stand, against, racial, dis...",0.285714,0.535714,negative,0.204,0.701,0.095,-0.5574
2,2021-03-30 08:50:13+00:00,vantaehworld we stand against racial discrimin...,https://twitter.com/vcuttparis/status/13768190...,,,1376819040606359552,vcuttparis,0,0,0,...,[None],"[[User(username='vantaehworld', displayname='a...","[vantaehworld, we, stand, against, racial, dis...",0.285714,0.535714,negative,0.225,0.671,0.104,-0.5574
3,2021-03-30 08:50:13+00:00,rosesblackswan mal ya bunlar harbi mal\nwe sta...,https://twitter.com/yunoyuno65/status/13768190...,,,1376819037519417345,yunoyuno65,0,0,2,...,[None],"[[User(username='rosesblackswan', displayname=...","[rosesblackswan, mal, ya, bunlar, harbi, mal, ...",0.285714,0.535714,negative,0.173,0.747,0.08,-0.5574
4,2021-03-30 08:50:13+00:00,lacittadi you i and we all have the right to b...,https://twitter.com/FlLTERM0ON/status/13768190...,,,1376819036445667330,FlLTERM0ON,0,1,1,...,[None],"[[User(username='LacittaDi', displayname='SMER...","[lacittadi, you, i, and, we, all, have, the, r...",0.285714,0.535714,positive,0.0,0.838,0.162,0.4767
5,2021-03-30 08:50:11+00:00,tkrinces we stand against racial discriminatio...,https://twitter.com/jiminielatte/status/137681...,,,1376819032125497345,jiminielatte,0,0,1,...,[None],"[[User(username='tkrinces', displayname='tinus...","[tkrinces, we, stand, against, racial, discrim...",0.285714,0.535714,negative,0.198,0.71,0.092,-0.5574
6,2021-03-30 08:50:11+00:00,sputniktr btsturkey we stand against racial di...,https://twitter.com/ARMY45161493/status/137681...,,,1376819030342926340,ARMY45161493,0,1,8,...,[None],"[[User(username='sputnik_TR', displayname='Spu...","[sputniktr, btsturkey, we, stand, against, rac...",0.285714,0.535714,negative,0.204,0.701,0.095,-0.5574
7,2021-03-30 08:50:10+00:00,herallizm we stand against racial discriminati...,https://twitter.com/qupiqupie/status/137681902...,,,1376819027696349184,qupiqupie,1,0,1,...,[None],"[[User(username='herallizm', displayname=""Hera...","[herallizm, we, stand, against, racial, discri...",0.285714,0.535714,negative,0.198,0.71,0.092,-0.5574
8,2021-03-30 08:50:10+00:00,imonnielab sonunda bighito\n\nyou i and we all...,https://twitter.com/Bangtanswagtan/status/1376...,,,1376819025204891649,Bangtanswagtan,0,0,0,...,[None],"[[User(username='iMONNIELAB', displayname='fes...","[imonnielab, sonunda, bighito, you, i, and, we...",0.285714,0.535714,positive,0.0,0.819,0.181,0.4767
9,2021-03-30 08:50:09+00:00,stayforyouth \n\nwe stand against racial discr...,https://twitter.com/kookiie_7_/status/13768190...,,,1376819022663139330,kookiie_7_,0,0,0,...,[None],"[[User(username='stayforyouth', displayname='h...","[stayforyouth, we, stand, against, racial, dis...",0.285714,0.535714,negative,0.204,0.701,0.095,-0.5574
