<a href="https://colab.research.google.com/github/frederik-kilpinen/ASDS2/blob/main/Notebooks/data_processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Processing

**By: Frederik, Connor, Matias, Lukas**

This notebook contains all the data-processing steps taken before the analysis. The data comes from two sources:
1. Metadata about Australian parliamentarians (MPs) from the research project http://twitterpoliticians.org/download. We call this the MP data.
2. The latest 3200 tweets from Australian MPs that we have collected. We call this the tweet data.

In short we do the following preprocessing steps:

1. Process the MP data by:
    * subsetting relevant variables
    * renaming Nick Xenophon Team to Centre Alliance (its later name)
    * removing titles such as Mr or Ms from MP names
2. Merge the two data sets
3. Keep only tweets posted while the author was an active MP
4. Subset on the period from 1 year before the bushfire (1 June 2018) to 1 year after the bushfire (1 May 2021)
5. Clean the tweet text by:
    * lower-casing the text
    * removing special characters, punctuation, symbols, mentions, emojis, URLs and digits
    * removing English stop words (nltk)
    * adding columns that retain the lemmas and the stems
    * adding columns that retain selected parts of speech from the lemmas and stems
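
Steps 2 to 4 can be sketched on toy data as follows. This is a minimal illustration, not the pipeline itself: all rows are made up, and the legislative period is stored as a string here to match the comparison used in `compile_final_df` (the real column type depends on how the CSV is read).

```python
import pandas as pd

# Toy stand-ins for the tweet data and the MP data (all values are made up).
tweets = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "created_at": pd.to_datetime(["2018-05-01", "2019-01-15", "2020-02-01", "2020-03-01"]),
    "full_text": ["too early", "period 45 tweet", "period 46 tweet", "no MP match"],
})
mps = pd.DataFrame({
    "user_id": [1, 1, 2],
    "name": ["MP One", "MP One", "MP Two"],
    "legislative_period": ["45", "46", "46"],
})

# Step 2: left-merge, so each tweet gets one row per legislative period of its author.
merged = tweets.merge(mps, on="user_id", how="left")

# Step 3: keep a row only if the tweet date falls in the matching period.
cutoff = pd.Timestamp("2019-07-01")
active = merged.loc[
    ((merged["legislative_period"] == "45") & (merged["created_at"] < cutoff))
    | ((merged["legislative_period"] == "46") & (merged["created_at"] > cutoff))
]

# Step 4: restrict to the study window.
window = active.loc[
    (active["created_at"] >= "2018-06-01") & (active["created_at"] <= "2021-04-30")
].reset_index(drop=True)

print(window[["user_id", "created_at", "legislative_period"]])
```

Because the merge duplicates a tweet for every period its author served in, the period filter in step 3 is what brings the data back to at most one row per tweet.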


The final data frame contains the following columns:

* screen_name
* user_id 
* tweet_id
* created_at
* full_text
* favorite_count
* retweet_count
* in_reply_to_screen_name
* hashtags
* user_mentions
* url
* image_url
* name
* party
* legislative_period
* lemmas
* stems
* pos_lemmas 
* pos_stems
    

In [32]:
#Necessary imports
import pandas as pd
import numpy as np
from tqdm import tqdm
import tweepy
from datetime import date
import pickle 
import time
import matplotlib.pyplot as plt
import re
import string
import nltk
from nltk.tokenize import TweetTokenizer
from collections import defaultdict
from textblob import TextBlob

# Also needed below (lemmatizer, POS tagger, stop words). Uncomment on a fresh environment such as Google Colab:
#nltk.download('wordnet')
#nltk.download('averaged_perceptron_tagger')
#nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [33]:
class DataProcessing:
    
    def __init__(self):
        
        #Set the file path. Change if necessary
        #tweet_data_path = "data/mp_tweets.csv"
        #mp_data_path = "data/full_member_info.csv"

        tweet_data_path = "/content/drive/MyDrive/Digital methods/mp_tweets"
        mp_data_path = "/content/drive/MyDrive/Digital methods/full_member_info.csv"

        self.tweet_data = pd.read_csv(tweet_data_path, index_col = 0)
        self.mp_data = pd.read_csv(mp_data_path)
    
    
    def compile_final_df(self):
        """
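        Compile the final data set used in our analysis:
            1. Clean the tweet data and the MP data
            2. Merge them on user_id
            3. Keep tweets posted while the author was an active MP, within the study period
            4. Add the lemmas, stems, pos_lemmas and pos_stems columns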
        """
        start_time = time.time()
        #Clean the Twitter data
        tweet_df = self.clean_tweet_data(self.tweet_data)
        #Clean the mp_info data
        mp_df = self.clean_mp_data(self.mp_data)
        
        print("-"*66)
        print(f"Shape of twitter data: {tweet_df.shape}\nShape of MP data: {mp_df.shape}")
        
        #Merge to final df
        final_df = tweet_df.merge(mp_df, on = "user_id", how = "left")
        
        #Subset on active MPs
        final_df = final_df.loc[((final_df["legislative_period"] == "45") & (final_df["created_at"] < "2019-07-01"))|
                                ((final_df["legislative_period"] == "46") & (final_df["created_at"] > "2019-07-01"))]
        
        #Subset tweets from 1 year before the bushfire (1. June 2018) and 1 year after the bushfire (1. May 2021)
        final_df = final_df.loc[(final_df["created_at"] >= "2018-06-01") & (final_df["created_at"] <= "2021-04-30")]
           
        #Resetting index for final df
        final_df = final_df.reset_index(drop = True)
        
        print("-"*66)
        print(f"Shape of final data-frame: {final_df.shape}" )
        print("Time to execute: ", "--- %s seconds ---" % (time.time() - start_time))
        start_time = time.time()
        print("-"*66)
        print("Begining to process the tweet text. Restarting timer...")
        
        #Get the stems and lemmas
        final_df["lemmas"] = final_df["full_text"].apply(lambda tweet: self.preprocess_lemma(tweet))
        final_df["stems"] = final_df["full_text"].apply(lambda tweet: self.preprocess_stem(tweet))
        
        # replace the nan values with empty strings
        final_df["lemmas"] = final_df["lemmas"].apply(lambda x: "" if str(x) == "nan" else x)
        print("Lematizing finished at: ", "--- %s seconds ---" % (time.time() - start_time))
        final_df["stems"] = final_df["stems"].apply(lambda x: "" if str(x) == "nan" else x)
        print("Stemming finished at: ", "--- %s seconds ---" % (time.time() - start_time))
        
        #Create column of part-of-speech from lemmas and stems
        final_df["pos_lemmas"] = final_df["lemmas"].apply(lambda tweet: self.get_pos(tweet))
        final_df["pos_stems"] = final_df["stems"].apply(lambda tweet: self.get_pos(tweet))
        
        print("-"*66)
        print("FINISHED: time to execute: ", "--- %s seconds ---" % (time.time() - start_time))

        return final_df

    def clean_tweet_data(self, tweet_df):


        #Drop 6 corrupt tweets by position. Since only 6 tweets are affected, we drop them instead of re-running the collection from the API
        remove_idx = [175522, 190414, 211953, 212012, 212013, 212298 ]
        tweet_df = tweet_df.drop(tweet_df.index[remove_idx])

        #Convert created_at to datetime and drop the time of day (keep the date only)
        tweet_df["created_at"] = pd.to_datetime(pd.to_datetime(tweet_df["created_at"]).dt.date)
        
        tweet_df["user_id"] = tweet_df["user_id"].astype(int)
        
        return tweet_df
    
    def clean_mp_data(self, mp_df):
        
        #Select relevant columns
        mp_df = mp_df[['p.country', 'm.name', 'p.party', 'm.uid', 'lp.official_legislative_period']]
        mp_df = mp_df.loc[mp_df["p.country"]=="Australia"]
        
        #Drop the country column (only Australian rows remain)
        mp_df = mp_df.drop(columns = ["p.country"])
        #Rename some columns
        mp_df = mp_df.rename(columns = {"m.name":"name", "p.party":"party",
                                       "lp.official_legislative_period":"legislative_period"})
        
        #Rename the user id column so it can be merged with the tweet data
        mp_df = mp_df.rename(columns = {"m.uid":"user_id"})
        
        #remove titles from the names (note: the first alternative would also strip a two-letter first name at the start of the string)
        remove = r"(^[A-Za-z]{2}\s{1}|\s{1}[A-Z]{2,}|^Hon\s{1}|^Mrs\s{1}|(Dr\s)|,)"
        mp_df["name"] = mp_df["name"].str.replace(remove, "", regex = True)
        
        #remove mps that don't have twitter
        mp_df = mp_df.loc[mp_df["user_id"] != "\\N"]
        mp_df["user_id"] = mp_df["user_id"].astype(int)
        
        # Merge the Nick Xenophon Team and Centre Alliance 
        mp_df["party"] = mp_df["party"].apply(lambda x: "Centre Alliance" if x == "Nick Xenophon Team" else x)
        
        return mp_df
    
    def preprocess_text(self, text):

        #Lowercasing words
        text = str(text)
        text = text.lower()

        #Removing '&amp' which was found to be common
        text = re.sub(r'&amp','', text)

        #Replace other instances of "&" with "and"
        text = re.sub(r'&','and', text)

        #Removing mentions 
        text = re.sub(r'@\w+ ', '', text)

        #Removing leading 'rt @user: ' and 'via @user: ' prefixes (the text is already lower-cased)
        text = re.sub(r'(^rt|^via)((?:\b\W*@\w+)+): ', '', text)

        #Removing emojis
        EMOJI_PATTERN = re.compile(
          "["
          "\U0001F1E0-\U0001F1FF"  # flags (iOS)
          "\U0001F300-\U0001F5FF"  # symbols & pictographs
          "\U0001F600-\U0001F64F"  # emoticons
          "\U0001F680-\U0001F6FF"  # transport & map symbols
          "\U0001F700-\U0001F77F"  # alchemical symbols
          "\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
          "\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
          "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
          "\U0001FA00-\U0001FA6F"  # Chess Symbols
          "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
          "\U00002702-\U000027B0"  # Dingbats
          "\U000024C2-\U0001F251" 
          "]+"
          )
        
        text = re.sub(EMOJI_PATTERN, '', text)

        #Removing punctuation
        my_punctuation = string.punctuation.replace('#','')
        my_punctuation = my_punctuation.replace('-','')

        text = text.translate(str.maketrans('', '', my_punctuation))
        text = re.sub(r' - ','', text) #removing dash lines bounded by whitespace (and therefore not part of a word)
        text = re.sub(r'[’“”—,!]','',text) #removing punctuation that is not captured by string.punctuation

        #Removing odd special characters
        text = re.sub(r"[┻┃━┳┓┏┛┗]","", text)
        text = re.sub(r"\u202F|\u2069|\u200d|\u2066","", text)

        #Removing URLs
        text = re.sub(r'http\S+', '', text)

        #Removing numbers
        text = re.sub(r'[0-9]','', text)

        #Removing separators and superfluous whitespace
        text = text.strip()
        text = re.sub(r' +',' ',text)

        #Tokenizing
        tokenizer = TweetTokenizer()
        tokens = tokenizer.tokenize(text)

        return tokens


    def preprocess_lemma(self, tokens):

        #Clean and tokenize the raw tweet text (the argument arrives as a string, not as tokens)
        tokens = self.preprocess_text(tokens)

        #Lemmatizing
        tag_map = defaultdict(lambda : nltk.corpus.wordnet.NOUN)      #POS map
        tag_map['J'] = nltk.corpus.wordnet.ADJ
        tag_map['V'] = nltk.corpus.wordnet.VERB
        tag_map['R'] = nltk.corpus.wordnet.ADV    

        lemmatizer = nltk.WordNetLemmatizer()             #Creating lemmatizer.
        text_lemmatized = []                              #Empty list to save lemmatized sentence

        for word, tag in nltk.pos_tag(tokens):
            lemma = lemmatizer.lemmatize(word, tag_map[tag[0]])
            text_lemmatized.append(lemma)

        tokens = text_lemmatized

        #Removing stopwords
        stop_words_list = nltk.corpus.stopwords.words("english")
        text = " ".join([i for i in tokens if i not in stop_words_list])

        return text

    def preprocess_stem(self, tokens):

        #Clean and tokenize the raw tweet text (the argument arrives as a string, not as tokens)
        tokens = self.preprocess_text(tokens)

        #Removing stopwords
        stop_words_list = nltk.corpus.stopwords.words("english")
        tokens = [i for i in tokens if i not in stop_words_list]

        #Stemming
        stemmer = nltk.PorterStemmer()    #Creating stemmer
        sent_stemmed = []                 #Empty list to save stemmed sentence

        for word in tokens:
            stem = stemmer.stem(word)     #Stemming words
            sent_stemmed.append(stem)

        tokens = sent_stemmed

        return " ".join(tokens)
    
    
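    def get_pos(self, text):
        #Keep words tagged NN or NNP. Note: "VD" is not a Penn Treebank tag (past-tense verbs are "VBD"), so it presumably never matches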
        blob = TextBlob(text)
        pos = [word for (word,tag) in blob.tags if tag in ["NN", "NNP", "VD"]]
        
        return " ".join(pos)
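
To see what the regular-expression steps in `preprocess_text` do, the following minimal sketch applies a subset of them to a made-up tweet. It uses only the standard library and skips the emoji, special-character and tokenizing steps; the helper name `strip_noise` and the sample text are hypothetical.

```python
import re
import string


def strip_noise(text):
    """Apply a subset of the cleaning steps from preprocess_text (illustrative only)."""
    text = str(text).lower()
    text = re.sub(r"&amp", "", text)   # drop HTML-escaped ampersands
    text = re.sub(r"&", "and", text)   # any remaining "&" becomes "and"
    text = re.sub(r"@\w+ ", "", text)  # mentions that are followed by a space
    # Leading "rt @user: " or "via @user: " prefix
    text = re.sub(r"(^rt|^via)((?:\b\W*@\w+)+): ", "", text)
    # Remove punctuation, but keep "#" and "-" so hashtags and hyphenated words survive
    drop = string.punctuation.replace("#", "").replace("-", "")
    text = text.translate(str.maketrans("", "", drop))
    text = re.sub(r"http\S+", "", text)  # URLs (their punctuation is already gone)
    text = re.sub(r"[0-9]", "", text)    # digits
    return re.sub(r" +", " ", text.strip())  # trim and collapse repeated spaces


sample = "RT @example_user: Bushfire relief &amp; recovery: $53m for #Australia! https://example.org/x1"
print(strip_noise(sample))  # bushfire relief recovery m for #australia
```

The leftover `m` from `$53m` shows why the stop-word step matters afterwards: single-letter fragments like this one are in the NLTK English stop-word list and are removed there.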

In [34]:
processor = DataProcessing()
final_df = processor.compile_final_df()

------------------------------------------------------------------
Shape of twitter data: (335969, 12)
Shape of MP data: (258, 4)
------------------------------------------------------------------
Shape of final data-frame: (170338, 15)
Time to execute:  --- 0.9455933570861816 seconds ---
------------------------------------------------------------------
Begining to process the tweet text. Restarting timer...
Lematizing finished at:  --- 354.2302391529083 seconds ---
Stemming finished at:  --- 354.2888534069061 seconds ---
------------------------------------------------------------------
FINISHED: time to execute:  --- 735.1697945594788 seconds ---


In [35]:
final_df.shape

(170338, 19)

In [36]:
final_df.head()

Unnamed: 0,screen_name,user_id,tweet_id,created_at,full_text,favorite_count,retweet_count,in_reply_to_screen_name,hashtags,user_mentions,url,image_url,name,party,legislative_period,lemmas,stems,pos_lemmas,pos_stems
0,AlanTudgeMP,185932331,1.388275e+18,2021-04-30,Get the fundamentals right and lift our game. ...,18,7.0,,[],[],https://www.theaustralian.com.au/inquirer/get-...,,Alan Tudge,Liberal Party of Australia,46,get fundamental right lift game see thought pr...,get fundament right lift game see thought prio...,lift game priority curriculum review,game nation curriculum review
1,AlanTudgeMP,185932331,1.388274e+18,2021-04-30,RT @australian: State and federal education mi...,0,6.0,,[],['australian'],,,Alan Tudge,Liberal Party of Australia,46,state federal education minister set oppose el...,state feder educ minist set oppos element prop...,state education minister element school curric...,state feder minist element school curriculum b...
2,AlanTudgeMP,185932331,1.388039e+18,2021-04-30,Great support for our $53m Higher Education su...,12,3.0,,[],[],https://twitter.com/ITECAust/status/1387941814...,,Alan Tudge,Liberal Party of Australia,46,great support high education support package,great support higher educ support packag,support education support package,support support packag
3,AlanTudgeMP,185932331,1.388035e+18,2021-04-30,RT @IndependentHEA: IHEA welcomes today’s anno...,0,2.0,,[],"['IndependentHEA', 'AlanTudgeMP', 'stuartrober...",,,Alan Tudge,Liberal Party of Australia,46,ihea welcome todays announcement unveil suite ...,ihea welcom today announc unveil suit budget m...,welcome announcement suite budget measure …,ihea welcom today suit budget measur provid …
4,AlanTudgeMP,185932331,1.388031e+18,2021-04-30,RT @ITECAust: The Australian — As @TimDoddEDU ...,0,1.0,,[],"['ITECAust', 'TimDoddEDU']",,,Alan Tudge,Liberal Party of Australia,46,australian report iteca welcome australian gov...,australian report iteca welcom australian gove...,report government support …,report iteca welcom support provid …


In [None]:
final_df.to_csv("data/final_df.csv")
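
When the saved file is read back, two details matter: `to_csv` writes the index as an unnamed first column, and `read_csv` turns empty fields (such as tweets whose `lemmas` ended up empty) back into `NaN` by default. The sketch below shows one way to handle both, assuming a later notebook reloads the file with pandas. It writes made-up rows to a temporary file rather than touching the real `data/final_df.csv`.

```python
import os
import tempfile

import pandas as pd

# Made-up rows; the second tweet has no tokens left after cleaning.
df = pd.DataFrame({
    "created_at": pd.to_datetime(["2020-01-01", "2020-01-02"]),
    "lemmas": ["bushfire relief", ""],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "final_df.csv")
    df.to_csv(path)  # the index is written as an unnamed first column

    reloaded = pd.read_csv(
        path,
        index_col=0,                 # use that first column as the index again
        parse_dates=["created_at"],  # restore the datetime dtype
    )

# Empty fields come back as NaN, so fill them before any text processing.
reloaded["lemmas"] = reloaded["lemmas"].fillna("")
print(reloaded.dtypes)
```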