# Cleaning Meltwater data

In this notebook, we clean and extract information from the text field of the Twitter data extracted from Meltwater. 

As of 30/11, the data consists of a sample collected for the period from 01/10/2020 to 29/11/2020. Tweets are only from the cities of Nottingham and Liverpool and are based on the following keyword query: `vaccin* OR vax OR vaxx OR vaxxx OR jab OR Pfizer OR AstraZeneca OR (Astra NEAR/1 Zeneca) OR Moderna OR antivac* OR antivax* OR anti-vac* OR anti-vax* OR (anti NEAR/1 vac*) OR immun*`.
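
As a rough local sanity check, the keyword query can be approximated with a regular expression. This is only a sketch: Meltwater's exact matching rules (wildcards, `NEAR/1`, case handling) are not documented in this notebook, so the pattern below, along with the names `QUERY_RE` and `matches_query`, is an assumption for illustration and not the query engine's actual behaviour. In particular, `NEAR/1` is approximated here as "adjacent, optionally separated by a space or hyphen".

```python
import re

# Hypothetical approximation of the Meltwater keyword query (assumption, not Meltwater's engine)
QUERY_RE = re.compile(
    r"\b(?:vaccin\w*"            # vaccin*
    r"|vaxx{0,2}"                # vax, vaxx, vaxxx
    r"|jab"                      # jab
    r"|pfizer"                   # Pfizer
    r"|astra[\s-]?zeneca"        # AstraZeneca, Astra NEAR/1 Zeneca (approximated)
    r"|moderna"                  # Moderna
    r"|anti[\s-]?va[cx]\w*"      # antivac*, antivax*, anti-vac*, anti-vax*, anti NEAR/1 vac* (approximated)
    r"|immun\w*)\b",             # immun*
    re.IGNORECASE,
)

def matches_query(text):
    """Return True if the text contains at least one of the approximated keywords."""
    return QUERY_RE.search(text) is not None

for tweet in ["Got my jab today", "Astra Zeneca trial news", "Nothing relevant here"]:
    print(tweet, "->", matches_query(tweet))
```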

For data protection purposes, the dataset that is used in this notebook is not provided here. If you want to replicate the analysis on this dataset, please contact the authors. 

#### Input
- Dataset with tweets from Nottingham: `Vaccines_search_Nottingham.csv`
- Datasets with tweets from Liverpool: `Vaccines_search_Liverpool_1.csv` and `Vaccines_search_Liverpool_2.csv`


#### Output
- Unprocessed dataset (combined and de-duplicated, text not yet cleaned): `Meltwater_unprocessed.csv`
- Processed dataset: `Meltwater_processed.csv`

### 1. Preliminaries

Here we load and combine the datasets and remove duplicated and null tweets. The text itself is not modified at this stage.


#### 1.1. Import packages and data

In [1]:
!pip install gensim
!pip install spacy
!python -m spacy download en_core_web_sm

In [2]:
import zipfile
import os
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.util import ngrams
import string
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
import re
import gensim
from gensim import corpora
from pprint import pprint
import spacy
spcy = spacy.load('/opt/conda/envs/Python-3.6-WMLCE/lib/python3.6/site-packages/en_core_web_sm/en_core_web_sm-2.3.1')

import warnings
warnings.filterwarnings('ignore')
def wrng():
  warnings.warn("deprecated", DeprecationWarning)

with warnings.catch_warnings():
  warnings.simplefilter("ignore")
  wrng()

nltk.download('punkt')
nltk.download('stopwords')


In [3]:
# Nottingham data
df_notts = pd.read_csv('/project_data/data_asset/Vaccines_search_Nottingham.csv')
print('Length of dataset:', len(df_notts))

In [6]:
# Liverpool data
df_liv1 = pd.read_csv('/project_data/data_asset/Vaccines_search_Liverpool_1.csv')
df_liv2 = pd.read_csv('/project_data/data_asset/Vaccines_search_Liverpool_2.csv')

print('Length of dataset:', len(df_liv1), '+', len(df_liv2))

In [7]:
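# Combine the three datasets (DataFrame.append is deprecated and was removed in pandas 2.0)
# ignore_index=True gives the combined dataframe a fresh, unique index
df = pd.concat([df_notts, df_liv1, df_liv2], ignore_index=True)
print('Length of combined dataset:', len(df))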

#### 1.2. Remove duplicates

In [8]:
def del_duplicate(df__):
    '''
    df__  -->  dataframe to be de-duplicated

    Returns a new dataframe without duplicated or null tweets
    '''
    print("\033[94m" + "\033[1m" + "Original length of data:", len(df__))
    # Remove tweets that are duplicated in terms of text, author and date
    df__ = df__.drop_duplicates(subset = ['Hit Sentence', 'Influencer', 'Date'], keep = 'first')
    print("\033[94m" + "\033[1m" + "After Removing Duplicates Total length of data is - ", len(df__))

    # Remove rows whose tweet text is null and rebuild the index
    df__ = df__.dropna(subset = ["Hit Sentence"]).reset_index(drop = True)
    print("\033[94m" + "\033[1m" + "After Removing Null Tweets Total length of data is - ", len(df__))

    return df__


df_unproc =  del_duplicate(df)
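
The de-duplication rule can be illustrated on a toy dataframe. This is a minimal sketch with made-up rows (the values below are invented for illustration and do not come from the Meltwater data): two rows are identical in text, author and date, one row shares the text but has a different author, and one row has a null tweet.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) using the same column names as the Meltwater export
toy = pd.DataFrame({
    "Hit Sentence": ["Got my jab", "Got my jab", "Got my jab", np.nan],
    "Influencer": ["@a", "@a", "@b", "@c"],
    "Date": ["01/10/2020", "01/10/2020", "01/10/2020", "02/10/2020"],
})

# Rows count as duplicates only if text, author and date all match
deduped = toy.drop_duplicates(subset=["Hit Sentence", "Influencer", "Date"], keep="first")

# Rows without tweet text are dropped afterwards
cleaned = deduped.dropna(subset=["Hit Sentence"]).reset_index(drop=True)

print(len(toy), len(deduped), len(cleaned))  # 4 3 2
```

The same text posted by a different author is kept, which is why the subset includes `Influencer` and `Date` and not only `Hit Sentence`.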

In [9]:
# Save unprocessed data
df_unproc.to_csv('/project_data/data_asset/Meltwater_unprocessed.csv', index = False)

### 2. Process data

Here we process the text and produce a "clean" version of it. The clean text can then be used for sentiment analysis and topic modelling.

At the end of this section, we will have two "clean" text fields: `Clean text_original text` and `Clean text_comment`. The distinction is necessary due to quoted tweets (QT). A quoted tweet has the following structure: `QT @user: <comment> ; <original text>`. Here, our influencer (the person whose tweet we are seeing, in Meltwater's terminology) is quoting a tweet originally published by `@user`, who posted `<original text>`. In her post, the influencer adds her own `<comment>`. Hence `Clean text_original text` refers to the tweet being quoted, whilst `Clean text_comment` is the comment by the influencer.

If the tweet is not a QT, the two columns should be the same.
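
The split between comment and original text can be sketched on a single string. This is an illustrative simplification, assuming the ` ; ` separator that the notebook's code splits on; the helper name `split_quoted_tweet` and the example tweet are hypothetical. Unlike the notebook's own regular expressions, it splits on the first ` ; ` for both fields and does not handle the `RT @user:` prefix.

```python
import re

def split_quoted_tweet(hit_sentence):
    """Return (comment, original_text) for a 'QT @user: <comment> ; <original text>' string."""
    match = re.match(r"QT @[A-Za-z0-9_-]*: (.*?) ; (.*)", hit_sentence, flags=re.DOTALL)
    if match is None:
        # Not a quoted tweet: both fields hold the same text
        return hit_sentence, hit_sentence
    return match.group(1), match.group(2)

# Made-up example tweet
example = "QT @some_user: Totally agree with this ; The vaccine trial results look promising"
comment, original = split_quoted_tweet(example)
print("Comment: ", comment)
print("Original:", original)
```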

#### 2.1. Remove RT/QT indicator

In [10]:
# Dummy variables for whether or not tweet is RT (retweet) or QT (quoted tweet)
# 1 if the tweet starts with the RT/QT identifier, 0 otherwise
df_unproc['Is RT'] = df_unproc['Hit Sentence'].str.startswith('RT @').astype(int)

df_unproc['Is QT'] = df_unproc['Hit Sentence'].str.startswith('QT @').astype(int)

In [11]:
def extract_RT_QT(_df):
    # regex=True is passed explicitly because the default of str.replace changed to regex=False in pandas 2.0
    _df['Text_original text'] = _df['Hit Sentence'].str.replace(r'RT @[A-Za-z0-9_-]*: ', '', regex = True) # Eliminates the RT identifier --> RT @someone:
    _df['Text_original text'] = _df['Text_original text'].str.replace(r'QT @*[A-Za-z0-9 \s\S]* ; ', '', regex = True) # Eliminates the QT identifier and the comment (everything up to the last ' ; '), leaving only the original tweet --> QT @someone: <comment> ;

    _df['Text_comment'] = _df['Hit Sentence'].str.replace(r'RT @[A-Za-z0-9_-]*: ', '', regex = True) # Eliminates the RT identifier --> RT @someone:
    _df['Text_comment'] = _df['Text_comment'].str.replace(r'QT @[A-Za-z0-9_-]*: ', '', regex = True) # Eliminates the QT identifier --> QT @someone:
    # For quoted tweets, keep only the text before the first ' ; ', i.e. the comment from the user
    is_qt = _df['Is QT'] == 1
    _df.loc[is_qt, 'Text_comment'] = _df.loc[is_qt, 'Text_comment'].str.split(' ; ').str[0]

    return _df

In [12]:
df_proc = extract_RT_QT(df_unproc)

#### 2.2. Clean text



In [167]:
def process(dframe_):
    '''
    dframe_  -->  dataframe to be processed
    '''
    def del_mention(sent):
        '''
        Removing all the mentions (@...) from dataset
        '''
        final_sent = re.sub(r"@[^\s]+", "", sent).replace("  ", " ")
        return final_sent
    
    def del_hash(sent):
        '''
        Removing all the Hash-tags (#...) from dataset
        '''
        final_sent = re.sub(r"#[^\s]+", "", sent).replace("  ", " ")
        return final_sent

    def del_url(sent):
        '''
        Removing all the URLs (http / https...) from dataset
        '''
        final_sent = re.sub(r'http\S+', '', sent).replace("  ", " ")
        return final_sent

    def tokenize(sent):
        '''
        Tokenizing sentences 
        '''
        final_tokens = word_tokenize(sent)
        return final_tokens

    def del_punct(sent):
        '''
        Removing all punctuations from dataset
        '''
        final_sent = re.sub(r'[^\w\s]', ' ', sent).replace("  ", " ")
        final_sent = re.sub(r'\_', ' ', final_sent).replace("  ", " ")    
        return final_sent
    
    
    """
    # This will convert chars like ÔŸóÉŻâÙÛé.... and remove chars with different languages

    def del_accnt(sent):
        final_sent = unicodedata.normalize('NFD', sent).encode('ascii', 'ignore').decode("utf-8")
        return final_sent
    """

    def del_lang_emoji_spclChar(sent):
        '''
        Removing all the special characters, emojis, words of different languages from dataset
        '''
        final_sent = re.sub("[^a-zA-Z0-9 \t\n\r\f\v]", "", sent).replace("  ", " ")
        return final_sent


    stop_words = (stopwords.words("english"))


    for text_var in ['_original text', '_comment']:
        sentences = []
        for line in dframe_["Text" + text_var].values.tolist():
            sentence = line.lower().replace("\n", " ")                                          # Replace "\n" with a space so words on adjacent lines are not glued together
            sentence = del_mention(sentence)                                   
            sentence = del_hash(sentence)                                    
            sentence = del_url(sentence) 
            sentence = del_punct(sentence)
            tokenized_word = tokenize(sentence)                                      
            words = [w for w in tokenized_word if len(w) > 2 and w not in stop_words]           # Removing stopwords and words having length <= 2
            sentence2 = " ".join(words)
            sentence2 = del_lang_emoji_spclChar(sentence2)                                      # Removing words from different language, punctuations, special chars and emojis
            tokenized_word2 = tokenize(sentence2)                                      
            words2 = [w2 for w2 in tokenized_word2 if len(w2) > 2]                              # Removing words having length <= 2 again
            sentence2 = " ".join(words2)  
            sentences.append(sentence2.strip())
        
        dframe_["Clean text" + text_var] = sentences

    return dframe_



In [13]:
%%time
# Running the above function
df_proc = process(df_proc)
df_proc.head(10)
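
To see what the cleaning steps do to a single tweet, here is a simplified, self-contained sketch. It is not the `process` function itself: it uses a plain whitespace split instead of NLTK's `word_tokenize`, filters words only once, and relies on a tiny hypothetical stop-word set (`TOY_STOP_WORDS`) instead of NLTK's English list. The example tweet is made up.

```python
import re

# Hypothetical, tiny stop-word set for illustration; the notebook uses NLTK's English list
TOY_STOP_WORDS = {"just", "had", "the", "and", "for", "this"}

def clean_text(text):
    """Simplified version of the cleaning steps applied to each tweet."""
    text = text.lower().replace("\n", " ")
    text = re.sub(r"@\S+", "", text)          # remove mentions
    text = re.sub(r"#\S+", "", text)          # remove hashtags
    text = re.sub(r"http\S+", "", text)       # remove URLs
    text = re.sub(r"[^\w\s]|_", " ", text)    # replace punctuation and underscores with spaces
    text = re.sub(r"[^a-z0-9\s]", "", text)   # drop remaining non-ASCII characters and emojis
    # Keep words longer than two characters that are not stop words
    words = [w for w in text.split() if len(w) > 2 and w not in TOY_STOP_WORDS]
    return " ".join(words)

print(clean_text("Just had my 1st jab at @NHSuk!! #vaccinated https://t.co/xyz 💉"))  # 1st jab
```

Note that hashtags are removed entirely, including their text, so `#vaccinated` does not survive as `vaccinated`.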

In [14]:
# Check there are no missing (NaN) values in our text variables after processing
assert not df_proc['Clean text_original text'].isnull().any()
assert not df_proc['Clean text_comment'].isnull().any()

assert not df_proc['Text_original text'].isnull().any()
assert not df_proc['Text_comment'].isnull().any()
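
Note that `isnull()` only detects missing values. A tweet made up entirely of mentions, hashtags, URLs or stop words becomes an empty string after cleaning, which is not null. A possible additional check, sketched here on a toy dataframe with made-up values, counts such empty strings:

```python
import pandas as pd

# Toy data (hypothetical values) mimicking the cleaned text column
toy = pd.DataFrame({"Clean text_comment": ["vaccine approved", "", "jab today"]})

has_null = toy["Clean text_comment"].isnull().any()
n_empty = (toy["Clean text_comment"].str.strip() == "").sum()

print("Any null values:", has_null)
print("Empty strings:  ", n_empty)
```

Whether empty rows should be dropped or kept depends on the downstream analysis, so this sketch only reports them.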

In [15]:
# Save processed data
df_proc.to_csv('/project_data/data_asset/Meltwater_processed.csv', index = False)

________

#### Authors
- **Álvaro Corrales Cano** is a Data Scientist within IBM's Cloud Pak Acceleration team. With a background in Economics, Álvaro specialises in a wide array of econometric techniques and causal inference, including regression, discrete choice models, time series and duration analysis.
- **Ananda Pal** is a Data Scientist and Performance Test Analyst at IBM, where he specialises in Data Science and Machine Learning Solutions. 

Copyright © IBM Corp. 2020. Licensed under the Apache License, Version 2.0. Released as licensed Sample Materials.
