## Data Pre-processing for Topic Modelling

In this notebook, we apply a few further pre-processing steps to our data to make it ready for topic modelling: removing blank tweets, merging original tweets with user comments, removing junk words and stopwords, and lemmatising.

For data protection reasons, the dataset used in this notebook is not provided here. If you want to replicate the analysis on this dataset, please contact the authors.

#### Input
- Our processed Meltwater dataset: `Meltwater_processed.csv`

#### Output
- A slightly more processed dataset that's suitable for topic modelling: `vacc_proc_for_topicMdl.csv`

### Importing Necessary Libraries

In [None]:
# ----------------------------------------
# Libraries that need to be installed
# ----------------------------------------

!pip install TextBlob
!pip install gensim
!pip install spacy
!python -m spacy download en_core_web_sm


# ----------------------------------------    
# For File operations
# ----------------------------------------

import zipfile

# ----------------------------------------
# Data read, write and other operations on Texts
# ----------------------------------------

import pandas as pd
import numpy as np
import string
import re
import unicodedata
from pprint import pprint

# ----------------------------------------
# Libraries for NLP applications
# ----------------------------------------

import nltk
from nltk.corpus import stopwords
from nltk.util import ngrams
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
import gensim
import spacy
spcy = spacy.load('/opt/conda/envs/Python-3.6-WMLCE/lib/python3.6/site-packages/en_core_web_sm/en_core_web_sm-2.3.1')
from gensim import corpora
from gensim.models import CoherenceModel
from textblob import TextBlob

# ----------------------------------------
# For ignoring some warnings
# ----------------------------------------

import warnings
warnings.filterwarnings('ignore')
def wrng():
  warnings.warn("deprecated", DeprecationWarning)

with warnings.catch_warnings():
  warnings.simplefilter("ignore")
  wrng()


# ----------------------------------------    
# Need to download some extras
# ----------------------------------------

nltk.download('punkt')
nltk.download('stopwords')

### Reading the data


In [None]:
Unproc_df = pd.read_csv("/project_data/data_asset/Meltwater_processed.csv")
pd.set_option('display.max_columns', None)  # Showing all columns for that dataframe

### Checking for Blank Tweets

In [None]:
print("\033[94m" + "\033[1m" + "Before Removing Null Tweets Total length of Data is - ", len(Unproc_df))

# Remove rows in which the original tweet text is missing
if Unproc_df["Clean text_original text"].isna().any():
    Unproc_df = Unproc_df.dropna(subset=["Clean text_original text"]).reset_index(drop=True)

# Remove rows in which the comment text is missing
if Unproc_df["Clean text_comment"].isna().any():
    Unproc_df = Unproc_df.dropna(subset=["Clean text_comment"]).reset_index(drop=True)

print("\033[94m" + "\033[1m" + "After Removing Null Tweets Total length of Data is - ", len(Unproc_df))

### Appending 'User Comments' and 'Original Tweet'

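The pre-processed dataset stores user comments and original tweets in two separate columns. For quote tweets (rows where `Is QT` is not 0), we append the comment to the original text; for all other rows we keep the original text alone. Merging the two columns gives the topic model the full text of each tweet.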
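
The merge rule can be illustrated on a toy DataFrame. This is a minimal sketch that uses `numpy.where` as a vectorised alternative to an explicit loop; the column names are those of the dataset described above, while the rows themselves are invented for illustration:

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical): one ordinary tweet and one quote tweet
toy = pd.DataFrame({
    "Is QT": [0, 1],
    "Clean text_original text": ["vaccine rollout starts today", "new trial results published"],
    "Clean text_comment": ["", "great news"],
})

# Keep the original text for ordinary tweets; append the comment for quote tweets
toy["Clean_sentence_Comment"] = np.where(
    toy["Is QT"] == 0,
    toy["Clean text_original text"],
    toy["Clean text_original text"] + " " + toy["Clean text_comment"],
)

print(toy["Clean_sentence_Comment"].tolist())
```

The result has one merged string per row, which is the form the later cleaning steps operate on.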

In [None]:
sntnccmts__ = []


for qt_, txt_, cmt_ in zip(Unproc_df["Is QT"], Unproc_df["Clean text_original text"], Unproc_df["Clean text_comment"]):
    if qt_ == 0:
        sntnccmts__.append(txt_)
    else:
        sntnccmts__.append(txt_ + " " + cmt_)
        
Unproc_df["Clean_sentence_Comment"] = sntnccmts__

### Cleaning Junk-words

Tweets contain many junk words (such as 'looool' or '5555666') that add nothing but noise. The code below treats any word in which the same character appears three or more times in a row as a junk word. Here, we:
- First, fetch all junk words and verify whether we are missing any important information.
- Second, add them to our stopwords list and remove them from our corpus.
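
The two steps above can be sketched on a single toy sentence. This sketch assumes that a junk word is any word containing the same character three or more times in a row; the function name `find_junk_words` and the sample sentence are hypothetical:

```python
import re

# Any character followed by at least two more copies of itself
REPEAT_PATTERN = re.compile(r"(.)\1\1+")


def find_junk_words(sentence):
    """Return the words that contain a run of three or more identical characters."""
    return [word for word in sentence.split() if REPEAT_PATTERN.search(word)]


sample = "looool the vaccine is gooood 5555666 indeed"

# Step 1: collect the junk words so that they can be inspected
junk = find_junk_words(sample)
print(junk)

# Step 2: treat them as extra stopwords and drop them from the text
junk_set = set(junk)
cleaned = " ".join(word for word in sample.split() if word not in junk_set)
print(cleaned)
```

Words with only a doubled letter, such as 'vaccine' or 'indeed', are not flagged, which is why the list is worth checking by eye before removal.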

In [None]:
# ----------------------------------------    
# Here, finding and checking junk-words
# ----------------------------------------

def repeats_(df__):
    """
    Collect words that contain a character repeated three or more times in a row.

    df__  -->  dataframe to be searched
    """

    repeated_words = []

    for sent_ in df__["Clean_sentence_Comment"].tolist():
        for ww in sent_.split():
            # (.)\1\1+ matches any character followed by at least two copies of itself
            if re.search(r'(.)\1\1+', ww):
                repeated_words.append(ww)

    print("\033[94m" + "\033[1m" + "Total count of repetitive words for dataset is - ", len(set(repeated_words)))
    print("\033[94m" + "\033[1m" + "Some repetitive words for dataset are:\n" + "\033[0m", repeated_words[0:20])

    return repeated_words


def tokenize(sent):
    """
    Split a sentence into word tokens.

    sent  -->  sentence to be tokenized
    """
    final_tokens = word_tokenize(sent)
    return final_tokens


# ----------------------------------------    
# Now, removing junk-words
# ----------------------------------------


def del_excess_(df__, repeats__):
    """
    Remove stopwords, junk words and words of length <= 2 from each tweet.

    df__  -->  dataframe to be modified
    repeats__  -->  extra stopwords found by repeats_()
    """

    # A set avoids adding the same junk word more than once and makes lookups fast
    stop_words_ext = set(stopwords.words('english'))
    stop_words_ext.update(repeats__)

    sentences = []
    for line in df__["Clean_sentence_Comment"].values.tolist():
        tokenized_word = tokenize(line)
        # Keep only words longer than 2 characters that are not stopwords
        words = [w for w in tokenized_word if len(w) > 2 and w not in stop_words_ext]
        sentence = " ".join(words)
        sentences.append(sentence.strip())

    df__["Clean_sentence_Comment"] = sentences

    return df__



_repeats_ = repeats_(Unproc_df)

Unproc_df = del_excess_(Unproc_df, _repeats_)

### Lemmatization

A word can appear in several different forms that share the same meaning. The base form they share is called the _lemma_. For instance, the lemma of the word 'best' is 'good'. Reducing words to their lemmas can be very helpful in topic modelling, because the different forms of a word are then counted as a single term.
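
As a rough illustration of the idea, the sketch below maps word forms to lemmas with a tiny hand-written lookup table. The table `TOY_LEMMAS` and the function name `lemmatise` are hypothetical; they only stand in for a real lemmatiser such as the spaCy model used in this notebook, which covers a far larger vocabulary:

```python
# Hypothetical lookup table: word form -> lemma (illustration only)
TOY_LEMMAS = {
    "best": "good",
    "better": "good",
    "vaccines": "vaccine",
    "vaccinated": "vaccinate",
}


def lemmatise(sentence, lemmas=TOY_LEMMAS):
    """Replace each word with its lemma when the table knows it, else keep the word."""
    return " ".join(lemmas.get(word, word) for word in sentence.split())


print(lemmatise("vaccinated people feel better"))
```

Words that are missing from the table pass through unchanged, so 'people' and 'feel' are kept as they are.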

In [None]:

def lemm_(df__):
    """
    Replace each tweet with the space-joined lemmas of its tokens.

    df__  -->  the dataframe to be modified
    """

    # Lemmatization begins

    all_lemmas = []

    for tweet in df__["Clean_sentence_Comment"]:
        sentence = spcy(tweet)
        text = ' '.join([elem.lemma_ for elem in sentence])
        all_lemmas.append(text)

    df__["Clean_sentence_Comment"] = all_lemmas

    return df__


processed_tweets_Vaccs_ = lemm_(Unproc_df)

print("\033[94m" + "\033[1m" + "Before Removing Null Tweets Total length of Data is - ", len(processed_tweets_Vaccs_))

# -----------------------------------
# Removing rows with null tweets from the processed data
# -----------------------------------

if processed_tweets_Vaccs_["Clean_sentence_Comment"].isna().any():
    processed_tweets_Vaccs_ = processed_tweets_Vaccs_.dropna(subset=["Clean_sentence_Comment"]).reset_index(drop=True)

        
print("\033[94m" + "\033[1m" + "After Removing Null Tweets Total length of Data is - ", len(processed_tweets_Vaccs_))
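
After stopword removal and lemmatisation, some tweets may be left as empty strings. pandas does not treat an empty string as a missing value, so a null check alone will not catch them. They only turn into missing values once the saved CSV is read back with default settings. A minimal sketch, on a toy DataFrame, of one way to drop such rows before saving; the helper name `drop_empty_text` and the toy rows are hypothetical:

```python
import pandas as pd


def drop_empty_text(df, column):
    """Drop rows whose text in `column` is missing, empty or only whitespace."""
    text = df[column]
    keep = text.notna() & text.astype(str).str.strip().ne("")
    return df.loc[keep].reset_index(drop=True)


# Toy rows (hypothetical): two usable tweets, two blank ones and one missing value
toy = pd.DataFrame(
    {"Clean_sentence_Comment": ["vaccine work well", "", "   ", None, "get second dose"]}
)

result = drop_empty_text(toy, "Clean_sentence_Comment")
print(len(toy), "->", len(result))
```

Whether blank tweets should be dropped here or handled when the file is loaded for topic modelling is a design choice; this sketch only shows the first option.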

### Saving Final Dataset

In [None]:
processed_tweets_Vaccs_.to_csv('/project_data/data_asset/vacc_proc_for_topicMdl.csv', index = False)

  
  
### Author:

-  **Ananda Pal** is a Data Scientist and Performance Test Analyst at IBM, where he specialises in Data Science and Machine Learning Solutions.

Copyright © IBM Corp. 2020. Licensed under the Apache License, Version 2.0. Released as licensed Sample Materials.