# _Text Normalization and Tokenization_

Machine learning algorithms don't work on raw text directly, but there is a workaround. By pre-processing the text and then converting it into a numerical format (i.e. vectors of numbers), we get something that can be fed into ML algorithms. But what does text pre-processing entail exactly?

This is where things get interesting. At a high level, pre-processing removes as much noise as possible from the text data so that an algorithm can more readily find any potential patterns. But what counts as noise depends heavily on the type of text. For example, you do not want to use the same pre-processing techniques when analyzing Tweets as when analyzing novels. Sure, there may be some overlap, but these two kinds of text differ significantly, not only in their function but in the text patterns that exist within them (after all, you won't see any emojis in Dostoyevsky's __The Brothers Karamazov__...).

That being said, there is a core set of text processing strategies that will help you get started:
- __Lowercasing__: Lowercasing all of the text lets us capture a word that appears in several forms because of inconsistent capitalization. For example, a text may include `America`, `aMerica`, and `AMERICA`. We know these are all the same word, but a machine doesn't; it treats them as three different words. Once everything is lowercased, the machine sees three occurrences of `america` instead of one occurrence each of three words.
- __Stemming__: This means looking for the "root" of a word. Plenty of words appear in multiple related forms. Take the word `connect`; some of its forms include `connected`, `connection`, and `connects`. With stemming, we can crudely reduce these forms to the root word by chopping off their endings.
- __Lemmatization__: Similar to stemming in that its goal is to remove inflections, but it does so in a less crude way. It attempts to map each word to its actual dictionary form. Take the word `geese`, which is an inflection of `goose`. Lemmatization can change it back to `goose`, which simply chopping letters off the end cannot do.
- __Removing Stopwords__: A lot of the words in a text carry little meaning on their own. Examples include `a`, `this`, `and`, etc. What is the benefit of removing them? In theory, it allows us to keep only the important words. Let's take a look at the following sentence: `John is going to the store.` Now, let's remove `is`, `to`, and `the`: `John going store`. While it isn't grammatically correct, you still get the primary concept, that John is going to the store. It may look odd to a human reader, but this strategy has the potential to help a machine.
- __Text Normalization__: Due to the character limit on Tweets, people often use non-standard forms of words. One example is `omg`, which stands for `oh my god`; another is `2mrw` as a stand-in for `tomorrow`. This is a pretty common pattern in social media text, so it is a technique to seriously consider for this project.
- __Text Enrichment / Augmentation__: Believe it or not, this strategy augments the text (i.e. adds information that wasn't there before) in the hope of improving its predictive power. Sub-strategies include things like part-of-speech tagging or dependency parsing.
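
To make a few of these strategies concrete, here is a minimal sketch of lowercasing, slang expansion, stopword removal, and crude suffix-chopping in plain Python. The word lists and the names `STOPWORDS`, `SLANG`, `SUFFIXES`, `crude_stem`, and `basic_preprocess` are hypothetical and hand-picked for illustration; they are not what this notebook uses later, and a real project would rely on a library's stopword list and a proper stemmer or lemmatizer.

```python
import re

# Hypothetical, hand-picked resources for illustration only.
STOPWORDS = {"a", "an", "and", "is", "the", "this", "to"}
SLANG = {"omg": "oh my god", "2mrw": "tomorrow"}
SUFFIXES = ("ion", "ed", "s")  # checked in this order


def crude_stem(word):
    """Chop one known suffix off the end of a word (very rough stemming)."""
    for suffix in SUFFIXES:
        # keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def basic_preprocess(text):
    # lowercase, then split on anything that is not a word character
    tokens = re.findall(r"\w+", text.lower())
    # expand non-standard forms; an expansion may contain several words
    tokens = [w for tok in tokens for w in SLANG.get(tok, tok).split()]
    # drop stopwords, then stem what is left
    tokens = [tok for tok in tokens if tok not in STOPWORDS]
    return [crude_stem(tok) for tok in tokens]


print(basic_preprocess("John is going to the store."))
print([crude_stem(w) for w in ["connected", "connection", "connects"]])
```

Note that suffix chopping cannot turn `geese` into `goose`; that is the gap lemmatization is meant to fill.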

But before we can start doing this, let's load in all the functions that we previously created in the 8.0 notebook, to address issues unique to Tweets:
- Links to Twitter pictures, YouTube videos, and other assorted URLs
- Mentions of other Twitter users, and hashtags
- Emojis

We'll create a wrapper function that runs all of these functions together, so we don't have to call them individually.

In [1]:
%load_ext autoreload
%load_ext line_profiler
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format='retina'

In [34]:
import re
import numpy as np
import pandas as pd
import os
import emoji
import spacy
import string
pd.options.display.max_columns = None
from tqdm.autonotebook import tqdm
tqdm.pandas()
import warnings
warnings.simplefilter("once")



## _Load Data_

In [3]:
df = pd.read_pickle("playground_data/covid19_0320_0324_updated_v2.pkl")
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3364618 entries, 2020-03-24 23:59:59 to 2020-03-20 01:37:05
Data columns (total 20 columns):
 #   Column              Dtype 
---  ------              ----- 
 0   id                  int64 
 1   conversation_id     int64 
 2   user_id             int64 
 3   username            object
 4   name                object
 5   tweet               object
 6   mentions            object
 7   urls                object
 8   photos              object
 9   replies_count       int64 
 10  retweets_count      int64 
 11  likes_count         int64 
 12  hashtags            object
 13  link                object
 14  retweet             bool  
 15  quote_url           object
 16  video               int64 
 17  reply_to_userids    object
 18  reply_to_usernames  object
 19  processed_tweet     object
dtypes: bool(1), int64(7), object(12)
memory usage: 516.6+ MB


In [4]:
# all the functions below (except the wrapper) were taken from the 8.0 Text Preprocessing Notebook
def newline_remove(text):
    regex = re.compile(r"\n+", re.I)
    return regex.sub(" ", text)

def twitterpic_replace(text):
    # dots are escaped so they only match a literal "."
    regex = re.compile(r"pic\.twitter\.com/\w+", re.I)
    return regex.sub("xxpictwit", text)

def youtube_replace(text):
    # matches both short (youtu.be) and full (www.youtube.*) links
    regex = re.compile(r"(https://youtu\.be/(\S+))|(https://www\.youtube\.(\S+))", re.I)
    return regex.sub("xxyoutubeurl", text)

def url_replace(text):
    regex = re.compile(r"(?:http|ftp|https)://(\S+)", re.I)
    return regex.sub("xxurl", text)

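def usermention_replace(text):
    # "@" followed by everything up to the next whitespace or colon
    regex = re.compile(r"@[^\s:]+", re.I)
    return regex.sub("xxuser", text)

def hashtag_replace(text):
    # "#" followed by everything up to the next whitespace or colon
    regex = re.compile(r"#[^\s:]+", re.I)
    return regex.sub("xxhashtag", text)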

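def emoji_replace(text):
    # first demojize text, i.e. turn each emoji into a ":name:" code
    # (use_aliases is accepted by emoji < 2.0; newer releases use language="alias")
    new_text = emoji.demojize(text, use_aliases=True)
    # then replace each run of ":name:" codes with a single placeholder
    regex = re.compile(r"(:\S+:)+", re.I)
    return regex.sub(" xxemoji ", new_text)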

def unique_twitter_tokens(text):
    text = newline_remove(text)
    text = twitterpic_replace(text)
    text = youtube_replace(text)
    text = url_replace(text)
    text = usermention_replace(text)
    text = hashtag_replace(text)
    text = emoji_replace(text)
    return text
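
The wrapper above applies its steps in a fixed order, and that order matters: the YouTube rule has to run before the generic URL rule, otherwise every YouTube link would already have been turned into `xxurl`. The sketch below shows the same idea as an ordered list of (pattern, placeholder) pairs. The names `TOKEN_RULES` and `replace_twitter_tokens` are hypothetical, the sample Tweet is made up, and the emoji step is left out so that the example only needs the standard library.

```python
import re

# Ordered (pattern, placeholder) pairs; earlier rules win over later ones.
TOKEN_RULES = [
    (re.compile(r"\n+"), " "),
    (re.compile(r"pic\.twitter\.com/\w+", re.I), "xxpictwit"),
    (re.compile(r"https://(?:youtu\.be/|www\.youtube\.)\S+", re.I), "xxyoutubeurl"),
    (re.compile(r"(?:http|ftp|https)://\S+", re.I), "xxurl"),
    (re.compile(r"@[^\s:]+"), "xxuser"),
    (re.compile(r"#[^\s:]+"), "xxhashtag"),
]


def replace_twitter_tokens(text):
    """Apply every rule in order and return the rewritten text."""
    for pattern, placeholder in TOKEN_RULES:
        text = pattern.sub(placeholder, text)
    return text


# made-up sample Tweet for illustration
sample = (
    "Stay home @WHO #covid19\n"
    "https://youtu.be/abc123 https://example.com/x pic.twitter.com/AbC123"
)
print(replace_twitter_tokens(sample))
# Stay home xxuser xxhashtag xxyoutubeurl xxurl xxpictwit
```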

In [9]:
# create new column for clean Tweet text
df["clean_tweet"] = df["tweet"].progress_apply(unique_twitter_tokens)


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3364618 entries, 2020-03-24 23:59:59 to 2020-03-20 01:37:05
Data columns (total 21 columns):
 #   Column              Dtype 
---  ------              ----- 
 0   id                  int64 
 1   conversation_id     int64 
 2   user_id             int64 
 3   username            object
 4   name                object
 5   tweet               object
 6   mentions            object
 7   urls                object
 8   photos              object
 9   replies_count       int64 
 10  retweets_count      int64 
 11  likes_count         int64 
 12  hashtags            object
 13  link                object
 14  retweet             bool  
 15  quote_url           object
 16  video               int64 
 17  reply_to_userids    object
 18  reply_to_usernames  object
 19  processed_tweet     object
 20  clean_tweet         object
dtypes: bool(1), int64(7), object(13)
memory usage: 542.3+ MB


In [11]:
# save dataframe so far for future access
df.to_pickle("playground_data/covid19_0320_0324_updated_v3.pkl")

## _Reload Data_

In [5]:
df = pd.read_pickle("playground_data/covid19_0320_0324_updated_v3.pkl")

## _Testing_

In [112]:
import spacy
from spacy import displacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

nlp = spacy.load("en_core_web_md")

In [107]:
def replace_corona(text):
    regex = re.compile(r"(corona[\s]?virus)+|(corona)+", re.I)
    return regex.sub("coronavirus", text)

In [109]:
def replace_covid(text):
    regex = re.compile(r"(covid[-\s]?19)+", re.I)
    return regex.sub("covid19", text)

In [139]:
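def spacy_tokenizer(text):
    # `parser` is the blank English() pipeline created in the clean_tweet_text cell
    mytokens = parser(text)
    # lowercase each lemma; keep pronouns as written ("-PRON-" is the pronoun lemma in spaCy 2.x)
    mytokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens]
    return " ".join(mytokens)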

def normalize_text(text):
    PUNC = string.punctuation + "…" + "–"
    text = replace_covid(text)
    text = replace_corona(text)
    # drop non-ASCII characters
    text = text.encode("ascii", errors="ignore").decode()
    # pad every punctuation character with spaces so it becomes its own token
    punct = r"[" + re.escape(PUNC) + r"]"
    text = re.sub("(?<! )(?=" + punct + ")|(?<=" + punct + ")(?! )", r" ", text)
    # tokenize
    text = spacy_tokenizer(text)
    # drop punctuation tokens and stopwords
    text = " ".join([word for word in text.split() if word not in PUNC and word not in STOP_WORDS])
    # collapse any extra whitespace
    text = re.sub(r'[ ]{2,}',' ',text)
    return text
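
The punctuation step in `normalize_text` is the least obvious line, so here it is in isolation. The regex only produces zero-width matches: a position right before a punctuation character that is not preceded by a space, or a position right after one that is not followed by a space. Substituting a space at those positions pads punctuation so that it splits off as its own token. The names `PAD_RE` and `pad_punctuation` are hypothetical helpers introduced for this sketch.

```python
import re
import string

PUNC = string.punctuation + "…" + "–"
PUNCT_CLASS = "[" + re.escape(PUNC) + "]"

# Zero-width matches only: nothing is consumed, spaces are just inserted.
PAD_RE = re.compile(
    "(?<! )(?=" + PUNCT_CLASS + ")"    # before punctuation, no space on the left
    + "|(?<=" + PUNCT_CLASS + ")(?! )"  # after punctuation, no space on the right
)


def pad_punctuation(text):
    """Insert spaces around punctuation that touches other characters."""
    return PAD_RE.sub(" ", text)


print(pad_punctuation("382,031"))             # 382 , 031
print(pad_punctuation("(COVID-19)").split())  # ['(', 'COVID', '-', '19', ')']
```

Punctuation at the very start or end of the string also gets padded, which is why `normalize_text` finishes with a whitespace clean-up.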

In [140]:
parser = English()

def clean_tweet_text(text):
    text = unique_twitter_tokens(text)
    text = normalize_text(text)
    return text

In [138]:
df["test_clean_tweet"] = df["tweet"].progress_apply(clean_tweet_text)


In [None]:
df["test_clean_tweet"].str.contains(r")

In [123]:
normalize_text(samptweet1.iloc[0])

'coronavirus disease covid19 update xxnum xxnum gmt xxemoji xxemoji total confirmed cases xxnum xxnum xxemoji total confirmed deaths xxnum xxnum xxemoji total number recovered patients xxnum xxnum xxhashtag xxhashtag xxhashtag xxhashtag xxhashtag xxhashtag'

In [124]:
normalize_text(samptweet2)

'coronavirus covid19 business advice xxurl'

In [129]:
normalize_text(subset["clean_tweet"][101])

'curling time shine suck xxhashtag xxpictwit'

In [16]:
# create sample to work with 1k tweets
subset = df.sample(n=1000, random_state=1)

In [66]:
testtext = "covid-19 covid19 COVID19 Covid-19 covid 19"

In [75]:
def find_covid(text):
    regex = re.compile(r"(covid[-\s]?19)+", re.I)
    return ",".join(x.group() for x in regex.finditer(text))

In [79]:
def replace_covid(text):
    regex = re.compile(r"(covid[-\s]?19)+", re.I)
    return regex.sub("covid19", text)

In [80]:
find_covid(testtext)

'covid-19,covid19,COVID19,Covid-19,covid 19'

In [81]:
replace_covid(testtext)

'covid19 covid19 covid19 covid19 covid19'
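
The same kind of check can be run against `replace_corona`. The sketch below repeats the function so that it runs on its own; the test strings are made up. It also surfaces one caveat: the pattern has no word boundaries, so it rewrites `corona` inside longer, unrelated words as well.

```python
import re


def replace_corona(text):
    # same pattern as the replace_corona cell in this notebook
    regex = re.compile(r"(corona[\s]?virus)+|(corona)+", re.I)
    return regex.sub("coronavirus", text)


testtext = "coronavirus Corona Virus CORONAVIRUS corona"
print(replace_corona(testtext))
# coronavirus coronavirus coronavirus coronavirus

# caveat: "corona" is also matched inside other words
print(replace_corona("coronation"))
# coronavirustion
```

If that turns out to matter for this dataset, one option would be to wrap the pattern in `\b` word boundaries, at the cost of no longer matching forms that are glued to other text.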

In [78]:
normalize_text(testtext)

'covid-19 covid19 covid19 covid-19 covid 19'

In [27]:
samptweet1 = subset["clean_tweet"].sample(n=1, random_state=9)
print(samptweet1.iloc[0])

Coronavirus disease (COVID-19) Update at 7:10 GMT  xxemoji   xxemoji  Total Confirmed cases  - 382,031   xxemoji  Total Confirmed deaths - 16,565   xxemoji  Total Number of Recovered patients - 102,481  xxhashtag xxhashtag xxhashtag xxhashtag xxhashtag xxhashtag


In [28]:
samptweet2 = subset["clean_tweet"][0]
print(samptweet2)

Coronavirus (COVID-19) Business Advice -  xxurl …


In [83]:
normalize_text(samptweet1.iloc[0])

'coronavirus disease covid19 update 710 gmt xxemoji xxemoji total confirmed cases 382031 xxemoji total confirmed deaths 16565 xxemoji total number recovered patients 102481 xxhashtag xxhashtag xxhashtag xxhashtag xxhashtag xxhashtag'

In [84]:
normalize_text(samptweet2)

'coronavirus covid19 business advice xxurl'

In [64]:
subset["clean_tweet"] = subset["clean_tweet"].apply(normalize_text)

In [65]:
for n in range(10):
    print(subset["clean_tweet"][n] + "\n")

coronavirus covid-19 business advice xxurl

find latest xxhashtag information resources cta's covid-19 information page xxurl xxpictwit

coronavirus disease covid-19 update 710 gmt xxemoji xxemoji total confirmed cases 382031 xxemoji total confirmed deaths 16565 xxemoji total number recovered patients 102481 xxhashtag xxhashtag xxhashtag xxhashtag xxhashtag xxhashtag

xxhashtag important prevent xxhashtag outbreak xxhashtag xxhashtag xxhashtag xxhashtag xxhashtag xxhashtag xxhashtag xxhashtag xxhashtag xxhashtag xxhashtag xxhashtag xxhashtag xxhashtag xxhashtag xxhashtag xxhashtag

hint good news italy xxemoji lombardy italy reports day decreasing hospitalizations region covid-19 coronavirus xxurl

where's place you're going over? xxhashtag xxhashtag xxhashtag xxhashtag xxpictwit

stand italy trying times share support italian friends colleagues friends family cari amici siamo con voi xxhashtag xxhashtag

jason heyward discusses generous donation support covid-19 relief xxurl

care eco