# _Text Normalization and Tokenization_

Machine learning algorithms cannot work with raw text directly, but there is a workaround. By pre-processing the text and then converting it into a numerical format (i.e. vectors of numbers), we end up with something that can be fed into ML algorithms. But what exactly does text pre-processing entail?
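
To make "vectors of numbers" concrete, here is a minimal sketch of one common conversion, a bag-of-words count vector. It is only an illustration: the helper name `bag_of_words` and the two sample sentences are my own assumptions, not part of this project's pipeline.

```python
from collections import Counter


def bag_of_words(docs):
    """Turn a list of already-tokenized documents into count vectors."""
    # Build a sorted vocabulary so every document shares the same column order.
    vocab = sorted({token for doc in docs for token in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # One number per vocabulary word: how often it appears in this document.
        vectors.append([counts[token] for token in vocab])
    return vocab, vectors


# Hypothetical sample documents, split on whitespace.
docs = [
    "the virus spreads fast".split(),
    "the tests for the virus".split(),
]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['fast', 'for', 'spreads', 'tests', 'the', 'virus']
print(vectors)  # [[1, 0, 1, 0, 1, 1], [0, 1, 0, 1, 2, 1]]
```

Each document becomes a row of counts with one column per vocabulary word, which is a shape an ML algorithm can consume.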

This is where things get interesting. At a high level, preprocessing removes as much noise as possible from the text data so that an algorithm can more readily find any potential patterns. But what counts as noise depends heavily on the type of text. For example, you do not want to use the same text pre-processing techniques for Tweets as you would for novels. Sure, there may be some overlap, but these two kinds of text differ significantly, not only in their function but also in the text patterns that exist within them (after all, you won't see any emojis in Dostoyevsky's __The Brothers Karamazov__...).

That being said, there is a core set of text processing strategies that will help you get started:
- __Lowercasing__: lowercasing all of the text lets us capture a word that shows up in several forms because of inconsistent capitalization. For example, a text may include `America`, `aMerica`, and `AMERICA`. We know these are all the same word, but a machine doesn't; it treats them as three different words. To help our computer come to its senses, we lowercase all the text so that it recognizes three occurrences of `america` instead of one occurrence each of three words.
- __Stemming__: This means looking for the "root" of a word. Plenty of words have multiple inflections. Take the word `connect`; some of its related forms include `connected`, `connection`, and `connects`. With stemming, we crudely reduce these forms to the root word by chopping off their endings.
- __Lemmatization__: Similar to stemming, in that its goal is to remove inflections, but it does so in a less crude way: it attempts to transform each word into its actual root. Take the word `geese`, which is an inflection of `goose`. With lemmatization, we can change it back into its original root (versus simply chopping off the letters at the end).
- __Removing Stopwords__: When you are dealing with text, a lot of the words carry little meaning on their own. Examples include `a`, `this`, `and`, etc. Removing them, in theory, lets us keep only the important words. Let's take a look at the following sentence: `John is going to the store.` Now, let's remove `is`, `to`, and `the`: `John going store`. While it isn't grammatically correct, you still get the primary concept, that John is going to the store. While humans may think the result is weird to read, this strategy has the potential to help a machine.
- __Text Normalization__: Due to the character limits for Tweets, people will often use non-standard forms of words. One such example is `omg`, which stands for `oh my god`; another is `2mrw` as a stand-in for the word `tomorrow`. This is a pretty common pattern in social media text, so it is a technique to seriously consider for this project.
- __Text Enrichment / Augmentation__: Believe it or not, this strategy augments the text (i.e. adds information that wasn't there before) in the hope that it improves the model's predictive power. Sub-strategies include things like part-of-speech tagging or dependency parsing.
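
To see a few of these strategies side by side, here is a toy sketch that uses only the standard library. Everything in it is an assumption made for illustration: the lookup tables `STOPWORDS`, `SLANG`, `LEMMAS`, and `SUFFIXES` are tiny hand-written stand-ins, and `crude_stem`, `lemmatize`, and `basic_clean` are hypothetical helpers, not the functions used later in this notebook. A real project would more likely lean on a library such as NLTK or spaCy.

```python
import re

# Toy lookup tables, assumed for illustration only.
STOPWORDS = {"a", "and", "is", "the", "this", "to"}
SLANG = {"omg": "oh my god", "2mrw": "tomorrow"}
LEMMAS = {"geese": "goose"}
SUFFIXES = ("ion", "ed", "s")  # checked in this order


def crude_stem(word):
    """Stemming: chop off the first matching suffix, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Lemmatization: look the word up instead of chopping letters off."""
    return LEMMAS.get(word, word)


def basic_clean(text):
    """Lowercase, expand slang, and drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())         # lowercasing + splitting
    tokens = [SLANG.get(tok, tok) for tok in tokens]  # text normalization
    tokens = " ".join(tokens).split()                 # re-split expanded phrases
    return [tok for tok in tokens if tok not in STOPWORDS]


print([crude_stem(w) for w in ["connected", "connection", "connects"]])
print(crude_stem("geese"), lemmatize("geese"))
print(basic_clean("John is going to the store."))
print(basic_clean("OMG see you 2mrw"))
```

All three `connect` forms stem to `connect`, while `geese` survives stemming untouched and only the lemma lookup recovers `goose`, which is the difference between the two techniques in a nutshell.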

With this background in mind, let's turn our attention to developing a function that will clean our Tweet data. This is v2 of the text preprocessing component of our pipeline; the first version can be found in the `8.1-je-text-normal-token` Jupyter Notebook.

In [1]:
%load_ext autoreload
%load_ext line_profiler
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format='retina'

### _Our test Tweet_

Below is `testtweet`, which contains an actual tweet from the data we've gathered, with a few modifications. I've added the following terms/items (i.e. these weren't in the original Tweet):

- every newline character (e.g. `\n`)
- `Sars-Cov-2`
- `corona virus`
- the emoji (🧐)
- `#coronavirus`
- the YouTube link
- a random GitHub link from Ryan's tutorial repository

The reason I added these items is that each one represents something we must catch in our text preprocessing phase. We need to remove newline characters because they provide next to no value; we need to account for emojis and replace them with special tokens to preserve the fact that there was an emoji in the Tweet; we need to standardize the various coronavirus/covid-related terms, in both normal text and hashtags; and we need to handle the various types of URLs that are present.

In [3]:
testtweet = """Sars-Cov-2 @narendramodi Dear PM, Is this really happening?\n\nThe countrymen have to pay for COVID-19 
corona virus #Covid19 Tests?  🧐 Unbelievable! \n #coronavirus pic.twitter.com/31nvInjcBZ 
https://www.youtube.com/watch?v=ig9yh8iVZWI
https://github.com/rkingery/ml_tutorials/blob/master/notebooks/ml_with_text.ipynb
"""

### _v2 Text Preprocessing Functions_

V1 of the text preprocessing component threw away or manipulated too much of the text data. Our primary focus with V2 is to retain as much information from the original Tweet as possible. With this in mind, let's turn our attention to the functions below.

- `newline_remove`: replaces each run of newline characters (e.g. `\n`) with a single space
- `replace_coronavirus`: standardizes variants of the coronavirus term (e.g. `corona virus`) within text to `coronavirus`
- `coronavirus_hashtags`: any instances of `#coronavirus` are replaced with special token `xxhashcoronavirus`
- `replace_covid`: standardizes COVID-19 within text to `covid19`
- `covid_hashtags`: any instances of `#covid19` are replaced with special token `xxhashcovid19`
- `sarscov2_replace`: accounts for any mention of `SARS-CoV-2` and standardizes to `sarscov2`
- `emoji_replace`: replaces any emojis (e.g. 😉) in a given text with a special token `xxemoji`
- `twitterpic_replace`: if there is a link to a picture in a format similar to `pic.twitter.com/`, we replace that substring with a special token `xxpictwit`
- `youtube_replace`: similar to the above, but geared specifically to any YouTube links, replacing them with `xxyoutubeurl`
- `url_replace`: again similar to the above, but geared specifically for any other miscellaneous URLs, replacing them with `xxurl`
- `punctuation_replace`: pads each punctuation character with a space on either side and then strips the punctuation itself, so words that were joined by punctuation end up separated
- `clean_wrapper`: wrapper function that includes all the functions mentioned above
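
One thing worth noticing about the URL-related functions is that their order matters: the more specific replacement has to run before the generic one. Here is a small sketch of why. The patterns `YOUTUBE` and `ANY_URL` and the helpers `specific_first` and `generic_first` are simplified stand-ins that I'm assuming for illustration, not the exact functions defined in this notebook.

```python
import re

# Simplified stand-ins for the YouTube and generic URL patterns (assumptions).
YOUTUBE = re.compile(r"https?://(?:youtu\.be/|www\.youtube\.)\S+", re.I)
ANY_URL = re.compile(r"(?:https?|ftp)://\S+", re.I)


def specific_first(text):
    # YouTube links are tagged first, so the generic pattern never sees them.
    text = YOUTUBE.sub("xxyoutubeurl", text)
    return ANY_URL.sub("xxurl", text)


def generic_first(text):
    # The generic pattern swallows every link, leaving nothing for YOUTUBE.
    text = ANY_URL.sub("xxurl", text)
    return YOUTUBE.sub("xxyoutubeurl", text)


sample = (
    "watch https://www.youtube.com/watch?v=ig9yh8iVZWI "
    "and https://github.com/rkingery/ml_tutorials"
)
print(specific_first(sample))  # watch xxyoutubeurl and xxurl
print(generic_first(sample))   # watch xxurl and xxurl
```

Running the generic replacement first erases the distinction between link types, which is the kind of information V2 is trying to keep.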

Additionally, `clean_wrapper` has some extra functionality that gives us a little more flexibility to try and compare different preprocessing strategies.

In [4]:
%%writefile sim_wrapper.py
import re
import string
import emoji
# assumed to be the tweet-preprocessor package, which clean_wrapper calls as `p`
import preprocessor as p
from nltk.tokenize import RegexpTokenizer

def newline_remove(text):
    # collapse each run of newlines into a single space
    return re.sub(r'\n+', ' ', text)


def replace_coronavirus(text):
    regex = re.compile(r'(corona[\s]?virus)', re.I)
    return regex.sub('coronavirus', text)


def coronavirus_hashtags(text):
    regex = re.compile(r'#(coronavirus)\b', re.I)
    return regex.sub('xxhashcoronavirus', text)


def replace_covid(text):
    regex = re.compile(r'(covid[-\s_]?19)', re.I)
    return regex.sub('covid19', text)


def covid_hashtags(text):
    regex = re.compile(r'#(covid[_-]?(19))', re.I)
    return regex.sub('xxhashcovid19', text)


def sarscov2_replace(text):
    regex = re.compile(r'(sars[-]?cov[-]?2)', re.I)
    return regex.sub(r'sarscov2', text)


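def emoji_replace(text):
    # first demojize text, turning each emoji into a :name: code
    # (use_aliases was removed in emoji 2.0; the name is discarded below anyway)
    new_text = emoji.demojize(text)
    # then swap every :name: code for the special token
    regex = re.compile(r"(:\S+:)")
    return regex.sub(" xxemoji ", new_text)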


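def twitterpic_replace(text):
    # dots are escaped so they only match literal periods
    regex = re.compile(r"pic\.twitter\.com/\w+", re.I)
    return regex.sub("xxpictwit", text)


def youtube_replace(text):
    # matches both the short (youtu.be) and long (www.youtube.*) link forms
    regex = re.compile(r"(https://youtu\.be/(\S+))|(https://www\.youtube\.(\S+))", re.I)
    return regex.sub("xxyoutubeurl", text)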


def url_replace(text):
    regex = re.compile(r"(?:http|ftp|https)://(\S+)", re.I)
    return regex.sub("xxurl", text)


def punctuation_replace(text):
    # put a space on each side of every punctuation character
    PUNC = string.punctuation + '…–”“’'
    punct = r"[" + re.escape(PUNC) + r"]"
    text = re.sub("(?<! )(?=" + punct + ")|(?<=" + punct + ")(?! )", r" ", text)
    # then drop the punctuation itself (could replace with xxpunc instead)
    text = re.sub(r"[^\w\s]", '', text)
    # remove any extra whitespace
    text = re.sub(r'[ ]{2,}', ' ', text)
    return text


def clean_wrapper(text, nltk_tokenize=False, punc_replace=False, preprocessor=False):
    PUNC = string.punctuation + '…–”“’'
    # removes newline characters from text
    text = newline_remove(text)
    # standardizes all instances of coronavirus in text
    text = replace_coronavirus(text)
    # replaces instances of #coronavirus with special token, xxhashcoronavirus
    text = coronavirus_hashtags(text)
    # standardizes all instances of covid19
    text = replace_covid(text)
    # replaces instances of #covid19 with special token, xxhashcovid19
    text = covid_hashtags(text)
    # standardizes SARS-Cov-2 to sarscov2
    text = sarscov2_replace(text)
    # removes hashtag characters
    text = text.replace(r'#', '')
    # removes @ character
    text = text.replace(r'@', '')
    # if preprocessor is True, use the preprocessor library to process the tweet
    if preprocessor:
        p.set_options(p.OPT.NUMBER)
        text = p.tokenize(text)
        text = emoji_replace(text)
        if punc_replace:
            text = punctuation_replace(text)
        # drop single-character tokens
        text = " ".join(word for word in text.split() if len(word) > 1)
        return text.strip()
    # replace emojis with special token xxemoji
    text = emoji_replace(text)
    # replace pic.twitter.com links with special token, xxpictwit
    text = twitterpic_replace(text)
    # replace YouTube links with special token, xxyoutubeurl
    text = youtube_replace(text)
    # replace other URLs with special token, xxurl
    text = url_replace(text)
    # if nltk_tokenize is True, split on whitespace with nltk's RegexpTokenizer,
    # then strip every punctuation character
    if nltk_tokenize:
        text = " ".join(RegexpTokenizer(r'\s+', gaps=True).tokenize(text))
        text = "".join(char for char in text if char not in PUNC)
    # if punc_replace is True, pad and then strip punctuation
    if punc_replace:
        text = punctuation_replace(text)
    # remove any unnecessary whitespace
    text = re.sub(r'[ ]{2,}',' ',text)
    return text.strip()

Overwriting sim_wrapper.py
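
The lookaround pattern inside `punctuation_replace` is fairly dense, so here is a minimal sketch of just the padding step in isolation. It assumes plain ASCII punctuation from `string.punctuation`, and the names `PUNCT_CLASS`, `PAD`, and `pad_punctuation` are hypothetical. Unlike the function above, it does not strip the punctuation afterwards.

```python
import re
import string

# Character class covering ASCII punctuation only (an assumption for this sketch).
PUNCT_CLASS = "[" + re.escape(string.punctuation) + "]"

# Two zero-width alternatives:
#   1. a spot just before a punctuation mark that has no space on its left
#   2. a spot just after a punctuation mark that has no space on its right
PAD = re.compile(
    r"(?<! )(?=" + PUNCT_CLASS + r")"
    r"|(?<=" + PUNCT_CLASS + r")(?! )"
)


def pad_punctuation(text):
    # Insert a space at every matching spot, then tidy up the whitespace.
    padded = PAD.sub(" ", text)
    return re.sub(r" {2,}", " ", padded).strip()


print(pad_punctuation("Tests?  Unbelievable!"))  # Tests ? Unbelievable !
print(pad_punctuation("COVID-19"))               # COVID - 19
```

Because both alternatives are zero-width, nothing from the original text is consumed; the substitution only inserts spaces around each mark.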
