# _Getting the Text Pre-Processed_

So we can gather live Twitter data, but as I noted in a previous notebook, we need to take the text we have and clean it up, which means we'll have to do some text pre-processing.

To make things a little easier, and so that we don't run into any issues with the Twitter API (i.e. we don't make too many unnecessary calls to it), we'll be using data that's already been acquired. This data has tweets ranging from March 20, 2020 at ~1:30am through March 24, 2020 at 11:59pm. In this case, historical data gives us a good representation of what the streaming data will look like, and lets us experiment with various text pre-processing strategies without worrying about mistakenly eliminating too much information.

The tools we'll primarily be working with will be regular expressions and the [`re`](https://docs.python.org/2/library/re.html) library, which provides operations for regular expression matching in text. 

In [1]:
%load_ext autoreload
%load_ext line_profiler
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format='retina'

In [2]:
import re
import numpy as np
import pandas as pd
import os
import fundamentals
pd.options.display.max_columns = None
from tqdm.autonotebook import tqdm
tqdm.pandas()
import warnings
warnings.simplefilter("once")


In [3]:
np.random.seed(42)

### _Load Data_

We'll start by loading in the data. The pickle file we have - `covid19_0320_0324.pkl` - contains roughly 3.3M Tweets from March 20th through March 24th. We'll load it into a pandas `DataFrame`, which will allow us to then start experimenting with text preprocessing techniques.

In [4]:
%%time
# strings of file paths and file name for data
origpath = "/notebooks/CovidDisinfo-Detect/experiments"
datapath = "/notebooks/CovidDisinfo-Detect/data/interim"
filename1 = "covid19_0320_0324.pkl"

# load data into pandas dataframe
df = fundamentals.load_data(origpath, datapath, filename1)
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3364618 entries, 2020-03-24 23:59:59 to 2020-03-20 01:37:05
Data columns (total 19 columns):
 #   Column              Dtype 
---  ------              ----- 
 0   id                  int64 
 1   conversation_id     int64 
 2   user_id             int64 
 3   username            object
 4   name                object
 5   tweet               object
 6   mentions            object
 7   urls                object
 8   photos              object
 9   replies_count       int64 
 10  retweets_count      int64 
 11  likes_count         int64 
 12  hashtags            object
 13  link                object
 14  retweet             bool  
 15  quote_url           object
 16  video               int64 
 17  reply_to_userids    object
 18  reply_to_usernames  object
dtypes: bool(1), int64(7), object(11)
memory usage: 490.9+ MB
CPU times: user 4.87 s, sys: 2.57 s, total: 7.44 s
Wall time: 7.45 s


### _Remove Newline Characters_

The first step will be to remove newline characters, which appear in the text as `\n`. Additionally there are plenty of instances where there is more than one of these characters, and even cases where there are two back-to-back. We need to be able to remove all of these instances, and the `re` library can help us do just that. 

The first function I'm going to define - `newline_search` - will give us a better idea of how newline characters appear within the text and, more importantly, how frequently they pop up.

Next we'll define a function that takes in a given text and substitutes every run of `\n` with a single space. Additionally, when we run this function on the DataFrame, we'll create a new column - `processed_tweet` - to hold our pre-processed text.

In [5]:
def newline_search(text):
    # re.I means it is not case-sensitive
    regex = re.compile(r"\n+", re.I)
    return ",".join(x.group() for x in regex.finditer(text))

In [6]:
# experiment on first 20 observations of DataFrame
df["tweet"][:20].apply(newline_search)

created_at
2020-03-24 23:59:59                \n,\n,\n
2020-03-24 23:59:59                        
2020-03-24 23:59:59                        
2020-03-24 23:59:59                        
2020-03-24 23:59:59                   \n,\n
2020-03-24 23:59:59                        
2020-03-24 23:59:59                        
2020-03-24 23:59:59                        
2020-03-24 23:59:59                        
2020-03-24 23:59:58                        
2020-03-24 23:59:58                      \n
2020-03-24 23:59:58    \n,\n,\n,\n,\n,\n,\n
2020-03-24 23:59:58                        
2020-03-24 23:59:58                        
2020-03-24 23:59:58                        
2020-03-24 23:59:58                      \n
2020-03-24 23:59:57                        
2020-03-24 23:59:57                        
2020-03-24 23:59:57                 \n\n,\n
2020-03-24 23:59:57                        
Name: tweet, dtype: object

While not every observation has them, we can see that they occur fairly frequently, and at the bottom we can see an example of back-to-back newline characters (i.e. `\n\n`). This also showcases how powerful the `re` library can be for capturing particular patterns in a text. Given a relatively simple raw string - `r"\n+"` - it is able to capture the various instances of a newline character within a given text.
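
To make the difference between `\n` and `\n+` concrete, here is a small sketch on a made-up string (it is not taken from the dataset). The first pattern reports every newline separately, while the second reports each run once, which is why a run collapses into a single space on substitution.

```python
import re

# Toy text (made up for illustration) with a double and a single newline.
sample = "Stay home\n\nWash your hands\nBe kind"

single = re.compile(r"\n")   # one match per newline character
runs = re.compile(r"\n+")    # one match per run of consecutive newlines

single_matches = [m.group() for m in single.finditer(sample)]
run_matches = [m.group() for m in runs.finditer(sample)]

print(single_matches)  # ['\n', '\n', '\n']
print(run_matches)     # ['\n\n', '\n']

# Substituting with the run-based pattern leaves exactly one space per break.
cleaned = runs.sub(" ", sample)
print(cleaned)  # Stay home Wash your hands Be kind
```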

Now that we have an idea of what we're replacing, we can move on to replacing them with a `newline_remove` function and subsequently creating our new `processed_tweet` column.

In [7]:
def newline_remove(text):
    regex = re.compile(r"\n+", re.I)
    return regex.sub(" ", text)

In [8]:
# we first apply the newline_remove function and then apply newline_search to see if there are any \n's
df["tweet"][:20].apply(newline_remove).apply(newline_search)

created_at
2020-03-24 23:59:59    
2020-03-24 23:59:59    
2020-03-24 23:59:59    
2020-03-24 23:59:59    
2020-03-24 23:59:59    
2020-03-24 23:59:59    
2020-03-24 23:59:59    
2020-03-24 23:59:59    
2020-03-24 23:59:59    
2020-03-24 23:59:58    
2020-03-24 23:59:58    
2020-03-24 23:59:58    
2020-03-24 23:59:58    
2020-03-24 23:59:58    
2020-03-24 23:59:58    
2020-03-24 23:59:58    
2020-03-24 23:59:57    
2020-03-24 23:59:57    
2020-03-24 23:59:57    
2020-03-24 23:59:57    
Name: tweet, dtype: object

Above is just an example (and a double check) to make sure our `newline_remove` works as intended. Now we'll use it to create our new column.

In [9]:
%%time
df["processed_tweet"] = df["tweet"].apply(newline_remove)

CPU times: user 19 s, sys: 379 ms, total: 19.3 s
Wall time: 19.4 s


### _Replace Twitter picture, YouTube and other URLs with "fillers"_

Since we are working with Twitter data, we need to address potential components of a Tweet (outside of its text), namely URLs. Each tweet could contain links to such things as a picture, a YouTube video, or a news article. The first kind we'll focus on is **Twitter pictures**, which are in the following format:
- `pic.twitter.com/(random assortment of numbers & letters)`

We'll replace the above link with `xxpictwit` for two reasons: it lets us acknowledge that there was a picture in a given tweet, and it makes it less likely that any model we produce can key in on a particular URL. In the following cells, we'll define the function `twitterpic_replace`, show a quick example of what it does, and then apply it to the `processed_tweet` column.

In [10]:
def twitterpic_replace(text):
    regex = re.compile(r"pic\.twitter\.com/\w+", re.I)
    return regex.sub("xxpictwit", text)
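
One detail worth checking in a pattern like this is the dot: unescaped, `.` matches any character, while `\.` matches only a literal dot. The sketch below compares the two on hypothetical strings (neither comes from the dataset). In practice the difference rarely matters for Twitter picture links, but the escaped form states the intent more precisely.

```python
import re

# Hypothetical strings for illustration only.
real_link = "Print this sign pic.twitter.com/AbC123xyz"
lookalike = "my picXtwitterYcom/notalink story"

loose = re.compile(r"pic.twitter.com/\w+", re.I)     # "." matches any character
strict = re.compile(r"pic\.twitter\.com/\w+", re.I)  # "\." matches a literal dot

loose_real = loose.sub("xxpictwit", real_link)
strict_real = strict.sub("xxpictwit", real_link)
loose_fake = loose.sub("xxpictwit", lookalike)
strict_fake = strict.sub("xxpictwit", lookalike)

print(loose_real)   # Print this sign xxpictwit
print(strict_real)  # Print this sign xxpictwit
print(loose_fake)   # my xxpictwit story
print(strict_fake)  # my picXtwitterYcom/notalink story
```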

In [11]:
# an example of what our twitterpic_replace function does, note the end of each of the texts below
for n in range(3):
    print(df[df["photos"] != "none"]["processed_tweet"][n:n+1].apply(twitterpic_replace).iloc[0] + "\n")

There are many in our community that are elderly, have compromised immune symptoms or are otherwise high risk for COVID-19. Print and Hang this sign on your door to help protect yourself from the public.  xxpictwit

Was this photo even taken during the COVID crisis?? 😉 JK is probably more scared of touching poverty and being infected with social services. Drastic non-privatized times call for drastic measures. No N95 for you! #jasonkenney #COVID19 #N95 #abhealth xxpictwit

Wana come up to my Hotel Room? . . . . . I have a Big Hand Sanitizer & Two LOO Rolls 💩💩 #COVIDIDIOTS #covid19UK #CoronavirusLockdown #COVID19  xxpictwit



In [12]:
# apply function to processed_tweet
df["processed_tweet"] = df["processed_tweet"].progress_apply(twitterpic_replace)


Now that we've replaced the Twitter picture links, we'll turn our attention to **YouTube links**, following along with the method we established above.

In [13]:
def youtube_replace(text):
    regex = re.compile(r"(https://youtu\.be/(\S+))|(https://www\.youtube\.(\S+))", re.I)
    return regex.sub("xxyoutubeurl", text)

Below are some examples of what `youtube_replace` does. If you look near the end of each text, the token `xxyoutubeurl` has been substituted for the YouTube URL.

In [16]:
# flags must be passed by keyword: the second positional argument of str.contains is `case`
yt_mask = df["processed_tweet"].str.contains(r"https://youtu\.be/\S+|https://www\.youtube\.\S+", flags=re.I)
for text in df.loc[yt_mask, "processed_tweet"][:3]:
    print(youtube_replace(text) + "\n")

Mexican President Takes No National Safety Measures Against COVID-19  xxyoutubeurl …

Koreans think Malaysia handles this covid-19 situation despite political instability for a while back there, better than Korean government and I live for these comments <3 @MuhyiddinYassin xxyoutubeurl …

Daily COVID-19 FAQ: Becki Young explains the recent announcement from the DOL that, starting March 25th, all PERM approvals will now be issued electronically, though the originals with wet signatures are still required upon submission to USCIS.  xxyoutubeurl 



In [17]:
df["processed_tweet"] = df["processed_tweet"].progress_apply(youtube_replace)


Lastly, we'll see how we can go about replacing **all remaining URLs** in the Tweets. 

In [19]:
def url_replace(text):
    regex = re.compile(r"(?:http|ftp|https)://(\S+)", re.I)
    return regex.sub("xxurl", text)

This time around, the links are going to vary a lot more than in the two previous examples because we're searching for all other links that aren't tied to either Twitter or YouTube. As a test, I'm going to extract 5 random samples from the `processed_tweet` column, then print what the text looked like before and after applying the `url_replace` function. Once we're sure it's picking up all other URLs, then we can go ahead and apply it to the column.
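
Because this last pattern is so general, the order of the steps matters: it would also match YouTube links, so the specific replacement has to run first if we want to keep the `xxyoutubeurl` token. Here is a minimal sketch on a made-up tweet (the URLs are placeholders, and the patterns are lightly simplified versions of the ones in this notebook):

```python
import re

youtube = re.compile(r"(https://youtu\.be/\S+)|(https://www\.youtube\.\S+)", re.I)
generic = re.compile(r"(?:http|ftp|https)://\S+", re.I)

# Made-up tweet containing one YouTube link and one other link.
tweet = "Watch https://youtu.be/abc123 then read https://example.com/story"

# Specific pattern first, generic pattern second (the order used in this notebook).
specific_first = generic.sub("xxurl", youtube.sub("xxyoutubeurl", tweet))

# Generic pattern first: the YouTube link is already gone by the second step.
generic_first = youtube.sub("xxyoutubeurl", generic.sub("xxurl", tweet))

print(specific_first)  # Watch xxyoutubeurl then read xxurl
print(generic_first)   # Watch xxurl then read xxurl
```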

In [20]:
test = (df[df["processed_tweet"].str.contains(r"(?:http|ftp|https)://\S+", flags=re.I)])["processed_tweet"].sample(n=5, random_state=42)

Now we'll print out the 5 observations to see what they look like before...

In [21]:
for n in range(len(test)):
    print(test.iloc[n] + "\n")

CUSD students are staying connected in a time of #socialdistancing during #COVID19 #schoolclosures with help from technology and their fun-loving, creative principals + teachers. Find out how on #CUSDInsider! http://cusdinsider.org/capistrano-unified-principals-engage-students-virtually-during-covid-19-closures/ …

 http://BlackburnNews.com : Huron Country creates resource list for businesses affected by COVID-19.  https://blackburnnews.com/midwestern-ontario/midwestern-ontario-news/2020/03/20/huron-country-creates-resource-list-businesses-affected-covid-19/ … via @GoogleNews

#HarveyWeinstein Contracts #Coronavirus While in Custody, According to Reports  https://bit.ly/2xgqwiX  #COVID19 #NewYork xxpictwit

This research found that 1 in 7 patients hospitalized with Covid-19 has acquired a dangerous secondary bacterial infection, and 50% of patients who have died had such infections.  https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)30566-3/fulltext#tbl2 …

They may ha

And this is what they look like after...

In [22]:
for n in range(len(test)):
    print(url_replace(test.iloc[n]) + "\n")

CUSD students are staying connected in a time of #socialdistancing during #COVID19 #schoolclosures with help from technology and their fun-loving, creative principals + teachers. Find out how on #CUSDInsider! xxurl …

 xxurl : Huron Country creates resource list for businesses affected by COVID-19.  xxurl … via @GoogleNews

#HarveyWeinstein Contracts #Coronavirus While in Custody, According to Reports  xxurl  #COVID19 #NewYork xxpictwit

This research found that 1 in 7 patients hospitalized with Covid-19 has acquired a dangerous secondary bacterial infection, and 50% of patients who have died had such infections.  xxurl …

They may have thought that, this is the way one can can rid of #Covid_19 😬  xxurl …



It captured all the links from above, so let's go ahead and apply this to the entire `processed_tweet` column.

In [23]:
df["processed_tweet"] = df["processed_tweet"].progress_apply(url_replace)


### _Replace User Mentions & Hashtags with fillers_

Now the last two components of the text we'll have to address are user mentions and hashtags. A user mention consists of an `@` symbol followed by another user's name, while a hashtag is usually a `#` with a word directly after it (sometimes several words run together). Below is an example of a tweet containing a user mention and a hashtag:
- `Hey @earny_joe, how is your project coming along? #DataScience`

In the text, `@earny_joe` represents a user mention and `#DataScience` is a hashtag. For the first version of this pipeline, which constitutes a baseline of sorts, I'm going to replace both user mentions and hashtags with a filler, similar to what we did with the links above. However, I want to point out that this information, particularly in the case of hashtags, could provide value for future iterations of the pipeline/model.
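
If a later iteration does want to use that information, one option is to pull the hashtags out before they are replaced. The sketch below is only an illustration: `extract_hashtags` is a hypothetical helper, and its pattern assumes a hashtag is `#` followed by word characters. The DataFrame also already carries a `hashtags` column (see the `df.info()` output above), which may make this unnecessary depending on how that column is populated.

```python
import re

# Illustrative pattern: "#" followed by letters, digits, or underscores.
HASHTAG = re.compile(r"#(\w+)")

def extract_hashtags(text):
    """Return the hashtags in `text`, lowercased and without the leading '#'."""
    return [tag.lower() for tag in HASHTAG.findall(text)]

example = "Hey @earny_joe, how is your project coming along? #DataScience"
tags = extract_hashtags(example)
print(tags)  # ['datascience']
```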

That being said, let's go ahead and define functions that address these two components and apply them to the `processed_tweet` column.

In [24]:
def usermention_replace(text):
    regex = re.compile(r"@([^\s:]+)+", re.I)
    return regex.sub("xxuser", text)

In [25]:
# grab 5 observations to showcase usermention_replace
test5 = df[df["processed_tweet"].str.contains(r"@([^\s:]+)+", flags=re.I)]["processed_tweet"][:5]

**Before `usermention_replace`...**

In [26]:
for n in range(len(test5)):
    print(test5.iloc[n] + "\n")

COVID-19: Airlines And OTAs Say, ‘Don’t Call Us’ As Travel Agents Come To The Rescue via @forbes xxurl …

Help make it happen for C3 Test for Coronavirus & COVID-19 on @indiegogo xxurl 

@drgregpoland I sent you a LinkedIn request. Thanks for your timely information on Covid 19. I admire your work. I admire Mayo Clinic. My wife Mary Ann Edwards is a pharmacist

So, not only was the order screwed up, and billing screwed up, The @att store at the Valley Mall robbed us. And acccording to 611 there is shit we can do about it because they can’t contact the store until the #COVID19 pandemic is over.

So you got your #slushfund and now you want people to go to work? You do not understand the risk of COVID19! @realDonaldTrump xxurl …



**After `usermention_replace`...**

In [27]:
for n in range(len(test5)):
    print(usermention_replace(test5.iloc[n]) + "\n")

COVID-19: Airlines And OTAs Say, ‘Don’t Call Us’ As Travel Agents Come To The Rescue via xxuser xxurl …

Help make it happen for C3 Test for Coronavirus & COVID-19 on xxuser xxurl 

xxuser I sent you a LinkedIn request. Thanks for your timely information on Covid 19. I admire your work. I admire Mayo Clinic. My wife Mary Ann Edwards is a pharmacist

So, not only was the order screwed up, and billing screwed up, The xxuser store at the Valley Mall robbed us. And acccording to 611 there is shit we can do about it because they can’t contact the store until the #COVID19 pandemic is over.

So you got your #slushfund and now you want people to go to work? You do not understand the risk of COVID19! xxuser xxurl …



In [28]:
df["processed_tweet"] = df["processed_tweet"].progress_apply(usermention_replace)


Now we'll turn our attention to **replacing the hashtags**.

In [29]:
def hashtag_replace(text):
    regex = re.compile(r"#([^\s:]+)+", re.I)
    return regex.sub("xxhashtag", text)

In [30]:
test123 = "Hey @earny_joe, how are you doing during the #coronavirus #COVID_19 pandemic? #StayHome"

In [31]:
hashtag_replace(test123)

'Hey @earny_joe, how are you doing during the xxhashtag xxhashtag pandemic? xxhashtag'
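
One behavior to be aware of: the character class `[^\s:]` used for mentions and hashtags stops only at whitespace or a colon, so punctuation glued to the end of a mention or hashtag is replaced along with it. The sketch below compares that style with a tighter `\w+` pattern. The tighter pattern is my assumption about what a handle or hashtag may contain (letters, digits, underscores), not something the notebook uses.

```python
import re

loose_user = re.compile(r"@([^\s:]+)+")  # pattern style used in this notebook
tight_user = re.compile(r"@\w+")         # assumption: letters, digits, underscores only
loose_tag = re.compile(r"#([^\s:]+)+")   # pattern style used in this notebook
tight_tag = re.compile(r"#\w+")          # assumption: letters, digits, underscores only

# Made-up text with punctuation directly after the mention and the hashtag.
text = "Hey @earny_joe, stay safe during #COVID19!"

loose_user_out = loose_user.sub("xxuser", text)
tight_user_out = tight_user.sub("xxuser", text)
loose_tag_out = loose_tag.sub("xxhashtag", text)
tight_tag_out = tight_tag.sub("xxhashtag", text)

print(loose_user_out)  # Hey xxuser stay safe during #COVID19!
print(tight_user_out)  # Hey xxuser, stay safe during #COVID19!
print(loose_tag_out)   # Hey @earny_joe, stay safe during xxhashtag
print(tight_tag_out)   # Hey @earny_joe, stay safe during xxhashtag!
```

Whether the trailing punctuation should survive depends on what the downstream tokenizer expects, so either choice can be reasonable for a baseline.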

In [33]:
# grab 5 observations to showcase hashtag_replace
test5 = df[df["processed_tweet"].str.contains(r"#([^\s:]+)+", flags=re.I)]["processed_tweet"][:5]

**Before `hashtag_replace`...**

In [34]:
for n in range(len(test5)):
    print(test5.iloc[n] + "\n")

I got fired yesterday due to the damage of coronavirus. Not only me, 70 colleagues got fired from my company. People are dying. Companies are closing. CHINA MUST TAKE RESPONSIBILITY FOR THIS!!!!!!! #COVID19 #ChinaVirus

NEW #CORONAVIRUS #COVID19 #VETgirl webinar - FREE - starting in 30 minutes, 8:30pm Eastern.   Sign up now!  Get the info YOU need to keep yourself, your practice, your clients, and your patients safe!  #Veterinary #VetMed #VetTech #Pandemic xxurl …

You may be hearing a lot of concerning information in the news, and even in these #coronavirus prevention tip videos. But the best way to proceed is to stay calm, stay informed, and know the facts. Learn more here. #COVID19  xxurl 

honestly? This shit is dark. What the Texas lieutenant governor proposed IN REAL LIFE is what I proposed as an absurd, way over-the-top (because it's so overtly inhumane!) remedy meant to call attention to a real problem  xxurl … h/t Swift #CoronavirusUSA #COVID19

Was this photo even taken durin

**After `hashtag_replace`...**

In [35]:
for n in range(len(test5)):
    print(hashtag_replace(test5.iloc[n]) + "\n")

I got fired yesterday due to the damage of coronavirus. Not only me, 70 colleagues got fired from my company. People are dying. Companies are closing. CHINA MUST TAKE RESPONSIBILITY FOR THIS!!!!!!! xxhashtag xxhashtag

NEW xxhashtag xxhashtag xxhashtag webinar - FREE - starting in 30 minutes, 8:30pm Eastern.   Sign up now!  Get the info YOU need to keep yourself, your practice, your clients, and your patients safe!  xxhashtag xxhashtag xxhashtag xxhashtag xxurl …

You may be hearing a lot of concerning information in the news, and even in these xxhashtag prevention tip videos. But the best way to proceed is to stay calm, stay informed, and know the facts. Learn more here. xxhashtag  xxurl 

honestly? This shit is dark. What the Texas lieutenant governor proposed IN REAL LIFE is what I proposed as an absurd, way over-the-top (because it's so overtly inhumane!) remedy meant to call attention to a real problem  xxurl … h/t Swift xxhashtag xxhashtag

Was this photo even taken during the CO

In [36]:
df["processed_tweet"] = df["processed_tweet"].progress_apply(hashtag_replace)


I'm going to stop here for today (**reference: April 7, 2020 @ 4:19pm PDT**). In the following cell, I'm going to save the DataFrame that I have so far into a pickle file - `covid19_0320_0324_updated.pkl` - and store it in the `playground_data` folder, which holds data that I've edited but that isn't quite in its final form yet.

In [5]:
#df.to_pickle("playground_data/covid19_0320_0324_updated.pkl")
#df = pd.read_pickle("playground_data/covid19_0320_0324_updated.pkl")

### _Emojis_

The last component of a Tweet, and potentially the most interesting, is the emoji. Today, emojis are no longer simple smiley faces, but a wide range of images, from thumbs up to check marks to the flags of countries. Luckily, there is a Python library with the creative name of [`emoji`](https://github.com/carpedm20/emoji/) to help us address this.

It gives us the ability to `demojize` text. For example, let's say we have the text: `Python is 👍`. All we need to do with the `emoji` library is the following:

- `emoji.demojize("Python is 👍")`

This will be returned in the following form: `Python is :thumbs_up:`. All of the emojis are demojized into this form, so we can then use the `re` library to search for this particular pattern (e.g. `:emoji_name:`) and replace with a filler token, `xxemoji`. 

That being said, we'll take the following steps over the next few cells:
- Develop an `emoji_replace` function that searches for this `:emoji_name:` pattern, and replaces it with ` xxemoji ` (the surrounding spaces separate it from any other text that may be directly next to it)
- Test out the above function on a small sample (text before & after)
- Then apply the function to our pandas DataFrame
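
Before building the function, here is a small sketch of how the `(:\S+:)+` pattern behaves. To keep it free of third-party dependencies, it runs on hand-written strings that are already in demojized form, so `emoji` itself is not called. Two things stand out: adjacent emojis collapse into a single `xxemoji` token, and anything else shaped like `:text:` (such as the middle of a timestamp) is replaced too. The lazy variant `:\S+?:` is my own addition for comparison, not part of the notebook.

```python
import re

greedy = re.compile(r"(:\S+:)+")  # pattern used for emoji_replace in this notebook
lazy = re.compile(r":\S+?:")      # hypothetical variant: one match per shortcode

# Hand-written, already-demojized strings for illustration.
double = "Two LOO Rolls :pile_of_poo::pile_of_poo:"
clock = "Webinar at 8:30:15 tonight"

greedy_count = len(greedy.findall(double))  # adjacent shortcodes form one match
lazy_count = len(lazy.findall(double))      # each shortcode is matched separately
clock_out = greedy.sub(" xxemoji ", clock)  # false positive on ":30:"

print(greedy_count)  # 1
print(lazy_count)    # 2
print(clock_out)     # Webinar at 8 xxemoji 15 tonight
```

For a baseline, collapsing runs of emojis into one token is a defensible simplification; the timestamp case is the one more likely to need attention later.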

In [62]:
import emoji

# function to help grab indexes for data that we can test below function on 
def emoji_detector(text):
    regex = re.compile(r"(:\S+:)+", re.I)
    return ",".join(x.group() for x in regex.finditer(text))

def emoji_replace(text):
    # first demojize text
    # (note: `use_aliases` was removed in emoji 2.0; newer versions use language="alias" instead)
    new_text = emoji.demojize(text, use_aliases=True)
    regex = re.compile(r"(:\S+:)+", re.I)
    return regex.sub(" xxemoji ", new_text)

In [55]:
# indexes from first 1000 observations that contain emoji
emoji_idx = df["processed_tweet"][:1000].progress_apply(lambda x: emoji.demojize(x)).progress_apply(emoji_detector) != ""


In [58]:
# create df where all text has emojis
subset = df[:1000]
df_emojis = subset[emoji_idx]

**Before `emoji_replace`...**

In [60]:
for n in range(5):
    print(df_emojis["processed_tweet"].iloc[n] + "\n")

Was this photo even taken during the COVID crisis?? 😉 JK is probably more scared of touching poverty and being infected with social services. Drastic non-privatized times call for drastic measures. No N95 for you! xxhashtag xxhashtag xxhashtag xxhashtag xxpictwit

Wana come up to my Hotel Room? . . . . . I have a Big Hand Sanitizer & Two LOO Rolls 💩💩 xxhashtag xxhashtag xxhashtag xxhashtag  xxpictwit

Director xxuser talking about how our agency is adapting to challenges of COVID-19 and where to go to get the resources you need. For food, cash assistance or Medicaid 👇  xxurl   For unemployment 👇  xxurl  xxhashtag xxhashtag xxpictwit

xxhashtag ✅ Imagine $6 trillion and how it would go to people, not a coup for corporations.  ✅Imagine having healthcare for you and enough medical personnel and equipment to deal with COVID19 ✅Imagine being able to pay for your basic needs xxhashtag xxhashtag xxpictwit

Trump wants to lift all lockdowns throughout the nation by Easter and he said more peop

**After `emoji_replace`...**

In [63]:
for n in range(5):
    print(emoji_replace(df_emojis["processed_tweet"].iloc[n]) + "\n")

Was this photo even taken during the COVID crisis??  xxemoji  JK is probably more scared of touching poverty and being infected with social services. Drastic non-privatized times call for drastic measures. No N95 for you! xxhashtag xxhashtag xxhashtag xxhashtag xxpictwit

Wana come up to my Hotel Room? . . . . . I have a Big Hand Sanitizer & Two LOO Rolls  xxemoji  xxhashtag xxhashtag xxhashtag xxhashtag  xxpictwit

Director xxuser talking about how our agency is adapting to challenges of COVID-19 and where to go to get the resources you need. For food, cash assistance or Medicaid  xxemoji   xxurl   For unemployment  xxemoji   xxurl  xxhashtag xxhashtag xxpictwit

xxhashtag  xxemoji  Imagine $6 trillion and how it would go to people, not a coup for corporations.   xxemoji Imagine having healthcare for you and enough medical personnel and equipment to deal with COVID19  xxemoji Imagine being able to pay for your basic needs xxhashtag xxhashtag xxpictwit

Trump wants to lift all lockdown

In [64]:
df["processed_tweet"] = df["processed_tweet"].progress_apply(emoji_replace)


In [65]:
# save dataframe so far for ease of future access
df.to_pickle("playground_data/covid19_0320_0324_updated_v2.pkl")
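
To recap, the individual steps in this notebook can be chained into a single function. The sketch below is one possible arrangement rather than the pipeline's final form: `STEPS` and `preprocess_tweet` are hypothetical names, the dots in the URL patterns are escaped, and the emoji step is left as an optional `demojize` callable (for example `emoji.demojize`) so the sketch runs with the standard library alone.

```python
import re

# Ordered (pattern, filler) pairs mirroring the steps in this notebook.
# Specific URL patterns come before the generic one so they keep their own token.
STEPS = [
    (re.compile(r"\n+"), " "),
    (re.compile(r"pic\.twitter\.com/\w+", re.I), "xxpictwit"),
    (re.compile(r"(https://youtu\.be/\S+)|(https://www\.youtube\.\S+)", re.I), "xxyoutubeurl"),
    (re.compile(r"(?:http|ftp|https)://\S+", re.I), "xxurl"),
    (re.compile(r"@([^\s:]+)+"), "xxuser"),
    (re.compile(r"#([^\s:]+)+"), "xxhashtag"),
]

EMOJI = re.compile(r"(:\S+:)+")

def preprocess_tweet(text, demojize=None):
    """Apply each replacement in order; `demojize` is an optional callable."""
    for pattern, filler in STEPS:
        text = pattern.sub(filler, text)
    if demojize is not None:
        # Convert emojis to :shortcode: form, then replace them with a filler.
        text = EMOJI.sub(" xxemoji ", demojize(text))
    return text

# Made-up tweet exercising most of the steps.
raw = "Stay safe @earny_joe\n\n#COVID19 https://youtu.be/abc123 pic.twitter.com/XyZ789"
result = preprocess_tweet(raw)
print(result)  # Stay safe xxuser xxhashtag xxyoutubeurl xxpictwit
```

On the DataFrame this would be applied once, e.g. `df["tweet"].progress_apply(preprocess_tweet)`, instead of making a separate pass over all rows for every step.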