Text preprocessing - text cleaning with the example of sentiment classification in Tweets

```{admonition} Information
__Section__: Text preprocessing  
__Goal__: Learn some methods for text preprocessing using Tweets.  
__Time needed__: x min  
__Prerequisites__: None
```

# Text preprocessing

In this part, we introduce some easy methods of text cleaning for sentiment analysis using Tweets.

## Sentiment analysis in Tweets

Tweets and social networks in general have the great advantage of offering a lot of accessible data, usable for many tasks and research questions. On Twitter, for example, it is easy to collect Tweets about a brand to analyze how the public perceives it. This process is also called "opinion mining" and is widely used and researched.

In this example of sentiment analysis on Tweets, we want to build a model that classifies each Tweet into one of 3 categories: positive, neutral or negative. This kind of classification makes it possible to measure the overall satisfaction expressed on social media for a set of selected Tweets (for example, Tweets about a new product). The text is analyzed by the model according to the method we introduced in the previous page (TODO: add a link to the page where we explain the method). The kind of preprocessing that we apply to the text can greatly impact the results.

Besides offering large amounts of data, social networks come with a difficulty: the language used is usually not normalized. Slang, spelling mistakes, emojis and other noise make the text hard to process for a text analysis model. This is why we first need to apply some preprocessing techniques.

## Basic text preprocessing

We will go through the steps with the following Tweet examples (found with a search about #Christmas):

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Hospitalizations from COVID-19 have increased nearly 90% and California officials say they could triple by Christmas. <a href="https://t.co/hrBnP04HnB">https://t.co/hrBnP04HnB</a></p>&mdash; KRON4 News (@kron4news) <a href="https://twitter.com/kron4news/status/1333785378818387969?ref_src=twsrc%5Etfw">December 1, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

In [15]:
# Save tweet
tweet_1 = 'Hospitalizations from COVID-19 have increased nearly 90% and California officials say they could triple by Christmas. https://t.co/hrBnP04HnB'

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Something for the afternoon slump / journey home / after school / cooking dinner ... a special 30 minute mix of cool Christmas tunes intercut with Christmas film samples and scratching <a href="https://twitter.com/BBCSounds?ref_src=twsrc%5Etfw">@BBCSounds</a> <a href="https://t.co/rHovIA3u5e">https://t.co/rHovIA3u5e</a></p>&mdash; Chris Hawkins (@ChrisHawkinsUK) <a href="https://twitter.com/ChrisHawkinsUK/status/1333784735571611648?ref_src=twsrc%5Etfw">December 1, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

In [19]:
# Save tweet
tweet_2 = 'Something for the afternoon slump / journey home / after school / cooking dinner ... a special 30 minute mix of cool Christmas tunes intercut with Christmas film samples and scratching @BBCSounds https://t.co/rHovIA3u5e'

TODO: find more example tweets covering the range of changes

In [59]:
# Put everything in a single list
list_tweets = []
list_tweets.append(tweet_1)
list_tweets.append(tweet_2)

### Remove URLs

URLs do not carry any information when we analyze text word by word, especially on Twitter, where they are shortened to a code to save space. One of the first reasonable things to do is therefore to remove them from the text.

For that, we can simply remove all sequences of characters starting with ``http``.

```{toggle} Advanced level
To remove all substrings matching a certain pattern in a string, we use regular expressions.  
In Python, we will use the function [sub()](https://docs.python.org/3/library/re.html#re.sub) from the library ``re``, which allows us to use regular expressions. This function replaces each occurrence of a specified pattern by another specified string. In our case, as we want to remove the URL, the replacement will be an empty string, i.e. ``''``.

The regular expression we use here is: ``https?://[^ ]+``. The part ``http`` is there because we expect a URL to start with those 4 characters. We then add ``s?``, because some URLs start with 'https' but not all of them. Next, we match ``://`` exactly; in Python, ``/`` does not need to be escaped. We finish with ``[^ ]+``, meaning any character but a space, one or more times.
```

In [60]:
import re

for i in range(0, len(list_tweets)):
    print('\n----- Tweet ', i)
    print(list_tweets[i])
    list_tweets[i] = re.sub(r'https?://[^ ]+', '', list_tweets[i])
    print(list_tweets[i])


----- Tweet  0
Hospitalizations from COVID-19 have increased nearly 90% and California officials say they could triple by Christmas. https://t.co/hrBnP04HnB
Hospitalizations from COVID-19 have increased nearly 90% and California officials say they could triple by Christmas. 

----- Tweet  1
Something for the afternoon slump / journey home / after school / cooking dinner ... a special 30 minute mix of cool Christmas tunes intercut with Christmas film samples and scratching @BBCSounds https://t.co/rHovIA3u5e
Something for the afternoon slump / journey home / after school / cooking dinner ... a special 30 minute mix of cool Christmas tunes intercut with Christmas film samples and scratching @BBCSounds 
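
The pattern used above stops only at a space. Tweets can also contain line breaks or tabs, in which case ``[^ ]+`` keeps matching past the end of the URL. Below is a minimal sketch of a more defensive variant, assuming we want the match to stop at any whitespace character. The helper name ``remove_urls`` and the example text (including its URL) are made up for illustration.

```python
import re

def remove_urls(text):
    # \S+ matches a run of non-whitespace characters, so the match
    # stops at spaces, tabs and line breaks alike.
    return re.sub(r'https?://\S+', '', text)

# Made-up example with the URL on its own line
example = 'Merry Christmas!\nhttps://t.co/abc123\nSee you soon'

# Pattern from this page: the match runs across the line break
print(repr(re.sub(r'https?://[^ ]+', '', example)))
# Variant with \S+: only the URL itself is removed
print(repr(remove_urls(example)))
```

With the space-based pattern, the word following the line break is swallowed together with the URL; the ``\S+`` variant leaves it in place.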


### Remove usernames

Same as for the URL, a username in a Tweet does not give any valuable information, because it is not recognized as a word carrying meaning. We therefore remove it.

On Twitter, all usernames start with the character ``@``. To remove them, we only have to remove all sequences of characters starting with ``@``.

```{toggle} Advanced level
The regular expression here is: ``@[^ ]+``, which matches ``@`` followed by every character up to (but not including) the next space.
```

In [36]:
for i in range(0, len(list_tweets)):
    print('\n----- Tweet ', i)
    print(list_tweets[i])
    list_tweets[i] = re.sub(r'@[^ ]+', '', list_tweets[i])
    print(list_tweets[i])


----- Tweet  0
Hospitalizations from COVID-19 have increased nearly 90% and California officials say they could triple by Christmas. 
Hospitalizations from COVID-19 have increased nearly 90% and California officials say they could triple by Christmas. 

----- Tweet  1
Something for the afternoon slump / journey home / after school / cooking dinner ... a special 30 minute mix of cool Christmas tunes intercut with Christmas film samples and scratching @BBCSounds 
Something for the afternoon slump / journey home / after school / cooking dinner ... a special 30 minute mix of cool Christmas tunes intercut with Christmas film samples and scratching  
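
Note that ``@[^ ]+`` removes everything up to the next space, including any punctuation glued to the username. The following sketch shows a stricter variant, assuming that usernames consist only of letters, digits and underscores. The helper name ``remove_usernames`` and the example sentence are hypothetical.

```python
import re

def remove_usernames(text):
    # '@' followed by one or more word characters (letters, digits, underscore)
    return re.sub(r'@\w+', '', text)

example = 'Thanks @BBCSounds, great mix!'

# Pattern from this page: the comma is removed together with the username
print(re.sub(r'@[^ ]+', '', example))
# Stricter variant: only the username itself is removed
print(remove_usernames(example))
```

For this page's method the difference is small, since punctuation is removed in a later step anyway; the stricter pattern mainly matters when punctuation should be kept.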


### Hashtags

Hashtags are hard to handle, but they usually contain useful information about the context and content of a Tweet. The problem with hashtags is that their words are written one after the other, without spaces, which makes them hard to interpret with a basic word extraction algorithm. However, most of the time, hashtags consist of only one word, preceded by the symbol ``#``. It is therefore useful to keep the part following the ``#``. If the hashtag is made of two or more words, it stays as noise in the data.

To deal with hashtags, we only remove the character ``#``.

```{toggle} Advanced level
The regular expression is very simple in that case: ``#``.
```

In [37]:
for i in range(0, len(list_tweets)):
    print('\n----- Tweet ', i)
    print(list_tweets[i])
    list_tweets[i] = re.sub(r'#', '', list_tweets[i])
    print(list_tweets[i])


----- Tweet  0
Hospitalizations from COVID-19 have increased nearly 90% and California officials say they could triple by Christmas. 
Hospitalizations from COVID-19 have increased nearly 90% and California officials say they could triple by Christmas. 

----- Tweet  1
Something for the afternoon slump / journey home / after school / cooking dinner ... a special 30 minute mix of cool Christmas tunes intercut with Christmas film samples and scratching  
Something for the afternoon slump / journey home / after school / cooking dinner ... a special 30 minute mix of cool Christmas tunes intercut with Christmas film samples and scratching  
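
Neither example Tweet contains a hashtag, so the loop above leaves them unchanged. The sketch below applies the same idea to a made-up Tweet. It also shows one possible way to split multi-word hashtags written in CamelCase; that optional step is our own assumption and not part of the method described on this page. The helper name ``strip_hashtags`` is hypothetical.

```python
import re

def strip_hashtags(text, split_camel_case=False):
    if split_camel_case:
        # Inside each hashtag, insert a space wherever a lowercase letter
        # is directly followed by an uppercase one:
        # '#MerryChristmas' -> '#Merry Christmas'
        text = re.sub(
            r'#\w+',
            lambda m: re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', m.group(0)),
            text,
        )
    # Drop the '#' symbol and keep the word(s) that follow it
    return text.replace('#', '')

example = 'So excited for #Christmas and #MerryChristmas'
print(strip_hashtags(example))
print(strip_hashtags(example, split_camel_case=True))
```

The CamelCase split only helps when the author capitalized each word; an all-lowercase hashtag such as ``#merrychristmas`` still stays as a single token.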


### Character normalization

As Twitter is used mostly informally, it is very common to find irregularly written words. One example is the repetition of characters to emphasize a statement: "It starts todaaaaaaaaay". The word "todaaaaaaaaay" won't be recognized by our algorithm, while the word "today" would be, and could convey important information.

We can replace each character that is repeated more than 2 times in a row by a single occurrence.

```{toggle} Advanced level
Here, we use a regular expression to match a letter repeated more than 2 times: ``([A-Za-z])\1{2,}``. This one is a bit more complicated than the previous ones. First, we use ``[A-Za-z]`` to match only letters. This group is placed between parentheses so that we can refer to it with ``\1``, which matches the __same__ character that was first matched, and not just any letter. Finally, we add ``{2,}`` to require at least 2 further repetitions of that character, i.e. 3 or more identical letters in a row.

In the function ``re.sub()``, as the second parameter, we use ``r'\1'`` to replace the whole match with a single occurrence of the captured character.
```

In [56]:
string = 'todaaaaaaaaay'
print(re.sub(r'([A-Za-z])\1{2,}', r'\1', string))

today
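
Collapsing a repeated letter to a single occurrence can damage words that legitimately contain a double letter. The sketch below contrasts the replacement used above with a variant that keeps two occurrences. Which one works better depends on the data; the helper name ``normalize_repeats`` and its ``keep`` parameter are hypothetical.

```python
import re

def normalize_repeats(text, keep=1):
    # Replace any letter repeated 3 or more times in a row
    # by `keep` occurrences of that letter.
    return re.sub(r'([A-Za-z])\1{2,}', lambda m: m.group(1) * keep, text)

for word in ['todaaaaaaaaay', 'cooool']:
    print(word, '->', normalize_repeats(word, keep=1), '/', normalize_repeats(word, keep=2))
```

Keeping one occurrence restores "today" but turns "cooool" into "col", while keeping two restores "cool" but leaves "todaay".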


### Negations and contractions

Words containing a contracted negation, such as "can't" or "don't", could be recognized by our algorithm. However, we can make its task simpler by expanding the contracted form in the text. A "not" is easier to interpret, as it is a more frequent word than each of the contracted forms.
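
This page does not give an implementation for this step, so the following is only a minimal sketch of one possible approach: a generic rule turns ``n't`` into `` not``, and a short, non-exhaustive table handles irregular forms that the generic rule would mangle. The names ``IRREGULAR`` and ``expand_negations`` are hypothetical.

```python
import re

# Irregular contractions that the generic rule below would mangle
# ("can't" would otherwise become "ca not"). Illustrative, not exhaustive.
IRREGULAR = {
    "can't": "can not",
    "won't": "will not",
    "shan't": "shall not",
}

def expand_negations(text):
    # Normalize the typographic apostrophe often found in Tweets
    text = text.replace('\u2019', "'")
    for contraction, expansion in IRREGULAR.items():
        pattern = r'\b' + re.escape(contraction) + r'\b'
        text = re.sub(pattern, expansion, text, flags=re.IGNORECASE)
    # Generic rule: "don't" -> "do not", "isn't" -> "is not", ...
    return re.sub(r"n't\b", ' not', text, flags=re.IGNORECASE)

print(expand_negations("I can't believe they don't care"))
```

This step has to run before punctuation is removed, otherwise the apostrophe is gone and "don't" has already become "dont".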

TODO: something like the more a word is recurrent the better information it contains

### Punctuation, special characters and numbers

In the same way, punctuation and single characters do not add any information with the method we use to process the text, as the algorithm for sentiment analysis only detects words.  
The same goes for numbers: they are not processed, understandably, as they do not represent a sentiment. An exception could be the number 0, as it can convey a negative sense. To preserve it, we can keep the number 0 and translate it into its textual form, "zero".

We decide to detect every standalone ``0``, transform it into ``zero``, and keep only letters otherwise. This has the effect of getting rid of all special characters and digits.

```{toggle} Advanced level
First, we decide to change all the zeros. For that, we select all zeros preceded and followed by a space, in order to only keep real zeros. This is simply done with the regular expression `` 0 ``, replacing each match with the word ``zero``.

Then, it is easier to remove all characters that are not letters or blank spaces: ``[^A-Za-z ]``. Inside the square brackets, the leading ``^`` negates the set, so the expression matches every character that is __not__ listed.
```

In [61]:
for i in range(0, len(list_tweets)):
    print('\n----- Tweet ', i)
    print(list_tweets[i])
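    # Spell out isolated zeros, keeping the surrounding spaces so that words stay separated
    list_tweets[i] = re.sub(r' 0 ', ' zero ', list_tweets[i])
    # Remove everything that is not a letter or a space
    list_tweets[i] = re.sub(r'[^A-Za-z ]', '', list_tweets[i])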
    print(list_tweets[i])


----- Tweet  0
Hospitalizations from COVID-19 have increased nearly 90% and California officials say they could triple by Christmas. 
Hospitalizations from COVID have increased nearly  and California officials say they could triple by Christmas 

----- Tweet  1
Something for the afternoon slump / journey home / after school / cooking dinner ... a special 30 minute mix of cool Christmas tunes intercut with Christmas film samples and scratching  
Something for the afternoon slump  journey home  after school  cooking dinner  a special  minute mix of cool Christmas tunes intercut with Christmas film samples and scratching  
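
The individual steps can be gathered in a single function. This is only a sketch, assuming the steps are applied in the order presented on this page. The function name ``clean_tweet`` and the final whitespace clean-up are our additions, and the contraction step is left out because the page does not define it.

```python
import re

def clean_tweet(text):
    text = re.sub(r'https?://[^ ]+', '', text)        # remove URLs
    text = re.sub(r'@[^ ]+', '', text)                # remove usernames
    text = text.replace('#', '')                      # keep the hashtag word, drop '#'
    text = re.sub(r'([A-Za-z])\1{2,}', r'\1', text)   # collapse letters repeated 3+ times
    text = re.sub(r' 0 ', ' zero ', text)             # spell out isolated zeros
    text = re.sub(r'[^A-Za-z ]', '', text)            # keep letters and spaces only
    # Extra step (our assumption): squeeze the runs of spaces left behind
    return re.sub(r' +', ' ', text).strip()

tweet_1 = 'Hospitalizations from COVID-19 have increased nearly 90% and California officials say they could triple by Christmas. https://t.co/hrBnP04HnB'
print(clean_tweet(tweet_1))
```

The order matters: URLs and usernames must be removed before the last substitution, because once ``:``, ``/`` and ``@`` are stripped they can no longer be told apart from ordinary words.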
