Text preprocessing - text cleaning with the example of sentiment classification in Tweets

```{admonition} Information
__Section__: Text preprocessing  
__Goal__: Learn some methods for text preprocessing using Tweets.  
__Time needed__: x min  
__Prerequisites__: None
```

# Text preprocessing

In this part, we introduce some easy methods of text cleaning for sentiment analysis using Tweets.

## Sentiment analysis in Tweets

Tweets, and social networks in general, have the great advantage of offering a lot of accessible data that can be used for many tasks and research projects. On Twitter, for example, it is easy to collect tweets about a brand in order to analyze how the public perceives it. This process is also called "opinion mining" and is widely used and researched.

In this particular example of sentiment analysis in Tweets, we want to build a model that classifies each Tweet into one of three categories: positive, neutral, or negative. This kind of classification makes it possible to measure the overall satisfaction expressed on social media for a selected set of Tweets (for example, Tweets about a new product). The model analyzes the text according to the method we introduced in the previous page (TODO: add a link to the page where we explain the method). The kind of preprocessing that we apply to the text can greatly impact the results.

Besides offering large amounts of data, social networks come with a difficulty: the language used is usually not normalized. Slang, spelling mistakes, emojis, and other noise make the text hard for a text analysis model to process. That is why we first need to apply some preprocessing techniques.

## Basic text preprocessing

We will go through the steps with the following Tweet examples (found with a search about #Christmas):

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Hospitalizations from COVID-19 have increased nearly 90% and California officials say they could triple by Christmas. <a href="https://t.co/hrBnP04HnB">https://t.co/hrBnP04HnB</a></p>&mdash; KRON4 News (@kron4news) <a href="https://twitter.com/kron4news/status/1333785378818387969?ref_src=twsrc%5Etfw">December 1, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

In [11]:
tweet_1 = 'Hospitalizations from COVID-19 have increased nearly 90% and California officials say they could triple by Christmas. https://t.co/hrBnP04HnB'

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Something for the afternoon slump / journey home / after school / cooking dinner ... a special 30 minute mix of cool Christmas tunes intercut with Christmas film samples and scratching <a href="https://twitter.com/BBCSounds?ref_src=twsrc%5Etfw">@BBCSounds</a> <a href="https://t.co/rHovIA3u5e">https://t.co/rHovIA3u5e</a></p>&mdash; Chris Hawkins (@ChrisHawkinsUK) <a href="https://twitter.com/ChrisHawkinsUK/status/1333784735571611648?ref_src=twsrc%5Etfw">December 1, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

In [2]:
tweet_2 = 'Something for the afternoon slump / journey home / after school / cooking dinner ... a special 30 minute mix of cool Christmas tunes intercut with Christmas film samples and scratching @BBCSounds https://t.co/rHovIA3u5e'

### Remove URLs

URLs do not give any information when we analyze text at the word level, especially on Twitter, where they are shortened to a code (such as ``https://t.co/hrBnP04HnB``) to take less space. One of the first reasonable things to do is therefore to remove them from the text.

For that, we can simply remove every sequence of characters starting with ``http``.

```{toggle} Advanced level
To remove all substrings matching a certain pattern in a string, we use regular expressions.  
In Python, we will use the function [sub()](https://docs.python.org/3/library/re.html#re.sub) from the library ``re``, which allows us to use regular expressions. This function replaces each occurrence of a specified pattern with another specified string. In our case, as we want to remove the URL, the replacement will be an empty string, i.e. ``''``.

The regular expression we use here is ``https?://[^ ]+``. The part ``http`` is there because we expect a URL to start with those 4 characters. Then we add ``s?``, because some URLs start with 'https' but not all of them. Next, we match ``://`` exactly; in Python, ``/`` does not need to be escaped. We finish with ``[^ ]+``, meaning one or more characters that are not a space. We prefer this over ``.+``, which would also match everything after the URL up to the end of the line.
```
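To see why the end of the pattern matters, here is a minimal sketch comparing ``.+`` with ``[^ ]+`` on a Tweet where the URL is not at the end. The sample text and its URL are made up for illustration and are not part of the examples above.

```python
import re

# Hypothetical Tweet text with the URL in the middle (made up for this sketch)
sample = "Check https://t.co/abc123 for the full story"

# '.+' matches any character up to the end of the line,
# so the words after the URL are removed as well
greedy = re.sub(r'https?://.+', '', sample)

# '[^ ]+' stops at the first space, so only the URL itself is removed
bounded = re.sub(r'https?://[^ ]+', '', sample)

print(repr(greedy))   # 'Check '
print(repr(bounded))  # 'Check  for the full story'
```

Note that the second result keeps the spaces that surrounded the URL, which is why two spaces remain between the words.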

In [12]:
import re

for tweet in [tweet_1, tweet_2]:
    print(tweet)
    tweet = re.sub(r'https?://[^ ]+', '', tweet)
    print(tweet)

Hospitalizations from COVID-19 have increased nearly 90% and California officials say they could triple by Christmas. https://t.co/hrBnP04HnB
Hospitalizations from COVID-19 have increased nearly 90% and California officials say they could triple by Christmas. 
Something for the afternoon slump / journey home / after school / cooking dinner ... a special 30 minute mix of cool Christmas tunes intercut with Christmas film samples and scratching @BBCSounds https://t.co/rHovIA3u5e
Something for the afternoon slump / journey home / after school / cooking dinner ... a special 30 minute mix of cool Christmas tunes intercut with Christmas film samples and scratching @BBCSounds 
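As the output shows, a trailing space is left where the URL used to be. One possible way to package this step is a small helper, sketched below. The names ``URL_PATTERN`` and ``remove_urls`` are hypothetical, and two choices are assumptions that go beyond the pattern above: using ``\S+`` so that the match also stops at tabs and newlines, and collapsing all runs of whitespace into single spaces.

```python
import re

# Assumption: '\S+' (any non-whitespace) instead of '[^ ]+',
# so a URL followed by a newline or tab is also delimited correctly
URL_PATTERN = re.compile(r'https?://\S+')


def remove_urls(text: str) -> str:
    """Remove http(s) URLs and collapse the leftover whitespace."""
    without_urls = URL_PATTERN.sub('', text)
    # split() with no argument splits on any whitespace and drops empty
    # pieces, so joining with ' ' also trims leading and trailing spaces
    return ' '.join(without_urls.split())


# Hypothetical input, made up for this sketch
cleaned = remove_urls('Read this https://t.co/abc123\nand this http://example.com now')
print(cleaned)  # Read this and this now
```

Collapsing whitespace also removes line breaks inside a Tweet, so this choice only fits if the later steps do not rely on them.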
