<a href="https://colab.research.google.com/github/bucuram/foundations-of-NLP-labs/blob/main/Lab1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Preprocessing

Lab overview:


* Normalization
* Tokenization
* Lematization
* Stemming
* Stopwords removal





##Text normalization (cleaning)

Depending on the task you are cleaning the text for, you may perform one or more of: 

* Transform text to lowercase
* Remove emoticons ( :) :D) and emojis (💙 🐱)
* Remove punctuation
* Remove digits or transform them to words
* Correct spelling errors


Python Regular Expressions 
*   [re Python documentation](https://docs.python.org/3/library/re.html)
*   [Quick reference](https://www.computerhope.com/unix/regex-quickref.htm)
*   [Cheat Sheet](https://www.dataquest.io/wp-content/uploads/2019/03/python-regular-expressions-cheat-sheet.pdf)

Transform text to lowercase

In [None]:
import re

text = ''' " Jaws " 🦈🦈🦈 is a rare film that grabs your attention before it shows you 
        a single image on screen. \nThe movie opens with blackness, and only distant, 
        alien-like underwater sounds. :) :D It deserves 5 stars, not 4 stars.'''

text = text.lower()

text

' " jaws " 🦈🦈🦈 is a rare film that grabs your attention before it shows you \n        a single image on screen. \nthe movie opens with blackness, and only distant, \n        alien-like underwater sounds. :) :d it deserves 5 stars, not 4 stars.'

Remove digits

In [None]:
re.sub(' \d+', '', text)

' " jaws " 🦈🦈🦈 is a rare film that grabs your attention before it shows you \n        a single image on screen. \nthe movie opens with blackness, and only distant, \n        alien-like underwater sounds. :) :d it deserves stars, not stars.'

Converting numbers to words using [num2words](https://github.com/savoirfairelinux/num2words) (it works on multiple languages)

We need to install the num2words library first.

In [None]:
!pip install num2words

Collecting num2words
  Downloading num2words-0.5.10-py3-none-any.whl (101 kB)
[?25l[K     |███▎                            | 10 kB 26.1 MB/s eta 0:00:01[K     |██████▌                         | 20 kB 34.1 MB/s eta 0:00:01[K     |█████████▊                      | 30 kB 36.5 MB/s eta 0:00:01[K     |█████████████                   | 40 kB 24.8 MB/s eta 0:00:01[K     |████████████████▏               | 51 kB 15.0 MB/s eta 0:00:01[K     |███████████████████▍            | 61 kB 12.3 MB/s eta 0:00:01[K     |██████████████████████▋         | 71 kB 13.8 MB/s eta 0:00:01[K     |█████████████████████████▉      | 81 kB 15.4 MB/s eta 0:00:01[K     |█████████████████████████████   | 92 kB 12.6 MB/s eta 0:00:01[K     |████████████████████████████████| 101 kB 6.5 MB/s 
Installing collected packages: num2words
Successfully installed num2words-0.5.10


After installing, we can import it.

In [None]:
from num2words import num2words

text = ' '.join([num2words(word) if word.isdigit() else word for word in text.split()])
text


'" jaws " 🦈🦈🦈 is a rare film that grabs your attention before it shows you a single image on screen. the movie opens with blackness, and only distant, alien-like underwater sounds. :) :d it deserves five stars, not four stars.'

Remove emoticons ( :) :D) and emojis (💙 🐱)

Using [emoji](https://github.com/carpedm20/emoji) library or the corresponding unicode characters.

We need to install the emoji library first.

In [None]:
!pip install emoji

Collecting emoji
  Downloading emoji-1.6.0.tar.gz (168 kB)
[?25l[K     |██                              | 10 kB 25.1 MB/s eta 0:00:01[K     |███▉                            | 20 kB 28.8 MB/s eta 0:00:01[K     |█████▉                          | 30 kB 34.9 MB/s eta 0:00:01[K     |███████▊                        | 40 kB 40.4 MB/s eta 0:00:01[K     |█████████▊                      | 51 kB 30.1 MB/s eta 0:00:01[K     |███████████▋                    | 61 kB 13.8 MB/s eta 0:00:01[K     |█████████████▋                  | 71 kB 12.4 MB/s eta 0:00:01[K     |███████████████▌                | 81 kB 13.4 MB/s eta 0:00:01[K     |█████████████████▌              | 92 kB 14.7 MB/s eta 0:00:01[K     |███████████████████▍            | 102 kB 16.0 MB/s eta 0:00:01[K     |█████████████████████▍          | 112 kB 16.0 MB/s eta 0:00:01[K     |███████████████████████▎        | 122 kB 16.0 MB/s eta 0:00:01[K     |█████████████████████████▎      | 133 kB 16.0 MB/s eta 0:00:01[K    

After installing, we can import it.

In [None]:
import emoji

emoji.get_emoji_regexp().sub(u'', text)

' " jaws "  is a rare film that grabs your attention before it shows you a single image on screen. \nthe movie opens with blackness, and only distant, alien-like underwater sounds. :) :d'

The *get_emoji_regexp()* function returns a regex to match any emoji.

Another way of removing emojis with regex:


In [None]:
emoj = re.compile("["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
    u"\U00002500-\U00002BEF"  # chinese char
    u"\U00002702-\U000027B0"
    u"\U00002702-\U000027B0"
    u"\U000024C2-\U0001F251"
    u"\U0001f926-\U0001f937"
    u"\U00010000-\U0010ffff"
    u"\u2640-\u2642" 
    u"\u2600-\u2B55"
    u"\u200d"
    u"\u23cf"
    u"\u23e9"
    u"\u231a"
    u"\ufe0f"
    u"\u3030"
    "]+", re.UNICODE)

text = re.sub(emoj, '', text)
text

'" jaws "  is a rare film that grabs your attention before it shows you a single image on screen. the movie opens with blackness, and only distant, alien-like underwater sounds.   it deserves five stars, not four stars.'

Removing emoticons (regex from [nltk Twitter Tokenizer](https://github.com/nltk/nltk/blob/develop/nltk/tokenize/casual.py))

In [None]:
emoticon_string = r"""
    (?:
      [<>]?
      [:;=8]                     # eyes
      [\-o\*\']?                 # optional nose
      [\)\]\(\[dDpP/\:\}\{@\|\\] # mouth
      |
      [\)\]\(\[dDpP/\:\}\{@\|\\] # mouth
      [\-o\*\']?                 # optional nose
      [:;=8]                     # eyes
      [<>]?
      |
      </?3                       # heart
    )"""
    
emoticon_re = re.compile(emoticon_string, re.VERBOSE | re.I | re.UNICODE)
text = re.sub(emoticon_re, '', text)
text

'" jaws "  is a rare film that grabs your attention before it shows you a single image on screen. the movie opens with blackness, and only distant, alien-like underwater sounds.   it deserves five stars, not four stars.'

## Tokenization


*   Word level: Split by whitespace, [nltk.word_tokenize](https://www.nltk.org/api/nltk.tokenize.html)
*   Sentence level: Split by punctuation, [nltk.sent_tokenize](https://www.nltk.org/api/nltk.tokenize.html)


In [None]:
print(text.split())

['"', 'jaws', '"', '🦈🦈🦈', 'is', 'a', 'rare', 'film', 'that', 'grabs', 'your', 'attention', 'before', 'it', 'shows', 'you', 'a', 'single', 'image', 'on', 'screen.', 'the', 'movie', 'opens', 'with', 'blackness,', 'and', 'only', 'distant,', 'alien-like', 'underwater', 'sounds.', ':)', ':d', 'it', 'deserves', '5', 'stars,', 'not', '4', 'stars.']


We need to download first the Punkt Tokenizer Models.

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [29]:
from nltk import word_tokenize
tokenized_text_nltk = word_tokenize(text)
print(tokenized_text_nltk)

['jaws', '🦈🦈🦈', 'is', 'a', 'rare', 'film', 'that', 'grabs', 'your', 'attention', 'before', 'it', 'shows', 'you', 'a', 'single', 'image', 'on', 'screen', 'the', 'movie', 'opens', 'with', 'blackness', 'and', 'only', 'distant', 'alienlike', 'underwater', 'sounds', 'd', 'it', 'deserves', '5', 'stars', 'not', '4', 'stars']


Sentence tokenization using regex

In [None]:
 re.split('(?<=[.!?]) +', text)

['" jaws " 🦈🦈🦈 is a rare film that grabs your attention before it shows you a single image on screen.',
 'the movie opens with blackness, and only distant, alien-like underwater sounds.',
 'it deserves five stars, not four stars.']

Sentence tokenization using nltk.sent_tokenize

In [None]:
nltk.sent_tokenize(text)

['" jaws " 🦈🦈🦈 is a rare film that grabs your attention before it shows you a single image on screen.',
 'the movie opens with blackness, and only distant, alien-like underwater sounds.',
 'it deserves five stars, not four stars.']

In [None]:
text_example = 'I was good.Thanks.'
re.split('(?<=[.!?]) +', text_example)

['I was good.Thanks.']

In [None]:
nltk.sent_tokenize(text_example)

['I was good.Thanks.']

Removing punctuation


In [None]:
re.sub(r'[^\w\s]','', text)

'  jaws   is a rare film that grabs your attention before it shows you \n a single image on screen \nthe movie opens with blackness and only distant \n alienlike underwater sounds  d it deserves 5 stars not 4 stars'

In [None]:
import string
text = text.translate(str.maketrans('', '', string.punctuation))
text

'  jaws  🦈🦈🦈 is a rare film that grabs your attention before it shows you \n        a single image on screen \nthe movie opens with blackness and only distant \n        alienlike underwater sounds  d it deserves 5 stars not 4 stars'

Removing multiple spaces between words

In [None]:
text = re.sub(' +', ' ', text)
text

' " jaws " 🦈🦈🦈 is a rare film that grabs your attention before it shows you \n a single image on screen. \nthe movie opens with blackness, and only distant, \n alien-like underwater sounds. :) :d it deserves 5 stars, not 4 stars.'

## Removing stopwords

![stopwords.jpg](https://user.oc-static.com/upload/2021/01/06/16099626487943_P1C2.png) 

[Photo credits](https://openclassrooms.com/en/courses/6532301-introduction-to-natural-language-processing/6980726-remove-stop-words-from-a-block-of-text)






###Why do we Need to Remove Stopwords?

For tasks such as text classification, we may want to remove any unnecessary words and keep only words with meaning. 

Stopwords removal is not used in tasks such as machine translation or text summarization.

Using [nltk](https://www.nltk.org/index.html) and [spaCy](https://spacy.io/).

Stopwords removal using nltk

In [32]:
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words_nltk = set(stopwords.words('english'))
print(len(stop_words_nltk))
print(stop_words_nltk)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
179
{'all', 'doing', 'than', 'while', "wouldn't", 'further', 'each', 'yourself', 'does', 'or', 'did', 'mustn', 'that', 'as', 'out', 'shouldn', 'he', 'between', 'am', "you'll", 'was', "it's", 'been', 'up', "wasn't", 'myself', "doesn't", "she's", 'we', "that'll", 'those', 'if', 'more', 'have', 'before', "shan't", 'him', 'no', 'its', 'were', "you've", 'having', 'into', 'not', 'll', 'on', 'which', "haven't", 'then', 'wasn', 'can', 'y', 'over', 'a', 'there', "mightn't", 'his', 'theirs', "needn't", 'won', 'yourselves', "weren't", 'until', 'some', 'isn', 'because', 'are', 'just', 'why', 'only', 'm', 'being', 'too', 'where', 'had', 'down', 'hers', 'needn', 'in', 'below', 'with', 'didn', "hasn't", 'about', 'against', 'shan', 'off', 'yours', 'our', 'so', "mustn't", 'both', 'your', 'they', 'once', 'weren', 'above', 'doesn', 'ain', 'mightn', "don't", 'an', 'by', "didn't", 'the', '

In [31]:
tokenized_text_without_stopwords = [i for i in tokenized_text_nltk if not i in stop_words_nltk]
print(tokenized_text_without_stopwords)

['jaws', '🦈🦈🦈', 'rare', 'film', 'grabs', 'attention', 'shows', 'single', 'image', 'screen', 'movie', 'opens', 'blackness', 'distant', 'alienlike', 'underwater', 'sounds', 'deserves', '5', 'stars', '4', 'stars']


Stopwords removal using spacy

In [33]:
nlp = spacy.load('en_core_web_sm')
stop_words_spacy = nlp.Defaults.stop_words
print(len(stop_words_spacy))
print(stop_words_spacy)

326
{'all', 'doing', 'toward', 'part', 'could', 'twenty', 'whenever', "'re", 'somewhere', 'nobody', 'beforehand', 'he', 'thereby', 'always', 'thereafter', '’re', 'whether', 'have', 'before', 'around', 'into', 'fifty', 'which', 'either', 'can', 'mostly', 'a', '‘s', 'please', 'because', 'elsewhere', 'never', 'behind', 'one', 'still', 'must', 'unless', 'become', 'where', 'had', 'latter', 'name', 'hereafter', 'least', 'below', 'with', 'everyone', 'quite', 'off', 'made', 'whose', 'beside', 'show', 'namely', 'n‘t', 'whence', 'the', 'afterwards', 'thru', 'thus', 'keep', 'amongst', 'else', 'any', 'very', 'herein', 'after', 'various', 'somehow', 'meanwhile', 'throughout', 'when', '‘m', 'be', 'formerly', 'sometime', 'top', 'often', 'these', 'towards', 'used', 'besides', 'while', 'really', 'further', "'ve", 'or', 'did', 'everywhere', 'n’t', 'next', "n't", 'forty', 'we', 'anything', 'someone', 'everything', 'those', 'anyway', 'noone', 'not', 'indeed', 'on', 'although', '‘ve', 'amount', 'beyond', '

In [34]:
tokenized_text_spacy = nlp(text)
tokenized_text_without_stopwords = [i for i in tokenized_text_spacy if not i in stop_words_spacy]
print(tokenized_text_without_stopwords)

[  , jaws,  , 🦈, 🦈, 🦈, is, a, rare, film, that, grabs, your, attention, before, it, shows, you, 
        , a, single, image, on, screen, 
, the, movie, opens, with, blackness, and, only, distant, 
        , alienlike, underwater, sounds,  , d, it, deserves, 5, stars, not, 4, stars]


## Lematization/Stemming

![1_HLQgkMt5-g5WO5VpNuTl_g.jpeg](https://miro.medium.com/max/564/1*HLQgkMt5-g5WO5VpNuTl_g.jpeg)

[Photo credits](https://tr.pinterest.com/pin/706854104005417976/)

Using [nltk](https://www.nltk.org/index.html) and [spaCy](https://spacy.io/).

Lematization

Using the WordNetLemmatizer from nltk


In [None]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

words = word_tokenize(text)
for word in words:
    print(word, lemmatizer.lemmatize(word))

jaws jaw
🦈🦈🦈 🦈🦈🦈
is is
a a
rare rare
film film
that that
grabs grab
your your
attention attention
before before
it it
shows show
you you
a a
single single
image image
on on
screen screen
the the
movie movie
opens open
with with
blackness blackness
and and
only only
distant distant
alienlike alienlike
underwater underwater
sounds sound
d d
it it
deserves deserves
5 5
stars star
not not
4 4
stars star


Using the [lemmatizer](https://spacy.io/api/lemmatizer) from spacy

In [None]:
!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download en_core_web_sm

Collecting pip
  Downloading pip-21.3-py3-none-any.whl (1.7 MB)
[K     |████████████████████████████████| 1.7 MB 2.7 MB/s 
Collecting setuptools
  Downloading setuptools-58.2.0-py3-none-any.whl (946 kB)
[K     |████████████████████████████████| 946 kB 42.2 MB/s 
Installing collected packages: setuptools, pip
  Attempting uninstall: setuptools
    Found existing installation: setuptools 57.4.0
    Uninstalling setuptools-57.4.0:
      Successfully uninstalled setuptools-57.4.0
  Attempting uninstall: pip
    Found existing installation: pip 21.1.3
    Uninstalling pip-21.1.3:
      Successfully uninstalled pip-21.1.3
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.[0m
Successfully installed pip-21.3 setuptools-58.2.0


Collecting en-core-web-sm==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0-py3-none-any.whl (13.6 MB)
     |████████████████████████████████| 13.6 MB 74 kB/s             
Installing collected packages: en-core-web-sm
  Attempting uninstall: en-core-web-sm
    Found existing installation: en-core-web-sm 2.2.5
    Uninstalling en-core-web-sm-2.2.5:
      Successfully uninstalled en-core-web-sm-2.2.5
Successfully installed en-core-web-sm-3.1.0
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


In [None]:
import spacy

# Load English tokenizer, tagger, parser, etc.
nlp = spacy.load("en_core_web_sm")

doc = nlp(text)

for token in doc:
  print(token, token.lemma_)

     
jaws jaw
   
🦈 🦈
🦈 🦈
🦈 🦈
is be
a a
rare rare
film film
that that
grabs grab
your your
attention attention
before before
it it
shows show
you you

         
        
a a
single single
image image
on on
screen screen

 

the the
movie movie
opens open
with with
blackness blackness
and and
only only
distant distant

         
        
alienlike alienlike
underwater underwater
sounds sound
   
d d
it it
deserves deserve
5 5
stars star
not not
4 4
stars star


Stemming in using nltk

In [None]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
for word in words:
    print(word, ps.stem(word))

jaws jaw
🦈🦈🦈 🦈🦈🦈
is is
a a
rare rare
film film
that that
grabs grab
your your
attention attent
before befor
it it
shows show
you you
a a
single singl
image imag
on on
screen screen
the the
movie movi
opens open
with with
blackness black
and and
only onli
distant distant
alienlike alienlik
underwater underwat
sounds sound
d d
it it
deserves deserv
5 5
stars star
not not
4 4
stars star


[Other stemmers in nltk](https://www.nltk.org/api/nltk.stem.html)

The spacy library does not perform stemming, only lemmatization.

# Assignment

To be uploaded here: https://forms.gle/ygCNwFM4i5RMPtsC6

Preprocess texts from Twitter

## Data

We will use the twitter corpus from nltk, usually used in sentiment analysis.

The fist step is downloading the dataset using the *download* function.

In [None]:
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

In order to inspect our data, we look at the first 25 tweets from the dataset. The text contains a lot of mentions, hashtags and emoticons.

In [None]:
from nltk.corpus  import twitter_samples

tweets = twitter_samples.strings('positive_tweets.json')
tweets[:25]

['#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)',
 '@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!',
 '@DespiteOfficial we had a listen last night :) As You Bleed is an amazing track. When are you in Scotland?!',
 '@97sides CONGRATS :)',
 'yeaaaah yippppy!!!  my accnt verified rqst has succeed got a blue tick mark on my fb profile :) in 15 days',
 '@BhaktisBanter @PallaviRuhail This one is irresistible :)\n#FlipkartFashionFriday http://t.co/EbZ0L2VENM',
 "We don't like to keep our lovely customers waiting for long! We hope you enjoy! Happy Friday! - LWWF :) https://t.co/smyYriipxI",
 '@Impatientraider On second thought, there’s just not enough time for a DD :) But new shorts entering system. Sheep must be buying.',
 'Jgh , but we have to go to Bayan :D bye',
 'As an act of mischievousness, am calling the ETL layer of our in-house warehousing 

**Given a list of tweets, preprocess each tweet from the list.**

**Instructions**: Implement the *preprocess* function. You can do the text cleaning in any order you prefer.

**Hint**: You may need to use regex expressions (use the resources provided above).


In [None]:
def preprocess(tweets):

    """
    Input: 
        tweets: a list of tweets
    Output: 
        prepocessed_tweets: a list of preprocessed tweets
    """

    preprocessed_tweets = []

    for tweet in tweets:

        ###remove new line characters '\n'
        ###remove links http://t.co/of3DyOzML0
        ###remove mentions '@'
        ###remove hashtags '#'
        ###lowercase text
        ###remove emojis '👌 🍭 :) :D'
        ###remove digits
        ###remove punctuation
        ###tokenize tweet into separate words
        ###lematization or stemming
        ###remove stopwords
        
        preprocessed_tweets.append(words)
    
    return prepocessed_tweets

preprocess(tweet)

Using NLTK’s Pre-Trained Sentiment Analyzer **VADER** (Valence Aware Dictionary and sEntiment Reasoner).

In [None]:
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [None]:

from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
sia.polarity_scores(tweets[0])

{'compound': 0.7579, 'neg': 0.0, 'neu': 0.615, 'pos': 0.385}

Tools:

* [Preprocessing library for Twitter](https://github.com/s/preprocessor)
* [Emoji library](https://github.com/carpedm20/emoji)
* [Demoji library](https://github.com/bsolomon1124/demoji)


Further reading:

* [Lexical Normalization](https://arxiv.org/pdf/1710.03476.pdf)
* [On learning and representing social meaning in NLP: a sociolinguistic perspective](https://aclanthology.org/2021.naacl-main.50.pdf)






