<a href="https://colab.research.google.com/github/bucuram/foundations-of-NLP-labs/blob/main/Lab1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Preprocessing

Lab overview:


* Normalization
* Tokenization
* Lemmatization
* Stemming
* Stopwords removal





## Text normalization (cleaning)

Depending on the task you are cleaning the text for, you may perform one or more of: 

* Transform text to lowercase
* Remove emoticons ( :) :D) and emojis (💙 🐱)
* Remove punctuation
* Remove digits or transform them to words
* Correct spelling errors


Python Regular Expressions 
*   [re Python documentation](https://docs.python.org/3/library/re.html)
*   [Quick reference](https://www.computerhope.com/unix/regex-quickref.htm)
*   [Cheat Sheet](https://www.dataquest.io/wp-content/uploads/2019/03/python-regular-expressions-cheat-sheet.pdf)

![regular_expressions](https://res.cloudinary.com/practicaldev/image/fetch/s--_iE0KvdT--/c_imagga_scale,f_auto,fl_progressive,h_900,q_auto,w_1600/https://dev-to-uploads.s3.amazonaws.com/i/zpek00ubevoxvn458b01.png)

[Photo source](https://dev.to/mconner89/regular-expressions-grouping-and-string-methods-3ijn)

Here is our text sample, a short review of the movie [Jaws](https://en.wikipedia.org/wiki/Jaws_(film))

In [None]:
text = '" Jaws " 🦈🦈🦈 is a rare film that grabs your attention before it shows you a single image on screen. The movie opens with blackness, and only distant, alien-like underwater sounds. :) :D It deserves 5 stars, not 4 stars.'
text

Transform text to lowercase

In [None]:
text = text.lower()
text

Importing the [re](https://docs.python.org/3/library/re.html) library

In [None]:
import re

Remove digits

In [None]:
re.sub(r' \d+', '', text)
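
The pattern above only matches digits that follow a space, so a number at the very start of the string would survive. A minimal sketch of a stricter variant, assuming we want every run of digits removed and the leftover whitespace collapsed (the `remove_digits` helper name is hypothetical, not part of the lab):

```python
import re

def remove_digits(text):
    # drop every run of digits, wherever it appears
    without_digits = re.sub(r'\d+', '', text)
    # collapse the gaps left behind and trim both ends
    return re.sub(r'\s+', ' ', without_digits).strip()

sample = "5 stars, not 4 stars. see you in 2024!"
print(re.sub(r' \d+', '', sample))  # the leading "5" is kept
print(remove_digits(sample))        # all digits are gone
```

Which variant is preferable depends on the task: removing every digit also strips digits embedded in tokens such as product codes.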

Converting numbers to words using [num2words](https://github.com/savoirfairelinux/num2words) (it supports multiple languages)

We need to install the num2words library first.

In [None]:
!pip install num2words

After installing, we can import it.

In [None]:
from num2words import num2words

text = ' '.join([num2words(word) if word.isdigit() else word for word in text.split()])
text


Remove emoticons ( :) :D) and emojis (💙 🐱)

Using the [emoji](https://github.com/carpedm20/emoji) library or the corresponding Unicode character ranges.

We need to install the emoji library first.

In [None]:
!pip install emoji

After installing, we can import it.

In [None]:
import emoji

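# get_emoji_regexp() was removed in emoji 2.0; replace_emoji() is its replacement
emoji.replace_emoji(text, replace='')

The *replace_emoji()* function replaces every emoji in the string with the given replacement (here, the empty string). Versions of the library before 2.0 exposed *get_emoji_regexp()*, which returned a regex matching any emoji.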

Another way of removing emojis with regex:


In [None]:
emoj = re.compile("["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
    u"\U00002500-\U00002BEF"  # box drawing, geometric shapes, misc symbols, arrows
    u"\U00002702-\U000027B0"  # dingbats
    u"\U000024C2-\U0001F251"  # enclosed characters; note this wide range also covers CJK text
    u"\U0001f926-\U0001f937"
    u"\U00010000-\U0010ffff"
    u"\u2640-\u2642" 
    u"\u2600-\u2B55"
    u"\u200d"
    u"\u23cf"
    u"\u23e9"
    u"\u231a"
    u"\ufe0f"
    u"\u3030"
    "]+", re.UNICODE)

text = re.sub(emoj, '', text)
text
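
Hand-written ranges are easy to get wrong. As an alternative sketch using only the standard library, one can drop every character whose Unicode general category is `So` (Symbol, other), which covers most pictographic emojis. This is a shortcut with known gaps: the `strip_symbols` name is hypothetical, modifiers such as skin tones or zero-width joiners fall in other categories, and non-emoji symbols such as © are removed as well.

```python
import unicodedata

def strip_symbols(text):
    # keep only characters that are not in the "Symbol, other" (So) category
    return ''.join(ch for ch in text if unicodedata.category(ch) != 'So')

print(strip_symbols('" jaws " 🦈🦈🦈 is a rare film'))
print(strip_symbols('i love it 💙 🐱'))
```

The spaces that surrounded the removed characters remain, so a whitespace clean-up step is still needed afterwards.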

Removing emoticons (regex from [nltk Twitter Tokenizer](https://github.com/nltk/nltk/blob/develop/nltk/tokenize/casual.py))

In [None]:
emoticon_string = r"""
    (?:
      [<>]?
      [:;=8]                     # eyes
      [\-o\*\']?                 # optional nose
      [\)\]\(\[dDpP/\:\}\{@\|\\] # mouth
      |
      [\)\]\(\[dDpP/\:\}\{@\|\\] # mouth
      [\-o\*\']?                 # optional nose
      [:;=8]                     # eyes
      [<>]?
      |
      </?3                       # heart
    )"""
    
emoticon_re = re.compile(emoticon_string, re.VERBOSE | re.I | re.UNICODE)
text = re.sub(emoticon_re, '', text)
text

## Tokenization


*   Word level: Split by whitespace, [nltk.word_tokenize](https://www.nltk.org/api/nltk.tokenize.html)
*   Sentence level: Split by punctuation, [nltk.sent_tokenize](https://www.nltk.org/api/nltk.tokenize.html)


In [None]:
print(text.split())

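We first need to download the Punkt tokenizer models. Recent NLTK releases look for the `punkt_tab` resource instead, so we request both.

In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # needed by newer NLTK versions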

In [None]:
from nltk import word_tokenize
tokenized_text_nltk = word_tokenize(text)
print(tokenized_text_nltk)
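
To see why a tokenizer does more than split on whitespace, here is a rough regex-based sketch (the `simple_tokenize` helper is hypothetical and much cruder than `word_tokenize`). It separates punctuation from words while keeping hyphenated words and contractions together:

```python
import re

# a word (optionally continued by -part or 'part), or a single punctuation mark
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def simple_tokenize(text):
    return TOKEN_RE.findall(text)

sentence = "alien-like underwater sounds, isn't it?"
print(sentence.split())           # punctuation stays glued to the words
print(simple_tokenize(sentence))  # punctuation becomes separate tokens
```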

Sentence tokenization using regex

In [None]:
re.split(r'(?<=[.!?]) +', text)

Sentence tokenization using nltk.sent_tokenize

In [None]:
nltk.sent_tokenize(text)

In [None]:
text_example = 'I was good.Thanks.'
re.split('(?<=[.!?]) +', text_example)

In [None]:
nltk.sent_tokenize(text_example)
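
The regex split needs at least one space after the punctuation mark, so 'I was good.Thanks.' stays in one piece. A sketch of a more permissive variant, assuming Python 3.7 or later (where `re.split` can split on empty matches), makes the whitespace optional. The `split_sentences` name is hypothetical.

```python
import re

def split_sentences(text):
    # split after ., ! or ? even when no whitespace follows
    parts = re.split(r'(?<=[.!?])\s*', text)
    # the match at the end of the string leaves an empty piece; drop it
    return [part for part in parts if part]

print(split_sentences('I was good.Thanks.'))
```

The trade-off is that this pattern also splits inside decimals such as '3.5' and abbreviations such as 'e.g.', which are the cases a trained tokenizer such as Punkt is designed to handle.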

Removing punctuation


In [None]:
re.sub(r'[^\w\s]','', text)

Using the [string](https://docs.python.org/3/library/string.html) library.

The string.punctuation constant is a string containing the ASCII punctuation characters.

The str.maketrans('', '', string.punctuation) call builds a translation table that maps every punctuation character to None, and the translate() method applies that table, deleting those characters from our string.

In [None]:
import string
text = text.translate(str.maketrans('', '', string.punctuation))
text
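
To make the translation table less opaque, the sketch below inspects what `str.maketrans` returns and shows one limitation: `string.punctuation` only contains ASCII punctuation, so typographic quotes survive. The sample strings are made up for illustration.

```python
import string

table = str.maketrans('', '', string.punctuation)

# the table maps code points to None, which translate() treats as "delete"
print(table[ord('!')])

print('hello, world!'.translate(table))
# curly quotes are not ASCII punctuation, so they are left in place
print('“quoted” text'.translate(table))
```

For text with typographic punctuation, the regex `[^\w\s]` shown earlier is the more thorough option.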

Removing multiple spaces between words

In [None]:
text = re.sub(' +', ' ', text)
text
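
The individual cleaning steps can be chained into a single helper. The sketch below is one possible ordering using only the standard library (the `normalize` name and the choice of steps are assumptions; emoji and emoticon removal are left out for brevity):

```python
import re
import string

def normalize(text):
    text = text.lower()                                               # lowercase
    text = re.sub(r'\d+', '', text)                                   # remove digits
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    text = re.sub(r'\s+', ' ', text).strip()                          # collapse whitespace
    return text

print(normalize("It deserves 5 stars,  not 4 stars."))
```

Order matters: removing punctuation before tokenizing merges hyphenated words such as "alien-like" into "alienlike", so some pipelines tokenize first and filter punctuation tokens afterwards.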

## Removing stopwords

![stopwords.jpg](https://user.oc-static.com/upload/2021/01/06/16099626487943_P1C2.png) 

[Photo source](https://openclassrooms.com/en/courses/6532301-introduction-to-natural-language-processing/6980726-remove-stop-words-from-a-block-of-text)






### Why do we need to remove stopwords?

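For tasks such as text classification, we may want to remove very frequent function words (the, is, of) that carry little topical meaning and keep only the content words.

Stopword removal is usually avoided in tasks such as machine translation or text summarization, where function words are needed to produce fluent, grammatical output.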
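
A small illustration of the effect, using a tiny hand-picked stopword set (hypothetical, far smaller than the nltk or spaCy lists). Negations such as "not" appear in common stopword lists, so removing them can change the meaning of a review:

```python
# hypothetical mini stopword list, for illustration only
stop_words = {'it', 'is', 'a', 'the', 'not', 'that', 'of'}

def remove_stopwords(tokens, stop_words):
    return [token for token in tokens if token not in stop_words]

tokens = ['it', 'is', 'not', 'a', 'rare', 'film']
print(remove_stopwords(tokens, stop_words))
```

The output keeps "rare" and "film" but drops "not", which is why sentiment-oriented pipelines often use a customized stopword list.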

Using [nltk](https://www.nltk.org/index.html) and [spaCy](https://spacy.io/).

Stopwords removal using nltk

In [None]:
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words_nltk = set(stopwords.words('english'))
print(len(stop_words_nltk))
print(stop_words_nltk)

In [None]:
tokenized_text_without_stopwords = [i for i in tokenized_text_nltk if i not in stop_words_nltk]
print(tokenized_text_without_stopwords)

Stopwords removal using spaCy

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
stop_words_spacy = nlp.Defaults.stop_words
print(len(stop_words_spacy))
print(stop_words_spacy)

In [None]:
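tokenized_text_spacy = nlp(text)
# compare each token's text (a string) with the stopword set; a Token object itself never equals a string
tokenized_text_without_stopwords = [token.text for token in tokenized_text_spacy if token.text not in stop_words_spacy]
print(tokenized_text_without_stopwords)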

## Lemmatization/Stemming

![1_HLQgkMt5-g5WO5VpNuTl_g.jpeg](https://miro.medium.com/max/564/1*HLQgkMt5-g5WO5VpNuTl_g.jpeg)

[Photo source](https://tr.pinterest.com/pin/706854104005417976/)

Using [nltk](https://www.nltk.org/index.html) and [spaCy](https://spacy.io/).

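Lemmatization

Using the WordNetLemmatizer from nltk. By default, lemmatize() treats each word as a noun; pass the pos argument (for example pos='v') to lemmatize verbs.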


In [None]:
nltk.download('wordnet')

In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

words = word_tokenize(text)
for word in words:
    print(word, lemmatizer.lemmatize(word))

Using the [lemmatizer](https://spacy.io/api/lemmatizer) from spaCy

In [None]:
!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download en_core_web_sm

In [None]:
import spacy

# Load English tokenizer, tagger, parser, etc.
nlp = spacy.load("en_core_web_sm")

doc = nlp(text)

for token in doc:
    print(token, token.lemma_)

Stemming using nltk

In [None]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
for word in words:
    print(word, ps.stem(word))

[Other stemmers in nltk](https://www.nltk.org/api/nltk.stem.html)

The spaCy library does not perform stemming, only lemmatization.
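
To illustrate why stems are not always dictionary words, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm; the `naive_stem` name and the suffix list are hypothetical and only show the rule-based idea behind stemming:

```python
# hypothetical suffix list, far smaller than Porter's rule set
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    for suffix in SUFFIXES:
        # strip the first matching suffix, but keep at least three characters
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for word in ["grabs", "deserves", "opens", "is", "blackness"]:
    print(word, naive_stem(word))
```

A lemmatizer would map "deserves" to the dictionary form "deserve", whereas this rule-based approach simply cuts the suffix and leaves "deserv".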

# Assignment

To be uploaded here: https://forms.gle/ygCNwFM4i5RMPtsC6

Preprocess texts from Twitter

## Data

We will use the Twitter samples corpus from nltk, which is commonly used for sentiment analysis.

The first step is downloading the dataset using the *download* function.

In [None]:
import nltk
nltk.download('twitter_samples')

In order to inspect our data, we look at the first 25 tweets from the dataset. The text contains a lot of mentions, hashtags and emoticons.

In [None]:
from nltk.corpus import twitter_samples

tweets = twitter_samples.strings('positive_tweets.json')
tweets = tweets[:25]
tweets

**Given a list of tweets, preprocess each tweet from the list.**

**Instructions**: Implement the *preprocess* function. You can do the text cleaning in any order you prefer.

**Hint**: You may need to use regular expressions (use the resources provided above).


In [None]:
def preprocess(tweets):

    """
    Input: 
        tweets: a list of tweets
    Output: 
        preprocessed_tweets: a list of preprocessed tweets
    """

    ###list in which to store the processed tweets
    ###pay attention that some of the cleaning steps can be done at the document level, while others may be computed at word level
    preprocessed_tweets = []

    for tweet in tweets:

        ###remove new line characters '\n'
        ###remove links http://t.co/of3DyOzML0
        ###remove mentions '@'
        ###remove hashtags '#'
        ###lowercase text
        ###remove emojis and emoticons '👌 🍭 :) :D'
        ###remove digits
        ###remove punctuation
        ###tokenize tweet into separate words
        ###remove stopwords
        ###lemmatization or stemming

        ###placeholder so the skeleton runs: append your cleaned result instead of the raw tweet
        preprocessed_tweets.append(tweet)
    
    return preprocessed_tweets

preprocess(tweets)

Tools:

* [Preprocessing library for Twitter](https://github.com/s/preprocessor)
* [Emoji library](https://github.com/carpedm20/emoji)
* [Demoji library](https://github.com/bsolomon1124/demoji)
* [Gensim](https://radimrehurek.com/gensim/)


Further reading:

* [Lexical Normalization](https://arxiv.org/pdf/1710.03476.pdf)
* [On learning and representing social meaning in NLP: a sociolinguistic perspective](https://aclanthology.org/2021.naacl-main.50.pdf)






