# <center>Natural Language Processing Hands-on #2</center>

During the Natural Language Processing course, text representation algorithms were introduced. However, they are not sufficient on their own to build complete NLP systems.

Most such systems first rely on text pre-processing -- in other words, they rely on a specific data pipeline that is tied to the final task you are trying to solve.

As a result, we will try in this notebook to create a pipeline from scratch given a specific final task.

# Resources you'll need

## Machine Learning libraries

Lots of ML libraries exist in the wild. There are general-purpose libraries such as [scikit-learn](https://scikit-learn.org/stable/), domain-specific libraries such as [nltk](https://www.nltk.org/), and highly specific implementations of optimized algorithms such as [annoy](https://pypi.org/project/annoy/).

In this notebook, you'll need to rely on the following packages:

   - [scikit-learn](https://scikit-learn.org/stable/): all-purpose machine learning library for models that aren't neural network based.
   - [nltk](https://www.nltk.org/): natural language toolkit -- implements lots of preprocessing steps and text transformations.
   - [gensim](https://radimrehurek.com/gensim/): library designed to be easy to use for both topic modeling and text representation.
   - [spacy](https://spacy.io/): library for industrializing NLP systems. Provides lots of pretrained weights for various models.

Usually, a simple pip install is sufficient for them to work. Feel free to install them in a dedicated virtual environment, which is good practice. If you want to know more about that, see the [virtualenvwrapper documentation](https://virtualenvwrapper.readthedocs.io/en/latest/).
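To check which of these packages can already be imported in the current environment, a minimal sketch using only the standard library could look like this. The helper name `check_packages` is made up for this example; note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util


def check_packages(names):
    """Return a dict mapping each import name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Import names of the packages used in this notebook
# (scikit-learn is imported under the name `sklearn`).
status = check_packages(["sklearn", "nltk", "gensim", "spacy"])
for name, available in status.items():
    print(f"{name}: {'available' if available else 'missing'}")
```

Any package reported as missing can then be installed with pip inside the virtual environment.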

## Data & final task definition

Given the [News dataset](https://www.kaggle.com/rmisra/news-category-dataset/download) (also available alongside this notebook), you'll have to build a simple topic modeling system that will identify the topics of the news headlines.

Those headlines have already been labelled. Here are the categories and document counts of this dataset:

* POLITICS: 32739

* WELLNESS: 17827

* ENTERTAINMENT: 16058

* TRAVEL: 9887

* STYLE & BEAUTY: 9649

* PARENTING: 8677

* HEALTHY LIVING: 6694

* QUEER VOICES: 6314

* FOOD & DRINK: 6226

* BUSINESS: 5937

* COMEDY: 5175

* SPORTS: 4884

* BLACK VOICES: 4528

* HOME & LIVING: 4195

* PARENTS: 3955

* THE WORLDPOST: 3664

* WEDDINGS: 3651

* WOMEN: 3490

* IMPACT: 3459

* DIVORCE: 3426

* CRIME: 3405

* MEDIA: 2815

* WEIRD NEWS: 2670

* GREEN: 2622

* WORLDPOST: 2579

* RELIGION: 2556

* STYLE: 2254

* SCIENCE: 2178

* WORLD NEWS: 2177

* TASTE: 2096

* TECH: 2082

* MONEY: 1707

* ARTS: 1509

* FIFTY: 1401

* GOOD NEWS: 1398

* ARTS & CULTURE: 1339

* ENVIRONMENT: 1323

* COLLEGE: 1144

* LATINO VOICES: 1129

* CULTURE & ARTS: 1030

* EDUCATION: 1004

# Exploring the dataset

In [2]:
import pandas as pd
import itertools

You can load the dataset using pandas and the [.read_json()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html) method. Try loading your dataset here:

In [3]:
dataset = pd.read_json("News_Category_Dataset_v2.json", lines=True, dtype={"headline": str})
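The `lines=True` argument tells pandas that the file holds one JSON object per line (the JSON Lines layout) rather than a single JSON array. As a rough illustration using only the standard library, the sketch below parses two records laid out that way. The records are invented; `headline` is the field used in this notebook, and `category` is assumed here for illustration.

```python
import io
import json

# Two invented records in JSON Lines layout: one JSON object per line.
raw = io.StringIO(
    '{"category": "SPORTS", "headline": "Local Team Wins Final"}\n'
    '{"category": "TECH", "headline": "New Phone Released"}\n'
)

# Parse each non-empty line independently, which is what lines=True implies.
records = [json.loads(line) for line in raw if line.strip()]
headlines = [record["headline"] for record in records]
print(headlines)
```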

------

A good way to get a grasp of your corpus is to count the occurrences of words across it. For convenience, we've defined a dummy function that splits texts wherever there is a space, and nothing more. This is the most basic form of word identification that could be used.

In [4]:
def dummy_word_split(texts):
    """Function identifying words in a sentence in a really dummy way.
        
        Argument:
            - texts (list of str): a list of raw texts in which we'd like to identify words
            
        Return:
            - list of list containing each word separately.
    """
    texts_out = []
    for text in texts:
        texts_out.append(text.split(" "))
        
    return texts_out

In [5]:
splitted_texts = dummy_word_split(dataset["headline"].tolist())

In [6]:
splitted_texts[0] + splitted_texts[1]

['There',
 'Were',
 '2',
 'Mass',
 'Shootings',
 'In',
 'Texas',
 'Last',
 'Week,',
 'But',
 'Only',
 '1',
 'On',
 'TV',
 'Will',
 'Smith',
 'Joins',
 'Diplo',
 'And',
 'Nicky',
 'Jam',
 'For',
 'The',
 '2018',
 'World',
 "Cup's",
 'Official',
 'Song']

Now, let's define a function that counts word occurrences and highlights the most frequent words of our corpus:

In [7]:
def compute_word_occurences(texts):
    """You have to define this function yourself. """
    
    words = itertools.chain.from_iterable(texts)
    
    word_count = pd.Series(words).value_counts()
    word_count = pd.DataFrame({"Word": word_count.index, "Count": word_count.values})

    return word_count
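For comparison, the same count can be sketched with the standard library only. This is an alternative view of the idea, not the notebook's method; `count_words` is a hypothetical helper name and the sample texts are made up.

```python
from collections import Counter
from itertools import chain


def count_words(texts, top_n=None):
    """Count word occurrences in a list of tokenized texts.

    Returns (word, count) pairs sorted from most to least frequent.
    """
    counts = Counter(chain.from_iterable(texts))
    return counts.most_common(top_n)


sample = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]
print(count_words(sample, top_n=2))
```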

Once this is done, display the top 20 most frequent words in your texts.

In [9]:
compute_word_occurences(splitted_texts).head(20)

| | Word | Count |
|---|---|---|
| 0 | The | 47803 |
| 1 | To | 38569 |
| 2 | A | 24839 |
| 3 | In | 24141 |
| 4 | Of | 22956 |
| 5 | For | 18788 |
| 6 | Is | 16823 |
| 7 | And | 15137 |
| 8 | On | 13642 |
| 9 | With | 12556 |


Does it make sense, and can you leverage such results?

# Actual pipeline

As you have seen above, the results obtained from a simple word count aren't very useful. Related word forms don't add up (such as *run* and *running*), and a lot of noise is included. Words such as *the*, *you*, *an* could be removed, for instance.

Actually, a lot can be done. Let's check that out.

----------

## What does the pipeline look like?

An NLP data pipeline often relies on the following elements. Steps can be added or removed, but most pipelines look like this at some point:

1. **Ensuring data quality**. You have to make sure that there's no N/A in your data and that everything is in the expected format. Having this at the entrance of your pipeline will save you a lot of time in the long run, so try to define it thoroughly.


2. **Filtering texts from unwanted characters**. Especially if you get data from the web, you'll end up with HTML tags or encoding artifacts that you don't need in your texts. Before applying anything to them, you need to clean them up. Here, try removing the dates and the punctuation for instance.


3. **Unify your texts**. (*This is topic modeling specific*). You don't want to distinguish between a word at the beginning of a sentence and the same word in the middle of it. You should unify all your words by lowercasing them and removing their accents.


4. **Converting sentences to lists of words**. Some words aren't needed for our analyses, such as *your*, *my*, etc. In order to remove them easily, you have to convert your sentences to lists of words. You can use the dummy function defined above, but I'd advise against it. Try finding a function that does that smoothly in [gensim.utils](https://radimrehurek.com/gensim/utils.html)!


5. **Remove useless words**. You need to remove useless words from your corpus. You have two approaches: [use a hard defined list of stopwords](https://www.analyticsvidhya.com/blog/2019/08/how-to-remove-stopwords-text-normalization-nltk-spacy-gensim-python/) or rely on TF-IDF to identify useless words. The first is the simplest, the second might yield better results!


6. **Creating n-grams**. *New York* is composed of two words. As a result, a plain word count wouldn't return a true count for *New York* per se. In NLP, we represent New York as New_York, which is considered a single word. N-gram creation consists of identifying words that often occur together and grouping them. It boosts interpretability for topic modeling in this case.


7. **Stemming / Lemmatization**. Shouldn't run, running, runnable be grouped and counted as a single word when we're identifying discussion topics? Yes, they should. Stemming is the process of cutting words to their word root (run- for instance) quite brutally while lemmatization will do the same by identifying the kind of word it is working on. You should convert the corpus words into those truncated representations to have a more realistic word count.


8. **Part of speech tagging**. POS tagging helps identify verbs, nouns, adjectives, etc. For topic models, it is a good idea to work only on verbs and nouns. Adjectives usually convey little information about the underlying topic being discussed.
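One way to picture such a pipeline is as an ordered list of small functions, each taking the output of the previous one. The sketch below is a toy version that only uses the standard library: the step names, the tiny stopword set and the sample headline are all made up for illustration, and it only covers a few of the steps listed above (deaccenting, n-grams, lemmatization and POS filtering are left out).

```python
import re

TOY_STOPWORDS = {"the", "a", "in", "of", "to", "for"}  # illustrative only


def remove_punctuation(text):
    # Replace anything that is not a word character or whitespace by a space
    return re.sub(r"[^\w\s]", " ", text)


def unify(text):
    # Lowercasing only; deaccenting is omitted here for brevity
    return text.lower()


def tokenize(text):
    return text.split()


def remove_stopwords(tokens):
    return [t for t in tokens if t not in TOY_STOPWORDS]


def run_pipeline(text, steps):
    """Apply each step in order, feeding it the previous step's output."""
    result = text
    for step in steps:
        result = step(result)
    return result


steps = [remove_punctuation, unify, tokenize, remove_stopwords]
print(run_pipeline("The Mayor Opens A Park In New York!", steps))
```

Keeping each step as a separate function makes it easy to reorder, remove or swap steps and to compare word counts before and after each of them.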

## Let's create it!

In [16]:
import itertools
import os
import re
import secrets
import string

import numpy as np
import pandas as pd
import spacy

from itertools import chain

from gensim.models.callbacks import CallbackAny2Vec
from gensim.models import Word2Vec, Phrases, KeyedVectors
from gensim.models.phrases import Phraser
from gensim.utils import simple_preprocess
from nltk.corpus import wordnet
# from pattern.en import pluralize, singularize
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
from tqdm import tqdm

from spacy.parts_of_speech import IDS as POS_map

Now it's your turn. Try to implement each step of the pipeline, and compare the word counts obtained earlier with the ones obtained after preprocessing your texts.

### Ensuring data quality

In [10]:
def check_data_quality(texts):
    """Check whether the whole dataset conforms to the expected format."""
    assert all(isinstance(t, str) for t in texts), "Input data contains something other than strings."
    # NaN never compares equal to anything (even itself), so a dedicated check such as pd.isna is needed
    assert not any(pd.isna(t) for t in texts), "Input data contains NaN values."
    
    return True

In [11]:
def force_format(texts):
    return [str(t) for t in texts]

In [12]:
texts = force_format(dataset["headline"])

In [13]:
print(f"Is the dataset passing our data quality check?\n{check_data_quality(texts)}")

Is the dataset passing our data quality check?
True
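One subtlety when checking for missing values: NaN never compares equal to anything, including itself, so an equality or inequality test against NaN cannot detect it. The short sketch below shows this with the standard library only, and also shows that forcing values to `str` turns a NaN into the literal text `'nan'`. The sample values are made up.

```python
import math

nan = float("nan")

# NaN is not equal to anything, not even to itself...
print(nan == nan)   # False
# ...so an inequality test is always True and cannot be used as a check.
print(nan != nan)   # True

# A dedicated predicate is needed instead.
values = ["Some headline", nan, "Another headline"]
missing = [isinstance(v, float) and math.isnan(v) for v in values]
print(missing)      # [False, True, False]

# Casting to str hides the problem: the NaN becomes the text 'nan'.
as_text = [str(v) for v in values]
print(as_text)
```

In other words, a quality check that runs after a blanket `str()` conversion will pass even when values were missing, so it may be worth looking for missing values before forcing the format.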


### Filtering texts

A tool such as https://regex101.com/ is handy for testing the regular expressions used below.

In [14]:
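def filter_text(texts_in):
    """Removes unwanted patterns from a single text, such as hyperlinks, brackets, HTML entities and so on"""
    
    # Note: `.-~` inside the class is a range from '.' to '~', which also covers '/', ':', '?' and '=',
    # so the whole URL (path and query string included) is removed
    texts_out = re.sub(r'https?:\/\/[A-Za-z0-9_.-~\-]*', ' ', texts_in, flags=re.MULTILINE)
    texts_out = re.sub(r'[(){}\[\]<>]', ' ', texts_out, flags=re.MULTILINE)
    # Non-greedy match, so that only the entity itself is removed and not everything up to the last ';'
    texts_out = re.sub(r'&amp;#.*?;', ' ', texts_out, flags=re.MULTILINE)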
    texts_out = re.sub(r'&gt;', ' ', texts_out, flags=re.MULTILINE)
    texts_out = re.sub(r'â€™', "'", texts_out, flags=re.MULTILINE)
    texts_out = re.sub(r'\s+', ' ', texts_out, flags=re.MULTILINE)
    texts_out = re.sub(r'&#x200B;', ' ', texts_out, flags=re.MULTILINE)
    # Mail regex
    # This regex is correct but WAY TOO LONG to process. So we skip it with a simpler version
    # texts_out = re.sub(r"(?i)(?:[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*|\"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*\")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])", '', texts_out, flags=re.MULTILINE)
    texts_out = re.sub(r'[a-zA-Z0-9-_.]+@[a-zA-Z0-9-_.]+\.[a-zA-Z0-9-_.]+', '', texts_out, flags=re.MULTILINE)
    # Phone regex
    # This regex is correct but WAY TOO LONG to process. So we skip it with a simpler version
    # texts_out = re.sub(r".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?", '', texts_out, flags=re.MULTILINE)
    texts_out = re.sub(r"\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}", '', texts_out, flags=re.MULTILINE)
    # Remove Twitter handles (also when the handle is the last thing in the text)
    texts_out = re.sub(r'@\S+( |\n|$)', '', texts_out, flags=re.MULTILINE)

    # Remove stars commonly used on social media
    texts_out = re.sub(r'\*', '', texts_out, flags=re.MULTILINE)
    return texts_out


In [17]:
texts = [filter_text(t) for t in texts]
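When writing this kind of pattern, greedy quantifiers deserve some attention: `.*` grabs as much text as it can, which may delete far more than intended. The sketch below compares a greedy and a non-greedy version of an HTML-entity pattern on a made-up string.

```python
import re

# Invented example: an escaped HTML entity followed by regular text containing ';'
text = "Deal &amp;#8211; signed; markets react"

# Greedy: `.*` runs up to the LAST ';' in the text
greedy = re.sub(r"&amp;#.*;", " ", text)
# Non-greedy: `.*?` stops at the FIRST ';' after the entity start
lazy = re.sub(r"&amp;#.*?;", " ", text)

print(repr(greedy))
print(repr(lazy))
```

The greedy version also swallows the word "signed", while the non-greedy one only removes the entity.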

### Unifying texts & converting sentences to list of words

In [24]:
def sent_to_words(sentences):
    """Converts sentences to words.

    Convert sentences in lists of words while removing the accents and the punctuation.

    @param:
        sentences: a list of strings, the sentences we want to convert
    @return
        A list of words' lists.
    """
    for sentence in tqdm(sentences):
        yield (simple_preprocess(str(sentence), deacc=True))
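`simple_preprocess` does more than splitting on spaces: as far as its documentation describes, it lowercases the text, keeps only alphabetic tokens, drops tokens shorter than 2 or longer than 15 characters and, with `deacc=True`, strips accents. The standard-library sketch below is only a rough approximation of that behaviour (the name `rough_preprocess` is made up), but it helps explain why numbers and one-letter tokens disappear from the output.

```python
import re
import unicodedata

ALPHABETIC = re.compile(r"[^\W\d]+")  # runs of word characters, digits excluded


def rough_preprocess(text, min_len=2, max_len=15):
    """Rough approximation of gensim's simple_preprocess(text, deacc=True)."""
    # Decompose accented characters, then drop the combining marks
    decomposed = unicodedata.normalize("NFD", text.lower())
    no_accents = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    tokens = ALPHABETIC.findall(no_accents)
    return [t for t in tokens if min_len <= len(t) <= max_len and not t.startswith("_")]


print(rough_preprocess("There Were 2 Mass Shootings In Texas Last Week, But Only 1 On TV"))
```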


In [25]:
texts = list(sent_to_words(texts))

100%|██████████| 200853/200853 [00:06<00:00, 32374.27it/s]


In [26]:
texts

[['there',
  'were',
  'mass',
  'shootings',
  'in',
  'texas',
  'last',
  'week',
  'but',
  'only',
  'on',
  'tv'],
 ['will',
  'smith',
  'joins',
  'diplo',
  'and',
  'nicky',
  'jam',
  'for',
  'the',
  'world',
  'cup',
  'official',
  'song'],
 ['hugh', 'grant', 'marries', 'for', 'the', 'first', 'time', 'at', 'age'],
 ['jim',
  'carrey',
  'blasts',
  'castrato',
  'adam',
  'schiff',
  'and',
  'democrats',
  'in',
  'new',
  'artwork'],
 ['julianna',
  'margulies',
  'uses',
  'donald',
  'trump',
  'poop',
  'bags',
  'to',
  'pick',
  'up',
  'after',
  'her',
  'dog'],
 ['morgan',
  'freeman',
  'devastated',
  'that',
  'sexual',
  'harassment',
  'claims',
  'could',
  'undermine',
  'legacy'],
 ['donald',
  'trump',
  'is',
  'lovin',
  'new',
  'mcdonald',
  'jingle',
  'in',
  'tonight',
  'show',
  'bit'],
 ['what',
  'to',
  'watch',
  'on',
  'amazon',
  'prime',
  'that',
  'new',
  'this',
  'week'],
 ['mike',
  'myers',
  'reveals',
  'he',
  'like',
  'to',

### Removing useless words

In [28]:
def get_stopwords(additional_stopwords=None):
    """Return a sorted list of English stopwords, optionally extended with custom ones.

    Args:
        additional_stopwords (list of str, optional): list of strings representing extra stopwords
    Returns:
        List of strings representing stopwords
    """
    # Loading standard English stop words from the local stopwords.txt file (one stopword per line)
    with open('stopwords.txt', 'r') as f:
        stop_w = f.readlines()
    stopwords = [s.rstrip() for s in stop_w]

    # Adding stop words from sklearn
    stopwords = list(text.ENGLISH_STOP_WORDS.union(stopwords))

    # Adding words from a list if specified
    if additional_stopwords:
        stopwords += additional_stopwords

    # Removing duplicates
    stopwords = list(set(stopwords))

    # Removing any leftover newline characters
    stopwords = [s.replace("\n", "") for s in stopwords]

    stopwords = sorted(stopwords, key=str.lower)

    return stopwords


In [29]:
stopwords = get_stopwords(additional_stopwords=["trump"])
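The TF-IDF route mentioned in the pipeline description can be sketched as follows: words that appear in most documents get a low inverse document frequency (IDF) and are therefore stopword candidates. This is a minimal standard-library sketch on made-up documents, using the smoothed formula documented for scikit-learn's `TfidfVectorizer`; the helper name `inverse_document_frequency` is hypothetical.

```python
import math
from collections import Counter


def inverse_document_frequency(tokenized_texts):
    """Compute a smoothed IDF, log((1 + N) / (1 + df)) + 1, for every word."""
    n_docs = len(tokenized_texts)
    doc_freq = Counter()
    for tokens in tokenized_texts:
        doc_freq.update(set(tokens))  # count each word once per document
    return {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}


docs = [
    ["the", "senate", "votes", "on", "the", "bill"],
    ["the", "team", "wins", "the", "cup"],
    ["the", "new", "phone", "is", "on", "sale"],
]
idf = inverse_document_frequency(docs)
# Words with the lowest IDF occur in most documents: stopword candidates.
candidates = sorted(idf, key=idf.get)[:2]
print(candidates)
```

On a real corpus, the cut-off (how many low-IDF words to drop) is a judgment call and is worth checking by eye.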

In [31]:
# The list was built in the previous cell; a set makes the membership test below much faster
stopwords = set(stopwords)

texts = [[word for word in txt if word not in stopwords] for txt in tqdm(texts)]

100%|██████████| 200853/200853 [00:11<00:00, 17648.00it/s]


In [32]:
texts

[['mass', 'shootings', 'texas', 'week', 'tv'],
 ['smith',
  'joins',
  'diplo',
  'nicky',
  'jam',
  'world',
  'cup',
  'official',
  'song'],
 ['hugh', 'grant', 'marries', 'time', 'age'],
 ['jim',
  'carrey',
  'blasts',
  'castrato',
  'adam',
  'schiff',
  'democrats',
  'artwork'],
 ['julianna', 'margulies', 'donald', 'poop', 'bags', 'pick', 'dog'],
 ['morgan',
  'freeman',
  'devastated',
  'sexual',
  'harassment',
  'claims',
  'undermine',
  'legacy'],
 ['donald', 'lovin', 'mcdonald', 'jingle', 'tonight', 'bit'],
 ['watch', 'amazon', 'prime', 'week'],
 ['mike', 'myers', 'reveals', 'fourth', 'austin', 'powers', 'film'],
 ['watch', 'hulu', 'week'],
 ['justin', 'timberlake', 'visits', 'texas', 'school', 'shooting', 'victims'],
 ['south',
  'korean',
  'president',
  'meets',
  'north',
  'korea',
  'kim',
  'jong',
  'talk',
  'summit'],
 ['life', 'risk', 'remote', 'oyster', 'growing', 'region', 'called', 'robots'],
 ['crackdown', 'immigrant', 'parents', 'puts', 'kids', 'straine

### Creating n-grams

In [33]:
def create_bigrams(texts, bigram_count=15, threshold=10, convert_sent_to_words=False, as_str=True):
    """Identify bigrams in texts and return the texts with bigrams integrated"""
    if convert_sent_to_words:
        texts = list(sent_to_words(texts))
    
    bigram_model = Phraser(Phrases(texts, min_count=bigram_count, threshold=threshold))
    
    if as_str:
        return [" ".join(bigram_model[t]) for t in texts]

    else:
        return [bigram_model[t] for t in texts]

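def create_trigrams(texts, trigram_count=15, threshold=10, convert_sent_to_words=False, as_str=True):
    """Identify trigrams in texts and return the texts with trigrams integrated"""
    if convert_sent_to_words:
        texts = list(sent_to_words(texts))
    
    # A trigram model is a phrase model trained on texts whose bigrams are already merged
    bigram_model = Phraser(Phrases(texts, min_count=trigram_count, threshold=threshold))
    bigram_texts = [bigram_model[t] for t in texts]
    trigram_model = Phraser(Phrases(bigram_texts, min_count=trigram_count, threshold=threshold))
    
    if as_str:
        return [" ".join(trigram_model[t]) for t in bigram_texts]

    else:
        return [trigram_model[t] for t in bigram_texts]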
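To give an intuition of what `min_count` and `threshold` control, here is a standard-library sketch of a bigram score in the spirit of the default scoring used by gensim's `Phrases`: a pair scores high when it occurs often compared with how frequent its two words are individually. The exact formula and vocabulary handling in gensim may differ, and the documents and the helper name `score_bigrams` are made up.

```python
from collections import Counter


def score_bigrams(tokenized_texts, min_count=1):
    """Score adjacent pairs: (count(a, b) - min_count) * vocab_size / (count(a) * count(b))."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in tokenized_texts:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    vocab_size = len(unigrams)
    return {
        pair: (count - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, count in bigrams.items()
    }


docs = [
    ["new", "york", "mayor", "speaks"],
    ["snow", "in", "new", "york"],
    ["new", "york", "subway", "delays"],
    ["mayor", "visits", "school"],
]
scores = score_bigrams(docs)
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 2))
```

Pairs whose score exceeds the threshold would then be merged into a single token such as `new_york`.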


In [34]:
texts = create_bigrams(texts)

In [35]:
texts

['mass_shootings texas week tv',
 'smith joins diplo nicky jam world_cup official song',
 'hugh grant marries time age',
 'jim_carrey blasts castrato adam schiff democrats artwork',
 'julianna margulies donald poop bags pick dog',
 'morgan_freeman devastated sexual_harassment claims undermine legacy',
 'donald lovin mcdonald jingle tonight bit',
 'watch amazon prime week',
 'mike myers reveals fourth austin powers film',
 'watch hulu week',
 'justin_timberlake visits texas school_shooting victims',
 'south_korean president meets north_korea kim_jong talk summit',
 'life risk remote oyster growing region called robots',
 'crackdown immigrant parents puts kids strained',
 'son concerned fbi obtained wiretaps putin ally met jr',
 'edward_snowden loves vladimir_putin',
 'booyah obama photographer hilariously trolls spy claim',
 'ireland votes repeal abortion amendment landslide referendum',
 'ryan_zinke reel critics grand pivot conservation',
 'scottish golf resort pays women significantly

### Stemming / Lemmatization & Part-of-Speech filtering

***Note***: *if you encounter an error regarding a missing spacy model, head to your CLI and enter*
````bash
    python -m spacy download en_core_web_md
````

In [36]:
def lemmatize_texts(texts, 
                    allowed_postags=None, 
                    forbidden_postags=None, 
                    as_sentence=False, 
                    get_postags=False, 
                    spacy_model=None):
    """Lemmatize a list of texts.
    
        Please refer to https://spacy.io/api/annotation for details on the allowed
        POS tags.
        @params:
            - texts: a list of texts, where each text is a string
            - allowed_postags: a list of part of speech tags to keep, in the spacy fashion
              (defaults to NOUN, ADJ, VERB and ADV when no forbidden tags are given)
            - forbidden_postags: a list of part of speech tags to drop (can't be combined with allowed_postags)
            - as_sentence: a boolean indicating whether the output should be a list of sentences instead of a list of word lists
            - get_postags: a boolean indicating whether each lemma should be suffixed with its POS tag
            - spacy_model: an already loaded spacy model (en_core_web_md is loaded if none is given)
        @return:
            - A list of word lists, or a list of sentences if as_sentence is True
        """
    texts_out = []
    
    if allowed_postags and forbidden_postags:
        raise ValueError("Can't specify both allowed and forbidden postags")

    if forbidden_postags:
        allowed_postags = list(set(POS_map.keys()).difference(set(forbidden_postags)))
    elif allowed_postags is None:
        allowed_postags = ['NOUN', 'ADJ', 'VERB', 'ADV']

    if not spacy_model:
        print("Loading spacy model")
        spacy_model = spacy.load('en_core_web_md')

    print("Beginning lemmatization process")
    total_steps = len(texts)

    docs = spacy_model.pipe(texts)

    for doc in tqdm(docs, total=total_steps):
        if get_postags:
            texts_out.append(["_".join([token.lemma_, token.pos_]) for token in doc if token.pos_ in allowed_postags])
        else:
            texts_out.append(
                [token.lemma_ for token in doc if token.pos_ in allowed_postags])
    
    if as_sentence:
        texts_out = [" ".join(text) for text in texts_out]
        
    return texts_out


In [66]:
l_texts = lemmatize_texts(texts[:1000],
                allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV', 'X', 'PROPN'], 
                get_postags=False)

Loading spacy model
Beginning lemmatization process



100%|██████████| 1000/1000 [00:02<00:00, 474.62it/s]

In [69]:
occurences = compute_word_occurences(l_texts)
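To make the difference between stemming and lemmatization concrete, here is a deliberately naive sketch. The suffix list and the lemma dictionary are invented for this example; real stemmers and lemmatizers (such as the spacy model used above) are far more elaborate.

```python
SUFFIXES = ("ing", "able", "s")  # checked in this order; illustrative only

# A hand-written lookup standing in for a real lemmatizer's vocabulary.
TOY_LEMMAS = {"running": "run", "ran": "run", "runs": "run", "better": "good"}


def toy_stem(word):
    """Chop the first matching suffix off, without any linguistic knowledge."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in a dictionary and fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)


for word in ["running", "runs", "ran", "better"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

The stemmer turns *running* into *runn* and leaves *ran* untouched, whereas the dictionary-based lookup maps both to *run*: cutting is cheap but blunt, while lemmatization needs knowledge about the words.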

## Using pre-trained Word2Vec representations from spacy

In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')

nlp("this is a course")[3].vector

In [51]:
nlp = spacy.load('en_core_web_sm')

In [74]:
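def get_word_embeddings(texts, occurences):
    """Return a DataFrame holding the spacy vector of the 100 most frequent words.

    Note: `texts` is currently unused; the words are taken from `occurences`.
    """
    words = []
    vector_representations = []

    for word in occurences["Word"].head(100):
        words.append(word)
        vector_representations.append(nlp(word)[0].vector)

    return pd.DataFrame({"Words": words, "Vector": vector_representations})

get_word_embeddings(l_texts, occurences)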

| | Words | Vector |
|---|---|---|
| 0 | donald | [4.727399, 1.0132356, 0.833761, -3.2051291, -0... |
| 1 | man | [4.744825, 0.62399197, -1.4038447, -3.6159914,... |
| 2 | win | [6.64212, 2.612313, -0.42527348, -4.177225, 1.... |
| 3 | woman | [5.9668517, 1.5909166, 0.039670303, -5.785555,... |
| 4 | call | [5.7618876, -0.8812548, -0.33859012, -2.108201... |
| 5 | report | [7.803542, -1.3666301, 1.1807303, -2.839004, -... |
| 6 | black | [0.45798773, -1.2457224, -2.5088394, -3.915559... |
| 7 | reveal | [1.4318871, 2.0382378, -1.319364, -2.7912467, ... |
| 8 | people | [2.1282616, 2.3495624, 0.9128659, -3.5999022, ... |
| 9 | year | [6.3642664, -1.6766875, -0.54288036, -3.903829... |
