# Loading the dataset

In [None]:
import pandas as pd
import itertools

In [None]:
dataset = pd.read_json("News_Category_Dataset_v2.json", lines=True, dtype={"headline": str})

In [None]:
dataset.head()

Unnamed: 0,category,headline,authors,link,short_description,date
0,CRIME,There Were 2 Mass Shootings In Texas Last Week...,Melissa Jeltsen,https://www.huffingtonpost.com/entry/texas-ama...,She left her husband. He killed their children...,2018-05-26
1,ENTERTAINMENT,Will Smith Joins Diplo And Nicky Jam For The 2...,Andy McDonald,https://www.huffingtonpost.com/entry/will-smit...,Of course it has a song.,2018-05-26
2,ENTERTAINMENT,Hugh Grant Marries For The First Time At Age 57,Ron Dicker,https://www.huffingtonpost.com/entry/hugh-gran...,The actor and his longtime girlfriend Anna Ebe...,2018-05-26
3,ENTERTAINMENT,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,Ron Dicker,https://www.huffingtonpost.com/entry/jim-carre...,The actor gives Dems an ass-kicking for not fi...,2018-05-26
4,ENTERTAINMENT,Julianna Margulies Uses Donald Trump Poop Bags...,Ron Dicker,https://www.huffingtonpost.com/entry/julianna-...,"The ""Dietland"" actress said using the bags is ...",2018-05-26


In [None]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200853 entries, 0 to 200852
Data columns (total 6 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   category           200853 non-null  object        
 1   headline           200853 non-null  object        
 2   authors            200853 non-null  object        
 3   link               200853 non-null  object        
 4   short_description  200853 non-null  object        
 5   date               200853 non-null  datetime64[ns]
dtypes: datetime64[ns](1), object(5)
memory usage: 9.2+ MB


# Actual pipeline

1. **Ensuring data quality**. We want to make sure that the data contains no missing (N/A) values and that every entry has the expected format.


2. **Filtering texts**. We want to get rid of HTML tags and encoding artifacts that we don't need in the texts. The texts must be cleaned before any other processing is applied to them.


3. **Unifying texts**. For topic modeling, we don't want to distinguish a word at the beginning of a sentence from the same word in the middle of one. We unify all words by lowercasing them and removing their accents.


4. **Converting sentences to lists of words**. Some words aren't needed for our analyses, such as *your*, *my*, etc. In order to remove them easily, we have to convert the sentences to lists of words.


5. **Removing useless words**. We need to remove useless words from the corpus. There are two approaches: use a hard-coded list of stopwords, or rely on TF-IDF to identify uninformative words. The first is the simplest; the second might yield better results.


6. **Creating n-grams**. New York is composed of two words, so a plain word count would count *New* and *York* separately and never return a true count for *New York*. To fix this, we represent New York as New_York, which is treated as a single word. N-gram creation consists of identifying words that often occur together and grouping them. In this case, it makes the topics easier to interpret.


7. **Stemming / Lemmatization**. Shouldn't run, running, runnable be grouped and counted as a single word when we're identifying discussion topics? Yes, they should. Stemming cuts words down to a root by stripping affixes, fairly crudely, while lemmatization maps each word to its dictionary form by taking into account the kind of word it is working on. We convert the corpus words into these normalized forms to get a more realistic word count.


8. **Part of speech tagging**. POS tagging helps in identifying verbs, nouns, adjectives, and other parts of speech. For topic modeling, it is beneficial to focus on a limited set of parts of speech, such as nouns, proper nouns, verbs, and adjectives. Other parts of speech, like conjunctions and prepositions, typically do not convey significant information about the topics.

## Let's create it!

In [None]:
import itertools
import os
import re
import secrets
import string
from collections import Counter
from itertools import chain

import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import spacy
from gensim.models import Word2Vec, Phrases, KeyedVectors
from gensim.models.callbacks import CallbackAny2Vec
from gensim.models.phrases import Phraser
from gensim.utils import simple_preprocess
from nltk.corpus import wordnet
# from pattern.en import pluralize, singularize
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.parts_of_speech import IDS as POS_map
from tqdm import tqdm
from wordcloud import WordCloud

### 1.Ensuring data quality

We want to make sure that the data contains no missing (N/A) values and that every entry has the expected format.

In [None]:
# @title
def check_data_quality(texts):
    """Check whether the whole dataset conforms to the expected format."""
    assert all([isinstance(t, str) for t in texts]), "Input data contains something different than strings."
    assert pd.Series(texts).isnull().sum() == 0, "Input data contains NaN values."

    return True

In [None]:
# @title
def force_format(texts):
    return [str(t) for t in texts]

In [None]:
# @title
texts = force_format(dataset["headline"])

In [None]:
# @title
print(f"Is the dataset passing our data quality check?\n{check_data_quality(texts)}")

Is the dataset passing our data quality check?
True
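One caveat is worth illustrating: because `force_format` casts every value with `str`, a missing value becomes the literal string `'None'` or `'nan'`, so a check that runs after the cast cannot catch it. The sketch below uses only the standard library; `find_invalid` is a hypothetical helper, not part of this notebook, that inspects the raw values before any cast.

```python
import math


def find_invalid(values):
    """Return the positions of values that are missing or not strings.

    `find_invalid` is a hypothetical helper used for illustration only.
    """
    invalid = []
    for position, value in enumerate(values):
        is_nan = isinstance(value, float) and math.isnan(value)
        if value is None or is_nan or not isinstance(value, str):
            invalid.append(position)
    return invalid


raw = ["A headline", None, float("nan"), 42, "Another headline"]

# Inspect the raw values first ...
bad_positions = find_invalid(raw)

# ... because casting hides the problem: every value is now a string.
casted = [str(v) for v in raw]
```

Running the check on the raw column first, and casting only afterwards, keeps missing headlines visible instead of turning them into text.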


### 2.Filtering texts

We want to get rid of HTML tags and encoding artifacts that we don't need in the texts. The texts must be cleaned before any other processing is applied to them.

The regular expressions used below can be tested interactively at https://regex101.com/

In [None]:
def filter_text(texts_in):
    """Remove unwanted patterns from a single text: hyperlinks, brackets, HTML entities, e-mail addresses, phone numbers, Twitter handles and asterisks."""

    # Hyperlinks: drop everything up to the next whitespace
    texts_out = re.sub(r'https?://\S*', ' ', texts_in, flags=re.MULTILINE)
    texts_out = re.sub(r'[(){}\[\]<>]', ' ', texts_out, flags=re.MULTILINE)
    # Non-greedy match so that only the entity itself is removed
    texts_out = re.sub(r'&amp;#.*?;', ' ', texts_out, flags=re.MULTILINE)
    texts_out = re.sub(r'&gt;', ' ', texts_out, flags=re.MULTILINE)
    texts_out = re.sub(r'â€™', "'", texts_out, flags=re.MULTILINE)
    texts_out = re.sub(r'\s+', ' ', texts_out, flags=re.MULTILINE)
    texts_out = re.sub(r'&#x200B;', ' ', texts_out, flags=re.MULTILINE)
    # Mail regex
    # The full regex below is correct but far too slow to run, so we use a simpler version instead
    # texts_out = re.sub(r"(?i)(?:[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*|\"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*\")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])", '', texts_out, flags=re.MULTILINE)
    texts_out = re.sub(r'[a-zA-Z0-9-_.]+@[a-zA-Z0-9-_.]+\.[a-zA-Z0-9-_.]+', '', texts_out, flags=re.MULTILINE)
    # Phone regex
    # The regex below is correct but far too slow to run, so we use a simpler version instead
    # texts_out = re.sub(r".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?", '', texts_out, flags=re.MULTILINE)
    texts_out = re.sub(r"\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}", '', texts_out, flags=re.MULTILINE)
    # Remove names in twitter
    texts_out = re.sub(r'@\S+( |\n)', '', texts_out, flags=re.MULTILINE)

    # Remove stars (asterisks) commonly used on social media
    texts_out = re.sub(r'\*', '', texts_out, flags=re.MULTILINE)
    return texts_out


In [None]:
texts = [filter_text(t) for t in texts]

In [None]:
texts[:10]

['There Were 2 Mass Shootings In Texas Last Week, But Only 1 On TV',
 "Will Smith Joins Diplo And Nicky Jam For The 2018 World Cup's Official Song",
 'Hugh Grant Marries For The First Time At Age 57',
 "Jim Carrey Blasts 'Castrato' Adam Schiff And Democrats In New Artwork",
 'Julianna Margulies Uses Donald Trump Poop Bags To Pick Up After Her Dog',
 "Morgan Freeman 'Devastated' That Sexual Harassment Claims Could Undermine Legacy",
 "Donald Trump Is Lovin' New McDonald's Jingle In 'Tonight Show' Bit",
 'What To Watch On Amazon Prime That’s New This Week',
 "Mike Myers Reveals He'd 'Like To' Do A Fourth Austin Powers Film",
 'What To Watch On Hulu That’s New This Week']
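HTML entities such as `&amp;` or `&gt;` can also be decoded rather than deleted. The following is a minimal sketch of that alternative, assuming the standard-library `html.unescape` is acceptable for this corpus; `clean_headline` is a hypothetical helper and its URL pattern is deliberately simplified.

```python
import html
import re


def clean_headline(text):
    """Hypothetical helper: decode HTML entities, drop URLs, collapse spaces."""
    text = html.unescape(text)                 # '&amp;' -> '&', '&gt;' -> '>'
    text = re.sub(r"https?://\S+", " ", text)  # simplified URL pattern
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
    return text.strip()


sample = "Q&amp;A:   5 &gt; 3, see https://example.com/page?id=1 for details"
cleaned = clean_headline(sample)
print(cleaned)
```

Decoding keeps characters that carry meaning (an ampersand in a name, for instance), whereas the substitution approach above simply discards them. Which is preferable depends on the later tokenization step.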

### 3.Unifying texts & 4.converting sentences to list of words

For topic modeling, we don't want to distinguish a word at the beginning of a sentence from the same word in the middle of one. We unify all words by lowercasing them and removing their accents.

Some words aren't needed for our analyses, such as your, my, etc. In order to remove them easily, we have to convert the sentences to lists of words.

In [None]:
def sent_to_words(sentences):
    """Converts sentences to words.

    Convert sentences into lists of words while removing accents and punctuation.

    @param:
        sentences: a list of strings, the sentences we want to convert
    @return
        A generator yielding one list of words per sentence.
    """
    for sentence in tqdm(sentences):  # tqdm displays a progress bar over the iterable
        # lowercase, remove accents, and drop punctuation, numbers and very short or very long tokens
        yield simple_preprocess(str(sentence), deacc=True)

In [None]:
texts = list(sent_to_words(texts))

100%|██████████| 200853/200853 [00:06<00:00, 29752.17it/s]


In [None]:
texts[:3]

[['there',
  'were',
  'mass',
  'shootings',
  'in',
  'texas',
  'last',
  'week',
  'but',
  'only',
  'on',
  'tv'],
 ['will',
  'smith',
  'joins',
  'diplo',
  'and',
  'nicky',
  'jam',
  'for',
  'the',
  'world',
  'cup',
  'official',
  'song'],
 ['hugh', 'grant', 'marries', 'for', 'the', 'first', 'time', 'at', 'age']]
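To make the normalization concrete, here is a rough standard-library approximation of lowercasing, deaccenting and tokenizing. It is not gensim's implementation: `normalize_tokens` is a hypothetical helper, and its token rule (alphabetic tokens of at least two letters) is an assumption that only loosely mirrors `simple_preprocess`.

```python
import re
import unicodedata


def strip_accents(text):
    """Decompose characters (NFD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def normalize_tokens(sentence):
    """Hypothetical helper: lowercase, deaccent, keep alphabetic tokens of 2+ letters."""
    cleaned = strip_accents(sentence.lower())
    return re.findall(r"[a-z]{2,}", cleaned)


tokens = normalize_tokens("Beyoncé Visits A Café In Montréal In 2018")
print(tokens)
```

As in the notebook output above, numbers and one-letter words disappear, which is why `2` and `57` are missing from the tokenized headlines.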

### 5.Removing useless words

We need to remove unnecessary words from the corpus. For simplicity, we combine three sources: a local `stopwords.txt` file, the predefined English stop words from *sklearn.feature_extraction*, and a short custom list.

In [None]:
def get_stopwords(additional_stopwords=None):
    """Return a sorted list of English stopwords.

    The list merges the words in `stopwords.txt` (one stopword per line), the
    sklearn English stop words, and any additional stopwords passed in.

    Args:
        additional_stopwords (list of str, optional): list of strings representing extra stopwords
    Returns:
        List of strings representing stopwords
    """
    # Loading stop words from the local file, stripping the trailing "\n" of each line
    with open('stopwords.txt', 'r') as f:
        stopwords = {line.rstrip() for line in f}

    # Adding stop words from sklearn
    stopwords |= set(text.ENGLISH_STOP_WORDS)

    # Adding words from a list if specified
    if additional_stopwords:
        stopwords.update(additional_stopwords)

    # Using a set removes duplicates; sort case-insensitively for a readable result
    return sorted(stopwords, key=str.lower)



In [None]:
# Tokens are already lowercased and stripped of punctuation, so custom stopwords must be too
stopwords = get_stopwords(additional_stopwords=["trump", "photos", "donald"])

In [None]:
stopwords_set = set(stopwords)  # set membership tests are much faster than scanning a list
texts = [[word for word in txt if word not in stopwords_set] for txt in tqdm(texts)]

100%|██████████| 200853/200853 [00:19<00:00, 10308.94it/s]


In [None]:
texts[:3]

[['mass', 'shootings', 'texas', 'week', 'tv'],
 ['smith',
  'joins',
  'diplo',
  'nicky',
  'jam',
  'world',
  'cup',
  'official',
  'song'],
 ['hugh', 'grant', 'marries', 'time', 'age']]
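The TF-IDF route mentioned in the pipeline overview is not implemented in this notebook. As a rough sketch of the idea, words that appear in most documents get a low inverse document frequency (IDF) and are candidates for removal. The helper `low_idf_words`, its `max_idf` cutoff and the toy documents are all illustrative assumptions.

```python
import math
from collections import Counter


def low_idf_words(documents, max_idf=0.5):
    """Hypothetical helper: return words whose IDF is at most `max_idf`.

    IDF is computed here as log(N / df), where N is the number of documents
    and df is the number of documents containing the word.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each word once per document
    idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}
    return sorted(word for word, score in idf.items() if score <= max_idf)


docs = [
    ["the", "mass", "shootings", "in", "texas"],
    ["the", "world", "cup", "song"],
    ["the", "actor", "marries", "in", "london"],
    ["the", "new", "show", "in", "town"],
]
candidates = low_idf_words(docs)
print(candidates)
```

On a real corpus the cutoff would need tuning, since a frequent word can also be a genuine topic word (as `trump` is in this dataset).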

### 6.Creating n-grams

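Here, we focus on bigrams: pairs of words that frequently occur together are merged into a single token joined by an underscore, such as world_cup.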

In [None]:
def create_bigrams(texts, bigram_count=15, threshold=10, convert_sent_to_words=False, as_str=True):
    """Identify bigrams in texts and return the texts with bigrams integrated"""
    if convert_sent_to_words:
        texts = list(sent_to_words(texts))

    bigram_model = Phraser(Phrases(texts, min_count=bigram_count, threshold=threshold))

    if as_str:
        return [" ".join(bigram_model[t]) for t in texts]

    else:
        return [bigram_model[t] for t in texts]

In [None]:
texts = create_bigrams(texts)

In [None]:
texts[:3]

['mass_shootings texas week tv',
 'smith joins diplo nicky jam world_cup official song',
 'hugh grant marries time age']
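To give an intuition for the `min_count` and `threshold` parameters: as far as the gensim documentation describes its default scorer, a pair is kept when a score of the form (count(a, b) - min_count) * N / (count(a) * count(b)) exceeds `threshold`, where N is the vocabulary size. The sketch below reproduces that formula in plain Python under simplifying assumptions: `score_bigrams` is a hypothetical helper, N is taken as the number of distinct words, and none of gensim's vocabulary pruning is modeled.

```python
from collections import Counter


def score_bigrams(sentences, min_count=1):
    """Score adjacent word pairs with a Mikolov-style collocation score.

    score(a, b) = (count(a, b) - min_count) * vocab_size / (count(a) * count(b))
    """
    unigrams = Counter(word for sent in sentences for word in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)  # simplification: distinct single words only
    return {
        pair: (count - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, count in bigrams.items()
    }


sentences = [
    ["world", "cup", "song"],
    ["world", "cup", "final"],
    ["world", "cup", "draw"],
    ["new", "song", "released"],
]
scores = score_bigrams(sentences)
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 3))
```

Pairs seen no more than `min_count` times score zero or less, so raising `min_count` or `threshold` yields fewer, more reliable bigrams.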

### 7.Stemming/Lemmatization & 8.Part-of-Speech filtering

Finally, we will use en_core_web_lg, which is a spaCy model, to perform lemmatization and part-of-speech (POS) tagging. We will filter the text to retain only the following parts of speech: NOUN, ADJ, VERB, ADV, and PROPN.

In [None]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.6.0/en_core_web_lg-3.6.0-py3-none-any.whl (587.7 MB)
     587.7/587.7 MB 1.1 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [None]:
def lemmatize_texts(texts,
                    allowed_postags=None,
                    forbidden_postags=None,
                    as_sentence=False,
                    get_postags=False,
                    spacy_model=None):
    """Lemmatize a list of texts and keep only the selected parts of speech.

    Please refer to https://spacy.io/api/annotation for details on the allowed
    POS tags.
    @params:
        - texts: a list of texts, where each text is a string
        - allowed_postags: a list of part of speech tags to keep, in the spacy fashion
          (defaults to NOUN, ADJ, VERB and ADV)
        - forbidden_postags: a list of part of speech tags to drop; cannot be combined with allowed_postags
        - as_sentence: a boolean indicating whether the output should be a list of sentences instead of a list of word lists
        - get_postags: a boolean indicating whether each lemma should be suffixed with its POS tag
        - spacy_model: an already loaded spacy pipeline; en_core_web_lg is loaded if omitted
    @return:
        - A list with one entry per text: a list of lemmas, or a sentence if as_sentence is True
    """
    if allowed_postags and forbidden_postags:
        raise ValueError("Can't specify both allowed and forbidden postags")

    if forbidden_postags:
        # Keep every POS tag that is in POS_map but not in forbidden_postags
        allowed_postags = set(POS_map.keys()).difference(forbidden_postags)
    elif not allowed_postags:
        allowed_postags = ['NOUN', 'ADJ', 'VERB', 'ADV']
    allowed_postags = set(allowed_postags)

    if spacy_model is None:
        print("Loading spacy model")
        spacy_model = spacy.load('en_core_web_lg')  # en_core_web_trf

    print("Beginning lemmatization process")
    texts_out = []

    # pipe() streams the texts through the model in batches
    for doc in tqdm(spacy_model.pipe(texts), total=len(texts)):
        kept = [token for token in doc if token.pos_ in allowed_postags]
        if get_postags:
            texts_out.append(["_".join([token.lemma_, token.pos_]) for token in kept])
        else:
            texts_out.append([token.lemma_ for token in kept])

    if as_sentence:
        texts_out = [" ".join(words) for words in texts_out]

    return texts_out


In [None]:
# Only the first 1,000 headlines are lemmatized here
l_texts = lemmatize_texts(texts[:1000],
                allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV','PROPN'],
                get_postags=False)

Loading spacy model
Beginning lemmatization process


100%|██████████| 1000/1000 [00:02<00:00, 482.22it/s]


In [None]:
# Remove empty headlines
l_texts = [headline for headline in l_texts if headline]

In [None]:
l_texts[:3]

[['mass_shootings', 'texas', 'week', 'tv'],
 ['smith', 'join', 'diplo', 'nicky', 'jam', 'world_cup', 'official', 'song'],
 ['hugh', 'grant', 'marry', 'time', 'age']]
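The difference between stemming and lemmatization can be seen on the words above with a toy example. Both functions below are deliberately naive illustrations, not the algorithms used by spaCy or by any real stemmer: `crude_stem` strips a few suffixes blindly, while `toy_lemmatize` looks words up in a tiny hand-written table keyed by part of speech.

```python
def crude_stem(word):
    """Toy stemmer: strip a few common suffixes without any linguistic knowledge."""
    for suffix in ("ing", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Toy lemma table keyed by (word, part of speech); a real lemmatizer relies on
# a trained tagger and much larger resources.
TOY_LEMMAS = {
    ("marries", "VERB"): "marry",
    ("joins", "VERB"): "join",
    ("shootings", "NOUN"): "shooting",
}


def toy_lemmatize(word, pos):
    """Look the word up with its part of speech; fall back to the word itself."""
    return TOY_LEMMAS.get((word, pos), word)


tagged = [("marries", "VERB"), ("joins", "VERB"), ("shootings", "NOUN")]
stems = [crude_stem(word) for word, _ in tagged]
lemmas = [toy_lemmatize(word, pos) for word, pos in tagged]
print(stems)
print(lemmas)
```

The stemmer turns `marries` into the non-word `marr`, whereas the lemma `marry` stays readable, which matters when topics are later interpreted by a human.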

### Exporting Processed Texts to JSON File

In [None]:
import json
# File path where you want to save the file
file_path = 'l_texts_en_core_web_lg.json'

# Writing to the file
with open(file_path, 'w') as file:
    # Serializing the list of lists as JSON
    json.dump(l_texts, file)
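A quick way to confirm that an export of this kind is usable is to read the file back and compare it with the in-memory list. The sketch below does this with a small stand-in list and a temporary directory; `sample_texts` and the file name are illustrative, not the notebook's actual data.

```python
import json
import os
import tempfile

# Stand-in for the notebook's `l_texts`; the values are illustrative.
sample_texts = [["mass_shootings", "texas", "week", "tv"], ["hugh", "grant", "marry"]]

# Write to a temporary directory so the sketch leaves no file behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_texts.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample_texts, f)
    with open(path, "r", encoding="utf-8") as f:
        reloaded = json.load(f)

print(reloaded == sample_texts)
```

Lists of lists of strings survive a JSON round trip unchanged, so the reloaded object can be fed directly to a later modeling step.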