# Máster en Big Data Science
## Live coding session

---

- Date: January 25, 2022
- Language: Python 3.9
- Author: Fernando Rabanal

### Data load

- Dataset: [BBC News Summary](https://www.kaggle.com/pariza/bbc-news-summary)
- General information:
    - 5 classes: business, entertainment, politics, sport, tech
    - 2224 articles in total
    - First line of each article is treated as title
    
- Possible problems to be tackled:
    - Text summarization
    - **Text classification**
    - Named Entity Recognition
    - ...

In [1]:
import os
import re

import altair as alt
import gensim
import numpy as np
import umap
import pandas as pd
import spacy

from tqdm.notebook import tqdm
from loguru import logger
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

nlp = spacy.load('en_core_web_lg')

We'll load the corpus as {id: {'x': text, 'y': category}}.

- The downside is that loading becomes slightly more complicated, because the data must fit a specific structure.
- The upside is flexibility: with all the information in one predefined structure, we can process the text differently for each algorithm.
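
To make the trade-off concrete, here is a minimal sketch with a made-up two-document corpus. The texts and the names `toy_data`, `texts`, `labels` and `titles` are illustrative and not part of the session; the point is only that each later step can pull its own view out of the same nested dictionary.

```python
# Toy corpus in the same {id: {'x': text, 'y': category}} layout (made-up content)
toy_data = {
    0: {'x': 'Chip maker posts record profit\n\nShares rose on Monday.', 'y': 'business'},
    1: {'x': 'Striker signs new deal\n\nThe forward extended his contract.', 'y': 'sport'},
}

# Each downstream step derives its own view from the same structure
texts = [doc['x'] for doc in toy_data.values()]                   # model input
labels = [doc['y'] for doc in toy_data.values()]                  # classification targets
titles = [doc['x'].split('\n')[0] for doc in toy_data.values()]   # first line = title

print(titles)
print(labels)
```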

In [2]:
base_folder = 'BBC News Summary/News Articles/'
tags = [filename for filename in os.listdir(base_folder) if not filename.startswith('.')]

all_data = {}
counter = 0
for tag in tags:
    txt_files = [filename for filename in os.listdir(f'{base_folder}{tag}') if filename.endswith('.txt')]
    logger.info(f'Category: {tag} | Files: {len(txt_files)}')
    for filename in tqdm(txt_files):
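        try:
            with open(f'{base_folder}{tag}/{filename}', 'r') as f:
                txt = f.read()
            all_data[counter] = {'x': txt, 'y': tag}
            counter += 1
        except (OSError, UnicodeDecodeError) as exc:
            # Skip files that cannot be read or decoded, but leave a trace in the log
            logger.warning(f'Skipping {tag}/{filename}: {exc}')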

2022-01-12 20:06:21.809 | INFO     | __main__:<module>:8 - Category: tech | Files: 401
2022-01-12 20:06:21.886 | INFO     | __main__:<module>:8 - Category: sport | Files: 511
2022-01-12 20:06:21.962 | INFO     | __main__:<module>:8 - Category: politics | Files: 417
2022-01-12 20:06:22.032 | INFO     | __main__:<module>:8 - Category: entertainment | Files: 386
2022-01-12 20:06:22.101 | INFO     | __main__:<module>:8 - Category: business | Files: 510

Let's inspect the first document:

In [3]:
all_data[0]

{'x': 'More women turn to net security\n\nOlder people and women are increasingly taking charge of protecting home computers against malicious net attacks, according to a two-year study.\n\nThe number of women buying programs to protect PCs from virus, spam and spyware attacks rose by 11.2% each year between 2002 and 2004. The study, for net security firm Preventon, shows that security messages are reaching a diversity of surfers. It is thought that 40% of those buying home net security programs are retired. For the last three years, that has gone up by an average of 13.2%. But more retired women (53%) were buying security software than retired men. The research reflects the changing stereotype and demographics of web users, as well as growing awareness of the greater risks that high-speed broadband net connections can pose to surfers.\n\nThe study predicts that 40% of all home PC net security buyers will be women in 2005. They could even overtake men as the main buyers by 2007, if cur

## First approach: classic NLP with TF-IDF model

- Basic text cleaning with the `re` module
- Text preprocessing with `spacy`: lemmatization plus stop-word and punctuation removal
- Gensim dictionary with extreme terms (very rare and very frequent ones) filtered out for better performance
- Classifiers: Logistic Regression and Random Forest
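
As a rough illustration of the TF-IDF weighting behind this approach, the sketch below computes the weights by hand for a made-up tokenized corpus. It assumes the scheme I understand to be Gensim's default (raw term frequency times log2(N/df), followed by L2 normalization); `toy_corpus` and `tfidf_weights` are hypothetical names, not part of the session.

```python
import math
from collections import Counter

# Made-up tokenized documents (assumption: already cleaned and lemmatized)
toy_corpus = [
    ['market', 'share', 'profit', 'market'],
    ['match', 'goal', 'profit'],
    ['market', 'goal', 'goal'],
]

def tfidf_weights(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        # Raw weight: term frequency times log2 inverse document frequency
        raw = {t: tf[t] * math.log2(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({t: w / norm for t, w in raw.items()} if norm else raw)
    return weighted

for weights in tfidf_weights(toy_corpus):
    print({t: round(w, 3) for t, w in weights.items()})
```

Terms that occur in few documents (such as `share`) end up with a larger weight than terms spread across the corpus, which is what makes the representation useful for separating categories.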

In [4]:
corpus = [doc.get('x') for _, doc in all_data.items()]

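def clean_text(s, nlp_model):
    # Collapse newlines and repeated spaces before tokenizing
    s = re.sub('\n', ' ', s)
    s = re.sub(r' +', ' ', s)
    # Lemmatize and drop stop words and punctuation
    return [token.lemma_ for token in nlp_model(s) if not token.is_stop and not token.is_punct]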

corpus = [clean_text(doc, nlp) for doc in corpus]

In [5]:
dictionary = gensim.corpora.Dictionary(corpus)
dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=500)

num_docs = dictionary.num_docs
num_terms = len(dictionary.keys())
print((num_docs, num_terms))

(2224, 500)
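
To see what the filter does, here is a plain-Python approximation of the rule as I understand it from the Gensim documentation: keep tokens that appear in at least `no_below` documents and in no more than a `no_above` fraction of them, then keep the `keep_n` most frequent survivors. The function `filter_vocab` and the toy documents are hypothetical, and tie-breaking may differ from Gensim's.

```python
from collections import Counter

def filter_vocab(docs, no_below=5, no_above=0.5, keep_n=500):
    """Approximate a document-frequency filter over tokenized documents."""
    # Document frequency: in how many documents each token appears
    df = Counter(token for doc in docs for token in set(doc))
    max_df = no_above * len(docs)
    kept = [t for t, n in df.items() if no_below <= n <= max_df]
    # Most frequent tokens first; ties broken alphabetically for determinism
    kept.sort(key=lambda t: (-df[t], t))
    return kept[:keep_n]

toy_docs = [
    ['the', 'market', 'rose'],
    ['the', 'market', 'fell'],
    ['the', 'match', 'ended'],
    ['the', 'goal', 'stood'],
]
print(filter_vocab(toy_docs, no_below=2, no_above=0.5, keep_n=10))
```

In this toy case only `market` survives: `the` appears in every document and the remaining tokens appear in just one.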


In [6]:
modelo_bow = [dictionary.doc2bow(text) for text in corpus]

tfidf = gensim.models.TfidfModel(modelo_bow)
tfidf_docs = tfidf[modelo_bow]

corpus_tfidf_csc = gensim.matutils.corpus2csc(tfidf_docs)

In [7]:
# List of targets
y = [doc.get('y') for _, doc in all_data.items()]

In [8]:
clf = LogisticRegression(max_iter=1000)
# corpus2csc returns a terms x documents matrix, so transpose to get one row per document
cross_val_score(clf, corpus_tfidf_csc.T, y, cv=5, scoring='accuracy')

array([0.94606742, 0.97078652, 0.95730337, 0.94382022, 0.95495495])

In [9]:
m = RandomForestClassifier(random_state=1)
cross_val_score(m, corpus_tfidf_csc.T, y, cv=5, scoring='accuracy')

array([0.93707865, 0.95280899, 0.94831461, 0.94157303, 0.96171171])
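
With a classifier and an integer `cv`, scikit-learn splits the data into stratified folds, so each fold keeps roughly the same class proportions. The sketch below is a simplified stand-in (round-robin assignment within each class), not scikit-learn's exact algorithm; `stratified_folds` and `toy_labels` are hypothetical names.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, n_folds=5):
    """Assign each sample a fold index so every class is spread evenly."""
    seen = defaultdict(int)  # how many samples of each class have been assigned
    folds = []
    for label in labels:
        folds.append(seen[label] % n_folds)
        seen[label] += 1
    return folds

toy_labels = ['sport'] * 10 + ['tech'] * 5
folds = stratified_folds(toy_labels, n_folds=5)

# Class counts of the held-out part of each fold
for k in range(5):
    held_out = [lab for lab, f in zip(toy_labels, folds) if f == k]
    print(k, dict(Counter(held_out)))
```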

## What happens if I obtain document embeddings?

spaCy ships GloVe vectors for 1M words in its `lg` models, so we can obtain document vectors simply by averaging word vectors. Contextualized document embedding models could achieve better performance, but let's see whether the vectors bundled with spaCy are enough for this dataset.
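
The averaging idea can be sketched without loading any model. The snippet below uses made-up 3-dimensional vectors (the real ones have 300 dimensions) and a hypothetical helper `mean_vector`. Tokens without a vector are simply skipped here, which is a simplification of this sketch and may differ from how spaCy treats out-of-vocabulary tokens.

```python
# Made-up word vectors, for illustration only
toy_vectors = {
    'net': [0.2, 0.4, -0.1],
    'security': [0.6, 0.0, 0.3],
    'women': [-0.2, 0.2, 0.1],
}

def mean_vector(tokens, vectors):
    """Average the vectors of the tokens that have one."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError('no token has a vector')
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

doc_vector = mean_vector(['women', 'net', 'security', 'unknownword'], toy_vectors)
print(doc_vector)
```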

In [10]:
doc1 = nlp(all_data[0].get('x'))
doc2 = nlp(all_data[999].get('x'))

spaCy documents (and tokens) have a `similarity` method that measures how similar they are to another document (or token).

In [11]:
doc1.similarity(doc2)

0.9302041536062987
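
As far as I understand, the default `similarity` is the cosine similarity between the two vectors. A minimal sketch of that measure follows; the function name `cosine_similarity` and the input vectors are illustrative, and returning 0.0 for a zero vector is a choice made for this sketch.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # sketch choice: similarity with a zero vector is 0
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal
```

Because the measure depends only on direction, averaged vectors of long documents tend to point in similar directions, so high values such as the one above do not by themselves mean the documents share a topic.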

Document vectors have the same shape as word vectors:

In [12]:
doc1.vector

array([-1.06087252e-01,  1.57878980e-01, -1.23277478e-01, -7.39157200e-02,
        5.20617291e-02, -1.36929769e-02,  1.93576291e-02, -7.48483166e-02,
        1.70265567e-02,  2.16320920e+00, -1.81970730e-01,  2.17154678e-02,
        5.53707890e-02, -5.61834797e-02, -1.08367153e-01, -7.01459870e-02,
       -6.53185323e-02,  1.17052400e+00, -2.06994221e-01, -3.95639464e-02,
       -1.52840391e-02, -2.83568110e-02, -4.17800061e-02, -2.24864595e-02,
        3.20174359e-02, -2.74697365e-03, -8.12956244e-02, -5.66419810e-02,
        7.73983970e-02, -3.41670364e-02, -4.24717404e-02,  4.06999886e-02,
       -3.62336338e-02,  4.92242463e-02,  8.74517187e-02, -6.81794807e-02,
       -8.83685499e-02,  4.15820368e-02, -5.02331415e-04, -3.55098955e-02,
       -4.51946668e-02,  5.30518331e-02,  6.09959885e-02, -8.34672749e-02,
        2.17540152e-02,  3.50477993e-02, -9.91473719e-02, -2.26061475e-02,
        3.71811837e-02,  3.14484676e-03, -1.32091448e-01, -1.32745225e-02,
       -6.66089654e-02, -

In [13]:
doc1.vector.shape

(300,)

In [14]:
for token in doc1:
    if token.has_vector:
        break
token.vector

array([-3.9717e-01,  3.0269e-01, -1.8428e-01, -6.5407e-02,  1.9637e-01,
       -5.8685e-02,  3.7790e-02,  2.9643e-01,  1.1542e-02,  2.2009e+00,
       -4.6806e-02, -7.1777e-03, -1.1853e-01, -4.1681e-01, -2.0386e-01,
        1.2567e-01,  3.2915e-03,  1.3143e+00, -4.7148e-01, -1.1948e-01,
       -2.5665e-01,  1.0156e-01,  1.3020e-01, -7.0407e-01, -7.4254e-02,
       -1.7186e-02,  1.7362e-02,  1.5262e-01,  5.1837e-01, -3.6875e-01,
       -4.0545e-02, -4.6352e-02,  7.9905e-03, -3.0805e-01,  6.0676e-01,
       -1.3668e-01, -2.6167e-01,  2.3586e-01,  1.3590e-01, -8.4004e-02,
       -1.2044e-01,  1.3398e-02,  3.7747e-01,  4.7950e-02, -7.7707e-02,
        3.0638e-03, -1.0368e-02,  3.1060e-01,  1.0559e-01,  3.9321e-02,
       -4.6871e-01,  1.3819e-01,  2.5762e-01, -2.3689e-01,  5.6828e-02,
        1.4335e-01, -3.1491e-01,  1.2502e-02,  4.1930e-02, -1.4981e-01,
       -1.5684e-02, -3.4712e-02, -2.9339e-01,  1.0509e-01,  3.9542e-01,
        9.0509e-02,  3.1770e-02,  3.4126e-01, -1.2346e-02,  1.08

In [15]:
token.vector.shape

(300,)

Now let's see whether the document embeddings are meaningful on their own. We can start by visualizing the documents according to their GloVe-based embeddings.

- We use UMAP (`umap-learn`) to reduce the embeddings from 300 dimensions to 2, since a 300-dimensional plot can be neither drawn nor interpreted. UMAP is nonlinear, fast and data-driven, so it is a reasonable choice for an exploratory visualization.
- We use `altair` for visualization, as it gives good control over the appearance of the plot and supports interactive charts.

In [16]:
embeddings = np.zeros((len(all_data), 300))
for idx, doc in tqdm(all_data.items()):
    embeddings[idx,:] = nlp(doc.get('x')).vector

In [17]:
umap_embed = pd.DataFrame(umap.UMAP(n_neighbors=5, min_dist=0.3, metric='correlation', random_state=1)
                          .fit_transform(embeddings), columns=['X', 'Y'])

alt.Chart(umap_embed).mark_circle(size=25).encode(
    x=alt.X('X', axis=None),
    y=alt.Y('Y', axis=None),
).configure_view(
    width=600,
    height=400,
).properties(
    title='Document embeddings'
).interactive()
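
The UMAP call above passes `metric='correlation'`. As I understand it, this distance is one minus the Pearson correlation between two vectors, so it ignores both the offset and the scale of the embedding values. A minimal sketch under that assumption; `correlation_distance` is a hypothetical helper, and the handling of constant vectors is a choice made here.

```python
import math

def correlation_distance(a, b):
    """1 - Pearson correlation between two equal-length vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    ca = [x - mean_a for x in a]  # centered copies
    cb = [y - mean_b for y in b]
    norm = math.sqrt(sum(x * x for x in ca)) * math.sqrt(sum(y * y for y in cb))
    if norm == 0:
        return 1.0  # sketch choice for constant vectors
    return 1.0 - sum(x * y for x, y in zip(ca, cb)) / norm

print(correlation_distance([1.0, 2.0, 3.0], [11.0, 12.0, 13.0]))  # shifted copy
print(correlation_distance([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))     # reversed trend
```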

What happens if tags are included in the chart? We include category information as point colors.

In [18]:
# Row index of umap_embed matches the document ids used as keys in all_data
umap_embed['title'] = [all_data[idx]['x'].split('\n')[0] for idx in umap_embed.index]
umap_embed['tag'] = [all_data[idx]['y'] for idx in umap_embed.index]

alt.Chart(umap_embed).mark_circle(size=25).encode(
    x=alt.X('X', axis=None),
    y=alt.Y('Y', axis=None),
    color=alt.Color('tag'),
    tooltip=['title']
).configure_view(
    width=600,
    height=400,
).properties(
    title='Document embeddings (real tag)'
).interactive()

After these visualizations, we classify the documents as in the first approach, using both the 2D UMAP projection and the full 300-dimensional embeddings as features.

In [19]:
clf = LogisticRegression(max_iter=1000)
cross_val_score(clf, umap_embed[['X', 'Y']].values, y, cv=5, scoring='accuracy')

array([0.9011236 , 0.94382022, 0.94382022, 0.94382022, 0.94369369])

In [20]:
cross_val_score(clf, embeddings, y, cv=5, scoring='accuracy')

array([0.93932584, 0.96853933, 0.96404494, 0.96853933, 0.94369369])

In [21]:
m = RandomForestClassifier(random_state=1)
cross_val_score(m, umap_embed[['X', 'Y']].values, y, cv=5, scoring='accuracy')

array([0.9258427 , 0.9505618 , 0.95280899, 0.94606742, 0.95495495])

In [22]:
cross_val_score(m, embeddings, y, cv=5, scoring='accuracy')

array([0.94831461, 0.97078652, 0.96179775, 0.95505618, 0.9481982 ])
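
To compare the runs at a glance, the fold accuracies printed in this notebook can be summarized by their mean and standard deviation. The numbers below are copied from the outputs above; the dictionary keys are just labels chosen for this sketch.

```python
from statistics import mean, stdev

# Fold accuracies copied from the cross_val_score outputs in this notebook
scores = {
    'TF-IDF + LogisticRegression': [0.94606742, 0.97078652, 0.95730337, 0.94382022, 0.95495495],
    'TF-IDF + RandomForest': [0.93707865, 0.95280899, 0.94831461, 0.94157303, 0.96171171],
    'UMAP 2D + LogisticRegression': [0.9011236, 0.94382022, 0.94382022, 0.94382022, 0.94369369],
    'Embeddings 300D + LogisticRegression': [0.93932584, 0.96853933, 0.96404494, 0.96853933, 0.94369369],
    'UMAP 2D + RandomForest': [0.9258427, 0.9505618, 0.95280899, 0.94606742, 0.95495495],
    'Embeddings 300D + RandomForest': [0.94831461, 0.97078652, 0.96179775, 0.95505618, 0.9481982],
}

# Print the configurations from highest to lowest mean accuracy
for name, folds in sorted(scores.items(), key=lambda kv: mean(kv[1]), reverse=True):
    print(f'{name:<40} {mean(folds):.4f} +/- {stdev(folds):.4f}')
```

With only five folds per run, differences of a fraction of a percentage point between configurations should be read with caution.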