# Máster en Big Data Science
## Live coding session

---

- Date: January 25, 2022
- Language: Python 3.9
- Author: Fernando Rabanal

### Data load

- Dataset: [BBC News Summary](https://www.kaggle.com/pariza/bbc-news-summary)
- General information:
    - 5 classes: business, entertainment, politics, sport, tech
    - 2224 articles in total
    - First line of each article is treated as title
    
- Possible problems to be tackled:
    - Text summarization
    - **Text classification**
    - Named Entity Recognition
    - ...

In [1]:
import os
import re

import altair as alt
import gensim
import numpy as np
import umap
import pandas as pd
import spacy

from tqdm.notebook import tqdm
from loguru import logger
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

nlp = spacy.load('en_core_web_lg')



We'll load the corpus as a dictionary of the form `{id: {'x': text, 'y': category}}`.

- This makes loading slightly more involved, because every document has to fit a specific structure.
- In return, all the information sits in one predefined structure, so we can process the text differently for each algorithm.
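
As a rough illustration of why this structure is convenient, the sketch below builds a toy dictionary in the same shape and derives texts, labels and titles from it. The two articles are made up for the example and are not part of the BBC dataset.

```python
# Toy stand-in for the corpus; both texts are invented for illustration.
all_data = {
    0: {'x': 'Title one\n\nBody of the first article.', 'y': 'tech'},
    1: {'x': 'Title two\n\nBody of the second article.', 'y': 'sport'},
}

# Texts and labels come out in the same (insertion) order,
# so they stay aligned without any extra bookkeeping.
texts = [doc['x'] for doc in all_data.values()]
labels = [doc['y'] for doc in all_data.values()]

# The first line of each article is treated as its title.
titles = [doc['x'].split('\n')[0] for doc in all_data.values()]
```

Each downstream step can then pick the view it needs (raw text, label, or title) from the same dictionary.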

In [2]:
base_folder = 'BBC News Summary/News Articles/'
tags = [filename for filename in os.listdir(base_folder) if not filename.startswith('.')]

all_data = {}
counter = 0
for tag in tags:
    txt_files = [filename for filename in os.listdir(f'{base_folder}{tag}') if filename.endswith('.txt')]
    logger.info(f'Category: {tag} | Files: {len(txt_files)}')
    for filename in tqdm(txt_files):
        try:
            with open(f'{base_folder}{tag}/{filename}', 'r') as f:
                txt = f.read()
            all_data[counter] = {'x': txt, 'y': tag}
            counter += 1
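        except (OSError, UnicodeError) as err:
            # Skip files that cannot be read or decoded, but leave a trace in the log
            logger.warning(f'Skipped {tag}/{filename}: {err}')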

2022-01-25 20:03:38.813 | INFO     | __main__:<module>:8 - Category: tech | Files: 401



2022-01-25 20:03:38.848 | INFO     | __main__:<module>:8 - Category: sport | Files: 511



2022-01-25 20:03:38.881 | INFO     | __main__:<module>:8 - Category: politics | Files: 417



2022-01-25 20:03:38.909 | INFO     | __main__:<module>:8 - Category: entertainment | Files: 386



2022-01-25 20:03:38.942 | INFO     | __main__:<module>:8 - Category: business | Files: 510



In [3]:
all_data[0]

{'x': 'More women turn to net security\n\nOlder people and women are increasingly taking charge of protecting home computers against malicious net attacks, according to a two-year study.\n\nThe number of women buying programs to protect PCs from virus, spam and spyware attacks rose by 11.2% each year between 2002 and 2004. The study, for net security firm Preventon, shows that security messages are reaching a diversity of surfers. It is thought that 40% of those buying home net security programs are retired. For the last three years, that has gone up by an average of 13.2%. But more retired women (53%) were buying security software than retired men. The research reflects the changing stereotype and demographics of web users, as well as growing awareness of the greater risks that high-speed broadband net connections can pose to surfers.\n\nThe study predicts that 40% of all home PC net security buyers will be women in 2005. They could even overtake men as the main buyers by 2007, if cur

## First approach: classic NLP with TF-IDF model

- Basic text cleaning with the `re` module
- Text preprocessing with `spacy`: lemmatization, plus stop-word and punctuation removal
- Gensim: dictionary extremes (very rare and very frequent terms) filtered out to keep the vocabulary small
- Classifiers: Logistic Regression and Random Forest
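
To make the TF-IDF step concrete, here is a minimal pure-Python sketch of the weighting. It assumes the weight is `tf * log2(N / df)` followed by L2 normalization per document, which is, to my understanding, what gensim's `TfidfModel` does with its default settings. The function name `tfidf_weights` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(tokenized_docs):
    """Return one {term: weight} dict per document (sketch, not gensim's code)."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    weighted = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        # Assumed weighting: raw term count times log2 of inverse document frequency.
        raw = {t: tf[t] * math.log2(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        # L2-normalize and drop zero weights (terms present in every document).
        weighted.append({t: w / norm for t, w in raw.items() if w > 0} if norm else {})
    return weighted


docs = [['net', 'security', 'net'], ['match', 'goal'], ['net', 'goal']]
weights = tfidf_weights(docs)
```

In this toy corpus, `security` appears in a single document, so it ends up with a higher weight in the first document than `net`, even though `net` occurs twice there.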

In [4]:
corpus = [doc['x'] for doc in all_data.values()]
corpus[0]

'More women turn to net security\n\nOlder people and women are increasingly taking charge of protecting home computers against malicious net attacks, according to a two-year study.\n\nThe number of women buying programs to protect PCs from virus, spam and spyware attacks rose by 11.2% each year between 2002 and 2004. The study, for net security firm Preventon, shows that security messages are reaching a diversity of surfers. It is thought that 40% of those buying home net security programs are retired. For the last three years, that has gone up by an average of 13.2%. But more retired women (53%) were buying security software than retired men. The research reflects the changing stereotype and demographics of web users, as well as growing awareness of the greater risks that high-speed broadband net connections can pose to surfers.\n\nThe study predicts that 40% of all home PC net security buyers will be women in 2005. They could even overtake men as the main buyers by 2007, if current r

In [5]:
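def clean(text, nlp_model):
    """Normalize whitespace, then return lemmas without stop words or punctuation."""
    text = re.sub(r'\n', ' ', text)  # newlines -> spaces
    text = re.sub(r' +', ' ', text)  # collapse repeated spaces
    doc = nlp_model(text)
    return [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]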

corpus = [clean(doc, nlp) for doc in corpus]
corpus[0]

['woman',
 'turn',
 'net',
 'security',
 'old',
 'people',
 'woman',
 'increasingly',
 'take',
 'charge',
 'protect',
 'home',
 'computer',
 'malicious',
 'net',
 'attack',
 'accord',
 'year',
 'study',
 'number',
 'woman',
 'buy',
 'program',
 'protect',
 'pc',
 'virus',
 'spam',
 'spyware',
 'attack',
 'rise',
 '11.2',
 'year',
 '2002',
 '2004',
 'study',
 'net',
 'security',
 'firm',
 'Preventon',
 'show',
 'security',
 'message',
 'reach',
 'diversity',
 'surfer',
 'think',
 '40',
 'buy',
 'home',
 'net',
 'security',
 'program',
 'retire',
 'year',
 'go',
 'average',
 '13.2',
 'retired',
 'woman',
 '53',
 'buy',
 'security',
 'software',
 'retire',
 'man',
 'research',
 'reflect',
 'change',
 'stereotype',
 'demographic',
 'web',
 'user',
 'grow',
 'awareness',
 'great',
 'risk',
 'high',
 'speed',
 'broadband',
 'net',
 'connection',
 'pose',
 'surfer',
 'study',
 'predict',
 '40',
 'home',
 'pc',
 'net',
 'security',
 'buyer',
 'woman',
 '2005',
 'overtake',
 'man',
 'main',
 'b

In [6]:
' '.join(corpus[0])

'woman turn net security old people woman increasingly take charge protect home computer malicious net attack accord year study number woman buy program protect pc virus spam spyware attack rise 11.2 year 2002 2004 study net security firm Preventon show security message reach diversity surfer think 40 buy home net security program retire year go average 13.2 retired woman 53 buy security software retire man research reflect change stereotype demographic web user grow awareness great risk high speed broadband net connection pose surfer study predict 40 home pc net security buyer woman 2005 overtake man main buyer 2007 current rate persist accord research think old people vigilant protect pc tend cautious want insurance policy case wrong say over-60 woman take research start young male stereotype computer user 10 year Paul Goosens head Preventon tell BBC News website see real people sex woman access home net service provider need responsibility make sure people educate net threat online 

In [10]:
dictionary = gensim.corpora.Dictionary(corpus)
dictionary

<gensim.corpora.dictionary.Dictionary at 0x7f3e5909dad0>

In [12]:
dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=250)
dictionary

<gensim.corpora.dictionary.Dictionary at 0x7f3e5909dad0>

In [13]:
num_docs = dictionary.num_docs
num_terms = len(dictionary.keys())
logger.info(f'Terms: {num_terms} | Docs: {num_docs}')

2022-01-25 20:22:37.878 | INFO     | __main__:<module>:3 - Terms: 250 | Docs: 2224
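
The effect of `filter_extremes` can be sketched in plain Python. This is a simplified reading of its documented behavior: keep terms found in at least `no_below` documents and in at most a fraction `no_above` of them, then keep the `keep_n` terms with the highest document frequency. The function name `filter_extremes_sketch` and the alphabetical tie-breaking are assumptions made for the example.

```python
from collections import Counter


def filter_extremes_sketch(tokenized_docs, no_below=5, no_above=0.5, keep_n=250):
    """Return the surviving vocabulary, most frequent terms first (illustrative only)."""
    num_docs = len(tokenized_docs)
    # Document frequency of each term.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    # no_above is a fraction of the corpus; no_below is an absolute count.
    max_df = int(no_above * num_docs)
    kept = [t for t, n in df.items() if no_below <= n <= max_df]
    # Highest document frequency first; alphabetical order breaks ties (assumption).
    kept.sort(key=lambda t: (-df[t], t))
    return kept[:keep_n]


docs = [['a', 'b'], ['a', 'c'], ['a', 'b'], ['a', 'd']]
vocab = filter_extremes_sketch(docs, no_below=2, no_above=0.5, keep_n=10)
```

Here `a` is dropped for being too frequent, `c` and `d` for being too rare, and only `b` survives.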


In [14]:
bow = [dictionary.doc2bow(doc) for doc in corpus]

tfidf = gensim.models.TfidfModel(bow)
tfidf_docs = tfidf[bow]
tfidf

<gensim.models.tfidfmodel.TfidfModel at 0x7f3e5ba23350>

In [19]:
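# corpus2csc returns a terms x documents matrix; num_terms pins the number of rows
tfidf_mat = gensim.matutils.corpus2csc(tfidf_docs, num_terms=num_terms)
tfidf_mat[0]  # row 0: weights of term 0 across all documents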

<1x2224 sparse matrix of type '<class 'numpy.float64'>'
	with 350 stored elements in Compressed Sparse Column format>

In [20]:
y = [doc['y'] for doc in all_data.values()]
y[:5]

['tech', 'tech', 'tech', 'tech', 'tech']

In [22]:
pd.Series(y).value_counts(normalize=True)

business         0.229317
sport            0.229317
politics         0.187500
tech             0.180306
entertainment    0.173561
dtype: float64

In [23]:
clf = LogisticRegression(max_iter=1000)
clf

LogisticRegression(max_iter=1000)

In [25]:
tfidf_mat

<250x2224 sparse matrix of type '<class 'numpy.float64'>'
	with 82929 stored elements in Compressed Sparse Column format>

In [27]:
len(y)

2224

In [29]:
res1 = cross_val_score(clf, tfidf_mat.T, y, cv=5, scoring='accuracy')
logger.info(f'LR: {res1} | Mean: {np.mean(res1)}')

2022-01-25 20:40:41.457 | INFO     | __main__:<module>:2 - LR: [0.91460674 0.94831461 0.94382022 0.93258427 0.92342342] | Mean: 0.9325498532240104


In [30]:
clf2 = RandomForestClassifier(random_state=1)
res2 = cross_val_score(clf2, tfidf_mat.T, y, cv=5, scoring='accuracy')
logger.info(f'RF: {res2} | Mean: {np.mean(res2)}')

2022-01-25 20:42:45.091 | INFO     | __main__:<module>:3 - RF: [0.90337079 0.93258427 0.93258427 0.94606742 0.93918919] | Mean: 0.9307591861524445


## What happens if I obtain document embeddings?

spaCy ships GloVe vectors for 1M words in its `lg` models, so we can obtain document vectors simply by averaging word vectors. Contextualized document embedding models could of course achieve better performance, but let's see if we can handle this dataset with the vectors spaCy already provides.
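
The averaging idea can be sketched without spaCy. The example below uses hypothetical 3-dimensional word vectors (the real ones have 300 dimensions) and compares documents with cosine similarity. Skipping out-of-vocabulary tokens is a simplification chosen for this sketch, and `doc_vector` and `cosine` are illustrative names.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for illustration.
word_vectors = {
    'net': [1.0, 0.0, 0.0],
    'security': [0.8, 0.2, 0.0],
    'goal': [0.0, 1.0, 0.0],
    'match': [0.0, 0.8, 0.2],
}


def doc_vector(tokens, vectors):
    """Average the vectors of the tokens that have one (sketch assumption)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # No known token: fall back to a zero vector of the right size.
        return [0.0] * len(next(iter(vectors.values())))
    return [sum(dim) / len(known) for dim in zip(*known)]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


tech = doc_vector(['net', 'security'], word_vectors)
sport = doc_vector(['goal', 'match'], word_vectors)
similarity = cosine(tech, sport)
```

Documents built from related words end up close to each other, while the toy tech and sport documents are far apart.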

In [31]:
doc1 = nlp(all_data[0]['x'])
doc2 = nlp(all_data[1000]['x'])

In [32]:
doc1.similarity(doc2)

0.955825620184052

In [35]:
doc2.vector.shape

(300,)

In [36]:
for token in doc1:
    if token.has_vector:
        break
token.vector

array([-3.9717e-01,  3.0269e-01, -1.8428e-01, -6.5407e-02,  1.9637e-01,
       -5.8685e-02,  3.7790e-02,  2.9643e-01,  1.1542e-02,  2.2009e+00,
       -4.6806e-02, -7.1777e-03, -1.1853e-01, -4.1681e-01, -2.0386e-01,
        1.2567e-01,  3.2915e-03,  1.3143e+00, -4.7148e-01, -1.1948e-01,
       -2.5665e-01,  1.0156e-01,  1.3020e-01, -7.0407e-01, -7.4254e-02,
       -1.7186e-02,  1.7362e-02,  1.5262e-01,  5.1837e-01, -3.6875e-01,
       -4.0545e-02, -4.6352e-02,  7.9905e-03, -3.0805e-01,  6.0676e-01,
       -1.3668e-01, -2.6167e-01,  2.3586e-01,  1.3590e-01, -8.4004e-02,
       -1.2044e-01,  1.3398e-02,  3.7747e-01,  4.7950e-02, -7.7707e-02,
        3.0638e-03, -1.0368e-02,  3.1060e-01,  1.0559e-01,  3.9321e-02,
       -4.6871e-01,  1.3819e-01,  2.5762e-01, -2.3689e-01,  5.6828e-02,
        1.4335e-01, -3.1491e-01,  1.2502e-02,  4.1930e-02, -1.4981e-01,
       -1.5684e-02, -3.4712e-02, -2.9339e-01,  1.0509e-01,  3.9542e-01,
        9.0509e-02,  3.1770e-02,  3.4126e-01, -1.2346e-02,  1.08

In [37]:
token.vector.shape

(300,)

In [38]:
embeddings = np.zeros((len(all_data), 300))
for idx, doc in tqdm(all_data.items()):
    embeddings[idx, :] = nlp(doc['x']).vector


In [39]:
res3 = cross_val_score(clf, embeddings, y, cv=5, scoring='accuracy')
res4 = cross_val_score(clf2, embeddings, y, cv=5, scoring='accuracy')
logger.info(f'LR-embed: {res3} | Mean: {np.mean(res3)}')
logger.info(f'RF-embed: {res4} | Mean: {np.mean(res4)}')

2022-01-25 20:58:01.409 | INFO     | __main__:<module>:3 - LR-embed: [0.94606742 0.96853933 0.96404494 0.96853933 0.94594595] | Mean: 0.9586273914363801
2022-01-25 20:58:01.409 | INFO     | __main__:<module>:4 - RF-embed: [0.94157303 0.97078652 0.97078652 0.96404494 0.95045045] | Mean: 0.959528292337281


In [40]:
umap_embed = umap.UMAP(n_neighbors=5, min_dist=0.3, metric='correlation', random_state=1) \
    .fit_transform(embeddings)
umap_embed[:5,:]

array([[ 1.431147  , 13.988908  ],
       [ 0.99275994, 14.940517  ],
       [ 5.5264616 , 20.125797  ],
       [ 2.276065  , 16.329544  ],
       [-0.46491933, 13.657876  ]], dtype=float32)

In [41]:
umap_df = pd.DataFrame(umap_embed, columns=['X', 'Y'])

In [46]:
alt.Chart(umap_df).mark_circle(size=15).encode(
    x='X',
    y='Y'
).interactive()

In [48]:
# Tag is the article's category; title is the first line of its text
umap_df['tag'] = [all_data[idx]['y'] for idx in umap_df.index]
umap_df['title'] = [all_data[idx]['x'].split('\n')[0] for idx in umap_df.index]

In [54]:
alt.Chart(umap_df).mark_circle(size=15).encode(
    x=alt.X('X', axis=None),
    y=alt.Y('Y', axis=None),
    color=alt.Color('tag'),
    tooltip=['title', 'tag']
).configure_view(
    width=600, height=400
).properties(title='Document embeddings (real tag)').interactive()

In [55]:
res5 = cross_val_score(clf, umap_embed, y, cv=5, scoring='accuracy')
res6 = cross_val_score(clf2, umap_embed, y, cv=5, scoring='accuracy')
logger.info(f'LR-2D: {res5} | Mean: {np.mean(res5)}')
logger.info(f'RF-2D: {res6} | Mean: {np.mean(res6)}')

2022-01-25 21:22:58.476 | INFO     | __main__:<module>:3 - LR-2D: [0.91460674 0.93033708 0.94382022 0.93033708 0.93468468] | Mean: 0.9307571616560381
2022-01-25 21:22:58.477 | INFO     | __main__:<module>:4 - RF-2D: [0.92359551 0.9505618  0.9505618  0.94831461 0.95720721] | Mean: 0.9460481830144752
