# Máster en Big Data Science
## Live coding session

---

- Date: January 25, 2022
- Language: Python 3.9
- Author: Fernando Rabanal

### Data load

- Dataset: [BBC News Summary](https://www.kaggle.com/pariza/bbc-news-summary)
- General information:
    - 5 classes: business, entertainment, politics, sport, tech
    - 2224 articles in total
    - First line of each article is treated as title
    
- Possible problems to be tackled:
    - Text summarization
    - **Text classification**
    - Named Entity Recognition
    - ...

In [1]:
import os
import re

import altair as alt
import gensim
import numpy as np
import umap
import pandas as pd
import spacy

from tqdm.notebook import tqdm
from loguru import logger
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

nlp = spacy.load('en_core_web_lg')

We'll load the corpus as `{id: {'x': text, 'y': category}}`.

- The downside is that data loading gets slightly more involved, since the files must be arranged into this specific structure.
- The upside is flexibility: with all the information in one predefined structure, the text can be processed differently for each algorithm.
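
As a minimal sketch of this structure, the snippet below builds it for two made-up articles and unpacks it in two ways: into parallel lists of texts and labels, and into titles, relying on the first line of each article being its title. The article texts and the `split_title` helper are illustrative assumptions, not part of the dataset or of the session code.

```python
# Toy corpus in the {id: {'x': text, 'y': category}} layout (made-up articles).
toy_data = {
    0: {'x': 'Chip maker posts record profit\n\nThe company reported strong sales.', 'y': 'business'},
    1: {'x': 'Striker signs new deal\n\nThe forward extended his contract.', 'y': 'sport'},
}


def split_title(text):
    """Return (title, body), treating the first line as the title."""
    title, _, body = text.partition('\n')
    return title.strip(), body.strip()


# Parallel lists for algorithms that expect texts and labels separately.
texts = [item['x'] for item in toy_data.values()]
labels = [item['y'] for item in toy_data.values()]

# Per-article titles, keyed by the same ids as the corpus.
titles = {idx: split_title(item['x'])[0] for idx, item in toy_data.items()}

print(labels)
print(titles)
```

Keeping the raw text under `'x'` means each algorithm can derive its own view (title only, body only, cleaned tokens) without reloading the files.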

In [2]:
base_folder = 'BBC News Summary/News Articles/'
tags = [filename for filename in os.listdir(base_folder) if not filename.startswith('.')]

all_data = {}
counter = 0
for tag in tags:
    txt_files = [filename for filename in os.listdir(f'{base_folder}{tag}') if filename.endswith('.txt')]
    logger.info(f'Category: {tag} | Files: {len(txt_files)}')
    for filename in tqdm(txt_files):
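        try:
            with open(f'{base_folder}{tag}/{filename}', 'r') as f:
                txt = f.read()
        except (OSError, UnicodeDecodeError):
            # Skip files that cannot be read or decoded with the default encoding.
            continue
        # Key each article by a running integer id.
        all_data[counter] = {'x': txt, 'y': tag}
        counter += 1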

2022-01-12 20:12:08.260 | INFO     | __main__:<module>:8 - Category: tech | Files: 401
2022-01-12 20:12:08.294 | INFO     | __main__:<module>:8 - Category: sport | Files: 511
2022-01-12 20:12:08.320 | INFO     | __main__:<module>:8 - Category: politics | Files: 417
2022-01-12 20:12:08.346 | INFO     | __main__:<module>:8 - Category: entertainment | Files: 386
2022-01-12 20:12:08.373 | INFO     | __main__:<module>:8 - Category: business | Files: 510

## First approach: classic NLP with TF-IDF model

- Basic text cleaning with the `re` module
- Text preprocessing with `spacy`, as a standardized, production-ready process
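
A basic cleaning step with `re` could look like the sketch below. The exact rules used in the session are not shown here, so the choices made in this `clean_text` helper (lowercasing, keeping only ASCII letters, collapsing whitespace) are assumptions for illustration.

```python
import re


def clean_text(text):
    """Basic cleaning: lowercase, drop non-letters, collapse whitespace."""
    text = text.lower()
    # Replace digits, punctuation and other symbols with spaces.
    text = re.sub(r'[^a-z\s]', ' ', text)
    # Collapse runs of whitespace, including newlines, into single spaces.
    text = re.sub(r'\s+', ' ', text)
    return text.strip()


print(clean_text("UK's economy grew 3.2% in 2004,\nsays the BBC."))
# prints: uk s economy grew in says the bbc
```

The cleaned strings would then be handed to spaCy, for instance through `nlp.pipe`, for tokenization and lemmatization. Note that this simple rule splits contractions and possessives (`UK's` becomes `uk s`), which may or may not be acceptable depending on the downstream model.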

- Gensim: dictionary extremes (very rare and very frequent tokens) filtered out to shrink the vocabulary and improve performance
- Classifiers: Logistic Regression and Random Forest

## What happens if I obtain document embeddings?

spaCy ships GloVe word vectors for hundreds of thousands of words in its `lg` models, so document vectors can be obtained simply by averaging word vectors. Contextualized document embedding models could of course achieve better performance, but let's see whether the vectors already available in spaCy are enough for this dataset.
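
The averaging idea can be illustrated with plain NumPy. The 3-dimensional vectors and the `document_vector` helper below are made up for the example; with spaCy, an averaged vector is exposed directly as `Doc.vector`. Skipping unknown tokens, as done here, is a design choice of this sketch.

```python
import numpy as np

# Made-up 3-dimensional vectors; the real spaCy `lg` vectors have 300 dimensions.
toy_vectors = {
    'match': np.array([1.0, 0.0, 0.0]),
    'goal': np.array([0.0, 1.0, 0.0]),
    'chip': np.array([0.0, 0.0, 1.0]),
}


def document_vector(tokens, vectors, dim=3):
    """Average the vectors of known tokens; return zeros if none is known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


doc_vec = document_vector(['match', 'goal', 'unknownword'], toy_vectors)
print(doc_vec)
```

Each document ends up as a fixed-length dense vector regardless of its length, so the resulting matrix can be fed to the same classifiers as the TF-IDF features.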