# Demo 3: A classification task involving Natural Language Processing (NLP)

## Imports

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import re
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_validate, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
import spacy
from tqdm import tqdm

## Problem Description

This dataset is the [Twitter Financial News dataset](https://www.kaggle.com/datasets/sulphatet/twitter-financial-news) from Kaggle. Given a piece of text as input, we need to predict the category of the financial news it reports.

According to the documentation, the categories are listed as below: 

    "LABEL_0": "Analyst Update",

    "LABEL_1": "Fed | Central Banks",

    "LABEL_2": "Company | Product News",

    "LABEL_3": "Treasuries | Corporate Debt",

    "LABEL_4": "Dividend",

    "LABEL_5": "Earnings",

    "LABEL_6": "Energy | Oil",

    "LABEL_7": "Financials",

    "LABEL_8": "Currencies",

    "LABEL_9": "General News | Opinion",

    "LABEL_10": "Gold | Metals | Materials",

    "LABEL_11": "IPO",

    "LABEL_12": "Legal | Regulation",

    "LABEL_13": "M&A | Investments",

    "LABEL_14": "Macro",

    "LABEL_15": "Markets",

    "LABEL_16": "Politics",

    "LABEL_17": "Personnel Change",

    "LABEL_18": "Stock Commentary",

    "LABEL_19": "Stock Movement". 

There are two possible ways to handle this data and build a predictive model. In the previous notebook (Part A), we focused on a Bag of Words representation. In this notebook (Part B), we focus on pre-trained NLP libraries.
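
The category list above can be kept as a small lookup table so that predictions are easier to read. This is a minimal sketch: it assumes that the integer `k` in the `label` column corresponds to `LABEL_k` in the documentation, and `label_name` is a hypothetical helper, not part of the dataset or any library.

```python
# Category names copied from the list above.
# Assumption: integer label k in the data corresponds to "LABEL_k".
LABEL_NAMES = {
    0: "Analyst Update",
    1: "Fed | Central Banks",
    2: "Company | Product News",
    3: "Treasuries | Corporate Debt",
    4: "Dividend",
    5: "Earnings",
    6: "Energy | Oil",
    7: "Financials",
    8: "Currencies",
    9: "General News | Opinion",
    10: "Gold | Metals | Materials",
    11: "IPO",
    12: "Legal | Regulation",
    13: "M&A | Investments",
    14: "Macro",
    15: "Markets",
    16: "Politics",
    17: "Personnel Change",
    18: "Stock Commentary",
    19: "Stock Movement",
}


def label_name(label_id: int) -> str:
    """Return the readable category for an integer label (hypothetical helper)."""
    return LABEL_NAMES[label_id]


print(label_name(0))  # Analyst Update
print(label_name(3))  # Treasuries | Corporate Debt
```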

## Reading the data

First, we read in the training and validation data provided. We will further split the training set into training and validation data. The validation data provided will be used as the test data. 

In [2]:
train_df = pd.read_csv("../datasets/train_data.csv")
test_df = pd.read_csv("../datasets/valid_data.csv")

Let's take a look at the training data:

In [3]:
train_df

| Unnamed: 0 | text | label |
|---|---|---|
| 0 | Here are Thursday's biggest analyst calls: App... | 0 |
| 1 | Buy Las Vegas Sands as travel to Singapore bui... | 0 |
| 2 | Piper Sandler downgrades DocuSign to sell, cit... | 0 |
| 3 | Analysts react to Tesla's latest earnings, bre... | 0 |
| 4 | Netflix and its peers are set for a ‘return to... | 0 |
| ... | ... | ... |
| 16985 | KfW credit line for Uniper could be raised to ... | 3 |
| 16986 | KfW credit line for Uniper could be raised to ... | 3 |
| 16987 | Russian https://t.co/R0iPhyo5p7 sells 1 bln r... | 3 |
| 16988 | Global ESG bond issuance posts H1 dip as supra... | 3 |


Let's split our data into features `X` and labels `y` for the training and test portions. No separate validation portion is created in this cell.

In [4]:
X_train, y_train = train_df["text"], train_df["label"]
X_test, y_test = test_df["text"], test_df["label"]
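
The text mentions a validation portion, but the cell above only separates features from labels. A minimal sketch of how one might carve a validation set out of the training data is shown here. The toy lists stand in for `X_train` and `y_train`, and the 80/20 ratio and the names `X_tr`, `X_val`, `y_tr`, `y_val` are assumptions, not taken from the notebook.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for X_train / y_train so the sketch runs without the CSV files.
texts = [f"headline {i}" for i in range(10)]
labels = [0] * 5 + [1] * 5

X_tr, X_val, y_tr, y_val = train_test_split(
    texts,
    labels,
    test_size=0.2,     # assumed split ratio, not from the notebook
    stratify=labels,   # keep the class proportions the same in both parts
    random_state=613,  # same seed value the notebook uses for its classifier
)

print(len(X_tr), len(X_val))  # 8 2
```

Stratifying matters for this dataset because some categories are rare, and an unstratified split could leave them almost absent from the validation portion.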

## Preprocessing text

A first look at the text shows that some entries contain hyperlinks, which add noise to the data and may hurt performance. We define a function to remove the hyperlinks from the data:

In [5]:
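def remove_hyperlinks(text_series: pd.Series) -> pd.Series:
    # Match http:// or https:// followed by any run of non-whitespace characters
    url_pattern = re.compile(r"https?://\S+")
    # Delete every URL from each string; the surrounding whitespace is left as-is
    return text_series.apply(lambda text: url_pattern.sub("", text))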

Let's consider an example with a hyperlink: 

In [6]:
X_train.iloc[926]

'SEE Expands Sustainable Portfolio With Launch of Innovative Paper Bubble Mailer  https://t.co/R1OebY7H6L  https://t.co/b66wE3yvPI'

Let's apply our newly defined function `remove_hyperlinks` and see the corresponding output: 

In [7]:
remove_hyperlinks(X_train).iloc[926]

'SEE Expands Sustainable Portfolio With Launch of Innovative Paper Bubble Mailer    '
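
The output above still ends with the spaces that surrounded the removed links. A minimal sketch of one way to tidy that up on a single string follows. `clean_text` is a hypothetical helper that is not part of the notebook; it reuses the same URL pattern and then collapses runs of whitespace.

```python
import re

# Same URL pattern as remove_hyperlinks, applied to a single string.
URL_PATTERN = re.compile(r"https?://\S+")


def clean_text(text: str) -> str:
    """Hypothetical helper: drop URLs, then collapse runs of whitespace."""
    without_urls = URL_PATTERN.sub("", text)
    # str.split() with no arguments splits on any whitespace and drops empties,
    # so joining with a single space also trims both ends.
    return " ".join(without_urls.split())


example = (
    "SEE Expands Sustainable Portfolio With Launch of Innovative "
    "Paper Bubble Mailer  https://t.co/R1OebY7H6L  https://t.co/b66wE3yvPI"
)
print(repr(clean_text(example)))
# 'SEE Expands Sustainable Portfolio With Launch of Innovative Paper Bubble Mailer'
```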

## Part B: Using pre-trained NLP libraries

Natural Language Processing (NLP) is an integral part of many modern applications, from chatbots to machine translation and sentiment analysis. With the vast amounts of textual data generated every day, there is an increasing need for tools that can accurately analyze and extract meaningful information from this data. Pre-trained NLP libraries, such as `nltk` and `spacy`, are designed to meet this need. In this demo, we focus on spaCy.

spaCy is an NLP library that provides fast text processing, linguistic analysis, and support for deep learning models. Its pre-trained models let users perform a variety of common NLP tasks without training a model from scratch, which saves developers and data scientists time and resources.

Let's explore the training data using `spacy`. First, we load a pre-trained model.

In [8]:
nlp = spacy.load("en_core_web_md")

Recall that we previously used `TruncatedSVD`, a dimensionality-reduction technique closely related to `PCA`, to reduce the number of features. Pre-trained libraries like `spacy` also transform raw sentences into dense, lower-dimensional features, which are called sentence embeddings. With the `en_core_web_md` model, each embedded sentence has `300` dimensions.

Let's pick one sample from the data and compute its sentence embedding.

In [9]:
selected_text = X_train.iloc[2211]
print(selected_text)

AWS Selected as Delta’s Preferred Cloud Provider  https://t.co/6NDB9DjMzU  https://t.co/33zuRfvkYu


In [10]:
doc = nlp(selected_text)

sentence_embedding = doc.vector
sentence_embedding

array([-7.3421407e-01,  4.0525827e-01, -1.9024168e-01,  1.1192417e-01,
        3.4868500e+00,  9.9555415e-01, -1.1839241e+00,  1.6747411e-01,
       -1.0778084e+00, -2.0522883e+00,  4.0652585e+00, -6.9855839e-01,
       -2.2036667e+00,  3.4582505e-01, -7.1054751e-01,  1.1283466e+00,
        1.4741750e+00,  2.1859000e+00, -2.8289163e-01,  1.0027727e-01,
        9.7559649e-01, -1.5284873e+00, -9.5997500e-01, -6.3331157e-01,
        7.2001912e-02, -1.8117576e+00, -1.2292227e+00, -7.2013330e-01,
       -5.2782416e-01, -2.0157583e-01, -1.3222781e+00,  9.4262165e-01,
       -1.3730808e+00, -1.0827259e+00, -8.4195834e-01, -9.2275836e-02,
        1.9563252e-01,  2.1243913e-01, -8.4608896e-03, -2.0311692e+00,
        4.7339168e-01, -5.2695495e-01, -3.4479773e+00, -4.1158918e-01,
       -1.5155476e+00,  6.6906500e-01,  2.0562582e+00, -1.9486855e+00,
        1.7709298e-01, -1.5825042e+00,  1.4722586e-01,  9.1526359e-01,
        2.4090834e-01, -2.0232999e+00,  6.7629136e-02, -2.5324914e-01,
      

This is a sentence embedding vector which stores 300 different numerical features. 
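
As described in the spaCy documentation, `doc.vector` defaults to the average of the token vectors in the document. The sketch below illustrates that averaging with made-up 3-dimensional vectors. `TOY_VECTORS` and `average_vector` are hypothetical and only stand in for the 300-dimensional vectors that `en_core_web_md` provides.

```python
# Toy 3-dimensional word vectors (made-up values for illustration only).
TOY_VECTORS = {
    "aws": [1.0, 0.0, 2.0],
    "cloud": [0.0, 2.0, 2.0],
    "provider": [2.0, 1.0, -1.0],
}


def average_vector(tokens):
    """Average the toy vectors of the given tokens, dimension by dimension."""
    vectors = [TOY_VECTORS[token] for token in tokens]
    dims = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dims)]


print(average_vector(["aws", "cloud", "provider"]))  # [1.0, 1.0, 1.0]
```

One consequence of averaging is that word order is lost: two sentences made of the same tokens in a different order receive the same embedding.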

### Modelling

Recall from the previous notebook that `RandomForestClassifier` was the best-performing model (although with a test F1 of only `0.70`). Let's build a pipeline and train the model using sentence embeddings.

In [11]:
EMBEDDINGS = 300

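def get_sentence_embeddings(sent_df):
    # nlp.pipe processes the texts as a stream, which is usually faster
    # than calling nlp() once per row
    docs = tqdm(nlp.pipe(sent_df), total=len(sent_df))
    # Build the frame in one step: one row per text, one column per embedding dimension
    return pd.DataFrame([doc.vector for doc in docs], columns=range(EMBEDDINGS))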

In [12]:
pipe_emb = make_pipeline(
    FunctionTransformer(get_sentence_embeddings), 
    RandomForestClassifier(random_state=613, n_estimators=64, max_depth=16, class_weight="balanced")
)

In [13]:
pipe_emb.fit(X_train, y_train)

100%|█████████████████████████████████████| 16990/16990 [04:41<00:00, 60.43it/s]


In [14]:
pipe_emb.score(X_train, y_train)

100%|█████████████████████████████████████| 16990/16990 [04:37<00:00, 61.25it/s]


0.9928193054738081

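The training accuracy is `0.99`. A score this close to perfect on the training data suggests that the model may be overfitting, which the test results below help to confirm.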
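
Training accuracy alone cannot confirm overfitting; it has to be compared with accuracy on held-out data. A minimal sketch of that comparison with `cross_validate` follows. It uses a small synthetic dataset from `make_classification` as a stand-in for the embedding matrix so that it runs quickly, and the dataset sizes, `n_estimators=32` and `cv=3` are assumptions chosen for speed, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the sentence-embedding features (assumption for speed).
X_demo, y_demo = make_classification(
    n_samples=300,
    n_features=20,
    n_informative=5,
    n_classes=3,
    random_state=613,
)

clf = RandomForestClassifier(
    random_state=613, n_estimators=32, max_depth=16, class_weight="balanced"
)

# return_train_score=True reports accuracy on the training folds as well,
# so the gap between training and validation accuracy becomes visible.
scores = cross_validate(clf, X_demo, y_demo, cv=3, return_train_score=True)
train_acc = scores["train_score"].mean()
val_acc = scores["test_score"].mean()
print(f"train accuracy: {train_acc:.2f}, validation accuracy: {val_acc:.2f}")
```

A large gap between the two numbers points to overfitting; lowering `max_depth` or raising `min_samples_leaf` are common ways to try to narrow it.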

### Results on Test set

Let's evaluate the fitted pipeline on the test set:

In [15]:
y_pred = pipe_emb.predict(X_test)

100%|███████████████████████████████████████| 4117/4117 [00:47<00:00, 87.29it/s]


In [16]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.81      0.18      0.29        73
           1       0.79      0.52      0.63       214
           2       0.61      0.79      0.69       852
           3       1.00      0.34      0.50        77
           4       0.87      0.84      0.85        97
           5       0.94      0.78      0.85       242
           6       0.79      0.47      0.59       146
           7       0.89      0.56      0.69       160
           8       0.92      0.38      0.53        32
           9       0.51      0.80      0.62       336
          10       1.00      0.38      0.56        13
          11       1.00      0.14      0.25        14
          12       0.89      0.43      0.58       119
          13       0.89      0.15      0.25       116
          14       0.63      0.71      0.67       415
          15       0.81      0.55      0.66       125
          16       0.74      0.77      0.75       249
          17       0.86    
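
Several classes in the report combine high precision with low recall, which pulls the F1 score down because F1 is the harmonic mean of the two. The sketch below recomputes F1 for classes 0 and 11 from the rounded precision and recall shown above. `f1_from` is a hypothetical helper, and because the inputs are already rounded to two decimals, the recomputed value for other rows can differ from the report in the last digit.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Class 0: precision 0.81, recall 0.18 (values read from the report above)
print(round(f1_from(0.81, 0.18), 2))  # 0.29

# Class 11: precision 1.00, recall 0.14
print(round(f1_from(1.00, 0.14), 2))  # 0.25
```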

## Discussions

Overall, classifying text data is not an easy task. The training accuracy of `0.99` contrasts with low recall on several test classes (for example, `0.18` for class 0 and `0.14` for class 11), so the model needs further tuning for a (hopefully) better result. Moreover, the data appears to be highly non-linear, which suggests that the problem is more difficult than expected.