In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Embedding Construction with scikit-learn

In this tutorial, we will explore simple NLP techniques to create embeddings with scikit-learn. You will learn how to feed unstructured text data into a scikit-learn transformer, whose output can then be used in any machine learning pipeline.

By the end of this tutorial, you should have a deep understanding of how scikit-learn can be used to perform Natural Language Processing.




In [None]:
DATA_PATH = '/content/drive/MyDrive/Inspire/QA Webinars/Episode 02/data'
TRAIN_DATA = 'data.csv'

## 1. Dataset Creation
In this phase, we perform the preprocessing needed to feed consistent numerical input into the machine learning pipeline.
Let us use the pandas `read_csv` function to load the data.

In [None]:
import pandas as pd
import os
df = pd.read_csv(os.path.join(DATA_PATH, TRAIN_DATA))

A simple inspection of the dataset shows that we have two columns: one called `Sentiment` containing the target variable, and another one called `Text` containing the financial news text.

In [None]:
df.head()

Unnamed: 0,Sentiment,Text
0,neutral,"According to Gran , the company has no plans t..."
1,neutral,Technopolis plans to develop in stages an area...
2,negative,The international electronic industry company ...
3,positive,With the new production plant the company woul...
4,positive,According to the company 's updated strategy f...


Let us investigate the distribution of the target variable in our dataset.

In [None]:
df[['Text', 'Sentiment']].groupby('Sentiment').count()

Sentiment,Text
negative,604
neutral,2879
positive,1363


Let us perform a simple factorization of the `Sentiment` column. This is necessary because many machine learning models expect numerical targets rather than strings, so we need a one-to-one mapping from each category to an integer: 0, 1, 2, and so on.

To do so, we employ the scikit-learn `LabelEncoder`, which is a transformer used to encode categorical values into numerical ones.

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['Sentiment_id'] = le.fit_transform(df['Sentiment'])

A nice thing about scikit-learn transformers is that we can easily recover the original label from a label index, using the `inverse_transform` method.

Let us run the inverse transform of `[2,1]` and check that it is equal to `[positive, neutral]`.

In [None]:
le.inverse_transform([2,1])

array(['positive', 'neutral'], dtype=object)
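
To make the mapping explicit, here is a minimal sketch on a small, made-up list of labels (not the real dataset). It relies on the fitted encoder's `classes_` attribute, which stores the categories in sorted order, so a category's position in `classes_` is its integer code.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the `Sentiment` column (illustrative data only)
labels = ['neutral', 'negative', 'positive', 'neutral']

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the categories in sorted order; the index is the integer code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)         # {'negative': 0, 'neutral': 1, 'positive': 2}
print(codes.tolist())  # [1, 0, 2, 1]
```

This is why `inverse_transform([2,1])` returns `positive` and `neutral`: the codes are simply positions in the sorted list of categories.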

#### Alternative way of encoding the target
```python
df['Sentiment_id'] = df['Sentiment'].factorize()[0]
```
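This can be used instead of the `LabelEncoder`. However, recovering the original labels is less direct than with `LabelEncoder.inverse_transform`: you have to keep the array of unique values that `factorize()` returns alongside the codes. So my suggestion is to use the `LabelEncoder`.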
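
For completeness, here is a small sketch of how the labels could be recovered after `factorize()`, using a made-up series rather than the real dataset. Note that `factorize()` assigns codes in order of first appearance, so the integers will generally differ from those produced by `LabelEncoder`.

```python
import pandas as pd

# Toy labels standing in for the `Sentiment` column (illustrative data only)
sentiments = pd.Series(['neutral', 'negative', 'positive', 'neutral'])

# factorize returns the integer codes and the unique values they index into
codes, uniques = sentiments.factorize()

# Codes follow the order of first appearance, not alphabetical order
print(codes.tolist())    # [0, 1, 2, 0]
print(uniques.tolist())  # ['neutral', 'negative', 'positive']

# Inverse mapping: look up each code in the array of unique values
recovered = uniques.take(codes)
print(recovered.tolist())  # ['neutral', 'negative', 'positive', 'neutral']
```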

In [None]:
df.tail()

Unnamed: 0,Sentiment,Text,Sentiment_id
4841,negative,LONDON MarketWatch -- Share prices ended lower...,0
4842,neutral,Rinkuskiai 's beer sales fell by 6.5 per cent ...,1
4843,negative,Operating profit fell to EUR 35.4 mn from EUR ...,0
4844,negative,Net sales of the Paper segment decreased to EU...,0
4845,negative,Sales in Finland decreased by 10.5 % in Januar...,0


## 2. Creating Numerical Embeddings for Text Features: Bag of Words

The basic idea of bag-of-words (BoW) is to take a piece of text and count the frequency of the words in that text. It is important to note that the BoW concept treats each word individually and the order in which the words occur does not matter. 

In [None]:
corpus = [
 'This is the first document.',
 'This document is the second document.',
]

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Drop common English stop words ('this', 'is', 'the', ...) before counting
count_vect = CountVectorizer(stop_words='english')
# To also count bigrams, pass ngram_range=(1, 2):
# count_vect = CountVectorizer(stop_words='english', ngram_range=(1,2))
corpus_vect = count_vect.fit_transform(corpus)

In [None]:
toy_df = pd.DataFrame(corpus_vect.toarray())
toy_df.columns = count_vect.get_feature_names_out()

In [None]:
toy_df

Unnamed: 0,document,second
0,1,0
1,2,1


The bag-of-words model assumes that the words are independent, so it does not take into account any relationship between words or the order in which they appear. As a result, much of the meaning of a sentence is lost.

Let us apply the BoW to our dataset.

In [None]:
count_vect_df = CountVectorizer(stop_words='english')
corpus_vect_df = count_vect_df.fit_transform(df.Text.to_list())

In [None]:
bow_df = pd.DataFrame(corpus_vect_df.toarray())
bow_df.columns = count_vect_df.get_feature_names_out()

In [None]:
bow_df[['usa']].value_counts()

usa
0      4831
1        15
dtype: int64

In [None]:
count_vect_df = CountVectorizer(stop_words='english', ngram_range=(1,2))
corpus_vect_df = count_vect_df.fit_transform(df.Text.to_list())

In [None]:
bow_df = pd.DataFrame(corpus_vect_df.toarray())
bow_df.columns = count_vect_df.get_feature_names_out()

In [None]:
bow_df[['share price']].value_counts()

share price
0              4840
1                 6
dtype: int64

## 3. Creating Numerical Embeddings for Text Features: TfidfVectorizer


To avoid the aforementioned issues with BoW, we apply the Term Frequency-Inverse Document Frequency (TF-IDF) technique to convert the corpus of texts into a numerical matrix.
Each row describes a document, and the columns are the n-grams (typically unigrams and bigrams) that characterize our documents.

To do so, we use the `TfidfVectorizer` class from the scikit-learn `feature_extraction.text` submodule. To learn more about this class, please watch the course Introduction to Natural Language Processing with scikit-learn, available in our content library.

Here, we set `stop_words='english'` to remove common English words, such as pronouns; `min_df=5` to keep only those terms that appear in at least five different documents; and `ngram_range=(1, 2)` to keep track of both unigrams and bigrams. For the small toy corpus below, `min_df` and `ngram_range` are commented out; they are enabled when we fit on the full dataset. Finally, we set `sublinear_tf=True`. This argument is particularly useful, and I strongly encourage you to use it, especially when you are working with heterogeneous text data. With `sublinear_tf=True`, the raw term frequency `tf` is replaced by `1 + log(tf)`, which gives diminishing returns as the frequency of a word increases.


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    stop_words='english',
    # min_df = 5,
    sublinear_tf = True,
    norm = 'l2',
    # ngram_range = (1, 2)

)

In [None]:
corpus = [
 'Dogs are very friendly',
 'Dogs and cat are domestic animals',
 'Dogs and cat are not friends',
 "Apple surged by 10 percent",
 "Eating an apple every day helps to keep you health",
]

In [None]:
X = tfidf.fit_transform(corpus).toarray()

In [None]:
df_tfidf_toy = pd.DataFrame(X, columns=tfidf.get_feature_names_out())
df_tfidf_toy

Unnamed: 0,10,animals,apple,cat,day,dogs,domestic,eating,friendly,friends,health,helps,percent,surged
0,0.0,0.0,0.0,0.0,0.0,0.556451,0.0,0.0,0.830881,0.0,0.0,0.0,0.0,0.0
1,0.0,0.568014,0.0,0.45827,0.0,0.380406,0.568014,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.556816,0.0,0.462208,0.0,0.0,0.0,0.690159,0.0,0.0,0.0,0.0
3,0.523358,0.0,0.422242,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.523358,0.523358
4,0.0,0.0,0.374105,0.0,0.463693,0.0,0.0,0.463693,0.0,0.0,0.463693,0.463693,0.0,0.0
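
The values in the first row can be reproduced by hand. The sketch below assumes scikit-learn's default smoothed inverse document frequency, `idf = ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term. After stop-word removal, document 0 contains only `dogs` (present in 3 of the 5 documents) and `friendly` (present in 1).

```python
import numpy as np

n_docs = 5                   # documents in the toy corpus
df_dogs, df_friendly = 3, 1  # number of documents containing each term

# Smoothed idf (assumed default, smooth_idf=True): ln((1 + n) / (1 + df)) + 1
idf = np.array([
    np.log((1 + n_docs) / (1 + df_dogs)) + 1,
    np.log((1 + n_docs) / (1 + df_friendly)) + 1,
])

# Each term occurs once in document 0, so the sublinear tf is 1 + ln(1) = 1
tf = np.array([1.0, 1.0])

# Multiply tf by idf, then apply the L2 normalisation (norm='l2')
weights = tf * idf
row0 = weights / np.linalg.norm(weights)
print(row0.round(6))  # [0.556451 0.830881]
```

The rarer word `friendly` receives the larger weight, which is exactly the behaviour TF-IDF is designed to produce.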


Cosine Similarity

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity(df_tfidf_toy, df_tfidf_toy)
pd.DataFrame(cosine_sim)

Unnamed: 0,0,1,2,3,4
0,1.0,0.211677,0.257196,0.0,0.0
1,0.211677,1.0,0.430999,0.0,0.0
2,0.257196,0.430999,1.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.157963
4,0.0,0.0,0.0,0.157963,1.0
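
Because the vectors are L2-normalised (`norm='l2'`), the cosine similarity between two documents reduces to their dot product. The following sketch checks this on a small made-up corpus; the variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative documents: the first two share a word, the third shares none
docs = ['dogs are friendly', 'dogs and cats', 'apple shares surged']
X_toy = TfidfVectorizer(norm='l2').fit_transform(docs)

# For unit-length rows, the matrix of dot products is the cosine similarity
dot = (X_toy @ X_toy.T).toarray()
cos = cosine_similarity(X_toy)

print(np.allclose(dot, cos))  # True
print(cos[0, 2])              # 0.0, since the documents share no terms
```

This also explains the zero blocks in the table above: documents with no terms in common have orthogonal vectors.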


TFIDF model

In [None]:
tfidf = TfidfVectorizer(
    stop_words='english',
    min_df = 5,
    sublinear_tf = True,
    norm = 'l2',
    ngram_range = (1, 2)

)

In [None]:
X = tfidf.fit_transform(df.Text.to_list()).toarray()

In [None]:
pd.DataFrame(X, columns=tfidf.get_feature_names_out())

Unnamed: 0,00,00 eet,000,000 corresponding,000 eur,000 euro,000 new,000 people,000 period,000 quarter,...,yesterday,yhtyma,yit,york,zinc,zone,æinen,ænnen,ænnen tehtaat,ærvi
0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.15391,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4841,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4842,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4843,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4844,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
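
As the table shows, almost every entry is zero. `fit_transform` returns a SciPy sparse matrix, and calling `.toarray()` is only needed for display. A minimal sketch, on a made-up three-document corpus, of inspecting the sparse result directly:

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative documents (not taken from the dataset)
docs = ['net sales rose', 'net sales fell', 'operating profit fell']
X_sparse = TfidfVectorizer().fit_transform(docs)

print(sparse.issparse(X_sparse))  # True
print(X_sparse.shape)             # (3, 6): 3 documents, 6 distinct terms

# Fraction of entries that are actually stored (non-zero)
density = X_sparse.nnz / (X_sparse.shape[0] * X_sparse.shape[1])
print(density)  # 0.5
```

On a real corpus with thousands of n-gram columns the density is far lower, so keeping the matrix sparse saves a considerable amount of memory.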


### End Lab