# Feature Extraction
This notebook extracts features from the preprocessed text data using TF-IDF, Word2Vec, and Sentence Transformers. The extracted features are used in the subsequent notebook for model training and evaluation.

## Set Up Dependencies

In [11]:
import pandas as pd
import numpy as np
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec
import multiprocessing
from time import time
from sentence_transformers import SentenceTransformer
from scipy import sparse

Let's load the preprocessed dataframe from the previous notebook.

In [3]:
with open('data/df_balanced.pkl', 'rb') as f:
    df_balanced = pickle.load(f)

df_balanced.head()

Unnamed: 0,class,tweet,cleaned_text,Word2Vec,TF-IDF,SentenceTrans
0,0,I LOVE my 10 &amp; 5 but most days they remind...,I LOVE my 10 &amp; 5 but most days they remind...,"[love, my, amp, but, most, day, they, remind, ...",love my amp but most day they remind me why bi...,i love my amp but most days they remind me w...
1,1,She be thinking she throwing that pussy back s...,She be thinking she throwing that pussy back s...,"[she, be, think, she, throw, that, pussy, back...",she be think she throw that pussy back so good...,she be thinking she throwing that pussy back s...
2,1,RT @lamessican: I love when bitches throw shad...,I love when bitches throw shade. Just confirms...,"[love, when, bitch, throw, shade, just, confir...",love when bitch throw shade just confirm do so...,i love when bitches throw shade just confirms ...
3,1,"If you ain't a hoe, get up out my trap house @...","If you ain't a hoe, get up out my trap house .","[if, you, ain, hoe, get, up, out, my, trap, ho...",if you ain hoe get up out my trap house,if you aint a hoe get up out my trap house
4,0,Just hit 40 in flappy bird.&#128527;,Just hit 40 in flappy bird.&#128527;,"[just, hit, in, flappy, bird]",just hit in flappy bird,just hit in flappy bird


## Vectorize the Text Data
Convert the preprocessed text data to vector representations.

### TF-IDF

Parameters:
1. `min_df`: minimum document frequency
    
    Ignore terms that appear in fewer than 3 documents.
2. `ngram_range`: n-values for n-grams to be extracted
    
    Extract unigrams and bigrams.
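
To make the effect of these two parameters concrete, here is a minimal sketch on a toy corpus. The four documents and the `toy_` names are made up for illustration and are not part of the dataset. Only terms that appear in at least 3 of the 4 documents survive, and the surviving vocabulary mixes unigrams and bigrams.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only to illustrate min_df and ngram_range
toy_docs = [
    "love flappy bird",
    "love flappy bird game",
    "love flappy bird score",
    "hate flappy bird",
]

# Same settings as the notebook: keep terms found in >= 3 documents,
# and extract both unigrams and bigrams
toy_vectorizer = TfidfVectorizer(min_df=3, ngram_range=(1, 2))
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each retained term to its column index
toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)
print(toy_matrix.shape)
```

Terms such as "game", "score", and "hate" each occur in a single document, so they are dropped along with the bigrams that contain them.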

In [5]:
vectorizer = TfidfVectorizer(min_df=3, ngram_range=(1, 2))
x_tfidf = vectorizer.fit_transform(df_balanced['TF-IDF'])

In [10]:
print(x_tfidf.shape)

(8326, 7924)


Saving the text representations

In [12]:
# Save the sparse matrix to a .npz file
sparse.save_npz('x_tfidf.npz', x_tfidf)
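
The saved matrix can be read back with `sparse.load_npz`. The sketch below checks the round trip on a small stand-in matrix written to a temporary directory. The file name `toy_tfidf.npz` and the matrix values are assumptions for illustration, not the notebook's real data.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Small stand-in for x_tfidf (hypothetical values)
toy = sparse.csr_matrix(np.array([[0.0, 0.5, 0.0],
                                  [0.7, 0.0, 0.3]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_tfidf.npz")
    sparse.save_npz(path, toy)        # write the sparse matrix
    restored = sparse.load_npz(path)  # read it back

# The restored matrix should match the original in shape and values
same_shape = restored.shape == toy.shape
same_values = (restored != toy).nnz == 0
print(restored.format, same_shape, same_values)
```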

### Word2Vec

Setting up Model Parameters
1. `min_count`: minimum frequency of words

    Ignore terms with frequency count lower than 3.
2. `window`: context window size

    Set to 5 to balance between syntactic context and broader semantic relationships.
3. `vector_size`: dimensionality of the feature vectors

    Set to 100 for reduced computational load.
4. `sample`: threshold for downsampling of higher-frequency words

    Set to 6e-5. This is below gensim's default of 1e-3, so frequent words are downsampled more aggressively than by default.
5. `alpha`: initial learning rate

    Set to 0.03 for faster convergence.
6. `min_alpha`: as training progresses, learning rate decreases linearly from `alpha` to `min_alpha`

    Set to 0.0007.
7. `negative`: number of noise words drawn per positive example; negative sampling trains the model to distinguish genuine context words from noise

    Set to 10, within the commonly used range of 5 to 20.
8. `workers`: number of CPU cores used during training, impacting speed

    Set to cores-1 to utilise most of the system's processing power.
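
The linear decay from `alpha` to `min_alpha` can be sketched as follows. This is an idealised schedule evaluated at epoch boundaries, assuming the 30 training epochs used in this notebook; the exact points at which gensim updates the rate internally may differ, and the `learning_rate` helper is hypothetical.

```python
alpha = 0.03        # initial learning rate
min_alpha = 0.0007  # final learning rate
epochs = 30         # number of training epochs used in this notebook


def learning_rate(progress):
    """Linearly interpolate from alpha to min_alpha.

    `progress` is the fraction of training completed, between 0.0 and 1.0.
    """
    return alpha - (alpha - min_alpha) * progress


# Rate at the start, the midpoint, and the end of training
for epoch in (0, 15, 30):
    print(epoch, round(learning_rate(epoch / epochs), 5))
```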


In [25]:
cores = multiprocessing.cpu_count()  # Number of CPU cores on this machine

w2v_model = Word2Vec(min_count=3,
                     window=5,
                     vector_size=100,
                     sample=6e-5,
                     alpha=0.03,
                     min_alpha=0.0007,
                     negative=10,
                     workers=max(1, cores - 1))  # Keep at least one worker on single-core machines

Building the Vocabulary Table

In [26]:
t = time()
w2v_model.build_vocab(df_balanced['Word2Vec'])
print(f'Time to build vocab:{round((time()-t)/60, 2)} mins')

Time to build vocab:0.0 mins


Training the Model

In [27]:
t = time()
w2v_model.train(df_balanced['Word2Vec'], total_examples=w2v_model.corpus_count, epochs=30)
print('Time to train the model: {} mins'.format(round((time() - t) / 60, 2)))

Time to train the model: 0.02 mins


Generate Sentence Vectors

In [40]:
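def get_sentence_vector(text):
    """Average the Word2Vec vectors of the in-vocabulary tokens in `text`."""
    word_vectors = [w2v_model.wv[word] for word in text if word in w2v_model.wv]
    if len(word_vectors) == 0:
        # No token is in the vocabulary: fall back to a zero vector
        return np.zeros(w2v_model.vector_size)
    return np.mean(word_vectors, axis=0)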

x_w2v = np.array([get_sentence_vector(text) for text in df_balanced['Word2Vec']])
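
The averaging step can be illustrated without a trained model. The sketch below swaps the Word2Vec lookup for a hand-written dictionary of 3-dimensional vectors. The `toy_` names and values are assumptions for illustration only.

```python
import numpy as np

# Hypothetical stand-in for w2v_model.wv: token -> vector
toy_vectors = {
    "flappy": np.array([1.0, 0.0, 2.0]),
    "bird": np.array([3.0, 2.0, 0.0]),
}
toy_dim = 3  # stand-in for w2v_model.vector_size


def toy_sentence_vector(tokens):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [toy_vectors[token] for token in tokens if token in toy_vectors]
    if not known:
        return np.zeros(toy_dim)
    return np.mean(known, axis=0)


# "just" and "hit" are out of vocabulary and are skipped
mixed = toy_sentence_vector(["just", "hit", "flappy", "bird"])
# No known tokens at all: the zero-vector fallback is used
empty = toy_sentence_vector(["unseen", "tokens"])
print(mixed)
print(empty)
```

Out-of-vocabulary tokens do not pull the average toward zero; they are simply ignored. A tweet with no known tokens maps to the zero vector.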

Saving the text representations

In [41]:
np.save('representations/x_w2v.npy', x_w2v)

### Sentence Transformers

Generating sentence embeddings

In [4]:
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
# Pass a plain list so encoding does not depend on the DataFrame's index labels
x_ST = sentence_model.encode(df_balanced['SentenceTrans'].tolist(), show_progress_bar=True)



Batches:   0%|          | 0/261 [00:00<?, ?it/s]

Verifying the embeddings are a NumPy array

In [61]:
print(type(x_ST))

<class 'numpy.ndarray'>
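
A slightly stricter check could also confirm that the array is two-dimensional, has one row per input text, and contains only finite values. The helper below is a hypothetical sketch, shown on a small placeholder array rather than the real embeddings.

```python
import numpy as np


def check_embeddings(matrix, expected_rows):
    """Validate an embedding matrix and return its shape (hypothetical helper)."""
    if not isinstance(matrix, np.ndarray):
        raise TypeError("embeddings must be a NumPy array")
    if matrix.ndim != 2:
        raise ValueError("embeddings must be two-dimensional")
    if matrix.shape[0] != expected_rows:
        raise ValueError("expected one embedding row per input text")
    if not np.isfinite(matrix).all():
        raise ValueError("embeddings contain NaN or infinite values")
    return matrix.shape


# Placeholder array standing in for x_ST: 4 texts, 8 dimensions
toy_embeddings = np.zeros((4, 8), dtype=np.float32)
print(check_embeddings(toy_embeddings, expected_rows=4))
```

In the notebook this could be called as `check_embeddings(x_ST, expected_rows=len(df_balanced))`.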


Saving the text representations

In [23]:
np.save('representations/x_ST.npy', x_ST)