# DIM0782 - Machine Learning (DIMAp/UFRN/2024.1)

## Preprocessing the data

### Transforming non-structured data to structured data

The dataset is textual, so the first step is to turn it into structured (numeric) data with a pretrained transformer. I'm going to work mostly with BERT here, more precisely with its distilled variant, DistilBERT.

First, I'm going to define the imports block.

In [1]:
import numpy as np
import pandas as pd
import torch
import transformers as ppb

Now reading the sentiments dataset:

In [2]:
sentiments_df = pd.read_csv("datasets/twitter_sentiment_base_original.csv", usecols=["text", "label"])
sentiments_df.head()

Unnamed: 0,text,label
0,i just feel really helpless and heavy hearted,4
1,ive enjoyed being able to slouch about relax a...,0
2,i gave up my internship with the dmrg and am f...,4
3,i dont know i feel so lost,0
4,i am a kindergarten teacher and i am thoroughl...,4


#### Optional: run when the dataset is huge

In [3]:
sample_size, random_state = 4000, 42
sampled_data = []

for i in range(6):  ## Since we have 6 sentiments, labeled from 0 to 5
    sampled_data.append(sentiments_df[sentiments_df['label'] == i].sample(n=sample_size, random_state=random_state))

sentiments_df = pd.concat(sampled_data)
sentiments_df

Unnamed: 0,text,label
133243,ive learned to surround myself with women who ...,0
88501,i already feel crappy because of this and you ...,0
131379,i feel like i have lost mourned and moved past...,0
148369,i could write a whole lot more about why im fe...,0
134438,i always seem to feel inadequate,0
...,...,...
44216,i feel a strange sadness because the downhill ...,5
370216,i feel absolutely amazing when i have a conver...,5
118400,im not going to repeat every word written in t...,5
347704,i feel impressed to extend this to all,5


---

Now I'm going to use a BERT tokenizer to convert each text into a list of token IDs. Later on, the resulting data will be paired with the sentiment labels from the original base.

In [4]:
## Loading the pretrained model and tokenizer
tokenizer = ppb.DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = ppb.DistilBertModel.from_pretrained('distilbert-base-uncased')

## Tokenizing with DistilBERT; add_special_tokens=True wraps each sentence in [CLS] ... [SEP]
sentiments_tokenized = sentiments_df["text"].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))

The code above produces rows of different sizes, since each sentence has a different number of tokens. We now need to pad the shorter rows so that every sentence has the same length.

In [5]:
## Finding the length of the longest tokenized sentence
biggest_sentence_length = max(len(i) for i in sentiments_tokenized.values)

## Right-padding every sentence with 0 up to the longest length
sentiments_tokenized_padded = np.array([i + [0]*(biggest_sentence_length-len(i)) for i in sentiments_tokenized.values])
sentiments_tokenized_padded.shape

(24000, 89)
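
To make the padding step concrete, here is a minimal sketch on a toy input. The token IDs are made up for illustration (only 101 and 102, the `[CLS]` and `[SEP]` IDs, match what the tokenizer emits), and `pad_to_longest` is a hypothetical helper that is not part of this notebook.

```python
import numpy as np

def pad_to_longest(sequences, pad_id=0):
    """Right-pad every sequence with pad_id up to the longest length."""
    longest = max(len(seq) for seq in sequences)
    return np.array([seq + [pad_id] * (longest - len(seq)) for seq in sequences])

# Made-up token IDs, for illustration only
toy_tokenized = [[101, 7, 8, 102], [101, 9, 102], [101, 3, 4, 5, 6, 102]]
toy_padded = pad_to_longest(toy_tokenized)

print(toy_padded.shape)        # (3, 6)
print(toy_padded[1].tolist())  # [101, 9, 102, 0, 0, 0]
```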

We also need an attention mask so that the model ignores the padded positions: 1 marks a real token and 0 marks padding.

In [6]:
attention_mask = np.where(sentiments_tokenized_padded != 0, 1, 0)
attention_mask.shape

(24000, 89)
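
Building the mask with `!= 0` works here because the padding value is 0, which as far as I know is also the ID of the `[PAD]` token in the `distilbert-base-uncased` vocabulary, so no real token gets masked by accident. If that assumption stops holding (another tokenizer, another padding value), the mask can be derived from the original sentence lengths instead. A minimal sketch on made-up token IDs, where `mask_from_lengths` is a hypothetical helper:

```python
import numpy as np

def mask_from_lengths(lengths, max_len):
    """Return an (n_sentences, max_len) array with 1 on real tokens and 0 on padding."""
    positions = np.arange(max_len)            # shape (max_len,)
    lengths = np.asarray(lengths)[:, None]    # shape (n_sentences, 1)
    return (positions < lengths).astype(int)  # broadcasts to (n_sentences, max_len)

# Made-up token IDs, for illustration only
toy_tokenized = [[101, 7, 8, 102], [101, 9, 102]]
toy_lengths = [len(seq) for seq in toy_tokenized]
toy_mask = mask_from_lengths(toy_lengths, max_len=4)

print(toy_mask.tolist())  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```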

Now we can finally feed the input IDs generated in the steps above to the model and get the actual embeddings. After running multiple tests I realized that, given the size of the dataset, memory would be an issue, so the input is split into batches and processed one batch at a time.

In [7]:
input_ids = torch.tensor(sentiments_tokenized_padded)
attention_mask = torch.tensor(attention_mask)
batch_size = 4000
all_hidden_states = []

def batch_generator(input_ids, attention_mask, batch_size):
    for i in range(0, len(input_ids), batch_size):
        yield input_ids[i:i+batch_size], attention_mask[i:i+batch_size]

## Processes the input in batches with the help of a generator; no_grad() skips gradient tracking, which saves memory
with torch.no_grad():
    for batch_ids, batch_mask in batch_generator(input_ids, attention_mask, batch_size):
        outputs = model(batch_ids, attention_mask=batch_mask)
        all_hidden_states.append(outputs.last_hidden_state)

## Concatenates all batch results
last_hidden_states = torch.cat(all_hidden_states, dim=0)
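
Storing the full hidden state of every batch keeps one 768-dimensional vector per token position, i.e. 24000 x 89 x 768 values, roughly 6.5 GB assuming float32, even though only the first position of each sentence is used later on. One possible way to cut that down is to slice inside the loop. The sketch below only illustrates the batching pattern with NumPy: `fake_model` is a hypothetical stand-in that returns random numbers with the right shape (it is not DistilBERT), `batch_slices` is a hypothetical helper, and the sizes are shrunk so that it runs instantly.

```python
import numpy as np

HIDDEN_SIZE = 8  # the real model uses 768; shrunk for this sketch

def fake_model(batch_ids):
    """Hypothetical stand-in for the model: random (batch, seq_len, hidden) output."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(batch_ids.shape[0], batch_ids.shape[1], HIDDEN_SIZE))

def batch_slices(n_rows, batch_size):
    """Yield slices that cover n_rows in chunks of at most batch_size."""
    for start in range(0, n_rows, batch_size):
        yield slice(start, start + batch_size)

toy_ids = np.zeros((10, 5), dtype=int)  # 10 "sentences" of 5 token IDs each
first_token_vectors = []
for rows in batch_slices(len(toy_ids), batch_size=4):
    hidden = fake_model(toy_ids[rows])
    # Keep only position 0 of each sentence: each batch contributes (batch, hidden)
    first_token_vectors.append(hidden[:, 0, :])

toy_features = np.concatenate(first_token_vectors, axis=0)
print(toy_features.shape)  # (10, 8)
```

In the real loop, the equivalent change would presumably be to append `outputs.last_hidden_state[:, 0, :]` instead of the full tensor, at the cost of no longer having the per-token vectors available afterwards.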

#### Optional 

Saving the hidden states to a file, so that the embeddings don't have to be recomputed:

In [14]:
## Saves tensor to a file
torch.save(last_hidden_states, 'tensor.pt')

Now we can extract the features:

In [33]:
sentiments_features = last_hidden_states[:,0,:].numpy()
labels = sentiments_df['label']
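
The slice `[:,0,:]` keeps, for every sentence, the vector at position 0, which is where the tokenizer placed the `[CLS]` token (ID 101). The sketch below shows the effect of the same slice on a tiny made-up array; the numbers carry no meaning.

```python
import numpy as np

# Made-up "hidden states": 2 sentences, 3 token positions, 4 hidden dimensions
toy_hidden = np.arange(2 * 3 * 4).reshape(2, 3, 4)

toy_cls = toy_hidden[:, 0, :]  # first position of every sentence

print(toy_cls.shape)     # (2, 4)
print(toy_cls.tolist())  # [[0, 1, 2, 3], [12, 13, 14, 15]]
```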

Additionally, I may want to generate a CSV file containing the structured data. This can be achieved by running:

In [37]:
sentiments_df_embedded = pd.DataFrame({'features': sentiments_features.tolist(), 'label': labels})
sentiments_df_embedded.to_csv('datasets/twitter_sentiment_base_tokenized.csv')
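
One caveat with this CSV: a column of Python lists is written as text, so it comes back as strings when the file is read again. A minimal sketch of the round trip on made-up values, using `ast.literal_eval` to rebuild the lists (an in-memory buffer stands in for the real file):

```python
import ast
import io

import pandas as pd

toy_df = pd.DataFrame({"features": [[0.1, -0.2], [0.3, 0.4]], "label": [0, 5]})

buffer = io.StringIO()
toy_df.to_csv(buffer)  # same call as above, but written to memory
buffer.seek(0)

restored = pd.read_csv(buffer, index_col=0)
print(type(restored["features"].iloc[0]))  # <class 'str'>

# Parse each string back into a list of floats
restored["features"] = restored["features"].apply(ast.literal_eval)
print(restored["features"].iloc[0])  # [0.1, -0.2]
```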

### What next?

In [73]:
## This is how you get the features as attributes: there should be 24000 entries, each with 768 dimensions
sentiments_df_embedded['features'].to_numpy()

array([list([-0.2756660282611847, 0.13099384307861328, -0.30848270654678345, -0.27635568380355835, -0.3243187665939331, ...
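
As the output shows, `to_numpy()` on this column returns a 1-D object array whose elements are lists. When a 2-D numeric matrix is needed instead (one row per instance, one column per dimension), the lists can be stacked. A minimal sketch on a made-up frame with 3 dimensions instead of 768:

```python
import numpy as np
import pandas as pd

toy_embedded = pd.DataFrame({"features": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], "label": [0, 1]})

as_objects = toy_embedded["features"].to_numpy()         # 1-D object array of lists
as_matrix = np.array(toy_embedded["features"].tolist())  # 2-D float array

print(as_objects.shape, as_objects.dtype)  # (2,) object
print(as_matrix.shape, as_matrix.dtype)    # (2, 3) float64
```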

### Reduction of Instances

As part of this work, I'm required to reduce the number of instances in my dataset.

The goal here is to draw 14972 random instances from each of the 6 classes in the sentiments dataset and build a new, balanced base with them (6 x 14972 = 89832 instances in total).

In [18]:
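## Variables for the sampling conditions
## NOTE: sentiments_tokenized_df is assumed to hold the tokenized text and the labels of the
## full dataset; it is not built in the cells shown above
sample_size, random_state = 14972, 42
sampled_data = []

for i in range(6):  ## Since we have 6 sentiments, labeled from 0 to 5
    sampled_data.append(sentiments_tokenized_df[sentiments_tokenized_df['label'] == i].sample(n=sample_size, random_state=random_state))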

sampled_sentiments_df = pd.concat(sampled_data)
sampled_sentiments_df

Unnamed: 0,text,label
133243,"[101, 4921, 2063, 4342, 2000, 15161, 2870, 200...",0
88501,"[101, 1045, 2525, 2514, 10231, 7685, 2138, 199...",0
131379,"[101, 1045, 2514, 2066, 1045, 2031, 2439, 9587...",0
148369,"[101, 1045, 2071, 4339, 1037, 2878, 2843, 2062...",0
134438,"[101, 1045, 2467, 4025, 2000, 2514, 14710, 102...",0
...,...,...
142937,"[101, 1045, 4384, 2026, 6181, 2318, 3110, 1037...",5
373641,"[101, 1045, 2514, 2023, 2428, 7622, 2068, 1998...",5
148816,"[101, 10047, 2183, 2000, 3745, 2055, 2026, 430...",5
23199,"[101, 1045, 4342, 2008, 2065, 1045, 5454, 1062...",5
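
The per-class loop can also be written with `groupby(...).sample(...)`, available in pandas 1.1 and later. A minimal sketch on a made-up frame with 3 classes; the exact rows drawn may differ from the loop version, so treat it as an alternative rather than a drop-in replacement.

```python
import pandas as pd

# Made-up data: 12 rows, 4 per class
toy_df = pd.DataFrame({
    "text": [f"sentence {i}" for i in range(12)],
    "label": [0, 1, 2] * 4,
})

# Draw the same number of rows from every class
toy_sampled = toy_df.groupby("label").sample(n=2, random_state=42)

print(toy_sampled["label"].value_counts().sort_index().tolist())  # [2, 2, 2]
```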


Now I'm just going to generate a CSV file with this new dataset.

In [19]:
sampled_sentiments_df.to_csv('datasets/twitter_sentiments_base_sampled.csv')