# DIM0782 - Machine Learning (DIMAp/UFRN/2024.1)

## Preprocessing the data

### Transforming non-structured data to structured data

The dataset is textual, so the first step is to turn it into structured (numeric) data using a transformer model. I'm going to work mostly with BERT here.

First, I'm going to define the imports block.

In [1]:
import numpy as np
import pandas as pd
import torch
import transformers as ppb

Now reading the sentiments dataset:

In [2]:
sentiments_df = pd.read_csv("datasets/twitter_sentiment_base_original.csv", usecols=["text", "label"])
sentiments_df.head()

Unnamed: 0,text,label
0,i just feel really helpless and heavy hearted,4
1,ive enjoyed being able to slouch about relax a...,0
2,i gave up my internship with the dmrg and am f...,4
3,i dont know i feel so lost,0
4,i am a kindergarten teacher and i am thoroughl...,4


Now I'm going to use a BERT tokenizer to convert each text into a sequence of token IDs. Later, I'll combine the tokenized text with the sentiment labels from the original base.

In [12]:
## Tokenizing w/ BERT
tokenizer = ppb.BertTokenizer.from_pretrained('bert-base-uncased')
sentiments_tokenized = sentiments_df["text"].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))

The code above produces token sequences of different lengths, since each sentence has a different number of tokens. We now need to apply a process called padding so that all sequences end up with the same length.

In [16]:
## Generating a new dataframe with the tokenized data appended with their associated labels
sentiments_tokenized_df = pd.DataFrame({'text': sentiments_tokenized, 'label': sentiments_df['label']})

## Getting the length of the longest sentence in the dataset. All other rows will be padded to this length.
biggest_sentence_length = max(len(sentence) for sentence in sentiments_tokenized)

## Now padding each sentence with zeros (the ID of BERT's [PAD] token) up to the longest length
def pad_sentence(sentence):
    padding_size = biggest_sentence_length - len(sentence)
    ## A zero-width pad still returns an array, so every row ends up as an ndarray of the same length
    return np.pad(sentence, (0, padding_size), 'constant', constant_values=0)
sentiments_tokenized_df['text'] = sentiments_tokenized_df['text'].apply(pad_sentence)
sentiments_tokenized_df.head()

Unnamed: 0,text,label
0,"[101, 1045, 2074, 2514, 2428, 13346, 1998, 308...",4
1,"[101, 4921, 2063, 5632, 2108, 2583, 2000, 2288...",0
2,"[101, 1045, 2435, 2039, 2026, 22676, 2007, 199...",4
3,"[101, 1045, 2123, 2102, 2113, 1045, 2514, 2061...",0
4,"[101, 1045, 2572, 1037, 11793, 3836, 1998, 104...",4
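
Because the padded positions carry no information, a model such as BERT is usually given an attention mask alongside the padded IDs. The following is a minimal sketch of how such a mask could be derived, assuming the pad ID is 0 as in the cell above. The toy sequences and the names `toy_sequences`, `padded`, and `attention_mask` are illustrative and not part of the original notebook.

```python
import numpy as np

# Toy token-ID sequences of different lengths (illustrative values only)
toy_sequences = [[101, 1045, 2514, 102], [101, 2061, 102]]

# Pad every sequence with zeros up to the longest length
max_len = max(len(seq) for seq in toy_sequences)
padded = np.array([seq + [0] * (max_len - len(seq)) for seq in toy_sequences])

# 1 marks a real token, 0 marks a padded position
attention_mask = (padded != 0).astype(int)

print(padded)
print(attention_mask)
```

This works only because no real token in the toy data has ID 0; the same assumption applies to the padded dataframe above.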


Additionally, I may want to generate a CSV file containing the structured data. This can be achieved by running:

In [17]:
sentiments_tokenized_df.to_csv('datasets/twitter_sentiment_base_tokenized.csv')
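
One caveat: a column holding arrays is written to CSV as text, so reading the file back yields strings rather than sequences of integers. The sketch below shows one possible way to make the round trip explicit by serializing each row as a JSON list. The toy dataframe, the in-memory buffer, and the names `toy_df`, `to_save`, and `restored` are assumptions made for illustration; the original notebook does not do this.

```python
import io
import json

import numpy as np
import pandas as pd

# Toy stand-in for the padded dataframe (illustrative values only)
toy_df = pd.DataFrame({
    'text': [np.array([101, 1045, 102, 0]), np.array([101, 2061, 2439, 102])],
    'label': [4, 0],
})

# Serialize each array as a JSON list so it can be parsed back unambiguously
to_save = toy_df.assign(
    text=toy_df['text'].apply(lambda ids: json.dumps([int(i) for i in ids]))
)

buffer = io.StringIO()  # in-memory stand-in for a file on disk
to_save.to_csv(buffer, index=False)
buffer.seek(0)

# Read the CSV back and turn the JSON strings into lists of ints again
restored = pd.read_csv(buffer)
restored['text'] = restored['text'].apply(json.loads)

print(restored)
```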

One of the requirements is to generate a boxplot of the numeric attributes of the dataset (in our case, the only one is the label):

In [16]:
sentiments_df.boxplot(column='label')

<Axes: >

### Reduction of Instances

This work also requires reducing the number of instances in my dataset.

The goal here is to collect 14972 random instances from each of the 6 classes the sentiments dataset has, and create a new base with this balanced collection.
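
Sampling without replacement raises an error if a class has fewer rows than the requested sample size, so it can help to check the class counts first. This is a minimal sketch on toy labels; `toy_labels`, `class_counts`, and `max_balanced_size` are hypothetical names, and the values do not come from the sentiments dataset.

```python
import pandas as pd

# Toy labels standing in for sentiments_df['label'] (illustrative values only)
toy_labels = pd.Series([0, 0, 0, 1, 1, 2, 2, 2, 2], name='label')

# Number of instances per class, ordered by class label
class_counts = toy_labels.value_counts().sort_index()

# The largest balanced sample size is bounded by the rarest class
max_balanced_size = int(class_counts.min())

print(class_counts)
print(max_balanced_size)
```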

In [18]:
## Variables for the sampling conditions
sample_size, random_state = 14972, 42
sampled_data = []

for i in range(6): ## Since we have 6 sentiments, labeled from 0 to 5
    sampled_data.append(sentiments_tokenized_df[sentiments_tokenized_df['label'] == i].sample(n=sample_size, random_state=random_state))

sampled_sentiments_df = pd.concat(sampled_data)
sampled_sentiments_df

Unnamed: 0,text,label
133243,"[101, 4921, 2063, 4342, 2000, 15161, 2870, 200...",0
88501,"[101, 1045, 2525, 2514, 10231, 7685, 2138, 199...",0
131379,"[101, 1045, 2514, 2066, 1045, 2031, 2439, 9587...",0
148369,"[101, 1045, 2071, 4339, 1037, 2878, 2843, 2062...",0
134438,"[101, 1045, 2467, 4025, 2000, 2514, 14710, 102...",0
...,...,...
142937,"[101, 1045, 4384, 2026, 6181, 2318, 3110, 1037...",5
373641,"[101, 1045, 2514, 2023, 2428, 7622, 2068, 1998...",5
148816,"[101, 10047, 2183, 2000, 3745, 2055, 2026, 430...",5
23199,"[101, 1045, 4342, 2008, 2065, 1045, 5454, 1062...",5
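
The per-class loop above could also be written with `groupby(...).sample(...)`, which is available in pandas 1.1 and later. This is only a sketch on a toy dataframe: the names `toy_df`, `balanced_df`, and `counts` are illustrative, and the rows drawn this way are not guaranteed to match the ones selected by the loop.

```python
import pandas as pd

# Toy dataframe with three unbalanced classes (illustrative values only)
toy_df = pd.DataFrame({
    'text': [f'sentence {i}' for i in range(9)],
    'label': [0, 0, 0, 1, 1, 2, 2, 2, 2],
})

# Draw the same number of rows from every class in a single call
balanced_df = toy_df.groupby('label').sample(n=2, random_state=42)

# Every class should now appear exactly n times
counts = balanced_df['label'].value_counts().sort_index()

print(counts)
```

Checking `value_counts()` on the sampled result is a quick way to confirm that the new base is balanced.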


Now I'm just going to generate a CSV file with this new dataset.

In [19]:
sampled_sentiments_df.to_csv('datasets/twitter_sentiments_base_sampled.csv')