# DIM0782 - Machine Learning (DIMAp/UFRN/2024.1)

## Preprocessing the data

### Transforming non-structured data to structured data

This is textual data, so the first step is to turn it into a structured, numeric representation using a transformer model. I'm going to work mostly with BERT here.

First, the imports:

In [1]:
import numpy as np
import pandas as pd
import torch
import transformers as ppb

Now reading the sentiments dataset:

In [2]:
sentiments_df = pd.read_csv("datasets/twitter_sentiment_base_original.csv", usecols=["text", "label"])
sentiments_df.head()

|   | text | label |
|---|------|-------|
| 0 | i just feel really helpless and heavy hearted | 4 |
| 1 | ive enjoyed being able to slouch about relax a... | 0 |
| 2 | i gave up my internship with the dmrg and am f... | 4 |
| 3 | i dont know i feel so lost | 0 |
| 4 | i am a kindergarten teacher and i am thoroughl... | 4 |


Now I'm going to use a BERT tokenizer to encode each text as a list of token IDs. Then I'll pair the encoded texts with the sentiment labels from the original dataset and save the result as a new CSV.

In [4]:
# Tokenizing w/ BERT: encode() returns a list of token IDs for each text;
# add_special_tokens=True wraps each sequence in [CLS] ... [SEP].
tokenizer = ppb.BertTokenizer.from_pretrained("bert-base-uncased")
sentiments_tokenized = sentiments_df["text"].apply(
    lambda text: tokenizer.encode(text, add_special_tokens=True)
)

# Generating a new CSV with the tokenized data and the sentiment labels.
# Note: to_csv writes each list of token IDs as its string representation.
sentiments_tokenized_dataset = pd.DataFrame({"text": sentiments_tokenized, "label": sentiments_df["label"]})
sentiments_tokenized_dataset.to_csv("datasets/twitter_sentiment_base_tokenized.csv")
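
One thing to keep in mind when reusing the tokenized CSV: the lists of token IDs come back as plain strings, and the sequences have different lengths. The sketch below shows one possible way to handle both, assuming the file layout produced above (an index column, then `text` and `label`). It does not read the real file. The in-memory CSV and its token IDs are made up for illustration, and the names `csv_text`, `tokenized_df`, `padded` and `attention_mask` are hypothetical. Padding with 0 assumes the `[PAD]` ID of `bert-base-uncased`.

```python
import ast
import io

import numpy as np
import pandas as pd

# Stand-in for the tokenized CSV: illustrative token IDs (not real tokenizer output),
# stored the way to_csv serializes Python lists, i.e. as quoted strings.
csv_text = io.StringIO(
    ',text,label\n'
    '0,"[101, 1045, 2514, 102]",4\n'
    '1,"[101, 1045, 2123, 2102, 2113, 102]",0\n'
)

tokenized_df = pd.read_csv(csv_text, index_col=0)

# Each cell is a string such as "[101, 1045, ...]"; parse it back into a list of ints.
tokenized_df["text"] = tokenized_df["text"].apply(ast.literal_eval)

# Pad every sequence with 0 (assumed [PAD] ID) up to the length of the longest one,
# so the token IDs form a rectangular matrix.
max_len = max(len(ids) for ids in tokenized_df["text"])
padded = np.array([ids + [0] * (max_len - len(ids)) for ids in tokenized_df["text"]])

# The attention mask marks real tokens with 1 and padding with 0.
attention_mask = (padded != 0).astype(int)

print(padded.shape)  # (2, 6)
```

With the IDs in a rectangular array, `padded` and `attention_mask` could later be converted to tensors with `torch.tensor(...)` before being passed to a BERT model.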