# Prepare Dataset for Sequence Classification

A Hugging Face `Dataset` used for [Sequence Classification](https://huggingface.co/docs/transformers/tasks/sequence_classification) tasks must have the following two columns:

- `text`: _(string)_ text corpus that has to be classified
- `label`: _(integer)_ class label. _(Must be an Integer from $\{0,1,2, ..., p\}$)_
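
To make the two-column requirement concrete, here is a minimal sketch of a sanity check over rows represented as plain dictionaries. The helper name `check_rows` and the `num_classes` parameter are hypothetical, and the sketch assumes the classes are numbered from 0 up to `num_classes - 1`.

```python
def check_rows(rows, num_classes):
    """Return a list of problems found in `rows` (hypothetical helper)."""
    problems = []
    for i, row in enumerate(rows):
        # `text` must be a string
        if not isinstance(row.get("text"), str):
            problems.append(f"row {i}: 'text' must be a string")
        label = row.get("label")
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(label, bool) or not isinstance(label, int):
            problems.append(f"row {i}: 'label' must be an integer")
        elif not 0 <= label < num_classes:
            problems.append(f"row {i}: label {label} is outside 0..{num_classes - 1}")
    return problems


rows = [
    {"text": "what a great day", "label": 2},
    {"text": "nothing special", "label": 1},
    {"text": "this is awful", "label": "NEGATIVE"},  # wrong type on purpose
]
problems = check_rows(rows, num_classes=3)
print(problems)
```

Only the third row is reported, because its label is a string rather than an integer class id.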

## Objective

In this notebook, I demonstrate how to convert a CSV dataset that contains missing values into a Hugging Face dataset and then split it into training and test sets. You can download the CSV dataset from [Kaggle](https://www.kaggle.com/datasets/saurabhshahane/twitter-sentiment-dataset).

## Preprocessing the Dataset using `pandas`

In [1]:
import pandas

In [2]:
df = pandas.read_csv("./Twitter_Data.csv")
df.head()

| | clean_text | category |
|---|---|---|
| 0 | when modi promised “minimum government maximum... | -1.0 |
| 1 | talk all the nonsense and continue all the dra... | 0.0 |
| 2 | what did just say vote for modi  welcome bjp t... | 1.0 |
| 3 | asking his supporters prefix chowkidar their n... | 1.0 |
| 4 | answer who among these the most powerful world... | 1.0 |


__Dealing with Missing Values__

I am removing the rows with missing values: the dataset is large, so dropping a handful of rows won't noticeably affect training.
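
As a small illustration of this step, the sketch below counts and drops missing values on a made-up four-row frame. The toy rows are invented for the example; only the column names mirror the real CSV.

```python
import pandas as pd

# Toy frame with one missing text and one missing category (invented data)
toy = pd.DataFrame({
    "clean_text": ["good day", None, "bad service", "just ok"],
    "category": [1.0, 0.0, None, 0.0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value, then renumber the index
cleaned = toy.dropna(axis="index").reset_index(drop=True)
print(len(cleaned))
```

Note that `dropna` on its own leaves gaps in the index. Here `reset_index(drop=True)` renumbers the rows; dropping the index later during conversion is an alternative.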

In [3]:
print(df.isna().sum(axis='index'))

clean_text    4
category      7
dtype: int64


In [4]:
df.dropna(axis='index', inplace=True)

__Changing the Labels__

Here we change the numerical labels `[-1.0, 0.0, 1.0]` to `['NEGATIVE','NEUTRAL','POSITIVE']`. This lets us create the `ID2LABEL` and `LABEL2ID` mappings for the training process, and it also makes the inference results easier to interpret after training.
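
The `ID2LABEL` and `LABEL2ID` mappings mentioned above could be built as in the following sketch. The id order (NEGATIVE=0, NEUTRAL=1, POSITIVE=2) is an assumption chosen to match the order of the label names, and `RAW2LABEL` is a hypothetical name for the dictionary that translates the raw CSV values.

```python
# Class names in the order of their integer ids (assumed order)
LABELS = ["NEGATIVE", "NEUTRAL", "POSITIVE"]

# id -> name and name -> id lookups
ID2LABEL = dict(enumerate(LABELS))
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}

# Raw values found in the CSV -> readable names (hypothetical helper dict)
RAW2LABEL = {-1.0: "NEGATIVE", 0.0: "NEUTRAL", 1.0: "POSITIVE"}

print(ID2LABEL)
print(LABEL2ID)
print(LABEL2ID[RAW2LABEL[-1.0]])
```

Keeping both directions in one place means a predicted id can be turned back into a readable name with `ID2LABEL[predicted_id]`.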

In [5]:
df['category'] = df['category'].map({-1.0:"NEGATIVE", 0.0:"NEUTRAL", 1.0:"POSITIVE"})
df.head()

| | clean_text | category |
|---|---|---|
| 0 | when modi promised “minimum government maximum... | NEGATIVE |
| 1 | talk all the nonsense and continue all the dra... | NEUTRAL |
| 2 | what did just say vote for modi  welcome bjp t... | POSITIVE |
| 3 | asking his supporters prefix chowkidar their n... | POSITIVE |
| 4 | answer who among these the most powerful world... | POSITIVE |


__Modify Column Names__

Here we rename the columns to `text` and `label`, the names Hugging Face expects for sequence classification.
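
An alternative way to rename, sketched below on an invented one-row frame, is `DataFrame.rename` with an explicit old-to-new mapping. Unlike assigning a list to `df.columns`, it does not depend on the order of the columns.

```python
import pandas as pd

# Invented one-row frame that uses the original column names
toy = pd.DataFrame({"clean_text": ["hello world"], "category": ["POSITIVE"]})

# Map each old column name to its new name explicitly
renamed = toy.rename(columns={"clean_text": "text", "category": "label"})
print(list(renamed.columns))
```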

In [6]:
df.columns = ['text','label']
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 162969 entries, 0 to 162979
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    162969 non-null  object
 1   label   162969 non-null  object
dtypes: object(2)
memory usage: 3.7+ MB


## Conversion into Hugging Face `Dataset` object

In [7]:
from datasets import Dataset, Features, Value, ClassLabel

__Define Features__

In [8]:
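features = Features({
    "text": Value("string"),
    # Name the classes explicitly so the ids are fixed (NEGATIVE=0, NEUTRAL=1,
    # POSITIVE=2) and do not depend on the order returned by df['label'].unique()
    "label": ClassLabel(names=["NEGATIVE", "NEUTRAL", "POSITIVE"]),
})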

features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['NEGATIVE', 'NEUTRAL', 'POSITIVE'], id=None)}

__Conversion from `DataFrame` to `Dataset`__

In [9]:
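# preserve_index=False drops the pandas index, which has gaps after dropna()
dataset = Dataset.from_pandas(df, features=features, preserve_index=False)
# Hold out 20% of the rows as the test split
dataset = dataset.train_test_split(test_size=0.2, shuffle=True)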

In [10]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 130375
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 32594
    })
})