### Tokenizing

We will tokenize the text to create two input tensors: the `input IDs` and the `attention mask`.

We will store these tensors in two NumPy arrays, each of shape `len(df) * 512`, where `len(df)` is the number of samples in our dataset and 512 is the sequence length of our tokenized sequences for BERT.

So let's start by importing our dependencies ⬇

In [4]:
import pandas as pd
import numpy as np
from transformers import AutoTokenizer

In [9]:
df = pd.read_csv('/content/Digikala-comments/Data/Cleaned-data.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,id,Phrase,Sentiment
0,0,0,کیفیت حجم صدای محصول توی بازار اصلا نمیشه تهرا...,100
1,1,1,ماه مصرف دوبار مدل گرفتم اولاش خوبه ماه باد می...,5
2,2,2,کارآیی کارای سبک دیدن فیلم مطالعه مناسبه,60
3,3,3,بررسی کیفی خریدش شگفت انگیز حتما پیشنهاد,0
4,4,4,بسته ضعیف ظاهر بامزه ای داره عکسش شبیه جنس چوب...,60


In [10]:
seq_len = 512
num_samples = len(df)

num_samples, seq_len

(179348, 512)
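
As a rough sanity check before tokenizing, the size of each array can be estimated from these two numbers. The sketch below assumes the values are stored as 64-bit integers (an assumption; the actual dtype may differ on your platform), and `bytes_per_value` is a name introduced here for illustration.

```python
import numpy as np

num_samples, seq_len = 179348, 512  # values printed above

# Assumption: token IDs and mask values are stored as int64
bytes_per_value = np.dtype(np.int64).itemsize  # 8 bytes

n_values = num_samples * seq_len
approx_gb = n_values * bytes_per_value / 1e9

print(n_values)             # 91826176
print(round(approx_gb, 2))  # 0.73
```

Under that assumption, each of the arrays needs roughly three quarters of a gigabyte of memory.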

In [11]:
df['Phrase'] = df['Phrase'].astype('str')

Now we can begin tokenizing. Since we're dealing with Persian text, we can use the `HooshvareLab` Persian BERT tokenizer:

In [12]:
# initialize tokenizer
model_name = "HooshvareLab/bert-fa-zwnj-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# tokenize, padding/truncating every sequence to seq_len and returning NumPy arrays
tokens = tokenizer(df['Phrase'].tolist(), max_length=seq_len, truncation=True,
                   padding='max_length', add_special_tokens=True,
                   return_tensors='np')

This returns three NumPy arrays: `input_ids`, `token_type_ids`, and `attention_mask`.

In [13]:
tokens.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [14]:
tokens['input_ids'][:10]

array([[    2,  3114,  4035, ...,     0,     0,     0],
       [    2,  2208,  2763, ...,     0,     0,     0],
       [    2,     1, 22195, ...,     0,     0,     0],
       ...,
       [    2,  2915,  7060, ...,     0,     0,     0],
       [    2,  3114,  2297, ...,     0,     0,     0],
       [    2,  2395, 12228, ...,     0,     0,     0]])

In [15]:
tokens['attention_mask'][:10]

array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]])
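
The attention mask marks real tokens with 1 and padding with 0. As a minimal sketch of that relationship, and assuming the padding ID is 0 (as the trailing zeros in the `input_ids` output suggest), the mask can be rebuilt from the IDs alone. The toy IDs below are made up for illustration.

```python
import numpy as np

pad_id = 0  # assumption: padding token ID, inferred from the output above

# toy batch: two already-padded sequences of length 6 (made-up IDs)
toy_ids = np.array([[2, 3114, 4035, 4, 0, 0],
                    [2, 2208, 4, 0, 0, 0]])

# 1 where there is a real token, 0 where there is padding
toy_mask = (toy_ids != pad_id).astype(int)
print(toy_mask)
```

On the real data, `tokenizer.pad_token_id` reports the padding ID, so there is no need to guess it.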

Now we can save them to disk as NumPy binary (`.npy`) files:

In [16]:
# with open('movie-xids.npy', 'wb') as f:
#     np.save(f, tokens['input_ids'])
# with open('movie-xmask.npy', 'wb') as f:
#     np.save(f, tokens['attention_mask'])
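
For reference, a minimal round trip with `np.save` and `np.load` looks like the sketch below. It writes a small stand-in array to a temporary directory rather than the real token arrays, and the file name `example-xids.npy` is hypothetical.

```python
import os
import tempfile
import numpy as np

# stand-in for tokens['input_ids'] (made-up values)
example_ids = np.array([[2, 3114, 0], [2, 2208, 0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example-xids.npy')  # hypothetical file name
    np.save(path, example_ids)   # np.save also accepts an open binary file object
    restored = np.load(path)

print(np.array_equal(example_ids, restored))  # True
```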

Now it is time to deal with our target values, which are integers from `0 to 100`. That range is too wide for classification, so let's scale them down to `0 to 5`, which gives six classes:

In [17]:
df['Sentiment'] = df['Sentiment'].div(20).round(0)
df['Sentiment']

0         5.0
1         0.0
2         3.0
3         0.0
4         3.0
         ... 
179343    5.0
179344    3.0
179345    3.0
179346    0.0
179347    3.0
Name: Sentiment, Length: 179348, dtype: float64
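
One detail worth checking with this scaling: NumPy rounds exact halves to the nearest even number, and pandas' `round` appears to follow the same rule, so a score of 50 (2.5 after dividing) lands in class 2 rather than 3. The sketch below uses made-up scores with plain NumPy to show the behaviour.

```python
import numpy as np

scores = np.array([0, 5, 10, 30, 50, 60, 90, 100])  # made-up scores in 0..100

# divide by 20, then round; exact halves (0.5, 1.5, 2.5, 4.5) go to the even neighbour
scaled = np.round(scores / 20).astype(int)

print(scaled)  # [0 0 0 2 2 3 4 5]
```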

We need to extract these values and one-hot encode them into another NumPy array, which will have the dimensions `len(df) * number of label classes`. We will initialize a NumPy zero array beforehand, but we won't populate it row by row; we will use fancy indexing instead.

In [18]:
# first extract sentiment column
arr = df['Sentiment'].values
arr = arr.astype('int')

In [19]:
# we then initialize the zero array
labels = np.zeros((num_samples, arr.max()+1))
labels.shape

(179348, 6)

We are able to use `arr.max()+1` to define our second dimension here because our Sentiment column holds the values `[0, 1, 2, 3, 4, 5]`. That is six unique labels, so our labels array needs six columns (one for each). Since `arr.max() = 5`, we do `5 + 1` to get our required value of 6.

Now we use the current values in `arr`, each one of `[0, 1, 2, 3, 4, 5]`, as column indices to place 1 values in the correct positions of our presently zeros-only array:

In [20]:
labels[np.arange(num_samples), arr] = 1

labels

array([[0., 0., 0., 0., 0., 1.],
       [1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0.],
       ...,
       [0., 0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0.]])
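
To make the indexing step concrete, here is the same technique on a toy label vector (the values are made up). `np.arange(n)` supplies the row index for every sample and the label itself supplies the column, so each row receives exactly one 1.

```python
import numpy as np

toy_arr = np.array([5, 0, 3, 1])  # made-up class labels
n = len(toy_arr)

toy_labels = np.zeros((n, toy_arr.max() + 1))
# row i, column toy_arr[i] is set to 1 for every i at once
toy_labels[np.arange(n), toy_arr] = 1

print(toy_labels)
```

An equivalent shortcut for this toy case is `np.eye(6)[toy_arr]`, which picks one row of the identity matrix per label.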

Let's save them and we're good to go.

In [21]:
# with open('movie-labels.npy', 'wb') as f:
#     np.save(f, labels)