Data: https://www.kaggle.com/mustfkeskin/turkish-movie-sentiment-analysis-dataset/code

In [1]:
# Check the GPU colab gave to us.
!nvidia-smi -L

GPU 0: Tesla K80 (UUID: GPU-586ac994-7081-d8ff-b0a8-487379a672f1)


# Models we're going to build:

* Fine-tuned bert models ("dbmdz/bert-base-turkish-128k-uncased" and "dbmdz/bert-base-turkish-uncased") without taking out the stopwords and without stemming
* Fine-tuned bert models ("dbmdz/bert-base-turkish-128k-uncased" and "dbmdz/bert-base-turkish-uncased") with stemming
* Fine-tuned bert models ("dbmdz/bert-base-turkish-128k-uncased" and "dbmdz/bert-base-turkish-uncased") with taking out the stopwords from our dataset
* Fine-tuned bert models ("dbmdz/bert-base-turkish-128k-uncased" and "dbmdz/bert-base-turkish-uncased") with taking out the stopwords and stemming

## Model 1: Fine-tuned bert model with stopwords and without stemming

### Preprocess data

In [3]:
# Get data
import pandas as pd

df = pd.read_csv("magaza_yorumlari_duygu_analizi.csv", encoding="utf-16")
df.head()

Unnamed: 0,Görüş,Durum
0,"ses kalitesi ve ergonomisi rezalet, sony olduğ...",Olumsuz
1,hizli teslimat tesekkürler,Tarafsız
2,ses olayı süper....gece çalıştır sıkıntı yok.....,Olumlu
3,geldi bigün kullandık hemen bozoldu hiçtavsiye...,Olumsuz
4,Kulaklığın sesi kaliteli falan değil. Aleti öv...,Olumsuz


In [4]:
# Check the DataFrame to see the number of lines and non-null objects
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11429 entries, 0 to 11428
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Görüş   11426 non-null  object
 1   Durum   11429 non-null  object
dtypes: object(2)
memory usage: 178.7+ KB


In [5]:
# Check value counts to see whether the data is balanced or not
df.Durum.value_counts()

Olumlu      4253
Olumsuz     4238
Tarafsız    2938
Name: Durum, dtype: int64

In [6]:
# Since some nulls might be seen as a float, drop na to not face any problems.
df.dropna(inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11426 entries, 0 to 11428
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Görüş   11426 non-null  object
 1   Durum   11426 non-null  object
dtypes: object(2)
memory usage: 267.8+ KB


In [7]:
# Check the first line in df.Görüş
df.Görüş[0]

"ses kalitesi ve ergonomisi rezalet, sony olduğu için aldım ama 4'de 1 fiyatına çin replika ürün alsaydım çok çok daha iyiydi, kesinlikle tavsiye etmiyorum."

In [8]:
# Remove punctuation for our model to learn better
df['Görüş'] = df['Görüş'].str.replace(r'[^\w\s]+', '')
df.Görüş[0]

'ses kalitesi ve ergonomisi rezalet sony olduğu için aldım ama 4de 1 fiyatına çin replika ürün alsaydım çok çok daha iyiydi kesinlikle tavsiye etmiyorum'

In [9]:
# Lower the inputs for our model to learn better
df["Görüş"] = df["Görüş"].str.lower()
df.head()

Unnamed: 0,Görüş,Durum
0,ses kalitesi ve ergonomisi rezalet sony olduğu...,Olumsuz
1,hizli teslimat tesekkürler,Tarafsız
2,ses olayı süpergece çalıştır sıkıntı yokkablo ...,Olumlu
3,geldi bigün kullandık hemen bozoldu hiçtavsiye...,Olumsuz
4,kulaklığın sesi kaliteli falan değil aleti öve...,Olumsuz


----- BU KISMI BU MODELDEN ÇIKAR -------

In [10]:
# Import the nltk library and download stopwords
#import nltk

#nltk.download("stopwords")

In [11]:
# Get the stopwords
#from nltk.corpus import stopwords
#
#stop_words = stopwords.words("turkish")
#stop_words[:10]

In [12]:
# Remove stopwords from each line and check the lines
#stop_words = set(stop_words)
#df['Görüş'] = df['Görüş'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))
#
#df.Görüş[0]

In [13]:
# Import the library and get the stemmer for Turkish Language
#from TurkishStemmer import TurkishStemmer
#
#stemmer = TurkishStemmer()

In [14]:
# To use stemmer on each word, turn each line into a list
#df['Görüş'] = df['Görüş'].str.split()
#df.head()

In [15]:
# Apply stemmer
#df['Görüş'] = df['Görüş'].apply(lambda x: [stemmer.stem(y) for y in x])
#
#df.head()

In [16]:
#df.Görüş[0]

In [17]:
# Turn each line back to a string (from list)
#df['Görüş'] = df['Görüş'].apply(lambda x: ' '.join(word for word in x))
#
#df.head()

--------------------------------

In [18]:
# Get train sentences from df.Görüş
train_sentences = df["Görüş"].tolist()

train_sentences[0]

'ses kalitesi ve ergonomisi rezalet sony olduğu için aldım ama 4de 1 fiyatına çin replika ürün alsaydım çok çok daha iyiydi kesinlikle tavsiye etmiyorum'

In [19]:
# Shuffle the data so our model can learn in a proper way
df = df.sample(frac=1)
df.head()

Unnamed: 0,Görüş,Durum
8047,ürünün ekran ayarlamalırı muadillerine göre gü...,Olumsuz
7905,kaçırmayın derim çocuklara kış aylarında meyve...,Olumlu
6419,4gb ram tek parça mı belirtilmemiş 16gb a kada...,Tarafsız
10129,ürün çalışmıyor yardımcı olabilirmisiniz,Olumsuz
4580,şu anda konserve yapıyorum başlığı tam olarak ...,Olumsuz


In [20]:
# Check the DataFrame for one last time
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11426 entries, 8047 to 3582
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Görüş   11426 non-null  object
 1   Durum   11426 non-null  object
dtypes: object(2)
memory usage: 267.8+ KB


Looks like the dataset is ready to use in model_1

### Input pipeline

In [21]:
# Get the average and the max length of the inputs
import numpy as np

sent_lens = [len(sentence.split()) for sentence in train_sentences]
avg_sent_len = np.mean(sent_lens)
max_sent_len = np.max(sent_lens)
avg_sent_len, max_sent_len

(21.72378785226676, 422)

In [22]:
# How long of a sentence lenght covers 95% of examples?
output_seq_len_95 = int(np.percentile(sent_lens, 95))

output_seq_len_95

64

In [23]:
# How long of a sentence lenght covers 97% of examples?
output_seq_len_97 = int(np.percentile(sent_lens, 97))

output_seq_len_97

78

In [24]:
# How long of a sentence lenght covers 99% of examples?
output_seq_len_99 = int(np.percentile(sent_lens, 99))

output_seq_len_99

120

In [25]:
# Since it is not a large number, 120 is chosen.
output_seq_len = 120

In [26]:
# Get transformers
!pip install transformers

Collecting transformers
  Downloading transformers-4.12.5-py3-none-any.whl (3.1 MB)
[K     |████████████████████████████████| 3.1 MB 12.8 MB/s 
[?25hCollecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
[K     |████████████████████████████████| 3.3 MB 50.3 MB/s 
[?25hCollecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
[K     |████████████████████████████████| 895 kB 50.9 MB/s 
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
[K     |████████████████████████████████| 596 kB 50.5 MB/s 
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)
[K     |████████████████████████████████| 61 kB 510 kB/s 
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers


In [27]:
# Import AutoTokenizer
from transformers import AutoTokenizer
# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-128k-uncased")

Downloading:   0%|          | 0.00/59.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/386 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.18M [00:00<?, ?B/s]

In [28]:
# Encode inputs
import tensorflow as tf

input_ids = []
attention_mask = []

for txt in df.Görüş.values:
    encoded = tokenizer.encode_plus(
        text=txt, # the sentence to be encoded 
        add_special_tokens=True, # Add [CLS] and [SEP]
        max_length=120, # max length of a sentence
        truncation=True, # truncate if sentence length is bigger than max_length
        pad_to_max_length=True, # Add [PAD]s
        return_attention_mask=True, # Generate attention mask
        return_tensors="tf" # return TensorFlow tensors
    )

    # Append input_ids and attention_masks to their own lists
    input_ids.append(encoded["input_ids"])
    attention_mask.append(encoded["attention_mask"])

# Concatenate
input_ids = tf.concat(input_ids, 0)
attention_mask = tf.concat(attention_mask, 0)

print("Original: ", df.Görüş.values[0])
print("Token IDs: ", input_ids[0])



Original:  ürünün ekran ayarlamalırı muadillerine göre güzel ekran görüntüsü güzel fakat bir eksiği varses sistemi çok kötümüzik veya tv seyir ettiğinizde ses pek çıkmıyor ve hiç bana zevk vermedibu monotöre güçlü iki hatta üç hopörlür konmalı idi
Token IDs:  tf.Tensor(
[     2 124719   1009   4587  25098   5682   1022  37362   2616  22586
  14368   4587  92775   3440  14368   3244   1947  13466   2242 112715
   1951   3894   6110  36007   3711   1954   2358   4567   6441 102868
   1942   3072   3082  18101   3685   1946   9416   2789   5535  14952
   3199  44509  98967  94211   2537   3749   4508  10076  85623   1018
   2057   2788   6422      3      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0     

In [29]:
# Convert tokens to ids to check if the encoding operation is done correctly
tokenizer.convert_ids_to_tokens(input_ids[0])

['[CLS]',
 'urunu',
 '##n',
 'ekran',
 'ayarlama',
 '##lır',
 '##ı',
 'muadil',
 '##lerine',
 'gore',
 'guzel',
 'ekran',
 'goruntu',
 '##su',
 'guzel',
 'fakat',
 'bir',
 'eksi',
 '##gi',
 'vars',
 '##es',
 'sistemi',
 'cok',
 'kotu',
 '##muz',
 '##ik',
 'veya',
 'tv',
 'seyir',
 'ettiginiz',
 '##de',
 'ses',
 'pek',
 'cık',
 '##mıyor',
 've',
 'hic',
 'bana',
 'zevk',
 'vermedi',
 '##bu',
 'mono',
 '##tore',
 'guclu',
 'iki',
 'hatta',
 'uc',
 'hop',
 '##orlu',
 '##r',
 'kon',
 '##malı',
 'idi',
 '[SEP]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]

In [30]:
# Check input_ids and shape of input_ids
input_ids, input_ids.shape

(<tf.Tensor: shape=(11426, 120), dtype=int32, numpy=
 array([[     2, 124719,   1009, ...,      0,      0,      0],
        [     2,  21783,  91319, ...,      0,      0,      0],
        [     2,  68990,  10132, ...,      0,      0,      0],
        ...,
        [     2,  36664,   2122, ...,      0,      0,      0],
        [     2,  39110,   8320, ...,      0,      0,      0],
        [     2,  26965,     25, ...,      0,      0,      0]], dtype=int32)>,
 TensorShape([11426, 120]))

In [31]:
# Check attention_mask and shape of attention_mask
attention_mask, attention_mask.shape

(<tf.Tensor: shape=(11426, 120), dtype=int32, numpy=
 array([[1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0],
        ...,
        [1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0]], dtype=int32)>, TensorShape([11426, 120]))

In [32]:
# One hot encode our labels to use in our models
from sklearn.preprocessing import OneHotEncoder
one_hot_encoder = OneHotEncoder(sparse=False)
labels_one_hot = one_hot_encoder.fit_transform(df["Durum"].to_numpy().reshape(-1,1))
labels_one_hot

array([[0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       ...,
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.]])

In [33]:
# Create a TensorFlow Dataset
dataset = tf.data.Dataset.from_tensor_slices((input_ids, attention_mask, labels_one_hot))
dataset.take(1)

<TakeDataset shapes: ((120,), (120,), (3,)), types: (tf.int32, tf.int32, tf.float64)>

In [34]:
# Create a function to map our dataset
def map_func(input_ids, masks, labels):
    # We convert our three-item tuple into a two-item tuple where the input item is a dictionary
    return {"input_ids": input_ids,
            "attention_mask": masks}, labels

In [35]:
# Map the dataset using the function we created and check the dataset
dataset = dataset.map(map_func)
dataset.take(1)

<TakeDataset shapes: ({input_ids: (120,), attention_mask: (120,)}, (3,)), types: ({input_ids: tf.int32, attention_mask: tf.int32}, tf.float64)>

In [36]:
# Get the length of our dataset
len_dataset = len(dataset)

In [37]:
# Batch our dataset and drop remainders
batch_size=32
dataset = dataset.shuffle(10000).batch(batch_size, drop_remainder=True)

dataset.take(1)

<TakeDataset shapes: ({input_ids: (32, 120), attention_mask: (32, 120)}, (32, 3)), types: ({input_ids: tf.int32, attention_mask: tf.int32}, tf.float64)>

In [38]:
# Split our dataset into train, validation and test datasets
split = 0.8
size = int((input_ids.shape[0] / batch_size) * split)

train_ds = dataset.take(size) # 80% of the dataset
val_test_ds = dataset.skip(size) # 20% of the dataset

split_val_test = 0.5
size_val_test = int(((input_ids.shape[0] / batch_size) - len(train_ds)) * split_val_test)

val_ds = val_test_ds.take(size_val_test) # 10% of dataset
test_ds = val_test_ds.skip(size_val_test) # 10% of dataset

len(dataset), len(train_ds), len(val_ds), len(test_ds)

(357, 285, 36, 36)

In [39]:
# Import the model
from transformers import TFAutoModel

bert128k = TFAutoModel.from_pretrained("dbmdz/bert-base-turkish-128k-uncased")
#bert = TFAutoModel.from_pretrained("dbmdz/bert-base-turkish-uncased")

Downloading:   0%|          | 0.00/1.06G [00:00<?, ?B/s]

Some layers from the model checkpoint at dbmdz/bert-base-turkish-128k-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at dbmdz/bert-base-turkish-128k-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [40]:
# Create the model
import tensorflow as tf

# two input layers, we ensure layer name variables match to dictionary keys in TF dataset
input_ids = tf.keras.layers.Input(shape=(120,), name='input_ids', dtype='int32')
mask = tf.keras.layers.Input(shape=(120,), name='attention_mask', dtype='int32')

# we access the transformer model within our bert object using the bert attribute (eg bert.bert instead of bert)
embeddings = bert128k.bert(input_ids, attention_mask=mask)[1]  # access final activations (already max-pooled) [1]
# convert bert embeddings into 3 output classes
x = tf.keras.layers.Dense(1024, activation='relu')(embeddings)
outputs = tf.keras.layers.Dense(3, activation='softmax', name='outputs')(x)

# model
model_1_128k_uncased = tf.keras.Model(inputs=[input_ids, mask], outputs=outputs)

In [41]:
#Get the summary of model_1_128k_uncased
model_1_128k_uncased.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, 120)]        0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, 120)]        0           []                               
                                                                                                  
 bert (TFBertMainLayer)         TFBaseModelOutputWi  184345344   ['input_ids[0][0]',              
                                thPoolingAndCrossAt               'attention_mask[0][0]']         
                                tentions(last_hidde                                               
                                n_state=(None, 120,                                           

In [42]:
# Get learning rate using PolynomialDecay
from tensorflow.keras.optimizers.schedules import PolynomialDecay

x = len_dataset*0.85 # length of train_ds (unbatched)

num_epochs = 3
num_train_steps = x * num_epochs
lr_scheduler = PolynomialDecay(
    initial_learning_rate=5e-5,
    end_learning_rate=0.,
    decay_steps=num_train_steps
)

In [43]:
# Compile the model
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_scheduler)
loss = tf.keras.losses.CategoricalCrossentropy()
acc = tf.keras.metrics.CategoricalAccuracy('accuracy')

model_1_128k_uncased.compile(optimizer=optimizer, 
              loss=loss, 
              metrics=[acc])

In [44]:
# Check the datasets before fitting our model
train_ds, val_ds, test_ds

(<TakeDataset shapes: ({input_ids: (32, 120), attention_mask: (32, 120)}, (32, 3)), types: ({input_ids: tf.int32, attention_mask: tf.int32}, tf.float64)>,
 <TakeDataset shapes: ({input_ids: (32, 120), attention_mask: (32, 120)}, (32, 3)), types: ({input_ids: tf.int32, attention_mask: tf.int32}, tf.float64)>,
 <SkipDataset shapes: ({input_ids: (32, 120), attention_mask: (32, 120)}, (32, 3)), types: ({input_ids: tf.int32, attention_mask: tf.int32}, tf.float64)>)

In [45]:
# Fit the model
history = model_1_128k_uncased.fit(
    train_ds,
    validation_data=val_ds,
    epochs=4,
    verbose=1
)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


In [46]:
# Evaluate the model on test_ds
model_1_128k_uncased.evaluate(test_ds)



[0.27499502897262573, 0.9071180820465088]

In [51]:
# Download helper functions
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py

--2021-12-04 02:04:17--  https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10246 (10K) [text/plain]
Saving to: ‘helper_functions.py’


2021-12-04 02:04:17 (74.0 MB/s) - ‘helper_functions.py’ saved [10246/10246]



In [52]:
from helper_functions import calculate_results

In [49]:
# Make predictions
model_1_128k_uncased_pred_probs = model_1_128k_uncased.predict(test_ds)
model_1_128k_uncased_pred_probs[0], model_1_128k_uncased_pred_probs.shape

(array([0.89375335, 0.00717377, 0.09907291], dtype=float32), (1152, 3))

In [50]:
# Convert pred_probs to classes
model_1_128k_uncased_preds = tf.argmax(model_1_128k_uncased_pred_probs, axis=1)
model_1_128k_uncased_preds

<tf.Tensor: shape=(1152,), dtype=int64, numpy=array([0, 2, 1, ..., 1, 1, 2])>

Go back to input-pipeline and label encode labels