<a href="https://colab.research.google.com/github/aliflazuardi/Bangkit-Capstone/blob/main/Baseline_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Capstone Project - Indonesian Vaccine Sentiment Analysis**

This Colab notebook trains a machine learning model for the B21-CAP0267 capstone project. The model is used to analyze Indonesian sentiment about the Covid-19 vaccine.

Authors:
1. Khalifah Lazuardi Firmansyah - M0020082
2. Muhammad Nadhif Farizi - M0020085

Import TensorFlow and check its version

In [None]:
import tensorflow as tf
print(tf.__version__)

## **Clone GitHub repo**

In [None]:
!git clone -b update https://github.com/aliflazuardi/Bangkit-Capstone.git

## **Import dataset**

In [None]:
import os
import pandas as pd

# Get current path location
current_path = os.getcwd()

dataset_path = os.path.join(current_path, "Bangkit-Capstone", "Dataset", "modified-dataset.csv")
df = pd.read_csv(dataset_path)

In [None]:
df.sample(10)

## **Basic dataset checks and preprocessing**

In [None]:
df.shape

In [None]:
# column types
df.dtypes

In [None]:
# rename the column "label\t" (note the trailing tab) to "label"
df = df.rename(columns={"label\t": "label"})
df.columns

Drop rows with missing labels

In [None]:
# find rows whose label is missing and drop them
missing_values = df["label"].isna()
df = df.drop(df[missing_values].index)


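Reset the index

In [None]:
# renumber the rows 0..n-1 after dropping; reset_index returns a new dataframe,
# so the result must be assigned back
df = df.reset_index(drop=True)
df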
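As a minimal sketch of the two cleaning steps above, the cell below uses a hypothetical three-row dataframe (the tweets and labels are made up for illustration; only the column names follow the notebook). Dropping a row with a missing label leaves a gap in the index, and `reset_index(drop=True)` renumbers the rows.

```python
import pandas as pd

# hypothetical toy data, not taken from the real dataset
toy = pd.DataFrame({
    "data": ["vaksin aman", "RT vaksin", "belum vaksin"],
    "label": [1.0, None, 2.0],
})

# drop rows whose label is missing
toy = toy.dropna(subset=["label"])
index_before = list(toy.index)  # the dropped row leaves a gap

# renumber the rows and store the labels as integers
toy = toy.reset_index(drop=True)
toy["label"] = toy["label"].astype(int)
index_after = list(toy.index)

print(index_before, index_after)
```

Resetting the index this way avoids hard-coding the number of remaining rows.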

## **Exploratory data analysis**

Get the total count for each class

In [None]:
# get total counts for each label
df.groupby(by=['label']).count()

Count the total number of words in the dataset

In [None]:
total_words_count = 0
for tweet in df["data"]:
  words_list = tweet.split()
  total_words_count = total_words_count + len(words_list)

print("Total {} words in dataset".format(total_words_count))

Store words in a pandas Series

In [None]:
# collect every word from every tweet, then build one pandas Series
# (Series.append was removed in pandas 2.0, so build the Series from a list)
all_words = []
for tweet in df["data"]:
  all_words.extend(tweet.split())

word_series = pd.Series(all_words, dtype=object)

word_series

In [None]:
# total number of unique words
print("There are {} unique words".format(len(word_series.unique())))

The problem is that some words are irrelevant for predicting vaccine sentiment, such as conjunctions ('kata hubung') and 'RT'. Therefore, we look at the frequency of each word and remove high-frequency words that are not relevant to vaccines.

Create a word-only dataframe and analyze it

In [None]:
# create words only dataframe
word_dataframe = pd.DataFrame({'word': word_series, 'index': range(len(word_series))})

# unique value count
pd.set_option('display.max_rows', 60)
word_dataframe.groupby(by=['word']).count()
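The frequency count can also be sketched without a dataframe. The cell below is an illustrative alternative using `collections.Counter` from the standard library; the three tweets are hypothetical and stand in for `df["data"]`.

```python
from collections import Counter

# hypothetical tweets, for illustration only
tweets = [
    "RT vaksin aman dan halal",
    "vaksin dan masker",
    "RT sudah vaksin",
]

# count every whitespace-separated word across all tweets
word_counts = Counter(word for tweet in tweets for word in tweet.split())

# most frequent words first; candidates for the stop-word list
top_words = word_counts.most_common(3)
print(top_words)
```

Sorting by frequency makes it easier to spot words such as 'RT' and 'dan' that appear often but carry no sentiment.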

In [None]:
# remove these words since they are stop words that appear very frequently
words_to_remove = ['dan', 'di', 'yang']

Remove stopwords

In [None]:
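# remove stop words as whole tokens; a plain str.replace would also strip
# "di" from inside longer words such as "divaksin"
def remove_stop_words(tweet):
  return " ".join(word for word in tweet.split() if word not in words_to_remove)

# assign the whole column; df.iloc[i]['data'] = ... writes to a copy and is lost
df["data"] = df["data"].apply(remove_stop_words)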
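Substring replacement and token filtering behave differently on Indonesian text, because short stop words such as 'di' also occur inside longer words. The sketch below compares the two on a made-up tweet; the helper names `remove_by_substring` and `remove_by_token` are hypothetical.

```python
words_to_remove = ['dan', 'di', 'yang']

def remove_by_substring(tweet):
    # substring replacement: also removes letters inside longer words
    for stop_word in words_to_remove:
        tweet = tweet.replace(stop_word, "")
    return tweet

def remove_by_token(tweet):
    # token filtering: drops only words that match a stop word exactly
    return " ".join(w for w in tweet.split() if w not in words_to_remove)

tweet = "adik dan ayah sudah divaksin di puskesmas"  # hypothetical example
print(remove_by_substring(tweet))
print(remove_by_token(tweet))
```

In this example the substring version turns "divaksin" into "vaksin" and damages "adik", while the token version leaves both words intact.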

## **Train, Validation, and Test Split**

Train, validation, and test set sizes

In [None]:
import numpy as np

# convert to a NumPy array and shuffle the rows in place
dataset = df.values
np.random.shuffle(dataset)

# split train dev test (70%, 20%, 10%)
TRAIN_SIZE = round(len(dataset) * 0.7)
VAL_SIZE = round(len(dataset) * 0.2)
TEST_SIZE = len(dataset) - TRAIN_SIZE - VAL_SIZE

print("trainsize {}, val size {}, test size {}".format(TRAIN_SIZE, VAL_SIZE, TEST_SIZE))

Slice the shuffled dataset into train, validation, and test sets

In [None]:
train_dataset = dataset[:TRAIN_SIZE]
val_dataset = dataset[TRAIN_SIZE: (TRAIN_SIZE+VAL_SIZE)]
test_dataset = dataset[(VAL_SIZE+ TRAIN_SIZE):]
# print(len(train_dataset), len(val_dataset), len(test_dataset))
# train_dataset
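The size arithmetic and the slicing can be checked on a stand-in list. In the sketch below, 1520 rows is an assumed example size and `split_sizes` is a hypothetical helper; the point is that the three slices cover every row exactly once.

```python
def split_sizes(n, train_frac=0.7, val_frac=0.2):
    # the test set takes whatever is left after rounding the other two
    train_size = round(n * train_frac)
    val_size = round(n * val_frac)
    test_size = n - train_size - val_size
    return train_size, val_size, test_size

rows = list(range(1520))  # stand-in for the shuffled dataset (assumed size)
train_size, val_size, test_size = split_sizes(len(rows))

train = rows[:train_size]
val = rows[train_size:train_size + val_size]
test = rows[train_size + val_size:]

print(len(train), len(val), len(test))
```

Computing the test size as a remainder guarantees that rounding never drops or duplicates a row.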

In [None]:
# train data and label
X_train = train_dataset[:, 0]
Y_train = train_dataset[:, 1]

# validation data and label
X_validation = val_dataset[:, 0]
Y_validation = val_dataset[:, 1]

# test data and label
X_test = test_dataset[:, 0]
Y_test = test_dataset[:, 1]

## **Tokenizer**

Convert the data to sequences

In [None]:
X_train

In [None]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

max_words = 5000
max_len = 100

# create tokenizer
tokenizer = Tokenizer(num_words=max_words, oov_token='<OOV>')
tokenizer.fit_on_texts(X_train)
sequences = tokenizer.texts_to_sequences(X_train)
X_train = pad_sequences(sequences, maxlen=max_len)

# validation data to sequences
val_sequences = tokenizer.texts_to_sequences(X_validation)
X_validation = pad_sequences(val_sequences, maxlen=max_len)
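To make the tokenizing and padding step concrete, here is a plain-Python sketch of the idea. It is not the Keras implementation: it assumes lowercase whitespace tokenization (the Keras `Tokenizer` also strips punctuation by default), it ignores the `num_words` cap, and the helper names are hypothetical. It does mirror three behaviors of the Keras calls: index 1 is reserved for the OOV token, more frequent words get smaller indices, and `pad_sequences` pads and truncates at the front by default.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    # most frequent word gets index 2; index 1 is reserved for unknown words
    counts = Counter(w for text in texts for w in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def texts_to_sequences(texts, word_index, oov_token="<OOV>"):
    # words never seen during fitting map to the OOV index
    oov = word_index[oov_token]
    return [[word_index.get(w, oov) for w in text.lower().split()] for text in texts]

def pad_pre(sequences, maxlen):
    # keep the last maxlen tokens, then left-pad with zeros
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded

train_texts = ["vaksin aman", "vaksin sudah tiba"]  # hypothetical tweets
word_index = build_word_index(train_texts)
sequences = texts_to_sequences(["vaksin gratis", "vaksin sudah tiba"], word_index)
print(pad_pre(sequences, maxlen=4))
print(pad_pre(sequences, maxlen=2))
```

Fitting the index on the training texts only, as the notebook does, means validation words that never appeared in training all collapse to the OOV index.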

One hot encode labels

In [None]:
import numpy as np

# convert to integer arrays (np.int was removed in NumPy 1.24; use the built-in int)
Y_train_arr = np.asarray(Y_train, dtype=int)
Y_validation_arr = np.asarray(Y_validation).astype(int)

# convert to one hot encode
Y_train = tf.one_hot(indices=Y_train_arr, depth=3, dtype=tf.int64)
Y_validation = tf.one_hot(indices=Y_validation_arr, depth=3, dtype=tf.int64)

# convert to numpy
Y_train = Y_train.numpy()
Y_validation = Y_validation.numpy()
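For intuition, one-hot encoding can be reproduced with NumPy alone. The sketch below uses hypothetical class indices and shows that indexing an identity matrix gives the same kind of 0/1 rows, and that `argmax` recovers the original indices.

```python
import numpy as np

labels = np.array([0, 2, 1, 2])  # hypothetical class indices

# row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(3, dtype=np.int64)[labels]

# argmax along each row recovers the class index
decoded = one_hot.argmax(axis=1)

print(one_hot)
print(decoded)
```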

In [None]:
Y_validation

In [None]:
print(tokenizer.word_index)

## **Building Model and Train Model**

### **Baseline Model**

In [None]:
from datetime import datetime

from keras import layers
from keras.callbacks import ModelCheckpoint
from keras.models import Sequential

# save the best model under this date-stamped name
time_now = datetime.now()
MODEL_NAME = 'sentiment-analysis-CNN1D-LSTM-{}.hdf5'.format(time_now.strftime("%d-%m-%Y"))

model = Sequential()
model.add(layers.Embedding(input_dim=max_words, output_dim=16))  # token ids -> 16-dim vectors
model.add(layers.Conv1D(filters=20, kernel_size=3, strides=1, activation='relu'))  # local n-gram features
model.add(layers.MaxPooling1D(pool_size=3))  # downsample the sequence
model.add(layers.LSTM(20, activation='tanh', dropout=0.2))  # summarize the sequence
model.add(layers.Dense(3, activation='softmax'))  # one probability per class

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

# keep only the checkpoint with the best validation accuracy
# (the deprecated period=1 argument is dropped; checkpoints are still evaluated every epoch)
checkpoint = ModelCheckpoint(MODEL_NAME, monitor='val_accuracy', verbose=1, save_best_only=True, mode='auto', save_weights_only=False)
history = model.fit(X_train, Y_train, epochs=100, validation_data=(X_validation, Y_validation), callbacks=[checkpoint])


In [None]:
model.summary()
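The parameter counts reported by `model.summary()` can be cross-checked by hand. The sketch below uses the layer sizes from the model cell and assumes inputs padded to `max_len = 100` and 'valid' padding (the Keras default for `Conv1D` and `MaxPooling1D`); the variable names are chosen for illustration.

```python
max_words, max_len = 5000, 100  # values from the tokenizer cell
embed_dim, conv_filters, kernel_size = 16, 20, 3
pool_size, lstm_units, n_classes = 3, 20, 3

# sequence length after each layer (no padding, so the sequence shrinks)
conv_len = max_len - kernel_size + 1                # stride 1
pool_len = (conv_len - pool_size) // pool_size + 1  # pool stride defaults to pool_size

# trainable parameters per layer
embedding_params = max_words * embed_dim
conv_params = (kernel_size * embed_dim + 1) * conv_filters   # weights + bias per filter
lstm_params = 4 * (conv_filters + lstm_units + 1) * lstm_units  # 4 gates
dense_params = (lstm_units + 1) * n_classes
total_params = embedding_params + conv_params + lstm_params + dense_params

print(conv_len, pool_len)
print(embedding_params, conv_params, lstm_params, dense_params, total_params)
```

Almost all of the parameters sit in the embedding layer, so `max_words` and the embedding size dominate the model size.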

In [None]:
import matplotlib.pyplot as plt

def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()

In [None]:
plot_graphs(history, 'accuracy')

In [None]:
plot_graphs(history, 'loss')

Test prediction

In [None]:
sentiment = ['Neutral','Positive','Negative']
sequence = tokenizer.texts_to_sequences(['Alhamdulillah, papa dah vaksin, ayah dan mak masih tunggu giliran. Malah adik saya pun dah divaksin. Kami yang lain2 tunggu giliran. In shaa Allah.'])
print(sequence)
test = pad_sequences(sequence, maxlen=max_len)
print(test.shape)
# take the argmax of the raw probabilities; rounding first can zero out every class
sentiment[model.predict(test).argmax(axis=1)[0]]
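Rounding the softmax output before taking the argmax can change the predicted class whenever no probability exceeds 0.5. The sketch below shows this on a hypothetical probability vector: rounding turns every entry into 0, so the argmax falls back to index 0 (Neutral), while the argmax of the raw probabilities picks index 1 (Positive).

```python
import numpy as np

probs = np.array([[0.20, 0.45, 0.35]])  # hypothetical softmax output

# rounding first: every entry becomes 0, so argmax returns the first index
rounded_class = np.around(probs, decimals=0).argmax(axis=1)[0]

# argmax on the raw probabilities picks the most likely class
direct_class = probs.argmax(axis=1)[0]

print(rounded_class, direct_class)
```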

## **Evaluate metrics**

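Make predictions on the validation set, then score them with the following formulas for precision, recall, and F1 score, where TP, FP, and FN are the numbers of true positives, false positives, and false negatives for a class: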

1.   Precision = TP / (TP+FP)
2.   Recall = TP / (TP+FN)
3.   F1 score = 2*((precision*recall)/(precision+recall))
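As a worked example of the formulas above, the sketch below computes the three scores per class on hypothetical labels and then takes the unweighted mean over the classes (macro averaging). The helper `per_class_scores` is a name chosen for illustration, and returning 0.0 on a zero denominator is an assumption made here.

```python
def per_class_scores(y_true, y_pred, label):
    # count true positives, false positives, and false negatives for one class
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# hypothetical labels: 0 = neutral, 1 = positive, 2 = negative
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

scores = [per_class_scores(y_true, y_pred, label) for label in (0, 1, 2)]

# macro average: unweighted mean over the classes
macro_precision = sum(s[0] for s in scores) / len(scores)
macro_recall = sum(s[1] for s in scores) / len(scores)
macro_f1 = sum(s[2] for s in scores) / len(scores)

print(macro_precision, macro_recall, macro_f1)
```

Macro averaging weights every class equally, so a class with few examples counts as much as a class with many.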





Decode the one-hot labels to class indices, i.e. 0: neutral, 1: positive, 2: negative

In [None]:
# predict the whole validation set in one batch and take the most probable class
# (argmax on the raw probabilities; rounding first can zero out every class)
y_predict = model.predict(X_validation).argmax(axis=1)

# decode the one-hot labels back to class indices
y_true = Y_validation.argmax(axis=1)

In [None]:
print(y_true.shape, y_predict.shape)

In [None]:
from sklearn import metrics
acc = metrics.accuracy_score(y_true, y_predict)
f1_score = metrics.f1_score(y_true, y_predict, average='macro')
recall = metrics.recall_score(y_true, y_predict, average='macro')
precision = metrics.precision_score(y_true, y_predict, average='macro')

In [None]:
print("accuracy: {}, f1 score: {}, recall: {}, precision: {}".format(acc, f1_score, recall, precision))
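A confusion matrix shows which classes contribute to the averaged scores. The NumPy sketch below uses hypothetical labels, with rows as true classes and columns as predicted classes; per-class recall and precision then fall out of the row and column sums.

```python
import numpy as np

# hypothetical labels: 0 = neutral, 1 = positive, 2 = negative
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

# rows are true classes, columns are predicted classes
confusion = np.zeros((3, 3), dtype=int)
np.add.at(confusion, (y_true, y_pred), 1)

# correct predictions sit on the diagonal
recall_per_class = confusion.diagonal() / confusion.sum(axis=1)
precision_per_class = confusion.diagonal() / confusion.sum(axis=0)

print(confusion)
print(recall_per_class, precision_per_class)
```

This sketch assumes every class is both present and predicted at least once; otherwise a row or column sum is zero and the division needs a guard.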