<a href="https://colab.research.google.com/github/coffema/coffema/blob/main/Emoji_Sentiment_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 03: Emoji Sentiment Classification

This week's assignment is to train sequence models on the Emoji dataset to classify the emotion of a sentence. You'll create a model that takes in a sentence and predicts the emoji that best describes its sentiment.

Before starting, copy this file and work on your own copy by following the steps below:
* `File > Save Copy in Drive`.
* Add your name to the file (e.g., Assignment 01 - Zahraa Dhafer)

Before submitting do the following:
* Rerun the entire notebook from beginning to end to ensure everything is working properly and there are no errors: `Runtime > Restart and run all`.
* Make sure the outputs of your code match the expected outputs.
* Download the notebook locally `File > Download > Download .ipynb`
* Submit it through the submission form listed below.

**Make sure to only edit cells that start with `# YOUR CODE HERE` and nothing else. If you need more space to write your code, you can add more cells, but never delete any existing cells.**

**Submission Deadline: Saturday, 18/2/2023 at 11 PM**

**Submission Form:**  https://forms.gle/DpgK8sHDfUSLZFmbA

**Requirements:**

1. Read & Preprocess Dataset
   1. Text Cleanup
   2. Vectorize Text
   3. Padding
2. Build and train an RNN-based model
3. Save and test the model

Good luck and feel free to ask any questions on the #questions channel.

In [None]:
# Import python libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
import re  
import string  

In [None]:
# To ensure the notebook's reproducibility, let's set the random seed
seed = 42
tf.random.set_seed(seed)
np.random.seed(seed)

## Data Preparation


**DATASET**
The dataset consists of two CSV files: a training file with 16k rows and a testing file with 2k rows. Each row has 3 columns: the sentence, the emotion as text (meant to describe the emoji, not to be used in training/testing), and the emoji symbol (e.g. 😄, 😡, 😍).<br>

In [None]:
!wget -nc https://pub-7bf6c3e2f2bf41b083bd3e31313d5856.r2.dev/emoji_dataset.zip
!unzip -o 'emoji_dataset.zip'

## Read Dataset

Read the CSV files into pandas DataFrames `train_df` and `test_df`. The data is already split into training and testing sets, so you don't need to split it yourself.

In [None]:
# YOUR CODE HERE
train_df = pd.read_csv("train.csv")
train_df.head()

In [None]:
test_df = pd.read_csv("test.csv")
test_df.head()

In [None]:
# Test (Do not edit)

print("train_df shape: ", train_df.shape)
print("test_df shape: ", test_df.shape)




Expected Output

```
train_df shape:  (16000, 3)
test_df shape:  (2000, 3)
```

In [None]:
# let's preview the first 5 rows of the training dataset
train_df.head()

## Split

Notice that there are 3 columns in the dataset:
* `text`: the sentence
* `emotion`: the emotion as text (meant to provide description to the emoji and not to be used in training/testing)
* `emoji`: the emoji symbol (e.g. 😄, 😡, 😍)

We will be using the `text` column as the input to our model and the `emoji` column as the target.

Split the data into `x_train`, `y_train`, `x_test`, `y_test` using the `text` and `emoji` columns respectively.

In [None]:
# YOUR CODE HERE
x_train = train_df['text']
y_train = train_df['emoji']
x_test = test_df['text']
y_test = test_df['emoji']

In [None]:
# Test (Do not edit)

print("Number of train_text: ", len(x_train))
print("Number of test_text:", len(x_test))
print("Labels: ", sorted(set(y_train)))


Expected Output

```
Number of train_text:  16000
Number of test_text: 2000
Labels:  ['😄', '😍', '😡', '😢', '😨', '😲']
```

## Preprocess the text

Let's start by setting up the hyperparameters for preprocessing the text and for the model.

In [None]:
vocab_size = 10000 # The maximum number of words to be used. (most frequent)
max_sequence = 24 # Max number of words in each text.
batch_size = 64 # Batch size for training.
embedding_dims = 50 # Dimension of the embedding layers.

Create a function `text_cleanup` that takes a string as input and returns the string with HTML tags, URLs, emails, mentions, hashtags, and numbers removed.

In [None]:
# Create a function to clean up the text

def text_cleanup(text):
    # The patterns run from most to least specific, so the broad digit
    # rules cannot alter a URL or email before its own pattern is applied.
    text = re.sub(r"<.*?>", "", text) # remove HTML tags
    text = re.sub(r"https?://\S+", "", text) # remove URLs
    text = re.sub(r"\S*@\S*\s?", "", text) # remove emails
    text = re.sub(r"@\S+", "", text) # remove mentions (@username)
    text = re.sub(r"#\S+", "", text) # remove hashtags (#)
    text = re.sub(r"\w*\d\w*", "", text) # remove words with numbers
    text = re.sub(r"\d+", "", text) # remove any remaining numbers
    text = re.sub(r"\s+", " ", text) # collapse extra whitespace into one space

    return text

Apply the function to the `x_train` and `x_test` and save the results in `x_train_clean` and `x_test_clean` respectively.

In [None]:
# YOUR CODE HERE
x_train_clean = x_train.map(text_cleanup)

x_test_clean = x_test.map(text_cleanup)
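
As a quick sanity check of the cleanup step, the sketch below runs a simplified stand-in for `text_cleanup` on a made-up sentence. The function name `demo_cleanup` and the sample text are assumptions for illustration only; they are not part of the dataset or of the assignment code.

```python
import re

def demo_cleanup(text):
    """Hypothetical, shortened stand-in for text_cleanup."""
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    text = re.sub(r"[@#]\S+", "", text)       # remove mentions and hashtags
    text = re.sub(r"\w*\d\w*", "", text)      # remove words containing digits
    text = re.sub(r"\s+", " ", text)          # collapse whitespace runs
    return text.strip()

sample = "i feel great 2day!! see https://example.com #happy @friend"
cleaned = demo_cleanup(sample)
print(repr(cleaned))  # 'i feel great !! see'
```

Punctuation such as `!!` survives this step; the Keras tokenizer's default `filters` argument strips it later during tokenization.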

### Vectorize Text

Create a `tf.keras.preprocessing.text.Tokenizer` named `tokenizer` and set the appropriate parameters on it (make sure to include the `oov_token` parameter). Then fit the tokenizer on the `x_train_clean` data.

Don't convert the text to sequences yet, we will do that later.

In [None]:
# YOUR CODE HERE
tokenizer = tf.keras.preprocessing.text.Tokenizer(
    num_words=vocab_size,
    lower=True,
    split=' ',
    char_level=False,
    oov_token="<OOV>",
    analyzer=None
)

In [None]:
# Fit on the training text only; words that appear only in the test set map to the OOV token
tokenizer.fit_on_texts(x_train_clean)

In [None]:
# Test (Do not edit)

print("OOV token: ", tokenizer.oov_token)
print("Vocabulary size: ", tokenizer.get_config()['num_words'])


Expected Output

```
OOV token:  <OOV>
Vocabulary size:  15212
```

Now that we have a tokenizer, we can use it to convert the text to sequences. Use the tokenizer's `texts_to_sequences` method, which takes a list of strings as input and returns a list of lists of integers (the sequences). Save the tokenized `x_train_clean` and `x_test_clean` in `x_train_seq` and `x_test_seq` respectively.
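
To make the roles of `num_words` and `oov_token` concrete, here is a minimal pure-Python sketch of the idea behind word indexing. It is not the Keras implementation: `build_word_index` and `to_sequence` are hypothetical helpers and the toy sentences are made up. It does follow the same conventions, though: index 0 is left free for padding, index 1 is the OOV token, and only words whose index is below `num_words` are kept.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; the OOV token always gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequence(text, word_index, num_words, oov_token="<OOV>"):
    """Map each word to its index, falling back to the OOV index."""
    oov = word_index[oov_token]
    sequence = []
    for word in text.lower().split():
        idx = word_index.get(word)
        # unknown words and words ranked beyond num_words become OOV
        sequence.append(idx if idx is not None and idx < num_words else oov)
    return sequence

toy_texts = ["i am happy", "i am sad", "i am so happy"]
toy_index = build_word_index(toy_texts)
print(toy_index)
print(to_sequence("i am very happy", toy_index, num_words=5))  # [2, 3, 1, 4]
```

In this sketch "very" was never seen during fitting, so it becomes index 1, which is the behavior the `oov_token` parameter asks for.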

In [None]:
x_train_seq = tokenizer.texts_to_sequences(x_train_clean)
x_test_seq = tokenizer.texts_to_sequences(x_test_clean)

In [None]:
x_train_seq

### Padding

Use `tf.keras.preprocessing.sequence.pad_sequences`, which takes a list of lists of integers as input and returns an array in which every sequence is padded to the same length (use the maximum length already defined in the hyperparameters as `max_sequence`, and make sure to set the `padding` parameter to `post`).

Save the padded `x_train_seq` and `x_test_seq` in `x_train_pad` and `x_test_pad` respectively.
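
The effect of post-padding can be sketched in plain NumPy. The helper `pad_post` below is a hypothetical illustration, not the Keras function; it assumes zeros as the padding value and drops tokens from the front of sequences that are too long, which mirrors the Keras default `truncating="pre"`.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Pad at the end with `value`; keep only the last `maxlen` tokens of long rows."""
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]  # truncate from the front when too long
        padded[row, :len(trimmed)] = trimmed
    return padded

toy_sequences = [[1, 2, 3], [4, 5, 6, 7, 8, 9]]
toy_padded = pad_post(toy_sequences, maxlen=4)
print(toy_padded)
# [[1 2 3 0]
#  [6 7 8 9]]
```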

In [None]:
# YOUR CODE HERE
x_train_pad = tf.keras.preprocessing.sequence.pad_sequences(
    x_train_seq, maxlen=max_sequence, padding="post"
)

x_test_pad = tf.keras.preprocessing.sequence.pad_sequences(
    x_test_seq, maxlen=max_sequence, padding="post"
)


In [None]:
# Test (Do not edit)

print("x_train shape: ", x_train_pad.shape)
print("x_test shape: ", x_test_pad.shape)


Expected Output

```
x_train shape:  (16000, 24)
x_test shape:  (2000, 24)
```

### One Hot Encode Labels

The labels are emoji strings, so they first need to be converted to integer indices and then one hot encoded into vectors with one position per class.

In [None]:
# create emoji index and reverse index

emoji_index = {'😄': 0, '😍': 1, '😡': 2, '😢': 3, '😨': 4, '😲': 5}

reverse_emoji_index = {v: k for k, v in emoji_index.items()}

emoji_index, reverse_emoji_index

Now create two functions:
* `emoji_to_index`: takes an emoji string as input and returns the index of the emoji using `emoji_index` dictionary.
* `index_to_emoji`: takes an index as input and returns the emoji string using `reverse_emoji_index` dictionary.
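
One pitfall is worth seeing in isolation: `dict.get` returns `None` for a key it does not know, so converting labels that have already been converted to integers silently produces `None`. The sketch below uses a hypothetical two-emoji dictionary, `demo_emoji_index`, purely for illustration.

```python
demo_emoji_index = {"😄": 0, "😢": 3}
demo_reverse_index = {v: k for k, v in demo_emoji_index.items()}

first_pass = demo_emoji_index.get("😢")          # emoji -> 3
second_pass = demo_emoji_index.get(first_pass)   # 3 is not an emoji key -> None
round_trip = demo_reverse_index[first_pass]      # index -> emoji again

print(first_pass, second_pass, round_trip)  # 3 None 😢
```

This is why the label conversion should run exactly once on the original emoji column.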

In [None]:
# YOUR CODE HERE
def emoji_to_index(emoji):
    return emoji_index.get(emoji)

In [None]:
def index_to_emoji(index):
    # look the index up directly in the reverse mapping; int() also accepts NumPy integers
    return reverse_emoji_index[int(index)]

In [None]:
# Test (Do not edit)

print("😄 index: ", emoji_to_index('😄'))
print("4 emoji: ", index_to_emoji(4))


Expected Output

```
😄 index:  0
4 emoji:  😨
```

Now we will encode the labels using the function we just created and then one hot encode them.

Remember that we need to convert the labels to integers first before one hot encoding them.
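
The two steps can be sketched with NumPy alone. This is an illustration of the encoding, not the notebook's method (which uses `tf.one_hot`); `demo_index` repeats the six classes and `demo_labels` is a made-up batch of three labels.

```python
import numpy as np

demo_index = {"😄": 0, "😍": 1, "😡": 2, "😢": 3, "😨": 4, "😲": 5}
demo_labels = ["😢", "😄", "😲"]

# step 1: emoji strings -> integer class ids
demo_ids = np.array([demo_index[e] for e in demo_labels])

# step 2: integer ids -> one hot rows (row i of the identity matrix encodes class i)
demo_one_hot = np.eye(len(demo_index), dtype="float32")[demo_ids]

print(demo_ids)      # [3 0 5]
print(demo_one_hot)
```

Each row has a single 1 at the position of its class, so the result has shape `(number of labels, 6)`.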

In [None]:
# convert emoji to index
# (read the labels from the DataFrames so that re-running this cell is safe:
# applying emoji_to_index to already-converted integers would return None)
y_train = train_df['emoji'].apply(emoji_to_index)
y_test = test_df['emoji'].apply(emoji_to_index)

# one-hot encode the labels
depth = len(set(y_train))
y_train = tf.one_hot(y_train, depth)
y_test = tf.one_hot(y_test, depth)

y_train.shape, y_test.shape

Now that we have the tensors ready, let's convert them to `tf.data` pipelines.

Remember that we can always use `tf.data` pipelines even if we do the preprocessing manually.
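
A batched dataset's length is the number of batches, not the number of rows. Since `Dataset.batch` keeps the final partial batch by default (`drop_remainder=False`), the expected sizes can be worked out with a short calculation, using the row counts and batch size stated earlier in this notebook:

```python
import math

demo_batch_size = 64
row_counts = {"train": 16000, "test": 2000}

# number of batches = ceil(rows / batch size), because the last partial batch is kept
batch_counts = {name: math.ceil(rows / demo_batch_size) for name, rows in row_counts.items()}
print(batch_counts)  # {'train': 250, 'test': 32}
```

The test set does not divide evenly (2000 / 64 = 31.25), so its last batch holds only 16 rows.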

In [None]:
def dataset_creator(x, y):
    dataset = tf.data.Dataset.from_tensor_slices((x, y))
    dataset = dataset.shuffle(1000)
    dataset = dataset.batch(batch_size)
    dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
    return dataset


train_dataset = dataset_creator(x_train_pad, y_train)
test_dataset = dataset_creator(x_test_pad, y_test)

In [None]:
# preview dataset
for x, y in train_dataset.take(1):
    print(x.shape, y.shape)
    print(x[0])
    print(y[0])

# preview dataset size
print("Train dataset size: ", len(train_dataset))
print("Test dataset size: ", len(test_dataset))

## Modeling

Create `model` using the Sequential API. Your target is to hit 90% accuracy on the validation set.

Start with the smallest model that you can think of and then increase model size until you hit the target.

In [None]:
# YOUR CODE HERE

model = tf.keras.models.Sequential(
    [
        tf.keras.Input(shape=(max_sequence,)),  # padded sequences of 24 token ids
        tf.keras.layers.Embedding(
            vocab_size + 1, embedding_dims
        ),  # output shape is max_sequence (24) x embedding_dims (50)
        tf.keras.layers.LSTM(32),  # small starting point; grow it if accuracy stalls
        tf.keras.layers.Dense(len(emoji_index), activation="softmax"),  # one probability per emoji
    ]
)

model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.CategoricalCrossentropy(),  # labels are one-hot encoded
    metrics=[tf.keras.metrics.CategoricalAccuracy()],
)

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    "model_checkpoint.h5",
    save_best_only=True,
    monitor="val_loss",
)

early_stopping_callback = tf.keras.callbacks.EarlyStopping(
    patience=3,
    restore_best_weights=True,
)

model.fit(
    train_dataset,
    epochs=10,
    validation_data=test_dataset,
    callbacks=[checkpoint_callback, early_stopping_callback],
)

In [None]:
# Test (Do not edit)

model.summary()    


Expected Output

```
Model: "..."
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding (Embedding)       (None, 24, 50)            500050    
                                                                 
 ...
                                                                 
 dense (Dense)               (None, 6)                 ...       
                                                                 
=================================================================
Total params: ...
Trainable params: ...
Non-trainable params: 0
_________________________________________________________________
```
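
The embedding row of the expected summary can be checked by hand. The sketch below recomputes it from the hyperparameters, and also works out the counts for one possible middle layer. The 32-unit LSTM is an assumption for illustration; the expected output leaves the middle layers open, so your own numbers there may differ.

```python
demo_vocab_size = 10000
demo_embedding_dims = 50
demo_lstm_units = 32   # assumed size, not prescribed by the assignment
demo_num_classes = 6

# Embedding: one vector of size embedding_dims per index (vocab_size + 1 indices)
embedding_params = (demo_vocab_size + 1) * demo_embedding_dims

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * (demo_lstm_units * (demo_embedding_dims + demo_lstm_units) + demo_lstm_units)

# Dense: weight matrix plus one bias per output class
dense_params = demo_lstm_units * demo_num_classes + demo_num_classes

print(embedding_params, lstm_params, dense_params)  # 500050 10624 198
```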

In [None]:
# compile the model
model.compile(optimizer= tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.CategoricalCrossentropy(),
    metrics=[tf.keras.metrics.CategoricalAccuracy()]
)
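
To see what categorical cross-entropy measures, here is a small NumPy sketch. It is a simplified version of the formula, not the Keras implementation, and the probability vectors are made-up toy values. With one-hot targets the loss reduces to the negative log of the probability assigned to the true class.

```python
import numpy as np

def demo_categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Per-sample loss: -sum(y_true * log(y_pred)); clipping avoids log(0)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

y_true_demo = np.array([[0, 0, 0, 1, 0, 0]], dtype="float32")  # true class: index 3
confident_pred = np.array([[0.05, 0.02, 0.03, 0.80, 0.06, 0.04]])
uniform_pred = np.full((1, 6), 1 / 6)

loss_confident = demo_categorical_crossentropy(y_true_demo, confident_pred)[0]
loss_uniform = demo_categorical_crossentropy(y_true_demo, uniform_pred)[0]
print(round(float(loss_confident), 4), round(float(loss_uniform), 4))  # 0.2231 1.7918
```

A model that guesses uniformly over six classes starts near ln(6) ≈ 1.79, which is a useful reference point when reading the first epochs of training.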

Now fit the model on the training data and evaluate it on the testing data. Use an appropriate number of epochs and any callbacks you deem necessary.
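
If you use early stopping, the rule it applies is simple enough to sketch in plain Python. The helper `early_stop_epoch` is hypothetical and the validation losses are made-up numbers; the sketch only illustrates the patience logic, not the Keras callback itself.

```python
def early_stop_epoch(val_losses, patience=3):
    """Return (epoch where training stops, epoch with the best loss), both 0-based."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1  # one more epoch without improvement
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

toy_val_losses = [0.90, 0.60, 0.50, 0.52, 0.51, 0.55, 0.40]
stopped_at, best_at = early_stop_epoch(toy_val_losses, patience=3)
print(stopped_at, best_at)  # 5 2
```

Training stops at epoch 5 and never reaches the lower loss at epoch 6, which is the trade-off a small `patience` makes; restoring the best weights brings the model back to its state at epoch 2.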

In [None]:
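# YOUR CODE HERE
# The model is fitted in the model-definition cell above; evaluate it on the test set here
test_loss, test_accuracy = model.evaluate(test_dataset)
print("Test accuracy: ", test_accuracy)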



You will now save the model to `model.h5` and then load it again to make sure it works.

Make sure to save the whole model and not just weights.

In [None]:
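# YOUR CODE HERE
# save the whole model (architecture, weights, and optimizer state) in HDF5 format
model.save("model.h5")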



In [None]:
# Test (Do not edit)

# delete existing model and load it from scratch
if not model:
    del model
model = tf.keras.models.load_model('model.h5')

# input a test sentence and get the predicted emoji
def predict_emoji(sentence):
    sentence = text_cleanup(sentence)
    sentence = tokenizer.texts_to_sequences([sentence])
    sentence = tf.keras.preprocessing.sequence.pad_sequences(sentence, maxlen=max_sequence, padding='post')
    prediction = model.predict(sentence, verbose=0)
    return index_to_emoji(np.argmax(prediction))

sentence1 = "I am so happy"
sentence2 = "I am so sad"

print("Sentence 1: ", sentence1, "\nEmoji: ", predict_emoji(sentence1))
print()
print("Sentence 2: ", sentence2, "\nEmoji: ", predict_emoji(sentence2))


Expected Output

```
Sentence 1:  I am so happy 
Emoji:  😄

Sentence 2:  I am so sad 
Emoji:  😢
```
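
The last step of `predict_emoji` turns a vector of six probabilities into a single emoji. The sketch below isolates that step with NumPy; `toy_probs` is a made-up prediction standing in for the output of `model.predict`, and `demo_reverse_lookup` repeats the reverse index for self-containment.

```python
import numpy as np

demo_reverse_lookup = {0: "😄", 1: "😍", 2: "😡", 3: "😢", 4: "😨", 5: "😲"}

# shape (1, 6): one sentence, one probability per class
toy_probs = np.array([[0.05, 0.02, 0.03, 0.80, 0.06, 0.04]])

predicted_id = int(np.argmax(toy_probs, axis=-1)[0])  # position of the largest probability
predicted_emoji = demo_reverse_lookup[predicted_id]
print(predicted_id, predicted_emoji)  # 3 😢
```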