## Report for Kaggle competition

### Outline
1. Data preparation
2. Feature engineering
3. Model architecture
4. Training & testing

---

## 1. Data preparation

### 1.1 Load data
Load the JSON and CSV files into separate dataframes.

In [50]:
import pandas as pd
tweets = pd.read_json("dm2022-isa5810-lab2-homework2/tweets_DM.json", lines=True)

In [51]:
data_id = pd.read_csv("dm2022-isa5810-lab2-homework2/data_identification.csv") #, index_col='tweet_id')
emotion = pd.read_csv("dm2022-isa5810-lab2-homework2/emotion.csv")#, index_col='tweet_id')

In [52]:
print(data_id.shape)
data_id.head()

(1867535, 2)


   tweet_id identification
0  0x28cc61           test
1  0x29e452          train
2  0x2b3819          train
3  0x2db41f           test
4  0x2a2acc          train


### 1.2 Set up dataframes 

Extract only "tweet_id" and "text" from the json file.

In [53]:
# Iterate over the nested records directly instead of indexing by position
sources = tweets['_source']
tweet_id = [src['tweet']['tweet_id'] for src in sources]
tweet_text = [src['tweet']['text'] for src in sources]
tweet = pd.DataFrame({'tweet_id': tweet_id, 'text': tweet_text})
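
To make the nesting explicit, here is a minimal sketch of the same extraction using only the standard library. The record layout (`_source` -> `tweet` -> `tweet_id` / `text`) is taken from the cell above; the two sample lines and the helper name `extract_rows` are made up for illustration.

```python
import json

# Two toy records mimicking the assumed layout of tweets_DM.json:
# one JSON object per line, with the tweet nested under "_source" -> "tweet".
raw_lines = [
    '{"_source": {"tweet": {"tweet_id": "0x000001", "text": "hello world"}}}',
    '{"_source": {"tweet": {"tweet_id": "0x000002", "text": "second tweet"}}}',
]

def extract_rows(lines):
    """Return (tweet_id, text) pairs from JSON-lines records."""
    rows = []
    for line in lines:
        tweet = json.loads(line)["_source"]["tweet"]
        rows.append((tweet["tweet_id"], tweet["text"]))
    return rows

rows = extract_rows(raw_lines)
print(rows)
```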

Merge three dataframes into a single dataframe.

In [56]:
df = data_id.merge(emotion, on='tweet_id', how='outer').merge(tweet, on='tweet_id', how='outer')
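
The effect of the outer merge can be checked on a toy example. This is only a sketch with made-up rows: it assumes, as in the real data, that test tweets have no entry in the emotion table, so their `emotion` ends up as NaN after the merge.

```python
import pandas as pd

# Toy stand-ins for data_id, emotion and tweet (values are made up).
data_id = pd.DataFrame({"tweet_id": ["a", "b", "c"],
                        "identification": ["train", "test", "train"]})
emotion = pd.DataFrame({"tweet_id": ["a", "c"], "emotion": ["joy", "fear"]})
tweet = pd.DataFrame({"tweet_id": ["a", "b", "c"],
                      "text": ["t1", "t2", "t3"]})

merged = (data_id.merge(emotion, on="tweet_id", how="outer")
                 .merge(tweet, on="tweet_id", how="outer"))

# Test rows have no label, so the outer join leaves their emotion as NaN.
n_missing = int(merged["emotion"].isna().sum())
test_missing = merged.loc[merged["identification"] == "test", "emotion"].isna().all()
print(merged)
```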

Split the dataframe into training and test dataframes.

In [57]:
train_df = df[df['identification']=='train']
test_df = df[df['identification']=='test']

Split the training dataframe into training and validation dataframes (80% / 20%).

In [58]:
val_df = train_df.sample(frac=0.2)
train_df = train_df[~train_df.index.isin(val_df.index)]
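
Since `sample` draws randomly, the split differs from run to run. A possible variant, shown here on a toy frame, fixes the seed and verifies that the two parts do not overlap. The `random_state` value and the names `toy`, `train_part` and `val_part` are assumptions for this sketch and are not part of the original run.

```python
import pandas as pd

# Toy labelled frame standing in for train_df (values are made up).
toy = pd.DataFrame({"tweet_id": [f"id{i}" for i in range(10)],
                    "emotion": ["joy", "fear"] * 5})

# random_state is an assumption added here; the original call does not set it.
val_part = toy.sample(frac=0.2, random_state=42)
train_part = toy.drop(val_part.index)

# The two parts are disjoint and together cover every row.
overlap = set(val_part.index) & set(train_part.index)
print(len(train_part), len(val_part), overlap)
```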

Save the dataframes as pickle files.

In [59]:
train_df.to_pickle("./train_df.pkl") 
test_df.to_pickle("./test_df.pkl")
val_df.to_pickle("./val_df.pkl")

## 2. Feature engineering

### 2.1 Define vectorizer

Here I choose a "Bag of Words" representation, which achieved the better performance.
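
As a rough illustration of what a bag-of-words vectorizer with a feature limit does, the following sketch counts tokens in plain Python. It is a simplification, not the scikit-learn implementation: it splits on whitespace instead of using `nltk.word_tokenize`, and the helper names `fit_vocabulary` and `transform` are hypothetical.

```python
from collections import Counter

def fit_vocabulary(docs, max_features):
    """Keep the max_features most frequent tokens, as a token -> column map."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    # Sort by descending frequency, then alphabetically for a stable tie-break.
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:max_features]
    # Columns are assigned in alphabetical order of the kept tokens.
    return {tok: i for i, tok in enumerate(sorted(t for t, _ in top))}

def transform(docs, vocab):
    """Turn each document into a row of token counts; unknown tokens are dropped."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok in doc.lower().split():
            if tok in vocab:
                row[vocab[tok]] += 1
        rows.append(row)
    return rows

docs = ["happy happy day", "sad day"]
vocab = fit_vocabulary(docs, max_features=2)
matrix = transform(docs, vocab)
print(vocab, matrix)
```

With `max_features=500`, each tweet becomes a 500-dimensional count vector in the same way, which is the input size used by the network later on.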

In [30]:
from sklearn.feature_extraction.text import CountVectorizer
import nltk
nltk.download('punkt')
BOW500_vectorizer = CountVectorizer(max_features=500, tokenizer=nltk.word_tokenize) 

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Coo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Fit the vectorizer on the training text and apply it.

In [1]:
# apply analyzer to training data
BOW500_vectorizer.fit(train_df['text'])

# Transform documents to matrix.
train_data_BOW_features_500 = BOW500_vectorizer.transform(train_df['text'])

## Check dimension
train_data_BOW_features_500.shape



### 2.2 Set the training data and label.

In [60]:
# For a classification problem, we need both training and validation data
X_train = BOW500_vectorizer.transform(train_df['text'])
y_train = train_df['emotion']

X_val = BOW500_vectorizer.transform(val_df['text'])
y_val = val_df['emotion']  # labelled, because val_df is sampled from train_df

# Checking the data dimensions is a good habit :)
print('X_train.shape: ', X_train.shape)
print('y_train.shape: ', y_train.shape)
print('X_val.shape: ', X_val.shape)
print('y_val.shape: ', y_val.shape)

X_train.shape:  (1164450, 500)
y_train.shape:  (1164450,)
X_val.shape:  (291113, 500)
y_val.shape:  (291113,)


In [64]:
y_train = train_df['emotion']
y_val = val_df['emotion']

Use one-hot encoding to transform the categorical labels into numerical ones.

In [65]:
import numpy as np  # needed by label_decode below
import keras
# Convert the labels (string -> one-hot)
from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils # keras==2.4.0 and tensorflow==2.3.0

label_encoder = LabelEncoder()
label_encoder.fit(y_train)
print('check label: ', label_encoder.classes_)
print('\n## Before convert')
print('y_train[0:4]:\n', y_train[0:4])
print('\ny_train.shape: ', y_train.shape)
print('y_val.shape: ', y_val.shape)

def label_encode(le, labels):
    enc = le.transform(labels)
    return np_utils.to_categorical(enc)

def label_decode(le, one_hot_label):
    dec = np.argmax(one_hot_label, axis=1)
    return le.inverse_transform(dec)

y_train = label_encode(label_encoder, y_train)
y_val = label_encode(label_encoder, y_val)

print('\n\n## After convert')
print('y_train[0:4]:\n', y_train[0:4])
print('\ny_train.shape: ', y_train.shape)
print('y_val.shape: ', y_val.shape)

check label:  ['anger' 'anticipation' 'disgust' 'fear' 'joy' 'sadness' 'surprise'
 'trust']

## Before convert
y_train[0:4]:
 1             joy
4           trust
6    anticipation
7    anticipation
Name: emotion, dtype: object

y_train.shape:  (1164450,)
y_val.shape:  (291113,)


## After convert
y_train[0:4]:
 [[0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0.]]

y_train.shape:  (1164450, 8)
y_val.shape:  (291113, 8)
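
The conversion above can be reproduced by hand, which may help when reading the output. The sketch below is plain Python rather than the `LabelEncoder` / `to_categorical` pair used in the cell; the helper names `fit_classes` and `one_hot` are hypothetical, and the class list is copied from the printed `check label` output.

```python
def fit_classes(labels):
    """Sorted unique labels, mirroring how LabelEncoder orders its classes."""
    return sorted(set(labels))

def one_hot(labels, classes):
    """Encode each label as a 0/1 row with a single 1 at the class index."""
    index = {c: i for i, c in enumerate(classes)}
    return [[1.0 if i == index[lab] else 0.0 for i in range(len(classes))]
            for lab in labels]

classes = ['anger', 'anticipation', 'disgust', 'fear',
           'joy', 'sadness', 'surprise', 'trust']
encoded = one_hot(['joy', 'trust', 'anticipation'], fit_classes(classes))
print(encoded[0])
```

The first row has its single 1 at index 4, the position of `joy` in the sorted class list, which matches the first row printed after conversion.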




## 3. Model architecture

### 3.1 Build the model

Define the input and output dimensions.

In [66]:
# I/O check
input_shape = X_train.shape[1]
print('input_shape: ', input_shape)

output_shape = len(label_encoder.classes_)
print('output_shape: ', output_shape)

input_shape:  500
output_shape:  8


Neural network architecture

In [67]:
from keras.models import Model
from keras.layers import Input, Dense
from keras.layers import ReLU, Softmax

# input layer
model_input = Input(shape=(input_shape, ))  # 500
X = model_input

# 1st hidden layer
X_W1 = Dense(units=64)(X)  # 64
H1 = ReLU()(X_W1)

# 2nd hidden layer
H1_W2 = Dense(units=64)(H1)  # 64
H2 = ReLU()(H1_W2)

# output layer
H2_W3 = Dense(units=output_shape)(H2)  # 8
H3 = Softmax()(H2_W3)

model_output = H3

# create model
model = Model(inputs=[model_input], outputs=[model_output])

# loss function & optimizer
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# show model construction
model.summary()

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 500)]             0         
                                                                 
 dense_3 (Dense)             (None, 64)                32064     
                                                                 
 re_lu_2 (ReLU)              (None, 64)                0         
                                                                 
 dense_4 (Dense)             (None, 64)                4160      
                                                                 
 re_lu_3 (ReLU)              (None, 64)                0         
                                                                 
 dense_5 (Dense)             (None, 8)                 520       
                                                                 
 softmax_1 (Softmax)         (None, 8)                 0   
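
The parameter counts in the summary can be verified by hand: a dense layer has one weight per input-unit pair plus one bias per unit. A short sketch of that arithmetic (the helper `dense_params` is a hypothetical name):

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

layer_sizes = [500, 64, 64, 8]  # input, two hidden layers, output
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)
```

The three values agree with the `Param #` column for `dense_3`, `dense_4` and `dense_5`; the ReLU and Softmax layers have no trainable parameters.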

## 4. Training and testing process

### 4.1 Create logger and start training

In [68]:
from keras.callbacks import CSVLogger

csv_logger = CSVLogger('./training_log.csv')

# training setting
epochs = 25
batch_size = 32

# training!
history = model.fit(X_train, y_train, 
                    epochs=epochs, 
                    batch_size=batch_size, 
                    callbacks=[csv_logger],
                    validation_data = (X_val, y_val))
print('training finish')

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25
training finish
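
Because the per-epoch metrics are written to `training_log.csv`, the log can be inspected afterwards, for example to find the epoch with the lowest validation loss. The sketch below is hedged: the column names are an assumption about what `CSVLogger` writes for this setup, and the numbers are made up for illustration, not results from this run.

```python
import csv
import io

# Made-up log contents for illustration only; real values come from
# ./training_log.csv. The column names are an assumption about CSVLogger output.
sample_log = io.StringIO(
    "epoch,accuracy,loss,val_accuracy,val_loss\n"
    "0,0.40,1.60,0.41,1.58\n"
    "1,0.43,1.52,0.42,1.55\n"
    "2,0.45,1.47,0.42,1.57\n"
)

def best_epoch(log_file, metric="val_loss"):
    """Return (epoch, value) for the row with the lowest value of `metric`."""
    rows = list(csv.DictReader(log_file))
    best = min(rows, key=lambda r: float(r[metric]))
    return int(best["epoch"]), float(best[metric])

result = best_epoch(sample_log)
print(result)
```

For the real log, the same function could be called with `open('./training_log.csv', newline='')` in place of the in-memory sample.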


### 4.2 Test data

In [69]:
X_test = BOW500_vectorizer.transform(test_df['text'])

In [71]:
y_test_pred = model.predict(X_test)



Decode the output vector into each category.

In [75]:
import numpy as np
y_test_pred = label_decode(label_encoder, y_test_pred)
# assign() returns a new dataframe, which avoids pandas' SettingWithCopyWarning
test_df = test_df.assign(emotion=y_test_pred)
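
Decoding amounts to taking, for each row of predicted probabilities, the class with the highest score. A minimal plain-Python sketch of that step follows; the probability rows are made up, and `decode` is a hypothetical stand-in for `label_decode`.

```python
def decode(prob_rows, classes):
    """Pick, for each row of class probabilities, the label with the highest score."""
    labels = []
    for row in prob_rows:
        best_index = max(range(len(row)), key=row.__getitem__)
        labels.append(classes[best_index])
    return labels

classes = ['anger', 'anticipation', 'disgust', 'fear',
           'joy', 'sadness', 'surprise', 'trust']
# Made-up softmax outputs for two tweets (each row sums to 1).
probs = [[0.05, 0.05, 0.05, 0.05, 0.60, 0.10, 0.05, 0.05],
         [0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.30]]
decoded = decode(probs, classes)
print(decoded)
```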

### 4.3 Save output to csv file.

In [80]:
header = ["tweet_id", "emotion"]
# index=False keeps the dataframe index out of the file, leaving only the two columns
test_df.to_csv('output_NN1.csv', columns=header, index=False)

---