## Import Modules

In [1]:
import tensorflow as tf
import numpy as np
import pandas as pd
import re
from sklearn.model_selection import train_test_split
from datasets import load_dataset
from tensorflow import keras

from keras.layers import Embedding, LSTM, Dense, Bidirectional, SpatialDropout1D
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical, pad_sequences

from nltk.corpus import stopwords
from gensim.models import Word2Vec

## Import the datasets

In [2]:
dataset = load_dataset("climatebert/climate_sentiment")

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 320
    })
})

In [4]:
pd_train = pd.DataFrame.from_dict(dataset["train"])
pd_test = pd.DataFrame.from_dict(dataset["test"])

In [5]:
pd_train

| | text | label |
|---|---|---|
| 0 | − Scope 3: Optional scope that includes indire... | 1 |
| 1 | The Group is not aware of any noise pollution ... | 0 |
| 2 | Global climate change could exacerbate certain... | 0 |
| 3 | Setting an investment horizon is part and parc... | 0 |
| 4 | Climate change the physical impacts of climate... | 0 |
| ... | ... | ... |
| 995 | Greenhouse gas Mitigation Measures Our five ye... | 1 |
| 996 | We have updated our external sector statements... | 1 |
| 997 | STOREBRAND'S USE Task Force on Climate-related... | 0 |
| 998 | Estimations of nanced emissions indicate the i... | 1 |


## Text Classification using RNN with word embeddings

### Data preprocessing

Clean the text: lowercase it, replace separator symbols with spaces, strip the remaining symbols, and remove English stopwords.

In [6]:
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    text = text.lower()  # lowercase the text
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # replace separator symbols with a space
    text = BAD_SYMBOLS_RE.sub('', text)  # drop every character outside 0-9, a-z, space, #, + and _
    text = text.replace('x', '')  # note: strips every 'x', so "experts" becomes "eperts"
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)  # remove stopwords

    return text

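# note: pandas >= 2.0 treats the pattern literally unless regex=True is passed,
# so these calls strip digits only on older pandas versions
pd_train['text'] = pd_train['text'].apply(clean_text)
pd_train['text'] = pd_train['text'].str.replace(r'\d+', '')

pd_test['text'] = pd_test['text'].apply(clean_text)
pd_test['text'] = pd_test['text'].str.replace(r'\d+', '')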
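
Running the cleaning steps on a made-up sentence makes their side effects easier to see. The sketch below re-implements the same substitutions using only the standard library. The `SAMPLE` string, the `clean_text_demo` helper, and the small `STOPWORDS_DEMO` set are assumptions for illustration (the notebook uses NLTK's English list), and `re.sub(r'\d+', ...)` stands in for the pandas digit-removal call.

```python
import re

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS_DEMO = {"the", "of", "in", "and"}  # hypothetical stand-in for NLTK's list

def clean_text_demo(text, drop_x=True):
    """Apply the notebook's cleaning steps; drop_x toggles the 'x' removal."""
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(' ', text)
    text = BAD_SYMBOLS_RE.sub('', text)
    if drop_x:
        text = text.replace('x', '')
    return ' '.join(w for w in text.split() if w not in STOPWORDS_DEMO)

SAMPLE = "Experts audited 120 facilities (third-party) in 2019."

with_x_removed = clean_text_demo(SAMPLE)
without_x_removed = clean_text_demo(SAMPLE, drop_x=False)
# regex-based digit removal, then collapse the leftover double spaces
digits_stripped = ' '.join(re.sub(r'\d+', '', without_x_removed).split())

print(with_x_removed)     # eperts audited 120 facilities thirdparty 2019
print(without_x_removed)  # experts audited 120 facilities thirdparty 2019
print(digits_stripped)    # experts audited facilities thirdparty
```

The first line shows how removing every `x` damages ordinary words, and the last line shows what the text looks like once the digit pattern is applied as a regular expression.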

In [7]:
text_train = pd_train["text"].values
label_train = pd_train["label"].values

text_test = pd_test["text"].values
label_test = pd_test["label"].values

In [8]:
# define max words for the vocabulary
MAX_WORDS = 50000
tokenizer_train = Tokenizer(num_words=MAX_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~', lower=True)
tokenizer_test = Tokenizer(num_words=MAX_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~', lower=True)

# fit each tokenizer on its own dataset
# note: tokenizer_test builds a separate vocabulary, so its word ids do not match tokenizer_train's
tokenizer_train.fit_on_texts(text_train)
tokenizer_test.fit_on_texts(text_test)

# convert each dataset to sequences of integers
seq_train = tokenizer_train.texts_to_sequences(text_train)
seq_test = tokenizer_test.texts_to_sequences(text_test)

In [9]:
# pad every sequence to a fixed length (the value may be tuned later)
MAX_SEQ = 100
X_train = pad_sequences(sequences=seq_train, maxlen=MAX_SEQ)
X_test = pad_sequences(sequences=seq_test, maxlen=MAX_SEQ)
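
With its default arguments, Keras `pad_sequences` pads and truncates at the front of each sequence (`padding='pre'`, `truncating='pre'`) and fills with zeros. The pure-Python sketch below mimics that default behaviour on toy id lists. `pad_pre` is a hypothetical helper written for illustration, not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad or left-truncate a list of token ids to exactly maxlen items."""
    seq = list(seq)[-maxlen:]                    # keep only the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq   # fill the front with the pad value

short = pad_pre([5, 8, 2], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short)  # [0, 0, 5, 8, 2]
print(long)   # [3, 4, 5, 6, 7]
```

With `MAX_SEQ = 100`, any text longer than 100 tokens therefore loses its beginning rather than its end.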

### Split the data

In [10]:
# one-hot encode the labels (3 classes)
y_train = to_categorical(label_train, 3)
y_test = to_categorical(label_test, 3)
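
As a rough picture of what the one-hot encoding produces, the sketch below builds the same kind of rows in plain Python and decodes them back with the `index(max(...))` idiom used later in the prediction cells. `one_hot` is a hypothetical helper and the label list is made up for illustration.

```python
def one_hot(labels, num_classes):
    """Return one row per label with a single 1.0 at the label's position."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

encoded = one_hot([1, 0, 2], num_classes=3)
# decoding: the position of the largest value is the class id
decoded = [row.index(max(row)) for row in encoded]

print(encoded)  # [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
print(decoded)  # [1, 0, 2]
```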

The dataset already ships with a train/test split, so only a validation set is carved out of the training data.

In [11]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=42069)

### Create the RNN Model using Randomly Initialized Embedding

In [12]:
# create sequential model to stack layers
rnn = Sequential()

# embedding layer to convert integer tokens into dense vectors
# no weights are passed, so the embedding is randomly initialized
rnn.add(Embedding(input_dim=MAX_WORDS, output_dim=100, input_length=X_train.shape[1]))

# spatial dropout: drops whole embedding channels across all timesteps
rnn.add(SpatialDropout1D(rate=0.2))

# bidirectional LSTM with 100 units per direction
# reads the sequence in both directions to capture context from either side
rnn.add(Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2, return_sequences=True)))

# second bidirectional LSTM with 50 units per direction, returning only its final output
rnn.add(Bidirectional(LSTM(50)))

# dense output layer with 3 units and softmax activation (used for multiclass)
rnn.add(Dense(3, activation="softmax"))

# compile the RNN model
rnn.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [13]:
# train the model
rnn_history = rnn.fit(X_train, y_train, epochs=8, batch_size=16, validation_data=(X_val, y_val))

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


In [14]:
# evaluate on the validation data
loss, accuracy = rnn.evaluate(X_val, y_val)
print(f"Loss:\t{loss:.4f}")
print(f"Accuracy:\t{accuracy:.4f}")

Loss:	1.1081
Accuracy:	0.7200


In [15]:
# evaluate on the test data
loss, accuracy = rnn.evaluate(X_test, y_test)
print(f"Loss:\t{loss:.4f}")
print(f"Accuracy:\t{accuracy:.4f}")

Loss:	2.4801
Accuracy:	0.3688


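Test accuracy is poor (about 37%, against 72% on the validation split), and adjusting the layers did not yield a better model. One likely contributor is that the test set is encoded with `tokenizer_test`, whose word ids do not match the ids the model was trained on.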
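
To illustrate how vocabularies fitted on different corpora assign different ids to the same word, here is a minimal pure-Python sketch. `fit_word_index` and `to_sequence` are hypothetical helpers that only mimic frequency-ranked indexing and the skipping of unknown words, and the toy texts are made up.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency, most frequent first, starting at id 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index):
    """Map known words to ids and silently skip unknown ones."""
    return [word_index[w] for w in text.split() if w in word_index]

train_texts = ["climate risk climate policy", "risk report"]
test_texts = ["emissions report emissions climate"]

train_index = fit_word_index(train_texts)
test_index = fit_word_index(test_texts)

sample = "emissions report climate"
ids_from_test_vocab = to_sequence(sample, test_index)
ids_from_train_vocab = to_sequence(sample, train_index)

print(ids_from_test_vocab)   # [1, 2, 3]
print(ids_from_train_vocab)  # [4, 1]
```

In the test vocabulary id 1 means "emissions", while the model learned id 1 as "climate". In the notebook, reusing the fitted tokenizer via `tokenizer_train.texts_to_sequences(text_test)` would keep the ids aligned. Whether that alone closes the gap to the validation accuracy is untested here.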

### Create the RNN Model using Word2Vec Embedding

Create the embedding matrix from a Word2Vec model trained on the training text.

In [16]:
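EMBEDDING_DIM = 100
# note: Word2Vec expects lists of tokens; text_train holds raw strings,
# so each string is iterated character by character
w2v = Word2Vec(sentences=text_train, vector_size=EMBEDDING_DIM, window=5, min_count=1, sg=0)
w2v.save("word2vec.model")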
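
gensim's `Word2Vec` takes `sentences` as an iterable of token lists. The standard-library sketch below shows what changes when raw strings are passed instead: iterating a string yields single characters, so the learned vocabulary would consist of characters, not words. The toy texts are made up for illustration.

```python
texts = ["climate risk report", "emissions policy"]

# iterating a raw string yields its characters, spaces included
tokens_from_strings = [list(sentence) for sentence in texts]

# splitting first yields word-level sentences
tokens_from_split = [sentence.split() for sentence in texts]

char_vocab = {tok for sent in tokens_from_strings for tok in sent}
word_vocab = {tok for sent in tokens_from_split for tok in sent}

print("climate" in char_vocab)  # False
print("climate" in word_vocab)  # True
```

Under that assumption, passing something like `[t.split() for t in text_train]` as `sentences` would give the model word-level input.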

In [17]:
embedding_matrix = np.zeros((MAX_WORDS, EMBEDDING_DIM))
for word, i in tokenizer_train.word_index.items():
    if i < MAX_WORDS:
        if word in w2v.wv:
            embedding_matrix[i] = w2v.wv[word]
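
Because rows stay at zero whenever a word is missing from the Word2Vec vocabulary, it can help to count how many tokenizer words actually receive a vector. The sketch below does this with plain dictionaries. `embedding_coverage` is a hypothetical helper, and the toy `word_index` and `vectors` stand in for `tokenizer_train.word_index` and `w2v.wv` (which also supports the `in` test).

```python
def embedding_coverage(word_index, vectors, max_words):
    """Return (words with a vector, words eligible for a row in the matrix)."""
    eligible = [w for w, i in word_index.items() if i < max_words]
    found = [w for w in eligible if w in vectors]
    return len(found), len(eligible)

word_index = {"climate": 1, "risk": 2, "report": 3, "a": 4}   # made-up vocabulary
vectors = {"a": [0.1, 0.2], "c": [0.3, 0.4]}                  # made-up character-level vectors

found, eligible = embedding_coverage(word_index, vectors, max_words=50000)
print(f"{found} of {eligible} words have a pretrained vector")  # 1 of 4 words have a pretrained vector
```

A low ratio means most of the embedding layer starts from zeros rather than from pretrained vectors.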

In [18]:
embedding_matrix

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [19]:
# create sequential model to stack layers
rnn_w2v = Sequential()

# embedding layer to convert integer tokens into dense vectors
# initialize the weights from the Word2Vec embedding_matrix and keep them trainable
rnn_w2v.add(Embedding(input_dim=MAX_WORDS, output_dim=EMBEDDING_DIM, input_length=X_train.shape[1], weights=[embedding_matrix], trainable=True))

# spatial dropout: drops whole embedding channels across all timesteps
rnn_w2v.add(SpatialDropout1D(rate=0.2))

# bidirectional LSTM with 100 units per direction
# reads the sequence in both directions to capture context from either side
rnn_w2v.add(Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2, return_sequences=True)))

# second bidirectional LSTM with 50 units per direction, returning only its final output
rnn_w2v.add(Bidirectional(LSTM(50)))

# dense output layer with 3 units and softmax activation (used for multiclass)
rnn_w2v.add(Dense(3, activation="softmax"))

# compile the RNN model
rnn_w2v.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [20]:
# train the model
rnn_w2v_history = rnn_w2v.fit(X_train, y_train, epochs=8, batch_size=16, validation_data=(X_val, y_val))

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


In [21]:
# evaluate on the validation data
loss, accuracy = rnn_w2v.evaluate(X_val, y_val)
print(f"Loss:\t{loss:.4f}")
print(f"Accuracy:\t{accuracy:.4f}")

Loss:	1.1746
Accuracy:	0.7400


In [22]:
# evaluate on the test data
loss, accuracy = rnn_w2v.evaluate(X_test, y_test)
print(f"Loss:\t{loss:.4f}")
print(f"Accuracy:\t{accuracy:.4f}")

Loss:	3.1745
Accuracy:	0.4000


Neither model is consistently better than the other: which one comes out ahead depends on the layers used, the embedding dimension, and even the batch size.

### Predict with the Two Embeddings

In [23]:
# using randomly initialized embedding
predictions = rnn.predict(X_test[:5])

for text, prediction, groundtruth in zip(tokenizer_test.sequences_to_texts(X_test[:5]), predictions, y_test[:5]):
    pred = prediction.tolist()
    groundtruth = groundtruth.tolist()
    print(f"Text: {text}\nPredicted: {pred.index(max(pred))}\nGroundtruth: {groundtruth.index(max(groundtruth))}\n")

Text: sustainable strategy red lines sustainable strategy range incorporate series proprietary red lines order ensure poorest performing companies esg perspective eligible investment
Predicted: 1
Groundtruth: 0

Text: verizons environmental health safety management system provides framework identifying controlling reducing risks associated environments operate besides regular management system assessments internal thirdparty compliance audits inspections performed annually hundreds facilities worldwide goal assessments identify correct sitespecific issues educate empower facility managers supervisors implement corrective actions verizons environment health safety efforts directed supported eperienced eperts around world support operations facilities
Predicted: 1
Groundtruth: 1

Text: 2019 company closed series transactions related sale canadian fossil fuelbased electricity generation business transaction heartland generation ltd affiliate energy capital partners included sale 10 partly

In [24]:
# using Word2Vec embedding
predictions = rnn_w2v.predict(X_test[:5])

for text, prediction, groundtruth in zip(tokenizer_test.sequences_to_texts(X_test[:5]), predictions, y_test[:5]):
    pred = prediction.tolist()
    groundtruth = groundtruth.tolist()
    print(f"Text: {text}\nPredicted: {pred.index(max(pred))}\nGroundtruth: {groundtruth.index(max(groundtruth))}\n")

Text: sustainable strategy red lines sustainable strategy range incorporate series proprietary red lines order ensure poorest performing companies esg perspective eligible investment
Predicted: 1
Groundtruth: 0

Text: verizons environmental health safety management system provides framework identifying controlling reducing risks associated environments operate besides regular management system assessments internal thirdparty compliance audits inspections performed annually hundreds facilities worldwide goal assessments identify correct sitespecific issues educate empower facility managers supervisors implement corrective actions verizons environment health safety efforts directed supported eperienced eperts around world support operations facilities
Predicted: 1
Groundtruth: 1

Text: 2019 company closed series transactions related sale canadian fossil fuelbased electricity generation business transaction heartland generation ltd affiliate energy capital partners included sale 10 partly

## Summary

### Embedding Comparison

Two embedding methods were compared: randomly initialized embeddings and Word2Vec embeddings. Word2Vec scored slightly higher on the test set, but neither model delivers satisfactory results.

#### Randomly Initialized Embedding
- Accuracy: 36.88%

The randomly initialized embedding reached 36.88% test accuracy. This figure varies noticeably with the number of layers, the embedding dimension, the batch size, and the number of training epochs, which shows how sensitive the approach is to hyperparameter settings.

#### Word2Vec Embedding
- Accuracy: 40%

The Word2Vec embedding did somewhat better, at 40% test accuracy. As with random initialization, the result depends heavily on the number of layers, the embedding dimension, the batch size, and the number of training epochs.

Reaching high accuracy with either embedding requires careful hyperparameter tuning, and the choice between the two should follow the requirements and constraints of the problem at hand. The accuracy values reported here are not absolute: they change with the model configuration and the data used.


_*Both accuracies were measured on the test set, not on the training data._