# 1. Objective

In this term project, you will use deep learning to build a classification model in RapidMiner that predicts consumer sentiment towards US airlines from reviews posted as tweets. If you strongly prefer another DL framework, such as TensorFlow or PyTorch, let me know before starting the work. This is a group project, and you should work in the groups you have already formed.

You can use any neural network model you like for this classification task. In particular, you may start with a simple fully connected network as a “baseline” and then try more complex models, including CNN- and RNN-based models, to outperform that baseline. Your goal is to reach a mean absolute error of 0.48 or lower, which should not be too difficult. If you want to be more ambitious, you can aim for a mean absolute error of 0.47 (medium difficulty) or even 0.46 (difficult). The lower the error (that is, the higher the accuracy), the more points you will be awarded.
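The link between mean absolute error and accuracy can be made concrete. The following is a minimal sketch, assuming the sentiment labels are binary (0/1) and the predictions are hard class labels; under that assumption MAE equals the misclassification rate, so an MAE of 0.48 would correspond to an accuracy of 0.52. The label lists are made up for illustration and are not taken from the project data.

```python
# Hypothetical labels and hard predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]

n = len(y_true)

# Mean absolute error: average of |true - predicted|.
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n

# Accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

# For 0/1 labels, every wrong prediction contributes exactly 1 to the
# absolute error, so MAE is the error rate, i.e. 1 - accuracy.
print(f"MAE: {mae:.3f}, accuracy: {accuracy:.3f}, 1 - accuracy: {1 - accuracy:.3f}")
```

This identity no longer holds once predictions are probabilities rather than hard labels.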

In addition to the simple NN baseline mentioned above, you should also build another basic baseline, such as a logistic regression model (similar to the one we used in the RapidMiner lab in class), and compare the performance of your DL-based model against it. The expectation is that the more sophisticated DL model should outperform the simple baselines.

In [1]:
# fetch data
import requests
from io import StringIO

# core
import numpy as np
import pandas as pd

# preprocessing
from sklearn.model_selection import train_test_split

# modeling
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# evaluation
from sklearn.metrics import mean_absolute_error

# 2. Fetch Data

In [2]:
# url of dataset
google_drive_file_url = 'https://drivesds.google.com/file/d/1c96crlNZr7XiF3-9lmZ1nEJaY3MHTTz5/view?usp=sharing'

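def fetch_google_drive_csv(google_drive_file_url):
    """Download a shared Google Drive CSV file and return it as a DataFrame."""
    # the file id is the second-to-last path segment of the sharing URL
    file_id = google_drive_file_url.split('/')[-2]
    dwn_url = 'https://drive.google.com/uc?export=download&id=' + file_id
    response = requests.get(dwn_url)
    response.raise_for_status()  # fail early on an HTTP error
    csv_raw = StringIO(response.text)
    return pd.read_csv(csv_raw)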

data = fetch_google_drive_csv(google_drive_file_url)
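The URL handling in `fetch_google_drive_csv` relies on the sharing link ending in `/<file id>/view?...`. The sketch below isolates that step so the assumption is visible; `extract_file_id` and the placeholder URLs are hypothetical names introduced here for illustration, and no network request is made.

```python
def extract_file_id(share_url: str) -> str:
    """Return the second-to-last '/'-separated segment of a sharing URL."""
    return share_url.split('/')[-2]

# Placeholder sharing link in the same shape as the one used above.
share_url = 'https://drive.google.com/file/d/FILE_ID_PLACEHOLDER/view?usp=sharing'
file_id = extract_file_id(share_url)
download_url = 'https://drive.google.com/uc?export=download&id=' + file_id

# A link without the trailing '/view...' segment breaks the assumption:
# the second-to-last segment is then 'd', not the file id.
bad_id = extract_file_id('https://drive.google.com/file/d/FILE_ID_PLACEHOLDER')

print(file_id, download_url, bad_id)
```

If links in other shapes need to be supported, a more defensive parser would be required; the split-based approach is only as robust as the link format.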

In [3]:
data.head()

|   | sentiment | dimension1 | dimension2 | dimension3 | dimension4 | dimension5 | dimension6 | dimension7 | dimension8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | -0.400418 | 0.293417 | -0.572702 | 0.125659 | 0.471714 | -0.034476 | 0.042176 | -0.429317 |
| 1 | 1 | -0.454608 | -0.194998 | -0.497063 | 0.242207 | 0.209621 | 0.064868 | 0.072154 | 0.629457 |
| 2 | 0 | -0.515892 | -0.120781 | -0.106512 | -0.260192 | 0.197666 | -0.155029 | -0.306803 | 0.694974 |
| 3 | 1 | 0.04777 | -0.230509 | 0.132355 | 0.174913 | 0.24204 | -0.229259 | -0.835945 | 0.294148 |
| 4 | 0 | -0.574353 | -0.132517 | -0.09161 | 0.466463 | 0.51098 | -0.33848 | 0.20204 | -0.100443 |


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55524 entries, 0 to 55523
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   sentiment   55524 non-null  int64  
 1   dimension1  55524 non-null  float64
 2   dimension2  55524 non-null  float64
 3   dimension3  55524 non-null  float64
 4   dimension4  55524 non-null  float64
 5   dimension5  55524 non-null  float64
 6   dimension6  55524 non-null  float64
 7   dimension7  55524 non-null  float64
 8   dimension8  55524 non-null  float64
dtypes: float64(8), int64(1)
memory usage: 3.8 MB


In [5]:
data.shape

(55524, 9)

# 3. Preprocess Data

In [6]:
target = 'sentiment'

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(target, axis=1), # predictors
    data[target], # target
    test_size=0.2,
    random_state=90
)
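The effect of `test_size=0.2` on the 55,524 rows can be checked by hand. The sketch below assumes scikit-learn's documented behaviour of rounding the test fraction up and giving the remainder to the training set, and it uses the batch size of 64 that appears later in the training call; it only does the arithmetic and does not touch the data.

```python
import math

n_samples = 55524   # rows reported by data.shape
test_size = 0.2     # fraction passed to train_test_split
batch_size = 64     # batch size used later in model.fit

# Test rows: the fraction of the data, rounded up.
n_test = math.ceil(test_size * n_samples)

# Training rows: whatever is left.
n_train = n_samples - n_test

# Mini-batches per epoch: the last batch may be smaller than batch_size.
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_test, steps_per_epoch)
```

The resulting number of steps per epoch matches the `695/695` counter shown in the training log.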

In [7]:
X_train.head()

|   | dimension1 | dimension2 | dimension3 | dimension4 | dimension5 | dimension6 | dimension7 | dimension8 |
|---|---|---|---|---|---|---|---|---|
| 21623 | -0.393855 | 0.001229 | -0.635915 | 0.251783 | 0.325279 | -0.352774 | 0.331207 | -0.192716 |
| 5150 | -0.759324 | 0.177525 | 0.013792 | 0.335651 | 0.41925 | -0.28949 | -0.060092 | -0.125991 |
| 20133 | -0.540323 | -0.507512 | 0.048441 | -0.265888 | 0.517273 | 0.328362 | -0.018433 | 0.041325 |
| 33608 | 0.118876 | -0.121272 | 0.200789 | -0.304994 | 0.358935 | -0.143573 | -0.812649 | 0.167265 |
| 46437 | -0.434374 | -0.511446 | 0.231434 | 0.175639 | 0.519935 | 0.417033 | -0.102997 | 0.102339 |


In [8]:
y_train.head()

21623    0
5150     1
20133    0
33608    0
46437    1
Name: sentiment, dtype: int64

# 4. Modeling

## 4.1 Base Model Performance

### 4.1.1 Logistic Regression

In [19]:
model_lr = LogisticRegression(random_state=90).fit(X_train, y_train)
y_preds_lr = model_lr.predict(X_test)
score_lr = mean_absolute_error(y_test, y_preds_lr)

print(f'Logistic Regression MAE: {score_lr:.4f}')

Logistic Regression MAE: 0.4299


### 4.1.2 Random Forest

In [20]:
model_rf = RandomForestClassifier(
    n_estimators=500,
    random_state=90
).fit(X_train, y_train)
y_preds_rf = model_rf.predict(X_test)
score_rf = mean_absolute_error(y_test, y_preds_rf)

print(f'Random Forest MAE: {score_rf:.4f}')

Random Forest MAE: 0.4149


## 4.2 Deep Learning

In [100]:
import tensorflow as tf

class CustomNN(tf.keras.Model):

    def __init__(self):
        super().__init__()
        self.dense1 = tf.keras.layers.Dense(32, input_dim=8, activation=tf.nn.relu)
        # NOTE: this single BatchNormalization layer (and its statistics) is
        # reused at all three positions in call(); use separate layers if
        # independent normalization is intended
        self.normalization = tf.keras.layers.BatchNormalization()
        self.dense2 = tf.keras.layers.Dense(32, activation=tf.nn.relu)
        self.dense3 = tf.keras.layers.Dense(32, activation=tf.nn.relu)
        self.dense4 = tf.keras.layers.Dense(16, activation=tf.nn.relu)
        self.dropout = tf.keras.layers.Dropout(0.50)
        self._output = tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)

    def call(self, inputs, training=False):
        x = self.dense1(inputs)
        x = self.normalization(x)
        x = self.dense2(x)
        # dropout is only active when training=True
        x = self.dropout(x, training=training)
        x = self.normalization(x)
        x = self.dense3(x)
        x = self.dropout(x, training=training)
        x = self.normalization(x)
        x = self.dense4(x)
        # sigmoid output: a value in (0, 1), not a hard class label
        return self._output(x)

model = CustomNN()

In [101]:
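# note: set_floatx only affects layers created after this call; to apply it
# to CustomNN, it would need to run before the model is constructed
tf.keras.backend.set_floatx('float64')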
callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=50)

optimizer = tf.keras.optimizers.SGD()
loss_fn = tf.keras.losses.mean_absolute_error

model.compile(optimizer=optimizer, loss=loss_fn)
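The patience logic behind the `EarlyStopping` callback can be illustrated without TensorFlow. The following is a simplified sketch, not Keras's actual implementation: it ignores options such as `min_delta` and `restore_best_weights`, and `early_stopping_epoch` is a hypothetical helper introduced here. The validation losses are copied from the first 16 epochs reported in this notebook's training log.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the monitored loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss   # new best value: reset the counter
            wait = 0
        else:
            wait += 1     # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# val_loss values from epochs 1-16 of the training log in this notebook.
val_losses = [0.4979, 0.4937, 0.4894, 0.4859, 0.4840, 0.4817, 0.4804, 0.4788,
              0.4783, 0.4767, 0.4763, 0.4763, 0.4762, 0.4763, 0.4755, 0.4756]

print(early_stopping_epoch(val_losses, patience=50))  # patience used above
print(early_stopping_epoch(val_losses, patience=1))   # much stricter setting
```

With `patience=50`, these 16 epochs cannot trigger a stop, since no run of 50 non-improving epochs is possible yet. A patience of 1 would stop at the first epoch that fails to improve on the best value.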

In [102]:
model.fit(
    X_train, 
    y_train,
    batch_size=64,
    epochs=500, 
    verbose=2,
    callbacks=[callback],
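    # note: the test set doubles as the validation set for early stopping,
    # so the final test score may be optimistic
    validation_data=(X_test, y_test)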
)

Epoch 1/500
695/695 - 1s - loss: 0.4979 - val_loss: 0.4979
Epoch 2/500
695/695 - 1s - loss: 0.4908 - val_loss: 0.4937
Epoch 3/500
695/695 - 1s - loss: 0.4849 - val_loss: 0.4894
Epoch 4/500
695/695 - 1s - loss: 0.4809 - val_loss: 0.4859
Epoch 5/500
695/695 - 1s - loss: 0.4792 - val_loss: 0.4840
Epoch 6/500
695/695 - 1s - loss: 0.4768 - val_loss: 0.4817
Epoch 7/500
695/695 - 1s - loss: 0.4752 - val_loss: 0.4804
Epoch 8/500
695/695 - 1s - loss: 0.4732 - val_loss: 0.4788
Epoch 9/500
695/695 - 1s - loss: 0.4726 - val_loss: 0.4783
Epoch 10/500
695/695 - 1s - loss: 0.4705 - val_loss: 0.4767
Epoch 11/500
695/695 - 1s - loss: 0.4697 - val_loss: 0.4763
Epoch 12/500
695/695 - 1s - loss: 0.4684 - val_loss: 0.4763
Epoch 13/500
695/695 - 1s - loss: 0.4673 - val_loss: 0.4762
Epoch 14/500
695/695 - 1s - loss: 0.4658 - val_loss: 0.4763
Epoch 15/500
695/695 - 1s - loss: 0.4638 - val_loss: 0.4755
Epoch 16/500
695/695 - 1s - loss: 0.4618 - val_loss: 0.4756
Epoch 17/500
695/695 - 1s - loss: 0.4583 - val_lo

<tensorflow.python.keras.callbacks.History at 0x7f89087a3e80>

In [104]:
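# note: these are sigmoid probabilities, not hard 0/1 labels like the
# baselines' predict() output, so the MAE values are not directly comparable
y_preds_nn = model.predict(X_test).flatten()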
score_nn = mean_absolute_error(y_test, y_preds_nn)

print(f'Deep Learning MAE: {score_nn:.4f}')

Deep Learning MAE: 0.4320
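The network outputs probabilities, whereas `LogisticRegression.predict` and `RandomForestClassifier.predict` return hard labels, so the three MAE figures are measured on different kinds of predictions. The sketch below shows the difference on made-up values, assuming binary 0/1 labels and a 0.5 decision threshold; the numbers are hypothetical and unrelated to the results above.

```python
def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical true labels and sigmoid outputs, for illustration only.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.4, 0.6, 0.3]

# MAE on raw probabilities penalizes every prediction that is not
# exactly 0 or 1, even when it falls on the correct side of 0.5.
mae_prob = mae(y_true, y_prob)

# Thresholding at 0.5 (an assumed cut-off) yields hard labels that are
# comparable with the baselines' predict() output.
y_label = [1 if p >= 0.5 else 0 for p in y_prob]
mae_label = mae(y_true, y_label)

print(f"MAE on probabilities: {mae_prob:.2f}")
print(f"MAE on thresholded labels: {mae_label:.2f}")
```

In this toy case every thresholded prediction is correct, yet the MAE on probabilities is about 0.3. Applying the same thresholding to `y_preds_nn` before scoring would put the network on the same footing as the two baselines.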
