# Overview

In this term project, you will use deep learning to build a classification model in RapidMiner that predicts the sentiment of consumers towards US airlines from their reviews posted as tweets. If you strongly prefer another DL framework instead of RapidMiner, such as TensorFlow or PyTorch, let me know before starting the work. This is a group project, and you should work on it in the groups that you have already formed.

In [1]:
# fetch data
import requests
from io import StringIO

# core
import numpy as np
import pandas as pd

# preprocessing
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

# baseline algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

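# deep learning
# note: this wrapper module is deprecated and missing from recent TensorFlow releases;
# the separate SciKeras package offers a similar KerasClassifier
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier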
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Dropout
from tensorflow.keras.losses import mean_absolute_error as tf_mae
from tensorflow.keras.optimizers import Adam

# evaluation
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_validate
from sklearn.metrics import mean_absolute_error as skl_mae

# 1. Fetch Data

The data is provided to you in two versions: 

1. The original version of the tweets (and their sentiments) is located at https://drive.google.com/file/d/1atyRH5Yz7TU-2ziyZknfd7ib2LLwYeuv/view?usp=sharing
2. The preprocessed version of the tweets is located at https://drive.google.com/file/d/1c96crlNZr7XiF3-9lmZ1nEJaY3MHTTz5/view?usp=sharing, where text preprocessing and pre-training of the tweet embeddings using autoencoders have already been done to make your life simpler. This preprocessed version contains the sentiment of each tweet in column 1 of the spreadsheet (either positive (1) or negative (0)) and the 8-dimensional pre-trained embedding of the tweet in columns 2 – 9 of the spreadsheet.

I recommend that you use the preprocessed version of the tweets, since building these embeddings is non-trivial and the preprocessed file will save you a lot of work. However, if you like challenges, you can preprocess the tweets and build the embeddings with autoencoders yourself and, therefore, work directly with the “raw” tweets. As a “reward” for this extra work, you will be awarded 10 extra points (the max score of this project is 100) if you preprocess the tweets yourself.
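
As a rough illustration of the layout described above, the sketch below parses two rows in the preprocessed format and separates the label from the 8-dimensional embedding. The inline `sample_csv` text and the helper name `split_row` are assumptions made for this example; the values mirror the first two rows of the preprocessed file, and the real data would be loaded from the link above.

```python
import csv
from io import StringIO

# Hypothetical two-row sample in the preprocessed layout:
# column 1 is the sentiment label, columns 2-9 are the embedding.
sample_csv = """sentiment,dimension1,dimension2,dimension3,dimension4,dimension5,dimension6,dimension7,dimension8
1,-0.400418,0.293417,-0.572702,0.125659,0.471714,-0.034476,0.042176,-0.429317
1,-0.454608,-0.194998,-0.497063,0.242207,0.209621,0.064868,0.072154,0.629457
"""

def split_row(row):
    """Return (label, embedding) for one parsed CSV row (hypothetical helper)."""
    label = int(row[0])                      # 1 = positive, 0 = negative
    embedding = [float(v) for v in row[1:]]  # the 8 embedding dimensions
    return label, embedding

reader = csv.reader(StringIO(sample_csv))
header = next(reader)                        # skip the column names
rows = [split_row(row) for row in reader]

for label, embedding in rows:
    print(label, len(embedding))
```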

In [2]:
# url of dataset
google_drive_file_url = 'https://drive.google.com/file/d/1c96crlNZr7XiF3-9lmZ1nEJaY3MHTTz5/view?usp=sharing'

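def fetch_google_drive_csv(google_drive_file_url):
    """Download a publicly shared Google Drive CSV and return it as a DataFrame."""
    # the file id is the second-to-last path segment of the share link
    file_id = google_drive_file_url.split('/')[-2]
    download_url = 'https://drive.google.com/uc?export=download&id=' + file_id
    response = requests.get(download_url)
    response.raise_for_status()  # fail early on an HTTP error instead of parsing an error page
    csv_raw = StringIO(response.text)
    return pd.read_csv(csv_raw)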

data = fetch_google_drive_csv(google_drive_file_url)

In [3]:
data.head()

Unnamed: 0,sentiment,dimension1,dimension2,dimension3,dimension4,dimension5,dimension6,dimension7,dimension8
0,1,-0.400418,0.293417,-0.572702,0.125659,0.471714,-0.034476,0.042176,-0.429317
1,1,-0.454608,-0.194998,-0.497063,0.242207,0.209621,0.064868,0.072154,0.629457
2,0,-0.515892,-0.120781,-0.106512,-0.260192,0.197666,-0.155029,-0.306803,0.694974
3,1,0.04777,-0.230509,0.132355,0.174913,0.24204,-0.229259,-0.835945,0.294148
4,0,-0.574353,-0.132517,-0.09161,0.466463,0.51098,-0.33848,0.20204,-0.100443


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55524 entries, 0 to 55523
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   sentiment   55524 non-null  int64  
 1   dimension1  55524 non-null  float64
 2   dimension2  55524 non-null  float64
 3   dimension3  55524 non-null  float64
 4   dimension4  55524 non-null  float64
 5   dimension5  55524 non-null  float64
 6   dimension6  55524 non-null  float64
 7   dimension7  55524 non-null  float64
 8   dimension8  55524 non-null  float64
dtypes: float64(8), int64(1)
memory usage: 3.8 MB


In [5]:
data.shape

(55524, 9)

In [6]:
data.astype('float32')

Unnamed: 0,sentiment,dimension1,dimension2,dimension3,dimension4,dimension5,dimension6,dimension7,dimension8
0,1.0,-0.400418,0.293417,-0.572702,0.125659,0.471714,-0.034476,0.042176,-0.429317
1,1.0,-0.454608,-0.194998,-0.497063,0.242207,0.209621,0.064868,0.072154,0.629457
2,0.0,-0.515892,-0.120781,-0.106512,-0.260192,0.197666,-0.155029,-0.306803,0.694974
3,1.0,0.047770,-0.230509,0.132355,0.174913,0.242040,-0.229259,-0.835945,0.294148
4,0.0,-0.574353,-0.132517,-0.091610,0.466463,0.510980,-0.338480,0.202040,-0.100443
...,...,...,...,...,...,...,...,...,...
55519,0.0,0.349226,-0.236151,0.256277,-0.167399,0.641524,0.501288,-0.230371,-0.112517
55520,1.0,0.194372,0.017959,-0.743399,0.242821,0.271741,-0.509180,0.124537,0.040954
55521,0.0,0.309441,-0.311137,-0.066972,0.175950,0.684461,0.503490,-0.057506,-0.216101
55522,0.0,-0.459892,0.036321,-0.695212,-0.030716,0.374198,-0.347496,0.013221,0.204852


# 2. Preprocess Data

Your task is to predict a sentiment score between 0 (negative) and 1 (positive) based on the embeddings of the tweets specified in columns 2 – 9 of the preprocessed spreadsheet (or on the original tweets if you decided to work with the raw tweet data). To evaluate the performance of your model, please split the dataset into a train set and a test set in a 0.8:0.2 ratio and use cross-validation to calculate the prediction performance.
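
To make the 0.8:0.2 split concrete, the following sketch works out the expected set sizes for a dataset of 55,524 rows. It assumes the test size is rounded up and the train set receives the remainder; the helper name `split_sizes` is hypothetical and is not part of any library.

```python
import math

def split_sizes(n_rows, test_fraction=0.2):
    """Return (n_train, n_test) for a hold-out split (hypothetical helper)."""
    n_test = math.ceil(n_rows * test_fraction)  # test size rounded up (assumed convention)
    n_train = n_rows - n_test                   # the rest is used for training
    return n_train, n_test

n_train, n_test = split_sizes(55524)
print(n_train, n_test)  # 44419 11105
```

Checking these numbers against the shapes of the actual train and test sets is a quick way to confirm the split was applied as intended.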

## 2.1 Train/Test Split

In [7]:
target = 'sentiment'

data = data.astype('float32')

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(target, axis=1), # predictors
    data[target], # target
    test_size=0.2,
    random_state=90
)

In [8]:
X_train.shape

(44419, 8)

In [9]:
X_test.shape

(11105, 8)

## 2.2 Cross-Validation

In [10]:
cv_splits = KFold(n_splits=5, random_state=90, shuffle=True)

In [11]:
for fold_num, (train, test) in enumerate(cv_splits.split(X_train, y_train), start=1):
    print(f'Fold #{fold_num}: Train shape: {X_train.iloc[train].shape}, Test shape: {X_train.iloc[test].shape}')

Fold #1: Train shape: (35535, 8), Test shape: (8884, 8)
Fold #2: Train shape: (35535, 8), Test shape: (8884, 8)
Fold #3: Train shape: (35535, 8), Test shape: (8884, 8)
Fold #4: Train shape: (35535, 8), Test shape: (8884, 8)
Fold #5: Train shape: (35536, 8), Test shape: (8883, 8)
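
The fold sizes printed above follow from simple integer arithmetic. As a sketch, assuming the convention that the first `n % k` folds receive one extra sample (which is consistent with the output above), the hypothetical helper `fold_sizes` reproduces them:

```python
def fold_sizes(n_samples, n_splits):
    """Validation-fold sizes for k-fold CV (hypothetical helper)."""
    base, extra = divmod(n_samples, n_splits)
    # the first `extra` folds get one additional sample
    return [base + 1 if i < extra else base for i in range(n_splits)]

sizes = fold_sizes(44419, 5)
print(sizes)                       # [8884, 8884, 8884, 8884, 8883]
print([44419 - s for s in sizes])  # [35535, 35535, 35535, 35535, 35536]
```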


In [12]:
def nn_grid_search(model):
    
    param_grid = {
        'epochs': [50],
        'batch_size':[25, 50, 100, 250, 500]
    }

    grid = GridSearchCV(
        estimator=model,
        param_grid=param_grid,
        scoring='neg_mean_absolute_error',
        n_jobs=-1,
        cv=cv_splits
    )

    grid_result = grid.fit(X_train, y_train)
    
    # summarize results (scores are negated MAE, so flip the sign for display)
    print(f"Best: {-grid_result.best_score_:.4f} using {grid_result.best_params_}")
    means = -grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    params = grid_result.cv_results_['params']
    for mean, stdev, param in zip(means, stds, params):
        print(f"{mean:.4f} +/- {stdev:.4f} with: {param}")
    
    return grid_result

# 3. Neural Networks

You can use any neural network model you like for this classification task. In particular, you may start with a simple fully connected network as a “baseline” and then try more complex models, including CNN- and RNN-based models, to achieve better performance than this simple baseline model. Your goal is to reach a mean absolute error of 0.48 or lower, which should not be too difficult. If you want to be more ambitious, you can try to reach a mean absolute error of 0.47 (medium difficulty), or even 0.46 (this is difficult). The higher the accuracy you get (that is, the lower the error), the more points you will be awarded.
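
For reference, the mean absolute error is the average of the absolute differences between the labels and the predictions. The sketch below, using made-up labels and predictions, illustrates two properties that help when reading the targets above: a constant prediction of 0.5 always scores exactly 0.5 on binary labels, and for hard 0/1 predictions the MAE equals the misclassification rate. The function and the toy data are assumptions for this example only.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between labels and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # toy binary labels
constant = [0.5] * len(y_true)      # uninformative predictor
hard = [1, 0, 0, 1, 0, 1, 1, 0]     # hard 0/1 predictions with 2 mistakes

print(mean_absolute_error(y_true, constant))  # 0.5
print(mean_absolute_error(y_true, hard))      # 0.25, i.e. 2 errors out of 8
```

This distinction matters when comparing numbers: scoring hard class labels yields 1 minus accuracy, while scoring predicted probabilities also reflects how confident each prediction is.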

In addition to the simple NN baseline mentioned above, you should also build another basic baseline, such as a logistic regression model (similar to the one we used in the RapidMiner Lab done in class), and compare the performance of your DL-based model with that baseline. The expectation is that the more sophisticated DL model should outperform the simple baselines.

## 3.1 Baseline Model Performance

### 3.1.1 Logistic Regression

In [13]:
%%time

model_lr = LogisticRegression()
cv_results_lr = cross_validate(model_lr, X_train, y_train, scoring='neg_mean_absolute_error', cv=cv_splits)
cv_lr_mu = -cv_results_lr['test_score'].mean()
cv_lr_sd = cv_results_lr['test_score'].std()

print(f'Logistic Regression MAE: {cv_lr_mu:.4f} +/- {cv_lr_sd:.4f}\n')

Logistic Regression MAE: 0.4366 +/- 0.0041

CPU times: user 639 ms, sys: 98.3 ms, total: 737 ms
Wall time: 214 ms


In [14]:
_ = model_lr.fit(X_train, y_train)

### 3.1.2 Random Forest

In [15]:
%%time

model_rf = RandomForestClassifier(n_estimators=500)
cv_results_rf = cross_validate(model_rf, X_train, y_train, scoring='neg_mean_absolute_error', cv=cv_splits)
cv_rf_mu = -cv_results_rf['test_score'].mean()
cv_rf_sd = cv_results_rf['test_score'].std()

print(f'Random Forest MAE: {cv_rf_mu:.4f} +/- {cv_rf_sd:.4f}\n')

Random Forest MAE: 0.4177 +/- 0.0050

CPU times: user 3min 31s, sys: 2.05 s, total: 3min 33s
Wall time: 3min 35s


In [16]:
_ = model_rf.fit(X_train, y_train)

### 3.1.3 Simple Neural Network

In [17]:
%%time

def create_model_nn0():
    
    # model configuration
    loss_function = tf_mae
    optimizer = Adam()
    
    model = Sequential([
        Dense(32, activation='relu', input_dim=8),
        Dense(16, activation='relu'),
        Dense(1, activation='sigmoid')
    ])
    
    model.compile(loss=loss_function,
                  optimizer=optimizer)
    
    return model
 
model_nn_0 = KerasClassifier(build_fn=create_model_nn0, verbose=0)
cv_results_nn_0 = nn_grid_search(model_nn_0)

Best: 0.4428 using {'batch_size': 500, 'epochs': 50}
0.4621 +/- 0.0189 with: {'batch_size': 25, 'epochs': 50}
0.4531 +/- 0.0226 with: {'batch_size': 50, 'epochs': 50}
0.4616 +/- 0.0209 with: {'batch_size': 100, 'epochs': 50}
0.4524 +/- 0.0247 with: {'batch_size': 250, 'epochs': 50}
0.4428 +/- 0.0218 with: {'batch_size': 500, 'epochs': 50}
CPU times: user 3.54 s, sys: 751 ms, total: 4.29 s
Wall time: 2min 24s


## 3.2 Deep Learning

### 3.2.1 DL Model #1

In [18]:
%%time

def create_model_nn1():
    
    # model configuration
    loss_function = tf_mae
    optimizer = Adam()
    
    model = Sequential([
        Dense(8, activation='relu', input_dim=8),
        Dense(8, activation='relu'),
        Dense(8, activation='relu'),
        Dense(8, activation='relu'),
        Dense(8, activation='relu'),
        Dense(8, activation='relu'),
        Dense(8, activation='relu'),
        Dense(8, activation='relu'),
        Dense(1, activation='sigmoid')
    ])
    
    model.compile(loss=loss_function,
                  optimizer=optimizer)
    
    return model
 
    
model_nn_1 = KerasClassifier(build_fn=create_model_nn1, verbose=0)
cv_results_nn_1 = nn_grid_search(model_nn_1)

Best: 0.4488 using {'batch_size': 50, 'epochs': 50}
0.4629 +/- 0.0151 with: {'batch_size': 25, 'epochs': 50}
0.4488 +/- 0.0138 with: {'batch_size': 50, 'epochs': 50}
0.4564 +/- 0.0132 with: {'batch_size': 100, 'epochs': 50}
0.4556 +/- 0.0163 with: {'batch_size': 250, 'epochs': 50}
0.4624 +/- 0.0146 with: {'batch_size': 500, 'epochs': 50}
CPU times: user 36.7 s, sys: 6.34 s, total: 43.1 s
Wall time: 2min 59s


### 3.2.2 DL Model #2

In [19]:
%%time

def create_model_nn2():
    
    # model configuration
    loss_function = tf_mae
    optimizer = Adam()
    
    model = Sequential([
        Dense(64, activation='relu', input_dim=8),
        BatchNormalization(),
        Dense(32, activation='relu'),
        BatchNormalization(),
        Dense(16, activation='relu'),
        Dense(1, activation='sigmoid')
    ])
    
    model.compile(loss=loss_function,
                  optimizer=optimizer)
    
    return model
 
model_nn_2 = KerasClassifier(build_fn=create_model_nn2, verbose=0)
cv_results_nn_2 = nn_grid_search(model_nn_2)

Best: 0.4205 using {'batch_size': 500, 'epochs': 50}
0.4256 +/- 0.0047 with: {'batch_size': 25, 'epochs': 50}
0.4230 +/- 0.0046 with: {'batch_size': 50, 'epochs': 50}
0.4212 +/- 0.0055 with: {'batch_size': 100, 'epochs': 50}
0.4217 +/- 0.0057 with: {'batch_size': 250, 'epochs': 50}
0.4205 +/- 0.0050 with: {'batch_size': 500, 'epochs': 50}
CPU times: user 10.9 s, sys: 1.97 s, total: 12.9 s
Wall time: 4min 24s
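
Collecting the best cross-validated MAE values reported above makes the comparison against the baselines easier to read. The sketch below simply ranks those numbers; the dictionary name `cv_mae` is an assumption, and the values are copied from the outputs above rather than recomputed.

```python
# Best cross-validated MAE per model, copied from the outputs above.
cv_mae = {
    'logistic regression': 0.4366,
    'random forest': 0.4177,
    'simple NN': 0.4428,
    'DL #1': 0.4488,
    'DL #2': 0.4205,
}

# Rank models from lowest (best) to highest MAE.
ranking = sorted(cv_mae.items(), key=lambda item: item[1])
for name, mae in ranking:
    print(f'{name:<20} {mae:.4f}')

# Models that beat the logistic regression baseline in cross-validation.
baseline = cv_mae['logistic regression']
better = [name for name, mae in ranking if mae < baseline]
print(better)  # ['random forest', 'DL #2']
```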


# 4. Evaluation

After you build your neural network, apply the trained deep learning model to the test set and evaluate its performance using accuracy measures.
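
As one possible reading of "accuracy measures", the sketch below thresholds predicted probabilities at 0.5 and reports accuracy, precision, and recall. The threshold, the helper name `binary_metrics`, and the toy inputs are assumptions for illustration and are not prescribed by the assignment.

```python
def binary_metrics(y_true, probs, threshold=0.5):
    """Accuracy, precision and recall after thresholding (hypothetical helper)."""
    preds = [1 if p >= threshold else 0 for p in probs]
    pairs = list(zip(y_true, preds))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        'accuracy': (tp + tn) / len(pairs),
        'precision': tp / (tp + fp) if tp + fp else 0.0,
        'recall': tp / (tp + fn) if tp + fn else 0.0,
    }

y_true = [1, 0, 1, 1, 0, 0]                 # toy labels
probs = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1]      # toy predicted probabilities

metrics = binary_metrics(y_true, probs)
for name, value in metrics.items():
    print(f'{name}: {value:.3f}')
```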

In [24]:
# baseline algo predictions (probability of the positive class)
preds_lr = model_lr.predict_proba(X_test)[:,1]
preds_rf = model_rf.predict_proba(X_test)[:,1]

# predict test set for NN models
preds_nn_0 = cv_results_nn_0.best_estimator_.predict_proba(X_test)[:,1]
preds_nn_1 = cv_results_nn_1.best_estimator_.predict_proba(X_test)[:,1]
preds_nn_2 = cv_results_nn_2.best_estimator_.predict_proba(X_test)[:,1]

# evaluation of test predictions
mae_test_lr = skl_mae(y_test, preds_lr)
mae_test_rf = skl_mae(y_test, preds_rf)
mae_test_nn_0 = skl_mae(y_test, preds_nn_0)
mae_test_nn_1 = skl_mae(y_test, preds_nn_1)
mae_test_nn_2 = skl_mae(y_test, preds_nn_2)

test_results_df = pd.DataFrame({
    'Model': ['logistic regression', 'random forest', 'simple NN', 'DL #1', 'DL #2'],
    'MAE': [mae_test_lr, mae_test_rf, mae_test_nn_0, mae_test_nn_1, mae_test_nn_2]
}).sort_values('MAE').reset_index(drop=True)

test_results_df

Unnamed: 0,Model,MAE
0,DL #2,0.415586
1,random forest,0.460184
2,DL #1,0.467447
3,simple NN,0.467448
4,logistic regression,0.489189
