## Part II: Neural Networks
Of all the ML models built so far, Logistic Regression performed best, with an F1 score of 57%. The F-measure was chosen because the dataset is imbalanced and because both false negatives and false positives matter. This result is far from ideal: it indicates only moderate performance, so neural networks were tried next to improve the predictions.
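
For reference, F1 is the harmonic mean of precision and recall, so it stays low whenever either one is low. The sketch below computes it from raw counts. The function name and the counts are hypothetical, chosen only for illustration; they are not taken from this project.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Return F1, the harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for a single class (illustration only)
example_f1 = f1_from_counts(tp=30, fp=10, fn=30)
print(round(example_f1, 3))
```

Here precision is 0.75 and recall is 0.5, and F1 sits closer to the weaker of the two, which is why it is a stricter summary than accuracy on imbalanced data.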

Neural networks are used extensively in natural language processing, where they provide powerful tools for modeling language. They are applied to many language problems, including unsupervised learning of word representations, supervised text classification, and language modeling. They are well suited to learning the complex underlying structure of a sentence and the semantic proximity of words. Neural networks are also more flexible than many other ML models: it is easy to experiment with different architectures by adding and removing layers as needed, and a trained network can keep training as new data comes in.

## Data Preparation and Exploration
Before building neural networks, several preprocessing steps were required.

In [43]:
# Import all required packages
import pandas as pd
import numpy
import random
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from keras import models
from keras import layers
from keras.utils import to_categorical  # public import path for to_categorical
from sklearn.preprocessing import LabelBinarizer
from keras.preprocessing.text import Tokenizer

import warnings
warnings.filterwarnings(action='ignore')


In [36]:
df = pd.read_csv('Data/Data.csv')
df.head()

| Unnamed: 0 | text | category |
|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Positive emotion |


In [37]:
X = df['text']
y = df['category']

In [38]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=42)
print(y_train.value_counts(),'\n\n', y_test.value_counts())

Neutral emotion     4452
Positive emotion    2372
Negative emotion     449
Name: category, dtype: int64 

 Neutral emotion     1092
Positive emotion     606
Negative emotion     121
Name: category, dtype: int64
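
To make the imbalance concrete, the counts printed above can be converted to class shares. This is a small sketch that only reuses the numbers from the output; the helper name `class_shares` is introduced here for illustration.

```python
# Class counts copied from the train/test output above
train_counts = {"Neutral emotion": 4452, "Positive emotion": 2372, "Negative emotion": 449}
test_counts = {"Neutral emotion": 1092, "Positive emotion": 606, "Negative emotion": 121}


def class_shares(counts):
    """Convert absolute class counts to fractions of the total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


train_shares = class_shares(train_counts)
test_shares = class_shares(test_counts)

for label in train_counts:
    print(f"{label:<17} train {train_shares[label]:.1%}  test {test_shares[label]:.1%}")
```

The Negative class makes up well under a tenth of either split, which is the motivation for the resampling and class-weighting steps that follow.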


In [39]:
X_train_final, X_val, y_train_final, y_val = train_test_split(X_train,
                                                              y_train,
                                                              test_size=0.2,
                                                              random_state=42)
print(y_train_final.value_counts(),'\n\n', y_val.value_counts())

Neutral emotion     3590
Positive emotion    1873
Negative emotion     355
Name: category, dtype: int64 

 Neutral emotion     862
Positive emotion    499
Negative emotion     94
Name: category, dtype: int64


In [41]:
# Encode each tweet as a binary bag-of-words vector (1 if the word occurs, else 0)
# Only keep the 2,500 most common words
tokenizer = Tokenizer(num_words=2500)
tokenizer.fit_on_texts(X_train_final)
X_train_tokens = tokenizer.texts_to_matrix(X_train_final, mode='binary')
X_val_tokens = tokenizer.texts_to_matrix(X_val, mode='binary')
X_test_tokens = tokenizer.texts_to_matrix(X_test, mode='binary')
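
The idea behind the binary matrix can be sketched in plain Python. This is a simplified stand-in, not the Keras implementation: it splits on whitespace only, and the function names and toy texts are hypothetical. Like the Keras tokenizer, it leaves column 0 unused and numbers words from 1 by descending frequency.

```python
from collections import Counter


def fit_vocabulary(texts, num_words):
    """Map the most frequent words to column indices 1..num_words-1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = [word for word, _ in counts.most_common(num_words - 1)]
    return {word: index for index, word in enumerate(most_common, start=1)}


def texts_to_binary_matrix(texts, vocabulary, num_words):
    """Return one row per text with a 1 in each column whose word occurs."""
    matrix = []
    for text in texts:
        row = [0] * num_words
        for word in text.lower().split():
            index = vocabulary.get(word)
            if index is not None:  # words outside the vocabulary are dropped
                row[index] = 1
        matrix.append(row)
    return matrix


# Toy texts (illustration only)
toy_texts = ["love the new ipad", "the iphone battery died", "love love love it"]
vocab = fit_vocabulary(toy_texts, num_words=6)
toy_matrix = texts_to_binary_matrix(toy_texts, vocab, num_words=6)
for row in toy_matrix:
    print(row)
```

Repeated words still produce a single 1, so the encoding records presence, not frequency or word order.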

In [42]:
# Transform the sentiment labels into one-hot vectors (one column per class)
lb = LabelBinarizer()
lb.fit(y_train_final)
y_train_lb = to_categorical(lb.transform(y_train_final))[:, :, 1]
y_val_lb = to_categorical(lb.transform(y_val))[:, :, 1]
y_test_lb = to_categorical(lb.transform(y_test))[:, :, 1]
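
The resulting label layout can be illustrated without any libraries. `LabelBinarizer` orders classes by sorting them, so with these three categories column 0 is Negative, column 1 is Neutral, and column 2 is Positive. The sketch below reproduces that layout on a few hypothetical labels.

```python
# A handful of example labels (illustration only)
labels = ["Neutral emotion", "Positive emotion", "Negative emotion", "Neutral emotion"]

# Sorted class names define the column order of the one-hot matrix
classes = sorted(set(labels))

# One row per label, with a single 1 in the column of its class
one_hot = [[1 if label == cls else 0 for cls in classes] for label in labels]

print(classes)
for label, row in zip(labels, one_hot):
    print(f"{label:<17} {row}")
```

This column order matters later: any per-class index, such as a metric's `class_id` or the keys of a class-weight dictionary, refers to these positions.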

In [44]:
pip install imbalanced-learn

Collecting imbalanced-learn
  Downloading imbalanced_learn-0.10.1-py3-none-any.whl (226 kB)
Collecting joblib>=1.1.1
  Downloading joblib-1.2.0-py3-none-any.whl (297 kB)
Installing collected packages: joblib, imbalanced-learn
  Attempting uninstall: joblib
    Found existing installation: joblib 1.1.0
    Uninstalling joblib-1.1.0:
      Successfully uninstalled joblib-1.1.0
Successfully installed imbalanced-learn-0.10.1 joblib-1.2.0
Note: you may need to restart the kernel to use updated packages.


In [47]:
# Use SMOTE class to improve the model's performance on the minority class
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)

# Oversample the training set with synthetic examples so the classes are balanced
X_train_resamples, y_train_resampled = smote.fit_resample(X_train_tokens, y_train_lb)
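
The core idea of SMOTE is to create a new minority example by interpolating between an existing minority example and one of its nearest minority neighbours. The sketch below shows that interpolation step only. It is a simplified illustration with hypothetical names and toy points, not the imbalanced-learn implementation.

```python
import random


def smote_like_samples(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)

    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority-class points (illustration only)
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like_samples(minority_points, n_new=5)
for point in new_points:
    print(point)
```

One consequence for binary bag-of-words input is that interpolated rows are generally no longer strictly 0/1, because a synthetic value can fall anywhere between the two endpoints.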

In [48]:
# Build a baseline neural network model
random.seed(123)
baseline_model = models.Sequential()
baseline_model.add(layers.Dense(50, activation='relu', input_shape=(2500,)))
baseline_model.add(layers.Dense(25, activation='relu'))
baseline_model.add(layers.Dense(3, activation='softmax'))
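
As a quick sanity check on the size of this architecture, the number of trainable parameters can be worked out by hand: each dense layer has one weight per input-unit pair plus one bias per unit. The helper name below is introduced for illustration.

```python
# Layer widths of the baseline model: input, two hidden layers, output
layer_sizes = [2500, 50, 25, 3]


def dense_parameter_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


per_layer = [dense_parameter_count(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total_parameters = sum(per_layer)

print(per_layer)         # [125050, 1275, 78]
print(total_parameters)  # 126403
```

Almost all of the parameters sit in the first layer, which connects the 2,500 input columns to the 50 hidden units.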



In [None]:
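# Compile the model; precision and recall track class 0, which is 'Negative emotion'
# because LabelBinarizer orders the classes alphabetically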
import keras
baseline_model.compile(optimizer='SGD',
                       loss='categorical_crossentropy',
                       metrics=['accuracy',
                                keras.metrics.Precision(name='precision', class_id=0),
                                keras.metrics.Recall(name='recall', class_id=0)])
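
What a per-class precision and recall measure can be sketched in plain Python. The example assumes the default threshold of 0.5 on the predicted probability for the chosen class; the function name and the toy arrays are hypothetical.

```python
def precision_recall_for_class(y_true, y_prob, class_id, threshold=0.5):
    """Precision and recall for one class, treating it as a binary problem."""
    tp = fp = fn = 0
    for true_row, prob_row in zip(y_true, y_prob):
        predicted = prob_row[class_id] > threshold
        actual = true_row[class_id] == 1
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy one-hot labels and predicted probabilities (illustration only)
toy_true = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
toy_prob = [[0.7, 0.2, 0.1], [0.3, 0.6, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]

toy_precision, toy_recall = precision_recall_for_class(toy_true, toy_prob, class_id=0)
print(toy_precision, toy_recall)
```

In the toy data, class 0 has one true positive, one false positive, and one false negative, so both values come out at 0.5.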

In [None]:
# Train the model, using class weights so that errors on the rare Negative class (index 0) cost the most
weights = {0: 5.41356, 1: 0.54487, 2: 1.02042}
baseline_model_val = baseline_model.fit(X_train_tokens,
                                        y_train_lb,
                                        class_weight=weights,
                                        epochs=250,
                                        batch_size=256,
                                        validation_data=(X_val_tokens, y_val_lb))
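
The notebook does not state how the class weights were derived. One common heuristic, the same formula scikit-learn uses for `class_weight='balanced'`, is `n_samples / (n_classes * class_count)`. The sketch below applies it to the `y_train_final` counts printed earlier. It is an assumption about a plausible derivation, not the author's method, and it gives values close to but not identical to the weights used above.

```python
# Counts of y_train_final from the output above, keyed by class index
# (0 = Negative, 1 = Neutral, 2 = Positive)
train_final_counts = {0: 355, 1: 3590, 2: 1873}

n_samples = sum(train_final_counts.values())
n_classes = len(train_final_counts)

# Balanced heuristic: rarer classes receive proportionally larger weights
balanced_weights = {
    class_id: n_samples / (n_classes * count)
    for class_id, count in train_final_counts.items()
}

for class_id, weight in balanced_weights.items():
    print(class_id, round(weight, 5))
```

Under this heuristic, every class contributes the same total weight to the loss, so the network cannot minimize it by favouring the majority class alone.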