# CYBER BULLYING MODELLING NOTEBOOK

---

## NCSU CSC 591: Algorithms for Data Guided Business Intelligence

---
As social media usage grows across all age groups, the great majority of individuals rely on this crucial medium for day-to-day communication. Because of the pervasiveness of social media, cyberbullying may affect anybody at any time or from any location, and the internet's relative anonymity makes such personal attacks more difficult to stop than conventional bullying.


In light of this, this dataset comprises more than 47,000 tweets labeled with the following cyberbullying categories: Age, Ethnicity, Gender, Religion, Other type of cyberbullying, and Not cyberbullying.

Trigger Warning: These tweets either describe a bullying incident or are themselves the offense; read them only as far as you are comfortable.

---

#### Contributors: Anmolika Goyal(agoyal4), Anshul Navinbhai Patel(apatel28), Shubhangi Jain(sjain29)

---



### Connect Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


### Import Libraries

In [None]:
# Installing the libraries
!pip install kaggle
!pip install emoji==1.6.3

Collecting emoji==1.6.3
  Downloading emoji-1.6.3.tar.gz (174 kB)
     |████████████████████████████████| 174 kB 5.0 MB/s 
Building wheels for collected packages: emoji
  Building wheel for emoji (setup.py) ... done
  Created wheel for emoji: filename=emoji-1.6.3-py3-none-any.whl size=170298 sha256=fc060435f3d56c9937c23d71d10d51b818723f22bb8e79e5042fc947ab52aa98
  Stored in directory: /root/.cache/pip/wheels/03/8b/d7/ad579fbef83c287215c0caab60fb0ae0f30c4d7ce5f580eade
Successfully built emoji
Installing collected packages: emoji
Successfully installed emoji-1.6.3


In [None]:
# Upload the kaggle.json
# Reference: https://www.analyticsvidhya.com/blog/2021/06/how-to-load-kaggle-datasets-directly-into-google-colab/
! mkdir ~/.kaggle
!cp "/content/gdrive/MyDrive/Github Repos/kaggle.json" ~/.kaggle/kaggle.json
! chmod 600 ~/.kaggle/kaggle.json

In [None]:
# General Libraries
import kaggle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re, string
import emoji
from tqdm import tqdm
# Model Saving
import joblib
import pickle
# Scikit-Learn Functions
from sklearn import preprocessing, decomposition, metrics, pipeline
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
# Machine Learning
import xgboost as xgb
# NLTK
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords') 
nltk.download('punkt')
stop_words = stopwords.words('english')
# Keras
from keras.models import Sequential
from keras.layers import LSTM, GRU, Dense, Activation, Dropout, Embedding
from keras.utils import np_utils
from keras.layers import GlobalMaxPooling1D, Conv1D, MaxPooling1D, Flatten, Bidirectional, SpatialDropout1D
from keras.preprocessing import sequence, text
from keras.callbacks import EarlyStopping
from tensorflow.keras.layers import BatchNormalization

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


### Load the dataset from kaggle

Dataset Link: https://www.kaggle.com/datasets/andrewmvd/cyberbullying-classification

In [None]:
!kaggle datasets download -d andrewmvd/cyberbullying-classification

Downloading cyberbullying-classification.zip to /content
  0% 0.00/2.82M [00:00<?, ?B/s]
100% 2.82M/2.82M [00:00<00:00, 78.1MB/s]


In [None]:
!unzip /content/cyberbullying-classification.zip

Archive:  /content/cyberbullying-classification.zip
  inflating: cyberbullying_tweets.csv  


In [None]:
# Load dataset from csv
df = pd.read_csv('/content/cyberbullying_tweets.csv')
df.head()

Unnamed: 0,tweet_text,cyberbullying_type
0,"In other words #katandandre, your food was cra...",not_cyberbullying
1,Why is #aussietv so white? #MKR #theblock #ImA...,not_cyberbullying
2,@XochitlSuckkks a classy whore? Or more red ve...,not_cyberbullying
3,"@Jason_Gio meh. :P thanks for the heads up, b...",not_cyberbullying
4,@RudhoeEnglish This is an ISIS account pretend...,not_cyberbullying


### Preprocessing the Dataset

Using the information gathered from Exploratory Data Analysis Notebook: [Link](https://github.com/anshulp2912/Cyberbullying_Tweet_Classification/blob/main/src/cyberbullying_EDA.ipynb)

In [None]:
# Removing the duplicate rows from dataset
dataset = df.drop_duplicates()
dataset.head()

Unnamed: 0,tweet_text,cyberbullying_type
0,"In other words #katandandre, your food was cra...",not_cyberbullying
1,Why is #aussietv so white? #MKR #theblock #ImA...,not_cyberbullying
2,@XochitlSuckkks a classy whore? Or more red ve...,not_cyberbullying
3,"@Jason_Gio meh. :P thanks for the heads up, b...",not_cyberbullying
4,@RudhoeEnglish This is an ISIS account pretend...,not_cyberbullying


In [None]:
# Define preprocessing functions

# Remove emojis from text
def remove_emoji(txt):
  txt = re.sub(emoji.get_emoji_regexp(), r"", txt)
  return txt

# Expand common contractions
def expand_txt(txt):
  txt = re.sub(r"\'d", " would", txt)
  txt = re.sub(r"\'ll", " will", txt)
  txt = re.sub(r"can\'t", "can not", txt)
  txt = re.sub(r"\'ve", " have", txt)
  txt = re.sub(r"\'re", " are", txt)
  txt = re.sub(r"\'s", " is", txt)
  txt = re.sub(r"\'m", " am", txt)
  txt = re.sub(r"n\'t", " not", txt)
  txt = re.sub(r"\'t", " not", txt)
  return txt

# Remove line breaks, non-ASCII characters, punctuation, and stop words
def clean_nonwanted_chars(txt):
  # Remove line breaks
  txt = txt.replace('\n', ' ')
  txt = txt.replace('\r', '')
  # Remove non-ASCII characters
  txt = re.sub(r'[^\x00-\x7f]',r'', txt)
  # Remove punctuation (string.punctuation includes '#', '_', '&', '$' and '@')
  punc_remove = string.punctuation
  punc_list = str.maketrans('', '', punc_remove)
  txt = txt.translate(punc_list)
  # Remove stop words
  txt = [word for word in txt.split() if word not in stop_words]
  txt = ' '.join(txt)
  return txt

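# Remove trailing hashtags, then split on any remaining '#' or '_'
def remove_hash(txt):
  # Raw strings keep \b as a regex word boundary instead of a backspace character
  txt = " ".join(word.strip() for word in re.split(r'#(?!(?:hashtag)\b)[\w-]+(?=(?:\s+#[\w-]+)*\s*$)', txt)) 
  txt = " ".join(word.strip() for word in re.split(r'#|_', txt))
  return txt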

# Drop words that contain '&' or '$'
def remove_chars(txt):
    clean = []
    for word in txt.split(' '):
        if ('&' in word) or ('$' in word):
            clean.append('')
        else:
            clean.append(word)
    txt = ' '.join(clean)
    return txt

# Collapse runs of whitespace (spaces, tabs) into a single space
def remove_space(txt):
  txt = re.sub(r"\s\s+" , " ", txt)
  return txt

In [None]:
# Process the textual data
def preprocess_text(txt):
  txt = txt.lower()
  txt = remove_emoji(txt)
  txt = expand_txt(txt)
  txt = clean_nonwanted_chars(txt)
  txt = remove_hash(txt)
  txt = remove_chars(txt)
  txt = remove_space(txt)
  # Stemming the text
  tokens = nltk.word_tokenize(txt)
  PS = nltk.stem.PorterStemmer()
  txt = ' '.join([PS.stem(words) for words in tokens])
  return txt

# Generate clean text
clean_txt = []
for txt in list(dataset.tweet_text.values):
  clean_txt.append(preprocess_text(txt))

# Replace text in dataframe (assign returns a new DataFrame, which avoids the SettingWithCopyWarning)
dataset = dataset.assign(tweet_text=clean_txt)


In [None]:
# Removing the duplicate rows from dataset again after cleaning
dataset = dataset.drop_duplicates()
dataset.head()

Unnamed: 0,tweet_text,cyberbullying_type
0,word katandandr food crapilici mkr,not_cyberbullying
1,aussietv white mkr theblock imacelebrityau tod...,not_cyberbullying
2,xochitlsuckkk classi whore red velvet cupcak,not_cyberbullying
3,jasongio meh p thank head concern anoth angri ...,not_cyberbullying
4,rudhoeenglish isi account pretend kurdish acco...,not_cyberbullying
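
One consequence of the step order in `preprocess_text` can be checked with the standard library alone. `string.punctuation` already contains `#`, `_`, `&`, and `$`, so once `clean_nonwanted_chars` has run, the later hashtag and special-character steps have nothing left to match. This is consistent with handles such as `jasongio` surviving in the cleaned output. The sketch below uses a simplified stand-in (`strip_punctuation` is a hypothetical helper and the sample text is made up), not the notebook's own function.

```python
import string

def strip_punctuation(txt):
    """Drop every character listed in string.punctuation (hypothetical helper)."""
    return txt.translate(str.maketrans('', '', string.punctuation))

# Made-up sample text for illustration
sample = "why is #aussietv so white? @some_user pays $5 & leaves"
stripped = strip_punctuation(sample)

# The symbols that the later steps look for are all punctuation characters
later_symbols = ['#', '_', '&', '$']
print([sym in string.punctuation for sym in later_symbols])
print(stripped)
print(any(sym in stripped for sym in later_symbols))
```

If hashtags or mentions should be handled differently from other punctuation, those steps would need to run before the punctuation is stripped.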


In [None]:
# Generate the X and y dataset
X = dataset.tweet_text.values
y = dataset.cyberbullying_type.values

In [None]:
# Convert the textual labels to numerical values
LE = preprocessing.LabelEncoder()
y = LE.fit_transform(y)

# Save the label encoder
print('\nSaving Label Encoder...')
with open('/content/gdrive/Shareddrives/ADBI_Capstone/models/LE.pkl', 'wb') as f:
    pickle.dump(LE, f)


Saving Label Encoder...


In [None]:
# Split the Dataset into Train(70%), Validation(15%), Test(15%)
X_train, X_remaining, y_train, y_remaining = train_test_split(X, y, stratify=y, test_size=0.3, shuffle=True, random_state=111)
X_val, X_test, y_val, y_test = train_test_split(X_remaining, y_remaining, stratify=y_remaining, test_size=0.5, shuffle=True, random_state=111)
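
The two-stage split can be sanity-checked with plain arithmetic. The sketch below assumes that, for a fractional `test_size`, the held-out part is rounded up and the remainder goes to the other part. The total is taken from the set sizes reported by the progress bars later in this notebook; `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_first, n_second) for a fractional test_size, rounding the second part up."""
    n_second = math.ceil(test_size * n_samples)
    return n_samples - n_second, n_second

# Set sizes reported by the progress bars later in the notebook
n_total = 33185 + 7111 + 7112

n_train, n_remaining = split_sizes(n_total, 0.3)   # 70% train, 30% held out
n_val, n_test = split_sizes(n_remaining, 0.5)      # held-out part halved

print(n_train, n_val, n_test)
print(round(n_train / n_total, 3), round(n_val / n_total, 3), round(n_test / n_total, 3))
```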

### Generate word vectors from sentences


*   Count Vectorizer
*   TFIDF Vectorizer



In [None]:
# Count Vectorizer
CV = CountVectorizer(analyzer='word',max_features=3000,token_pattern=r'\w{1,}',ngram_range=(1, 3), stop_words = 'english')
CV.fit(list(X_train)+list(X_val)+list(X_test))
X_train_CV = CV.transform(X_train)
X_val_CV = CV.transform(X_val)
X_test_CV = CV.transform(X_test)

# Save the vectorizer
print('\nSaving Count Vectorizer...')
with open('/content/gdrive/Shareddrives/ADBI_Capstone/models/CountVectorizer.pkl', 'wb') as f:
    pickle.dump(CV, f)


Saving Count Vectorizer...
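
The cell above fits the vocabulary on the training, validation, and test texts together. A common alternative is to learn the vocabulary from the training split alone, so that no held-out text influences the features. The pure-Python sketch below illustrates that idea with word 1- to 3-grams and a `max_features` cap. `ngrams`, `fit_vocabulary`, and `transform` are hypothetical helpers (not scikit-learn's API), the texts are toy data, and the tie-breaking among equally frequent n-grams is an assumption.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=3):
    """Yield all word n-grams with n_min <= n <= n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield ' '.join(tokens[i:i + n])

def fit_vocabulary(texts, max_features):
    """Keep the max_features most frequent n-grams seen in `texts`."""
    counts = Counter(g for t in texts for g in ngrams(t.split()))
    # Sort by descending frequency, then alphabetically (tie-break is an assumption)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = sorted(term for term, _ in ranked[:max_features])
    return {term: idx for idx, term in enumerate(kept)}

def transform(texts, vocabulary):
    """Map each text to a count vector over the fixed vocabulary."""
    rows = []
    for t in texts:
        row = [0] * len(vocabulary)
        for g in ngrams(t.split()):
            if g in vocabulary:
                row[vocabulary[g]] += 1
        rows.append(row)
    return rows

# Toy data for illustration
train_texts = ["red velvet cupcak", "red velvet cake", "angri head"]
val_texts = ["red cupcak unseenword"]

vocab = fit_vocabulary(train_texts, max_features=5)  # training split only
val_rows = transform(val_texts, vocab)               # unseen n-grams are ignored
print(vocab)
print(val_rows)
```

With the notebook's objects, the equivalent change would be to call `CV.fit(X_train)` and keep the three `transform` calls as they are. The reported scores were obtained with the vocabulary fitted on all splits.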


In [None]:
# TFIDF Vectorizer
# Boolean options are passed as True (recent scikit-learn versions reject the integer 1 here)
TFIDF = TfidfVectorizer(min_df=3,  max_features=3000, strip_accents='unicode', analyzer='word',token_pattern=r'\w{1,}', ngram_range=(1, 3), use_idf=True,smooth_idf=True,sublinear_tf=True, stop_words = 'english')
TFIDF.fit(list(X_train)+list(X_val)+list(X_test))
X_train_TFIDF = TFIDF.transform(X_train)
X_val_TFIDF = TFIDF.transform(X_val)
X_test_TFIDF = TFIDF.transform(X_test)

# Save the vectorizer
print('\nSaving TFIDF Vectorizer...')
with open('/content/gdrive/Shareddrives/ADBI_Capstone/models/TFIDFVectorizer.pkl', 'wb') as f:
    pickle.dump(TFIDF, f)


Saving TFIDF Vectorizer...
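
To make the `sublinear_tf` and `smooth_idf` options more concrete, here is a minimal sketch of the weighting for a single document. It assumes the commonly documented formulas: term frequency `1 + ln(tf)`, inverse document frequency `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization. `tfidf_row` is a hypothetical helper and the corpus statistics are made up.

```python
import math

def tfidf_row(term_counts, doc_freq, n_docs):
    """Weight one document's raw term counts and L2-normalize the result."""
    weights = {}
    for term, count in term_counts.items():
        tf = 1.0 + math.log(count)                                  # sublinear term frequency
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0   # smoothed inverse document frequency
        weights[term] = tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

# Toy corpus statistics (made up for illustration)
n_docs = 4
doc_freq = {'bulli': 4, 'school': 1}

row = tfidf_row({'bulli': 2, 'school': 1}, doc_freq, n_docs)
print(row)
```

In this toy case the rarer term `school` receives the larger weight even though it occurs less often in the document, which is the effect the IDF factor is meant to produce.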


### Model Training

#### Machine Learning Models


*   Logistic Regression
*   Naive Bayes
*   XGBoost
*   Support Vector Machines


In [None]:
# Logistic Regression
print('LOGISTIC REGRESSION MODEL')

# Count Vectorizer
print('\nCount Vectorizer Model')
model_CV_LR = LogisticRegression(solver='saga')
model_CV_LR.fit(X_train_CV, y_train)
y_pred_val_CV = model_CV_LR.predict_proba(X_val_CV)
y_pred_test_CV = model_CV_LR.predict_proba(X_test_CV)
# Cross-Validation Score
scores = cross_val_score(model_CV_LR, X_train_CV, y_train, cv=5)
scores = pd.Series(scores)
print('CV Cross Validation Train Accuracy: %0.3f'%scores.mean())
# Calculate Multinomial Log loss 
print('CV Validation Loss: %0.3f'%metrics.log_loss(y_val, y_pred_val_CV))
print('CV Test Loss: %0.3f'%metrics.log_loss(y_test, y_pred_test_CV))

# TFIDF Vectorizer
print('\nTFIDF Vectorizer Model')
model_TFIDF_LR = LogisticRegression(solver='saga')
model_TFIDF_LR.fit(X_train_TFIDF, y_train)
y_pred_val_TFIDF = model_TFIDF_LR.predict_proba(X_val_TFIDF)
y_pred_test_TFIDF = model_TFIDF_LR.predict_proba(X_test_TFIDF)
# Cross-Validation Score
scores = cross_val_score(model_TFIDF_LR, X_train_TFIDF, y_train, cv=5)
scores = pd.Series(scores)
print('TFIDF Cross Validation Train Accuracy: %0.3f'%scores.mean())
# Calculate Multinomial Log loss 
print('TFIDF Validation Loss: %0.3f'%metrics.log_loss(y_val, y_pred_val_TFIDF))
print('TFIDF Test Loss: %0.3f'%metrics.log_loss(y_test, y_pred_test_TFIDF))

# Save the models
print('\nSaving Model...')
joblib.dump(model_CV_LR, '/content/gdrive/Shareddrives/ADBI_Capstone/models/model_CV_LR.sav')
joblib.dump(model_TFIDF_LR, '/content/gdrive/Shareddrives/ADBI_Capstone/models/model_TFIDF_LR.sav')

LOGISTIC REGRESSION MODEL

Count Vectorizer Model




CV Cross Validation Train Accuracy: 0.826
CV Validation Loss: 0.419
CV Test Loss: 0.416

TFIDF Vectorizer Model
TFIDF Cross Validation Train Accuracy: 0.822
TFIDF Validation Loss: 0.448
TFIDF Test Loss: 0.451

Saving Model...


['/content/gdrive/Shareddrives/ADBI_Capstone/models/model_TFIDF_LR.sav']
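
The validation and test losses printed above are multinomial log losses: the mean negative log-probability that the model assigns to the true class. A minimal pure-Python sketch of that metric follows; `multiclass_log_loss` is a hypothetical helper and the probabilities are toy values, not model output.

```python
import math

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class of each sample."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        p = min(max(probs[label], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)

# Toy labels and predicted class probabilities
y_true = [0, 2, 1]
y_prob = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.4, 0.3],
]
print(round(multiclass_log_loss(y_true, y_prob), 3))

# Reference point: a uniform guess over six classes scores ln(6)
print(round(multiclass_log_loss([3], [[1 / 6] * 6]), 3))
```

The uniform-guess value (about 1.79 for six classes) gives a rough baseline against which the losses in this notebook can be read.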

In [None]:
# Naive Bayes
print('NAIVE BAYES MODEL')

# Count Vectorizer
print('\nCount Vectorizer Model')
model_CV_NB = MultinomialNB()
model_CV_NB.fit(X_train_CV, y_train)
y_pred_val_CV = model_CV_NB.predict_proba(X_val_CV)
y_pred_test_CV = model_CV_NB.predict_proba(X_test_CV)
# Cross-Validation Score
scores = cross_val_score(model_CV_NB, X_train_CV, y_train, cv=5)
scores = pd.Series(scores)
print('CV Cross Validation Train Accuracy: %0.3f'%scores.mean())
# Calculate Multinomial Log loss 
print('CV Validation Loss: %0.3f'%metrics.log_loss(y_val, y_pred_val_CV))
print('CV Test Loss: %0.3f'%metrics.log_loss(y_test, y_pred_test_CV))

# TFIDF Vectorizer
print('\nTFIDF Vectorizer Model')
model_TFIDF_NB = MultinomialNB()
model_TFIDF_NB.fit(X_train_TFIDF, y_train)
y_pred_val_TFIDF = model_TFIDF_NB.predict_proba(X_val_TFIDF)
y_pred_test_TFIDF = model_TFIDF_NB.predict_proba(X_test_TFIDF)
# Cross-Validation Score
scores = cross_val_score(model_TFIDF_NB, X_train_TFIDF, y_train, cv=5)
scores = pd.Series(scores)
print('TFIDF Cross Validation Train Accuracy: %0.3f'%scores.mean())
# Calculate Multinomial Log loss 
print('TFIDF Validation Loss: %0.3f'%metrics.log_loss(y_val, y_pred_val_TFIDF))
print('TFIDF Test Loss: %0.3f'%metrics.log_loss(y_test, y_pred_test_TFIDF))

# Save the models
print('\nSaving Model...')
joblib.dump(model_CV_NB, '/content/gdrive/Shareddrives/ADBI_Capstone/models/model_CV_NB.sav')
joblib.dump(model_TFIDF_NB, '/content/gdrive/Shareddrives/ADBI_Capstone/models/model_TFIDF_NB.sav')

NAIVE BAYES MODEL

Count Vectorizer Model
CV Cross Validation Train Accuracy: 0.770
CV Validation Loss: 0.686
CV Test Loss: 0.707

TFIDF Vectorizer Model
TFIDF Cross Validation Train Accuracy: 0.767
TFIDF Validation Loss: 0.632
TFIDF Test Loss: 0.635

Saving Model...


['/content/gdrive/Shareddrives/ADBI_Capstone/models/model_TFIDF_NB.sav']

In [None]:
# XGBOOST
print('XGBOOST MODEL')

# Count Vectorizer
print('\nCount Vectorizer Model')
model_CV_XG = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8, subsample=0.8, nthread=10, learning_rate=0.1)
model_CV_XG.fit(X_train_CV.tocsc(), y_train)
y_pred_val_CV = model_CV_XG.predict_proba(X_val_CV.tocsc())
y_pred_test_CV = model_CV_XG.predict_proba(X_test_CV.tocsc())
# Cross-Validation Score
scores = cross_val_score(model_CV_XG, X_train_CV.tocsc(), y_train, cv=5)
scores = pd.Series(scores)
print('CV Cross Validation Train Accuracy: %0.3f'%scores.mean())
# Calculate Multinomial Log loss 
print('CV Validation Loss: %0.3f'%metrics.log_loss(y_val, y_pred_val_CV))
print('CV Test Loss: %0.3f'%metrics.log_loss(y_test, y_pred_test_CV))

# TFIDF Vectorizer
print('\nTFIDF Vectorizer Model')
model_TFIDF_XG = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8, subsample=0.8, nthread=10, learning_rate=0.1)
model_TFIDF_XG.fit(X_train_TFIDF.tocsc(), y_train)
y_pred_val_TFIDF = model_TFIDF_XG.predict_proba(X_val_TFIDF.tocsc())
y_pred_test_TFIDF = model_TFIDF_XG.predict_proba(X_test_TFIDF.tocsc())
# Cross-Validation Score
scores = cross_val_score(model_TFIDF_XG, X_train_TFIDF.tocsc(), y_train, cv=5)
scores = pd.Series(scores)
print('TFIDF Cross Validation Train Accuracy: %0.3f'%scores.mean())
# Calculate Multinomial Log loss 
print('TFIDF Validation Loss: %0.3f'%metrics.log_loss(y_val, y_pred_val_TFIDF))
print('TFIDF Test Loss: %0.3f'%metrics.log_loss(y_test, y_pred_test_TFIDF))

# Save the models
print('\nSaving Model...')
joblib.dump(model_CV_XG, '/content/gdrive/Shareddrives/ADBI_Capstone/models/model_CV_XG.sav')
joblib.dump(model_TFIDF_XG, '/content/gdrive/Shareddrives/ADBI_Capstone/models/model_TFIDF_XG.sav')

XGBOOST MODEL

Count Vectorizer Model
CV Cross Validation Train Accuracy: 0.834
CV Validation Loss: 0.398
CV Test Loss: 0.390

TFIDF Vectorizer Model
TFIDF Cross Validation Train Accuracy: 0.831
TFIDF Validation Loss: 0.403
TFIDF Test Loss: 0.399

Saving Model...


['/content/gdrive/Shareddrives/ADBI_Capstone/models/model_TFIDF_XG.sav']

In [None]:
# Support Vector Machines
print('SVM MODEL')

# Reduce the feature dimensionality (3000 -> 150 components) to speed up training
SVD = decomposition.TruncatedSVD(n_components=150)
X_train_CV_svd = SVD.fit_transform(X_train_CV)
X_val_CV_svd = SVD.transform(X_val_CV)
X_test_CV_svd = SVD.transform(X_test_CV)

scaler = preprocessing.StandardScaler()
X_train_CV_svd = scaler.fit_transform(X_train_CV_svd)
X_val_CV_svd = scaler.transform(X_val_CV_svd)
X_test_CV_svd = scaler.transform(X_test_CV_svd)

# Count Vectorizer
print('\nCount Vectorizer Model')
model_CV_SVM = SVC(C=1.0, probability=True)
model_CV_SVM.fit(X_train_CV_svd, y_train)
y_pred_val_CV = model_CV_SVM.predict_proba(X_val_CV_svd)
y_pred_test_CV = model_CV_SVM.predict_proba(X_test_CV_svd)
# Cross-Validation Score
scores = cross_val_score(model_CV_SVM, X_train_CV_svd, y_train, cv=5)
scores = pd.Series(scores)
print('CV Cross Validation Train Accuracy: %0.3f'%scores.mean())
# Calculate Multinomial Log loss 
print('CV Validation Loss: %0.3f'%metrics.log_loss(y_val, y_pred_val_CV))
print('CV Test Loss: %0.3f'%metrics.log_loss(y_test, y_pred_test_CV))

# Reduce the feature dimensionality (3000 -> 150 components) to speed up training
SVD = decomposition.TruncatedSVD(n_components=150)
X_train_TFIDF_svd = SVD.fit_transform(X_train_TFIDF)
X_val_TFIDF_svd = SVD.transform(X_val_TFIDF)
X_test_TFIDF_svd = SVD.transform(X_test_TFIDF)

scaler = preprocessing.StandardScaler()
X_train_TFIDF_svd = scaler.fit_transform(X_train_TFIDF_svd)
X_val_TFIDF_svd = scaler.transform(X_val_TFIDF_svd)
X_test_TFIDF_svd = scaler.transform(X_test_TFIDF_svd)

# TFIDF Vectorizer
print('\nTFIDF Vectorizer Model')
model_TFIDF_SVM = SVC(C=1.0, probability=True)
model_TFIDF_SVM.fit(X_train_TFIDF_svd, y_train)
y_pred_val_TFIDF = model_TFIDF_SVM.predict_proba(X_val_TFIDF_svd)
y_pred_test_TFIDF = model_TFIDF_SVM.predict_proba(X_test_TFIDF_svd)
# Cross-Validation Score
scores = cross_val_score(model_TFIDF_SVM, X_train_TFIDF_svd, y_train, cv=5)
scores = pd.Series(scores)
print('TFIDF Cross Validation Train Accuracy: %0.3f'%scores.mean())
# Calculate Multinomial Log loss 
print('TFIDF Validation Loss: %0.3f'%metrics.log_loss(y_val, y_pred_val_TFIDF))
print('TFIDF Test Loss: %0.3f'%metrics.log_loss(y_test, y_pred_test_TFIDF))

# Save the models
print('\nSaving Model...')
joblib.dump(model_CV_SVM, '/content/gdrive/Shareddrives/ADBI_Capstone/models/model_CV_SVM.sav')
joblib.dump(model_TFIDF_SVM, '/content/gdrive/Shareddrives/ADBI_Capstone/models/model_TFIDF_SVM.sav')
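# Note: SVD and scaler were re-fitted on the TFIDF features above, so the saved files match model_TFIDF_SVM only
with open('/content/gdrive/Shareddrives/ADBI_Capstone/models/SVD.pkl', 'wb') as f: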
    pickle.dump(SVD, f)
with open('/content/gdrive/Shareddrives/ADBI_Capstone/models/SVMScaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

SVM MODEL

Count Vectorizer Model
CV Cross Validation Train Accuracy: 0.795
CV Validation Loss: 0.514
CV Test Loss: 0.503

TFIDF Vectorizer Model
TFIDF Cross Validation Train Accuracy: 0.808
TFIDF Validation Loss: 0.481
TFIDF Test Loss: 0.468

Saving Model...


#### Word Vectors


In [None]:
# Download GloVe vectors
!wget https://nlp.stanford.edu/data/glove.6B.zip

--2022-04-23 21:51:58--  https://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2022-04-23 21:51:59--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2022-04-23 21:54:42 (5.07 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]



In [None]:
# Unzip the contents of the GloVe archive
!unzip /content/glove.6B.zip

Archive:  /content/glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


In [None]:
#Create a dictionary and load all the glove vectors into it.
# This dictionary will be used to fetch the values in the normalized vectors created for the sentences
embeddings = {}
with open("/content/glove.6B.300d.txt", 'r', encoding='utf-8') as f:
    #go line by line and map the tokens with the vectors in the dictionary
    for line in f:
        values = line.split()
        token = values[0]
        vector = np.asarray(values[1:], "float32")
        embeddings[token] = vector

print(str(len(embeddings))+' word vectors have been found in this dictionary')

# Saving the embeddings
with open('/content/gdrive/Shareddrives/ADBI_Capstone/models/embeddings.pkl', 'wb') as f:
    pickle.dump(embeddings, f)

400000 word vectors have been found in this dictionary


In [None]:
# Use word_tokenize to build one L2-normalized vector for the whole sentence
def tokenized_sentence(s):
    text = str(s).lower()
    # Split the text into word tokens
    text = word_tokenize(text)
    # Drop the tokens that are stop words
    text = [word for word in text if word not in stop_words]
    # Keep purely alphabetic tokens only
    text = [word for word in text if word.isalpha()]
    # Collect the GloVe vector of every token that has one
    values = []
    for word in text:
        try:
            values.append(embeddings[word])
        except KeyError:
            continue
    # No token with a known vector: fall back to a zero vector
    if not values:
        return np.zeros(300)
    vectors = np.array(values).sum(axis=0)
    # Return the summed vector scaled to unit L2 norm
    return vectors / np.sqrt((vectors ** 2).sum())
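
The sentence representation used here is the sum of the word vectors scaled to unit length. The pure-Python sketch below shows the same idea on toy 3-dimensional embeddings. The values are made up and `sentence_vector` is a hypothetical helper; the notebook itself uses 300-dimensional GloVe vectors.

```python
import math

# Toy 3-dimensional embeddings (made up for illustration)
toy_embeddings = {
    'angri': [1.0, 0.0, 2.0],
    'head': [0.0, 2.0, 0.0],
}

def sentence_vector(tokens, table, dim=3):
    """Sum the vectors of known tokens and scale the sum to unit L2 norm."""
    known = [table[t] for t in tokens if t in table]
    if not known:
        return [0.0] * dim  # no known token: zero vector
    summed = [sum(component) for component in zip(*known)]
    norm = math.sqrt(sum(x * x for x in summed))
    return [x / norm for x in summed]

vec = sentence_vector(['angri', 'head', 'unknownword'], toy_embeddings)
print(vec)
print(sentence_vector(['unknownword'], toy_embeddings))
```

Tokens without an entry in the table are skipped, so a tweet whose tokens are all unknown ends up as the zero vector.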

In [None]:
# Build sentence vectors for the training, validation, and testing sets
# Use the tokenized_sentence function to generate the vector for each sentence in the respective set.

X_training = []
#for each sentence in the X_train set, create normalized vectors and append it to the X_training list
for sentence in tqdm(X_train):
  X_training.append(tokenized_sentence(sentence))
#Convert this list into an array using np.array
X_training = np.array(X_training)

X_validation = []
#for each sentence in the X_val set, create normalized vectors and append it to the X_validation list
for sentence in tqdm(X_val):
  X_validation.append(tokenized_sentence(sentence))
#Convert this list into an array using np.array
X_validation = np.array(X_validation)

X_testing = []
#for each sentence in the X_test set, create normalized vectors and append it to the X_testing list
for sentence in tqdm(X_test):
  X_testing.append(tokenized_sentence(sentence))
#Convert this list into an array using np.array
X_testing = np.array(X_testing)

100%|██████████| 33185/33185 [00:12<00:00, 2662.78it/s]
100%|██████████| 7111/7111 [00:01<00:00, 3942.58it/s]
100%|██████████| 7112/7112 [00:01<00:00, 4237.26it/s]


In [None]:
# Train on glove vectors using XGBOOST
print('GLOVE XGBOOST')
GLOVE_XB = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8, subsample=0.8, nthread=10, learning_rate=0.1)
#fit the model 
GLOVE_XB.fit(X_training, y_train)
#use predict_proba() which returns an array of lists containing the class probabilities for the input data points
y_pred_val_GLOVE = GLOVE_XB.predict_proba(X_validation)
y_pred_test_GLOVE = GLOVE_XB.predict_proba(X_testing)
# Cross-Validation Score
scores = cross_val_score(GLOVE_XB, X_training, y_train, cv=5)
scores = pd.Series(scores)
# Report the mean cross-validation accuracy
print('GLOVE Cross Validation Train Accuracy: %0.3f'%scores.mean())
# Calculate Multinomial Log loss 
print('GLOVE Validation Loss: %0.3f'%metrics.log_loss(y_val, y_pred_val_GLOVE))
print('GLOVE Test Loss: %0.3f'%metrics.log_loss(y_test, y_pred_test_GLOVE))

# Save the models
print('\nSaving Model...')
joblib.dump(GLOVE_XB, '/content/gdrive/Shareddrives/ADBI_Capstone/models/GLOVE_XB.sav')

GLOVE XGBOOST
GLOVE Cross Validation Train Accuracy: 0.781
GLOVE Validation Loss: 0.554
GLOVE Test Loss: 0.560

Saving Model...


['/content/gdrive/Shareddrives/ADBI_Capstone/models/GLOVE_XB.sav']

#### Deep Neural Network


*   Vanilla Neural Network
*   Bidirectional LSTM



In [None]:
# Standardize the GloVe sentence vectors (zero mean, unit variance) to help training converge
scaler = preprocessing.StandardScaler()
X_training_SCL = scaler.fit_transform(X_training)
X_validation_SCL = scaler.transform(X_validation)
X_testing_SCL = scaler.transform(X_testing)

# Use one-hot encoding to convert target to binary
y_train_enc = np_utils.to_categorical(y_train)
y_val_enc = np_utils.to_categorical(y_val)
y_test_enc = np_utils.to_categorical(y_test)

# Saving the variables
print('\nSaving the variables...')
with open('/content/gdrive/Shareddrives/ADBI_Capstone/models/NNScaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)


Saving the variables...
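
The two preparation steps above can be sketched without any library: the scaling statistics are computed on the training split and reused for the other splits, and integer labels become one-hot rows. `fit_standardizer`, `standardize`, and `one_hot` are hypothetical helpers and the data are toy values. The sketch assumes a population standard deviation and a scale of 1 for constant columns.

```python
import math

def fit_standardizer(rows):
    """Compute per-column mean and population standard deviation."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    # A constant column has std 0; use 1.0 instead so the division is safe
    stds = [math.sqrt(sum((x - m) ** 2 for x in col) / n) or 1.0
            for col, m in zip(columns, means)]
    return means, stds

def standardize(rows, means, stds):
    """Apply previously fitted statistics to any split."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def one_hot(labels, n_classes):
    """Turn integer labels into one-hot rows."""
    return [[1.0 if k == label else 0.0 for k in range(n_classes)] for label in labels]

# Toy data for illustration
train = [[1.0, 10.0], [3.0, 10.0]]
val = [[2.0, 10.0]]

means, stds = fit_standardizer(train)   # statistics come from the training split only
print(standardize(train, means, stds))
print(standardize(val, means, stds))
print(one_hot([0, 5, 2], n_classes=6))
```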


In [None]:
# Vanilla Neural Network
print('Vanilla Neural Network')

# Define the network
# Build the network with the Keras Sequential API
# In a sequential network, the input of each layer is the output of the previous layer
vanillann = Sequential()
vanillann.add(Dense(300, input_dim=300, activation='relu'))
vanillann.add(Dropout(0.2))
vanillann.add(BatchNormalization())
vanillann.add(Dense(200, activation='relu'))
vanillann.add(Dropout(0.2))
vanillann.add(BatchNormalization())
vanillann.add(Dense(300, activation='relu'))
vanillann.add(Dropout(0.3))
vanillann.add(BatchNormalization())
vanillann.add(Dense(6))
vanillann.add(Activation('softmax'))

# Compile the model and print a summary of its layers and parameter counts
vanillann.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
vanillann.summary()

# Initialize the hyperparameters
batch_size = 64
epochs = 20

# Fitting the model
history = vanillann.fit(X_training_SCL, y_train_enc, batch_size=batch_size, epochs=epochs, verbose=1, validation_data=(X_validation_SCL, y_val_enc))

# Predict on Validation and Test Set
# The predict function returns the class probabilities from the trained model
y_pred_val_ANN = vanillann.predict(X_validation_SCL)
y_pred_test_ANN = vanillann.predict(X_testing_SCL)

#Display the accuracy of the Vanilla ANN model that we trained
print('Vanilla ANN Train Accuracy: %0.3f'%history.history['accuracy'][-1])
# Calculate Multinomial Log loss 
print('Vanilla ANN Validation Loss: %0.3f'%metrics.log_loss(y_val_enc, y_pred_val_ANN))
print('Vanilla ANN Test Loss: %0.3f'%metrics.log_loss(y_test_enc, y_pred_test_ANN))

# Save the models
print('\nSaving Model...')
joblib.dump(vanillann, '/content/gdrive/Shareddrives/ADBI_Capstone/models/vanillann.sav')

Vanilla Neural Network
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 300)               90300     
                                                                 
 dropout (Dropout)           (None, 300)               0         
                                                                 
 batch_normalization (BatchN  (None, 300)              1200      
 ormalization)                                                   
                                                                 
 dense_1 (Dense)             (None, 200)               60200     
                                                                 
 dropout_1 (Dropout)         (None, 200)               0         
                                                                 
 batch_normalization_1 (Batc  (None, 200)              800       
 hNormalization)                 

['/content/gdrive/Shareddrives/ADBI_Capstone/models/vanillann.sav']
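
The parameter counts in the summary above can be reproduced by hand. The sketch assumes the usual accounting: a Dense layer has one weight per input-output pair plus one bias per output, and a BatchNormalization layer holds four values per feature (scale, offset, moving mean, moving variance). Both helpers are hypothetical.

```python
def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out

def batchnorm_params(n_features):
    """Scale, offset, moving mean, and moving variance per feature."""
    return 4 * n_features

print(dense_params(300, 300))    # first Dense layer
print(batchnorm_params(300))     # first BatchNormalization layer
print(dense_params(300, 200))    # second Dense layer
print(batchnorm_params(200))     # second BatchNormalization layer
```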

In [None]:
#Bi-directional LSTM Model
print('Bi-directional LSTM Model')

# Generate sequences for LSTM
# Use texts_to_sequences to convert each text in the data to a sequence of integers
token = text.Tokenizer(num_words=None)
max_len = 100
token.fit_on_texts(list(X_train) + list(X_val) + list(X_test))
X_train_seq = token.texts_to_sequences(X_train)
X_val_seq = token.texts_to_sequences(X_val)
X_test_seq = token.texts_to_sequences(X_test)

# Adding padding to sequences
# Use pad_sequences to pad each sentence in the Train, Validation and Testing data to max_len
X_train_pad = sequence.pad_sequences(X_train_seq, maxlen=max_len)
X_val_pad = sequence.pad_sequences(X_val_seq, maxlen=max_len)
X_test_pad = sequence.pad_sequences(X_test_seq, maxlen=max_len)

# Create a matrix that holds the GloVe vector for each word index in the dataset
word_index = token.word_index
vector_matrix = np.zeros((len(word_index) + 1, 300))
for word, val in tqdm(word_index.items()):
    em_vector = embeddings.get(word)
    if em_vector is not None:
        vector_matrix[val] = em_vector

# Define the network
# Build the network with the Keras Sequential API
# In a sequential network, the input of each layer is the output of the previous layer
biLSTM = Sequential()
biLSTM.add(Embedding(len(word_index) + 1, 300, weights=[vector_matrix], input_length=max_len, trainable=False))
biLSTM.add(SpatialDropout1D(0.3))
biLSTM.add(Bidirectional(LSTM(300, dropout=0.3, recurrent_dropout=0.3)))
biLSTM.add(Dense(1024, activation='relu'))
biLSTM.add(Dropout(0.8))
biLSTM.add(Dense(1024, activation='relu'))
biLSTM.add(Dropout(0.8))
biLSTM.add(Dense(6))
biLSTM.add(Activation('softmax'))

# Compile the model and print a summary of its layers and parameter counts
biLSTM.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
biLSTM.summary()

# Initialize the hyperparameters
batch_size = 512
epochs = 20

# Use the EarlyStopping callback to stop training once val_loss has not improved for 3 epochs
# Fit the model with the early stopping callback and keep the history for the accuracy report below
stop_callback = EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=0, mode='auto')
history = biLSTM.fit(X_train_pad, y=y_train_enc, batch_size=batch_size, epochs=epochs, 
                     verbose=1, validation_data=(X_val_pad, y_val_enc), callbacks=[stop_callback])

# Predict on Validation and Test Set
# The predict function returns the class probabilities from the trained model
y_pred_val_biLSTM = biLSTM.predict(X_val_pad)
y_pred_test_biLSTM = biLSTM.predict(X_test_pad)

#Display the accuracy of the Bidirectional LSTM model that we trained
print('Bidirectional LSTM Train Accuracy: %0.3f'%history.history['accuracy'][-1])
# Calculate Multinomial Log loss 
print('Bidirectional LSTM Validation Loss: %0.3f'%metrics.log_loss(y_val_enc, y_pred_val_biLSTM))
print('Bidirectional LSTM Test Loss: %0.3f'%metrics.log_loss(y_test_enc, y_pred_test_biLSTM))

# Save the models
print('\nSaving Model...')
joblib.dump(biLSTM, '/content/gdrive/Shareddrives/ADBI_Capstone/models/biLSTM.sav')
with open('/content/gdrive/Shareddrives/ADBI_Capstone/models/token.pkl', 'wb') as f:
    pickle.dump(token, f)
with open('/content/gdrive/Shareddrives/ADBI_Capstone/models/word_index.pkl', 'wb') as f:
    pickle.dump(word_index, f)

Bi-directional LSTM Model


100%|██████████| 55047/55047 [00:00<00:00, 740021.83it/s]


Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 100, 300)          16514400  
                                                                 
 spatial_dropout1d (SpatialD  (None, 100, 300)         0         
 ropout1D)                                                       
                                                                 
 bidirectional (Bidirectiona  (None, 600)              1442400   
 l)                                                              
                                                                 
 dense_4 (Dense)             (None, 1024)              615424    
                                                                 
 dropout_3 (Dropout)         (None, 1024)              0         
                                                                 
 dense_5 (Dense)             (None, 1024)             

