<a href="https://colab.research.google.com/github/ayaanzhaque/SuiSense/blob/master/notebooks/BERT/workingBertModel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/ae/05/c8c55b600308dc04e95100dc8ad8a244dd800fe75dfafcf1d6348c6f6209/transformers-3.1.0-py3-none-any.whl (884kB)
Collecting tokenizers==0.8.1.rc2
  Downloading https://files.pythonhosted.org/packages/80/83/8b9fccb9e48eeb575ee19179e2bdde0ee9a1904f97de5f02d19016b8804f/tokenizers-0.8.1rc2-cp36-cp36m-manylinux1_x86_64.whl (3.0MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting sentencepiece!=0.1.92
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB

In [2]:
#importing relevant libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

import torch

import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer

import transformers as ppb
import warnings
warnings.filterwarnings('ignore')

## Importing the dataset
We'll use pandas to read the dataset and load it into a dataframe.

In [6]:
df = pd.read_csv('/content/cleaned_data_casual.csv')
og_batch_1 = df[['selftext', 'is_suicide']]

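The subsetting lines below are commented out, so we keep the full dataset (2,701 rows, as the shape output further down confirms) and only rename the columns: `0` holds the post text and `1` holds the label.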

In [7]:
#batch_1_start = og_batch_1.head(60)
#batch_1_end = og_batch_1.tail(60)
#test_batch_1 = pd.concat([batch_1_start, batch_1_end], ignore_index=True)
batch_1 = og_batch_1.rename(columns={'selftext': 0, 'is_suicide': 1})
batch_1.head()

| | 0 | 1 |
|---|---|---|
| 0 | We understand that most people who reply immed... | 0 |
| 1 | Welcome to /r/depression's check-in post - a p... | 0 |
| 2 | I've been feeling really depressed and lonely ... | 0 |
| 3 | I literally broke down crying and asked to go ... | 0 |
| 4 | Any kind soul want to give a depressed person ... | 0 |


We can ask pandas how many posts carry each label. Note that the label column (originally `is_suicide`, now `1`) holds three distinct values (0, 1, and 2), so this is a three-class problem rather than a binary one.

In [8]:
batch_1[1].value_counts()

1    980
0    917
2    804
Name: 1, dtype: int64

## Loading the Pre-trained BERT model
Let's now load a pre-trained BERT model. 

In [9]:
# To use DistilBERT instead, uncomment the following line and comment out the BERT line below:
#model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

# Full-size BERT (currently active):
model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)


Right now, the variable `model` holds a pretrained BERT model (`bert-base-uncased`). The commented-out DistilBERT alternative is a version of BERT that is smaller, much faster, and needs a lot less memory.

## Model #1: Preparing the Dataset
Before we can hand our sentences to BERT, we need to do some minimal processing to put them in the format it requires.

### Tokenization
Our first step is to tokenize the sentences, that is, break them up into words and subwords in the format BERT expects. The call below also adds BERT's special tokens and truncates each post to at most 128 tokens.

In [10]:
tokenized = batch_1[0].apply((lambda x: tokenizer.encode(x, add_special_tokens=True, truncation=True, max_length=128)))
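
To make "words and subwords" concrete, here is a toy sketch of greedy longest-match-first splitting, the general idea behind BERT's WordPiece tokenizer. The vocabulary `TOY_VOCAB` and the function `toy_wordpiece` are made up for illustration; the real tokenizer uses a much larger learned vocabulary and extra preprocessing.

```python
# Toy greedy longest-match-first subword splitter (illustration only).
# TOY_VOCAB is a made-up vocabulary, not BERT's real one.
TOY_VOCAB = {"feel", "##ing", "lone", "##ly", "sad", "[UNK]"}

def toy_wordpiece(word, vocab=TOY_VOCAB):
    """Split one lowercase word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

print(toy_wordpiece("feeling"))  # ['feel', '##ing']
print(toy_wordpiece("lonely"))   # ['lone', '##ly']
print(toy_wordpiece("xyz"))      # ['[UNK]']
```

In the notebook, `tokenizer.encode` goes one step further and maps each piece to its integer id.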

In [11]:
# Pad every sequence with 0 (the [PAD] token id) up to the length of the longest one
max_len = max(len(i) for i in tokenized.values)

padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

Our dataset is now in the `padded` variable. We can view its dimensions below:

In [12]:
np.array(padded).shape

(2701, 128)

### Masking
If we send `padded` directly to BERT, the padding would be treated as real input and slightly confuse it. We need another array that tells the model to ignore (mask) the padding we've added when it processes its input. That's what `attention_mask` is: 1 for real tokens, 0 for padding.

In [13]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(2701, 128)
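
The padding and masking steps are easier to see on a tiny example. The sketch below uses three made-up token-id lists (the names `toy_tokenized`, `toy_padded`, and `toy_mask` are hypothetical) and applies the same two operations as the cells above.

```python
import numpy as np

# Three made-up token-id lists of different lengths.
# 101 and 102 are the [CLS] and [SEP] ids in bert-base-uncased; the other ids are arbitrary.
toy_tokenized = [[101, 7592, 102], [101, 2023, 2003, 2307, 102], [101, 102]]

# Pad with 0 up to the longest list, exactly as done for `padded`
toy_max_len = max(len(ids) for ids in toy_tokenized)
toy_padded = np.array([ids + [0] * (toy_max_len - len(ids)) for ids in toy_tokenized])

# 1 where there is a real token, 0 where there is padding
toy_mask = np.where(toy_padded != 0, 1, 0)

print(toy_padded)
print(toy_mask)
```

Each row of `toy_mask` sums to the original length of that sequence, which is a quick sanity check that only padding was masked.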

In [14]:
input_ids = torch.tensor(padded)  
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

In [15]:
features = last_hidden_states[0][:,0,:].numpy()
print(features)

[[ 0.15627055 -0.20722187 -0.03962998 ...  0.15881488  0.3669506
   0.46319035]
 [-0.05789474 -0.01733885 -0.1070803  ...  0.01880805 -0.06970198
   0.54524016]
 [ 0.19604996  0.20534162  0.24803358 ...  0.1578196   0.45455894
  -0.4178381 ]
 ...
 [ 0.45092496 -0.15768571 -0.16213767 ... -0.16745816  0.6286537
   0.411904  ]
 [ 0.1912971  -0.16093378 -0.01542986 ... -0.11184699  0.46886578
   0.47975355]
 [ 0.5417328   0.0307596   0.32225472 ...  0.01061545  0.7953502
   0.41072857]]
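
The indexing in `last_hidden_states[0][:,0,:]` does two things: `[0]` picks the last-layer hidden states out of the model's output, and `[:,0,:]` keeps only the vector at position 0 of every sequence, which is the `[CLS]` token that `add_special_tokens=True` put at the front. The sketch below shows the same slice on a small random array; the sizes and the names `toy_hidden` and `toy_features` are made up, whereas in the notebook the sequence length is 128 and the hidden size of `bert-base-uncased` is 768.

```python
import numpy as np

# Stand-in for last_hidden_states[0], with shape (batch, sequence length, hidden size).
# Sizes are deliberately tiny and the values are random.
rng = np.random.default_rng(0)
toy_hidden = rng.normal(size=(4, 6, 8))

# Every example, position 0 only, all hidden units
toy_features = toy_hidden[:, 0, :]

print(toy_features.shape)  # (4, 8): one fixed-size vector per example
```

That one vector per post is what gets passed to the classifier as `features`.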


The labels indicating which class each post belongs to (0, 1, or 2) now go into the `labels` variable.

In [17]:
labels = batch_1[1]
labels.head()

0    0
1    0
2    0
3    0
4    0
Name: 1, dtype: int64

## Model #2: Train/Test Split
Let's now split our dataset into a training set and a testing set. The call below holds out 25% of the rows for testing and stratifies on the labels, so both splits keep roughly the same class proportions.

In [18]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.25, random_state=42, stratify=labels)
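
To illustrate what `stratify=labels` buys us, here is a rough NumPy sketch of a stratified split. It is not scikit-learn's implementation: `toy_stratified_split` is a hypothetical helper that simply samples the test fraction from each class separately, using the label counts from the `value_counts()` output above.

```python
import numpy as np

def toy_stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_idx, test_idx), sampling test_size of each class separately."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        # Shuffle the row indices of this class, then cut off the test share
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_test = int(round(test_size * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

# Label counts taken from the value_counts() output earlier in the notebook
toy_labels = np.array([1] * 980 + [0] * 917 + [2] * 804)
train_idx, test_idx = toy_stratified_split(toy_labels)

# Class shares in train and test should be nearly identical
for cls in (0, 1, 2):
    train_share = (toy_labels[train_idx] == cls).mean()
    test_share = (toy_labels[test_idx] == cls).mean()
    print(cls, round(train_share, 3), round(test_share, 3))
```

Without stratification, a random split could by chance over- or under-represent one class in the test set, which makes accuracy numbers harder to interpret.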

In [None]:
# parameters = {'C': np.linspace(0.0001, 100, 20)}
# grid_search = GridSearchCV(LogisticRegression(), parameters)
# grid_search.fit(train_features, train_labels)

# print('best parameters: ', grid_search.best_params_)
# print('best scores: ', grid_search.best_score_)

In [20]:
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import ModelCheckpoint

model3 = Sequential()
# NOTE: this 1-unit sigmoid layer squeezes each 768-dim BERT feature vector down
# to a single number before the later layers see it, which limits the model.
model3.add(Dense(1, activation='sigmoid'))

model3.add(Dense(10, activation='relu'))
# model3.add(Dense(1, activation='sigmoid'))
# 3-way softmax: a probability distribution over the classes (labels 0, 1, 2)
model3.add(Dense(3, activation='softmax'))

# NOTE: binary_crossentropy expects 0/1 targets; for integer labels 0-2 with a
# 3-way softmax, 'sparse_categorical_crossentropy' is the usual choice.
model3.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Save the model to disk whenever the validation loss improves
checkpoint = ModelCheckpoint('/content/depression_suicide_neither_nn.h5', verbose=1, monitor='val_loss', save_best_only=True, mode='auto')

# NOTE: the test split doubles as validation data here
history = model3.fit(train_features, train_labels, batch_size=32, epochs=100, verbose=1, callbacks=[checkpoint], validation_data=(test_features, test_labels))

Epoch 1/100
Epoch 00001: val_loss improved from inf to 1.06888, saving model to /content/depression_suicide_neither_nn.h5
Epoch 2/100
Epoch 00002: val_loss improved from 1.06888 to 1.06888, saving model to /content/depression_suicide_neither_nn.h5
Epoch 3/100
Epoch 00003: val_loss did not improve from 1.06888
Epoch 4/100
Epoch 00004: val_loss did not improve from 1.06888
Epoch 5/100
Epoch 00005: val_loss did not improve from 1.06888
Epoch 6/100
Epoch 00006: val_loss did not improve from 1.06888
Epoch 7/100
Epoch 00007: val_loss did not improve from 1.06888
Epoch 8/100
Epoch 00008: val_loss did not improve from 1.06888
Epoch 9/100
Epoch 00009: val_loss did not improve from 1.06888
Epoch 10/100
Epoch 00010: val_loss did not improve from 1.06888
Epoch 11/100
Epoch 00011: val_loss did not improve from 1.06888
Epoch 12/100
Epoch 00012: val_loss did not improve from 1.06888
Epoch 13/100
Epoch 00013: val_loss did not improve from 1.06888
Epoch 14/100
Epoch 00014: val_loss did not improve from

In [21]:
from keras.models import load_model

loadedModel = load_model('/content/depression_suicide_neither_nn.h5')
loadedModel.evaluate(test_features, test_labels)



[1.0688765048980713, 0.5636094808578491]
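
The log above shows `val_loss` flat at 1.06888 and a final accuracy near 0.564, which suggests the network is not learning much. Two likely contributors (assumptions, not verified by re-running the notebook) are the 1-unit sigmoid layer at the front of `model3` and the loss: `binary_crossentropy` is designed for 0/1 targets, while the labels here take the values 0, 1, and 2 and the output layer is a 3-way softmax. The loss that matches integer labels and a softmax output is sparse categorical cross-entropy, available in Keras as `loss='sparse_categorical_crossentropy'`. Here is a minimal NumPy sketch of what that loss computes, using made-up logits:

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_categorical_crossentropy(labels, probs):
    """Mean negative log-probability assigned to the true class of each row."""
    rows = np.arange(len(labels))
    return float(-np.log(probs[rows, labels]).mean())

toy_labels = np.array([0, 1, 2, 1])   # integer class ids, like `labels` in the notebook
uniform_logits = np.zeros((4, 3))     # a model that has learned nothing
confident_logits = np.array([[5.0, 0.0, 0.0],
                             [0.0, 5.0, 0.0],
                             [0.0, 0.0, 5.0],
                             [0.0, 5.0, 0.0]])  # a model that favours the true class

uniform_loss = sparse_categorical_crossentropy(toy_labels, softmax(uniform_logits))
confident_loss = sparse_categorical_crossentropy(toy_labels, softmax(confident_logits))
print(uniform_loss)    # equals ln(3), the loss of guessing uniformly over 3 classes
print(confident_loss)  # much smaller
```

A loss that drops well below ln(3) ≈ 1.0986 during training is a sign that a three-class model is doing better than uniform guessing.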

<img src="https://jalammar.github.io/images/distilBERT/bert-training-logistic-regression.png" />

## Evaluating Model #2
So how well does our model do at classifying posts? One way is to check the accuracy on the testing dataset. The `evaluate` call above returns `[loss, accuracy]`, so the test accuracy is about 0.564.

How good is this score? What can we compare it against? Let's first look at a dummy classifier:

In [None]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()

scores = cross_val_score(clf, train_features, train_labels)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Dummy classifier score: 0.509 (+/- 0.06)
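
For a sense of what trivial baselines look like on this label distribution, the sketch below works out two of them by hand from the counts in the `value_counts()` output earlier (980, 917, 804). The variable names are hypothetical, and which strategy `DummyClassifier()` uses by default depends on the scikit-learn version, so treat this as a rough cross-check rather than a reproduction of the score recorded above.

```python
# Label counts copied from the value_counts() output earlier in the notebook
counts = {1: 980, 0: 917, 2: 804}
total = sum(counts.values())

# Baseline 1: always predict the most common class
majority_acc = max(counts.values()) / total

# Baseline 2: guess at random in proportion to class frequency.
# Expected accuracy is the sum of squared class shares.
stratified_acc = sum((n / total) ** 2 for n in counts.values())

print(f"majority-class baseline:    {majority_acc:.3f}")
print(f"stratified-random baseline: {stratified_acc:.3f}")
```

With three roughly balanced classes, both baselines come out well below 0.5, which is worth keeping in mind when comparing against the 0.509 recorded above.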


So at about 0.564 accuracy, our model scores above the dummy classifier's 0.509, though the margin is small given the spread of +/- 0.06. But how does it compare against the best models?

## Proper SST2 scores
For reference, the numbers in this section are for the SST-2 sentiment benchmark, not for the Reddit posts classified in this notebook. The [highest accuracy score](http://nlpprogress.com/english/sentiment_analysis.html) listed for SST-2 is **96.8**. DistilBERT can be trained to improve its score on this task, a process called **fine-tuning**, which updates the model's weights so that it performs better on the sentence classification task (which we can call the downstream task). The fine-tuned DistilBERT achieves an accuracy score of **90.7**. The full-size BERT model achieves **94.9**.



And that's it! That's a good first contact with BERT. The next step would be to head over to the documentation and try your hand at [fine-tuning](https://huggingface.co/transformers/examples.html#glue). You can also go back and switch from BERT to DistilBERT and see how that works.