<a href="https://colab.research.google.com/github/bjohn22/Natural-Language-Processing/blob/main/Eluvio_NLP_Challenge_concise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Natural Language Understanding: Semantic Analyses**
## Comparison of single predictors with Bidirectional Encoder Representations from Transformers (BERT)



## Workflow:
1. Prepare train and test datasets
2. Text normalization
3. Model training
4. Model prediction and evaluation

### Predictors
* Single predictors
>Preprocess the dataset, split it into train and test sets, fit the model, and evaluate it.

* BERT
>Preprocess the dataset using BERT-specific preprocessing steps, split it into train and test sets, fit and evaluate the model,
>>then save the model.


#### Text Normalization
Text normalization is a series of steps that wrangle, clean, and standardize textual data into a form that other NLP and analytics systems and applications can consume as input.
##### Tokenizing the text
1. Sentence tokenization (sentence segmentation)
> Splitting a text corpus into sentences using common separators such as periods or newlines (\n).

2. Word tokenization
> The process of splitting or segmenting sentences into their constituent words.
> The word tokens are then used for stemming or lemmatizing, depending on which one we choose.
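
The two tokenization steps above can be sketched with the standard library alone. This is a rough regex-based approximation, not what `nltk.sent_tokenize` or `nltk.word_tokenize` do internally; the helper names `split_sentences` and `split_words` are hypothetical and exist only for this illustration.

```python
import re

def split_sentences(text):
    # Split on newlines, or on whitespace that follows '.', '!' or '?'.
    parts = re.split(r"(?<=[.!?])\s+|\n+", text)
    return [p.strip() for p in parts if p.strip()]

def split_words(sentence):
    # Keep runs of letters/digits (with inner hyphens or apostrophes) as words
    # and every other non-space character as its own token.
    return re.findall(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*|[^\sA-Za-z0-9]", sentence)

text = "Japan resumes refuelling mission. US presses Egypt on Gaza border\nScores killed"
sentences = split_sentences(text)
tokens = [split_words(s) for s in sentences]
print(sentences)
print(tokens)
```

Punctuation comes out as separate tokens, which is why the notebook later filters tokens that contain no letters.
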
#### Removing Stopwords
* These words have little or no significance.
>They are usually removed from text during processing so as to retain the words with the most significance and context. Stopwords are usually the words that occur most often when you aggregate a corpus by single tokens and check their frequencies. Words like *a, the, me,* and so on are stopwords.
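
As a small illustration of the frequency argument above, the sketch below counts tokens in three lowercased titles and then filters them. The stopword set here is a hypothetical, hand-picked list; the notebook itself combines the NLTK and scikit-learn lists.

```python
from collections import Counter

titles = [
    "the council of europe bashes the terror blacklist",
    "scores killed in the pakistan clashes",
    "the us presses egypt on the gaza border",
]
# Hypothetical, hand-picked stopword list for illustration only.
stop_words = {"a", "the", "me", "of", "in", "on"}

tokens = [w for t in titles for w in t.split()]
counts = Counter(tokens)
print(counts.most_common(1))  # the most frequent token is a stopword

# Keep only the tokens that are not stopwords.
filtered = [w for w in tokens if w not in stop_words]
print(filtered)
```
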

#### Stemming
* We start with explaining morphemes:
* Morphemes are the smallest independent unit in any natural language
* Morphemes consist of units that are stems and affixes.
* Affixes are units like prefixes, suffixes, and so on, which are attached to a word stem to change its meaning or create a new word altogether.
* Word stems are also often known as the *base form* of a word, and we can create new words by attaching affixes to them in a process known as *inflection*.
* The reverse of this is obtaining the base form of a word from its inflected form, and this is known as ***stemming***.
* The *nltk* package has several stemmer implementations. They live in the *nltk.stem* module and implement the *StemmerI* interface defined in the *nltk.stem.api* module.

#### Lemmatization
* Lemmatization is very similar to stemming: you remove word affixes to get to a base form of the word.
* In this case, however, the base form is the root word, not the root stem.
* The difference is that the root stem may not always be a lexicographically correct word; that is, it may not be present in the dictionary.
* The root word, also known as the lemma, will always be present in the dictionary.

The lemmatization process is *considerably slower than stemming* because an
additional step is involved where the root form or lemma is formed by removing the affix
from the word if and only if the lemma is present in the dictionary.
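
The contrast between a stem and a lemma can be sketched with a toy example. The suffix list and the tiny dictionary below are hypothetical and far simpler than the Snowball stemmer or a real lemmatizer; the point is only that the lemmatizer accepts a stripped form when it is a dictionary word, while the stemmer does not check.

```python
# Hypothetical suffix list and dictionary, for illustration only.
SUFFIXES = ("ing", "es", "ed", "s")
DICTIONARY = {"press", "clash", "resume", "kill"}

def naive_stem(word):
    # Strip the first matching suffix; the result need not be a real word.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def naive_lemmatize(word):
    # Strip a suffix only if what remains is present in the dictionary.
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)]
            if candidate in DICTIONARY:
                return candidate
    return word

for w in ["presses", "clashes", "resumes", "killed"]:
    print(w, naive_stem(w), naive_lemmatize(w))
```

For "resumes" the stemmer returns "resum", which is not a dictionary word, while the lemmatizer keeps trying suffixes until it reaches "resume". The extra dictionary check is the additional step that makes lemmatization slower.
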


Connect to google drive: Directory for data

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Import packages

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import re
import pickle
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

Dataset preprocessing




1. Read in the dataset and take a first look at it

In [5]:
path = "/content/drive/MyDrive/Eluvio_DS_Challenge.csv"
news_df = pd.read_csv(path)

In [6]:
news_df.head()

Unnamed: 0,time_created,date_created,up_votes,down_votes,title,over_18,author,category
0,1201232046,2008-01-25,3,0,Scores killed in Pakistan clashes,False,polar,worldnews
1,1201232075,2008-01-25,2,0,Japan resumes refuelling mission,False,polar,worldnews
2,1201232523,2008-01-25,3,0,US presses Egypt on Gaza border,False,polar,worldnews
3,1201233290,2008-01-25,1,0,Jump-start economy: Give health care to all,False,fadi420,worldnews
4,1201274720,2008-01-25,4,0,Council of Europe bashes EU&UN terror blacklist,False,mhermans,worldnews


In [7]:
len(news_df)

509236

In [8]:
print(sum(news_df['category'] == "worldnews"))
print(sum(news_df["down_votes"] == 0))

509236
509236


* It looks like the dataset contains news headlines (the 'title' attribute) from around the world.
* Each 'title' has a rating (the 'up_votes' and 'down_votes' attributes).
>The 'up_votes' attribute will be used as the label.

*Note*
* 'category' contains only "worldnews" and 'down_votes' is always 0, so both are dropped. 'time_created' and 'date_created' are dropped as well.

In [9]:
news_df = news_df.drop("category", axis = 1)
news_df = news_df.drop("down_votes", axis = 1)
news_df = news_df.drop("time_created", axis = 1)
news_df = news_df.drop("date_created", axis = 1)

In [10]:
#Sanity check
news_df.head()

Unnamed: 0,up_votes,title,over_18,author
0,3,Scores killed in Pakistan clashes,False,polar
1,2,Japan resumes refuelling mission,False,polar
2,3,US presses Egypt on Gaza border,False,polar
3,1,Jump-start economy: Give health care to all,False,fadi420
4,4,Council of Europe bashes EU&UN terror blacklist,False,mhermans


In [11]:
len(set(news_df['author']))  # the number of distinct authors

85838

# Process the 'title' (word vectorize)
For this text data to be usable by a model, we need to convert each title to a numeric representation, a step called vectorization.


In [12]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## build the corpus (normalize then tokenize)
### Normalization
Convert all the different forms of a given word into one using:
>>Stemming
>>Lemmatization

1. Stemming: SnowballStemmer will be used in this case, but there are other algorithms in the NLTK package.

In [13]:
#convert all text to lowercase
title = news_df.title.str.lower()

In [14]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

In [15]:
# To get the stems of words in a sentence.
def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

# To get the words themselves in a sentence.
def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

In [16]:
# Get full stems and tokens to build vocabulary
def tokenized_stemmed(title):
    totalvocab_stemmed = []
    totalvocab_tokenized = []
    for i in title:
        allwords_stemmed = tokenize_and_stem(i) 
        totalvocab_stemmed.extend(allwords_stemmed) 

        allwords_tokenized = tokenize_only(i)
        totalvocab_tokenized.extend(allwords_tokenized)
    return totalvocab_stemmed, totalvocab_tokenized

In [17]:
totalvocab_stemmed_, totalvocab_tokenized_ = tokenized_stemmed(title)

In [18]:
print(len(totalvocab_stemmed_))

7194561


In [19]:
#Save the corpus in our directory for re-use
pickle.dump((totalvocab_stemmed_, totalvocab_tokenized_), open("/content/drive/MyDrive/stem_token_.pkl", "wb" ))

In [20]:
totalvocab_stemmed_, totalvocab_tokenized_ = pickle.load(open("/content/drive/MyDrive/stem_token_.pkl", "rb" ))

In [21]:
totalvocab = zip(totalvocab_stemmed_, totalvocab_tokenized_)

In [22]:
totalvocab = list(set(totalvocab))

In [23]:
totalvocab_stemmed, totalvocab_tokenized = zip(*totalvocab)

In [24]:
pickle.dump((totalvocab_stemmed, totalvocab_tokenized), open("/content/drive/MyDrive/stem_token.pkl", "wb" ))

In [25]:
totalvocab_stemmed, totalvocab_tokenized = pickle.load(open("/content/drive/MyDrive/stem_token.pkl", "rb" ))

In [26]:
print(len(totalvocab_stemmed))

115041


In [27]:
#stem-token vocabulary
vocab_frame = pd.DataFrame({'words_tokenized': totalvocab_tokenized}, index = totalvocab_stemmed)


In [28]:

pickle.dump(vocab_frame, open('/content/drive/MyDrive/vocab_frame.pkl','wb'))



In [29]:
#See the vocabulary we built
vocab_frame.head(30)

Unnamed: 0,words_tokenized
thutmosi,thutmosis
chaff,chaff
supremo,supremo
bp-rosneft,bp-rosneft
dot-com,dot-com
borodino,borodino
bobblehead,bobblehead
empress,empress
narathiwat,narathiwat
duffel,duffel


In [30]:
vocab_frame = pickle.load(open('/content/drive/MyDrive/vocab_frame.pkl','rb'))

### Removing *stop words* such as ('but','we','he', 'if')
This is done without any noticeable effect on the semantics of the text.

In [31]:
# Build the stopword set by combining two common lists (NLTK and scikit-learn).
import sklearn.feature_extraction.text as text
stopwords = nltk.corpus.stopwords.words('english')
my_stop_words = text.ENGLISH_STOP_WORDS.union(stopwords)

TF-IDF to vectorize the text.
TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a statistic that measures how important a word is to a document while also taking into account how often the word appears in the other documents of the same corpus.
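
To make the definition concrete, here is a minimal sketch of the textbook formula, tf(t, d) * ln(N / df(t)), on four tiny tokenized titles. It is not what `TfidfVectorizer` computes: by default scikit-learn uses a smoothed idf and L2-normalizes each row, so its numbers differ. The example documents are made up for illustration.

```python
import math
from collections import Counter

docs = [
    ["scores", "killed", "pakistan", "clashes"],
    ["japan", "resumes", "refuelling", "mission"],
    ["us", "presses", "egypt", "gaza", "border"],
    ["pakistan", "border", "clashes"],
]

n_docs = len(docs)
# Document frequency: number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(doc):
    # Textbook variant: tf = count / len(doc), idf = ln(N / df).
    counts = Counter(doc)
    return {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}

weights = tf_idf(docs[0])
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {w:.3f}")
```

Terms that appear in only one title ("scores", "killed") get a higher weight than terms shared with another title ("pakistan", "clashes").
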


In [32]:
# tf-idf vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(min_df =10**-3 ,analyzer = 'word', max_features=len(set(totalvocab_stemmed)), stop_words=my_stop_words, tokenizer=tokenize_and_stem, ngram_range=(1,3))

tfidf_matrix = tfidf_vectorizer.fit_transform(title)

print(tfidf_matrix.shape)

  'stop_words.' % sorted(inconsistent))


(509236, 1814)


In [33]:
#Getting the importance of words in the titles.
tf_idf = pd.DataFrame(tfidf_matrix[0].T.todense(), index=tfidf_vectorizer.get_feature_names(), columns=["TF-IDF"])
tf_idf = tf_idf.sort_values('TF-IDF', ascending=False)
print (tf_idf.head(5))

            TF-IDF
score     0.656073
clash     0.513358
pakistan  0.441258
kill      0.333651
1st       0.000000


### Saving results

In [34]:
pickle.dump(tfidf_matrix, open("/content/drive/MyDrive/tfidf_matrix.pkl", "wb" ))


In [35]:
pickle.dump(tfidf_vectorizer, open( "/content/drive/MyDrive/tfidf_vectorizer.pkl", "wb" ))



In [36]:
tfidf_matrix = pickle.load(open("/content/drive/MyDrive/tfidf_matrix.pkl", "rb" ))


In [37]:
tfidf_vectorizer = pickle.load(open("/content/drive/MyDrive/tfidf_vectorizer.pkl", "rb" ))

In [38]:
tfidf_matrix

<509236x1814 sparse matrix of type '<class 'numpy.float64'>'
	with 3565328 stored elements in Compressed Sparse Row format>

Model

In [39]:
#setting up the label. 
np.quantile(news_df['up_votes'], 0.8)

24.0

In [40]:
#Setting a cut() level for the label to know which news article is hot or not.
news_df['up_votes'].describe()

count    509236.000000
mean        112.236283
std         541.694675
min           0.000000
25%           1.000000
50%           5.000000
75%          16.000000
max       21253.000000
Name: up_votes, dtype: float64

In [41]:
#1 = high up_votes (> np.quantile(news_df['up_votes'], 0.8)), 0 = low up_votes (<= np.quantile(news_df['up_votes'], 0.8))
thre = np.quantile(news_df['up_votes'], 0.8)
y = [1 if i > thre else 0 for i in news_df['up_votes']]
y = np.array(y)
X_train, X_test, y_train, y_test = train_test_split(tfidf_matrix, y, test_size = 0.2, shuffle = True, random_state = 123)

MultinomialNB

In [42]:
clf = MultinomialNB()
clf.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [43]:
y_predict = clf.predict(X_test)
clf.score(X_test, y_test)

0.804335873065745

In [44]:
print(classification_report(y_test, y_predict))

              precision    recall  f1-score   support

           0       0.80      1.00      0.89     81923
           1       0.48      0.00      0.00     19925

    accuracy                           0.80    101848
   macro avg       0.64      0.50      0.45    101848
weighted avg       0.74      0.80      0.72    101848



LogisticRegression

In [45]:
LR = LogisticRegression(C=1.0, penalty='elasticnet', solver='saga', tol=0.01, l1_ratio=0.5)

In [46]:
LR.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=0.5, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='elasticnet',
                   random_state=None, solver='saga', tol=0.01, verbose=0,
                   warm_start=False)

In [47]:
y_predict = LR.predict(X_test)
LR.score(X_test, y_test)

0.8050329903385437

In [48]:
print(classification_report(y_test, y_predict))

              precision    recall  f1-score   support

           0       0.81      0.99      0.89     81923
           1       0.53      0.04      0.07     19925

    accuracy                           0.81    101848
   macro avg       0.67      0.51      0.48    101848
weighted avg       0.75      0.81      0.73    101848




GBDT

In [49]:
gbdt = GradientBoostingClassifier()
gbdt.fit(X_train, y_train)

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [50]:
y_predict = gbdt.predict(X_test)
gbdt.score(X_test, y_test)

0.8045715183410572

In [51]:
print(classification_report(y_test, y_predict))

              precision    recall  f1-score   support

           0       0.80      1.00      0.89     81923
           1       0.62      0.00      0.01     19925

    accuracy                           0.80    101848
   macro avg       0.71      0.50      0.45    101848
weighted avg       0.77      0.80      0.72    101848



Random Forest

In [52]:
rfc = RandomForestClassifier(n_jobs = -1, max_features = 'sqrt', n_estimators = 10, oob_score = True)
rfc.fit(X_train, y_train)

  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])


RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='sqrt',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
                       oob_score=True, random_state=None, verbose=0,
                       warm_start=False)

In [53]:
y_predict = rfc.predict(X_test)
rfc.score(X_test, y_test)

0.7924554237687534

In [54]:
print(classification_report(y_test, y_predict))

              precision    recall  f1-score   support

           0       0.81      0.97      0.88     81923
           1       0.30      0.05      0.08     19925

    accuracy                           0.79    101848
   macro avg       0.55      0.51      0.48    101848
weighted avg       0.71      0.79      0.73    101848



XGB

In [55]:
import xgboost as xgb
from xgboost.sklearn import XGBClassifier

In [56]:
xgb = XGBClassifier(
 learning_rate =0.1,
 n_estimators=1000,
 max_depth=5,
 min_child_weight=1,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)

In [57]:
xgb.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.8, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=5,
              min_child_weight=1, missing=None, n_estimators=1000, n_jobs=1,
              nthread=4, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=27,
              silent=None, subsample=0.8, verbosity=1)

In [58]:
y_predict = xgb.predict(X_test)

In [59]:
xgb.score(X_test, y_test)

0.8051311758699238

In [60]:
print(classification_report(y_test, y_predict))

              precision    recall  f1-score   support

           0       0.81      0.99      0.89     81923
           1       0.53      0.04      0.07     19925

    accuracy                           0.81    101848
   macro avg       0.67      0.51      0.48    101848
weighted avg       0.75      0.81      0.73    101848



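### Comments
The best accuracy is about 80.5%, achieved by the Extreme Gradient Boosting (0.8051), Logistic Regression (0.8050) and Gradient Boosting (0.8046) classifiers. Gradient Boosting has the highest precision on the high up_votes class (62%), which means that when it does predict that a 'title' has high up_votes, it is right more often than the other models. Its recall on that class is 0.00, however, so it flags almost none of the high up_votes titles; Logistic Regression and Extreme Gradient Boosting reach 0.04 and Random Forest 0.05. All of these accuracies are close to the 80.4% obtained by always predicting the low class (81,923 of the 101,848 test titles).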
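
To put the accuracy figures in context, the sketch below computes the accuracy of a classifier that always predicts the low class, using the class supports printed in the classification reports. The confusion counts in the second half are hypothetical, chosen only to show how a precision of 0.62 can coexist with a recall near zero; they are not taken from any of the fitted models.

```python
# Class supports from the classification reports: 81,923 low and 19,925 high titles.
n_low, n_high = 81923, 19925

# Baseline: predict "low" for every title.
baseline_accuracy = n_low / (n_low + n_high)
print(f"majority-class accuracy: {baseline_accuracy:.4f}")

def precision_recall_f1(tp, fp, fn):
    # Standard definitions, guarding against division by zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical confusion counts for a classifier that rarely predicts "high".
tp, fp = 62, 38
fn = n_high - tp
p, r, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={p:.2f} recall={r:.4f} f1={f1:.4f}")
```

With an imbalance of roughly 80/20, precision on the minority class says little unless it is read together with recall.
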

## BERT Model for text classification
BERT needs its own preprocessing: the titles are tokenized with BERT's tokenizer instead of the stemming and TF-IDF steps used above.

In [64]:
#about the text
title_vote= news_df[['title', 'up_votes']].copy()
title_vote.head()


Unnamed: 0,title,up_votes
0,Scores killed in Pakistan clashes,3
1,Japan resumes refuelling mission,2
2,US presses Egypt on Gaza border,3
3,Jump-start economy: Give health care to all,1
4,Council of Europe bashes EU&UN terror blacklist,4


In [65]:
#The labels.
thre = np.quantile(news_df['up_votes'], 0.8)
title_vote['y_vote'] = [1 if i > thre else 0 for i in news_df['up_votes']]

In [66]:
title_vote.head()

Unnamed: 0,title,up_votes,y_vote
0,Scores killed in Pakistan clashes,3,0
1,Japan resumes refuelling mission,2,0
2,US presses Egypt on Gaza border,3,0
3,Jump-start economy: Give health care to all,1,0
4,Council of Europe bashes EU&UN terror blacklist,4,0


In [52]:
#delete
title_vote1 = title_vote[['title', 'y_vote']]
title_vote1.head()

Unnamed: 0,title,y_vote
0,Scores killed in Pakistan clashes,0
1,Japan resumes refuelling mission,0
2,US presses Egypt on Gaza border,0
3,Jump-start economy: Give health care to all,0
4,Council of Europe bashes EU&UN terror blacklist,0


In [67]:
!pip install -q -U tensorflow-text
!pip install transformers


In [68]:
pip install tf-models-official


In [69]:
import tensorflow as tf
import os
import shutil
import tensorflow_hub as hub
import tensorflow_text as text
from official.nlp import optimization  # to create AdamW optimizer

import matplotlib.pyplot as plt

tf.get_logger().setLevel('ERROR')

In [70]:
#From Hugging Face Transformers library

from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures


In [71]:
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")


All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [72]:
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  1538      
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________


In [73]:
import tensorflow as tf
import pandas as pd

In [74]:
title_vote.head()
title_vote.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 509236 entries, 0 to 509235
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   title     509236 non-null  object
 1   up_votes  509236 non-null  int64 
 2   y_vote    509236 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 11.7+ MB


In [75]:
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split


In [77]:
#Stratified splitting (35% test) of the dataset to maintain the label distribution
split = StratifiedShuffleSplit(n_splits=1, test_size=0.35, random_state=42)
for train_index, test_index in split.split(title_vote, title_vote["y_vote"]):
    strat_train_news = title_vote.loc[train_index]
    strat_test_news = title_vote.loc[test_index]

In [78]:
#Sanity checks
strat_train_news.info()
strat_train_news.head()
    

<class 'pandas.core.frame.DataFrame'>
Int64Index: 331003 entries, 149289 to 393311
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   title     331003 non-null  object
 1   up_votes  331003 non-null  int64 
 2   y_vote    331003 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 10.1+ MB


Unnamed: 0,title,up_votes,y_vote
149289,The Saudi government is placing its bets squar...,48,1
491574,The long-lost ship of British polar explorer S...,1737,1
385457,US Embassy employee gunned down in Pakistan,51,1
172973,Malawi: Madonna Demanded Special Treatment,4,0
308167,Gay sex could be punishable by 100 lashes of t...,508,1


In [79]:
strat_test_news.info()
strat_test_news.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 178233 entries, 497236 to 68455
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   title     178233 non-null  object
 1   up_votes  178233 non-null  int64 
 2   y_vote    178233 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 5.4+ MB


Unnamed: 0,title,up_votes,y_vote
497236,Bomb Threat Prompts Evacuation at Brussels Tra...,15,0
107103,India’s ‘Hitler’ Soap Opera Stirs Controversy,4,0
125184,"Euro Crisis Deepens: After Spain, the focus of...",2,0
395063,Australia s Bernie Fraser quits as chairman of...,8,0
265185,UK: Illegal immigrants and foreign offenders ...,6,0


### Create input sequences
* Using the `InputExample` class, we can convert the pandas dataframes into a format suitable for the BERT model.
1. `convert_data_to_examples`: accepts our train and test datasets and converts each row into an `InputExample` object.

2. `convert_examples_to_tf_dataset`: tokenizes the `InputExample` objects, builds the required input format from the tokenized objects, and finally creates an input dataset that we can feed to the model.
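
The "required input format" is three equal-length integer lists per title. The sketch below imitates padding and truncation to a fixed length in plain Python. The function `pad_or_truncate` and the token ids are hypothetical; a real tokenizer also keeps the closing special token when it truncates, which this toy version does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    # Truncate to max_length, then right-pad with pad_id.
    ids = token_ids[:max_length]
    attention_mask = [1] * len(ids)  # 1 marks a real token
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,  # 0 marks padding
        "token_type_ids": [0] * max_length,  # single-sentence input: all zeros
    }

# Hypothetical token ids; real ids come from the tokenizer's vocabulary.
example = pad_or_truncate([101, 7, 8, 9, 102], max_length=8)
print(example)
```

The attention mask lets the model ignore the padded positions, so titles of different lengths can share one batch.
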

In [80]:
def convert_data_to_examples(train, test, DATA_COLUMN, LABEL_COLUMN): 
  train_InputExamples = train.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                          text_a = x[DATA_COLUMN], 
                                                          text_b = None,
                                                          label = x[LABEL_COLUMN]), axis = 1)

  validation_InputExamples = test.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                          text_a = x[DATA_COLUMN], 
                                                          text_b = None,
                                                          label = x[LABEL_COLUMN]), axis = 1)
  
  return train_InputExamples, validation_InputExamples

  
def convert_examples_to_tf_dataset(examples, tokenizer, max_length=128):
    features = [] # -> will hold InputFeatures to be converted later

    for e in examples:
        #  we will use padding to make all the sentences have the same length
        input_dict = tokenizer.encode_plus(
            e.text_a,
            add_special_tokens=True,
            max_length=max_length, # truncates if len(s) > max_length
            return_token_type_ids=True,
            return_attention_mask=True,
            padding="max_length", # pads to the right by default; replaces the deprecated pad_to_max_length=True
            truncation=True
        )

        input_ids, token_type_ids, attention_mask = (input_dict["input_ids"],
            input_dict["token_type_ids"], input_dict['attention_mask'])

        features.append(
            InputFeatures(
                input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, label=e.label
            )
        )

    def gen():
        for f in features:
            yield (
                {
                    "input_ids": f.input_ids,
                    "attention_mask": f.attention_mask,
                    "token_type_ids": f.token_type_ids,
                },
                f.label,
            )

    return tf.data.Dataset.from_generator(
        gen,
        ({"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}, tf.int64),
        (
            {
                "input_ids": tf.TensorShape([None]),
                "attention_mask": tf.TensorShape([None]),
                "token_type_ids": tf.TensorShape([None]),
            },
            tf.TensorShape([]),
        ),
    )




In [81]:
DATA_COLUMN = 'title' 
LABEL_COLUMN = 'y_vote' 

In [82]:
#Calling the above functions
train_InputExamples, validation_InputExamples = convert_data_to_examples(strat_train_news, strat_test_news, DATA_COLUMN, LABEL_COLUMN)

train_data = convert_examples_to_tf_dataset(list(train_InputExamples), tokenizer)
train_data = train_data.shuffle(100).batch(32).repeat(2)

validation_data = convert_examples_to_tf_dataset(list(validation_InputExamples), tokenizer)
validation_data = validation_data.batch(32)



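### Configuring the BERT Model

* Because the labels are integers rather than one-hot vectors, we can use sparse categorical cross-entropy as the loss and sparse categorical accuracy as the metric.
* We train the model for 2 epochs using a GPU. Because the training dataset is built with `.repeat(2)`, each epoch passes over the training data twice.
>This takes about 4 hours.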
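
As a rough illustration of the loss used here, the sketch below computes sparse categorical cross-entropy from raw logits for a single example in plain Python. It mirrors the idea behind `SparseCategoricalCrossentropy(from_logits=True)` and is not the Keras implementation; the logits are made up.

```python
import math

def sparse_categorical_crossentropy(logits, label):
    # Softmax over the logits, then the negative log-probability of the true class.
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[label])

# Hypothetical logits for two classes (0 = low up_votes, 1 = high up_votes).
logits = [2.0, 0.5]
loss_if_low = sparse_categorical_crossentropy(logits, 0)
loss_if_high = sparse_categorical_crossentropy(logits, 1)
print(loss_if_low, loss_if_high)
```

The label is passed as an integer class index, which is why no one-hot encoding is needed.
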

In [84]:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0), 
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')])
model.fit(train_data, epochs=2, validation_data=validation_data)


Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7f9c07a390d0>

In [85]:
from tensorflow import keras

In [88]:
model.save('/content/drive/MyDrive')



In [None]:
#"/content/drive/MyDrive

Making predictions
Create a few news headlines to test. Sentence 1 should be upvoted, sentences 2 and 3 are ordinary, and sentence 4 was copied from a title with high up_votes in the dataset.

In [86]:
pred_sentences = ['The police officer who leaked the footage of sex is killed',
                  'DOW is up 5% today', 'It s election year', 'Hundreds of thousands of leaked emails reveal massively widespread corruption in global oil industry']

In [87]:
tf_batch = tokenizer(pred_sentences, max_length=128, padding=True, truncation=True, return_tensors='tf')
tf_outputs = model(tf_batch)
tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)
#labels = ['Negative','Positive']
labels = [0,1]
label = tf.argmax(tf_predictions, axis=1)
label = label.numpy()
for i in range(len(pred_sentences)):
  print(pred_sentences[i], ": \n", labels[label[i]])

The police officer who leaked the footage of sex is killed : 
 1
DOW is up 5% today : 
 0
It s election year : 
 0
Hundreds of thousands of leaked emails reveal massively widespread corruption in global oil industry : 
 1


In [89]:
strat_test_news.sort_values(by=['up_votes'], ascending=False)

Unnamed: 0,title,up_votes,y_vote
391415,Twitter has forced 30 websites that archive po...,13435,1
391318,The police officer who leaked the footage of t...,12333,1
390252,Paris shooting survivor suing French media for...,11288,1
449809,Hundreds of thousands of leaked emails reveal ...,11108,1
500786,Feeding cows seaweed could slash global greenh...,10394,1
...,...,...,...
137417,PLO: 18 Palestinians killed in Damascus by Syr...,0,0
398787,Silicon Valley shouldn’t let China strong-arm ...,0,0
468658,Facebook accidentally declared the Philippines...,0,0
2576,Jewish groups condemn FIA boss over Nazi sex...,0,0


In [26]:
from tensorflow import keras
#model = keras.models.load_model('path/to/location')

In [27]:
#load the saved model to fit or predict again
model = keras.models.load_model('/content/drive/MyDrive')