# Natural Language Processing

**Objective**
- Demonstrate how to use slickML to implement sentiment analysis with a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model.
- Note: the expected input data is text (e.g. reviews) with labels (e.g. "Positive", "Neutral", "Negative").
- Reference: https://huggingface.co/transformers/model_doc/bert.html
- Example dataset: tweets scraped from Twitter during the 2016 U.S. Presidential Debates

In [1]:
# Change path to project root
%cd ..

/Users/tracesmith/Desktop/Trace/Code_Library/slick-ml


In [7]:
%load_ext autoreload

# widen the screen
from IPython.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

import os, sys
import pandas as pd
import transformers
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format='retina'
sns.set(style='whitegrid', palette='muted', font_scale=1.2)

In [11]:
from slickml.nlp import SentimentTorch

## Load Data

In [57]:
df = pd.read_csv('data/sentiment.csv')

In [58]:
df.head()

| Unnamed: 0 | id | candidate | candidate_confidence | relevant_yn | relevant_yn_confidence | sentiment | sentiment_confidence | subject_matter | subject_matter_confidence | candidate_gold | ... | relevant_yn_gold | retweet_count | sentiment_gold | subject_matter_gold | text | tweet_coord | tweet_created | tweet_id | tweet_location | user_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | No candidate mentioned | 1.0 | yes | 1.0 | Neutral | 0.6578 | None of the above | 1.0 | | ... | | 5 | | | RT @NancyLeeGrahn: How did everyone feel about... | | 2015-08-07 09:54:46 -0700 | 629697200650592256 | | Quito |
| 1 | 2 | Scott Walker | 1.0 | yes | 1.0 | Positive | 0.6333 | None of the above | 1.0 | | ... | | 26 | | | RT @ScottWalker: Didn't catch the full #GOPdeb... | | 2015-08-07 09:54:46 -0700 | 629697199560069120 | | |
| 2 | 3 | No candidate mentioned | 1.0 | yes | 1.0 | Neutral | 0.6629 | None of the above | 0.6629 | | ... | | 27 | | | RT @TJMShow: No mention of Tamir Rice and the ... | | 2015-08-07 09:54:46 -0700 | 629697199312482304 | | |
| 3 | 4 | No candidate mentioned | 1.0 | yes | 1.0 | Positive | 1.0 | None of the above | 0.7039 | | ... | | 138 | | | RT @RobGeorge: That Carly Fiorina is trending ... | | 2015-08-07 09:54:45 -0700 | 629697197118861312 | Texas | Central Time (US & Canada) |
| 4 | 5 | Donald Trump | 1.0 | yes | 1.0 | Positive | 0.7045 | None of the above | 1.0 | | ... | | 156 | | | RT @DanScavino: #GOPDebate w/ @realDonaldTrump... | | 2015-08-07 09:54:45 -0700 | 629697196967903232 | | Arizona |


### Convert Dataset to Binary Classification + Encode Labels

In [8]:
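# Drop neutral tweets so the task becomes binary (Negative vs. Positive)
df = df[df['sentiment'] != 'Neutral'].reset_index(drop=True)
# LabelEncoder sorts classes alphabetically: Negative -> 0, Positive -> 1
le = LabelEncoder()
df['sentiment'] = le.fit_transform(df['sentiment'])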
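
To make the label encoding explicit, the sketch below repeats the same two steps on a small made-up frame and recovers the integer-to-string mapping from the fitted encoder. The `toy` frame, the `label` column, and the `label_map` name are illustrative assumptions and not part of the notebook's dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for the 'sentiment' column
toy = pd.DataFrame({"sentiment": ["Positive", "Negative", "Neutral", "Negative"]})

# Same steps as the notebook cell: drop neutral rows, then encode the labels
toy = toy[toy["sentiment"] != "Neutral"].reset_index(drop=True)
encoder = LabelEncoder()
toy["label"] = encoder.fit_transform(toy["sentiment"])

# classes_ is sorted, so position i holds the string behind integer i
label_map = {str(name): int(code) for code, name in enumerate(encoder.classes_)}
print(label_map)

# inverse_transform turns integer predictions back into the original strings
decoded = [str(s) for s in encoder.inverse_transform([1, 0])]
print(decoded)
```

Keeping the fitted encoder around is useful later, since it gives the class order needed for `target_names` in a classification report.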

### Split Train/Test

In [None]:
# Hold out 10% of the data, then split it evenly into validation and test sets (90/5/5)
train, test = train_test_split(df, test_size=0.10, random_state=123)
val, test = train_test_split(test, test_size=0.50, random_state=123)
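
The two-step split can be checked on synthetic data. The sketch below uses a made-up 100-row frame and additionally passes `stratify`, which the notebook does not do, so that each subset keeps the original class ratio. Treat the `stratify` argument as an optional variation rather than the notebook's method.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 100 rows, 70 negative (0) and 30 positive (1)
toy = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(100)],
    "sentiment": [0] * 70 + [1] * 30,
})

# Step 1: hold out 10% of the rows, keeping the class ratio
train, holdout = train_test_split(
    toy, test_size=0.10, random_state=123, stratify=toy["sentiment"]
)
# Step 2: split the holdout evenly into validation and test sets
val, test = train_test_split(
    holdout, test_size=0.50, random_state=123, stratify=holdout["sentiment"]
)

print(len(train), len(val), len(test))
```

With an imbalanced sentiment column, an unstratified 5% test set can end up with very few examples of the minority class, which makes the per-class metrics noisy.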

### Explore Maximum Sequence Length

In [None]:
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased')

In [None]:
token_len = []
for txt in df.text:
    tokens = tokenizer.encode(str(txt),max_length=512,truncation=True)
    token_len.append(len(tokens))

In [None]:
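# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(token_len, kde=True)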
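
Besides reading the plot, one way to pick a `max_len` value is to take a high percentile of the token counts. The sketch below is a minimal illustration with a hypothetical helper, `suggest_max_len`, and made-up lengths. In the notebook the input would be the `token_len` list, whose counts include the special tokens that `tokenizer.encode` adds by default.

```python
import numpy as np

def suggest_max_len(token_len, coverage=99.0):
    """Return the `coverage`-th percentile of the token counts, rounded up (hypothetical helper)."""
    return int(np.ceil(np.percentile(token_len, coverage)))

# Hypothetical token counts; replace with the token_len list from the notebook
example_lengths = list(range(1, 101))

p99 = suggest_max_len(example_lengths, coverage=99.0)
p50 = suggest_max_len(example_lengths, coverage=50.0)
print(p99, p50)
```

A smaller `max_len` shortens every padded batch and speeds up training, at the cost of truncating the few longest texts.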

### Model Training

In [None]:
Model = SentimentTorch(epochs=1, batch_size=1, max_len=60, n_classes=2, n_workers=4)
print(Model)

In [None]:
history = Model.fit(train,val)

### Evaluate Model

In [None]:
Y_pred, Y_proba, Y_test = Model.predict(test)

In [None]:
print(classification_report(Y_test, Y_pred, target_names=["Negative", "Positive"]))
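
A confusion matrix complements the classification report by showing the raw counts behind precision and recall. The sketch below uses made-up labels and predictions. It assumes, as the encoding step suggests, that 0 stands for Negative and 1 for Positive, and that `Y_test` and `Y_pred` are integer label sequences of this kind.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (0 = Negative, 1 = Positive)
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes, in the order given by `labels`
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = (int(v) for v in cm.ravel())

print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```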

### Plot Training History

In [None]:
def plot_history(history):
    """
    Plot train/validation accuracy per epoch from the model history.

    Parameters
    ----------
    history: dict
        Model training/validation history with keys 'train_acc' and 'val_acc'
    """
    plt.figure(figsize=(10, 8))
    plt.plot(history['train_acc'], color='navy', label='train accuracy')
    plt.plot(history['val_acc'], color='orange', label='validation accuracy')
    plt.title('Training history', fontsize=18)
    plt.ylabel('Accuracy')
    plt.xlabel('Epoch')
    plt.legend()
    plt.ylim([0, 1])

plot_history(history)