## Sentiment Analysis Model

### Requirements

In [77]:
# !pip install transformers
# !pip install tensorflow
# !pip install -U jupyter
# !pip install ipywidgets
# !pip install nltk
# Reference for the TensorFlow Auto model classes:
# https://huggingface.co/transformers/v3.0.2/model_doc/auto.html

### Imports

In [79]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
from tensorflow.nn import softmax
from tensorflow.keras import Sequential, layers
import pandas as pd
import glob
from big_picture.get_merged_data import get_data
from big_picture import pre_processor
import numpy as np

### Model and base example

In [80]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_39']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [81]:
text = "Replace me by any text you'd like"
encoded_input = tokenizer(text, return_tensors='tf')
encoded_input

{'input_ids': <tf.Tensor: shape=(1, 11), dtype=int32, numpy=
array([[ 101, 5672, 2033, 2011, 2151, 3793, 2017, 1005, 1040, 2066,  102]],
      dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 11), dtype=int32, numpy=array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>}

In [82]:
output = model(encoded_input)
output

TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[ 3.3290102, -2.7952664]], dtype=float32)>, hidden_states=None, attentions=None)

In [83]:
softmax(output.logits)

<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[0.9978157 , 0.00218429]], dtype=float32)>
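
The softmax row above is easier to read once each column is tied to a label. The sketch below recomputes the probabilities from the logits printed above and picks the most likely class. It assumes the checkpoint's label order is index 0 = NEGATIVE and index 1 = POSITIVE; `ID2LABEL`, `softmax_row` and `top_label` are illustrative names, and the mapping is worth checking against `model.config.id2label`.

```python
import math

# Assumed label order for this checkpoint; verify with model.config.id2label.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax_row(logits):
    """Numerically stable softmax over a flat list of logits."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits, id2label=ID2LABEL):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax_row(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Logits copied from the model output above.
label, score = top_label([3.3290102, -2.7952664])
print(label, round(score, 4))
```

Under that assumed mapping, the placeholder sentence is scored as negative with a probability of about 0.998.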

### Basic modelling of our data

In [55]:
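REL_PATH_INPUT = "../raw_data/data_12k/"
CONTENT_COL = "content"
DESCRIPTION_COL = "short_description"
HEADLINE_COL = "headline"

# Name of the column that holds the combined text of each article
news_all_data = "news_all_data"

df = get_data(REL_PATH_INPUT)
# Work on a random sample of 150 rows to keep the experiment fast
df = df.sample(150)
# Replace newlines in the article body with spaces
df[CONTENT_COL] = df[CONTENT_COL].replace('\n', ' ', regex=True)

# Combine body, short description and headline into one string per row.
# The result is NaN whenever any of the three columns is NaN.
df[news_all_data] = df[CONTENT_COL] + " " + df[DESCRIPTION_COL] + " " + df[HEADLINE_COL]
# Drop rows without combined text; reset_index() keeps the old labels in an "index" column
df = df.dropna(subset=[news_all_data]).reset_index()
# Note: this compares the full combined string, not the content column alone
df = df[df[news_all_data] != "Invalid file"].reset_index(drop=True)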

In [56]:
texts = list(df.news_all_data)

In [66]:
# Override the list above with a single headline for a quick check
texts = "Finland has banned the neo-Nazi Nordic Resistance Movement following a court ruling"

In [67]:
encoded_input = tokenizer(texts, return_tensors='tf',
                padding=True)

In [68]:
# When one text:
# [Encoding(num_tokens=16, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask,
# overflowing])]

In [69]:
encoded_input

{'input_ids': <tf.Tensor: shape=(1, 16), dtype=int32, numpy=
array([[  101,  6435,  2038,  7917,  1996,  9253,  1011,  6394, 13649,
         5012,  2929,  2206,  1037,  2457,  6996,   102]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 16), dtype=int32, numpy=array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>}

In [70]:
output = model(encoded_input)

In [71]:
softmax(output.logits)

<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[0.8897241 , 0.11027591]], dtype=float32)>
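
The cells above score a single headline. The same steps could be applied to the full `texts` list from `df.news_all_data`, but full articles can exceed DistilBERT's 512-token limit, and encoding everything in one call may use a lot of memory. The sketch below is one possible way to do it, with truncation and small batches. `classify_texts`, `softmax_row`, `batch_size=16` and `max_length=512` are illustrative choices, not part of the original notebook.

```python
import math


def softmax_row(logits):
    """Numerically stable softmax over a flat list of logits."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_texts(texts, tokenizer, model, batch_size=16, max_length=512):
    """Return one row of class probabilities per input text.

    `tokenizer` and `model` are expected to behave like the Hugging Face
    objects loaded earlier in this notebook.
    """
    if isinstance(texts, str):
        texts = [texts]  # accept a single string as well as a list

    results = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        encoded = tokenizer(
            batch,
            return_tensors="tf",
            padding=True,        # pad to the longest text in the batch
            truncation=True,     # cut texts longer than max_length tokens
            max_length=max_length,
        )
        logits = model(encoded).logits
        for row in logits:
            results.append(softmax_row([float(x) for x in row]))
    return results
```

With the objects from the earlier cells, a call would look like `probs = classify_texts(list(df.news_all_data), tokenizer, model)`. Truncation means only the first `max_length` tokens of each article influence the score.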