## Performing sentiment analysis using DistilBERT

The fine-tuned version of DistilBERT used here is hosted in [this repository](https://huggingface.co/noahnsimbe/DistilBERT-yelp-sentiment-analysis).

The model implementation can be found [in this notebook](https://huggingface.co/noahnsimbe/DistilBERT-yelp-sentiment-analysis/blob/main/DistilBERT.ipynb).


In [None]:
from transformers import pipeline

## Loading the model

In [None]:
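# Download the fine-tuned checkpoint from the Hugging Face Hub and wrap it in a classification pipeline.
model = pipeline("text-classification", model="noahnsimbe/DistilBERT-yelp-sentiment-analysis")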

## Performing sentiment analysis

In [None]:
model(["Been going to Dr. Goldberg for over 10 years. I think I was one of his 1st patients when he started at MHMG. He's been great over the years and is really all about the big picture. It is because of him, not my now former gyn Dr. Markoff, that I found out I have fibroids. He explores all options with you and is very patient and understanding. He doesn't judge and asks all the right questions. Very thorough and wants to be kept in the loop on every aspect of your medical health and your life."])

`LABEL_0`, `LABEL_1`, and `LABEL_2` stand for `Negative`, `Neutral`, and `Positive`, respectively.
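
The label mapping above can be applied to the pipeline output directly. The following is a minimal sketch, not part of the original notebook: `LABEL_NAMES`, `to_readable`, and `sample_output` are hypothetical names, and the score in `sample_output` is made up for illustration. It assumes the pipeline returns a list of dictionaries with `label` and `score` keys, which is the usual shape for a text-classification pipeline.

```python
# Mapping taken from the note above: LABEL_0 -> Negative, LABEL_1 -> Neutral, LABEL_2 -> Positive.
LABEL_NAMES = {"LABEL_0": "Negative", "LABEL_1": "Neutral", "LABEL_2": "Positive"}


def to_readable(predictions):
    """Replace raw pipeline labels with readable sentiment names.

    Labels that are not in LABEL_NAMES are passed through unchanged.
    """
    return [
        {"label": LABEL_NAMES.get(p["label"], p["label"]), "score": p["score"]}
        for p in predictions
    ]


# Hypothetical pipeline output (the score is invented, not a real model result).
sample_output = [{"label": "LABEL_2", "score": 0.98}]
readable = to_readable(sample_output)
print(readable)
```

In the notebook, `to_readable(model([...]))` would wrap the call from the previous cell, provided the output has the assumed shape.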

# @noah: this is only a layout for the fine-tuning code; you should build on it

In [None]:
import numpy as np
import pandas as pd  # needed for pd.isnull when cleaning the texts below
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification

from trainmodel import prepare_input_data  # local helper module


In [None]:
train_texts, train_labels, test_texts, test_labels, _ = prepare_input_data('train_data.csv', 'test_data.csv')


In [None]:
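# Replace missing reviews with empty strings and make sure every entry is a str,
# since the tokenizer only accepts strings.
train_texts = [str(text) for text in np.where(pd.isnull(train_texts), '', train_texts)]
test_texts = [str(text) for text in np.where(pd.isnull(test_texts), '', test_texts)]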


In [None]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

def encode_texts(tokenizer, texts):
    return tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="tf")

train_encodings = encode_texts(tokenizer, train_texts)
test_encodings = encode_texts(tokenizer, test_texts)


In [None]:
train_dataset = tf.data.Dataset.from_tensor_slices((
    {'input_ids': train_encodings['input_ids'], 'attention_mask': train_encodings['attention_mask']},
    train_labels
)).batch(16)

test_dataset = tf.data.Dataset.from_tensor_slices((
    {'input_ids': test_encodings['input_ids'], 'attention_mask': test_encodings['attention_mask']},
    test_labels
)).batch(16)


In [None]:
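# One output unit per distinct label found in the training set.
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=np.unique(train_labels).size)

optimizer = Adam(learning_rate=5e-5)
# The model outputs raw logits, so the loss must not expect probabilities.
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])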


In [None]:
model.fit(train_dataset, validation_data=test_dataset, epochs=3)
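
After training, the classifier produces one row of raw scores (logits) per review rather than a sentiment name. The sketch below shows one way to turn such a row into a label and a confidence value using softmax and argmax. It is plain Python so it runs without the trained model. `SENTIMENTS`, `softmax`, `predict_sentiment`, and `example_logits` are hypothetical names, the logits are made up, and the class order assumes the training labels are encoded as 0 = Negative, 1 = Neutral, 2 = Positive, which depends on how `prepare_input_data` encodes them.

```python
import math

# Assumed class order: index 0 = Negative, 1 = Neutral, 2 = Positive.
SENTIMENTS = ["Negative", "Neutral", "Positive"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_sentiment(logits):
    """Return the most likely sentiment name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax
    return SENTIMENTS[best], probs[best]


# Made-up logits for a single review, not real model output.
example_logits = [-1.2, 0.3, 2.5]
label, confidence = predict_sentiment(example_logits)
print(label, round(confidence, 3))
```

With the trained model, each row of `model.predict(test_dataset).logits` could likely be passed to `predict_sentiment` after converting it to a list, assuming the prediction output exposes a `logits` field as Hugging Face TensorFlow classifiers normally do.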
