**Covid19 Twitter Sentiment Analysis with Hugging Face**

Hugging Face is a platform that provides open-source machine learning tools. You can install its packages to access pre-trained models, which you can use directly or fine-tune on your own dataset, building on the knowledge gained during the initial training. Once trained, you can host your models on the Hugging Face Hub and reuse them from other devices and applications. To access all the features of the platform, visit the website and sign in. To learn more about text classification with Hugging Face, refer to the Hugging Face documentation.

Please note that Hugging Face models are based on deep learning and require significant GPU computational power for training. We recommend using Colab, your preferred GPU cloud provider, or a local machine with an NVIDIA GPU for this purpose.
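
Before starting a long fine-tuning run, it can help to confirm that a GPU is actually visible. The snippet below is a minimal sketch, assuming PyTorch is the backend (as in the rest of this notebook); falling back to CPU when PyTorch is not installed is an illustrative choice, not part of the original workflow.

```python
# Pick the device the training run would use.
# Assumption: PyTorch is the backend; if it is missing we simply report "cpu".
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    device = "cpu"

print(f"Training would run on: {device}")
```

If this prints `cpu`, fine-tuning will still work but is likely to be much slower than on a GPU.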

**Application of Hugging Face Text Classification Model Fine-tuning**

In [27]:
!pip install datasets

In [29]:
!pip install transformers

In [30]:
!pip install sentencepiece

**Importing Relevant Libraries**

In [31]:
import os
import numpy as np
import pandas as pd
import torch
from datasets import load_dataset
from sklearn.model_selection import train_test_split

from transformers import AutoModelForSequenceClassification
# from transformers import TFAutoModelForSequenceClassification
# from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
# from transformers import BertTokenizer, BertModel
from transformers import AutoTokenizer, AutoConfig
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding

from sklearn.metrics import mean_squared_error

In [32]:
# Disable W&B
os.environ["WANDB_DISABLED"] = "true"

In [36]:
# Load the dataset from a GitHub link
import pandas as pd
url = "https://raw.githubusercontent.com/Azubi-Africa/Career_Accelerator_P5-NLP/master/zindi_challenge/data/Train.csv"
df = pd.read_csv(url, encoding="ISO-8859-1")


# A way to eliminate rows containing NaN values
df = df[~df.isna().any(axis=1)]

**Splitting the dataset into Train & Eval**

In [37]:
# Split the train data => {train, eval}
train, eval = train_test_split(df, test_size=0.2, random_state=42, stratify=df['label'])

In [38]:
train.head()

Unnamed: 0,tweet_id,safe_text,label,agreement
9305,YMRMEDME,Mickey's Measles has gone international <url>,0.0,1.0
3907,5GV8NEZS,S1256 [NEW] Extends exemption from charitable ...,0.0,1.0
795,EI10PS46,<user> your ignorance on vaccines isn't just ...,1.0,0.666667
5793,OM26E6DG,Pakistan partly suspends polio vaccination pro...,0.0,1.0
3431,NBBY86FX,In other news I've gone up like 1000 mmr,0.0,1.0


In [39]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7999 entries, 9305 to 1387
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   tweet_id   7999 non-null   object 
 1   safe_text  7999 non-null   object 
 2   label      7999 non-null   float64
 3   agreement  7999 non-null   float64
dtypes: float64(2), object(2)
memory usage: 312.5+ KB


In [40]:
train.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
label,7999.0,0.301413,0.646832,-1.0,0.0,0.0,1.0,1.0
agreement,7999.0,0.854398,0.180677,0.333333,0.666667,1.0,1.0,1.0


In [41]:
eval.head()

Unnamed: 0,tweet_id,safe_text,label,agreement
6571,R7JPIFN7,Children's Museum of Houston to Offer Free Vac...,1.0,1.0
1754,2DD250VN,<user> no. I was properly immunized prior to t...,1.0,1.0
3325,ESEVBTFN,<user> thx for posting vaccinations are impera...,1.0,1.0
1485,S17ZU0LC,This Baby Is Exactly Why Everyone Needs To Vac...,1.0,0.666667
4175,IIN5D33V,"Meeting tonight, 8:30pm in room 322 of the stu...",1.0,1.0


In [42]:
eval.label.unique()

array([ 1., -1.,  0.])

In [43]:
print(f"new dataframe shapes: train is {train.shape}, eval is {eval.shape}")

new dataframe shapes: train is (7999, 4), eval is (2000, 4)


**Creating a Hugging Face dataset**

In [44]:
from datasets import DatasetDict, Dataset
train_dataset = Dataset.from_pandas(train[['tweet_id', 'safe_text', 'label', 'agreement']])
eval_dataset = Dataset.from_pandas(eval[['tweet_id', 'safe_text', 'label', 'agreement']])

dataset = DatasetDict({'train': train_dataset, 'eval': eval_dataset})
dataset = dataset.remove_columns('__index_level_0__')
dataset

DatasetDict({
    train: Dataset({
        features: ['tweet_id', 'safe_text', 'label', 'agreement'],
        num_rows: 7999
    })
    eval: Dataset({
        features: ['tweet_id', 'safe_text', 'label', 'agreement'],
        num_rows: 2000
    })
})

**Preprocessing our data**

In [45]:
# Preprocess text (username and link placeholders)
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

# checkpoint = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
checkpoint = "roberta-base"
# checkpoint = "xlnet-base-cased"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
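
The `preprocess` helper above is defined but not applied anywhere in this notebook, and the `safe_text` column already contains `<user>` and `<url>` placeholders. As a rough illustration of what the helper does on raw text, here is a sketch that runs it on a made-up tweet (the sample string is hypothetical, not taken from the dataset).

```python
def preprocess(text):
    """Replace @mentions and links with generic placeholders."""
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

# Hypothetical raw tweet, for illustration only
sample = "@bob thanks for the info https://example.com/x"
cleaned = preprocess(sample)
print(cleaned)  # @user thanks for the info http
```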

In [46]:
def transform_labels(label):

    label = label['label']
    num = 0
    if label == -1: #'Negative'
        num = 0
    elif label == 0: #'Neutral'
        num = 1
    elif label == 1: #'Positive'
        num = 2

    return {'labels': num}

def tokenize_data(example):
    return tokenizer(example['safe_text'], padding='max_length')

# Change the tweets to tokens that the models can exploit
dataset = dataset.map(tokenize_data, batched=True)

# Transform labels and remove the useless columns
remove_columns = ['tweet_id', 'label', 'safe_text', 'agreement']
dataset = dataset.map(transform_labels, remove_columns=remove_columns)
# data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [47]:
dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 7999
    })
    eval: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2000
    })
})
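
With `padding='max_length'`, every tweet is padded to the tokenizer's maximum length (512 tokens for `roberta-base`), even though tweets are short. The commented-out `DataCollatorWithPadding` hints at the alternative, which is to pad each batch only to its longest member. The sketch below illustrates the difference in plain Python; the token ids and the `pad_batch` helper are made up for illustration and are not part of the `transformers` API.

```python
def pad_batch(sequences, pad_id, length=None):
    """Right-pad each sequence to `length` (default: longest sequence in the batch)."""
    target = length if length is not None else max(len(s) for s in sequences)
    return [s + [pad_id] * (target - len(s)) for s in sequences]

# Toy token ids standing in for three short tokenized tweets (hypothetical values)
batch = [[0, 713, 16, 2], [0, 100, 657, 47, 328, 2], [0, 2]]

fixed = pad_batch(batch, pad_id=1, length=512)  # like padding='max_length'
dynamic = pad_batch(batch, pad_id=1)            # like padding per batch

fixed_tokens = sum(len(s) for s in fixed)       # 3 * 512
dynamic_tokens = sum(len(s) for s in dynamic)   # 3 * 6
print(fixed_tokens, dynamic_tokens)
```

In this toy batch the model would process 1536 positions with fixed padding versus 18 with per-batch padding, which is one reason dynamic padding tends to train faster on short texts.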

In [48]:
# Configure the training parameters, such as `num_train_epochs`:
# the number of times the model will repeat the training loop over the dataset
training_args = TrainingArguments(
    "test_trainer",
    num_train_epochs=10,
    load_best_model_at_end=True,
    save_strategy='epoch',
    evaluation_strategy='epoch',
    logging_strategy='epoch',
    logging_steps=100,
    per_device_train_batch_size=16,
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [49]:
# Loading a pretrain model while specifying the number of labels in our dataset for fine-tuning
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.dense.weight', 'roberta.pooler.dense.bias', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.

In [21]:
# set up the optimizer with the PyTorch implementation of AdamW
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

In [50]:
train_dataset = dataset['train'].shuffle(seed=24) 
eval_dataset = dataset['eval'].shuffle(seed=24) 

In [51]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"rmse": mean_squared_error(labels, predictions, squared=False)}
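
To make the metric concrete: `compute_metrics` takes the arg-max of the logits as the predicted class index (0, 1 or 2) and then computes the root mean squared error against the true class indices. Here is a small worked example with made-up logits, written with NumPy only so it does not depend on a particular scikit-learn version.

```python
import numpy as np

# Made-up logits for four tweets over the classes [Negative, Neutral, Positive]
logits = np.array([
    [2.0, 0.1, -1.0],
    [0.2, 1.5, 0.3],
    [-0.5, 0.1, 2.2],
    [1.0, 0.2, 0.1],
])
labels = np.array([0, 1, 2, 2])

predictions = np.argmax(logits, axis=-1)             # [0, 1, 2, 0]
rmse = np.sqrt(np.mean((labels - predictions) ** 2))  # sqrt((0 + 0 + 0 + 4) / 4)
print(predictions, rmse)
```

Because RMSE treats the class indices as ordered numbers, predicting Negative for a Positive tweet (an error of 2) is penalised more heavily than predicting Neutral (an error of 1).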

In [52]:
trainer = Trainer(
    model,
    training_args, 
    train_dataset=train_dataset, 
    eval_dataset=eval_dataset,
    # data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [53]:
trainer.train()

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Rmse
1,0.7428,0.70561,0.767789
2,0.6018,0.587418,0.663702
3,0.4638,0.616905,0.63206
4,0.3363,0.596771,0.597913
5,0.2376,0.776491,0.604979
6,0.171,0.883687,0.607042
7,0.1239,1.013598,0.614817
8,0.0938,1.080369,0.605392
9,0.0729,1.255414,0.630079
10,0.0574,1.288336,0.629285




TrainOutput(global_step=2500, training_loss=0.29012388076782225, metrics={'train_runtime': 4533.0274, 'train_samples_per_second': 17.646, 'train_steps_per_second': 0.552, 'total_flos': 2.104644228406272e+16, 'train_loss': 0.29012388076782225, 'epoch': 10.0})

In [54]:
# Launch the final evaluation 
trainer.evaluate()



{'eval_loss': 0.5874180793762207,
 'eval_rmse': 0.6637017402418047,
 'eval_runtime': 42.0346,
 'eval_samples_per_second': 47.58,
 'eval_steps_per_second': 2.974,
 'epoch': 10.0}
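
The evaluation above reports the epoch 2 numbers (loss 0.5874, RMSE 0.6637) rather than the epoch 10 ones. With `load_best_model_at_end=True` and no `metric_for_best_model` set, the `Trainer` restores the checkpoint with the lowest validation loss. The sketch below re-reads the training log in plain Python to show that the best epoch depends on which metric is used for selection.

```python
# (epoch, training loss, validation loss, RMSE) copied from the training log above
history = [
    (1, 0.7428, 0.705610, 0.767789),
    (2, 0.6018, 0.587418, 0.663702),
    (3, 0.4638, 0.616905, 0.632060),
    (4, 0.3363, 0.596771, 0.597913),
    (5, 0.2376, 0.776491, 0.604979),
    (6, 0.1710, 0.883687, 0.607042),
    (7, 0.1239, 1.013598, 0.614817),
    (8, 0.0938, 1.080369, 0.605392),
    (9, 0.0729, 1.255414, 0.630079),
    (10, 0.0574, 1.288336, 0.629285),
]

best_by_loss = min(history, key=lambda row: row[2])
best_by_rmse = min(history, key=lambda row: row[3])

print("Best epoch by validation loss:", best_by_loss[0])
print("Best epoch by RMSE:", best_by_rmse[0])
```

Validation loss is lowest at epoch 2 while RMSE is lowest at epoch 4, and the training loss keeps falling as the validation loss rises, which suggests overfitting after the first few epochs. If RMSE is the metric that matters, one option is to pass `metric_for_best_model="rmse"` and `greater_is_better=False` to `TrainingArguments`, and possibly to train for fewer epochs.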

**Pushing to Hugging Face**

In [55]:
# Push the model and tokenizer to Hugging Face
# Read the access token from the environment instead of hard-coding it
# (the HF_TOKEN variable name is an assumption)
token = os.environ["HF_TOKEN"]
model.push_to_hub("ikoghoemmanuell/finetuned_sentiment_model", use_auth_token=token, commit_message="Pushed model")
# push_to_hub expects a repo id ("user/name"), not a full URL
tokenizer.push_to_hub("TruelyEpic/tweeter-sentiment-analysis-bert-base-cased", use_auth_token=token, commit_message="Pushed tokenizer")

In [None]:
from transformers import AutoTokenizer, AutoConfig, AutoModelForSequenceClassification

model_path = "TruelyEpic/tweeter-sentiment-analysis-bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_path)
config = AutoConfig.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

In [57]:
# Read the access token from the environment instead of hard-coding it
# (the HF_TOKEN variable name is an assumption)
token = os.environ["HF_TOKEN"]
model_id = "TruelyEpic/tweeter-sentiment-analysis-bert-base-cased"
tokenizer_id = "TruelyEpic/tweeter-sentiment-analysis-bert-base-cased"

# `huggingface-cli` has no `push` subcommand, so push from Python instead
# Push the model
model.push_to_hub(model_id, use_auth_token=token)

# Push the tokenizer
tokenizer.push_to_hub(tokenizer_id, use_auth_token=token)

In [63]:
from transformers import BertTokenizer, BertForSequenceClassification

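# Load the pre-trained base model and tokenizer
# Note: this is the base checkpoint, not the fine-tuned model; num_labels=3 matches
# the Negative/Neutral/Positive mapping, but the classification head is untrained
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForSequenceClassification.from_pretrained("bert-base-cased", num_labels=3)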

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

In [64]:
# Save the model and tokenizer
model.save_pretrained("model/")
tokenizer.save_pretrained("tokenizer/")

('tokenizer/tokenizer_config.json',
 'tokenizer/special_tokens_map.json',
 'tokenizer/vocab.txt',
 'tokenizer/added_tokens.json')

In [65]:
import json

In [66]:
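# Get the tokenizer vocabulary and save it as a JSON file
# (written to vocab.json, because tokenizer.json is the file name used by the fast tokenizer's own format)
tokenizer_vocab = tokenizer.get_vocab()
with open("tokenizer/vocab.json", "w") as f:
    json.dump(tokenizer_vocab, f)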

In [68]:
pip install streamlit

In [76]:
pip install gradio

In [77]:
import streamlit as st
import transformers
import torch

# Setting the page configurations (this should be the first Streamlit command in the script)
st.set_page_config(
    page_title="Sentiment Analysis App",
    page_icon=":smile:",
    layout="wide",
    initial_sidebar_state="auto",
)

# Load the model and tokenizer
# num_labels=3 matches the Negative/Neutral/Positive mapping used below
model = transformers.AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=3)
tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-cased")

# Define the function for sentiment analysis
@st.cache(allow_output_mutation=True)
def predict_sentiment(text):
    # Tokenize the input text
    inputs = tokenizer(text, return_tensors="pt")
    # Pass the tokenized input through the model
    outputs = model(**inputs)
    # Get the predicted class and return the corresponding sentiment
    predicted_class = torch.argmax(outputs.logits, dim=-1).item()
    if predicted_class == 0:
        return "Negative"
    elif predicted_class == 1:
        return "Neutral"
    else:
        return "Positive"

# Add description and title
st.write("""
# How Positive or Negative is your Text?
Enter some text and we'll tell you if it has a positive, negative, or neutral sentiment!
""")


# Add image
image = st.image("https://i0.wp.com/thedatascientist.com/wp-content/uploads/2018/10/sentiment-analysis.png", width=400)

# Get user input
text = st.text_input("Enter some text here:")

# Define the CSS style for the app
st.markdown(
"""
<style>
body {
    background-color: #f5f5f5;
}
h1 {
    color: #4e79a7;
}
</style>
""",
unsafe_allow_html=True
)


# Show sentiment output
if text:
    sentiment = predict_sentiment(text)
    if sentiment == "Positive":
        st.success(f"The sentiment is {sentiment}!")
    elif sentiment == "Negative":
        st.error(f"The sentiment is {sentiment}.")
    else:
        st.warning(f"The sentiment is {sentiment}.")

In [78]:
import transformers
import torch
import gradio as gr

# Load the model and tokenizer
# num_labels=3 matches the Negative/Neutral/Positive mapping used below
model = transformers.AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=3)
tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-cased")

# Define the function for sentiment analysis
def predict_sentiment(text):
    # Tokenize the input text
    inputs = tokenizer(text, return_tensors="pt")
    # Pass the tokenized input through the model
    outputs = model(**inputs)
    # Get the predicted class and return the corresponding sentiment
    predicted_class = torch.argmax(outputs.logits, dim=-1).item()
    if predicted_class == 0:
        return "Negative"
    elif predicted_class == 1:
        return "Neutral"
    else:
        return "Positive"

# Create the input and output interfaces
# (gr.Textbox replaces the deprecated gr.inputs / gr.outputs namespaces)
inputs = gr.Textbox(label="Enter some text here:")
outputs = gr.Textbox(label="Sentiment")

# Create the Gradio interface
gr.Interface(fn=predict_sentiment, inputs=inputs, outputs=outputs,
             title="Sentiment Analysis App",
             description="Enter some text and we'll tell you if it has a positive, negative, or neutral sentiment!",
             article="https://huggingface.co/transformers/model_doc/bert.html",
             thumbnail="https://i0.wp.com/thedatascientist.com/wp-content/uploads/2018/10/sentiment-analysis.png").launch()


Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

Kaggle notebooks require sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://03c25871b30a970dd7.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces




In [None]:
import gradio as gr
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

In [None]:
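# num_labels=3 matches the Negative/Neutral/Positive mapping used in predict_sentiment
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=3)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")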

In [None]:
def predict_sentiment(text):
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model(**inputs)
    predicted_class = torch.argmax(outputs.logits, dim=-1).item()
    if predicted_class == 0:
        return "Negative"
    elif predicted_class == 1:
        return "Neutral"
    else:
        return "Positive"
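
`predict_sentiment` returns only the arg-max label. If a confidence score is also wanted, the logits can be turned into probabilities with a softmax. The sketch below uses NumPy and made-up logits; the `logits_to_sentiment` helper is hypothetical, and the label order follows the mapping in `transform_labels` (0 Negative, 1 Neutral, 2 Positive).

```python
import numpy as np

LABELS = ["Negative", "Neutral", "Positive"]

def logits_to_sentiment(logits):
    """Return (label, probability) for one example's logits using a softmax."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    idx = int(probs.argmax())
    return LABELS[idx], float(probs[idx])

# Made-up logits for a single text
label, confidence = logits_to_sentiment([-1.2, 0.3, 2.1])
print(f"{label} ({confidence:.2f})")
```

In the app, the same idea could be applied to `outputs.logits[0]` after converting it to a NumPy array, so the interface can show both the label and how confident the model is.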

In [None]:
# The variable must be named `inputs` to match the gr.Interface call below
inputs = gr.Textbox(label="Enter some text here:")
outputs = gr.Textbox(label="Sentiment")

In [None]:
gr.Interface(fn=predict_sentiment, inputs=inputs, outputs=outputs,
             title="Sentiment Analysis App",
             description="Enter some text and we'll tell you if it has a positive, negative, or neutral sentiment!",
             article="https://huggingface.co/transformers/model_doc/bert.html",
             thumbnail="https://i0.wp.com/thedatascientist.com/wp-content/uploads/2018/10/sentiment-analysis.png").launch()


In [None]:
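# Log in and create the repository with the Hugging Face CLI
!huggingface-cli login
!huggingface-cli repo create your-repo-name
# The CLI has no `push` subcommand; upload from Python with push_to_hub instead
model.push_to_hub("your-repo-name")
tokenizer.push_to_hub("your-repo-name")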