Notebook objectives:
- Fine-tune a HuggingFace model for text classification by following this documentation. Below are the specifications:
  - Make use of Google Colab’s free GPU to train a HuggingFace model
  - Follow the documentation from start to finish
  - Be able to answer questions about each piece of code during the interview
  - Demonstrate fine-tuning using the sample dataset provided.
  - Bonus points: Use another text classification dataset to perform fine-tuning



# Project 1: Text classification

## Project objective
- Fine-tune a HuggingFace model for text classification 

## Setup: install the following
- transformers: library that provides the model used for text classification
- datasets: library used to download the GLUE datasets
- Git-LFS

- Using a pretrained model with SST-2 (Stanford Sentiment Treebank), a task that determines whether a sentence is positive or negative

- Using [hate-speech-data](https://huggingface.co/datasets/hate_speech_offensive) to fine-tune the model to predict whether a text is hate speech

Objective for part 2:
- Social media and messaging platforms have given us the ability to connect and express ourselves freely. However, what happens when these platforms are used to spread negativity and hate?
- That's where the Hater_classifier comes in. This tool is designed to identify hateful speech on social media and possibly filter out the negativity, making it a powerful tool for creating a more positive online environment.



In [None]:
! pip install transformers
! apt install git-lfs
! pip install datasets
! pip install torch
! pip install imbalanced-learn
! pip install optuna
! pip install "ray[tune]"
! pip install scikit-learn
! pip install pynvml

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Reading package lists... Done
Building dependency tree       

git-lfs is already the newest version (2.9.2-1).
0 upgraded, 0 newly installed, 0 to remove and 23 not upgraded.


Import all the libraries needed

In [None]:
from huggingface_hub import notebook_login
import pandas as pd
import torch
# import datasets
from datasets import load_dataset, load_metric, Dataset, DatasetDict
# from transformers import AutoTokenizer
import transformers
from transformers import AutoTokenizer,Trainer,AutoModelForSequenceClassification, TrainingArguments
from transformers.utils import send_example_telemetry
# import imblearn
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter
from sklearn.model_selection import train_test_split

# for gpu util
from pynvml import * 



Define functions to track GPU utilization

In [None]:
from pynvml import *


def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")


def print_summary(result):
    print(f"Time: {result.metrics['train_runtime']:.2f}")
    print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
    print_gpu_utilization()


Get your Hugging Face API token at https://huggingface.co/settings/tokens

In [None]:
# log in to Hugging Face
notebook_login()




In [None]:
# show the installed transformers version
# any version above 4.11.0 works with this notebook
print(transformers.__version__)

4.27.3


In [None]:
send_example_telemetry("text_classification_notebook", framework="pytorch")

### Model Definition
Pick a model from the [Model Hub](https://huggingface.co/models) with a classification head.

Adjust the batch size as needed to avoid running out of memory.

Why choose distilbert-base-uncased with the task of sst2?
- sst2 (sentiment analysis) can help identify hate speech
- distilbert-base-uncased has about 66M parameters, roughly 5x fewer than bert-large-uncased (340M); the 110M figure belongs to bert-base-uncased

In [None]:
task = "hate_speech_detection"
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

# from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

# tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
# model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

# inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
# with torch.no_grad():
#     logits = model(**inputs).logits

# predicted_class_id = logits.argmax().item()
# model.config.id2label[predicted_class_id]

### Loading the existing model's dataset

Picking a dataset:
- [SST-2](https://nlp.stanford.edu/sentiment/index.html) (Stanford Sentiment Treebank) Determine if the sentence has a positive or negative sentiment.

In [None]:
existing_tasks = ["cola", "mnli", "mnli-mm", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]

actual_task = "sst2"
data = load_dataset("glue", actual_task)
metric = load_metric('glue', actual_task)





In [None]:
# metric["accuracy"]

In [None]:
#checking the dictionary format of the data
display(data)

#checking what the first entry in train
data["train"][0]

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})

{'sentence': 'hide new secretions from the parental units ',
 'label': 0,
 'idx': 0}

Pick random entries to get a sense of what the data looks like
1. Turn the dataset object into a DataFrame so you can sample from it

In [None]:
pd.DataFrame(data["train"]).sample(10)

Unnamed: 0,sentence,label,idx
57574,award-winning english cinematographer giles nu...,1,57574
60141,it 'll probably be in video stores by christma...,0,60141
9065,sensuality and a conniving wit,1,9065
42544,much self-centered,0,42544
38159,your disgust,0,38159
37579,a story already overladen with plot conceits,0,37579
36247,too many films that can be as simultaneously f...,1,36247
60842,takes a fresh and absorbing look at a figure w...,1,60842
28010,"love , family and",1,28010
45759,"in all fairness , i must report that the child...",1,45759


### Loading the data for hate speech

data description (based on github):
- hate_speech = number of CrowdFlower (CF) users who judged the tweet to be hate speech.

- offensive_language = number of CF users who judged the tweet to be offensive.

- neither = number of CF users who judged the tweet to be neither hate speech nor offensive.

- class = class label for majority of CF users. 
  - 0 - hate speech 
  - 1 - offensive language 
  - 2 - neither


In [None]:
#
dataset_name = "hate_speech_offensive"

hate_speech_ds  = load_dataset(dataset_name)
# hate_speech = pd.read_csv("hate_speech_labeled_data.csv")




In [None]:
hate_speech_df = pd.DataFrame(hate_speech_ds["train"])

In [None]:
# hate_speech_df

Showing some samples of each class

In [None]:
# show the full tweet text instead of truncating long columns
pd.set_option('display.max_colwidth', None)

print("Hate speech")
display(hate_speech_df[hate_speech_df["class"]==0].sample(10)[["class","tweet"]])
print("-----------------------------------------------------------------")

print("Offensive language")
display(hate_speech_df[hate_speech_df["class"]==1].sample(10)[["class","tweet"]])
print("-----------------------------------------------------------------")

print("Neither")
print()
display(hate_speech_df[hate_speech_df["class"]==2].sample(10)[["class","tweet"]])
print("-----------------------------------------------------------------")

Hate speech




Unnamed: 0,class,tweet
21168,0,Stoni is a fuckin queer...
3408,0,"@HuffingtonPost im American, don't give 2 shits about the beaners, ship them beaners back"
22483,0,Wassup wit all these fucc niccas
3591,0,@JacobbBacker fag
12189,0,Kanye west is a faggot
19303,0,RT @ivanrabago_: @staycoolwheels @Studhardt22 Joshua is a faggot. Just suspend him on those grounds
17812,0,RT @TopBlokeBill: Are you Justin Beiber? Because your a Fucken gay cunt @Arii_nV http://t.co/pgHISyDEyB
15420,0,RT @HG_Shit: Mfs still in the same spot as 3yrs ago. Beefing with the same nighas. Sleeping with the same hoes. Eating from the same restau&#8230;
5939,0,"@erinscafe We hate the Yankees though, right? I feel like I'm really good at hating them."
10929,0,"I swear these anon fags go to protests just to take pictures to post to Twitter. ""Look, I was there...Like me"""


-----------------------------------------------------------------
Offensive language


Unnamed: 0,class,tweet
14321,1,RT @BasedZae: When lil b said &#8220;Word around town that I aint got no bitches? what?! Thats a damn lie nigga my nuts is my witness&#8221; http://t.c&#8230;
11685,1,If you're being a little scared bitch about the POSSIBILITY of a louisville purge...just go to Ferguson.
17411,1,RT @SteadmanTerri: @wheeler_kashhh Mann fuckk dat shit!!..fuckk dat bitch!! &#128514;&#128514;
159,1,"""@KingCuh: @16stanleys io io alu record ho vine sai pe hahahaha"" lol anywaaaaaays..... haha"
15266,1,RT @FuckTraVonn: @TropicalKyle I saw the xxx but I just thought she was a hoe
20142,1,RT @thebootycluh_: &#8220;@_xoxoMOOKY: Why do niggas settle for easy bitches ? Like don't y'all wanna bitch that challenges you to better yoursel&#8230;
2364,1,@AbstractLife man too much damn giggling was going. He was offering $40 tips and shit. Lol think I even heard the bitch spit a couple times
23579,1,"animal crackers are the fuckin bomb, yo"
9214,1,Fuc u say ? &#8220;@AyyyeThatsChubb: All these hoes look good on IG&#8221; http://t.co/PlsFL84cDp
22829,1,Who's ready for nigger spam?


-----------------------------------------------------------------
Neither



Unnamed: 0,class,tweet
13811,2,Permanent slit in my eyebrow from when I tumbled down the bleachers when I was little.... I stayed in the ER
5503,2,@allsportsbruh how? All of the QBs we had before this season were trash. They brought in a veteran so they could have someone to play
20846,2,Sighs of relief from Beijing Guoan fans &amp; #China's govt as club nips Japanese rival 2-1 in tense match: #football http://t.co/EesEOR7Iba
17404,2,"RT @SportsNation: So, Yanks have signed Brian McCann &amp; Jacoby Ellsbury. Rumor has it, they still want Robinson Cano, Babe Ruth, Miguel Cabr&#8230;"
7226,2,"@tyler_wilde CH3BURASHKA sure gets to be a nit-picky Russian monkey/bear thing sometimes, doesn't he?"
13659,2,Only Americans are degenerate enough to 'honor' their war dead by having a barbecue. Anyone who 'grills out' for Memorial Day is trash.
6594,2,@longbongchris mk Hun just textith me
11264,2,I'm finna see the Yankees win a baseball game
902,2,"#porn,#android,#iphone,#ipad,#sex,#xxx, | #Teen | Cutie lesbian teens toy slits http://t.co/ZS05enjjwm"
16422,2,RT @MelissaTweets: Scott Walker investigate. Christie's Bridgegate. Perry's drunk monkey Dem DA-gate. All men innocent. All threats to left&#8230;


-----------------------------------------------------------------


Dataset observations:
1. The majority (77%) of the examples are offensive language
2. Hate speech is rare: only 1,430 examples (about 6% of the whole dataset)

Possible experiments:
- Rebalance the dataset via:
  - oversampling 
  - undersampling 


In [None]:
#check the distribution of the classes
hate_speech_df["class"].value_counts(normalize=True)


1    0.774321
2    0.167978
0    0.057701
Name: class, dtype: float64
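
To put the imbalance in perspective, the short sketch below recomputes these proportions from the raw class counts (the counts are the ones printed by `Counter(y)` in the oversampling cell) and derives the accuracy of a trivial classifier that always predicts the majority class. It uses only the standard library; the variable names are illustrative.

```python
from collections import Counter

# Raw class counts as printed by Counter(y) in the oversampling cell.
class_counts = Counter({1: 19190, 2: 4163, 0: 1430})
total = sum(class_counts.values())

# Share of each class in the whole dataset.
proportions = {label: count / total for label, count in class_counts.items()}

# A classifier that always answers with the largest class gets this accuracy.
majority_label, majority_count = class_counts.most_common(1)[0]
baseline_accuracy = majority_count / total

for label in sorted(proportions):
    print(f"class {label}: {proportions[label]:.4f}")
print(f"always predicting class {majority_label} scores {baseline_accuracy:.4f}")
```

Any accuracy reported later is easier to judge against this baseline than against zero.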

Check for nulls and drop them

In [None]:
# print the null counts so they are visible, then drop rows with nulls
print(hate_speech_df.isna().sum())
hate_speech_df = hate_speech_df.dropna()



## Preprocessing

### Tokenize

1. Define the tokenizer (a tokenizer converts words into numbers so that the computer can process them)

2. Show a sample of the tokenizer output




In [None]:
# create a tokenizer that matches the model, in this case distilbert-base-uncased
# it uses the vocabulary of the pretrained model
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

# input_ids are the ids of the tokens in the vocabulary
tokenizer("it's me, hi, I'm the problem, it's me")

{'input_ids': [101, 2009, 1005, 1055, 2033, 1010, 7632, 1010, 1045, 1005, 1049, 1996, 3291, 1010, 2009, 1005, 1055, 2033, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
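
To make the `input_ids` less opaque, here is a minimal sketch that maps them back to tokens. The id-to-token table is a toy one, inferred by aligning the output above with the input sentence; the real vocabulary is far larger (roughly 30k entries), and `ID_TO_TOKEN` is a hypothetical name, not part of the library.

```python
# Toy id -> token table, inferred by aligning the tokenizer output above
# with the input sentence. 101 and 102 are the [CLS] and [SEP] markers.
ID_TO_TOKEN = {
    101: "[CLS]", 102: "[SEP]", 2009: "it", 1005: "'", 1055: "s", 2033: "me",
    1010: ",", 7632: "hi", 1045: "i", 1049: "m", 1996: "the", 3291: "problem",
}

input_ids = [101, 2009, 1005, 1055, 2033, 1010, 7632, 1010, 1045, 1005,
             1049, 1996, 3291, 1010, 2009, 1005, 1055, 2033, 102]

# Look every id up in the table; "I" comes back lowercased (uncased model).
tokens = [ID_TO_TOKEN[i] for i in input_ids]
print(tokens)
```

With the real tokenizer, `tokenizer.convert_ids_to_tokens(input_ids)` performs the same lookup against the full vocabulary.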

tokenize curse words or profanity to check what's the result

In [None]:
print(tokenizer("faggot"))
print(tokenizer("bitch"))
print(tokenizer("bitch"))

{'input_ids': [101, 6904, 13871, 4140, 102], 'attention_mask': [1, 1, 1, 1, 1]}
{'input_ids': [101, 7743, 102], 'attention_mask': [1, 1, 1]}
{'input_ids': [101, 7743, 102], 'attention_mask': [1, 1, 1]}



<!-- After tokenzing,use pytorch to train the model to take advantage of colab free gpu:

- video source: https://youtu.be/Dh9CL8fyG80
- link source: https://youtu.be/Dh9CL8fyG80 -->


In [None]:

# sentence1_key, sentence2_key = task_to_keys[task]
sentence1_key = "sentence"

#print the sentence
print(f"Sentence: {data['train'][0][sentence1_key]}")

#print the tokenized sentence 
tokenizer(data['train'][0][sentence1_key],truncation=True) 

Sentence: hide new secretions from the parental units 


{'input_ids': [101, 5342, 2047, 3595, 8496, 2013, 1996, 18643, 3197, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

The sample code is changed because only one text column is needed: for ```sst2``` that column is `sentence`, and for the hate speech dataset used here the key is `tweet`.

The argument:
- `truncation=True`: ensures that an input longer than the model can handle is truncated to the maximum length accepted by the model.
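
As a rough illustration of what truncation does, the sketch below cuts a list of token ids so that the total length, including the two special tokens, stays within a limit. This is a simplified stand-in, not the library's implementation; `truncate_ids` is a hypothetical helper, and 512 is used as the maximum length DistilBERT accepts.

```python
CLS_ID, SEP_ID = 101, 102  # special-token ids seen in the outputs above


def truncate_ids(content_ids, max_length):
    """Keep at most max_length ids in total, including [CLS] and [SEP]."""
    room = max_length - 2  # two slots are reserved for the special tokens
    return [CLS_ID] + content_ids[:room] + [SEP_ID]


# Content ids of the first SST-2 sentence, without the special tokens.
content = [5342, 2047, 3595, 8496, 2013, 1996, 18643, 3197]

print(truncate_ids(content, max_length=6))    # cut down to 6 ids
print(truncate_ids(content, max_length=512))  # short input, nothing is cut
```

Tweets are short, so truncation should rarely trigger here; it mostly acts as a safeguard against unusually long inputs.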

Why did I tokenize first?
- It is better to tokenize first and then oversample, undersample, and do the train/test split
- That way the same tweets do not need to be retokenized

In [None]:
#pre process function for existing data from dataset library
def preprocess_function(examples):
  return tokenizer(examples["tweet"], truncation=True)

# #preprocessing on hate_speech dataset
# def my_preprocess_function(examples):
#   return tokenizer(examples, truncation=True)

#encoded hate_speech for fine tuning
# endocded_hspeech = hate_speech["tweet"].map(my_preprocess_function)

encoded_ds = hate_speech_ds.map(preprocess_function)

#check what encoded ds looks like
encoded_ds



DatasetDict({
    train: Dataset({
        features: ['count', 'hate_speech_count', 'offensive_language_count', 'neither_count', 'class', 'tweet', 'input_ids', 'attention_mask'],
        num_rows: 24783
    })
})

In [None]:
hate_speech_df = pd.DataFrame(encoded_ds["train"])

In [None]:
# hate_speech_df

### Undersampling and Oversampling





Perform oversampling using [Imbalanced-Learn Library](https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/)

- The ```minority``` strategy resamples only the minority class: if the majority class had 1,000 examples and the minority class had 100, the minority class would be oversampled until it has 1,000 examples.

- We need to oversample twice because we have 3 classes: after the first pass only class 0 has increased, from 1430 to 19190 (the same logic applies to undersampling)
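
The two-pass behaviour can be sketched in plain Python. This is an illustration of the idea only, not imbalanced-learn's code: `oversample_minority` is a hypothetical helper that duplicates random rows of the single rarest class until it matches the largest class, and it works on labels alone for brevity.

```python
import random
from collections import Counter


def oversample_minority(labels, rng):
    """Duplicate random rows of the single rarest class up to the majority size."""
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    target = max(counts.values())
    # positions of the minority rows that can be copied
    pool = [i for i, y in enumerate(labels) if y == minority]
    extra = rng.choices(pool, k=target - counts[minority])
    return labels + [labels[i] for i in extra]


rng = random.Random(0)
labels = [1] * 19190 + [2] * 4163 + [0] * 1430  # class counts of this dataset

first_pass = oversample_minority(labels, rng)        # only class 0 grows
second_pass = oversample_minority(first_pass, rng)   # now class 2 is the rarest
print(Counter(first_pass))
print(Counter(second_pass))
```

imbalanced-learn also documents a `'not majority'` strategy that resamples every class except the largest in one call, which may be worth trying instead of two passes.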



In [None]:
# features and labels to resample
X = hate_speech_df[["tweet",'input_ids', 'attention_mask']].values
y = hate_speech_df["class"].values

print("distribution before oversampling")
print(Counter(y))
# 'minority' resamples only the rarest class up to the size of the majority class
oversample = RandomOverSampler(random_state=0,sampling_strategy='minority')

X_over, y_over = oversample.fit_resample(X, y)
print("distribution after oversampling")
print(Counter(y_over))

# need to oversample twice because we have 3 classes
# after the first pass only class 0 has increased, from 1430 to 19190
X_over, y_over = oversample.fit_resample(X_over, y_over)
print("distribution after 2nd oversampling")
print(Counter(y_over))


over_h_speech = pd.DataFrame(X_over, columns=["tweet",'input_ids', 'attention_mask'])
over_h_speech["label"] = y_over

distribution before oversampling
Counter({1: 19190, 2: 4163, 0: 1430})
distribution after oversampling
Counter({1: 19190, 0: 19190, 2: 4163})
distribution after 2nd oversampling
Counter({2: 19190, 1: 19190, 0: 19190})


In [None]:
# over_h_speech

Perform undersampling using [Imbalanced-Learn Library](https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/)


In [None]:
# 'majority' resamples only the largest class down to the size of the minority class
undersample = RandomUnderSampler(random_state=0,sampling_strategy='majority')

X_under, y_under = undersample.fit_resample(X, y)
print("distribution after undersampling")
print(Counter(y_under))

# the second pass shrinks the remaining larger class (class 2)
X_under, y_under = undersample.fit_resample(X_under, y_under)
print("distribution after 2nd undersampling")
print(Counter(y_under))

# build the frame from the undersampled arrays (not the oversampled ones)
under_h_speech = pd.DataFrame(X_under, columns=["tweet",'input_ids', 'attention_mask'])
under_h_speech["label"] = y_under



distribution after undersampling
Counter({2: 4163, 0: 1430, 1: 1430})
distribution after 2nd undersampling
Counter({0: 1430, 1: 1430, 2: 1430})


### Train test split on the following:
- hate_speech
- over_h_speech
- under_h_speech
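
One thing to watch with the oversampled frame: oversampling duplicates rows, so splitting after resampling can place copies of the same tweet in both train and test, which tends to inflate test scores. The toy sketch below shows the effect with made-up tweet ids (not the real data); splitting first and resampling only the training portion is one way to avoid it.

```python
import random

# Four distinct tweets (ids 0-3), each present 25 times after oversampling.
rows = [tweet_id for tweet_id in range(4) for _ in range(25)]

rng = random.Random(0)
rng.shuffle(rows)

# 20% test, 80% train, taken from the already oversampled rows.
test_rows, train_rows = rows[:20], rows[20:]

# Tweets that appear on both sides of the split.
leaked = set(test_rows) & set(train_rows)
print(f"{len(leaked)} of {len(set(test_rows))} test tweets also appear in train")
```

With 25 copies per tweet and only 20 test rows, every tweet in the test set necessarily has copies left in the training set.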

In [None]:
hate_speech_df.columns = ['count', 'hate_speech_count', 'offensive_language_count',
       'neither_count', 'label', 'tweet', 'input_ids', 'attention_mask']

In [None]:

train, test = train_test_split(hate_speech_df, test_size=0.2)
over_train, over_test = train_test_split(over_h_speech, test_size=0.2)
under_train, under_test = train_test_split(under_h_speech, test_size=0.2)


                               

### Create a Dataset Dictionary
- So we can use it for the model

reference: The `dataset` object itself is [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key for the training, validation and test set (with more keys for the mismatched validation and test set in the special case of `mnli`).

In [None]:
# Dataset.from_pandas(over_train)



train_ds = Dataset.from_pandas(train, split="train")
test_ds = Dataset.from_pandas(test, split="test")

o_train_ds = Dataset.from_pandas(over_train, split="o_train")
# use the held-out oversampled split here, not the training split
o_test_ds = Dataset.from_pandas(over_test, split="o_test")


u_train_ds = Dataset.from_pandas(under_train, split="u_train")
u_test_ds = Dataset.from_pandas(under_test, split="u_test")



In [None]:
train_ds

Dataset({
    features: ['count', 'hate_speech_count', 'offensive_language_count', 'neither_count', 'label', 'tweet', 'input_ids', 'attention_mask', '__index_level_0__'],
    num_rows: 19826
})

In [None]:

#processed dataset
dataset_proc = DatasetDict({"train":train_ds,
                             "test":test_ds, 
                             "o_train_ds":o_train_ds,
                             "o_test_ds":o_test_ds,
                             "u_train_ds":u_train_ds,
                             "u_test_ds":u_test_ds, })


In [None]:
# dataset_proc

## Fine-tuning the model

1. Load the model (the weights are downloaded and cached)
2. Call `.to("cuda")` to use the GPU [source](https://huggingface.co/docs/transformers/perf_train_gpu_one)



In [None]:

labels = 3
# model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=labels)
##comment this out if you just want cpu
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=labels).to("cuda")

Downloading pytorch_model.bin:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.bias', 'classifier

Defining the training arguments

Required argument:
- the folder name used to store model checkpoints

Other arguments used here:
- evaluation_strategy (str or IntervalStrategy, optional, defaults to "no"): the evaluation strategy to adopt during training. Possible values are "no", "steps", and "epoch".
- per_device_train_batch_size (int, optional, defaults to 8): the batch size per GPU/TPU core/CPU for training.
- push_to_hub=True: pushes the model to the Hub regularly during training. Remove it if you didn't follow the installation steps at the top of the notebook.

In [None]:
model_name = model_checkpoint.split("/")[-1]
batch_size= 8
metric_name = "accuracy"
task = "hate_speech"

args = TrainingArguments(
    f"{model_name}-finetuned-{task}",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    push_to_hub=True,
)

Measuring performance / predictions
1. Use the `metric` we loaded earlier
2. The predicted class is the argmax of the predicted logits; `predictions[:, 0]` is only appropriate for a single-output regression task such as STS-B
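
To see the difference between taking the argmax of the logits and taking `predictions[:, 0]`, here is a small pure-Python sketch. The logits and labels are made up for illustration; with the NumPy arrays that the `Trainer` passes in, the equivalent call would be `np.argmax(predictions, axis=1)`.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)


# Hypothetical logits for four tweets over the three classes (0, 1, 2).
logits = [
    [-1.2, 2.3, 0.1],
    [0.9, 0.2, -0.5],
    [-0.7, 0.4, 1.8],
    [-2.0, 3.1, -0.3],
]
labels = [1, 0, 2, 1]

# Class prediction: the position of the highest logit in each row.
predicted = [argmax(row) for row in logits]
accuracy = sum(p == y for p, y in zip(predicted, labels)) / len(labels)

# What predictions[:, 0] selects: the raw class-0 logit, not a class id.
first_column = [row[0] for row in logits]

print(predicted, accuracy)
print(first_column)
```

The first column holds raw scores for class 0, so comparing it with integer labels does not measure classification accuracy.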


In [None]:
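import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # pick the class with the highest logit; predictions[:, 0] would score the raw class-0 logit
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)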

In [None]:
#define the validation key of sst2
validation_key  = "validation"
trainer = Trainer(model,
                  args,
                  train_dataset=dataset_proc["train"],
                  eval_dataset=dataset_proc["test"],
                  tokenizer=tokenizer,
                  compute_metrics=compute_metrics)



/content/distilbert-base-uncased-finetuned-hate_speech is already a clone of https://huggingface.co/Dc26/distilbert-base-uncased-finetuned-hate_speech. Make sure you pull the latest changes with `repo.git_pull()`.


In [None]:
result = trainer.train()
print_summary(result)
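# The accuracy logged below is identical in every epoch, most likely because this run's
# compute_metrics scored predictions[:, 0] (the raw class-0 logit) instead of the argmax class.
# The constant validation loss is not explained by that and still needs investigating.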

# print_summary(result)

Epoch,Training Loss,Validation Loss,Accuracy
1,0.19,0.294211,0.056082
2,0.193,0.294211,0.056082
3,0.1883,0.294211,0.056082
4,0.1835,0.294211,0.056082
5,0.1925,0.294211,0.056082


Time: 784.05
Samples/second: 126.43
GPU memory occupied: 9937 MB.


In [None]:
# result

In [None]:
trainer.evaluate()

{'eval_loss': 0.29421094059944153,
 'eval_accuracy': 0.0560823078474884,
 'eval_runtime': 9.5234,
 'eval_samples_per_second': 520.508,
 'eval_steps_per_second': 32.551,
 'epoch': 5.0}

In [None]:
trainer.push_to_hub()

To https://huggingface.co/Dc26/distilbert-base-uncased-finetuned-hate_speech
   5dc7512..8b3f0f8  main -> main

   5dc7512..8b3f0f8  main -> main

To https://huggingface.co/Dc26/distilbert-base-uncased-finetuned-hate_speech
   8b3f0f8..bd3a48b  main -> main

   8b3f0f8..bd3a48b  main -> main



'https://huggingface.co/Dc26/distilbert-base-uncased-finetuned-hate_speech/commit/8b3f0f829c7a6acf7c198dea9b0ba000532167ca'

In [None]:
f"{model_name}-finetuned-{task}"

'distilbert-base-uncased-finetuned-hate_speech'

In [None]:
# how to share the model you created

AutoModelForSequenceClassification.from_pretrained("Dc26/distilbert-base-uncased-finetuned-hate_speech")

## Hyperparameter search

Install optuna and Ray Tune, since they are used for hyperparameter tuning.

We also define a ```new_model``` function that is passed as `model_init`, so a new model is created for every trial.

In [None]:
num_labels=3
def new_model():
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

In [None]:
trainer = Trainer(
    model_init=new_model,
    args=args,
    train_dataset=dataset_proc["train"],
    eval_dataset=dataset_proc["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)


Hyperparameter search may take a long time to run on the full dataset, so
- it only runs on 1/10th of the dataset
- only the best-performing configuration is then trained on the full dataset

In [None]:
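# use one tenth of the training data for the search
train_dataset = dataset_proc["train"].shard(index=1, num_shards=10)
# the trainer above was built with the full training set, so point it at the shard
trainer.train_dataset = train_dataset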

In [None]:
# returns a BestRun object for the trial that maximizes the accuracy / desired metric
best_run = trainer.hyperparameter_search(n_trials=5, direction="maximize")

[I 2023-03-24 13:52:31,865] A new study created in memory with name: no-name-ff8d76f1-4d8e-4de7-8c98-6d7d79c9d77e

OutOfMemoryError: ignored

In [None]:
# best_run

You can customize the objective to maximize by passing along a `compute_objective` function to the `hyperparameter_search` method, and you can customize the search space by passing a `hp_space` argument to `hyperparameter_search`. See this [forum post](https://discuss.huggingface.co/t/using-hyperparameter-search-in-trainer/785/10) for some examples.
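
As a hedged sketch of those two hooks with the Optuna backend: `hp_space` receives an Optuna trial and returns the hyperparameters to try, and `compute_objective` receives the evaluation metrics and returns the value to optimize. The ranges below are illustrative guesses, not tuned values, and the smaller batch sizes are only an assumption about what might fit in GPU memory.

```python
def hp_space(trial):
    """Search space for the Optuna backend; the ranges are illustrative guesses."""
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 5),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16]
        ),
    }


def compute_objective(metrics):
    """Score a trial by the evaluation accuracy reported by the trainer."""
    return metrics["eval_accuracy"]


# Assumed call, reusing the `trainer` built above:
# best_run = trainer.hyperparameter_search(
#     hp_space=hp_space,
#     compute_objective=compute_objective,
#     n_trials=5,
#     direction="maximize",
#     backend="optuna",
# )
```

The keys returned by `hp_space` are `TrainingArguments` field names, which is what lets the best values be copied back onto `trainer.args` afterwards.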

To reproduce the best training, just set the hyperparameters in your `TrainingArguments` before creating a `Trainer`:

In [None]:
for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)

trainer.train()

Check if the model works

In [None]:
train["label"].value_counts()

1    15324
2    3376 
0    1126 
Name: label, dtype: int64

In [None]:
test["label"].value_counts()

1    3841
2    824 
0    292 
Name: label, dtype: int64

In [None]:
import os
from transformers import TextClassificationPipeline
os.environ['CUDA_VISIBLE_DEVICES'] ='0'

# load the fine-tuned model that was pushed to the Hub
current_model = AutoModelForSequenceClassification.from_pretrained("Dc26/distilbert-base-uncased-finetuned-hate_speech")
pipe = TextClassificationPipeline(model=current_model, tokenizer=tokenizer, return_all_scores=False)


def get_prediction(text):
  # the pipeline returns labels such as "LABEL_1"; keep only the trailing class digit
  return pipe(text)[-1]["label"][-1]



In [None]:
%%time
test["predicted"] = test["tweet"].apply(lambda x: get_prediction(x))


CPU times: user 5min 55s, sys: 737 ms, total: 5min 56s
Wall time: 6min 7s


In [None]:
## create a confusion matrix
test["predicted"] = test["predicted"].astype(int)
test.groupby(["predicted", "label"]).size().unstack(fill_value=0)

predicted \ label,0,1,2
0,87,65,3
1,177,3724,47
2,40,77,737


In [None]:
import seaborn as sns
test.groupby(["predicted", "label"]).size().unstack(fill_value=0)

predicted \ label,0,1,2
0,87,65,3
1,177,3724,47
2,40,77,737


In [None]:
test["predicted"].value_counts()
# most predictions are class 1, the majority class in the training data

1    3948
2    854 
0    155 
Name: predicted, dtype: int64

In [None]:
test[test["predicted"]==test["label"]].shape[0] / test.shape[0]

0.917490417591285
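
Overall accuracy can hide how the rare class fares, so it helps to read per-class precision and recall off the confusion matrix reported above. The sketch below uses those exact counts (rows are predicted classes, columns are true labels); the class names follow the dataset description, and the variable names are illustrative.

```python
# Confusion matrix reported above: rows are predicted classes, columns are true labels.
confusion = [
    [87, 65, 3],      # predicted 0
    [177, 3724, 47],  # predicted 1
    [40, 77, 737],    # predicted 2
]
names = ["hate speech", "offensive language", "neither"]

total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[k][k] for k in range(3)) / total
print(f"accuracy: {accuracy:.4f}")

recall = {}
precision = {}
for k, name in enumerate(names):
    true_count = sum(row[k] for row in confusion)  # column sum: tweets truly in class k
    predicted_count = sum(confusion[k])            # row sum: tweets predicted as class k
    recall[name] = confusion[k][k] / true_count
    precision[name] = confusion[k][k] / predicted_count
    print(f"{name}: precision {precision[name]:.3f}, recall {recall[name]:.3f}")
```

On these counts only 87 of the 304 true hate-speech tweets are recovered (recall of about 0.29), even though overall accuracy is close to 0.92.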

## Recommendations

1. Further data cleaning
  - The tweets contain a lot of emojis, so performance might improve if we remove them and do further cleaning
  - Replace each @username with a generic tag so the model does not see many different usernames
2. Test whether the model performs better with the undersampled and oversampled datasets
3. Graph the confusion matrix of the test set
4. Compare performance with an existing paper (papers: https://github.com/aymeam/Datasets-for-Hate-Speech-Detection)
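
The cleaning ideas in recommendation 1 could look roughly like the sketch below. It is a starting point under stated assumptions, not a tested preprocessing pipeline: `clean_tweet`, the `@USER` and `URL` placeholders, and the emoji code-point ranges are all illustrative choices, and the sample tweet is made up. Emojis in this dataset appear as HTML entities such as `&#128514;`, so the text is unescaped first.

```python
import html
import re

MENTION = re.compile(r"@\w+")
URL = re.compile(r"https?://\S+")
# Rough emoji coverage (pictographs and common symbols); not exhaustive.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
SPACES = re.compile(r"\s+")


def clean_tweet(text):
    """Unescape HTML entities, mask usernames and links, and drop emojis."""
    text = html.unescape(text)         # "&amp;" -> "&", "&#128514;" -> emoji
    text = URL.sub("URL", text)        # generic tag for links
    text = MENTION.sub("@USER", text)  # generic tag for usernames
    text = EMOJI.sub("", text)
    return SPACES.sub(" ", text).strip()


sample = "RT @SomeUser: so funny &#128514;&#128514; &amp; true http://t.co/abc123"
print(clean_tweet(sample))
```

If this is adopted, the cleaning would need to run before tokenization, for example inside `preprocess_function`, and again on any text passed to the pipeline at prediction time.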