Notebook objectives:
- Fine-tune a HuggingFace model for text classification by following this documentation. Below are the specifications:
  - Make use of Google Colab’s free GPU to train a HuggingFace model
  - Follow the documentation from start to finish
  - Be able to answer questions about each piece of code during the interview
  - Demonstrate fine-tuning using the sample dataset provided.
  - Bonus points: Use another text classification dataset to perform fine-tuning



# Project 1: Text classification

## Project objective
- Fine-tune a HuggingFace model for text classification 

## Setup: install the following
- transformers: library that provides the pre-trained models and the `Trainer` used for text classification
- datasets: library used to download the GLUE datasets
- Git-LFS

- Use SST-2 (Stanford Sentiment Treebank), a dataset that labels a sentence as positive or negative, as the sample task

- Use [hate-speech-data](https://huggingface.co/datasets/hate_speech_offensive) to fine-tune the model to predict whether a text is hate speech

Objective for part 2:
- Social media and messaging platforms have given us the ability to connect and express ourselves freely. However, what happens when these platforms are used to spread negativity and hate? 
- That's where the Hater_classifier comes in. This tool is designed to identify hateful speech on social media, possibly filtering out the negativity making it a powerful tool for creating a more positive online environment.



In [3]:
! pip install  transformers
! pip install datasets
! pip install torch
! pip install imbalanced-learn
! pip install optuna
! pip install ray[tune]
! pip install scikit-learn
! pip install pynvml 





ERROR: Could not find a version that satisfies the requirement ray[tune] (from versions: none)
ERROR: No matching distribution found for ray[tune]




## Git LFS

- Problem with plain Git: every time you replace an asset (for example, a new version of an image), Git keeps a compressed snapshot of each version, so the repo keeps growing and gets slower. Instead of pushing media assets to GitHub, Git LFS pushes them to the ```lfs server``` and keeps them out of the repo.
- Git LFS is not needed for this example because there are no CSVs to track
- source: https://www.youtube.com/watch?v=jXsvFfksvd0

In [10]:
# ! pip install git-lfs
!pip install tqdm
!pip install ipywidgets


import all the libraries needed

In [7]:
from huggingface_hub import notebook_login
import pandas as pd
import torch
# import datasets
from datasets import load_dataset, load_metric, Dataset, DatasetDict
# from transformers import AutoTokenizer
import transformers
from transformers import AutoTokenizer,Trainer,AutoModelForSequenceClassification, TrainingArguments
from transformers.utils import send_example_telemetry
# import imblearn
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter
from sklearn.model_selection import train_test_split

# for gpu util
from pynvml import * 



define functions to track GPU utilization

In [11]:
from pynvml import *


def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")


def print_summary(result):
    print(f"Time: {result.metrics['train_runtime']:.2f}")
    print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
    print_gpu_utilization()


Get your Hugging Face API token at https://huggingface.co/settings/tokens

In [13]:
# log in to Hugging Face
notebook_login()




In [14]:
# show the installed transformers version
# the version is > 4.11.0, so there is no issue
print(transformers.__version__)

4.27.3


In [15]:
send_example_telemetry("text_classification_notebook", framework="pytorch")

### Model Definition
Pick a model from the [Model Hub](https://huggingface.co/models) with a classification head.

Adjust the batch size as needed to avoid running out of memory.

Why choose distilbert-base-uncased, with SST-2 as the sample task?
- SST-2 is sentiment analysis, a related task that can help identify hate speech
- distilbert-base-uncased has about 66M parameters, compared with about 110M for bert-base-uncased and about 340M for bert-large-uncased

In [16]:
task = "hate_speech_detection"
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

# from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

# tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
# model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

# inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
# with torch.no_grad():
#     logits = model(**inputs).logits

# predicted_class_id = logits.argmax().item()
# model.config.id2label[predicted_class_id]

### Loading the existing model's dataset

Picking a dataset:
- [SST-2](https://nlp.stanford.edu/sentiment/index.html) (Stanford Sentiment Treebank) Determine if the sentence has a positive or negative sentiment.

In [17]:
existing_tasks = ["cola", "mnli", "mnli-mm", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]

actual_task = "sst2"
data = load_dataset("glue", actual_task)
metric = load_metric('glue', actual_task)



In [None]:
# metric["accuracy"]

In [18]:
#checking the dictionary format of the data
display(data)

#checking what the first entry in train
data["train"][0]

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})

{'sentence': 'hide new secretions from the parental units ',
 'label': 0,
 'idx': 0}

Pick random entries to get a sense of what the data looks like
1. Turn the dataset object into a DataFrame so you can sample from it

In [19]:
pd.DataFrame(data["train"]).sample(10)

Unnamed: 0,sentence,label,idx
63192,'ll feel as the credits roll,1,63192
26154,goldbacher draws on an elegant visual sense an...,1,26154
42779,be commended for illustrating the merits of fi...,1,42779
25720,astonishingly skillful and moving ... it could...,1,25720
27512,is a lot like a well-made pb & j sandwich : fa...,1,27512
60129,dismiss --,0,60129
24560,original and insightful,1,24560
41458,wretchedly unfunny wannabe,0,41458
15032,few movies are able to accomplish,1,15032
5687,the biggest disappointments of the year,0,5687


### Loading the data for hate speech

Data description (based on the GitHub repository):
- hate_speech = number of CF (CrowdFlower) users who judged the tweet to be hate speech.

- offensive_language = number of CF users who judged the tweet to be offensive.

- neither = number of CF users who judged the tweet to be neither hate speech nor offensive.

- class = class label chosen by the majority of CF users.
  - 0 - hate speech
  - 1 - offensive language
  - 2 - neither
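
To make the link between the vote columns and `class` concrete, here is a minimal sketch that picks the label with the most annotator votes. The function name `majority_class` and the tie-breaking rule (the lowest class id wins) are assumptions for illustration; how the dataset authors resolved ties is not stated here.

```python
def majority_class(hate_speech_count: int, offensive_language_count: int, neither_count: int) -> int:
    """Return the class id (0, 1 or 2) that received the most annotator votes."""
    votes = [hate_speech_count, offensive_language_count, neither_count]
    # max() returns the first maximum, so ties go to the lowest class id (an assumption)
    return max(range(len(votes)), key=lambda class_id: votes[class_id])


# Three annotators: none voted hate speech, all three voted offensive language
print(majority_class(0, 3, 0))  # 1
# Two of three annotators voted "neither"
print(majority_class(0, 1, 2))  # 2
```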


In [20]:
# load the hate speech dataset from the Hugging Face Hub
dataset_name = "hate_speech_offensive"

hate_speech_ds  = load_dataset(dataset_name)
# hate_speech = pd.read_csv("hate_speech_labeled_data.csv")



In [22]:
hate_speech_df = pd.DataFrame(hate_speech_ds["train"])

In [None]:
# hate_speech_df

Showing some samples of each class

In [23]:
# show the full tweet text instead of truncating long columns
pd.set_option('display.max_colwidth', None)

print("Hate speech")
display(hate_speech_df[hate_speech_df["class"]==0].sample(10)[["class","tweet"]])
print("-----------------------------------------------------------------")

print("Offensive language")
display(hate_speech_df[hate_speech_df["class"]==1].sample(10)[["class","tweet"]])
print("-----------------------------------------------------------------")

print("Neither")
print()
display(hate_speech_df[hate_speech_df["class"]==2].sample(10)[["class","tweet"]])
print("-----------------------------------------------------------------")

Hate speech




Unnamed: 0,class,tweet
16595,0,RT @NYTMinusContext: kill whitey
3815,0,@Kevin_McAdams Because Ian's retarded
4725,0,"@SoftestMuffin @_tee13 @TorahBlaze Best believe We aint no christian slave brainwash black spooks miss white man, unlike yoself, she devil"
10653,0,I like Chinese buffets but I hate all the chinks
8034,0,Bill Smith DDS MTV ODB LGBTQ MMA NFL AFLCIO \n\nbitch pick one.\n\nFucking hate you.\n\nI know you proud but shit.
5939,0,"@erinscafe We hate the Yankees though, right? I feel like I'm really good at hating them."
4856,0,@Taylor_1017 faggot
13319,0,Niggas be in they feelings when they find out their hoe fuckin another nigga #StopSavinTheseHoes
18483,0,"RT @_politeASSHOLE: 4) Only pussy niggas who suck dick cry about someone being in their mentions, tweet the person then cry again about the&#8230;"
4657,0,"@SchulzGrayson @Dswizzle3 oh woulda coulda shoulda ass niggas bitch ass nigga, what u bout oh ban wagging ass nigga catch your phase"


-----------------------------------------------------------------
Offensive language


Unnamed: 0,class,tweet
12286,1,LMFAOOOOOO GAY AS FUCK RT @PubesOnFleeK: &#128557;&#128557;&#128557;&#128557;&#128557; get these two faggots off my TL http://t.co/oKUMmSJEYi
19531,1,RT @lildurk_: I'm finna up my bag so I'm cutting you weak hoes off
10945,1,I tell the 5th bitch to get the 6th bitch to to have the 7th bitch bring more.
12248,1,Kissing that bitch but she sucking me
8207,1,Bruh leave it to the coins coons
12284,1,LMFAOO somebody commented on my book saying I need Medical Attention bitch you the one reading it &#128514;
18784,1,"RT @collegefession: ""Guy tells me he doesn't eat pussy, and all of the sudden I don't suck dick anymore"" - Centeral Washington University"
19745,1,RT @nhalegood: When hoes feel like their photo didnt get enough favorites http://t.co/Gi3xXzBbq9
1170,1,&#8220;@CayMarieee: &#8220;@WhereYoHussleAt: @CayMarieee bitch&#8221; what u gone say bitch for. We're on the phone bitch.&#8221; Cuz to the let these Mfs know &#128514;
11297,1,I'm in the cut with my old boys and this bitch asks me to smoke weed with her in a fucking Yaris. The fuck? Lol hell no.


-----------------------------------------------------------------
Neither



Unnamed: 0,class,tweet
18558,2,RT @amoz1939: drinking\nto the last drop\nwashing teapot \n#haiku #mijikai
12474,2,Little things can just kill a negros mood man
18977,2,"RT @espn: Hot off the press, here&#8217;s Todd McShay&#8217;s first post-Combine NFL mock draft, including a new No. 1. http://t.co/su5ixWrthq"
5528,2,"@antiamnesty @AppSame @WhiteHouse \nI agree 100%. No naturalized citizenship, no anchor baby amnesty. All go to parents country of origin."
205,2,"""@OSAY_it_aint_so: &#8220;@IgnoreAllLaws: Fosters home for imaginary trash&#8221; WHOA CHILL""\n\nthat show was everything tf"
805,2,#Yankees #FireCashman I don't want Arod back.
7072,2,"@spence1reed_ bro , i heard slaughter songs .. they all trash .. they just freestylers"
11453,2,ISIS Supporters in America: The Jihadis Next Door http://t.co/u0YHWkD34c http://t.co/0XHPbRL9EQ
10158,2,I could go for a brownie right now
3405,2,@Hovaa_ your pet zebra. stripey?


-----------------------------------------------------------------


Dataset observations:
1. The majority (77%) of the examples are offensive language
2. Hate speech is rare: only 1,430 examples (about 6% of the whole dataset)

Possible experiments:
- Rebalance the dataset via:
  - oversampling 
  - undersampling 


In [26]:
#check the distribution of the classes
hate_speech_df["class"].value_counts(normalize=True)


1    0.774321
2    0.167978
0    0.057701
Name: class, dtype: float64

Check for nulls and drop them

In [27]:
# print the null counts; without print() the result of the first line is discarded
print(hate_speech_df.isna().sum())
hate_speech_df = hate_speech_df.dropna()



## Preprocessing

### Tokenize

1. Define the tokenizer (a tokenizer converts text into numbers, the token ids, so that the model can process it)

2. Try the tokenizer on a sample sentence




In [28]:
# create a tokenizer that matches the model, in this case distilbert-base-uncased
# it uses the vocabulary of the pre-trained model
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

# input_ids are the ids of the tokens in the vocabulary
tokenizer("it's me, hi, I'm the problem, it's me")



{'input_ids': [101, 2009, 1005, 1055, 2033, 1010, 7632, 1010, 1045, 1005, 1049, 1996, 3291, 1010, 2009, 1005, 1055, 2033, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
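
To show what the two fields in this output mean, here is a toy sketch of the encoding step. `TOY_VOCAB` and `toy_encode` are hypothetical names; the special-token ids 100 (`[UNK]`), 101 (`[CLS]`) and 102 (`[SEP]`) follow the bert-base-uncased vocabulary, and the word ids are read off the output above by position. The real tokenizer also splits punctuation and word pieces, which this sketch does not attempt.

```python
# Toy vocabulary: special-token ids follow bert-base-uncased; word ids are read off the output above
TOY_VOCAB = {"[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
             "the": 1996, "it": 2009, "me": 2033, "problem": 3291, "hi": 7632}


def toy_encode(text: str, max_length: int = 8) -> dict:
    """Lowercase, split on whitespace, map words to ids, and wrap in [CLS] ... [SEP]."""
    words = text.lower().split()
    # Truncate so the sequence, including the two special tokens, fits max_length
    words = words[: max_length - 2]
    ids = [TOY_VOCAB["[CLS]"]]
    ids += [TOY_VOCAB.get(word, TOY_VOCAB["[UNK]"]) for word in words]
    ids.append(TOY_VOCAB["[SEP]"])
    # attention_mask is 1 for every real token; padding (not shown) would be 0
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}


print(toy_encode("Hi me the problem"))
# {'input_ids': [101, 7632, 2033, 1996, 3291, 102], 'attention_mask': [1, 1, 1, 1, 1, 1]}
```

Words missing from the vocabulary map to `[UNK]` here, whereas the real tokenizer would break them into smaller word pieces instead.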

Tokenize profanity to check how the tokenizer handles it

In [29]:
print(tokenizer("faggot"))
print(tokenizer("bitch"))

{'input_ids': [101, 6904, 13871, 4140, 102], 'attention_mask': [1, 1, 1, 1, 1]}
{'input_ids': [101, 7743, 102], 'attention_mask': [1, 1, 1]}



<!-- After tokenzing,use pytorch to train the model to take advantage of colab free gpu:

- video source: https://youtu.be/Dh9CL8fyG80
- link source: https://youtu.be/Dh9CL8fyG80 -->


In [30]:

# sentence1_key, sentence2_key = task_to_keys[task]
sentence1_key = "sentence"

#print the sentence
print(f"Sentence: {data['train'][0][sentence1_key]}")

#print the tokenized sentence 
tokenizer(data['train'][0][sentence1_key],truncation=True) 

Sentence: hide new secretions from the parental units 


{'input_ids': [101, 5342, 2047, 3595, 8496, 2013, 1996, 18643, 3197, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

The sample code picks the text columns per GLUE task; ```sst2``` needs only the `sentence` column. The hate speech dataset stores its text in the `tweet` column, so `tweet` is the key used for it below.

The argument:
- `truncation=True`: ensures that an input longer than the model can handle is truncated to the maximum length the model accepts.

Why tokenize first?
- It is better to tokenize before oversampling, undersampling, and the train/test split
- That way the same tweets do not need to be re-tokenized for each resampled dataset

In [31]:
# preprocessing function: tokenize the tweet column of the hate speech dataset
def preprocess_function(examples):
    return tokenizer(examples["tweet"], truncation=True)

# #preprocessing on hate_speech dataset
# def my_preprocess_function(examples):
#   return tokenizer(examples, truncation=True)

#encoded hate_speech for fine tuning
# endocded_hspeech = hate_speech["tweet"].map(my_preprocess_function)

encoded_ds = hate_speech_ds.map(preprocess_function)

#check what encoded ds looks like
encoded_ds

                                                                                                                    

DatasetDict({
    train: Dataset({
        features: ['count', 'hate_speech_count', 'offensive_language_count', 'neither_count', 'class', 'tweet', 'input_ids', 'attention_mask'],
        num_rows: 24783
    })
})

In [32]:
hate_speech_df = pd.DataFrame(encoded_ds["train"])

In [33]:
# hate_speech_df

### Undersampling and Oversampling





Perform oversampling using [Imbalanced-Learn Library](https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/)

- The ```minority``` strategy resamples only the smallest class until it matches the largest one. For example, if the majority class has 1,000 examples and the minority class has 100, the minority class is oversampled to 1,000 examples.

- We need to oversample twice because we have 3 classes: after the first pass only class 0 has grown (from 1,430 to 19,190). The same logic applies to undersampling.
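
The two-pass behaviour described above can be sketched in plain Python. This is an illustration of random oversampling with replacement, not the imbalanced-learn implementation; the function name `oversample_minority` and the small label counts are made up for the example.

```python
import random
from collections import Counter


def oversample_minority(labels: list, seed: int = 0) -> list:
    """Return row indices after duplicating random rows of the smallest class
    until it matches the largest class (one pass of the 'minority' strategy)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    target = max(counts.values())
    minority_rows = [i for i, label in enumerate(labels) if label == minority]
    # Sample with replacement: the same row can be copied several times
    extra = rng.choices(minority_rows, k=target - counts[minority])
    return list(range(len(labels))) + extra


labels = [1] * 8 + [2] * 4 + [0] * 2
first = [labels[i] for i in oversample_minority(labels)]
second = [first[i] for i in oversample_minority(first)]
print(sorted(Counter(first).items()))   # [(0, 8), (1, 8), (2, 4)]
print(sorted(Counter(second).items()))  # [(0, 8), (1, 8), (2, 8)]
```

In imbalanced-learn itself, `sampling_strategy='not majority'` resamples every class except the largest in a single call, which may be a simpler alternative to running the `minority` strategy twice.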



In [34]:
# features and labels
X = hate_speech_df[["tweet",'input_ids', 'attention_mask']].values
y = hate_speech_df["class"].values

print("distribution before oversampling")
print(Counter(y))
# 'minority' oversamples only the smallest class until it matches the largest class
oversample = RandomOverSampler(random_state=0,sampling_strategy='minority')

X_over, y_over = oversample.fit_resample(X, y)
print("distribution after oversampling")
print(Counter(y_over))

# need to oversample twice because we have 3 classes
# after the first pass only class 0 has grown, from 1430 to 19190
X_over, y_over = oversample.fit_resample(X_over, y_over)
print("distribution after 2nd oversampling")
print(Counter(y_over))


over_h_speech = pd.DataFrame(X_over, columns=["tweet",'input_ids', 'attention_mask'])
over_h_speech["label"] = y_over

distribution before oversampling
Counter({1: 19190, 2: 4163, 0: 1430})
distribution after oversampling
Counter({1: 19190, 0: 19190, 2: 4163})
distribution after 2nd oversampling
Counter({2: 19190, 1: 19190, 0: 19190})


In [35]:
# over_h_speech

Perform undersampling using [Imbalanced-Learn Library](https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/)


In [36]:
# 'majority' undersamples only the largest class down to the size of the smallest class
undersample = RandomUnderSampler(random_state=0,sampling_strategy='majority')

X_under, y_under = undersample.fit_resample(X, y)
print("distribution after undersampling")
print(Counter(y_under))

# second pass: class 2 is now the largest class, so it is reduced as well
X_under, y_under = undersample.fit_resample(X_under, y_under)
print("distribution after 2nd undersampling")
print(Counter(y_under))

# build the undersampled DataFrame from X_under and y_under
under_h_speech = pd.DataFrame(X_under, columns=["tweet",'input_ids', 'attention_mask'])
under_h_speech["label"] = y_under



distribution after undersampling
Counter({2: 4163, 0: 1430, 1: 1430})
distribution after 2nd undersampling
Counter({0: 1430, 1: 1430, 2: 1430})


### Train test split on the following:
- hate_speech
- over_h_speech
- under_h_speech

In [37]:
hate_speech_df.columns = ['count', 'hate_speech_count', 'offensive_language_count',
       'neither_count', 'label', 'tweet', 'input_ids', 'attention_mask']

In [38]:

train, test = train_test_split(hate_speech_df, test_size=0.2)
over_train, over_test = train_test_split(over_h_speech, test_size=0.2)
under_train, under_test = train_test_split(under_h_speech, test_size=0.2)


                               

### Create a Dataset Dictionary
- So we can use it for the model

Reference: the `dataset` object is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key each for the training, validation, and test sets (with more keys for the mismatched validation and test sets in the special case of `mnli`).

In [39]:
# Dataset.from_pandas(over_train)



train_ds = Dataset.from_pandas(train, split="train")
test_ds = Dataset.from_pandas(test, split="test")

o_train_ds = Dataset.from_pandas(over_train, split="o_train")
o_test_ds = Dataset.from_pandas(over_test, split="o_test")


u_train_ds = Dataset.from_pandas(under_train, split="u_train")
u_test_ds = Dataset.from_pandas(under_test, split="u_test")



In [40]:
train_ds

Dataset({
    features: ['count', 'hate_speech_count', 'offensive_language_count', 'neither_count', 'label', 'tweet', 'input_ids', 'attention_mask', '__index_level_0__'],
    num_rows: 19826
})

In [41]:

#processed dataset
dataset_proc = DatasetDict({"train":train_ds,
                             "test":test_ds, 
                             "o_train_ds":o_train_ds,
                             "o_test_ds":o_test_ds,
                             "u_train_ds":u_train_ds,
                             "u_test_ds":u_test_ds, })


In [42]:
# dataset_proc

## Fine-tuning the model

1. Load the pre-trained model (the weights are downloaded once and then cached)
2. Call `.to("cuda")` to move the model to the GPU [source](https://huggingface.co/docs/transformers/perf_train_gpu_one)



In [None]:

labels = 3
# model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=labels)
##comment this out if you just want cpu
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=labels).to("cuda")


Defining the training arguments

Required argument:
- the folder name where model checkpoints are stored

Other arguments used here:
- evaluation_strategy (str or IntervalStrategy, optional, defaults to "no"): when to run evaluation during training. It is set to "epoch" here, so the model is evaluated at the end of every epoch.
- per_device_train_batch_size (int, optional, defaults to 8): the batch size per GPU/TPU core/CPU for training.
- push_to_hub=True: pushes the model to the Hub regularly during training. Remove it if you didn't follow the installation steps at the top of the notebook.

In [None]:
model_name = model_checkpoint.split("/")[-1]
batch_size= 8
metric_name = "accuracy"
task = "hate_speech"

args = TrainingArguments(
    f"{model_name}-finetuned-{task}",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    push_to_hub=True,
)
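
As a rough check on how much work these arguments request, the arithmetic below estimates the number of optimizer steps. It uses the values from the cell above and the 19,826 training rows shown earlier, and assumes a single device with no gradient accumulation.

```python
import math

train_rows = 19826   # rows in the training split shown earlier
batch_size = 8       # per_device_train_batch_size
epochs = 5           # num_train_epochs

# Assumes a single device and no gradient accumulation
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 2479 12395
```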

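Measuring performance / predictions
1. Use the `metric` we loaded earlier
2. For classification, the predicted class is the argmax of the predicted logits. `predictions[:, 0]` only fits a regression task with a single output; with three classes it returns the raw class-0 logit, so the accuracy in the training log below, which was recorded with `predictions[:, 0]`, does not reflect real performance.


In [None]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # pick the class with the highest logit for each example
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)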
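
The difference between taking the argmax of the logits and taking `predictions[:, 0]` can be seen on a few made-up logits. The sketch uses plain Python lists so it runs without any library; the logits and labels are hypothetical, and with NumPy arrays `np.argmax(predictions, axis=1)` gives the same class ids.

```python
def argmax_rows(logits: list) -> list:
    """Index of the largest logit in each row, i.e. the predicted class id."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy(predictions: list, references: list) -> float:
    correct = sum(int(p == r) for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical logits for four tweets over the three classes (0, 1, 2)
logits = [[0.2, 2.5, -1.0],
          [1.9, 0.3, -0.5],
          [-0.7, 0.1, 3.2],
          [0.4, 1.1, 0.2]]
labels = [1, 0, 2, 1]

print(argmax_rows(logits))                    # [1, 0, 2, 1]
print(accuracy(argmax_rows(logits), labels))  # 1.0

# Taking column 0 instead returns raw class-0 logits, not class ids
first_column = [row[0] for row in logits]
print(first_column)                           # [0.2, 1.9, -0.7, 0.4]
print(accuracy(first_column, labels))         # 0.0
```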

In [None]:
#define the validation key of sst2
validation_key  = "validation"
trainer = Trainer(model,
                  args,
                  train_dataset=dataset_proc["train"],
                  eval_dataset=dataset_proc["test"],
                  tokenizer=tokenizer,
                  compute_metrics=compute_metrics)



/content/distilbert-base-uncased-finetuned-hate_speech is already a clone of https://huggingface.co/Dc26/distilbert-base-uncased-finetuned-hate_speech. Make sure you pull the latest changes with `repo.git_pull()`.


In [None]:
result = trainer.train()
print_summary(result)
# why are the validation loss and accuracy identical in every epoch?

# print_summary(result)

Epoch,Training Loss,Validation Loss,Accuracy
1,0.19,0.294211,0.056082
2,0.193,0.294211,0.056082
3,0.1883,0.294211,0.056082
4,0.1835,0.294211,0.056082
5,0.1925,0.294211,0.056082


Time: 784.05
Samples/second: 126.43
GPU memory occupied: 9937 MB.


In [None]:
# result

In [None]:
trainer.evaluate()

{'eval_loss': 0.29421094059944153,
 'eval_accuracy': 0.0560823078474884,
 'eval_runtime': 9.5234,
 'eval_samples_per_second': 520.508,
 'eval_steps_per_second': 32.551,
 'epoch': 5.0}

In [None]:
trainer.push_to_hub()

To https://huggingface.co/Dc26/distilbert-base-uncased-finetuned-hate_speech
   5dc7512..8b3f0f8  main -> main

To https://huggingface.co/Dc26/distilbert-base-uncased-finetuned-hate_speech
   8b3f0f8..bd3a48b  main -> main



'https://huggingface.co/Dc26/distilbert-base-uncased-finetuned-hate_speech/commit/8b3f0f829c7a6acf7c198dea9b0ba000532167ca'

In [None]:
f"{model_name}-finetuned-{task}"

'distilbert-base-uncased-finetuned-hate_speech'

In [None]:
# once pushed, the model can be loaded from the Hub by its repo name

AutoModelForSequenceClassification.from_pretrained("Dc26/distilbert-base-uncased-finetuned-hate_speech")

## Hyperparameter search

Optuna or Ray Tune must be installed because `hyperparameter_search` uses one of them as its backend.

We also define a ```new_model``` function and pass it as `model_init`, so the `Trainer` creates a fresh model for every trial.

In [None]:
num_labels=3
def new_model():
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

In [None]:
trainer = Trainer(
    model_init=new_model,
    args=args,
    train_dataset=dataset_proc["train"],
    eval_dataset=dataset_proc["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.bias', 'classifier

Hyperparameter search may take a long time on the full dataset, so
- run the search on only 1/10th of the dataset
- then train on the full dataset only with the best-performing hyperparameters

In [None]:
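# 1/10th of the training data; it must be passed as train_dataset to the Trainer for the search to use it
train_dataset = dataset_proc["train"].shard(index=1, num_shards=10) 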

In [None]:
# returns a BestRun object for the trial that maximizes the accuracy / desired metric
best_run = trainer.hyperparameter_search(n_trials=5, direction="maximize")

[I 2023-03-24 13:52:31,865] A new study created in memory with name: no-name-ff8d76f1-4d8e-4de7-8c98-6d7d79c9d77e

OutOfMemoryError: ignored

In [None]:
# best_run

You can customize the objective to maximize by passing along a `compute_objective` function to the `hyperparameter_search` method, and you can customize the search space by passing a `hp_space` argument to `hyperparameter_search`. See this [forum post](https://discuss.huggingface.co/t/using-hyperparameter-search-in-trainer/785/10) for some examples.
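
A custom search space for the Optuna backend could look like the sketch below. The ranges are illustrative assumptions rather than tuned values, and `FakeTrial` is a hypothetical stand-in for an Optuna trial so the function can be tried without Optuna installed. The small batch sizes are chosen only to lower the risk of the out-of-memory error seen above.

```python
def hp_space(trial):
    """Search space passed as `hp_space=` to `Trainer.hyperparameter_search` (Optuna backend).
    The ranges below are illustrative assumptions, not tuned values."""
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 5),
        # Small batch sizes only, to lower the risk of running out of GPU memory
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [4, 8]
        ),
    }


class FakeTrial:
    """Hypothetical stand-in for an Optuna trial so the function can be tried offline."""

    def suggest_float(self, name, low, high, log=False):
        return low

    def suggest_int(self, name, low, high):
        return low

    def suggest_categorical(self, name, choices):
        return choices[0]


print(hp_space(FakeTrial()))
# {'learning_rate': 1e-05, 'num_train_epochs': 2, 'per_device_train_batch_size': 4}

# In the notebook this would be used roughly as:
# best_run = trainer.hyperparameter_search(
#     hp_space=hp_space, n_trials=5, direction="maximize", backend="optuna"
# )
```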

To reproduce the best training, just set the hyperparameters in your `TrainingArguments` before creating a `Trainer`:

In [None]:
for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)

trainer.train()

Check if the model works

In [None]:
train["label"].value_counts()

1    15324
2    3376 
0    1126 
Name: label, dtype: int64

In [None]:
test["label"].value_counts()

1    3841
2    824 
0    292 
Name: label, dtype: int64

In [None]:
import os
from transformers import TextClassificationPipeline
os.environ['CUDA_VISIBLE_DEVICES'] ='0'

current_model = AutoModelForSequenceClassification.from_pretrained("Dc26/distilbert-base-uncased-finetuned-hate_speech")
pipe = TextClassificationPipeline(model=current_model, tokenizer=tokenizer, return_all_scores=False)


def get_prediction(text):
    # the pipeline returns e.g. [{"label": "LABEL_1", "score": ...}]; keep the last character of the label ("1")
    return pipe(text)[-1]["label"][-1]
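
`get_prediction` relies on the default label names (`LABEL_0`, `LABEL_1`, `LABEL_2`) that a model reports when no custom `id2label` mapping is set. A slightly more explicit way to turn such a label into a class id is sketched below; `label_to_id` and the sample output are hypothetical, and the score is made up.

```python
def label_to_id(label: str) -> int:
    """Convert a default pipeline label such as 'LABEL_1' into the integer class id 1."""
    return int(label.split("_")[-1])


# Hypothetical pipeline output for one tweet (the score is made up)
fake_output = [{"label": "LABEL_1", "score": 0.98}]

print(fake_output[-1]["label"][-1])           # 1  (a string: the last character)
print(label_to_id(fake_output[-1]["label"]))  # 1  (an int)
print(label_to_id("LABEL_12"))                # 12
```

Taking only the last character works for up to ten classes; splitting on the underscore also handles more classes and returns an integer directly.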



In [None]:
%%time
test["predicted"] = test["tweet"].apply(lambda x: get_prediction(x))


CPU times: user 5min 55s, sys: 737 ms, total: 5min 56s
Wall time: 6min 7s


In [None]:
# create a confusion matrix (rows: predicted class, columns: true label)
test["predicted"] = test["predicted"].astype(int)
test.groupby(["predicted", "label"]).size().unstack(fill_value=0)

label,0,1,2
predicted,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,87,65,3
1,177,3724,47
2,40,77,737


In [None]:
import seaborn as sns
test.groupby(["predicted", "label"]).size().unstack(fill_value=0)

label,0,1,2
predicted,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,87,65,3
1,177,3724,47
2,40,77,737


In [None]:
test["predicted"].value_counts()
# most tweets are predicted as class 1, which is also the majority class in the data

1    3948
2    854 
0    155 
Name: predicted, dtype: int64

In [None]:
test[test["predicted"]==test["label"]].shape[0] / test.shape[0]

0.917490417591285
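
Overall accuracy can hide weak performance on the rare class, so it may help to read per-class precision and recall off the confusion matrix shown earlier. The sketch below hard-codes that matrix (rows are predicted classes, columns are true labels); the variable names are chosen for illustration.

```python
# Confusion matrix from above: rows are predicted classes, columns are true labels
confusion = [
    [87, 65, 3],      # predicted 0
    [177, 3724, 47],  # predicted 1
    [40, 77, 737],    # predicted 2
]
class_names = ["hate speech", "offensive language", "neither"]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(3))
print(f"accuracy: {correct / total:.4f}")  # accuracy: 0.9175

for c, name in enumerate(class_names):
    predicted_as_c = sum(confusion[c])                # row sum
    truly_c = sum(confusion[r][c] for r in range(3))  # column sum
    precision = confusion[c][c] / predicted_as_c
    recall = confusion[c][c] / truly_c
    print(f"{name}: precision={precision:.3f} recall={recall:.3f}")
# hate speech: precision=0.561 recall=0.286
# offensive language: precision=0.943 recall=0.963
# neither: precision=0.863 recall=0.936
```

On these numbers, fewer than a third of the true hate speech tweets are recovered even though accuracy is about 92%.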

## Recommendations

1. Further data cleaning
  - The tweets contain a lot of emojis, so performance might improve if we remove them and clean the text further
  - Replace each @username with a generic tag so the model does not see many different usernames
2. Test whether the model performs better with the undersampled and oversampled datasets
3. Graph the confusion matrix of the test set
4. Compare performance with an existing paper (papers: https://github.com/aymeam/Datasets-for-Hate-Speech-Detection)
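
The data cleaning in recommendation 1 could start from something like the sketch below. The function name `clean_tweet` and the replacement tags `@user` and `<url>` are assumptions for illustration; emoji removal is not shown, and whether any of these steps improves the model would need to be tested.

```python
import html
import re


def clean_tweet(text: str) -> str:
    """Normalise a raw tweet: decode HTML entities, mask usernames and links."""
    text = html.unescape(text)                      # e.g. "&#8217;" becomes a real apostrophe
    text = re.sub(r"@\w+", "@user", text)           # every username becomes the same tag
    text = re.sub(r"https?://\S+", "<url>", text)   # replace links with one placeholder
    text = re.sub(r"\s+", " ", text)                # collapse newlines and repeated spaces
    return text.strip()


print(clean_tweet("RT @espn: here&#8217;s the draft\n http://t.co/su5ixWrthq"))
# RT @user: here’s the draft <url>
```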