### Different categories of fine-tuning

The most common ways to fine-tune language models are instruction fine-tuning and
classification fine-tuning. Instruction fine-tuning trains a language model on
a set of tasks paired with specific instructions, improving its ability to understand and
execute tasks described in natural language prompts.

In classification fine-tuning, the model is trained to recognize a specific set of class labels, such as whether a message is spam or not.

An instruction fine-tuned model can undertake a wide variety of tasks, whereas a classification fine-tuned model is specialized: it can only predict the classes it saw during training.
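
To make the contrast concrete, here is a minimal sketch of what a single training example might look like in each setting. The field names (`instruction`, `input`, `output`, `text`, `label`) are illustrative assumptions rather than a format this notebook prescribes.

```python
# Hypothetical instruction fine-tuning example: the task itself is
# described in natural language, and the target is free-form text.
instruction_example = {
    "instruction": "Is the following text message spam? Answer with 'yes' or 'no'.",
    "input": "Free entry in 2 a wkly comp to win FA Cup fina...",
    "output": "yes",
}

# Hypothetical classification fine-tuning example: the task is implicit,
# and the target is one label from a fixed set.
classification_example = {
    "text": "Free entry in 2 a wkly comp to win FA Cup fina...",
    "label": "spam",
}

# A classification model can only ever answer with one of these labels.
allowed_labels = {"ham", "spam"}
print(classification_example["label"] in allowed_labels)
```

The same spam question appears in both forms: the instruction variant could be swapped for any other task by changing the prompt, while the classification variant is tied to the two labels.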

In [4]:
import urllib.request
import zipfile
import os
from pathlib import Path

url = "https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip"
zip_path = "spam_collection.zip"
extracted_path = "spam_collection"
data_file_path = Path(extracted_path) / "SpamCollection.tsv"

def download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path):
    # Skip the download if the dataset is already on disk.
    if data_file_path.exists():
        print(f"{data_file_path} already exists, so skipping the download")
        return

    # Download the zip archive.
    with urllib.request.urlopen(url) as response:
        with open(zip_path, "wb") as out_file:
            out_file.write(response.read())

    # Extract the archive contents.
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(extracted_path)

    # The extracted file has no extension; rename it to a .tsv file.
    original_file_path = Path(extracted_path) / "SMSSpamCollection"
    os.rename(original_file_path, data_file_path)
    print(f"File downloaded and saved as {data_file_path}")

download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path)

File downloaded and saved as spam_collection/SpamCollection.tsv


In [11]:
import pandas as pd
df = pd.read_csv(data_file_path, sep='\t', header=None, names=["Label", "Text"])
df.head(5)

  Label                                               Text
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


In [12]:
print(df["Label"].value_counts())

Label
ham     4825
spam     747
Name: count, dtype: int64


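The dataset is imbalanced: only 747 of the 5,572 messages are spam, while 4,825 are ham. To balance the classes, we undersample the ham messages to match the number of spam messages.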
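
As a quick way to quantify the imbalance, the sketch below computes each class's share of the dataset. It rebuilds the counts by hand from the `value_counts()` output above so that it runs on its own; in the notebook, `df["Label"].value_counts(normalize=True)` would give the fractions directly.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({"ham": 4825, "spam": 747}, name="count")

# Fraction of the dataset that each class makes up.
fractions = counts / counts.sum()
print(fractions.round(3))

# Number of ham messages for every spam message.
imbalance_ratio = counts["ham"] / counts["spam"]
print(f"ham-to-spam ratio: {imbalance_ratio:.2f}")
```

Spam makes up roughly 13% of the messages, so a model that always predicts "ham" would already be right about 87% of the time, which is why accuracy on the unbalanced data would be misleading.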

In [19]:
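import pandas as pd

def create_balanced_dataset(df):
    # Count the spam messages (the minority class).
    num_spam = df[df["Label"] == "spam"].shape[0]
    # Randomly sample the same number of ham messages; the seed makes this reproducible.
    ham_subset = df[df["Label"] == "ham"].sample(num_spam, random_state=123)
    # Combine the ham subset with all spam messages.
    balanced_df = pd.concat([ham_subset, df[df["Label"] == "spam"]])
    return balanced_df

balanced_df = create_balanced_dataset(df)
print(balanced_df)
print(balanced_df["Label"].value_counts())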

     Label                                               Text
4307   ham  Awww dat is sweet! We can think of something t...
4138   ham                             Just got to  &lt;#&gt;
4831   ham  The word "Checkmate" in chess comes from the P...
4461   ham  This is wishing you a great day. Moji told me ...
5440   ham      Thank you. do you generally date the brothas?
...    ...                                                ...
5537  spam  Want explicit SEX in 30 secs? Ring 02073162414...
5540  spam  ASKED 3MOBILE IF 0870 CHATLINES INCLU IN FREE ...
5547  spam  Had your contract mobile 11 Mnths? Latest Moto...
5566  spam  REMINDER FROM O2: To get 2.50 pounds free call...
5567  spam  This is the 2nd time we have tried 2 contact u...

[1494 rows x 2 columns]
Label
ham     747
spam    747
Name: count, dtype: int64


Next, we split the balanced dataset into training (70%), validation (10%), and test (the remaining 20%) sets.

In [21]:
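import pandas as pd

def random_split(df, train_frac, validation_frac):
    # Shuffle all rows; the fixed seed makes the split reproducible.
    df = df.sample(frac=1, random_state=123).reset_index(drop=True)
    # Compute the row positions where the train and validation sets end.
    train_end = int(len(df) * train_frac)
    validation_end = train_end + int(len(df) * validation_frac)
    # Slice the shuffled rows; the test set gets whatever remains.
    train_df = df[:train_end]
    validation_df = df[train_end:validation_end]
    test_df = df[validation_end:]
    return train_df, validation_df, test_df

# 70% training, 10% validation, remaining 20% test
train_df, validation_df, test_df = random_split(balanced_df, 0.7, 0.1)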
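
A quick sanity check on the split can be sketched as follows. To keep it self-contained, the block repeats `random_split` and runs it on a synthetic stand-in DataFrame (`toy_df`, a hypothetical name) with the same 1,494 rows as the balanced dataset, instead of the real `balanced_df`.

```python
import pandas as pd

def random_split(df, train_frac, validation_frac):
    # Same logic as the split function in this notebook.
    df = df.sample(frac=1, random_state=123).reset_index(drop=True)
    train_end = int(len(df) * train_frac)
    validation_end = train_end + int(len(df) * validation_frac)
    return df[:train_end], df[train_end:validation_end], df[validation_end:]

# Synthetic stand-in for balanced_df: 747 "ham" and 747 "spam" rows.
toy_df = pd.DataFrame({
    "Label": ["ham", "spam"] * 747,
    "Text": [f"message {i}" for i in range(1494)],
})

toy_train, toy_val, toy_test = random_split(toy_df, 0.7, 0.1)
print(len(toy_train), len(toy_val), len(toy_test))

# Every row should land in exactly one split.
all_texts = pd.concat([toy_train, toy_val, toy_test])["Text"]
print(all_texts.is_unique, len(all_texts) == len(toy_df))
```

Because the boundaries are computed with `int(...)`, the sizes are rounded down, so 1,494 rows yield 1,045 training, 149 validation, and 300 test rows rather than exact 70/10/20 fractions.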

In [23]:
train_df.to_csv("train.csv", index=False)
validation_df.to_csv("validation.csv", index=False)
test_df.to_csv("test.csv", index=False)
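
To confirm that saved files can be read back unchanged, a round-trip check along these lines may help. This is a sketch on a tiny hypothetical DataFrame (`sample_df`) written to a temporary directory, so it does not touch the real `train.csv`, `validation.csv`, or `test.csv`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny stand-in DataFrame; the first text contains a comma on purpose.
sample_df = pd.DataFrame({
    "Label": ["ham", "spam"],
    "Text": [
        "Go until jurong point, crazy.. Available only ...",
        "Free entry in 2 a wkly comp to win FA Cup fina...",
    ],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "sample.csv"
    # Omitting the index avoids an extra unnamed column on reload.
    sample_df.to_csv(csv_path, index=False)
    reloaded_df = pd.read_csv(csv_path)

print(list(reloaded_df.columns))
print(reloaded_df["Text"].tolist() == sample_df["Text"].tolist())
```

pandas quotes fields that contain the delimiter by default, so message texts with commas survive the trip through CSV.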