<a href="https://colab.research.google.com/github/Zenith1618/LLM/blob/main/Dataset_for_CommentGPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating a Dataset for Fine-Tuning Mistral as CommentGPT

### Import Libraries

In [2]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.19.0-py3-none-any.whl (542 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
Collecting huggingface-hub>=0.21.2 (from datasets)
  Downloading huggingface_hub-0.22.2-py3-none-any

In [4]:
import csv
import random
import pandas as pd
from datasets import Dataset, DatasetDict

### Dataset

In [5]:
data = pd.read_csv('YT-comments.csv')

In [6]:
data.head()

| | Comment | Response |
|---|---|---|
| 0 | This is a about as perfect a coverage of this ... | Thanks Sean! It's always a challenge to convey... |
| 1 | This was a very thorough introduction to LLMs ... | Great to hear, glad it was helpful :) |
| 2 | Thank you so much for putting these videos tog... | My pleasure, glad it was informative yet conci... |
| 3 | Honestly the most straightforward explanation ... | Thanks, glad it was clear |
| 4 | Wow dude, just you wait, this channel is gonna... | Thanks for the kind words! Maybe one day |


### Constructing Dataset


In [7]:
# load csv of YouTube comments
comment_list = []
response_list = []

with open('YT-comments.csv', mode='r', newline='') as file:
    # use a separate name so the reader does not shadow the file handle
    reader = csv.reader(file)

    # read file line by line
    for line in reader:
        # skip the header row
        if line[0] == 'Comment':
            continue

        # append comments and responses to respective lists
        comment_list.append(line[0])
        response_list.append(line[1] + " -CommentGPT")

In [10]:
comment_list[1]

'This was a very thorough introduction to LLMs and answered many questions I had. Thank you.'

In [11]:
response_list[2]

'My pleasure, glad it was informative yet concise :) -CommentGPT'
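
The loading step above could also key each row by column name rather than by position. This is a minimal sketch, assuming the CSV has `Comment` and `Response` header columns as the preview suggests. The helper name `load_pairs`, its `signature` parameter, and the in-memory sample rows are hypothetical and used only for illustration.

```python
import csv
import io


def load_pairs(csv_file, signature=" -CommentGPT"):
    """Return (comments, responses) from a CSV with 'Comment' and 'Response' columns."""
    comments, responses = [], []
    # DictReader consumes the header row and keys every row by column name
    for row in csv.DictReader(csv_file):
        comments.append(row["Comment"])
        responses.append(row["Response"] + signature)
    return comments, responses


# Made-up sample standing in for YT-comments.csv (not real data from the file)
sample = io.StringIO(
    'Comment,Response\n'
    '"Great video, thanks!","Glad it helped"\n'
    '"Clear and concise","Thanks, glad it was clear"\n'
)
comments, responses = load_pairs(sample)
print(comments[0])
print(responses[1])
```

With a real file, the same function could be called as `load_pairs(open('YT-comments.csv', newline=''))`, which removes the need for the manual header check.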

Next, we convert these lists into the instruction format used by the Mistral 7B Instruct model, prepending a prompt to each comment.

In [13]:
instructions_string = """CommentGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. \
It reacts to feedback aptly and ends responses with its signature '–CommentGPT'. \
CommentGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, \
thus keeping the interaction natural and engaging.

Please respond to the following comment.
"""

def example_template(comment, response):
    """Wrap a comment/response pair in the Mistral instruct format."""
    return f"<s>[INST] {instructions_string} \n{comment} \n[/INST]\n{response}</s>"

example_list = []
for i in range(len(comment_list)):
    example = example_template(comment_list[i],response_list[i])
    example_list.append(example)

print(example_list[1])

<s>[INST] CommentGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. It reacts to feedback aptly and ends responses with its signature '–CommentGPT'. CommentGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, thus keeping the interaction natural and engaging.

Please respond to the following comment.
 
This was a very thorough introduction to LLMs and answered many questions I had. Thank you. 
[/INST]
Great to hear, glad it was helpful :) -CommentGPT</s>
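
Before splitting, it may be worth checking that every generated string follows the `<s>[INST] ... [/INST] ... </s>` layout shown above. The following is a minimal sketch of such a check. The function name `is_well_formed` is hypothetical, and the rule that a response must end with `-CommentGPT` is an assumption based on the suffix appended earlier in this notebook.

```python
def is_well_formed(example: str) -> bool:
    """Check that an example follows the <s>[INST] ... [/INST] response</s> layout."""
    if not (example.startswith("<s>[INST]") and example.endswith("</s>")):
        return False
    # expect exactly one opening and one closing instruction tag
    if example.count("[INST]") != 1 or example.count("[/INST]") != 1:
        return False
    # everything after [/INST], minus the trailing </s>, is the response
    _, _, completion = example.partition("[/INST]")
    response = completion[: -len("</s>")].strip()
    # assumption: every response carries the signature appended earlier
    return bool(response) and response.endswith("-CommentGPT")


good = "<s>[INST] Please respond. \nNice video \n[/INST]\nThanks! -CommentGPT</s>"
print(is_well_formed(good))
```

Applied to the notebook's data, something like `all(is_well_formed(e) for e in example_list)` would flag any malformed entry before the dataset is built.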


### Splitting the dataset

In [14]:
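# Randomly pick 9 examples for the test set and remove them from the training set
# (range covers every index, so the last example can also be sampled)
test_index_list = random.sample(range(len(example_list)), 9)

test_list = [example_list[index] for index in test_index_list]

# filter by index so duplicate examples cannot cause the wrong entry to be removed
test_index_set = set(test_index_list)
example_list = [example for index, example in enumerate(example_list) if index not in test_index_set]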

In [15]:
# Converting the list into dataset
data = DatasetDict({'train':Dataset.from_dict({"example":example_list}), 'test':Dataset.from_dict({"example":test_list})})

In [16]:
data

DatasetDict({
    train: Dataset({
        features: ['example'],
        num_rows: 50
    })
    test: Dataset({
        features: ['example'],
        num_rows: 9
    })
})
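
The split above draws different test examples on every run. If a repeatable split is wanted, one option is to seed a dedicated random generator, as in this minimal sketch. The function name `split_examples`, the `seed` parameter, and the placeholder strings are hypothetical. The sizes (59 examples, 9 held out) mirror the counts shown in the output above.

```python
import random


def split_examples(examples, n_test, seed=0):
    """Split examples into (train, test) using a seeded, index-based sample."""
    rng = random.Random(seed)  # local generator, leaves the global random state alone
    test_idx = set(rng.sample(range(len(examples)), n_test))
    train = [ex for i, ex in enumerate(examples) if i not in test_idx]
    test = [ex for i, ex in enumerate(examples) if i in test_idx]
    return train, test


# Placeholder strings standing in for the formatted examples
examples = [f"example {i}" for i in range(59)]
train, test = split_examples(examples, n_test=9, seed=42)
print(len(train), len(test))  # 50 9
```

Because the sample is taken over indices, duplicate strings in the data do not affect which entries end up in each split.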

### Push dataset to Hugging Face

In [19]:
from huggingface_hub import login

In [21]:
login()  # prompts for the access token; never hard-code a token in a shared notebook

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful
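
For non-interactive runs, the token could be read from the environment instead of being typed or pasted into a cell. This is a minimal sketch under stated assumptions: the variable name `HF_TOKEN` and the helper names `read_hf_token` and `login_from_env` are hypothetical choices for illustration, and `login(token=...)` is the only `huggingface_hub` call used.

```python
import os


def read_hf_token(environ=os.environ, var_name="HF_TOKEN"):
    """Return the token stored in the given environment mapping, or raise if missing."""
    token = environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable before logging in.")
    return token


def login_from_env():
    """Log in to the Hugging Face Hub with the token taken from the environment."""
    # imported lazily so the helper above can be used without huggingface_hub installed
    from huggingface_hub import login

    login(token=read_hf_token())
```

Calling `login_from_env()` would then replace the interactive login, and the secret stays out of the notebook's saved contents.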


In [22]:
# push dataset to hub
data.push_to_hub("Zenith1618/youtube-comments")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


CommitInfo(commit_url='https://huggingface.co/datasets/Zenith1618/youtube-comments/commit/f3cc76a2b27e85288c6fd2c4f08601f3406ed15d', commit_message='Upload dataset', commit_description='', oid='f3cc76a2b27e85288c6fd2c4f08601f3406ed15d', pr_url=None, pr_revision=None, pr_num=None)