# CognitiveLab Research Internship Assignment

## Synthetic Data Generation and LLM Fine-tuning

### Overview
In this assignment for the [CognitiveLab](https://cognitivelab.in/) Research Internship, you will:
1. Create a synthetic dataset for a use case of your choice
2. Fine-tune a small LLM using this dataset
3. Evaluate the model performance before and after fine-tuning

### What is Synthetic Data?
Synthetic data is data that is generated artificially rather than collected from real-world events. In the context of this assignment, it includes:
- Data generated using AI/ML algorithms
- Data transformed from existing datasets (e.g., translations)
- Data created by prompting LLMs to generate samples

### Task Description
You're expected to:
1. Choose an interesting use case (suggestions below)
2. Generate a high-quality synthetic dataset
3. Fine-tune a small LLM (1-2B parameters)
4. Thoroughly evaluate the results
5. Document your approach and findings

### Potential Use Cases
- **Multilingual Translation**: Generate translation pairs for low-resource languages
- **Reasoning Tasks**: Create logical or mathematical reasoning problems
- **Vision-Language OCR**: Generate text extraction examples from images
- **Domain-Specific Q&A**: Create question-answer pairs for specialized domains
- **Code Generation**: Generate code examples for specific programming tasks

### Requirements
- The assignment must run on Google Colab with a single click
- Your synthetic dataset must be uploaded to Hugging Face Datasets
- Use small language models (e.g., Llama 3.2 1B or Qwen 3 0.6B / 1.7B) that can run on T4 GPUs
- Include comprehensive documentation of your approach

Good luck! We're looking for creative approaches to this problem - surprise us with your solution!

## 1. What is your idea, and what use case are you trying to solve?

Give us a TL;DR of your idea and the use case you are trying to solve.


## 2. Environment Setup

In this section, we'll install all the necessary dependencies for our project. This includes libraries for:
- Data processing and manipulation
- LLM access and fine-tuning
- Evaluation metrics
- Hugging Face integration for dataset upload and model download

Run the cell below to set up your environment.

In [None]:
# Install necessary dependencies
!pip install -q transformers datasets evaluate peft bitsandbytes accelerate
!pip install -q huggingface_hub
!pip install -q trl
!pip install -q nltk rouge-score sacrebleu

# Optional: For specific use cases
# !pip install -q sentencepiece tokenizers
# !pip install -q gradio # For demo creation

# Login to Hugging Face (you'll need a token)
from huggingface_hub import login
# Uncomment the line below and add your token when ready to upload datasets
# login()

# Verify installations
import transformers
import datasets
import peft

print(f"Transformers version: {transformers.__version__}")
print(f"Datasets version: {datasets.__version__}")
print(f"PEFT version: {peft.__version__}")

# Check available GPU
!nvidia-smi
# ideally a T4 or A100 GPU

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.0/84.0 kB[0m [31m3.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m76.1/76.1 MB[0m [31m13.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m363.4/363.4 MB[0m [31m3.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.8/13.8 MB[0m [31m23.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.6/24.6 MB[0m [31m26.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m883.7/883.7 kB[0m [31m27.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m664.8/664.8 MB[0m [31m2.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m211.5/211.5 MB[0m [31m5.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

## 3. Synthetic Data Generation

In this section, we'll generate a synthetic dataset for our selected use case. The process involves:

1. Defining the data structure and schema
2. Setting up data generation techniques (LLM prompting, rules-based generation, etc.)
3. Creating the dataset
4. Validating data quality
5. Uploading to Hugging Face Datasets
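
As a rough illustration of steps 1 to 4, here is a minimal rules-based sketch. The question/answer schema, the arithmetic task, and all function names are assumptions chosen for the example; your own use case will need its own schema and generator.

```python
import random

# Hypothetical schema for this sketch: each record has "question" and "answer" strings.
REQUIRED_FIELDS = ("question", "answer")


def make_arithmetic_example(rng: random.Random) -> dict:
    """Generate one arithmetic question whose answer is computed programmatically."""
    a, b = rng.randint(2, 99), rng.randint(2, 99)
    op = rng.choice(["+", "-", "*"])
    result = {"+": a + b, "-": a - b, "*": a * b}[op]
    return {"question": f"What is {a} {op} {b}?", "answer": str(result)}


def is_valid(record: dict) -> bool:
    """Schema check: every required field is present and is a non-empty string."""
    return all(
        isinstance(record.get(field), str) and bool(record[field].strip())
        for field in REQUIRED_FIELDS
    )


def build_dataset(n: int, seed: int = 0) -> list[dict]:
    """Create up to n unique, valid records; duplicate questions are dropped."""
    rng = random.Random(seed)
    seen, records = set(), []
    for _ in range(n * 10):  # cap the attempts so the loop always terminates
        if len(records) == n:
            break
        record = make_arithmetic_example(rng)
        if is_valid(record) and record["question"] not in seen:
            seen.add(record["question"])
            records.append(record)
    return records


records = build_dataset(100, seed=42)
print(len(records), records[0])
```

Because the answers are computed rather than generated, every label is correct by construction, which is one advantage of rules-based generation over free-form LLM output.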

Good examples of synthetic data generation include:
- Using LLMs to generate text samples based on specific prompts
- Using existing datasets to create variations (e.g., translations, paraphrasing)
- Using rules-based systems to generate structured data (e.g., JSON, CSV)
- Using data augmentation techniques to create variations of existing data
- Using generative models to create new data points based on existing distributions
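
The first approach, prompting an LLM, might look roughly like the sketch below. The prompt template, the JSON reply format, and the `fake_llm` stand-in are all assumptions made so the example runs offline; in practice you would pass in a callable that wraps a local model or a hosted API.

```python
import json
from typing import Callable

# Hypothetical prompt; the doubled braces are literal braces after .format().
PROMPT_TEMPLATE = (
    "Write one question-answer pair about {topic}. "
    'Reply with a single JSON object: {{"question": "...", "answer": "..."}}'
)


def generate_pairs(
    llm: Callable[[str], str], topics: list[str], per_topic: int = 2
) -> list[dict]:
    """Prompt `llm` for each topic and keep only well-formed JSON replies."""
    pairs = []
    for topic in topics:
        prompt = PROMPT_TEMPLATE.format(topic=topic)
        for _ in range(per_topic):
            reply = llm(prompt)
            try:
                item = json.loads(reply)
            except json.JSONDecodeError:
                continue  # models sometimes return malformed JSON; skip those replies
            if isinstance(item, dict) and item.get("question") and item.get("answer"):
                pairs.append(
                    {"topic": topic, "question": item["question"], "answer": item["answer"]}
                )
    return pairs


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs without network access."""
    return json.dumps(
        {"question": f"Placeholder question for: {prompt[:40]}", "answer": "Placeholder answer"}
    )


print(generate_pairs(fake_llm, ["contract law", "organic chemistry"], per_topic=1))
```

Keeping the model behind a plain `str -> str` callable makes it easy to swap providers and to test the parsing and filtering logic without spending API quota.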

P.S. You can always start with an existing dataset and augment it with synthetic data.

Some libraries you can use for data generation:
- https://github.com/meta-llama/synthetic-data-kit
- https://github.com/argilla-io/distilabel
- https://github.com/argilla-io/synthetic-data-generator

For LLMs, you can use a local model, free APIs such as [Groq](https://groq.com/), or anything else you can find.

Choose one of the example approaches below or create your own. Remember to document your methodology.

### The generated dataset must be uploaded to Hugging Face Datasets
Reference: https://huggingface.co/docs/datasets/upload_dataset
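
One possible way to split and upload a list of records is sketched below. The helper names, the toy records, and the repository id are placeholders, and the upload call assumes you have already authenticated with `login()`.

```python
import random


def split_records(records: list[dict], test_fraction: float = 0.1, seed: int = 0) -> dict:
    """Shuffle a copy of the records and split it into train/test lists."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return {"train": shuffled[n_test:], "test": shuffled[:n_test]}


def push_splits_to_hub(splits: dict, repo_id: str, private: bool = False) -> None:
    """Upload the splits to the Hugging Face Hub (requires a prior `login()`)."""
    # Imported lazily so the splitting logic can be used without `datasets` installed.
    from datasets import Dataset, DatasetDict

    dataset_dict = DatasetDict(
        {name: Dataset.from_list(rows) for name, rows in splits.items()}
    )
    dataset_dict.push_to_hub(repo_id, private=private)


# Toy records standing in for your generated dataset.
records = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(20)]
splits = split_records(records, test_fraction=0.2, seed=42)
print({name: len(rows) for name, rows in splits.items()})

# Placeholder repo id; uncomment once you are logged in.
# push_splits_to_hub(splits, "your-username/your-dataset-name")
```

Fixing the shuffle seed keeps the train/test split reproducible, so the evaluation later in the notebook always runs on the same held-out examples.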

In [None]:
from datasets import Dataset, DatasetDict
import pandas as pd
from tqdm.auto import tqdm

# Create a synthetic dataset


# Optional: Upload to Hugging Face Datasets
def upload_dataset_to_hf():
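    # Uncomment when ready to upload. This assumes earlier cells define
    # `train_test` (with "train" and "test" splits) and `PROJECT_CONFIG["dataset_name"]`.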
    # dataset_dict = DatasetDict({
    #     "train": train_test["train"],
    #     "test": train_test["test"]
    # })
    # dataset_dict.push_to_hub(
    #     f"your-username/{PROJECT_CONFIG['dataset_name']}",
    #     private=False
    # )
    # print(f"Dataset uploaded to Hugging Face: your-username/{PROJECT_CONFIG['dataset_name']}")
    print(
        "Dataset upload code is ready but commented out. Uncomment when ready to upload."
    )


upload_dataset_to_hf()

## 4. Model Fine-tuning

Now that we have our synthetic dataset, let's fine-tune a small LLM using PEFT/LoRA techniques. This approach allows us to efficiently adapt the pre-trained model to our specific task without excessive computational requirements.

We'll:
1. Load the pre-trained model
2. Prepare the dataset in the correct format
3. Configure LoRA adapters
4. Fine-tune the model
5. Save the fine-tuned model
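
The sketch below covers steps 1 to 3 under some assumptions: the prompt layout and field names are invented for illustration, the rank and dropout values are common starting points rather than recommendations, and the `target_modules` names depend on the model architecture. The model id in the usage comment is only an example; substitute the one you chose.

```python
def format_example(example: dict) -> dict:
    """Turn a question/answer record into a single training string.

    The field names and the prompt layout are assumptions for this sketch.
    """
    text = f"### Question:\n{example['question']}\n\n### Answer:\n{example['answer']}"
    return {"text": text}


def load_lora_model(model_id: str, r: int = 16, lora_alpha: int = 32):
    """Load a 4-bit quantized base model and wrap it with LoRA adapters."""
    # Heavy imports are kept inside the function so the formatting helper
    # above can be used (and tested) on a machine without a GPU.
    import torch
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,  # T4 GPUs lack native bfloat16 support
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
    model = prepare_model_for_kbit_training(model)

    lora_config = LoraConfig(
        r=r,
        lora_alpha=lora_alpha,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # attention projections; names vary by model
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
    return model, tokenizer


print(format_example({"question": "What is 2 + 2?", "answer": "4"})["text"])

# Run on a GPU runtime:
# model, tokenizer = load_lora_model("Qwen/Qwen3-0.6B")
# ... train here ...
# model.save_pretrained("lora-adapter")  # saves only the adapter weights
```

The training loop itself is left out because trainer arguments differ between library versions; `transformers.Trainer` or TRL's `SFTTrainer` are the usual choices, and their current documentation is the safest guide.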

This section uses Parameter-Efficient Fine-Tuning (PEFT) with Low-Rank Adaptation (LoRA) to update only a small number of parameters, making it suitable for running on Colab's T4 GPU.
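
To make "a small number of parameters" concrete: LoRA freezes a weight matrix of shape `d_out x d_in` and trains two low-rank factors of shapes `d_out x r` and `r x d_in` instead. The short calculation below uses illustrative dimensions that are assumed for the example, not taken from any particular model.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


# Illustrative numbers (assumed): a 2048 x 2048 projection adapted with rank r = 8.
d_in, d_out, r = 2048, 2048, 8
full = d_in * d_out
added = lora_param_count(d_in, d_out, r)

print(f"full matrix:  {full:,} params")
print(f"LoRA adapter: {added:,} params ({100 * added / full:.2f}% of the full matrix)")
```

With these numbers the adapter trains 32,768 parameters against 4,194,304 in the frozen matrix, under 1% of the original, which is why the optimizer state and gradients fit comfortably on a T4.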

## 5. Model Evaluation

Now that we have fine-tuned our model, let's evaluate its performance by comparing it with the base model. We'll assess how well our synthetic data helped improve the model's abilities on our target task.

We'll:
1. Load both the base and fine-tuned models
2. Define appropriate evaluation metrics
3. Perform inference on test examples
4. Compare and analyze the results
5. Visualize performance differences
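
A minimal sketch of steps 2 and 4 follows, using hand-written exact-match and token-level F1 metrics so that it runs without extra dependencies. The sample outputs are made up purely to show the comparison; for translation or summarization tasks, the `sacrebleu` and `rouge-score` packages installed earlier are likely a better fit.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace; a deliberately simple tokenizer."""
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate_outputs(predictions: list[str], references: list[str]) -> dict:
    """Average each metric over a non-empty test set."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equal in length")
    n = len(references)
    pairs = list(zip(predictions, references))
    return {
        "exact_match": sum(exact_match(p, r) for p, r in pairs) / n,
        "token_f1": sum(token_f1(p, r) for p, r in pairs) / n,
    }


# Made-up outputs for illustration only; replace with real model generations.
references = ["Paris", "4", "the mitochondria"]
base_outputs = ["I think it is Paris", "5", "mitochondria"]
tuned_outputs = ["Paris", "4", "the mitochondria"]

for name, outputs in [("base", base_outputs), ("fine-tuned", tuned_outputs)]:
    print(name, evaluate_outputs(outputs, references))
```

Running both models on the same held-out test split, with the same prompts and decoding settings, keeps the before/after comparison fair.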

## 6. Final Thoughts and Project Analysis

In this section, reflect on your approach, findings, and potential improvements.

### Project Summary
Provide a brief overview of what you did:
- What use case did you choose and why?
- How did you generate the synthetic dataset?
- Which model did you fine-tune and what techniques did you use?
- What were the main evaluation metrics and results?

### Analysis of Results
- Did fine-tuning improve performance? If so, how much?
- Were there specific types of examples where improvement was more noticeable?
- What limitations did you observe in your approach?

### Improvement Ideas
- How could you enhance the quality of the synthetic dataset?
- What other fine-tuning approaches might work better?
- If you had more computational resources, what would you do differently?

### Learning Outcomes
- What insights did you gain about synthetic data generation?
- What did you learn about fine-tuning LLMs?
- What surprised you during this project?

Remember to support your analysis with specific examples and metrics from your evaluation.

## 7. References

List any resources, papers, tutorials, or tools that you found helpful for this assignment:

1. [Hugging Face PEFT Documentation](https://huggingface.co/docs/peft/index)
2. [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)
3. [Parameter-Efficient Fine-Tuning Methods](https://huggingface.co/blog/peft)
4. [Synthetic Data Generation Techniques](https://arxiv.org/abs/2111.02739)
5. [Evaluating Large Language Models](https://arxiv.org/abs/2307.03109)

*This notebook was created as part of the CognitiveLab Research Internship assignment.*