# CodeClarity: AI-Powered Code Documentation and Generation

This notebook demonstrates how to fine-tune a pre-trained language model (e.g., `Salesforce/codegen-350M-mono`) on a dataset of natural language descriptions and corresponding code snippets. The fine-tuned model can then generate a code snippet from a natural language description, which is intended as a building block for tools that help developers work with unfamiliar or legacy codebases.

## Key Features
 - **Fine-tune a pre-trained model** on a dataset of natural language and code pairs.
 - **Generate code snippets** from natural language descriptions.
 - **GPU acceleration** for faster training and inference.

## Tools and Libraries
 - **Hugging Face Transformers**: For loading and fine-tuning pre-trained models.
 - **Datasets Library**: For loading and preprocessing datasets.
 - **PyTorch**: For training and inference.
 - **Google Colab**: For running the notebook on an A100 GPU.

## Dataset
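We use the `code_x_glue_tc_text_to_code` dataset (the CodeXGLUE text-to-code task, built on CONCODE), which pairs natural language descriptions with the Java methods that implement them. It provides 100,000 training, 2,000 validation, and 2,000 test examples.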

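## Model
We fine-tune `Salesforce/codegen-350M-mono`, a pre-trained causal language model for code generation. The `mono` checkpoints are specialized on Python, whereas the dataset targets Java, so fine-tuning also has to adapt the model to a different language.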

## Steps
 1. Install required libraries.
 2. Load and preprocess the dataset.
 3. Fine-tune the model.
 4. Test the fine-tuned model by generating code snippets.
 5. Save the fine-tuned model and tokenizer.

## Step 1: Install Required Libraries
Install the necessary libraries for the project.
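Colab images typically come with some of these packages preinstalled, so it can be useful to check what is already present before (or after) running `pip`. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is illustrative and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Report the three libraries this notebook relies on
versions = installed_versions(["transformers", "datasets", "torch"])
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```

Recording these versions alongside the results makes a run easier to reproduce later.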



In [None]:
!pip install transformers datasets torch

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl

## Step 2: Load and Preprocess the Dataset
Load the `code_x_glue_tc_text_to_code` dataset and preprocess it for training.


In [None]:
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("code_x_glue_tc_text_to_code")

# Inspect the first example
print(dataset["train"][0])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/5.56k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/33.1M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/634k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/526k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/100000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/2000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/2000 [00:00<?, ? examples/s]

{'id': 0, 'nl': 'check if details are parsed . concode_field_sep Container parent concode_elem_sep boolean isParsed concode_elem_sep long offset concode_elem_sep long contentStartPosition concode_elem_sep ByteBuffer deadBytes concode_elem_sep boolean isRead concode_elem_sep long memMapSize concode_elem_sep Logger LOG concode_elem_sep byte[] userType concode_elem_sep String type concode_elem_sep ByteBuffer content concode_elem_sep FileChannel fileChannel concode_field_sep Container getParent concode_elem_sep byte[] getUserType concode_elem_sep void readContent concode_elem_sep long getOffset concode_elem_sep long getContentSize concode_elem_sep void getContent concode_elem_sep void setDeadBytes concode_elem_sep void parse concode_elem_sep void getHeader concode_elem_sep long getSize concode_elem_sep void parseDetails concode_elem_sep String getType concode_elem_sep void _parseDetails concode_elem_sep String getPath concode_elem_sep boolean verify concode_elem_sep void setParent concode_
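As the printed example suggests, the `nl` field is more than a plain sentence: the description is followed by class context delimited with the marker tokens `concode_field_sep` and `concode_elem_sep`. The sketch below splits such a string into its parts. It assumes the layout visible in the output above (description first, then groups of `type name` entries); the helper name `split_concode_nl` and the shortened sample string are illustrative.

```python
FIELD_SEP = "concode_field_sep"  # separates the description and each context group
ELEM_SEP = "concode_elem_sep"    # separates entries inside one context group


def split_concode_nl(nl):
    """Split an `nl` string into the description and its context groups."""
    parts = [part.strip() for part in nl.split(FIELD_SEP)]
    description = parts[0]
    groups = [
        [entry.strip() for entry in part.split(ELEM_SEP) if entry.strip()]
        for part in parts[1:]
    ]
    return {"description": description, "context": groups}


# Shortened version of the first training example, for illustration only
sample = (
    "check if details are parsed . concode_field_sep Container parent "
    "concode_elem_sep boolean isParsed concode_field_sep Container getParent "
    "concode_elem_sep void parse"
)
parsed = split_concode_nl(sample)
print(parsed["description"])  # check if details are parsed .
print(parsed["context"])
```

Inspecting the description separately is handy when writing test prompts, since a bare sentence differs from the context-rich strings the model sees during training.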

### Preprocess the Data
Combine the natural language description (`nl`) and code snippet (`code`) into a single sequence for causal language modeling.
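For causal language modeling, the labels are normally a copy of the input IDs. The tokenization cell in this notebook copies them unchanged, which means padded positions also contribute to the loss. A common alternative is to set the labels of padded positions to `-100`, the value that PyTorch's cross-entropy loss ignores by default. The following is a minimal sketch of that idea on toy token IDs; `pad_and_label` and the pad ID `0` are made up for illustration and are not taken from the notebook.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def pad_and_label(token_ids, max_length, pad_id):
    """Truncate or pad one sequence and build labels that skip padded positions."""
    ids = list(token_ids[:max_length])
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,
        "labels": ids + [IGNORE_INDEX] * n_pad,
    }


# Toy IDs standing in for a tokenized "description + code" sequence
example = pad_and_label([11, 12, 13], max_length=6, pad_id=0)
print(example["input_ids"])  # [11, 12, 13, 0, 0, 0]
print(example["labels"])     # [11, 12, 13, -100, -100, -100]
```

With a fixed `max_length` of 512, short examples consist mostly of padding, so whether padding is masked can noticeably change the reported loss values.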

In [None]:
from transformers import AutoTokenizer

# Load the tokenizer
model_name = "Salesforce/codegen-350M-mono"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Add a padding token if not already present
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Tokenize the dataset
def tokenize_function(examples):
    # Combine natural language description and code into a single sequence
    combined_texts = [nl + " " + code for nl, code in zip(examples["nl"], examples["code"])]
    tokenized = tokenizer(
        combined_texts,
        padding="max_length",  # Pad every example to max_length
        truncation=True,       # Truncate anything longer than max_length
        max_length=512,        # Maximum sequence length (adjust as needed)
    )
    # Labels for causal language modeling mirror the input IDs.
    # Note: padded positions are therefore included in the loss.
    tokenized["labels"] = [list(ids) for ids in tokenized["input_ids"]]
    return tokenized

# Apply the tokenization function to the dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Inspect the tokenized dataset
print(tokenized_datasets["train"][0])

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

tokenizer_config.json:   0%|          | 0.00/240 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/1.00k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/90.0 [00:00<?, ?B/s]

Map:   0%|          | 0/100000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

{'id': 0, 'nl': 'check if details are parsed . concode_field_sep Container parent concode_elem_sep boolean isParsed concode_elem_sep long offset concode_elem_sep long contentStartPosition concode_elem_sep ByteBuffer deadBytes concode_elem_sep boolean isRead concode_elem_sep long memMapSize concode_elem_sep Logger LOG concode_elem_sep byte[] userType concode_elem_sep String type concode_elem_sep ByteBuffer content concode_elem_sep FileChannel fileChannel concode_field_sep Container getParent concode_elem_sep byte[] getUserType concode_elem_sep void readContent concode_elem_sep long getOffset concode_elem_sep long getContentSize concode_elem_sep void getContent concode_elem_sep void setDeadBytes concode_elem_sep void parse concode_elem_sep void getHeader concode_elem_sep long getSize concode_elem_sep void parseDetails concode_elem_sep String getType concode_elem_sep void _parseDetails concode_elem_sep String getPath concode_elem_sep boolean verify concode_elem_sep void setParent concode_

## Step 3: Fine-Tune the Model
Fine-tune the pre-trained model on the tokenized dataset.
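Before launching training, it helps to estimate how many optimizer steps the settings in this step imply. The sketch below does that arithmetic for the 100,000 training examples, the per-device batch size of 4, and the 3 epochs used in this notebook. It assumes a single GPU and no gradient accumulation (both are the defaults here); the function name `optimizer_steps` is illustrative.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, num_epochs,
                    num_devices=1, grad_accum_steps=1):
    """Estimate optimizer steps per epoch and in total."""
    effective_batch = per_device_batch_size * num_devices * grad_accum_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch, steps_per_epoch * num_epochs


per_epoch, total = optimizer_steps(100_000, per_device_batch_size=4, num_epochs=3)
print(per_epoch, total)  # 25000 75000
```

With `logging_steps=10`, a run of this length produces several thousand log entries, so a larger logging interval may be more practical.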


In [None]:
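# Optional: install a PyTorch build for a specific CUDA version (here CUDA 11.8).
# Colab already ships a CUDA-enabled PyTorch, so this step can usually be skipped.
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install transformers and datasets
!pip install transformers datasets

# Debugging aid: make CUDA kernel launches synchronous so errors point at the failing call.
# This slows training, so remove it for normal runs.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"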

# Verify CUDA and PyTorch
import torch
print(torch.__version__)  # PyTorch version
print(torch.version.cuda)  # CUDA version
print(torch.cuda.is_available())  # Check if CUDA is available

# Load dataset
from datasets import load_dataset
dataset = load_dataset("code_x_glue_tc_text_to_code")

# Tokenize dataset
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
tokenizer.pad_token = tokenizer.eos_token  # Reuse EOS as the padding token

def tokenize_fn(examples):
    # Join each description and its code into one training sequence
    combined = [f"{nl} {code}" for nl, code in zip(examples["nl"], examples["code"])]
    tokenized = tokenizer(
        combined,
        padding="max_length",
        truncation=True,
        max_length=512,
    )
    # Labels mirror the input IDs (padded positions are included in the loss)
    tokenized["labels"] = [list(ids) for ids in tokenized["input_ids"]]
    return tokenized

tokenized_datasets = dataset.map(tokenize_fn, batched=True)

# Fine-tune model
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Load the pre-trained model
model_name = "Salesforce/codegen-350M-mono"
model = AutoModelForCausalLM.from_pretrained(model_name)

# Check CUDA availability and device properties before moving the model
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Using CUDA device: {torch.cuda.get_device_name(device)}")
    # Check compute capability to identify potential unsupported operations
    print(f"Compute Capability: {torch.cuda.get_device_capability(device)}")
else:
    device = torch.device("cpu")
    print("Using CPU")

# Move the model to the device
model.to(device)

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=True,
    logging_dir="./logs",
    logging_steps=10,
    save_strategy="epoch",
    save_total_limit=2,
    prediction_loss_only=True,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)

# Fine-tune the model
trainer.train()

Looking in indexes: https://download.pytorch.org/whl/cu118
2.5.1+cu121
12.1
True


Map:   0%|          | 0/100000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Some weights of the model checkpoint at Salesforce/codegen-350M-mono were not used when initializing CodeGenForCausalLM: ['transformer.h.0.attn.causal_mask', 'transformer.h.1.attn.causal_mask', 'transformer.h.10.attn.causal_mask', 'transformer.h.11.attn.causal_mask', 'transformer.h.12.attn.causal_mask', 'transformer.h.13.attn.causal_mask', 'transformer.h.14.attn.causal_mask', 'transformer.h.15.attn.causal_mask', 'transformer.h.16.attn.causal_mask', 'transformer.h.17.attn.causal_mask', 'transformer.h.18.attn.causal_mask', 'transformer.h.19.attn.causal_mask', 'transformer.h.2.attn.causal_mask', 'transformer.h.3.attn.causal_mask', 'transformer.h.4.attn.causal_mask', 'transformer.h.5.attn.causal_mask', 'transformer.h.6.attn.causal_mask', 'transformer.h.7.attn.causal_mask', 'transformer.h.8.attn.causal_mask', 'transformer.h.9.attn.causal_mask']
- This IS expected if you are initializing CodeGenForCausalLM from the checkpoint of a model trained on another task or with another architecture (e

Using CUDA device: NVIDIA A100-SXM4-40GB
Compute Capability: (8, 0)




| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 0.3129 | 0.483375 |
| 2 | 0.27 | 0.484361 |
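One way to read these numbers is to convert the validation loss into perplexity (the exponential of the mean cross-entropy) and to track the gap between training and validation loss. The sketch below does this for the two epochs reported in the training output above. Note that the loss in this notebook is averaged over all positions, including padding, so the resulting perplexity is only a rough indicator and is not comparable to per-token figures reported elsewhere.

```python
import math

# (epoch, training loss, validation loss) as reported in the training output
history = [(1, 0.3129, 0.483375), (2, 0.27, 0.484361)]

for epoch, train_loss, val_loss in history:
    perplexity = math.exp(val_loss)
    gap = val_loss - train_loss
    print(f"epoch {epoch}: val perplexity = {perplexity:.3f}, val - train gap = {gap:.4f}")

# Epoch with the lowest validation loss
best_epoch = min(history, key=lambda row: row[2])[0]
print("best epoch by validation loss:", best_epoch)
```

Here the training loss keeps falling while the validation loss rises slightly from epoch 1 to epoch 2, which may be an early sign of overfitting; selecting the checkpoint by validation loss is one possible response.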


## Step 4: Test the Fine-Tuned Model
Generate code snippets from natural language descriptions using the fine-tuned model.


In [None]:
# Generate code from a natural language description
input_text = "check if details are parsed"

# Tokenize the prompt and move it to the same device as the model
inputs = tokenizer(input_text, return_tensors="pt").to(device)

# Generate up to 50 new tokens. Passing the attention mask and an explicit
# pad token ID avoids ambiguity, because the pad token is the EOS token here.
output = model.generate(
    **inputs,
    max_new_tokens=50,
    pad_token_id=tokenizer.eos_token_id,
)

# Decode the generated sequence (the prompt followed by the continuation)
generated_code = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_code)
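A decoder-only model returns the prompt tokens followed by the newly generated ones, so the decoded text starts with the description itself. If only the generated code is wanted, the prompt can be sliced off before decoding. The sketch below shows the idea on plain lists of toy token IDs; the helper name `continuation_only` is illustrative.

```python
def continuation_only(output_ids, prompt_ids):
    """Drop the echoed prompt from a generated sequence of token IDs."""
    prompt_len = len(prompt_ids)
    if list(output_ids[:prompt_len]) != list(prompt_ids):
        raise ValueError("output does not start with the prompt")
    return list(output_ids[prompt_len:])


# Toy IDs: the first three stand in for the prompt, the rest for generated code
prompt_ids = [101, 102, 103]
output_ids = [101, 102, 103, 7, 8, 9]
print(continuation_only(output_ids, prompt_ids))  # [7, 8, 9]
```

With real tensors the same effect is usually achieved by slicing the output row from the prompt length onward and passing the remainder to `tokenizer.decode`.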

## Step 5: Save and Share the Model
Save the fine-tuned model and tokenizer for future use.


In [None]:
# Save the model and tokenizer
model.save_pretrained("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")
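To reuse the saved artifacts in a later session, the directory can be passed back to `from_pretrained`. The following is a minimal sketch, assuming the `./fine_tuned_model` directory written by the cell above exists; the wrapper function `load_fine_tuned` is illustrative and not part of the original notebook.

```python
def load_fine_tuned(model_dir="./fine_tuned_model"):
    """Reload the model and tokenizer written by save_pretrained()."""
    # Imported lazily so that defining this helper does not require transformers
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir)
    return model, tokenizer
```

Calling `model, tokenizer = load_fine_tuned()` in a fresh session should give objects that can be used with the generation code from Step 4 after the model is moved to the desired device.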

## Conclusion
This notebook demonstrates how to fine-tune a pre-trained language model to generate code from natural language descriptions. You can extend this project by:
 - Using a larger or domain-specific dataset.
 - Adding features like interactive Q&A or automated documentation generation.
 - Deploying the model to a cloud service for real-time use.


## References
 - [Hugging Face Transformers](https://huggingface.co/transformers/)
 - [Datasets Library](https://huggingface.co/docs/datasets/)
 - [PyTorch](https://pytorch.org/)
 - [Google Colab](https://colab.research.google.com/)