# DataCamp Skill Track: Developing Large Language Models (LLMs)

**Description:** This skill track teaches how to develop large language models (LLMs) with PyTorch and Hugging Face, using the latest deep learning and NLP techniques. The track comprises 6 courses with a total duration of 16 hours.

**Reference:** [Skill track: Developing Large Language Models](https://app.datacamp.com/learn/skill-tracks/developing-large-language-models)

**Author:** Tirthankar Dutta

## Table of Contents

* [Essential Settings](#essential-settings)
  
* [Course-1: Introduction to LLMs in Python](#course-1-introduction-to-llms-in-python)

    * [Course-1.1: Getting Started with Large Language Models (LLMs)](#course-11-getting-started-with-large-language-models-llms)
 
    * [Course-1.2: Fine-tuning LLMs](#course-12-fine-tuning-llms)
 
    * [Course-1.3: Evaluating LLM Performance](#course-13-evaluating-llm-performance)  

* [Course-2: Working with Llama 3](#course-2-working-with-llama-3)

* [Course-3: Deep Learning for Text with PyTorch](#course-3-deep-learning-for-text-with-pytorch)

* [Course-4: Transformer Models with PyTorch](#course-4-transformer-models-with-pytorch)

* [Course-5: Reinforcement Learning from Human Feedback (RLHF)](#course-5-reinforcement-learning-from-human-feedback-rlhf)

* [Course-6: LLMOps Concepts](#course-6-llmops-concepts)

* [Project-1: Analyzing Car Reviews with LLMs](#project-1-analyzing-car-reviews-with-llms)

* [Project-2: Classifying Emails using Llama](#project-2-classifying-emails-using-llama)

* [Project-3: Service Desk Ticket Classification with Deep Learning](#project-3-service-desk-ticket-classification-with-deep-learning)

## Working with Hugging Face Models on Corporate Devices

The objective of this section is to explore approaches to working with Hugging Face models on corporate devices. A corporate device (laptop/desktop) is subject to a variety of restrictions imposed by organizational policies, such as restricted data sharing and transfer or selective internet access.

In one of the organizations where I have worked, company policy restricted employees from accessing and using the OpenAI API. Whenever an employee tried to access the OpenAI API using the Python SDK, an `SSLError` would be raised. These SSL errors, however, could be bypassed using suitable _hacks_ available on the Stack Overflow forum.
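Before resorting to a bypass, it is often worth checking whether the organization already provides a CA bundle that makes certificate verification succeed. The sketch below shows one way to pick the `verify` setting for an HTTP client such as `requests`. The function name `resolve_verify_setting` is hypothetical, and it assumes that the bundle location is exposed through the `REQUESTS_CA_BUNDLE` or `SSL_CERT_FILE` environment variables, which may not hold on every corporate device.

```python
import os
from pathlib import Path
from typing import Mapping, Union


def resolve_verify_setting(
    env: Mapping[str, str] = os.environ,
) -> Union[str, bool]:
    """Return a value suitable for the `verify` option of an HTTP client.

    Prefers a CA bundle file named by `REQUESTS_CA_BUNDLE` or
    `SSL_CERT_FILE` (assumed conventions); otherwise returns `True`, which
    keeps the default certificate verification switched on.
    """
    for var in ("REQUESTS_CA_BUNDLE", "SSL_CERT_FILE"):
        candidate = env.get(var, "")
        # Only accept the variable if it points to an existing file.
        if candidate and Path(candidate).is_file():
            return candidate
    return True


print(resolve_verify_setting())
```

The returned value can be assigned to `session.verify` on a `requests.Session`, so that verification stays enabled whenever a usable bundle is available.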

However, if the organization decides to blacklist a certain website or domain, it is impossible to bypass such restrictions. In principle, an employee can raise a request to the IT team to whitelist the blocked sites, but a proper business justification is required. In most cases, such a justification demands that the whitelisting requirement relate to a client project and not to personal exploration or experimentation. I faced this scenario in another organization, where we could not use Hugging Face models for any kind of experimentation because the sites required for downloading the models locally to persistent storage or a cache were blacklisted.

This particular obstacle prompted me to write this section in detail.

We use Hugging Face models for two distinct purposes: 

- Inference with pretrained models
- Inference with fine-tuned models obtained from pretrained models.

For the second purpose, we need to download the model to 

In [None]:
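import torch

# Select the compute device. Note that `torch.cuda.get_device_name()` returns
# a human-readable GPU name, which is not a valid device identifier.
device: torch.device = torch.device(
    "cuda" if torch.cuda.is_available() else "cpu"
)
print(device)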

In [None]:
input_text: str = """
Ritwik Ghatak (1925-1976) was a renowned Indian filmmaker known for his poignant storytelling and exploration of social issues, particularly the impact of the Partition of India on Bengali society.
Early Life
Ritwik Kumar Ghatak was born on November 4, 1925, in Dacca, Bengal Presidency, British India (now Dhaka, Bangladesh). He was the son of Suresh Chandra Ghatak, a poet and playwright, and Indubala Devi. Ghatak had a large family, with eight siblings, including the poet Manish Ghatak. After the Partition of India in 1947, his family relocated to Kolkata, which significantly influenced his work and themes in cinema.
Career Highlights
Ghatak began his career in the film industry as an actor and assistant director in Nimai Ghosh's Chinnamul (1950). His first completed film as a director was Nagarik (1952), which is considered a landmark in Bengali cinema. He is best known for his partition trilogy, which includes:
Meghe Dhaka Tara (The Cloud-Capped Star, 1960)
Komal Gandhar (E Flat, 1961)
Subarnarekha (The Golden Thread, 1962)
These films are celebrated for their deep emotional resonance and social realism, often reflecting the struggles of ordinary people against the backdrop of societal upheaval.
Themes and Style
Ghatak's films are characterized by their theatricality, documentary realism, and a strong focus on the human condition. He often explored themes of displacement, identity, and the socio-political landscape of Bengal. His work is noted for its Brechtian influences, combining stylized performances with a critical perspective on society.
Awards and Recognition
Despite facing challenges during his lifetime, Ghatak received several accolades for his contributions to cinema, including:
Padma Shri in 1970 for his contributions to the arts.
National Film Award's Rajat Kamal Award for Best Story in 1974 for Jukti Takko Aar Gappo.
Best Director's Award from the Bangladesh Cine Journalist's Association for Titash Ekti Nadir Naam.
Personal Life and Legacy
Ghatak struggled with alcoholism in his later years, which affected his health and career. He passed away on February 6, 1976, in Kolkata. His legacy continues through his films, which are increasingly recognized and studied for their artistic and cultural significance. His son, Ritaban Ghatak, is also a filmmaker, and his family remains involved in preserving his memory and contributions to cinema.
Ritwik Ghatak is now regarded as one of the greatest filmmakers in Indian cinema, alongside contemporaries like Satyajit Ray and Mrinal Sen, and his works are celebrated for their profound impact on the film industry and society at large. 
"""

print(f"Text to summarize:\n{input_text:s}\n")

In [None]:
from langchain_core.output_parsers.string import StrOutputParser
from langchain_core.prompts.prompt import PromptTemplate
from langchain_core.prompt_values import PromptValue
from langchain_core.runnables.base import RunnableSerializable
from langchain_huggingface import HuggingFaceEndpoint

llm: HuggingFaceEndpoint = HuggingFaceEndpoint(
    name="summarizer_llm",
    model="Qwen/Qwen3-Coder-480B-A35B-Instruct",
    temperature=0.01,
    top_k=5,
    top_p=0.9,
    max_new_tokens=1024,
    provider="auto",
)

prompt: str = "Summarize the text given below:\n\n{text}\n"
prompt_template: PromptTemplate = PromptTemplate.from_template(template=prompt)
# prompt_value: PromptValue = prompt_template.invoke(input=input_text)
# print(prompt_value)

output_parser: StrOutputParser = StrOutputParser()
summary_chain: RunnableSerializable = prompt_template | llm | output_parser
summary: str = summary_chain.invoke(input={"text": input_text})
print(summary)

## Essential Settings

In this section we execute all the steps required for running the code cells contained in this notebook.

* If we use local CPU/GPU to run this notebook, we should keep `execute = False` and select the appropriate local Jupyter kernel from the top-right corner of the notebook.

* If we use Kaggle's kernel to run this notebook, we should make `execute = True` and execute the code enclosed within the `if` statement. But before doing so, we must change the kernel of this Jupyter notebook according to [these instructions](https://www.kaggle.com/code/antdes/how-to-add-kaggle-to-cursor).
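Instead of flipping the flag by hand, the choice could be derived from the environment. The following is a minimal sketch, assuming that Kaggle kernels export an environment variable named `KAGGLE_KERNEL_RUN_TYPE`; this is an observed convention rather than a documented guarantee, and the helper name `running_on_kaggle` is hypothetical.

```python
import os
from typing import Mapping


def running_on_kaggle(env: Mapping[str, str] = os.environ) -> bool:
    """Guess whether the current kernel is a Kaggle kernel.

    Assumption: Kaggle kernels set `KAGGLE_KERNEL_RUN_TYPE` to a non-empty
    value, while local kernels leave it unset.
    """
    return bool(env.get("KAGGLE_KERNEL_RUN_TYPE"))


# Mirror the manual flag used in this notebook.
execute: bool = running_on_kaggle()
print(f"execute = {execute}")
```

If the assumption does not hold in a given environment, setting the flag manually remains the safer option.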

In [None]:
import subprocess as sp
from colorama import init, Fore, Style
from pathlib import Path
from typing import Literal, List, Union

%xmode Minimal 

**Essential utility/helper functions**

In [None]:
def install_dependencies(
    deps_file: Union[str, Path],
    pkg_manager: Literal["uv", "pip"],
) -> None:
    """Install essential dependencies on the remote Kaggle kernel.

    Args:
        deps_file (Union[str, Path]): Absolute file path of the dependency file.
        pkg_manager (Literal["uv", "pip"]): The Python package manager used for
        installing dependencies.

    Returns:
        None: No return values

    Raises:
        NotImplementedError: Raise if Python package manager used for installing dependencies is neither `pip` nor `uv`.
        CalledProcessError: Raise if called process exit code is not `0`.
        Exception: Raise if some unknown/unforeseen error crashes the code.
    """
    init()
    if pkg_manager not in ["uv", "pip"]:
        raise NotImplementedError(
            f"{pkg_manager} not supported by current implementation!"
        )

    # Only the launcher differs between the two package managers.
    launcher: List[str] = (
        ["python3", "-m", "pip"] if pkg_manager == "pip" else ["uv", "pip"]
    )

    # Build the argument list directly so that a path containing spaces is
    # passed through as a single argument.
    args: List[str] = launcher + [
        "install",
        "--trusted-host",
        "pypi.org",
        "--trusted-host",
        "files.pythonhosted.org",
        "-r",
        str(deps_file),
        "--no-cache-dir",
    ]

    try:
        # An argument list must be run without a shell. `check=True` raises
        # `CalledProcessError` when the exit code is not `0`.
        result: sp.CompletedProcess = sp.run(
            args=args,
            check=True,
            capture_output=True,
            text=True,
        )

    except sp.CalledProcessError as err:
        print(
            f"{Style.BRIGHT} {Fore.RED} Called process failed! "
            f"Exit code: {err.returncode}\nStdErr: {err.stderr}"
            f"{Style.RESET_ALL}"
        )
        raise

    except Exception as err:
        print(
            f"{Style.BRIGHT} {Fore.RED} Called process failed due to "
            f"unforeseen error! Error: {err}{Style.RESET_ALL}\n"
        )
        raise

    print(
        f"{Style.BRIGHT} {Fore.GREEN} Called process executed "
        f"successfully!{Style.RESET_ALL}"
    )
    if result.stdout:
        print(
            f"{Style.BRIGHT} {Fore.GREEN} StdOut: {result.stdout}"
            f"{Style.RESET_ALL}"
        )

    return None

In [None]:
def check_gpu_support() -> None:
    """Check whether GPU support is enabled for the local/remote kernel
    using the command `nvidia-smi`.

    Returns:
        None: No return value.

    Raises:
        CalledProcessError: Raise if called process exit code is not `0`.
        Exception: Raise if some unknown/unforeseen error crashes the code.
    """
    init()
    cmd: List[str] = ["nvidia-smi"]

    try:
        # `check=True` raises `CalledProcessError` when the exit code is
        # not `0`, so no separate return-code check is needed.
        result: sp.CompletedProcess = sp.run(
            args=cmd,
            check=True,
            capture_output=True,
            text=True,
        )

    except sp.CalledProcessError as err:
        print(
            f"{Style.BRIGHT} {Fore.RED} Called process failed! "
            f"Exit code: {err.returncode}\nStdErr: {err.stderr}"
            f"{Style.RESET_ALL}"
        )
        raise

    except Exception as err:
        print(
            f"{Style.BRIGHT} {Fore.RED} Called process failed due to "
            f"unforeseen error! Error: {err}{Style.RESET_ALL}\n"
        )
        raise

    print(
        f"{Style.BRIGHT} {Fore.GREEN} Called process executed "
        f"successfully!{Style.RESET_ALL}\n"
    )
    if result.stdout:
        print(
            f"{Style.BRIGHT} {Fore.GREEN} StdOut: {result.stdout}"
            f"{Style.RESET_ALL}\n"
        )

    return None
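On a machine without NVIDIA drivers, the `nvidia-smi` executable is usually absent altogether, so it can help to look for it before running it. The sketch below is one possible non-raising variant built on `shutil.which`; the function name `query_gpu` and its `tool` parameter are hypothetical and only serve this illustration.

```python
import shutil
import subprocess as sp
from typing import Optional


def query_gpu(tool: str = "nvidia-smi") -> Optional[str]:
    """Return the tool's output, or `None` if it is missing or fails."""
    # `shutil.which` returns `None` when the executable is not on the PATH.
    executable = shutil.which(tool)
    if executable is None:
        return None
    result = sp.run([executable], capture_output=True, text=True)
    return result.stdout if result.returncode == 0 else None


report = query_gpu()
print(report if report is not None else "No NVIDIA GPU tooling detected.")
```

Returning `None` instead of raising keeps the check usable on CPU-only machines, at the cost of hiding the reason for a failure.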

In [None]:
execute: bool = False
if execute:
    ROOT_DIR: Path = Path.cwd()
    DEPS_FILE: Path = ROOT_DIR / Path("./pyproject.toml")
    print(f"Dependency file location:\n{DEPS_FILE}\n")
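    # NOTE: pip's `-r` option expects a requirements-style file, whereas
    # `uv pip install -r` also accepts a `pyproject.toml`.
    install_dependencies(deps_file=DEPS_FILE, pkg_manager="pip")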

    # This particular line is essential to know whether this VS Code
    # notebook is connected to the Kaggle kernel, should we choose to use
    # one, as well as if the Kaggle kernel is using any GPU cores for
    # acceleration.
    print()
    check_gpu_support()

## Course-1: Introduction to LLMs in Python

**Course URL:** [Introduction to LLMs in Python](https://app.datacamp.com/learn/courses/introduction-to-llms-in-python)

This course consists of the following three sub-courses:

* [Course-1.1: Getting Started with Large Language Models (LLMs)](#course-11-getting-started-with-large-language-models-llms)

* [Course-1.2: Fine-tuning LLMs](#course-12-fine-tuning-llms)

* [Course-1.3: Evaluating LLM performance](#course-13-evaluating-llm-performance)

### Course-1.1: Getting Started with Large Language Models (LLMs)

#### Working with LLMs Using the Hugging Face Transformers Library

In this course we will work with pre-trained LLMs through the Hugging Face Transformers library. The code given below illustrates the process.

In [None]:
input_text: str = """
Ritwik Ghatak (1925-1976) was a renowned Indian filmmaker known for his poignant storytelling and exploration of social issues, particularly the impact of the Partition of India on Bengali society.
Early Life
Ritwik Kumar Ghatak was born on November 4, 1925, in Dacca, Bengal Presidency, British India (now Dhaka, Bangladesh). He was the son of Suresh Chandra Ghatak, a poet and playwright, and Indubala Devi. Ghatak had a large family, with eight siblings, including the poet Manish Ghatak. After the Partition of India in 1947, his family relocated to Kolkata, which significantly influenced his work and themes in cinema.
Career Highlights
Ghatak began his career in the film industry as an actor and assistant director in Nimai Ghosh's Chinnamul (1950). His first completed film as a director was Nagarik (1952), which is considered a landmark in Bengali cinema. He is best known for his partition trilogy, which includes:
Meghe Dhaka Tara (The Cloud-Capped Star, 1960)
Komal Gandhar (E Flat, 1961)
Subarnarekha (The Golden Thread, 1962)
These films are celebrated for their deep emotional resonance and social realism, often reflecting the struggles of ordinary people against the backdrop of societal upheaval.
Themes and Style
Ghatak's films are characterized by their theatricality, documentary realism, and a strong focus on the human condition. He often explored themes of displacement, identity, and the socio-political landscape of Bengal. His work is noted for its Brechtian influences, combining stylized performances with a critical perspective on society.
Awards and Recognition
Despite facing challenges during his lifetime, Ghatak received several accolades for his contributions to cinema, including:
Padma Shri in 1970 for his contributions to the arts.
National Film Award's Rajat Kamal Award for Best Story in 1974 for Jukti Takko Aar Gappo.
Best Director's Award from the Bangladesh Cine Journalist's Association for Titash Ekti Nadir Naam.
Personal Life and Legacy
Ghatak struggled with alcoholism in his later years, which affected his health and career. He passed away on February 6, 1976, in Kolkata. His legacy continues through his films, which are increasingly recognized and studied for their artistic and cultural significance. His son, Ritaban Ghatak, is also a filmmaker, and his family remains involved in preserving his memory and contributions to cinema.
Ritwik Ghatak is now regarded as one of the greatest filmmakers in Indian cinema, alongside contemporaries like Satyajit Ray and Mrinal Sen, and his works are celebrated for their profound impact on the film industry and society at large. 
"""

print(f"Text to summarize:\n{input_text:s}\n")

In [None]:
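# `!export` runs in a throwaway subshell, so `%env` is used to set the
# variable for the kernel itself.
%env HF_HUB_ENABLE_HF_TRANSFER=0
# `/resolve/` serves the raw file (`/blob/` serves the HTML page) and `-P`
# sets the download directory.
!wget --no-check-certificate -P hf_models/ https://huggingface.co/google-bert/bert-base-uncased/resolve/main/pytorch_model.bin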

In [None]:
import os

# `huggingface_hub` reads these settings when it is first imported, so they
# are set before the import (a fresh kernel is needed if the library has
# already been imported).
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0"
os.environ["HF_HUB_DISABLE_XET"] = "1"

from pathlib import Path
from typing import Final
from huggingface_hub import hf_hub_download, logging

logging.set_verbosity_debug()

ROOT_DIR: Final[Path] = Path.cwd()
REPO_ID: Final[str] = "facebook/bart-large-cnn"
LOCAL_STORAGE_NAME: Final[str] = "hf_models"
LOCAL_STORAGE_PATH: Path = ROOT_DIR / LOCAL_STORAGE_NAME

# Download a single weights file from the repository into local storage.
model_path: str = hf_hub_download(
    repo_id=REPO_ID,
    repo_type="model",
    force_download=True,
    filename="pytorch_model.bin",
    local_dir=LOCAL_STORAGE_PATH,
)
print(model_path)

In [None]:
import os
from pathlib import Path
from huggingface_hub import snapshot_download

model: str = "facebook/bart-large-cnn"
local_dir: str = str(Path.cwd() / "hf_models")
# `os.cpu_count()` may return `None`; fall back to a single worker.
n_cpus: int = os.cpu_count() or 1

# Download only the JSON files and the safetensors weights, skipping the
# other weight formats stored in the repository.
snapshot_download(
    repo_id=model,
    repo_type="model",
    local_dir=local_dir,
    max_workers=n_cpus,
    allow_patterns=["*.json", "*.safetensors"],
)
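Once a snapshot has been written to local storage, a quick sanity check helps confirm that everything needed for offline use is actually there. The sketch below assumes that `config.json` and `model.safetensors` are the minimum required files and that they sit directly inside `hf_models`; both the helper name `missing_model_files` and that file list are assumptions made for illustration.

```python
from pathlib import Path
from typing import Iterable, List


def missing_model_files(
    model_dir: Path,
    required: Iterable[str] = ("config.json", "model.safetensors"),
) -> List[str]:
    """Return the names in `required` that are absent from `model_dir`."""
    return [name for name in required if not (model_dir / name).is_file()]


local_dir = Path.cwd() / "hf_models"
missing = missing_model_files(local_dir)
print("Snapshot complete." if not missing else f"Missing files: {missing}")
```

An empty result suggests that the directory can be passed to a `from_pretrained` call without further downloads, provided the tokenizer files are present as well.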

In [None]:
import os
import ssl

import requests
from ctransformers import AutoModelForCausalLM
from ctransformers.llm import LLM
from huggingface_hub import configure_http_backend

# Disable TLS certificate verification for Python's default HTTPS context.
ssl._create_default_https_context = ssl._create_unverified_context


def backend_factory() -> requests.Session:
    """Create a `requests` session that skips certificate verification."""
    session = requests.Session()
    session.verify = False
    return session


# Route all `huggingface_hub` HTTP calls through the session defined above.
configure_http_backend(backend_factory=backend_factory)

# NOTE: `ctransformers` loads GGML/GGUF model files. The repository below
# ships PyTorch weights, so a GGML/GGUF checkpoint has to be substituted for
# this call to succeed.
llm: LLM = AutoModelForCausalLM.from_pretrained(
    model_path_or_repo_id="facebook/bart-large-cnn",
    model_file="pytorch_model.bin",
)

summary: str = llm(
    prompt=input_text,
    max_new_tokens=1024,
    top_k=10,
    top_p=0.9,
    temperature=0.01,
    # `os.cpu_count()` may return `None`; fall back to a single thread.
    threads=os.cpu_count() or 1,
)
print(type(summary))
print(summary)

In [None]:
from huggingface_hub import InferenceClient, SummarizationOutput

client: InferenceClient = InferenceClient()
summary: SummarizationOutput = client.summarization(
    text=input_text,
    model="facebook/bart-large-cnn",
    clean_up_tokenization_spaces=True,
)
print(type(summary))
print(summary)
print(summary.summary_text)

In [None]:
from typing import Dict, List
from huggingface_hub.hf_file_system import HfFileSystem

hf_repo_file_system: HfFileSystem = HfFileSystem()

# List files in a HF repo file system.
files_from_approach1: List[str] = hf_repo_file_system.glob(
    path="facebook/bart-large-cnn/*.*"
)
print(f"Files in HF repo:\n\n{files_from_approach1}\n")

# Alternative approach to list all files in a HF repo.

# Here: `detail = True` - This yields a list of `Dict[str, str]` objects
# with all metadata associated with each file.
files_from_approach2: List[str | Dict[str, str]] = hf_repo_file_system.ls(
    path="facebook/bart-large-cnn",
    refresh=True,
    detail=True,
)
print(f"Approach-2: All files:\n\n{files_from_approach2}\n")

# Here: `detail = False` - This yields a list of `str` objects that gives
# the name of each file. No metadata associated with the files is returned.
files_from_approach2: List[str | Dict[str, str]] = hf_repo_file_system.ls(
    path="facebook/bart-large-cnn",
    refresh=True,
    detail=False,
)
print(f"Approach-2: All files:\n\n{files_from_approach2}\n")

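# Compare the two listings element by element, irrespective of ordering.
# (`all(a) == all(b)` would only compare two booleans.)
assert sorted(files_from_approach1) == sorted(files_from_approach2), (
    "Output from `.ls()` method with `detail=False` and from `.glob()` must "
    "be identical!"
)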
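Comparing two listings is only meaningful when the comparison ignores ordering and looks at the path names themselves. The sketch below normalizes both output shapes to a sorted list of names; it relies on the file-system convention that each metadata dictionary carries a `"name"` key. The helper `entry_names` and the sample listings are hypothetical stand-ins for real repository contents.

```python
from typing import Any, Dict, List, Sequence, Union

Entry = Union[str, Dict[str, Any]]


def entry_names(entries: Sequence[Entry]) -> List[str]:
    """Extract sorted path names from `ls`-style output.

    Accepts plain strings (as returned with `detail=False`) as well as
    metadata dictionaries with a "name" key (as returned with `detail=True`).
    """
    return sorted(e["name"] if isinstance(e, dict) else e for e in entries)


# Hypothetical listings standing in for real repository contents.
plain = ["org/repo/vocab.json", "org/repo/config.json"]
detailed = [
    {"name": "org/repo/config.json", "type": "file", "size": 1},
    {"name": "org/repo/vocab.json", "type": "file", "size": 2},
]
assert entry_names(plain) == entry_names(detailed)
print(entry_names(plain))
```

With such a helper, the detailed and the plain listings can be checked against each other directly.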

In [None]:
# List all files in a HF repo and then write the appropriate files to a local
# storage.
from pathlib import Path
from typing import Final, List

required_files: List[str] = [
    "facebook/bart-large-cnn/config.json",
    "facebook/bart-large-cnn/model.safetensors",
    "facebook/bart-large-cnn/pytorch_model.bin",
    "facebook/bart-large-cnn/tokenizer.json",
    "facebook/bart-large-cnn/vocab.json",
]

ROOT_DIR: Final[Path] = Path.cwd()
REPO_ID: Final[str] = "facebook/bart-large-cnn"
LOCAL_STORAGE_NAME: Final[str] = "hf_models"
LOCAL_STORAGE_PATH: Path = ROOT_DIR / Path(LOCAL_STORAGE_NAME)

print(f"Local storage:\n{LOCAL_STORAGE_PATH}\n")

for file in required_files[:2]:
    local_file = (LOCAL_STORAGE_PATH / Path(file)).absolute()
    try:
        print(f"Downloading {file}...")
        hf_repo_file_system.get_file(
            rpath=file,
            lpath=local_file,
        )
        print(f"{file} downloaded successfully...")
        print()
    except Exception as err:
        print(f"Downloading of {file} failed! Error: {err}\n")

In [None]:
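import os

# A plain Python variable has no effect on the library; the setting must be
# an environment variable, and it is read when `huggingface_hub` is first
# imported.
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="facebook/bart-large-cnn",
    local_dir=LOCAL_STORAGE_PATH,
    force_download=True,
    local_dir_use_symlinks=False,
)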

In [None]:
from pathlib import Path
from typing import Dict, List

# import adapters
# from adapters import AutoAdapterModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers.pipelines import pipeline

# model: Union[Any, None] = AutoAdapterModel.from_pretrained(
#     pretrained_model_name_or_path="facebook/bart-large-cnn",
# )

# _adaptor = model.load_adapter()

# The summarization pipeline needs a sequence-to-sequence checkpoint, so the
# local copy of `facebook/bart-large-cnn` is loaded here. Assumption: the
# snapshot was saved directly under `hf_models/`.
# A GGUF file such as `Qwen2-72B-Instruct-Q8_0.gguf` holds a causal LM; it
# would instead be loaded for text generation through the `gguf_file`
# argument of `AutoModelForCausalLM.from_pretrained`.
local_model_dir: str = str(Path.cwd() / "hf_models")

local_model = AutoModelForSeq2SeqLM.from_pretrained(
    pretrained_model_name_or_path=local_model_dir,
    local_files_only=True,
)
local_tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path=local_model_dir,
    local_files_only=True,
)

summarizer = pipeline(
    task="summarization",
    model=local_model,
    tokenizer=local_tokenizer,
)
summary: List[Dict[str, str]] = summarizer(input_text, max_length=100)
print(f"Bare Summary Object:\n{summary}\n")

To remove unwanted whitespace introduced through tokenization, we can add the parameter `clean_up_tokenization_spaces` to the `pipeline` function and set its value to `True` as shown.

```python
from typing import Dict, List

from transformers import pipeline

summarizer = pipeline(
    task="summarization",
    model="facebook/bart-large-cnn",
    clean_up_tokenization_spaces=True,
)

input_text: str = """..."""

# The pipeline returns one dictionary per input text.
summary: List[Dict[str, str]] = summarizer(input_text, max_length=100)
print(summary[0]["summary_text"])
```
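To get a feel for what this kind of cleanup does, the sketch below removes the spaces that detokenization tends to leave before punctuation marks and common English contractions. It is a simplified approximation written for illustration, not the library's own implementation, and the function name `clean_up_spaces` is hypothetical.

```python
def clean_up_spaces(text: str) -> str:
    """Remove spaces left before punctuation and common contractions."""
    replacements = [
        (" .", "."),
        (" ?", "?"),
        (" !", "!"),
        (" ,", ","),
        (" n't", "n't"),
        (" 'm", "'m"),
        (" 's", "'s"),
        (" 've", "'ve"),
        (" 're", "'re"),
    ]
    # Apply each replacement in turn to the whole string.
    for old, new in replacements:
        text = text.replace(old, new)
    return text


print(clean_up_spaces("He did n't stop , and the film 's ending is famous ."))
# Prints: He didn't stop, and the film's ending is famous.
```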

Let us now see the

In [None]:
# If we intend to use the Kaggle kernel to run this VS Code notebook, we must 
# make `run_code = True` to ensure that the necessary dependencies are 
# installed.

run_code: bool = False 
if run_code:
    pass

### Course-1.2: Fine-tuning LLMs

### Course-1.3: Evaluating LLM performance

In [None]:
from transformers import pipeline

## Course-2: Working with Llama 3

## Course-3: Deep Learning for Text with PyTorch

## Course-4: Transformer Models with PyTorch

## Course-5: Reinforcement Learning from Human Feedback (RLHF)

## Course-6: LLMOps Concepts

## Project-1: Analyzing Car Reviews with LLMs

## Project-2: Classifying Emails using Llama

## Project-3: Service Desk Ticket Classification with Deep Learning