-------

<br>

## **Part 1.1:** Recalling Deep Learning

Throughout your deep learning journey, you have probably trained a variety of models for tasks like classification and regression. Your progression likely looked something like this:

- When you started out, you used **linear and logistic regression** to model and interpret simple linear relationships that associated your inputs with your outputs.
- When that wasn't enough, you started **stacking linear layers one after another and adding non-linear activations** to give your model more predictive power.
- When your data started getting intractably high-dimensional, you started using more **informed sparsely-connected techniques like convolution** to add more control to your reasoning criteria.
- When you realized that you didn't have enough data to properly train your models for each specific task, you got **pre-trained components (e.g. VGG-16/ResNet)** that were trained on a giant repository of training data and already contained the necessary logic you wanted.



If you’ve gone through these steps, you already possess the foundational skills to tackle complex topics, including the vast field of language modeling.

Like vision, language is highly complex and high-dimensional. For instance, a simple 200x200 color image contains $200\times 200\times 3 = 120,000$ features! Now, consider the even larger number of combinations possible in natural language—**it’s enormous!** Fortunately, creative techniques and large pre-trained models simplify the problem, providing efficient solutions.
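To make the scale comparison concrete, here is a rough back-of-the-envelope sketch. The vocabulary size (30,522, the size used by `bert-base-uncased`) and the 10-token sequence length are assumptions chosen purely for illustration; they are not figures from this section.

```python
# Feature count for the 200x200 color image mentioned above.
height, width, channels = 200, 200, 3
image_features = height * width * channels
print(f"Image features: {image_features:,}")

# Illustrative assumptions: a 30,522-token vocabulary (as in bert-base-uncased)
# and a short sequence of only 10 tokens.
vocab_size = 30_522
sequence_length = 10

# Every position can hold any token, so the count grows exponentially with length.
possible_sequences = vocab_size ** sequence_length
print(f"Possible {sequence_length}-token sequences: {possible_sequences:.2e}")
```

Even at ten tokens, the number of distinct sequences is on the order of $10^{44}$, which is why language models cannot simply memorize every possible input.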

**This course will show you how to approach language modeling, the tools available, and the types of problems you can solve!**

-------

<br>

## **Part 1.2:** Pulling In Our First LLM

Rather than building models from scratch, this course focuses on using powerful tools to simplify the process. A great place to start is **HuggingFace** &#x1F917;.

[**HuggingFace**](https://huggingface.co/) is an open-source community that offers simple strategies for accessing, developing, and hosting large deep learning models for testing and deployment. The topics they support span many tasks and modalities, but we'll be focusing on large language models (**LLMs**) for most of this course.

When searching through the [**HuggingFace Models catalog**](https://huggingface.co/models?sort=downloads&search=bert), you'll quickly stumble upon the [`bert-base-uncased`](https://huggingface.co/bert-base-uncased) model. Taking a look at its card, you'll see several interesting things:

- **The [`transformers`](https://github.com/huggingface/transformers) Package**: Loading in the model requires the use of the [`transformers`](https://github.com/huggingface/transformers) package, which is HuggingFace’s primary library for language models. Its name refers to the transformer architecture, which we’ll explore further in upcoming sections. You’ll be using `transformers` throughout this course, so feel free to dive into the source code as you go.

- **Pipelines**: HuggingFace simplifies complex deep learning tasks with its [Pipeline API](https://huggingface.co/docs/transformers/main_classes/pipelines). A pipeline allows you to move from input to output without worrying about the internal workings. We’ll demonstrate this with the `bert-base-uncased` model for mask-filling.

As a representative example, we can go ahead and pull in the discussed [`bert-base-uncased`](https://huggingface.co/bert-base-uncased) model and test it out:

In [2]:
from transformers import pipeline

## Load the pipeline and predict the mask fill (example from the model card)
unmasker = pipeline(
    'fill-mask',
    model='bert-base-uncased',
    device='cuda',  ## Runs on the GPU; optional for a model this small
)
unmasker("Hello I'm a [MASK] model.")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.10731059312820435,
  'token': 4827,
  'token_str': 'fashion',
  'sequence': "hello i ' m a fashion model."},
 {'score': 0.08774524927139282,
  'token': 2535,
  'token_str': 'role',
  'sequence': "hello i ' m a role model."},
 {'score': 0.05338389053940773,
  'token': 2047,
  'token_str': 'new',
  'sequence': "hello i ' m a new model."},
 {'score': 0.046672068536281586,
  'token': 3565,
  'token_str': 'super',
  'sequence': "hello i ' m a super model."},
 {'score': 0.02709585428237915,
  'token': 2986,
  'token_str': 'fine',
  'sequence': "hello i ' m a fine model."}]

**And just like that, it works!** You have just obtained an intuitive answer that makes sense with human logic, which makes you almost forget that there is actually a deep learning model calculating probabilities to make this all work.
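As a small sanity check on the "calculating probabilities" point, the sketch below simply adds up the five scores printed above. The values are copied from that output; the variable names are made up for this example.

```python
# Scores copied from the fill-mask output above (only the top 5 candidates are returned).
top5_scores = {
    "fashion": 0.10731059312820435,
    "role": 0.08774524927139282,
    "new": 0.05338389053940773,
    "super": 0.046672068536281586,
    "fine": 0.02709585428237915,
}

# Probability mass covered by the five candidates shown.
covered = sum(top5_scores.values())

# Whatever is left is spread over every other token in the vocabulary.
remaining = 1.0 - covered

print(f"Top-5 probability mass: {covered:.3f}")
print(f"Mass on all other tokens: {remaining:.3f}")
```

The five candidates account for only about a third of the total probability, so the model is far from certain here; the rest of the mass sits on tokens the pipeline did not display.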

**But that's what this course is for: to see what's actually going on behind the scenes and know how to use it to make interesting and useful systems.**

<hr>
<br>

## **Part 1.3:** Dissecting The Pipeline

While the pipeline provides a clean interface (strings in, a list of dictionaries out), it doesn’t give us much insight into what’s happening under the hood. Let’s peel back a layer and examine the approximate inner workings:

In [5]:
from transformers import FillMaskPipeline                ## Base pipeline class that we subclass below

from transformers import AutoTokenizer, AutoModel        ## General-purpose fully-automatic
from transformers import AutoModelForMaskedLM            ## Default import for FillMaskPipeline
from transformers import BertTokenizer, BertForMaskedLM  ## Realized components after automatic resolution

class MyMlmPipeline(FillMaskPipeline):
    ## My Masked Language Modeling Pipeline

    ### CASE 0: Construct your pipeline automatically by pulling in the components
    ###   with their respective configs from HuggingFace. Pipeline assumes preprocessing/postprocessing.
    def __init__(self, *args, **kwargs):
        ## The fully-automatic version
        super().__init__(
            tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased'),
            model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased"),
            # model = BertForMaskedLM.from_pretrained("bert-base-uncased"),
            *args, **kwargs  ## <- pass in any extra arguments
        )

    ### CASE 1: Uncomment the __call__ method to see what data is flowing.
    """
    def __call__(self, string, verbose=True):
        ## Verbose argument just there for our convenience
        input_tensors = self.preprocess(string)
        if verbose: print('\npreprocess outputs:\n', input_tensors, '\n')
        output_tensors = self.forward(input_tensors)
        if verbose: print('forward outputs:\n', output_tensors, '\n')
        output = self.postprocess(output_tensors)
        return output
    """

    ### CASE 2: Uncomment the manual overrides below to verify the pipeline still works
    """
    def preprocess(self, string):
        string = [string] if isinstance(string, str) else string
        inputs = self.tokenizer(string, return_tensors="pt")           ### strings -> indices
        inputs = {k: v.to(self.device) for k, v in inputs.items()}     ### move to the pipeline's device
        return inputs

    def forward(self, tensor_dict):
        output_tensors = self.model.forward(**tensor_dict)             ### indices -> vectors -> probabilities
        return {**output_tensors, **tensor_dict}

    def postprocess(self, tensor_dict):
        ## Very Task-specific; see FillMaskPipeline.postprocess
        tensor_dict = {k: v.to("cpu") for k, v in tensor_dict.items()} ### move off GPU
        return super().postprocess(tensor_dict)                        ### probabilities (or vectors) -> outputs
    """

unmasker = MyMlmPipeline(device="cuda")
unmasker("Hello, Mr. Bert! How is it [MASK]?")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.23865924775600433,
  'token': 2183,
  'token_str': 'going',
  'sequence': 'hello, mr. bert! how is it going?'},
 {'score': 0.07178767770528793,
  'token': 2017,
  'token_str': 'you',
  'sequence': 'hello, mr. bert! how is it you?'},
 {'score': 0.05827939882874489,
  'token': 6230,
  'token_str': 'happening',
  'sequence': 'hello, mr. bert! how is it happening?'},
 {'score': 0.056334055960178375,
  'token': 2651,
  'token_str': 'today',
  'sequence': 'hello, mr. bert! how is it today?'},
 {'score': 0.052870072424411774,
  'token': 2085,
  'token_str': 'now',
  'sequence': 'hello, mr. bert! how is it now?'}]

From this code, we can see that the pipeline primarily consists of two core components:
- **Tokenizer**: Converts input strings into a format the model can process.
- **Model**: The deep learning engine that transforms input tensors into output tensors.

Together, these components support the pipeline’s intuitive workflow:
- `preprocess`: human-intuitive input $\to$ tensor inputs. Facilitated by `tokenizer`
- `forward`: tensor inputs $\to$ tensor outputs. Facilitated by `model`
- `postprocess`: tensor outputs $\to$ human-intuitive outputs. Facilitated by the pipeline task.

For deep learning, this flow makes sense. The model operates on numbers, but users work with words, so this abstraction greatly simplifies the LLM interface for typical scenarios.
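The three stages can also be mimicked without any library at all. The sketch below is a toy stand-in, not how `FillMaskPipeline` is actually implemented: the seven-word vocabulary, the whitespace tokenizer, and the hard-coded logits are all hypothetical and exist only to show what kind of data passes between `preprocess`, `forward`, and `postprocess`.

```python
import math

# Hypothetical toy vocabulary; a real tokenizer such as BERT's has roughly 30k entries.
VOCAB = ["[MASK]", "how", "is", "it", "going", "today", "?"]
TOKEN_TO_ID = {token: index for index, token in enumerate(VOCAB)}
MASK_ID = TOKEN_TO_ID["[MASK]"]


def preprocess(text):
    """Human-readable string -> list of token ids (plain whitespace split)."""
    return [TOKEN_TO_ID[token] for token in text.split()]


def forward(token_ids):
    """Token ids -> one logit per vocabulary entry for the masked position.

    A real model computes these with a neural network; here they are
    hard-coded so the example stays self-contained.
    """
    assert MASK_ID in token_ids, "input must contain a [MASK] token"
    toy_logits = {"going": 2.0, "today": 1.0}
    return [toy_logits.get(token, 0.0) for token in VOCAB]


def postprocess(token_ids, logits, top_k=2):
    """Logits -> ranked, human-readable candidates."""
    # Softmax turns raw logits into probabilities that sum to 1.
    exps = [math.exp(value) for value in logits]
    total = sum(exps)
    probs = [value / total for value in exps]

    # Keep the top_k most probable vocabulary entries.
    best = sorted(range(len(VOCAB)), key=lambda i: probs[i], reverse=True)[:top_k]

    mask_pos = token_ids.index(MASK_ID)
    results = []
    for candidate in best:
        filled = token_ids[:mask_pos] + [candidate] + token_ids[mask_pos + 1:]
        results.append({
            "score": probs[candidate],
            "token": candidate,
            "token_str": VOCAB[candidate],
            "sequence": " ".join(VOCAB[i] for i in filled),
        })
    return results


token_ids = preprocess("how is it [MASK] ?")   # strings -> indices
logits = forward(token_ids)                    # indices -> logits
results = postprocess(token_ids, logits)       # logits -> readable output
for result in results:
    print(result)
```

The returned dictionaries deliberately use the same keys as the real pipeline output (`score`, `token`, `token_str`, `sequence`), which makes it easier to map each toy stage onto its counterpart in `MyMlmPipeline`.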

-------

In [6]:
%%bash
echo """
===================================================
GPU SPECIFICATION
==================================================="""
nvidia-smi
echo """
===================================================
MEMORY SPECIFICATION
==================================================="""
free -h


GPU SPECIFICATION
Tue Nov 26 11:14:16 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   53C    P0              27W /  70W |   1059MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                 

In [7]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}