-------

## 1.2. Pulling In Our First LLM

Instead of constructing things from the ground up, this course will focus on spotlighting tools that you can use and diving into them as necessary to figure out exactly how they work. And the best tool to start our journey into language modeling is **HuggingFace &#x1F917;!**

[**HuggingFace**](https://huggingface.co/) is an open-source community that offers simple strategies for accessing, uploading, and using large deep learning models for testing and deployment. The topics they support span many tasks and modalities, but we'll be focusing on large language models (**LLMs**) for most of this course.

When searching through the [HuggingFace Models catalog](https://huggingface.co/models?sort=downloads&search=bert), you'll quickly stumble upon the [`bert-base-uncased`](https://huggingface.co/bert-base-uncased) model. Taking a look at its card, you'll see several interesting things:

1. Loading in the model requires the use of the [`transformers`](https://github.com/huggingface/transformers) package. This is the HuggingFace package used to support most of the platform's language modeling code. Its name, `transformers`, refers to the primary architectural structure underlying many of these models, and we'll be talking about this structure in some detail throughout the next notebook. From here on out, you'll want to get comfortable with `transformers` and will be using it quite a bit, so feel free to search around and dive into the source code if you feel like it!
2. The card describes a default version that can be pulled in for mask filling (to be discussed) via its [Pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines) support. By **pipeline**, we mean the end-to-end process of going from a human-readable input to a human-readable output. This makes it super easy to pull in the model and lets you forget that there is a tensor-in/tensor-out differentiable process going on somewhere under the hood.

As a representative example, we can go ahead and pull in the discussed [`bert-base-uncased`](https://huggingface.co/bert-base-uncased) model and test it out!

In [4]:
from transformers import pipeline

## Load the pipeline and predict the mask fill (example from model card)
unmasker = pipeline('fill-mask', model='bert-base-uncased')
unmasker("Hello I'm a graceful [MASK] model.")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.23936112225055695,
  'token': 4827,
  'token_str': 'fashion',
  'sequence': "hello i'm a graceful fashion model."},
 {'score': 0.2011405974626541,
  'token': 2535,
  'token_str': 'role',
  'sequence': "hello i'm a graceful role model."},
 {'score': 0.048176608979701996,
  'token': 2210,
  'token_str': 'little',
  'sequence': "hello i'm a graceful little model."},
 {'score': 0.028926273807883263,
  'token': 9271,
  'token_str': 'runway',
  'sequence': "hello i'm a graceful runway model."},
 {'score': 0.020872579887509346,
  'token': 2047,
  'token_str': 'new',
  'sequence': "hello i'm a graceful new model."}]

**Amazing! It just works!** Under the hood, there's a deep learning model somewhere - crunching numbers and spitting out probabilities to make all of this happen - but it's easy to forget that sometimes. It's especially easy to forget when the model you're dealing with is actually generating human-sounding text, at which point you may start to wonder if it's connected to a human brain somewhere in a warehouse in California. But that's what this course is for: **to see what's actually going on behind the scenes and know how to use it to make good systems**.
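
The list returned by the pipeline is plain Python data, so it can be inspected directly. The sketch below copies the five entries printed above (with the `sequence` field left out for brevity) and, assuming each `score` is the probability the model assigned to that token, checks how much probability mass the top five candidates cover. The variable names `results`, `best`, and `covered` are ours.

```python
# Top-5 predictions copied from the pipeline output above
# ('sequence' omitted for brevity).
results = [
    {"score": 0.23936112225055695, "token": 4827, "token_str": "fashion"},
    {"score": 0.2011405974626541, "token": 2535, "token_str": "role"},
    {"score": 0.048176608979701996, "token": 2210, "token_str": "little"},
    {"score": 0.028926273807883263, "token": 9271, "token_str": "runway"},
    {"score": 0.020872579887509346, "token": 2047, "token_str": "new"},
]

# Highest-scoring candidate and the total score of the listed candidates.
best = max(results, key=lambda r: r["score"])
covered = sum(r["score"] for r in results)

print(f"best guess: {best['token_str']!r} (token id {best['token']})")
print(f"probability mass covered by the top 5: {covered:.3f}")
```

The five listed candidates account for only a little over half of the total, which is a reminder that the model spreads the remaining probability over the rest of its vocabulary.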

-------

## 1.3. Dissecting The Pipeline

At this level of abstraction, where we only see the pipeline taking strings in and spitting a list of dictionaries out, we don't learn much, so let's see what's actually going on inside. We can peel back the layer of abstraction just a little to see the structure of the pipeline:

In [5]:
from transformers import AutoTokenizer, AutoModel        ## General-purpose fully-automatic
from transformers import AutoModelForMaskedLM            ## Default import for FillMaskPipeline
from transformers import BertTokenizer, BertForMaskedLM  ## Realized components after automatic resolution
from transformers import FillMaskPipeline                ## Task-specific pipeline class we subclass below

class MyMlmPipeline(FillMaskPipeline):
    def __init__(self):
        ## The fully-automatic version
        super().__init__(
            tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased'),
            model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
        )

    def __call__(self, string, verbose=False):
        ## Verbose argument just there for our convenience
        input_tensors = self.preprocess(string)
        if verbose: print('\npreprocess outputs:\n', input_tensors, '\n')
        output_tensors = self.forward(input_tensors)
        if verbose: print('forward outputs:\n', output_tensors, '\n')
        output = self.postprocess(output_tensors)
        return output

    # def preprocess(self, string):
    #     string = [string] if isinstance(string, str) else string
    #     inputs = self.tokenizer(string, return_tensors="pt")
    #     return inputs

    # def forward(self, tensor_dict):
    #     output_tensors = self.model.forward(**tensor_dict)
    #     return {**output_tensors, **tensor_dict}

    # def postprocess(self, tensor_dict):
    #     ## Very Task-specific; see FillMaskPipeline.postprocess
    #     return super().postprocess(tensor_dict)

unmasker = MyMlmPipeline()
unmasker("Hello, Mr. Bert! How is it [MASK]?", verbose=True)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



preprocess outputs:
 {'input_ids': tensor([[  101,  7592,  1010,  2720,  1012, 14324,   999,  2129,  2003,  2009,
           103,  1029,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])} 

forward outputs:
 ModelOutput([('logits', tensor([[[ -6.6978,  -6.6436,  -6.6542,  ...,  -5.9664,  -5.8454,  -4.0651],
         [ -7.1922,  -7.0817,  -7.2808,  ...,  -6.6275,  -6.7937,  -5.0987],
         [-14.1895, -14.3331, -14.3951,  ..., -11.4550, -10.5485, -10.7081],
         ...,
         [ -7.8704,  -7.9647,  -7.8207,  ...,  -6.7212,  -7.0204,  -4.0526],
         [-10.7055, -10.2556, -10.7905,  ...,  -9.1913, -10.1996,  -0.9692],
         [-10.9574, -11.0610, -10.8301,  ...,  -9.3834,  -9.7457,  -8.4909]]])), ('input_ids', tensor([[  101,  7592,  1010,  2720,  1012, 14324,   999,  2129,  2003,  2009,
           103,  1029,   102]]))]) 



[{'score': 0.23866014182567596,
  'token': 2183,
  'token_str': 'going',
  'sequence': 'hello, mr. bert! how is it going?'},
 {'score': 0.07178756594657898,
  'token': 2017,
  'token_str': 'you',
  'sequence': 'hello, mr. bert! how is it you?'},
 {'score': 0.05827958881855011,
  'token': 6230,
  'token_str': 'happening',
  'sequence': 'hello, mr. bert! how is it happening?'},
 {'score': 0.056334324181079865,
  'token': 2651,
  'token_str': 'today',
  'sequence': 'hello, mr. bert! how is it today?'},
 {'score': 0.052870072424411774,
  'token': 2085,
  'token_str': 'now',
  'sequence': 'hello, mr. bert! how is it now?'}]

We can also see that the pipeline is built from two main components:
- `tokenizer`: The strategy to convert the input strings to something usable by the model.
- `model`: The deep learning model responsible for the input-tensor-to-output-tensor conversion.
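
To make the `tokenizer` role concrete, here is a toy sketch that reproduces the `input_ids` printed above. The vocabulary is a thirteen-entry slice that we inferred by lining up the printed ids with the input string; it is not loaded from the real `bert-base-uncased` vocabulary file, and the helper `toy_encode` is hypothetical. The real tokenizer also splits rare words into sub-word pieces, which this sketch does not attempt.

```python
import re

# Hand-built vocabulary slice (assumption: inferred by aligning the printed
# input_ids with the input string, not loaded from the real vocab file).
TOY_VOCAB = {
    "[CLS]": 101, "[SEP]": 102, "[MASK]": 103,
    "hello": 7592, ",": 1010, "mr": 2720, ".": 1012, "bert": 14324,
    "!": 999, "how": 2129, "is": 2003, "it": 2009, "?": 1029,
}

def toy_encode(text):
    """Split into words/punctuation, lowercase, and map each piece to an id."""
    # Keep the [MASK] placeholder intact; lowercase everything else.
    pieces = re.findall(r"\[MASK\]|\w+|[^\w\s]", text)
    tokens = [p if p == "[MASK]" else p.lower() for p in pieces]
    # Wrap the sequence in the start and end markers seen in the output.
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    return tokens, [TOY_VOCAB[t] for t in tokens]

tokens, ids = toy_encode("Hello, Mr. Bert! How is it [MASK]?")
print(tokens)
print(ids)
```

The printed ids match the `input_ids` tensor from the verbose run, with `103` marking the position the model is asked to fill.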

With these, the pipeline is able to support its streamlined interface with a pretty intuitive organization scheme:
- `preprocess`: human-intuitive input $\to$ tensor inputs. Facilitated by `tokenizer`.
- `forward`: tensor inputs $\to$ tensor outputs. Facilitated by `model`.
- `postprocess`: tensor outputs $\to$ human-intuitive outputs. Facilitated by the pipeline task.
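
The three stages can be mimicked without any deep learning library. The sketch below is a toy, not the `FillMaskPipeline` implementation: `ToyFillMask`, its seven-entry vocabulary, and its hard-coded logits are all made up for illustration. It assumes the postprocessing step boils down to a softmax over the logits at the masked position followed by a top-k selection.

```python
import math

class ToyFillMask:
    """Toy stand-in for a fill-mask pipeline (hypothetical, for illustration)."""

    vocab = ["[MASK]", "how", "is", "it", "going", "today", "now"]

    def preprocess(self, text):
        # Human-readable string -> list of integer ids.
        return [self.vocab.index(word) for word in text.split()]

    def forward(self, input_ids):
        # Integer ids -> one row of logits per position.
        # A real model computes these; here they are hard-coded.
        fake_logits = [0.0, 0.0, 0.0, 0.0, 3.0, 1.5, 1.0]
        return {"input_ids": input_ids, "logits": [fake_logits for _ in input_ids]}

    def postprocess(self, outputs, top_k=3):
        # Logits at the masked position -> probabilities -> readable dicts.
        mask_pos = outputs["input_ids"].index(self.vocab.index("[MASK]"))
        logits = outputs["logits"][mask_pos]
        exps = [math.exp(x) for x in logits]
        total = sum(exps)
        probs = [e / total for e in exps]  # softmax
        ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
        return [{"score": probs[i], "token": i, "token_str": self.vocab[i]}
                for i in ranked[:top_k]]

    def __call__(self, text):
        return self.postprocess(self.forward(self.preprocess(text)))

for guess in ToyFillMask()("how is it [MASK]"):
    print(f"{guess['token_str']:>6}  {guess['score']:.3f}")
```

Only `forward` would need to change to plug in a real model; the string handling on either side stays ordinary Python.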

For deep learning, this actually seems pretty reasonable; the model reasons in numbers, and you probably don't want to expose that to the typical user when your domain is language. This makes it very easy for a typical starting user to just pick up the models and roll with them, so hopefully you feel a bit more comfortable when approaching the open-sourced LLM ecosystem!

-------

In [6]:
%%bash
echo """
===================================================
GPU SPECIFICATION
===================================================
"""
nvidia-smi
echo """
===================================================
MEMORY SPECIFICATION
===================================================
"""
cat /proc/meminfo


GPU SPECIFICATION


MEMORY SPECIFICATION

MemTotal:       13290460 kB
MemFree:         4665140 kB
MemAvailable:   10846260 kB
Buffers:          399764 kB
Cached:          5897256 kB
SwapCached:            0 kB
Active:          1170936 kB
Inactive:        7078212 kB
Active(anon):       1388 kB
Inactive(anon):  1952616 kB
Active(file):    1169548 kB
Inactive(file):  5125596 kB
Unevictable:           8 kB
Mlocked:               8 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:               300 kB
Writeback:             0 kB
AnonPages:       1952320 kB
Mapped:          1237192 kB
Shmem:              1736 kB
KReclaimable:     211380 kB
Slab:             259880 kB
SReclaimable:     211380 kB
SUnreclaim:        48500 kB
KernelStack:        6080 kB
PageTables:        36028 kB
SecPageTables:         0 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     6645228 kB
Committed_AS:    4466300 kB
VmallocTotal:   34359738367 kB
Vm

bash: line 6: nvidia-smi: command not found


**So yeah, decent compute budget, *but not infinite*!**
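
To put the `/proc/meminfo` numbers in more familiar units, the sketch below parses three lines copied from the output above and converts them to GiB. It assumes that `kB` in this file means 1024 bytes; the helper name `parse_meminfo` is ours.

```python
# Three lines copied from the /proc/meminfo output above.
MEMINFO_SAMPLE = """\
MemTotal:       13290460 kB
MemFree:         4665140 kB
MemAvailable:   10846260 kB
"""

def parse_meminfo(text):
    """Return {field: size in kB} for lines shaped like 'Name:   123 kB'."""
    sizes = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if len(parts) == 2 and parts[1] == "kB":
            sizes[name.strip()] = int(parts[0])
    return sizes

KB_PER_GIB = 1024 * 1024  # assumption: 1 kB here is 1024 bytes
info = parse_meminfo(MEMINFO_SAMPLE)
for field, kb in info.items():
    print(f"{field:<13} {kb / KB_PER_GIB:6.2f} GiB")
```

On this instance that works out to roughly 12.7 GiB of RAM in total, of which about 10.3 GiB is reported as available.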

Before starting the next notebook, please restart the Jupyter kernel by running the code cell below. This frees the memory held by this notebook and helps prevent out-of-memory issues in later notebooks.
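
A commonly used way to restart the kernel from code is sketched below. It assumes the notebook runs on an IPython kernel, and the wrapper name `restart_kernel` is ours. The function is only defined here, so nothing restarts until it is called.

```python
def restart_kernel():
    """Ask the running IPython kernel to shut down and restart."""
    import IPython  # imported lazily so the definition also works outside notebooks

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)  # True requests a restart, not just a shutdown

# In a notebook cell, uncomment the next line to restart:
# restart_kernel()
```

After the restart, all variables and loaded models from this notebook are gone, which is exactly the point.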

-------