# Environment setup

> Run a Large Language Model using the [HuggingFace `Transformers`](https://huggingface.co/docs/transformers/index) API.  

In [1]:
#| default_exp lesson_1.first_run

The cells below are good defaults for development.  

The `autoreload` lines reload imported libraries on the fly whenever their source changes. This works well with the editable install we created via `pip install -e .`  
It means we can edit the source code directly and have the change reflected live in the notebook.  

In [4]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Introduction

Imagine we have a list of product reviews from our users, and we want to find out whether those reviews were good or bad. Manually going through and checking each one would take a lot of effort. With an LLM, we can automatically get a label for each product review. 

How would this be useful? We could use it to find the more negative reviews to see where our product needs improving. Or, we can look at the more positive ones to see what we're doing right.  

The broader task in NLP of figuring out a statement's tone is called `Sentiment Analysis`.

## First, a Pipeline

A HuggingFace model is based on 3 key pieces: 
1. Config file.  
2. Preprocessor file.   
3. Model file.   

The HuggingFace API gives us a way of automatically using these pieces directly: the `pipeline`.  

Let's get right to it and create a Sentiment Analysis `pipeline`.

In [6]:
#| export 

# load in the pipeline object from huggingface
from transformers import pipeline

# create the sentiment analysis pipeline
classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Downloading (…)lve/main/config.json: 100%|██████████| 629/629 [00:00<00:00, 752kB/s]
Downloading model.safetensors: 100%|██████████| 268M/268M [00:21<00:00, 12.6MB/s] 
Downloading (…)okenizer_config.json: 100%|██████████| 48.0/48.0 [00:00<00:00, 388kB/s]
Downloading (…)solve/main/vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 5.65MB/s]


We can see in the output message above that HuggingFace automatically picked a decent default model for us, since we didn't specify one. Specifically, it chose a [distilbert model](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).  

We will learn more about what exactly `distilbert` is and how it works later on. For now, think of it as a useful NLP genie who can tell us how it feels about a given sentence. 
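
The warning in the output above recommends naming the model and revision explicitly. Below is a minimal sketch of that, reusing the model name and revision hash printed in that message. The helper name `build_pinned_classifier` is hypothetical, and the import sits inside the function so that nothing is downloaded until it is called.

```python
# Model name and revision copied from the pipeline's output message above.
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
MODEL_REVISION = "af0f99b"


def build_pinned_classifier():
    """Create the sentiment pipeline with an explicit model and revision.

    `build_pinned_classifier` is a hypothetical helper name, not part of
    the Transformers API.
    """
    # imported lazily: the model files are only fetched when this is called
    from transformers import pipeline

    return pipeline(
        "sentiment-analysis",
        model=MODEL_NAME,
        revision=MODEL_REVISION,
    )
```

Calling `classifier = build_pinned_classifier()` should give the same kind of classifier as before, without relying on the library's default choice.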

In [7]:
#| export

# example from the HuggingFace tutorial
classifier("We are very happy to show you the 🤗 Transformers library.")

[{'label': 'POSITIVE', 'score': 0.9997795224189758}]

In [8]:
#| export

# passing in several sentences at once, inside a python list
results = classifier([
    "We are very happy to show you the 🤗 Transformers library.",
    "We hope you don't hate it.",
    "I love Fractal! I'm so glad it's not a cult!", 
])

# print the output of each result
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309
label: POSITIVE, with score: 0.999
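
Going back to the use case from the introduction, one way to surface the most negative reviews is to turn each result into a single signed number and sort on it. This is only a sketch: the results are hard-coded from the rounded output above so that it runs without a model, and `signed_score` is a hypothetical helper.

```python
# Pipeline-style results, hard-coded from the (rounded) output above.
reviews = [
    "We are very happy to show you the 🤗 Transformers library.",
    "We hope you don't hate it.",
    "I love Fractal! I'm so glad it's not a cult!",
]
results = [
    {"label": "POSITIVE", "score": 0.9998},
    {"label": "NEGATIVE", "score": 0.5309},
    {"label": "POSITIVE", "score": 0.999},
]


def signed_score(result):
    """Map a result to [-1, 1]: NEGATIVE labels get a negative score."""
    sign = 1 if result["label"] == "POSITIVE" else -1
    return sign * result["score"]


# sort from most negative to most positive
ranked = sorted(zip(reviews, results), key=lambda pair: signed_score(pair[1]))

for review, result in ranked:
    print(f"{signed_score(result):+.4f}  {review}")
```

The first entries of `ranked` are the reviews to read when looking for what needs improving, and the last entries show what is going well.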


# Inspecting the `classifier`, notebook style.

What is the `classifier`, exactly?

In [9]:
classifier

<transformers.pipelines.text_classification.TextClassificationPipeline at 0x16d2fe690>

In [10]:
## showing the lookup's auto-complete
# classifier.

In [12]:
## viewing all of a class' methods and properties
dir(classifier)

['__abstractmethods__',
 '__call__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 '_batch_size',
 '_ensure_tensor_on_device',
 '_forward',
 '_forward_params',
 '_num_workers',
 '_postprocess_params',
 '_preprocess_params',
 '_sanitize_parameters',
 'binary_output',
 'call_count',
 'check_model_type',
 'default_input_names',
 'device',
 'device_placement',
 'ensure_tensor_on_device',
 'feature_extractor',
 'forward',
 'framework',
 'function_to_apply',
 'get_inference_context',
 'get_iterator',
 'image_processor',
 'iterate',
 'model',
 'modelcard',
 'postprocess',
 'predict',
 'preprocess',
 'return_all_scores',
 'run_mul
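
Most of the `dir` listing above is made up of dunder and private names. Here is a small sketch of filtering the listing down to public attributes, shown on a hypothetical stand-in class (`TinyPipeline`) so that it runs without loading a model.

```python
class TinyPipeline:
    """Hypothetical stand-in for a pipeline object, for illustration only."""

    def __init__(self):
        self.model = "some-model"
        self.tokenizer = "some-tokenizer"
        self._batch_size = None  # private by convention

    def predict(self, text):
        return text


def public_names(obj):
    """Return the attribute names of `obj` that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


print(public_names(TinyPipeline()))
```

The same `public_names(classifier)` call can be used on the real pipeline object.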

Jupyter notebooks have powerful ways of inspecting and analyzing the code, as we're running it. 

In [16]:
## refresher
classifier

<transformers.pipelines.text_classification.TextClassificationPipeline at 0x16d2fe690>

In [15]:
## the power of asking questions
classifier?

Signature:      classifier(*args, **kwargs)
Type:           TextClassificationPipeline
String form:    <transformers.pipelines.text_classification.TextClassificationPipeline object at 0x16d2fe690>
File:           ~/mambaforge/envs/llm_base/lib/python3.11/site-packages/transformers/pipelines/text_classification.py
Docstring:
Text classification pipeline using any `ModelForSequenceClassification`. See the [sequence classification
examples](../task_summary#sequence-classification) for more information.

Example:

```python
>>> from transformers import pipeline

>>> classifier = pipeline(model="distilbert-base-uncased-finetuned-sst-2-english")
>>> classifier("This movie is disgustingly good !")
[{'label': 'POSITIVE', 'score': 1.0}]

>>> classifier("Director tried too much.")
[{'label': 'NEGATIVE', 'score': 0.996}]
```

Learn mo

In [17]:
## again, with feeling
classifier??

Signature:      classifier(*args, **kwargs)
Type:           TextClassificationPipeline
String form:    <transformers.pipelines.text_classification.TextClassificationPipeline object at 0x16d2fe690>
File:           ~/mambaforge/envs/llm_base/lib/python3.11/site-packages/transformers/pipelines/text_classification.py
Source:
@add_end_docstrings(
    PIPELINE_INIT_ARGS,
    r"""
        return_all_scores (`bool`, *optional*, defaults to `False`):
            Whether to return all prediction scores or just the one of the predicted class.
        function_to_apply (`str`, *optional*, defaults to `"default"`):
            The function to apply to the model outputs in order to retrieve th

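The `?` and `??` helpers are IPython features, so they only work in a notebook or IPython shell. In a plain Python script, a rough equivalent can be sketched with the standard library's `inspect` module. The example inspects `json.dumps` purely as a convenient stand-in target, so it runs without any model loaded.

```python
import inspect
import json

# `json.dumps` is just a convenient standard-library target for the demo
target = json.dumps

# roughly what `target?` shows: the signature and the docstring
signature = inspect.signature(target)
docstring = inspect.getdoc(target)
print(f"Signature: {target.__name__}{signature}")
print(f"Docstring: {docstring.splitlines()[0]}")

# roughly what `target??` adds: the source code, when it is available
try:
    source = inspect.getsource(target)
    print(source[:200])
except (OSError, TypeError):
    print("source not available for this object")
```

Swapping `target` for `classifier.__call__` or `type(classifier)` applies the same calls to the pipeline object.
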
# Peeking inside the `pipeline`

We can see the pipeline loaded the model: `distilbert-base-uncased-finetuned-sst-2-english`.  

It then handled the three key pieces (Config, Preprocessor, Model) under the hood. What exactly is `pipeline` doing?  

Let's build our own pipeline from scratch, stepping one small level below the abstraction. To do this, we will create each of the key pieces manually.  

### Config class

In [18]:
from transformers import DistilBertConfig

### Preprocessor class

In [19]:
from transformers import DistilBertTokenizer
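
To get a feel for what a preprocessor does, here is a toy sketch of WordPiece-style tokenization: each word is split greedily into the longest pieces found in a vocabulary, and the pieces are mapped to integer ids. The vocabulary and ids below are made up for illustration and are not DistilBERT's real ones, and `toy_tokenize` is a hypothetical helper, not the library's implementation.

```python
# Toy vocabulary: tokens and ids are invented for this example.
TOY_VOCAB = {
    "[UNK]": 0, "[CLS]": 1, "[SEP]": 2,
    "we": 3, "are": 4, "happy": 5, "un": 6, "##happy": 7, "very": 8,
}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # shrink the candidate from the right until it is in the vocabulary
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no split works: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Lowercase, split on whitespace, apply WordPiece, add special tokens."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[token] for token in tokens]


tokens, ids = toy_tokenize("We are very unhappy")
print(tokens)
print(ids)
```

The real tokenizer also handles punctuation, padding, and attention masks, which this sketch leaves out.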

### Model class

In [20]:
# from transformers import DistilBertModel
from transformers import DistilBertForSequenceClassification

Now we can use the model's name from up above and build each piece ourselves. HuggingFace uses the `from_pretrained` method to make this quick and easy. 

In [21]:
# the model we are using
model_name = 'distilbert-base-uncased-finetuned-sst-2-english'

In [22]:
# creating the config
config = DistilBertConfig.from_pretrained(model_name)

# creating the preprocessor 
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

# creating the model
model = DistilBertForSequenceClassification.from_pretrained(model_name)

Next we build a simple pipeline with these manual pieces.  

In [23]:
def preprocess(text):
    """
    Sends `text` through the LLM's tokenizer.  
    The tokenizer turns words and characters into special inputs for the LLM.
    """
    tokenized_inputs = tokenizer(text, return_tensors='pt')
    return tokenized_inputs


def forward(text):
    """
    First we preprocess the `text` into tokens.
    Then we send the `token_inputs` to the model.
    """
    token_inputs = preprocess(text)
    outputs = model(**token_inputs)
    return outputs

def process_outputs(outs):
    """
    Turns the raw model outputs into a human-readable label and score.
    Here is where HuggingFace does the most for us via `pipeline`.
    """

    # grab the raw "scores" from the model for the Positive and Negative labels
    logits = outs.logits

    # find the strongest label score, aka the model's decision
    pred_idx = logits.argmax(1).item()

    # use the `config` object to find the class label
    pred_label = config.id2label[pred_idx]  

    # calculate the human-readable number for the score
    pred_score = logits.softmax(-1)[:, pred_idx].item()

    return {
        'label': pred_label,
        'score': pred_score, 
    }

def simple_pipeline(text):
    model_outs = forward(text)
    preds = process_outputs(model_outs)
    return preds

Let's call this pipeline on the same example text from before.

In [24]:
text = "We are very happy to show you the 🤗 Transformers library."

In [26]:
simple_pipeline(text)

{'label': 'POSITIVE', 'score': 0.9997795224189758}
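
The post-processing step is the least obvious part, so here it is again with no model involved. This sketch uses plain Python and made-up logits to show how softmax and argmax turn raw scores into a label and a probability. The `id2label` mapping is an assumed two-class mapping for illustration.

```python
import math

# Assumed label mapping for a two-class sentiment model, for illustration.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Made-up logits for one sentence: [negative score, positive score]
logits = [-4.2, 4.5]


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    # subtracting the max keeps exp() numerically stable
    biggest = max(values)
    exps = [math.exp(v - biggest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(logits)

# argmax: index of the largest logit, the model's decision
pred_idx = max(range(len(logits)), key=lambda i: logits[i])

prediction = {"label": id2label[pred_idx], "score": probs[pred_idx]}
print(prediction)
```

Because softmax preserves the ordering of the logits, taking the argmax of the logits or of the probabilities picks the same label.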

## More Hugging Face Magic

The `Auto` classes pick the right config, tokenizer, and model class for us from the model name alone.

In [33]:
from transformers import AutoConfig
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers import AutoModel
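
A simplified picture of the idea behind the `Auto` classes, not the library's actual implementation: a model's config names its architecture in a `model_type` field, and a lookup table maps that name to a concrete class. All class and variable names below are hypothetical.

```python
# Hypothetical stand-ins for concrete model classes.
class ToyDistilBert:
    def __init__(self, config):
        self.config = config


class ToyRoberta:
    def __init__(self, config):
        self.config = config


# registry from a config's `model_type` to the class that implements it
TOY_MODEL_MAPPING = {
    "distilbert": ToyDistilBert,
    "roberta": ToyRoberta,
}


class ToyAutoModel:
    """Picks a concrete class based on the `model_type` field of a config."""

    @staticmethod
    def from_config(config):
        model_type = config["model_type"]
        if model_type not in TOY_MODEL_MAPPING:
            raise ValueError(f"Unrecognized model_type: {model_type!r}")
        return TOY_MODEL_MAPPING[model_type](config)


# in the real library, the config is read from the model's `config.json`
toy_model = ToyAutoModel.from_config({"model_type": "distilbert"})
print(type(toy_model).__name__)
```

This is why the same `from_pretrained` call can be pointed at very different models without changing the import.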

In [30]:
model_name = "nghuyong/ernie-3.0-nano-zh"
tweet_classifier_model = "vinai/bertweet-base"

In [35]:
bertweet = AutoModel.from_pretrained(tweet_classifier_model)

# For transformers v4.x+:
tokenizer = AutoTokenizer.from_pretrained(tweet_classifier_model, use_fast=False)

emoji is not installed, thus not converting emoticons or emojis into text. Install emoji: pip3 install emoji==0.6.0
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [37]:
import torch

In [38]:
# INPUT TWEET IS ALREADY NORMALIZED!
line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"

input_ids = torch.tensor([tokenizer.encode(line)])

with torch.no_grad():
    features = bertweet(input_ids)  # returns a model output object with named fields

In [39]:
features

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.0261,  0.2147,  0.1159,  ...,  0.0314,  0.0336, -0.1419],
         [ 0.0986, -0.0205,  0.2265,  ...,  0.0828, -0.2993,  0.4767],
         [-0.1011,  0.2133, -0.2283,  ...,  0.0797, -0.1762, -0.2632],
         ...,
         [-0.3448, -0.2996, -0.2430,  ..., -0.3354,  0.2429,  0.3029],
         [-0.5279, -0.2429,  0.0758,  ..., -0.1733,  0.0389,  0.0513],
         [-0.0425,  0.2355,  0.1219,  ...,  0.0018,  0.0732, -0.1305]]]), pooler_output=tensor([[ 2.1684e-01, -1.6747e-01, -3.9913e-02, -1.6048e-01,  1.2537e-01,
         -7.5435e-02,  2.1886e-01, -1.3510e-01,  2.0169e-01, -2.1097e-01,
         -9.4783e-02, -5.3385e-02, -2.2015e-01,  1.3985e-02,  1.5082e-01,
         -6.2458e-02, -1.4717e-01, -6.7478e-02,  2.4149e-02,  1.7152e-01,
         -1.1483e-01, -1.7264e-01,  2.9670e-01, -1.6612e-02, -2.3634e-02,
          3.5211e-02, -1.6087e-01, -3.7129e-02,  4.9395e-02, -5.8116e-02,
         -5.6990e-02, -1.5559e-01,  