Importing necessary modules
1. snapshot_download from huggingface_hub downloads an entire repository stored on the Hugging Face Hub (a single file can be fetched with hf_hub_download instead). The downloaded files are cached on the local disk, and files are downloaded concurrently for higher speed.
2. torch (PyTorch) is the deep learning backend that transformers uses to load and run the model.
3. The time module can be used to measure the execution time and generation speed of the model.
4. AutoModelForCausalLM is a class that loads a pre-trained causal language model from the Hugging Face Hub or from a local directory. The loaded model can generate text, answer questions and perform other NLP tasks.
5. AutoTokenizer is a class that loads the tokenizer matching a pre-trained model, which converts text into token IDs and back.

In [None]:
from huggingface_hub import snapshot_download
import torch
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

We use the official Phi-2 repository on Hugging Face, published by Microsoft. The entire repository is roughly 5.18 GB, so the download may take some time.

In [None]:
repo_id = "microsoft/phi-2"
repo_type = "model"
local_dir = "./phi-2"
local_dir_use_symlinks = False
model_path = snapshot_download(
    repo_id=repo_id,
    repo_type=repo_type,
    local_dir=local_dir,
    local_dir_use_symlinks=local_dir_use_symlinks
)
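
Once the download finishes, it can be useful to confirm that the local copy is about the expected size. The helper below is a minimal sketch for that purpose; dir_size_gb is a hypothetical name, not part of huggingface_hub, and it simply sums the file sizes under a directory. The result may differ slightly from the figure quoted above because of metadata files stored next to the weights.

```python
from pathlib import Path


def dir_size_gb(path):
    """Return the total size of all regular files under `path` in GB (10**9 bytes)."""
    root = Path(path)
    # rglob("*") walks the directory tree recursively; only files are counted.
    total_bytes = sum(f.stat().st_size for f in root.rglob("*") if f.is_file())
    return total_bytes / 1e9


# Example usage (assumes the download cell above has completed):
# print(f"{dir_size_gb(local_dir):.2f} GB")
```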

Phi-2 has deliberately not been fine-tuned, which gives the research community more freedom to experiment with it. Even without fine-tuning, it performs well on some general QA and text generation tasks. We now set up the model and tokenizer and write a generate function that produces text from a prompt.

In [None]:
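model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",      # use the dtype stored with the checkpoint
    # Optimization flags forwarded to the model; whether they take effect
    # depends on the installed model code and hardware.
    flash_attn=True,
    flash_rotary=True,
    fused_dense=True,
    trust_remote_code=True   # allow running the custom code shipped in the repository
)

tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    trust_remote_code=True
)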

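def generate(prompt):
    """Generate a completion for `prompt` and return a list of decoded strings."""
    # Tokenize the prompt into PyTorch tensors.
    inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False)
    # max_length counts the prompt tokens plus the newly generated tokens.
    outputs = model.generate(**inputs, max_length=1000)
    # Convert the generated token IDs back to text (one string per sequence).
    decoded_output = tokenizer.batch_decode(outputs)
    return decoded_output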
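
The strings returned by generate contain the original prompt followed by the completion, and may also contain an end-of-text marker. The sketch below shows one way to post-process such a string. clean_completion is a hypothetical helper, and the marker "<|endoftext|>" is an assumption about the tokenizer; the actual value can be checked with tokenizer.eos_token.

```python
def clean_completion(decoded_text, prompt, eos_marker="<|endoftext|>"):
    """Strip the echoed prompt and everything from the first end-of-text marker on."""
    text = decoded_text
    # The decoded output normally starts with the prompt; drop it if present.
    if text.startswith(prompt):
        text = text[len(prompt):]
    # Cut the text at the first end-of-text marker (assumed marker string).
    eos_index = text.find(eos_marker)
    if eos_index != -1:
        text = text[:eos_index]
    return text.strip()


# Example usage (assumes model and tokenizer are loaded):
# prompt = "def add(a, b):"
# print(clean_completion(generate(prompt)[0], prompt))
```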


Now that the model is set up, we can use it to generate text or answer questions by calling the generate function.

In [None]:
output_text = generate('''def print_prime(n):
    """
    Print all primes between 1 and n
    """'''
)
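
The time module imported at the start can be used to measure how long a call like the one above takes. The following is a minimal sketch; timed_call and tokens_per_second are hypothetical helper names, and the token count in the usage comment is only an approximation because it re-tokenizes the decoded text, prompt included.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def tokens_per_second(num_tokens, elapsed_seconds):
    """Return throughput in tokens per second (infinity if no time elapsed)."""
    if elapsed_seconds <= 0:
        return float("inf")
    return num_tokens / elapsed_seconds


# Example usage (assumes generate and tokenizer are defined as above):
# output_text, elapsed = timed_call(generate, "def add(a, b):")
# num_tokens = len(tokenizer(output_text[0])["input_ids"])
# print(f"{elapsed:.1f} s, about {tokens_per_second(num_tokens, elapsed):.1f} tokens/s")
```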

In [None]:
print(output_text[0])
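
The same generate function can be used for question answering. The sketch below wraps a question in an "Instruct: ... Output:" template, which is the QA prompt shape described for Phi-2; treat the exact template as an assumption and check it against the model card. build_qa_prompt is a hypothetical helper name.

```python
def build_qa_prompt(question):
    """Wrap a question in an instruct-style QA template (assumed format)."""
    return f"Instruct: {question.strip()}\nOutput:"


# Example usage (assumes generate is defined as above):
# answer = generate(build_qa_prompt("What is the capital of France?"))
# print(answer[0])
```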