# Generation G - S1E2 - Fake model

This notebook is a companion to a series of posts about generative AI.

This episode shows how to replace the OpenAI LLM with a fake LLM, and build a dataset.

## Conclusion

This option is for testing purposes only.
The fake LLM is dumb: it returns canned responses whatever the prompt.

However, it is useful for mimicking the real model's behavior and outputs during development.

Fake LLMs have several advantages during the development phase:
- They are free. You don't have to pay for the trial-and-error steps.
- The response is almost immediate, while testing with real models requires API calls and may suffer from latency.
- The model does not require credentials or secret management.
- The responses can be fully deterministic if the random list is built with a seed.

Fake LLMs are a good fit for automated testing.
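
As a rough illustration of the automated-testing point, the sketch below checks a small piece of application logic against canned responses. `StubLLM` and `answer_in_upper_case` are hypothetical names introduced here. `StubLLM` is a plain Python stand-in that only imitates the "return the next canned response" behavior; it is not the LangChain class used later in this notebook.

```python
class StubLLM:
    """Hypothetical stand-in: returns canned responses in order, one per call."""

    def __init__(self, responses):
        self.responses = list(responses)
        self.calls = 0

    def __call__(self, prompt):
        # The prompt is ignored; only the call count decides the response.
        response = self.responses[self.calls]
        self.calls += 1
        return response


def answer_in_upper_case(llm, question):
    """Hypothetical application logic under test."""
    return llm(question).strip().upper()


def test_answer_is_upper_cased():
    llm = StubLLM(["384,400 km"])
    result = answer_in_upper_case(llm, "What is the distance to the Moon?")
    assert result == "384,400 KM"
    assert llm.calls == 1  # no network call, no API key, one canned answer


test_answer_is_upper_cased()
```

A test written this way needs no network access or credentials, so it can run in a CI pipeline.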

# Material

## Initializations

In [2]:
### Update environment

In [3]:
!apt-get update && apt-get install -y build-essential 1>/dev/null

Get:1 http://security.debian.org/debian-security bullseye-security InRelease [48.4 kB]
Hit:2 http://deb.debian.org/debian bullseye InRelease
Hit:3 http://deb.debian.org/debian bullseye-updates InRelease
Get:4 http://security.debian.org/debian-security bullseye-security/main amd64 Packages [252 kB]
Fetched 300 kB in 0s (665 kB/s)   
Reading package lists... Done


In [4]:
!apt-get update && apt-get install -y jq 1>/dev/null

Hit:1 http://deb.debian.org/debian bullseye InRelease
Hit:2 http://deb.debian.org/debian bullseye-updates InRelease
Hit:3 http://security.debian.org/debian-security bullseye-security InRelease
Reading package lists... Done


In [5]:
!pip install --upgrade pip  1>/dev/null


## Requirements

In [6]:
!pip install langchain==0.0.230 1>/dev/null


In [7]:
!pip install openai==0.27.8 1>/dev/null


In [8]:
!pip install tiktoken==0.4.0 1>/dev/null


## Secrets and credentials

In [9]:
%%bash --out secrets 
# using AWS Secrets Manager to store keys
# grab the keys and store them in a Python variable
export RESPONSE=$(aws secretsmanager get-secret-value --secret-id 'salvia/labbench/tests' )
export SECRETS=$( echo $RESPONSE | jq '.SecretString | fromjson')

echo $SECRETS

In [10]:
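import json
import os

# Parse the JSON printed by the previous cell; json.loads avoids calling eval on external input
os.environ["OPENAI_API_KEY"] = json.loads(secrets)["OPENAI_API_KEY"]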


## Code session

In the app, change `generate_response` so that a flag switches between the real and the fake model:

```python
fake = True  # False

def generate_response(input_text):
    llm = get_fake_llm_model() if fake else get_llm_model()
    st.info(llm(input_text))
```

In [11]:
import os
from langchain.llms import OpenAI
from langchain.llms.fake import FakeListLLM

openai_api_key = os.environ["OPENAI_API_KEY"] 

def get_llm_model():
    llm = OpenAI(temperature=0.7, openai_api_key=openai_api_key)
    return llm

def get_fake_llm_model():
    fake_responses = [
        "384,400 km",
        "The White Rabbit is a character of Alice of Wonderland and is always late"
    ]
    llm = FakeListLLM(responses=fake_responses) 
    return llm


In [12]:
%%time

fake = False
llm = get_fake_llm_model() if fake else get_llm_model()
    
query = "What is the distance to the Moon?"
print(llm(query))

query = "Who is the White Rabbit?"
print(llm(query))



The average distance from Earth to the Moon is 238,855 miles (384,400 kilometers).


The White Rabbit is a fictional character from Lewis Carroll's 1865 novel Alice's Adventures in Wonderland. He is a frantic, humanoid rabbit who is always late and in a hurry. He often talks to himself in a frenzied manner and is often seen to carry a pocket watch as he runs. He is known for his famous line, "Oh dear! Oh dear! I shall be too late!"
CPU times: user 22.3 ms, sys: 7.68 ms, total: 30 ms
Wall time: 3.1 s


OpenAI gave the answers below.

To the question "What is the distance to the Moon?", we get the right answer:
> "The average distance from Earth to the Moon is 238,855 miles (384,400 kilometers)".

And to the question "Who is the White Rabbit?", we get a typical ChatGPT answer:
> The White Rabbit is a fictional character from the book Alice's Adventures in Wonderland by Lewis Carroll. He appears at the very beginning of the story, in which Alice follows him down a rabbit hole, and is noted for his continual use of the phrase "Oh dear! Oh dear! I shall be too late!" He is often referred to as the March Hare, due to his association with the March Hare in the book's sequel, Through the Looking-Glass.



In [13]:
%%time

fake = True
llm = get_fake_llm_model() if fake else get_llm_model()
    
query = "What is the distance to the Moon?"
print(llm(query))

query = "Who is the White Rabbit?"
print(llm(query))


384,400 km
The White Rabbit is a character of Alice of Wonderland and is always late
CPU times: user 1.39 ms, sys: 0 ns, total: 1.39 ms
Wall time: 2.72 ms


In [14]:
# A third query raises "list index out of range": the fake LLM only holds two responses

In [16]:
try:
    query = "What is the Capital of France?"
    print(llm(query))
except Exception as e:
    print(e)

list index out of range
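
The error comes from the response list being exhausted: two canned answers, three queries. One possible workaround during development is to loop over the list instead of stopping at its end. The sketch below does this in plain Python rather than with LangChain; `CyclingResponses` is a hypothetical helper, not part of any library.

```python
from itertools import cycle


class CyclingResponses:
    """Hypothetical helper: serves canned responses in a loop, never running out."""

    def __init__(self, responses):
        if not responses:
            raise ValueError("at least one response is required")
        self._responses = cycle(responses)

    def __call__(self, prompt):
        # The prompt is ignored, as with any canned-response model.
        return next(self._responses)


llm = CyclingResponses(["384,400 km", "The White Rabbit is always late"])
replies = [llm("any question") for _ in range(3)]
print(replies[2])  # the third call wraps around to the first response
```

The other option, used in the rest of this notebook, is simply to provide a longer list of responses.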


## QA dataset

Fortunately, NLP benchmarks have a similar need for large sets of answers, and curated datasets exist.

SQuAD is a question-answering dataset available as a JSON file.
> - SQuAD page https://rajpurkar.github.io/SQuAD-explorer/
> - SQuAD dataset https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
> - other datasets https://paperswithcode.com/task/question-answering#:~:text=Popular%20benchmark%20datasets%20for%20evaluation%20question%20answering%20systems%20include%20SQuAD,models%20are%20T5%20and%20XLNet.

The file structure is fairly complex, but questions and answers are easy to find.
Some points to take into consideration:
- The file contains multiple areas, each with a title.
- Each area contains paragraphs, each consisting of a context, questions, and answers.
- Most questions have multiple answers, but some have none.
- Each answer records the start position of its text within the context.
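
The layout described above can be walked with a few nested loops. The sketch below uses a tiny hand-made sample that mimics the structure; the context sentence and the second question are made up for illustration, and `iter_answers` is a hypothetical helper name.

```python
def iter_answers(dataset):
    """Yield (context, question, answer) for every answer in a SQuAD-shaped dict."""
    for area in dataset["data"]:
        for paragraph in area["paragraphs"]:
            for qa in paragraph["qas"]:
                # Questions with an empty answer list yield nothing.
                for answer in qa["answers"]:
                    yield paragraph["context"], qa["question"], answer


# Hand-made sample mimicking the layout; not taken from the real file.
context = "Normandy is a region in France."
sample = {
    "version": "v2.0",
    "data": [{
        "title": "Normans",
        "paragraphs": [{
            "context": context,
            "qas": [
                {"question": "In what country is Normandy located?", "id": "q1",
                 "is_impossible": False,
                 "answers": [{"text": "France",
                              "answer_start": context.index("France")}]},
                {"question": "A question with no answer?", "id": "q2",
                 "is_impossible": True, "answers": []},
            ],
        }],
    }],
}

found = []
for ctx, question, answer in iter_answers(sample):
    start = answer["answer_start"]
    # answer_start is the character offset of the answer text in the context
    assert ctx[start:start + len(answer["text"])] == answer["text"]
    found.append((question, answer["text"]))

print(found)
```

Flattening the file this way makes it easy to skip unanswered questions before picking responses.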



In [45]:
import urllib.request
import json 
from pprint import pprint

squad_dataset_path = "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json"
with urllib.request.urlopen(squad_dataset_path) as url:
    data= json.load(url)

print(f"root level attributes {data.keys()}")
print(f"number of areas {len(data['data'])}")
print(f"area attributes {data['data'][0].keys()}")
print(f"paragraphs attributes {data['data'][0]['paragraphs'][0].keys()}")
print(f"qas attributes {data['data'][0]['paragraphs'][0]['qas'][0].keys()}")
print(f"question sample {data['data'][0]['paragraphs'][0]['qas'][0]['question']}")
print(f"answers attributes {data['data'][0]['paragraphs'][0]['qas'][0]['answers'][0].keys()}")
print(f"answer sample {data['data'][0]['paragraphs'][0]['qas'][0]['answers'][0]['text']}")


root level attributes dict_keys(['version', 'data'])
number of areas 35
area attributes dict_keys(['title', 'paragraphs'])
paragraphs attributes dict_keys(['qas', 'context'])
qas attributes dict_keys(['question', 'id', 'answers', 'is_impossible'])
question sample In what country is Normandy located?
answers attributes dict_keys(['text', 'answer_start'])
answer sample France


In [68]:
# randomly pick answers
from random import randrange
from pprint import pprint

answers = []
while len(answers) < 20:
    a = randrange(len(data['data']))
    p = randrange(len(data['data'][a]['paragraphs']))
    q = randrange(len(data['data'][a]['paragraphs'][p]['qas']))
    nr_t = len(data['data'][a]['paragraphs'][p]['qas'][q]['answers'])
    if nr_t > 0:
        t = randrange(nr_t)  # index of a randomly chosen answer
        answer = data['data'][a]['paragraphs'][p]['qas'][q]['answers'][t]['text']
        answers.append(answer)

pprint(answers)

['30–75%',
 'the same procedures as for IPCC Assessment Reports',
 'south',
 '32,463',
 'normal faulting and through the ductile stretching and thinning',
 'Construction',
 'Brest',
 '14',
 'Maria Skłodowska-Curie Institute of Oncology',
 'the same procedures as for IPCC Assessment Reports',
 'small-business proprietors',
 '2.5 million',
 'NP-complete Boolean satisfiability problem',
 'Thames River',
 'colloblasts',
 'evidence',
 'between 1500 and 1850',
 'an Eastern Bloc city',
 'level of the top tax rate',
 'Advanced Steam']


## Fake LLM with dataset

In [21]:
from langchain.llms.fake import FakeListLLM
import urllib.request
import json 
from random import randrange

def get_responses(size):
    # loads the dataset
    squad_dataset_path = "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json"
    with urllib.request.urlopen(squad_dataset_path) as url:
        data= json.load(url)

    # randomly pick answers
    answers = []
    while len(answers) < size:
        a = randrange(len(data['data']))
        p = randrange(len(data['data'][a]['paragraphs']))
        q = randrange(len(data['data'][a]['paragraphs'][p]['qas']))
        nr_t = len(data['data'][a]['paragraphs'][p]['qas'][q]['answers'])
        if nr_t > 0:
            t = randrange(nr_t)  # index of a randomly chosen answer
            answer = data['data'][a]['paragraphs'][p]['qas'][q]['answers'][t]['text']
            answers.append(answer)
    print(answers)
    return answers


def get_fake_llm_model():
    llm = FakeListLLM(responses=get_responses(20)) 
    return llm

In [22]:
llm = get_fake_llm_model() 
    
for _ in range(4):
    query = "any"
    print(llm(query))

['final', 'regulates the practice of pharmacists and pharmacy technicians', 'though the 21st century', 'hormones', 'constituency seats', 'Civil disobedience', 'independent', 'Guilt implies wrong-doing', 'legitimacy of a particular law', 'Jacksonville', '3600', 'rent-seeking', 'nearly three hundred years', 'State Route 41', 'three', 'AUSTPAC was an Australian public X.25 network operated by Telstra', 'Japan', 'British troops', 'seven', '1565']
final
regulates the practice of pharmacists and pharmacy technicians
though the 21st century
hormones


In [23]:
llm = get_fake_llm_model() 
    
for _ in range(4):
    query = "any"
    print(llm(query))

['A user or host could call a host on a foreign network by including the DNIC of the remote network as part of the destination address', 'amount of time for which they are allowed to speak', 'northwest', 'oxygen-16', 'small numbers of settlers', 'Annual Status of Education Report', '12 million', 'two', 'Nearly 3,000', 'Timucua', 'The earlier they surrendered to the Mongols, the higher they were placed', 'a sword', 'governmental entities', 'surface condensers', 'statistical mechanics', '2018', 'College', 'much larger conflict between France and Great Britain', 'naval Battle of the Restigouche', 'in the relevant committee or committees']
A user or host could call a host on a foreign network by including the DNIC of the remote network as part of the destination address
amount of time for which they are allowed to speak
northwest
oxygen-16
