# Mosaic MPT-7B model for document information retrieval

In [1]:
!pip install -qU transformers accelerate einops langchain wikipedia xformers pdfplumber


In [2]:
from torch import cuda, bfloat16
import transformers
import tensorflow as tf

In [3]:
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

Num GPUs Available:  1


## Initializing the Hugging Face Pipeline
The first thing we need to do is initialize a text-generation pipeline with Hugging Face transformers. The pipeline requires three things that we must initialize first:

An LLM, in this case mosaicml/mpt-7b-instruct.

The respective tokenizer for the model.

A stopping criteria object.

We'll explain each of these as we get to it. Let's begin with the model.

We initialize the model and move it to our CUDA-enabled GPU. On Colab, downloading and initializing the model can take 5-10 minutes.

In [5]:
cuda.is_available()

True

In [7]:
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

model = transformers.AutoModelForCausalLM.from_pretrained(
    'mosaicml/mpt-7b-instruct',
    trust_remote_code=True,
    torch_dtype=bfloat16,
    max_seq_len=10000
)
model.eval()
model.to(device)
print(f"Model loaded on {device}")





Model loaded on cuda:0


In [8]:
print(f"Model loaded on {device}")

Model loaded on cuda:0


The pipeline requires a tokenizer, which translates human-readable plaintext into the token IDs that the LLM reads. The MPT-7B model was trained using the EleutherAI/gpt-neox-20b tokenizer, which we initialize like so:

In [9]:
tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")


Finally, we need to define the stopping criteria of the model. The stopping criteria let us specify when the model should stop generating text. Without stopping criteria, the model tends to go off on a tangent after answering the initial question.

In [10]:
import torch
from transformers import StoppingCriteria, StoppingCriteriaList

# mpt-7b is trained to add "<|endoftext|>" at the end of generations
stop_token_ids = tokenizer.convert_tokens_to_ids(["<|endoftext|>"])

# define custom stopping criteria object
class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        for stop_id in stop_token_ids:
            if input_ids[0][-1] == stop_id:
                return True
        return False

stopping_criteria = StoppingCriteriaList([StopOnTokens()])
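
The criteria above stops on a single token ID. If a stop marker were to span several tokens, the same idea could be extended by comparing the tail of the generated sequence against each stop sequence. The following is a minimal sketch on plain Python lists rather than tensors; `ends_with_stop_sequence` is a hypothetical helper, not part of transformers, and the token IDs are made up for illustration.

```python
from typing import List, Sequence


def ends_with_stop_sequence(
    generated_ids: Sequence[int], stop_sequences: List[List[int]]
) -> bool:
    """Return True if generated_ids ends with any of the stop sequences."""
    for stop in stop_sequences:
        # An empty stop sequence would match everything, so skip it.
        if not stop:
            continue
        tail = list(generated_ids[-len(stop):])
        if len(generated_ids) >= len(stop) and tail == list(stop):
            return True
    return False


# Toy token IDs, chosen only for illustration.
print(ends_with_stop_sequence([5, 8, 13, 0], [[0]]))        # single-token stop -> True
print(ends_with_stop_sequence([5, 8, 13, 21], [[13, 21]]))  # two-token stop -> True
print(ends_with_stop_sequence([5, 8, 13, 21], [[8, 13]]))   # stop not at the end -> False
```

Inside a `StoppingCriteria` subclass, the same comparison would be applied to `input_ids[0].tolist()`.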

Now we're ready to initialize the HF pipeline. There are a few additional parameters that we must define here. Comments explaining these have been included in the code.

In [22]:
generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    device=device,
    # we pass generation parameters here too
    stopping_criteria=stopping_criteria,  # without this the model will ramble
    # note: temperature, top_p and top_k only take effect when sampling is
    # enabled (do_sample=True); otherwise generation is greedy
    temperature=0.1,  # 'randomness' of outputs; lower values are more deterministic
    top_p=0.15,  # sample from the top tokens whose probabilities add up to 15%
    top_k=0,  # 0 disables top-k filtering, so sampling relies on top_p
    max_new_tokens=64,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this the output begins repeating
)
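
To make the effect of `temperature` and `top_p` more concrete, here is a small sketch on a toy distribution. It illustrates the general idea of temperature scaling followed by nucleus (top-p) filtering and is not the transformers implementation; the function names, tokens, and logits are made up for this example.

```python
import math
from typing import Dict


def apply_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Softmax over logits / temperature; a lower temperature sharpens the distribution."""
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_scaled = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - max_scaled) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the smallest set of most likely tokens whose cumulative probability reaches top_p."""
    kept = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}  # renormalize the survivors


# Hypothetical next-token logits.
toy_logits = {"fission": 2.0, "fusion": 1.5, "energy": 0.5, "banana": -1.0}

probs = apply_temperature(toy_logits, temperature=0.1)
print(top_p_filter(probs, top_p=0.15))  # {'fission': 1.0}
```

With a temperature of 0.1 and a top_p of 0.15, only the single most likely token survives in this toy case, which is why these settings behave almost deterministically.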

Try a simple prompt to test whether the pipeline is working:

Simple prompt: Explain to me the difference between nuclear fission and fusion.

In [12]:
res = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])



Explain to me the difference between nuclear fission and fusion.
Nuclear Fission is a process that splits heavy atoms into smaller, lighter ones by releasing energy in the form of heat or radiation (depending on how it's used). Nuclear Fusion occurs when two light atomic nuclei are combined together to create one heavier nucleus with more mass than either of them individually had before - this
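
Because the pipeline was created with `return_full_text=True`, the printed output starts with the prompt itself. When only the completion is of interest, the echoed prompt can be removed afterwards. A minimal sketch, assuming the generated text begins with the exact prompt string; `strip_prompt` is a hypothetical helper, and the completion below is shortened for illustration.

```python
def strip_prompt(generated_text: str, prompt: str) -> str:
    """Return only the completion when the generated text starts with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text  # prompt not echoed; return the text unchanged


prompt = "Explain to me the difference between nuclear fission and fusion."
full_text = prompt + "\nNuclear Fission is a process that splits heavy atoms."
print(strip_prompt(full_text, prompt))
```

With the real pipeline this would be called as `strip_prompt(res[0]["generated_text"], prompt)`.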


## Apply the MPT-7B model to two example PDFs

### Use the pdfplumber package to extract text from PDF documents

In [14]:
import pdfplumber

In [15]:
def Extract_Text(file_path):
    """
    Extract the text of every page in a PDF file and return it as one string.
    """
    full_pdf_text = ""
    with pdfplumber.open(file_path) as pdf:
        # Loop through each page in the PDF
        for page_number, page in enumerate(pdf.pages):
            # Extract text from the current page (fall back to "" if nothing is returned)
            text = page.extract_text() or ""
            # Append the page text, labelled with its page number and followed by a separator
            full_pdf_text += f"Text from page {page_number + 1}:\n{text}\n{'=' * 40}"

    return full_pdf_text

### Use the MPT-7B model to extract attributes from text in a PDF

#### First PDF (1 page long):

In [24]:
pdf1 = "./KGM15CR51E106K-DATA.pdf"
text = Extract_Text(pdf1)


In [26]:
INSTRUCTION_KEY = "### Instruction:"
TEXT_KEY = "### Text:"
RESPONSE_KEY = "### Response:"
INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
PROMPT_FOR_GENERATION_FORMAT = """{intro}
{instruction_key}

{instruction}

{Text_key}

{Text}

{response_key}
""".format(
    intro=INTRO_BLURB,
    instruction_key=INSTRUCTION_KEY,
    Text_key = TEXT_KEY,
    instruction="{instruction}",
    Text = "{Text}",
    response_key=RESPONSE_KEY,
)

example = "You need to retreive following attribute information under the text section: Attribute list = [Supplier Name, Product Type, Dimensions, Orientation(if any),Current Rating, Voltage, Frequency, Impedance, Capacitance, Temperature] "
fmt_ex = PROMPT_FOR_GENERATION_FORMAT.format(instruction=example, Text = text)


The next cell prints the entire prompt fed into the model, followed by the model's response.

The prompt template is:

**Intro** + **Instruction** + **Text from PDF** + **Response**

**Intro**:

"Below is an instruction that describes a task. Write a response that appropriately completes the request."

**Instruction**: 

"You need to retreive following attribute information under the text section: Attribute list = [Supplier Name, Product Type, Dimensions, Orientation(if any),Current Rating, Voltage, Frequency, Impedance, Capacitance, Temperature] "

**Text from PDF**:

Text extracted from the PDF file using pdfplumber

**Response**:
Response will be generated by the model
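
The four-part template described above can also be wrapped in a small function, which makes it easier to vary the intro or the instruction between experiments. This is a sketch only: `build_prompt` is a hypothetical helper that reproduces the same layout as `PROMPT_FOR_GENERATION_FORMAT`, and the instruction and text passed in below are placeholders.

```python
INSTRUCTION_KEY = "### Instruction:"
TEXT_KEY = "### Text:"
RESPONSE_KEY = "### Response:"


def build_prompt(intro: str, instruction: str, text: str) -> str:
    """Assemble intro, instruction, document text and an empty response slot."""
    return (
        f"{intro}\n"
        f"{INSTRUCTION_KEY}\n\n{instruction}\n\n"
        f"{TEXT_KEY}\n\n{text}\n\n"
        f"{RESPONSE_KEY}\n"
    )


prompt = build_prompt(
    intro=(
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
    ),
    instruction="List the supplier name and the rated voltage.",  # placeholder instruction
    text="Rated Voltage 25Vdc",  # placeholder for text extracted from a PDF
)
print(prompt)
```

The prompt ends with the response key, so the model's continuation is the response.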


In [28]:
res = generate_text(fmt_ex)
print(res[0]["generated_text"])



Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
You need to retreive following attribute information under the text section: Attribute list = [Supplier Name, Product Type, Dimensions, Orientation(if any),Current Rating, Voltage, Frequency, Impedance, Capacitance, Temperature] 
### Text:
Text from page 1:
Electrical Characteristics Data
0603 / X5R / 10μF / +/-10% / 25V : KGM15CR51E106K (CM105X5R106K25A)
Dimension
1608(JIS) / 0603(EIA) L W
DC Bias AC Voltage
Unit:mm
L 1.6±0.2 20 20
W 0.8±0.2 0 0
T 0.8±0.2 T
-20 -20
)%(
)%(
-40 -40
Characteristic C/C
C/C
Temp. Range -55deg to85deg -60 -60
Temp. coeff +/-15% -80 -80
Capacitance 10μF -100 -100
Tolerance +/-10% 0 5 10 15 20 25 0.001 0.01 0.1 1
Rated Voltage 25Vdc DC BIAS (V) AC voltage (Vrms)
S parameters(Series) Impedance/ESR Temperature Characteristics
Z(0Vdc) Rs(0Vdc) 0Vdc 12.5Vdc
S11 S21
10 20
0
0
1
-20
-20
)Bd(12S/11S )mho(RSE/Z
-40 )%(
0.1 -40
C/C
-60
-60
0.01
-

As shown above, the response generated by the model is as follows:

**Response**:
The electrical characteristics data for this capacitor are as follows: 
- Supplier name: KEMET 
- Part number: CM105X5R106K25A 
- Product type: Ceramic Capacitors 
- Case size: 0603 
- Lead spacing: X5R 
- Dimensions: 1608(JIS)/0603(EIA)

We observe that the model has a hard time following the exact instruction provided. Although the attribute names are explicitly listed in the prompt, the model still changes attribute names (for example, it returns "Part number", "Case size", and "Lead spacing", none of which were requested) and fails to return values for several of the listed attributes, such as Voltage, Capacitance, and Temperature.
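
This kind of mismatch can be checked mechanically. The sketch below parses the `- Name: value` lines of the response shown above and reports which requested attributes are missing. It assumes the response uses that bullet layout; `parse_response` and `missing_attributes` are hypothetical helpers, and "Orientation(if any)" is shortened to "Orientation" for matching.

```python
from typing import Dict, List

REQUESTED = [
    "Supplier Name", "Product Type", "Dimensions", "Orientation", "Current Rating",
    "Voltage", "Frequency", "Impedance", "Capacitance", "Temperature",
]


def parse_response(response: str) -> Dict[str, str]:
    """Parse '- Name: value' lines into a dict keyed by the lower-cased name."""
    found = {}
    for line in response.splitlines():
        line = line.strip()
        if line.startswith("-") and ":" in line:
            name, value = line.lstrip("- ").split(":", 1)
            found[name.strip().lower()] = value.strip()
    return found


def missing_attributes(response: str, requested: List[str]) -> List[str]:
    """Return the requested attribute names that do not appear in the response."""
    found = parse_response(response)
    return [name for name in requested if name.lower() not in found]


# The response text produced by the model for the first PDF.
response = """The electrical characteristics data for this capacitor are as follows:
- Supplier name: KEMET
- Part number: CM105X5R106K25A
- Product type: Ceramic Capacitors
- Case size: 0603
- Lead spacing: X5R
- Dimensions: 1608(JIS)/0603(EIA)"""

print(missing_attributes(response, REQUESTED))
```

For this response, seven of the ten requested attributes are reported as missing.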

#### Second PDF (12 pages long):

In [30]:
pdf2 = "./example1.pdf"
text = Extract_Text(pdf2)

In [33]:
INSTRUCTION_KEY = "### Instruction:"
TEXT_KEY = "### Text:"
RESPONSE_KEY = "### Response:"
INTRO_BLURB = "Below is an instruction about retrieving attributes from text. Write a response that appropriately completes the request. Please list all attributes from intruction in your answers. If you did not find answe for an attribute, return 'information not found'. "
PROMPT_FOR_GENERATION_FORMAT = """{intro}
{instruction_key}

{instruction}

{Text_key}

{Text}

{response_key}
""".format(
    intro=INTRO_BLURB,
    instruction_key=INSTRUCTION_KEY,
    Text_key = TEXT_KEY,
    instruction="{instruction}",
    Text = "{Text}",
    response_key=RESPONSE_KEY,
)
example = "You need to retreive following attribute information under the text section: Attribute list = [Supplier Name, Product Type, Dimensions, Orientation(if any),Current Rating, Voltage, Frequency, Impedance, Capacitance, Temperature] "
fmt_ex = PROMPT_FOR_GENERATION_FORMAT.format(instruction=example, Text = text)


The next cell shows the prompt fed into the model. The prompt template is the same as for the first example PDF; only the intro text differs.

In [35]:
res = generate_text(fmt_ex)
print(res[0]["generated_text"])



Below is an instruction about retrieving attributes from text. Write a response that appropriately completes the request. Please list all attributes from intruction in your answers. If you did not find answe for an attribute, return 'information not found'. 
### Instruction:

You need to retreive following attribute information under the text section: Attribute list = [Supplier Name, Product Type, Dimensions, Orientation(if any),Current Rating, Voltage, Frequency, Impedance, Capacitance, Temperature] 

### Text:

Text from page 1:
NUMBER TYPE
GENERAL
GS-12-1565 PRODUCT SPECIFICATION
TITLE PAGE REVISION
1 of 12 H
EXAMAX2™ and EXAMEZZ2™ Connector System AUTHORIZED BY DATE
S. Minich 2023-01-13
CLASSIFICATION
UNRESTRICTED
EXAMAX2™ VH
EXAMAX2™ RAOH
EXAMAX2™ RAR
EXAMAX2™ RAR
BACKPLANE: RIGHT ANGLE RECEPTACLE (RAR) DIRECT MATE ORTHOGONAL (DMO): RIGHT ANGLE
WITH VERTICAL HEADER (VH) ORTHOGONAL HEADER (RAOH) WITH RAR
EXAMAX2™ RAR EXAMAX2™ RAH
COPLANAR: RIGHT ANGLE RECEPTACLE (RAR) EXAMEZZ2™: HE

In [36]:
INSTRUCTION_KEY = "### Instruction:"
TEXT_KEY = "### Text:"
RESPONSE_KEY = "### Response:"
INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
PROMPT_FOR_GENERATION_FORMAT = """{intro}
{instruction_key}
{instruction}
{Text_key}
{Text}

{response_key}
""".format(
    intro=INTRO_BLURB,
    instruction_key=INSTRUCTION_KEY,
    Text_key = TEXT_KEY,
    instruction="{instruction}",
    Text = "{Text}", # filled in later with the text extracted from the PDF by pdfplumber
    response_key=RESPONSE_KEY,
)

example = "You need to retreive following attribute information under the text section: Attribute list = [Supplier Name, Product Type, Dimensions, Orientation(if any),Current Rating, Voltage, Frequency, Impedance, Capacitance, Temperature] "
# formatted prompt
fmt_ex = PROMPT_FOR_GENERATION_FORMAT.format(instruction=example, Text = text)

In [38]:
res = generate_text(fmt_ex)
print(res[0]["generated_text"])



Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
You need to retreive following attribute information under the text section: Attribute list = [Supplier Name, Product Type, Dimensions, Orientation(if any),Current Rating, Voltage, Frequency, Impedance, Capacitance, Temperature] 
### Text:
Text from page 1:
NUMBER TYPE
GENERAL
GS-12-1565 PRODUCT SPECIFICATION
TITLE PAGE REVISION
1 of 12 H
EXAMAX2™ and EXAMEZZ2™ Connector System AUTHORIZED BY DATE
S. Minich 2023-01-13
CLASSIFICATION
UNRESTRICTED
EXAMAX2™ VH
EXAMAX2™ RAOH
EXAMAX2™ RAR
EXAMAX2™ RAR
BACKPLANE: RIGHT ANGLE RECEPTACLE (RAR) DIRECT MATE ORTHOGONAL (DMO): RIGHT ANGLE
WITH VERTICAL HEADER (VH) ORTHOGONAL HEADER (RAOH) WITH RAR
EXAMAX2™ RAR EXAMAX2™ RAH
COPLANAR: RIGHT ANGLE RECEPTACLE (RAR) EXAMEZZ2™: HERMAPHRODITIC MEZZANINE
WITH RIGHT ANGLE HEADER (RAH)
© 2018 AICC
Form E-3701 – Revision E GS-01-029
PDS: Rev :H STATUS:Released Printed: Jan 17, 2023
NUMBER

Response generated by the model: 

**Response**:

Chained Grit Chokes Chose Chosen EXCLUS Chances Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch Ch


We can observe that as documents become longer and more complex, the output of mpt-7b-instruct becomes worse and less helpful: for the 12-page document, the model produces repeated, meaningless tokens instead of attribute values.
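
One possible mitigation, not tried in this notebook, is to send the document to the model in smaller pieces rather than as one long prompt. The sketch below splits the concatenated text back into pages using the separator written by `Extract_Text` and then greedily packs consecutive pages into chunks. `split_pages`, `group_pages`, and the `max_chars` budget are assumptions made for illustration; a character count is only a rough stand-in for a token budget.

```python
from typing import List

PAGE_SEPARATOR = "=" * 40  # the separator written by Extract_Text


def split_pages(full_pdf_text: str) -> List[str]:
    """Split the concatenated text back into one string per page."""
    return [page.strip() for page in full_pdf_text.split(PAGE_SEPARATOR) if page.strip()]


def group_pages(pages: List[str], max_chars: int) -> List[str]:
    """Greedily pack consecutive pages into chunks of at most max_chars characters.

    A single page longer than max_chars is kept as its own chunk.
    """
    chunks, current = [], ""
    for page in pages:
        candidate = f"{current}\n{page}" if current else page
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = page
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Toy input in the same layout that Extract_Text produces.
sample = "".join(
    f"Text from page {n}:\npage {n} body\n{PAGE_SEPARATOR}" for n in (1, 2, 3)
)
pages = split_pages(sample)
chunks = group_pages(pages, max_chars=70)
print(len(pages), len(chunks))  # 3 2
```

Each chunk could then be formatted into its own prompt and the per-chunk answers merged afterwards. Whether this improves extraction quality for this model would need to be tested.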