In [2]:
# Create and run a RAG pipeline from scratch

## What we're going to build

* https://github.com/mrdbourke/simple-local-rag
* https://whimsical.com/simple-local-rag-workflow-39kToR3yNf7E8kY4sS2tjV

We're going to build NutriChat to "chat with a nutrition textbook".

Specifically:

1. Open a PDF document (you could use almost any PDF here or even a collection of PDFs).
2. Format the text of the PDF textbook ready for an embedding model.
3. Embed all of the chunks of text in the textbook, turning them into numerical representations (embeddings) which we can store for later.
4. Build a retrieval system that uses vector search to find relevant chunks of text based on a query.
5. Create a prompt that incorporates the retrieved pieces of text.
6. Generate an answer to a query based on the passages of the textbook with an LLM.

All locally!

1. Steps 1-3: Document preprocessing and embedding creation.
2. Steps 4-6: Search and answer.

In [3]:
#Import a pdf as the data source

In [4]:
import os
import requests

pdf_path = 'human-nutrition-text.pdf'

# Download PDF
if not os.path.exists(pdf_path):
    print("[INFO] File doesn't exist, downloading...")

    # Enter the URL of the PDF
    url = "https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf"

    # The local filename to save the downloaded file
    filename = pdf_path

    # Send a GET request to the URL
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        # Open the file and save it
        with open(filename, "wb") as file:
            file.write(response.content)
        print(f"[INFO] The file has been downloaded and saved as {filename}")
    else:
        print(f"[INFO] Failed to download the file. Status code: {response.status_code}")

else:
    print(f"File {pdf_path} exists.")

File human-nutrition-text.pdf exists.


In [5]:
import fitz # requires: !pip install PyMuPDF, see: https://github.com/pymupdf/PyMuPDF
from tqdm.auto import tqdm # pip install tqdm

def text_formatter(text: str) -> str:
    """Performs minor formatting on text."""
    cleaned_text = text.replace("\n", " ").strip()

    # Potentially more text formatting functions can go here
    return cleaned_text

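def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """Opens a PDF, reads it page by page and collects each page's text and basic statistics."""
    doc = fitz.open(pdf_path)
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text=text)
        pages_and_texts.append({"page_number": page_number - 41, # offset so the numbering lines up with the page numbers printed in the textbook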
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),
                                "page_setence_count_raw": len(text.split(". ")),
                                "page_token_count": len(text) / 4, # 1 token = ~4 characters
                                "text": text})
    return pages_and_texts

pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)
pages_and_texts[:2]

1208it [00:01, 705.76it/s]


[{'page_number': -41,
  'page_char_count': 29,
  'page_word_count': 4,
  'page_setence_count_raw': 1,
  'page_token_count': 7.25,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_number': -40,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_setence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''}]

In [3]:
import random
random.sample(pages_and_texts,2)

[{'page_number': 370,
  'page_char_count': 1923,
  'page_word_count': 332,
  'page_setence_count_raw': 17,
  'page_token_count': 480.75,
  'text': 'and folding. During translation each amino acid is connected to the  next amino acid by a special chemical bond called a peptide bond.  The peptide bond forms between the carboxylic acid group of one  amino acid and the amino group of another, releasing a molecule  of water. The third step in protein production involves folding it  into its correct shape. Specific amino acid sequences contain all  the information necessary to spontaneously fold into a particular  shape. A change in the amino acid sequence will cause a change in  protein shape. Each protein in the human body differs in its amino  acid sequence and consequently, its shape. The newly synthesized  protein is structured to perform a particular function in a cell.  A protein made with an incorrectly placed amino acid may not  function properly and this can sometimes cause disease

In [6]:
import pandas as pd
df = pd.DataFrame(pages_and_texts)
df.head()

Unnamed: 0,page_number,page_char_count,page_word_count,page_setence_count_raw,page_token_count,text
0,-41,29,4,1,7.25,Human Nutrition: 2020 Edition
1,-40,0,1,1,0.0,
2,-39,320,54,1,80.0,Human Nutrition: 2020 Edition UNIVERSITY OF ...
3,-38,212,32,1,53.0,Human Nutrition: 2020 Edition by University of...
4,-37,797,145,2,199.25,Contents Preface University of Hawai‘i at Mā...


In [7]:
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_setence_count_raw,page_token_count
count,1208.0,1208.0,1208.0,1208.0,1208.0
mean,562.5,1148.59,198.89,9.97,287.15
std,348.86,560.44,95.75,6.19,140.11
min,-41.0,0.0,1.0,1.0,0.0
25%,260.75,762.75,134.0,4.0,190.69
50%,562.5,1232.5,215.0,10.0,308.12
75%,864.25,1605.25,271.25,14.0,401.31
max,1166.0,2308.0,429.0,32.0,577.0


Why would we care about token count?

Token count is important to think about because:
1. Embedding models don't deal with infinite tokens.
2. LLMs don't deal with infinite tokens.

For example, an embedding model may have been trained to embed sequences of 384 tokens into numerical space (sentence-transformers `all-mpnet-base-v2`, see: https://www.sbert.net/docs/pretrained_models.html).

As for LLMs, they can't accept infinite tokens in their context window, plus it would not be cost-effective to send hundreds of thousands of tokens to an LLM every time.

We want the tokens we send to an LLM to be valuable tokens.
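
A rough way to sanity-check a piece of text against a model's limit is the ~4 characters per token rule of thumb used in the page statistics above. The sketch below is only an estimate (a real tokenizer will give different counts). The helper names `estimate_token_count` and `fits_in_context` are hypothetical, and the default limit of 384 is the figure quoted above for `all-mpnet-base-v2`.

```python
def estimate_token_count(text: str, chars_per_token: float = 4.0) -> float:
    """Rough estimate: ~4 characters per token (a rule of thumb, not a real tokenizer)."""
    return len(text) / chars_per_token

def fits_in_context(text: str, max_tokens: int = 384) -> bool:
    """Check whether the estimated token count is within a model's token limit."""
    return estimate_token_count(text) <= max_tokens

title = "Human Nutrition: 2020 Edition"
print(estimate_token_count(title))  # 29 characters / 4 = 7.25
print(fits_in_context(title))       # True

# The longest page above has 2308 characters, i.e. an estimated 577 tokens
print(fits_in_context("a" * 2308))  # False
```

Pages that fail this check are one reason to split the text into smaller chunks before embedding.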

In [8]:
# Split text into sentences with spaCy (the sentences are grouped into chunks of ~10 later on)
from spacy.lang.en import English

nlp = English()

#add a sentencizer pipeline
nlp.add_pipe('sentencizer')

doc = nlp("This is a sentence. This is another. Welcome to RAG.")

assert len(list(doc.sents))==3
print(list(doc.sents))

[This is a sentence., This is another., Welcome to RAG.]


In [9]:
for item in tqdm(pages_and_texts):
  #make sure all sentences are strings instead of spacy datatypes
  item['sentences'] = [str(sentence) for sentence in list(nlp(item['text']).sents)]
  #count the sentences
  item['page_sentence_count_spacy'] = len(item['sentences'])

100%|██████████| 1208/1208 [00:01<00:00, 986.16it/s] 


In [11]:
import random
random.seed(14)
random.sample(pages_and_texts, 1)

[{'page_number': 177,
  'page_char_count': 1915,
  'page_word_count': 323,
  'page_setence_count_raw': 23,
  'page_token_count': 478.75,
  'text': 'Sodium Imbalances  Sweating is a homeostatic mechanism for maintaining body  temperature, which influences fluid and electrolyte balance. Sweat  is mostly water but also contains some electrolytes, mostly sodium  and chloride. Under normal environmental conditions (i.e., not hot,  humid days) water and sodium loss through sweat is negligible,  but is highly variable among individuals. It is estimated that sixty  minutes of high-intensity physical activity, like playing a game of  tennis, can produce approximately one liter of sweat; however the  amount of sweat produced is highly dependent on environmental  conditions. A liter of sweat typically contains between 1 and 2 grams  of sodium and therefore exercising for multiple hours can result in a  high amount of sodium loss in some people. Additionally, hard labor  can produce substantial so

In [12]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_setence_count_raw,page_token_count,page_sentence_count_spacy
count,1208.0,1208.0,1208.0,1208.0,1208.0,1208.0
mean,562.5,1148.59,198.89,9.97,287.15,10.32
std,348.86,560.44,95.75,6.19,140.11,6.3
min,-41.0,0.0,1.0,1.0,0.0,0.0
25%,260.75,762.75,134.0,4.0,190.69,5.0
50%,562.5,1232.5,215.0,10.0,308.12,10.0
75%,864.25,1605.25,271.25,14.0,401.31,15.0
max,1166.0,2308.0,429.0,32.0,577.0,28.0


### Chunking our sentences together

The concept of splitting larger pieces of text into smaller ones is often referred to as text splitting or chunking.

There is no 100% correct way to do this.

We'll keep it simple and split into groups of 10 sentences (however, you could also try 5, 7, 8, whatever you like).

There are frameworks such as LangChain which can help with this, however, we'll stick with Python for now: https://python.langchain.com/docs/modules/data_connection/document_transformers/

Why we do this:
1. So our texts are easier to filter (smaller groups of text can be easier to inspect than large passages of text).
2. So our text chunks can fit into our embedding model context window (e.g. 384 tokens as a limit).
3. So our contexts passed to an LLM can be more specific and focused.
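
As a back-of-the-envelope check on the chunk size, we can reuse the averages from the `df.describe()` output above. This is only an average-case estimate under the ~4 characters per token assumption, so individual chunks can still end up over the limit.

```python
# Averages taken from the df.describe() output above
mean_tokens_per_page = 287.15     # page_token_count mean (estimated at ~4 chars per token)
mean_sentences_per_page = 10.32   # page_sentence_count_spacy mean

embedding_model_limit = 384       # token limit quoted above for all-mpnet-base-v2
num_sentence_chunk_size = 10

tokens_per_sentence = mean_tokens_per_page / mean_sentences_per_page
tokens_per_chunk = tokens_per_sentence * num_sentence_chunk_size

print(f"~{tokens_per_sentence:.1f} tokens per sentence")
print(f"~{tokens_per_chunk:.0f} tokens per {num_sentence_chunk_size}-sentence chunk")
print(f"Fits on average: {tokens_per_chunk <= embedding_model_limit}")
```

On average a 10-sentence chunk sits comfortably under 384 estimated tokens, which is why 10 is a reasonable starting point here.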

In [13]:
# Define split size to turn groups of sentences into chunks
num_sentence_chunk_size = 10

# Create a function to split a list into consecutive slices of at most slice_size items
# e.g. 20 items -> [10, 10] or 25 items -> [10, 10, 5]
def split_list(input_list: list[str],
               slice_size: int=num_sentence_chunk_size) -> list[list[str]]:
    return [input_list[i:i+slice_size] for i in range(0, len(input_list), slice_size)]

test_list = list(range(25))
split_list(test_list)

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24]]

In [14]:
# Group each page's sentences into chunks and record how many chunks the page produced
for item in tqdm(pages_and_texts):
    item['sentence_chunks'] = split_list(item['sentences'], slice_size=num_sentence_chunk_size)
    item['num_chunks'] = len(item['sentence_chunks'])




100%|██████████| 1208/1208 [00:00<00:00, 639172.35it/s]


In [15]:
random.sample(pages_and_texts,1)

[{'page_number': 1038,
  'page_char_count': 1156,
  'page_word_count': 209,
  'page_setence_count_raw': 9,
  'page_token_count': 289.0,
  'text': 'to any mold spores hanging in the air. Use plastic wrap to cover  foods that you want to remain moist, such as fresh fruits, vegetables,  and salads. After a meal, do not keep leftovers at room temperature  for more than two hours. They should be refrigerated as promptly  as possible. It is also helpful to date leftovers, so they can be used  within a safe time, which is generally three to five days when stored  in a refrigerator.  Learning Activities  Technology Note: The second edition of the Human  Nutrition Open Educational Resource (OER) textbook  features interactive learning activities.\xa0 These activities are  available in the web-based textbook and not available in the  downloadable versions (EPUB, Digital PDF, Print_PDF, or  Open Document).  Learning activities may be used across various mobile  devices, however, for the best user

In [16]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_setence_count_raw,page_token_count,page_sentence_count_spacy,num_chunks
count,1208.0,1208.0,1208.0,1208.0,1208.0,1208.0,1208.0
mean,562.5,1148.59,198.89,9.97,287.15,10.32,1.53
std,348.86,560.44,95.75,6.19,140.11,6.3,0.64
min,-41.0,0.0,1.0,1.0,0.0,0.0,0.0
25%,260.75,762.75,134.0,4.0,190.69,5.0,1.0
50%,562.5,1232.5,215.0,10.0,308.12,10.0,1.0
75%,864.25,1605.25,271.25,14.0,401.31,15.0,2.0
max,1166.0,2308.0,429.0,32.0,577.0,28.0,3.0


### Splitting each chunk into its own item

We'd like to embed each chunk of sentences into its own numerical representation.

That'll give us a good level of granularity.

Meaning, we can see exactly which text sample was retrieved and passed to our model.

In [17]:
import re

#split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pages_and_texts):
  for sentence_chunk in item['sentence_chunks']:
    chunk_dict = {}
    chunk_dict['page_number'] = item['page_number']
    #join sentences into one paragraph (they are joined without a separator, so the regex below re-inserts spacing after full stops)
    joined_sentence_chunk = ''.join(sentence_chunk).replace("  ", " ").strip()
    joined_sentence_chunk = re.sub(r'\.([A-Z])', r'.  \1', joined_sentence_chunk) # ".A" => ".  A" (note: inserts two spaces; works for any capital letter)

    chunk_dict['sentence_chunk'] = joined_sentence_chunk
    #get some stats
    chunk_dict['chunk_char_count'] = len(joined_sentence_chunk)
    chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
    chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # 1 token = ~4 chars

    pages_and_chunks.append(chunk_dict)

len(pages_and_chunks)


100%|██████████| 1208/1208 [00:00<00:00, 38501.20it/s]


1843

In [19]:
random.sample(pages_and_chunks,1)

[{'page_number': 341,
  'sentence_chunk': 'Trans fatty acids occur in small amounts in nature, mostly in dairy products.  However, the trans fats that are used by the food industry are produced from the hydrogenation process.  Trans fats are a result of the partial hydrogenation of unsaturated fatty acids, which cause them to have a trans configuration, rather than the naturally occurring cis configuration. Health Implications of Trans Fats No trans fats!Zero trans fats!We see these advertisements on a regular basis.  So widespread is the concern over the issue that restaurants, food manufacturers, and even fast-food establishments proudly tout either the absence or the reduction of these fats within their products.  Amid the growing awareness that trans fats may not be good for you, let’s get right to the heart of the matter.  Why are trans fats so bad? Lipids and the Food Industry | 341',
  'chunk_char_count': 858,
  'chunk_word_count': 145,
  'chunk_token_count': 214.5}]

In [20]:
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)
# Note: some chunks have more than 384 (estimated) tokens, so they may get truncated by the embedding model

Unnamed: 0,page_number,chunk_char_count,chunk_word_count,chunk_token_count
count,1843.0,1843.0,1843.0,1843.0
mean,583.38,738.16,116.05,184.54
std,347.79,449.33,72.99,112.33
min,-41.0,12.0,3.0,3.0
25%,280.5,317.5,46.0,79.38
50%,586.0,749.0,117.0,187.25
75%,890.0,1125.5,178.0,281.38
max,1166.0,1838.0,304.0,459.5
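
The `max` row shows at least one chunk with an estimated 459.5 tokens, which is above the 384-token figure quoted earlier for the embedding model. A minimal sketch for flagging such chunks is below. The helper name `find_oversized_chunks` is hypothetical and the two example dicts are stand-in data with made-up text (in the notebook you would pass `pages_and_chunks` instead).

```python
def find_oversized_chunks(chunks: list[dict], max_tokens: int = 384) -> list[dict]:
    """Return the chunks whose estimated token count exceeds max_tokens."""
    return [chunk for chunk in chunks if chunk["chunk_token_count"] > max_tokens]

# Stand-in data with the same keys as the items in pages_and_chunks (values are illustrative)
example_chunks = [
    {"page_number": 1, "sentence_chunk": "a" * 858, "chunk_token_count": 858 / 4},
    {"page_number": 2, "sentence_chunk": "b" * 1838, "chunk_token_count": 1838 / 4},
]

oversized = find_oversized_chunks(example_chunks)
print(len(oversized))                     # 1
print(oversized[0]["chunk_token_count"])  # 459.5
```

An oversized chunk is not necessarily an error. As far as I know, sentence-transformers models truncate input beyond their maximum sequence length, so the tail of such a chunk would simply not be reflected in its embedding.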


Next, filter out chunks of text that are too short. These chunks (often just page headers or footers) may not contain much useful information.

In [47]:
min_token_len = 30
# Show a random sample of chunks at or below the minimum token length
for _, row in df[df['chunk_token_count'] <= min_token_len].sample(5).iterrows():
    print(f'Chunk token count: {row["chunk_token_count"]} | Text: {row["sentence_chunk"]}')



Chunk token count: 24.25 | Text: There are several lecithin supplements on the market Nonessential and Essential Fatty Acids | 315
Chunk token count: 16.25 | Text: Health Consequences and Benefits of High-Carbohydrate Diets | 267
Chunk token count: 29.75 | Text: 2.  Lacto-vegetarian.  This type of vegetarian diet includes dairy products but not eggs. Lifestyles and Nutrition | 27
Chunk token count: 11.0 | Text: 978 | Food Supplements and Food Replacements
Chunk token count: 16.0 | Text: Accessed January 20, 2018. The Effect of New Technologies | 1031


In [83]:
# Keep only the chunks above the minimum token length
pages_and_chunks_over_min_len = df[df['chunk_token_count'] > min_token_len].to_dict(orient='records')
pages_and_chunks_over_min_len[:2]

[{'page_number': -39,
  'sentence_chunk': 'Human Nutrition: 2020 Edition UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN NUTRITION PROGRAM ALAN TITCHENAL, SKYLAR HARA, NOEMI ARCEO CAACBAY, WILLIAM MEINKE-LAU, YA-YUN YANG, MARIE KAINOA FIALKOWSKI REVILLA, JENNIFER DRAPER, GEMADY LANGFELDER, CHERYL GIBBY, CHYNA NICOLE CHUN, AND ALLISON CALABRESE',
  'chunk_char_count': 308,
  'chunk_word_count': 42,
  'chunk_token_count': 77.0},
 {'page_number': -38,
  'sentence_chunk': 'Human Nutrition: 2020 Edition by University of Hawai‘i at Mānoa Food Science and Human Nutrition Program is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.',
  'chunk_char_count': 210,
  'chunk_word_count': 30,
  'chunk_token_count': 52.5}]

In [75]:
random.sample(pages_and_chunks_over_min_len,k=1)

[{'page_number': 913,
  'sentence_chunk': 'Middle Age UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN NUTRITION PROGRAM AND HUMAN NUTRITION PROGRAM Middle age is defined as the period from age thirty-one to fifty.  The early period of this stage is very different from the end.  For example, during the early years of middle age, many women experience pregnancy, childbirth, and lactation.  In the latter part of this life stage, women face perimenopause, which is a transition period that leads up to menopause, or the end of menstruation.  A number of physical changes take place in the middle-aged years, including the loss of bone mass in women due to dropping levels of estrogen during menopause.  In both men and women, visual acuity declines, and by age forty there can be a decreased ability to see objects at a close distance, a condition known as presbyopia.1 All of these are signs of aging, as the human body begins to change in subtle and not-so-subtle ways.  However, a middle age

### Embedding our text chunks

While humans understand text, machines understand numbers best.

An [embedding](https://vickiboykis.com/what_are_embeddings/index.html) is a broad concept.

But one of my favourite simple definitions is "a useful numerical representation".

The most powerful thing about modern embeddings is that they are *learned* representations.

Meaning rather than directly mapping words/tokens/characters to numbers (e.g. `{"a": 0, "b": 1, "c": 2...}`), the numerical representation of tokens is learned by going through large corpora of text and figuring out how different tokens relate to each other.

Ideally, embeddings of text will mean that texts with similar meanings have similar numerical representations.
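
One common way to measure how "similar" two numerical representations are is cosine similarity. The sketch below uses toy 3-dimensional vectors that are made up for illustration (real embedding models produce hundreds of dimensions), so treat it as a picture of the idea rather than real model output.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (made-up numbers, not produced by a real model)
protein_text = [0.9, 0.1, 0.0]
amino_acid_text = [0.8, 0.2, 0.1]
weather_text = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(protein_text, amino_acid_text)
sim_unrelated = cosine_similarity(protein_text, weather_text)
print(f"related: {sim_related:.2f}, unrelated: {sim_unrelated:.2f}")
```

The two vectors standing in for related texts point in nearly the same direction and score close to 1, while the unrelated pair scores close to 0.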

> **Note:** Most modern NLP models deal with "tokens", which can be whole words, parts of words or single characters (and combinations of these) rather than always whole words or always single characters. For example, the string `"hello world!"` gets mapped to the token values `{15339: b'hello', 1917: b' world', 0: b'!'}` using [Byte pair encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding) (or BPE via OpenAI's [`tiktoken`](https://github.com/openai/tiktoken) library). Google has a tokenization library called [SentencePiece](https://github.com/google/sentencepiece).
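
To give a feel for how byte pair encoding builds tokens, here is a toy sketch of its core step: repeatedly merging the most frequent adjacent pair. It is a simplification under stated assumptions. It works on characters of a single string, whereas real tokenizers such as `tiktoken` work on bytes with merge tables learned from large corpora. The function names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(tokens: list[str]) -> tuple[str, str]:
    """Find the adjacent token pair that occurs most often (first seen wins ties)."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    return pair_counts.most_common(1)[0][0]

def merge_pair(tokens: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("low lower lowest")  # start from single characters
for _ in range(2):                 # apply two merge steps
    pair = most_frequent_pair(tokens)
    tokens = merge_pair(tokens, pair)
    print(pair, tokens)
```

After two merges the frequent character sequence `low` has become a single token, while the rarer endings are still split into characters.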

Our goal is to turn each of our chunks into a numerical representation (an embedding vector, where a vector is a sequence of numbers arranged in order).

Once our text samples are in embedding vectors, we humans will no longer be able to understand them.

However, we don't need to.

The embedding vectors are for our computers to understand.

We'll use our computers to find patterns in the embeddings and then we can use their text mappings to further our understanding.

Enough talking, how about we import a text embedding model and see what an embedding looks like?

To do so, we'll use the [`sentence-transformers`](https://www.sbert.net/docs/installation.html) library which contains many pre-trained embedding models.

Specifically, we'll get the `all-mpnet-base-v2` model (you can see the model's intended use on the [Hugging Face model card](https://huggingface.co/sentence-transformers/all-mpnet-base-v2#intended-uses)).

In [79]:
# Requires !pip install sentence-transformers
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", 
                                      device="mps") # choose the device to load the model to: "mps" is the Apple Silicon GPU, use "cuda" or "cpu" on other machines (note: GPU will often be *much* faster than CPU)

# Create a list of sentences to turn into numbers
sentences = [
    "The Sentences Transformers library provides an easy and open-source way to create embeddings.",
    "Sentences can be embedded one by one or as a list of strings.",
    "Embeddings are one of the most powerful concepts in machine learning!",
    "Learn to use embeddings well and you'll be well on your way to being an AI engineer."
]

# Sentences are encoded/embedded by calling model.encode()
embeddings = embedding_model.encode(sentences)
embeddings_dict = dict(zip(sentences, embeddings))

# See the embeddings
for sentence, embedding in embeddings_dict.items():
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

Sentence: The Sentences Transformers library provides an easy and open-source way to create embeddings.
Embedding: [-2.07981002e-02  3.03164218e-02 -2.01218333e-02  6.86483607e-02
 -2.55255681e-02 -8.47688783e-03 -2.07074991e-04 -6.32376373e-02
  2.81606670e-02 -3.33353393e-02  3.02634798e-02  5.30720688e-02
 -5.03526963e-02  2.62287669e-02  3.33313756e-02 -4.51578386e-02
  3.63044068e-02 -1.37108692e-03 -1.20170591e-02  1.14947157e-02
  5.04510514e-02  4.70857583e-02  2.11913139e-02  5.14607467e-02
 -2.03746296e-02 -3.58889587e-02 -6.67867833e-04 -2.94392928e-02
  4.95858751e-02 -1.05640218e-02 -1.52013293e-02 -1.31751422e-03
  4.48196568e-02  1.56023316e-02  8.60379942e-07 -1.21393392e-03
 -2.37978380e-02 -9.09428170e-04  7.34483870e-03 -2.53930432e-03
  5.23370281e-02 -4.68042754e-02  1.66215152e-02  4.71578725e-02
 -4.15599868e-02  9.01964260e-04  3.60279046e-02  3.42215002e-02
  9.68226567e-02  5.94828874e-02 -1.64984781e-02 -3.51249427e-02
  5.92513289e-03 -7.07964587e-04 -2.4103

In [82]:
emb = embedding_model.encode('My name is your problem')
emb

array([ 6.82840496e-02,  6.59643635e-02, -2.83903703e-02,  3.65828909e-02,
       -1.12695638e-02,  5.52323796e-02,  1.17913142e-01,  3.59581225e-02,
       -2.63561159e-02,  2.28422321e-02,  6.05239160e-02, -3.68684623e-03,
        2.41252407e-02,  2.29606517e-02, -2.81991884e-02,  7.65418634e-03,
       -1.98292406e-03,  1.17344770e-03,  7.78795257e-02, -3.03520653e-02,
        2.23127864e-02, -2.60733571e-02,  7.00846734e-03,  2.49813544e-03,
       -2.85164304e-02, -3.44413035e-02,  4.82039973e-02, -2.24039629e-02,
        4.39252844e-03, -7.22346734e-03,  1.35754067e-02,  5.94723271e-03,
       -4.30344082e-02,  1.14190998e-02,  1.87283024e-06, -2.07082368e-02,
        2.70835608e-02, -1.44090420e-02, -2.54005082e-02,  1.55789172e-02,
       -2.14378592e-02, -4.33755405e-02, -7.54963188e-03, -1.23396460e-02,
       -3.67865674e-02,  4.08756919e-02,  1.10779041e-02,  8.55262130e-02,
        4.13690992e-02,  1.58441276e-03, -7.55972788e-03, -1.83783863e-02,
       -1.72755755e-02, -

In [85]:
pages_and_chunks_over_min_len[0].keys()

dict_keys(['page_number', 'sentence_chunk', 'chunk_char_count', 'chunk_word_count', 'chunk_token_count'])

In [91]:
%%time
# Move the model to the CPU to time embedding creation there
embedding_model.to('cpu')

# Embed each chunk one by one
for item in tqdm(pages_and_chunks_over_min_len):
    item['embedding'] = embedding_model.encode(item['sentence_chunk'])

100%|██████████| 1681/1681 [02:16<00:00, 12.28it/s]

CPU times: user 11min 56s, sys: 8min 51s, total: 20min 47s
Wall time: 2min 16s





In [95]:
text_chunks = [item['sentence_chunk'] for item in pages_and_chunks_over_min_len]
 

In [96]:
%%time

# Embed all texts in batches
text_chunk_embeddings = embedding_model.encode(text_chunks,
                                               batch_size=32, # you can use different batch sizes here for speed/performance, I found 32 works well for this use case
                                               convert_to_tensor=True) # optional to return embeddings as tensor instead of array

text_chunk_embeddings

CPU times: user 10min 26s, sys: 4min 32s, total: 14min 59s
Wall time: 1min 44s


tensor([[ 0.0674,  0.0902, -0.0051,  ..., -0.0221, -0.0232,  0.0126],
        [ 0.0552,  0.0592, -0.0166,  ..., -0.0120, -0.0103,  0.0227],
        [ 0.0280,  0.0340, -0.0206,  ..., -0.0054,  0.0213,  0.0313],
        ...,
        [ 0.0771,  0.0098, -0.0122,  ..., -0.0409, -0.0752, -0.0241],
        [ 0.1030, -0.0165,  0.0083,  ..., -0.0574, -0.0283, -0.0295],
        [ 0.0864, -0.0125, -0.0113,  ..., -0.0522, -0.0337, -0.0299]])
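
Passing the whole list to `encode()` with `batch_size=32` lets the model process the texts in groups rather than one at a time, and the batched run above finished faster than the one-by-one loop (1min 44s vs 2min 16s wall time). As a rough illustration of the batching idea only (this is not how sentence-transformers implements it internally, and `batch_items` is a hypothetical helper), splitting 1681 chunks into batches of 32 looks like this:

```python
def batch_items(items: list[str], batch_size: int = 32) -> list[list[str]]:
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# Stand-in for the 1681 text chunks embedded above (contents are placeholders)
placeholder_chunks = [f"chunk {i}" for i in range(1681)]
batches = batch_items(placeholder_chunks, batch_size=32)

print(len(batches))      # 53 batches (52 full batches plus one partial batch)
print(len(batches[-1]))  # 17 items in the final batch
```

Larger batches mean fewer forward passes through the model but more memory per pass, so the best batch size depends on your hardware.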