# Basic Large Language Model (LLM) concepts and their application

To build an application with LLMs, we need to become familiar with several basic concepts first. We will start with text processing.

## Loading PDF documents

Initially, the texts we wish to analyze will be stored in some file format. Ideally, this format will be `.txt`. This is seldom the case, however, as it is much more common for texts to be stored in PDF format. Luckily, there are tools built specifically for these cases.

We must take into account, however, that although these tools are designed for loading and processing PDF documents, the format has some intrinsic limitations that make this task harder. The most relevant limitations for our goals concern images, mathematical expressions, and tables. PDF text loaders ignore images, and they flatten tables and mathematical expressions into raw text. New approaches are being developed to address these issues, such as processing PDFs with AI models. These approaches are currently less flexible and less compatible with other tools, and they are outside the scope of this session. For now, we will use the simplest approach and keep its limitations in mind.

As a working example, we will use the article ["Learning from Shared News: When Abundant Information Leads to Belief Polarization" by Bowen, Dmitriev and Galperti (2022)](https://www.nber.org/system/files/working_papers/w28465/w28465.pdf). We will be using the `pypdf` package. We start by downloading the article with `requests` and saving it in the current working directory. As a technical side note, Colab runs inside a container that gives us access to a Linux file system; our default working directory is `/content`.

In [None]:
!pip install openai tiktoken pypdf chromadb langchain

Collecting openai
  Downloading openai-1.13.3-py3-none-any.whl (227 kB)
Collecting tiktoken
  Downloading tiktoken-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.8 MB)
Collecting pypdf
  Downloading pypdf-4.0.2-py3-none-any.whl (283 kB)
Collecting chromadb
  Downloading chromadb-0.4.24-py3-none-any.whl (525 kB)
Collecting langchain
  Downloading langchain-0.1.10-py3-none-any.whl (806 kB)

In [None]:
import requests

url = r"https://www.nber.org/system/files/working_papers/w28465/w28465.pdf"
response = requests.get(url)

with open("bdg_2022.pdf", "wb") as document:
    document.write(response.content)

Now that we have downloaded the document, we can load it into memory for processing. We start by importing the `PdfReader` class from `pypdf`.

In [None]:
from pypdf import PdfReader

A `PdfReader` object stores the information for each page of the loaded PDF document in its `pages` attribute, a sequence that can be indexed and iterated over.

In [None]:
loadedPdf = PdfReader("bdg_2022.pdf")
print("Number of pages:", len(loadedPdf.pages), "\n\n")
loadedPdf.pages[39]

Number of pages: 75 




{'/Annots': [IndirectObject(795, 0, 137890570171488),
  IndirectObject(796, 0, 137890570171488),
  IndirectObject(797, 0, 137890570171488),
  IndirectObject(798, 0, 137890570171488),
  IndirectObject(799, 0, 137890570171488)],
 '/Contents': {'/Filter': '/FlateDecode'},
 '/CropBox': [0, 0, 612, 792],
 '/MediaBox': [0, 0, 612, 792],
 '/Parent': {'/Count': 6,
  '/Kids': [IndirectObject(119, 0, 137890570171488),
   IndirectObject(121, 0, 137890570171488),
   IndirectObject(124, 0, 137890570171488),
   IndirectObject(127, 0, 137890570171488),
   IndirectObject(129, 0, 137890570171488),
   IndirectObject(132, 0, 137890570171488)],
  '/Parent': {'/Count': 36,
   '/Kids': [IndirectObject(1391, 0, 137890570171488),
    IndirectObject(1392, 0, 137890570171488),
    IndirectObject(1393, 0, 137890570171488),
    IndirectObject(1394, 0, 137890570171488),
    IndirectObject(1395, 0, 137890570171488),
    IndirectObject(1396, 0, 137890570171488)],
   '/Parent': {'/Count': 75,
    '/Kids': [IndirectOb

This document contains 75 pages. As we can see, the information for each page (here page 40, at index 39) is stored as a dictionary of PDF objects that is not useful in its current form. To obtain readable text, we must parse the page with the `extract_text` method.

In [None]:
print(loadedPdf.pages[1].extract_text())

Learning from Shared News: When Abundant Information Leads to Belief Polarization  
Renee Bowen, Danil Dmitriev, and Simone Galperti
NBER Working Paper No. 28465
February 2021, Revised March 2022
JEL No. D82,D83,D90
ABSTRACT
We study learning via shared news. Each period agents receive the same quantity and quality of 
first-hand information and can share it with friends. Some friends (possibly few) share 
selectively, generating  heterogeneous news diets across agents akin to echo chambers. Agents are 
aware of selective sharing and update beliefs by Bayes' rule. Contrary to standard learning results, 
we show that beliefs can diverge  in this environment leading to polarization. This requires that (i) 
agents hold misperceptions (even minor) about friends' sharing and (ii) information quality is 
sufficiently low. Polarization can worsen when agents' social connections expand. When the 
quantity of first-hand information becomes large,  agents can hold opposite extreme beliefs 
resul

The paragraphs are now readable. However, the conversion also exposes the limitations of the PDF format: once the article has been converted into plain text, mathematical expressions are no longer displayed correctly.

Now that we know how to convert each page of the document into a string, we can repeat this procedure over the whole document with a function that takes the file path as input and returns a string with all the text in the document. The steps in this function are:

1. Load the file in the path provided
2. Initialize an empty string
3. For each page: extract the text and add it to the `string`

This procedure could take up to a minute for a document of this size.

In [None]:
def pdfToString(path):
    documentText = ""
    loadedPdf = PdfReader(path)
    for page in loadedPdf.pages:
        documentText += page.extract_text()
    return documentText

In [None]:
bookText = pdfToString("bdg_2022.pdf")

In [None]:
print(bookText)

NBER WORKING PAPER SERIES
LEARNING FROM SHARED NEWS:
WHEN ABUNDANT INFORMATION LEADS TO BELIEF POLARIZATION
Renee Bowen
Danil Dmitriev
Simone Galperti
Working Paper 28465
http://www.nber.org/papers/w28465
NATIONAL BUREAU OF ECONOMIC RESEARCH
1050 Massachusetts Avenue
Cambridge, MA 02138
February 2021, Revised March 2022
We thank S. Nageeb Ali, Myles Ellis, Harry Pei, Jacopo Perego, Joel Sobel, and participants in 
conferences  and seminars at UCSD, PhDEI, SIOE, PennState, NBER POL, UC Berkeley, UC 
Davis, NYU, MEDS, Harvard, Chicago, WEAI, NASMES, CETC, and Stony Brook for helpful 
comments and suggestions. All remaining errors are ours. The views expressed herein are those 
of the authors and do not necessarily  reflect the views of the National Bureau of Economic 
Research.
NBER working papers are circulated for discussion and comment purposes. They have not been 
peer-reviewed or been subject to the review by the NBER Board of Directors that accompanies 
official NBER publications.


We can see that the text is quite long. Now is a good time to explain one of the limitations of LLMs: the maximum number of *tokens* in the input.

## Tokenizing

LLMs are, fundamentally, neural networks. These networks do not take raw text as input; the text is first encoded into tokens. Each token is an integer that represents a common sequence of characters, which makes it possible to represent a given text with a smaller amount of memory. We will see an example with the first verse of "Contigo Perú" by Augusto Polo Campos. We will use the `tiktoken` package by OpenAI. GPT-3.5 and GPT-4 both use the `cl100k_base` tokenizer.


In [None]:
import tiktoken

lyrics = """
Cuando despiertan mis ojos y veo
Que sigo viviendo contigo, Perú
Emocionado doy gracias al cielo
Por darme la vida contigo, Perú
"""

tokenizer = tiktoken.get_encoding("cl100k_base")
tokenizedLyrics = tokenizer.encode(lyrics)
tokenizedLyrics

[198,
 45919,
 4988,
 951,
 2554,
 531,
 276,
 5906,
 297,
 40261,
 379,
 5320,
 78,
 198,
 26860,
 274,
 7992,
 18434,
 37116,
 687,
 7992,
 11,
 3700,
 6792,
 198,
 2321,
 511,
 290,
 2172,
 656,
 88,
 67648,
 453,
 12088,
 20782,
 198,
 29197,
 294,
 74960,
 1208,
 25994,
 687,
 7992,
 11,
 3700,
 6792,
 198]

The tokenizing process is reversible: decoding the list of tokens returns the original text.

In [None]:
print(tokenizer.decode(tokenizedLyrics), "\n")
tokenizer.decode(tokenizedLyrics) == lyrics


Cuando despiertan mis ojos y veo
Que sigo viviendo contigo, Perú
Emocionado doy gracias al cielo
Por darme la vida contigo, Perú
 



True
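
To make the idea of "integers that stand for common character sequences" concrete, here is a minimal toy sketch. The vocabulary `TOY_VOCAB` and the helpers `toy_encode` and `toy_decode` are hypothetical and only for illustration; this greedy longest-match scheme is not how `cl100k_base` builds or applies its vocabulary.

```python
# Toy vocabulary: each known character sequence maps to an integer ID.
# Hypothetical example, unrelated to the real cl100k_base vocabulary.
TOY_VOCAB = ["contigo", "vida", "con", "da", " ",
             "c", "o", "n", "t", "i", "g", "v", "d", "a"]
TOKEN_TO_ID = {piece: index for index, piece in enumerate(TOY_VOCAB)}


def toy_encode(text):
    """Greedily match the longest known piece at each position."""
    pieces_by_length = sorted(TOY_VOCAB, key=len, reverse=True)
    ids = []
    position = 0
    while position < len(text):
        for piece in pieces_by_length:
            if text.startswith(piece, position):
                ids.append(TOKEN_TO_ID[piece])
                position += len(piece)
                break
        else:
            raise ValueError(f"No token covers {text[position]!r}")
    return ids


def toy_decode(ids):
    """Map each ID back to its character sequence and concatenate."""
    return "".join(TOY_VOCAB[token_id] for token_id in ids)


common = toy_encode("vida contigo")  # frequent words: one token each
rare = toy_encode("cita")            # unknown word: one token per character
print(common, rare)
print(toy_decode(common) == "vida contigo")
```

Sequences that appear in the vocabulary cost a single token, while text the vocabulary does not cover as a whole falls back to smaller pieces and therefore needs more tokens.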

An LLM accepts a limited number of tokens (its context window), so the text we intend to analyze must fit within that limit. The `gpt-3.5-turbo-0125` model accepts a maximum of 16,385 tokens, of which at most 4,096 can be used for the output. Let's see how many tokens are needed to encode this article.

In [None]:
len(tokenizer.encode(bookText))

54883

54,883 tokens is far above our limit. To deal with this issue, we must split the text into chunks.
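
A quick back-of-the-envelope calculation shows how far over the limit we are. The sketch assumes we reserve the full output allowance and ignore the tokens needed for instructions, so it gives only a lower bound on the number of pieces; the variable names are illustrative.

```python
import math

article_tokens = 54_883    # token count of the article, computed above
context_window = 16_385    # gpt-3.5-turbo-0125 context window
max_output_tokens = 4_096  # maximum tokens the model can generate

# Assumption: the whole output allowance is reserved, the rest is input.
input_budget = context_window - max_output_tokens
min_pieces = math.ceil(article_tokens / input_budget)

print(input_budget, min_pieces)
```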

## Chunking

The simplest way to generate these chunks would be to pick a length $a$ and split the text every $a$ characters. The issue with this approach is that many of the breaking points would land in the middle of a word, leaving fragments that have lost their meaning and that the LLM cannot interpret properly. Likewise, if a breaking point lands in the middle of a sentence, we lose part of the information it contains and the sentence may become unintelligible.
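
The problem is easy to reproduce. The sketch below uses a hypothetical helper, `naive_chunks`, on a sentence taken from the article's abstract:

```python
def naive_chunks(text, size):
    """Cut `text` every `size` characters, ignoring word boundaries."""
    return [text[start:start + size] for start in range(0, len(text), size)]


sample = "Polarization can worsen when agents' social connections expand."

# Printing with repr() makes the broken words at the edges visible.
for piece in naive_chunks(sample, 16):
    print(repr(piece))
```

No characters are lost, but several pieces begin or end in the middle of a word.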

A common approach to text splitting is to use a recursive character splitter. These splitters place breaking points at specific characters, such as periods or line breaks. After generating chunks at these breaking points, the splitter checks whether each chunk is, at most, the desired length. Chunks that are too long go through the procedure again with the next separator (hence the "recursive" nature of the splitter) until all the fragments fit.
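
The idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the `langchain` implementation: the function name `recursive_split` is hypothetical, length is measured in characters rather than tokens, separators are dropped instead of kept, and there is no overlap between chunks.

```python
def recursive_split(text, max_length, separators=("\n\n", "\n", ". ", " ")):
    """Split `text` into pieces of at most `max_length` characters."""
    if len(text) <= max_length:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut.
        return [text[i:i + max_length] for i in range(0, len(text), max_length)]

    separator, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(separator):
        candidate = part if not current else current + separator + part
        if len(candidate) <= max_length:
            current = candidate  # keep merging small parts together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= max_length:
            current = part
        else:
            # The part alone is too long: retry with finer separators.
            chunks.extend(recursive_split(part, max_length, remaining))
    if current:
        chunks.append(current)
    return chunks


text = ("Agents share news. Some friends share selectively.\n\n"
        "Beliefs can diverge. Polarization can worsen when connections expand.")
for chunk in recursive_split(text, 30):
    print(repr(chunk))
```

Coarse separators are tried first, so paragraphs and sentences stay together whenever they fit, and single words are only separated from their neighbors as a last resort.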

The `langchain` package provides a splitter of this kind. First, we need to build a function that takes a `string` and measures the amount of tokens generated for it.

In [None]:
def tokenCounter(text):
    return len(tokenizer.encode(text))

The splitter will use this function to determine whether it is necessary to split each subset of text. Now we generate the chunks.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

textSplitter = RecursiveCharacterTextSplitter(
    chunk_size=8192,
    chunk_overlap=50,
    length_function=tokenCounter,
    separators=["\n\n", ".", "\n", " "]
)

chunks = textSplitter.create_documents(
    [bookText],
    metadatas=[{"author": "Bowen, Dmitriev and Galperti", "year": "2022"}]
)

len(chunks)

7

We have generated 7 chunks, each with close to 8,000 tokens' worth of text. Let's take a look at one of these chunks.

In [None]:
print(chunks[0].page_content)

NBER WORKING PAPER SERIES
LEARNING FROM SHARED NEWS:
WHEN ABUNDANT INFORMATION LEADS TO BELIEF POLARIZATION
Renee Bowen
Danil Dmitriev
Simone Galperti
Working Paper 28465
http://www.nber.org/papers/w28465
NATIONAL BUREAU OF ECONOMIC RESEARCH
1050 Massachusetts Avenue
Cambridge, MA 02138
February 2021, Revised March 2022
We thank S. Nageeb Ali, Myles Ellis, Harry Pei, Jacopo Perego, Joel Sobel, and participants in 
conferences  and seminars at UCSD, PhDEI, SIOE, PennState, NBER POL, UC Berkeley, UC 
Davis, NYU, MEDS, Harvard, Chicago, WEAI, NASMES, CETC, and Stony Brook for helpful 
comments and suggestions. All remaining errors are ours. The views expressed herein are those 
of the authors and do not necessarily  reflect the views of the National Bureau of Economic 
Research.
NBER working papers are circulated for discussion and comment purposes. They have not been 
peer-reviewed or been subject to the review by the NBER Board of Directors that accompanies 
official NBER publications.


Chunk size has other implications as well. Bigger chunks contain more information, but this may be more than can be meaningfully processed: some of it may be irrelevant to a given query and act as noise that lowers the quality of the text the LLM generates. Larger inputs also take longer to process. We must weigh these trade-offs when we chunk text. In our case, we will split the text into chunks of at most 256 tokens.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

textSplitter = RecursiveCharacterTextSplitter(
    chunk_size=256,
    chunk_overlap=20,
    length_function=tokenCounter
)

chunks = textSplitter.create_documents(
    [bookText],
    metadatas=[{"author": "Bowen, Dmitriev and Galperti", "year": "2022"}]
)

len(chunks)

238

In [None]:
print(chunks[13].page_content)

agent’s posterior converges to a belief that assigns probability one to the state favored by
the majority of her dogmatic friends, irrespective of the truth. For higher quality, her belief
converges to the truth despite the eﬀects of her echo chamber. These short- and long-run
properties hold even if the agent does not take shared signals at face value (i.e., not just for
ˆγ≈0). They remain true when selection neglect is minimal, i.e., when ˆγis very close to
the true γ. Thus, even minor misperceptions can distort learning.
Our second contribution is to show how these distorting forces at the individual level
can cause polarization at the social level. To begin, we emphasize the central role of infor-
mation quality. If some agents have unbalanced echo chambers towards diﬀerent states and
information quality is suﬃciently low, their beliefs will move apart on average in the short
run and almost surely in the long run. However, note that in our setting polarization does
not mean that al

These texts are useful for generating answers to users' questions. However, we cannot yet work with them directly: the collection of chunks is quite large, and a specific question may only require the information contained in a few of them. How do we pick the most relevant chunks?

## Embeddings, knowledge base, and retrieval

Picking the most relevant documents requires a measure of relevance. This might seem like a very subjective notion (and, fundamentally, it is), but we can quantify it thanks to Natural Language Processing and some math. First, we must generate embeddings: vectors that encode the semantic content of a text. Both our chunks and the user's query can be encoded this way. Because each vector represents semantic content along $n$ dimensions, we expect the embedding of a relevant chunk to be close to the embedding of the question, given some distance function.
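
As a minimal sketch of this idea, the example below ranks three made-up chunks against a made-up query using cosine similarity. The 3-dimensional vectors and chunk labels are hypothetical; real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, chosen by hand for illustration only.
toy_chunks = {
    "chunk about polarization": [0.9, 0.1, 0.0],
    "chunk about acknowledgements": [0.0, 0.2, 0.9],
    "chunk about echo chambers": [0.7, 0.6, 0.1],
}
toy_query = [1.0, 0.2, 0.0]

# Sort chunk labels from most to least similar to the query.
ranked = sorted(
    toy_chunks,
    key=lambda name: cosine_similarity(toy_query, toy_chunks[name]),
    reverse=True,
)
print(ranked)
```

The chunk whose vector points in nearly the same direction as the query comes first, and the unrelated one comes last.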

To start with this approach, we must convert our chunks into embeddings and then store them in an index. This index will be our knowledge base, as it contains the information that our application will draw on. We will use `chromadb` to store the index and OpenAI's `text-embedding-3-small` embedding model, which generates a 1,536-dimensional vector for each chunk. The similarity measure we have in mind is the cosine of the angle between the embedded query and each embedded chunk. The collection below is configured with the inner product (`"ip"`) instead; because OpenAI embeddings are normalized to unit length, the inner product of two such vectors equals the cosine of the angle between them.
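
For unit-length vectors the inner product and the cosine of the angle coincide, which is why an inner-product index can stand in for cosine similarity when the embeddings are normalized. A minimal check with made-up vectors (the `normalize`, `inner_product` and `cosine` helpers are illustrative and not part of `chromadb`):

```python
import math


def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(vector):
    """Rescale a vector to length 1."""
    norm = math.sqrt(inner_product(vector, vector))
    return [x / norm for x in vector]


def cosine(u, v):
    norms = math.sqrt(inner_product(u, u)) * math.sqrt(inner_product(v, v))
    return inner_product(u, v) / norms


# Made-up vectors, normalized to unit length before comparing.
u = normalize([3.0, 4.0, 0.0])
v = normalize([1.0, 2.0, 2.0])

print(math.isclose(inner_product(u, v), cosine(u, v)))
```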

Very important: we need an OpenAI API key to use the embedding model. You will have to sign up to [OpenAI's developer platform](https://platform.openai.com/) and generate a key. API requests incur a cost, although we will use low-cost models for this demonstration. Keep the key out of the notebook itself so that it is not shared by accident.

In [None]:
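import os

from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

# Read the key from an environment variable instead of hard-coding it.
# Set OPENAI_API_KEY in your environment before running this cell.
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

openaiEmbedding = OpenAIEmbeddingFunction(
    api_key=OPENAI_API_KEY,
    model_name="text-embedding-3-small"
)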

We start a persistent client (it will not actually persist, since changes to Colab's file system are erased when the runtime is deleted) and create a collection called `"bdg2022_article"`.

In [None]:
import chromadb

chromaClient = chromadb.PersistentClient()
collection = chromaClient.create_collection(
    name="bdg2022_article",
    embedding_function=openaiEmbedding,
    metadata={"hnsw:space": "ip"}
)

Now we can add our chunks. Each chunk is stored as a `Document` object, which holds the chunk's text as a string (`page_content`) and a dictionary with its metadata (`metadata`). We also generate a unique ID for each chunk.

In [None]:
collection.add(
        documents=[document.page_content for document in chunks],
        metadatas=[document.metadata for document in chunks],
        ids=[f"id{i+1}" for i in range(len(chunks))]
)

Now that we have added the documents, we can retrieve them by ID. Let's retrieve the chunk we previously looked at, whose ID should be `"id14"`. By default, our index will provide a dictionary with the metadata and the text contents of this chunk.

In [None]:
print(collection.get(ids=["id14"])["documents"][0])

agent’s posterior converges to a belief that assigns probability one to the state favored by
the majority of her dogmatic friends, irrespective of the truth. For higher quality, her belief
converges to the truth despite the eﬀects of her echo chamber. These short- and long-run
properties hold even if the agent does not take shared signals at face value (i.e., not just for
ˆγ≈0). They remain true when selection neglect is minimal, i.e., when ˆγis very close to
the true γ. Thus, even minor misperceptions can distort learning.
Our second contribution is to show how these distorting forces at the individual level
can cause polarization at the social level. To begin, we emphasize the central role of infor-
mation quality. If some agents have unbalanced echo chambers towards diﬀerent states and
information quality is suﬃciently low, their beliefs will move apart on average in the short
run and almost surely in the long run. However, note that in our setting polarization does
not mean that al

As we can see, this is the same chunk as before. Finally, we can query the index with text and get the most relevant documents by providing a short question.

In [None]:
collection.query(query_texts=["Does belief polarization require fake news?"])["documents"]

[['multiple factors related to news sharing that may contribute to polarization on various topics.\nFinally, our analysis goes to the heart of how new communication channels and formats\non the Internet can aﬀect polarization. They can lower information quality in some cases.\nFor instance, tweets and social-media posts tend to be short and few people read the linked\narticles (Bakshy et al. (2015); Gabielkov et al. (2016)). People may also misperceive how\nnews-feed algorithms work on social media, which we model with alternative misspeciﬁcations\nand show that they have similar implications to selection neglect (Section 7). All this can\nlead to polarization, even without deliberate misinformation. Yet, the Internet has arguably\nmagniﬁed the spread of fake news through social connections. As a byproduct of our analysis,\nwe ﬁnd that selective sharing is one (and in a sense the only) channel through which fake\nnews can cause polarization. This could explain why fake news have become

Each of these results can be used to inform a response.

With this, we have completed the *Retrieval* part of **Retrieval Augmented Generation (RAG)**, which uses semantic search to provide context to an LLM and enhance the quality of the generated text. This method also lowers the probability of hallucination, which happens when a text generator outputs made-up information that is plausible but false.
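
As a hedged sketch of how retrieved text could be handed to an LLM as context, the hypothetical helper below assembles a prompt from a question and a list of retrieved passages. The function name `build_prompt`, the instruction wording, and the placeholder passages are all assumptions for illustration; the generation step itself is not shown.

```python
def build_prompt(question, retrieved_chunks):
    """Combine retrieved passages and the user's question into one prompt."""
    context = "\n\n".join(
        f"[{number}] {chunk.strip()}"
        for number, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is not enough, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Placeholder strings standing in for the documents returned by a query.
example_chunks = ["First retrieved passage.", "Second retrieved passage."]
prompt = build_prompt("Does belief polarization require fake news?", example_chunks)
print(prompt)
```

Numbering the passages makes it possible to ask the model which passage supports its answer, and telling it to rely only on the context is one common way to discourage hallucinated details.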