In [1]:
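# In current LangChain releases, PyPDFLoader is provided by langchain_community.
from langchain_community.document_loaders import PyPDFLoader

# load() returns one Document per PDF page.
loader = PyPDFLoader("/sandbox/notebooks/data/1811.12808v3.pdf")
pages = loader.load()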

In [2]:
# Drop the final two pages of the PDF before indexing.
pages = pages[:-2]
len(pages)

47

In [3]:
from IPython.display import Markdown

page = pages[0]
Markdown(page.page_content[0:1000])

Model Evaluation, Model Selection, and Algorithm
Selection in Machine Learning
Sebastian Raschka
University of Wisconsin–Madison
Department of Statistics
November 2018
sraschka@wisc.edu
Abstract
The correct use of model evaluation, model selection, and algorithm selection
techniques is vital in academic machine learning research as well as in many
industrial settings. This article reviews different techniques that can be used for
each of these three subtasks and discusses the main advantages and disadvantages
of each technique with references to theoretical and empirical studies. Further,
recommendations are given to encourage best yet feasible practices in research and
applications of machine learning. Common methods such as the holdout method
for model evaluation and selection are covered, which are not recommended
when working with small datasets. Different flavors of the bootstrap technique
are introduced for estimating the uncertainty of performance estimates, as an
alternative to 

In [4]:
page.metadata

{'producer': 'pdfTeX-1.40.21',
 'creator': 'LaTeX with hyperref',
 'creationdate': '2020-11-12T01:17:31+00:00',
 'author': '',
 'keywords': '',
 'moddate': '2020-11-12T01:17:31+00:00',
 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2',
 'subject': '',
 'title': '',
 'trapped': '/False',
 'source': 'data/1811.12808v3.pdf',
 'total_pages': 49,
 'page': 0,
 'page_label': '1'}

In [5]:
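# In current LangChain releases, the splitters live in langchain_text_splitters.
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Chunks of at most 1000 characters; neighbouring chunks share up to 100 characters.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = splitter.split_documents(pages)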
len(docs)

157
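
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-width chunking with overlap. It is a deliberate simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line boundaries, so its chunks will generally differ from these. The function name `sliding_chunks` and the sample string are hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-width windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters of toy text
chunks = sliding_chunks(sample, chunk_size=12, chunk_overlap=4)
for chunk in chunks:
    print(repr(chunk))
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one of the two neighbouring chunks, as long as it is shorter than the overlap.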

In [6]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings

embedding = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

In [7]:
# In current LangChain releases, this Chroma wrapper is provided by langchain_community.
from langchain_community.vectorstores import Chroma

persist_directory = "docs/chroma/"

vectordb = Chroma.from_documents(
    documents=docs, embedding=embedding, persist_directory=persist_directory
)
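# Recent Chroma versions typically write to persist_directory automatically,
# so an explicit vectordb.persist() call should not be needed.
# To reopen the stored index in a later session without re-embedding:
# vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)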

In [8]:
question = "p-value"
# Keep the search hits separate from `docs`, which still holds all 157 chunks.
hits = vectordb.similarity_search(question, k=3)

In [9]:
Markdown(hits[0].page_content)

Nonetheless, for the sake of completeness, and since it a commonly used method in practice, the
general procedure is outlined below as follows (which also generally applies to the different hypothesis
tests presented later):
1. formulate the hypothesis to be tested (for instance, the null hypothesis stating that the
proportions are the same; consequently, the alternative hypothesis that the proportions are
different, if we use a two-tailed test);
2. decide upon a significance threshold (for instance, if the probability of observing a difference
more extreme than the one seen is more than 5%, then we plan to reject the null hypothesis);
3. analyze the data, compute the test statistic (here: z-score), and compare its associated
p-value (probability) to the previously determined significance threshold;
4. based on the p-value and significance threshold, either accept or reject the null hypothesis at
the given confidence level and interpret the results.
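
Conceptually, a similarity search embeds the query and returns the chunks whose embedding vectors are closest to it. The sketch below shows that idea in plain Python using cosine similarity. It is an illustration under assumptions, not what Chroma does internally: the store may use a different distance metric and an approximate index, and the two-dimensional toy vectors and the names `cosine_similarity` and `top_k` are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Toy "embeddings"; real embedding vectors have hundreds of dimensions.
toy_docs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
toy_query = [0.9, 0.1]
nearest = top_k(toy_query, toy_docs, k=3)
print(nearest)
```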

In [10]:
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro")

In [11]:
# In current LangChain releases, PromptTemplate is provided by langchain_core.
from langchain_core.prompts import PromptTemplate

template = """
    You are an assistant for question-answering tasks. 
    Use the following pieces of retrieved context to answer the question. 
    If you don't know the answer, just say that you don't know. 
    Use five sentences maximum and keep the answer concise.
    {context}
    Question: {question}
    Helpful Answer:
"""

QA_CHAIN_PROMPT = PromptTemplate.from_template(template)
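
To see what the model eventually receives, the following sketch fills a shortened copy of the template by hand. It assumes the retrieved chunks are joined with a blank line before being substituted for `{context}`, which is roughly how a "stuff"-style chain behaves; the exact separator is an assumption. `toy_template`, `build_prompt`, and the chunk strings are hypothetical names and data.

```python
toy_template = """
Use the following pieces of retrieved context to answer the question.
{context}
Question: {question}
Helpful Answer:
"""


def build_prompt(template, chunks, question):
    """Join the retrieved chunks and substitute both placeholders."""
    context = "\n\n".join(chunks)  # assumed separator between chunks
    return template.format(context=context, question=question)


prompt_text = build_prompt(
    toy_template,
    ["First retrieved chunk.", "Second retrieved chunk."],
    "What is a p-value?",
)
print(prompt_text)
```

Printing the filled prompt like this is a quick way to check that the context lands where the instructions expect it.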

In [12]:
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
)

In [13]:
query = "Is nuclear physics discussed in these documents?"
result = qa_chain.invoke({"query": query})

In [14]:
Markdown(result["result"])

The provided text references confusion matrices and contingency tables, topics related to machine learning evaluation.  It also mentions comparing predictions of two models.  There is no mention of nuclear physics in the given context.  Therefore, the answer is no.

In [15]:
query = "What statistical tests are discussed in these documents?"
result = qa_chain.invoke({"query": query})

In [16]:
Markdown(result["result"])

The provided text discusses hypothesis testing using a z-score and p-value.  It outlines a four-step procedure: formulating hypotheses, setting a significance threshold, calculating the test statistic and p-value, and then accepting or rejecting the null hypothesis.  Additionally, it details calculating an F-statistic from mean SSA (sum of squares of classifiers) and MSAB (mean sum of squares for classification-object interaction) values.  The text doesn't name specific statistical tests beyond these general procedures. 
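
Because the chain was built with `return_source_documents=True`, the retrieved chunks are available under `result["source_documents"]`, which helps when checking whether an answer is grounded in the paper. The helper below is a small sketch for listing them; `summarize_sources` is a hypothetical name, and the stand-in objects are toy data so the snippet runs without the notebook state. It relies only on the `page_content` and `metadata` attributes shown earlier.

```python
from types import SimpleNamespace


def summarize_sources(source_documents, preview_chars=80):
    """Return one line per retrieved chunk: page label plus a short preview."""
    lines = []
    for doc in source_documents:
        label = doc.metadata.get("page_label", "?")
        # Collapse line breaks and repeated spaces before truncating.
        preview = " ".join(doc.page_content.split())[:preview_chars]
        lines.append(f"p. {label}: {preview}")
    return lines


# Toy stand-ins for retrieved Document objects.
fake_sources = [
    SimpleNamespace(page_content="first toy chunk\nwith a line break", metadata={"page_label": "1"}),
    SimpleNamespace(page_content="second toy chunk", metadata={}),
]
summary = summarize_sources(fake_sources)
print("\n".join(summary))
```

In the notebook, the same helper could be called as `summarize_sources(result["source_documents"])`.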