<a href="https://colab.research.google.com/github/andygma567/LLM-experiments/blob/main/Test_mlflow_%2B_Palm2_API.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is a test of integrating mlflow with my Langchain chains. Later I'd like to study this [web scraping with an LLM example](https://python.langchain.com/docs/use_cases/web_scraping/) in more detail.

## Setup



In [4]:
%%bash
pip install -U -q google-generativeai # PALM API library
pip install -U -q langchain
pip install -q unstructured # for reading urls with langchain
pip install -q transformers # needed by the summary chain

# mlflow things
pip install -q mlflow
pip install -q "pydantic==1.*" # quoted so bash does not glob the *; test if this works with pydantic 2 later
pip install -q pyngrok



# Write some environment files

These are in case I end up working outside of colab. For personal projects this is probably overkill.

In [5]:
# Write a requirements.txt file
# I don't use pip freeze > requirements.txt because
# colab installs a ton of extra libraries that I don't actually need
text = """
pandas>=1.5
mlflow
transformers
langchain
unstructured
pydantic==1.*
pyngrok
google-generativeai
"""
with open("requirements.txt", "w") as f:
    f.write(text)

It would be interesting to see if I could install using my requirements.txt file
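
One way to try that from Python is sketched below. The helper names `pip_install_command` and `install_requirements` are made up for this example, and the `dry_run` flag is only there so the command can be inspected before anything is installed. In a notebook cell, `%pip install -q -r requirements.txt` would do the same job.

```python
import subprocess
import sys


def pip_install_command(requirements_path="requirements.txt"):
    """Build the pip command for the current interpreter (hypothetical helper)."""
    return [sys.executable, "-m", "pip", "install", "-q", "-r", requirements_path]


def install_requirements(requirements_path="requirements.txt", dry_run=True):
    """Return the pip command; only run it when dry_run is False."""
    cmd = pip_install_command(requirements_path)
    if not dry_run:
        # check=True raises CalledProcessError if pip exits with an error
        subprocess.run(cmd, check=True)
    return cmd


# Dry run: show the command without installing anything
print(" ".join(install_requirements()))
```

Using `sys.executable -m pip` rather than a bare `pip` keeps the install tied to the interpreter that is running the notebook.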

In [6]:
# Write a conda environment yaml - I used ChatGPT
# I might not actually need this but I'll include it just to be safe
# By default, conda is not installed in the colab notebook, because colab runs
# docker images
text = """
name: myenv
channels:
  - defaults
dependencies:
  - python>=3.10
  - pip
  - pip:
    - -r requirements.txt
"""
with open("conda.yaml", "w") as f:
    f.write(text)

In [7]:
# Write an MLproject file
# it doesn't have much use because I don't have a main python script but it
# could be useful in the future...
text = '''
name: mlflow + langchain experiment

conda_env: conda.yaml

entry_points:
  main:
    command: python3 -c "print('hello')"
'''
with open("MLproject", "w") as f:
    f.write(text)
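
The `conda_env` entry in an MLproject file is a path relative to the project directory, so it has to match the name of the environment file written earlier (`conda.yaml` here). Below is a minimal sanity check for that, using only the standard library. The helpers `declared_conda_env` and `conda_env_exists` are hypothetical names for this sketch, and the line-based parsing is a simplification that assumes `conda_env` is a plain top-level key. It is not a real YAML parser.

```python
import tempfile
from pathlib import Path


def declared_conda_env(mlproject_text):
    """Return the value of the top-level conda_env key, or None if absent."""
    for line in mlproject_text.splitlines():
        if line[:1].isspace():
            continue  # skip nested keys such as entry point commands
        key, sep, value = line.partition(":")
        if sep and key.strip() == "conda_env":
            return value.strip().strip("'\"")
    return None


def conda_env_exists(project_dir):
    """Check that the environment file named in MLproject is actually there."""
    project_dir = Path(project_dir)
    env_name = declared_conda_env((project_dir / "MLproject").read_text())
    if env_name is None:
        return True  # nothing declared, nothing to check
    return (project_dir / env_name).is_file()


# Demo in a throwaway directory so no real files are touched
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "MLproject").write_text("name: demo\nconda_env: conda.yaml\n")
    print("before writing conda.yaml:", conda_env_exists(tmp))
    (tmp / "conda.yaml").write_text("name: myenv\n")
    print("after writing conda.yaml:", conda_env_exists(tmp))
```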

# Set up the langchain PALM integration

To get started, you'll need to [create an API key](https://developers.generativeai.google/tutorials/setup). I'm using the [langchain integration](https://api.python.langchain.com/en/latest/chat_models/langchain.chat_models.google_palm.ChatGooglePalm.html#langchain.chat_models.google_palm.ChatGooglePalm).

In [8]:
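import os
from getpass import getpass
from langchain.llms.google_palm import GooglePalm
from langchain.chains.summarize import load_summarize_chain

# Never hardcode the key in the notebook: read it from the environment,
# or prompt for it if it is not set
if not os.environ.get('GOOGLE_API_KEY'):
    os.environ['GOOGLE_API_KEY'] = getpass('Google API key: ')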

llm = GooglePalm(temperature=0,
                 max_output_tokens=1024,
                 )
chain = load_summarize_chain(llm=llm, chain_type="stuff")

# Try Summarization

[Lang chain summarization example](https://python.langchain.com/docs/use_cases/summarization)

[Reference for PALM2 models](https://developers.generativeai.google/models/language#:~:text=Note%3A%20For%20the%20PaLM%202,about%2060%2D80%20English%20words).

## Load and split data

In [9]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# PALM2 accepts roughly 8k input tokens,
# but the PALM API can only take about 20k bytes per request
# 1 byte ~ 1 char
# 4 chars ~ 1 token
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
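
The rules of thumb in the comments above can be turned into a quick budget calculation. This is only a back-of-the-envelope sketch: the constants are the approximate figures from those comments, not official limits, and the function names are made up for the example.

```python
# Rough limits quoted in the notes above (approximations, not official constants)
TOKEN_LIMIT = 8_000      # approximate PALM2 input limit, in tokens
BYTE_LIMIT = 20_000      # approximate PALM API request limit, in bytes
CHARS_PER_TOKEN = 4      # rule of thumb: 4 characters per token
BYTES_PER_CHAR = 1       # rule of thumb: 1 byte per character


def max_chars_per_request():
    """Return the tighter of the token-based and byte-based character budgets."""
    chars_by_tokens = TOKEN_LIMIT * CHARS_PER_TOKEN
    chars_by_bytes = BYTE_LIMIT // BYTES_PER_CHAR
    return min(chars_by_tokens, chars_by_bytes)


def approx_tokens(num_chars):
    """Estimate the token count of a chunk from its character count."""
    return num_chars // CHARS_PER_TOKEN


chunk_size = 2000  # same value passed to RecursiveCharacterTextSplitter
budget = max_chars_per_request()
print(f"character budget per request: {budget}")
print(f"approximate tokens per chunk: {approx_tokens(chunk_size)}")
print(f"chunks that fit in one request: {budget // chunk_size}")
```

Under these assumptions the byte limit is the binding one (20k characters versus 32k from the token limit), and a 2000-character chunk is roughly 500 tokens.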

In [22]:
import textwrap
from langchain.document_loaders import UnstructuredURLLoader, WebBaseLoader

urls = [
    # this one only works with WebBaseLoader, I think
    # "https://sites.google.com/view/mnovackmath/home",
    "https://sites.google.com/view/mnovack",
    ]
# loader = UnstructuredURLLoader(urls=urls)
loader = WebBaseLoader(web_path=urls)

docs = loader.load_and_split(text_splitter=text_splitter)
# In textwrap.fill, replace_whitespace=True is better for UnstructuredURLLoader
# output and replace_whitespace=False is better for WebBaseLoader output
print(f"Total number of documents: {len(docs)}\n")
print(f"Num chars per doc: {len(docs[0].page_content)}\n")
print(textwrap.fill(docs[0].page_content, max_lines=10))

Total number of documents: 3

Num chars per doc: 1676

pytest: helps you write better programs — pytest documentation
Navigation   modules pytest-7.4 » pytest: helps you write better
programs        Next Open Trainings  Professional Testing with Python,
via Python Academy, March 5th to 7th 2024 (3 day in-depth training),
Leipzig, Germany / Remote  Also see previous talks and blogposts.
pytest: helps you write better programs¶ The pytest framework makes it
easy to write small, readable tests, and can scale to support complex
functional testing for applications and libraries. pytest requires:
Python 3.7+ or PyPy3. PyPI package name: pytest  A quick example¶ #
content of test_sample.py def inc(x):     return x + 1   def [...]
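
The preview above relies on two `textwrap.fill` options: `max_lines`, which truncates the output and appends a `[...]` placeholder, and `replace_whitespace`, which decides whether newlines in the source text are flattened into spaces. Here is a small self-contained sketch of the difference, using a made-up sample string rather than real loader output, and assuming the note about `replace_whitespace` in the cell refers to this `textwrap` option.

```python
import textwrap

# Made-up stand-in for scraped page text: a title, a blank line, then a long body
sample = "Page title\n\nFirst paragraph of the page. " + "word " * 40

# Default behaviour: every whitespace character (including newlines) becomes a space
collapsed = textwrap.fill(sample, width=40, max_lines=3)

# replace_whitespace=False keeps the original newlines inside the wrapped text
preserved = textwrap.fill(sample, width=40, max_lines=3, replace_whitespace=False)

print(collapsed)
print("---")
print(preserved)
```

With `replace_whitespace=False` the original line breaks survive, so the printed preview can span more physical lines than `max_lines` suggests.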


## Run a summarization chain + mlflow

This is a nice reference: [LLMOps: Experiment Tracking with MLflow for Large Language Models
](https://dagshub.com/blog/mlflow-support-for-large-language-models/)

- I need to figure out how to use `mlflow.evaluate()` later. For now I have enough to work with, and `mlflow.evaluate()` is an experimental feature anyway
- Maybe later I can try running the [mlflow example from the docs](https://mlflow.org/docs/latest/models.html#evaluating-with-llms)

This is some `mlflow.evaluate()` code that didn't work for me earlier:
```
# try to log a table using mlflow.evaluate()
# use model type="text" bc "summarization" generates extra metrics

# Use the pandas.DataFrame constructor to create a new DataFrame from the list of strings
# I had to check the model signature to see that the name of the input is defaulted to
# "input_documents"

# For some reason this mlflow.evaluate() doesn't work for me...
# I can double check this another time

# df = pd.DataFrame(data=inputs, columns=["input_documents"])
# print(df)

# mlflow.evaluate(
#     model=logged_model.model_uri,
#     model_type="text",
#     data=df,
#     )
```

In [40]:
%%time
# my manual test
import langchain
import textwrap
import mlflow
from pprint import pp
import pandas as pd

mlflow.set_tracking_uri('')
experiment = mlflow.set_experiment('Langchain + mlflow')

# Only the first 2k characters of Matt's webpage can be passed to the API,
# otherwise it raises an error. I never figured out why, but I assume it is
# a limitation of the PALM API

urls = [
    "https://sites.google.com/view/mnovackmath/home",
    "https://sites.google.com/view/mnovack",
    "https://math.gmu.edu/~scarney6/index.html", # Sean's website
    ]

for website in urls:
    print()
    print(website)
    loader = WebBaseLoader(web_path=website)
    docs = loader.load_and_split(text_splitter=text_splitter)

    with mlflow.start_run():
        # log the number of docs
        params = {'num_docs': len(docs),
                  'website': website,
                  }
        mlflow.log_params(params)

        # log the prediction
        inputs = [docs[0].page_content]
        outputs = [chain.run(docs[:1])]
        prompts = [chain.llm_chain.prompt.template]

        model_info = mlflow.llm.log_predictions(inputs, outputs, prompts)

        # see docs:
        # https://mlflow.org/docs/latest/python_api/mlflow.langchain.html#mlflow.langchain.log_model
        # by default this flavor can infer the signature from the chain
        # which appears to be good enough for my uses

        # but we can also explicitly pass an input example
        # it infers a signature from the input example

        # log the model, I can use the infer signature later if I want
        logged_model = mlflow.langchain.log_model(chain,
                                                  "langchain_summary_chain",
                                                  input_example=docs[0].page_content
                                                  )

        # I think the artifact view for comparing runs currently only works well for
        #  table artifacts, so I need to use the mlflow.log_table() function
        data_dict = {
            'prompts': prompts,
            'inputs': inputs,
            'outputs': outputs,
        }

        df = pd.DataFrame(data_dict)
        mlflow.log_table(data=df, artifact_file="prediction_results.json")

https://sites.google.com/view/mnovackmath/home
Total number of documents: 4

Num chars per doc: 1998

HomeSearch this siteSkip to main contentSkip to navigationMatthew
NovackAssistant ProfessorPurdue University, Department of
Mathematicsmdnovack "at" purdue "dot" eduAbout MeI am an assistant
professor in the Department of Mathematics at Purdue University.  My
research interests lie in partial differential equations, particularly
those arising in fluid dynamics and related fields. My CV can be found
here.  In recent years I was a postdoc at New York University, MSRI,
and IAS.  I completed my Ph.D. at the University of Texas-Austin in
2019.My research is partially supported by the National Science
Foundation, Division of Mathematical Sciences, through NSF Grant [...]


2023/09/10 16:57:38 INFO mlflow.tracking.llm_utils: Creating a new llm_predictions.csv for run 6064891ec39840949c40c8e4f5b084ea.


https://sites.google.com/view/mnovack
Total number of documents: 2

Num chars per doc: 1992

Michael NovackSearch this siteSkip to main contentSkip to
navigationMichael NovackMichael  NovackPostdoctoral Research Associate
at Carnegie Mellon UniversityEmail address: mnovack at andrew dot cmu
dot eduPersonal InfoI am a postdoc at Carnegie Mellon University,
where my mentors are Irene Fonseca and Giovanni Leoni . I am
interested in the calculus of variations, geometric measure theory,
and partial differential equations.Previously, I was a postdoc at the
University of Texas at Austin with Francesco Maggi and the University
of Connecticut with Xiaodong Yan . I completed my doctoral studies at
Indiana University under the supervision of Peter Sternberg  and [...]


2023/09/10 16:57:45 INFO mlflow.tracking.llm_utils: Creating a new llm_predictions.csv for run 7f0b9be6114d4c79bf04d23ed8052bb8.


https://math.gmu.edu/~scarney6/index.html
Total number of documents: 1

Num chars per doc: 1298

Sean P. Carney Profile              Sean P. Carney  Postdoctoral
Research Fellow
Department of Mathematical Sciences, George Mason University
Interests  Multiscale and stochastic modeling, analysis, and
simulation  Transport and mixing in complex and turbulent flows
Asymptotic and numerical homogenization      Education and experience
Postdoctoral Research Fellow, Center for Mathematics and Artificial
Intelligence, George Mason University   Hedrick Asst. Adj. Prof.
(2020-2023), University of California, Los Angeles Ph.D. (2020)
University of Texas at Austin, advised by Bj&oumlrn Engquist [...]


2023/09/10 16:57:53 INFO mlflow.tracking.llm_utils: Creating a new llm_predictions.csv for run 50f738bfa7274fb2909c594235f6391d.


CPU times: user 1.01 s, sys: 48.8 ms, total: 1.06 s
Wall time: 20.7 s


# Set up the UI

In [14]:
import os
os.system("mlflow ui &")

0

In [15]:
import os
from pyngrok import ngrok

# Terminate open tunnels if any exist
ngrok.kill()

# Setting the authtoken (optional)
# Get your authtoken from https://dashboard.ngrok.com/auth
# Read it from the environment instead of hardcoding it in the notebook
NGROK_AUTH_TOKEN = os.environ.get("NGROK_AUTH_TOKEN")
if NGROK_AUTH_TOKEN:
    ngrok.set_auth_token(NGROK_AUTH_TOKEN)

# Open an HTTPS tunnel on port 5000 for http://localhost:5000
public_url = ngrok.connect("5000")

# public_url = ngrok.connect(port="5000", proto="http", options={"bind_tls": True})
print("MLflow Tracking UI:", public_url)





MLflow Tracking UI: NgrokTunnel: "https://affd-34-23-32-80.ngrok-free.app" -> "http://localhost:5000"
