# Using Jupyter AI with your documents

Outputs generated by a language model can include inaccuracies, particularly when you ask about new or specialized material that was absent from the training data. Generated code can also be incorrect, so always review and test any code suggestions offered by your AI assistant.

To get more reliable outputs from a language model, you can combine the generation model with a retrieval model and use external sources of information. This is called _Retrieval-Augmented Generation_ (RAG), and it allows the LLM to generate content that is _grounded_ in the documents processed with the retrieval model.

A retrieval model can find relevant information in your collection of documents by using _embeddings_, that is, representations of the words and sentences in the documents as vectors in a numeric vector space. Embeddings encode meaning, and capture semantic similarity via distances between vectors in the mapped space. Moving from text data to vector spaces brings the great power of linear algebra to the task!
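
To make the idea of distance in a vector space concrete, here is a minimal sketch using made-up three-dimensional vectors. Real embedding models produce vectors with many more dimensions, and the numbers below are purely illustrative, not output from any model. The `cosine_similarity` helper and the `toy_embeddings` dictionary are names invented for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical "embeddings": hand-picked numbers, not produced by a model.
toy_embeddings = {
    "beer":  [0.9, 0.1, 0.0],
    "ale":   [0.8, 0.2, 0.1],
    "numpy": [0.0, 0.2, 0.9],
}

# Compare every toy vector against the one for "beer".
query = toy_embeddings["beer"]
for word, vector in toy_embeddings.items():
    print(f"{word:>5}: {cosine_similarity(query, vector):.3f}")
```

In this toy setup, "ale" scores much closer to "beer" than "numpy" does, which is the kind of signal a retrieval model uses to pick relevant passages.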

You can set an embedding model to use with Jupyter AI in the settings (gear icon on the chat interface). For example, I am using `text-embedding-ada-002`, the [OpenAI embedding model](https://openai.com/blog/new-and-improved-embedding-model).

In the Jupyter AI chat interface, you use the `/learn` command with a local _directory_ to process the files in that directory with the embedding model, and then the `/ask` command to ask questions about them.

## Limitations

### File types

The Jupyter AI [documentation](https://jupyter-ai.readthedocs.io/en/latest/users/index.html#learning-about-local-data) does not mention the types of files that can be used with the `/learn` command. But on exploring the source code, I found a [line](https://github.com/ellisonbg/jupyter-ai/blob/54b965a9222b0dc4aea42b503b69219b966b0504/packages/jupyter-ai/jupyter_ai/document_loaders/directory.py#L47) that defines the file extensions as follows:

```python
extensions={'.py', '.md', '.R', '.Rmd', '.jl', '.sh', '.ipynb', '.js', '.ts'},
```

Therefore, at this time, Jupyter AI can learn from code written in Python, R, Julia, JavaScript and TypeScript, from shell scripts, and from documents written in Markdown or Jupyter notebooks.
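
Before pointing `/learn` at a folder, it can help to check which files would qualify. The following is a minimal sketch, not Jupyter AI's own loader: the helper `split_by_support` is hypothetical, the extension set is copied from the source line quoted above (and may change in later versions), and the recursive search and case-sensitive matching are assumptions.

```python
import tempfile
from pathlib import Path

# Copied from the Jupyter AI source line quoted above; may differ in other versions.
SUPPORTED_EXTENSIONS = {'.py', '.md', '.R', '.Rmd', '.jl', '.sh', '.ipynb', '.js', '.ts'}

def split_by_support(directory):
    """Return (supported, unsupported) lists of files found under `directory`."""
    supported, unsupported = [], []
    # Assumption: search subfolders too, and match suffixes case-sensitively.
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file():
            target = supported if path.suffix in SUPPORTED_EXTENSIONS else unsupported
            target.append(path)
    return supported, unsupported

# Demo on a throwaway folder with empty, made-up files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["lesson.ipynb", "notes.md", "paper.pdf", "data.txt"]:
        (Path(tmp) / name).touch()
    supported, unsupported = split_by_support(tmp)
    print("supported:  ", [p.name for p in supported])
    print("unsupported:", [p.name for p in unsupported])
```

To check a real folder, call `split_by_support` with its path instead of the temporary one.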

Notably, you cannot learn from PDF files or plain-text files at this time. If you have PDF documents that you want to analyze, you will need to work outside of Jupyter, for now. One good option is [Claude](https://claude.ai/chat/) by Anthropic, which lets you upload substantial documents in many formats.

### Provider rate limits

On first trying to use the `/learn` command with a folder containing the lessons of Module 3 of Engineering Computations, I got the following error:

```
openai.error.RateLimitError: Rate limit reached for default-text-embedding-ada-002 in organization org-{id} on requests per min. Limit: 3 / min. Please try again in 20s. Contact us through our help center at help.openai.com if you continue to have issues. Please add a payment method to your account to increase your rate limit. Visit https://platform.openai.com/account/billing to add a payment method.
```
What happened? Looking into my OpenAI account, I found my [Rate limits](https://platform.openai.com/docs/guides/rate-limits/overview) for the `text-embedding-ada-002` model: 150,000 tokens-per-minute, 3 requests-per-minute, and 200 requests-per-day.

It looks like Jupyter AI made too many API requests to OpenAI with my files. This can happen if the files are too big, and indeed one notebook in the folder is more than 5 MB, because it has some embedded images. So I tried again with a folder containing Module 2 (four Jupyter notebooks between 86 KB and 711 KB), but it failed again with the same error. Even trying with a folder containing just two notebooks, 34,830 and 101,068 bytes respectively, failed.
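
The numbers in the error message are consistent with each other: a limit of 3 requests per minute works out to one request every 20 seconds, which is the wait the error suggests. Here is a small sketch of that arithmetic. The helper name `minutes_needed` is hypothetical, the default limits are the free-tier values listed above, and the sketch makes no claim about how many requests Jupyter AI actually sends per file.

```python
import math

def minutes_needed(n_requests, per_minute=3, per_day=200):
    """Minimum whole minutes to send n_requests without exceeding per_minute.

    Returns None when n_requests is more than the daily cap allows.
    """
    if n_requests > per_day:
        return None
    return math.ceil(n_requests / per_minute)

# Spacing between requests that stays exactly at the per-minute limit.
seconds_between_requests = 60 / 3
print("seconds between requests:", seconds_between_requests)

# A few hypothetical request counts.
for n in (3, 30, 200, 201):
    print(n, "requests ->", minutes_needed(n), "minutes")
```

Under these limits, any burst of more than three requests within a minute triggers the error, regardless of how small each file is.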

The limits above apply to _free tier_ accounts, and adding a payment method leads to an increase. But while simply exploring the usefulness of this tool, you may not want to do that, and thus will be limited in what you can do.

In the end, to try the Jupyter AI `/learn` command, I had to upgrade my OpenAI account to pay-as-you-go, enter my credit card, and make a minimum payment of $5.

# Play with `/learn` and `/ask`

My newly upgraded OpenAI account has rate limits of 1,000,000 tokens per minute and 3,000 requests per minute. Great. Now I could run the following command in the chat:
`/learn EngComp/modules/2_takeoff/notebooks_en`. This points to the location of my local copy of the notebooks for Module 2 of Engineering Computations. The module includes lessons on exploratory data analysis with real data, including a data set of craft beers in the US; handling labeled data with the `pandas` library; and visualizing quantitative and categorical data.

[Lesson 2](http://go.gwu.edu/engcomp2lesson2) is focused on visualizing statistics, using the data set of craft beers. The fourth section is titled "Visualizing multiple data," and it works through an example of making a bubble chart, where beers are grouped by style and represented by the average value of two features: bitterness and alcohol content. Here's an example interaction with Jupyter AI asking questions about that section of the lesson.

![](../images/jupyterchat_learn.png)

The answers above are not only correct in the general sense, but also in the context of the lesson, and even match the variable names used!

We have a little challenge for students in that section under the heading "Discuss with your neighbor." When asking about that in the chat, Jupyter AI gave a correct response, even though the solution is not in the text (and auto-correct caused a typo in the question). Impressive!

![](../images/jupyterchat_ask.png)

# Comparison with Claude

We made a PDF version of Module 2 available publicly [on Figshare](https://doi.org/10.6084/m9.figshare.5673499.v1). Uploading this document to [Claude](https://claude.ai/chat/) and asking the first question above gave the following response:

<blockquote><small>
Based on the description, it sounds like the key pandas feature needed to make a bubble chart of the beer data set grouped by style is the groupby() method. 

The lessons mention using groupby() to group the beer data by style, and then calculate the mean abv and ibu values within each style group. To make a bubble chart with bubbles representing each style, we would need to:

1. Group the beer dataframe by the 'style' column using groupby():

```python
beers_by_style = beers.groupby('style')
```

2. Calculate the mean abv and ibu for each style group:

```python 
style_means = beers_by_style.mean()[['abv', 'ibu']]
```

3. Extract the style count into a separate series:

```python
style_counts = beers['style'].value_counts()
```

4. Use style_means as the x and y data for the bubbles, with style_counts used to set the size of each bubble.

So the key pandas feature needed here is groupby(), which allows us to split the dataframe into groups, apply aggregations like mean(), and then use the grouped data to create the desired bubble chart visualization. The groupby() method is essential for handling the grouping of the data needed for this chart.
</small></blockquote>

This explanation is not only more detailed than what OpenAI's `gpt-3.5-turbo` gave, but it actually solves the problem slightly differently from the lesson. Instead of first dropping columns we didn't want, and then grouping, Claude's response applied `groupby()` and then extracted the relevant columns by their labels. A student here would get the benefit of an alternative code solution!
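
The two orderings can be compared side by side. This is a minimal sketch on made-up rows standing in for the craft-beers data; only the column names `abv`, `ibu`, and `style` come from the text above, and the `name` column and all values are invented. Approach 2 selects the columns before calling `mean()`, a small variation on the quoted response that avoids averaging non-numeric columns, which recent pandas versions may refuse to do.

```python
import pandas as pd

# Hypothetical data: four invented beers in two styles.
beers = pd.DataFrame({
    "name":  ["A", "B", "C", "D"],
    "style": ["IPA", "IPA", "Stout", "Stout"],
    "abv":   [0.06, 0.07, 0.08, 0.05],
    "ibu":   [60.0, 70.0, 40.0, 30.0],
})

# Approach 1: keep only the columns of interest, then group by style.
means_drop_first = beers[["abv", "ibu", "style"]].groupby("style").mean()

# Approach 2: group by style first, then pick the columns to average.
means_group_first = beers.groupby("style")[["abv", "ibu"]].mean()

print(means_group_first)
print("Same result:", means_drop_first.equals(means_group_first))
```

Both routes produce one row per style with the mean `abv` and `ibu`, so the choice is mostly a matter of which reads more clearly.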

Of course, we didn't try `gpt-4`, which we can expect to give better results than `gpt-3.5`.

When asking Claude the second question above, this is the response:

<blockquote><small>
Good catch! When we dropped the unnecessary columns from the dataframe and only kept 'abv', 'ibu' and 'style', it seems we lost some rows that had null values in the 'abv' and 'ibu' columns.

The reason the number of unique styles went down from 99 to 90 is likely because:

- Originally, when we extracted the 'style' column and got the value counts, there were 99 unique beer styles represented across all 2410 rows. 

- But then when we dropped rows with null 'abv' and 'ibu' values, this removed any rows where a style had missing abv or ibu data. 

- So any styles that only had rows with missing abv/ibu data got removed entirely when we cleaned the dataframe.

- This resulted in only 90 remaining styles that had at least one row with non-null abv and ibu values.

So in summary:

- Original dataframe had 99 unique styles 

- Cleaning dropped rows with missing abv/ibu data

- Styles that had no non-null abv/ibu data got removed

- Leaving only 90 styles with valid abv and ibu data

The value_counts on 'style' after cleaning only counts these 90 remaining styles with valid abv/ibu data.
</small></blockquote>

That is a nice explanation that could be very helpful to a confused student. And we get it using the free tier.
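
The effect described in that response is easy to reproduce on a toy table. In this sketch the rows are made up, and cleaning with `dropna()` is an assumption about how the missing values were removed; the point is only that a style whose rows all have missing values disappears from the count.

```python
import pandas as pd

# Hypothetical data: "Kolsch" has no row with both abv and ibu present.
beers = pd.DataFrame({
    "style": ["IPA", "IPA", "Stout", "Kolsch", "Kolsch"],
    "abv":   [0.06, None, 0.08, 0.05, None],
    "ibu":   [60.0, 70.0, 40.0, None, None],
})

styles_before = beers["style"].nunique()

# Drop every row that has a missing value in any column.
clean = beers.dropna()
styles_after = clean["style"].nunique()

# Styles present before cleaning but absent afterwards.
lost_styles = set(beers["style"]) - set(clean["style"])

print("styles before:", styles_before)
print("styles after: ", styles_after)
print("lost styles:  ", lost_styles)
```

Note that "IPA" survives even though one of its rows is dropped, because another IPA row is complete.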

# What we've learned

- Jupyter AI lets you use retrieval-augmented generation via an embedding model, chosen in the settings.
- OpenAI's embedding model is `text-embedding-ada-002`.
- Use the `/learn` command in the Jupyter AI chat interface, with a local directory, to index the documents you want to ask about.
- Supported file types include code in Python, R, Julia, JavaScript and TypeScript, shell scripts, and documents in Markdown or Jupyter notebooks. Files in PDF or plain text are unsupported at this time.
- The free-tier OpenAI account is insufficient to use `/learn`; you need a pay-as-you-go account.
- Use the `/ask` command to prompt your language model about the information in your documents.
- These commands are only available in the chat UI, and not in notebook cells via magics.
- Claude is a good alternative when you need to ask about information contained in large files (and it works with the free tier).
