<div width=50% style="display: block; margin: auto">
    <img src="figures/ucl-logo.svg" width=100%>
</div>


### [UCL-ELEC0136 Data Acquisition and Processing Systems 2024]()
University College London
# Lab 4: Advanced Data Storage - Vector Databases and LLMs


<hr width=70% style="float: left">

**IMPORTANT:** The content of this Notebook will not be evaluated in the final exam. The goal is to give you practical experience with LLMs, a very popular topic, so that you know how to use them and have enough background to start your own LLM projects.

This lab also serves as a good illustration of what this module is about: how data acquisition, storage, and processing all come together to create intelligent AI-driven applications.

### Objectives
* Perform CRUD operations on a Pinecone Vector Database.
* Use LangChain to query pre-trained LLMs accessed through an API (from Hugging Face and OpenAI).
* Compare the performance of two pre-trained LLMs.
* Use a Pinecone vector database to perform Retrieval Augmentation of a pre-trained LLM, allowing it to both expand its knowledge base and cite sources.

### Outline

This notebook has a setup section followed by 3 parts:

0. [Setting up](#0.-Setting-up)
1. [CRUD operations on a vector database](#1-crud-operations-on-a-pinecone-vector-database)
2. [Intro to LLMs and LangChain](#2.-Intro-to-LLMs-and-Langchain)
3. [Retrieval Augmentation of a LLM using a vector database](#3.-Retrieval-Augmentation-of-a-LLM-using-a-vector-database)

<hr width=70% style="float: left">

# 0. Setting up

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- The assignment repository contains a `requirements.txt` file. Make sure to install all the libraries listed in this file, with the specified versions, in your `daps` conda environment.

</div>

## 0.1 Create a Pinecone account and connect to your free-tier online vector database

[Pinecone](https://www.pinecone.io) is a vector database service that helps developers build and deploy applications with high-performance similarity search and recommendation capabilities. It enables efficient storage and retrieval of vector data, making it easier to create personalized experiences and content recommendations in various applications, such as Stable Diffusion, LLM chatbots, and many other AI applications.

In this Notebook, we will use Pinecone's free tier to create and connect to a vector database, perform CRUD operations, and then use that database to power an LLM application.




<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>
    
- Follow the instructions in `pinecone_tutorial.pdf` to create a Pinecone account and get an API key for the free tier vector database. MAKE SURE TO KEEP A COPY OF YOUR API KEY.
- Run the cell below to connect to your online vector database.

</div>

In [None]:
# Run this cell (it may take a few seconds)
import pinecone

In [None]:
###########################
# Task: 
#   replace the value of PINECONE_API_KEY with your Pinecone API key and run the cell
#
###########################

PINECONE_API_KEY = "your_pinecone_API_key" #<--- TODO: your API key here 

pinecone.init(api_key=PINECONE_API_KEY, environment="gcp-starter")

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>
    
- Use the [`list_indexes`](https://docs.pinecone.io/reference/list_indexes) function to return the list of Pinecone indexes you have on your database (it should be empty).
- If it isn't empty, use the [`delete_index`](https://docs.pinecone.io/reference/delete_index) function to delete any indexes you may have on your database.

</div>

In [None]:
###########################
# Task: 
#   Check that your pinecone database does not contain any indexes, and delete them if there are any.
#
###########################

# TODO: your code below

## 0.2 Create a HuggingFace account and generate a free API key

**Note:** you do not need this step to do part [1. CRUD operations on a Pinecone Vector Database](#1-crud-operations-on-a-pinecone-vector-database).


[Hugging Face 🤗](https://huggingface.co) is a company and open-source platform that specializes in natural language processing (NLP) and provides tools, libraries, and pre-trained models for building and deploying NLP applications. Their most well-known product is the Transformers library, which offers access to a wide range of pre-trained NLP models, making it easier for developers to work with text-based tasks such as language translation, sentiment analysis, and more.

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>
    
- Follow the instructions in `huggingface_tutorial.pdf` to create a Hugging Face account and get a free API key. MAKE SURE TO KEEP A COPY OF YOUR API KEY.
- Run the cell below to set an environment variable to your API key. `langchain.HuggingFaceHub` will use this environment variable to authenticate the connection when sending a request to the Hugging Face server.

</div>
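
Typing a key directly into a notebook cell makes it easy to commit it to GitHub by accident. As an alternative, here is a minimal sketch that reads the key from an environment variable and, optionally, prompts for it with hidden input. The helper name `load_api_key` is hypothetical and is not part of any library used in this lab.

```python
import os
from getpass import getpass


def load_api_key(env_var: str, prompt_if_missing: bool = False) -> str:
    """Return the API key stored in the environment variable `env_var`.

    If the variable is unset and `prompt_if_missing` is True, ask for the key
    interactively (the input is hidden) and cache it for the session.
    """
    key = os.environ.get(env_var, "").strip()
    if not key and prompt_if_missing:
        key = getpass(f"Enter a value for {env_var}: ").strip()
        os.environ[env_var] = key
    if not key:
        raise RuntimeError(f"{env_var} is not set")
    return key


# Possible use in this lab (hypothetical, prompts only if the variable is unset):
# HUGGING_FACE_API_KEY = load_api_key("HUGGINGFACEHUB_API_TOKEN", prompt_if_missing=True)
```

The same helper would work for the Pinecone and OpenAI keys by changing the variable name.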

In [None]:
import os

###########################
# Task: 
#   replace the value of HUGGING_FACE_API_KEY with your Hugging Face API key and run the cell to set the environment variable HUGGINGFACEHUB_API_TOKEN
#
###########################

HUGGING_FACE_API_KEY = "your_HF_API_key" #<--- TODO: your Hugging Face API key here 

os.environ['HUGGINGFACEHUB_API_TOKEN'] = HUGGING_FACE_API_KEY

## 0.3 Create an OpenAI account and generate a free API key

**Note:** you do not need this step to do part [1. CRUD operations on a Pinecone Vector Database](#1.-crud-operations-on-a-pinecone-vector-database) and part [2.1 Hugging Face LLM](#211-initializing-the-llm).

[OpenAI](https://openai.com) is an artificial intelligence (AI) research laboratory consisting of the for-profit OpenAI LP and its non-profit parent company, OpenAI Inc.

OpenAI provides an [API](https://platform.openai.com/docs/overview) that allows developers to access and integrate the capabilities of OpenAI's language models into their own applications, products, or services. The OpenAI API is based on models like GPT and allows developers to make use of powerful natural language processing (NLP) functionalities.

**Unlike Hugging Face, OpenAI is not open source, and the free tier of their API is only available for 3 months after the creation of a new account and has many restrictions (for example, you are limited to 3 requests per minute for each model). However, their models are state of the art and very powerful, which is why we will use them in this lab.**
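
With a limit of 3 requests per minute, a loop that calls the model several times in a row can fail quickly. One possible workaround, sketched here under the assumption that the limit behaves like a sliding 60-second window, is to throttle calls on the client side. `RateLimiter` is a hypothetical helper written for illustration; it is not part of LangChain or the OpenAI client.

```python
import time
from collections import deque


class RateLimiter:
    """Allow at most `max_calls` calls in any sliding window of `period` seconds."""

    def __init__(self, max_calls=3, period=60.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._calls = deque()  # timestamps of the most recent calls

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self._clock()
        # Forget calls that have left the sliding window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window.
            self._sleep(self.period - (now - self._calls[0]))
            now = self._clock()
            self._calls.popleft()
        self._calls.append(now)


limiter = RateLimiter(max_calls=3, period=60.0)
# Calling limiter.wait() just before each request spaces the requests out.
```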

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>
    
- Follow the instructions in `openai_tutorial.pdf` to create an OpenAI account and get an API key for the free tier. MAKE SURE TO KEEP A COPY OF YOUR API KEY.
- Run the cell below to set an environment variable to your API key. 

</div>

In [None]:
###########################
# Task: 
#   replace the value of OPENAI_API_KEY with your OpenAI API key and run the cell to set the environment variable OPENAI_API_KEY
#
###########################


OPENAI_API_KEY = "your_openai_API_key" #<--- TODO: your OpenAI API key here 

os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

# 1. CRUD operations on a Pinecone Vector Database

*Source: https://docs.pinecone.io/docs/quickstart*


In this part, we will familiarize ourselves with basic operations on a Pinecone Vector Database.
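
Before working with the real service, it can help to see the four CRUD operations on a toy example. The class below is a minimal in-memory sketch written for illustration only: `ToyVectorIndex` and its method signatures are hypothetical and are not the Pinecone API, and the search is a brute-force scan rather than the indexing a real vector database uses.

```python
import math


class ToyVectorIndex:
    """A tiny in-memory stand-in for a vector index (not the Pinecone API)."""

    def __init__(self, dimension):
        self.dimension = dimension
        self._records = {}  # id -> (values, metadata)

    def upsert(self, records):
        """Create: add or overwrite records given as (id, values, metadata) tuples."""
        for record_id, values, metadata in records:
            if len(values) != self.dimension:
                raise ValueError(f"expected {self.dimension} values, got {len(values)}")
            self._records[record_id] = (list(values), dict(metadata))

    def query(self, vector, top_k):
        """Read: return the top_k (id, distance) pairs closest to `vector`."""
        scored = [(rid, math.dist(vector, vals)) for rid, (vals, _) in self._records.items()]
        return sorted(scored, key=lambda pair: pair[1])[:top_k]

    def update(self, record_id, values=None, set_metadata=None):
        """Update: replace the values and/or merge new metadata fields."""
        old_values, metadata = self._records[record_id]
        new_values = list(values) if values is not None else old_values
        metadata.update(set_metadata or {})
        self._records[record_id] = (new_values, metadata)

    def delete(self, ids):
        """Delete: remove the listed ids, ignoring ids that do not exist."""
        for record_id in ids:
            self._records.pop(record_id, None)

    def __len__(self):
        return len(self._records)


index_demo = ToyVectorIndex(dimension=2)
index_demo.upsert([("x", [0.0, 0.0], {"kind": "origin"}), ("y", [3.0, 4.0], {})])
print(index_demo.query([0.0, 1.0], top_k=1))  # nearest neighbour of (0, 1)
index_demo.update("y", set_metadata={"kind": "far"})
index_demo.delete(["x"])
print(len(index_demo))
```

The tasks in this part perform the same kinds of operations, but through the Pinecone client and on an index hosted online.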

<div class="alert alert-block alert-warning">
    <b>👩‍💻👨‍💻 Optional action</b>

- Very short video intro on Vector Databases: [Vector databases are so hot right now. What are they?](https://www.youtube.com/watch?v=klTvEwg3oJ4)
- Short read: [Everything you need to know about Pinecone – A Vector Database](https://www.packtpub.com/article-hub/everything-you-need-to-know-about-pinecone-a-vector-database).
- In depth read: [What is a Vector Database & How Does it Work? Use Cases + Examples](https://www.pinecone.io/learn/vector-database/).

</div>

<div class="alert alert-heading alert-danger" style="background-color: white; border: 2px solid; border-radius: 5px; color: #000; border-color:#AAA; padding: 10px">
    <b>💎 Tip</b>

Once your connection to your Pinecone database is established (which should have been done in part. 0.1), you do not need to use `pinecone.init` anymore and can directly use the API requests.

</div>

## 1.1 CRUD operations on a Vector Database - Create

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>
    
- Use the [`create_index`](https://docs.pinecone.io/reference/create_index) function to create an index, name it `quickstart`, set the dimension to 8, and use the metric `euclidean`.
- Check that your database contains an index called `quickstart`.
- Use the [`describe_index`](https://docs.pinecone.io/reference/describe_index) function to get information about the index `quickstart`.

</div>

In [None]:
###########################
# Task: 
#   Create an index called quickstart, check that it has been added to your Pinecone Vector DB, and get information about that index
#
###########################

# TODO: your code below


<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>
    
- Use the [`Index`](https://docs.pinecone.io/docs/python-client#index) class to construct an `Index` object from the index `quickstart`. 
- Use the [`upsert`](https://docs.pinecone.io/docs/python-client#indexupsert) method to push the 5 vectors in the cell below to the index `quickstart`.
- Use the [`describe_index_stats`](https://docs.pinecone.io/docs/python-client#indexdescribe_index_stats) method to get statistics about the index's contents.

</div>


<div class="alert alert-heading alert-danger" style="background-color: white; border: 2px solid; border-radius: 5px; color: #000; border-color:#AAA; padding: 10px">
    <b>💎 Tip</b>

Wait for a few seconds after you use `upsert` before querying the index with `describe_index_stats` as the data needs a bit of time to be saved.

</div>
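
A fixed pause works, but the right duration is a guess. A possible alternative is to poll until the expected state is visible, with a timeout as a safety net. The sketch below is a generic helper; the name `wait_until` is hypothetical and it is not part of the Pinecone client.

```python
import time


def wait_until(condition, timeout=30.0, interval=1.0):
    """Call `condition()` every `interval` seconds until it returns True.

    Returns True as soon as the condition holds, or False if `timeout`
    seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Possible use in this lab (assumes the stats response exposes a
# `total_vector_count` field):
# wait_until(lambda: index.describe_index_stats()["total_vector_count"] == 5)
```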

In [None]:
###########################
# Task: 
#   Construct an Index object and use it to push the 5 vectors in data to your index, and get statistics about the index
#
###########################

data = [
    ("A", [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1], {"genre": "comedy", "year": 2020}),
    ("B", [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2], {"genre": "documentary", "year": 2019}),
    ("C", [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3], {"genre": "comedy", "year": 2019}),
    ("D", [0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4], {"genre": "drama"}),
    ("E", [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5], {"genre": "drama"})
    ]


# TODO: your code below


## 1.2 CRUD operations on a Vector Database - Read

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>
    
- Use the [`query`](https://docs.pinecone.io/docs/python-client#indexquery) method to search for the 3 closest vectors to the `target_vector` in the index `quickstart`.

</div>

In [None]:
###########################
# Task: 
#   find the 3 nearest vectors to the target_vector
#
###########################
target_vector = [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]

# TODO: your code below


<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>
    
- Use the [`query`](https://docs.pinecone.io/docs/python-client#indexquery) method to search for the 3 closest vectors to the `target_vector` in the index `quickstart` that are dramas.

</div>
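
To build intuition for what a filtered query does, here is a minimal plain-Python sketch: records whose metadata does not match the filter are discarded, and the remaining ones are ranked by Euclidean distance to the target. The function `nearest` and the exact-match filter format are assumptions made for illustration; Pinecone's own filter syntax and the scale of its reported scores may differ.

```python
import math

# Same vectors and metadata as the `data` list used earlier in this part.
records_demo = [
    ("A", [0.1] * 8, {"genre": "comedy", "year": 2020}),
    ("B", [0.2] * 8, {"genre": "documentary", "year": 2019}),
    ("C", [0.3] * 8, {"genre": "comedy", "year": 2019}),
    ("D", [0.4] * 8, {"genre": "drama"}),
    ("E", [0.5] * 8, {"genre": "drama"}),
]


def nearest(records, target, top_k, metadata_filter=None):
    """Return the ids of the top_k records closest to `target` (Euclidean).

    `metadata_filter` is an optional dict of exact-match conditions; records
    whose metadata does not satisfy all of them are skipped before ranking.
    """
    metadata_filter = metadata_filter or {}
    candidates = [
        (math.dist(target, values), record_id)
        for record_id, values, metadata in records
        if all(metadata.get(key) == value for key, value in metadata_filter.items())
    ]
    return [record_id for _, record_id in sorted(candidates)[:top_k]]


target_demo = [0.3] * 8
print(nearest(records_demo, target_demo, top_k=3))
print(nearest(records_demo, target_demo, top_k=3, metadata_filter={"genre": "drama"}))
```

Note that `B` and `D` are equidistant from the target up to floating-point rounding, and that a filter can leave fewer than `top_k` candidates.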

In [None]:
###########################
# Task: 
#   find the 3 nearest vectors to the target_vector that are dramas
#
###########################

# TODO: your code below


## 1.3 CRUD operations on a Vector Database - Update

We want to change the values and metadata associated with vectors A and D.

In [None]:
# Run this cell
print(index.query(
  id = "A",
  top_k = 1,
  include_values = True,
  include_metadata = True
))

print(index.query(
  id = "D",
  top_k = 1,
  include_values = True,
  include_metadata = True
))

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>
    
- Use the [`update`](https://docs.pinecone.io/docs/python-client#indexupdate) method to do the following updates:
    - Replace all the values of `A` by `0.6`, change the genre to `action-comedy`.
    - Replace all the values of `D` by `-0.4`, add a `year` field in the metadata and set it to `2019`.

</div>

<div class="alert alert-heading alert-danger" style="background-color: white; border: 2px solid; border-radius: 5px; color: #000; border-color:#AAA; padding: 10px">
    <b>💎 Tip</b>

Wait for a few seconds after you use `update` before querying the index as the data needs a bit of time to be saved.

</div>

In [None]:
###########################
# Task: 
#   Update vectors A and D as indicated above.
#
###########################


import time

# TODO: your code below


time.sleep(5)

In [None]:
# Run this cell
print(index.query(
  id = "A",
  top_k = 1,
  include_values = True,
  include_metadata = True
))

print(index.query(
  id = "D",
  top_k = 1,
  include_values = True,
  include_metadata = True
))

## 1.4 CRUD operations on a Vector Database - Delete

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>
    
- Use the [`delete`](https://docs.pinecone.io/docs/python-client#indexdelete) method to delete the vectors `B` and `C`.
- Check with `describe_index_stats` that your index now contains 3 vectors.

</div>

<div class="alert alert-heading alert-danger" style="background-color: white; border: 2px solid; border-radius: 5px; color: #000; border-color:#AAA; padding: 10px">
    <b>💎 Tip</b>

Wait for a few seconds after you use `delete` before querying the index with `describe_index_stats` as the change needs a bit of time to be saved.

</div>

In [None]:
###########################
# Task: 
#   Delete vectors B and C from the index.
#
###########################


# TODO: your code below


time.sleep(5)

In [None]:
index.describe_index_stats()

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>
    
- Use the [`delete_index`](https://docs.pinecone.io/reference/delete_index) function to delete the index `quickstart`.

</div>

In [None]:
###########################
# Task: 
#   delete the index quickstart from your Pinecone Vector Database
#
###########################

# TODO: your code below



# 2. Intro to LLMs and LangChain

**Context:**

* **Large language models (LLMs)** are a class of artificial intelligence models that have been trained on vast amounts of text data to understand and generate human-like text. These models are based on deep learning techniques, such as neural networks, and have many parameters, often numbering in the hundreds of millions or even billions. Some well-known examples include OpenAI's GPT-3 and GPT-4, and Google's BERT and T5 models.

* " [**LangChain**](https://python.langchain.com/docs/get_started/introduction) is an open source framework that lets software developers working with artificial intelligence (AI) and its machine learning subset combine large language models with other external components to develop LLM-powered applications. The goal of LangChain is to link powerful LLMs, such as OpenAI's GPT-3.5 and GPT-4, to an array of external data sources to create and reap the benefits of natural language processing (NLP) applications. " - [Source](https://www.techtarget.com/searchenterpriseai/definition/LangChain#:~:text=LangChain%20is%20an%20open%20source,to%20develop%20LLM%2Dpowered%20applications.)



**Although we are doing these operations in a Jupyter Notebook, the exact same code can be used to program server-hosted web applications that could perform real-world tasks.**


<div class="alert alert-block alert-warning">
    <b>👩‍💻👨‍💻 Optional action</b>

- In depth read: [LangChain AI Handbook](https://www.pinecone.io/learn/series/langchain/).

</div>


In this part, we will use LangChain to deploy two pre-trained LLMs, one from Hugging Face Hub, and one from OpenAI's API.

*Source: [https://www.pinecone.io/learn/series/langchain/langchain-intro/](https://www.pinecone.io/learn/series/langchain/langchain-intro/)*

## 2.1 Hugging Face LLM

Let's use a pre-trained Google language model, [flan-t5-xxl](https://huggingface.co/google/flan-t5-xxl), hosted on the Hugging Face Hub, that we will access for free through Hugging Face's API. More specifically, we will use the HuggingFaceHub module of the langchain library, which will query Hugging Face's API for us.

### 2.1.1 Initializing the LLM

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Run the cell below to initialize the connection to the LLM.

</div>

In [None]:
# Run this cell

from langchain import HuggingFaceHub

# initialize Hub LLM
hub_llm = HuggingFaceHub(
    repo_id='google/flan-t5-xxl',
    model_kwargs={'temperature': 1e-2}  # Best temperature found: 1e-2
)

### 2.1.2 Use LangChain to ask a question to the LLM

The first thing we need to do to query an LLM is to create a **prompt template**. A prompt template contains instructions to generate a prompt in a reproducible way. It consists of a text string (the "template") that takes in a set of parameters from the end user (the input variables) and generates a prompt.

For example: 

* If we want an LLM to tell what language a sentence is written in, an appropriate prompt template would be: 
    * `"What language is the sentence "{sentence}" written in?"`
    * Here, `sentence` is the only input variable. 

<br/>

* If we want an LLM to generate jokes about a topic, while specifying the type of joke, an appropriate prompt template would be: 
    * `"Make a {type} of joke about {subject}"`
    * Here, `type` and `subject` are the input variables.

<br/>


**Note:** `google/flan-t5-xxl` is a small LLM, best suited to give short answers. Although it is enough for us to explore LLMs today, for a real application bigger models would be better suited.
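
The prompt-template idea described above can be sketched with plain Python string formatting. `SimplePromptTemplate` is a hypothetical stand-in written for illustration; it mimics the concept but is not LangChain's `PromptTemplate` class, whose constructor and options may differ.

```python
from string import Formatter


class SimplePromptTemplate:
    """Hypothetical stand-in that mimics the idea of a prompt template."""

    def __init__(self, template):
        self.template = template
        # Collect the names that appear between curly braces in the template.
        self.input_variables = sorted(
            {name for _, name, _, _ in Formatter().parse(template) if name}
        )

    def format(self, **kwargs):
        """Fill in the input variables and return the finished prompt."""
        missing = [name for name in self.input_variables if name not in kwargs]
        if missing:
            raise KeyError(f"missing input variables: {missing}")
        return self.template.format(**kwargs)


joke_prompt = SimplePromptTemplate("Make a {type} of joke about {subject}")
print(joke_prompt.input_variables)
print(joke_prompt.format(type="pun", subject="databases"))
```

The template is written once and reused with different values, which is what makes the generated prompts reproducible.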

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Using [langchain.PromptTemplate](https://api.python.langchain.com/en/latest/prompts/langchain.prompts.prompt.PromptTemplate.html), create a prompt template with an input variable called `name` that asks the language model to give the date of birth of a historical figure. 

</div>

In [None]:
###########################
# Task: 
#   Create a PromptTemplate with an input variable called `name` that asks the language model to give the date of birth of a historical figure.
#
###########################

from langchain import PromptTemplate

# TODO: your code below




<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Using langchain.LLMChain, create a chain to run the prompt through our LLM.

</div>


<div class="alert alert-heading alert-danger" style="background-color: white; border: 2px solid; border-radius: 5px; color: #000; border-color:#AAA; padding: 10px">
    <b>💎 Tip</b>

An LLMChain consists of a PromptTemplate and a language model (either an LLM or chat model). It formats the prompt template using the input key values provided, passes the formatted string to LLM and returns the LLM output.

**Example:** 
llm_chain = LLMChain(
    prompt=your_PromptTemplate_template,
    llm=your_llm
)
</div>
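
Conceptually, a chain does two things in sequence: it fills in the template, then it passes the resulting string to the model. The sketch below illustrates that flow with a fake model so that it runs without any API key. `run_chain` and `echo_llm` are hypothetical names; LangChain's `LLMChain` does more than this (for example, memory and callbacks).

```python
def run_chain(template, llm, **inputs):
    """Format `template` with `inputs`, send the prompt to `llm`, return its reply."""
    prompt = template.format(**inputs)
    return llm(prompt)


def echo_llm(prompt):
    """Fake model used for illustration: it only reports the prompt it received."""
    return f"[model received] {prompt}"


print(run_chain("Translate '{word}' into {language}.", echo_llm, word="cat", language="French"))
```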


In [None]:
###########################
# Task: 
#   Create an LLMChain with your prompt and your hub_llm
#
###########################

from langchain import LLMChain

# TODO: your code below



<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Use `LLMChain.run()` or `LLMChain.predict()` to ask for the date of birth of Napoleon.


</div>

In [None]:
###########################
# Task: 
#   Ask the language model for the date of birth of Napoleon.
#
###########################

# TODO: your code below


**SPOILER:** Well, that's not amazing. The LLM understood we wanted a date, and we even got something close to Napoleon's real date of birth, but the result is wrong. This is because this LLM is small and not well suited to this task.

## 2.2 OpenAI LLM

In this part, we will ask a more powerful OpenAI LLM called [`text-davinci-003`](https://platform.openai.com/docs/models/gpt-3), a variant of GPT-3, to perform the same task we gave the model hosted on Hugging Face, and see if we get better results.

### 2.2.1 Initializing the LLM

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Run the cell below to initialize the connection to the LLM.

</div>

In [None]:
# Run this cell

from langchain.llms import OpenAI

davinci = OpenAI(model_name='text-davinci-003', openai_api_key =  OPENAI_API_KEY)

### 2.2.2 Use LangChain to ask a question to the LLM

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Using langchain.LLMChain, create a chain to run the same prompt as in part 2.1.2 through our OpenAI LLM.

</div>

In [None]:
###########################
# Task: 
#   Create an LLMChain with your prompt and the davinci LLM
#
###########################

# TODO: your code below


<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Use `LLMChain.run()` or `LLMChain.predict()` to ask for the date of birth of Napoleon.


</div>

In [None]:
###########################
# Task: 
#   Ask the language model for the date of birth of Napoleon.
#
###########################

# TODO: your code below


**SPOILER:** Alright, that's better: Napoleon was indeed born on August 15, 1769. Let's ask something more complex and recent and see how our LLM does.

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Using [langchain.PromptTemplate](https://api.python.langchain.com/en/latest/prompts/langchain.prompts.prompt.PromptTemplate.html), create a new prompt template with an input variable called `year` that asks the language model to give the winner of the FIFA World Cup in a given year.

</div>

In [None]:
###########################
# Task: 
#   Create a PromptTemplate with an input variable called `year` that asks the language model to give the winner of the FIFA world cup of that year.
###########################

# TODO: your code below

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Using langchain.LLMChain, create a chain to run the new prompt template through our OpenAI LLM.

</div>

In [None]:
###########################
# Task: 
#   Create an LLMChain with your new prompt and the davinci LLM
#
###########################

# TODO: your code below

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Use `LLMChain.run()` or `LLMChain.predict()` to ask for the winner of the 2022 FIFA world cup.


</div>

In [None]:
###########################
# Task: 
#   Ask the language model for the winner of the 2022 FIFA World Cup
#
###########################

# TODO: your code below


The LLM doesn't know what happened in 2022 because its training data ends before that year. How can we bring the LLM's knowledge up to date? With retrieval augmentation!

# 3. Retrieval Augmentation of a LLM using a Vector Database

**WARNING:** Some of the cells in this part may take some time to run as we are working with a lot of data.

Even the most powerful LLMs have no knowledge of events that occurred after their training data was collected, nor can they reliably cite sources on their own. In general, an LLM only has knowledge of what it was exposed to during training. For LLMs, the world exists as a static snapshot, frozen as it was in their training data.

A solution to this problem is **retrieval augmentation**: we retrieve relevant information from an external knowledge base and give that information to our LLM.

*Source: https://docs.pinecone.io/docs/langchain#retrieval-augmentation-in-langchain*


<div width=50% style="display: block; margin: auto">
    <img src="figures/augmentation.png" width=70%>
</div>

To perform **retrieval augmentation**, we will embed data with an **embedding model** and store the embedded data in a **vector database index**. We will then create a **LangChain vectorstore** that uses the index and the embedding model to find the information most relevant to the user's prompt, and feeds the top results to the **LLM** to provide it with context and sources.
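
The whole pipeline can be sketched end to end in a few lines of plain Python. Everything here is a simplification chosen so that the example runs offline: `toy_embed` is a hypothetical word-count "embedding" (a real embedding model returns a dense vector), the "vector database" is a Python list scanned in full, and the prompt wording is made up for illustration.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Toy embedding: a sparse word-count vector keyed by lowercase word."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, documents, k=1):
    """Return the k documents whose embeddings are most similar to the question."""
    q = toy_embed(question)
    ranked = sorted(documents, key=lambda d: cosine_similarity(q, toy_embed(d)), reverse=True)
    return ranked[:k]


def augment(question, documents, k=1):
    """Build a prompt that places the retrieved context before the question."""
    context = "\n".join(retrieve(question, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


knowledge_base = [
    "The quickstart index stores eight-dimensional vectors.",
    "Cosine similarity compares the direction of two vectors.",
    "Jupyter notebooks mix code cells and markdown cells.",
]
print(augment("What does cosine similarity compare?", knowledge_base))
```

The augmented prompt, rather than the bare question, is what gets sent to the LLM.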

## 3.1 Building a knowledge base with Vector Embedding 

Source: [Creating the knowledge base](https://www.pinecone.io/learn/series/langchain/langchain-retrieval-augmentation/#Creating-the-Knowledge-Base)

This part consists of taking a dataset of relevant content we want to augment our LLM with (for example, code documentation for an LLM that helps write code, or company documents for an internal chatbot), processing it, and embedding it into vectors. 

You can look [here](https://www.pinecone.io/learn/series/langchain/langchain-retrieval-augmentation/#Creating-the-Knowledge-Base) for the full process, which you would have to follow if you wanted to augment your LLM for a specific use case.

**For the sake of simplicity, we will use a pre-embedded dataset from `pinecone_datasets` and upload it to a Pinecone vector database. Unfortunately, this dataset doesn't contain data on the 2022 FIFA World Cup, but the same process can be applied to other datasets, should you want to use it for personal projects.**

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Run the cell below to import and process the pre-embedded data we will use for retrieval augmentation.

</div>

In [None]:
# Run this cell

import pinecone_datasets

# We import a dataset
dataset = pinecone_datasets.load_dataset('wikipedia-simple-text-embedding-ada-002-100K')
# We drop the original 'metadata' column and use the 'blob' column as the metadata instead
dataset.documents.drop(['metadata'], axis=1, inplace=True)
dataset.documents.rename(columns={'blob': 'metadata'}, inplace=True)
# We will use rows of the dataset up to index 30_000 to make the upload to the Pinecone Vector Database faster
dataset.documents.drop(dataset.documents.index[30_000:], inplace=True)

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Run the cell below to check that your Pinecone database does not contain any indexes, and delete them if there are any.

</div>

In [None]:


pinecone.list_indexes()

#pinecone.delete_index("index_name")

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Use the [`create_index`](https://docs.pinecone.io/reference/create_index) function to create an index, name it `langchain-retrieval-augmentation-fast`, set the dimension to **1536**, and use the metric `cosine`.

</div>


**Note:** We set the dimension to 1536 as this is the dimension of OpenAI's text embedding model 'text-embedding-ada-002' that we use to embed the data. If you wish to use another model (for instance a Hugging Face model using HuggingFaceInferenceAPIEmbeddings), you will have to change this number to match the dimension of your model.
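
A dimension mismatch between the vectors and the index is a common source of upload errors, so it can be worth checking before pushing any data. The helper below is a minimal sketch; `check_dimensions` is a hypothetical name and is not part of the Pinecone client.

```python
def check_dimensions(vectors, index_dimension):
    """Raise ValueError if any vector's length differs from the index dimension."""
    for position, vector in enumerate(vectors):
        if len(vector) != index_dimension:
            raise ValueError(
                f"vector {position} has {len(vector)} values, "
                f"but the index expects {index_dimension}"
            )
    return True


# Two placeholder vectors with the dimension used in this lab.
print(check_dimensions([[0.0] * 1536, [0.1] * 1536], index_dimension=1536))
```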

In [None]:
###########################
# Task: 
#   Create an index called langchain-retrieval-augmentation-fast with dimension 1536 and the cosine metric.
#
###########################

index_name = 'langchain-retrieval-augmentation-fast'

# TODO: your code below




<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Run the cell below to push the data onto the Vector Database. This can take a few minutes.

</div>
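
The upload for this step sends the documents in batches of 100 rather than one at a time or all at once, which keeps each request small while limiting the number of round trips. If your own data is a plain Python iterable, batching can be sketched as follows; `batched` is a generic helper written for illustration and is independent of `pinecone_datasets`.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most `batch_size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


sizes = [len(batch) for batch in batched(range(250), 100)]
print(sizes)  # → [100, 100, 50]
```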

In [None]:
# Run this cell - It may take a few minutes

import time

index = pinecone.GRPCIndex(index_name) # GRPC allows for faster upserts to the Pinecone Vector Database
# wait a moment for the index to be fully initialized
time.sleep(1)

index.describe_index_stats()

for batch in dataset.iter_documents(batch_size=100):
    index.upsert(batch)


index.describe_index_stats()

## 3.2 Creating a vector store

Now that we've built our index, we can switch over to LangChain. We need to initialize a LangChain vector store using the same index we just built. For this we will also need a LangChain embedding object.

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Run the following cells to initialize a LangChain vector store.

</div>

In [None]:
# Run this cell

from langchain.embeddings.openai import OpenAIEmbeddings

model_name = 'text-embedding-ada-002'

embed = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)

In [None]:
# Run this cell

from langchain.vectorstores import Pinecone

text_field = "text"

# switch back to normal index for langchain
index = pinecone.Index(index_name)

vectorstore = Pinecone(
    index, embed, text_field
)



<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Run the following cell to perform a `similarity_search` of the content of the query on our [vectorstore](https://python.langchain.com/docs/modules/data_connection/vectorstores/) containing embedded information about our augmentation dataset.

</div>

In [None]:
# Run this cell

query = "When was Napoleon born"

vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)

## 3.3 Augmented LLM query

We finally have all the pieces of the puzzle, and we can now make the LLM answer a question based on the information returned from the vectorstore. This allows for many things, like giving it up-to-date information, but can also be used to [cite sources](https://python.langchain.com/docs/use_cases/question_answering/vector_db_qa#return-source-documents). 

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Run the following cell to create a `VectorStoreRetrieverMemory` object that we will pass on to the LLMChain.
</div>

In [None]:
# Run this cell
from langchain.memory.vectorstore import VectorStoreRetrieverMemory

retriever = vectorstore.as_retriever(search_kwargs=dict(k=1))
memory_RAG = VectorStoreRetrieverMemory(retriever=retriever)


<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Create a new prompt template with a question of your choice. For example, you can ask about the life of historical figures.

- Using langchain.LLMChain, create a chain to run the new prompt template through our OpenAI LLM `davinci`. Add `memory = memory_RAG` to the list of arguments so that the chain retrieves relevant context from the vectorstore.



</div>

In [None]:
###########################
# Task: 
#   Create a prompt template and an LLMChain that uses memory_RAG, then ask your question.
#
###########################

# TODO: your code below


## 3.4 Augmented LLM query with sources

As stated before, we can use retrieval augmentation to provide sources to the output of the LLM (provided the sources were in the metadata of the embedded dataset).
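
The idea behind the `"stuff"` chain type is to concatenate ("stuff") all retrieved documents into a single prompt. A minimal sketch of how sources can travel with the text is shown below. The function `stuff_with_sources`, the prompt wording, and the example URLs are all hypothetical; LangChain's own prompt for this chain is different.

```python
def stuff_with_sources(question, documents):
    """Concatenate retrieved documents into one prompt and list their sources.

    `documents` is a list of dicts with a "text" field and a "source" field,
    mirroring the text and metadata stored alongside each vector.
    """
    context = "\n".join(
        f"Content: {doc['text']}\nSource: {doc['source']}" for doc in documents
    )
    prompt = (
        "Answer the question using the extracts below and cite the sources you used.\n"
        f"{context}\n"
        f"Question: {question}"
    )
    sources = [doc["source"] for doc in documents]
    return prompt, sources


retrieved_demo = [
    {"text": "First placeholder extract.", "source": "https://example.org/doc-1"},
    {"text": "Second placeholder extract.", "source": "https://example.org/doc-2"},
]
demo_prompt, demo_sources = stuff_with_sources("What do the extracts say?", retrieved_demo)
print(demo_prompt)
print(demo_sources)
```

Because each extract is labelled with its source inside the prompt, the model can refer to those labels in its answer.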

In [None]:
# Run this cell

from langchain.chains import RetrievalQAWithSourcesChain

chain = RetrievalQAWithSourcesChain.from_chain_type(llm = davinci, chain_type="stuff", retriever=retriever)
chain({"question": "Tell me about the life of Napoleon"})




<div class="alert alert-block alert-warning">
    <b>👩‍💻👨‍💻 Optional action</b>

- Reading: [Making Retrieval Augmented Generation Fast](https://www.pinecone.io/learn/fast-retrieval-augmented-generation/)
    
</div>

# 4. The end

That's the end of this lab! We hope you learned a lot through it, and that you are now ready to set out on your own and explore everything that is possible with LLMs and vector databases.

<div class="alert alert-block alert-warning">
    <b>👩‍💻👨‍💻 Optional action</b>

Explore the other examples of applications of vector databases listed in Pinecone's documentation:

https://docs.pinecone.io/page/examples
    
</div>


To submit this assignment and **every other future assignment**, including the **final assignment**, you have to:
- Commit and push your code to GitHub
- Go to **your** repository of the assignment. This must be on our course organisation `UCL-ELEC0136` and usually has the pattern `https://github.com/UCL-ELEC0136/<assignment-name>-<your-github-username>`.
- Go in the `Pull requests` tab and click on the `Feedback` pull request.
- Click on `Files changed` and verify that the files you have changed are listed.
- Merge the pull request by clicking on `Merge pull request` and then `Confirm merge`.

We are now ready to push our code that acquires data from GitHub to our repository (which is also GitHub, but this is just a coincidence, we could have used any other API, like Twitter's or Facebook's).

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

Submit your assignment by following the steps above.
</div>