# Embeddings Lab

### Introduction

In this lesson, we'll work with a dataset of restaurant reviews to build a chatbot that can give restaurant recommendations.  

We'll do this in two parts.  

1. Build the database

First, we'll read our original reviews, retrieve an embedding for each one, and store the text along with the corresponding embedding vector in a parquet file.

2. Finding the related reviews

Once we have our reviews along with their corresponding embeddings, we can take a question, embed it, and then use cosine similarity to find the reviews most relevant to that question.  From there, we'll combine these reviews and provide them as context to our LLM.
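To make the cosine similarity step concrete, here is a minimal sketch. The three-dimensional vectors and the `cosine_similarity` helper are made up for illustration; real embeddings come from the embeddings API and have many more dimensions.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy "embeddings" (hypothetical values, not real API output).
question_vec = np.array([1.0, 0.0, 1.0])
review_vecs = {
    "great tacos": np.array([0.9, 0.1, 0.8]),
    "slow service": np.array([0.0, 1.0, 0.1]),
}

# Sort the review texts from most to least similar to the question.
ranked = sorted(
    review_vecs,
    key=lambda text: cosine_similarity(question_vec, review_vecs[text]),
    reverse=True,
)
print(ranked)
```

A score near 1 means the two vectors point in nearly the same direction, so the review that ranks first is the one we treat as most relevant.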

Ok, let's get started.


### Building our database

You can get an overview of how we'll build our database by looking at the `db_builder/build_db.py` file.

```python
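# Read the raw reviews.
file_name = './source_data/reviews.csv'
df = read_csv(file_name)
# Combine the restaurant name and review into one string per row.
text_series = combine_data(df)
# Embed the strings 30 at a time, then save the text and vectors together.
combined_embedded = build_in_batches_from(text_series, batch_size=30)
combined_embedded.to_parquet('../database.parquet')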
```

As you can see, we'll read in our original reviews, then combine the `Restaurant` and `Review` columns into one combined column.

Then the `build_in_batches_from` function turns each combined restaurant name and review into an embedding vector.  It returns a dataframe that has both the original text and the corresponding embedding, and we save this to a parquet file.

Ok, let's take these functions in turn.

1. `combine_data`

* Take in a dataframe of the reviews, and return a series where each entry is a string containing both the restaurant and the review.  For example, if we had a restaurant of `Chipotle` and a review of `good tacos`, the entry would be:

`'restaurant name: Chipotle Review: good tacos'`

> See the corresponding test.
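One possible shape for this function is sketched below. It assumes the column names `Restaurant` and `Review` mentioned above, and it is only one way to pass the test, not necessarily the intended solution.

```python
import pandas as pd


def combine_data(df):
    """Return a Series with one 'restaurant name: ... Review: ...' string per row."""
    return "restaurant name: " + df["Restaurant"] + " Review: " + df["Review"]


df = pd.DataFrame({"Restaurant": ["Chipotle"], "Review": ["good tacos"]})
combined = combine_data(df)
print(combined.iloc[0])
```

With this approach, a row that is missing either value produces a missing entry in the result rather than a string, so the series can still contain NaN values afterwards.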

2. `text_to_vectors`

This function takes in a list of strings and returns one numpy array per string, which is the embedding vector for that string.

3. `build_embeddings_df_from(text_inputs)`

* This takes the `text_inputs` strings and returns a dataframe with one column of the reviews and another column of the corresponding numpy arrays.  Use the `text_to_vectors` function to accomplish this.
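As a rough sketch of how the two functions fit together, the example below swaps in a fake `text_to_vectors` so it can run without an API key. The fake vectors and the column names `text` and `embedding` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def text_to_vectors(texts):
    """Stand-in for the real embedding call: one fake 2-d vector per string."""
    return [np.array([float(len(t)), float(t.count(" "))]) for t in texts]


def build_embeddings_df_from(text_inputs):
    """Pair each input string with its embedding vector in a dataframe."""
    texts = list(text_inputs)
    vectors = text_to_vectors(texts)
    # Column names here are hypothetical; use whatever the tests expect.
    return pd.DataFrame({"text": texts, "embedding": list(vectors)})


embeddings_df = build_embeddings_df_from(["good tacos", "slow"])
print(embeddings_df)
```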

4. `build_in_batches_from(combined_series, batch_size)`

Now there are two problems with our `build_embeddings_df_from(text_inputs)` function.

1. One is that we cannot feed all of our data to the embeddings API at once.  Instead, we need to do this in batches, which for us means 30 records at a time.  So use `range()` to loop through and select 30 records at a time, feed each batch to the `build_embeddings_df_from` function, and collect the resulting dataframes of text and embeddings in a list.

Finally, combine all of the dataframes by using the pandas `concat` function.

2. Another problem is that we need to make sure we only feed text to the embeddings API, and we still have some NaN values in our series.  So make sure the function does some preprocessing by removing all of the NaN values before feeding the text to the embeddings API.
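The two fixes above can be sketched as follows. The `build_embeddings_df_from` here is a stand-in that returns fake vectors, and the column names are assumptions, so treat this as an outline of the batching logic rather than the finished function.

```python
import numpy as np
import pandas as pd


def build_embeddings_df_from(text_inputs):
    """Stand-in for the real function: one fake 1-d vector per string."""
    texts = list(text_inputs)
    vectors = [np.array([float(len(t))]) for t in texts]
    return pd.DataFrame({"text": texts, "embedding": vectors})


def build_in_batches_from(combined_series, batch_size):
    # Remove missing values so only text reaches the embedding step.
    cleaned = combined_series.dropna()
    frames = []
    # Step through the cleaned series batch_size records at a time.
    for start in range(0, len(cleaned), batch_size):
        batch = cleaned.iloc[start:start + batch_size]
        frames.append(build_embeddings_df_from(batch))
    if not frames:
        # pd.concat raises on an empty list, so return an empty frame instead.
        return pd.DataFrame(columns=["text", "embedding"])
    return pd.concat(frames, ignore_index=True)


series = pd.Series(["a", None, "bb", "ccc", np.nan, "dddd", "eeeee"])
result = build_in_batches_from(series, batch_size=2)
print(result)
```

Passing `ignore_index=True` to `concat` gives the combined dataframe a fresh 0-to-n index instead of repeating each batch's own index.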



* Set it up

Now you can `cd` into the `db_builder` folder and run the `build_db.py` file.

`python3 build_db.py`

Notice that we save the data as a parquet file.  We do that because each embedding is a numpy array.  If we stored the data as a csv file, each array would be written out as plain text, so it would not come back as a numpy array when we read the file.
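To see the csv problem for yourself, here is a small sketch that round-trips a made-up one-row dataframe through csv in memory. The column names are illustrative, and the sketch only demonstrates the csv half of the comparison.

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"text": ["good tacos"], "embedding": [np.array([0.1, 0.2, 0.3])]}
)

# Write to an in-memory csv and read it straight back.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

original_type = type(df.loc[0, "embedding"])
restored_value = restored.loc[0, "embedding"]
print(original_type, type(restored_value))
```

After the round trip the embedding is a string that merely looks like an array, so it would have to be parsed again before any similarity calculation.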

### LLM Model

* `build_context_from_distances_to(question, db_path)`

    * This takes in the question and the path to the database we just created.  It returns the three most relevant reviews as the context to pass to the LLM.
    * Each review should be separated by two newline characters, `'\n\n'`.

* `generate_prompt`
    * This takes the question and context, and returns the corresponding instructions to be fed to our LLM.

* `question_and_answer(question, db_path)`
    * This uses the functions above to take in a question and return the corresponding answer from the LLM.
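The sketch below shows one way the context and prompt steps could work on data that is already in memory. The helper `top_reviews_context` is hypothetical and takes an already-embedded question and a dataframe rather than a question string and a `db_path`; the column names, toy vectors, and prompt wording are also assumptions.

```python
import numpy as np
import pandas as pd


def top_reviews_context(question_vec, df, k=3):
    """Join the k reviews most similar to the question with blank lines."""
    matrix = np.vstack(df["embedding"].tolist())  # one row per review
    q = np.asarray(question_vec, dtype=float)
    # Cosine similarity of every review against the question at once.
    sims = matrix @ q / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q))
    top_idx = np.argsort(sims)[::-1][:k]  # highest similarity first
    return "\n\n".join(df["text"].iloc[top_idx])


def generate_prompt(question, context):
    """Wrap the question and retrieved reviews in instructions (wording is illustrative)."""
    return (
        "Answer the question using only the reviews below.\n\n"
        f"Reviews:\n{context}\n\n"
        f"Question: {question}"
    )


db = pd.DataFrame({
    "text": ["A", "B", "C", "D"],
    "embedding": [
        np.array([1.0, 0.0]),
        np.array([0.9, 0.1]),
        np.array([0.0, 1.0]),
        np.array([0.7, 0.7]),
    ],
})
context = top_reviews_context(np.array([1.0, 0.0]), db)
prompt = generate_prompt("Where should I eat?", context)
print(prompt)
```

In the lab itself, `build_context_from_distances_to` would also need to load the parquet file and embed the question before ranking the reviews.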

When these functions are complete, look at the `llm_runner.py` file.  It uses the `question_and_answer` function to take in a question, and return a corresponding answer based on the information in our database.