## Step 2: Vectorize the input data

In this step, we will create a JSON file containing mathematical vectors (embeddings) for all the text in the files we downloaded and cleaned in Step 1. This type of vectorization is powered by large language models and provides the basis for next-generation search capabilities. The code below calls the OpenAI Ada-002 model to produce the vectors and configures the gpt-3.5-turbo chat model as the LLM used by the index.
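To make the idea of "vectors for search" concrete, here is a minimal sketch of how two vectors can be compared with cosine similarity. The three-dimensional vectors and the document names are made up for illustration (real Ada-002 embeddings have 1536 dimensions), and this is not necessarily how llama_index scores matches internally.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_documents(query, vectors):
    """Return document names ordered from most to least similar to the query."""
    return sorted(
        vectors,
        key=lambda name: cosine_similarity(query, vectors[name]),
        reverse=True,
    )


# Toy "embeddings", invented for this example only.
query_vec = [0.9, 0.1, 0.0]
doc_vectors = {
    "install_guide": [0.8, 0.2, 0.1],
    "release_notes": [0.1, 0.9, 0.2],
}

ranked = rank_documents(query_vec, doc_vectors)
print(ranked)
```

Texts whose vectors point in nearly the same direction score close to 1, which is why a question can be matched to passages that are related in meaning even when they share few exact words.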

In [None]:
# Install required packages
!pip install llama_index
!pip install langchain

In [None]:
# Import the required dependencies
from pathlib import Path
from llama_index import LLMPredictor, GPTSimpleVectorIndex, PromptHelper, ServiceContext, SimpleDirectoryReader, download_loader
from langchain import OpenAI
from langchain.llms import OpenAIChat
import json
import os
message = "The dependencies have been imported"
print(message)

__REQUIRED: Enter your OpenAI API key below by replacing the text REPLACE_WITH_OPENAI_API_KEY with your key:__

In [None]:
os.environ['OPENAI_API_KEY'] = 'REPLACE_WITH_OPENAI_API_KEY'
message = "The OPENAI_API_KEY has been loaded"
print(message)
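If you would rather not leave the key in the notebook, one possible alternative is to prompt for it at run time. The sketch below is an assumption about how you might do that, not part of the original walkthrough; `resolve_api_key` and `PLACEHOLDER` are hypothetical names introduced here.

```python
import os
from getpass import getpass

# The placeholder text used in the cell above.
PLACEHOLDER = "REPLACE_WITH_OPENAI_API_KEY"


def resolve_api_key(env=os.environ, prompt=getpass):
    """Return a usable key from `env`, prompting for one if it is missing."""
    key = env.get("OPENAI_API_KEY", "")
    if not key or key == PLACEHOLDER:
        # getpass hides the typed value so it is not echoed into the output.
        key = prompt("OpenAI API key: ").strip()
    if not key or key == PLACEHOLDER:
        raise ValueError("No OpenAI API key was provided")
    env["OPENAI_API_KEY"] = key
    return key


# Demonstration with a stand-in environment and a fake prompt, so nothing
# is read from the keyboard and no real key appears in the notebook.
demo_env = {"OPENAI_API_KEY": PLACEHOLDER}
resolve_api_key(env=demo_env, prompt=lambda _: "sk-example")
print("Key loaded:", demo_env["OPENAI_API_KEY"] != PLACEHOLDER)
```

In a real session you would call `resolve_api_key()` with no arguments so that it reads and updates `os.environ`.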

_Execute the code block below to vectorize your data_

In [None]:
OpenAI.api_key = os.environ.get('OPENAI_API_KEY')
SimpleDirectoryReader = download_loader("SimpleDirectoryReader")

# # set maximum input size
# max_input_size = 4096
# # set number of output tokens
# num_outputs = 4096
# # set maximum chunk overlap
# max_chunk_overlap = 40
# # set chunk size limit
# chunk_size_limit = 600

# define LLM
llm_predictor = LLMPredictor(llm=OpenAIChat(
    temperature=0, model_name="gpt-3.5-turbo"))
# prompt_helper = PromptHelper(max_input_size,
#                              num_outputs,
#                              max_chunk_overlap,
#                              chunk_size_limit=chunk_size_limit)


# loader = UnstructuredReader()
# documents = loader.load_data(file=Path('/root/html/GUID-build-images-index.html'))

# Read every file in the html_downloads directory into a list of documents
documents = SimpleDirectoryReader('html_downloads').load_data()

# print(documents)
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)

index = GPTSimpleVectorIndex.from_documents(
    documents, service_context=service_context
)

index.save_to_disk('testindex.json')
message = "The index has been saved"
print(message)

### Review Step 2 Outputs

Congratulations, you made it through step 2! Your data has now been vectorized!

- In the left nav bar, you should see a file named "testindex.json". This file was created by the previous code block and contains the vectors for your dataset. It's typically a large file that does not open easily in most readers, so it's better to leave it closed and proceed to the next step, where we will load the vectors into an in-memory database and finally query our data!
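If you want to confirm the file was written without opening it in a viewer, a small sketch like the following reports its size and top-level keys. It assumes only that the file is valid JSON and makes no assumption about the index's internal layout; `summarize_json_file` is a hypothetical helper name.

```python
import json
import os


def summarize_json_file(path):
    """Return the file size in MB and the top-level keys of a JSON file."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    # Only a JSON object has keys; lists and scalars yield an empty list.
    top_level = sorted(data.keys()) if isinstance(data, dict) else []
    return {"size_mb": round(size_mb, 2), "top_level_keys": top_level}


# Only inspect the index if the previous cell actually produced it.
if os.path.exists("testindex.json"):
    print(summarize_json_file("testindex.json"))
```

Note that this still loads the whole file into memory, so it may take a moment on a large index.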

## Now, query the data!

In [None]:
response = index.query("Can you tell me what's new in Tanzu Application Platform 1.4?")
print(response)

In [None]:
response = index.query("how can I install tanzu application platform?")
print(response)

In [None]:
response = index.query("What is Namespace Provisioner and what problem does it solve?")
print(response)

In [None]:
response = index.query("What components make up the Namespace Provisioner package and how do they work together?")
print(response)

In [None]:
response = index.query("what is tanzu application platform?")
print(response)

In [None]:
response = index.query("What resources are contained in the default-resources secret and how are they templated?")
print(response)
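The cells above repeat the same two lines for each question. As a possible convenience, the sketch below loops over a list of questions instead. `run_queries` is a hypothetical helper, and it assumes only that `index` exposes the `query` method used above; each call still sends a request to OpenAI.

```python
def run_queries(index, questions):
    """Query the index once per question and return (question, answer) pairs."""
    results = []
    for question in questions:
        response = index.query(question)
        results.append((question, str(response)))
    return results


questions = [
    "What is Tanzu Application Platform?",
    "How can I install Tanzu Application Platform?",
]

# `index` is the object built in Step 2; the guard skips the loop if that
# cell has not been run in this session.
if "index" in globals():
    for question, answer in run_queries(index, questions):
        print(f"Q: {question}\nA: {answer}\n")
```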