# **Building the Database**

Scraping the paragraph text from Toyota's Wikipedia page and saving it as a JSON file

In [2]:
import requests
from bs4 import BeautifulSoup
import json

url = 'https://en.wikipedia.org/wiki/Toyota'
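response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early on an HTTP error instead of parsing an error page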

soup = BeautifulSoup(response.content, 'lxml')
paragraphs = soup.find_all('p')

paragraph_texts = [para.get_text() for para in paragraphs]

with open('scraped_data.json', 'w', encoding='utf-8') as file:
    json.dump(paragraph_texts, file, ensure_ascii=False, indent=4)

print("Data successfully saved to 'scraped_data.json'.")


Data successfully saved to 'scraped_data.json'.


In [3]:
import json

with open('scraped_data.json', 'r', encoding='utf-8') as file:
    scraped_data = json.load(file)
print(scraped_data)

['\n', 'Toyota Motor Corporation (Japanese: トヨタ自動車株式会社, Hepburn: Toyota Jidōsha kabushikigaisha, IPA: [toꜜjota], English: /tɔɪˈjoʊtə/, commonly known as simply Toyota) is a Japanese multinational automotive manufacturer headquartered in Toyota City, Aichi, Japan. It was founded by Kiichiro Toyoda and incorporated on August 28, 1937. Toyota is the largest automobile manufacturer in the world, producing about 10 million vehicles per year.\n', "The company was originally founded as a spinoff of Toyota Industries, a machine maker started by Sakichi Toyoda, Kiichiro's father. Both companies are now part of the Toyota Group, one of the largest conglomerates in the world. While still a department of Toyota Industries, the company developed its first product, the Type A engine, in 1934 and its first passenger car in 1936, the Toyota AA.\n", "After World War II, Toyota benefited from Japan's alliance with the United States to learn from American automakers and other companies, which gave rise t
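The scraped list contains whitespace-only entries (the leading '\n') and Wikipedia markers such as [1] or [update]. The following is a minimal cleaning sketch, not part of the original pipeline; the helper name `clean_paragraphs` and the marker pattern are assumptions chosen for illustration, and the pattern may need tuning for other pages.

```python
import re

# Assumed pattern: numeric citations like "[1]", single-letter notes like "[a]",
# and the "[update]" / "[citation needed]" markers. Other bracketed text
# (for example IPA transcriptions) is left untouched.
BRACKET_MARKER = re.compile(r"\[(?:\d+|[a-z]|update|citation needed)\]")


def clean_paragraphs(paragraphs):
    """Return the paragraphs with markers removed and empty entries dropped."""
    cleaned = []
    for para in paragraphs:
        text = BRACKET_MARKER.sub("", para)
        # Replace non-breaking spaces, then trim surrounding whitespace.
        text = text.replace("\xa0", " ").strip()
        if text:
            cleaned.append(text)
    return cleaned


sample = [
    "\n",
    "Production reached 300 million vehicles.[1]\n",
    "as of December\xa02020[update].\n",
]
print(clean_paragraphs(sample))
```

Because this sketch strips the trailing newlines, cleaned paragraphs would need to be joined with "\n" rather than "" before splitting.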

Installing the dependencies

In [4]:
!pip install -qU \
    cohere \
    pinecone \
    langchain

Setting up a text splitter that breaks the data into overlapping chunks, so that large documents can be embedded and retrieved piece by piece

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=20,
    separators=["\n\n", "\n", " ", ""]
)

# joining the data into a string to split it
text_to_split = "".join(scraped_data)
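To make the roles of chunk_size and chunk_overlap concrete, here is a simplified character-window sketch. It is not the algorithm used by RecursiveCharacterTextSplitter, which prefers to break on the listed separators; the function name `sliding_window_chunks` is hypothetical and only illustrates how consecutive chunks share a few characters.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# Each chunk repeats the last character of the previous one.
print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
```

The overlap means a sentence cut at a chunk boundary still appears, at least partly, in the neighbouring chunk.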

Example of the created chunks

In [6]:
chunks = text_splitter.split_text(text_to_split)
chunks[1]

"In the 1960s, Toyota took advantage of the rapidly growing Japanese economy to sell cars to a growing middle-class, leading to the development of the Toyota Corolla, which became the world's all-time best-selling automobile. The booming economy also funded an international expansion that allowed Toyota to grow into one of the largest automakers in the world, the largest company in Japan and the ninth-largest company in the world by revenue, as of December\xa02020[update]. Toyota was the world's first automobile manufacturer to produce more than 10 million vehicles per year, a record set in 2012, when it also reported the production of its 200 millionth vehicle. By September 2023, total production reached 300 million vehicles.[1]\nToyota was praised for being a leader in the development and sales of more fuel-efficient hybrid electric vehicles, starting with the introduction of the original Toyota Prius in 1997. The company now sells more than 40 hybrid vehicle models around the world.

# **Creating Embeddings**

Setting up the keys

In [7]:
!pip install python-dotenv

Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1


In [9]:
from dotenv import load_dotenv
import os


load_dotenv()  # Load environment variables from .env file
COHERE_API_KEY = os.getenv('COHERE_API_KEY')
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY')
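os.getenv returns None when a variable is missing, so a typo in the .env file only surfaces later as an authentication error. A small guard like the one below can make that failure explicit; `require_env` is a hypothetical helper, not part of python-dotenv.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file"
        )
    return value
```

With this in place, the keys could be loaded as `COHERE_API_KEY = require_env('COHERE_API_KEY')` after calling `load_dotenv()`.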

In [11]:
import cohere
from pinecone import Pinecone

co = cohere.Client(COHERE_API_KEY)

pc = Pinecone(api_key=PINECONE_API_KEY)


Using the Cohere "embed-english-v3.0" model to generate the embeddings

In [12]:
model="embed-english-v3.0"
input_type="search_document"

texts = ["This is a test sentence.", "Another sentence for embedding."]

embeds = co.embed(
    texts=texts,
    model=model,
    input_type=input_type,
    embedding_types=['float'])

(test1, test2) = embeds.embeddings.float

print(test1)

[-0.009712219, -0.016036987, 2.8073788e-05, -0.022491455, -0.041259766, 0.002281189, -0.033294678, -0.00057029724, -0.026260376, 0.0579834, -0.020874023, -0.0032749176, -0.042022705, 0.024505615, -0.035308838, -0.027236938, -0.006385803, 0.034362793, -0.027175903, -0.017242432, -0.026870728, -0.0076141357, -0.07165527, -0.04296875, 0.031921387, -0.028121948, -0.014099121, 0.02420044, 0.0110321045, -0.0060691833, 0.008773804, -0.0031032562, -0.006587982, 0.034088135, -0.012756348, -0.00554657, -0.022598267, 0.010856628, 0.0023479462, 0.032226562, 0.021362305, -0.0026226044, -0.008834839, -0.018096924, -0.038513184, -0.059020996, 0.068847656, -0.004184723, 0.01436615, -0.039886475, -0.002811432, 0.009223938, -0.0016822815, 0.016983032, -0.033233643, -0.0022411346, -0.044708252, 0.046295166, 0.03768921, 0.0077590942, -0.018569946, 0.03427124, 0.021209717, -0.0063323975, -0.01184845, 0.0019378662, -0.016921997, 0.02835083, 0.0769043, -0.027069092, -0.031555176, 0.0011129379, -0.006679535, 

Each embedding is a 1024-dimensional vector, which is the output size of the Cohere "embed-english-v3.0" model

# **Vector Database**

Setting up our index specification

In [13]:
from pinecone import ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-east-1"
)

Then we initialize the index. We set the dimension to 1024 to match the output size of the Cohere model

In [14]:
import time

index_name = 'testing'
existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

# check if index already exists
if index_name not in existing_indexes:
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=1024,   # dimensionality of embed-english-v3.0
        metric='dotproduct', #can use dot product, cosine similarity, and Euclidean distance as the similarity metric
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()

{'dimension': 1024,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}
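The index above uses the dot product as its similarity metric. As a rough illustration of how the three metrics mentioned in the code comment differ, the sketch below compares them on toy 2-dimensional vectors; the vectors and helper names are made up for the example and have nothing to do with real embeddings.

```python
import math


def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    return dot_product(a, b) / (
        math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    )


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


q_vec = [1.0, 0.0]
same_direction = [2.0, 0.0]  # points the same way as q_vec, but is longer
orthogonal = [0.0, 3.0]      # at a right angle to q_vec

for name, vec in [("same_direction", same_direction), ("orthogonal", orthogonal)]:
    print(
        name,
        dot_product(q_vec, vec),
        cosine_similarity(q_vec, vec),
        euclidean_distance(q_vec, vec),
    )
```

Dot product rewards both direction and length, cosine similarity looks at direction only, and Euclidean distance is smaller for closer points. When all vectors have unit length, dot product and cosine similarity produce the same ranking.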

# **Indexing**

We create embeddings for each chunk and add them to the index

In [15]:
from tqdm.auto import tqdm
from uuid import uuid4

batch_limit = 100

texts = []
metadatas = []

data = "".join(scraped_data)
record_texts = text_splitter.split_text(data)

# Create metadata for each text chunk
record_metadatas = [{
    "chunk": j, "text": text
} for j, text in enumerate(record_texts)]

print(f"Number of chunks: {len(record_texts)}")
print(f"Number of metadata entries: {len(record_metadatas)}")

# texts.extend(record_texts)
# metadatas.extend(record_metadatas)


def embed_and_upsert(batch_texts, batch_metadatas):
    """Embed one batch of chunks with Cohere and upsert it into the Pinecone index."""
    ids = [str(uuid4()) for _ in range(len(batch_texts))]
    embeds = co.embed(
        texts=batch_texts,
        model=model,
        input_type=input_type,
        embedding_types=['float'])
    embed = embeds.embeddings.float

    # every vector needs exactly one id and one metadata entry
    if len(ids) != len(embed) or len(embed) != len(batch_metadatas):
        raise ValueError(f"Mismatch between ids ({len(ids)}), embeddings ({len(embed)}), and metadata ({len(batch_metadatas)})")

    index.upsert(vectors=zip(ids, embed, batch_metadatas))


for i, chunk in enumerate(tqdm(record_texts)):
    texts.append(chunk)
    metadatas.append(record_metadatas[i])

    # flush once a full batch has been collected
    if len(texts) >= batch_limit:
        embed_and_upsert(texts, metadatas)
        texts = []
        metadatas = []

# flush the final, possibly partial, batch
if len(texts) > 0:
    embed_and_upsert(texts, metadatas)
    texts = []
    metadatas = []


Number of chunks: 58
Number of metadata entries: 58


  0%|          | 0/58 [00:00<?, ?it/s]
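The loop above collects chunks until batch_limit is reached and then flushes the remainder. An equivalent way to think about it is slicing the list into fixed-size batches, sketched below with a hypothetical `iter_batches` helper and plain integers standing in for the 58 chunks.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


chunk_ids = list(range(58))  # stand-in for the 58 text chunks
print([len(batch) for batch in iter_batches(chunk_ids, 100)])
print([len(batch) for batch in iter_batches(chunk_ids, 25)])
```

With 58 chunks and a limit of 100, everything fits into a single partial batch, which is why only the final flush runs in this notebook.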

In [16]:
index.describe_index_stats()

{'dimension': 1024,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}
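The count still reads 0 here even though the later similarity search returns documents. One possible explanation is that index statistics lag behind recent writes. Under that assumption, a small polling helper can wait until the expected count appears; `wait_for_count` is a hypothetical helper, and the demo uses a stand-in counter rather than a real index.

```python
import time


def wait_for_count(get_count, expected, timeout=30.0, interval=1.0):
    """Poll get_count() until it reaches expected or timeout seconds have passed.

    Returns the last observed count, which may still be below expected.
    """
    deadline = time.monotonic() + timeout
    count = get_count()
    while count < expected and time.monotonic() < deadline:
        time.sleep(interval)
        count = get_count()
    return count


# Stand-in for the real stats call: reports 0 twice, then 58.
readings = iter([0, 0, 58])
print(wait_for_count(lambda: next(readings), expected=58, interval=0.01))

# Assumed usage against the index from this notebook (not executed here):
# wait_for_count(lambda: index.describe_index_stats().total_vector_count,
#                expected=len(record_texts))
```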

In [17]:
# print("Count of embeddings:", len(embeds.embeddings.float))
print("Count of metadata:", len(metadatas))


Count of metadata: 0


# **Creating a Vector Store**

In [18]:
!pip install -U langchain-pinecone


Collecting langchain-pinecone
  Downloading langchain_pinecone-0.2.0-py3-none-any.whl.metadata (1.7 kB)
Collecting aiohttp<3.10,>=3.9.5 (from langchain-pinecone)
  Downloading aiohttp-3.9.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.5 kB)
Collecting pinecone-client<6.0.0,>=5.0.0 (from langchain-pinecone)
  Downloading pinecone_client-5.0.1-py3-none-any.whl.metadata (19 kB)
Downloading langchain_pinecone-0.2.0-py3-none-any.whl (11 kB)
Downloading aiohttp-3.9.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Downloading pinecone_client-5.0.1-py3-none-any.whl (244 kB)
Installing collected packages: pinecone-client, aiohttp, langchain-pinecone
  Attempting uninstall: aiohttp
    Found existing ins

In [19]:
from langchain_pinecone import PineconeVectorStore

# Custom embedding class for Cohere
class CohereEmbedder:
    def embed_query(self, query):
        # Queries use input_type="search_query"; the indexed chunks used "search_document"
        embeds = co.embed(
            texts=[query],  # Cohere expects a list of texts
            model="embed-english-v3.0",
            input_type="search_query",
            embedding_types=['float']
        )
        return embeds.embeddings.float[0]  # the single query embedding

# Instantiate the embedder
cohere_embedder = CohereEmbedder()

# Initialize the vector store with the embedding method.
# PineconeVectorStore replaces the deprecated langchain_pinecone.Pinecone class and
# avoids shadowing the pinecone.Pinecone client imported earlier.
text_field = "text"
vectorstore = PineconeVectorStore(
    index=index, embedding=cohere_embedder, text_key=text_field
)


In [20]:

# Run a similarity search
query = "who founded toyota"

vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)

[Document(id='9c7b0aa3-cdcd-4e7a-a291-932e69867ce0', metadata={'chunk': 0.0}, page_content="Toyota Motor Corporation (Japanese: トヨタ自動車株式会社, Hepburn: Toyota Jidōsha kabushikigaisha, IPA: [toꜜjota], English: /tɔɪˈjoʊtə/, commonly known as simply Toyota) is a Japanese multinational automotive manufacturer headquartered in Toyota City, Aichi, Japan. It was founded by Kiichiro Toyoda and incorporated on August 28, 1937. Toyota is the largest automobile manufacturer in the world, producing about 10 million vehicles per year.\nThe company was originally founded as a spinoff of Toyota Industries, a machine maker started by Sakichi Toyoda, Kiichiro's father. Both companies are now part of the Toyota Group, one of the largest conglomerates in the world. While still a department of Toyota Industries, the company developed its first product, the Type A engine, in 1934 and its first passenger car in 1936, the Toyota AA.\nAfter World War II, Toyota benefited from Japan's alliance with the United Sta

In [21]:
def generate_answer_from_cohere(query, vectorstore, k=3):
    relevant_docs = vectorstore.similarity_search(query, k=k)

    context = "\n".join([doc.page_content for doc in relevant_docs])

    prompt = f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"

    response = co.generate(
        model='command-xlarge-nightly',
        prompt=prompt,
        max_tokens=150,
        temperature=0.7,
        stop_sequences=["\n"]
    )

    return response.generations[0].text.strip()

# Example usage
query = "should i trust toyota"


# Generate an answer based on the query and relevant context from Pinecone
answer = generate_answer_from_cohere(query, vectorstore, k=3)

print("Generated Answer:", answer)

Generated Answer: Based on the information provided, there are several factors to consider when determining whether or not to trust Toyota. On the one hand, Toyota has made significant investments in artificial intelligence, robotics, and electric vehicle battery production, demonstrating a commitment to innovation and sustainability. They have also maintained their position as the world's best-selling automaker for the third year in a row. However, there have been several incidents that have damaged Toyota's reputation, such as the revelation that their subsidiary Daihatsu cheated in crash tests, the recall of their first mass-produced all-electric vehicles due to safety concerns, and their support of the Trump Administration's proposal to reduce California's fuel efficiency standards, which has negatively impacted their reputation as a green brand. Ultimately, whether
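With chunks of up to 1500 characters and k=3, the context passed to the model can reach several thousand characters. The sketch below builds the same "Context / Question / Answer" prompt as generate_answer_from_cohere but stops adding passages once a character budget is used up. The `build_prompt` helper and its `max_context_chars` parameter are hypothetical additions, not part of the Cohere or LangChain APIs.

```python
def build_prompt(query, passages, max_context_chars=4000):
    """Assemble the prompt, keeping passages in ranked order within a character budget."""
    selected = []
    used = 0
    for passage in passages:
        # Stop at the first passage that would exceed the budget.
        if used + len(passage) > max_context_chars:
            break
        selected.append(passage)
        used += len(passage)
    context = "\n".join(selected)
    return f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"


print(build_prompt("Who created toyota", ["Passage one.", "Passage two."]))
```

Inside generate_answer_from_cohere, the passages would be `[doc.page_content for doc in relevant_docs]`. A character budget is only a crude proxy for the model's token limit.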


In [22]:
query = "Who created toyota"
answer = generate_answer_from_cohere(query, vectorstore, k=3)
print("Generated Answer:", answer)

Generated Answer: Toyota Motor Corporation was founded by Kiichiro Toyoda and incorporated on August 28, 1937. The company was originally founded as a spinoff of Toyota Industries, a machine maker started by Sakichi Toyoda, Kiichiro's father.


In [23]:
query = "Who is toyota's biggest competitor"
answer = generate_answer_from_cohere(query, vectorstore, k=3)
print("Generated Answer:", answer)

Generated Answer: Toyota's biggest competitors are Volkswagen and other large automakers, such as General Motors, Ford, and Stellantis.


In [24]:
query = "Which is the best car toyota made"
answer = generate_answer_from_cohere(query, vectorstore, k=3)
print("Generated Answer:", answer)

Generated Answer: The Toyota Corolla is the best-selling car that Toyota has made, having become the world's all-time best-selling automobile.


In [25]:
query = "Name some previous CEO's"
answer = generate_answer_from_cohere(query, vectorstore, k=3)
print("Generated Answer:", answer)

Generated Answer: Some previous CEOs of Toyota include Akio Toyoda, Katsuaki Watanabe, and Eiji Toyoda.


In [26]:
query = "Whats the latest news"
answer = generate_answer_from_cohere(query, vectorstore, k=3)
print("Generated Answer:", answer)

Generated Answer: The latest news in the context provided is from July 2024, when Toyota announced plans to build an electric car cell plant in Fukuoka and export them to the rest of Asia.


In [27]:
query = "How huge is the brand"
answer = generate_answer_from_cohere(query, vectorstore, k=3)
print("Generated Answer:", answer)

Generated Answer: Toyota is a massive brand, with a global presence and a history of innovation and success. It is the world's largest automaker, the largest company in Japan, and the ninth-largest company in the world by revenue. Toyota has sold more than 300 million vehicles worldwide, including the Toyota Corolla, which is the world's all-time best-selling automobile. The company has also been a leader in the development and sales of fuel-efficient hybrid electric vehicles, with more than 40 hybrid vehicle models sold around the world. In addition, Toyota has expanded its offerings to include luxury cars through its Lexus division, and has also developed the Toyota Coaster minibus, which is widely used in Japan, Singapore, Hong Kong, Australia,
