### Alpaca-like dataset for the website theplantera.com

This is an example of how to use OpenAI's API to create training data.
I will write a prompt and let a GPT model do its magic.
The API is not free and can get a bit expensive.

However, if you are not willing to pay for the API, you could also paste the same prompt into chat.openai.com and save the responses in a JSON file.

Example dataset format: https://huggingface.co/datasets/tatsu-lab/alpaca

E-commerce website: https://theplantera.com/
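
For the manual route, the copied answers still have to end up in a file with the four Alpaca columns. Below is a minimal sketch of how that could look, assuming the rows are collected by hand as Python dicts. The helper names `make_record` and `save_records`, the sample row, and the file name are illustrative and not part of this notebook; the two templates mirror the `text` column of the tatsu-lab/alpaca dataset linked above.

```python
import json

PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{input}"
    "\n\n### Response:\n{output}"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n### Instruction:\n{instruction}"
    "\n\n### Response:\n{output}"
)

def make_record(instruction, output, input_text=""):
    """Build one Alpaca-style row; `text` is derived from the other three fields."""
    template = PROMPT_WITH_INPUT if input_text else PROMPT_NO_INPUT
    text = template.format(instruction=instruction, input=input_text, output=output)
    return {"instruction": instruction, "input": input_text, "output": output, "text": text}

def save_records(records, path):
    """Write the rows as a JSON array, keeping non-ASCII characters readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

# Hypothetical hand-written row, only to show the shape of the data
records = [
    make_record(
        instruction="List the protein powder flavors mentioned on the page.",
        output="Dark Chocolate and Strawberry Basil.",
    )
]
```

Calling `save_records(records, "plantera_alpaca.json")` would then write the file. Deriving `text` in code rather than typing it by hand keeps it consistent with the other three columns.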

In [6]:
!pip install langchain
!pip install unstructured
!pip install openai

Collecting langchain
  Downloading langchain-0.0.341-py3-none-any.whl (1.9 MB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.3-py3-none-any.whl (28 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl (12 kB)
Collecting langchain-core<0.0.7,>=0.0.6 (from langchain)
  Downloading langchain_core-0.0.6-py3-none-any.whl (174 kB)
Collecting langsmith<0.1.0,>=0.0.63 (from langchain)
  Downloading langsmith-0.0.66-py3-none-any.whl (46 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain)
  Downloading m

### Web scraping

- Get all the sublinks from the main website

In [2]:
import requests
from bs4 import BeautifulSoup
import urllib.parse

def extract_urls(url):
  """Return the de-duplicated theplantera.com links found on the page at `url`."""
  urls = []

  # Send an HTTP GET request to the URL
  response = requests.get(url)

  # Check if the request was successful
  if response.status_code == 200:
      # Parse the HTML content of the page using BeautifulSoup
      soup = BeautifulSoup(response.text, "html.parser")

      # Find all anchor tags (a tags) in the HTML
      links = soup.find_all("a")

      # Collect the href attribute of each anchor tag, filtering out unwanted links
      for link in links:
          href = link.get("href")
          if href and not href.startswith(("#", "javascript:", "mailto:")):
              # Create an absolute URL if it's a relative link
              if not urllib.parse.urlparse(href).scheme:
                  href = urllib.parse.urljoin(url, href)

              # Keep only links that point to the site itself
              if 'https://theplantera.com/' in href:
                  urls.append(href)

      # Remove duplicates once, preserving the order in which the links appear
      urls = list(dict.fromkeys(urls))
  else:
      print("Failed to retrieve the webpage. Status code:", response.status_code)

  return urls

In [3]:
all_links = extract_urls("https://theplantera.com")

In [5]:
all_links[0:5]

['https://theplantera.com/',
 'https://theplantera.com/collections/all',
 'https://theplantera.com/collections/best-tasting-vegan-protein-powder',
 'https://theplantera.com/products/dark-chocolate-organic-vegan-protein-powder',
 'https://theplantera.com/products/strawberry-basil-organic-vegan-plant-based-protein-powder']
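
The substring test in `extract_urls` also accepts any off-site URL that merely contains the site address, for example inside a query string. A stricter check would compare the parsed hostname instead. This is only a sketch and not what the notebook ran; `is_same_site` and the sample URLs are hypothetical.

```python
from urllib.parse import urlparse

def is_same_site(href, site_host="theplantera.com"):
    """True if `href` is an http(s) URL whose host is `site_host` or its www. variant."""
    parsed = urlparse(href)
    host = (parsed.hostname or "").lower()
    return parsed.scheme in ("http", "https") and host in (site_host, "www." + site_host)

candidates = [
    "https://theplantera.com/collections/all",
    # Off-site link that still contains the site URL as a parameter
    "https://example.com/share?u=https://theplantera.com/",
]
same_site = [u for u in candidates if is_same_site(u)]
```

Here only the first candidate is kept, whereas the `in` test would have kept both.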

In [8]:
from langchain.document_loaders import UnstructuredURLLoader
loaders = UnstructuredURLLoader(urls=all_links)
data = loaders.load()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [9]:
len(data),len(all_links)

(32, 32)

In [10]:
data[0]

Document(page_content="Skip to content\n\nHome\n\nProducts\n                        \n\n                        \n\n                          \n                          \n                            \n                            \n                            \n                          \n\n                          \n\n                            \n                              \n                                \n                                  VEGAN PROTEIN POWDER\n\n                                  \n                                    \n\n                                  \n                                \n\n                                \n                                  \n                                    \n                                      \n                                        Dark Chocolate Protein Powder\n                                      \n                                    \n                                      \n                                       
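
Most of the scraped `page_content` is whitespace, and that whitespace still counts toward the prompt tokens sent to the API. One option is to collapse it before building the prompt. A minimal sketch; `squash_whitespace` is a hypothetical helper and the sample string is a shortened imitation of the output above.

```python
import re

def squash_whitespace(text):
    """Trim each line, collapse runs of spaces/tabs, and drop empty lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)

sample = (
    "Skip to content\n\nHome\n\nProducts\n      \n\n"
    "        VEGAN PROTEIN POWDER\n\n     Dark Chocolate   Protein Powder\n"
)
cleaned = squash_whitespace(sample)
```

With the notebook's variables this would be applied as `squash_whitespace(data[0].page_content)`.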

### Generate JSON data in the exact same format as Alpaca

- We will use the web-scraped content and ask a powerful model like GPT to create training data, which we can then use to fine-tune an open-source model such as Llama 2.



In [None]:
import openai
import os

In [12]:
key = 'enter_your_key'

In [13]:
os.environ['OPENAI_API_KEY'] = key
openai.api_key = os.getenv('OPENAI_API_KEY')

In [26]:
instruction_text = """I have some data scraped from a webpage, but it is quite unstructured. I need you to generate a dataset in the following format. Let me describe it for you.

There are 4 columns: instruction, input, output and text (all of data type string).
The instruction column has the question or prompt that describes the task.
The input column is optional; when present, it provides additional context for the instruction.
The output column is the model's response.
The text column is a single string that combines the instruction, input and output.

An example of a datapoint with an input, in JSON format, looks like this:

{
    "instruction": "Create a classification task by clustering the given list of items.",
    "input": "Apples, oranges, bananas, strawberries, pineapples",
    "output": "Class 1: Apples, Oranges\nClass 2: Bananas, Strawberries\nClass 3: Pineapples",
    "text": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nCreate a classification task by clustering the given list of items.\n\n### Input:\nApples, oranges, bananas, strawberries, pineapples\n\n### Response:\nClass 1: Apples, Oranges\nClass 2: Bananas, Strawberries\nClass 3: Pineapples"
}

An example of a datapoint without an input, in JSON format, looks like this:

{
    "instruction": "Describe the structure of an atom.",
    "input": "",
    "output": "An atom is made up of a nucleus, which contains protons and neutrons, surrounded by electrons that travel in orbits around the nucleus. The protons and neutrons have a positive charge, while the electrons have a negative charge, resulting in an overall neutral atom. The number of each particle determines the atomic number and the type of atom.",
    "text": "Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: Describe the structure of an atom. ### Response: An atom is made up of a nucleus, which contains protons and neutrons, surrounded by electrons that travel in orbits around the nucleus. The protons and neutrons have a positive charge, while the electrons have a negative charge, resulting in an overall neutral atom. The number of each particle determines the atomic number and the type of atom."
}

The raw web-scraped data follows. I need you to generate 5 datapoints (with and without input) using the structure described above. Make sure to stick to the context of the web-scraped data:"""

In [27]:
prompt_1 = instruction_text + "\n" + data[0].page_content
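
`prompt_1` covers only the first page. To cover every scraped page, one prompt per document could be built the same way. The sketch below also caps the page text at a fixed number of characters so that long pages are less likely to exceed the model's context window. `build_prompts`, `max_chars`, and the value 6000 are illustrative choices, not settings from this notebook.

```python
def build_prompts(instruction, pages, max_chars=6000):
    """Return one prompt per non-empty page: the shared instruction plus the (truncated) page text."""
    prompts = []
    for page_text in pages:
        # Crude character cap; a token-based limit would be more precise
        body = page_text.strip()[:max_chars]
        if body:  # skip pages that scraped to nothing
            prompts.append(instruction + "\n" + body)
    return prompts

# Stand-in strings; in the notebook these would be [d.page_content for d in data]
pages = ["Home\nProducts\nVEGAN PROTEIN POWDER", "   ", "x" * 10000]
prompts = build_prompts("Generate 5 datapoints from this page:", pages)
```

Each entry of `prompts` could then be sent in its own API call, which keeps every request focused on a single page.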

In [28]:
from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": prompt_1}
  ]
)
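
The reply comes back as one plain string in `response.choices[0].message.content`, so it has to be parsed before it can be stored as a dataset. The model may return several JSON objects one after another rather than a single array, possibly with extra text around them. Here is a sketch of one way to pull them out, using only the standard library; `parse_records`, `REQUIRED_KEYS`, and the sample reply are made up for illustration.

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output", "text"}

def parse_records(reply):
    """Extract every JSON object that has the four Alpaca keys from a model reply."""
    # strict=False tolerates raw newlines inside string values
    decoder = json.JSONDecoder(strict=False)
    records, pos = [], 0
    while True:
        start = reply.find("{", pos)
        if start == -1:
            break
        try:
            obj, pos = decoder.raw_decode(reply, start)
        except json.JSONDecodeError:
            pos = start + 1  # not valid JSON from here; try the next brace
            continue
        if isinstance(obj, dict) and REQUIRED_KEYS.issubset(obj):
            records.append(obj)
    return records

# Made-up reply: one complete object, one missing keys, plus surrounding chatter
reply = (
    'Here you go:\n'
    '{"instruction": "A", "input": "", "output": "B", "text": "T"}\n\n'
    '{"instruction": "C", "output": "D"}'
)
records = parse_records(reply)
```

In the notebook this would be called as `parse_records(response.choices[0].message.content)`. Objects that are cut off or lack one of the four keys are skipped rather than raising an error.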

In [35]:
print(response.choices[0].message.content)

{
  "instruction": "Describe the range of products available from The Plant Era.",
  "input": "",
  "output": "The Plant Era offers a range of vegan protein powders, vitamins, and merchandise. Their vegan protein powders come in flavors like Dark Chocolate and Strawberry-Basil. They also offer vitamins such as Vegan Bone Support, Vegan Immune Support, Vegan Omega, Vegan Turmeric & Black Pepper, and Vegan Vitamin D3. Additionally, they have merchandise available, including an Insulated Stainless Steel Protein Shaker.",
  "text": "Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: Describe the range of products available from The Plant Era. ### Response: The Plant Era offers a range of vegan protein powders, vitamins, and merchandise. Their vegan protein powders come in flavors like Dark Chocolate and Strawberry-Basil. They also offer vitamins such as Vegan Bone Support, Vegan Immune Support, Vegan Omega, Vegan Turme