[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/docs/assets/how-to-create-pinecone-datasets.ipynb)


# Creating Pinecone Datasets

This notebook walks you through creating a Pinecone dataset from a pandas DataFrame.

## Step 1: Create a simple sample dataset

In [None]:
!pip install -qU pandas==2.0.2

In [None]:
import pandas as pd

In [None]:
documents = [
    {
        "id": "1",
        "values": [0.1, 0.2, 0.3],
        "sparse_values": {"indices": [1, 2, 3], "values": [0.1, 0.2, 0.3]},
        "metadata": {"title": "title1", "url": "url1"},
        "blob": {"extra_field": "extra_value"},
    },
    {
        "id": "2",
        "values": [0.4, 0.5, 0.6],
        "sparse_values": {"indices": [4, 5, 6], "values": [0.4, 0.5, 0.6]},
        "metadata": {"title": "title2", "url": "url2"},
        "blob": None,
    },
    {
        "id": "3",
        "values": [0.7, 0.8, 0.9],
        "sparse_values": {"indices": [7, 8, 9], "values": [0.7, 0.8, 0.9]},
        "metadata": {"title": "title3", "url": "url3"},
        "blob": None,
    },
    {
        "id": "4",
        "values": [1.0, 1.1, 1.2],
        "sparse_values": {"indices": [10, 11, 12], "values": [1.0, 1.1, 1.2]},
        "metadata": {"title": "title4", "url": "url4"},
        "blob": None,
    },
    {
        "id": "5",
        "values": [1.3, 1.4, 1.5],
        "sparse_values": {"indices": [13, 14, 15], "values": [1.3, 1.4, 1.5]},
        "metadata": {"title": "title5", "url": "url5"},
        "blob": {"another_field": "another_value"},
    }
]

df = pd.DataFrame(documents)
df

Unnamed: 0,id,values,sparse_values,metadata,blob
0,1,"[0.1, 0.2, 0.3]","{'indices': [1, 2, 3], 'values': [0.1, 0.2, 0.3]}","{'title': 'title1', 'url': 'url1'}",{'extra_field': 'extra_value'}
1,2,"[0.4, 0.5, 0.6]","{'indices': [4, 5, 6], 'values': [0.4, 0.5, 0.6]}","{'title': 'title2', 'url': 'url2'}",
2,3,"[0.7, 0.8, 0.9]","{'indices': [7, 8, 9], 'values': [0.7, 0.8, 0.9]}","{'title': 'title3', 'url': 'url3'}",
3,4,"[1.0, 1.1, 1.2]","{'indices': [10, 11, 12], 'values': [1.0, 1.1,...","{'title': 'title4', 'url': 'url4'}",
4,5,"[1.3, 1.4, 1.5]","{'indices': [13, 14, 15], 'values': [1.3, 1.4,...","{'title': 'title5', 'url': 'url5'}",{'another_field': 'another_value'}


Some notes:
* We have both a `metadata` field and a `blob` field. The `metadata` field is the actual Pinecone metadata that we will use in our index; `blob` is an extra field for any additional information we want to store alongside the dataset.
* Here we used both `values` and `sparse_values`, but `sparse_values` is not a mandatory field. If you don't have sparse values, leave it empty.
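As a minimal sketch of the second note, the records below carry dense vectors only. The helper `make_record` is hypothetical (it is not part of pandas or `pinecone-datasets`), and representing an "empty" `sparse_values` as `None` is an assumption rather than something this notebook states.

```python
from typing import Any, Dict, List, Optional


def make_record(
    doc_id: str,
    values: List[float],
    sparse_values: Optional[Dict[str, list]] = None,
    metadata: Optional[Dict[str, Any]] = None,
    blob: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build one record with the same five keys as the sample documents.

    Hypothetical helper: optional fields default to None (assumed to be
    an acceptable way of leaving them empty).
    """
    return {
        "id": doc_id,
        "values": list(values),
        "sparse_values": sparse_values,
        "metadata": metadata,
        "blob": blob,
    }


# Dense-only records: sparse_values and blob are simply left as None.
dense_only_documents = [
    make_record("1", [0.1, 0.2, 0.3], metadata={"title": "title1", "url": "url1"}),
    make_record("2", [0.4, 0.5, 0.6], metadata={"title": "title2", "url": "url2"}),
]
```

Passing `dense_only_documents` to `pd.DataFrame` should produce the same five columns as `df`, with `sparse_values` and `blob` empty in every row.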

## Pinecone Dataset

Now that our data is ready, we can create a Pinecone Dataset. A Pinecone Dataset is a collection of documents, queries, and metadata:
* Documents: a collection of records with an ID, vectors (dense, sparse), and metadata
* Queries: a collection of queries with vectors (dense, sparse), a metadata filter, and `top_k`
* Metadata: a definition of the dataset: name, dimension, metric, embedding models, etc.
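This notebook builds documents only, so here is a rough sketch of what query records might look like. The key names `vector`, `sparse_vector`, `filter`, and `top_k` are assumptions inferred from the list above, not a schema confirmed by this notebook, and `check_queries` is a hypothetical helper.

```python
from typing import Any, Dict, List

# Assumed key names; check the pinecone-datasets documentation for the real schema.
queries: List[Dict[str, Any]] = [
    {
        "vector": [0.1, 0.2, 0.3],
        "sparse_vector": {"indices": [1, 2, 3], "values": [0.1, 0.2, 0.3]},
        "filter": {"title": {"$eq": "title1"}},  # Pinecone-style metadata filter
        "top_k": 2,
    },
    {
        "vector": [0.4, 0.5, 0.6],
        "sparse_vector": None,  # dense-only query
        "filter": None,         # no metadata filter
        "top_k": 3,
    },
]


def check_queries(queries: List[Dict[str, Any]], dimension: int) -> int:
    """Check that every query matches the dense dimension and has a positive top_k.

    Returns the number of queries checked; raises ValueError on the first problem.
    """
    for position, query in enumerate(queries):
        if len(query["vector"]) != dimension:
            raise ValueError(f"query {position}: expected {dimension} dense values")
        if query["top_k"] <= 0:
            raise ValueError(f"query {position}: top_k must be positive")
    return len(queries)


checked = check_queries(queries, dimension=3)
```

Checking the dense dimension up front is a design choice: it catches a mismatch between queries and documents before anything is written to disk.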

In [None]:
!pip install -qU \
  pinecone-client==2.2.2 \
  pinecone-datasets==0.6.0

In [None]:
from pinecone_datasets import Dataset, DatasetMetadata

In [None]:
# creating a new empty metadata
metadata = DatasetMetadata.empty()
metadata.dict()

{'name': '',
 'created_at': '2023-08-14 09:18:50.196514',
 'documents': 0,
 'queries': 0,
 'source': None,
 'license': None,
 'bucket': None,
 'task': None,
 'dense_model': {'name': '', 'tokenizer': None, 'dimension': 0},
 'sparse_model': None,
 'description': None,
 'tags': None,
 'args': None}
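The empty metadata above leaves `name`, `documents`, and `dense_model.dimension` at their defaults. As an illustration only, the sketch below derives plausible values for those fields from a list of records. It builds a plain dictionary that mirrors the printed keys and does not use the `DatasetMetadata` API; `summarize_documents` and the model name are hypothetical.

```python
from typing import Any, Dict, List


def summarize_documents(
    documents: List[Dict[str, Any]], name: str, model_name: str
) -> Dict[str, Any]:
    """Derive metadata-like fields from records (plain dict, not DatasetMetadata)."""
    # Every dense vector should have the same length; collect the distinct lengths.
    dimensions = {len(doc["values"]) for doc in documents}
    if len(dimensions) != 1:
        raise ValueError(f"expected exactly one vector dimension, found {sorted(dimensions)}")
    return {
        "name": name,
        "documents": len(documents),
        "queries": 0,
        "dense_model": {"name": model_name, "tokenizer": None, "dimension": dimensions.pop()},
    }


sample_documents = [
    {"id": "1", "values": [0.1, 0.2, 0.3]},
    {"id": "2", "values": [0.4, 0.5, 0.6]},
]
summary = summarize_documents(
    sample_documents, name="toy-dataset", model_name="hypothetical-model"
)
```

With the two sample records, `summary["documents"]` is 2 and `summary["dense_model"]["dimension"]` is 3.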

In [None]:
ds = Dataset.from_pandas(documents=df, q=None, metadata=metadata)
ds.documents

Unnamed: 0,id,values,sparse_values,metadata,blob
0,1,"[0.1, 0.2, 0.3]","{'indices': [1, 2, 3], 'values': [0.1, 0.2, 0.3]}","{'title': 'title1', 'url': 'url1'}",{'extra_field': 'extra_value'}
1,2,"[0.4, 0.5, 0.6]","{'indices': [4, 5, 6], 'values': [0.4, 0.5, 0.6]}","{'title': 'title2', 'url': 'url2'}",
2,3,"[0.7, 0.8, 0.9]","{'indices': [7, 8, 9], 'values': [0.7, 0.8, 0.9]}","{'title': 'title3', 'url': 'url3'}",
3,4,"[1.0, 1.1, 1.2]","{'indices': [10, 11, 12], 'values': [1.0, 1.1,...","{'title': 'title4', 'url': 'url4'}",
4,5,"[1.3, 1.4, 1.5]","{'indices': [13, 14, 15], 'values': [1.3, 1.4,...","{'title': 'title5', 'url': 'url5'}",{'another_field': 'another_value'}


## Save dataset to local path


In [None]:
ds.to_path('/tmp/ds')



### Reload dataset

In [None]:
new_ds = Dataset.from_path('/tmp/ds')

In [None]:
new_ds.documents

Unnamed: 0,id,values,sparse_values,metadata,blob
0,1,"[0.1, 0.2, 0.3]","{'indices': [1, 2, 3], 'values': [0.1, 0.2, 0.3]}","{'title': 'title1', 'url': 'url1'}","{'another_field': None, 'extra_field': 'extra_..."
1,2,"[0.4, 0.5, 0.6]","{'indices': [4, 5, 6], 'values': [0.4, 0.5, 0.6]}","{'title': 'title2', 'url': 'url2'}",
2,3,"[0.7, 0.8, 0.9]","{'indices': [7, 8, 9], 'values': [0.7, 0.8, 0.9]}","{'title': 'title3', 'url': 'url3'}",
3,4,"[1.0, 1.1, 1.2]","{'indices': [10, 11, 12], 'values': [1.0, 1.1,...","{'title': 'title4', 'url': 'url4'}",
4,5,"[1.3, 1.4, 1.5]","{'indices': [13, 14, 15], 'values': [1.3, 1.4,...","{'title': 'title5', 'url': 'url5'}","{'another_field': 'another_value', 'extra_fiel..."
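Note that the reloaded `blob` column differs from the original: each non-empty blob now lists both `another_field` and `extra_field`, with `None` for the key that row did not originally have. This is consistent with the blobs being stored under one shared schema, although that is an inference from the output rather than something stated here. A minimal sketch for recovering the original shape, using a hypothetical helper `strip_none_keys`:

```python
from typing import Any, Dict, Optional


def strip_none_keys(blob: Optional[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Drop keys whose value is None; leave a missing blob as None."""
    if blob is None:
        return None
    return {key: value for key, value in blob.items() if value is not None}


# Blob of the first row as it appears after reloading, and as originally written.
reloaded_blob = {"another_field": None, "extra_field": "extra_value"}
original_blob = {"extra_field": "extra_value"}

cleaned_blob = strip_none_keys(reloaded_blob)
```

Here `cleaned_blob` equals `original_blob`. Keep in mind that this approach cannot distinguish a key that was padded on reload from one that was deliberately stored as `None`.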
