# Introduction to Dataset

A dataset is a structured collection of data, often organized in tabular form, that represents information about a particular domain or topic. It is a set of data points or observations typically related to each other in some way. Datasets can be used for various purposes, including analysis, research, and training machine learning models.

Let's import `Dataset` from csSDK, another ConverSight library that is used to query a dataset in different ways.

In [3]:
from csSDK import Dataset

`Dataset` takes the following arguments:

| Argument | Description |
| :------- | :---------- |
| dataset_id | The id of the dataset to be initiated |
| token | Optional. Token of the logged-in user. If no token is provided, the logged-in user's token is used by default |
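
As a minimal sketch of how the optional `token` argument might be handled, the helper below builds the keyword arguments and leaves `token` out when none is supplied, so the default applies. `dataset_kwargs` and the `CS_TOKEN` environment variable are hypothetical names used for illustration; neither is part of csSDK.

```python
import os

def dataset_kwargs(dataset_id, token=None):
    """Build keyword arguments for Dataset(...), omitting `token` when none is given."""
    kwargs = {"dataset_id": dataset_id}
    if token:
        # Only pass a token when the caller wants to override the default one.
        kwargs["token"] = token
    return kwargs

# CS_TOKEN is a hypothetical environment variable name, chosen for this sketch.
kwargs = dataset_kwargs("655ceb56-HrXES9SSm", token=os.environ.get("CS_TOKEN"))
# With csSDK available, the dataset could then be created with:
# ds = Dataset(**kwargs)
```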

In [1]:
dataset_id = "655ceb56-HrXES9SSm"

You can find the dataset id in the UI: open `Data Workbench -> Data Management` from the side menu and click the dataset. The dataset id appears at the very end of the URL in the browser address bar.
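
If you copy that URL, the id can also be extracted programmatically. The sketch below assumes only that the id is the last segment of the URL path (not part of a `#` fragment). `dataset_id_from_url` is a hypothetical helper and the URL is a made-up placeholder, not a real ConverSight address.

```python
from urllib.parse import urlparse

def dataset_id_from_url(url: str) -> str:
    """Return the last non-empty path segment of a URL."""
    path = urlparse(url).path
    segments = [segment for segment in path.split("/") if segment]
    if not segments:
        raise ValueError(f"no path segment found in {url!r}")
    return segments[-1]

# Placeholder URL for illustration only; the real URL layout may differ.
example_url = "https://example.com/data-management/655ceb56-HrXES9SSm"
extracted_id = dataset_id_from_url(example_url)
```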

In [2]:
ds = Dataset(dataset_id=dataset_id)

Retail Sales loaded successfully


Now that the dataset has been initiated, let's look at the most commonly used methods:
- raw_sql => Accepts a standard SQL query and returns the data as a data frame
  > - query: the SQL query to run.
  > - token: An optional argument to override the token
- raw_sql_arrow => Accepts a standard SQL query and returns the data in Arrow format
  > - query: the SQL query to run.
  > - token: An optional argument to override the token
- sql_arrow => Accepts a ConverSight SQL query and returns the data in Arrow format
  > - query: the ConverSight SQL query to run.
  > - token: An optional argument to override the token
- sql_dataframe => Accepts a ConverSight SQL query and returns the data as a data frame
  > - query: the ConverSight SQL query to run.
  > - token: An optional argument to override the token
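
The four methods above differ along two axes: the SQL dialect (standard or ConverSight) and the return format (data frame or Arrow). A small sketch of that mapping follows. `pick_query_method` is a hypothetical helper written for illustration and is not part of csSDK.

```python
def pick_query_method(conversight_sql: bool, as_arrow: bool) -> str:
    """Map (SQL dialect, output format) to one of the four method names."""
    methods = {
        (False, False): "raw_sql",        # standard SQL -> data frame
        (False, True): "raw_sql_arrow",   # standard SQL -> Arrow table
        (True, True): "sql_arrow",        # ConverSight SQL -> Arrow table
        (True, False): "sql_dataframe",   # ConverSight SQL -> data frame
    }
    return methods[(conversight_sql, as_arrow)]

# With a loaded `ds`, the chosen method could be looked up and called like this:
# result = getattr(ds, pick_query_method(True, False))(query=conversight_query)
```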

In [27]:
conversight_query = """Select @RetailSales.revenue as newcost, @RetailSales.delivery_date, @RetailSales.buyer from #RetailSales"""

In [10]:
ds.sql_dataframe(query=conversight_query)

|  | newcost | m_delivery_date | m_buyer |
| :--- | :--- | :--- | :--- |
| 0 | 5980.000000 | 2016-01-09 | Nass Torres |
| 1 | 7211.259766 | 2016-01-11 | Sascha Johnson |
| 2 | 35999.601562 | 2016-01-12 | Kim Rogers |
| 3 | 19950.000000 | 2016-01-12 | Kei Thompson |
| 4 | 7980.000000 | 2016-01-11 | Gabby Taylor |
| ... | ... | ... | ... |
| 15797 | 35199.359375 | 2018-08-18 | Mo Myers |
| 15798 | 39767.000000 | 2018-08-15 | Fran Sullivan |
| 15799 | 11046.000000 | 2018-08-17 | Izzi Jones |
| 15800 | 65189.468750 | 2018-08-18 | Nik Reyes |


In [7]:
ds.sql_arrow(query=conversight_query)

pyarrow.Table
newcost: float
m_delivery_date: timestamp[s]
m_buyer: dictionary<values=string, indices=int32, ordered=0>
----
newcost: [[5980,7211.26,35999.6,19950,7980,...,35199.36,39767,11046,65189.47,23099.23]]
m_delivery_date: [[2016-01-09 00:00:00,2016-01-11 00:00:00,2016-01-12 00:00:00,2016-01-12 00:00:00,2016-01-11 00:00:00,...,2018-08-18 00:00:00,2018-08-15 00:00:00,2018-08-17 00:00:00,2018-08-18 00:00:00,2018-08-19 00:00:00]]
m_buyer: [  -- dictionary:
["Nass Torres","Sascha Johnson","Kim Rogers","Kei Thompson","Gabby Taylor",...,"Nik Reyes","Izzi Taylor","Chris Martino","Fran Frumine","Myntra"]  -- indices:
[0,1,2,3,4,...,5,6,7,8,9]]

In [24]:
con = ds.get_connector_info()
resolved_query = f"""Select {con.schema.current}_RetailSales.m_revenue as newcost, {con.schema.current}_RetailSales.m_delivery_date as delivery_date, {con.schema.current}_RetailSales.m_buyer as buyer from {con.schema.current}_RetailSales"""

In the cell above, `resolved_query` is ConverSight's internal resolved form of the query. It uses the dynamic schema id from `con.schema.current`.
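
The same pattern can be factored into a small builder so the schema id is interpolated in one place. This is only a sketch: `build_resolved_query` is a hypothetical helper, `"abc123"` is a placeholder schema id (in the notebook it would come from `con.schema.current`), and the `<schema_id>_<table>` naming is inferred from the query above.

```python
from typing import Mapping

def build_resolved_query(schema_id: str, table: str, columns: Mapping[str, str]) -> str:
    """Build a standard SQL query against the `<schema_id>_<table>` table.

    `columns` maps physical column names (e.g. "m_revenue") to output aliases.
    """
    qualified = f"{schema_id}_{table}"
    select_list = ", ".join(
        f"{qualified}.{column} as {alias}" for column, alias in columns.items()
    )
    return f"Select {select_list} from {qualified}"

# "abc123" is a placeholder; a real schema id comes from the connector info.
example_query = build_resolved_query(
    "abc123",
    "RetailSales",
    {"m_revenue": "newcost", "m_delivery_date": "delivery_date", "m_buyer": "buyer"},
)
```

The helper does no escaping or validation of identifiers, so it should only be given trusted table and column names.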

In [25]:
ds.raw_sql(query=resolved_query)

|  | newcost | delivery_date | buyer |
| :--- | :--- | :--- | :--- |
| 0 | 5980.000000 | 2016-01-09 | Nass Torres |
| 1 | 7211.259766 | 2016-01-11 | Sascha Johnson |
| 2 | 35999.601562 | 2016-01-12 | Kim Rogers |
| 3 | 19950.000000 | 2016-01-12 | Kei Thompson |
| 4 | 7980.000000 | 2016-01-11 | Gabby Taylor |
| ... | ... | ... | ... |
| 15797 | 35199.359375 | 2018-08-18 | Mo Myers |
| 15798 | 39767.000000 | 2018-08-15 | Fran Sullivan |
| 15799 | 11046.000000 | 2018-08-17 | Izzi Jones |
| 15800 | 65189.468750 | 2018-08-18 | Nik Reyes |


In [26]:
ds.raw_sql_arrow(query=resolved_query)

pyarrow.Table
newcost: float
delivery_date: timestamp[s]
buyer: dictionary<values=string, indices=int32, ordered=0>
----
newcost: [[5980,7211.26,35999.6,19950,7980,...,35199.36,39767,11046,65189.47,23099.23]]
delivery_date: [[2016-01-09 00:00:00,2016-01-11 00:00:00,2016-01-12 00:00:00,2016-01-12 00:00:00,2016-01-11 00:00:00,...,2018-08-18 00:00:00,2018-08-15 00:00:00,2018-08-17 00:00:00,2018-08-18 00:00:00,2018-08-19 00:00:00]]
buyer: [  -- dictionary:
["Nass Torres","Sascha Johnson","Kim Rogers","Kei Thompson","Gabby Taylor",...,"Nik Reyes","Izzi Taylor","Chris Martino","Fran Frumine","Myntra"]  -- indices:
[0,1,2,3,4,...,5,6,7,8,9]]