# Example Datasets Python package

A Python package for obtaining example datasets.

Currently, this repository contains only the [datasets metadata](https://github.com/antononcube/Python-packages/raw/main/ExampleDatasets/ExampleDatasets/resources/dfRdatasets.csv.gz).
The datasets themselves are downloaded from the repository
[Rdatasets](https://github.com/vincentarelbundock/Rdatasets/),
[VAB1].

This package follows the design of the [Raku](https://raku.org) package with the same name; see [AAr1].

------

## Usage examples

### Setup

Here we load the Python package `pandas` and this package:

In [1]:
from ExampleDatasets import *
import pandas

### Get a dataset by using an identifier

Here we get a dataset by using an identifier and display part of the obtained dataset:

In [2]:
tbl = example_dataset(itemSpec = 'Baumann')
tbl.head(4)

   Unnamed: 0  group  pretest.1  pretest.2  post.test.1  post.test.2  \
0           1  Basal          4          3            5            4
1           2  Basal          6          5            9            5
2           3  Basal          9          4            5            3
3           4  Basal         12          6            8            5

   post.test.3
0           41
1           41
2           43
3           46

Here we summarize the dataset obtained above:

In [3]:
tbl.describe()

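|       | Unnamed: 0 | pretest.1 | pretest.2 | post.test.1 | post.test.2 | post.test.3 |
|-------|------------|-----------|-----------|-------------|-------------|-------------|
| count | 66.0       | 66.0      | 66.0      | 66.0        | 66.0        | 66.0        |
| mean  | 33.5       | 9.787879  | 5.106061  | 8.075758    | 6.712121    | 44.015152   |
| std   | 19.196354  | 3.02052   | 2.212752  | 3.393707    | 2.635644    | 6.643661    |
| min   | 1.0        | 4.0       | 1.0       | 1.0         | 0.0         | 30.0        |
| 25%   | 17.25      | 8.0       | 3.25      | 5.0         | 5.0         | 40.0        |
| 50%   | 33.5       | 9.0       | 5.0       | 8.0         | 6.0         | 45.0        |
| 75%   | 49.75      | 12.0      | 6.0       | 11.0        | 8.0         | 49.0        |
| max   | 66.0       | 16.0      | 13.0      | 15.0        | 13.0        | 57.0        |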


**Remark**: The values for the arguments `itemSpec` and `packageSpec` correspond to the values
of the columns "Item" and "Package", respectively, in the 
[metadata dataset](https://vincentarelbundock.github.io/Rdatasets/articles/data.html)
from the GitHub repository "Rdatasets", 
[[VAB1](https://github.com/vincentarelbundock/Rdatasets/)].
See the datasets metadata sub-section below.
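
Here is a minimal sketch of how an "Item" / "Package" pair can be resolved to a CSV URL through a metadata table. It is not how `example_dataset` is implemented: `csv_url_for` is a hypothetical helper, and the small table is a hand-made stand-in whose two rows mirror Rdatasets metadata entries.

```python
import pandas as pd

# Toy stand-in for the metadata table (not loaded from the package).
meta = pd.DataFrame(
    {
        "Package": ["COUNT", "COUNT"],
        "Item": ["titanic", "titanicgrp"],
        "CSV": [
            "https://vincentarelbundock.github.io/Rdatasets/csv/COUNT/titanic.csv",
            "https://vincentarelbundock.github.io/Rdatasets/csv/COUNT/titanicgrp.csv",
        ],
    }
)


def csv_url_for(meta, item, package=None):
    """Hypothetical helper: return the CSV URL for an Item and an optional Package."""
    mask = meta["Item"] == item
    if package is not None:
        mask &= meta["Package"] == package
    hits = meta.loc[mask, "CSV"]
    # An Item name may occur in several packages, so insist on a unique match.
    if len(hits) != 1:
        raise ValueError(f"expected exactly one match, found {len(hits)}")
    return hits.iloc[0]


url = csv_url_for(meta, "titanic", "COUNT")
print(url)
```

Requiring exactly one match is a design choice of the sketch: it makes an ambiguous item name fail loudly instead of silently picking a package.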

### Get a dataset by using a URL

Here we find the URLs of datasets whose titles match a regex:

In [4]:
dfMeta = load_datasets_metadata()
print(dfMeta[dfMeta.Title.str.contains('^tita')][["Package", "Item", "CSV"]].to_string())

    Package        Item                                                                      CSV
288   COUNT     titanic     https://vincentarelbundock.github.io/Rdatasets/csv/COUNT/titanic.csv
289   COUNT  titanicgrp  https://vincentarelbundock.github.io/Rdatasets/csv/COUNT/titanicgrp.csv
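
The pattern `'^tita'` is anchored at the start of the title and, with the `pandas` defaults, is case-sensitive. Here is a small sketch of how the match changes with `case=False`. The titles in it are made up for illustration and are not taken from the real metadata.

```python
import pandas as pd

# Illustrative titles only; one missing value is included on purpose.
titles = pd.DataFrame(
    {"Title": ["titanic", "Titanic survival", "Survival on the Titanic", None]}
)

# Default behaviour: regex match, case-sensitive. na=False treats missing titles as non-matches.
strict = titles[titles["Title"].str.contains("^tita", na=False)]

# case=False also accepts a capitalised title; the '^' anchor still applies.
relaxed = titles[titles["Title"].str.contains("^tita", case=False, na=False)]

print(strict["Title"].tolist())
print(relaxed["Title"].tolist())
```

Passing `na=False` keeps the boolean mask free of missing values, so it can be used directly for row selection.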


Here we get a dataset through
[`pandas`](https://pandas.pydata.org)
by using a URL and display the head of the obtained dataset:

In [5]:
import pandas
url = 'https://raw.githubusercontent.com/antononcube/Raku-Data-Reshapers/main/resources/dfTitanic.csv'
tbl2 = pandas.read_csv(url)
tbl2.head()

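|   | id | passengerClass | passengerAge | passengerSex | passengerSurvival |
|---|----|----------------|--------------|--------------|-------------------|
| 0 | 1  | 1st            | 30           | female       | survived          |
| 1 | 2  | 1st            | 0            | male         | survived          |
| 2 | 3  | 1st            | 0            | female       | died              |
| 3 | 4  | 1st            | 30           | male         | died              |
| 4 | 5  | 1st            | 20           | female       | died              |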


### Datasets metadata

Here we:

1. Get the dataset of the datasets metadata
2. Keep only the columns "Item", "Title", "Rows", and "Cols"
3. Filter it to have only datasets with 13 rows
4. Display it

In [6]:
tblMeta = load_datasets_metadata()
tblMeta = tblMeta[["Item", "Title", "Rows", "Cols"]]
tblMeta = tblMeta[tblMeta["Rows"] == 13]
tblMeta

|      | Item       | Title                                              | Rows | Cols |
|------|------------|----------------------------------------------------|------|------|
| 805  | Snow.pumps | John Snow's Map and Data on the 1854 London Ch... | 13   | 4    |
| 820  | BCG        | BCG Vaccine Data                                   | 13   | 7    |
| 935  | cement     | Heat Evolved by Setting Cements                    | 13   | 5    |
| 1354 | kootenay   | Waterflow Measurements of Kootenay River in Li... | 13   | 2    |
| 1644 | Newhouse77 | Medical-Care Expenditure: A Cross-National Sur... | 13   | 5    |
| 1735 | Saxony     | Families in Saxony                                 | 13   | 2    |
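
The row filter and the column selection can also be done in one step. Here is a sketch on a toy table: the three 13-row entries are copied from the table above, while the "iris" row and the "Note" column are illustrative additions so that there is something to drop.

```python
import pandas as pd

# Toy metadata; see the lead-in for which parts are illustrative.
meta = pd.DataFrame(
    {
        "Item": ["BCG", "cement", "Saxony", "iris"],
        "Title": [
            "BCG Vaccine Data",
            "Heat Evolved by Setting Cements",
            "Families in Saxony",
            "Edgar Anderson's Iris Data",
        ],
        "Rows": [13, 13, 13, 150],
        "Cols": [7, 5, 2, 5],
        "Note": ["toy"] * 4,
    }
)

cols = ["Item", "Title", "Rows", "Cols"]

# Boolean row mask and column list in a single .loc call.
by_loc = meta.loc[meta["Rows"] == 13, cols]

# The same result with DataFrame.query followed by column selection.
by_query = meta.query("Rows == 13")[cols]

print(by_loc)
```

Both forms give the same rows and columns as the two-step version; which one reads better is a matter of taste.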


### Keeping downloaded data

By default, the data is obtained over the web from
[Rdatasets](https://github.com/vincentarelbundock/Rdatasets/),
but `example_dataset` has the option `keep` to store the data locally.
(The data is saved in `XDG_DATA_HOME`; see
[SS1](https://pypi.org/project/xdg/).)
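
Here is a rough sketch of the general "download once, then read from disk" pattern under `XDG_DATA_HOME`. It is an assumption-laden illustration, not the package's code: `data_home`, `cached_text`, the sub-directory name, and `fake_fetch` are all hypothetical, and a temporary directory is used so that nothing is written to the real data directory.

```python
import os
import tempfile
from pathlib import Path


def data_home():
    """Return $XDG_DATA_HOME, or the default ~/.local/share if it is unset or empty."""
    value = os.environ.get("XDG_DATA_HOME", "")
    return Path(value) if value else Path.home() / ".local" / "share"


def cached_text(name, fetch, base_dir=None):
    """Hypothetical helper: return file contents, calling fetch() only on a cache miss."""
    base = Path(base_dir) if base_dir is not None else data_home() / "example-datasets-sketch"
    path = base / name
    if path.exists():
        return path.read_text(encoding="utf-8")
    text = fetch()
    base.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return text


calls = []


def fake_fetch():
    # Stand-in for a download; records each call so that cache hits are visible.
    calls.append(1)
    return "id,survived\n1,yes\n"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_text("titanic.csv", fake_fetch, base_dir=tmp)
    second = cached_text("titanic.csv", fake_fetch, base_dir=tmp)

print(len(calls))  # 1: the second call is served from disk
```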

This can be demonstrated with the following timings of a dataset with ~1300 rows:

In [7]:
import time
startTime = time.time()
data = example_dataset(itemSpec = 'titanic', packageSpec = 'COUNT', keep = True)
endTime = time.time()
print("Getting the data first time took " + str( endTime - startTime ) + " seconds")

Getting the data first time took 0.002950906753540039 seconds


In [8]:
import time
startTime = time.time()
data = example_dataset(itemSpec = 'titanic', packageSpec = 'COUNT', keep = True)
endTime = time.time()
print("Getting the data second time took " + str( endTime - startTime ) + " seconds")

Getting the data second time took 0.0029611587524414062 seconds
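
For timings like the ones above, `time.perf_counter` is usually a better fit than `time.time`, because it is a monotonic, high-resolution clock. Here is a small sketch with a hypothetical helper `timed`. A stand-in workload is used so that the snippet runs without network access.

```python
import time


def timed(func, *args, **kwargs):
    """Hypothetical helper: run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload; with the package installed one could instead time
# example_dataset(itemSpec='titanic', packageSpec='COUNT', keep=True).
result, seconds = timed(sum, range(100_000))
print(f"Took {seconds:.6f} seconds")
```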


------

## References

### Functions, packages, repositories

[AAf1] Anton Antonov,
[`ExampleDataset`](https://resources.wolframcloud.com/FunctionRepository/resources/ExampleDataset),
(2020),
[Wolfram Function Repository](https://resources.wolframcloud.com/FunctionRepository).

[AAr1] Anton Antonov,
[`Data::ExampleDatasets` Raku package](https://github.com/antononcube/Raku-Data-ExampleDatasets),
(2021),
[GitHub/antononcube](https://github.com/antononcube).

[VAB1] Vincent Arel-Bundock,
[Rdatasets](https://github.com/vincentarelbundock/Rdatasets/),
(2020),
[GitHub/vincentarelbundock](https://github.com/vincentarelbundock).

[SS1] Scott Stevenson,
[xdg Python package](https://pypi.org/project/xdg/),
(2016-2021),
[PyPI.org](https://pypi.org/project/xdg/).

### Interactive interfaces

[AAi1] Anton Antonov,
[Example datasets recommender interface](https://antononcube.shinyapps.io/ExampleDatasetsRecommenderInterface/),
(2021),
[Shinyapps.io](https://antononcube.shinyapps.io/).