# Consuming the dataset

Once the dataset is published to the dataset repository, we expect end users to incorporate it into their machine learning workflows. The easiest way to do this is with the [Hugging Face Datasets](https://huggingface.co/docs/datasets/en/index) library, which integrates with the major ML frameworks.


In [1]:
from datasets import load_dataset



In [None]:
from pelicanfs.core import PelicanFileSystem

pelfs = PelicanFileSystem("pelican://uwdf-director.chtc.wisc.edu")


In [4]:
# TODO: add a mapping from the local file system to the S3 or Pelican layer.
# For now, load the CSV directly from its pelican:// URL.

dataset = load_dataset("csv", data_files="pelican://uwdf-director.chtc.wisc.edu/wisc.edu/dsi/pytorch/bird_migration_data.csv")
torch_dataset = dataset.with_format("torch")

Downloading data: 100%|██████████| 2.78M/2.78M [00:00<00:00, 22.0MB/s]
Generating train split: 10000 examples [00:00, 159799.45 examples/s]


In [5]:
torch_dataset

DatasetDict({
    train: Dataset({
        features: ['Bird_ID', 'Species', 'Region', 'Habitat', 'Weather_Condition', 'Migration_Reason', 'Start_Latitude', 'Start_Longitude', 'End_Latitude', 'End_Longitude', 'Flight_Distance_km', 'Flight_Duration_hours', 'Average_Speed_kmph', 'Max_Altitude_m', 'Min_Altitude_m', 'Temperature_C', 'Wind_Speed_kmph', 'Humidity_pc', 'Pressure_hPa', 'Visibility_km', 'Nesting_Success', 'Tag_Battery_Level_pc', 'Signal_Strength_dB', 'Migration_Start_Month', 'Migration_End_Month', 'Rest_Stops', 'Predator_Sightings', 'Tag_Type', 'Migrated_in_Flock', 'Flock_Size', 'Food_Supply_Level', 'Tracking_Quality', 'Migration_Interrupted', 'Interrupted_Reason', 'Tagged_By', 'Tag_Weight_g', 'Migration_Success', 'Recovery_Location_Known', 'Recovery_Time_days', 'Observation_Counts', 'Observation_Quality'],
        num_rows: 10000
    })
})

In [6]:
torch_dataset["train"][0]

{'Bird_ID': 'B1000',
 'Species': 'Warbler',
 'Region': 'South America',
 'Habitat': 'Grassland',
 'Weather_Condition': 'Stormy',
 'Migration_Reason': 'Feeding',
 'Start_Latitude': tensor(11.9066),
 'Start_Longitude': tensor(-169.3783),
 'End_Latitude': tensor(30.3776),
 'End_Longitude': tensor(-21.3669),
 'Flight_Distance_km': tensor(1753.7900),
 'Flight_Duration_hours': tensor(49.5000),
 'Average_Speed_kmph': tensor(47.8200),
 'Max_Altitude_m': tensor(5280),
 'Min_Altitude_m': tensor(285),
 'Temperature_C': tensor(-2.2000),
 'Wind_Speed_kmph': tensor(9.1000),
 'Humidity_pc': tensor(43),
 'Pressure_hPa': tensor(1030.3000),
 'Visibility_km': tensor(1.5000),
 'Nesting_Success': 'No',
 'Tag_Battery_Level_pc': tensor(45),
 'Signal_Strength_dB': tensor(-64.9000),
 'Migration_Start_Month': 'Jan',
 'Migration_End_Month': 'Apr',
 'Rest_Stops': tensor(3),
 'Predator_Sightings': tensor(6),
 'Tag_Type': 'Radio',
 'Migrated_in_Flock': 'Yes',
 'Flock_Size': tensor(264),
 'Food_Supply_Level': 'Low',

Technically, the dataset can also be consumed through mlcroissant's `Dataset`, but that path is very buggy and not recommended.

In [6]:
from mlcroissant import Dataset
import itertools
import pandas as pd


In [7]:
dataset = Dataset(jsonld="https://web.s3.wisc.edu/pelican-data-loader/metadata/bird_migration_data.json")
records = dataset.records("bird_migration_data_record_set")
pd.DataFrame(list(itertools.islice(records, 100))).head(5)


Unnamed: 0,bird_migration_data/Bird_ID,bird_migration_data/Species,bird_migration_data/Region,bird_migration_data/Habitat,bird_migration_data/Weather_Condition,bird_migration_data/Migration_Reason,bird_migration_data/Start_Latitude,bird_migration_data/Start_Longitude,bird_migration_data/End_Latitude,bird_migration_data/End_Longitude,...,bird_migration_data/Tracking_Quality,bird_migration_data/Migration_Interrupted,bird_migration_data/Interrupted_Reason,bird_migration_data/Tagged_By,bird_migration_data/Tag_Weight_g,bird_migration_data/Migration_Success,bird_migration_data/Recovery_Location_Known,bird_migration_data/Recovery_Time_days,bird_migration_data/Observation_Counts,bird_migration_data/Observation_Quality
0,b'B1000',b'Warbler',b'South America',b'Grassland',b'Stormy',b'Feeding',11.906566,-169.378251,30.377647,-21.366879,...,b'Excellent',b'Yes',b'Storm',b'Researcher_A',27.0,b'Failed',b'No',102,56,b'Low'
1,b'B1001',b'Stork',b'North America',b'Grassland',b'Stormy',b'Breeding',62.301546,-111.475069,39.921092,47.963436,...,b'Good',b'Yes',b'Injury',b'Researcher_C',14.2,b'Successful',b'Yes',118,61,b'Low'
2,b'B1002',b'Hawk',b'South America',b'Mountain',b'Stormy',b'Avoid Predators',87.861164,-78.727327,66.99098,19.448466,...,b'Fair',b'No',b'Lost Signal',b'Researcher_B',16.1,b'Failed',b'No',41,71,b'High'
3,b'B1003',b'Warbler',b'South America',b'Urban',b'Stormy',b'Climate Change',35.77059,153.104341,-49.003145,-157.868744,...,b'Good',b'Yes',b'Lost Signal',b'Researcher_C',24.4,b'Successful',b'No',15,68,b'Low'
4,b'B1004',b'Crane',b'Europe',b'Urban',b'Windy',b'Avoid Predators',-21.611614,106.674824,11.681051,-115.022863,...,b'Good',b'No',,b'Researcher_B',25.8,b'Failed',b'Yes',73,67,b'Moderate'
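
The output above shows two quirks of the records: every key carries the `bird_migration_data/` prefix, and text values arrive as `bytes` (for example `b'Warbler'`). A minimal plain-Python sketch of one way to tidy a record before building the DataFrame follows. The helper name `clean_record` is hypothetical and not part of mlcroissant, and UTF-8 encoding of the byte values is an assumption.

```python
def clean_record(record, prefix="bird_migration_data/"):
    """Return a copy of the record with the key prefix stripped and bytes decoded."""
    cleaned = {}
    for key, value in record.items():
        # Drop the record-set prefix from the column name, if present.
        short_key = key[len(prefix):] if key.startswith(prefix) else key
        # Decode bytes to str (assumes UTF-8); leave other values untouched.
        cleaned[short_key] = value.decode("utf-8") if isinstance(value, bytes) else value
    return cleaned


# A sample shaped like the first record shown above.
sample = {
    "bird_migration_data/Bird_ID": b"B1000",
    "bird_migration_data/Species": b"Warbler",
    "bird_migration_data/Start_Latitude": 11.906566,
}
print(clean_record(sample))
```

In the cell above, this could be applied as `pd.DataFrame([clean_record(r) for r in itertools.islice(records, 100)])`.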
