# Pre-processing your dataset
> Getting your files in the right format

In this notebook, we examine how to work with raw files and convert them into the HuggingFace Datasets format. We'll use some example files from a directory on GitHub.

----
### What data structure have we been using so far?
Recall the [fine-tuning notebook](https://huggingface.co/docs/transformers/training) we looked at earlier. One of the first things we do there is load the data, and the object that comes back works seamlessly with the rest of the API. Let's look into this further.

#### What IS this and why might we use it?
Let's learn more about Datasets using the HuggingFace [Dataset API](https://huggingface.co/docs/datasets/).

#### Questions and Discussion
Using the pages [The Dataset Object](https://huggingface.co/docs/datasets/access.html) and [Train with Datasets](https://huggingface.co/docs/datasets/use_dataset.html), answer the following questions:
* How would you generally describe the structure of a Dataset vs something more like a DataFrame (table)?
* What seem to be some operations you can do with a Dataset?
* What are the advantages of a Dataset?

---
# Let's try an example.
We have some data that we'd like to classify. We've saved our data as follows:
* We have one huge directory full of files of data
* Each file name is the id of the file
* We have a separate CSV file that contains information about each of the files, one row per id

_Need to see this visually? Navigate to [the repo](https://github.com/vanderbilt-data-science/deep-learning-intensive) and click on the `workshop-files` directory. This is what the directory looks like._

How do we get this in the right format for processing?

### Install and load the required modules
We need to first install `transformers` and `datasets`. Execute the line below if you're using Google Colab. The rest of the modules are already available through Google Colab.

In [3]:
!pip install transformers
!pip install datasets

Collecting transformers
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers

In [4]:
#dl imports
from transformers import pipeline
from datasets import load_dataset, Dataset, ClassLabel, load_from_disk, DatasetDict
from huggingface_hub import notebook_login

#import data science packages
import pandas as pd
import numpy as np
import seaborn as sns

#import file helper packages
import glob
import requests

# Load the data
Our data is structured as a table of information about authors including the ids of articles they've written, and a set of data files (on GitHub) named with the ID of each of the texts. We'll leverage this here just for an example, and build up a table.

#### Get data info table

In [7]:
#set base_url
base_url = 'https://raw.githubusercontent.com/vanderbilt-data-science/deep-learning-intensive/master/workshop-files/'

In [10]:
#article data
info_table = pd.read_csv(base_url + 'author_data.csv')
info_table

Unnamed: 0,last_name,first_name,age,years_of_journalism,college major,article_id
0,west,enrique,56,12,humanities,551293
1,braun,damien,43,13,humanities,373587
2,osborn,ellie,22,2,engineering,597061
3,vega,cierra,67,34,science,434648
4,cantrell,alden,53,23,science,532970
5,gentry,kierra,25,7,humanities,520668
6,cox,pierre,24,4,humanities,209035
7,crane,thomas,35,9,science,830014
8,hill,krystal,41,31,premed,671125
9,cuevas,kira,39,22,premed,893941


#### Add data (note: this could be a URL path instead)

In [13]:
#download each article's text by id and add it as a new 'text' column
info_table['text'] = info_table['article_id'].apply(lambda x: requests.get(base_url + str(x) + '.txt').text)

In [14]:
info_table

Unnamed: 0,last_name,first_name,age,years_of_journalism,college major,article_id,text
0,west,enrique,56,12,humanities,551293,"The rain and wind abruptly stopped, but the sk..."
1,braun,damien,43,13,humanities,373587,She patiently waited for his number to be call...
2,osborn,ellie,22,2,engineering,597061,The chair sat in the corner where it had been ...
3,vega,cierra,67,34,science,434648,The computer wouldn't start. She banged on the...
4,cantrell,alden,53,23,science,532970,Do you really listen when you are talking with...
5,gentry,kierra,25,7,humanities,520668,Cake or pie? I can tell a lot about you by whi...
6,cox,pierre,24,4,humanities,209035,It was a concerning development that he couldn...
7,crane,thomas,35,9,science,830014,She was in a hurry. Not the standard hurry whe...
8,hill,krystal,41,31,premed,671125,All he could think about was how it would all ...
9,cuevas,kira,39,22,premed,893941,The red glint of paint sparkled under the sun....
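
One caveat with the `requests.get(...).text` approach above: `requests` does not raise an error for a missing file unless asked to, so an error page could silently end up stored as article text. Below is a minimal sketch of a stricter download helper that uses only the standard library. The helper names `article_url` and `fetch_article` are hypothetical (they are not part of the original notebook), and the sketch assumes the files are UTF-8 encoded.

```python
import urllib.request

BASE_URL = 'https://raw.githubusercontent.com/vanderbilt-data-science/deep-learning-intensive/master/workshop-files/'

def article_url(article_id, base_url=BASE_URL):
    """Build the raw-file URL for one article id (hypothetical helper)."""
    return f'{base_url}{article_id}.txt'

def fetch_article(article_id, base_url=BASE_URL, timeout=10):
    """Download one article; urlopen raises HTTPError on a 404 or other HTTP failure."""
    url = article_url(article_id, base_url)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Assumption: the text files are UTF-8 encoded.
        return response.read().decode('utf-8')

# Building the URL needs no network access.
print(article_url(551293))
```

With this in place, `info_table['article_id'].apply(fetch_article)` could stand in for the lambda above, at the cost of stopping on the first failed download.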


# Make into HuggingFace Dataset
Let's figure out some different ways that we can load the data. Let's learn more from the [Load documentation reference](https://huggingface.co/docs/datasets/loading.html). 

In [15]:
data_ds = Dataset.from_pandas(info_table)

In [16]:
data_ds.info

DatasetInfo(description='', citation='', homepage='', license='', features={'last_name': Value(dtype='string', id=None), 'first_name': Value(dtype='string', id=None), 'age': Value(dtype='int64', id=None), 'years_of_journalism': Value(dtype='int64', id=None), 'college major': Value(dtype='string', id=None), 'article_id': Value(dtype='int64', id=None), 'text': Value(dtype='string', id=None)}, post_processed=None, supervised_keys=None, task_templates=None, builder_name=None, config_name=None, version=None, splits=None, download_checksums=None, download_size=None, post_processing_size=None, dataset_size=None, size_in_bytes=None)

#### Questions and Discussion
Datasets tend to come in one of two forms: canonical and community. Let's [explore this further](https://huggingface.co/docs/datasets/share.html).

This was just one of _many_ ways that you can create a dataset. Later on, we'll push this to the hub using a simple command. What kind of data do you have and in what format? Join the breakout rooms below based on the situation that best fits your scenario, and discuss with your group how you think you need to organize your data and the steps for uploading your data. Some resources are noted below.
1. **Programmatic standard formats:**
    * e.g., Datasets in CSV, JSON, Parquet, or Text formats stored locally or remotely
    * Use the load documentation references above. Using the [Structure Your Repository](https://huggingface.co/docs/datasets/repository_structure.html) resource, define how your repo should be structured.
2. **Low-code standard formats:**
    * e.g., Datasets in CSV, JSON, Parquet, or Text formats that you want to be hosted on HuggingFace Hub
    * Explore the [Huggingface Hub direct upload](https://huggingface.co/docs/datasets/upload_dataset.html) reference. Using the [Structure Your Repository](https://huggingface.co/docs/datasets/repository_structure.html) resource, define how your data repo should be structured. If you're interested in command-line programming, explore the [Share](https://huggingface.co/docs/datasets/share.html) code reference for the terminal/command line equivalent of the direct upload.
3. **Non-standard formatted resources that are referenced by URL:**
    * e.g., Large image or audio datasets to be loaded via URL, or datasets that need custom operations
    * Explore the use of [dataset loading scripts](https://huggingface.co/docs/datasets/dataset_script.html). It may be extremely helpful to look at how a dataset of this type with a loading script is structured and the usage of the Python script. Check out the [Huggingface internal image test dataset](https://huggingface.co/datasets/hf-internal-testing/fixtures_image_utils) to see an example. Make sure to click on the _Files and Versions_ tab and the .py file to see the loading script. 
4. **Datasets you want to have multiple configurations:**
    * e.g., T0, training on multiple objectives/tasks
    * Explore the use of [dataset loading scripts](https://huggingface.co/docs/datasets/dataset_script.html). It may be instructive to check out the SuperGLUE example for multiple configurations. A quick reference to the repo loading script is [on GitHub.](https://github.com/huggingface/datasets/blob/master/datasets/super_glue/super_glue.py)
---

# Operations on HuggingFace Datasets
Let's explore some operations that we can do on Datasets. We can learn more on the [Process](https://huggingface.co/docs/datasets/process.html) page here. Note that if you're using audio data, the [Process Audio](https://huggingface.co/docs/datasets/audio_process.html) reference has lots of helpful functions for processing audio.
## `features`
View detailed information about the features in the dataset.

In [17]:
data_ds.features

{'age': Value(dtype='int64', id=None),
 'article_id': Value(dtype='int64', id=None),
 'college major': Value(dtype='string', id=None),
 'first_name': Value(dtype='string', id=None),
 'last_name': Value(dtype='string', id=None),
 'text': Value(dtype='string', id=None),
 'years_of_journalism': Value(dtype='int64', id=None)}

## `rename_column`
Sometimes, we need to rename columns (obviously!)

In [18]:
data_ds = data_ds.rename_column('college major', 'label')

In [19]:
data_ds

Dataset({
    features: ['last_name', 'first_name', 'age', 'years_of_journalism', 'label', 'article_id', 'text'],
    num_rows: 20
})

## `class_encode_column`
Interestingly enough, the dataset has no notion yet of which column holds the labels or target classes. We can add this by encoding the label column as a class type. Essentially, this converts the column to integer values with a `ClassLabel` feature type, which provides lookups in both directions between the integer ids and the label names (`int2str` and `str2int`).

In [20]:
data_ds = data_ds.class_encode_column('label')

Casting to class labels:   0%|          | 0/1 [00:00<?, ?ba/s]

Casting the dataset:   0%|          | 0/1 [00:00<?, ?ba/s]

In [22]:
data_ds.features

{'age': Value(dtype='int64', id=None),
 'article_id': Value(dtype='int64', id=None),
 'first_name': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=5, names=['engineering', 'humanities', 'prelaw', 'premed', 'science'], names_file=None, id=None),
 'last_name': Value(dtype='string', id=None),
 'text': Value(dtype='string', id=None),
 'years_of_journalism': Value(dtype='int64', id=None)}

In [23]:
data_ds['label']

[1, 1, 0, 4, 4, 1, 1, 4, 3, 3, 4, 1, 0, 3, 2, 2, 0, 1, 4, 3]
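
To make the encoding concrete, here is a plain-Python sketch of the mapping that the outputs above imply: the distinct label strings, sorted alphabetically, receive the ids 0 through 4. It mirrors the result shown in this notebook rather than the library's internal implementation, and the variable names are illustrative.

```python
# The 20 label strings, in row order, as they appear in this notebook.
majors = [
    'humanities', 'humanities', 'engineering', 'science', 'science',
    'humanities', 'humanities', 'science', 'premed', 'premed',
    'science', 'humanities', 'engineering', 'premed', 'prelaw',
    'prelaw', 'engineering', 'humanities', 'science', 'premed',
]

# Sorted unique names define the class ids.
names = sorted(set(majors))
label2id = {name: idx for idx, name in enumerate(names)}
id2label = {idx: name for name, idx in label2id.items()}

# Encode every row with its integer id.
encoded = [label2id[major] for major in majors]

print(names)
print(encoded)
```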

## `map`
Familiar with map functions? The pandas apply function? Sometimes we need to apply a function to each of the examples to create or update fields. We can do this here with the Dataset `map` function.

In this example, we take a deliberately roundabout approach to creating a column with the real names of the labels. We'll take advantage of the `ClassLabel` class and its `int2str` method.

In [25]:
data_ds = data_ds.map(lambda x: {'college_major': data_ds.features['label'].int2str(x['label'])})

  0%|          | 0/20 [00:00<?, ?ex/s]

In [28]:
data_ds['college_major']

['humanities',
 'humanities',
 'engineering',
 'science',
 'science',
 'humanities',
 'humanities',
 'science',
 'premed',
 'premed',
 'science',
 'humanities',
 'engineering',
 'premed',
 'prelaw',
 'prelaw',
 'engineering',
 'humanities',
 'science',
 'premed']

## Index
Sometimes, we want to see one or more examples from the data. We can use normal indexing approaches to do this.

## ID conversion
We often need to convert our string labels back and forth to/from integers. We can do this using methods of the `ClassLabel` class.

## `train_test_split`
We need to split our data into at least a training set and a test set. A train/validation/test split is often preferable: the validation set lets us evaluate the model during training, while the test set stays untouched for the final evaluation.

In [32]:
#start with train/test split
data_ds = data_ds.train_test_split(test_size=0.2)

In [33]:
data_ds

DatasetDict({
    train: Dataset({
        features: ['age', 'article_id', 'first_name', 'label', 'last_name', 'text', 'years_of_journalism', 'college_major'],
        num_rows: 16
    })
    test: Dataset({
        features: ['age', 'article_id', 'first_name', 'label', 'last_name', 'text', 'years_of_journalism', 'college_major'],
        num_rows: 4
    })
})

In [35]:
#split training into train and validation
train_val_ds = data_ds['train'].train_test_split(test_size=0.3)
train_val_ds

DatasetDict({
    train: Dataset({
        features: ['age', 'article_id', 'first_name', 'label', 'last_name', 'text', 'years_of_journalism', 'college_major'],
        num_rows: 11
    })
    test: Dataset({
        features: ['age', 'article_id', 'first_name', 'label', 'last_name', 'text', 'years_of_journalism', 'college_major'],
        num_rows: 5
    })
})

In [36]:
#update original ds with re-split training and validation
data_ds['train'] = train_val_ds['train']
data_ds['valid'] = train_val_ds['test']

In [37]:
data_ds

DatasetDict({
    train: Dataset({
        features: ['age', 'article_id', 'first_name', 'label', 'last_name', 'text', 'years_of_journalism', 'college_major'],
        num_rows: 11
    })
    test: Dataset({
        features: ['age', 'article_id', 'first_name', 'label', 'last_name', 'text', 'years_of_journalism', 'college_major'],
        num_rows: 4
    })
    valid: Dataset({
        features: ['age', 'article_id', 'first_name', 'label', 'last_name', 'text', 'years_of_journalism', 'college_major'],
        num_rows: 5
    })
})

# Sharing your dataset
## `save_to_disk`
Want to save your dataset and load it later? You can directly use the save functionality.

In [38]:
data_ds.save_to_disk('./demo_data')

Flattening the indices:   0%|          | 0/1 [00:00<?, ?ba/s]

Flattening the indices:   0%|          | 0/1 [00:00<?, ?ba/s]

Flattening the indices:   0%|          | 0/1 [00:00<?, ?ba/s]

## `load_from_disk`

In [39]:
loaded_ds = load_from_disk('./demo_data')

## `push_to_hub`
Want to directly upload your dataset to HuggingFace now that you've got it created? You can programmatically push it to the Hub. This requires that you're signed into your HF account. You may need to run the following line so that git stores your credentials.

In [40]:
!git config --global credential.helper store

Make sure you input your HuggingFace _token_ below. If this doesn't work, you can use the `token` parameter in `push_to_hub` and set it to a string containing your HuggingFace token.

In [41]:
notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token


In [45]:
data_ds.push_to_hub('eadsa1998/data_ds', private=True)

Pushing split train to the Hub.


HTTPError: ignored

### A note on pushing to the hub
Instead of pushing a Dataset object, you can also push the raw data files. Here we write the train and test splits to Parquet format and push those files to a dataset repository. Let's see how we can do this!

In [None]:
from huggingface_hub import create_repo, Repository

In [None]:
#create a remote repo (skip this cell if the repo already exists)
repo_url = create_repo(name="demo_data_raw", organization='charreaubell', repo_type="dataset", private=True)
repo_url

'https://huggingface.co/datasets/charreaubell/demo_data_raw'

In [None]:
#clone the repo locally
repo = Repository(local_dir="demo_data_raw", clone_from='charreaubell/demo_data_raw', repo_type='dataset', use_auth_token=True)

Cloning https://huggingface.co/datasets/charreaubell/demo_data_raw into local empty directory.


In [None]:
#save the data
data_ds['train'].to_parquet('demo_data_raw/demo_data_train.parquet')
data_ds['test'].to_parquet('demo_data_raw/demo_data_test.parquet')

31516

In [None]:
#push to hub
is_done = repo.push_to_hub(commit_message='train and test raw data push')
is_done

Upload file demo_data_train.parquet: 100%|##########| 8.53k/8.53k [00:00<?, ?B/s]

Upload file demo_data_test.parquet: 100%|##########| 5.77k/5.77k [00:00<?, ?B/s]

To https://huggingface.co/datasets/charreaubell/demo_data_raw
   4bbe4e5..8799829  main -> main



'https://huggingface.co/datasets/charreaubell/demo_data_raw/commit/8799829ce46f5278934337af2393ab0b8986a2c9'

## Questions and Discussion
Using the Datasets reference:
* Tell me about some of the functions offered for interfacing with Cloud Storage
* How could you stream live data or extremely large data?
* Look at the Search Index page. Describe the general functionality offered here. How might you use it in your application?
* What else seems cool to you?

# What we've covered
We've covered a lot of ground today! In particular, we:
* Explored the HuggingFace Datasets API
* Learned about the Dataset structure
* Converted raw data into the HuggingFace Dataset structure
* Uploaded data to the HuggingFace Hub

# Homework assignment
Create a HuggingFace dataset from your data, and upload it to the HuggingFace Hub!  We'll be working on your own data for the next few classes. If you're unable to upload your data, make sure you know how to access it programmatically.