<a href="https://colab.research.google.com/github/anyuanay/medium/blob/main/src/working_huggingface/Working_with_HuggingFace_ch1_huging_face_ecosystem_using_datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial: Working with Hugging Face Models and Datasets
## Chapter 1: Getting Familiar with the Hugging Face Ecosystem
### Lesson 1.2: Loading and exploring a tweet sentiment analysis dataset
In this lesson, we will load a dataset from Hugging Face, inspect its metadata and contents, and apply some basic pre-processing: renaming a column, indexing and slicing records, and splitting the data into train and test sets.

# Install Transformers and Datasets from Hugging Face

In [1]:
# Install the Datasets and Transformers libraries
! pip install datasets transformers[torch]
# To install from source instead of the latest release, comment out the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/datasets.git

Collecting datasets
  Downloading datasets-2.14.5-py3-none-any.whl (519 kB)
Collecting transformers[torch]
  Downloading transformers-4.33.3-py3-none-any.whl (7.6 MB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)

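After the install finishes, it can help to confirm which versions are actually active in the runtime, since the results in this lesson were produced with specific releases. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of any Hugging Face API.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the version of each library this lesson relies on.
for name in ("datasets", "transformers", "torch"):
    print(name, installed_version(name))
```

A value of `None` means the package is not installed in the current environment, in which case the install cell should be run again.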
# Load a tweet dataset for sentiment analysis

In [2]:
from datasets import load_dataset

dataset = load_dataset("SetFit/tweet_sentiment_extraction")

Repo card metadata block was not found. Setting CardData to empty.



In [28]:
type(dataset), type(dataset['train']), type(dataset['test'])

(datasets.dataset_dict.DatasetDict,
 datasets.arrow_dataset.Dataset,
 datasets.arrow_dataset.Dataset)

# List the metadata and content of the dataset

In [6]:
# Show the metadata
dataset

DatasetDict({
    train: Dataset({
        features: ['textID', 'text', 'label', 'label_text'],
        num_rows: 27481
    })
    test: Dataset({
        features: ['textID', 'text', 'label', 'label_text'],
        num_rows: 3534
    })
})

In [11]:
# Rename the column 'label_text' to 'label_name'
dataset = dataset.rename_column('label_text', 'label_name')
dataset

DatasetDict({
    train: Dataset({
        features: ['textID', 'text', 'label', 'label_name'],
        num_rows: 27481
    })
    test: Dataset({
        features: ['textID', 'text', 'label', 'label_name'],
        num_rows: 3534
    })
})

In [12]:
# Convert the train split to a Pandas DataFrame
import pandas as pd
train = dataset['train'].to_pandas()
train.head()

index,textID,text,label,label_name
0,cb774db0d1,"I`d have responded, if I were going",1,neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,0,negative
2,088c60f138,my boss is bullying me...,0,negative
3,9642c003ef,what interview! leave me alone,0,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...",0,negative


# Index and slice the dataset

In [13]:
# Get the second record from the train set
dataset['train'][1]

{'textID': '549e992a42',
 'text': ' Sooo SAD I will miss you here in San Diego!!!',
 'label': 0,
 'label_name': 'negative'}

In [16]:
# Get the last 50 records from the test set.
# A slice returns a dict that maps each column name to a list of values.
test_last_50 = dataset['test'][-50:]

In [17]:
# Convert the sliced columns to a Pandas DataFrame
pd.DataFrame(test_last_50)

index,textID,text,label,label_name
0,1dc7f3d536,What are those barrels made of? Hey pass that...,1,neutral
1,c08d28468b,Thanks for adding me,2,positive
2,043c3b53f8,can`t wait to watch the next season of heroes,2,positive
3,5fc6e71643,"Bored, making a mothers day card",1,neutral
4,27829d97bd,'sometime around midnight' by The Airborne Tox...,0,negative
5,603a6f30be,I think Max (my cat) may really be gone,1,neutral
6,f13fd6b067,`m working on a logo on photoshop & it didn`t ...,1,neutral
7,6bda8dca4d,I have a bird living with me. So I have to f...,0,negative
8,b70db9bf31,oowww...why do wisdom teeth hurt so much,0,negative
9,c9980ac0cd,airsoft is so much fun! i play with my brothe...,2,positive
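Indexing a single record returns one dict per row, while a slice returns a dict of column lists. The sketch below shows, in plain Python, how such a column-oriented batch can be turned into per-row dicts and summarized. The `batch` literal is a hand-copied stand-in for the slice (first five rows of the table above, with the truncated `text` column left out), and the names `rows`, `counts`, and `id2label` are illustrative.

```python
from collections import Counter

# Column-oriented batch, shaped like the result of dataset['test'][-50:].
batch = {
    "textID": ["1dc7f3d536", "c08d28468b", "043c3b53f8", "5fc6e71643", "27829d97bd"],
    "label": [1, 2, 2, 1, 0],
    "label_name": ["neutral", "positive", "positive", "neutral", "negative"],
}

# Turn the dict of lists into a list of per-row dicts.
rows = [dict(zip(batch, values)) for values in zip(*batch.values())]

# Count how often each sentiment appears in the batch.
counts = Counter(batch["label_name"])

# Recover the integer-to-name mapping from the paired label columns.
id2label = dict(sorted(set(zip(batch["label"], batch["label_name"]))))

print(rows[0])
print(counts)
print(id2label)
```

On these five rows the mapping comes out as 0 for negative, 1 for neutral, and 2 for positive, which matches the pairs visible throughout this lesson.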


# Split a dataset into train and test sets

In [25]:
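# Hold out 20% of the original train split as a new test set.
# Rows are shuffled before splitting, so pass seed=... for a reproducible split.
dataset_split = dataset['train'].train_test_split(test_size=0.2)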

In [26]:
dataset_split['train']

Dataset({
    features: ['textID', 'text', 'label', 'label_name'],
    num_rows: 21984
})

In [27]:
dataset_split['test']

Dataset({
    features: ['textID', 'text', 'label', 'label_name'],
    num_rows: 5497
})
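The two row counts can be checked with a little arithmetic. The sketch below assumes the test count is the requested fraction rounded up and the train split receives the remaining rows; that assumption is consistent with the sizes printed above, but it is inferred from the output rather than stated in this lesson.

```python
import math

n_rows = 27481   # rows in the original train split
test_size = 0.2  # fraction passed to train_test_split

# Assumption: the test count is the fraction rounded up, and the
# train split receives whatever is left.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)  # 21984 5497
```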