Machine perceptual quality evaluation

* Images
  * Dataset: [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k)
  * Model: [Distilled data-efficient Image Transformer (DeiT)](https://huggingface.co/facebook/deit-small-distilled-patch16-224)
  * Metric: Image classification accuracy
  * Compression:
    * JPEG Q=5/100
    * HiFiC
    * TFCI
* Audio
  * Dataset: [Common Voice Corpus 11.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0)
  * Model: [Whisper](https://huggingface.co/openai/whisper-tiny.en)
  * Metric: Speech recognition word error rate
  * Compression:
    * MP3 kbps
    * Descript
    * EnCodec
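
As an illustration of the JPEG Q=5/100 condition listed above, the following is a minimal sketch of an in-memory JPEG round trip. It assumes Pillow is installed; the helper name `jpeg_roundtrip` is hypothetical and not part of this notebook, and the sketch is not necessarily how the compressed images for this evaluation were produced.

```python
import io

from PIL import Image


def jpeg_roundtrip(image, quality=5):
    """Encode an image as JPEG in memory, then decode it again.

    Hypothetical helper (an assumption, not from the notebook).
    Returns the decoded image and the size of the encoded file in bytes.
    """
    buffer = io.BytesIO()
    # JPEG has no alpha channel, so convert to RGB before saving.
    image.convert("RGB").save(buffer, format="JPEG", quality=quality)
    encoded = buffer.getvalue()
    decoded = Image.open(io.BytesIO(encoded))
    decoded.load()  # force the decode now instead of lazily
    return decoded, len(encoded)
```

The decoded image could then be passed to the classifier in place of the original, and the byte count gives the bitrate to report alongside the accuracy.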

In [4]:
from datasets import load_dataset
from evaluate import evaluator
from transformers import pipeline

In [None]:
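# ImageNet-1k validation split (a large, gated download on the Hugging Face Hub)
data = load_dataset("imagenet-1k", split="validation")

# Image classifier used as the "machine observer"
pipe = pipeline(
    task="image-classification",
    model="facebook/deit-small-distilled-patch16-224"
)

# Top-1 accuracy of the classifier on the dataset
task_evaluator = evaluator("image-classification")
eval_results = task_evaluator.compute(
    model_or_pipeline=pipe,
    data=data,
    metric="accuracy",
    # map the model's string labels to the dataset's integer class ids
    label_mapping=pipe.model.config.label2id
)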

In [None]:
task_evaluator = evaluator("automatic-speech-recognition")
# First 40 utterances of the English Common Voice 11.0 validation split
data = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="validation[:40]")
results = task_evaluator.compute(
    # Hub model id rather than the full model-page URL
    model_or_pipeline="openai/whisper-tiny.en",
    data=data,
    input_column="path",
    label_column="sentence",
    metric="wer",
)
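
For reference, the word error rate requested via `metric="wer"` above is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The following is a minimal pure-Python sketch for a single utterance; the function name `word_error_rate` is hypothetical, no text normalization is applied, and the library metric may aggregate over utterances differently.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.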