<a target="_blank" href="https://colab.research.google.com/github/mrdbourke/learn-huggingface/blob/main/notebooks/hugging_face_text_classification_tutorial.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In [1]:
# Goal
# Start with dataset
  # Could generate this dataset or pre-existing
  # Can have the dataset labelled manually or labelled with an LLM
  # Could label this dataset manually or have it zero-shot labelled
# Build a custom text classifier on labelled data
  # Test text classifier on labelled data vs zero-shot model

In [2]:
# Next:
# Add tools used in this overview
# Create a small dataset with text generation, e.g. 50x spam/not_spam emails and train a classifier on it ✅
   # Done, see notebook: https://colab.research.google.com/drive/14xr3KN_HINY5LjV0s2E-4i7v0o_XI3U8?usp=sharing 
# Save the dataset to Hugging Face Datasets ✅
   # Done, see dataset: https://huggingface.co/datasets/mrdbourke/learn_hf_food_not_food_image_captions
# Train a classifier on it ✅
# Save the model to the Hugging Face Model Hub
# Create a with Gradio and test the model in the wild 

## TK - What is Hugging Face?

## TK - Why Hugging Face?

## TK - What is text classification?

* TK - write example problems (binary classification, multi-class classification, multi-label classification)
* TK - write places to find text classification models
* TK - write about different types of text classification models

## TK - Why train your own text classification models?

## TK - What we're going to build

* TK - food not food image caption classifier


## TK - Importing necessary libraries

In [1]:
try:
  import datasets, evaluate, accelerate
except:
  !pip install -U datasets, evaluate, accelerate
  # !pip install -U datasets, evaluate, accelerate
  import datasets, evaluate, accelerate

from datasets import Dataset

import random
import pandas as pd

import transformers

# TK - Write code so that this example works on Google Colab
# TK - e.g. import/install required libraries 
# from google.colab import drive
# drive.mount('/content/drive')

## TK - Getting a dataset

* TK - show how this dataset was created
* TK image - show an image of example text dataset

In [2]:
# Load the dataset
dataset = datasets.load_dataset("mrdbourke/learn_hf_food_not_food_image_captions")
dataset

### TK - Inspect random examples from the dataset

* TK - always spend time with your data, when interacting with a new dataset, view random examples for ~10 minutes or at least 20-100 random examples to get a feel of the data

In [12]:
import random

random_indexs = random.sample(range(len(dataset["train"])), 5)
random_samples = dataset["train"][random_indexs]

print(f"[INFO] Random samples from dataset:\n")
for item in zip(random_samples["text"], random_samples["label"]):
    print(f"Text: {item[0]} | Label: {item[1]}")

[INFO] Random samples from dataset:

Text: A bowl of sliced cantaloupe with a sprinkle of cinnamon and a side of cottage cheese | Label: food
Text: Traditional Japanese flavored sushi roll with pickled plum or fermented soybeans. | Label: food
Text: Set of napkins arranged in a ring | Label: not_food
Text: A close-up of a girl feeding her rabbit in the garden | Label: not_food
Text: A fruit kabob with a variety of fruits, such as grapes, melon, and berries | Label: food


## TK - Preparing data for text classification

See docs: https://huggingface.co/docs/transformers/en/tasks/sequence_classification#preprocess

In [14]:
# Create mapping from id2label and label2id
id2label = {0: "not_food", 1: "food"}
label2id = {"not_food": 0, "food": 1}

In [15]:
# Turn labels into 0 or 1 (e.g. 0 for "not_food", 1 for "food"), see: https://huggingface.co/docs/datasets/en/process#map
def map_labels_to_number(example):
  example["label"] = label2id[example["label"]]
  return example

dataset = dataset["train"].map(map_labels_to_number)
dataset[:5]

Map:   0%|          | 0/250 [00:00<?, ? examples/s]

{'text': ['Creamy cauliflower curry with garlic naan, featuring tender cauliflower in a rich sauce with cream and spices, served with garlic naan bread.',
  'Set of books stacked on a desk',
  'Watching TV together, a family has their dog stretched out on the floor',
  'Wooden dresser with a mirror reflecting the room',
  'Lawn mower stored in a shed'],
 'label': [1, 0, 0, 0, 0]}

In [16]:
dataset.shuffle()[:5]

{'text': ['Working from home at her desk, a woman deals with a cat sitting on the keyboard',
  'Camera mounted on a tripod',
  'A girl feeding her rabbit in the garden',
  'Wooden hanger holding clothes on a rack',
  'Close-up of a sushi roll with avocado, cucumber, and salmon.'],
 'label': [0, 0, 0, 0, 1]}

In [17]:
# Create train/test splits, see: https://huggingface.co/docs/datasets/en/process#split
dataset = dataset.train_test_split(test_size=0.2)
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 200
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 50
    })
})

In [22]:
random_idx_train = random.randint(0, len(dataset["train"]))
random_sample_train = dataset["train"][random_idx_train]

random_idx_test = random.randint(0, len(dataset["test"]))
random_sample_test = dataset["test"][random_idx_test]

print(f"[INFO] Random sample from training dataset:")
print(f"Text: {random_sample_train['text']} | Label: {random_sample_train['label']} ({id2label[random_sample_train['label']]})\n")
print(f"[INFO] Random sample from testing dataset:")
print(f"Text: {random_sample_test['text']} | Label: {random_sample_test['label']} ({id2label[random_sample_test['label']]})")

[INFO] Random sample from training dataset:
Text: Set of knitting needles with yarn waiting to be knitted | Label: 0 (not_food)

[INFO] Random sample from testing dataset:
Text: Set of spatulas kept in a holder | Label: 0 (not_food)


### TK - Tokenizing text data

* TK - what is tokenization? E.g. turning data from text to numbers (machines like numbers)
* TK - see OpenAI guide on tokenization: https://openai.com/tokenization/

In [23]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

def preprocess_function(examples):
  return tokenizer(examples["text"], truncation=True)



In [24]:
tokenized_dataset = dataset.map(preprocess_function, batched=True)
tokenized_dataset

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

Map:   0%|          | 0/50 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 200
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 50
    })
})

In [25]:
tokenized_dataset["train"][0], tokenized_dataset["test"][0]

({'text': 'Pizza with a stuffed crust, oozing with cheese',
  'label': 1,
  'input_ids': [101,
   10733,
   2007,
   1037,
   11812,
   19116,
   1010,
   1051,
   18153,
   2075,
   2007,
   8808,
   102],
  'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]},
 {'text': 'A close-up of a cat lounging on a windowsill with a child reading nearby',
  'label': 0,
  'input_ids': [101,
   1037,
   2485,
   1011,
   2039,
   1997,
   1037,
   4937,
   10223,
   22373,
   2006,
   1037,
   3645,
   8591,
   2007,
   1037,
   2775,
   3752,
   3518,
   102],
  'attention_mask': [1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1,
   1]})

### TK - Make sure all text is the same length

In [26]:
# Collate examples and pad them each batch
# TK - this is not 100% needed as the tokenizer can handle padding, but it's good to know how to do it
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer,
                                        padding=True)
data_collator

DataCollatorWithPadding(tokenizer=DistilBertTokenizerFast(name_or_path='distilbert/distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}, padding=True, max_length=None, pad_to_multiple_of=None, return_tensor

## Evaluation

See: https://huggingface.co/docs/transformers/en/tasks/sequence_classification#evaluate

In [27]:
import evaluate
import numpy as np

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
  predictions, labels = eval_pred
  predictions = np.argmax(predictions, axis=1)
  return accuracy.compute(predictions=predictions, references=labels)

## Train

See: https://huggingface.co/docs/transformers/en/tasks/sequence_classification#train

3 steps for training:

1. Define model
2. Define training arguments
3. Pass training arguments to Trainer
4. Call `train()`

In [57]:
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path="distilbert/distilbert-base-uncased",
    num_labels=2, # can customize this to the number of classes in your dataset
    id2label=id2label,
    label2id=label2id
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [58]:
# Try and make a prediction with the loaded model (this will error)
model(**tokenized_dataset["train"][:2])

TypeError: DistilBertForSequenceClassification.forward() got an unexpected keyword argument 'text'

In [59]:
# Create model output directory
from pathlib import Path

# Create models directory
models_dir = Path("models")
models_dir.mkdir(exist_ok=True)

# Create model save name
model_save_name = "learn_hf_food_not_food_text_classifier-distilbert-base-uncased"

# Create model save path
model_save_dir = Path(models_dir, model_save_name)

model_save_dir

PosixPath('models/learn_hf_food_not_food_text_classifier-distilbert-base-uncased')

In [64]:
# Create training arguments
# See: https://huggingface.co/docs/transformers/v4.40.2/en/main_classes/trainer#transformers.TrainingArguments
# TODO: Turn off Weights & Biases logging? Or add it in?
# TK - exercise: spend 10 minutes reading the TrainingArguments documentation
training_args = TrainingArguments(
    output_dir=model_save_dir, # TODO: change this path to model save path, e.g. 'learn_hf_food_not_food_text_classifier_model' 
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=10,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True, # load the best model when finished training
    logging_strategy="epoch" # log training results every epoch
    # push_to_hub=True # optional: automatically upload the model to the Hub (we'll do this manually later on)
    # hub_token="your_token_here" # optional: add your Hugging Face Hub token to push to the Hub (will default to huggingface-cli login)
    # report_to="none" # optional: log experiments to Weights & Biases/other similar experimenting tracking services (we'll turn this off for now) 

)

In [65]:
# Setup Trainer
# Note: Trainer applies dynamic padding by default when you pass `tokenizer` to it.
# In this case, you don't need to specify a data collator explicitly.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    #data_collator=data_collator,
    compute_metrics=compute_metrics
)

In [66]:
# Batch size 16
#  [ 391/15234 00:22 < 14:27, 17.12 it/s, Epoch 0.05/2]

# Batch size 32
# [ 724/7618 01:08 < 10:51, 10.58 it/s, Epoch 0.19/2]

# Batch size 64
#  [ 150/3810 00:31 < 12:52, 4.74 it/s, Epoch 0.08/2]

In [67]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,0.0129,0.006287,1.0
2,0.006,0.003362,1.0
3,0.0034,0.002199,1.0
4,0.0025,0.001656,1.0
5,0.002,0.001357,1.0
6,0.0016,0.001173,1.0
7,0.0015,0.001059,1.0
8,0.0013,0.00099,1.0
9,0.0013,0.000951,1.0
10,0.0012,0.000937,1.0


TrainOutput(global_step=70, training_loss=0.003375225713742631, metrics={'train_runtime': 6.734, 'train_samples_per_second': 297.002, 'train_steps_per_second': 10.395, 'total_flos': 17458789182240.0, 'train_loss': 0.003375225713742631, 'epoch': 10.0})

In [42]:
# Optional: push the model to Hugging Face Hub for re-use later
# Note: Requires Hugging Face login
# trainer.push_to_hub()

### TK - Save the model for later use

In [78]:
# Save model
# See: https://discuss.huggingface.co/t/how-to-save-my-model-to-use-it-later/20568/4
# TODO: Make a models/ dir to save models to (so we don't have to commit them to git)
trainer.save_model(model_save_dir)

### TK - Push the model to Hugging Face Hub

TK - optional to share the model/use elsewhere 

* see here: https://huggingface.co/docs/transformers/en/model_sharing 
* also see here for how to setup `huggingface-cli` so you can write your model to your account

In [79]:
# TK - have a note here for the errors
# Note: you may see the following error
# 403 Forbidden: You don't have the rights to create a model under the namespace "mrdbourke".
# Cannot access content at: https://huggingface.co/api/repos/create.
# If you are trying to create or update content,make sure you have a token with the `write` role.

In [80]:
# TK - Push model to hub (for later re-use)
# TODO: Push this model to the hub to be able to use it later
# TK - this requires a "write" token from the Hugging Face Hub
# TK - see docs: https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.Trainer.push_to_hub 
# TK - for example, on my local computer, my token is saved to: "/home/daniel/.cache/huggingface/token"

# TK - Can create a model card with create_model_card()
# see here: https://huggingface.co/docs/transformers/v4.41.3/en/main_classes/trainer#transformers.Trainer.create_model_card 

trainer.push_to_hub(
    commit_message="Uploading food not food text classifier model" # set to False if you want the model to be public
    # token="YOUR_HF_TOKEN_HERE" # note: this will default to the token you have saved in your Hugging Face config
)

CommitInfo(commit_url='https://huggingface.co/mrdbourke/learn_hf_food_not_food_text_classifier-distilbert-base-uncased/commit/7c9a4a6b17da981559f484538d51f6ff9a14c12d', commit_message='Uploading food not food text classifier model', commit_description='', oid='7c9a4a6b17da981559f484538d51f6ff9a14c12d', pr_url=None, pr_revision=None, pr_num=None)

* TK - note: this will make the model public, to make it private, 

See the model here saved for later: https://huggingface.co/mrdbourke/learn_hf_food_not_food_text_classifier-distilbert-base-uncased 

## TK - Inference

UPTOHERE
- load the model (locally + from Hub)
    - make sure to change the save paths when loading the model to the new paths
- make predictions on new text data
- build a demo with Gradio (optional)

Making predictions on our own text options.

See: https://huggingface.co/docs/transformers/en/tasks/sequence_classification#inference

In [81]:
sample_text = "A delicious photo of a plate of scrambled eggs, bacon and toast"

### Pipeline mode

In [82]:
# TODO: TK - set device agnostic code for CUDA/Mac/CPU?

In [83]:
import torch
from transformers import pipeline

food_not_food_classifier = pipeline(task="text-classification", 
                                    model="./learn_hf_food_not_food_text_classifier_model",
                                    batch_size=64,
                                    device="cuda" if torch.cuda.is_available() else "cpu")
food_not_food_classifier(sample_text)

[{'label': 'food', 'score': 0.9857270121574402}]

In [84]:
sample_text_not_food = "A yellow tractor driving over the hill"
food_not_food_classifier(sample_text_not_food)

[{'label': 'not_food', 'score': 0.9952113032341003}]

In [85]:
# Predicting works with lists
# Can find the examples with highest confidence and keep those
sentences = [
    "I whipped up a fresh batch of code, but it seems to have a syntax error.",
    "We need to marinate these ideas overnight before presenting them to the client.",
    "The new software is definitely a spicy upgrade, taking some time to get used to.",
    "Her social media post was the perfect recipe for a viral sensation.",
    "He served up a rebuttal full of facts, leaving his opponent speechless.",
    "The team needs to simmer down a bit before tackling the next challenge.",
    "Our budget is a bit thin, so we'll have to use budget-friendly materials for this project.",
    "The presentation was a delicious blend of humor and information, keeping the audience engaged.",
    "I'm feeling overwhelmed by this workload – it's a real information buffet.",
    "We're brainstorming new content ideas, hoping to cook up something innovative.",
    "Daniel Bourke is really cool :D"
]

food_not_food_classifier(sentences)

[{'label': 'food', 'score': 0.5004891753196716},
 {'label': 'not_food', 'score': 0.8031825423240662},
 {'label': 'food', 'score': 0.5688743591308594},
 {'label': 'food', 'score': 0.5170369744300842},
 {'label': 'not_food', 'score': 0.6362243890762329},
 {'label': 'not_food', 'score': 0.7544246315956116},
 {'label': 'not_food', 'score': 0.7407550811767578},
 {'label': 'not_food', 'score': 0.5384440422058105},
 {'label': 'not_food', 'score': 0.863006055355072},
 {'label': 'not_food', 'score': 0.9562841653823853},
 {'label': 'not_food', 'score': 0.9076286554336548}]

In [86]:
%%time
import time
for i in [10, 100, 1000, 10_000]:
    sentences_big = sentences * i
    print(f"[INFO] Number of sentences: {len(sentences_big)}")

    start_time = time.time()
    food_not_food_classifier(sentences_big)
    end_time = time.time()

    print(f"[INFO] Inference time for {len(sentences_big)} sentences: {end_time - start_time} seconds.")
    print(f"[INFO] Avg inference time per sentence: {(end_time - start_time) / len(sentences_big)} seconds.")
    print()

[INFO] Number of sentences: 110
[INFO] Inference time for 110 sentences: 0.06326603889465332 seconds.
[INFO] Avg inference time per sentence: 0.000575145808133212 seconds.

[INFO] Number of sentences: 1100
[INFO] Inference time for 1100 sentences: 0.341522216796875 seconds.
[INFO] Avg inference time per sentence: 0.00031047474254261364 seconds.

[INFO] Number of sentences: 11000
[INFO] Inference time for 11000 sentences: 1.562863826751709 seconds.
[INFO] Avg inference time per sentence: 0.00014207852970470083 seconds.

[INFO] Number of sentences: 110000
[INFO] Inference time for 110000 sentences: 15.670900344848633 seconds.
[INFO] Avg inference time per sentence: 0.00014246273040771485 seconds.

CPU times: user 17.2 s, sys: 493 ms, total: 17.6 s
Wall time: 17.6 s


### PyTorch mode

In [50]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("learn_hf_food_not_food_text_classifier_model")
inputs = tokenizer(sample_text, return_tensors="pt")

In [53]:
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("learn_hf_food_not_food_text_classifier_model")
with torch.no_grad():
  logits = model(**inputs).logits

In [54]:
# Get predicted class
predicted_class_id = logits.argmax().item()
print(f"Text: {sample_text}")
print(f"Predicted label: {model.config.id2label[predicted_class_id]}")

Text: A delicious photo of a plate of scrambled eggs, bacon and toast
Predicted label: food


## TK - Turning our model into a demo

* TK - build a demo with Gradio, see it here: https://www.gradio.app/guides/quickstart 
* TK - requires `pip install gradio`

In [56]:
# TODO: Make a demo of the model with Gradio and test it in the wild

In [88]:
food_not_food_classifier("Testing the pipeline", return_all_scores=True)



[[{'label': 'not_food', 'score': 0.9902743697166443},
  {'label': 'food', 'score': 0.00972571037709713}]]

In [101]:
import gradio as gr

def food_not_food_classifier(text):
    food_not_food_classifier = pipeline(task="text-classification", 
                                        model="models/learn_hf_food_not_food_text_classifier-distilbert-base-uncased",
                                        batch_size=64,
                                        device="cuda" if torch.cuda.is_available() else "cpu",
                                        top_k=None) # return all possible scores (not just top-1)
    
    # Get outputs from pipeline (as a list of dicts)
    outputs = food_not_food_classifier(text)[0]

    # Format output for Gradio (e.g. {"label_1": probability_1, "label_2": probability_2})
    output_dict = {}

    for item in outputs:
        output_dict[item["label"]] = item["score"]

    return output_dict

food_not_food_classifier("My lunch today was bacon and eggs")

{'food': 0.7966588139533997, 'not_food': 0.20334114134311676}

In [102]:
demo = gr.Interface(fn=food_not_food_classifier, 
             inputs="text", 
             outputs=gr.Label(num_top_classes=2), # show top 2 classes (that's all we have)
             title="Food or Not Food Classifier",
             description="A text classifier to determine if a sentence is about food or not food.",
             examples=[["I whipped up a fresh batch of code, but it seems to have a syntax error."],
                       ["A delicious photo of a plate of scrambled eggs, bacon and toast."]])

demo.launch()

Running on local URL:  http://127.0.0.1:7862

To create a public link, set `share=True` in `launch()`.




In [106]:
# Make a directory for demos
demos_dir = Path("../demos")
demos_dir.mkdir(exist_ok=True)

# Create a folder for the food_not_food_text_classifer demo
food_not_food_text_classifier_demo_dir = Path(demos_dir, "food_not_food_text_classifier")
food_not_food_text_classifier_demo_dir.mkdir(exist_ok=True)

In [109]:
%%writefile ../demos/food_not_food_text_classifier/app.py
import torch
import gradio as gr

from transformers import pipeline

def food_not_food_classifier(text):
    # Set up text classification pipeline
    food_not_food_classifier = pipeline(task="text-classification", 
                                        model="mrdbourke/learn_hf_food_not_food_text_classifier-distilbert-base-uncased", # link to model on HF Hub
                                        device="cuda" if torch.cuda.is_available() else "cpu",
                                        top_k=None) # return all possible scores (not just top-1)
    
    # Get outputs from pipeline (as a list of dicts)
    outputs = food_not_food_classifier(text)[0]

    # Format output for Gradio (e.g. {"label_1": probability_1, "label_2": probability_2})
    output_dict = {}
    for item in outputs:
        output_dict[item["label"]] = item["score"]

    return output_dict

description = """
A text classifier to determine if a sentence is about food or not food.

TK - See source code:
"""

demo = gr.Interface(fn=food_not_food_classifier, 
             inputs="text", 
             outputs=gr.Label(num_top_classes=2), # show top 2 classes (that's all we have)
             title="🍗🚫🥑 Food or Not Food Text Classifier",
             description=description,
             examples=[["I whipped up a fresh batch of code, but it seems to have a syntax error."],
                       ["A delicious photo of a plate of scrambled eggs, bacon and toast."]])

if __name__ == "__main__":
    demo.launch()

Overwriting ../demos/food_not_food_text_classifier/app.py


### TK - Uploading/running the demo

Options:
* Uploading manually to Hugging Face Spaces - hf.co/new-space 
* Uploading programmatically to Hugging Face Spaces - https://www.gradio.app/guides/using-hugging-face-integrations#hosting-your-gradio-demos-on-spaces
* Running the demo locally - `Interface.launch()` (only works if you have Gradio installed)


In [126]:
%%writefile ../demos/food_not_food_text_classifier/requirements.txt
gradio
torch
transformers

Overwriting ../demos/food_not_food_text_classifier/requirements.txt


Create a `README.md` file with metadata instructions (these are specific to Hugging Face Spaces).

In [127]:
%%writefile ../demos/food_not_food_text_classifier/README.md
---
title: Food Not Food Text Classifier
emoji: 🍗🚫🥑
colorFrom: blue
colorTo: yellow
sdk: gradio
sdk_version: 4.36.1
app_file: app.py
pinned: false
license: apache-2.0
---

# 🍗🚫🥑 Food Not Food Text Classifier

Small demo to showcase a text classifier to determine if a sentence is about food or not food.

DistillBERT model fine-tuned on a small synthetic dataset of 250 generated [Food or Not Food image captions](https://huggingface.co/datasets/mrdbourke/learn_hf_food_not_food_image_captions).

TK - see the demo notebook on how to create this

Overwriting ../demos/food_not_food_text_classifier/README.md


In [129]:
from huggingface_hub import (
    create_repo,
    get_full_repo_name,
    upload_file, # for uploading a single file
    upload_folder # for uploading multiple files (in a folder)
)

path_to_demo_folder = "../demos/food_not_food_text_classifier"
repo_type = "space" # we're creating a Hugging Face Space

# Create a repo on Hugging Face
# see docs: https://huggingface.co/docs/huggingface_hub/v0.23.3/en/package_reference/hf_api#huggingface_hub.HfApi.create_repo
target_space_name = "learn_hf_food_not_food_text_classifier_demo"
print(f"[INFO] Creating repo: {target_space_name}")
create_repo(
    repo_id=target_space_name,
    #token="YOUR_HF_TOKEN"
    private=False, # set to True if you want the repo to be private
    repo_type=repo_type, # create a Hugging Face Space
    space_sdk="gradio", # we're using Gradio to build our demo 
    exist_ok=True, # set to False if you want to create the repo even if it already exists            
)

# Get the full repo name (e.g. "mrdbourke/learn_hf_food_not_food_text_classifier_demo")
full_repo_name = get_full_repo_name(model_id=target_space_name)
print(f"[INFO] Full repo name: {full_repo_name}")

# Upload a file
# see docs: https://huggingface.co/docs/huggingface_hub/v0.23.3/en/package_reference/hf_api#huggingface_hub.HfApi.upload_file 
print(f"[INFO] Uploading {path_to_demo_folder} to repo: {full_repo_name}")
file_url = upload_folder(
    folder_path=path_to_demo_folder,
    path_in_repo=".", # save to the root of the repo
    repo_id=full_repo_name,
    repo_type=repo_type,
    #token="YOUR_HF_TOKEN"
    commit_message="Uploading food not food text classifier demo app.py"
)

[INFO] Creating repo: learn_hf_food_not_food_text_classifier_demo
[INFO] Full repo name: mrdbourke/learn_hf_food_not_food_text_classifier_demo
[INFO] Uploading ../demos/food_not_food_text_classifier to repo: mrdbourke/learn_hf_food_not_food_text_classifier_demo


TK - note: you may need a requirements.txt file

```
===== Application Startup at 2024-06-13 05:37:21 =====

Traceback (most recent call last):
  File "/home/user/app/app.py", line 1, in <module>
    import torch
ModuleNotFoundError: No module named 'torch'
```

In [None]:
# Next:
# Change repo name to "learn_hf..."
# Embed the repo here
# Go back through code and make sure it's clean
# See demo link: https://huggingface.co/spaces/mrdbourke/learn_food_not_food_text_classifier_demo 