# Hugging Face and its Importance in NLP Projects

Hugging Face provides open-source libraries (such as `transformers` and `datasets`) and a model hub for natural language processing (NLP) and large language models (LLMs). Its ready-to-use pre-trained models cover many common NLP tasks, which can save considerable time in both learning and deployment.

In this notebook, we will:
1. Discuss the Hugging Face `pipeline` API.
2. Explore various NLP tasks such as sentiment analysis, text classification, and summarization.
3. Dive into tokenization and fine-tuning our own models.
4. Use Hugging Face datasets.
5. Discuss Hugging Face Spaces.
6. Build a project that combines an NLP task with the arXiv API to pull and summarize research papers.


# Installation


In [1]:
pip install transformers



# Hugging Face Tasks
Hugging Face supports a wide variety of tasks. By visiting huggingface.co/tasks, you can explore tasks such as text classification, question answering, summarization, text generation, and more.
In the cell below, we import the `pipeline` function from the `transformers` library. The `pipeline` API simplifies many common NLP tasks by providing an easy-to-use interface.

In [None]:
from transformers import pipeline
#---------------------------------------------------#
#                     NLP TASKS                     #
#---------------------------------------------------#

'''
1. Text Classification: Assigning a category to a piece of text.
Sentiment Analysis
Topic Classification
Spam Detection '''

classifier = pipeline("text-classification")

'''
2. Token Classification: Assigning labels to individual tokens in a sequence.
Named Entity Recognition (NER)
Part-of-Speech Tagging
'''

token_classifier = pipeline("token-classification")

'''
3. Question Answering: Extracting an answer from a given context based on a question.
'''
question_answerer = pipeline("question-answering")

'''
4. Text Generation: Generating text based on a given prompt.
Language Modeling
Story Generation

'''

text_generator = pipeline("text-generation")

'''
5. Summarization: Condensing long documents into shorter summaries.
'''

summarizer = pipeline("summarization")

'''
6. Translation: Translating text from one language to another.
'''

translator = pipeline("translation",
                      model="Helsinki-NLP/opus-mt-en-fr")

'''
7. Text2Text Generation: General-purpose text transformation, including summarization and translation.
'''

text2text_generator = pipeline("text2text-generation")

'''
8. Fill-Mask: Predicting the masked token in a sequence.
'''

fill_mask = pipeline("fill-mask")

'''
9. Feature Extraction: Extracting hidden states or features from text.
'''

feature_extractor = pipeline("feature-extraction")

'''
10. Sentence Similarity: Measuring the similarity between two sentences.
'''
# NOTE: "sentence-similarity" is a task category on the Hub, but it is not a
# task name that pipeline() accepts, so the call below would raise an error.
# Similarity is usually computed from sentence embeddings instead (for example
# from the "feature-extraction" pipeline or the sentence-transformers library).
# sentence_similarity = pipeline("sentence-similarity")

#---------------------------------------------------#
#             Computer Vision TASKS                 #
#---------------------------------------------------#

'''
1. Image Classification: Classifying the main content of an image.

'''

image_classifier = pipeline("image-classification")

'''
2. Object Detection: Identifying objects within an image and their bounding boxes.
'''

object_detector = pipeline("object-detection")

'''
3. Image Segmentation: Segmenting different parts of an image into classes.
'''

image_segmenter = pipeline("image-segmentation")

'''
4. Image Generation: Generating images from textual descriptions (using DALL-E or similar models).
'''

#---------------------------------------------------#
#             Speech Processing TASKS               #
#---------------------------------------------------#

'''
1. Automatic Speech Recognition (ASR): Converting spoken language into text.
'''

speech_recognizer = pipeline("automatic-speech-recognition")

'''
2. Speech Translation: Translating spoken language from one language to another.
3. Audio Classification: Classifying audio signals into predefined categories.
'''

#---------------------------------------------------#
#                   Multimodal TASKS                #
#---------------------------------------------------#

'''
1. Image Captioning: Generating a textual description of an image.
'''
image_captioner = pipeline("image-to-text")
'''
2. Visual Question Answering (VQA): Answering questions about the content of an image.
'''

#---------------------------------------------------#
#                     Other TASKS                   #
#---------------------------------------------------#
'''
1. Table Question Answering: Answering questions based on tabular data.
'''
table_qa = pipeline("table-question-answering")

'''
2. Document Question Answering: Extracting answers from documents like PDFs.

'''
doc_qa = pipeline("document-question-answering")
'''
3. Time Series Forecasting: Predicting future values in time series data (not directly supported in the main Transformers library but available through extensions).
'''

# NLP Tasks

## Sentiment Analysis

We will start with a basic sentiment analysis task. Using the `pipeline` API, we can quickly set up a sentiment analysis pipeline and analyze the sentiment of various texts.

We will also demonstrate how to use specific pre-trained models for sentiment analysis, handle batch processing of multiple sentences, and utilize models that detect emotions to add more nuance to our sentiment analysis.

In [3]:
from transformers import pipeline

# Initialize a sentiment analysis pipeline using a default model
classifier = pipeline("sentiment-analysis")
result = classifier("I really like techno music")
print(result)


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9989181756973267}]


In [4]:
# Perform sentiment analysis with a one-liner syntax; equivalent to the previous example.
pipeline(task = "sentiment-analysis")("I was confused by your attitude")



[{'label': 'NEGATIVE', 'score': 0.9984862208366394}]

Example of analyzing sentiment on a multi-line text:

In [None]:
pipeline(task = "sentiment-analysis")\
                                      ("Everyday lots of LLMs papers are published about LLMs Evlauation. \
                                      Lots of them Looks very Promising. \
                                      I am not sure if we CAN actually Evaluate LLMs. \
                                      There is still lots to do.\
                                      Don't you think?")



[{'label': 'POSITIVE', 'score': 0.9964345693588257}]

In [None]:
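# Try a different pre-trained model. Note that facebook/bart-large-mnli is a
# natural language inference (NLI) model, so its labels (contradiction, neutral,
# entailment) are NLI classes rather than sentiment classes.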
pipeline(task = "sentiment-analysis", model="facebook/bart-large-mnli")\
                                      ("Everyday lots of LLMs papers are published about LLMs Evlauation. \
                                      Lots of them Looks very Promising. \
                                      I am not sure if we CAN actually Evaluate LLMs. \
                                      There is still lots to do.\
                                      Don't you think?")


[{'label': 'neutral', 'score': 0.7693336606025696}]

A single label for a long passage with mixed opinions can hide the sentiment of its individual sentences. Passing a list of sentences to the pipeline scores each one separately, which gives a more detailed picture.

### Batch Sentiment Analysis

In [None]:
# Process multiple sentences at once

classifier = pipeline(task = "sentiment-analysis")

task_list = ["I really like Autoencoders, best models for Anomaly Detection", \
            "I am not sure if we CAN actually Evaluate LLMs.", \
            "PassiveAgressive is the name of a Linear Regression Model that so many people do not know.",\
            "I hate long Meetings."]
classifier(task_list)



[{'label': 'POSITIVE', 'score': 0.9978686571121216},
 {'label': 'NEGATIVE', 'score': 0.9995476603507996},
 {'label': 'NEGATIVE', 'score': 0.9983084201812744},
 {'label': 'NEGATIVE', 'score': 0.9969879984855652}]
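
Once each sentence has its own result, the batch can be condensed into an overall picture. The following is a minimal sketch of one way to do that; `summarize_labels` is a hypothetical helper (not part of `transformers`), and the input is the batch output shown above with scores rounded.

```python
from collections import Counter

# Results copied from the batch output above (scores rounded for brevity).
batch_results = [
    {"label": "POSITIVE", "score": 0.9979},
    {"label": "NEGATIVE", "score": 0.9995},
    {"label": "NEGATIVE", "score": 0.9983},
    {"label": "NEGATIVE", "score": 0.9970},
]

def summarize_labels(results):
    """Return (majority_label, label_counts, mean_score_per_label)."""
    counts = Counter(r["label"] for r in results)
    mean_scores = {
        label: sum(r["score"] for r in results if r["label"] == label) / n
        for label, n in counts.items()
    }
    majority_label = counts.most_common(1)[0][0]
    return majority_label, counts, mean_scores

majority, counts, mean_scores = summarize_labels(batch_results)
print(majority, dict(counts))
```

A majority vote is only one possible aggregation; weighting by score or by sentence length may suit other texts better.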

In [None]:
# Using Emotion Detection for Sentiment Analysis
# Utilize a model that can detect emotions

classifier = pipeline(task = "sentiment-analysis", model = "SamLowe/roberta-base-go_emotions")

task_list = ["I really like Autoencoders, best models for Anomaly Detection", \
            "I am not sure if we CAN actually Evaluate LLMs.", \
            "PassiveAgressive is the name of a Linear Regression Model that so many people do not know. It is pretty funny name for a Regression Model.",\
            "I hate long Meetings."]
classifier(task_list)

[{'label': 'admiration', 'score': 0.7406538128852844},
 {'label': 'confusion', 'score': 0.9066851139068604},
 {'label': 'amusement', 'score': 0.9083251357078552},
 {'label': 'anger', 'score': 0.7870614528656006}]

## Text Generation

In [5]:
# Use a pipeline as a high-level helper
from transformers import pipeline

text_generator = pipeline("text-generation", model="distilbert/distilgpt2")
generated_text = text_generator("Your eyes are two galaxies",
                                truncation=True,
                                num_return_sequences = 2)
print("Generated_text:\n ", generated_text[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated_text:
  Your eyes are two galaxies apart. So there's two big, shiny, golden stars that's about 60 light-years from Earth.



Now, after a week of painstaking work, one thing is certain - until now, the galaxy
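
Running the cell again will usually produce a different continuation, because the next token is sampled rather than always taken as the most likely one (assuming sampling is enabled, which returning several sequences requires). The sketch below illustrates temperature and top-k sampling on a toy distribution; `sample_next_token` and the logits are hypothetical and do not reproduce the library's internals.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Sample one token from a {token: logit} mapping."""
    items = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        items = items[:top_k]  # keep only the k highest-scoring tokens
    scaled = [score / temperature for _, score in items]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # unnormalized softmax
    tokens = [tok for tok, _ in items]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token logits after the prompt "Your eyes are two galaxies".
toy_logits = {"apart": 2.0, "away": 1.5, ".": 1.0, "banana": -3.0}

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [
    sample_next_token(toy_logits, temperature=0.8, top_k=3, rng=rng)
    for _ in range(5)
]
print(samples)
```

Lower temperatures concentrate probability on the top tokens, and `top_k=1` reduces to greedy decoding.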


## Question Answering

In [6]:
from transformers import pipeline

qa_model = pipeline("question-answering")
question = "What is my job?"
context = "I am developing AI models with Python."
qa_model(question = question, context = context)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.7823829650878906,
 'start': 5,
 'end': 25,
 'answer': 'developing AI models'}
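
Extractive question answering does not generate text: the model scores every context token as a possible answer start and answer end, and the answer is the best-scoring span. A minimal sketch of that selection step follows, using hypothetical per-token scores; `best_span` is an illustrative helper, not the pipeline's actual implementation. The last lines show that `start` and `end` in the output above are character offsets into the context.

```python
def best_span(start_scores, end_scores, max_answer_len=15):
    """Return (start, end, score) maximizing start_scores[i] + end_scores[j], i <= j."""
    best = None
    for i, s in enumerate(start_scores):
        last = min(i + max_answer_len, len(end_scores))
        for j in range(i, last):
            score = s + end_scores[j]
            if best is None or score > best[2]:
                best = (i, j, score)
    return best

# Hypothetical per-token scores for the context tokens below (illustration only).
tokens = ["I", "am", "developing", "AI", "models", "with", "Python", "."]
start_scores = [0.1, 0.2, 3.0, 0.5, 0.3, 0.0, 1.0, -1.0]
end_scores = [0.0, 0.1, 0.4, 0.8, 2.5, 0.1, 1.2, -1.0]

i, j, _ = best_span(start_scores, end_scores)
print(" ".join(tokens[i : j + 1]))

# The pipeline reports character offsets into the context string.
context = "I am developing AI models with Python."
print(context[5:25])
```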

# Tokenization

In [None]:
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DistilBertForSequenceClassification,
    DistilBertTokenizer,
    pipeline,
)

# Load a model and its tokenizer explicitly with the model-specific classes
model_name1 = "distilbert-base-uncased-finetuned-sst-2-english"
mytokenizer1 = DistilBertTokenizer.from_pretrained(model_name1)
mymodel1 = DistilBertForSequenceClassification.from_pretrained(model_name1)

# Pass the loaded model and tokenizer to the pipeline instead of relying on the default
classifier = pipeline("sentiment-analysis", model=mymodel1, tokenizer=mytokenizer1)
res = classifier("I was so not happy with the Barbie Movie")
print(res)

# The Auto* classes pick the matching architecture from the checkpoint name
model_name2 = "nlptown/bert-base-multilingual-uncased-sentiment"
mymodel2 = AutoModelForSequenceClassification.from_pretrained(model_name2)
mytokenizer2 = AutoTokenizer.from_pretrained(model_name2)

classifier = pipeline("sentiment-analysis", model=mymodel2, tokenizer=mytokenizer2)
res = classifier("I was so not happy with the Barbie Movie")
print(res)


[{'label': 'NEGATIVE', 'score': 0.9998108744621277}]
[{'label': '2 stars', 'score': 0.5099300742149353}]


In [None]:
from transformers import AutoTokenizer

# Load a pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Example text
text = "I was so not happy with the Barbie Movie"

# Tokenize the text
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)

# Convert tokens to input IDs
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Input IDs:", input_ids)

# Encode the text (tokenization + converting to input IDs)
encoded_input = tokenizer(text)
print("Encoded Input:", encoded_input)

# Decode the text
decoded_output = tokenizer.decode(input_ids)
print("Decode Output: ", decoded_output)


Tokens: ['i', 'was', 'so', 'not', 'happy', 'with', 'the', 'barbie', 'movie']
Input IDs: [1045, 2001, 2061, 2025, 3407, 2007, 1996, 22635, 3185]
Encoded Input: {'input_ids': [101, 1045, 2001, 2061, 2025, 3407, 2007, 1996, 22635, 3185, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Decode Output:  i was so not happy with the barbie movie


In [None]:
from transformers import AutoTokenizer

# Load a pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Example text
text = "I was so not happy with the Barbie Movie"

# Tokenize the text
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)

# Convert tokens to input IDs
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Input IDs:", input_ids)

# Encode the text (tokenization + converting to input IDs)
encoded_input = tokenizer(text)
print("Encoded Input:", encoded_input)

# Decode the text
decoded_output = tokenizer.decode(input_ids)
print("Decode Output: ", decoded_output)

Tokens: ['I', 'was', 'so', 'not', 'happy', 'with', 'the', 'Barbie', 'Movie']
Input IDs: [146, 1108, 1177, 1136, 2816, 1114, 1103, 25374, 8275]
Encoded Input: {'input_ids': [101, 146, 1108, 1177, 1136, 2816, 1114, 1103, 25374, 8275, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Decode Output:  I was so not happy with the Barbie Movie
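
The BERT tokenizers used above split text with WordPiece: each word is matched greedily against the vocabulary, longest piece first, and pieces that continue a word carry a `##` prefix. The sketch below illustrates the idea with a tiny invented vocabulary; `wordpiece` and `tokenize` are hypothetical helpers and skip details of the real tokenizer such as punctuation splitting.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

def tokenize(text, vocab, lowercase=True):
    """Whitespace-split the text, optionally lowercase it, and apply WordPiece."""
    words = (text.lower() if lowercase else text).split()
    return [p for w in words for p in wordpiece(w, vocab)]

# Toy vocabulary invented for this example; real BERT vocabularies hold
# roughly 30,000 entries.
toy_vocab = {"i", "was", "not", "happy", "un", "##happy", "##ness", "movie"}

print(tokenize("I was not happy", toy_vocab))
print(tokenize("unhappyness", toy_vocab))
print(tokenize("Barbie", toy_vocab))
```

The `lowercase` flag mirrors the difference between the uncased and cased checkpoints seen in the two outputs above.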


**token_type_ids**<br>
These IDs are used to distinguish between different sequences in tasks that involve multiple sentences, such as question-answering and sentence-pair classification. BERT uses this mechanism to understand which tokens belong to which segment. For single-sequence tasks like sentiment analysis, token_type_ids are all zeros.
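
For a sentence pair, BERT-style inputs take the form `[CLS] A [SEP] B [SEP]`, with segment 0 covering the first sentence and segment 1 the second. A minimal sketch, assuming the special-token IDs 101 and 102 seen in the encoded outputs above; `encode_pair` is a hypothetical helper standing in for what the tokenizer does when it is given two texts.

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] IDs, as seen in the encoded outputs above

def encode_pair(ids_a, ids_b):
    """Build input_ids and token_type_ids for a [CLS] A [SEP] B [SEP] input."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0: [CLS], sentence A, first [SEP]. Segment 1: sentence B, final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return {"input_ids": input_ids, "token_type_ids": token_type_ids}

# IDs for "i was" and "not happy", taken from the bert-base-uncased output above.
encoded = encode_pair([1045, 2001], [2025, 3407])
print(encoded)
```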

**attention_mask** <br>
The attention mask is used to differentiate between actual tokens and padding tokens (if any). It helps the model focus on non-padding tokens and ignore padding tokens. A value of 1 indicates that the token should be attended to, while a value of 0 indicates padding.

**Why Padding Tokens Are Used**<br>
- **Uniform sequence length:** Deep learning models typically process input data in batches. To process these batches efficiently, all sequences in a batch must have the same length. Padding tokens ensure this by extending shorter sequences to match the length of the longest sequence in the batch.
- **Efficient computation:** Fixed-length sequences allow for more efficient use of hardware resources, as the model can process all sequences in parallel without needing to handle variable-length sequences individually.
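
The padding and masking described above can be sketched in a few lines. This is an illustration rather than the tokenizer's own code: `pad_batch` is a hypothetical helper, and the pad ID of 0 is an assumption that matches BERT's vocabularies.

```python
PAD_ID = 0  # assumption: 0 is the [PAD] ID, as in BERT's vocabularies

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([
    [101, 1045, 2001, 2061, 2025, 3407, 102],  # "i was so not happy"
    [101, 3407, 102],                          # "happy"
])
for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(ids, mask)
```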



# Fine-Tuning on IMDB
In this section, we fine-tune a pre-trained language model using the IMDB dataset for sentiment analysis. The IMDB dataset is a well-known collection of movie reviews used for training and evaluating sentiment classifiers. By fine-tuning on this dataset, we adapt a general-purpose model to better understand and classify movie reviews as positive or negative.

This process demonstrates the practical application of transfer learning to enhance a model's ability to handle domain-specific sentiment analysis tasks.

## Step 1: Install Necessary Libraries

In [6]:
pip install datasets



## Step 2: Load and Prepare the Dataset

In [None]:
from datasets import load_dataset
dataset = load_dataset('imdb')

In [None]:
dataset

## Step 3: Preprocess the Data
Tokenize the dataset using the tokenizer associated with the pre-trained model.

In [None]:
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)


In [None]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 50000
    })
})

## Step 4: Set Up the Training Arguments
Specify the hyperparameters and training settings.

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # Output directory
    eval_strategy="epoch",           # Evaluate every epoch
    learning_rate=2e-5,              # Learning rate
    per_device_train_batch_size=16,  # Batch size for training
    per_device_eval_batch_size=16,   # Batch size for evaluation
    num_train_epochs=3,              # Number of training epochs
    weight_decay=0.01,               # Strength of weight decay
)
training_args

TrainingArguments(
_n_gpu=0,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=epoch,
evaluation_strategy=None,
fp16=False,
fp16_backend=auto,
fp
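
To get a feel for what these settings imply, the sketch below estimates the number of optimizer steps for IMDB's 25,000 training reviews and traces a linear learning-rate decay from `2e-5` to zero. It assumes a single device, no gradient accumulation, and a linear schedule without warmup; both function names are hypothetical helpers.

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps on a single device without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * num_epochs

def linear_lr(step, total_steps, base_lr, warmup_steps=0):
    """Linear warmup to base_lr, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

total = total_training_steps(num_examples=25_000, batch_size=16, num_epochs=3)
print(total)  # 1563 steps per epoch * 3 epochs = 4689
for step in (0, total // 2, total):
    print(step, linear_lr(step, total, base_lr=2e-5))
```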

## Step 5: Initialize the Model
Load the pre-trained model and define the training procedure.

In [None]:
from transformers import AutoModelForSequenceClassification, Trainer

# Load the pre-trained model
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test']
)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Step 6: Train the Model
Fine-tune the pre-trained model on your specific dataset.

In [None]:
# Train the model
trainer.train()

## Step 7: Evaluate the Model
Assess the model's performance on the IMDB test split, which was passed to the `Trainer` as `eval_dataset`.

In [None]:
# Evaluate the model
results = trainer.evaluate()
print(results)
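
Because the `Trainer` in Step 5 was created without a metrics function, `evaluate()` reports the loss and timing statistics but no accuracy. One possible addition is sketched below: a function that turns logits and labels into an accuracy score. `compute_accuracy` is a hypothetical helper; the assumption is that it would be passed as `compute_metrics=compute_accuracy` when constructing the `Trainer`, which then calls it with a `(logits, labels)` pair.

```python
def compute_accuracy(eval_pred):
    """Accuracy from (logits, labels); a hypothetical helper for illustration."""
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row.
    predictions = [max(range(len(row)), key=lambda k: row[k]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Toy logits for four reviews (columns: negative, positive) and their true labels.
toy_logits = [[2.1, -1.0], [0.3, 1.7], [-0.5, 0.2], [1.2, 0.9]]
toy_labels = [0, 1, 0, 0]
print(compute_accuracy((toy_logits, toy_labels)))
```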

## Step 8: Save the Fine-Tuned Model
Save the fine-tuned model for later use.

In [None]:
# Save the model
model.save_pretrained('./fine-tuned-model')
tokenizer.save_pretrained('./fine-tuned-model')


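# Fine-Tuning on PubMed

In this section, we prepare to fine-tune a biomedical model on labeled PubMed text. If you don't have a labeled dataset of your own, you can use a public one such as PubMed 200k RCT (Randomized Controlled Trials), in which each abstract sentence is labeled with its role: background, objective, methods, results, or conclusions.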

In [None]:
from datasets import load_dataset

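# Load the PubMed RCT dataset.
# NOTE: the identifier and configuration name below are kept as written; confirm
# the exact dataset repository name on the Hugging Face Hub before running.
dataset = load_dataset('pubmed-rct', 'pubmed-rct-200k')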

In [None]:
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('dmis-lab/biobert-v1.1')

# Text Summarization NLP Project using the arXiv API
In this section, we build a text summarization project on top of the arXiv API, which gives access to a large number of research papers, including those on artificial intelligence and machine learning. We fetch recent papers, extract their abstracts, and use NLP techniques to generate concise summaries that give a quick view of the latest advances in AI research.

In [None]:
pip install arxiv

In [None]:
import arxiv
import pandas as pd

In [None]:
# Query to fetch AI-related papers
query = 'ai OR artificial intelligence OR machine learning'
search = arxiv.Search(query=query, max_results=10, sort_by=arxiv.SortCriterion.SubmittedDate)

# Fetch papers through a client; Search.results() is deprecated in recent
# versions of the arxiv package
client = arxiv.Client()
papers = []
for result in client.results(search):
    papers.append({
        'published': result.published,
        'title': result.title,
        'abstract': result.summary,
        'categories': result.categories
    })

# Convert to DataFrame
df = pd.DataFrame(papers)

# Show full abstracts instead of truncated cells
pd.set_option('display.max_colwidth', None)
df.head(10)


index,published,title,abstract,categories
0,2024-07-22 17:59:46+00:00,WayEx: Waypoint Exploration using a Single Demonstration,"We propose WayEx, a new method for learning complex goal-conditioned robotics\ntasks from a single demonstration. Our approach distinguishes itself from\nexisting imitation learning methods by demanding fewer expert examples and\neliminating the need for information about the actions taken during the\ndemonstration. This is accomplished by introducing a new reward function and\nemploying a knowledge expansion technique. We demonstrate the effectiveness of\nWayEx, our waypoint exploration strategy, across six diverse tasks, showcasing\nits applicability in various environments. Notably, our method significantly\nreduces training time by 50% as compared to traditional reinforcement learning\nmethods. WayEx obtains a higher reward than existing imitation learning methods\ngiven only a single demonstration. Furthermore, we demonstrate its success in\ntackling complex environments where standard approaches fall short. More\ninformation is available at: https://waypoint-ex.github.io.","[cs.RO, cs.AI]"
1,2024-07-22 17:59:45+00:00,LLMmap: Fingerprinting For Large Language Models,"We introduce LLMmap, a first-generation fingerprinting attack targeted at\nLLM-integrated applications. LLMmap employs an active fingerprinting approach,\nsending carefully crafted queries to the application and analyzing the\nresponses to identify the specific LLM model in use. With as few as 8\ninteractions, LLMmap can accurately identify LLMs with over 95% accuracy. More\nimportantly, LLMmap is designed to be robust across different application\nlayers, allowing it to identify LLMs operating under various system prompts,\nstochastic sampling hyperparameters, and even complex generation frameworks\nsuch as RAG or Chain-of-Thought.","[cs.CR, cs.AI]"
2,2024-07-22 17:59:10+00:00,Reconstructing Training Data From Real World Models Trained with Transfer Learning,"Current methods for reconstructing training data from trained classifiers are\nrestricted to very small models, limited training set sizes, and low-resolution\nimages. Such restrictions hinder their applicability to real-world scenarios.\nIn this paper, we present a novel approach enabling data reconstruction in\nrealistic settings for models trained on high-resolution images. Our method\nadapts the reconstruction scheme of arXiv:2206.07758 to real-world scenarios --\nspecifically, targeting models trained via transfer learning over image\nembeddings of large pre-trained models like DINO-ViT and CLIP. Our work employs\ndata reconstruction in the embedding space rather than in the image space,\nshowcasing its applicability beyond visual data. Moreover, we introduce a novel\nclustering-based method to identify good reconstructions from thousands of\ncandidates. This significantly improves on previous works that relied on\nknowledge of the training set to identify good reconstructed images. Our\nfindings shed light on a potential privacy risk for data leakage from models\ntrained using transfer learning.","[cs.LG, cs.AI, cs.CR, cs.CV]"
3,2024-07-22 17:59:01+00:00,CarFormer: Self-Driving with Learned Object-Centric Representations,"The choice of representation plays a key role in self-driving. Bird's eye\nview (BEV) representations have shown remarkable performance in recent years.\nIn this paper, we propose to learn object-centric representations in BEV to\ndistill a complex scene into more actionable information for self-driving. We\nfirst learn to place objects into slots with a slot attention model on BEV\nsequences. Based on these object-centric representations, we then train a\ntransformer to learn to drive as well as reason about the future of other\nvehicles. We found that object-centric slot representations outperform both\nscene-level and object-level approaches that use the exact attributes of\nobjects. Slot representations naturally incorporate information about objects\nfrom their spatial and temporal context such as position, heading, and speed\nwithout explicitly providing it. Our model with slots achieves an increased\ncompletion rate of the provided routes and, consequently, a higher driving\nscore, with a lower variance across multiple runs, affirming slots as a\nreliable alternative in object-centric approaches. Additionally, we validate\nour model's performance as a world model through forecasting experiments,\ndemonstrating its capability to predict future slot representations accurately.\nThe code and the pre-trained models can be found at\nhttps://kuis-ai.github.io/CarFormer/.","[cs.CV, cs.AI, cs.RO]"
4,2024-07-22 17:59:01+00:00,HandDGP: Camera-Space Hand Mesh Prediction with Differentiable Global Positioning,"Predicting camera-space hand meshes from single RGB images is crucial for\nenabling realistic hand interactions in 3D virtual and augmented worlds.\nPrevious work typically divided the task into two stages: given a cropped image\nof the hand, predict meshes in relative coordinates, followed by lifting these\npredictions into camera space in a separate and independent stage, often\nresulting in the loss of valuable contextual and scale information. To prevent\nthe loss of these cues, we propose unifying these two stages into an end-to-end\nsolution that addresses the 2D-3D correspondence problem. This solution enables\nback-propagation from camera space outputs to the rest of the network through a\nnew differentiable global positioning module. We also introduce an image\nrectification step that harmonizes both the training dataset and the input\nimage as if they were acquired with the same camera, helping to alleviate the\ninherent scale-depth ambiguity of the problem. We validate the effectiveness of\nour framework in evaluations against several baselines and state-of-the-art\napproaches across three public benchmarks.","[cs.CV, cs.AI, cs.LG, cs.RO]"
5,2024-07-22 17:57:59+00:00,QueST: Self-Supervised Skill Abstractions for Learning Continuous Control,"Generalization capabilities, or rather a lack thereof, is one of the most\nimportant unsolved problems in the field of robot learning, and while several\nlarge scale efforts have set out to tackle this problem, unsolved it remains.\nIn this paper, we hypothesize that learning temporal action abstractions using\nlatent variable models (LVMs), which learn to map data to a compressed latent\nspace and back, is a promising direction towards low-level skills that can\nreadily be used for new tasks. Although several works have attempted to show\nthis, they have generally been limited by architectures that do not faithfully\ncapture shareable representations. To address this we present Quantized Skill\nTransformer (QueST), which learns a larger and more flexible latent encoding\nthat is more capable of modeling the breadth of low-level skills necessary for\na variety of tasks. To make use of this extra flexibility, QueST imparts causal\ninductive bias from the action sequence data into the latent space, leading to\nmore semantically useful and transferable representations. We compare to\nstate-of-the-art imitation learning and LVM baselines and see that QueST's\narchitecture leads to strong performance on several multitask and few-shot\nlearning benchmarks. Further results and videos are available at\nhttps://quest-model.github.io/",[cs.RO]
6,2024-07-22 17:57:12+00:00,Importance Sampling-Guided Meta-Training for Intelligent Agents in Highly Interactive Environments,"Training intelligent agents to navigate highly interactive environments\npresents significant challenges. While guided meta reinforcement learning (RL)\napproach that first trains a guiding policy to train the ego agent has proven\neffective in improving generalizability across various levels of interaction,\nthe state-of-the-art method tends to be overly sensitive to extreme cases,\nimpairing the agents' performance in the more common scenarios. This study\nintroduces a novel training framework that integrates guided meta RL with\nimportance sampling (IS) to optimize training distributions for navigating\nhighly interactive driving scenarios, such as T-intersections. Unlike\ntraditional methods that may underrepresent critical interactions or\noveremphasize extreme cases during training, our approach strategically adjusts\nthe training distribution towards more challenging driving behaviors using IS\nproposal distributions and applies the importance ratio to de-bias the result.\nBy estimating a naturalistic distribution from real-world datasets and\nemploying a mixture model for iterative training refinements, the framework\nensures a balanced focus across common and extreme driving scenarios.\nExperiments conducted with both synthetic dataset and T-intersection scenarios\nfrom the InD dataset demonstrate not only accelerated training but also\nimprovement in agent performance under naturalistic conditions, showcasing the\nefficacy of combining IS with meta RL in training reliable autonomous agents\nfor highly interactive navigation tasks.","[cs.RO, cs.AI]"
7,2024-07-22 17:54:41+00:00,Towards Latent Masked Image Modeling for Self-Supervised Visual Representation Learning,"Masked Image Modeling (MIM) has emerged as a promising method for deriving\nvisual representations from unlabeled image data by predicting missing pixels\nfrom masked portions of images. It excels in region-aware learning and provides\nstrong initializations for various tasks, but struggles to capture high-level\nsemantics without further supervised fine-tuning, likely due to the low-level\nnature of its pixel reconstruction objective. A promising yet unrealized\nframework is learning representations through masked reconstruction in latent\nspace, combining the locality of MIM with the high-level targets. However, this\napproach poses significant training challenges as the reconstruction targets\nare learned in conjunction with the model, potentially leading to trivial or\nsuboptimal solutions.Our study is among the first to thoroughly analyze and\naddress the challenges of such framework, which we refer to as Latent MIM.\nThrough a series of carefully designed experiments and extensive analysis, we\nidentify the source of these challenges, including representation collapsing\nfor joint online/target optimization, learning objectives, the high region\ncorrelation in latent space and decoding conditioning. By sequentially\naddressing these issues, we demonstrate that Latent MIM can indeed learn\nhigh-level representations while retaining the benefits of MIM models.","[cs.CV, cs.AI]"
8,2024-07-22 17:51:53+00:00,dMel: Speech Tokenization made Simple,"Large language models have revolutionized natural language processing by\nleveraging self-supervised pretraining on vast textual data. Inspired by this\nsuccess, researchers have investigated complicated speech tokenization methods\nto discretize continuous speech signals so that language modeling techniques\ncan be applied to speech data. However, existing approaches either model\nsemantic tokens, potentially losing acoustic information, or model acoustic\ntokens, risking the loss of semantic information. Having multiple token types\nalso complicates the architecture and requires additional pretraining. Here we\nshow that discretizing mel-filterbank channels into discrete intensity bins\nproduces a simple representation (dMel), that performs better than other\nexisting speech tokenization methods. Using a transformer decoder-only\narchitecture for speech-text modeling, we comprehensively evaluate different\nspeech tokenization methods on speech recognition (ASR), speech synthesis\n(TTS). Our results demonstrate the effectiveness of dMel in achieving high\nperformance on both tasks within a unified framework, paving the way for\nefficient and effective joint modeling of speech and text.","[cs.CL, cs.AI, cs.SD, eess.AS]"
9,2024-07-22 17:50:31+00:00,NV-Retriever: Improving text embedding models with effective hard-negative mining,"Text embedding models have been popular for information retrieval\napplications such as semantic search and Question-Answering systems based on\nRetrieval-Augmented Generation (RAG). Those models are typically Transformer\nmodels that are fine-tuned with contrastive learning objectives. Many papers\nintroduced new embedding model architectures and training approaches, however,\none of the key ingredients, the process of mining negative passages, remains\npoorly explored or described. One of the challenging aspects of fine-tuning\nembedding models is the selection of high quality hard-negative passages for\ncontrastive learning. In this paper we propose a family of positive-aware\nmining methods that leverage the positive relevance score for more effective\nfalse negatives removal. We also provide a comprehensive ablation study on\nhard-negative mining methods over their configurations, exploring different\nteacher and base models. We demonstrate the efficacy of our proposed methods by\nintroducing the NV-Retriever-v1 model, which scores 60.9 on MTEB Retrieval\n(BEIR) benchmark and 0.65 points higher than previous methods. The model placed\n1st when it was published to MTEB Retrieval on July 07, 2024.","[cs.IR, cs.AI]"


In [None]:
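# Take the first abstract returned by the arXiv API query
abstract = df['abstract'].iloc[0]

# Load a summarization pipeline backed by the BART model fine-tuned on CNN/DailyMail
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Summarize the abstract; the result is a list with one dict per input text
summarization_result = summarizer(abstract)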

In [None]:
summarization_result[0]['summary_text']

'We propose WayEx, a new method for learning complex goal-conditioned robotics tasks from a single demonstration. WayEx obtains a higher reward than existing imitation learning methods and significantly reduces training time by 50% as compared to traditional reinforcement learning methods. We demonstrate its success in tackling complex environments where standard approaches fall short.'
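
The cell above summarizes a single abstract. A minimal sketch of how the same step might be applied to every abstract in the table follows. The helper names (`clean_abstract`, `summarize_all`), the stand-in summarizer, and the sample strings are all hypothetical and are not part of the original notebook. The stand-in only mimics the output shape of the Hugging Face summarization pipeline (a list of dicts with a `summary_text` key) so the sketch runs without downloading a model.

```python
import re


def clean_abstract(text):
    """Collapse hard line breaks and repeated whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def summarize_all(abstracts, summarizer, **kwargs):
    """Run `summarizer` on each abstract and return the summary strings.

    `summarizer` is any callable returning a list of dicts with a
    'summary_text' key, the shape the summarization pipeline returns.
    Extra keyword arguments are passed through to the summarizer.
    """
    summaries = []
    for text in abstracts:
        result = summarizer(clean_abstract(text), **kwargs)
        summaries.append(result[0]["summary_text"])
    return summaries


# Hypothetical stand-in: returns the first sentence instead of running a model.
# In the notebook, pass the real `summarizer` pipeline object instead.
def first_sentence_summarizer(text, **kwargs):
    first = text.split(". ")[0].rstrip(".") + "."
    return [{"summary_text": first}]


# Made-up sample strings with hard line breaks, as in the arXiv abstracts.
sample_abstracts = [
    "This paper studies automatic\nsummarization of abstracts. It reports results.",
    "We describe a small retrieval\nsystem. It is evaluated on two tasks.",
]

summaries = summarize_all(sample_abstracts, first_sentence_summarizer)
for summary in summaries:
    print(summary)
```

In the notebook this would presumably be called as `summarize_all(df['abstract'].tolist(), summarizer, max_length=60, min_length=20)`, assuming the `df` and `summarizer` objects from the earlier cells; the length values here are illustrative, not tuned. Running a large model over every row can be slow on CPU, so trying it on a few rows first is a reasonable precaution.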