# Zero-shot Learning

Recently, the NLP research community has been paying increasing attention to zero-shot and few-shot applications, as in the OpenAI paper introducing GPT-3. This [blog post](https://joeddav.github.io/blog/2020/05/29/ZSL.html) shows how 🤗 Transformers can be used for zero-shot topic classification, the task of assigning a text to a topic that the model was never explicitly trained to predict.

In [1]:
import requests
import warnings
import string
import joblib
import multiprocessing
import torch
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
from collections import defaultdict
from transformers import AutoTokenizer, AutoModel
from transformers import BertTokenizer
from transformers import BertModel
from torch.nn import functional as F


warnings.filterwarnings("ignore")

In [2]:
def load_tweets(tweets_file="../data/preprocessed_tweet_20201619.csv", 
                from_date="2017-01-01", 
                to_date="2020-06-01", 
                count=10):
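    """
    Load the preprocessed tweets, keep a date range, and drop short tweets.

    Parameters: 
        tweets_file: str (path to the preprocessed tweets CSV)
        from_date: str (start of the date range, e.g. "2017-01-01")
        to_date: str (end of the date range)
        count: int (keep only rows whose clean_tweet contains more than `count` whitespace runs, i.e. roughly more than count + 1 words)
    """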
    cols = ["date", "time", "username", "tweet", "clean_tweet", "hashtags", 
            "likes_count", "replies_count", "retweets_count", "slang_count"]
    df = pd.read_csv(tweets_file, usecols=cols)
    print("# of total tweets: {}".format(df.shape[0]))
    df.sort_values(by="date", ascending=True, inplace=True)
    df.set_index('date', inplace=True)
    df = df.loc[from_date:to_date]
    df.reset_index(drop=False, inplace=True)
    df.drop_duplicates(inplace=True)
    # Raw string so that \s is read as a regex token, not a Python string escape
    df = df[df.clean_tweet.str.count(r'\s+').gt(count)]
    print("There are {} tweets we get.".format(df.shape[0]))
    return df

# Latent Embedding Approach
A common approach to zero-shot learning in the computer vision setting is to use an existing featurizer to embed an image and any possible class names into their corresponding latent representations (e.g. Socher et al. 2013). Because the image and label embeddings come from different models, they then have to be aligned, typically by learning a projection from labeled training data. In the text domain, we have the advantage that we can trivially use a single model to embed both the data and the class names into the same space, eliminating the need for this data-hungry alignment step. We therefore ran some experiments with Sentence-BERT, a technique that fine-tunes the pooled BERT sequence representations for increased semantic richness, as a method for obtaining sequence and label embeddings.
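
Written as a formula, the classification rule used in this section picks the label whose embedding is closest to the embedding of the text. The symbols are notation assumed here for illustration and do not appear in the notebook: $\Phi$ is the shared sentence embedding model, $x$ is the input text, and $C$ is the set of candidate label names.

$$
\hat{c} = \arg\max_{c \in C} \cos\big(\Phi(x), \Phi(c)\big)
$$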

## Sentence-BERT
Here's an example code snippet showing how this can be done using Sentence-BERT as our embedding model.

In [3]:
class SentenceBert():
    """
    A common approach to zero shot learning using Sentence-BERT.
    Reference from https://joeddav.github.io/blog/2020/05/29/ZSL.html
    """
    def __init__(self):
        self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        self.tokenizer = AutoTokenizer.from_pretrained('deepset/sentence_bert')
        self.model = AutoModel.from_pretrained('deepset/sentence_bert')
        self.model = self.model.to(self.device)
        
    def get_similarity(self, sentence, labels):
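        """
        Score each candidate label by cosine similarity to the sentence.

        Parameters:
            sentence: str
            labels: list of str (candidate label names)
        Returns:
            dict mapping each label to its cosine similarity with the sentence
        """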
        # Run inputs through model and mean-pool over the sequence dimension to get sequence-level representations
        # pad_to_max_length is deprecated; padding=True pads to the longest sequence in the batch
        inputs = self.tokenizer(
            [sentence] + labels,
            return_tensors='pt',
            padding=True)
        input_ids = inputs['input_ids'].to(self.device)
        attention_mask = inputs['attention_mask'].to(self.device)
        with torch.no_grad():
            output = self.model(input_ids, attention_mask=attention_mask)[0]
        sentence_rep = output[:1].mean(dim=1)
        label_reps = output[1:].mean(dim=1)
    
        # Now find the labels with the highest cosine similarities to the sentence
        similarities = F.cosine_similarity(sentence_rep, label_reps)
        closest = similarities.argsort(descending=True)
        
        # Map each label to its similarity; insertion order runs from most to least similar
        sim_dict = {}
        for ind in closest.tolist():
            sim_dict[labels[ind]] = similarities[ind].item()
            
        return sim_dict
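
One detail worth checking in `get_similarity`: `output[...].mean(dim=1)` averages over every position, including any padding added to make the batch rectangular, so short label sequences can be diluted by padding vectors. Below is a minimal, dependency-free sketch of padding-aware mean pooling. The helper name `masked_mean_pool` and the toy vectors are illustrative assumptions, not part of the notebook.

```python
from typing import List


def masked_mean_pool(token_vectors: List[List[float]],
                     attention_mask: List[int]) -> List[float]:
    """Average token vectors over real tokens only (mask == 1), ignoring padding."""
    if len(token_vectors) != len(attention_mask):
        raise ValueError("token_vectors and attention_mask must have the same length")
    # Keep only the positions that the attention mask marks as real tokens
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy sequence: two real tokens followed by one padding position
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(tokens, mask)  # averages only the first two rows
```

With tensors, the same idea is usually written by multiplying the token outputs by the attention mask, summing over the sequence dimension, and dividing by the number of unmasked tokens.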

In [4]:
df = load_tweets(from_date="2017-01-01", to_date="2020-06-17")
df = df[["date", "clean_tweet"]]

SB = SentenceBert()
for index, row in tqdm(df.iterrows(), total=df.shape[0]):
    sim_dict = SB.get_similarity(row["clean_tweet"], ['forex', 'finance', 'politics'])
    df.loc[index, 'forex'] = sim_dict["forex"]
    df.loc[index, 'finance'] = sim_dict["finance"]
    df.loc[index, 'politics'] = sim_dict["politics"]

# of total tweets: 1297358
There are 282228 tweets we get.


In [7]:
joblib.dump(df, "../data/tweets_zero_shot_df.gzip", compress=3)

['../data/tweets_zero_shot_df.gzip']

# Natural Language Inference
We will now explore an alternative method that does more than embed sequences and labels into the same latent space where their distance can be measured: it can tell us something about the compatibility of two distinct sequences out of the box. As a quick review, natural language inference (NLI) considers two sentences: a "premise" and a "hypothesis". The task is to determine whether the hypothesis is true (entailment), false (contradiction), or undetermined (neutral) given the premise.

## BART
BART is a sequence-to-sequence model pretrained with a denoising objective. The approach proposed by Yin et al. (2019) uses a pre-trained MNLI sequence-pair classifier as an out-of-the-box zero-shot text classifier, and it works quite well in practice. The idea is to take the sequence we're interested in labeling as the "premise" and to turn each candidate label into a "hypothesis". If the NLI model predicts that the premise "entails" the hypothesis, we take the label to be true. Here is a [demo](https://huggingface.co/zero-shot/) built by Hugging Face.
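
The two steps around the NLI model can be sketched without loading any model: turning labels into hypotheses, and turning the model's logits into a probability that a label is true. This is a minimal sketch under stated assumptions. The template `"This text is about {}."`, the function names, and the example logits are all hypothetical, and dropping the neutral logit before the softmax is one common choice rather than the only option.

```python
import math
from typing import Dict, List


def build_hypotheses(labels: List[str],
                     template: str = "This text is about {}.") -> Dict[str, str]:
    """Turn each candidate label into an NLI hypothesis sentence."""
    return {label: template.format(label) for label in labels}


def entailment_probability(contradiction_logit: float,
                           entailment_logit: float) -> float:
    """Softmax over the contradiction and entailment logits only.

    The neutral logit is deliberately left out (an assumption of this sketch).
    """
    # Subtract the maximum first for numerical stability
    m = max(contradiction_logit, entailment_logit)
    exp_c = math.exp(contradiction_logit - m)
    exp_e = math.exp(entailment_logit - m)
    return exp_e / (exp_c + exp_e)


hypotheses = build_hypotheses(["forex", "finance", "politics"])

# Hypothetical logits for one (premise, hypothesis) pair, not real model output
p_true = entailment_probability(contradiction_logit=-1.0, entailment_logit=2.0)
```

In a full pipeline, each tweet would be paired as the premise with every hypothesis, and the NLI classifier's logits for each pair would be passed to `entailment_probability`. Recent versions of 🤗 Transformers also package this recipe as the `zero-shot-classification` pipeline.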