# Zero-Shot Classification
> This notebook shows how to run zero-shot classification and question answering, using text from a LinkedIn dataset

In [1]:
#| default_exp zeroshot

In [2]:
#| hide
#!pip install transformers

In [3]:
#| hide
!pip install plotly

Collecting plotly
  Downloading plotly-5.11.0-py2.py3-none-any.whl (15.3 MB)
Collecting tenacity>=6.2.0
  Downloading tenacity-8.1.0-py3-none-any.whl (23 kB)
Installing collected packages: tenacity, plotly
Successfully installed plotly-5.11.0 tenacity-8.1.0


#### Imports required for this notebook

In [4]:
import torch  # torch.argmax is used in the question-answering cell
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from transformers import pipeline
from fastai.tabular.all import *
import plotly.express as px

## Create a classifier to run "zero-shot" classification
> Here we use the Hugging Face transformers `pipeline` with the pretrained "zero-shot-classification" task. With no model specified, it instantiates a pipeline that defaults to the facebook/bart-large-mnli model, which scores how relevant each candidate topic is to a given text.

In [5]:
classifier = pipeline("zero-shot-classification")

No model was supplied, defaulted to facebook/bart-large-mnli (https://huggingface.co/facebook/bart-large-mnli)
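
To make the scoring less of a black box, here is a minimal sketch of how a zero-shot pipeline turns labels into scores. Each candidate label is slotted into a hypothesis sentence (the default template in transformers is "This example is {}."), the NLI model scores whether the text entails each hypothesis, and in single-label mode the entailment logits are softmaxed across the labels. The logits below are made-up placeholders rather than model output, and `hypotheses_for` and `softmax` are illustrative helpers, not library functions.

```python
import math

def hypotheses_for(labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in labels]

def softmax(logits):
    """Normalize logits into scores that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["politics", "public health", "economics"]
hypotheses = hypotheses_for(labels)

# Hypothetical entailment logits, one per hypothesis (NOT real model output).
entailment_logits = [4.0, 0.1, 0.3]
scores = softmax(entailment_logits)

# Sort labels by score, highest first, as the pipeline output does.
ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```

Because the scores come from a softmax over the candidate labels, they are relative to that label set: adding or removing a label changes every score.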



### Topic Relevance Example 1
> How relevant is each of the following topics to a sample sentence: politics, public health, economics?

In [6]:
sequence = "Who are you voting for in 2022?"
candidate_labels = ["politics", "public health", "economics"]

print(classifier(sequence, candidate_labels))

{'sequence': 'Who are you voting for in 2022?', 'labels': ['politics', 'economics', 'public health'], 'scores': [0.9570426940917969, 0.023699436336755753, 0.019257841631770134]}
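
The result is a plain dictionary whose `labels` and `scores` lists are aligned and sorted from most to least relevant. A small sketch of reading it, using the output above copied in as a literal so it runs without the model (`label_scores` and `top_label` are names chosen here for illustration):

```python
# Result copied from the cell output above.
result = {
    "sequence": "Who are you voting for in 2022?",
    "labels": ["politics", "economics", "public health"],
    "scores": [0.9570426940917969, 0.023699436336755753, 0.019257841631770134],
}

# Pair each label with its score; the lists are already sorted high to low.
label_scores = dict(zip(result["labels"], result["scores"]))
top_label = result["labels"][0]

print(f"Top label: {top_label} ({label_scores[top_label]:.1%})")
print(f"Scores sum to {sum(result['scores']):.3f}")
```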


### Topic Relevance Example 2
> Apply the classifier to profiles from a LinkedIn dataset to see which categories are relevant to each person


Load the LinkedIn dataset

In [7]:
path = Path('..')
df = pd.read_csv(path/'linkedin.csv')

Create a column with all info to easily feed into the models

In [8]:
df.fillna('', inplace=True)  # replace every NaN with '' so the string concatenation below works
# build one field-labelled text column to feed into the models
df['input'] = ('description: ' + df.description + '; Experience: ' + df.Experience
               + '; Name: ' + df.Name + '; position: ' + df.position
               + '; location: ' + df.location + '; skills: ' + df.skills
               + '; clean_skills: ' + df.clean_skills)
df.input = df.input.astype(str)
df.input = df.input.str.replace('\\n', ' ', regex=False)  # replace literal backslash-n sequences with a space
print(str(df.input[0][:400]) + '...')

description: An experienced HR professional,  HR mentor and Coach , Talent advisory and HR strategist... see more; Experience: Senior Vice President & Head of HRCompany NameSamsung Electronics India LimitedDates EmployedJan 2018 – PresentEmployment Duration2 yrs 3 mosLocationGurgaon, Haryana, IndiaVice President Franchise capability building and business transformationCompany NameCoca-Cola India a...


Create a list of topics/categories. We'll use the model to check how relevant each topic/category is to each person. In this case, each person is a text entry from the column `input`.

In [9]:
candidate_labels = df.category.unique().tolist()
candidate_labels.extend(['US','India']); 'first five categories: '+str(candidate_labels[:5])

"first five categories: ['HR', 'Designing', 'Managment', 'Information Technology', 'Education']"

In [10]:
#| export
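def analyze_one(df:pd.DataFrame, # dataframe with df.input and df.category
                candidate_labels, # labels to score against the text
                index): # row label of the person to analyze
    "Classify one row of `df.input`, plot the label scores, and print the actual category"
    sequence = df.input[index]
    answer = classifier(sequence, candidate_labels) # uses the global `classifier` pipeline
    dfo = pd.DataFrame(answer)
    dfo.sort_values('scores',inplace=True) # ascending order for the horizontal bar chart
    fig = px.bar(dfo, x="scores", y="labels", orientation='h')
    print(dfo.sequence[0])
    print('Actual Category: '+str(df.category[index]))
    fig.show()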
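
`analyze_one` compares the prediction with the actual category by eye, one row at a time. A minimal sketch of doing the same comparison over several rows is shown below. It assumes results in the pipeline's output shape; the three results and the `actual` list are hypothetical placeholders, not model output, and `top_label` and `top1_accuracy` are helper names introduced here.

```python
def top_label(result):
    """Return the highest-scoring label from one classifier result."""
    best = max(range(len(result["scores"])), key=result["scores"].__getitem__)
    return result["labels"][best]

def top1_accuracy(results, actual):
    """Fraction of rows whose top label equals the actual category."""
    hits = sum(top_label(r) == a for r, a in zip(results, actual))
    return hits / len(actual)

# Hypothetical results in the pipeline's output shape (NOT real model output).
results = [
    {"labels": ["HR", "Education", "Designing"], "scores": [0.7, 0.2, 0.1]},
    {"labels": ["Education", "HR", "Designing"], "scores": [0.5, 0.3, 0.2]},
    {"labels": ["Designing", "HR", "Education"], "scores": [0.6, 0.3, 0.1]},
]
actual = ["HR", "HR", "Designing"]

print(f"Top-1 accuracy: {top1_accuracy(results, actual):.2f}")
```

Note that `candidate_labels` in this notebook also contains 'US' and 'India', which are not values of `category`, so a row whose top label is one of those would always count as a miss under this measure.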

In [11]:
# analyze_one(df,candidate_labels,index=0)

In [12]:
# analyze_one(df,candidate_labels,index=1000)

In [13]:
# analyze_one(df,candidate_labels,index=421)

In [14]:
# df.groupby('category').Name.count()

In [15]:
# df.loc[df.category=='Agricultural'].description

Try question answering:

In [16]:
idx = 4

In [17]:
text = df.input[idx][:(int(.75 *len(df.input[idx])))]
text

"description: Over 18 Years of experience in IT /ITES  / BPO with leading global OrganizationsHave a passion for working on great products, enthusiastic about #UserExperience #SaaS #HRTech #Bots #Io...\n            see more; Experience: Company NameEXLTotal Duration6 yrs 4 mosTitleVice President - Head of Digital HR Technologies and HR Operations/ shared servicesDates EmployedJul 2018 – PresentEmployment Duration1 yr 9 mosLocationNoida Area, IndiaHave a passion for working on great products, enthusiastic about #UserExperience #SaaS #HRTech #Bots #IoT #Gadgets, #Mobileapps, #ERP... Strong experience in managing Transformative Business HR IT initiatives in a Global Shared Service environmentTitleSenior Assistant Vice President - Human ResourcesDates EmployedDec 2013 – Jun 2018Employment Duration4 yrs 7 mosTitleVice President - Head of Digital HR Technologies and HR Operations/ shared servicesDates EmployedJul 2018 – PresentEmployment Duration1 yr 9 mosLocationNoida Area, IndiaHave a pass

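The slice above keeps the first 75% of the characters. The notebook does not say why; one plausible reason (an assumption) is to keep the input short enough for the question-answering model, since BERT models accept at most 512 tokens. An alternative that does not discard the tail is to split the text into overlapping windows and query each one. The sketch below counts whitespace-separated words, which only approximates the tokenizer's token count, and `sliding_windows` with its `size` and `overlap` defaults is a hypothetical helper.

```python
def sliding_windows(text, size=200, overlap=50):
    """Split text into overlapping chunks of whitespace-separated words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks

sample = " ".join(f"word{i}" for i in range(500))  # stand-in for df.input[idx]
chunks = sliding_windows(sample, size=200, overlap=50)
print(len(chunks), "chunks")
```

The overlap reduces the chance that an answer is cut in half at a window boundary.
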
In [18]:
# tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
# model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

# text = text

# questions = ["What is the person's name?", "What companies has this person worked with?", "Do they work in HR?"
# ]

# for question in questions:
#     inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
#     input_ids = inputs["input_ids"].tolist()[0]

#     outputs = model(**inputs)
#     answer_start_scores = outputs.start_logits
#     answer_end_scores = outputs.end_logits

#     # Get the most likely beginning of answer with the argmax of the score
#     answer_start = torch.argmax(answer_start_scores)
#     # Get the most likely end of answer with the argmax of the score
#     answer_end = torch.argmax(answer_end_scores) + 1

#     answer = tokenizer.convert_tokens_to_string(
#         tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
#     )

#     print(f"Question: {question}")
#     print(f"Answer: {answer}")

In [19]:

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

text = r"""
🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""

questions = [
    "How many pretrained models are available in 🤗 Transformers?",
    "What does 🤗 Transformers provide?",
    "🤗 Transformers provides interoperability between which frameworks?",
]

for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]

    outputs = model(**inputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits

    # Get the most likely beginning of answer with the argmax of the score
    answer_start = torch.argmax(answer_start_scores)
    # Get the most likely end of answer with the argmax of the score
    answer_end = torch.argmax(answer_end_scores) + 1

    answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
    )

    print(f"Question: {question}")
    print(f"Answer: {answer}")


Question: How many pretrained models are available in 🤗 Transformers?
Answer: over 32 +
Question: What does 🤗 Transformers provide?
Answer: general - purpose architectures
Question: 🤗 Transformers provides interoperability between which frameworks?
Answer: tensorflow 2. 0 and pytorch
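
The loop above picks the answer by taking the argmax of the start logits and the argmax of the end logits independently. The sketch below reproduces that selection step in plain Python on hypothetical tokens and logits (not model output), and adds one guard that the cell above does not have: if the best end position falls before the best start position, the slice is empty, so the helper returns an empty string explicitly. `best_span` is an illustrative name.

```python
def best_span(tokens, start_logits, end_logits):
    """Pick the answer span from independent argmaxes of start and end logits."""
    start = max(range(len(start_logits)), key=start_logits.__getitem__)
    end = max(range(len(end_logits)), key=end_logits.__getitem__) + 1  # exclusive end
    if end <= start:
        return ""  # the two argmaxes disagree, so there is no valid span
    return " ".join(tokens[start:end])

# Hypothetical tokens and logits (NOT real model output).
tokens = ["[CLS]", "which", "frameworks", "?", "[SEP]",
          "tensorflow", "and", "pytorch", "[SEP]"]
start_logits = [0.1, 0.0, 0.2, 0.0, 0.1, 5.0, 0.3, 1.0, 0.1]
end_logits = [0.1, 0.0, 0.1, 0.0, 0.1, 0.5, 0.2, 6.0, 0.1]

print(best_span(tokens, start_logits, end_logits))
```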
