<a href="https://colab.research.google.com/github/elliemci/MelanomaDetection/blob/main/qa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# installs a very light version of Hugging Face Transformers, without a machine learning framework such as PyTorch or TensorFlow
#pip install transformers

In [None]:
# install Transformers together with the SentencePiece extra
!pip install transformers[sentencepiece]

In [3]:
from transformers import pipeline

Pipeline objects offer an API abstraction over task-specific models. The *pipeline* function is a wrapper around all the task-specific pipelines. Available pipelines:<br>

*   feature-extraction - vector representation of a text
*   fill-mask
*   ner - named entity recognition
*   question-answering
*   sentiment-analysis
*   summarization
*   text-generation
*   translation
*   zero-shot-classification

You can use the default model for a task or choose a task-specific model from the Model Hub at https://huggingface.co/models
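A minimal sketch of choosing a model explicitly instead of relying on the task default. The checkpoint name and revision are the ones the question-answering pipeline reports as its default in this notebook's output; the helper `build_qa_pipeline` and the constant names are illustrative assumptions. Nothing is downloaded until the helper is called.

```python
# Checkpoint and revision as reported in the "No model was supplied" message
# printed by the question-answering pipeline in this notebook.
QA_MODEL = "distilbert-base-cased-distilled-squad"
QA_REVISION = "626af31"


def build_qa_pipeline():
    """Create a question-answering pipeline pinned to a fixed checkpoint (hypothetical helper)."""
    # Imported lazily so that defining the helper does not load the library.
    from transformers import pipeline

    return pipeline("question-answering", model=QA_MODEL, revision=QA_REVISION)
```

Pinning both values is one way to address the "not recommended in production" warning that appears when no model is supplied.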

## DistilBERT transformer

**Extracting an answer from a text** <br>
Given a question and a context, the model extracts the answer to the question from the information provided in the context.

Three main steps happen when text is passed to a pipeline object:

1.   The text is preprocessed into a format the model understands
2.   The preprocessed inputs are passed to the model
3.   The model predictions are post-processed
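The three steps above can be sketched in plain Python. This is a toy illustration, not what the library does internally: the keyword weights stand in for a trained network, and every function name here is an assumption made for the example.

```python
import math


def preprocess(text):
    # Step 1: turn raw text into model inputs (here: lowercase word tokens).
    return text.lower().replace("!", " ").replace(".", " ").split()


def toy_model(tokens):
    # Step 2: produce raw scores (logits) for [NEGATIVE, POSITIVE].
    # Hypothetical keyword weights stand in for a trained network.
    weights = {"hate": (2.0, -2.0), "love": (-2.0, 2.0)}
    neg = sum(weights.get(t, (0.0, 0.0))[0] for t in tokens)
    pos = sum(weights.get(t, (0.0, 0.0))[1] for t in tokens)
    return [neg, pos]


def postprocess(logits):
    # Step 3: softmax the logits and report the most likely label.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    labels = ["NEGATIVE", "POSITIVE"]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}


def toy_pipeline(text):
    # Chain the three stages, as a pipeline object does for a real model.
    return postprocess(toy_model(preprocess(text)))


result = toy_pipeline("I hate that weather!")
print(result["label"])
```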



In [None]:
# instantiate a pipeline by passing the "question-answering" task name to pipeline()
qa = pipeline("question-answering")

qa(question="When was OpenAI API released?", context="OpenAI announced a multi-purpose API in June 2020.")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.9583989977836609, 'start': 40, 'end': 49, 'answer': 'June 2020'}
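A short sketch of how to read this result: `start` and `end` are character offsets into the context string, so slicing the context recovers the answer. The values are copied from the output above; the variable names are illustrative.

```python
context = "OpenAI announced a multi-purpose API in June 2020."
result = {"score": 0.9583989977836609, "start": 40, "end": 49, "answer": "June 2020"}

# "start" is inclusive and "end" is exclusive, matching Python slice semantics.
span = context[result["start"]:result["end"]]
print(span)
```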

**Classifying whole sentences**

In [9]:
classifier = pipeline("sentiment-analysis")
classifier(["I've been waiting for work like this my whole life.", "I hate that weather!"])

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9881021976470947},
 {'label': 'NEGATIVE', 'score': 0.9983065128326416}]
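The pipeline returns one result per input, in the same order as the inputs. A minimal sketch of pairing each sentence with its prediction, using the values from the output above (the variable names are illustrative):

```python
texts = [
    "I've been waiting for work like this my whole life.",
    "I hate that weather!",
]
results = [
    {"label": "POSITIVE", "score": 0.9881021976470947},
    {"label": "NEGATIVE", "score": 0.9983065128326416},
]

# Pair each input sentence with its prediction.
for text, res in zip(texts, results):
    print(f"{res['label']:8s} {res['score']:.3f}  {text}")

# Keep only the sentences classified as negative.
negatives = [t for t, r in zip(texts, results) if r["label"] == "NEGATIVE"]
```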

In [4]:
# classify text that has not been labeled, without any fine-tuning
classifier = pipeline("zero-shot-classification")

classifier("This is a course about the Transformers library",
           candidate_labels=["education", "politics", "business"],)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445952534675598, 0.11197689920663834, 0.04342782497406006]}
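In this output the labels come back sorted by score rather than in the order they were passed in, and the scores add up to roughly 1. A small sketch of picking the best label from the result, with values copied from above and illustrative variable names:

```python
output = {
    "sequence": "This is a course about the Transformers library",
    "labels": ["education", "business", "politics"],
    "scores": [0.8445952534675598, 0.11197689920663834, 0.04342782497406006],
}

# Labels and scores are parallel lists, so zip them into (label, score) pairs.
ranked = list(zip(output["labels"], output["scores"]))
top_label, top_score = max(ranked, key=lambda pair: pair[1])
total = sum(output["scores"])

print(top_label, round(top_score, 3))
```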

In [6]:
# text generation by auto-completing a prompt
generator = pipeline("text-generation")
generator("This course on Large Language Models teaches how to")

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'This course on Large Language Models teaches how to make language models that can be used with any object, such as languages or languages of the same type. The goal of the class is to create a new language model based on this model, making the models'}]
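The generated text includes the prompt itself. A minimal sketch of separating the continuation from the prompt, using the output above (the variable names are illustrative):

```python
prompt = "This course on Large Language Models teaches how to"
outputs = [
    {
        "generated_text": (
            "This course on Large Language Models teaches how to make language "
            "models that can be used with any object, such as languages or "
            "languages of the same type. The goal of the class is to create a "
            "new language model based on this model, making the models"
        )
    }
]

# Drop the prompt prefix so only the newly generated text remains.
continuations = [
    o["generated_text"][len(prompt):].strip()
    for o in outputs
    if o["generated_text"].startswith(prompt)
]
print(continuations[0])
```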

In [None]:
# using a specific model
generator = pipeline("text-generation", model="distilgpt2")
generator("This course on Large Language Models teaches how to",
           max_length=30,
           num_return_sequences=2)

In [None]:
# named entity recognition - the model finds which parts of the input text correspond
# to entities such as persons, locations or organizations; the option grouped_entities
# regroups the parts of the sentence that correspond to the same entity
ner = pipeline("ner", grouped_entities=True)
ner("Emma is a freshman in UW pre-med program participating in UW marching band and NCAA D1 women rowing team.")
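To illustrate what regrouping means, here is a simplified sketch that merges consecutive tokens sharing an entity type into one span. It is not the library's implementation: the B-/I- tag scheme, the sample tokens and tags, and the function `group_entities` are all assumptions made for the example.

```python
def group_entities(tokens, tags):
    """Merge consecutive tokens that share an entity type into one span."""
    groups = []
    current_type, current_words = None, []
    for token, tag in zip(tokens, tags):
        # Strip the B-/I- prefix to get the bare entity type; "O" means no entity.
        entity_type = None if tag == "O" else tag.split("-", 1)[1]
        starts_new = tag.startswith("B-") or entity_type != current_type
        if current_words and starts_new:
            # Close the span collected so far before starting a new one.
            groups.append({"entity_group": current_type, "word": " ".join(current_words)})
            current_words = []
        if entity_type is not None:
            current_words.append(token)
        current_type = entity_type
    if current_words:
        groups.append({"entity_group": current_type, "word": " ".join(current_words)})
    return groups


# Hypothetical tokens and tags, not produced by any model in this notebook.
tokens = ["Ada", "Lovelace", "visited", "New", "York"]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
grouped = group_entities(tokens, tags)
print(grouped)
```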

In [None]:
# Summarization
summarizer = pipeline("summarization", model="google/pegasus-xsum")

summarizer(""" Microsatellite Instability (MSI) is a key genomic biomarker in
                  colorectal cancer and about 15% of the overall CRC population
                  has this marker. Recent clinical trials have shown that MSI
                  phenotype has both prognostic and therapeutic importance,
                  especially with the recent approval of immune checkpoint
                  inhibitor (ICI) therapies. Patients whose tumors show MSI are
                  considered more likely to respond to ICI therapy and are
                  recommended for it. Conversely, ICI is not routinely
                  recommended for those with tumors that are microsatellite
                  stable (MSS). Many medical organizations such as the National
                  Institute for Health and Care Excellence (NICE) and the National
                  Comprehensive Cancer Network (NCCN), recommend universal screening
                  for MSI status of all newly diagnosed CRC. Prescreening tools
                  could streamline this process, reducing the pressure on
                  laboratory staff and resources. """)
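The triple-quoted text above carries the line breaks and indentation used to lay it out in the cell. A small optional sketch of collapsing that whitespace before passing text to a pipeline; it may not be needed for every tokenizer, and only the first sentence of the abstract is reproduced here.

```python
raw = """ Microsatellite Instability (MSI) is a key genomic biomarker in
                  colorectal cancer and about 15% of the overall CRC population
                  has this marker. """

# split() with no arguments breaks on any run of whitespace, including newlines,
# so joining with single spaces yields one clean line.
clean = " ".join(raw.split())
print(clean)
```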

## BioBERT model
To extract answers from biomedical text

In [8]:
# define the checkpoint used for both the model and the tokenizer
model = "ktrapeznikov/biobert_v1.1_pubmed_squad_v2"


# instantiate a pipeline object with qa task, model and tokenizer
qa_pipeline = pipeline("question-answering", model=model, tokenizer=model)

# define the context and the question
context = "Symptoms of COVID-19 are variable, but often include fever, cough, fatigue, breathing difficulties, and loss of smell and taste. Symptoms may begin one to fourteen days after exposure to the virus. At least a third of people who are infected do not develop noticeable symptoms.[9] Of those people who develop noticeable symptoms enough to be classed as patients, most (81%) develop mild to moderate symptoms (up to mild pneumonia), while 14% develop severe symptoms (dyspnea, hypoxia, or more than 50% lung involvement on imaging), and 5% suffer critical symptoms (respiratory failure, shock,or multiorgan dysfunction).[10] Older people are more likely to have severe symptoms. Some people continue to experience a range of effects—known as long COVID—for months after recovery, and damage to organs has been observed.[11] Multi-year studies are underway to further investigate the long-term effects of the disease."

question = "What are the symptoms of COVID-19?"

# use the pipeline to answer the question
answer = qa_pipeline({"context" : context,
                      "question" : question})


Some weights of the model checkpoint at ktrapeznikov/biobert_v1.1_pubmed_squad_v2 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [9]:
answer

{'score': 0.9045335650444031,
 'start': 53,
 'end': 127,
 'answer': 'fever, cough, fatigue, breathing difficulties, and loss of smell and taste'}
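The `score` field can be used to decide whether an extracted answer is trustworthy enough to show. A minimal sketch, assuming a cut-off of 0.5; both the threshold and the helper `accept_answer` are illustrative choices, not part of the library.

```python
def accept_answer(result, min_score=0.5):
    """Return the answer text, or None when the score is low or the answer is empty."""
    if result["score"] < min_score or not result["answer"].strip():
        return None
    return result["answer"]


# Values copied from the output above.
answer = {
    "score": 0.9045335650444031,
    "start": 53,
    "end": 127,
    "answer": "fever, cough, fatigue, breathing difficulties, and loss of smell and taste",
}

accepted = accept_answer(answer)
rejected = accept_answer(answer, min_score=0.95)
print(accepted)
```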