# The Key Terms for Wednesday

* `transformers`
* `pipeline`

# Open Source NLP

Everything we have done with NLP and ML this semester has been done using **open source**: open source data; open source models; open source code.

There is a thriving open source tradition in AI.

AI would not be as pervasive without open source.

Among the *benefits* of open source are transparency and reproducibility.

Among the *risks* of open source are reproducibility and security.

# Huggingface

[Huggingface](https://www.crunchbase.com/organization/hugging-face/) is a Brooklyn-based company that was founded just about the time the first transformer models for NLP became famous. Its business model is open source.

The huggingface staff:

* host transformer-related data, code, models, data sheets, model cards and applications
* help people more easily use transformers (for NLP, computer vision and other AI applications)
* help people more easily *fine tune* transformers (we will do that!)
* consult with companies on how to operationalize and scale their use of transformers

# `transformers`

Because of the huggingface `transformers` package, we can easily use transformers ourselves!

Let's make a transformers `pipeline` for sentiment analysis. Sentiment analysis is an NLP task that estimates the *polarity* (and sometimes the *strength*) of the sentiment communicated by a text.

In [1]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(["Summer in Maine is the best season.", "Spring is variable.", "The period between winter and spring is muddy; I hate mud."])

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9998225569725037},
 {'label': 'POSITIVE', 'score': 0.9882440567016602},
 {'label': 'NEGATIVE', 'score': 0.998120129108429}]

Well, we just used transformers, the most advanced NLP model type known today! 
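The warning printed with the output above recommends naming the model and revision explicitly rather than relying on the default. Here is a minimal sketch of that, assuming the model name and revision hash quoted in the warning are the ones you want to pin. `make_pinned_classifier` is a hypothetical helper name, and calling it downloads the model the first time, so it needs a network connection.

```python
# Model name and revision copied from the warning printed above.
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
MODEL_REVISION = "af0f99b"


def make_pinned_classifier():
    """Build a sentiment pipeline that always loads the same model version."""
    # Imported inside the function so the constants above can be reused
    # elsewhere without loading the library.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=MODEL_NAME, revision=MODEL_REVISION)
```

To use it, call `classifier = make_pinned_classifier()` once and then pass your texts to `classifier` as before.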

A huggingface `pipeline` pulls together a tokenizer, one or more models, and some post-processing. It can operate over a single text or a list of texts.
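To make the "post-processing" step concrete, here is a small sketch in plain Python. It assumes the model returns one raw score (a logit) per label and that the post-processing is a softmax followed by picking the top label. The logits and the label order below are made up for illustration; a real pipeline gets them from the model and its configuration.

```python
import math

# Hypothetical raw scores (logits) for one text; a real model produces these.
logits = [-4.2, 4.4]
# Assumed label order, for illustration only.
labels = ["NEGATIVE", "POSITIVE"]


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    biggest = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - biggest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits, labels):
    """Pick the most probable label, in the same shape a pipeline returns."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return {"label": labels[best], "score": probs[best]}


result = postprocess(logits, labels)
print(result)
```

The result is a dictionary with a `label` and a `score`, the same shape as each item in the classifier output shown earlier.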

[There are NLP pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines) for:

* named entity recognition
* sentiment analysis
* summarization
* question answering
* text classification
* translation

There are also computer vision and speech pipelines.


Let's try a summarization pipeline.

In [2]:
summarizer = pipeline("summarization")
text = "Colby College is a private liberal arts college in Waterville, Maine. Founded in 1813 as the Maine Literary and Theological Institution, it was renamed Waterville College in 1821. The donations of Christian philanthropist Gardner Colby saw the institution renamed again to Colby University before settling on its current title, reflecting its liberal arts college curriculum, in 1899. Approximately 2,000 students from more than 60 countries are enrolled annually. The college offers 54 major fields of study and 30 minors. Located in central Maine, the 714-acre Neo-Georgian campus sits atop Mayflower Hill and overlooks downtown Waterville and the Kennebec River Valley. Along with fellow Maine institutions Bates College and Bowdoin College, Colby competes in the New England Small College Athletic Conference (NESCAC) and the Colby-Bates-Bowdoin Consortium."
summarizer(text)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' Colby College is a private liberal arts college in Waterville, Maine . Located in central Maine, the 714-acre Neo-Georgian campus sits atop Mayflower Hill and overlooks downtown Waterville and the Kennebec River Valley . Approximately 2,000 students from more than 60 countries are enrolled annually .'}]

Maybe that summary is still too long. Let's exert more control with `min_length` and `max_length`, which are measured in tokens, not words.

In [3]:
summarizer(text, min_length=5, max_length=20)

[{'summary_text': ' Colby College was founded in 1813 as the Maine Literary and Theological Institution .'}]

Is that better?

What if we used a different model?

In [14]:
summarizer = pipeline("summarization", model="t5-base")
summarizer(text)

[{'summary_text': 'Founded in 1813 as the Maine Literary and Theological Institution, it was renamed Waterville College in 1821 . the donations of Christian philanthropist Gardner Colby saw the institution re-named again to Colby University . Approximately 2,000 students from more than 60 countries are enrolled annually .'}]

Is that better?

Notice that when you instantiate a pipeline, huggingface downloads a model. Any transformer model is pretty big, and some are a lot bigger than others. Downloading a model (and then loading it) takes time, which is why, once we've made a pipeline, it's good to keep it around if we are going to process a lot of documents.
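One way to "keep it around" is to build each pipeline once and reuse it. The sketch below shows the idea with a small cache. `get_or_load` and `fake_loader` are hypothetical names, and the stand-in loader only pretends to be a summarizer so that the example runs without downloading anything.

```python
_loaded = {}  # maps a name to an already-built object


def get_or_load(name, loader):
    """Return the cached object for `name`, building it with `loader` only once."""
    if name not in _loaded:
        _loaded[name] = loader()
    return _loaded[name]


# Stand-in loader so this sketch runs without downloading a model.
load_count = 0


def fake_loader():
    global load_count
    load_count += 1
    return lambda text: text[:20]  # pretend "summarizer"


docs = ["first document ...", "second document ...", "third document ..."]
for doc in docs:
    summarizer = get_or_load("summarization", fake_loader)
    summarizer(doc)

print(load_count)  # 1: the loader ran once, not three times
```

With the real library, the loader could be something like `lambda: pipeline("summarization")`, so the slow download and load happen only on the first call.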

# Huggingface vs spaCy

Huggingface and spaCy are different companies. Each company releases open source software.

The huggingface software is the python package `transformers`.

The spaCy software is the python package `spacy`.

Both packages use models. spaCy has a whole set of models (the ones ending in `_trf`, such as `en_core_web_trf`) that use huggingface transformers!

spaCy can do some NLP tasks that huggingface can't do. 

The spaCy models are highly tuned and optimized for processing text using NLP. The huggingface models (e.g. for NER) are contributed by the community. 

The huggingface models focus more on NLP *applications* like summarization, sentiment analysis or translation.

If you have a choice, I recommend the spaCy models for text preprocessing.

If you want to use an NLP application, huggingface is great.

# Over to You!

Now, on your own, complete the [first lesson](https://huggingface.co/learn/nlp-course/chapter1/3?fw=pt) in the huggingface course. Paste your code below.

Note that you now understand that `pipeline` is a function that builds and returns a pipeline object, and that this object is an instance of the `Pipeline` class, which has subclasses for various tasks. You know how to import packages; you know what required and optional arguments to a function or method look like; and you know how to make strings and lists. Congratulations!