# Project 4b: Goals and Deliverables

The goals of this assignment are:
* To work with the object-oriented version of our corpus code.
* To modify a web app that we can use to analyze text data.
* To explore document- and corpus-level analyses using transformer models: summarization, key phrase extraction, and sentiment and topic analysis.


Here are the steps you should do to successfully complete this project:
1. From Moodle, download the files for this project. Upload them into the codespace for project 4a.
2. Complete the notebook and commit it to GitHub. Make sure to answer all questions, and to commit the notebook in a "run" state!
3. Modify `spacy_on_corpus.py` following the instructions in this notebook.
4. Edit the README.md file. Provide your name, your class year, links to/descriptions of any extensions and a list of resources. 
5. Commit your code often. We will take the last commit before the deadline as your submission of the project.

Possible extensions (from least points to most points):

* Modify the token, entity, and noun chunk get count methods so they count only lower cased tokens, entities and noun chunks.
* Modify the [styling](https://anvil.works/learn/tutorials/using-material-3) of the web app. 
* To the screen `Build Corpus` in the web app, add the ability for the user to choose the language of their input documents.
* To the screen `Analyze Document` in the web app, add the ability for the user to choose a value for `top_k` and to choose which token and entity tags to exclude.
* Plot more than one analysis at a time in `Analyze Corpus` (see [this page](https://anvil.works/docs/client/components/plots)).
* Add the ability for a user to enter jsonl in the input text area on the `Build Corpus` screen.
* If you added paragraphs to project 3c, port that over to project 4a.
* Add some metadata analysis and visualization on a fourth screen.
* Your other ideas are welcome! If you'd like to discuss one with Dr Stent, feel free.

# Set Up

1. Download this notebook, the `requirements.txt` file and the file `creator.jsonl`.
2. Upload all three into your project 4a codespace.
3. In the codespace terminal, run `pip install -r requirements.txt`.

# Make Sure We Can Work With .py Files We Are Editing

Run the code cell below.

In [1]:
# Automatically reload your external source code
%load_ext autoreload
%autoreload 2

# Get a Corpus

In the code cell below, build a corpus using `creator.jsonl`. This will be our test corpus for this project. If you can get `files.jsonl.zip` to load you can use it at the end.

In [2]:
from spacy_on_corpus import corpus
my_corpus = corpus()
corpus.build_corpus('creator.jsonl', my_corpus= my_corpus)


  from .autonotebook import tqdm as notebook_tqdm
No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'1': {'doc': It's a shame that the weak writing undermines The Creator so much, as there are some intriguing concepts that could have been compelling if executed better. For the most part, it's a mishmash of other movies with not much to say on its own.,
  'metadata': {'id': '1',
   'author': 'Michelle Kisner',
   'fullText': "It's a shame that the weak writing undermines The Creator so much, as there are some intriguing concepts that could have been compelling if executed better. For the most part, it's a mishmash of other movies with not much to say on its own."},
  'sentiment-analysis': {'label': 'NEGATIVE', 'score': 0.9997203946113586}},
 '2': {'doc': Although 'New Asia' is America's enemy, we are encouraged to transfer our sympathies in that direction. Yet the abiding vision of Asian life is a mass of touristic clichés seen through western eyes.,
  'metadata': {'id': '2',
   'author': 'John McDonald',
   'fullText': "Although 'New Asia' is America's enemy, we are encouraged to tr

# Ways We Can Add More NLP

There are many python packages that do NLP. Today we will look at three ways to add more NLP to our `corpus` class and our web app:

1. Augment spaCy
2. Use huggingface transformers
3. Use something else (that probably uses huggingface transformers!)

# Alternative 1: Augment spaCy

spaCy has extra plugins (available from the [spaCy universe](https://spacy.io/universe/) ). These plugins allow you to extend spaCy. We will play with two.

## Key phrases

Let's get some keyphrases!

In [12]:
import pyate
import spacy

nlp = spacy.load('en_core_web_md')          
nlp.add_pipe('combo_basic')

doc = nlp('Maine is beautiful in the fall. The leaves turn orange and green and drop from the trees. The quiet roads summon all travelers.')
print(list(doc._.combo_basic.keys()))

['beautiful in the fall', 'quiet roads']


### Add key phrases to the things we can do in our `corpus` class

Modify `spacy_on_corpus.py` as follows:
1. import `pyate`
2. after you make the spacy engine, add this line: `nlp.add_pipe("combo_basic")`
3. implement instance method `get_keyphrase_counts` which behaves similarly to `get_token_counts` except the document attribute to retrieve is `_.combo_basic`
4. implement instance method `get_keyphrase_statistics`

Make sure to add docstrings.
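
If it helps to see the counting step on its own, here is a minimal sketch. It is not the `corpus` class itself: `count_keyphrases` is a hypothetical helper, and the lists in `example` stand in for what `list(doc._.combo_basic.keys())` would give you for each document.

```python
from collections import Counter


def count_keyphrases(keyphrases_per_doc):
    """Count how often each key phrase occurs across a collection of documents.

    keyphrases_per_doc: an iterable with one list of key phrases per document
    (in the corpus class, these would come from doc._.combo_basic.keys()).
    Returns a list of (phrase, count) pairs, most frequent first.
    """
    counts = Counter()
    for phrases in keyphrases_per_doc:
        counts.update(phrases)
    return counts.most_common()


# stand-in data so the sketch runs without spaCy or pyate
example = [
    ["weak writing", "intriguing concepts"],
    ["weak writing", "astonishing visuals"],
]
print(count_keyphrases(example))
```

In your instance method, the loop would run over the documents stored in the corpus rather than over a hand-made list.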

Feel free to test in the code cell below.

In [13]:
print(my_corpus.get_token_counts())
my_corpus.get_keyphrase_counts()


[('It', 1), ('shame', 1), ('weak', 1), ('undermines', 1), ('there', 1), ('intriguing', 1), ('concepts', 1), ('could', 1), ('been', 1), ('compelling', 1), ('if', 1), ('executed', 1), ('better', 1), ('For', 1), ('most', 1), ('part', 1), ('mishmash', 1), ('other', 1), ('movies', 1), ('say', 1), ('own', 1), ('Although', 1), ('Asia', 1), ('America', 1), ('enemy', 1), ('encouraged', 1), ('transfer', 1), ('sympathies', 1), ('direction', 1), ('Yet', 1), ('abiding', 1), ('vision', 1), ('Asian', 1), ('life', 1), ('mass', 1), ('touristic', 1), ('clichés', 1), ('seen', 1), ('through', 1), ('western', 1), ('eyes', 1), ('astonishing', 1), ('visuals', 1), ('but', 1), ('where', 1), ('ends', 1), ('While', 1), ('performances', 1), ('strong', 1), ('thrilling', 1), ('elements', 1), ('swap', 1), ('actual', 1), ('excitement', 1), ('more', 1), ('traditional', 1), ('pays', 1), ('tribute', 1), ('influences', 1), ('little', 1), ('else', 1), ('incredibly', 1), ('immersive', 1), ('from', 1), ('visual', 1), ('pers

[('mishmash of other movies', 1),
 ('mishmash of other', 1),
 ('weak writing', 1),
 ('intriguing concepts', 1),
 ('most part', 1),
 ('other movies', 1),
 ('sympathies in that direction', 1),
 ('vision of Asian life', 1),
 ('vision of Asian', 1),
 ('mass of touristic', 1),
 ('Asian life', 1),
 ('western eyes', 1),
 ('elements of the film', 1),
 ('traditional science fiction film', 1),
 ('traditional science fiction', 1),
 ('science fiction film', 1),
 ('astonishing visuals', 1),
 ('actual excitement', 1),
 ('traditional science', 1),
 ('fiction film', 1),
 ('immersive from a visual', 1),
 ('world building perspective', 1),
 ('world building', 1),
 ('building perspective', 1),
 ('gorgeous feature in all the ways', 1),
 ('feature in all the ways', 1),
 ('charm of an old', 1),
 ('gorgeous feature', 1),
 ('fi epic', 1),
 ('fashioned blockbuster', 1),
 ('leap of faith', 1),
 ('serious thrills', 1),
 ('fi journey', 1),
 ('stimulating film', 1),
 ('profound questions', 1),
 ('rise of artificia

### Add keyphrases to our web app's server

In the notebook for project4a, define a function `get_corpus_keyphrases_statistics` that calls `get_keyphrase_statistics` and returns the key phrase statistics for the corpus.


Make sure to add docstrings.

Feel free to test in the code cell below.

### Add keyphrases to our web app's client

1. Add a button for key phrases to the `Analyze Corpus` form
2. Add a function in the code for that form that calls `get_corpus_keyphrases_statistics`
3. Add a display of the key phrase statistics to the web app
4. If the user chooses key phrases and either count or cloud, print an error message

### Questions:
1. *What is `_.combo_basic`? Is it an instance attribute, instance method, class attribute or class method of class `doc`?*
Instance attribute because _.combo_basic is an attribute relating to an instance of a doc class. We run the .keys() method on it.
2. *Why do we not implement `plot_keyphrase_counts` or `plot_keyphrase_cloud`?*
Because most key phrases occur only once in this corpus, a count plot or a word cloud would not give us much information.
3. *Rebuild your corpus now that you have added functionality. Look at the keyphrases. Do you agree that these capture the essence of this corpus?*
Yes, they capture much of the core content of the corpus. For a quick synthesis, the key phrases convey the main ideas.

# Alternative 2: Use huggingface transformers

[Huggingface](https://www.crunchbase.com/organization/hugging-face/) is a Brooklyn-based company that was founded just about the time the first transformer models for NLP became famous. Its business model is open source.

The huggingface staff:

* host transformer-related data, code, models, data sheets, model cards and applications
* help people more easily use transformers (for NLP, computer vision and other AI applications)
* help people more easily *fine tune* transformers (we will do that!)
* consult with companies on how to operationalize and scale their use of transformers

## `transformers`

Because of the huggingface `transformers` package, we can easily use transformers ourselves!


## Sentiment analysis

Let's make a transformers `pipeline` for sentiment analysis. Sentiment analysis is an NLP task that estimates the *polarity* (and sometimes the *strength*) of the sentiment communicated by a text.

In [14]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier('Maine is beautiful in the fall. The leaves turn orange and green and drop from the trees. The quiet roads summon all travelers.'))

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9997126460075378}]


Now let's get the sentiment of each document in our corpus.

In [15]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

# get a list of the document ids in the corpus
keys = list(my_corpus.keys())
# get the text from each entry in the corpus
text = my_corpus.get_documents()
# there are two ways to get the text from each entry in the corpus!

# print the texts
print(*text, sep = "\n")
# do sentiment analysis
results = []
for doc in text:
    results.extend(classifier(str(doc)))
# print the results
print(results)
# add the sentiment label and score into the metadata for each entry in the corpus
for i, result in enumerate(results):
    my_corpus[keys[i]]['metadata']['sentiment_label'] = result['label']
    my_corpus[keys[i]]['metadata']['sentiment_score'] = result['score']
print(my_corpus)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


It's a shame that the weak writing undermines The Creator so much, as there are some intriguing concepts that could have been compelling if executed better. For the most part, it's a mishmash of other movies with not much to say on its own.
Although 'New Asia' is America's enemy, we are encouraged to transfer our sympathies in that direction. Yet the abiding vision of Asian life is a mass of touristic clichés seen through western eyes.
The Creator has astonishing visuals, but that's where its charm ends. While the performances are strong, thrilling elements of the film swap actual excitement for a more traditional science fiction film that pays tribute to its influences and little else.
The Creator is incredibly immersive from a visual and world building perspective, however it leaves a lot to be desired with its writing.
The Creator is a gorgeous feature in all the ways that matter and it's certainly a sci-fi epic that shouldn't be missed in a year that's severely lacking the charm of

Well, we just used transformers, the most advanced NLP model type known today! 

A huggingface `pipeline` pulls together a tokenizer, one or more models, and some post-processing. It can operate over a single text or a list of texts.

[There are NLP pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines) for:

* named entity recognition
* sentiment analysis
* summarization
* question answering
* text classification
* translation

There are also computer vision and speech pipelines.

You can change the sentiment model. Copy the code from above into a new code cell. Investigate the sentiment models available - try at least one more. 
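
One way to switch models is to name the model explicitly when you build the pipeline, which also gets rid of the "No model was supplied" warning. The sketch below is an illustration, not the required solution: `build_sentiment_classifier` and `label_texts` are hypothetical helper names, the default model name is the one reported in the warning above, and the `transformers` import sits inside the function so that the second helper can be tried without downloading anything.

```python
def build_sentiment_classifier(
    model_name="distilbert-base-uncased-finetuned-sst-2-english", revision=None
):
    """Build a sentiment pipeline for an explicitly named model.

    Swap model_name for another sentiment model from the huggingface hub
    to compare results; revision is optional.
    """
    from transformers import pipeline  # imported lazily; assumes transformers is installed

    return pipeline("sentiment-analysis", model=model_name, revision=revision)


def label_texts(classifier, texts):
    """Run the classifier once over a list of texts and pair each text with its result."""
    texts = list(texts)
    results = classifier(texts)  # a pipeline accepts a list of texts
    return list(zip(texts, results))


# stand-in classifier so the sketch runs offline; a real pipeline returns dicts of this shape
def fake_classifier(texts):
    return [{"label": "POSITIVE", "score": 1.0} for _ in texts]


print(label_texts(fake_classifier, ["Maine is beautiful in the fall."]))
```

With a real model you would call `label_texts(build_sentiment_classifier(), texts)` instead of passing the stand-in.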


In [16]:
from transformers import pipeline

classifier2 = pipeline("translation_en_to_fr")

No model was supplied, defaulted to t5-base and revision 686f1db (https://huggingface.co/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [17]:
# get a list of the document ids in the corpus
keys = list(my_corpus.keys())
# get the text from each entry in the corpus
text = my_corpus.get_documents()
# there are two ways to get the text from each entry in the corpus!

# print the texts
print(*text, sep = "\n")
# translate each document (classifier2 is the English-to-French translation pipeline)
results = []
for doc in text:
    results.extend(classifier2(str(doc)))
# print the results
print(results)
# add the translation into the metadata for each entry in the corpus
for i, result in enumerate(results):
    my_corpus[keys[i]]['metadata']['translation_text'] = result['translation_text']
print(my_corpus)

It's a shame that the weak writing undermines The Creator so much, as there are some intriguing concepts that could have been compelling if executed better. For the most part, it's a mishmash of other movies with not much to say on its own.
Although 'New Asia' is America's enemy, we are encouraged to transfer our sympathies in that direction. Yet the abiding vision of Asian life is a mass of touristic clichés seen through western eyes.
The Creator has astonishing visuals, but that's where its charm ends. While the performances are strong, thrilling elements of the film swap actual excitement for a more traditional science fiction film that pays tribute to its influences and little else.
The Creator is incredibly immersive from a visual and world building perspective, however it leaves a lot to be desired with its writing.
The Creator is a gorgeous feature in all the ways that matter and it's certainly a sci-fi epic that shouldn't be missed in a year that's severely lacking the charm of

### A little abstraction

Before we proceed, let's make a function to get the texts of all the documents out of the corpus.

In `spacy_on_corpus.py`:
* add an instance method to class `corpus` called `get_document_texts`. It should return a list of pairs (id, text).
* add an instance method to class `corpus` called `update_document_metadata`. It should take a document id and a dictionary of metadata key:value pairs. It should add each key:value pair to the document in the corpus at the id, and if there is no such document id in the corpus it should print an error message.
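
To make the two methods concrete, here is a minimal sketch on a toy class. `MiniCorpus` is a hypothetical stand-in, not the course's `corpus` class: it assumes the corpus behaves like a dictionary from document id to an entry with `'doc'` and `'metadata'` keys, and it returns texts as plain strings.

```python
class MiniCorpus(dict):
    """Toy stand-in for the corpus: id -> {'doc': ..., 'metadata': {...}}."""

    def get_document_texts(self):
        """Return a list of (id, text) pairs, one per document."""
        return [(doc_id, str(entry["doc"])) for doc_id, entry in self.items()]

    def update_document_metadata(self, doc_id, metadata):
        """Add each key:value pair in metadata to the document stored at doc_id.

        Prints an error message if there is no such document id.
        """
        if doc_id not in self:
            print(f"Error: no document with id {doc_id!r} in the corpus")
            return
        self[doc_id]["metadata"].update(metadata)


toy = MiniCorpus()
toy["1"] = {"doc": "The Creator has astonishing visuals.", "metadata": {"id": "1"}}
toy.update_document_metadata("1", {"label": "POSITIVE", "score": 0.99})
toy.update_document_metadata("42", {"label": "NEGATIVE"})  # prints the error message
print(toy.get_document_texts())
```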



Now copy the code from above into a new code cell. Modify it to use `get_document_texts` and `update_document_metadata`.

In [18]:
texts = my_corpus.get_document_texts()
# print the (id, text) pairs
print(*texts, sep = "\n")
# run the second pipeline (classifier2) on the text of each document
results = []
for doc in texts:
    results.extend(classifier2(str(doc[1])))
# print the results
print(results)
# add each result into the metadata for the corresponding document
for i, result in enumerate(results):
    my_corpus.update_document_metadata(texts[i][0], result)
print(my_corpus)

('1', It's a shame that the weak writing undermines The Creator so much, as there are some intriguing concepts that could have been compelling if executed better. For the most part, it's a mishmash of other movies with not much to say on its own.)
('2', Although 'New Asia' is America's enemy, we are encouraged to transfer our sympathies in that direction. Yet the abiding vision of Asian life is a mass of touristic clichés seen through western eyes.)
('3', The Creator has astonishing visuals, but that's where its charm ends. While the performances are strong, thrilling elements of the film swap actual excitement for a more traditional science fiction film that pays tribute to its influences and little else.)
('4', The Creator is incredibly immersive from a visual and world building perspective, however it leaves a lot to be desired with its writing.)
('5', The Creator is a gorgeous feature in all the ways that matter and it's certainly a sci-fi epic that shouldn't be missed in a year th

### Add sentiment analysis to our `corpus` class

In the `corpus` class, add some code to run sentiment analysis to the `add_document` method. 

In `get_basic_statistics`, print out the number of documents, number of sentences and number of positive, neutral and negative documents.

Use the code cell below to test.


In [19]:
from spacy_on_corpus import corpus
my_corpus = corpus()
my_corpus = corpus.build_corpus('creator.jsonl', my_corpus= my_corpus)
print(my_corpus)

{'1': {'doc': It's a shame that the weak writing undermines The Creator so much, as there are some intriguing concepts that could have been compelling if executed better. For the most part, it's a mishmash of other movies with not much to say on its own., 'metadata': {'id': '1', 'author': 'Michelle Kisner', 'fullText': "It's a shame that the weak writing undermines The Creator so much, as there are some intriguing concepts that could have been compelling if executed better. For the most part, it's a mishmash of other movies with not much to say on its own."}, 'sentiment-analysis': {'label': 'NEGATIVE', 'score': 0.9997203946113586}}, '2': {'doc': Although 'New Asia' is America's enemy, we are encouraged to transfer our sympathies in that direction. Yet the abiding vision of Asian life is a mass of touristic clichés seen through western eyes., 'metadata': {'id': '2', 'author': 'John McDonald', 'fullText': "Although 'New Asia' is America's enemy, we are encouraged to transfer our sympathi

In [3]:

print(my_corpus.get_basic_statistics())

Documents: 10
Sentences: 21
Tokens: 539
Unique tokens: 311
Entities: 34
Unique Entities: 28
Noun chunks: 144
Unique Noun chunks: 115
Positive Documents: 6
Neutral Docuements: 0
Negative Docuements: 4



### Add sentiment analysis to our web app's server

In the notebook for project4a:

1. define a function `get_corpus_sentiment` that returns a list of sentiment types and their frequencies (a counter of positive sentiment, negative sentiment and neutral sentiment documents in the corpus).
2. define a function `get_corpus_statistics` that calls `get_basic_statistics`.

Use the code cell below to test.
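
Here is a minimal sketch of the counting behind `get_corpus_sentiment`. The helper name `sentiment_counts` and the label strings `POSITIVE`, `NEUTRAL` and `NEGATIVE` are assumptions for illustration; check which labels your chosen model actually produces. Every sentiment type starts at zero so that a missing type still shows up in the plot.

```python
from collections import Counter


def sentiment_counts(labels):
    """Return (sentiment type, frequency) pairs for a list of sentiment labels.

    All three sentiment types are always present, even with a count of zero.
    """
    counts = Counter({"POSITIVE": 0, "NEUTRAL": 0, "NEGATIVE": 0})
    counts.update(label.upper() for label in labels)
    return list(counts.items())


# stand-in labels matching the statistics printed above: 6 positive, 4 negative
labels = ["POSITIVE"] * 6 + ["NEGATIVE"] * 4
print(sentiment_counts(labels))
```

In the server function, the labels would come from the sentiment results stored with each document in the corpus.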


### Add sentiment analysis to our web app's client

In your web app, add the ability to plot corpus sentiment:
1. Add a button for sentiment to the `Analyze Corpus` form
2. Add a function in the code for that form that calls `get_corpus_sentiment`
3. Add a plot of the sentiment counts to the web app
4. On the `Build Corpus` form, print the basic statistics after each action is performed

### Questions

4. *Why did we add `update_document_metadata` to our corpus class?*
If we want to add new metadata to a document we can with this method
5. *Run the sentiment analysis on this corpus. For which documents do you agree with the assigned sentiment, and for which do you disagree?*


## Summarization

Let's try a transformers single-document summarization pipeline.

In the code cell below, summarize each document. Add each summary to the corresponding document's metadata in the corpus.

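In [None]:
from transformers import pipeline

summarizer = pipeline("summarization")

# get the documents and their ids as (id, text) pairs
texts = my_corpus.get_document_texts()
# summarize each document; each call returns a list holding one result dictionary
results = []
for _, text in texts:
    results.extend(summarizer(str(text)))
# add the summaries to the documents' metadata
for (doc_id, _), result in zip(texts, results):
    my_corpus.update_document_metadata(doc_id, result)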

Maybe we think those summaries are too long or too short. Let's exert more control.

In [None]:
# get the documents and their ids

# summarize
results = summarizer(..., min_length=5, max_length=20)

# add the summaries to the documents' metadatas


Is that better?
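
Instead of fixed numbers, you could tie the summary length to the length of each document. The sketch below is one possible approach, not a recommendation: `summary_length_bounds` is a hypothetical helper, the fractions are arbitrary starting points, and it counts whitespace-separated words even though `min_length` and `max_length` are measured in model tokens, so treat the result as a rough guide.

```python
def summary_length_bounds(text, min_fraction=0.1, max_fraction=0.3, floor=5):
    """Suggest (min_length, max_length) for a summary as fractions of the input length.

    Length is approximated by the number of whitespace-separated words.
    """
    n_words = len(text.split())
    min_length = max(floor, round(n_words * min_fraction))
    # keep max_length strictly above min_length, even for very short texts
    max_length = max(min_length + 1, round(n_words * max_fraction))
    return min_length, max_length


document = " ".join(["word"] * 100)  # stand-in for a 100-word document
print(summary_length_bounds(document))
```

The two numbers can then be passed along as `summarizer(text, min_length=..., max_length=...)`.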

What if we used a different model? In the code cell below, try a different summarizer model.

Is that better?

Notice that when you instantiate a pipeline, huggingface downloads a model. Any transformer model is pretty big. Some are a lot bigger than others. Downloading a model (and then loading it) takes time, which is why once we've made a pipeline it's good to keep it around if we are going to process a lot of documents.
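
One way to keep a pipeline around is to build it the first time it is requested and reuse it afterwards. This is a minimal sketch under assumptions: `PipelineCache` is a hypothetical helper, and `fake_loader` stands in for `transformers.pipeline` so the example runs without downloading a model.

```python
class PipelineCache:
    """Build each pipeline once and hand back the same object on later requests."""

    def __init__(self, loader):
        # loader is any function that takes a task name and returns a pipeline,
        # for example transformers.pipeline
        self._loader = loader
        self._pipelines = {}

    def get(self, task):
        if task not in self._pipelines:
            # the slow step (download and load) happens only once per task
            self._pipelines[task] = self._loader(task)
        return self._pipelines[task]


# stand-in loader that records how often it is called
load_calls = []


def fake_loader(task):
    load_calls.append(task)
    return lambda text: f"{task} result for: {text}"


cache = PipelineCache(fake_loader)
first = cache.get("summarization")
second = cache.get("summarization")
print(first is second, load_calls)
```

In the `corpus` class the same idea could be as simple as creating the pipelines once, outside `add_document`, and reusing them for every document.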

### Add summarization to our `corpus` class

In the `corpus` class, add some code to run single-document summarization to the `add_document` method.


### Add summarization to our web app's server

In the notebook for project4a, define a function `get_document_summary` that gets the summary for a single document id. It should return the text summary.

Use the code cell below to test.


### Add summarization to our web app's client

In your web app, add the ability to display a document summary:
1. Add a button for summary to the `Analyze Document` form
2. Add a function in the code for that form that calls `get_document_summary`
3. Render the document summary in the web app

### Questions

6. *Look at the generated summaries. Are they extractive or abstractive? In other words, is every word in a summary in the document it comes from?*
7. *Play around with summary lengths. Is there a good length - either an absolutely ideal length, or a fraction of the length of the input document? Why or why not?*
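
For question 6, a quick word-level check can help. The sketch below uses a hypothetical helper, `novel_words`, and a deliberately simple tokenizer (lowercased runs of letters, digits and apostrophes). It only tells you which summary words never appear in the document; a summary with no novel words may still rearrange the source, so read the summaries as well.

```python
import re


def words(text):
    """Split text into lowercased word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def novel_words(summary, document):
    """Return the sorted list of words in the summary that do not occur in the document."""
    document_vocabulary = set(words(document))
    return sorted({w for w in words(summary) if w not in document_vocabulary})


document = "The Creator has astonishing visuals, but that's where its charm ends."
print(novel_words("The Creator has astonishing visuals.", document))
print(novel_words("The film looks great.", document))
```

An empty list is consistent with an extractive summary; any words in the list are evidence that the model generated text of its own.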

# Alternative 3: Using transformers in Other Packages

Transformers are used in many other applications and python packages. Here we will look at one, BERTopic, which does topic modeling.

In [None]:
# first, get the full texts from the corpus
# get a lot of copies of them since we have a small corpus and topic modeling needs more documents
full_texts = [my_corpus[x]['doc'].text for x in my_corpus]*50

print(len(full_texts))

### Topic modeling

We set up BERTopic with spaCy using the best practices from https://maartengr.github.io/BERTopic/getting_started/best_practices/best_practices.html.

In [None]:
# Pre-calculate embeddings; consider any embedding model
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(full_texts, show_progress_bar=True)

# Prevent stochastic behavior
from umap import UMAP

# choose a number of neighbors that's reasonable for your data set
umap_model = UMAP(n_neighbors=5, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

# Set minimum cluster size
from hdbscan import HDBSCAN

# choose a minimum cluster size that's reasonable for your data set
hdbscan_model = HDBSCAN(min_cluster_size=5, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

# I find this dubious but okay; "Improve" default representation
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(stop_words="english", min_df=2, ngram_range=(1, 2))

# Use multiple representations
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, PartOfSpeech

# KeyBERT
keybert_model = KeyBERTInspired()

# Part-of-Speech
pos_model = PartOfSpeech("en_core_web_md")

# MMR
mmr_model = MaximalMarginalRelevance(diversity=0.3)

# All representation models
representation_model = {
    "KeyBERT": keybert_model,
    "MMR": mmr_model,
    "POS": pos_model
}

# Make topic model using all of this setup
from bertopic import BERTopic

topic_model = BERTopic(

  # Pipeline models
  embedding_model=embedding_model,
  umap_model=umap_model,
  hdbscan_model=hdbscan_model,
  vectorizer_model=vectorizer_model,
  representation_model=representation_model,

  # Hyperparameters
  top_n_words=10,
  verbose=True,
  nr_topics="auto"
)

That bit takes a while. Now we have a topic model, but what does it look like?

In [None]:
# extract the topics
topics, probabilities = topic_model.fit_transform(full_texts, embeddings)
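
Before visualizing, it can help to see how many documents landed in each topic. Here is a minimal sketch: `topic_sizes` is a hypothetical helper and `example_topics` is stand-in data shaped like the `topics` list returned by `fit_transform` (one topic id per document). BERTopic uses topic id -1 for outlier documents that were not assigned to any topic.

```python
from collections import Counter


def topic_sizes(topics):
    """Summarize a list of per-document topic ids.

    Returns ([(topic id, number of documents), ...] sorted largest first,
    number of outlier documents with topic id -1).
    """
    sizes = Counter(topics)
    outliers = sizes.pop(-1, 0)
    ranked = sorted(sizes.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked, outliers


# stand-in for the `topics` list produced by fit_transform
example_topics = [0, 0, 1, -1, 1, 0]
print(topic_sizes(example_topics))
```

If most documents end up as outliers, that is a hint to revisit `n_neighbors` or `min_cluster_size` above.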

At this point you will iterate through one or both of the steps below until you are happy(ish).

Look at the topics.

In [None]:
# create a visualization of the topic model for this group
topic_model.visualize_topics()

Or maybe visualizing the documents will help more.

In [None]:
# Reduce dimensionality of embeddings, this step is optional but much faster to perform iteratively:
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
topic_model.visualize_documents(full_texts, reduced_embeddings=reduced_embeddings)

Maybe you look at the visualization and you know the number of topics you want to end up with.

In [None]:
topic_model.reduce_topics(full_texts, nr_topics=7)

Or maybe you want to merge some topics, like topics 0 and 1.

In [None]:
topics_to_merge = [[0, 1]]
topic_model.merge_topics(full_texts, topics_to_merge)

### Add topic modeling to our `corpus` class

In the `corpus` class, add an instance method `build_topic_model` that runs topic modeling. 

Use the code cell below to test.

### Add topic modeling to our web app's server

In the notebook for project4a, define a function `get_topic_model_topics_plot` that returns the topic model plot.

In the notebook for project4a, define a function `get_topic_model_documents_plot` that returns the topic model document plot.

Use the code cell below to test.



### Add topic modeling to our web app's client

In your web app, add the ability to display the topic model plots:
1. Add a new form for topic modeling, called `Analyze Topics`
2. Add it to the top right hand menu
3. Add a function in the code for that form that calls `get_topic_model_topics_plot` and `get_topic_model_documents_plot`
4. Render the two plots in the web app, side by side or one above the other

### Questions

8. *Can a document have two or more topics? Why or why not?*
9. *Which do you think is easier to understand as a 'capture' of a corpus - key phrases or topics? Why?*
10. *Which flow control statement type do we use more often in the `corpus` class: for, while or if?*

# Huggingface vs spaCy

Huggingface and spaCy are different companies. Each company releases open source software.

The huggingface software is the python package `transformers`.

The spaCy software is the python package `spacy`.

Both packages use models. spaCy has a whole set of models (the ones ending in `_trf`) that use huggingface transformers!

spaCy can do some NLP tasks that the huggingface `transformers` pipelines don't offer out of the box, such as lemmatization and dependency parsing.

The spaCy models are highly tuned and optimized for processing text using NLP. The huggingface models (e.g. for NER) are contributed by the community. 

The huggingface models focus more on NLP *applications* like summarization, sentiment analysis or translation.

If you have a choice, I would use the spaCy models for text preprocessing.

If you want to use an NLP application, huggingface is great.