# Title: skweak: Weak Supervision Made Easy for NLP

#### Members' Names : Bhakti Bhatt & Yashi Garg

####  Emails: bhakti.bhatt@ryerson.ca & yashi.garg@ryerson.ca

# Introduction:

#### Problem Description:

In NLP, accurately labelled data is scarce when dealing with specialized domains and internal business initiatives. NLP projects undertaken by businesses often suffer from a lack of labelled data, especially when the business defines domain-specific (e.g. internal) labels and cannot make use of pre-existing resources. In many cases, there is a need to rely on massive sets of hand-labelled domain-specific training data or large pre-trained language models.

#### Context of the Problem:

In the modern world, text documents are available in practically unlimited quantities. Unfortunately, meaningful labelled data remains scarce. The issue is magnified for resource-poor languages and uncommon textual domains, and it extends to projects without pre-existing datasets.

#### Limitation About other Approaches:

There are many ways to get more labeled training data; however, each has its own drawbacks.



*   **Traditional Supervision (Hand-Labeled Data by Subject Matter Experts (SMEs))**

     Hand-labeled training datasets are expensive and time-consuming to create, and cannot swiftly adapt when new labelling guidelines are introduced. For example, if a new domain-specific label were introduced, SMEs would need to review the full training data set.


*   **Semi-supervised Learning (Use of structural assumptions on unlabeled data)**

     The semi-supervised learning approach combines a small labelled dataset with a large unlabelled dataset, exploiting structural assumptions about the unlabelled data. However, it still needs some labeled data.

*	**Transfer Learning (Use of pre-trained models)**

     Transfer learning makes use of existing pre-trained models and fine-tunes them on a different task. However, transfer learning only works if the initial and target problems are similar enough for the first round of training to be relevant. This is often not the case in cross-sector NLP projects.


#### Solution:

The skweak framework relies on weak supervision, which removes the need to label data by hand: data points are labelled programmatically through a collection of labelling functions. Another feature of skweak is the ability to create labelling functions that produce underspecified labels, i.e. labels that may stand for any one of several target labels.

# Background


| Reference | Explanation | Dataset/Input | Weakness |
| --- | --- | --- | --- |
| Ratner et al. [4] | The Snorkel implementation of weak supervision combines various supervision sources using a generative model. It allows the incorporation of both labelling functions and pre-trained discriminative models. | Labelling Functions, Pre-trained Discriminative Models | One drawback is that Snorkel requires data points to be independent, making it difficult to apply to sequence labelling tasks. This is improved in the skweak implementation. The future outlook comprises closer unification with transfer learning and the implementation of multi-task weak supervision to accommodate noisy data. |
| Fries et al. | The SwellShark implementation of weak supervision is optimised for biomedical Named Entity Recognition (NER). | Labelling Functions, Pre-trained Discriminative Models | SwellShark relies on a separate, ad-hoc mechanism to generate in-line labels. |
| Fu et al. [2] | The FlyingSquid implementation of weak supervision uses an approach called triplet methods, which is fast and applicable to structured prediction problems. | Labelling Functions | The aggregation model of FlyingSquid focuses on estimating the accuracy of each labelling function and is therefore difficult to apply to problems where labelling sources have varying precision/recall trade-offs. |
| Safranchik et al. [8] | The Safranchik paper describes a model based on linked hidden Markov models (HMMs). This is closely related to skweak. | Tagging Rules, Linking Rules | Compared to skweak, the Safranchik paper does not provide the ability to include document-level constraints. |
| Lison et al. [5] | The skweak implementation of weak supervision aggregates the input data and models using an HMM. It also allows the input of document-level constraints to generalize the labels at the document level as opposed to the token level. | Labelling Functions, Pre-trained Discriminative Models, Document-level constraints | Similar to the more mature implementation of Snorkel, skweak could incorporate semi-supervised learning approaches and, furthermore, implement multi-task weak supervision. |

# Methodology

Skweak is a versatile, Python-based software toolkit enabling NLP developers to apply weak supervision to a wide range of tasks, and in particular sequence labelling and text classification. Instead of labelling data points manually, we define labelling functions to automatically annotate text documents from the target domain. The results of those labelling functions are then aggregated into one single annotation layer using a generative model.



![skweak_procedure.png](main/skweak_procedure.png)



As shown above, weak supervision with skweak is divided into several steps:

**Start: Preprocessing**
We must first prepare the (unlabelled) corpus onto which the labelling functions will be applied. Skweak is built on top of spaCy and operates on spaCy `Doc` objects, so the documents first need to be converted to `Doc` objects with spaCy.

**Step 1: Labelling functions**
We then define a range of labelling functions that will take those documents and annotate spans with labels. Those labelling functions can take a variety of forms, from handcrafted heuristics to machine learning models.

**Heuristics**
The simplest type of labelling functions integrated in skweak are rule-based heuristics. For instance, one heuristic to detect entities of type COMPANY is to look for text spans ending with a legal company type (such as “Inc.”).
The easiest way to define heuristics in skweak is through standard Python functions that take a spaCy `Doc` object as input and return labelled spans.

**Machine learning models**
Labelling functions may also take the form of machine learning models. Typically, those models will be trained on data from other, related domains, thereby leading to some form of transfer learning across domains. Skweak does not impose any constraint on the type of model that can be employed.

**Gazetteers**
Gazetteers are modules searching for occurrences of a list of words or phrases in the document. For instance, a gazetteer may be constructed using the geographical locations from Geonames or names of persons, organisations and locations from DBPedia.
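
As a rough illustration of the lookup a gazetteer performs, the sketch below matches multi-token entries against a tokenised text using a small token-level trie made of nested dictionaries. It is a plain-Python approximation written for this report, not skweak's own implementation; `build_trie`, `find_matches` and the sample entries are hypothetical.

```python
END = object()  # sentinel key marking the end of a complete entry

def build_trie(entries):
    """Build a nested-dict trie from (token tuple, label) pairs."""
    trie = {}
    for tokens, label in entries:
        node = trie
        for tok in tokens:
            node = node.setdefault(tok, {})
        node[END] = label
    return trie

def find_matches(tokens, trie):
    """Return (start, end, label) for the longest entry starting at each position."""
    matches = []
    i = 0
    while i < len(tokens):
        node, best, j = trie, None, i
        # Walk down the trie as long as the next token continues some entry
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if END in node:
                best = (i, j, node[END])
        if best:
            matches.append(best)
            i = best[1]  # continue after the match so spans do not overlap
        else:
            i += 1
    return matches

# Hypothetical entries; a real gazetteer would be loaded from a knowledge base
entries = [
    (("New", "York"), "LOCATION"),
    (("New", "York", "Times"), "ORG"),
    (("Reuters",), "ORG"),
]
trie = build_trie(entries)
tokens = "The New York Times quoted Reuters in New York".split()
matches = find_matches(tokens, trie)
print(matches)
```

Preferring the longest match means "New York Times" is labelled as one organisation rather than as a location followed by an unmatched token.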

**Document-level functions**
Unlike previous weak supervision frameworks, skweak also provides functionalities to create document-level labelling functions that rely on the global document context to derive new supervision signals. In particular, skweak includes a labelling function that takes advantage of label consistency within a document.
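
The idea of label consistency within a document can be pictured with the following minimal sketch: every mention of the same string receives the label most often assigned to that string elsewhere in the document. This is a simplified, hypothetical illustration (`propagate_majority_label` is not a skweak function), and the actual skweak labelling function may work differently.

```python
from collections import Counter, defaultdict

def propagate_majority_label(mentions):
    """Spread the most frequent label of each string to all of its mentions.

    mentions: list of (text, label) pairs in document order, where label is
    None when no labelling function fired on that mention.
    """
    votes = defaultdict(Counter)
    for text, label in mentions:
        if label is not None:
            votes[text][label] += 1

    result = []
    for text, label in mentions:
        if votes[text]:  # at least one labelled mention of this string exists
            label = votes[text].most_common(1)[0][0]
        result.append((text, label))
    return result

# Hypothetical mentions from one document
mentions = [
    ("Komatsu", "COMPANY"),
    ("Komatsu", None),
    ("Komatsu", "PERSON"),
    ("Komatsu", "COMPANY"),
    ("Tokyo", None),
]
result = propagate_majority_label(mentions)
print(result)
```

The unlabelled and the conflicting mention of "Komatsu" both end up as COMPANY, while "Tokyo" stays unlabelled because no function ever labelled it.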

**Step 2: Aggregation model**
Once the labelling functions have been applied to your corpus, we aggregate their results in order to obtain a single, probabilistic annotation (instead of the multiple, possibly conflicting annotations from the labelling functions).
For sequence labelling, this model is expressed as a Hidden Markov Model where the states correspond to the “true” (unobserved) labels, and the observations are the predictions of each labelling function. For classification, this model reduces to Naive Bayes since there are no transitions. The generative model automatically estimates the relative accuracy and possible confusions of each labelling function. It is estimated using the Baum-Welch algorithm, which is a variant of EM that uses the forward-backward algorithm to compute the statistics for the expectation step. For efficient inference, skweak combines Python with C-compiled routines from the hmmlearn package for parameter estimation and decoding.
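
To make the classification case concrete, here is a small worked sketch of Naive Bayes aggregation: the posterior over the true label is proportional to the prior times the probability of each labelling function producing its observed output given that label. All numbers and names (`aggregate`, `lf_suffix`, `lf_cue`) are made up for illustration; in skweak the corresponding parameters are estimated by EM rather than fixed by hand.

```python
from fractions import Fraction

def aggregate(prior, emissions, observed):
    """Posterior over the true label given one output per labelling function.

    prior:     {label: P(label)}
    emissions: {lf_name: {true_label: {output: P(output | true_label)}}}
    observed:  {lf_name: output}
    """
    scores = {}
    for label, p in prior.items():
        # Naive Bayes: outputs are independent given the true label
        for lf_name, output in observed.items():
            p *= emissions[lf_name][label][output]
        scores[label] = p
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

F = Fraction
prior = {"COMPANY": F(1, 2), "OTHER_ORG": F(1, 2)}
emissions = {
    # A fairly reliable function ...
    "lf_suffix": {
        "COMPANY": {"COMPANY": F(9, 10), "OTHER_ORG": F(1, 10)},
        "OTHER_ORG": {"COMPANY": F(2, 10), "OTHER_ORG": F(8, 10)},
    },
    # ... and a noisier one
    "lf_cue": {
        "COMPANY": {"COMPANY": F(6, 10), "OTHER_ORG": F(4, 10)},
        "OTHER_ORG": {"COMPANY": F(3, 10), "OTHER_ORG": F(7, 10)},
    },
}
# The two functions disagree on this data point
observed = {"lf_suffix": "COMPANY", "lf_cue": "OTHER_ORG"}
posterior = aggregate(prior, emissions, observed)
print({label: float(p) for label, p in posterior.items()})
```

With these assumed parameters the disagreement is resolved in favour of the more reliable function: the unnormalised scores are 0.18 for COMPANY and 0.07 for OTHER_ORG, giving posteriors of 0.72 and 0.28.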

**Step 3: Final model**
Finally, based on those aggregated labels, we train our final model. Step 2 gives us a labelled corpus that (probabilistically) aggregates the outputs of all labelling functions, and you can use this labelled data to estimate any kind of machine learning model.

# Implementation

#### Install all required libraries


* skweak is the implementation of weak supervision.
* spaCy, discussed further below, is one of the key libraries used in NLP to process and understand large volumes of text.
* Afinn is one of the simplest yet most popular lexicons used for sentiment analysis, developed by Finn Årup Nielsen.
* en_core_web_sm and en_core_web_md are pre-trained English spaCy pipelines.


In [None]:
#pip install skweak

In [None]:
#pip install -U spacy

In [None]:
#pip install afinn

In [None]:
#!python -m spacy download en_core_web_sm
#!python -m spacy download en_core_web_md

### Start: preparing the corpus
We have a small corpus of 200 news articles that we wish to annotate with two entity types: 
- companies
- other (non-commercial) organisations.

The first step is to extract the texts from the corpus:

In [None]:
import tarfile

# We retrieve the texts
texts = [] 
archive_file = tarfile.open("reuters_small.tar.gz")
for archive_member in archive_file.getnames():
    if archive_member.endswith(".txt"):
        text = archive_file.extractfile(archive_member).read().decode("utf8")
        texts.append(text)

We can now run spaCy on those texts to obtain `Doc` objects. spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It is designed to help build applications that process and “understand” large volumes of text. It can be used to **build information extraction** or **natural language understanding systems**, or to **pre-process text for deep learning**.

The spaCy library is foundational to all the code for skweak.

In [None]:
import spacy

# We run spacy on the texts    
nlp = spacy.load("en_core_web_sm", disable=["ner", "lemmatizer"])
docs = list(nlp.pipe(texts))


<br>

## Step 1: Labelling functions

Labelling functions are at the core of `skweak`. They take a `Doc` as input and return a list of spans with their associated labels.

One simple type of labelling function is a heuristic. For instance, we can write that commercial companies may be recognized by their legal suffix (such as Corp.):

In [None]:
import skweak

def company_detector_fun(doc):
    for chunk in doc.noun_chunks:
        if chunk[-1].lower_.rstrip(".") in {'corp', 'inc', 'ltd', 'llc', 'sa', 'ag'}:
            yield chunk.start, chunk.end, "COMPANY"

# We create the labelling function by giving it a name, and a function to apply
company_detector = skweak.heuristics.FunctionAnnotator("company_detector", company_detector_fun)

# We run the function on the full corpus
docs = list(company_detector.pipe(docs))

# Show an example
skweak.utils.display_entities(docs[28], "company_detector")
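
The string test at the heart of the company heuristic can be tried out without spaCy. The sketch below applies the same normalisation (lower-casing and stripping trailing periods) to plain token lists; `has_legal_suffix` and the sample chunks are hypothetical and only mirror the check used in `company_detector_fun`.

```python
LEGAL_SUFFIXES = {"corp", "inc", "ltd", "llc", "sa", "ag"}

def has_legal_suffix(chunk_tokens):
    """True if the last token of the chunk is a known legal company suffix."""
    last = chunk_tokens[-1].lower().rstrip(".")
    return last in LEGAL_SUFFIXES

# Hypothetical noun chunks, already split into tokens
for chunk in (["Komatsu", "Ltd."], ["Apple", "INC"], ["the", "company"], ["Nestle", "S.A."]):
    print(chunk, has_legal_suffix(chunk))
```

The last example shows a limitation of the rule: `rstrip(".")` only removes trailing periods, so "S.A." becomes "s.a" and is not matched, even though "SA" would be.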

<br>
For non-commercial organisations, we can also look for the occurrence of words that are quite typical of public organisations or NGOs: 

In [None]:
OTHER_ORG_CUE_WORDS = {"University", "Institute", "College", "Committee", "Party", "Agency",
                       "Union", "Association", "Organization", "Court", "Office", "National"}
def other_org_detector_fun(doc):
    for chunk in doc.noun_chunks:
        if any([tok.text in OTHER_ORG_CUE_WORDS for tok in chunk]):
            yield chunk.start, chunk.end, "OTHER_ORG"

# We create the labelling function
other_org_detector = skweak.heuristics.FunctionAnnotator("other_org_detector", other_org_detector_fun)

# We run the function on the full corpus
docs = list(other_org_detector.pipe(docs))

# Show an example
skweak.utils.display_entities(docs[28], "other_org_detector")

Below are some labelling functions we added ourselves. 

1. Month Detector
2. Unit Detector
3. Adjective Detector
4. Negative Sentiment Detector (per noun chunk) using the Afinn module



In [None]:
# 1. Month detector
MONTHS = {"January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November",
          "December"}

def month_detector(doc):
    for chunk in doc.noun_chunks:
        if any([tok.text in MONTHS for tok in chunk]):
            yield chunk.start, chunk.end, "MONTH"
            
# We create the labelling function
month_detector = skweak.heuristics.FunctionAnnotator("month_detector", month_detector)

# We run the function on the full corpus
docs = list(month_detector.pipe(docs))

# Show an example
skweak.utils.display_entities(docs[28], "month_detector")

In [None]:
## 2. Unit Detector
UNITS = {"million", "billion", "mln", "bln", "bn", "thousand", "m", "k", "b", "m.", "k.", "b.", "mln.", "bln.",
              "bn.", "tons", "tonnes", "barrels", "m", "km", "miles", "kph", "mph", "kg", "°C", "dB", "ft", "gal", "gallons", "g",
              "kW", "s", "oz",
              "m2", "km2", "yards", "W", "kW", "kWh", "kWh/yr", "Gb", "MW", "kilometers", "meters", "liters", "litres", "g",
              "grams", "tons/yr",
              'pounds', 'cubits', 'degrees', 'ton', 'kilograms', 'inches', 'inch', 'megawatts', 'metres', 'feet', 'ounces',
              'watts', 'megabytes',
              'gigabytes', 'terabytes', 'hectares', 'centimeters', 'millimeters', "F", "Celsius"}

def unit_detector(doc):
    for chunk in doc.noun_chunks:
        if any([tok.text in UNITS for tok in chunk]):
            yield chunk.start, chunk.end, "UNIT"
            
# We create the labelling function
unit_detector = skweak.heuristics.FunctionAnnotator("unit_detector", unit_detector)

# We run the function on the full corpus
docs = list(unit_detector.pipe(docs))

# Show an example
skweak.utils.display_entities(docs[28], "unit_detector")

In [None]:
## 3. Adjective Detector
def adj_detector(doc):
    for chunk in doc.noun_chunks:
        if any([tok.pos_ == 'ADJ' for tok in chunk]):
            yield chunk.start, chunk.end, "ADJ"
            
# We create the labelling function
adj_detector = skweak.heuristics.FunctionAnnotator("adj_detector", adj_detector)

# We run the function on the full corpus
docs = list(adj_detector.pipe(docs))

# Show an example
skweak.utils.display_entities(docs[28], "adj_detector")


In [None]:
## 4. Negative Sentiment 
from afinn import Afinn

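# Load the Afinn lexicon once instead of on every call
afn = Afinn()

def negative_sentiment(x):
    for chunk in x.noun_chunks:
        # Label noun chunks whose Afinn score is zero or negative
        if afn.score(chunk.text) <= 0:
            yield chunk.start, chunk.end, "0"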
    
#yield 0, len(x),('0' if afn.score(x.text)<=0 else '1')

negative_sentiment = skweak.heuristics.FunctionAnnotator("negative_sentiment",negative_sentiment )

# We run the function on the full corpus
docs = list(negative_sentiment.pipe(docs))

# Show an example
skweak.utils.display_entities(docs[28], "negative_sentiment")
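
To show what a lexicon-based score of this kind looks like, here is a toy sketch that sums word valences and applies the same `<= 0` threshold as above. The lexicon values are invented for illustration and are not taken from AFINN; `TOY_LEXICON`, `toy_score` and `sentiment_label` are hypothetical names.

```python
# Invented valences for illustration only (not the real AFINN values)
TOY_LEXICON = {"loss": -3, "weak": -2, "profit": 2, "strong": 2}

def toy_score(text):
    """Sum the valences of all words found in the lexicon (unknown words count as 0)."""
    return sum(TOY_LEXICON.get(word, 0) for word in text.lower().split())

def sentiment_label(text):
    """Return "0" for zero or negative scores and "1" for positive scores."""
    return "0" if toy_score(text) <= 0 else "1"

for text in ("a strong profit", "a weak quarter", "the company"):
    print(text, toy_score(text), sentiment_label(text))
```

Note that with a `<= 0` threshold, neutral text such as "the company" receives the same label as negative text; a strict `< 0` comparison would be needed to single out negative chunks only.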


<br>
In addition to heuristics, we can also exploit _gazetteers_ that search for the occurrences of entries (often extracted from a knowledge base): 

In [None]:

# We extract the entries (from Crunchbase)
tries = skweak.gazetteers.extract_json_data("crunchbase_companies.json.gz")
gazetteer = skweak.gazetteers.GazetteerAnnotator("gazetteer", tries)
print("done building the gazetteer")

# We run the function on the full corpus
docs = list(gazetteer.pipe(docs))

# Show an example
skweak.utils.display_entities(docs[28], "gazetteer")

Extracting data from ../data/crunchbase_companies.json.gz
Populating trie for class COMPANY (number: 539174)
done building the gazetteer


<br>
And finally, we can also take advantage of machine learning models trained from data of related domains. Here, we will use a spacy model to get the usual named entities:

In [None]:

# Run a NER model trained on OntoNotes 5.0
ner = skweak.spacy.ModelAnnotator("spacy", "en_core_web_sm")
docs = list(ner.pipe(docs))

# Show an example
skweak.utils.display_entities(docs[28], "spacy")

<br> 

## Step 2: aggregation

Once the labelling functions have been applied, we must then aggregate their results, to get a single annotation for each document. This is done in `skweak` by estimating a generative model. Aggregating the labels can be done in a few lines of code: 

In [None]:
# We define the aggregation model
model = skweak.aggregation.HMM("hmm", ["COMPANY", "OTHER_ORG","UNIT","ADJ", "MONTH"])

# We indicate that "ORG" is an underspecified value, which may
# represent either COMPANY or OTHER_ORG
model.add_underspecified_label("ORG", ["COMPANY", "OTHER_ORG"])

# And run the estimation
docs = model.fit_and_aggregate(docs)

Starting iteration 1
Finished E-step with 195 documents
Starting iteration 2


         1      -70672.2650             +nan


Finished E-step with 195 documents
Starting iteration 3


         2      -64345.5949       +6326.6700


Finished E-step with 195 documents
Starting iteration 4


         3      -64269.1687         +76.4263


Finished E-step with 195 documents


         4      -64260.3083          +8.8604


In [None]:
# Note: if you are running Jupyter Notebook instead of Jupyter Lab, you need to 
# set add_tooltip=False, as Jupyter Notebook does not support HTML tooltips
skweak.utils.display_entities(docs[28], "hmm", add_tooltip=True) 
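
The role of the underspecified "ORG" label in the aggregation step (for example as emitted by the spaCy NER model) can be pictured as a set of compatible target labels. The sketch below is a deliberately simplified, non-probabilistic view written for this report; skweak handles underspecified labels inside its generative model, and `compatible_labels` and `consistent_candidates` are hypothetical helpers.

```python
LABELS = ["COMPANY", "OTHER_ORG", "UNIT", "ADJ", "MONTH"]

# An underspecified label stands for any one of several target labels
UNDERSPECIFIED = {"ORG": {"COMPANY", "OTHER_ORG"}}

def compatible_labels(output):
    """Target labels that a single labelling-function output is compatible with."""
    return UNDERSPECIFIED.get(output, {output})

def consistent_candidates(outputs):
    """Target labels compatible with every output in the list."""
    candidates = set(LABELS)
    for output in outputs:
        candidates &= compatible_labels(output)
    return candidates

print(consistent_candidates(["ORG"]))             # both organisation types remain possible
print(consistent_candidates(["ORG", "COMPANY"]))  # the specific label narrows it down
print(consistent_candidates(["ORG", "MONTH"]))    # conflicting outputs leave no candidate
```

In this picture an "ORG" prediction never contradicts a more specific COMPANY or OTHER_ORG prediction; it only adds weaker evidence that the span is some kind of organisation.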

<br>

## Step 3: Training the final model
    
Once we have finished labelling the corpus, we can then train any type of machine learning model on it!

In [None]:
for doc in docs:
    doc.ents = doc.spans["hmm"]
skweak.utils.docbin_writer(docs, "reuters_small.spacy")

Write to ../data/reuters_small.spacy...done


In [None]:
!spacy init config - --lang en --pipeline ner --optimize accuracy | \
spacy train - --paths.train /content/reuters_small.spacy  --paths.dev /content/reuters_small.spacy \
--initialize.vectors en_core_web_md --output content/reuters_small


[i] Saving to output directory: ..\data\reuters_small
[i] Using CPU


2022-04-19 09:13:47.283920: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2022-04-19 09:13:47.283956: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2022-04-19 09:13:47.284371: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2022-04-19 09:13:47.284399: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
[2022-04-19 09:13:53,450] [INFO] Set up nlp object from config
[2022-04-19 09:13:53,450] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-04-19 09:13:53,466] [INFO] Created vocabulary
[2022-04-19 09:13:54,552] [INFO] Added vectors: en_core_web_md
[2022-04-19 09:13:54,709] [INFO] Finished initializing nlp object
[2022-04-19 09:14:04

[+] Initialized pipeline
[i] Pipeline: ['tok2vec', 'ner']
[i] Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     85.00    0.51    0.38    0.80    0.01
  1     200        240.32   4788.39   73.35   89.01   62.38    0.73
  2     400         42.94   1570.86   72.39   78.06   67.48    0.72
  3     600         45.79   1096.95   77.36   83.96   71.73    0.77
  4     800         54.65    881.41   90.47   94.76   86.56    0.90
  5    1000        155.03    821.89   89.35   93.70   85.39    0.89
  6    1200        145.34    606.25   93.93   92.93   94.95    0.94
  7    1400        221.35    596.89   94.13   92.07   96.28    0.94
  8    1600        230.47    422.48   95.45   98.04   92.99    0.95
  9    1800        314.55    428.59   94.26   95.23   93.30    0.94
 10    2000        286.46    372.67   94.38   94.66   94.10    0.94
 11    2200        323.
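
As a quick sanity check on how to read the training table, the `ENTS_F` column is the harmonic mean (F1) of `ENTS_P` and `ENTS_R`. The short sketch below recomputes it for two rows of the log; the helper name `f1` is our own.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (ENTS_P, ENTS_R) copied from epochs 1 and 4 of the training log
rows = {1: (89.01, 62.38), 4: (94.76, 86.56)}
for epoch, (p, r) in rows.items():
    print(epoch, round(f1(p, r), 2))
```

The recomputed values agree with the logged `ENTS_F` of 73.35 and 90.47. Note that the training command passes the same `reuters_small.spacy` file as both `--paths.train` and `--paths.dev`, so these scores measure how well the model reproduces the aggregated labels it was trained on rather than performance on held-out data.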

This is of course just a very short example. Please look at domain-specific Jupyter notebooks in `examples/ner` and `examples/sentiment` directories for more details.

# Conclusion and Future Direction


**Conclusion**

The skweak toolkit helps to obtain labelled data using weak supervision, eliminating manual labelling. The toolkit provides
a Python API to apply labelling functions and aggregate their results in a few lines of code. The toolkit can be applied
to both sequence labelling and text classification and comes with a range of functionalities, such as the integration of underspecified labels.


**Future Direction**

The skweak implementation can be further improved by learning from the more mature implementation of Snorkel. It could incorporate semi-supervised learning approaches and, furthermore, implement multi-task weak supervision.


**Lessons Learned**

Weak supervision is a state-of-the-art methodology for applying Natural Language Processing to unlabelled data. Through a review of this paper, we now understand what weak supervision is and how it is implemented, how labelling functions can help extract labels from unlabelled data, and how the skweak library can be used for Named Entity Recognition. We were able to identify parallels and differences with other articles that use weak supervision and understand that there are more mature weak supervision toolkits available for use.

# References:

[1]: Dasagrandhi, Charan Sai. “Understanding Named Entity Recognition Pre-Trained Models.” Blog, https://blog.vsoftconsulting.com/blog/understanding-named-entity-recognition-pre-trained-models#:~:text=Named%20Entity%20Recognition%20(NER)%20is,entity%20chunking%20and%20entity%20extraction. 

[2]: Fu, Daniel Y., et al. “Fast and Three-Rious: Speeding up Weak Supervision with Triplet Methods.” ArXiv.org, 15 July 2020, https://arxiv.org/abs/2002.11955. 

[3]: “Information Extraction from Text Python.” Analytics Vidhya, 23 Dec. 2020, https://www.analyticsvidhya.com/blog/2020/06/nlp-project-information-extraction/. 

[4]: Ratner, Alexander, et al. “Snorkel: Rapid Training Data Creation with Weak Supervision.” ArXiv.org, 28 Nov. 2017, https://arxiv.org/abs/1711.10160. 

[5]: Lison, Pierre, et al. “skweak: Weak Supervision Made Easy for NLP.” ArXiv.org, https://arxiv.org/pdf/2104.09683v1. 

[6]: “spaCy 101: Everything You Need to Know.” spaCy Usage Documentation, https://spacy.io/usage/spacy-101. 

[7]: Tran, Khuyen. “Snorkel - Programmatically Build Training Data in Python.” Medium, Towards Data Science, 30 Jan. 2022, https://towardsdatascience.com/snorkel-programmatically-build-training-data-in-python-712fc39649fe. 

[8]: Safranchik, Esteban, et al. “Weakly Supervised Sequence Tagging from Noisy Rules.” https://cs.brown.edu/people/sbach/files/safranchik-aaai20.pdf. 

[9]: “🐭 Weakly Supervised NER with Skweak.” Rubrix 0.13 Documentation, https://rubrix.readthedocs.io/en/stable/tutorials/skweak.html. 