# Title: skweak: Weak Supervision Made Easy for NLP

#### Members' Names : Bhakti Bhatt & Yashi Garg

####  Emails: bhakti.bhatt@ryerson.ca & yashi.garg@ryerson.ca

# Introduction:

#### Problem Description:

In NLP, accurately labeled data is scarce when dealing with specialized domains and internal business initiatives. NLP projects undertaken by businesses often deal with the lack of labeled data – especially when the business defines domain specific (e.g. internal) labels and cannot make use of pre-existing resources. In many cases, there is a need to rely on massive sets of hand-labeled domain-specific training data or large pre-trained language models.

#### Context of the Problem:

In the modern world, text documents are available in virtually unlimited quantities. Unfortunately, meaningful labeled data is scarce. The issue is magnified for resource-poor languages and uncommon textual domains, and it extends to projects without pre-existing datasets.

This problem is important because machine learning models require labeled data for supervised learning, and the scarcity of accurately labeled data makes these models difficult to train.

#### Limitation About other Approaches:

There are many ways to get more labeled training data; however, each has its own drawbacks.



*   **Traditional Supervision (Hand-Labeled Data by Subject Matter Experts (SMEs))**

     Hand-labeled training datasets are expensive and time-consuming to create, and cannot swiftly accommodate change when new labelling guidelines are introduced. For example, if a new domain-specific label were introduced, SMEs would need to review the full training dataset.


*   **Semi-supervised Learning (Use of structural assumptions on unlabeled data)**

     The semi-supervised learning approach takes a small labelled dataset and a large unlabelled dataset to extract structural assumptions. However, it needs some labeled data. 

*	**Transfer Learning (Use of pre-trained models)**

     Transfer Learning makes use of already existing pre-trained models and fine-tunes them on a different task. However, transfer learning only works if the initial and target problems are similar enough for the first round of training to be relevant. This is often not the case in cross sector NLP projects. 


#### Solution:

The solution to these problems is provided by the weak supervision paradigm. Weak supervision saves the time that is required to label data by hand. It takes noisy, conflicting or less accurate labels as input, then aggregates them into a single set of labels without conflicts. This aggregated data can then be used to train an NLP model for tasks such as sentiment analysis and named entity recognition. The skweak framework relies on weak supervision to **programmatically label data points** through a collection of labelling functions, which are discussed further in this report. 

# Background


| Reference | Explanation | Dataset/Input | Weakness |
| --- | --- | --- | --- |
| Ratner et al. [1] | The Snorkel implementation of weak supervision combines various supervision sources using a generative model. | Labelling Functions, Pre-trained Discriminative Models | Snorkel requires data points to be independent, making it difficult to apply to sequence labelling tasks. This is improved in the skweak implementation. The future outlook comprises increasing unification with transfer learning and implementing multi-task weak supervision to accommodate noisy data. |
| Fries et al. [2] | The Swellshark implementation of weak supervision is optimised for biomedical Named Entity Recognition (NER). | Labelling Functions, Pre-trained Discriminative Models | Swellshark relies on a separate, ad-hoc mechanism to generate in-line labels. |
| Fu et al. [3] | The FlyingSquid implementation of weak supervision uses an approach called triplet methods, which is fast and applicable to structured prediction problems. | Labelling Functions | The aggregation model of FlyingSquid focuses on estimating the accuracy of each labelling function and is therefore difficult to apply to problems where labelling sources have varying precision/recall trade-offs. |
| Safranchik et al. [4] | The Safranchik paper proposes a model based on linked hidden Markov models (HMMs). This is closely related to skweak. | Tagging Rules, Linking Rules | Compared to skweak, the Safranchik paper does not provide the ability to include document-level constraints. |
| Lison et al. [5] | The skweak implementation of weak supervision aggregates the input data and models using an HMM. It also allows the input of document-level constraints to generalize the labels at the document level as opposed to the token level. | Labelling Functions, Pre-trained Discriminative Models, Document-level constraints | Following the more mature implementation of Snorkel, skweak could incorporate semi-supervised learning approaches into its current implementation, and furthermore implement multi-task weak supervision. |

# Methodology

Skweak allows users to define labelling functions that automatically annotate text documents from the target domain. The results of those labelling functions are then aggregated into one single annotation layer using a generative model.


<img src='/content/skweak_procedure.PNG'>


As shown above, there are three primary steps to follow when using the skweak library.

**Start: Preprocessing**

Prepare the (unlabelled) corpus onto which the labelling functions will be applied. As skweak is built on top of spaCy, it operates on spaCy Doc objects, so the input data needs to be converted to Doc objects with spaCy before the labelling functions can be applied.

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python and is designed to help build NLP systems. The spaCy library is foundational to all the code for skweak.

**Step 1: Labelling functions**

The first step requires one to define labelling functions. One can think of labelling functions as "enablers" that help ML models be more self-sufficient for learning.

The goal of labelling functions is to take spaCy Doc objects as input and annotate spans with labels. Note that a Token refers to a single word, punctuation symbol, whitespace, etc. from a document, while a Span is an ordered sequence of Tokens. 
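To make the token/span distinction concrete, the sketch below uses a plain Python list of strings as a stand-in for a spaCy `Doc` (an assumption made to keep the example dependency-free; `span_text` is a hypothetical helper, not part of skweak or spaCy). A span is written as `(start, end, label)` with an exclusive end index, which is the same convention the labelling functions in this report yield.

```python
# Simplified stand-in for a spaCy Doc: a list of token strings.
tokens = ["Apple", "Inc.", "paid", "$", "5", "million", "."]

# A span is (start, end, label); `end` is exclusive, as in Python slicing.
spans = [(0, 2, "COMPANY"), (3, 5, "MONEY")]

def span_text(tokens, span):
    """Return the text covered by a (start, end, label) span."""
    start, end, _label = span
    return " ".join(tokens[start:end])

for span in spans:
    print(span[2], "->", span_text(tokens, span))
```

Because the end index is exclusive, a single-token span at position `i` is written `(i, i + 1, label)`.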

There are different types of labelling functions that skweak supports.

1. **Heuristics**

     Heuristics are rule-based labelling functions, which makes them the simplest form of labelling function. For example, a heuristic created to detect entities of type COMPANY could simply search for text spans ending with a legal company type (such as “Inc.”).
     
     To build a heuristic, define a Python function that takes a spaCy Doc object as input and yields the labelled spans. An example is shown below:

       def money_detector(doc):
           """Searches for occurrences of MONEY entities in text"""
           for tok in doc[1:]:
               # check whether the token starts with a digit
               # and whether the previous token is a currency symbol
               if tok.text[0].isdigit() and tok.nbor(-1).is_currency:
                   # yield the start and (exclusive) end of the span and its label
                   yield tok.i-1, tok.i+1, "MONEY"

2. **Machine learning models**

     Skweak also supports the input of machine learning models. Typically, those models will be trained on data from related domains, thereby leading to some form of transfer learning across domains. 

     Let's take a look at a quick example on how a Machine Learning Model can be taken as input in skweak. Below we define two pre-trained models.
     * en_core_web_md is the standard spaCy model for English, trained on OntoNotes v5, which has 18 entity classes including PERSON, ORG, MONEY, DATE, GPE (geopolitical locations), LOC ("natural locations" such as continents, seas, etc.), CARDINAL, ORDINAL and QUANTITY. The full list of labels can be found here: https://github.com/explosion/spacy-models/blob/master/meta/en_core_web_md-3.2.0.json.

     * conll2003 is another model pre-trained for NER. It has four labels: PER(SON), ORG, MISC and LOC. Details on the dataset can be found here: https://www.clips.uantwerpen.be/conll2003/ner/

  We can apply annotations from a Spacy model using the ModelAnnotator class. As input, the Model Annotator takes the name and path of a pre-trained model. The model is then run on spaCy doc objects and it returns the annotated documents. 

       # apply annotations from the core_web_md pre-trained model
       core_web_annotator = skweak.spacy.ModelAnnotator("core_web_md", "en_core_web_md")
       doc = core_web_annotator(doc)

       # the annotations are stored in the spans of the annotated docs
       [(ent, ent.label_) for ent in doc.spans["core_web_md"]]
       
       # apply annotations from the conll2003 pre-trained model
       annotator = skweak.spacy.ModelAnnotator("conll2003", "/content/conll2003/")
       doc = annotator(doc)
       
       # the annotations are stored in the spans of the annotated docs
       [(ent, ent.label_) for ent in doc.spans["conll2003"]]
     
      Note that there are differences between the entity labels of these models: while core_web_md has 18 classes, conll2003 only contains PER(SON), ORG, LOC and MISC. Furthermore, the labels do not match each other perfectly: while OntoNotes distinguishes between geopolitical locations (GPE) and "natural" locations (such as continents, seas etc., labelled as LOC), conll2003 groups all geographical entities as LOC. 
      
      There is no need to worry if the input data is conflicting. In fact, any weak supervision model is expected to take noisy data as input, with the final goal of de-noising it. Similarly, the aim of skweak is to generate a large amount of training data in a short period of time by taking noisy or conflicting data and resolving the conflicts in the aggregation layer, which is discussed further on. 

3. **Gazetteers**
     
     Gazetteers are functions that search for entities in a document, by cross referencing a list of words or phrases from a different source. For example, Geonames is a geographical database that contains over eleven million placenames that are available for download online. A gazetteer labelling function can be created to search within a doc object for any placenames and label them as "LOC". In general, a gazetteer is helpful when searching for names of persons, organisations and locations by cross-referencing external sources. 

     There are many such external sources that can be used to build gazetteers, and some examples are implemented in the src/NER_step_by_step.ipynb file: PRODUCT NAMES, GEONAMES, CRUNCHBASE (includes a list of company names and business persons), WIKIDATA (includes annotations for PER, LOC, GPE, ORG, PRODUCT).

     
4. **Document-level functions**
     
     A functionality that sets skweak apart from other weak supervision libraries is the ability to create document-level labelling functions that rely on the global document context to produce a label. For example, document-level functions can encode rules such as:

    * If a person's name has been mentioned in full (at least two consecutive tokens, most often first name followed by last name), then mark future occurrences of the last token (last name) as a PER as well. 
    * If an organisation has been mentioned together with a legal type, mark all other occurrences (possibly without the legal type at the end) also as a COMPANY.
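The gazetteer and document-level ideas can be sketched without any library. The example below is a simplified illustration on plain token lists, not skweak's implementation: `gazetteer_spans` and `propagate_last_names` are hypothetical helper names, the gazetteer is a small in-memory set rather than a large external database, and overlapping matches are not filtered.

```python
def gazetteer_spans(tokens, entries, label):
    """Yield (start, end, label) for every occurrence of a gazetteer entry.

    `entries` is a set of token tuples, e.g. {("New", "York")}.
    """
    max_len = max((len(entry) for entry in entries), default=0)
    for start in range(len(tokens)):
        # Prefer the longest entry that matches at this position.
        for length in range(min(max_len, len(tokens) - start), 0, -1):
            if tuple(tokens[start:start + length]) in entries:
                yield start, start + length, label
                break

def propagate_last_names(tokens, person_spans):
    """Document-level rule: once a full name (two or more tokens) is seen,
    label later stand-alone mentions of its last token as PER."""
    for start, end, _label in person_spans:
        if end - start < 2:
            continue
        last_name = tokens[end - 1]
        for i in range(end, len(tokens)):
            if tokens[i] == last_name:
                yield i, i + 1, "PER"

tokens = "Jane Smith flew to New York . Smith returned .".split()
loc_spans = list(gazetteer_spans(tokens, {("New", "York")}, "LOC"))
per_spans = list(propagate_last_names(tokens, [(0, 2, "PER")]))
print(loc_spans, per_spans)
```

The document-level rule needs the output of another labelling function (the full-name span) as input, which is why such functions are applied after the token-level ones.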

All the labelling functions have to be applied to your corpus prior to moving onto the next step. By doing so, you will have generated (token/span, label) which will later be aggregated.

 Once the labelling functions have been applied to your corpus, we aggregate their results in order to obtain a single, probabilistic annotation (instead of the multiple, possibly conflicting annotations from the labelling functions).

**Step 2: Aggregation model**

A generative model is used to aggregate the labelled tokens/spans. Depending on the task at hand, the aggregators defined in skweak use either an HMM or a Naive Bayes model. As we do not have access to labelled data for the target domain, the model parameters are estimated in a fully unsupervised manner.

The aggregation model has the following components:

* List of j labeling functions {λ1, ...λj} 
* List of labels that can be produced by labeling functions {l1, ... lL} 
* List of observations Yij - annotations from each labeling function
* List of tokens/spans i  ∈ {1, ... n}
* List of S states for the tokens/spans (i.e. the true (hidden) labels for each token) {l1, ...lS}

and the respective probability distributions: 

* initial probabilities of the label on a token/span
* transition probability matrix of moving from one token/span to another
* emission/output probability matrix (one for each labelling function)
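One way to see how these components fit together is to write out the joint probability of a hidden label sequence and the observed annotations. The formula below is a sketch under the standard HMM assumptions, and it additionally assumes that the labelling functions are independent of each other given the hidden label. The symbols $\pi$ (initial probabilities), $A$ (transition matrix), $B^{(j)}$ (emission matrix of labelling function $\lambda_j$), $s_i$ (hidden label of token $i$) and $J$ (number of labelling functions) are notation introduced here for illustration.

```latex
P(s_1, \dots, s_n, Y)
  = \pi_{s_1}
    \prod_{i=2}^{n} A_{s_{i-1},\, s_i}
    \prod_{i=1}^{n} \prod_{j=1}^{J} B^{(j)}_{s_i,\, Y_{ij}}
```

The first factor is the initial probability of the first label, the second chains the label-to-label transitions, and the third multiplies in the probability that each labelling function emits its observed annotation $Y_{ij}$ when the true label is $s_i$.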

To initialise the HMM, the above parameters must, at minimum, be estimated. The initial transition probabilities are estimated using a Majority Voter (MV) mechanism that predicts the most likely labels based on the “votes” for each label, where each labelling function corresponds to a voter. Similarly, the emission probabilities are estimated using the majority voter technique, and an emission probability matrix is created for each labelling function. The transition matrix is of size |S| × |S|, where S is the set of hidden states (labels), not the set of tokens. It resembles the adjacency matrix we learned in class, except that each entry holds the probability of moving from label A on one token to label B on the next rather than a 0/1 flag, so each row sums to 1.
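The initialisation idea can be illustrated with a small, dependency-free sketch. It is not skweak's code: `majority_vote` and `estimate_transitions` are hypothetical helpers, the vote table is made up, and add-one smoothing is an assumption chosen so that unseen transitions keep a non-zero probability.

```python
from collections import Counter, defaultdict

def majority_vote(votes_per_token, default="O"):
    """Pick the most frequent non-abstaining label for each token.

    `votes_per_token[i]` lists the labels proposed for token i by the
    labelling functions; None means that the function abstained.
    """
    voted = []
    for votes in votes_per_token:
        counts = Counter(v for v in votes if v is not None)
        voted.append(counts.most_common(1)[0][0] if counts else default)
    return voted

def estimate_transitions(sequence, states, smoothing=1.0):
    """Turn label-to-label counts into row-normalised probabilities."""
    counts = defaultdict(Counter)
    for prev, curr in zip(sequence, sequence[1:]):
        counts[prev][curr] += 1
    matrix = {}
    for a in states:
        total = sum(counts[a][b] for b in states) + smoothing * len(states)
        matrix[a] = {b: (counts[a][b] + smoothing) / total for b in states}
    return matrix

# Three labelling functions voting on five tokens (None = abstain).
votes = [
    ["COMPANY", "COMPANY", None],
    ["COMPANY", None, "COMPANY"],
    [None, None, None],
    ["MONEY", None, "MONEY"],
    ["MONEY", "CARDINAL", "MONEY"],
]
states = ["O", "COMPANY", "MONEY"]
voted = majority_vote(votes)
transitions = estimate_transitions(voted, states)
print(voted)
print(transitions["COMPANY"])
```

Each row of `transitions` is a probability distribution over the next label, which is the property the |S| × |S| transition matrix must satisfy.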

Once the parameters of the model are initialised, we can fit the aggregation model and get the desired output: **aggregated labels**. The parameters are optimised with several iterations of the Baum-Welch algorithm, a special case of the Expectation-Maximization (EM) algorithm that is used to find the unknown parameters of a hidden Markov model (HMM). Each iteration runs a forward-backward pass over the documents, and the result is what we call the best-model.
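The forward-backward pass at the heart of each Baum-Welch iteration can be sketched in a few lines. The example below is a minimal illustration, not skweak's implementation: it handles a single observation stream (one labelling function) instead of several, uses made-up probabilities, and skips the numerical scaling that longer sequences would need.

```python
def forward_backward(init, trans, emit, observations):
    """Return per-position posterior state probabilities and the likelihood.

    init[s]     : P(first state = s)
    trans[a][b] : P(next state = b | current state = a)
    emit[s][o]  : P(observation = o | state = s)
    """
    states = list(init)
    n = len(observations)

    # Forward pass: alpha[i][s] = P(obs[0..i], state_i = s)
    alpha = [{s: init[s] * emit[s][observations[0]] for s in states}]
    for i in range(1, n):
        alpha.append({
            s: emit[s][observations[i]]
            * sum(alpha[i - 1][p] * trans[p][s] for p in states)
            for s in states
        })

    # Backward pass: beta[i][s] = P(obs[i+1..] | state_i = s)
    beta = [dict.fromkeys(states, 1.0) for _ in range(n)]
    for i in range(n - 2, -1, -1):
        for s in states:
            beta[i][s] = sum(
                trans[s][q] * emit[q][observations[i + 1]] * beta[i + 1][q]
                for q in states
            )

    likelihood = sum(alpha[-1].values())
    posteriors = [
        {s: alpha[i][s] * beta[i][s] / likelihood for s in states}
        for i in range(n)
    ]
    return posteriors, likelihood

# Toy model with two hidden labels and one noisy labelling function.
init = {"O": 0.8, "ENT": 0.2}
trans = {"O": {"O": 0.7, "ENT": 0.3}, "ENT": {"O": 0.4, "ENT": 0.6}}
emit = {"O": {"O": 0.9, "ENT": 0.1}, "ENT": {"O": 0.2, "ENT": 0.8}}
posteriors, likelihood = forward_backward(init, trans, emit, ["O", "ENT", "ENT"])
print(posteriors[1], likelihood)
```

In the full algorithm, these posteriors (the E-step) are used to re-estimate the initial, transition and emission probabilities (the M-step), and the two steps repeat until the likelihood stops improving.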

**Step 3: Train Final model**

Finally, based on those aggregated labels, we train our final model. Step 2 gives us a labelled corpus that (probabilistically) aggregates the outputs of all labelling functions, and you can use this labelled data to estimate any kind of machine learning model. You can either re-train your data on the best-model that was output through the EM iterations, or train your own NER model.

# Implementation

#### Install all required libraries


* skweak is the implementation of weak supervision.
* spaCy is one of the key libraries used in NLP to process larger texts.
* en_core_web_sm and en_core_web_md are pre-trained English spaCy pipelines.


In [47]:
pip install skweak



In [48]:
pip install -U spacy



In [49]:
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_md

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Collecting en-core-web-md==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.2.0/en_core_web_md-3.2.0-py3-none-any.whl (45.7 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


### Start: Preprocessing Data

We have a small corpus of 200 news articles that we wish to annotate with five label types: 
- companies
- other (non-commercial) organisations
- adjectives
- units
- Month

The first step is to extract the texts from the corpus:

In [50]:
import tarfile

# We retrieve the texts
texts = []
# The context manager closes the archive once all texts have been read
with tarfile.open("reuters_small.tar.gz") as archive_file:
    for archive_member in archive_file.getnames():
        if archive_member.endswith(".txt"):
            text = archive_file.extractfile(archive_member).read().decode("utf8")
            texts.append(text)

We can now run spaCy on those texts to obtain `Doc` objects. spaCy is designed to help build applications that process and “understand” large volumes of text. It can be used to **build information extraction** or **natural language understanding systems**, or to **pre-process text for deep learning**. 

In [51]:
import spacy

# We run spacy on the texts    
nlp = spacy.load("en_core_web_sm", disable=["ner", "lemmatizer"])
docs = list(nlp.pipe(texts))


<br>

## Step 1: Labelling functions

Labelling functions are at the core of `skweak`. They take a `Doc` as input and return a list of spans with their associated labels. 

One simple type of labelling functions are heuristics. For instance, we can write that commercial companies may be recognized by their legal suffix (such as Corp.):

In [52]:
import skweak

def company_detector_fun(doc):
    for chunk in doc.noun_chunks:
        if chunk[-1].lower_.rstrip(".") in {'corp', 'inc', 'ltd', 'llc', 'sa', 'ag'}:
            yield chunk.start, chunk.end, "COMPANY"

# We create the labelling function by giving it a name, and a function to apply
company_detector = skweak.heuristics.FunctionAnnotator("company_detector", company_detector_fun)

# We run the function on the full corpus
docs = list(company_detector.pipe(docs))

# Show an example
skweak.utils.display_entities(docs[28], "company_detector")

<br>
For non-commercial organisations, we can also look for the occurrence of words that are quite typical of public organisations or NGOs: 

In [53]:
OTHER_ORG_CUE_WORDS = {"University", "Institute", "College", "Committee", "Party", "Agency",
                       "Union", "Association", "Organization", "Court", "Office", "National"}
def other_org_detector_fun(doc):
    for chunk in doc.noun_chunks:
        if any([tok.text in OTHER_ORG_CUE_WORDS for tok in chunk]):
            yield chunk.start, chunk.end, "OTHER_ORG"

# We create the labelling function
other_org_detector = skweak.heuristics.FunctionAnnotator("other_org_detector", other_org_detector_fun)

# We run the function on the full corpus
docs = list(other_org_detector.pipe(docs))

# Show an example
skweak.utils.display_entities(docs[28], "other_org_detector")

Below are some labelling functions we added ourselves. 

1. Month Detector
2. Unit Detector
3. Adjective Detector


In [54]:
# 1. Month detector
MONTHS = {"January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November",
          "December"}

def month_detector_fun(doc):
    for chunk in doc.noun_chunks:
        if any(tok.text in MONTHS for tok in chunk):
            yield chunk.start, chunk.end, "MONTH"
            
# We create the labelling function (the function keeps its own name,
# so re-running this cell does not wrap the annotator in itself)
month_detector = skweak.heuristics.FunctionAnnotator("month_detector", month_detector_fun)

# We run the function on the full corpus
docs = list(month_detector.pipe(docs))

# Show an example
skweak.utils.display_entities(docs[28], "month_detector")

In [55]:
## 2. Unit Detector
UNITS = {"million", "billion", "mln", "bln", "bn", "thousand", "m", "k", "b", "m.", "k.", "b.", "mln.", "bln.",
              "bn.", "tons", "tonnes", "barrels", "m", "km", "miles", "kph", "mph", "kg", "°C", "dB", "ft", "gal", "gallons", "g",
              "kW", "s", "oz", "m2", "km2", "yards", "W", "kW", "kWh", "kWh/yr", "Gb", "MW", "kilometers", "meters", "liters", "litres", "g",
              "grams", "tons/yr", 'pounds', 'cubits', 'degrees', 'ton', 'kilograms', 'inches', 'inch', 'megawatts', 'metres', 'feet', 'ounces',
              'watts', 'megabytes', 'gigabytes', 'terabytes', 'hectares', 'centimeters', 'millimeters', "F", "Celsius"}

def unit_detector_fun(doc):
    # Iterate over the tokens of the already-parsed Doc instead of
    # re-running the spaCy pipeline on the raw text
    for tok in doc:
        if tok.text in UNITS:
            yield tok.i, tok.i + 1, "UNIT"
            
# We create the labelling function
unit_detector = skweak.heuristics.FunctionAnnotator("unit_detector", unit_detector_fun)

# We run the function on the full corpus
docs = list(unit_detector.pipe(docs))

# Show an example
skweak.utils.display_entities(docs[28], "unit_detector")

In [56]:
## 3. Adjective Detector
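def adj_detector_fun(doc):
    # The Docs already carry part-of-speech tags from the preprocessing step,
    # so there is no need to re-run the spaCy pipeline on the raw text
    for tok in doc:
        if tok.pos_ == 'ADJ':
            yield tok.i, tok.i + 1, "ADJ"
            
# We create the labelling function
adj_detector = skweak.heuristics.FunctionAnnotator("adj_detector", adj_detector_fun)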

# We run the function on the full corpus
docs = list(adj_detector.pipe(docs))

# Show an example
skweak.utils.display_entities(docs[28], "adj_detector")


<br>
In addition to heuristics, we can also use  _gazetteers_ that search for the occurrences of entries (often extracted from an external source): 

In [57]:
# We extract the entries (from Crunchbase)
tries = skweak.gazetteers.extract_json_data("crunchbase_companies.json.gz")
gazetteer = skweak.gazetteers.GazetteerAnnotator("gazetteer", tries)
print("done building the gazetteer")

# We run the function on the full corpus
docs = list(gazetteer.pipe(docs))

# Show an example
skweak.utils.display_entities(docs[28], "gazetteer")

Extracting data from crunchbase_companies.json.gz
Populating trie for class COMPANY (number: 539174)
done building the gazetteer


<br>
And finally, we can also take advantage of machine learning models trained from data of related domains. Here, we will use a spacy model to get the usual named entities:

In [58]:

# Run a NER model trained on OntoNotes 5.0
ner = skweak.spacy.ModelAnnotator("spacy", "en_core_web_sm")
docs = list(ner.pipe(docs))

# Show an example
skweak.utils.display_entities(docs[28], "spacy")

<br> 

## Step 2: aggregation

Once the labelling functions have been applied, we must then aggregate their results, to get a single annotation for each document. This is done in `skweak` by estimating a generative model. Aggregating the labels can be done in a few lines of code. However, it is important to understand that we are estimating 3 particular parameters for the HMM model.

1. initial probabilities of the label on a token/span
2. transition probability matrix of moving from one token/span to another
3. emission/output probability matrix (one for each labelling function)

In [59]:
# We define the aggregation model
model = skweak.aggregation.HMM("hmm", ["COMPANY", "OTHER_ORG","UNIT","ADJ", "MONTH"])

# We indicate that "ORG" is an underspecified value, which may
# represent either COMPANY or OTHER_ORG
model.add_underspecified_label("ORG", ["COMPANY", "OTHER_ORG"])

print("NOTE: The numbers printed at each EM iteration give the log-likelihood of the model\n"
      "with respect to the training data, followed by its change from the previous iteration.\n"
      "The log-likelihood is very low in many cases, primarily because the selected labelling\n"
      "functions conflict with each other. This can be improved by enhancing the labelling functions\n"
      "and tuning the input parameters of the model (for example, adding weights so that one\n"
      "labelling function can override others in case of conflicts).")

# And run the estimation
docs = model.fit_and_aggregate(docs)

NOTE: The numbers printed at each EM iteration give the log-likelihood of the model
with respect to the training data, followed by its change from the previous iteration.
The log-likelihood is very low in many cases, primarily because the selected labelling
functions conflict with each other. This can be improved by enhancing the labelling functions
and tuning the input parameters of the model (for example, adding weights so that one
labelling function can override others in case of conflicts).
Starting iteration 1
Finished E-step with 195 documents
Starting iteration 2


         1      -67934.8666             +nan


Finished E-step with 195 documents
Starting iteration 3


         2      -63804.6987       +4130.1679


Finished E-step with 195 documents
Starting iteration 4


         3      -63715.9821         +88.7166


Finished E-step with 195 documents


         4      -63701.5607         +14.4214
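The three columns in the EM output above appear to be the iteration number, the log-likelihood, and the change in log-likelihood since the previous iteration. The short check below, using the values copied from that output, confirms that the third column is the difference between consecutive log-likelihoods. `has_converged` and its `tolerance` argument are hypothetical and only illustrate how such a change could be used as a stopping criterion.

```python
# Log-likelihood values copied from the EM output above.
log_likelihoods = [-67934.8666, -63804.6987, -63715.9821, -63701.5607]

# Improvement between consecutive iterations (third column of the log).
improvements = [
    round(curr - prev, 4)
    for prev, curr in zip(log_likelihoods, log_likelihoods[1:])
]
print(improvements)

def has_converged(improvements, tolerance):
    """Return True once the latest improvement falls below `tolerance`."""
    return bool(improvements) and improvements[-1] < tolerance

print(has_converged(improvements, tolerance=100.0))
```

The improvements shrink quickly from one iteration to the next, which suggests that most of the gain comes from the first EM iterations.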


### Visualize the Aggregated Labels

In [60]:
skweak.utils.display_entities(docs[28], "hmm", add_tooltip=True) 

<br>

## Step 3: Training the final model
    
Once we have finished labelling the corpus, we can then train any type of machine learning model on it!

In [61]:
for doc in docs:
    doc.ents = doc.spans["hmm"]
skweak.utils.docbin_writer(docs, "reuters_small.spacy")

Write to reuters_small.spacy...done


#### Getting the best-model
The command below trains a spaCy NER model on the HMM-aggregated labels that were written to `reuters_small.spacy`. Note that the same file is used for both training and development, so the reported scores measure how well the model fits the aggregated labels rather than its accuracy on held-out data.

In [62]:
!spacy init config - --lang en --pipeline ner --optimize accuracy | \
spacy train - --paths.train /content/reuters_small.spacy  --paths.dev /content/reuters_small.spacy \
--initialize.vectors en_core_web_md --output content/reuters_small


ℹ Saving to output directory: content/reuters_small
ℹ Using CPU

[2022-04-23 08:38:49,511] [INFO] Set up nlp object from config
[2022-04-23 08:38:49,523] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-04-23 08:38:49,528] [INFO] Created vocabulary
[2022-04-23 08:38:51,048] [INFO] Added vectors: en_core_web_md
[2022-04-23 08:38:51,204] [INFO] Finished initializing nlp object
[2022-04-23 08:39:01,362] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline

ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00    101.68    0.00    0.00    0.00    0.00
  1     200        473.32   9988.82   88.80   89.15   88.45    0.89
  2     400        114.51   3801.07   91.06   95.00   87.43    0.91
  3     600         89.99   2535.98   94.65   92

The implementation above is a very short example. Please see the domain-specific Jupyter notebooks, such as `src/NER_step_by_step.ipynb`, for more implementations.

## Experimental Results

As previously mentioned, skweak can be used for both text classification and NER. We will briefly discuss the NER task and the results obtained in the original paper for their NER implementation.

* Dataset used: MUC-6 corpus with 318 Wall Street Journal articles annotated with 7 entity types: LOCATION, ORGANIZATION, PERSON, MONEY, DATE, TIME, PERCENT

* 52 labelling functions created:
     * For detecting dates, time, money amounts, cardinal/ordinal values, %, dictionary of common first names, detecting company names, NNP part-of-speech tags
     * Neural models trained on the Broad Twitter Corpus and CoNLL 2003
     * Document-level labelling functions specified for majority labels for given entity 

**Results** 

![ner_scores.PNG](/fig/ner_scores.PNG)

They evaluated the model using micro-averaged F1 scores. The F1 score indicates whether there is a good balance between precision and recall. The micro-averaged F1 score is used when there are multiple classes: it is simply the harmonic mean of the micro-averaged precision and recall, which pool the counts over all classes. The F1 scores were evaluated at both the token and entity level. When using the predicted named entities for downstream tasks, it is more useful to evaluate at the level of full named entities. For example, "New York" is split into two tokens when evaluated at the token level, but at the entity level it should be counted as a whole.
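The difference between token-level and entity-level scoring can be shown with a small sketch. The gold and predicted spans below are made up for illustration and are not taken from the paper; `micro_f1` and `to_token_items` are hypothetical helper names.

```python
def micro_f1(gold, predicted):
    """Micro-averaged precision, recall and F1 over two sets of items.

    Counts are pooled over all labels because each item carries its label.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

def to_token_items(spans):
    """Expand (start, end, label) spans into (token_index, label) items."""
    return {(i, label) for start, end, label in spans for i in range(start, end)}

# Tokens: ["She", "moved", "to", "New", "York", "from", "Paris"]
gold = {(3, 5, "LOC"), (6, 7, "LOC")}
# The prediction finds "Paris" but only "York" instead of "New York".
predicted = {(4, 5, "LOC"), (6, 7, "LOC")}

entity_scores = micro_f1(gold, predicted)
token_scores = micro_f1(to_token_items(gold), to_token_items(predicted))
print("entity-level (P, R, F1):", entity_scores)
print("token-level  (P, R, F1):", token_scores)
```

A partially correct span such as "York" earns credit at the token level but counts as a full miss at the entity level, which is one reason entity-level scores tend to be lower.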

As a baseline, the authors report the results obtained by aggregating all labelling functions using a majority voter (i.e. each token/entity takes the label assigned most frequently by the labelling functions). The HMM was then evaluated with different subsets of labelling functions. A noteworthy observation is that heuristics outperform pre-trained NER models and gazetteers when properly defined. They are simple yet powerful labelling functions. Also, it seems that document-level constraints did not have a major impact on the F1 score at either the entity or the token level. The final line indicates the results using a neural NER model trained on the HMM-aggregated labels (with all labelling functions). This model is composed of four convolutional layers with residual connections. Another noteworthy observation is that the models are consistently worse off when evaluated at the entity level as opposed to the token level. 


# Conclusion and Future Direction


**Conclusion**

The skweak toolkit helps to obtain labelled data using weak supervision, reducing the need for manual labelling. By simply defining labelling functions and aggregating their results in a few lines of code, we are able to create aggregated labels. skweak can be used for both sequence labelling and text classification.


**Future Direction**

The skweak implementation can be further improved by learning from the more mature implementation of Snorkel. It could incorporate semi-supervised learning approaches into its current implementation, and furthermore implement multi-task weak supervision.


**Lessons Learned**

Weak supervision is a state-of-the-art methodology used in Natural Language Processing of unlabelled data. Through a review of this paper, we now understand what weak supervision is and how it is implemented, how labelling functions can help extract labels from unlabelled data, and how the skweak library can be used for Named Entity Recognition. We were able to identify parallels with and differences from other articles that use weak supervision, and we understand that there are more mature weak supervision toolkits available for use.

# References:

[1]: Ratner, Alexander, et al. “Snorkel: Rapid Training Data Creation with Weak Supervision.” ArXiv.org, 28 Nov. 2017, https://arxiv.org/abs/1711.10160. 

[2]: Fries, J., Sen Wu, Alex Ratner, and Christopher Re. 2017. Swellshark: A generative model for biomedical named entity recognition without labeled data.

[3]: Fu, Daniel Y., et al. “Fast and Three-Rious: Speeding up Weak Supervision with Triplet Methods.” ArXiv.org, 15 July 2020, https://arxiv.org/abs/2002.11955. 

[4]: Esteban Safranchik, Shiying Luo, and Stephen H. Bach. 2020. Weakly supervised sequence tagging from noisy rules. In AAAI Conference on Artificial Intelligence (AAAI). https://cs.brown.edu/people/sbach/files/safranchik-aaai20.pdf. 

[5]: Lison, P., et al. “skweak: Weak Supervision Made Easy for NLP.” ArXiv.org, https://arxiv.org/pdf/2104.09683v1. 

[6]: “Spacy 101: Everything You Need to Know · Spacy Usage Documentation.” SpaCy 101: Everything You Need to Know, https://spacy.io/usage/spacy-101. 

[7]: Tran, Khuyen. “Snorkel - Programmatically Build Training Data in Python.” Medium, Towards Data Science, 30 Jan. 2022, https://towardsdatascience.com/snorkel-programmatically-build-training-data-in-python-712fc39649fe. 

[8]: Dasagrandhi, Charan Sai. “Understanding Named Entity Recognition Pre-Trained Models.” Blog, https://blog.vsoftconsulting.com/blog/understanding-named-entity-recognition-pre-trained-models#:~:text=Named%20Entity%20Recognition%20(NER)%20is,entity%20chunking%20and%20entity%20extraction. 

[9]: Pierre Lison, Jeremy Barnes, Aliaksandr Hubin, and Samia Touileb. 2020. Named entity recognition without labelled data: A weak supervision approach. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1518–1533. Association for Computational Linguistics. https://aclanthology.org/2020.acl-main.139.pdf.
