# NLP Pipelines

This package generalizes the fit/predict pattern to entire end-to-end pipelines for text. Its methods can be used individually, or combined in a pipeline object that manages execution of training and/or testing. See README.md for an overview of the available methods.

## This Demo Set

This set of demos walks through:

* **Basic Usage Demos**: Using abstract/toy datasets to show how to use this library.
  * **Usage of the library as individual methods**: Using each of the methods manually to get results.
  * **Usage of the pipeline constructor**: A reimplementation of the examples above using the pipeline object.
  * **Saving and loading trained pipelines**: Quick guidance on how to separate training and running/prediction.
* **Practical Demos**: With real data and justifying decisions on pipeline design.
  * **Practical Demo: Clustering**: A real end-to-end example on clustering, where "clustering" is picking some set of groups without explicit mapping to classes.
  * **Practical Demo: Classification**: A real end-to-end example on classification, where "classification" is picking which of a set of classes best applies to a document.
  * **Practical Demo: Extractive Labeling**: A real end-to-end example on extractive labeling, where "extractive labeling" is picking which words or phrases in a document "seem important", like automatic keyword extraction.
  * **Practical Demo: Predictive Labeling**: A real end-to-end example on predictive labeling, where a set of possible labels is chosen a priori, and anywhere from none to all of the possible labels may apply to a document.
  * **(WIP) Practical Demo: Mixing Methods**:  A real end-to-end example on using different kinds of methods together to get simpler results.

NOTE that this is not intended to be an exhaustive demo of all of the different methods: `test_demos.ipynb` may be a better choice for that.

## Setup Dependencies
For portability, this cell installs the dependencies for this library and for the notebook itself.

In [2]:
# install dependencies
%pip install -r requirements.txt 

Note: you may need to restart the kernel to use updated packages.


# Basic Usage Demos

## Datasets

We need different datasets for different pipelines. For this basic usage section, the datasets are fake and deliberately unrealistic, chosen to demonstrate the shape of the inputs and outputs of the steps and pipelines. True labels are called "truths" in this library. We'll need one dataset with no truths for clustering and extractive labeling, one with classes as truths for classification, and one with lists of labels as truths for labeling. (N.b. leaving the truths out of the first dataset is not strictly necessary, since clustering and extractive labeling methods simply ignore truths, but including them here would likely be confusing.)

In [13]:
from nlp_pipelines.dataset import Dataset

# methods which do not need truths of any form get the unlabeled_dataset.
texts_1 = [
    "I love hiking in the mountains.",
    "The sun is shining bright today.",
    "Reading books is my favorite hobby.",
    "I enjoy outdoor activities like camping.",
    "It's a beautiful day for a walk in the park.",
    "I prefer staying indoors and watching movies.",
    "Cooking new recipes is always fun."
]

unlabeled_dataset = Dataset(texts_1)

# methods which need truth classes (exactly one per document) get the classed_dataset.
texts_2 = [
    "I love this movie",
    "This is terrible",
    "Fantastic work",
    "Awful experience",
    "My favorite film ever",
    "A waste of my time",
    "Not enough people have seen this masterpiece"
    ]
classes = [
    "positive",
    "negative",
    "positive",
    "negative",
    "positive",
    "negative",
    "positive"
    ]
classed_dataset = Dataset(texts_2, classes)

# methods which need truth label sets (0 to n per document) get the labeled_dataset.
texts_3 = [
    "I love baking with things from my garden.",
    "Here's a recipe for lasagna.",
    "I enjoy hiking in the mountains and camping.",
    "I have recently learned how to cook good food when hiking.",
    "I spent the weekend building a treehouse.",
    "Cooking dinner with fresh ingredients is a joy.",
    "I often take long walks in the park during summer to get inspired on what I should plant."
]

labels = [
    ["cooking", "gardening"],
    ["cooking"],
    ["outdoors"],
    ["outdoors", "cooking"],
    ["outdoors"],
    ["cooking"],
    ["outdoors", "gardening"]
]

labeled_dataset = Dataset(texts_3, labels)  # truths here are the per-document label lists, not the classes from texts_2

#also, for convenience, let's add an explicit list of possible_labels
possible_labels = ["gardening", "cooking", "outdoors"]
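
Before handing truths to a dataset, it can help to confirm they have the shape a method expects. The sketch below uses plain Python only. `describe_truths` and `unknown_labels` are hypothetical helpers written for this illustration and are not part of `nlp_pipelines`; the toy truths are copied from the cell above so that the block runs on its own.

```python
# Toy truths copied from the cell above, so this sketch runs without the library.
classes = ["positive", "negative", "positive", "negative",
           "positive", "negative", "positive"]
labels = [
    ["cooking", "gardening"],
    ["cooking"],
    ["outdoors"],
    ["outdoors", "cooking"],
    ["outdoors"],
    ["cooking"],
    ["outdoors", "gardening"],
]
possible_labels = ["gardening", "cooking", "outdoors"]


def describe_truths(truths):
    """Hypothetical helper: report which of the three truth shapes a list has."""
    if truths is None:
        return "no truths"
    if all(isinstance(t, str) for t in truths):
        return "one class per document"
    if all(isinstance(t, (list, tuple, set)) for t in truths):
        return "label set per document"
    return "mixed"


def unknown_labels(label_sets, allowed):
    """Hypothetical helper: labels used in the truths but missing from `allowed`."""
    used = {label for doc in label_sets for label in doc}
    return sorted(used - set(allowed))


print(describe_truths(None))     # clustering / extractive labeling
print(describe_truths(classes))  # classification
print(describe_truths(labels))   # predictive labeling
print(unknown_labels(labels, possible_labels))
```

A check like this makes it easy to notice when a list of classes is passed where a list of label sets was intended, or when a truth label is missing from `possible_labels`.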



## Usage of the library as individual methods

Now, let's try to get some results for our toy data.



### Clustering
As a reminder, we want to cluster the unlabeled dataset.
For now, we'll judge the results on what *feels* right for such a small dataset with no truth for comparison; proper evaluation will be explored in the next section on real data.

In [14]:

# let's make a copy of our data so as not to interfere with any other runs
import copy

cluster_dataset = copy.deepcopy(unlabeled_dataset)

# Our goal is to see which clusters the data falls into, rather than to use the clusters to classify new data, so we can fit and predict on the same data (cluster_dataset) without splitting.

# We need a vectorizer to put the text into a form which we can do math on. To keep to one model throughout, we'll use the sentence embedding model ('all-MiniLM-L6-v2').

# import
from nlp_pipelines.vectorizer import SentenceEmbedding

# initialize
embedding_model = SentenceEmbedding(model_name='all-MiniLM-L6-v2') # "all-MiniLM-L6-v2" is the default for this method, but let's make it explicit
# this is a pretrained embedding model which uses context to take a chunk of text and return a vector which is meant to represent the text in some high-dimensional space. See https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 for details of this particular model; the website should also have other options for sentence embeddings, which may be better suited to particular tasks, possibly with computational tradeoffs.

# while sentence embeddings typically do well with the full sentence uncleaned, we'll also remove uninformative words and stem, to make the sentences simpler.

# import
from nlp_pipelines.preprocess import StopwordRemove, Stem

# initialize
stopword_remover = StopwordRemove() # remove uninformative words
stemmer = Stem() # find the roots of words

# let's use k-means with k=2 for the clustering

# import
from nlp_pipelines.clusterer import Kmeans

# initialize
kmeans_model = Kmeans(num_clusters=2, random_state=101) # picks 2 points in embedding space as centroids; documents are assigned to closest centroid

# ok, now we need to run these on the dataset in order. Preprocess, then vectorize, then cluster.

# each of the methods needs to be fit on the data, and then used to transform it (or predict on it).
print("initial data:", cluster_dataset)
stopword_remover.fit(cluster_dataset)
cluster_dataset = stopword_remover.transform(cluster_dataset) # keep working on this object "cluster_dataset" step by step
print("after stopword remove:", cluster_dataset)
stemmer.fit(cluster_dataset)
cluster_dataset = stemmer.transform(cluster_dataset)
print("after stemming:", cluster_dataset)
embedding_model.fit(cluster_dataset)
cluster_dataset = embedding_model.transform(cluster_dataset)
print("after vectorization:", cluster_dataset)
kmeans_model.fit(cluster_dataset)
cluster_dataset = kmeans_model.predict(cluster_dataset)
print("after k-means:", cluster_dataset)

# finally, let's just look at the results and original texts side by side
print("original text:", cluster_dataset.original_texts)
print("cluster results:", cluster_dataset.results)


initial data: <Dataset with 7 texts
Texts: ['I love hiking in the mountains.', 'The sun is shining bright today.']... +5 more>
after stopword remove: <Dataset with 7 texts
Texts: ['love hiking mountains', 'sun shining bright today']... +5 more>
after stemming: <Dataset with 7 texts
Texts: ['love hike mountain', 'sun shine bright today']... +5 more>
after vectorization: <Dataset with 7 texts, vectors: 384-dim
Texts: ['love hike mountain', 'sun shine bright today']... +5 more>
after k-means: <Dataset with 7 texts, vectors: 384-dim, results: 7 items
Texts: ['love hike mountain', 'sun shine bright today']... +5 more\Results: [1 1]... +5 more>
original text: ['I love hiking in the mountains.', 'The sun is shining bright today.', 'Reading books is my favorite hobby.', 'I enjoy outdoor activities like camping.', "It's a beautiful day for a walk in the park.", 'I prefer staying indoors and watching movies.', 'Cooking new recipes is always fun.']
cluster results: [1 1 1 0 1 0 0]
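
To make the centroid idea concrete, here is a minimal plain-Python sketch of the textbook k-means loop (Lloyd's algorithm) on six 2-D points. It is an illustration only: the function names, the toy points, and the hand-picked starting centroids are assumptions for this example, and the library's `Kmeans` may be implemented differently (it works on the 384-dimensional embeddings and takes a `random_state` for its initialization).

```python
import math


def assign(points, centroids):
    """For each point, return the index of its nearest centroid (Euclidean distance)."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]


def update(points, assignments, centroids):
    """Move each centroid to the mean of its members; leave it in place if it has none."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, a in zip(points, assignments) if a == c]
        if members:
            new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
        else:
            new_centroids.append(old)
    return new_centroids


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    for _ in range(max_iter):
        assignments = assign(points, centroids)
        new_centroids = update(points, assignments, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return assignments, centroids


# Two obvious groups: one near the origin, one near (9, 9).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
initial = [points[0], points[3]]  # hand-picked starting centroids (an assumption)
assignments, centroids = kmeans(points, initial)
print(assignments)  # [0, 0, 0, 1, 1, 1]
```

Note that the cluster numbers themselves carry no meaning: swapping the two starting centroids would swap the 0s and 1s without changing the grouping. The same holds for the `[1 1 1 0 1 0 0]` result above.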




### Extractive Labeling (Keyword Extraction)

Labeling means assigning zero or more "labels" to each document. The simpler version of this is extractive labeling, more commonly referred to as keyword extraction. The problem is usually framed as having a large text and wanting to find its especially important or informative keywords. These are extracted from the text itself, rather than picked from a user-supplied set. Since the labels are extracted from the text, and there can be multiple labels per document, I've called this extractive labeling for internal naming consistency.

Anyway, we can do the same preprocessing as we did for clustering, but we don't need to vectorize, since this method works on the text itself rather than in an embedding space. We *could* reuse the dataset which already has these transformations and overwrite the results, but for now, let's do it from scratch for clarity.

In [None]:
# preprocessing methods were already imported above; let's re-initialize them for clarity (again, we don't *need* to do this: we could simply fit again to overwrite any information the methods picked up during the earlier preprocessing [in this case there's none]).

stopword_remover = StopwordRemove() # remove uninformative words
stemmer = Stem() # find the roots of words

# TfidfTopN is a method which uses term frequency-inverse document frequency (TF-IDF) to score how informative each word is, then just picks the top N words by that score.
# more or less, think of this method as returning words which are rare across the corpus but show up a lot in this document.

from nlp_pipelines.labeler import TfidfTopN

# initialize
tfidf_labeler = TfidfTopN(top_k=2, ngram_range=(1,3)) # 2 keywords per document, each keyword an n-gram of up to 3 words

# a fresh copy of the source dataset again
extractive_dataset = copy.deepcopy(unlabeled_dataset)

# fit and transform/predict each method

print("initial data:", extractive_dataset)
stopword_remover.fit(extractive_dataset)
extractive_dataset = stopword_remover.transform(extractive_dataset) # keep working on this object "extractive_dataset" step by step
print("after stopword remove:", extractive_dataset)
stemmer.fit(extractive_dataset)
extractive_dataset = stemmer.transform(extractive_dataset)
print("after stemming:", extractive_dataset)
tfidf_labeler.fit(extractive_dataset)
extractive_dataset = tfidf_labeler.predict(extractive_dataset)
print("after labeling:", extractive_dataset)

# finally, let's just look at the results and original texts side by side
print("original text:", extractive_dataset.original_texts)
print("labeling results:", extractive_dataset.results)

initial data: <Dataset with 7 texts
Texts: ['I love hiking in the mountains.', 'The sun is shining bright today.']... +5 more>
after stopword remove: <Dataset with 7 texts
Texts: ['love hiking mountains', 'sun shining bright today']... +5 more>
after stemming: <Dataset with 7 texts
Texts: ['love hike mountain', 'sun shine bright today']... +5 more>
after labeling: <Dataset with 7 texts, results: 7 items
Texts: ['love hike mountain', 'sun shine bright today']... +5 more\Results: [['love', 'hike'], ['today', 'sun']]... +5 more>
original text: ['I love hiking in the mountains.', 'The sun is shining bright today.', 'Reading books is my favorite hobby.', 'I enjoy outdoor activities like camping.', "It's a beautiful day for a walk in the park.", 'I prefer staying indoors and watching movies.', 'Cooking new recipes is always fun.']
labeling results: [['love', 'hike'], ['today', 'sun'], ['book', 'read'], ['activ', 'like'], ['beauti', 'day'], ['watch', 'indoor'], ['recip', 'cook']]
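
For intuition about how a score like this ranks words, here is a minimal plain-Python sketch of textbook TF-IDF followed by a top-N pick. It is an illustration under stated assumptions, not the library's implementation: `tfidf_top_n` is a hypothetical function, it handles single words only, and `TfidfTopN` may use different weighting, smoothing, normalization, and n-gram handling, so its scores and picks can differ.

```python
import math
from collections import Counter


def tfidf_top_n(docs, top_n=2):
    """Return the top_n highest-scoring single words per document (textbook TF-IDF)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: the number of documents each term appears in
    df = Counter(term for tokens in tokenized for term in set(tokens))
    results = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {
            # term frequency in this document times log inverse document frequency
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # highest score first; ties are broken alphabetically so results are deterministic
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_n])
    return results


# Already-preprocessed toy documents (an assumption made for this sketch).
docs = [
    "hike mountain hike trail",
    "cook recipe cook dinner",
    "hike dinner camp",
]
print(tfidf_top_n(docs, top_n=1))  # [['mountain'], ['cook'], ['camp']]
```

In the first document, "hike" appears twice but is also found in another document, so the rarer "mountain" outscores it. In the second, "cook" is both frequent in the document and unique to it, so it wins outright.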


## Usage of the pipeline constructor

## Saving and loading trained pipelines

# Practical Demos

In [4]:
# Dataset, what it is and why it's used.

## Clustering

## Classification

## Extractive Labeling

## Predictive Labeling

## Mixing Methods
WIP