# NLP Pipelines Demos

This notebook walks through a few demos to help you understand the nlp_pipelines package.

In [1]:
# install dependencies
%pip install -r requirements.txt 

Note: you may need to restart the kernel to use updated packages.


## Example: Simple Data, Clustered

As perhaps the simplest example, let's take a small toy corpus and cluster it. First we'll call the methods individually, then we'll use the pipeline object to simplify things.

### Using Specific Methods

You can use any of the methods directly. For example, we can clean the text or apply a bag-of-words vectorization.

In [2]:

from nlp_pipelines.vectorizer import Bow
from nlp_pipelines.dataset import Dataset
from nlp_pipelines import preprocess

# First, a simple dataset for demonstration
texts = [
    "The new stethoscope model by Littmann is available now.",
    "Philips unveils an innovative heart monitor with improved accuracy.",
    "Medtronic announces a breakthrough in robotic surgery technology.",
    "GE Healthcare's ultrasound device provides high-definition imaging.",
    "Stryker introduces a new orthopedic surgical tool.",
    "Johnson & Johnson releases a new line of surgical instruments.",
    "Siemens Healthineers develops a state-of-the-art MRI scanner.",
    "Boston Scientific launches a catheter designed for heart surgery."
]
dataset = Dataset(texts)

print(dataset)




<Dataset with 8 texts
Texts: ['The new stethoscope model by Littmann is available now.', 'Philips unveils an innovative heart monitor with improved accuracy.']... +6 more>


In [3]:
# let's remove stopwords (uninformative words)
stopword_remover = preprocess.StopwordRemove()

dataset = stopword_remover.transform(dataset)
print(dataset.texts)

['new stethoscope model Littmann available', 'Philips unveils innovative heart monitor improved accuracy', 'Medtronic announces breakthrough robotic surgery technology', 'GE Healthcare ultrasound device provides high definition imaging', 'Stryker introduces new orthopedic surgical tool', 'Johnson Johnson releases new line surgical instruments', 'Siemens Healthineers develops state art MRI scanner', 'Boston Scientific launches catheter designed heart surgery']
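
For intuition, stopword removal can be sketched in a few lines of plain Python. This is not how preprocess.StopwordRemove is implemented: the tiny stopword list, the regex tokenizer, and the helper name `remove_stopwords` are all assumptions made for illustration.

```python
import re

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"the", "by", "is", "now", "an", "with", "a", "in", "of", "for"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Drop punctuation and any token whose lowercase form is a stopword."""
    # Keep runs of letters/digits; punctuation such as "." or "&" is discarded.
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return " ".join(tok for tok in tokens if tok.lower() not in stopwords)


print(remove_stopwords("The new stethoscope model by Littmann is available now."))
# prints: new stethoscope model Littmann available
```

A real stopword list is much longer, so results on other sentences may differ from the package's output.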


In [4]:
# let's lemmatize to see what that does too

lemmatizer = preprocess.Lemmatize()

dataset = lemmatizer.transform(dataset)
print(dataset.texts)

['new stethoscope model Littmann available', 'philip unveil innovative heart monitor improved accuracy', 'medtronic announce breakthrough robotic surgery technology', 'GE Healthcare ultrasound device provide high definition imaging', 'stryker introduce new orthopedic surgical tool', 'Johnson Johnson release new line surgical instrument', 'Siemens Healthineers develop state art MRI scanner', 'Boston Scientific launch catheter design heart surgery']


In [5]:
# ok, maybe this is reasonable to vectorize?

vectorizer = Bow()

vectorizer.fit(dataset)
dataset = vectorizer.transform(dataset) # for now, the same dataset since it's all we have

print(dataset) # show us the state of the dataset
print(dataset.vectors) # show us just the vectors


<Dataset with 8 texts, vectors: 47-dim
Texts: ['new stethoscope model Littmann available', 'philip unveil innovative heart monitor improved accuracy']... +6 more>
[[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0
  0 0 0 1 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 1]
 [0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0
  0 0 0 0 0 1 0 1 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
  0 0 0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0
  0 0 0 0 1 0 1 0 1 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 2 0 1 0 0 0 0 0 1 0 0 0 1 0 0
  0 0 0 0 0 0 1 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1
  0 1 1 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
  1 0 0 0 0 1 0 0 0 0 0]]
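
To see what the count vectors above encode, here is a minimal bag-of-words sketch in plain Python. It assumes lowercase, whitespace-separated tokens and an alphabetically sorted vocabulary; the helper names `fit_vocabulary` and `to_counts` are hypothetical and are not part of nlp_pipelines.

```python
from collections import Counter


def fit_vocabulary(texts):
    """Map every distinct lowercase token to a column index (sorted order)."""
    vocab = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: idx for idx, tok in enumerate(vocab)}


def to_counts(text, vocab):
    """Turn one text into a list of per-token counts; unknown tokens are ignored."""
    counts = Counter(text.lower().split())
    vector = [0] * len(vocab)
    for tok, n in counts.items():
        if tok in vocab:
            vector[vocab[tok]] = n
    return vector


docs = [
    "new stethoscope model Littmann available",
    "Johnson Johnson release new line surgical instrument",
]
vocab = fit_vocabulary(docs)
for doc in docs:
    print(to_counts(doc, vocab))
```

Each column corresponds to one vocabulary token, so a token that occurs twice in a text, such as the repeated "Johnson", produces a 2 in its column, like the 2 in the sixth row of the matrix above.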


In [6]:
# to make this end to end, let's try to cluster to see what that gets us
from nlp_pipelines.clusterer import Kmeans

model = Kmeans(num_clusters=2, random_state=101)

model.fit(dataset)
dataset = model.predict(dataset)

print(dataset) # what's in the dataset now
print(dataset.results) # what are the results


<Dataset with 8 texts, vectors: 47-dim, results: 8 items
Texts: ['new stethoscope model Littmann available', 'philip unveil innovative heart monitor improved accuracy']... +6 more\Results: [1 1]... +6 more>
[1 1 1 1 0 0 1 1]
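
The cluster ids above come from k-means, which alternates between assigning each vector to its nearest centroid and moving each centroid to the mean of its members. The sketch below shows that loop in plain Python. It is an illustration under stated assumptions, not the package's Kmeans: it seeds the centroids with the first `num_clusters` vectors instead of a seeded random choice, and the function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def kmeans(vectors, num_clusters, max_iter=100):
    # Assumption: deterministic seeding with the first `num_clusters` vectors.
    centroids = [list(v) for v in vectors[:num_clusters]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every vector.
        new_labels = [
            min(range(num_clusters), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # converged: assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(num_clusters):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = mean(members)
    return labels


points = [[0, 0], [10, 10], [0, 1], [10, 11]]
print(kmeans(points, num_clusters=2))  # prints: [0, 1, 0, 1]
```

Cluster ids are arbitrary names, so a different seeding can swap 0 and 1 without changing the grouping.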


### Using a Pipeline

Instead of running these steps one by one, the helper class "Pipeline" lets us declare them as a list of named steps and run them all together.

In [7]:
from nlp_pipelines.pipeline import Pipeline

pipeline = Pipeline([
    {"name": "preproc1", "method": "preprocess.StopwordRemove"},
    {"name": "preproc2", "method": "preprocess.Lemmatize"},
    {"name": "vectorize", "method": "vectorizer.Bow"},
    {"name": "cluster", "method": "clusterer.Kmeans", "params":{"num_clusters":2, "random_state": 101}}
])
pipeline.set_data(train_data=dataset, run_data=dataset) # for now, train and run on the same data
pipeline.run()

print("Results:", pipeline.run_data.results)




Results: [1 1 1 1 0 0 1 1]


Same results, since it's the same pipeline.

Also, the intermediate results are still part of the pipeline's dataset.

In [8]:
# the dataset keeps the latest version of each intermediate artifact it has seen

# original text
print("Original texts:", pipeline.run_data.original_texts)
# preprocessed text
print("Preprocessed texts:", pipeline.run_data.texts)
# vectors
print("Vectors:", pipeline.run_data.vectors)

Original texts: ['The new stethoscope model by Littmann is available now.', 'Philips unveils an innovative heart monitor with improved accuracy.', 'Medtronic announces a breakthrough in robotic surgery technology.', "GE Healthcare's ultrasound device provides high-definition imaging.", 'Stryker introduces a new orthopedic surgical tool.', 'Johnson & Johnson releases a new line of surgical instruments.', 'Siemens Healthineers develops a state-of-the-art MRI scanner.', 'Boston Scientific launches a catheter designed for heart surgery.']
Preprocessed texts: ['new stethoscope model Littmann available', 'philip unveil innovative heart monitor improved accuracy', 'medtronic announce breakthrough robotic surgery technology', 'GE Healthcare ultrasound device provide high definition imaging', 'stryker introduce new orthopedic surgical tool', 'Johnson Johnson release new line surgical instrument', 'Siemens Healthineers develop state art MRI scanner', 'Boston scientific launch catheter design hea
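
The pattern behind such a pipeline can be sketched generically: look up each step's method string in a registry, apply the steps in order, and keep the intermediate state alongside the final results. Everything below is a hypothetical stand-in written for illustration (the registry, the `demo.*` method names, and the dict-based state); it does not reflect how nlp_pipelines resolves or runs its steps.

```python
# Hypothetical step functions: each takes the shared state and returns it updated.
def lowercase(state):
    state["texts"] = [t.lower() for t in state["texts"]]
    return state


def truncate(state, max_tokens):
    state["texts"] = [" ".join(t.split()[:max_tokens]) for t in state["texts"]]
    return state


def count_tokens(state):
    state["results"] = [len(t.split()) for t in state["texts"]]
    return state


# Hypothetical registry mapping method strings to callables.
REGISTRY = {
    "demo.Lowercase": lowercase,
    "demo.Truncate": truncate,
    "demo.CountTokens": count_tokens,
}


def run_pipeline(steps, texts):
    """Apply each step in order; original texts stay untouched in the state."""
    state = {"original_texts": list(texts), "texts": list(texts)}
    for step in steps:
        func = REGISTRY[step["method"]]
        state = func(state, **step.get("params", {}))
    return state


steps = [
    {"name": "lower", "method": "demo.Lowercase"},
    {"name": "shorten", "method": "demo.Truncate", "params": {"max_tokens": 2}},
    {"name": "count", "method": "demo.CountTokens"},
]
state = run_pipeline(steps, ["Hello World", "One two THREE"])
print(state["original_texts"], state["texts"], state["results"])
```

Keeping the original texts, the latest texts, and the results as separate fields is what makes the intermediate stages inspectable after the run.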

## Example: Simple Data, Classified

Now let's take a dataset with ground-truth labels ("truths") and use it to classify the documents.

In [9]:
texts = ["I love this movie", "This is terrible", "Fantastic work", "Awful experience", "It was okay"]
truths = ["positive", "negative", "positive", "negative", "neutral"]
dataset = Dataset(texts, truths)

print(dataset)

train, test = dataset.split(count=3)
print(train, test)


<Dataset with 5 texts
Texts: ['I love this movie', 'This is terrible']... +3 more
Truths: ['positive', 'negative']... +3 more>
<Dataset with 3 texts
Texts: ['I love this movie', 'This is terrible']... +1 more
Truths: ['positive', 'negative']... +1 more> <Dataset with 2 texts
Texts: ['Awful experience', 'It was okay']
Truths: ['negative', 'neutral']>


In [10]:
# the texts share almost no words, so low co-occurrence is going to break tfidf/bow; let's use a sentence embedding instead to capture context!

pipeline = Pipeline([
    {"name": "vectorize", "method": "vectorizer.SentenceEmbedding"},
    {"name": "classify", "method": "classifier.Xgboost"}
])


pipeline.set_data(train_data=train, run_data=test) # now we have different train and test data!
pipeline.run()

print("Results:", pipeline.run_data.results)
print("Truths:", pipeline.run_data.truths) # TODO evaluation code was not finished for all result types as of the writing of this.

# I ran it without a seed, and got 1/2 right, so maybe three sentences isn't enough to train a tree model ;)



Results: ['positive' 'positive']
Truths: ['negative', 'neutral']
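
Until evaluation is built into the package, predictions and truths can be compared by hand. The sketch below computes plain accuracy for single-label results; `accuracy` is a hypothetical helper, not an nlp_pipelines function, and it only assumes that results and truths are equal-length sequences.

```python
def accuracy(results, truths):
    """Fraction of predictions that exactly match the corresponding truth."""
    if len(results) != len(truths):
        raise ValueError("results and truths must have the same length")
    if len(truths) == 0:
        return 0.0
    correct = sum(1 for predicted, actual in zip(results, truths) if predicted == actual)
    return correct / len(truths)


# Values copied from the output above.
print(accuracy(["positive", "positive"], ["negative", "neutral"]))  # prints: 0.0
```

With only two test documents, a single changed prediction moves the score by 0.5, so the number says little about the model.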


## Example: Simple Data, Labeled

Finally, we have labeling (formerly "keyword extraction"), which returns 0 to n labels per document.
There are two kinds of labelers: extractive ones pick important words or phrases from the document itself, while predictive ones take a list of possible labels and try to predict which of them apply to the text.

In any case, let's start with the dataset.

In [11]:
# common dataset
texts = [
    "Patient shows symptoms of fever and cough, possible pneumonia diagnosis.",
    "Headache and nausea reported, likely migraine.",
    "Frequent urination and fatigue, potential diabetes condition.",
    "Coughing and shortness of breath, indicative of respiratory infection.",
    "Reports of dizziness, nausea, and blurred vision, possible stroke."
]

truths = [
    ["pneumonia", "respiratory infection"],
    ["migraine"],
    ["diabetes"],
    ["respiratory infection"],
    ["stroke"]
]

possible_labels = ["pneumonia", "migraine", "diabetes", "respiratory infection", "stroke"]

# Create the Dataset object
dataset = Dataset(texts, truths)

print(dataset)


<Dataset with 5 texts
Texts: ['Patient shows symptoms of fever and cough, possible pneumonia diagnosis.', 'Headache and nausea reported, likely migraine.']... +3 more
Truths: [['pneumonia', 'respiratory infection'], ['migraine']]... +3 more>


### Extractive Labeling
What are the top 2 keywords according to an extractive labeler? Let's clean the text a little, then try it!

In [12]:
pipeline = Pipeline([
    {"name": "preprocess", "method": "preprocess.Lemmatize"},
    {"name": "extract", "method": "labeler.Yake", "params":{"top_k":2}}
])


pipeline.set_data(train_data=dataset, run_data=dataset)
pipeline.run()

print("Results:", pipeline.run_data.results) # not sure how to best evaluate extractive keywords in our context.

Results: [['patient show symptom', 'patient show'], ['Headache and nausea', 'nausea report'], ['potential diabetes condition', 'frequent urination'], ['cough and shortness', 'shortness of breath'], ['report of dizziness', 'nausea']]


### Predictive Labeling
Let's use a method that predicts which labels from our list best apply to each text. We'll need to embed more things.

In [13]:
pipeline = Pipeline([
    {"name": "vectorize", "method": "vectorizer.SentenceEmbedding"},
    {"name": "predict", "method": "labeler.ThresholdSim"}
])

# we overwrote the previous dataset instance, so let's make a clean copy
dataset = Dataset(texts, truths)
train, test = dataset.split(count=3)

pipeline.set_data(train_data=train, run_data=test, possible_labels=possible_labels)
pipeline.run()

print("Results:", pipeline.run_data.results) # TODO! it only SORTS the keywords on distance right now! Do the threshold.
print("Truths:", pipeline.run_data.truths)

Results: [['migraine', 'stroke', 'pneumonia', 'diabetes', 'respiratory infection'], ['diabetes', 'respiratory infection', 'migraine', 'stroke', 'pneumonia']]
Truths: [['migraine'], ['diabetes']]
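
The results above are only ranked by distance. The missing thresholding step could look roughly like the sketch below: score every candidate label by cosine similarity to the text vector and keep those at or above a cutoff. The 2-d vectors are made-up stand-ins for sentence embeddings, and the function names and the 0.7 threshold are assumptions rather than the labeler's actual behavior.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def labels_above_threshold(text_vector, label_vectors, threshold=0.5):
    """Return labels sorted by similarity, keeping only those >= threshold."""
    scored = [
        (label, cosine_similarity(text_vector, vector))
        for label, vector in label_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [label for label, score in scored if score >= threshold]


# Toy vectors standing in for embeddings (illustration only).
label_vectors = {"migraine": [1.0, 0.0], "diabetes": [0.0, 1.0], "stroke": [1.0, 1.0]}
text_vector = [1.0, 0.1]
print(labels_above_threshold(text_vector, label_vectors, threshold=0.7))
# prints: ['migraine', 'stroke']
```

A cutoff like this lets a document receive zero labels when nothing is similar enough, which fits the "0 to n labels" behavior described earlier. A sensible threshold depends on the embedding model and would need tuning.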
