# Understanding Outliers in Text Data with Transformers, Cleanlab, and Topic Modeling

<figure>
<img 
src="https://cdn.pixabay.com/photo/2016/02/16/21/07/books-1204029_960_720.jpg" width="900"
alt="books">
    <figcaption>Image taken from 
    <a href="https://pixabay.com/photos/books-bookstore-book-reading-1204029/">Pixabay</a>.
    </figcaption>
</figure>


Many text corpora contain heterogeneous documents, some of which may be anomalous and worth a closer look. For deployed ML systems in particular, we may want to automatically flag test documents that do not stem from the same distribution as the training data, and to understand emerging themes in these new documents that were absent from the training data. This post demonstrates how to find anomalous texts in large NLP corpora using open-source Python libraries like Hugging Face Transformers, cleanlab, and PyTorch, and how to discover new topics within these texts using c-TF-IDF in order to better understand the anomalies.


We will use the [**MultiNLI** dataset on the Hugging Face Hub](https://huggingface.co/datasets/multi_nli), a natural language inference dataset commonly used to train language understanding models.

- The dataset contains (premise, hypothesis) sentence pairs, each labelled by whether the premise entails the hypothesis (`"entailment"`), contradicts it (`"contradiction"`), or neither (`"neutral"`).
- The corpus is split into a single training set and two validation sets.
    - The training set is sourced from 5 different genres: `[fiction, government, slate, telephone, travel]`.
    - The *matched validation set* is sourced from genres that *match* those in the training set.
    - The other validation set, also referred to as the *mismatched validation set*, is sourced from *other* genres not present in the training data: `[nineeleven, facetoface, letters, oup, verbatim]`. 
- More information about the corpus can be found [here](https://cims.nyu.edu/~sbowman/multinli/).



The steps in this post can be applied with your own word/sentence embedding models and any dataset containing multiple sources of text.




#### Too Long; Didn't Run (the code)
Here's our general workflow for detecting outliers from multiple text sources and finding new topics within them:



- Load and preprocess text datasets from the Hugging Face Hub to create PyTorch datasets.
- Apply a pretrained sentence embedding model to create vector embeddings from the text.
    - Here we utilize a bi-encoder based on a siamese neural network from the [SentenceTransformers](https://huggingface.co/sentence-transformers) library.
- Use the [cleanlab](https://github.com/cleanlab/cleanlab) library to find outlier texts in the training data.
- Find outlier examples in the validation data that don't come from the data distribution in the training set.
    - This would be analogous to looking for anomalies in new data sources/feeds.
- Select a threshold for deciding whether to consider an example an outlier or not.
- Cluster the selected outlier examples to find anomalous genres/sources of text.
- Identify topics within the anomalous genres/sources.



Our main goal is to find out-of-distribution examples in a dataset, paying particular attention to new genres/domains/sources. In the case of the MultiNLI dataset, only 1 of the following 4 examples is considered anomalous by these methods. (Can you guess which?)

| **Premise**                                                     | **Hypothesis**                                                     |  **Genre**  |
|-----------------------------------------------------------------|--------------------------------------------------------------------|:-----------:|
| said San'doro.                                                  | San'doro spoke.                                                    |   fiction   |
| Answer? said Julius.                                            | Julius needed an answer right then.                                |   fiction   |
| The space age began with the launch of Sputnik in October 1957. | In October 1957 the space age started after the launch of Sputnik. | **letters** |
| Then he turned to Tommy.                                        | He turned to Tommy next.                                           |   fiction   |

It will turn out that the most likely outliers identified by our method come from the genres in the mismatched validation set, as is to be expected.

Many of these outlier examples form clusters based on their respective genres, which can be used to find out-of-distribution topics in the data.

![Outlier UMAP](images/outlier_umap.png)



# Let's get coding!

The remainder of this article will demonstrate how we implement our strategy, with fully runnable code! Here's a link to a Colab notebook in which you can run this code: TODO:link


## Install dependencies

You can install all the required packages by running:

```ipython
!pip install cleanlab datasets hdbscan matplotlib nltk scikit-learn torch tqdm transformers umap-learn
```

Next we'll import the necessary packages, set logging level to 'ERROR' and set some RNG seeds for reproducibility.

In [None]:
import cleanlab
import datasets
import hdbscan
import nltk
import matplotlib.pyplot as plt
import numpy as np
import re
import torch

from cleanlab.outlier import OutOfDistribution
from datasets import load_dataset, concatenate_datasets
from IPython.display import display
from sklearn.metrics import precision_recall_curve
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModel
from umap import UMAP


try:
    nltk.corpus.stopwords.words
except LookupError:
    nltk.download('stopwords')

datasets.logging.set_verbosity_error()

SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.cuda.manual_seed_all(SEED)


## Preprocess datasets

The MultiNLI dataset can be fetched from the Hugging Face Hub via its `datasets` API. Apart from dropping examples without a gold label and subsampling each split, the only preprocessing we perform is removing unused columns/features from the datasets. Note that for this post we're *not* looking at the entailment labels (`label`) in the dataset. Rather, we are simply trying to automatically identify out-of-distribution examples based only on their text.


For evaluating our outlier detection algorithm, we consider all examples from the mismatched validation set to be out-of-distribution examples. 
We'll still use the matched validation set to find naturally occurring outlier examples. Our algorithms do not require the genre information; it is only used for evaluation purposes.


In [None]:
def preprocess_datasets(
    *datasets,
    sample_sizes = [45000, 9000, 9000],
    columns_to_remove = ['premise_binary_parse', 'premise_parse', 'hypothesis_binary_parse', 'hypothesis_parse', 'promptID', 'pairID', 'label'],
):
    # Remove -1 labels (no gold label)
    f = lambda ex: ex["label"] != -1
    datasets = [dataset.filter(f) for dataset in datasets]

    # Sample a subset of the data
    assert len(sample_sizes) == len(datasets), "Number of datasets and sample sizes must match"
    datasets = [
        dataset.shuffle(seed=SEED).select(range(sample_size))
        for dataset, sample_size in zip(datasets, sample_sizes)
    ]
    
    # Remove columns
    datasets = [data.remove_columns(columns_to_remove) for data in datasets]

    return datasets

train_data = load_dataset("multi_nli", split="train")
val_matched_data = load_dataset("multi_nli", split="validation_matched")
val_mismatched_data = load_dataset("multi_nli", split="validation_mismatched")

train_data, val_matched_data, val_mismatched_data = preprocess_datasets(
    train_data, val_matched_data, val_mismatched_data
)

To get some idea of the data format, we'll take a look at a few examples from each dataset.


In [None]:
print("Training data")
print(f"Genres: {np.unique(train_data['genre'])}")
display(train_data.to_pandas().head())

print("Validation matched data")
print(f"Genres: {np.unique(val_matched_data['genre'])}")
display(val_matched_data.to_pandas().head())

print("Validation mismatched data")
print(f"Genres: {np.unique(val_mismatched_data['genre'])}")
display(val_mismatched_data.to_pandas().head())


# Transform NLI data into vector embeddings

We'll use pretrained SentenceTransformer models to embed the sentence pairs in the MultiNLI dataset.

One way to train sentence encoders from NLI data (including MultiNLI) is to add a 3-way softmax classifier on top of a Siamese BERT-Network like the one shown below.

<figure>
<img 
src="https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SBERT_SoftmaxLoss.png" width="400"
alt="Softmax loss with siamese network">
    <figcaption>Siamese network with softmax classifier. Image taken from 
    <a href="https://www.sbert.net/examples/training/nli/README.html">SBERT docs</a>.
    </figcaption>
</figure>

We will use the outputs of the $(u, v, \vert u - v \vert )$-layer from such a network as a single vector embedding for each sentence pair. This is preferable to concatenating each sentence pair into a single string, since concatenation would increase the risk of truncating the model inputs and losing information (particularly from the hypothesis).
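To make the shape of this pair embedding concrete, here is a minimal NumPy sketch of masked mean pooling followed by the $(u, v, \vert u - v \vert)$ concatenation. The helper names `masked_mean_pool` and `pair_features` and the toy arrays are assumptions made for illustration; they are not part of any library.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions."""
    mask = attention_mask[..., None].astype(float)   # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
    return summed / counts

def pair_features(u, v):
    """Concatenate (u, v, |u - v|) along the feature axis."""
    return np.concatenate([u, v, np.abs(u - v)], axis=1)

# Toy batch: 2 sentences, 3 token positions, 4-dimensional token embeddings
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 3, 4))
mask = np.array([[1, 1, 0], [1, 1, 1]])  # first sentence has one padded token

u = masked_mean_pool(tokens, mask)              # stand-in for premise embeddings
v = masked_mean_pool(tokens[::-1], mask[::-1])  # stand-in for hypothesis embeddings
features = pair_features(u, v)
print(features.shape)  # (2, 12)
```

An encoder that produces `d`-dimensional sentence embeddings therefore yields `3 * d` features per sentence pair.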


In [None]:
#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

def embed_sentence_pairs(dataloader, tokenizer, model, disable_tqdm=False):
    premise_embeddings  = []
    hypothesis_embeddings = []
    feature_embeddings = []

    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
    model.to(device)
    model.eval()

    loop = tqdm(dataloader, desc=f"Embedding sentences...", disable=disable_tqdm)
    for data in loop:

        premise, hypothesis = data['premise'], data['hypothesis']
        encoded_premise, encoded_hypothesis = (
            tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
            for sentences in (premise, hypothesis)
        )

        # Compute token embeddings
        with torch.no_grad():
            encoded_premise = encoded_premise.to(device)
            encoded_hypothesis = encoded_hypothesis.to(device)
            model_premise_output = model(**encoded_premise)
            model_hypothesis_output = model(**encoded_hypothesis)

        # Perform pooling
        pooled_premise = mean_pooling(model_premise_output, encoded_premise['attention_mask']).cpu().numpy()
        pooled_hypothesis = mean_pooling(model_hypothesis_output, encoded_hypothesis['attention_mask']).cpu().numpy()
    
        premise_embeddings.extend(pooled_premise)
        hypothesis_embeddings.extend(pooled_hypothesis)

        
    # Concatenate premise and hypothesis embeddings, as well as their absolute difference
    feature_embeddings = np.concatenate(
        [
            np.array(premise_embeddings),
            np.array(hypothesis_embeddings),
            np.abs(np.array(premise_embeddings) - np.array(hypothesis_embeddings))
        ],
        axis=1
    )
    # feature_embeddings = normalize(feature_embeddings, norm='l2', axis=1)
    return feature_embeddings



For the next step, you have to choose a pretrained tokenizer+model from the Hugging Face Hub that will provide the token embeddings to the pooling layer of the network.

This is done by providing the name of the model on the Hub.

In [None]:
# Load model from Hugging Face Hub

# Pretrained SentenceTransformers handle this task better than regular Transformers
model_name = 'sentence-transformers/all-MiniLM-L6-v2'

# Uncomment the following line to try a regular Transformers model trained on MultiNLI
# model_name = 'sileod/roberta-base-mnli'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

batch_size = 128

# Place Hugging Face datasets in a PyTorch DataLoader
trainloader = DataLoader(train_data, batch_size=batch_size, shuffle=False)
valmatchedloader = DataLoader(val_matched_data, batch_size=batch_size, shuffle=False)
valmismatchedloader = DataLoader(val_mismatched_data, batch_size=batch_size, shuffle=False)

# Get embeddings
train_embeddings = embed_sentence_pairs(trainloader, tokenizer, model, disable_tqdm=True)
val_matched_embeddings = embed_sentence_pairs(valmatchedloader, tokenizer, model, disable_tqdm=True)
val_mismatched_embeddings = embed_sentence_pairs(valmismatchedloader, tokenizer, model, disable_tqdm=True)

## Find outliers in the datasets with cleanlab

We can find outliers in the training data with cleanlab's `OutOfDistribution` class. This fits a nearest neighbor estimator to the training data (in feature space) and returns an outlier score for each example based on its average distance from its *K* nearest neighbors. The scores lie between 0 and 1, and lower scores indicate more atypical examples.


In [None]:
# Get outlier scores for each of the training data feature embeddings
ood = OutOfDistribution()
train_outlier_scores = ood.fit_score(features=train_embeddings)
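To build intuition for a nearest-neighbor outlier score, here is a minimal NumPy sketch on synthetic data. It is not cleanlab's implementation: the `knn_outlier_scores` helper, the value of `k`, the Euclidean distance, and the `exp(-distance)` transform are all assumptions made for illustration.

```python
import numpy as np

def knn_outlier_scores(reference, queries, k=10):
    """Map the mean distance to the k nearest reference points into (0, 1]."""
    # Pairwise Euclidean distances, shape (n_queries, n_reference)
    diffs = queries[:, None, :] - reference[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    dists.sort(axis=1)                      # nearest reference points first
    avg_knn_dist = dists[:, :k].mean(axis=1)
    # Assumed transform: a larger distance gives a lower (more atypical) score
    return np.exp(-avg_knn_dist)

rng = np.random.default_rng(0)
train = rng.normal(size=(200, 5))            # hypothetical training features
in_dist = rng.normal(size=(20, 5))           # drawn from the same distribution
shifted = rng.normal(loc=6.0, size=(20, 5))  # drawn from a shifted distribution

in_dist_scores = knn_outlier_scores(train, in_dist)
shifted_scores = knn_outlier_scores(train, shifted)
print(in_dist_scores.mean(), shifted_scores.mean())
```

In this toy setup, points from the shifted distribution sit far from every training point and so receive much lower scores, which is the behavior we rely on when scoring the validation sets.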

We can look at the top outliers in the training data.

In [None]:
# View the 15 training examples with the lowest outlier scores (lower scores indicate more atypical examples)
top_train_outlier_idxs = (train_outlier_scores).argsort()[:15]
top_train_outlier_subset = train_data.select(top_train_outlier_idxs)
top_train_outlier_subset.to_pandas()


Next, we use the fitted nearest neighbor estimator to get outlier scores for the validation data, both the matched and mismatched validation sets.


In [None]:
# Get outlier scores for each of the feature embeddings in the *combined* validation set
test_feature_embeddings = np.concatenate([val_matched_embeddings, val_mismatched_embeddings], axis=0)
test_outlier_scores = ood.score(features=test_feature_embeddings)

First, we look at the top outliers in the validation data.

In [None]:
# Visualize the 20 most severe outliers in the test data
test_data = concatenate_datasets([val_matched_data, val_mismatched_data])

top_outlier_idxs = (test_outlier_scores).argsort()[:20]
top_outlier_subset = test_data.select(top_outlier_idxs)
top_outlier_subset.to_pandas()

Although the combined validation set is balanced with respect to matched and mismatched genres, most of the examples with the lowest outlier scores (i.e. the most likely outliers) are from the mismatched validation set (`[nineeleven, facetoface, letters, oup, verbatim]`).


Compare this with examples at the other end of the spectrum that are considered unlikely to be outliers.


In [None]:
bottom_outlier_idxs = (-test_outlier_scores).argsort()[:20]
bottom_outlier_subset = test_data.select(bottom_outlier_idxs)
bottom_outlier_subset.to_pandas()

These examples come from only 4 of the 5 genres in the matched validation set (`[fiction, government, telephone, travel]`).
The missing genre is `slate`, whose first example appears much further down this list.

## Evaluate outlier scores

Realistically, if we already knew that the mismatched dataset contained different genres from those in the training set, we could do the outlier detection on each genre separately.

  - I.e. detect outlier sentence pairs from `nineeleven`, then outliers from `facetoface`, etc.

To keep things brief for now, let's consider outlier examples from the combined validation set.

We can set a threshold to decide which examples in the combined validation set are outliers.
We'll be conservative and use the 2.5th percentile of the outlier score distribution in the training data as the threshold.
Any example in the combined validation set that scores below this threshold is selected as an outlier.

In [None]:
# Take the 2.5th percentile of the outlier scores in the training data as the threshold
threshold = np.percentile(train_outlier_scores, 2.5)

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
plt_range = [min(train_outlier_scores.min(),test_outlier_scores.min()), \
             max(train_outlier_scores.max(),test_outlier_scores.max())]

axes[0].hist(train_outlier_scores, range=plt_range, bins=50)
axes[0].set(title='train_outlier_scores distribution', ylabel='Frequency')
axes[0].axvline(x=threshold, color='red', linewidth=2)
axes[1].hist(test_outlier_scores, range=plt_range, bins=50)
axes[1].set(title='test_outlier_scores distribution', ylabel='Frequency')
axes[1].axvline(x=threshold, color='red', linewidth=2)

This will result in a few false positives, as can be seen below.

In [None]:
# Select test examples whose outlier scores are below the threshold

sorted_ids = test_outlier_scores.argsort()
outlier_scores = test_outlier_scores[sorted_ids]
outlier_ids = sorted_ids[outlier_scores < threshold]

selected_outlier_subset = test_data.select(outlier_ids)
selected_outlier_subset.to_pandas().tail(15)
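When ground-truth labels are available, as they are here through the mismatched split, the effect of the threshold can be quantified. The following is a minimal sketch on synthetic scores: the beta-distributed scores, the sample sizes, and all variable names are assumptions for illustration, not outputs of the pipeline above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical scores in [0, 1]; lower means more atypical
train_scores = rng.beta(8, 2, size=5000)
matched_scores = rng.beta(8, 2, size=1000)     # same distribution as training
mismatched_scores = rng.beta(4, 4, size=1000)  # shifted towards lower scores

# Threshold taken from the training scores only
threshold = np.percentile(train_scores, 2.5)

test_scores = np.concatenate([matched_scores, mismatched_scores])
is_mismatched = np.concatenate([np.zeros(1000, dtype=bool), np.ones(1000, dtype=bool)])
flagged = test_scores < threshold

true_positives = (flagged & is_mismatched).sum()
precision = true_positives / max(flagged.sum(), 1)
recall = true_positives / is_mismatched.sum()
false_positive_rate = (flagged & ~is_mismatched).sum() / (~is_mismatched).sum()
print(f"precision={precision:.2f} recall={recall:.2f} FPR={false_positive_rate:.3f}")
```

Because the threshold is a percentile of the training scores, roughly that same fraction of in-distribution test examples is expected to fall below it. That is where the false positives come from.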


## Cluster outliers

Let's assume that we don't know the content of the genres from the mismatched dataset.
We can try clustering the outliers from the validation set to see if we can get a better idea about the mismatched genres.

Under this assumption, it makes sense to use a density-based clustering algorithm like HDBSCAN, which can handle noise in the selected outlier examples. Unfortunately, it doesn't perform well on high-dimensional data, so we'll first use UMAP to reduce the dimensionality of the data. For visualization purposes, we'll reduce the data to 2 dimensions, but you may benefit from a slightly higher dimensionality if you expect some overlapping clusters.


In [None]:
# Get embeddings of selected outliers
selected_outlier_subset_embeddings = test_feature_embeddings[outlier_ids]

# Reduce dimensionality with UMAP
umap_fit = UMAP(n_components=2, n_neighbors=8, random_state=SEED)
selected_outlier_subset_embeddings_umap = umap_fit.fit_transform(selected_outlier_subset_embeddings)

# Set plot labels
mismatched_labels = {"nineeleven": 0, "facetoface": 1, "letters": 2, "oup": 3, "verbatim": 4}
matched_labels = {"fiction": 5, "government": 6, "slate": 7, "telephone": 8, "travel": 9}
labels_dict = {**mismatched_labels, **matched_labels}
genre_labels = np.array([labels_dict.get(x, 0) for x in selected_outlier_subset["genre"]])

# Plot reduced embeddings
plt.figure(figsize=(10, 10))
x_plot, y_plot = selected_outlier_subset_embeddings_umap[:, 0], selected_outlier_subset_embeddings_umap[:, 1]


for i, genre in enumerate(labels_dict.keys()):
    x, y = x_plot[genre_labels == i], y_plot[genre_labels == i]
    if genre in mismatched_labels:
        # Mismatched genres are filled circles
        plt.scatter(x, y, label=genre)
    else:
        # Matched genres are transparent triangles
        plt.scatter(x, y, label=genre, alpha=0.5, marker="^")
plt.legend()


At a quick glance, we see that the mismatched genres tend to cluster together. Only `facetoface` overlaps with `verbatim` and the majority of the matched genres. Our best bet would be to look for small local clusters to see how a single genre contains multiple topics. We'll have to set a relatively small minimum cluster size and allow more localized clusters.
This is done by lowering the `min_cluster_size` and `min_samples` parameters in the HDBSCAN algorithm.

In [None]:
clusterer = hdbscan.HDBSCAN(min_cluster_size=6, min_samples=4)
clusterer.fit(selected_outlier_subset_embeddings_umap)
cluster_labels = clusterer.labels_

clusterer.condensed_tree_.plot(select_clusters=True)

# plot each set of points in a different color
plt.figure(figsize=(10, 10))
for i in np.unique(cluster_labels):
    if i != -1:
        x, y = x_plot[cluster_labels == i], y_plot[cluster_labels == i]
        plt.scatter(x, y, label=f"cluster {i}")

# Plot outliers in gray
x, y = x_plot[cluster_labels == -1], y_plot[cluster_labels == -1]
plt.scatter(x, y, label="outliers", color="gray", alpha=0.15)
plt.legend()

The clusters on the edges are relatively pure based on visual inspection, i.e. the majority of the points in each cluster are from the same genre.

The main exceptions are:

- The violet cluster, which consists of 3 genres.
- The yellow-green cluster in the center, which contains multiple overlapping genres.
  - This suggests that `verbatim` is an "in-distribution" topic, which is not useful for testing NLI models.
  - This cluster can be removed in some cases.

Most of the "pure" `verbatim` clusters might be too small to be insightful, but the larger `nineeleven` and `oup` clusters are promising.
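Visual inspection can be complemented with a simple purity measure: for each cluster, the share of points that belong to its most common genre. Here is a minimal sketch, where the `cluster_purity` helper and the toy labels are hypothetical and introduced only for illustration.

```python
import numpy as np
from collections import Counter

def cluster_purity(cluster_labels, genres, noise_label=-1):
    """Return {cluster: (majority genre, fraction of points with that genre)}."""
    cluster_labels = np.asarray(cluster_labels)
    genres = np.asarray(genres)
    purity = {}
    for cluster in np.unique(cluster_labels):
        if cluster == noise_label:
            continue  # HDBSCAN marks unclustered points with -1
        counts = Counter(genres[cluster_labels == cluster].tolist())
        genre, count = counts.most_common(1)[0]
        purity[int(cluster)] = (genre, count / sum(counts.values()))
    return purity

toy_clusters = [0, 0, 0, 1, 1, 1, 1, -1]
toy_genres = ["oup", "oup", "oup", "verbatim", "verbatim", "facetoface", "travel", "slate"]
purity = cluster_purity(toy_clusters, toy_genres)
print(purity)
# {0: ('oup', 1.0), 1: ('verbatim', 0.5)}
```

With the real data, the same function could be called as `cluster_purity(cluster_labels, selected_outlier_subset["genre"])`. Note that this uses the genre labels, so it is only an evaluation aid.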



# Finding Topics with c-TF-IDF

A useful way of extracting topics from clusters of dense sentence/document embeddings is with [c-TF-IDF](https://maartengr.github.io/BERTopic/api/ctfidf.html). An existing algorithm, [BERTopic](https://maartengr.github.io/BERTopic/index.html), leverages Transformers (specifically the BERT model) to do exactly this in an easy manner. BERTopic essentially does the following:


- Reduce the dimensionality of the embeddings with UMAP.
- Cluster the reduced embeddings with HDBSCAN.
- Compute the TF-IDF scores of the words in each sentence/document cluster with c-TF-IDF.
  - TF-IDF scores represent the importance of the words in the cluster.
  - Extracting the main topic from a cluster can be done by collecting the words with the highest scores. 

We've already performed the dimensionality reduction and clustering with the aforementioned methods, so it's redundant to do it again with BERTopic.  A [nice article by James Briggs](https://www.pinecone.io/learn/bertopic/) goes through the same steps in detail and provides a clear implementation of c-TF-IDF for pre-computed embeddings and clusters.  We reuse parts of that implementation below. To keep things simple, we use unigrams to extract topics.



In [None]:
###### Create documents from sentence pairs

# Get combined text from the selected outliers
# Joining the premise and hypothesis together
def join_sentence_pair(example):
    docs = []
    for premise, hypothesis in zip(example["premise"], example["hypothesis"]):
        docs.append(premise + " " + hypothesis)
    example["docs"] = docs
    return example

selected_outlier_subset = selected_outlier_subset.map(join_sentence_pair, batched=True)

###### Build vocabularies for classes

classes = {}
for label in set(clusterer.labels_):
    classes[label] = {
        'vocab': set(),
        'tokens': [],
        'tfidf_array': None
    }
selected_outlier_subset = selected_outlier_subset.add_column('class', clusterer.labels_)


# Lowercase and remove punctuation
alpha = re.compile(r'[^a-zA-Z ]+')
selected_outlier_subset = selected_outlier_subset.map(lambda x: {
    'tokens': alpha.sub('', x['docs']).lower()
})

# Tokenize
selected_outlier_subset = selected_outlier_subset.map(lambda x: {
    'tokens': nltk.tokenize.wordpunct_tokenize(x['tokens'])
})

# Collect tokens from all examples for their respective classes
for example in selected_outlier_subset:
    classes[example['class']]['tokens'].extend(example['tokens'])

# Remove stopwords
for c in classes.keys():
    stopwords = set(nltk.corpus.stopwords.words('english'))
    classes[c]['tokens'] = [
        word for word in classes[c]['tokens'] if word not in stopwords
    ]

# Build class vocabulary
vocab = set()
for c in classes.keys():
    vocab = vocab.union(set(classes[c]['tokens']))
    classes[c]['vocab'] = set(classes[c]['tokens'])


###### c-TF-IDF scores

tf = np.zeros((len(classes.keys()), len(vocab)))

for c, _class in enumerate(classes.keys()):
    for t, term in enumerate(tqdm(vocab, disable=True)):
        tf[c, t] = classes[_class]['tokens'].count(term)

idf = np.zeros((1, len(vocab)))

# Calculate average number of words per class
A = tf.sum() / tf.shape[0]

for t, term in enumerate(tqdm(vocab, disable=True)):
    # Frequency of term t across all classes
    f_t = tf[:,t].sum()
    # Calculate IDF
    idf_score = np.log(1 + (A / f_t))
    idf[0, t] = idf_score

tf_idf = tf*idf
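The computation above weights each term $t$ in class $c$ as $W_{t,c} = \mathrm{tf}_{t,c} \cdot \log\left(1 + A / f_t\right)$, where $A$ is the average number of tokens per class and $f_t$ is the frequency of $t$ across all classes. The same formula can be checked by hand on a toy example; the token lists below are made up for illustration.

```python
import numpy as np
from collections import Counter

# Hypothetical token lists, one per cluster (stopwords already removed)
class_tokens = {
    0: ["flight", "hijackers", "flight", "airport"],
    1: ["cotton", "textile", "mill", "cotton", "flight"],
}

vocab = sorted(set(tok for toks in class_tokens.values() for tok in toks))
counts = {c: Counter(toks) for c, toks in class_tokens.items()}

# tf[c, t]: how often term t occurs in class c
tf = np.array([[counts[c][term] for term in vocab] for c in class_tokens], dtype=float)

A = tf.sum() / tf.shape[0]   # average number of tokens per class
f_t = tf.sum(axis=0)         # frequency of each term across all classes
idf = np.log(1 + A / f_t)
c_tf_idf = tf * idf

top_terms = {c: vocab[int(np.argmax(c_tf_idf[row]))] for row, c in enumerate(class_tokens)}
print(top_terms)
# {0: 'flight', 1: 'cotton'}
```

Note that `flight` is the top term for class 0 even though it also appears in class 1: its term frequency in class 0 outweighs the lower IDF it receives for occurring in both classes.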


Looking at the words with the top c-TF-IDF scores in each cluster should give us some idea of this cluster's main topic. 

In [None]:
n = 7

top_idx = np.argpartition(tf_idf, -n)[:, -n:]
vlist = list(vocab)
for c, _class in enumerate(classes.keys()):
    topn_idx = top_idx[c, :]
    topn_terms = [vlist[idx] for idx in topn_idx]
    if _class != -1:
        print(f"Topic class {_class}: {topn_terms}")
    else:
        print(f"Outliers: {topn_terms}")
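One detail worth knowing: `np.argpartition` only guarantees that the last `n` positions hold the `n` largest values, not that they are sorted. If you want the terms printed from most to least important, the candidates can be sorted afterwards. Here is a small sketch with a made-up score matrix:

```python
import numpy as np

# Hypothetical score matrix: 2 classes x 5 terms
scores = np.array([
    [0.1, 0.9, 0.3, 0.7, 0.5],
    [0.8, 0.2, 0.6, 0.4, 0.0],
])
n = 3

# The last n columns hold the indices of the n largest scores, in arbitrary order
top_unsorted = np.argpartition(scores, -n, axis=1)[:, -n:]

# Sort those n candidates per row by descending score
order = np.argsort(-np.take_along_axis(scores, top_unsorted, axis=1), axis=1)
top_sorted = np.take_along_axis(top_unsorted, order, axis=1)
print(top_sorted)
# [[1 3 4]
#  [0 2 3]]
```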

Let's visualize the clustered embeddings with their associated topics (left). For comparison, we'll visualize the same embeddings with their original genres as labels (right). 

In [None]:
# Plot two figures
plt.subplots(nrows=1, ncols=2, figsize=(20, 10))


# LEFT PLOT
# Plot scatter plot of umap embeddings with clusterer labels as colors

x_plot, y_plot = selected_outlier_subset_embeddings_umap[:, 0], selected_outlier_subset_embeddings_umap[:, 1]
plt.subplot(1, 2, 1)
for i, topic in enumerate(np.unique(cluster_labels)):
    if topic != -1:
        if i > 10:
            marker = "x"
        else:
            marker = "o"
        x, y = x_plot[cluster_labels == topic], y_plot[cluster_labels == topic]
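        # Rows of top_idx follow the order of classes.keys(), which need not equal the cluster label
        row = list(classes.keys()).index(topic)
        label = "_".join([vlist[idx] for idx in top_idx[row, :]])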
        # Truncate label to fit in legend
        label = label[:10]
        plt.scatter(x, y, label=f"{topic}: {label}", marker=marker)

# Plot outliers in gray with lower alpha
plt.scatter(x_plot[cluster_labels == -1], y_plot[cluster_labels == -1], label="outliers", color="gray", alpha=0.25)
plt.title("Clustered by HDBSCAN")
plt.xlabel("UMAP 1")
plt.ylabel("UMAP 2")
plt.legend()


# RIGHT PLOT
# Plot scatter plot of umap embeddings with genre labels as colors

plt.subplot(1, 2, 2)
genre_labels = np.array([labels_dict.get(x, 0) for x in selected_outlier_subset["genre"]])
for i, genre in enumerate(labels_dict.keys()):
    x, y = x_plot[genre_labels == i], y_plot[genre_labels == i]
    if genre in mismatched_labels:
        plt.scatter(x, y, label=genre)
    else:
        plt.scatter(x, y, label=genre, alpha=0.5, marker="^")
plt.title("Labelled by genre")
plt.xlabel("UMAP 1")
plt.ylabel("UMAP 2")
plt.legend()

In the `nineeleven` genre, several topics stick out, some on US airline flights and others on Middle Eastern leaders. One topic is discovered in the `oup` genre, which appears to be about textiles. The remaining genres overlap too much to obtain meaningful topics. One way to handle the overlapping clusters is to redo the previous clustering exclusively on those points, e.g. by removing `nineeleven` from the analysis, and to repeat this process recursively as needed.


## Conclusion

This analysis demonstrated how to identify and understand outliers in text data. The required methods are all available and easy to use in open-source Python libraries, and you should be able to apply the same code demonstrated here to your own text datasets. I hope identifying and understanding outliers helps you ensure better quality data and ML performance in your own applications. You might choose to either omit such examples from your dataset, or to expand your data collection to obtain better coverage of such cases (if they seem relevant).
