Real-world projects stand or fall with the quality of the data they use. Quality over quantity is the name of the game! For numerical or categorical data, there are many good techniques to determine and inspect data quality. But how can we do this for text data, where there are no set benchmarks for what makes a good, high-quality text? It turns out there are some useful libraries and techniques that we can use to find out if our text data is any good!

First off, we are going to use the "textstat" library, a fantastic tool that offers many functions for understanding a text better.

Also, all these snippets are taken from our open-source NLP content library. You can find all of this and more at: https://bricks.kern.ai/home :-) 

## Measuring data quality for text data

Kiefer (2019) suggests these quality indicators for text data:
- percentage of abbreviations
- percentage of spelling mistakes 
- lexical diversity 
- percentage of uppercase words 
- percentage of ungrammatical sentences
- average sentence length 

Let's have a look at how we can measure or improve these features of our text data. 

Source: https://btw.informatik.uni-rostock.de/download/workshopband/C2-5.pdf
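
Two of the indicators listed above, the percentage of uppercase words and the average sentence length, can be approximated with the standard library alone. The following is a minimal sketch, not taken from the paper or from bricks: the function names are made up for illustration, a "word" is assumed to be a run of letters, and sentences are naively split on `.`, `!` and `?`.

```python
import re

def uppercase_word_ratio(text: str) -> float:
    """Share of words written entirely in uppercase (assumption: at least two letters)."""
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters only
    if not words:
        return 0.0
    upper = [w for w in words if len(w) > 1 and w.isupper()]
    return round(len(upper) / len(words), 3)

def average_sentence_length(text: str) -> float:
    """Average number of whitespace-separated words per sentence (naive split on . ! ?)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return round(sum(len(s.split()) for s in sentences) / len(sentences), 3)

sample = "PLEASE read the manual. It is short!"
print(uppercase_word_ratio(sample))     # 0.143 (1 of 7 words)
print(average_sentence_length(sample))  # 3.5
```

The naive sentence split will stumble over abbreviations like "Mr." or "A.I.", so for real data a proper sentence splitter is the safer choice.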

<hr>

### Lexical diversity

In [29]:
# lexical diversity 
def lexical_diversity(text):
    word_count = len(text.split())
    vocab_size = len(set(text.split()))
    return round(vocab_size / word_count, 3) # this is the diversity score

texts = [
# NYT article:
"At Microsoft, Satya Nadella, the tech giant’s chief executive, said that Mr. Altman would be chief executive of the new research lab, “setting a new pace for innovation,” in an apparent contrast at the OpenAI board’s desire for caution in developing A.I. technology. Mr. Nadella noted in a post to X, formerly known as Twitter, that Mr. Altman’s new group will operate as an independent entity within Microsoft.",
# Poem for children:
"The wheels on the bus go round and round, Round and round, Round and round. The wheels on the bus go round and round, All through the town. The wipers on the bus go Swish, swish, swish; Swish, swish, swish; Swish, swish, swish. The wipers on the bus go Swish, swish, swish, All through the town. The horn on the bus goes Beep, beep, beep; Beep, beep, beep; Beep, beep, beep. The horn on the bus goes Beep, beep, beep, All through the town."
]

for t in texts:
    print(lexical_diversity(t))

0.794
0.298


I would argue that lexical diversity is a pretty useful indicator of how complicated a text is. A text with a very high lexical diversity might be too confusing for an LLM (or non-expert human readers) and might require you to provide a glossary or to simplify the language that is used.
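
One caveat of splitting on whitespace is that "Round", "round," and "round." all count as different words. Here is a small sketch of a normalized variant, assuming that lowercasing and dropping punctuation is acceptable for your data (the function name is made up for illustration):

```python
import re

def normalized_lexical_diversity(text: str) -> float:
    """Type-token ratio after lowercasing and stripping punctuation."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0  # avoid dividing by zero on empty input
    return round(len(set(tokens)) / len(tokens), 3)

print(normalized_lexical_diversity("Round and round, round and round."))  # 0.333
```

For comparison, the whitespace-based version above scores the same sentence at 0.833, because it sees five distinct "words" among six.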

<hr>

### Sentence complexity

The sentence complexity is another great indicator, as sentences that are too complex are often not good when doing RAG. 

**The formula for this (the Flesch reading ease score) is as follows: 206.835 – (1.015 x Average Sentence Length) – (84.6 x Average Syllables Per Word)** <br>
Higher scores mean that a text is easier to read. <br>
You can also find a weighted version of this code here: https://bricks.kern.ai/classifiers/1049 (applied some additional aggregation logic)
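
To make the formula concrete, here is a worked example that plugs hand-counted values into it. This is only a sketch: the counts for "Mary had a little lamb." (5 words, 1 sentence, 7 syllables) are my own, and textstat counts syllables with its own logic, so its score for the same sentence does not have to match.

```python
def flesch_reading_ease(total_words: int, total_sentences: int, total_syllables: int) -> float:
    """Flesch reading ease computed from raw counts."""
    avg_sentence_length = total_words / total_sentences
    avg_syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word

# "Mary had a little lamb." counted by hand: Ma-ry(2) had(1) a(1) lit-tle(2) lamb(1)
score = flesch_reading_ease(total_words=5, total_sentences=1, total_syllables=7)
print(round(score, 2))  # 83.32
```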

In [1]:
# sentence complexity with textstat
import textstat

def sentence_complexity(text: str) -> str:
    return lookup_label(textstat.flesch_reading_ease(text))

def lookup_label(score: float) -> str:
    if score < 30:
        return "very difficult"
    if score < 50:
        return "difficult"
    if score < 60:
        return "fairly difficult"
    if score < 70:
        return "standard"
    if score < 80:
        return "fairly easy"
    if score < 90:
        return "easy"        
    return "very easy"


# ↑ necessary bricks function 
# -----------------------------------------------------------------------------------------
# ↓ example implementation 

def example_integration():
    texts = ["Doctors from Stockholm University invent cure for rare disease.", "Mary had a little lamb."]
    textstat.set_lang("en") # you can use en, de, es, fr, it, nl, ru
    for text in texts:
        print(f"\"{text}\" is {sentence_complexity(text)}")

example_integration()

"Doctors from Stockholm University invent cure for rare disease." is fairly difficult
"Mary had a little lamb." is very easy


You can see that a sentence doesn't need to be long to be detected as difficult. <br>
Besides that, **textstat** also offers some useful statistical features, such as: 

In [7]:
text = "this is a super short text, which contains the word Epigallocatechin-gallate"

print(f"Reading time of the text: {textstat.reading_time(text, ms_per_char=14.69)} seconds (this is an average, some read even faster, some read slower) \n")
print(f"Number of words with three or more syllables: {textstat.polysyllabcount(text)}")

Reading time of the text: 0.97 seconds (this is an average, some read even faster, some read slower) 

Number of words with three or more syllables: 1


<hr>

For RAG, it also makes sense to learn about the length of your texts to find out how suitable they are. As we have seen above, length is also a great indicator for the quality of the text data. <br><br>
Doing this can have two purposes: 
- Finding out the complexity of the data -> an additional indicator of whether the text needs further treatment
- Keeping in mind the limited context window (token limit) of an LLM

This also plays a role for the famous Lost in the Middle problem: https://arxiv.org/pdf/2307.03172.pdf

![lost in the middle by Liu et al.](lost-in-the-middle.png "LLMs often ignore context in the middle")

In [8]:
# classify a text based on its word count
def word_count_classifier(text: str) -> str:
    words = text.split()
    length = len(words)
    if length < 5:
        return "short"
    elif length < 20:
        return "medium"
    else:
        return "long"

# ↑ necessary bricks function 
# -----------------------------------------------------------------------------------------
# ↓ example implementation 

def example_integration():
    texts = ["This is short.", "This is a text with medium length.", "This is a longer text with many more words. There is even a second sentence with extra words. Splendid, what a joyful day!"]
    for text in texts:
        print(f"\"{text}\" is -> {word_count_classifier(text)}")

example_integration()

"This is short." is -> short
"This is a text with medium length." is -> medium
"This is a longer text with many more words. There is even a second sentence with extra words. Splendid, what a joyful day!" is -> long


Or we can use the **tiktoken** library from OpenAI, which uses the tokenizer of the GPT models to count the number of tokens (subwords). <br>
(There are of course other tokenizers we can use for open-source models, too)

In [9]:
# count the tokens of a text
import tiktoken

def tiktoken_token_counter(text: str, encoding_name: str = "cl100k_base") -> int:
    encoding = tiktoken.get_encoding(encoding_name)
    tokens = encoding.encode(text)
    return len(tokens)

# ↑ necessary bricks function 
# -----------------------------------------------------------------------------------------
# ↓ example implementation 

def example_integration():
    texts = ["This is a short text with few tokens.", "This is a second short text"]
    for text in texts:
        print(f"\"{text}\" -> {tiktoken_token_counter(text)}")

example_integration()

"This is a short text with few tokens." -> 9
"This is a second short text" -> 6
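
Once we can count tokens, we can also check how many chunks fit into a given context budget. The following is a rough sketch with a made-up function name. By default it counts whitespace-separated words so that it runs without extra dependencies, and a real tokenizer-based counter (such as the one above) can be passed in as `count_tokens` instead.

```python
from typing import Callable, List

def select_chunks_within_budget(
    chunks: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda t: len(t.split()),  # assumption: words as a token proxy
) -> List[str]:
    """Keep chunks in their given order until the token budget would be exceeded."""
    selected = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected

chunks = ["This is a short text with few tokens.", "This is a second short text", "And a third one."]
print(select_chunks_within_budget(chunks, max_tokens=15))  # keeps the first two chunks (8 + 6 words)
```

Word counts usually underestimate subword token counts, so leave some headroom when you use the default counter.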


<hr>

### Special characters

We've learned that text containing, for example, special characters or extracted tables risks confusing an LLM and leading to unstable outputs. Let's see how we could tackle this! 

Especially when you extract data from PDFs, the extracted contents can often contain some weird special characters. While this might not be too much of a problem for human readers (depending on how many special characters and fragments are in the text), an LLM can easily get distracted by some weird characters. Using the snippet below, we can easily detect some weird stuff in our data and make sure to quickly find data that we would need to manually check! 

In [8]:
import unicodedata
from typing import List, Tuple

# range() excludes its upper bound, so each block ends one past its last code point
DEFAULT_ALLOWED_RANGE = set(range(32, 127)).union( # Basic Latin
    set(range(160, 256)), # Latin-1 Supplement
    set(range(256, 384)),  # Latin Extended-A
    set(range(384, 592)),  # Latin Extended-B
    set(range(8192, 8304)),  # General Punctuation
    set(range(8352, 8400)),  # Currency Symbols
    set([ord("\t"), ord("\n"), ord("\r")])# common stop chars
)


def special_character_classifier(text: str, allowed_range: set = None) -> bool:
    if allowed_range is None:
        allowed_range= DEFAULT_ALLOWED_RANGE
    
    for char in text:
        if ord(char) not in allowed_range and unicodedata.category(char) != "Zs":
            return True
    return False


# ↑ necessary bricks function 
# -----------------------------------------------------------------------------------------
# ↓ example implementation 

def example_integration():
    texts = ["This contains a special char 你好.", "Such a clean text, wow!", "This is a greek letter: α", "Super funny 😀", "Rainbows are very nice."]
    for text in texts:
        print(f"\"{text}\" -> {special_character_classifier(text)}")

example_integration()


"This contains a special char 你好." -> True
"Such a clean text, wow!" -> False
"This is a greek letter: α" -> True
"Super funny 😀" -> True
"Rainbows are very nice." -> False
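
A plain True/False tells us that something is off, but not what. For manual checks it can help to list the offending characters together with their position and Unicode name. Here is a simplified sketch with a made-up function name: unlike the classifier above, it only treats printable ASCII plus tab and newlines as "normal", which is an assumption you will want to relax for non-English data.

```python
import unicodedata
from typing import List, Tuple

def find_special_characters(text: str) -> List[Tuple[str, int, str]]:
    """Return (character, position, unicode name) for everything outside printable ASCII."""
    findings = []
    for position, char in enumerate(text):
        is_printable_ascii = 32 <= ord(char) < 127
        if is_printable_ascii or char in "\t\n\r":
            continue
        findings.append((char, position, unicodedata.name(char, "UNKNOWN")))
    return findings

print(find_special_characters("This is a greek letter: α"))
# [('α', 24, 'GREEK SMALL LETTER ALPHA')]
```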


<hr>

### Splitting techniques

Splitting is another super important topic for RAG. If the text pieces (chunks) are too big, we might have issues finding the right texts or the context window of an LLM is filled up too fast. If you are having issues with the **retrieval** part of the RAG process, implementing chunking can provide you some quick and easy gains. <br><br>
Here are some basic chunking techniques:

In [3]:
# simple newline splitter
from typing import List 

def newline_splitter(text: str) -> List[str]:
    splits = [t.strip() for t in text.split("\n")]
    return [val for val in splits if len(val) > 0]

# ↑ necessary bricks function 
# -----------------------------------------------------------------------------------------
# ↓ example implementation

def example_integration():
    texts = ["""
    This is a sentence.
    This too, but in another line
    """, "This is a sentence\nwith a newline literal!"]
    for text in texts:
        print(f"{repr(text)} ---> {newline_splitter(text)}\n")

example_integration()

'\n    This is a sentence.\n    This too, but in another line\n    ' ---> ['This is a sentence.', 'This too, but in another line']

'This is a sentence\nwith a newline literal!' ---> ['This is a sentence', 'with a newline literal!']



SpaCy is a fantastic library for processing texts. We can also use it to detect sentences and use these to chunk the text data. SpaCy is also available in many different languages: https://spacy.io/

In [2]:
# spacy sentence splitter 
import spacy 

text = """
The quick brown fox jumps over the lazy dog. 

This is a well-known pangram, a sentence that uses every letter of the alphabet at least once.
"""

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

def spacy_splitter(spacy_doc):
    return [str(sent) for sent in spacy_doc.sents]

splits = spacy_splitter(doc)
print(splits)
print(len(splits))


['\nThe quick brown fox jumps over the lazy dog.', '\n\n', 'This is a well-known pangram, a sentence that uses every letter of the alphabet at least once.\n']
3
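
Note that the raw sentence list above still contains leading newlines and a whitespace-only entry. A possible post-processing step is to clean the sentences and merge neighbours into chunks up to a maximum size. This is a sketch under my own assumptions (the function name, the character-based limit and the single-space joiner are not part of spaCy or bricks):

```python
from typing import List

def merge_sentences(sentences: List[str], max_chars: int = 200) -> List[str]:
    """Strip sentences, drop empty ones and greedily merge neighbours up to max_chars."""
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue  # skip whitespace-only "sentences"
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # current chunk is full, start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

raw = [
    "\nThe quick brown fox jumps over the lazy dog.",
    "\n\n",
    "This is a well-known pangram, a sentence that uses every letter of the alphabet at least once.\n",
]
print(merge_sentences(raw, max_chars=60))   # two chunks, one per sentence
print(merge_sentences(raw, max_chars=200))  # one merged chunk
```

A single sentence longer than `max_chars` is kept whole rather than cut, so the limit is a soft one.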


Libraries like LangChain also apply some additional logic, for example by recursively chunking to build better text chunks. This can also be combined with the previous chunking approaches.

In [5]:
# recursive splitter 
class RecursiveCharacterTextSplitter:
    def __init__(self, separators=None, keep_separator=True, is_separator_regex=False, chunksize=50):
        self.separators = separators if separators else [' ']
        self.keep_separator = keep_separator
        self.is_separator_regex = is_separator_regex
        self.chunksize = chunksize

    def split_text(self, text):
        chunks = []
        for i in range(0, len(text), self.chunksize):
            chunk = text[i:i+self.chunksize]
            for sep in self.separators:
                if sep in chunk:
                    parts = chunk.rsplit(sep, 1)  # Split on the last occurrence of the separator
                    chunks.extend([parts[0] + sep] if self.keep_separator else [parts[0]])
                    remaining_text = parts[1] + text[i+self.chunksize:]
                    return chunks + self.split_text(remaining_text)  # Recursively split the remaining text
            else:
                chunks.append(chunk)
        return chunks
    
# ↑ necessary function 
# -----------------------------------------------------------------------------------------
# ↓ example implementation 

splitter = RecursiveCharacterTextSplitter(separators=["\n", " "], chunksize=100)
text = """
The quick brown fox jumps over the lazy dog. This is a well-known pangram, a sentence that uses every letter of the alphabet at least once. Pangrams have been used for centuries and are a fascinating aspect of the English language.
However, not all languages have pangrams. For example, in Chinese, which uses a logographic writing system, it's impossible to create a pangram because there are thousands of characters, and a sentence containing all of them would be impractically long.

In contrast, languages with alphabets, like English, French, and German, can have pangrams. Some other examples of English pangrams include "Pack my box with five dozen liquor jugs" and "How vexingly quick daft zebras jump!"
Pangrams are useful for testing keyboards, fonts, and other typography-related tools. They can show how each character in a font looks and whether any characters are missing or incorrectly rendered.
So, the next time you see a sentence like "The quick brown fox jumps over the lazy dog," remember that it's not just a quirky sentence. It's a tool that's been used for centuries to help us understand and improve our written language.
"""
splits = splitter.split_text(text)

print(splits)
print(len(splits))

['\n', 'The quick brown fox jumps over the lazy dog. This is a well-known pangram, a sentence that uses ', 'every letter of the alphabet at least once. Pangrams have been used for centuries and are a ', 'fascinating aspect of the English language.\n', 'However, not all languages have pangrams. For example, in Chinese, which uses a logographic writing ', "system, it's impossible to create a pangram because there are thousands of characters, and a ", 'sentence containing all of them would be impractically long.\n\n', 'In contrast, languages with alphabets, like English, French, and German, can have pangrams. Some ', 'other examples of English pangrams include "Pack my box with five dozen liquor jugs" and "How ', 'vexingly quick daft zebras jump!"\n', 'Pangrams are useful for testing keyboards, fonts, and other typography-related tools. They can show ', 'how each character in a font looks and whether any characters are missing or incorrectly rendered.\n', 'So, the next time you see a sent

<hr>

### Table clean up

Tables from PDFs can be a mess. Below is a table I extracted from an invoice I had lying around. You can see that the extracted table is super messy. We can use GPT to fix this! For this part, you need an OpenAI API key. See their quickstart for how to get started: https://platform.openai.com/docs/quickstart?context=python

In [6]:
# cleaning up tables using GPT
import openai
import os

openai.api_type = "openai"
openai.api_key = os.getenv("OPENAI_API_KEY")

raw_markdown_table = """
Datum Beschreibung Preis Leistungs-
datum

MwSt Betrag
MwSt
Betrag
netto
Betrag
brutto
13.11.23 Fahrkarte Sparpreis, Nürnberg
Hbf → Berlin Hbf (tief), 2.
Klasse, 1 Person (15-26 Jahre)

66,65 € 14.11.23^17 % (D) 4,36 € 62,29 € 66,65 €
13.11.23 Reservierung Nürnberg Hbf
→ Berlin Hbf (tief), 1 Person
(27-64 Jahre)

0,00 € 14.11.
13.11.23 Reservierungsentgelt 2. Klasse 4,90 € 14.11.23^17 % (D) 0,32 € 4,58 € 4,90 €

Summe (netto) 7 % (D) 66,87 €
zzgl. 7 % MwSt (D) 4,68 €

Summe (brutto) 71,55 €
"""

completion = openai.chat.completions.create(
  model="gpt-3.5-turbo",
  temperature=0.0,  # keyword argument of create(), not an entry of the messages list
  messages=[
    {"role": "system", "content": f"""
        I extracted this table from a PDF, but it's pretty messy. Could you use this to create me a clean markdown table?
        =====
        Here's the raw extracted table: {raw_markdown_table}
        =====
        Start here: 
        """},
  ]
)

print(completion.choices[0].message.content)

| Datum     | Beschreibung                          | Preis       | Leistungsdatum | MwSt Betrag | MwSt  Betrag | netto     | brutto    |
|-----------|---------------------------------------|-------------|----------------|-------------|--------------|-----------|-----------|
| 13.11.23  | Fahrkarte Sparpreis, Nürnberg Hbf →... | 66,65 €     | 14.11.23^17 %  | (D) 4,36 €  | 62,29 €      | 66,65 €  |           |
| 13.11.23  | Reservierung Nürnberg Hbf → Berlin...  | 0,00 €      | 14.11.        |             |              |           |           |
| 13.11.23  | Reservierungsentgelt 2. Klasse        | 4,90 €      | 14.11.23^17 %  | (D) 0,32 €  | 4,58 €       | 4,90 €    |           |
| Summe     | (netto)                               | 7 % (D)     | 66,87 €        |             |              |           |           |
| zzgl. 7 % | MwSt (D)                              | 4,68 €      |                |             |              |           |           |
| Summe     | (brutto)            
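
Since the model output is not guaranteed to be a well-formed table, a quick sanity check can flag rows whose number of cells differs from the header. This is a naive sketch with made-up function names; it assumes a simple pipe-delimited table and does not handle escaped pipes inside cells.

```python
from typing import List

def cell_count(row: str) -> int:
    """Number of cells in one pipe-delimited markdown row."""
    return len(row.strip().strip("|").split("|"))

def inconsistent_table_rows(markdown_table: str) -> List[int]:
    """Return the indices of rows whose cell count differs from the header row."""
    rows = [line for line in markdown_table.strip().splitlines() if line.strip()]
    if not rows:
        return []
    expected = cell_count(rows[0])
    return [i for i, row in enumerate(rows) if cell_count(row) != expected]

table = """
| Datum | Preis |
|-------|-------|
| 13.11.23 | 66,65 € |
| 13.11.23 | 0,00 € | 14.11. |
"""
print(inconsistent_table_rows(table))  # [3]
```

Rows that get flagged are good candidates for a manual check or a second cleanup pass.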

<hr>

## Handling multilingualism

If you are working for a bigger organization, chances are that the data you work with is in multiple languages. Luckily there are some easy approaches that can make our life a bit easier when dealing with these use cases. 

<hr>

### Language detection

Before we process any data, it is generally a good idea to find out what languages you are dealing with. For this, the **langdetect** library is fantastic, as it is super lightweight and easy to use!

In [6]:
# detect the language of your data
from langdetect import detect

def language_detection(text:str)->str:    
    if not text or not text.strip():
        return "unknown"
    return detect(text)

# ↑ necessary bricks function 
# -----------------------------------------------------------------------------------------
# ↓ example implementation 

def example_integration():
    texts = ["This is an english sentence.", "Dies ist ein Text in Deutsch."]
    for text in texts:
        print(f"\"{text}\" is written in {language_detection(text)}")

example_integration()

"This is an english sentence." is written in en
"Dies ist ein Text in Deutsch." is written in de


We often noticed that, even when given the right instructions, an LLM tends to answer in the language that the system message is in. Using **langdetect**, we can quite easily classify an input text and then inject a predefined system message in the matching language, even if the data for our context is in a different language. 

![simple architecture](multilingualism.png "Simple multilingual setup")
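
The setup described above could look roughly like this. It is only a sketch: the system messages, the fallback language and the function names are placeholders I made up, and the detected language code would come from a detector such as the `language_detection` function shown earlier.

```python
from typing import Dict, List

# Hypothetical predefined system messages, keyed by language code
SYSTEM_MESSAGES: Dict[str, str] = {
    "en": "You are a helpful assistant. Answer in English.",
    "de": "Du bist ein hilfreicher Assistent. Antworte auf Deutsch.",
}
DEFAULT_LANGUAGE = "en"  # assumption: fall back to English for unsupported languages

def pick_system_message(detected_language: str) -> str:
    """Return the system message for the detected language, or the default one."""
    return SYSTEM_MESSAGES.get(detected_language, SYSTEM_MESSAGES[DEFAULT_LANGUAGE])

def build_messages(user_text: str, detected_language: str) -> List[Dict[str, str]]:
    """Build a chat message list whose system message matches the user's language."""
    return [
        {"role": "system", "content": pick_system_message(detected_language)},
        {"role": "user", "content": user_text},
    ]

print(build_messages("Was ist grüner Tee?", "de")[0]["content"])
```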

<hr>

### Multilingual embeddings & search 

Actually processing the text data is the next logical step. While the **text-embedding-ada-002** embedding model from OpenAI is really good at languages like English, German or Mandarin, its performance is lacking for minority languages.<br> For smaller languages, we can use a model like **intfloat/multilingual-e5-small** to embed our texts. The model is lightweight enough to run locally, even on normal hardware / CPU only.<br><br>
Link on HuggingFace: https://huggingface.co/intfloat/multilingual-e5-small

In [8]:
# load some example sentences in different languages
with open("language-examples.txt", "r", encoding="utf-8") as f:
    examples = f.readlines()
print(examples)

['Green tea is rich in antioxidants and can boost brain function.\n', 'Grüner Tee ist reich an Antioxidantien und kann die Gehirnfunktion verbessern.\n', 'El té verde es rico en antioxidantes y puede mejorar la función cerebral.\n', 'Le thé vert est riche en antioxydants et peut stimuler la fonction cérébrale.\n', '绿茶富含抗氧化剂，可以提高大脑功能。\n', '緑茶は抗酸化物質が豊富で、脳の機能を向上させることができます。\n', 'Il tè verde è ricco di antiossidanti e può migliorare la funzione cerebrale.']


In [11]:
# downloading and using a multilingual E5 model for embedding purposes and
# implement super simple similarity search

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-small")

def cosine_similarity(a, b):
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)

embeddings = [model.encode(sent) for sent in examples]
base_vector = model.encode("Voglio saperne di più sul tè verde") # "I want to learn more about green tea" in Italian

# Calculate cosine similarity for each vector in the list
similarities = [cosine_similarity(np.array(vector), np.array(base_vector)) for vector in embeddings]

# Find the index of the most similar vector
most_similar_index = np.argmax(similarities)

print(f"The most similar sentence is -> {examples[most_similar_index]}") # ideally our Italian example sentence; in the run recorded below, the English one scored highest

The most similar sentence is -> Green tea is rich in antioxidants and can boost brain function.
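
Looking only at the single best match hides how close the other candidates are. A small sketch that ranks all candidates and returns the top k can make results like the one above easier to inspect. It uses plain Python lists and toy 2D vectors instead of real embeddings, and the function names are made up for illustration.

```python
import math
from typing import List, Sequence, Tuple

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if one of them has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query: Sequence[float], vectors: List[Sequence[float]], k: int = 3) -> List[Tuple[int, float]]:
    """Return (index, similarity) pairs of the k most similar vectors, best first."""
    scored = [(i, cosine(query, v)) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

toy_vectors = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]  # stand-ins for sentence embeddings
for index, score in top_k([1.0, 0.1], toy_vectors, k=2):
    print(index, round(score, 3))
```

With real embeddings, printing the top three sentences next to their scores quickly shows whether the expected sentence was a near miss or not in the running at all.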



For more really good embedding models, you can visit the text embedding leaderboard: https://huggingface.co/spaces/mteb/leaderboard

![leaderboard](leaderboard.png "HuggingFace embedding model leaderboard")

## Handling private or sensitive data

<hr>

Text data is often filled with personal information. Finding and removing it is not an easy task! Especially in the context of RAG, we don't want to give any private information to third parties. Luckily there are ways to use smaller, lightweight models locally to do some of the work for us!

### Names 

Besides splitting our texts, we can also use SpaCy to find the names of people in our texts. SpaCy offers models in many different languages and can be used for many extraction tasks like this!  

In [12]:
# detect names of people using spacy (returns token positions, replacing is up to you)
import spacy
from typing import List, Tuple

def person_extraction(text: str, extraction_keyword: str) -> List[Tuple[str, int, int]]:
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)

    name_positions = []
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            name_positions.append((extraction_keyword, ent.start, ent.end))
    return name_positions

# ↑ necessary bricks function 
# -----------------------------------------------------------------------------------------
# ↓ example implementation

def example_integration():
    texts = ["My name is James Bond.", "Harry met Jane on a sunny afternoon.", "Say my name."]
    extraction_keyword = "name"
    for text in texts:
        found = person_extraction(text, extraction_keyword)
        if found:
            print(f"text: \"{text}\" has {extraction_keyword} -> \"{found}\"")
        else:
            print(f"text: \"{text}\" doesn't have {extraction_keyword}")

example_integration()


text: "My name is James Bond." has name -> "[('name', 3, 5)]"
text: "Harry met Jane on a sunny afternoon." has name -> "[('name', 0, 1), ('name', 2, 3)]"
text: "Say my name." doesn't have name
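
The function above only finds the names. To actually remove them, we can replace the matching character spans with a placeholder. Here is a minimal sketch with a made-up function name. It expects character offsets, which spaCy entities expose as `ent.start_char` and `ent.end_char`, rather than the token positions printed above; the offsets in the example are counted by hand.

```python
from typing import List, Tuple

def redact_spans(text: str, spans: List[Tuple[int, int]], placeholder: str = "[NAME]") -> str:
    """Replace each (start_char, end_char) span with a placeholder."""
    # Work from right to left so earlier offsets stay valid after each replacement
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + placeholder + text[end:]
    return text

print(redact_spans("My name is James Bond.", [(11, 21)]))
# My name is [NAME].
print(redact_spans("Harry met Jane on a sunny afternoon.", [(0, 5), (10, 14)]))
# [NAME] met [NAME] on a sunny afternoon.
```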


<hr>

### General Named Entity Extraction with BERT

For this part you will need a Hugging Face account and API key: https://huggingface.co/

In [1]:
# detect named entities using an open-source transformer model via the Hugging Face Inference API
import requests
import spacy
import os

def bert_ner_extraction(text, api_key):
      headers = {"Authorization": f"Bearer {api_key}"}
      data = {"inputs": text}
      response_json = None  # defined up front so the error message below can always reference it
      try: 
            response = requests.post("https://api-inference.huggingface.co/models/dslim/bert-base-NER", headers=headers, json=data)
            response_json = response.json()
            ner_positions = []

            nlp = spacy.load("en_core_web_sm")
            doc = nlp(text)

            for item in response_json:
                  start = item["start"]
                  end = item["end"]
                  # map the character offsets of the API response to spaCy token positions
                  span = doc.char_span(start, end, alignment_mode="expand")
                  ner_positions.append((item["entity_group"], span.start, span.end))
            return ner_positions
      except Exception as e: 
            return f"That didn't work. Did you provide a valid API key? Got error: {e} and message: {response_json}"

# ↑ necessary bricks function 
# -----------------------------------------------------------------------------------------
# ↓ example implementation 

def example_integration():
      hf_api_key = os.getenv("HUGGINGFACE_API_KEY")
      texts = ["Apple announces new iPhone.", "Angela Merkel was the chancellor of Germany."]
      for text in texts:
            output = bert_ner_extraction(text, api_key=hf_api_key)
            print(output)

example_integration()



[('ORG', 0, 1), ('MISC', 3, 4)]
[('PER', 0, 2), ('LOC', 6, 7)]
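
Each item in the API response also carries a confidence score and the matched text, which we can use to keep only confident hits and group them by entity type. The sketch below works on a mocked response so it runs without an API key. The field names follow the ones used in the code above plus `score` and `word`, the score values are invented for illustration, and the function name and threshold are my own.

```python
from collections import defaultdict
from typing import Dict, List

def group_entities(response_json: List[dict], min_score: float = 0.8) -> Dict[str, List[str]]:
    """Group entity texts by type, keeping only items at or above min_score."""
    grouped = defaultdict(list)
    for item in response_json:
        if item.get("score", 0.0) >= min_score:
            grouped[item["entity_group"]].append(item["word"])
    return dict(grouped)

# Mocked response for "Angela Merkel was the chancellor of Germany." (scores are made up)
mock_response = [
    {"entity_group": "PER", "score": 0.99, "word": "Angela Merkel", "start": 0, "end": 13},
    {"entity_group": "LOC", "score": 0.55, "word": "Germany", "start": 36, "end": 43},
]
print(group_entities(mock_response))  # {'PER': ['Angela Merkel']}
```

For removing sensitive data, a lower threshold is usually the safer choice, since a missed name is worse than an unnecessary redaction.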
