# NLP

**Text cleaning & normalization**

re (built-in regex), string, unicodedata

ftfy (fixes broken unicode)

unidecode (accent → ASCII)

emoji / emot (detect/remove/describe emojis)

clean-text, neattext (common cleaning ops)

**Tokenization & segmentation (rule-based)**

sacremoses (Moses tokenizer, rule-based)

nltk tokenizers (Punkt/simpler word tokenizers; no training needed)

spaCy tokenizer without loading a model (purely rule-based; many languages)

**Stemming & lemmatization (non-ML)**

nltk.stem (Porter, Snowball, Lancaster)

snowballstemmer (standalone)

simplemma (dictionary-based lemmatizer)

**Keyword/phrase matching & string search**

flashtext (very fast exact keyword extraction)

pyahocorasick (multi-pattern search; Aho-Corasick automaton)

rapidfuzz / python-Levenshtein (fuzzy matching, edit distances)

regex (advanced PCRE-like features)

**Rule-based pattern extraction**

spaCy’s Matcher/PhraseMatcher (can use without ML models)

dateparser (recognize/normalize dates)

wordfreq (word frequencies, useful for heuristics)

textstat (readability metrics)

**Spell-check & correction (non-ML)**

pyenchant (dictionary-based spellcheck)

symspellpy (very fast SymSpell corrections)

**Sentiment & simple scoring (lexicon-based)**

vaderSentiment (rule/lexicon-based sentiment, great for social text)

textblob (can use the default Pattern analyzer = lexicon-based)

**Transliteration & Indic text utilities (mostly rule-based)**

indic-transliteration (script ↔ script mapping)

indic-nlp-library (normalizers/tokenizers for Indic)

pykakasi (JP kana/kanji → romaji)

In [1]:
# --- Setup: install third‑party libraries (run as needed) ---
# You can comment out any lines you don't need.

# Core text cleaning
!pip install ftfy unidecode emoji clean-text neattext regex

# Tokenization & segmentation
!pip install sacremoses nltk spacy

# Morphology / lemmatization
!pip install snowballstemmer simplemma

# Keyword search / fuzzy
!pip install flashtext pyahocorasick rapidfuzz

# Rule-based extractors / utils
!pip install dateparser wordfreq textstat

# Spell-check
!pip install pyenchant symspellpy

# Sentiment (lexicon-based) & simple NLP
!pip install vaderSentiment textblob

# Indic tools & transliteration
!pip install indic-transliteration indic-nlp-library pykakasi

# One‑time NLTK downloads (for tokenizers / VADER lexicon / etc.)
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # for newer nltk versions
nltk.download('vader_lexicon')

# (Optional) spaCy: download a small model if you later want ML features
!python -m spacy download en_core_web_sm





## re (built‑in regular expressions)

Classic regex tasks: tokenization, extraction, substitution, named groups, and lookaheads.

### Word tokenization (simple regex)

In [2]:
import re
text = "AI (A.I.) is AMAZING!!! Visit us at ai@example.com"
tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
print(tokens)

['AI', 'A', 'I', 'is', 'AMAZING', 'Visit', 'us', 'at', 'ai', 'example', 'com']


### Extract emails

In [3]:
import re
text = "Contact: alice@test.org, bob.smith@company.co.in"
emails = re.findall(r"[\w\.-]+@[\w\.-]+", text)
print(emails)

['alice@test.org', 'bob.smith@company.co.in']
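
The pattern above accepts any run of word characters, dots, and hyphens after the `@`, so a sentence-final period would be captured as part of the address. A slightly stricter sketch, assuming addresses end in a dot followed by an alphabetic top-level domain of at least two letters (the name `EMAIL_RE` is illustrative, and this is still far from a full address validator):

```python
import re

# Local part, "@", one or more dot-terminated labels, then an alphabetic TLD.
EMAIL_RE = re.compile(r"[\w.+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}")

text = "Contact: alice@test.org, bob.smith@company.co.in. Or not@valid"
emails = EMAIL_RE.findall(text)
print(emails)
```

The trailing period after `co.in` is left out of the match, and `not@valid` is skipped because it has no dotted domain.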


### Substitute multiple spaces / punctuation

In [4]:
import re
text = "Hello,,,   world!!"
clean = re.sub(r"[!,]+", "!", text)
clean = re.sub(r"\s+", " ", clean).strip()
print(clean)

Hello! world!


### Named groups

In [5]:
import re
log = "USER=jane EVENT=LOGIN TS=2025-09-15"
m = re.search(r"USER=(?P<user>\w+)\s+EVENT=(?P<evt>\w+)", log)
print(m.groupdict())

{'user': 'jane', 'evt': 'LOGIN'}


### Positive lookahead (extract words before '!')

In [6]:
import re
text = "Great! Awesome! Meh."
hits = re.findall(r"\b\w+(?=!)", text)  # words followed by '!'
print(hits)

['Great', 'Awesome']


## string (built‑in)

Utilities for punctuation, digits, ascii letters, templates, and simple transformations.

### Remove punctuation using string.punctuation

In [7]:
import string
text = "Hello, world! #NLP"
no_punct = "".join(ch for ch in text if ch not in string.punctuation)
print(no_punct)

Hello world NLP


### Check character classes (ascii letters / digits)

In [8]:
import string
s = "A1b2C3"
print(all(ch in string.ascii_letters+string.digits for ch in s))

True


### Using string.Template for safe substitution

In [9]:
from string import Template
t = Template("Hello, $name! Today is $day.")
print(t.substitute(name="Giri", day="Monday"))

Hello, Giri! Today is Monday.
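
`substitute` raises an error when a placeholder has no value. When partial filling is acceptable, `safe_substitute` leaves unknown placeholders in place. A small sketch of the difference, reusing the template from the cell above:

```python
from string import Template

t = Template("Hello, $name! Today is $day.")

# substitute() raises KeyError when a placeholder has no value...
try:
    t.substitute(name="Giri")
except KeyError as exc:
    missing = exc.args[0]
    print("missing placeholder:", missing)

# ...while safe_substitute() keeps the unresolved placeholder as-is.
partial = t.safe_substitute(name="Giri")
print(partial)
```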


### Capitalize words with string.capwords

In [10]:
import string
text = "hello from hyderabad"
print(string.capwords(text))

Hello From Hyderabad


### Custom translation table (remove digits)

In [11]:
import string
text = "a1 b2 c3"
tbl = str.maketrans("", "", string.digits)
print(text.translate(tbl))

a b c


## unicodedata (built‑in)

Normalize, inspect categories, names, and numeric values of Unicode characters.

### NFC vs NFD normalization

In [12]:
import unicodedata
s = "café"
nfd = unicodedata.normalize("NFD", s)
nfc = unicodedata.normalize("NFC", nfd)
print(s, nfd, nfc, sep=" | ")

café | café | café


### Character categories

In [13]:
import unicodedata
chars = ['A', 'é', '—', '🙂']
print([ (c, unicodedata.category(c)) for c in chars ])

[('A', 'Lu'), ('é', 'Ll'), ('—', 'Pd'), ('🙂', 'So')]


### Character names (if available)

In [14]:
import unicodedata
chars = ['A', 'é', '—', '🙂']
print([ (c, unicodedata.name(c, 'UNKNOWN')) for c in chars ])

[('A', 'LATIN CAPITAL LETTER A'), ('é', 'LATIN SMALL LETTER E WITH ACUTE'), ('—', 'EM DASH'), ('🙂', 'SLIGHTLY SMILING FACE')]


### Strip combining marks (decompose then drop)

In [15]:
import unicodedata
s = "café"
nfd = unicodedata.normalize("NFD", s)
stripped = "".join(ch for ch in nfd if unicodedata.category(ch) != 'Mn')
print(stripped)

cafe


### Numeric values in Unicode (e.g., Roman numerals)

In [16]:
import unicodedata
chars = ['Ⅳ', 'Ⅴ', '５']  # roman 4, 5, fullwidth 5
print([ unicodedata.numeric(c, None) for c in chars ])

[4.0, 5.0, 5.0]
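
The steps above (normalize, drop combining marks) can be combined into one folding helper for search keys. This is a sketch under the assumption that compatibility forms such as fullwidth letters and ligatures should also be folded, which is why it uses NFKD rather than NFD. The name `fold_text` is illustrative.

```python
import unicodedata

def fold_text(s: str) -> str:
    """Compatibility-decompose, drop combining marks, recompose, then casefold."""
    decomposed = unicodedata.normalize("NFKD", s)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", no_marks).casefold()

# Fullwidth letters, an accented letter, the "fi" ligature, and a fullwidth digit.
print(fold_text("Ｃａｆé ﬁle ５"))
```

Folding of this kind is lossy, so it suits matching and indexing rather than display.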


## ftfy

Fixes broken Unicode text (mojibake, curly quotes, etc.).

### Fix mojibake / smart quotes

In [17]:
!pip install ftfy



In [18]:
from ftfy import fix_text
text = "This â€” is \"broken\" text…"
print(fix_text(text))

This — is "broken" text…


### Normalize weird spacing

In [19]:
from ftfy import fix_text
print(fix_text("Hello\u00A0world"))  # NBSP -> space


Hello world


### Repair ligatures & punctuation

In [20]:
from ftfy import fix_text
print(fix_text("offiﬁce — fiﬂes"))


offifice — fifles


### Decode mixed encodings

In [21]:
from ftfy import fix_text
print(fix_text('cafÃ© -> café'))


café -> café


### Straighten curly quotes

In [22]:
from ftfy import fix_text
print(fix_text("He said, ‘hello’ -- ok?"))


He said, 'hello' -- ok?


## unidecode

ASCII transliteration for non‑ASCII characters.

### Accented Latin → ASCII

In [23]:
from unidecode import unidecode
print(unidecode("café naïve jalapeño"))


cafe naive jalapeno


### Greek → ASCII

In [24]:
from unidecode import unidecode
print(unidecode("Αθήνα (Athens)"))

Athena (Athens)


### Cyrillic → ASCII

In [25]:
from unidecode import unidecode
print(unidecode("Привет мир"))


Privet mir


### Han/Kanji → phonetic ASCII

In [26]:
from unidecode import unidecode
print(unidecode("東京"))


Dong Jing 


### Emoji fallback

In [27]:
from unidecode import unidecode
print(unidecode("I ❤️ NLP"))


I  NLP


## emoji

Detect, demojize (emoji → :shortcode:), emojize (:shortcode: → emoji).

### Find emojis in text

In [28]:
import emoji
text = "Great job! 👍🔥"
print([ch for ch in text if ch in emoji.EMOJI_DATA])


['👍', '🔥']


### Demojize to :shortcodes:

In [29]:
import emoji
print(emoji.demojize("I ❤️ NLP"))


I :red_heart: NLP


### Emojize from :shortcodes:

In [30]:
import emoji
print(emoji.emojize("Python is :fire:", language='alias'))


Python is 🔥


### Replace emojis with tokens

In [31]:
import re
text = "So happy 😊"
print(re.sub(r"[\U0001F600-\U0001F64F]", "<EMOJI>", text))


So happy <EMOJI>
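
The range `\U0001F600-\U0001F64F` covers only the Emoticons block, so characters such as 🔥 (U+1F525) or 👍 (U+1F44D) would pass through unchanged. A broader sketch follows. The chosen blocks are an assumption and are not exhaustive; the `emoji` package's own data is more complete. `EMOJI_RE` and `mask_emojis` are illustrative names.

```python
import re

# A few common emoji blocks; not exhaustive.
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # miscellaneous symbols and pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]+"
)

def mask_emojis(text: str, token: str = "<EMOJI>") -> str:
    """Replace each run of emoji characters with a single placeholder token."""
    return EMOJI_RE.sub(token, text)

print(mask_emojis("So happy 😊 great job 👍🔥"))
```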


### Count emojis

In [32]:
import emoji
text = "wow 😮😮😮"
print(sum(ch in emoji.EMOJI_DATA for ch in text))


3


## clean-text

One‑liner cleaning utilities via `cleantext.clean`.

### Basic cleaning

In [33]:
from cleantext import clean
s = "  Hello!!! Visit https://x.y  \n \t #NLP "
print(clean(s, lower=True, no_urls=True, no_punct=True, no_line_breaks=True))


hello visit httpsxy nlp


### Strip emojis & accents

In [34]:
from cleantext import clean
print(clean("Café 😊", no_emoji=True, fix_unicode=True))


cafe


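### Keep case and remove numbers

In [35]:
from cleantext import clean
# no_numbers only targets standalone numbers, so digits embedded in tokens
# such as "A1" are left untouched; no_digits=True is the option aimed at single digits.
print(clean("A1 B2 C3", lower=False, no_numbers=True))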


A1 B2 C3


### Replace currency symbols

In [36]:
from cleantext import clean
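# Default call: to_ascii transliterates ₹ to "rs"; "$" is left as-is.
# To substitute a placeholder instead, pass no_currency_symbols=True together
# with replace_with_currency_symbol="<CUR>".
print(clean("Price: ₹100 or $5"))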


price: rs100 or $5


### Custom replacements

In [38]:
from cleantext import clean


## neattext

Convenient text-cleaning helpers with a chainable API.

### Remove special chars and digits

In [40]:
import neattext.functions as nfx
s = "a1! b2? c3."
print(nfx.remove_special_characters(nfx.remove_numbers(s)))


a b c


### Extract emails & phones

In [41]:
import neattext.functions as nfx
s = "Email me at a@b.com or +91-90000-12345"
print(nfx.extract_emails(s), nfx.extract_phone_numbers(s))


['a@b.com'] ['+91-90000-12345']


### Word tokens with TextFrame

In [43]:
import neattext as nt
doc = nt.TextFrame(text="NLP is fun.")
# word_tokens is a method; without the call parentheses, print shows the bound method.
print(doc.word_tokens())


## sacremoses (Moses tokenizer)

Rule‑based tokenization/detokenization (Moses).

### Basic tokenize

In [44]:
from sacremoses import MosesTokenizer
mt = MosesTokenizer(lang='en')
print(mt.tokenize("Hello, world!", return_str=False))


['Hello', ',', 'world', '!']


### Return as string

In [45]:
from sacremoses import MosesTokenizer
mt = MosesTokenizer(lang='en')
print(mt.tokenize("Don't split contractions.", return_str=True))


Don &apos;t split contractions .


### Detokenize

In [46]:
from sacremoses import MosesTokenizer, MosesDetokenizer
mt = MosesTokenizer(lang='en')
md = MosesDetokenizer(lang='en')
toks = mt.tokenize("Hello, world!")
print(md.detokenize(toks))


Hello, world!


### Escape XML

In [47]:
from sacremoses import MosesTokenizer
mt = MosesTokenizer()
print(mt.escape_xml("5 < 6 & 7 > 3"))


5 &lt; 6 &amp; 7 &gt; 3


### Non‑English example

In [48]:
from sacremoses import MosesTokenizer
mt = MosesTokenizer(lang='de')
print(mt.tokenize("Das ist großartig!"))


['Das', 'ist', 'großartig', '!']


## NLTK tokenizers

Sentence and word tokenization using NLTK (no ML training). Requires punkt models.

### Sentence tokenize

In [49]:
import nltk
from nltk.tokenize import sent_tokenize
text = "Hello world. How are you? I'm fine."
print(sent_tokenize(text))


['Hello world.', 'How are you?', "I'm fine."]


### Word tokenize

In [50]:
from nltk.tokenize import word_tokenize
print(word_tokenize("Don't split! Please."))


['Do', "n't", 'split', '!', 'Please', '.']


### TreebankWordTokenizer

In [51]:
from nltk.tokenize import TreebankWordTokenizer
print(TreebankWordTokenizer().tokenize("It''s 3.14 (approx.)"))


['It', "''", 's', '3.14', '(', 'approx', '.', ')']


### RegexpTokenizer

In [52]:
from nltk.tokenize import RegexpTokenizer
tok = RegexpTokenizer(r"[A-Za-z]+")
print(tok.tokenize("Hello, world! 123"))


['Hello', 'world']


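### Punkt (untrained instance)

In [53]:
from nltk.tokenize.punkt import PunktSentenceTokenizer
text = "Dr. Smith went to the U.K. It was fun."
# An untrained PunktSentenceTokenizer has no learned abbreviations, so it splits after "Dr.";
# sent_tokenize loads the pretrained English parameters instead.
print(PunktSentenceTokenizer().tokenize(text))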


['Dr.', 'Smith went to the U.K.', 'It was fun.']


## spaCy tokenizer (no model)

Use spaCy’s rule-based tokenizer by creating a blank Language object.

### Basic English tokenizer

In [54]:
import spacy
from spacy.lang.en import English
nlp = English()
doc = nlp("Hello, world! It's 2025.")
print([t.text for t in doc])


['Hello', ',', 'world', '!', 'It', "'s", '2025', '.']


### Customize token rules (add special case)

In [55]:
from spacy.lang.en import English
from spacy.symbols import ORTH
nlp = English()
special_case = [{ORTH: "NLP"}]
nlp.tokenizer.add_special_case("NLP", special_case)
print([t.text for t in nlp("I love NLP and AI.")])


['I', 'love', 'NLP', 'and', 'AI', '.']


### Keep URLs as single tokens

In [56]:
from spacy.lang.en import English
nlp = English()
doc = nlp("Read https://example.com now.")
print([t.text for t in doc])

['Read', 'https://example.com', 'now', '.']


### Whitespace tokenizer

In [57]:
import spacy
from spacy.lang.en import English
nlp = English()
nlp.tokenizer = spacy.tokenizer.Tokenizer(nlp.vocab, rules={})  # simplistic
print([t.text for t in nlp("Hello   world")])


['Hello', '  ', 'world']


### Inspect token attributes (is_punct, is_space)

In [58]:
from spacy.lang.en import English
nlp = English()
doc = nlp("Hi!  ")
print([(t.text, t.is_punct, t.is_space) for t in doc])


[('Hi', False, False), ('!', True, False), (' ', False, True)]


## NLTK stemming

Porter, Snowball, and Lancaster stemmers.

### PorterStemmer

In [59]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
print([ps.stem(w) for w in ["caresses", "ponies", "caress", "cats"]])


['caress', 'poni', 'caress', 'cat']


### SnowballStemmer (English)

In [60]:
from nltk.stem.snowball import SnowballStemmer
ss = SnowballStemmer("english")
print([ss.stem(w) for w in ["running", "easily", "fairly"]])


['run', 'easili', 'fair']


### LancasterStemmer

In [61]:
from nltk.stem import LancasterStemmer
ls = LancasterStemmer()
print([ls.stem(w) for w in ["maximum", "presumably"]])


['maxim', 'presum']


### Compare stemmers

In [62]:
from nltk.stem import PorterStemmer, LancasterStemmer
ps, ls = PorterStemmer(), LancasterStemmer()
word = "universities"
print(ps.stem(word), ls.stem(word))


univers univers


### Stem a sentence

In [63]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
sent = "Cats are running easily."
# split() leaves the final period attached, so "easily." comes back unstemmed.
print(" ".join(ps.stem(w) for w in sent.lower().split()))


cat are run easily.


## snowballstemmer (standalone)

Pure Snowball stemmer package with many languages.

### English stemmer

In [64]:
import snowballstemmer
stemmer = snowballstemmer.stemmer('english')
print(stemmer.stemWords(["running", "happiness", "cats"]))


['run', 'happi', 'cat']


### Spanish stemmer

In [65]:
import snowballstemmer
stemmer = snowballstemmer.stemmer('spanish')
print(stemmer.stemWords(["corriendo", "felicidad"]))


['corr', 'felic']


### German stemmer

In [66]:
import snowballstemmer
stemmer = snowballstemmer.stemmer('german')
print(stemmer.stemWords(["großartig", "Kinder"]))


['grossart', 'Kind']


### French stemmer

In [67]:
import snowballstemmer
stemmer = snowballstemmer.stemmer('french')
print(stemmer.stemWords(["manges", "mangé"]))


['mang', 'mang']


### List supported languages

In [68]:
import snowballstemmer
print(snowballstemmer.algorithms())


['arabic', 'armenian', 'basque', 'catalan', 'danish', 'dutch', 'dutch_porter', 'english', 'esperanto', 'estonian', 'finnish', 'french', 'german', 'greek', 'hindi', 'hungarian', 'indonesian', 'irish', 'italian', 'lithuanian', 'nepali', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'serbian', 'spanish', 'swedish', 'tamil', 'turkish', 'yiddish']


## simplemma

Dictionary-based lemmatizer for many languages.

### English lemmatization

In [69]:
from simplemma import lemmatize

print([lemmatize(w, lang="en") for w in ["running", "mice", "better"]])
# Irregular forms come from the dictionary, e.g. "better" maps to "good".


['run', 'mouse', 'good']


### French lemmatization

In [70]:
from simplemma import lemmatize

words = ["manges", "allées"]
print([lemmatize(w, lang="fr") for w in words])


['manger', 'allée']


### German lemmatization

In [71]:
from simplemma import lemmatize

words = ["Kinder", "großen"]
print([lemmatize(w, lang="de") for w in words])


['Kind', 'groß']


### Mixed-language handling

In [72]:
from simplemma import lemmatize

words = ["running", "manges"]
print([lemmatize(w, "en") for w in words])  # e.g. ['run', 'manges']
print([lemmatize(w, "fr") for w in words])  # e.g. ['running', 'manger']


['run', 'manges']
['running', 'manger']


### Sentence lemmatization

In [73]:
from simplemma import lemmatize

sent = "Cats were running faster"
print(" ".join(lemmatize(w, "en") for w in sent.split()))


cat be run fast


## flashtext

Very fast exact keyword extraction and replacement.

### Extract keywords

In [74]:
from flashtext import KeywordProcessor
kp = KeywordProcessor(case_sensitive=False)
kp.add_keywords_from_list(["nlp", "genai", "python"])
text = "I love Python and NLP!"
print(kp.extract_keywords(text))


['python', 'nlp']


### Add mapping (synonym → canonical)

In [75]:
from flashtext import KeywordProcessor
kp = KeywordProcessor(case_sensitive=False)
kp.add_keyword("ai", "artificial_intelligence")
print(kp.extract_keywords("AI rocks"))


['artificial_intelligence']


### Replace keywords inline

In [76]:
from flashtext import KeywordProcessor
kp = KeywordProcessor()
kp.add_keyword("IBM", "International Business Machines")
print(kp.replace_keywords("IBM Cloud"))


International Business Machines Cloud


### Map several keywords to one label (add_keywords_from_dict)

In [77]:
from flashtext import KeywordProcessor
kp = KeywordProcessor()
kp.add_keywords_from_dict({"lang": ["Python", "Java", "C++"]})
print(kp.extract_keywords("I code in Python and C++"))


['lang', 'lang']


### Keyword spans (offsets)
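
Character offsets are useful when matches have to be highlighted or replaced later. The following is a standard-library sketch of the idea using `re.finditer` rather than flashtext itself. The helper name `keyword_spans` is illustrative, and whole-word, case-insensitive matching is an assumption.

```python
import re

def keyword_spans(text, keywords):
    """Return (keyword, start, end) for whole-word, case-insensitive hits."""
    # Longest keywords first so that a longer phrase wins over its prefix.
    ordered = sorted(keywords, key=len, reverse=True)
    alternation = "|".join(re.escape(k) for k in ordered)
    pattern = re.compile(rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE)
    return [(m.group(0).lower(), m.start(), m.end()) for m in pattern.finditer(text)]

spans = keyword_spans("I love Python and NLP!", ["nlp", "genai", "python"])
print(spans)
```

Each tuple gives the keyword with its start and exclusive end offset, so `text[start:end]` recovers the matched surface form. flashtext's `extract_keywords` accepts `span_info=True` for the same purpose.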

## pyahocorasick

Aho–Corasick automaton for multi-pattern substring search.

### Build automaton and search

In [78]:
import ahocorasick
A = ahocorasick.Automaton()
for i, w in enumerate(["nlp", "python", "ai"]):
    A.add_word(w, (i, w))
A.make_automaton()
for end_idx, (i, w) in A.iter("I love python and NLP"):
    print(end_idx, w)


12 python


### Find first match only

In [79]:
import ahocorasick
A = ahocorasick.Automaton()
for w in ["foo", "bar"]:
    A.add_word(w, w)
A.make_automaton()
print(next(A.iter("barbecue")))  # first match


(2, 'bar')


### Store custom payloads

In [80]:
import ahocorasick
A = ahocorasick.Automaton()
A.add_word("nlp", {"tag":"tech"})
A.make_automaton()
print(list(A.iter("nlp rules")))

[(2, {'tag': 'tech'})]


### Case-insensitive build

In [81]:
import ahocorasick
A = ahocorasick.Automaton()
for w in ["NLP","Python"]:
    A.add_word(w.lower(), w)
A.make_automaton()
text = "I like PYTHON"
print([v for _, v in A.iter(text.lower())])


['Python']
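
To show what the automaton does internally, here is a compact pure-Python sketch of Aho–Corasick: a trie, failure links computed breadth-first, and merged output lists. It is an illustration of the algorithm, not pyahocorasick's implementation. `build_automaton` and `find_all` are illustrative names.

```python
from collections import deque

def build_automaton(words):
    """Build trie transitions, failure links, and output lists."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link per state
    out = [[]]    # words recognized when a state is reached
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass: each failure link points to the longest proper
    # suffix of the node's path that is also a path from the root.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_all(text, automaton):
    """Yield (end_index, word) for every occurrence in a single pass."""
    goto, fail, out = automaton
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            yield i, word

automaton = build_automaton(["he", "she", "his", "hers"])
matches = list(find_all("ushers", automaton))
print(matches)
```

As in the pyahocorasick cells, each hit is reported with the index of its last character, and overlapping matches ("she" and "he") are both found.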


## rapidfuzz

Fast fuzzy matching (ratios, extract, distances).

### Similarity ratio

In [82]:
from rapidfuzz import fuzz
print(fuzz.ratio("house", "houses"))


90.9090909090909
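
rapidfuzz documents `fuzz.ratio` as a normalized Indel similarity, where only insertions and deletions count. The value above can be reproduced from the longest common subsequence. This is a plain-Python sketch for intuition, not the library's optimized code; `lcs_length` and `indel_ratio` are illustrative names.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, using one rolling row."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if ch_a == ch_b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def indel_ratio(a: str, b: str) -> float:
    """100 * (1 - indel_distance / (len(a) + len(b)))."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    indel_distance = total - 2 * lcs_length(a, b)
    return 100.0 * (1 - indel_distance / total)

print(indel_ratio("house", "houses"))
```

For "house" and "houses" the common subsequence has length 5, so the Indel distance is 1 out of a combined length of 11, which gives roughly 90.91.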


### Partial ratio

In [83]:
from rapidfuzz import fuzz
print(fuzz.partial_ratio("New York City", "York"))


100.0


### Token sort ratio

In [84]:
from rapidfuzz import fuzz
print(fuzz.token_sort_ratio("NLP and AI", "AI & NLP"))


66.66666666666667


### Top matches from choices

In [85]:
from rapidfuzz import process
choices = ["python", "java", "javascript", "rust"]
print(process.extract("py", choices, limit=2))


[('python', 90.0, 0), ('javascript', 45.0, 2)]


### Levenshtein distance

In [86]:
from rapidfuzz.distance import Levenshtein
print(Levenshtein.distance("kitten", "sitting"))


3
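
For reference, the distance above follows from the classic dynamic-programming recurrence with unit costs. The version below is a short sketch that keeps only two rows of the table; the function name `levenshtein` is illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))              # distances from "" to b[:j]
    for i, ch_a in enumerate(a, start=1):
        cur = [i]                               # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,                    # delete ch_a
                cur[j - 1] + 1,                 # insert ch_b
                prev[j - 1] + (ch_a != ch_b),   # substitute, or free match
            ))
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))
```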


## regex (enhanced regex module)

Advanced features: overlapped matches, fuzzy matching, named sets, etc.

### Overlapped matches

In [87]:
import regex as re
print(re.findall(r"(?r)aba", "ababa", overlapped=True))


['aba', 'aba']


### Fuzzy matching (up to 1 error)

In [88]:
import regex as re
m = re.search(r"(?:color){e<=1}", "colour")
print(bool(m))


True


### Named sets / character properties

In [89]:
import regex as re
print(bool(re.match(r"\p{L}+", "Café")))


True


## spaCy Matcher & PhraseMatcher (rule‑based)

Build token patterns and phrase matchers without ML models.

### Matcher: adjective + noun

In [90]:
import spacy
from spacy.lang.en import English
from spacy.matcher import Matcher
nlp = English()
matcher = Matcher(nlp.vocab)
# A POS pattern such as [{"POS": "ADJ"}, {"POS": "NOUN"}] needs a trained pipeline for tagging.
# With a blank pipeline, match on token text instead:
pattern = [{"LOWER": {"IN": ["great","awesome","good"]}}, {"LOWER": "work"}]
matcher.add("ADJ_NOUN", [pattern])
doc = nlp("great work by team; awesome work indeed")
print([doc[s:e].text for _, s, e in matcher(doc)])


['great work', 'awesome work']


### PhraseMatcher: canned phrases

In [None]:
import spacy
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher
nlp = English()
ph = PhraseMatcher(nlp.vocab)
phrases = [nlp.make_doc(p) for p in ["natural language processing","machine translation"]]
ph.add("TECH", phrases)
doc = nlp("I study natural language processing.")
print([doc[s:e].text for _, s, e in ph(doc)])


### Get spans with offsets

In [None]:
import spacy
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher
nlp = English()
ph = PhraseMatcher(nlp.vocab)
ph.add("BRANDS", [nlp.make_doc("IBM"), nlp.make_doc("OpenAI")])
doc = nlp("IBM partners with OpenAI.")
print([(doc[s:e].text, s, e) for _, s, e in ph(doc)])


### Chaining multiple patterns

In [None]:
import spacy
from spacy.lang.en import English
from spacy.matcher import Matcher
nlp = English()
m = Matcher(nlp.vocab)
m.add("EMAIL", [[{"TEXT": {"REGEX": ".+@.+"}}]])
m.add("HASHTAG", [[{"TEXT": {"REGEX": r"#\w+"}}]])
doc = nlp("Mail a@b.com #hashtag")
print([doc[s:e].text for _, s, e in m(doc)])


### On-the-fly token attributes (LIKE_NUM/IS_PUNCT)

In [None]:
import spacy
from spacy.lang.en import English
nlp = English()
doc = nlp("I have 2 apples, you have 3.")
print([(t.text, t.like_num, t.is_punct) for t in doc])


## dateparser

Parse natural language dates and extract all date mentions.

### Parse simple dates

In [None]:
import dateparser
print(dateparser.parse("15 Sep 2025"))


### Relative dates

In [None]:
import dateparser
print(dateparser.parse("next Friday at 5pm"))


### Locale-specific formats

In [None]:
import dateparser
print(dateparser.parse("15/09/2025", settings={"DATE_ORDER":"DMY"}))


### Search and extract all dates

In [None]:
from dateparser.search import search_dates
print(search_dates("Pay on 15 Sep or 20/09/2025"))


### Time zone handling

In [None]:
import dateparser
print(dateparser.parse("2025-09-15 10:00 IST"))


## wordfreq

Word frequency estimates and helpful lists.

### Zipf frequency

In [None]:
from wordfreq import zipf_frequency
print(zipf_frequency("python", "en"))


### Word frequency

In [None]:
from wordfreq import word_frequency
print(word_frequency("hyderabad", "en"))
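
The two scales are related: the Zipf value is the base-10 logarithm of occurrences per billion words, so it equals `log10(frequency) + 9`. Here is a small conversion sketch. The helper names are illustrative, and rounding to two decimals is an assumption chosen to resemble the values wordfreq reports.

```python
import math

def freq_to_zipf(freq: float) -> float:
    """Proportion of tokens (0 < freq <= 1) to Zipf scale."""
    return round(math.log10(freq) + 9, 2)

def zipf_to_freq(zipf: float) -> float:
    """Zipf scale back to a proportion of tokens."""
    return 10 ** (zipf - 9)

print(freq_to_zipf(1e-6))   # a word seen once per million tokens
print(zipf_to_freq(6.0))    # a word seen once per thousand tokens
```

On this scale a threshold such as `> 2.0` corresponds to words that occur more often than about once per ten million tokens.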


### Top N words by language

In [None]:
from wordfreq import top_n_list
print(top_n_list("en", n=10)[:5])


### Filter rare tokens (threshold)

In [None]:
from wordfreq import zipf_frequency
words = "this is a rarewordxyz maybe".split()
print([w for w in words if zipf_frequency(w,"en") > 2.0])


### Compare frequencies across words

In [None]:
from wordfreq import zipf_frequency
items = ["python","java","elixir"]
print(sorted(items, key=lambda w: zipf_frequency(w,"en"), reverse=True))


## textstat

Readability scores and text statistics.

### Flesch Reading Ease

In [None]:
import textstat
print(textstat.flesch_reading_ease("This is a simple sentence."))
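
The score comes from the classic Flesch formula, which needs only three counts. The sketch below takes hand-counted values; textstat's own syllable counting and rounding may produce a slightly different number for the same sentence.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease from raw counts (higher means easier to read)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

# "This is a simple sentence." counted by hand: 5 words, 1 sentence, 7 syllables.
score = flesch_reading_ease(words=5, sentences=1, syllables=7)
print(round(score, 2))
```

Longer sentences and more syllables per word both lower the score.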


### Grade level

In [None]:
import textstat
print(textstat.text_standard("Complex, polysyllabic content may score differently."))


### Syllable count & char count

In [None]:
import textstat
s = "Readability matters."
print(textstat.syllable_count(s), textstat.char_count(s))


### Sentence count

In [None]:
import textstat
print(textstat.sentence_count("One. Two! Three?"))


## pyenchant

Dictionary-based spell checking and suggestions.

### Check spelling

In [None]:
import enchant
d = enchant.Dict("en_US")
print(d.check("color"), d.check("colur"))


## symspellpy

Very fast spell correction using the SymSpell algorithm.

### Create dictionary and add words

In [None]:
from symspellpy import SymSpell, Verbosity
sym = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
for w,f in [("color",10),("colour",5),("colors",3)]:
    sym.create_dictionary_entry(w, f)
print(sym.lookup("colur", Verbosity.CLOSEST))
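
The speed comes from the symmetric-delete idea: index every dictionary word under all strings obtained by deleting up to N characters, then generate only deletes of the query at lookup time. The sketch below is a toy version of that idea, not symspellpy's implementation. It uses plain Levenshtein distance for verification and ignores `prefix_length`, and all helper names are illustrative.

```python
from collections import defaultdict

def deletes(word: str, max_dist: int) -> set:
    """All strings reachable from word by deleting up to max_dist characters."""
    results, frontier = {word}, {word}
    for _ in range(max_dist):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results

def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def build_index(freqs: dict, max_dist: int = 2) -> dict:
    """Map each delete variant to the dictionary words that produce it."""
    index = defaultdict(set)
    for word in freqs:
        for d in deletes(word, max_dist):
            index[d].add(word)
    return index

def lookup(term, index, freqs, max_dist=2):
    """Candidates sharing a delete variant, ranked by distance then frequency."""
    candidates = set()
    for d in deletes(term, max_dist):
        candidates |= index.get(d, set())
    scored = [(edit_distance(term, c), -freqs[c], c) for c in candidates]
    return [(c, dist) for dist, _, c in sorted(scored) if dist <= max_dist]

freqs = {"color": 10, "colour": 5, "colors": 3}
index = build_index(freqs)
suggestions = lookup("colur", index, freqs)
print(suggestions)
```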


### Compound word correction

## vaderSentiment (lexicon-based sentiment)

Social-text‑oriented sentiment analyzer; no ML required.

### Basic polarity scores

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
print(sid.polarity_scores("VADER is extremely useful!"))


### Negation handling

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
print(sid.polarity_scores("I don't like this."))


### Emoji & slang handling

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
print(sid.polarity_scores("This is lit 🔥"))


### Contrastive conjunctions

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
print(sid.polarity_scores("The plot was boring, but the acting was great!"))


### Batch evaluate lines

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
lines = ["good", "bad", "meh"]
print([sid.polarity_scores(s)["compound"] for s in lines])
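
As a rough picture of how a lexicon-plus-rules scorer works, the toy below sums word valences, flips and dampens a valence when a negation precedes it, and squashes the total into the range -1 to 1. It is not VADER's implementation. The mini-lexicon is hypothetical, and the constants -0.74 and 15 are assumptions chosen to mirror VADER's published source as best understood here.

```python
import math

# Hypothetical mini-lexicon; the real lexicon has thousands of rated entries.
LEXICON = {"good": 1.9, "great": 3.1, "like": 1.5, "bad": -2.5, "boring": -1.3}
NEGATIONS = {"not", "no", "never", "don't"}
NEGATION_SCALAR = -0.74   # assumed flip-and-dampen factor
ALPHA = 15                # assumed normalization constant

def compound_score(text: str) -> float:
    """Sum valences with a simple negation rule, then normalize to (-1, 1)."""
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    total = 0.0
    for i, tok in enumerate(tokens):
        valence = LEXICON.get(tok, 0.0)
        # Flip and dampen when a negation occurs in the three preceding tokens.
        if valence and any(t in NEGATIONS for t in tokens[max(0, i - 3):i]):
            valence *= NEGATION_SCALAR
        total += valence
    return total / math.sqrt(total * total + ALPHA)

for line in ["I like this.", "I don't like this.", "meh"]:
    print(line, round(compound_score(line), 4))
```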


## textblob (Pattern-based)

Convenient API for sentiment (lexicon-based), noun phrases, simple POS, etc.

### Sentiment

In [None]:
from textblob import TextBlob
print(TextBlob("I absolutely love this!").sentiment)


### Spelling correction

In [None]:
from textblob import TextBlob
print(TextBlob("I havv goood speling").correct())


## indic-transliteration

Script ↔ script transliteration (e.g., Devanagari ⇄ IAST/ISO).

### Devanagari → IAST

In [None]:
from indic_transliteration.sanscript import transliterate, DEVANAGARI, IAST
print(transliterate("भारत", DEVANAGARI, IAST))


### IAST → Devanagari

In [None]:
from indic_transliteration.sanscript import transliterate, DEVANAGARI, IAST
print(transliterate("bhārata", IAST, DEVANAGARI))


### Devanagari → HK

In [None]:
from indic_transliteration.sanscript import transliterate, DEVANAGARI, HK
print(transliterate("संस्कृतम्", DEVANAGARI, HK))


### Devanagari → ITRANS

In [None]:
from indic_transliteration.sanscript import transliterate, DEVANAGARI, ITRANS
print(transliterate("नमस्ते", DEVANAGARI, ITRANS))


### Multiple words/batch

In [None]:
from indic_transliteration.sanscript import transliterate, DEVANAGARI, IAST
items = ["भारत", "सम्मान"]
print([transliterate(x, DEVANAGARI, IAST) for x in items])


## indic-nlp-library

Normalization, tokenization, and utilities for Indic languages. Requires resource path setup.

### Set INDIC_NLP_RESOURCES and import

In [None]:
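import os
from indicnlp import common, loader
INDIC_NLP_RESOURCES = "/path/to/indic_nlp_resources"  # set your path
os.environ['INDIC_NLP_RESOURCES'] = INDIC_NLP_RESOURCES
common.set_resources_path(INDIC_NLP_RESOURCES)  # register the path with the library
loader.load()                                    # load resources from that path
print("Resources path:", INDIC_NLP_RESOURCES)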


### Normalize Hindi text

In [None]:
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
factory = IndicNormalizerFactory()
normalizer = factory.get_normalizer("hi")
print(normalizer.normalize("क़िलाफ़"))


### Sentence tokenization (Hindi)
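
Hindi sentences usually end with a danda (।). The following is a standard-library sketch of splitting on it, assuming sentences are separated by whitespace after a danda, double danda, question mark, or exclamation mark. It does not use indic-nlp-library and does not handle abbreviations. The helper name `split_hindi_sentences` is illustrative.

```python
import re

def split_hindi_sentences(text: str) -> list:
    """Split after a danda, double danda, '?' or '!' that is followed by whitespace."""
    parts = re.split(r"(?<=[।॥?!])\s+", text.strip())
    return [p for p in parts if p]

text = "मैं घर जा रहा हूँ। तुम कहाँ हो? चलो चलते हैं।"
sentences = split_hindi_sentences(text)
print(sentences)
```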

## pykakasi

Japanese transliteration (Kanji/Kana → Hepburn Romaji, etc.).

### Kanji → Romaji

In [None]:
import pykakasi
kks = pykakasi.kakasi()
# convert() returns one dict per segment; the "hepburn" key holds the romaji.
# (setMode/getConverter belong to the older, deprecated interface.)
print(" ".join(item["hepburn"] for item in kks.convert("日本語の勉強")))


### Kana → Romaji

In [None]:
import pykakasi
kks = pykakasi.kakasi()
print("".join(item["hepburn"] for item in kks.convert("にほんご")))
