## Noise removal 

In [1]:
# sample code to remove noisy words from a text 

noise_list = ["is", "a", "this", "..."]

def _remove_noise(input_text):
    words = input_text.split()
    noise_free_words = [word for word in words if word not in noise_list]
    noise_free_text = " ".join(noise_free_words)
    return noise_free_text

In [2]:
print(_remove_noise("this is a sample text"))

sample text


### Using regular expressions

In [4]:
# sample code to remove a regex pattern 
import re 

def _remove_regex(input_text, regex_pattern):
    # re.sub removes every match of the pattern in one pass; reusing the
    # matched text as a new pattern would break on regex metacharacters
    return re.sub(regex_pattern, '', input_text)

In [8]:
regex_pattern = r"#[\w]*" 
input_text = "remove this #hashtag from Oluwafemi"

_remove_regex(input_text, regex_pattern)

'remove this  from Oluwafemi'

## Lexicon Normalization

Another type of textual noise is the multiple representations exhibited by a single word.

For example, "play", "player", "played", "plays" and "playing" are different variations of the word "play". Although their meanings differ, they are contextually similar.

This step converts all the variations of a word into their normalized form (also known as the lemma).

Normalization in text feature engineering maps a high-dimensional feature space (N different features) to a low-dimensional one (1 feature), which is desirable for most ML models.

The most common lexicon normalization practices are:

* Stemming: a rudimentary rule-based process of stripping suffixes ("ing", "ly", "es", "s", etc.) from a word.

* Lemmatization: an organized, step-by-step procedure for obtaining the root form of a word. It makes use of vocabulary (dictionary importance of words) and morphological analysis (word structure and grammar relations).
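
To make the idea of suffix stripping concrete, here is a minimal sketch of a toy stemmer. The `SUFFIXES` tuple and the `naive_stem` helper are assumptions made for illustration; this is not the Porter algorithm, which applies many more rules and conditions.

```python
# Toy suffix stripper: an illustrative assumption, NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "er", "s")  # hypothetical rule list, checked in this order

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

variants = ["play", "player", "played", "plays", "playing"]
stems = [naive_stem(w) for w in variants]
print(stems)
print(len(set(variants)), "features ->", len(set(stems)), "feature")
```

All five variants collapse to the single feature `play`. A rule set this crude also damages unrelated words (it reduces "string" to "str"), which is one reason real stemmers and lemmatizers rely on richer rules or a vocabulary.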


In [27]:
import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\o.adejumobi\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\abc.zip.
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\o.adejumobi\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     C:\Users\o.adejumobi\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping taggers\averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     C:\Users\o.adejumobi\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_eng is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     C:\Users\o.adejumobi\AppData\Roaming\nltk_data...
[nltk_data]    |   

True

In [None]:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

lem = WordNetLemmatizer()

word = "multiplying"
lem.lemmatize(word, 'v')

'multiply'

In [14]:
stem = PorterStemmer()

stem.stem(word)

'multipli'

## Object Standardization 

Text data often contains words or phrases that are not present in any standard lexical dictionary. These pieces are not recognized by search engines and models.

Examples include acronyms, hashtags with attached words, and colloquial slang.

With the help of regular expressions and manually prepared data dictionaries, this type of noise can be fixed. The code below uses a dictionary lookup to replace social media slang in a text.

In [16]:
lookup_dict = {'rt': 'Retweet',
               'dm': 'direct message',
               'awsm': 'awesome',
               'luv': 'love'}

def _lookup_words(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    new_text = " ".join(new_words)
    return new_text


In [18]:
_lookup_words("RT this is a retweeted tweet by Oluwafemi")

'Retweet this is a retweeted tweet by Oluwafemi'
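
The dictionary lookup handles slang, but hashtags with attached words need a regular expression instead. The following is a minimal sketch, assuming hashtags are written in CamelCase; `split_hashtag` and `standardize_hashtags` are hypothetical helper names, and lowercase-only hashtags such as `#machinelearning` are left as a single word.

```python
import re

def split_hashtag(tag):
    """Split a CamelCase hashtag such as '#NaturalLanguage' into words."""
    body = tag.lstrip("#")
    # Match, in order: Capitalized words, runs of capitals (acronyms),
    # lowercase runs, and digit runs.
    parts = re.findall(r"[A-Z][a-z]+|[A-Z]+(?![a-z])|[a-z]+|\d+", body)
    return " ".join(parts)

def standardize_hashtags(input_text):
    """Replace every hashtag in the text with its space-separated words."""
    return re.sub(r"#\w+", lambda m: split_hashtag(m.group()), input_text)

print(standardize_hashtags("I am learning #NaturalLanguageProcessing"))
print(standardize_hashtags("#NLPRocks"))
```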

## Feature engineering on text data

Text features can be constructed using a variety of techniques:
* Syntactical parsing
* Entities
* N-grams
* Word-based features
* Statistical features
* Word embeddings

### Syntactical Parsing

Syntactical parsing analyzes the words in a sentence for grammar and arranges them in a way that shows the relationships among the words.

**Dependency grammar** and **part-of-speech tags** are the important attributes of text syntax.

In [20]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\o.adejumobi\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


True

In [24]:
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\o.adejumobi\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger_eng.zip.


True

In [25]:
## part of speech tags

from nltk import word_tokenize, pos_tag
text = "I am learning Natural Language Processing on Analytics Vidhya"
tokens = word_tokenize(text)
print(pos_tag(tokens))

[('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'NNP'), ('on', 'IN'), ('Analytics', 'NNP'), ('Vidhya', 'NNP')]


In [22]:
tokens

['I',
 'am',
 'learning',
 'Natural',
 'Language',
 'Processing',
 'on',
 'Analytics',
 'Vidhya']

In [26]:
sentence = "book my flight, I will read this book"

tokens = word_tokenize(sentence)
tokens

['book', 'my', 'flight', ',', 'I', 'will', 'read', 'this', 'book']

## Topic Modelling

### Latent Dirichlet Allocation (LDA)

In [29]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.3.3-cp312-cp312-win_amd64.whl.metadata (8.2 kB)
Collecting numpy<2.0,>=1.18.5 (from gensim)
  Using cached numpy-1.26.4-cp312-cp312-win_amd64.whl.metadata (61 kB)
Collecting scipy<1.14.0,>=1.7.0 (from gensim)
  Downloading scipy-1.13.1-cp312-cp312-win_amd64.whl.metadata (60 kB)
Collecting smart-open>=1.8.1 (from gensim)
  Downloading smart_open-7.1.0-py3-none-any.whl.metadata (24 kB)
Collecting wrapt (from smart-open>=1.8.1->gensim)
  Downloading wrapt-1.17.2-cp312-cp312-win_amd64.whl.metadata (6.5 kB)
Downloading gensim-4.3.3-cp312-cp312-win_amd64.whl (24.0 MB)


In [31]:
import gensim
from gensim import corpora

In [32]:
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father." 
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc_complete = [doc1, doc2, doc3]
doc_clean = [doc.split() for doc in doc_complete]

# Creating the term dictionary of our corpus, where every unique term is assigned an index.  
dictionary = corpora.Dictionary(doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above. 
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and Training LDA model on the document term matrix
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)

# Results 
print(ldamodel.print_topics())



[(0, '0.053*"driving" + 0.053*"sister" + 0.053*"my" + 0.053*"My" + 0.053*"to" + 0.053*"time" + 0.053*"father" + 0.053*"around" + 0.053*"lot" + 0.053*"practice."'), (1, '0.063*"to" + 0.036*"not" + 0.036*"sugar," + 0.036*"likes" + 0.036*"Sugar" + 0.036*"consume." + 0.036*"have" + 0.036*"but" + 0.036*"bad" + 0.036*"father."'), (2, '0.029*"driving" + 0.029*"My" + 0.029*"sister" + 0.029*"my" + 0.029*"to" + 0.029*"and" + 0.029*"stress" + 0.029*"Doctors" + 0.029*"suggest" + 0.029*"may"')]


## N-Grams as Features

A combination of N words taken together is called an N-gram. N-grams with N > 1 are generally more informative as features than single words (unigrams).

In [33]:
def generate_ngrams(text, n):
    words = text.split()
    output = []
    for i in range(len(words)-n+1):
        output.append(words[i:i+n])
    return output

In [34]:

generate_ngrams('this is a sample text', 2)
# expected: [['this', 'is'], ['is', 'a'], ['a', 'sample'], ['sample', 'text']]

[['this', 'is'], ['is', 'a'], ['a', 'sample'], ['sample', 'text']]
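
To use N-grams as features, they are typically counted per document. The sketch below is one way to do that with the standard library; the helper name `ngram_counts`, the lowercasing step, and the sample sentence are assumptions for illustration.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count each n-gram (as a tuple of words) in a whitespace-tokenized text."""
    words = text.lower().split()
    # zip over n shifted copies of the word list yields consecutive n-grams
    ngrams = zip(*(words[i:] for i in range(n)))
    return Counter(ngrams)

counts = ngram_counts("to be or not to be", 2)
print(counts.most_common(1))
```

Tuples are used instead of lists here because they are hashable and can serve as dictionary keys.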

# Statistical Features

## Term Frequency – Inverse Document Frequency (TF – IDF)

In [36]:
!pip install scikit-learn

Collecting scikit-learn
  Using cached scikit_learn-1.6.1-cp312-cp312-win_amd64.whl.metadata (15 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Using cached threadpoolctl-3.5.0-py3-none-any.whl.metadata (13 kB)
Using cached scikit_learn-1.6.1-cp312-cp312-win_amd64.whl (11.1 MB)
Using cached threadpoolctl-3.5.0-py3-none-any.whl (18 kB)
Installing collected packages: threadpoolctl, scikit-learn
Successfully installed scikit-learn-1.6.1 threadpoolctl-3.5.0





In [37]:
from sklearn.feature_extraction.text import TfidfVectorizer
obj = TfidfVectorizer()
corpus = ['This is sample document.', 'another random document.', 'third sample document text']
X = obj.fit_transform(corpus)
print(X)

  (0, 7)	0.5844829010200651
  (0, 2)	0.5844829010200651
  (0, 4)	0.444514311537431
  (0, 1)	0.34520501686496574
  (1, 1)	0.3853716274664007
  (1, 0)	0.652490884512534
  (1, 3)	0.652490884512534
  (2, 4)	0.444514311537431
  (2, 1)	0.34520501686496574
  (2, 6)	0.5844829010200651
  (2, 5)	0.5844829010200651


The vectorizer builds a vocabulary and assigns an index to each word. Each row of the printed output shows a tuple (i, j) followed by the TF-IDF value of the word at index j in document i.
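
The weights for document 0 can be reproduced by hand. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer` (lowercasing, tokens of at least two word characters, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization per document). The helper names `tokenize` and `tfidf` are hypothetical.

```python
import math
import re
from collections import Counter

corpus = ['This is sample document.', 'another random document.', 'third sample document text']

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    # (assumed to mirror TfidfVectorizer's default tokenization).
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(d) for d in corpus]
vocab = sorted({t for doc in docs for t in doc})   # index j = position in this list
n_docs = len(docs)
df = Counter(t for doc in docs for t in set(doc))  # document frequency per term
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf(doc):
    """Return L2-normalized tf * idf weights for one tokenized document."""
    tf = Counter(doc)
    raw = {t: tf[t] * idf[t] for t in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

weights = [tfidf(doc) for doc in docs]
for j, term in enumerate(vocab):
    if term in weights[0]:
        print((0, j), term, round(weights[0][term], 4))
```

Words that appear in every document, such as "document", receive the lowest weight, while words unique to one document, such as "this", receive the highest.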

# Word Embedding (text vectors)

Word2Vec and GloVe are two popular models for creating word embeddings of a text.

**Word2Vec** is composed of a preprocessing module and two shallow neural network models: Continuous Bag of Words (CBOW) and skip-gram.

It first constructs a vocabulary from the training corpus and then learns word embedding representations.
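
To give a feel for what the skip-gram network is trained on, here is a minimal sketch that generates (target, context) pairs from a tokenized sentence. It illustrates only the pair generation step, not gensim's implementation; `skipgram_pairs` and its `window` parameter are hypothetical names.

```python
def skipgram_pairs(sentence, window=2):
    """Return (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(sentence):
        start = max(0, i - window)
        stop = min(len(sentence), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((target, sentence[j]))
    return pairs

sentence = ['vidhya', 'science', 'data', 'analytics']
for pair in skipgram_pairs(sentence, window=1):
    print(pair)
```

Skip-gram learns to predict each context word from its target. CBOW uses the same windows in the opposite direction, predicting the target word from its surrounding context words.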

In [None]:
from gensim.models import Word2Vec

sentences = [['data', 'science'], ['vidhya', 'science', 'data', 'analytics'],
             ['machine', 'learning'], ['deep', 'learning']]

# Train the Word2Vec model
model = Word2Vec(sentences, min_count=1)

# Calculate similarity between two words
similarity_score = model.wv.similarity('data', 'science')
print(f"Similarity between 'data' and 'science': {similarity_score}")

# Access the vector for a specific word
word_vector = model.wv['learning']
print(f"Vector for 'learning': {word_vector}")

Similarity between 'data' and 'science': -0.02367166057229042
Vector for 'learning': [-5.3622725e-04  2.3643136e-04  5.1033497e-03  9.0092728e-03
 -9.3029495e-03 -7.1168090e-03  6.4588725e-03  8.9729885e-03
 -5.0154282e-03 -3.7633716e-03  7.3805046e-03 -1.5334714e-03
 -4.5366134e-03  6.5540518e-03 -4.8601604e-03 -1.8160177e-03
  2.8765798e-03  9.9187379e-04 -8.2852151e-03 -9.4488179e-03
  7.3117660e-03  5.0702621e-03  6.7576934e-03  7.6286553e-04
  6.3508903e-03 -3.4053659e-03 -9.4640139e-04  5.7685734e-03
 -7.5216377e-03 -3.9361035e-03 -7.5115822e-03 -9.3004224e-04
  9.5381187e-03 -7.3191668e-03 -2.3337686e-03 -1.9377411e-03
  8.0774371e-03 -5.9308959e-03  4.5162440e-05 -4.7537340e-03
 -9.6035507e-03  5.0072931e-03 -8.7595852e-03 -4.3918253e-03
 -3.5099984e-05 -2.9618145e-04 -7.6612402e-03  9.6147433e-03
  4.9820580e-03  9.2331432e-03 -8.1579173e-03  4.4957981e-03
 -4.1370760e-03  8.2453608e-04  8.4986202e-03 -4.4621765e-03
  4.5175003e-03 -6.7869602e-03 -3.5484887e-03  9.3985079e-03


## Text classification

In [42]:
!pip install textblob

Collecting textblob
  Downloading textblob-0.19.0-py3-none-any.whl.metadata (4.4 kB)
Downloading textblob-0.19.0-py3-none-any.whl (624 kB)
Installing collected packages: textblob
Successfully installed textblob-0.19.0





In [None]:
from textblob.classifiers import NaiveBayesClassifier as NBC
from textblob import TextBlob

training_corpus = [('I am exhausted of this work.', 'Class_B'),
                   ("I can't cooperate with this", 'Class_B'),
                   ('He is my badest enemy!', 'Class_B'),
                   ('My management is poor.', 'Class_B'),
                   ('I love this burger.', 'Class_A'),
                   ('This is an brilliant place!', 'Class_A'),
                   ('I feel very good about these dates.', 'Class_A'),
                   ('This is my best work.', 'Class_A'),
                   ("What an awesome view", 'Class_A'),
                   ('I do not like this dish', 'Class_B')
                   ]

test_corpus = [("I am not feeling well today.", 'Class_B'), 
                ("I feel brilliant!", 'Class_A'), 
                ('Gary is a friend of mine.', 'Class_A'), 
                ("I can't believe I'm doing this.", 'Class_B'), 
                ('The date was good.', 'Class_A'), 
                ('I do not enjoy my job', 'Class_B')
                ]



In [44]:
model = NBC(training_corpus)

In [45]:
print(model.classify("Their codes are amazing."))

Class_A


In [46]:
print(model.classify("I don't like their computer."))

Class_B


In [47]:
print(model.accuracy(test_corpus))

0.8333333333333334


In [71]:
# scikit-learn can also be used for text classification:
# TF-IDF features are fed to a linear SVM

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn import svm

# data prep
train_data = []
train_label = []
for item in training_corpus:
    train_data.append(item[0])
    train_label.append(item[1])

test_data = []
test_labels = []
for item in test_corpus:
    test_data.append(item[0])
    test_labels.append(item[1])

# create feature vectors
vectorizer = TfidfVectorizer(min_df=4, max_df=0.9)
train_vectors = vectorizer.fit_transform(train_data)

test_vectors = vectorizer.transform(test_data)


In [78]:
train_vectors.data

array([1.        , 1.        , 1.        , 1.        , 1.        ,
       0.77635539, 0.63029542, 0.77635539, 0.63029542, 1.        ])

In [72]:
model = svm.SVC(kernel='linear')
model.fit(train_vectors, train_label)

In [73]:
prediction = model.predict(test_vectors)

In [74]:
prediction

array(['Class_A', 'Class_A', 'Class_B', 'Class_B', 'Class_A', 'Class_A'],
      dtype='<U7')

In [75]:
test_labels

['Class_B', 'Class_A', 'Class_A', 'Class_B', 'Class_A', 'Class_B']

In [76]:

print(classification_report(test_labels, prediction))

              precision    recall  f1-score   support

     Class_A       0.50      0.67      0.57         3
     Class_B       0.50      0.33      0.40         3

    accuracy                           0.50         6
   macro avg       0.50      0.50      0.49         6
weighted avg       0.50      0.50      0.49         6



In [80]:
!pip install spacy

Collecting spacy
  Downloading spacy-3.8.4-cp312-cp312-win_amd64.whl.metadata (27 kB)
Collecting spacy-legacy<3.1.0,>=3.0.11 (from spacy)
  Downloading spacy_legacy-3.0.12-py2.py3-none-any.whl.metadata (2.8 kB)
Collecting spacy-loggers<2.0.0,>=1.0.0 (from spacy)
  Downloading spacy_loggers-1.0.5-py3-none-any.whl.metadata (23 kB)
Collecting murmurhash<1.1.0,>=0.28.0 (from spacy)
  Downloading murmurhash-1.0.12-cp312-cp312-win_amd64.whl.metadata (2.2 kB)
Collecting cymem<2.1.0,>=2.0.2 (from spacy)
  Downloading cymem-2.0.11-cp312-cp312-win_amd64.whl.metadata (8.8 kB)
Collecting preshed<3.1.0,>=3.0.2 (from spacy)
  Downloading preshed-3.0.9-cp312-cp312-win_amd64.whl.metadata (2.2 kB)
Collecting thinc<8.4.0,>=8.3.4 (from spacy)
  Downloading thinc-8.3.4-cp312-cp312-win_amd64.whl.metadata (15 kB)
Collecting wasabi<1.2.0,>=0.9.1 (from spacy)
  Downloading wasabi-1.1.3-py3-none-any.whl.metadata (28 kB)
Collecting srsly<3.0.0,>=2.4.3 (from spacy)
  Downloading srsly-2.5.1-cp312-cp312-win_amd64




In [82]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')





In [83]:
import spacy

nlp = spacy.load("en_core_web_sm")
text = "I don't like NLP!"
doc = nlp(text)
tokens = [token.text for token in doc]
print(tokens)

['I', 'do', "n't", 'like', 'NLP', '!']


## Text Matching / Similarity


### Levenshtein Distance

In [84]:

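def levenshtein(s1, s2):
    # Keep s1 as the shorter string so each row stays as small as possible
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    # distances[i] holds the edit distance between s1[:i] and the part of s2 seen so far
    distances = list(range(len(s1) + 1))
    for index2, char2 in enumerate(s2):
        new_distances = [index2 + 1]
        for index1, char1 in enumerate(s1):
            if char1 == char2:
                new_distances.append(distances[index1])
            else:
                # 1 + the cheapest of substitution, deletion, and insertion
                new_distances.append(1 + min(distances[index1],
                                             distances[index1 + 1],
                                             new_distances[-1]))
        distances = new_distances
    return distances[-1]

print(levenshtein("analyze", "analyse"))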



1


### Phonetic Matching

The `jellyfish` package is used here to compute the Soundex codes of different words.

In [89]:
!pip install jellyfish

Collecting jellyfish
  Downloading jellyfish-1.1.3-cp312-cp312-win_amd64.whl.metadata (2.6 kB)
Downloading jellyfish-1.1.3-cp312-cp312-win_amd64.whl (212 kB)
Installing collected packages: jellyfish
Successfully installed jellyfish-1.1.3





In [93]:
from jellyfish import soundex


In [94]:
print(soundex('ankit'))

A523


In [95]:
print(soundex('kent'))

K530
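
For intuition about what a Soundex code encodes, here is a minimal pure-Python sketch assuming the standard American Soundex rules: keep the first letter, map the remaining consonants to digits, skip vowels, collapse adjacent repeats, and pad to four characters. `simple_soundex` and `CODES` are hypothetical names, the function handles only plain ASCII letters, and it may differ from `jellyfish.soundex` in edge cases.

```python
# Letter-to-digit table of standard American Soundex (vowels, h, w, y have no code)
CODES = {
    letter: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt",
        "4": "l", "5": "mn", "6": "r",
    }.items()
    for letter in letters
}

def simple_soundex(word):
    """Return a four-character Soundex code for an ASCII alphabetic word."""
    if not word:
        return ""
    word = word.lower()
    digits = []
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":  # h and w do not separate letters with the same code
            prev = code
    # Keep the first letter, pad with zeros, and truncate to four characters
    return (word[0].upper() + "".join(digits) + "000")[:4]

print(simple_soundex("ankit"))
print(simple_soundex("kent"))
```

On the two words used above, this sketch produces the same codes as the `jellyfish` calls.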


### Cosine Similarity

In [96]:
import math
from collections import Counter

def get_cosine(vec1, vec2):
    common = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x]* vec2[x] for x in common])

    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()]) 

    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator)/denominator
    
def text_to_vector(text):
    words = text.split()
    return Counter(words)





In [100]:
text1 = 'This is an article on analytics vidhya' 
text2 = 'article on analytics vidhya is about natural language processing'

vector1 = text_to_vector(text1) 
vector2 = text_to_vector(text2)
cosine = get_cosine(vector1, vector2)
cosine

0.629940788348712