## Working with text data - Text Preprocessing

### **Text preprocessing steps:**

Text preprocessing cleans raw text and removes noise so that later steps, such as vectorization, work on consistent input. The essential steps are:

1. **Removing special characters and punctuation**
2. **Converting to lowercase**
3. **Tokenization (sentence tokenization and word tokenization)**
4. **Removing stop words**
5. **Stemming or lemmatization**
6. **HTML parsing and cleanup**
7. **Spell correction**
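
HTML parsing and cleanup (step 6) is not demonstrated in the cells that follow, so here is a minimal sketch using only the standard library's `html.parser`. The names `TextExtractor`, `strip_html`, and `html_snippet` are hypothetical, and the sketch assumes simple markup: it keeps every text node, so the contents of `<script>` or `<style>` elements would not be filtered out.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into '&'
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Return the text of `markup` with runs of whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return ' '.join(''.join(parser.parts).split())


# Hypothetical input, not taken from the notebook
html_snippet = '<p>We are <b>learning</b> NLP &amp; text cleanup.</p>'
print(strip_html(html_snippet))  # We are learning NLP & text cleanup.
```

For real web pages a dedicated HTML library is usually the more robust choice; this only illustrates the idea of dropping tags before the other steps run.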

In [1]:
# Raw string: the backslash in "this\ example" is kept literally instead of being read as an escape
raw_text = r'''
We're LaRNING1 nATURAL-LANguage-Processing!😂🙇🍆
In this\ example wE are goINg to LeaRN various text9 preProcessing Steps.
I'm GOiNG TO-Be Mr. Rich. 
'''
print(raw_text)


We're LaRNING1 nATURAL-LANguage-Processing!😂🙇🍆
In this\ example wE are goINg to LeaRN various text9 preProcessing Steps.
I'm GOiNG TO-Be Mr. Rich. 




In [2]:
import string
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [4]:
text = ''.join([char for char in raw_text if char not in string.punctuation and not char.isdigit()])
print(text)


Were LaRNING nATURALLANguageProcessing😂🙇🍆
In this example wE are goINg to LeaRN various text preProcessing Steps
Im GOiNG TOBe Mr Rich 
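
The same character-level filter can also be written with `str.translate`, which deletes characters through a lookup table instead of testing each one in a comprehension. This is a sketch on a hypothetical `sample` string in the spirit of `raw_text`; note that `string.digits` covers ASCII digits only, whereas `char.isdigit()` also matches other Unicode digits.

```python
import string

# Hypothetical sample in the spirit of raw_text above
sample = "We're LaRNING1 nATURAL-LANguage-Processing! text9"

# The third argument of str.maketrans lists characters to delete
drop_table = str.maketrans('', '', string.punctuation + string.digits)
cleaned = sample.translate(drop_table)
print(cleaned)  # Were LaRNING nATURALLANguageProcessing text
```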



### A more powerful tool for removing special characters and punctuation: regular expressions

In [6]:
import re
regex = '[^a-zA-Z.!]'
text = re.sub(regex, ' ', raw_text)
print(text)

 We re LaRNING  nATURAL LANguage Processing!    In this  example wE are goINg to LeaRN various text  preProcessing Steps. I m GOiNG TO Be Mr. Rich.  


### Converting to lowercase

In [8]:
text = text.lower()
text

' we re larning  natural language processing!    in this  example we are going to learn various text  preprocessing steps. i m going to be mr. rich.  '

### Tokenization (Sentence Tokenization and Word Tokenization)

In [10]:
words = text.split(' ')
print(words)

['', 'we', 're', 'larning', '', 'natural', 'language', 'processing!', '', '', '', 'in', 'this', '', 'example', 'we', 'are', 'going', 'to', 'learn', 'various', 'text', '', 'preprocessing', 'steps.', 'i', 'm', 'going', 'to', 'be', 'mr.', 'rich.', '', '']
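
The empty strings in this output come from splitting on a single space: every extra space yields an empty token. A small sketch of the difference, using a shortened, hypothetical copy of the lowercased text; calling `split()` with no argument splits on any run of whitespace and drops the empty pieces.

```python
# Shortened copy of the lowercased text, for illustration only
text = ' we re larning  natural language processing!    in this  example'

naive_tokens = text.split(' ')  # one token per single space, so '' appears
tokens = text.split()           # splits on runs of whitespace, no '' tokens

print(naive_tokens)
print(tokens)
# ['we', 're', 'larning', 'natural', 'language', 'processing!', 'in', 'this', 'example']
```

Punctuation is still attached to words here (`'processing!'`), which is one reason to prefer a real tokenizer.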


In [11]:
sentences = text.split('.')
print(sentences)

[' we re larning  natural language processing!    in this  example we are going to learn various text  preprocessing steps', ' i m going to be mr', ' rich', '  ']


In [12]:
import nltk

In [13]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\DeLL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [14]:
from nltk.tokenize import sent_tokenize
my_sent = sent_tokenize(text)
print(my_sent)

[' we re larning  natural language processing!', 'in this  example we are going to learn various text  preprocessing steps.', 'i m going to be mr. rich.']


In [15]:
from nltk.tokenize import word_tokenize
words = word_tokenize(text)
print(words)

['we', 're', 'larning', 'natural', 'language', 'processing', '!', 'in', 'this', 'example', 'we', 'are', 'going', 'to', 'learn', 'various', 'text', 'preprocessing', 'steps', '.', 'i', 'm', 'going', 'to', 'be', 'mr.', 'rich', '.']


In [16]:
for sentence in sent_tokenize(text):
    print(word_tokenize(sentence))

['we', 're', 'larning', 'natural', 'language', 'processing', '!']
['in', 'this', 'example', 'we', 'are', 'going', 'to', 'learn', 'various', 'text', 'preprocessing', 'steps', '.']
['i', 'm', 'going', 'to', 'be', 'mr.', 'rich', '.']


In [17]:
from nltk.corpus import stopwords
# Run nltk.download('stopwords') once if the stop word corpus is not installed yet

In [19]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

### Removing stop words

In [21]:
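# Build the stop word set once; set lookups are fast and the list is not reloaded for every word
stop_words = set(stopwords.words('english'))
words = [word for word in words if word not in stop_words]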
print(words)

['larning', 'natural', 'language', 'processing', '!', 'example', 'going', 'learn', 'various', 'text', 'preprocessing', 'steps', '.', 'going', 'mr.', 'rich', '.']
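
The filtered list still contains pure punctuation tokens such as `'!'` and `'.'`. If those are not wanted downstream, one option is to drop every token that consists only of punctuation characters. This is a sketch, not part of the original notebook; `punct` and `word_tokens` are hypothetical names, and a token like `'mr.'` is deliberately kept because it also contains letters.

```python
import string

# Token list copied from the stop word removal output above
tokens = ['larning', 'natural', 'language', 'processing', '!', 'example',
          'going', 'learn', 'various', 'text', 'preprocessing', 'steps',
          '.', 'going', 'mr.', 'rich', '.']

punct = set(string.punctuation)

# Keep a token unless every one of its characters is punctuation
word_tokens = [tok for tok in tokens if not all(ch in punct for ch in tok)]
print(word_tokens)
```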


### Stemming: stripping suffixes to reduce each word to a root form (not always a dictionary word)

In [22]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
clean_tokens_stem = [stemmer.stem(word) for word in words]
print(clean_tokens_stem)

['larn', 'natur', 'languag', 'process', '!', 'exampl', 'go', 'learn', 'variou', 'text', 'preprocess', 'step', '.', 'go', 'mr.', 'rich', '.']


### Lemmatization

In [23]:
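from nltk.stem import WordNetLemmatizer
# Requires the WordNet corpus: run nltk.download('wordnet') once if it is missing
lemmatizer = WordNetLemmatizer()
# lemmatize() treats each word as a noun by default (pos='n'), so verbs such as 'going' stay unchanged
clean_text_lemmatization = [lemmatizer.lemmatize(word) for word in words]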
print(clean_text_lemmatization)

['larning', 'natural', 'language', 'processing', '!', 'example', 'going', 'learn', 'various', 'text', 'preprocessing', 'step', '.', 'going', 'mr.', 'rich', '.']
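
The misspelled token `'larning'` passes through both stemming and lemmatization uncorrected. Spell correction (step 7) is not covered by the cells in this notebook; the sketch below shows one simple way to approximate it with the standard library's `difflib`, by replacing each token with its closest match in a known vocabulary. The `vocabulary` list, the `correct` helper, and the `0.8` cutoff are assumptions made for illustration, not a recommended configuration.

```python
import difflib

# Hypothetical vocabulary of correctly spelled words
vocabulary = ['learning', 'natural', 'language', 'processing', 'rich']


def correct(token, vocabulary, cutoff=0.8):
    """Return the closest vocabulary word, or the token itself if none is similar enough."""
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


tokens = ['larning', 'natural', 'language', 'mr.', 'rich']
corrected = [correct(tok, vocabulary) for tok in tokens]
print(corrected)  # ['learning', 'natural', 'language', 'mr.', 'rich']
```

Tokens with no sufficiently similar vocabulary entry, such as `'mr.'`, are returned unchanged. Dedicated spell checkers use word frequencies and context, which this sketch does not.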


## Putting all the steps together

In [24]:
import pandas as pd
import numpy as np

text_data = ['We are learning machine learning $', 'Processing Natural - Language Data.', '10 machine - learning algorithms.', 'We are mimicking natural intelligence.']

In [25]:
df = pd.DataFrame({'text' : text_data})
df

|   | text |
|---|------|
| 0 | We are learning machine learning $ |
| 1 | Processing Natural - Language Data. |
| 2 | 10 machine - learning algorithms. |
| 3 | We are mimicking natural intelligence. |


In [29]:
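import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Build these once instead of on every call to clean()
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()


def clean(doc):
    """Lowercase a document, drop noise and stop words, and lemmatize the rest."""
    # Replace everything except letters, '!' and '.' with a space
    doc = re.sub('[^a-zA-Z!.]', ' ', doc)
    doc = doc.lower()

    tokens = nltk.word_tokenize(doc)
    filtered_tokens = [word for word in tokens if word not in stop_words]
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]

    # Join back into one string so the result can be vectorized directly
    return ' '.join(lemmatized_tokens)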

In [30]:
df['clean_text'] = df['text'].apply(clean)
df

|   | text | clean_text |
|---|------|------------|
| 0 | We are learning machine learning $ | learning machine learning |
| 1 | Processing Natural - Language Data. | processing natural language data . |
| 2 | 10 machine - learning algorithms. | machine learning algorithm . |
| 3 | We are mimicking natural intelligence. | mimicking natural intelligence . |


## Vectorization

### Bag of Words

In [31]:
from sklearn.feature_extraction.text import CountVectorizer
bow_vect = CountVectorizer()
text_dtm = bow_vect.fit_transform(df['clean_text'])
print()
print(f'shape of text_dtm (# of docs, # of unique vocabulary): {text_dtm.shape}')
print(f'Vocab : {bow_vect.get_feature_names_out()}')


shape of text_dtm (# of docs, # of unique vocabulary): (4, 9)
Vocab : ['algorithm' 'data' 'intelligence' 'language' 'learning' 'machine'
 'mimicking' 'natural' 'processing']


In [33]:
pd.DataFrame(text_dtm.toarray(), columns = bow_vect.get_feature_names_out())

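|   | algorithm | data | intelligence | language | learning | machine | mimicking | natural | processing |
|---|-----------|------|--------------|----------|----------|---------|-----------|---------|------------|
| 0 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 2 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |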
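
To make the counts in this matrix less opaque, the same bag of words can be rebuilt by hand with `collections.Counter`. This is a sketch of the idea, not of how `CountVectorizer` is implemented: the `isalpha()` filter is an assumed simplification of its default token pattern, which keeps tokens of two or more word characters and therefore drops the lone `.` tokens.

```python
from collections import Counter

# The clean_text column from the DataFrame above
docs = [
    'learning machine learning',
    'processing natural language data .',
    'machine learning algorithm .',
    'mimicking natural intelligence .',
]

# Keep alphabetic tokens only (simplified stand-in for the default token pattern)
tokenized = [[tok for tok in doc.split() if tok.isalpha()] for doc in docs]

# The vocabulary is the sorted set of all tokens; each term becomes one column
vocab = sorted({tok for doc in tokenized for tok in doc})

# One row per document: how often each vocabulary term occurs in it
rows = []
for doc in tokenized:
    counts = Counter(doc)
    rows.append([counts[term] for term in vocab])

print(vocab)
for row in rows:
    print(row)
```

The first row comes out as `[0, 0, 0, 0, 2, 1, 0, 0, 0]`: `learning` appears twice and `machine` once in the first document, matching the table.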


In [4]:
from sklearn.feature_extraction.text import CountVectorizer

In [9]:
docs = [
    'Convert a collection of text documents to a matrix of token counts.',
    'This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.',
    'If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature selection then the number of features will be equal to the vocabulary size found by analyzing the data.',
    'For an efficiency comparison of the different feature extractors.',
]

In [10]:
cv = CountVectorizer()
# Keep the document-term matrix in its own variable instead of overwriting the input list
dtm = cv.fit_transform(docs)
# Each printed entry is (document index, vocabulary index) followed by the count
print(dtm)

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 61 stored elements and shape (4, 52)>
  Coords	Values
  (0, 8)	1
  (0, 6)	1
  (0, 30)	2
  (0, 40)	1
  (0, 15)	1
  (0, 45)	1
  (0, 27)	1
  (0, 46)	1
  (0, 9)	1
  (1, 30)	1
  (1, 9)	1
  (1, 44)	1
  (1, 25)	1
  (1, 32)	1
  (1, 39)	2
  (1, 34)	1
  (1, 42)	1
  (1, 48)	1
  (1, 35)	1
  (1, 10)	1
  (2, 30)	2
  (2, 45)	1
  (2, 42)	3
  (2, 24)	1
  (2, 51)	2
  :	:
  (2, 38)	1
  (2, 26)	1
  (2, 20)	1
  (2, 36)	1
  (2, 43)	1
  (2, 29)	1
  (2, 21)	1
  (2, 50)	1
  (2, 4)	1
  (2, 18)	1
  (2, 49)	1
  (2, 37)	1
  (2, 23)	1
  (2, 5)	1
  (2, 2)	1
  (2, 11)	1
  (3, 30)	1
  (3, 42)	1
  (3, 0)	1
  (3, 20)	1
  (3, 22)	1
  (3, 17)	1
  (3, 7)	1
  (3, 13)	1
  (3, 19)	1
