**Text Preprocessing & Analysis**

In [1]:
text = "Machine learning is transforming the world of data science. It helps in making accurate predictions and decisions."


**Tasks:**

1. Tokenization:
- Break the text into individual words.

2. Stop Word Removal:
- Remove common stop words.

3. Stemming:
- Apply PorterStemmer to the filtered tokens.

4. Frequency Distribution:
- Count how many times each word appears after stemming.

1. **Tokenization**

In [2]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

In [3]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [4]:
word_tokens = word_tokenize(text)
print('The word tokens are:', word_tokens)

The word tokens are: ['Machine', 'learning', 'is', 'transforming', 'the', 'world', 'of', 'data', 'science', '.', 'It', 'helps', 'in', 'making', 'accurate', 'predictions', 'and', 'decisions', '.']


In [5]:
sentence_tokens = sent_tokenize(text)
print('The sentence tokens are:', sentence_tokens)

The sentence tokens are: ['Machine learning is transforming the world of data science.', 'It helps in making accurate predictions and decisions.']


2. **Stopword Removal**

In [6]:
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [7]:
stop_words = set(stopwords.words('english'))

In [8]:
filtered_words = [word for word in word_tokens if word.lower() not in stop_words]
print('The filtered words are:', filtered_words)

The filtered words are: ['Machine', 'learning', 'transforming', 'world', 'data', 'science', '.', 'helps', 'making', 'accurate', 'predictions', 'decisions', '.']
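Note that the two `'.'` tokens survive the filter, since the output above shows that punctuation is not treated as a stop word. A minimal sketch of one way to drop them, assuming tokens made up entirely of punctuation are not wanted. The name `filtered_no_punct` is illustrative and not part of the notebook, and the list is copied from the output above so the snippet runs on its own:

```python
import string

# Copied from the output of the previous cell so this snippet is self-contained.
filtered_words = ['Machine', 'learning', 'transforming', 'world', 'data', 'science', '.',
                  'helps', 'making', 'accurate', 'predictions', 'decisions', '.']

# Keep a token only if at least one of its characters is not punctuation.
# (filtered_no_punct is a hypothetical name used for illustration.)
filtered_no_punct = [
    word for word in filtered_words
    if not all(ch in string.punctuation for ch in word)
]
print('The filtered words without punctuation are:', filtered_no_punct)
```

Whether to remove punctuation depends on the downstream task; for word counts it usually only adds noise.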


3. **Stemming**

In [9]:
from nltk.stem import PorterStemmer

In [10]:
# instantiate the stemmer object
stemmer = PorterStemmer()

In [11]:
stemmed_tokens = [stemmer.stem(word) for word in filtered_words]
print('The stemmed tokens are:', stemmed_tokens)

The stemmed tokens are: ['machin', 'learn', 'transform', 'world', 'data', 'scienc', '.', 'help', 'make', 'accur', 'predict', 'decis', '.']


The words have been stemmed, but several of the resulting stems (for example 'machin', 'scienc', and 'accur') are not real words.
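If the stems are kept, one way to keep them interpretable is to record which original tokens each stem came from. The sketch below is only an illustration: `stem_to_words` is a hypothetical name, and both lists are copied from the outputs above so that it runs without NLTK.

```python
from collections import defaultdict

# Copied from the outputs above so this snippet is self-contained.
filtered_words = ['Machine', 'learning', 'transforming', 'world', 'data', 'science', '.',
                  'helps', 'making', 'accurate', 'predictions', 'decisions', '.']
stemmed_tokens = ['machin', 'learn', 'transform', 'world', 'data', 'scienc', '.',
                  'help', 'make', 'accur', 'predict', 'decis', '.']

# Map each stem to the set of original tokens that produced it.
# (stem_to_words is a hypothetical name used for illustration.)
stem_to_words = defaultdict(set)
for word, stem in zip(filtered_words, stemmed_tokens):
    stem_to_words[stem].add(word)

for stem, words in stem_to_words.items():
    print(stem, '<-', sorted(words))
```

On a larger text, several word forms would collapse onto the same stem, which is the point of stemming even when the stem itself is not readable.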

Let's try NLTK's lemmatization and see whether it gives more meaningful base words.

In [13]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [14]:
# instantiate the lemmatizer
lemmatizer = WordNetLemmatizer()

In [15]:
lem_words = [lemmatizer.lemmatize(word) for word in filtered_words]
print('The lemmatized words are:', lem_words)

The lemmatized words are: ['Machine', 'learning', 'transforming', 'world', 'data', 'science', '.', 'help', 'making', 'accurate', 'prediction', 'decision', '.']


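We can see that NLTK's lemmatizer returns real words, unlike the stemmer, but we can do better. Because `lemmatize()` treats every token as a noun by default, 'learning', 'transforming', and 'making' are left unchanged.

This is where spaCy's lemmatization comes in to give more meaningful base words.

We will explore this in another lesson.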

4. **Frequency Distribution**

In [16]:
# count how many times each word appears (using the lemmatized words rather than the stems)
from collections import Counter

In [17]:
word_freq = Counter(lem_words)
print('The word frequency is:', word_freq)

The word frequency is: Counter({'.': 2, 'Machine': 1, 'learning': 1, 'transforming': 1, 'world': 1, 'data': 1, 'science': 1, 'help': 1, 'making': 1, 'accurate': 1, 'prediction': 1, 'decision': 1})
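The count above includes the `'.'` token and is case-sensitive. A minimal sketch of a slightly cleaner count, assuming that punctuation should be ignored and that case should not matter. The names `clean_words` and `clean_freq` are illustrative, and the list is copied from the lemmatizer output above so the snippet runs on its own:

```python
from collections import Counter

# Copied from the lemmatizer output above so this snippet is self-contained.
lem_words = ['Machine', 'learning', 'transforming', 'world', 'data', 'science', '.',
             'help', 'making', 'accurate', 'prediction', 'decision', '.']

# Lowercase every token and keep only purely alphabetic ones (drops '.').
# (clean_words and clean_freq are hypothetical names used for illustration.)
clean_words = [word.lower() for word in lem_words if word.isalpha()]
clean_freq = Counter(clean_words)

# most_common(n) returns the n most frequent (word, count) pairs.
print('The three most common words are:', clean_freq.most_common(3))
```

Every word in this short text appears once, so the ranking is not informative here; on a longer document `most_common` is a quick way to see the dominant terms.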
