#   Assignment No. 07 - Text Analysis

#   1. Extract a sample document and apply the following document preprocessing methods:
#   tokenization, POS tagging, stop word removal, stemming, and lemmatization.
#   2. Create a representation of the document by calculating Term Frequency and Inverse Document Frequency.

# 1. Document Preprocessing:

Tokenization:
Tokenization splits text into individual tokens, typically words and punctuation marks. In Python this can be done with libraries such as NLTK or spaCy; the example below uses NLTK.

In [19]:
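# word_tokenize needs NLTK's Punkt tokenizer data; if it is missing, run
# nltk.download('punkt') once ('punkt_tab' on recent NLTK releases)
import nltk
from nltk.tokenize import word_tokenize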

# Sample document
document = "Text analytics is the process of analyzing unstructured text data for useful insights."

# Tokenization
tokens = word_tokenize(document)
print(tokens)


['Text', 'analytics', 'is', 'the', 'process', 'of', 'analyzing', 'unstructured', 'text', 'data', 'for', 'useful', 'insights', '.']
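As a rough illustration of what a tokenizer does, the sketch below splits text into word and punctuation tokens with a regular expression. It is a simplification, not how `word_tokenize` works internally, and the helper name `simple_tokenize` is hypothetical. NLTK additionally handles cases such as contractions and abbreviations that this pattern does not.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches one word; [^\w\s] matches one character that is
    # neither a word character nor whitespace (e.g. '.')
    return re.findall(r"\w+|[^\w\s]", text)

document = "Text analytics is the process of analyzing unstructured text data for useful insights."
tokens = simple_tokenize(document)
print(tokens)
```

For this simple sentence the result matches the NLTK tokens shown above; on more complex text the two approaches will differ.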


POS Tagging:
POS tagging assigns a part-of-speech tag (such as noun, verb, or adjective) to each token in the text. By default, `nltk.pos_tag` uses the Penn Treebank tag set.

In [20]:
# POS Tagging
import nltk
nltk.download('averaged_perceptron_tagger')

pos_tags = nltk.pos_tag(tokens)
print(pos_tags)


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\afeef\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


[('Text', 'NN'), ('analytics', 'NNS'), ('is', 'VBZ'), ('the', 'DT'), ('process', 'NN'), ('of', 'IN'), ('analyzing', 'VBG'), ('unstructured', 'JJ'), ('text', 'NN'), ('data', 'NNS'), ('for', 'IN'), ('useful', 'JJ'), ('insights', 'NNS'), ('.', '.')]
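The tag abbreviations can be hard to read at first. The following sketch, which only post-processes the tagged output above, counts how often each tag occurs and prints a short description of it. The `TAG_MEANINGS` dictionary is a hand-written helper for this example and covers only the tags that appear in this sentence.

```python
from collections import Counter

# Tagged tokens copied from the output above
pos_tags = [('Text', 'NN'), ('analytics', 'NNS'), ('is', 'VBZ'), ('the', 'DT'),
            ('process', 'NN'), ('of', 'IN'), ('analyzing', 'VBG'),
            ('unstructured', 'JJ'), ('text', 'NN'), ('data', 'NNS'),
            ('for', 'IN'), ('useful', 'JJ'), ('insights', 'NNS'), ('.', '.')]

# Short descriptions of the Penn Treebank tags that occur in this sentence
TAG_MEANINGS = {
    'NN': 'noun, singular or mass',
    'NNS': 'noun, plural',
    'VBZ': 'verb, 3rd person singular present',
    'VBG': 'verb, gerund or present participle',
    'DT': 'determiner',
    'IN': 'preposition or subordinating conjunction',
    'JJ': 'adjective',
    '.': 'sentence-final punctuation',
}

# Count the tags and print them from most to least frequent
tag_counts = Counter(tag for _, tag in pos_tags)
for tag, count in tag_counts.most_common():
    print(f"{tag:4} {count}  {TAG_MEANINGS.get(tag, 'unknown')}")

# Keep only the nouns (all noun tags start with 'NN')
nouns = [word for word, tag in pos_tags if tag.startswith('NN')]
print(nouns)
```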


Stop Words Removal:
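Stop words are very common words (such as "is", "the", and "of") that are often removed because they carry little meaning on their own. The filter below is case-insensitive but does not remove punctuation, so "." stays in the list.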

In [21]:
from nltk.corpus import stopwords

# Stop words removal
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)


['Text', 'analytics', 'process', 'analyzing', 'unstructured', 'text', 'data', 'useful', 'insights', '.']
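The filtered list still contains "." and keeps "Text" and "text" as distinct tokens. A minimal sketch of a stricter filter is shown here; it assumes a tiny hand-written stop-word set in place of NLTK's much longer English list, so that it runs without any downloads.

```python
tokens = ['Text', 'analytics', 'is', 'the', 'process', 'of', 'analyzing',
          'unstructured', 'text', 'data', 'for', 'useful', 'insights', '.']

# Hypothetical mini stop-word set for illustration only
stop_words = {'is', 'the', 'of', 'for'}

# Keep alphabetic tokens that are not stop words, and lowercase them
content_tokens = [
    token.lower()
    for token in tokens
    if token.isalpha() and token.lower() not in stop_words
]
print(content_tokens)
```

Lowercasing merges "Text" and "text" into one term, which matters later when term frequencies are counted.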


Stemming and Lemmatization:
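Stemming strips suffixes with heuristic rules, so the resulting stem need not be a real word (for example, "analyt"). Lemmatization maps each word to its dictionary form (lemma) using a vocabulary such as WordNet.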

In [22]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Stemming
porter = PorterStemmer()
stemmed_tokens = [porter.stem(word) for word in filtered_tokens]
print(stemmed_tokens)

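# Lemmatization (needs the WordNet data: nltk.download('wordnet'))
# Without a pos argument, lemmatize() treats every word as a noun,
# so a verb form such as 'analyzing' is left unchanged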
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print(lemmatized_tokens)


['text', 'analyt', 'process', 'analyz', 'unstructur', 'text', 'data', 'use', 'insight', '.']
['Text', 'analytics', 'process', 'analyzing', 'unstructured', 'text', 'data', 'useful', 'insight', '.']
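In the lemmatized output, "analyzing" is unchanged because the lemmatizer assumes every word is a noun unless told otherwise. One common workaround is to convert the Penn Treebank tags from the POS tagging step into the single-letter POS codes that WordNet uses. The helper `penn_to_wordnet` below is a hypothetical name and a simplified mapping, not part of NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('a', 'v', 'r' or 'n')."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # noun, which is also the lemmatizer's default

# A few tagged tokens taken from the POS tagging output
tagged = [('analyzing', 'VBG'), ('unstructured', 'JJ'), ('insights', 'NNS')]
wordnet_pos = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_pos)
```

With NLTK and the WordNet data installed, each pair can then be passed on as `lemmatizer.lemmatize(word, pos=pos)`, which lets the lemmatizer treat "analyzing" as a verb instead of a noun.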


# 2. Document Representation:

Term Frequency (TF):
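Term Frequency measures how often a term occurs in a document. Here it is the raw count of each filtered token; the counts are case-sensitive, so "Text" and "text" are counted separately, and "." is counted as a term.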

In [23]:
from collections import Counter

# Term Frequency
tf = Counter(filtered_tokens)
print(tf)


Counter({'Text': 1, 'analytics': 1, 'process': 1, 'analyzing': 1, 'unstructured': 1, 'text': 1, 'data': 1, 'useful': 1, 'insights': 1, '.': 1})
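Raw counts depend on document length, so TF is often normalized by the total number of terms. The sketch below shows one such variant, assuming the tokens are lowercased and punctuation is dropped first; other definitions of TF (raw count, log-scaled) are equally valid.

```python
from collections import Counter

# Filtered tokens copied from the stop word removal step
filtered_tokens = ['Text', 'analytics', 'process', 'analyzing', 'unstructured',
                   'text', 'data', 'useful', 'insights', '.']

# Normalize case and drop punctuation so 'Text' and 'text' count as one term
terms = [token.lower() for token in filtered_tokens if token.isalpha()]

counts = Counter(terms)
total = len(terms)

# Relative term frequency: occurrences of the term / number of terms
tf = {term: count / total for term, count in counts.items()}
print(tf)
```

With this normalization "text" occurs twice among nine terms, and the values over all terms sum to 1.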


Inverse Document Frequency (IDF):
Inverse Document Frequency measures how rare a term is across the corpus: terms that occur in few documents get a higher weight than terms that occur in most documents.

In [24]:
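import math
from nltk.tokenize import word_tokenize

def calculate_idf(corpus, term):
    # Compare whole lowercased tokens; a substring test on the raw string
    # would count 'process' as present in 'processing'
    term = term.lower()
    doc_containing_term = sum(
        1 for doc in corpus
        if term in {token.lower() for token in word_tokenize(doc)}
    )
    # The +1 in the denominator avoids division by zero, but it makes the
    # IDF negative for terms that occur in every document
    return math.log(len(corpus) / (1 + doc_containing_term))

# Example corpus
corpus = ["Text analytics is the process of analyzing unstructured text data for useful insights.",
          "Text analytics involves natural language processing techniques to derive meaningful information from text documents."]

# Calculate IDF for each term (sorted so the output order is reproducible)
idf = {}
for term in sorted(set(filtered_tokens)):
    idf[term] = calculate_idf(corpus, term)

print(idf)


{'.': -0.40546510810816444, 'Text': -0.40546510810816444, 'analytics': -0.40546510810816444, 'analyzing': 0.0, 'data': 0.0, 'insights': 0.0, 'process': 0.0, 'text': -0.40546510810816444, 'unstructured': 0.0, 'useful': 0.0}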
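The formula above yields negative values for terms that occur in every document, and TF and IDF have not yet been combined into a single document representation. The following sketch shows one way to do both, assuming the smoothed variant idf(t) = ln((1 + N) / (1 + df(t))) + 1 (the form scikit-learn uses when `smooth_idf=True`) and a simple regex tokenizer in place of NLTK. The helper names `tokenize` and `tf_idf` are hypothetical, and stop words are not removed here to keep the example short.

```python
import math
import re
from collections import Counter

corpus = [
    "Text analytics is the process of analyzing unstructured text data for useful insights.",
    "Text analytics involves natural language processing techniques to derive meaningful information from text documents.",
]

def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

docs = [tokenize(doc) for doc in corpus]
n_docs = len(docs)

# Document frequency: number of documents that contain each term
df = Counter(term for doc in docs for term in set(doc))

# Smoothed IDF; always >= 1, so it is never negative
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

def tf_idf(doc_tokens):
    """Return {term: relative term frequency * smoothed IDF} for one document."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {term: (count / total) * idf[term] for term, count in counts.items()}

# Show the five highest-weighted terms of the first document
weights = tf_idf(docs[0])
for term, weight in sorted(weights.items(), key=lambda item: -item[1])[:5]:
    print(f"{term:14} {weight:.4f}")
```

Because whole tokens are compared, "process" and "processing" are treated as different terms, and terms shared by both documents (such as "text") receive the minimum IDF of 1 instead of a negative value.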
