## Text Analytics
1. Extract a sample document and apply the following preprocessing methods: tokenization, POS tagging, stop word removal, stemming, and lemmatization.
2. Create a representation of the document by calculating Term Frequency (TF) and Inverse Document Frequency (IDF).

# 1. Document Preprocessing

# A. Tokenization

### Tokenization splits the document into individual word and punctuation tokens.

In [1]:
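import nltk
# Download the Punkt models that word_tokenize relies on
# (recent NLTK releases may ask for "punkt_tab" instead)
nltk.download("punkt")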

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [2]:
from nltk.tokenize import word_tokenize

In [3]:
document = "Hi! My name is Vaishnav, I am studying Data Science and Analytics."

In [4]:
tokens = word_tokenize(document)
tokens

['Hi',
 '!',
 'My',
 'name',
 'is',
 'Vaishnav',
 ',',
 'I',
 'am',
 'studying',
 'Data',
 'Science',
 'and',
 'Analytics',
 '.']
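
To make the splitting rule concrete, the following sketch approximates the result with a regular expression. It is not NLTK's algorithm: `word_tokenize` also handles contractions, abbreviations, and other cases this pattern ignores. The `simple_tokenize` helper is a hypothetical name introduced only for illustration. For this particular sentence it yields the same tokens as the output above.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks.

    A simplified stand-in for illustration; not equivalent to nltk.word_tokenize.
    """
    return re.findall(r"\w+|[^\w\s]", text)

document = "Hi! My name is Vaishnav, I am studying Data Science and Analytics."
simple_tokens = simple_tokenize(document)
print(simple_tokens)
```

The pattern keeps every punctuation mark as its own token, which is why later steps still see `'!'`, `','`, and `'.'` unless they are filtered out explicitly.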

# B. POS Tagging

### POS tagging assigns a part-of-speech tag to each word in the document.

In [6]:
# Download the pretrained perceptron tagger model used by pos_tag
nltk.download("averaged_perceptron_tagger")

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

In [7]:
from nltk import pos_tag

In [8]:
# Tag each token with its Penn Treebank part-of-speech label
pos_tags = pos_tag(tokens)
pos_tags

[('Hi', 'NN'),
 ('!', '.'),
 ('My', 'PRP$'),
 ('name', 'NN'),
 ('is', 'VBZ'),
 ('Vaishnav', 'NNP'),
 (',', ','),
 ('I', 'PRP'),
 ('am', 'VBP'),
 ('studying', 'VBG'),
 ('Data', 'NNP'),
 ('Science', 'NNP'),
 ('and', 'CC'),
 ('Analytics', 'NNPS'),
 ('.', '.')]
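
The tags in this output come from the Penn Treebank tag set. As a reading aid, the sketch below attaches a short gloss to a few of the pairs. The `PENN_TAG_GLOSSARY` dictionary and the `describe` helper are hypothetical names written for this example, and the glossary covers only the tags that appear above.

```python
# Short glosses for the Penn Treebank tags that appear in the tagger output.
PENN_TAG_GLOSSARY = {
    "NN": "noun, singular or mass",
    "NNP": "proper noun, singular",
    "NNPS": "proper noun, plural",
    "PRP": "personal pronoun",
    "PRP$": "possessive pronoun",
    "VBZ": "verb, 3rd person singular present",
    "VBP": "verb, non-3rd person singular present",
    "VBG": "verb, gerund or present participle",
    "CC": "coordinating conjunction",
    ".": "sentence-final punctuation",
    ",": "comma",
}

# A few (token, tag) pairs copied from the output above.
sample_tags = [("My", "PRP$"), ("name", "NN"), ("is", "VBZ"),
               ("studying", "VBG"), ("Analytics", "NNPS")]

def describe(tagged):
    """Attach a human-readable gloss to each (token, tag) pair."""
    return [(token, tag, PENN_TAG_GLOSSARY.get(tag, "unknown tag"))
            for token, tag in tagged]

for token, tag, gloss in describe(sample_tags):
    print(f"{token:10} {tag:5} {gloss}")
```

The tagger is statistical, so individual labels are best read as guesses: here "Hi" is tagged as a noun (`NN`), although an interjection tag would arguably fit better.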

# C. Stop Words Removal

### Stop words are very common words (e.g., "the", "is", "and") that carry little meaning on their own; removing them leaves the more informative terms.

In [9]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [10]:
from nltk.corpus import stopwords

In [11]:
# Load the English stop word list into a set for fast membership tests
stop_words = set(stopwords.words("english"))

In [12]:
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 ...

In [13]:
# Compare in lowercase because the NLTK stop word list is all lowercase
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]

filtered_tokens

['Hi',
 '!',
 'name',
 'Vaishnav',
 ',',
 'studying',
 'Data',
 'Science',
 'Analytics',
 '.']
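
The filtered list still contains the punctuation tokens `'!'`, `','`, and `'.'`, because punctuation is not part of the stop word list. One possible way to drop them in the same pass is to keep only alphabetic tokens, as sketched below. The token list is copied from the tokenizer output so the snippet runs on its own, and `demo_stop_words` is a small hand-picked subset of the English list used only for illustration.

```python
# Tokens as produced by word_tokenize earlier (copied so the snippet is self-contained).
sample_tokens = ['Hi', '!', 'My', 'name', 'is', 'Vaishnav', ',', 'I', 'am',
                 'studying', 'Data', 'Science', 'and', 'Analytics', '.']

# Illustrative subset of the English stop word list, not the full NLTK set.
demo_stop_words = {"my", "is", "i", "am", "and"}

# Keep tokens that are purely alphabetic and not stop words.
content_tokens = [
    token for token in sample_tokens
    if token.isalpha() and token.lower() not in demo_stop_words
]
print(content_tokens)
```

Note that `str.isalpha()` also discards tokens containing digits or hyphens, so this filter is only appropriate when such tokens are not needed.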

# D. Stemming and Lemmatization

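### Stemming and lemmatization both reduce words to a base form. Stemming strips suffixes by rule and can produce non-words (e.g., "studi"), while lemmatization looks words up in a vocabulary (here WordNet) to return a dictionary form. `WordNetLemmatizer.lemmatize` treats each word as a noun unless a part of speech is passed, which is why the lemmatized tokens below are unchanged.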

In [21]:
import nltk

# WordNet data is required by WordNetLemmatizer
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...


True

In [22]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

In [23]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

In [24]:
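# Apply both reducers to the stop-word-filtered tokens
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
# lemmatize() assumes a noun when no pos argument is given
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]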

In [25]:
stemmed_tokens

['hi', '!', 'name', 'vaishnav', ',', 'studi', 'data', 'scienc', 'analyt', '.']

In [26]:
lemmatized_tokens

['Hi',
 '!',
 'name',
 'Vaishnav',
 ',',
 'studying',
 'Data',
 'Science',
 'Analytics',
 '.']
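
The lemmatized tokens are identical to the input because `lemmatize` was called without a part of speech, so every token was looked up as a noun. Passing the POS tags computed earlier should let the lemmatizer reduce verbs as well; for example, "studying" tagged as a verb is expected to become "study". WordNet uses single-letter POS codes rather than Penn Treebank tags, so a small mapping is needed. The `penn_to_wordnet` helper below is a hypothetical name, and the mapping is a common simplification rather than part of NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the single-letter POS code WordNet expects."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # fall back to noun, the lemmatizer's own default

# (token, tag) pairs copied from the tagger output earlier in the notebook.
tagged_sample = [("name", "NN"), ("studying", "VBG"), ("Data", "NNP")]
wordnet_tags = [(token, penn_to_wordnet(tag)) for token, tag in tagged_sample]
print(wordnet_tags)

# With NLTK and the WordNet data available, each pair can be passed on as:
#     lemmatizer.lemmatize(token, pos=penn_to_wordnet(tag))
```

The tagging has to happen before stop word removal changes the sentence, so in practice the tags from `pos_tag(tokens)` would be carried along with the tokens through the filter.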

# 2. Representation using TF-IDF

### TF-IDF (Term Frequency-Inverse Document Frequency) weights a term by how often it occurs in a document and how rare it is across a corpus, so it needs a collection of documents rather than a single one. The example below uses a small four-sentence corpus and scikit-learn's `TfidfVectorizer`.
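
For reference, with its default settings (`smooth_idf=True`, `norm="l2"`, `sublinear_tf=False`), `TfidfVectorizer` computes the weights as follows, where $n$ is the number of documents, $\mathrm{df}(t)$ is the number of documents containing term $t$, and $\mathrm{tf}(t, d)$ is the raw count of $t$ in document $d$:

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t)
$$

Each document vector is then scaled to unit Euclidean length:

$$
w(t, d) = \frac{\mathrm{tfidf}(t, d)}{\sqrt{\sum_{t'} \mathrm{tfidf}(t', d)^2}}
$$

This differs from the textbook definition $\mathrm{idf}(t) = \ln\frac{n}{\mathrm{df}(t)}$, so hand-computed values only match the library output when the smoothed form and the normalization are both applied.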

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [28]:
corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

In [29]:
vectorizer = TfidfVectorizer()

In [30]:
# Learn the vocabulary and IDF weights, then build the document-term matrix
tfidf_matrix = vectorizer.fit_transform(corpus)

In [31]:
tfidf_matrix

<4x9 sparse matrix of type '<class 'numpy.float64'>'
	with 21 stored elements in Compressed Sparse Row format>

In [32]:
feature_names = vectorizer.get_feature_names_out()

feature_names

array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)

In [33]:
# Convert the sparse matrix to a dense array once, then list each document's non-zero scores
dense_matrix = tfidf_matrix.toarray()
for doc_index, row in enumerate(dense_matrix):
    print(f"\nDocument {doc_index + 1}: ")
    for term, tfidf_value in zip(feature_names, row):
        if tfidf_value > 0:
            print(f"{term}: {tfidf_value:.2f}")


Document 1: 
document: 0.47
first: 0.58
is: 0.38
the: 0.38
this: 0.38

Document 2: 
document: 0.69
is: 0.28
second: 0.54
the: 0.28
this: 0.28

Document 3: 
and: 0.51
is: 0.27
one: 0.51
the: 0.27
third: 0.51
this: 0.27

Document 4: 
document: 0.47
first: 0.58
is: 0.38
the: 0.38
this: 0.38
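
The scores above can be reproduced without scikit-learn, which helps show where each number comes from. The sketch below assumes the library defaults: lowercasing, tokens of two or more word characters, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. The `tokenize` and `tfidf` functions are hypothetical helpers written for this example, not part of any library.

```python
import math
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # mirroring TfidfVectorizer's default token pattern.
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf(docs):
    """Return one {term: weight} dict per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc_counts in counts for term in doc_counts)
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in df}
    rows = []
    for doc_counts in counts:
        raw = {term: tf * idf[term] for term, tf in doc_counts.items()}
        # Scale the vector to unit Euclidean (L2) length.
        norm = math.sqrt(sum(value * value for value in raw.values()))
        rows.append({term: value / norm for term, value in raw.items()})
    return rows

manual_scores = tfidf(corpus)
for index, row in enumerate(manual_scores, start=1):
    print(f"Document {index}:", {t: round(v, 2) for t, v in sorted(row.items())})
```

Documents 1 and 4 receive identical vectors because they contain the same words once lowercasing and punctuation removal are applied; word order plays no role in a bag-of-words representation.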
