## Assignment 6 (Text Analytics)

### 1. Document Preprocessing
Given documents:
- **Document A**: "Jupiter is the largest Planet"
- **Document B**: "Mars is the fourth planet from the Sun"

Steps:
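- **Tokenization**: Splitting each sentence into individual word tokens.
- **POS Tagging**: Assigning a part-of-speech tag (noun, verb, adjective, etc.) to each token.
- **Stop Words Removal**: Removing common words such as "is" and "the" that carry little meaning on their own.
- **Stemming and Lemmatization**: Reducing words to a base form. Stemming strips suffixes by rule, while lemmatization maps each word to its dictionary form.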
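
The POS tagging step can be illustrated without any downloaded resources. The following is only a toy sketch: `TOY_LEXICON` and `toy_pos_tag` are hypothetical names, the lexicon is hand-written for the two sample documents (Penn Treebank style tags), and unknown words simply fall back to `NN`. In practice a trained tagger such as NLTK's `nltk.pos_tag` would normally be used instead.

```python
# Toy unigram lookup tagger: a hand-written lexicon (an assumption, not a trained model)
# maps lowercase tokens to Penn Treebank tags; unknown tokens fall back to a default tag.
TOY_LEXICON = {
    "jupiter": "NNP", "mars": "NNP", "sun": "NNP",
    "is": "VBZ", "the": "DT", "from": "IN",
    "largest": "JJS", "fourth": "JJ", "planet": "NN",
}

def toy_pos_tag(sentence, default_tag="NN"):
    """Return (token, tag) pairs using the hand-written lexicon above."""
    return [(tok, TOY_LEXICON.get(tok.lower(), default_tag)) for tok in sentence.split()]

print(toy_pos_tag("Jupiter is the largest Planet"))
print(toy_pos_tag("Mars is the fourth planet from the Sun"))
```

A lookup table like this cannot resolve words whose tag depends on context, which is the main reason to prefer a trained tagger on real text.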

### 2. Term Frequency and Inverse Document Frequency (TF-IDF)
Calculate the TF-IDF manually, then use Python to verify.
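
As a sketch of the manual calculation, assume raw counts for TF and an unsmoothed natural-log IDF (one common convention; other weightings exist). With $N = 2$ documents and stop words removed, "jupiter" occurs in one document and "planet" in both:

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \ln\frac{N}{\mathrm{df}(t)}

\mathrm{tfidf}(\text{jupiter}, A) = 1 \cdot \ln\frac{2}{1} \approx 0.6931
\qquad
\mathrm{tfidf}(\text{planet}, A) = 1 \cdot \ln\frac{2}{2} = 0
```

Under this convention, a term that appears in every document always scores zero, regardless of how often it occurs.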

### Code:


In [9]:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time download of the NLTK resources used below.
# Without 'punkt_tab', word_tokenize raises "LookupError: Resource punkt_tab not found".
nltk.download('punkt_tab')   # tokenizer models for word_tokenize
nltk.download('stopwords')   # English stop word list
nltk.download('wordnet')     # dictionary used by WordNetLemmatizer

# Sample documents
doc_a = "Jupiter is the largest Planet"
doc_b = "Mars is the fourth planet from the Sun"

# Tokenization (lowercased so that "Planet" and "planet" match)
tokens_a = word_tokenize(doc_a.lower())
tokens_b = word_tokenize(doc_b.lower())

# Removing stop words
stop_words = set(stopwords.words('english'))
filtered_a = [word for word in tokens_a if word not in stop_words]
filtered_b = [word for word in tokens_b if word not in stop_words]

# Stemming and Lemmatization
ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stemmed_a = [ps.stem(word) for word in filtered_a]
stemmed_b = [ps.stem(word) for word in filtered_b]
lemmatized_a = [lemmatizer.lemmatize(word) for word in filtered_a]
lemmatized_b = [lemmatizer.lemmatize(word) for word in filtered_b]

print("Stemmed A:", stemmed_a)
print("Stemmed B:", stemmed_b)
print("Lemmatized A:", lemmatized_a)
print("Lemmatized B:", lemmatized_b)

# Calculate TF-IDF on the raw documents (TfidfVectorizer lowercases and tokenizes internally)
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform([doc_a, doc_b])
print("TF-IDF Matrix:\n", tfidf_matrix.toarray())
print("Feature Names:", vectorizer.get_feature_names_out())


In [8]:
import re
from collections import Counter
import math

def simple_tokenize(text):
    """Simple tokenization by splitting on spaces and removing punctuation"""
    # Convert to lowercase and remove punctuation
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    return text.split()

def remove_stopwords(tokens):
    """Remove common English stop words (small hand-picked list, not NLTK's)"""
    # Note: 'from' is not in this list, so it survives filtering
    stop_words = {'is', 'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'with', 'by'}
    return [token for token in tokens if token not in stop_words]

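def simple_stem(word):
    """Very basic suffix stripping; the first matching rule wins"""
    # These rules over-stem: any trailing 's' is dropped, so 'mars' becomes 'mar'
    if word.endswith('ing'):
        return word[:-3]
    if word.endswith('s'):
        return word[:-1]
    if word.endswith('ed'):
        return word[:-2]
    return word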

def calculate_tf(tokens):
    """Calculate term frequency as raw counts (not normalized by document length)"""
    return Counter(tokens)

def calculate_idf(documents_tokens):
    """Calculate inverse document frequency as ln(N / df), without smoothing"""
    num_documents = len(documents_tokens)
    word_doc_count = Counter()
    
    for doc_tokens in documents_tokens:
        # Count each word only once per document
        unique_words = set(doc_tokens)
        for word in unique_words:
            word_doc_count[word] += 1
    
    # IDF = ln(N / df); a term that appears in every document gets ln(1) = 0
    idf = {}
    for word, doc_count in word_doc_count.items():
        idf[word] = math.log(num_documents / doc_count)
    
    return idf

def calculate_tf_idf(documents):
    """Calculate TF-IDF for the documents"""
    # Preprocess all documents
    processed_docs = []
    for doc in documents:
        tokens = simple_tokenize(doc)
        tokens = remove_stopwords(tokens)
        tokens = [simple_stem(token) for token in tokens]
        processed_docs.append(tokens)
    
    # Calculate TF for each document
    tf_scores = [calculate_tf(doc_tokens) for doc_tokens in processed_docs]
    
    # Calculate IDF across all documents
    idf_scores = calculate_idf(processed_docs)
    
    # Calculate TF-IDF
    tf_idf_results = []
    for tf_doc in tf_scores:
        tf_idf_doc = {}
        for term, tf in tf_doc.items():
            tf_idf_doc[term] = tf * idf_scores.get(term, 0)
        tf_idf_results.append(tf_idf_doc)
    
    return tf_idf_results

# Example documents
doc_A = "Jupiter is the largest Planet"
doc_B = "Mars is the fourth planet from the Sun"
documents = [doc_A, doc_B]

# Process and analyze documents
print("Document Processing Results:")
for idx, doc in enumerate(documents):
    print(f"\nDocument {idx + 1}: '{doc}'")
    # Show tokenization
    tokens = simple_tokenize(doc)
    print(f"Tokens: {tokens}")
    
    # Show after stop words removal
    clean_tokens = remove_stopwords(tokens)
    print(f"After stop words removal: {clean_tokens}")
    
    # Show after stemming
    stemmed = [simple_stem(token) for token in clean_tokens]
    print(f"After stemming: {stemmed}")

# Calculate and show TF-IDF
print("\nTF-IDF Results:")
tf_idf_results = calculate_tf_idf(documents)
for idx, doc_scores in enumerate(tf_idf_results):
    print(f"\nDocument {idx + 1} TF-IDF scores:")
    sorted_terms = sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)
    for term, score in sorted_terms:
        print(f"{term}: {score:.4f}")

Document Processing Results:

Document 1: 'Jupiter is the largest Planet'
Tokens: ['jupiter', 'is', 'the', 'largest', 'planet']
After stop words removal: ['jupiter', 'largest', 'planet']
After stemming: ['jupiter', 'largest', 'planet']

Document 2: 'Mars is the fourth planet from the Sun'
Tokens: ['mars', 'is', 'the', 'fourth', 'planet', 'from', 'the', 'sun']
After stop words removal: ['mars', 'fourth', 'planet', 'from', 'sun']
After stemming: ['mar', 'fourth', 'planet', 'from', 'sun']

TF-IDF Results:

Document 1 TF-IDF scores:
jupiter: 0.6931
largest: 0.6931
planet: 0.0000

Document 2 TF-IDF scores:
mar: 0.6931
fourth: 0.6931
from: 0.6931
sun: 0.6931
planet: 0.0000


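### Explanation:
1. Preprocessing: each document is lowercased and tokenized, stop words are removed, and the remaining tokens are stemmed or lemmatized.
2. TF-IDF is calculated with `TfidfVectorizer` in the first cell and manually, as raw count times `ln(N / df)`, in the second. In the manual results, "planet" occurs in both documents and scores 0, while every other term occurs in only one document and scores `ln(2) ≈ 0.6931`.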
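
`TfidfVectorizer` does not use the plain `ln(N / df)` weighting by default, so its matrix is not expected to match the manual scores digit for digit. The sketch below reproduces what the default settings are understood to compute (`smooth_idf=True`, `norm='l2'`, no stop word removal, tokens of two or more word characters) using only the standard library. The helper name `tfidf_row` is hypothetical, and the result should be compared against the vectorizer's actual output rather than taken as authoritative.

```python
import math
import re
from collections import Counter

docs = ["Jupiter is the largest Planet", "Mars is the fourth planet from the Sun"]

# Approximates the default tokenization: lowercase, tokens of 2+ word characters
tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
vocab = sorted({t for toks in tokenized for t in toks})
n_docs = len(docs)

# Document frequency: number of documents containing each term
df = Counter(t for toks in tokenized for t in set(toks))

# Smoothed IDF: ln((1 + N) / (1 + df)) + 1, so no term is weighted to zero
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Raw count times smoothed IDF, then L2-normalized to unit length."""
    counts = Counter(tokens)
    weights = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

rows = [tfidf_row(toks) for toks in tokenized]
for name, row in zip("AB", rows):
    print(name, {t: round(w, 4) for t, w in zip(vocab, row) if w > 0})
```

Because of the `+ 1` in the smoothed IDF, "planet" keeps a nonzero weight here even though it appears in both documents, and stop words such as "is" and "the" remain in the vocabulary unless `stop_words` is set.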