
# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of Friday, at 11:59 PM.

**Late submissions will be penalized 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**

**Please check that the link you submitted can be opened and points to the correct assignment.**

## Question 1 (10 Points)
Describe an interesting **text classification or text mining task** and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features. **Your dataset must be text.**

In [21]:
# Your answer here (no code for this question, write down your answer in as much detail as possible for the above questions):

'''
Please write your answer here:

Task: Email Spam Detection
Description: Spam detection is the task of classifying emails as spam or not spam based on their content. It helps filter out unsolicited emails and protects users from potential threats.

Features for building the Machine Learning Model

1. Bag of Words (BoW):
 -->Represents text as a set of word frequencies.
 -->It captures terms such as "free," "win," and "prize" that frequently appear in spam emails, along with how often they occur.

2. Term Frequency-Inverse Document Frequency (TF-IDF):
 -->It weighs a word's importance by comparing its frequency within a document to its frequency across the full corpus.
 -->It helps find terms that are more prominent in spam emails than in the dataset as a whole, emphasizing distinctive terms that might be signs of spam.

3. Email Metadata:

 -->Includes features such as the sender's email address, the subject line, and the presence of specific keywords in the subject.
 -->Certain sender addresses or subject-line patterns are associated with spam. For instance, emails with subjects like "Congratulations" or "Urgent", or emails from unknown domains, may indicate spam.

4. N-grams:
  -->Sequences of N consecutive words, such as bigrams and trigrams.
  -->They capture word combinations and context that individual words could miss. For instance, the bigrams "limited time" and "click here" are often reliable signs of spam.

5. HTML Features:
 -->Presence of HTML tags and attributes in the email body.
 -->Spam emails often contain HTML content with links, images, and formatting to make them look
    legitimate. The presence of certain HTML tags (e.g., <a>, <img>) can be indicative of spam.

6. Special Characters and Punctuation:
 -->Counts of punctuation marks and special characters (like $, %, and @).
 -->Special characters and heavy punctuation are frequently used in spam emails to draw attention or hide content. Multiple dollar signs ("$$$") or exclamation points ("!!!"), for instance, could be indicators of spam.

7. Word Embeddings:
 -->Dense vector representations of words (e.g., Word2Vec, GloVe).
 -->They capture semantic relationships between words, enabling the model to account for word similarity and context. For instance, the embeddings for "offer" and "deal" would be close to each other, so the model can treat both as related signals of spam.

8. Text Length:
 -->The email's word count or character count.
 -->Spam emails tend to fall into one of two length ranges:
    very long with comprehensive offers, or very short with a call to action.
    Text length can complement the other features.

Conclusion:
By combining these features, we can build a strong spam detection model that draws on both email content and metadata.
This multifaceted approach helps email filtering systems classify messages as spam or not spam more effectively.



'''

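To make the TF-IDF feature from the list above concrete, here is a minimal sketch that computes it by hand on three toy messages. It assumes whitespace tokenization and the textbook weighting `tf * log(N / df)`; the toy messages and the helper name `tf_idf` are illustrative only and are not part of the exercise.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    # Assumption: lowercase + whitespace split is enough for this toy data
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus
toy_docs = [
    "win a free prize now",
    "free lunch at the meeting",
    "the meeting is at noon",
]
toy_weights = tf_idf(toy_docs)

# "prize" occurs in one document, "free" in two, so "prize" gets the larger weight
print(toy_weights[0]["prize"], toy_weights[0]["free"])
```

scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its values will not match this sketch exactly, but the ordering intuition (rarer terms weigh more) is the same.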

## Question 2 (10 Points)
Write python code to extract these features you discussed above. You can collect a few sample text data for the feature extraction.

In [22]:
# Your code here (Please add comments in the code):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import pos_tag
from collections import Counter
from bs4 import BeautifulSoup
import nltk

# Download the NLTK resources used below (stopword list, tokenizer, POS tagger)
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Sample text data
emails = [
    "Congratulations! You've won a free prize. Click here to claim your reward.",
    "Dear user, your account has been compromised. Please update your password immediately.",
    "Meeting tomorrow at 10 AM. Your appointment with UNT Legal Advisor is scheduled.",
    "Limited time offer! Get 50% off on all books in UNT library. Visit our website now.",
    "Hi Bhargava, just wanted to check in and see how you're doing. Let's catch up soon."
]

# Initialize vectorizers
count_vector = CountVectorizer()
tfidf_vector = TfidfVectorizer()
stop_words = set(stopwords.words('english'))

# Bag of Words (BoW)
bow_features = count_vector.fit_transform(emails).toarray()

# TF-IDF
tfidf_features = tfidf_vector.fit_transform(emails).toarray()

# Length, special-character, and HTML-tag features
# (sender and subject metadata are not available in these plain-text samples)
def extract_metadata(email):
    return {
        'length': len(email),
        'num_special_chars': sum(1 for char in email if char in "!@#$%^&*()"),
        'num_html_tags': len(BeautifulSoup(email, "html.parser").find_all())
    }

metadata_features = [extract_metadata(email) for email in emails]

# N-grams
def extract_ngrams(email, n=2):
    # Tokenize the email into words
    words = word_tokenize(email)

    # Create a list to store the n-grams
    ngrams = []

    # Loop through the list of words to create n-grams
    num_words = len(words)
    for start_index in range(num_words - n + 1):
        # Extract n words starting from the current index
        ngram_words = words[start_index:start_index + n]

        # Join the n-gram words with a space to form a single string
        ngram_string = " ".join(ngram_words)

        # Add the n-gram string to the list of n-grams
        ngrams.append(ngram_string)

    return ngrams

# Bigrams per email (kept as lists of strings; not added to features_df below)
bigrams = [extract_ngrams(email, 2) for email in emails]

# Part of Speech (POS) Tags
def extract_pos_tags(email):
    tokens = word_tokenize(email)
    pos_tags = pos_tag(tokens)
    pos_counts = Counter(tag for word, tag in pos_tags)
    return pos_counts

pos_tag_features = [extract_pos_tags(email) for email in emails]

# Create DataFrames for the features
bow_df = pd.DataFrame(bow_features, columns=count_vector.get_feature_names_out())
tfidf_df = pd.DataFrame(tfidf_features, columns=[f"tfidf_{word}" for word in tfidf_vector.get_feature_names_out()])
metadata_df = pd.DataFrame(metadata_features)
pos_tag_df = pd.DataFrame(pos_tag_features).fillna(0)

# Combine all features into a single DataFrame
features_df = pd.concat([bow_df, tfidf_df, metadata_df, pos_tag_df], axis=1)

print(features_df)


   10  50  account  advisor  all  am  and  appointment  at  been  ...  PRP$  \
0   0   0        0        0    0   0    0            0   0     0  ...   1.0   
1   0   0        1        0    0   0    0            0   0     1  ...   2.0   
2   1   0        0        1    0   1    0            1   1     0  ...   1.0   
3   0   1        0        0    1   0    0            0   0     0  ...   1.0   
4   0   0        0        0    0   0    1            0   0     0  ...   0.0   

     ,  VBZ  VBG   IN   CD   CC  WRB  POS   RP  
0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  
1  1.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  
2  0.0  1.0  1.0  2.0  1.0  0.0  0.0  0.0  0.0  
3  0.0  0.0  0.0  3.0  1.0  0.0  0.0  0.0  0.0  
4  1.0  0.0  1.0  1.0  0.0  1.0  1.0  1.0  1.0  

[5 rows x 151 columns]


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
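The `bigrams` list built in the cell above is not part of `features_df`. One way to turn n-grams into numeric columns is to count them per document, sketched below with the standard library only. The regex tokenizer, the sample texts, and the names `ngram_counts`, `sample_texts`, and `bigram_matrix` are assumptions made for illustration; the cell above uses NLTK's `word_tokenize`, which also emits punctuation tokens.

```python
import re
from collections import Counter

def ngram_counts(text, n=2):
    """Count word n-grams in `text` after lowercasing and dropping punctuation."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)

# Hypothetical sample texts
sample_texts = [
    "Limited time offer! Click here now.",
    "Click here to claim your prize. Click here today!",
]
per_doc = [ngram_counts(text) for text in sample_texts]

# Shared, sorted vocabulary so every document gets the same columns
vocabulary = sorted(set().union(*per_doc))

# One row per document, one column per bigram (a Counter returns 0 for missing keys)
bigram_matrix = [[counts[gram] for gram in vocabulary] for counts in per_doc]

print(vocabulary)
print(bigram_matrix)
```

The resulting rows could be wrapped in `pd.DataFrame(bigram_matrix, columns=vocabulary)` and concatenated with the other feature frames. Within scikit-learn, `CountVectorizer(ngram_range=(2, 2))` produces bigram counts directly.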


## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [23]:
# Your code here (Please add comments in the code):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2
from sklearn.datasets import fetch_20newsgroups

# Load dataset
categories_ds = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories_ds)

# Convert text data to TF-IDF features
tf_vector = TfidfVectorizer(stop_words='english', max_features=1000)
X = tf_vector.fit_transform(newsgroups_train.data)
y = newsgroups_train.target

# Apply Chi-Square feature selection
chi2_scores, p_values = chi2(X, y)

# Create a DataFrame with feature names and their Chi-Square scores
feature_names = tf_vector.get_feature_names_out()
chi2_df = pd.DataFrame({'feature': feature_names, 'chi2_score': chi2_scores})

# Rank features based on Chi-Square scores in descending order
chi2_df = chi2_df.sort_values(by='chi2_score', ascending=False)

# Select the top N features
top_features = chi2_df.head(20)

# Display the top features
print(top_features)


        feature  chi2_score
395    graphics  107.404265
475       keith  106.852696
385         god  102.261303
655        pitt   93.206041
373         geb   77.141680
122       banks   74.653441
390      gordon   74.486385
582         msg   74.118948
465       jesus   66.868713
193      church   64.188361
159     caltech   61.956683
514     livesey   61.442825
192  christians   59.816445
189      christ   59.122114
108     atheism   54.920004
577    morality   54.437770
354       files   54.228775
35           3d   51.563985
757     rutgers   49.022242
437       image   48.457932
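To show what the chi-square score in the table above measures, the sketch below computes an observed-versus-expected statistic by hand on a tiny count matrix. The toy data and the function name `chi2_by_hand` are assumptions for illustration. The idea is that a feature whose total weight is spread across classes in proportion to class size scores 0, while a feature concentrated in one class scores high.

```python
import numpy as np

def chi2_by_hand(X, y):
    """Chi-square score per feature from non-negative feature values and class labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)

    # One-hot encode the labels: shape (n_samples, n_classes)
    Y = (y[:, None] == classes[None, :]).astype(float)

    observed = Y.T @ X                       # per-class sum of each feature
    class_prob = Y.mean(axis=0)              # share of samples in each class
    feature_total = X.sum(axis=0)            # total of each feature over all samples
    expected = np.outer(class_prob, feature_total)

    return ((observed - expected) ** 2 / expected).sum(axis=0)

# Hypothetical toy data: feature 0 occurs only in class 1, feature 1 is evenly spread
toy_X = np.array([
    [3, 1],
    [2, 1],
    [0, 1],
    [0, 1],
])
toy_y = np.array([1, 1, 0, 0])

toy_scores = chi2_by_hand(toy_X, toy_y)
toy_ranking = np.argsort(toy_scores)[::-1]   # feature indices, highest score first
print(toy_scores, toy_ranking)
```

Several of the top-ranked terms above (for example `keith`, `livesey`, `pitt`, `caltech`) look like author names and hosts from message headers rather than topical words. If that is not wanted, `fetch_20newsgroups` accepts `remove=('headers', 'footers', 'quotes')` to strip that metadata before vectorizing.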


## Question 4 (10 points):
Write python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the texts by similarity in descending order.

In [24]:
# Your code here (Please add comments in the code):
import torch
import numpy as np
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained BERT model and tokenizer
bt_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

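# Function to get BERT embeddings: one vector per input text
def get_bert(texts):
    # Tokenize as a padded batch, truncating to BERT's 512-token limit
    inputs = bt_tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=512)
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    # Mean-pool token vectors over the sequence dimension
    # (note: padding positions are included in this average)
    embeddings = outputs.last_hidden_state.mean(dim=1)
    return embeddings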

# Sample text data
emails = [
    "Congratulations! You've won a free prize. Click here to claim your reward.",
    "Dear user, your account has been compromised. Please update your password immediately.",
    "Meeting tomorrow at 10 AM. Your appointment with UNT Legal Advisor is scheduled.",
    "Limited time offer! Get 50% off on all books in UNT library. Visit our website now.",
    "Hi Bhargava, just wanted to check in and see how you're doing. Let's catch up soon."
]

# Query
query = "How to claim my prize?"

# Get BERT embeddings for emails and query
email_embeddings = get_bert(emails)
query_embedding = get_bert([query])

# Calculate cosine similarity
similarity_scores = cosine_similarity(query_embedding, email_embeddings).flatten()

# Rank emails by similarity score, highest first
indices = np.argsort(similarity_scores)[::-1]
ranked_emails = [emails[i] for i in indices]
ranked_scores = [similarity_scores[i] for i in indices]

# Print ranked emails and their similarity scores
for email, score in zip(ranked_emails, ranked_scores):
    print(f"Score: {score:.4f}, Email: {email}")



Score: 0.6043, Email: Congratulations! You've won a free prize. Click here to claim your reward.
Score: 0.5335, Email: Meeting tomorrow at 10 AM. Your appointment with UNT Legal Advisor is scheduled.
Score: 0.5165, Email: Limited time offer! Get 50% off on all books in UNT library. Visit our website now.
Score: 0.4297, Email: Hi Bhargava, just wanted to check in and see how you're doing. Let's catch up soon.
Score: 0.4205, Email: Dear user, your account has been compromised. Please update your password immediately.
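`get_bert` averages `last_hidden_state` over every position, which includes the padding positions added when texts in a batch differ in length. A common alternative is to weight the average by the attention mask so that padding is ignored. The sketch below shows only the arithmetic, on toy NumPy arrays standing in for `last_hidden_state` and `attention_mask`; the function name `masked_mean` and the toy values are assumptions, and the scores printed above were produced without this masking.

```python
import numpy as np

def masked_mean(hidden, mask):
    """Average token vectors per sequence, ignoring positions where mask == 0.

    hidden: (batch, seq_len, dim) array of token vectors
    mask:   (batch, seq_len) array with 1 for real tokens and 0 for padding
    """
    mask = mask[:, :, None].astype(hidden.dtype)      # broadcast over the dim axis
    summed = (hidden * mask).sum(axis=1)              # sum of real-token vectors
    counts = np.clip(mask.sum(axis=1), 1e-9, None)    # avoid division by zero
    return summed / counts

# Hypothetical toy batch: two sequences, three positions, two dimensions
toy_hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],   # third position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],   # no padding
])
toy_mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])

pooled = masked_mean(toy_hidden, toy_mask)   # padding excluded
plain = toy_hidden.mean(axis=1)              # padding included, as in get_bert
print(pooled)
print(plain)
```

For the first sequence the two results differ because the padding vector pulls the plain mean away from the real tokens; for the second they agree. In PyTorch the same idea would apply `inputs['attention_mask'].unsqueeze(-1)` to `outputs.last_hidden_state` before summing, which may change the similarity ranking slightly.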


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [25]:
# Your answer here (no code for this question, write down your answer in as much detail as possible for the above questions):

'''
Please write your answer here:

This exercise took considerable time to analyze and understand. I went through many websites and Python documentation pages to understand the concepts.
I learned how machine learning represents text datasets through Bag of Words, Term Frequency-Inverse Document Frequency (TF-IDF),
N-grams, and part-of-speech tags.
While writing and running the code I ran into many errors.
Finding and resolving these issues took the most time.
I learned techniques such as preprocessing and feature extraction, along with other common machine learning concepts.

This exercise is highly relevant to the study of NLP. Feature extraction from text is a crucial first step in many natural language processing (NLP) applications, such as text categorization, machine translation, and information retrieval.
By understanding and applying these techniques, I can preprocess and analyze text efficiently.

This exercise highlights the importance of data preprocessing and feature engineering in NLP.
It has given me a strong foundation in text feature extraction, which is essential for many natural language processing tasks.

'''
