
# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Do not create a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of Friday, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**

**Please check that the link you submitted can be opened and points to the correct assignment.**

## Question 1 (10 Points)
Describe an interesting **text classification or text mining task** and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features. **Your dataset must be text.**

In [None]:
'''
Customer Intent Classification:
It is a task in Natural Language Processing (NLP) where the goal is to identify the purpose or intent behind a customer's input, typically in the form of text such as chat messages, reviews, or emails.

Useful features for building a Machine Learning Model:
As the dataset is text-based, the features extracted from the text will help in identifying the customer’s intent.

1. Bag of Words (BoW) / Term Frequency-Inverse Document Frequency (TF-IDF)
    BoW captures word frequencies in a document, while TF-IDF down-weights words that appear in many documents.
    Ex: Keywords such as “buy,” “help,” or “issue” can directly hint at specific intents. For instance, "buy" or "order" may indicate a "Purchase Intent," while "issue" or "problem" may indicate a "Support Request."

2. N-grams (Bigrams, Trigrams)
    N-grams are contiguous sequences of n words. Bigrams (two-word sequences) and trigrams (three-word sequences) capture local context that single words miss.
    Ex: Phrases like "how to buy" or "need support" carry more context than their individual words and are often more predictive of intent.

3. Sentiment Analysis
    Sentiment analysis detects the emotional tone of the text (positive, negative, or neutral).
    Ex: Inquiries with negative sentiment might indicate frustration or a need for support, whereas positive sentiment might indicate feedback or praise. Sentiment can help distinguish between "Request for Support" and "Feedback."

4. Word Embeddings
    Word embeddings represent words as dense, real-valued vectors so that words with similar meanings lie close together in the vector space.
    Ex: Words like “purchase” and “order” are often used interchangeably, and their embeddings would be similar, which helps the model generalize across different wordings of the same intent.

5. Length of the Text
    The number of words or characters in the input text.
    Ex: Shorter texts may indicate quick inquiries or commands, while longer texts may indicate detailed feedback or complex support requests.

6. Part of Speech (POS) Tags
    POS tags capture the grammatical structure of a sentence by labeling each word with its part of speech (e.g., noun, verb, adjective).
    Ex: Different intents tend to use different grammatical structures. For example, "Request" intents may frequently include modal verbs ("can," "would"), while "Feedback" might include more adjectives ("great," "bad").

'''
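
To make a few of the features above concrete, here is a minimal, library-free sketch of how word n-grams, text length, and a modal-verb flag could be computed for one message. The function names `word_ngrams` and `handcrafted_features`, the regex tokenizer, and the `MODAL_VERBS` set are illustrative assumptions, not part of the assignment or of any library.

```python
import re

# Hypothetical, deliberately small list of modal verbs; a POS tagger would be more robust.
MODAL_VERBS = {"can", "could", "would", "should", "may", "might", "will"}


def word_ngrams(tokens, n):
    """Return all contiguous n-word sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def handcrafted_features(text):
    """Compute a few simple intent-related features for one message."""
    # Lowercase and keep alphabetic words (apostrophes allowed, e.g. "i'm").
    tokens = re.findall(r"[a-z']+", text.lower())
    return {
        "n_chars": len(text),
        "n_words": len(tokens),
        "has_modal": any(tok in MODAL_VERBS for tok in tokens),
        "bigrams": word_ngrams(tokens, 2),
        "trigrams": word_ngrams(tokens, 3),
    }


features = handcrafted_features("Can you tell me how to return the item?")
print(features["n_words"], features["has_modal"])
print(features["bigrams"])
```

A keyword list like this is only a rough stand-in for real POS tagging, but it shows how a grammatical cue can become a single binary feature.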

## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [None]:
import nltk
import spacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.sentiment import SentimentIntensityAnalyzer

# NLTK resources: VADER lexicon for sentiment, punkt for word tokenization
nltk.download('vader_lexicon')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Small English spaCy pipeline, used below for document vectors and POS tags
nlp = spacy.load("en_core_web_sm")

sample_texts = [
    "I want to buy a new laptop.",
    "I need help with my account.",
    "The product is great, I love it!",
    "I'm having issues with the billing.",
    "Can you tell me how to return the item?"
]

# 1a. Bag of Words: raw term counts per document
bow_vectorizer = CountVectorizer()
bow_features = bow_vectorizer.fit_transform(sample_texts)

print("Bag of Words (BoW) features:\n", bow_features.toarray())
print("BoW feature names:", bow_vectorizer.get_feature_names_out())

# 1b. TF-IDF: counts reweighted by inverse document frequency, L2-normalized per document
tfidf_vectorizer = TfidfVectorizer()
tfidf_features = tfidf_vectorizer.fit_transform(sample_texts)

print("\nTF-IDF features:\n", tfidf_features.toarray())
print("TF-IDF feature names:", tfidf_vectorizer.get_feature_names_out())

# 2. N-grams: bigram and trigram counts
ngram_vectorizer = CountVectorizer(ngram_range=(2, 3))
ngram_features = ngram_vectorizer.fit_transform(sample_texts)

print("\nN-gram features (Bigrams/Trigrams):\n", ngram_features.toarray())
print("N-gram feature names:", ngram_vectorizer.get_feature_names_out())

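# 3. Sentiment: VADER returns neg/neu/pos proportions and a compound score
sentiment_analyzer = SentimentIntensityAnalyzer()
sentiment_scores = [sentiment_analyzer.polarity_scores(text) for text in sample_texts]

print("\nSentiment Scores:")
for text, score in zip(sample_texts, sentiment_scores):
    print(f"{text} --> {score}")

# 4. Embeddings: one vector per text from spaCy.
# Note: en_core_web_sm ships without static word vectors, so doc.vector is
# derived from the pipeline's context-sensitive tensors; a model such as
# en_core_web_md would be needed for pretrained word vectors.
print("\nWord Embeddings (SpaCy):")
for text in sample_texts:
    doc = nlp(text)
    print(f"Text: {text}")
    print(f"Embedding shape: {doc.vector.shape}")
    print(f"First 5 elements of embedding: {doc.vector[:5]}\n")

# 5. Text length in tokens (nltk.word_tokenize also counts punctuation marks)
text_lengths = [len(nltk.word_tokenize(text)) for text in sample_texts]

print("\nLength of Text (in tokens):")
for text, length in zip(sample_texts, text_lengths):
    print(f"{text} --> {length} tokens")

# 6. POS tags: coarse-grained universal tags from spaCy
print("\nPart of Speech (POS) Tags:")
for text in sample_texts:
    doc = nlp(text)
    print(f"\nText: {text}")
    for token in doc:
        print(f"{token.text} ({token.pos_})", end=' ')
    print()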



[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Bag of Words (BoW) features:
 [[0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 1 0 0]
 [1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0]
 [0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0]
 [0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0]
 [0 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 1 1 1 1 0 0 1]]
BoW feature names: ['account' 'billing' 'buy' 'can' 'great' 'having' 'help' 'how' 'is'
 'issues' 'it' 'item' 'laptop' 'love' 'me' 'my' 'need' 'new' 'product'
 'return' 'tell' 'the' 'to' 'want' 'with' 'you']

TF-IDF features:
 [[0.         0.         0.46369322 0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.46369322 0.         0.         0.         0.         0.46369322
  0.         0.         0.         0.         0.37410477 0.46369322
  0.         0.        ]
 [0.46369322 0.         0.         0.         0.         0.
  0.46369322 0.         0.         0.         0.         0.
  0.         0.         0.         0.46369322 0.46369322 0.
  0.   
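
The TF-IDF values printed above can be checked by hand. The sketch below assumes scikit-learn's default `TfidfVectorizer` settings (lowercasing, tokens of two or more word characters, smoothed IDF, L2 normalization) and recomputes the weights for the first sample text using only the standard library. The helper names `tokenize` and `tfidf` are illustrative.

```python
import math
import re

sample_texts = [
    "I want to buy a new laptop.",
    "I need help with my account.",
    "The product is great, I love it!",
    "I'm having issues with the billing.",
    "Can you tell me how to return the item?",
]


def tokenize(text):
    # Mirrors the assumed default token pattern: lowercase, words of 2+ characters.
    return re.findall(r"\b\w\w+\b", text.lower())


docs = [tokenize(t) for t in sample_texts]
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
df = {}
for doc in docs:
    for term in set(doc):
        df[term] = df.get(term, 0) + 1


def tfidf(doc):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    raw = {}
    for term in doc:
        idf = math.log((1 + n_docs) / (1 + df[term])) + 1  # smoothed IDF
        raw[term] = raw.get(term, 0.0) + idf  # one idf per occurrence = tf * idf
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}


weights = tfidf(docs[0])
for term in sorted(weights):
    print(f"{term}: {weights[term]:.8f}")
```

The terms that occur only in the first text (`buy`, `laptop`, `new`, `want`) come out near 0.4637, while `to`, which also occurs in the last text, comes out near 0.3741, in line with the first row of the printed matrix.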

## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2
from sklearn.preprocessing import LabelEncoder

texts = [
    "I want to buy a new laptop.",
    "I need help with my account."
]
labels = ["Purchase Intent", "Support Request"]

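# Represent each text as a TF-IDF vector
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Encode the string labels as integers
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(labels)

# Chi-square test of each feature against the class label.
# With only one document per class, every term occurs in exactly one class
# with the same weight, so all features tie; more labeled texts are needed
# for a ranking that discriminates between features.
chi2_values, p_values = chi2(X, y)

# Rank features by chi-square score in descending order
feature_names = vectorizer.get_feature_names_out()
chi2_df = pd.DataFrame({'Feature': feature_names, 'Chi2': chi2_values})
chi2_df = chi2_df.sort_values(by='Chi2', ascending=False).reset_index(drop=True)

print(chi2_df)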


   Feature      Chi2
0  account  0.447214
1      buy  0.447214
2     help  0.447214
3   laptop  0.447214
4       my  0.447214
5     need  0.447214
6      new  0.447214
7       to  0.447214
8     want  0.447214
9     with  0.447214
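
Every feature receives the same score above because each term occurs in exactly one of the two single-document classes with the same TF-IDF weight. The sketch below re-implements the chi-square statistic in plain Python, following the observed-versus-expected formulation that `sklearn.feature_selection.chi2` uses as far as I understand it, and reproduces that tied value. It then applies the same function to the five sample texts from Question 2 with hypothetical labels to show a ranking that does separate features. The labels and the helper name `chi2_scores` are assumptions made for illustration only.

```python
import math
import re


def chi2_scores(rows, labels):
    """Chi-square score per feature.

    rows   : list of dicts mapping feature name -> non-negative value
    labels : list of class labels, one per row
    """
    classes = sorted(set(labels))
    class_prob = {c: labels.count(c) / len(labels) for c in classes}
    features = sorted({f for row in rows for f in row})
    scores = {}
    for f in features:
        total = sum(row.get(f, 0.0) for row in rows)
        score = 0.0
        for c in classes:
            observed = sum(row.get(f, 0.0) for row, y in zip(rows, labels) if y == c)
            expected = class_prob[c] * total
            score += (observed - expected) ** 2 / expected
        scores[f] = score
    return scores


# Case 1: the two-document setup. Each text has five terms of weight 1/sqrt(5).
w = 1 / math.sqrt(5)
two_docs = [
    {t: w for t in ["want", "to", "buy", "new", "laptop"]},
    {t: w for t in ["need", "help", "with", "my", "account"]},
]
two_doc_scores = chi2_scores(two_docs, ["Purchase Intent", "Support Request"])
print(round(two_doc_scores["buy"], 6))

# Case 2: five texts with HYPOTHETICAL labels and raw term counts.
texts = [
    "I want to buy a new laptop.",
    "I need help with my account.",
    "The product is great, I love it!",
    "I'm having issues with the billing.",
    "Can you tell me how to return the item?",
]
hypothetical_labels = ["other", "support", "other", "support", "support"]

count_rows = []
for text in texts:
    counts = {}
    for tok in re.findall(r"\b\w\w+\b", text.lower()):
        counts[tok] = counts.get(tok, 0) + 1
    count_rows.append(counts)

ranked = sorted(chi2_scores(count_rows, hypothetical_labels).items(),
                key=lambda kv: kv[1], reverse=True)
for term, score in ranked[:5]:
    print(f"{term}: {score:.4f}")
```

With a sample this small the statistic tends to favor terms that appear once in the smaller class, so the second ranking is illustrative rather than a reliable feature selection.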


## Question 4 (10 points):
Write Python code to rank the texts by similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [None]:
!pip install transformers torch scikit-learn
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

sample_texts = [
    "I want to buy a new laptop.",
    "I need help with my account.",
    "The product is great, I love it!",
    "I'm having issues with the billing.",
    "Can you tell me how to return the item?"
]

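def get_bert_embedding(text, model, tokenizer):
    """Return a (1, hidden_size) sentence embedding by mean-pooling BERT's last hidden states."""
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    # Average over the token dimension (includes the [CLS] and [SEP] tokens)
    embeddings = outputs.last_hidden_state.mean(dim=1)
    return embeddings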

text_embeddings = []
for text in sample_texts:
    text_embedding = get_bert_embedding(text, model, tokenizer)
    text_embeddings.append(text_embedding)

text_embeddings = torch.cat(text_embeddings)

query = "Can you please help me return this?."

query_embedding = get_bert_embedding(query, model, tokenizer)

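# Cosine similarity between the query and every text (tensors converted to NumPy for scikit-learn)
cosine_similarities = cosine_similarity(query_embedding.numpy(), text_embeddings.numpy())[0]

# Indices of the texts sorted from most to least similar
ranked_indices = np.argsort(-cosine_similarities)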

print(f"Query: {query}\n")
print("Ranked Texts based on Similarity:")
for idx in ranked_indices:
    print(f"Text: {sample_texts[idx]} -- Similarity Score: {cosine_similarities[idx]:.4f}")

Query: Can you please help me return this?.

Ranked Texts based on Similarity:
Text: Can you tell me how to return the item? -- Similarity Score: 0.8308
Text: I need help with my account. -- Similarity Score: 0.6154
Text: I want to buy a new laptop. -- Similarity Score: 0.5814
Text: I'm having issues with the billing. -- Similarity Score: 0.5672
Text: The product is great, I love it! -- Similarity Score: 0.5554
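
The ranking step relies only on cosine similarity, the dot product of two vectors divided by the product of their lengths. A minimal sketch with made-up 3-dimensional vectors (real `bert-base-uncased` embeddings have 768 dimensions) shows the computation and the descending sort. The vectors and names below are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy stand-ins for sentence embeddings (NOT real BERT outputs).
toy_embeddings = {
    "return question": [0.9, 0.1, 0.2],
    "account help": [0.4, 0.8, 0.1],
    "product praise": [0.1, 0.2, 0.9],
}
toy_query = [1.0, 0.2, 0.1]

# Score every candidate against the query and sort from most to least similar.
ranking = sorted(
    ((name, cosine(toy_query, vec)) for name, vec in toy_embeddings.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

Because cosine similarity ignores vector length, two texts score highly whenever their embeddings point in a similar direction, regardless of how long the texts are.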


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
Learning experience:
Working on extracting features from text data has been a valuable experience, particularly in understanding how raw text can be transformed into meaningful numerical representations for various NLP tasks.
The concepts I found most beneficial were Bag of Words (BoW) and TF-IDF, N-grams, sentiment analysis, and BERT embeddings.

Challenges encountered:
Implementing feature extraction with multiple techniques (BoW, TF-IDF, sentiment analysis, embeddings, etc.) on the same text dataset required careful planning and execution. It took time and care to make sure the different approaches fit together in a single pipeline.

Relevance to the field of study:
This exercise is directly relevant to Natural Language Processing (NLP), especially for problems involving text similarity, sentiment analysis, and intent classification. Transforming raw text into numerical features is a fundamental step in almost all NLP tasks.

'''