# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# An interesting text classification task is sentiment analysis of social media comments. The objective is to decide whether a given comment expresses a positive, negative, or neutral attitude. This is useful for companies and organizations that want to know how the public feels about their products, services, or events.

# The following five types of features could be useful for building a machine learning model for sentiment analysis:

# Word frequency features:
# Description: How often sentiment-bearing words appear in a comment gives a sense of its overall tone.
# Example feature: Counts of positive and negative words in the comment.
# Explanation: "Love," "great," and "amazing" signal positive sentiment, whereas "hate," "disappointing," and "terrible" signal negative sentiment.

# N-grams:
# Description: Sequences of adjacent words (n-grams) capture local context.
# Example feature: Bigrams or trigrams of words that frequently occur together.
# Explanation: Phrases like "not good" or "highly recommend" carry meaning that a single word would miss.

# Emoticons and emojis:
# Description: Emoticons and emojis can express how the writer feels.
# Example feature: The type and number of emoticons or emojis used in the comment.
# Explanation: 😊 could signal a positive feeling, while 😡 could signal a negative feeling.

# Sentiment lexicons:
# Description: Predefined lists of words with associated sentiment scores.
# Example feature: Sum of the sentiment scores of the words in the comment, calculated using a sentiment lexicon.
# Explanation: Scoring individual words captures how strongly someone feels, and the scores can be aggregated into the comment's overall sentiment.

# Part-of-speech (POS) tags:
# Description: The grammatical category of each word in a comment.
# Example feature: The number of adjectives or adverbs used in the comment.
# Explanation: Adjectives and adverbs often carry sentiment. For example, the adjectives distinguish "amazing product" from "slow service".

# Combining these features lets a machine learning model capture different aspects of the text, which helps it classify the sentiment of comments left on social media sites. Accurate sentiment classification also requires preprocessing the text data, careful feature engineering, and choosing a suitable algorithm.
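
The point that a bigram such as "not good" carries meaning a single word misses can be illustrated with a small sketch. It uses only the standard library; the word lists, the `NEGATORS` set, and the function names are hypothetical and chosen for illustration, not taken from the exercise.

```python
import re

# Hypothetical mini word lists, for illustration only
POSITIVE = {"good", "great", "love", "amazing"}
NEGATIVE = {"bad", "terrible", "hate", "disappointing"}
NEGATORS = {"not", "never", "no"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (apostrophes allowed)."""
    return re.findall(r"[a-z']+", text.lower())

def bigrams(tokens):
    """Return all pairs of adjacent tokens."""
    return list(zip(tokens, tokens[1:]))

def polarity_counts(text):
    """Count positive/negative cue words, flipping a cue that directly follows a negator."""
    tokens = tokenize(text)
    # Positions of tokens that are the second element of a (negator, word) bigram
    negated = {i + 1 for i, (first, _) in enumerate(bigrams(tokens)) if first in NEGATORS}
    counts = {"positive": 0, "negative": 0}
    for i, tok in enumerate(tokens):
        if tok in POSITIVE:
            label = "positive"
        elif tok in NEGATIVE:
            label = "negative"
        else:
            continue
        if i in negated:
            label = "negative" if label == "positive" else "positive"
        counts[label] += 1
    return counts

print(polarity_counts("The food was not good"))
print(polarity_counts("I love it"))
```

A unigram count would treat "good" in the first sentence as a positive cue; the bigram check flips it. The sketch only looks one word back, so a phrase like "not a bad experience" would still be missed and would need a longer window or trigrams.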

## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [12]:
!pip install nltk emoji textblob

import re
from nltk.tokenize import word_tokenize
from nltk import FreqDist
from nltk.util import ngrams
import emoji
from textblob import TextBlob
from nltk import pos_tag
import nltk

# Download necessary NLTK resources
# (newer NLTK releases may instead require 'punkt_tab' and 'averaged_perceptron_tagger_eng')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Sample text data
sample_comments = [
    "I love this product! It's amazing.",
    "The service was terrible. I won't recommend it.",
    "Not a bad experience, but could be better.",
    "😊 Great event! Enjoyed every moment.",
    "Disappointed with the quality. 😡",
]

def analyze_word_frequencies(input_text):
    words = word_tokenize(input_text.lower())
    positive_words = ["love", "great", "awesome"]
    negative_words = ["hate", "disappointing", "terrible"]

    positive_count = sum(1 for word in words if word in positive_words)
    negative_count = sum(1 for word in words if word in negative_words)

    return positive_count, negative_count

def generate_n_grams(input_text, n=2):
    words = word_tokenize(input_text.lower())
    n_grams_result = list(ngrams(words, n))
    return n_grams_result

def identify_emojis_modified(input_text):
    text_with_emojis = emoji.demojize(input_text)
    emojis = re.findall(r':[a-z_]+:', text_with_emojis)
    return emojis

def analyze_sentiment(input_text):
    blob = TextBlob(input_text)
    sentiment_score = blob.sentiment.polarity
    return sentiment_score

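# Counts tokens whose Penn Treebank tag is exactly one of pos_tags (JJ = adjective, RB = adverb);
# comparative and superlative tags such as JJR or RBS are not matched.
def count_pos_tags_modified(input_text, pos_tags=("JJ", "RB")):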
    words = word_tokenize(input_text)
    pos_tags_count = sum(1 for _, tag in pos_tag(words) if tag in pos_tags)
    return pos_tags_count

# Feature extraction for each comment
for sample_comment in sample_comments:
    print(f"\nComment: {sample_comment}")

    positive_count, negative_count = analyze_word_frequencies(sample_comment)
    print(f"Word Frequency Features - Positive Count: {positive_count}, Negative Count: {negative_count}")

    n_grams_result = generate_n_grams(sample_comment, n=2)
    print(f"N-grams Features: {n_grams_result}")

    emojis_result = identify_emojis_modified(sample_comment)
    print(f"Emoji Features: {emojis_result}")

    sentiment_score_result = analyze_sentiment(sample_comment)
    print(f"Sentiment Lexicon Features: {sentiment_score_result}")

    pos_tags_count_result = count_pos_tags_modified(sample_comment, pos_tags=["JJ", "RB"])
    print(f"POS Tag Features - Adjectives and Adverbs Count: {pos_tags_count_result}")


Comment: I love this product! It's amazing.
Word Frequency Features - Positive Count: 1, Negative Count: 0
N-grams Features: [('i', 'love'), ('love', 'this'), ('this', 'product'), ('product', '!'), ('!', 'it'), ('it', "'s"), ("'s", 'amazing'), ('amazing', '.')]
Emoji Features: []
Sentiment Lexicon Features: 0.6125
POS Tag Features - Adjectives and Adverbs Count: 1

Comment: The service was terrible. I won't recommend it.
Word Frequency Features - Positive Count: 0, Negative Count: 1
N-grams Features: [('the', 'service'), ('service', 'was'), ('was', 'terrible'), ('terrible', '.'), ('.', 'i'), ('i', 'wo'), ('wo', "n't"), ("n't", 'recommend'), ('recommend', 'it'), ('it', '.')]
Emoji Features: []
Sentiment Lexicon Features: -1.0
POS Tag Features - Adjectives and Adverbs Count: 2

Comment: Not a bad experience, but could be better.
Word Frequency Features - Positive Count: 0, Negative Count: 0
N-grams Features: [('not', 'a'), ('a', 'bad'), ('bad', 'experience'), ('experience', ','), (',', 

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
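
The answer to Question 1 describes the lexicon feature as a sum of per-word sentiment scores, while the cell above relies on TextBlob's polarity value. A minimal sketch of the summed-score variant follows, assuming a tiny hand-made lexicon; the `TOY_LEXICON` values are invented for illustration and do not come from any published resource.

```python
import re

# Hypothetical sentiment lexicon: word -> score in [-1, 1] (values are made up)
TOY_LEXICON = {
    "love": 0.8, "amazing": 0.9, "great": 0.7,
    "terrible": -0.9, "disappointed": -0.7, "bad": -0.6,
}

def lexicon_score(text, lexicon=TOY_LEXICON):
    """Sum the lexicon scores of all tokens; words missing from the lexicon count as 0."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(tok, 0.0) for tok in tokens)

for comment in ["I love this product! It's amazing.", "Disappointed with the quality. 😡"]:
    print(f"{comment!r}: {lexicon_score(comment):.2f}")
```

A real lexicon would cover far more vocabulary, and the sum grows with comment length, so dividing by the number of matched words is a common alternative when comments vary a lot in size.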


## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above, rank the features based on their importance in the descending order.

In [13]:
from sklearn.feature_selection import SelectKBest, chi2
import pandas as pd

# Sample features and labels
input_features = [
    [1, 0, [('i', 'love'), ('love', 'this')], 2, 0.6125, 3],
    [0, 1, [('service', 'terrible'), ('recommend', 'it.')], 1, -0.5, 2],
    [0, 0, [('bad', 'experience'), ('could', 'better.')], 0, 0.25, 1],
    [1, 0, [('great', 'event'), ('enjoyed', 'moment.')], 1, 0.75, 2],
    [0, 1, [('disappointed', 'quality.'), ('angry', 'face.')], 1, -0.8, 1],
]

labels = ['Positive', 'Negative', 'Neutral', 'Positive', 'Negative']

# Create a DataFrame for the features
original_df = pd.DataFrame(input_features, columns=['Positive_Count', 'Negative_Count', 'N-grams', 'Emoji_Count', 'Sentiment_Lexicon', 'Pos_Tags_Count'])

# Extract labels
original_y = [1 if label == 'Positive' else 0 for label in labels]

# Preprocess 'N-grams' feature
original_df['N-grams'] = original_df['N-grams'].apply(lambda x: len(x))  # For simplicity, using the count of n-grams as a feature

# Make all features non-negative (chi2 rejects negative values) by shifting
# every column by the global minimum of the table
normalized_df = original_df - original_df.min().min()

# Select the top k features
k_value = 3  # You can adjust this value based on your requirements
feature_selector = SelectKBest(chi2, k=k_value)
selected_features_array = feature_selector.fit_transform(normalized_df, original_y)

# Get the indices of the selected features
selected_feature_indices = feature_selector.get_support(indices=True)

# Display the selected features and their scores
selected_feature_names = normalized_df.columns[selected_feature_indices]
feature_scores_result = feature_selector.scores_[selected_feature_indices]
sorted_selected_features = sorted(zip(selected_feature_names, feature_scores_result), key=lambda x: x[1], reverse=True)

print(f"Selected Features (Top {k_value}):")
for feature_name, score_value in sorted_selected_features:
    print(f"{feature_name}: {score_value}")

Selected Features (Top 3):
Sentiment_Lexicon: 1.4796195652173916
Positive_Count: 0.9999999999999992
Pos_Tags_Count: 0.628205128205128
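
To make the scores above easier to interpret, here is a pure-Python sketch of the chi-square statistic as I understand scikit-learn's `chi2` to compute it: per class, the observed sum of a feature is compared with the sum expected if the feature were independent of the label. The helper name `chi2_score` and the `SHIFT` constant are assumptions for illustration; the data are the sample features from the cell above, with the N-grams column already reduced to its count. The sketch ranks all six features rather than only the top three.

```python
def chi2_score(values, labels):
    """Chi-square statistic of one non-negative feature against class labels."""
    total = sum(values)
    n = len(labels)
    score = 0.0
    for cls in set(labels):
        # Observed: sum of the feature over samples of this class
        observed = sum(v for v, y in zip(values, labels) if y == cls)
        # Expected under independence: class proportion times the feature total
        expected = total * labels.count(cls) / n
        if expected > 0:
            score += (observed - expected) ** 2 / expected
    return score

SHIFT = 0.8  # same global shift as above: the smallest value in the table is -0.8
columns = {
    "Positive_Count": [1, 0, 0, 1, 0],
    "Negative_Count": [0, 1, 0, 0, 1],
    "N-grams": [2, 2, 2, 2, 2],
    "Emoji_Count": [2, 1, 0, 1, 1],
    "Sentiment_Lexicon": [0.6125, -0.5, 0.25, 0.75, -0.8],
    "Pos_Tags_Count": [3, 2, 1, 2, 1],
}
labels = [1, 0, 0, 1, 0]  # 1 = Positive, 0 = anything else

ranking = sorted(
    ((name, chi2_score([v + SHIFT for v in vals], labels)) for name, vals in columns.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

Chi-square is designed for counts or frequencies, so shifting a signed score such as the sentiment polarity is a workaround, and the resulting scores depend on the size of the shift. With only five samples the ranking is illustrative rather than statistically meaningful.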


## Question 4 (10 points):
Write Python code to rank the text based on text similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the results by similarity in descending order.

In [14]:
!pip install transformers scikit-learn

from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
import torch
import pandas as pd

# Sample text data
text_data_samples = [
    "I love this product! It's amazing.",
    "The service was terrible. I won't recommend it.",
    "Not a bad experience, but could be better.",
    "😊 Great event! Enjoyed every moment.",
    "Disappointed with the quality. 😡",
]

# Query
search_query = "Looking for a great product with excellent service."

# Load pre-trained BERT model and tokenizer
tokenizer_bert = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize and encode the query
query_tokens_encoded = tokenizer_bert(search_query, return_tensors='pt', padding=True, truncation=True)
# Inference only, so run the model without tracking gradients
with torch.no_grad():
    query_output_embeddings = bert_model(**query_tokens_encoded)

# Mean-pool the token embeddings into one vector for the query
query_embedding_result = query_output_embeddings.last_hidden_state.mean(dim=1).numpy()

# Tokenize, encode, and mean-pool each document in the dataset
document_embeddings_result = []
for document_text in text_data_samples:
    document_tokens_encoded = tokenizer_bert(document_text, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        document_output_embeddings = bert_model(**document_tokens_encoded)
    document_embedding_result = document_output_embeddings.last_hidden_state.mean(dim=1).numpy()
    document_embeddings_result.append(document_embedding_result)

# Calculate cosine similarity between the query and each document
similarity_scores_result = [cosine_similarity(query_embedding_result, doc_embedding)[0][0] for doc_embedding in document_embeddings_result]

# Create a DataFrame to display results
result_df_updated = pd.DataFrame({'Document': text_data_samples, 'Similarity': similarity_scores_result})

# Sort documents based on similarity in descending order
result_df_updated = result_df_updated.sort_values(by='Similarity', ascending=False)

# Display the ranked documents
print("Ranked Documents based on Similarity:")
print(result_df_updated)




Ranked Documents based on Similarity:
                                          Document  Similarity
0               I love this product! It's amazing.    0.674666
4                 Disappointed with the quality. 😡    0.614161
3             😊 Great event! Enjoyed every moment.    0.595608
2       Not a bad experience, but could be better.    0.579785
1  The service was terrible. I won't recommend it.    0.574029
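
The ranking step itself, cosine similarity followed by a descending sort, can be sketched without BERT. The three-dimensional vectors and the names `doc_a`, `doc_b`, and `doc_c` below are hypothetical stand-ins for the 768-dimensional mean-pooled embeddings used above.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for query and document embeddings
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_a": [1.0, 0.0, 1.0],  # same direction as the query
    "doc_b": [1.0, 1.0, 0.0],  # partially aligned
    "doc_c": [0.0, 1.0, 0.0],  # orthogonal to the query
}

# Score every document against the query and sort from most to least similar
ranked = sorted(
    ((name, cosine(query_vec, vec)) for name, vec in doc_vecs.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

In the BERT output above, all five scores fall in a narrow band (about 0.57 to 0.67) and a negative comment ranks second for a query about a great product, so the ordering should be read with some caution.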


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Learning Experience: Extracting features from text data was a valuable way to learn how textual data is represented for machine learning tasks. Tokenization, feature extraction, and external libraries such as NLTK and scikit-learn helped me become familiar with preprocessing text and transforming it into a format that machine learning models can handle. The exercise also made clear how important it is to choose the right features and to understand how they affect a model's performance.

# Challenges Encountered: One difficulty was getting different libraries to work together and meeting the specific requirements of each one. For example, working with the NLTK punkt resource and adapting the character feature extraction after a library change showed how important it is to know a library well. It was also hard to extract features from text containing emojis, special characters, and different languages.

# Relevance to Your Field of Study: The exercise relates directly to Natural Language Processing (NLP), which involves analyzing text data with machine learning. Feature extraction is essential to NLP because it determines how text is represented for models. Word frequencies, n-grams, sentiment scores, and part-of-speech tags are used in NLP for sentiment analysis, text classification, and information retrieval. Using BERT to rank text similarity shows how advanced NLP methods apply in real-world settings. Overall, the activity teaches useful lessons about preprocessing and feature engineering for NLP tasks.


