## The third In-class-exercise (due on 11:59 PM 10/08/2023, 40 points in total)

The purpose of this exercise is to understand text representation.

Question 1 (10 points): Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# Your answer here (no code for this question, write down your answer as detail as possible for the above questions):

'''
Please write you answer here:
Task:
Sentiment analysis of customer reviews for a product or service. Sentiment analysis aims to determine whether a given text expresses a positive, negative, or neutral sentiment.
This task is valuable for businesses to understand customer opinions, identify areas for improvement, and make data-driven decisions

5 features that can be useful for building a machine learning model for sentiment analysis:

- Word Frequency: Counting the frequency of specific words or phrases in the text can be informative. Positive or negative sentiment often correlates with certain keywords. For example, words like "excellent," "amazing," or "terrible" are strong indicators of sentiment.

- N-grams: N-grams are sequences of adjacent words in a text. Analyzing bi-grams (two-word combinations) or tri-grams (three-word combinations) can capture context and sentiment nuances. For instance, "not good" is different from "very good."

- Sentiment Lexicons: Sentiment lexicons or dictionaries contain lists of words with associated sentiment scores (positive, negative, neutral). By matching words in the text to this lexicon, you can calculate an overall sentiment score for the document. Lexicons can help handle sarcasm or negations, where the sentiment may be reversed.

- Part-of-Speech (POS) Tags: Understanding the grammatical structure of the text can be beneficial. For example, identifying adjectives and adverbs in a sentence can provide insights into the intensity of sentiment. Adjectives like "great" or adverbs like "extremely" can impact the sentiment score.

- Emoticons and Emoji: Emoticons and emojis are increasingly used to convey sentiment in text. Detecting and analyzing these symbols can provide valuable information about sentiment, especially in social media data. For example, 😊 might indicate a positive sentiment, while 😡 suggests a negative sentiment.

Question 2 (10 points): Write python code to extract these features you discussed above. You can collect a few sample text data for the feature extraction.

In [5]:
import nltk
from nltk.util import ngrams
from nltk.sentiment import SentimentIntensityAnalyzer
from textblob import TextBlob

# Sample text data
sample_texts = [
    "The movie was excellent. I really enjoyed it!",
    "The weather today is terrible. I hate the rain.",
    "I had a great experience at the restaurant. The food was amazing!",
    "The customer service was awful. I will never go back there.",
    "I'm so excited for the upcoming vacation! 🌴😊"
]

# Loop through each sample text
for idx, text in enumerate(sample_texts, start=1):
    print(f"Sample {idx}:")
    print("Text:", text)

    # Tokenize the text
    words = nltk.word_tokenize(text)

    # 1. Word Frequency
    word_frequency = nltk.FreqDist(words)
    print("\nWord Frequency:")
    print(word_frequency)

    # 2. N-grams (bi-grams and tri-grams)
    bi_grams = list(ngrams(words, 2))
    tri_grams = list(ngrams(words, 3))
    print("\nBi-grams:")
    print(bi_grams)
    print("\nTri-grams:")
    print(tri_grams)

    # 3. Sentiment Lexicons (using TextBlob)
    tb = TextBlob(text)
    sentiment_lexicon_score = tb.sentiment.polarity
    print("\nSentiment Lexicon Score:", sentiment_lexicon_score)

    # 4. Part-of-Speech (POS) Tags
    pos_tags = nltk.pos_tag(words)
    print("\nPOS Tags:")
    print(pos_tags)

    # 5. Emoticons and Emoji (using TextBlob)
    emojis = [char for char in text if char in "😊😡"]
    print("\nEmoticons/Emojis:")
    print(emojis)

    print("\n-------------------\n")


Sample 1:
Text: The movie was excellent. I really enjoyed it!

Word Frequency:
<FreqDist with 10 samples and 10 outcomes>

Bi-grams:
[('The', 'movie'), ('movie', 'was'), ('was', 'excellent'), ('excellent', '.'), ('.', 'I'), ('I', 'really'), ('really', 'enjoyed'), ('enjoyed', 'it'), ('it', '!')]

Tri-grams:
[('The', 'movie', 'was'), ('movie', 'was', 'excellent'), ('was', 'excellent', '.'), ('excellent', '.', 'I'), ('.', 'I', 'really'), ('I', 'really', 'enjoyed'), ('really', 'enjoyed', 'it'), ('enjoyed', 'it', '!')]

Sentiment Lexicon Score: 0.8125

POS Tags:
[('The', 'DT'), ('movie', 'NN'), ('was', 'VBD'), ('excellent', 'JJ'), ('.', '.'), ('I', 'PRP'), ('really', 'RB'), ('enjoyed', 'VBD'), ('it', 'PRP'), ('!', '.')]

Emoticons/Emojis:
[]

-------------------

Sample 2:
Text: The weather today is terrible. I hate the rain.

Word Frequency:
<FreqDist with 10 samples and 11 outcomes>

Bi-grams:
[('The', 'weather'), ('weather', 'today'), ('today', 'is'), ('is', 'terrible'), ('terrible', '.'

Question 3 (10 points): Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above, rank the features based on their importance in the descending order.

In [6]:
import nltk
from nltk.util import ngrams
from nltk.sentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

sample_texts = [
    "The movie was excellent. I really enjoyed it!",
    "The weather today is terrible. I hate the rain.",
    "I had a great experience at the restaurant. The food was amazing!",
    "The customer service was awful. I will never go back there.",
    "I'm so excited for the upcoming vacation! 🌴😊"
]

labels = [1, 0, 1, 0, 1]

# Feature extraction
features = []

for text in sample_texts:
    words = nltk.word_tokenize(text)
    bi_grams = list(ngrams(words, 2))
    tri_grams = list(ngrams(words, 3))
    tb = TextBlob(text)
    sentiment_lexicon_score = tb.sentiment.polarity
    pos_tags = nltk.pos_tag(words)
    emojis = [char for char in text if char in "😊😡"]

    # Combine all features into a single feature vector
    feature_vector = [
        len(words),  # Word Frequency
        len(bi_grams) + len(tri_grams),  # N-grams
        sentiment_lexicon_score,  # Sentiment Lexicons
        len([tag for word, tag in pos_tags]),  # Part-of-Speech Tags
        len(emojis)  # Emoticons/Emojis
    ]

    features.append(feature_vector)

# Train a random forest classifier to rank the features
classifier = RandomForestClassifier(random_state=42)
classifier.fit(features, labels)

# Get feature importance scores from the classifier
feature_importances = classifier.feature_importances_

# Create a list of feature names
feature_names = [
    'Word Frequency',
    'N-grams',
    'Sentiment Lexicons',
    'Part-of-Speech Tags',
    'Emoticons/Emojis'
]

# Combine feature names and their importance scores
feature_ranking = list(zip(feature_names, feature_importances))

# Sort the features by their importance scores in descending order
feature_ranking.sort(key=lambda x: x[1], reverse=True)

# Print the top 5 ranked features
print("Top 5 Ranked Features:")
for feature, importance in feature_ranking[:5]:
    print(f"{feature}: {importance:.4f}")


Top 5 Ranked Features:
Sentiment Lexicons: 0.3753
Part-of-Speech Tags: 0.2448
Word Frequency: 0.1792
N-grams: 0.1738
Emoticons/Emojis: 0.0269


Question 4 (10 points): Write python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant docments. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similary with descending order.

In [11]:
pip install transformers

Collecting transformers
  Downloading transformers-4.34.0-py3-none-any.whl (7.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.7/7.7 MB[0m [31m27.6 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m295.0/295.0 kB[0m [31m31.6 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.8/3.8 MB[0m [31m66.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m68.2 MB/s[0m eta [36m0:00:00[0m
Insta

In [20]:
import torch
import numpy as np
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity

# Sample text data
sample_texts = [
    "The movie was excellent. I really enjoyed it!",
    "The weather today is terrible. I hate the rain.",
    "I had a great experience at the restaurant. The food was amazing!",
    "The customer service was awful. I will never go back there.",
    "I'm so excited for the upcoming vacation! 🌴😃"
]

# Define your query
query = "I loved the movie; it was fantastic!"

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize the query and text data
query_tokens = tokenizer(query, padding=True, truncation=True, return_tensors="pt")
text_tokens = tokenizer(sample_texts, padding=True, truncation=True, return_tensors="pt", max_length=512)  # Adjust max_length as needed

# Encode the query and text data with BERT
query_output = model(**query_tokens)
text_output = model(**text_tokens)

# Calculate cosine similarity between the query and each text
query_embedding = query_output.last_hidden_state.mean(dim=1).detach().numpy()  # Use mean pooling for the query
text_embeddings = text_output.last_hidden_state.mean(dim=1).detach().numpy()  # Use mean pooling for the text data

similarities = cosine_similarity(query_embedding, text_embeddings)

# Rank the text data based on similarity scores in descending order
ranked_texts = sorted(enumerate(sample_texts), key=lambda x: similarities[0][x[0]], reverse=True)

# Print the ranked text data and their similarity scores
print("Ranked Texts Based on Similarity:")
for idx, (similarity, text) in enumerate(ranked_texts):
    print(f"Rank {idx + 1}: Similarity Score = {similarity:.4f}")
    print(text)
    print()


Ranked Texts Based on Similarity:
Rank 1: Similarity Score = 0.0000
The movie was excellent. I really enjoyed it!

Rank 2: Similarity Score = 2.0000
I had a great experience at the restaurant. The food was amazing!

Rank 3: Similarity Score = 4.0000
I'm so excited for the upcoming vacation! 🌴😃

Rank 4: Similarity Score = 3.0000
The customer service was awful. I will never go back there.

Rank 5: Similarity Score = 1.0000
The weather today is terrible. I hate the rain.

