# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid creating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:
'''

### Movie Review Sentiment Analysis
1. **Word Frequency:** This is about counting words in movie reviews. Words like "amazing" in good reviews and "boring" in bad ones help the computer guess if a review is positive or negative.

2. **Sentence Length:** Short sentences might mean a bad review, and long sentences might mean a good one. It's like seeing if the reviewer wanted to say a lot or just a little about the movie.

3. **Emotional Tone:** This finds feeling words in reviews. Words that show happiness or anger tell the computer how the person feels about the movie.

4. **Negation Words:** Words like "not" change the meaning. "Not bad" can mean something is good. The computer learns these tricks to better understand the review.

5. **Punctuation Usage:** How people use punctuation, like "!" or "...", can show if they're really excited or not happy. It's a clue for the computer about the review's mood.

6. **Use of Capitalization:** Sometimes, when people write reviews, they use ALL CAPS to show they feel strongly about something. For example, "LOVED IT" might mean they really enjoyed the movie. This can help the computer see when someone feels very positive or negative about a movie.

7. **Thematic Words:** Certain words are directly related to movies, like "plot," "characters," or "cinematography." If a review says positive things about these aspects, it's likely a good review. For example, "The plot was captivating" suggests a positive sentiment. This helps the computer understand specific parts of the movie that people liked or didn't like.
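The negation feature (item 4) can be sketched as a simple polarity flip. This is a minimal illustration, not a full sentiment model: the lexicons `POSITIVE`, `NEGATIVE`, and `NEGATIONS` and the function `polarity_with_negation` are hypothetical names introduced here, and the rule only looks at the word immediately before a sentiment word.

```python
import re

# Hypothetical mini-lexicons, chosen only for illustration
POSITIVE = {"good", "great", "amazing"}
NEGATIVE = {"bad", "boring", "terrible"}
NEGATIONS = {"not", "never", "no"}

def polarity_with_negation(text):
    """Sum word polarities, flipping a sentiment word that directly follows a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, tok in enumerate(tokens):
        # +1 for a positive word, -1 for a negative word, 0 otherwise
        value = 1 if tok in POSITIVE else -1 if tok in NEGATIVE else 0
        # Flip the sign when the previous token is a negation ("not bad" -> positive)
        if value and i > 0 and tokens[i - 1] in NEGATIONS:
            value = -value
        score += value
    return score

print(polarity_with_negation("Not bad overall"))         # 1
print(polarity_with_negation("The plot was not great"))  # -1
```

A one-word window misses cases such as "not very good"; a wider window or a dependency parse would be needed for those.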

## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [None]:
# Your code here (Please add comments in the code):
# Sample reviews
reviews = [
    "LOVED this movie! The plot was AMAZING and the characters were fantastic!",
    "This was the WORST movie ever. Boring plot and terrible acting.",
    "Stunning cinematography, but the plot was not great. Not bad overall."
]

# Words indicating positive or negative sentiment
positive_words = ['amazing', 'fantastic', 'loved', 'stunning', 'captivating']
negative_words = ['worst', 'boring', 'terrible']

# Thematic words related to movies
thematic_words = ['plot', 'characters', 'cinematography']

import re

def extract_features(text):
    # Initialize counters
    word_freq = {'positive': 0, 'negative': 0, 'thematic': 0}
    negation_count = 0
    capitalization_count = 0

    # Split into sentences on '.', '!' or '?' and drop empty pieces,
    # so a trailing period does not create an extra empty sentence
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    # Tokenize into words so attached punctuation ("fantastic!") does not block lexicon matches
    words = re.findall(r"[A-Za-z']+", text)

    # Sentence length feature: average number of words per sentence
    avg_sentence_length = len(words) / len(sentences) if sentences else 0.0

    # Process each word
    for word in words:
        lower = word.lower()

        # Check for positive/negative words
        if lower in positive_words:
            word_freq['positive'] += 1
        elif lower in negative_words:
            word_freq['negative'] += 1

        # Check for thematic words
        if lower in thematic_words:
            word_freq['thematic'] += 1

        # Check for negation: match whole tokens only, so "nothing" or "note" is not counted
        if lower == 'not' or lower.endswith("n't"):
            negation_count += 1

        # Check for capitalization (ALL CAPS words longer than one letter)
        if word.isupper() and len(word) > 1:
            capitalization_count += 1

    # Count punctuation usage
    punctuation_count = text.count('!') + text.count('...')

    # Compile features
    features = {
        'avg_sentence_length': avg_sentence_length,
        'word_freq': word_freq,
        'negation_count': negation_count,
        'capitalization_count': capitalization_count,
        'punctuation_count': punctuation_count,
    }

    return features

# Extract and print features for each review
for review in reviews:
    features = extract_features(review)
    print(features)


{'avg_sentence_length': 6.0, 'word_freq': {'positive': 3, 'negative': 0, 'thematic': 2}, 'negation_count': 0, 'capitalization_count': 2, 'punctuation_count': 2}
{'avg_sentence_length': 5.5, 'word_freq': {'positive': 0, 'negative': 3, 'thematic': 1}, 'negation_count': 0, 'capitalization_count': 1, 'punctuation_count': 0}
{'avg_sentence_length': 5.5, 'word_freq': {'positive': 1, 'negative': 0, 'thematic': 2}, 'negation_count': 2, 'capitalization_count': 0, 'punctuation_count': 0}


## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above, and rank them by importance in descending order.

In [None]:
# Your code here (Please add comments in the code):



1. **Emotional Words**: Words that clearly show feelings, like "love" for good reviews or "hate" for bad ones, tell us a lot about whether someone enjoyed the movie. They're very important because they directly express the reviewer's sentiment.

2. **Negation Words**: Words like "not" can change the whole meaning of a sentence. For example, "not good" actually means something is bad. These are crucial for understanding the real sentiment behind a sentence.

3. **ALL CAPS Words**: When people write in ALL CAPS, it often means they're feeling very strongly about what they're saying. For instance, saying "I LOVED THIS MOVIE" likely means they really enjoyed it. This is a strong indicator of sentiment.

4. **Punctuation Marks**: The use of exclamation marks (!) or ellipses (...) can show how strongly someone feels about their opinion. Lots of exclamation marks might mean excitement or anger, depending on the context.

5. **Thematic Words**: Words specific to movies, like "plot," "characters," or "cinematography," are important because they show what aspect of the movie the review is focusing on. While not directly about sentiment, they help us understand what the reviewer is commenting on.

6. **Sentence Length**: Longer sentences might be used to describe positive experiences in detail, while shorter sentences could indicate negative feedback. However, this isn't always the case, making sentence length a bit less directly tied to sentiment than other features.

7. **General Word Frequency**: This is looking at which words show up a lot, but without focusing on whether they're positive or negative. It's the least direct way to understand sentiment because it's just about word popularity, not meaning.
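The ranking above is based on reasoning rather than a computed score. As a rough sketch of how a score-based ranking could be produced, the block below applies the chi-squared test from scikit-learn, a filter method commonly covered in surveys of feature selection for text classification. Everything in the data is an assumption: the feature matrix `X` holds illustrative values in the shape a feature extractor might produce, and the labels in `y` (1 = positive, 0 = negative) are hypothetical. With so few samples the scores are not meaningful; the point is only the mechanics.

```python
import numpy as np
from sklearn.feature_selection import chi2

feature_names = [
    'positive_words', 'negative_words', 'thematic_words',
    'negation_count', 'capitalization_count', 'punctuation_count',
    'avg_sentence_length',
]

# Illustrative (made-up) feature values, one row per review
X = np.array([
    [3, 0, 2, 0, 2, 2, 6.0],
    [0, 3, 1, 0, 1, 0, 5.5],
    [1, 0, 2, 2, 0, 0, 5.5],
    [2, 0, 1, 0, 1, 1, 7.0],
    [0, 2, 1, 1, 0, 0, 4.0],
    [0, 1, 0, 1, 1, 0, 3.5],
])
# Hypothetical sentiment labels: 1 = positive review, 0 = negative review
y = np.array([1, 0, 1, 1, 0, 0])

# chi2 requires non-negative features; it returns one score and p-value per column
scores, p_values = chi2(X, y)
# A column that is zero everywhere would yield NaN, so map NaN to 0 before ranking
scores = np.nan_to_num(scores)

# Rank features by chi-squared score in descending order
ranking = sorted(zip(feature_names, scores), key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

On a real labeled corpus, the same call would give a data-driven ordering that could be compared against the reasoning-based list.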



## Question 4 (10 points):
Write Python code to rank the texts based on text similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [None]:
from transformers import BertTokenizer, BertModel
import torch
from sklearn.metrics.pairwise import cosine_similarity

# Initialize BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Define your texts and query
texts = [
    "Loved this movie, amazing plot and characters.",
    "Worst movie, dull plot and unlikable characters.",
    "Great cinematography but predictable story, okay overall."
]
query = "Amazing movie with interesting characters."

# Function to encode a text into its BERT [CLS] embedding
def encode(text):
    inputs = tokenizer(text, return_tensors='pt', max_length=512, truncation=True, padding='max_length')
    # Inference only, so disable gradient tracking
    with torch.no_grad():
        outputs = model(**inputs)
    # The final hidden state of the [CLS] token serves as the sentence vector
    return outputs.last_hidden_state[:, 0, :].squeeze().numpy()

# Encode query and texts
query_vec = encode(query)
text_vecs = [encode(text) for text in texts]

# Calculate cosine similarities
cos_sim = cosine_similarity([query_vec], text_vecs)[0]

# Rank texts by similarity, highest first
ranked_texts = sorted(zip(texts, cos_sim), key=lambda x: x[1], reverse=True)

# Display ranked texts and similarities
for text, sim in ranked_texts:
    print(f"{text} (Similarity: {sim:.4f})")


Worst movie, dull plot and unlikable characters. (Similarity: 0.8904)
Loved this movie, amazing plot and characters. (Similarity: 0.8821)
Great cinematography but predictable story, okay overall. (Similarity: 0.8420)
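The ranking step itself depends only on cosine similarity, which can be sketched without BERT. The block below is a minimal NumPy illustration using made-up 3-dimensional vectors in place of real embeddings; `cosine_sim` and `rank_by_similarity` are hypothetical helper names introduced here.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_by_similarity(query_vec, doc_vecs, docs):
    """Return (doc, similarity) pairs sorted from most to least similar."""
    scored = [(doc, cosine_sim(query_vec, vec)) for doc, vec in zip(docs, doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up vectors standing in for embeddings
query_vec = [1.0, 0.0, 1.0]
doc_vecs = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
docs = ["doc_a", "doc_b", "doc_c"]

ranked = rank_by_similarity(query_vec, doc_vecs, docs)
for doc, sim in ranked:
    print(f"{doc} (Similarity: {sim:.4f})")
```

In the BERT output above, all three scores fall in a narrow band (roughly 0.84 to 0.89), so the ordering is more informative than the absolute values, and small differences between scores should be read with caution.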


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:

'''

Learning Experience: I found it interesting to see how text features and the BERT model help us understand text better. It's cool how we can figure out what a text is saying by looking at things like word choice and sentence structure.

Challenges Encountered: The main challenge was dealing with the technical side, like making sure the data was set up right for BERT and figuring out cosine similarity.

Relevance to NLP: This exercise is super relevant to NLP (Natural Language Processing) because it's all about teaching computers to understand human language. Learning about text features and BERT is key for making smarter NLP tools.