# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, grant sharing rights from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

"""
Distinguishing fake news from genuine news is an important text classification task for combating misinformation. The following types of features can be used to build a machine learning model for it:

Text Content Features:
- TF-IDF (Term Frequency-Inverse Document Frequency)
- Word Embeddings

Lexical Features:
- Punctuation Usage
- Capitalization

Syntactic Features:
- Part-of-Speech (POS) Tags
- Sentence Length

Semantic Features:
- Named Entity Recognition (NER)
- Topic Modeling

Sentiment Analysis Features:
- Sentiment Polarity
- Emotional Tone

Source and Context Features:
- Source Reliability
- Publication Date

Combining these feature types in one model should help a classifier distinguish fake from genuine news articles. Training and testing such a model also requires a dataset of labeled examples of both fake and genuine articles.
"""

## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [2]:
# Your code here (Please add comments in the code):
import pandas as pd
import numpy as np
import spacy
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

def extract_tfidf_features(data):
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_features = tfidf_vectorizer.fit_transform(data)
    tfidf_df = pd.DataFrame(tfidf_features.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
    return tfidf_df

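def extract_lexical_features(data):
    # Count punctuation characters and measure the share of uppercase characters per text
    lexical_features = []
    for text in data:
        # Any character that is neither alphanumeric nor whitespace is counted as punctuation
        punctuation_count = sum(1 for char in text if not char.isalnum() and not char.isspace())
        # max(..., 1) guards against division by zero for empty strings
        capitalization_ratio = sum(1 for char in text if char.isupper()) / max(len(text), 1)
        lexical_features.append([punctuation_count, capitalization_ratio])
    lexical_df = pd.DataFrame(lexical_features, columns=["Punctuation Count", "Capitalization Ratio"])
    return lexical_df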

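def extract_pos_features(data):
    # Count how often each coarse part-of-speech tag occurs in each text
    nlp = spacy.load("en_core_web_sm")
    pos_tags = []
    for text in data:
        doc = nlp(text)
        pos_counts = nltk.FreqDist(token.pos_ for token in doc)
        pos_tags.append(pos_counts)
    # Tags absent from a text would otherwise show up as NaN, so fill them with 0
    pos_df = pd.DataFrame(pos_tags).fillna(0).astype(int)
    return pos_df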

def extract_sentiment_features(data):
    sia = SentimentIntensityAnalyzer()
    sentiment_polarity = []
    for text in data:
        sentiment_scores = sia.polarity_scores(text)
        sentiment_polarity.append(sentiment_scores)
    sentiment_df = pd.DataFrame(sentiment_polarity)
    return sentiment_df

# Different sample data
different_sample_data = [
    "New Study Claims Coffee Causes Memory Loss",
    "Local School Wins National Science Competition",
    "Mayor Announces Plan to Reduce Traffic Congestion",
    "Rumors of Celebrity Engagement Spark Social Media Frenzy",
]

# Extracting features
tfidf_features = extract_tfidf_features(different_sample_data)
lexical_features = extract_lexical_features(different_sample_data)
pos_features = extract_pos_features(different_sample_data)
sentiment_features = extract_sentiment_features(different_sample_data)

# Print extracted features
print("TF-IDF Features:\n", tfidf_features)
print("\nLexical Features:\n", lexical_features)
print("\nPOS Tag Features:\n", pos_features)
print("\nSentiment Polarity Features:\n", sentiment_features)




TF-IDF Features:
    announces    causes  celebrity    claims    coffee  competition  \
0   0.000000  0.377964   0.000000  0.377964  0.377964     0.000000   
1   0.000000  0.000000   0.000000  0.000000  0.000000     0.408248   
2   0.377964  0.000000   0.000000  0.000000  0.000000     0.000000   
3   0.000000  0.000000   0.353553  0.000000  0.000000     0.000000   

   congestion  engagement    frenzy     local  ...    reduce    rumors  \
0    0.000000    0.000000  0.000000  0.000000  ...  0.000000  0.000000   
1    0.000000    0.000000  0.000000  0.408248  ...  0.000000  0.000000   
2    0.377964    0.000000  0.000000  0.000000  ...  0.377964  0.000000   
3    0.000000    0.353553  0.353553  0.000000  ...  0.000000  0.353553   

     school   science    social     spark     study        to   traffic  \
0  0.000000  0.000000  0.000000  0.000000  0.377964  0.000000  0.000000   
1  0.408248  0.408248  0.000000  0.000000  0.000000  0.000000  0.000000   
2  0.000000  0.000000  0.000000  0.
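
The repeated values in the TF-IDF output (0.377964, 0.408248, 0.353553) follow directly from the weighting formula. As a rough check, assuming scikit-learn's default settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row), every term in these four headlines occurs in exactly one document, so all terms share the same IDF and each weight reduces to `1 / sqrt(k)` for a headline with `k` tokens. The pure-Python sketch below re-derives this; `manual_tfidf` is a hypothetical helper written for illustration and is not part of scikit-learn.

```python
import math
import re
from collections import Counter

headlines = [
    "New Study Claims Coffee Causes Memory Loss",
    "Local School Wins National Science Competition",
    "Mayor Announces Plan to Reduce Traffic Congestion",
    "Rumors of Celebrity Engagement Spark Social Media Frenzy",
]

def manual_tfidf(docs):
    # Tokenize roughly like the default vectorizer: lowercase, tokens of 2+ word characters
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Raw weight = term frequency * smoothed inverse document frequency
        raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        # L2-normalize the row (fall back to 1.0 for an empty document)
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

weights = manual_tfidf(headlines)
print(round(weights[0]["coffee"], 6))     # 7 tokens -> 1 / sqrt(7)
print(round(weights[1]["school"], 6))     # 6 tokens -> 1 / sqrt(6)
print(round(weights[3]["celebrity"], 6))  # 8 tokens -> 1 / sqrt(8)
```

With only four short headlines and no shared vocabulary, TF-IDF cannot separate informative terms from uninformative ones; the weights only reflect headline length.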

## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [4]:
# Your code here (Please add comments in the code):
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

def train_random_forest_classifier(labels):
    # Convert labels to numerical format
    label_dict = {label: idx for idx, label in enumerate(labels)}
    num_labels = [label_dict[label] for label in labels]

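    # Creating a new TF-IDF vectorizer and fitting it on the label strings themselves,
    # so each single-word label becomes a one-term document
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_features = tfidf_vectorizer.fit_transform(labels)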

    # Training Random Forest classifier
    rand_forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
    rand_forest_clf.fit(tfidf_features, num_labels)

    # Extracting feature importances
    feature_importances = rand_forest_clf.feature_importances_

    # Creating a DataFrame to store feature importances
    feature_importance_df = pd.DataFrame({'Feature': tfidf_vectorizer.get_feature_names_out(), 'Importance': feature_importances})

    # Sorting features by importance
    sorted_features = feature_importance_df.sort_values(by='Importance', ascending=False)

    return sorted_features

def print_top_features(feature_df, top_n=10):
    # Printing top N features
    for idx, row in feature_df.head(top_n).iterrows():
        print(f"Feature: {row['Feature']}, Importance: {row['Importance']:.4f}")

# Sample labels
sample_labels = ['Immigration', 'Economy', 'GDP', 'Unemployment', 'Government']

# Training Random Forest classifier and getting top features
sorted_features = train_random_forest_classifier(sample_labels)

# Printing top features
print_top_features(sorted_features)

Feature: economy, Importance: 0.2390
Feature: immigration, Importance: 0.2317
Feature: gdp, Importance: 0.1867
Feature: government, Importance: 0.1840
Feature: unemployment, Importance: 0.1585
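
Filter-style scores such as the chi-square statistic are a common choice for text feature selection, and they need labeled documents rather than the labels alone. The following is a minimal sketch, not the method used in the cell above: it scores each term against one class using the standard 2x2 contingency-table formula `chi2 = N * (A*D - C*B)^2 / ((A+C) * (B+D) * (A+B) * (C+D))`. The toy headlines, their `fake`/`real` labels, and the helper name `chi_square_scores` are all hypothetical and exist only for illustration.

```python
# Hypothetical labeled toy corpus (not real data)
toy_docs = [
    ("shocking miracle cure doctors hate", "fake"),
    ("shocking secret celebrity scandal", "fake"),
    ("council approves new budget", "real"),
    ("school wins science competition", "real"),
]

def chi_square_scores(labeled_docs, target_class):
    # Represent each document as a set of lowercase tokens plus its label
    n = len(labeled_docs)
    token_sets = [(set(text.lower().split()), label) for text, label in labeled_docs]
    vocabulary = sorted(set().union(*(tokens for tokens, _ in token_sets)))
    scores = {}
    for term in vocabulary:
        # A: term present, in class;  B: term present, other class
        # C: term absent, in class;   D: term absent, other class
        a = sum(1 for tokens, label in token_sets if term in tokens and label == target_class)
        b = sum(1 for tokens, label in token_sets if term in tokens and label != target_class)
        c = sum(1 for tokens, label in token_sets if term not in tokens and label == target_class)
        d = n - a - b - c
        denominator = (a + c) * (b + d) * (a + b) * (c + d)
        # A zero denominator means the term or class never varies, so the score is set to 0
        scores[term] = n * (a * d - c * b) ** 2 / denominator if denominator else 0.0
    # Rank terms from most to least class-dependent
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranked_terms = chi_square_scores(toy_docs, target_class="fake")
for term, score in ranked_terms[:5]:
    print(f"Feature: {term}, Chi-square: {score:.4f}")
```

In this toy corpus only "shocking" appears in both documents of one class, so it receives the highest score; every other term occurs once and ties with the rest.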


## Question 4 (10 points):
Write Python code to rank the text based on text similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [5]:
pip install transformers torch numpy



In [6]:
# Your code here (Please add comments in the code):
import torch
from transformers import AutoTokenizer, AutoModel
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def get_bert_embeddings(text):
    # Tokenize a single text and run it through BERT without tracking gradients
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token vectors of the last hidden layer into one embedding for the text
    embeddings = outputs.last_hidden_state.mean(dim=1).squeeze().numpy()
    return embeddings

def rank_documents_by_similarity(query_embedding, sample_embeddings, sample_data):
    # Cosine similarity between the query and every document embedding
    similarities = cosine_similarity([query_embedding], sample_embeddings)[0]
    # Sort document indices by similarity, highest first
    ranking = sorted(enumerate(similarities), key=lambda x: x[1], reverse=True)
    ranked_results = [(sample_data[index], similarity) for index, similarity in ranking]
    return ranked_results

# Sample text data
sample_data = [
    "New Study Claims Coffee Causes Memory Loss",
    "Local School Wins National Science Competition",
    "Mayor Announces Plan to Reduce Traffic Congestion",
    "Rumors of Celebrity Engagement Spark Social Media Frenzy"]

# Query
query = "Reduce Traffic Congestion"

# Load pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode the query and sample data into BERT embeddings
query_embedding = get_bert_embeddings(query)
sample_embeddings = [get_bert_embeddings(text) for text in sample_data]

# Rank documents based on similarity to query
ranked_results = rank_documents_by_similarity(query_embedding, sample_embeddings, sample_data)

# Print ranked results
print("Ranked Documents based on Similarity to Query:")
for rank, (text, similarity) in enumerate(ranked_results):
    print(f"Rank {rank + 1}: Similarity = {similarity:.4f}")
    print(f"Text: {text}\n")






Ranked Documents based on Similarity to Query:
Rank 1: Similarity = 0.8043
Text: Mayor Announces Plan to Reduce Traffic Congestion

Rank 2: Similarity = 0.6189
Text: New Study Claims Coffee Causes Memory Loss

Rank 3: Similarity = 0.5869
Text: Local School Wins National Science Competition

Rank 4: Similarity = 0.5691
Text: Rumors of Celebrity Engagement Spark Social Media Frenzy
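
The scores above are cosine similarities, `cos(q, d) = (q . d) / (||q|| * ||d||)`, so they depend only on the angle between two embedding vectors and not on their lengths. A minimal pure-Python sketch of the computation and the ranking step follows; the three-dimensional vectors and the document names are made-up stand-ins for the 768-dimensional `bert-base-uncased` embeddings and are assumptions for illustration only.

```python
import math

def cosine(u, v):
    # Dot product divided by the product of the Euclidean norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings
toy_query = [1.0, 0.0, 1.0]
toy_docs = {
    "doc_a": [2.0, 0.0, 2.0],  # same direction as the query, different length
    "doc_b": [1.0, 1.0, 0.0],  # partially overlapping direction
    "doc_c": [0.0, 3.0, 0.0],  # orthogonal to the query
}

# Rank documents by similarity to the query, highest first
toy_ranking = sorted(
    ((name, cosine(toy_query, vector)) for name, vector in toy_docs.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for rank, (name, similarity) in enumerate(toy_ranking, start=1):
    print(f"Rank {rank}: {name}, Similarity = {similarity:.4f}")
```

Because `doc_a` points in the same direction as the query, it ranks first even though it is twice as long, which is why cosine similarity is a natural fit for comparing embeddings of texts with different lengths.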



# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:
It was a great learning experience to work on extracting features from text data. I also understood
the importance of text preprocessing before performing feature extraction.
The main challenge I encountered was ensuring compatibility between the
libraries and models used for this exercise. This exercise is highly relevant to the
field of NLP: it supports NLP projects such as text classification,
topic modeling, and summarization.
'''