<a href="https://colab.research.google.com/github/apurwa2024/Apurwa_INFO5731_FaLL2024/blob/main/Bhattarai_Apurwa_Exercise_3_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of Friday, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**

**Please check that the link you submitted can be opened and points to the correct assignment.**

## Question 1 (10 Points)
Describe an interesting **text classification or text mining task** and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features. **Your dataset must be text.**

In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
Please write your answer here:
An interesting text classification task is categorizing OfferUp listings into product categories. For example, a listing's title, description, and price can be used to assign it to a category such as furniture or bikes.

1. Textual features: features extracted from the listing text that help the model learn patterns and relationships between words. Examples include bag of words, n-grams, and TF-IDF.
2. Price feature: a numerical representation of an item's cost. Just as size helps predict a house's price, price helps predict a listing's category, for example by grouping items into low, medium, and high ranges.
3. Seller feature: the seller's rating may relate to the items being sold. A seller with a high rating is more likely to be selling high-value goods.
4. Specific terms: if the seller includes a specific term, the listing is easier to categorize. For example, a listing that mentions "sofa" or "couch" is most likely furniture.
5. Interaction feature: combining text features with price could increase accuracy. For example, the word "primitive" together with a high price most likely indicates furniture.




'''
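
The price, specific-term, and interaction features described above could be sketched roughly as follows. This is a minimal illustration, not part of the submitted solution: the listings, the price cut points (50 and 200), and the keyword list are all hypothetical values chosen for the example.

```python
import pandas as pd

# Hypothetical listings; titles and prices are made up for illustration.
listings = pd.DataFrame({
    "title": ["Leather sofa", "Mountain bike", "Primitive wooden couch"],
    "price": [30, 150, 400],
})

# Price feature: bucket raw prices into low / medium / high ranges.
# The cut points (50 and 200) are arbitrary assumptions.
listings["price_range"] = pd.cut(
    listings["price"],
    bins=[0, 50, 200, float("inf")],
    labels=["low", "medium", "high"],
)

# Specific-term feature: flag titles that mention a furniture keyword.
furniture_terms = ("sofa", "couch", "desk", "table")
listings["has_furniture_term"] = listings["title"].str.lower().apply(
    lambda title: any(term in title for term in furniture_terms)
)

# Interaction feature: a furniture keyword combined with a high price.
listings["furniture_and_high_price"] = (
    listings["has_furniture_term"] & (listings["price_range"] == "high")
)

print(listings)
```

In a real pipeline the cut points would more plausibly come from the price distribution (for example quantiles) than from fixed numbers.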

## Question 2 (10 Points)
Write python code to extract these features you discussed above. You can collect a few sample text data for the feature extraction.

In [None]:
from bs4 import BeautifulSoup
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import normalize
from textblob import TextBlob
import numpy as np
import re
from gensim.models import Word2Vec

# Sample HTML content
html_content = """
<html>
  <body>
    <div class="listing">
      <h2 class="title">Second-Hand Couch</h2>
      <p class="price">$30</p>
      <p class="description">used couch but looks good.</p>
    </div>
    <div class="listing">
      <h2 class="title">Skateboard</h2>
      <p class="price">$45</p>
      <p class="description">Skateboard for sale.</p>
    </div>
    <div class="listing">
      <h2 class="title">Study Desk</h2>
      <p class="price">$125</p>
      <p class="description">A study desk, good for students.</p>
    </div>
  </body>
</html>
"""

# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

# Extract product listings
listings = soup.find_all('div', class_='listing')

# Prepare a list to hold the data
data = []
for listing in listings:
    title = listing.find('h2', class_='title').text
    # Strip the leading "$" so the price can be used as a number
    price = float(listing.find('p', class_='price').text.lstrip('$'))
    description = listing.find('p', class_='description').text
    data.append({'title': title, 'price': price, 'description': description})

# Create a DataFrame
df = pd.DataFrame(data)

# Feature 1: Bag of Words
count_vectorizer = CountVectorizer()
bow_matrix = count_vectorizer.fit_transform(df['title'] + ' ' + df['description'])

# Feature 2: TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['title'] + ' ' + df['description'])

# Feature 3: Morphological Analysis (crude suffix stripping as a stand-in for stemming)
def stem_words(text):
    words = text.split()
    return ' '.join([re.sub(r'(ing|ed|s)$', '', word) for word in words])

df['stemmed'] = df['title'] + ' ' + df['description']
df['stemmed'] = df['stemmed'].apply(stem_words)

# Feature 4: Word Embedding
sentences = [row.split() for row in df['stemmed']]
word2vec_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Create an average vector for each listing based on the words in the title and description
def get_average_vector(text):
    words = text.split()
    vector = np.zeros(100)
    count = 0
    for word in words:
        if word in word2vec_model.wv:
            vector += word2vec_model.wv[word]
            count += 1
    return vector / count if count > 0 else vector

df['embedding'] = df['stemmed'].apply(get_average_vector)

# Feature 5: Sentiment Analysis (TextBlob polarity score in [-1, 1])
df['sentiment'] = df['description'].apply(lambda x: TextBlob(x).sentiment.polarity)

# Combine all features into a final DataFrame
feature_df = pd.concat([
    df[['title', 'description', 'sentiment']],
    pd.DataFrame(bow_matrix.toarray(), columns=count_vectorizer.get_feature_names_out()),
    pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out()),
    pd.DataFrame(df['embedding'].tolist(), columns=[f'embed_{i}' for i in range(100)])
], axis=1)

# Display the feature DataFrame
print(feature_df)


               title                       description  sentiment  but  couch  \
0  Second-Hand Couch        used couch but looks good.        0.7    1      2   
1         Skateboard              Skateboard for sale.        0.0    0      0   
2         Study Desk  A study desk, good for students.        0.7    0      0   

   desk  for  good  hand  looks  ...  embed_90  embed_91  embed_92  embed_93  \
0     0    0     1     1      1  ...  0.000949  0.001448  0.000337 -0.000137   
1     0    1     0     0      0  ...  0.002830 -0.000366  0.001651 -0.002444   
2     2    1     1     0      0  ...  0.000874  0.002089  0.001089 -0.000180   

   embed_94  embed_95  embed_96  embed_97  embed_98  embed_99  
0  0.003118  0.001339  0.001929 -0.003381  0.000586 -0.001534  
1 -0.000364  0.002606  0.002808 -0.004326 -0.003576  0.001823  
2  0.003795 -0.000367  0.000048  0.002937  0.001586  0.004486  

[3 rows x 129 columns]
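
To make the TF-IDF columns above less of a black box, here is a small sketch of how such weights can be computed by hand. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is how scikit-learn documents its default `TfidfVectorizer` behavior. The helper name `tfidf_rows` and the whitespace tokenization are assumptions made for this example, so the numbers are not guaranteed to match the vectorizer output exactly.

```python
import math
from collections import Counter

# Toy corpus loosely mirroring the title + description strings used above.
docs = [
    "second hand couch used couch but looks good",
    "skateboard skateboard for sale",
    "study desk study desk good for students",
]

def tfidf_rows(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        weights = {t: term_freq[t] * idf[t] for t in term_freq}
        # L2-normalize so every document vector has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

rows = tfidf_rows(docs)
for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

A term that is repeated within one listing and rare across listings (such as "couch") ends up with a larger weight than a term shared by several listings (such as "good").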


## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import make_pipeline

# Step 1: Load the Data
df = pd.DataFrame({
    'description': ['Sofa in good condition', 'Mountain bike for sale', 'Dining table set'],
    'title': ['Sofa', 'Bike', 'Table'],
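    'price': [200, 150, 300],
    # Category labels (assumed for illustration; chosen so Sofa and Table share a class)
    'category': ['furniture', 'bikes', 'furniture'],
})

# Step 2: Preprocess the Text
df['text'] = df['title'] + ' ' + df['description']

# Step 3: Vectorize the Text Data
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
X = tfidf_vectorizer.fit_transform(df['text'])

# Step 4: Encode the Category Labels as Integers
y = LabelEncoder().fit_transform(df['category'])

# Step 5: Applying the Chi-Squared Test
chi2_scores, p_values = chi2(X, y)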

# Step 6: Creating a DataFrame for Features and Scores
feature_names = tfidf_vectorizer.get_feature_names_out()
chi2_results = pd.DataFrame({'feature': feature_names, 'chi2_score': chi2_scores})

# Ranking the features based on Chi-Squared scores in descending order
chi2_results = chi2_results.sort_values(by='chi2_score', ascending=False)

# Display the ranked features
print(chi2_results)


     feature  chi2_score
0       bike    1.632993
4   mountain    0.816497
5       sale    0.816497
7       sofa    0.408248
8      table    0.408248
1  condition    0.204124
2     dining    0.204124
3       good    0.204124
6        set    0.204124
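
The scores above can be checked by hand. The sketch below recomputes the chi-squared statistic for two of the terms with plain NumPy, under two assumptions: the sofa and table listings share one class while the bike listing has its own, and every term has the same IDF (true here, since after stop-word removal each term occurs in exactly one listing). The function `chi2_scores` is a hypothetical helper that follows the observed-versus-expected computation as I understand `sklearn.feature_selection.chi2` to perform it.

```python
import numpy as np

# TF-IDF weight of a word that occurs twice in a listing whose other two
# terms occur once each, with equal IDF for all terms: 2 / sqrt(2^2 + 1^2 + 1^2).
w = 2 / np.sqrt(6)

# Rows are the three listings; columns are the terms "bike" and "sofa".
X = np.array([
    [0.0, w],    # sofa listing
    [w, 0.0],    # bike listing
    [0.0, 0.0],  # table listing
])

# Assumed labels: 0 = bikes, 1 = furniture.
y = np.array([1, 0, 1])

def chi2_scores(X, y):
    """Chi-squared statistic per feature (hypothetical helper)."""
    classes = np.unique(y)
    # One-hot encode the labels: shape (n_samples, n_classes).
    Y = (y[:, None] == classes[None, :]).astype(float)
    observed = Y.T @ X                # per-class sum of each feature
    class_prob = Y.mean(axis=0)       # share of samples in each class
    feature_total = X.sum(axis=0)     # total weight of each feature
    expected = np.outer(class_prob, feature_total)
    return ((observed - expected) ** 2 / expected).sum(axis=0)

scores = chi2_scores(X, y)
print(dict(zip(["bike", "sofa"], scores.round(6))))
```

Under these assumptions "bike" scores `2 * w` and "sofa" scores `w / 2`, which agrees with the 1.632993 and 0.408248 reported in the ranking. With only three documents the ranking mostly reflects which term belongs to the minority class, so it should not be read as strong evidence of feature importance.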


## Question 4 (10 points):
Write Python code to rank the text based on text similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [None]:
# Your code here (Please add comments in the code):
from transformers import BertTokenizer, BertModel
import torch
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load the pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Sample text data
documents = [
    "Second-Hand Couch used couch but looks good",
    "Skateboard Skateboard for sale",
    "Study Desk A study desk, good for students"
]

# example query
query = "Second-Hand Couch looks good"

# Function to get BERT embeddings
def get_bert_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    cls_embedding = outputs.last_hidden_state[:, 0, :]
    return cls_embedding.squeeze().numpy()

# BERT embeddings for documents and the query
document_embeddings = np.array([get_bert_embedding(doc) for doc in documents])
query_embedding = get_bert_embedding(query)

# cosine similarity between query and documents
cosine_similarities = cosine_similarity([query_embedding], document_embeddings).flatten()

# Rank documents in descending order of similarity
ranked_indices = np.argsort(cosine_similarities)[::-1]

# Output ranked documents
print("Ranked documents based on similarity to the query:")
for idx in ranked_indices:
    print(f"Document: {documents[idx]} | Similarity: {cosine_similarities[idx]:.4f}")






Ranked documents based on similarity to the query:
Document: Second-Hand Couch used couch but looks good | Similarity: 0.8916
Document: Study Desk A study desk, good for students | Similarity: 0.7630
Document: Skateboard Skateboard for sale | Similarity: 0.6850
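
The ranking step itself does not depend on BERT. As a rough sketch, the cosine similarity and descending sort can be written directly in NumPy. The three-dimensional vectors below are hypothetical stand-ins for real embeddings, and `rank_by_cosine` is a helper name invented for this example.

```python
import numpy as np

def rank_by_cosine(query_vec, doc_vecs):
    """Return (indices sorted by descending similarity, similarities)."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    # Cosine similarity: dot product divided by the product of the norms.
    sims = (doc_vecs @ query_vec) / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    # argsort is ascending, so reverse it for a descending ranking.
    order = np.argsort(sims)[::-1]
    return order, sims

# Hypothetical toy "embeddings" (real BERT vectors have 768 dimensions).
query = [1.0, 0.0, 1.0]
docs = [
    [1.0, 0.0, 1.0],  # same direction as the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]

order, sims = rank_by_cosine(query, docs)
for idx in order:
    print(f"doc {idx}: similarity {sims[idx]:.4f}")
```

Because cosine similarity ignores vector length, only the direction of each embedding matters for the ranking.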


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
Please write your answer here:

This assignment was challenging for me, but it offered many learning opportunities. I found tokenization, TF-IDF, and bag of words especially beneficial.

The main challenge I encountered was data preprocessing, particularly cleaning the data.

This exercise is relevant to NLP tasks such as text classification, feature extraction, and information retrieval.



'''