
# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing permissions from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of Friday, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**

**Please check that the link you submitted can be opened and points to the correct assignment.**

## Question 1 (10 Points)
Describe an interesting **text classification or text mining task** and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features. **Your dataset must be text.**

Detecting fake news is an important task in today's information-driven world. The goal is to classify news articles as either fake or real. Fake news often contains misleading or sensational content, making it essential to identify such articles in the fight against misinformation.

**Features for fake news detection:**

*   **TF-IDF (Term Frequency-Inverse Document Frequency).** Why I need it: Fake news often uses attention-grabbing words. TF-IDF helps identify words that are overused in fake articles but less common across all articles.
*   **N-grams (word pairs or triplets).** Why I need it: Fake news tends to use word combinations like "shocking discovery." N-grams help detect patterns in phrasing typical of misleading articles.
*   **Sentiment analysis.** Why I need it: The goal of fake news is often to evoke strong emotions, whether positive or negative. Sentiment analysis can reveal exaggerated emotional tones.
*   **Readability scores.** Why I need it: Real news is often more formal and complex, while fake news is typically written in simpler language to appeal to a wider audience. Readability measures help capture this difference.
*   **Part-of-speech (POS) tags.** Why I need it: Fake news may use more adjectives and adverbs to sound more dramatic. POS tags help detect grammatical patterns that distinguish real news from fake.
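
The readability feature listed above can be sketched with the Flesch Reading Ease formula, where higher scores indicate easier text. This is only a rough illustration: the `count_syllables` helper is a hypothetical vowel-group heuristic rather than a dictionary lookup, and the two sample sentences are far too few to confirm or refute the claim that fake news reads more simply.

```python
import re

def count_syllables(word: str) -> int:
    """Rough syllable estimate: count runs of consecutive vowels (heuristic only)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: 206.835 - 1.015 * (words/sentence) - 84.6 * (syllables/word)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )

# Two of the sample headlines used later in this notebook
simple = "Shocking discovery! Scientists found alien life on Mars."
formal = "Experts suggest global warming has irreversible effects."

for text in (simple, formal):
    print(f"{flesch_reading_ease(text):7.2f}  {text}")
```

A score column produced this way could be added to the DataFrame next to the sentiment score, assuming the raw (not stop-word-stripped) text is kept for the calculation.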



## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [9]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from textblob import TextBlob
import nltk
from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

data = {
    'text': [
        "Shocking discovery! Scientists found alien life on Mars.",
        "Government passes new healthcare bill.",
        "Celebrity caught in outrageous scandal!",
        "Experts suggest global warming has irreversible effects.",
        "New study shows coffee cures all diseases.",
        "Stock market hits all-time high.",
        "Breaking: Scientists clone a dinosaur.",
        "Sports team wins championship after dramatic final."
    ],
    'label': [1, 0, 1, 0, 1, 0, 1, 0]
}

df = pd.DataFrame(data)

df['text'] = df['text'].str.lower().apply(nltk.word_tokenize)

stop_words = set(stopwords.words('english'))
df['text'] = df['text'].apply(lambda words: [w for w in words if w.isalpha() and w not in stop_words])

df['text'] = df['text'].apply(lambda words: ' '.join(words))

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['text'])
tfidf_features = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
print("\nTF-IDF Features:")
print(tfidf_features.head())

bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
bigram_matrix = bigram_vectorizer.fit_transform(df['text'])
bigram_features = pd.DataFrame(bigram_matrix.toarray(), columns=bigram_vectorizer.get_feature_names_out())
print("\nBigram Features:")
print(bigram_features.head())

df['sentiment'] = df['text'].apply(lambda text: TextBlob(text).sentiment.polarity)
print("\nSentiment Scores:")
print(df[['text', 'sentiment']].head())

def pos_counts(text):
    tags = pos_tag(word_tokenize(text))
    pos_freq = nltk.FreqDist(tag for (word, tag) in tags)
    return pos_freq

df['pos_tags'] = df['text'].apply(pos_counts)
print("\nPOS Tag Counts:")
print(df[['text', 'pos_tags']].head())


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.



TF-IDF Features:
      alien      bill  breaking  caught  celebrity  championship  clone  \
0  0.386265  0.000000       0.0     0.0        0.0           0.0    0.0   
1  0.000000  0.461149       0.0     0.0        0.0           0.0    0.0   
2  0.000000  0.000000       0.0     0.5        0.5           0.0    0.0   
3  0.000000  0.000000       0.0     0.0        0.0           0.0    0.0   
4  0.000000  0.000000       0.0     0.0        0.0           0.0    0.0   

     coffee     cures  dinosaur  ...  scientists  shocking     shows  sports  \
0  0.000000  0.000000       0.0  ...     0.32372  0.386265  0.000000     0.0   
1  0.000000  0.000000       0.0  ...     0.00000  0.000000  0.000000     0.0   
2  0.000000  0.000000       0.0  ...     0.00000  0.000000  0.000000     0.0   
3  0.000000  0.000000       0.0  ...     0.00000  0.000000  0.000000     0.0   
4  0.418767  0.418767       0.0  ...     0.00000  0.000000  0.418767     0.0   

   stock     study   suggest  team   warming  wins
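
To make the TF-IDF numbers above less opaque, the sketch below recomputes a few of them by hand. It assumes the `TfidfVectorizer` defaults (raw term counts, smoothed IDF, L2 normalization), and the `docs` list is a reconstruction of what the preprocessing step leaves behind after lowercasing and stop-word removal, so treat it as an assumption rather than captured output.

```python
import math
from collections import Counter

# Assumed result of the lowercasing and stop-word removal in the cell above
docs = [
    "shocking discovery scientists found alien life mars",
    "government passes new healthcare bill",
    "celebrity caught outrageous scandal",
    "experts suggest global warming irreversible effects",
    "new study shows coffee cures diseases",
    "stock market hits high",
    "breaking scientists clone dinosaur",
    "sports team wins championship dramatic final",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    """Smoothed TF-IDF weights for one document, L2-normalized."""
    counts = Counter(tokens)
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {term: v / norm for term, v in raw.items()}

row0 = tfidf_row(tokenized[0])
print(round(row0["alien"], 6), round(row0["scientists"], 5))
```

Under these assumptions, "scientists" receives a lower weight than "alien" in the first document because it occurs in two documents rather than one, which is the down-weighting of common terms that motivates TF-IDF.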

## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [11]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from textblob import TextBlob
from sklearn.feature_selection import chi2
from sklearn.preprocessing import LabelEncoder
import nltk
from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

data = {
    'text': [
        "Shocking discovery! Scientists found alien life on Mars.",
        "Government passes new healthcare bill.",
        "Celebrity caught in outrageous scandal!",
        "Experts suggest global warming has irreversible effects.",
        "New study shows coffee cures all diseases.",
        "Stock market hits all-time high.",
        "Breaking: Scientists clone a dinosaur.",
        "Sports team wins championship after dramatic final."
    ],
    'label': [1, 0, 1, 0, 1, 0, 1, 0]
}

df = pd.DataFrame(data)
df['text'] = df['text'].str.lower().apply(nltk.word_tokenize)

stop_words = set(stopwords.words('english'))
df['text'] = df['text'].apply(lambda words: [w for w in words if w.isalpha() and w not in stop_words])
df['text'] = df['text'].apply(lambda words: ' '.join(words))

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['text'])
tfidf_features = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

label_encoder = LabelEncoder()
df['label_encoded'] = label_encoder.fit_transform(df['label'])

chi2_scores, p_values = chi2(tfidf_matrix, df['label_encoded'])
chi2_df = pd.DataFrame({'Feature': tfidf_vectorizer.get_feature_names_out(), 'Chi2_Score': chi2_scores})
chi2_df = chi2_df.sort_values(by='Chi2_Score', ascending=False)

print("\nTop features ranked by Chi-Square scores:")
print(chi2_df.head(10))

bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
bigram_matrix = bigram_vectorizer.fit_transform(df['text'])
bigram_features = pd.DataFrame(bigram_matrix.toarray(), columns=bigram_vectorizer.get_feature_names_out())

bigram_chi2_scores, bigram_p_values = chi2(bigram_matrix, df['label_encoded'])
bigram_chi2_df = pd.DataFrame({'Bigram': bigram_vectorizer.get_feature_names_out(), 'Chi2_Score': bigram_chi2_scores})
bigram_chi2_df = bigram_chi2_df.sort_values(by='Chi2_Score', ascending=False)

print("\nTop bigram features ranked by Chi-Square scores:")
print(bigram_chi2_df.head(10))

df['sentiment'] = df['text'].apply(lambda text: TextBlob(text).sentiment.polarity)

print("\nFinal dataset with selected features:")
print(df[['text', 'label_encoded', 'sentiment']])



Top features ranked by Chi-Square scores:
       Feature  Chi2_Score
30  scientists    0.759277
9     dinosaur    0.519708
2     breaking    0.519708
6        clone    0.519708
20        high    0.500000
29     scandal    0.500000
21        hits    0.500000
34       stock    0.500000
27  outrageous    0.500000
24      market    0.500000

Top bigram features ranked by Chi-Square scores:
                 Bigram  Chi2_Score
0            alien life         1.0
25   shocking discovery         1.0
19       new healthcare         1.0
20            new study         1.0
21   outrageous scandal         1.0
22           passes new         1.0
23     scientists clone         1.0
24     scientists found         1.0
26         shows coffee         1.0
1   breaking scientists         1.0

Final dataset with selected features:
                                                text  label_encoded  sentiment
0  shocking discovery scientists found alien life...              1  -0.625000
1              go

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
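
The chi-square scores above can be checked by hand for a single feature. The sketch below follows what scikit-learn's `chi2` does as far as I understand it: the observed value per class is the sum of the feature's values in that class, and the expected value is the feature's total multiplied by the class proportion. The function name `chi2_score` and the two feature columns are illustrative; the `high_tfidf` column assumes the TF-IDF weight of 0.5 shown for "high" in Question 2.

```python
import math

def chi2_score(values, labels):
    """Chi-square statistic for one non-negative feature against class labels."""
    total = sum(values)
    if total == 0:
        return math.nan  # a feature that never occurs has no defined score
    score = 0.0
    for cls in sorted(set(labels)):
        # Observed: sum of the feature's values over documents of this class
        observed = sum(v for v, y in zip(values, labels) if y == cls)
        # Expected: total feature mass split according to the class proportion
        expected = total * labels.count(cls) / len(labels)
        score += (observed - expected) ** 2 / expected
    return score

labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Bigram "alien life": count of 1 in the first (fake) document only
alien_life = [1, 0, 0, 0, 0, 0, 0, 0]
# Unigram "high": assumed TF-IDF weight of 0.5 in the sixth (real) document only
high_tfidf = [0, 0, 0, 0, 0, 0.5, 0, 0]

print(chi2_score(alien_life, labels), chi2_score(high_tfidf, labels))
```

With two balanced classes, any bigram that occurs exactly once gets the same score, which is consistent with the ten-way tie in the bigram ranking above. On a sample this small the ranking therefore says little about which bigrams are actually discriminative.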


## Question 4 (10 points):
Write Python code to rank the texts by similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the texts by similarity in descending order.

In [14]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('bert-base-nli-mean-tokens')

data = {
    'text': [
        "Shocking discovery! Scientists found alien life on Mars.",
        "Government passes new healthcare bill.",
        "Celebrity caught in outrageous scandal!",
        "Experts suggest global warming has irreversible effects.",
        "New study shows coffee cures all diseases.",
        "Stock market hits all-time high.",
        "Breaking: Scientists clone a dinosaur.",
        "Sports team wins championship after dramatic final."
    ]
}

df = pd.DataFrame(data)

query = "New scientific breakthrough in space exploration."

query_embedding = model.encode(query, convert_to_tensor=True)
text_embeddings = model.encode(df['text'].tolist(), convert_to_tensor=True)

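# Cosine similarity between the query and every document, moved to NumPy for sorting
cosine_scores = util.pytorch_cos_sim(query_embedding, text_embeddings)[0].cpu().numpy()
sorted_indexes = np.argsort(-cosine_scores)

# Copy the reordered rows so that adding a column does not raise SettingWithCopyWarning
ranked_df = df.iloc[sorted_indexes].copy()
ranked_df['similarity_score'] = cosine_scores[sorted_indexes]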

print("Ranked Documents based on Similarity to Query:\n")
print(ranked_df[['text', 'similarity_score']])




Ranked Documents based on Similarity to Query:

                                                text  similarity_score
0  Shocking discovery! Scientists found alien lif...          0.708499
6             Breaking: Scientists clone a dinosaur.          0.512551
1             Government passes new healthcare bill.          0.433043
3  Experts suggest global warming has irreversibl...          0.426087
5                   Stock market hits all-time high.          0.325972
7  Sports team wins championship after dramatic f...          0.299732
2            Celebrity caught in outrageous scandal!          0.287964
4         New study shows coffee cures all diseases.          0.268384
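
The ranking step itself does not depend on BERT: once each text is a vector, cosine similarity is the dot product divided by the product of the vector lengths. The sketch below shows this with made-up three-dimensional vectors standing in for real embeddings; the names `query_vec` and `doc_vecs` and all of the values are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical stand-ins for sentence embeddings (real ones have hundreds of dimensions)
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_a": [1.0, 0.0, 1.0],  # same direction as the query
    "doc_b": [1.0, 1.0, 0.0],  # partially overlapping
    "doc_c": [0.0, 1.0, 0.0],  # orthogonal to the query
}

# Score every document against the query, then sort from most to least similar
ranked = sorted(
    ((name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Because cosine similarity ignores vector length, a long and a short text pointing in the same direction score the same, which is one reason it is a common choice for comparing embeddings.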




# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
Please write your answer here:





'''