# CMPSC 445 - M9 Assignment

### Dataset
Loads the Stack Overflow dataset into a pandas DataFrame and prints its schema. Each column is a feature available in the training and testing data, and the DataFrame's shape gives the number of questions and columns.

In [135]:
import pandas as pd
import textwrap

# read json data into a dataframe
df_idf = pd.read_json("data/tf-idf-data/data/stackoverflow-data-idf.json", lines=True)

# print schema
print(f"Schema:\n\n{df_idf.dtypes}")
print(f"\nNumber of questions, columns = {df_idf.shape}")

Schema:

id                            int64
title                        object
body                         object
answer_count                  int64
comment_count                 int64
creation_date                object
last_activity_date           object
last_editor_display_name     object
owner_display_name           object
owner_user_id               float64
post_type_id                  int64
score                         int64
tags                         object
view_count                    int64
accepted_answer_id          float64
favorite_count              float64
last_edit_date               object
last_editor_user_id         float64
community_owned_date         object
dtype: object

Number of questions, columns = (20000, 19)
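
The `lines=True` argument tells pandas to treat the file as JSON Lines, that is, one JSON object per line. A minimal standard-library sketch of that layout follows; the two records are made up for illustration and are not taken from the dataset.

```python
import json

# Two hypothetical records in JSON Lines layout: one JSON object per line.
raw = "\n".join([
    '{"id": 1, "title": "First question", "score": 3}',
    '{"id": 2, "title": "Second question", "score": 0}',
])

# Parse each non-empty line independently, as a JSON Lines reader would.
records = [json.loads(line) for line in raw.splitlines() if line.strip()]

# Each record corresponds to a row; the union of keys gives the columns.
columns = sorted({key for record in records for key in record})
print(len(records), columns)
```

Each parsed record plays the role of one DataFrame row, which is why the shape printed above reports one row per question.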


### Data Processing
Concatenates each document's title and body into a single `text` column, cleans it (lowercasing, removing HTML comments, and replacing digits and special characters with spaces), and then prints the cleaned text of the document at index 2 (the third document).

In [136]:
import re

def pre_process(text):
    
    # lowercase
    text=text.lower()
    
    # remove HTML comments (<!-- ... -->); ordinary tags such as <p> are not matched here
    text=re.sub("<!--?.*?-->","",text)
    
    # remove special characters and digits
    text=re.sub("(\\d|\\W)+"," ",text)
    
    return text
 
df_idf['text'] = df_idf['title'] + df_idf['body']
df_idf['text'] = df_idf['text'].apply(lambda x:pre_process(x))
 

# show the cleaned text at index 2 (the third document) as a sanity check
txt = df_idf['text'][2]
txt = textwrap.fill(txt, width=100)
print(txt)

gradle command line p i m trying to run a shell script with gradle i currently have something like
this p pre code def test project tasks create test exec commandline bash c bash c my file dir script
sh code pre p the problem is that i cannot run this script because i have spaces in my dir name i
have tried everything e g p pre code commandline bash c bash c my file dir script sh tokenize
commandline bash c bash c my file dir script sh commandline bash c new stringbuilder append bash
append c my file dir script sh commandline bash c bash c my file dir script sh file dir file c my
file dir script sh commandline bash c bash dir getabsolutepath code pre p im using windows bit and
if i use a path without spaces the script runs perfectly therefore the only issue as i can see is
how gradle handles spaces p
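
The stray `p`, `pre`, and `code` tokens in the output above come from HTML tags. The first pattern only matches comment-style markup (`<!-- ... -->`), while the second replaces `<`, `>`, and `/` with spaces and leaves the tag names behind. A small sketch on a made-up string (the sample text is hypothetical) shows the effect:

```python
import re

def pre_process(text):
    # same three steps as the notebook cell above
    text = text.lower()
    text = re.sub("<!--?.*?-->", "", text)      # drops HTML comments only
    text = re.sub("(\\d|\\W)+", " ", text)      # digits and non-word runs become one space
    return text

# Hypothetical input containing a comment, two tags, a digit, and punctuation.
sample = "<p>Run <!-- hidden note --> script v2!</p>"
cleaned = pre_process(sample)

# The comment text is gone, but the tag name "p" survives as a token.
print(cleaned.split())
```

If tag names are unwanted in the vocabulary, one option would be to strip tags before this step; the notebook does not do so, and single-letter tokens such as `p` are in any case ignored by `CountVectorizer`'s default tokenizer, which keeps only tokens of two or more characters.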


### Creating Vocabulary and Word Counts for IDF
Loads stop words, processes a set of documents to create a word count matrix while filtering out common terms, and sets up the data for further analysis, such as TF-IDF computation.

In [137]:
from sklearn.feature_extraction.text import CountVectorizer

def get_stop_words(stop_file_path):
    """load stop words """

    with open(stop_file_path, 'r', encoding="utf-8") as f:
        stopwords = f.readlines()
        stop_set = set(m.strip() for m in stopwords)
        # return a list rather than frozenset(stop_set):
        # recent scikit-learn versions expect stop_words to be a list
        return list(stop_set)

# load a set of stop words
stopwords= get_stop_words("data/tf-idf-data/resources/stopwords.txt")

# get the text column 
docs = df_idf['text'].tolist()

# create a vocabulary of words,
# ignore words that appear in more than 85% of documents,
# eliminate stop words, and keep at most the 10,000 most frequent terms
cv=CountVectorizer(max_df=0.85, stop_words=stopwords, max_features=10000)
word_count_vector=cv.fit_transform(docs)

list(cv.vocabulary_.keys())[:10]



['serializing',
 'private',
 'struct',
 'public',
 'class',
 'contains',
 'properties',
 'string',
 'serialize',
 'attempt']
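
To make the `max_df` and `stop_words` filters concrete, here is a plain-Python sketch of the same idea on a toy corpus. It is not scikit-learn's implementation: the documents and the stop list are hypothetical, and the sketch omits the tokenizer and the `max_features` cap.

```python
from collections import Counter

# Hypothetical toy corpus, already lowercased and cleaned.
docs = [
    "the build fails in gradle",
    "the gradle script runs",
    "the maven build works",
    "the script fails",
]
stop_words = {"in"}   # hypothetical stop list
max_df = 0.85         # same threshold as the notebook

# Document frequency: number of documents containing each term at least once.
doc_freq = Counter(term for doc in docs for term in set(doc.split()))

# Keep terms that are not stop words and appear in at most 85% of documents.
vocabulary = sorted(
    term for term, df in doc_freq.items()
    if term not in stop_words and df / len(docs) <= max_df
)
print(vocabulary)
```

Here `the` appears in every document, so it exceeds the threshold and is dropped even though it is not in the stop list.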

### Transformer to Compute Inverse Document Frequency (IDF)

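- Fits a `TfidfTransformer` on the training word counts so that it learns an IDF weight for each vocabulary term. The fitted transformer can then convert raw word counts into TF-IDF scores, which emphasize terms that are frequent in one document but rare across the whole document set.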

In [138]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(word_count_vector)
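
With `smooth_idf=True`, scikit-learn documents the weight as idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. Assuming the default `norm='l2'`, each document vector is then scaled to unit length. The following plain-Python sketch works through that arithmetic with hypothetical counts; it illustrates the formula and is not the library's code.

```python
import math

n_docs = 4
# Hypothetical document frequencies for three vocabulary terms.
doc_freq = {"gradle": 2, "script": 2, "maven": 1}

def smooth_idf(df, n):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n) / (1 + df)) + 1

idf = {term: smooth_idf(df, n_docs) for term, df in doc_freq.items()}

# Raw term counts for one hypothetical document.
counts = {"gradle": 2, "maven": 1}

# Term frequency times IDF, then L2 normalization of the whole vector.
weighted = {term: count * idf[term] for term, count in counts.items()}
norm = math.sqrt(sum(w * w for w in weighted.values()))
tfidf = {term: w / norm for term, w in weighted.items()}

print({term: round(score, 3) for term, score in tfidf.items()})
```

The rarer term (`maven`) gets the larger IDF, but `gradle` still scores higher in this document because it occurs twice.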

### Computing TF-IDF and Extracting Keywords

- Reads the test dataset, concatenates title and body, applies the same pre-processing as the training data, and converts the result into a list of documents from which keywords are extracted.

In [139]:
# read test docs into a dataframe and concatenate title and body
df_test = pd.read_json("data/tf-idf-data/data/stackoverflow-test.json",lines=True)
df_test['text'] = df_test['title'] + df_test['body']
df_test['text'] = df_test['text'].apply(lambda x:pre_process(x))

# get test docs into a list
docs_test = df_test['text'].tolist()

In [140]:
def sort_coo(coo_matrix):
    """sorts the values in the vector while preserving the column index"""
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)
 
def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """get the feature names and tf-idf score of top n items"""
    
    #use only topn items from vector
    sorted_items = sorted_items[:topn]
 
    score_vals = []
    feature_vals = []
    
    # word index and corresponding tf-idf score
    for idx, score in sorted_items:
        
        #keep track of feature name and its corresponding score
        score_vals.append(round(score, 3))
        feature_vals.append(feature_names[idx])
 
    # map each feature name to its score (dicts keep insertion order, so the ranking is preserved)
    results = {}
    for idx in range(len(feature_vals)):
        results[feature_vals[idx]]=score_vals[idx]
    
    return results


# mapping of indices
feature_names = cv.get_feature_names_out()

# get the document that we want to extract keywords from
doc = docs_test[0]
doc = textwrap.fill(doc, width=100)

# generate tf-idf for given document
tf_idf_vector = tfidf_transformer.transform(cv.transform([doc]))

# sort the tf-idf vectors by descending order of scores
sorted_items = sort_coo(tf_idf_vector.tocoo())

# extract only the top n; n here is 10
keywords = extract_topn_from_vector(feature_names,sorted_items, 10)

# find maximum key length
max_key_length = max(len(k) for k in keywords)

# now print the results
print("\n=====Doc=====")
print(doc)
print("\n===Keywords===")
for k in keywords:
    print(f"{k:<{max_key_length}} {keywords[k]}")


=====Doc=====
integrate war plugin for m eclipse into eclipse project p i set up a small web project with jsf and
maven now i want to deploy on a tomcat server is there a possibility to automate that like a button
in eclipse that automatically deploys the project to tomcat p p i read about a the a href http maven
apache org plugins maven war plugin rel nofollow noreferrer maven war plugin a but i couldn t find a
tutorial how to integrate that into my process eclipse m eclipse p p can you link me to help or try
to explain it thanks p

===Keywords===
eclipse     0.49
maven       0.451
war         0.393
plugin      0.265
integrate   0.233
tomcat      0.223
project     0.197
automate    0.13
jsf         0.125
possibility 0.121
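
The ranking step relies only on the `.col` and `.data` attributes of the COO matrix. The sketch below reuses `sort_coo` with a simple stand-in object instead of a real SciPy matrix; the column indices, scores, and feature names are hypothetical.

```python
from types import SimpleNamespace

def sort_coo(coo_matrix):
    """sorts the values in the vector while preserving the column index"""
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

# Stand-in for a one-row COO matrix: only .col and .data are read.
fake_row = SimpleNamespace(col=[7, 2, 5, 9], data=[0.2, 0.5, 0.2, 0.1])

# Hypothetical mapping from column index to vocabulary term.
feature_names = {2: "eclipse", 5: "maven", 7: "war", 9: "jsf"}

ranked = sort_coo(fake_row)
print([(feature_names[idx], score) for idx, score in ranked])
```

Because the sort key is `(score, index)` in descending order, terms with equal scores are ordered by the higher column index first.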


In [141]:
def get_top_n_keywords_from_Nth_doc(N, n):
    """Extracts and prints the top n keywords from the Nth test document based on TF-IDF scores.
       Parameters:
       N (int): The position of the document in docs_test (1-based index).
       n (int): The number of top keywords to print.
    """
    # convert the 1-based position to a 0-based list index
    N = N - 1
    # Nth document
    N_doc = docs_test[N]

    # generate tf-idf for given document
    N_tf_idf_vector = tfidf_transformer.transform(cv.transform([N_doc]))

    # sort the tf-idf vectors by descending order of scores
    N_sorted_items = sort_coo(N_tf_idf_vector.tocoo())

    # extract only the top n
    N_keywords = extract_topn_from_vector(feature_names, N_sorted_items, n)

    # find maximum key length
    N_max_key_length = max(len(k) for k in N_keywords)

    # print results
    # single quotes inside the f-string keep this valid on Python versions before 3.12
    print(f"{'Keyword':<{N_max_key_length}} | TF-IDF")
    print("-"*20)
    for k in N_keywords:
        print(f"{k:<{N_max_key_length}} | {N_keywords[k]}")
    

### Questions

1. How many samples in the training set?

In [142]:
# df_idf is the dataframe from stackoverflow-data-idf.json
print(f"Number of samples in the training set: {len(df_idf)} samples.")

Number of samples in the training set: 20000 samples.


2. How many samples in the testing set?

In [143]:
# df_test is the dataframe from stackoverflow-test.json
print(f"Number of samples in the testing set: {len(df_test)} samples.")

Number of samples in the testing set: 500 samples.


3. What is this dataset?<br>
It is a dataset of question content and metadata from the technical help website Stack Overflow. <br><br>Specifically, it consists of the following data:
    - Questions: User-generated questions about programming and software development, each stored as a title and an HTML body. In the schema above, answers appear only as metadata (`answer_count`, `accepted_answer_id`).
    - Metadata: The dataset includes metadata such as:
        - User IDs
        - Timestamps of when questions were created, last edited, and last active
        - Tags related to the questions (e.g., programming languages, frameworks)
        - Scores, view counts, answer counts, and comment counts
    - Text Data: The primary focus is the text content of the questions, which can be analyzed for tasks like topic modeling, classification, or sentiment analysis.

In [144]:
df_idf.head()

Unnamed: 0,id,title,body,answer_count,comment_count,creation_date,last_activity_date,last_editor_display_name,owner_display_name,owner_user_id,post_type_id,score,tags,view_count,accepted_answer_id,favorite_count,last_edit_date,last_editor_user_id,community_owned_date,text
0,4821394,Serializing a private struct - Can it be done?,<p>I have a public class that contains a priva...,1,0,2011-01-27 20:19:13.563 UTC,2011-01-27 20:21:37.59 UTC,,,163534.0,1,0,c#|serialization|xml-serialization,296,,,,,,serializing a private struct can it be done p ...
1,3367882,How do I prevent floated-right content from ov...,<p>I have the following HTML:</p>\n\n<pre><cod...,2,2,2010-07-30 00:01:50.9 UTC,2012-05-10 14:16:05.143 UTC,,,1190.0,1,2,css|overflow|css-float|crop,4121,3367943.0,0.0,2012-05-10 14:16:05.143 UTC,44390.0,,how do i prevent floated right content from ov...
2,31682135,Gradle command line,<p>I'm trying to run a shell script with gradl...,0,2,2015-07-28 16:30:18.28 UTC,2015-07-28 16:32:15.117 UTC,,,1299158.0,1,1,bash|shell|android-studio|gradle,259,,,,,,gradle command line p i m trying to run a shel...
3,20218536,Loop variable as parameter in asynchronous fun...,<p>I have an object with the following form.</...,1,1,2013-11-26 13:34:49.957 UTC,2013-11-26 15:07:50.8 UTC,,,642751.0,1,0,javascript|asynchronous|foreach|async.js,120,,1.0,2013-11-26 15:02:47.993 UTC,1333873.0,,loop variable as parameter in asynchronous fun...
4,19941459,Canot get the href value,<p>Hi I need to valid the href is empty or not...,5,1,2013-11-12 22:41:36.11 UTC,2013-11-12 23:48:34.67 UTC,,,819774.0,1,0,javascript,97,19941620.0,,2013-11-12 22:43:42.97 UTC,21886.0,,canot get the href value p hi i need to valid ...


4. What are the top 10 keywords and their TF-IDF scores from the test set sample 1 (the document about eclipse project)?

In [145]:
eclipse_project_doc = 1
get_top_n_keywords_from_Nth_doc(eclipse_project_doc, 10)

Keyword     | TF-IDF
--------------------
eclipse     | 0.49
maven       | 0.451
war         | 0.393
plugin      | 0.265
integrate   | 0.233
tomcat      | 0.223
project     | 0.197
automate    | 0.13
jsf         | 0.125
possibility | 0.121


5. How about sample 2 (the one about phantomjs)?

In [146]:
phantomjs_doc = 2
get_top_n_keywords_from_Nth_doc(phantomjs_doc, 10)

Keyword  | TF-IDF
--------------------
evaluate | 0.474
content  | 0.403
console  | 0.281
log      | 0.265
function | 0.215
promise  | 0.2
return   | 0.195
wait     | 0.169
let      | 0.163
resolve  | 0.156


6. How about sample 3 (the one about dynamic operations)?

In [147]:
dynamic_operations_doc = 3
get_top_n_keywords_from_Nth_doc(dynamic_operations_doc, 10)

Keyword     | TF-IDF
--------------------
appdomain   | 0.41
dynamic     | 0.384
performed   | 0.332
operations  | 0.297
targeting   | 0.199
trust       | 0.182
net         | 0.182
project     | 0.179
stating     | 0.178
expressions | 0.167
