# HW4: Natural Language Processing

 <div class="alert alert-block alert-warning">Each assignment needs to be completed independently. Never ever copy others' work (even with minor modification, e.g. changing variable names). Anti-Plagiarism software will be used to check all submissions. </div>

## Problem Description

In this assignment, we'll use what we learned in preprocessing module to compare ChatGPT-generated text with human-generated answers. A dataset with 200 questions and answers has been provided for you to use. The dataset can be found at https://huggingface.co/datasets/Hello-SimpleAI/HC3.


Please follow the instruction below to do the assessment step by step and answer all analysis questions.


In [1]:
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
import string

import pandas as pd
import spacy
import nltk

import numpy as np
from sklearn.preprocessing import normalize

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
data = pd.read_csv("qa.csv")
data.head()

Unnamed: 0,question,chatgpt_answer,human_answer
0,What happens if a parking ticket is lost / des...,If a parking ticket is lost or destroyed befor...,In my city you also get something by mail to t...
1,"why the waves do n't interfere ? first , I 'm ...",Interference is the phenomenon that occurs whe...,They do actually . That 's why a microwave ove...
2,Is it possible to influence a company's action...,"Yes, it is possible to influence a company's a...",Yes and no. This really should be taught at ju...
3,Why do taxpayers front the bill for sports sta...,Sports stadiums are usually built with public ...,That 's the bargaining chip that team owners u...
4,Why do clothing stores generally have a ton of...,There are a few reasons why clothing stores ma...,Your observation is almost certainly a matter ...


## Q1. Tokenize function

Define a function `tokenize(docs, lemmatized = True, remove_stopword = True, remove_punct = True)`  as follows:
   - Take three parameters: 
       - `docs`: a list of documents (e.g. questions)
       - `lemmatized`: an optional boolean parameter to indicate if tokens are lemmatized. The default value is True (i.e. tokens are lemmatized).
       - `remove_stopword`: an optional bookean parameter to remove stop words. The default value is True (i.e. remove stop words). 
   - Split each input document into unigrams and also clean up tokens as follows:
       - if `lemmatized` is turned on, lemmatize all unigrams.
       - if `remove_stopword` is set to True, remove all stop words.
       - if `remove_punct` is set to True, remove all punctuation tokens.
       - remove all empty tokens and lowercase all the tokens.
   - Return the list of tokens obtained for each document after all the processing. 
   
(Hint: you can use spacy package for this task. For reference, check https://spacy.io/api/token#attributes)

In [3]:
def tokenize_a_doc(doc, nlp, lemmatized=True, remove_stopword=True, remove_punct=True): 
    clean_tokens = []
    # load current doc into spacy nlp model
    chunks = doc.split("\\n")
    for chunk in chunks:
        doc = nlp(chunk)
    
        # clean either lemmatized unigrams or unmodified doc tokens
        if lemmatized:
            clean_tokens += [token.lemma_.lower() for token in doc            # using spacy nlp params, skip token if:
                            if (not remove_stopword or not token.is_stop)     # it is a stopword and remove_stopwords = True
                            and (not remove_punct or not token.is_punct)      # it is punctuation and remove_punct = True
                            and not token.lemma_.isspace()]                   # it is whitespace
        else:
            clean_tokens += [token.text.lower() for token in doc 
                            if (not remove_stopword or not token.is_stop) 
                            and (not remove_punct or not token.is_punct) 
                            and not token.text.isspace()]
        
    return clean_tokens

def tokenize(docs, lemmatized=True, remove_stopword=True, remove_punct=True):
    # load in spacy NLP model and disable unused pipelines to reduce processing time/memory space
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    nlp.add_pipe("sentencizer")
    # tokenize each doc in the corpus using specified params for lemmatization and removal conditions
    tokens = [tokenize_a_doc(doc, nlp, lemmatized, remove_stopword, remove_punct) for doc in docs]
    
    return tokens

Test your function with different parameter configuration and observe the differences in the resulting tokens.

In [4]:
# For simplicity, We will test on document

print(data["question"].iloc[0] + "\n")

print(f"1.lemmatized=True, remove_stopword=False, remove_punct = True:\n \
{tokenize(data['question'].iloc[0:1], lemmatized=True, remove_stopword=False, remove_punct = True)}\n")

print(f"2.lemmatized=True, remove_stopword=True, remove_punct = True:\n \
{tokenize(data['question'].iloc[0:1], lemmatized=True, remove_stopword=True, remove_punct = True)}\n")

print(f"3.lemmatized=False, remove_stopword=False, remove_punct = True:\n \
{tokenize(data['question'].iloc[0:1], lemmatized=False, remove_stopword=False, remove_punct = True)}\n")

print(f"4.lemmatized=False, remove_stopword=False, remove_punct = False:\n \
{tokenize(data['question'].iloc[0:1], lemmatized=False, remove_stopword=False, remove_punct = False)}\n")

What happens if a parking ticket is lost / destroyed before the owner is aware of the ticket , and it goes unpaid ? I 've always been curious . Please explain like I'm five.

1.lemmatized=True, remove_stopword=False, remove_punct = True:
 [['what', 'happen', 'if', 'a', 'parking', 'ticket', 'be', 'lose', 'destroy', 'before', 'the', 'owner', 'be', 'aware', 'of', 'the', 'ticket', 'and', 'it', 'go', 'unpaid', 'i', 've', 'always', 'be', 'curious', 'please', 'explain', 'like', 'i', 'be', 'five']]

2.lemmatized=True, remove_stopword=True, remove_punct = True:
 [['happen', 'parking', 'ticket', 'lose', 'destroy', 'owner', 'aware', 'ticket', 'go', 'unpaid', 've', 'curious', 'explain', 'like']]

3.lemmatized=False, remove_stopword=False, remove_punct = True:
 [['what', 'happens', 'if', 'a', 'parking', 'ticket', 'is', 'lost', 'destroyed', 'before', 'the', 'owner', 'is', 'aware', 'of', 'the', 'ticket', 'and', 'it', 'goes', 'unpaid', 'i', 've', 'always', 'been', 'curious', 'please', 'explain', 'like

## Q2. Sentiment Analysis


Let's check if there is any difference in sentiment between ChatGPT-generated and human-generated answers.


Define a function `compute_sentiment(generated, reference, pos, neg )` as follows:
- take four parameters:
    - `gen_tokens` is the tokenized ChatGPT-generated answers by the `tokenize` function in Q1.
    - `ref_tokens` is the tokenized human-generated answers by the `tokenize` function in Q1.
    - `pos` (`neg`) is the lists of positive (negative) words, which can be find in Canvas preprocessing module.
- for each ChatGPT-generated or human-generated answer, compute the sentiment as `(#pos - #neg )/(#pos + #neg)`, where `#pos`(`#neg`) is the number of positive (negative) words found in each answer. If an answer contains none of the positive or negative words, set the sentiment to 0.
- return the sentiment of ChatGPT-generated and human-generated answers as two columns of DataFrame.


Analysis: 
- Try different tokenization parameter configurations (lemmatized, remove_stopword, remove_punct), and observe how sentiment results change.
- Do you think, in general, which tokenization configuration should be used? Why does this combination make the most senese?
- Do you think, overall, ChatGPT-generated answers are more posive or negative than human-generated ones? Use data to support your conclusion.


In [5]:
def sent(target, pos, neg):
    p = sum(1 for word in target if word in pos)
    n = sum(1 for word in target if word in neg)
    if p + n != 0:
        sentiment = (p - n) / (p + n)
    else:
        sentiment = 0
    return sentiment

def compute_sentiment(gen_tokens, ref_tokens, pos, neg):
    
    tokens = lambda token_list: [sent(sublist, pos, neg) for sublist in token_list]
    result = pd.DataFrame({'gen_sentiment': tokens(gen_tokens), 'ref_sentiment': tokens(ref_tokens)})
    return result

In [6]:
gen_tokens = tokenize(data["chatgpt_answer"], lemmatized=False, remove_stopword=False, remove_punct = False)
ref_tokens = tokenize(data["human_answer"], lemmatized=False, remove_stopword=False, remove_punct = False)

In [7]:
pos = pd.read_csv("positive-words.txt", header = None)
pos.head()

neg = pd.read_csv("negative-words.txt", header = None)
neg.head()

Unnamed: 0,0
0,a+
1,abound
2,abounds
3,abundance
4,abundant


Unnamed: 0,0
0,2-faced
1,2-faces
2,abnormal
3,abolish
4,abominable


In [8]:
result = compute_sentiment(gen_tokens, 
                           ref_tokens, 
                           pos[0].values,
                           neg[0].values)
result.head()

Unnamed: 0,gen_sentiment,ref_sentiment
0,0.0,-0.5
1,-0.777778,0.076923
2,0.666667,0.2
3,1.0,0.2
4,0.6,-0.333333


In [9]:
from scipy.stats import wilcoxon

(result['gen_sentiment'] - result['ref_sentiment']).mean()

res = wilcoxon(result['gen_sentiment'] - result['ref_sentiment'], alternative='greater')
res.statistic, res.pvalue

0.1462586239453829

(10279.5, 0.0011456573663914912)

In [9]:
from scipy.stats import wilcoxon

(result['gen_sentiment'] - result['ref_sentiment']).mean()

res = wilcoxon(result['gen_sentiment'] - result['ref_sentiment'], alternative='greater')
res.statistic, res.pvalue

0.14665715815970656

(10403.0, 0.0010660004805700114)

## Q3: Performance Evaluation


Next, we evaluate how accurate the ChatGPT-generated answers are, compared to the human-generated answers. One widely used method is to calculate the `precision` and `recall` of n-grams. For simplicity, we only calculate bigrams here. You can try unigram, trigram, or n-grams in the same way.


Define a funtion `bigram_precision_recall(gen_tokens, ref_tokens)` as follows:
- take two parameters:
    - `gen_tokens` is the tokenized ChatGPT-generated answers by the `tokenize` function in Q1.
    - `ref_tokens` is the tokenized human answers by the `tokenize` function in Q1.
- generate bigrams from each tokenized document in `gen_tokens` and `ref_tokens`
- for each pair of ChatGPT-generated and human answers, find the overlapping bigrams between them
- compute `precision` as the number of overlapping bigrams divided by the total number of bigrams from the ChatGPT-generated answer. In other words, the bigram is considered as a predicted value. The `precision` measures the percentage of correctly generated bigrams out of all generated bigrams.
- compute `recall` as the number of overlapping bigrams divided by the total number of bigrams from the human answer. In other words, the `recall` measures the percentage of bigrams from the human answer can be successfully retrieved.
- return the precision and recall for each pair of answers.


Analysis: 
- Try different tokenization parameter configurations (lemmatized, remove_stopword, remove_punct), and observe how precison and recall change.
- Do you think, in general, which tokenization configuration should be used? Why does this combination make the most senese?
- Do you think, overall, ChatGPT is able to mimic human in answering these questions?



In [10]:
def bigram_precision_recall(gen_tokens, ref_tokens):
    result = pd.DataFrame(columns = ['overlapping','precision','recall'])
    
    gen_bigrams = [list(nltk.bigrams(tokens)) for tokens in gen_tokens]
    ref_bigrams = [list(nltk.bigrams(tokens)) for tokens in ref_tokens]

    bigrams = list(zip(gen_bigrams, ref_bigrams))

    overlapping = []
    precision = []
    recall = []
    for gen, ref in bigrams:
        overlap = [tup1 for tup1 in gen for tup2 in ref if tup1 == tup2]
        overlapping.append(list(set(overlap)))

        precision.append(len(overlap)/len(gen))
        recall.append(len(overlap)/len(ref))

    result['overlapping'] = overlapping
    result['precision'] = precision
    result['recall'] = recall
    
    return result

In [11]:
result = bigram_precision_recall(gen_tokens, 
                                 ref_tokens)
result.head()

Unnamed: 0,overlapping,precision,recall
0,"[(to, pay), (it, goes)]",0.016807,0.042553
1,"[(radio, stations), (can, cancel), (out, of), ...",0.065421,0.03139
2,"[(shareholders, to), (to, influence), (to, vot...",0.230216,0.096677
3,"[(be, the), (to, be), (the, local)]",0.017647,0.051724
4,"[(result, ,), (., if), (a, result), (as, a), (...",0.028571,0.072165


In [12]:
result[["precision", "recall"]].mean(axis = 0)

precision    0.110050
recall       0.165531
dtype: float64

In [11]:
result = bigram_precision_recall(gen_tokens, 
                                 ref_tokens)
result.head()

Unnamed: 0,overlapping,precision,recall
0,"[(it, goes), (to, pay)]",0.016807,0.042553
1,"[(can, cancel), (out, of), (radio, stations), ...",0.033333,0.015695
2,"[(to, influence), (a, company), (to, vote), (t...",0.143885,0.060423
3,"[(to, be), (the, local), (be, the)]",0.017647,0.051724
4,"[(., as), (as, a), (a, result), (result, ,), (...",0.028571,0.072165


In [12]:
result[["precision", "recall"]].mean(axis = 0)

precision    0.074530
recall       0.132274
dtype: float64

## Q4 Compute TF-IDF

Define a function `compute_tf_idf(tokenized_docs)` as follows: 
- Take paramter `tokenized_docs`, i.e., a list of tokenized documents by `tokenize` function in Q1
- Calculate tf_idf weights as shown in lecture notes (Hint: feel free to reuse the code segment in Lecture Notes (II))
- Return the smoothed normalized `tf_idf` array, where each row stands for a document and each column denotes a word. 

In [60]:
docs_tokens = {idx:nltk.FreqDist(tokens) for idx,tokens in enumerate(corpus)}
dtm=pd.DataFrame.from_dict(docs_tokens, orient="index").fillna(0).sort_index(axis = 0)
tf=dtm.values
tf

array([[1., 1., 1., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 1.]])

In [50]:
def compute_tfidf(docs):
    # step 2. process all documents to get list of token list
    docs_tokens = {idx:nltk.FreqDist(tokens) for idx,tokens in enumerate(docs)}

    # step 3. get document-term matrix
    dtm=pd.DataFrame.from_dict(docs_tokens, orient="index").fillna(0).sort_index(axis = 0)
      
    # step 4. get normalized term frequency (tf) matrix        
    tf=dtm.values
    doc_len=tf.sum(axis=1, keepdims=True)
    tf=np.divide(tf, doc_len)
    
    # step 5. get idf
    df=np.where(tf>0,1,0)
    #idf=np.log(np.divide(len(docs), \
    #    np.sum(df, axis=0)))+1

    smoothed_idf=np.log(np.divide(len(docs)+1, np.sum(df, axis=0)+1))+1    
    smoothed_tf_idf=normalize(tf*smoothed_idf)
    
    return smoothed_tf_idf

Try different tokenization options to see how these options affect TFIDF matrix:

In [51]:
# Test tfidf generation using questions

question_tokens = tokenize(data["question"], lemmatized=True, remove_stopword=False, remove_punct = True)
dtm = compute_tfidf(question_tokens)
print(f"1.lemmatized=True, remove_stopword=False, remove_punct = True\n \
Shape: {dtm.shape}\n")

question_tokens = tokenize(data["question"], lemmatized=True, remove_stopword=True, remove_punct = True)
dtm = compute_tfidf(question_tokens)
print(f"2.lemmatized=True, remove_stopword=True, remove_punct = True:\n \
Shape: {dtm.shape}\n")

question_tokens = tokenize(data["question"], lemmatized=False, remove_stopword=False, remove_punct = True)
dtm = compute_tfidf(question_tokens)
print(f"3.lemmatized=False, remove_stopword=False, remove_punct = True:\n \
Shape: {dtm.shape}\n")

question_tokens = tokenize(data["question"], lemmatized=False, remove_stopword=False, remove_punct = False)
dtm = compute_tfidf(question_tokens)
print(f"4.lemmatized=False, remove_stopword=False, remove_punct = False:\n \
Shape: {dtm.shape}\n")

1.lemmatized=True, remove_stopword=False, remove_punct = True
 Shape: (200, 1438)

2.lemmatized=True, remove_stopword=True, remove_punct = True:
 Shape: (200, 1271)

3.lemmatized=False, remove_stopword=False, remove_punct = True:
 Shape: (200, 1643)

4.lemmatized=False, remove_stopword=False, remove_punct = False:
 Shape: (200, 1665)



In [14]:
# Test tfidf generation using questions

question_tokens = tokenize(data["question"], lemmatized=True, remove_stopword=False, remove_punct = True)
dtm = compute_tfidf(question_tokens)
print(f"1.lemmatized=True, remove_stopword=False, remove_punct = True:\n \
Shape: {dtm.shape}\n")

question_tokens = tokenize(data["question"], lemmatized=True, remove_stopword=True, remove_punct = True)
dtm = compute_tfidf(question_tokens)
print(f"2.lemmatized=True, remove_stopword=True, remove_punct = True:\n \
Shape: {dtm.shape}\n")

question_tokens = tokenize(data["question"], lemmatized=False, remove_stopword=False, remove_punct = True)
dtm = compute_tfidf(question_tokens)
print(f"3.lemmatized=False, remove_stopword=False, remove_punct = True:\n \
Shape: {dtm.shape}\n")

question_tokens = tokenize(data["question"], lemmatized=False, remove_stopword=False, remove_punct = False)
dtm = compute_tfidf(question_tokens)
print(f"4.lemmatized=False, remove_stopword=False, remove_punct = False:\n \
Shape: {dtm.shape}\n")


1.lemmatized=True, remove_stopword=False, remove_punct = True
 Shape: (200, 1435)

2.lemmatized=True, remove_stopword=True, remove_punct = True:
 Shape: (200, 1269)

3.lemmatized=False, remove_stopword=False, remove_punct = True:
 Shape: (200, 1643)

4.lemmatized=False, remove_stopword=False, remove_punct = False:
 Shape: (200, 1665)



## Q5. Assess similarity. 


Define a function `assess_similarity(question_tokens, gen_tokens, ref_tokens)`  as follows: 
- Take three inputs:
   - `question_tokens`: tokenized questions by `tokenize` function in Q1
   - `gen_tokens`: tokenized ChatGPT-generated answers by `tokenize` function in Q1
   - `ref_tokens`: tokenized human answers by `tokenize` function in Q1
- Concatenate these three token lists into a single list to form a corpus
- Calculate the smoothed normalized tf_idf matrix for the concatenated list using the `compute_tfidf` function defined in Q3.
- Split the tf_idf matrix into sub-matrices corresponding to `question_tokens`, `gen_tokens`, and `ref_tokens` respectively
- For each question, find its similarities to the paired ChatGPT-generated answer and human answer.
- For each pair of ChatGPT-generated answer and human answer, find their similarity
- Print out the following:
    - the question which has the largest similarity to the ChatGPT-generated answer.
    - the question which has the largest similarity to the human answer.
    - the pair of ChatGPT-generated and human answers which have the largest similarity.
- Return a DataFrame with the three columns for the similarities among questions and answers.



Analysis: 
- Try different tokenization parameter configurations (lemmatized, remove_stopword, remove_punct), and observe how similarities change.
- Based on similarity, do you think ChatGPT-generate answers are more relevant to questions than human answers?

In [25]:
# package to calculate distance
from sklearn.metrics import pairwise_distances

In [69]:
similarity=1-pairwise_distances(tfidf_question, tfidf_gen, metric = 'cosine')
similarity

# find top doc similar to the first one
# Note the diagonal value is 1, which is the largest

np.argsort(similarity)[:,::-1][0,0:2]


array([[0.57054916, 0.04873125, 0.05435993, ..., 0.05525877, 0.04119335,
        0.06083295],
       [0.06837583, 0.53589851, 0.07290558, ..., 0.09236547, 0.05481382,
        0.10305691],
       [0.1187486 , 0.01299878, 0.46451491, ..., 0.08568355, 0.0288882 ,
        0.04995915],
       ...,
       [0.04069226, 0.02958631, 0.03604291, ..., 0.33750164, 0.03119483,
        0.08137021],
       [0.0439228 , 0.05876601, 0.06815987, ..., 0.06023432, 0.50735979,
        0.06542159],
       [0.10568441, 0.07956869, 0.08638208, ..., 0.08546416, 0.1024627 ,
        0.47311868]])

array([  0, 177], dtype=int64)

In [48]:
# For better visualization, let's make the tf-idf array a dataframe
pd.options.display.float_format = '{:,.2f}'.format # set format for float

pd.DataFrame(tf_idf)
# the dtm dataframe we created in Step 3 has each word as a column

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,7341,7342,7343,7344,7345,7346,7347,7348,7349,7350
0,0.10,0.19,0.09,0.06,0.26,0.49,0.24,0.22,0.27,0.15,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
1,0.08,0.00,0.00,0.00,0.00,0.00,0.13,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
2,0.00,0.00,0.00,0.12,0.00,0.00,0.13,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
3,0.00,0.00,0.00,0.07,0.00,0.00,0.15,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
4,0.00,0.00,0.00,0.10,0.00,0.00,0.05,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
595,0.09,0.00,0.04,0.09,0.00,0.00,0.08,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
596,0.00,0.00,0.00,0.07,0.00,0.00,0.03,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
597,0.00,0.00,0.00,0.09,0.00,0.00,0.02,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
598,0.00,0.00,0.00,0.08,0.00,0.00,0.00,0.00,0.00,0.07,...,0.13,0.13,0.26,0.13,0.13,0.13,0.13,0.13,0.00,0.00


In [67]:
len(similarity)

200

In [61]:
# Flatten the nested list
# Concatenate the three lists into a single list
corpus = []
corpus.extend(question_tokens)
corpus.extend(gen_tokens)
corpus.extend(ref_tokens)
tf_idf = compute_tfidf(corpus)

n_question_tokens = len(question_tokens)
n_gen_tokens = len(gen_tokens)
n_ref_tokens = len(ref_tokens)

tfidf_question = tf_idf[:n_question_tokens]
tfidf_gen = tf_idf[n_question_tokens:n_question_tokens + n_gen_tokens]
tfidf_ref = tf_idf[-n_ref_tokens:]



In [33]:
from sklearn.metrics.pairwise import cosine_similarity

def assess_similarity(question_tokens, gen_tokens, ref_tokens):

    # Concatenate the three lists into a single list
    corpus = []
    corpus.extend(question_tokens)
    corpus.extend(gen_tokens)
    corpus.extend(ref_tokens)
    
    tf_idf = compute_tfidf(corpus)
    # Split the tf-idf matrix
    n_question_tokens = len(question_tokens)
    n_gen_tokens = len(gen_tokens)
    n_ref_tokens = len(ref_tokens)
    
    tfidf_question_tokens = tf_idf[:n_question_tokens]
    tfidf_gen_tokens = tf_idf[n_question_tokens:n_question_tokens + n_gen_tokens]
    tfidf_ref_tokens = tf_idf[-n_ref_tokens:]

    # Calculate the similarities
    sim_question_gen = cosine_similarity(tfidf_question_tokens, tfidf_gen_tokens)
    sim_question_ref = cosine_similarity(tfidf_question_tokens, tfidf_ref_tokens)
    sim_gen_ref = cosine_similarity(tfidf_gen_tokens, tfidf_ref_tokens)

    # Find the maximum similarities
    max_sim_question_gen = np.max(sim_question_gen, axis=1)
#     max_sim_question_gen_idx = np.argmax(sim_question_gen, axis=1)
    max_sim_question_ref = np.max(sim_question_ref, axis=1)
#     max_sim_question_ref_idx = np.argmax(sim_question_ref, axis=1)
    max_sim_gen_ref = np.max(sim_gen_ref)
#     max_sim_gen_ref_idx = np.unravel_index(np.argmax(sim_gen_ref, axis=None), sim_gen_ref.shape)
    
    # Return a DataFrame
    result = pd.DataFrame({
        'question_gen_sim': max_sim_question_gen,
        'question_ref_sim': max_sim_question_ref,
        'gen_ref_sim': max_sim_gen_ref
    })
    return result
    
        
    

In [38]:
result = assess_similarity(question_tokens, gen_tokens, ref_tokens)
result.describe()

Unnamed: 0,sim_question_gen,sim_question_ref,sim_gen_ref
count,200.0,200.0,200.0
mean,0.306688,0.210218,0.6801671
std,0.116436,0.103345,1.780814e-15
min,0.082671,0.082353,0.6801671
25%,0.215031,0.138806,0.6801671
50%,0.293052,0.173668,0.6801671
75%,0.387031,0.246901,0.6801671
max,0.742086,0.675172,0.6801671


In [37]:
data.iloc[30]

question          Where to find historical quotes for the Dow Jo...
chatgpt_answer    You can find historical quotes for the Dow Jon...
human_answer      A number of places.  First, fast and cheap, yo...
Name: 30, dtype: object

In [16]:
result = assess_similarity(question_tokens, gen_tokens, ref_tokens)
result.head()

Question with the largest similarity to the ChatGPT-generated answer:
 Question: Where to find historical quotes for the Dow Jones Global Total Stock Market Index? 
 ChatGPT: You can find historical quotes for the Dow Jones Global Total Stock Market Index at several financial websites, such as Yahoo Finance, Google Finance, and Bloomberg. These websites allow you to view the historical performance of the index and see how it has changed over time.To find historical quotes for the Dow Jones Global Total Stock Market Index on Yahoo Finance, go to the Yahoo Finance website and enter "Dow Jones Global Total Stock Market Index" in the search bar. From the search results, click on the link for the Dow Jones Global Total Stock Market Index. On the resulting page, you will be able to view the current value of the index as well as historical data dating back to the index's inception.To find historical quotes for the Dow Jones Global Total Stock Market Index on Google Finance, go to the Google F

Unnamed: 0,question_ref_sim,question_gen_sim,gen_ref_sim
0,0.125593,0.570514,0.17182
1,0.136946,0.535158,0.26538
2,0.198697,0.464419,0.451627
3,0.14098,0.418178,0.349791
4,0.177977,0.285982,0.142059


In [17]:
result.describe()

Unnamed: 0,question_ref_sim,question_gen_sim,gen_ref_sim
count,200.0,200.0,200.0
mean,0.202392,0.346593,0.293754
std,0.128555,0.116747,0.143728
min,0.0,0.020415,0.014863
25%,0.105901,0.272874,0.187289
50%,0.183565,0.347686,0.28931
75%,0.266182,0.429793,0.385834
max,0.871841,0.782898,0.682076


## Q5 (Bonus): Further Analysis (Open question)


- Can you find at least three significant differences between ChatGPT-generated and human answeres? Use data to support your answer.
- Based on these differences, are you able to design a classifier to identify ChatGPT generated answers? Implement your ideas using traditional machine learning models, such as SVM, decision trees.
