# HW4: Natural Language Processing

 <div class="alert alert-block alert-warning">Each assignment needs to be completed independently. Never ever copy others' work (even with minor modification, e.g. changing variable names). Anti-Plagiarism software will be used to check all submissions. </div>

## Problem Description

In this assignment, we'll use what we learned in the preprocessing module to compare ChatGPT-generated text with human-generated answers. A dataset with 200 questions and answers has been provided for you to use. The dataset can be found at https://huggingface.co/datasets/Hello-SimpleAI/HC3.


Please follow the instructions below to do the assessment step by step and answer all analysis questions.


In [1]:
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
import string
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import spacy
import nltk

import numpy as np
from sklearn.preprocessing import normalize

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
data = pd.read_csv("qa.csv")
data.head()

Unnamed: 0,question,chatgpt_answer,human_answer
0,What happens if a parking ticket is lost / des...,If a parking ticket is lost or destroyed befor...,In my city you also get something by mail to t...
1,"why the waves do n't interfere ? first , I 'm ...",Interference is the phenomenon that occurs whe...,They do actually . That 's why a microwave ove...
2,Is it possible to influence a company's action...,"Yes, it is possible to influence a company's a...",Yes and no. This really should be taught at ju...
3,Why do taxpayers front the bill for sports sta...,Sports stadiums are usually built with public ...,That 's the bargaining chip that team owners u...
4,Why do clothing stores generally have a ton of...,There are a few reasons why clothing stores ma...,Your observation is almost certainly a matter ...


## Q1. Tokenize function

Define a function `tokenize(docs, lemmatized = True, remove_stopword = True, remove_punct = True)`  as follows:
   - Take four parameters: 
       - `docs`: a list of documents (e.g. questions)
       - `lemmatized`: an optional boolean parameter to indicate if tokens are lemmatized. The default value is True (i.e. tokens are lemmatized).
       - `remove_stopword`: an optional boolean parameter to remove stop words. The default value is True (i.e. remove stop words). 
       - `remove_punct`: an optional boolean parameter to remove punctuation tokens. The default value is True (i.e. remove punctuation tokens).
   - Split each input document into unigrams and also clean up tokens as follows:
       - if `lemmatized` is turned on, lemmatize all unigrams.
       - if `remove_stopword` is set to True, remove all stop words.
       - if `remove_punct` is set to True, remove all punctuation tokens.
       - remove all empty tokens and lowercase all the tokens.
   - Return the list of tokens obtained for each document after all the processing. 
   
(Hint: you can use the spaCy package for this task. For reference, check https://spacy.io/api/token#attributes)

In [22]:
def tokenize_a_doc(doc, nlp, lemmatized=True, remove_stopword=True, remove_punct=True): 
    clean_tokens = []
    # load current doc into spacy nlp model and split sentences by newline chars
    sentences = doc.split("\\n")
    for sentence in sentences:
        doc = nlp(sentence)
    
        # clean either lemmatized unigrams or unmodified doc tokens
        if lemmatized:
            clean_tokens += [token.lemma_.lower() for token in doc            # using spacy nlp params, skip token if:
                            if (not remove_stopword or not token.is_stop)     # it is a stopword and remove_stopwords = True
                            and (not remove_punct or not token.is_punct)      # it is punctuation and remove_punct = True
                            and not token.lemma_.isspace()]                   # it is whitespace
        else:
            clean_tokens += [token.text.lower() for token in doc 
                            if (not remove_stopword or not token.is_stop) 
                            and (not remove_punct or not token.is_punct) 
                            and not token.text.isspace()]
        
    return clean_tokens

def tokenize(docs, lemmatized=True, remove_stopword=True, remove_punct=True):
    # load in spacy NLP model and disable unused pipelines to reduce processing time/memory space
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    nlp.add_pipe("sentencizer")
    # tokenize each doc in the corpus using specified params for lemmatization and removal conditions
    tokens = [tokenize_a_doc(doc, nlp, lemmatized, remove_stopword, remove_punct) for doc in docs]
    
    return tokens

Test your function with different parameter configurations and observe the differences in the resulting tokens.

In [23]:
# For simplicity, we will test on one document

print(data["question"].iloc[0] + "\n")

print(f"1.lemmatized=True, remove_stopword=False, remove_punct = True:\n \
{tokenize(data['question'].iloc[0:1], lemmatized=True, remove_stopword=False, remove_punct = True)}\n")

print(f"2.lemmatized=True, remove_stopword=True, remove_punct = True:\n \
{tokenize(data['question'].iloc[0:1], lemmatized=True, remove_stopword=True, remove_punct = True)}\n")

print(f"3.lemmatized=False, remove_stopword=False, remove_punct = True:\n \
{tokenize(data['question'].iloc[0:1], lemmatized=False, remove_stopword=False, remove_punct = True)}\n")

print(f"4.lemmatized=False, remove_stopword=False, remove_punct = False:\n \
{tokenize(data['question'].iloc[0:1], lemmatized=False, remove_stopword=False, remove_punct = False)}\n")


What happens if a parking ticket is lost / destroyed before the owner is aware of the ticket , and it goes unpaid ? I 've always been curious . Please explain like I'm five.

1.lemmatized=True, remove_stopword=False, remove_punct = True:
 [['what', 'happen', 'if', 'a', 'parking', 'ticket', 'be', 'lose', 'destroy', 'before', 'the', 'owner', 'be', 'aware', 'of', 'the', 'ticket', 'and', 'it', 'go', 'unpaid', 'i', 've', 'always', 'be', 'curious', 'please', 'explain', 'like', 'i', 'be', 'five']]

2.lemmatized=True, remove_stopword=True, remove_punct = True:
 [['happen', 'parking', 'ticket', 'lose', 'destroy', 'owner', 'aware', 'ticket', 'go', 'unpaid', 've', 'curious', 'explain', 'like']]

3.lemmatized=False, remove_stopword=False, remove_punct = True:
 [['what', 'happens', 'if', 'a', 'parking', 'ticket', 'is', 'lost', 'destroyed', 'before', 'the', 'owner', 'is', 'aware', 'of', 'the', 'ticket', 'and', 'it', 'goes', 'unpaid', 'i', 've', 'always', 'been', 'curious', 'please', 'explain', 'like

## Q2. Sentiment Analysis


Let's check if there is any difference in sentiment between ChatGPT-generated and human-generated answers.


Define a function `compute_sentiment(gen_tokens, ref_tokens, pos, neg)` as follows:
- take four parameters:
    - `gen_tokens`: the ChatGPT-generated answers tokenized by the `tokenize` function in Q1.
    - `ref_tokens`: the human-generated answers tokenized by the `tokenize` function in Q1.
    - `pos` (`neg`): the list of positive (negative) words, which can be found in the Canvas preprocessing module.
- for each ChatGPT-generated or human-generated answer, compute the sentiment as `(#pos - #neg )/(#pos + #neg)`, where `#pos`(`#neg`) is the number of positive (negative) words found in each answer. If an answer contains none of the positive or negative words, set the sentiment to 0.
- return the sentiment of ChatGPT-generated and human-generated answers as two columns of a DataFrame.
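
As a quick check of the formula above, here is a minimal sketch on made-up tokens. The word sets `toy_pos` and `toy_neg` and the helper name `sentiment_score` are illustrative assumptions, not the Canvas lexicons. Sets are used so that each membership test is fast.

```python
def sentiment_score(tokens, pos_words, neg_words):
    """Return (#pos - #neg) / (#pos + #neg), or 0 if no lexicon word occurs."""
    n_pos = sum(1 for tok in tokens if tok in pos_words)
    n_neg = sum(1 for tok in tokens if tok in neg_words)
    total = n_pos + n_neg
    return (n_pos - n_neg) / total if total else 0

# Hypothetical mini-lexicons, for illustration only
toy_pos = {"good", "great", "helpful"}
toy_neg = {"bad", "lost"}

tokens = ["the", "answer", "is", "good", "and", "helpful", "but", "lost"]
score = sentiment_score(tokens, toy_pos, toy_neg)        # 2 positive, 1 negative
neutral = sentiment_score(["no", "lexicon", "words"], toy_pos, toy_neg)
print(score, neutral)
```

With two positive words and one negative word, the score is (2 - 1) / (2 + 1); an answer without any lexicon word falls back to 0.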


Analysis: 
- Try different tokenization parameter configurations (lemmatized, remove_stopword, remove_punct), and observe how sentiment results change.
- In general, which tokenization configuration do you think should be used? Why does this combination make the most sense?
- Do you think, overall, ChatGPT-generated answers are more positive or negative than human-generated ones? Use data to support your conclusion.


In [4]:
def sent(target, pos, neg):
    p = sum(1 for word in target if word in pos)
    n = sum(1 for word in target if word in neg)
    if p + n != 0:
        sentiment = (p - n) / (p + n)
    else:
        sentiment = 0
    return sentiment

def compute_sentiment(gen_tokens, ref_tokens, pos, neg):
    
    tokens = lambda token_list: [sent(sublist, pos, neg) for sublist in token_list]
    result = pd.DataFrame({'gen_sentiment': tokens(gen_tokens), 'ref_sentiment': tokens(ref_tokens)})
    return result

In [5]:
gen_tokens = tokenize(data["chatgpt_answer"], lemmatized=False, remove_stopword=False, remove_punct = False)
ref_tokens = tokenize(data["human_answer"], lemmatized=False, remove_stopword=False, remove_punct = False)

In [6]:
pos = pd.read_csv("positive-words.txt", header = None)
pos.head()

neg = pd.read_csv("negative-words.txt", header = None)
neg.head()

Unnamed: 0,0
0,a+
1,abound
2,abounds
3,abundance
4,abundant


Unnamed: 0,0
0,2-faced
1,2-faces
2,abnormal
3,abolish
4,abominable


In [7]:
result = compute_sentiment(gen_tokens, 
                           ref_tokens, 
                           pos[0].values,
                           neg[0].values)
result.head()

Unnamed: 0,gen_sentiment,ref_sentiment
0,0.0,-0.5
1,-0.777778,0.076923
2,0.666667,0.2
3,1.0,0.2
4,0.6,-0.333333


In [8]:
from scipy.stats import wilcoxon

(result['gen_sentiment'] - result['ref_sentiment']).mean()

res = wilcoxon(result['gen_sentiment'] - result['ref_sentiment'], alternative='greater')
res.statistic, res.pvalue

0.1462586239453829

(10279.5, 0.0011456573663914912)

### Analysis

#### Try different tokenization parameter configurations (lemmatized, remove_stopword, remove_punct), and observe how sentiment results change.

In [24]:
gen_tokens = tokenize(data["chatgpt_answer"], lemmatized=True, remove_stopword=False, remove_punct = True)
ref_tokens = tokenize(data["human_answer"], lemmatized=True, remove_stopword=False, remove_punct = True)
result = compute_sentiment(gen_tokens, 
                           ref_tokens, 
                           pos[0].values,
                           neg[0].values)
result.head()

(result['gen_sentiment'] - result['ref_sentiment']).mean()

res = wilcoxon(result['gen_sentiment'] - result['ref_sentiment'], alternative='greater')
res.statistic, res.pvalue

Unnamed: 0,gen_sentiment,ref_sentiment
0,-0.230769,-0.5
1,-0.777778,-0.066667
2,0.666667,0.2
3,1.0,0.2
4,0.6,-0.333333


0.13499355278093766

(10576.5, 0.002139873867261238)

In [25]:
gen_tokens = tokenize(data["chatgpt_answer"], lemmatized=True, remove_stopword=True, remove_punct = True)
ref_tokens = tokenize(data["human_answer"], lemmatized=True, remove_stopword=True, remove_punct = True)
result = compute_sentiment(gen_tokens, 
                           ref_tokens, 
                           pos[0].values,
                           neg[0].values)
result.head()

(result['gen_sentiment'] - result['ref_sentiment']).mean()

res = wilcoxon(result['gen_sentiment'] - result['ref_sentiment'], alternative='greater')
res.statistic, res.pvalue

Unnamed: 0,gen_sentiment,ref_sentiment
0,-0.230769,-0.5
1,-0.777778,-0.142857
2,0.666667,0.111111
3,1.0,0.2
4,0.6,-0.333333


0.14764114639510492

(10680.0, 0.0008091725092345695)

In [26]:
gen_tokens = tokenize(data["chatgpt_answer"], lemmatized=False, remove_stopword=False, remove_punct = True)
ref_tokens = tokenize(data["human_answer"], lemmatized=False, remove_stopword=False, remove_punct = True)
result = compute_sentiment(gen_tokens, 
                           ref_tokens, 
                           pos[0].values,
                           neg[0].values)
result.head()

(result['gen_sentiment'] - result['ref_sentiment']).mean()

res = wilcoxon(result['gen_sentiment'] - result['ref_sentiment'], alternative='greater')
res.statistic, res.pvalue

Unnamed: 0,gen_sentiment,ref_sentiment
0,0.0,-0.5
1,-0.777778,0.076923
2,0.666667,0.2
3,1.0,0.2
4,0.6,-0.333333


0.1462586239453829

(10279.5, 0.0011456573663914912)

#### In general, which tokenization configuration should be used? Why does this combination make the most sense?
The configuration that had the best results was `tokenize(data, lemmatized=True, remove_stopword=True, remove_punct=True)`, which produced the largest mean sentiment difference (0.148) and the lowest Wilcoxon p-value (0.0008). This combination makes the most sense because lemmatization normalizes the data by reducing the number of unique words while still preserving the semantic meaning of each document in the corpus. Removing stop words and punctuation also improves the results, since these tokens add little to the semantic meaning of the text. Stop words and punctuation also tend to have a high frequency within a text, so removing this noise makes it much easier to extract the desired text characteristics.

#### Do you think, overall, ChatGPT-generated answers are more positive or negative than human-generated ones? Use data to support your conclusion.

As seen in the statistics from `result.describe()`, ChatGPT-generated answers are more positive overall: their mean sentiment (0.233) is roughly 2.7 times that of the human answers (0.087). The median ChatGPT answer has a sentiment score of 0.26, while the median human answer scores 0. The two groups have nearly identical 25th percentiles (about -0.25), but at the 75th percentile ChatGPT scores 0.75 against 0.45 for the human answers. The one-sided Wilcoxon signed-rank test on the paired differences (p of about 0.001) supports the same conclusion.

In [27]:
result.describe()

Unnamed: 0,gen_sentiment,ref_sentiment
count,200.0,200.0
mean,0.232919,0.08666
std,0.593347,0.563039
min,-1.0,-1.0
25%,-0.255682,-0.253289
50%,0.261364,0.0
75%,0.753676,0.446429
max,1.0,1.0


## Q3: Performance Evaluation


Next, we evaluate how accurate the ChatGPT-generated answers are, compared to the human-generated answers. One widely used method is to calculate the `precision` and `recall` of n-grams. For simplicity, we only calculate bigrams here. You can try unigram, trigram, or n-grams in the same way.


Define a function `bigram_precision_recall(gen_tokens, ref_tokens)` as follows:
- take two parameters:
    - `gen_tokens`: the ChatGPT-generated answers tokenized by the `tokenize` function in Q1.
    - `ref_tokens`: the human answers tokenized by the `tokenize` function in Q1.
- generate bigrams from each tokenized document in `gen_tokens` and `ref_tokens`
- for each pair of ChatGPT-generated and human answers, find the overlapping bigrams between them
- compute `precision` as the number of overlapping bigrams divided by the total number of bigrams from the ChatGPT-generated answer. In other words, each generated bigram is treated as a predicted value, and `precision` measures the percentage of correctly generated bigrams out of all generated bigrams.
- compute `recall` as the number of overlapping bigrams divided by the total number of bigrams from the human answer. In other words, `recall` measures the percentage of bigrams from the human answer that are successfully retrieved.
- return the precision and recall for each pair of answers.
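
One way to read "number of overlapping bigrams" is to count each shared bigram at most as often as it occurs in both answers (clipped counts), which keeps precision and recall within [0, 1] even when a bigram repeats. The sketch below assumes that reading; the helper name `clipped_bigram_pr` is hypothetical, and it uses only the standard library.

```python
from collections import Counter

def clipped_bigram_pr(gen, ref):
    """Bigram precision/recall for one (generated, reference) token pair."""
    gen_bigrams = Counter(zip(gen, gen[1:]))
    ref_bigrams = Counter(zip(ref, ref[1:]))
    # Counter intersection keeps the minimum count of each shared bigram
    overlap = sum((gen_bigrams & ref_bigrams).values())
    n_gen = sum(gen_bigrams.values())
    n_ref = sum(ref_bigrams.values())
    precision = overlap / n_gen if n_gen else 0
    recall = overlap / n_ref if n_ref else 0
    return precision, recall

gen = ["pay", "the", "ticket", "and", "pay", "the", "fine"]
ref = ["you", "must", "pay", "the", "ticket"]
precision, recall = clipped_bigram_pr(gen, ref)
print(precision, recall)
```

Here `("pay", "the")` appears twice in the generated answer but once in the reference, so it contributes a single overlap: 2 of 6 generated bigrams and 2 of 4 reference bigrams match.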


Analysis: 
- Try different tokenization parameter configurations (lemmatized, remove_stopword, remove_punct), and observe how precision and recall change.
- In general, which tokenization configuration do you think should be used? Why does this combination make the most sense?
- Do you think, overall, ChatGPT is able to mimic humans in answering these questions?



In [32]:
def bigram_precision_recall(gen_tokens, ref_tokens):
    result = pd.DataFrame(columns = ['overlapping','precision','recall'])
    
    gen_bigrams = [list(nltk.bigrams(tokens)) for tokens in gen_tokens]
    ref_bigrams = [list(nltk.bigrams(tokens)) for tokens in ref_tokens]

    bigrams = list(zip(gen_bigrams, ref_bigrams))

    overlapping = []
    precision = []
    recall = []
    for gen, ref in bigrams:
        overlap = [tup1 for tup1 in gen for tup2 in ref if tup1 == tup2]
        overlapping.append(list(set(overlap)))
        
        if gen:
            precision.append(len(overlap)/len(gen))
        else:
            precision.append(0)
        
        if ref:
            recall.append(len(overlap)/len(ref))
        else:
            recall.append(0)

    result['overlapping'] = overlapping
    result['precision'] = precision
    result['recall'] = recall
    
    return result

In [10]:
result = bigram_precision_recall(gen_tokens, 
                                 ref_tokens)
result.head()

Unnamed: 0,overlapping,precision,recall
0,"[(it, goes), (to, pay)]",0.016807,0.042553
1,"[(can, be), (can, cancel), (out, .), (out, of)...",0.065421,0.03139
2,"[(influence, the), (other, shareholders), (to,...",0.230216,0.096677
3,"[(the, local), (be, the), (to, be)]",0.017647,0.051724
4,"[(a, result), (as, a), (result, ,), (it, 's), ...",0.028571,0.072165


In [11]:
result[["precision", "recall"]].mean(axis = 0)

precision    0.110050
recall       0.165531
dtype: float64

### Analysis
#### Try different tokenization parameter configurations (lemmatized, remove_stopword, remove_punct), and observe how precision and recall change.

In [34]:
gen_tokens = tokenize(data["chatgpt_answer"], lemmatized=False, remove_stopword=False, remove_punct = True)
ref_tokens = tokenize(data["human_answer"], lemmatized=False, remove_stopword=False, remove_punct = True)
result = bigram_precision_recall(gen_tokens, ref_tokens)
result[["precision", "recall"]].mean(axis = 0)

precision    0.096335
recall       0.153194
dtype: float64

In [33]:
gen_tokens = tokenize(data["chatgpt_answer"], lemmatized=True, remove_stopword=True, remove_punct = True)
ref_tokens = tokenize(data["human_answer"], lemmatized=True, remove_stopword=True, remove_punct = True)
result = bigram_precision_recall(gen_tokens, ref_tokens)
result[["precision", "recall"]].mean(axis = 0)

precision    0.064452
recall       0.101432
dtype: float64

In [30]:
gen_tokens = tokenize(data["chatgpt_answer"], lemmatized=True, remove_stopword=False, remove_punct = True)
ref_tokens = tokenize(data["human_answer"], lemmatized=True, remove_stopword=False, remove_punct = True)
result = bigram_precision_recall(gen_tokens, ref_tokens)
result[["precision", "recall"]].mean(axis = 0)

precision    0.119037
recall       0.186091
dtype: float64

#### In general, which tokenization configuration should be used? Why does this combination make the most sense?
The best configuration in this case was `tokenize(data, lemmatized=True, remove_stopword=False, remove_punct=True)`, which gave the highest mean precision (0.119) and recall (0.186). This combination makes sense because stop words provide context for neighboring words and preserve common phrases, so more bigrams that pair a content word with a stop word can match between the two answers.
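
To illustrate the role of stop words in bigram matching, the sketch below builds bigrams from one toy sentence with and without a small, hypothetical stop-word set (`toy_stopwords`; spaCy's actual list is much larger). Removing stop words pairs up tokens that were never adjacent in the original text.

```python
# Hypothetical stop-word set, for illustration only
toy_stopwords = {"the", "of", "is", "a"}

tokens = ["the", "owner", "of", "the", "ticket", "is", "aware"]
kept = [tok for tok in tokens if tok not in toy_stopwords]

bigrams_with_stop = list(zip(tokens, tokens[1:]))
bigrams_without_stop = list(zip(kept, kept[1:]))

print(bigrams_with_stop)
print(bigrams_without_stop)
```

After filtering, `("owner", "ticket")` becomes a bigram even though the two words are separated by "of the" in the sentence, so matches on such bigrams reflect shared content words rather than shared phrasing.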

#### Do you think, overall, ChatGPT is able to mimic human in answering these questions?
I think ChatGPT does not mimic humans very well: mean precision stays below 0.12 and mean recall below 0.19, regardless of the tokenization configuration. ChatGPT tends to repeat a lot of the words from the question, or rephrase the same information with some new text. The human answers have more variability in the text and more of a unique voice than the ones generated by ChatGPT.

## Q4 Compute TF-IDF

Define a function `compute_tfidf(tokenized_docs)` as follows: 
- Take the parameter `tokenized_docs`, i.e., a list of documents tokenized by the `tokenize` function in Q1
- Calculate tf_idf weights as shown in lecture notes (Hint: feel free to reuse the code segment in Lecture Notes (II))
- Return the smoothed normalized `tf_idf` array, where each row stands for a document and each column denotes a word. 
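
The weighting can be traced by hand on a tiny corpus. The sketch below assumes the smoothed IDF form ln((N + 1) / (df + 1)) + 1 followed by L2 normalization of each row; the helper name `toy_tfidf` is hypothetical, and it uses plain Python rather than the lecture-note code.

```python
import math
from collections import Counter

def toy_tfidf(tokenized_docs):
    """Return (vocabulary, rows) with smoothed, L2-normalized TF-IDF weights."""
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    n_docs = len(tokenized_docs)
    # document frequency: number of documents that contain each term
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    idf = {t: math.log((n_docs + 1) / (df[t] + 1)) + 1 for t in vocab}

    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        # term frequency (count / document length) scaled by the smoothed idf
        row = [counts[t] / len(doc) * idf[t] if doc else 0.0 for t in vocab]
        norm = math.sqrt(sum(w * w for w in row))
        rows.append([w / norm if norm else 0.0 for w in row])
    return vocab, rows

docs = [["parking", "ticket", "ticket"], ["ticket", "fine"], ["stock", "index"]]
vocab, rows = toy_tfidf(docs)
print(vocab)
print([round(w, 3) for w in rows[0]])
```

Each row has unit length after normalization, and a term that does not occur in a document keeps a weight of 0 in that row.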

In [20]:
def compute_tfidf(docs):
    # process all documents to get token frequency for each doc
    docs_tokens = {idx:nltk.FreqDist(tokens) for idx,tokens in enumerate(docs)}

    #get document-term matrix
    dtm=pd.DataFrame.from_dict(docs_tokens, orient="index").fillna(0).sort_index(axis = 0)
      
    # get term frequency matrix        
    tf=dtm.values
    doc_len=tf.sum(axis=1, keepdims=True)
    tf=np.divide(tf, doc_len)
    
    # binary term-presence indicator, summed below to get document frequency
    df=np.where(tf>0,1,0)
    
    # get smoothed and normalized tfidf
    smoothed_idf=np.log(np.divide(len(docs)+1, np.sum(df, axis=0)+1))+1    
    smoothed_tf_idf=normalize(tf*smoothed_idf)
    
    return smoothed_tf_idf

Try different tokenization options to see how they affect the TF-IDF matrix:

In [21]:
# Test tfidf generation using questions

question_tokens = tokenize(data["question"], lemmatized=True, remove_stopword=False, remove_punct = True)
dtm = compute_tfidf(question_tokens)
print(f"1.lemmatized=True, remove_stopword=False, remove_punct = True\n \
Shape: {dtm.shape}\n")

question_tokens = tokenize(data["question"], lemmatized=True, remove_stopword=True, remove_punct = True)
dtm = compute_tfidf(question_tokens)
print(f"2.lemmatized=True, remove_stopword=True, remove_punct = True:\n \
Shape: {dtm.shape}\n")

question_tokens = tokenize(data["question"], lemmatized=False, remove_stopword=False, remove_punct = True)
dtm = compute_tfidf(question_tokens)
print(f"3.lemmatized=False, remove_stopword=False, remove_punct = True:\n \
Shape: {dtm.shape}\n")

question_tokens = tokenize(data["question"], lemmatized=False, remove_stopword=False, remove_punct = False)
dtm = compute_tfidf(question_tokens)
print(f"4.lemmatized=False, remove_stopword=False, remove_punct = False:\n \
Shape: {dtm.shape}\n")

1.lemmatized=True, remove_stopword=False, remove_punct = True
 Shape: (200, 1438)

2.lemmatized=True, remove_stopword=True, remove_punct = True:
 Shape: (200, 1271)

3.lemmatized=False, remove_stopword=False, remove_punct = True:
 Shape: (200, 1643)

4.lemmatized=False, remove_stopword=False, remove_punct = False:
 Shape: (200, 1665)



## Q5. Assess similarity. 


Define a function `assess_similarity(question_tokens, gen_tokens, ref_tokens)`  as follows: 
- Take three inputs:
   - `question_tokens`: tokenized questions by `tokenize` function in Q1
   - `gen_tokens`: tokenized ChatGPT-generated answers by `tokenize` function in Q1
   - `ref_tokens`: tokenized human answers by `tokenize` function in Q1
- Concatenate these three token lists into a single list to form a corpus
- Calculate the smoothed normalized tf_idf matrix for the concatenated list using the `compute_tfidf` function defined in Q4.
- Split the tf_idf matrix into sub-matrices corresponding to `question_tokens`, `gen_tokens`, and `ref_tokens` respectively
- For each question, find its similarities to the paired ChatGPT-generated answer and human answer.
- For each pair of ChatGPT-generated answer and human answer, find their similarity
- Print out the following:
    - the question which has the largest similarity to the ChatGPT-generated answer.
    - the question which has the largest similarity to the human answer.
    - the pair of ChatGPT-generated and human answers which have the largest similarity.
- Return a DataFrame with the three columns for the similarities among questions and answers.
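
Because the similarity is requested for each question and its own paired answers, only matching rows need to be compared; in a full pairwise similarity matrix these values sit on the diagonal. A minimal sketch on hypothetical TF-IDF rows (the helper name `paired_cosine` and the toy vectors are assumptions for illustration):

```python
import math

def paired_cosine(rows_a, rows_b):
    """Cosine similarity between row i of rows_a and row i of rows_b."""
    sims = []
    for a, b in zip(rows_a, rows_b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        sims.append(dot / norm if norm else 0.0)
    return sims

# Hypothetical TF-IDF rows: two questions and their paired answers
question_rows = [[1.0, 0.0, 0.0], [0.0, 0.6, 0.8]]
answer_rows = [[0.6, 0.8, 0.0], [0.0, 0.6, 0.8]]

sims = paired_cosine(question_rows, answer_rows)
best = max(range(len(sims)), key=sims.__getitem__)  # index of the most similar pair
print(sims, best)
```

With NumPy arrays, the same values can be read from the diagonal of `cosine_similarity(A, B)`, and `np.argmax` over that diagonal gives the index of the most similar pair.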



Analysis: 
- Try different tokenization parameter configurations (lemmatized, remove_stopword, remove_punct), and observe how similarities change.
- Based on similarity, do you think ChatGPT-generated answers are more relevant to questions than human answers?

In [14]:

def assess_similarity(question_tokens, gen_tokens, ref_tokens):
    n_tokens = len(question_tokens)  # number of documents in each of the three lists
    # Concatenate the three lists into a single list
    corpus = []
    corpus.extend(question_tokens)
    corpus.extend(gen_tokens)
    corpus.extend(ref_tokens)
    
    tf_idf = compute_tfidf(corpus)
    
    # Split the tf-idf matrix
    tfidf_question = tf_idf[:n_tokens]
    tfidf_gen = tf_idf[n_tokens:n_tokens + n_tokens]
    tfidf_ref = tf_idf[-n_tokens:]

    # Calculate similarity
    question_gen_sim = cosine_similarity(tfidf_question, tfidf_gen)
    question_ref_sim = cosine_similarity(tfidf_question, tfidf_ref)
    gen_ref_sim = cosine_similarity(tfidf_ref, tfidf_gen)

    # Find the maximum similarities and their indices
    max_sim_question_gen = np.max(question_gen_sim, axis=1)
    question_gen_idx = np.argmax(max_sim_question_gen)
    
    max_sim_question_ref = np.max(question_ref_sim, axis=1)
    question_ref_idx = np.argmax(max_sim_question_ref)
    
    max_sim_gen_ref = np.max(gen_ref_sim, axis = 1)
    gen_ref_idx = np.argmax(max_sim_gen_ref)
    
    result = pd.DataFrame({
        'question_ref_sim': max_sim_question_ref,
        'question_gen_sim': max_sim_question_gen,
        'gen_ref_sim': max_sim_gen_ref
    })
    
    print(f"\nQuestion with the largest similarity to the ChatGPT-generated answer:\
    \nQuestion: {data['question'][question_gen_idx]}\
    \nChatGPT: {data['chatgpt_answer'][question_gen_idx]}\
    \nHuman: {data['human_answer'][question_gen_idx]}")
    print(f"\n{result.iloc[question_gen_idx]}")
    
    print(f"\nQuestion with the largest similarity to the human answer:\
        \nQuestion: {data['question'][question_ref_idx]}\
        \nChatGPT: {data['chatgpt_answer'][question_ref_idx]}\
        \nHuman: {data['human_answer'][question_ref_idx]}")
    print(f"\n{result.iloc[question_ref_idx]}")
    
    print(f"\nQuestion with the largest similarity between ChatGPT-generated and human answers:\
        \nQuestion: {data['question'][gen_ref_idx]}\
        \nChatGPT: {data['chatgpt_answer'][gen_ref_idx]}\
        \nHuman: {data['human_answer'][gen_ref_idx]}")
    print(f"\n{result.iloc[gen_ref_idx]}")
    
    return result    

In [15]:
result = assess_similarity(question_tokens, gen_tokens, ref_tokens)
result.head()


Question with the largest similarity to the ChatGPT-generated answer:    
Question: Where to find historical quotes for the Dow Jones Global Total Stock Market Index?    
ChatGPT: You can find historical quotes for the Dow Jones Global Total Stock Market Index at several financial websites, such as Yahoo Finance, Google Finance, and Bloomberg. These websites allow you to view the historical performance of the index and see how it has changed over time.To find historical quotes for the Dow Jones Global Total Stock Market Index on Yahoo Finance, go to the Yahoo Finance website and enter "Dow Jones Global Total Stock Market Index" in the search bar. From the search results, click on the link for the Dow Jones Global Total Stock Market Index. On the resulting page, you will be able to view the current value of the index as well as historical data dating back to the index's inception.To find historical quotes for the Dow Jones Global Total Stock Market Index on Google Finance, go to the Go

Unnamed: 0,question_ref_sim,question_gen_sim,gen_ref_sim
0,0.139381,0.570549,0.19047
1,0.204887,0.535899,0.287058
2,0.215753,0.464515,0.451962
3,0.158283,0.418185,0.350015
4,0.178233,0.286313,0.202566


In [16]:
result.describe()

Unnamed: 0,question_ref_sim,question_gen_sim,gen_ref_sim
count,200.0,200.0,200.0
mean,0.233769,0.35287,0.320368
std,0.107624,0.110428,0.123144
min,0.084628,0.090956,0.094002
25%,0.156987,0.276771,0.231383
50%,0.20921,0.353386,0.312536
75%,0.277172,0.431382,0.397675
max,0.871841,0.782894,0.682076


### Analysis
#### Try different tokenization parameter configurations (lemmatized, remove_stopword, remove_punct), and observe how similarities change.

In [35]:
gen_tokens = tokenize(data["chatgpt_answer"], lemmatized=False, remove_stopword=False, remove_punct = True)
ref_tokens = tokenize(data["human_answer"], lemmatized=False, remove_stopword=False, remove_punct = True)
result = assess_similarity(question_tokens, gen_tokens, ref_tokens)
result.describe()


Question with the largest similarity to the ChatGPT-generated answer:    
Question: Where to find historical quotes for the Dow Jones Global Total Stock Market Index?    
ChatGPT: You can find historical quotes for the Dow Jones Global Total Stock Market Index at several financial websites, such as Yahoo Finance, Google Finance, and Bloomberg. These websites allow you to view the historical performance of the index and see how it has changed over time.To find historical quotes for the Dow Jones Global Total Stock Market Index on Yahoo Finance, go to the Yahoo Finance website and enter "Dow Jones Global Total Stock Market Index" in the search bar. From the search results, click on the link for the Dow Jones Global Total Stock Market Index. On the resulting page, you will be able to view the current value of the index as well as historical data dating back to the index's inception.To find historical quotes for the Dow Jones Global Total Stock Market Index on Google Finance, go to the Go

Unnamed: 0,question_ref_sim,question_gen_sim,gen_ref_sim
count,200.0,200.0,200.0
mean,0.215811,0.332533,0.305191
std,0.108893,0.11427,0.128017
min,0.077459,0.095774,0.087769
25%,0.133289,0.253999,0.206993
50%,0.193265,0.319195,0.296274
75%,0.264397,0.408473,0.382049
max,0.874726,0.789226,0.69602


In [36]:
gen_tokens = tokenize(data["chatgpt_answer"], lemmatized=True, remove_stopword=True, remove_punct = True)
ref_tokens = tokenize(data["human_answer"], lemmatized=True, remove_stopword=True, remove_punct = True)
result = assess_similarity(question_tokens, gen_tokens, ref_tokens)
result.describe()


Question with the largest similarity to the ChatGPT-generated answer:    
Question: Where to find historical quotes for the Dow Jones Global Total Stock Market Index?    
ChatGPT: You can find historical quotes for the Dow Jones Global Total Stock Market Index at several financial websites, such as Yahoo Finance, Google Finance, and Bloomberg. These websites allow you to view the historical performance of the index and see how it has changed over time.To find historical quotes for the Dow Jones Global Total Stock Market Index on Yahoo Finance, go to the Yahoo Finance website and enter "Dow Jones Global Total Stock Market Index" in the search bar. From the search results, click on the link for the Dow Jones Global Total Stock Market Index. On the resulting page, you will be able to view the current value of the index as well as historical data dating back to the index's inception.To find historical quotes for the Dow Jones Global Total Stock Market Index on Google Finance, go to the Go

Unnamed: 0,question_ref_sim,question_gen_sim,gen_ref_sim
count,200.0,200.0,200.0
mean,0.156714,0.228628,0.30775
std,0.107821,0.12101,0.159078
min,0.027274,0.030997,0.0
25%,0.084312,0.141589,0.178442
50%,0.123175,0.210139,0.289394
75%,0.206487,0.311293,0.415362
max,0.823524,0.699193,0.725314


In [37]:
gen_tokens = tokenize(data["chatgpt_answer"], lemmatized=True, remove_stopword=False, remove_punct = True)
ref_tokens = tokenize(data["human_answer"], lemmatized=True, remove_stopword=False, remove_punct = True)
result = assess_similarity(question_tokens, gen_tokens, ref_tokens)
result.describe()


Question with the largest similarity to the ChatGPT-generated answer:    
Question: Where to find historical quotes for the Dow Jones Global Total Stock Market Index?    
ChatGPT: You can find historical quotes for the Dow Jones Global Total Stock Market Index at several financial websites, such as Yahoo Finance, Google Finance, and Bloomberg. These websites allow you to view the historical performance of the index and see how it has changed over time.To find historical quotes for the Dow Jones Global Total Stock Market Index on Yahoo Finance, go to the Yahoo Finance website and enter "Dow Jones Global Total Stock Market Index" in the search bar. From the search results, click on the link for the Dow Jones Global Total Stock Market Index. On the resulting page, you will be able to view the current value of the index as well as historical data dating back to the index's inception.To find historical quotes for the Dow Jones Global Total Stock Market Index on Google Finance, go to the Go

Unnamed: 0,question_ref_sim,question_gen_sim,gen_ref_sim
count,200.0,200.0,200.0
mean,0.198715,0.27258,0.354096
std,0.101307,0.118067,0.136537
min,0.064813,0.065961,0.10884
25%,0.124651,0.18267,0.239965
50%,0.175704,0.256312,0.345645
75%,0.253095,0.343357,0.452354
max,0.876957,0.736404,0.718917


#### Based on similarity, do you think ChatGPT-generated answers are more relevant to questions than human answers?
The ChatGPT answers were on average closer in similarity to the question than the human answers. However, the human answers did have a higher max similarity for all configurations, suggesting the human answers may sometimes not offer enough detail in comparison to ChatGPT. This can be seen in the question with the highest similarity to the human answer, where the human answer is one line and ChatGPT gives a long paragraph. The question provided is also pretty vague, and is more of a statement than an actual question. ChatGPT does better in terms of providing information when the prompt is vague, but which response is more relevant to the user is unknown. ChatGPT does produce more information relevant to the question than humans do, which is not unexpected considering the huge training dataset available to ChatGPT while formulating its response.

## Q5 (Bonus): Further Analysis (Open question)


- Can you find at least three significant differences between ChatGPT-generated and human answers? Use data to support your answer.
- Based on these differences, are you able to design a classifier to identify ChatGPT generated answers? Implement your ideas using traditional machine learning models, such as SVM, decision trees.
