# Description of Data

The AI-GA (Artificial Intelligence Generated Abstracts) dataset is a collection of paper abstracts, either AI-generated or original.

The AI-generated abstracts are generated using state-of-the-art language generation techniques (GPT-3 model).

The dataset is provided in CSV format, with each row representing a single sample (i.e.,  a single abstract).

*The ultimate goal of this assignment is to classify the abstracts based on the source (i.e., whether it is AI-generated or original).*

Total sample size: 14,331 (7,248 AI-generated and 7,082 original)

Each sample contains three columns: abstract, title, and label. The label indicates whether the sample is an original abstract (labeled as 0) or an AI-generated abstract (labeled as 1).

##Package installs and imports

DO NOT CHANGE THIS CODE

In [1]:
!pip3 install nltk spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m84.8 MB/s[0m eta [36m0:00:00[0m00:01[0m0:01[0m
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


In [2]:
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /home/jupyter/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/jupyter/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jupyter/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

Load dataset **"ai-ga-dataset.csv"** as a csv file and save it as a dataframe named **"abstracts_df"**

In [3]:
abstracts_df = pd.read_csv("https://raw.githubusercontent.com/elhamod/BA820/main/Assignment/Assignment2/ai-ga-dataset.csv")
abstracts_df.head()

Unnamed: 0,doc_id,title,abstract,label
0,1,Exaggerated Autophagy in Stanford Type A Aorti...,\n\nThis study presents a novel transcriptome ...,1
1,2,ABO blood types and sepsis mortality,\n\nThe ABO blood types have been associated w...,1
2,3,AAV8-Mediated Angiotensin-Converting Enzyme 2 ...,\n\nTitle: AAV8-Mediated Angiotensin-Convertin...,1
3,4,MyCare study: protocol for a controlled trial ...,INTRODUCTION: People with serious mental illne...,0
4,5,Exploring collective emotion transmission in f...,Collective emotion is the synchronous converge...,0


##Inspection:

**Maximum marks: 5**

- Print the number of abstracts that are human or AI generated, respectively.
- Check if any abstracts have invalid values. Address them appropriately.
- Check if any labels have invalid values. Address them appropriately

In [4]:
abstracts_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14330 entries, 0 to 14329
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   doc_id    14330 non-null  int64 
 1   title     14330 non-null  object
 2   abstract  14330 non-null  object
 3   label     14330 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 447.9+ KB


In [5]:
abstracts_df.shape

(14330, 4)

In [6]:
ai_gen= abstracts_df[abstracts_df['label']==1]
ori_gen= abstracts_df[abstracts_df['label']==0]
print(len(ai_gen),"numbers of abtracts were generated by AI,", len(ori_gen),"were generated by human")

7248 numbers of abtracts were generated by AI, 7082 were generated by human


- checking for null value of `abstract`.

In [7]:
ai_gen['abstract'].isna().sum()

0

In [8]:
ori_gen['abstract'].isna().sum()

0

**Answer**:

There doesn't seem to be any invalid values of `abstract` for both AI and human generated contents.


#Pre-processing

## Question 1.1: text cleaning

**Maximum marks: 5**

Perform pre-processing on all abstracts by lower-casing and removing all non-alpha-numeric characters (i.e., only keep numbers, English alphabet letters, and white spaces).

In [9]:
abstract= abstracts_df['abstract']

In [10]:
import re
#lower casing the text
clean_df= abstract.str.lower()

In [11]:
clean_df

0        \n\nthis study presents a novel transcriptome ...
1        \n\nthe abo blood types have been associated w...
2        \n\ntitle: aav8-mediated angiotensin-convertin...
3        introduction: people with serious mental illne...
4        collective emotion is the synchronous converge...
                               ...                        
14325    background: falls are a significant source of ...
14326    autosomal dominant polycystic kidney disease (...
14327    we study numerically how the structures of dis...
14328    \n\nthis paper aims to elucidate the role of p...
14329    infectious disease threat events (idtes) are i...
Name: abstract, Length: 14330, dtype: object

In [12]:
#special characters
to_remove = ["\W", "\s"]
clean_df= clean_df.replace(to_remove, " ", regex=True)

In [13]:
clean_df= clean_df.replace('[^\x00-\x7F]', " ", regex= True)

In [14]:
pd.DataFrame(clean_df)

Unnamed: 0,abstract
0,this study presents a novel transcriptome pi...
1,the abo blood types have been associated wit...
2,title aav8 mediated angiotensin converting ...
3,introduction people with serious mental illne...
4,collective emotion is the synchronous converge...
...,...
14325,background falls are a significant source of ...
14326,autosomal dominant polycystic kidney disease ...
14327,we study numerically how the structures of dis...
14328,this paper aims to elucidate the role of pho...


## Question 1.2: Stemming or Lemmatization

**Maximum Marks: 7.5**

We enhance the effectiveness of our text analysis algorithms by normalizing words and reducing them to their root/base forms.

Write a function `process_text` that



1.   removes `english` stop words.
2.   uses `PorterStemmer` and `WordNetLemmatizer` to stem AND lemmatize the tokenized abstracts.

The function would take in a document and return its tokenization as a list of tokens.

To verify its functionality, call the function with the first abstract as input, and then print the transformed abstract as a full text (i.e., as a string, not as a list of tokens).

In [15]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

In [16]:
def process_text(text):
    processed_sentences = []
    tokenized_text = [word_tokenize(t) for t in clean_df]
    for sent in tokenized_text:
        text_stemmer= [stemmer.stem(word) for word in sent if word not in stop_words]
        text_lemmatizer = [lemmatizer.lemmatize(word) for word in text_stemmer]
        processed_sentences.append(text_lemmatizer)
    return processed_sentences

In [17]:
processed_text= process_text(clean_df)

In [18]:
processed_texts_str = [' '.join(doc) for doc in processed_text]

In [19]:
# from more_itertools import flatten
# flattened_list = list(flatten(processed_text))

In [20]:
# import random
# sample_text = random.sample(flattened_list, 60000)

#Vectorization

Next, we will try different vector representations and see how well each performs.

## Question 2.1: Bag of Words

**Maximum Marks: 5**

Perform Bag of Words on the abstracts and store the vector representation as a DataFrame.

You are expected to apply the `process_text` tokenization.

Print the head of the resulting DataFrame.

How many tokens does BoW yield?

**Answer:**

In [21]:
from sklearn.feature_extraction.text import CountVectorizer

In [22]:
cv= CountVectorizer()
cv_text= cv.fit_transform(processed_texts_str)

In [23]:
text_df= pd.DataFrame(cv_text.toarray(), columns= cv.get_feature_names_out())

In [24]:
text_df.shape

(14330, 45128)

## Question 2.2: TF-IDF

**Maximum Marks: 5**

Using TF-IDF with `process_text` tokenization, vectorize the abstracts. Then, find the top 5 most similar abstracts to the document with doc_id=6 (shown below) in terms of content.

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.tokenize import sent_tokenize

In [26]:
tf= TfidfVectorizer(norm=None)
tf.fit(processed_texts_str)
tf_output= tf.transform(processed_texts_str)

In [27]:
tf_df= pd.DataFrame(tf_output.toarray(), columns=tf.get_feature_names_out())
tf_df.shape

(14330, 45128)

- Cleaning the sentence

In [28]:
query_index = 6
index6 = abstracts_df["abstract"].iloc[query_index].lower()

In [29]:
processed_index6= word_tokenize(index6)

In [30]:
transform_ind6 = tf.transform(processed_index6)

In [31]:
transform_ind6

<377x45128 sparse matrix of type '<class 'numpy.float64'>'
	with 156 stored elements in Compressed Sparse Row format>

In [32]:
cos_df= pd.DataFrame(cosine_similarity(transform_ind6, tf_output))

In [33]:
cos_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,14320,14321,14322,14323,14324,14325,14326,14327,14328,14329
0,0.0,0.0,0.0,0.0,0.0,0.0,0.024647,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.039428,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
372,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
373,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
374,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
375,0.0,0.0,0.0,0.0,0.0,0.0,0.119376,0.0,0.0,0.0,...,0.0,0.013724,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0


In [34]:
similarities= cos_df.sum()
similarities_5= (similarities).sort_values(ascending=False)[1:6]
similarities_5

8270     4.243555
8629     3.847517
1545     3.562759
8676     3.168066
13061    3.020051
dtype: float64

In [134]:
# for i in similarities_5.index:
#     print(abstracts_df["abstract"].iloc[i])

## Question 2.3 Word2Vec

**Maximum Marks: 7.5**

Now repeat Q 2.2 but using Word2Vec. For each token, the model should consider the two adjacent tokens on its left and the two on its right. Use a `workers=4` as a parameter to speed up computations. Include **all** possible words that occur in the abstracts.

Use vector averaging to calculate the vector representation of the sentence based on the vectors of its constituent words.

How do the results of Word2Vec and TF-IDF compare?

In [36]:
pip install --upgrade gensim

Note: you may need to restart the kernel to use updated packages.


In [37]:
from random import sample 

In [38]:
from gensim.models import Word2Vec

In [56]:
w2v_model= Word2Vec(sentences= processed_texts_str, vector_size= 20, window=4, min_count=1, workers=4, negative=20)

In [57]:
dataset_embeddings = np.array([w2v_model.wv[token] for sentence in processed_texts_str for token in sentence if token in w2v_model.wv.key_to_index])

In [52]:
#word embedding
dataset_embeddings.shape

(1908884, 20)

In [59]:
sentence_embeddings = []

# Iterate over each sentence in processed_text_str
for sentence in processed_texts_str:
    # Initialize an empty list to store token embeddings
    token_embeddings = []
    # Iterate over each token in the sentence
    for token in sentence:
        # Check if the token exists in the Word2Vec model's vocabulary
        if token in w2v_model.wv.key_to_index:
            # Get the Word2Vec embedding vector for the token
            embedding_vector = w2v_model.wv[token]
            # Append the embedding vector to the list of token embeddings
            token_embeddings.append(embedding_vector)
    # Calculate the mean of token embeddings for the current sentence
    if token_embeddings:
        sentence_embedding = np.mean(token_embeddings, axis=0)
        sentence_embeddings.append(sentence_embedding)

# Convert the list of sentence embeddings into a NumPy array
sentence_embeddings = np.array(sentence_embeddings)

In [69]:
sentence_embeddings.shape

(14330, 20)

- Sentence id6

In [46]:
embeddings_ind6 = np.array([w2v_model.wv[word] for word in processed_index6 if word in w2v_model.wv.key_to_index])
embeddings_ind6.shape #sentence embedding

(130, 20)

In [70]:
ind6_embedding = np.mean(embeddings_ind6, axis=0)

In [71]:
ind6_embedding.shape

(20,)

In [110]:
cos_sim_w2v = cosine_similarity([ind6_embedding], sentence_embeddings)

In [111]:
cos_sim_w2v

array([[0.7363782 , 0.7387619 , 0.74003905, ..., 0.72811747, 0.7254579 ,
        0.72315717]], dtype=float32)

In [112]:
similarities_5 = cos_sim_glove.argsort()[0][-5:]
print(similarities_5)

[ 6294  7231 11857  3069  4690]


**Answer:**

# Classification

## Question 3.1: GloVe

**Maximum Marks: 7.5**

Instead of training our own Word2Vec model, we decided to use a [GloVe](https://nlp.stanford.edu/projects/glove/) model that was pre-trained by researchers at Stanford University. They used a much larger amount of text in their training (e.g., Wikipedia).

For this question, simply use `get_tokens(doc)` below for tokenization.

**Note:** *Vectorizing the entire dataset using GloVe may take 5-10 minutes. Use the guidelines we discussed in class to test and develop your code before fully applying it to the entire dataset.*

In [89]:
from gensim import downloader

glove_model = downloader.load("glove-wiki-gigaword-50")

In [99]:
# import spacy
# nlp = spacy.load("en_core_web_sm")
# def get_tokens(doc):
#     doc_tokenized = nlp(doc)
#     tokens = [token.text for token in doc_tokenized]
#     return tokens

In [90]:
# doc_embeddings = []
# for sentence in processed_texts_str:
#     words = word_tokenize(sentence)
#     for word in words:
#         if word in glove_model.key_to_index:
#             embedding_vector = glove_model.get_vector(word)
#             doc_embeddings.append(embedding_vector)
# len(doc_embeddings)

In [102]:
sentence_embeddings = []

for sentence in processed_texts_str:
    token_embeddings = []
    for token in sentence:
        if token in glove_model.key_to_index:
            embedding_vector = glove_model.get_vector(token)
            token_embeddings.append(embedding_vector)
    # Calculate the mean of token embeddings for the current sentence
    if token_embeddings:
        sentence_embedding = np.mean(token_embeddings, axis=0)
        sentence_embeddings.append(sentence_embedding)

sentence_embeddings = np.array(sentence_embeddings)

In [103]:
sentence_embeddings.shape

(14330, 50)

- Sentence id6

In [104]:
embeddings_ind6 = np.array([glove_model.get_vector(word) for word in processed_index6 if word in glove_model.key_to_index])
embeddings_ind6.shape #sentence embedding

(361, 50)

In [106]:
ind6_embedding = np.mean(embeddings_ind6, axis=0)
ind6_embedding.shape

(50,)

- Cosine similarities

In [108]:
cos_sim_glove = cosine_similarity([ind6_embedding], sentence_embeddings)
cos_sim_glove

array([[0.7363782 , 0.7387619 , 0.74003905, ..., 0.72811747, 0.7254579 ,
        0.72315717]], dtype=float32)

In [109]:
similarities_5 = cos_sim_glove.argsort()[0][-5:]
print(similarities_5)

[ 6294  7231 11857  3069  4690]


## Question 3.2: Random Forest Classifier

**Maximum Marks: 7.5**

Using a [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), compare the classification results using GloVe to those using TF-IDF. Does GloVe do better or worse? Are there any particular issues you faced? Elaborate on your findings and justify them.

Use a test set of 20% the total dataset size. Use `random_state = 42`.

Print the `classification_report` of your model.

**Answer**:

This shows that TF-IDF is better at predicting the label, as the confusion matrix shows higher TP and TN numbers. This is not surprising as the TF model was trained using this specific dataset, which might've cause some leakage and overfitting.

In [113]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

In [114]:
abstracts_df.head()

Unnamed: 0,doc_id,title,abstract,label
0,1,Exaggerated Autophagy in Stanford Type A Aorti...,\n\nThis study presents a novel transcriptome ...,1
1,2,ABO blood types and sepsis mortality,\n\nThe ABO blood types have been associated w...,1
2,3,AAV8-Mediated Angiotensin-Converting Enzyme 2 ...,\n\nTitle: AAV8-Mediated Angiotensin-Convertin...,1
3,4,MyCare study: protocol for a controlled trial ...,INTRODUCTION: People with serious mental illne...,0
4,5,Exploring collective emotion transmission in f...,Collective emotion is the synchronous converge...,0


In [118]:
X= abstracts_df['abstract'].str.lower()
y= abstracts_df['label']

In [138]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

- GloVe

In [139]:
X_train_gv=[]

for sentence in X_train:
    token_embeddings = []
    for token in sentence:
        if token in glove_model.key_to_index:
            embedding_vector = glove_model.get_vector(token)
            token_embeddings.append(embedding_vector)
    # Calculate the mean of token embeddings for the current sentence
    if token_embeddings:
        X_train1 = np.mean(token_embeddings, axis=0)
        X_train_gv.append(X_train1)

X_train_gv = np.array(X_train_gv)

In [140]:
X_test_gv=[]

for sentence in X_test:
    token_embeddings = []
    for token in sentence:
        if token in glove_model.key_to_index:
            embedding_vector = glove_model.get_vector(token)
            token_embeddings.append(embedding_vector)
    # Calculate the mean of token embeddings for the current sentence
    if token_embeddings:
        X_test1 = np.mean(token_embeddings, axis=0)
        X_test_gv.append(X_test1)

X_test_gv = np.array(X_test_gv)

In [141]:
from sklearn.metrics import confusion_matrix
rf= RandomForestClassifier()
rf.fit(X_train_gv, y_train)
y_pred= rf.predict(X_test_gv)
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[1210  216]
 [ 192 1248]]


In [143]:
from sklearn.metrics import classification_report
classification_report(y_test, y_pred)

'              precision    recall  f1-score   support\n\n           0       0.86      0.85      0.86      1426\n           1       0.85      0.87      0.86      1440\n\n    accuracy                           0.86      2866\n   macro avg       0.86      0.86      0.86      2866\nweighted avg       0.86      0.86      0.86      2866\n'

- TF-IDF

In [144]:
X_train_tf= tf.transform(X_train)
X_test_tf= tf.transform(X_test)

In [145]:
rf= RandomForestClassifier()
rf.fit(X_train_tf, y_train)
y_pred= rf.predict(X_test_tf)
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[1358   68]
 [  21 1419]]


In [147]:
classification_report(y_test, y_pred)

'              precision    recall  f1-score   support\n\n           0       0.98      0.95      0.97      1426\n           1       0.95      0.99      0.97      1440\n\n    accuracy                           0.97      2866\n   macro avg       0.97      0.97      0.97      2866\nweighted avg       0.97      0.97      0.97      2866\n'