## Baseline Model using TF-IDF with Logistic Regression
This baseline represents each abstract as a TF-IDF vector built from unigrams and bigrams. The n-grams capture some local context, which can help with nuances in the abstracts. The logistic regression classifier uses balanced class weights, so that underrepresented classes contribute more to the training loss.
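To make the effect of balanced class weights concrete, the sketch below computes them for the three-class counts reported later in this notebook (112, 52 and 43 abstracts). It assumes scikit-learn's `balanced` heuristic, `n_samples / (n_classes * class_count)`; the label array `y_counts_demo` is rebuilt from those counts purely for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Rebuild a label vector from the class counts printed by value_counts()
# (illustrative only; the notebook itself uses data['ratings']).
class_labels = np.array([1, 2, 3])
class_counts = np.array([112, 52, 43])
y_counts_demo = np.repeat(class_labels, class_counts)

# 'balanced' weights: n_samples / (n_classes * count of each class)
weights = compute_class_weight(class_weight="balanced", classes=class_labels, y=y_counts_demo)

for label, weight in zip(class_labels, weights):
    print(f"class {label}: weight {weight:.3f}")
```

The smallest class receives the largest weight, so its errors are penalized more heavily during training.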

### Import Necessary Libraries

In [241]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, f1_score
import numpy as np


### Loading the Dataset
#### Data Cleaning and Preprocessing 

In [242]:

data_path = 'completed_gold_standard.xlsx'
data = pd.read_excel(data_path)

# cleaning and preprocessing the data
data['ratings'] = pd.to_numeric(data['ratings'], errors='coerce')
data.dropna(subset=['ratings'], inplace=True)
data['ratings'] = data['ratings'].astype(int)
data.head()

Unnamed: 0,title,authors/0,authors/1,authors/2,published,summary,pdf_link,authors/3,authors/4,authors/5,...,authors/16,authors/17,authors/18,authors/19,authors/20,authors/21,authors/22,ratings,positive/negative,description
0,Evaluating Modular Dialogue System for Form Fi...,Sherzod Hakimov,Yan Weiser,David Schlangen,2024-03-01T00:00:00Z,This paper introduces a novel approach to form...,https://aclanthology.org/2024.scichat-1.4,,,,...,,,,,,,,3,Negative,"addressing ""context limitations"" as typical co..."
1,Improving Cross-Domain Low-Resource Text Gener...,Zhuang Li,Levon Haroutunian,Raj Tumuluri,2024-03-01T00:00:00Z,Post-editing has proven effective in improving...,https://aclanthology.org/2024.findings-eacl.24,Philip Cohen,Reza Haf,,...,,,,,,,,4,Positive,ability to generalize across domains when usin...
2,Re3val: Reinforced and Reranked Generative Ret...,EuiYul Song,Sangryul Kim,Haeju Lee,2024-03-01T00:00:00Z,Generative retrieval models encode pointers to...,https://aclanthology.org/2024.findings-eacl.27,Joonkee Kim,James Thorne,,...,,,,,,,,4,Positive,discusses two specific limitations of generati...
3,Reward Engineering for Generating Semi-structu...,Jiuzhou Han,Wray Buntine,Ehsan Shareghi,2024-03-01T00:00:00Z,Semi-structured explanation depicts the implic...,https://aclanthology.org/2024.findings-eacl.41,,,,...,,,,,,,,3,Negative,highlighting the challenges in producing struc...
4,Are Large Language Model-based Evaluators the ...,Rishav Hada,Varun Gumma,Adrian Wynter,2024-03-01T00:00:00Z,Large Language Models (LLMs) excel in various ...,https://aclanthology.org/2024.findings-eacl.71,Harshita Diddee,Mohamed Ahmed,Monojit Choudhury,...,,,,,,,,4,Positive,It details the bias of LLM-based evaluators to...


In [243]:

# Merge the 5-point ratings into three classes: 1 -> 1, 2 and 3 -> 2, 4 and 5 -> 3
data['ratings'] = data['ratings'].replace({1: 1, 2: 2, 3: 2, 4: 3, 5: 3})
print(data['ratings'].value_counts())

ratings
1    112
2     52
3     43
Name: count, dtype: int64
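The merge above collapses the five-point scale into three classes. A minimal sketch of the same mapping on a toy series is shown here; `MERGE_MAP` and `toy_ratings` are hypothetical names, and `Series.map` is swapped in for `replace` because it turns any unmapped value into NaN, which makes unexpected ratings easy to spot.

```python
import pandas as pd

# Same mapping as in the notebook: 1 -> 1, 2 and 3 -> 2, 4 and 5 -> 3
MERGE_MAP = {1: 1, 2: 2, 3: 2, 4: 3, 5: 3}

# Toy ratings standing in for data['ratings']
toy_ratings = pd.Series([1, 2, 3, 4, 5, 1, 3])
merged = toy_ratings.map(MERGE_MAP)

# Any rating outside 1-5 would show up as NaN here
unmapped = int(merged.isna().sum())
print("unmapped ratings:", unmapped)
print(merged.value_counts().sort_index())
```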


#### Convert Text to TF-IDF Features
Convert the paper abstracts (the `summary` column) to TF-IDF features, using both unigrams (single words) and bigrams (pairs of consecutive words).

In [244]:
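# Build TF-IDF features from unigrams and bigrams, keeping the 2,000 most frequent terms.
# Note: the vectorizer is fit on all abstracts, before the train/test split.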
vectorizer = TfidfVectorizer(max_features=2000, ngram_range=(1, 2))
X_tfidf = vectorizer.fit_transform(data['summary'])


#### Split the Gold Standard into Training and Test Sets (Repeated 3 Times)
Stratification ensures that the training and test sets have approximately the same distribution of the three merged rating classes (1 to 3).
#### Train the Logistic Regression Model
Initialize a multinomial logistic regression model with balanced class weights to address class imbalance.

In [245]:
f1_scores = []

# repeat three times with a fresh random 80/20 stratified split each time
# (random_state=None, so the splits differ between runs and their test sets may overlap)
for _ in range(3):
    X_train, X_test, y_train, y_test = train_test_split(
        X_tfidf, data['ratings'], test_size=0.2, random_state=None, stratify=data['ratings'])
    
    model = LogisticRegression(class_weight='balanced', solver='lbfgs', max_iter=1000, multi_class='multinomial')
    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_test)
    f1 = f1_score(y_test, y_pred, average='macro')
    f1_scores.append(f1)
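One possible variation, sketched below under stated assumptions, wraps the vectorizer and classifier in a pipeline so that the TF-IDF vocabulary is learned from the training texts of each split only, and uses `StratifiedShuffleSplit` with a fixed seed so the three splits are reproducible. The helper name `repeated_macro_f1` and the toy corpus are hypothetical; this is not the procedure that produced the scores reported in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.pipeline import make_pipeline

def repeated_macro_f1(texts, labels, n_splits=3, test_size=0.2, seed=0):
    """Return one macro F1 score per stratified random split."""
    splitter = StratifiedShuffleSplit(n_splits=n_splits, test_size=test_size, random_state=seed)
    scores = []
    for train_idx, test_idx in splitter.split(texts, labels):
        # A fresh pipeline per split: vocabulary and IDF weights
        # are learned from the training texts only.
        pipeline = make_pipeline(
            TfidfVectorizer(max_features=2000, ngram_range=(1, 2)),
            LogisticRegression(class_weight="balanced", max_iter=1000),
        )
        pipeline.fit([texts[i] for i in train_idx], [labels[i] for i in train_idx])
        predictions = pipeline.predict([texts[i] for i in test_idx])
        true_labels = [labels[i] for i in test_idx]
        scores.append(f1_score(true_labels, predictions, average="macro"))
    return scores

# Hypothetical toy corpus standing in for the abstracts (10 texts per class)
toy_texts = (
    [f"dialogue system evaluation study {i}" for i in range(10)]
    + [f"generative retrieval ranking model {i}" for i in range(10)]
    + [f"reward engineering explanation graph {i}" for i in range(10)]
)
toy_labels = [1] * 10 + [2] * 10 + [3] * 10

toy_scores = repeated_macro_f1(toy_texts, toy_labels)
print("Macro F1 per split:", toy_scores)
```

With the real data, `texts` would be `data['summary'].tolist()` and `labels` would be `data['ratings'].tolist()`.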




#### Predicting and Evaluating the Tf-idf Model

In [248]:
# Calculate the average F1 score across the three runs
average_f1_score = np.mean(f1_scores)

print("F1 Scores from each run:", f1_scores)
print("Average Macro F1 Score:", average_f1_score)


F1 Scores from each run: [0.4676463886990203, 0.2142857142857143, 0.40350564468211525]
Average Macro F1 Score: 0.36181258255561666


# SBERT

In [21]:
!pip install sentence-transformers


Collecting sentence-transformers
  Downloading sentence_transformers-3.0.1-py3-none-any.whl.metadata (10 kB)
Downloading sentence_transformers-3.0.1-py3-none-any.whl (227 kB)
   ---------------------------------------- 227.1/227.1 kB 2.7 MB/s eta 0:00:00
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-3.0.1





## Import Necessary Libraries

In [14]:
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, f1_score
from imblearn.over_sampling import SMOTE
import numpy as np




## Load and Preprocess the Dataset

In [15]:

data_path = 'completed_gold_standard.xlsx'
data = pd.read_excel(data_path)
data['ratings'] = pd.to_numeric(data['ratings'], errors='coerce')
data.dropna(subset=['ratings'], inplace=True)
data['ratings'] = data['ratings'].astype(int)



In [66]:

# Merge the 5-point ratings into three classes: 1 -> 1, 2 and 3 -> 2, 4 and 5 -> 3
data['ratings'] = data['ratings'].replace({1: 1, 2: 2, 3: 2, 4: 3, 5: 3})
print(data['ratings'].value_counts())


ratings
1    112
2     52
3     43
Name: count, dtype: int64


## SBERT and Generate Embeddings

In [67]:
sbert_model = SentenceTransformer('all-MiniLM-L12-v2')

# Generate embeddings
embeddings = sbert_model.encode(data['summary'].tolist())





In [68]:
import pandas as pd
# Convert 'ratings' to an integer type, handling errors and missing values
data['ratings'] = pd.to_numeric(data['ratings'], errors='coerce')  # Converts to float by default
data.dropna(subset=['ratings'], inplace=True)
data['ratings'] = data['ratings'].astype(int)


## Data Splitting and SMOTE Application

In [69]:
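f1_scores = []
# Repeat the split three times. Only the split and SMOTE steps are inside this loop;
# the classifier cells below run once, on the split left over from the final iteration.
for i in range(3):
    X_train, X_test, y_train, y_test = train_test_split(  # fresh random 80/20 stratified split
        embeddings,
        data['ratings'],
        test_size=0.2,
        random_state=None, 
        stratify=data['ratings']
    )
    # Oversample the minority classes in the training set only
    smote = SMOTE(random_state=42)
    X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)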

In [71]:
classifier = LogisticRegression(class_weight='balanced', max_iter=1000)
classifier.fit(X_train_smote, y_train_smote)


In [72]:
y_pred = classifier.predict(X_test)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='macro')
f1_scores.append(f1)
# This cell runs once, after the loop, so `i` is 2 and only the final split is reported
print(f"Rotation {i+1}:")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)
print("\n")


Rotation 3:
Confusion Matrix:
[[13  4  6]
 [ 7  1  2]
 [ 1  2  6]]
Classification Report:
              precision    recall  f1-score   support

           1       0.62      0.57      0.59        23
           2       0.14      0.10      0.12        10
           3       0.43      0.67      0.52         9

    accuracy                           0.48        42
   macro avg       0.40      0.44      0.41        42
weighted avg       0.46      0.48      0.46        42
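As a cross-check, the macro F1 score can be recomputed by hand from the confusion matrix printed above. Macro F1 is the unweighted mean of the per-class F1 scores, so each class counts equally regardless of its support. The sketch assumes rows are true classes and columns are predicted classes, which is the layout `confusion_matrix` returns.

```python
import numpy as np

# Confusion matrix from the rotation reported above
# (rows: true classes 1-3, columns: predicted classes 1-3)
conf_matrix = np.array([[13, 4, 6],
                        [7, 1, 2],
                        [1, 2, 6]])

true_positives = np.diag(conf_matrix)
precision = true_positives / conf_matrix.sum(axis=0)  # column sums = predicted counts
recall = true_positives / conf_matrix.sum(axis=1)     # row sums = true counts (support)
per_class_f1 = 2 * precision * recall / (precision + recall)

# Macro average: plain mean over classes
macro_f1 = per_class_f1.mean()
print("Per-class F1:", per_class_f1.round(2))  # about 0.59, 0.12, 0.52
print("Macro F1:", round(macro_f1, 4))         # about 0.4101
```

The result agrees with the `macro avg` row of the classification report.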





In [73]:
print("Number of embeddings:", len(embeddings))
print("Number of ratings:", len(data['ratings']))


Number of embeddings: 207
Number of ratings: 207


In [74]:
#average f1 score across the three runs
average_f1_score = np.mean(f1_scores)
print("F1 Scores from each run:", f1_scores)
print("Average Macro F1 Score:", average_f1_score)


F1 Scores from each run: [0.41009842672246766]
Average Macro F1 Score: 0.41009842672246766


In [32]:
# from sklearn.linear_model import LogisticRegression

# #class weights 
# classifier = LogisticRegression(class_weight='balanced')
# classifier.fit(X_train, y_train)


In [33]:
# from sklearn.metrics import confusion_matrix, classification_report

# # y_pred = classifier.predict(X_test)
# conf_matrix = confusion_matrix(y_test, y_pred)
# class_report = classification_report(y_test, y_pred)

# print(conf_matrix)
# print(class_report)


[[13  2  2  4  2]
 [ 2  0  0  1  0]
 [ 1  2  2  1  1]
 [ 2  1  1  1  1]
 [ 1  0  1  1  0]]
              precision    recall  f1-score   support

           1       0.68      0.57      0.62        23
           2       0.00      0.00      0.00         3
           3       0.33      0.29      0.31         7
           4       0.12      0.17      0.14         6
           5       0.00      0.00      0.00         3

    accuracy                           0.38        42
   macro avg       0.23      0.20      0.21        42
weighted avg       0.45      0.38      0.41        42



## SBERT

In [37]:
# import pandas as pd

# data_path = 'completed_gold_standard.xlsx'
# data = pd.read_excel(data_path)
# data['ratings'] = pd.to_numeric(data['ratings'], errors='coerce')

# data = data.dropna(subset=['ratings'])

# data['ratings'] = data['ratings'].astype(int)

# print("Data types:\n", data.dtypes)
# print("Number of NaN in ratings:", data['ratings'].isna().sum())


Data types:
 title                 object
authors/0             object
authors/1             object
authors/2             object
published             object
summary               object
pdf_link              object
authors/3             object
authors/4             object
authors/5             object
authors/6             object
authors/7             object
authors/8             object
authors/9             object
authors/10            object
authors/11            object
authors/12            object
authors/13            object
authors/14            object
authors/15            object
authors/16            object
authors/17            object
authors/18            object
authors/19            object
authors/20            object
authors/21            object
authors/22            object
ratings                int32
positive/negative     object
description           object
dtype: object
Number of NaN in ratings: 0


In [41]:
# from sklearn.metrics import confusion_matrix, classification_report

# # Predict the ratings on the test set
# y_pred = classifier.predict(X_test)

# # Generate a confusion matrix and a classification report
# conf_matrix = confusion_matrix(y_test, y_pred)
# class_report = classification_report(y_test, y_pred)

# # Print the results
# print("Confusion Matrix:")
# print(conf_matrix)
# print("\nClassification Report:")
# print(class_report)


Confusion Matrix:
[[14  2  2  3  2]
 [ 1  2  0  0  0]
 [ 2  2  0  2  1]
 [ 2  1  0  1  2]
 [ 1  0  1  0  1]]

Classification Report:
              precision    recall  f1-score   support

           1       0.70      0.61      0.65        23
           2       0.29      0.67      0.40         3
           3       0.00      0.00      0.00         7
           4       0.17      0.17      0.17         6
           5       0.17      0.33      0.22         3

    accuracy                           0.43        42
   macro avg       0.26      0.36      0.29        42
weighted avg       0.44      0.43      0.42        42



# SMOTE 
Oversampling with SMOTE (Synthetic Minority Over-sampling Technique) can help balance the class distribution in the training data. SMOTE creates synthetic minority-class samples by interpolating between existing ones, which is useful for the underrepresented classes here: classes 2, 3, 4, and 5 on the original five-point rating scale.
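The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the `imbalanced-learn` implementation: the function name `smote_like_sample`, the toy points and the choice of `k` are assumptions.

```python
import numpy as np

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and one of its k nearest neighbours."""
    rng = np.random.default_rng() if rng is None else rng
    i = rng.integers(len(minority))                     # pick a minority sample at random
    distances = np.linalg.norm(minority - minority[i], axis=1)
    neighbours = np.argsort(distances)[1:k + 1]         # skip position 0, the sample itself
    j = rng.choice(neighbours)                          # pick one of its nearest neighbours
    gap = rng.random()                                  # interpolation factor in [0, 1)
    return minority[i] + gap * (minority[j] - minority[i])

# Four toy minority-class points on the corners of the unit square
minority_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_sample(minority_points, k=2, rng=np.random.default_rng(0))
print(synthetic)
```

Each synthetic point lies on the line segment between two real minority samples, so oversampling adds new points rather than exact duplicates.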

In [42]:
!pip install imbalanced-learn


Collecting imbalanced-learn
  Downloading imbalanced_learn-0.12.3-py3-none-any.whl.metadata (8.3 kB)
Downloading imbalanced_learn-0.12.3-py3-none-any.whl (258 kB)
   ---------------------------------------- 258.3/258.3 kB 5.3 MB/s eta 0:00:00
Installing collected packages: imbalanced-learn
Successfully installed imbalanced-learn-0.12.3


In [43]:
# from imblearn.over_sampling import SMOTE
# from sklearn.model_selection import train_test_split

# X_train, X_test, y_train, y_test = train_test_split(
#     embeddings, 
#     data['ratings'], 
#     test_size=0.2, 
#     random_state=42, 
#     stratify=data['ratings']
# )
# smote = SMOTE(random_state=42)
# X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)



In [44]:
# from sklearn.linear_model import LogisticRegression


# classifier = LogisticRegression(class_weight='balanced', max_iter=1000)
# classifier.fit(X_train_smote, y_train_smote)
# from sklearn.metrics import confusion_matrix, classification_report

# y_pred = classifier.predict(X_test)
# conf_matrix = confusion_matrix(y_test, y_pred)
# class_report = classification_report(y_test, y_pred)

# print("Confusion Matrix:")
# print(conf_matrix)
# print("\nClassification Report:")
# print(class_report)


Confusion Matrix:
[[15  2  2  2  2]
 [ 0  1  1  1  0]
 [ 2  1  2  2  0]
 [ 2  0  2  1  1]
 [ 1  0  1  0  1]]

Classification Report:
              precision    recall  f1-score   support

           1       0.75      0.65      0.70        23
           2       0.25      0.33      0.29         3
           3       0.25      0.29      0.27         7
           4       0.17      0.17      0.17         6
           5       0.25      0.33      0.29         3

    accuracy                           0.48        42
   macro avg       0.33      0.35      0.34        42
weighted avg       0.51      0.48      0.49        42

