## CSC-696-001.2025F Final Project (1/2)
**Name: Anna Hyunjung Kim**

**Collaborators: Prof. Patrick Wu**





---





**Title:** Measuring Ethical Risks in AI-Generated News Using NLP with the UNESCO Ethics of AI Framework

**Research Question:** To what extent do ethically problematic statements occur in AI-generated news articles, and which category of the AI ethics principles proposed by UNESCO does each issue correspond to most closely?



**Data Set**
1. Train Data (data_1):
    -  https://huggingface.co/datasets/hendrycks/ethics
    -  21.8k rows
    -  This data labels sentences as ethically appropriate or inappropriate; I use it to train a text classification model.
    - @article{hendrycks2021ethics,
  title={Aligning AI With Shared Human Values},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}

2. Test Data (data_2):
    - https://huggingface.co/datasets/lvulpecula/ai_watermarked_fake_news-v2
    - 1.5k rows
    - A trained ethics model is applied to this data to identify ethically problematic texts among AI-generated articles.

3. Ethical category (data_3):
    - https://huggingface.co/datasets/ktiyab/ethical-framework-UNESCO-Ethics-of-AI
    - 483 rows
    - For articles that the classifier flags as "problematic", compare each article with the principle descriptions in this data and assign the semantically nearest ethical category.
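
The matching step in item 3 can be sketched as a nearest-neighbour search over text vectors. The example below is only an illustration under stated assumptions: it uses bag-of-words cosine similarity as a stand-in for whatever sentence representation the project ends up using, the helper names (`bag_of_words`, `cosine`, `nearest_category`) are hypothetical, and the two category descriptions are toy strings written for this example, not rows of data_3.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_category(article, categories):
    """Return (category_name, similarity) for the best-matching description."""
    article_vec = bag_of_words(article)
    scored = [
        (name, cosine(article_vec, bag_of_words(description)))
        for name, description in categories.items()
    ]
    return max(scored, key=lambda pair: pair[1])


# Toy descriptions written for this example only; they are NOT rows of data_3.
toy_categories = {
    "privacy": "personal data must be protected and privacy respected",
    "fairness": "systems should avoid discrimination and treat groups fairly",
}

name, score = nearest_category(
    "The article leaked personal data and ignored privacy.", toy_categories
)
print(name, round(score, 3))
```

Word overlap is a crude notion of "nearest"; swapping `bag_of_words` for a sentence-embedding function would keep the rest of the sketch unchanged.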


In [None]:
from datasets import load_dataset

**Load data_1**

I use two of the ETHICS subsets: commonsense and justice. I combined them to increase data diversity.
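Although they emphasize different moral dimensions (everyday morality vs. fairness/entitlement), I treat label 1 consistently as ‘morally problematic’ and label 0 as ‘acceptable’. This treatment assumes that both subsets use the same label polarity, which should be checked against the dataset documentation: if label 1 in the justice subset marks a claim as reasonable, those labels need to be flipped before merging. Combining the subsets slightly broadens the notion of “ethical risk” learned by the classifier, which is appropriate for analyzing AI-generated news.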


---


**Commonsense:** Total 21.8k rows (label: 0 (54.3%), 1 (45.7%))

**Justice:** Total 26.5k rows (label: 0 (45.7%), 1 (54.3%))
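
Label shares like the ones above can be recomputed at any point with a few lines. The sketch below uses toy label lists (not the real ETHICS labels) and a hypothetical helper `label_shares` to show how two subsets with opposite imbalances move toward an even split once they are concatenated.

```python
from collections import Counter


def label_shares(labels):
    """Return {label: fraction of rows} for a sequence of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}


# Toy labels standing in for two subsets; the real values come from the CSVs.
toy_subset_a = [0, 0, 0, 1]
toy_subset_b = [1, 1, 1, 0]

combined = toy_subset_a + toy_subset_b
print(label_shares(toy_subset_a))
print(label_shares(combined))
```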


---


Initially, I attempted to load the ETHICS dataset using the standard Hugging Face interface: `ethics_ds = load_dataset("hendrycks/ethics", "commonsense")`.

However, with the newer version of the `datasets` library, this call failed with
`RuntimeError: Dataset scripts are no longer supported, but found ethics.py`.
The repository is defined by a loading script (`ethics.py`), and script-based datasets are no longer supported by default.

To resolve this, instead of relying on the old script interface, I directly loaded the underlying CSV files from the Hugging Face repository.
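
An alternative sketch of the same workaround keeps everything inside `datasets` by pointing its generic CSV builder at the file URLs. The helper names `ethics_csv_urls` and `load_ethics_subset` are hypothetical, and the mapping of `test.csv` to validation and `test_hard.csv` to test is this notebook's own choice, not something the repository prescribes. The loader needs network access and is therefore only defined here, not called.

```python
def ethics_csv_urls(subset):
    """Build the train/validation/test CSV URLs for one ETHICS subset."""
    base = f"https://huggingface.co/datasets/hendrycks/ethics/resolve/main/data/{subset}/"
    return {
        "train": base + "train.csv",
        "validation": base + "test.csv",      # used as the validation split here
        "test": base + "test_hard.csv",       # used as the held-out test split here
    }


def load_ethics_subset(subset):
    """Load one subset through the generic CSV builder (needs network access)."""
    # Imported lazily so the URL helper can be used without the library installed.
    from datasets import load_dataset

    return load_dataset("csv", data_files=ethics_csv_urls(subset))


print(ethics_csv_urls("justice")["train"])
```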

In [None]:
import pandas as pd

commonsense_base = "https://huggingface.co/datasets/hendrycks/ethics/resolve/main/data/commonsense/"

commonsense_train_df = pd.read_csv(commonsense_base + "train.csv")
commonsense_val_df   = pd.read_csv(commonsense_base + "test.csv")
commonsense_test_df  = pd.read_csv(commonsense_base + "test_hard.csv")

commonsense_train_df.head()

In [None]:
justice_base = "https://huggingface.co/datasets/hendrycks/ethics/resolve/main/data/justice/"

justice_train_df = pd.read_csv(justice_base + "train.csv")
justice_val_df   = pd.read_csv(justice_base + "test.csv")
justice_test_df  = pd.read_csv(justice_base + "test_hard.csv")

justice_train_df.head()

In [None]:
# Match column names across the two subsets

# commonsense
commonsense_train_df = commonsense_train_df.rename(columns={"input": "text"})
commonsense_val_df   = commonsense_val_df.rename(columns={"input": "text"})
commonsense_test_df  = commonsense_test_df.rename(columns={"input": "text"})

# justice
justice_train_df = justice_train_df.rename(columns={"scenario": "text"})
justice_val_df   = justice_val_df.rename(columns={"scenario": "text"})
justice_test_df  = justice_test_df.rename(columns={"scenario": "text"})

# Keep only the necessary columns

# commonsense
for df in [commonsense_train_df, commonsense_val_df, commonsense_test_df]:
    df["label"] = df["label"].astype(int)
    df["source"] = "commonsense"
    df.drop(columns=[c for c in df.columns if c not in ["text", "label", "source"]],
            inplace=True)

# justice
for df in [justice_train_df, justice_val_df, justice_test_df]:
    df["label"] = df["label"].astype(int)
    df["source"] = "justice"
    df.drop(columns=[c for c in df.columns if c not in ["text", "label", "source"]],
            inplace=True)

print(commonsense_train_df.head(2))
print(justice_train_df.head(2))



In [None]:
train_df = pd.concat([commonsense_train_df, justice_train_df], ignore_index=True)
val_df   = pd.concat([commonsense_val_df,   justice_val_df],   ignore_index=True)
test_df  = pd.concat([commonsense_test_df,  justice_test_df],  ignore_index=True)

print(train_df["source"].value_counts())

print("Train:", train_df.shape)
print("Val:", val_df.shape)
print("Test:", test_df.shape)

In [None]:
from datasets import Dataset, DatasetDict

# pandas DataFrame → Dataset
# Convert to Dataset objects because the Transformers `Trainer` works with them.
train_ds = Dataset.from_pandas(train_df, preserve_index=False)
val_ds   = Dataset.from_pandas(val_df,   preserve_index=False)
test_ds  = Dataset.from_pandas(test_df,  preserve_index=False)

data_1 = DatasetDict({
    "train": train_ds,
    "validation": val_ds,
    "test": test_ds,
})

data_1

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification


In [None]:
# Tokenization test on a single row
row = train_df.iloc[0]
print(row["text"])
print(row["label"], row["source"])

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

encoding = tokenizer(
    row["text"],
    truncation=True,
    padding="max_length",
    max_length=200,
)

# Inspect what the tokenizer returns
print(encoding.keys())
print(encoding["input_ids"][:20])
print(encoding["attention_mask"][:20])
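
What `truncation=True` and `padding="max_length"` do to a sequence can be sketched without the tokenizer. The function below is a simplified illustration with a hypothetical name (`pad_or_truncate`) and made-up token ids. Real tokenizers also keep the closing special token when they truncate, which this sketch does not do.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut or pad a list of token ids to exactly max_length entries."""
    kept = token_ids[:max_length]
    padding = [pad_id] * (max_length - len(kept))
    input_ids = kept + padding
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(kept) + [0] * len(padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Illustrative ids only; they are not produced by a real tokenizer call.
short = pad_or_truncate([101, 7592, 102], max_length=5)
long = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
print(short)
print(long)
```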

**ver1**

In [None]:
# DistilBERT
# https://huggingface.co/docs/transformers/en/model_doc/distilbert

model_name = "distilbert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2  # 0/1 binary
)


In [None]:
def tokenize_batch(batch):
    return tokenizer(
        batch["text"],
        truncation=True,
        padding="max_length",
        max_length=128,
    )

tokenized_ds = data_1.map(tokenize_batch, batched=True)

print("tokenized_ds['train'][0]:", tokenized_ds['train'][0])

tokenized_ds = tokenized_ds.remove_columns(["text", "source"])
tokenized_ds.set_format("torch")

# Check that label, input_ids and attention_mask are present
tokenized_ds

In [None]:
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    acc = accuracy_score(labels, preds)
    f1  = f1_score(labels, preds)
    prec = precision_score(labels, preds)
    rec  = recall_score(labels, preds)
    return {
        "accuracy": acc,
        "f1": f1,
        "precision": prec,
        "recall": rec
    }
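
The four scores returned by `compute_metrics` all derive from the confusion counts. As a worked example, the sketch below recomputes them by hand on six toy predictions. The function name `binary_metrics` is hypothetical, and returning 0.0 when a denominator is zero is a choice made for this sketch.

```python
def binary_metrics(labels, preds):
    """Accuracy, precision, recall and F1 for binary labels (positive class = 1)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)

    accuracy = (tp + tn) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "f1": f1, "precision": precision, "recall": recall}


toy_labels = [1, 1, 1, 0, 0, 0]
toy_preds = [1, 1, 0, 1, 0, 0]
scores = binary_metrics(toy_labels, toy_preds)
print(scores)
```

Here two of three positives are found (recall 2/3) and two of three positive predictions are right (precision 2/3), so F1 is also 2/3.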


In [None]:
from transformers import TrainingArguments, Trainer
# https://huggingface.co/docs/transformers/main_classes/trainer

In [None]:
training_args = TrainingArguments(
    output_dir="./ethics-distilbert-full",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    report_to="none",
    label_smoothing_factor=0.1 # Added
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
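
The `label_smoothing_factor=0.1` setting changes the training target. The sketch below shows the standard formulation of label smoothing, in which the true class receives weight `1 - epsilon` plus an equal share of `epsilon` spread over all classes. It is meant as an illustration of the idea, not as a copy of the `Trainer` internals, and the function name `smoothed_cross_entropy` is hypothetical.

```python
import math


def smoothed_cross_entropy(logits, target, epsilon=0.1):
    """Cross-entropy against a label-smoothed target distribution."""
    num_classes = len(logits)
    # Numerically stable log-softmax.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    log_probs = [z - log_norm for z in logits]
    # Smoothed target: epsilon / K on every class, plus (1 - epsilon) on the true one.
    loss = 0.0
    for k, log_p in enumerate(log_probs):
        weight = epsilon / num_classes + ((1.0 - epsilon) if k == target else 0.0)
        loss -= weight * log_p
    return loss


confident_logits = [5.0, -5.0]
plain = smoothed_cross_entropy(confident_logits, target=0, epsilon=0.0)
smoothed = smoothed_cross_entropy(confident_logits, target=0, epsilon=0.1)
print(f"plain: {plain:.5f}, smoothed: {smoothed:.5f}")
```

With smoothing, a very confident correct prediction still pays a noticeable loss, which discourages the model from pushing its logits to extremes.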


In [None]:
trainer.train()

trainer.evaluate(tokenized_ds["validation"])
trainer.evaluate(tokenized_ds["test"])


In [None]:
all_results = {}

all_results["full_v1"] = {
    "val":  trainer.evaluate(tokenized_ds["validation"]),
    "test": trainer.evaluate(tokenized_ds["test"]),
}

In [None]:
trainer.save_model("./ethics-distilbert-full")
tokenizer.save_pretrained("./ethics-distilbert-full")


**commonsense subset only**

Because the initial model performance was relatively low, I plan to conduct an additional experiment using only the commonsense subset of the ETHICS dataset. This subset is more behavior-focused and less abstract than the justice subset, so using it alone may reduce noise and lead to clearer learning signals for the classifier.

In [None]:
commonsense_train_ds = Dataset.from_pandas(commonsense_train_df, preserve_index=False)
commonsense_val_ds   = Dataset.from_pandas(commonsense_val_df,   preserve_index=False)
commonsense_test_ds  = Dataset.from_pandas(commonsense_test_df,  preserve_index=False)

data_1_2 = DatasetDict({
    "train": commonsense_train_ds,
    "validation": commonsense_val_ds,
    "test": commonsense_test_ds,
})

data_1_2

In [None]:
model_commonsense = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
)

In [None]:
# Reuse the same tokenize_batch; only the dataset changes
tokenized_ds_2 = data_1_2.map(tokenize_batch, batched=True)

print("tokenized_ds_2['train'][0]:", tokenized_ds_2['train'][0])

tokenized_ds_2 = tokenized_ds_2.remove_columns(["text", "source"])
tokenized_ds_2.set_format("torch")

# Check that label, input_ids and attention_mask are present
tokenized_ds_2

In [None]:
training_args2 = TrainingArguments(
    output_dir="./ethics-distilbert-commonsense",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    report_to="none",
    label_smoothing_factor=0.1
)

trainer2 = Trainer(
    model=model_commonsense,
    args=training_args2,
    train_dataset=tokenized_ds_2["train"],
    eval_dataset=tokenized_ds_2["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer2.train()

trainer2.evaluate(tokenized_ds_2["validation"])
trainer2.evaluate(tokenized_ds_2["test"])

In [None]:
all_results["commonsense_v1"] = {
    "val":  trainer2.evaluate(tokenized_ds_2["validation"] ),
    "test": trainer2.evaluate(tokenized_ds_2["test"] ),
}

In [None]:
trainer2.save_model("./ethics-distilbert-commonsense")
tokenizer.save_pretrained("./ethics-distilbert-commonsense")

**Justice subset only**

In [None]:
justice_train_ds = Dataset.from_pandas(justice_train_df, preserve_index=False)
justice_val_ds   = Dataset.from_pandas(justice_val_df,   preserve_index=False)
justice_test_ds  = Dataset.from_pandas(justice_test_df,  preserve_index=False)

data_1_3 = DatasetDict({
    "train": justice_train_ds,
    "validation": justice_val_ds,
    "test": justice_test_ds,
})

data_1_3

model_justice = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
)

# Reuse the same tokenize_batch; only the dataset changes
tokenized_ds_3 = data_1_3.map(tokenize_batch, batched=True)

print("tokenized_ds_3['train'][0]:", tokenized_ds_3['train'][0])

tokenized_ds_3 = tokenized_ds_3.remove_columns(["text", "source"])
tokenized_ds_3.set_format("torch")

# Check that label, input_ids and attention_mask are present
tokenized_ds_3

training_args3 = TrainingArguments(
    output_dir="./ethics-distilbert-justice",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    report_to="none",
    label_smoothing_factor=0.1
)

trainer3 = Trainer(
    model=model_justice,
    args=training_args3,
    train_dataset=tokenized_ds_3["train"],
    eval_dataset=tokenized_ds_3["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer3.train()

trainer3.evaluate(tokenized_ds_3["validation"])
trainer3.evaluate(tokenized_ds_3["test"])

In [None]:
all_results["justice_v1"] = {
    "val":  trainer3.evaluate(tokenized_ds_3["validation"] ),
    "test": trainer3.evaluate(tokenized_ds_3["test"] ),
}

In [None]:
trainer3.save_model("./ethics-distilbert-justice")
tokenizer.save_pretrained("./ethics-distilbert-justice")


**ver2**

The training loss is low, but the F1, precision, and recall scores on held-out data are not convincing. This suggests the model has memorized patterns in the training data, so I suspect overfitting.

To address this, I made two changes:

1) dropout=0.2 / attention_dropout=0.2

I increased the dropout rate to 0.2 to reduce overfitting.
Dropout randomly disables a fraction of the model's units during training,
so the model cannot rely too heavily on memorizing the training data.
This should help it generalize better to new, unseen sentences.

2) num_train_epochs = 1

I reduced the number of training epochs to 1 because the ETHICS dataset is small and noisy.
Training for too long makes the model overfit:
it fits the training data almost perfectly but performs worse on validation examples. Using only 1 epoch helps limit overfitting.
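
The effect of a 0.2 dropout rate can be illustrated on a plain list of numbers. The sketch below implements "inverted" dropout, where surviving values are rescaled by `1 / (1 - rate)` so the expected activation stays the same. The function name `inverted_dropout` and the toy activations are assumptions made for this example, and the code is independent of the DistilBERT model.

```python
import random


def inverted_dropout(values, rate, rng):
    """Zero each value with probability `rate` and rescale the survivors."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)  # fixed seed so the run is repeatable
activations = [1.0] * 10_000
dropped = inverted_dropout(activations, rate=0.2, rng=rng)

zero_share = dropped.count(0.0) / len(dropped)
mean_value = sum(dropped) / len(dropped)
print(f"zeroed: {zero_share:.3f}, mean after dropout: {mean_value:.3f}")
```

Roughly one value in five is zeroed on each pass, and a different random subset is dropped every time, which is what prevents the network from depending on any single unit.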

**Full**

In [None]:
from transformers import AutoConfig

In [None]:
model_name = "distilbert-base-uncased"

tokenizer_ver2 = AutoTokenizer.from_pretrained(model_name, use_fast=True)

config_ver2 = AutoConfig.from_pretrained(
    model_name,
    num_labels=2,
    dropout=0.2,
    attention_dropout=0.2,
)

model_ver2 = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    config=config_ver2,
)

def tokenize_batch(batch):
    return tokenizer_ver2(
        batch["text"],
        truncation=True,
        padding="max_length",
        max_length=128,
    )

tokenized_ds_ver2 = data_1.map(tokenize_batch, batched=True)

print("tokenized_ds_ver2['train'][0]:", tokenized_ds_ver2['train'][0])

tokenized_ds_ver2 = tokenized_ds_ver2.remove_columns(["text", "source"])
tokenized_ds_ver2.set_format("torch")

# Check that label, input_ids and attention_mask are present
tokenized_ds_ver2

training_args_ver2 = TrainingArguments(
    output_dir="./ethics-distilbert-full_ver2",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    report_to="none",
    label_smoothing_factor=0.1
)
trainer_ver2 = Trainer(
    model=model_ver2,
    args=training_args_ver2,
    train_dataset=tokenized_ds_ver2["train"],
    eval_dataset=tokenized_ds_ver2["validation"],
    tokenizer=tokenizer_ver2,
    compute_metrics=compute_metrics,
)

trainer_ver2.train()

trainer_ver2.evaluate(tokenized_ds_ver2["validation"])
trainer_ver2.evaluate(tokenized_ds_ver2["test"])


In [None]:
all_results["full_v2"] = {
    "val":  trainer_ver2.evaluate(tokenized_ds_ver2["validation"]),
    "test": trainer_ver2.evaluate(tokenized_ds_ver2["test"]),
}

trainer_ver2.save_model("./ethics-distilbert-full_ver2")
tokenizer_ver2.save_pretrained("./ethics-distilbert-full_ver2")

**Commonsense**

In [None]:
model_name = "distilbert-base-uncased"

tokenizer_ver2_2 = AutoTokenizer.from_pretrained(model_name, use_fast=True)

config_ver2_2 = AutoConfig.from_pretrained(
    model_name,
    num_labels=2,
    dropout=0.2,
    attention_dropout=0.2,
)

model_ver2_2 = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    config=config_ver2_2,
)

def tokenize_batch(batch):
    return tokenizer_ver2_2(
        batch["text"],
        truncation=True,
        padding="max_length",
        max_length=128,
    )

tokenized_ds_ver2_2 = data_1_2.map(tokenize_batch, batched=True)

print("tokenized_ds_ver2_2['train'][0]:", tokenized_ds_ver2_2['train'][0])

tokenized_ds_ver2_2 = tokenized_ds_ver2_2.remove_columns(["text", "source"])
tokenized_ds_ver2_2.set_format("torch")

# Check that label, input_ids and attention_mask are present
tokenized_ds_ver2_2

training_args_ver2_2 = TrainingArguments(
    output_dir="./ethics-distilbert-commonsense_ver2_2",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    report_to="none",
    label_smoothing_factor=0.1
)
trainer_ver2_2 = Trainer(
    model=model_ver2_2,
    args=training_args_ver2_2,
    train_dataset=tokenized_ds_ver2_2["train"],
    eval_dataset=tokenized_ds_ver2_2["validation"],
    tokenizer=tokenizer_ver2_2,
    compute_metrics=compute_metrics,
)

trainer_ver2_2.train()

trainer_ver2_2.evaluate(tokenized_ds_ver2_2["validation"])
trainer_ver2_2.evaluate(tokenized_ds_ver2_2["test"])

In [None]:
all_results["commonsense_v2"] = {
    "val":  trainer_ver2_2.evaluate(tokenized_ds_ver2_2["validation"]),
    "test": trainer_ver2_2.evaluate(tokenized_ds_ver2_2["test"]),
}

trainer_ver2_2.save_model("./ethics-distilbert-commonsense_ver2_2")
tokenizer_ver2_2.save_pretrained("./ethics-distilbert-commonsense_ver2_2")

**Justice**

In [None]:
model_name = "distilbert-base-uncased"

tokenizer_ver2_3 = AutoTokenizer.from_pretrained(model_name, use_fast=True)

config_ver2_3 = AutoConfig.from_pretrained(
    model_name,
    num_labels=2,
    dropout=0.2,
    attention_dropout=0.2,
)

model_ver2_3 = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    config=config_ver2_3,
)

def tokenize_batch(batch):
    return tokenizer_ver2_3(
        batch["text"],
        truncation=True,
        padding="max_length",
        max_length=128,
    )

tokenized_ds_ver2_3 = data_1_3.map(tokenize_batch, batched=True)

print("tokenized_ds_ver2_3['train'][0]:", tokenized_ds_ver2_3['train'][0])

tokenized_ds_ver2_3 = tokenized_ds_ver2_3.remove_columns(["text", "source"])
tokenized_ds_ver2_3.set_format("torch")

# Check that label, input_ids and attention_mask are present
tokenized_ds_ver2_3


training_args_ver2_3 = TrainingArguments(
    output_dir="./ethics-distilbert-justice_ver2_3",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    report_to="none",
    label_smoothing_factor=0.1
)
trainer_ver2_3 = Trainer(
    model=model_ver2_3,
    args=training_args_ver2_3,
    train_dataset=tokenized_ds_ver2_3["train"],
    eval_dataset=tokenized_ds_ver2_3["validation"],
    tokenizer=tokenizer_ver2_3,
    compute_metrics=compute_metrics,
)

trainer_ver2_3.train()

trainer_ver2_3.evaluate(tokenized_ds_ver2_3["validation"])
trainer_ver2_3.evaluate(tokenized_ds_ver2_3["test"])


In [None]:
all_results["justice_v2"] = {
    "val":  trainer_ver2_3.evaluate(tokenized_ds_ver2_3["validation"]),
    "test": trainer_ver2_3.evaluate(tokenized_ds_ver2_3["test"]),
}

trainer_ver2_3.save_model("./ethics-distilbert-justice_ver2_3")
tokenizer_ver2_3.save_pretrained("./ethics-distilbert-justice_ver2_3")

In [None]:
# Use run_name so the loop does not overwrite the global model_name checkpoint id.
for run_name, splits in all_results.items():
    print(f"\n################ {run_name} ################")
    for split, metrics in splits.items():
        print(f"\n--- {split.upper()} ---")
        for k, v in metrics.items():
            print(f"{k}: {v}")

In [None]:
import matplotlib.pyplot as plt
import numpy as np

model_names = []
val_f1_list = []
test_f1_list = []

# Use run_name so the loop does not overwrite the global model_name checkpoint id.
for run_name, splits in all_results.items():
    model_names.append(run_name)
    val_f1_list.append(splits["val"]["eval_f1"])
    test_f1_list.append(splits["test"]["eval_f1"])

x = np.arange(len(model_names))
width = 0.35

fig, ax = plt.subplots(figsize=(9, 4))

bars_val = ax.bar(x - width/2, val_f1_list, width, label="VAL F1")
bars_test = ax.bar(x + width/2, test_f1_list, width, label="TEST F1")

ax.set_xticks(x)
ax.set_xticklabels(model_names, rotation=45, ha="right")
ax.set_ylabel("F1 score")
ax.set_title("ETHICS models – Validation vs Test F1")
ax.legend()

def add_labels(bars):
    for b in bars:
        height = b.get_height()
        ax.text(
            b.get_x() + b.get_width()/2,
            height + 0.01,
            f"{height:.3f}",
            ha="center",
            va="bottom",
            fontsize=8,
        )

add_labels(bars_val)
add_labels(bars_test)

plt.tight_layout()
plt.show()
