WIP: Multilabel classification #440

Merged: 35 commits, May 11, 2024
Commits
d42ef56
Added Multilabel kNN classification evaluator
x-tabdeveloping Apr 19, 2024
2da53e2
Added Multilabel classification AbsTask
x-tabdeveloping Apr 19, 2024
e22c9e6
Added MultiLabelClassification Task type to TaskMetadata
x-tabdeveloping Apr 19, 2024
6bdd3f5
bugfix
x-tabdeveloping Apr 19, 2024
a31ed01
Removed all references to metadata_dict from Multilabel classification
x-tabdeveloping Apr 23, 2024
3b30b50
Added Eurlex (wip)
x-tabdeveloping Apr 23, 2024
4ade123
Made MultiLabelClassification more efficient by moving the embedding …
x-tabdeveloping Apr 24, 2024
c6217db
fix: changed itertools.chain to itertools.chain.from_iter
x-tabdeveloping Apr 24, 2024
e2a0b1d
fix: Fixed validation and import on MultiEURLEX
x-tabdeveloping Apr 24, 2024
b342b3d
Merge branch 'main' into multilabel-classification
x-tabdeveloping Apr 24, 2024
ad955d1
Removed MultioutputClassifier, because kNN can already do that
x-tabdeveloping Apr 25, 2024
b5e99c8
fix: multilabels are not turned into an array
x-tabdeveloping Apr 25, 2024
9381852
Ran linting
x-tabdeveloping Apr 25, 2024
d122507
Added points for PR (2+23*4 for eurlex, 10 for new task type)
x-tabdeveloping Apr 26, 2024
623f7af
fix: Fixed undersampling for training set in Multitask classification
x-tabdeveloping Apr 29, 2024
55efd8f
fix: sped up sampling by using select() instead of indexing
x-tabdeveloping Apr 29, 2024
7f87f96
fix: removed duplicate code for selecting train sentences
x-tabdeveloping Apr 29, 2024
052538b
Added n_samples and avg_length to MultiEURLEX
x-tabdeveloping Apr 29, 2024
0bf888b
Added MultiEURLEX results for paraphrase-multilingual-MiniLM-L12
x-tabdeveloping Apr 29, 2024
1802201
Added EURLEX results for multilingual-e5-small
x-tabdeveloping May 3, 2024
f575fcb
Changed evaluation in multilabel classification to use MLPClassifier
x-tabdeveloping May 3, 2024
3248b70
Limited evaluation to test split in EURLEX
x-tabdeveloping May 8, 2024
5cd605b
multilabel classification now subsamples test set, and the neural net…
x-tabdeveloping May 8, 2024
45bfa29
Multilabel classification now allows tasks to define the samples per …
x-tabdeveloping May 8, 2024
33f3f27
Removed unused code
x-tabdeveloping May 8, 2024
0117b59
Moved subsampling to before encoding
x-tabdeveloping May 8, 2024
49afeb1
Made subsampling error tolerant
x-tabdeveloping May 8, 2024
96312c7
Made sure all labels are represented in the training set
x-tabdeveloping May 8, 2024
f55cbeb
Revert "Made sure all labels are represented in the training set"
x-tabdeveloping May 8, 2024
87ad125
Reran EURLEX
x-tabdeveloping May 9, 2024
6d2d1b0
EURLEX only evaluates on test set, not validation set
x-tabdeveloping May 9, 2024
51ccdc4
Made KNeighbours the default classifier in MultiLabelClassification, …
x-tabdeveloping May 9, 2024
1dad403
Merge branch 'multilabel-classification' of https://github.com/embedd…
x-tabdeveloping May 9, 2024
55374a9
Added results for EURLEX
x-tabdeveloping May 9, 2024
e617020
Merge branch 'main' into multilabel-classification
x-tabdeveloping May 11, 2024
3 changes: 3 additions & 0 deletions docs/mmteb/points/440.jsonl
@@ -0,0 +1,3 @@
{"GitHub": "x-tabdeveloping", "New dataset": 94}
{"GitHub": "x-tabdeveloping", "New task": 10}
{"GitHub": "KennethEnevoldsen", "Review PR": 2}
178 changes: 178 additions & 0 deletions mteb/abstasks/AbsTaskMultilabelClassification.py
@@ -0,0 +1,178 @@
from __future__ import annotations

import itertools
import logging
from collections import defaultdict

import numpy as np
from sklearn.base import ClassifierMixin, clone
from sklearn.metrics import f1_score, label_ranking_average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MultiLabelBinarizer

from .AbsTask import AbsTask

logger = logging.getLogger(__name__)


def evaluate_classifier(
embeddings_train: np.ndarray,
y_train: np.ndarray,
embeddings_test: np.ndarray,
y_test: np.ndarray,
classifier: ClassifierMixin,
):
scores = {}
classifier = clone(classifier)
classifier.fit(embeddings_train, y_train)
y_pred = classifier.predict(embeddings_test)
accuracy = classifier.score(embeddings_test, y_test)
f1 = f1_score(y_test, y_pred, average="macro")
scores["accuracy"] = accuracy
scores["f1"] = f1
lrap = label_ranking_average_precision_score(y_test, y_pred)
scores["lrap"] = lrap
return scores


class AbsTaskMultilabelClassification(AbsTask):
"""Abstract class for multioutput classification tasks
The similarity is computed between pairs and the results are ranked.

self.load_data() must generate a huggingface dataset with a split matching self.metadata_dict["eval_splits"], and assign it to self.dataset. It must contain the following columns:
text: str
label: list[Hashable]
"""

classifier = KNeighborsClassifier(n_neighbors=5)

def __init__(
self,
n_experiments=None,
samples_per_label=None,
batch_size=32,
**kwargs,
):
super().__init__(**kwargs)
self.batch_size = batch_size

# Bootstrap parameters
self.n_experiments = n_experiments or getattr(self, "n_experiments", 10)
self.samples_per_label = samples_per_label or getattr(
self, "samples_per_label", 8
)
        # Run metadata validation by accessing the attribute.
        # This is quite hacky. Ideally, this would be done in the constructor of
        # each concrete task, but then we would have to duplicate the __init__
        # method's interface.
if hasattr(self, "metadata"):
self.metadata

def _add_main_score(self, scores):
if self.metadata.main_score in scores:
scores["main_score"] = scores[self.metadata.main_score]
else:
            logger.warning(
f"main score {self.metadata.main_score} not found in scores {scores.keys()}"
)

def evaluate(self, model, eval_split="test", train_split="train", **kwargs):
if not self.data_loaded:
self.load_data()

if self.is_multilingual:
scores = {}
for lang in self.dataset:
logger.info(
f"\nTask: {self.metadata.name}, split: {eval_split}, language: {lang}. Running..."
)
scores[lang] = self._evaluate_monolingual(
model, self.dataset[lang], eval_split, train_split, **kwargs
)
self._add_main_score(scores[lang])
else:
logger.info(
f"\nTask: {self.metadata.name}, split: {eval_split}. Running..."
)
scores = self._evaluate_monolingual(
model, self.dataset, eval_split, train_split, **kwargs
)
self._add_main_score(scores)

return scores

def _evaluate_monolingual(
self, model, dataset, eval_split="test", train_split="train", **kwargs
):
train_split = dataset[train_split]
eval_split = dataset[eval_split]
params = {
"classifier_type": type(self.classifier).__name__,
"classifier_params": self.classifier.get_params(),
"batch_size": self.batch_size,
}
params.update(kwargs)

scores = []
# Bootstrap sample indices from training set for each experiment
train_samples = []
for _ in range(self.n_experiments):
sample_indices, _ = self._undersample_data_indices(
train_split["label"], self.samples_per_label, None
)
train_samples.append(sample_indices)
# Encode all unique sentences at the indices
unique_train_indices = list(set(itertools.chain.from_iterable(train_samples)))
unique_train_sentences = train_split.select(unique_train_indices)["text"]
unique_train_embeddings = dict(
zip(unique_train_indices, model.encode(unique_train_sentences))
)
test_text = eval_split["text"]
binarizer = MultiLabelBinarizer()
y_test = binarizer.fit_transform(eval_split["label"])
# Stratified subsampling of test set to 2000 examples.
try:
if len(test_text) > 2000:
test_text, _, y_test, _ = train_test_split(
test_text, y_test, stratify=y_test, train_size=2000
)
except ValueError:
            logger.warning("Couldn't subsample, continuing with the entire test set.")
X_test = model.encode(test_text)
for i_experiment, sample_indices in enumerate(train_samples):
logger.info(
"=" * 10
+ f" Experiment {i_experiment+1}/{self.n_experiments} "
+ "=" * 10
)
X_train = np.stack([unique_train_embeddings[idx] for idx in sample_indices])
y_train = train_split.select(sample_indices)["label"]
y_train = binarizer.transform(y_train)
scores_exp = evaluate_classifier(
X_train, y_train, X_test, y_test, self.classifier
)
scores.append(scores_exp)

if self.n_experiments == 1:
return scores[0]
else:
avg_scores = {k: np.mean([s[k] for s in scores]) for k in scores[0].keys()}
std_errors = {
k + "_stderr": np.std([s[k] for s in scores]) for k in scores[0].keys()
}
return {**avg_scores, **std_errors}

def _undersample_data_indices(self, y, samples_per_label, idxs=None):
"""Undersample data to have samples_per_label samples of each label"""
sample_indices = []
if idxs is None:
idxs = np.arange(len(y))
np.random.shuffle(idxs)
label_counter = defaultdict(int)
for i in idxs:
if any((label_counter[label] < samples_per_label) for label in y[i]):
sample_indices.append(i)
for label in y[i]:
label_counter[label] += 1
return sample_indices, idxs
9 changes: 2 additions & 7 deletions mteb/abstasks/TaskMetadata.py
@@ -4,13 +4,7 @@
from datetime import date
from typing import List, Mapping, Union

-from pydantic import (
-    AnyUrl,
-    BaseModel,
-    BeforeValidator,
-    TypeAdapter,
-    field_validator,
-)
+from pydantic import AnyUrl, BaseModel, BeforeValidator, TypeAdapter, field_validator
from typing_extensions import Annotated, Literal

from .languages import (
@@ -80,6 +74,7 @@
TASK_TYPE = Literal[
"BitextMining",
"Classification",
"MultilabelClassification",
"Clustering",
"PairClassification",
"Reranking",
1 change: 1 addition & 0 deletions mteb/abstasks/__init__.py
@@ -5,6 +5,7 @@
from .AbsTaskClassification import *
from .AbsTaskClustering import *
from .AbsTaskInstructionRetrieval import *
from .AbsTaskMultilabelClassification import *
from .AbsTaskPairClassification import *
from .AbsTaskReranking import *
from .AbsTaskRetrieval import *
67 changes: 66 additions & 1 deletion mteb/evaluation/evaluators/ClassificationEvaluator.py
@@ -5,7 +5,12 @@
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
-from sklearn.metrics import accuracy_score, average_precision_score, f1_score
+from sklearn.metrics import (
+    accuracy_score,
+    average_precision_score,
+    f1_score,
+    label_ranking_average_precision_score,
+)
from sklearn.neighbors import KNeighborsClassifier
from torch import Tensor

@@ -14,6 +19,66 @@
logger = logging.getLogger(__name__)


def dot_distance(a: np.ndarray, b: np.ndarray) -> float:
return -np.dot(a, b)


class kNNMultiLabelClassificationEvaluator(Evaluator):
def __init__(
self,
embeddings_train,
y_train,
embeddings_test,
y_test,
k=1,
batch_size=32,
limit=None,
**kwargs,
):
super().__init__(**kwargs)
if limit is not None:
embeddings_train = embeddings_train[:limit]
y_train = y_train[:limit]
embeddings_test = embeddings_test[:limit]
y_test = y_test[:limit]
self.embeddings_train = embeddings_train
self.y_train = y_train
self.embeddings_test = embeddings_test
self.y_test = y_test

self.batch_size = batch_size

self.k = k

def __call__(self, model, test_cache=None):
scores = {}
max_accuracy = 0
max_f1 = 0
max_ap = 0
for metric_name in ["cosine", "euclidean", "dot"]:
if metric_name == "dot":
metric = dot_distance
else:
metric = metric_name
classifier = KNeighborsClassifier(n_neighbors=self.k, metric=metric)
classifier.fit(self.embeddings_train, self.y_train)
y_pred = classifier.predict(self.embeddings_test)
accuracy = classifier.score(self.embeddings_test, self.y_test)
f1 = f1_score(self.y_test, y_pred, average="macro")
scores["accuracy_" + metric_name] = accuracy
scores["f1_" + metric_name] = f1
max_accuracy = max(max_accuracy, accuracy)
max_f1 = max(max_f1, f1)
lrap = label_ranking_average_precision_score(self.y_test, y_pred)
scores["lrap_" + metric_name] = lrap
max_ap = max(max_ap, lrap)
scores["accuracy"] = max_accuracy
scores["f1"] = max_f1
if len(np.unique(self.y_train)) == 2:
scores["lrap"] = max_ap
return scores, test_cache


class kNNClassificationEvaluator(Evaluator):
def __init__(
self,
1 change: 1 addition & 0 deletions mteb/tasks/MultiLabelClassification/__init__.py
@@ -0,0 +1 @@
from .multilingual.MultiEURLEXMultilabelClassification import *
74 changes: 74 additions & 0 deletions mteb/tasks/MultiLabelClassification/multilingual/MultiEURLEXMultilabelClassification.py
@@ -0,0 +1,74 @@
from __future__ import annotations

from mteb.abstasks.TaskMetadata import TaskMetadata

from ....abstasks import AbsTaskMultilabelClassification, MultilingualTask


class MultiEURLEXMultilabelClassification(
MultilingualTask, AbsTaskMultilabelClassification
):
metadata = TaskMetadata(
name="MultiEURLEXMultilabelClassification",
dataset={
"path": "mteb/eurlex-multilingual",
"revision": "2aea5a6dc8fdcfeca41d0fb963c0a338930bde5c",
},
description="EU laws in 23 EU languages containing gold labels.",
reference="https://huggingface.co/datasets/coastalcph/multi_eurlex",
category="p2p",
type="MultilabelClassification",
eval_splits=["test"],
eval_langs={
"en": ["eng-Latn"],
"de": ["deu-Latn"],
"fr": ["fra-Latn"],
"it": ["ita-Latn"],
"es": ["spa-Latn"],
"pl": ["pol-Latn"],
"ro": ["ron-Latn"],
"nl": ["nld-Latn"],
"el": ["ell-Grek"],
"hu": ["hun-Latn"],
"pt": ["por-Latn"],
"cs": ["ces-Latn"],
"sv": ["swe-Latn"],
"bg": ["bul-Cyrl"],
"da": ["dan-Latn"],
"fi": ["fin-Latn"],
"sk": ["slk-Latn"],
"lt": ["lit-Latn"],
"hr": ["hrv-Latn"],
"sl": ["slv-Latn"],
"et": ["est-Latn"],
"lv": ["lav-Latn"],
"mt": ["mlt-Latn"],
},
main_score="accuracy",
date=("1958-01-01", "2016-01-01"),
form=["written"],
domains=["Legal", "Government"],
task_subtypes=["Topic classification"],
license="CC BY-SA 4.0",
socioeconomic_status="high",
annotations_creators="expert-annotated",
dialect=[],
text_creation="found",
bibtex_citation="""
@InProceedings{chalkidis-etal-2021-multieurlex,
author = {Chalkidis, Ilias
and Fergadiotis, Manos
and Androutsopoulos, Ion},
title = {MultiEURLEX -- A multi-lingual and multi-label legal document
classification dataset for zero-shot cross-lingual transfer},
booktitle = {Proceedings of the 2021 Conference on Empirical Methods
in Natural Language Processing},
year = {2021},
publisher = {Association for Computational Linguistics},
location = {Punta Cana, Dominican Republic},
url = {https://arxiv.org/abs/2109.00904}
}
""",
n_samples={"test": 5000},
avg_character_length={"test": 12014.41},
)
1 change: 1 addition & 0 deletions mteb/tasks/__init__.py
@@ -5,6 +5,7 @@
from .CLSD import *
from .Clustering import *
from .InstructionRetrieval import *
from .MultiLabelClassification import *
from .PairClassification import *
from .Reranking import *
from .Retrieval import *