## Multiclass text classifier
Key points explained

- labels is a one-dimensional list (shape (n_samples,)), with exactly one label per text.
- Because labels is 1D and takes more than two distinct values ("civil", "criminal", "labor"), MLPClassifier automatically treats the problem as multiclass (not multilabel).
- No special encoding is needed: scikit-learn encodes the string labels internally, and predict returns the original strings.
- Pipeline chains the preprocessing step (TfidfVectorizer) and the classifier (MLPClassifier), so fit, predict, etc. can be called directly on the pipeline.
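
The points above can be checked with a small sketch. It assumes scikit-learn's `type_of_target` helper from `sklearn.utils.multiclass` to inspect how a target is interpreted; the toy texts, labels, and the names `toy_texts`, `toy_labels`, and `toy_pipeline` are hypothetical and are not part of the example dataset used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.utils.multiclass import type_of_target

# Hypothetical toy data, for illustration only
toy_texts = [
    "breach of contract", "armed robbery", "unpaid overtime",
    "hidden defects in the property", "embezzlement of funds", "unfair dismissal",
]
toy_labels = ["civil", "criminal", "labor", "civil", "criminal", "labor"]

# A 1D target with more than two distinct values is read as multiclass
target_kind = type_of_target(toy_labels)
print(target_kind)

# A 2D binary indicator matrix would be read as multilabel instead
multilabel_kind = type_of_target([[1, 0, 1], [0, 1, 0]])
print(multilabel_kind)

# Small pipeline: string labels go in as-is, no manual encoding
toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("mlp", MLPClassifier(hidden_layer_sizes=(10,), max_iter=300, random_state=0)),
])
toy_pipeline.fit(toy_texts, toy_labels)

# The fitted pipeline exposes the classifier's classes, sorted alphabetically
learned_classes = list(toy_pipeline.classes_)
print(learned_classes)

# predict returns the original string labels, not integer codes
prediction = toy_pipeline.predict(["unpaid overtime"])[0]
print(type(prediction))
```

With so few training samples the network may emit a convergence warning; that does not affect what the sketch demonstrates.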

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report

# Example dataset
texts = [
    # Civil Law
    "The claimant files a civil action seeking compensation for pain and suffering after a car accident.",
    "The buyer requests judicial termination of the contract due to hidden defects in the property.",
    "The plaintiff alleges a breach of a service agreement and demands payment of contractual penalties.",

    # Criminal Law
    "The public prosecutor files a criminal complaint for the offense of embezzlement of public funds.",
    "The accused is under investigation for participating in an organized criminal group.",
    "The defendant appeals the conviction for the crime of armed robbery, claiming lack of evidence.",

    # Labor Law
    "The employee brings a labor claim for unpaid overtime and recognition of on-call hours.",
    "The worker alleges unfair dismissal and seeks reinstatement and back pay.",
    "The claimant asserts that the employer failed to provide a safe working environment, causing occupational disease.",
]

# One label per text (1D vector) → multiclass problem
labels = [
    "civil", "civil", "civil",
    "criminal", "criminal", "criminal",
    "labor", "labor", "labor",
]

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.3,
    random_state=42,
    stratify=labels  # keeps class proportions
)

# Build the pipeline
#    TF-IDF → MLPClassifier
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(
        lowercase=True,
        max_features=5000,
        ngram_range=(1, 2),  # unigrams + bigrams
    )),
    ("mlp", MLPClassifier(
        hidden_layer_sizes=(100,),
        activation="relu",
        solver="adam",
        max_iter=500,
        random_state=42
    ))
])

# Train (fit) the model
pipeline.fit(X_train, y_train)

# Evaluate on test set
y_pred = pipeline.predict(X_test)
print("Classification report on test set:\n")
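# zero_division=0 avoids warnings when a class is never predicted on this tiny test set
print(classification_report(y_test, y_pred, zero_division=0))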

# Predict for a new example
text = (
    "The employee files a labor lawsuit claiming unpaid severance payments, "
    "including prior notice, vacation pay with an additional one-third, "
    "and the release of the unemployment insurance funds."
)
expected_label = "labor"

predicted_label = pipeline.predict([text])[0]
print("\nSingle prediction example:")
print(f"Text (first 80 chars): {text[:80]}...")
print(f"Predicted label (multiclass): {predicted_label} | Expected label: {expected_label}")
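
Beyond the single predicted label, it can be useful to look at the per-class probabilities. The following is a minimal, self-contained sketch under the assumption that a small TF-IDF + MLPClassifier pipeline like the one above is used; the data and the names `toy_texts`, `toy_labels`, and `toy_pipeline` are hypothetical. It relies on `predict_proba`, which `Pipeline` exposes when its final step provides it.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data, for illustration only
toy_texts = [
    "breach of contract", "armed robbery", "unpaid overtime",
    "hidden defects in the property", "embezzlement of funds", "unfair dismissal",
]
toy_labels = ["civil", "criminal", "labor", "civil", "criminal", "labor"]

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("mlp", MLPClassifier(hidden_layer_sizes=(10,), max_iter=300, random_state=0)),
])
toy_pipeline.fit(toy_texts, toy_labels)

# One row per input text, one column per class (same order as classes_)
proba = toy_pipeline.predict_proba(["The worker claims unpaid overtime"])[0]
for cls, p in zip(toy_pipeline.classes_, proba):
    print(f"{cls}: {p:.3f}")

# The most probable class, read off the probability vector
top_class = toy_pipeline.classes_[np.argmax(proba)]
print(f"Most probable class: {top_class}")
```

In a multiclass setting the probabilities in each row sum to 1, so a flat distribution is a hint that the model is unsure about the text.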