**First, we load and merge the train, validation and test datasets**

In [1]:
import pandas as pd

# 1. Load Training Data
train_args = pd.read_csv("data/arguments-training.tsv", sep='\t')
train_labels = pd.read_csv("data/labels-training.tsv", sep='\t')
df_train = pd.merge(train_args, train_labels, on="Argument ID")

# 2. Load Validation Data
val_args = pd.read_csv("data/arguments-validation.tsv", sep='\t')
val_labels = pd.read_csv("data/labels-validation.tsv", sep='\t')
df_val = pd.merge(val_args, val_labels, on="Argument ID")

# 3. Load Test Data (Crucial for volume!)
test_args = pd.read_csv("data/arguments-test.tsv", sep='\t')
test_labels = pd.read_csv("data/labels-test.tsv", sep='\t')
df_test = pd.merge(test_args, test_labels, on="Argument ID")

# 4. Concatenate EVERYTHING into one giant dataset
trainval_df = pd.concat([df_train, df_val, df_test], ignore_index=True)

# 5. Verify the size (Should be > 8,500)
print(f"Total Examples: {len(trainval_df)}")
print(trainval_df.head(3))

Total Examples: 8865
  Argument ID                                   Conclusion       Stance  \
0      A01002                  We should ban human cloning  in favor of   
1      A01005                      We should ban fast food  in favor of   
2      A01006  We should end the use of economic sanctions      against   

                                             Premise  Self-direction: thought  \
0  we should ban human cloning as it will only ca...                        0   
1  fast food should be banned because it is reall...                        0   
2  sometimes economic sanctions are the only thin...                        0   

   Self-direction: action  Stimulation  Hedonism  Achievement  \
0                       0            0         0            0   
1                       0            0         0            0   
2                       0            0         0            0   

   Power: dominance  ...  Tradition  Conformity: rules  \
0                 0  ...          
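
The merges above assume that each `Argument ID` appears exactly once in both the arguments file and the labels file. Below is a minimal sketch of how that assumption could be checked. It uses two small made-up frames in place of the TSV files, and the helper name `checked_merge` is hypothetical, not part of the notebook.

```python
import pandas as pd


def checked_merge(args_df: pd.DataFrame, labels_df: pd.DataFrame) -> pd.DataFrame:
    """Merge arguments with labels and fail loudly on ID mismatches.

    Hypothetical helper: the name and the checks are illustrative.
    """
    merged = pd.merge(
        args_df,
        labels_df,
        on="Argument ID",
        how="outer",            # keep unmatched rows so they can be detected
        validate="one_to_one",  # raises MergeError on duplicated IDs
        indicator=True,         # adds a "_merge" column: both / left_only / right_only
    )
    unmatched = merged[merged["_merge"] != "both"]
    if not unmatched.empty:
        raise ValueError(
            f"{len(unmatched)} rows have no partner: {unmatched['Argument ID'].tolist()}"
        )
    return merged.drop(columns="_merge")


# Toy stand-ins for arguments-*.tsv and labels-*.tsv (made-up content)
toy_args = pd.DataFrame({
    "Argument ID": ["A1", "A2"],
    "Premise": ["first premise", "second premise"],
})
toy_labels = pd.DataFrame({
    "Argument ID": ["A1", "A2"],
    "Stimulation": [0, 1],
})

toy_merged = checked_merge(toy_args, toy_labels)
print(toy_merged.shape)
```

The default inner merge silently drops arguments without labels, so a check of this kind would make any such loss visible before the three splits are concatenated.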

**Examples**

In [5]:
import pandas as pd

# 1. Load the Data
# Using validation set because it's cleaner for inspection
df_args = pd.read_csv("data/arguments-validation.tsv", sep='\t')
df_labels = pd.read_csv("data/labels-validation.tsv", sep='\t')

# 2. Merge them
val_df = pd.merge(df_args, df_labels, on="Argument ID")

# 3. Identify the Value Columns (The 19 or 20 labels)
# We exclude the text columns to find just the label columns
metadata_cols = ['Argument ID', 'Conclusion', 'Stance', 'Premise', 'Language']
value_cols = [col for col in val_df.columns if col not in metadata_cols]

# 4. Display 5 Random Examples
# Change random_state to see different examples
samples = val_df.sample(5, random_state=42) 

for idx, row in samples.iterrows():
    print(f"🆔 ID: {row['Argument ID']}")
    print(f"📢 CONCLUSION: {row['Conclusion']}")
    print(f"⚖️ STANCE: {row['Stance']}")
    print(f"📝 PREMISE: {row['Premise']}")
    print("-" * 30)
    print("🧠 ACTUAL HUMAN VALUES (Ground Truth):")
    
    # Iterate through the columns and print only the ones marked as '1'
    has_values = False
    for val in value_cols:
        if row[val] == 1:
            print(f"   ✅ {val}")
            has_values = True
            
    if not has_values:
        print("   (No values annotated)")
        
    print("=" * 80 + "\n")

🆔 ID: A28426
📢 CONCLUSION: Payday loans should be banned
⚖️ STANCE: in favor of
📝 PREMISE: payday loans should be banned because it causes people to go into debt
------------------------------
🧠 ACTUAL HUMAN VALUES (Ground Truth):
   ✅ Power: resources
   ✅ Security: personal

🆔 ID: A21315
📢 CONCLUSION: Homeopathy brings more harm than good
⚖️ STANCE: in favor of
📝 PREMISE: introducing items that normally produce symptoms of a disease is something that really could do more harm than good in the long run.
------------------------------
🧠 ACTUAL HUMAN VALUES (Ground Truth):
   ✅ Security: personal
   ✅ Universalism: objectivity

🆔 ID: A25015
📢 CONCLUSION: Payday loans should be banned
⚖️ STANCE: in favor of
📝 PREMISE: payday loans allow people to spend money they do not have yet and then they have to pay interest on the loan.  this could cause them to need another loan to get through the next pay period.
------------------------------


**Use the iterative-stratification library**

In [9]:
!uv pip install iterative-stratification

Using Python 3.13.7 environment at: /home/alumno/py313ml/.venv
Resolved 6 packages in 313ms
Prepared 1 package in 20ms
Installed 1 package in 2ms
 + iterative-stratification==0.1.9


In [6]:
import numpy as np
import pandas as pd
from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit

# 1. Create the Input Feature (X) and Targets (y) from your NEW trainval_df
# Concatenate Conclusion + Stance + Premise
trainval_df['text'] = trainval_df['Conclusion'] + " " + trainval_df['Stance'] + " " + trainval_df['Premise']

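# NOTE: 19 labels are listed here. 'Universalism: objectivity', which appears in the
# ground-truth examples printed earlier, is not in this list and is therefore not modelled.
label_cols = [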
    'Self-direction: thought', 'Self-direction: action', 'Stimulation',
    'Hedonism', 'Achievement', 'Power: dominance', 'Power: resources',
    'Face', 'Security: personal', 'Security: societal', 'Tradition',
    'Conformity: rules', 'Conformity: interpersonal', 'Humility',
    'Benevolence: caring', 'Benevolence: dependability',
    'Universalism: concern', 'Universalism: nature', 'Universalism: tolerance'
]

# Create the arrays for splitting
X_all = trainval_df['text'].values
y_all = trainval_df[label_cols].values

print(f"Features shape: {X_all.shape}")
print(f"Labels shape:   {y_all.shape}")

# 2. Iterative Stratified Split (Train vs Test)
# We use X_all and y_all here
msss = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

# n_splits=1, so this loop runs once and yields a single train/test split
for train_index, test_index in msss.split(X_all, y_all):
    X_train, X_test = X_all[train_index], X_all[test_index]
    y_train, y_test = y_all[train_index], y_all[test_index]

print("-" * 30)
print(f"Final Training Set: {X_train.shape[0]} examples (Use for Cross-Validation)")
print(f"Final Test Set:     {X_test.shape[0]} examples (Use for Report)")

# OPTIONAL: Sanity Check
print("\nLabel Distribution Check (First 3 labels):")
print(f"Train: {np.mean(y_train, axis=0)[:3]}")
print(f"Test:  {np.mean(y_test, axis=0)[:3]}")

Features shape: (8865,)
Labels shape:   (8865, 19)
------------------------------
Final Training Set: 7092 examples (Use for Cross-Validation)
Final Test Set:     1773 examples (Use for Report)

Label Distribution Check (First 3 labels):
Train: [0.15595037 0.25747321 0.05217146]
Test:  [0.15566836 0.2571912  0.05188945]
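
The sanity check above prints only the first three labels, where the train and test rates differ by about 0.0003. Below is a sketch of how the same check could cover every label at once. It runs on random toy matrices rather than the notebook's `y_train` and `y_test`, and the helper name `max_prevalence_gap` is hypothetical.

```python
import numpy as np


def max_prevalence_gap(y_a: np.ndarray, y_b: np.ndarray) -> float:
    """Largest absolute difference in per-label positive rate between two splits."""
    return float(np.max(np.abs(y_a.mean(axis=0) - y_b.mean(axis=0))))


# Toy multi-label indicator matrices (assumption: stand-ins for y_train / y_test)
rng = np.random.default_rng(0)
toy_y = (rng.random((1000, 5)) < 0.2).astype(int)
toy_train, toy_test = toy_y[:800], toy_y[800:]

gap = max_prevalence_gap(toy_train, toy_test)
print(f"Largest per-label prevalence gap: {gap:.4f}")
```

In the notebook this would be called as `max_prevalence_gap(y_train, y_test)`. A single number close to zero would indicate that the stratification held for the rare labels as well as the common ones.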


**SPARSE REPRESENTATION**

We compare TF-IDF weighting with raw count vectors (`CountVectorizer`) across three n-gram ranges.

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
import numpy as np

# 1. Setup the Cross-Validation Strategy (Mandatory)
# Matches your data splitting logic (Stratified Multi-label)
stratified_cv = MultilabelStratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# 2. Define Experiments
# We test different Feature Types AND N-gram ranges
experiments = [
    ("TF-IDF", (1, 1)),      # Standard baseline
    ("TF-IDF", (1, 2)),      # Captures phrases ("climate change")
    ("TF-IDF", (1, 3)),      # Captures longer context
    ("CountVec", (1, 1)),    # Raw frequency (Bag of Words)
    ("CountVec", (1, 2)),    # Raw frequency + Phrases
    ("CountVec", (1, 3)),    # Trigrams
]

print(f"{'Feature Type':<12} | {'N-Grams':<10} | {'Mean F1-Macro':<15} | {'Std Dev':<10}")
print("-" * 60)

best_score = 0
best_config = ""

for vec_type, ngram in experiments:
    # 3. Select Vectorizer
    if vec_type == "TF-IDF":
        vectorizer = TfidfVectorizer(ngram_range=ngram, min_df=3, max_features=20000)
    else:
        vectorizer = CountVectorizer(ngram_range=ngram, min_df=3, max_features=20000)
        
    # 4. Build Pipeline
    # Using Logistic Regression (OneVsRest) as the standard baseline classifier
    pipeline = Pipeline([
        ('vec', vectorizer),
        ('clf', OneVsRestClassifier(LogisticRegression(solver='liblinear', random_state=42)))
    ])
    
    # 5. Run Cross-Validation
    scores = cross_val_score(pipeline, X_train, y_train, cv=stratified_cv, scoring='f1_macro', n_jobs=-1)
    
    # 6. Store Results
    mean_score = scores.mean()
    std_score = scores.std()
    
    print(f"{vec_type:<12} | {str(ngram):<10} | {mean_score:.4f}          | {std_score:.4f}")
    
    if mean_score > best_score:
        best_score = mean_score
        best_config = f"{vec_type} {ngram}"

print("-" * 60)
print(f"🏆 WINNER: {best_config} with F1-Macro: {best_score:.4f}")

Feature Type | N-Grams    | Mean F1-Macro   | Std Dev   
------------------------------------------------------------
TF-IDF       | (1, 1)     | 0.2656          | 0.0060
TF-IDF       | (1, 2)     | 0.2707          | 0.0082
TF-IDF       | (1, 3)     | 0.2866          | 0.0073
CountVec     | (1, 1)     | 0.4005          | 0.0082
CountVec     | (1, 2)     | 0.4286          | 0.0075
CountVec     | (1, 3)     | 0.4322          | 0.0072
------------------------------------------------------------
🏆 WINNER: CountVec (1, 3) with F1-Macro: 0.4322
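
`scoring='f1_macro'` averages the per-label F1 scores with equal weight, so a rare label that the model never predicts pulls the mean down as much as a common one would. Below is a small numpy sketch of that computation on made-up predictions. The function name is illustrative, and scoring a label with no true or predicted positives as 0 is an assumption about how to treat the undefined case.

```python
import numpy as np


def macro_f1(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Unweighted mean of per-label F1 for binary indicator matrices."""
    tp = np.sum((y_true == 1) & (y_pred == 1), axis=0)
    fp = np.sum((y_true == 0) & (y_pred == 1), axis=0)
    fn = np.sum((y_true == 1) & (y_pred == 0), axis=0)
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); undefined (0/0) cases are scored as 0
    f1 = np.divide(2 * tp, denom, out=np.zeros(denom.shape, dtype=float), where=denom > 0)
    return float(f1.mean())


# Toy example: label 0 is common and predicted perfectly, label 1 is rare and missed
y_true = np.array([[1, 0], [1, 0], [1, 1], [0, 0]])
y_pred = np.array([[1, 0], [1, 0], [1, 0], [0, 0]])

print(macro_f1(y_true, y_pred))  # per-label F1 is 1.0 and 0.0, so the mean is 0.5
```

This is why macro-F1 values in the table can look low even when frequent labels are predicted reasonably well.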


*Feature representation analysis:*

- Count vectors (raw term frequencies) outperformed TF-IDF by a wide margin in every n-gram setting: 0.40 to 0.43 mean F1-macro versus 0.27 to 0.29. This suggests that for short argumentation texts, the raw presence of specific value-laden keywords (e.g., 'freedom', 'security') is a strong predictor.

- TF-IDF down-weights terms that occur in many documents, but in this domain high-frequency terms may be exactly the class identifiers we need. BM25 builds on the same idea, adding term-frequency saturation and document-length normalization, so we expect it to share this weakness on this dataset.

- Because the simpler count-vector model already beats the weighted TF-IDF model by a large margin, we expect additional frequency dampening (as in BM25) to be unnecessary and probably detrimental for this task, although BM25 was not part of the experiments above. We therefore selected CountVectors as our sparse baseline. The best configuration was n-grams (1, 3) at 0.4322 mean F1-macro; (1, 2) scored 0.4286, a gap smaller than one standard deviation across folds.
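
To make the weighting argument concrete, here is a sketch comparing how raw counts, a smoothed IDF, and BM25 term-frequency saturation treat a term. The IDF uses the smoothed form ln((1 + n) / (1 + df)) + 1 that scikit-learn documents as its default. The BM25 part is the standard Okapi term-frequency component, with k1 = 1.5 and b = 0.75 chosen only for illustration. The corpus size and document frequencies are made up.

```python
import numpy as np


def smoothed_idf(df: int, n_docs: int) -> float:
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return float(np.log((1 + n_docs) / (1 + df)) + 1.0)


def bm25_tf(tf: float, doc_len: float, avg_len: float,
            k1: float = 1.5, b: float = 0.75) -> float:
    """BM25 term-frequency component; it saturates towards k1 + 1 as tf grows."""
    norm = k1 * (1.0 - b + b * doc_len / avg_len)
    return tf * (k1 + 1.0) / (tf + norm)


n_docs = 1000  # made-up corpus size

# A frequent term (in 400 documents) receives a lower IDF than a rare one (in 5)
print(f"idf(frequent) = {smoothed_idf(400, n_docs):.3f}")
print(f"idf(rare)     = {smoothed_idf(5, n_docs):.3f}")

# Raw counts grow linearly with tf, while the BM25 component flattens out
for tf in (1, 2, 5, 10):
    print(f"tf={tf:>2}  count={tf:>2}  bm25_tf={bm25_tf(tf, doc_len=20, avg_len=20):.3f}")
```

If frequent terms carry the useful signal, as the bullets above argue, both mechanisms would work against it. That remains a hypothesis until BM25 is actually evaluated on this data.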

**Dense Methods**

In [18]:
import numpy as np
import gensim.downloader as api
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import cross_val_score

# 1. Define the Vectorizer (Averaging Logic)
class MeanEmbeddingVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, model_name):
        self.model_name = model_name
        self.word2vec = None
        self.dim = None

    def fit(self, X, y=None):
        # Load the model only when fitting to save memory/time if not used
        print(f"Loading {self.model_name}...")
        self.word2vec = api.load(self.model_name)
        self.dim = self.word2vec.vector_size
        return self

    def transform(self, X):
        # Check if model is loaded
        if self.word2vec is None:
             self.word2vec = api.load(self.model_name)
             self.dim = self.word2vec.vector_size
             
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in [s.lower().split() for s in X]
        ])

# 2. Define the Models to Compare
# Format: (Display Name, Gensim API Name)
dense_models = [
    ("GloVe (100d)", "glove-wiki-gigaword-100"),
    ("Word2Vec (300d)", "word2vec-google-news-300") 
]

print(f"{'Model Name':<20} | {'Dimensions':<10} | {'Mean F1-Macro':<15} | {'Std Dev':<10}")
print("-" * 65)

results_dense = {}

for display_name, api_name in dense_models:
    # 3. Build Pipeline
    # We initialize the vectorizer with the model name, it loads during fit()
    pipeline = Pipeline([
        ('vec', MeanEmbeddingVectorizer(api_name)),
        ('clf', OneVsRestClassifier(LogisticRegression(solver='liblinear', random_state=42)))
    ])
    
    # 4. Run Cross-Validation
    # Note: This might be slower due to the large matrix operations in 300d
    scores = cross_val_score(pipeline, X_train, y_train, cv=stratified_cv, scoring='f1_macro', n_jobs=-1)
    
    # 5. Store and Print
    mean_score = scores.mean()
    results_dense[display_name] = mean_score
    print(f"{display_name:<20} | {str(300 if '300' in api_name else 100):<10} | {mean_score:.4f}          | {scores.std():.4f}")

# 6. Final Comparison
# Ignore models whose CV score is NaN (failed fits) when picking the best one
valid_dense = {name: score for name, score in results_dense.items() if not np.isnan(score)}
best_dense = max(valid_dense, key=valid_dense.get)
print("-" * 65)
print(f"🏆 Best Dense Model: {best_dense} with F1: {results_dense[best_dense]:.4f}")

# Optional: Compare against your Sparse Baseline (assuming 'best_score' exists)
try:
    print(f"Sparse Baseline:     {best_score:.4f}")
    if results_dense[best_dense] > best_score:
        print("🚀 Result: Dense Embeddings BEAT Sparse Features!")
    else:
        print("📉 Result: Sparse Features (CountVec/TF-IDF) are SUPERIOR.")
except NameError:
    pass

Model Name           | Dimensions | Mean F1-Macro   | Std Dev   
-----------------------------------------------------------------
GloVe (100d)         | 100        | 0.2791          | 0.0051
Word2Vec (300d)      | 300        | nan          | nan
-----------------------------------------------------------------
🏆 Best Dense Model: GloVe (100d) with F1: 0.2791
Sparse Baseline:     0.4322
📉 Result: Sparse Features (CountVec/TF-IDF) are SUPERIOR.


2 fits failed out of a total of 5.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
2 fits failed with the following error:
Traceback (most recent call last):
  File "/home/alumno/py313ml/.venv/lib/python3.13/site-packages/sklearn/model_selection/_validation.py", line 859, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alumno/py313ml/.venv/lib/python3.13/site-packages/sklearn/base.py", line 1365, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/home/alumno/py313ml/.venv/lib/python3.13/site-packages/sklearn/pipeline.py", line 655, in fit
    Xt = self._fit(X, y, routed_params, raw_params=params)
  File "/home/alumno/py313ml/.venv/lib/p

Loading glove-wiki-gigaword-100...
Loading word2vec-google-news-300...
Loading glove-wiki-gigaword-100...
Loading glove-wiki-gigaword-100...
Loading word2vec-google-news-300...
Loading glove-wiki-gigaword-100...
Loading word2vec-google-news-300...
Loading word2vec-google-news-300...
Loading glove-wiki-gigaword-100...
Loading word2vec-google-news-300...
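
The repeated `Loading ...` lines show that every cross-validation fit loads the embedding model again inside its own worker. The traceback above is truncated, so the cause of the two failed Word2Vec fits is not known. Memory pressure from several parallel copies of a large model is one possibility, stated here only as an assumption. Below is a sketch of one way to avoid the repeated loading: average the embeddings once, outside the pipeline, and cross-validate on the resulting dense matrix. A tiny hand-made dictionary stands in for the gensim vectors, and `mean_embed` is a hypothetical helper.

```python
import numpy as np


def mean_embed(texts, vectors, dim):
    """Average the vectors of known tokens; return an all-zero row if none are known."""
    rows = []
    for text in texts:
        known = [vectors[tok] for tok in text.lower().split() if tok in vectors]
        rows.append(np.mean(known, axis=0) if known else np.zeros(dim))
    return np.vstack(rows)


# Toy 3-dimensional "embedding" (assumption: stands in for the api.load(...) result)
toy_vectors = {
    "ban": np.array([1.0, 0.0, 0.0]),
    "loans": np.array([0.0, 1.0, 0.0]),
}

X_dense = mean_embed(["Ban loans", "unknown words only"], toy_vectors, dim=3)
print(X_dense)
```

With real vectors, the matrix would be computed once per model from `X_train` and passed to `cross_val_score` with only the classifier as the estimator. Because the averaging step learns nothing from the data, precomputing it does not leak information across folds. Setting `error_score='raise'`, as the warning suggests, would surface the full error from the failing fits.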
