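Extracted the headlines (text) and their category labels (label).
Applied TF-IDF vectorization to convert the text into numerical features.
Removed English stop words, which shrinks the vocabulary and drops uninformative terms.
The result is a sparse matrix with one row per headline and one column per vocabulary term.
Splitting the Data
Used train_test_split to hold out 20% of the training data as a validation set.
Scoring models on this unseen data exposes overfitting that training-set scores would hide.
Note that the vectorizer is fitted before the split, so the validation headlines also contribute to the vocabulary and IDF weights.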
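As a small illustration of the preprocessing and splitting steps, the sketch below runs them on a handful of made-up headlines. The texts, labels, and every variable name here are hypothetical and are not part of the notebook. It also shows one possible variant in which the split happens before the vectorizer is fitted, so the held-out texts do not influence the vocabulary; this is an alternative ordering, not what the cell below does.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy headlines and labels, for illustration only
toy_texts = [
    "Stocks rally as the markets open higher",
    "The team wins the championship final",
    "New phone chip promises faster graphics",
    "Oil prices fall after the supply report",
    "Striker scores twice in the cup match",
    "Software update fixes a security flaw",
    "Central bank holds interest rates steady",
    "Coach names the squad for the tournament",
    "Startup launches a cloud storage service",
    "Investors watch the quarterly earnings",
]
toy_labels = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]

# Hold out 20% of the raw texts before any fitting
texts_fit, texts_val, labels_fit, labels_val = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42
)

# Learn the vocabulary and IDF weights from the training portion only
toy_vectorizer = TfidfVectorizer(stop_words="english")
X_fit = toy_vectorizer.fit_transform(texts_fit)
X_held_out = toy_vectorizer.transform(texts_val)

print("Training matrix shape:", X_fit.shape)
print("Validation matrix shape:", X_held_out.shape)
print("Stored as a sparse matrix:", issparse(X_fit))
print("Stop word 'the' kept in vocabulary:", "the" in toy_vectorizer.vocabulary_)
```

Both matrices share the same columns because the validation texts are transformed with the vocabulary learned from the training texts.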


In [None]:
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier



# Load the AG News dataset from the Hugging Face Hub
dataset = load_dataset("wangrongsheng/ag_news")

# Get the training and test data
train_data = dataset['train']
test_data = dataset['test']

# Extract headlines and labels
train_texts = [item['text'] for item in train_data]
train_labels = [item['label'] for item in train_data]
test_texts = [item['text'] for item in test_data]
test_labels = [item['label'] for item in test_data]

# TF-IDF vectorization
vectorizer = TfidfVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Hold out 20% of the training rows as a validation set.
# Note: this rebinds X_train to the remaining 80%.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, train_labels, test_size=0.2, random_state=42
)


Generating train split: 120000 examples
Generating test split: 7600 examples

Logistic Regression performed best (91.7% weighted F1-score on the validation set)

Works well on high-dimensional, sparse data such as TF-IDF features.
Suits text classification because a linear decision boundary is often enough in TF-IDF space.
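KNN came second (89.8%) but is computationally expensive

It stores the training vectors and computes distances to them at prediction time, so inference slows as the training set grows.
Works best when the classes are well separated.
Distances become less informative in high-dimensional spaces (TF-IDF has thousands of features), which usually hurts KNN, although it still scored well here.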
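Decision Tree performed the worst (81.1%)

Overfits easily: with no depth limit, the tree can keep splitting until it memorizes the training data.
Less effective for text classification without feature selection, because each split tests a single term out of thousands.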
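The overfitting point can be illustrated with a minimal sketch on synthetic data. The dataset, the variable names, and the choice of max_depth=4 are assumptions made for this example and do not come from the notebook; the exact scores will differ from the AG News results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (hypothetical, not AG News)
X_syn, y_syn = make_classification(
    n_samples=600, n_features=40, n_informative=5, flip_y=0.1, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# Default settings: no depth limit, so the tree grows until its leaves are pure
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Depth-limited tree for comparison (max_depth=4 is an arbitrary choice)
shallow_tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

for name, tree in [("unlimited depth", deep_tree), ("max_depth=4", shallow_tree)]:
    print(
        f"{name}: depth={tree.get_depth()}, "
        f"train accuracy={tree.score(X_tr, y_tr):.3f}, "
        f"validation accuracy={tree.score(X_va, y_va):.3f}"
    )
```

A gap between training and validation accuracy for the unlimited tree is the usual sign that it has memorized the training rows.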
Boosting (83.2%) was slightly better than Decision Tree but weaker than KNN and Logistic Regression

Captures complex relationships better than a single decision tree.
Performance depends on the number of trees and the learning rate; only the default settings were tried here.
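To see how the number of trees and the learning rate interact, one option is to compare a few settings on a validation set. The sketch below does this on synthetic data; the data, the names, and the three settings are hypothetical choices for illustration, not values tuned for AG News. The second setting (100 trees, learning rate 0.1) matches the scikit-learn defaults.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical, not the AG News features)
X_boost, y_boost = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)
Xb_tr, Xb_va, yb_tr, yb_va = train_test_split(
    X_boost, y_boost, test_size=0.2, random_state=42
)

# (n_estimators, learning_rate) pairs to compare; arbitrary example values
settings = [(50, 0.1), (100, 0.1), (100, 0.5)]

boost_scores = {}
for n_trees, rate in settings:
    booster = GradientBoostingClassifier(
        n_estimators=n_trees, learning_rate=rate, random_state=0
    )
    booster.fit(Xb_tr, yb_tr)
    # Accuracy on the held-out validation rows
    boost_scores[(n_trees, rate)] = booster.score(Xb_va, yb_va)

for (n_trees, rate), accuracy in boost_scores.items():
    print(f"n_estimators={n_trees}, learning_rate={rate}: {accuracy:.3f}")
```

On the full TF-IDF matrix each fit is slow, so a search like this would probably need a subsample or a small grid.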

In [None]:
# Initialize models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(),
    'Boosting': GradientBoostingClassifier(),
    'KNN': KNeighborsClassifier()
}

# Train models and evaluate
results = {}
for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    results[model_name] = classification_report(y_val, y_pred, output_dict=True)

# Print results
for model_name, result in results.items():
    print(f"Model: {model_name}")
    print(f"Precision: {result['weighted avg']['precision']}")
    print(f"Recall: {result['weighted avg']['recall']}")
    print(f"F1-Score: {result['weighted avg']['f1-score']}")
    print("\n")


Model: Logistic Regression
Precision: 0.9172477676314146
Recall: 0.9174166666666667
F1-Score: 0.9172019982465429


Model: Decision Tree
Precision: 0.8112777314370113
Recall: 0.8117083333333334
F1-Score: 0.8114089624622414


Model: Boosting
Precision: 0.8334691018627339
Recall: 0.8308333333333333
F1-Score: 0.8315347308071943


Model: KNN
Precision: 0.8979489444585644
Recall: 0.8982083333333334
F1-Score: 0.8978776223870867
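
The ranking in the summary can be reproduced from the printed scores. The short sketch below copies the weighted F1 values from the output above into a dictionary (the name reported_f1 is introduced here for illustration) and sorts them.

```python
# Weighted-average F1 scores copied from the printed results above
reported_f1 = {
    "Logistic Regression": 0.9172019982465429,
    "Decision Tree": 0.8114089624622414,
    "Boosting": 0.8315347308071943,
    "KNN": 0.8978776223870867,
}

# Sort from best to worst F1 score
ranking = sorted(reported_f1.items(), key=lambda pair: pair[1], reverse=True)

for rank, (model_name, f1) in enumerate(ranking, start=1):
    print(f"{rank}. {model_name}: {f1:.1%}")
```

Inside the notebook, the same values are available as results[model_name]['weighted avg']['f1-score'].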


