## Introduction

The Kaggle competition has been launched; please register using this [link](https://www.kaggle.com/t/f79b637ede074e70a233661b4614083c).

You will find the training and test data in the data section of the competition, along with a description of the features. You will need to build models on the training data and make predictions on the test data and submit your solutions to Kaggle. You will also find a sample solution file in the data section that shows the format you will need to use for your own submissions.

The deadline for Kaggle solutions is 8PM on 19 April. You will be graded primarily on the basis of your work and how clearly you explain your methods and results. The top three participants in the competition will receive some extra points. I expect you to experiment with all the methods we have covered: linear models, random forest, gradient boosting and neural networks, together with parameter tuning and feature engineering.

You will see the public score of your best model on the leaderboard. A private dataset will be used to evaluate the final performance of your model to avoid overfitting based on the leaderboard.

You should also submit to Moodle the documentation (ipynb and pdf) of your work, including exploratory data analysis, data cleaning, parameter tuning and evaluation. Aim for concise explanations.

Feel free to ask questions about the task in Slack. The Kaggle competition is already open, so please start working on it and submitting solutions (you cannot submit more than 5 solutions per day).


## The plan

Our plan for the Kaggle competition follows a systematic approach to model development and optimization. First, we split the provided 'train' dataframe into two segments that serve as our training and validation sets. We then build and evaluate different predictive models (such as logistic regression, random forests, and gradient boosting machines), using AUC as the performance metric. The model with the highest AUC on our validation set is selected for further refinement.

After this initial modeling phase, we engineer new features, then rebuild and re-evaluate the models with them. The best-performing model from this second phase is applied to the external validation set (the competition's test data). Finally, we prepare and submit our predictions to Kaggle in the required format (article_id and score).
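The last step of the plan, producing predictions in the article_id and score format, could be sketched as follows. This is only an illustration on made-up toy data: the frames `toy_train` and `toy_test`, the two features used, and the fitted model are all assumptions, and the sample solution file on Kaggle remains the authoritative reference for the format.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the competition data (hypothetical values, not real results)
rng = np.random.default_rng(0)
toy_train = pd.DataFrame({
    "article_id": np.arange(100),
    "num_hrefs": rng.integers(0, 30, size=100),
    "num_imgs": rng.integers(0, 10, size=100),
})
toy_train["is_popular"] = (
    toy_train["num_hrefs"] + rng.normal(0, 5, size=100) > 15
).astype(int)
toy_test = pd.DataFrame({
    "article_id": np.arange(100, 120),
    "num_hrefs": rng.integers(0, 30, size=20),
    "num_imgs": rng.integers(0, 10, size=20),
})

# article_id is an identifier, so it is kept out of the feature list
features = ["num_hrefs", "num_imgs"]
model = LogisticRegression(max_iter=1000)
model.fit(toy_train[features], toy_train["is_popular"])

# One row per test article: its id and the predicted probability of class 1
submission = pd.DataFrame({
    "article_id": toy_test["article_id"],
    "score": model.predict_proba(toy_test[features])[:, 1],
})

# Without a path, to_csv returns the CSV text; pass a file name to write a file
csv_text = submission.to_csv(index=False)
print(csv_text.splitlines()[0])
```

Submitting probabilities rather than hard 0/1 labels matters here because AUC is computed from the ranking of the scores.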

## Import Libraries

In [2]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score
from sklearn.pipeline import Pipeline

# show max columns and rows
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [9]:
# read the data
train = pd.read_csv("train.csv")
# check the shape
print(train.shape)
# print first 5 rows
train.head()

(29733, 61)


Unnamed: 0,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,num_videos,average_token_length,num_keywords,data_channel_is_lifestyle,data_channel_is_entertainment,data_channel_is_bus,data_channel_is_socmed,data_channel_is_tech,data_channel_is_world,kw_min_min,kw_max_min,kw_avg_min,kw_min_max,kw_max_max,kw_avg_max,kw_min_avg,kw_max_avg,kw_avg_avg,self_reference_min_shares,self_reference_max_shares,self_reference_avg_sharess,weekday_is_monday,weekday_is_tuesday,weekday_is_wednesday,weekday_is_thursday,weekday_is_friday,weekday_is_saturday,weekday_is_sunday,is_weekend,LDA_00,LDA_01,LDA_02,LDA_03,LDA_04,global_subjectivity,global_sentiment_polarity,global_rate_positive_words,global_rate_negative_words,rate_positive_words,rate_negative_words,avg_positive_polarity,min_positive_polarity,max_positive_polarity,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity,is_popular,article_id
0,594,9,702,0.454545,1.0,0.620438,11,2,1,0,4.790598,8,0,0,1,0,0,0,4,2900.0,753.875,2900,690400,194850.0,2635.188119,5109.090909,3411.008319,1200.0,6500.0,3850.0,0,0,0,0,1,0,0,0,0.453017,0.025031,0.025053,0.025027,0.471872,0.499743,0.22462,0.047009,0.012821,0.785714,0.214286,0.452328,0.1,1.0,-0.153395,-0.4,-0.1,0.0,0.0,0.5,0.0,0,1
1,346,8,1197,0.470143,1.0,0.666209,21,6,2,13,4.622389,6,0,1,0,0,0,0,-1,417.0,120.0,0,843300,221250.0,0.0,4221.436364,2393.57792,1900.0,68300.0,21600.0,1,0,0,0,0,0,0,0,0.033338,0.034222,0.033336,0.865673,0.03343,0.509674,0.169761,0.060986,0.020886,0.744898,0.255102,0.434561,0.05,1.0,-0.308167,-1.0,-0.1,0.0,0.0,0.5,0.0,0,3
2,484,9,214,0.61809,1.0,0.748092,5,2,1,0,4.476636,8,0,0,0,0,1,0,4,512.0,138.625,2400,843300,184962.5,945.5,3602.455629,2481.347103,331.0,331.0,331.0,0,0,1,0,0,0,0,0,0.025767,0.025002,0.025002,0.025,0.899229,0.311717,0.060404,0.042056,0.028037,0.6,0.4,0.254562,0.1,0.433333,-0.141667,-0.2,-0.05,0.0,0.0,0.5,0.0,0,5
3,639,8,249,0.621951,1.0,0.66474,16,5,8,0,5.180723,6,0,0,0,0,0,0,4,717.0,146.6,0,617900,221150.0,0.0,4612.708333,2920.744778,2900.0,8800.0,5866.666667,0,1,0,0,0,0,0,0,0.033333,0.373002,0.033333,0.526997,0.033333,0.387522,-0.006684,0.036145,0.016064,0.692308,0.307692,0.231818,0.1,0.5,-0.5,-0.8,-0.4,0.0,0.0,0.5,0.0,0,6
4,177,12,1219,0.397841,1.0,0.583578,21,1,1,2,4.659557,4,0,0,0,0,0,1,-1,60.0,25.25,3100,843300,263875.0,844.5625,3529.36651,2250.688809,627.0,627.0,627.0,0,1,0,0,0,0,0,0,0.050001,0.05,0.799995,0.05,0.050004,0.45797,0.077511,0.025431,0.012305,0.673913,0.326087,0.380401,0.05,0.8,-0.441111,-1.0,-0.05,0.0,0.0,0.5,0.0,0,7


In [10]:
# check info
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29733 entries, 0 to 29732
Data columns (total 61 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   timedelta                      29733 non-null  int64  
 1   n_tokens_title                 29733 non-null  int64  
 2   n_tokens_content               29733 non-null  int64  
 3   n_unique_tokens                29733 non-null  float64
 4   n_non_stop_words               29733 non-null  float64
 5   n_non_stop_unique_tokens       29733 non-null  float64
 6   num_hrefs                      29733 non-null  int64  
 7   num_self_hrefs                 29733 non-null  int64  
 8   num_imgs                       29733 non-null  int64  
 9   num_videos                     29733 non-null  int64  
 10  average_token_length           29733 non-null  float64
 11  num_keywords                   29733 non-null  int64  
 12  data_channel_is_lifestyle      29733 non-null 

Every column listed shows 29733 non-null values, which matches the number of rows, so there appear to be no missing values.
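Reading non-null counts off a long listing is easy to get wrong, so a direct count of missing values is a useful cross-check. The sketch below runs on a small made-up frame (`toy` is a hypothetical stand-in); in the notebook the same calls would be applied to `train`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one deliberately missing value
toy = pd.DataFrame({
    "n_tokens_title": [9, 8, np.nan, 12],
    "num_hrefs": [11, 21, 5, 16],
})

# Number of missing values per column, and in total
missing_per_column = toy.isna().sum()
total_missing = int(missing_per_column.sum())

# Show only the columns that actually contain missing values
print(missing_per_column[missing_per_column > 0])
print("Total missing values:", total_missing)
```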

## First Models

- Logistic Regression
- Lasso Regression
- Random Forest
- Gradient Boosting Machine (GBM)
- Neural Network
- Explainable Boosting Machine (EBM)
- Support Vector Machine (SVM)
- k-Nearest Neighbors (k-NN)
- XGBoost
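One way to compare several of the candidates listed above on the same validation split is to loop over a dictionary of estimators and record the validation AUC of each. The sketch below uses synthetic data from `make_classification` and only three of the listed model types with mostly default settings; the data, the names `candidates` and `val_auc`, and the hyperparameters are illustrative assumptions, not the notebook's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (stand-in for the competition data)
X_toy, y_toy = make_classification(
    n_samples=600, n_features=12, n_informative=5, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# Candidate models, keyed by a readable name
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting Machine": GradientBoostingClassifier(random_state=0),
}

# Fit each candidate and score it by AUC on the held-out validation set
val_auc = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    val_auc[name] = roc_auc_score(y_va, estimator.predict_proba(X_va)[:, 1])

best_name = max(val_auc, key=val_auc.get)
for name, auc in val_auc.items():
    print(f"{name}: validation AUC = {auc:.3f}")
print("Best on validation:", best_name)
```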


### Model a1: Logistic Regression

In [12]:
# define a random state
prng = np.random.RandomState(20240418)

# define the target and features
X = train.drop('is_popular', axis=1)
y = train['is_popular']

# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=prng)

# Select features for each model
# Minimal feature model (select a few you think are most important)
features_minimal = ['n_tokens_title', 'n_tokens_content', 'num_hrefs']

# Moderate feature model (a larger set of features)
features_moderate = ['n_tokens_title', 'n_tokens_content', 'num_hrefs', 'num_imgs', 'num_videos', 'average_token_length']

# All features for the comprehensive model
# (note: X still contains the article_id identifier column at this point)
features_comprehensive = list(X_train.columns)

# Define models
models = {
    'Minimal Features Model': LogisticRegression(max_iter=1000),
    'Moderate Features Model': LogisticRegression(max_iter=1000),
    'Comprehensive Features Model': LogisticRegression(max_iter=1000)
}

# Dictionary to store AUC scores
auc_scores = {}

# Evaluate models
for name, model in models.items():
    # Select the appropriate features for each model
    if 'Minimal' in name:
        features = features_minimal
    elif 'Moderate' in name:
        features = features_moderate
    else:
        features = features_comprehensive
    
    # Create a pipeline with scaling and logistic regression
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', model)
    ])
    
    # Train model
    pipeline.fit(X_train[features], y_train)
    
    # Predict probabilities
    preds_train = pipeline.predict_proba(X_train[features])[:, 1]
    preds_val = pipeline.predict_proba(X_val[features])[:, 1]
    
    # Calculate AUC
    auc_train = roc_auc_score(y_train, preds_train)
    auc_val = roc_auc_score(y_val, preds_val)
    
    # Store AUC scores
    auc_scores[name] = {'AUC Train': auc_train, 'AUC Validation': auc_val}

# Print AUC scores for all models
for model_name, scores in auc_scores.items():
    print(f"{model_name} - AUC Train: {scores['AUC Train']}, AUC Validation: {scores['AUC Validation']}")


Minimal Features Model - AUC Train: 0.5474924736067324, AUC Validation: 0.561370603870939
Moderate Features Model - AUC Train: 0.6039263417730092, AUC Validation: 0.6335697293916618
Comprehensive Features Model - AUC Train: 0.6925358504255668, AUC Validation: 0.7133059337272026
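A single train/validation split gives a fairly noisy AUC estimate, which may be part of why the validation scores above come out higher than the training scores. A cross-validated grid search over the regularization strength is one possible next step, and it also covers the parameter tuning the assignment asks for. The sketch below uses synthetic data and an arbitrary grid for `C`; both are assumptions for illustration, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (stand-in for the competition data)
X_toy, y_toy = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0
)

# Same pipeline shape as in the cell above: scaling followed by the model
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Illustrative grid for the inverse regularization strength C
param_grid = {"model__C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold stratified cross-validation, scored by AUC
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=cv)
search.fit(X_toy, y_toy)

print("Best C:", search.best_params_["model__C"])
print("Mean cross-validated AUC:", round(search.best_score_, 4))
```

Because the scaler sits inside the pipeline, it is refit on the training folds of each split, so no information from the held-out fold leaks into the scaling.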
