## ICR - Identifying Age-Related Conditions

The goal of this competition is to predict whether a person has any of three medical conditions. I predict whether the person has one or more of the three conditions (Class 1) or none of them (Class 0).

An accuracy of 0.97 (about 97%) is achieved on a held-out test split. On Kaggle's public test data the log loss score is 0.43, where log loss is defined as:

$\text{Log Loss} = \frac{-\frac{1}{N_0} \sum_{i=1}^{N_0} y_{0i}\log{p_{0i}} - \frac{1}{N_1}\sum_{i=1}^{N_1} y_{1i}\log{p_{1i}}}{2}$

where $N_{c}$ is the number of observations of class c, $\log$ is the natural logarithm, $y_{c i}$ is 1 if observation i belongs to class c and 0 otherwise, $p_{c i}$ is the predicted probability that observation i belongs to class c.
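
The formula above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the official competition scorer: the function name `balanced_log_loss`, the clipping constant `eps`, and the requirement that both classes be present are assumptions made for this example.

```python
import numpy as np

def balanced_log_loss(y_true, p_class1, eps=1e-15):
    """Average of the per-class mean negative log-likelihoods (binary case)."""
    y_true = np.asarray(y_true)
    # Clip probabilities so that log(0) never occurs (eps is an assumed value).
    p1 = np.clip(np.asarray(p_class1, dtype=float), eps, 1 - eps)
    p0 = 1.0 - p1

    is0 = y_true == 0
    is1 = y_true == 1
    n0, n1 = is0.sum(), is1.sum()
    if n0 == 0 or n1 == 0:
        raise ValueError("both classes must be present in y_true")

    # Each class contributes its own mean loss, so the majority class
    # cannot dominate the score.
    loss0 = -np.log(p0[is0]).sum() / n0
    loss1 = -np.log(p1[is1]).sum() / n1
    return (loss0 + loss1) / 2

# Example: three Class 0 observations and one Class 1 observation
print(balanced_log_loss([0, 0, 0, 1], [0.1, 0.2, 0.1, 0.7]))
```

Because each class is averaged separately, predicting 0.5 for every observation scores $\log 2 \approx 0.693$ regardless of how imbalanced the classes are.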

In [15]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import GridSearchCV

In [16]:
# Load the data
df = pd.read_csv('data/train.csv')

# Display the first few rows of the dataframe
df.head()

| | Id | AB | AF | AH | AM | AR | AX | AY | AZ | BC | ... | FL | FR | FS | GB | GE | GF | GH | GI | GL | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 000ff2bfdfe9 | 0.209377 | 3109.03329 | 85.200147 | 22.394407 | 8.138688 | 0.699861 | 0.025578 | 9.812214 | 5.555634 | ... | 7.298162 | 1.73855 | 0.094822 | 11.339138 | 72.611063 | 2003.810319 | 22.136229 | 69.834944 | 0.120343 | 1 |
| 1 | 007255e47698 | 0.145282 | 978.76416 | 85.200147 | 36.968889 | 8.138688 | 3.63219 | 0.025578 | 13.51779 | 1.2299 | ... | 0.173229 | 0.49706 | 0.568932 | 9.292698 | 72.611063 | 27981.56275 | 29.13543 | 32.131996 | 21.978 | 0 |
| 2 | 013f2bd269f5 | 0.47003 | 2635.10654 | 85.200147 | 32.360553 | 8.138688 | 6.73284 | 0.025578 | 12.82457 | 1.2299 | ... | 7.70956 | 0.97556 | 1.198821 | 37.077772 | 88.609437 | 13676.95781 | 28.022851 | 35.192676 | 0.196941 | 0 |
| 3 | 043ac50845d5 | 0.252107 | 3819.65177 | 120.201618 | 77.112203 | 8.138688 | 3.685344 | 0.025578 | 11.053708 | 1.2299 | ... | 6.122162 | 0.49706 | 0.284466 | 18.529584 | 82.416803 | 2094.262452 | 39.948656 | 90.493248 | 0.155829 | 0 |
| 4 | 044fb8a146ec | 0.380297 | 3733.04844 | 85.200147 | 14.103738 | 8.138688 | 3.942255 | 0.05481 | 3.396778 | 102.15198 | ... | 8.153058 | 48.50134 | 0.121914 | 16.408728 | 146.109943 | 8524.370502 | 45.381316 | 36.262628 | 0.096614 | 1 |


In [17]:
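# The 'Id' column is not dropped (the line below is commented out),
# so it stays in X and is treated as a categorical feature
#df = df.drop('Id', axis=1)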

In [18]:
# Define the features and target
X = df.drop('Class', axis=1)
y = df['Class']

In [19]:
# Identify numeric and categorical features
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns

In [20]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [21]:
# Define preprocessing for numeric columns (impute missing values and scale them)
numeric_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())])

# Define preprocessing for categorical columns (impute missing values and one hot encode)
categorical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='constant', fill_value='missing')), ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [22]:
# Combine preprocessing steps
preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_features), ('cat', categorical_transformer, categorical_features)])
# Create preprocessing and training pipeline for each model
pipeline_svm = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', SVC(random_state=42))])
pipeline_knn = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', KNeighborsClassifier())])
pipeline_gbc = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', GradientBoostingClassifier(random_state=42))])
# The target is binary, so XGBoost uses the binary 'logloss' evaluation metric
pipeline_xgb = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', XGBClassifier(eval_metric='logloss', random_state=42))])

# List of pipelines and pipeline names
pipelines = [pipeline_svm, pipeline_knn, pipeline_gbc, pipeline_xgb]
pipeline_names = ['Support Vector Machine', 'K-Nearest Neighbors', 'Gradient Boosting', 'XGBoost']

In [23]:
%%time
# X_train and X_test are already DataFrames; wrapping them again just makes the column names explicit
X_train_df = pd.DataFrame(X_train, columns=X.columns)
X_test_df = pd.DataFrame(X_test, columns=X.columns)
best_accuracy = 0

# Loop to fit each of the four pipelines and keep the most accurate one
for pipe, name in zip(pipelines, pipeline_names):
    print('\n', name)
    pipe.fit(X_train_df, y_train)
    y_pred = pipe.predict(X_test_df)
    accuracy = accuracy_score(y_test, y_pred)
    print('Test Accuracy: ', accuracy)
    if accuracy > best_accuracy:
        best_model = pipe
        best_accuracy = accuracy


 Support Vector Machine
Test Accuracy:  0.8629032258064516

 K-Nearest Neighbors
Test Accuracy:  0.8467741935483871

 Gradient Boosting
Test Accuracy:  0.9516129032258065

 XGBoost
Test Accuracy:  0.967741935483871
CPU times: user 3.22 s, sys: 479 ms, total: 3.7 s
Wall time: 1.2 s


In [10]:
%%time

# Define the parameter grid
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [None, 10, 20, 30],
    'classifier__learning_rate': [0.1, 0.01, 0.001],
    'classifier__subsample': [0.5, 0.7, 1.0],
    'classifier__colsample_bytree': [0.4, 0.6, 0.8, 1.0]
}
# Create the GridSearchCV object
grid_search = GridSearchCV(pipeline_xgb, param_grid, cv=3, scoring='neg_log_loss')

# Fit the model and search for the best parameters
grid_search.fit(X_train, y_train)

# Print the best parameters and the best score
print('Best parameters: ', grid_search.best_params_)
print('Best score: ', grid_search.best_score_)

Best parameters:  {'classifier__colsample_bytree': 0.4, 'classifier__learning_rate': 0.1, 'classifier__max_depth': 10, 'classifier__n_estimators': 100, 'classifier__subsample': 1.0}
Best score:  -0.21846163669816657
CPU times: user 17min 39s, sys: 52.9 s, total: 18min 32s
Wall time: 6min 8s
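
The grid search above is scored with `neg_log_loss`, which weights every observation equally, whereas the competition metric averages the loss per class. One possible way to tune against the class-balanced metric is to pass a callable scorer to `GridSearchCV`. The sketch below is an illustration under assumptions, not part of the original notebook: the scorer name `neg_balanced_log_loss`, the synthetic data, and the `LogisticRegression` stand-in model are all hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def neg_balanced_log_loss(estimator, X, y, eps=1e-15):
    """Callable scorer: negative class-balanced log loss (higher is better)."""
    proba = np.clip(estimator.predict_proba(X), eps, 1 - eps)
    y = np.asarray(y)
    # Mean negative log-probability of the true class, computed per class.
    # Assumes every class appears in y (true for stratified folds).
    per_class = [-np.log(proba[y == c, k]).mean()
                 for k, c in enumerate(estimator.classes_)]
    return -float(np.mean(per_class))

# Synthetic, imbalanced stand-in data (assumption, not the competition data)
X_demo, y_demo = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [0.1, 1.0]},
    cv=3,
    scoring=neg_balanced_log_loss,
)
search.fit(X_demo, y_demo)
print('Best parameters: ', search.best_params_)
print('Best score: ', search.best_score_)
```

The same scorer could be passed as `scoring=` when searching over `pipeline_xgb`, since a fitted `Pipeline` also exposes `predict_proba` and `classes_`.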


In [24]:
y_pred = grid_search.predict(X_test_df)
accuracy = accuracy_score(y_test, y_pred)
print('Test Accuracy: ', accuracy)

Test Accuracy:  0.9354838709677419


In [26]:
test_df = pd.read_csv('data/test.csv')
#test_df = test_df.drop('Id', axis=1)
test_df

Unnamed: 0,Id,AB,AF,AH,AM,AR,AX,AY,AZ,BC,...,FI,FL,FR,FS,GB,GE,GF,GH,GI,GL
0,00eed32682bb,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,010ebe33f668,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,02fa521e1838,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,040e15f562a2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,046e85c7cc7f,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [41]:
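# best_model is the pipeline with the highest hold-out accuracy from the loop above
# (XGBoost with default settings), not the grid-searched model
final_predictions = best_model.predict_proba(test_df)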
submission = pd.DataFrame({'Id': test_df.Id, 'class_0': final_predictions[:,0], 'class_1': final_predictions[:,1]})
submission.to_csv('submission.csv', index=False)

In [40]:
submission

| | Id | class_0 | class_1 |
|---|---|---|---|
| 0 | 00eed32682bb | 0.746636 | 0.253364 |
| 1 | 010ebe33f668 | 0.746636 | 0.253364 |
| 2 | 02fa521e1838 | 0.746636 | 0.253364 |
| 3 | 040e15f562a2 | 0.746636 | 0.253364 |
| 4 | 046e85c7cc7f | 0.746636 | 0.253364 |
