<a href="https://colab.research.google.com/github/anabarrerar/Machine_Learning/blob/main/CatBoost/ICR_KFold.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Description

The goal of this project is to predict whether a person has one or more of three medical conditions (Class 1) or none of them (Class 0).

Determining whether someone has these conditions normally requires a long and intrusive process of collecting information from patients. Predictive models can shorten this process and keep patient details private by collecting key characteristics related to the conditions and then encoding them.

This project will help researchers discover the relationship between measurements of certain characteristics and potential patient conditions.



**Dataset Description**

The competition data comprises over fifty anonymized health characteristics linked to three age-related conditions. Your goal is to predict whether a subject has or has not been diagnosed with one of these conditions -- a binary classification problem.



**train.csv - The training set** 617 rows, 58 columns
* `Id:` Unique identifier for each observation.
* `AB-GL:` Fifty-six anonymized health characteristics. All are numeric except for `EJ`, which is categorical with two levels.
* `Class:` A binary target, 1 indicates the subject has been diagnosed with one of the three conditions, 0 indicates they have not.


**test.csv - The test set** The goal is to predict the probability that a subject in this set belongs to each of the two classes.


**greeks.csv - Supplemental metadata**, only available for the training set. 617 rows, 6 columns
* `Alpha:` Identifies the type of age-related condition, if present.

    A: No age-related condition. Corresponds to class 0.

    B, D, G: The three age-related conditions. Correspond to class 1.

* `Beta, Gamma, Delta:` Three experimental characteristics.
* `Epsilon:` The date the data for this subject was collected. Note that all of the data in the test set was collected after the training set was collected.

# Load data

In [34]:
import numpy as np
import pandas as pd

Train = pd.read_csv('train.csv')
Test = pd.read_csv('test.csv')
greeks = pd.read_csv('greeks.csv')

# Baseline CatBoost + Stratified K-fold Cross Validation

## Loss Function

**Log loss**, also known as cross-entropy loss or logistic loss, is commonly used as a loss function in binary classification problems. It measures the dissimilarity between the predicted probabilities and the true binary labels, and it penalizes confident wrong predictions most heavily.
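
To make the definition concrete, here is a minimal sketch of binary log loss in plain NumPy. The helper name `binary_log_loss` and the toy labels and probabilities are made up for illustration; they are not part of the project code.

```python
import numpy as np

def binary_log_loss(y_true, p_pos, eps=1e-15):
    """Mean negative log-likelihood of the true labels (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log() never receives exactly 0 or 1.
    p_pos = np.clip(np.asarray(p_pos, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p_pos) + (1 - y_true) * np.log(1 - p_pos)))

y_true = [0, 0, 1, 1]  # toy labels
confident_right = binary_log_loss(y_true, [0.1, 0.1, 0.9, 0.9])
uninformative = binary_log_loss(y_true, [0.5, 0.5, 0.5, 0.5])
confident_wrong = binary_log_loss(y_true, [0.9, 0.9, 0.1, 0.1])

print(confident_right, uninformative, confident_wrong)
```

The three values increase in that order: the loss is small for confident correct predictions, equals ln 2 for a constant 0.5, and grows quickly for confident mistakes.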

The `balance_logloss` function is an implementation of the *balanced log loss* evaluation metric. This metric is used in imbalanced binary classification problems, where the classes have different numbers of samples, so that the minority class is not swamped by the majority class.
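
Reading the weights off the `balance_logloss` code in this section, the quantity it returns can be written as follows. The symbols are introduced here for illustration: $N$ is the number of samples, $N_c$ the number of samples in class $c$, and $p_{i,c}$ the predicted probability of class $c$ for sample $i$.

$$
w_c = \frac{N}{N_c}, \qquad
L_{\text{bal}} = \frac{-\dfrac{w_0}{N_0}\sum_{i:\,y_i=0}\log p_{i,0} \;-\; \dfrac{w_1}{N_1}\sum_{i:\,y_i=1}\log p_{i,1}}{w_0 + w_1}
$$

Each class contributes its own mean log loss, weighted by the inverse of its frequency. Whether this weighting matches the competition's official metric is not checked here.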

In [26]:
def balance_logloss(y_true, y_pred):
    y_pred = np.clip(y_pred, 1e-15, 1-1e-15) # keep probabilities away from exactly 0 or 1 so the log stays finite
    y_pred = y_pred / np.sum(y_pred, axis=1)[:, None] # normalize each row of predictions to sum to 1
    nc = np.bincount(y_true) # count the number of samples of each class in the vector of true labels
    w0, w1 = 1/(nc[0]/y_true.shape[0]), 1/(nc[1]/y_true.shape[0]) # the weights w0, w1 compensate for the class imbalance
    loss0 = -np.sum(np.where(y_true==0,1,0) * np.log(y_pred[:,0])) / nc[0] # mean log loss over the class 0 samples
    loss1 = -np.sum(np.where(y_true!=0,1,0) * np.log(y_pred[:,1])) / nc[1] # mean log loss over the class 1 samples
    logloss = (w0*loss0 + w1*loss1) / (w0+w1) # weighted average of the two per-class losses
    
    return logloss
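
As a sanity check on the weighting, the sketch below restates the same formula in a self-contained form and evaluates it on made-up data. The name `balanced_logloss_sketch` and the 8-to-2 toy sample are assumptions for illustration, not a replacement for the function above.

```python
import numpy as np

def balanced_logloss_sketch(y_true, y_pred, eps=1e-15):
    """Illustrative restatement of the inverse-frequency weighted log loss."""
    y_true = np.asarray(y_true)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    y_pred = y_pred / y_pred.sum(axis=1, keepdims=True)  # rows sum to 1
    n = np.bincount(y_true, minlength=2)                 # samples per class
    w = y_true.shape[0] / n                              # inverse class frequency
    # Mean log loss computed separately inside each class.
    per_class = np.array([-np.log(y_pred[y_true == c, c]).sum() / n[c] for c in (0, 1)])
    return float((w * per_class).sum() / w.sum())

# Hypothetical imbalanced sample: 8 negatives, 2 positives.
y_true = np.array([0] * 8 + [1] * 2)
coin_flip = balanced_logloss_sketch(y_true, np.full((10, 2), 0.5))
always_negative = balanced_logloss_sketch(y_true, np.tile([0.9, 0.1], (10, 1)))

print(coin_flip, always_negative)
```

A constant 0.5 prediction scores ln 2 whatever the class balance, while a model that leans toward the majority class for every row scores worse than that, because the two positives carry most of the weight.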

## Model + Validation

**Stratified k-fold cross-validation**

The code below performs multilabel stratified k-fold cross-validation: it divides the data into `n_splits` folds and writes each row's fold index to the `kfold` column of `Train`. The stratification labels are `greeks.iloc[:,1:-1]`, which after the merge with `Train` are the columns `Class`, `Alpha`, `Beta`, `Gamma` and `Delta`. The aim is for each fold to have a similar distribution of these labels, so that every validation fold is representative of the full training set.

In [20]:
greeks_copy = greeks.copy() # keep the unmerged metadata
greeks = pd.merge(Train[['Id', 'Class']], greeks, on='Id') # columns: Id, Class, Alpha, Beta, Gamma, Delta, Epsilon


In [None]:
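# Requires the iterative-stratification package (pip install iterative-stratification)
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

Train['kfold'] = -1 # placeholder, overwritten with the fold index below

kf = MultilabelStratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_indices, valid_indices) in enumerate(kf.split(X=Train, y=greeks.iloc[:,1:-1])):
    Train.loc[valid_indices, "kfold"] = fold # rows held out for validation in this fold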
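
To illustrate what stratification achieves, here is a simplified single-label sketch in plain NumPy. It is not the algorithm used by `MultilabelStratifiedKFold`, which balances several label columns at once. The helper `stratified_fold_ids` and the toy target are hypothetical.

```python
import numpy as np

def stratified_fold_ids(labels, n_splits=5, seed=42):
    """Assign a fold id per row so each fold gets a similar share of every class.

    Hypothetical helper for illustration; handles a single label column only.
    """
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    folds = np.full(labels.shape[0], -1, dtype=int)
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))  # shuffle the rows of this class
        folds[idx] = np.arange(idx.size) % n_splits            # deal them out round-robin
    return folds

# Toy target with 20% positives (made-up data, not the competition set).
y = np.array([1] * 20 + [0] * 80)
fold_ids = stratified_fold_ids(y)
positive_rate = [float(y[fold_ids == k].mean()) for k in range(5)]

print(positive_rate)
```

Every fold ends up with the same 20% positive rate as the full toy target, which is the property the `kfold` column is meant to provide for the real labels.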

In [None]:
Train.to_csv("train_folds.csv", index=False) #Export the csv file
#Train=pd.read_csv("train_folds.csv")

**CatBoost** is a gradient boosting library built on decision trees and designed to handle categorical features natively. It also accepts missing values in numeric features without separate imputation, and it is widely used for classification and regression on tabular data.

In [None]:
Train['EJ'] = Train['EJ'].replace({'A': 0, 'B': 1})
Test['EJ']  = Test['EJ'].replace({'A': 0, 'B': 1})

In [None]:
features = Train.columns[1:-2]
label    = Train.columns[-2]
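
The positional slices above rely on the column order `Id`, features, `Class`, `kfold`. The sketch below shows the same selection on a made-up three-feature frame, alongside a name-based alternative that does not depend on the order. The frame `toy` and the variable `features_by_name` are illustrative only.

```python
import pandas as pd

# Made-up frame mirroring the column order: Id, features..., Class, kfold.
toy = pd.DataFrame({
    "Id": ["a", "b"],
    "AB": [0.1, 0.2],
    "EJ": [0, 1],
    "GL": [1.5, 2.5],
    "Class": [0, 1],
    "kfold": [0, 1],
})

features = toy.columns[1:-2]  # drops Id on the left, Class and kfold on the right
label = toy.columns[-2]       # second-to-last column, here Class

# Name-based selection that stays correct if columns are reordered or appended.
features_by_name = [c for c in toy.columns if c not in ("Id", "Class", "kfold")]

print(list(features), label)
```

If another column were appended after `kfold`, the positional version would silently treat `Class` as a feature, which is why the name-based form can be the safer choice.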

In [27]:
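from catboost import CatBoostClassifier, Pool
from sklearn.metrics import log_loss

final_valid_predictions = {} # out-of-fold predictions keyed by sample Id
final_test_predictions = []  # one test-set prediction array per fold
s  = [] # log loss per fold
bs = [] # balanced log loss per fold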

for k in range(5): #Iterate through 5 folds
    print('------------------ Fold: '+str(k)) #Print number of iteration
    train     = Train[Train['kfold'] !=k].reset_index(drop=True) #training set, except current kfold
    val       = Train[Train['kfold'] ==k].reset_index(drop=True) #validation set, current kfold
    valid_ids = val.Id.values.tolist() #Unique identifiers in the validation set (val.Id) are obtained and stored in the valid_ids list.


    #train_dataset and eval_dataset created using the training and validation data,features, and label
    train_dataset = Pool(data=train[features], label=train[label], cat_features=["EJ"] )
    eval_dataset  = Pool(data=val[features], label=val[label], cat_features=["EJ"])


    #The parameters of the CatBoostClassifier model are defined
    params = {
        "iterations": 10000,
        "verbose": 1000,
        "learning_rate": 0.003,
        "depth": 4,
        "auto_class_weights": "Balanced",
        "loss_function": "MultiClass",
        "eval_metric": "MultiClass:use_weights=False",
    }

    #An instance of the CatBoostClassifier model is created with the specified parameters.
    model = CatBoostClassifier(**params)

    #The model is trained on the training set.
    model.fit(train_dataset, eval_set=eval_dataset, use_best_model=True)

    #Probability predictions are generated for the validation set and the test set.
    preds_valid = model.predict_proba(val[features])
    preds_test  = model.predict_proba(Test[features])

    final_test_predictions.append(preds_test) #Append test set predictions
    final_valid_predictions.update(dict(zip(valid_ids, preds_valid))) #Validation set predictions added to a dictionary, sample identifiers as keys. 
    logloss  = log_loss(val[label], preds_valid) #Calculate the log-loss metric 
    blogloss = balance_logloss(val[label], preds_valid) #Calculate the balanced log-loss metric 

    s.append(logloss) #Append logloss values
    bs.append(blogloss) #Append balanced logloss values

    print(k, logloss, blogloss)
print('Log loss:')
print(s)
print(np.mean(s), np.std(s))  
print('Balance Log loss:')
print(bs)
print(np.mean(bs), np.std(bs))

------------------ Fold: 0
0:	learn: 0.6918613	test: 0.6918524	best: 0.6918524 (0)	total: 60.4ms	remaining: 10m 4s
1000:	learn: 0.3112258	test: 0.3129799	best: 0.3129799 (1000)	total: 12.4s	remaining: 1m 51s
2000:	learn: 0.2102446	test: 0.2228566	best: 0.2228566 (2000)	total: 28s	remaining: 1m 52s
3000:	learn: 0.1412684	test: 0.1690151	best: 0.1690151 (3000)	total: 36.5s	remaining: 1m 25s
4000:	learn: 0.0976817	test: 0.1384604	best: 0.1384604 (4000)	total: 43s	remaining: 1m 4s
5000:	learn: 0.0725291	test: 0.1201606	best: 0.1201606 (5000)	total: 51.3s	remaining: 51.3s
6000:	learn: 0.0561581	test: 0.1080580	best: 0.1080580 (6000)	total: 59.4s	remaining: 39.6s
7000:	learn: 0.0447400	test: 0.0994511	best: 0.0994511 (7000)	total: 1m 12s	remaining: 30.9s
8000:	learn: 0.0365569	test: 0.0929959	best: 0.0929959 (8000)	total: 1m 20s	remaining: 20.1s
9000:	learn: 0.0304336	test: 0.0879040	best: 0.0879040 (9000)	total: 1m 26s	remaining: 9.63s
9999:	learn: 0.0258011	test: 0.0839598	best: 0.0839433 

# Submission File

In [28]:
#Predictions of the validation set
final_valid_predictions = pd.DataFrame.from_dict(final_valid_predictions, orient="index").reset_index() #dictionary to dataframe
final_valid_predictions.columns = ['Id', 'class_0', 'class_1'] #Column names

#Calculating the average prediction for each sample in the test set (across all folds).
final_test_predictions = np.mean(final_test_predictions, axis=0)
test_dict = {}
test_dict.update(dict(zip(Test.Id.values.tolist(), final_test_predictions))) #Mapping each sample ID from the Test dataset to its corresponding average prediction from final_test_predictions. 
submission = pd.DataFrame.from_dict(test_dict, orient="index").reset_index() #Dictionary to dataframe
submission.columns = ['Id', 'class_0', 'class_1'] #Column names               

submission.to_csv(r"submission.csv", index=False)
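
One property worth checking before submitting is that averaging the per-fold probability matrices element-wise still yields valid probabilities. The sketch below uses made-up fold outputs for three test rows; the names `fold_preds` and `averaged` are illustrative.

```python
import numpy as np

# Made-up class_1 probabilities from 5 folds for 3 test rows.
rng = np.random.default_rng(0)
p1 = rng.uniform(size=(5, 3))

# One (rows x 2) matrix per fold, columns ordered as class_0, class_1.
fold_preds = [np.column_stack([1 - p, p]) for p in p1]

# Element-wise mean over the fold axis.
averaged = np.mean(fold_preds, axis=0)

print(averaged.shape)
print(averaged.sum(axis=1))
```

Because every fold's rows sum to 1, the averaged rows do as well, so the `class_0` and `class_1` columns of the submission remain a proper probability pair.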

In [29]:
submission.head()

| | Id | class_0 | class_1 |
|---|---|---|---|
| 0 | 00eed32682bb | 0.803432 | 0.196568 |
| 1 | 010ebe33f668 | 0.803432 | 0.196568 |
| 2 | 02fa521e1838 | 0.803432 | 0.196568 |
| 3 | 040e15f562a2 | 0.803432 | 0.196568 |
| 4 | 046e85c7cc7f | 0.803432 | 0.196568 |
