# Titanic dataset
* Get sample data [here](https://github.com/fuyu-quant/IBLM/tree/main/datasets).

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/fuyu-quant/IBLM/blob/main/examples/iblmodel_titanic.ipynb)

In [1]:
%%capture
!pip install git+https://github.com/fuyu-quant/IBLM.git

In [2]:
# pkg_resources is deprecated; importlib.metadata reports the same installed version
from importlib.metadata import version
print(version('IBLM'))

0.0.13


## Training

In [3]:
import pandas as pd
from langchain.llms import OpenAI
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

from iblm import IBLMClassifier


import os
#os.environ["OPENAI_API_KEY"] = "OPENAI_API_KEY"

In [43]:
#df = pd.read_csv('/content/titanicdata_train.csv')
df = pd.read_csv('../datasets/titanicdata_train.csv')
x_train = df.drop('survived', axis=1)
y_train = df['survived']

In [44]:
llm_model_name = 'gpt-4'

params = {
    'columns_name': True
    }

iblm = IBLMClassifier(llm_model_name=llm_model_name, params=params)

In [45]:
#file_path = '/content/'
file_path = '../datasets/'

model = iblm.fit(x_train, y_train, model_name = 'titanic', file_path=file_path)

> Start of model creating.


Retrying langchain.llms.openai.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID d68efd9970911e599409d4ba50885ca4 in your message.).


Tokens Used: 7782
	Prompt Tokens: 7557
	Completion Tokens: 225
Successful Requests: 1
Total Cost (USD): $0.24020999999999998


In [46]:
# Print the source code of the generated model
print(model)

import numpy as np
def predict(x):
    df = x.copy()
    output = []
    for index, row in df.iterrows():

        # Feature creation and data preprocessing
        sex = 1 if row['sex'] == 'male' else 0
        age = row['age']
        pclass = row['pclass']
        fare = row['fare']
        sibsp = row['sibsp']
        parch = row['parch']
        adult_male = 1 if row['adult_male'] else 0
        alone = 1 if row['alone'] else 0

        # Prediction logic
        y = -1.5 + 0.8 * sex - 0.02 * age - 0.5 * pclass + 0.001 * fare - 0.3 * sibsp - 0.2 * parch + 0.6 * adult_male - 0.4 * alone

        y = 1 / (1 + np.exp(-y))
        output.append(y)
    return np.array(output)
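
The generated `predict` is a hand-weighted logistic model: a linear score passed through the sigmoid. Because the score uses `age` directly, a row with a missing `age` yields `nan`, since NaN propagates through arithmetic. A minimal sketch that reproduces this with the coefficients printed above, written column-wise instead of row by row. The name `predict_vectorized` and the two-row DataFrame are made up for illustration and are not part of IBLM or the dataset:

```python
import numpy as np
import pandas as pd

def predict_vectorized(x: pd.DataFrame) -> np.ndarray:
    """Same formula as the generated model, applied to whole columns."""
    sex = (x["sex"] == "male").astype(int)
    adult_male = x["adult_male"].astype(bool).astype(int)
    alone = x["alone"].astype(bool).astype(int)
    score = (-1.5 + 0.8 * sex - 0.02 * x["age"] - 0.5 * x["pclass"]
             + 0.001 * x["fare"] - 0.3 * x["sibsp"] - 0.2 * x["parch"]
             + 0.6 * adult_male - 0.4 * alone)
    # Sigmoid maps the linear score to a value in (0, 1)
    return (1 / (1 + np.exp(-score))).to_numpy()

# Hypothetical rows: the second one has no recorded age.
sample = pd.DataFrame({
    "sex": ["male", "female"],
    "age": [30.0, np.nan],
    "pclass": [3, 1],
    "fare": [8.05, 80.0],
    "sibsp": [0, 1],
    "parch": [0, 0],
    "adult_male": [True, False],
    "alone": [True, False],
})
proba = predict_vectorized(sample)
print(proba)  # the second entry is nan because age is missing
```

Filling or dropping missing `age` values before calling `predict` would avoid the `nan` outputs; which strategy is appropriate depends on the data.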


## Prediction

In [47]:
#df = pd.read_csv('/content/titanicdata_test.csv')
df = pd.read_csv('../datasets/titanicdata_test.csv')
x_test = df.drop('survived', axis=1)
y_test = df['survived']

In [51]:
y_proba = iblm.predict(x_test)
y_pred = (y_proba > 0.5).astype(int)
y_proba

array([0.05134269, 0.07094751, 0.06497929, 0.00929802, 0.1224426 ,
              nan, 0.09500499, 0.04592424, 0.13359959, 0.104544  ,
              nan,        nan, 0.11209679, 0.08069992, 0.06960486,
       0.12025686, 0.01999207, 0.03339125, 0.14338616, 0.07914653,
              nan, 0.01679829, 0.09854916, 0.07699459, 0.01477713,
       0.01861944,        nan, 0.05149023, 0.06993613, 0.02474682,
              nan, 0.0212772 , 0.11301074, 0.14185106,        nan,
       0.09890062, 0.0362528 , 0.0837675 , 0.03892044, 0.04906188,
              nan, 0.07585818, 0.02929801, 0.18089434,        nan,
       0.02982754, 0.01348966,        nan, 0.02387229, 0.0150717 ,
       0.03583802, 0.07228246, 0.09025708, 0.08533681,        nan,
       0.06590662, 0.06975233, 0.08075712, 0.08891138, 0.02120158,
       0.05514059, 0.06985487, 0.09911248, 0.02040918, 0.07783706,
              nan, 0.07850965, 0.06988547,        nan, 0.04961947,
              nan,        nan, 0.08471057,        nan,        

In [49]:
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

# Precision
precision = precision_score(y_test, y_pred)
print(f'Precision: {precision}')

# Recall
recall = recall_score(y_test, y_pred)
print(f'Recall: {recall}')

# F1 score
f1 = f1_score(y_test, y_pred)
print(f'F1 score: {f1}')

# ROC-AUC needs predicted probabilities, not thresholded class labels,
# so y_proba is passed here instead of y_pred
roc_auc = roc_auc_score(y_test, y_proba)
print(f'ROC-AUC: {roc_auc}')

Accuracy: 0.6036308623298033
Precision: 0.0
Recall: 0.0
F1 score: 0.0


ValueError: Input contains NaN.
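
The `ValueError` comes from the `nan` entries in `y_proba`: `roc_auc_score` rejects them, while the other metrics still run because `(y_proba > 0.5)` evaluates to `False` for `nan`, silently turning those rows into class 0. One possible workaround, sketched here on made-up arrays rather than the notebook's data, is to score only the rows with a finite probability. Imputing the missing inputs before calling `predict` is another option.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and probabilities standing in for y_test and y_proba.
y_true = np.array([0, 1, 1, 0, 1, 0])
y_score = np.array([0.10, 0.80, np.nan, 0.30, 0.60, np.nan])

# nan > 0.5 is False, so thresholding hides the missing predictions.
y_label = (y_score > 0.5).astype(int)

# Keep only rows where the model produced a usable probability.
valid = ~np.isnan(y_score)
auc = roc_auc_score(y_true[valid], y_score[valid])
print(f"Rows scored: {valid.sum()} of {len(y_score)}")
print(f"ROC-AUC on valid rows: {auc}")
```

Reporting how many rows were excluded alongside the metric keeps the result honest, since a score computed on a subset is not directly comparable to one computed on the full test set.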

## Prediction from external files


In [12]:
from model_code import titanic

y_proba = titanic.predict(x_test)
y_pred = (y_proba > 0.5).astype(int)

In [13]:
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

# Precision
precision = precision_score(y_test, y_pred)
print(f'Precision: {precision}')

# Recall
recall = recall_score(y_test, y_pred)
print(f'Recall: {recall}')

# F1 score
f1 = f1_score(y_test, y_pred)
print(f'F1 score: {f1}')

# ROC-AUC needs predicted probabilities, not thresholded class labels,
# so y_proba is passed here instead of y_pred
roc_auc = roc_auc_score(y_test, y_proba)
print(f'ROC-AUC: {roc_auc}')

Accuracy: 0.45
Precision: 0.3956043956043956
Recall: 1.0
F1 score: 0.5669291338582677
ROC-AUC: 0.9197048611111112
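
A ROC-AUC near 0.92 alongside recall 1.0 and precision near 0.40 suggests that the scores rank passengers well but the fixed 0.5 cutoff sits too low for this model, so the large majority of passengers are predicted to survive. A hedged sketch of picking a cutoff from the ROC curve with Youden's J statistic (true positive rate minus false positive rate). The arrays are made up, and in practice the cutoff should be chosen on a validation split rather than on the test set:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_curve

# Hypothetical labels and scores: every score exceeds 0.5, yet they rank well.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_score = np.array([0.55, 0.60, 0.62, 0.65, 0.80, 0.85, 0.90, 0.70])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
best = np.argmax(tpr - fpr)  # index of the largest Youden's J
cutoff = thresholds[best]

# roc_curve treats a score as positive when it is >= the threshold.
acc_default = accuracy_score(y_true, (y_score > 0.5).astype(int))
acc_tuned = accuracy_score(y_true, (y_score >= cutoff).astype(int))
print(f"cutoff={cutoff}, accuracy at 0.5={acc_default}, at cutoff={acc_tuned}")
```

Youden's J is only one criterion; a cutoff that maximizes F1 or meets a target recall may suit a given application better.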


## Interpretation of results

In [14]:
description = iblm.interpret()

Tokens Used: 896
	Prompt Tokens: 341
	Completion Tokens: 555
Successful Requests: 1
Total Cost (USD): $0.04353


In [15]:
print(description)

- First, the function `predict` takes a DataFrame `x` as input and creates a copy of it named `df`. The columns of `df` are then renamed to be integer indices.

- The function then initializes an empty list called `output` to store the predictions.

- For each row in the DataFrame `df`, the function extracts the following features:
  - `pclass`: Passenger class (First, Second, or Third)
  - `sex`: Gender of the passenger (male or female)
  - `age`: Age of the passenger
  - `fare`: Ticket fare paid by the passenger
  - `embarked`: Port of embarkation (C, Q, or S)
  - `alone`: Whether the passenger is traveling alone or not (True or False)

- The prediction logic is then applied to these features, and a variable `y` is initialized to 0.

- The following conditions are checked and the corresponding values are added to `y`:
  - If `pclass` is 'First', add 0.3 to `y`.
  - If `pclass` is 'Second', add 0.15 to `y`.
  - If `sex` is 'female', add 0.35 to `y`.
  - If `age` is less than or equal 