# Titanic dataset
* Get sample data [here](https://github.com/fuyu-quant/IBLM/tree/main/datasets).

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/fuyu-quant/IBLM/blob/main/examples/iblmodel_titanic.ipynb)

In [1]:
%%capture
!pip install git+https://github.com/fuyu-quant/IBLM.git

In [None]:
# pkg_resources is deprecated; importlib.metadata is the standard-library replacement
from importlib.metadata import version
print(version('IBLM'))

## Training

In [2]:
import pandas as pd
from langchain.llms import OpenAI
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

from iblm import IBLMClassifier


import os
# Replace the placeholder string with your own OpenAI API key
os.environ["OPENAI_API_KEY"] = "OPENAI_API_KEY"

In [3]:
df = pd.read_csv('../datasets/learning.csv')
x_train = df.drop('survived', axis=1)
y_train = df['survived']

In [5]:
llm_model_name = 'gpt-4'

params = {
    'columns_name': False
    }

iblm = IBLMClassifier(llm_model_name=llm_model_name, params=params)

In [6]:
model = iblm.fit(x_train, y_train, model_name='titanic', file_path='./model_code/')

> Start of model creating.


Retrying langchain.llms.openai.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID 813c542ea19eacf56c42ad5b4ad5aa20 in your message.).


Tokens Used: 3695
	Prompt Tokens: 3436
	Completion Tokens: 259
Successful Requests: 1
Total Cost (USD): $0.11862


In [7]:
# Print the code of the generated model
print(model)

import numpy as np
import pandas as pd

def predict(x):
    df = x.copy()
    df.columns = range(df.shape[1])

    output = []
    for index, row in df.iterrows():

        # Feature creation and data preprocessing
        pclass = row[7]
        sex = row[1]
        age = row[2]
        fare = row[5]
        embarked = row[6]
        alone = row[13]

        # Prediction logic
        y = 0
        if pclass == 'First':
            y += 0.3
        elif pclass == 'Second':
            y += 0.15

        if sex == 'female':
            y += 0.35

        if age <= 16:
            y += 0.1
        elif age > 16 and age <= 32:
            y += 0.05

        if fare > 50:
            y += 0.1

        if embarked == 'C':
            y += 0.05

        if alone:
            y += 0.05

        y = 1 / (1 + np.exp(-y))
        output.append(y)

    output = np.array(output)

    return output
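
Every rule in the generated `predict` adds a non-negative amount to `y`, so the raw score is never below 0 and the sigmoid on the last line never returns less than 0.5. The sketch below works out the reachable score range from the increments printed above. The names `sigmoid` and `max_increments` are illustrative and are not part of IBLM.

```python
import math

def sigmoid(y):
    """Logistic function applied on the last line of the generated predict()."""
    return 1.0 / (1.0 + math.exp(-y))

# Largest increment each rule group can contribute, read off the generated code.
max_increments = {
    "pclass": 0.30,    # pclass == 'First'
    "sex": 0.35,       # sex == 'female'
    "age": 0.10,       # age <= 16
    "fare": 0.10,      # fare > 50
    "embarked": 0.05,  # embarked == 'C'
    "alone": 0.05,     # alone is truthy
}

lowest = sigmoid(0.0)                            # no rule fires
highest = sigmoid(sum(max_increments.values()))  # every rule fires at its maximum

print(f"score range: [{lowest:.4f}, {highest:.4f}]")
# score range: [0.5000, 0.7211]
```

Because the scores live in roughly [0.5, 0.72], a cut-off of 0.5 labels almost every passenger as a survivor; the scores are better read as a ranking than as calibrated probabilities.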


## Prediction

In [8]:
df = pd.read_csv('../datasets/pred.csv')
x_test = df.drop('survived', axis=1)
y_test = df['survived']

In [9]:
y_proba = iblm.predict(x_test)
y_pred = (y_proba > 0.5).astype(int)

In [10]:
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

# Precision
precision = precision_score(y_test, y_pred)
print(f'Precision: {precision}')

# Recall
recall = recall_score(y_test, y_pred)
print(f'Recall: {recall}')

# F1 score
f1 = f1_score(y_test, y_pred)
print(f'F1 score: {f1}')

# ROC-AUC (computed from the predicted scores y_proba, not the thresholded y_pred)
roc_auc = roc_auc_score(y_test, y_proba)
print(f'ROC-AUC: {roc_auc}')

Accuracy: 0.45
Precision: 0.3956043956043956
Recall: 1.0
F1 score: 0.5669291338582677
ROC-AUC: 0.9197048611111112
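
A recall of 1.0 with an accuracy of 0.45 next to a ROC-AUC of about 0.92 suggests the scores rank passengers well but the 0.5 cut-off is too low for this model, whose sigmoid output never drops below 0.5. One possible remedy, sketched here under assumptions, is to pick the cut-off from data. `best_threshold` is a hypothetical helper, not an IBLM function, and the toy labels and scores are made up for illustration.

```python
def best_threshold(y_true, y_score):
    """Return (threshold, accuracy) maximizing accuracy of `score > threshold`."""
    labels = [int(t) for t in y_true]
    scores = [float(s) for s in y_score]
    # Candidate cut-offs: "predict everything positive" plus each observed score.
    candidates = [float("-inf")] + sorted(set(scores))
    best_thr, best_acc = None, -1.0
    for thr in candidates:
        correct = sum((s > thr) == bool(t) for s, t in zip(scores, labels))
        acc = correct / len(labels)
        if acc > best_acc:  # strict '>' keeps the lowest threshold on ties
            best_thr, best_acc = thr, acc
    return best_thr, best_acc

# Toy example with scores confined to the upper half of [0, 1].
toy_y = [0, 0, 0, 1, 1]
toy_scores = [0.50, 0.51, 0.55, 0.62, 0.70]
thr, acc = best_threshold(toy_y, toy_scores)
print(thr, acc)
```

In this notebook the helper would be fitted on the training split, for example `thr, _ = best_threshold(y_train, iblm.predict(x_train))`, and then applied as `y_pred = (y_proba > thr).astype(int)`. Tuning the cut-off on `pred.csv` itself would give an optimistic estimate.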


## Prediction from external files


In [12]:
from model_code import titanic

y_proba = titanic.predict(x_test)
y_pred = (y_proba > 0.5).astype(int)
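
The import above only works when the notebook's working directory contains the `model_code` folder. As a sketch of an alternative, the generated file can be loaded by path with the standard-library `importlib`. `load_model_module` is a hypothetical helper, and the path `./model_code/titanic.py` is an assumption inferred from the `fit` arguments and the import statement.

```python
import importlib.util
from pathlib import Path

def load_model_module(path, module_name="generated_model"):
    """Load a Python file as a module without requiring its folder on sys.path."""
    path = Path(path)
    spec = importlib.util.spec_from_file_location(module_name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load a module from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file, defining predict()
    return module

# Usage (path assumed, see the note above):
# titanic = load_model_module("./model_code/titanic.py")
# y_proba = titanic.predict(x_test)
```

Loading by path also makes it easy to keep several generated models side by side and compare them without renaming modules.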

In [13]:
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

# Precision
precision = precision_score(y_test, y_pred)
print(f'Precision: {precision}')

# Recall
recall = recall_score(y_test, y_pred)
print(f'Recall: {recall}')

# F1 score
f1 = f1_score(y_test, y_pred)
print(f'F1 score: {f1}')

# ROC-AUC (computed from the predicted scores y_proba, not the thresholded y_pred)
roc_auc = roc_auc_score(y_test, y_proba)
print(f'ROC-AUC: {roc_auc}')

Accuracy: 0.45
Precision: 0.3956043956043956
Recall: 1.0
F1 score: 0.5669291338582677
ROC-AUC: 0.9197048611111112


## Interpretation of results

In [14]:
description = iblm.interpret()

Tokens Used: 896
	Prompt Tokens: 341
	Completion Tokens: 555
Successful Requests: 1
Total Cost (USD): $0.04353


In [15]:
print(description)

- First, the function `predict` takes a DataFrame `x` as input and creates a copy of it named `df`. The columns of `df` are then renamed to be integer indices.

- The function then initializes an empty list called `output` to store the predictions.

- For each row in the DataFrame `df`, the function extracts the following features:
  - `pclass`: Passenger class (First, Second, or Third)
  - `sex`: Gender of the passenger (male or female)
  - `age`: Age of the passenger
  - `fare`: Ticket fare paid by the passenger
  - `embarked`: Port of embarkation (C, Q, or S)
  - `alone`: Whether the passenger is traveling alone or not (True or False)

- The prediction logic is then applied to these features, and a variable `y` is initialized to 0.

- The following conditions are checked and the corresponding values are added to `y`:
  - If `pclass` is 'First', add 0.3 to `y`.
  - If `pclass` is 'Second', add 0.15 to `y`.
  - If `sex` is 'female', add 0.35 to `y`.
  - If `age` is less than or equal 