# Titanic dataset
* Get sample data [here](https://github.com/fuyu-quant/IBLM/tree/main/datasets).

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/fuyu-quant/IBLM/blob/main/examples/iblmodel_titanic.ipynb)

In [1]:
%%capture
!pip install git+https://github.com/fuyu-quant/IBLM.git

In [2]:
# importlib.metadata replaces the deprecated pkg_resources API (Python 3.8+)
from importlib.metadata import version
print(version('IBLM'))

0.0.21


### Training

In [10]:
import pandas as pd
from langchain.llms import OpenAI
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

from iblm import IBLMClassifier


import os
#os.environ["OPENAI_API_KEY"] = "OPENAI_API_KEY"

In [36]:
#df = pd.read_csv('/content/titanicdata_train.csv')
df = pd.read_csv('../datasets/titanicdata_train.csv')
x_train = df.drop('survived', axis=1)
y_train = df['survived']

In [37]:
llm_model_name = 'gpt-4'

params = {
    'columns_name': True
    }

iblm = IBLMClassifier(llm_model_name=llm_model_name, params=params)

In [64]:
#file_path = '/content/'
file_path = './model_code/'

model = iblm.fit(x_train, y_train, model_name='titanic_22_', file_path=file_path)

> Start of model creating.
Tokens Used: 7676
	Prompt Tokens: 7239
	Completion Tokens: 437
Successful Requests: 1
Total Cost (USD): $0.24338999999999997
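
The logged cost can be reproduced from the token counts. The sketch below assumes per-token rates of $0.03 per 1,000 prompt tokens and $0.06 per 1,000 completion tokens; these rates are an assumption inferred from the log totals, not something the notebook states, and `estimate_cost` is a hypothetical helper.

```python
# Assumed rates (USD per token), chosen because they match the logged totals.
PROMPT_RATE = 0.03 / 1000
COMPLETION_RATE = 0.06 / 1000


def estimate_cost(prompt_tokens, completion_tokens):
    """Estimate the USD cost of one request from its token counts."""
    return prompt_tokens * PROMPT_RATE + completion_tokens * COMPLETION_RATE


# Token counts taken from the fit() log above.
cost = estimate_cost(prompt_tokens=7239, completion_tokens=437)
print(f"Estimated cost: ${cost:.5f}")
```

Because the prompt carries the training data, the prompt tokens account for most of the cost of each `fit` call under these assumed rates.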


In [65]:
# Print the source code of the generated model
print(model)

import numpy as np

def predict(x):
    df = x.copy()
    output = []
    for index, row in df.iterrows():
        # Do not change the code before this point.
        
        # Calculate the probability based on the given data
        pclass = row['pclass']
        sex = row['sex']
        age = row['age']
        fare = row['fare']
        embarked = row['embarked']
        who = row['who']
        adult_male = row['adult_male']
        alone = row['alone']

        # Initialize probability
        prob = 0

        # Consider the effect of pclass
        if pclass == 1:
            prob += 0.6
        elif pclass == 2:
            prob += 0.4
        else:
            prob += 0.2

        # Consider the effect of sex
        if sex == 'female':
            prob += 0.35
        else:
            prob -= 0.35

        # Consider the effect of age
        if age <= 16:
            prob += 0.1
        elif age >= 60:
            prob -= 0.1

        # Consider the effect of fare
       

### Prediction

In [55]:
#df = pd.read_csv('/content/titanicdata_test.csv')
df = pd.read_csv('../datasets/titanicdata_test.csv')
x_test = df.drop('survived', axis=1)
y_test = df['survived']

In [56]:
y_proba = iblm.predict(x_test)
y_pred = (y_proba > 0.5).astype(int)

In [57]:
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

# Precision
precision = precision_score(y_test, y_pred)
print(f'Precision: {precision}')

# Recall
recall = recall_score(y_test, y_pred)
print(f'Recall: {recall}')

# F1 score
f1 = f1_score(y_test, y_pred)
print(f'F1 score: {f1}')

# ROC-AUC is computed from the predicted scores (y_proba),
# not from the thresholded class labels (y_pred)
roc_auc = roc_auc_score(y_test, y_proba)
print(f'ROC-AUC: {roc_auc}')

Accuracy: 0.6050670640834576
Precision: 0.0
Recall: 0.0
F1 score: 0.0
ROC-AUC: 0.8148294451157171
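
Zero precision and recall alongside a ROC-AUC of about 0.81 suggests that the scores rank passengers reasonably well but that the fixed 0.5 cutoff does not suit their scale. That reading is an interpretation, not something the notebook confirms. A minimal sketch of one alternative, choosing the cutoff that maximizes Youden's J (TPR minus FPR) with scikit-learn's `roc_curve`, is shown below. The data is synthetic and `pick_threshold` is a hypothetical helper; in practice the cutoff would be chosen on training or validation data, not on the test set.

```python
import numpy as np
from sklearn.metrics import roc_curve


def pick_threshold(y_true, y_score):
    """Return the score cutoff that maximizes Youden's J (TPR - FPR)."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    best = np.argmax(tpr - fpr)
    return thresholds[best]


# Synthetic stand-in for model scores: all below 0.5 but still informative.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40])

threshold = pick_threshold(y_true, y_score)

# With a fixed 0.5 cutoff no sample is predicted positive.
y_pred_fixed = (y_score > 0.5).astype(int)
# roc_curve treats a sample as positive when score >= threshold.
y_pred_tuned = (y_score >= threshold).astype(int)

print(f"Chosen threshold: {threshold}")
print(f"Positives at 0.5: {y_pred_fixed.sum()}, at tuned cutoff: {y_pred_tuned.sum()}")
```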


### Prediction from external files


In [59]:
from model_code import titanic

y_proba = titanic.predict(x_test)
y_pred = (y_proba > 0.5).astype(int)

ImportError: attempted relative import with no known parent package
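
The ImportError above is raised while resolving the import, before any prediction runs. One way to avoid package resolution altogether is to load the generated file directly by path with `importlib`. The sketch below assumes the generated file defines a top-level `predict(x)` function, as in the printed model code; the file name and its contents are hypothetical stand-ins written to a temporary directory so the example runs on its own.

```python
import importlib.util
import tempfile
from pathlib import Path

import pandas as pd


def load_predict(path):
    """Load a Python file by path and return its top-level `predict` function."""
    spec = importlib.util.spec_from_file_location(Path(path).stem, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.predict


# Stand-in for a generated model file (hypothetical name and contents).
source = '''
import numpy as np

def predict(x):
    df = x.copy()
    output = []
    for index, row in df.iterrows():
        prob = 0.35 if row['sex'] == 'female' else 0.0
        output.append(prob)
    return np.array(output)
'''

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "titanic_example.py"
    model_path.write_text(source)
    predict = load_predict(model_path)
    scores = predict(pd.DataFrame({"sex": ["female", "male"]}))

print(scores)
```

To use this with a real generated model, `model_path` would point at the file written under `file_path` by `iblm.fit`.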

In [13]:
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

# Precision
precision = precision_score(y_test, y_pred)
print(f'Precision: {precision}')

# Recall
recall = recall_score(y_test, y_pred)
print(f'Recall: {recall}')

# F1 score
f1 = f1_score(y_test, y_pred)
print(f'F1 score: {f1}')

# ROC-AUC is computed from the predicted scores (y_proba),
# not from the thresholded class labels (y_pred)
roc_auc = roc_auc_score(y_test, y_proba)
print(f'ROC-AUC: {roc_auc}')

Accuracy: 0.45
Precision: 0.3956043956043956
Recall: 1.0
F1 score: 0.5669291338582677
ROC-AUC: 0.9197048611111112


### Interpretation of results

In [14]:
description = iblm.interpret()

Tokens Used: 896
	Prompt Tokens: 341
	Completion Tokens: 555
Successful Requests: 1
Total Cost (USD): $0.04353


In [15]:
print(description)

- First, the function `predict` takes a DataFrame `x` as input and creates a copy of it named `df`. The columns of `df` are then renamed to be integer indices.

- The function then initializes an empty list called `output` to store the predictions.

- For each row in the DataFrame `df`, the function extracts the following features:
  - `pclass`: Passenger class (First, Second, or Third)
  - `sex`: Gender of the passenger (male or female)
  - `age`: Age of the passenger
  - `fare`: Ticket fare paid by the passenger
  - `embarked`: Port of embarkation (C, Q, or S)
  - `alone`: Whether the passenger is traveling alone or not (True or False)

- The prediction logic is then applied to these features, and a variable `y` is initialized to 0.

- The following conditions are checked and the corresponding values are added to `y`:
  - If `pclass` is 'First', add 0.3 to `y`.
  - If `pclass` is 'Second', add 0.15 to `y`.
  - If `sex` is 'female', add 0.35 to `y`.
  - If `age` is less than or equal 

### Creating Multiple Models

In [52]:
#file_path = '/content/'
file_path = './model_code/'

for i in range(30):
    model = iblm.fit(x_train, y_train, model_name=f'titanic_{i}_', file_path=file_path)

> Start of model creating.
Tokens Used: 7571
	Prompt Tokens: 7239
	Completion Tokens: 332
Successful Requests: 1
Total Cost (USD): $0.23708999999999997
> Start of model creating.
Tokens Used: 7690
	Prompt Tokens: 7239
	Completion Tokens: 451
Successful Requests: 1
Total Cost (USD): $0.24422999999999997
> Start of model creating.
Tokens Used: 7577
	Prompt Tokens: 7239
	Completion Tokens: 338
Successful Requests: 1
Total Cost (USD): $0.23744999999999997
> Start of model creating.
Tokens Used: 7653
	Prompt Tokens: 7239
	Completion Tokens: 414
Successful Requests: 1
Total Cost (USD): $0.24200999999999998
> Start of model creating.
Tokens Used: 7692
	Prompt Tokens: 7239
	Completion Tokens: 453
Successful Requests: 1
Total Cost (USD): $0.24434999999999998
> Start of model creating.
Tokens Used: 7570
	Prompt Tokens: 7239
	Completion Tokens: 331
Successful Requests: 1
Total Cost (USD): $0.23702999999999996
> Start of model creating.
Tokens Used: 7692
	Prompt Tokens: 7239
	Completion Tokens: 45
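
Once several model variants exist, one simple way to combine them is to average their scores. The sketch below is an illustration only: it does not assume that IBLM provides ensembling, `average_scores` is a hypothetical helper, and the two toy rule sets stand in for generated `predict` functions.

```python
import numpy as np
import pandas as pd


def average_scores(predict_fns, x):
    """Average the scores returned by several predict(x) functions."""
    all_scores = np.vstack([np.asarray(fn(x), dtype=float) for fn in predict_fns])
    return all_scores.mean(axis=0)


# Two toy rule sets standing in for generated models (hypothetical).
def predict_a(x):
    return np.where(x["sex"] == "female", 0.8, 0.2)


def predict_b(x):
    return np.where(x["pclass"] == 1, 0.6, 0.4)


x_demo = pd.DataFrame({"sex": ["female", "male"], "pclass": [1, 3]})
ensemble = average_scores([predict_a, predict_b], x_demo)
print(ensemble)
```

Averaging smooths out differences between individual generated rule sets; comparing each variant on held-out data before combining them is a reasonable alternative.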