# Titanic dataset

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/fuyu-quant/CreateTool-AGI/blob/main/examples/titanic.ipynb)

In [1]:
%%capture
!pip install git+https://github.com/fuyu-quant/IBLM.git

## Training

In [2]:
import pandas as pd
from langchain.llms import OpenAI
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

from iblm import IBLMClassifier


In [61]:
df = pd.read_csv('../datasets/learning.csv')
x_train = df.drop('survived', axis=1)
y_train = df['survived']

In [62]:
llm_model_name = 'gpt-4'

params = {
    'columns_name': False,
}

ibl = IBLMClassifier(llm_model_name=llm_model_name, params=params)

In [63]:
model = ibl.fit(x_train, y_train, model_name='titanic', file_path='./model_code/')

> Start of model creating.
Tokens Used: 3698
	Prompt Tokens: 3436
	Completion Tokens: 262
Successful Requests: 1
Total Cost (USD): $0.1188


In [64]:
# Print the source code of the generated model
print(model)

import numpy as np
import pandas as pd

def predict(x):
    df = x.copy()
    df.columns = range(df.shape[1])

    output = []
    for index, row in df.iterrows():

        # Feature creation and data preprocessing
        pclass = row[7]
        sex = row[1]
        age = row[2]
        fare = row[5]
        embarked = row[6]
        alone = row[13]

        # Prediction logic
        y = 0

        if pclass == 'First':
            y += 0.3
        elif pclass == 'Second':
            y += 0.1

        if sex == 'female':
            y += 0.35

        if age <= 16:
            y += 0.2
        elif age > 16 and age <= 32:
            y += 0.1

        if fare > 50:
            y += 0.1

        if embarked == 'C':
            y += 0.1

        if alone == 'True':
            y -= 0.1

        y = 1 / (1 + np.exp(-y))
        output.append(y)

    output = np.array(output)

    return output


## Prediction

In [69]:
df = pd.read_csv('../datasets/pred.csv')
x_test = df.drop('survived', axis=1)
y_test = df['survived']

In [70]:
y_proba = ibl.predict(x_test)
y_pred = (y_proba > 0.5).astype(int)

In [71]:
# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

# Precision
precision = precision_score(y_test, y_pred)
print(f'Precision: {precision}')

# Recall
recall = recall_score(y_test, y_pred)
print(f'Recall: {recall}')

# F1 score
f1 = f1_score(y_test, y_pred)
print(f'F1 score: {f1}')

# ROC-AUC (needs predicted probabilities rather than class labels,
# so y_proba is passed here instead of y_pred)
roc_auc = roc_auc_score(y_test, y_proba)
print(f'ROC-AUC: {roc_auc}')

Accuracy: 0.54
Precision: 0.43902439024390244
Recall: 1.0
F1 score: 0.6101694915254238
ROC-AUC: 0.9097222222222222
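One plausible reading of these numbers (recall of 1.0 with low accuracy but a high ROC-AUC) is that the 0.5 threshold sits near the bottom of the range the generated `predict` function can produce. The sketch below is not part of IBLM: it enumerates the rule weights copied from the printed model code and passes the smallest and largest raw scores through the same sigmoid. The `contributions` table and the `sigmoid` helper are illustrative names introduced here.

```python
import math
from itertools import product


def sigmoid(score):
    """Logistic function, as applied on the last line of the generated predict()."""
    return 1.0 / (1.0 + math.exp(-score))


# Weights copied from the generated model code printed earlier.
# Each list holds every value a rule can add to the raw score y.
contributions = {
    "pclass": [0.3, 0.1, 0.0],   # 'First', 'Second', anything else
    "sex": [0.35, 0.0],          # 'female', otherwise
    "age": [0.2, 0.1, 0.0],      # <= 16, (16, 32], older than 32 or missing
    "fare": [0.1, 0.0],          # > 50, otherwise
    "embarked": [0.1, 0.0],      # 'C', otherwise
    "alone": [-0.1, 0.0],        # 'True', otherwise
}

# Raw score for every combination of rule outcomes.
raw_scores = [sum(combo) for combo in product(*contributions.values())]

lowest = sigmoid(min(raw_scores))
highest = sigmoid(max(raw_scores))

# Rule combinations that a 0.5 threshold would label as class 0.
at_or_below_half = sum(sigmoid(s) <= 0.5 for s in raw_scores)

print(f"probability range: {lowest:.3f} to {highest:.3f}")
print(f"combinations at or below 0.5: {at_or_below_half} of {len(raw_scores)}")
```

Under these weights the output stays between roughly 0.475 and 0.741, so `y_proba > 0.5` is equivalent to "raw score above zero" and flags almost every rule combination as a survivor. The ranking of passengers can still be good, which is what ROC-AUC measures, so a threshold chosen from the training data may suit this model better than a fixed 0.5.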


## Prediction from external files


In [68]:
import titanic

y_proba = titanic.predict(x_test)
y_pred = (y_proba > 0.5).astype(int)
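`import titanic` only succeeds when the directory holding the generated file is on Python's import path. As a minimal sketch, assuming `fit` wrote the model to `./model_code/titanic.py` (the file name is inferred from `model_name` and `file_path`, not confirmed by the source), the module can also be loaded directly from its path with the standard library. `load_module_from_path` is a hypothetical helper, and the temporary file stands in for the real generated model so the example runs on its own.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_module_from_path(module_name, file_path):
    """Load a Python source file as a module without touching sys.path."""
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load {module_name!r} from {file_path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Stand-in for the generated model file; in the notebook the path would be
# something like './model_code/titanic.py' (assumed, see above).
with tempfile.TemporaryDirectory() as tmp_dir:
    model_file = Path(tmp_dir) / "titanic.py"
    model_file.write_text("def predict(x):\n    return [0.5 for _ in x]\n")

    demo_model = load_module_from_path("titanic", model_file)
    demo_output = demo_model.predict([1, 2, 3])

print(demo_output)
```

Loading by path avoids name clashes with any other `titanic` module that may already be importable in the environment.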

In [56]:
# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

# Precision
precision = precision_score(y_test, y_pred)
print(f'Precision: {precision}')

# Recall
recall = recall_score(y_test, y_pred)
print(f'Recall: {recall}')

# F1 score
f1 = f1_score(y_test, y_pred)
print(f'F1 score: {f1}')

# ROC-AUC (needs predicted probabilities rather than class labels,
# so y_proba is passed here instead of y_pred)
roc_auc = roc_auc_score(y_test, y_proba)
print(f'ROC-AUC: {roc_auc}')

Accuracy: 0.83
Precision: 0.7435897435897436
Recall: 0.8055555555555556
F1 score: 0.7733333333333334
ROC-AUC: 0.9266493055555556


## Interpretation of results

In [57]:
description = ibl.interpret()

Tokens Used: 881
	Prompt Tokens: 537
	Completion Tokens: 344
Successful Requests: 1
Total Cost (USD): $0.036750000000000005


In [58]:
print(description)

- Data preprocessing:
    - Fill missing 'age' values with the median age.
    - Fill missing 'fare' values with the median fare.
    - Fill missing 'embarked' values with the mode (most frequent) of the 'embarked' column.

- Feature creation:
    - Create a new binary feature 'is_female' based on the 'sex' column.
    - Create a new binary feature 'is_child' based on the 'age' column.
    - Create a new binary feature 'is_adult_male' based on the 'adult_male' column.
    - Create a new binary feature 'is_alone' based on the 'alone' column.
    - Create new binary features 'is_first_class', 'is_second_class', and 'is_third_class' based on the 'pclass' column.
    - Create new binary features 'embarked_C', 'embarked_Q', and 'embarked_S' based on the 'embarked' column.

- Prediction logic:
    - Initialize a variable 'y' to 0.
    - Add or subtract weights to 'y' based on the created binary features.
    - Apply the logistic function (sigmoid) to 'y' to get the final prediction.
    - Ap