# Titanic Survival Prediction using FEDOT and LLM

This notebook analyzes the Titanic dataset and predicts passenger survival using the FEDOT framework, with a large language model (LLM) used to describe the dataset and its columns.


## Setup

In [1]:
import os
import sys
import re

import numpy as np
import pandas as pd

from pprint import pprint

# Make the directory two levels up importable so that the fedot_llm package resolves.
module_path = os.path.abspath(os.path.join('..', '..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from fedot_llm.fedot_util import run_example
from fedot_llm.data import Dataset
from fedot_llm.chains import ChainBuilder
from langchain_community.chat_models import ChatOllama
from langchain_core.tracers import ConsoleCallbackHandler

## Data Loading and Exploration

In this section, we load the Titanic dataset and perform initial exploration.

In [2]:
dataset_name = 'titanic'
datasets_folder = os.path.join(module_path, 'datasets')
dataset_path = os.path.join(datasets_folder, dataset_name)
dataset = Dataset.load_from_path(dataset_path)
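
For a first look at a split, a per-column summary of dtypes, missing values and distinct values is often enough. The sketch below is a minimal example of that idea using plain pandas; `summarize_columns` is a hypothetical helper (not part of `fedot_llm`), and the toy frame is synthetic data that only borrows a few column names.

```python
import numpy as np
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing-value count, distinct-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "distinct": df.nunique(),  # NaN/None are not counted as distinct values
    })


# Synthetic rows that only mimic a few column names; not real passenger data.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
})

summary = summarize_columns(toy)
print(summary)
```

In this notebook the same helper could presumably be applied to `dataset.train_split.data`, which is the DataFrame passed to the later cells.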

In [3]:
import json
description_file = 'big_descriptions.json'
with open(os.path.join(datasets_folder, description_file), 'r') as json_file:
    dataset_big_descriptions = json.load(json_file)
big_description = dataset_big_descriptions[dataset_name]
pprint(big_description)

('The sinking of the Titanic is one of the most infamous shipwrecks in '
 'history.\n'
 '\n'
 'On April 15, 1912, during her maiden voyage, the widely considered '
 '“unsinkable” RMS Titanic sank after colliding with an iceberg. '
 'Unfortunately, there weren’t enough lifeboats for everyone onboard, '
 'resulting in the death of 1502 out of 2224 passengers and crew.\n'
 '\n'
 'While there was some element of luck involved in surviving, it seems some '
 'groups of people were more likely to survive than others.\n'
 '\n'
 'In this challenge, we ask you to build a predictive model that answers the '
 'question: “what sorts of people were more likely to survive?” using '
 'passenger data (ie name, age, gender, socio-economic class, etc).\n'
 '\n'
 'In this competition, you’ll gain access to two similar datasets that include '
 'passenger information like name, age, gender, socio-economic class, etc. One '
 'dataset is titled train.csv and the other is titled test.csv.\n'
 '\n'
 'Train.csv 

In [4]:
print(dataset.detailed_description)
print("-"*100)
print(dataset.metadata_description)

Assume we have a datasetThe dataset contains the following splits:

The test split stored in file "test.csv" contains following columns: ['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']. It is described as None
The predictions split stored in file "predictions.csv" contains following columns: ['Unnamed: 0', 'Survived']. It is described as None
The train split stored in file "train.csv" contains following columns: ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']. It is described as None
The gender_submission split stored in file "gender_submission.csv" contains following columns: ['PassengerId', 'Survived']. It is described as None
----------------------------------------------------------------------------------------------------
splits:

name: test 
path: /Users/aleksejlapin/Work/AutoML-LLM/AutoML-LLM-24-Jul/datasets/titanic/test.csv 
description: None

name: pre

## Dataset Analysis using LLM

Here we use an LLM to analyze and describe various aspects of the dataset.

In [5]:
model = ChatOllama(model='llama3.1', temperature=0.1)
arbiter = ChatOllama(model='llama3.1', temperature=0.0, top_k=30)
cb = ChainBuilder(assistant=model, dataset=dataset, arbiter=arbiter)

### Define Dataset Name, Description and Goal

In [6]:
pprint(cb.dataset_metadata_chain.invoke({'big_description': big_description}))

{'dataset_info': {'dataset_description': 'Here is a short description of the '
                                         'dataset:\n'
                                         '\n'
                                         'The Titanic dataset contains '
                                         'information about 2224 passengers '
                                         'who were on board the RMS Titanic '
                                         'during its ill-fated maiden voyage '
                                         'in 1912. The data includes features '
                                         'such as name, age, gender, '
                                         'socio-economic class, and more, with '
                                         'a binary outcome variable indicating '
                                         'whether each passenger survived or '
                                         'not. The dataset is split into two '
                                         '

In [7]:
print(f"Dataset name: {dataset.name}")
print(f"Dataset description: {dataset.description}")
print(f"Dataset goal: {dataset.goal}")
print(f"Dataset train:\n{dataset.train_split}")
print(f"Dataset test:\n{dataset.test_split}")
print('\n')
print(f"Dataset target: {dataset.target_name}")
print(f"Dataset task type: {dataset.task_type}")

Dataset name: Titanic Dataset
Dataset description: Here is a short description of the dataset:

The Titanic dataset contains information about 2224 passengers who were on board the RMS Titanic during its ill-fated maiden voyage in 1912. The data includes features such as name, age, gender, socio-economic class, and more, with a binary outcome variable indicating whether each passenger survived or not. The dataset is split into two parts: a training set (train.csv) that contains the ground truth for 891 passengers, and a test set (test.csv) that requires prediction of survival outcomes for an additional 418 passengers.
Dataset goal: None
Dataset train:
name: train 
path: /Users/aleksejlapin/Work/AutoML-LLM/AutoML-LLM-24-Jul/datasets/titanic/train.csv 
description: None
Dataset test:
name: test 
path: /Users/aleksejlapin/Work/AutoML-LLM/AutoML-LLM-24-Jul/datasets/titanic/test.csv 
description: None


Dataset target: Survived
Dataset task type: classification


### Column Descriptions and Categorical Columns Identification

In [8]:
cb.categorize_columns_chain.batch(
    list(cb.dataset.train_split.data.columns),
    config={'callbacks': [ConsoleCallbackHandler()]},
)

[32;1m[1;3m[chain/start][0m [1m[chain:RunnableSequence] Entering Chain run with input:
[0m{
  "input": "PassengerId"
}
[32;1m[1;3m[chain/start][0m [1m[chain:RunnableSequence] Entering Chain run with input:
[0m{
  "input": "Survived"
}
[32;1m[1;3m[chain/start][0m [1m[chain:RunnableSequence] Entering Chain run with input:
[0m{
  "input": "Pclass"
}
[32;1m[1;3m[chain/start][0m [1m[chain:RunnableSequence] Entering Chain run with input:
[0m{
  "input": "Name"
}
[32;1m[1;3m[chain/start][0m [1m[chain:RunnableSequence] Entering Chain run with input:
[0m{
  "input": "Sex"
}
[32;1m[1;3m[chain/start][0m [1m[chain:RunnableSequence] Entering Chain run with input:
[0m{
  "input": "Age"
}
[32;1m[1;3m[chain/start][0m [1m[chain:RunnableSequence] Entering Chain run with input:
[0m{
  "input": "SibSp"
}
[32;1m[1;3m[chain/start][0m [1m[chain:RunnableSequence] Entering Chain run with input:
[0m{
  "input": "Parch"
}
[32;1m[1;3m[chain/start][0m [1m[chain:RunnableS

[{'reasoning': 'Based on the column description and data, I can conclude that:\n\n1. Unique ratio is 1.0, which means all values are unique.\n2. The column contains a sequence of numbers from 1 to 2224.\n\nHence it is numerical data. The conclusion is a numerical feature, because ratio is 1 -> all values unique.\n\n\n\n```json\n {"name": "PassengerId", "column_type": "numerical"}\n```',
  'category': ColumnType(name='PassengerId', column_type='numerical')},
 {'reasoning': 'Based on the column description and data, I can conclude that:\n\n1. Unique ratio is low (0). This means that there are only two unique values in the column: 0 and 1.\n2. The column contains a binary variable indicating whether each passenger survived or not.\n\nGiven these characteristics, I would classify this column as categorical data, specifically nominal data, since it represents a classification or label rather than a numerical value.\n\n\n\n```json\n {"name": "Survived", "column_type": "categorical"}\n```',
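
The reasoning above leans on a column's unique ratio: a ratio of 1.0 (every value distinct) points to a numerical or identifier column, while a ratio near 0 points to a categorical one. As a rough, non-LLM illustration of that rule, the sketch below computes the ratio with pandas and applies a fixed threshold. The threshold value, the helper names and the toy data are all assumptions made for this example; this is not how `categorize_columns_chain` is implemented.

```python
import pandas as pd

# Hypothetical cut-off: numeric columns at or below this unique ratio are treated as categorical.
CATEGORICAL_MAX_UNIQUE_RATIO = 0.05


def unique_ratio(series: pd.Series) -> float:
    """Share of distinct values among the non-null values of a column."""
    non_null = series.dropna()
    if non_null.empty:
        return 0.0
    return non_null.nunique() / len(non_null)


def guess_column_type(series: pd.Series) -> str:
    """Rough heuristic: non-numeric or low-unique-ratio columns are categorical."""
    if not pd.api.types.is_numeric_dtype(series):
        return "categorical"
    if unique_ratio(series) <= CATEGORICAL_MAX_UNIQUE_RATIO:
        return "categorical"
    return "numerical"


# Synthetic stand-in for a training split (not real Titanic rows).
toy = pd.DataFrame({
    "PassengerId": range(1, 101),            # all values unique
    "Survived": [0, 1] * 50,                 # two distinct values
    "Fare": [7.25 + i for i in range(100)],  # continuous-looking values
    "Sex": ["male", "female"] * 50,          # text labels
})

column_types = {name: guess_column_type(toy[name]) for name in toy.columns}
print(column_types)
```

A fixed threshold like this is crude: a free-text column such as `Name` would be labeled categorical simply because it is non-numeric, which is one reason to let a model reason over the column description as well.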
 

## FEDOT Framework Execution

In this section, we run the FEDOT framework on the training split and generate predictions for the test split.

In [9]:
prediction = run_example(
    train_df=dataset.train_split.data,
    test_df=dataset.test_split.data,
    problem=dataset.task_type,
    target=dataset.target_name,
    timeout=5,
)

2024-08-06 19:52:54,575 - Topological features operation requires extra dependencies for time series forecasting, which are not installed. It can infuence the performance. Please install it by 'pip install fedot[extra]'


Generations:   0%|          | 0/10000 [00:00<?, ?gen/s]

2024-08-06 19:53:02,257 - Topological features operation requires extra dependencies for time series forecasting, which are not installed. It can infuence the performance. Please install it by 'pip install fedot[extra]'
2024-08-06 19:53:25,816 - Topological features operation requires extra dependencies for time series forecasting, which are not installed. It can infuence the performance. Please install it by 'pip install fedot[extra]'
2024-08-06 19:53:25,816 - Topological features operation requires extra dependencies for time series forecasting, which 

## Results

Here we display and analyze the prediction results.

In [None]:
prediction[:5]
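
To compare predictions with a file such as `gender_submission.csv`, which has the columns `PassengerId` and `Survived`, the raw predictions need to be paired with the test identifiers. The sketch below shows one way to do that. It assumes the predictions are an array-like of scores or 0/1 labels in test-row order; the actual return format of `run_example` is not shown in this notebook, so `build_submission`, the 0.5 threshold and the toy inputs are all hypothetical.

```python
import numpy as np
import pandas as pd


def build_submission(test_df: pd.DataFrame, predicted, id_column: str = "PassengerId",
                     target: str = "Survived", threshold: float = 0.5) -> pd.DataFrame:
    """Pair each test-row identifier with a 0/1 label derived from the predictions."""
    values = np.asarray(predicted, dtype=float).ravel()  # accept shapes (n,) and (n, 1)
    if len(values) != len(test_df):
        raise ValueError("prediction length does not match the number of test rows")
    labels = (values >= threshold).astype(int)  # values at or above the threshold map to 1
    return pd.DataFrame({id_column: test_df[id_column].to_numpy(), target: labels})


# Placeholder inputs for illustration only; not real model output.
toy_test = pd.DataFrame({"PassengerId": [892, 893, 894]})
toy_prediction = np.array([[0.1], [0.8], [0.5]])

submission = build_submission(toy_test, toy_prediction)
print(submission)
```

The resulting frame can be written out with `submission.to_csv(path, index=False)`, which avoids the extra `Unnamed: 0` index column visible in the `predictions.csv` split above.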