# Titanic Survival Prediction using FEDOT and LLM

This notebook analyzes the Titanic dataset and predicts passenger survival with the FEDOT framework, using a Large Language Model (LLM) to describe the dataset and to identify the splits, target column, and task type.


## Setup

In [1]:
import os
import sys
import re

import numpy as np
import pandas as pd

from pprint import pprint

module_path = os.path.abspath(os.path.join('../..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from fedot_llm.fedot_util import run_example
from langchain_community.chat_models import ChatOllama
from langchain_core.output_parsers import StrOutputParser, JsonOutputParser, PydanticOutputParser
from langchain.output_parsers import RetryWithErrorOutputParser # My PR is already in langchain master but not on PyPI yet
from langchain_core.runnables import (
    RunnableParallel,
    RunnableLambda,
    RunnablePassthrough
)
from langchain_core.tracers import ConsoleCallbackHandler
from fedot_llm.data import Dataset, Split
from fedot_llm import prompts

from typing_extensions import Literal
from langchain_core.pydantic_v1 import BaseModel, Field

## Data Loading and Exploration

In this section, we load the Titanic dataset and perform initial exploration.

In [2]:
dataset_name = 'titanic'
datasets_folder = os.sep.join([module_path, 'datasets'])
dataset_path = os.sep.join([datasets_folder, dataset_name])
dataset = Dataset.load_from_path(dataset_path)

In [3]:
import json
description_file = 'big_descriptions.json'
with open(os.sep.join([datasets_folder, description_file]), 'r') as json_file:
    dataset_big_descriptions = json.load(json_file)
big_description = dataset_big_descriptions[dataset_name]
pprint(big_description)

('The sinking of the Titanic is one of the most infamous shipwrecks in '
 'history.\n'
 '\n'
 'On April 15, 1912, during her maiden voyage, the widely considered '
 '“unsinkable” RMS Titanic sank after colliding with an iceberg. '
 'Unfortunately, there weren’t enough lifeboats for everyone onboard, '
 'resulting in the death of 1502 out of 2224 passengers and crew.\n'
 '\n'
 'While there was some element of luck involved in surviving, it seems some '
 'groups of people were more likely to survive than others.\n'
 '\n'
 'In this challenge, we ask you to build a predictive model that answers the '
 'question: “what sorts of people were more likely to survive?” using '
 'passenger data (ie name, age, gender, socio-economic class, etc).\n'
 '\n'
 'In this competition, you’ll gain access to two similar datasets that include '
 'passenger information like name, age, gender, socio-economic class, etc. One '
 'dataset is titled train.csv and the other is titled test.csv.\n'
 '\n'
 'Train.csv 

In [4]:
print(dataset.detailed_description)
print("-"*100)
print(dataset.metadata_description)

Assume we have a datasetThe dataset contains the following splits:

The test split stored in file "test.csv" contains following columns: ['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']. It is described as None
The predictions split stored in file "predictions.csv" contains following columns: ['Unnamed: 0', 'Survived']. It is described as None
The train split stored in file "train.csv" contains following columns: ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']. It is described as None
The gender_submission split stored in file "gender_submission.csv" contains following columns: ['PassengerId', 'Survived']. It is described as None
----------------------------------------------------------------------------------------------------
splits:

name: test 
path: /Users/aleksejlapin/Work/AutoML-LLM/AutoML-LLM-24-Jul/datasets/titanic/test.csv 
description: None

name: pre

## Dataset Analysis using LLM

Here we use an LLM to analyze and describe various aspects of the dataset.

In [5]:
model = ChatOllama(
            model='llama3.1',
            temperature=0.1)

### Define Dataset Name, Description and Goal

In [6]:
def clear_and_set_split(dataset: Dataset, split_name: str, split_type: Literal['train', 'test']):
    split_name = split_name.split(".")[0]
    if split_type == 'train':
        dataset.train_split = split_name
    elif split_type == 'test':
        dataset.test_split = split_name
    else:
        raise ValueError(f"Not supported split type: {split_type}")
    

parser = StrOutputParser()


dataset_name_chain = (prompts.dataset_name_template 
                      | model 
                      | parser 
                      | RunnableLambda(lambda x: setattr(dataset, 'name', x) or x))
dataset_description_chain = (prompts.dataset_description_template 
                             | model 
                             | StrOutputParser() 
                             | RunnableLambda(lambda x: setattr(dataset, 'description', x) or x))
dataset_goal_chain = (prompts.dataset_goal_template 
                      | model 
                      | parser 
                      | RunnableLambda(lambda x: setattr(dataset, 'goal', x) or x))

train_split_def= (prompts.train_split_template 
                  | model 
                  | parser 
                  | RunnableLambda(lambda name: clear_and_set_split(dataset, str(name), 'train') or name))
test_split_def = (prompts.test_split_template 
                  | model 
                  | parser 
                  | RunnableLambda(lambda name: clear_and_set_split(dataset, str(name), 'test') or name))
target_def = (prompts.target_definition_template 
              | model 
              | parser 
              | RunnableLambda(lambda x : re.sub(r'[\'\"“”‘’`´]', '', x) or x) 
              | RunnableLambda(lambda x: setattr(dataset, 'target_name', x) or x)
)
task_type_def = (prompts.task_definition_template
                 | model 
                 | parser 
                 | RunnableLambda(lambda x : re.sub(r'[\'\"“”‘’`´]', '', x.lower())) 
                 | RunnableLambda(lambda x: setattr(dataset, 'task_type', x) or x))


chain = (
    RunnableParallel(
            name=dataset_name_chain,
            description=dataset_description_chain,
            goal=dataset_goal_chain
        )
    | RunnablePassthrough.assign(detailed_description=lambda _: dataset.detailed_description) 
    | RunnableParallel(
            train=train_split_def,
            test=test_split_def,
            target=target_def,
            task_type=task_type_def
        )
)

responses = chain.invoke({'dataset_description': big_description}) # pass config={'callbacks': [ConsoleCallbackHandler()]} for debugging purposes

responses

{'train': 'train.csv',
 'test': 'test.csv',
 'target': 'Survived',
 'task_type': 'classification'}
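
Each chain above ends with a lambda of the form `setattr(dataset, 'name', x) or x`. Because `setattr` always returns `None`, the `or` expression evaluates to `x`, so the step stores the value on the object and still passes it down the pipeline. A minimal, dependency-free sketch of that idiom follows; the `Record` class and `store_and_pass` helper are hypothetical names used only for illustration and are not part of `fedot_llm`.

```python
class Record:
    """Hypothetical stand-in for the notebook's Dataset object."""
    name = None


def store_and_pass(obj, attr):
    """Build a step that saves its input on obj.attr and returns the input unchanged."""
    # setattr() returns None, so `None or value` evaluates to `value`.
    return lambda value: setattr(obj, attr, value) or value


record = Record()
step = store_and_pass(record, "name")
passed_on = step("Titanic Dataset")

print(record.name)  # the side effect: the attribute is now set
print(passed_on)    # the same value still flows on to the next step
```

This is why the final dictionary only reports the outputs of the last parallel stage, while the name, description, and goal are available as attributes of `dataset`.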

In [7]:
print(f"Dataset name: {dataset.name}")
print(f"Dataset description: {dataset.description}")
print(f"Dataset goal: {dataset.goal}")
print(f"Dataset train:\n{dataset.train_split}")
print(f"Dataset test:\n{dataset.test_split}")
print('\n')
print(f"Dataset target: {dataset.target_name}")
print(f"Dataset task type: {dataset.task_type}")

Dataset name: Titanic Dataset
Dataset description: Here is a short description of this dataset:

This dataset contains information about passengers on the ill-fated RMS Titanic, including their name, age, gender, socio-economic class, and survival status. The data has been split into two sets: a training set (train.csv) with 891 passengers where the outcome (survival or not) is known, and a test set (test.csv) with 418 unseen passengers where the outcome needs to be predicted based on patterns found in the training set.
Dataset goal: Task: Predicting passenger survival on the Titanic based on their demographic information. The target column is "Survived".
Dataset train:
name: train 
path: /Users/aleksejlapin/Work/AutoML-LLM/AutoML-LLM-24-Jul/datasets/titanic/train.csv 
description: None
Dataset test:
name: test 
path: /Users/aleksejlapin/Work/AutoML-LLM/AutoML-LLM-24-Jul/datasets/titanic/test.csv 
description: None


Dataset target: Survived
Dataset task type: classification
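
The target and task-type chains strip quotation marks from the model's reply, and `clear_and_set_split` drops the file extension, so that answers such as `"Survived"` or `train.csv` match column and split names. The sketch below isolates that cleanup with the same regular expression as the cell above; the helper names `strip_quotes` and `split_stem` are assumptions made for this example.

```python
import re

# Same character class as in the chains above: straight, curly, and backtick-style quotes.
QUOTES = re.compile(r'[\'\"“”‘’`´]')


def strip_quotes(text: str) -> str:
    """Remove quotation marks that an LLM may wrap around a short answer."""
    return QUOTES.sub('', text)


def split_stem(file_name: str) -> str:
    """Keep everything before the first dot, e.g. 'train.csv' -> 'train'."""
    return file_name.split(".")[0]


print(strip_quotes('“Survived”'))
print(strip_quotes("'Classification'").lower())
print(split_stem("train.csv"))
```

Surrounding whitespace is left untouched by this cleanup, so adding `.strip()` may be worthwhile if the model pads its answer.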


### Column Descriptions and Categorical Columns Identification

In [11]:
class ColumnDescription(BaseModel):
    name: str = Field(description="The name of the column")
    description: str = Field(description="The short description of the column")

class ColumnType(BaseModel):
    name: str = Field(description="The name of the column")
    column_type: Literal['categorical', 'numerical'] = Field(description="The variable's type: categorical or numerical")

def get_column_unique_ratio(split: Split, column_name: str, ndigits: int = 2):
    # Share of distinct values among non-missing entries; 0.0 for an all-missing column.
    non_null_count = len(split[column_name].dropna())
    if non_null_count == 0:
        return 0.0
    return round(split[column_name].nunique() / non_null_count, ndigits)

parser = PydanticOutputParser(pydantic_object=ColumnType)

retry_parser = RetryWithErrorOutputParser.from_llm(parser=parser, llm=ChatOllama(model='llama3.1', temperature=0))
    

    
describe_column_template = prompts.describe_column_template.partial(dataset_title=dataset.name,
                                                                    dataset_description=dataset.description,
                                                                    format_instructions=PydanticOutputParser(pydantic_object=ColumnDescription).get_format_instructions()
                                                                    )

categorical_columns_def = prompts.categorical_template.partial(dataset_title=dataset.name,
                                                                dataset_description=dataset.description,
                                                                format_instructions=PydanticOutputParser(pydantic_object=ColumnType).get_format_instructions())


describe_column_chain = (
    {
    "column_name":lambda x: x,
    "column_hint": lambda x: dataset.train_split.get_column_hint(x),
    "column_samples": lambda x: dataset.train_split.get_head_by_column(x, count=10)
    } 
    | describe_column_template 
    | model 
    | JsonOutputParser(pydantic_object=ColumnDescription)
)

def_column_category_chain = (
    {
        'column_name': lambda x: x['name'],
        'column_description': lambda x: x['description'],
        'column_ratio': lambda x:  get_column_unique_ratio(dataset.train_split, x['name']),
        "column_samples": lambda x: dataset.train_split.get_unique_values(x['name'], max_number=10).to_markdown(index=False)
    }
    | categorical_columns_def
    | {
        'completion': model | StrOutputParser(),
        'prompt_value': lambda x: x
    } 
    | RunnableLambda(lambda x: retry_parser.parse_with_prompt(x['completion'], x['prompt_value']))
        .with_retry(
            wait_exponential_jitter=True,
            stop_after_attempt=5
        )
)

main_chain = (
    describe_column_chain 
    | def_column_category_chain
)


# The ConsoleCallbackHandler prints chain traces for debugging purposes.
result = main_chain.batch(list(dataset.train_split.data.columns), config={'callbacks': [ConsoleCallbackHandler()]})
pprint(result)


[32;1m[1;3m[chain/start][0m [1m[chain:RunnableSequence] Entering Chain run with input:
[0m{
  "input": "PassengerId"
}
[32;1m[1;3m[chain/start][0m [1m[chain:RunnableSequence] Entering Chain run with input:
[0m{
  "input": "Survived"
}
[32;1m[1;3m[chain/start][0m [1m[chain:RunnableSequence] Entering Chain run with input:
[0m{
  "input": "Pclass"
}
[32;1m[1;3m[chain/start][0m [1m[chain:RunnableSequence] Entering Chain run with input:
[0m{
  "input": "Name"
}
[32;1m[1;3m[chain/start][0m [1m[chain:RunnableSequence] Entering Chain run with input:
[0m{
  "input": "Sex"
}
[32;1m[1;3m[chain/start][0m [1m[chain:RunnableSequence] Entering Chain run with input:
[0m{
  "input": "Age"
}
[32;1m[1;3m[chain/start][0m [1m[chain:RunnableSequence] Entering Chain run with input:
[0m{
  "input": "SibSp"
}
[32;1m[1;3m[chain/start][0m [1m[chain:RunnableSequence] Entering Chain run with input:
[0m{
  "input": "Parch"
}
[32;1m[1;3m[chain/start][0m [1m[chain:RunnableS

In [12]:
print(f"Dataset train:\n{dataset.train_split}")
print(f"Dataset test:\n{dataset.test_split}")

Dataset train:
name: train 
path: /Users/aleksejlapin/Work/AutoML-LLM/AutoML-LLM-24-Jul/datasets/titanic/train.csv 
description: None
Dataset test:
name: test 
path: /Users/aleksejlapin/Work/AutoML-LLM/AutoML-LLM-24-Jul/datasets/titanic/test.csv 
description: None
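
The `column_ratio` value passed to the categorical prompt is the number of distinct values divided by the number of non-missing values: a ratio near 0 suggests a categorical column, while a ratio near 1 suggests an identifier or a continuous quantity. Below is a dependency-free sketch of the same calculation on toy data; `unique_ratio` and the two example columns are hypothetical and are not taken from the Titanic files.

```python
def unique_ratio(values, ndigits=2):
    """Distinct non-missing values divided by the number of non-missing values."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0  # assumption: treat an all-missing column as ratio 0
    return round(len(set(present)) / len(present), ndigits)


# Hypothetical toy columns used only to show the two extremes.
sex_like = ["male", "female", "female", "male", None, "male"]
id_like = [1, 2, 3, 4, 5, 6]

print(unique_ratio(sex_like))  # 2 distinct values over 5 present entries
print(unique_ratio(id_like))   # every entry is distinct
```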


## FEDOT Framework Execution

In this section, we pass the train and test splits, together with the task type and target column identified above, to the FEDOT framework to generate predictions.

In [None]:
prediction = run_example(
    train_df=dataset.train_split.data,
    test_df=dataset.test_split.data,
    problem=dataset.task_type,
    target=dataset.target_name,
    timeout=5
)

## Results

Here we display the first five predictions.

In [None]:
prediction[:5]