# Titanic Survival Prediction using FEDOT and LLM

This notebook analyzes the Titanic dataset and predicts passenger survival with the FEDOT framework, using a Large Language Model (LLM) to describe the dataset and configure the prediction task.


## Setup

In [1]:
import os
import sys
import re

import numpy as np
import pandas as pd

from pprint import pprint

# Add the directory two levels up to sys.path so that the imports below resolve.
module_path = os.path.abspath(os.path.join('..', '..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from fedot_llm.fedot_util import run_example
from fedot_llm.language_models.actions import ModelAction
from fedot_llm.language_models.llms import OllamaLLM
from fedot_llm.data.data import Dataset
from fedot_llm.language_models import prompts

## Data Loading and Exploration

In this section, we load the Titanic dataset and perform initial exploration.

In [2]:
dataset_name = 'titanic'
datasets_folder = os.path.join(module_path, 'datasets')
dataset_path = os.path.join(datasets_folder, dataset_name)
dataset = Dataset.load_from_path(dataset_path)

In [3]:
import json
description_file = 'big_descriptions.json'
with open(os.path.join(datasets_folder, description_file), 'r') as json_file:
    dataset_big_descriptions = json.load(json_file)
big_description = dataset_big_descriptions[dataset_name]
pprint(big_description)

('The sinking of the Titanic is one of the most infamous shipwrecks in '
 'history.\n'
 '\n'
 'On April 15, 1912, during her maiden voyage, the widely considered '
 '“unsinkable” RMS Titanic sank after colliding with an iceberg. '
 'Unfortunately, there weren’t enough lifeboats for everyone onboard, '
 'resulting in the death of 1502 out of 2224 passengers and crew.\n'
 '\n'
 'While there was some element of luck involved in surviving, it seems some '
 'groups of people were more likely to survive than others.\n'
 '\n'
 'In this challenge, we ask you to build a predictive model that answers the '
 'question: “what sorts of people were more likely to survive?” using '
 'passenger data (ie name, age, gender, socio-economic class, etc).\n'
 '\n'
 'In this competition, you’ll gain access to two similar datasets that include '
 'passenger information like name, age, gender, socio-economic class, etc. One '
 'dataset is titled train.csv and the other is titled test.csv.\n'
 '\n'
 'Train.csv 

In [4]:
print(dataset.detailed_description)
print("-"*100)
print(dataset.metadata_description)

Assume we have a datasetThe dataset contains the following splits:

The test_merged split stored in file "test_merged.csv" contains following columns: ['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Survived']. It is described as None
The test split stored in file "test.csv" contains following columns: ['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']. It is described as None
The predictions split stored in file "predictions.csv" contains following columns: ['Unnamed: 0', 'Survived']. It is described as None
The train split stored in file "train.csv" contains following columns: ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']. It is described as None
The gender_submission split stored in file "gender_submission.csv" contains following columns: ['PassengerId', 'Survived']. It is described as None
------------
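
The split listing printed above can be approximated without `fedot_llm`. The following is a minimal sketch, assuming each CSV file in a folder is one split named after its file stem. `describe_splits` is a hypothetical helper written for illustration, not the library's implementation of `metadata_description`, and the two CSV files are made up.

```python
import csv
import tempfile
from pathlib import Path


def describe_splits(folder):
    """Return one description line per CSV file in `folder` (hypothetical helper)."""
    lines = []
    for csv_path in sorted(Path(folder).glob("*.csv")):
        # Read only the header row to get the column names.
        with open(csv_path, newline="") as handle:
            columns = next(csv.reader(handle), [])
        lines.append(
            f'The {csv_path.stem} split stored in file "{csv_path.name}" '
            f"contains following columns: {columns}."
        )
    return "\n".join(lines)


# Demo on a throwaway folder with two tiny, made-up CSV files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "train.csv").write_text("PassengerId,Survived,Pclass\n1,0,3\n")
    Path(tmp, "test.csv").write_text("PassengerId,Pclass\n2,1\n")
    summary = describe_splits(tmp)

print(summary)
```

Reading only the header row keeps the summary cheap even for large files, which matters when the text is later pasted into an LLM prompt.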

## Dataset Analysis using LLM

Here we use an LLM to analyze and describe various aspects of the dataset.

In [5]:
model = OllamaLLM(model='llama3')
action = ModelAction(model=model)

### Define Dataset Name, Description and Goal

In [6]:
task_prompts = {
    "dataset_name": {
        "system": big_description,
        "task": prompts.dataset_name_prompt,
        "context": None,
    },
    "dataset_description": {
        "system": big_description,
        "task": prompts.dataset_description_prompt,
        "context": dataset.detailed_description,
    },
    "dataset_goal": {
        "system": big_description,
        "task": prompts.dataset_goal_prompt,
        "context": dataset.description,
    }
}

responses = action.run_model_multicall(
    task_prompts
)
dataset.name = responses["dataset_name"]
dataset.description = responses["dataset_description"]
dataset.goal = responses["dataset_goal"]

print(f"Dataset name: {dataset.name}")
print(f"Dataset description: {dataset.description}")
print(f"Dataset goal: {dataset.goal}")

Dataset name: Titanic Passenger Survival Dataset
Dataset description: Here is a short description of the dataset:

This dataset contains information about passengers on board the RMS Titanic, including passenger IDs, classes, names, genders, ages, and other relevant details. The data is split into four files: "train.csv" for training, "test.csv" and "test_merged.csv" for testing, and "predictions.csv" for submitting predicted survival outcomes. The goal is to build a predictive model that answers the question "what sorts of people were more likely to survive?" using passenger information from the train dataset and predict the survival outcomes for the test passengers based on patterns learned from the training data.
Dataset goal: Predict whether or not each passenger survived the sinking of the Titanic based on their demographic information and other characteristics. The target column is "Survived".
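
To make the structure of `task_prompts` concrete, here is a minimal sketch of how a multi-call runner could consume such a dictionary. It is an assumption for illustration, not the actual `run_model_multicall` implementation: `run_multicall` and `echo_model` are hypothetical names, and the stub model only echoes its task instead of calling an LLM.

```python
def run_multicall(generate, task_prompts):
    """Call `generate` once per task and collect the responses by task name."""
    responses = {}
    for name, prompt in task_prompts.items():
        responses[name] = generate(
            system=prompt["system"],
            task=prompt["task"],
            context=prompt.get("context"),
        )
    return responses


def echo_model(system, task, context=None):
    """Stub standing in for an LLM: reports which task it received."""
    suffix = " (with context)" if context else ""
    return f"answer to: {task}{suffix}"


# Made-up prompts with the same system/task/context layout as the cell above.
demo_prompts = {
    "dataset_name": {
        "system": "About the Titanic.",
        "task": "Name the dataset.",
        "context": None,
    },
    "dataset_goal": {
        "system": "About the Titanic.",
        "task": "State the goal.",
        "context": "train.csv, test.csv",
    },
}
demo_responses = run_multicall(echo_model, demo_prompts)
print(demo_responses)
```

Keying both the prompts and the responses by task name is what lets the notebook assign `responses["dataset_name"]` and the other fields directly.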


In [7]:
print(dataset.detailed_description)

Assume we have a dataset called Titanic Passenger Survival Dataset
It could be described as following: Here is a short description of the dataset:

This dataset contains information about passengers on board the RMS Titanic, including passenger IDs, classes, names, genders, ages, and other relevant details. The data is split into four files: "train.csv" for training, "test.csv" and "test_merged.csv" for testing, and "predictions.csv" for submitting predicted survival outcomes. The goal is to build a predictive model that answers the question "what sorts of people were more likely to survive?" using passenger information from the train dataset and predict the survival outcomes for the test passengers based on patterns learned from the training data.
The goal is: Predict whether or not each passenger survived the sinking of the Titanic based on their demographic information and other characteristics. The target column is "Survived".
The dataset contains the following splits:

The test_me

### Define Dataset Train and Test Splits

In [8]:
task_prompts = {
    "train_split": {
        "system": dataset.detailed_description,
        "task": prompts.train_split_definition_prompt,
        "context": f"Available splits:\n{dataset.metadata_description}",
    },
    "test_split": {
        "system": dataset.detailed_description,
        "task": prompts.test_split_definition_prompt,
        "context": f"Available splits:\n{dataset.metadata_description}",
    }
}

responses = action.run_model_multicall(
    task_prompts
)
# Keep only the text before the first period, e.g. "train.csv" -> "train".
operations = {
    "train_split": lambda x: x.split(".")[0],
    "test_split": lambda x: x.split(".")[0],
}
responses = action.process_model_responses(responses, operations)

dataset.train_split = responses["train_split"]
dataset.test_split = responses["test_split"]


print(f"Train split:\n{dataset.train_split}\n\n")
print(f"Test split:\n{dataset.test_split}")

Train split:
name: train 
path: /Users/aleksejlapin/Work/AutoML-LLM/AutoML-LLM-24-Jul/datasets/titanic/train.csv 
description: None


Test split:
name: test 
path: /Users/aleksejlapin/Work/AutoML-LLM/AutoML-LLM-24-Jul/datasets/titanic/test.csv 
description: None
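
The `operations` mapping pairs each response key with a clean-up function. A minimal sketch of that post-processing step follows, assuming responses are plain strings keyed by task name. `apply_operations` is a hypothetical stand-in written for illustration; the real `process_model_responses` may behave differently.

```python
def apply_operations(responses, operations):
    """Apply a per-key clean-up function; keys without one pass through unchanged."""
    return {
        key: operations[key](value) if key in operations else value
        for key, value in responses.items()
    }


# Made-up raw responses, as an LLM might return them.
raw = {"train_split": "train.csv", "test_split": "test.csv", "note": "unchanged"}
ops = {
    "train_split": lambda x: x.split(".")[0],
    "test_split": lambda x: x.split(".")[0],
}
cleaned = apply_operations(raw, ops)
print(cleaned)
```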


### Column Descriptions

In [9]:
column_descriptions = action.generate_all_column_description(split=dataset.train_split, dataset=dataset)
dataset.train_split.set_column_descriptions(column_descriptions)
pprint(column_descriptions)

{'Age': 'The age of a passenger, ranging from children (0-5 years) to adults '
        'and seniors, with some missing values (nan).',
 'Cabin': 'The cabin numbers range from single letters indicating cabins '
          'without numbers, to combinations of letters and numbers '
          'representing specific cabins.',
 'Embarked': 'The embarked location of the passengers, with values '
             'representing Southampton (S), Cherbourg (C), Queenstown (Q), and '
             'unknown (nan)',
 'Fare': 'Passenger fares, ranging from approximately $7 to over $500.',
 'Name': "Passenger names in the format of 'Last Name', Mr./Mrs./Miss.",
 'Parch': 'Number of parents or children aboard, with possible values being '
          'the number of immediate family members (0-5)',
 'PassengerId': 'Unique identifier for each passenger.',
 'Pclass': 'Passenger class, with 1st class being the most luxurious and 3rd '
           'class being the least.',
 'Sex': 'Indicates whether a passenger is m

### Target Column and Task Type Identification

In [10]:
task_prompts = {
    "target_column": {
        "system": dataset.description,
        "task": prompts.target_definition_prompt,
        "context": None,
    },
    "task_type": {
        "system": dataset.description,
        "task": prompts.task_definition_prompt,
        "context": None,
    }
}

responses = action.run_model_multicall(
    task_prompts
)

# Strip straight and curly quotes, backticks and acute accents from the LLM answers.
pattern = r'[\'\"“”‘’`´]'
operations = {
    "target_column": lambda x: re.sub(pattern, '', x),
    "task_type": lambda x: re.sub(pattern, '', x.lower()),
}
responses = action.process_model_responses(responses, operations)
dataset.target_name = responses["target_column"]
dataset.task_type = responses["task_type"]

print(f"Target column: {dataset.target_name}")
print(f"Task type: {dataset.task_type}")

Target column: Survived
Task type: classification
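
LLM answers often arrive wrapped in quotes or backticks, which is why the cell strips them before storing the values. Below is a small sketch of the same regular expression applied to made-up responses. The sample strings are illustrative, not actual model output, and the extra `.strip()` call is an addition that the original cell does not make.

```python
import re

# Same character class as in the cell above: straight and curly quotes,
# backtick, and acute accent.
QUOTE_PATTERN = r'[\'\"“”‘’`´]'


def strip_quotes(text):
    """Remove quote-like characters, then trim surrounding whitespace."""
    return re.sub(QUOTE_PATTERN, "", text).strip()


samples = ['"Survived"', "`Survived`", " “Classification” "]
cleaned = [strip_quotes(s) for s in samples]
print(cleaned)
```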


### Categorical Columns Identification

In [11]:
categorical_columns = action.get_categorical_features(split=dataset.train_split, dataset=dataset)
pprint(categorical_columns)

['Survived', 'Pclass', 'Name', 'Sex', 'Parch', 'Ticket', 'Embarked']
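
The list returned by the LLM includes free-text columns such as `Name` and `Ticket`. One way to cross-check such a list is a simple cardinality heuristic: treat a column as categorical only if it has few distinct values. The sketch below is an assumption for illustration and not part of `fedot_llm`; the helper name, the threshold `max_unique=10`, and the sample rows are all made up.

```python
def low_cardinality_columns(rows, max_unique=10):
    """Return columns whose count of distinct non-missing values is at most `max_unique`."""
    if not rows:
        return []
    selected = []
    for column in rows[0]:
        values = {row[column] for row in rows if row[column] not in ("", None)}
        if 0 < len(values) <= max_unique:
            selected.append(column)
    return selected


# Made-up rows in the shape produced by csv.DictReader (all values are strings).
rows = [
    {
        "PassengerId": str(i),
        "Sex": "male" if i % 2 else "female",
        "Pclass": str(i % 3 + 1),
        "Name": f"Passenger {i}",
    }
    for i in range(1, 31)
]
candidates = low_cardinality_columns(rows, max_unique=10)
print(candidates)
```

Columns that appear in the LLM's list but not in the heuristic's output are worth a manual look before they are passed on as categorical features.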


## FEDOT Framework Execution

In this section, we prepare the data and run the FEDOT framework to generate predictions.

In [12]:
prediction = run_example(
    train_df=dataset.train_split.data,
    test_df=dataset.test_split.data,
    problem=dataset.task_type,
    target=dataset.target_name,
    timeout=5,
)

2024-07-29 17:25:24,519 - Topological features operation requires extra dependencies for time series forecasting, which are not installed. It can infuence the performance. Please install it by 'pip install fedot[extra]'


Generations:   0%|          | 0/10000 [00:00<?, ?gen/s]

2024-07-29 17:25:30,884 - Topological features operation requires extra dependencies for time series forecasting, which are not installed. It can infuence the performance. Please install it by 'pip install fedot[extra]'
2024-07-29 17:25:52,481 - Topological features operation requires extra dependencies for time series forecasting, which are not installed. It can infuence the performance. Please install it by 'pip install fedot[extra]'
2024-07-29 17:25:52,481 - Topological features operation requires extra dependencies for time series forecasting, which 

## Results

Here we display the first five predictions.

In [None]:
prediction[:5]
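
Looking at the first few predictions gives only a rough impression. If ground-truth labels are available (the split listing earlier shows a `Survived` column in `test_merged.csv`), the predictions can be scored. The sketch below computes plain accuracy on made-up labels. The `accuracy` helper and the sample lists are assumptions for illustration, and how `prediction` aligns with a labelled split is not shown in this notebook.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on empty input")
    matches = sum(int(p) == int(a) for p, a in zip(predicted, actual))
    return matches / len(actual)


# Made-up labels; in the notebook these would come from `prediction`
# and from a labelled split.
predicted_labels = [0, 1, 1, 0, 1]
true_labels = [0, 1, 0, 0, 1]
score = accuracy(predicted_labels, true_labels)
print(f"accuracy = {score:.2f}")
```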