# Exploring the Raw Titanic Dataset
© Explore Data Science Academy

Building a good regression model starts with understanding the data being modelled. To do this, we will perform some exploratory data analysis on the raw data from the [Titanic Kaggle Challenge](https://www.kaggle.com/c/titanic). The goal of this challenge is to predict each passenger's probability of survival from their boarding details.

### Honour Code

I **Bruce, Kgarimetsa**, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the EDSA honour code (https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).  

Non-compliance with the honour code constitutes a material breach of contract.

## Imports

In [142]:
import pandas as pd
import numpy as np

## Data

In [143]:
df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/regression_sprint/titanic_train_raw.csv')
df_clean = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/regression_sprint/titanic_train_clean_raw.csv')

In [144]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [145]:
df_clean.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,B96 B98,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,B96 B98,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,B96 B98,S


In [146]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [147]:
df.Age.dtype

dtype('float64')

## Questions

### Question 1

After briefly looking through the data, you may notice that some entries are missing.

Write a function that determines the number of missing entries for a specified column in the dataset. The function should return a `string` detailing the number of missing values.

_**Function Specifications:**_
* Should take a pandas `DataFrame` and a `column_name` as input and return a `string` as output.
* The `string` should detail the number of missing entries in the column.
* Should be generalised to be able to work on _**ANY**_ dataframe.

In [153]:
### START FUNCTION
def total_missing(df, column_name):
    # Fail early if the requested column is not in the dataframe
    if column_name not in df.columns:
        raise ValueError(f"'{column_name}' does not exist")
    # isnull() flags the missing cells; sum() counts them
    count = df[column_name].isnull().sum()
    return f"{column_name} has {count} missing values"
### END FUNCTION 

In [154]:
total_missing(df,'Age')

'Age has 177 missing values'

In [155]:
total_missing(df,'Survived')

'Survived has 0 missing values'

In [156]:
total_missing(df,'Age') == "Age has 177 missing values"

True

In [157]:
total_missing(df,'PassengerId')

'PassengerId has 0 missing values'

In [158]:
total_missing(df,'Survived') == "Survived has 0 missing values"

True

_**Expected Outputs:**_
```python
total_missing(df,'Age') == "Age has 177 missing values"
total_missing(df,'Survived') == "Survived has 0 missing values"
```
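
The same `isnull().sum()` idea extends to a whole frame, which gives a quick overview of every column at once. The sketch below is illustrative only: `toy` is a small made-up DataFrame (not the Titanic data) and `missing_summary` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic frame
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Survived": [0, 1, 1, 0],
})

def missing_summary(frame):
    """Return the number of missing entries per column as a dict."""
    # isnull() marks NaN/None cells; sum() counts the True values column by column
    return {col: int(n) for col, n in frame.isnull().sum().items()}

summary = missing_summary(toy)
print(summary)
```

Running something like `missing_summary(df)` on the real data should agree with the non-null counts reported by `df.info()`.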

### Question 2

It would be a good idea to replace some of our missing data. Missing values can be replaced with either the _mean_, the _median_ or the _mode_ (in the case of categorical columns). Write a function that takes as input a dataframe and a column name, and returns the `mean` for numerical columns and the `mode` for non-numerical columns.
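
As a rough illustration of how such a replacement value could be used, the sketch below fills gaps with `fillna`. It assumes a small made-up DataFrame (`toy` is hypothetical, not the Titanic data) and simply takes the first mode for the categorical column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Titanic frame
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Embarked": ["S", None, "S"],
})

# Numerical column: replace missing values with the column mean
age_filled = toy["Age"].fillna(toy["Age"].mean())

# Categorical column: replace missing values with the first mode
embarked_filled = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(age_filled.tolist())
print(embarked_filled.tolist())
```

Whether the mean, median or mode is the right choice depends on the column; the median, for instance, is less sensitive to outliers than the mean.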

_**Function Specifications:**_
* The function should take two inputs: `(df, column_name)`, where `df` is a pandas `DataFrame`, `column_name` is a `str`.
* If the `column_name` does not exist in `df`, raise a `ValueError`.
* Should return as output the `mean` if the specified column is numerical and return a list of the `mode(s)` otherwise.
* The mean should be rounded to 2 decimal places.
* **If there is more than one `mode` for a given non-numerical column, the function should return a list of all modes**.


In [159]:
### START FUNCTION
def calc_mean_mode(df, column_name):
    # Fail early if the requested column is not in the dataframe
    if column_name not in df.columns:
        raise ValueError(f"'{column_name}' does not exist")

    # Check the dtype of the requested column, not of the column index
    if pd.api.types.is_numeric_dtype(df[column_name]):
        # Numerical column: mean rounded to 2 decimal places (NaNs are skipped)
        return round(float(df[column_name].mean()), 2)

    # Non-numerical column: mode() returns every value tied for most frequent, sorted
    return df[column_name].mode().tolist()

### END FUNCTION

In [160]:
calc_mean_mode(df,'Age')

29.7

In [161]:
calc_mean_mode(df, 'Age') == 29.7

True

In [115]:
calc_mean_mode(df, 'Embarked')

['S']

_**Expected Outputs:**_
```python
calc_mean_mode(df, 'Age') == 29.7
calc_mean_mode(df, 'Embarked') == ['S']
```

In [112]:
calc_mean_mode(df, 'Embarked') == ['S']

True

### Question 3

We ultimately want to predict the survival chances of the passengers in the testing set. We can start by building a simple model from the data we already have, using _conditional probability_! Write a function that returns the survival probability of a passenger, given a condition on a **numerical variable** from the dataset. The condition will consist of a `column_name`, a `value` and a `boolean_operator`. Possible boolean operators include `"<"`, `">"`, or `"=="`. For example, `column_name = "Age"`, `boolean_operator = ">"`, and `value = 40` together form the condition `Age > 40`.
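
Written out for the example condition `Age > 40`, and assuming the probability is estimated by simply counting rows of the dataset, the quantity being asked for is:

$$
P(\text{Survived} = 1 \mid \text{Age} > 40) = \frac{\#\{\text{Age} > 40 \text{ and Survived} = 1\}}{\#\{\text{Age} > 40\}}
$$

Here $\#\{\cdot\}$ denotes the number of passengers satisfying the stated condition.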

_**Function specifications:**_
* The function should make use of the `df_clean` `DataFrame` loaded earlier in this notebook.
* It should take a numerical `column_name` string, a `boolean_operator` string, and a `value` of type string as input. 
* It should return a survival likelihood as a number between 0 and 1, rounded to 2 decimal places. 
* Assume that `column_name` exists in `df_clean`.

_**Hint:** You can use `eval()` to evaluate string boolean expressions._
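
`eval()` executes whatever string it is given, so it is only reasonable here because the inputs are controlled. As a hedged alternative, the three operators can be mapped to functions from Python's standard `operator` module. The sketch below uses a made-up DataFrame (`toy`) and a hypothetical helper name `conditional_survival`; it is one possible approach under those assumptions, not the solution this notebook requires.

```python
import operator
import pandas as pd

# Map each supported operator string to its comparison function
OPERATORS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}

def conditional_survival(frame, column_name, boolean_operator, value):
    """Estimate P(Survived == 1 | column <op> value) without eval()."""
    compare = OPERATORS[boolean_operator]
    # value arrives as a string, so convert it before comparing
    mask = compare(frame[column_name], float(value))
    # The mean of the 0/1 Survived flag among matching rows is the survival rate
    return round(float(frame.loc[mask, "Survived"].mean()), 2)

# Hypothetical toy data, not the Titanic frame
toy = pd.DataFrame({
    "Age": [10, 12, 30, 50],
    "Survived": [1, 0, 1, 0],
})
result = conditional_survival(toy, "Age", "<", "15")
print(result)
```

An unsupported operator string raises a `KeyError` in this sketch, which makes the set of allowed operators explicit.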

In [163]:
### START FUNCTION
def survival_likelihood(df_clean, column_name, boolean_operator, value):
    # Build the boolean mask for the condition, e.g. df_clean["Age"] < 15
    condition = eval(f'df_clean["{column_name}"] {boolean_operator} {value}')
    # Passengers who meet the condition and survived
    survivors = int((condition & (df_clean['Survived'] == 1)).sum())
    # All passengers who meet the condition
    total = int(condition.sum())
    return round(survivors / total, 2)

### END FUNCTION

In [164]:
survival_likelihood(df_clean,"Age","<","15")

0.58

_**Expected Outputs:**_
```python
survival_likelihood(df,"Pclass","==","3") == 0.24
survival_likelihood(df,"Age","<","15") == 0.58
```

In [165]:
survival_likelihood(df,"Pclass","==","3")

0.24

In [166]:
survival_likelihood(df,"Age","<","15") == 0.58

True

In [167]:
survival_likelihood(df,"Pclass","==","3") == 0.24

True