[Titanic](https://www.kaggle.com/c/titanic/data)

# Project 1: Titanic

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this project, you will exercise your skills with loading data, python data structures, and Pandas to identify charactaristics of Titanic survivors!

### Considerations:

* You will be generating long data strutures- if you can, avoid displaying the whole thing. Display just the first or last few entries and look at the length or shape to check whether your code gives you back what you want and expect.
* Make functions whenever possiblle!
* Be explicit with your naming. You may forget what `this_list` is, but you will have an idea of what `passenger_fare_list` is. Variable naming will help you in the long run!
* Don't forget about tab autocomplete!
* Use markdown cells to document your planning, thoughts, and results. 
* Delete cells you will not include in your final submission
* Try to solve your own problems using this framework:
  1. Check your spelling
  2. Google your errors
  3. Ask your classmates
  4. Ask an instructor or TA

# 1. Using the `with open()` method in the `csv` library, load the titanic dataset into a list of lists.

* The `type()` of your dataset should be `list`
* The `type()` of each element in your dataset should also be `list`
* The `len()` of your dataset should be 892 (892 rows, including the header)
* The `len()` of each row element in your dataset should be have a `len()` of 12
* Print out the first 3 rows including the header to check your data.

In [1]:
import csv
from IPython.display import display
import numpy as np

In [2]:
dataset = []

with open('titanic.csv') as csvfile:
    my_reader = csv.reader(csvfile, delimiter=',')
    for row in my_reader:
        dataset.append(row)

In [3]:
len(dataset)

892

In [4]:
len(dataset[0])

12

In [5]:
dataset[:3]

[['PassengerId',
  'Survived',
  'Pclass',
  'Name',
  'Sex',
  'Age',
  'SibSp',
  'Parch',
  'Ticket',
  'Fare',
  'Cabin',
  'Embarked'],
 ['1',
  '0',
  '3',
  'Braund, Mr. Owen Harris',
  'male',
  '22',
  '1',
  '0',
  'A/5 21171',
  '7.25',
  '',
  'S'],
 ['2',
  '1',
  '1',
  'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
  'female',
  '38',
  '1',
  '0',
  'PC 17599',
  '71.2833',
  'C85',
  'C']]

# 2. Separate the first header row from the rest of your dataset. 

* The header should be a list of the column names
* The data should be the rest of your data
* Display the header and the first row of the dataset zipped together

In [6]:
header = dataset[0]
data = dataset[1:]

In [7]:
list(zip(header, data[0]))

[('PassengerId', '1'),
 ('Survived', '0'),
 ('Pclass', '3'),
 ('Name', 'Braund, Mr. Owen Harris'),
 ('Sex', 'male'),
 ('Age', '22'),
 ('SibSp', '1'),
 ('Parch', '0'),
 ('Ticket', 'A/5 21171'),
 ('Fare', '7.25'),
 ('Cabin', ''),
 ('Embarked', 'S')]

# 3. Using a `for` loop, load your data into a `dict` called `data_dict`.

* The keys of your `data_dict` should be `PassengerId`
* The values of your `data_dict` should be dictionaries...
  * Each of these dictionaries should reperesent a column value within a row
  * The keys should be the names of the columns
  * The values should be the values of that column
  
The beginning of your `data_dict` should look like: 

    {'1': {'Age': '22',
      'Cabin': '',
      'Embarked': 'S',
      'Fare': '7.25',
      'Name': 'Braund, Mr. Owen Harris',
      'Parch': '0',
      'Pclass': '3',
      'Sex': 'male',
      'SibSp': '1',
      'Survived': '0',
      'Ticket': 'A/5 21171'},
     '10': {'Age': '14',
      'Cabin': '',
      'Embarked': 'C',
      'Fare': '30.0708',
      'Name': 'Nasser, Mrs. Nicholas (Adele Achem)',
      'Parch': '0',
      'Pclass': '2',
      'Sex': 'female',
      'SibSp': '1',
      'Survived': '1',
      'Ticket': '237736'},
      ...
      }

In [8]:
data_dict = {}

for row in data:
    zipped = zip(header, row)
    row_dict = {}
    for col, element in zipped[1:]:
        row_dict[col] = element
    data_dict[row[0]] = row_dict

In [9]:
data_dict

{'1': {'Age': '22',
  'Cabin': '',
  'Embarked': 'S',
  'Fare': '7.25',
  'Name': 'Braund, Mr. Owen Harris',
  'Parch': '0',
  'Pclass': '3',
  'Sex': 'male',
  'SibSp': '1',
  'Survived': '0',
  'Ticket': 'A/5 21171'},
 '10': {'Age': '14',
  'Cabin': '',
  'Embarked': 'C',
  'Fare': '30.0708',
  'Name': 'Nasser, Mrs. Nicholas (Adele Achem)',
  'Parch': '0',
  'Pclass': '2',
  'Sex': 'female',
  'SibSp': '1',
  'Survived': '1',
  'Ticket': '237736'},
 '100': {'Age': '34',
  'Cabin': '',
  'Embarked': 'S',
  'Fare': '26',
  'Name': 'Kantor, Mr. Sinai',
  'Parch': '0',
  'Pclass': '2',
  'Sex': 'male',
  'SibSp': '1',
  'Survived': '0',
  'Ticket': '244367'},
 '101': {'Age': '28',
  'Cabin': '',
  'Embarked': 'S',
  'Fare': '7.8958',
  'Name': 'Petranec, Miss. Matilda',
  'Parch': '0',
  'Pclass': '3',
  'Sex': 'female',
  'SibSp': '0',
  'Survived': '0',
  'Ticket': '349245'},
 '102': {'Age': '',
  'Cabin': '',
  'Embarked': 'S',
  'Fare': '7.8958',
  'Name': 'Petroff, Mr. Pastcho ("Pen

# 4. Repeat step 3 using a dictionary comprehension.

* Using `==`, check if your `data_dict` from your `for` loop is the same as the one from your dictionary comprehension.

In [None]:
data_dict_comp = {row[0]:{i:j for i,j in zip(header[1:], row[1:])} for row in data}

In [None]:
data_dict == data_dict_comp

# 5. Transform your `data_dict` to be oriented by column and call it `data_dict_columns`

* Currently, our `data_dict` is oriented by row, indexed by `"PassengerId"`. 
* Transform your data so that the title of each row is a key, the values are of type `list` and represent column vectors.

If you display `data_dict_columns`, the beginning should look like...

    {'Age': ['25',
      '36',
      '24',
      '40',
      '45',
      '2',
      '24',
      '28',
      '33',
      '26',
      '39',
      ...

In [10]:
data_dict_columns = {}
for col in header:
    data_dict_columns[col] = []

for pass_id, row in data_dict.items():
    data_dict_columns['PassengerId'].append(pass_id)
    for col, value in row.items():
        data_dict_columns[col].append(value)

# 6. Data Types

What is the current `type` of each column? What do you think the data type of each column *should* be? The data types in Python are...

* `int`
* `float`
* `str`
* `bool`
* `tuple`
* `list`
* `dict`
* `set`

In a markdown cell, describe what each column represents and what the `type` of each value should be. **Extra:** If you want to be fancy, use a markdown table to display your results.

### Add link to markdown table syntax


|Column|Type|
|---|---|
|Fare|`float`|
|Name|`str`|
|Embarked|`str`|
|Age|`float`|
|Parch|`int`|
|Pclass|`int`|
|Sex|`str`|
|Survived|`int`|
|SibSp|`int`|
|PassengerId|`int`|
|Ticket|`str`|
|Cabin|`str`|

# 7. Transform each column to the appropriate type if needed.

Build a function called `transform_column` that takes arguments for a `data_dict`, `column_name`, and `datatype`, and use it to transofm the columns that need transformation.

**NOTE:** There are values in this dataset that cannot be directly cast to a numerical value. Use `if/then` or `try/except` statements to handle errors. 

**To help identify potential sources of errors, explore the `set` of values in each column.**

In [11]:
data_dict_columns.keys()

['Fare',
 'Name',
 'Embarked',
 'Age',
 'Parch',
 'Pclass',
 'Sex',
 'Survived',
 'SibSp',
 'PassengerId',
 'Ticket',
 'Cabin']

In [12]:
non_str_cols = ['Fare','Age','Parch','Pclass','Survived','SibSp','PassengerId']
type_list = [float, float, int, int, int, int, int]

In [13]:
set(data_dict_columns[non_str_cols[0]])

{'0',
 '10.1708',
 '10.4625',
 '10.5',
 '10.5167',
 '106.425',
 '108.9',
 '11.1333',
 '11.2417',
 '11.5',
 '110.8833',
 '113.275',
 '12',
 '12.275',
 '12.2875',
 '12.35',
 '12.475',
 '12.525',
 '12.65',
 '12.875',
 '120',
 '13',
 '13.4167',
 '13.5',
 '13.7917',
 '13.8583',
 '13.8625',
 '133.65',
 '134.5',
 '135.6333',
 '14',
 '14.1083',
 '14.4',
 '14.4542',
 '14.4583',
 '14.5',
 '146.5208',
 '15',
 '15.0458',
 '15.05',
 '15.1',
 '15.2458',
 '15.5',
 '15.55',
 '15.7417',
 '15.75',
 '15.85',
 '15.9',
 '151.55',
 '153.4625',
 '16',
 '16.1',
 '16.7',
 '164.8667',
 '17.4',
 '17.8',
 '18',
 '18.75',
 '18.7875',
 '19.2583',
 '19.5',
 '19.9667',
 '20.2125',
 '20.25',
 '20.525',
 '20.575',
 '21',
 '21.075',
 '21.6792',
 '211.3375',
 '211.5',
 '22.025',
 '22.3583',
 '22.525',
 '221.7792',
 '227.525',
 '23',
 '23.25',
 '23.45',
 '24',
 '24.15',
 '247.5208',
 '25.4667',
 '25.5875',
 '25.925',
 '25.9292',
 '26',
 '26.25',
 '26.2833',
 '26.2875',
 '26.3875',
 '26.55',
 '262.375',
 '263',
 '27',
 '27

In [14]:
def transform_column(data_dict, column, datatype):
    new_col = []
    for i in data_dict[column]:
        try:
            new_col.append(datatype(i))
        except:
            new_col.append(np.NaN)
    data_dict[column] = new_col
    return data_dict

In [15]:
for col, datatype in zip(non_str_cols, type_list):
    data_dict_columns = transform_column(data_dict_columns, col, datatype)

# 8. Build functions to calculate the mean, standard deviation (sample, not population), median, and mode of a list of ints or floats. 


If you filled any missing values with `np.NaN`, you may need to handle that in your functions (look up `np.isnan()`).


**Hint:** Mode is tricky. Start by building a function that counts the occurances of each value. You may also need to sort using a `key` with a `lambda function` inside. You may also find a `defaultdict` useful.

**Optional:** for Mode, return the mode value *and* the count of that value.

Mean

In [16]:
def this_mean(data_list):
    data_list = [float(i) for i in data_list if not np.isnan(i)]
    return sum(data_list)/len(data_list)

Standard Deviation

In [17]:
def this_std(data_list):
    data_list = [float(i) for i in data_list if not np.isnan(i)]
    df = len(data_list) - 1
    mean = this_mean(data_list)
    dev = sum([(i - mean)**2 for i in data_list])
    return (dev/df)**.5

Median

In [18]:
def this_median(data_list):
    data_list = [float(i) for i in data_list if not np.isnan(i)]
    data_list.sort()
    l = len(data_list)
    if l%2 == 1:
        ind = l/2
        return data_list[ind]
    else:
        ind1 = l//2
        ind2 = ind1+1
        return this_mean([data_list[ind1], data_list[ind2]])

Mode

In [19]:
from collections import defaultdict

In [20]:
def counter(data_list):
    '''
    Returns a dict with the values as keys and the counts of each value as the dict values.
    '''
    count_dict = defaultdict(int)
    for i in data_list:
        count_dict[i] += 1
    return dict(count_dict)

In [21]:
def this_mode(data_list):
    '''
    Returns the mode and the no. of occurances of that value.
    '''
    counts = counter(data_list).items()
    counts.sort(key = lambda x: x[1], reverse=True)
    value, count = counts[0]
    return value, count

# 9. Summary Statistics of Numerical Columns

For numerical columns, what is the mean, standard deviation, mean, and mode for that data? Which measure of central tendency is the most descriptive of each column? Why? Explain your answer in a markdown cell.

In [22]:
def get_sum_stats(data_col):
    for name, method in [('mean',this_mean), 
                         ('standard deviation',this_std), 
                         ('median', this_median),
                         ('mode', this_mode)
                        ]:
        print("{:20} : {}".format(name, method(data_col)))

In [23]:
for col in non_str_cols:
    print("Column: {}".format(col))
    get_sum_stats(data_dict_columns[col])
    print('\n')

Column: Fare
mean                 : 32.2042079686
standard deviation   : 49.6934285972
median               : 14.4542
mode                 : (8.05, 43)


Column: Age
mean                 : 29.6991176471
standard deviation   : 14.5264973323
median               : 28.0
mode                 : (nan, 177)


Column: Parch
mean                 : 0.381593714927
standard deviation   : 0.80605722113
median               : 0.0
mode                 : (0, 678)


Column: Pclass
mean                 : 2.30864197531
standard deviation   : 0.836071240977
median               : 3.0
mode                 : (3, 491)


Column: Survived
mean                 : 0.383838383838
standard deviation   : 0.486592454265
median               : 0.0
mode                 : (0, 549)


Column: SibSp
mean                 : 0.523007856341
standard deviation   : 1.10274343229
median               : 0.0
mode                 : (0, 608)


Column: PassengerId
mean                 : 446.0
standard deviation   : 257.353842015
media

#### Fare

Fare has a mean of 32 with a std of almost 50. Because the median is around 14 and the mode is 8, the data is skewed. Thus, the median is the most appropriate measure to describe this data.

#### Age

The mean and median of Age are close to each other, so the data is likely not heavily skewed. The mean is appropriate in this case.

#### Parch

The mode is descriptive here: 678 individuals out of our entire dataset traveled without parents or children.

#### SibSp

The mode is descriptive here: 608 individuals out of our entire dataset traveled without siblings or spouses.

#### PassengerId

No descriptive stastics are informative

#### Pclass

The mode is descriptive here. More than half the passengers are in class 3

#### Survived

The mean is important here. The mean is the frequentist probability of survival on the Titanic

# 10. Splitting the Data to Predicting Survival

For all the passengers in the dataset, the mean survival rate is around .38 (38% of the passengers survived). From our data, we may be able to profile who survived and who didn't!

Split the data by pclass. Does the class a passenger was in affect survivability? You can do this by:
* Creating a list of `True` and `False` values conditional on a column's value
* Taking the mean of the `Survived` column where those values are `True`

In [24]:
set(data_dict_columns['Pclass'])

{1, 2, 3}

In [25]:
pclass_1_mask = [True if i == 1 else False for i in data_dict_columns['Pclass']]
pclass_2_mask = [True if i == 2 else False for i in data_dict_columns['Pclass']]
pclass_3_mask = [True if i == 3 else False for i in data_dict_columns['Pclass']]

In [26]:
pclass_1_prob = this_mean([i for i,j in zip(data_dict_columns['Survived'],pclass_1_mask) if j])
pclass_2_prob = this_mean([i for i,j in zip(data_dict_columns['Survived'],pclass_2_mask) if j])
pclass_3_prob = this_mean([i for i,j in zip(data_dict_columns['Survived'],pclass_3_mask) if j])

In [27]:
display(pclass_1_prob)
display(pclass_2_prob)
display(pclass_3_prob)

0.6296296296296297

0.47282608695652173

0.24236252545824846

# 11. Independent Work

Use the techniques from step 10 to make different conditional splits in the `Survived` column. Can you find a combination of splits that maximizes the survival rate?

In [28]:
child_mask = [True if i < 6.0 else False for i in data_dict_columns['Age']]

female_mask = [True if i == 'female' else False for i in data_dict_columns['Sex']]

In [29]:
women_children_mask = [i&j for i,j in zip(child_mask, female_mask)]

In [30]:
women_children_prob = this_mean([i for i,j in zip(data_dict_columns['Survived'],women_children_mask) if j])

In [31]:
women_children_prob

0.7619047619047619

# 12. Pandas

### A: Load the titanic csv into a `DataFrame` using `pd.read_csv()`

In [32]:
import pandas as pd

In [33]:
titanic_df = pd.read_csv('titanic.csv')

### B: Display the first 5 rows, the last 4 rows, and a sample of 3 rows.

In [34]:
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [35]:
titanic_df.tail(4)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [36]:
titanic_df.sample(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
415,416,0,3,"Meek, Mrs. Thomas (Annie Louise Rowley)",female,,0,0,343095,8.05,,S
219,220,0,2,"Harris, Mr. Walter",male,30.0,0,0,W/C 14208,10.5,,S
777,778,1,3,"Emanuel, Miss. Virginia Ethel",female,5.0,0,0,364516,12.475,,S


### C: Create a row mask that is `True` when `Pclass == 3`. Use this to mask your `DataFrame`. Find the mean of the `Survived` column. Is it the same as what we calculated in part 10?

In [37]:
pclass3_mask = titanic_df['Pclass'] == 3

In [38]:
titanic_df[pclass3_mask]['Survived'].mean()

0.24236252545824846

### D: Using a `.groupby()`, what is the mean of the survival column grouped by `Pclass` and `Sex`. What are your observations?

In [39]:
titanic_df.groupby(['Pclass','Sex'])[['Survived']].mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,Survived
Pclass,Sex,Unnamed: 2_level_1
1,female,0.968085
1,male,0.368852
2,female,0.921053
2,male,0.157407
3,female,0.5
3,male,0.135447


### E: Survival Rate by Age Range:  `pd.cut()` takes two arguments: A `list`, `Series`, or `array`, and a list of bins. Create a new column in your `DataFrame` using `pd.cut()` that groups your ages into bins of 5 years. Then, use `.groupby()` to display the survival rate and count for each age group

In [40]:
titanic_df['age_group'] = pd.cut(titanic_df['Age'],range(0,81,5))

In [41]:
titanic_df.groupby(['age_group'])[['Survived']].agg(['mean','count'])

Unnamed: 0_level_0,Survived,Survived
Unnamed: 0_level_1,mean,count
age_group,Unnamed: 1_level_2,Unnamed: 2_level_2
"(0, 5]",0.704545,44
"(5, 10]",0.35,20
"(10, 15]",0.578947,19
"(15, 20]",0.34375,96
"(20, 25]",0.344262,122
"(25, 30]",0.388889,108
"(30, 35]",0.465909,88
"(35, 40]",0.41791,67
"(40, 45]",0.361702,47
"(45, 50]",0.410256,39


# 13. Write-up

Use markdown cells to answer the following questions:

1. What is the main difference between a list and a tuple?
2. Can you iterate over a dictionary? If so, how?
3. What is the term for a list or array of `True` and `False` values used to select certain rows or columns in a `DataFrame`?
4. If you have a continuous variable, when would the median be a better descriptor than the mean? Why?
5. Give a qualitative description of the survivors of the titanic based on the effects of your splits in the data on survival rate. Or, given a row, what columns would you look at to guess if they survived?


# 14. Evaluation

Please use markdown cells to submit your responses. 

1. What was easy for you in this project?
2. What was difficult?
3. Where did you make the most improvement?
4. Where would you like to improve?