![](https://upload.wikimedia.org/wikipedia/en/b/bb/Titanic_breaks_in_half.jpg)

# Project 1: [Titanic](https://www.kaggle.com/c/titanic/data)
---

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this project, you will exercise your skills with loading data, Python data structures, and Pandas to identify characteristics of Titanic survivors!

---
#### Your goals should be to:
* Practice Python programming, including loops, conditionals, types, functions, and data structures
* Start thinking critically about manipulating, organizing, and interpreting data
* Troubleshoot errors

---
#### Getting Started:
* **fork** the repository on git.generalassemb.ly
* **clone** your forked repo

---
#### Submission:
* You should be working on a **fork** of the GA project one repository. 
* Use **git** to manage versions of your project. Make sure to `add`, `commit`, and `push` your changes to **your fork** of the GitHub repository.
* Submit a link to your project repository in the submission form by **Friday, 9/29 11:59 PM**. You will then receive the solutions.
* Create a copy of your original notebook (file > make a copy in jupyter notebook)
* In the copy, use the solutions to correct your work. Make sure to take note of your successes and struggles. Did you learn anything new from correcting your work?
* Submit the corrected version by **Sunday, 10/1 11:59 PM** to receive instructor feedback on your work. ***Projects submitted after this deadline will not receive instructor feedback.***

### Considerations:

* You will be generating long data structures, so avoid displaying the whole thing. Display just the first or last few entries and look at the length or shape to check whether your code gives you back what you want and expect.
* Make functions whenever possible!
* Be explicit with your naming. You may forget what `this_list` is, but you will have an idea of what `passenger_fare_list` is. Variable naming will help you in the long run!
* Don't forget about tab autocomplete!
* Use markdown cells to document your planning, thoughts, and results. 
* Delete cells you will not include in your final submission
* Try to solve your own problems using this framework:
  1. Check your spelling
  2. Google your errors. Is it on stackoverflow?
  3. Ask your classmates
  4. Ask a TA or instructor
* Do not include errors or stack traces (fix them!)
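
As a rough sketch of the first consideration above, a small helper can report the length of a long list together with only its first and last few entries. The helper name `peek` and the sample list are made up for illustration and are not part of the project.

```python
def peek(values, n=3):
    """Return the length plus the first and last n entries of a list."""
    return {"length": len(values), "head": values[:n], "tail": values[-n:]}

# Hypothetical long list standing in for a column of the dataset
passenger_fare_list = [float(i) for i in range(1000)]

summary = peek(passenger_fare_list)
print(summary)
```

Checking `summary["length"]` alongside the head and tail is usually enough to confirm that a transformation produced what you expected.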

# 1. Using `with open()` and the `csv` library, load the titanic dataset into a list of lists.

* The `type()` of your dataset should be `list`
* The `type()` of each element in your dataset should also be `list`
* The `len()` of your dataset should be 892 (892 rows, including the header)
* The `len()` of each row in your dataset should be 12
* Print out the first 3 rows including the header to check your data.

In [45]:
import csv
from IPython.display import display
import numpy as np

In [46]:
dataset = []

with open('titanic.csv') as csvfile:
    my_reader = csv.reader(csvfile, delimiter=',')
    for row in my_reader:
        dataset.append(row)

In [None]:
type(dataset)

In [None]:
type(dataset[0])

In [47]:
len(dataset)

892

In [48]:
len(dataset[0])

12

In [49]:
dataset[:3]

[['PassengerId',
  'Survived',
  'Pclass',
  'Name',
  'Sex',
  'Age',
  'SibSp',
  'Parch',
  'Ticket',
  'Fare',
  'Cabin',
  'Embarked'],
 ['1',
  '0',
  '3',
  'Braund, Mr. Owen Harris',
  'male',
  '22',
  '1',
  '0',
  'A/5 21171',
  '7.25',
  '',
  'S'],
 ['2',
  '1',
  '1',
  'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
  'female',
  '38',
  '1',
  '0',
  'PC 17599',
  '71.2833',
  'C85',
  'C']]
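
The length checks above only look at the first row. A minimal sketch of checking every row at once is shown below, assuming an in-memory two-line stand-in for `titanic.csv` (built with `io.StringIO`) so the example runs without the file. The helper name `describe_shape` is made up for illustration.

```python
import csv
import io

# Two-line stand-in for titanic.csv: the header plus the first passenger shown above
sample_csv = io.StringIO(
    'PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n'
    '1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S\n'
)

sample_rows = [row for row in csv.reader(sample_csv)]

def describe_shape(rows):
    """Return (number of rows, set of distinct row lengths)."""
    return len(rows), {len(row) for row in rows}

n_rows, row_lengths = describe_shape(sample_rows)
print(n_rows, row_lengths)
```

If the set of row lengths holds more than one value, at least one row was parsed with a different number of fields and deserves a closer look.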

### Rubric:

#### Great (2)
---
* Student completed the goal of loading the CSV into a list of lists
* Student confirmed the number of rows and number of columns
* Student checked the type of their outer and inner `list`s
* Student displayed the first 3 rows of data, including the header
* Sample response: Great job! Although it's rarely used in practice, the `with open` method is useful to know, especially when reading in non-csv text files. Grow the habit of checking the number of rows and columns, or the shape, of your data. It's always good to know what to expect in terms of the shape of your data!
* Add any additional notes.
  
#### Satisfactory (1)
---
* Basic goals from above accomplished
* Code is disorganized, repetitive, or variables named poorly
* Sample response: Good job, you've completed the basic requirements of the question. However, your code is disorganized. Make sure to clearly name your variables, check types and lengths, and keep your code clear, concise, and coherent. Please take a look at the solution code to see a cleaner implementation of this problem.
* Add any additional notes.
  
#### Wrong/Incomplete (0)
---
* The core goals of this exercise were not accomplished. 
* Sample response: This exercise is incomplete. It's important to complete each exercise because it is designed to help you practice important tasks to be successful in this course. Here, your goal was to load data into a usable data structure. What did you have trouble with? Is there anything I can help clear up for you regarding this problem?

# 2. Separate the first header row from the rest of your dataset. 

* The header should be a list of the column names
* The data should be the rest of your data
* Display the header and the first row of the dataset zipped together using `zip`
* Your result should look like...


```
[('PassengerId', '1'),
 ('Survived', '0'),
 ('Pclass', '3'),
 ...
 ('Embarked', 'S')]
 ```

In [50]:
header = dataset[0]
data = dataset[1:]

In [51]:
list(zip(header, data[0]))

[('PassengerId', '1'),
 ('Survived', '0'),
 ('Pclass', '3'),
 ('Name', 'Braund, Mr. Owen Harris'),
 ('Sex', 'male'),
 ('Age', '22'),
 ('SibSp', '1'),
 ('Parch', '0'),
 ('Ticket', 'A/5 21171'),
 ('Fare', '7.25'),
 ('Cabin', ''),
 ('Embarked', 'S')]

### Rubric:

#### Great (2)
---
* Student completed the goals of the exercise.
* Sample response: Great job! It's common to separate your header from your body so you can work with your data and column names independently.
* Add any additional notes.
  
#### Satisfactory (1)
---
* Basic goals from above accomplished
* Code is disorganized, repetitive, or variables named poorly
* Sample response: Good job, you've completed the basic requirements of the question.
* Add any additional notes. What could've made this response great?
  
#### Wrong/Incomplete (0)
---
* The core goals of this exercise were not accomplished. 
* Sample response: This exercise is incomplete. It's important to complete each exercise because it is designed to help you practice important tasks to be successful in this course. What did you have trouble with? Is there anything I can help clear up for you regarding this problem?
* Add any additional notes. What could've made this response good or great?

# 3. Using a `for` loop, load your data into a `dict` called `data_dict`.

* The keys of your `data_dict` should be `PassengerId`
* The values of your `data_dict` should be dictionaries...
  * Each of these dictionaries should represent the column values within one row
  * The keys should be the names of the columns
  * The values should be the values of that column
  
The beginning of your `data_dict` should look like: 

    {'1': {'Age': '22',
      'Cabin': '',
      'Embarked': 'S',
      'Fare': '7.25',
      'Name': 'Braund, Mr. Owen Harris',
      'Parch': '0',
      'Pclass': '3',
      'Sex': 'male',
      'SibSp': '1',
      'Survived': '0',
      'Ticket': 'A/5 21171'},
     '10': {'Age': '14',
      'Cabin': '',
      'Embarked': 'C',
      'Fare': '30.0708',
      'Name': 'Nasser, Mrs. Nicholas (Adele Achem)',
      'Parch': '0',
      'Pclass': '2',
      'Sex': 'female',
      'SibSp': '1',
      'Survived': '1',
      'Ticket': '237736'},
      ...
      }

In [52]:
data_dict = {}

for row in data:
    zipped = list(zip(header, row))
    row_dict = {}
    for col, element in zipped[1:]:
        row_dict[col] = element
    data_dict[row[0]] = row_dict

In [53]:
data_dict[list(data_dict.keys())[0]]

{'Age': '25',
 'Cabin': '',
 'Embarked': 'S',
 'Fare': '13',
 'Name': 'Sedgwick, Mr. Charles Frederick Waddington',
 'Parch': '0',
 'Pclass': '2',
 'Sex': 'male',
 'SibSp': '0',
 'Survived': '0',
 'Ticket': '244361'}

In [54]:
data_dict[list(data_dict.keys())[-1]]

{'Age': '22',
 'Cabin': '',
 'Embarked': 'S',
 'Fare': '7.5208',
 'Name': 'Karlsson, Mr. Nils August',
 'Parch': '0',
 'Pclass': '3',
 'Sex': 'male',
 'SibSp': '0',
 'Survived': '0',
 'Ticket': '350060'}

### Rubric:

#### Great (2)
---
* Student completed the goals of the exercise.
* Sample response: Great job! A dictionary is a great structure in which to store data. When you have nested data structures, like lists of lists, a nested for loop can be a good way of iterating through the data.
* Add any additional notes.
  
#### Satisfactory (1)
---
* Basic goals from above accomplished
* Code is disorganized, repetitive, or variables named poorly
* Sample response: Good job, you've completed the basic requirements of the question. You may want to gain some additional practice with nested for loops, dictionaries, and lists.
* Add any additional notes. What could've made this response great?
  
#### Wrong/Incomplete (0)
---
* The core goals of this exercise were not accomplished. 
* Sample response: This exercise is incomplete. It's important to complete each exercise because it is designed to help you practice important tasks to be successful in this course. What did you have trouble with? Is there anything I can help clear up for you regarding this problem?
* Add any additional notes. What could've made this response good or great?

# 4. Repeat step 3 using a dictionary comprehension.

* Using `==`, check if your `data_dict` from your `for` loop is the same as the one from your dictionary comprehension.

In [55]:
data_dict_comp = {row[0]:{i:j for i,j in zip(header[1:], row[1:])} for row in data}

In [56]:
data_dict == data_dict_comp

True
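
A nested comprehension like the one above can be easier to read when built inside-out from the equivalent `for` loop. The sketch below does this on a made-up three-column toy dataset. `toy_header` and `toy_data` are illustrative stand-ins for `header` and `data`.

```python
# Toy header and rows (made up) in the same layout as `header` and `data`
toy_header = ['PassengerId', 'Survived', 'Sex']
toy_data = [['1', '0', 'male'], ['2', '1', 'female']]

# Step 1: the nested for loop
loop_dict = {}
for row in toy_data:
    row_dict = {}
    for col, value in zip(toy_header[1:], row[1:]):
        row_dict[col] = value
    loop_dict[row[0]] = row_dict

# Step 2: the inner loop becomes the inner comprehension,
# and the outer loop becomes the outer comprehension around it
comp_dict = {
    row[0]: {col: value for col, value in zip(toy_header[1:], row[1:])}
    for row in toy_data
}

print(loop_dict == comp_dict)
```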

### Rubric:

#### Great (2)
---
* Student completed the goals of the exercise.
* Sample response: Great job! List and dictionary comprehensions can come in really useful and help keep your code clean and readable. Use them whenever possible, especially when you don't have a heavily nested data structure!
* Add any additional notes.
  
#### Satisfactory (1)
---
* Basic goals from above accomplished
* Code is disorganized, repetitive, or variables named poorly
* Sample response: Good job, you've completed the basic requirements of the question. You may want to gain some additional practice with comprehensions. Practice writing for loops first, and fitting them into dict and list comprehensions.
* Add any additional notes. What could've made this response great?
  
#### Wrong/Incomplete (0)
---
* The core goals of this exercise were not accomplished. 
* Sample response: This exercise is incomplete. It's important to complete each exercise because it is designed to help you practice important tasks to be successful in this course. What did you have trouble with? Is there anything I can help clear up for you regarding this problem?
* Add any additional notes. What could've made this response good or great?

# 5. Transform your `data_dict` to be oriented by column and call it `data_dict_columns`

* Currently, our `data_dict` is oriented by row, indexed by `"PassengerId"`. 
* Transform your data so that the name of each column is a key, and the values are of type `list` and represent column vectors.

If you display `data_dict_columns`, the beginning should look like...

    {'Age': ['25',
      '36',
      '24',
      '40',
      '45',
      '2',
      '24',
      '28',
      '33',
      '26',
      '39',
      ...

In [57]:
data_dict_columns = {}
for col in header:
    data_dict_columns[col] = []

for pass_id, row in data_dict.items():
    data_dict_columns['PassengerId'].append(pass_id)
    for col, value in row.items():
        data_dict_columns[col].append(value)
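
As an alternative sketch (not the approach used in the solution above), the same column orientation can be reached directly from the list of lists with `zip(*rows)`, which regroups values by position. The toy header and rows below are made up for illustration.

```python
# Toy header and rows (made up) in the same layout as `header` and `data`
toy_header = ['PassengerId', 'Survived', 'Sex']
toy_data = [['1', '0', 'male'], ['2', '1', 'female'], ['3', '1', 'female']]

# zip(*toy_data) yields one tuple per column; pair each with its column name
toy_columns = {col: list(values) for col, values in zip(toy_header, zip(*toy_data))}
print(toy_columns)
```

This version starts from the rows rather than from `data_dict`, so the order of values in each column follows the order of the rows in the file.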

### Rubric:

#### Great (2)
---
* Student completed the goals of the exercise.
* Sample response: Great job! While having a `dict` oriented by rows can help you find individual data, orienting the data by columns can help you easily find summary statistics or perform operations on an entire column.
* Add any additional notes.
  
#### Satisfactory (1)
---
* Basic goals from above accomplished
* Code is disorganized, repetitive, or variables named poorly
* Sample response: Good job, you've completed the basic requirements of the question. While having a `dict` oriented by rows can help you find individual data, orienting the data by columns can help you easily find summary statistics or perform operations on an entire column. You may want to gain some additional practice indexing and iterating over dictionaries. This is great Python practice, and is useful when thinking about data transformations.
* Add any additional notes. What could've made this response great?
  
#### Wrong/Incomplete (0)
---
* The core goals of this exercise were not accomplished. 
* Sample response: This exercise is incomplete. It's important to complete each exercise because it is designed to help you practice important tasks to be successful in this course. What did you have trouble with? Is there anything I can help clear up for you regarding this problem?
* Add any additional notes. What could've made this response good or great?

# 6. Data Types

What is the current `type` of each column? What do you think the data type of each column *should* be? The data types in Python are...

* `int`
* `float`
* `str`
* `bool`
* `tuple`
* `list`
* `dict`
* `set`

In a markdown cell, describe what each column represents and what the `type` of each value should be. **Extra:** If you want to be fancy, use a [markdown table](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet#tables) to display your results.


|Column|Type|
|---|---|
|Fare|`float`|
|Name|`str`|
|Embarked|`str`|
|Age|`float`|
|Parch|`int`|
|Pclass|`int`|
|Sex|`str`|
|Survived|`int`|
|SibSp|`int`|
|PassengerId|`int`|
|Ticket|`str`|
|Cabin|`str`|
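
To answer the first question (the *current* type of each column), one option is to collect the distinct type names found in each column. Because `csv.reader` returns every field as a string, every column should report `str` at this stage. The toy dict below is a made-up stand-in for `data_dict_columns`.

```python
# Toy column-oriented dict (made up) standing in for `data_dict_columns`
toy_columns = {
    'Age': ['22', '38', ''],
    'Fare': ['7.25', '71.2833', '7.925'],
    'Sex': ['male', 'female', 'female'],
}

# Collect the distinct type names that appear in each column
current_types = {
    col: sorted({type(value).__name__ for value in values})
    for col, values in toy_columns.items()
}
print(current_types)
```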

### Rubric:

#### Great (2)
---
* Student completed the goals of the exercise.
* Sample response: Great job! It's always important to consider what `type` a particular row or column should be.
* Add any additional notes.
  
#### Satisfactory (1)
---
* Basic goals from above accomplished
* Code is disorganized, repetitive, or variables named poorly
* Some of the types are wrong
* Sample response: Good job, you've completed the basic requirements of the question. Make sure to think critically about what type a column should be (int, str, float, bool). This will help define how you treat the data in subsequent steps.
* Add any additional notes. What could've made this response great?
  
#### Wrong/Incomplete (0)
---
* The core goals of this exercise were not accomplished. 
* Sample response: This exercise is incomplete. It's important to complete each exercise because it is designed to help you practice important tasks to be successful in this course. What did you have trouble with? Is there anything I can help clear up for you regarding this problem?
* Add any additional notes. What could've made this response good or great?

# 7. Transform each column to the appropriate type if needed.

Build a function called `transform_column` that takes arguments for a `data_dict`, `column_name`, and `datatype`, and use it to transform the columns that need transformation.

**NOTE:** There are values in this dataset that cannot be directly cast to a numerical value. Use `if/else` or `try/except` statements to handle errors. 

**To help identify potential sources of errors, explore the `set` of values in each column.**
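
One way to act on this hint is to try the cast on each distinct value and keep the ones that fail. The sketch below is a minimal version of that idea. The helper name `uncastable_values` and the sample list are made up for illustration.

```python
def uncastable_values(values, datatype):
    """Return the set of distinct values that `datatype` cannot convert."""
    bad = set()
    for value in set(values):
        try:
            datatype(value)
        except (ValueError, TypeError):
            bad.add(value)
    return bad

# Made-up sample in the style of the Age column, including a blank entry
toy_age = ['22', '38', '', '0.42', '22']

bad_as_float = uncastable_values(toy_age, float)
bad_as_int = uncastable_values(toy_age, int)
print(bad_as_float, bad_as_int)
```

Here the blank string fails for both types, while `'0.42'` fails only for `int`, which suggests that such a column is better cast to `float`.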

In [11]:
data_dict_columns.keys()

['Fare',
 'Name',
 'Embarked',
 'Age',
 'Parch',
 'Pclass',
 'Sex',
 'Survived',
 'SibSp',
 'PassengerId',
 'Ticket',
 'Cabin']

In [12]:
non_str_cols = ['Fare','Age','Parch','Pclass','Survived','SibSp','PassengerId']
type_list = [float, float, int, int, int, int, int]

In [14]:
def transform_column(data_dict, column, datatype):
    new_col = []
    for i in data_dict[column]:
        try:
            new_col.append(datatype(i))
        except (ValueError, TypeError):
            # Values that cannot be cast (e.g. empty strings) become NaN
            new_col.append(np.nan)
    data_dict[column] = new_col
    return data_dict

In [15]:
for col, datatype in zip(non_str_cols, type_list):
    data_dict_columns = transform_column(data_dict_columns, col, datatype)

### Rubric:

#### Great (2)
---
* Student completed the goals of the exercise.
* Sample response: Great job! Casting columns to the correct type is very important and will affect subsequent work on the data.
* Add any additional notes.
  
#### Satisfactory (1)
---
* Basic goals from above accomplished
* Code is disorganized, repetitive, or variables named poorly
* Some of the types are wrong
* Sample response: Good job, you've completed the basic requirements of the question. Make sure to keep in mind how to deal with things like missing values, strangely formatted strings that encode numeric values, etc.
* Add any additional notes. What could've made this response great?
  
#### Wrong/Incomplete (0)
---
* The core goals of this exercise were not accomplished. 
* Sample response: This exercise is incomplete. It's important to complete each exercise because it is designed to help you practice important tasks to be successful in this course. What did you have trouble with? Is there anything I can help clear up for you regarding this problem?
* Add any additional notes. What could've made this response good or great?

# 8. Build functions to calculate the mean, sample standard deviation, and median of a list of ints or floats. Use `scipy.stats.mode` or build your own mode function!


**Hint:** If you filled any missing values with `np.nan`, you may need to handle that in your functions (look up `np.isnan()`).

**Extra Credit:** Build a function to calculate the `mode`.

**Optional:**  Build a function for calculating the mode that returns the mode value *and* the count of that value. Mode is tricky, so start by building a function that counts the occurrences of each value. You may also need to sort using a `key` with a `lambda` function inside. You may also find a `defaultdict` useful.
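
On the hint about missing values: NaN never compares equal to anything, including itself, so an `==` test cannot find it. The sketch below uses the standard library's `math.isnan`, which behaves like `np.isnan` on a single float. The sample list is made up.

```python
import math

nan = float('nan')
toy_ages = [22.0, nan, 38.0, nan, 26.0]

# NaN is not equal to itself, so this prints False
print(nan == nan)

# An explicit isnan test is the reliable way to filter missing values
clean_ages = [age for age in toy_ages if not math.isnan(age)]
print(clean_ages)
```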

Mean

In [16]:
def this_mean(data_list):
    data_list = [float(i) for i in data_list if not np.isnan(i)]
    return sum(data_list)/len(data_list)

Standard Deviation

In [17]:
def this_std(data_list):
    data_list = [float(i) for i in data_list if not np.isnan(i)]
    df = len(data_list) - 1
    mean = this_mean(data_list)
    dev = sum([(i - mean)**2 for i in data_list])
    return (dev/df)**.5

Median

In [18]:
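def this_median(data_list):
    data_list = [float(i) for i in data_list if not np.isnan(i)]
    data_list.sort()
    l = len(data_list)
    if l%2 == 1:
        # Odd length: the median is the single middle value
        ind = l//2
        return data_list[ind]
    else:
        # Even length: average the two middle values (indices l//2 - 1 and l//2)
        ind1 = l//2 - 1
        ind2 = ind1+1
        return this_mean([data_list[ind1], data_list[ind2]])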
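
Hand-written statistics are easy to get subtly wrong (off-by-one indices in the median are a common culprit), so it may be worth cross-checking against the standard library's `statistics` module. The sketch below recomputes each measure from its definition on a made-up list of six values and compares the results. It is independent of the functions above.

```python
import statistics

# Made-up sample with an even number of values
toy_values = [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583]

n = len(toy_values)
mean = sum(toy_values) / n

# Sample standard deviation divides by n - 1 degrees of freedom
sample_std = (sum((x - mean) ** 2 for x in toy_values) / (n - 1)) ** 0.5

# Even-length list: the median is the mean of the two middle sorted values
ordered = sorted(toy_values)
median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(abs(mean - statistics.mean(toy_values)) < 1e-9)
print(abs(sample_std - statistics.stdev(toy_values)) < 1e-9)
print(abs(median - statistics.median(toy_values)) < 1e-9)
```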

Mode

In [19]:
from collections import defaultdict

In [20]:
def counter(data_list):
    '''
    Returns a dict with the values as keys and the counts of each value as the dict values.
    '''
    count_dict = defaultdict(int)
    for i in data_list:
        count_dict[i] += 1
    return dict(count_dict)

In [21]:
def this_mode(data_list):
    '''
    Returns the mode and the no. of occurrences of that value.
    '''
    # sorted() accepts the dict's items in both Python 2 and Python 3
    counts = sorted(counter(data_list).items(), key = lambda x: x[1], reverse=True)
    value, count = counts[0]
    return value, count

#### OR

In [1]:
from scipy.stats import mode
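
A third route, not part of the original solution, is `collections.Counter` from the standard library, whose `most_common` method returns `(value, count)` pairs sorted by count. The sketch below uses a made-up list with a single clear mode.

```python
from collections import Counter

# Made-up fares with one value that clearly occurs most often
toy_fares = [8.05, 7.25, 8.05, 13.0, 8.05, 7.25]

# most_common(1) returns a list holding the (value, count) pair with the highest count
mode_value, mode_count = Counter(toy_fares).most_common(1)[0]
print(mode_value, mode_count)
```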

### Rubric:

#### Great (2)
---
* Student completed the goals of the exercise.
* Sample response: Great job! It seems like you're comfortable writing functions and understand these measures of central tendency.
* Add any additional notes.
  
#### Satisfactory (1)
---
* Basic goals from above accomplished
* Code is disorganized, repetitive, or variables named poorly
* Some of the types are wrong
* Sample response: Good job, you've completed the basic requirements of the question. Keep practicing writing functions and use them whenever possible! Make sure to name your functions appropriately and refactor them when needed.
* Add any additional notes. What could've made this response great?
  
#### Wrong/Incomplete (0)
---
* The core goals of this exercise were not accomplished. 
* Sample response: This exercise is incomplete. It's important to complete each exercise because it is designed to help you practice important tasks to be successful in this course. What did you have trouble with? Is there anything I can help clear up for you regarding this problem?
* Add any additional notes. What could've made this response good or great?

#### Extra (1 additional point)
---
* Student wrote their own function to calculate Mode

# 9. Summary Statistics of Numerical Columns

For numerical columns, what are the mean, standard deviation, median, and mode of the data? Which measure of central tendency is the most descriptive of each column? Why? Explain your answer in a markdown cell.

In [22]:
def get_sum_stats(data_col):
    for name, method in [('mean',this_mean), 
                         ('standard deviation',this_std), 
                         ('median', this_median),
                         ('mode', this_mode)
                        ]:
        print("{:20} : {}".format(name, method(data_col)))

In [23]:
for col in non_str_cols:
    print("Column: {}".format(col))
    get_sum_stats(data_dict_columns[col])
    print('\n')

Column: Fare
mean                 : 32.2042079686
standard deviation   : 49.6934285972
median               : 14.4542
mode                 : (8.05, 43)


Column: Age
mean                 : 29.6991176471
standard deviation   : 14.5264973323
median               : 28.0
mode                 : (nan, 177)


Column: Parch
mean                 : 0.381593714927
standard deviation   : 0.80605722113
median               : 0.0
mode                 : (0, 678)


Column: Pclass
mean                 : 2.30864197531
standard deviation   : 0.836071240977
median               : 3.0
mode                 : (3, 491)


Column: Survived
mean                 : 0.383838383838
standard deviation   : 0.486592454265
median               : 0.0
mode                 : (0, 549)


Column: SibSp
mean                 : 0.523007856341
standard deviation   : 1.10274343229
median               : 0.0
mode                 : (0, 608)


Column: PassengerId
mean                 : 446.0
standard deviation   : 257.353842015
media

#### Fare

Fare has a mean of 32 with a std of almost 50. Because the median (around 14) and the mode (about 8) are both well below the mean, the data is right-skewed. Thus, the median is the most appropriate measure to describe this data.

#### Age

The mean and median of Age are close to each other, so the data is likely not heavily skewed. The mean is appropriate in this case.

#### Parch

The mode is descriptive here: 678 individuals out of our entire dataset traveled without parents or children.

#### SibSp

The mode is descriptive here: 608 individuals out of our entire dataset traveled without siblings or spouses.

#### PassengerId

No descriptive statistics are informative.

#### Pclass

The mode is descriptive here. More than half the passengers are in class 3.

#### Survived

The mean is important here. The mean is the frequentist probability of survival on the Titanic
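
The reason the mean works as a survival rate is that `Survived` only holds 0s and 1s, so summing it counts the survivors. A tiny sketch with a made-up list:

```python
# Made-up 0/1 outcomes: three survivors out of eight passengers
toy_survived = [0, 1, 1, 0, 0, 1, 0, 0]

mean_survived = sum(toy_survived) / len(toy_survived)
share_of_ones = toy_survived.count(1) / len(toy_survived)
print(mean_survived, share_of_ones)
```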

### Rubric:

#### Great (2)
---
* Student completed the goals of the exercise.
* Sample response: Great job! You clearly understand when mean, median, and mode should be used, and were able to apply the functions that you created!
* Add any additional notes.
  
#### Satisfactory (1)
---
* Basic goals from above accomplished
* Code is disorganized, repetitive, or variables named poorly
* Sample response: Good job, you've completed the basic requirements of the question. You clearly understand when mean, median, and mode should be used, and were able to apply the functions that you created! Please keep the readability of your code in mind- in the future, you will usually be the only person to read over your code in detail, so make sure it's understandable when you revisit it!
* Add any additional notes. What could've made this response great?
  
#### Wrong/Incomplete (0)
---
* The core goals of this exercise were not accomplished. 
* Sample response: This exercise is incomplete. It's important to complete each exercise because it is designed to help you practice important tasks to be successful in this course. What did you have trouble with? Is there anything I can help clear up for you regarding this problem?
* Add any additional notes. What could've made this response good or great?

# 10. Splitting the Data to Predict Survival

For all the passengers in the dataset, the mean survival rate is around .38 (38% of the passengers survived). From our data, we may be able to profile who survived and who didn't!

Split the data by pclass. Does the class a passenger was in affect survivability? You can do this by:
* Creating a list of `True` and `False` values conditional on a column's value
* Taking the mean of the `Survived` column where those values are `True`

In [24]:
set(data_dict_columns['Pclass'])

{1, 2, 3}

In [25]:
pclass_1_mask = [i == 1 for i in data_dict_columns['Pclass']]
pclass_2_mask = [i == 2 for i in data_dict_columns['Pclass']]
pclass_3_mask = [i == 3 for i in data_dict_columns['Pclass']]

In [26]:
pclass_1_prob = this_mean([i for i,j in zip(data_dict_columns['Survived'],pclass_1_mask) if j])
pclass_2_prob = this_mean([i for i,j in zip(data_dict_columns['Survived'],pclass_2_mask) if j])
pclass_3_prob = this_mean([i for i,j in zip(data_dict_columns['Survived'],pclass_3_mask) if j])

In [27]:
display(pclass_1_prob)
display(pclass_2_prob)
display(pclass_3_prob)

0.6296296296296297

0.47282608695652173

0.24236252545824846
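
Since the three masks and three means above repeat the same pattern, one possible refactor is a single function that takes the condition as an argument. This is a sketch under assumptions: the function name `survival_rate_where` and the toy columns are made up, and the numbers below come from the toy data, not from the Titanic dataset.

```python
def survival_rate_where(survived, column, condition):
    """Mean of `survived` over the rows where condition(column value) is True."""
    mask = [condition(value) for value in column]
    selected = [s for s, keep in zip(survived, mask) if keep]
    return sum(selected) / len(selected)

# Made-up toy columns standing in for the Pclass and Survived columns
toy_pclass = [1, 3, 3, 2, 1, 3]
toy_survived = [1, 0, 1, 0, 1, 0]

# One rate per distinct class; p=pclass pins the current class inside each lambda
rates = {
    pclass: survival_rate_where(toy_survived, toy_pclass, lambda value, p=pclass: value == p)
    for pclass in sorted(set(toy_pclass))
}
print(rates)
```

Note that the function raises a `ZeroDivisionError` when no row satisfies the condition, so an empty split needs to be handled separately.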

### Rubric:

#### Great (2)
---
* Student completed the goals of the exercise.
* Sample response: Great job! You've done your first proto-model! You've explored how `Pclass` affects survival. Comparing survival rates across groups of passengers like this helps you understand who survived on the Titanic.
* Add any additional notes.
  
#### Satisfactory (1)
---
* Basic goals from above accomplished
* Code is disorganized, repetitive, or variables named poorly
* Sample response: Good job, you've completed the basic requirements of the question. You've done your first proto-model! You've explored how `Pclass` affects survival. Comparing survival rates across groups of passengers like this helps you understand who survived on the Titanic. However, you could've kept your code a bit more organized and displayed it more clearly.
* Add any additional notes. What could've made this response great?
  
#### Wrong/Incomplete (0)
---
* The core goals of this exercise were not accomplished. 
* Sample response: This exercise is incomplete. It's important to complete each exercise because it is designed to help you practice important tasks to be successful in this course. What did you have trouble with? Is there anything I can help clear up for you regarding this problem?
* Add any additional notes. What could've made this response good or great?

# 11. Independent Work

Use the techniques from step 10 to make different conditional splits in the `Survived` column. Can you find a combination of splits that maximizes the survival rate?

In [28]:
child_mask = [i < 6.0 for i in data_dict_columns['Age']]

female_mask = [i == 'female' for i in data_dict_columns['Sex']]

In [29]:
women_children_mask = [i&j for i,j in zip(child_mask, female_mask)]

In [30]:
women_children_prob = this_mean([i for i,j in zip(data_dict_columns['Survived'],women_children_mask) if j])

In [31]:
women_children_prob

0.7619047619047619
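
The solution above combines its two masks with a logical AND, so it selects passengers who are both female and under 6. Selecting passengers who are female or under 6 would need a logical OR instead. The sketch below contrasts the two on made-up toy columns. The helper name `masked_mean` is hypothetical and the rates are not Titanic results.

```python
# Made-up toy columns for illustration only
toy_age = [4.0, 30.0, 2.0, 40.0, 5.0]
toy_sex = ['female', 'female', 'male', 'male', 'female']
toy_survived = [1, 1, 0, 0, 1]

child_mask = [age < 6.0 for age in toy_age]
female_mask = [sex == 'female' for sex in toy_sex]

# AND keeps rows where both conditions hold; OR keeps rows where either holds
female_and_child = [c and f for c, f in zip(child_mask, female_mask)]
female_or_child = [c or f for c, f in zip(child_mask, female_mask)]

def masked_mean(values, mask):
    """Mean of the values whose mask entry is True."""
    selected = [v for v, keep in zip(values, mask) if keep]
    return sum(selected) / len(selected)

and_rate = masked_mean(toy_survived, female_and_child)
or_rate = masked_mean(toy_survived, female_or_child)
print(and_rate, or_rate)
```

Narrower splits tend to produce higher rates on fewer passengers, so it helps to report the size of each group next to its survival rate.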

### Rubric:

#### Great (2)
---
* Student completed the goals of the exercise. Student used multiple features to increase probability of survival.
* Sample response: Great job! You've beaten the original score we predicted from `Pclass`. What does your work tell you about survivors?
* Add any additional notes.
  
#### Satisfactory (1)
---
* Basic goals from above accomplished
* Code is disorganized, repetitive, or variables named poorly
* Sample response: Good job, you've completed the basic requirements of the question. If you didn't beat the original probability of survival from pclass, you may want to try some conditionals to pull in more than one feature. 
* Add any additional notes. What could've made this response great?
  
#### Wrong/Incomplete (0)
---
* The core goals of this exercise were not accomplished. 
* Sample response: This exercise is incomplete. It's important to complete each exercise because it is designed to help you practice important tasks to be successful in this course. What did you have trouble with? Is there anything I can help clear up for you regarding this problem?
* Add any additional notes. What could've made this response good or great?

# 12. Pandas

### A: Load the titanic csv into a `DataFrame` using `pd.read_csv()`

In [32]:
import pandas as pd

In [33]:
titanic_df = pd.read_csv('titanic.csv')

### Rubric:

#### Satisfactory (1)
---
* Student loaded the data in Pandas using `read_csv()`
  
#### Wrong/Incomplete (0)
---
* Student did not load the data in Pandas using `read_csv()`
* Sample response: This exercise is incomplete. It's important to complete each exercise because it is designed to help you practice important tasks to be successful in this course. What did you have trouble with? Is there anything I can help clear up for you regarding this problem?

### B: Display the first 5 rows, the last 4 rows, and a sample of 3 rows.

In [34]:
titanic_df.head()

| |PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|1|0|3|Braund, Mr. Owen Harris|male|22.0|1|0|A/5 21171|7.25|NaN|S|
|1|2|1|1|Cumings, Mrs. John Bradley (Florence Briggs Th...|female|38.0|1|0|PC 17599|71.2833|C85|C|
|2|3|1|3|Heikkinen, Miss. Laina|female|26.0|0|0|STON/O2. 3101282|7.925|NaN|S|
|3|4|1|1|Futrelle, Mrs. Jacques Heath (Lily May Peel)|female|35.0|1|0|113803|53.1|C123|S|
|4|5|0|3|Allen, Mr. William Henry|male|35.0|0|0|373450|8.05|NaN|S|


In [35]:
titanic_df.tail(4)

| |PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|887|888|1|1|Graham, Miss. Margaret Edith|female|19.0|0|0|112053|30.0|B42|S|
|888|889|0|3|Johnston, Miss. Catherine Helen "Carrie"|female|NaN|1|2|W./C. 6607|23.45|NaN|S|
|889|890|1|1|Behr, Mr. Karl Howell|male|26.0|0|0|111369|30.0|C148|C|
|890|891|0|3|Dooley, Mr. Patrick|male|32.0|0|0|370376|7.75|NaN|Q|


In [36]:
titanic_df.sample(3)

| |PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|415|416|0|3|Meek, Mrs. Thomas (Annie Louise Rowley)|female|NaN|0|0|343095|8.05|NaN|S|
|219|220|0|2|Harris, Mr. Walter|male|30.0|0|0|W/C 14208|10.5|NaN|S|
|777|778|1|3|Emanuel, Miss. Virginia Ethel|female|5.0|0|0|364516|12.475|NaN|S|


### Rubric:

  
#### Satisfactory (1)
---
* Basic goals from above accomplished
* Sample response: Good job! It's always a good idea to check the head, tail, or sample of your data. This allows you to explore the data without needing to visualize the whole dataset.
* Add any additional notes. What could've made this response great?
  
#### Wrong/Incomplete (0)
---
* The core goals of this exercise were not accomplished. 
* Sample response: This exercise is incomplete. It's important to complete each exercise because it is designed to help you practice important tasks to be successful in this course. What did you have trouble with? Is there anything I can help clear up for you regarding this problem?
* Add any additional notes. What could've made this response good or great?

### C: Create a row mask that is `True` when `Pclass == 3`. Use this to mask your `DataFrame`. Find the mean of the `Survived` column. Is it the same as what we calculated in part 10?

In [37]:
pclass3_mask = titanic_df['Pclass'] == 3

In [38]:
titanic_df[pclass3_mask]['Survived'].mean()

0.24236252545824846
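
As a rough illustration of what the mask is doing, the sketch below builds a tiny made-up frame (`toy_df` is hypothetical, not the Titanic data). It shows that the mask is a Boolean `Series` with one value per row, and that `.loc` can select the masked rows and the `Survived` column in a single step, giving the same result as the chained form used above.

```python
import pandas as pd

# Hypothetical six-row frame, only for illustration.
toy_df = pd.DataFrame({
    "Pclass":   [3, 1, 3, 2, 3, 1],
    "Survived": [0, 1, 1, 0, 0, 1],
})

# Comparing a column to a value yields a Boolean Series (the row mask).
pclass3_mask = toy_df["Pclass"] == 3

# True counts as 1, so summing the mask counts the matching rows.
n_third_class = pclass3_mask.sum()

# Select rows with the mask and the column by name in one .loc call.
third_class_rate = toy_df.loc[pclass3_mask, "Survived"].mean()

# The chained form from the solution gives the same number.
chained_rate = toy_df[pclass3_mask]["Survived"].mean()

print(n_third_class, third_class_rate)
```

Because `Survived` is coded as 0 or 1, its mean over the masked rows is the survival rate for that group.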

### Rubric:

  
#### Satisfactory (1)
---
* Basic goals from above accomplished
* Sample response: Good job! Masking is a great way to explore your data in Pandas.
* Add any additional notes. What could've made this response great?
  
#### Wrong/Incomplete (0)
---
* The core goals of this exercise were not accomplished. 
* Sample response: This exercise is incomplete. It's important to complete each exercise because it is designed to help you practice important tasks to be successful in this course. What did you have trouble with? Is there anything I can help clear up for you regarding this problem?
* Add any additional notes. What could've made this response good or great?

### D: Using a `.groupby()`, what is the mean of the `Survived` column grouped by `Pclass` and `Sex`? What are your observations?

In [39]:
titanic_df.groupby(['Pclass','Sex'])[['Survived']].mean()

Pclass,Sex,Survived
1,female,0.968085
1,male,0.368852
2,female,0.921053
2,male,0.157407
3,female,0.5
3,male,0.135447
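
One possible way to make a two-key grouping easier to compare is to pivot one of the keys into columns with `.unstack()`. The sketch below assumes a small made-up frame (`toy_df` is hypothetical, not the Titanic data) and is only meant to show the shape of the result.

```python
import pandas as pd

# Hypothetical eight-row frame, only for illustration.
toy_df = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "male",
                 "female", "female", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Grouping by two columns gives a Series indexed by (Pclass, Sex).
rates = toy_df.groupby(["Pclass", "Sex"])["Survived"].mean()

# Moving Sex into the columns gives one row per class, one column per sex.
rate_table = rates.unstack("Sex")

print(rates)
print(rate_table)
```

In the wide layout, reading across a row compares the sexes within a class, and reading down a column compares classes within a sex.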


### Rubric:

  
#### Satisfactory (1)
---
* Basic goals from above accomplished
* Sample response: Good job! Grouping is a great way to explore your data in Pandas.
* Add any additional notes. What could've made this response great?
  
#### Wrong/Incomplete (0)
---
* The core goals of this exercise were not accomplished. 
* Sample response: This exercise is incomplete. It's important to complete each exercise because it is designed to help you practice important tasks to be successful in this course. What did you have trouble with? Is there anything I can help clear up for you regarding this problem?
* Add any additional notes. What could've made this response good or great?

### E: Survival Rate by Age Range: `pd.cut()` takes two required arguments: the values to bin (a `list`, `Series`, or `array`) and a list of bin edges. Create a new column in your `DataFrame` using `pd.cut()` that groups your ages into bins of 5 years. Then, use `.groupby()` to display the survival rate and count for each age group.

In [40]:
titanic_df['age_group'] = pd.cut(titanic_df['Age'],range(0,81,5))

In [41]:
titanic_df.groupby(['age_group'])[['Survived']].agg(['mean','count'])

age_group,Survived mean,Survived count
"(0, 5]",0.704545,44
"(5, 10]",0.35,20
"(10, 15]",0.578947,19
"(15, 20]",0.34375,96
"(20, 25]",0.344262,122
"(25, 30]",0.388889,108
"(30, 35]",0.465909,88
"(35, 40]",0.41791,67
"(40, 45]",0.361702,47
"(45, 50]",0.410256,39


### Rubric:

  
#### Satisfactory (1)
---
* Basic goals from above accomplished
* Sample response: Good job! Although rarely used, `pd.cut()` can be used to create additional features. Once we start plotting histograms, we can think of the above as a strategy to create bins.
* Add any additional notes. What could've made this response great?
  
#### Wrong/Incomplete (0)
---
* The core goals of this exercise were not accomplished. 
* Sample response: This exercise is incomplete. It's important to complete each exercise because it is designed to help you practice important tasks to be successful in this course. What did you have trouble with? Is there anything I can help clear up for you regarding this problem?
* Add any additional notes. What could've made this response good or great?
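
When reviewing Part E responses, it may help to have a small worked case of how `pd.cut()` assigns values to bins. The sketch below uses made-up ages and outcomes (`toy` is hypothetical, not the Titanic data). It assumes the default behavior of `pd.cut()`, where each bin includes its right edge but not its left, and where a missing age falls into no bin.

```python
import pandas as pd

# Hypothetical ages and outcomes, including one missing age.
toy = pd.DataFrame({
    "Age":      [0.5, 5, 6, 22, 25, 80, None],
    "Survived": [1,   1, 0, 0,  1,  0,  1],
})

bins = list(range(0, 81, 5))           # bin edges 0, 5, 10, ..., 80
toy["age_group"] = pd.cut(toy["Age"], bins)

# Age 5 lands in (0, 5] because bins are closed on the right by default;
# age 6 lands in (5, 10]; the missing age gets no bin (NaN).
print(toy[["Age", "age_group"]])

# observed=True keeps only the bins that actually contain rows.
summary = (
    toy.groupby("age_group", observed=True)["Survived"]
       .agg(["mean", "count"])
)
print(summary)
```

The row with a missing age is left out of every group, so the counts in a table like this can add up to fewer rows than the `DataFrame` contains.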

# 13. Write-up

Use markdown cells to answer the following questions:

1. What is the main difference between a list and a tuple?
2. Can you iterate over a dictionary? If so, how?
3. What is the term for a list or array of `True` and `False` values used to select certain rows or columns in a `DataFrame`?
4. If you have a continuous variable, when would the median be a better descriptor than the mean? Why?
5. Give a qualitative description of the survivors of the titanic based on the effects of your splits in the data on survival rate. Or, given a row, what columns would you look at to guess if they survived?


#### Answer:
1. Tuples are **immutable** and denoted by `()`, while lists are mutable and denoted by `[]`.
2. Yes. Looping directly over a dictionary (`for key in my_dict`) iterates over its keys. You can also loop over `.keys()` for the keys, `.values()` for the values, or `.items()` for `(key, value)` pairs; in Python 3 each of these returns an iterable view of the dictionary.
3. A mask. Any of Boolean mask, bitmask, row mask, or column mask is acceptable.
4. If the data is skewed or contains outliers, the median describes the data better than the mean. Extreme values pull the mean toward the tail of the distribution, so it no longer represents a typical data point, while the median is unaffected by how extreme those values are.
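5. Survival depended mostly on sex and class, with age as a secondary factor. Women in first and second class survived at rates above 90%, while men in second and third class survived at about 14% to 16%. Children aged 5 and under survived about 70% of the time. Given a single row, look at `Sex`, `Pclass`, and `Age`.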
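
Answers 1, 2, and 4 can be checked with a few lines of plain Python. The sketch below is only an illustration: the variable names and the fare values are made up for the example and are not taken from the dataset.

```python
from statistics import mean, median

# 1. Lists can be changed in place; tuples cannot.
ports_list = ["S", "C"]
ports_list.append("Q")              # works: lists are mutable

ports_tuple = ("S", "C")
try:
    ports_tuple[0] = "Q"            # item assignment is not allowed on a tuple
    tuple_changed = True
except TypeError:
    tuple_changed = False

# 2. Looping over a dict yields its keys; .items() yields (key, value) pairs.
class_names = {1: "first", 2: "second", 3: "third"}
keys_seen = [key for key in class_names]
pairs_seen = [(key, value) for key, value in class_names.items()]

# 4. One very large value drags the mean far from the typical value,
#    while the median stays with the bulk of the data.
fares = [7.25, 7.75, 8.05, 8.05, 512.33]    # illustrative values only
fare_mean = mean(fares)
fare_median = median(fares)

print(ports_list, tuple_changed)
print(keys_seen, pairs_seen)
print(fare_mean, fare_median)
```

Here four of the five fares are close to 8, yet the mean is above 100; the median of 8.05 is the more faithful summary of a typical fare.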

### Rubric:

* 1 point for each correctly answered question. If a response is almost there, feel free to award the point based on your judgment.
* 5 points possible

# 14. Evaluation

Please use markdown cells to submit your responses. 

1. What was easy for you in this project?
2. What was difficult?
3. Where did you make the most improvement?
4. Where would you like to improve?