# Working With Missing Data
Learn to handle missing data using pandas and a data set on Titanic survival.

In [66]:
import pandas as pd
import numpy as np
titanic_survival = pd.read_csv("/home/aida/Desktop/Dataquest/data-analysis-with-pandas/data/titanic_survival.csv")

### Finding the Missing Data
    Description
- Count how many values in the "age" column have null values

In [67]:
age = titanic_survival["age"]
age_is_null = pd.isnull(age)
age_null_true = age[age_is_null]
age_null_count = len(age_null_true)
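
As a minimal sketch of the same counting idea on data small enough to check by hand, the block below builds a hypothetical `toy_age` Series (not part of the Titanic file) and counts its nulls two ways. The `sum()` shortcut relies on `True` counting as 1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the "age" column.
toy_age = pd.Series([22.0, np.nan, 35.0, np.nan, 4.0])

# Boolean mask: True where the value is missing.
toy_is_null = pd.isnull(toy_age)

# Same approach as in the cell above: filter with the mask, then count.
count_by_filter = len(toy_age[toy_is_null])

# Shorter equivalent: each True counts as 1 when summed.
count_by_sum = int(toy_is_null.sum())

print(count_by_filter, count_by_sum)  # 2 2
```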

### What's the big deal with missing data?
    Description
- Use age_is_null to create a vector that only contains values from the "age" column that aren't NaN.
- Calculate the mean of the new vector.

In [68]:
age_is_null = pd.isnull(titanic_survival["age"])
good_age = titanic_survival["age"][~age_is_null]
correct_mean_age = sum(good_age)/len(good_age)
print(correct_mean_age)

29.8811345124
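
To see why the null values have to be filtered out first, here is a small sketch on a hypothetical three-value Series (`toy_age` is an assumption for illustration, not the real column). Python's built-in `sum()` propagates NaN, so a single missing value ruins the naive mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value.
toy_age = pd.Series([20.0, np.nan, 40.0])

# Built-in sum() propagates NaN, so the naive mean is NaN.
naive_mean = sum(toy_age) / len(toy_age)

# Filtering out the nulls first gives a usable answer.
good = toy_age[~pd.isnull(toy_age)]
filtered_mean = sum(good) / len(good)

print(naive_mean)     # nan
print(filtered_mean)  # 30.0
```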


### Easier Ways to Do Math
    Description
- Assign the mean of the "fare" column to correct_mean_fare.

In [69]:
correct_mean_fare = titanic_survival["fare"].mean()
correct_mean_fare

33.29547928134572
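
The reason `Series.mean()` works here without any filtering is that it skips missing values by default. A short sketch on a hypothetical `toy_fare` Series shows the default next to `skipna=False`, which turns the skipping off.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the "fare" column.
toy_fare = pd.Series([10.0, np.nan, 30.0])

# Default behaviour: NaN values are ignored.
mean_skipping_nan = toy_fare.mean()

# With skipna=False the NaN propagates into the result.
mean_with_nan = toy_fare.mean(skipna=False)

print(mean_skipping_nan)  # 20.0
print(mean_with_nan)      # nan
```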

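### Calculating Summary Statistics

- Use a for loop to iterate over passenger_classes. Within the for loop, select the rows where "pclass" equals the current class, take their "fare" values, and store the mean fare in fares_by_class under that class.
- Once the loop completes, the dictionary fares_by_class should have 1, 2, and 3 as keys, with the average fares as the corresponding values.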


In [70]:
passenger_classes = [1, 2, 3]
fares_by_class = {}

for i in passenger_classes:
    row = titanic_survival["pclass"] == i
    rows_pclass = titanic_survival[row]
    fare_pclass = rows_pclass["fare"]
    fare_pclass_mean = fare_pclass.mean()
    fares_by_class[i] = fare_pclass_mean
print(fares_by_class)

{1: 87.50899164086687, 2: 21.1791963898917, 3: 13.302888700564957}
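
The loop above can also be written as a single `groupby` call. The sketch below uses a hypothetical five-row frame (`toy`) rather than the real data, and checks that both approaches agree.

```python
import pandas as pd

# Hypothetical toy data with the same two columns used above.
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3],
    "fare": [80.0, 100.0, 20.0, 10.0, 14.0],
})

# Loop version, mirroring the cell above.
loop_result = {}
for passenger_class in [1, 2, 3]:
    class_fares = toy.loc[toy["pclass"] == passenger_class, "fare"]
    loop_result[passenger_class] = class_fares.mean()

# groupby version: one mean per distinct "pclass" value.
groupby_result = toy.groupby("pclass")["fare"].mean().to_dict()

print(loop_result == groupby_result)  # True
```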


### Making Pivot Tables
    Description
- Use the DataFrame.pivot_table() method to calculate the mean age for each passenger class ("pclass").

In [71]:
passenger_age = titanic_survival.pivot_table(index = "pclass", values = "age", aggfunc = np.mean)
print(passenger_age)

              age
pclass           
1.0     39.159918
2.0     29.506705
3.0     24.816367
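
A small sketch of how a pivot table treats missing values when averaging, using a hypothetical four-row frame. The aggregation is passed as the string `"mean"` here, which pandas accepts in place of `np.mean`; that substitution is a choice made for this example, not something the cell above requires.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one first-class passenger has no recorded age.
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 2],
    "age": [30.0, np.nan, 20.0, 40.0],
})

# Mean age per class; the NaN is skipped rather than counted as zero.
toy_age_by_class = toy.pivot_table(index="pclass", values="age", aggfunc="mean")

print(toy_age_by_class)
```

Class 1 averages to 30.0 from its single known age, and class 2 averages to 30.0 from 20.0 and 40.0.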


### More Complex Pivot Tables

- Make a pivot table that calculates the total fares collected ("fare") and total number of survivors ("survived") for each embarkation port ("embarked").
- Assign the result to port_stats.
- Display port_stats using the print() function.


In [72]:
port_stats = titanic_survival.pivot_table(index = "embarked", values = ["survived", "fare"], aggfunc = np.sum)
print(port_stats)

                fare  survived
embarked                      
C         16830.7922     150.0
Q          1526.3085      44.0
S         25033.3862     304.0
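
One detail worth checking on a small example: rows whose index key is missing do not appear in the pivot table at all. The sketch below uses a hypothetical frame in which one passenger has no embarkation port, and passes the aggregation as the string `"sum"` (an assumption for this example; the cell above uses `np.sum`).

```python
import pandas as pd

# Hypothetical toy data: the fourth passenger has no embarkation port.
toy = pd.DataFrame({
    "embarked": ["C", "S", "S", None, "C"],
    "fare": [50.0, 10.0, 20.0, 99.0, 30.0],
    "survived": [1.0, 0.0, 1.0, 1.0, 1.0],
})

# Totals per port for two value columns at once.
toy_port_stats = toy.pivot_table(
    index="embarked", values=["fare", "survived"], aggfunc="sum"
)

print(toy_port_stats)
```

The 99.0 fare is not included in either port's total, because its row has no "embarked" value to group under.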


### Drop Missing Values
    Description
- Drop all columns in titanic_survival that have missing values.
- Drop all rows in titanic_survival where the columns "age" or "sex" have missing values.

In [73]:
drop_na_columns = titanic_survival.dropna(axis = 1)
new_titanic_survival = titanic_survival.dropna(axis = 0, subset = ["age", "sex"] )
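
The two `dropna()` calls behave quite differently, which is easier to see on a hypothetical three-row frame: `axis=1` removes whole columns, while `axis=0` with `subset` removes only rows missing a value in the listed columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in "age" and "cabin".
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "age": [30.0, np.nan, 25.0],
    "cabin": [None, "C85", None],
})

# Drop every column that contains at least one missing value.
no_missing_columns = toy.dropna(axis=1)

# Drop only the rows where "age" is missing; "cabin" is ignored.
rows_with_age = toy.dropna(axis=0, subset=["age"])

print(list(no_missing_columns.columns))  # ['name']
print(list(rows_with_age.index))         # [0, 2]
```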

### Using iloc to Access Rows by Position
    Description
- Assign the first ten rows from new_titanic_survival to first_ten_rows.
- Assign the fifth row from new_titanic_survival to row_position_fifth.
- Assign the row with index label 25 from new_titanic_survival to row_index_25.


In [74]:
first_ten_rows = new_titanic_survival.iloc[0:10]
row_position_fifth = new_titanic_survival.iloc[4]
row_index_25 = new_titanic_survival.loc[25]
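
After rows are dropped, positions and index labels no longer line up, which is why `iloc` and `loc` can return different rows. A minimal sketch on a hypothetical four-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; dropping the NaN row leaves labels 0, 2, 3.
toy = pd.DataFrame({"age": [30.0, np.nan, 25.0, 40.0]})
kept = toy.dropna()

# iloc counts positions: position 1 is now the row labelled 2.
age_at_position_1 = kept.iloc[1]["age"]

# loc uses labels: label 2 is that same row.
age_at_label_2 = kept.loc[2, "age"]

# Label 1 was dropped, so kept.loc[1] would raise a KeyError.
label_1_exists = 1 in kept.index

print(age_at_position_1, age_at_label_2, label_1_exists)  # 25.0 25.0 False
```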


### Using Column Indexes
    Description

- Assign the value at row index label 1100, column index label "age" from new_titanic_survival to row_index_1100_age.
- Assign the value at row index label 25, column index label "survived" from new_titanic_survival to row_index_25_survived.
- Assign the first five rows and first three columns from new_titanic_survival to five_rows_three_columns.


In [75]:
row_index_1100_age = new_titanic_survival.loc[1100, "age"]
row_index_25_survived = new_titanic_survival.loc[25, "survived"]
five_rows_three_columns = new_titanic_survival.iloc[0:5, 0:3]
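
A short sketch of row-and-column selection on a hypothetical frame with non-default index labels. It also illustrates a difference that is easy to miss: `iloc` slices exclude the end position, while `loc` slices include the end label.

```python
import pandas as pd

# Hypothetical toy data with index labels 10, 20, 30.
toy = pd.DataFrame(
    {"pclass": [1, 2, 3], "survived": [1, 0, 1], "age": [30.0, 25.0, 40.0]},
    index=[10, 20, 30],
)

# Single value by row label and column label.
age_at_label_20 = toy.loc[20, "age"]

# Position slices stop before the end: 2 rows, 2 columns.
by_position = toy.iloc[0:2, 0:2]

# Label slices include the end: 3 rows, 3 columns.
by_label = toy.loc[10:30, "pclass":"age"]

print(age_at_label_20)    # 25.0
print(by_position.shape)  # (2, 2)
print(by_label.shape)     # (3, 3)
```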

### Reindexing Rows
    Description

- Reindex the new_titanic_survival dataframe so the row indexes start from 0, and the old index is dropped.
- Assign the final result to titanic_reindexed.
- Print the first 5 rows and the first 3 columns of titanic_reindexed.


In [78]:
titanic_reindexed = new_titanic_survival.reset_index(drop = True)
print(titanic_reindexed.iloc[0:5, 0:3])

   pclass  survived                                             name
0     1.0       1.0                    Allen, Miss. Elisabeth Walton
1     1.0       1.0                   Allison, Master. Hudson Trevor
2     1.0       0.0                     Allison, Miss. Helen Loraine
3     1.0       0.0             Allison, Mr. Hudson Joshua Creighton
4     1.0       0.0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
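
To show what `drop = True` changes, the sketch below resets the index of a hypothetical frame both ways. Without `drop=True`, pandas keeps the old labels in a new column named "index".

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; dropping the NaN row leaves labels 0 and 2.
toy = pd.DataFrame({"age": [30.0, np.nan, 25.0]}).dropna()

# drop=True discards the old labels and renumbers from 0.
renumbered = toy.reset_index(drop=True)

# The default keeps the old labels as an "index" column.
with_old_labels = toy.reset_index()

print(list(renumbered.index))          # [0, 1]
print(list(with_old_labels["index"]))  # [0, 2]
```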


### Apply Functions Over a DataFrame
    Description

- Write a function that counts the number of null elements in a Series, and apply it to every column of titanic_survival.

In [88]:
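def count_null(ele):
    # ele is one column of the dataframe, passed in as a Series
    null_ele = pd.isnull(ele)
    # Filter the column itself rather than the global dataframe
    null_true = ele[null_ele]
    nr_null = len(null_true)
    return nr_null
column_null_count = titanic_survival.apply(count_null)
print(column_null_count)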

pclass          1
survived        1
name            1
sex             1
age           264
sibsp           1
parch           1
ticket          1
fare            2
cabin        1015
embarked        3
boat          824
body         1189
home.dest     565
dtype: int64
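
As a cross-check on the `apply()` approach, the sketch below runs a null-counting function over a hypothetical two-column frame and compares it with `isnull().sum()`, which gives the same per-column counts. The function name `count_nulls` is made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with nulls in both columns.
toy = pd.DataFrame({
    "age": [30.0, np.nan, np.nan],
    "sex": ["f", None, "m"],
})

def count_nulls(column):
    # apply() passes each column in as a Series.
    return int(pd.isnull(column).sum())

# One count per column, computed through apply().
via_apply = toy.apply(count_nulls)

# Same counts without a custom function.
via_sum = toy.isnull().sum()

print(via_apply.to_dict())  # {'age': 2, 'sex': 1}
print(via_sum.to_dict())    # {'age': 2, 'sex': 1}
```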


### Applying a Function to a Row
    Description
    
- Create a function that returns the string "minor" if someone is under 18, "adult" if they are equal to or over 18, and "unknown" if their age is null.
- Find the correct label for everyone in the titanic_survival dataframe.

In [96]:
def age_people(x):
    row = x['age']
    if pd.isnull(row):
        return "unknown"
    elif row < 18:
        return "minor"
    else:
        return "adult"
age_labels = titanic_survival.apply(age_people, axis=1)
titanic_survival["age_labels"] = age_labels
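
Row-wise `apply()` is easy to read but calls the function once per row. One possible alternative, swapped in here purely for comparison, is `np.select`, which picks the first matching condition for each element. The frame `toy` is hypothetical, and the null check is listed first so that missing ages are labelled before the numeric comparison is considered.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data covering all three cases.
toy = pd.DataFrame({"age": [5.0, 18.0, np.nan, 40.0]})

# Conditions are checked in order; the first True one wins.
conditions = [
    toy["age"].isnull().to_numpy(),
    (toy["age"] < 18).to_numpy(),
]
choices = ["unknown", "minor"]

# Anything matching neither condition is labelled "adult".
toy["age_labels"] = np.select(conditions, choices, default="adult")

print(list(toy["age_labels"]))  # ['minor', 'adult', 'unknown', 'adult']
```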

### Calculating Survival Percentage by Age Group
Now that we have age labels for everyone, let's make a pivot table to find the probability of survival for each age group.
    
    Description
    
- Create a pivot table that calculates the mean survival chance ("survived") for each age group ("age_labels") of the dataframe titanic_survival.
- Assign the resulting pivot table to age_group_survival.

In [103]:
age_group_survival = titanic_survival.pivot_table(index = "age_labels", values = "survived",aggfunc = np.mean)
print(age_group_survival)

            survived
age_labels          
adult       0.387892
minor       0.525974
unknown     0.277567
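
The mean of "survived" can be read as a survival rate because the column holds only 0 and 1, so the mean equals survivors divided by passengers. A sketch on a hypothetical five-row frame checks this by computing the fraction directly; `"mean"` is passed as a string here, an assumption for this example.

```python
import pandas as pd

# Hypothetical toy data: 1.0 means survived, 0.0 means did not.
toy = pd.DataFrame({
    "age_labels": ["adult", "adult", "minor", "minor", "unknown"],
    "survived": [1.0, 0.0, 1.0, 1.0, 0.0],
})

# Mean of a 0/1 column per group.
toy_rates = toy.pivot_table(index="age_labels", values="survived", aggfunc="mean")

# The same figure for adults, computed as survivors / passengers.
adults = toy[toy["age_labels"] == "adult"]
adult_fraction = adults["survived"].sum() / len(adults)

print(toy_rates.loc["adult", "survived"], adult_fraction)  # 0.5 0.5
```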
