## Python Version

In [1]:
!python --version

Python 3.6.4 :: Anaconda custom (64-bit)


## Mission
In this mission, we'll work with missing values, use pivot tables, and calculate summary statistics.

## Import data
- Read file

read_csv takes an encoding option to deal with files in different character encodings. I mostly use read_csv('file', encoding="ISO-8859-1"), or alternatively encoding="utf8", for reading, and generally utf-8 for to_csv.

You can also use the alias 'latin1' instead of 'ISO-8859-1'.
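As a small, self-contained sketch of the encoding option, the snippet below builds a tiny CSV in memory, encodes it as Latin-1, and reads it back with both spellings of the encoding name. The names and ports are made up for illustration and are not taken from the Titanic file.

```python
import io
import pandas as pd

# Made-up sample data containing non-ASCII characters, encoded as Latin-1 bytes
raw_bytes = "name,port\nZoë,Cherbourg\nJosé,Queenstown\n".encode("latin1")

# "latin1" and "ISO-8859-1" name the same codec, so both calls decode identically
df_latin1 = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin1")
df_iso = pd.read_csv(io.BytesIO(raw_bytes), encoding="ISO-8859-1")

print(df_latin1)
print(df_latin1.equals(df_iso))
```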

In [2]:
import pandas as pd
titanic_survival = pd.read_csv("titanic_survival.csv", encoding = "latin1")
col_names = titanic_survival.columns.tolist()
print(col_names)
titanic_survival.head(3)

['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest']


Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,3,0,"Abbing, Mr. Anthony",male,42.0,0,0,C.A. 5547,7.55,,S,,,
1,3,0,"Abbott, Master. Eugene Joseph",male,13.0,0,2,C.A. 2673,20.25,,S,,,"East Providence, RI"
2,3,0,"Abbott, Mr. Rossmore Edward",male,16.0,1,1,C.A. 2673,20.25,,S,,190.0,"East Providence, RI"


The table above shows the first few rows of the dataset. Each row contains information for a specific Titanic passenger.

Let's take a closer look at a few of the key columns:

    pclass -- The passenger's cabin class from 1 to 3 where 1 was the highest class
    survived -- 1 if the passenger survived, and 0 if they did not.
    sex -- The passenger's gender
    age -- The passenger's age
    fare -- The amount the passenger paid for their ticket
    embarked -- Either C, Q, or S, to indicate which port the passenger boarded the ship from.

Many of the columns, such as age and cabin, have missing values.

## Missing data

Missing data can take a few different forms:

    - In Python, the None keyword and type indicates no value.
    - The Pandas library uses NaN, which stands for "not a number", to indicate a missing value.

In general terms, both NaN and None can be called null values.
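A minimal sketch of how the two kinds of null value behave, using a made-up Series rather than the Titanic data. It also shows why a dedicated function such as pd.isnull() is needed: NaN never compares equal to anything, including itself.

```python
import numpy as np
import pandas as pd

# In a numeric Series, pandas stores None as NaN
values = pd.Series([1.5, np.nan, None, 4.0])

# pd.isnull() flags both kinds of null value
mask = pd.isnull(values)
print(mask.tolist())

# An equality test cannot find NaN, because NaN is not equal to itself
nan_equals_itself = (np.nan == np.nan)
print(nan_equals_itself)
```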

To see which values are NaN, we can use the pandas.isnull() function. It takes a pandas Series and returns a Series of True and False values, just as NumPy returned a Boolean array when we compared arrays.

In [3]:
sex = titanic_survival["sex"]
sex_is_null = pd.isnull(sex)

We can use this resultant series to select only the rows that have null values.

In [4]:
sex_null_true = sex[sex_is_null]
titanic_survival[sex_is_null]

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest


**Instruction:**

- Count how many values in the "age" column have null values:
    - Use pandas.isnull() on the age Series to create a Series of True and False values.
    - Use the resulting Series to select only the elements in age that are null, and assign the result to age_null_true.
- Assign the length of age_null_true to age_null_count.
    - Print age_null_count to see how many null values are in the "age" column.



In [5]:
age = titanic_survival["age"]
# Series of age is null contains either True or False
age_is_null = pd.isnull(age)

# DataFrame of age is null 
df_age_null = titanic_survival[age_is_null]

# Series of age is null == True 
age_null_true = age[age_is_null]

age_null_count = len(age_null_true) # or len(df_age_null)
print("age_null_count is", age_null_count)
df_age_null.head(3)

age_null_count is 263


Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
97,1,0,"Baumann, Mr. John D",male,,0,0,PC 17318,25.925,,S,,,"New York, NY"
118,3,0,"Betros, Master. Seman",male,,0,0,2622,7.2292,,C,,,
139,3,0,"Boulos, Mr. Hanna",male,,0,0,2664,7.225,,C,,,Syria


**Instruction:**

- Calculating the mean as mean_age = sum(titanic_survival["age"]) / len(titanic_survival["age"]) returns NaN, because adding NaN to any number gives NaN.
- Use age_is_null to create a vector that only contains values from the "age" column that aren't NaN.
- Calculate the mean of the new vector, and assign the result to correct_mean_age

In [6]:
# Boolean Series: True where age is null, False otherwise
age_is_null = pd.isnull(titanic_survival["age"]) # or pd.isnull(age)

# Ages that are not null; ~ inverts the Boolean mask
age_not_null = titanic_survival["age"][~age_is_null] # or age[~age_is_null]

# Compute mean age 
correct_mean_age = age_not_null.sum()/len(age_not_null)
print("correct_mean_age is", correct_mean_age)

correct_mean_age is 29.8811345124283


## Mean
To calculate the mean age, we can replace all of our code with one line. Series.mean() ignores null values by default:

In [7]:
# Compute mean age 
correct_mean_age = titanic_survival["age"].mean()
print("correct_mean_age is", correct_mean_age)

correct_mean_age is 29.8811345124283
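The difference between the manual calculation and Series.mean() can be sketched on a made-up three-element Series (the values are illustrative only). The skipna parameter of Series.mean() controls whether nulls are ignored; it defaults to True.

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing value
ages = pd.Series([20.0, np.nan, 40.0])

# Built-in sum() propagates NaN, so the result is NaN
naive_mean = sum(ages) / len(ages)

# Series.mean() skips NaN by default and averages the two remaining values
pandas_mean = ages.mean()

# Turning skipna off reproduces the NaN result
strict_mean = ages.mean(skipna=False)

print(naive_mean, pandas_mean, strict_mean)
```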


**Instruction**
- Assign the mean of the "fare" column to correct_mean_fare

In [8]:
# Compute mean fare
correct_mean_fare = titanic_survival["fare"].mean()
print("correct_mean_fare is", correct_mean_fare)

correct_mean_fare is 33.29547928134565


## Summary Statistics
Let's calculate more summary statistics for the data. The pclass column indicates the cabin class for each passenger, which was either first class (1), second class (2), or third class (3). You'll use the list passenger_classes, which contains these values, in the following exercise.

**Instruction:**
- Use a for loop to iterate over passenger_classes. Within the for loop:
    - Select just the rows in titanic_survival where the pclass value is equivalent to the current iterator value (class).
    - Select just the fare column for the current subset of rows.
    - Use the Series.mean method to calculate the mean of this subset.
    - Add the mean of the class to the fares_by_class dictionary with class as the key.
- Once the loop completes, the dictionary fares_by_class should have 1, 2, and 3 as keys, with the average fares as the corresponding values.

In [9]:
passenger_classes = [1, 2, 3]
fares_by_class = {}

for pclass in passenger_classes:
    pclass_fare = titanic_survival["fare"][titanic_survival["pclass"] == pclass]
    fares_by_class[pclass] = pclass_fare.mean()
    
for pclass, fare in fares_by_class.items():
    print("pclass", pclass, "average fare is", fare)

pclass 1 average fare is 87.50899164086687
pclass 2 average fare is 21.1791963898917
pclass 3 average fare is 13.302888700564957


## Pivot Tables
Pivot tables provide an easy way to subset by one column and then apply a calculation like a sum or a mean. The concept of pivot tables was popularized with the introduction of the 'PivotTable' feature in Microsoft Excel in the mid-1990s.

Pivot tables first group and then apply a calculation. In the previous section, we actually made a pivot table manually by grouping by the column "pclass" and then calculating the mean of the "fare" column for each class.

Luckily, we can use the DataFrame.pivot_table() method instead, which simplifies the kind of work we did in the previous section. To produce the same data, we could use one line.

In [10]:
import numpy as np
passenger_class_fares = titanic_survival.pivot_table(index="pclass", values="fare", aggfunc=np.mean)
passenger_class_fares

pclass,fare
1,87.508992
2,21.179196
3,13.302889


The index parameter tells the method which column to group by. The values parameter is the column that we want to apply the calculation to, and aggfunc specifies the calculation we want to perform. The default for aggfunc is the mean, so we can omit it when that is the calculation we want.
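As a rough check of the default aggfunc, the sketch below compares three ways of getting the mean fare per class on a small made-up DataFrame (the fares are illustrative, not the Titanic values): pivot_table() without aggfunc, pivot_table() with an explicit "mean", and the equivalent groupby() call.

```python
import pandas as pd

# Made-up fares for three passenger classes
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3],
    "fare": [80.0, 100.0, 20.0, 10.0, 14.0],
})

# aggfunc omitted: the mean is used
default_pivot = toy.pivot_table(index="pclass", values="fare")

# aggfunc given explicitly by name
explicit_pivot = toy.pivot_table(index="pclass", values="fare", aggfunc="mean")

# The same grouping written with groupby()
grouped = toy.groupby("pclass")["fare"].mean()

print(default_pivot)
```

All three produce one mean fare per class; pivot_table() returns a DataFrame, while the groupby() version returns a Series.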

**Instruction**
- Use the DataFrame.pivot_table() method to calculate the mean age for each passenger class ("pclass").
- Assign the result to passenger_age.
- Display the passenger_age pivot table using the print() function.

In [11]:
import numpy as np
import pandas as pd

# index specifies which column to group the rows by
# values specifies which column to aggregate within each group
# aggfunc specifies the calculation to apply to each group
# In this case, we split survived into 3 groups, one for each passenger class, and take the mean of each


passenger_survival = titanic_survival.pivot_table(index="pclass", values="survived", aggfunc=np.mean)
print(passenger_survival)
print()
passenger_age = titanic_survival.pivot_table(index="pclass", values="age", aggfunc=np.mean)
print(passenger_age)

        survived
pclass          
1       0.619195
2       0.429603
3       0.255289

              age
pclass           
1       39.159918
2       29.506705
3       24.816367


In [12]:
# You can also aggregate several columns at once by passing a list to values
titanic_survival.pivot_table(index="pclass", values=["fare","age","survived"])

pclass,age,fare,survived
1,39.159918,87.508992,0.619195
2,29.506705,21.179196,0.429603
3,24.816367,13.302889,0.255289


We can use the DataFrame.pivot_table() method to perform even more advanced tasks. If we pass a list of column names to the values parameter instead of a single value, we can perform calculations on multiple columns at once.

We can also specify a custom calculation to be made. For instance, if we pass np.sum to the aggfunc parameter it will total the values in each column.
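The aggfunc parameter also accepts a dictionary that maps each column to its own calculation. The sketch below, on a small made-up DataFrame (not the Titanic data), totals one column while averaging another.

```python
import pandas as pd

# Made-up passengers from two ports
toy = pd.DataFrame({
    "embarked": ["C", "C", "S", "S"],
    "fare": [10.0, 20.0, 5.0, 7.0],
    "survived": [1, 0, 1, 1],
})

# Sum the fares but average the survival flag, per embarkation port
port_summary = toy.pivot_table(
    index="embarked",
    values=["fare", "survived"],
    aggfunc={"fare": "sum", "survived": "mean"},
)

print(port_summary)
```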

**Instruction**

- Make a pivot table that calculates the total fares collected ("fare") and total number of survivors ("survived") for each embarkation port ("embarked").
- Assign the result to port_stats.
- Display port_stats using the print() function.


In [15]:
import pandas as pd
import numpy as np
port_stats = titanic_survival.pivot_table(index="embarked", values=["fare","survived"], aggfunc=np.sum)
print(port_stats)

                fare  survived
embarked                      
C         16830.7922       150
Q          1526.3085        44
S         25033.3862       304


We learned how to remove the missing values in a vector of data, but how about in a matrix?

We can use the DataFrame.dropna() method on pandas DataFrames to do this. The method will drop any rows that contain missing values.

The dropna() method takes an axis parameter, which indicates whether you would like to drop rows or columns. Specifying axis=0 or axis='index' will drop any rows that have null values, while specifying axis=1 or axis='columns' will drop any columns that have null values. We will use 0 and 1 since they're more commonly used, but you can use either.

The code below will drop all rows in titanic_survival that have null values.

In [17]:
drop_na_rows = titanic_survival.dropna(axis=0)
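To make the effect of the axis parameter concrete, here is a minimal sketch on a made-up three-row DataFrame (the values are illustrative only).

```python
import numpy as np
import pandas as pd

# Made-up data: "age" and "cabin" contain nulls, "name" does not
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "age": [30.0, np.nan, 25.0],
    "cabin": [np.nan, np.nan, "C22"],
})

# axis=0 drops every row that has at least one null value
rows_dropped = toy.dropna(axis=0)

# axis=1 drops every column that has at least one null value
cols_dropped = toy.dropna(axis=1)

print(rows_dropped)
print(cols_dropped)
```

Only the row labeled 2 has no nulls, so it is the only row that survives axis=0; only the "name" column survives axis=1.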

There is also a parameter that allows you to specify a list of columns or rows to look at when using dropna(). You will need to use this in the next exercise - take a look at the [documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) to work out the name of this parameter and how it works.

In [19]:
help(pd.DataFrame.dropna)

Help on function dropna in module pandas.core.frame:

dropna(self, axis=0, how='any', thresh=None, subset=None, inplace=False)
    Return object with labels on given axis omitted where alternately any
    or all of the data are missing
    
    Parameters
    ----------
    axis : {0 or 'index', 1 or 'columns'}, or tuple/list thereof
        Pass tuple or list to drop on multiple axes
    how : {'any', 'all'}
        * any : if any NA values are present, drop that label
        * all : if all values are NA, drop that label
    thresh : int, default None
        int value : require that many non-NA values
    subset : array-like
        Labels along other axis to consider, e.g. if you are dropping rows
        these would be a list of columns to include
    inplace : boolean, default False
        If True, do operation inplace and return None.
    
    Returns
    -------
    dropped : DataFrame
    
    Examples
    --------
    >>> df = pd.DataFrame([[np.nan, 2, np.nan, 0], [3, 4, np.

**Instructions**

- Drop all columns in titanic_survival that have missing values and assign the result to drop_na_columns.
- Drop all rows in titanic_survival where the columns "age" or "sex" have missing values and assign the result to new_titanic_survival.



In [21]:
# Drop every column that contains at least one NaN
drop_na_columns = titanic_survival.dropna(axis=1, how="any")
# Drop every row where "age" or "sex" is NaN
new_titanic_survival = titanic_survival.dropna(subset=["age","sex"])
new_titanic_survival.head(3)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,3,0,"Abbing, Mr. Anthony",male,42.0,0,0,C.A. 5547,7.55,,S,,,
1,3,0,"Abbott, Master. Eugene Joseph",male,13.0,0,2,C.A. 2673,20.25,,S,,,"East Providence, RI"
2,3,0,"Abbott, Mr. Rossmore Edward",male,16.0,1,1,C.A. 2673,20.25,,S,,190.0,"East Providence, RI"


In [44]:
# Sort new_titanic_survival by "age", then by "name", from highest to lowest
new_titanic_survival = new_titanic_survival.sort_values(by=["age", "name"], ascending=False)
#new_titanic_survival.sort_values("age", ascending=False, inplace=True)


## Row indices

**Instruction**
- Assign the first ten rows from new_titanic_survival to first_ten_rows.
- Assign the fifth row from new_titanic_survival to row_position_fifth.
- Assign the row with index label 25 from new_titanic_survival to row_index_25.

**Hint**
- Remember to use .loc when addressing by label, and .iloc when indexing by position

In [45]:
first_ten_rows = new_titanic_survival.iloc[0:10]
first_ten_rows

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
93,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0,A23,S,B,,"Hessle, Yorks"
217,1,1,"Cavendish, Mrs. Tyrell William (Julia Florence...",female,76.0,1,0,19877,78.85,C46,S,6,,"Little Onn Hall, Staffs"
1168,3,0,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.775,,S,,,
457,1,0,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C,,,"New York, NY"
54,1,0,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C,,22.0,"Montevideo, Uruguay"
263,3,0,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.75,,Q,,171.0,
814,2,0,"Mitchell, Mr. Henry Michael",male,70.0,0,0,C.A. 24580,10.5,,S,,,"Guernsey / Montclair, NJ and/or Toledo, Ohio"
282,1,0,"Crosby, Capt. Edward Gifford",male,70.0,1,1,WE/P 5735,71.0,B22,S,,269.0,"Milwaukee, WI"
1159,1,0,"Straus, Mr. Isidor",male,67.0,1,0,PC 17483,221.7792,C55 C57,S,,96.0,"New York, NY"
1264,2,0,"Wheadon, Mr. Edward H",male,66.0,0,0,C.A. 24579,10.5,,S,,,"Guernsey, England / Edgewood, RI"


In [46]:
row_position_fifth = new_titanic_survival.iloc[4]
row_position_fifth

pclass                             1
survived                           0
name         Artagaveytia, Mr. Ramon
sex                             male
age                               71
sibsp                              0
parch                              0
ticket                      PC 17609
fare                         49.5042
cabin                            NaN
embarked                           C
boat                             NaN
body                              22
home.dest        Montevideo, Uruguay
Name: 54, dtype: object

In [48]:
row_index_25 = new_titanic_survival.loc[25]
row_index_25

pclass                                          1
survived                                        0
name         Allison, Mr. Hudson Joshua Creighton
sex                                          male
age                                            30
sibsp                                           1
parch                                           2
ticket                                     113781
fare                                       151.55
cabin                                     C22 C26
embarked                                        S
boat                                          NaN
body                                          135
home.dest         Montreal, PQ / Chesterville, ON
Name: 25, dtype: object

We can also index columns using both the loc[] and iloc[] indexers. With loc[], we specify the column label strings as we have in the earlier exercises in this mission. With iloc[], we use the integer position of the column, starting from the left-most column, which is 0. As with NumPy arrays, you separate the rows and columns with a comma, and can use a colon to specify a range or as a wildcard.

In [49]:
first_row_first_column = new_titanic_survival.iloc[0,0]
all_rows_first_three_columns = new_titanic_survival.iloc[:,0:3]
row_index_83_age = new_titanic_survival.loc[83,"age"]
row_index_766_pclass = new_titanic_survival.loc[766,"pclass"]
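The difference between label-based and position-based indexing is easiest to see after a sort, when labels and positions no longer agree. A minimal sketch on a made-up DataFrame (names and ages are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({"name": ["A", "B", "C"], "age": [22, 80, 41]})

# After sorting, the index labels are in the order [1, 2, 0]
by_age = toy.sort_values("age", ascending=False)

# iloc uses position: the first row after sorting is the oldest passenger
first_by_position = by_age.iloc[0]["name"]

# loc uses the label: label 0 still refers to the original first row
row_with_label_0 = by_age.loc[0, "name"]

# Row and column positions together: first row, second column ("age")
oldest_age = by_age.iloc[0, 1]

# Position slices exclude the end point, so 0:2 returns two rows
top_two_labels = by_age.iloc[0:2].index.tolist()

print(first_by_position, row_with_label_0, oldest_age, top_two_labels)
```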

**Instruction**

- Assign the value at row index label 1100, column index label "age" from new_titanic_survival to row_index_1100_age.
- Assign the value at row index label 25, column index label "survived" from new_titanic_survival to row_index_25_survived.
- Assign the first 5 rows and first three columns from new_titanic_survival to five_rows_three_cols.


In [61]:
#row_index_1100_age = new_titanic_survival.loc[1100, "age"]
row_index_25_survived = new_titanic_survival.loc[25, "survived"]
five_rows_three_cols = new_titanic_survival.iloc[0:5,0:3]

## Reset index
After we sorted new_titanic_survival by age, the row indexes were no longer sequential. Each row retained its original index from titanic_survival.

Sometimes it's useful to reindex, starting from 0. We can use the DataFrame.reset_index() method to do this. By default, the method retains the old index by adding an extra column to the dataframe with the old index values.

In this exercise, we don't want to retain the index. Check the documentation to see what parameter you need to add so that we don't retain the old index.

**Instructions**

- Reindex the new_titanic_survival dataframe so the row indexes start from 0, and the old index is dropped.
- Assign the final result to titanic_reindexed.
- Print the first 5 rows and the first 3 columns of titanic_reindexed.


In [59]:
# Use drop parameter to avoid the old index being added as a column
titanic_reindexed = new_titanic_survival.reset_index(drop=True)
titanic_reindexed.iloc[0:5,0:3]

Unnamed: 0,pclass,survived,name
0,1,1,"Barkworth, Mr. Algernon Henry Wilson"
1,1,1,"Cavendish, Mrs. Tyrell William (Julia Florence..."
2,3,0,"Svensson, Mr. Johan"
3,1,0,"Goldschmidt, Mr. George B"
4,1,0,"Artagaveytia, Mr. Ramon"
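For comparison, this sketch shows reset_index() with and without drop=True on a small made-up DataFrame (the ages are illustrative only).

```python
import pandas as pd

# Sorting leaves the index labels in the order [1, 2, 0]
toy = pd.DataFrame({"age": [22, 80, 41]}).sort_values("age", ascending=False)

# Default: the old index is kept as a new column named "index"
kept = toy.reset_index()

# drop=True: the old index is discarded
dropped = toy.reset_index(drop=True)

print(kept)
print(dropped)
```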


## Apply()
To perform a complex calculation across pandas objects, we'll need to learn about the DataFrame.apply() method. By default, DataFrame.apply() iterates through each column in a DataFrame and runs a function on each one. When we write that function, we give it one parameter, and apply() passes each column to that parameter as a pandas Series.

The result from the function will be combined with all of the other results, and placed into a new series. The function results will have the same position as the column or row we generated them from. Let's look at a simple example:


In [66]:
# Return hundredth item from a series
def hundredth_row(column):
    return column.iloc[99]

# Return the hundredth item from each column
hundredth_row_var = titanic_survival.apply(hundredth_row)
hundredth_row_var

pclass                                                     1
survived                                                   1
name         Baxter, Mrs. James (Helene DeLaudeniere Chaput)
sex                                                   female
age                                                       50
sibsp                                                      0
parch                                                      1
ticket                                              PC 17558
fare                                                 247.521
cabin                                                B58 B60
embarked                                                   C
boat                                                       6
body                                                     NaN
home.dest                                       Montreal, PQ
dtype: object

**Instructions**
- Write a function that counts the number of null elements in a Series.
- Use the DataFrame.apply() method along with your function to run across all the columns in titanic_survival.
- Assign the result to column_null_count.


In [70]:
# Return the null count of a column 
def null_count(column):
    return len(column[pd.isnull(column)]) # or len(column[pd.isnull(column) == True])
# Same as above
#def null_count_x(column):
#    column_is_null = pd.isnull(column)
#    return len(column[column_is_null])
column_null_count = titanic_survival.apply(null_count)
column_null_count

pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     563
dtype: int64
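The same per-column null counts can also be obtained without apply(), by chaining DataFrame.isnull() and sum(). The sketch below compares the two approaches on a small made-up DataFrame; null_count is repeated here only so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

def null_count(column):
    # Number of null elements in a Series
    return len(column[pd.isnull(column)])

# Made-up data with nulls in two of the three columns
toy = pd.DataFrame({
    "age": [30.0, np.nan, np.nan],
    "sex": ["male", "female", "male"],
    "cabin": [None, "C22", "B5"],
})

via_apply = toy.apply(null_count)

# isnull() gives a Boolean DataFrame; summing counts the True values per column
via_sum = toy.isnull().sum()

print(via_apply)
print(via_sum)
```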

## Applying a function to a Row
By passing in the axis=1 argument, we can use the DataFrame.apply() method to iterate over rows instead of columns.

In [72]:
def is_minor(row):
    return row["age"] < 18

minors = titanic_survival.apply(is_minor, axis=1)

We can use this to calculate some summary information about the ages of the passengers on the Titanic. You will need to use an if/elif/else statement in your function. The elif statement just means else if. Below is an example of how these statements work.

In [76]:
def which_class(row):
    pclass = row['pclass']
    if pd.isnull(pclass):
        return "Unknown"
    elif pclass == 1:
        return "First Class"
    elif pclass == 2:
        return "Second Class"
    else:
        return "Third Class"

classes = titanic_survival.apply(which_class, axis=1)

When the function is called, the conditions are tested in order: the first if or elif whose condition is true runs, and if none is true, the else branch runs.

**Instructions**

- Create a function that returns the string "minor" if someone is under 18, "adult" if they are equal to or over 18, and "unknown" if their age is null.
- Then, use the function along with .apply() to find the correct label for everyone in the titanic_survival dataframe.
- Assign the result to age_labels.
- You can use pd.isnull to check if a value is null or not.


In [81]:
def age_label(row):
    age = row["age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

age_labels = titanic_survival.apply(age_label, axis=1)
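The null check has to come first because any comparison with NaN is False. The sketch below uses a hypothetical variant of the function, age_label_no_null_check (not part of the exercise), to show what happens on a made-up DataFrame when that check is left out.

```python
import numpy as np
import pandas as pd

def age_label(row):
    age = row["age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

# Hypothetical variant without the null check, for illustration only
def age_label_no_null_check(row):
    # NaN < 18 is False, so a missing age falls through to "adult"
    return "minor" if row["age"] < 18 else "adult"

toy = pd.DataFrame({"age": [10.0, 35.0, np.nan]})

with_check = toy.apply(age_label, axis=1).tolist()
without_check = toy.apply(age_label_no_null_check, axis=1).tolist()

print(with_check)
print(without_check)
```

Without the check, the passenger with a missing age is silently counted as an adult.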

In [82]:
# Add "age_labels" to titanic_survival
titanic_survival["age_labels"] = age_labels

**Instruction** 
- Create a pivot table that calculates the mean survival chance ("survived") for each age group ("age_labels") of the dataframe titanic_survival.
- Assign the resulting DataFrame to age_group_survival.

In [83]:
age_group_survival = titanic_survival.pivot_table(index="age_labels", values="survived")
print(age_group_survival)

            survived
age_labels          
adult       0.387892
minor       0.525974
unknown     0.277567
