## Table of contents:
1. [Load the dataset and libraries](#1-Load)
2. [Data leakage and the Titanic dataset](#2-Data-leakage)
3. [Removing columns](#3-Removing-columns)
4. [Handling the index column](#4-Handling)
5. [Accessing the value of a cell](#5-Accessing)
    5.1. [Method 1: .loc[]](#5-1-Method-1)
    5.2. [Method 2: .iloc[]](#5-2-Method-2)
6. [Mask](#6-Mask)
7. [By group statistics](#7-By-group)
8. [Sorting](#8-Sorting)

# 1 Load the dataset and libraries <a name="1-Load"></a>

To begin, import the pandas library to handle DataFrames.

## Import the needed Libraries

In [1]:
import pandas as pd

Next, run the code below and review the comments. Note that we are loading a different type of file here: a dataset of tab-separated values (.tsv) rather than the usual CSV. Although it is a less common file format, as a budding data scientist it is good to know how to handle a range of dataset file types.

In [2]:
titanic_filename = 'titanic_full.tsv' 

# pd.read_csv() can also load a tsv file. It accepts many parameters.
# Only the first one, the file path, is required.

# The remaining parameters have default values, which we override here:

# sep='\t' specifies that columns are delimited by tabs

# header=0 specifies the row number to use as the header (or None if the file has no header row)

# index_col=0 specifies that the leftmost column is the index column

titanic_df = pd.read_csv(titanic_filename, sep='\t', header=0, index_col=0) 

titanic_df 

passengerid,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1,1,1,"Allen, Miss. Elisabeth Walton",female,29.00,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
2,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Miss. Helen Loraine",female,2.00,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.00,1,2,113781,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
5,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.00,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1305,3,0,"Zabour, Miss. Hileni",female,14.50,1,0,2665,14.4542,,C,,328.0,
1306,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C,,,
1307,3,0,"Zakarian, Mr. Mapriededer",male,26.50,0,0,2656,7.2250,,C,,304.0,
1308,3,0,"Zakarian, Mr. Ortin",male,27.00,0,0,2670,7.2250,,C,,,
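
As a minimal sketch of what the `sep`, `header` and `index_col` parameters do, the snippet below reads a tiny tab-separated sample from memory instead of from disk. The three rows and the names `tsv_text` and `sample_df` are made up for illustration and are not part of the activity's dataset.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for a .tsv file on disk
tsv_text = (
    "passengerid\tpclass\tsurvived\tage\n"
    "1\t1\t1\t29.0\n"
    "2\t1\t1\t0.92\n"
    "3\t1\t0\t\n"
)

# io.StringIO lets read_csv treat the string like an open file
sample_df = pd.read_csv(io.StringIO(tsv_text), sep="\t", header=0, index_col=0)

print(sample_df.index.name)            # name of the column that became the index
print(list(sample_df.columns))         # remaining columns
print(sample_df["age"].isna().sum())   # empty fields are read as missing values
```

The same call works with a file path in place of the `io.StringIO` object.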


# 2 Data leakage and the Titanic dataset <a name="2-Data-leakage"></a>

Let’s consider the Titanic dataset, and more specifically its columns. Run the code below.



In [3]:
titanic_df.columns 

Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
      dtype='object')

**Think** 🤔 Consider that you want to predict whether an individual would have survived the Titanic disaster. What information can be relevant and what will likely cause leakage?


<details>

    The column ‘survived’ is our target (a.k.a. label) column, the outcome we’d like to predict. In the modelling phase, this column should be set explicitly as the target column.

    The columns ‘boat’ (for survivors, the lifeboat they were rescued in) and ‘body’ (for non-survivors, the identification number of their body if it was recovered) will cause leakage if they are not removed from the dataset, because this information is not expected to be available at prediction time. It was collected after the accident happened and indicates whether passengers survived or not, which is exactly what we are trying to predict. Understanding the problem at hand is crucial for deciding which features to use.

    The index of a dataset can contribute to data leakage issues as well. As it doesn’t contain information that you want your model to learn, it should be excluded from the training dataset. In the Titanic dataset, passengerid looks like an arbitrary integer range at first. But once the boat and body columns are removed, it visually seems to correlate with the name, pclass and survived columns. Since passengerid is already set as the DataFrame index, it won’t be considered as part of the training process. However, the rows seem to be ordered by passengerid together with the other columns. Therefore, to avoid any bias, we can shuffle the rows and replace the current DataFrame index with a new one. This gives each passenger on the Titanic an arbitrary unique identifier.

 </details>
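
One rough way to spot a leaking column is to check how closely "this column is filled in" tracks the target. The sketch below does this on a hypothetical six-row toy DataFrame (`toy_df` and its values are invented for illustration); on real data the agreement would not necessarily be perfect.

```python
import pandas as pd

# Hypothetical toy data: 'boat' is only recorded for rescued passengers
toy_df = pd.DataFrame({
    "survived": [1, 1, 0, 0, 1, 0],
    "boat": ["2", "11", None, None, "4", None],
    "age": [29.0, 0.92, 2.0, 30.0, 25.0, 40.0],
})

# True where a boat is recorded, False where the value is missing
boat_known = toy_df["boat"].notna()

# Share of rows where "boat is recorded" agrees with "survived == 1"
agreement = (boat_known == (toy_df["survived"] == 1)).mean()
print(agreement)
```

An agreement close to 1 suggests the column encodes the outcome itself and should not be used as a feature.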

# 3 Removing columns <a name="3-Removing-columns"></a>

Let’s drop the columns that cause leakage, ‘boat’ and ‘body’, with the `DataFrame.drop()` method. 

**Note**: in pandas, applying a method to a Series or DataFrame usually doesn’t modify the original object. By default, it returns a modified copy instead. To change the original, we can either specify the parameter `inplace=True` or assign the result back to the original variable to overwrite the existing object, as in the code below.

In [4]:
titanic_df = titanic_df.drop(columns=['boat', 'body']) 

titanic_df

passengerid,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,home.dest
1,1,1,"Allen, Miss. Elisabeth Walton",female,29.00,0,0,24160,211.3375,B5,S,"St Louis, MO"
2,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.5500,C22 C26,S,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Miss. Helen Loraine",female,2.00,1,2,113781,151.5500,C22 C26,S,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.00,1,2,113781,151.5500,C22 C26,S,"Montreal, PQ / Chesterville, ON"
5,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.00,1,2,113781,151.5500,C22 C26,S,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...
1305,3,0,"Zabour, Miss. Hileni",female,14.50,1,0,2665,14.4542,,C,
1306,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C,
1307,3,0,"Zakarian, Mr. Mapriededer",male,26.50,0,0,2656,7.2250,,C,
1308,3,0,"Zakarian, Mr. Ortin",male,27.00,0,0,2670,7.2250,,C,
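
To illustrate the note about copies, here is a small sketch on a made-up DataFrame (`demo_df` is hypothetical). It shows that calling `drop()` without assigning the result leaves the original unchanged.

```python
import pandas as pd

demo_df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# drop() returns a new DataFrame; without assignment, demo_df is untouched
demo_df.drop(columns=["c"])
columns_without_assignment = list(demo_df.columns)
print(columns_without_assignment)

# Assigning the result back replaces the old object
demo_df = demo_df.drop(columns=["c"])
columns_after_assignment = list(demo_df.columns)
print(columns_after_assignment)

# errors="ignore" skips labels that are no longer present instead of raising KeyError
demo_df = demo_df.drop(columns=["c"], errors="ignore")
```

The `errors="ignore"` option is handy in notebooks, where a cell that drops columns may be run more than once.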


In [5]:
titanic_df.head(100)

passengerid,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,home.dest
1,1,1,"Allen, Miss. Elisabeth Walton",female,29.00,0,0,24160,211.3375,B5,S,"St Louis, MO"
2,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.5500,C22 C26,S,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Miss. Helen Loraine",female,2.00,1,2,113781,151.5500,C22 C26,S,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.00,1,2,113781,151.5500,C22 C26,S,"Montreal, PQ / Chesterville, ON"
5,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.00,1,2,113781,151.5500,C22 C26,S,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...
96,1,1,"Dodge, Mrs. Washington (Ruth Vidaver)",female,54.00,1,1,33638,81.8583,A34,S,"San Francisco, CA"
97,1,0,"Douglas, Mr. Walter Donald",male,50.00,1,0,PC 17761,106.4250,C86,C,"Deephaven, MN / Cedar Rapids, IA"
98,1,1,"Douglas, Mrs. Frederick Charles (Mary Helene B...",female,27.00,1,1,PC 17558,247.5208,B58 B60,C,"Montreal, PQ"
99,1,1,"Douglas, Mrs. Walter Donald (Mahala Dutton)",female,48.00,1,0,PC 17761,106.4250,C86,C,"Deephaven, MN / Cedar Rapids, IA"


# 4 Handling the index column <a name="4-Handling"></a>

Let’s recap what we know about the index column: 

* The index column is not a feature 

* We need to leave the index column out during the learning/training phase 

* To solve this problem, we can specify the index column while loading the dataset in pandas (as we did at the beginning of this activity) 

The *passengerid* column is already defined as the index column. To replace it, let’s randomly shuffle the data and reset the index. The [`DataFrame.sample()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html) method draws a random sample of rows from the DataFrame. Specifying `frac=1` and sampling without replacement (`replace=False`, the default) returns every row exactly once in random order, which simply shuffles the data. 

In [6]:
# reset_index() creates a new index (0 to length-1); drop=True discards the existing index instead of adding it to the data as a column 

titanic_df = titanic_df.sample(frac=1).reset_index(drop=True) 

titanic_df


,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,home.dest
0,3,0,"Hart, Mr. Henry",male,,0,0,394140,6.8583,,Q,
1,1,1,"Smith, Mrs. Lucien Philip (Mary Eloise Hughes)",female,18.0,1,0,13695,60.0000,C31,S,"Huntington, WV"
2,3,0,"Davies, Mr. Alfred J",male,24.0,2,0,A/4 48871,24.1500,,S,"West Bromwich, England Pontiac, MI"
3,1,0,"Case, Mr. Howard Brown",male,49.0,0,0,19924,26.0000,,S,"Ascot, Berkshire / Rochester, NY"
4,3,0,"Dika, Mr. Mirko",male,17.0,0,0,349232,7.8958,,S,
...,...,...,...,...,...,...,...,...,...,...,...,...
1304,1,0,"Williams, Mr. Charles Duane",male,51.0,0,1,PC 17597,61.3792,,C,"Geneva, Switzerland / Radnor, PA"
1305,1,1,"Stengel, Mr. Charles Emil Henry",male,54.0,1,0,11778,55.4417,C116,C,"Newark, NJ"
1306,2,0,"Parker, Mr. Clifford Richard",male,28.0,0,0,SC 14888,10.5000,,S,"St Andrews, Guernsey"
1307,1,1,"Lurette, Miss. Elise",female,58.0,0,0,PC 17569,146.5208,B80,C,
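
Because the shuffle is random, the output above will differ from run to run. If a repeatable order is preferred, `sample()` accepts a `random_state` seed. The sketch below uses a made-up five-row DataFrame (`ordered_df` is hypothetical) and an arbitrary seed of 0.

```python
import pandas as pd

ordered_df = pd.DataFrame({"value": [10, 20, 30, 40, 50]}, index=[1, 2, 3, 4, 5])

# random_state fixes the seed, so both shuffles produce the same order
shuffled_a = ordered_df.sample(frac=1, random_state=0).reset_index(drop=True)
shuffled_b = ordered_df.sample(frac=1, random_state=0).reset_index(drop=True)

print(shuffled_a.equals(shuffled_b))   # same seed, same result
print(list(shuffled_a.index))          # fresh index starting at 0
print(sorted(shuffled_a["value"]))     # every row is still present exactly once
```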


# 5 Accessing the value of a cell <a name="5-Accessing"></a>

We’ll now look at ways of accessing data that go beyond the `head()` method, which only returns the first n rows. There are multiple methods for [indexing and selecting data](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) in pandas. We will show two recommended methods here and use each of them to achieve the same result. Later you may encounter other methods that aren’t covered here; all of them are described in the documentation. Which method you use depends on which one makes more sense to you and suits your personal preference.  

For simplicity let’s define another DataFrame to be used as a basic example. Run the code below. 

In [7]:
df = pd.DataFrame(index=['row0','row1','row2'], columns=['col0','col1','col2'], data=[[11,12,13],[21,22,23],[31,32,33]]) 

df 

,col0,col1,col2
row0,11,12,13
row1,21,22,23
row2,31,32,33


## 5.1 Method 1: .loc[] <a name="5-1-Method-1"></a>

The `.loc` indexer is used with square brackets and primarily accepts label-based indices, but it can also be used with a boolean array. Generally, the syntax is `.loc[row, column]`. Let’s see some basic examples. Run the code below. 

In [8]:
print('------- Single Value -------') 

print(df.loc['row0', 'col1']) 

------- Single Value -------
12


In [9]:
print('------- Single row, all columns -------') 

print(df.loc['row1'])

------- Single row, all columns -------
col0    21
col1    22
col2    23
Name: row1, dtype: int64


In [10]:
print('------- 2 rows, 2 columns -------') 

print(df.loc[['row0','row2'], ['col1','col0']]) 
# note that passing multiple rows/columns should be within lists and order matters

------- 2 rows, 2 columns -------
      col1  col0
row0    12    11
row2    32    31


In [11]:
print('------- using boolean array -------') 

print(df.loc[[True,False,True]]) # this will return the first and third rows, and all the columns 

------- using boolean array -------
      col0  col1  col2
row0    11    12    13
row2    31    32    33


Now let’s use .loc[] to access the age of the passenger in row 2. Run the code below. 

In [12]:
titanic_df.loc[2, 'age'] # note that in this case 2 is both the position and the label of the row, because the index was reset 

24.0
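
Two further `.loc` patterns are worth knowing: label slices and boolean conditions. The sketch below rebuilds the small example DataFrame so that it runs on its own; the names `subset` and `large_rows` are introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    index=["row0", "row1", "row2"],
    columns=["col0", "col1", "col2"],
    data=[[11, 12, 13], [21, 22, 23], [31, 32, 33]],
)

# Label slices with .loc include BOTH endpoints, unlike ordinary Python slices
subset = df.loc["row0":"row1", "col0":"col1"]
print(subset.shape)

# A boolean Series built from a column can be used as the row selector
large_rows = df.loc[df["col0"] > 15, "col2"]
print(list(large_rows))
```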

## 5.2 Method 2: .iloc[] <a name="5-2-Method-2"></a>

The `.iloc` indexer also uses square brackets, but primarily accepts integer position-based indices (0 to length-1). It can also be used with a boolean array. Again, the syntax is `.iloc[row, column]`. Let’s see some basic examples. Run the code below. 

In [13]:
print('------- Single Value -------') 

print(df.iloc[0, 1]) 

------- Single Value -------
12


In [14]:
print('------- Single row, all columns -------') 

print(df.iloc[1]) 

------- Single row, all columns -------
col0    21
col1    22
col2    23
Name: row1, dtype: int64


In [15]:
print('------- All rows, Single column -------') 

print(df.iloc[:,1]) 

------- All rows, Single column -------
row0    12
row1    22
row2    32
Name: col1, dtype: int64


In [16]:
print('------- 2 rows, 2 columns -------') 

print(df.iloc[[0,2], [1,0]]) # note that passing multiple rows/columns should be within lists and order matters 

------- 2 rows, 2 columns -------
      col1  col0
row0    12    11
row2    32    31


In [17]:
print('------- using boolean array -------') 

print(df.iloc[[True,False,True]]) # this will return the first and third rows, and all the columns 

------- using boolean array -------
      col0  col1  col2
row0    11    12    13
row2    31    32    33


Now let’s use .iloc[] to access the age of the passenger in row 2. Run the code below. 

In [18]:
titanic_df.iloc[2, 4] # 2 is the position of the row (here also its label) and 4 is the position of the 'age' column 

24.0
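
`.loc` and `.iloc` only agree when row labels happen to match row positions. The sketch below uses a made-up DataFrame (`people` is hypothetical) whose labels are deliberately out of order, and looks up the column position with `Index.get_loc()` rather than hard-coding it.

```python
import pandas as pd

people = pd.DataFrame(
    {"name": ["A", "B", "C"], "age": [30.0, 40.0, 50.0]},
    index=[2, 0, 1],  # labels deliberately out of order
)

# .loc looks for the row whose LABEL is 2, which is the first row here
by_label = people.loc[2, "age"]

# .iloc takes the row at POSITION 2 (the third row), whatever its label is
age_pos = people.columns.get_loc("age")
by_position = people.iloc[2, age_pos]

print(by_label, by_position)
```

Looking up the position by name keeps the code working if columns are later added or removed.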

# 6 Mask <a name="6-Mask"></a>

Let’s use a mask to select the Titanic passengers who survived. We can call this “mask_survived”. We specify that we want to return rows where “survived” is equal to 1, designating the passenger as a survivor. Run the code below. 

In [19]:
mask_survived = titanic_df['survived'] == 1 # square brackets with a column name are another way to access a single column 

mask_survived 

0       False
1        True
2       False
3       False
4       False
        ...  
1304    False
1305     True
1306    False
1307     True
1308    False
Name: survived, Length: 1309, dtype: bool

The result is a boolean Series with the same length as titanic_df: each entry is True if the passenger survived and False otherwise. In other words, we can immediately see for which rows the survived column is equal to 1. 

Now, to check the ‘age’ of those who survived, we can use the following command. Run the code below.  

In [20]:
survived_ages = titanic_df.loc[mask_survived, 'age'] 

survived_ages 

1       18.0
6       54.0
13      34.0
14      38.0
17       NaN
        ... 
1298     5.0
1301    18.0
1302    34.0
1305    54.0
1307    58.0
Name: age, Length: 500, dtype: float64
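
Masks can also be combined. The following is a minimal sketch on a hypothetical five-row DataFrame (`toy` and its values are invented), using the element-wise operators `&` (and), `|` (or) and `~` (not).

```python
import pandas as pd

toy = pd.DataFrame({
    "survived": [1, 0, 1, 1, 0],
    "sex": ["female", "male", "male", "female", "female"],
    "age": [29.0, 30.0, None, 2.0, 25.0],
})

mask_survived = toy["survived"] == 1
mask_female = toy["sex"] == "female"

# & keeps rows where both masks are True
female_survivor_ages = toy.loc[mask_survived & mask_female, "age"]

# ~ inverts a mask; summing a boolean Series counts the True values
non_survivor_count = (~mask_survived).sum()

print(list(female_survivor_ages))
print(non_survivor_count)
```

When the comparisons are written inline rather than stored in variables, each one needs its own parentheses, for example `(toy["survived"] == 1) & (toy["sex"] == "female")`.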

# 7 By group statistics <a name="7-By-group"></a>

We’ll now look at some statistics about the Titanic dataset. To find the median age of the Titanic survivors, we can use the [`Series.median()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.median.html) method. Run the code below.

In [21]:
survived_ages.median() 

28.0

But what if we want to compare the median age of both groups: survived (1) and not survived (0)?

The [`DataFrame.groupby()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) method can help us to compare across groups. It groups the rows by one or more columns, so that we can then apply any aggregation function (including mean, median, sum, count) to each group separately. Run the code below.

In [22]:
# this returns a special groupby object that contains information about the groups 

grouped_object = titanic_df.groupby(['survived']) 

In [23]:
# we then calculate the median for a single column 

grouped_object['age'].median() 

survived
0    28.0
1    28.0
Name: age, dtype: float64

We can also group by more than one column. Let’s check how the “women and children first” policy that was applied during the evacuation of the Titanic affected the median ages of those groups. Run the code below. 

In [24]:
titanic_df.groupby(['survived','sex'])['age'].median() 

survived  sex   
0         female    24.5
          male      29.0
1         female    28.5
          male      27.0
Name: age, dtype: float64
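
A grouped column can also be summarised with several aggregations at once through `agg()`. The sketch below runs on a hypothetical six-row DataFrame (`toy` and its values are invented for illustration, so the numbers say nothing about the real dataset).

```python
import pandas as pd

toy = pd.DataFrame({
    "survived": [1, 0, 1, 1, 0, 0],
    "sex": ["female", "male", "male", "female", "female", "male"],
    "age": [29.0, 30.0, None, 2.0, 25.0, 40.0],
})

# Several aggregations in one call; missing ages are ignored by all three
summary = toy.groupby("survived")["age"].agg(["median", "mean", "count"])
print(summary)

# The mean of a 0/1 column is the share of ones, i.e. the survival rate per group
survival_rate = toy.groupby("sex")["survived"].mean()
print(survival_rate)
```

Comparing `count` with the group sizes is a quick way to see how many ages are missing in each group.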

# 8 Sorting  <a name="8-Sorting"></a>

Lastly, if you need to sort the data based on a specific column, you can use the [`DataFrame.sort_values()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) method. Let’s look at reordering the data based on the age of the passengers. Run the code below. 

In [25]:
# this will sort the data by age in ascending order and display the first 10 rows

titanic_df.sort_values(by='age', ascending=True).head(10) 

,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,home.dest
202,3,1,"Dean, Miss. Elizabeth Gladys ""Millvina""",female,0.17,1,2,C.A. 2315,20.575,,S,"Devon, England Wichita, KS"
242,3,0,"Danbom, Master. Gilbert Sigvard Emanuel",male,0.33,0,2,347080,14.4,,S,"Stanton, IA"
912,3,1,"Thomas, Master. Assad Alexander",male,0.42,0,1,2625,8.5167,,C,
1277,2,1,"Hamalainen, Master. Viljo",male,0.67,1,1,250649,14.5,,S,"Detroit, MI"
495,3,1,"Baclini, Miss. Eugenie",female,0.75,2,1,2666,19.2583,,C,"Syria New York, NY"
627,3,0,"Peacock, Master. Alfred Edward",male,0.75,1,1,SOTON/O.Q. 3101315,13.775,,S,
428,3,1,"Baclini, Miss. Helene Barbara",female,0.75,2,1,2666,19.2583,,C,"Syria New York, NY"
637,3,1,"Aks, Master. Philip Frank",male,0.83,0,1,392091,9.35,,S,"London, England Norfolk, VA"
1084,2,1,"Caldwell, Master. Alden Gates",male,0.83,0,2,248738,29.0,,S,"Bangkok, Thailand / Roseville, IL"
1189,2,1,"Richards, Master. George Sibley",male,0.83,1,1,29106,18.75,,S,"Cornwall / Akron, OH"


We can also sort by multiple columns. Let’s sort by age and ticket fare. Run the code below. 

In [26]:
# this will sort the data first by age in ascending order, then rows with equal age will be sorted by fare in descending order 

titanic_df.sort_values(by=['age','fare'], ascending=[True,False]).head(10)

,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,home.dest
202,3,1,"Dean, Miss. Elizabeth Gladys ""Millvina""",female,0.17,1,2,C.A. 2315,20.575,,S,"Devon, England Wichita, KS"
242,3,0,"Danbom, Master. Gilbert Sigvard Emanuel",male,0.33,0,2,347080,14.4,,S,"Stanton, IA"
912,3,1,"Thomas, Master. Assad Alexander",male,0.42,0,1,2625,8.5167,,C,
1277,2,1,"Hamalainen, Master. Viljo",male,0.67,1,1,250649,14.5,,S,"Detroit, MI"
428,3,1,"Baclini, Miss. Helene Barbara",female,0.75,2,1,2666,19.2583,,C,"Syria New York, NY"
495,3,1,"Baclini, Miss. Eugenie",female,0.75,2,1,2666,19.2583,,C,"Syria New York, NY"
627,3,0,"Peacock, Master. Alfred Edward",male,0.75,1,1,SOTON/O.Q. 3101315,13.775,,S,
1084,2,1,"Caldwell, Master. Alden Gates",male,0.83,0,2,248738,29.0,,S,"Bangkok, Thailand / Roseville, IL"
1189,2,1,"Richards, Master. George Sibley",male,0.83,1,1,29106,18.75,,S,"Cornwall / Akron, OH"
637,3,1,"Aks, Master. Philip Frank",male,0.83,0,1,392091,9.35,,S,"London, England Norfolk, VA"
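
Two optional parameters of `sort_values()` are often useful with data like this: `na_position` controls where missing values end up, and `ignore_index` renumbers the rows after sorting. The sketch below uses a hypothetical four-row DataFrame (`toy` and its values are invented).

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "age": [0.75, None, 0.17, 0.75],
    "fare": [13.775, 8.05, 20.575, 19.2583],
})

# Missing ages are placed last by default; na_position="first" moves them to the top
nan_first = toy.sort_values(by="age", na_position="first")
print(nan_first["name"].iloc[0])

# Ties on age are broken by fare in descending order; ignore_index renumbers from 0
ranked = toy.sort_values(by=["age", "fare"], ascending=[True, False], ignore_index=True)
print(list(ranked["name"]))
print(list(ranked.index))
```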


<div class="warning" style='padding:0.1em; background-color:#e6ffff'>
<span>
<p style='margin:1em;'>
<b>Congratulations!</b></p>
<p style='margin:1em;'>
You’ve now learned some fundamental skills that will also contribute to the integrity of your coming work. To revise this activity, return to the Canvas page and read over the content under “What you’ll learn”.
</p>
<p style='margin-bottom:1em; margin-right:1em; text-align:right; font-family:Georgia'>
</p></span>
</div>