In this notebook we'll explore more features of the pandas library, using a dataset available in Python's Seaborn library that contains data about the passengers of the Titanic shipwreck.

In [0]:
import numpy as np
import pandas as pd
import seaborn as sns

In [0]:
titanic = sns.load_dataset("titanic")

Let's take a quick look at the DataFrame by displaying a few rows.

In [68]:
titanic.head(6)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
5,0,3,male,,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True


In [69]:
titanic.tail()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
886,0,2,male,27.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.45,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0,C,First,man,True,C,Cherbourg,yes,True
890,0,3,male,32.0,0,0,7.75,Q,Third,man,True,,Queenstown,no,True


**Let's see the ages of the passengers:**

In [70]:
titanic.age.unique()
# notice that we have some values < 1
# also we have a NaN value

array([22.  , 38.  , 26.  , 35.  ,   nan, 54.  ,  2.  , 27.  , 14.  ,
        4.  , 58.  , 20.  , 39.  , 55.  , 31.  , 34.  , 15.  , 28.  ,
        8.  , 19.  , 40.  , 66.  , 42.  , 21.  , 18.  ,  3.  ,  7.  ,
       49.  , 29.  , 65.  , 28.5 ,  5.  , 11.  , 45.  , 17.  , 32.  ,
       16.  , 25.  ,  0.83, 30.  , 33.  , 23.  , 24.  , 46.  , 59.  ,
       71.  , 37.  , 47.  , 14.5 , 70.5 , 32.5 , 12.  ,  9.  , 36.5 ,
       51.  , 55.5 , 40.5 , 44.  ,  1.  , 61.  , 56.  , 50.  , 36.  ,
       45.5 , 20.5 , 62.  , 41.  , 52.  , 63.  , 23.5 ,  0.92, 43.  ,
       60.  , 10.  , 64.  , 13.  , 48.  ,  0.75, 53.  , 57.  , 80.  ,
       70.  , 24.5 ,  6.  ,  0.67, 30.5 ,  0.42, 34.5 , 74.  ])

**If we want some additional information:**


In [71]:
titanic.describe(include="all")

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
count,891.0,891.0,891,714.0,891.0,891.0,891.0,889,891,891,891,203,889,891,891
unique,,,2,,,,,3,3,3,2,7,3,2,2
top,,,male,,,,,S,Third,man,True,C,Southampton,no,True
freq,,,577,,,,,644,491,537,537,59,644,549,537
mean,0.383838,2.308642,,29.699118,0.523008,0.381594,32.204208,,,,,,,,
std,0.486592,0.836071,,14.526497,1.102743,0.806057,49.693429,,,,,,,,
min,0.0,1.0,,0.42,0.0,0.0,0.0,,,,,,,,
25%,0.0,2.0,,20.125,0.0,0.0,7.9104,,,,,,,,
50%,0.0,3.0,,28.0,0.0,0.0,14.4542,,,,,,,,
75%,1.0,3.0,,38.0,1.0,0.0,31.0,,,,,,,,


The ```.describe()``` method is very useful for checking the consistency of the data.

The ```include="all"``` parameter also includes the non-numerical columns, such as the sex or the deck.
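
To see what ```include="all"``` changes, here is a minimal sketch on a small hypothetical DataFrame (the values are made up for illustration and are not the Titanic data). By default only the numerical column is summarised; with ```include="all"``` the text column gets its ```unique```, ```top``` and ```freq``` rows too.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic dataset
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0],
    "sex": ["male", "female", "female"],
})

# Default: only numerical columns are described
numeric_only = df.describe()

# include="all": non-numerical columns are described as well
everything = df.describe(include="all")

print(list(numeric_only.columns))
print(list(everything.columns))
print(everything.loc["top", "sex"], everything.loc["freq", "sex"])
```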

**The NaN value**

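NaN (Not a Number) is the marker pandas uses for a missing value. In this dataset it appears wherever a piece of information about a passenger, such as the age, was not recorded.

Arithmetic involving a NaN returns NaN, so Python's built-in ```sum()``` over a column with missing values also returns NaN. Note that pandas' own reductions, such as ```.sum()``` and ```.mean()```, skip NaN by default.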

In [72]:
titanic.age.head(6)
# here we can see that we have a NaN in the 6th row

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
5     NaN
Name: age, dtype: float64

In [73]:
# if we try to sum all the ages:
sum(titanic.age)
# we will have NaN due to the 6th row

nan
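
The difference between Python's built-in ```sum()``` and the pandas ```.sum()``` method can be sketched on a small hypothetical Series (made-up values standing in for ```titanic.age```):

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value
ages = pd.Series([22.0, 38.0, np.nan, 35.0])

builtin_total = sum(ages)               # built-in sum propagates the NaN
pandas_total = ages.sum()               # pandas skips NaN by default (skipna=True)
strict_total = ages.sum(skipna=False)   # ask pandas to propagate NaN as well
missing_count = ages.isna().sum()       # how many values are missing

print(builtin_total, pandas_total, strict_total, missing_count)
```

Counting the missing values with ```.isna().sum()``` first is a cheap way to decide whether they need cleaning at all.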

**Cleaning NaN values:**

The first method is to use the ```.fillna()``` function, which replaces all the NaN values with a given value:

In [74]:
# as we saw in the head, we have a NaN in the 'age' attribute in the 6th row
# so we will replace it by 0.0
titanic.fillna(value={"age":0.0})
# returns the DataFrame with all the NaN replaced by 0.0 in the column "age"

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,0.0,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [75]:
titanic.fillna(value={"age":0.0}).age.head(10)
# we can see that the NaN in the 6th row is replaced by 0.0

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
5     0.0
6    54.0
7     2.0
8    27.0
9    14.0
Name: age, dtype: float64

In [76]:
# now we can ask pandas to do operations on the column:
sum(titanic.fillna(value={"age":0.0}).age)

21205.17

**If** we don't have a value to replace the NaN with, we can use ```X.ffill()``` to replace it with the value before it. Older pandas versions spelled this ```X.fillna(method="pad")```, which is deprecated in recent releases.
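
Filling each NaN with the previous value (forward fill) can be sketched as follows. The Series is hypothetical, and ```.ffill()``` is used because recent pandas releases prefer it over the ```method="pad"``` argument. Filling with the column mean is shown as one possible alternative, not as a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with gaps, not the Titanic data
ages = pd.Series([22.0, np.nan, 26.0, np.nan, np.nan, 54.0])

# Forward fill: each NaN takes the last valid value before it
forward_filled = ages.ffill()

# Alternative: replace each NaN with the mean of the known values
mean_filled = ages.fillna(ages.mean())

print(forward_filled.tolist())
print(mean_filled.tolist())
```

A NaN at the very start of the column has no previous value, so forward fill leaves it untouched.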

**```.dropna()``` method:**

This function drops the rows or the columns that contain NaN (by default it drops the rows).

In [79]:
titanic.age.dropna().head(10)
# we can see that the 6th row is gone

0     22.0
1     38.0
2     26.0
3     35.0
4     35.0
6     54.0
7      2.0
8     27.0
9     14.0
10     4.0
Name: age, dtype: float64

In [83]:
titanic.dropna(axis="columns").head(7)
# here, all the columns which contain NaN are gone (including the 'age' column)


Unnamed: 0,survived,pclass,sex,sibsp,parch,fare,class,who,adult_male,alive,alone
0,0,3,male,1,0,7.25,Third,man,True,no,False
1,1,1,female,1,0,71.2833,First,woman,False,yes,False
2,1,3,female,0,0,7.925,Third,woman,False,yes,True
3,1,1,female,1,0,53.1,First,woman,False,yes,False
4,0,3,male,0,0,8.05,Third,man,True,no,True
5,0,3,male,0,0,8.4583,Third,man,True,no,True
6,0,1,male,0,0,51.8625,First,man,True,no,True
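
Dropping every row or column that has any NaN can throw away a lot of data. ```dropna``` also accepts a ```subset``` parameter to look at specific columns only. A minimal sketch on a hypothetical three-row DataFrame (made-up values):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'age' and 'deck' have gaps, 'fare' is complete
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0],
    "deck": [np.nan, "C", np.nan],
    "fare": [7.25, 71.28, 7.92],
})

# Default: drop every row containing at least one NaN
complete_rows = df.dropna()

# Only drop rows where 'age' is missing, ignoring NaN in other columns
rows_with_age = df.dropna(subset=["age"])

# Drop every column containing at least one NaN
complete_cols = df.dropna(axis="columns")

print(len(complete_rows), list(rows_with_age.index), list(complete_cols.columns))
```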


**Renaming a column**

In [86]:
titanic.rename(columns={"sex":"sexe"}).columns

Index(['survived', 'pclass', 'sexe', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')

**Dropping rows and columns**

In [87]:
# To drop a row:
titanic.drop(1).head(2)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True


In [88]:
# To drop a column:
titanic.drop(columns=["age","sex"]).head(2)

Unnamed: 0,survived,pclass,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False


**!!! IMPORTANT NOTE !!!**

The modifying functions above (```fillna```, ```dropna```, ```rename```, ```drop```), and several others not mentioned here, accept a boolean parameter named ```inplace```. If it is False, the function returns a modified copy and leaves the original DataFrame untouched. If it is True, the DataFrame itself is modified and the function returns None.

The parameter is False by default.
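
The effect of ```inplace``` can be sketched with ```rename``` on a small hypothetical DataFrame (made-up values):

```python
import pandas as pd

# Hypothetical toy data, not the Titanic dataset
df = pd.DataFrame({"sex": ["male", "female"], "age": [22.0, 38.0]})

# inplace=False (the default): a modified copy is returned, df is unchanged
renamed = df.rename(columns={"sex": "sexe"})
columns_after_copy = list(df.columns)

# inplace=True: df itself is modified and the call returns None
result = df.rename(columns={"sex": "sexe"}, inplace=True)
columns_after_inplace = list(df.columns)

print(columns_after_copy, columns_after_inplace, result)
```

Because the in-place call returns None, writing ```df = df.rename(..., inplace=True)``` would replace the DataFrame with None, so reassigning the returned copy is often the less error-prone style.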