### Notes:
Below is a completed and cleaned-up version of the "data cleaning" that we performed on the titanic dataset.  Place this notebook in the student files directory for our class so that it can find the titanic.csv data file in the resources directory of the student files.  See the very bottom for our final version of this dataframe.
<br><br>
Also, per the question asked at the end of class on day 4, here are some resources on creating a Python project structure.  The following page discusses the src and flat layouts of a typical Python project (the root page of that site is worth a look as well):  https://packaging.python.org/en/latest/discussions/src-layout-vs-flat-layout/
<br><br>
The Poetry tool can perform a number of tasks, including creating virtual environments and building a project structure for you.  Check out this resource to get a feel for the project structure that Poetry provides:
https://python-poetry.org/docs/basic-usage/


In [1]:
import numpy as np
import pandas as pd

In [2]:
# load the titanic data file into a dataframe
titanic = pd.read_csv('resources/titanic.csv')

In [3]:
titanic.shape         # check its shape (rows, cols)

(891, 15)

In [4]:
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [5]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   survived     891 non-null    int64  
 1   pclass       891 non-null    int64  
 2   sex          891 non-null    object 
 3   age          714 non-null    float64
 4   sibsp        891 non-null    int64  
 5   parch        891 non-null    int64  
 6   fare         891 non-null    float64
 7   embarked     889 non-null    object 
 8   class        891 non-null    object 
 9   who          891 non-null    object 
 10  adult_male   891 non-null    bool   
 11  deck         203 non-null    object 
 12  embark_town  889 non-null    object 
 13  alive        891 non-null    object 
 14  alone        891 non-null    bool   
dtypes: bool(2), float64(2), int64(4), object(7)
memory usage: 92.4+ KB


In [6]:
# view the number of "NaNs" (missing values) in the various columns
pd.isnull(titanic).sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
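
The raw counts above are easier to judge as a share of all rows (for example, deck is missing in 688 of 891 rows, roughly 77%).  A minimal sketch of that calculation follows; the `toy` frame and its values are made up for illustration and stand in for the titanic dataframe.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the titanic dataframe
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# isnull() gives True/False per cell; the mean of booleans is the fraction of True values
missing_share = toy.isnull().mean()
print(missing_share)
```

A column with a very high missing share (like deck) is a candidate for dropping, while a column with a small share (like age) is usually worth imputing instead.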

In [7]:
# Too many missing values in the "deck" column.  Let's drop it...
titanic.drop(labels=['deck'], axis=1, inplace=True)

In [8]:
# drop the 2 rows containing NaNs in the embarked column
titanic.dropna(subset=['embarked'], inplace=True)

In [9]:
# Use value replacement encoding to encode the embarked column
titanic.replace({'embarked': {'S': 0, 'C': 1, 'Q': 2}}, inplace=True)



In [10]:
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,0,Third,man,True,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,1,First,woman,False,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,0,Third,woman,False,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,0,First,woman,False,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,0,Third,man,True,Southampton,no,True


In [11]:
# Examine the unique values in the embarked column
titanic.embarked.unique()

array([0, 1, 2], dtype=int64)

In [12]:
titanic.embarked.value_counts()

embarked
0    644
1    168
2     77
Name: count, dtype: int64
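
As a side note, the same value replacement encoding can be sketched with `Series.map` and a dictionary.  This is only an alternative illustration on a small made-up Series (`ports` is a hypothetical name), not the method used in the notebook above.

```python
import pandas as pd

# Hypothetical sample of embarkation codes
ports = pd.Series(["S", "C", "S", "Q"], name="embarked")

# Same mapping as in the notebook: S -> 0, C -> 1, Q -> 2
codes = {"S": 0, "C": 1, "Q": 2}

# map() looks up each value in the dictionary; values missing from it become NaN
encoded = ports.map(codes)
print(encoded.tolist())
```

Because unmapped values turn into NaN, it is worth checking `encoded.isnull().sum()` afterwards when the mapping might not cover every value.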

In [13]:
# Encode the embark_town column using "categorical encoding"
titanic.embark_town = titanic.embark_town.astype('category')

In [14]:
titanic.embark_town = titanic.embark_town.cat.codes

In [15]:
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,0,Third,man,True,2,no,False
1,1,1,female,38.0,1,0,71.2833,1,First,woman,False,0,yes,False
2,1,3,female,26.0,0,0,7.925,0,Third,woman,False,2,yes,True
3,1,1,female,35.0,1,0,53.1,0,First,woman,False,2,yes,False
4,0,3,male,35.0,0,0,8.05,0,Third,man,True,2,no,True
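
Once the column is overwritten with its codes, the original labels are gone, so it can help to record which code stands for which town first.  The sketch below shows one way to do that on a small made-up Series (`towns` is a hypothetical name); for string labels pandas orders the categories alphabetically, which is why Southampton shows up as 2 above.

```python
import pandas as pd

# Hypothetical sample of embark_town values
towns = pd.Series(
    ["Southampton", "Cherbourg", "Queenstown", "Southampton"]
).astype("category")

# cat.categories holds the labels; the position of each label is its code
mapping = dict(enumerate(towns.cat.categories))
print(mapping)

# cat.codes holds the integer code for each row
town_codes = towns.cat.codes
print(town_codes.tolist())
```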


In [16]:
# Impute (fill in) the missing values in the age column.  First compute the mean(), then the median(), of that column.
age_mean = titanic.age.mean()
age_mean

29.64209269662921

In [17]:
age_median = titanic.age.median()
age_median

28.0
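
The mean and median differ here because a few large values pull the mean upward.  The small sketch below uses made-up ages (not the titanic data) to show the effect and to show imputing with the median.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one large value and one missing value
ages = pd.Series([20.0, 22.0, 24.0, 80.0, np.nan])

# mean() and median() both skip NaN by default
toy_mean = ages.mean()      # pulled upward by the 80.0
toy_median = ages.median()  # middle of the non-missing values

# fillna() returns a new Series with the NaN replaced
filled = ages.fillna(toy_median)
print(toy_mean, toy_median)
print(filled.tolist())
```

When a column is skewed like this, the median is often the more representative fill value, although either choice is a judgment call.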

In [18]:
# Assign the filled column back to the dataframe.  Calling fillna(..., inplace=True)
# on titanic.age acts on an intermediate copy and raises a FutureWarning in recent pandas.
titanic['age'] = titanic['age'].fillna(age_median)


In [19]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
Index: 889 entries, 0 to 890
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   survived     889 non-null    int64  
 1   pclass       889 non-null    int64  
 2   sex          889 non-null    object 
 3   age          889 non-null    float64
 4   sibsp        889 non-null    int64  
 5   parch        889 non-null    int64  
 6   fare         889 non-null    float64
 7   embarked     889 non-null    int64  
 8   class        889 non-null    object 
 9   who          889 non-null    object 
 10  adult_male   889 non-null    bool   
 11  embark_town  889 non-null    int8   
 12  alive        889 non-null    object 
 13  alone        889 non-null    bool   
dtypes: bool(2), float64(2), int64(5), int8(1), object(4)
memory usage: 85.9+ KB


In [20]:
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,0,Third,man,True,2,no,False
1,1,1,female,38.0,1,0,71.2833,1,First,woman,False,0,yes,False
2,1,3,female,26.0,0,0,7.925,0,Third,woman,False,2,yes,True
3,1,1,female,35.0,1,0,53.1,0,First,woman,False,2,yes,False
4,0,3,male,35.0,0,0,8.05,0,Third,man,True,2,no,True


In [21]:
# complete the encoding process on the remaining columns and delete any columns with redundant information...
titanic.drop(labels=['alive', 'alone', 'embark_town', 'adult_male', 'class', 'sex'], axis=1, inplace=True)

In [22]:
titanic.who = titanic.who.astype('category')
titanic.who = titanic.who.cat.codes

In [23]:
titanic.head()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,embarked,who
0,0,3,22.0,1,0,7.25,0,1
1,1,1,38.0,1,0,71.2833,1,2
2,1,3,26.0,0,0,7.925,0,2
3,1,1,35.0,1,0,53.1,0,2
4,0,3,35.0,0,0,8.05,0,1


Above is our final cleaned-up version of the titanic dataset!
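
For reference, the steps above could be gathered into one function along the following lines.  This is a sketch under assumptions, not a definitive version: the name `clean_titanic` and the five-row `sample` frame are made up for illustration, and `Series.map` is swapped in for `replace` when encoding embarked.

```python
import numpy as np
import pandas as pd

def clean_titanic(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps from this notebook and return a new dataframe."""
    out = df.drop(columns=["deck"])                  # too many missing values
    out = out.dropna(subset=["embarked"]).copy()     # drop rows with no embarked value
    out["embarked"] = out["embarked"].map({"S": 0, "C": 1, "Q": 2})
    out["age"] = out["age"].fillna(out["age"].median())
    # drop columns that repeat information held elsewhere
    out = out.drop(columns=["alive", "alone", "embark_town",
                            "adult_male", "class", "sex"])
    out["who"] = out["who"].astype("category").cat.codes
    return out

# Hypothetical five-row sample with the same columns as titanic.csv
sample = pd.DataFrame({
    "survived": [0, 1, 1, 1, 0],
    "pclass": [3, 1, 3, 2, 3],
    "sex": ["male", "female", "female", "female", "male"],
    "age": [22.0, 38.0, np.nan, 4.0, 2.0],
    "sibsp": [1, 1, 0, 1, 3],
    "parch": [0, 0, 0, 1, 1],
    "fare": [7.25, 71.2833, 7.925, 20.0, 21.075],
    "embarked": ["S", "C", "Q", np.nan, "S"],
    "class": ["Third", "First", "Third", "Second", "Third"],
    "who": ["man", "woman", "woman", "child", "child"],
    "adult_male": [True, False, False, False, False],
    "deck": [np.nan, "C", np.nan, np.nan, np.nan],
    "embark_town": ["Southampton", "Cherbourg", "Queenstown", np.nan, "Southampton"],
    "alive": ["no", "yes", "yes", "yes", "no"],
    "alone": [False, False, True, False, False],
})

cleaned = clean_titanic(sample)
print(cleaned)
```

Note that `cat.codes` numbers only the categories actually present, so the codes for who match the notebook (child 0, man 1, woman 2) only when all three values appear in the data.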