Scenario: the last voyage of the Titanic
As an eager marine archaeologist, you have an unusually keen interest in maritime disasters. Late one night while scrolling between images of whale bones and ancient scrolls about Atlantis, you come across a public dataset listing people known to be on the Titanic during its first – and last – voyage. Captivated by the balance between fate and chance, you ponder: what factors dictated whether a person survived this famous shipwreck? Data from this period are slightly patchy, and a lot of information about certain passengers is unknown. You’ll need to find ways to patch up this data before analyzing it in full.

# Titanic Dataset - Find and Visualize Missing Data

It is quite common for datasets to have missing data, which can cause problems when we perform machine learning. Missing data can be hard to spot at first glance.

Recall our scenario - we have obtained a list of passengers on the Titanic's failed maiden voyage and would like to work out which factors predicted whether a passenger would survive. Our first task, which we'll perform here, is to check whether our dataset has missing information.
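To make "checking for missing information" concrete, here is a minimal sketch of one common way to do it in pandas. The `toy` table and its column names are invented for illustration and are not taken from the Titanic file; the only assumption is that missing cells are stored as `NaN` or `None`, which is how pandas usually represents them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records, made up for this illustration only
toy = pd.DataFrame({
    "name": ["Ada", "Ben", "Cy", "Dee"],
    "age": [31.0, np.nan, 47.0, np.nan],
    "cabin": ["A1", None, None, None],
})

# isnull() marks every missing cell as True; summing counts them per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Taking the mean instead gives the fraction of rows missing in each column
missing_fraction = toy.isnull().mean()
print(missing_fraction)
```

A per-column count like this is usually quicker to read than scanning rows by eye, which is why missing values that look harmless in a preview can still be caught early.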

## Preparing data

Let's use Pandas to load the dataset and take a cursory look at it:


In [1]:
# Install missingno, a library for visualizing missing data
!pip install missingno

import pandas as pd

# Download the dataset and the graphing.py script from the course repository
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/titanic.csv
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/graphing.py

# Load data from our dataset file into a pandas dataframe
dataset = pd.read_csv('titanic.csv', index_col=False, sep=",", header=0)

# Let's take a look at the data
dataset.head()


Defaulting to user installation because normal site-packages is not writeable
Collecting missingno
  Using cached missingno-0.5.1-py3-none-any.whl (8.7 kB)
Collecting seaborn
  Downloading seaborn-0.12.2-py3-none-any.whl (293 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m293.3/293.3 kB[0m [31m1.2 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
Installing collected packages: seaborn, missingno
Successfully installed missingno-0.5.1 seaborn-0.12.2
--2023-01-24 09:46:34--  https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/titanic.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 61194 (60K) [text/plain]
Saving to: ‘titanic.csv’


2023-01-24 09:46:34 (1.30 MB/s) - ‘titanic.csv’ saved [6119

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
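
Even in these five rows, some `Cabin` fields are empty. As a rough sketch of how pandas treats such fields, the snippet below re-types a subset of the columns from the rows above into an in-memory CSV string (the `Name`, `SibSp`, `Parch`, `Ticket` and `Fare` columns are left out for brevity) and counts the missing values. The variable names `sample_csv`, `sample` and `missing_counts` are ours, chosen for illustration.

```python
import io

import pandas as pd

# A subset of the columns from the five rows shown above, typed in as CSV text
sample_csv = """PassengerId,Survived,Pclass,Sex,Age,Cabin,Embarked
1,0,3,male,22.0,,S
2,1,1,female,38.0,C85,C
3,1,3,female,26.0,,S
4,1,1,female,35.0,C123,S
5,0,3,male,35.0,,S
"""

# read_csv turns empty fields into NaN, pandas' marker for missing data
sample = pd.read_csv(io.StringIO(sample_csv))

# Count missing cells per column and keep only columns that have any
missing_counts = sample.isnull().sum()
print(missing_counts[missing_counts > 0])
```

Five rows are far too few to judge the whole dataset, so the same count would need to be run on the full `dataset` dataframe before drawing any conclusions.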
