## Investigate the following Titanic data sample as follows:
1. Inspect it with head, tail, sample, and info. What observations can you make?
2. Produce a statistical summary and extract the information that follows from your observations.
3. Check whether there are duplicates and decide how to handle them.
4. Check whether there are missing values, what percentage they represent if any, and how to handle them.

## Import Libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

## 1. Data Observation

In [None]:
# load the data
df = pd.read_excel('titanic.xlsx')

df.head()  # show the first 5 rows

Unnamed: 0,survived,name,sex,age
0,1,"Allen, Miss. Elisabeth Walton",female,29.0
1,1,"Allison, Master. Hudson Trevor",male,0.9167
2,0,"Allison, Miss. Helen Loraine",female,2.0
3,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0
4,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0


In [None]:
# show the last 5 rows
df.tail()

Unnamed: 0,survived,name,sex,age
495,1,"Mallet, Mrs. Albert (Antoinette Magnin)",female,24.0
496,0,"Mangiavacchi, Mr. Serafino Emilio",male,
497,0,"Matthews, Mr. William John",male,30.0
498,0,"Maybery, Mr. Frank Hubert",male,40.0
499,0,"McCrae, Mr. Arthur Gordon",male,32.0


In [None]:
# draw 5 random rows
df.sample(5)

Unnamed: 0,survived,name,sex,age
294,0,"Thayer, Mr. John Borland",male,49.0
194,0,"Maguire, Mr. John Edward",male,30.0
483,1,"Lehmann, Miss. Bertha",female,17.0
79,1,"Cornell, Mrs. Robert Clifford (Malvina Helen L...",female,55.0
496,0,"Mangiavacchi, Mr. Serafino Emilio",male,


In [None]:
# check the structure of the Titanic dataset (columns, non-null counts, dtypes)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   survived  500 non-null    int64  
 1   name      500 non-null    object 
 2   sex       500 non-null    object 
 3   age       451 non-null    float64
dtypes: float64(1), int64(1), object(2)
memory usage: 15.8+ KB


## 2. Statistical Summary

In [None]:
# show basic statistics for the numeric columns in the dataset
df.describe()

Unnamed: 0,survived,age
count,500.0,451.0
mean,0.54,35.917775
std,0.498897,14.766454
min,0.0,0.6667
25%,0.0,24.0
50%,1.0,35.0
75%,1.0,47.0
max,1.0,80.0
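
`df.describe()` only covers the numeric columns, so `name` and `sex` are not summarized above. The following is a minimal sketch, on a small hypothetical frame (`toy`) that merely reuses the notebook's column names, of how the text columns and the survival rate could be summarized as well. The values in `toy` are made up for illustration and are not results from `titanic.xlsx`.

```python
import pandas as pd

# Hypothetical toy frame that mirrors the column names of the Titanic sample
toy = pd.DataFrame({
    "survived": [1, 1, 0, 0],
    "name": ["A", "B", "C", "D"],
    "sex": ["female", "male", "female", "male"],
    "age": [29.0, 0.9167, 2.0, None],
})

# describe() on text-only columns reports count, unique, top, and freq
cat_summary = toy[["name", "sex"]].describe()

# Share of each category in `sex`
sex_share = toy["sex"].value_counts(normalize=True)

# `survived` is coded 0/1, so its mean is the survival rate
survival_rate = toy["survived"].mean()

print(cat_summary)
print(sex_share)
print(f"Survival rate: {survival_rate:.2f}")
```

Applied to the real `df`, the same calls would show how many distinct names exist and how the sample splits by sex, which complements the numeric summary.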


## 3. Check for Duplicates and Handle Them

In [None]:
# Count the duplicated rows
# (the first row of each duplicate group is not counted)
n_dups = df.duplicated().sum()
print(f"Number of duplicate rows: {n_dups}")

Number of duplicate rows: 1


In [None]:
# drop the duplicated rows, keeping the first occurrence of each
df.drop_duplicates(inplace=True)
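
Before dropping, it can help to look at the duplicated rows themselves. The sketch below uses a hypothetical toy frame (`toy`) with made-up values. It shows `duplicated(keep=False)`, which flags every member of a duplicate group, and a non-in-place `drop_duplicates()` followed by `reset_index(drop=True)`. Resetting the index is an optional choice the notebook does not make.

```python
import pandas as pd

# Hypothetical toy frame; rows 0 and 2 are exact duplicates
toy = pd.DataFrame({
    "survived": [1, 0, 1, 0],
    "name": ["Allen", "Baumann", "Allen", "Cairns"],
    "sex": ["female", "male", "female", "male"],
    "age": [29.0, None, 29.0, 40.0],
})

# keep=False marks every row of a duplicate group, not just the repeats
dup_groups = toy[toy.duplicated(keep=False)]
print(dup_groups)

# Keep the first occurrence of each group and renumber the index
deduped = toy.drop_duplicates().reset_index(drop=True)
print(f"Rows before: {len(toy)}, after: {len(deduped)}")
```

Without `reset_index`, the dropped row leaves a gap in the index labels, which is harmless but worth knowing when rows are later selected by label.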

## 4. Check for Missing Values

In [None]:
# count the missing values in each column
missing = df.isnull().sum()
missing_percent = (missing / len(df)) * 100

# combine both pieces of information into one table
missing_info = pd.DataFrame({
    'Missing Values': missing,
    'Percentage (%)': missing_percent
})
display(missing_info)

# show the rows that have a missing value in any column
rows_with_null = df[df.isnull().any(axis=1)]
print(f"Number of rows with at least one missing value: {len(rows_with_null)}")
display(rows_with_null)

Unnamed: 0,Missing Values,Percentage (%)
survived,0,0.0
name,0,0.0
sex,0,0.0
age,49,9.819639


Number of rows with at least one missing value: 49


Unnamed: 0,survived,name,sex,age
15,0,"Baumann, Mr. John D",male,
37,1,"Bradley, Mr. George (""George Arthur Brayton"")",male,
40,0,"Brewe, Dr. Arthur Jackson",male,
46,0,"Cairns, Mr. Alexander",male,
59,1,"Cassebeer, Mrs. Henry Arthur Jr (Eleanor Genev...",female,
69,1,"Chibnall, Mrs. (Edith Martha Bowerman)",female,
70,0,"Chisholm, Mr. Roderick Robert Crispin",male,
74,0,"Clifford, Mr. George Quincy",male,
80,0,"Crafton, Mr. John Bertram",male,
106,0,"Farthing, Mr. John",male,


In [None]:
# drop the rows that contain null values
df = df.dropna()
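
Dropping rows removes roughly 10% of the sample, since only `age` has missing values. An alternative, which is not what this notebook does, is to impute `age` instead. The sketch below assumes median imputation is acceptable for the analysis. It uses a hypothetical toy frame (`toy`) with made-up values, and the `age_was_missing` flag column is an illustrative addition.

```python
import pandas as pd

# Hypothetical toy frame with one missing age
toy = pd.DataFrame({
    "sex": ["female", "male", "male", "female", "male"],
    "age": [20.0, 30.0, None, 40.0, 50.0],
})

imputed = toy.copy()

# Record which rows were imputed so the information is not lost
imputed["age_was_missing"] = imputed["age"].isna()

# median() skips NaN by default
median_age = imputed["age"].median()
imputed["age"] = imputed["age"].fillna(median_age)

print(imputed)
```

Whether to drop or impute depends on the later analysis: dropping keeps only observed ages, while imputing keeps every passenger at the cost of pulling the age distribution toward its median.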