## Investigate the following Titanic sample data as follows:
1. Inspect the data with head, tail, sample, and info. What can you observe?
2. Produce a statistical summary and extract the information it gives you.
3. Check whether there are duplicates and how to handle them.
4. Check whether there are missing values, what percentage they represent if any, and how to handle them.

## Import Libraries

In [61]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

## Load Data

In [62]:
# Load the data
df = pd.read_excel('titanic.xlsx')
data = df.copy()  # separate copy of the data as loaded
df.head()  # Show the first 5 rows

Unnamed: 0,survived,name,sex,age
0,1,"Allen, Miss. Elisabeth Walton",female,29.0
1,1,"Allison, Master. Hudson Trevor",male,0.9167
2,0,"Allison, Miss. Helen Loraine",female,2.0
3,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0
4,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0


In [63]:
df.tail()
# Show the last 5 rows

Unnamed: 0,survived,name,sex,age
495,1,"Mallet, Mrs. Albert (Antoinette Magnin)",female,24.0
496,0,"Mangiavacchi, Mr. Serafino Emilio",male,
497,0,"Matthews, Mr. William John",male,30.0
498,0,"Maybery, Mr. Frank Hubert",male,40.0
499,0,"McCrae, Mr. Arthur Gordon",male,32.0


In [64]:
df.sample(5)
# Show 5 random rows

Unnamed: 0,survived,name,sex,age
177,1,"Kimball, Mr. Edwin Nelson Jr",male,42.0
119,1,"Frauenthal, Dr. Henry William",male,50.0
147,0,"Harrington, Mr. Charles H",male,
253,1,"Ryerson, Mrs. Arthur Larned (Emily Maria Borie)",female,48.0
123,1,"Frolicher-Stehli, Mr. Maxmillian",male,60.0


In [65]:
df.info()
# Show general information about the dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   survived  500 non-null    int64  
 1   name      500 non-null    object 
 2   sex       500 non-null    object 
 3   age       451 non-null    float64
dtypes: float64(1), int64(1), object(2)
memory usage: 15.8+ KB


survived : Integer (int64). Holds 0 or 1, indicating whether the passenger did not survive (0) or survived (1).

name : Object (object). Holds the passenger's name.

sex : Object (object). Holds the passenger's sex.

age : Float (float64). Holds the passenger's age. Only 451 of the 500 entries are non-null, so this column has 49 missing values.
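
The missing-value count read off `info()` above can also be computed directly. The sketch below uses a small hypothetical frame (`toy`, not the real `titanic.xlsx`) with the same column names, and subtracts the non-null count from the row count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same column names as the Titanic sample
toy = pd.DataFrame({
    "survived": [1, 0, 1, 0],
    "name": ["A", "B", "C", "D"],
    "sex": ["female", "male", "male", "female"],
    "age": [29.0, np.nan, 2.0, np.nan],
})

non_null = toy.count()          # per-column non-null counts, as listed by info()
missing = len(toy) - non_null   # rows minus non-null entries = missing entries
print(missing)
```

On the real data the same subtraction would give 500 - 451 = 49 for `age`.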

## Statistical Summary

In [66]:
df.describe()

Unnamed: 0,survived,age
count,500.0,451.0
mean,0.54,35.917775
std,0.498897,14.766454
min,0.0,0.6667
25%,0.0,24.0
50%,1.0,35.0
75%,1.0,47.0
max,1.0,80.0


In [68]:
df.describe(include='object')

Unnamed: 0,name,sex
count,500,500
unique,499,2
top,"Eustis, Miss. Elizabeth Mussey",male
freq,2,288
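
Two readings of the summaries above: the mean of `survived` (0.54) is the share of passengers who survived, since the column only holds 0 and 1, and `male` is the most frequent value of `sex` (288 of 500 rows). The sketch below illustrates both readings on a small hypothetical frame. The per-sex breakdown at the end is an illustrative extra, not something computed in the notebook.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration
toy = pd.DataFrame({
    "survived": [1, 1, 0, 0, 1],
    "sex": ["female", "male", "male", "male", "female"],
})

survival_rate = toy["survived"].mean()               # share of rows equal to 1
sex_share = toy["sex"].value_counts(normalize=True)  # relative frequency per category
rate_by_sex = toy.groupby("sex")["survived"].mean()  # survival share within each sex

print(survival_rate)
print(sex_share)
print(rate_by_sex)
```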


## Duplicate Handling

In [69]:
len(df.drop_duplicates()) / len(df)

0.998
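
A ratio of 0.998 on 500 rows means 499 distinct rows, so one row is a redundant copy. As a minimal sketch on a hypothetical frame, the same information can be read as a count with `duplicated()`:

```python
import pandas as pd

# Hypothetical toy frame with one fully duplicated row
toy = pd.DataFrame({
    "name": ["A", "B", "B", "C"],
    "age": [30.0, 54.0, 54.0, 41.0],
})

unique_ratio = len(toy.drop_duplicates()) / len(toy)  # 3 distinct rows out of 4
n_redundant = int(toy.duplicated().sum())             # rows beyond the first occurrence

print(unique_ratio, n_redundant)
```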

In [70]:
# Show the duplicated rows
duplicates = df[df.duplicated(keep=False)]

# dropna=False keeps groups whose keys contain NaN (e.g. a missing age)
duplicate_counts = duplicates.groupby(list(df.columns), dropna=False).size().reset_index(name='jumlah_duplikat')

sorted_duplicates = duplicate_counts.sort_values(by='jumlah_duplikat', ascending=False)

print("Duplicated rows:")
sorted_duplicates


Duplicated rows:


Unnamed: 0,survived,name,sex,age,jumlah_duplikat
0,1,"Eustis, Miss. Elizabeth Mussey",female,54.0,2


In [71]:
# Remove duplicates
df = df.drop_duplicates()

In [72]:
len(df.drop_duplicates()) / len(df)

1.0
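
After `drop_duplicates()` the surviving rows keep their original index labels, so the index has a gap where the duplicate was removed. If a contiguous index is wanted, one option is the `ignore_index=True` argument. This is a sketch on a hypothetical frame, not a step taken in the notebook.

```python
import pandas as pd

# Hypothetical toy frame with one fully duplicated row
toy = pd.DataFrame({
    "name": ["A", "B", "B", "C"],
    "age": [30.0, 54.0, 54.0, 41.0],
})

deduped = toy.drop_duplicates()                      # keeps the original labels
renumbered = toy.drop_duplicates(ignore_index=True)  # relabels rows from 0

print(list(deduped.index))
print(list(renumbered.index))
```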

## Missing Value Handling

In [58]:
# Count missing values and their percentage per column
missing = df.isnull().sum()
missing_percent = (missing / len(df)) * 100

print("Number of missing values:")
print(missing)
print("\nPercentage of missing values:")
print(missing_percent)

# Handling: fill missing values in the 'age' column with the median.
# Assign the result back rather than calling df['age'].fillna(..., inplace=True):
# the chained inplace form acts on an intermediate object, raises a FutureWarning,
# and will no longer modify df in pandas 3.0.
median_age = df['age'].median()
df['age'] = df['age'].fillna(median_age)

print("\nMissing values in the 'age' column have been filled with the median:", median_age)


Number of missing values:
survived     0
name         0
sex          0
age         49
dtype: int64

Percentage of missing values:
survived    0.000000
name        0.000000
sex         0.000000
age         9.819639
dtype: float64

Missing values in the 'age' column have been filled with the median: 35.0
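
Filling every gap with one global median is only one way to handle the missing ages. As a hedged alternative, not the approach used in this notebook, the sketch below fills each gap with the median age of the passenger's own `sex` group and keeps a flag column recording which rows were imputed. The frame `toy` and the column name `age_missing` are hypothetical, and grouping by `sex` is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are made up for illustration
toy = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male"],
    "age": [20.0, np.nan, 30.0, 40.0, 50.0, np.nan],
})

# Record which rows are about to be imputed (hypothetical helper column)
toy["age_missing"] = toy["age"].isna()

# Median age per sex, broadcast back to every row of that group
group_median = toy.groupby("sex")["age"].transform("median")

# Assign back instead of using inplace=True on the column
toy["age"] = toy["age"].fillna(group_median)

print(toy)
```

Keeping the flag column makes it possible to tell imputed ages apart from observed ones in later analysis.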
