
# Library

In [1]:
import numpy as np  # numerical computing
import pandas as pd  # DataFrames
import seaborn as sns  # statistical visualization
import matplotlib.pyplot as plt

## About the dataset

The dataset used here comes from [this article](https://www.sciencedirect.com/science/article/pii/S2352340918315191) and can be downloaded [from Kaggle](https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand).


## Input Data

In [2]:
dataset = pd.read_csv("hotel_bookings.csv")
dataset.head()

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,...,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,...,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,...,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03


In [3]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   hotel                           119390 non-null  object 
 1   is_canceled                     119390 non-null  int64  
 2   lead_time                       119390 non-null  int64  
 3   arrival_date_year               119390 non-null  int64  
 4   arrival_date_month              119390 non-null  object 
 5   arrival_date_week_number        119390 non-null  int64  
 6   arrival_date_day_of_month       119390 non-null  int64  
 7   stays_in_weekend_nights         119390 non-null  int64  
 8   stays_in_week_nights            119390 non-null  int64  
 9   adults                          119390 non-null  int64  
 10  children                        119386 non-null  float64
 11  babies                          119390 non-null  int64  
 12  meal            

## Cleaning the Data

### Checking for missing values


In [4]:
# Check the percentage of missing values in each column
dataset.isna().mean().round(3).mul(100)

Unnamed: 0,0
hotel,0.0
is_canceled,0.0
lead_time,0.0
arrival_date_year,0.0
arrival_date_month,0.0
arrival_date_week_number,0.0
arrival_date_day_of_month,0.0
stays_in_weekend_nights,0.0
stays_in_week_nights,0.0
adults,0.0
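
When a frame has many columns, it can help to keep only the columns that actually contain missing values and sort them by share. The following is a minimal sketch on a small made-up frame; the `toy` frame and its values are assumptions for illustration, not the hotel data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the hotel data
toy = pd.DataFrame({
    "adults": [2, 2, 1, 1],
    "children": [0.0, np.nan, 0.0, 1.0],
    "agent": [np.nan, np.nan, 304.0, np.nan],
})

# Share of missing values per column, as a percentage
missing_pct = toy.isna().mean().mul(100)

# Keep only columns with at least one missing value, largest share first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

Filtering first keeps the report short, which matters on a frame with 32 columns where most are complete.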


### Filling missing values with a fixed value

In [5]:
dataset["children"] = dataset["children"].fillna(dataset["children"].median())

In [6]:
dataset["children"].isna().sum()

np.int64(0)
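
The same median fill can be traced on a small series. This is only a sketch: the values below are invented, and the median is one possible fill value among several (a constant or the mode would work the same way).

```python
import numpy as np
import pandas as pd

# Hypothetical values with one gap
children = pd.Series([0.0, 2.0, np.nan, 0.0, 1.0, 0.0], name="children")

# median() skips NaN by default, so it is computed from the five known values
median_value = children.median()

# fillna returns a new Series; the original is left unchanged
filled = children.fillna(median_value)
print(filled)
```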

### Changing the data type of the children column

In [7]:
dataset["children"] = dataset["children"].astype("int")
dataset["children"].head()

Unnamed: 0,children
0,0
1,0
2,0
3,0
4,0
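
The cast to `int` works here only because the missing values were filled first. A short sketch of what happens otherwise, using an invented series: casting a float column that still contains `NaN` raises an error, while pandas' nullable `"Int64"` dtype keeps the gap as `<NA>`.

```python
import numpy as np
import pandas as pd

# Hypothetical column that still has a missing value
raw = pd.Series([0.0, 1.0, np.nan, 2.0], name="children")

# A plain integer dtype cannot represent NaN, so the cast fails
try:
    raw.astype("int64")
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True

# The nullable integer dtype "Int64" (capital I) stores the gap as <NA>
nullable = raw.astype("Int64")
print(cast_failed)
print(nullable)
```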


### Checking for and dropping duplicate rows

In [8]:
dataset.duplicated().sum()

np.int64(31994)

In [9]:
# Take an explicit copy so that new columns can be added to dataset1 safely
dataset1 = dataset.drop_duplicates().copy()
dataset1.duplicated().sum()

np.int64(0)
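
`duplicated` and `drop_duplicates` compare whole rows by default. The sketch below, on an invented `bookings` frame, shows two options that are often useful before dropping anything: `keep=False` to inspect every member of a duplicate group, and `subset=` to compare only selected columns.

```python
import pandas as pd

# Hypothetical bookings; rows 0 and 1 are identical in every column
bookings = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "City Hotel"],
    "lead_time": [10, 10, 10, 25],
    "adr": [80.0, 80.0, 80.0, 95.0],
})

# Default: only the second and later occurrences are flagged
n_full_duplicates = int(bookings.duplicated().sum())

# keep=False flags every row that belongs to a duplicate group
duplicate_groups = bookings[bookings.duplicated(keep=False)]

# subset= restricts the comparison to the listed columns
deduped_on_price = bookings.drop_duplicates(subset=["lead_time", "adr"], keep="first")

print(n_full_duplicates)
print(duplicate_groups)
print(deduped_on_price)
```

Whether identical rows are true duplicates or separate bookings that happen to match is a judgment about the data, so inspecting the groups first is usually worthwhile.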

## Transforming the Data

### Creating a new column

In [10]:
# total_penghuni: total number of occupants per booking
dataset1["total_penghuni"] = dataset1["adults"] + dataset1["babies"] + dataset1["children"]


In [11]:
dataset1["total_penghuni"]

Unnamed: 0,total_penghuni
0,2
1,2
2,1
3,1
4,2
...,...
119385,2
119386,3
119387,2
119388,2
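
An alternative to assigning a column in place is `DataFrame.assign`, which returns a new frame and leaves the original untouched. A minimal sketch on an invented `guests` frame:

```python
import pandas as pd

# Hypothetical guest counts
guests = pd.DataFrame({
    "adults": [2, 1, 2],
    "children": [0, 0, 1],
    "babies": [0, 1, 0],
})

# sum(axis=1) adds the selected columns row by row
with_total = guests.assign(
    total_penghuni=guests[["adults", "children", "babies"]].sum(axis=1)
)
print(with_total)
```

Because `assign` does not modify `guests`, it chains well with other transformations.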


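### Encoding a categorical column

In [12]:
# get_dummies returns a DataFrame; select the one remaining indicator column explicitly
dataset1["is_resort"] = pd.get_dummies(dataset1["hotel"], drop_first=True)["Resort Hotel"]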
dataset1["is_resort"].head()


Unnamed: 0,is_resort
0,True
1,True
2,True
3,True
4,True
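
To see what `drop_first` does, the sketch below runs `pd.get_dummies` on an invented three-row column. Categories are ordered alphabetically, so "City Hotel" comes first and is the one dropped, leaving a single indicator for "Resort Hotel".

```python
import pandas as pd

# Hypothetical column with the two hotel types
hotels = pd.DataFrame({"hotel": ["Resort Hotel", "City Hotel", "Resort Hotel"]})

# One indicator column per category, in sorted category order
dummies = pd.get_dummies(hotels["hotel"])

# drop_first=True removes the first category and keeps the rest
reduced = pd.get_dummies(hotels["hotel"], drop_first=True)

# Cast to int so the result is 0/1 regardless of the pandas version's default dtype
is_resort = reduced["Resort Hotel"].astype(int)

print(dummies.columns.tolist())
print(is_resort.tolist())
```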


### Creating a new category (discretization)

In [13]:
def lead_time_category(x):
  # Bucket lead time (in days) into approximate months
  if x <= 30:
    return "1mo"
  elif x <= 60:
    return "2mo"
  elif x <= 90:
    return "3mo"
  else:
    return ">3mo"

dataset1["lead_time_category"] = dataset1["lead_time"].apply(lead_time_category)
dataset1["lead_time_category"].head()


Unnamed: 0,lead_time_category
0,>3mo
1,>3mo
2,1mo
3,1mo
4,1mo
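
Binning with a Python function and `apply` can also be written with `pd.cut`, which is vectorized and returns an ordered categorical. This is a sketch under assumptions: the bin edges at 30, 60, and 90 days, the labels, and the sample values are chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical lead times in days, including values on the bin edges
lead_time = pd.Series([0, 30, 31, 60, 61, 90, 91, 342])

# Bins are right-closed by default, so 30 falls in (-inf, 30] and 31 in (30, 60]
lead_time_category = pd.cut(
    lead_time,
    bins=[-np.inf, 30, 60, 90, np.inf],
    labels=["1mo", "2mo", "3mo", ">3mo"],
)
print(lead_time_category)
```

Because the result is an ordered categorical, sorting and grouping follow the bin order rather than alphabetical order.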


### Remapping categories

In [14]:
dataset1["co_status"] = dataset1["reservation_status"].map({"Check-Out" : "Success",
                                    "Canceled": "Not Success",
                                    "No-Show" : "Not Success"})
dataset1["co_status"].head()


Unnamed: 0,co_status
0,Success
1,Success
2,Success
3,Success
4,Success
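
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so it is worth checking for unmapped values afterwards. A minimal sketch; the "Pending" status is invented to trigger that case and does not appear in the dataset description above.

```python
import pandas as pd

# Hypothetical statuses; "Pending" has no entry in the mapping
status = pd.Series(["Check-Out", "Canceled", "No-Show", "Pending"])

mapping = {
    "Check-Out": "Success",
    "Canceled": "Not Success",
    "No-Show": "Not Success",
}

mapped = status.map(mapping)

# Values that were silently turned into NaN
unmapped = status[mapped.isna()].unique().tolist()

# value_counts also exposes misspelled target labels as extra categories
counts = mapped.value_counts()

print(unmapped)
print(counts)
```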


### Rescaling data

In [15]:
# Min-max scaling: map adr onto the range [0, 1]
dataset1["adr_scaled"] = dataset1[["adr"]].apply(lambda x: (x - x.min()) / (x.max() - x.min()))

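
Min-max scaling subtracts the column minimum and divides by the range, so the smallest value becomes 0 and the largest becomes 1. A small sketch with invented `adr` values makes the result easy to verify:

```python
import pandas as pd

# Hypothetical average daily rates
adr = pd.Series([0.0, 75.0, 98.0, 150.0], name="adr")

# (x - min) / (max - min): minimum maps to 0, maximum maps to 1
adr_scaled = (adr - adr.min()) / (adr.max() - adr.min())
print(adr_scaled)
```

If every value in the column is the same, the range is zero and the result is `NaN`, so a constant column needs separate handling.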


### Aggregating data

In [16]:
by_country = dataset1.groupby("country")[["lead_time", "total_penghuni", "adr"]].mean()
by_country.head()

country,lead_time,total_penghuni,adr
ABW,126.0,2.5,128.34
AGO,23.897661,1.836257,117.970029
AIA,0.0,4.0,265.0
ALB,93.272727,1.909091,85.203636
AND,47.0,2.714286,202.652857
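
A group mean says nothing about how many rows went into it. The sketch below uses named aggregation to report a count next to the other statistics; the `bookings` frame and the output column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical bookings from two countries
bookings = pd.DataFrame({
    "country": ["PRT", "PRT", "GBR", "GBR", "GBR"],
    "lead_time": [10, 30, 5, 15, 40],
    "adr": [80.0, 100.0, 90.0, 110.0, 100.0],
})

# Named aggregation: new_column=(source_column, function)
summary = bookings.groupby("country").agg(
    n_bookings=("lead_time", "count"),
    mean_lead_time=("lead_time", "mean"),
    median_adr=("adr", "median"),
)
print(summary)
```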


### Pivot

In [17]:
pivot_lead_time = dataset1.pivot_table(index = "reservation_status",
                     columns = "hotel",
                     values = "lead_time",
                     aggfunc = "mean")
pivot_lead_time.head()

reservation_status,City Hotel,Resort Hotel
Canceled,104.133325,115.775616
Check-Out,67.387999,73.999115
No-Show,50.759358,60.011278
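
A pivot table with a single aggregation can be read as a `groupby` on two keys followed by `unstack`. The sketch below checks this on an invented frame; it is meant to illustrate the relationship, not to replace the call above.

```python
import pandas as pd

# Hypothetical bookings
bookings = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel", "City Hotel"],
    "reservation_status": ["Canceled", "Check-Out", "Canceled", "Check-Out", "Canceled"],
    "lead_time": [100, 50, 120, 70, 110],
})

# Rows: reservation_status, columns: hotel, cells: mean lead_time
pivoted = bookings.pivot_table(
    index="reservation_status",
    columns="hotel",
    values="lead_time",
    aggfunc="mean",
)

# Same table: group by both keys, then move "hotel" from the index to the columns
grouped = (
    bookings.groupby(["reservation_status", "hotel"])["lead_time"]
    .mean()
    .unstack("hotel")
)

print(pivoted)
print(grouped)
```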


### Melt

In [18]:
pivot_lead_time.melt()

Unnamed: 0,hotel,value
0,City Hotel,104.133325
1,City Hotel,67.387999
2,City Hotel,50.759358
3,Resort Hotel,115.775616
4,Resort Hotel,73.999115
5,Resort Hotel,60.011278
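
Called without arguments, `melt` discards the index, so the output above no longer shows which reservation status each value belongs to. One way to keep it is to reset the index and pass it as `id_vars`. In this sketch the frame is rebuilt by hand from rounded values of the pivot output, and the column name `mean_lead_time` is an assumption.

```python
import pandas as pd

# Hand-built stand-in for pivot_lead_time (values rounded from the pivot output)
pivot_lead_time = pd.DataFrame(
    {"City Hotel": [104.1, 67.4, 50.8], "Resort Hotel": [115.8, 74.0, 60.0]},
    index=pd.Index(["Canceled", "Check-Out", "No-Show"], name="reservation_status"),
)
pivot_lead_time.columns.name = "hotel"

# reset_index turns reservation_status into a regular column that melt can keep
long_format = pivot_lead_time.reset_index().melt(
    id_vars="reservation_status",
    var_name="hotel",
    value_name="mean_lead_time",
)
print(long_format)
```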
