# Data Science Module 1: Introduction to Pandas and Data Transformation


## Prerequisites
In this module we will get to know the pandas package, one of the core packages in data science, and use it to perform several data transformations.

First, the following Python modules are used throughout this practicum.

In [1]:
import numpy as np
import pandas as pd

If any of them are not installed yet, install them first using `pip`:

```default
!pip install numpy
!pip install pandas
```

or with `conda` if you are using Anaconda:

```default
conda install numpy
conda install pandas
```

## Series


A `pandas.Series` is very similar to a NumPy array (in fact, it is built on top of the NumPy array object). What distinguishes a Series from a NumPy array is that a Series can have index labels, so it can be indexed by label rather than only by integer position. In addition, a Series does not have to hold numeric data; it can hold arbitrary Python objects.
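As a minimal sketch of these two differences (the variable names and values are made up for illustration), the snippet below compares a plain array with a labeled Series and then stores mixed Python objects in a Series:

```python
import numpy as np
import pandas as pd

# A NumPy array is indexed by integer position only.
arr = np.array([10, 20, 30])
print(arr[1])

# A Series holds the same values plus one label per value.
labeled = pd.Series([10, 20, 30], index=["x", "y", "z"])
print(labeled["y"])      # access by label
print(labeled.iloc[1])   # access by position

# A Series may also hold arbitrary Python objects (illustrative values).
mixed = pd.Series([1, "two", [3, 4]], index=["int", "str", "list"])
print(mixed.dtype)       # pandas falls back to the generic "object" dtype
```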

### Creating a pd.Series from a list

The simplest way to create a `pd.Series` is from a Python `list`:

In [2]:
my_index= ['a','b','c','d','e']
my_data= [1,2,3,4,5]
my_series= pd.Series(data=my_data, index=my_index)

In [3]:
print(my_series)

a    1
b    2
c    3
d    4
e    5
dtype: int64


In [4]:
print(my_series.__class__)

<class 'pandas.core.series.Series'>


### Creating a pd.Series from a dictionary
We can also create a `pd.Series` from a `dictionary`; the keys become the index labels:

In [5]:
# creating a series from a dictionary
my_dict= {'a':1, 'b':2, 'c':3, 'd':4, 'e':5}
my_series_dict= pd.Series(my_dict)

In [6]:
print(my_series_dict)

a    1
b    2
c    3
d    4
e    5
dtype: int64
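One detail that may be useful here: when a dictionary is passed together with an explicit `index`, pandas selects and reorders the entries by label. The sketch below uses a made-up dictionary and a label (`"z"`) that is deliberately absent from it:

```python
import pandas as pd

prices = {"a": 1, "b": 2, "c": 3}  # illustrative data

# index= picks entries by label, in the order given;
# labels that are not keys of the dict get NaN.
subset = pd.Series(prices, index=["c", "a", "z"])
print(subset)
```

Because NaN is a floating-point value, the resulting Series has a float dtype even though the dictionary held integers.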


In [7]:
print(my_series_dict.__class__)

<class 'pandas.core.series.Series'>


### Operations on Series

In [8]:
# Imaginary Sales Data for 1st and 2nd Quarters for Global Company
q1 = {'Japan': 80, 'China': 450, 'India': 200, 'USA': 250}
q2 = {'Brazil': 100,'China': 500, 'India': 210,'USA': 260}


# Creating a Series from a Dictionary q1 and q2
q1_series= pd.Series(q1)
q2_series= pd.Series(q2)

In [9]:
print(q1_series)

Japan     80
China    450
India    200
USA      250
dtype: int64


We can index by label:

In [10]:
# call values of q1_series based on named index
print(q1_series['Japan'])
print(q1_series['China'])
print(q1_series['India'])

80
450
200


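We can still index by integer position. Use `.iloc` for this, because plain `q1_series[0]` on a label-indexed Series is deprecated in recent pandas versions:

In [11]:
# values of q1_series can also be accessed by positional index
print(q1_series.iloc[0])
print(q1_series.iloc[1])
print(q1_series.iloc[2])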

80
450
200
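To make the two access styles explicit, here is a short sketch using `.loc` (label-based) and `.iloc` (position-based) on the same Q1 data. Note the difference in slicing: a label slice includes both endpoints, while a positional slice excludes the end, as with Python lists.

```python
import pandas as pd

q1_series = pd.Series({"Japan": 80, "China": 450, "India": 200, "USA": 250})

by_label = q1_series.loc["China"]              # label-based lookup
by_position = q1_series.iloc[1]                # position-based lookup

last_two = q1_series.iloc[-2:]                 # positional slice
label_slice = q1_series.loc["China":"India"]   # label slice, both ends included

print(by_label, by_position)
print(last_two)
print(label_slice)
```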


Be careful when indexing by label: a `KeyError` is raised if the label does not exist in the `pd.Series`.

In [12]:
# remember that index labels are case sensitive
try:
    print(q1_series['japan'])
except KeyError:
    print('something went wrong')

something went wrong
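If a missing label is an expected situation rather than an error, one possible approach is to test membership first or to use `Series.get` with a default. A small sketch (the default value `0` is an arbitrary choice for illustration):

```python
import pandas as pd

q1_series = pd.Series({"Japan": 80, "China": 450, "India": 200, "USA": 250})

# `in` checks the index labels, not the values.
has_japan = "Japan" in q1_series
has_lowercase = "japan" in q1_series

# .get returns the default instead of raising KeyError.
value = q1_series.get("japan", default=0)

print(has_japan, has_lowercase, value)
```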


Simple arithmetic operations on a `pd.Series` are *broadcast*, that is, applied to each element:

In [13]:
# operations with arithmetic on series are broadcasted to all values
print(q1_series*2)

Japan    160
China    900
India    400
USA      500
dtype: int64


In [14]:
print(q1_series+1000)

Japan    1080
China    1450
India    1200
USA      1250
dtype: int64


When adding two `pd.Series`, values are matched by label. A label that appears in only one of the *series* still appears in the result, with the value NaN (*not a number*, here meaning that no data is available).

(NaN is a floating-point value, so the data type of the result is forced to become `float`.)

In [15]:
# operations between two Series are aligned by index label
print(q1_series+q2_series)

Brazil      NaN
China     950.0
India     410.0
Japan       NaN
USA       510.0
dtype: float64


Why not simply zero? A label that is absent from one of the *series* is treated as unknown data for that label, not as zero.
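The sketch below repeats the sum on the same Q1/Q2 data and shows two things: the change of dtype from integer to float, and how the unknown entries can be located with `isna()` or removed with `dropna()`.

```python
import pandas as pd

q1_series = pd.Series({"Japan": 80, "China": 450, "India": 200, "USA": 250})
q2_series = pd.Series({"Brazil": 100, "China": 500, "India": 210, "USA": 260})

total = q1_series + q2_series

print(q1_series.dtype, total.dtype)   # integer inputs, float result

missing = total[total.isna()]         # labels present in only one Series
complete = total.dropna()             # labels present in both

print(list(missing.index))
print(complete)
```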

If you want missing data to be treated as zero before the addition, use `add` with `fill_value`:

In [16]:
print(q1_series.add(q2_series, fill_value=0))

Brazil    100.0
China     950.0
India     410.0
Japan      80.0
USA       510.0
dtype: float64
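A subtlety of `fill_value` worth knowing: it only applies when a value is missing on one side. If a label is missing on both sides, the result stays NaN. The two Series below are illustrative, with NaN placed so that each case occurs once:

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 1, 1, np.nan], index=["a", "b", "c", "d"])
b = pd.Series([1, np.nan, 1, np.nan], index=["a", "b", "d", "e"])

result = a.add(b, fill_value=0)
print(result)
# "a": both present          -> 2.0
# "b": NaN in b only         -> 1.0
# "c": label absent from b   -> 1.0
# "d": NaN in a only         -> 1.0
# "e": absent from a, NaN in b -> stays NaN
```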


## DataFrame

A `pd.DataFrame` consists of several `pd.Series` that share the same index.

Suppose we have the following data.

In [17]:
my_data = np.array([
    [25, 59, 18],
    [75, 54, 65],
    [29, 21,  7],
    [32, 68, 16]
])

In [18]:
my_data

array([[25, 59, 18],
       [75, 54, 65],
       [29, 21,  7],
       [32, 68, 16]])

We will create a `pd.DataFrame` from this NumPy array. Note that we can give names to both the rows and the columns:

In [19]:
my_index= ["Toko A", "Toko B", "Toko C", "Toko D"]
my_columns= ["Apel", "Jeruk", "Pisang"]

df= pd.DataFrame(data=my_data, index=my_index, columns=my_columns)

In [20]:
df

Unnamed: 0,Apel,Jeruk,Pisang
Toko A,25,59,18
Toko B,75,54,65
Toko C,29,21,7
Toko D,32,68,16


In [21]:
df_2 = pd.DataFrame(data=my_data)
df_2

Unnamed: 0,0,1,2
0,25,59,18
1,75,54,65
2,29,21,7
3,32,68,16


In [22]:
df_3 = pd.DataFrame(data=my_data, columns=my_columns)
df_3

Unnamed: 0,Apel,Jeruk,Pisang
0,25,59,18
1,75,54,65
2,29,21,7
3,32,68,16
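Since a DataFrame is a collection of Series sharing one index, it can also be assembled from a dictionary of Series. The following sketch reuses the store names and the first two fruit columns from the example above; selecting a single column gives back a Series.

```python
import pandas as pd

stores = ["Toko A", "Toko B", "Toko C", "Toko D"]
apel = pd.Series([25, 75, 29, 32], index=stores)
jeruk = pd.Series([59, 54, 21, 68], index=stores)

# Each dict key becomes a column name; rows are aligned on the shared index.
df = pd.DataFrame({"Apel": apel, "Jeruk": jeruk})

column = df["Apel"]
print(type(column))
print(df.shape)
print(df.loc["Toko B", "Jeruk"])
```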


### Reading a CSV file as a `pd.DataFrame`



If your .py or .ipynb file is in exactly the same folder as the .csv file you want to read, simply pass the file name as a string, for example:

 `df = pd.read_csv('some_file.csv')`

Provide the full file path if the file is in a different directory. The path must be exactly right for this to work. For example:

 `df = pd.read_csv("C:\\Users\\myself\\files\\some_file.csv")`
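As a hedged alternative to hand-written path strings, `pathlib.Path` builds paths that work across operating systems, and `pd.read_csv` accepts such objects. To keep the sketch self-contained it first writes a tiny CSV into a temporary directory; the file content and column names are invented for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Join folder and file name without worrying about "/" versus "\\".
    csv_path = Path(tmp_dir) / "some_file.csv"
    csv_path.write_text("name,score\nA,1\nB,2\n")  # illustrative content

    df = pd.read_csv(csv_path)

print(df)
```

On Windows, a raw string such as `r"C:\Users\myself\files\some_file.csv"` is another way to avoid doubling every backslash.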
 
Before continuing, download the "Waiter's Tips Dataset" from one of the following links:

* [Direct link (straight from this GitHub Pages site)](./tips.csv)

* [Kaggle](https://www.kaggle.com/datasets/aminizahra/tips-dataset)

* [Google Drive](https://drive.google.com/file/d/1NHbbqX0_kAO0n9T-W7pTR7XTjpvNwaRF/view?usp=sharing)

In [23]:
df_tips = pd.read_csv('./tips.csv')

In [24]:
df_tips

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608
2,21.01,3.50,Male,No,Sun,Dinner,3,7.00,Travis Walters,6011812112971322,Sun4458
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251
...,...,...,...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,Sat2657
240,27.18,2.00,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,Sat1766
241,22.67,2.00,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196,Sat3880
242,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950,Sat17


### Simple operations on a DataFrame

In [25]:
# check the column names
df_tips.columns

Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size',
       'price_per_person', 'Payer Name', 'CC Number', 'Payment ID'],
      dtype='object')

In [26]:
# check the index (row labels)
df_tips.index

RangeIndex(start=0, stop=244, step=1)

In [27]:
df_tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251


In [28]:
df_tips.head(10)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251
5,25.29,4.71,Male,No,Sun,Dinner,4,6.32,Erik Smith,213140353657882,Sun9679
6,8.77,2.0,Male,No,Sun,Dinner,2,4.38,Kristopher Johnson,2223727524230344,Sun5985
7,26.88,3.12,Male,No,Sun,Dinner,4,6.72,Robert Buck,3514785077705092,Sun8157
8,15.04,1.96,Male,No,Sun,Dinner,2,7.52,Joseph Mcdonald,3522866365840377,Sun6820
9,14.78,3.23,Male,No,Sun,Dinner,2,7.39,Jerome Abbott,3532124519049786,Sun3775


In [29]:
df_tips.tail()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
239,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,Sat2657
240,27.18,2.0,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,Sat1766
241,22.67,2.0,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196,Sat3880
242,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950,Sat17
243,18.78,3.0,Female,No,Thur,Dinner,2,9.39,Michelle Hardin,3511451626698139,Thur672


In [30]:
df_tips.tail(10)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
234,15.53,3.0,Male,Yes,Sat,Dinner,2,7.76,Tracy Douglas,4097938155941930,Sat7220
235,10.07,1.25,Male,No,Sat,Dinner,2,5.04,Sean Gonzalez,3534021246117605,Sat4615
236,12.6,1.0,Male,Yes,Sat,Dinner,2,6.3,Matthew Myers,3543676378973965,Sat5032
237,32.83,1.17,Male,Yes,Sat,Dinner,2,16.42,Thomas Brown,4284722681265508,Sat2929
238,35.83,4.67,Female,No,Sat,Dinner,3,11.94,Kimberly Crane,676184013727,Sat9777
239,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,Sat2657
240,27.18,2.0,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,Sat1766
241,22.67,2.0,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196,Sat3880
242,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950,Sat17
243,18.78,3.0,Female,No,Thur,Dinner,2,9.39,Michelle Hardin,3511451626698139,Thur672


In [31]:
df_tips.describe()

Unnamed: 0,total_bill,tip,size,price_per_person,CC Number
count,244.0,244.0,244.0,244.0,244.0
mean,19.785943,2.998279,2.569672,7.888197,2563496000000000.0
std,8.902412,1.383638,0.9511,2.914234,2369340000000000.0
min,3.07,1.0,1.0,2.88,60406790000.0
25%,13.3475,2.0,2.0,5.8,30407310000000.0
50%,17.795,2.9,2.0,7.255,3525318000000000.0
75%,24.1275,3.5625,3.0,9.39,4553675000000000.0
max,50.81,10.0,6.0,20.27,6596454000000000.0


In [32]:
df_tips.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
total_bill,244.0,19.78594,8.902412,3.07,13.3475,17.795,24.1275,50.81
tip,244.0,2.998279,1.383638,1.0,2.0,2.9,3.5625,10.0
size,244.0,2.569672,0.9510998,1.0,2.0,2.0,3.0,6.0
price_per_person,244.0,7.888197,2.914234,2.88,5.8,7.255,9.39,20.27
CC Number,244.0,2563496000000000.0,2369340000000000.0,60406790000.0,30407310000000.0,3525318000000000.0,4553675000000000.0,6596454000000000.0
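Note that `describe()` summarizes every numeric column, including `CC Number`, whose mean and standard deviation carry no meaning because it is an identifier. One possible remedy, sketched here on three rows copied from the table above, is to store such identifiers as strings so that they drop out of the numeric summary:

```python
import pandas as pd

sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "CC Number": [3560325168603410, 4478071379779230, 6011812112971322],
})

# Treat the card number as text, not as a quantity.
sample["CC Number"] = sample["CC Number"].astype(str)

numeric_summary = sample.describe()          # numeric columns only by default
n_unique_cards = sample["CC Number"].nunique()

print(numeric_summary)
print(n_unique_cards)
```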


### Transforming Data (Row Wise)

#### Filtering

In [33]:
df_tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251


In [34]:
print(df_tips["size"] == 3)

0      False
1       True
2       True
3      False
4      False
       ...  
239     True
240    False
241    False
242    False
243    False
Name: size, Length: 244, dtype: bool


In [35]:
conditional_size = (df_tips["size"] == 3)
df_tips[conditional_size]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458
16,10.33,1.67,Female,No,Sun,Dinner,3,3.44,Elizabeth Foster,4240025044626033,Sun9715
17,16.29,3.71,Male,No,Sun,Dinner,3,5.43,John Pittman,6521340257218708,Sun2998
18,16.97,3.5,Female,No,Sun,Dinner,3,5.66,Laura Martinez,30422275171379,Sun2789
19,20.65,3.35,Male,No,Sat,Dinner,3,6.88,Timothy Oneal,6568069240986485,Sat9213
35,24.06,3.6,Male,No,Sat,Dinner,3,8.02,Joseph Mullins,5519770449260299,Sat632
36,16.31,2.0,Male,No,Sat,Dinner,3,5.44,William Ford,3527691170179398,Sat9139
37,16.93,3.07,Female,No,Sat,Dinner,3,5.64,Erin Lewis,5161695527390786,Sat6406
38,18.69,2.31,Male,No,Sat,Dinner,3,6.23,Brandon Bradley,4427601595688633,Sat4056


In [36]:
conditional = (df_tips["size"] == 3) & (df_tips["total_bill"] > 20)
print(conditional)

0      False
1      False
2       True
3      False
4      False
       ...  
239     True
240    False
241    False
242    False
243    False
Length: 244, dtype: bool


In [37]:
df_tips[conditional]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458
19,20.65,3.35,Male,No,Sat,Dinner,3,6.88,Timothy Oneal,6568069240986485,Sat9213
35,24.06,3.6,Male,No,Sat,Dinner,3,8.02,Joseph Mullins,5519770449260299,Sat632
39,31.27,5.0,Male,No,Sat,Dinner,3,10.42,Mr. Brandon Berry,6011525851069856,Sat6373
48,28.55,2.05,Male,No,Sun,Dinner,3,9.52,Austin Fisher,6011481668986587,Sun4142
65,20.08,3.15,Male,No,Sat,Dinner,3,6.69,Justin Dixon,180021262464926,Sat6840
102,44.3,2.5,Female,Yes,Sat,Dinner,3,14.77,Heather Cohen,379771118886604,Sat6240
112,38.07,4.0,Male,No,Sun,Dinner,3,12.69,Jeff Lopez,3572865915176463,Sun591
114,25.71,4.0,Female,No,Sun,Dinner,3,8.57,Katie Smith,5400160161311292,Sun6492
129,22.82,2.18,Male,No,Thur,Lunch,3,7.61,Raymond Torres,4855776744024,Thur9424


In [38]:
df_tips[(df_tips["size"] == 3) & (df_tips["total_bill"] > 20)]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458
19,20.65,3.35,Male,No,Sat,Dinner,3,6.88,Timothy Oneal,6568069240986485,Sat9213
35,24.06,3.6,Male,No,Sat,Dinner,3,8.02,Joseph Mullins,5519770449260299,Sat632
39,31.27,5.0,Male,No,Sat,Dinner,3,10.42,Mr. Brandon Berry,6011525851069856,Sat6373
48,28.55,2.05,Male,No,Sun,Dinner,3,9.52,Austin Fisher,6011481668986587,Sun4142
65,20.08,3.15,Male,No,Sat,Dinner,3,6.69,Justin Dixon,180021262464926,Sat6840
102,44.3,2.5,Female,Yes,Sat,Dinner,3,14.77,Heather Cohen,379771118886604,Sat6240
112,38.07,4.0,Male,No,Sun,Dinner,3,12.69,Jeff Lopez,3572865915176463,Sun591
114,25.71,4.0,Female,No,Sun,Dinner,3,8.57,Katie Smith,5400160161311292,Sun6492
129,22.82,2.18,Male,No,Thur,Lunch,3,7.61,Raymond Torres,4855776744024,Thur9424


In [39]:
conditional_or = (df_tips["tip"] > 4) | (df_tips["total_bill"] > 20)
df_tips[conditional_or]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
2,21.01,3.50,Male,No,Sun,Dinner,3,7.00,Travis Walters,6011812112971322,Sun4458
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251
5,25.29,4.71,Male,No,Sun,Dinner,4,6.32,Erik Smith,213140353657882,Sun9679
7,26.88,3.12,Male,No,Sun,Dinner,4,6.72,Robert Buck,3514785077705092,Sun8157
...,...,...,...,...,...,...,...,...,...,...,...
237,32.83,1.17,Male,Yes,Sat,Dinner,2,16.42,Thomas Brown,4284722681265508,Sat2929
238,35.83,4.67,Female,No,Sat,Dinner,3,11.94,Kimberly Crane,676184013727,Sat9777
239,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,Sat2657
240,27.18,2.00,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,Sat1766
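The filters above combine boolean Series with `&` (and) and `|` (or); `~` negates a mask. Each comparison needs its own parentheses because these operators bind more tightly than `==` or `>`. The sketch below uses the first five rows of the dataset and also shows that Python's `and` keyword does not work on Series:

```python
import pandas as pd

tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "size": [2, 3, 3, 2, 4],
})

big_bill = tips["total_bill"] > 20
party_of_three = tips["size"] == 3

both = tips[big_bill & party_of_three]     # AND
either = tips[big_bill | party_of_three]   # OR
not_big = tips[~big_bill]                  # NOT

# `and` asks for one True/False for the whole Series, which is ambiguous.
try:
    big_bill and party_of_three
    used_and_ok = True
except ValueError:
    used_and_ok = False

print(list(both.index), list(either.index), list(not_big.index), used_and_ok)
```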


In [40]:
weekend = ["Sun", "Sat"]
conditional_in = df_tips["day"].isin(weekend)
df_tips[conditional_in]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608
2,21.01,3.50,Male,No,Sun,Dinner,3,7.00,Travis Walters,6011812112971322,Sun4458
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251
...,...,...,...,...,...,...,...,...,...,...,...
238,35.83,4.67,Female,No,Sat,Dinner,3,11.94,Kimberly Crane,676184013727,Sat9777
239,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,Sat2657
240,27.18,2.00,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,Sat1766
241,22.67,2.00,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196,Sat3880
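Two related helpers may be handy alongside `isin`: negating it with `~` to keep everything outside the list, and `between` for numeric ranges (inclusive on both ends by default). A sketch on four rows taken from the dataset:

```python
import pandas as pd

tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 18.78],
    "day": ["Sun", "Sun", "Sat", "Thur"],
})

weekend = ["Sun", "Sat"]

# Rows whose day is NOT in the weekend list.
weekday_rows = tips[~tips["day"].isin(weekend)]

# Rows with 15 <= total_bill <= 20.
mid_range = tips[tips["total_bill"].between(15, 20)]

print(weekday_rows)
print(mid_range)
```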


In [41]:
df_tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251


#### Finding unique values

In [42]:
df_tips["day"].unique()

array(['Sun', 'Sat', 'Thur', 'Fri'], dtype=object)

In [43]:
df_tips[["day","time"]]

Unnamed: 0,day,time
0,Sun,Dinner
1,Sun,Dinner
2,Sun,Dinner
3,Sun,Dinner
4,Sun,Dinner
...,...,...
239,Sat,Dinner
240,Sat,Dinner
241,Sat,Dinner
242,Sat,Dinner


In [44]:
df_tips.drop_duplicates(["day","time"])[["day","time"]]

Unnamed: 0,day,time
0,Sun,Dinner
19,Sat,Dinner
77,Thur,Lunch
90,Fri,Dinner
220,Fri,Lunch
243,Thur,Dinner
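Beyond listing the distinct values, it is often useful to count them. The sketch below uses a small made-up sample (not the real row counts of the dataset) to show `nunique`, `value_counts`, and a per-combination count with `groupby(...).size()`:

```python
import pandas as pd

# Illustrative rows only; the real dataset has 244 rows.
tips = pd.DataFrame({
    "day": ["Sun", "Sun", "Sat", "Thur", "Fri", "Fri"],
    "time": ["Dinner", "Dinner", "Dinner", "Lunch", "Dinner", "Lunch"],
})

n_days = tips["day"].nunique()                        # number of distinct days
day_counts = tips["day"].value_counts()               # rows per day
pair_counts = tips.groupby(["day", "time"]).size()    # rows per (day, time) pair

print(n_days)
print(day_counts)
print(pair_counts)
```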


### Transforming Data (Column Wise)

#### Selecting Columns

In [45]:
print(df_tips["day"])

0       Sun
1       Sun
2       Sun
3       Sun
4       Sun
       ... 
239     Sat
240     Sat
241     Sat
242     Sat
243    Thur
Name: day, Length: 244, dtype: object


In [46]:
print(df_tips.day)

0       Sun
1       Sun
2       Sun
3       Sun
4       Sun
       ... 
239     Sat
240     Sat
241     Sat
242     Sat
243    Thur
Name: day, Length: 244, dtype: object


In [47]:
df_tips[["day","time"]]

Unnamed: 0,day,time
0,Sun,Dinner
1,Sun,Dinner
2,Sun,Dinner
3,Sun,Dinner
4,Sun,Dinner
...,...,...
239,Sat,Dinner
240,Sat,Dinner
241,Sat,Dinner
242,Sat,Dinner
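Attribute access (`df_tips.day`) is convenient but has limits that matter for this dataset: a column called `size` collides with the built-in `DataFrame.size` attribute, and a name containing a space, such as `Payer Name`, cannot be written as an attribute at all. A sketch on three rows from the dataset:

```python
import pandas as pd

tips = pd.DataFrame({
    "size": [2, 3, 3],
    "Payer Name": ["Christy Cunningham", "Douglas Tucker", "Travis Walters"],
})

attr_result = tips.size       # DataFrame.size: number of cells (rows * columns)
column = tips["size"]         # the actual "size" column
names = tips["Payer Name"]    # bracket access works for any column name

print(attr_result)
print(list(column))
print(names.iloc[0])
```

Bracket access is therefore the safer default.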


#### Mutating (create new column)

In [48]:
df_tips["tips_percentage"]= df_tips["tip"]/df_tips["total_bill"]*100

df_tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,tips_percentage
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,5.944673
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,16.054159
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458,16.658734
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,13.978041
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,14.680765
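Assigning to `df_tips["tips_percentage"]` modifies the DataFrame in place. If the original should stay untouched, `DataFrame.assign` is one alternative: it returns a new DataFrame with the extra column. A sketch on the first two rows of the dataset:

```python
import pandas as pd

tips = pd.DataFrame({"total_bill": [16.99, 10.34], "tip": [1.01, 1.66]})

# The lambda receives the DataFrame; the keyword becomes the new column name.
with_pct = tips.assign(
    tips_percentage=lambda d: d["tip"] / d["total_bill"] * 100
)

print(with_pct)
print(list(tips.columns))   # the original is unchanged
```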


#### Renaming Columns

In [49]:
df_tips.rename(columns={"tips_percentage": "tips_%"}, inplace=True)
df_tips.head()


Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,tips_%
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,5.944673
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,16.054159
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458,16.658734
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,13.978041
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,14.680765
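Two behaviors of `rename` are worth a quick sketch. Without `inplace=True` it returns a new DataFrame and leaves the original alone, and by default it silently ignores keys that match no column, so a typo goes unnoticed unless `errors="raise"` is passed. The misspelled key below is deliberate.

```python
import pandas as pd

tips = pd.DataFrame({"tip": [1.01], "tips_percentage": [5.944673]})

# Returns a renamed copy; `tips` keeps its column names.
renamed = tips.rename(columns={"tips_percentage": "tips_%"})

# A misspelled key is ignored by default ...
unchanged = tips.rename(columns={"tip_percentage": "tips_%"})

# ... but raises KeyError when errors="raise" is requested.
try:
    tips.rename(columns={"tip_percentage": "tips_%"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True

print(list(renamed.columns), list(unchanged.columns), typo_detected)
```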


#### Relocating Columns

In [50]:
# relocate the tips_% column to the leftmost position, keeping all other columns
cols = list(df_tips.columns)
cols = [cols[-1]] + cols[:-1]

df_tips = df_tips[cols]

In [51]:
df_tips

Unnamed: 0,tips_%,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
0,5.944673,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959
1,16.054159,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608
2,16.658734,21.01,3.50,Male,No,Sun,Dinner,3,7.00,Travis Walters,6011812112971322,Sun4458
3,13.978041,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260
4,14.680765,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251
...,...,...,...,...,...,...,...,...,...,...,...,...,...
239,20.392697,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,Sat2657
240,7.358352,27.18,2.00,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,Sat1766
241,8.822232,22.67,2.00,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196,Sat3880
242,9.820426,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950,Sat17
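Reordering is just selecting columns in a new order, so the slices must cover every column that should be kept. The sketch below, on a toy DataFrame with made-up column names, moves the last column to the front and a named column to the end without losing any:

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2], "c": [3], "d": [4]})  # toy data

# Last column to the front: cols[:-1] keeps all the remaining columns.
cols = list(df.columns)
front = df[[cols[-1]] + cols[:-1]]

# A named column to the end.
target = "a"
back = df[[c for c in df.columns if c != target] + [target]]

print(list(front.columns))
print(list(back.columns))
```

Comparing `len(front.columns)` with `len(df.columns)` is a cheap check that nothing was dropped.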


## Exporting a DataFrame to CSV

In [52]:
df_tips.to_csv("tips_modified.csv")
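By default `to_csv` also writes the index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. Passing `index=False` avoids this when the index is only a row counter. The sketch uses in-memory buffers instead of real files and a two-row DataFrame made up for illustration:

```python
import io

import pandas as pd

df = pd.DataFrame({"tip": [1.01, 1.66], "day": ["Sun", "Sun"]})

# Default: the index is written as an extra, unnamed column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reread_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reread_clean = pd.read_csv(without_index)

print(list(reread_with_index.columns))
print(list(reread_clean.columns))
```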