# Assignment 2

### Discretization

Discretization, also called binning, converts a numeric attribute into a categorical one. It is typically used for data mining methods or models that cannot handle numeric attributes.

In [1]:
import pandas as pd
import numpy as np

In [18]:
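# source data
dataset_url = "https://raw.githubusercontent.com/calvinr14/datamining/gh-pages/iris.data.csv"
# create the dataframe
# NOTE: no header option is passed, so pandas takes the first record of the file
# (5.1, 3.5, 1.4, 0.2, Iris-setosa) as the column names
df = pd.read_csv(dataset_url)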

In [19]:
#show first 15 rows
df.head(15)

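| | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| --- | --- | --- | --- | --- | --- |
| 0 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 2 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 3 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 4 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 5 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 6 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 7 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 8 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |
| 9 | 5.4 | 3.7 | 1.5 | 0.2 | Iris-setosa |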
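
The column headers in this output are data values (5.1, 3.5, ...), which suggests the CSV has no header line and its first record was consumed as column names. The sketch below shows the difference on two inline sample records instead of the real URL. The column names in `columns` are hypothetical, since the file itself defines none. Switching the notebook to this form would also require the later cells to use the new names, and every count would include one more row.

```python
import io
import pandas as pd

# Two sample records in the same layout as the file (no header line).
raw = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"

# Hypothetical column names; the source file does not define any.
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

# Default behaviour: the first record becomes the header and is lost as data.
default_read = pd.read_csv(io.StringIO(raw))

# Explicit behaviour: no header in the file, names supplied by the caller.
explicit_read = pd.read_csv(io.StringIO(raw), header=None, names=columns)

print(len(default_read), list(default_read.columns))
print(len(explicit_read), list(explicit_read.columns))
```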


In [21]:
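# Constant series; the columns are keyed by the values of the first record,
# which pandas used as the header when reading the CSV
SEPAL_LENGTH_SERIES = df["5.1"]
SEPAL_WIDTH_SERIES = df["3.5"]
PETAL_LENGTH_SERIES = df["1.4"]
PETAL_WIDTH_SERIES = df["0.2"]

# class (species) name
IRIS_SPECIES = df["Iris-setosa"]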

**Equal-Width Intervals**

The simplest discretization approach is to divide the range of X into k intervals of equal width, so each interval spans (max − min) / k.
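
As a rough illustration of the idea, the sketch below computes the equal-width edges by hand with `np.linspace` and compares them with the edges pandas reports. The sample values are hypothetical, chosen only to share the minimum (2.0) and maximum (4.4) of the sepal-width column. With an integer bin count and `right=True`, pandas lowers the first edge slightly so the minimum falls inside the first interval, which is why only the remaining edges are compared.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values (not taken from the dataset).
values = pd.Series([2.0, 2.5, 3.0, 3.4, 4.4])
k = 3  # number of equal-width intervals

# Manual edges: k + 1 evenly spaced points between min and max.
width = (values.max() - values.min()) / k
edges = np.linspace(values.min(), values.max(), k + 1)

# Edges as computed by pandas; retbins=True returns them alongside the bins.
binned, cut_edges = pd.cut(values, bins=k, right=True, retbins=True)

print(width)
print(edges)
print(cut_edges)
```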

**Cut**

`cut` is one of the tools pandas provides. Given a number of bins, it divides the value range into equal-width intervals.
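
A minimal sketch of `pd.cut` with keyword arguments is shown here on a small made-up series. The values and the English labels are assumptions for illustration and are not the labels used in the notebook cells.

```python
import pandas as pd

# Hypothetical measurements and labels, for illustration only.
widths = pd.Series([2.0, 2.3, 3.0, 3.1, 3.5, 4.4], name="width")
labels = ["low", "medium", "high"]

# One label per bin: the number of bins is taken from the number of labels.
binned = pd.cut(widths, bins=len(labels), right=True, labels=labels)

# Count how many values landed in each bin, in bin order.
counts = binned.value_counts(sort=False)
print(binned.tolist())
print(counts)
```

Passing `right` and `labels` by name avoids relying on the positional order of the parameters.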

**Sepal Width**

In [22]:
# equal-width intervals
# labels: sedikit_lebar = "slightly wide", lebar = "wide", sangat_lebar = "very wide"
labels = ["sedikit_lebar", "lebar", "sangat_lebar"]
amount_of_binning = len(labels)

# bin once with labels and once with the raw intervals
sepal_width_ew_binning = pd.cut(SEPAL_WIDTH_SERIES, bins=amount_of_binning, right=True, labels=labels)
labelled_sepal_width_ew_binning = sepal_width_ew_binning.value_counts()
interval_sepal_width_ew_binning = pd.cut(SEPAL_WIDTH_SERIES, bins=amount_of_binning, right=True).value_counts()

In [23]:
# dataframe of sepal width, its category, and the species
df_sepal_width_ew = pd.concat([SEPAL_WIDTH_SERIES, sepal_width_ew_binning, IRIS_SPECIES], axis=1)

In [24]:
# rename the columns
df_sepal_width_ew.columns = ["sepal.width", "category", "species"]

In [25]:
df_sepal_width_ew

| | sepal.width | category | species |
| --- | --- | --- | --- |
| 0 | 3.0 | lebar | Iris-setosa |
| 1 | 3.2 | lebar | Iris-setosa |
| 2 | 3.1 | lebar | Iris-setosa |
| 3 | 3.6 | lebar | Iris-setosa |
| 4 | 3.9 | sangat_lebar | Iris-setosa |
| ... | ... | ... | ... |
| 144 | 3.0 | lebar | Iris-virginica |
| 145 | 2.5 | sedikit_lebar | Iris-virginica |
| 146 | 3.0 | lebar | Iris-virginica |
| 147 | 3.4 | lebar | Iris-virginica |


In [26]:
# equal-width intervals binning with label
labelled_sepal_width_ew_binning

lebar            87
sedikit_lebar    47
sangat_lebar     15
Name: 3.5, dtype: int64

In [27]:
# equal-width intervals without label
interval_sepal_width_ew_binning

(2.8, 3.6]      87
(1.998, 2.8]    47
(3.6, 4.4]      15
Name: 3.5, dtype: int64

**Petal Width**

In [28]:
# equal-width intervals
labels = ["sedikit_lebar", "lebar", "sangat_lebar"]
amount_of_binning = len(labels)

# bin once with labels and once with the raw intervals
petal_width_ew_binning = pd.cut(PETAL_WIDTH_SERIES, bins=amount_of_binning, right=True, labels=labels)
labelled_petal_width_ew_binning = petal_width_ew_binning.value_counts()
interval_petal_width_ew_binning = pd.cut(PETAL_WIDTH_SERIES, bins=amount_of_binning, right=True).value_counts()

In [29]:
# dataframe of petal width, its category, and the species
df_petal_width = pd.concat([PETAL_WIDTH_SERIES, petal_width_ew_binning, IRIS_SPECIES], axis=1)

In [30]:
# rename the columns
df_petal_width.columns = ["petal.width", "category", "species"]

In [31]:
df_petal_width

| | petal.width | category | species |
| --- | --- | --- | --- |
| 0 | 0.2 | sedikit_lebar | Iris-setosa |
| 1 | 0.2 | sedikit_lebar | Iris-setosa |
| 2 | 0.2 | sedikit_lebar | Iris-setosa |
| 3 | 0.2 | sedikit_lebar | Iris-setosa |
| 4 | 0.4 | sedikit_lebar | Iris-setosa |
| ... | ... | ... | ... |
| 144 | 2.3 | sangat_lebar | Iris-virginica |
| 145 | 1.9 | sangat_lebar | Iris-virginica |
| 146 | 2.0 | sangat_lebar | Iris-virginica |
| 147 | 2.3 | sangat_lebar | Iris-virginica |


In [32]:
# equal-width intervals with label
labelled_petal_width_ew_binning

lebar            54
sedikit_lebar    49
sangat_lebar     46
Name: 0.2, dtype: int64

In [33]:
# equal-width intervals without label
interval_petal_width_ew_binning

(0.9, 1.7]       54
(0.0976, 0.9]    49
(1.7, 2.5]       46
Name: 0.2, dtype: int64

**Sepal Length**

In [34]:
# equal-width intervals
# the width labels are reused here for the length categories
labels = ["sedikit_lebar", "lebar", "sangat_lebar"]
amount_of_binning = len(labels)

# bin once with labels and once with the raw intervals
sepal_length_ew_binning = pd.cut(SEPAL_LENGTH_SERIES, bins=amount_of_binning, right=True, labels=labels)
labelled_sepal_length_ew_binning = sepal_length_ew_binning.value_counts()
interval_sepal_length_ew_binning = pd.cut(SEPAL_LENGTH_SERIES, bins=amount_of_binning, right=True).value_counts()

In [35]:
# dataframe of sepal length, its category, and the species
df_sepal_length_ew = pd.concat([SEPAL_LENGTH_SERIES, sepal_length_ew_binning, IRIS_SPECIES], axis=1)

In [36]:
# rename the columns
df_sepal_length_ew.columns = ["sepal_length", "category", "species"]

In [37]:
df_sepal_length_ew

| | sepal_length | category | species |
| --- | --- | --- | --- |
| 0 | 4.9 | sedikit_lebar | Iris-setosa |
| 1 | 4.7 | sedikit_lebar | Iris-setosa |
| 2 | 4.6 | sedikit_lebar | Iris-setosa |
| 3 | 5.0 | sedikit_lebar | Iris-setosa |
| 4 | 5.4 | sedikit_lebar | Iris-setosa |
| ... | ... | ... | ... |
| 144 | 6.7 | lebar | Iris-virginica |
| 145 | 6.3 | lebar | Iris-virginica |
| 146 | 6.5 | lebar | Iris-virginica |
| 147 | 6.2 | lebar | Iris-virginica |


In [38]:
# equal-width intervals with label
labelled_sepal_length_ew_binning

lebar            71
sedikit_lebar    58
sangat_lebar     20
Name: 5.1, dtype: int64

**Petal Length**

In [39]:
# equal-width intervals
# the width labels are reused here for the length categories
labels = ["sedikit_lebar", "lebar", "sangat_lebar"]
amount_of_binning = len(labels)

# bin once with labels and once with the raw intervals
petal_length_ew_binning = pd.cut(PETAL_LENGTH_SERIES, bins=amount_of_binning, right=True, labels=labels)
labelled_petal_length_ew_binning = petal_length_ew_binning.value_counts()
interval_petal_length_ew_binning = pd.cut(PETAL_LENGTH_SERIES, bins=amount_of_binning, right=True).value_counts()

In [40]:
# dataframe of petal length, its category, and the species
df_petal_length_ew = pd.concat([PETAL_LENGTH_SERIES, petal_length_ew_binning, IRIS_SPECIES], axis=1)

In [41]:
# rename the columns
df_petal_length_ew.columns = ["petal_length", "category", "species"]

In [42]:
df_petal_length_ew

| | petal_length | category | species |
| --- | --- | --- | --- |
| 0 | 1.4 | sedikit_lebar | Iris-setosa |
| 1 | 1.3 | sedikit_lebar | Iris-setosa |
| 2 | 1.5 | sedikit_lebar | Iris-setosa |
| 3 | 1.4 | sedikit_lebar | Iris-setosa |
| 4 | 1.7 | sedikit_lebar | Iris-setosa |
| ... | ... | ... | ... |
| 144 | 5.2 | sangat_lebar | Iris-virginica |
| 145 | 5.0 | sangat_lebar | Iris-virginica |
| 146 | 5.2 | sangat_lebar | Iris-virginica |
| 147 | 5.4 | sangat_lebar | Iris-virginica |


In [43]:
# equal-width intervals binning with label
labelled_petal_length_ew_binning

lebar            54
sedikit_lebar    49
sangat_lebar     46
Name: 1.4, dtype: int64

In [44]:
# equal-width intervals without label
interval_petal_length_ew_binning

(2.967, 4.933]    54
(0.994, 2.967]    49
(4.933, 6.9]      46
Name: 1.4, dtype: int64

**Equal-Frequency Intervals**

In equal-frequency discretization, we divide the range of X into intervals that each contain the same, or nearly the same, number of data points. Exactly equal frequencies may not be achievable because some values are repeated.
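
One way to picture this is that the interval edges are quantiles of the data. The sketch below, on a hypothetical series of nine distinct values, computes the edges with `Series.quantile` and checks them against the edges returned by `pd.qcut`. Only the edges after the first are compared, as a precaution in case the lowest edge is treated specially.

```python
import numpy as np
import pandas as pd

# Hypothetical data with no repeated values, for illustration only.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])
k = 3  # number of equal-frequency intervals

# Edges are the quantiles at 0, 1/k, 2/k, ..., 1.
quantile_edges = values.quantile(np.linspace(0, 1, k + 1)).to_numpy()

# pandas equivalent; retbins=True also returns the edges it used.
binned, qcut_edges = pd.qcut(values, q=k, retbins=True)
counts = binned.value_counts(sort=False)

print(quantile_edges)
print(counts.tolist())
```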

**Qcut**

`qcut` is a pandas tool that bins values into equal-frequency (quantile-based) intervals.
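
The effect of repeated values can be seen in a small sketch. The series and the English labels below are made up for illustration. Because the value 3.0 appears three times and sits on a quantile edge, all of its copies fall into the same bin and the counts come out unequal.

```python
import pandas as pd

# Hypothetical data with a repeated value (3.0) on a quantile edge.
values = pd.Series([1.0, 2.0, 3.0, 3.0, 3.0, 6.0, 7.0, 8.0, 9.0])
labels = ["low", "medium", "high"]

# Ask for three quantile-based bins, one label per bin.
binned = pd.qcut(values, q=len(labels), labels=labels)
counts = binned.value_counts(sort=False)

print(counts)
```

If ties make two edges coincide, `qcut` raises a `ValueError` by default; its `duplicates="drop"` option merges such edges, at the cost of producing fewer bins than requested.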

**Sepal Width**

In [45]:
# equal-frequency intervals
# labels: sedikit_lebar = "slightly wide", lebar = "wide", sangat_lebar = "very wide"
labels = ["sedikit_lebar", "lebar", "sangat_lebar"]
amount_of_binning = len(labels)

# bin once with labels and once with the raw intervals
sepal_width_ef_binning = pd.qcut(SEPAL_WIDTH_SERIES, q=amount_of_binning, labels=labels)
labelled_sepal_width_ef_binning = sepal_width_ef_binning.value_counts()
interval_sepal_width_ef_binning = pd.qcut(SEPAL_WIDTH_SERIES, q=amount_of_binning).value_counts()

In [46]:
# dataframe of sepal width, its category, and the species
df_sepal_width_ef = pd.concat([SEPAL_WIDTH_SERIES, sepal_width_ef_binning, IRIS_SPECIES], axis=1)

In [47]:
# rename the columns
df_sepal_width_ef.columns = ["sepal_width", "category", "species"]

In [48]:
df_sepal_width_ef

| | sepal_width | category | species |
| --- | --- | --- | --- |
| 0 | 3.0 | lebar | Iris-setosa |
| 1 | 3.2 | lebar | Iris-setosa |
| 2 | 3.1 | lebar | Iris-setosa |
| 3 | 3.6 | sangat_lebar | Iris-setosa |
| 4 | 3.9 | sangat_lebar | Iris-setosa |
| ... | ... | ... | ... |
| 144 | 3.0 | lebar | Iris-virginica |
| 145 | 2.5 | sedikit_lebar | Iris-virginica |
| 146 | 3.0 | lebar | Iris-virginica |
| 147 | 3.4 | sangat_lebar | Iris-virginica |


In [49]:
# equal-frequency intervals binning with label
labelled_sepal_width_ef_binning

sedikit_lebar    57
lebar            51
sangat_lebar     41
Name: 3.5, dtype: int64

In [50]:
# equal-frequency intervals without label
interval_sepal_width_ef_binning

(1.999, 2.9]    57
(2.9, 3.2]      51
(3.2, 4.4]      41
Name: 3.5, dtype: int64

**Petal Width**

In [51]:
# equal-frequency intervals
labels = ["sedikit_lebar", "lebar", "sangat_lebar"]
amount_of_binning = len(labels)

# bin once with labels and once with the raw intervals
petal_width_ef_binning = pd.qcut(PETAL_WIDTH_SERIES, q=amount_of_binning, labels=labels)
labelled_petal_width_ef_binning = petal_width_ef_binning.value_counts()
interval_petal_width_ef_binning = pd.qcut(PETAL_WIDTH_SERIES, q=amount_of_binning).value_counts()

In [52]:
# dataframe of petal width, its category, and the species
df_petal_width_ef = pd.concat([PETAL_WIDTH_SERIES, petal_width_ef_binning, IRIS_SPECIES], axis=1)

In [53]:
# rename the columns
df_petal_width_ef.columns = ["petal_width", "category", "species"]

In [54]:
df_petal_width_ef

| | petal_width | category | species |
| --- | --- | --- | --- |
| 0 | 0.2 | sedikit_lebar | Iris-setosa |
| 1 | 0.2 | sedikit_lebar | Iris-setosa |
| 2 | 0.2 | sedikit_lebar | Iris-setosa |
| 3 | 0.2 | sedikit_lebar | Iris-setosa |
| 4 | 0.4 | sedikit_lebar | Iris-setosa |
| ... | ... | ... | ... |
| 144 | 2.3 | sangat_lebar | Iris-virginica |
| 145 | 1.9 | sangat_lebar | Iris-virginica |
| 146 | 2.0 | sangat_lebar | Iris-virginica |
| 147 | 2.3 | sangat_lebar | Iris-virginica |


In [55]:
# equal-frequency intervals binning with label
labelled_petal_width_ef_binning

sedikit_lebar    56
sangat_lebar     48
lebar            45
Name: 0.2, dtype: int64

In [56]:
# equal-frequency intervals without label
interval_petal_width_ef_binning

(0.099, 1.0]    56
(1.6, 2.5]      48
(1.0, 1.6]      45
Name: 0.2, dtype: int64

**Petal Length**

In [57]:
# equal-frequency intervals
# the width labels are reused here for the length categories
labels = ["sedikit_lebar", "lebar", "sangat_lebar"]
amount_of_binning = len(labels)

# bin once with labels and once with the raw intervals
petal_length_ef_binning = pd.qcut(PETAL_LENGTH_SERIES, q=amount_of_binning, labels=labels)
labelled_petal_length_ef_binning = petal_length_ef_binning.value_counts()
interval_petal_length_ef_binning = pd.qcut(PETAL_LENGTH_SERIES, q=amount_of_binning).value_counts()

In [58]:
# dataframe of petal length, its category, and the species
df_petal_length_ef = pd.concat([PETAL_LENGTH_SERIES, petal_length_ef_binning, IRIS_SPECIES], axis=1)

In [59]:
# rename the columns
df_petal_length_ef.columns = ["petal_length", "category", "species"]

In [60]:
df_petal_length_ef

| | petal_length | category | species |
| --- | --- | --- | --- |
| 0 | 1.4 | sedikit_lebar | Iris-setosa |
| 1 | 1.3 | sedikit_lebar | Iris-setosa |
| 2 | 1.5 | sedikit_lebar | Iris-setosa |
| 3 | 1.4 | sedikit_lebar | Iris-setosa |
| 4 | 1.7 | sedikit_lebar | Iris-setosa |
| ... | ... | ... | ... |
| 144 | 5.2 | sangat_lebar | Iris-virginica |
| 145 | 5.0 | sangat_lebar | Iris-virginica |
| 146 | 5.2 | sangat_lebar | Iris-virginica |
| 147 | 5.4 | sangat_lebar | Iris-virginica |


In [61]:
# equal-frequency intervals binning with label
labelled_petal_length_ef_binning

lebar            53
sedikit_lebar    50
sangat_lebar     46
Name: 1.4, dtype: int64

In [62]:
# equal-frequency intervals without label
interval_petal_length_ef_binning

(3.1, 4.9]      53
(0.999, 3.1]    50
(4.9, 6.9]      46
Name: 1.4, dtype: int64