# Small Assignment 2
### Exploring Scikit-Learn in a Jupyter Notebook

Dion Saputra 1351645 <br>
Rabbi Fijar Mayoza 13516081

Import the necessary libraries.

In [4]:
from sklearn import datasets
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import ComplementNB

### A. Load the Datasets

#### A.1 Load the Iris Dataset

In [5]:
def load_dataset_iris():
    # load the iris dataset from scikit-learn's bundled datasets
    return datasets.load_iris()

def build_dataframe_iris(iris):
    # save feature values in pandas dataframe
    iris_feature_df = pd.DataFrame(iris.data)
    iris_feature_df.columns = iris.feature_names

    # save label values in pandas dataframe
    iris_target_df = pd.DataFrame(iris.target)
    iris_target_df.columns = ['target']
    map_target = pd.Series(iris.target_names, index=[0, 1, 2])
    iris_target_df['target'] = iris_target_df['target'].map(map_target)

    # concat feature dataframe and label dataframe
    iris_df = pd.concat([iris_feature_df, iris_target_df], axis=1)
    
    return iris_df

# load dataset iris
iris = load_dataset_iris()

# show dataframe iris
iris_df = build_dataframe_iris(iris)
iris_df.head(10)

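| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |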


#### A.2 Load the Play-Tennis Dataset

In [6]:
# load play-tennis dataset from external csv using pandas
def build_dataframe_tennis(tennis_file):
    tennis_df = pd.read_csv(tennis_file)
    return tennis_df

# show tennis dataframe
tennis_file = "weather.nominal.csv"
tennis_df = build_dataframe_tennis(tennis_file)
tennis_df

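| | outlook | temperature | humidity | windy | play |
|---|---|---|---|---|---|
| 0 | sunny | hot | high | False | no |
| 1 | sunny | hot | high | True | no |
| 2 | overcast | hot | high | False | yes |
| 3 | rainy | mild | high | False | yes |
| 4 | rainy | cool | normal | False | yes |
| 5 | rainy | cool | normal | True | no |
| 6 | overcast | cool | normal | True | yes |
| 7 | sunny | mild | high | False | no |
| 8 | sunny | cool | normal | False | yes |
| 9 | rainy | mild | normal | False | yes |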


The attributes in the dataset loaded from tennis_file hold string values (<b>windy</b> is read as a boolean), so they must first be encoded into a numeric representation.

The first attribute, <b>outlook</b>, with outlook = {sunny, overcast, rainy}, can be encoded by brightness as outlook = {0: rainy, 1: overcast, 2: sunny}.

The second attribute, <b>temperature</b>, with temperature = {hot, mild, cool}, can be encoded by how hot it is as temperature = {0: cool, 1: mild, 2: hot}.

The third attribute, <b>humidity</b>, with humidity = {normal, high}, can be encoded by humidity level as humidity = {0: normal, 1: high}.

The fourth attribute, <b>windy</b>, with windy = {True, False}, can be encoded as windy = {0: False, 1: True}.

The label <b>play</b>, with play = {no, yes}, can be encoded as play = {0: no, 1: yes}.
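
The encoding scheme above can also be sketched with plain dictionaries. This is a minimal illustration rather than the notebook's own cell: the `ENCODINGS` dictionary, the `encode_ordinal` helper, and the three-row `sample` frame are assumptions introduced here. Because `Series.map` yields `NaN` for any value missing from the mapping, the helper raises an error on unmapped categories, such as `cold` where the data uses `cool`.

```python
import pandas as pd

# Hypothetical mapping table that mirrors the encoding described above.
ENCODINGS = {
    "outlook": {"rainy": 0, "overcast": 1, "sunny": 2},
    "temperature": {"cool": 0, "mild": 1, "hot": 2},
    "humidity": {"normal": 0, "high": 1},
    "windy": {False: 0, True: 1},
    "play": {"no": 0, "yes": 1},
}

def encode_ordinal(df, encodings):
    """Return an encoded copy of df; fail loudly on unmapped values."""
    encoded = df.copy()
    for column, mapping in encodings.items():
        encoded[column] = df[column].map(mapping)
        # Series.map produces NaN for values that are not in the mapping.
        if encoded[column].isna().any():
            unknown = sorted(set(df[column]) - set(mapping), key=str)
            raise ValueError(f"unmapped values in {column!r}: {unknown}")
        encoded[column] = encoded[column].astype(int)
    return encoded

# Three sample rows copied from the play-tennis table shown earlier.
sample = pd.DataFrame({
    "outlook": ["sunny", "overcast", "rainy"],
    "temperature": ["hot", "hot", "mild"],
    "humidity": ["high", "high", "high"],
    "windy": [False, False, False],
    "play": ["no", "yes", "yes"],
})
encoded_sample = encode_ordinal(sample, ENCODINGS)
print(encoded_sample)
```

Returning a copy leaves the original string-valued frame untouched, which makes it easier to re-run the cell without encoding already-encoded columns.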

In [7]:
# encode tennis dataframe attribute
def encode_columns_tennis(tennis_df):
    map_outlook = pd.Series([0,1,2], index=["rainy","overcast","sunny"])
    map_temperature = pd.Series([0,1,2], index=["cool","mild","hot"])
    map_humidity = pd.Series([0,1], index=["normal","high"])
    map_windy = pd.Series([0,1], index=[False,True])
    map_play = pd.Series([0,1], index=["no","yes"])

    tennis_df['outlook'] = tennis_df["outlook"].map(map_outlook)
    tennis_df['temperature'] = tennis_df["temperature"].map(map_temperature)
    tennis_df['humidity'] = tennis_df["humidity"].map(map_humidity)
    tennis_df['windy'] = tennis_df["windy"].map(map_windy)
    tennis_df['play'] = tennis_df["play"].map(map_play)
    
    return tennis_df

# show tennis dataframe after attribute values encoding
tennis_df = encode_columns_tennis(tennis_df)
tennis_df

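| | outlook | temperature | humidity | windy | play |
|---|---|---|---|---|---|
| 0 | 2 | 2 | 1 | 0 | 0 |
| 1 | 2 | 2 | 1 | 1 | 0 |
| 2 | 1 | 2 | 1 | 0 | 1 |
| 3 | 0 | 1 | 1 | 0 | 1 |
| 4 | 0 | 0 | 0 | 0 | 1 |
| 5 | 0 | 0 | 0 | 1 | 0 |
| 6 | 1 | 0 | 0 | 1 | 1 |
| 7 | 2 | 1 | 1 | 0 | 0 |
| 8 | 2 | 0 | 0 | 0 | 1 |
| 9 | 0 | 1 | 0 | 0 | 1 |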


### B. Full-Train Learning
<b>Full-train learning</b> is a learning scheme that uses the entire dataset as training data. The same full dataset is then also used as the test data.
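
As a minimal sketch of this scheme, the hypothetical helper below fits a classifier on a dataset and counts its correct predictions on that same dataset. The name `count_correct_full_train` is an assumption introduced here, and `GaussianNB` on Iris serves only as a convenient example. Because the model has already seen every test sample, the resulting count tends to be optimistic compared with evaluation on held-out data.

```python
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

def count_correct_full_train(model, X, y):
    """Fit model on (X, y) and count correct predictions on the same data."""
    model.fit(X, y)
    return int((model.predict(X) == y).sum())

iris = datasets.load_iris()
n_correct = count_correct_full_train(GaussianNB(), iris.data, iris.target)
print("Correct predictions: %d of %d" % (n_correct, len(iris.target)))
```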

#### B.1. Naive Bayes
<b>Naive Bayes</b> is a <i>supervised learning</i> method based on <i>Bayes' theorem</i>. It is called <i>naive</i> because it relies on the <i>naive assumption</i> that the features in the dataset are independent of one another given the class.
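
Written out, with a class $y$ and features $x_1, \dots, x_n$ (notation introduced here for illustration), Bayes' theorem combined with the naive independence assumption gives:

$$
P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)} \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

The denominator does not depend on $y$, so it can be ignored when comparing classes, and the predicted class is the one that maximizes the remaining product:

$$
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$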

This notebook uses three of the Naive Bayes algorithms available in scikit-learn:
<ol>
    <li><i>Gaussian Naive Bayes (GaussianNB)</i></li>
    <li><i>Multinomial Naive Bayes (MultinomialNB)</i></li>
    <li><i>Complement Naive Bayes (ComplementNB)</i></li>
</ol>

##### B.1.a Gaussian Naive Bayes (GaussianNB) 

In [8]:
# gaussian naive bayes model for iris data

# build model and do prediction
gnb_iris_model = GaussianNB().fit(iris.data,iris.target)
y_predict = gnb_iris_model.predict(iris.data)

# show correct prediction
print("Number of correct prediction from %d data is: %d" 
      %(iris.data.shape[0], (iris.target == y_predict).sum()))

Number of correct prediction from 150 data is: 144


In [9]:
# gaussian naive bayes model for tennis data

# split the encoded dataframe into features and label
feature = tennis_df.drop(columns='play')
label = tennis_df['play']

gnb_tennis_model = GaussianNB().fit(feature,label)
y_predict = gnb_tennis_model.predict(feature)

# show correct prediction
print("Number of correct prediction from %d data is: %d" 
      %(feature.shape[0], (label == y_predict).sum()))

Number of correct prediction from 14 data is: 11


##### B.1.b Multinomial Naive Bayes (MultinomialNB)

In [10]:
# multinomial naive bayes model for iris data

# build model and do prediction
mnb_iris_model = MultinomialNB().fit(iris.data,iris.target)
y_predict = mnb_iris_model.predict(iris.data)

# show correct prediction
print("Number of correct prediction from %d data is: %d" 
      %(iris.data.shape[0], (iris.target == y_predict).sum()))

Number of correct prediction from 150 data is: 143


In [13]:
# multinomial naive bayes model for tennis data

# split the encoded dataframe into features and label
feature = tennis_df.drop(columns='play')
label = tennis_df['play']

mnb_tennis_model = MultinomialNB().fit(feature,label)
y_predict = mnb_tennis_model.predict(feature)

# show correct prediction
print("Number of correct prediction from %d data is: %d" 
      %(feature.shape[0], (label == y_predict).sum()))

Number of correct prediction from 14 data is: 9


##### B.1.c Complement Naive Bayes (ComplementNB)

In [12]:
# complement naive bayes model for iris data

# build model and do prediction
cnb_iris_model = ComplementNB().fit(iris.data,iris.target)
y_predict = cnb_iris_model.predict(iris.data)

# show correct prediction
print("Number of correct prediction from %d data is: %d" 
      %(iris.data.shape[0], (iris.target == y_predict).sum()))

Number of correct prediction from 150 data is: 100


In [15]:
# complement naive bayes model for tennis data

# split the encoded dataframe into features and label
feature = tennis_df.drop(columns='play')
label = tennis_df['play']

cnb_tennis_model = ComplementNB().fit(feature,label)
y_predict = cnb_tennis_model.predict(feature)

# show correct prediction
print("Number of correct prediction from %d data is: %d" 
      %(feature.shape[0], (label == y_predict).sum()))

Number of correct prediction from 14 data is: 8
