# 1. Missing Values

Data is often corrupted or missing. We need to deal with this first, because many later steps do not work when data is missing or incomplete.

### 1.1 Imputing missing values with Imputer

In [1]:
import pandas as pd
from sklearn.preprocessing import Imputer

In [3]:
df = pd.read_csv('Data.csv')
df

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [7]:
df.isnull()

Unnamed: 0,Country,Age,Salary,Purchased
0,False,False,False,False
1,False,False,False,False
2,False,False,False,False
3,False,False,False,False
4,False,False,True,False
5,False,False,False,False
6,False,True,False,False
7,False,False,False,False
8,False,False,False,False
9,False,False,False,False


Count the null values in each column of the DataFrame

In [4]:
df.isnull().sum()

Country      0
Age          1
Salary       1
Purchased    0
dtype: int64

Drop rows with missing data

In [5]:
df.dropna()

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
5,France,35.0,58000.0,Yes
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [6]:
# Drop only the rows where a specific column (here 'Age') contains NaN
df.dropna(subset=['Age'])

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [8]:
df.iloc[:, 1:3]
# take all rows, columns at index 1 and 2 (the end index 3 is excluded)

Unnamed: 0,Age,Salary
0,44.0,72000.0
1,27.0,48000.0
2,30.0,54000.0
3,38.0,61000.0
4,40.0,
5,35.0,58000.0
6,,52000.0
7,48.0,79000.0
8,50.0,83000.0
9,37.0,67000.0


Replace NaN values with filled-in values

In [9]:
# Replace every occurrence of missing_values with a value chosen by strategy,
# which can be 'mean', 'median', or 'most_frequent'.
# axis=0 imputes along columns (each NaN gets its column's statistic); axis=1 imputes along rows.
# Note: Imputer was removed in scikit-learn 0.22; newer versions provide sklearn.impute.SimpleImputer.

imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
# fit_transform learns the column means and fills the NaN values in one step
df.iloc[:, 1:3] = imputer.fit_transform(df.iloc[:, 1:3])
df

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,63777.777778,Yes
5,France,35.0,58000.0,Yes
6,Spain,38.777778,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes
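
On recent scikit-learn versions the `Imputer` class is no longer available. The sketch below shows the same mean imputation with `sklearn.impute.SimpleImputer`. It assumes a small in-memory DataFrame named `data` that copies the Age and Salary values shown above, so it runs without `Data.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Stand-in for Data.csv (assumption): same Age and Salary values as the table above.
data = pd.DataFrame({
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan, 58000, 52000, 79000, 83000, 67000],
})

# SimpleImputer always works column by column, so it has no axis argument.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = imputer.fit_transform(data[["Age", "Salary"]])

# Wrap the NumPy result back into a DataFrame for readability.
filled = pd.DataFrame(imputed, columns=["Age", "Salary"])
print(filled.round(6))
```

The missing Age and Salary cells receive the mean of the remaining nine values in their column, matching the output above.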


# 2. Encoding Categorical Data

In [10]:
# LabelEncoder replaces every category with an integer. Useful for replacing Yes with 1 and No with 0.

# OneHotEncoder creates a separate column for every category and puts a 1 where that category is present.
# Here we convert the No/Yes data
# using LabelEncoder and OneHotEncoder.

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [21]:
label_encoder = LabelEncoder()  # create the encoder object

# Work on a copy so the original df stays unchanged.
temp = df.copy()

# Convert column 0 (Country) to numeric labels.
temp.iloc[:, 0] = label_encoder.fit_transform(df.iloc[:, 0])
print(label_encoder.classes_)
print(temp)
print("-------------------")
# Convert Purchased: No becomes 0, Yes becomes 1.
temp.iloc[:, 3] = label_encoder.fit_transform(df.iloc[:, 3])
print(label_encoder.classes_)  # show the classes the encoder learned
print(temp)

['France' 'Germany' 'Spain']
   Country        Age        Salary Purchased
0        0  44.000000  72000.000000        No
1        2  27.000000  48000.000000       Yes
2        1  30.000000  54000.000000        No
3        2  38.000000  61000.000000        No
4        1  40.000000  63777.777778       Yes
5        0  35.000000  58000.000000       Yes
6        2  38.777778  52000.000000        No
7        0  48.000000  79000.000000       Yes
8        1  50.000000  83000.000000        No
9        0  37.000000  67000.000000       Yes
-------------------
['No' 'Yes']
   Country        Age        Salary  Purchased
0        0  44.000000  72000.000000          0
1        2  27.000000  48000.000000          1
2        1  30.000000  54000.000000          0
3        2  38.000000  61000.000000          0
4        1  40.000000  63777.777778          1
5        0  35.000000  58000.000000          1
6        2  38.777778  52000.000000          0
7        0  48.000000  79000.000000          1
8        
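
Refitting a single `LabelEncoder` on a second column overwrites its `classes_`, so the first column's mapping is lost. A minimal sketch of one way around this is to keep one encoder per column. The `data` frame and the `encoders` dictionary are illustrative names, not part of the notebook above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (assumption) with the same kind of columns as Data.csv.
data = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Purchased": ["No", "Yes", "No", "No"],
})

# One encoder per column, so each keeps its own classes_ mapping.
encoders = {}
encoded = data.copy()
for column in ["Country", "Purchased"]:
    encoders[column] = LabelEncoder()
    encoded[column] = encoders[column].fit_transform(data[column])

print(encoded)

# Recover the original labels from the integer codes.
countries = encoders["Country"].inverse_transform(encoded["Country"])
print(list(countries))
```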

Example of encoding with One-Hot Encoding

In [22]:
# you can pass an array of indices of categorical features
# one_hot_encoder = OneHotEncoder(categorical_features=[0])
# temp = df.copy()
# temp.iloc[:, 0] = one_hot_encoder.fit_transform(df.iloc[:, :0])
# temp
# you can achieve the same thing using get_dummies
pd.get_dummies(df.iloc[:, :-1])

Unnamed: 0,Age,Salary,Country_France,Country_Germany,Country_Spain
0,44.0,72000.0,1,0,0
1,27.0,48000.0,0,0,1
2,30.0,54000.0,0,1,0
3,38.0,61000.0,0,0,1
4,40.0,63777.777778,0,1,0
5,35.0,58000.0,1,0,0
6,38.777778,52000.0,0,0,1
7,48.0,79000.0,1,0,0
8,50.0,83000.0,0,1,0
9,37.0,67000.0,1,0,0
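
For comparison with `get_dummies`, here is a minimal sketch of `OneHotEncoder` applied directly to a string column. It assumes a scikit-learn version in which `OneHotEncoder` accepts string categories, and it uses a toy `data` frame rather than `Data.csv`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame (assumption) with a single categorical column.
data = pd.DataFrame({"Country": ["France", "Spain", "Germany", "Spain"]})

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
one_hot = encoder.fit_transform(data[["Country"]]).toarray()

print(encoder.categories_[0])  # column order of the one-hot matrix
print(one_hot)
```

Unlike `get_dummies`, the fitted encoder can be reused to transform new data with the same column layout.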


# 3. Binarizing

Binarizing converts data to 0 and 1. We will try another dataset, the iris dataset that ships with the scikit-learn library. (https://archive.ics.uci.edu/ml/datasets/iris)

In [24]:
from sklearn.datasets import load_iris

iris_dataset = load_iris()
# Feature matrix and target vector
X = iris_dataset.data
y = iris_dataset.target

feature_names = iris_dataset.feature_names
print(feature_names)


['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [25]:
# Example: sepal width only

X[:, 1]

array([3.5, 3. , 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3. ,
       3. , 4. , 4.4, 3.9, 3.5, 3.8, 3.8, 3.4, 3.7, 3.6, 3.3, 3.4, 3. ,
       3.4, 3.5, 3.4, 3.2, 3.1, 3.4, 4.1, 4.2, 3.1, 3.2, 3.5, 3.1, 3. ,
       3.4, 3.5, 2.3, 3.2, 3.5, 3.8, 3. , 3.8, 3.2, 3.7, 3.3, 3.2, 3.2,
       3.1, 2.3, 2.8, 2.8, 3.3, 2.4, 2.9, 2.7, 2. , 3. , 2.2, 2.9, 2.9,
       3.1, 3. , 2.7, 2.2, 2.5, 3.2, 2.8, 2.5, 2.8, 2.9, 3. , 2.8, 3. ,
       2.9, 2.6, 2.4, 2.4, 2.7, 2.7, 3. , 3.4, 3.1, 2.3, 3. , 2.5, 2.6,
       3. , 2.6, 2.3, 2.7, 3. , 2.9, 2.9, 2.5, 2.8, 3.3, 2.7, 3. , 2.9,
       3. , 3. , 2.5, 2.9, 2.5, 3.6, 3.2, 2.7, 3. , 2.5, 2.8, 3.2, 3. ,
       3.8, 2.6, 2.2, 3.2, 2.8, 2.8, 2.7, 3.3, 3.2, 2.8, 3. , 2.8, 3. ,
       2.8, 3.8, 2.8, 2.8, 2.6, 3. , 3.4, 3.1, 3. , 3.1, 3.1, 3.1, 2.7,
       3.2, 3.3, 3. , 2.5, 3. , 3.4, 3. ])

We will set a value to 0 if it is below the mean and to 1 if it is above the mean.

In [27]:
from sklearn.preprocessing import Binarizer
# Use the column mean as the threshold
binarizer_obj = Binarizer(threshold=X[:, 1].mean())
print(binarizer_obj)

X[:, 1:2] = binarizer_obj.fit_transform(X[:, 1].reshape(-1, 1))

X[:, 1]

Binarizer(copy=True, threshold=3.0540000000000003)


array([1., 0., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 0., 0., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 0., 1., 1., 0., 1., 1., 1., 0., 1., 1., 1., 1., 1.,
       1., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
       0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
       0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 1., 0., 1., 0.,
       0., 1., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
       1., 1., 0., 1., 1., 1., 0., 1., 1., 0., 0., 0., 1., 0.])
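
The thresholding rule can be checked against a plain NumPy comparison. This is a small sketch and the variable names are illustrative: `Binarizer` maps values strictly greater than the threshold to 1 and everything else to 0.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import Binarizer

sepal_width = load_iris().data[:, 1]
threshold = sepal_width.mean()

# Binarizer expects a 2-D array, so reshape the column and flatten the result.
with_sklearn = Binarizer(threshold=threshold).fit_transform(
    sepal_width.reshape(-1, 1)
).ravel()

# Same rule written by hand: 1.0 where the value exceeds the threshold, else 0.0.
with_numpy = (sepal_width > threshold).astype(float)

print(np.array_equal(with_sklearn, with_numpy))
```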

# 4. Feature Scaling

Feature scaling is normalization.
Numeric features can have very large and very different ranges (here Age is in the tens while Salary is in the tens of thousands).
Scaling brings them to a comparable range.

In [29]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler, Normalizer, MinMaxScaler

df = pd.read_csv('Data.csv').dropna()
print(df)

X = df[["Age", "Salary"]].values.astype(np.float64)
print(X)

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
5   France  35.0  58000.0       Yes
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes
[[4.4e+01 7.2e+04]
 [2.7e+01 4.8e+04]
 [3.0e+01 5.4e+04]
 [3.8e+01 6.1e+04]
 [3.5e+01 5.8e+04]
 [4.8e+01 7.9e+04]
 [5.0e+01 8.3e+04]
 [3.7e+01 6.7e+04]]


#### StandardScaler
Ref: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
Meaning: related to the variance. Each feature (column) is shifted to zero mean and scaled to unit variance.

#### Normalizing
Ref: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html
Meaning: each sample (row) is independently divided by its norm (L2 by default), so it has unit length. The dot product of two L2-normalized vectors is their cosine similarity.

#### MinMax Scaling
Ref: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
Meaning: the maximum becomes 1 and the minimum becomes 0.

standard_scaler = StandardScaler()
print("Standardization")
print(standard_scaler.fit_transform(X))
print("--------------")

normalizer = Normalizer()
print("Normalizing")
print(normalizer.fit_transform(X))
print("--------------")

min_max_scaler = MinMaxScaler()
print("MinMax Scaling")
print(min_max_scaler.fit_transform(X))
print("--------------")
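
To make the three definitions concrete, the sketch below recomputes each scaling with NumPy and compares it with the scikit-learn result. The array `X` is typed in by hand from the Age and Salary rows printed earlier, and the default settings of each scaler are assumed.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# Age and Salary rows from the DataFrame above (rows with NaN already dropped).
X = np.array([
    [44, 72000], [27, 48000], [30, 54000], [38, 61000],
    [35, 58000], [48, 79000], [50, 83000], [37, 67000],
], dtype=np.float64)

# Standardization: per column, subtract the mean and divide by the standard deviation.
standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# Normalization: per row, divide by the row's L2 norm.
normalized = X / np.linalg.norm(X, axis=1, keepdims=True)

# Min-max scaling: per column, map the minimum to 0 and the maximum to 1.
min_max = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(np.allclose(standardized, StandardScaler().fit_transform(X)))
print(np.allclose(normalized, Normalizer().fit_transform(X)))
print(np.allclose(min_max, MinMaxScaler().fit_transform(X)))
```

Note that standardization and min-max scaling operate on columns, while normalization operates on rows.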

# 5. Feature Extraction

In the previous session you wrote a WordCount program. WordCount is one technique for feature extraction. However, you do not need to build it yourself: scikit-learn already provides it. Feature extraction will later be useful for classification, clustering, and other machine learning techniques.

## 5.1 Count Vectorizer

In [36]:
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Mayur mayur is a nice boy.", "Mayur rock! wohooo!", "My name is Mayur, and I am a Pythonista!"]
cv = CountVectorizer()  # create the CountVectorizer object
X = cv.fit_transform(docs)  # learn the vocabulary and build the count matrix

print(X)  # prints (document index, vocabulary index) followed by the count
print(cv.vocabulary_)  # prints the word-to-index mapping

print(X.todense())
# Occurrence counts per document:
# [0 0 1 1 2 0 0 1 0 0 0]
# 0 = the word does not appear
# [0 0 0 0 1 0 0 0 0 1 1]

  (0, 2)	1
  (0, 7)	1
  (0, 3)	1
  (0, 4)	2
  (1, 10)	1
  (1, 9)	1
  (1, 4)	1
  (2, 8)	1
  (2, 0)	1
  (2, 1)	1
  (2, 6)	1
  (2, 5)	1
  (2, 3)	1
  (2, 4)	1
{'mayur': 4, 'is': 3, 'nice': 7, 'boy': 2, 'rock': 9, 'wohooo': 10, 'my': 5, 'name': 6, 'and': 1, 'am': 0, 'pythonista': 8}
[[0 0 1 1 2 0 0 1 0 0 0]
 [0 0 0 0 1 0 0 0 0 1 1]
 [1 1 0 1 1 1 1 0 1 0 0]]
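
The count matrix can be reproduced by hand, which shows why "a" and "I" are missing from the vocabulary. This is a sketch under the assumption that `CountVectorizer` runs with its default settings (lowercasing, tokens of at least two word characters, alphabetical vocabulary order). The helper names are illustrative.

```python
import re
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Mayur mayur is a nice boy.", "Mayur rock! wohooo!", "My name is Mayur, and I am a Pythonista!"]

# Default tokenization: lowercase, then keep tokens of two or more word characters.
token_pattern = re.compile(r"(?u)\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in docs]

# Vocabulary indices follow the alphabetical order of the tokens.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per document, one column per vocabulary word.
manual_counts = [[Counter(tokens)[word] for word in vocabulary] for tokens in tokenized]

sklearn_counts = CountVectorizer().fit_transform(docs).toarray().tolist()
print(vocabulary)
print(manual_counts == sklearn_counts)
```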


## 5.2 Dict Vectorizer

DictVectorizer maps word-count dictionaries to vectors.

In [38]:
from sklearn.feature_extraction import DictVectorizer

docs = [{"Aku": 1, "suka": 1, "makan": 2}, {"Aku": 1, "tidak": 1, "suka": 2, "makan": 3, "kambing": 1, "bakar": 2, "madu": 3}]
dv = DictVectorizer(sort=False)
X = dv.fit_transform(docs)

print(X)
print(dv.vocabulary_)
# Convert to a dense matrix
print(X.todense())

  (0, 0)	1.0
  (0, 1)	1.0
  (0, 2)	2.0
  (1, 0)	1.0
  (1, 1)	2.0
  (1, 2)	3.0
  (1, 3)	1.0
  (1, 4)	1.0
  (1, 5)	2.0
  (1, 6)	3.0
{'Aku': 0, 'suka': 1, 'makan': 2, 'tidak': 3, 'kambing': 4, 'bakar': 5, 'madu': 6}
[[1. 1. 2. 0. 0. 0. 0.]
 [1. 2. 3. 1. 1. 2. 3.]]
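
The mapping also works in the other direction. The short sketch below uses a reduced version of the dictionaries above (an assumption made for brevity) and shows `feature_names_` and `inverse_transform`, which turn vector rows back into dictionaries.

```python
from sklearn.feature_extraction import DictVectorizer

# Reduced example dictionaries (assumption), in the same style as above.
docs = [{"Aku": 1, "suka": 1, "makan": 2}, {"Aku": 1, "tidak": 1, "kambing": 1}]
dv = DictVectorizer(sort=False)
X = dv.fit_transform(docs)

# feature_names_ lists the keys in column order (first-seen order with sort=False).
print(dv.feature_names_)

# inverse_transform maps each row back to a dict of its non-zero features.
restored = dv.inverse_transform(X)
print(restored)
```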


## 5.3 TfIdf Vectorizer

The word count (term frequency) multiplied by the inverse document frequency.

A tutorial is available at the following links: https://datascience.mipa.ugm.ac.id/id/representasi-teks-dalam-vektor-part-1/ https://datascience.mipa.ugm.ac.id/id/representasi-teks-dalam-vektor-part-2/

In [42]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

tfidf_vectorizer = TfidfVectorizer()
cv_vectorizer = CountVectorizer()

docs = ["Mayur is a Guitarist Guitarist", "Mayur is Musician", "Mayur is also a programmer"]
# The words "Mayur" and "is" are not informative: a word that appears in every document carries little information,
# so its vector value is low.
# A word that appears in only one document scores high, e.g. "guitarist" (twice in document 1).


X_idf = tfidf_vectorizer.fit_transform(docs)  # TF-IDF weights (DF = document frequency)
X_cv = cv_vectorizer.fit_transform(docs)

print(X_idf.todense())
print(tfidf_vectorizer.vocabulary_)
print(X_cv.todense())

[[0.         0.92276146 0.27249889 0.27249889 0.         0.        ]
 [0.         0.         0.45329466 0.45329466 0.76749457 0.        ]
 [0.6088451  0.         0.35959372 0.35959372 0.         0.6088451 ]]
{'mayur': 3, 'is': 2, 'guitarist': 1, 'musician': 4, 'also': 0, 'programmer': 5}
[[0 2 1 1 0 0]
 [0 0 1 1 1 0]
 [1 0 1 1 0 1]]
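
The TF-IDF values above can be recomputed step by step. This sketch assumes the `TfidfVectorizer` defaults (`smooth_idf=True`, `norm='l2'`, raw counts as term frequency), under which the inverse document frequency is $\ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1$ for $n$ documents.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["Mayur is a Guitarist Guitarist", "Mayur is Musician", "Mayur is also a programmer"]

# Term frequencies: raw counts per document.
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Document frequency: number of documents that contain each term.
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (the TfidfVectorizer default).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Multiply TF by IDF, then scale each row to unit L2 norm.
tfidf = counts * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

print(np.allclose(tfidf, TfidfVectorizer().fit_transform(docs).toarray()))
print(tfidf.round(8))
```

Words that occur in every document ("mayur", "is") get the minimum IDF of 1, while "guitarist" is weighted up both by its count of 2 and by its higher IDF.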


1. Stop word removal
2. Stemming (e.g. "mengatakan" and "dikatakan" share the root "kata")
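
As a minimal sketch of the first step, `CountVectorizer` accepts a `stop_words` list and leaves those words out of the vocabulary. The stop list here is a hypothetical one chosen for this toy corpus. Stemming is not shown because scikit-learn does not provide a stemmer.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Mayur is a Guitarist Guitarist", "Mayur is Musician", "Mayur is also a programmer"]

# Hypothetical stop list for this toy corpus; real projects use a curated list.
stop_words = ["is", "also"]
cv = CountVectorizer(stop_words=stop_words)
X = cv.fit_transform(docs)

print(sorted(cv.vocabulary_))  # remaining vocabulary, alphabetical
print(X.toarray())
```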