# What Are AI, ML, DL, and RL?

Artificial Intelligence (AI) is a branch of computer science that focuses on building intelligent machines able to perform tasks that normally require human intelligence. AI has several sub-fields, one of which is Machine Learning (ML).

Machine Learning (ML) is a branch of AI that focuses on building systems that learn from data. ML has several sub-fields of its own, one of which is Deep Learning (DL).

Deep Learning (DL) is a branch of ML that uses multi-layer neural networks to learn from data with complex structure, such as images, audio, and text.

Reinforcement Learning (RL) is a branch of ML, alongside supervised and unsupervised learning, in which an agent learns by interacting with an environment and receiving rewards. It can be combined with DL (deep reinforcement learning), but it does not depend on it. One well-known RL algorithm is Q-Learning.

# Applications

AI (in the broad sense, including classical algorithms that do not learn from data):
- Sorting
- Searching
- Storing

ML:
- Stock price prediction
- Weather forecasting
- House price prediction
- Car price prediction
- Goods price prediction

DL:
- Image recognition
- Speech recognition
- Text recognition
- Handwriting recognition

RL:
- Robot control
- Vehicle control
- Game-playing bots

AI, ML, DL, and RL can do far more than the examples above, and as the technology matures they will be used in a growing number of fields.

From the perspective of the desired output, a task can be framed as classification, regression, clustering, detection, segmentation, generation, and others.

# Machine Learning in a Nutshell

### Load Data

In [3]:
import pandas as pd

In [4]:
df = pd.read_csv('data/titanic.csv')
df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


PassengerId can serve as the index because it is a unique identifier.

In [5]:
df = pd.read_csv('data/titanic.csv', index_col='PassengerId')
df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### EDA (Exploratory Data Analysis)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


In [7]:
df.describe()

,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


In [8]:
df = pd.get_dummies(df, columns=['Embarked'], dtype=int)
df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked_C,Embarked_Q,Embarked_S
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,0,0,1
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,1,0,0
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,0,0,1
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,0,0,1
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,0,0,1


In [9]:
# Encode Sex as integers: male -> 1, female -> 0
df['Sex'] = df['Sex'].map({'male': 1, 'female': 0})
df.head()


PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked_C,Embarked_Q,Embarked_S
1,0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.25,,0,0,1
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,C85,1,0,0
3,1,3,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.925,,0,0,1
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,113803,53.1,C123,0,0,1
5,0,3,"Allen, Mr. William Henry",1,35.0,0,0,373450,8.05,,0,0,1


In [10]:
df.describe()

,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_C,Embarked_Q,Embarked_S
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0,891.0,891.0,891.0
mean,0.383838,2.308642,0.647587,29.699118,0.523008,0.381594,32.204208,0.188552,0.08642,0.722783
std,0.486592,0.836071,0.47799,14.526497,1.102743,0.806057,49.693429,0.391372,0.281141,0.447876
min,0.0,1.0,0.0,0.42,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,2.0,0.0,20.125,0.0,0.0,7.9104,0.0,0.0,0.0
50%,0.0,3.0,1.0,28.0,0.0,0.0,14.4542,0.0,0.0,1.0
75%,1.0,3.0,1.0,38.0,1.0,0.0,31.0,0.0,0.0,1.0
max,1.0,3.0,1.0,80.0,8.0,6.0,512.3292,1.0,1.0,1.0


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 891 entries, 1 to 891
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Survived    891 non-null    int64  
 1   Pclass      891 non-null    int64  
 2   Name        891 non-null    object 
 3   Sex         891 non-null    int64  
 4   Age         714 non-null    float64
 5   SibSp       891 non-null    int64  
 6   Parch       891 non-null    int64  
 7   Ticket      891 non-null    object 
 8   Fare        891 non-null    float64
 9   Cabin       204 non-null    object 
 10  Embarked_C  891 non-null    int32  
 11  Embarked_Q  891 non-null    int32  
 12  Embarked_S  891 non-null    int32  
dtypes: float64(2), int32(3), int64(5), object(3)
memory usage: 87.0+ KB


In the next step, we drop the columns we will not use: Name, Ticket, and Cabin. We also fill in the missing values in the Age column.

In [12]:
df.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
df.head()

PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_C,Embarked_Q,Embarked_S
1,0,3,1,22.0,1,0,7.25,0,0,1
2,1,1,0,38.0,1,0,71.2833,1,0,0
3,1,3,0,26.0,0,0,7.925,0,0,1
4,1,1,0,35.0,1,0,53.1,0,0,1
5,0,3,1,35.0,0,0,8.05,0,0,1


In [13]:
df.Age = df.Age.fillna(df.Age.mean())
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 891 entries, 1 to 891
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Survived    891 non-null    int64  
 1   Pclass      891 non-null    int64  
 2   Sex         891 non-null    int64  
 3   Age         891 non-null    float64
 4   SibSp       891 non-null    int64  
 5   Parch       891 non-null    int64  
 6   Fare        891 non-null    float64
 7   Embarked_C  891 non-null    int32  
 8   Embarked_Q  891 non-null    int32  
 9   Embarked_S  891 non-null    int32  
dtypes: float64(2), int32(3), int64(5)
memory usage: 66.1 KB


In [14]:
df.describe()

,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_C,Embarked_Q,Embarked_S
count,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0
mean,0.383838,2.308642,0.647587,29.699118,0.523008,0.381594,32.204208,0.188552,0.08642,0.722783
std,0.486592,0.836071,0.47799,13.002015,1.102743,0.806057,49.693429,0.391372,0.281141,0.447876
min,0.0,1.0,0.0,0.42,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,2.0,0.0,22.0,0.0,0.0,7.9104,0.0,0.0,0.0
50%,0.0,3.0,1.0,29.699118,0.0,0.0,14.4542,0.0,0.0,1.0
75%,1.0,3.0,1.0,35.0,1.0,0.0,31.0,0.0,0.0,1.0
max,1.0,3.0,1.0,80.0,8.0,6.0,512.3292,1.0,1.0,1.0


Scaling matters because a feature with a large numeric range can otherwise dominate the model, especially a distance-based one such as KNN. scikit-learn ships several scalers, including:

- MinMaxScaler
  MinMaxScaler is the simplest technique. It rescales each feature linearly into a fixed range, 0 to 1 by default. Because that range is set by the minimum and maximum values, it is very sensitive to outliers.

- StandardScaler
  StandardScaler is the most commonly used technique. It shifts and rescales each feature to mean 0 and standard deviation 1. It does not change the shape of the distribution, and outliers still influence the mean and standard deviation.

- RobustScaler
  RobustScaler is the most outlier-resistant of the three. It centers each feature on its median and divides by the interquartile range (IQR), so the result has median 0 and IQR 1. It suits datasets with significant outliers.
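To make the difference concrete, here is a minimal sketch that applies the three scalers to one toy feature containing a single large outlier. The values in `x` are made up for illustration and are not taken from the Titanic data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Toy feature with one large outlier (hypothetical values)
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0]).reshape(-1, 1)

minmax = MinMaxScaler().fit_transform(x).ravel()
standard = StandardScaler().fit_transform(x).ravel()
robust = RobustScaler().fit_transform(x).ravel()

# Compare how each scaler places the four ordinary values and the outlier
print("MinMaxScaler  :", minmax.round(3))
print("StandardScaler:", standard.round(3))
print("RobustScaler  :", robust.round(3))
```

With MinMaxScaler the four ordinary values end up squeezed close to 0 because the outlier defines the top of the range, while RobustScaler keeps them evenly spread around 0 and leaves the outlier far away.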

The gap between the minimum and maximum values of the Fare and Age columns is large, so we will try scaling them with MinMaxScaler.

In [15]:
from sklearn.preprocessing import MinMaxScaler
scalerAge = MinMaxScaler()
scalerFare = MinMaxScaler()

In [16]:
# Fit each scaler on its own column and write the scaled values back
df['Age'] = scalerAge.fit_transform(df[['Age']]).ravel()
df['Fare'] = scalerFare.fit_transform(df[['Fare']]).ravel()

df.head()

PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_C,Embarked_Q,Embarked_S
1,0,3,1,0.271174,1,0,0.014151,0,0,1
2,1,1,0,0.472229,1,0,0.139136,1,0,0
3,1,3,0,0.321438,0,0,0.015469,0,0,1
4,1,1,0,0.434531,1,0,0.103644,0,0,1
5,0,3,1,0.434531,0,0,0.015713,0,0,1


In [17]:
df.describe()

,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_C,Embarked_Q,Embarked_S
count,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0
mean,0.383838,2.308642,0.647587,0.367921,0.523008,0.381594,0.062858,0.188552,0.08642,0.722783
std,0.486592,0.836071,0.47799,0.163383,1.102743,0.806057,0.096995,0.391372,0.281141,0.447876
min,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,2.0,0.0,0.271174,0.0,0.0,0.01544,0.0,0.0,0.0
50%,0.0,3.0,1.0,0.367921,0.0,0.0,0.028213,0.0,0.0,1.0
75%,1.0,3.0,1.0,0.434531,1.0,0.0,0.060508,0.0,0.0,1.0
max,1.0,3.0,1.0,1.0,8.0,6.0,1.0,1.0,1.0,1.0


### Train Test Split and Cross Validation

Train Test Split divides the dataset into two parts: a training set and a testing set. The training set is used to fit the model, and the testing set is used to measure how well the model performs on data it has not seen.

Cross Validation divides the dataset into several parts (folds) and trains the model once per fold, each time holding out a different fold for validation and training on the rest. It is especially useful when the dataset is small.
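The idea can be sketched with scikit-learn's `cross_val_score`. The data below is synthetic (generated with `make_classification`), and the choice of 5 folds and `n_neighbors=5` is an assumption made only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (not the Titanic dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

# 5 folds: each fold is held out once while the model trains on the other 4
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X_demo, y_demo, cv=cv)

print("accuracy per fold:", scores.round(3))
print("mean accuracy    :", scores.mean().round(3))
```

Each entry in `scores` is the accuracy on one held-out fold, so the mean gives a less split-dependent estimate than a single train/test split.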

In [18]:
X = df.drop(columns=['Survived'])
y = df.Survived

In [19]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((712, 9), (179, 9), (712,), (179,))

The steps above run without error, but the imputation and scaling were fitted on the full dataset before the split, so information from the test rows leaks into the preprocessing. A Data Pipeline makes the same processing simpler and has the added benefit of avoiding this data leakage.
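As a minimal sketch of what avoiding leakage means in practice, the snippet below fits a MinMaxScaler on the training rows only and reuses those statistics for the test rows. The array `values` is synthetic and the variable names are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic single-feature data (hypothetical, stands in for a column such as Age)
rng = np.random.default_rng(0)
values = rng.normal(loc=30, scale=10, size=(100, 1))

# Split first, then fit the scaler
train, test = train_test_split(values, test_size=0.2, random_state=42)

scaler = MinMaxScaler().fit(train)       # min/max learned from training rows only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)     # test rows reuse the training min/max

print(train_scaled.min(), train_scaled.max())
print(test_scaled.min(), test_scaled.max())  # may fall slightly outside [0, 1]
```

A scikit-learn `Pipeline` applies the same fit-on-train, transform-on-test discipline automatically, including inside each cross-validation fold.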

### Data Pipeline

1. Load the data
2. Drop the columns that are not used
3. Replace values where needed
4. Use get_dummies for categorical columns
5. Use a Pipeline that contains the scaling step and the model (we use KNN)

In [20]:
df = pd.read_csv('data/titanic.csv', index_col='PassengerId')
df.drop(columns=['Name', 'Ticket', 'Cabin'], inplace=True)
df = pd.get_dummies(df, columns=['Embarked'], dtype=int)
# Encode Sex as integers: male -> 1, female -> 0
df['Sex'] = df['Sex'].map({'male': 1, 'female': 0})

df.head()


PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_C,Embarked_Q,Embarked_S
1,0,3,1,22.0,1,0,7.25,0,0,1
2,1,1,0,38.0,1,0,71.2833,1,0,0
3,1,3,0,26.0,0,0,7.925,0,0,1
4,1,1,0,35.0,1,0,53.1,0,0,1
5,0,3,1,35.0,0,0,8.05,0,0,1


In [21]:
df.corr()

,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_C,Embarked_Q,Embarked_S
Survived,1.0,-0.338481,-0.543351,-0.077221,-0.035322,0.081629,0.257307,0.16824,0.00365,-0.15566
Pclass,-0.338481,1.0,0.1319,-0.369226,0.083081,0.018443,-0.5495,-0.243292,0.221009,0.08172
Sex,-0.543351,0.1319,1.0,0.093254,-0.114631,-0.245489,-0.182333,-0.082853,-0.074115,0.125722
Age,-0.077221,-0.369226,0.093254,1.0,-0.308247,-0.189119,0.096067,0.036261,-0.022405,-0.032523
SibSp,-0.035322,0.083081,-0.114631,-0.308247,1.0,0.414838,0.159651,-0.059528,-0.026354,0.070941
Parch,0.081629,0.018443,-0.245489,-0.189119,0.414838,1.0,0.216225,-0.011069,-0.081228,0.063036
Fare,0.257307,-0.5495,-0.182333,0.096067,0.159651,0.216225,1.0,0.269335,-0.117216,-0.166603
Embarked_C,0.16824,-0.243292,-0.082853,0.036261,-0.059528,-0.011069,0.269335,1.0,-0.148258,-0.778359
Embarked_Q,0.00365,0.221009,-0.074115,-0.022405,-0.026354,-0.081228,-0.117216,-0.148258,1.0,-0.496624
Embarked_S,-0.15566,0.08172,0.125722,-0.032523,0.070941,0.063036,-0.166603,-0.778359,-0.496624,1.0


In [22]:
from sklearn.model_selection import train_test_split

X = df.drop(columns=['Survived'])
y = df.Survived

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((712, 9), (179, 9), (712,), (179,))

In [23]:
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV

In [24]:
numerical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", MinMaxScaler())
])

In [25]:
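# Only Age and Fare are listed; ColumnTransformer's default remainder='drop'
# discards every other column, so the model sees just these two features.
preprocessor = ColumnTransformer([
    ("numeric", numerical_pipeline, ['Age', 'Fare'])
])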

In [26]:
pipeline = Pipeline([
    ("prep", preprocessor),
    ("algo", KNeighborsClassifier())
])
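One detail worth checking in a pipeline like this is what happens to columns that are not listed in the `ColumnTransformer`. By default they are dropped. The sketch below uses a tiny hand-built frame (the first three passengers shown earlier, reduced to four columns) to compare the default with `remainder="passthrough"`; the names `toy`, `drop_rest`, and `keep_rest` are illustrative only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Small frame for illustration: two scaled columns plus two untouched ones
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.2833, 7.925],
    "Pclass": [3, 1, 3],
    "Sex": [1, 0, 0],
})

# Default remainder='drop': unlisted columns are discarded
drop_rest = ColumnTransformer([("numeric", MinMaxScaler(), ["Age", "Fare"])])

# remainder='passthrough': unlisted columns are appended unchanged
keep_rest = ColumnTransformer(
    [("numeric", MinMaxScaler(), ["Age", "Fare"])],
    remainder="passthrough",
)

out_drop = drop_rest.fit_transform(toy)
out_keep = keep_rest.fit_transform(toy)

print(out_drop.shape)  # (3, 2): only Age and Fare remain
print(out_keep.shape)  # (3, 4): Pclass and Sex are kept as well
```

Whether to pass the remaining columns through is a modelling decision; the point here is only that the default silently narrows the feature set.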

### Search CV

In [27]:
parameter = {
    "algo__n_neighbors": range(1, 21),
    "algo__weights": ['uniform','distance'],
    "algo__p": [1,2]
}

model = GridSearchCV(pipeline, parameter, cv=3, n_jobs=-1, verbose=1)
model.fit(X_train, y_train)

Fitting 3 folds for each of 80 candidates, totalling 240 fits


### Model Evaluation

In [28]:
model.best_params_

{'algo__n_neighbors': 19, 'algo__p': 1, 'algo__weights': 'uniform'}

In [29]:
model.score(X_train, y_train), model.score(X_test, y_test)

(0.7120786516853933, 0.6312849162011173)
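An accuracy figure is easier to judge next to a baseline. Since about 38% of passengers survived (the mean of Survived in the describe output above), always predicting "did not survive" would already score roughly 0.62. The sketch below uses small made-up label arrays, not the notebook's predictions, to show how accuracy, a confusion matrix, and a majority-class baseline can be computed.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only (not taken from the Titanic results)
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_hat = np.array([0, 0, 0, 0, 1, 0, 0, 0])

acc = accuracy_score(y_true, y_hat)
cm = confusion_matrix(y_true, y_hat)  # rows: true class, columns: predicted class

# Baseline that always predicts the most frequent class; features are ignored
X_dummy = np.zeros((len(y_true), 1))
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
baseline_acc = baseline.score(X_dummy, y_true)

print("accuracy :", acc)           # 0.75
print("baseline :", baseline_acc)  # 0.625
print(cm)
```

In this toy case the classifier beats the baseline mainly by getting the majority class right, which the confusion matrix makes visible: two of the three positive cases are missed.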

### Save Model and Load Model

In [30]:
import pickle

In [31]:
with open('models/model-titanic.pkl', 'wb') as f:
    pickle.dump(model, f)

In [32]:
with open('models/model-titanic.pkl', 'rb') as f:
    model = pickle.load(f)

In [33]:
y_pred = model.predict(X_test)
y_pred

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0], dtype=int64)

### Save data

pandas can save data in a variety of formats, such as CSV, Excel, JSON, and others.

In [34]:
# Keep the PassengerId index as a column in the saved file
df.to_csv('data/titanic-processed.csv')

The resulting score may not be good. Some things we can try to improve it:
- Feature engineering
- Hyperparameter tuning
- Using a different model
- Using more data
- Using higher-quality data
- Using deep learning