## **TensorFlow**

This time we will try something new: learning basic deep learning with TensorFlow, using only TensorFlow and its built-in libraries, without Keras.
The dataset is California census data. The target is the column named `income_bracket`, which contains only two values, `>50K` and `<=50K`, so this is clearly a binary classification problem.

## **Import Libraries**

In [2]:
import pandas as pd
import tensorflow as tf

## **Import Dataframe**

In [3]:
df = pd.read_csv('census_data_Classification.csv')
df.head()

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [6]:
listitem = []
for col in df.columns:
    listitem.append([col, df[col].dtype, df[col].unique()])
    
dfDesc = pd.DataFrame(columns=['Column Name', 'Type', 'Description'],
                     data=listitem)
dfDesc

Unnamed: 0,Column Name,Type,Description
0,age,int64,"[39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 30, 2..."
1,workclass,object,"[ State-gov, Self-emp-not-inc, Private, Fed..."
2,education,object,"[ Bachelors, HS-grad, 11th, Masters, 9th, ..."
3,education_num,int64,"[13, 9, 7, 14, 5, 10, 12, 11, 4, 16, 15, 3, 6,..."
4,marital_status,object,"[ Never-married, Married-civ-spouse, Divorce..."
5,occupation,object,"[ Adm-clerical, Exec-managerial, Handlers-cl..."
6,relationship,object,"[ Not-in-family, Husband, Wife, Own-child, ..."
7,race,object,"[ White, Black, Asian-Pac-Islander, Amer-In..."
8,gender,object,"[ Male, Female]"
9,capital_gain,int64,"[2174, 0, 14084, 5178, 5013, 2407, 14344, 1502..."


In [7]:
df['income_bracket'].unique()

array([' <=50K', ' >50K'], dtype=object)

Look at the target: its values are strings. That is a problem, because the estimator as used here expects numeric labels. So we encode the target: `>50K` becomes 1 and `<=50K` becomes 0. Note that the raw values carry a leading space (`' <=50K'`), which the comparison below has to match exactly.

In [9]:
listkosong = []
for i in df['income_bracket']:
    if i == ' <=50K':
        listkosong.append(0)
    else :
        listkosong.append(1)
df['income_bracket'] = listkosong
df.head()

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0
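The loop above can also be written as one vectorized expression. The following is a minimal sketch on a small hypothetical DataFrame named `sample` (not the census file), assuming the labels carry the leading space shown in the `unique()` output.

```python
import pandas as pd

# Hypothetical sample that mimics the raw labels, including the leading space.
sample = pd.DataFrame({"income_bracket": [" <=50K", " >50K", " <=50K"]})

# Strip whitespace, compare against the positive class, and cast True/False to 1/0.
sample["income_bracket"] = (
    sample["income_bracket"].str.strip().eq(">50K").astype(int)
)

print(sample["income_bracket"].tolist())
```

One difference from the loop: the loop maps every value other than `' <=50K'` to 1, while this version maps every value other than `>50K` to 0. For a column with exactly these two labels the result is the same.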


Now look at the numeric features. Their ranges differ widely from one feature to the next. We therefore normalize them to a common interval; the min-max scaling used below maps each column to [0, 1].

## **Normalization**

In [12]:
cols = ['age', 'education_num','capital_gain','capital_loss', 'hours_per_week']
df[cols] = df[cols].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
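The lambda above is min-max scaling, x' = (x - min) / (max - min). The sketch below applies the same formula to a small hypothetical DataFrame. The ranges (age 17 to 90, hours 1 to 99) are assumptions inferred from the normalized output shown later, not values printed in the notebook.

```python
import pandas as pd

# Hypothetical values; the notebook applies the same formula to the census columns.
sample = pd.DataFrame({"age": [17, 39, 90], "hours_per_week": [1, 40, 99]})

def min_max(col):
    """Rescale a numeric column so its minimum maps to 0 and its maximum to 1."""
    return (col - col.min()) / (col.max() - col.min())

scaled = sample.apply(min_max)
print(scaled)
```

Under these assumed ranges, age 39 maps to 22/73, about 0.301370. A column whose values are all identical would divide by zero, so such columns need separate handling.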

## **Creating continuous and categorical features**

Feature columns act as the bridge between the raw data and the estimator. They are very versatile; the TensorFlow documentation on feature columns explains them in detail.
If you want to keep growing in tech, make a habit of reading and understanding the documentation for whatever you want to learn.

To create a feature column we call TensorFlow's own `tf.feature_column` module. For numeric columns, use `tf.feature_column.numeric_column`.

In [13]:
education_num_feature=tf.feature_column.numeric_column("education_num")
capital_gain_feature=tf.feature_column.numeric_column("capital_gain")
capital_loss_feature=tf.feature_column.numeric_column("capital_loss")
hours_feature=tf.feature_column.numeric_column("hours_per_week")
age1=tf.feature_column.numeric_column("age")

For categorical columns, two of the options TensorFlow offers are:
- `tf.feature_column.categorical_column_with_hash_bucket`: use this when you do not know the set of possible values in advance, or when there are too many of them to list.
- `tf.feature_column.categorical_column_with_vocabulary_list`: use this when you know every possible value and there are only a few.

Because several of our categorical columns have many distinct values, we use the hash function.
Choose a `hash_bucket_size` larger than the number of categories in the column to reduce the chance that two different categories land in the same bucket. A larger size makes collisions less likely but does not rule them out. Some sizes below equal the category count (for example, 2 buckets for the two `gender` values), so collisions may still occur.
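To illustrate how bucket size affects collisions, here is a small sketch. The `bucket` helper is hypothetical and uses MD5 purely as a stand-in; TensorFlow uses its own fingerprint hash, so the actual bucket ids will differ. The category names are the first five `education` values shown earlier.

```python
import hashlib

def bucket(value, num_buckets):
    """Map a string to a bucket index (MD5 is a stand-in, not TensorFlow's hash)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

categories = ["Bachelors", "HS-grad", "11th", "Masters", "9th"]

for size in (3, 5, 50):
    assigned = {c: bucket(c, size) for c in categories}
    distinct = len(set(assigned.values()))
    print(f"{size} buckets: {distinct} distinct ids for {len(categories)} categories")
```

With fewer buckets than categories, at least two categories must share a bucket. With more buckets, sharing becomes less likely but is never excluded. Checking `df[col].nunique()` for each column is one way to pick a size with some headroom.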

In [15]:
education1=tf.feature_column.categorical_column_with_hash_bucket("education",hash_bucket_size=16)
workclass1=tf.feature_column.categorical_column_with_hash_bucket("workclass",hash_bucket_size=10)
martial1=tf.feature_column.categorical_column_with_hash_bucket("marital_status",hash_bucket_size=7)
occupation1=tf.feature_column.categorical_column_with_hash_bucket("occupation",hash_bucket_size=14)
relationship1=tf.feature_column.categorical_column_with_hash_bucket("relationship",hash_bucket_size=6)
race1=tf.feature_column.categorical_column_with_hash_bucket("race",hash_bucket_size=5)
gender1=tf.feature_column.categorical_column_with_hash_bucket("gender",hash_bucket_size=2)
native_country1=tf.feature_column.categorical_column_with_hash_bucket("native_country",hash_bucket_size=60)

Now we collect all of these variables into a single list of feature columns.

In [16]:
feat_columns=[age1,education1,workclass1,martial1,occupation1,relationship1,race1,gender1,native_country1,education_num_feature,capital_gain_feature,capital_loss_feature,hours_feature]

In [19]:
df

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,0.301370,State-gov,Bachelors,0.800000,Never-married,Adm-clerical,Not-in-family,White,Male,0.021740,0.0,0.397959,United-States,0
1,0.452055,Self-emp-not-inc,Bachelors,0.800000,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.000000,0.0,0.122449,United-States,0
2,0.287671,Private,HS-grad,0.533333,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.000000,0.0,0.397959,United-States,0
3,0.493151,Private,11th,0.400000,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.000000,0.0,0.397959,United-States,0
4,0.150685,Private,Bachelors,0.800000,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.000000,0.0,0.397959,Cuba,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,0.136986,Private,Assoc-acdm,0.733333,Married-civ-spouse,Tech-support,Wife,White,Female,0.000000,0.0,0.377551,United-States,0
32557,0.315068,Private,HS-grad,0.533333,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0.000000,0.0,0.397959,United-States,1
32558,0.561644,Private,HS-grad,0.533333,Widowed,Adm-clerical,Unmarried,White,Female,0.000000,0.0,0.397959,United-States,0
32559,0.068493,Private,HS-grad,0.533333,Never-married,Adm-clerical,Own-child,White,Male,0.000000,0.0,0.193878,United-States,0


## **Train test split**

In [17]:
data = df.drop('income_bracket', axis=1)

In [18]:
target = df['income_bracket']

In [21]:
from sklearn.model_selection import train_test_split

xtr,xts,ytr,yts = train_test_split(data,
                                  target,
                                  test_size=.30,
                                  random_state=101)

Next we create an input function, which feeds the DataFrame to the estimator in batches. It needs to know the features (`x`) and the target (`y`). Note that `tf.estimator.inputs.pandas_input_fn` is a TensorFlow 1.x API.

In [23]:
input_func=tf.estimator.inputs.pandas_input_fn(x=xtr,y=ytr,batch_size=10,num_epochs=1000,shuffle=True)
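To make the roles of `batch_size`, `num_epochs`, and `shuffle` concrete, here is a simplified pure-Python sketch of what an input function does. The `batches` generator and the sample rows are hypothetical. The real `pandas_input_fn` uses a queue and can differ in details such as how batches are filled at epoch boundaries.

```python
import random

def batches(rows, labels, batch_size, num_epochs, shuffle, seed=0):
    """Yield (feature_batch, label_batch) pairs, one epoch after another."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    for _ in range(num_epochs):
        if shuffle:
            rng.shuffle(indices)  # new row order for each epoch
        for start in range(0, len(indices), batch_size):
            chunk = indices[start:start + batch_size]
            yield [rows[i] for i in chunk], [labels[i] for i in chunk]

# Hypothetical rows and labels, just to exercise the generator.
rows = [{"age": 0.30}, {"age": 0.45}, {"age": 0.29}, {"age": 0.49}, {"age": 0.15}]
labels = [0, 0, 0, 0, 1]

all_batches = list(batches(rows, labels, batch_size=2, num_epochs=2, shuffle=True))
print(len(all_batches))  # 3 batches per epoch, 2 epochs
```

The training call later uses `steps=1000`, so with `batch_size=10` it consumes about 10,000 rows. The test split holds 9,769 rows at 30%, which puts the training split at roughly 22,800 rows, so training most likely stops before finishing a single epoch and `num_epochs=1000` acts only as an upper bound.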

Next we define our linear classifier. `LinearClassifier` trains a linear model (for two classes, effectively logistic regression) to predict the target. Why linear? It is a simple baseline for a binary target such as ours, with classes 0 and 1, although a linear model is not limited to two classes.

In [24]:
model=tf.estimator.LinearClassifier(feature_columns=feat_columns,n_classes=2)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'C:\\Users\\Hp\\AppData\\Local\\Temp\\tmp48guo6bu', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x00000241CF71CD88>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


## **Training model**

In [25]:
model.train(input_fn=input_func,steps=1000)

Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
INFO:tensorflow:Calling model_fn.
Instructions for updating:
Use tf.cast instead.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
INFO:tensorflow:Saving checkpoints for 0 into C:\Users\Hp\AppData\Local\Temp\tmp48guo6bu\model.ckpt.
INFO:tensorflow:loss = 6.931472, step = 1
INFO:tensorflow:global_step/sec: 69.0434
INFO:tensorflow:loss = 4.384447, step = 101 (1.457 sec)
INFO:tensorflow:global_step/sec: 199.338
INFO:tensorflow:loss = 4.3695326, step = 201 (0.501 sec)
INFO:tensorflow:global_step/sec: 211.088
INFO:tensorflow

<tensorflow_estimator.python.estimator.canned.linear.LinearClassifier at 0x241cf71cd48>
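The first logged loss, 6.931472, can be checked by hand. The sketch below assumes that the weights start at zero, so every example gets probability 0.5, and that the reported loss is the sum over the batch of 10. Both are assumptions about the estimator's defaults and are not stated in the notebook.

```python
import math

batch_size = 10  # from the training input function
per_example_loss = -math.log(0.5)  # cross-entropy when the predicted probability is 0.5
initial_batch_loss = batch_size * per_example_loss

print(round(initial_batch_loss, 6))  # 6.931472
```

Under the same assumption, the later value of about 4.37 per batch corresponds to roughly 0.44 per example, which is in line with the `average_loss` reported during evaluation.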

## **Prediction**

Now it is time to look at our predictions. Are they correct?
Keep in mind that a prediction is not gospel, so some wrong guesses are to be expected.
First we define a new input function. During training we had to supply both the features and the target.
For prediction the target is not needed: the input function below still passes `y=yts`, but `predict` uses only the features. We then compare the predictions with the actual values in `yts` to evaluate the model. Let's begin!

In [28]:
pred_fn = tf.estimator.inputs.pandas_input_fn(
      x=xts,
      y=yts,
      batch_size=10,
      num_epochs=1,
      shuffle=False)

In [29]:
predictions = list(model.predict(input_fn=pred_fn))

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from C:\Users\Hp\AppData\Local\Temp\tmp48guo6bu\model.ckpt-1000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


In [30]:
predictions[0]

{'logits': array([-1.3125604], dtype=float32),
 'logistic': array([0.21205875], dtype=float32),
 'probabilities': array([0.7879413, 0.2120587], dtype=float32),
 'class_ids': array([0], dtype=int64),
 'classes': array([b'0'], dtype=object)}
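The fields in this dictionary are related: `logistic` is the sigmoid of `logits`, `probabilities` holds the pair (1 - p, p), and `class_ids` is 1 when p exceeds 0.5. A minimal check using the numbers printed above:

```python
import math

def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

logit = -1.3125604            # 'logits' from predictions[0]
p_one = sigmoid(logit)        # probability of class 1, compare with 'logistic'
p_zero = 1.0 - p_one          # probability of class 0
class_id = int(p_one > 0.5)   # compare with 'class_ids'

print(round(p_one, 6), round(p_zero, 6), class_id)
```

The values agree with the printed `logistic` (0.21205875) and `probabilities` (0.7879413, 0.2120587) up to float32 rounding.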

In [31]:
final_preds = []
for pred in predictions:
    final_preds.append(pred['class_ids'][0])

In [32]:
final_preds[:10]

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

## **Evaluation**

We have reached the final stage of the project.
We will score the model's predictions against the actual data using sklearn.
Before that, you should already know how evaluation differs between classification, regression, and clustering.

In [36]:
from sklearn.metrics import classification_report

print(classification_report(yts,final_preds))

              precision    recall  f1-score   support

           0       0.82      0.97      0.89      7436
           1       0.77      0.32      0.45      2333

    accuracy                           0.81      9769
   macro avg       0.79      0.64      0.67      9769
weighted avg       0.81      0.81      0.78      9769
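To show where the numbers in the report come from, the sketch below recomputes them from confusion-matrix counts. The counts are not printed in the notebook. They are back-calculated from the reported precision, recall, and support, so treat them as an inference.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 for one class from its counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Inferred counts, taking class 1 (income above 50K) as the positive class.
tp, fp, fn, tn = 738, 224, 1595, 7212

p1, r1, f1_1 = precision_recall_f1(tp, fp, fn)
# For class 0 the roles swap: its true positives are tn, its false positives are fn.
p0, r0, f1_0 = precision_recall_f1(tn, fn, fp)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(round(p0, 2), round(r0, 2), round(f1_0, 2))
print(round(p1, 2), round(r1, 2), round(f1_1, 2))
print(round(accuracy, 2))
```

Read this way, the model misses 1,595 of the 2,333 people in class 1, which is why recall for that class is only 0.32.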



In [42]:
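# eval_input_func is not defined in the cells shown; the definition below is an assumption.
eval_input_func = tf.estimator.inputs.pandas_input_fn(x=xts, y=yts, batch_size=len(xts), num_epochs=1, shuffle=False)
results = model.evaluate(eval_input_func)
results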

INFO:tensorflow:Calling model_fn.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-11-28T09:03:58Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from C:\Users\Hp\AppData\Local\Temp\tmp48guo6bu\model.ckpt-1000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2019-11-28-09:04:01
INFO:tensorflow:Saving dict for global step 1000: accuracy = 0.8137987, accuracy_baseline = 0.7611833, auc = 0.84021556, auc_precision_recall = 0.6496148, average_loss = 0.4061277, global_step = 1000, label/mean = 0.23881666, loss = 3967.4614, precision = 0.7671518, prediction/mean = 0.20776603, recall = 0.3163309
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 1000: C:\Users\Hp\AppData\Local\Temp\tmp48guo6bu\model.ckpt-1000


{'accuracy': 0.8137987,
 'accuracy_baseline': 0.7611833,
 'auc': 0.84021556,
 'auc_precision_recall': 0.6496148,
 'average_loss': 0.4061277,
 'label/mean': 0.23881666,
 'loss': 3967.4614,
 'precision': 0.7671518,
 'prediction/mean': 0.20776603,
 'recall': 0.3163309,
 'global_step': 1000}
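Several of these metrics are tied together. The sketch below checks two of the relationships with the reported numbers: `accuracy_baseline` is the accuracy of always predicting the majority class, and `loss` matches `average_loss` times the number of test rows. The second relationship suggests that the evaluation ran as a single batch over the whole test set, which is an inference rather than something the notebook states.

```python
label_mean = 0.23881666   # share of test labels equal to 1
average_loss = 0.4061277  # mean loss per example
n_test = 9769             # support reported by classification_report

# Always predicting the majority class is right for the larger of the two shares.
baseline = max(label_mean, 1 - label_mean)
# Summed loss if all test rows form one batch.
summed_loss = average_loss * n_test

print(round(baseline, 7), round(summed_loss, 2))
```

The model's accuracy of 0.8138 is therefore only about five percentage points above the 0.7612 obtained by predicting class 0 for everyone.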

**Conclusion**

Our predictions are neither entirely good nor entirely bad. Why does recall reach 97% for class 0 but only 32% for class 1? A likely reason is class imbalance: only about 24% of the test labels are 1 (`label/mean` = 0.2388), so the model leans toward predicting 0.
Whether this is acceptable depends on the business case and the kind of solution it needs.
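If the business case values recall for class 1, one option is to lower the decision threshold from the default 0.5. The following is a minimal sketch with hypothetical probabilities; in the notebook, these would come from `pred['logistic'][0]` for each prediction. The `classify` helper is illustrative and not part of TensorFlow.

```python
def classify(probabilities, threshold=0.5):
    """Assign class 1 when the probability of class 1 reaches the threshold."""
    return [int(p >= threshold) for p in probabilities]

# Hypothetical class-1 probabilities for five people.
probs = [0.21, 0.35, 0.48, 0.62, 0.90]

default_labels = classify(probs)                # threshold 0.5
lenient_labels = classify(probs, threshold=0.3)

print(default_labels)
print(lenient_labels)
```

A lower threshold labels more people as class 1, which typically raises recall for that class at the cost of precision.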