## TensorFlow: Logistic Regression

### Intro to logistic regression

### What is logistic regression?

* A statistical method for modeling the relationship between a binary dependent variable and one or more independent variables

* Model is a classifier

* Predicts a binary outcome

* Generalized Linear Model (GLM)

* Learns its parameters by minimizing a loss function with an optimizer

* Outputs class probabilities

* LR is a linear model because the score __z = w.T x + b__ is linear in the inputs

* The linear function z is the argument of the logistic function

* The logistic function squashes every output into the interval (0, 1)
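
As a minimal sketch of the two steps above (linear score, then logistic function), the following NumPy snippet computes a class probability for one example. The weights, bias, and input values are made up for illustration and are not taken from the Titanic model.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters for a model with three features
w = np.array([0.5, -1.0, 2.0])
b = 0.1

x = np.array([1.0, 2.0, 0.5])  # one example
z = w @ x + b                  # linear part: z = w.T x + b
p = sigmoid(z)                 # probability of the positive class

print(z, p)
```

A score of zero maps to a probability of exactly 0.5, so the sign of z decides which class is predicted at the usual 0.5 threshold.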
   

### When to use logistic regression?

  * Depends on dataset and objective
  
  * Good for creating basic binary classifiers from mixed data types
  
  * Categorical data should be encoded (for example, with one-hot encoding)
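
One common way to one-hot encode a categorical column is `pd.get_dummies`. The sketch below uses a small made-up frame (the column names mirror the Titanic data, the values are illustrative only).

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (illustrative values)
df = pd.DataFrame({
    "Embarked": ["S", "C", "Q", "S"],
    "Fare": [7.25, 71.28, 7.75, 8.05],
})

# Replace "Embarked" with one indicator column per category
encoded = pd.get_dummies(df, columns=["Embarked"])
print(encoded.columns.tolist())
```

Each row ends up with exactly one active indicator among the `Embarked_*` columns, while the numeric column is left unchanged.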

## Neural networks

### Types

* MLP

* Autoencoder 

* Convolutional (CNN)

* Recurrent (RNN)

### Activation Functions

* sigmoid (logistic)

* ReLU

* tanh

* linear

* softmax
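
For reference, the activation functions listed above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the one TensorFlow uses internally. The helper names are chosen here for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def linear(z):
    return z  # identity: the output equals the input

def softmax(z):
    # subtract the row maximum first for numerical stability
    shifted = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), relu(z), np.tanh(z), linear(z), softmax(z))

# With two classes, softmax over [z, 0] gives the same probability as sigmoid(z)
two_class = softmax(np.array([1.5, 0.0]))[0]
```

The last line is why a two-class softmax model and a sigmoid model describe the same family of classifiers.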

## Example Logistic Regression model in TF

__Data Source: https://www.kaggle.com/c/titanic/data__



In [91]:
import numpy as np
import tensorflow as tf
import pandas as pd
from pandas import DataFrame as DF, Series
import gc

In [92]:
data = pd.read_csv("datasets/titanic/train.csv")

In [93]:
data.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [65]:
data.drop(["PassengerId", "Name", "Ticket"], axis=1, inplace=True)

In [66]:
data.to_csv("datasets/titanic-reformatted-data.csv", index=False)

In [67]:
del data
gc.collect()

0

In [68]:
data = pd.read_csv("datasets/titanic-reformatted-data.csv")
data.head()

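|   | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | NaN | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C85 | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | NaN | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | C123 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | NaN | S |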


In [69]:
data.isnull().sum()

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Cabin       687
Embarked      2
dtype: int64

In [70]:
data.fillna({"Age": -1, "Cabin": "Unk", "Embarked": "Unk", "Fare": -1}, inplace=True)

In [71]:
# convert sex to binary
data.loc[:, "Sex"] = (data.Sex == "female").astype(int)

Xtr = data.loc[:, ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]].sample(frac=0.75)

Xts = data[~data.index.isin(Xtr.index)].loc[:, ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]]

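# one-hot encode the labels
# select rows by Xtr.index / Xts.index so labels stay aligned with the features
# (sample() returns rows in shuffled order, so a boolean mask would misalign them)
Ytr = pd.get_dummies(data.loc[Xtr.index, "Survived"]).values
Yts = pd.get_dummies(data.loc[Xts.index, "Survived"]).values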

In [73]:
data.head()
data.isnull().sum()

Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Cabin       0
Embarked    0
dtype: int64

### note

No custom initializers are defined for the weights and bias; both start at zero.

To use a custom initializer, we would create the variables with __get_variable__ instead.

In [74]:
num_features = Xtr.shape[1]
num_classes = 2

X = tf.placeholder(tf.float32, [None, num_features], name="X")
Y = tf.placeholder(tf.float32, [None, num_classes], name="Y")

W = tf.Variable(tf.zeros([num_features, num_classes]))
b = tf.Variable(tf.zeros([num_classes]))

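# define logistic model: logits = XW + b, yhat = softmax(logits)
logits = tf.add(tf.matmul(X, W), b)
yhat = tf.nn.softmax(logits)

# define loss function
# softmax_cross_entropy_with_logits applies softmax internally, so pass the raw logits
loss_fn = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y))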

# define optimizer
opt = tf.train.AdamOptimizer(0.01).minimize(loss_fn)

print(W.shape)
print(X.shape)

(6, 2)
(?, 6)
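
`tf.nn.softmax_cross_entropy_with_logits` applies softmax internally, so its `logits` argument expects raw scores rather than probabilities. The NumPy sketch below reimplements the same quantity to show what changes when already-softmaxed values are passed in by mistake. The helper names and the toy numbers are assumptions made for this illustration.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy_from_logits(logits, labels):
    """Per-example cross-entropy computed from raw scores via log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1)

# Two toy examples, both labelled class 0 (one-hot)
logits = np.array([[2.0, -1.0], [0.0, 3.0]])
labels = np.array([[1.0, 0.0], [1.0, 0.0]])

loss_correct = cross_entropy_from_logits(logits, labels)          # raw scores
loss_double = cross_entropy_from_logits(softmax(logits), labels)  # softmax applied twice

print(loss_correct, loss_double)
```

Because probabilities lie between 0 and 1, the double-softmax version can only see score differences between -1 and 1. That squeezes every per-example loss into a narrow band and weakens the gradient signal for confidently wrong predictions.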


In [75]:
init = tf.global_variables_initializer()
num_epochs = 10

with tf.Session() as sess:
    sess.run(init)
    
    for i in range(num_epochs):
        sess.run(opt, feed_dict={X: Xtr, Y: Ytr})

    # accuracy function
    accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(yhat, 1), tf.argmax(Y, 1)), "float"))
    
    # accuracy test value
    accuracy_value = sess.run(accuracy, feed_dict={X: Xts, Y: Yts})

In [76]:
accuracy_value

0.6547085
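
The accuracy expression used above (argmax of predictions compared with argmax of one-hot labels, then averaged) can be checked by hand on a small example. The probabilities and labels below are made up for illustration.

```python
import numpy as np

# Hypothetical predicted class probabilities and one-hot labels for four examples
probs = np.array([[0.8, 0.2], [0.3, 0.7], [0.6, 0.4], [0.1, 0.9]])
labels = np.array([[1, 0], [0, 1], [0, 1], [0, 1]])

predicted = probs.argmax(axis=1)   # index of the most probable class
actual = labels.argmax(axis=1)     # index of the 1 in each one-hot row

# cast the boolean matches to float and average them
accuracy = (predicted == actual).astype(np.float32).mean()
print(accuracy)
```

Here three of the four predictions match, so the mean of the matches is 0.75.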

## Logistic Regression with Batching

In [77]:
data = pd.read_csv("datasets/titanic-reformatted-data.csv")
data.head()

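|   | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | NaN | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C85 | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | NaN | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | C123 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | NaN | S |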


In [87]:
# define columns and default values
_csv_column_defaults = [[0], [-1], ['Unk'], [-1.], [0], [0], [-1.], ['Unk'], ['Unk']]
_csv_columns = data.columns.tolist()

def input_fn(csv_file, feature_names, batch_size=16, n_epochs=10, shuffle=False):
    def decode_csv(line):
        parsed_line = tf.decode_csv(line, _csv_column_defaults)
        features_dict = dict(zip(feature_names, parsed_line))
        
        labels = features_dict.pop("Survived")
        return features_dict, labels
    
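    # build the dataset first: skip the header row and parse each line
    dataset = tf.data.TextLineDataset(csv_file).skip(1).map(decode_csv, num_parallel_calls=3)

    # shuffle only after the dataset exists
    if shuffle:
        dataset = dataset.shuffle(buffer_size=100*1024)
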
    dataset = dataset.batch(batch_size)
    dataset = dataset.repeat(n_epochs)
    iterator = dataset.make_one_shot_iterator()
    batch_features, batch_labels = iterator.get_next()
    return batch_features, batch_labels
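
To make the effect of `batch` followed by `repeat` concrete, here is a plain-Python sketch of the same iteration pattern, together with the resulting step count. It assumes the training file has 891 data rows (the size of Kaggle's Titanic `train.csv`). The `batches` helper is hypothetical and only mimics the ordering, not the `tf.data` API.

```python
import math

def batches(rows, batch_size, n_epochs):
    """Yield mini-batches: split one pass into batches, then repeat the pass."""
    for _ in range(n_epochs):
        for start in range(0, len(rows), batch_size):
            yield rows[start:start + batch_size]

n_rows = 891                  # assumption: row count of the Titanic training file
batch_size, n_epochs = 16, 10  # defaults of input_fn

steps_per_epoch = math.ceil(n_rows / batch_size)  # the last batch of each pass is smaller
total_steps = steps_per_epoch * n_epochs

all_batches = list(batches(list(range(n_rows)), batch_size, n_epochs))
print(steps_per_epoch, total_steps, len(all_batches[steps_per_epoch - 1]))
```

Under that assumption one pass yields 56 batches and ten passes yield 560 steps, which is consistent with the `model.ckpt-560` checkpoint name in the training log.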

## Handling categorical features

Use `tf.feature_column` to describe how each input column maps into the model, instead of passing arrays through feed dicts.

In [88]:
sex = tf.feature_column.categorical_column_with_vocabulary_list("Sex", vocabulary_list=["female", "male", "Unk"])
embarked = tf.feature_column.categorical_column_with_vocabulary_list("Embarked", vocabulary_list=["S", "C", "Q", "Unk"])

age = tf.feature_column.numeric_column("Age")

sib = tf.feature_column.numeric_column("SibSp")

parch = tf.feature_column.numeric_column("Parch")

fare = tf.feature_column.numeric_column("Fare")
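
Conceptually, a vocabulary-list categorical column maps each string to its position in the vocabulary, and a linear model then learns one weight per position. The sketch below shows that mapping as a one-hot matrix in NumPy. It is an illustration of the idea under the assumption that every value appears in the vocabulary, not a reproduction of TensorFlow's internals.

```python
import numpy as np

vocabulary = ["S", "C", "Q", "Unk"]
index = {token: i for i, token in enumerate(vocabulary)}  # string -> integer id

def one_hot(values, index):
    """Map each string to its vocabulary id, then to a one-hot row."""
    ids = np.array([index[v] for v in values])
    return np.eye(len(index), dtype=np.float32)[ids]

encoded = one_hot(["S", "Q", "Unk"], index)
print(encoded)
```

Numeric columns such as `Age` or `Fare` need no such mapping and enter the model as single float features.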


## Define model

In [90]:
columns = [sex, embarked, age, sib, parch, fare]

model_dir = "lr_model"

# uses logistic regression model
model = tf.estimator.LinearClassifier(model_dir=model_dir, 
                                      feature_columns=columns, 
                                      optimizer=tf.train.AdamOptimizer())

model.train(input_fn=lambda: input_fn("datasets/titanic-reformatted-data.csv", _csv_columns))

# evaluation reads the same file used for training, so these are training-set metrics
results = model.evaluate(input_fn=lambda: input_fn("datasets/titanic-reformatted-data.csv", _csv_columns, n_epochs=1))

results

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'lr_model', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x12600d978>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from lr_model/model.ckpt-560
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving check

{'accuracy': 0.7968575,
 'accuracy_baseline': 0.6161616,
 'auc': 0.83007914,
 'auc_precision_recall': 0.7614253,
 'average_loss': 0.5082387,
 'label/mean': 0.3838384,
 'loss': 8.08644,
 'precision': 0.7523511,
 'prediction/mean': 0.4128513,
 'recall': 0.7017544,
 'global_step': 1120}
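
Two of the reported metrics can be cross-checked from the others with simple arithmetic. The sketch below assumes 891 evaluation rows and a batch size of 16, and it assumes that `loss` is the average of per-batch summed losses while `average_loss` is the mean per example. The numbers above are consistent with that reading, but it is stated here as an assumption.

```python
import math

# Values copied from the evaluation results
label_mean = 0.3838384
average_loss = 0.5082387

n_rows = 891      # assumption: number of rows evaluated
batch_size = 16   # default batch size of input_fn

# Accuracy of always predicting the majority class
accuracy_baseline = max(label_mean, 1.0 - label_mean)

# Total loss over all rows, divided by the number of batches
n_batches = math.ceil(n_rows / batch_size)
loss_per_batch = average_loss * n_rows / n_batches

print(accuracy_baseline, loss_per_batch)
```

The model's accuracy of about 0.80 should therefore be read against a majority-class baseline of about 0.62, not against 0.5.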