# Linear Regression with TensorFlow

This notebook shows how to perform a linear regression with the `TensorFlow` library. As an example, we use a set of OpenStreetMap (OSM) elements gathered around Bordeaux. The goal is to predict the number of contributors for each element from a set of other element characteristics.

## Step 0: module imports

`matplotlib` is used to plot the regression results, `os` for relative path handling, `math` for the final error computation, `pandas` to handle the input dataframe and, of course, `tensorflow` to perform the regression. The code relies on the TensorFlow 1.x graph API (placeholders and `tf.Session`).

In [None]:
%matplotlib inline

In [None]:
import math
import matplotlib.pyplot as plt
import os
import pandas as pd
import tensorflow as tf

## Step 1: data retrieval and preparation

The data describes a set of OSM elements; we assume it is already available on the local machine.

In [None]:
data = pd.read_csv("/home/rde/data/osm-history/output-extracts/bordeaux-metropole/element-metadata.csv", index_col=0)
data.shape

The table contains 2760999 individuals, each described by 17 features. A short extract of the dataset:

In [None]:
data.sample(6).T

In this study, the number of contributors is the output to predict. We select a small set of features as predictors: the number of days between the first and last modifications, the number of days since the first modification, the number of days on which modifications occurred, the last version, the number of change sets, and the numbers of autocorrections and corrections.

In [None]:
# Create data_x and data_y, two subsets of data that will be respectively the predictors and the predicted feature
# list of features to integrate into data_x: "lifespan", "n_inscription_days", "n_activity_days", "version", "n_chgset", "n_autocorr", "n_corr"


As good practice, we use the dedicated `sklearn` function, `train_test_split`, to split the dataset into **train** and **test** data.

In [None]:
# Import the appropriate function from sklearn.model_selection

# Create four arrays x_train, x_test, y_train and y_test with train_test_split function (test_size=0.1)
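
A minimal sketch of the split on toy arrays, assuming `scikit-learn` is installed. `toy_x` and `toy_y` are hypothetical stand-ins for `data_x` and `data_y`, and `random_state=42` is an arbitrary seed that the notebook does not specify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for data_x (7 predictors) and data_y (1 output)
toy_x = np.arange(20 * 7, dtype=float).reshape(20, 7)
toy_y = np.arange(20, dtype=float).reshape(20, 1)

# test_size=0.1 keeps 10% of the rows for testing; random_state is an
# arbitrary seed added here only to make the split reproducible
x_train, x_test, y_train, y_test = train_test_split(
    toy_x, toy_y, test_size=0.1, random_state=42)

print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
```

With 20 rows, 18 end up in the train set and 2 in the test set; predictors and outputs are shuffled together, so rows stay aligned.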


In [None]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

## Step 2: Parameter settings

In [None]:
# Parameters
learning_rate = 0.01
training_epochs = 1000
display_step = 100

## Step 3: TensorFlow model design

For the sake of code readability and graph visualization clarity, we will use `tf.name_scope` from now on. The model design is much cleaner with this kind of context manager.

First, we need two tensors: one for the inputs and one for the outputs.

In [None]:
# Create the context manager (based on tf.name_scope)

    # Create the placeholders X and Y (tf.float32)
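
A possible sketch of this cell, under a few assumptions: the scope name `inputs` is hypothetical, the placeholders are built in a separate `tf.Graph` so the example does not interfere with the notebook's default graph, and `tf.compat.v1` is used so that the 1.x-style calls also resolve under TensorFlow 2.

```python
import tensorflow as tf

tf1 = tf.compat.v1  # TensorFlow 1.x-style graph API

n_features = 7  # number of predictors listed in Step 1

graph = tf.Graph()  # separate graph so the sketch does not touch the default one
with graph.as_default():
    with tf1.name_scope("inputs"):
        # None leaves the number of rows free, so train and test data both fit
        X = tf1.placeholder(tf.float32, shape=[None, n_features], name="X")
        Y = tf1.placeholder(tf.float32, shape=[None, 1], name="Y")

print(X.name, X.shape)
print(Y.name, Y.shape)
```

The scope name becomes a prefix of the tensor names, which is what groups the nodes in the graph visualization.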


The linear regression is defined by weights and a bias, which are declared as TensorFlow variables and combined into the output tensor `predictions`. The linear model is as follows:

`Y[N,1] = X[N,k] * W[k,1] + b[1,1]` (`N` being the number of elements, `k` the number of predictors, `W` the vector of weights, `b` the bias, and `*` the matrix product)
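
To make the shapes concrete, here is a small NumPy illustration of the same formula. It is not the TensorFlow code the exercise asks for, and the sizes and values are arbitrary.

```python
import numpy as np

N, k = 4, 7                       # 4 samples, 7 predictors (toy sizes)
X = np.ones((N, k))               # inputs, shape [N, k]
W = np.full((k, 1), 0.5)          # weights, shape [k, 1]
b = np.array([[2.0]])             # bias, shape [1, 1], broadcast over the N rows

Y = X @ W + b                     # matrix product, then bias added to every row
print(Y.shape)                    # (4, 1)
print(Y[0, 0])                    # 7 * 0.5 + 2.0 = 5.5
```

The weight tensor must therefore have shape `[k, 1]`, so that the product with the `[N, k]` input yields one prediction per row.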

In [None]:
# Create a new context manager containing the model definition (let's say 'linear_reg')

    # Create the weights associated with each predictor (initializer=tf.truncated_normal_initializer()), be careful with the shape!
    
    # Create the bias associated with the model (initializer=tf.constant_initializer(0.0))

    # Create another tensor for the model prediction (recall the linear model definition)


The cost function is the sum of the squared differences between predictions and true outputs, to which an L2 regularization term on the weights is added.
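
As an illustration, the same quantity can be computed by hand in NumPy. This is a sketch with made-up values; the helper name `regularized_loss` is hypothetical. `tf.nn.l2_loss(w)` is defined as `sum(w ** 2) / 2`, which is the definition reproduced here.

```python
import numpy as np

def regularized_loss(predictions, targets, w, reg=0.01):
    """Sum of squared errors plus an L2 penalty on the weights."""
    squared_error = np.sum(np.square(predictions - targets))
    l2_penalty = np.sum(np.square(w)) / 2.0  # same definition as tf.nn.l2_loss
    return squared_error + reg * l2_penalty

predictions = np.array([[1.0], [2.0], [3.0]])
targets = np.array([[1.0], [1.0], [5.0]])
w = np.array([[2.0], [0.0]])

# squared errors: 0 + 1 + 4 = 5; penalty: 0.01 * (4 / 2) = 0.02
loss = regularized_loss(predictions, targets, w)
print(loss)
```

Because the error term is a sum rather than a mean, its magnitude grows with the number of rows, while the penalty does not.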

In [None]:
# Create a new context manager containing the loss (objective function, to minimize)

    # Create the loss function, by using tf.reduce_sum and tf.square (+ regularization value 0.01*tf.nn.l2_loss(w))


We use the Adam optimizer to update the model variables (weights and bias) so as to minimize the loss.
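
For intuition about what the optimizer does at each step, here is a plain NumPy sketch of the standard Adam update rule applied to a one-parameter toy problem. It is an illustration, not TensorFlow's implementation; the helper `adam_step` and the toy objective are hypothetical, and the `beta1`, `beta2` and `epsilon` values are the commonly used defaults.

```python
import numpy as np

def adam_step(param, grad, m, v, t, learning_rate=0.01,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Apply one Adam update and return the new parameter and moment estimates."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    return param, m, v

# Toy problem: minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t)
print(w)
```

Each update is scaled by the gradient history, so the step taken per epoch is roughly bounded by `learning_rate` whatever the raw gradient magnitude.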

In [None]:
# Create a new context manager containing the optimizer

    # Declare the minimization of the loss through the optimizer (tf.train.AdamOptimizer, alternative choice: GradientDescentOptimizer, ...)
    # Use learning_rate as a parameter of the optimizer


## Step 4: Variable initialization

In [None]:
# Initialize all the variables


## Final step: running the model

First, we have to open a new session, initialize the variables, and prepare the graph summary (and checkpoint utilities):

In [None]:
# Old way of opening a session (for the purposes of this notebook only!)
session = tf.Session()

In [None]:
# Run the initializer tensor

# Create a graph summary (tf.summary.FileWriter)


The model is ready to be trained. We run as many training steps as set by the `training_epochs` parameter above.

In [None]:
costs = list()
weights = list()
biases = list()
for epoch in range(training_epochs):
    # Run the linear regression model with train data (by using feed_dict parameter of session.run)

    # Print the current state of training according to epoch value
    if (epoch+1) % display_step == 0:
        # Re-run the model without training it, for printing purposes

        print("*** Epoch", '%04d' % (epoch+1), "cost={}\nn_user = {:.3f}*X1 + {:.3f}*X2 + {:.3f}*X3 + {:.3f}*X4 + {:.3f}*X5 + {:.3f}*X6 + {:.3f}*X7 + {:.3f} ***"
              .format(training_cost, weight[0][0], weight[1][0], weight[2][0], weight[3][0], weight[4][0], weight[5][0], weight[6][0], bias[0]))
        # Store the model results into dedicated lists
        costs.append(training_cost)
        weights.append(weight[:,0])
        biases.append(bias[0])
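
To see the whole pipeline run end to end, here is a minimal, self-contained sketch on synthetic data. It is one possible reading of the exercise, not the reference solution: the scope names `inputs`, `loss` and `optimizer`, the variable names, the toy data, the learning rate of 0.05 and the 500 epochs are all assumptions, and `tf.compat.v1` is used so that the 1.x-style calls also resolve under TensorFlow 2.

```python
import numpy as np
import tensorflow as tf

tf1 = tf.compat.v1  # TensorFlow 1.x-style graph API

# Synthetic data (hypothetical): y = 2*x1 - 1*x2 + 0.5*x3 + 1
rng = np.random.RandomState(0)
x_toy = rng.normal(size=(200, 3)).astype(np.float32)
y_toy = x_toy @ np.array([[2.0], [-1.0], [0.5]], dtype=np.float32) + 1.0

graph = tf.Graph()  # separate graph so the sketch does not touch the default one
with graph.as_default():
    with tf1.name_scope("inputs"):
        X = tf1.placeholder(tf.float32, shape=[None, 3], name="X")
        Y = tf1.placeholder(tf.float32, shape=[None, 1], name="Y")
    with tf1.name_scope("linear_reg"):
        w = tf1.get_variable("weights", shape=[3, 1],
                             initializer=tf1.truncated_normal_initializer(seed=0))
        b = tf1.get_variable("bias", shape=[1],
                             initializer=tf1.constant_initializer(0.0))
        predictions = tf.matmul(X, w) + b
    with tf1.name_scope("loss"):
        loss = (tf.reduce_sum(tf.square(predictions - Y))
                + 0.01 * tf.nn.l2_loss(w))
    with tf1.name_scope("optimizer"):
        train_op = tf1.train.AdamOptimizer(learning_rate=0.05).minimize(loss)
    init = tf1.global_variables_initializer()

costs = []
with tf1.Session(graph=graph) as session:
    session.run(init)
    for epoch in range(500):
        # One training step: feed the data and apply the optimizer update
        session.run(train_op, feed_dict={X: x_toy, Y: y_toy})
        if (epoch + 1) % 100 == 0:
            # Evaluate without training: only fetch loss and parameters
            training_cost, weight, bias = session.run(
                [loss, w, b], feed_dict={X: x_toy, Y: y_toy})
            costs.append(training_cost)
            print("Epoch {:04d} cost={:.4f}".format(epoch + 1, training_cost))
```

Fetching `train_op` is what updates the variables; fetching only `loss`, `w` and `b` reads the current state without changing it, which is why the display step can safely re-run the graph.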

The results are stored in a pandas dataframe (and may be saved to the file system).

In [None]:
param_history = pd.DataFrame(weights,columns=["lifespan", "n_inscription_days", "n_activity_days", "version", "n_chgset", "n_autocorr", "n_corr"])
param_history["bias"] = biases
param_history["loss"] = costs

We then plot the history of each weight, of the bias and of the loss.

In [None]:
f, ax = plt.subplots(3, 3, figsize=(12,6))
# One panel per column of param_history (7 weights, bias and loss), filled column by column
for i in range(param_history.shape[1]):
    ax[i % 3][i // 3].plot(param_history.iloc[:,i])
    ax[i % 3][i // 3].set_title(param_history.columns[i])
f.tight_layout()
plt.show()

Then the model is run on the test data (which was not used for training). The goal is to evaluate the correspondence between the true values of `y` and the model predictions.

In [None]:
# Run the model on test data, to get its predictions

print("Test cost = {}, i.e. +/- {:.3f} contributor(s) per OSM element on average"
      .format(cost, math.sqrt(cost/len(y_test))))
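
The quantity `math.sqrt(cost/len(y_test))` printed above is a root-mean-square error when `cost` is a sum of squared errors; since the notebook's loss also includes the regularization term, the reported value is slightly inflated. A toy NumPy sketch of the unregularized computation, with made-up values:

```python
import math
import numpy as np

y_true = np.array([[1.0], [2.0], [4.0], [3.0]])   # toy true contributor counts
y_hat = np.array([[2.0], [2.0], [2.0], [5.0]])    # toy model predictions

sum_squared_error = float(np.sum((y_hat - y_true) ** 2))  # 1 + 0 + 4 + 4 = 9
rmse = math.sqrt(sum_squared_error / len(y_true))         # sqrt(9 / 4) = 1.5
print(rmse)
```

Dividing by the number of test rows before taking the square root brings the summed cost back to the scale of a single element.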

A final plot built from the test predictions shows how good they are: the closer the points lie to the diagonal line, the better the predictions.

In [None]:
plt.plot(y_test, y_pred, 'go')
# y_pred is a NumPy array of shape [N, 1]: use its vectorized min and max
output_min, output_max = int(y_pred.min()), int(y_pred.max())
plt.plot(range(output_min, output_max+2), range(output_min, output_max+2))
plt.xlabel("True values of y")
plt.ylabel("Model predictions")
plt.xlim(min(y_test)[0], max(y_test)[0]+2)
plt.ylim(output_min, output_max+2)
plt.tight_layout()
plt.show()

Finally, the TensorFlow session is closed.

In [None]:
# Close the session
session.close()