# We'll start using TensorFlow and Keras (Keras uses TensorFlow, MXNet and other deep learning libraries as a backend).

TensorFlow is a big library. The more you use it and the more use cases you encounter, the better you will learn how to translate your ideas into code (a computational graph).

## Please look at the posted presentation.
## It contains a summary of the Google TensorFlow whitepaper
http://download.tensorflow.org/paper/whitepaper2015.pdf, which I read some time back.

In short, it is a library to build a computation graph.
 - One writes code to specify abstract computations like addition and matrix multiplication.
 - One can feed the actual data later to evaluate different nodes in the graph.
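
To make the "build first, feed data later" idea concrete, here is a minimal sketch in plain NumPy. The classes `Placeholder`, `MatMul`, `Add` and the helper `run` are hypothetical names made up for this illustration; they are not TensorFlow's API, only a toy model of the same idea.

```python
import numpy as np

class Placeholder:
    """A node with no value of its own; data is fed at evaluation time."""
    def __init__(self, name):
        self.name = name

class MatMul:
    """A node describing a matrix product of two upstream nodes."""
    def __init__(self, a, b):
        self.inputs = (a, b)
    def compute(self, a, b):
        return a @ b

class Add:
    """A node describing element-wise addition of two upstream nodes."""
    def __init__(self, a, b):
        self.inputs = (a, b)
    def compute(self, a, b):
        return a + b

def run(node, feed):
    """Recursively evaluate `node`, looking placeholders up in `feed`."""
    if isinstance(node, Placeholder):
        return np.asarray(feed[node])
    args = [run(inp, feed) for inp in node.inputs]
    return node.compute(*args)

# Step 1: describe the computation W x + b. Nothing is computed here.
W, x, b = Placeholder("W"), Placeholder("x"), Placeholder("b")
graph = Add(MatMul(W, x), b)

# Step 2: feed actual data and evaluate the output node.
result = run(graph, {W: [[1.0, 2.0], [3.0, 4.0]],
                     x: [[1.0], [1.0]],
                     b: [[0.5], [0.5]]})
print(result)
```

The same `graph` object can be evaluated again with different data, which is the main point of separating the description of a computation from its execution.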

# But why build a graph for doing computations?
Building the graph before doing the actual computation provides a lot of optimization benefits, like
- Common subexpression elimination, which avoids redundant computation
- Optimizing individual operations

Graph optimization is itself a big area of research, and we can benefit from this research without worrying about doing the research ourselves.
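
As a rough illustration of common subexpression elimination, the sketch below evaluates a nested-tuple expression with and without a cache. The expression format and the `evaluate` helper are assumptions made up for this example; a real library works on a much richer graph representation.

```python
import numpy as np

evaluations = []  # records every operation that actually gets executed

def evaluate(expr, feed, cache=None):
    """Evaluate a nested-tuple expression such as ("add", a, b).

    When `cache` is a dict, structurally identical subexpressions are
    computed once and reused (a toy form of common subexpression elimination).
    """
    if isinstance(expr, str):            # a leaf: look the input up by name
        return feed[expr]
    if cache is not None and expr in cache:
        return cache[expr]
    op, left, right = expr
    a = evaluate(left, feed, cache)
    b = evaluate(right, feed, cache)
    evaluations.append(op)
    value = a @ b if op == "matmul" else a + b
    if cache is not None:
        cache[expr] = value
    return value

# The subexpression (W x) appears twice in (W x) + (W x).
wx = ("matmul", "W", "x")
expr = ("add", wx, wx)
feed = {"W": np.array([[1.0, 2.0], [3.0, 4.0]]),
        "x": np.array([[1.0], [1.0]])}

naive = evaluate(expr, feed)                 # no cache: matmul runs twice
naive_ops = list(evaluations)
evaluations.clear()
optimized = evaluate(expr, feed, cache={})   # with cache: matmul runs once
optimized_ops = list(evaluations)

print(naive_ops, optimized_ops)
```

Both runs return the same value; only the amount of work differs. This kind of saving is only possible because the whole expression is known before anything is executed.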

# And what is a tensor, and where does it fit in the computation graph lingo above?

If you want to go deep into what a tensor is, we need to take a deeper dive:

  - [Introduction to Tensors](https://math.stackexchange.com/questions/10282/an%C2%ADintroduction%C2%ADto%C2%ADtensors?%20noredirect=1&lq=1)
  - For lighter reading, just skim through https://en.wikipedia.org/wiki/Tensor

Tensors represent linear relationships between vectors and other tensors.

Think about how a matrix (a kind of tensor) $M_{n,m}$ maps an **m-dimensional** vector $v_{m,1}$ into another
**n-dimensional** vector $(Mv)_{n,1}$.

<font color = "red">Note: Not every multidimensional thing is a tensor. A tensor represents a function. </font>

You may not have thought about scalars, vectors and matrices in this way, but
- A scalar $x \in \mathbb{R}$ is a 0th-order tensor
- A vector is a 1st-order tensor
- A matrix is a 2nd-order tensor
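
A quick NumPy sketch of these orders, using the number of array axes (`ndim`) as a stand-in for the tensor order. That is only a convenient proxy, and the values are made up. The last lines check the linear-map view of a matrix described above.

```python
import numpy as np

scalar = np.array(3.0)                # 0th-order: no axes
vector = np.array([1.0, 0.0, 2.0])    # 1st-order: one axis (m = 3)
matrix = np.array([[1.0, 2.0, 1.0],
                   [2.0, 2.0, 2.0]])  # 2nd-order: two axes (n = 2, m = 3)

print(scalar.ndim, vector.ndim, matrix.ndim)

# The matrix acts as a linear map from R^3 to R^2.
mapped = matrix @ vector
print(mapped.shape)

# Linearity: M(a*u + b*v) equals a*M(u) + b*M(v).
u = np.array([1.0, 1.0, 1.0])
lhs = matrix @ (2.0 * u + 3.0 * vector)
rhs = 2.0 * (matrix @ u) + 3.0 * (matrix @ vector)
print(np.allclose(lhs, rhs))
```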

So TensorFlow builds a computation graph where each **node** represents a computation and the **inputs and outputs** represent the flow of tensors. One way to think about TensorFlow is as a **flow of tensors (scalars, vectors, matrices, etc.) via computation nodes.**

# Let's look at a simple graph from the google paper
<a href="https://github.com/psnegi/ml_s2018/blob/master/hws/"> <img src= "computation_graph.png">  </img> </a>

- In the previous graph, think of the weight matrix (parameters to learn) $W$, the feature vector $X$ and the scalar $b$ as tensors.
- The various nodes (MatMul, Add, etc.) are the computations you want to perform.
- Inputs and outputs flow along the edges of the graph.

### All of the above is great, but one of the biggest advantages of TensorFlow-like graph building libraries is that they do automatic differentiation too
<a href="https://github.com/psnegi/ml_s2018/blob/master/hws/"> <img src= "gradient_computation.png">  </img> </a>
 

One need not worry about building the gradient computation nodes. As one keeps adding parts to the graph (left graph in the picture above), the library automatically keeps building the differentiation graph (right graph in the picture) too.
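
To give a feel for how a library can derive gradients from the graph itself, here is a tiny reverse-mode differentiation sketch for scalars. The `Node` class and `backward` function are hypothetical and written only for this illustration; TensorFlow's actual machinery is far more general.

```python
class Node:
    """A scalar value that remembers how it was computed."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs of (parent_node, local_derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))

def backward(output):
    """Propagate d(output)/d(node) to every node using the chain rule."""
    # Order nodes so each one is processed after everything that depends on it.
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad

# Forward graph: C = w * x + b, built from ordinary-looking arithmetic.
w, x, b = Node(2.0), Node(3.0), Node(1.0)
C = w * x + b
backward(C)
print(C.value, w.grad, x.grad, b.grad)
```

Every arithmetic operation records its local derivative while the forward graph is being built, so the gradient computation needs no extra code from the user.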

Remember that sometimes there is no closed-form solution for the parameters $W$ that maximize the likelihood or log-likelihood (the **MLE estimation procedure**), as in logistic regression. In that case we write the problem as minimizing a loss $C(W)$, for example the negative log-likelihood. We showed that if the function $C(W)$ is differentiable,
one can use an iterative procedure called **gradient descent** to update the parameters.

$W_{k+1} = W_k - \eta \frac{dC(W)}{dW}\Big|_{W = W_k}$

If the function $C(W)$ is convex and one uses the right step size $\eta \in \mathbb{R}^{+}$ (the set of all positive numbers), one is guaranteed to find the optimal value of the parameter $W$. In the logistic regression case, $C(W)$ is the cross entropy, which measures how well the logistic model is currently performing. We want to find the parameters $W$ of the logistic model that minimize the loss $C(W)$.

Note: 
- For initial step $k=0$, in general one can use any random value for parameters $W_0$. 
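
Here is a minimal NumPy sketch of gradient descent on a convex toy loss. Each step moves against the gradient, $W_{k+1} = W_k - \eta \, dC/dW$. The loss, the `target` vector and the step size are assumptions chosen for illustration; this is not the logistic regression loss.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([1.0, -2.0])        # minimizer of the toy loss (assumed)

def loss(W):
    """Convex toy loss C(W) = ||W - target||^2."""
    return float(np.sum((W - target) ** 2))

def grad(W):
    """Gradient dC/dW = 2 (W - target)."""
    return 2.0 * (W - target)

eta = 0.1                              # step size
W = rng.normal(size=2)                 # random starting point W_0
initial_loss = loss(W)
for k in range(100):
    W = W - eta * grad(W)              # W_{k+1} = W_k - eta * dC/dW

final_loss = loss(W)
print(W, final_loss)
```

With this step size the distance to the minimizer shrinks by a constant factor at every step. A step size that is too large (here, anything above 1.0) would make the iterates diverge instead.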

# I hope by now you are somewhat convinced that computation graph building libraries are quite useful and powerful in machine learning.

# Let's get started with TensorFlow (creating various types of nodes and tensors).

https://www.tensorflow.org/guide/

**Please follow the instructions on the course webpage to install TensorFlow and Keras.**

https://github.com/psnegi/ml_s2018#tensorflow-and-keras

[Keras](https://keras.io/) is a well-thought-out high-level computational graph building library. One need not write a lot of boilerplate code.

<font color ='red'> The following code will not work if you haven't installed TensorFlow. </font>

In [None]:
import tensorflow as tf # importing tensorflow
import numpy as np

## Creating a constant of type string
A constant takes no input and outputs its stored tensor. Constants are immutable (you can't change the value once defined).

In [None]:
hello_ml = tf.constant('Hello machine learning, probabilistic perspective')
print(hello_ml)

# To get the value out, we need to run it using a session


In [None]:
with tf.Session() as sess:
    value = sess.run(hello_ml)
    print(value)

# Q1 (1 point) Create a constant tensor with value "tensorflow" and print the value by using sess as done above

In [None]:
## write code here

# Variable

Training a machine learning model is nothing but learning the parameters of a chosen model.

We need a way to define variables representing the parameters. They are mutable. When we train the model, the values stored in the variables change to reflect the learned parameters.

# Let's do a simple matrix multiplication

In [None]:
M = tf.Variable([[1, 2, 1],[2, 2, 2]])
I = tf.Variable([[1, 0 ,0],[0, 1, 0], [0, 0, 1]])
random_normal = tf.Variable(initial_value= tf.random_normal([10], mean = 1.0, stddev= 0.1))
# Keep using . + Tab or Shift + Tab to find methods and argument lists respectively

In [None]:
prod = tf.matmul(M, I)

## uncomment this cell and try to run it

In [None]:

#with tf.Session() as sess:
#    prod_value, random_normal_samples = sess.run([prod, random_normal])
#    print(prod_value)
#    print(random_normal_samples)

If you try to run the previous cell you will get a **FailedPreconditionError**. We always need to initialize our variables before using them.


In [None]:
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer()) # always need to initialize
    prod_value, random_normal_value = sess.run([prod, random_normal])
    print(prod_value)
    print(random_normal_value)

In the previous code section, see the different ways to give initial values to variables.
Check out the various initializers here:

https://www.tensorflow.org/api_docs/python/tf/initializers

The best way to create a variable is to call the **tf.get_variable function.**

We will not use tf.Variable directly from here onward.

# Placeholder
As the name suggests, we can create a placeholder in the graph without creating the actual tensor.
But when you run the graph, you need to feed a tensor of the right shape and type.

**See below how we created x as a placeholder**

In [None]:
with tf.variable_scope("M", reuse=tf.AUTO_REUSE):
    M = tf.get_variable(name = 'matrix', initializer=  tf.constant([[1.0, 2, 1],[2, 2, 2]]))
x = tf.placeholder(tf.float32, shape=(3, 1))# just tell the type and shape
matrix_vector_mul = tf.matmul(M,x)

In [None]:
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # feeding the value of placeholder x at run time
    print(sess.run(matrix_vector_mul, feed_dict={x: np.array([[1],[0.0], [0.0]])}))

# Let's do ridge linear regression (predicting a continuous value) using TensorFlow on the Boston house price dataset


## Let's download dataset $\mathcal{D} = \{(\mathbf{x_i}, y_i) \}_{i=1}^{N}$ containing features $\mathbf{x_i}$  and target value $y_i$ 

In [None]:
import pandas as pd # for doing exploratory data analysis
import seaborn as sns # statistical visualization
import matplotlib.pyplot as plt
from sklearn import model_selection
import numpy as np
# to make graphics inline
%matplotlib inline 
sns.set()

# Loading the dataset with sklearn's load_boston

In [None]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older scikit-learn
from sklearn.datasets import load_boston
boston_data = load_boston()
print(boston_data.keys())

In [None]:
boston_data.feature_names

In [None]:
boston_data.DESCR

In [None]:
df = pd.DataFrame(boston_data.data, columns=boston_data.feature_names)
df['target'] = boston_data.target

In [None]:
df.shape[0]

# Q2 (1 point) Create a pandas Series of size df.shape[0] with value 1 and name CONST

In [None]:
const_df = ###? write code here

In [None]:
df = pd.concat([const_df, df], axis=1) # prepend a constant column of ones (intercept term) to all observations

In [None]:
print(df.shape)
df.head()

In [None]:
# summary
df.describe()

# Statistical Summary and data sanity check

Please read about the pandas **isnull** and **any** functions

In [None]:
# just to make sure values in different columns are not missing
df.isnull().any()

## As per the above output, none of the columns has any null values

In [None]:
# Making sure the datatypes are also good, so that the relevant algebra on columns makes sense
df.dtypes

In [None]:
from sklearn.model_selection import  train_test_split
validation_size = 0.40
seed = 3
train_df, valid_test_df = train_test_split(df, test_size=validation_size, random_state=seed)
valid_df, test_df = train_test_split(valid_test_df, test_size=.5, random_state=seed)

<font color = '#FF5733'>Can you guess why we split the data into train, validation and test sets? </font>

Remember that in ridge regression we have to tune the parameter $\lambda$. It controls the strength of regularization (how small each $w_i$ should be). There is no formula for it given the data; we need to tune it.

We will search a grid of values to find the optimal $\lambda$ using the validation data, since we can't touch the test data to evaluate the performance of the linear model during tuning.

<font color = '#FF5733'>Test data works as a proxy for unseen data. Only touch it when you have selected a final model. </font>

In [None]:
train_df.shape, valid_test_df.shape, valid_df.shape, test_df.shape

In [None]:
train_df.head()

In machine learning we would like uncorrelated features. Let's see how our attributes/features are correlated.

In [None]:
train_df.corr()

The last column tells the predictivity of the various attributes.

Correlation is not the only way to measure the predictivity of features. One can use information-theoretic ideas
like mutual information to find more predictive (powerful) features.
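
To see what "correlated features" and "predictive features" look like in a correlation matrix, here is a small NumPy sketch on synthetic data. The feature names (`strong`, `weak`, `duplicate`) and all coefficients are made up for illustration; they have nothing to do with the Boston columns.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
strong = rng.normal(size=n)                     # feature that drives the target
weak = rng.normal(size=n)                       # feature unrelated to the target
duplicate = strong + 0.05 * rng.normal(size=n)  # nearly a copy of `strong`
target = 3.0 * strong + 0.5 * rng.normal(size=n)

# Rows are variables; corr[i, j] is the Pearson correlation of rows i and j.
corr = np.corrcoef(np.vstack([strong, weak, duplicate, target]))

corr_with_target = corr[:3, 3]       # the "last column": feature vs. target
corr_strong_duplicate = corr[0, 2]   # correlation between two features
print(np.round(corr_with_target, 2), round(float(corr_strong_duplicate), 2))
```

In this sketch `strong` and `duplicate` are both highly predictive, but they are also highly correlated with each other, so keeping both adds little information.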

In [None]:
fig, ax = plt.subplots(figsize=(14,14)) 
sns.heatmap(train_df.corr(), annot=True, ax=ax)

Visually we can see a lot of correlation among attributes (like DIS and INDUS), and some attributes being more predictive than others (it looks like LSTAT is more predictive).

We can select features based on correlation and predictivity. Making sure features are uncorrelated is a very important activity. One can use PCA, ICA, dimensionality reduction or manifold learning to create uncorrelated features.

We could go ahead and pick features based on correlation and predictivity,
but let's go ahead and learn ridge regression to take care of the correlation among features.

## Let's build design matrix X containing observations along the rows

In [None]:
train_df.columns

In [None]:
# We are making an arbitrary selection of features
selected_feature =['CONST', 'CRIM', 'INDUS', 'NOX', 'DIS', 'RAD', 'TAX','PTRATIO', 'B', 'LSTAT']
X_train = train_df[selected_feature].values
y_train = train_df['target'].values
X_valid = valid_df[selected_feature].values
y_valid = valid_df['target'].values.reshape((-1,1))
X_test =  test_df[selected_feature].values
y_test = test_df['target'].values.reshape((-1,1))
train_df.head()

In [None]:
X_train[0:3,:], y_train[0:3] 

The solution of ridge regression is given by
$$\hat{w}_{ridge} = (\lambda I_D + X^TX)^{-1}X^T y$$

For us X is X_train and y is y_train
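
Before building this in TensorFlow, it may help to see the same formula in NumPy on synthetic data. This is only a sketch: `X_demo`, `y_demo`, `w_true` and `ridge_closed_form` are made-up names, and the linear system is solved with `np.linalg.solve` instead of forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 3
X_demo = rng.normal(size=(N, D))                 # synthetic design matrix
w_true = np.array([2.0, -1.0, 0.5])              # assumed ground truth
y_demo = X_demo @ w_true + 0.1 * rng.normal(size=N)

def ridge_closed_form(X, y, lam):
    """Solve (lam * I_D + X^T X) w = X^T y for w."""
    D = X.shape[1]
    return np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ y)

w_small = ridge_closed_form(X_demo, y_demo, 1e-3)   # almost no regularization
w_large = ridge_closed_form(X_demo, y_demo, 100.0)  # strong regularization
print(w_small)
print(w_large)
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```

A larger $\lambda$ pulls the weight vector toward zero, so its norm gets smaller as $\lambda$ grows.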

## Let's build a computational graph to find $\hat{w}_{ridge}$ 

# Q3 (1 point) Create a placeholder for y. This represents the vector containing all the y_i values

In [None]:
X = tf.placeholder(tf.float32, shape= X_train.shape, name='input_training_features')
y = ###??? write your code here
l = tf.placeholder(tf.float32, shape= [], name='regularization_weight')

In [None]:
X.shape, y.shape, l.shape

In [None]:
y_train.shape

## Build the ridge formula (computation)
First,

$\lambda I_D + X^TX$

In [None]:
temp = tf.multiply(l, tf.eye(X_train.shape[1])) + tf.matmul(tf.transpose(X), X)

In [None]:
temp.shape

# Q 4 (1 point) Write code to build
$(\lambda I_D + X^TX)^{-1} X^T$

Note that temp already has $\lambda I_D + X^TX$

Hint: search for how to represent matrix inverse 

In [None]:
temp_ridge = ###write code here
print(temp_ridge.shape)

matmul needs both arguments to have 2 dimensions, hence the added second dimension. I don't know why; in NumPy this is not an issue. Let me know if you have a better solution to multiply a matrix and a vector.
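
For comparison, this is how the shapes behave on the NumPy side (a small sketch with made-up values). NumPy accepts a 1-D right operand and returns a 1-D result, while an explicit column vector gives a 2-D result.

```python
import numpy as np

A = np.arange(6.0).reshape(2, 3)     # shape (2, 3)
y_vec = np.array([1.0, 0.0, 2.0])    # shape (3,): a 1-D array, not a column

out_1d = A @ y_vec                          # NumPy accepts the 1-D operand
out_2d = A @ np.expand_dims(y_vec, 1)       # explicit (3, 1) column vector

print(out_1d.shape, out_2d.shape)
```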

In [None]:
ridge_weights = tf.matmul(temp_ridge, tf.expand_dims(y,1))
print(ridge_weights.shape)

## Let's run our computation graph, feeding actual data
<font color = 'green'>See the feed_dict argument: it is how we provide the actual data so that the required graph dependencies are satisfied. </font>

# 1-d  grid of $\lambda$ values 

In [None]:
lambdas = [1e-20, 1e-10, 1e-5, 1e-4, 1e-3,1e-1, 1, 5.0, 10, 50, 100]
print(lambdas)
print(type(lambdas[0]))

# Building a pandas dataframe to store $\lambda$ and learned weights

In [None]:
ind =['lambda_{}'.format(la) for la in lambdas]
column_names = ['lambda', 'mse']+ ['w_{}'.format(i) for i in range(X_train.shape[1])]
print(ind)
print(column_names)
coeff_matrix = pd.DataFrame(index=ind, columns=column_names, dtype=np.float32)

In [None]:
coeff_matrix.dtypes

In [None]:
coeff_matrix.head()
# we haven't filled values in different columns.  NaN is ok

In classification we used accuracy as a measure of how well our model performed.

But how do we measure regression model performance for various values of $\lambda$?

For a linear model we can calculate the MSE (mean squared error)

Look at this link to learn about various other metrics to use for model selection and evaluation

http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics
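
As a reference point, the mean squared error is $\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$. The sketch below, with made-up numbers, computes it in two equivalent ways. It also shows that squaring `norm / N` divides by $N$ twice and therefore gives MSE / N, which is worth keeping in mind when comparing numbers across tools.

```python
import numpy as np

y_true = np.array([[3.0], [5.0], [7.0], [9.0]])
y_pred = np.array([[2.0], [5.0], [8.0], [11.0]])
n = len(y_true)

residual = y_true - y_pred
mse = float(np.mean(residual ** 2))                        # (1/N) * sum of squared errors
mse_from_norm = float(np.linalg.norm(residual) ** 2 / n)   # same value via the l2 norm
scaled = float((np.linalg.norm(residual) / n) ** 2)        # divides by N twice: mse / N

print(mse, mse_from_norm, scaled)
```

Because MSE / N differs from the MSE only by a constant factor for a fixed dataset, both give the same ranking of models evaluated on that dataset.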

## Q 5 (1 point) In the feed_dict part, feed the actual values for the y and l (regularization_weight) tensors

In [None]:

with tf.Session() as sess:
    for i, reg in enumerate(lambdas):
        # running the graph and feeding actual data
        ridge_weight_value = sess.run(ridge_weights, feed_dict={X:X_train, ##write code here##, ###write code here#})
        # Let's evaluate the performance using MSE on y_valid, y_valid_prediction data
        #print(ridge_weight_value)
        y_valid_pred= np.dot(X_valid, ridge_weight_value)
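        # See how we can evaluate the l_2 norm in numpy.
        # Note: (norm / N)**2 equals MSE / N, a constant multiple of the MSE, so the ranking of lambdas is unchanged.
        mse = (np.linalg.norm(y_valid - y_valid_pred ,ord=2)/(len(y_valid)))**2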
        coeff_matrix.iloc[i, 0] = reg
        coeff_matrix.iloc[i, 1] = mse
        print(mse)
        coeff_matrix.iloc[i, 2:] = ridge_weight_value.T
    
    

In [None]:
y_valid.shape, y_valid_pred.shape, ridge_weight_value.shape, X_valid.shape

In [None]:
coeff_matrix

## Note how the weights $w_i$ shrink as lambda increases (bottom rows of the table)

Based on the minimum mse (0.349990), let's pick the corresponding weight vector to evaluate the mse on the test set.

## Q 6 (1 point) Select the index name of the row you think has the minimum value of mse

Hint: for example, if lambda_0.0001 has the minimum mse, then

selected_index = 'lambda_0.0001'

In [None]:
selected_index = ### Write string index here 

## See what would have happened if you had chosen the average of y_valid as the prediction

In [None]:
mean_val = np.mean(y_valid, axis=0)
print(mean_val)

# This would have been our MSE in this baseline scenario

In [None]:
(np.linalg.norm(y_valid -mean_val, ord=2)/len(y_valid))**2

# Let's see how well we did on truly unseen data (never touched while building the model)

In [None]:
selected_weights =  coeff_matrix.loc[selected_index, 'w_0':].values.reshape((-1,1))  # keep only the w_i columns of the selected row

In [None]:
y_test_pred= np.dot(X_test, selected_weights)
print(y_test_pred.shape, y_test.shape)
print(type(y_test_pred), type(y_test))
test_mse = (np.linalg.norm(y_test - y_test_pred ,ord=2)/(len(y_test)))**2

In [None]:
test_mse

# See what we get if we use sklearn

# Q7 (1 point) Use Ridge from sklearn, fit on the training data and compute the MSE on the test set. Keep normalize=False when instantiating the Ridge class

In [None]:
from sklearn.linear_model import Ridge
# Write code here



# One can also see how good the linear model is by analysing the error on the training data. The errors $\epsilon_i$ should be normally distributed

In [None]:
from statsmodels.graphics.gofplots import qqplot

In [None]:
y_train_pred= np.dot(X_train, coeff_matrix.loc[selected_index, 'w_0':].values.reshape((-1,1)))

In [None]:
error = y_train_pred.ravel() - y_train  # flatten (N, 1) predictions so the result is (N,), not an (N, N) broadcast

In [None]:
qqplot(error, line='s')
# looks like not a great fit
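
One thing to watch when computing residuals with NumPy is array shape: subtracting a flat vector from a column vector broadcasts to a full matrix instead of giving one residual per observation. A small sketch with made-up numbers:

```python
import numpy as np

pred = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1), like a matrix product output
truth = np.array([1.5, 2.0, 2.0])        # shape (3,), a flat vector

broadcast = pred - truth                 # shapes (3, 1) and (3,) broadcast to (3, 3)
aligned = pred.ravel() - truth           # both (3,): one residual per observation

print(broadcast.shape, aligned.shape)
```

Printing shapes before plotting or computing errors, as done earlier for `y_valid` and `y_valid_pred`, is a cheap way to catch this.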