# Linear Model with TensorFlow

Let's use the MTA subway turnstile data, combined with weather data, to predict the number of entries at a station.
In this notebook I will use TensorFlow to build a simple linear regression model.

Start by importing the necessary libraries and loading the data.

In [3]:
import math

from IPython import display
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
import tensorflow as tf
from tensorflow.python.data import Dataset
from exploratory_analysis import *

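# Only report TensorFlow errors, and keep pandas output compact.
tf.logging.set_verbosity(tf.logging.ERROR)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format

# Load the turnstile and weather files, then merge them into one DataFrame.
turnstile, weather = load_data('master_turnstile_file.txt', 'weather_underground.csv')
data = merge_turnstile_weather(turnstile, weather)

data.head()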

,C/A,UNIT,SCP,DATEn,TIMEn,DESCn,ENTRIESn,EXITSn,EXITSn_hourly,ENTRIESn_hourly,...,precipi,snowfalli,since1jancoolingdegreedaysnormal,precipm,snowfallm,thunder,monthtodateheatingdegreedays,meantempi,maxvism,meantempm
0,A002,R051,02-00-00,2011-05-21,00:00:00,REGULAR,3169391,1097585,0.0,0.0,...,0.1,0.0,41,1.8,0.0,0,81,67,16,19
1,A002,R051,02-00-00,2011-05-21,04:00:00,REGULAR,3169415,1097588,3.0,24.0,...,0.1,0.0,41,1.8,0.0,0,81,67,16,19
2,A002,R051,02-00-00,2011-05-21,08:00:00,REGULAR,3169431,1097607,19.0,16.0,...,0.1,0.0,41,1.8,0.0,0,81,67,16,19
3,A002,R051,02-00-00,2011-05-21,12:00:00,REGULAR,3169506,1097686,79.0,75.0,...,0.1,0.0,41,1.8,0.0,0,81,67,16,19
4,A002,R051,02-00-00,2011-05-21,16:00:00,REGULAR,3169693,1097734,48.0,187.0,...,0.1,0.0,41,1.8,0.0,0,81,67,16,19
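
The helpers `load_data` and `merge_turnstile_weather` come from the `exploratory_analysis` module, whose source is not shown in this notebook. As a rough sketch of what such a merge could look like, the snippet below attaches a daily weather row to each turnstile reading. The function name `merge_on_date`, the assumption that both tables share a `DATEn` column, and the toy values are all hypothetical.

```python
import pandas as pd


def merge_on_date(turnstile: pd.DataFrame, weather: pd.DataFrame) -> pd.DataFrame:
    """Attach the daily weather row to every turnstile reading of that day.

    Hypothetical stand-in for merge_turnstile_weather; assumes both frames
    have a 'DATEn' column holding the date as a string.
    """
    # A left join keeps every turnstile reading, even if weather is missing.
    return turnstile.merge(weather, on="DATEn", how="left")


# Toy data, invented for illustration only.
turnstile_toy = pd.DataFrame({
    "UNIT": ["R051", "R051", "R051"],
    "DATEn": ["2011-05-21", "2011-05-21", "2011-05-22"],
    "ENTRIESn_hourly": [24.0, 16.0, 75.0],
})
weather_toy = pd.DataFrame({
    "DATEn": ["2011-05-21", "2011-05-22"],
    "meantempi": [67, 70],
})

merged_toy = merge_on_date(turnstile_toy, weather_toy)
print(merged_toy)
```

A left join is used here so that no turnstile reading is lost when a day has no weather record; the real helper may behave differently.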


Some columns contain only null entries, and others contain a single constant value. Neither kind carries any information for the model, so I will drop them.

In [4]:
data = drop_null_columns(data)
data = drop_one_value_columns(data)

data.describe()

,ENTRIESn,EXITSn,EXITSn_hourly,ENTRIESn_hourly,HOUR,maxpressurem,maxdewptm,maxpressurei,maxdewpti,since1julheatingdegreedaysnormal,...,minpressurei,monthtodatecoolingdegreedays,maxtempi,minpressurem,precipi,since1jancoolingdegreedaysnormal,precipm,monthtodateheatingdegreedays,meantempi,meantempm
count,5256.0,5256.0,5256.0,5256.0,5256.0,5256.0,5256.0,5256.0,5256.0,5256.0,...,5256.0,5256.0,5256.0,5256.0,5256.0,5256.0,5256.0,5256.0,5256.0,5256.0
mean,3847457.0,3366837.5,26825.8,40324.2,10.3,1016.3,17.9,30.0,64.2,4750.3,...,29.8,16.6,75.5,1010.2,0.0,49.7,0.7,91.4,68.2,19.9
std,3543497.8,3863643.5,468432.1,441129.5,6.8,4.3,2.5,0.1,4.3,4.5,...,0.1,12.4,8.8,4.8,0.0,5.9,1.0,4.7,6.8,3.8
min,0.0,0.0,-5249768.0,0.0,0.0,1009.0,14.0,29.8,57.0,4743.0,...,29.5,5.0,58.0,1000.0,0.0,41.0,0.0,81.0,56.0,13.0
25%,1051834.8,908510.5,35.0,46.0,4.0,1014.0,16.0,29.9,61.0,4746.0,...,29.8,5.0,67.0,1009.0,0.0,44.0,0.0,90.0,61.0,16.0
50%,3053227.0,2224428.5,143.0,175.0,12.0,1016.0,17.0,30.0,63.0,4751.0,...,29.9,14.0,78.0,1011.0,0.0,50.0,0.0,94.0,71.0,22.0
75%,5433198.0,4425136.2,393.0,444.0,16.0,1021.0,21.0,30.1,69.0,4755.0,...,29.9,30.0,83.0,1013.0,0.1,56.0,1.8,94.0,74.0,23.0
max,18935728.0,24578513.0,17560300.0,11521719.0,22.0,1023.0,21.0,30.2,70.0,4757.0,...,30.0,39.0,84.0,1017.0,0.1,59.0,2.5,94.0,75.0,24.0
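
The column-dropping helpers used above are also defined in `exploratory_analysis` and are not shown here. A minimal sketch of how the same cleanup could be done in plain pandas follows; the function names `drop_all_null_columns` and `drop_constant_columns` and the toy frame are hypothetical, not the notebook's actual implementation.

```python
import numpy as np
import pandas as pd


def drop_all_null_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop columns in which every entry is null."""
    return df.dropna(axis="columns", how="all")


def drop_constant_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop columns that hold at most one distinct non-null value."""
    # nunique() ignores NaN by default, so a constant column counts as 1.
    keep = df.nunique() > 1
    return df.loc[:, keep]


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "ENTRIESn_hourly": [0.0, 24.0, 16.0],
    "all_null": [np.nan, np.nan, np.nan],
    "constant": [1, 1, 1],
})

cleaned = drop_constant_columns(drop_all_null_columns(toy))
print(list(cleaned.columns))
```

In this toy frame only `ENTRIESn_hourly` survives, since the other two columns are either entirely null or constant.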


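Either 'EXITSn_hourly' or 'ENTRIESn_hourly' can serve as a target or a feature, and both contain evident outliers: 'EXITSn_hourly' has a minimum of -5249768.0, and both columns have maxima above 11 million while their medians are only 143.0 and 175.0. Let's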