Getting Data and Applying Models
=====

Some of the code contained within this notebook is from Ch. 9 of *Data Science from Scratch* by J. Grus.



## Getting Data

Let's grab some text from the web and then determine the most common words.

In [1]:
from collections import Counter
import urllib2

counter = Counter()

# Read the first 20,000 bytes of the text and count lower-cased words line by line
for line in urllib2.urlopen('http://www.gutenberg.org/files/84/84-0.txt').read(20000).split("\n"):
    counter += Counter(word.lower() for word in line.strip().split() if word)
    
for word, count in counter.most_common(10):
    print count, "\t", word

163 	the
112 	i
107 	of
99 	and
99 	to
88 	my
79 	a
62 	in
38 	which
37 	that
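
As a variation on the cell above, here is a minimal Python 3 sketch that counts words in a local string, so it runs without network access. The sample sentence, the `count_words` helper, and the choice to strip punctuation with a regular expression are assumptions made for illustration. The cell above splits on whitespace only, so tokens such as `the,` and `the` are counted separately there.

```python
import re
from collections import Counter

def count_words(text):
    """Count lower-cased words, treating runs of letters/apostrophes as words."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

# Made-up sample text, for illustration only.
sample = "The monster fled. The creator followed, and the ice cracked."
counts = count_words(sample)

for word, count in counts.most_common(3):
    print(count, "\t", word)
```

Stripping punctuation before counting merges variants of the same word, which usually gives a cleaner frequency table than whitespace splitting alone.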


## Serialization and JSON

Python supports JSON (JavaScript Object Notation) through the `json` module.  Files containing JSON can be read and written with `load` and `dump`, which require a file pointer.  `loads` and `dumps` work with strings instead.

Serialization is one way to port objects or dump custom data structures.  
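
The file-based pair can be sketched as follows. This is a minimal Python 3 example, assuming a temporary file named `counts.json` (a hypothetical name) and a small dictionary of word counts; `json.dump` and `json.load` take an open file object where `dumps` and `loads` take or return a string.

```python
import json
import os
import tempfile

counts = {"the": 32, "a": 5}

# Write the dictionary to a temporary file with json.dump (needs a file object).
path = os.path.join(tempfile.mkdtemp(), "counts.json")
with open(path, "w") as fp:
    json.dump(counts, fp)

# Read it back with json.load (also takes a file object).
with open(path) as fp:
    restored = json.load(fp)

print(restored == counts)
```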

In [3]:
import json

serialized = """{ "title" : "Data Science Book",
                      "author" : "Joel Grus",
                      "publicationYear" : 2014,
                      "topics" : [ "data", "science", "data science"] }"""

# Load the object ...
deserialized = json.loads(serialized)

print type(deserialized)

# Encode a dictionary and list, then decode ...

counts = {'the': 32, 'a': 5}
nums = [1, 2, 3]

### Solution ###
encoded = json.dumps(counts)
print type(encoded), encoded

decoded = json.loads(encoded)
print type(decoded), decoded

<type 'dict'>
<type 'str'> {"a": 5, "the": 32}
<type 'dict'> {u'a': 5, u'the': 32}


As an example of what can be stored in JSON, let's inspect a tweet taken from Twitter: [Link](https://gist.githubusercontent.com/hrp/900964/raw/2bbee4c296e6b54877b537144be89f19beff75f4/twitter.json).

In [17]:
# https://gist.githubusercontent.com/hrp/900964/raw/2bbee4c296e6b54877b537144be89f19beff75f4/twitter.json
import json
import urllib2

serialized = ""

for line in urllib2.urlopen('https://gist.githubusercontent.com/hrp/900964/raw/2bbee4c296e6b54877b537144be89f19beff75f4/twitter.json'):
    serialized += line

# print serialized
# print type(serialized)

tweet = json.loads(serialized)
print type(tweet)
print tweet.keys()
print tweet["user"]["screen_name"]

print json.dumps(tweet, indent=4, sort_keys=True)

<type 'dict'>
[u'user', u'favorited', u'retweeted_status', u'entities', u'contributors', u'truncated', u'text', u'created_at', u'retweeted', u'in_reply_to_status_id_str', u'coordinates', u'id', u'source', u'in_reply_to_status_id', u'in_reply_to_screen_name', u'in_reply_to_user_id', u'place', u'retweet_count', u'geo', u'in_reply_to_user_id_str', u'id_str']
OldGREG85
{
    "contributors": null, 
    "coordinates": null, 
    "created_at": "Sun Apr 03 23:48:36 +0000 2011", 
    "entities": {
        "hashtags": [], 
        "urls": [], 
        "user_mentions": [
            {
                "id": 271572434, 
                "id_str": "271572434", 
                "indices": [
                    3, 
                    19
                ], 
                "name": "PostGradProblems", 
                "screen_name": "PostGradProblem"
            }
        ]
    }, 
    "favorited": false, 
    "geo": null, 
    "id": 54691802283900930, 
    "id_str": "54691802283900928", 
    "in_reply_

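The pretty-printed tweet above is a nested structure of dictionaries and lists, so individual values are reached by chaining keys and indices. The Python 3 sketch below assumes a hand-trimmed sample that contains only fields visible in that output; the variable names are illustrative.

```python
import json

# A hand-trimmed sample containing only fields visible in the output above.
sample = """{
    "created_at": "Sun Apr 03 23:48:36 +0000 2011",
    "favorited": false,
    "geo": null,
    "entities": {
        "hashtags": [],
        "urls": [],
        "user_mentions": [
            {"id": 271572434, "indices": [3, 19],
             "name": "PostGradProblems", "screen_name": "PostGradProblem"}
        ]
    }
}"""

tweet = json.loads(sample)

# JSON null/false become Python None/False; objects and arrays become dict/list.
mentions = [m["screen_name"] for m in tweet["entities"]["user_mentions"]]
location = tweet.get("geo")      # None, because the JSON value is null
has_place = "place" in tweet     # False: this trimmed sample omits the key

print(mentions, location, has_place)
```

Using `.get` or an `in` check avoids a `KeyError` when a record lacks an optional field.
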
## Normalization

Load a dataset of house prices in Boston.  Each house is characterized by 13 features.  These are non-negative and real-valued, and the target is the house value (in thousands of dollars). [Boston Housing](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html)

In [28]:
import numpy as np

from sklearn.datasets import load_boston
boston = load_boston()
print(boston.data.shape)
print(type(boston.data))
print(boston.target.shape)

print boston.data[0,:], boston.target[0]

# Determine the range of each feature
X = boston.data

### Solution ###
print "Ranges", np.max(X, axis=0) - np.min(X, axis=0)
print "Max. per feature", np.max(X, axis=0)
print "Min. per feature", np.min(X, axis=0)
print "Mean. per feature", np.mean(X, axis=0)
print "Std. per feature", np.std(X, axis=0)

# Normalize with z scores
X = boston.data
X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
print np.min(X, axis=0), np.max(X, axis=0)


# Normalize by scaling
X = boston.data
rnge = (np.max(X, axis=0) - np.min(X, axis=0))
X = (X - np.min(X, axis=0))
X = X / rnge

print np.min(X, axis=0), np.max(X, axis=0)

(506, 13)
<type 'numpy.ndarray'>
(506,)
[6.320e-03 1.800e+01 2.310e+00 0.000e+00 5.380e-01 6.575e+00 6.520e+01
 4.090e+00 1.000e+00 2.960e+02 1.530e+01 3.969e+02 4.980e+00] 24.0
Ranges [8.896988e+01 1.000000e+02 2.728000e+01 1.000000e+00 4.860000e-01
 5.219000e+00 9.710000e+01 1.099690e+01 2.300000e+01 5.240000e+02
 9.400000e+00 3.965800e+02 3.624000e+01]
Max. per feature [ 88.9762 100.      27.74     1.       0.871    8.78   100.      12.1265
  24.     711.      22.     396.9     37.97  ]
Min. per feature [6.3200e-03 0.0000e+00 4.6000e-01 0.0000e+00 3.8500e-01 3.5610e+00
 2.9000e+00 1.1296e+00 1.0000e+00 1.8700e+02 1.2600e+01 3.2000e-01
 1.7300e+00]
Mean. per feature [3.61352356e+00 1.13636364e+01 1.11367787e+01 6.91699605e-02
 5.54695059e-01 6.28463439e+00 6.85749012e+01 3.79504269e+00
 9.54940711e+00 4.08237154e+02 1.84555336e+01 3.56674032e+02
 1.26530632e+01]
Std. per feature [8.59304135e+00 2.32993957e+01 6.85357058e+00 2.53742935e-01
 1.15763115e-01 7.01922514e-01 2.81210326e+01
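
The two normalizations used in the cell above can be wrapped in small helpers. This is a Python 3 sketch on a made-up array, not the Boston data; the function names `zscore` and `minmax`, and the guard that leaves constant columns at zero instead of dividing by zero, are assumptions added for illustration.

```python
import numpy as np

def zscore(X):
    """Center each column at 0 and scale it to unit standard deviation."""
    std = np.std(X, axis=0)
    std = np.where(std == 0, 1.0, std)      # leave constant columns at 0
    return (X - np.mean(X, axis=0)) / std

def minmax(X):
    """Rescale each column linearly onto the interval [0, 1]."""
    lo = np.min(X, axis=0)
    rng = np.max(X, axis=0) - lo
    rng = np.where(rng == 0, 1.0, rng)      # leave constant columns at 0
    return (X - lo) / rng

# Hypothetical toy data: 4 records, 3 features (the last one is constant).
X = np.array([[1.0, 200.0, 5.0],
              [2.0, 400.0, 5.0],
              [3.0, 600.0, 5.0],
              [4.0, 800.0, 5.0]])

Z = zscore(X)
M = minmax(X)
print(Z.mean(axis=0), Z.std(axis=0))
print(M.min(axis=0), M.max(axis=0))
```

Z-scores keep the shape of each feature's distribution but remove its units, while min-max scaling pins every feature to the same fixed range.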

## Reading in CSV Files

Data can readily be saved to and read from comma-separated value (CSV) files.  

Each line corresponds to one record, with individual fields separated by commas.  The first line may contain a header.  

In [33]:
# https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv

import numpy as np

boston = np.genfromtxt('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv', delimiter=',', skip_header=1)

# print boston
# print boston.shape
# print boston[0,:]

X = boston[:,:13] #For every row, get the first 13 columns
Y = boston[:,13]

print X.shape, Y.shape

(506, 13) (506,)
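
The same `np.genfromtxt` call also works on an in-memory file object, which makes the header handling easy to try offline. The sketch below is Python 3 and assumes a hypothetical three-record CSV; the column names and values are made up for illustration.

```python
import io
import numpy as np

# Hypothetical CSV text: a header line followed by one record per line.
csv_text = """rooms,age,price
6.5,65.2,24.0
6.4,78.9,21.6
7.1,61.1,34.7
"""

# skip_header=1 drops the header so every remaining field parses as a float.
data = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1)

X = data[:, :-1]   # all columns except the last are features
Y = data[:, -1]    # the last column is the target
print(X.shape, Y.shape)
```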


In [37]:
import numpy as np

# This is the game data from Jack and Jill.  The first two columns are the plays of Jack and Jill
#   and the third column is the winner.

# Read in data and save to X, Y

### Solution ###
data = np.genfromtxt('http://mlid.cps.cmich.edu/resources/game-data.txt', delimiter = ' ')
print data

X = data[:,:2]
Y = data[:,-1]
print Y

[[0.426 0.283 0.   ]
 [0.244 0.827 1.   ]
 [0.485 0.661 1.   ]
 ...
 [0.248 0.566 1.   ]
 [0.884 0.196 0.   ]
 [0.931 0.919 0.   ]]
[0. 1. 1. ... 1. 0. 0.]


Let's assume we have a neural network with a middle (hidden) layer of 2 nodes and 1 output node, with a sigmoid activation function on each node.  Apply the model to our game data.

In [38]:
import numpy as np
import math

def sigmoid(x):
    "Numerically stable sigmoid function."
    if x >= 0:
        z = math.exp(-x)
        return 1 / (1 + z)
    else:
        # if x is less than zero then z will be small, denom can't be
        # zero because it's 1+z.
        z = math.exp(x)
        return z / (1 + z)

def NN_apply(X, W1, B1, W2, B2):
    "Do a feed-forward pass through a neural network"
    A1 = np.dot(X, W1) + B1
    Z1 = np.vectorize(sigmoid)(A1)

    A2 = np.dot(Z1, W2) + B2
    Z2 = np.vectorize(sigmoid)(A2)

    return Z2

W1 = np.matrix([[0.2, 0.2], [0.1, 0.3]])  # Where did these values come from ???
B1 = np.matrix([[0.05, 0.05]])
W2 = np.matrix([[0.4], [0.2]])
B2 = np.matrix([[0.1]])

NN_apply(X, W1, B1, W2, B2)

matrix([[0.60522265],
        [0.60713222],
        [0.60785989],
        ...,
        [0.60562946],
        [0.60794918],
        [0.61243735]])
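
The forward pass above calls a scalar `sigmoid` through `np.vectorize`. As an alternative sketch (Python 3, plain arrays instead of `np.matrix`), the same computation can be written with array operations only. The weights are copied from the cell above and the two input rows are the first two records printed earlier; the names `nn_apply` and `X_small` are illustrative.

```python
import numpy as np

def sigmoid(a):
    """Numerically stable sigmoid applied elementwise to an array."""
    a = np.asarray(a, dtype=float)
    z = np.exp(-np.abs(a))                 # always in [0, 1], never overflows
    return np.where(a >= 0, 1.0 / (1.0 + z), z / (1.0 + z))

def nn_apply(X, W1, B1, W2, B2):
    """Feed-forward pass: input -> sigmoid middle layer -> sigmoid output."""
    Z1 = sigmoid(X @ W1 + B1)
    return sigmoid(Z1 @ W2 + B2)

# Weights copied from the cell above; inputs are the first two rows printed earlier.
W1 = np.array([[0.2, 0.2], [0.1, 0.3]])
B1 = np.array([[0.05, 0.05]])
W2 = np.array([[0.4], [0.2]])
B2 = np.array([[0.1]])
X_small = np.array([[0.426, 0.283], [0.244, 0.827]])

out = nn_apply(X_small, W1, B1, W2, B2)
print(out)
```

The two results should agree with the first two rows of the matrix printed above.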

Let's evaluate the accuracy of the predictions...

In [41]:
from __future__ import division

# Generate predictions ...
model_output = NN_apply(X, W1, B1, W2, B2)

# Initialize predictions to 0, then determine indices of predictions that are 
#  above the 0.5 threshold
predictions = np.zeros(model_output.shape)
predictions[model_output > 0.5] = 1

correct_count = 0
for i, p in enumerate(predictions):
    if p == Y[i]:
        correct_count += 1
        
### Determine the accuracy of the predictions
print correct_count / predictions.shape[0]

0.4969
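
The counting loop above can also be expressed as a single array comparison. The Python 3 sketch below assumes a hypothetical `accuracy` helper and made-up outputs and labels. Note that the model outputs shown earlier all lie near 0.6, above the 0.5 threshold, so an accuracy close to 0.5 is consistent with a model that predicts 1 for nearly every record.

```python
import numpy as np

def accuracy(model_output, labels, threshold=0.5):
    """Fraction of rows whose thresholded output equals the 0/1 label."""
    predictions = (np.asarray(model_output).ravel() > threshold).astype(float)
    return np.mean(predictions == np.asarray(labels).ravel())

# Hypothetical outputs and labels, for illustration only.
outputs = np.array([[0.61], [0.60], [0.43], [0.58]])
labels = np.array([1.0, 0.0, 0.0, 1.0])

print(accuracy(outputs, labels))
```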


Somehow we would like to adjust the weights to improve the performance. 
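
One common way to do this is gradient descent on a squared-error loss, with the gradients obtained by backpropagation. The equations below are a sketch under that assumption (the loss $E$ is not stated in this notebook). The symbols follow the names used in the code cells: $X$ for the inputs, $Y$ for the targets, $A_1, Z_1, A_2, Z_2$ for the pre-activations and activations of the two layers, $\odot$ for the elementwise product, and $\eta$ for the learning rate.

$$
\begin{aligned}
E &= \tfrac{1}{2}\sum_i \left(Z_{2,i} - Y_i\right)^2, \qquad \sigma'(a) = \sigma(a)\,\bigl(1 - \sigma(a)\bigr) \\
\delta_2 &= (Z_2 - Y) \odot \sigma'(A_2) \\
\delta_1 &= \left(\delta_2 W_2^{\top}\right) \odot \sigma'(A_1) \\
\frac{\partial E}{\partial W_2} &= Z_1^{\top}\delta_2, \qquad \frac{\partial E}{\partial W_1} = X^{\top}\delta_1 \\
W_k &\leftarrow W_k - \eta\,\frac{\partial E}{\partial W_k}, \qquad B_k \leftarrow B_k - \eta\,\mathbf{1}^{\top}\delta_k, \qquad k \in \{1, 2\}
\end{aligned}
$$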

In [42]:
# ???

mid_layer_size = 2
epochs = 10
batch_size = 20
learning_rate = 0.01

# Initialize some random values for the weights in the NN
W1 = np.random.rand(X.shape[1], int(mid_layer_size))
W1 = 0.01 * W1
B1 = np.random.rand(1, int(mid_layer_size))
B1 = 0.01 * B1

W2 = np.random.rand(int(mid_layer_size), 1)
W2 = 0.01 * W2
B2 = np.random.rand(1,1)
B2 = 0.01 * B2

def sigmoid_prime(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x))
    s = sigmoid(x)
    return s * (1 - s)

def evaluate(X, Y):
    " calculate the accuracy of the model "
    O = NN_apply(X, W1, B1, W2, B2)

    P = np.zeros(O.shape)
    for i in range(P.shape[0]):
        if O[i] > 0.5:
            P[i] = 1

    correct_count = 0
    for i in range(P.shape[0]):
        if abs(Y[i] - P[i]) < 0.001:
               correct_count = correct_count + 1

    print "Accuracy ", 1.0 * correct_count / Y.shape[0]


for i in range(0, epochs):
    print "Starting epoch ", i + 1

    for j in range(0, X.shape[0] // batch_size):

        Xt = X[j*batch_size:(j+1)*batch_size,:]
        Yt = Y[j*batch_size:(j+1)*batch_size]

        # Forward pass
        A1 = np.dot(Xt, W1) + B1
        Z1 = np.vectorize(sigmoid)(A1)

        A2 = np.dot(Z1, W2) + B2
        Z2 = np.vectorize(sigmoid)(A2)

        # Calculate D
        D2 = (Z2 - np.reshape(Yt, (Z2.shape[0],1))) * np.vectorize(sigmoid_prime)(A2)
        D1 = np.dot(D2, np.transpose(W2))  * np.vectorize(sigmoid_prime)(A1)

        DE_dw2 = np.dot(np.transpose(Z1) , D2)
        DE_dw1 = np.dot(np.transpose(Xt) , D1)

        # Update weights

        W2 = W2 - learning_rate * DE_dw2 
        W1 = W1 - learning_rate * DE_dw1
        B2 = B2 - learning_rate * np.dot(np.ones((1, D2.shape[0])), D2)
        B1 = B1 - learning_rate * np.dot(np.ones((1, D1.shape[0])), D1)

    evaluate(X, Y)
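
A quick way to check a derivative helper such as `sigmoid_prime` is to compare it against a finite-difference estimate. The Python 3 sketch below assumes the standard identity that the sigmoid's derivative equals `sigmoid(x) * (1 - sigmoid(x))`; the helper `numeric_derivative` and the step size `h` are illustrative choices.

```python
import math

def sigmoid(x):
    """Numerically stable sigmoid for a scalar."""
    if x >= 0:
        return 1 / (1 + math.exp(-x))
    z = math.exp(x)
    return z / (1 + z)

def sigmoid_prime(x):
    """Analytic derivative: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1 - s)

def numeric_derivative(f, x, h=1e-5):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Compare the analytic and numeric derivatives at a few points.
gaps = []
for x in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    gap = abs(sigmoid_prime(x) - numeric_derivative(sigmoid, x))
    gaps.append(gap)
    print(x, sigmoid_prime(x), gap)
```

If the gaps are not close to zero, the analytic derivative and the function it is meant to differentiate disagree, and the weight updates built on it will point in the wrong direction.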