# Intro to Classification Modeling using Naive Bayes

Galvanize: 2017-04-29

Slides: http://lukas.show/, https://s3.amazonaws.com/ai-learn-l2k/ML_Course.pdf

GitHub: https://github.com/lukas/scikit-class

In [1]:
import pandas as pd
import sklearn as sk
import keras
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.linear_model import Perceptron
from sklearn.model_selection import cross_val_score

Using TensorFlow backend.


# Looking at the data

In [56]:
tweets = pd.read_csv('tweets.csv')

In [57]:
tweets.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [58]:
target = tweets.is_there_an_emotion_directed_at_a_brand_or_product
text = tweets.tweet_text

In [59]:
text[0:5]

0    .@wesley83 I have a 3G iPhone. After 3 hrs twe...
1    @jessedee Know about @fludapp ? Awesome iPad/i...
2    @swonderlin Can not wait for #iPad 2 also. The...
3    @sxsw I hope this year's festival isn't as cra...
4    @sxtxstate great stuff on Fri #SXSW: Marissa M...
Name: tweet_text, dtype: object

## Feature Extraction

First, we need to remove the tweets whose text is null.

In [60]:
fixed_text = text[pd.notnull(text)]
fixed_target = target[pd.notnull(text)]

In [61]:
count_vect=CountVectorizer()
count_vect.fit(fixed_text)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

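count_vect.vocabulary_ is a dictionary that maps each word to its column index.
count_vect.fit(fixed_text) learns that vocabulary and returns the vectorizer itself. count_vect.transform(fixed_text) then produces an n x m matrix, where n is the number of tweets and m is the number of distinct words. Entry (i, j) in the matrix is how many times word j occurs in tweet i.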
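
To make the difference between fitting and transforming concrete, here is a minimal sketch on three made-up sentences (they are not taken from tweets.csv). It assumes the default CountVectorizer settings shown above, which lowercase the text and drop one-character tokens such as "I".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up documents, used only for illustration
docs = ["I love my iPhone", "my iPad is great", "love love love"]

vect = CountVectorizer()
fitted = vect.fit(docs)        # learns the vocabulary, returns the vectorizer
matrix = vect.transform(docs)  # sparse matrix: one row per document

print(fitted is vect)            # fit() hands back the same object
print(sorted(vect.vocabulary_))  # the distinct words that became columns
print(matrix.shape)              # (number of documents, number of words)
print(matrix.toarray())          # dense view of the word counts
```

The third row has a single non-zero entry, in the column that vocabulary_ assigns to "love".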

In [62]:
# Which column index does each word map to?
print(count_vect.vocabulary_.get(u'the'))
print(count_vect.vocabulary_.get(u'eye'))

8562
3108


In [63]:
len(count_vect.vocabulary_)

9706

In [75]:
# Transform the text of the tweets into a matrix
counts = count_vect.transform(fixed_text)

In [65]:
print(fixed_text[0:2])

0    .@wesley83 I have a 3G iPhone. After 3 hrs twe...
1    @jessedee Know about @fludapp ? Awesome iPad/i...
Name: tweet_text, dtype: object


In [81]:
# See counts for the first tweet
print(counts[0])

  (0, 168)	1
  (0, 430)	1
  (0, 774)	2
  (0, 2291)	1
  (0, 3981)	1
  (0, 4210)	1
  (0, 4573)	1
  (0, 4610)	1
  (0, 5766)	1
  (0, 6478)	1
  (0, 7232)	1
  (0, 8076)	1
  (0, 8323)	1
  (0, 8702)	1
  (0, 8920)	1
  (0, 9062)	1
  (0, 9303)	1
  (0, 9373)	1


In [71]:
print(count_vect.transform(['iPhone']))

  (0, 4573)	1


## Fitting a Naive Bayes model

In [82]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(counts, fixed_target)

print(nb.predict(count_vect.transform(["iphone!!!"])))
print(nb.predict(count_vect.transform(["iphone does not suck"])))

['No emotion toward brand or product']
['Negative emotion']
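
One plausible reason "iphone does not suck" comes out negative, offered as a sketch rather than a diagnosis of this particular model: with the default ngram_range=(1, 1), CountVectorizer counts single words and throws away their order, so "not" and "suck" are seen as unrelated features. The scrambled sentence and the bigram setting below are illustrative choices, not something the notebook itself does.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

sentence = "iphone does not suck"

# Unigram counts: word order is lost, so a scrambled sentence looks identical
unigrams = CountVectorizer().fit([sentence])
a = unigrams.transform([sentence]).toarray()
b = unigrams.transform(["suck not does iphone"]).toarray()
same_vector = np.array_equal(a, b)
print(same_vector)

# Unigrams plus bigrams: "not suck" becomes a feature of its own
bigrams = CountVectorizer(ngram_range=(1, 2)).fit([sentence])
has_negation_feature = "not suck" in bigrams.vocabulary_
print(has_negation_feature)
```

Adding bigrams grows the vocabulary, so whether it actually helps on the tweet data would have to be checked on held-out examples.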


# Testing Results

In [85]:
predictions = nb.predict(counts)
correct = sum(predictions == fixed_target)
incorrect = sum(predictions != fixed_target)
acc = correct/(len(fixed_target))
print(acc)

0.795094588649


## Hacky way to do a hold-out

In [87]:
nb.fit(counts[0:6000], fixed_target[0:6000])
predictions = nb.predict(counts[6000:9092])

In [88]:
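holdout_target = fixed_target[6000:9092]
correct = sum(predictions == holdout_target)
incorrect = sum(predictions != holdout_target)
# Divide by the size of the hold-out set, not the size of the whole data set
acc = correct/(len(holdout_target))
print(acc)

0.663971539457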
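
A less hacky hold-out, sketched under assumptions: the texts and labels below are made up, and train_test_split from sklearn.model_selection stands in for the manual slicing. It shuffles before splitting, and the accuracy is divided by the number of held-out examples only. The vectorizer is fitted on the training texts alone so the held-out tweets do not influence the vocabulary.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up stand-in data; the notebook would use fixed_text and fixed_target
texts = ["love it", "great phone", "awesome app", "really love this",
         "hate it", "terrible phone", "awful app", "really hate this"] * 5
labels = (["pos"] * 4 + ["neg"] * 4) * 5

# Shuffle and split the raw texts: 75% for training, 25% held out
text_train, text_test, label_train, label_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)

# Learn the vocabulary from the training texts only
vect = CountVectorizer()
nb = MultinomialNB().fit(vect.fit_transform(text_train), label_train)

predictions = nb.predict(vect.transform(text_test))
# Accuracy = correct predictions / number of held-out examples
acc = np.mean(predictions == np.array(label_test))
print(len(label_test), acc)
```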


# Intro to Deep Learning

A good resource: http://neuralnetworksanddeeplearning.com/

In [90]:
from sklearn.linear_model import Perceptron

perceptron = Perceptron()

from sklearn.model_selection import cross_val_score

scores = cross_val_score(perceptron, counts, fixed_target, cv=10)
print(scores)
print(scores.mean())


[ 0.63516484  0.65274725  0.64175824  0.63626374  0.62087912  0.61428571
  0.62706271  0.58305831  0.53031974  0.45644983]
0.599798948321
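
What cv=10 does can be spelled out by hand. The sketch below uses a small synthetic data set from make_classification (an assumption, chosen so it runs quickly) and 5 folds instead of 10. For a classifier and an integer cv, scikit-learn uses stratified folds without shuffling, so the manual loop should reproduce the same scores.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data, not the tweet counts
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = Perceptron(random_state=0)

manual = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    # Train a fresh copy on 4 folds, score it on the remaining fold
    fold_model = clone(model).fit(X[train_idx], y[train_idx])
    manual.append(fold_model.score(X[test_idx], y[test_idx]))

scores = cross_val_score(model, X, y, cv=5)
print(np.array(manual))
print(scores)
```

Each example is used for testing exactly once, which is why the mean of the fold scores is a steadier estimate than a single hold-out split.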


## Using Keras for image learning

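A tensor is essentially a matrix generalized to an arbitrary number of dimensions: a scalar is a 0-dimensional tensor, a vector is 1-dimensional, a matrix is 2-dimensional, and a stack of equally sized images is 3-dimensional.
Question: How do we prepare the data set?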
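
In numpy terms, the number of dimensions of a tensor is the array's ndim and its size along each dimension is its shape. A small sketch with made-up arrays; the (100, 28, 28) array is only a stand-in for a batch of 28 x 28 grayscale images.

```python
import numpy as np

scalar = np.array(5)              # 0 dimensions
vector = np.array([1, 2, 3])      # 1 dimension
matrix = np.zeros((28, 28))       # 2 dimensions, e.g. one grayscale image
images = np.zeros((100, 28, 28))  # 3 dimensions, e.g. a stack of 100 images

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("images", images)]:
    print(name, t.ndim, t.shape)
```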

Keras documentation: https://keras.io/models/model/

In [2]:
from keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()

digit = X_train[0]
print(digit.shape)
picture = ""  # not named `str`, which would shadow the built-in type
# Doing some hacky drawing
for i in range(digit.shape[0]):
    for j in range(digit.shape[1]):
        if digit[i][j] == 0:
            picture += " "
        elif digit[i][j] < 128:
            picture += "."
        else:
            picture += "X"
    picture += "\n"

print(picture)
print("Label: ", y_train[0])

(28, 28)
                            
                            
                            
                            
                            
            .....XX.XXX.    
        ...XXXXXXXXXXXX.    
       .XXXXXXXXXX.....     
       .XXXXXXXXXX          
        .X.XXX. .X          
         ..XX.              
           XXX.             
           .XX.             
            .XXX..          
             .XXX..         
              .XXXX.        
               ..XXX        
                 XXX.       
              .XXXXX.       
            .XXXXXXX        
          ..XXXXXX.         
        ..XXXXXX..          
      .XXXXXXX..            
    .XXXXXXXX.              
    XXXXXXX.                
                            
                            
                            

Label:  5


In [5]:
X_train.shape

(60000, 28, 28)

In [5]:
print(type(X_train))
print(type(digit))

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


In [6]:
digit

array([[  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   3,
         18,  18,  18, 126, 136, 175,  26, 166, 255, 247, 127,   0,   0,
          0,   0],
       [  

In [3]:
digit.shape

(28, 28)

In [12]:
# Now see how well the perceptron does on the flattened MNIST images
from sklearn.model_selection import cross_val_score

X_train = [x.flatten() for x in X_train]
perceptron = Perceptron()
scores = cross_val_score(perceptron, X_train, y_train, cv=10)
print(scores)
print(scores.mean())

[ 0.88661339  0.87922705  0.84685886  0.87716667  0.87366667  0.8708118
  0.87747958  0.84664111  0.73169918  0.89292862]
0.85830929207
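
The perceptron expects one row of features per example, which is why each 28 x 28 image is flattened into 784 numbers first. A single reshape gives the same result as the list comprehension and keeps the data in one numpy array. The five fake images below are an assumption made so the sketch runs without downloading MNIST.

```python
import numpy as np

# Five fake 28 x 28 "images" filled with consecutive integers
images = np.arange(5 * 28 * 28).reshape(5, 28, 28)

# Flatten image by image, as in the cell above
flat_loop = np.array([x.flatten() for x in images])

# Flatten all images at once: keep the first axis, merge the rest
flat_reshape = images.reshape(len(images), -1)

print(flat_reshape.shape)
print(np.array_equal(flat_loop, flat_reshape))
```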


## Aside: What is TensorFlow?

A Tensor object is a symbolic handle to the result of an operation, but does not actually hold the values of the operation's output. Instead, TensorFlow encourages users to build up complicated expressions (such as entire neural networks and its gradients) as a dataflow graph. You then offload the computation of the entire dataflow graph (or a subgraph of it) to a TensorFlow tf.Session, which is able to execute the whole computation much more efficiently than executing the operations one-by-one.

https://www.tensorflow.org/programmers_guide/faq

In [14]:
import tensorflow as tf
x = tf.Variable(3, name="x")
y = tf.Variable(4, name="y")
f = x*x*y + y + 2

with tf.Session() as sess:
    x.initializer.run()
    y.initializer.run()
    result = f.eval()
    print(result)

42
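
The idea of a symbolic handle that holds no value until the graph is run can be imitated in a few lines of plain Python. This is a toy sketch, not TensorFlow's API: the Node class and the as_node helper are hypothetical names invented for the illustration.

```python
class Node:
    """A deferred value: remembers how to compute a result, not the result."""

    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs

    def __add__(self, other):
        return Node(lambda a, b: a + b, self, as_node(other))

    def __mul__(self, other):
        return Node(lambda a, b: a * b, self, as_node(other))

    def eval(self):
        # Only now are the inputs evaluated and the operation applied
        return self.fn(*(node.eval() for node in self.inputs))


def as_node(value):
    """Wrap a plain number as a constant node; pass existing nodes through."""
    return value if isinstance(value, Node) else Node(lambda: value)


x = as_node(3)
y = as_node(4)
f = x * x * y + y + 2       # builds a graph; nothing is computed yet
print(isinstance(f, Node))  # f is a handle, not a number
result = f.eval()           # 3*3*4 + 4 + 2
print(result)
```

Because the whole expression is known before anything runs, a real framework can plan the computation, place it on a GPU, and derive gradients from the same graph.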


## One-hot encoding

y_train is a 1-dimensional array of length n that holds the true label (0 to 9) of each digit. One-hot encoding turns it into an n x 10 array in which column k is 1 when the label equals k and 0 otherwise.

X_train is a 3-dimensional numpy array of shape (n, 28, 28) that holds the pixel values of the images.
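
One-hot encoding is simple enough to write by hand with numpy, which shows what to_categorical produces. A sketch with five illustrative labels; indexing the identity matrix with the labels picks out one row per label.

```python
import numpy as np

labels = np.array([5, 0, 4, 1, 9])  # illustrative digit labels
num_classes = 10

# Row k of the identity matrix is the one-hot vector for class k
one_hot = np.eye(num_classes)[labels]
print(one_hot.shape)
print(one_hot[0])  # the 1 sits in column 5

# argmax along each row recovers the original labels
recovered = one_hot.argmax(axis=1)
print(recovered)
```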

In [23]:
import numpy
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten

from keras.layers import Dropout
from keras.utils import np_utils

# load data
(X_train, y_train), (X_test, y_test) = mnist.load_data()
img_width = X_train.shape[1]
img_height = X_train.shape[2]

# Scale pixel values from [0, 255] to [0, 1], which generally helps training.
# MNIST images are 8-bit grayscale, so 255 is the maximum pixel intensity.
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255.
X_test /= 255.

# one hot encode outputs
y_train = np_utils.to_categorical(y_train)
y_test = np_utils.to_categorical(y_test)
num_classes = y_train.shape[1]

print(y_train)


[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 1.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  1.  0.]]


In [21]:
# create model
model=Sequential()
model.add(Flatten(input_shape=(img_width,img_height)))
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit the model
model.fit(X_train, y_train)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x11a552ef0>
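
What this one-layer model computes for a batch of images can be sketched in numpy: flatten, multiply by a weight matrix, add a bias, apply softmax. The weights below are random stand-ins (an assumption for illustration); Keras learns the real ones during fit.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((4, 28, 28))  # stand-in for 4 normalized images

# Flatten: (4, 28, 28) -> (4, 784)
flat = images.reshape(len(images), -1)

# Dense(10): one weight per (pixel, class) pair plus one bias per class
W = rng.normal(scale=0.01, size=(784, 10))
b = np.zeros(10)
logits = flat @ W + b

# Softmax: exponentiate and normalize each row so it sums to 1
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)

n_params = W.size + b.size  # 784*10 + 10
print(probs.shape, n_params)
```

Each row of probs is a probability distribution over the ten digits, and categorical cross-entropy compares it with the one-hot label.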

Validating against the held-out test set (a single hold-out split, not k-fold cross-validation)

In [24]:
model.fit(X_train, y_train, validation_data=(X_test, y_test))

Train on 60000 samples, validate on 10000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x11a7f2940>

## Adding multiple layers

In [26]:
# create model
model=Sequential()
model.add(Flatten(input_shape=(img_width,img_height)))
model.add(Dense(30, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), batch_size=200)

Train on 60000 samples, validate on 10000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x11dd15da0>
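
The extra Dense(30, activation='relu') layer inserts a non-linear step between the pixels and the class scores. A numpy sketch of the forward pass and the resulting parameter count, again with random stand-in weights rather than trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)
flat = rng.random((4, 784))  # stand-in for 4 flattened, normalized images

W1, b1 = rng.normal(scale=0.01, size=(784, 30)), np.zeros(30)
W2, b2 = rng.normal(scale=0.01, size=(30, 10)), np.zeros(10)

# Hidden layer: linear map followed by ReLU, which zeroes negative values
hidden = np.maximum(0, flat @ W1 + b1)

# Output layer: linear map followed by softmax
logits = hidden @ W2 + b2
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)

hidden_params = W1.size + b1.size  # 784*30 + 30
output_params = W2.size + b2.size  # 30*10 + 10
total_params = hidden_params + output_params
print(hidden.shape, probs.shape, total_params)
```

Without the ReLU, the two matrix multiplications would collapse into a single linear map, and the model would be no more expressive than the one-layer version.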