# Lambda School Data Science Unit 4 Sprint Challenge 4

## RNNs, CNNs, AutoML, and more...

In this sprint challenge, you'll explore some of the cutting edge of Data Science.

*Caution* - these approaches can be pretty heavy computationally. All problems were designed so that you should be able to achieve results within at most 5-10 minutes of runtime on Colab or a comparable environment. If something is running longer, double-check your approach!

## Part 1 - RNNs

Use an RNN to fit a simple classification model that distinguishes tweets by Austen Allred from tweets by Weird Al Yankovic.

The following code scrapes the needed data (no API auth needed; it uses [twitterscraper](https://github.com/taspinar/twitterscraper)):

In [1]:
!pip install twitterscraper

Collecting twitterscraper
  Downloading https://files.pythonhosted.org/packages/38/7d/0bf84247b78d7d223914cbf410e1160203a65d39086aaf8c6cad521cec74/twitterscraper-0.9.3.tar.gz
Collecting coala-utils~=0.5.0 (from twitterscraper)
  Downloading https://files.pythonhosted.org/packages/54/00/74ec750cfc4e830f9d1cfdd4d559f3d2d4ba1b834b78d5266446db3fd1d6/coala_utils-0.5.1-py3-none-any.whl
Building wheels for collected packages: twitterscraper
  Building wheel for twitterscraper (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/45/50/9b/70128bca07e2bf8b5ed3f504002e9e74a6eaa5e756341b6931
Successfully built twitterscraper
Installing collected packages: coala-utils, twitterscraper
Successfully installed coala-utils-0.5.1 twitterscraper-0.9.3


In [2]:
from twitterscraper import query_tweets

austen_tweets = query_tweets('from:austen', 1000)
len(austen_tweets)

INFO: queries: ['from:austen since:2006-03-21 until:2006-11-14', 'from:austen since:2006-11-14 until:2007-07-11', 'from:austen since:2007-07-11 until:2008-03-05', 'from:austen since:2008-03-05 until:2008-10-30', 'from:austen since:2008-10-30 until:2009-06-25', 'from:austen since:2009-06-25 until:2010-02-19', 'from:austen since:2010-02-19 until:2010-10-15', 'from:austen since:2010-10-15 until:2011-06-11', 'from:austen since:2011-06-11 until:2012-02-04', 'from:austen since:2012-02-04 until:2012-09-30', 'from:austen since:2012-09-30 until:2013-05-26', 'from:austen since:2013-05-26 until:2014-01-20', 'from:austen since:2014-01-20 until:2014-09-15', 'from:austen since:2014-09-15 until:2015-05-12', 'from:austen since:2015-05-12 until:2016-01-05', 'from:austen since:2016-01-05 until:2016-08-31', 'from:austen since:2016-08-31 until:2017-04-26', 'from:austen since:2017-04-26 until:2017-12-21', 'from:austen since:2017-12-21 until:2018-08-16', 'from:austen since:2018-08-16 until:2019-04-12']
INFO

181

In [3]:
austen_tweets[0].text

'I love love love working with great people.pic.twitter.com/fCKOm6Vl'

In [4]:
al_tweets = query_tweets('from:AlYankovic', 1000)
len(al_tweets)

INFO: queries: ['from:AlYankovic since:2006-03-21 until:2006-11-14', 'from:AlYankovic since:2006-11-14 until:2007-07-11', 'from:AlYankovic since:2007-07-11 until:2008-03-05', 'from:AlYankovic since:2008-03-05 until:2008-10-30', 'from:AlYankovic since:2008-10-30 until:2009-06-25', 'from:AlYankovic since:2009-06-25 until:2010-02-19', 'from:AlYankovic since:2010-02-19 until:2010-10-15', 'from:AlYankovic since:2010-10-15 until:2011-06-11', 'from:AlYankovic since:2011-06-11 until:2012-02-04', 'from:AlYankovic since:2012-02-04 until:2012-09-30', 'from:AlYankovic since:2012-09-30 until:2013-05-26', 'from:AlYankovic since:2013-05-26 until:2014-01-20', 'from:AlYankovic since:2014-01-20 until:2014-09-15', 'from:AlYankovic since:2014-09-15 until:2015-05-12', 'from:AlYankovic since:2015-05-12 until:2016-01-05', 'from:AlYankovic since:2016-01-05 until:2016-08-31', 'from:AlYankovic since:2016-08-31 until:2017-04-26', 'from:AlYankovic since:2017-04-26 until:2017-12-21', 'from:AlYankovic since:2017-12

960

In [5]:
al_tweets[0].text

'Well well well... look what just showed up on my doorstep! http://twitpic.com/59mi2c'

In [6]:
len(austen_tweets + al_tweets)

1141

In [7]:
len(austen_tweets)

181

In [26]:
len(al_tweets)

960

Your tasks:

- Encode the characters to a sequence of integers for the model
- Get the data into the appropriate shape/format, including labels and a train/test split
- Use Keras to fit a predictive model, classifying tweets as being from Austen versus Weird Al
- Report your overall score and accuracy

For reference, the [Keras IMDB sentiment classification example](https://github.com/keras-team/keras/blob/master/examples/imdb_lstm.py) will be useful, as well as the RNN code we used in class.

*Note* - focus on getting a running model, not on maxing accuracy with extreme data size or epoch numbers. Only revisit and push accuracy if you get everything else done!

### Encoding the characters to a sequence of integers for the model

In [0]:
import numpy as np

In [0]:
# Concatenate the text of every tweet (Austen's first, then Al's) into one string
alltweets_string = ''.join(tweet.text for tweet in austen_tweets + al_tweets)

In [28]:
alltweets_string



In [29]:
# split and remove duplicate characters. convert to list.
chars = list(set(alltweets_string))

# the number of unique characters
num_chars = len(chars) 
txt_data_size = len(alltweets_string)

print("unique characters : ", num_chars)
print("txt_data_size : ", txt_data_size)

unique characters :  106
txt_data_size :  110220


In [30]:
# Map each character to an integer index and back (an integer encoding, not one-hot)
char_to_int = dict((c, i) for i, c in enumerate(chars))  # enumerate yields (index, value) pairs
int_to_char = dict((i, c) for i, c in enumerate(chars))
print(char_to_int)
print("----------------------------------------------------")
print(int_to_char)
print("----------------------------------------------------")

{'G': 0, 'O': 1, '8': 2, 'I': 3, '&': 4, 'f': 5, ':': 6, '–': 7, 'V': 8, '/': 9, ';': 10, 'y': 11, '6': 12, '.': 13, 'l': 14, '3': 15, 'M': 16, '*': 17, 'r': 18, '™': 19, '(': 20, 'n': 21, 'H': 22, 'E': 23, 'B': 24, 'е': 25, '9': 26, 'g': 27, 'í': 28, 'z': 29, 'S': 30, 'm': 31, '‘': 32, 'а': 33, '0': 34, 'b': 35, '_': 36, '$': 37, '-': 38, '4': 39, 'с': 40, 'F': 41, 'i': 42, 'A': 43, '…': 44, 'e': 45, 'L': 46, 'K': 47, 'D': 48, 's': 49, '”': 50, 'o': 51, 'U': 52, 'q': 53, 'Y': 54, 'Q': 55, '#': 56, ',': 57, '"': 58, '?': 59, '\xa0': 60, 'J': 61, 'a': 62, 'р': 63, 'j': 64, 'R': 65, '%': 66, '—': 67, 'й': 68, 'u': 69, '“': 70, 'é': 71, 'ï': 72, '7': 73, 'N': 74, 'W': 75, '5': 76, '2': 77, 'c': 78, '1': 79, 'v': 80, 'у': 81, 'k': 82, 'T': 83, 'p': 84, ' ': 85, '\n': 86, 'w': 87, 'Z': 88, 'd': 89, 'З': 90, 'h': 91, 't': 92, 'в': 93, 'т': 94, ')': 95, '!': 96, 'x': 97, '@': 98, 'P': 99, 'C': 100, 'X': 101, '+': 102, 'д': 103, "'": 104, '’': 105}
---------------------------------------------
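
One caveat with `list(set(...))`: Python randomizes string hashes per process, so the character order, and with it each character's integer code, can differ from one run to the next. A minimal sketch of a reproducible alternative, assuming that sorting the vocabulary is acceptable (`sample_text` is a made-up stand-in for `alltweets_string`, and the `*_sorted` names are hypothetical):

```python
# Hypothetical stand-in for alltweets_string
sample_text = "I love love love working with great people."

# sorted() fixes the order, so the mapping is identical on every run
chars_sorted = sorted(set(sample_text))
char_to_int_sorted = {c: i for i, c in enumerate(chars_sorted)}
int_to_char_sorted = {i: c for c, i in char_to_int_sorted.items()}

# Round trip: encode to integers, then decode back to the original text
encoded = [char_to_int_sorted[c] for c in sample_text]
decoded = ''.join(int_to_char_sorted[i] for i in encoded)
print(decoded == sample_text)
```

A stable mapping matters as soon as a trained model is saved and reused, because the saved weights are tied to specific integer codes.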

In [119]:
# Integer-encode the Austen tweets

# Each tweet becomes a list of integers,
# one per character, looked up in char_to_int.
austen_tweetsEnc = []
for tweet in austen_tweets:
    integer_encoded = [char_to_int[ch] for ch in tweet.text]
    austen_tweetsEnc.append(integer_encoded)

print(austen_tweetsEnc)
len(austen_tweetsEnc)

[[3, 85, 14, 51, 80, 45, 85, 14, 51, 80, 45, 85, 14, 51, 80, 45, 85, 87, 51, 18, 82, 42, 21, 27, 85, 87, 42, 92, 91, 85, 27, 18, 45, 62, 92, 85, 84, 45, 51, 84, 14, 45, 13, 84, 42, 78, 13, 92, 87, 42, 92, 92, 45, 18, 13, 78, 51, 31, 9, 5, 100, 47, 1, 31, 12, 8, 14], [54, 45, 49, 57, 85, 92, 91, 45, 85, 42, 31, 84, 62, 78, 92, 85, 87, 62, 49, 85, 31, 45, 21, 92, 42, 51, 21, 45, 89, 85, 42, 21, 85, 92, 91, 45, 85, 30, 38, 79], [65, 42, 49, 82, 49, 85, 42, 21, 85, 52, 35, 45, 18, 104, 49, 85, 30, 79, 6, 86, 86, 100, 51, 31, 84, 45, 92, 42, 92, 51, 18, 49, 86, 48, 45, 14, 45, 92, 45, 52, 35, 45, 18, 86, 65, 45, 84, 69, 92, 62, 92, 42, 51, 21, 86, 86, 77, 9, 15, 85, 62, 18, 45, 85, 49, 45, 14, 5, 38, 42, 21, 5, 14, 42, 78, 92, 45, 89, 85, 20, 62, 21, 89, 85, 42, 5, 85, 42, 92, 85, 87, 45, 18, 45, 21, 104, 92, 85, 5, 51, 18, 85, 48, 45, 14, 45, 92, 45, 52, 35, 45, 18, 85, 46, 11, 5, 92, 85, 31, 42, 27, 91, 92, 85, 35, 45, 85, 78, 14, 51, 49, 45, 85, 92, 51, 85, 89, 45, 62, 89, 57, 85, 49, 51

181

In [121]:
# Integer-encode the Al tweets

# Each tweet becomes a list of integers,
# one per character, looked up in char_to_int.
al_tweetsEnc = []
for tweet in al_tweets:
    integer_encoded = [char_to_int[ch] for ch in tweet.text]
    al_tweetsEnc.append(integer_encoded)

print(al_tweetsEnc)
len(al_tweetsEnc)

[[75, 45, 14, 14, 85, 87, 45, 14, 14, 85, 87, 45, 14, 14, 13, 13, 13, 85, 14, 51, 51, 82, 85, 87, 91, 62, 92, 85, 64, 69, 49, 92, 85, 49, 91, 51, 87, 45, 89, 85, 69, 84, 85, 51, 21, 85, 31, 11, 85, 89, 51, 51, 18, 49, 92, 45, 84, 96, 85, 91, 92, 92, 84, 6, 9, 9, 92, 87, 42, 92, 84, 42, 78, 13, 78, 51, 31, 9, 76, 26, 31, 42, 77, 78], [83, 91, 42, 21, 82, 42, 21, 27, 85, 51, 5, 85, 78, 91, 62, 21, 27, 42, 21, 27, 85, 31, 11, 85, 14, 51, 51, 82, 85, 62, 27, 62, 42, 21, 13, 85, 100, 51, 21, 49, 42, 89, 45, 18, 42, 21, 27, 85, 43, 21, 92, 51, 21, 85, 100, 91, 42, 27, 69, 18, 91, 85, 91, 62, 42, 18, 78, 69, 92, 85, 62, 21, 89, 85, 69, 21, 42, 35, 18, 51, 87, 85, 92, 62, 92, 92, 51, 51, 13], [74, 45, 87, 85, 54, 51, 18, 82, 85, 83, 42, 31, 45, 49, 57, 85, 11, 104, 62, 14, 14, 96, 91, 92, 92, 84, 6, 9, 9, 64, 13, 31, 84, 9, 31, 49, 100, 55, 91, 49, 60], [3, 85, 78, 62, 21, 104, 92, 85, 35, 45, 14, 42, 45, 80, 45, 85, 3, 85, 21, 45, 80, 45, 18, 85, 27, 51, 92, 85, 62, 18, 51, 69, 21, 89, 85, 92

960
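
Keras's `pad_sequences` fills short sequences with `0` by default, and in the mapping above `0` is also the code of a real character (`'G'` in this run). A minimal sketch of one way to avoid the clash, assuming codes are shifted to start at 1 so that 0 means padding only. The helper names `build_vocab`, `encode` and `pad_left` are hypothetical, and `pad_left` only mimics the default pre-padding and pre-truncating behaviour in plain Python:

```python
def build_vocab(texts):
    """Map each distinct character to an integer code starting at 1."""
    chars = sorted(set(''.join(texts)))
    return {c: i + 1 for i, c in enumerate(chars)}  # 0 is reserved for padding

def encode(text, vocab):
    """Turn a string into a list of integer codes."""
    return [vocab[c] for c in text]

def pad_left(seq, maxlen, value=0):
    """Keep the last `maxlen` items and left-pad with `value`."""
    seq = seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

texts = ["abc", "abcabcab"]  # toy stand-ins for tweet texts
vocab = build_vocab(texts)
padded = [pad_left(encode(t, vocab), maxlen=5) for t in texts]
print(padded)  # [[0, 0, 1, 2, 3], [1, 2, 3, 1, 2]]
```

With this scheme an `Embedding` layer needs `len(vocab) + 1` input rows, one per character plus one for the padding code.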

### Get the data into the appropriate shape/format, including labels and a train/test split

In [117]:
# labels for Austen tweets (1 = Austen)
y_austen = [1] * len(austen_tweetsEnc)
print(y_austen)
len(y_austen)

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


181

In [118]:
# labels for Al tweets (0 = Weird Al)
y_al = [0] * len(al_tweetsEnc)
print(y_al)
len(y_al)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

960

In [131]:
# X --> joined Austen and Al encoded tweets
X = austen_tweetsEnc + al_tweetsEnc
X = np.asarray(X, dtype=object)  # tweets differ in length, so store them as an object array of lists
X

array([list([3, 85, 14, 51, 80, 45, 85, 14, 51, 80, 45, 85, 14, 51, 80, 45, 85, 87, 51, 18, 82, 42, 21, 27, 85, 87, 42, 92, 91, 85, 27, 18, 45, 62, 92, 85, 84, 45, 51, 84, 14, 45, 13, 84, 42, 78, 13, 92, 87, 42, 92, 92, 45, 18, 13, 78, 51, 31, 9, 5, 100, 47, 1, 31, 12, 8, 14]),
       list([54, 45, 49, 57, 85, 92, 91, 45, 85, 42, 31, 84, 62, 78, 92, 85, 87, 62, 49, 85, 31, 45, 21, 92, 42, 51, 21, 45, 89, 85, 42, 21, 85, 92, 91, 45, 85, 30, 38, 79]),
       list([65, 42, 49, 82, 49, 85, 42, 21, 85, 52, 35, 45, 18, 104, 49, 85, 30, 79, 6, 86, 86, 100, 51, 31, 84, 45, 92, 42, 92, 51, 18, 49, 86, 48, 45, 14, 45, 92, 45, 52, 35, 45, 18, 86, 65, 45, 84, 69, 92, 62, 92, 42, 51, 21, 86, 86, 77, 9, 15, 85, 62, 18, 45, 85, 49, 45, 14, 5, 38, 42, 21, 5, 14, 42, 78, 92, 45, 89, 85, 20, 62, 21, 89, 85, 42, 5, 85, 42, 92, 85, 87, 45, 18, 45, 21, 104, 92, 85, 5, 51, 18, 85, 48, 45, 14, 45, 92, 45, 52, 35, 45, 18, 85, 46, 11, 5, 92, 85, 31, 42, 27, 91, 92, 85, 35, 45, 85, 78, 14, 51, 49, 45, 85, 92, 5

In [102]:
# y --> joined y for austen and al
y = np.concatenate([y_austen,y_al])
y

array([1, 1, 1, ..., 0, 0, 0])

In [112]:
len(X)

1141
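
The two classes are imbalanced (181 Austen tweets versus 960 Weird Al tweets). One common option, sketched here as an assumption rather than something the challenge requires, is to weight each class inversely to its frequency. Keras's `model.fit` accepts such a dictionary through its `class_weight` argument.

```python
from collections import Counter

# Labels as built above: 1 = Austen, 0 = Weird Al
labels = [1] * 181 + [0] * 960

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# "Balanced" heuristic: n_samples / (n_classes * class_count)
class_weight = {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}
print(class_weight)
```

The rarer Austen class gets the larger weight, so each of its tweets counts for more in the loss.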

In [0]:
def shuffle_split_data(X, y, train_frac=0.7):
    # Shuffle every index exactly once, so the train and test sets are disjoint
    indices = np.random.permutation(len(X))
    n_train = int(train_frac * len(X))
    train_idx, test_idx = indices[:n_train], indices[n_train:]

    X_train = X[train_idx]
    y_train = y[train_idx]
    X_test = X[test_idx]
    y_test = y[test_idx]

    print(len(X_train), len(y_train), len(X_test), len(y_test))
    return X_train, y_train, X_test, y_test

In [133]:
shuffle_split_data(X,y)

798 798 343 343


(array([list([41, 42, 18, 49, 92, 85, 78, 45, 14, 45, 35, 18, 42, 92, 11, 85, 89, 45, 62, 92, 91, 85, 51, 5, 85, 77, 34, 79, 73, 13, 85, 85, 56, 65, 3, 99, 91, 92, 92, 84, 6, 9, 9, 35, 42, 92, 13, 14, 11, 9, 77, 42, 35, 24, 52, 31, 89, 60]),
        list([3, 21, 85, 14, 42, 21, 45, 85, 18, 42, 27, 91, 92, 85, 21, 51, 87, 85, 92, 51, 85, 49, 45, 45, 85, 92, 91, 45, 85, 21, 45, 87, 85, 98, 78, 62, 18, 18, 42, 45, 5, 5, 42, 49, 91, 45, 18, 85, 31, 51, 80, 42, 45, 13]),
        list([83, 91, 42, 49, 85, 21, 45, 87, 85, 0, 13, 0, 13, 85, 43, 14, 14, 42, 21, 85, 89, 51, 78, 69, 31, 45, 21, 92, 62, 18, 11, 85, 14, 51, 51, 82, 49, 85, 43, 16, 43, 88, 3, 74, 0, 13, 84, 42, 78, 13, 92, 87, 42, 92, 92, 45, 18, 13, 78, 51, 31, 9, 64, 77, 51, 5, 35, 14, 62, 100, 78, 61]),
        list([16, 45, 21, 69, 89, 51, 85, 89, 51, 45, 49, 21, 104, 92, 85, 49, 45, 45, 31, 85, 92, 51, 85, 35, 45, 85, 62, 49, 85, 84, 51, 84, 69, 14, 62, 18, 85, 62, 49, 85, 92, 91, 45, 11, 85, 51, 21, 78, 45, 85, 87, 45, 18, 45,
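
A pitfall worth knowing when splitting by index: `~` is a bitwise NOT, and on an integer it maps `i` to `-i - 1` rather than selecting "everything else". NumPy applies it elementwise to integer index arrays; only boolean masks are inverted in the expected sense. The plain-Python sketch below, with hypothetical toy data, shows the effect, followed by a shuffle-based split whose two parts cannot overlap:

```python
import random

data = list(range(10))          # toy stand-in for the encoded tweets
picked = [0, 3, 3, 7]           # sampled with replacement, so 3 repeats

# Bitwise NOT turns each index into a negative index, not a complement
print([~i for i in picked])     # [-1, -4, -4, -8]

# Shuffle all indices once, then cut: every item lands in exactly one part
random.seed(0)
indices = list(range(len(data)))
random.shuffle(indices)
n_train = len(data) * 7 // 10   # 70% of the data, in integer arithmetic
train = [data[i] for i in indices[:n_train]]
test = [data[i] for i in indices[n_train:]]
print(len(train), len(test))    # 7 3
```

A quick sanity check for any split is that the train and test sizes add up to the size of the full dataset.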

In [153]:
# TODO - your code!
from __future__ import print_function

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM

# vocabulary size: the Embedding layer needs one row per distinct character
max_features = num_chars
# pad or truncate every tweet to this many characters
maxlen = 40
batch_size = 32

# shuffle the indices once so that the train and test sets do not overlap
indices = np.random.permutation(len(X))
n_train = int(0.7*len(X))
X_train = X[indices[:n_train]]
y_train = y[indices[:n_train]]
X_test = X[indices[n_train:]]
y_test = y[indices[n_train:]]
print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')

print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(X_train, maxlen=maxlen)
x_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 798))
model.add(LSTM(798, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
# fit on the padded arrays (x_train / x_test), not the ragged X_train / X_test
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=15,
          validation_data=(x_test, y_test))
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

798 train sequences
343 test sequences
Pad sequences (samples x time)
x_train shape: (798, 40)
x_test shape: (343, 40)
Build model...
Train...

Conclusion - RNN runs, and gives pretty decent improvement over a naive "It's Al!" model. To *really* improve the model, more playing with parameters, and just getting more data (particularly Austen tweets), would help. Also - RNN may well not be the best approach here, but it is at least a valid one.
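
For context on that comparison: the naive "It's Al!" model labels every tweet as Weird Al, so its accuracy equals Weird Al's share of the data. A quick sketch using the tweet counts reported above:

```python
n_austen = 181   # len(austen_tweets)
n_al = 960       # len(al_tweets)

# Always predicting the majority class is right for every Weird Al tweet
baseline_accuracy = n_al / (n_austen + n_al)
print(round(baseline_accuracy, 4))  # 0.8414
```

A trained model's test accuracy has to beat this number to count as an improvement.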

## Part 2 - CNNs

Time to play "find the frog!" Use Keras and ResNet50 to detect which of the following images contain frogs:

In [60]:
!pip install google_images_download

Collecting google_images_download
  Downloading https://files.pythonhosted.org/packages/43/51/49ebfd3a02945974b1d93e34bb96a1f9530a0dde9c2bc022b30fd658edd6/google_images_download-2.5.0.tar.gz
Collecting selenium (from google_images_download)
  Downloading https://files.pythonhosted.org/packages/80/d6/4294f0b4bce4de0abf13e17190289f9d0613b0a44e5dd6a7f5ca98459853/selenium-3.141.0-py2.py3-none-any.whl (904kB)
    100% |████████████████████████████████| 911kB 21.3MB/s 
Building wheels for collected packages: google-images-download
  Building wheel for google-images-download (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/d2/23/84/3cec6d566b88bef64ad727a7e805f6544b8af4a8f121f9691c
Successfully built google-images-download
Installing collected packages: selenium, google-images-download
Successfully installed google-images-download-2.5.0 selenium-3.141.0


In [65]:
from google_images_download import google_images_download

response = google_images_download.googleimagesdownload()
arguments = {"keywords": "animal pond", "limit": 5, "print_urls": True}
absolute_image_paths = response.download(arguments)


Item no.: 1 --> Item name = animal pond
Evaluating...
Starting Download...
Image URL: https://www.enchantedlearning.com/pgifs/Pondanimals.GIF
Completed Image ====> 1. pondanimals.gif
Image URL: https://i.ytimg.com/vi/NCbu0TND9vE/hqdefault.jpg
Completed Image ====> 2. hqdefault.jpg
Image URL: https://pklifescience.com/staticfiles/articles/images/PKLS4116_inline.png
Completed Image ====> 3. pkls4116_inline.png
Image URL: https://pixnio.com/free-images/fauna-animals/reptiles-and-amphibians/alligators-and-crocodiles-pictures/alligator-animal-on-pond.jpg
Completed Image ====> 4. alligator-animal-on-pond.jpg
Image URL: https://www.nwf.org/-/media/NEW-WEBSITE/Programs/Garden-for-Wildlife/amphibian_bronze-frog_Julia-Bartosh_400x267.ashx
Completed Image ====> 5. amphibian_bronze-frog_julia-bartosh_400x267.ash

Errors: 0



At the time of writing at least a few do, but since the Internet changes, it is possible your 5 won't. You can easily verify this yourself, and (once you have working code) increase the number of images you pull to be more sure of getting a frog. Your goal is to validly run ResNet50 on the input images; don't worry about tuning or improving the model.

*Hint* - ResNet50 doesn't just return "frog". The three labels it has for frogs are: `bullfrog, tree_frog, tailed_frog`

*Stretch goal* - also check for fish.

In [68]:
images_list = absolute_image_paths["animal pond"]
images_list

['/content/downloads/animal pond/1. pondanimals.gif',
 '/content/downloads/animal pond/2. hqdefault.jpg',
 '/content/downloads/animal pond/3. pkls4116_inline.png',
 '/content/downloads/animal pond/4. alligator-animal-on-pond.jpg',
 '/content/downloads/animal pond/5. amphibian_bronze-frog_julia-bartosh_400x267.ash']

In [0]:
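# TODO - your code!
import numpy as np

from keras.applications.resnet50 import ResNet50
from keras.preprocessing import image
from keras.applications.resnet50 import preprocess_input, decode_predictions

# ImageNet has no plain 'frog' class; these are the three frog labels,
# spelled with underscores as decode_predictions returns them
FROG_LABELS = {'bullfrog', 'tree_frog', 'tailed_frog'}

# load the pretrained network once, rather than on every call
resnet_model = ResNet50(weights='imagenet')

def process_img_path(img_path):
  return image.load_img(img_path, target_size=(224, 224))

def img_contains_frog(img):
  x = image.img_to_array(img)
  x = np.expand_dims(x, axis=0)
  x = preprocess_input(x)
  features = resnet_model.predict(x)
  results = decode_predictions(features, top=3)[0]
  print(results)
  # each entry is (class_id, label, probability)
  for entry in results:
    if entry[1] in FROG_LABELS:
      return entry[2]
  return 0.0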

In [83]:
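# The image below ending in ".ash" could not be opened, so it is skipped
from IPython.display import Image, display

faulty_pic = '/content/downloads/animal pond/5. amphibian_bronze-frog_julia-bartosh_400x267.ash'
for pic in images_list:
    if pic != faulty_pic:
        print(img_contains_frog(process_img_path(pic)))
        # inside a loop an image is only rendered when passed to display()
        display(Image(filename=pic, width=600))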

[('n03598930', 'jigsaw_puzzle', 0.8680313), ('n06359193', 'web_site', 0.06410024), ('n02834397', 'bib', 0.021264324)]
0.0
[('n01443537', 'goldfish', 0.8495859), ('n01631663', 'eft', 0.06760218), ('n02536864', 'coho', 0.035163548)]
0.0
[('n04243546', 'slot', 0.8712449), ('n04476259', 'tray', 0.04993588), ('n03908618', 'pencil_box', 0.023072386)]
0.0
[('n01698640', 'American_alligator', 0.96394104), ('n01697457', 'African_crocodile', 0.026759902), ('n01737021', 'water_snake', 0.005964664)]
0.0
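
The hint and the stretch goal both come down to matching predicted labels against a set of class names. A minimal sketch on tuples shaped like the `decode_predictions` output above. The helper name `best_match` is hypothetical, and the fish set is only an illustrative subset: `goldfish` and `coho` appear in the output above, and ImageNet's other fish classes would still need to be added.

```python
FROG_LABELS = {'bullfrog', 'tree_frog', 'tailed_frog'}
FISH_LABELS = {'goldfish', 'coho'}  # illustrative subset, not the full list

def best_match(results, labels):
    """Return the highest probability among entries whose label is in `labels`."""
    return max((prob for _, name, prob in results if name in labels), default=0.0)

# Copied from the second image's predictions above
results = [('n01443537', 'goldfish', 0.8495859),
           ('n01631663', 'eft', 0.06760218),
           ('n02536864', 'coho', 0.035163548)]

print(best_match(results, FROG_LABELS))  # 0.0
print(best_match(results, FISH_LABELS))  # 0.8495859
```

Using sets of label names keeps the frog check and the fish check in the same function, with only the label set changing.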


## Part 3 - AutoML

Use [TPOT](https://github.com/EpistasisLab/tpot) to fit a predictive model for the King County housing data, with `price` as the target output variable.

In [85]:
!pip install tpot

Collecting tpot
  Downloading https://files.pythonhosted.org/packages/36/6f/9a400b0a7d32d13b1b9a565de481d10163c8b39d1bdf63ae0219922a24fb/TPOT-0.10.0-py3-none-any.whl (73kB)
    100% |████████████████████████████████| 81kB 3.2MB/s 
Collecting update-checker>=0.16 (from tpot)
  Downloading https://files.pythonhosted.org/packages/17/c9/ab11855af164d03be0ff4fddd4c46a5bd44799a9ecc1770e01a669c21168/update_checker-0.16-py2.py3-none-any.whl
Collecting stopit>=1.1.1 (from tpot)
  Downloading https://files.pythonhosted.org/packages/35/58/e8bb0b0fb05baf07bbac1450c447d753da65f9701f551dca79823ce15d50/stopit-1.1.2.tar.gz
Collecting deap>=1.0 (from tpot)
  Downloading https://files.pythonhosted.org/packages/af/29/e7f2ecbe02997b16a768baed076f5fc4781d7057cd5d9adf7c94027845ba/deap-1.2.2.tar.gz (936kB)
    100% |████████████████████████████████| 942kB 22.6MB/s 
Building wheels for collected packages: stopit, deap
  Building wheel for stopit (setup.py) ... done
  Stored

In [86]:
!wget https://raw.githubusercontent.com/ryanleeallred/datasets/master/kc_house_data.csv

--2019-04-12 17:03:20--  https://raw.githubusercontent.com/ryanleeallred/datasets/master/kc_house_data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2515206 (2.4M) [text/plain]
Saving to: ‘kc_house_data.csv’


2019-04-12 17:03:21 (26.8 MB/s) - ‘kc_house_data.csv’ saved [2515206/2515206]



In [87]:
!head kc_house_data.csv

id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
"7129300520","20141013T000000",221900,3,1,1180,5650,"1",0,0,3,7,1180,0,1955,0,"98178",47.5112,-122.257,1340,5650
"6414100192","20141209T000000",538000,3,2.25,2570,7242,"2",0,0,3,7,2170,400,1951,1991,"98125",47.721,-122.319,1690,7639
"5631500400","20150225T000000",180000,2,1,770,10000,"1",0,0,3,6,770,0,1933,0,"98028",47.7379,-122.233,2720,8062
"2487200875","20141209T000000",604000,4,3,1960,5000,"1",0,0,5,7,1050,910,1965,0,"98136",47.5208,-122.393,1360,5000
"1954400510","20150218T000000",510000,3,2,1680,8080,"1",0,0,3,8,1680,0,1987,0,"98074",47.6168,-122.045,1800,7503
"7237550310","20140512T000000",1.225e+006,4,4.5,5420,101930,"1",0,0,3,11,3890,1530,2001,0,"98053",47.6561,-122.005,4760,101930
"1321400060","20140627T000000",257500,3,2.25,1715,6819,"2",0,0,3,7,1715,0,1995,0,"98003",47.3097,-122.327,2238,6819
"2

As with previous questions, your goal is to run TPOT successfully and report the error at the end. Also, in the interest of time, feel free to choose small `generations=1` and `population_size=10` parameters so your pipeline runs efficiently and you are able to iterate and test.

*Hint* - you'll have to drop and/or type coerce at least a few variables to get things working. It's fine to err on the side of dropping to get things running, as long as you still get a valid model with reasonable predictive power.
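
As an illustration of the type coercion the hint mentions: the `date` column holds strings such as `20141013T000000`, which a model cannot use directly. A minimal sketch, using only the standard library, of turning one such value into numeric features. The helper name `date_to_features` and the choice of year and month are assumptions for illustration; the solution below simply drops the column.

```python
from datetime import datetime

def date_to_features(raw):
    """Parse a value like '20141013T000000' into numeric (year, month)."""
    parsed = datetime.strptime(raw, "%Y%m%dT%H%M%S")
    return parsed.year, parsed.month

print(date_to_features("20141013T000000"))  # (2014, 10)
```

Dropping the column is the quicker route; coercing it keeps any seasonal signal in the sale date available to the model.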

In [90]:
# TODO - your code!
import pandas as pd

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

df = pd.read_csv('kc_house_data.csv')
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,3,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,5,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,3,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


In [91]:
df = df.drop(['id', 'date', 'lat', 'long'], axis=1)
df.head()

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,sqft_living15,sqft_lot15
0,221900.0,3,1.0,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,1340,5650
1,538000.0,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,1690,7639
2,180000.0,2,1.0,770,10000,1.0,0,0,3,6,770,0,1933,0,98028,2720,8062
3,604000.0,4,3.0,1960,5000,1.0,0,0,5,7,1050,910,1965,0,98136,1360,5000
4,510000.0,3,2.0,1680,8080,1.0,0,0,3,8,1680,0,1987,0,98074,1800,7503


In [92]:
X = df.drop('price', axis=1)
y = df.price

X.shape, y.shape

((21613, 16), (21613,))

In [0]:
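from sklearn.preprocessing import RobustScaler

# NOTE: the scaled copy is stored in lowercase `x`; the cells below keep
# using the unscaled `X`, so this scaling does not affect the TPOT fit
scaler = RobustScaler()
x = scaler.fit_transform(X)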

In [95]:
from tpot import TPOTRegressor
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    train_size=0.75, 
                                                    test_size=0.25)

tpot = TPOTRegressor(generations=1, population_size=10, verbosity=2)
tpot.fit(X_train, y_train)

print(tpot.score(X_test, y_test))

HBox(children=(IntProgress(value=0, description='Optimization Progress', max=20, style=ProgressStyle(descripti…

Generation 1 - Current best internal CV score: -22754579273.6778

Best pipeline: DecisionTreeRegressor(GradientBoostingRegressor(input_matrix, alpha=0.9, learning_rate=0.1, loss=huber, max_depth=7, max_features=0.6000000000000001, min_samples_leaf=5, min_samples_split=11, n_estimators=100, subsample=0.6500000000000001), max_depth=10, min_samples_leaf=10, min_samples_split=2)
-18333795348.39358


In [96]:
tpot.predict(X_test)

array([354453.67261905, 202185.97810219, 460894.73684211, ...,
       313434.8705036 , 295435.81730769, 560993.41353383])

> Comparing the predictions above with the true prices below, `tpot` gives a fairly satisfactory baseline regression. Feature engineering would likely reduce the error considerably.
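
TPOT's regressor scores with negative mean squared error by default, so the number printed above is in squared dollars. Taking the square root of its magnitude gives a root mean squared error in dollars, which is easier to compare against the prices:

```python
import math

neg_mse = -18333795348.39358   # tpot.score(X_test, y_test) from the run above

rmse = math.sqrt(-neg_mse)
print(round(rmse))  # roughly 135,000 dollars
```

An error of this size is large next to the cheaper homes in `y_test` and small next to the multi-million-dollar ones, which is worth keeping in mind when judging the baseline.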

In [97]:
y_test

10137     340000.0
2340      230000.0
6886      509500.0
7529      487000.0
4944      625000.0
18509     705640.0
6343      260000.0
9043      455000.0
8443     2140000.0
11700     455000.0
458       578000.0
1912      339950.0
269      2900000.0
17350     520000.0
11045    1120000.0
11350     651500.0
20799     290000.0
10368     590000.0
14556    2888000.0
10841     265000.0
2075     1200000.0
11171     319000.0
3379      415000.0
2881      149900.0
18679     300000.0
16678     305000.0
6278      241000.0
630       218000.0
8483      799990.0
12161     600000.0
20969     193000.0
2519      545000.0
1788      256000.0
6803      135000.0
7468      415000.0
17624     329900.0
13123    1180000.0
6783     2250000.0
14154     450000.0
15701     619000.0
15023     250000.0
18466     285000.0
5497      230000.0
13814     760000.0
17595     406500.0
17845     249500.0
8192      180000.0
18475     571500.0
5309      535000.0
6118      325000.0
20235     735000.0
5122      210000.0
16149     53

## Part 4 - More...

Answer the following questions, with a target audience of a fellow Data Scientist:

- What do you consider your strongest area, as a Data Scientist?
> I consider myself a strong data wrangler and feature engineer, skills that improve model accuracy for both classification and regression. This is especially useful for businesses seeking insight from their data, or predictions that will help them achieve their goals.
- What area of Data Science would you most like to learn more about, and why?
> I would like to learn more about implementing NLP and Deep Learning models in order to help businesses overcome hurdles and thrive.
- Where do you think Data Science will be in 5 years?
> I see myself running my own consultancy using machine learning and deep learning models in order to help small businesses.

A few sentences per answer is fine - only elaborate if time allows.

Thank you for your hard work, and congratulations! You've learned a lot, and should proudly call yourself a Data Scientist.