# Deep Learning Sprint Challenge
### RNNs, CNNs, GANs, and AutoML

In this Sprint Challenge, you'll explore some of the cutting edge of Data Science. *Caution* - these approaches can be pretty heavy computationally. All problems are designed to be completed with 5-10 minutes of run time on most machines. If your approach takes longer, please double check your work.

## Part 1 - RNNs

Use an RNN to fit a classification model that distinguishes between tweets from any two accounts. The code sample below uses [twitterscraper](https://github.com/taspinar/twitterscraper) to illustrate how to access data from an account (no API auth needed).

Your Tasks:
* Select two twitter accounts to gather data from
* Use twitterscraper to get ~1,000 tweets from each account
* Encode the characters to a sequence of integers for the model
* Get the data into the appropriate shape/format, including labels and a train/test split
* Use Keras to fit a predictive model, classifying tweets as being from one account or the other
* Report your overall score and accuracy

For reference, the [Keras IMDB classification example](https://github.com/keras-team/keras/blob/master/examples/imdb_lstm.py) will be useful, as well as the RNN code we used in class.

Note - focus on getting a running model, not on maximizing accuracy with extreme data sizes or epoch counts. Fit a baseline model based on tweet text. Only revisit and push accuracy or incorporate additional features if you get everything else done!
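One detail worth keeping in mind for the encoding step: Keras's `pad_sequences` pads with `0` by default, so it can help to reserve index `0` for padding and start character indices at `1`. The following is a minimal pure-Python sketch of that idea, not the approach required by the challenge. `build_vocab` and `encode` are hypothetical helper names, and the left-padding and left-truncation are written to mimic the `pad_sequences` defaults.

```python
def build_vocab(texts):
    """Map each distinct character to an integer, starting at 1 (0 is kept for padding)."""
    chars = sorted(set("".join(texts)))
    return {c: i for i, c in enumerate(chars, start=1)}


def encode(text, vocab, maxlen):
    """Encode text as character indices, then left-truncate and left-pad to maxlen."""
    ids = [vocab[c] for c in text][-maxlen:]  # keep only the last maxlen characters
    return [0] * (maxlen - len(ids)) + ids    # pad on the left with zeros


texts = ["abc", "cab!"]
vocab = build_vocab(texts)
print(vocab)
print(encode("abc", vocab, maxlen=5))
print(encode("cab!", vocab, maxlen=3))
```

With this layout the embedding layer would need an input size of `len(vocab) + 1` to account for the padding index.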

In [None]:
# !pip install twitterscraper

In [1]:
from twitterscraper import query_tweets

tweets_1 = query_tweets('from:streetsblogchi', 1000)
print(len(tweets_1))
tweets_2 = query_tweets('from:elonmusk', 1000)
print(len(tweets_2))

INFO: queries: ['from:streetsblogchi since:2006-03-21 until:2006-11-16', 'from:streetsblogchi since:2006-11-16 until:2007-07-14', 'from:streetsblogchi since:2007-07-14 until:2008-03-10', 'from:streetsblogchi since:2008-03-10 until:2008-11-06', 'from:streetsblogchi since:2008-11-06 until:2009-07-04', 'from:streetsblogchi since:2009-07-04 until:2010-03-01', 'from:streetsblogchi since:2010-03-01 until:2010-10-27', 'from:streetsblogchi since:2010-10-27 until:2011-06-25', 'from:streetsblogchi since:2011-06-25 until:2012-02-20', 'from:streetsblogchi since:2012-02-20 until:2012-10-17', 'from:streetsblogchi since:2012-10-17 until:2013-06-14', 'from:streetsblogchi since:2013-06-14 until:2014-02-10', 'from:streetsblogchi since:2014-02-10 until:2014-10-08', 'from:streetsblogchi since:2014-10-08 until:2015-06-05', 'from:streetsblogchi since:2015-06-05 until:2016-01-31', 'from:streetsblogchi since:2016-01-31 until:2016-09-28', 'from:streetsblogchi since:2016-09-28 until:2017-05-26', 'from:streetsbl

746


INFO: queries: ['from:elonmusk since:2006-03-21 until:2006-11-16', 'from:elonmusk since:2006-11-16 until:2007-07-14', 'from:elonmusk since:2007-07-14 until:2008-03-10', 'from:elonmusk since:2008-03-10 until:2008-11-06', 'from:elonmusk since:2008-11-06 until:2009-07-04', 'from:elonmusk since:2009-07-04 until:2010-03-01', 'from:elonmusk since:2010-03-01 until:2010-10-27', 'from:elonmusk since:2010-10-27 until:2011-06-25', 'from:elonmusk since:2011-06-25 until:2012-02-20', 'from:elonmusk since:2012-02-20 until:2012-10-17', 'from:elonmusk since:2012-10-17 until:2013-06-14', 'from:elonmusk since:2013-06-14 until:2014-02-10', 'from:elonmusk since:2014-02-10 until:2014-10-08', 'from:elonmusk since:2014-10-08 until:2015-06-05', 'from:elonmusk since:2015-06-05 until:2016-01-31', 'from:elonmusk since:2016-01-31 until:2016-09-28', 'from:elonmusk since:2016-09-28 until:2017-05-26', 'from:elonmusk since:2017-05-26 until:2018-01-21', 'from:elonmusk since:2018-01-21 until:2018-09-18', 'from:elonmusk 

6592


In [27]:
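# Encode each tweet as a sequence of integers, one per character
import numpy as np

# collect every distinct character that appears in either account's tweets
tweet_chr_set = set()
for user_tweets in [tweets_1, tweets_2]:
    for tweet in user_tweets:
        tweet_chr_set.update(tweet.text)

# map each character to an integer index
chr_to_int = {c: i for i, c in enumerate(tweet_chr_set)}

tweets = []
for user_tweets in [tweets_1, tweets_2]:
    for tweet in user_tweets:
        tweets.append([chr_to_int[c] for c in tweet.text])

# tweets differ in length, so store the ragged lists in an object array
X = np.array(tweets, dtype=object)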

In [28]:
# 1: streetsblogchi, 0: elonmusk
y = np.array([1] * len(tweets_1) + [0] * len(tweets_2))
y = y.reshape(len(y), 1)

In [29]:
print(X[0:3], '-' * 10, y[0:3], sep='\n')

array([list([122, 76, 7, 50, 10, 41, 138, 10, 76, 10, 110, 56, 30, 149, 76, 130, 10, 74, 30, 10, 50, 130, 128, 30, 41, 132, 76, 149, 50, 10, 83, 99, 128, 120, 128, 56, 99, 130, 149, 13, 10, 95, 99, 130, 10, 74, 145, 99, 110, 10, 145, 76, 74, 129, 10, 145, 74, 74, 138, 37, 84, 84, 139, 129, 114, 138, 84, 139, 128, 149, 70, 50, 62, 10, 38, 2, 138, 50, 130, 10, 74, 30, 10, 72, 3, 10, 110, 74, 76, 74, 50, 110, 10, 27, 10, 81, 69, 134]),
       list([51, 112, 10, 112, 50, 101, 50, 132, 76, 56, 10, 69, 122, 115, 77, 10, 114, 30, 130, 50, 120, 10, 101, 30, 50, 110, 130, 80, 74, 10, 138, 76, 120, 10, 74, 30, 10, 114, 76, 99, 130, 74, 76, 99, 130, 10, 83, 99, 7, 50, 10, 56, 76, 130, 50, 110, 13, 10, 95, 145, 30, 10, 101, 30, 50, 110, 75, 10, 145, 74, 74, 138, 37, 84, 84, 139, 129, 114, 138, 84, 114, 112, 26, 74, 85, 43, 10, 63, 83, 99, 128, 120, 128, 56, 99, 130, 149]),
       list([115, 74, 10, 88, 41, 114, 76, 80, 110, 10, 112, 30, 132, 10, 74, 145, 50, 10, 112, 99, 132, 110, 74, 10, 74, 99, 

In [37]:
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

In [38]:
maxlen = 100
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test  = sequence.pad_sequences(X_test,  maxlen=maxlen)
print('x_train shape:', X_train.shape)
print('x_test shape: ', X_test.shape)

x_train shape: (4916, 100)
x_test shape:  (2422, 100)


In [39]:
batch_size = 32
# Embedding input size: it only needs to exceed the largest character index,
# so len(chr_to_int) would also work here.
max_features = 20000

model = Sequential()
model.add(Embedding(max_features, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(X_train, y_train,
          batch_size=batch_size,
          epochs=15,
          validation_data=(X_test, y_test))
score, acc = model.evaluate(X_test, y_test,
                            batch_size=batch_size)

Train on 4916 samples, validate on 2422 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


In [41]:
print(score, acc)

0.15867118511585832 0.9508670522692182
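Because the two accounts contributed different numbers of tweets, it can help to read the accuracy against a majority-class baseline. The following is a minimal sketch using the tweet counts printed above (746 and 6592); `majority_baseline` is a hypothetical helper name, and the baseline on the test split alone may differ slightly from this whole-dataset figure.

```python
def majority_baseline(class_counts):
    """Accuracy obtained by always predicting the largest class."""
    total = sum(class_counts)
    return max(class_counts) / total


# tweet counts printed by the scraping cell: streetsblogchi, elonmusk
baseline = majority_baseline([746, 6592])
print(f"majority-class baseline: {baseline:.4f}")
```

With these counts the baseline comes out near 0.90, so the reported accuracy of about 0.95 is an improvement over always guessing the larger account, though a smaller one than the raw number suggests.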


## Part 2 - CNNs
Time to play "find the frog!" Use Keras and ResNet50 to detect which of the following images contain frogs. You may need to adjust the number of images queried to ensure at least one picture contains a frog. Your goal is to validly run ResNet50 on the input images; don't worry about tuning or improving the model.

*Hint:* ResNet50 doesn't just return "frog". The three labels it has for frogs are bullfrog, tree frog, and tailed frog.

Stretch goal - also check for fish.
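The hint can be turned into a small check on decoded predictions. The sketch below assumes the `(class_id, label, probability)` tuples that `decode_predictions` returns and assumes the frog labels are spelled with underscores (`tree_frog`, `tailed_frog`). `FROG_LABELS` and `find_frogs` are hypothetical names introduced for illustration, and the sample tuples are copied from the prediction output further down.

```python
# Frog class names, assuming underscore spelling in the decoded labels
FROG_LABELS = {"bullfrog", "tree_frog", "tailed_frog"}


def find_frogs(decoded, min_prob=0.0):
    """Return (label, probability) pairs for frog classes in one image's predictions."""
    return [(label, prob) for _, label, prob in decoded
            if label in FROG_LABELS and prob >= min_prob]


# sample predictions copied from the notebook output
frog_image = [("n01641577", "bullfrog", 0.9223341),
              ("n01644900", "tailed_frog", 0.07364755)]
fish_image = [("n01443537", "goldfish", 0.8495913),
              ("n01631663", "eft", 0.067602284)]

print(find_frogs(frog_image))
print(find_frogs(fish_image))
print(find_frogs(frog_image, min_prob=0.5))
```

The same pattern could cover the stretch goal by swapping in a set of fish labels such as `goldfish`.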

In [40]:
# !pip install google_images_download

In [48]:
from google_images_download import google_images_download

response = google_images_download.googleimagesdownload()
arguments = {'keywords': "animal pond",
             "limit": 8, 
             "print_urls": False,
             "output_directory":'img/ignore'}
absolute_image_paths = response.download(arguments)


Item no.: 1 --> Item name = animal pond
Evaluating...
Starting Download...
Completed Image ====> 1.Pondanimals.GIF
Completed Image ====> 2.hqdefault.jpg
Completed Image ====> 3.water-animal-pond-wildlife-mammal-fish-eat-fauna-whiskers-vertebrate-otter-mink-marmot-sea-otter-mustelidae-1383482.jpg
Completed Image ====> 4.PKLS4116_inline.png
Completed Image ====> 5.alligator_animal_on_pond.jpg
Completed Image ====> 6.frog-2243543_960_720.jpg
Completed Image ====> 7.maxresdefault.jpg
Completed Image ====> 8.birds-in-a-pond-5986310798966784.jpg

Errors: 0



In [49]:
from pathlib import Path

from keras.applications.resnet50 import ResNet50
from keras.preprocessing import image
from keras.applications.resnet50 import preprocess_input, decode_predictions

resnet = ResNet50(weights='imagenet')

# sort the paths so the output order is deterministic
for path in sorted(Path('img/ignore/animal pond').iterdir()):
    # resize the image to the 224x224 input shape the pretrained model expects
    img = image.load_img(path, target_size=(224, 224))
    # convert to a batch of one and apply ResNet50's preprocessing
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    # use the pretrained model to classify the image
    preds = resnet.predict(x)
    # use decode_predictions to inspect the top two predicted labels
    print('Predicted:', decode_predictions(preds, top=2)[0])

Predicted: [('n03598930', 'jigsaw_puzzle', 0.86803204), ('n06359193', 'web_site', 0.06409999)]
Predicted: [('n01443537', 'goldfish', 0.8495913), ('n01631663', 'eft', 0.067602284)]
Predicted: [('n02442845', 'mink', 0.30976605), ('n02363005', 'beaver', 0.23398966)]
Predicted: [('n04243546', 'slot', 0.8712447), ('n04476259', 'tray', 0.04993611)]
Predicted: [('n01698640', 'American_alligator', 0.963947), ('n01697457', 'African_crocodile', 0.026759991)]
Predicted: [('n01641577', 'bullfrog', 0.9223341), ('n01644900', 'tailed_frog', 0.07364755)]
Predicted: [('n02013706', 'limpkin', 0.3572372), ('n01806567', 'quail', 0.1810789)]
Predicted: [('n02009912', 'American_egret', 0.7822417), ('n02012849', 'crane', 0.1433928)]


Correctly identified frog in image 6!

## Part 3 - AutoML

Use [TPOT](https://epistasislab.github.io/tpot/) to fit a predictive model for the King County housing data, with `price` as the target output variable.

As with previous questions, your goal is to run TPOT successfully and report the error at the end. Also, in the interest of time, feel free to choose small parameters such as `generations=1` and `population_size=10` so your pipeline runs efficiently. You will want to be able to iterate and test.

*Hint:* You will have to drop and/or type coerce at least a few variables to get things working. It's fine to err on the side of dropping to get things running - as long as you still get a valid model with reasonable predictive power. 
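As one illustration of type coercion, the `date` column in this dataset holds strings such as `20141013T000000`, which could be converted into numeric features rather than dropped. The sketch below uses only the standard library; `parse_sale_date` and the feature names `sale_year` and `sale_month` are hypothetical choices, not part of the dataset or of TPOT.

```python
from datetime import datetime


def parse_sale_date(raw):
    """Turn a string such as '20141013T000000' into numeric year/month features."""
    parsed = datetime.strptime(raw, "%Y%m%dT%H%M%S")
    return {"sale_year": parsed.year, "sale_month": parsed.month}


print(parse_sale_date("20141013T000000"))
print(parse_sale_date("20150225T000000"))
```

In a pandas workflow the same function could be applied to each value of the column, with the resulting numbers added as new columns before the original string column is dropped.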

In [51]:
# !pip install tpot

In [114]:
import pandas as pd

url = ("https://raw.githubusercontent.com/ryanleeallred/"
       "datasets/master/kc_house_data.csv")
df = pd.read_csv(url)

df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,3,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,5,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,3,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


In [115]:
pd.options.display.max_columns = None

In [116]:
df.describe()

Unnamed: 0,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
count,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0
mean,4580302000.0,540088.1,3.370842,2.114757,2079.899736,15106.97,1.494309,0.007542,0.234303,3.40943,7.656873,1788.390691,291.509045,1971.005136,84.402258,98077.939805,47.560053,-122.213896,1986.552492,12768.455652
std,2876566000.0,367127.2,0.930062,0.770163,918.440897,41420.51,0.539989,0.086517,0.766318,0.650743,1.175459,828.090978,442.575043,29.373411,401.67924,53.505026,0.138564,0.140828,685.391304,27304.179631
min,1000102.0,75000.0,0.0,0.0,290.0,520.0,1.0,0.0,0.0,1.0,1.0,290.0,0.0,1900.0,0.0,98001.0,47.1559,-122.519,399.0,651.0
25%,2123049000.0,321950.0,3.0,1.75,1427.0,5040.0,1.0,0.0,0.0,3.0,7.0,1190.0,0.0,1951.0,0.0,98033.0,47.471,-122.328,1490.0,5100.0
50%,3904930000.0,450000.0,3.0,2.25,1910.0,7618.0,1.5,0.0,0.0,3.0,7.0,1560.0,0.0,1975.0,0.0,98065.0,47.5718,-122.23,1840.0,7620.0
75%,7308900000.0,645000.0,4.0,2.5,2550.0,10688.0,2.0,0.0,0.0,4.0,8.0,2210.0,560.0,1997.0,0.0,98118.0,47.678,-122.125,2360.0,10083.0
max,9900000000.0,7700000.0,33.0,8.0,13540.0,1651359.0,3.5,1.0,4.0,5.0,13.0,9410.0,4820.0,2015.0,2015.0,98199.0,47.7776,-121.315,6210.0,871200.0


In [117]:
df.dtypes

id                 int64
date              object
price            float64
bedrooms           int64
bathrooms        float64
sqft_living        int64
sqft_lot           int64
floors           float64
waterfront         int64
view               int64
condition          int64
grade              int64
sqft_above         int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
zipcode            int64
lat              float64
long             float64
sqft_living15      int64
sqft_lot15         int64
dtype: object

In [118]:
(df == 0).sum()

id                   0
date                 0
price                0
bedrooms            13
bathrooms           10
sqft_living          0
sqft_lot             0
floors               0
waterfront       21450
view             19489
condition            0
grade                0
sqft_above           0
sqft_basement    13126
yr_built             0
yr_renovated     20699
zipcode              0
lat                  0
long                 0
sqft_living15        0
sqft_lot15           0
dtype: int64

In [119]:
df = df.drop(columns=['id', 'date', 'zipcode', 'yr_renovated', 'lat', 'long'])

In [121]:
from tpot import TPOTRegressor
from sklearn.model_selection import train_test_split

target = 'price'
X = df.drop(columns=target).values
y = df[target].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

pipeline_optimizer = TPOTRegressor(
    generations=1, population_size=20, cv=3, n_jobs=-1,
    verbosity=1
)

In [122]:
pipeline_optimizer.fit(X_train, y_train)

Best pipeline: ExtraTreesRegressor(input_matrix, bootstrap=False, max_features=0.55, min_samples_leaf=1, min_samples_split=6, n_estimators=100)


TPOTRegressor(config_dict=None, crossover_rate=0.1, cv=3,
       disable_update_check=False, early_stop=None, generations=1,
       max_eval_time_mins=5, max_time_mins=None, memory=None,
       mutation_rate=0.9, n_jobs=-1, offspring_size=None,
       periodic_checkpoint_folder=None, population_size=20,
       random_state=None, scoring=None, subsample=1.0,
       template='RandomTree', use_dask=False, verbosity=1,
       warm_start=False)

In [128]:
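# TPOTRegressor.score returns the negative MSE by default
# (scoring='neg_mean_squared_error'), so flip the sign before taking the root.
neg_mse = pipeline_optimizer.score(X_test, y_test)
print('mse: ', neg_mse)
print('rmse:', np.sqrt(-neg_mse))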

mse: -40550329466.80189
rmse: 201371.1237163906
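The negative sign on the reported score comes from TPOT's default regression scoring, negative mean squared error, where values closer to zero are better. A minimal sketch of the conversion to RMSE, using the score printed above; `rmse_from_neg_mse` is a hypothetical helper name.

```python
import math


def rmse_from_neg_mse(neg_mse):
    """Convert a negative-MSE score (higher is better) into an RMSE in target units."""
    if neg_mse > 0:
        raise ValueError("expected a negative-MSE score, which is never positive")
    return math.sqrt(-neg_mse)


rmse = rmse_from_neg_mse(-40550329466.80189)
print(round(rmse, 2))
```

Because RMSE is expressed in the same units as `price`, it is easier to read than the raw score: the typical prediction error here is on the order of 200,000 dollars.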


## Part 4 - More... 

Answer the following questions, with a target audience of a fellow Data Scientist. A few sentences per answer is fine. Only elaborate if time allows. Use markdown to format your answers.

**What do you consider your strongest area as a Data Scientist?**
  * Since ETL demands the greatest human effort, I imagine that will always be my greatest strength. However, I also have a great eye for design and presentation. I scaffold my projects well to efficiently apply effort to a given task.

**What area of Data Science would you most like to learn more about and why?**
  * Computer vision and geospatial algorithms. I love solutions that are tangible, and these applications seem to exist in the real world. They augment aspects of humanity (vision and physical engagement with our world) in ways that I find satisfying and mystifying at the same time.

**Where do you think Data Science will be in 5 years?**
  * I think ETL tasks will still require the most work. Automated modeling will be easier. Pre-trained models will be more powerful and generalizable, and new frameworks will arise. Python will explode in usership. The barrier to entry will continue to decline.