<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

# Major Neural Network Architectures Challenge
## *Data Science Unit 4 Sprint 3 Challenge*

In this sprint challenge, you'll explore some of the cutting edge of Data Science. This week we studied several famous neural network architectures: 
recurrent neural networks (RNNs), long short-term memory networks (LSTMs), convolutional neural networks (CNNs), and autoencoders. In this sprint challenge, you will revisit these models. Remember, we are testing your knowledge of these architectures, not your ability to fit a model with high accuracy. 

__*Caution:*__  These approaches can be computationally heavy. All problems were designed so that you should be able to achieve results within at most 5-10 minutes of runtime on SageMaker, Colab, or a comparable environment. If something is running longer, double-check your approach!

## Challenge Objectives
*You should be able to:*
* <a href="#p1">Part 1</a>: Train an LSTM classification model
* <a href="#p2">Part 2</a>: Utilize a pre-trained CNN for object detection
* <a href="#p3">Part 3</a>: Describe the components of an autoencoder
* <a href="#p4">Part 4</a>: Describe yourself as a Data Scientist and elucidate your vision of AI

<a id="p1"></a>
## Part 1 - RNNs

Use an RNN/LSTM to fit a multi-class classification model on Reuters news articles to distinguish the topics of the articles. The data is already encoded properly for use in an RNN model. 

Your Tasks: 
- Use Keras to fit a predictive model, classifying news articles into topics. 
- Report your overall score and accuracy

For reference, the [Keras IMDB sentiment classification example](https://github.com/keras-team/keras/blob/master/examples/imdb_lstm.py) will be useful, as well as the RNN code we used in class.

__*Note:*__  Focus on getting a running model, not on maxing accuracy with extreme data size or epoch numbers. Only revisit and push accuracy if you get everything else done!

In [0]:
# Import libraries.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats

from google_images_download import google_images_download
from sklearn.metrics import accuracy_score

from tensorflow.keras.datasets import reuters
from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.applications.resnet50 import decode_predictions
from tensorflow.keras.applications.resnet50 import preprocess_input
from tensorflow.keras.preprocessing import image, sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM

In [0]:
pip install google-images-download



In [0]:
from tensorflow.keras.datasets import reuters

(X_train, y_train), (X_test, y_test) = reuters.load_data(num_words=None,
                                                         skip_top=0,
                                                         maxlen=None,
                                                         test_split=0.2,
                                                         seed=723812,
                                                         start_char=1,
                                                         oov_char=2,
                                                         index_from=3)

In [0]:
# Demo of encoding

word_index = reuters.get_word_index(path="reuters_word_index.json")

print(f"Iran is encoded as {word_index['iran']} in the data")
print(f"London is encoded as {word_index['london']} in the data")
print("Words are encoded as numbers in our dataset.")

Iran is encoded as 779 in the data
London is encoded as 544 in the data
Words are encoded as numbers in our dataset.


In [0]:
topic_indexes = pd.DataFrame.from_dict(data={'copper': 6, 
                                             'livestock': 28, 
                                             'gold': 25, 
                                             'money-fx': 19, 
                                             'ipi': 30, 
                                             'trade': 11, 
                                             'cocoa': 0, 
                                             'iron-steel': 31, 
                                             'reserves': 12, 
                                             'tin': 26, 
                                             'zinc': 37, 
                                             'jobs': 34, 
                                             'ship': 13, 
                                             'cotton': 14, 
                                             'alum': 23, 
                                             'strategic-metal': 27, 
                                             'lead': 45, 
                                             'housing': 7, 
                                             'meal-feed': 22, 
                                             'gnp': 21, 
                                             'sugar': 10, 
                                             'rubber': 32, 
                                             'dlr': 40,
                                             'veg-oil': 2,  
                                             'interest': 20, 
                                             'crude': 16, 
                                             'coffee': 9, 
                                             'wheat': 5, 
                                             'carcass': 15, 
                                             'lei': 35, 
                                             'gas': 41, 
                                             'nat-gas': 17, 
                                             'oilseed': 24, 
                                             'orange': 38, 
                                             'heat': 33, 
                                             'wpi': 43, 
                                             'silver': 42, 
                                             'cpi': 18, 
                                             'earn': 3, 
                                             'bop': 36, 
                                             'money-supply': 8, 
                                             'hog': 44, 
                                             'acq': 4, 
                                             'pet-chem': 39, 
                                             'grain': 1, 
                                             'retail': 29}, 
                                       orient='index',
                                       columns=['topic_id'])
                            
topic_indexes = topic_indexes.rename_axis('topic_name').reset_index()
topic_indexes = topic_indexes.set_index('topic_id').sort_index()
topic_indexes

topic_id,topic_name
0,cocoa
1,grain
2,veg-oil
3,earn
4,acq
5,wheat
6,copper
7,housing
8,money-supply
9,coffee


In [0]:
# Identify articles from training dataset with topic "orange".
np.where(y_train == 38)

(array([ 321, 1395, 1660, 2107, 2334, 2774, 2903, 3611, 3629, 3937, 4102,
        4506, 5213, 5894, 6591, 6718, 7757, 8059, 8975]),)

In [0]:
# Create reverse lookup dictionary for article reconstruction.
wordDict = {y:x for x, y in reuters.get_word_index().items()} 

# Reconstruct sample "orange" article.
print(' '.join([(wordDict.get(index-3)) for index in 
                filter(lambda i: (i >= 3), 
                       X_train[321])]))

there is no confirmation that brazil's major processors of frozen concentrated orange juice fcoj will raise export prices of the product to 1 375 dlrs per tonne from april 1 a spokesman for the brazilian association of citrus juice industries abrassuco said asked to comment on a report from new york that cutrale and citrosuco had sent telexes to customers informing of the price raise jose carlos goncalves said abrassuco was not aware of it all we know is that cacex has increased the dollar amount to translate fob price to ex dock new york price to 1 050 dlrs from 770 dlrs goncalves said citrosuco and cutrale officials were not available for comment reuter 3
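
The reverse lookup above subtracts 3 from every ID because the data was loaded with `index_from=3`, which reserves the lowest IDs for markers (0 for padding, `start_char=1`, `oov_char=2`). A minimal sketch of that decoding step follows. The three-word `toy_word_index` and the `decode` helper are made up for illustration and are not part of Keras.

```python
# Hypothetical toy vocabulary in the same shape as reuters.get_word_index():
# word -> frequency rank, starting at 1.
toy_word_index = {"the": 1, "orange": 2, "juice": 3}

INDEX_FROM = 3  # same offset as the index_from argument used when loading the data


def decode(encoded, word_index, index_from=INDEX_FROM):
    """Turn a list of encoded IDs back into a space-separated string."""
    reverse_index = {rank: word for word, rank in word_index.items()}
    # IDs below index_from are reserved markers (padding, start, out-of-vocabulary).
    ranks = [token - index_from for token in encoded if token >= index_from]
    # Ranks missing from the vocabulary are shown as "?".
    return " ".join(reverse_index.get(rank, "?") for rank in ranks)


# 1 is the start marker; 4, 5, 6 are ranks 1, 2, 3 shifted by 3.
print(decode([1, 4, 5, 6], toy_word_index))
```
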


In [0]:
# Look at topic distribution in training dataset.
pd.merge(pd.DataFrame(y_train, columns=['topic_id']), 
         topic_indexes, 
         on='topic_id')['topic_name'].value_counts(normalize=True)

earn               0.350590
acq                0.214986
money-fx           0.061456
grain              0.049878
crude              0.047651
trade              0.041527
interest           0.030728
ship               0.018815
money-supply       0.016477
sugar              0.014139
gnp                0.012469
coffee             0.011356
gold               0.011022
veg-oil            0.008461
oilseed            0.007793
cpi                0.007348
cocoa              0.006123
bop                0.005455
ipi                0.005344
copper             0.005233
reserves           0.005233
jobs               0.005010
alum               0.004676
livestock          0.004676
iron-steel         0.004565
nat-gas            0.004231
dlr                0.003785
rubber             0.003451
gas                0.003340
tin                0.002672
wpi                0.002561
pet-chem           0.002561
cotton             0.002561
carcass            0.002449
retail             0.002227
meal-feed          0

In [0]:
# Check that all classes are represented in both datasets.
set(y_train) == set(y_test)

True

In [0]:
# Check total number of words.
len(word_index.values())

30979

In [0]:
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM
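batch_size = 46
# Word IDs in the data are shifted by index_from=3, so the largest possible
# ID is len(word_index) + 3; the embedding needs one more row than that.
max_features = len(word_index) + 4
maxlen = 200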

print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')

print('Pad sequences (samples x time)')
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(len(set(y_train)), activation='softmax'))


8982 train sequences
2246 test sequences
Pad sequences (samples x time)
X_train shape: (8982, 200)
X_test shape: (2246, 200)
Build model...


In [0]:
# You should only run this cell once your model has been
# properly configured.
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
model.fit(X_train, y_train,
          batch_size=batch_size,
          epochs=1,
          validation_data=(X_test, y_test))

score, acc = model.evaluate(X_train, y_train,
                            batch_size=batch_size,
                            verbose=False)
print('Train score:', score)
print('Train accuracy:', acc)

score, acc = model.evaluate(X_test, y_test,
                            batch_size=batch_size,
                            verbose=False)
print('Test score:', score)
print('Test accuracy:', acc)

Train...
Train on 8982 samples, validate on 2246 samples
Train score: 1.9176377393014976
Train accuracy: 0.49220663
Test score: 1.8947205191091672
Test accuracy: 0.49821904


In [0]:
# Define majority class from training dataset.
majority_class = scipy.stats.mode(y_train)[0][0]
majority_class

3

In [0]:
# Calculate accuracy of majority classifier for comparison.
acc = accuracy_score(y_train, np.full_like(y_train, majority_class))
print('Majority classifier train accuracy:', acc)

acc = accuracy_score(y_test, np.full_like(y_test, majority_class))
print('Majority classifier test accuracy:', acc)

Majority classifier train accuracy: 0.3505900690269428
Majority classifier test accuracy: 0.3664292074799644


## Sequence Data Question
#### *Describe the `pad_sequences` method used on the training dataset. What does it do? Why do you need it?*

The `pad_sequences` method (API documentation) used on the training and test datasets truncates any input longer than the specified maximum length (`maxlen=200` here) and prepends zeros to any input shorter than that length. By default the truncation also happens at the front, so the last 200 tokens of a long article are kept. This is necessary because the model is trained on batches in which every input has the same length, while the raw articles vary in length.
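
As a rough illustration of that default behavior, the pure-Python sketch below pads and truncates at the front of each sequence. The `pad_pre` function is a hypothetical stand-in written for this example, not the Keras implementation, and it assumes the Keras defaults of `padding='pre'`, `truncating='pre'`, and a padding value of 0.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value` and keep only the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)
        # Truncate at the front: drop the earliest items if the sequence is too long.
        seq = seq[max(len(seq) - maxlen, 0):]
        # Pad at the front: prepend `value` until the sequence reaches maxlen.
        padding = [value] * (maxlen - len(seq))
        padded.append(padding + seq)
    return padded


example = [[5, 8], [1, 2, 3, 4, 5, 6], [7, 7, 7, 7]]
print(pad_pre(example, maxlen=4))
# [[0, 0, 5, 8], [3, 4, 5, 6], [7, 7, 7, 7]]
```
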

## RNNs versus LSTMs
#### *What are the primary motivations behind using Long Short-Term Memory (LSTM) cells over traditional Recurrent Neural Networks?*

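Traditional RNNs are prone to vanishing and exploding gradients, which can cause slow convergence or failure to converge. Because gradients shrink as they are propagated back through many time steps, these networks end up weighting the most recent inputs heavily and model long-term dependencies poorly. LSTMs address this with a gated cell state that is updated additively, which lets information and gradients persist across many steps. Exploding gradients are less directly addressed by the LSTM design and are commonly handled with gradient clipping.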
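
The vanishing-gradient argument can be made concrete with a toy scalar example. The sketch below is a simplified illustration, not a real training setup: `rnn_gradient` and `cell_state_gradient` are names invented here, the weights and inputs are arbitrary, and the LSTM gates are treated as constants so that only the direct cell-state path is considered.

```python
import math


def rnn_gradient(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule through one time step: d h_t / d h_{t-1} = w * (1 - h_t^2).
        grad *= w * (1.0 - h * h)
    return grad


def cell_state_gradient(forget_gates):
    """Return d c_T / d c_0 along the cell-state path c_t = f_t * c_{t-1} + i_t * g_t.

    The gate values are treated as constants, so each step contributes a factor f_t.
    """
    grad = 1.0
    for f in forget_gates:
        grad *= f
    return grad


steps = 50
print("Simple RNN, w = 0.9:      ", rnn_gradient(0.9, [0.5] * steps))
print("LSTM cell path, f = 0.99: ", cell_state_gradient([0.99] * steps))
```

After 50 steps the simple RNN's gradient is vanishingly small, while the cell-state path with forget gates near 1 still carries a usable signal.
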


## RNN / LSTM Use Cases
#### *Name and Describe 3 Use Cases of LSTMs or RNNs and why they are suited to that use case*

LSTMs - and RNNs generally - are useful for prediction and generation tasks where inputs occur in a meaningful and/or relevant sequential context, because of their ability to retain and use information about that context, i.e. past inputs. Examples include machine translation, text generation, music composition, handwriting and speech recognition, robot control, and many kinds of time series prediction.

<a id="p2"></a>
## Part 2 - CNNs

### Find the Frog

Time to play "find the frog!" Use Keras and ResNet50 (pre-trained) to detect which of the following images contain frogs:

<img align="left" src="https://d3i6fh83elv35t.cloudfront.net/newshour/app/uploads/2017/03/GettyImages-654745934-1024x687.jpg" width=400>


In [87]:
# Download images.
from google_images_download import google_images_download

response = google_images_download.googleimagesdownload()
search_queries = ['lily frog pond']


def downloadimages(query):
    """Download a few medium-sized JPEG images matching the search query."""
    arguments = {
        "keywords": query,     # the search query
        "format": "jpg",       # the image file format
        "limit": 4,            # the number of images to download
        "print_urls": True,    # print the URL of each image file
        "size": "medium",      # image size, e.g. "large", "medium", "icon"
    }
    try:
        response.download(arguments)
    except FileNotFoundError:
        # Retry once with the same arguments and give up quietly
        # if the second attempt fails as well.
        try:
            response.download(arguments)
        except Exception:
            pass


# Driver code
for query in search_queries:
    downloadimages(query)
    print()

#arguments = {"keywords": "lily frog pond", "limit": 5, "print_urls": True}
#absolute_image_paths = response.download(arguments)



Item no.: 1 --> Item name = lily frog pond
Evaluating...
Starting Download...


Unfortunately all 4 could not be downloaded because some images were not downloadable. 0 is all we got for this search filter!

Errors: 0




In [0]:
!git clone https://github.com/hardikvasa/google-images-download.git

Cloning into 'google-images-download'...
remote: Enumerating objects: 604, done.
remote: Total 604 (delta 0), reused 0 (delta 0), pack-reused 604
Receiving objects: 100% (604/604), 255.16 KiB | 3.45 MiB/s, done.
Resolving deltas: 100% (349/349), done.


In [0]:
!cd google-images-download && sudo python setup.py install

running install
running bdist_egg
running egg_info
creating google_images_download.egg-info
writing google_images_download.egg-info/PKG-INFO
writing dependency_links to google_images_download.egg-info/dependency_links.txt
writing entry points to google_images_download.egg-info/entry_points.txt
writing requirements to google_images_download.egg-info/requires.txt
writing top-level names to google_images_download.egg-info/top_level.txt
writing manifest file 'google_images_download.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'google_images_download.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib
creating build/lib/google_images_download
copying google_images_download/__init__.py -> build/lib/google_images_download
copying google_images_download/__main__.py -> build/lib/google_images_download
copying google_images_download/google_images_download.py -> 

At the time of writing, at least a few of the downloaded images contain frogs, but since the Internet changes, it is possible your 5 won't. You can easily verify this yourself, and (once you have working code) increase the number of images you pull to be more sure of getting a frog. Your goal is to validly run ResNet50 on the input images, so don't worry about tuning or improving the model.

*Hint* - ResNet50 doesn't just return "frog". The three labels it has for frogs are: `bullfrog`, `tree frog`, and `tailed frog`.

*Stretch goals* 
- Check for fish or other labels
- Create matplotlib visualizations of the images, using your prediction as the visualization label

In [0]:
from google.colab import files
absolute_image_paths = files.upload()

Saving Unknown-2.jpeg to Unknown-2 (1).jpeg
Saving Unknown-3.jpeg to Unknown-3 (1).jpeg
Saving Unknown-4.jpeg to Unknown-4 (1).jpeg
Saving Unknown-5.jpeg to Unknown-5 (1).jpeg
Saving Unknown.jpeg to Unknown (2).jpeg


In [0]:
absolute_image_paths

{'Unknown-2.jpeg': b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xdb\x00\x84\x00\t\x06\x07\x13\x13\x12\x15\x13\x12\x13\x16\x15\x15\x17\x17\x16\x15\x15\x17\x18\x17\x1d\x17\x18\x18\x15\x15\x15\x16\x16\x17\x18\x17\x18\x18\x1d( \x18\x1a%\x1d\x15\x15!1!%)+...\x17\x1f383-7(-.+\x01\n\n\n\x0e\r\x0e\x1b\x10\x10\x1b.&\x1f%-------------------+-./---------------------------\xff\xc0\x00\x11\x08\x00\xad\x01#\x03\x01"\x00\x02\x11\x01\x03\x11\x01\xff\xc4\x00\x1c\x00\x00\x00\x07\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x02\x03\x04\x05\x06\x07\x00\x08\xff\xc4\x00G\x10\x00\x02\x01\x02\x03\x04\x07\x04\x07\x06\x05\x02\x06\x03\x00\x00\x01\x02\x03\x00\x11\x04\x12!\x05\x061A\x13Qaq\x81\x91\xa1"2B\xb1\x07\x14Rb\x92\xc1\xd1#CSr\x82\xe1\x153\xa2\xb2\xf0\xc2\xf1\x16DTc\x83\x84\x08%4\xff\xc4\x00\x1a\x01\x00\x03\x01\x01\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x02\x03\x04\x05\x06\xff\xc4\x004\x11\x00\x02\x01\x02\x03\x04\x08\x06\x02\x02\x03\x00\x00\x00\x00\x

In [0]:
import os
import glob

#source_path = "/Users/ericchiyembekeza/Desktop/Lambda School/frog_pics/*.jpeg"
os.path.isfile("/Users/ericchiyembekeza/Desktop/Lambda School/frog_pics/Unknown.jpeg")

#absolute_image_paths = [source_path + '/' + f for f in glob.glob('*.jpeg')]

#for filepath in absolute_image_paths:
 #   callimage = Image.open(filepath).load()

False

In [0]:
# You've got something to do in this cell. ;)

import numpy as np

from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.resnet50 import preprocess_input, decode_predictions

# Load the pre-trained model once and reuse it for every prediction.
resnet_model = ResNet50(weights='imagenet')

def process_img_path(img_path):
    """
    Pre-process image for prediction with ResNet50.
    """
    img = image.load_img(img_path, target_size=(224, 224))
    img = image.img_to_array(img)
    img = np.expand_dims(img, axis=0)
    return preprocess_input(img)

def predict_top_n(img, n):
    """
    Returns ResNet50's top n predictions for the given preprocessed image.
    """
    features = resnet_model.predict(img)
    return decode_predictions(features, top=n)[0]

def img_contains_prob(img, name, top=10):
    """ Scans image for named object.
    
    Returns the estimated probability that the named object appears
    in the given image, summed over the top predictions whose label
    contains `name`.
    
    Inputs:
    ---------
    img:  Preprocessed image ready for prediction. The `process_img_path`
    function should already be applied to the image. 
    name: Text to look for in the predicted labels (e.g. 'frog').
    top:  Number of top predictions to consider.
    
    Returns: 
    ---------
    (float): [0, 1] - The probability that the named object appears.
    
    """
    probability = 0
    
    # Each prediction is a (class_id, label, probability) tuple.
    for pred in predict_top_n(img, top):
        if name in pred[1]:
            probability += pred[2]
            
    return probability

def img_contains(img, name, top=10):
    """ Scans image for named object.
    
    Returns True if the named object is in the image with a
    predicted probability >= 0.5, and False otherwise.
    
    Inputs:
    ---------
    img:  Preprocessed image ready for prediction. The `process_img_path`
    function should already be applied to the image. 
    
    Returns: 
    ---------
    (boolean):  True or False - The named object appears in the image.    
    """
    return img_contains_prob(img, name, top=top) >= 0.5

In [89]:
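# Check uploaded images for frogs.
# `files.upload()` returns a dict keyed by filename, so iterate over the keys
# (this assumes the uploads were saved to the working directory under these names).
for path in absolute_image_paths:
    img = process_img_path(path)
    print(img_contains(img, 'frog'))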


In [0]:
# Pick one of the uploaded files by name; `process_img_path` expects a file path.
frog_path = 'Unknown.jpeg'
test_frog = process_img_path(frog_path)

#### Stretch Goal: Displaying Predictions

In [0]:
import matplotlib.pyplot as plt



<a id="p3"></a>
## Part 3 - Autoencoders

Describe a use case for an autoencoder given that an autoencoder tries to predict its own input. 

__*Your Answer:*__ Autoencoders can be used for anomaly detection, taking a high reconstruction error relative to the model's typical performance as an indicator of anomalous input. They can also be used for dimensionality reduction (useful for visualizing high-dimensional data), where they work a bit like a non-linear form of principal component analysis (PCA). Image denoising is a third use case for autoencoders.
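
To make the anomaly-detection idea concrete, here is a small NumPy sketch. It swaps the neural autoencoder for a linear projection onto the top principal component (computed with SVD), which plays the role of a one-unit bottleneck. The synthetic data, the 99th-percentile threshold, and all names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "normal" data: 2-D points lying close to the line y = 2x.
t = rng.normal(size=(500, 1))
normal = np.hstack([t, 2 * t]) + rng.normal(scale=0.05, size=(500, 2))

# "Train" the linear stand-in: centre the data and keep the top principal direction.
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
components = vt[:1]  # shape (1, 2): the 1-D bottleneck


def reconstruction_error(x):
    """Encode to the bottleneck, decode back, and return the squared error per row."""
    centred = np.atleast_2d(x) - mean
    code = centred @ components.T        # encode: 2-D -> 1-D
    reconstructed = code @ components    # decode: 1-D -> 2-D
    return np.sum((centred - reconstructed) ** 2, axis=1)


# Flag inputs whose error exceeds what is typical on the training data.
threshold = np.percentile(reconstruction_error(normal), 99)

typical_point = np.array([1.0, 2.0])   # lies on the line the data follows
odd_point = np.array([1.0, -2.0])      # far away from that line

print("typical point flagged:", reconstruction_error(typical_point)[0] > threshold)
print("odd point flagged:    ", reconstruction_error(odd_point)[0] > threshold)
```

A trained autoencoder would replace the SVD projection with learned encoder and decoder networks, but the thresholding of reconstruction error works the same way.
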

<a id="p4"></a>
## Part 4 - More...

Answer the following questions, with a target audience of a fellow Data Scientist:

- What do you consider your strongest area, as a Data Scientist?
My strongest area is probably creating ML models. I have found that when a problem is relatable, especially one related to health care, I grasp the concepts quickly and can build fairly accurate models.
- What area of Data Science would you most like to learn more about, and why?
I would definitely like to delve a bit deeper into neural networks. I would also like to learn about blockchain, as I believe that a lot of industries will benefit from the technology.
- Where do you think Data Science will be in 5 years?
It is really hard to say where DS will be in 5 years. But I am certain that DS will be present in virtually every industry and will likely be leading the decision making in most companies. I think there will be a push to incorporate it into middle and high school curricula.
- What are the threats posed by AI to our society?
The more that humans learn, the more dangerous we become to our environment. With AI, both the learning and the destruction could happen much faster.
- How do you think we can counteract those threats? 
Before this technology is really implemented and fully integrated into our society, there needs to be deep conversation and (basically) universal agreement about the implications. I think that there is a rudimentary understanding of the uses of AI. This is dangerous.
- Do you think achieving General Artificial Intelligence is ever possible?
It is kind of hard to say. I believe that General Artificial Intelligence is conceptually possible, but given that AI implies that learning happens at an accelerated pace, there is already an advantage that AI has over humans. Additionally, humans have emotions and other limiting factors that machines do not necessarily have to account for.

A few sentences per answer is fine - only elaborate if time allows.

## Congratulations! 

Thank you for your hard work, and congratulations! You've learned a lot, and you should proudly call yourself a Data Scientist.


In [0]:
from IPython.display import HTML

HTML("""<iframe src="https://giphy.com/embed/26xivLqkv86uJzqWk" width="480" height="270" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/mumm-champagne-saber-26xivLqkv86uJzqWk">via GIPHY</a></p>""")