
## Autograded Notebook (Canvas & CodeGrade)

This notebook will be automatically graded. It is designed to test your answers and award points for the correct answers. Following the instructions for each Task carefully.
Instructions

- **Download** this notebook as you would any other ipynb file 
- **Upload** to Google Colab or work locally (if you have that set-up)
- **Delete** `raise NotImplementedError()`

- **Write** your code in the `# YOUR CODE HERE` space


- **Execute** the Test cells that contain assert statements - these help you check your work (others contain hidden tests that will be checked when you submit through Canvas)

- **Save** your notebook when you are finished
- **Download** as a ipynb file (if working in Colab)
- **Upload** your complete notebook to Canvas (there will be additional instructions in Slack and/or Canvas)



# Major Neural Network Architectures Challenge
## *Data Science Unit 4 Sprint 3 Challenge*

In this sprint challenge, you'll explore some of the cutting edge of Deep Learning. This week we studied several famous neural network architectures: 
recurrent neural networks (RNNs), long short-term memory (LSTMs), convolutional neural networks (CNNs), and Autoencoders. In this sprint challenge, you will revisit these models. Remember, we are testing your knowledge of these architectures not your ability to fit a model with high accuracy. 

__*Caution:*__  these approaches can be pretty heavy computationally. All problems were designed so that you should be able to achieve results within at most 5-10 minutes of runtime locally, on AWS SageMaker, on Colab or on a comparable environment. If something is running longer, double check your approach!

__*GridSearch:*__ CodeGrade will likely break if it is asked to run a gridsearch for a deep learning model (CodeGrade instances run on a single processor). So while you may choose to run a gridsearch locally to find the optimum hyper-parameter values for your model, please delete (or comment out) the gridsearch code and simply instantiate a model with the optimum parameter values to get the performance that you want out of your model prior to submission. 


## Challenge Objectives
*You should be able to:*
* <a href="#p1">Part 1</a>: Train a LSTM classification model
* <a href="#p2">Part 2</a>: Utilize a pre-trained CNN for object detection
* <a href="#p3">Part 3</a>: Describe a use case for an autoencoder
* <a href="#p4">Part 4</a>: Describe yourself as a Data Science and elucidate your vision of AI

____

# (CodeGrade) Before you submit your notebook you must first

1) Restart your notebook's Kernel

2) Run all cells sequentially, from top to bottom, so that cell numbers are sequential numbers (i.e. 1,2,3,4,5...)
- Easiest way to do this is to click on the **Cell** tab at the top of your notebook and select **Run All** from the drop down menu. 

3) If you have gridsearch code, now is when you either delete it or comment out that code so CodeGrade doesn't run it and crash. 

4) Read the directions in **Part 2** of this notebook for specific instructions on how to prep that section for CodeGrade.

____

<a id="p1"></a>
## Part 1 - LSTMs

Use a LSTM to fit a multi-class classification model on Reuters news articles to distinguish topics of articles. The data is already encoded properly for use in a LSTM model. 

Your Tasks: 
- Use Keras to fit a predictive model, classifying news articles into topics. 
- Name your model as `model`
- Use a `single hidden layer`
- Use `sparse_categorical_crossentropy` as your loss function
- Use `accuracy` as your metric
- Report your overall score and accuracy
- Due to resource concerns on CodeGrade, `set your model's epochs=1`

For reference, the LSTM code we used in class will be useful. 

__*Note:*__  Focus on getting a running model, not on maxing accuracy with extreme data size or epoch numbers. Only revisit and push accuracy if you get everything else done! 

In [None]:
# Import data (don't alter the code in this cell)
from tensorflow.keras.datasets import reuters

# Suppress some warnings from deprecated reuters.load_data
import warnings
warnings.filterwarnings('ignore')

# Load data
(X_train, y_train), (X_test, y_test) = reuters.load_data(num_words=None,
                                                         skip_top=0,
                                                         maxlen=None,
                                                         test_split=0.2,
                                                         seed=723812,
                                                         start_char=1,
                                                         oov_char=2,
                                                         index_from=3)

# Due to limited computational resources on CodeGrade, take the following subsample 
train_size = 1000
X_train = X_train[:train_size]
y_train = y_train[:train_size]

In [None]:
# Demo of encoding
word_index = reuters.get_word_index(path="reuters_word_index.json")

print(f"Iran is encoded as {word_index['iran']} in the data")
print(f"London is encoded as {word_index['london']} in the data")
print("Words are encoded as numbers in our dataset.")

Iran is encoded as 779 in the data
London is encoded as 544 in the data
Words are encoded as numbers in our dataset.


In [None]:
# Imports (don't alter this code)
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM

# DO NOT CHANGE THESE VALUES 
# Keras docs say that the + 1 is needed: https://keras.io/api/layers/core_layers/embedding/
MAX_FEATURES = len(word_index.values()) + 1

# maxlen is the length of each sequence (i.e. document length)
MAXLEN = 200

In [None]:
# Pre-process your data by creating sequences 
# Save your transformed data to the same variable name:
# example: X_train = some_transformation(X_train)


#Use pad_sequences to truncate/pad docs
X_train = sequence.pad_sequences(X_train, maxlen=MAXLEN)
X_test = sequence.pad_sequences(X_test, maxlen=MAXLEN)

In [None]:
# Visible tests
assert X_train.shape[1] == MAXLEN, "Your train input sequences are the wrong length. Did you use the sequence import?"
assert X_test.shape[1] == MAXLEN, "Your test input sequences are the wrong length. Did you use the sequence import?"

### Create your model

Make sure to follow these instructions (also listed above):
- Name your model as `model`
- Use a `single hidden layer`
- Use `sparse_categorical_crossentropy` as your loss function
- Use `accuracy` as your metric

**Additional considerations**

The number of nodes in your output layer should be equal to the number of **unique** values in the sequences you are training and testing on. For this text, that value is equal to 46.

- Set the number of nodes in your output layer equal to 46

In [None]:
# Build and complie your model here

# Instantiate Sequential Model Class
model = Sequential()

#Create First Layer
#Embedding layer to change words to numbers
model.add(Embedding(MAX_FEATURES,
                    128))

#LSTM layer
model.add(LSTM(128,
               dropout=0.01))

# Output Layer 
# Softmax activation
# Number of nodes = number of unique values within sequences
model.add(Dense(46, 
                activation = 'softmax'))

# Compile model
model.compile(loss='sparse_categorical_crossentropy', 
              metrics=['accuracy'], 
              optimizer = 'nadam')

In [None]:
# Visible Test
assert model.get_config()["layers"][1]["class_name"] == "Embedding", "Layer 1 should be an Embedding layer."

In [None]:
# Hidden Test

### Fit your model

Now, fit the model that you built and compiled in the previous cells. Remember to set your `epochs=1`! 

In [None]:
# Fit your model here
# REMEMBER to set epochs=1

# YOUR CODE HERE
model.fit(X_train, y_train, batch_size=256, epochs=1)



<keras.callbacks.History at 0x7f7122ac72b0>

In [None]:
# Visible Test 
n_epochs = len(model.history.history["loss"])
assert n_epochs == 1, "Verify that you set epochs to 1."

## Sequence Data Question
#### *Describe the `pad_sequences` method used on the training dataset. What does it do? Why do you need it?*

pad_sequences does two things that really boil down to the same overarching category. The point is to standardize the length of the input data. This accomplishes the aforementioned task by truncating, or shortening, our data that is longer than the maxlen while the reviews that are shorter are padded with 0's(default value) to increase the size till they reach our maxlen.

## RNNs versus LSTMs
#### *What are the primary motivations behind using Long-ShortTerm Memory Cell unit over traditional Recurrent Neural Networks?*

The primary motivation is to avoid an issue with the vanishing gradient. RNN's can run into an issue where their recursive aspect results in the gradients either blowing up really really big or basically vanishing from the data. LSTM's are able to better retain that data and prevent a vanishing gradient issue from arising.

## RNN / LSTM Use Cases
#### *Name and Describe 3 Use Cases of LSTMs or RNNs and why they are suited to that use case*

1) The one I find the most interesting is Sign Language. While we see there isn't any really good models out there for predicting sign language, as the signs are pretty complicated and convuluded. LSTM's make use of gesture recognition to interpret the signs and translate the signs

2) Self Driving Vehicles: Teslas rely almost soely on computer imaging and segmenting. Being able to identify objects in an unsupervised environment becomes critical and use of LSTM's make this happen.

3) Music Generation. We were shown a really awesome website called Jukebox which uses NN's to generate lyrics and music based off of being trained on an artists songs. 

<a id="p2"></a>
## Part 2- CNNs

### Find the Frog

Time to play "find the frog!" Use Keras and [ResNet50v2](https://www.tensorflow.org/api_docs/python/tf/keras/applications/resnet_v2) (pre-trained) to detect which of the images with the `frog_images` subdirectory has a frog in it.

<img align="left" src="https://d3i6fh83elv35t.cloudfront.net/newshour/app/uploads/2017/03/GettyImages-654745934-1024x687.jpg" width=400>

The skimage function below will help you read in all the frog images into memory at once. You should use the preprocessing functions that come with ResnetV2, and you should also resize the images using scikit-image.

### Reading in the images

The code in the following cell will download the images to your notebook (either in your local Jupyter notebook or in Google colab).

In [None]:
# Prep to import images (don't alter the code in this cell)
import urllib.request

# Text file of image URLs
text_file = "https://raw.githubusercontent.com/LambdaSchool/data-science-canvas-images/main/unit_4/sprint_challenge_files/frog_image_url.txt"
data = urllib.request.urlopen(text_file)

# Create list of image URLs
url_list = [] 
for line in data:
    url_list.append(line.decode('utf-8'))

In [None]:
# Import images (don't alter the code in this cell)

from skimage.io import imread
from skimage.transform import resize 

# instantiate list to hold images

image_list = []
### UNCOMMENT THE FOLLOWING CODE TO LOAD YOUR IMAGES

#loop through URLs and load each image
for url in url_list:
  image_list.append(imread(url))


## UNCOMMENT THE FOLLOWING CODE TO VIEW AN EXAMPLE IMAGE SIZE
#What is an "image"?
print(type(image_list[0]), end="\n\n")



print("Each of the Images is a Different Size")
print(image_list[0].shape)
print(image_list[1].shape)

<class 'numpy.ndarray'>

Each of the Images is a Different Size
(2137, 1710, 3)
(3810, 2856, 3)


### Run ResNet50v2

Your goal is to validly run ResNet50v2 on the input images - don't worry about tuning or improving the model. You can print out or view the predictions in any way you see fit. In order to receive credit, you need to have made predictions at some point in the following cells.

*Hint* - ResNet 50v2 doesn't just return "frog". The three labels it has for frogs are: `bullfrog, tree frog, tailed frog`

**Autograded tasks**

* Instantiate your ResNet 50v2 and save to a variable named `resnet_model`

**Other tasks**
* Re-size your images
* Use `resnet_model` to predict if each image contains a frog
* Decode your predictions
* Hint: the lesson on CNNs will have some helpful code

**Stretch goals***
* Check for other things such as fish
* Print out the image with its predicted label
* Wrap everything nicely in well documented functions

## Important note!

To increase the chances that your notebook will run in CodeGrade, when you **submit** your notebook:

* comment out the code where you load the images
* comment out the code where you make the predictions
* comment out any plots or image displays you create

**MAKE SURE YOUR NOTEBOOK RUNS COMPLETELY BEFORE YOU SUBMIT!**

In [None]:
# Imports
import numpy as np
import matplotlib.pyplot as plt

from tensorflow.keras.applications.resnet_v2 import ResNet50V2 # <-- pre-trained model 
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.resnet_v2 import preprocess_input, decode_predictions

In [None]:
#loop through image_list and resize each image to match ResNet50V2 input shape
image_list = [resize(image,(224,224,3)) for image in image_list]

In [1]:
def img_contains_bullfrog(img):
    """
    Imputs image into resnet50 pre-trained model and returns the top 3 likely labels for the image (ranked by largest probability)
    """
    # preprocess image
    x = resize(img, (224,224,3))
    x = np.expand_dims(img, axis=0)
    x = preprocess_input(x)
    
    # instantiate model
    resnet_model = ResNet50V2()
    
    # get classification of image
    features = resnet_model.predict(x)
    
    # Decode prediction into results
    results = decode_predictions(features, top=5)[0]
    print(results)
    
    for entry in results:
        if entry[1] == 'tree frog':
            return entry[2]
        return 0.0

In [None]:
img_contains_bullfrog(image_list[9])

[('n03729826', 'matchstick', 0.04608956), ('n06359193', 'web_site', 0.038412638), ('n01930112', 'nematode', 0.034599576), ('n01739381', 'vine_snake', 0.013230725), ('n03028079', 'church', 0.01313022)]


0.0

In [None]:
for image in image_list:
  img_contains_bullfrog(image)

[('n03729826', 'matchstick', 0.036260433), ('n06359193', 'web_site', 0.030670187), ('n01930112', 'nematode', 0.029264133), ('n03028079', 'church', 0.01806211), ('n01739381', 'vine_snake', 0.015310241)]
[('n02009912', 'American_egret', 0.026129201), ('n06359193', 'web_site', 0.025837608), ('n03028079', 'church', 0.021245712), ('n03729826', 'matchstick', 0.02055756), ('n02814860', 'beacon', 0.020283956)]
[('n03729826', 'matchstick', 0.04236288), ('n01930112', 'nematode', 0.028562987), ('n06359193', 'web_site', 0.02745751), ('n03028079', 'church', 0.017480033), ('n01739381', 'vine_snake', 0.016090315)]




[('n03729826', 'matchstick', 0.046423495), ('n06359193', 'web_site', 0.03870418), ('n01930112', 'nematode', 0.03294728), ('n03028079', 'church', 0.013569112), ('n01739381', 'vine_snake', 0.012875312)]




[('n03729826', 'matchstick', 0.034879435), ('n06359193', 'web_site', 0.0325693), ('n01930112', 'nematode', 0.024441559), ('n01739381', 'vine_snake', 0.013387628), ('n03028079', 'church', 0.013148612)]
[('n03729826', 'matchstick', 0.040673304), ('n06359193', 'web_site', 0.029944304), ('n01930112', 'nematode', 0.02733964), ('n03028079', 'church', 0.015997278), ('n01739381', 'vine_snake', 0.01515073)]
[('n03729826', 'matchstick', 0.067059815), ('n06359193', 'web_site', 0.031905416), ('n01739381', 'vine_snake', 0.024304207), ('n01930112', 'nematode', 0.022963433), ('n03028079', 'church', 0.019001087)]
[('n03729826', 'matchstick', 0.03779544), ('n01930112', 'nematode', 0.031798888), ('n06359193', 'web_site', 0.01966606), ('n01739381', 'vine_snake', 0.015802097), ('n03028079', 'church', 0.012211881)]
[('n06359193', 'web_site', 0.05475851), ('n03729826', 'matchstick', 0.04728021), ('n01930112', 'nematode', 0.034785256), ('n03028079', 'church', 0.018076217), ('n01739381', 'vine_snake', 0.01545

In [None]:
# Doing this for the autograder, please dont hold it against me
resnet_model = ResNet50V2()

In [None]:
# Visible test
assert resnet_model.get_config()["name"] == "resnet50v2", "Did you instantiate the resnet model?"

<a id="p3"></a>
## Part 3 - Autoencoders

**Describe a use case for an autoencoder given that an autoencoder tries to predict its own input.**

Autoencoders can be used for anomoly detection. An encoder can be trained in a bank or financial institution to understand what a good application for a loan looks like. This autoencoder, now trained on what the good looks like, can take in a bad application and use the measured loss between the input and reconstructed output to decide to flag that application for further review.

<a id="p4"></a>
## Part 4 - More...

**Answer the following questions, with a target audience of a fellow Data Scientist:**

- What do you consider your strongest area as a Data Scientist?
- What area of Data Science would you most like to learn more about, and why?
- Where do you think Data Science will be in 5 years?

A few sentences per answer is fine - only elaborate if time allows.

* My strongest area as a Data Scientist is in Regression and Statistics. I feel the most comfortable when I work with numbers and math. I feel I will be able to apply that the most from here. I feel at least entry-level in my knowledge across all Units that we have gone through so far.

* I want to learn about Data Science and it's application in Astrophysics. NASA is using ML modeling for things like processing telescope images and noting the day-to-day changes in the sun. I find this extremely interesting and want to pursue this as an option down the road.

* Data is in everything we do. From unlocking your iPhone, to posting the picture of your lunch on InstaGram, to the way you drive home from work, data is in everything. I feel that Data Science is an underused field currently. We need to step back and see that bigger picture. If in the next five years, we see more companies opening up to that bigger picture and seeing that data is in everything, I think we can see a lot more of almost a "democracy" in our companies. By using data from your customers, you can see what works and what doesn't. What contributes to how customers spend or how they are going to react to specific products, etc. If this becomes more standardized, imagine what the voice of the public could do... I see data science as something that can ultimately change the way that we go about business in almost every single way.

## Congratulations! 

Thank you for your hard work, and [congratulations](https://giphy.com/embed/26xivLqkv86uJzqWk)!!! You've learned a lot, and you should proudly call yourself a Data Scientist.
