
# Lambda School Data Science Unit 4 Sprint Challenge 4

## RNNs, CNNs, AutoML, and more...

In this sprint challenge, you'll explore some of the cutting edge of Data Science.

*Caution* - these approaches can be pretty heavy computationally. All problems were designed so that you should be able to achieve results within at most 5-10 minutes of runtime on Colab or a comparable environment. If something is running longer, double-check your approach!

## Part 1 - RNNs

Use an RNN to fit a simple classification model that distinguishes tweets by Austen Allred from tweets by Weird Al Yankovic.

Following is code to scrape the needed data (no API auth needed, uses [twitterscraper](https://github.com/taspinar/twitterscraper)):

In [0]:
!pip install twitterscraper

In [0]:
from twitterscraper import query_tweets

# Stretch Goal - few Austen tweets are available, so fetch more of them
austen_tweets = query_tweets('from:austen', 3000)
len(austen_tweets)

In [0]:
austen_tweets[0].text

In [0]:
al_tweets = query_tweets('from:AlYankovic', 1000)
len(al_tweets)

In [0]:
al_tweets[0].text

In [0]:
len(austen_tweets + al_tweets)

Your tasks:

- Encode the characters to a sequence of integers for the model
- Get the data into the appropriate shape/format, including labels and a train/test split
- Use Keras to fit a predictive model, classifying tweets as being from Austen versus Weird Al
- Report your overall score and accuracy
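
The first two tasks can be sketched without Keras. The snippet below is a minimal illustration, not the approach this notebook takes: it maps each distinct character to a small integer id (instead of the raw `ord` value) and left-pads every sequence to a fixed length, which mirrors the default pre-padding and pre-truncating behaviour of Keras `pad_sequences`. The helper names (`build_vocab`, `encode`, `pad`) and the sample texts are hypothetical.

```python
def build_vocab(texts):
    """Map each distinct character to a compact integer id; 0 is reserved for padding."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: idx for idx, ch in enumerate(chars, start=1)}


def encode(text, vocab):
    """Turn a string into a list of integer ids, skipping characters not in the vocabulary."""
    return [vocab[ch] for ch in text if ch in vocab]


def pad(seq, maxlen, value=0):
    """Keep the last `maxlen` items of a sequence and left-pad it to exactly `maxlen`."""
    seq = seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq


# Hypothetical stand-ins for the scraped tweet texts.
sample_texts = ["hi al", "lambda"]
vocab = build_vocab(sample_texts)
encoded = [pad(encode(text, vocab), maxlen=8) for text in sample_texts]
print(encoded)
```

A compact vocabulary keeps the embedding layer small: its input size only needs to be `len(vocab) + 1` rather than large enough to cover every possible character code.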

For reference, the [Keras IMDB sentiment classification example](https://github.com/keras-team/keras/blob/master/examples/imdb_lstm.py) will be useful, as well as the RNN code we used in class.

*Note* - focus on getting a running model, not on maxing accuracy with extreme data size or epoch numbers. Only revisit and push accuracy if you get everything else done!

In [0]:
# Generic imports
import pandas as pd
import numpy as np

In [0]:
# TODO - your code!

# Inspect the range of tweet lengths for each author
austen_lengths = [len(tweet.text) for tweet in austen_tweets]
print('Austen tweet lengths:')
print(f'max: {max(austen_lengths)}, min: {min(austen_lengths)}')

al_lengths = [len(tweet.text) for tweet in al_tweets]
print('AlYankovic tweet lengths:')
print(f'max: {max(al_lengths)}, min: {min(al_lengths)}')

In [0]:
import re

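# Encode the characters to a sequence of integers for the model
def convert_to_ascii(text):
  # ord() returns the Unicode code point, so non-ASCII characters map to values above 127
  ascii_list = [ord(char) for char in text]
  return ascii_list

def remove_url_from_text(text):
  # Remove URLs anywhere in the tweet, not only at the start of a line
  return re.sub(r'https?://\S+', '', text)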
  
# output = convert_to_ascii('hello')
# print(output)

# Strip URLs, then encode each tweet as a list of character codes
austen_tweets_updated = [
  convert_to_ascii(remove_url_from_text(tweet.text)) for tweet in austen_tweets
]

al_tweets_updated = [
  convert_to_ascii(remove_url_from_text(tweet.text)) for tweet in al_tweets
]
  

print('Initial tweet counts')
print(len(austen_tweets))
print(len(al_tweets))

print('Updated tweet counts')
print(len(austen_tweets_updated))
print(len(al_tweets_updated))

In [0]:
# Get the data into the appropriate shape/format, including labels 
# and a train/test split

combined_updated_tweets = austen_tweets_updated + al_tweets_updated
print('Combined tweet counts')
print(len(combined_updated_tweets))
print(combined_updated_tweets)

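# Tweets have different lengths, so store them as an object array of lists
X = np.array(combined_updated_tweets, dtype=object)
# print(X)

# Label Austen tweets as 1 and Weird Al tweets as 0
# (np.int was removed from recent NumPy releases; the built-in int works everywhere)
austen_tweets_label = np.ones(len(austen_tweets_updated), dtype=int)
al_tweets_label = np.zeros(len(al_tweets_updated), dtype=int)

y = np.concatenate((austen_tweets_label, al_tweets_label), axis=0)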

In [0]:
# Use Keras to fit a predictive model, classifying tweets as being 
# from Austen versus Weird Al

from sklearn.model_selection import train_test_split

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM

# Size of the embedding's input vocabulary; every character code fed to the
# model must be smaller than this value
max_features = 20000
# pad or truncate each tweet to this many characters
maxlen = 370
batch_size = 32

print('Loading data...')
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# print(x_train)
# print(y_train)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=15,
          validation_data=(x_test, y_test))
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)

In [0]:
# Report your overall score and accuracy
print('Test score:', score)
print('Test accuracy:', acc)

Conclusion - the RNN runs and gives a pretty decent improvement over a naive "It's Al!" model. To *really* improve the model, more parameter tuning and simply getting more data (particularly Austen tweets) would help. Also, an RNN may well not be the best approach here, but it is at least a valid one.
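
To make the comparison with a naive model concrete, the sketch below computes the accuracy of always predicting the most common class. The function name and the example labels are hypothetical; in this notebook the same function could be applied to `y_test`.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)


# Hypothetical labels: 1 = Austen, 0 = Weird Al.
example_labels = [0, 0, 0, 1]
print(majority_baseline_accuracy(example_labels))  # 0.75
```

A trained model is only useful if its test accuracy clearly exceeds this number. Which class is the majority depends on how many tweets were scraped for each author.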

## Part 2 - CNNs

Time to play "find the frog!" Use Keras and ResNet50 to detect which of the following images contain frogs:

In [0]:
!pip install google_images_download

In [0]:
from google_images_download import google_images_download

response = google_images_download.googleimagesdownload()
arguments = {"keywords": "animalpond", "limit": 20, "print_urls": True}
absolute_image_paths = response.download(arguments)

At time of writing at least a few do, but since the Internet changes, it is possible that none of your downloaded images will. You can easily verify this yourself, and (once you have working code) increase the number of images you pull to be more sure of getting a frog. Your goal is to validly run ResNet50 on the input images - don't worry about tuning or improving the model.

*Hint* - ResNet50 doesn't just return "frog". The three labels it has for frogs are: `bullfrog, tree frog, tailed frog`

*Stretch goal* - also check for fish.
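
One way to use the hint is to match predictions against an explicit set of labels rather than a substring. The sketch below assumes predictions arrive as `(class_id, label, score)` tuples, as produced by Keras `decode_predictions`, and assumes the labels are spelled with underscores; both the `FROG_LABELS` spelling and the sample predictions are illustrative and should be checked against real output.

```python
# Assumed spellings of the three frog labels; verify against actual predictions.
FROG_LABELS = {"bullfrog", "tree_frog", "tailed_frog"}


def best_match(predictions, wanted=FROG_LABELS):
    """Return the highest score among predictions whose label is in `wanted`, else 0.0."""
    scores = [score for _, label, score in predictions if label in wanted]
    return max(scores, default=0.0)


# Made-up predictions for a single image, in (class_id, label, score) form.
sample = [
    ("id_a", "bullfrog", 0.61),
    ("id_b", "lakeside", 0.20),
    ("id_c", "tree_frog", 0.05),
]
print(best_match(sample))
```

The same function covers the fish stretch goal by passing a different `wanted` set. An exact-match set avoids accidental hits that a plain substring search could produce.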

In [0]:
# TODO - your code!

import numpy as np

from keras.applications.resnet50 import ResNet50
from keras.preprocessing import image
from keras.applications.resnet50 import preprocess_input, decode_predictions

def process_img_path(img_path):
  return image.load_img(img_path, target_size=(224, 224))

# Load the pretrained network once instead of on every call
resnet_model = ResNet50(weights='imagenet')

def img_contains(img, findstr='frog'):
  """Return the score of the first top-3 label containing findstr, else 0.0."""
  x = image.img_to_array(img)
  x = np.expand_dims(x, axis=0)
  x = preprocess_input(x)
  features = resnet_model.predict(x)
  results = decode_predictions(features, top=3)[0]

  # Each entry is (class_id, label, score)
  for entry in results:
    if findstr in entry[1]:
      return entry[2]

  return 0.0

In [0]:
# Check the download path
absolute_image_paths

In [0]:
# For each image, check for the presence of a frog or a fish

image_path_list = absolute_image_paths['animalpond']
for i, image_path in enumerate(image_path_list):
  print(image_path)
  processed_image = process_img_path(image_path)
  results = img_contains(processed_image, 'frog')
  print(f'Prediction for frog in the picture is {results}\n')
  results = img_contains(processed_image, 'fish')
  print(f'Prediction for fish in the picture is {results}\n')

## Part 3 - AutoML

Use [TPOT](https://github.com/EpistasisLab/tpot) to fit a predictive model for the King County housing data, with `price` as the target output variable.

In [0]:
!pip install tpot

In [0]:
!wget https://raw.githubusercontent.com/ryanleeallred/datasets/master/kc_house_data.csv

In [0]:
!head kc_house_data.csv

As with previous questions, your goal is to run TPOT successfully and report the error at the end. Also, in the interest of time, feel free to choose small `generations=1` and `population_size=10` parameters so your pipeline runs efficiently and you are able to iterate and test.

*Hint* - you'll have to drop and/or type coerce at least a few variables to get things working. It's fine to err on the side of dropping to get things running, as long as you still get a valid model with reasonable predictive power.
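
As an illustration of type coercion, the sketch below turns a raw date string into numeric features using only the standard library. It assumes the `date` column stores values shaped like `20141013T000000`; that format and the `date_features` helper are assumptions made for this example, so check them against the output of `!head kc_house_data.csv`.

```python
from datetime import datetime


def date_features(raw):
    """Split a raw date string into numeric features a model can consume."""
    # Assumed format: YYYYMMDD, a literal "T", then HHMMSS.
    parsed = datetime.strptime(raw, "%Y%m%dT%H%M%S")
    return {
        "year": parsed.year,
        "month": parsed.month,
        "day_of_month": parsed.day,
        "day_of_week": parsed.weekday(),  # Monday is 0
    }


features = date_features("20141013T000000")
print(features)
```

Once the date is expanded into integer columns like these, the original string column can be dropped so that every remaining feature is numeric.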

In [0]:
# TODO - your code!
import pandas as pd
from tpot import TPOTRegressor

In [0]:
df = pd.read_csv('kc_house_data.csv')
df.head()

In [0]:
# Feature Engineering
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_month'] = df['date'].dt.day
df['day_of_week'] = df['date'].dt.weekday

In [0]:
df.dtypes

In [0]:
X = df.drop(columns=['price','date'])
y = df['price']

In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, test_size=0.25)

In [0]:
%%time

tpot = TPOTRegressor(generations=1, population_size=10, verbosity=2)
tpot.fit(X_train, y_train)
# TPOTRegressor scores with negative mean squared error by default
print(tpot.score(X_test, y_test))

In [0]:
y_test_predict = tpot.predict(X_test)

In [0]:
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

MSE = mean_squared_error(y_test, y_test_predict)
RMSE = (np.sqrt(MSE))

print('MSE is {}'.format(MSE))
print('RMSE is {}'.format(RMSE))

R2 = r2_score(y_test, y_test_predict)

print('R^2 is {}'.format(R2))
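
For readers who want to see what these metrics compute, here is a minimal pure-Python sketch of MSE, RMSE, and R^2. The `regression_metrics` helper and the example prices are hypothetical; the scikit-learn functions used above remain the practical choice.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, and R^2 from two equal-length lists of numbers."""
    n = len(y_true)
    # Sum of squared residuals between actual and predicted values.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares around the mean of the actual values.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": 1 - ss_res / ss_tot}


# Made-up prices and predictions.
metrics = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
print(metrics)
```

RMSE is expressed in the same units as `price`, which makes it easier to interpret than MSE. R^2 compares the model against always predicting the mean price.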

## Part 4 - More...

Answer the following questions, with a target audience of a fellow Data Scientist:

- What do you consider your strongest area, as a Data Scientist?

**Answer** 

I like experimenting and iterating to improve the overall output of a Data Science problem. Trying new approaches and interpreting the results is key for any Data Scientist, because not all Data Science problems are the same and each one needs a different approach to finding a solution.

Technically, I have solid programming experience from my previous work in the software field, and I currently love programming in Python.

Another key requirement for a Data Scientist is working with cross-functional teams. I have worked in and even managed teams as part of my previous work experience.

Finally, I like upskilling myself regularly on new technologies within the domain. This is again key to being a successful Data Scientist.


- What area of Data Science would you most like to learn more about, and why?

**Answer**

Given the opportunity, I want to democratise Data Science and learn how to make it reach the masses. Personally, I would love to learn more about the use of Data Science in the field of automation. The hard problems I would like to learn about and work on relate to improving the efficiency of tasks and reducing loss within a project. One example is AutoML, which can reduce the effort needed to get a quick estimate.

I want to work on Data Science projects that have a positive impact on people's lives and give them more free time. For example, Google Maps has drastically improved the efficiency of travel for people.

- Where do you think Data Science will be in 5 years?

**Answer**

I see the future of Data Science as both exciting and challenging for the next few years. This is because incumbent companies with huge cash reserves in fields other than software are yet to catch up and incorporate Machine Learning or AI into their existing products.

Also, many ecosystems of Data Science service providers and platforms are now evolving in parallel, for example Google, Amazon, IBM, Microsoft...

Five years from now, companies will have gone through several cycles of product evolution based on ML/AI, and delivery will be stable with improved results. Similarly, the platforms providing ML/AI solutions will have stabilized and the market will have a few leaders.

A few sentences per answer is fine - only elaborate if time allows.

Thank you for your hard work, and congratulations! You've learned a lot, and should proudly call yourself a Data Scientist.