# Lambda School Data Science Unit 4 Sprint Challenge 4

## RNNs, CNNs, AutoML, and more...

In this sprint challenge, you'll explore some of the cutting edge of Data Science.

*Caution* - these approaches can be computationally heavy. All problems were designed so that you should be able to get results within 5-10 minutes of runtime on Colab or a comparable environment. If something runs longer, double-check your approach!

## Part 1 - RNNs

Use an RNN to fit a simple classification model that distinguishes tweets by Austen Allred from tweets by Weird Al Yankovic.

Following is code to scrape the needed data (no API auth needed, uses [twitterscraper](https://github.com/taspinar/twitterscraper)):

Conclusion - the RNN runs and gives a decent improvement over a naive "It's Al!" model (one that always predicts Weird Al). To *really* improve the model, further parameter tuning and more data (particularly Austen tweets) would help. Also, an RNN may well not be the best approach here, but it is at least a valid one.

In [0]:
!pip install twitterscraper

In [0]:
from twitterscraper import query_tweets

austen_tweets = query_tweets('from:austen', 1000)
len(austen_tweets)

In [4]:
austen_tweets[0].text

'I love love love working with great people.pic.twitter.com/fCKOm6Vl'

In [0]:
al_tweets = query_tweets('from:AlYankovic', 1000)
len(al_tweets)

In [6]:
al_tweets[0].text

'RT @GeoffTheRobot: Hey Al, you played zydeco on my ribs at the RED premiere and it airs tonight on Late Late with @CraigyFerg!'

In [7]:
len(austen_tweets + al_tweets)

1141

Your tasks:

- Encode the characters to a sequence of integers for the model
- Get the data into the appropriate shape/format, including labels and a train/test split
- Use Keras to fit a predictive model, classifying tweets as being from Austen versus Weird Al
- Report your overall score and accuracy

For reference, the [Keras IMDB sentiment classification example](https://github.com/keras-team/keras/blob/master/examples/imdb_lstm.py) will be useful, as well as the RNN code we used in class.

*Note* - focus on getting a running model, not on maxing accuracy with extreme data size or epoch numbers. Only revisit and push accuracy if you get everything else done!

In [0]:
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM
from keras.layers import Flatten
from sklearn.model_selection import train_test_split
import numpy as np

In [0]:
# Integer-encode all the characters in the corpus, and transform each
# tweet into a list of integers (one per character).  Combine all tweets into a
# numpy array that pads each tweet to the same length.

def tweet_preprocessing(tweets):
  '''
  Pre-processes a corpus of tweets
  '''
  fulltext = ''
  for twt in tweets:
    fulltext += twt.text
    
  chars = list(set(fulltext)) # split and remove duplicate characters. convert to list.
  num_chars = len(chars) # the number of unique characters
  txt_data_size = len(fulltext)
  
  # Integer-encoded dictionaries
  char_to_int = dict((c, i) for i, c in enumerate(chars))
  int_to_char = dict((i, c) for i, c in enumerate(chars))
  
  # Turn each tweet into a list of integers representing the characters
  encoded_tweets = []
  for twt in tweets:
    enc_tweet = [char_to_int[ch] for ch in twt.text]
    encoded_tweets.append(enc_tweet)
  
  # Pad each tweet to 280 characters
  padded_tweets = sequence.pad_sequences(encoded_tweets, maxlen=280)
  
  # Informative printouts
  print("Unique characters : ", num_chars)
  print("Size of tweet library (char) : ", txt_data_size)
  print('All characters: \n', [x for x in chars])
  print('Character dictionary: \n', char_to_int)
  print('Example processed tweet: \n', padded_tweets[0])
  print('Shape of tweet library: ', padded_tweets.shape)
  
  return padded_tweets, chars
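
The padding step above relies on the defaults of `sequence.pad_sequences`, which pads (and truncates) at the front of each sequence with the value 0. The following is a minimal NumPy sketch of that behavior, not the Keras implementation itself; `pad_pre` and the toy sequences are hypothetical names used only for illustration.

```python
import numpy as np

def pad_pre(encoded, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    out = np.full((len(encoded), maxlen), value, dtype="int32")
    for row, seq in enumerate(encoded):
        if not seq:
            continue  # an empty sequence stays all padding
        trimmed = seq[-maxlen:]           # keep only the last maxlen items
        out[row, -len(trimmed):] = trimmed  # right-align, padding on the left
    return out

# Toy sequences (assumed data, not real tweets)
toy = [[5, 3], [7, 1, 4, 2, 9]]
padded = pad_pre(toy, maxlen=4)
print(padded)
```

Because `char_to_int` above starts numbering at 0, one real character shares its integer with the padding value, so the model cannot tell that character apart from padding.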

In [72]:
austen_tweets2, austen_chars = tweet_preprocessing(austen_tweets)

Unique characters :  90
Size of tweet library (char) :  16177
All characters: 
 ['X', 'k', 'T', ';', 'D', ')', 'f', 'h', '-', 'J', '…', 'W', 'F', 'a', '5', '\n', '8', '1', 'I', 'K', 's', 'b', 'r', 'y', 'Q', 'A', '/', 'C', '’', 'c', 'Y', '?', 'e', ':', '(', 'u', '$', 'S', 'l', "'", '0', 'x', 'd', 'j', '“', '"', 'M', '7', 'g', 'E', ' ', '3', 'p', 'H', 'm', '*', '2', 'O', '\xa0', 'o', ',', 'i', 't', '@', '6', '4', 'ï', 'v', 'N', 'G', '#', 'U', '%', '!', 'n', 'q', 'z', 'B', 'w', '+', 'P', 'Z', 'R', '_', '.', 'V', '”', '&', '9', 'L']
Character dictionary: 
 {'X': 0, 'k': 1, 'T': 2, ';': 3, 'D': 4, ')': 5, 'f': 6, 'h': 7, '-': 8, 'J': 9, '…': 10, 'W': 11, 'F': 12, 'a': 13, '5': 14, '\n': 15, '8': 16, '1': 17, 'I': 18, 'K': 19, 's': 20, 'b': 21, 'r': 22, 'y': 23, 'Q': 24, 'A': 25, '/': 26, 'C': 27, '’': 28, 'c': 29, 'Y': 30, '?': 31, 'e': 32, ':': 33, '(': 34, 'u': 35, '$': 36, 'S': 37, 'l': 38, "'": 39, '0': 40, 'x': 41, 'd': 42, 'j': 43, '“': 44, '"': 45, 'M': 46, '7': 47, 'g': 48, 'E': 49,

In [73]:
al_tweets2, al_chars = tweet_preprocessing(al_tweets)

Unique characters :  104
Size of tweet library (char) :  94043
All characters: 
 ['т', 'й', 'X', 'T', 'k', ';', 'D', ')', 'f', 'h', '-', 'J', '…', 'W', 'F', 'a', '5', '–', '\n', '8', '1', 'í', 'I', 'K', 's', 'b', 'д', 'y', 'r', 'Q', 'A', '/', '‘', 'C', '’', 'р', 'c', 'Y', '?', 'в', 'e', ':', '(', 'u', '$', 'S', 'l', "'", '0', '—', 'x', 'd', 'j', '“', '"', 'M', '7', 'g', 'с', 'а', 'E', ' ', '3', 'p', 'H', 'm', 'З', '™', '*', '2', 'é', 'O', '\xa0', 'o', ',', 'i', 't', 'у', '@', '6', '4', 'v', 'N', 'G', '#', 'U', '%', '!', 'n', 'q', 'z', 'B', 'w', 'P', 'Z', 'R', '_', '.', 'V', '”', '&', 'е', '9', 'L']
Character dictionary: 
 {'т': 0, 'й': 1, 'X': 2, 'T': 3, 'k': 4, ';': 5, 'D': 6, ')': 7, 'f': 8, 'h': 9, '-': 10, 'J': 11, '…': 12, 'W': 13, 'F': 14, 'a': 15, '5': 16, '–': 17, '\n': 18, '8': 19, '1': 20, 'í': 21, 'I': 22, 'K': 23, 's': 24, 'b': 25, 'д': 26, 'y': 27, 'r': 28, 'Q': 29, 'A': 30, '/': 31, '‘': 32, 'C': 33, '’': 34, 'р': 35, 'c': 36, 'Y': 37, '?': 38, 'в': 39, 'e': 40, ':': 41, 

In [77]:
# Create a single vocabulary for both tweet libraries
vocabulary = set(austen_chars + al_chars)
len(vocabulary)

106
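
The combined vocabulary here is only used for its size; each account's tweets were still encoded with their own dictionary. One possible alternative, sketched below under the assumption that plain text strings are available, is to build a single character index over both corpora so that a given character always maps to the same integer. The names `build_char_index` and `encode`, and the toy texts, are hypothetical.

```python
def build_char_index(texts):
    """Map each character to an integer >= 1, leaving 0 free for padding."""
    chars = sorted(set("".join(texts)))  # sorted for a reproducible mapping
    return {ch: i for i, ch in enumerate(chars, start=1)}

def encode(texts, char_index):
    """Turn each text into a list of integers, one per character."""
    return [[char_index[ch] for ch in text] for text in texts]

# Hypothetical stand-ins for the two accounts' tweet texts
austen_texts = ["great people", "ship it"]
al_texts = ["eat it", "polka!"]

# One shared mapping built from both corpora
char_index = build_char_index(austen_texts + al_texts)
austen_encoded = encode(austen_texts, char_index)
al_encoded = encode(al_texts, char_index)

print(len(char_index), austen_encoded[1], al_encoded[0])
```

With indices starting at 1, an `Embedding` layer would need `input_dim=len(char_index) + 1` to cover the padding value as well.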

In [94]:
# Combine both users' processed tweets into a single dataset
X = np.concatenate((austen_tweets2, al_tweets2), axis=0)
y = np.concatenate((np.zeros((austen_tweets2.shape[0],1)),
                    np.ones((al_tweets2.shape[0],1))), axis=0)

X.shape, y.shape

((1141, 280), (1141, 1))

In [0]:
# Divide into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                  test_size=0.25, random_state=42)
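
Before training, it can help to know what the naive "It's Al!" model would score on the same split. The sketch below computes the accuracy of always predicting the most common training label; `majority_baseline_accuracy` and the toy labels are hypothetical and stand in for `y_train` and `y_test`.

```python
import numpy as np

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most common training label."""
    labels, counts = np.unique(y_train, return_counts=True)
    majority = labels[np.argmax(counts)]
    return float(np.mean(np.ravel(y_test) == majority))

# Toy labels shaped like the (n, 1) arrays above (assumed data)
toy_train = np.array([[0.0], [1.0], [1.0], [1.0]])
toy_test = np.array([[1.0], [0.0], [1.0], [1.0]])

baseline = majority_baseline_accuracy(toy_train, toy_test)
print(f"Baseline accuracy: {baseline:.2f}")
```

Calling `majority_baseline_accuracy(y_train, y_test)` on the real split gives the number the RNN has to beat.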

In [96]:
# Create the Keras LSTM RNN

model = Sequential()
model.add(Embedding(input_dim=len(vocabulary), output_dim=128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, None, 128)         13568     
_________________________________________________________________
lstm_6 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 129       
Total params: 145,281
Trainable params: 145,281
Non-trainable params: 0
_________________________________________________________________
None
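
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for these layer types: an embedding has one row per vocabulary entry, an LSTM has four gates that each see the input, the previous hidden state, and a bias, and a dense layer has one weight per input plus a bias. The helper function names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (input_dim + units + 1) * units

def dense_params(input_dim, units):
    """One weight per input plus one bias, per output unit."""
    return (input_dim + 1) * units

emb = embedding_params(106, 128)   # vocabulary size 106, output_dim 128
lstm = lstm_params(128, 128)       # 128 inputs, 128 LSTM units
dense = dense_params(128, 1)       # 128 inputs, 1 sigmoid output
total = emb + lstm + dense

print(emb, lstm, dense, total)
```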


In [97]:
batch_size = 32

model.fit(X_train, y_train,
          batch_size=batch_size,
          epochs=15,
          validation_data=(X_test, y_test))

Train on 855 samples, validate on 286 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x7f94f7e36208>

In [100]:
score, acc = model.evaluate(X_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print(f'Test accuracy: {acc*100:.2f}%')

Test score: 0.008031357738752751
Test accuracy: 99.65%


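Whoa! The model classified 99.65% of the test tweets correctly. One caveat: `tweet_preprocessing` was called separately for each account, so each account's tweets were encoded with a different character dictionary (for example, 'X' maps to 0 for Austen and to 2 for Al). Part of this accuracy may therefore reflect the encoding difference rather than differences in how the two authors write.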
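
Since the two classes are not the same size, a single accuracy figure can hide weak performance on the smaller class. The sketch below counts true and false positives and negatives from predicted probabilities, assuming they come from something like `model.predict(X_test)`; `confusion_counts` and the toy values are hypothetical.

```python
import numpy as np

def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count TN, FP, FN, TP for a binary classifier's probabilities."""
    y_true = np.ravel(y_true).astype(int)
    y_pred = (np.ravel(y_prob) >= threshold).astype(int)
    return {
        "tn": int(np.sum((y_true == 0) & (y_pred == 0))),
        "fp": int(np.sum((y_true == 0) & (y_pred == 1))),
        "fn": int(np.sum((y_true == 1) & (y_pred == 0))),
        "tp": int(np.sum((y_true == 1) & (y_pred == 1))),
    }

# Toy labels and probabilities (assumed data): 0 = Austen, 1 = Al
toy_true = np.array([0, 0, 1, 1, 1])
toy_prob = np.array([0.2, 0.7, 0.9, 0.4, 0.8])

counts = confusion_counts(toy_true, toy_prob)
austen_recall = counts["tn"] / (counts["tn"] + counts["fp"])
al_recall = counts["tp"] / (counts["tp"] + counts["fn"])
print(counts, austen_recall, al_recall)
```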