### Analyze the data

The blog post I'm following prints this summary of its dataset:

```
1823250 English words.
227 unique English words.
10 Most common words in the English dataset:
"is" "," "." "in" "it" "during" "the" "but" "and" "sometimes"

1961295 French words.
355 unique French words.
10 Most common words in the French dataset:
"est" "." "," "en" "il" "les" "mais" "et" "la" "parfois"
```

I want to reproduce this output for my own data, and also add how many phrases I have.

In [1]:
import pandas as pd 
from tools import *
from collections import Counter
import numpy as np
from sklearn.model_selection import train_test_split

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model, Sequential
from keras.layers import GRU, Input, Dense, TimeDistributed, Activation, RepeatVector, Bidirectional, Dropout, LSTM
from keras.layers.embeddings import Embedding
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy

Using TensorFlow backend.


These next two cells exist because I had to manipulate the data a little after pre-processing. I have re-saved it, so the `load_data()` line should now work.

In [2]:
data, id2eng, id2fren, X_eng, y_fren = load_data()

In [3]:
data.head()

| Unnamed: 0 | english | french | french_tokens | english_tokens | english_bow | french_bow | english_padded | french_padded |
|---|---|---|---|---|---|---|---|---|
| 0 | new jersey is sometimes quiet during autumn , ... | new jersey est parfois calme pendant l' automn... | [new, jersey, est, parfois, calme, pendant, l,... | [new, jersey, is, sometimes, quiet, during, au... | [8, 7, 5, 11, 9, 3, 2, 0, 6, 5, 10, 4, 1] | [10, 7, 4, 11, 2, 12, 8, 0, 5, 6, 4, 9, 3, 1] | [8, 7, 5, 11, 9, 3, 2, 0, 6, 5, 10, 4, 1, -1, ... | [10, 7, 4, 11, 2, 12, 8, 0, 5, 6, 4, 9, 3, 1, ... |
| 1 | the united states is usually chilly during jul... | les états-unis est généralement froid en juill... | [les, états, unis, est, généralement, froid, e... | [the, united, states, is, usually, chilly, dur... | [17, 18, 16, 5, 19, 12, 3, 14, 0, 6, 5, 19, 13... | [18, 21, 20, 4, 15, 13, 3, 17, 5, 6, 14, 16, 3... | [17, 18, 16, 5, 19, 12, 3, 14, 0, 6, 5, 19, 13... | [18, 21, 20, 4, 15, 13, 3, 17, 5, 6, 14, 16, 3... |
| 2 | california is usually quiet during march , and... | california est généralement calme en mars , et... | [california, est, généralement, calme, en, mar... | [california, is, usually, quiet, during, march... | [20, 5, 19, 9, 3, 23, 0, 6, 5, 19, 21, 4, 22] | [22, 4, 15, 2, 3, 25, 5, 6, 4, 15, 23, 3, 24] | [20, 5, 19, 9, 3, 23, 0, 6, 5, 19, 21, 4, 22, ... | [22, 4, 15, 2, 3, 25, 5, 6, 4, 15, 23, 3, 24, ... |
| 3 | the united states is sometimes mild during jun... | les états-unis est parfois légère en juin , et... | [les, états, unis, est, parfois, légère, en, j... | [the, united, states, is, sometimes, mild, dur... | [17, 18, 16, 5, 11, 25, 3, 22, 0, 6, 5, 24, 4,... | [18, 21, 20, 4, 11, 27, 3, 24, 5, 6, 26, 13, 3... | [17, 18, 16, 5, 11, 25, 3, 22, 0, 6, 5, 24, 4,... | [18, 21, 20, 4, 11, 27, 3, 24, 5, 6, 26, 13, 3... |
| 4 | your least liked fruit is the grape , but my l... | votre moins aimé fruit est le raisin , mais mo... | [votre, moins, aimé, fruit, est, le, raisin, m... | [your, least, liked, fruit, is, the, grape, bu... | [34, 31, 32, 29, 5, 17, 30, 28, 33, 31, 32, 5,... | [38, 34, 29, 30, 4, 32, 37, 33, 35, 34, 29, 4,... | [34, 31, 32, 29, 5, 17, 30, 28, 33, 31, 32, 5,... | [38, 34, 29, 30, 4, 32, 37, 33, 35, 34, 29, 4,... |


In [4]:
# Flatten the per-phrase token lists into one list of words to pass into Counter
english_word_list = [word for tokens in data['english_tokens'] for word in tokens]
french_word_list = [word for tokens in data['french_tokens'] for word in tokens]

In [5]:
eng_counter = Counter(english_word_list)
fren_counter = Counter(french_word_list)

In [6]:
print(f'{len(english_word_list)} English words.')
print(f'{len(eng_counter)} unique English Words.')
print('10 most common english words in the data:')
print('"' + '" "'.join(word for word, _ in eng_counter.most_common(10)) + '"')
print()
print(f'{len(french_word_list)} French words.')
print(f'{len(fren_counter)} unique french words.')
print('10 most common french words:')
print('"' + '" "'.join(word for word, _ in fren_counter.most_common(10)) + '"')

2511216 English words.
13482 unique English Words.
10 most common english words in the data:
"is" "the" "it" "in" "during" "but" "i" "and" "you" "never"

2766892 French words.
23242 unique french words.
10 most common french words:
"est" "en" "il" "la" "mais" "l" "et" "les" "le" "de"
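
The summary above still lacks the phrase count I wanted. Here is a minimal sketch of one way to add it, assuming each row of the token column is one phrase. `corpus_summary` is a hypothetical helper name, and the toy token lists stand in for `data['english_tokens']`:

```python
from collections import Counter


def corpus_summary(token_lists, language):
    """Build the summary text for one language (hypothetical helper)."""
    token_lists = list(token_lists)
    # Flatten the per-phrase token lists into one list of words.
    words = [word for tokens in token_lists for word in tokens]
    counts = Counter(words)
    top_words = " ".join(f'"{word}"' for word, _ in counts.most_common(10))
    return "\n".join([
        f"{len(token_lists)} {language} phrases.",
        f"{len(words)} {language} words.",
        f"{len(counts)} unique {language} words.",
        f"10 most common {language} words:",
        top_words,
    ])


# Toy stand-in for data['english_tokens']; one inner list per phrase.
toy_tokens = [["it", "is", "cold"], ["it", "is", "warm", "in", "june"]]
print(corpus_summary(toy_tokens, "English"))
```

Returning the text instead of printing it directly keeps the helper easy to reuse for both languages.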


### Ok so I have slightly more complicated data
I'm mostly going to keep following the guide and 'create my own' embedding, because that will be *slightly* easier for the first round (just following along). However, as the guide suggests, I want to go back and implement someone else's embedding to better capture this vocabulary and make it exportable.

In the next section the guide does a lot of work around tokenization. I'm going to assume my tokenization worked properly. For this version I'm not going to go back and do any additional cleaning, and I won't worry about all the punctuation.

Additional cleaning is an area where I could improve this project in the future. I'm acknowledging that I'm feeding my neural network garbage and as such will get garbage back out.

Since I'm just copying the project from the internet, there is an argument to be made that I should just use the same data, but I want to have some personal flair.

It looks like my pre-processing does most of the same things as theirs. There's just the issue that the data needs to be reshaped. I'm going to see what I can do to get my data into the right shape and be left with an X and y variable that I can pass into a train_test_split.

These edits need to be made in the function in my `tools` module that loads the data: it should return the full DataFrame along with an X and a y value.
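
A rough sketch of the reshaping step, assuming every padded row already has the same length and that the recurrent layer expects input shaped `(samples, timesteps, 1)`. `to_model_arrays` is a hypothetical helper name, not the actual `tools` code, and the toy lists stand in for the `english_padded` and `french_padded` columns:

```python
import numpy as np


def to_model_arrays(english_padded, french_padded):
    """Stack padded id lists into 3-D arrays (hypothetical helper)."""
    # Each column holds one list of ids per phrase; stacking them gives
    # a 2-D array of shape (samples, timesteps).
    X = np.array(list(english_padded), dtype=np.int64)
    y = np.array(list(french_padded), dtype=np.int64)
    # Add a trailing axis: (samples, timesteps) -> (samples, timesteps, 1).
    return X[..., np.newaxis], y[..., np.newaxis]


# Toy rows padded with -1, mirroring the padded columns shown earlier.
toy_eng = [[8, 7, 5, -1], [17, 18, 16, 5]]
toy_fren = [[10, 7, 4, 11], [18, 21, 20, -1]]
X_toy, y_toy = to_model_arrays(toy_eng, toy_fren)
print(X_toy.shape, y_toy.shape)
```

Arrays in this form can be passed straight to `train_test_split`, which splits along the first axis.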

In [5]:
X_train_eng, X_test_eng, y_train_fren, y_test_fren = train_test_split(X_eng, y_fren)

In [14]:
decode_french(data['french_padded'][0], id2fren)

['new',
 'jersey',
 'est',
 'parfois',
 'calme',
 'pendant',
 'l',
 'automne',
 'et',
 'il',
 'est',
 'neigeux',
 'en',
 'avril']

### Model 1

This is a simple RNN that is just here to be a baseline.

In [3]:
def simple_model(input_shape, output_length, eng_vocab_size, french_vocab_size): 
    # Hyperparameters
    learning_rate = .005

    # Build the layers
    model = Sequential()
    model.add(GRU(256, input_shape=input_shape[1:], return_sequences=True))
    model.add(TimeDistributed(Dense(1024, activation='relu')))
    model.add(Dropout(0.5))
    model.add(TimeDistributed(Dense(french_vocab_size, activation='softmax'))) 

    # Compile model
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    return model

In [6]:
simple_rnn = simple_model(X_train_eng.shape, 60, len(id2eng), len(id2fren))

In [17]:
len(id2eng), len(id2fren)

(13482, 23242)

In [7]:
simple_rnn.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
gru_1 (GRU)                  (None, 60, 256)           198144    
_________________________________________________________________
time_distributed_1 (TimeDist (None, 60, 1024)          263168    
_________________________________________________________________
dropout_1 (Dropout)          (None, 60, 1024)          0         
_________________________________________________________________
time_distributed_2 (TimeDist (None, 60, 23242)         23823050  
Total params: 24,284,362
Trainable params: 24,284,362
Non-trainable params: 0
_________________________________________________________________


In [None]:
simple_rnn.fit(X_train_eng, y_train_fren, batch_size=1024, epochs=10, validation_split=.2)

Train on 170696 samples, validate on 42675 samples
Epoch 1/10


### First learning experience 
Well, it looks like my input and output have to be the same length. I have gone back and updated my data-processing function accordingly and will regenerate the dataset.

(213371, 50)
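
One way to enforce the matching lengths, sketched with plain NumPy. The pad value of -1 follows the padded columns shown earlier; `pad_to_length` and the toy sequences are illustrative assumptions rather than my actual updated function:

```python
import numpy as np


def pad_to_length(sequences, length, pad_value=-1):
    """Right-pad (or truncate) every sequence to exactly `length` ids."""
    padded = np.full((len(sequences), length), pad_value, dtype=np.int64)
    for row, seq in enumerate(sequences):
        clipped = seq[:length]
        padded[row, :len(clipped)] = clipped
    return padded


# Toy id sequences of different lengths for each language.
eng_ids = [[8, 7, 5], [17, 18, 16, 5, 19]]
fren_ids = [[10, 7, 4, 11], [18, 21]]

# Use one shared length so the input and output timesteps match.
shared_length = max(len(seq) for seq in eng_ids + fren_ids)
X_toy = pad_to_length(eng_ids, shared_length)
y_toy = pad_to_length(fren_ids, shared_length)
print(X_toy.shape, y_toy.shape)
```

Computing the length once across both languages is what keeps the two arrays aligned; padding each language to its own maximum would reintroduce the mismatch.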