# Text Classification and word Embeddings 



I. Classification on text data using NNs


II. Unsupervised learning of word embeddings


III. Joint learning of embeddings with a predictive model 


IV. Learning Convolutional models with learned embeddings


# I. Classifying text data with NNs


1. Getting and preparing Reuters Data

2. Using a standard MLP

3. Using a Convolutional model. Non ?

## I.1. Getting and preparing Reuters Data


### Introduction 

Dataset of 11,228 newswires from Reuters, labeled over 46 topics. As with the IMDB dataset, each wire is encoded as a sequence of word indexes (same conventions).

The documents in the Reuters-21578 collection appeared on the Reuters newswire in 1987. The documents were assembled and indexed with categories by personnel from Reuters Ltd. (Sam Dobbins, Mike Topliss, Steve Weinstein) and Carnegie Group, Inc. (Peggy Andersen, Monica Cellio, Phil Hayes, Laura Knecht, Irene Nirenburg) in 1987. 

In 1990, the documents were made available by Reuters and CGI for research purposes to the Information Retrieval Laboratory (W. Bruce Croft, Director) of the Computer and Information Science Department at the University of Massachusetts at Amherst. Formatting of the documents and production of associated data files was done in 1990 by David D. Lewis and Stephen Harding at the Information Retrieval Laboratory. 

* Links
    * [UCI link](https://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection)
    * [Keras utility](https://keras.io/datasets/)

In [44]:
from __future__ import print_function
import numpy as np
np.random.seed(1337)  # for reproducibility

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import Convolution1D, GlobalMaxPooling1D
from keras.datasets import reuters
from keras.utils import np_utils
from keras.preprocessing.text import Tokenizer

# set parameters:
max_words = 500
maxlen = 400
batch_size = 32
embedding_dims = 50
nb_filter = 250
filter_length = 3
hidden_dims = 250
nb_epoch = 2

print('Loading data...')

(X_train_o, y_train_o), (X_test_o, y_test_o) = reuters.load_data(path="reuters.npz",
                                                         num_words=max_words,
                                                         skip_top=0,
                                                         maxlen=None,
                                                         test_split=0.2,
                                                         seed=113,
                                                         start_char=1,
                                                         oov_char=2,
                                                         index_from=3)

print('Loading data...')
#(X_train, y_train), (X_test, y_test) = reuters.load_data(nb_words=max_words, test_split=0.2)
print(len(X_train_o), 'train sequences')
print(len(X_test_o), 'test sequences')


nb_classes = np.max(y_train_o) + 1
print(nb_classes, 'classes')

print('Vectorizing sequence data...')
tokenizer_R = Tokenizer(nb_words=max_words)
X_train_R = tokenizer_R.sequences_to_matrix(X_train_o, mode='binary')
X_test_R = tokenizer_R.sequences_to_matrix(X_test_o, mode='binary')
print('X_train shape:', X_train_R.shape)
print('X_test shape:', X_test_R.shape)

print('Convert class vector to binary class matrix (for use with categorical_crossentropy)')
Y_train_R = np_utils.to_categorical(y_train_o, nb_classes)
Y_test_R = np_utils.to_categorical(y_test_o, nb_classes)
print('Y_train shape:', Y_train_R.shape)
print('Y_test shape:', Y_test_R.shape)



Loading data...
Loading data...
8982 train sequences
2246 test sequences
46 classes
Vectorizing sequence data...
X_train shape: (8982, 500)
X_test shape: (2246, 500)
Convert class vector to binary class matrix (for use with categorical_crossentropy)
Y_train shape: (8982, 46)
Y_test shape: (2246, 46)


In [2]:
print(X_train_R[0])
print(Y_train_R[0])

[ 0.  1.  1.  0.  1.  1.  1.  1.  1.  1.  1.  1.  1.  0.  0.  1.  1.  1.
  0.  1.  0.  0.  1.  0.  0.  1.  1.  0.  0.  1.  1.  0.  1.  0.  0.  0.
  0.  0.  0.  1.  0.  0.  0.  1.  1.  0.  0.  0.  1.  1.  0.  0.  1.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  0.  0.  0.  0.  1.
  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.
  0.  1.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.
  0.  0.  0.  0.  0.  0.  1.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  1.  1.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  1.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0

### To Do 

* Why are data samples of dimension 500 ? What do these 500 values stand for ? 
* What is the content of word_index ? 
* Change the code above so that data structures and associated variables are suffixed with R for Reuters  

#### Answers

- 500 stands for the number of words in our vocabulary. For each element of X_train_R (for instance) we have a matrix of presence/absence of each word of the dictionnary in our sentence.
- word_index describes the position of each word in our vocabulary

## I.2. Classification with a standard MLP

### To do 
 
Build a MLP for dealing with the Reuters data and evaluate its performance. 

When building your MLP, explain the number of parameters per layer.

In [3]:
model = Sequential()
model.add(Dense(64, input_dim=500, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(46, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adadelta',
              metrics=['accuracy'])

model.fit(X_train_R, Y_train_R, nb_epoch=8, batch_size=16, validation_split = 0.3)
score = model.evaluate(X_train_R, Y_train_R, batch_size=16)
print('Test score:', score[0])
print('Test accuracy:', score[1])



Train on 6287 samples, validate on 2695 samples
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8
Test accuracy: 0.791694500125


- Input dim = 500 car il y a 500 mots, donc paramètres, dans notre vocabulaire.
- Output dim = 46 car il y a 46 thèmes entre lesquels on souhaite classer nos phrases.
- Pour le hidden layer, on prend 64 assez arbitrairement, on suit l'heuristique suivante : le nombre de neurones diminue de couche en couche.

## II.  Unsupervised learning word embeddings

0. Loading Newsgroup data 

1. Learning embeddings with the Skipgram model 

2. Exploring Glove Embeddings

 i. Visualization

 ii. Closest words query

 iii. Compositional queries

 iv. Evaluating embeddings
 

### II.1. Chargement des données des Newsgroups

Le code suivant vous permet les données classiques 20_Newsgroup (http://qwone.com/~jason/20Newsgroups/) et de les mettre en forme pour l'apprentissage de représentations des mots à l'aide d'un modèle type skipgram. 

Changez le code ci-dessous pour suffixer les structures de données et les variables associées par N pour Newsgroup 


In [2]:
import numpy as np
np.random.seed(13)

from keras.models import Sequential, Model
from keras.layers import Embedding, Reshape, Activation, Input
from keras.layers.merge import Dot
from keras.utils import np_utils
from keras.utils.data_utils import get_file
from keras.preprocessing.text import Tokenizer

import os
import sys


In [5]:
BASE_DIR = '/users/usrlocal/artieres/data'
TEXT_DATA_DIR = BASE_DIR + '/20_newsgroup/'
MAX_SEQUENCE_LENGTH = 40
MAX_NB_WORDS = 500
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.2


# prepare text samples and their labels
print('Processing text dataset')

texts = []  # list of text samples
for name in sorted(os.listdir(TEXT_DATA_DIR)):
    path = os.path.join(TEXT_DATA_DIR, name)
    if os.path.isdir(path):
        for fname in sorted(os.listdir(path)):
            if fname.isdigit():
                fpath = os.path.join(path, fname)
                if sys.version_info < (3,):
                    f = open(fpath)
                else:
                    f = open(fpath, encoding='latin-1')
                texts.append(f.read())
                f.close()

print('Found %s texts.' % len(texts))

texts = texts[0:100]

# finally, vectorize the text samples into a 2D integer tensor
tokenizer_N = Tokenizer(nb_words=MAX_NB_WORDS)
tokenizer_N.fit_on_texts(texts)
sequences_N = tokenizer_N.texts_to_sequences(texts)

word_index_N = tokenizer_N.word_index
print('Found %s unique tokens.' % len(word_index_N))

#data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)


V_N = len(tokenizer_N.word_index) + 1


Processing text dataset
Found 19997 texts.
Found 5722 unique tokens.


####  You may alternative use following data


### II.2. To do: Learning embeddings with the CBOW model 

* Design and learn a CBOW model able to learn from above loaded data.

* Once trained
 * get the embedding from  the embedding layer 
 * define a function that finds the word with the closest embedding to a given word. try it. 


You may use code below to prepare the data and learn the model

In [6]:
def generate_data(corpus, window_size, V):
    maxlen = window_size*2
    #print (len(corpus))
    for words in corpus:
        L = len(words)
        #print (L)
        for index, word in enumerate(words):
            contexts = []
            labels   = []            
            s = index - window_size
            e = index + window_size + 1
            
            contexts.append([words[i] for i in range(s, e) if 0 <= i < L and i != index])
            labels.append(word)
            #print (contexts)
            x = sequence.pad_sequences(contexts, maxlen=maxlen)
            y = np_utils.to_categorical(labels, V)
            yield (x, y)

In [7]:
nb_samples = sum(len(s) for s in sequences_N)
dim = 30
window_size = 2
print (nb_samples, V_N)

30892 5723


#### Design your model here and call it cbow

In [3]:
from keras.layers import Lambda
import keras.backend as K
import theano
import numpy as np
from numpy import linalg as LA



In [9]:
cbow = Sequential()
cbow.add(Embedding(input_dim = V_N, output_dim = EMBEDDING_DIM, input_length = window_size*2))
cbow.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(EMBEDDING_DIM,)))
cbow.add(Dense(V_N, activation='softmax'))
cbow.compile(loss='categorical_crossentropy', optimizer='adadelta')

In [10]:
for ite in range(10):
    loss = 0.
    for x, y in generate_data(sequences_N, window_size, V_N):
        loss += cbow.train_on_batch(x, y)

    print(ite, loss)

0 229305.919884
1 201409.043353
2 198520.808911
3 196539.605292
4 194274.746633
5 192079.066923
6 190110.671973
7 188305.981887
8 186577.289324
9 185127.484783


In [11]:
vectors = cbow.get_weights()[0]

embeddings_index = {}
for word, i in word_index_N.items():
    embeddings_index[word] = np.asarray(vectors[i,:], dtype='float32')

In [12]:
embeddings_index['oceans']

array([ -1.09864585e-02,   1.86755918e-02,   1.23374462e-02,
        -3.92579064e-02,  -2.02844869e-02,   2.61343457e-02,
        -4.45602648e-02,  -4.20271158e-02,   4.43270914e-02,
        -3.31120566e-03,  -6.40191883e-03,   5.13488054e-03,
        -1.70851722e-02,  -4.76726294e-02,   3.76207568e-02,
         2.08579376e-03,  -3.27934846e-02,  -3.26588303e-02,
        -4.23085093e-02,  -2.75608301e-02,  -3.64290848e-02,
        -3.58486995e-02,  -1.99965723e-02,   1.87807158e-03,
        -3.24599743e-02,   3.75561155e-02,   3.47062089e-02,
         3.60765718e-02,  -1.19230524e-02,  -4.90283370e-02,
         2.26186030e-02,  -2.67401338e-05,   4.88229729e-02,
        -1.23007409e-02,  -2.66130455e-02,   2.96876170e-02,
        -2.43330244e-02,   4.08882014e-02,  -3.66538651e-02,
         4.50231880e-03,   3.45428847e-02,   7.54807144e-03,
        -1.13337748e-02,   8.48339871e-03,  -2.18051672e-02,
         4.62532043e-05,   4.90461625e-02,   1.70347579e-02,
         4.48462851e-02,

In [13]:
for word, i in word_index_N.items():
    print(word)

writings
'created
four
looking
superficially
regional
screaming
prize
solid
oooh
commented
fritter
darice
charter
tired
miller
second
273
276
errors
usenet
26390
fossil
increasing
evolutionism
hero
herb
here
atoms
china
natured
cyclical
kids
k
reports
replicators
153552
i'd
military
i'm
criticism
golden
divide
explained
summons
psychological
unix
univ
dna
42
cannibal
music
therefore
until
benedikt
holy
successful
brings
pelletier
hurt
90
93
hold
95
circumstances
pursue
accomplishment
16b9510654
1p8v1ainn9e9
concepts
centralization
example
mindedness
wand
unjust
currency
want
absolute
damage
how
hot
preferable
cadence
oceans
funny
n4hy
outlawed
wrong
types
512
effective
keeps
195807
8018
65882
wc1r
fit
fiu
survivors
conner
hidden
easier
reasonability
lamb's
effects
schools
blink
represents
acme
arrow
interfering
definitional
financial
series
allah
allan
parasites
cathedral
whit
golen
netsys
mayan
rz
re
encourage
ra
dsi
dsg
millions
rk
foundation
snake2
sensory
attributions
1pqifj
4627
e

In [14]:
embeddings_index

{'writings': array([ 0.0075023 ,  0.00913991, -0.01408539,  0.03871191, -0.00173422,
        -0.00801544, -0.0300933 ,  0.0371162 ,  0.00426503, -0.00012974,
        -0.04542939, -0.00139225,  0.0450013 ,  0.02773816, -0.02893012,
         0.01398372, -0.00156714,  0.02386716, -0.03369286,  0.03158228,
        -0.01283883,  0.049577  , -0.03492011, -0.04114062, -0.04289835,
         0.03361317,  0.01210109, -0.04116954, -0.0228798 , -0.00721561,
         0.01114804,  0.02684258,  0.04504743, -0.0200974 ,  0.03787366,
         0.00179503,  0.0180347 ,  0.01486217,  0.02898672, -0.02671448,
        -0.03111023, -0.04891801,  0.04599779,  0.0141643 , -0.00888517,
        -0.01957221, -0.01257845, -0.01307008,  0.03625992, -0.00448072,
         0.01401353, -0.01347561, -0.03057692,  0.04440514, -0.00381606,
        -0.01871189,  0.02941508, -0.03473438, -0.00940241, -0.00823282,
         0.00966956,  0.02851179, -0.00498676, -0.04243663,  0.00233134,
        -0.00872574, -0.02929087,  0.00

In [15]:
def dot(u, v):
    s = sum(u[:,]*v[:,])
    return s

In [16]:
def most_similar(word, corpus):
    sim = []
    for words, i in word_index_N.items():
        if words != word:
            similarity = dot(embeddings_index[word], embeddings_index[words])
            product = LA.norm(embeddings_index[word]) * LA.norm(embeddings_index[words])
            similarity = similarity / product
            sim.append(similarity)
    similar = np.argmax(sim)
    return word_index_N.items()[similar]

In [17]:
most_similar('oceans', sequences_N)

('inflicted', 4218)

###  II.3. Exploring Glove Embeddings 


#### II.3.i. Geting the embeddings

You may download the 880Mo file [here](http://nlp.stanford.edu/data/glove.6B.zip)

=> But these are actually downloaded there : /users/usrlocal/artieres/...

In [4]:
### Importing GLOVE Embeddings

BASE_DIR = '/users/usrlocal/artieres/data'
GLOVE_DIR = BASE_DIR + '/glove.6B/'
# BASE_DIR = os.getcwd()
# GLOVE_DIR = BASE_DIR + '/glove/'
MAX_SEQUENCE_LENGTH = 400
MAX_NB_WORDS = 500
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.2

# first, build index mapping words in the embeddings set
# to their embedding vector

print('Indexing word vectors.')

embeddings_index = {}
f = open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'))
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

print('Found %s word vectors.' % len(embeddings_index))


Indexing word vectors.
Found 400000 word vectors.


In [13]:
# Quelques exemples
print (embeddings_index["last"])
print (embeddings_index["last"].shape)

[ 0.23745     0.1241      0.62519997 -0.46639001  0.12238    -0.03638
  0.63611001  0.59246999 -0.51344001  0.06342     0.13813999  0.15193
  0.094313   -0.20547999  0.023568    0.091028   -0.089488   -0.29624
 -0.52465999 -0.028628    0.87168002 -0.16306999 -0.06663     1.02199996
  0.44789001 -0.48008001 -0.17092    -0.30576    -0.061999   -0.23902
  0.25187999 -0.018878    0.084057   -0.045508   -0.38034001  0.45910999
 -0.4122      0.39721999 -0.33249     0.031747   -0.54667997 -0.093553
  0.67786002  0.006941    0.20717999  0.075996    0.49682    -0.92707002
  0.3506     -1.09860003 -0.074723   -0.62458003  0.31856999  1.2155
 -0.47095999 -2.94219995 -0.76529002 -0.11556     1.90059996  1.01520002
 -0.49471     0.61458999 -0.25465    -0.0065357   0.17702     0.14239
  0.18148001  0.69493002 -0.028739   -0.083564    0.009709   -0.18764
 -0.72948998 -0.26976001 -0.47262001  0.23738    -0.38301    -0.017983
 -1.45959997  0.34242001  0.83222997 -0.1178     -0.28637001  0.35349
 -1.215

####  II.3.ii. To do: Visualizing embeddings of Reuters vocabulary

* The goal is to visualize the Glove embeddings of Reuters words 

* You must first build an embedding matrix called embedding_matrix with all Glove embeddings of the Reuters vocabulary.
Beginning of the code is provided below.

Une fois celle-ci créée vous pourrez utiliser le code suivant pour déterminer des projections en 2D des donnéd spour les visualiser.
      
 * from sklearn.manifold import TSNE

 * model_sne = TSNE(n_components=2, random_state=0)

 * model_sne.fit_transform(embedding_matrix) 

In [45]:
word_index = reuters.get_word_index(path="reuters_word_index.json")

print (word_index["last"], len(word_index))

NB_Words = len(word_index)
nb_words = min(NB_Words, len(word_index))
embedding_matrix = np.zeros((nb_words, EMBEDDING_DIM))
label_Embeddings = []

for word, i in word_index.items():
    if i >= nb_words:
        continue
    embedding_vector = embeddings_index.get(word)

    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        label_Embeddings.append(word)


51 30979


In [None]:
from sklearn.manifold import TSNE
model_sne = TSNE(n_components=2, random_state=0)
model_sne.fit_transform(embedding_matrix)

####  II.3.iii.  Finding closest words in the Embedding's space

#### To do 

* Define a function that returns the nbest clostest words to a given word w in the embedding space. 

 * The signature of this function is:
 
  def closest(Embeddings, word, nbest):
  ...
 
 
 
 * The closeness between two words is measured in the embedding space as the cosinus similarity or the dot product beteen the embedding vectors of the two words.  

In [29]:
def closest(label_Embeddings, word, nbest):
    # nbest means the 'k' of 'k nearest neighbors'
    sim = []
    
    for words in label_Embeddings:
        if words != word:
            similarity = dot(embeddings_index[word], embeddings_index[words])
            product = LA.norm(embeddings_index[word]) * LA.norm(embeddings_index[words])
            similarity = similarity / product
            sim.append(similarity)
        
    similar = np.argmax(sim)
    
    for j in range(nbest):
        print (label_Embeddings[similar])
        del sim[similar]
        similar = np.argmax(sim)

In [30]:
closest(label_Embeddings, 'the', 3)

retreated
pari
one


#### II.2.3.iii. Evaluating the Embedding space (Just to know that it exist)

To evaluate the quality of a set of word embeddings one can use the WS-3531 dataset ([available here](https://github.com/k-kawakami/embedding-evaluation)). The file EN-WS-353-REL.txt contains on each line a pair of words followed by the average of a similarity rating by human judges. The
higher the rating, the more similar the judges though the words in the pair are. One way to evaluate the quality of word embeddings is to compute the correlation between the cosine similarity of the pairs and human ratings. Since
cosine similarities and judgements are on a different scale, these need to be normalized.



#### II.2.3.iv. Exploring compositional issues

##### To do: Define a funcion enabling that takes 3 wrds, w1, w2, w3 and that finds the closest word embedding to w1 + (w2-w3). Test this function on "king", "woman", "man"


In [31]:
def comp_closest(label_Embeddings, w1, w2, w3):
    sim = []
    embeddings_index_w = embeddings_index[w1] + (embeddings_index[w2] - embeddings_index[w3])
    for words in label_Embeddings:
        if words != w1 and words != w2 and words != w3:
            similarity = dot(embeddings_index_w, embeddings_index[words])
            product = LA.norm(embeddings_index_w) * LA.norm(embeddings_index[words])
            similarity = similarity / product
            sim.append(similarity)
    similar = np.argmax(sim)
    return label_Embeddings[similar]

In [32]:
comp_closest(label_Embeddings, 'king', 'woman', 'man')

u'pope'

In [33]:
comp_closest(label_Embeddings, 'man', 'queen', 'woman')

u'king'

### Deeper understanding 

* [If you want to know more on Glove embeddings](http://nlp.stanford.edu/pubs/glove.pdf)

* A very popular model for learning embedding is the [Word2vec model](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf).

## III. Using a CNN with an Embedding layer

0. Preparing data

1. Using a CNN with an embedding layer on Reuters data

2. Using the same modeling on NewsGroup data

3. Comparing embeddings

 i. Learned embedding on Reuters's data ?

### III.1. Preparing the data in a suitable way

To use data in a convolutional model one uses an embedding layer as input which transfoms input sentences in a series of vectors representing the input sequence of words. Unless recurrent models are used (seen later in the course) a neural network model cannot deal with inputs of varying size (e.g. sentences of different lengthes). To do so we have to tranforme the data so that each input has the same size. It is done with pad_sequences.

The code below does this job, fixing the maximum length of sequences, cutting long ones intio multiple pieces and  filling too short ones. 

In [46]:
from __future__ import print_function
import numpy as np
np.random.seed(1337)  # for reproducibility

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import Convolution1D, GlobalMaxPooling1D
from keras.datasets import reuters
from keras.utils import np_utils
from keras.preprocessing.text import Tokenizer

# set parameters:
max_words = 500
maxlen = 400
batch_size = 32
embedding_dims = 50
nb_filter = 250
filter_length = 3
hidden_dims = 250
nb_epoch = 2

print('Loading data...')

(X_train_o, y_train_o), (X_test_o, y_test_o) = reuters.load_data(path="reuters.npz",
                                                         num_words=max_words,
                                                         skip_top=0,
                                                         maxlen=None,
                                                         test_split=0.2,
                                                         seed=113,
                                                         start_char=1,
                                                         oov_char=2,
                                                         index_from=3)

print('Loading data...')
#(X_train, y_train), (X_test, y_test) = reuters.load_data(nb_words=max_words, test_split=0.2)
print(len(X_train_o), 'train sequences')
print(len(X_test_o), 'test sequences')


nb_classes = np.max(y_train_o) + 1
print(nb_classes, 'classes')

print('Vectorizing sequence data...')
tokenizer = Tokenizer(nb_words=max_words)
X_train = tokenizer.sequences_to_matrix(X_train_o, mode='binary')
X_test = tokenizer.sequences_to_matrix(X_test_o, mode='binary')
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

print('Convert class vector to binary class matrix (for use with categorical_crossentropy)')
Y_train = np_utils.to_categorical(y_train_o, nb_classes)
Y_test = np_utils.to_categorical(y_test_o, nb_classes)
print('Y_train shape:', Y_train.shape)
print('Y_test shape:', Y_test.shape)



Loading data...
Loading data...
8982 train sequences
2246 test sequences
46 classes
Vectorizing sequence data...
X_train shape: (8982, 500)
X_test shape: (2246, 500)
Convert class vector to binary class matrix (for use with categorical_crossentropy)
Y_train shape: (8982, 46)
Y_test shape: (2246, 46)


In [47]:
# set parameters:
max_words = 500
maxlen = 400


print('Pad sequences (samples x time)')
X_train = sequence.pad_sequences(X_train_o, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test_o, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

print('Maximum dans X_train', np.max(X_train))

print('Y_train shape:', Y_train.shape)
print('Y_test shape:', Y_test.shape)

Pad sequences (samples x time)
X_train shape: (8982, 400)
X_test shape: (2246, 400)
Maximum dans X_train 499
Y_train shape: (8982, 46)
Y_test shape: (2246, 46)


### III.2. To do: Design and learn a convolutional model

The model should take as input a sequence of word indices and process it through an embedding layer, a convolutional layer (convolution1D since it is only temporal convolution), a maxpooling layer, a dense hidden layer and a final classification layer for classifying documents into classes of documents (categories of the newsgroup data).  

 * Design the model, explain the number of parameters of its layers, learn it on data and report accuracy.


In [22]:
EMBEDDING_DIM = 100
nb_filters = 64
kernel_size = (3,3)

In [39]:
model = Sequential()
model.add(Embedding(max_words,
                    embedding_dims,
                    input_length=maxlen))
model.add(Convolution1D(nb_filter,
                 filter_length,
                 padding='valid',
                 activation='relu',
                 strides=1))
model.add(GlobalMaxPooling1D())
model.add(Dense(hidden_dims))
model.add(Activation('relu'))
model.add(Dense(nb_classes))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.summary()

model.fit(X_train, Y_train,
          batch_size=batch_size,
          epochs=nb_epoch)

score = model.evaluate(X_test, Y_test, verbose=2, batch_size=batch_size)

print (score)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_10 (Embedding)     (None, 400, 50)           25000     
_________________________________________________________________
conv1d_10 (Conv1D)           (None, 398, 250)          37750     
_________________________________________________________________
global_max_pooling1d_8 (Glob (None, 250)               0         
_________________________________________________________________
dense_15 (Dense)             (None, 250)               62750     
_________________________________________________________________
activation_15 (Activation)   (None, 250)               0         
_________________________________________________________________
dense_16 (Dense)             (None, 46)                11546     
Total params: 137,046
Trainable params: 137,046
Non-trainable params: 0
_________________________________________________________________
Epoc

Accuracy et loss sont plutôt bons !

## III. Building a CNN with an Embedding layer initialized with Glove embeddings


0. Reloading Reuters data

1. Initilizing embedding layer with Glove embeddings

2. Comparison of CNN with embeddings initiliazed from scratch / from Glove embeddings


### III.1. Reloading Reuters Data


In [48]:
# set parameters:
max_words = 500
maxlen = 400
batch_size = 32
embedding_dims = 50
nb_filter = 250
filter_length = 3
hidden_dims = 250
nb_epoch = 2

print('Loading data...')
(X_train_o, y_train_o), (X_test_o, y_test_o) = reuters.load_data(path="reuters.pkl",
                                                         nb_words=max_words,
                                                         skip_top=0,
                                                         maxlen=None,
                                                         test_split=0.2,
                                                         seed=113,
                                                         start_char=1,
                                                         oov_char=2,
                                                         index_from=3)

print('Loading data...')
#(X_train, y_train), (X_test, y_test) = reuters.load_data(nb_words=max_words, test_split=0.2)
print(len(X_train_o), 'train sequences')
print(len(X_test_o), 'test sequences')


nb_classes = np.max(y_train_o) + 1
print(nb_classes, 'classes')

# set parameters:
max_words = 500
maxlen = 400



print('Pad sequences (samples x time)')
X_train = sequence.pad_sequences(X_train_o, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test_o, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

print('Convert class vector to binary class matrix (for use with categorical_crossentropy)')
Y_train = np_utils.to_categorical(y_train_o, nb_classes)
Y_test = np_utils.to_categorical(y_test_o, nb_classes)
print('Y_train shape:', Y_train.shape)
print('Y_test shape:', Y_test.shape)

print('Maximum dans X_train', np.max(X_train))

print('Y_train shape:', Y_train.shape)
print('Y_test shape:', Y_test.shape)



Loading data...
Loading data...
8982 train sequences
2246 test sequences
46 classes
Pad sequences (samples x time)
X_train shape: (8982, 400)
X_test shape: (2246, 400)
Convert class vector to binary class matrix (for use with categorical_crossentropy)
Y_train shape: (8982, 46)
Y_test shape: (2246, 46)
Maximum dans X_train 499
Y_train shape: (8982, 46)
Y_test shape: (2246, 46)


### III.2. Designing and Learning the model 

* Define a convolutional model with a first embedding layer and initialize its weights using the Glove Embeddings of Reters words 

Note that you can initialize weights of a hidden layer by filling the parameter weights when creating the layer

* Compare the accuracy achieved when relearning the embeddings or when freezing these (this can be done by setting the parameter trainable=False when creating the layer).  

https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html


First of all : weights initialized

In [49]:
embeddings_index = {}
f = open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'))
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.


In [51]:
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

In [52]:
embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

In [64]:
from keras.layers import Input
from keras.layers import Conv1D, MaxPooling1D, Embedding
from keras.models import Model

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = GlobalMaxPooling1D()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(nb_classes, activation='softmax')(x)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_12 (InputLayer)        (None, 400)               0         
_________________________________________________________________
embedding_11 (Embedding)     (None, 400, 100)          3098000   
_________________________________________________________________
conv1d_25 (Conv1D)           (None, 396, 128)          64128     
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 79, 128)           0         
_________________________________________________________________
conv1d_26 (Conv1D)           (None, 75, 128)           82048     
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, 15, 128)           0         
_________________________________________________________________
conv1d_27 (Conv1D)           (None, 11, 128)           82048     
__________

In [65]:
model.fit(X_train, Y_train,
          batch_size=batch_size,
          epochs=nb_epoch)

score = model.evaluate(X_test, Y_test, verbose = 0, batch_size=batch_size)

print('Test score:', score[0])
print('Test accuracy:', score[1])

Epoch 1/2
Epoch 2/2
Test score: 1.621852722
Test accuracy: 0.594390026714


Second way to do : weights learned through training

In [66]:
embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH)

In [69]:
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = GlobalMaxPooling1D()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(nb_classes, activation='softmax')(x)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

model.fit(X_train, Y_train,
          batch_size=batch_size,
          epochs=nb_epoch)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x134731810>

In [70]:
score = model.evaluate(X_test, Y_test, verbose = 0, batch_size=batch_size)
print('Test score:', score[0])
print('Test accuracy:', score[1])

Test score: 1.40940414309
Test accuracy: 0.634016028548


Conclusion on results :

Le ré-entrainement donne de bons résultats, bien que ceux-ci ne soient pas excellents.