## Convolutional NN to classify govuk content to level2 taxons

Based on:
https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html

### Load requirements and data

In [20]:
import pandas as pd
import numpy as np
import os
from datetime import datetime
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers import (Embedding, Input, Dense, 
                          Activation, Conv1D, MaxPooling1D, Flatten)
from keras.models import Model, Sequential
from keras.optimizers import rmsprop
from keras.callbacks import TensorBoard
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MultiLabelBinarizer


In [2]:
!which python3

/Users/ellieking/Documents/tag_tax/govuk-taxonomy-supervised-learning/tax_SL/bin/python3


In [3]:
labelled_level2 = pd.read_csv('../../../data/labelled_level2.csv', dtype=object)

## Hyperparameters

In [4]:
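MAX_NB_WORDS = 20000  # same value as the Tokenizer's num_words below
MAX_SEQUENCE_LENGTH = 1000
EMBEDDING_DIM = 100  # matches the glove.6B.100d.txt vectors used below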

### Create target/Y 

Note: when using the categorical_crossentropy loss, your targets should be in categorical format (e.g. if you have 10 classes, the target for each sample should be a 10-dimensional vector that is all zeros except for a 1 at the index corresponding to the class of the sample). To convert integer targets into categorical targets, you can use the Keras utility to_categorical.
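As a rough illustration of the categorical format described above, here is a minimal NumPy sketch of one-hot encoding. It is a hand-rolled stand-in, not the Keras to_categorical implementation; the function name one_hot and the toy codes are made up for this example.

```python
import numpy as np

def one_hot(codes, num_classes=None):
    """Turn integer class codes into one-hot rows (illustrative helper)."""
    codes = np.asarray(codes, dtype=int)
    if num_classes is None:
        # assume classes are numbered 0..max(code)
        num_classes = int(codes.max()) + 1
    encoded = np.zeros((codes.shape[0], num_classes), dtype="float32")
    # put a single 1 in each row, at the column given by that row's code
    encoded[np.arange(codes.shape[0]), codes] = 1.0
    return encoded

toy_codes = [0, 2, 1, 2]  # made-up class codes for four samples, three classes
toy_targets = one_hot(toy_codes)
print(toy_targets.shape)
print(toy_targets)
```

Each row sums to 1, which is what makes this format suitable for single-label problems only.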

In [26]:
labelled_level2['level2taxon'] = labelled_level2['level2taxon'].astype('category')

labels = labelled_level2['level2taxon'].cat.codes 


In multilabel learning, the joint set of binary classification tasks is expressed with a label binary indicator array: each sample is one row of a 2D array of shape (n_samples, n_classes) with binary values,  
where the ones (the non-zero elements) correspond to the subset of labels for that sample.  
An array such as np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0]]) represents label 0 in the first sample, labels 1 and 2 in the second sample, and no labels in the third sample.  
Producing multilabel data as a list of sets of labels may be more intuitive. The MultiLabelBinarizer transformer can be used to convert between a collection of collections of labels and the indicator format.
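The conversion from a list of label sets to the indicator format can be sketched by hand as follows. This is only an illustration of the idea, assuming made-up taxon names; it is not the scikit-learn MultiLabelBinarizer code, and to_indicator is a hypothetical helper.

```python
import numpy as np

def to_indicator(label_sets):
    """Build a (n_samples, n_classes) 0/1 array from a list of label sets."""
    classes = sorted(set().union(*label_sets))  # fixed column order
    column = {label: j for j, label in enumerate(classes)}
    indicator = np.zeros((len(label_sets), len(classes)), dtype=int)
    for row, labels in enumerate(label_sets):
        for label in labels:
            indicator[row, column[label]] = 1
    return classes, indicator

# made-up labels: one label, two labels, and no labels
toy_label_sets = [{"education"}, {"health", "transport"}, set()]
toy_classes, toy_indicator = to_indicator(toy_label_sets)
print(toy_classes)
print(toy_indicator)
```

The resulting array has the same pattern as the np.array example in the text: one label in the first row, two in the second, none in the third.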

#### First reshape wide: one column per level2taxon and one row per unique url

In [153]:
#take a smaller copy of the data to make pivoting easier (the full data would probably work too, as the other columns get dropped automatically)

level2_reduced = labelled_level2[['content_id', 'level2taxon', 'combined_text']].copy()

#how many level2taxons are there?
print(level2_reduced.level2taxon.nunique())

#count the number of taxons per content item into new column
level2_reduced['num_taxon_per_content'] = level2_reduced.groupby(["content_id"])['content_id'].transform("count")

#add 1 to the zero-indexed category codes so the numerical targets run from 1 to the number of level2taxons
level2_reduced['level2taxon_code'] = level2_reduced.level2taxon.astype('category').cat.codes + 1

440


In [155]:
#reshape to wide per taxon and keep the combined text so indexing is consistent when splitting X from Y

multilabel = (level2_reduced.pivot_table(index=['content_id', 'combined_text'], 
                  columns='level2taxon_code', 
                  values='num_taxon_per_content'))
print(level2_reduced.shape)
print(multilabel.shape)

list(multilabel.columns)

(173560, 5)
(114048, 440)


Preview of multilabel (the 10 rows displayed; combined_text is shortened here with "..."). Rows are indexed by content_id and combined_text, and columns are level2taxon_code 1 to 440. The display shows codes 1-10 and 431-440, and every displayed cell is empty for these rows.

| content_id | combined_text |
|---|---|
| 00029fa4-9b60-4285-898c-85ae8a6367f5 | emma jones - small business crown representative as small business crown representative emma is keen to help uk smes win government business. ... |
| 00037b70-5b08-44c2-bf0a-fa8eb636a60b | land remediation: bringing brownfield sites back to use brochure showing uk expertise in land remediation ... |
| 00037ee5-7b5e-452d-a233-af2c134f5bce | steps 2 success:ni statistics from october 2014 to september 2016 ... |
| 0004c63d-ae16-432a-bb35-c0f949b1e27c | student support applications for higher education: september 2016 ... |
| 0005ac76-50fe-42f1-8168-8b6fc046e40f | advice for building owners: large-scale wall system test 2 ... |
| 0006811c-ad80-4cd0-a732-04cc983ec8c2 | on the mark: m&s works with four security consultants to upgrade cctv ... |
| 0006be3e-1180-4288-9d2d-2a2c69e9f5f2 | new government veterinary services blog launched ... |
| 0008f82f-9713-4074-8793-0d266d53930c | focused inspection of wakefield city academy trust ... |
| 000aa34d-c3c0-4176-ad8a-50e801056df1 | oxford flood alleviation scheme online consultation opens up ... |
| 000b6a38-c69a-4ac9-918b-717a79cbdad2 | provisional uk greenhouse gas emissions national statistics 2012 ... |


In [143]:
#convert the num_taxon_per_content values to 1 where there was an entry for this taxon and this content_id, 0 otherwise
binary_multilabel = multilabel.notnull().astype('int')

In [144]:
#check the shape of the array that the columns will be converted to
binary_multilabel[list(binary_multilabel.columns)].values.shape

(114048, 440)

In [145]:
#convert columns to an array. Each row represents a content item, each column an individual taxon
binary_multilabel = binary_multilabel[list(binary_multilabel.columns)].values
binary_multilabel[2]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0,

In [149]:
type(binary_multilabel)

numpy.ndarray

In [146]:
# mlb = MultiLabelBinarizer()
# y = mlb.fit_transform(binary_multilabel)
# y.shape

In [5]:
#Use this for singlelabel problems
labels = to_categorical(np.asarray(labels))

print('Shape of label tensor:', labels.shape)

labels

Shape of label tensor: (173560, 440)


array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

### Create language data/X

Format our text samples and labels into tensors that can be fed into a neural network. To do this, we will rely on the Keras utilities keras.preprocessing.text.Tokenizer and keras.preprocessing.sequence.pad_sequences.

In [161]:
multilabel.index.names
texts = multilabel.index.get_level_values('combined_text')
texts.shape

(114048,)

In [162]:
tokenizer = Tokenizer(num_words=20000)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts) #yield one sequence per input text

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen= MAX_SEQUENCE_LENGTH) #MAX_SEQUENCE_LENGTH

Found 213132 unique tokens.
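To make the tokenizing and padding steps above more concrete, here is a pure-Python sketch of the same idea. It is a simplification under stated assumptions: it splits on whitespace only (the Keras Tokenizer also filters punctuation), breaks frequency ties alphabetically (a choice made for this example), and mimics the pad_sequences defaults of padding and truncating at the start of each sequence. The helper names and toy texts are made up.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent, 0 is kept for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def texts_to_padded(texts, word_index, maxlen):
    """Map each text to a fixed-length list of word indices."""
    rows = []
    for text in texts:
        seq = [word_index[w] for w in text.lower().split() if w in word_index]
        seq = seq[-maxlen:]                            # keep the last maxlen tokens
        rows.append([0] * (maxlen - len(seq)) + seq)   # left-pad with zeros
    return rows

toy_texts = ["flood scheme opens", "flood scheme consultation opens in oxford"]
toy_index = fit_word_index(toy_texts)
toy_data = texts_to_padded(toy_texts, toy_index, maxlen=4)
print(toy_index)
print(toy_data)
```

Short texts gain leading zeros and long texts lose their earliest tokens, so every row ends up with the same length.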


In [163]:
print('Shape of label tensor:', binary_multilabel.shape)
print('Shape of data tensor:', data.shape)

Shape of label tensor: (114048, 440)
Shape of data tensor: (114048, 1000)


### Data split

In [164]:
# split the data into a training set and a validation set
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = binary_multilabel[indices]
nb_validation_samples = int(0.2 * data.shape[0]) #validation split
print('nb_validationsamples:', nb_validation_samples)

x_train = data[:-nb_validation_samples]
print('Shape of x_train:', x_train.shape)
y_train = labels[:-nb_validation_samples]
print('Shape of y_train:', y_train.shape)
x_val = data[-nb_validation_samples:]
print('Shape of x_val:', x_val.shape)
y_val = labels[-nb_validation_samples:]
print('Shape of y_val:', y_val.shape)

nb_validationsamples: 22809
Shape of x_train: (91239, 1000)
Shape of y_train: (91239, 440)
Shape of x_val: (22809, 1000)
Shape of y_val: (22809, 440)


### Preparing the Embedding layer
Compute an index mapping words to known embeddings by parsing the data dump of pre-trained embeddings.
NB: stopwords haven't been removed yet.

In [165]:
embeddings_index = {}
# each line of the GloVe file is a word followed by its vector components
with open(os.path.join('../../../data/', 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.


Compute the embedding matrix using the embeddings_index dict and word_index.

In [166]:
embedding_matrix = np.zeros((len(word_index) + 1, 100))# used 6B.100d.txt
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
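The same two steps can be tried on toy data to see why the matrix has len(word_index) + 1 rows: the Keras Tokenizer numbers words from 1, so row 0 is left for the padding value. The sketch below uses made-up 3-dimensional vectors and a made-up word_index rather than the real GloVe file.

```python
import numpy as np

# made-up lines in the GloVe text format: word followed by its components
toy_glove_lines = [
    "flood 0.1 0.2 0.3",
    "scheme 0.4 0.5 0.6",
]
# made-up tokenizer index; "wcat" has no pre-trained vector
toy_word_index = {"flood": 1, "scheme": 2, "wcat": 3}

toy_embeddings_index = {}
for line in toy_glove_lines:
    word, *coefs = line.split()
    toy_embeddings_index[word] = np.asarray(coefs, dtype="float32")

# one row per word index, plus row 0 for padding
toy_matrix = np.zeros((len(toy_word_index) + 1, 3))
for word, i in toy_word_index.items():
    vector = toy_embeddings_index.get(word)
    if vector is not None:
        toy_matrix[i] = vector

print(toy_matrix)
```

Rows 0 (padding) and 3 (the word without a pre-trained vector) stay all-zero, while rows 1 and 2 hold the copied vectors.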

Load this embedding matrix into an Embedding layer. Note that we set trainable=False to prevent the weights from being updated during training.

In [167]:
embedding_layer = Embedding(len(word_index) + 1,
                            100, # used 6B.100d.txt
                            weights=[embedding_matrix],
                            input_length= MAX_SEQUENCE_LENGTH, #MAX_SEQUENCE LENGTH
                            trainable=False)

An Embedding layer should be fed sequences of integers, i.e. a 2D input of shape (samples, indices). These input sequences should be padded so that they all have the same length in a batch of input data (although an Embedding layer is capable of processing sequences of heterogeneous length, if you don't pass an explicit input_length argument to the layer).

All that the Embedding layer does is to map the integer inputs to the vectors found at the corresponding index in the embedding matrix, i.e. the sequence [1, 2] would be converted to [embeddings[1], embeddings[2]]. This means that the output of the Embedding layer will be a 3D tensor of shape (samples, sequence_length, embedding_dim).
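The lookup described above can be mimicked with plain NumPy indexing. This is only a sketch of the mapping, with a made-up 4-word, 3-dimensional matrix; it is not how Keras implements the layer internally.

```python
import numpy as np

# made-up embedding matrix: 4 word indices, 3 dimensions each
toy_embeddings = np.arange(12, dtype="float32").reshape(4, 3)

# a batch of 2 samples, each a sequence of 2 word indices
toy_batch = np.array([[1, 2], [0, 3]])

# integer-array indexing replaces every index with its row of the matrix
toy_output = toy_embeddings[toy_batch]
print(toy_output.shape)  # (samples, sequence_length, embedding_dim)
```

The sequence [1, 2] becomes [toy_embeddings[1], toy_embeddings[2]], and the batch as a whole becomes a 3D array.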

## Training a 1D convnet

In [168]:
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32') #MAX_SEQUENCE_LENGTH
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(35)(x)  # global max pooling
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
# one sigmoid unit per taxon column (440 here); len(x_train) would create 91239 units
preds = Dense(y_train.shape[1], activation='sigmoid')(x)

model = Model(sequence_input, preds)
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

Metric values are recorded at the end of each epoch on the training dataset. If a validation dataset is also provided, then the metric recorded is also calculated for the validation dataset.

All metrics are reported in verbose output and in the history object returned from calling the fit() function. In both cases, the name of the metric function is used as the key for the metric values. In the case of metrics for the validation dataset, the “val_” prefix is added to the key.

In [169]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 1000)              0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 1000, 100)         21313300  
_________________________________________________________________
conv1d_4 (Conv1D)            (None, 996, 128)          64128     
_________________________________________________________________
max_pooling1d_4 (MaxPooling1 (None, 199, 128)          0         
_________________________________________________________________
conv1d_5 (Conv1D)            (None, 195, 128)          82048     
_________________________________________________________________
max_pooling1d_5 (MaxPooling1 (None, 39, 128)           0         
_________________________________________________________________
conv1d_6 (Conv1D)            (None, 35, 128)           82048     
__________
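The output lengths and parameter counts in the summary above can be reproduced with a little arithmetic. The sketch below assumes the Keras defaults used in this model (stride-1 convolutions with 'valid' padding, and pooling with a stride equal to the pool size); the helper names are made up for this example.

```python
def conv1d_valid_length(length, kernel_size):
    """Output length of a stride-1 Conv1D with 'valid' padding."""
    return length - kernel_size + 1

def max_pool_length(length, pool_size):
    """Output length of MaxPooling1D when the stride equals the pool size."""
    return length // pool_size

def conv1d_params(kernel_size, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

length = 1000  # MAX_SEQUENCE_LENGTH
trace = [length]
for kernel_size, pool_size in [(5, 5), (5, 5), (5, 35)]:
    length = conv1d_valid_length(length, kernel_size)
    trace.append(length)
    length = max_pool_length(length, pool_size)
    trace.append(length)

print(trace)
print(conv1d_params(5, 100, 128), conv1d_params(5, 128, 128))
```

The last pooling layer has a pool size equal to the remaining length (35), which is why it acts as a global max pooling and leaves a single position per filter.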

In [170]:
logname = 'tf_logs/allgovuk_' + str(datetime.now())

tb = TensorBoard(
    log_dir=logname, histogram_freq=1, write_graph=True, write_images=True)

#tbCallBack = TensorBoard(log_dir='./Graph', histogram_freq=1, write_graph=True, write_images=True)

In [171]:
model.fit(
    x_train, y_train, 
    validation_data=(x_val, y_val), 
    epochs=4, batch_size=128, 
    callbacks=[tb]
)

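ValueError: Error when checking target: expected dense_4 to have shape (None, 91239) but got array with shape (91239, 440)

This error was raised because the output layer in that run was built with len(x_train) units (91239, the number of training samples) rather than one unit per taxon. The final Dense layer needs y_train.shape[1] (440) units to match the label matrix.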
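The ValueError above comes down to the width of the final layer not matching the number of label columns. One inexpensive way to catch this before training is a small shape check such as the sketch below; check_output_units is a hypothetical helper and the label array is a zero-filled placeholder, not the real data.

```python
import numpy as np

def check_output_units(n_units, y):
    """Raise early if the final layer width does not match the label matrix."""
    n_labels = y.shape[1]
    if n_units != n_labels:
        raise ValueError(
            "output layer has %d units but targets have %d columns"
            % (n_units, n_labels)
        )
    return n_labels

toy_y = np.zeros((6, 440), dtype=int)  # 6 placeholder samples, 440 taxon columns

# correct: one unit per label column
print(check_output_units(toy_y.shape[1], toy_y))

# the mistake from the run above: using the number of samples instead
try:
    check_output_units(len(toy_y), toy_y)
except ValueError as err:
    print("caught:", err)
```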

Improve accuracy by training longer with some regularization mechanism (such as dropout) or by fine-tuning the Embedding layer.

We can also test how well we would have performed by not using pre-trained word embeddings, but instead initializing our Embedding layer from scratch and learning its weights during training. We just need to replace our Embedding layer with the following:

embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH)
                            
In the Keras blog post this notebook is based on, this approach only reached 90% validation accuracy after 2 epochs, less than the model with pre-trained embeddings reached in just one epoch, so the pre-trained embeddings were clearly contributing something there. In general, using pre-trained embeddings is relevant for natural language processing tasks where little training data is available (functionally, the embeddings act as an injection of outside information which might prove useful for your model).

