## Convolutional NN to classify govuk content to level2 taxons

Based on:
https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html

## To do:
- ~~Consider grouping very small classes (especially if too small for evaluation)~~
- ~~Split data into training, validation and test to avoid overfitting validation data during hyperparameter searches & model architecture changes~~
- ~~Try learning embeddings~~
- ~~Try changing pos_ratio~~
- Try implementing class_weights during model fit (does this do the same as the weighted binary cross entropy?)
- Work on tensorboard callbacks
- ~~Create dictionary of class indices to taxon names for viewing results~~
- ~~Check model architecture~~
- ~~consider relationship of training error to validation error - overfitting/bias?~~
- ~~train longer~~
- Try different max_sequence_length
- Check batch size is appropriate
- Also think about:
  - ~~regularization (e.g. dropout)~~ 
  - fine-tuning the Embedding layer

### Load requirements and data

TODO: edit requirement.txt to include only these packages, and leave out tensorflow because it conflicts with the tf install on AWS when running on GPU.

In [1]:
import pandas as pd
import numpy as np
import os
from datetime import datetime
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

from keras.utils import to_categorical, layer_utils, plot_model

from keras.layers import (Embedding, Input, Dense, Dropout, 
                          Activation, Conv1D, MaxPooling1D, Flatten, concatenate, Reshape)
from keras.models import Model, Sequential
from keras.optimizers import rmsprop
from keras.callbacks import TensorBoard, Callback, ModelCheckpoint
import keras.backend as K
from keras.losses import binary_crossentropy

from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score 
from sklearn.metrics import precision_recall_fscore_support, classification_report
from sklearn.utils import class_weight

import tensorflow as tf

import matplotlib.pyplot as plt
%matplotlib inline

import functools

import h5py

import time

Using TensorFlow backend.


### Environmental vars

In [2]:
DATADIR=os.getenv('DATADIR')
#DATADIR='/data' #this was put in for AWS run but doesn't work locally...

## Hyperparameters

The intuition for POS_RATIO is that it penalises predicting zero for everything, which is otherwise attractive to the model because the multilabel y matrix is very sparse. 

Increasing POS_RATIO should penalise predicting zeros more.
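The loss function itself is not shown in this section, so the following is only a minimal NumPy sketch of the general idea behind a weighted binary cross-entropy. It assumes the positive term is scaled by a hypothetical `pos_weight` argument; how POS_RATIO maps onto that weight in the actual model is not specified here.

```python
import numpy as np

def weighted_binary_crossentropy(y_true, y_pred, pos_weight=1.0, eps=1e-7):
    """Mean binary cross-entropy with the positive term scaled by pos_weight.

    pos_weight is a hypothetical parameter used for illustration only.
    """
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions so that log() never sees exactly 0 or 1
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    per_label = -(pos_weight * y_true * np.log(y_pred)
                  + (1.0 - y_true) * np.log(1.0 - y_pred))
    return per_label.mean()

# Sparse toy target: one positive label out of ten
y_true = np.array([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
# A model that predicts (almost) zero for every label
near_zero_pred = np.full(y_true.shape, 0.01)

loss_unweighted = weighted_binary_crossentropy(y_true, near_zero_pred, pos_weight=1.0)
loss_weighted = weighted_binary_crossentropy(y_true, near_zero_pred, pos_weight=5.0)
print(loss_unweighted, loss_weighted)
```

With `pos_weight=1.0` this reduces to the ordinary binary cross-entropy. Raising the weight makes the all-zeros prediction more expensive, because the single missed positive contributes more to the mean.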

In [3]:
#MAX_NB_WORDS
MAX_SEQUENCE_LENGTH = 1000
EMBEDDING_DIM = 100  # keras embedding layer output_dim = dimension of the dense embedding
P_THRESHOLD = 0.5  # threshold for probability of being assigned to class
POS_RATIO = 0.5  # ratio of positive to negative for each class in weighted binary cross entropy loss function
NUM_WORDS = 20000  # keras tokenizer num_words: None or int. Maximum number of words to work with
# (if set, tokenization will be restricted to the top num_words most common words in the dataset).

### Read in data
Content items tagged to level 2 taxons or lower in the topic taxonomy

In [4]:
labelled_level2 = pd.read_csv(os.path.join(DATADIR, 'labelled_level2.csv.gz'), dtype=object, compression='gzip')

In [5]:
labelled_level2.shape

(172916, 21)

In [6]:
labelled_level2['content_id'].nunique()

113481

#### clean up any World taxons leftover despite dropping relevant doctypes

In [7]:
#COLLAPSE World level2taxons
labelled_level2.loc[labelled_level2['level1taxon'] == 'World', 'level2taxon'] = 'world_level1'

#creating categorical variable for level2taxons from values
labelled_level2['level2taxon'] = labelled_level2['level2taxon'].astype('category')

In [8]:
#count the number of content items per taxon into new column
labelled_level2['num_content_per_taxon'] = labelled_level2.groupby(["level2taxon"])['level2taxon'].transform("count")

In [9]:
labelled_level2['num_content_per_taxon'].describe()

count    172916.000000
mean       4575.345752
std        3691.673143
min           1.000000
25%        1500.000000
50%        3780.000000
75%        6156.000000
max       11717.000000
Name: num_content_per_taxon, dtype: float64

In [10]:
#number of rows in biggest level2 taxon -this is the target size for all other level2 taxons in resampling
max_content_freq = max(labelled_level2['num_content_per_taxon'])
max_content_freq

11717

### drop news

In [11]:
labelled_level2.shape

(172916, 22)

In [12]:
labelled_level2[(labelled_level2['document_type'] == 'world_news_story')].shape

(3927, 22)

In [13]:
labelled_level2[(labelled_level2['document_type'] == 'news_story')].shape

(33214, 22)

In [14]:
nonews = labelled_level2[(labelled_level2['document_type'] != 'news_story') &
                         (labelled_level2['document_type'] != 'world_news_story')]

In [15]:
nonews.shape

(135775, 22)

### Create dictionary mapping taxon codes to string labels

In [16]:
#Get the category numeric values (codes) and avoid zero-indexing
labels = nonews['level2taxon'].cat.codes + 1

#create dictionary of taxon category code to string label for use in model evaluation
labels_index = dict(zip((labels), nonews['level2taxon']))

In [17]:
#labels_index

In [18]:
print(len(labels_index))

210


### Create target/Y 

Note: when using the categorical_crossentropy loss, your targets should be in categorical format (e.g. if you have 10 classes, the target for each sample should be a 10-dimensional vector that is all-zeros except for a 1 at the index corresponding to the class of the sample).

In multilabel learning, the joint set of binary classification tasks is expressed with a binary label indicator array: each sample is one row of a 2d array of shape (n_samples, n_classes) with binary values, where the ones, i.e. the non-zero elements, correspond to the subset of labels for that sample.  
An array such as np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0]]) represents label 0 in the first sample, labels 1 and 2 in the second sample, and no labels in the third sample.  
Producing multilabel data as a list of sets of labels may be more intuitive.
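As a small illustration of the indicator-array example above, the sketch below converts a list of label sets into the same 3x3 array using scikit-learn's `MultiLabelBinarizer` (already imported in this notebook). The variable names are made up for the example.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# One set of labels per sample; the third sample has no labels
label_sets = [{0}, {1, 2}, set()]

# Fix the class order explicitly so the columns are labels 0, 1, 2
mlb = MultiLabelBinarizer(classes=[0, 1, 2])
indicator = mlb.fit_transform(label_sets)
print(indicator)

# Going back from the indicator array to the labels of each sample
recovered = mlb.inverse_transform(indicator)
print(recovered)
```

Each row of `indicator` is one sample and each column is one class, which is the layout the multilabel target needs.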

#### First reshape wide to get a column for each level2taxon, with number of rows = number of unique urls

In [19]:
#get a smaller copy of data for pivoting ease (think you can work from full data actually and other cols get dropped automatically)

level2_reduced = nonews[['content_id', 
                         'level2taxon', 
                         'combined_text', 
                         'title', 
                         'description',
                         'document_type', 
                         'first_published_at', 
                         'publishing_app', 
                         'primary_publishing_organisation']].copy()

#how many level2taxons are there?
print('Number of unique level2taxons: {}'.format(level2_reduced.level2taxon.nunique()))

#count the number of taxons per content item into new column
level2_reduced['num_taxon_per_content'] = level2_reduced.groupby(["content_id"])['content_id'].transform("count")

#Add 1 because cat.codes is zero-indexed, so the numerical targets run from 1 to the number of level2taxons
level2_reduced['level2taxon_code'] = level2_reduced.level2taxon.astype('category').cat.codes + 1

Number of unique level2taxons: 210


In [20]:
#how many level2taxons are there?
print('Number of unique level2taxons: {}'.format(labelled_level2.level2taxon.nunique()))

#count the number of taxons per content item into new column
labelled_level2['num_taxon_per_content'] = labelled_level2.groupby(["content_id"])['content_id'].transform("count")

#Add 1 because cat.codes is zero-indexed, so the numerical targets run from 1 to the number of level2taxons
labelled_level2['level2taxon_code'] = labelled_level2.level2taxon.astype('category').cat.codes + 1

Number of unique level2taxons: 210


In [21]:
#reshape to wide per taxon and keep the combined text so indexing is consistent when splitting X from Y

multilabel = (level2_reduced.pivot_table(index=['content_id', 
                                                'combined_text', 
                                                'title', 
                                                'description' 
                                                ] , columns='level2taxon_code', values='num_taxon_per_content'))
print('level2reduced shape: {}'.format(level2_reduced.shape))
print('pivot table shape (no duplicates): {} '.format(multilabel.shape))


level2reduced shape: (135775, 11)
pivot table shape (no duplicates): (91771, 210) 


In [22]:
multilabel.columns

Int64Index([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,
            ...
            201, 202, 203, 204, 205, 206, 207, 208, 209, 210],
           dtype='int64', name='level2taxon_code', length=210)

In [23]:
multilabel.head()

content_id,combined_text,title,description,1,2,3,4,5,6,7,8,9,10,...,201,202,203,204,205,206,207,208,209,210
00029fa4-9b60-4285-898c-85ae8a6367f5,emma jones - small business crown representative as small business crown representative emma is keen to help uk smes win government business. emma was appointed as small business crown representative in july 2016. she was selected for the role because of her wealth of experience in working with smes. she is the founder of small business support group enterprise nation and the co founder of startup britain. emma’s work in her role as small business crown representative includes: working with government and the small business panel to identify the remaining barriers to smes doing business with the public sector supporting the launch and delivery of the campaign to help show that government is “open for business” for smes and helping them bid for and win more contracts increasing awareness among smaller businesses of opportunities to deliver on behalf of larger private sector firms who have secured government contracts working with government to identify new opportunities to get best value from smes getting support emma is keen to hear what small business have to say and wants to engage with as many smes as possible. so if you’re thinking about becoming a government supplier take a look at the events and opportunities below for how to get involved and gain support. events the leeds cross government sme roadshow 24 november 2017 is a great opportunity for smes to hear directly about the opportunities to sell to the public sector. more information about the event and how to register can be found here. webinars register for free for emma’s half hour webinars offering advice on how to become a government supplier. a list of webinars coming up is featured below. blogs read emma’s blogs to gain useful insight updates and tips for smes and government buyers. these smes did it and so can you! 
prompt payment makes for good business 2017 a big year for small businesses calling central government buyers: emma can help you meet your target small business saturday dec 2016: top tips for selling to government selling to the public sector guide in partner with the crown commercial service emma has developed a guide for small businesses with tips on selling to government. read here . government is open for business ‘open for business’ is the government’s campaign to reach more smes as potential suppliers: to help and support them to become suppliers and to listen to how government can improve the process. for more information visit www.gov.uk/openforbusiness register with contracts finder to keep updated on new and upcoming contracts worth over £10 000. for inspiration on how other small business have grown and benefitted from being a supplier government read our case studies . if you would like to help in getting the message out that government is open for business then visit the resources page for ways in which you can support.,emma jones - small business crown representative,as small business crown representative emma is keen to help uk smes win government business.,,,,,,,,,,,...,,,,,,,,,,
00037b70-5b08-44c2-bf0a-fa8eb636a60b,land remediation: bringing brownfield sites back to use brochure showing uk expertise in land remediation outlining technologies systems and ideas used in the regeneration of industrial land. the uk was the first industrialised country in the world. the legacy of the industrial revolution is over 400 000 hectares of contaminated land. uk expertise in land remediation has been borne out of necessity. the department for international trade’s ( dit ) brochure provides an overview of the expertise gained from over 5 decades of experience in land remediation. the brochure includes information on: sector specialists urban regeneration spill response monitoring and validation corporate liability management innovation industry bodies how dit can help this was published originally by uk trade and investment which has since moved to the department for international trade ( dit ). land remediation: bringing brownfield sites back to use html land remediation: bringing brownfield sites back to use pdf 4.47mb 18 pages this file may not be suitable for users of assistive technology. request an accessible format. if you use assistive technology (such as a screen reader) and need a version of this document in a more accessible format please email digital@ukti.gsi.gov.uk . please tell us what format you need. it will help us if you say what assistive technology you use.,land remediation: bringing brownfield sites back to use,brochure showing uk expertise in land remediation outlining technologies systems and ideas used in the regeneration of industrial land.,,,,,,,,,,,...,,,,,,,,,,
00037ee5-7b5e-452d-a233-af2c134f5bce,steps 2 success:ni statistics from october 2014 to september 2016 details on the number of referrals and starts on the steps 2 success programme and the number of moves into employment up to 30 sept 2016 statistics presented include details on the number of referrals and starts to the steps 2 success programme up to 30 september 2016. steps 2 success: ni statistics from october 2014 to september 2016 https://www.communities-ni.gov.uk/topics/statistics-and-research/employment-programme-statistics,steps 2 success:ni statistics from october 2014 to september 2016,details on the number of referrals and starts on the steps 2 success programme and the number of moves into employment up to 30 sept 2016,,,,,,,,,,,...,,,,,,,,,,
0004c63d-ae16-432a-bb35-c0f949b1e27c,student support applications for higher education: september 2016 data includes the number of applications received and grants awarded. these monthly statistics present information on applications for student support and tuition fee loans and tuition fee grants which include data for welsh domiciled students (wherever they study) and eu domiciled students studying in wales. student support applications for higher education: september 2016 http://gov.wales/statistics-and-research/student-support-applications-higher-education/?lang=en,student support applications for higher education: september 2016,data includes the number of applications received and grants awarded.,,,,,,,,,,,...,,,,,,,,,,
0005ac76-50fe-42f1-8168-8b6fc046e40f,advice for building owners: large-scale wall system test 2 advice for building owners on the large-scale wall system test with acm with a polyethylene filler cladding with stone wool insulation. the government is undertaking large scale testing of cladding systems to understand better how 3 different types of aluminium composite material ( acm ) panels behave in combination with 2 different types of insulation in a fire. this note sets out advice to building owners following the results of the large scale test for a wall system including: acm with unmodified polyethylene filler (category 3 in screening tests) stone wool insulation. this should be read alongside the government’s explanatory note on the large scale wall systems testing . advice for building owners: large-scale wall system test with acm with unmodified polyethylene filler with stone wool insulation pdf 214kb 3 pages this file may not be suitable for users of assistive technology. request an accessible format. if you use assistive technology (such as a screen reader) and need a version of this document in a more accessible format please email alternativeformats@communities.gsi.gov.uk . please tell us what format you need. it will help us if you say what assistive technology you use.,advice for building owners: large-scale wall system test 2,advice for building owners on the large-scale wall system test with acm with a polyethylene filler cladding with stone wool insulation.,,,,,,,,,,,...,,,,,,,,,,


In [24]:
multilabel.columns.astype('str')

Index(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10',
       ...
       '201', '202', '203', '204', '205', '206', '207', '208', '209', '210'],
      dtype='object', name='level2taxon_code', length=210)

In [25]:
#THIS IS WHY INDEXING IS NOT ZERO-BASED
#convert the num_taxon_per_content values to 1, meaning there was an entry for this taxon and this content_id, 0 otherwise
binary_multilabel = multilabel.notnull().astype('int')
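To make the reshape-then-binarise step easier to follow, here is a minimal sketch on a made-up three-document frame. It uses the same column names as the cells above, but the data is purely illustrative.

```python
import pandas as pd

# Long format: one row per (content item, taxon) pair
toy = pd.DataFrame({
    'content_id': ['a', 'a', 'b', 'c'],
    'level2taxon_code': [1, 3, 2, 3],
})
toy['num_taxon_per_content'] = (toy.groupby('content_id')['content_id']
                                   .transform('count'))

# Wide format: one row per content item, one column per taxon code;
# cells are NaN where the item is not tagged to that taxon
wide = toy.pivot_table(index='content_id',
                       columns='level2taxon_code',
                       values='num_taxon_per_content')

# Any non-null cell becomes 1, everything else 0
binary = wide.notnull().astype('int')
print(binary)
```

Document `a` is tagged to taxons 1 and 3, so its row has ones in those two columns and a zero for taxon 2.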

## Data Pre-Processing

In [26]:
total_size = binary_multilabel.shape[0]
total_size

91771

In [27]:
nb_test_samples = int(0.1 * total_size) #test split
print('nb_test samples:', nb_test_samples)

nb_dev_samples = int(0.2 * total_size) #dev split
print('nb_dev samples:', nb_dev_samples)

nb_training_samples = int(0.8 * total_size) #train split
print('nb_training samples:', nb_training_samples)

nb_test samples: 9177
nb_dev samples: 18354
nb_training samples: 73416
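
The fractions above (0.1 test, 0.2 dev, 0.8 train) add up to 1.1, so the three counts cannot all be taken as disjoint slices of the 91,771 rows. How the notebook slices the data is not shown in this section. The sketch below is only one possible way to derive non-overlapping sizes, assuming an 80/10/10 split; `split_sizes` and its default fractions are hypothetical and not part of the notebook.

```python
def split_sizes(total, train_frac=0.8, dev_frac=0.1):
    """Return (n_train, n_dev, n_test) that sum exactly to total.

    The test set takes whatever is left after train and dev,
    so rounding never loses or double-counts a row.
    """
    n_train = int(train_frac * total)
    n_dev = int(dev_frac * total)
    n_test = total - n_train - n_dev
    return n_train, n_dev, n_test

n_train, n_dev, n_test = split_sizes(91771)
print(n_train, n_dev, n_test)
```

After shuffling, the three sets could then be taken as consecutive slices: the first `n_train` rows, the next `n_dev` rows, and the final `n_test` rows.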


### Shuffle

In [28]:
for i in range(0,10):
    print(binary_multilabel.index[i][0])

00029fa4-9b60-4285-898c-85ae8a6367f5
00037b70-5b08-44c2-bf0a-fa8eb636a60b
00037ee5-7b5e-452d-a233-af2c134f5bce
0004c63d-ae16-432a-bb35-c0f949b1e27c
0005ac76-50fe-42f1-8168-8b6fc046e40f
0006811c-ad80-4cd0-a732-04cc983ec8c2
0008f82f-9713-4074-8793-0d266d53930c
000aa34d-c3c0-4176-ad8a-50e801056df1
000b6a38-c69a-4ac9-918b-717a79cbdad2
000b8c7e-4671-4586-9eff-97c0c374126b



In [29]:
from sklearn.utils import shuffle

In [30]:
binary_multilabel = shuffle(binary_multilabel,random_state=0)

In [31]:
for i in range(0,10):
    print(binary_multilabel.index[i][0])

df76ffdf-70d6-4a38-9d60-a1765c18914e
dca1f897-c8bd-4e35-a839-5953ee94d54e
3bec5cd0-76bd-48b1-924a-567bd3361ec0
5eb7cd3c-7631-11e4-a3cb-005056011aef
a67385c3-8562-4dc1-96ba-d96ff215943b
5e35118a-7631-11e4-a3cb-005056011aef
5feb658b-7631-11e4-a3cb-005056011aef
144a86f9-6902-444c-87bc-b389a6f3b275
5e139390-7631-11e4-a3cb-005056011aef
e5741923-bc21-46bd-8832-886706f59e81



### Upsample minority classes to address imbalance, leading to ~2,465,570 rows of data!

Access taxon columns with indexing. 
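
As a rough illustration of upsampling with replacement, the sketch below applies `sklearn.utils.resample` to a made-up multilabel frame. The column names, index values and target size are illustrative only.

```python
import pandas as pd
from sklearn.utils import resample

# Toy binary multilabel frame: taxon_a is rare, taxon_b is common
toy = pd.DataFrame({'taxon_a': [1, 1, 0, 0, 0, 0],
                    'taxon_b': [0, 1, 1, 1, 1, 1]},
                   index=['d1', 'd2', 'd3', 'd4', 'd5', 'd6'])

# Rows tagged to the rare taxon (2 rows here)
minority = toy[toy['taxon_a'] == 1]

# Draw 5 rows from those 2, with replacement, so rows repeat
upsampled = resample(minority,
                     replace=True,
                     n_samples=5,
                     random_state=123)

print(upsampled.shape)
print(sorted(upsampled.index.unique()))
```

Because each row can carry several labels, upsampling one taxon also duplicates any other positive labels on the same rows (here `taxon_b` on `d2`), so the class balance of other taxons shifts as a side effect.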

In [32]:
print("[ENCODING] Taxon min indx:", binary_multilabel.columns[0],
      "Taxon max indx:", binary_multilabel.columns[-1])

[ENCODING] Taxon min indx: 1 Taxon max indx: 210


In [33]:
binary_multilabel[1].shape

(91771,)

In [34]:
type(binary_multilabel.columns[0])

numpy.int64

In [35]:
### Array with indices to upsample

In [36]:
index = [binary_multilabel.index[i][0] for i in range(0,nb_training_samples)]
print(len(index))

73416


In [37]:
binary_multilabel[binary_multilabel[1]==1].loc[index].head()

content_id,combined_text,title,description,1,2,3,4,5,6,7,8,9,10,...,201,202,203,204,205,206,207,208,209,210
56929879-ef37-4d62-ae36-8985fd738369,administrative justice and tribunals: final progress report this is the final performance report against the commitments in the administrative justice and tribunals strategic work programme. the administrative justice and tribunals strategic work programme 2013 to 2016 sets out the government’s overarching objectives for reforming and improving the administrative justice and tribunal system to ensure it meets the principles of fairness efficiency and accessibility for users. the first progress report was published in june 2014 and we have continued to work across all these areas. this is our final update on how we have delivered the objectives specified in the strategic work programme. related links administrative justice and tribunals: a strategic work programme 2013 16 administrative justice and tribunals annual performance report: 2013 to 2014 administrative justice and tribunals: final progress report ref: isbn 9781474141321 cm 9319 pdf 442kb 45 pages this file may not be suitable for users of assistive technology. request an accessible format. if you use assistive technology (such as a screen reader) and need a version of this document in a more accessible format please email web.comments@justice.gsi.gov.uk . please tell us what format you need. it will help us if you say what assistive technology you use. administrative justice and tribunals: final progress report (print version) ref: isbn 9781474141314 cm 9319 pdf 1010kb 48 pages this file may not be suitable for users of assistive technology. request an accessible format. if you use assistive technology (such as a screen reader) and need a version of this document in a more accessible format please email web.comments@justice.gsi.gov.uk . please tell us what format you need. 
it will help us if you say what assistive technology you use.,administrative justice and tribunals: final progress report,this is the final performance report against the commitments in the administrative justice and tribunals strategic work programme.,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5fa8e157-7631-11e4-a3cb-005056011aef,proposals to reform judicial review the government response to the joint committee on human rights’ (jchr) 13th report of the 2013 to 2014 session. the government has set out its view on the committee’s recommendations in respect of its proposed reforms to judicial review many of which are being taken forward through the criminal justice and courts bill. the government’s view is that the reforms being taken forward are a proportionate response to the concerns raised in the consultation judicial review – proposals for further reform. the government is clear that judicial review is and must remain a crucial check on the power of the state and should continue to be readily available where it’s necessary in the interests of justice. the reforms the government is pursuing are aimed at speeding up the process for people who have arguable grounds and a genuine case to put. further information government response to the consultation judicial review: proposals for further reform government’s proposals to reform judicial review pdf 253kb 28 pages,proposals to reform judicial review,the government response to the joint committee on human rights’ (jchr) 13th report of the 2013 to 2014 session.,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8ea3bdc9-05ff-4416-8874-248a13bc7610,merger of north sussex and west sussex local justice areas seeks views on merger of 2 local justice areas (ljas) of sussex northern and sussex western into 1 lja to be known as the west sussex lja. the aim of this consultation is to find views on proposals to merge the local justice areas of sussex (northern) and sussex (western) into one new west sussex local justice area. this will give greater flexibility in managing the caseload across west sussex whilst increasing the opportunities for magistrates to sit on a broader range of cases on a regular basis and maintain experience. it aims to reduce delays and provide a more consistent service to court users. there will also be no reduction in access to justice for court users who have to attend hearings. this will also enable more effective management of the business of the bench reducing the number of meetings that magistrates and support staff must attend.,merger of north sussex and west sussex local justice areas,seeks views on merger of 2 local justice areas (ljas) of sussex northern and sussex western into 1 lja to be known as the west sussex lja.,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5f617c08-7631-11e4-a3cb-005056011aef,gwent magistrates' courts: proposals for the future this is a consultation on a proposal to close abergavenny magistrates' court and caerphilly magistrates' court. both abergavenny magistrates’ court and caerphilly magistrates’ court are in need of restoration and hm courts and tribunals service would incur considerable costs in making necessary repairs. it is proposed that both courts close and the workload be absorbed by the other 2 magistrates’ courts in gwent newport and cwmbran. this proposal aims to ensure our court estate is used more efficiently and the closure of the courts would offer hm courts and tribunals service savings of around £80 000 a year. this consultation seeks the views of local users judiciary magistracy staff criminal justice agency practitioners and elected representatives to better understand the impact that this proposal would have on the gwent community.,gwent magistrates' courts: proposals for the future,this is a consultation on a proposal to close abergavenny magistrates' court and caerphilly magistrates' court.,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
714b7c4c-269a-40fd-b3d8-41eda3d5517a,merger of local justice areas in greater manchester seeks views on merging 8 local justice areas (ljas) into a single lja to be known as the greater manchester lja. there are 3 key reasons for considering a merger of the current 8 ljas: to improve the effectiveness of the delivery of justice by improving flexibility in dealing with cases to make better use of reduced resources to increase the opportunities for magistrates to retain experience and thus competence the judicial business group (jbg) must address the question of magistrates’ sittings against the background of falling court sittings in criminal jurisdiction. the jbg must also consider the resources available to hmcts and criminal justice agencies to ensure that justice can be delivered as effectively as possible with reduced resources. staffing within hmcts and other organisations is determined by the workload and has therefore reduced over recent years.,merger of local justice areas in greater manchester,seeks views on merging 8 local justice areas (ljas) into a single lja to be known as the greater manchester lja.,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [38]:
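# Why are we deleting this?
# It clears the 'level2taxon_code' name that pivot_table leaves on the columns index.
# Assigning None has the same effect as `del` and is the more portable spelling.
binary_multilabel.columns.name = None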

In [39]:
#TAKES FOREVER TO RUN!
from sklearn.utils import resample

In [40]:
upsampled_training = pd.DataFrame()
upper = len(binary_multilabel.columns)+1

for taxon in range(1, upper):
    num_samples = binary_multilabel[binary_multilabel[taxon]==1].shape[0] 
    if num_samples<500:
        print("Taxon code:",taxon,"Taxon name:",labels_index[taxon])
        print("SMALL SUPPORT:",num_samples)
        df_minority = binary_multilabel[binary_multilabel[taxon]==1].loc[index]
        if not df_minority.empty:
            # Upsample minority class
            print(df_minority.shape)
            df_minority_upsampled = resample(df_minority, 
                                             replace=True,     # sample with replacement
                                             n_samples=500,    # fixed target size; switch to max_content_freq to match the majority class if this works
                                             random_state=123) # reproducible results
            
            print("FIRST 5 IDs:",[df_minority_upsampled.index[i][0] for i in range(0,5)])

            # Append the upsampled minority rows to the accumulated upsampled frame
            upsampled_training = pd.concat([upsampled_training, df_minority_upsampled])

            # Display new shape
            print("UPSAMPLING:",upsampled_training.shape)

upsampled_training = shuffle(upsampled_training,random_state=0)

Taxon code: 1 Taxon name: Administrative justice reform
SMALL SUPPORT: 11
(9, 210)
FIRST 5 IDs: ['8ea3bdc9-05ff-4416-8874-248a13bc7610', '8ea3bdc9-05ff-4416-8874-248a13bc7610', 'fd0b66df-bab6-4e8a-bd7b-bb12a8ca63ca', '5fa8e157-7631-11e4-a3cb-005056011aef', '5f617c08-7631-11e4-a3cb-005056011aef']
UPSAMPLING: (500, 210)
Taxon code: 2 Taxon name: Adoption, fostering and surrogacy
SMALL SUPPORT: 69
(55, 210)
FIRST 5 IDs: ['5e134ab5-7631-11e4-a3cb-005056011aef', '7583313c-c51f-4da2-bf77-a332e4c89678', '5f510507-7631-11e4-a3cb-005056011aef', '5e385e06-7631-11e4-a3cb-005056011aef', '5dc7f416-7631-11e4-a3cb-005056011aef']
UPSAMPLING: (1000, 210)
Taxon code: 3 Taxon name: Afghanistan
SMALL SUPPORT: 81
(69, 210)
FIRST 5 IDs: ['5f4c54a9-7631-11e4-a3cb-005056011aef', '5e973f59-7631-11e4-a3cb-005056011aef', '601f2bb3-7631-11e4-a3cb-005056011aef', '5c85d62e-7631-11e4-a3cb-005056011aef', '5fa6f327-7631-11e4-a3cb-005056011aef']
UPSAMPLING: (1500, 210)
Taxon code: 4 Taxon name: Armed Forces Covenant
SM

(39, 210)
FIRST 5 IDs: ['cb57ff4b-3167-4d44-ade7-4e8c23064f1e', '8ee55b42-4f25-45fd-bba6-adb0c67ff539', '5dcabd0c-7631-11e4-a3cb-005056011aef', '5fee9cf3-7631-11e4-a3cb-005056011aef', '5d645268-7631-11e4-a3cb-005056011aef']
UPSAMPLING: (13500, 210)
Taxon code: 33 Taxon name: Civil justice reform
SMALL SUPPORT: 3
(2, 210)
FIRST 5 IDs: ['c2734a07-31c8-4306-bb1c-d8038dba326e', 'ce9ceea5-e8b5-497d-ae73-e8290dcb5a5d', 'c2734a07-31c8-4306-bb1c-d8038dba326e', 'c2734a07-31c8-4306-bb1c-d8038dba326e', 'c2734a07-31c8-4306-bb1c-d8038dba326e']
UPSAMPLING: (14000, 210)
Taxon code: 34 Taxon name: Civil service reform
SMALL SUPPORT: 358
(286, 210)
FIRST 5 IDs: ['5d5e5165-7631-11e4-a3cb-005056011aef', '5f5e5842-7631-11e4-a3cb-005056011aef', '5f43519a-7631-11e4-a3cb-005056011aef', '5d5e7359-7631-11e4-a3cb-005056011aef', '5f4351e5-7631-11e4-a3cb-005056011aef']
UPSAMPLING: (14500, 210)
Taxon code: 36 Taxon name: Commercial fishing and fisheries
SMALL SUPPORT: 263
(220, 210)
FIRST 5 IDs: ['5f5cd007-7631-11

(9, 210)
FIRST 5 IDs: ['5fea5890-a27f-4130-9e43-4ae0cfb83415', '5fea5890-a27f-4130-9e43-4ae0cfb83415', '5f46e8d5-7631-11e4-a3cb-005056011aef', '578df83d-ea40-49df-a035-b923f03c092f', '8c1633f1-b255-4bbf-87e2-1313d7f5ca52']
UPSAMPLING: (26500, 210)
Taxon code: 67 Taxon name: European Union laws and regulation
SMALL SUPPORT: 11
(9, 210)
FIRST 5 IDs: ['5c7166fe-7631-11e4-a3cb-005056011aef', '5c7166fe-7631-11e4-a3cb-005056011aef', '5f10966c-7631-11e4-a3cb-005056011aef', '5d0517d1-7631-11e4-a3cb-005056011aef', '5d63bf56-7631-11e4-a3cb-005056011aef']
UPSAMPLING: (27000, 210)
Taxon code: 68 Taxon name: European funds
SMALL SUPPORT: 82
(70, 210)
FIRST 5 IDs: ['5e8d6411-7631-11e4-a3cb-005056011aef', '5e37fecd-7631-11e4-a3cb-005056011aef', '5f9d5f42-7631-11e4-a3cb-005056011aef', '5fee766d-7631-11e4-a3cb-005056011aef', '5e120d7d-7631-11e4-a3cb-005056011aef']
UPSAMPLING: (27500, 210)
Taxon code: 69 Taxon name: European single market
SMALL SUPPORT: 142
(123, 210)
FIRST 5 IDs: ['5e9c262b-7631-11e4-a

UPSAMPLING: (39000, 210)
Taxon code: 103 Taxon name: Land management
SMALL SUPPORT: 57
(44, 210)
FIRST 5 IDs: ['8b5d3305-44cb-422f-ad18-4e1bd21d098c', '5f46e8d5-7631-11e4-a3cb-005056011aef', 'b881f77f-1fe5-49a2-9570-51a7add80566', '5c71ea9e-7631-11e4-a3cb-005056011aef', '2bb3eb0c-90f9-497e-874c-e356b20f0781']
UPSAMPLING: (39500, 210)
Taxon code: 104 Taxon name: Land registration
SMALL SUPPORT: 140
(113, 210)
FIRST 5 IDs: ['5f66e3e6-7631-11e4-a3cb-005056011aef', '72c74777-3a57-4170-80a8-9c1e5ce6f78d', '94737fd3-49c4-47ac-96f6-ae9d0d18ca0a', '5f4aa54c-7631-11e4-a3cb-005056011aef', '5f53e5c2-7631-11e4-a3cb-005056011aef']
UPSAMPLING: (40000, 210)
Taxon code: 105 Taxon name: Lasting power of attorney, being in care and your financial affairs
SMALL SUPPORT: 21
(16, 210)
FIRST 5 IDs: ['aa2b6cbb-aaaa-437f-ba03-5e6133c7036a', '0ed58e79-d9f6-4bed-bbe4-c6b5bad1a543', 'aa2b6cbb-aaaa-437f-ba03-5e6133c7036a', '3c62a127-1c02-41ce-86cd-eccfdb9e0ec3', '3b73d6f9-da30-4166-a3ba-ef5316b00803']
UPSAMPLING:

UPSAMPLING: (52000, 210)
Taxon code: 136 Taxon name: Passports and travel documents for foreign nationals
SMALL SUPPORT: 38
(28, 210)
FIRST 5 IDs: ['5ee64097-7631-11e4-a3cb-005056011aef', '5ee57e77-7631-11e4-a3cb-005056011aef', '5ee57e77-7631-11e4-a3cb-005056011aef', 'dabfc6b3-d88c-458f-a9fb-f286b987509b', '5e16d212-7631-11e4-a3cb-005056011aef']
UPSAMPLING: (52500, 210)
Taxon code: 137 Taxon name: Payroll
SMALL SUPPORT: 40
(32, 210)
FIRST 5 IDs: ['6602dce6-38c1-4592-882b-e177fd72fe9f', '88309f75-af24-42e9-b30d-e36157cf918b', '6602dce6-38c1-4592-882b-e177fd72fe9f', '6de2a5b5-5d33-42a4-97bf-35f56a1e1295', 'acfe2320-09c7-4fad-9afe-3a0b9b8bb0dd']
UPSAMPLING: (53000, 210)
Taxon code: 140 Taxon name: Permanent stay in the UK
SMALL SUPPORT: 33
(28, 210)
FIRST 5 IDs: ['5ec24967-7631-11e4-a3cb-005056011aef', 'f4c6ac13-1769-4e0a-aba5-46eedaab2bcf', 'f4c6ac13-1769-4e0a-aba5-46eedaab2bcf', '5ef20eb2-7631-11e4-a3cb-005056011aef', '851ee732-7aad-4749-b2a2-54bfd32b0f6b']
UPSAMPLING: (53500, 210)
Taxo

(57, 210)
FIRST 5 IDs: ['6027472e-7631-11e4-a3cb-005056011aef', 'f33cbf6f-f27c-4335-ab6f-b5e396f02d56', '60272e36-7631-11e4-a3cb-005056011aef', '60272a1f-7631-11e4-a3cb-005056011aef', '5fa70803-7631-11e4-a3cb-005056011aef']
UPSAMPLING: (65000, 210)
Taxon code: 177 Taxon name: Tax evasion and avoidance
SMALL SUPPORT: 122
(100, 210)
FIRST 5 IDs: ['377a2bba-eaf1-4090-bf8e-dd7ea6939e84', '412c52e7-3a8c-4100-bb0b-1311543a1430', 'f1f8bc2e-0e64-46f1-b56d-fe71d79b2ae1', '5e124bdb-7631-11e4-a3cb-005056011aef', '5e5b1a34-7631-11e4-a3cb-005056011aef']
UPSAMPLING: (65500, 210)
Taxon code: 179 Taxon name: The Commonwealth
SMALL SUPPORT: 50
(39, 210)
FIRST 5 IDs: ['da89bcba-5afb-41ca-8e1b-0c7261e9f54e', '5d376793-7631-11e4-a3cb-005056011aef', 'fe72798e-40be-46c4-856e-e72919993c22', '940e5b95-2bed-415f-9fe7-9cd41941548b', '0a597401-7432-4fd9-8bf2-680706979448']
UPSAMPLING: (66000, 210)
Taxon code: 180 Taxon name: Tourism
SMALL SUPPORT: 118
(83, 210)
FIRST 5 IDs: ['0aa36874-d6ac-4b57-b6de-dcb81d8a0d21

### Double-check dataframe contents before merging

In [41]:
binary_multilabel.shape

(91771, 210)

In [42]:
binary_multilabel.index[91770][0] # final sample before merging.

'76f5d9df-2d2b-486b-97ba-1a0098a72068'

In [43]:
binary_multilabel = pd.concat([binary_multilabel, upsampled_training])

In [44]:
binary_multilabel.index[total_size][0] # first sample of duplicated training data

'5c85d2b1-7631-11e4-a3cb-005056011aef'

Do not remove index because the text data lives there.
**TODO** Consider reworking how datasets are set up at some point

In [45]:
binary_multilabel.to_csv(os.path.join(DATADIR, 'balanced_level2_training_set_sampled.csv.gz'), compression='gzip')

### LOAD OVERSAMPLED DATASET

In [46]:
balanced_df = pd.read_csv(os.path.join(DATADIR, 'balanced_level2_training_set_sampled.csv.gz'), dtype=object, compression='gzip')

In [47]:
balanced_df.shape

(169271, 214)

In [48]:
# .values converts the dataframe columns to a NumPy array; print its shape
print('Shape of Y multilabel array before train/val/test split:{}'.format(balanced_df[list(balanced_df.columns)].values.shape))

Shape of Y multilabel array before train/val/test split:(169271, 214)


In [49]:
# don't overwrite balanced_df as it takes ages to read in
balanced_df_taxons = balanced_df.iloc[:,4:215]

In [50]:
balanced_df_taxons.columns = balanced_df_taxons.columns.astype(int)

In [51]:
balanced_df_taxons = balanced_df_taxons.astype(int)

In [52]:
#convert columns to an array. Each row represents a content item, each column an individual taxon
binary_multilabel = balanced_df_taxons[list(balanced_df_taxons.columns)].values
print('Example row of multilabel array {}'.format(binary_multilabel[2]))

Example row of multilabel array [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


In [53]:
balanced_df.head()

Unnamed: 0,content_id,combined_text,title,description,1,2,3,4,5,6,...,201,202,203,204,205,206,207,208,209,210
0,df76ffdf-70d6-4a38-9d60-a1765c18914e,dft: spending over £25 000 february 2015 repor...,dft: spending over £25 000 february 2015,reports on departmental spending over £500.,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,dca1f897-c8bd-4e35-a839-5953ee94d54e,regulatory notice: the abbeyfield dorcas socie...,regulatory notice: the abbeyfield dorcas socie...,the homes and communities agency's view of how...,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3bec5cd0-76bd-48b1-924a-567bd3361ec0,tariff notice 5 (2015): blue polymethine dye (...,tariff notice 5 (2015): blue polymethine dye (...,tariff classification of blue polymethine dye ...,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,5eb7cd3c-7631-11e4-a3cb-005056011aef,dfe: senior civil servant expenses and hospita...,dfe: senior civil servant expenses and hospita...,this collection brings together all documents ...,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,a67385c3-8562-4dc1-96ba-d96ff215943b,quarterly fines report 29 - quarterly fines re...,quarterly fines report 29 - quarterly fines re...,this publication reports on fines collection b...,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# Format metadata/X

In [54]:
#extract content_id index to df
meta1 = pd.DataFrame(balanced_df['content_id'])

In [55]:
print(meta1.shape)
meta1.head()

(169271, 1)


Unnamed: 0,content_id
0,df76ffdf-70d6-4a38-9d60-a1765c18914e
1,dca1f897-c8bd-4e35-a839-5953ee94d54e
2,3bec5cd0-76bd-48b1-924a-567bd3361ec0
3,5eb7cd3c-7631-11e4-a3cb-005056011aef
4,a67385c3-8562-4dc1-96ba-d96ff215943b


In [56]:
metas = ['document_type','first_published_at','publishing_app','primary_publishing_organisation']

In [57]:
def build_index(x):
    index_dict = {}
    index_dict['index'] = 0
    for i,elem in enumerate(x):
        index_dict[elem] = i+1
    return index_dict

In [58]:
import time

In [59]:
#IF THIS FUNCTION TURNS OUT FASTER KEEP
#apply meta data to content
print("STARTED:",time.strftime("%H:%M:%S"))
for meta in metas:
    print("WORKON:",meta)
    meta1[meta] = meta1['content_id'].map(dict(zip(labelled_level2['content_id'], labelled_level2[meta])))
print("FINISHED:",time.strftime("%H:%M:%S"))

STARTED: 11:45:41
WORKON: document_type
WORKON: first_published_at
WORKON: publishing_app
WORKON: primary_publishing_organisation
FINISHED: 11:45:42


In [60]:
meta1 = meta1.replace(np.nan, '', regex=True) # convert NaNs to empty strings so LabelEncoder sees a single type
meta1.head()

Unnamed: 0,content_id,document_type,first_published_at,publishing_app,primary_publishing_organisation
0,df76ffdf-70d6-4a38-9d60-a1765c18914e,transparency,2015-06-05T09:07:35.000+00:00,whitehall,{'title': 'Department for Transport'}
1,dca1f897-c8bd-4e35-a839-5953ee94d54e,decision,2015-08-12T08:00:00.000+00:00,whitehall,{'title': 'Homes and Communities Agency'}
2,3bec5cd0-76bd-48b1-924a-567bd3361ec0,notice,2015-02-12T09:24:00.000+00:00,whitehall,{'title': 'HM Revenue & Customs'}
3,5eb7cd3c-7631-11e4-a3cb-005056011aef,document_collection,2013-07-12T15:00:00.000+00:00,whitehall,{'title': 'Department for Education'}
4,a67385c3-8562-4dc1-96ba-d96ff215943b,official_statistics,2016-08-25T08:30:17.000+00:00,whitehall,{'title': 'The Scottish Government'}


In [61]:
def to_cat_to_hot(column):
    doctype_encoder = LabelEncoder()
    new_col = column+"_cat"
    meta1[new_col] = doctype_encoder.fit_transform(meta1[column])
    return to_categorical(meta1[new_col])

dict_of_encodings = {}
for meta in metas:
    if meta != "first_published_at":
        print(meta)
        dict_of_encodings[meta] = to_cat_to_hot(meta)   

document_type
publishing_app
primary_publishing_organisation


In [62]:
meta1.head()

Unnamed: 0,content_id,document_type,first_published_at,publishing_app,primary_publishing_organisation,document_type_cat,publishing_app_cat,primary_publishing_organisation_cat
0,df76ffdf-70d6-4a38-9d60-a1765c18914e,transparency,2015-06-05T09:07:35.000+00:00,whitehall,{'title': 'Department for Transport'},51,8,101
1,dca1f897-c8bd-4e35-a839-5953ee94d54e,decision,2015-08-12T08:00:00.000+00:00,whitehall,{'title': 'Homes and Communities Agency'},9,8,177
2,3bec5cd0-76bd-48b1-924a-567bd3361ec0,notice,2015-02-12T09:24:00.000+00:00,whitehall,{'title': 'HM Revenue & Customs'},29,8,165
3,5eb7cd3c-7631-11e4-a3cb-005056011aef,document_collection,2013-07-12T15:00:00.000+00:00,whitehall,{'title': 'Department for Education'},11,8,90
4,a67385c3-8562-4dc1-96ba-d96ff215943b,official_statistics,2016-08-25T08:30:17.000+00:00,whitehall,{'title': 'The Scottish Government'},31,8,332


In [63]:
meta1['first_published_at'] = pd.to_datetime(meta1['first_published_at'])
print(meta1['first_published_at'].shape)

(169271,)


In [64]:
first_published = np.array(meta1['first_published_at']).reshape(meta1['first_published_at'].shape[0], 1)

In [65]:
print(first_published.dtype,first_published.shape,type(first_published))

datetime64[ns] (169271, 1) <class 'numpy.ndarray'>


In [66]:
dict_of_encodings.keys()

dict_keys(['publishing_app', 'document_type', 'primary_publishing_organisation'])

In [67]:
meta = np.concatenate((dict_of_encodings['document_type'], 
                               dict_of_encodings['primary_publishing_organisation'], 
                               dict_of_encodings['publishing_app']), 
                              axis=1)

In [68]:
nb_metavars = meta.shape[1]
print(nb_metavars)
print(meta.shape)

429
(169271, 429)


### Tokenize text fields

`Tokenizer` is a Keras class for vectorizing texts and/or turning texts into sequences (a sequence is a list of word indexes, where the word of rank i in the dataset, starting at 1, has index i).
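
As a rough illustration of the rank-based indexing described above, here is a minimal pure-Python sketch. It is not the Keras implementation: `build_word_index`, `texts_to_sequences` and the toy texts are hypothetical names invented for this example, and real tokenization also handles punctuation filtering and the `num_words` cap.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to its frequency rank, starting at 1 (0 stays free for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() orders words from most to least frequent
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index):
    """Replace every known word with its integer index."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]


toy_texts = ["tax return guidance", "tax credits", "passport guidance tax"]
toy_index = build_word_index(toy_texts)
toy_sequences = texts_to_sequences(toy_texts, toy_index)

print(toy_index)
print(toy_sequences)
```

The most frequent word ("tax") receives index 1, the next most frequent ("guidance") index 2, and so on.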

In [69]:
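def tokenize(local_tokenizer, input_data, option):
    """Fit local_tokenizer on input_data and vectorize the texts.

    option=True  -> integer sequences padded/truncated to MAX_SEQUENCE_LENGTH
    option=False -> (samples, num_words) matrix from sequences_to_matrix
    """
    # fit the tokenizer on our text data
    local_tokenizer.fit_on_texts(input_data)
    # list of word indexes, where the word of rank i in the dataset (starting at 1) has index i
    sequences = local_tokenizer.texts_to_sequences(input_data)
    word_index = local_tokenizer.word_index
    print('Found %s unique tokens.' % len(word_index))
    if option:
        data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
    else:
        # use the tokenizer that was passed in, not the global `tokenizer`
        data = local_tokenizer.sequences_to_matrix(sequences)
    return data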

In [70]:
# option=True returns padded integer sequences; option=False returns a (samples, num_words) matrix.
texts = balanced_df['combined_text']
tokenizer = Tokenizer(num_words=NUM_WORDS)
data = tokenize(tokenizer,texts,False)

titles = balanced_df['title']
tokenizer_tit = Tokenizer(num_words=10000)
onehot_tit = tokenize(tokenizer_tit,titles,True)

descs = balanced_df['description']
tokenizer_desc = Tokenizer(num_words=10000)
onehot_desc = tokenize(tokenizer_desc,descs,True)

Found 193851 unique tokens.
Found 31442 unique tokens.
Found 36795 unique tokens.


In [72]:
print('Shape of label tensor:', binary_multilabel.shape)
print('Shape of data tensor:', data.shape)

Shape of label tensor: (169271, 210)
Shape of data tensor: (169271, 20000)


### Data split
- Training data = 80%
- Development data = 10%
- Test data = 10%
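
The 80/10/10 proportions above can be sketched on a toy array as follows. This is only an illustration under assumptions: `contiguous_split` and `toy_rows` are hypothetical names, and the notebook itself slices with precomputed sample counts (and appends the upsampled rows to the training portion only) rather than calling a helper like this.

```python
import numpy as np


def contiguous_split(array, fractions=(0.8, 0.1, 0.1)):
    """Split an array into train/dev/test blocks without shuffling."""
    n = len(array)
    train_end = int(n * fractions[0])
    dev_end = train_end + int(n * fractions[1])
    # whatever remains after train and dev goes to test
    return array[:train_end], array[train_end:dev_end], array[dev_end:]


toy_rows = np.arange(20)
toy_train, toy_dev, toy_test = contiguous_split(toy_rows)
print(len(toy_train), len(toy_dev), len(toy_test))
```

Because the blocks are contiguous, the same boundaries can be reused to slice every parallel array (text, metadata, titles, descriptions, labels) so that rows stay aligned.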

#### Original sizes, keep for reference.
    nb_test samples: 9177
    nb_dev samples: 18354
    nb_training samples: 73416

In [109]:
print(nb_training_samples,nb_dev_samples,nb_test_samples)

73416 18354 9177


In [102]:
def split(data,splits):
    l = []
    for (start,end) in splits:
        l.append(data[start:end])
    return tuple([x for x in l])

In [103]:
diff = len(data)-total_size+1
diff

77501

In [104]:
splits = [(0,-(nb_dev_samples+diff)),(-(nb_dev_samples+diff),-(nb_test_samples+diff)),(-(nb_test_samples+diff),total_size)]
re_split = [(total_size,len(data))]

In [105]:
x_train, x_dev, x_test = split(data,splits)
x_resampled = split(data,re_split)[0]

In [107]:
print(x_train.shape,x_resampled.shape)
print(x_dev.shape,x_test.shape)

(73416, 20000) (77500, 20000)
(9177, 20000) (9178, 20000)


In [98]:
x_train = np.concatenate([x_train,x_resampled],axis=0)

In [85]:
x_train.shape

(150917, 20000)

In [96]:
meta_train, meta_dev, meta_test = split(meta,splits)
meta_resampled = split(meta,re_split)[0]
meta_train = np.concatenate([meta_train,meta_resampled],axis=0)
                                                                  
title_train, title_dev, title_test = split(onehot_tit,splits)
title_resampled = split(onehot_tit,re_split)[0]   
title_train = np.concatenate([title_train,title_resampled],axis=0)
                                                                  
desc_train, desc_dev, desc_test = split(onehot_desc,splits)
desc_resampled = split(onehot_desc,re_split)[0] 
desc_train = np.concatenate([desc_train,desc_resampled],axis=0)
                                                                  
y_train, y_dev, y_test = split(binary_multilabel,splits)
y_resampled = split(binary_multilabel,re_split)[0]
y_train = np.concatenate([y_train,y_resampled],axis=0)                                                             

In [99]:
print('Shape of x_train:', x_train.shape)
print('Shape of metax_train:', meta_train.shape)
print('Shape of titlex_train:', title_train.shape)
print('Shape of descx_train:', desc_train.shape)
print('Shape of y_train:', y_train.shape)

Shape of x_train: (150916, 20000)
Shape of metax_train: (150916, 429)
Shape of titlex_train: (150916, 1000)
Shape of descx_train: (150916, 1000)
Shape of y_train: (150916, 210)


In [100]:
print('Shape of x_dev:', x_dev.shape)
print('Shape of meta_dev:', meta_dev.shape)
print('Shape of titlex_dev:', title_dev.shape)
print('Shape of descx_dev:', desc_dev.shape)
print('Shape of y_dev:', y_dev.shape)

Shape of x_dev: (9177, 20000)
Shape of meta_dev: (9177, 429)
Shape of titlex_dev: (9177, 1000)
Shape of descx_dev: (9177, 1000)
Shape of y_dev: (9177, 210)


In [101]:
print('Shape of x_test:', x_test.shape)
print('Shape of metax_test:', meta_test.shape)
print('Shape of titlex_test:', title_test.shape)
print('Shape of descx_test:', desc_test.shape)
print('Shape of y_test:', y_test.shape)

Shape of x_test: (9178, 20000)
Shape of metax_test: (9178, 429)
Shape of titlex_test: (9178, 1000)
Shape of descx_test: (9178, 1000)
Shape of y_test: (9178, 210)


### preparing the Embedding layer

NB stopwords haven't been removed yet...

In [None]:
# word_index is local to tokenize(), so read the vocabulary from the fitted tokenizer
embedding_layer = Embedding(len(tokenizer.word_index) + 1, 
                            EMBEDDING_DIM, 
                            input_length=MAX_SEQUENCE_LENGTH)

An Embedding layer should be fed sequences of integers, i.e. a 2D input of shape (samples, indices). These input sequences should be padded so that they all have the same length in a batch of input data (although an Embedding layer is capable of processing sequences of heterogeneous length, if you don't pass an explicit input_length argument to the layer).

All that the Embedding layer does is to map the integer inputs to the vectors found at the corresponding index in the embedding matrix, i.e. the sequence [1, 2] would be converted to [embeddings[1], embeddings[2]]. This means that the output of the Embedding layer will be a 3D tensor of shape (samples, sequence_length, embedding_dim).
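
The lookup described above can be mimicked with plain NumPy indexing. This is a sketch with made-up sizes, not the Keras layer itself: `embeddings`, `toy_batch` and the dimensions are assumptions chosen for illustration, and the matrix is random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 6, 4            # hypothetical toy sizes
embeddings = rng.normal(size=(vocab_size, embedding_dim))

# Two padded sequences of length 3 (index 0 plays the role of padding)
toy_batch = np.array([[1, 2, 0],
                      [3, 5, 4]])

# Integer-array indexing picks one embedding row per word index
embedded = embeddings[toy_batch]
print(embedded.shape)   # (samples, sequence_length, embedding_dim)
```

Each integer in the 2D input is replaced by a row of the matrix, which is why the output gains a third dimension.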

### Estimate class weights for unbalanced datasets
`model.fit` takes a __class_weight__ parameter: an optional dictionary mapping class indices (integers) to a weight (float) value, used for weighting the loss function (during training only). This can be useful to tell the model to "pay more attention" to samples from an under-represented class.

Implement class_weight from sklearn:

- Import the module:

`from sklearn.utils import class_weight`
- Calculate the class weights. If 'balanced', class weights will be given by `n_samples / (n_classes * np.bincount(y))`:

`class_weights = class_weight.compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)`

- Change it to a dict in order to work with Keras:

`class_weight_dict = dict(enumerate(class_weights))`

- Add to model fitting (pass the dict, not the array):

`model.fit(X_train, y_train, class_weight=class_weight_dict)`
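
To make the 'balanced' formula above concrete, here is a small NumPy-only sketch that applies it by hand. It assumes a 1-D vector of integer class labels (`y_toy` is a made-up example); `y_train` in this notebook is a 2-D multilabel matrix, so the recipe would need adapting before it could be used here.

```python
import numpy as np

# Hypothetical single-label targets: class 0 is three times as common as 1 or 2
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])

n_samples = len(y_toy)
classes = np.unique(y_toy)
n_classes = len(classes)

# Rare classes get weights above 1, common classes below 1
balanced_weights = n_samples / (n_classes * np.bincount(y_toy))
toy_class_weight_dict = {int(c): float(w) for c, w in zip(classes, balanced_weights)}
print(toy_class_weight_dict)
```

With these weights every class contributes the same total weight to the loss: 6 samples times 10/18 equals 2 samples times 10/6.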

In [None]:
# class_weights = class_weight.compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
# class_weight_dict = dict(enumerate(class_weights))

### Custom loss function

In [None]:
import tensorflow as tf  # the loss below calls TensorFlow ops directly (TF 1.x API)


class WeightedBinaryCrossEntropy(object):
    """Binary cross-entropy that weights positive labels by neg_ratio / pos_ratio."""

    def __init__(self, pos_ratio):
        neg_ratio = 1. - pos_ratio
        #self.pos_ratio = tf.constant(pos_ratio, tf.float32)
        self.pos_ratio = pos_ratio
        #self.weights = tf.constant(neg_ratio / pos_ratio, tf.float32)
        self.weights = neg_ratio / pos_ratio
        self.__name__ = "weighted_binary_crossentropy({0})".format(pos_ratio)

    def __call__(self, y_true, y_pred):
        return self.weighted_binary_crossentropy(y_true, y_pred)

    def weighted_binary_crossentropy(self, y_true, y_pred):
        # Transform predicted probabilities back to logits
        epsilon = tf.convert_to_tensor(K.epsilon(), y_pred.dtype.base_dtype)
        y_pred = tf.clip_by_value(y_pred, epsilon, 1 - epsilon)
        y_pred = tf.log(y_pred / (1 - y_pred))

        cost = tf.nn.weighted_cross_entropy_with_logits(y_true, y_pred, self.weights)
        return K.mean(cost * self.pos_ratio, axis=-1)


# Sanity check: compare with the unweighted Keras loss on a toy example
y_true_arr = np.array([0,1,0,1], dtype="float32")
y_pred_arr = np.array([0,0,1,1], dtype="float32")
y_true = tf.constant(y_true_arr)
y_pred = tf.constant(y_pred_arr)

with tf.Session().as_default(): 
    print(WeightedBinaryCrossEntropy(0.5)(y_true, y_pred).eval())
    print(binary_crossentropy(y_true, y_pred).eval())
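
For readers who want to check the arithmetic without a TensorFlow session, the following is a NumPy restatement of the same weighting, offered as a sketch rather than a drop-in replacement. The function names and the toy arrays are hypothetical. Multiplying through, positive labels end up weighted by `1 - pos_ratio` and negative labels by `pos_ratio`, so with `pos_ratio=0.5` the result should be half the plain binary cross-entropy.

```python
import numpy as np


def weighted_bce_numpy(y_true, y_prob, pos_ratio, eps=1e-7):
    """Weighted binary cross-entropy on probabilities, averaged over the last axis."""
    pos_weight = (1.0 - pos_ratio) / pos_ratio
    p = np.clip(y_prob, eps, 1.0 - eps)
    cost = -(pos_weight * y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
    return np.mean(cost * pos_ratio, axis=-1)


def plain_bce_numpy(y_true, y_prob, eps=1e-7):
    """Unweighted binary cross-entropy, for comparison."""
    p = np.clip(y_prob, eps, 1.0 - eps)
    return np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)), axis=-1)


loss_true = np.array([0.0, 1.0, 0.0, 1.0])
loss_prob = np.array([0.1, 0.8, 0.6, 0.3])

half_weighted = weighted_bce_numpy(loss_true, loss_prob, pos_ratio=0.5)
unweighted = plain_bce_numpy(loss_true, loss_prob)
rare_positive = weighted_bce_numpy(loss_true, loss_prob, pos_ratio=0.1)

print(half_weighted, unweighted, rare_positive)
```

Lowering `pos_ratio` shifts the penalty towards missed positives, which is the intended effect when positive labels are rare.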


### Difficulty getting global precision/recall metrics: take CAUTION when interpreting monitoring metrics
fchollet: "Basically these are all global metrics that were approximated
batch-wise, which is more misleading than helpful. This was mentioned in
the docs but it's much cleaner to remove them altogether. It was a mistake
to merge them in the first place."

In [None]:
def f1(y_true, y_pred):
    """Batch-wise F1 score: harmonic mean of precision and recall.

    Precision and recall are computed on the current batch only, so the
    value Keras reports per epoch is an average of batch-level scores,
    not a global F1 over the whole dataset.
    """
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    recall = true_positives / (possible_positives + K.epsilon())
    # epsilon guards against 0/0 when a batch has no true positives
    f1 = 2*((precision*recall)/(precision+recall+K.epsilon()))
    
    return f1
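
The caveat quoted above can be seen on a tiny example: averaging F1 over batches does not, in general, give the F1 computed over all samples at once. The sketch below uses NumPy only; `f1_from_counts` and the toy arrays are hypothetical and stand in for the Keras metric.

```python
import numpy as np


def f1_from_counts(y_true, y_pred):
    """F1 from true-positive, predicted-positive and actual-positive counts."""
    tp = np.sum(y_true * y_pred)
    predicted_pos = np.sum(y_pred)
    actual_pos = np.sum(y_true)
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1_true = np.array([1, 1, 0, 0, 0, 0, 0, 1])
f1_pred = np.array([1, 0, 0, 0, 1, 1, 1, 1])

# F1 over all eight samples at once
global_f1 = f1_from_counts(f1_true, f1_pred)

# F1 per batch of four, then averaged (what a batch-wise metric reports)
batch_f1s = [f1_from_counts(f1_true[i:i + 4], f1_pred[i:i + 4]) for i in range(0, 8, 4)]
batchwise_f1 = float(np.mean(batch_f1s))

print(global_f1, batchwise_f1)
```

For this reason the evaluation sections further down compute precision, recall and F1 with scikit-learn on the full prediction arrays.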

## Training a 1D convnet

### 1. Create model

In [None]:
NB_CLASSES = y_train.shape[1]
NB_METAVARS = meta_train.shape[1]  # meta_train is the name assigned in the data split above



sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32', name='wordindex') #MAX_SEQUENCE_LENGTH
embedded_sequences = embedding_layer(sequence_input)
x = Dropout(0.2, name = 'dropout_embedded')(embedded_sequences)

x = Conv1D(128, 5, activation='relu', name = 'conv0')(x)

x = MaxPooling1D(5, name = 'max_pool0')(x)

x = Dropout(0.5, name = 'dropout0')(x)

x = Conv1D(128, 5, activation='relu', name = 'conv1')(x)

x = MaxPooling1D(5 , name = 'max_pool1')(x)

x = Conv1D(128, 5, activation='relu', name = 'conv2')(x)

x = MaxPooling1D(35, name = 'global_max_pool')(x)  # global max pooling

x = Flatten()(x) #reduce dimensions from 3 to 2; convert to vector + FULLYCONNECTED

meta_input = Input(shape=(NB_METAVARS,), name='meta')
meta_hidden = Dense(128, activation='relu', name = 'hidden_meta')(meta_input)
meta_hidden = Dropout(0.2, name = 'dropout_meta')(meta_hidden)


title_input = Input(shape=(title_train.shape[1],), name='titles')
title_hidden = Dense(128, activation='relu', name = 'hidden_title')(title_input)
title_hidden = Dropout(0.2, name = 'dropout_title')(title_hidden)

desc_input = Input(shape=(desc_train.shape[1],), name='descs')
desc_hidden = Dense(128, activation='relu', name = 'hidden_desc')(desc_input)
desc_hidden = Dropout(0.2, name = 'dropout_desc')(desc_hidden)

concatenated = concatenate([meta_hidden, title_hidden, desc_hidden, x])

x = Dense(400, activation='relu', name = 'fully_connected0')(concatenated)

x = Dropout(0.2, name = 'dropout1')(x)

x = Dense(NB_CLASSES, activation='sigmoid', name = 'fully_connected1')(x)

# The Model class turns input tensors and an output tensor into a model.
# This creates a Keras model instance, which is used to train/test the model.
model = Model(inputs=[meta_input, title_input, desc_input, sequence_input], outputs=x)

### 2. Compile model

In [None]:
# model.compile(loss=WeightedBinaryCrossEntropy(POS_RATIO),
#               optimizer='rmsprop',
#               metrics=['binary_accuracy', f1])

Metric values are recorded at the end of each epoch on the training dataset. If a validation dataset is also provided, then the metric recorded is also calculated for the validation dataset.

All metrics are reported in verbose output and in the history object returned from calling the fit() function. In both cases, the name of the metric function is used as the key for the metric values. In the case of metrics for the validation dataset, the “val_” prefix is added to the key.

You have now described your model. To train and test this model, there are four steps in Keras:
1. Create the model (the `Model(...)` call above)
2. Compile the model by calling `model.compile(optimizer = "...", loss = "...", metrics = ["accuracy"])`
3. Train the model on train data by calling `model.fit(x = ..., y = ..., epochs = ..., batch_size = ...)`
4. Test the model on test data by calling `model.evaluate(x = ..., y = ...)`

If you want to know more about `model.compile()`, `model.fit()`, `model.evaluate()` and their arguments, refer to the official [Keras documentation](https://keras.io/models/model/).


In [None]:
model.summary()

### Tensorboard callbacks /metrics /monitor training

<span style="color:red"> **Size of these files is killing storage during training. Is it histograms?**</span>

In [None]:
tb = TensorBoard(log_dir='./learn_embedding_logs', histogram_freq=1, write_graph=True, write_images=False)

In [None]:
CHECKPOINT_PATH = os.path.join(DATADIR, 'model_checkpoint.hdf5')

cp = ModelCheckpoint(
                     filepath = CHECKPOINT_PATH, 
                     monitor='val_loss', 
                     verbose=0, 
                     save_best_only=False, 
                     save_weights_only=False, 
                     mode='auto', 
                     period=1
                    )

In [None]:
# class Metrics(Callback):
#     def on_train_begin(self, logs={}):
#         self.val_f1s = []
#         self.val_recalls = []
#         self.val_precisions = []
 
#     def on_epoch_end(self, epoch, logs={}):
#         val_predict = (np.asarray(self.model.predict(self.model.validation_data[0]))).round()
#         val_targ = self.model.validation_data[1]
        
#         self.val_f1s.append(f1_score(val_targ, val_predict, average='micro'))
#         self.val_recalls.append(recall_score(val_targ, val_predict))
#         self.val_precisions.append(precision_score(val_targ, val_predict))
#         print("- val_f1: %f — val_precision: %f — val_recall %f" 
#                 %(f1_score(val_targ, val_predict, average='micro'), 
#                   precision_score(val_targ, val_predict),
#                    recall_score(val_targ, val_predict)))
#         return
 
# metrics = Metrics()

In [None]:
from keras.callbacks import EarlyStopping
early_stopping = EarlyStopping(monitor='val_loss', patience=2)
#model.fit(x, y, validation_split=0.2, callbacks=[early_stopping])

### 3. Train model

In [None]:
# metrics callback causes: CCCCCCR55555555511155
# So disable for now
from keras.utils import multi_gpu_model

# Replicates `model` on 8 GPUs.
# This assumes that your machine has 8 available GPUs.
parallel_model = multi_gpu_model(model, gpus=8)
parallel_model.compile(loss=WeightedBinaryCrossEntropy(POS_RATIO),
              optimizer='rmsprop',
              metrics=['binary_accuracy', f1])

# This `fit` call will be distributed on 8 GPUs.
# Since the batch size is 128, each GPU will process 16 samples.
history = parallel_model.fit(
    {'meta': meta_train, 'titles': title_train, 'descs': desc_train, 'wordindex': x_train},
    y_train, 
    validation_data=([meta_dev, title_dev, desc_dev, x_dev], y_dev), 
    epochs=10, batch_size=128, callbacks=[early_stopping]
)


# history = model.fit(
#     {'meta': meta_train, 'titles': title_train, 'descs': desc_train, 'wordindex': x_train},
#     y_train, 
#     validation_data=([meta_dev, title_dev, desc_dev, x_dev], y_dev), 
#     epochs=10, batch_size=128, callbacks=[early_stopping]
# )

In [None]:
history_dict = history.history
history_dict.keys()

In [None]:
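import matplotlib.pyplot as plt

loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']

# one point per completed epoch; early stopping can end training before epoch 10
epochs = range(1, len(loss_values) + 1)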

plt.plot(epochs, loss_values, 'bo', label='Training loss')           
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')      
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

In [None]:
plt.clf()    

f1_values = history_dict['f1']
val_f1_values = history_dict['val_f1']

plt.plot(epochs, f1_values, 'bo', label='Training f1')
plt.plot(epochs, val_f1_values, 'b', label='Validation f1')
plt.title('Training and validation batch-level f1-micro')
plt.xlabel('Epochs')
plt.ylabel('F1-micro')
plt.legend()

plt.show()

### Evaluate model

#### Training metrics

In [None]:
y_prob = parallel_model.predict([meta_train, title_train, desc_train, x_train])

In [None]:
y_prob.shape

In [None]:
y_pred = y_prob.copy()
y_pred[y_pred>=P_THRESHOLD] = 1
y_pred[y_pred<P_THRESHOLD] = 0

In [None]:
f1_score(y_train, y_pred, average='micro')

In [None]:
#average= None, the scores for each class are returned.
#precision_recall_fscore_support(y_train, y_pred, average=None, sample_weight=None)

In [None]:
#a = precision_recall_fscore_support(y_train, y_pred, average=None, sample_weight=None)
# pd.DataFrame(list(a))
# f1_byclass = pd.DataFrame((a)[2], columns=['f1'])

# support_byclass = pd.DataFrame((a)[3], columns=['support'])

# f1_byclass = pd.merge(
#     left=f1_byclass, 
#     right=support_byclass, 
#     left_index=True,
#     right_index=True,
#     how='outer', 
#     validate='one_to_one'
# )

# f1_byclass['index_col'] = f1_byclass.index

# f1_byclass['level2taxon'] = f1_byclass['index_col'].map(labels_index).copy()

# print("At p_threshold of {}, there were {} out of {} ({})% taxons with auto-tagged content in the training data"
#       .format(P_THRESHOLD, 
#               f1_byclass.loc[f1_byclass['f1'] > 0].shape[0], 
#               y_pred.shape[1], 
#               (f1_byclass.loc[f1_byclass['f1'] > 0].shape[0]/y_pred.shape[1])*100 ))

In [None]:
# no_auto_content = f1_byclass.loc[f1_byclass['f1'] == 0]
# no_auto_content = no_auto_content.set_index('level2taxon')

In [None]:
# no_auto_content['support'].sort_values().plot( kind = 'barh', figsize=(20, 20))

In [None]:
# classes_predictedto = f1_byclass.loc[f1_byclass['f1'] > 0]
# classes_predictedto = classes_predictedto.set_index('level2taxon') 

In [None]:
# classes_predictedto.plot.scatter(x='support', y='f1', figsize=(20, 10), xticks=np.arange(0, 9700, 100))

In [None]:
# classes_predictedto['f1'].sort_values().plot( kind = 'barh', figsize=(20, 20))

In [None]:
#Calculate globally by counting the total true positives, false negatives and false positives.
precision_recall_fscore_support(y_train, y_pred, average='micro', sample_weight=None) 

In [None]:
#Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account
precision_recall_fscore_support(y_train, y_pred, average='macro', sample_weight=None)

In [None]:
#Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This accounts for label imbalance
precision_recall_fscore_support(y_train, y_pred, average='weighted', sample_weight=None)
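
The three averaging modes used above can give quite different numbers on imbalanced labels. The following NumPy sketch works through a made-up two-label example by hand; `per_label_counts`, `avg_true` and `avg_pred` are hypothetical names, and the code is meant to illustrate the definitions rather than replace scikit-learn.

```python
import numpy as np


def per_label_counts(y_true, y_pred):
    """Column-wise true positives, false positives and false negatives."""
    tp = (y_true * y_pred).sum(axis=0)
    fp = ((1 - y_true) * y_pred).sum(axis=0)
    fn = (y_true * (1 - y_pred)).sum(axis=0)
    return tp, fp, fn


# Label 0 is common and always predicted correctly; label 1 is rare and never predicted
avg_true = np.array([[1, 0], [1, 0], [1, 0], [1, 1], [0, 1]])
avg_pred = np.array([[1, 0], [1, 0], [1, 0], [1, 0], [0, 0]])

tp, fp, fn = per_label_counts(avg_true, avg_pred)
denom = 2 * tp + fp + fn
label_f1 = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)
support = avg_true.sum(axis=0)

macro_f1 = label_f1.mean()                           # every label counts equally
weighted_f1 = np.average(label_f1, weights=support)  # labels weighted by support
micro_f1 = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())  # pooled counts

print(macro_f1, weighted_f1, micro_f1)
```

Here the macro average is pulled down most by the rare label that is never predicted, which is why it is a useful check alongside the micro score when many taxons have small support.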

#### Development set metrics

In [None]:
y_pred_dev = parallel_model.predict([meta_dev, title_dev, desc_dev, x_dev])

In [None]:
y_pred_dev[y_pred_dev>=P_THRESHOLD] = 1
y_pred_dev[y_pred_dev<P_THRESHOLD] = 0

In [None]:
#average= None, the scores for each class are returned.
precision_recall_fscore_support(y_dev, y_pred_dev, average=None, sample_weight=None)

In [None]:
#Calculate globally by counting the total true positives, false negatives and false positives.
precision_recall_fscore_support(y_dev, y_pred_dev, average='micro', sample_weight=None) 

In [None]:
#Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account
precision_recall_fscore_support(y_dev, y_pred_dev, average='macro', sample_weight=None)

In [None]:
#Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This accounts for label imbalance
precision_recall_fscore_support(y_dev, y_pred_dev, average='weighted', sample_weight=None)