## SpaCy
We're going to try to create a text classifier using Spacy to tag each PQ with a topic keyword. We want a mutli-classifier, i.e. PQs can have more than one tag. 

## Exploring model training to recognise EV-related PQs.

In [1]:
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy import displacy
from spacy.util import minibatch, compounding
from tqdm import tqdm
tqdm.pandas()
import pandas as pd

nlp = spacy.blank('en')
matcher = Matcher(nlp.vocab)



In [2]:
wpqs = pd.read_csv('cleaned.csv')
wpqs['cleanedQuestion'].fillna('', inplace=True)
wpqs['topic'].fillna('', inplace=True)

In [3]:
# Select some text data

data = wpqs[['cleanedQuestion', 'topic']]

In [4]:
data.head()

Unnamed: 0,cleanedQuestion,topic
0,how much has been paid through working tax cre...,working tax credit
1,what the value was of the average claim for ta...,welfare tax credits
2,what the total value was of tax credits paid t...,welfare tax credits
3,how many self-employed people claimed (a) chil...,welfare tax credits
4,how many families in (a) york central constitu...,welfare tax credits


In [5]:
# data[(data.label == None) & (data.cleanedQuestion.str.contains('on-street residential'))].cleanedQuestion.tolist()

In [6]:
# lst = ['coronavirus', 'universal credit', 'nhs', 'children', 'schools', 'social security benefits', 'asylum', 'immigraiton', 'metnal health servicies', 'armed forces', 'prisons']

In [7]:
# data['label'] = data.progress_apply(lambda row: 0 if row.topic in lst else row['label'], axis=1)

In [8]:
data.topic.value_counts().head(30)

coronavirus                      7881
universal credit                 4082
nhs                              3654
children                         3637
schools                          3473
social security benefits         3231
asylum                           3137
railways                         3133
immigration                      2815
mental health services           2717
housing                          2446
armed forces                     2342
prisons                          1965
personal independence payment    1927
developing countries             1910
health services                  1779
refugees                         1682
energy                           1669
israel                           1655
police                           1651
apprentices                      1640
members                          1638
                                 1603
syria                            1596
ministry of defence              1545
social services                  1512
cancer      

In [9]:
# data_sel = data.copy()
# data_sel['topic'] = data_sel.label.astype('float')

In [10]:
# Manually label EVs
data['ELECTRICV'] = data.progress_apply(lambda row: 1 if 'electric vehicle' in row.cleanedQuestion else 0, axis=1)
data['ELECTRICV'] = data.progress_apply(lambda row: 1 if 'electric car' in row.cleanedQuestion else row['ELECTRICV'], axis=1)
data['ELECTRICV'] = data.progress_apply(lambda row: 1 if 'public charging' in row.cleanedQuestion else row['ELECTRICV'], axis=1)
data['ELECTRICV'] = data.progress_apply(lambda row: 1 if 'gigafactory' in row.cleanedQuestion else row['ELECTRICV'], axis=1)
data['ELECTRICV'] = data.progress_apply(lambda row: 1 if 'electric battery' in row.cleanedQuestion else row['ELECTRICV'], axis=1)
data['ELECTRICV'] = data.progress_apply(lambda row: 1 if 'vehicle charging' in row.cleanedQuestion else row['ELECTRICV'], axis=1)
data['ELECTRICV'] = data.progress_apply(lambda row: 1 if 'car grant' in row.cleanedQuestion else row['ELECTRICV'], axis=1)
data['ELECTRICV'] = data.progress_apply(lambda row: 1 if 'homecharge' in row.cleanedQuestion else row['ELECTRICV'], axis=1)
data['ELECTRICV'] = data.progress_apply(lambda row: 1 if 'workplace charging' in row.cleanedQuestion else row['ELECTRICV'], axis=1)
data['ELECTRICV'] = data.progress_apply(lambda row: 1 if 'on-street residential' in row.cleanedQuestion else row['ELECTRICV'], axis=1)

100%|████████████████████████████████████| 385417/385417 [00:03<00:00, 127642.41it/s]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['ELECTRICV'] = data.progress_apply(lambda row: 1 if 'electric vehicle' in row.cleanedQuestion else 0, axis=1)
100%|████████████████████████████████████| 385417/385417 [00:03<00:00, 101914.99it/s]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['ELECTRICV'] = data.progress_apply(lambda row: 1 if 'electric car' in row.cleanedQuestion else row['ELECTRICV'], axis=1)
100%|████████████████████████████████████| 385417/385417 

In [11]:
data['CORONAVIRUS'] = data.progress_apply(lambda row: 1 if 'coronavirus' in row.topic else 0, axis=1)

100%|████████████████████████████████████| 385417/385417 [00:02<00:00, 133161.02it/s]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['CORONAVIRUS'] = data.progress_apply(lambda row: 1 if 'coronavirus' in row.topic else 0, axis=1)


In [12]:
data['UCREDIT'] = data.progress_apply(lambda row: 1 if 'universal credit' in row.topic else 0, axis=1)

100%|████████████████████████████████████| 385417/385417 [00:02<00:00, 133193.14it/s]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['UCREDIT'] = data.progress_apply(lambda row: 1 if 'universal credit' in row.topic else 0, axis=1)


In [13]:
data['NHS'] = data.progress_apply(lambda row: 1 if 'nhs' in row.topic else 0, axis=1)

100%|████████████████████████████████████| 385417/385417 [00:03<00:00, 125496.69it/s]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['NHS'] = data.progress_apply(lambda row: 1 if 'nhs' in row.topic else 0, axis=1)


In [14]:
data['CHILDREN'] = data.progress_apply(lambda row: 1 if 'children' in row.topic else 0, axis=1)

100%|████████████████████████████████████| 385417/385417 [00:03<00:00, 125829.08it/s]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['CHILDREN'] = data.progress_apply(lambda row: 1 if 'children' in row.topic else 0, axis=1)


In [15]:
data['EDUCATION'] = data.progress_apply(lambda row: 1 if 'schools' in row.topic else 0, axis=1)

100%|████████████████████████████████████| 385417/385417 [00:02<00:00, 129288.09it/s]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['EDUCATION'] = data.progress_apply(lambda row: 1 if 'schools' in row.topic else 0, axis=1)


In [16]:
data.head()

Unnamed: 0,cleanedQuestion,topic,ELECTRICV,CORONAVIRUS,UCREDIT,NHS,CHILDREN,EDUCATION
0,how much has been paid through working tax cre...,working tax credit,0,0,0,0,0,0
1,what the value was of the average claim for ta...,welfare tax credits,0,0,0,0,0,0
2,what the total value was of tax credits paid t...,welfare tax credits,0,0,0,0,0,0
3,how many self-employed people claimed (a) chil...,welfare tax credits,0,0,0,0,0,0
4,how many families in (a) york central constitu...,welfare tax credits,0,0,0,0,0,0


In [17]:
cats_dict = {
    'CORONAVIRUS': 0,
    'UCREDIT': 0,
    'NHS': 0,
    'CHILDREN': 0,
    'EDUCATION': 0,
    'ELECTRICV': 0
}
cats = {'cats': cats_dict}

In [18]:
ln = data.shape[0]

In [19]:
ln

385417

In [20]:
data.iloc[0]

cleanedQuestion    how much has been paid through working tax cre...
topic                                             working tax credit
ELECTRICV                                                          0
CORONAVIRUS                                                        0
UCREDIT                                                            0
NHS                                                                0
CHILDREN                                                           0
EDUCATION                                                          0
Name: 0, dtype: object

In [21]:
# data.iloc[0][['ELECTRICV', 'CORONAVIRUS', 'UCREDIT', 'NHS', 'CHILDREN', 'EDUCATION']].to_dict()

In [27]:
data[(data.ELECTRICV > 0) | (data.CORONAVIRUS > 1)]

Unnamed: 0,cleanedQuestion,topic,ELECTRICV,CORONAVIRUS,UCREDIT,NHS,CHILDREN,EDUCATION
11815,what assessment he has made of the comprehensi...,electric vehicles,1,0,0,0,0,0
27572,how many ultra low emission vehicles of what m...,motor vehicles,1,0,0,0,0,0
40813,how many publicly-funded charging points for e...,electric vehicles,1,0,0,0,0,0
40837,what his department's budget was for new charg...,electric vehicles,1,0,0,0,0,0
41542,what assessment he has made of the possible da...,electric vehicles,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...
383221,what steps his department is taking to ensure ...,electric vehicles,1,0,0,0,0,0
383358,what recent assessment he has made of the adeq...,electric vehicles,1,0,0,0,0,0
383481,how many electric car charging points (a) are ...,electric vehicles,1,0,0,0,0,0
384101,what support his department is providing to lo...,,1,0,0,0,0,0


In [29]:
data_sel = data[(data.ELECTRICV > 0) | (data.CORONAVIRUS > 0) | (data.UCREDIT > 0) | (data.NHS > 0) | (data.CHILDREN > 0) | (data.EDUCATION > 0)]

In [30]:
data_sel['tuples'] = data_sel.progress_apply(lambda row: (row['cleanedQuestion'], {'cats': row[['ELECTRICV', 'CORONAVIRUS', 'UCREDIT', 'NHS', 'CHILDREN', 'EDUCATION']].to_dict()}), axis=1)

100%|████████████████████████████████████████| 28524/28524 [00:05<00:00, 4868.07it/s]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_sel['tuples'] = data_sel.progress_apply(lambda row: (row['cleanedQuestion'], {'cats': row[['ELECTRICV', 'CORONAVIRUS', 'UCREDIT', 'NHS', 'CHILDREN', 'EDUCATION']].to_dict()}), axis=1)


In [31]:
# data_sel is a labelled dataset, we can now preprocess
# data_sel['tuples'] = data_sel.progress_apply(lambda row: (row['cleanedQuestion'],cats), axis=1)
data_sel = data_sel['tuples'].tolist()
data_sel[:1]

[('what estimate he has made of the number of partners in small businesses who will apply for universal credit; and what steps such people need to take to establish their monthly income in order to do so.',
  {'cats': {'ELECTRICV': 0,
    'CORONAVIRUS': 0,
    'UCREDIT': 1,
    'NHS': 0,
    'CHILDREN': 0,
    'EDUCATION': 0}})]

In [54]:
textcat = get_textcat_pipe(nlp)

NameError: name 'get_textcat_pipe' is not defined

In [38]:
len(data_sel)

28524

In [35]:
def load_data(train_data, limit=0, split=0.8):
#     train_data = data
    np.random.shuffle(train_data)
    train_data = train_data[-limit:]
    texts, labels = zip(*train_data)
    cats = [{'ELECTRICV': bool(y)} for y in labels]
    split = int(len(train_data) * split)
    return (texts[:split], cats[:split]), (texts[split:], cats[split:])

In [36]:
def evaluate(tokenizer, textcat, texts, cats):
    docs = (tokenizer(text) for text in texts)
    tp = 1e-8  # True positives
    fp = 1e-8  # False positives
    fn = 1e-8  # False negatives
    tn = 1e-8  # True negatives
    for i, doc in enumerate(textcat.pipe(docs)):
        gold = cats[i]
        for label, score in doc.cats.items():
            if label not in gold:
                continue
            if score >= 0.5 and gold[label] >= 0.5:
                tp += 1.
            elif score >= 0.5 and gold[label] < 0.5:
                fp += 1.
            elif score < 0.5 and gold[label] < 0.5:
                tn += 1
            elif score < 0.5 and gold[label] >= 0.5:
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * (precision * recall) / (precision + recall)
    return {'textcat_p': precision, 'textcat_r': recall, 'textcat_f': f_score}

In [39]:
n_texts = 20000
n_iter = 10

In [46]:
if 'textcat_multilabel' not in nlp.pipe_names:
    textcat = nlp.add_pipe('textcat_multilabel')
else:
    textcat = nlp.get_pipe

In [47]:
data_sel[0]

('what estimate he has made of the number of partners in small businesses who will apply for universal credit; and what steps such people need to take to establish their monthly income in order to do so.',
 {'cats': {'ELECTRICV': 0,
   'CORONAVIRUS': 0,
   'UCREDIT': 1,
   'NHS': 0,
   'CHILDREN': 0,
   'EDUCATION': 0}})

In [51]:
nlp.add_label('ELECTRICV')
# textcat.add_label('CORONAVIRUS')
# textcat.add_label('UCREDIT')
# textcat.add_label('NHS')
# textcat.add_label('CHILDREN')
# textcat.add_label('EDUCATION')

AttributeError: 'English' object has no attribute 'add_label'

In [53]:
textcat.labels

AttributeError: 'function' object has no attribute 'labels'

In [None]:
textcat.labels

In [None]:
(train_texts, train_cats), (dev_texts, dev_cats) = load_data(train_data = data_sel, limit=n_texts)
print("Using {} examples ({} training, {} evaluation)".format(n_texts, len(train_texts), len(dev_texts)))

In [None]:
train_data = list(zip(train_texts, [{'cats': cats} for cats in train_cats]))

In [None]:
train_data

In [None]:
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat_multilabel']


with nlp.disable_pipes(*other_pipes):  # only train textcat
    optimizer = nlp.begin_training()
    print("Training the model...")
    print('{:^5}\t{:^5}\t{:^5}\t{:^5}'.format('LOSS', 'P', 'R', 'F'))
    for i in range(n_iter):
        losses = {}
        # batch up the examples using spaCy's minibatch
        batches = minibatch(train_data, size=compounding(4., 32., 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.2,
                       losses=losses)
        with textcat.model.use_params(optimizer.averages):
            # evaluate on the dev data split off in load_data()
            scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)
        print('{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}'  # print a simple table
              .format(losses['textcat'], scores['textcat_p'],
                      scores['textcat_r'], scores['textcat_f']))