# Education theme - all audits - all data excluding PDF contents (including all possible bigrams)

This experiment used 8697 pages from GOV.UK related to the education theme. We extracted the following content from those pages:

- Title
- Description
- Indexable content (i.e. the body of the document stored in Search)
- Existing topic names
- Existing organisation names

To do so, we combined data from the search index and the content store.

The input is the same as in `20_topics_without_pdf_data`. That experiment kept only the bigrams (word pairs) from each document that matched a curated list. This experiment keeps all possible bigrams, regardless of whether they are on the list.
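To make the difference concrete, here is a minimal sketch of the two approaches. The helper names `all_bigrams` and `curated_bigrams` are hypothetical, and joining the two words with an underscore is an assumption. The real extraction is done by `fetch_document_bigrams` in `gensim_engine.py`, whose implementation is not shown here and may differ.

```python
def all_bigrams(lemmas):
    """Return every pair of adjacent lemmas, joined with an underscore (assumed format)."""
    return [f"{first}_{second}" for first, second in zip(lemmas, lemmas[1:])]


def curated_bigrams(lemmas, allowed):
    """Return only the adjacent pairs that appear in a curated list."""
    allowed = set(allowed)
    return [bigram for bigram in all_bigrams(lemmas) if bigram in allowed]


# Illustrative lemmas, not taken from the real data set
lemmas = ["school", "funding", "reform", "consultation"]

# This experiment: keep every bigram
print(all_bigrams(lemmas))
# ['school_funding', 'funding_reform', 'reform_consultation']

# Previous experiment: keep only bigrams found on the curated list
print(curated_bigrams(lemmas, ["school_funding"]))
# ['school_funding']
```

Keeping every bigram enlarges the vocabulary considerably, so it is worth comparing the resulting topics against the curated-list run.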

We then ran Latent Dirichlet Allocation (LDA) with the following parameters:

- we asked for 20 topics
- we let LDA run with 50 iterations

The results of this experiment are shown below. To run the script again, use:

```shell
python train_lda.py --output-topics output/all_bigrams_topics.csv --output-tags output/all_bigrams_tags.csv --vis-filename experiments/all_bigrams.html expanded_audits/all_audits_for_education_words_nopdf.csv
```

With commit `ab8f93a2be7080a8cb73720badad48499c907da2`

```diff
--- a/gensim_engine.py
+++ b/gensim_engine.py
@@ -90,13 +90,8 @@ class GensimEngine:
         self.topics = []
         self.ldamodel = None
         self.bigrams = []
-        self.top_bigrams = []
         self.include_bigrams = include_bigrams

-        with open('input/bigrams.csv', 'r') as f:
-            reader = csv.reader(f)
-            self.top_bigrams = [bigram[0] for bigram in list(reader)]
-
         if log:
             logging.basicConfig(
                 format='%(asctime)s : %(levelname)s : %(message)s',
@@ -178,10 +173,9 @@ class GensimEngine:

         # Extract the most common bigrams in the document
         document_bigrams = self.fetch_document_bigrams(document_lemmas)
-        known_bigrams = [bigram for bigram in document_bigrams if bigram in self.top_bigrams]

         # Calculate the bag of words
-        document_bow = self.dictionary.doc2bow(document_lemmas + known_bigrams)
+        document_bow = self.dictionary.doc2bow(document_lemmas + document_bigrams)

         # Tag the document
         all_tags = self.ldamodel[document_bow]
@@ -203,8 +197,7 @@ class GensimEngine:
             raw_text = document['text'].lower()
             all_lemmas = lemmatize(raw_text, allowed_tags=re.compile('(NN|JJ)'), stopwords=STOPWORDS)
             document_bigrams = self.fetch_document_bigrams(all_lemmas)
-            known_bigrams = [bigram for bigram in document_bigrams if bigram in self.top_bigrams]
-            lemmas.append(all_lemmas + known_bigrams)
+            lemmas.append(all_lemmas + document_bigrams)

         if dictionary_path:
             print("Load pre-existing dictionary from file")
```

## Dictionary

The model works from a dictionary of the words (and bigrams) collected from the documents. This dictionary also holds information on word frequencies.
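As a rough illustration of what such a dictionary holds, the sketch below builds a token-to-id mapping and per-document counts using only the standard library. It is a simplified stand-in, not the dictionary object used by `gensim_engine.py`. The functions `build_dictionary` and `doc2bow` are hypothetical helpers that only mimic the shape of the data, and the documents are made up.

```python
from collections import Counter


def build_dictionary(documents):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for tokens in documents:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc2bow(token2id, tokens):
    """Convert one document into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())


# Illustrative documents: lemmas followed by their bigrams
documents = [
    ["school", "funding", "school_funding"],
    ["school", "inspection", "school_inspection", "school"],
]

token2id = build_dictionary(documents)
bow = doc2bow(token2id, documents[1])

print(token2id)
print(bow)  # [(0, 2), (3, 1), (4, 1)]
```

The bag of words discards word order, which is one reason bigrams are appended to the lemmas: they preserve a little local context.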

```python
# This code is required so we can display the visualisation
import pyLDAvis
import pandas as pd
from IPython.display import display, HTML

# Widen the notebook cells to the full page width
display(HTML("<style>.container { width:100% !important; }</style>"))

# Limit how many rows and columns pandas displays
pd.options.display.max_rows = 30
pd.options.display.max_columns = 50

pyLDAvis.enable_notebook()
```

## Interactive topic model visualisation

The page below displays the topics generated by the algorithm and allows us to interact with them in order to discover what words make up each topic.

```python
from IPython.display import HTML
HTML(filename='all_bigrams.html')
```

## Sample of tagged documents

Below is a sample of the education links and the corresponding topics the algorithm tagged them with. It helps us check whether the algorithm is assigning meaningful topics to those documents.

For the complete list, see [all_bigrams_tags.csv](all_bigrams_tags.csv).
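Each entry in the sample shows the three most probable topics for a document. A minimal sketch of how such a listing could be produced from `(topic_id, probability)` pairs follows. The helpers `top_topics` and `format_topics` are hypothetical, the input values are illustrative, and rounding to a whole percentage is an assumption. The actual script may select and format the tags differently.

```python
def top_topics(tags, limit=3):
    """Return the `limit` most probable (topic_id, probability) pairs, highest first."""
    return sorted(tags, key=lambda pair: pair[1], reverse=True)[:limit]


def format_topics(tags):
    """Render each pair as a list item with a zero-padded whole percentage."""
    return [
        f"- Topic {topic_id} ({round(probability * 100):02d}%)"
        for topic_id, probability in tags
    ]


# Illustrative tags for one document, in (topic_id, probability) form
example_tags = [(1, 0.14), (4, 0.79), (6, 0.03), (9, 0.02)]

for line in format_topics(top_topics(example_tags)):
    print(line)
```

Because only the top three topics are kept, the percentages for a document should sum to at most 100%.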

## https://www.gov.uk/government/news/plans-to-manage-school-and-further-education-inspections-in-house
- Topic 12 (41%)
- Topic 19 (18%)
- Topic 14 (09%)

## https://www.gov.uk/government/organisations/ofsted/about/personal-information-charter
- Topic 17 (23%)
- Topic 14 (22%)
- Topic 2 (20%)

## https://www.gov.uk/government/speeches/written-ministerial-statement-by-michael-gove-on-building-schools-for-the-future
- Topic 9 (27%)
- Topic 14 (25%)
- Topic 11 (18%)

## https://www.gov.uk/government/consultations/high-needs-funding-reform
- Topic 9 (44%)
- Topic 14 (29%)
- Topic 6 (11%)

## https://www.gov.uk/government/publications/apprenticeship-standard-maritime-mechanical-fitter-standard
- Topic 4 (79%)
- Topic 1 (14%)
- Topic 6 (03%)

## https://www.gov.uk/government/news/phonics-screening-check-week-begins-today
- Topic 3 (70%)
- Topic 6 (13%)
- Topic 2 (10%)

## https://www.gov.uk/government/speeches/school-funding
- Topic 16 (34%)
- Topic 14 (25%)
- Topic 9 (20%)

## https://www.gov.uk/government/publications/national-occupational-standards-for-supporting-teaching-learning
- Topic 17 (34%)
- Topic 8 (13%)
- Topic 9 (13%)

## https://www.gov.uk/government/publications/point-in-time-survey-for-pupils-aged-3-to-19-at-non-association-independent-schools
- Topic 0 (68%)
- Topic 9 (13%)
- Topic 15 (08%)

## https://www.gov.uk/government/publications/sfa-update-issue-303-6-april-2016
- Topic 17 (75%)
- Topic 2 (06%)
- Topic 19 (05%)

## https://www.gov.uk/government/publications/sfa-update-issue-279-7-october-2015
- Topic 17 (74%)
- Topic 2 (09%)
- Topic 18 (05%)

## https://www.gov.uk/guidance/16-to-18-education-free-meals-for-academic-year-2016-to-2017
- Topic 13 (57%)
- Topic 9 (31%)
- Topic 2 (06%)

## https://www.gov.uk/government/publications/pre-warning-notice-to-the-malcolm-arnold-academy
- Topic 7 (93%)
- Topic 9 (05%)
- Topic 13 (61%)

## https://www.gov.uk/government/publications/apprenticeship-standard-aviation-ground-operative
- Topic 4 (77%)
- Topic 1 (14%)
- Topic 6 (04%)

## https://www.gov.uk/government/publications/welfare-and-duty-of-care-in-armed-forces-initial-training-2015
- Topic 19 (53%)
- Topic 12 (33%)
- Topic 14 (07%)

## https://www.gov.uk/government/publications/pre-warning-notice-to-sir-thomas-wharton-community-college
- Topic 7 (92%)
- Topic 19 (06%)
- Topic 14 (58%)

## https://www.gov.uk/government/publications/gcse-9-to-1-subject-level-guidance-for-computer-science
- Topic 8 (71%)
- Topic 0 (24%)
- Topic 14 (02%)

## https://www.gov.uk/government/publications/apprenticeship-standard-junior-content-producer
- Topic 4 (75%)
- Topic 1 (13%)
- Topic 6 (07%)