This notebook lets you examine the curriculum through a very elementary text mining approach. All mineable teaching documents have been downloaded and their readable content extracted. These include Word (.docx), PowerPoint (.pptx), plain text, web pages, and PDFs. There are some technical wrinkles, especially with PDFs.
Keywords and keyword pairs (bigrams) have been extracted from this text and are used as the basis for similarity matching.
There is the option to use *stemming* when processing the keywords, where plurals and similar derived forms are reduced to the stem of the word.
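
As a rough illustration of the keyword, bigram and stemming steps described above, here is a minimal sketch using only the standard library. The stopword list, the `naive_stem` helper and the choice to build bigrams after stopword removal are all assumptions made for this example; the notebook itself uses the NLTK Porter stemmer, which is considerably more thorough than the crude suffix stripping shown here.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are", "by"}

def naive_stem(word):
    """Strip a trailing plural 's' (a crude stand-in for a real stemmer)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def keywords(text, stem=None):
    """Lowercase the text, keep alphabetic tokens, drop stopwords, optionally stem."""
    words = re.findall(r"[a-z]+", text.lower())
    words = [w for w in words if w not in STOPWORDS]
    return [stem(w) for w in words] if stem else words

def bigrams(words):
    """Pair each keyword with the keyword that follows it."""
    return list(zip(words, words[1:]))

text = "Protein kinases are regulated by the phosphorylation of proteins."
words = keywords(text, stem=naive_stem)
print(words)
print(bigrams(words))
```

With stemming switched on, "kinases" and "proteins" collapse onto the same keywords as their singular forms, so documents that differ only in such endings still share terms.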

Each document is matched against every other document using cosine similarity over the tf-idf weights ([see the Wikipedia entry](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)) calculated for every keyword. That means about 500,000 document similarity measures. Each score is multiplied by 1 million to give a more readable number. We can use various tools to explore the text: finding where particular keywords occur (concordance), finding which documents match a document of interest, and so on.
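
The pairwise scoring can be sketched in a few lines of plain Python. This is not the code in `tfidf.py`: the toy documents are invented, and the particular tf-idf variant used here (term frequency as count divided by document length, idf as the natural log of N/df) is an assumption, since several variants are in common use.

```python
import math
from collections import Counter
from itertools import combinations

# Toy corpus: document name -> list of keywords (hypothetical data).
docs = {
    "lecture_a": ["kinase", "signal", "cell", "kinase"],
    "lecture_b": ["kinase", "phosphatase", "cell"],
    "lecture_c": ["meiosis", "chromosome", "cell"],
}

def tfidf_vectors(docs):
    """Return a sparse tf-idf vector (term -> weight) for every document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for words in docs.values() for term in set(words))
    vectors = {}
    for name, words in docs.items():
        counts = Counter(words)
        vectors[name] = {
            term: (count / len(words)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

vectors = tfidf_vectors(docs)
# Score every unordered pair once and scale by one million for readability.
scores = {
    (a, b): round(cosine(vectors[a], vectors[b]) * 1_000_000)
    for a, b in combinations(sorted(vectors), 2)
}
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, score)
```

Note that "cell" occurs in every toy document, so its idf is zero and it contributes nothing to any score; only the shared, more specific term "kinase" links the first two lectures. Scoring each unordered pair once gives n(n-1)/2 comparisons for n documents, so a figure of about 500,000 would correspond to roughly 1,000 documents under that assumption.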

The similarity network can be downloaded and visualised in R. If you filter the similarities by score (approximately 2-3 edges per node is a good rule of thumb), a clearer representation emerges. [An R script that does this is available (it depends on the igraph package, which can be installed from CRAN).](quickview.R)
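
To get a feel for the rule of thumb before moving to R, the effect of a score cut-off can be checked in Python. The edge list below is invented, and "edges per node" is interpreted here as the mean degree (each edge counted once for each of its two endpoints); both are assumptions for this sketch rather than a description of what `quickview.R` does.

```python
# Hypothetical edge list: (document_a, document_b, similarity score).
edges = [
    ("doc1", "doc2", 50000),
    ("doc1", "doc3", 20000),
    ("doc2", "doc3", 4000),
    ("doc1", "doc4", 2000),
    ("doc3", "doc4", 700),
    ("doc2", "doc4", 100),
]

def edges_per_node(edges, min_score):
    """Keep edges scoring at least min_score and report the mean degree."""
    kept = [edge for edge in edges if edge[2] >= min_score]
    nodes = {name for a, b, _ in edges for name in (a, b)}
    # Each edge touches two nodes, so mean degree = 2 * edges / nodes.
    return kept, 2 * len(kept) / len(nodes)

for threshold in (100, 1000, 10000):
    kept, mean_degree = edges_per_node(edges, threshold)
    print(threshold, len(kept), mean_degree)
```

Raising the threshold until the mean degree falls into the 2-3 range is one simple way to pick a cut-off for a given network.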

[A blog post covering text mining and more in R](https://www.r-bloggers.com/an-overview-of-text-mining-visualisations-possibilities-with-r-on-the-ceta-trade-agreement/)

## Loading the corpus
A corpus is a collection of documents. The extracted text is in the *keydata* directory.

Click on the cell below and press CTRL-Enter to run it. Loading may take a few minutes. In the example below we use a stemmer from the Natural Language Toolkit (NLTK) and choose to include pairs of words as keywords.

In [3]:
import nltk
from tfidf import *
corpus = TeachingCorpus('keydata5', stemmer=nltk.PorterStemmer(), bigraph=True) 

Error reading corpus from keydata5: None


Traceback (most recent call last):
  File "C:\Users\marti\Documents\LifeSciteaching\Curriculum\curriculumTools\curriculum map 2016\tfidf.py", line 522, in load_corpus
    corpus = nltk.corpus.PlaintextCorpusReader(self._datadir,'.*\.txt')
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\corpus\reader\plaintext.py", line 62, in __init__
    CorpusReader.__init__(self, root, fileids, encoding)
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\corpus\reader\api.py", line 80, in __init__
    root = FileSystemPathPointer(root)
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\compat.py", line 41, in _decorator
    return init_func(*args, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\data.py", line 312, in __init__
    raise OSError("No such file or directory: %r" % _path)
OSError: No such file or directory: 'C:\\Users\\marti\\Documents\\LifeSciteaching\\Curriculum\\curriculumTools\\curriculum map 2016\\keydata5'


In [2]:
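# Build a keyword index for every file in the corpus.
for f in corpus.corpus.fileids():
    try:
        corpus.indices[f] = IndexedText(corpus._stemmer, corpus.corpus.words(f))
    except Exception as err:
        # Skip files that cannot be indexed, but report them rather than failing silently.
        print(f"Could not index {f}: {err}")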

That has loaded the text and built indexed keyword lists. You can now find keywords throughout the curriculum with the *corpus.show_matches(keyword)* command. To see the specific matches in context, add the option details=True to the command.

In [10]:
corpus.show_matches('kinase', details=False)
# change the keyword to anything you want and try it out - remember that text must be between quotes.

351 documents returned matches
BS42019 (30 files match)
BS32006 (27 files match)
BS42013 (24 files match)
BS42014 (21 files match)
BS42015 (18 files match)
BS32009 (12 files match)
BS31004 (11 files match)
BS42021 (11 files match)
BS32003 (11 files match)
BS42020 (10 files match)
BS42018 (10 files match)
BS31005 (10 files match)
BS42007 (9 files match)
BS42008 (9 files match)
BS42005 (8 files match)
BS42009 (8 files match)
BS42006 (8 files match)
BS31006 (8 files match)
BS32028 (8 files match)
BS42022 (7 files match)
BS31019 (7 files match)
BS42023 (7 files match)
BS22002 (7 files match)
BS32004 (6 files match)
BS42004 (6 files match)
BS32008 (6 files match)
BS42024 (6 files match)
BS31013 (5 files match)
BS22001 (5 files match)
BS42010 (4 files match)
BS21002 (4 files match)
BS42012 (4 files match)
BS32024 (3 files match)
BS42025 (3 files match)
BS32022 (2 files match)
advhigher (2 files match)
BS32011 (2 files match)
BS31016 (2 files match)
BS12002 (2 files match)
BS42017 (1 files ma

In [None]:
corpus.show_matches('handbook', details=True)
# This shows the specific matches in context. Note that the text shown in the
# matches is the stemmed keywords without stopwords, so it may look a little strange.
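
The keyword-in-context view that `details=True` produces can be approximated with a short concordance function. This is a sketch of the general idea, not the implementation behind `show_matches`; the `concordance` helper, its `width` parameter and the sample keyword list are hypothetical.

```python
def concordance(words, keyword, width=3):
    """Return each occurrence of keyword with up to `width` words of context on each side."""
    hits = []
    for i, word in enumerate(words):
        if word == keyword:
            left = words[max(0, i - width):i]
            right = words[i + 1:i + 1 + width]
            hits.append(" ".join(left + [f"[{word}]"] + right))
    return hits

# Hypothetical stemmed keyword list with stopwords already removed.
words = (
    "protein kinase regulate cell cycle cyclin dependent kinase "
    "inhibitor block progress"
).split()

for line in concordance(words, "kinase"):
    print(line)
```

Because the context is built from stemmed keywords with stopwords removed, the lines read as compressed fragments rather than full sentences.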

## Inexact matching of documents and text

Individual keywords are great for some purposes, but can be very noisy. Often we would like to take a lecture or a paper and find the most closely related documents (and hence perhaps the most closely related module). We can do that with a measure of how well the terms in two documents match, weighted by how specific those terms are to a particular document.


In [None]:
# This is a student essay in a fourth year module.
corpus.match_file('test_document1.docx')
# only the top matches are returned - control the number with the matches option


We can compare all the files from one module against all the other files not in that module to see how modules relate. This can take some time. 

In [3]:
corpus.match_module('BS31005', matches=10)



Your file Tutorial 3 – Andy Flavell.docx from assessments matched the following modules/documents:
55533	BS32008	assessments	BS32008_3.txt	Tutorial 1 Andy Flavell.docx
18059	BS32008	week 20	BS32008_7.txt	Lecture 4 - 2016(2).pptx
3654	BS32008	assessments	BS32008_1.txt	BS32008 Final 2015.docx
3120	BS11001	WEEK 11	BS11001_5.txt	BS11001_LTEY_DZB_molecular_evolution(4).pptx
2273	BS22002	week 23	BS22002_25.txt	BS22002Intro(1).pptx
2230	BS32022	week 19	BS32022_14.txt	BS32022-L14.pptx
1814	BS32008	week 20	BS32008_6.txt	Lecture 3 - 2016.pptx
1782	BS12001	week 16	BS12001_5.txt	BS12001 Chromosome structure(1).pptx
1706	BS42014	lectures	BS42014_21.txt	L4 2016 Obesity and Metabolic Disease Lectures 5 & 6 (Ashford).pptx
1621	BS42008	assessments	BS42008_2.txt	Paper analysis test-Formative.docx

Your file linkage_Tutorial_ notes2015.pptx from assessments matched the following modules/documents:
67035	BS12001	week 16	BS12001_6.txt	BS12001 Meiosis and Mendel(2).pptx
18956	BS42024	workshops	BS42024_14.tx

In [3]:
corpus.cal_network('ahall4network.csv',minsize=700)


In [None]:
sorted([k.lower() for k in corpus.fileinfo.keys()])[:20]


In [None]:
corpus.load_fileinfo('keydata5')
