## Project 04: Extending Wikipedia Text Explorer

- **Due**: Tuesday, 30 October 2018; 12:00pm
- **Total Points**: 180
    - code runs and does not produce too many lint/style warnings, 40 points
    - the output shows the changes correctly on your dataset, 100 points
    - the code works on other datasets as well, 40 points
    
In this project, each of you will be tasked with making some modification
to the `wikitext.py` module. You need to make the change and re-run the analysis
that you had from Project 03 (if getting the topic and cluster names again is too
difficult, that is not a big deal). If you have any questions about a specific task,
please let me know!

**For submission, make sure to upload all of the html files. You should also
upload this project file.**

### Tasks

For this assignment, you'll pick one of the following sets of tasks. My intent
was that all five are equally time-consuming, though (roughly) as you move
down the list the difficulty increases but the number of tasks you have to
perform decreases.

1. **Update user interface**: The user interface is fairly basic at 
the moment. Add, with appropriate documentation, the following elements:
    1. Allow users to select the number of top documents shown in the
    topics page and the most similar documents shown on the documents 
    page.
    2. Print out status messages for initializing the WikiCorpus object
    and the function `wiki_text_explorer`. If you are feeling ambitious,
    use the logging module for this.
    3. Modify `json_meta_template` to construct better default topic and
    cluster names. Perhaps the top 3-5 terms in each, separated by a hyphen?
2. **Linguistic Information**: I have stored a list of other languages 
that pages have been translated into in the `WikiCorpus` metadata. Display
a subset of these (say, a dozen or so popular languages) on the document
page by showing a country flag when a page has been translated into a given
language.
3. **Network Analysis**: Compute a network structure using the internal links
and display the eigenvector centrality (the `eigen` score) on the documents page.
4. **Images**: Write a function that finds the first image found on a
Wikipedia page and store the `src` attribute in the `WikiCorpus` metadata.
Then, add a column to the document page that includes the image. 
5. **Integrate with spacy**: Load the *spacy* module and use its much
more sophisticated tokenization algorithms to parse the texts. Save
the lemmas, perform part of speech filtering (nouns and verbs, perhaps)
and use these in the `WikiCorpus` class. Finally, use the spacy document
similarity measurement to construct the similarity scores.
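
For the first set of tasks, the status messages and the default names might look something like the sketch below. This is only an illustration, not the expected solution: the logger name, the helper functions `default_name` and `default_names`, and the assumed input (a list of term lists, one per topic or cluster, ordered from most to least important) are all hypothetical and not part of `wikitext.py`.

```python
import logging

# Module-level logger; the name "wikitext" is an assumption for this sketch.
LOGGER = logging.getLogger("wikitext")


def default_name(terms, max_terms=4):
    """Join the first few top terms with hyphens."""
    return "-".join(terms[:max_terms])


def default_names(top_terms, max_terms=4):
    """Build one default name per topic (or cluster).

    `top_terms` is assumed to be a list of term lists, one list per
    topic, each ordered from most to least important.
    """
    LOGGER.info("Building default names for %d groups", len(top_terms))
    names = [default_name(terms, max_terms) for terms in top_terms]
    LOGGER.debug("Default names: %s", names)
    return names


if __name__ == "__main__":
    # Show INFO-level status messages on the console.
    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")
    print(default_names([["plato", "ethics", "virtue", "soul", "form"],
                         ["logic", "proof", "truth"]]))
```

Using a logger rather than `print` lets whoever runs the notebook choose how much detail to see by changing the level passed to `logging.basicConfig`.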

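For the network analysis task, one way to get a centrality score without extra dependencies is power iteration on the link graph. The sketch below is a minimal illustration under several assumptions: links are given as a dictionary mapping each page title to the titles it links to (how the links are actually stored in `WikiCorpus` may differ), links to pages outside the corpus are ignored, and the graph is treated as undirected. The function name `eigenvector_centrality` is hypothetical.

```python
def eigenvector_centrality(links, num_iter=100, tol=1e-9):
    """Approximate eigenvector centrality by power iteration.

    `links` is assumed to map each page title to an iterable of the
    page titles it links to. Returns a dict of title -> score, scaled
    so that the scores have unit Euclidean norm.
    """
    nodes = sorted(links)
    if not nodes:
        return {}

    # Build a symmetric neighbor table, dropping self-links and links
    # that point to pages outside the corpus.
    neighbors = {node: set() for node in nodes}
    for source, targets in links.items():
        for target in targets:
            if target in neighbors and target != source:
                neighbors[source].add(target)
                neighbors[target].add(source)

    score = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(num_iter):
        # Multiply by (A + I): adding the node's own score keeps the
        # iteration from oscillating on bipartite graphs and does not
        # change the leading eigenvector.
        new = {node: score[node] + sum(score[nb] for nb in neighbors[node])
               for node in nodes}
        norm = sum(value * value for value in new.values()) ** 0.5
        new = {node: value / norm for node, value in new.items()}
        change = sum(abs(new[node] - score[node]) for node in nodes)
        score = new
        if change < tol:
            break
    return score


if __name__ == "__main__":
    toy_links = {"A": ["B", "C", "D"], "B": ["A"], "C": ["A"], "D": ["A", "Z"]}
    for title, value in sorted(eigenvector_centrality(toy_links).items()):
        print(title, round(value, 3))
```

In the toy example, page `A` is linked with every other page and so receives the highest score, while the link to `Z` is dropped because `Z` is not in the corpus. A graph library would also work here; the hand-written loop is only meant to show what the score measures.
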
### General approach

**In this block, describe in just a couple of sentences the task you choose and
approach you took to solving the problem.**

### Application

There is not much more code that you need to write in this file. Just copy
the code you had from Project 03 over to this page so that it reproduces your
Text Explorer page, but with your new `wikitext` module.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import wiki
import iplot
import wikitext

In [None]:
assert wiki.__version__ >= 6
assert wikitext.__version__ >= 3
assert iplot.__version__ >= 3

In [None]:
links = wikitext.get_internal_links('List_of_important_publications_in_philosophy') 
wcorp = wikitext.WikiCorpus(links['ilinks'], num_clusters=40, num_topics=15)

In [None]:
wikitext.wiki_text_explorer(wcorp, output_dir="philosophy-pub")

You should also check that your code is not producing too many warnings with
`pycodestyle` and `pylint`. (Note: the output does not need to be perfectly clean,
but it should not contain many simple errors.)

In [None]:
import pycodestyle
pycodestyle.Checker(filename='wikitext.py').check_all()

In [None]:
from pylint.epylint import lint
lint("wikitext.py")

In [None]:
wcorp.meta.eigen