## Tutorial 18: Creating the wikitext module

Recall that I created the module `wiki.py` to wrap up all of the functions
for interacting with the MediaWiki API and make them easy to use.
I then also created the module `iplot.py` for working with interactive
data visualizations. We need a similar module for working with textual
data from a corpus of Wikipedia pages. However, this time you are going
to try to create this module yourself.

### Create `wikitext.py`

Start by constructing an empty module named `wikitext.py`. We will turn on the
autoreload function and import the empty module.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import wikitext

Now, add this line into your module to indicate that this is version one of
the code:

In [None]:
__version__ = 1

Save the file, and make sure that everything is working correctly (and autoreloading) by
checking the version number:

In [None]:
wikitext.__version__

### Cleaning text

To start our module, let's create a function that gets rid of the newline characters
and numeric references (numbers in square brackets). For example, here is a short
snippet of text from the 'Plato' page:

In [None]:
text = """Western religion and spirituality.[6]\n"""
text

Write a function named `clean_text` in the `wikitext.py` module that
takes one string argument and returns a cleaned string with the newlines
and references removed. You can informally test using this code:

In [None]:
wikitext.clean_text(text)

When you think you have the correct code, test your function by running the
following code lines:

In [None]:
assert wikitext.clean_text(text) == 'Western religion and spirituality.'
assert wikitext.clean_text('Some[120] more.') == 'Some more.'
assert wikitext.clean_text('And\n again.') == 'And again.'

If your code is working as expected, the block of code will not produce anything.
Only if there is an error will something appear.
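
If you get stuck, here is a minimal sketch of one way to write the function. It assumes that a reference is always a run of digits inside square brackets, and it uses the `re` module from the standard library. Treat it as one possible approach rather than the expected answer.

```python
import re


def clean_text(text):
    """Remove numeric references such as [6] and newline characters."""
    # Drop references: one or more digits wrapped in square brackets.
    text = re.sub(r"\[\d+\]", "", text)
    # Drop newlines outright (they are not replaced with a space).
    return text.replace("\n", "")


print(clean_text("Some[120] more."))  # prints: Some more.
```

Note that the newline is deleted rather than turned into a space, which is what the third test above requires.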

### List of paragraphs

Next, we'll create a function `link_to_p` that takes the name of a Wikipedia
page as an input and returns a list of the paragraphs in the text. That is,
each element of the list is a string containing the text of a paragraph. You
can test the function with this code (it shows the first three paragraphs of
the 'Plato' page):

In [None]:
wikitext.link_to_p('Plato')[:3]

Make sure that your function does these two things:

- calls the `clean_text` function on each block of text
- does not return paragraphs that are empty after cleaning

Once you have that worked out, test the code with the test below.

In [None]:
paragraphs = wikitext.link_to_p('Plato')[:3]

assert paragraphs[1][:22] == 'Along with his teacher'
assert paragraphs[1][-13:] == 'spirituality.'

Again, the code works if the above does not produce any output.
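
The two requirements above can be sketched on a small, made-up piece of markup. The function name `paragraphs_from_xml` and the sample string are hypothetical; how you obtain the parsed page through `wiki.py` is not shown here, so this sketch simply assumes you already have the page as a well-formed XML string. The `clean_text` defined here is a stand-in for your own version.

```python
import re
import xml.etree.ElementTree as ET


def clean_text(text):
    """Stand-in cleaner: strip [n] references and newlines."""
    return re.sub(r"\[\d+\]", "", text).replace("\n", "")


def paragraphs_from_xml(page_xml):
    """Return the cleaned, non-empty paragraph texts found in an XML string."""
    tree = ET.fromstring(page_xml)
    output = []
    for para in tree.iter("p"):
        # itertext() also collects text nested inside child tags such as <b>.
        text = clean_text("".join(para.itertext()))
        if text:  # skip paragraphs that are empty after cleaning
            output.append(text)
    return output


# Hypothetical sample: the second paragraph contains only a newline.
sample = "<div><p>First <b>bold</b> claim.[1]\n</p><p>\n</p><p>Second.</p></div>"
paragraphs = paragraphs_from_xml(sample)
print(paragraphs)  # prints: ['First bold claim.', 'Second.']
```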

### Entire document

While it is often useful to have the text within each paragraph
separated, more often we will want to extract the entire text as
a whole. Write a new function `link_to_doc` that returns the entire
paragraph text as a single string. Hint: The easiest way to do this
is to call the function `link_to_p` and then collapse the results
using the string `join` method.
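
As a small illustration of collapsing a list of strings, here is `join` applied to a made-up list of paragraphs. The choice of a single space as the separator is an assumption on my part; the text above does not say which separator to use.

```python
# Hypothetical stand-in for the list returned by link_to_p.
paragraphs = ["Plato was a philosopher.", "He founded the Academy."]

# join is called on the separator string and takes the list as its argument.
doc = " ".join(paragraphs)
print(doc)  # prints: Plato was a philosopher. He founded the Academy.
```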

First, try your code with this:

In [None]:
wikitext.link_to_doc('Plato')[:1000]

Then run these tests once you think you are finished with the code.

In [None]:
doc = wikitext.link_to_doc('Plato')

assert isinstance(doc, str)
assert len(doc) == 64299
assert doc[:5] == 'Plato'

### Docstrings

Go back to the module and make sure that you have full docstrings on all of
the functions in the module. Each should give a sentence describing what the
function does, followed by a description of the input arguments and of the
value that is returned.
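
Here is a sketch of what such a docstring might look like. The function `word_count` is a hypothetical example that is not part of the module, and the `Args:`/`Returns:` layout is just one common convention; any consistent layout that covers the same three pieces is fine.

```python
def word_count(text):
    """Count the whitespace-separated words in a string.

    Args:
        text: A string containing the text to count.

    Returns:
        An integer giving the number of words in the text.
    """
    return len(text.split())


# The docstring is stored on the function and is what help() displays.
print(word_count.__doc__.splitlines()[0])
```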

### Checking your code

Once you have the three functions written, check your code with the `pycodestyle`
and `pylint` modules. 

In [None]:
import pycodestyle
pycodestyle.Checker(filename='wikitext.py').check_all()

In [None]:
# pylint.epylint was removed in pylint 3.0; pylint.lint.Run works in 2.x and 3.x
from pylint.lint import Run
_ = Run(["wikitext.py"], exit=False)

Try to fix all of the issues given by these modules. I suggest fixing the code style
issues first, followed by the `pylint` warnings. If a warning does not make sense to
you, just ask!

### More practice

Let's write another function! We will write a function named `link_to_plinks` 
that takes a Wikipedia link and returns a list of all the internal links on the
page that are given somewhere inside of a paragraph tag. This will avoid, for
example, extraneous links at the bottom and sides of the page. However, we want
to ensure a few things about the results:

- only return the page name (i.e., not the '/wiki/' part)
- only return internal links and 'real' pages; check each name against the list of internal links in the page's JSON data
- return a list with no duplicates
- sort the output, and make sure the result is a `list`

You'll probably want to work on this function in stages. That is, return all of
the links at first and then build `if` statements to filter out exactly what we
want. Note that you'll need to replace spaces with underscores in the list of links
provided from the Wikipedia JSON file.
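
The filtering stages described above can be sketched on a made-up piece of markup. The function name `plinks_from_xml`, its two arguments, and the sample data are all hypothetical: the sketch assumes you already have the page as a well-formed XML string and the internal link titles as a list of strings, and it does not show how to get either one from `wiki.py`.

```python
import xml.etree.ElementTree as ET


def plinks_from_xml(page_xml, internal_titles):
    """Return a sorted list of unique internal page names linked inside <p> tags."""
    # Titles contain spaces, while hrefs use underscores.
    valid = {title.replace(" ", "_") for title in internal_titles}
    tree = ET.fromstring(page_xml)
    found = set()  # a set removes duplicates for us
    for para in tree.iter("p"):
        for anchor in para.iter("a"):
            href = anchor.get("href", "")
            if not href.startswith("/wiki/"):
                continue  # external link or missing href
            name = href[len("/wiki/"):]
            if name in valid:  # keeps only 'real' internal pages
                found.add(name)
    return sorted(found)  # sorted() always returns a list


# Hypothetical sample: one duplicate, one external link, one file link,
# and one link that sits outside of any paragraph.
sample = (
    '<div>'
    '<p><a href="/wiki/Socrates">Socrates</a> and '
    '<a href="/wiki/Platonic_Academy">the Academy</a>; '
    '<a href="/wiki/Socrates">again</a>; '
    '<a href="https://example.org/">external</a>; '
    '<a href="/wiki/File:Plato.jpg">image</a></p>'
    '<a href="/wiki/Aristotle">sidebar link</a>'
    '</div>'
)
links = plinks_from_xml(sample, ["Socrates", "Platonic Academy", "Aristotle"])
print(links)  # prints: ['Platonic_Academy', 'Socrates']
```

Collecting names in a set and sorting at the end handles the last two requirements in one step.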

You can test your code with:

In [None]:
wikitext.link_to_plinks('Plato')

You should find that the following code will run if you have correctly
defined the `link_to_plinks` function.

In [None]:
ilinks = wikitext.link_to_plinks('Plato')

assert len(ilinks) == 288
assert ilinks[0] == 'Abstraction'
assert isinstance(ilinks, list)

Finally, ensure that you have a docstring for the function and that the `pycodestyle`
and `pylint` modules produce no errors.

In [None]:
import pycodestyle
pycodestyle.Checker(filename='wikitext.py').check_all()

In [None]:
# pylint.epylint was removed in pylint 3.0; pylint.lint.Run works in 2.x and 3.x
from pylint.lint import Run
_ = Run(["wikitext.py"], exit=False)

### Even more practice

The above steps should take some time to get done correctly. If you would like even
more practice with building and testing functions for working with XML and textual
data, here is one more task. Build a function `link_to_geo` that takes the name of a
Wikipedia page and returns either the latitude and longitude associated with the
page or, if there is no geographic information, returns the object `None`.

You can test your code with the 'London' page:

In [None]:
wikitext.link_to_geo('London')

At first, try to just spit out the coordinates as a string as given by Wikipedia.
Then, make sure that you correctly return `None` when given a page like 'Plato':

In [None]:
type(wikitext.link_to_geo('Plato'))

Finally, when there is coordinate information, split the string into latitude
and longitude and return the result as a tuple (just use `return lat, lon` in 
the code). To test, check that we have:

In [None]:
lat, lon = wikitext.link_to_geo('London')

assert isinstance(lat, float)
assert isinstance(lon, float)
assert abs(lat - 51.50722) < 0.4
assert abs(lon - -0.12750) < 0.4
assert wikitext.link_to_geo('Plato') is None
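
A rough sketch of the splitting step is given below, under a strong assumption: that the coordinates appear in the page as a `<span class="geo">` element whose text looks like `51.50722; -0.12750`. Check the actual page markup before relying on this. The function name `geo_from_xml` and the sample strings are hypothetical, and the sketch takes a well-formed XML string rather than a page name.

```python
import xml.etree.ElementTree as ET


def geo_from_xml(page_xml):
    """Return (lat, lon) as floats, or None if no coordinates are found."""
    tree = ET.fromstring(page_xml)
    for span in tree.iter("span"):
        # Assumption: coordinates are stored as 'lat; lon' in a 'geo' span.
        if span.get("class") == "geo" and span.text:
            lat, lon = span.text.split(";")
            # float() ignores the whitespace left over after splitting.
            return float(lat), float(lon)
    return None


with_geo = '<div><span class="geo">51.50722; -0.12750</span></div>'
without_geo = '<div><p>No coordinates here.</p></div>'
print(geo_from_xml(with_geo))     # prints: (51.50722, -0.1275)
print(geo_from_xml(without_geo))  # prints: None
```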

And, as usual, make sure that you have full docstrings and the code produces
no warnings when running `pylint`.