# Extract coordinates and keywords from Wikipedia

Wikipedia allows you to download the entirety of its contents. For English only (and without talk pages), that comes to about 16 GB of compressed data. We're only interested in articles whose coordinates describe the article itself (rather than something the article refers to). So we need to 1) read in a page, given either its page ID or its title, and 2) query all pages whose coordinates have a certain display type.

Wikipedia provides an index file to go with the database dump that lets you read articles in batches of 100 using byte locations. But I'd rather have an exact index to every page; that will save time when we iterate over a subset of the pages later. And compared to the other things I'm doing, building my own index isn't terribly expensive computationally.
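
To give a feel for what a per-page index involves, here is a minimal sketch that records the start and end byte offsets of every page in an uncompressed XML dump. It is not the code inside `geo_indexer`; the function name `index_pages` and the sample data are made up for illustration, and it assumes each `<page>` and `</page>` tag sits on its own line.

```python
import io

def index_pages(stream):
    """Yield (start, end) byte offsets for each <page>...</page> block.

    Assumes `stream` is opened in binary mode and that each <page> and
    </page> tag is on its own line (an assumption about the dump layout).
    """
    start = None
    offset = 0
    for line in stream:
        stripped = line.strip()
        if stripped == b"<page>":
            start = offset              # first byte of the opening tag's line
        offset += len(line)
        if stripped == b"</page>" and start is not None:
            yield start, offset         # end is one past the closing tag's line
            start = None

# A tiny hypothetical dump standing in for the real file.
sample = (
    b"<mediawiki>\n"
    b"  <page>\n"
    b"    <title>Ashfall Fossil Beds</title>\n"
    b"    <text>{{coord|42.440556|-98.148083|type:landmark}}</text>\n"
    b"  </page>\n"
    b"  <page>\n"
    b"    <title>Chermignon</title>\n"
    b"    <text>Some article text.</text>\n"
    b"  </page>\n"
    b"</mediawiki>\n"
)

offsets = list(index_pages(io.BytesIO(sample)))

# With the offsets stored, any single page can be sliced out directly.
start, end = offsets[1]
second_page = sample[start:end].decode("utf-8")
```

On a real file you would `seek(start)` and `read(end - start)` instead of slicing a bytes object, which is what makes iterating over a subset of pages cheap.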

First, let's point the pipeline at the dump file. I made a 10 MB sample file that we can use.

In [1]:
from pathlib import Path
xml_filename = "wiki_sample.xml"

scratch_folder = Path("scratch-pipeline")

## What we want to create

For each page, we're going to have the start and end indices, the page number, title, and the coord tag. We'll ignore any pages that don't have the string `Coord` or `coord`.

Here are the databases we're creating:
 - `index`: maps page number to the title, start and end indices, and raw coord tag
 - `coords`: maps page number to lots of attributes in a coord tag
 - `title`: maps page title to page number
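
The notebook doesn't show how these databases are stored, so the following is only a sketch of how the `index` and `title` mappings could fit together, assuming SQLite. The table names (`page_index`, `title_index`), column names, and byte offsets are all hypothetical; the `coords` table is left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.executescript("""
    CREATE TABLE page_index (
        page_num   INTEGER PRIMARY KEY,
        title      TEXT NOT NULL,
        start_byte INTEGER NOT NULL,
        end_byte   INTEGER NOT NULL,
        coord_tag  TEXT
    );
    CREATE TABLE title_index (
        title    TEXT PRIMARY KEY,
        page_num INTEGER NOT NULL
    );
""")

# Made-up page number and offsets; the coord tag is the example from this notebook.
conn.execute(
    "INSERT INTO page_index VALUES (?, ?, ?, ?, ?)",
    (1, "Ashfall Fossil Beds", 1024, 5120,
     "coord|42.440556|-98.148083|type:landmark|name=Ashfall Fossil Beds"),
)
conn.execute("INSERT INTO title_index VALUES (?, ?)", ("Ashfall Fossil Beds", 1))

def offsets_for_title(conn, title):
    """Resolve a title to its page number, then to its byte offsets."""
    return conn.execute(
        """
        SELECT p.start_byte, p.end_byte
        FROM title_index AS t
        JOIN page_index AS p ON p.page_num = t.page_num
        WHERE t.title = ?
        """,
        (title,),
    ).fetchone()

found = offsets_for_title(conn, "Ashfall Fossil Beds")
missing = offsets_for_title(conn, "No such page")
```

Keeping the title lookup separate from the main index means a page can be fetched either by number or by title, which are the two access paths described at the top.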
 
Here's an example of a coord tag from a Wikipedia article:
```
coord|42.440556|-98.148083|type:landmark|name=Ashfall Fossil Beds
```
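
As a rough sketch of what pulling attributes out of such a tag can look like, the snippet below splits the example into latitude, longitude, and named parameters. The function `parse_coord_tag` is hypothetical and is not the parser `geo_indexer` uses; it only handles the decimal-degrees form shown above, not degrees/minutes/seconds.

```python
def parse_coord_tag(tag):
    """Split a decimal-degrees coord tag into (lat, lon, params)."""
    parts = [p.strip() for p in tag.strip("{}").split("|")]
    if not parts or parts[0].lower() != "coord":
        raise ValueError(f"not a coord tag: {tag!r}")

    positional = []
    params = {}
    for part in parts[1:]:
        if "=" in part:
            # Named template arguments such as name=... or display=...
            key, value = part.split("=", 1)
            params[key.strip()] = value.strip()
        elif ":" in part:
            # Coordinate parameters such as type:landmark; several may be
            # joined with underscores, e.g. type:landmark_region:US-NE
            for item in part.split("_"):
                key, _, value = item.partition(":")
                params[key] = value
        else:
            positional.append(part)

    if len(positional) < 2:
        raise ValueError(f"expected latitude and longitude in {tag!r}")
    lat, lon = float(positional[0]), float(positional[1])
    return lat, lon, params

lat, lon, params = parse_coord_tag(
    "coord|42.440556|-98.148083|type:landmark|name=Ashfall Fossil Beds"
)
```

Here `lat` and `lon` come back as floats and `params` holds `type` and `name`. A `display=...` argument, when present, would land in `params` the same way.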

Let's get started.

In [2]:
from wikiparse import geo_indexer
import time
pipeline_start = time.time()

In [3]:
indexer = geo_indexer.Indexer(xml_filename,
            scratch_folder=scratch_folder)
indexer.load()

opening scratch-pipeline\index.db
Ready. Metadata: [('size', 999)]
iterating 100.0% of pages took 0.01 minutes
100.0 % contained coordinates tag
Creating title dictionary
        0.0%	5.98ms / page  

How many pages do we have? This is the number that have coord tags.

In [4]:
len(indexer.get_page_numbers())

999

Let's check out a page.

In [7]:
page = indexer.get_page_by_num(indexer.get_page_numbers()[5])
print(f'"{page.title}" \n\n{page.text[:1000]}')

"Chermignon" 



Chermignon is a former Municipalities of Switzerland municipality in the district of Sierre district Sierre in the Cantons of Switzerland canton of Valais in Switzerland. On 1 January 2017 the former municipalities of Chermignon, Mollens, Valais Mollens, Montana, Switzerland Montana and Randogne merged into the new municipality of Crans-Montana.

  History  
Chermignon is first mentioned in 1228 as Chermenon and Chirminon.  It became an independent municipality in 1905 when it separated from Lens, Valais Lens.

  Geography  
Chermignon had an area, , of .  Of this area,  or 38.8% is used for agricultural purposes, while  or 30.5% is forested.   Of the rest of the land,  or 28.6% is settled buildings or roads,  or 0.4% is either rivers or lakes and  or 1.3% is unproductive land.  

Of the built up area, industrial buildings made up 1.3% of the total area while housing and buildings made up 17.3% and transportation infrastructure made up 6.1%. while parks, green belts an

We're losing _some_ of the actual text of the Wikipedia article (note the missing area figures in the Geography section), but generally the filter is doing a good job of keeping the metadata out and the text in.

In [6]:
took = time.time() - pipeline_start
if took < 60:
    print("pipeline took", round(took, 2), "seconds")
elif took < 3600:
    print("pipeline took", round(took/60, 2), "minutes")
else:
    print("pipeline took", round(took/60/60, 2), "hours")

pipeline took 4.75 seconds
