In this notebook we describe the process for downloading all of Wikipedia's data from their publicly accessible<br>
data dumps. The data is dumped in SQL and XML files. <br><br> In this case, we will scrape the dump site for the XML downloads.

 The linked blog post goes into detail about the technologies used here and was an important resource. <br>
 https://towardsdatascience.com/wikipedia-data-science-working-with-the-worlds-largest-encyclopedia-c08efbac5f5c <br>
 <br><br>
 You can find the dump site below: <br>
 https://dumps.wikimedia.org/ <br>https://dumps.wikimedia.org/enwiki/<br>
 

In [39]:
import requests

# Parsing HTML
from bs4 import BeautifulSoup

# File system management
import os

Using the requests library and the BeautifulSoup library, we can easily grab the index page and list <br>
the available data dumps.<br>
#### We list the available wiki dumps. We go with 20191220.

In [40]:
base_url = 'https://dumps.wikimedia.org/enwiki/'
index = requests.get(base_url).text
soup_index = BeautifulSoup(index, 'html.parser')

# Collect every link on the index page (dump dates plus '../' and 'latest/')
dumps = [a['href'] for a in soup_index.find_all('a') if 
         a.has_attr('href')]
dumps

['../',
 '20191220/',
 '20200101/',
 '20200120/',
 '20200201/',
 '20200220/',
 '20200301/',
 '20200401/',
 'latest/']
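The list above still contains navigation links such as `'../'` and `'latest/'`. As a minimal sketch, assuming every dated dump is named with eight digits followed by a slash, the dated entries can be filtered with a regular expression and compared as strings. The names `dated_dumps`, `newest`, and `oldest` are introduced here for illustration only.

```python
import re

# Sample of the hrefs scraped from the index page above
dumps = ['../', '20191220/', '20200101/', '20200120/', '20200201/',
         '20200220/', '20200301/', '20200401/', 'latest/']

# Keep only hrefs that look like an 8-digit date followed by a slash
dated_dumps = [d for d in dumps if re.fullmatch(r'\d{8}/', d)]

# YYYYMMDD strings sort chronologically, so max() gives the newest dump
# and min() gives the oldest one
newest = max(dated_dumps)
oldest = min(dated_dumps)
print(newest, oldest)
```

In this listing, 20191220 (the dump used in the rest of the notebook) is the oldest dated entry rather than the newest.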

Using requests.get() we can download the webpage to examine the HTML.

In [41]:
dump_url = base_url + '20191220/'

# Retrieve the html
dump_html = requests.get(dump_url).text
dump_html[:100]

'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n        "http://www.w3.org/TR/xhtml1/DTD/xh'

Exploring the dump URL ourselves, we see that there are many different XML files. There are the full XML <br>
texts as well as Multistream downloads that organize the data differently. The full XML document of all Wikipedia data<br>
is available, although it is about 16 GB and will be a hassle to work with. Instead we will aim to download the individual<br>
partitions that are generated before they are merged into the full document.<br><br>
For this example we work with the raw XML files <b>not</b> the Multistream (although we download both, as Multistream should be looked into).

In [42]:
# Convert to a soup
soup_dump = BeautifulSoup(dump_html, 'html.parser')

# Find li elements with the class file
soup_dump.find_all('li', {'class': 'file'}, limit = 10)[:4]

[<li class="file"><a href="/enwiki/20191220/enwiki-20191220-pages-articles-multistream.xml.bz2">enwiki-20191220-pages-articles-multistream.xml.bz2</a> 16.5 GB</li>,
 <li class="file"><a href="/enwiki/20191220/enwiki-20191220-pages-articles-multistream-index.txt.bz2">enwiki-20191220-pages-articles-multistream-index.txt.bz2</a> 207.5 MB</li>,
 <li class="file"><a href="/enwiki/20191220/enwiki-20191220-pages-articles-multistream1.xml-p10p30302.bz2">enwiki-20191220-pages-articles-multistream1.xml-p10p30302.bz2</a> 174.8 MB</li>,
 <li class="file"><a href="/enwiki/20191220/enwiki-20191220-pages-articles-multistream-index1.txt-p10p30302.bz2">enwiki-20191220-pages-articles-multistream-index1.txt-p10p30302.bz2</a> 163 KB</li>]
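The `href` values in these list items start with a slash, so they are paths relative to the site root rather than to the dump page. A small sketch of turning one into a full download URL with the standard library's `urljoin` follows; the variable `file_url` is introduced here for illustration.

```python
from urllib.parse import urljoin

dump_url = 'https://dumps.wikimedia.org/enwiki/20191220/'
href = '/enwiki/20191220/enwiki-20191220-pages-articles-multistream.xml.bz2'

# A leading slash makes the href relative to the site root, so urljoin
# keeps the scheme and host from dump_url and replaces the path
file_url = urljoin(dump_url, href)
print(file_url)
```

For this dump page, the result is the same as concatenating `dump_url` with the bare file name.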

BeautifulSoup allows us to iterate through the HTML tags! Using this we can pull out the direct references to files.

In [43]:

files = []

# Search through all files
for file in soup_dump.find_all('li', {'class': 'file'}):
    text = file.text
    # Select the relevant files
    if 'pages-articles' in text:
        files.append((text.split()[0], text.split()[1:]))
        
files[:5]

[('enwiki-20191220-pages-articles-multistream.xml.bz2', ['16.5', 'GB']),
 ('enwiki-20191220-pages-articles-multistream-index.txt.bz2', ['207.5', 'MB']),
 ('enwiki-20191220-pages-articles-multistream1.xml-p10p30302.bz2',
  ['174.8', 'MB']),
 ('enwiki-20191220-pages-articles-multistream-index1.txt-p10p30302.bz2',
  ['163', 'KB']),
 ('enwiki-20191220-pages-articles-multistream2.xml-p30304p88444.bz2',
  ['208.0', 'MB'])]
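Each entry keeps the size as a pair of strings such as `['16.5', 'GB']`. If numeric sizes are wanted before downloading, one possible sketch is shown below. The helper `size_to_mb` and the table `UNIT_TO_MB` are hypothetical names, and the decimal multipliers are an assumption about how the listing reports its units.

```python
# Decimal multipliers relative to one megabyte
# (an assumption about the units used in the dump listing)
UNIT_TO_MB = {'KB': 1e-3, 'MB': 1.0, 'GB': 1e3}


def size_to_mb(size_parts):
    """Convert a ['16.5', 'GB'] style pair into a float number of megabytes."""
    value, unit = size_parts
    return float(value) * UNIT_TO_MB[unit]


# A few entries in the same shape as the files list above
files = [
    ('enwiki-20191220-pages-articles-multistream.xml.bz2', ['16.5', 'GB']),
    ('enwiki-20191220-pages-articles-multistream-index1.txt-p10p30302.bz2', ['163', 'KB']),
    ('enwiki-20191220-pages-articles-multistream2.xml-p30304p88444.bz2', ['208.0', 'MB']),
]

sizes_mb = {name: size_to_mb(size) for name, size in files}
for name, size in sizes_mb.items():
    print(f'{size:10.3f} MB  {name}')
```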

The partition files we are looking for all contain .xml-p in their names, so we can filter on that.

In [44]:
files_to_download = [file[0] for file in files if '.xml-p' in file[0]]
files_to_download[-5:]

['enwiki-20191220-pages-articles27.xml-p56163464p57663464.bz2',
 'enwiki-20191220-pages-articles27.xml-p57663464p59163464.bz2',
 'enwiki-20191220-pages-articles27.xml-p59163464p60663464.bz2',
 'enwiki-20191220-pages-articles27.xml-p60663464p62163464.bz2',
 'enwiki-20191220-pages-articles27.xml-p62163464p62632462.bz2']
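The part of each name after `.xml-` appears to hold the first and last page ID covered by the partition. As a sketch under that assumption, a regular expression can pull the two numbers out; `PAGE_RANGE` and `page_range` are hypothetical names introduced for this example.

```python
import re

# Matches the trailing "p<first>p<last>.bz2" part of a partition file name
PAGE_RANGE = re.compile(r'p(\d+)p(\d+)\.bz2$')


def page_range(file_name):
    """Return (first_page_id, last_page_id) parsed from a partition file name."""
    match = PAGE_RANGE.search(file_name)
    if match is None:
        raise ValueError(f'no page range in {file_name!r}')
    return int(match.group(1)), int(match.group(2))


first, last = page_range('enwiki-20191220-pages-articles27.xml-p56163464p57663464.bz2')
print(first, last, last - first)
```

The difference `last - first` is only a rough upper bound on the number of pages in a partition, since page IDs need not be contiguous.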

## We outline two methods for downloading.

### Option 1
#### An interesting point to remember is that we would rather have many small files than one big file.

In this section we show a less elegant method than the one above to scrape the HTML for downloading partitions.<br>
However, we show off some of the ways that BeautifulSoup can be used to grab specific files. <br>
 <br>
<b> You may wish to skip to Option 2 for the download.</b>

In [35]:
# Retrieve the html
dump_url = base_url + '20200201/'
dump_html = requests.get(dump_url).text
# Convert to a soup
soup_dump = BeautifulSoup(dump_html, 'html.parser')
# Find list elements with the class file
soup_dump.find_all('li', {'class': 'file'})[:3]

[<li class="file"><a href="/enwiki/20200201/enwiki-20200201-pages-articles-multistream.xml.bz2">enwiki-20200201-pages-articles-multistream.xml.bz2</a> 16.6 GB</li>,
 <li class="file"><a href="/enwiki/20200201/enwiki-20200201-pages-articles-multistream-index.txt.bz2">enwiki-20200201-pages-articles-multistream-index.txt.bz2</a> 208.6 MB</li>,
 <li class="file"><a href="/enwiki/20200201/enwiki-20200201-pages-articles-multistream1.xml-p10p30302.bz2">enwiki-20200201-pages-articles-multistream1.xml-p10p30302.bz2</a> 175.7 MB</li>]

Here we use the Wikidata dump (wikidatawiki) from 2020-02-01: https://dumps.wikimedia.org/wikidatawiki/20200201/

In [20]:
page = requests.get("https://dumps.wikimedia.org/wikidatawiki/20200201/")
page

<Response [200]>

In [21]:
page.content

b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n\n<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\n<head>\n        <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>\n        <title>wikidatawiki dump progress on 20200201</title>\n        <link rel="stylesheet" type="text/css" href="/dumps.css" />\n        <style type="text/css">\n                .siteinfo {\n                        text-align: center;\n                }\n                li {\n                        list-style-type: none;\n                        padding: 0.5em 1.5em 0.5em 1.5em;\n                        background: #fff;\n                        margin-bottom: 1em;\n                }\n                li li {\n                        background-color: white;\n                        box-shadow: none;\n                        border-top: none;\n                        padding: 0px;\n                        mar

In [22]:
# Many files are included in the dump web page, however we only want the pages-articles-multistream files,
# which are approximately 83GB.
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <title>
   wikidatawiki dump progress on 20200201
  </title>
  <link href="/dumps.css" rel="stylesheet" type="text/css"/>
  <style type="text/css">
   .siteinfo {
                        text-align: center;
                }
                li {
                        list-style-type: none;
                        padding: 0.5em 1.5em 0.5em 1.5em;
                        background: #fff;
                        margin-bottom: 1em;
                }
                li li {
                        background-color: white;
                        box-shadow: none;
                        border-top: none;
                        padding: 0px;
                        margin-bottom: 0em;
                }
                li ul

In [23]:
list(soup.children)

['html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"',
 '\n',
 <html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
 <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
 <title>wikidatawiki dump progress on 20200201</title>
 <link href="/dumps.css" rel="stylesheet" type="text/css"/>
 <style type="text/css">
                 .siteinfo {
                         text-align: center;
                 }
                 li {
                         list-style-type: none;
                         padding: 0.5em 1.5em 0.5em 1.5em;
                         background: #fff;
                         margin-bottom: 1em;
                 }
                 li li {
                         background-color: white;
                         box-shadow: none;
                         border-top: none;
                         padding: 0px;
                         margin-bottom: 0em;
                 }
 

In [24]:
[type(item) for item in list(soup.children)]

[bs4.element.Doctype,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString]

In [25]:
soup = BeautifulSoup(page.content, 'html.parser')

In [26]:
datalist = soup.select("ul ul")[1]
datalist

<ul><li class="file"><a href="/wikidatawiki/20200201/wikidatawiki-20200201-pages-articles-multistream1.xml-p1p235321.bz2">wikidatawiki-20200201-pages-articles-multistream1.xml-p1p235321.bz2</a> 768.9 MB</li>
<li class="file"><a href="/wikidatawiki/20200201/wikidatawiki-20200201-pages-articles-multistream-index1.txt-p1p235321.bz2">wikidatawiki-20200201-pages-articles-multistream-index1.txt-p1p235321.bz2</a> 759 KB</li>
<li class="file"><a href="/wikidatawiki/20200201/wikidatawiki-20200201-pages-articles-multistream2.xml-p235322p585543.bz2">wikidatawiki-20200201-pages-articles-multistream2.xml-p235322p585543.bz2</a> 843.8 MB</li>
<li class="file"><a href="/wikidatawiki/20200201/wikidatawiki-20200201-pages-articles-multistream-index2.txt-p235322p585543.bz2">wikidatawiki-20200201-pages-articles-multistream-index2.txt-p235322p585543.bz2</a> 1.1 MB</li>
<li class="file"><a href="/wikidatawiki/20200201/wikidatawiki-20200201-pages-articles-multistream3.xml-p585544p1015944.bz2">wikidatawiki-202

In [27]:
currentdata = datalist.select("li")[0]
currentdata

<li class="file"><a href="/wikidatawiki/20200201/wikidatawiki-20200201-pages-articles-multistream1.xml-p1p235321.bz2">wikidatawiki-20200201-pages-articles-multistream1.xml-p1p235321.bz2</a> 768.9 MB</li>

In [28]:
currentdata.find("a").get_text()

'wikidatawiki-20200201-pages-articles-multistream1.xml-p1p235321.bz2'

In [29]:
# Below we download all wiki articles. 136 files total.
# leng is the last index, so the loop covers indices 0 through 135.
leng = 135

Please note this will take a few hours. 

In [None]:
import sys
from keras.utils import get_file
import tensorflow as tf
keras_home = 'C:/Users/Austin/.keras/datasets/'

# Download each listed file with Keras' get_file helper
for y in range(leng + 1):
    currentdata = datalist.select("li")[y]
    ref = currentdata.find("a").get_text()
    print(ref)
    tf.keras.utils.get_file(
        fname=ref,
        origin="https://dumps.wikimedia.org/wikidatawiki/20200201/" + ref
    )

### Option 2

In [16]:
import sys
from keras.utils import get_file
import tensorflow as tf
keras_home = 'C:/Users/Austin/.keras/datasets/'

In [45]:
files_to_download

['enwiki-20191220-pages-articles-multistream1.xml-p10p30302.bz2',
 'enwiki-20191220-pages-articles-multistream2.xml-p30304p88444.bz2',
 'enwiki-20191220-pages-articles-multistream3.xml-p88445p200507.bz2',
 'enwiki-20191220-pages-articles-multistream4.xml-p200511p352689.bz2',
 'enwiki-20191220-pages-articles-multistream5.xml-p352690p565312.bz2',
 'enwiki-20191220-pages-articles-multistream6.xml-p565314p892912.bz2',
 'enwiki-20191220-pages-articles-multistream7.xml-p892914p1268691.bz2',
 'enwiki-20191220-pages-articles-multistream8.xml-p1268693p1791079.bz2',
 'enwiki-20191220-pages-articles-multistream9.xml-p1791081p2336422.bz2',
 'enwiki-20191220-pages-articles-multistream10.xml-p2336425p3046511.bz2',
 'enwiki-20191220-pages-articles-multistream11.xml-p3046517p3926861.bz2',
 'enwiki-20191220-pages-articles-multistream12.xml-p3926864p5040435.bz2',
 'enwiki-20191220-pages-articles-multistream13.xml-p5040438p6197593.bz2',
 'enwiki-20191220-pages-articles-multistream14.xml-p6197599p7697599.

## Note: files_to_download now includes both the plain articles partitions and the multistream articles partitions.

#### More work would be needed to separate them further.
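One simple way to separate the two groups, sketched here under the assumption that every multistream partition has `multistream` in its file name, is a pair of list comprehensions. The names `multistream_files` and `plain_files` are introduced for illustration.

```python
# A few names in the same shape as files_to_download
files_to_download = [
    'enwiki-20191220-pages-articles-multistream1.xml-p10p30302.bz2',
    'enwiki-20191220-pages-articles-multistream2.xml-p30304p88444.bz2',
    'enwiki-20191220-pages-articles27.xml-p56163464p57663464.bz2',
    'enwiki-20191220-pages-articles27.xml-p62163464p62632462.bz2',
]

# Split on the presence of the word "multistream" in the file name
multistream_files = [f for f in files_to_download if 'multistream' in f]
plain_files = [f for f in files_to_download if 'multistream' not in f]

print(len(multistream_files), 'multistream partitions')
print(len(plain_files), 'plain partitions')
```

Looping over `plain_files` instead of `files_to_download` would restrict the download to the raw XML partitions.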

The following cell will download the data straight to the Keras home folder. <br>
We add a check that skips any file already present in the folder, so re-running the cell after a <br>
failed download does not fetch duplicates.

In [None]:
data_paths = []
file_info = []

# Iterate through each file
for file in files_to_download:
    path = keras_home + file

    # Download the file only if it is not already in the folder
    if not os.path.exists(path):
        print('Downloading')
        path = get_file(file, dump_url + file)
    data_paths.append(path)

    # Find the file size in MB
    file_size = os.stat(path).st_size / 1e6

    # Estimate the number of articles from the page ID range in the file name
    file_articles = int(file.split('p')[-1].split('.')[-2]) - int(file.split('p')[-2])

    # Record the same (file, size, articles) tuple for new and existing files
    file_info.append((file, file_size, file_articles))
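An existence check alone cannot tell a complete file from a partial one if a download is interrupted at the wrong moment. The sketch below is not the Keras `get_file` approach used in this notebook. It is a standard-library variant, with the hypothetical name `fetch_if_missing`, that writes to a temporary `.part` file and renames it only after the transfer finishes.

```python
import os
import shutil
import urllib.request


def fetch_if_missing(url, dest_path):
    """Download url to dest_path unless dest_path already exists.

    Returns True if a download happened and False if it was skipped.
    """
    if os.path.exists(dest_path):
        return False

    # Write to a temporary name first so an interrupted download
    # never leaves a partial file under the final name
    tmp_path = dest_path + '.part'
    with urllib.request.urlopen(url) as response, open(tmp_path, 'wb') as out:
        shutil.copyfileobj(response, out)

    # Rename only after the whole body has been written
    os.replace(tmp_path, dest_path)
    return True
```

A call such as `fetch_if_missing(dump_url + file, keras_home + file)` could take the place of the existence check and `get_file` call in the loop, assuming `keras_home` already exists on disk.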

In [76]:
dump_url

'https://dumps.wikimedia.org/enwiki/20191220/'

### In Option 2, we kept file info about what we downloaded. 
### Let's take a look!

In [27]:
# Largest files first. Sizes are stored in MB.
sorted(file_info, key = lambda x: x[1], reverse = True)[:5]

[]

In [28]:
print(f'There are {len(file_info)} partitions.')


There are 0 partitions.
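Both outputs above are empty because `file_info` held no entries when these cells ran. As an illustration of what the cells report once it is populated, the sketch below uses two entries in the same `(file, size in MB, articles)` shape. The sizes are copied from the dump listing rather than measured on disk, and the article counts are the page ID differences taken from the file names.

```python
# Illustrative entries only: sizes come from the dump listing and the
# article counts are differences of the page IDs in each file name
file_info = [
    ('enwiki-20191220-pages-articles-multistream1.xml-p10p30302.bz2', 174.8, 30302 - 10),
    ('enwiki-20191220-pages-articles-multistream2.xml-p30304p88444.bz2', 208.0, 88444 - 30304),
]

# Largest file first, using the size in MB as the sort key
largest = sorted(file_info, key=lambda x: x[1], reverse=True)[0]

# Total size converted from MB to GB
total_gb = sum(size for _, size, _ in file_info) / 1e3

print(f'There are {len(file_info)} partitions.')
print(f'Largest: {largest[0]} ({largest[1]} MB)')
print(f'Total size: {total_gb} GB')
```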


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
file_df = pd.DataFrame(file_info, columns = ['file', 'size (MB)', 'articles']).set_index('file')
file_df['size (MB)'].plot.bar(color = 'red', figsize = (22, 6));

In [None]:
print(f"The total size of files on disk is {file_df['size (MB)'].sum() / 1e3} GB")
