Chapter 3: Regular Expressions
The file enwiki-country.json.gz stores Wikipedia articles in the following format:

Each line stores a Wikipedia article in JSON format
Each JSON document has key-value pairs:
Title of the article as the value for the title key
Body of the article as the value for the text key
The entire file is compressed by gzip
Write code that performs the following tasks.

20. Read JSON documents
Read the JSON documents and output the body of the article about the United Kingdom. Reuse the output in problems 21-29.

In [7]:
import json

# This reads the decompressed file, so enwiki-country.json.gz must be gunzipped first.
with open('/Users/wdy940211/Desktop/enwiki-country.json', 'r') as f_source, open('/Users/wdy940211/Desktop/uk.txt', 'w') as f_target:
    for line in f_source:
        data = json.loads(line)
        if data['title'] == 'United Kingdom':
            uk = data['text']
            f_target.write(uk)
# The with statement closes both files, so no explicit close() is needed.
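
The cell above works on a decompressed copy of the file. As a minimal sketch of an alternative, the article can also be read straight from the gzip archive with the standard gzip module. The function name find_article_text is hypothetical and the path is whatever location the archive was saved to.

```python
import gzip
import json


def find_article_text(path, title):
    """Return the body of the first article whose title matches, or None."""
    # 'rt' opens the gzip stream in text mode, so it can be iterated line by line
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            article = json.loads(line)  # one JSON document per line
            if article['title'] == title:
                return article['text']
    return None
```

Calling find_article_text('enwiki-country.json.gz', 'United Kingdom') would then return the same text that the cell above writes to uk.txt, assuming the archive is in the working directory.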

21. Lines with category names
Extract lines that define the categories of the article.

In [11]:
with open('/Users/wdy940211/Desktop/uk.txt', 'r') as f_source, open('/Users/wdy940211/Desktop/categories.txt', 'w') as f_target:
    for line in f_source:
        if '[[Category:' in line:
            # The line already ends with a newline, so avoid writing a second one
            f_target.write(line.rstrip('\n') + '\n')

In [32]:
import pandas as pd
import re

# Match the category link itself rather than any mention of the word "Category"
pattern = re.compile(r'\[\[Category:')
article = pd.read_json('/Users/wdy940211/Desktop/enwiki-country.json', lines=True)
uk = article[article['title']=='United Kingdom'].text.values
lines = uk[0].split('\n')

with open('/Users/wdy940211/Desktop/categories1.txt', 'w') as f_target:
    for line in lines:
        if pattern.search(line):
            f_target.write('%s\n' % line)

22. Category names
Extract the category names of the article.

In [13]:
with open('/Users/wdy940211/Desktop/categories.txt', 'r') as f_source, open('/Users/wdy940211/Desktop/category_names.txt', 'w') as f_target:
    for line in f_source:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Drop the brackets and the prefix, then cut off any sort key after '|'
        name = line.replace('[[', '').replace('Category:', '').replace(']]', '').split('|')[0]
        f_target.write(name + '\n')

In [37]:
import re

# Group 1 is the category name; an optional sort key after '|' is discarded
pattern = re.compile(r'\[\[Category:(.*?)(?:\|.*)?\]\]')
with open('/Users/wdy940211/Desktop/categories.txt', 'r') as f_source, open('/Users/wdy940211/Desktop/category_names1.txt', 'w') as f_target:
    for line in f_source:
        match = pattern.search(line)
        if match:
            f_target.write(match.group(1) + '\n')
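
The same extraction can be tried without any files. The following is a small sketch on a made-up sample string (the two category lines are illustrative, not copied from the article), and the function name category_names is hypothetical.

```python
import re

# Capture the name up to the first '|' or ']'; an optional sort key is skipped
CATEGORY = re.compile(r'\[\[Category:([^|\]]+)(?:\|[^\]]*)?\]\]')


def category_names(text):
    """Return the category names found in a block of MediaWiki markup."""
    return [name.strip() for name in CATEGORY.findall(text)]


# Illustrative sample, not taken from the dataset
sample = "[[Category:United Kingdom| ]]\n[[Category:Island countries]]\n"
print(category_names(sample))  # ['United Kingdom', 'Island countries']
```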

23. Section structure
Extract section names in the article with their levels. For example, the level of the section is 1 for the MediaWiki markup "== Section name ==".

In [92]:
import re

# The backreference \1 requires the same number of '=' on both sides of the name
pattern = re.compile(r'^(={2,})\s*(.+?)\s*\1\s*$')
with open('/Users/wdy940211/Desktop/uk.txt', 'r') as f_source, open('/Users/wdy940211/Desktop/section.txt', 'w') as f_target:
    for line in f_source:
        match = pattern.match(line)
        if match:
            level = len(match.group(1)) - 1  # '==' is level 1, '===' is level 2, ...
            f_target.write(match.group(2) + ' ' + str(level) + '\n')
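
To check the level rule from the problem statement ("== Section name ==" is level 1) in isolation, here is a minimal sketch that works on a string instead of a file. The function name sections and the sample text are assumptions made for illustration.

```python
import re

# MULTILINE lets ^ and $ match at every line; [ \t]* keeps a match on one line
HEADING = re.compile(r'^(={2,})[ \t]*(.+?)[ \t]*\1[ \t]*$', re.MULTILINE)


def sections(text):
    """Return (name, level) pairs, where '==' is level 1 and '===' is level 2."""
    return [(m.group(2), len(m.group(1)) - 1) for m in HEADING.finditer(text)]


# Illustrative sample, not taken from the dataset
sample = "intro\n== History ==\n===Background===\ntext\n"
print(sections(sample))  # [('History', 1), ('Background', 2)]
```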

24. Media references
Extract references to media files linked from the article.

In [94]:
import re

# A line may hold several file links; each name ends at the first '|' or ']'
pattern = re.compile(r'\[\[File:([^|\]]+)')
with open('/Users/wdy940211/Desktop/uk.txt', 'r') as f_source, open('/Users/wdy940211/Desktop/media.txt', 'w') as f_target:
    for line in f_source:
        for name in pattern.findall(line):
            f_target.write(name.strip() + '\n')
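
As a quick sketch of how file links can be pulled out when a line contains more than one of them, the snippet below uses re.findall on a made-up sample. The function name media_files and the file names in the sample are hypothetical.

```python
import re

# The file name runs from 'File:' up to the first '|' or ']'
FILE_LINK = re.compile(r'\[\[File:([^|\]]+)')


def media_files(text):
    """Return the names of all files referenced with [[File:...]] links."""
    return [name.strip() for name in FILE_LINK.findall(text)]


# Illustrative sample, not taken from the dataset
sample = "[[File:Stonehenge.jpeg|thumb|Stonehenge]] and [[File:Map.png]]"
print(media_files(sample))  # ['Stonehenge.jpeg', 'Map.png']
```

This does not depend on the length of the file extension, so names ending in .jpeg or .tiff are handled the same way as .svg or .png.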

25. Infobox
Extract field names and their values in the Infobox “country”, and store them in a dictionary object.

In [39]:
import re

d = {}
# Fields look like "|name = value" at the start of a line.
# Note: this scans the whole article, so matching lines outside the Infobox are also stored.
pattern = re.compile(r'^\s*\|\s*(.+?)\s*=\s*(.+)')
with open('/Users/wdy940211/Desktop/uk.txt', 'r') as f_source, open('/Users/wdy940211/Desktop/infobox.txt', 'w') as f_target:
    for line in f_source:
        match = pattern.search(line)
        if match:
            d[match[1].strip()]=match[2].strip()
    for key, value in d.items():
        f_target.write(key+': '+value+',\n')
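
The cell above collects every "|name = value" line in the article. One possible way to restrict the search to the Infobox is to track the nesting depth of "{{" and "}}". The sketch below assumes that the template starts on its own line with "{{Infobox country" and that each field fits on one line; the function name parse_infobox and the sample text are hypothetical.

```python
import re

FIELD = re.compile(r'^\|\s*(.+?)\s*=\s*(.*)$')


def parse_infobox(text, name='Infobox country'):
    """Return the top-level fields of the first matching template as a dict."""
    fields = {}
    depth = 0
    inside = False
    for line in text.split('\n'):
        if not inside:
            if line.startswith('{{' + name):
                inside = True
                depth = line.count('{{') - line.count('}}')
            continue
        # Only lines at depth 1 belong to the Infobox itself
        match = FIELD.match(line) if depth == 1 else None
        if match:
            fields[match.group(1)] = match.group(2).strip()
        depth += line.count('{{') - line.count('}}')
        if depth <= 0:
            break  # the closing '}}' of the Infobox was reached
    return fields


# Illustrative sample, not taken from the dataset
sample = (
    "{{Infobox country\n"
    "|common_name = United Kingdom\n"
    "|image_flag = Flag of the United Kingdom.svg\n"
    "|established = {{Start date|1707}}\n"
    "}}\n"
    "|other = outside the infobox\n"
)
print(parse_infobox(sample)['image_flag'])  # Flag of the United Kingdom.svg
```

Values that continue over several lines are ignored by this sketch; handling them would require appending continuation lines to the most recent key.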

26. Remove emphasis markups
In addition to the processing in problem 25, remove MediaWiki emphasis markup from the values. See Help:Cheatsheet.

In [28]:
import re

d = {}
pattern = re.compile(r'^\s*\|\s*(.+?)\s*=\s*(.+)')
# ''italic'', '''bold''' and '''''bold italic''''' all use runs of 2 to 5 quotes
emphasis = re.compile(r"'{2,5}")
with open('/Users/wdy940211/Desktop/uk.txt', 'r') as f_source, open('/Users/wdy940211/Desktop/remove_emphasis.txt', 'w') as f_target:
    for line in f_source:
        match = pattern.search(line)
        if match:
            # Remove every emphasis marker in the value, not only the first pair
            d[match[1].strip()] = emphasis.sub('', match[2]).strip()
    for key, value in d.items():
        f_target.write(key+': '+value+',\n')

27. Remove internal links
In addition to the processing in problem 26, remove internal links from the values. See Help:Cheatsheet.

In [20]:
import re

d = {}
pattern = re.compile(r'^\s*\|\s*(.+?)\s*=\s*(.+)')
emphasis = re.compile(r"'{2,5}")
# [[Article]] -> Article, [[Article|label]] -> label
links = re.compile(r'\[\[(?:[^|\]]*\|)*([^|\]]+)\]\]')
with open('/Users/wdy940211/Desktop/uk.txt', 'r') as f_source, open('/Users/wdy940211/Desktop/remove_links.txt', 'w') as f_target:
    for line in f_source:
        match = pattern.search(line)
        if match:
            # Delete only the markers; the text between and after them is kept
            value = emphasis.sub('', match[2])
            value = links.sub(r'\1', value)
            d[match[1].strip()] = value.strip()

    for key, value in d.items():
        f_target.write(key+': '+value+',\n')

28. Remove MediaWiki markups
In addition to the processing in problem 27, remove as much MediaWiki markup from the values as you can, and obtain the basic information of the country in plain text format.

In [19]:
import re

d = {}
pattern = re.compile(r'^\s*\|\s*(.+?)\s*=\s*(.+)')
emphasis = re.compile(r"'{2,5}")
links = re.compile(r'\[\[(?:[^|\]]*\|)*([^|\]]+)\]\]')
# [br|ref] is a character class, not an alternation, so the tags are spelled out:
# self-closing <ref/>, paired <ref>...</ref>, then any leftover <br> or <ref> tag
media = re.compile(r'<ref[^>]*/>|<ref[^>]*>.*?</ref>|</?(?:br|ref)[^>]*>')

with open('/Users/wdy940211/Desktop/uk.txt', 'r') as f_source, open('/Users/wdy940211/Desktop/remove_media.txt', 'w') as f_target:
    for line in f_source:
        match = pattern.search(line)
        if match:
            value = media.sub('', match[2])
            value = emphasis.sub('', value)
            value = links.sub(r'\1', value)
            d[match[1].strip()] = value.strip()

    for key, value in d.items():
        f_target.write(key+': '+value+',\n')
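
The three clean-up steps of problems 26-28 can also be combined into one helper and tried on a single value. This is a rough sketch under the assumption that references fit on one line; the function name strip_markup and the sample value are hypothetical, and templates such as {{...}} are left untouched.

```python
import re

EMPHASIS = re.compile(r"'{2,5}")                        # '' ''' '''''
LINK = re.compile(r'\[\[(?:[^|\]]*\|)*([^|\]]+)\]\]')   # keep the visible label
REF = re.compile(r'<ref[^>]*/>|<ref[^>]*>.*?</ref>')    # references and their text
TAG = re.compile(r'</?[a-zA-Z][^>]*>')                  # any remaining HTML-like tag


def strip_markup(value):
    """Remove references, tags, emphasis and internal links from one value."""
    value = REF.sub('', value)
    value = TAG.sub(' ', value)  # a space keeps words on both sides of <br> apart
    value = EMPHASIS.sub('', value)
    value = LINK.sub(r'\1', value)
    return re.sub(r'\s+', ' ', value).strip()


# Illustrative sample, not taken from the dataset
sample = ("'''[[London]]'''<br />"
          "[[Monarchy of the United Kingdom|Monarch]]<ref name=\"x\">note</ref>")
print(strip_markup(sample))  # London Monarch
```

References are removed before the generic tags so that the text inside <ref>...</ref> disappears together with the tags.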

29. Country flag
Obtain the URL of the country flag by using the analysis result of Infobox. (Hint: convert a file reference to a URL by calling imageinfo in MediaWiki API)

In [24]:
# Learning requests
import requests

r = requests.get('https://en.wikipedia.org/w/api.php')
r

<Response [200]>

In [25]:
# Learning requests
import requests

r = requests.get('https://en.wikipedia.org/w/api.php')
dir(r)

['__attrs__',
 '__bool__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__nonzero__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_content',
 '_content_consumed',
 '_next',
 'apparent_encoding',
 'close',
 'connection',
 'content',
 'cookies',
 'elapsed',
 'encoding',
 'headers',
 'history',
 'is_permanent_redirect',
 'is_redirect',
 'iter_content',
 'iter_lines',
 'json',
 'links',
 'next',
 'ok',
 'raise_for_status',
 'raw',
 'reason',
 'request',
 'status_code',
 'text',
 'url']

In [26]:
# Learning requests
import requests

r = requests.get('https://en.wikipedia.org/w/api.php')
help (r)

Help on Response in module requests.models object:

class Response(builtins.object)
 |  The :class:`Response <Response>` object, which contains a
 |  server's response to an HTTP request.
 |  
 |  Methods defined here:
 |  
 |  __bool__(self)
 |      Returns True if :attr:`status_code` is less than 400.
 |      
 |      This attribute checks if the status code of the response is between
 |      400 and 600 to see if there was a client error or a server error. If
 |      the status code, is between 200 and 400, this will return True. This
 |      is **not** a check to see if the response code is ``200 OK``.
 |  
 |  __enter__(self)
 |  
 |  __exit__(self, *args)
 |  
 |  __getstate__(self)
 |  
 |  __init__(self)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __iter__(self)
 |      Allows you to use a response as an iterator.
 |  
 |  __nonzero__(self)
 |      Returns True if :attr:`status_code` is less than 400.
 |      
 |      This attribute checks if

In [27]:
# Learning requests
import requests

r = requests.get('https://en.wikipedia.org/w/api.php')
r.text



In [28]:
# Learning requests
import requests

r = requests.get('https://en.wikipedia.org/w/api.php')
r.status_code

200

In [31]:
# Learning requests
import requests

r = requests.get('https://en.wikipedia.org/w/api.php')
r.content



In [30]:
# Learning requests
import requests

r = requests.get('https://en.wikipedia.org/w/api.php')
r.headers

{'Date': 'Mon, 19 Oct 2020 16:23:58 GMT', 'Server': 'mw2356.codfw.wmnet', 'X-Content-Type-Options': 'nosniff', 'P3p': 'CP="See https://en.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info."', 'Content-Language': 'en', 'Vary': 'Accept-Encoding,Cookie,Authorization', 'Expires': 'Mon, 19 Oct 2020 16:23:58 GMT', 'X-Frame-Options': 'SAMEORIGIN', 'Content-Disposition': 'inline; filename=api-help.html', 'Cache-Control': 'private, must-revalidate, max-age=0', 'X-Request-Id': '430618a4-5dac-460d-bac1-0679ec03686c', 'Content-Type': 'text/html; charset=utf-8', 'Content-Encoding': 'gzip', 'Age': '0', 'X-Cache': 'cp5011 miss, cp5007 pass', 'X-Cache-Status': 'pass', 'Server-Timing': 'cache;desc="pass"', 'Strict-Transport-Security': 'max-age=106384710; includeSubDomains; preload', 'Report-To': '{ "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0"

In [79]:
# test
import requests

s = requests.Session()
url = 'https://en.wikipedia.org/w/api.php'
params = {
    'action': 'query',
    'format': 'json',
    'prop': 'imageinfo',
    'titles': 'File:'+d['image_flag'],
    'iiprop': 'url'
}

r = s.get(url, params=params)
data = r.json()
page = data['query']['pages']

print(r)
print(data)

<Response [200]>
{'continue': {'iistart': '2011-10-03T04:05:02Z', 'continue': '||'}, 'query': {'pages': {'23473560': {'pageid': 23473560, 'ns': 6, 'title': 'File:Flag of the United Kingdom.svg', 'imagerepository': 'local', 'imageinfo': [{'url': 'https://upload.wikimedia.org/wikipedia/en/a/ae/Flag_of_the_United_Kingdom.svg', 'descriptionurl': 'https://en.wikipedia.org/wiki/File:Flag_of_the_United_Kingdom.svg', 'descriptionshorturl': 'https://en.wikipedia.org/w/index.php?curid=23473560'}]}}}}


In [67]:
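# 29: look up the flag file named in the Infobox dictionary d (built in problems 25-28)
import requests

s = requests.Session()
url = 'https://en.wikipedia.org/w/api.php'
params = {
    'action': 'query',
    'format': 'json',
    'prop': 'imageinfo',
    'titles': 'File:'+d['image_flag'],
    'iiprop': 'url'
}

r = s.get(url, params=params)
r.raise_for_status()  # stop early on an HTTP error
data = r.json()
pages = data['query']['pages']
# The page id is not known in advance, so iterate over the returned pages
for page in pages.values():
    print(page['imageinfo'][0]['url'])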

https://upload.wikimedia.org/wikipedia/en/a/ae/Flag_of_the_United_Kingdom.svg
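
The lookup in the response can be separated from the network call, which makes it easy to test offline. The sketch below uses a trimmed copy of the response printed in the test cell above; the function name image_urls is hypothetical, and the use of .get() guards against pages that may come back without an imageinfo entry.

```python
def image_urls(data):
    """Collect the imageinfo URLs from a MediaWiki API query response."""
    urls = []
    for page in data.get('query', {}).get('pages', {}).values():
        # A page may lack 'imageinfo', for example when the file was not found
        for info in page.get('imageinfo', []):
            if 'url' in info:
                urls.append(info['url'])
    return urls


# Trimmed copy of the response shown in the test cell above
response = {
    'query': {'pages': {'23473560': {
        'pageid': 23473560,
        'ns': 6,
        'title': 'File:Flag of the United Kingdom.svg',
        'imageinfo': [{
            'url': 'https://upload.wikimedia.org/wikipedia/en/a/ae/Flag_of_the_United_Kingdom.svg',
        }],
    }}}
}
print(image_urls(response))
```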
