## Chapter 3: Regular Expressions

NLP 100  

Steven Coyne
  
The file enwiki-country.json.gz stores Wikipedia articles in the following format:

- Each line stores a Wikipedia article in JSON format
- Each JSON document has key-value pairs:
  - Title of the article as the value for the title key
  - Body of the article as the value for the text key
- The entire file is compressed by gzip

Write code that performs the following tasks.

### 20. Read JSON documents

Read the JSON documents and output the body of the article about the United Kingdom. Reuse the output in problems 21-29.

In [2]:
import json
import gzip

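with gzip.open('data/enwiki-country.json.gz', 'rt', encoding='utf-8') as in_file:
    # Each line of the file is one JSON document
    jsonList = [json.loads(row) for row in in_file]

# Keep the body ('text') of the article whose title is 'United Kingdom'
UK_article = next(doc['text'] for doc in jsonList if doc['title'] == 'United Kingdom')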

with open('work/UK_text.txt', 'w') as out_file:
    out_file.write(UK_article)
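
The cell above loads every article into memory before filtering. As an alternative, here is a minimal sketch that reads the file line by line and stops at the first matching title. The function name `find_article_text` is my own, and the sketch assumes the file is UTF-8 encoded with one JSON document per line, as described at the top of the chapter.

```python
import gzip
import json


def find_article_text(path, title):
    """Return the 'text' of the first article whose 'title' matches, or None."""
    with gzip.open(path, 'rt', encoding='utf-8') as in_file:
        for row in in_file:
            doc = json.loads(row)
            if doc['title'] == title:
                # Stop reading as soon as the article is found
                return doc['text']
    return None
```

Calling `find_article_text('data/enwiki-country.json.gz', 'United Kingdom')` should give the same string as `UK_article`, provided the title occurs only once in the file.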

### 21. Lines with category names

Extract lines that define the categories of the article.

In [3]:
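with open('work/UK_text.txt', 'r') as UK_text:
    cat_line_list = []
    for line in UK_text:
        # A plain substring test also catches template lines that merely mention
        # "Category" (such as Sister project links); problem 22 filters those out
        if 'Category' in line:
            line = line.rstrip()
            cat_line_list.append(line)
            print(line)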

{{Sister project links|n=Category:United Kingdom|voy=United Kingdom|d=Q145}}
[[Category:United Kingdom| ]]
[[Category:British Islands]]
[[Category:Countries in Europe]]
[[Category:English-speaking countries and territories]]
[[Category:G7 nations]]
[[Category:Group of Eight nations]]
[[Category:G20 nations]]
[[Category:Island countries]]
[[Category:Northern European countries]]
[[Category:Former member states of the European Union]]
[[Category:Member states of NATO]]
[[Category:Member states of the Commonwealth of Nations]]
[[Category:Member states of the Council of Europe]]
[[Category:Member states of the Union for the Mediterranean]]
[[Category:Member states of the United Nations]]
[[Category:Priority articles for attention after Brexit]]
[[Category:Western European countries]]
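
The first line of the output above is a Sister project links template rather than a category definition. Since this chapter is about regular expressions, here is a sketch of a stricter filter that only accepts lines consisting of a `[[Category:...]]` link. The names `category_line_pattern` and `category_lines` are my own, and the sample text is a shortened copy of lines from the output above.

```python
import re

# The whole line must be a [[Category:...]] link, optionally followed by whitespace
category_line_pattern = re.compile(r'^\[\[Category:.*\]\]\s*$')


def category_lines(text):
    """Return the lines of text that define a category."""
    return [line for line in text.splitlines() if category_line_pattern.match(line)]


sample = (
    "{{Sister project links|n=Category:United Kingdom|voy=United Kingdom|d=Q145}}\n"
    "[[Category:United Kingdom| ]]\n"
    "[[Category:British Islands]]\n"
    "Some body text\n"
)
print(category_lines(sample))
# ['[[Category:United Kingdom| ]]', '[[Category:British Islands]]']
```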


### 22. Category names

Extract the category names of the article.

In [4]:
import re

category_list = []
# Capture the name lazily, stopping at the first '|' (sort key) or ']' (end of the link)
category_pattern = re.compile(r'\[\[Category:(.*?)[|\]]')

for line in cat_line_list:
    match = re.search(category_pattern, line)
    if match:
        category = match.group(1)
        category_list.append(category)
        print(category)

United Kingdom
British Islands
Countries in Europe
English-speaking countries and territories
G7 nations
Group of Eight nations
G20 nations
Island countries
Northern European countries
Former member states of the European Union
Member states of NATO
Member states of the Commonwealth of Nations
Member states of the Council of Europe
Member states of the Union for the Mediterranean
Member states of the United Nations
Priority articles for attention after Brexit
Western European countries
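
The same names can probably be pulled from the article text in one pass, without building the intermediate list of lines. The sketch below uses `re.MULTILINE` so that `^` matches at the start of every line, and a negated character class in place of the lazy quantifier. `category_name_pattern` and `category_names` are names I made up for this example.

```python
import re

# With re.MULTILINE, '^' anchors at the start of each line of the text
category_name_pattern = re.compile(r'^\[\[Category:([^|\]]+)', re.MULTILINE)


def category_names(text):
    """Return the category names found in MediaWiki text."""
    return [name.strip() for name in category_name_pattern.findall(text)]


sample = (
    "[[Category:United Kingdom| ]]\n"
    "[[Category:G7 nations]]\n"
    "{{Sister project links|n=Category:United Kingdom|voy=United Kingdom|d=Q145}}\n"
)
print(category_names(sample))
# ['United Kingdom', 'G7 nations']
```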


### 23. Section structure

Extract section names in the article with their levels. For example, the level of the section is 1 for the MediaWiki markup "== Section name ==".

In [5]:
# Opening run of '=', the section name, then the closing run of '='
section_pattern = re.compile(r'(==+)(.*?)(==+)')

with open('work/UK_text.txt', 'r') as UK_text:
    for line in UK_text:
        match = re.search(section_pattern,line)
        if match:
            # strip() drops spaces around the name; level = number of '=' minus 1
            print(f'{match.group(2).strip()} ({len(match.group(1)) - 1})')

Etymology and terminology (1)
History (1)
Background (2)
Treaty of Union (2)
From the union with Ireland to the end of the First World War (2)
Between the World Wars (2)
Since the Second World War (2)
Geography (1)
Climate (2)
Administrative divisions (2)
Dependencies (1)
Politics (1)
Government (2)
Devolved administrations (2)
Law and criminal justice (2)
Foreign relations (2)
Military (2)
Economy (1)
Overview (2)
Science and technology (2)
Transport (2)
Energy (2)
Water supply and sanitation (2)
Demographics (1)
Ethnic groups (2)
Languages (2)
Religion (2)
Migration (2)
Education (2)
Health (2)
Culture (1)
Literature (2)
Music (2)
Visual art (2)
Cinema (2)
Cuisine (2)
Media (2)
Philosophy (2)
Sport (2)
Symbols (2)
Stereotypes (2)
Historiography (1)
See also (1)
Notes (1)
References (1)
External links (1)
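
The pattern in the cell above matches anywhere in a line and does not check that the two runs of `=` have the same length. A stricter variant, sketched below, anchors the heading to the whole line and uses the backreference `\1` to require a matching closing run. `heading_pattern` and `parse_heading` are names of my own choosing.

```python
import re

# \1 requires the closing run of '=' to repeat the opening run exactly
heading_pattern = re.compile(r'^(={2,})\s*(.+?)\s*\1\s*$')


def parse_heading(line):
    """Return (name, level) for a MediaWiki heading line, or None otherwise."""
    match = heading_pattern.match(line)
    if match is None:
        return None
    return match.group(2), len(match.group(1)) - 1


print(parse_heading('== History =='))      # ('History', 1)
print(parse_heading('=== Cuisine  ==='))   # ('Cuisine', 2)
print(parse_heading('a == b'))             # None
```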


### 24. Media references

Extract references to media files linked from the article.

In [6]:
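# Capture the file name between "File:" and the next "|"
media_pattern = re.compile(r'File:(.+?)\|')

with open('work/UK_text.txt', 'r') as UK_text:
    for line in UK_text:
        # re.search returns only the first match, so a line with several
        # File: references contributes just one name
        match = re.search(media_pattern, line)
        if match:
            print(match.group(1))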

Royal Coat of Arms of the United Kingdom.svg
Europe-UK (orthographic projection).svg
United Kingdom (+overseas territories and crown dependencies) in the World (+Antarctica claims).svg
Stonehenge, Condado de Wiltshire, Inglaterra, 2014-08-12, DD 18.JPG
Bayeux Tapestry WillelmDux.jpg
State House- 1620 - St Geo - Bermuda.jpg
Treaty of Union.jpg
Royal Irish Rifles ration party Somme July 1916.jpg
The British Empire.png
Tratado de Lisboa 13 12 2007 (081).jpg
Uk topo en.jpg
UK_K%C3%B6ppen.svg 
Inside the Reef Cayman.jpg
Cap Juluca - Anguilla.jpg
Bermuda-Harbour and Town of St George.jpg
Rothera from reptile.jpg
Roadtown, Tortola.jpg
Upland.jpg
Catalan Bay from The Rock.JPG
Soufriere Hills.jpg
Bounty_bay.jpg
St-Helena-Jamestown-from-above.jpg
Grytviken_church.jpg
Cockburn Town.jpg
Mont Orgueil and Gorey harbour, Jersey.jpg
St Peter Port Guernsey.jpg
The_View_From_Douglas_Head,_Isle_Of_Man..jpg
London Parliament 2007-1.jpg
UK Political System.png
Scottish Parliament, Main Debating Chamber - g
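
Because `re.search` stops at the first match, a line that holds two media links (the `other_symbol` field of the infobox is one) only contributes its first file name. A sketch of one way around this is to keep the same pattern but call `findall`, which returns every non-overlapping match. The function name `media_files` is hypothetical; the sample line is the `other_symbol` value from the article.

```python
import re

media_pattern = re.compile(r'File:(.+?)\|')


def media_files(text):
    """Return every file name that follows 'File:' and precedes a '|'."""
    return [name.strip() for name in media_pattern.findall(text)]


sample = (
    '[[File:Royal Coat of Arms of the United Kingdom.svg|x100px]]'
    '[[File:Royal Coat of Arms of the United Kingdom (Scotland).svg|x100px]]'
)
for name in media_files(sample):
    print(name)
# Royal Coat of Arms of the United Kingdom.svg
# Royal Coat of Arms of the United Kingdom (Scotland).svg
```

The `strip()` call also removes stray whitespace around a name.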

### 25. Infobox

Extract field names and their values in the Infobox “country”, and store them in a dictionary object.

In [7]:
with open('work/UK_text.txt', 'r') as UK_text:
    infobox_list = []
    flag = False  # True while reading lines inside the infobox
    for line in UK_text:
        if line.startswith('{{Infobox country'):
            flag = True
        elif flag and line.startswith('}}'):
            # A line starting with '}}' closes the infobox template
            break
        elif flag:
            infobox_list.append(line)

    UK_dict = {}
    # "|field = value": group 1 is the field name, group 2 is the value
    infobox_pattern = re.compile(r'\|(.+?)\s*=\s*(.*)')

    for i in infobox_list:
        match = re.match(infobox_pattern, i)
        if match:
            UK_dict[match.group(1)] = match.group(2)
    for key, value in UK_dict.items():
        print(f'{key}: {value}')

 common_name: United Kingdom
 linking_name: the United Kingdom<!--Note: "the" required here as this entry used to create wikilinks-->
 conventional_long_name: United Kingdom of Great Britain and Northern Ireland
 image_flag: Flag of the United Kingdom.svg
 alt_flag: A flag featuring both cross and saltire in red, white and blue
 other_symbol: [[File:Royal Coat of Arms of the United Kingdom.svg|x100px]][[File:Royal Coat of Arms of the United Kingdom (Scotland).svg|x100px]]
 other_symbol_type: [[Royal coat of arms of the United Kingdom|Royal coats of arms]]:{{#tag:ref |The coat of arms on the left is used in England, Northern Ireland, and Wales; the version on the right is used in Scotland|group=note}}
 national_anthem: "[[God Save the Queen]]"{{#tag:ref |There is no authorised version of the national anthem as the words are a matter of tradition; only the first verse is usually sung.<ref>{{cite web |title=National Anthem |url=https://www.royal.uk/national-anthem |website=Official web si
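
The keys printed above begin with a space, which suggests the markup is written as `| field = value` and the space after the pipe ends up in group 1. Below is a sketch of the same extraction wrapped in a function, with `\s*` added after the pipe so the keys come out clean. `parse_infobox` and `field_pattern` are my own names, and the sample lines are a hand-made miniature of the infobox, not the real article.

```python
import re

# \s* after the pipe keeps leading spaces out of the field name
field_pattern = re.compile(r'^\|\s*(.+?)\s*=\s*(.*)$')


def parse_infobox(lines):
    """Collect 'field = value' pairs from the lines of an Infobox country template."""
    fields = {}
    in_infobox = False
    for line in lines:
        if line.startswith('{{Infobox country'):
            in_infobox = True
        elif in_infobox and line.startswith('}}'):
            break  # end of the template
        elif in_infobox:
            match = field_pattern.match(line.rstrip('\n'))
            if match:
                fields[match.group(1)] = match.group(2)
    return fields


sample = [
    '{{Infobox country\n',
    '| common_name = United Kingdom\n',
    '| image_flag = Flag of the United Kingdom.svg\n',
    '}}\n',
    '| ignored = after the infobox\n',
]
print(parse_infobox(sample))
# {'common_name': 'United Kingdom', 'image_flag': 'Flag of the United Kingdom.svg'}
```

Like the cell above, this ignores continuation lines of multi-line values.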

### 26. Remove emphasis markups

In addition to the processing in problem 25, remove MediaWiki emphasis markup from the values. See [Help:Cheatsheet](https://en.wikipedia.org/wiki/Help:Cheatsheet).

In [8]:
#I took this to mean removing ''italics'', '''bold''', and '''''both''''', which all use the character ' several times

with open('work/UK_text.txt', 'r') as UK_text:
    infobox_list = []
    flag = False
    for line in UK_text:
        if line.startswith('{{Infobox country'):
            flag = True
        elif flag and line.startswith('}}'):
            break
        elif flag:
            infobox_list.append(line)

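    UK_dict = {}
    # Two to five consecutive apostrophes mark italics, bold, or both
    emphasis_pattern = re.compile(r'\'{2,5}')

    for i in infobox_list:
        # infobox_pattern was compiled in the cell for problem 25
        match = re.match(infobox_pattern, i)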
        if match:
            value = re.sub(emphasis_pattern, '', match.group(2))
            UK_dict[match.group(1)] = value

    for key, value in UK_dict.items():
        print(f'{key}: {value}')

 common_name: United Kingdom
 linking_name: the United Kingdom<!--Note: "the" required here as this entry used to create wikilinks-->
 conventional_long_name: United Kingdom of Great Britain and Northern Ireland
 image_flag: Flag of the United Kingdom.svg
 alt_flag: A flag featuring both cross and saltire in red, white and blue
 other_symbol: [[File:Royal Coat of Arms of the United Kingdom.svg|x100px]][[File:Royal Coat of Arms of the United Kingdom (Scotland).svg|x100px]]
 other_symbol_type: [[Royal coat of arms of the United Kingdom|Royal coats of arms]]:{{#tag:ref |The coat of arms on the left is used in England, Northern Ireland, and Wales; the version on the right is used in Scotland|group=note}}
 national_anthem: "[[God Save the Queen]]"{{#tag:ref |There is no authorised version of the national anthem as the words are a matter of tradition; only the first verse is usually sung.<ref>{{cite web |title=National Anthem |url=https://www.royal.uk/national-anthem |website=Official web si
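
None of the values shown above happen to contain emphasis markup, so the effect of the substitution is hard to see in the output. Here is a small check on a made-up string, using the same pattern; the helper name `remove_emphasis` is my own.

```python
import re

# Runs of two to five apostrophes are emphasis markup; a single apostrophe is not
emphasis_pattern = re.compile(r"'{2,5}")


def remove_emphasis(value):
    """Delete MediaWiki italic and bold markers from a string."""
    return emphasis_pattern.sub('', value)


print(remove_emphasis("''italic'', '''bold''', '''''both''''' and it's fine"))
# italic, bold, both and it's fine
```

The lower bound of 2 is what keeps ordinary apostrophes, as in "it's", untouched.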

### 27. Remove internal links

In addition to the processing in problem 26, remove internal links from the values. See [Help:Cheatsheet](https://en.wikipedia.org/wiki/Help:Cheatsheet).

In [9]:
with open('work/UK_text.txt', 'r') as UK_text:
    infobox_list = []
    flag = False
    for line in UK_text:
        if line.startswith('{{Infobox country'):
            flag = True
        elif flag and line.startswith('}}'):
            break
        elif flag:
            infobox_list.append(line)

    UK_dict = {}
    # Replace [[target]] or [[target|label]] with the text after the last '|'.
    # File links such as [[File:Royal Coat of Arms of the United Kingdom.svg|x100px]]
    # are therefore reduced to their final parameter (here x100px).
    link_pattern = re.compile(r'\[\[(?:[^|]*?\|)*?([^|]*?)\]\]')

    for i in infobox_list:
        # infobox_pattern and emphasis_pattern come from problems 25 and 26
        match = re.match(infobox_pattern, i)
        if match:
            #remove emphasis
            value = re.sub(emphasis_pattern, '', match.group(2))
            #remove internal links
            value = re.sub(link_pattern, r'\1', value)
            UK_dict[match.group(1)] = value

    for key, value in UK_dict.items():
        print(f'{key}: {value}')

 common_name: United Kingdom
 linking_name: the United Kingdom<!--Note: "the" required here as this entry used to create wikilinks-->
 conventional_long_name: United Kingdom of Great Britain and Northern Ireland
 image_flag: Flag of the United Kingdom.svg
 alt_flag: A flag featuring both cross and saltire in red, white and blue
 other_symbol: x100pxx100px
 other_symbol_type: Royal coats of arms:{{#tag:ref |The coat of arms on the left is used in England, Northern Ireland, and Wales; the version on the right is used in Scotland|group=note}}
 national_anthem: "God Save the Queen"{{#tag:ref |There is no authorised version of the national anthem as the words are a matter of tradition; only the first verse is usually sung.<ref>{{cite web |title=National Anthem |url=https://www.royal.uk/national-anthem |website=Official web site of the British Royal Family |accessdate=4 June 2016|date=15 January 2016 }}</ref> No law was passed making "God Save the Queen" the official anthem. In the English t
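
In the output above, `other_symbol` becomes `x100pxx100px` because the two `[[File:...|x100px]]` links are treated like ordinary links and reduced to their last parameter. One possible refinement, sketched here, is to delete media links first and only then unwrap the remaining internal links. The pattern and function names are my own, and the sketch assumes that media links contain no nested `[[...]]` in their captions.

```python
import re

# Media links carry no readable label, so they are dropped entirely
file_link_pattern = re.compile(r'\[\[(?:File|Image):[^\]]*\]\]')
# [[target]] or [[target|label]]: keep the label if present, else the target
internal_link_pattern = re.compile(r'\[\[(?:[^|\]]*\|)?([^|\]]*)\]\]')


def remove_internal_links(value):
    """Drop media links, then replace internal links with their visible text."""
    value = file_link_pattern.sub('', value)
    return internal_link_pattern.sub(r'\1', value)


print(remove_internal_links('[[Royal coat of arms of the United Kingdom|Royal coats of arms]]'))
# Royal coats of arms
print(remove_internal_links('"[[God Save the Queen]]"'))
# "God Save the Queen"
```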

### 28. Remove MediaWiki markups

In addition to the processing in problem 27, remove as much MediaWiki markup from the values as you can, and obtain the basic information about the country in plain text format.
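
No solution is recorded for this problem, so what follows is only a rough sketch of one possible approach, not a complete MediaWiki parser. It removes HTML comments, `<ref>` footnotes, templates (from the inside out, so that nested `{{...}}` are handled), media links, internal and external links, emphasis, and leftover HTML tags. The function name `strip_markup` is hypothetical. Deleting templates wholesale also discards any useful text they carry, which is a deliberate simplification.

```python
import re

# Matches an innermost template: {{ ... }} with no braces inside
template_pattern = re.compile(r'\{\{[^{}]*\}\}')


def strip_markup(value):
    """Crudely reduce one infobox value to plain text."""
    value = re.sub(r'<!--.*?-->', '', value, flags=re.DOTALL)           # HTML comments
    value = re.sub(r'<ref[^>]*/>', '', value)                           # <ref ... />
    value = re.sub(r'<ref[^>]*>.*?</ref>', '', value, flags=re.DOTALL)  # <ref>...</ref>
    # Repeat until no template is left, so nested templates disappear too
    while template_pattern.search(value):
        value = template_pattern.sub('', value)
    value = re.sub(r'\[\[(?:File|Image):[^\]]*\]\]', '', value)         # media links
    value = re.sub(r'\[\[(?:[^|\]]*\|)?([^|\]]*)\]\]', r'\1', value)    # internal links
    value = re.sub(r'\[https?://[^\s\]]+\s+([^\]]+)\]', r'\1', value)   # [url label]
    value = re.sub(r'\[https?://[^\s\]]+\]', '', value)                 # [url]
    value = re.sub(r"'{2,5}", '', value)                                # emphasis
    value = re.sub(r'<[^>]+>', ' ', value)                              # remaining HTML tags
    return re.sub(r'\s+', ' ', value).strip()                           # tidy whitespace


print(strip_markup('the United Kingdom<!--Note: "the" required here-->'))
# the United Kingdom
print(strip_markup("'''bold''' [https://example.org Example site]<br />text"))
# bold Example site text
```

Applying `strip_markup` to every value of `UK_dict` would give a plain-text version of the infobox, within the limits noted above.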

### 29. Country flag

Obtain the URL of the country flag by using the analysis result of Infobox. (Hint: convert a file reference to a URL by calling [imageinfo](https://www.mediawiki.org/wiki/API:Imageinfo) in [MediaWiki API](https://www.mediawiki.org/wiki/API:Main_page))
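
No solution is recorded for this problem either. The sketch below shows one way the hint might be followed: build a `query` request with `prop=imageinfo` and `iiprop=url` for the file named in the `image_flag` field, then read the URL out of the JSON response. The helper names, the choice of the English Wikipedia endpoint, and the User-Agent string are my assumptions; the response layout (`query` → `pages` → page → `imageinfo`) is the default JSON format as I understand it from the API documentation.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the English Wikipedia instance of the MediaWiki API
API_ENDPOINT = 'https://en.wikipedia.org/w/api.php'


def build_imageinfo_url(file_name):
    """Build the API request URL that asks for the URL of a media file."""
    params = {
        'action': 'query',
        'titles': f'File:{file_name}',
        'prop': 'imageinfo',
        'iiprop': 'url',
        'format': 'json',
    }
    return f'{API_ENDPOINT}?{urllib.parse.urlencode(params)}'


def extract_image_url(response):
    """Return the first image URL in a decoded imageinfo response, or None."""
    for page in response['query']['pages'].values():
        for info in page.get('imageinfo', []):
            return info['url']
    return None


def fetch_flag_url(file_name):
    """Query the API over the network and return the file's URL (or None)."""
    request = urllib.request.Request(
        build_imageinfo_url(file_name),
        headers={'User-Agent': 'nlp100-example/0.1'},  # hypothetical client name
    )
    with urllib.request.urlopen(request) as response:
        return extract_image_url(json.load(response))
```

With the dictionary from problem 25, the call would look something like `fetch_flag_url(UK_dict[' image_flag'])`, where the key carries the leading space seen in the printed output. It needs network access, so I have not included a result here.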