# Minicase - Raw Data Cleanup from Wikipedia

We will be scraping content from Wikipedia, which makes all public edits in the history of the wiki available through a public API. This API is documented at https://www.mediawiki.org/wiki/API:Main_page, but you will not have to work with it directly; the code below provides helper functions that make things easier.


## Initial Help

We have already included a few functions below, which will allow you to call `page_text()` in your code. This fetches either the raw HTML of any Wikipedia page or its plain text, as a single string or as a list of strings, and caches the result locally.
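
To make the local caching a little more concrete, here is a minimal sketch of how a cache file path can be derived from the request parameters. It mirrors the hashing idea used by the helper cell, but `cache_file_for` is a hypothetical name introduced only for this illustration, and no request is sent.

```python
import hashlib
import os

def cache_file_for(function_key, parameters, cache_path="cached_api"):
    """Return the cache path for a request (hypothetical helper name)."""
    # Hash the printed form of the parameters so equal requests share a file.
    digest = hashlib.md5(str(parameters).encode("utf-8")).hexdigest()
    return os.path.join(cache_path, function_key, digest)

params = {"action": "parse", "page": "Ursula K. Le Guin", "format": "json"}
path_a = cache_file_for("revisions", params)
path_b = cache_file_for("revisions", dict(params))  # same content
path_c = cache_file_for("revisions", {**params, "page": "Angkor Wat"})

print(path_a == path_b)  # same parameters map to the same file
print(path_a == path_c)  # a different page maps to a different file
```

Because the key is built from the parameters, asking for the same page twice reads the saved file instead of contacting Wikipedia again.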

In [None]:

import hashlib, os, json, requests
from lxml import etree

# Input: Page name of a Wikipedia article.
# 
# Returns: Full HTML source of the named article if it exists,
#          or None if no such page exists.
def __api_GET_latest_page(title):
    parameters = {
        "action": "parse",
        "page": title,
        "format": "json"
    }
    response_json = __get("revisions", title, parameters)
    if("parse" in response_json.keys() 
        and "text" in response_json["parse"].keys() 
        and "*" in response_json["parse"]["text"].keys()):
        return response_json["parse"]["text"]["*"]
    return None    

# Internal function to hide a caching API request into a single private function.
# This function will save you a lot of headaches in writing your own HTTP requests
# and will save the Wikimedia foundation some bandwidth since you'll fetch a local
# copy if you have already retrieved an article text at least once.
def __get(function_key, key, parameters, check_cache=True, write_cache=True):
    target = "https://en.wikipedia.org/w/api.php"
    cache_path = "cached_api"
    params_unicode = str(parameters).encode('utf-8')
    md5 = hashlib.md5(params_unicode).hexdigest()
    return_json = None

    cache_file = os.path.join(cache_path, function_key, md5)
    cache_exists = os.path.isfile(cache_file)
    if cache_exists:
        try:
            # Read the cached response; the with-block closes the file for us.
            with open(cache_file, "r") as json_in:
                return_json = json.load(json_in)
            # A cached "maxlag" error means the server was busy, so fetch again.
            if "error" in return_json.keys() and "code" in return_json["error"].keys() and return_json["error"]["code"]=="maxlag":
                cache_exists = False
        except Exception:
            # Unreadable or corrupt cache entry: fall back to a fresh request.
            cache_exists = False

    if not cache_exists:
        cache_dir = os.path.dirname(cache_file)
        os.makedirs(cache_dir, exist_ok=True)
        r = requests.get(target, params=parameters)
        request_json = r.json()
        with open(cache_file, "w") as json_out:
            json.dump(request_json, json_out)
        return_json = request_json
    return return_json

# This function takes as input a parsed HTML tree and returns the same
# tree but with a set of tags removed, mostly the contents of tables and scripts.
# This makes parsing the actual contents of a page easier.
def __remove_tables_and_scripts(tree):
    tags_to_remove = ["tbody", "td", "script"]
    for tag in tags_to_remove:
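        # Note: find() returns only the first matching element (or None), so the
        # loop below removes the children of that first match, not every match.
        elements = tree.find(f".//{tag}")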
        if elements is not None:
            for e in elements:
                e.getparent().remove(e)
    return tree

# This function takes two required and one optional parameters as input.
#
# Required:
# name: Name of a Wikipedia page to retrieve.
# format: Type of content that you want returned. Options include:
#         "html" : Full HTML content of the page you requested.
#         "text" : Full content of the page you requested as a single string,
#                  with all HTML tags removed.
#         "list" : Full content of the page you requested with all HTML removed,
#                  but each paragraph on the page is a separate string, and the
#                  page as a whole is returned to you as a list of paragraphs.
#
# Optional:
# include_tables: By default, all tables and scripts in the HTML text will be 
#                 removed from the text that gets sent back to you. If you want
#                 to include that content, you can pass in True instead.
#
# This function returns the content of the page in the format that you specified.
def page_text(name, format, include_tables = False):
    result = None  # stays None if the request below raises
    try:
        result = __api_GET_latest_page(name)
    except Exception as e:
        print("API request failed.")
        print(e)
    if result:
        e = etree.fromstring(result)
        if not include_tables:
            e = __remove_tables_and_scripts(e)
        if format == "html":
            return str(etree.tostring(e))
        elif format == "text":
            return ''.join(e.itertext())
        elif format == "list":
            return ''.join(e.itertext()).split('\n')
    else:
        print("Failed to retrieve a page.")
        return None


The `page_text()` function above is called with two parameters, first a page name and then a format. `text` will return the entire page's text as a single string, `html` will return the page including all HTML formatting, and `list` will return each line of the page as a separate string, all stored in a single list object.
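
As a rough illustration of how the `text` and `list` formats relate, the sketch below joins all text nodes of a tiny HTML fragment and then splits the result on newlines, which is the same pair of steps `page_text()` applies. It uses the standard library's `xml.etree.ElementTree` as a stand-in for `lxml` (an assumption made so the example runs without extra packages), and the fragment itself is made up.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed fragment standing in for a fetched page.
snippet = "<div><p>First <b>bold</b> paragraph.</p>\n<p>Second paragraph.</p></div>"
root = ET.fromstring(snippet)

# "text" format: every text node joined into one string, tags dropped.
as_text = "".join(root.itertext())

# "list" format: the same string split on newlines.
as_list = as_text.split("\n")

print(repr(as_text))
print(as_list)
```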

In [None]:
page_text("Ursula K. Le Guin", "text")

'American fantasy and science fiction author (1929–2018)\n\n\n.mw-parser-output .infobox-subbox{padding:0;border:none;margin:-3px;width:auto;min-width:100%;font-size:100%;clear:none;float:none;background-color:transparent}.mw-parser-output .infobox-3cols-child{margin:auto}\nUrsula Kroeber Le Guin (/ˈkroʊbər lə ˈɡwɪn/;[1] October 21, 1929 – January 22, 2018) was an American author best known for her works of speculative fiction, including science fiction works set in her Hainish universe, and the Earthsea fantasy series. She was first published in 1959, and her literary career spanned nearly sixty years, producing more than twenty novels and over a hundred short stories, in addition to poetry, literary criticism, translations, and children\'s books. Frequently described as an author of science fiction, Le Guin has also been called a "major voice in American Letters".[2] Le Guin herself said she would prefer to be known as an "American novelist".[3]\nLe Guin was born in Berkeley, Califor

In [None]:
page_text("Ursula K. Le Guin", "html")

'b\'<div class="mw-parser-output"><div class="shortdescription nomobile noexcerpt noprint searchaux" style="display:none">American fantasy and science fiction author (1929&#8211;2018)</div>\\n<p class="mw-empty-elt">\\n</p>\\n<style data-mw-deduplicate="TemplateStyles:r1048617464">.mw-parser-output .infobox-subbox{padding:0;border:none;margin:-3px;width:auto;min-width:100%;font-size:100%;clear:none;float:none;background-color:transparent}.mw-parser-output .infobox-3cols-child{margin:auto}</style><table class="infobox vcard"><tbody/></table>\\n<p><b>Ursula Kroeber Le Guin</b> (<span class="rt-commentedText nowrap"><span class="IPA nopopups noexcerpt"><a href="/wiki/Help:IPA/English" title="Help:IPA/English">/<span style="border-bottom:1px dotted"><span title="/&#712;/: primary stress follows">&#712;</span><span title="\\\'k\\\' in \\\'kind\\\'">k</span><span title="\\\'r\\\' in \\\'rye\\\'">r</span><span title="/o&#650;/: \\\'o\\\' in \\\'code\\\'">o&#650;</span><span title="\\\'b\\\' i

In [None]:
page_text("Ursula K. Le Guin", "list")

['American fantasy and science fiction author (1929–2018)',
 '',
 '',
 '.mw-parser-output .infobox-subbox{padding:0;border:none;margin:-3px;width:auto;min-width:100%;font-size:100%;clear:none;float:none;background-color:transparent}.mw-parser-output .infobox-3cols-child{margin:auto}',
 'Ursula Kroeber Le Guin (/ˈkroʊbər lə ˈɡwɪn/;[1] October 21, 1929 – January 22, 2018) was an American author best known for her works of speculative fiction, including science fiction works set in her Hainish universe, and the Earthsea fantasy series. She was first published in 1959, and her literary career spanned nearly sixty years, producing more than twenty novels and over a hundred short stories, in addition to poetry, literary criticism, translations, and children\'s books. Frequently described as an author of science fiction, Le Guin has also been called a "major voice in American Letters".[2] Le Guin herself said she would prefer to be known as an "American novelist".[3]',
 "Le Guin was born in B

## Featured Article Crawling 

The English-language Wikipedia has over 5 million articles, and about 0.1% of those have been reviewed by the community as Featured Articles, which represent the best quality content on the site. The full list of all featured articles is at the following URL:

https://en.wikipedia.org/wiki/Wikipedia:Featured_articles

The biography page for author Ursula K. Le Guin, which we fetched above, is one example of a featured article.

We are going to fetch the contents of the page `Wikipedia:Featured_articles` and build a dataframe with all featured article names and their categories. For instance, as of October 2020, the first featured article listed on this page was `7 World Trade Center` and its category is `Architecture and archaeology`. The last featured article listed on this page was `Henry Wrigley` and its category is `Warfare biographies`.




### Part 1: Retrieval 
Fetch the contents of `Wikipedia: Featured_articles` in either string, HTML, or list format. Manually check the output to make sure you understand the contents and everything looks ready to go.
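
One thing to look for when checking the `html` format: `page_text()` returns `str(etree.tostring(e))`, which is the printed form of a bytes object. The string therefore starts with `b'`, and each line break shows up as the two characters backslash and `n` rather than as a real newline. The sketch below reproduces that behaviour on a small made-up value; decoding the bytes is shown only for comparison and is not what the helper does.

```python
raw = b"<p>one</p>\n<p>two</p>"  # bytes, like the value etree.tostring() returns

# What the "html" format does: take the printed form of the bytes object.
as_printed = str(raw)
print(as_printed.split("\\n"))  # split on backslash + n

# For comparison: decoding gives a normal string with real newlines.
decoded = raw.decode("utf-8")
print(decoded.split("\n"))
```

This is why splitting the `html` output on `"\\n"` produces one item per source line, while splitting on `"\n"` would leave it as a single item.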

In [None]:
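# Fetch the page in HTML format.
data = page_text("Wikipedia: Featured_articles", "html")
# page_text() returns the printed form of a bytes object, so line breaks appear
# as the two characters backslash + n; split on those rather than on "\n".
data_line = data.split("\\n")
data_line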

['b\'<div class="mw-parser-output"><p class="mw-empty-elt">',
 '</p>',
 '<div class="shortdescription nomobile noexcerpt noprint searchaux" style="display:none">List of all featured articles in English Wikipedia</div>',
 '<p class="mw-empty-elt">',
 '</p>',
 '<table style="clear:both; background:none; color:black;">',
 '',
 '<tbody/></table>',
 '<div style="border:1px solid #A3BFB1; padding:1em 1em 1em 1em; background-color:#D8E4F2; margin: 1px 1px 2px 1px;">',
 '<div style="font-size:14pt;text-align:center">Contents</div>',
 '<div class="hlist" style="font-size:11pt;text-align:center;">',
 '<p>Biographies are in sub-topics according to the larger topics',
 '</p>',
 '<ul><li><a href="#Art,_architecture,_and_archaeology">Art, architecture, and archaeology</a> <small><a href="#Biographies_(art,_architecture,_and_archaeology)">(bios)</a></small></li>',
 '<li><a href="#Biology">Biology</a> <small><a href="#Biology_biographies">(bios)</a></small></li>',
 '<li><a href="#Business,_economics,_


### Part 2: Removing Extra Information

Using Python, find the start and end of the list of featured articles, and remove the content before and after that section of the page. You might use strategies similar to the Alice in Wonderland investigation from week 9, videos 4 and 5.
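
One possible way to express "keep only what lies between a start marker and an end marker" is sketched below on a toy string. The helper name `slice_between` and the bracketed markers are hypothetical; on the real page you would pick marker strings by inspecting the fetched text.

```python
def slice_between(text, start_marker, end_marker):
    """Return the part of text after start_marker and before end_marker.

    Hypothetical helper; raises ValueError if either marker is missing.
    """
    start = text.find(start_marker)
    if start == -1:
        raise ValueError(f"start marker not found: {start_marker!r}")
    start += len(start_marker)
    end = text.find(end_marker, start)
    if end == -1:
        raise ValueError(f"end marker not found: {end_marker!r}")
    return text[start:end]

page = "navigation junk [LIST START] Angkor Wat; Borobudur [LIST END] footer junk"
print(slice_between(page, "[LIST START]", "[LIST END]"))
```

Raising an error when a marker is missing makes it obvious if the page layout changes, instead of silently returning the wrong slice.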


In [None]:
# Drop everything before the first top-level heading, identified here by the
# style fragment "margin: 1px; vertical-align:top;".
data = data.split("margin: 1px; vertical-align:top;")[1]

# Drop everything after the last featured-article entry, identified here by
# the class name "infobox plainlinks".
data = data.split("infobox plainlinks")[0]

data_line1 = data.split("\\n")
data_line1

[' padding:1em 1em 1em 1em; border:1px solid #A3BFB1; background-color:#F1F6FB">',
 '<h2><span id="Art.2C_architecture.2C_and_archaeology"/><span class="mw-headline" id="Art,_architecture,_and_archaeology" data-mw-comment="{&quot;type&quot;:&quot;heading&quot;,&quot;level&quot;:0,&quot;id&quot;:&quot;h-Art,_architecture,_and_archaeology&quot;,&quot;replies&quot;:[&quot;h-Architecture_and_archaeology-Art,_architecture,_and_archaeology&quot;,&quot;h-Art-Art,_architecture,_and_archaeology&quot;,&quot;h-Biographies_(art,_architecture,_and_archaeology)-Art,_architecture,_and_archaeology&quot;],&quot;headingLevel&quot;:2,&quot;placeholderHeading&quot;:false}"><span data-mw-comment-start="" id="h-Art,_architecture,_and_archaeology"/>Art, architecture, and archaeology<span data-mw-comment-end="h-Art,_architecture,_and_archaeology"/></span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Wikipedia:Featured_articles&amp;action=edit&amp;section

### Part 3: Extracting Contents

Write code that can perform the following tasks:

   * Distinguish page names from category headers in the raw text of the featured article list.
   * Keep track of the current category at all times while looping through contents, probably in a variable.
   * Extract page titles and store them as you iterate through page contents, associated with the current category.

IMPORTANT: You **DO NOT** need to parse HTML in order to complete this task. External libraries such as BeautifulSoup may look helpful, and colleagues may suggest them if you ask for help, but using one is probably a bad idea and will make your work more complicated than it needs to be.
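
The three tasks above can be sketched on a handful of toy lines using plain string matching. The sample lines are a hypothetical, heavily simplified version of the page markup, and the two patterns are assumptions that fit only this sample; the real lines are longer and need their own markers.

```python
import re

# Hypothetical, simplified lines; the real page markup is more verbose.
sample_lines = [
    '<h3><span class="mw-headline" id="Architecture_and_archaeology">x</span></h3>',
    '<li><a href="/wiki/Angkor_Wat" title="Angkor Wat">Angkor Wat</a></li>',
    '<li><a href="/wiki/Borobudur" title="Borobudur">Borobudur</a></li>',
    '<h3><span class="mw-headline" id="Warfare_biographies">x</span></h3>',
    '<li><a href="/wiki/John_Whittle" title="John Whittle">John Whittle</a></li>',
]

heading_re = re.compile(r'<h3>.*?id="([^"]+)"')   # category header lines
title_re = re.compile(r'<li>.*?title="([^"]+)"')  # page name lines

current_category = None  # updated whenever a header line is seen
pairs = []
for line in sample_lines:
    heading = heading_re.search(line)
    if heading:
        current_category = heading.group(1)
        continue
    title = title_re.search(line)
    if title:
        pairs.append((current_category, title.group(1)))

print(pairs)
```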

In [None]:
# Track the current heading at each level while walking through the lines.
first_level_title = ""
second_level_title = ""
third_level_title = ""
result_list = []
for line in data_line1:
    if "<h2>" in line:
        # A new top-level section resets the lower levels.
        first_level_title = line.split("span class")[1].split("id=")[1].split('"')[1]
        second_level_title = ""
        third_level_title = ""
    if "<h3>" in line:
        second_level_title = line.split("span class")[1].split("id=")[1].split('"')[1]
        third_level_title = ""
    if "<h4>" in line:
        index = line.find("span class")
        third_level_title = line[index:].split("id=")[1].split('"')[1]
    # Only lines containing this exact class string are treated as article entries.
    if "featured_article_metadata has_been_on_main_page" in line:
        page_title = line.split("title=")[1].split('"')[1]
        result_list.append([first_level_title, second_level_title, third_level_title, page_title])

result_list

[['Art,_architecture,_and_archaeology',
  'Architecture_and_archaeology',
  '',
  '7 World Trade Center'],
 ['Art,_architecture,_and_archaeology',
  'Architecture_and_archaeology',
  '',
  'Acra (fortress)'],
 ['Art,_architecture,_and_archaeology',
  'Architecture_and_archaeology',
  '',
  'Angkor Wat'],
 ['Art,_architecture,_and_archaeology',
  'Architecture_and_archaeology',
  '',
  'Belton House'],
 ['Art,_architecture,_and_archaeology',
  'Architecture_and_archaeology',
  '',
  'Benty Grange helmet'],
 ['Art,_architecture,_and_archaeology',
  'Architecture_and_archaeology',
  '',
  'Biblioteca Marciana'],
 ['Art,_architecture,_and_archaeology',
  'Architecture_and_archaeology',
  '',
  'Blakeney Chapel'],
 ['Art,_architecture,_and_archaeology',
  'Architecture_and_archaeology',
  '',
  'Bodiam Castle'],
 ['Art,_architecture,_and_archaeology',
  'Architecture_and_archaeology',
  '',
  'Borobudur'],
 ['Art,_architecture,_and_archaeology',
  'Architecture_and_archaeology',
  '',
  'Br

### Part 4: Building a dataframe

Store the filtered contents of the featured article list in a structured dataframe named `articles_df`. This should contain the list of featured articles on Wikipedia. Each row should represent one featured article and should contain two columns: 

   * `page_title`
   * `category`

These values should have been found by your extraction code in Part 3. 

Save the resulting dataframe to a CSV file and submit it along with your Python notebook to complete this homework.

In [None]:
import pandas as pd

articles_df_1 = pd.DataFrame(data=result_list, columns=["first_level_title", "second_level_title", "third_level_title", "page_title"])
articles_df_1

Unnamed: 0,first_level_title,second_level_title,third_level_title,page_title
0,"Art,_architecture,_and_archaeology",Architecture_and_archaeology,,7 World Trade Center
1,"Art,_architecture,_and_archaeology",Architecture_and_archaeology,,Acra (fortress)
2,"Art,_architecture,_and_archaeology",Architecture_and_archaeology,,Angkor Wat
3,"Art,_architecture,_and_archaeology",Architecture_and_archaeology,,Belton House
4,"Art,_architecture,_and_archaeology",Architecture_and_archaeology,,Benty Grange helmet
...,...,...,...,...
5271,Warfare,Warfare_biographies,,John Whittle
5272,Warfare,Warfare_biographies,,Maurice Wilder-Neligan
5273,Warfare,Warfare_biographies,,Richard Williams (RAAF officer)
5274,Warfare,Warfare_biographies,,James Park Woods


In [None]:
# Keep only the two required columns, using the second-level heading as the category.
articles_df = articles_df_1[['page_title','second_level_title']].rename(columns = {'second_level_title':'category'})
articles_df

Unnamed: 0,page_title,category
0,7 World Trade Center,Architecture_and_archaeology
1,Acra (fortress),Architecture_and_archaeology
2,Angkor Wat,Architecture_and_archaeology
3,Belton House,Architecture_and_archaeology
4,Benty Grange helmet,Architecture_and_archaeology
...,...,...
5271,John Whittle,Warfare_biographies
5272,Maurice Wilder-Neligan,Warfare_biographies
5273,Richard Williams (RAAF officer),Warfare_biographies
5274,James Park Woods,Warfare_biographies


In [None]:
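# Save the result to a CSV file and download it.
# files.download() is assumed to come from Google Colab's helper module.
from google.colab import files

articles_df.to_csv('Yixuan Li-Article_df-hw4.csv')
files.download("Yixuan Li-Article_df-hw4.csv")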


### Extra credit (Task 2)

**For up to 2 points of credit, you may answer the following question:**
   * What is the count and percentage distribution of categories in Wikipedia featured articles? Use `.groupby()` on the `category` column that you filled in while constructing your dataframe.
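
The arithmetic behind a count and percentage distribution can be sketched without pandas, which may help when checking the `.groupby()` result by hand. This uses `collections.Counter` in place of `.groupby()`, and the category list is a made-up sample rather than real data.

```python
from collections import Counter

# Made-up sample standing in for the category column.
categories = ["Art", "Animals", "Art", "Albums", "Animals", "Art"]

counts = Counter(categories)
total = sum(counts.values())
# Each percentage is the category count divided by the number of rows.
percentages = {name: 100 * n / total for name, n in counts.items()}

for name in sorted(counts):
    print(f"{name:<8} {counts[name]:>3} {percentages[name]:6.2f}%")
```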



In [None]:
# the count of categories
categories_cnt = articles_df.groupby('category').size()
categories_cnt

category
                                1329
Albums                           102
Animals                          429
Architecture_and_archaeology     129
Art                               92
                                ... 
Video_game_systems                 7
Video_games_2                    168
Warfare_biographies              154
Wars,_battles_and_events         223
Wrestling                         11
Length: 68, dtype: int64

In [None]:
# percentage distribution 
categories_pct = (categories_cnt/len(articles_df))*100
categories_pct

category
                                25.189538
Albums                           1.933283
Animals                          8.131160
Architecture_and_archaeology     2.445034
Art                              1.743745
                                  ...    
Video_game_systems               0.132676
Video_games_2                    3.184230
Warfare_biographies              2.918878
Wars,_battles_and_events         4.226687
Wrestling                        0.208491
Length: 68, dtype: float64
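
The blank category at the top of both outputs corresponds to rows whose second-level title was an empty string, presumably entries listed directly under a top-level heading. One possible remedy, sketched here on illustrative rows in the same shape as `result_list`, is to fall back to the first-level title when the second level is empty. The function name and the second sample row are hypothetical.

```python
# Rows shaped like result_list: [first, second, third, page_title].
# The second row is illustrative, not real extraction output.
rows = [
    ["Art,_architecture,_and_archaeology", "Architecture_and_archaeology", "", "Angkor Wat"],
    ["Some_top_level_section", "", "", "Example page"],
]

def category_with_fallback(row):
    """Use the second-level title, or the first-level title if it is empty."""
    first, second, _third, _title = row
    return second or first

labelled = [(row[3], category_with_fallback(row)) for row in rows]
print(labelled)
```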

**For up to 5 additional points of extra credit, you may complete the following task**: 
   * Write a loop that fetches the contents of every page in `articles_df`, and stores the fetched page in your dataframe in a new column named `page_contents`. Use `try / except` to skip pages that fail to load.

In [None]:
# The full dataframe is too large to fetch here, so only the first 50 pages are loaded.
page_content1 = []
for title in articles_df.page_title[:50]:
    try:
        tpage = page_text(title, 'text')
        if "{margin:auto}" in tpage:
            # Drop the leading CSS residue, then keep the non-empty lines.
            tpage1 = tpage.split("{margin:auto}")[1]
            page_content = list(filter(None, tpage1.split('\n')))
        else:
            page_content = tpage
    except Exception:
        # page_text() returns None for a missing page, which makes the "in" test raise.
        page_content = "None"
    page_content1.append([title, page_content])
  

Failed to retrieve a page.
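
The message above is printed by `page_text()`, which returns `None` for a page it cannot load instead of raising an error. A fetch loop therefore has two failure cases to handle: a `None` result and an actual exception. The sketch below shows one way to cover both; `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` only simulates `page_text(title, "text")` so the example runs offline.

```python
def fetch_all(titles, fetch):
    """Fetch every title with fetch(title); store None for pages that fail."""
    contents = {}
    for title in titles:
        try:
            contents[title] = fetch(title)  # may return None or raise
        except Exception as error:
            print(f"Skipping {title!r}: {error}")
            contents[title] = None
    return contents

def fake_fetch(title):
    """Stand-in for page_text(title, "text"); for this demo only."""
    if title == "Missing page":
        return None
    if title == "Broken page":
        raise ValueError("simulated network error")
    return f"Text of {title}"

result = fetch_all(["Angkor Wat", "Missing page", "Broken page"], fake_fetch)
loaded = [title for title, text in result.items() if text is not None]
print(loaded)
```

Storing `None` rather than the string `"None"` keeps failed pages easy to filter out later.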


In [None]:
page_contents=pd.DataFrame(page_content1,columns=["page_title","page_contents"])
page_contents

Unnamed: 0,page_title,page_contents
0,7 World Trade Center,[7 World Trade Center (7 WTC or WTC-7) refers ...
1,Acra (fortress),"[The Acra, The Acra (also spelled Akra, from A..."
2,Angkor Wat,"[Angkor Wat, This article contains Khmer text...."
3,Belton House,"Country house in Belton near Grantham, Lincoln..."
4,Benty Grange helmet,[The Benty Grange helmet is a boar-crested Ang...
5,Biblioteca Marciana,"[Marciana Library, The Marciana Library or Li..."
6,Blakeney Chapel,[Blakeney Chapel is a ruined building on the N...
7,Bodiam Castle,[Bodiam Castle (/ˈboʊdiəm/) is a 14th-century ...
8,Borobudur,"[Borobudur, also transcribed Barabudur (Indone..."
9,Bramall Hall,[Bramall Hall is a largely Tudor manor house i...


In [None]:
# Merge the fetched contents into articles_df as a new page_contents column.
articles_df = articles_df.merge(page_contents, how='left', on='page_title')
articles_df

Unnamed: 0,page_title,category,page_contents
0,7 World Trade Center,Architecture_and_archaeology,[7 World Trade Center (7 WTC or WTC-7) refers ...
1,Acra (fortress),Architecture_and_archaeology,"[The Acra, The Acra (also spelled Akra, from A..."
2,Angkor Wat,Architecture_and_archaeology,"[Angkor Wat, This article contains Khmer text...."
3,Belton House,Architecture_and_archaeology,"Country house in Belton near Grantham, Lincoln..."
4,Benty Grange helmet,Architecture_and_archaeology,[The Benty Grange helmet is a boar-crested Ang...
...,...,...,...
5271,John Whittle,Warfare_biographies,
5272,Maurice Wilder-Neligan,Warfare_biographies,
5273,Richard Williams (RAAF officer),Warfare_biographies,
5274,James Park Woods,Warfare_biographies,
