**PySDS Week 02 Day 02 v.1 - Exercise - File Types and Text Processing I**

Today we will work through some example regular expressions (yay) and some dataframe manipulation. Recall that we used the Canada Wikipedia page as an example. Today, you will be asked to read in several pages, compare them on a number of features in a dataframe, and report on what you found. Below is code that you can use to download a Wikipedia page as data.

In [2]:
import urllib.parse, urllib.request
import bs4 
import re
import pandas as pd

# You can set the page argument to the title of any existing Wikipedia page.

def getWikiPage(page="United Kingdom"): 
    '''Returns the XML found using special export of a Wikipedia page.'''
    
    # Here we use urllib.parse.quote to percent-encode spaces and special
    # characters so the title is safe to use in a URL. For example, spaces become %20.

    URL = "http://en.wikipedia.org/wiki/Special:Export/%s" % urllib.parse.quote(page)

    print(URL,"\n")

    req = urllib.request.Request( URL, headers={'User-Agent': 'OII SDS class 2018.1/Hogan'})
    infile = urllib.request.urlopen(req)

    return infile.read()

# Testing
data = getWikiPage('Nigeria')
soup = bs4.BeautifulSoup(data.decode('utf8'), "lxml")
print(soup.mediawiki.page.revision.id)


http://en.wikipedia.org/wiki/Special:Export/Nigeria 

<id>863881727</id>
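
To see what `urllib.parse.quote` does to a page title before it goes into the URL, here is a minimal sketch. The helper name `export_url` and the third title are assumptions made for illustration; the URL pattern is the one used in `getWikiPage`, and nothing is downloaded.

```python
from urllib.parse import quote

def export_url(page):
    """Build the Special:Export URL for a page title (hypothetical helper)."""
    return "http://en.wikipedia.org/wiki/Special:Export/%s" % quote(page)

# Titles with spaces or non-ASCII characters are percent-encoded;
# plain ASCII titles pass through unchanged.
for title in ["Nigeria", "South Africa", "Côte d'Ivoire"]:
    print(export_url(title))
```

Running this before fetching is a cheap way to check that a title with spaces or accents produces the URL you expect.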


In [3]:
# Now, select 10 countries and place them in a list. 
# These will be rows in a dataframe. 
# For each of the ten countries, 
# find the following features from parsing their wikipedia page: 
# 1. The number of internal wikilinks. 
# 2. The number of external wikilinks. 
# 3. The length of the page (in characters)
# 4. The population of the country. 
#   - This last one will be very tricky. It's okay if you cannot get the 
#     regex working, or if you have to build multiple regexes. 
#     Please simply document this. 

# Print the following: 
# The rank order of each of the columns. 
# For example, for wikilinks you might print 
# (note: the numbers below are not accurate)

# Table 1. Number of <Wikilinks>
# Canada        46
# Germany       45
# France        24
# Netherlands   12
# ...

# answer below here

countries = ['Nigeria', 'China', 'Zimbabwe', 'Kenya', 'South Africa', 'Ghana', 'Egypt', 'Tunisia', 'Togo', 'Senegal', 'Algeria']
# China was added later on by Patrick to demonstrate a point, so the list has eleven entries. I didn't remove it because I was worried I'd break something.

countries_stats = []
for c in countries:
    data = getWikiPage(c)
    soup = bs4.BeautifulSoup(data.decode('utf8'), "lxml")
    text_to_parse = soup.mediawiki.page.text
    re_inner_links = re.compile(r'\[\[.*?\]\]')
    inner_links = re_inner_links.findall(text_to_parse)
    re_outer_links = re.compile(r'https?://[\w\./?&=%]*')
    outer_links = re_outer_links.findall(text_to_parse)
    page_length = len(text_to_parse)  # page length in characters
    countries_stats.append([c, len(inner_links), len(outer_links), page_length])

countries_stats = pd.DataFrame(countries_stats, columns=['Country','Inner Links','Outer Links','Page Length'])
countries_final = countries_stats.set_index('Country')

# Reviewer's comments


http://en.wikipedia.org/wiki/Special:Export/Nigeria 

http://en.wikipedia.org/wiki/Special:Export/China 

http://en.wikipedia.org/wiki/Special:Export/Zimbabwe 

http://en.wikipedia.org/wiki/Special:Export/Kenya 

http://en.wikipedia.org/wiki/Special:Export/South%20Africa 

http://en.wikipedia.org/wiki/Special:Export/Ghana 

http://en.wikipedia.org/wiki/Special:Export/Egypt 

http://en.wikipedia.org/wiki/Special:Export/Tunisia 

http://en.wikipedia.org/wiki/Special:Export/Togo 

http://en.wikipedia.org/wiki/Special:Export/Senegal 

http://en.wikipedia.org/wiki/Special:Export/Algeria 
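
The two regular expressions used in the loop can be tried out on a short string before running them on whole pages. The sketch below uses a made-up wikitext fragment (the `sample` text and the example.org/example.com URLs are assumptions, not real page content) to show what each pattern captures.

```python
import re

# Made-up wikitext fragment for illustration only.
sample = (
    "'''Togo''' is a country in [[West Africa]], bordered by [[Ghana]] "
    "and [[Benin|the Republic of Benin]].<ref>{{cite web "
    "|url=https://example.org/togo?id=1 |title=Example}}</ref> "
    "See also http://example.com/page."
)

# Same patterns as in the loop above.
INTERNAL_LINK = re.compile(r'\[\[.*?\]\]')            # non-greedy: stops at the first ]]
EXTERNAL_LINK = re.compile(r'https?://[\w\./?&=%]*')  # stops at the first character outside the class

internal = INTERNAL_LINK.findall(sample)
external = EXTERNAL_LINK.findall(sample)

# Strip the brackets and any display text after '|' to get the link target.
link_targets = [link[2:-2].split('|')[0] for link in internal]

print(len(internal), link_targets)
print(len(external), external)
```

Two things are worth noticing. The external pattern keeps the trailing full stop in the last URL because `.` is in its character class. It would also cut a URL short at a hyphen or colon, since those are not in the class. Neither changes the count, only the captured text.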



In [3]:
display(countries_final['Inner Links'].sort_values())

Country
Togo             339
Senegal          448
Tunisia          521
Kenya            734
Algeria          787
Zimbabwe         821
South Africa     906
Egypt           1084
Nigeria         1112
Ghana           1174
China           1433
Name: Inner Links, dtype: int64

In [4]:
display(countries_final['Outer Links'].sort_values())

Country
Togo             59
Senegal          82
Algeria         177
Kenya           192
South Africa    228
Tunisia         235
Nigeria         266
Egypt           289
Ghana           322
Zimbabwe        324
China           582
Name: Outer Links, dtype: int64

In [5]:
display(countries_final['Page Length'].sort_values())

Country
Togo             51304
Senegal          68247
Tunisia         123617
Algeria         125603
Kenya           141087
South Africa    158132
Nigeria         181013
Ghana           184682
Zimbabwe        185456
Egypt           192202
China           279232
Name: Page Length, dtype: int64
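
The exercise's example table lists the largest value first, while `sort_values()` sorts ascending by default. A minimal sketch of the descending version, using four of the inner-link counts printed above (the table title is an assumption):

```python
import pandas as pd

# A few of the inner-link counts from the output above.
inner_links = pd.Series(
    {"Togo": 339, "Senegal": 448, "Ghana": 1174, "China": 1433},
    name="Inner Links",
)

# ascending=False puts the country with the most links first.
ranked = inner_links.sort_values(ascending=False)

print("Table 1. Number of internal wikilinks")
print(ranked.to_string())
```

`to_string()` prints just the labels and values, without the trailing `Name: ..., dtype: ...` line.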

In [6]:
# Now I attempt to extract the population details from the wiki pages, reusing the list 'countries'.
# Because both dataframes are built from the same list and indexed by country name, the population figures will line up with 'countries_final' when I add them.

pop_all = []
for c in countries:
    data = getWikiPage(c)
    soup = bs4.BeautifulSoup(data.decode('utf8'), "lxml")
    text_to_parse = soup.mediawiki.page.text
    re_pop_c = re.compile(r'population_census = [0-9,]+')
    re_pop_e = re.compile(r'population_estimate = [0-9,]+')
    pop_c = re_pop_c.findall(text_to_parse)
    pop_e = re_pop_e.findall(text_to_parse)
    pop_all.append([c, pop_c, pop_e])

# Note that after a couple of attempts I realised there were two population figures, estimate and census, so I've extracted them separately.
# Not all countries have both, though, so you will see some missing values along the way.

pop_all_df = pd.DataFrame(pop_all)
display(pop_all_df)
pop_all_df[1] = pop_all_df[1].apply(lambda x: ', '.join(x))
pop_all_df[2] = pop_all_df[2].apply(lambda x: ', '.join(x))

display(pop_all_df)

http://en.wikipedia.org/wiki/Special:Export/Nigeria 

http://en.wikipedia.org/wiki/Special:Export/China 

http://en.wikipedia.org/wiki/Special:Export/Zimbabwe 

http://en.wikipedia.org/wiki/Special:Export/Kenya 

http://en.wikipedia.org/wiki/Special:Export/South%20Africa 

http://en.wikipedia.org/wiki/Special:Export/Ghana 

http://en.wikipedia.org/wiki/Special:Export/Egypt 

http://en.wikipedia.org/wiki/Special:Export/Tunisia 

http://en.wikipedia.org/wiki/Special:Export/Togo 

http://en.wikipedia.org/wiki/Special:Export/Senegal 

http://en.wikipedia.org/wiki/Special:Export/Algeria 



,0,1,2
0,Nigeria,"[population_census = 140,431,790]","[population_estimate = 190,886,311]"
1,China,"[population_census = 1,339,724,852]",[]
2,Zimbabwe,"[population_census = 12,973,808]",[]
3,Kenya,"[population_census = 38,610,097]","[population_estimate = 49,125,325]"
4,South Africa,"[population_census = 51,770,560]","[population_estimate = 57,725,600]"
5,Ghana,"[population_census = 24,200,000]","[population_estimate = 28,308,301]"
6,Egypt,"[population_census = 94,798,827]",[]
7,Tunisia,[],"[population_estimate = 11,304,482]"
8,Togo,"[population_census = 6,337,000]","[population_estimate = 7,965,055]"
9,Senegal,"[population_census = 14,668,522]",[]


,0,1,2
0,Nigeria,"population_census = 140,431,790","population_estimate = 190,886,311"
1,China,"population_census = 1,339,724,852",
2,Zimbabwe,"population_census = 12,973,808",
3,Kenya,"population_census = 38,610,097","population_estimate = 49,125,325"
4,South Africa,"population_census = 51,770,560","population_estimate = 57,725,600"
5,Ghana,"population_census = 24,200,000","population_estimate = 28,308,301"
6,Egypt,"population_census = 94,798,827",
7,Tunisia,,"population_estimate = 11,304,482"
8,Togo,"population_census = 6,337,000","population_estimate = 7,965,055"
9,Senegal,"population_census = 14,668,522",
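
The patterns above need exactly one space on each side of `=`, so a differently spaced infobox line would be missed. As a hedged alternative, one pattern with capture groups can allow flexible spacing and return the number directly. The function name `extract_population` and the `sample` fragment are assumptions for illustration; the figures are Togo's from the output above.

```python
import re

# \s* tolerates any spacing around '='; the groups capture the field kind and the digits.
POP_FIELD = re.compile(r'population_(census|estimate)\s*=\s*([0-9][0-9,]*)')

def extract_population(wikitext):
    """Return a dict like {'census': int, 'estimate': int} for the fields found."""
    found = {}
    for kind, digits in POP_FIELD.findall(wikitext):
        # Keep the first match of each kind and drop the thousands separators.
        found.setdefault(kind, int(digits.replace(',', '')))
    return found

# Made-up infobox fragment with uneven spacing.
sample = (
    "| population_estimate = 7,965,055\n"
    "| population_census   = 6,337,000\n"
    "| population_census_year = 2010\n"
)

print(extract_population(sample))
print(extract_population("no infobox here"))
```

Fields such as `population_census_year` are skipped because the pattern requires `=` right after `census` or `estimate`, apart from whitespace. A page with neither field gives an empty dict, which maps to a missing value in the dataframe.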


In [7]:
# Cleaning my extract by removing all text and symbols and leaving only the numbers.
# The regex strips the field name, spaces, '=' and commas; pd.to_numeric then converts
# the remaining digits to numbers, with empty strings becoming NaN.

for col in [1, 2]:
    cleaned = pop_all_df[col].str.replace(r'[a-z_ =,]+', '', regex=True)
    pop_all_df[col] = pd.to_numeric(cleaned, errors='coerce')

# I set the column names so it makes some sense and index the dataframe to the name of the country.

pop_all_df.columns = ['Country', 'Population Census', 'Population Estimate']
pop_all_df = pop_all_df.set_index('Country')
display(pop_all_df)

Country,Population Census,Population Estimate
Nigeria,140431790.0,190886311.0
China,1339724852.0,
Zimbabwe,12973808.0,
Kenya,38610097.0,49125325.0
South Africa,51770560.0,57725600.0
Ghana,24200000.0,28308301.0
Egypt,94798827.0,
Tunisia,,11304482.0
Togo,6337000.0,7965055.0
Senegal,14320055.0,


In [8]:
# Finally, I now add my pop figures to my dataframe 'countries_final' 

countries_final['Population Census'] = pop_all_df['Population Census']
countries_final['Population Estimate'] = pop_all_df['Population Estimate']

# And voila! Presto chango, we have a new dataframe that includes them all. 
# As you will see, there are some missing values but I guess this was part of the challenge.

display(countries_final)

# Below are the sort print statements if you want to test them. At this point I don't know if I care.

# display(countries_final['Population Census'].sort_values())
# display(countries_final['Population Estimate'].sort_values())


Country,Inner Links,Outer Links,Page Length,Population Census,Population Estimate
Nigeria,1112,266,181013,140431790.0,190886311.0
China,1433,582,279232,1339724852.0,
Zimbabwe,821,324,185456,12973808.0,
Kenya,734,192,141087,38610097.0,49125325.0
South Africa,906,228,158132,51770560.0,57725600.0
Ghana,1174,322,184682,24200000.0,28308301.0
Egypt,1084,289,192202,94798827.0,
Tunisia,521,235,123617,,11304482.0
Togo,339,59,51304,6337000.0,7965055.0
Senegal,448,82,68247,14320055.0,
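
Adding the population columns works because pandas aligns on the index labels (the country names) when a Series is assigned to a dataframe column; row order does not matter. A minimal sketch with three countries, using figures from the outputs above and leaving Algeria without a census value on purpose:

```python
import pandas as pd

stats = pd.DataFrame(
    {"Inner Links": [339, 1174, 787]},
    index=pd.Index(["Togo", "Ghana", "Algeria"], name="Country"),
)

# Different order from `stats`, and no entry for Algeria.
census = pd.Series({"Ghana": 24200000, "Togo": 6337000}, name="Population Census")

# Assignment matches rows by country name; unmatched rows become NaN.
stats["Population Census"] = census
print(stats)
```

The NaN for Algeria also turns the whole column into floats, which is why the population figures in the tables above are shown with a trailing `.0`.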
