# Data Acquisition: Web Scraping

Data acquisition is a crucial step in developing an information retrieval system. Because the bulk of data, primarily textual, is available online, we should be familiar with extracting data from a site, either through an API or by scraping.

The practice in this notebook asks you to extract data from a wiki page.
The tasks are similar to those in the lab notebook; the only difference is that you have to extract two different tables into two separate data frames and then merge them.

**Activity 1:** Scrape the wiki page "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_sector_composition" to extract its content. Create a soup object using the Beautiful Soup library and save it in a variable called wiki_soup.

In [2]:
# Your code for activity 1 goes here..
#---------------------------------------

# Import the library used to query a website
import requests
# Import the Beautiful Soup library, which provides
# functions to parse the data returned from the website
from bs4 import BeautifulSoup

# Import pandas to convert lists to a data frame
import pandas as pd
# Import numpy
import numpy as np


# specify the url
url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_sector_composition"
# Request the URL and store the server's reply in the variable 'response'
response = requests.get(url)
# Parse the HTML in 'response' with Python's built-in parser and store the soup
wiki_soup = BeautifulSoup(response.text, "html.parser")



**Activity 2:** Extract the table "GDP from natural resources" from the soup and print it.

In [3]:
# Your code for activity 2 goes here..
#---------------------------------------

table_list=wiki_soup.find_all('table', class_='wikitable')
gdp_nat_resc = table_list[3]
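
Picking the table by position (`table_list[3]`) works only as long as the page keeps its tables in the same order. As a hedged alternative, the sketch below selects a table by the text of one of its headers. The helper name `find_table_by_header` and the tiny `sample_html` page are assumptions made for illustration; they are not part of the wiki page or of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the wiki page: two tables with different headers.
sample_html = """
<table class="wikitable"><tr><th>Country/Economy</th><th>Agriculture</th></tr>
<tr><td>Albania</td><td>22.9 %</td></tr></table>
<table class="wikitable"><tr><th>Country/Economy</th><th>Total natural resources<br/> (% of GDP)</th></tr>
<tr><td>Albania</td><td>5.1</td></tr></table>
"""

def find_table_by_header(soup, header_text):
    """Return the first wikitable with a <th> containing header_text, else None.

    Hypothetical helper: matching is case-insensitive and substring-based.
    """
    for table in soup.find_all("table", class_="wikitable"):
        headers = [th.get_text(" ", strip=True) for th in table.find_all("th")]
        if any(header_text.lower() in h.lower() for h in headers):
            return table
    return None

sample_soup = BeautifulSoup(sample_html, "html.parser")
resources_table = find_table_by_header(sample_soup, "Total natural resources")
# Second cell of the matched table
print(resources_table.find_all("td")[1].get_text(strip=True))
```

On the real page, the same helper could be called with `wiki_soup` in place of `sample_soup`, assuming the header wording has not changed.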




In [4]:
print(gdp_nat_resc)

<table class="wikitable sortable">
<tbody><tr>
<th>Country/Economy</th>
<th>Total natural resources<br/> (% of GDP)</th>
<th>Oil<br/> (% of GDP)</th>
<th>Natural gas<br/> (% of GDP)</th>
<th>Coal<br/> (% of GDP)</th>
<th>Mineral<br/> (% of GDP)</th>
<th>Forest<br/> (% of GDP)
</th></tr>
<tr>
<td><span class="flagicon"><img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Flag_of_Afghanistan_%282013%E2%80%932021%29.svg/23px-Flag_of_Afghanistan_%282013%E2%80%932021%29.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Flag_of_Afghanistan_%282013%E2%80%932021%29.svg/35px-Flag_of_Afghanistan_%282013%E2%80%932021%29.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Flag_of_Afghanistan_%282013%E2%80%932021%29.svg/45px-Flag_of_Afghanistan_%282013%E2%80%932021%29.svg.png 2x" width="23"/> </span><a href="/wiki/Afghanistan" title="Afghanistan">Afghanistan

**Activity 3:** Create a dataframe called "resources_df" from the extracted table. Use the same column names as the headings in the original wiki table, but turn them into valid identifiers. For example, use "country" instead of "country/economy".

In [5]:
# Your code for activity 3 goes here..
#---------------------------------------

# One list per output column
country_c = []
tnr_c = []
oil_c = []
ng_c = []
coal_c = []
mineral_c = []
forest_c = []

# Skip the header row
for row in gdp_nat_resc.find_all("tr")[1:]:
    tds = row.find_all("td")
    # Only data rows carry all seven cells
    if len(tds) >= 7:
        # The country name is the text of the link in the first cell
        country_c.append(tds[0].find("a").find(string=True))

        tnr_c.append(tds[1].find(string=True))
        oil_c.append(tds[2].find(string=True))
        ng_c.append(tds[3].find(string=True))
        coal_c.append(tds[4].find(string=True))
        mineral_c.append(tds[5].find(string=True))
        # Strip surrounding whitespace from the last cell
        forest_c.append(tds[6].find(string=True).strip())

resources_df = pd.DataFrame({'Country': country_c,
                             'TotalNaturalResources': tnr_c,
                             'Oil': oil_c,
                             'NaturalGas': ng_c,
                             'Coal': coal_c,
                             'Mineral': mineral_c,
                             'Forest': forest_c
                             })
resources_df.head()




Unnamed: 0,Country,TotalNaturalResources,Oil,NaturalGas,Coal,Mineral,Forest
0,Afghanistan,2.1,..,..,0,0.0,2.1
1,Albania,5.1,4.6,0,0,0.5,0.1
2,Algeria,26.3,19,7,0,0.3,0.1
3,Angola,46.6,46.3,0.1,..,0.0,0.2
4,Antigua and Barbuda,0.0,..,..,..,0.0,..
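
Activities 3 and 4 repeat the same row-by-row loop with different column lists. A minimal sketch of a reusable version follows, assuming every data row has exactly one cell per column. The function name `table_to_df` and the `sample_html` snippet are hypothetical. Note that it uses `get_text(strip=True)`, which collects all text in a cell, so it may behave differently from the activity code on cells that contain footnote markers.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Small stand-in table shaped like the wiki tables (header row + data rows).
sample_html = """
<table class="wikitable">
<tr><th>Country/Economy</th><th>Oil<br/> (% of GDP)</th><th>Forest<br/> (% of GDP)</th></tr>
<tr><td><a href="/wiki/Albania">Albania</a></td><td>4.6</td><td>0.1
</td></tr>
<tr><td><a href="/wiki/Angola">Angola</a></td><td>46.3</td><td>0.2
</td></tr>
</table>
"""

def table_to_df(table, column_names):
    """Turn a parsed <table> into a DataFrame of strings (hypothetical helper)."""
    rows = []
    # Skip the header row, as in the activity code
    for tr in table.find_all("tr")[1:]:
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        # Keep only rows that have one cell per requested column
        if len(cells) == len(column_names):
            rows.append(cells)
    return pd.DataFrame(rows, columns=column_names)

table = BeautifulSoup(sample_html, "html.parser").find("table")
sample_df = table_to_df(table, ["Country", "Oil", "Forest"])
print(sample_df)
```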


**Activity 4:** Extract the table "gdp per person employed(ppp) (2015) by sector" from the wiki page and create a dataframe called "gdp_percent" from it. Use the same column names as the headings in the original wiki table, but turn them into valid identifiers. For example, use "country" instead of "country/economy".

In [6]:
# Your code for activity 4 goes here..
#---------------------------------------
gdp_percent_TAB = table_list[5]

# One list per output column
country_c = []
agg_c = []
ind_c = []
serv_c = []
aggT_c = []
indT_c = []
servT_c = []

# Skip the header row
for row in gdp_percent_TAB.find_all("tr")[1:]:
    tds = row.find_all("td")
    # Only data rows carry all seven cells
    if len(tds) >= 7:
        c_links = tds[0].find_all("a")
        # The 'World' row has no country link in its first cell
        if len(c_links) < 1:
            country_c.append('World')
        else:
            country_c.append(c_links[0].find(string=True))

        agg_c.append(tds[1].find(string=True))
        ind_c.append(tds[2].find(string=True))
        serv_c.append(tds[3].find(string=True))
        aggT_c.append(tds[4].find(string=True))
        indT_c.append(tds[5].find(string=True))
        # Strip surrounding whitespace from the last cell
        servT_c.append(tds[6].find(string=True).strip())


gdp_percent = pd.DataFrame({
    'Country': country_c,
    'AggPercent': agg_c,
    'IndPercent': ind_c,
    'ServPercent': serv_c,
    'AggPercentOfTotal': aggT_c,
    'IndPercentOfTotal': indT_c,
    'ServPercentOfTotal': servT_c
})
gdp_percent.tail(10)





Unnamed: 0,Country,AggPercent,IndPercent,ServPercent,AggPercentOfTotal,IndPercentOfTotal,ServPercentOfTotal
150,United States,1.1 %,20 %,78.9 %,1.5 %,17.5 %,81 %
151,Uzbekistan,18.2 %,34.5 %,47.3 %,30.1 %,23.8 %,46.1 %
152,St. Vincent and the Grenadines,7.4 %,18.2 %,74.4 %,22.9 %,16.4 %,60.7 %
153,Vietnam,18.9 %,37 %,44.2 %,44 %,22.3 %,33.7 %
154,World,3.8 %,27.3 %,68.9 %,29.5 %,21.5 %,48.9 %
155,Samoa,9.5 %,24.3 %,66.2 %,5.3 %,14.5 %,80.2 %
156,Yemen,9.8 %,48.1 %,42.1 %,27.8 %,17 %,55.2 %
157,South Africa,2.3 %,29.2 %,68.5 %,6.2 %,26.4 %,67.4 %
158,Zambia,5.3 %,35.3 %,59.4 %,54.9 %,10.2 %,34.9 %
159,Zimbabwe,11.6 %,24.2 %,64.2 %,67.1 %,7.3 %,25.6 %
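
The scraped percentages are strings such as "1.1 %", so they cannot be averaged or compared numerically yet. One possible cleanup, shown here on a two-row stand-in built from the values above (the `sample` frame and `percent_cols` list are illustrative names), removes the percent sign and converts the rest with `pd.to_numeric`.

```python
import pandas as pd

# Two rows copied from the output above, kept as strings like the scraped data.
sample = pd.DataFrame({
    "Country": ["United States", "Vietnam"],
    "AggPercent": ["1.1 %", "18.9 %"],
    "IndPercent": ["20 %", "37 %"],
})

percent_cols = ["AggPercent", "IndPercent"]
for col in percent_cols:
    # Drop the literal "%" and surrounding spaces, then parse as a number.
    cleaned = sample[col].str.replace("%", "", regex=False).str.strip()
    # errors="coerce" turns anything unparseable into NaN instead of raising.
    sample[col] = pd.to_numeric(cleaned, errors="coerce")

print(sample.dtypes)
```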


**Activity 5:** Combine the dataframes resources_df and gdp_percent. Name the resulting dataframe combined_df.

In [7]:
# Your code for activity 5 goes here..
#---------------------------------------

combined_df = resources_df.merge(
                    gdp_percent, 
                    how='inner', 
                    on='Country')

combined_df.head()




Unnamed: 0,Country,TotalNaturalResources,Oil,NaturalGas,Coal,Mineral,Forest,AggPercent,IndPercent,ServPercent,AggPercentOfTotal,IndPercentOfTotal,ServPercentOfTotal
0,Afghanistan,2.1,..,..,0,0.0,2.1,21.4 %,22.9 %,55.7 %,61.6 %,9.9 %,28.5 %
1,Albania,5.1,4.6,0,0,0.5,0.1,22.9 %,24.2 %,53 %,42.3 %,18.1 %,39.6 %
2,Algeria,26.3,19,7,0,0.3,0.1,12.6 %,38.8 %,48.6 %,11.4 %,35.1 %,53.5 %
3,Argentina,6.1,4.1,1.2,0,0.8,0.1,6 %,28.1 %,65.9 %,2.1 %,24.7 %,73.3 %
4,Armenia,2.7,..,..,..,2.7,0.0,19.3 %,28.8 %,52 %,35.3 %,15.9 %,48.8 %
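
In the outputs above, Angola and Antigua and Barbuda appear in `resources_df.head()` but not in `combined_df.head()`, which suggests the inner join dropped them because their names have no exact match in `gdp_percent`. A quick way to see which countries fail to match is an outer merge with `indicator=True`. The sketch below uses two small stand-in frames (`left` and `right` are illustrative names) rather than the scraped data.

```python
import pandas as pd

# Stand-ins for resources_df and gdp_percent with one shared country.
left = pd.DataFrame({"Country": ["Albania", "Angola"], "Oil": ["4.6", "46.3"]})
right = pd.DataFrame({"Country": ["Albania", "Argentina"],
                      "AggPercent": ["22.9 %", "6 %"]})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only".
check = left.merge(right, how="outer", on="Country", indicator=True)

# Rows that an inner join would have dropped
unmatched = check[check["_merge"] != "both"]
print(unmatched[["Country", "_merge"]])
```

If the unmatched names turn out to be spelling variants of the same country, they could be harmonized before merging; whether that applies to this page is something to check against the actual output.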


**Activity 6:** Replace the invalid values ".." with NaN in combined_df.

In [9]:
help(combined_df.replace)

Help on method replace in module pandas.core.frame:

replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad') method of pandas.core.frame.DataFrame instance
    Replace values given in `to_replace` with `value`.
    
    Values of the DataFrame are replaced with other values dynamically.
    This differs from updating with ``.loc`` or ``.iloc``, which require
    you to specify a location to update with some value.
    
    Parameters
    ----------
    to_replace : str, regex, list, dict, Series, int, float, or None
        How to find the values that will be replaced.
    
        * numeric, str or regex:
    
            - numeric: numeric values equal to `to_replace` will be
              replaced with `value`
            - str: string exactly matching `to_replace` will be replaced
              with `value`
            - regex: regexs matching `to_replace` will be replaced with
              `value`
    
        * list of str, regex, or numeric:
 

In [10]:
# Your code for activity 6 goes here
#---------------------------------------

# np.nan is the canonical spelling; the np.NaN alias was removed in NumPy 2.0
combined_df.replace('..', np.nan, inplace=True)

combined_df.head()




Unnamed: 0,Country,TotalNaturalResources,Oil,NaturalGas,Coal,Mineral,Forest,AggPercent,IndPercent,ServPercent,AggPercentOfTotal,IndPercentOfTotal,ServPercentOfTotal
0,Afghanistan,2.1,,,0.0,0.0,2.1,21.4 %,22.9 %,55.7 %,61.6 %,9.9 %,28.5 %
1,Albania,5.1,4.6,0.0,0.0,0.5,0.1,22.9 %,24.2 %,53 %,42.3 %,18.1 %,39.6 %
2,Algeria,26.3,19.0,7.0,0.0,0.3,0.1,12.6 %,38.8 %,48.6 %,11.4 %,35.1 %,53.5 %
3,Argentina,6.1,4.1,1.2,0.0,0.8,0.1,6 %,28.1 %,65.9 %,2.1 %,24.7 %,73.3 %
4,Armenia,2.7,,,,2.7,0.0,19.3 %,28.8 %,52 %,35.3 %,15.9 %,48.8 %
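
Replacing ".." only swaps the marker. The remaining cells were scraped as text, so the resource columns may still hold strings rather than numbers. A minimal sketch of converting them, using a three-row stand-in built from the values above (`sample` and `numeric_cols` are illustrative names):

```python
import numpy as np
import pandas as pd

# Values copied from the tables above, kept as strings like the scraped data.
sample = pd.DataFrame({
    "Country": ["Afghanistan", "Albania", "Armenia"],
    "Oil": ["..", "4.6", ".."],
    "Coal": ["0", "0", ".."],
})

# Step 1: mark the placeholder as missing.
sample = sample.replace("..", np.nan)

# Step 2: convert the remaining strings to numbers, column by column.
numeric_cols = ["Oil", "Coal"]
sample[numeric_cols] = sample[numeric_cols].apply(pd.to_numeric, errors="coerce")

# Count of missing values per column
print(sample.isna().sum())
```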


**Activity 7:** How do you think the NaN values in the dataset should be handled? Should the rows with NaN values be deleted, or should the missing values be imputed with a statistic such as the mean or median? Give us your thoughts.
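
To ground the discussion, the sketch below compares the two options on a small stand-in frame built from values shown earlier (`sample`, `dropped`, and `imputed` are illustrative names). It is not a recommendation: which option is appropriate depends on whether ".." on the source page means "not reported" or something closer to zero, which the page data alone does not settle.

```python
import numpy as np
import pandas as pd

# Numeric stand-in with two missing Oil values.
sample = pd.DataFrame({
    "Country": ["Afghanistan", "Albania", "Algeria", "Armenia"],
    "Oil": [np.nan, 4.6, 19.0, np.nan],
    "Mineral": [0.0, 0.5, 0.3, 2.7],
})

# Share of missing values per column helps decide how costly deletion would be.
missing_share = sample[["Oil", "Mineral"]].isna().mean()

# Option 1: delete rows where Oil is missing (loses half of this sample).
dropped = sample.dropna(subset=["Oil"])

# Option 2: impute missing Oil values with the column median.
imputed = sample.assign(Oil=sample["Oil"].fillna(sample["Oil"].median()))

print(missing_share)
print(len(dropped), "rows kept after dropping")
print(imputed)
```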

# Save your notebook, then `File > Close and Halt`