# Denison DA210/CS181 Homework 3.f - Step 1

Before you turn this notebook in, make sure everything runs as expected: **restart the kernel** and then **run all cells**.

Make sure you fill in any place that says `# YOUR CODE HERE` or "YOUR ANSWER HERE".

---

In [1]:
import os
import io
import sys
import importlib
import pandas as pd
from lxml import etree

module_dir = os.path.join("..", "..", "modules")
module_path = os.path.abspath(module_dir)
if module_path not in sys.path:
    sys.path.append(module_path)

import util
importlib.reload(util)

datadir = "publicdata"

---

## Part A: Web scraping a table

We can either work with locally saved HTML documents, or download them from the web.  For this homework, we'll just work with already-acquired HTML documents.
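
As a rough sketch of the other route, HTML that is already in memory (for example, the text of a downloaded page) can be handed to `etree.parse` through a file-like object. The `html_text` string below is a hypothetical stand-in for downloaded content, not the actual page.

```python
import io
from lxml import etree

# Hypothetical page content; in practice this string would come from a
# download rather than a literal.
html_text = "<html><body><table><tr><td>CAN</td></tr></table></body></html>"

parser = etree.HTMLParser()

# etree.parse accepts a file-like object as well as a file path
tree = etree.parse(io.StringIO(html_text), parser)
root = tree.getroot()

print(root.tag)                    # html
print(root.xpath("//td/text()"))   # ['CAN']
```

Once the root is available, everything that follows works the same way whether the HTML came from a local file or from a string.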

First, we'll consider the `indicators2016` dataset represented as an HTML table within a web page: [http://datasystems.denison.edu/ind2016.html](http://datasystems.denison.edu/ind2016.html).  This HTML file is stored in `datadir`.

**Q1:** First, we need to do some discovery.  Use `etree` to parse the root of the HTML tree from `ind2016.html` into the variable `ind2016_root`.

In [24]:
htmlparser = etree.HTMLParser()

file = os.path.join(datadir, "ind2016.html")
htmltree = etree.parse(file, htmlparser)
ind2016_root = htmltree.getroot()



# Display a snippet of the file (using a util module provided with the textbook)
util.print_xml(ind2016_root, depth=3, nchild=3)

<html>
  <head>
    <meta charset='utf-8'></meta>
    <meta name='viewport' content='width=device-width, init
    <meta http-equiv='X-UA-Compatible' content='IE=edge'></
     ...
  </head>
  <body>
    <div class='wrapper'>
      <<cyfunction Comment at 0x7ff641d76040>>Page Content<
      <div id='content-no-side'>
      </div>
    </div>
    <script src='js/jquery-3.4.1.min.js'></script>
    <script src='js/popper.min.js'></script>
     ...
  </body>
</html>


**Q2:** Find all `<table>` nodes in the `ind2016` HTML tree.  Store the resulting list in a variable `ind2016_table_nodes`.

In [26]:
ind2016_table_nodes = ind2016_root.xpath("//table")

# Display the resulting list
ind2016_table_nodes

[<Element table at 0x7ff641df4c80>]

In [27]:
# Testing cell
assert type(ind2016_table_nodes) is list
assert len(ind2016_table_nodes) == 1
assert ind2016_table_nodes[0].tag == "table"

**Q3:** The previous question should have resulted in a list of only one node.  From this node, use XPath or XML procedural operations to retrieve the column names in the table.  Store this list in a variable `ind2016_columns`.

In [29]:
ind2016_table_node = ind2016_table_nodes[0]
util.print_xml(ind2016_table_node, depth=3, nchild=3)

# YOUR CODE HERE
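# Start the path with "." so the search stays inside this table node;
# a leading "//" would search the whole document instead
ind2016_columns = ind2016_table_node.xpath(".//thead/tr/th/text()")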

# Display the resulting list
ind2016_columns

<table class='table table-dark' style='width: 600px'>
  <thead>
    <tr>
      <th title='Field #1'>code</th>
      <th title='Field #2'>country</th>
      <th title='Field #3' class='text-right'>pop</th>
       ...
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>CAN</td>
      <td>Canada</td>
      <td align='right'>36.26</td>
       ...
    </tr>
    <tr>
      <td>CHN</td>
      <td>China</td>
      <td align='right'>1378.66</td>
       ...
    </tr>
    <tr>
      <td>IND</td>
      <td>India</td>
      <td align='right'>1324.17</td>
       ...
    </tr>
     ...
  </tbody>
</table>


['code', 'country', 'pop', 'gdp', 'life', 'cell']

In [30]:
# Testing cell
assert type(ind2016_columns) is list
assert len(ind2016_columns) == 6
assert "code" in ind2016_columns
assert "life" in ind2016_columns
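
The question also allows procedural operations instead of XPath. The sketch below shows that style on a small, hypothetical table snippet. It uses the standard library's `xml.etree.ElementTree` as a stand-in so that it runs without lxml; lxml elements offer the same `find` method and child iteration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet shaped like the table shown above
snippet = """
<table>
  <thead>
    <tr><th>code</th><th>country</th><th>pop</th></tr>
  </thead>
  <tbody>
    <tr><td>CAN</td><td>Canada</td><td>36.26</td></tr>
  </tbody>
</table>
"""

table = ET.fromstring(snippet)

# Walk down one level at a time: table -> thead -> tr
header_row = table.find("thead").find("tr")

# Iterating over an element yields its children, here the <th> nodes
columns = [th.text for th in header_row]

print(columns)   # ['code', 'country', 'pop']
```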

**Q4:** One way to process the data in the table is to read the text of all data cells and then group them into a list of lists (LoL), assuming every row has the same number of cells.

Modify the following code to retrieve the text of all data cells in the table, stored in a variable `ind2016_td_text`.

In [31]:
# YOUR CODE HERE
# Relative path (leading ".") keeps the search inside this table node
ind2016_td_text = ind2016_table_node.xpath(".//tbody/tr/td/text()")

ind2016_LoL = []
colNum = 0
for text in ind2016_td_text:
    if colNum == 0:
        row = []

    # Add the text to the current row
    row.append(text)

    # If this was the last element in the row, add it to the LoL,
    # otherwise increment the column number
    if colNum == len(ind2016_columns)-1:
        ind2016_LoL.append(row)
        colNum = 0
    else:
        colNum += 1

# Print a subset of the resulting LoL
util.print_data(ind2016_LoL, nlines=20)

[
  [
    "CAN",
    "Canada",
    "36.26",
    "1535.77",
    "82.3",
    "30.75"
  ],
  [
    "CHN",
    "China",
    "1378.66",
    "11199.15",
    "76.25",
    "1364.93"
  ],
  [
    "IND",
    "India",


In [32]:
# Debugging cell - try to create a dataframe
pd.DataFrame(ind2016_LoL, columns=ind2016_columns)

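|   | code | country | pop | gdp | life | cell |
|---|------|---------|-----|-----|------|------|
| 0 | CAN | Canada | 36.26 | 1535.77 | 82.3 | 30.75 |
| 1 | CHN | China | 1378.66 | 11199.15 | 76.25 | 1364.93 |
| 2 | IND | India | 1324.17 | 2263.79 | 68.56 | 1127.81 |
| 3 | RUS | Russia | 144.34 | 1283.16 | 71.59 | 229.13 |
| 4 | USA | United States | 323.13 | 18624.47 | 78.69 | 395.88 |
| 5 | VNM | Vietnam | 94.57 | 205.28 | 76.25 | 120.6 |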


In [33]:
# Testing cell
assert type(ind2016_td_text) is list
assert len(ind2016_td_text) == 36
assert len(ind2016_LoL) == 6
assert ind2016_LoL[0][0] == "CAN"
assert ind2016_LoL[2][4] == "68.56"
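
The grouping step can also be written with list slicing instead of a running column counter. This is only a sketch of an alternative; `group_rows` is a hypothetical helper name and the sample cells are a shortened version of the table.

```python
def group_rows(flat, ncols):
    """Split a flat list of cell texts into rows of ncols cells each."""
    if len(flat) % ncols != 0:
        raise ValueError("cell count is not a multiple of the column count")
    # Take consecutive slices of length ncols
    return [flat[i:i + ncols] for i in range(0, len(flat), ncols)]

cells = ["CAN", "Canada", "36.26", "CHN", "China", "1378.66"]
rows = group_rows(cells, 3)

print(rows)   # [['CAN', 'Canada', '36.26'], ['CHN', 'China', '1378.66']]
```

The explicit length check makes the "same number of cells in each row" assumption fail loudly rather than silently dropping a partial row.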

**Q5:** Alternatively, we could use XPath to create a dictionary-of-lists (DoL) representation of this data table.

Either procedurally or using XPath, get the text for each cell in the `pop` column (note the _position_ of this column in the table).  Store the values of this column _as `float`s_ in a variable `ind2016_pop_values`.

In [39]:
# YOUR CODE HERE
# td[3] selects the third cell of each row, which is the pop column;
# the leading "." keeps the search inside this table node
ind2016_pop_text = ind2016_table_node.xpath(".//tbody/tr/td[3]/text()")
ind2016_pop_values = [float(text) for text in ind2016_pop_text]

# Display the list of population values
ind2016_pop_values

[36.26, 1378.66, 1324.17, 144.34, 323.13, 94.57]

In [40]:
# Testing cell
assert type(ind2016_pop_values) is list
assert len(ind2016_pop_values) == 6
assert type(ind2016_pop_values[0]) is float
assert ind2016_pop_values[0] == 36.26
assert ind2016_pop_values[-1] == 94.57
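
One detail worth seeing in isolation: when an XPath query is run from an element in lxml, a path that starts with `//` still searches the whole document, while `.//` restricts the search to that element's subtree. The two-table page below is hypothetical and exists only to make the difference visible; `td[3]` is shorthand for `td[position()=3]`.

```python
from lxml import etree

# Hypothetical page with two tables
html_text = """
<html><body>
  <table id="first"><tbody>
    <tr><td>A</td><td>B</td><td>1.5</td></tr>
  </tbody></table>
  <table id="second"><tbody>
    <tr><td>C</td><td>D</td><td>2.5</td></tr>
  </tbody></table>
</body></html>
"""

root = etree.fromstring(html_text, etree.HTMLParser())
second = root.xpath("//table[@id='second']")[0]

# Absolute path: evaluated from the document root, so both tables match
absolute = second.xpath("//tbody/tr/td[3]/text()")

# Relative path: only cells inside the second table match
relative = second.xpath(".//tbody/tr/td[3]/text()")

print(absolute)                        # ['1.5', '2.5']
print([float(t) for t in relative])    # [2.5]
```

With a single table on the page the two forms return the same cells, which is why the difference is easy to miss.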

**Q6:** Complete the process started in the previous question.  For each of the columns in the table, extract the values for that column.  Store the data in a DoL named `ind2016_DoL`.  You can do this either procedurally or using XPath.

Note: For all columns containing numerical data, you must convert those values to `float`s.

In [41]:
# YOUR CODE HERE
ind2016_DoL = {}

# Column i of the table (1-based) is the i-th <td> of every row.
# The leading "." keeps each search inside this table node.
for position, column in enumerate(ind2016_columns, start=1):
    values = ind2016_table_node.xpath(f".//tbody/tr/td[{position}]/text()")
    # code and country hold strings; every other column is numeric
    if column not in ("code", "country"):
        values = [float(value) for value in values]
    ind2016_DoL[column] = values

# Display the DoL
ind2016_DoL

{'code': ['CAN', 'CHN', 'IND', 'RUS', 'USA', 'VNM'],
 'country': ['Canada', 'China', 'India', 'Russia', 'United States', 'Vietnam'],
 'pop': [36.26, 1378.66, 1324.17, 144.34, 323.13, 94.57],
 'gdp': [1535.77, 11199.15, 2263.79, 1283.16, 18624.47, 205.28],
 'life': [82.3, 76.25, 68.56, 71.59, 78.69, 76.25],
 'cell': [30.75, 1364.93, 1127.81, 229.13, 395.88, 120.6]}

In [42]:
# Debugging cell - try to create a dataframe
pd.DataFrame(ind2016_DoL)

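|   | code | country | pop | gdp | life | cell |
|---|------|---------|-----|-----|------|------|
| 0 | CAN | Canada | 36.26 | 1535.77 | 82.3 | 30.75 |
| 1 | CHN | China | 1378.66 | 11199.15 | 76.25 | 1364.93 |
| 2 | IND | India | 1324.17 | 2263.79 | 68.56 | 1127.81 |
| 3 | RUS | Russia | 144.34 | 1283.16 | 71.59 | 229.13 |
| 4 | USA | United States | 323.13 | 18624.47 | 78.69 | 395.88 |
| 5 | VNM | Vietnam | 94.57 | 205.28 | 76.25 | 120.6 |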


In [43]:
# Testing cell
assert len(ind2016_DoL) == 6
assert ind2016_DoL["code"][0] == "CAN"
assert ind2016_DoL["life"][2] == 68.56
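
For comparison, a DoL can also be derived from a LoL without touching the HTML tree again. The sketch below assumes the rows and column names are already available as plain Python lists; `to_dol` and the `numeric` set are hypothetical names introduced for illustration.

```python
def to_dol(rows, columns, numeric):
    """Turn a list of rows into a dictionary mapping column name to values."""
    dol = {}
    for j, name in enumerate(columns):
        # Collect the j-th cell of every row
        values = [row[j] for row in rows]
        if name in numeric:
            values = [float(v) for v in values]
        dol[name] = values
    return dol

columns = ["code", "country", "pop"]
rows = [["CAN", "Canada", "36.26"], ["CHN", "China", "1378.66"]]

dol = to_dol(rows, columns, numeric={"pop"})
print(dol)
# {'code': ['CAN', 'CHN'], 'country': ['Canada', 'China'], 'pop': [36.26, 1378.66]}
```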

---

## Part B: Web scraping nested lists

Next, we'll consider the `indicators0` dataset represented as a set of nested lists within a web page: [http://datasystems.denison.edu/ind0.html](http://datasystems.denison.edu/ind0.html).  This HTML file is stored in `datadir`.

**Q7:** Once again, discovery is our first step.  Use `etree` to parse the root of the HTML tree from `ind0.html` into the variable `ind0_root`.

In [45]:
# YOUR CODE HERE
htmlparser = etree.HTMLParser()

file = os.path.join(datadir, "ind0.html")
htmltree = etree.parse(file, htmlparser)
ind0_root = htmltree.getroot()

# Display a snippet of the file (using a util module provided with the textbook)
util.print_xml(ind0_root, depth=3, nchild=3)

<html>
  <head>
    <meta charset='utf-8'></meta>
    <meta name='viewport' content='width=device-width, init
    <meta http-equiv='X-UA-Compatible' content='IE=edge'></
     ...
  </head>
  <body>
    <div class='wrapper'>
      <<cyfunction Comment at 0x7ff641d76040>>Page Content<
      <div id='content-no-side'>
      </div>
    </div>
    <script src='js/jquery-3.4.1.min.js'></script>
    <script src='js/popper.min.js'></script>
     ...
  </body>
</html>


**Q8:** This webpage was created by a tool, so it has a lot going on (e.g., due to formatting) between the `<body>` node and the nested lists.  Use XPath or procedural operations to find the top-level HTML unordered-list element representing the indicators data, and store that node in the variable `ind0_list_node`.

In [53]:
# YOUR CODE HERE
# Index 1 picks the second <ul> in document order
ind0_list_node = ind0_root.xpath("//ul")[1]

# You should get a single node with tag "ul"
util.print_xml(ind0_list_node, depth=4, nchild=3, nlines=18)

<ul>
  <li>FRA
    <ul>
      <li>2007
        <ul>
        </ul>
      </li>
      <li>2017
        <ul>
        </ul>
      </li>
    </ul>
  </li>
  <li>GBR
    <ul>
      <li>
        <span ...>2007</span>
        <ul>


In [54]:
# Testing cell
assert type(ind0_list_node) is etree._Element
assert ind0_list_node.tag == "ul"
assert len(ind0_list_node) == 3
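
To get a feel for the nesting before extracting values, it can help to walk the list procedurally. The sketch below runs on a simplified, hypothetical snippet and uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml; the `findall` and `.text` calls behave the same way on lxml elements. Note that `.text` holds only the text before an element's first child, and that the real page wraps some labels in `<span>`, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical version of the nested lists
snippet = """
<ul>
  <li>FRA
    <ul>
      <li>2007</li>
      <li>2017</li>
    </ul>
  </li>
  <li>GBR
    <ul>
      <li>2007</li>
    </ul>
  </li>
</ul>
"""

top = ET.fromstring(snippet)

outline = {}
for country_li in top.findall("li"):
    # The country code is the text that precedes the nested <ul>
    code = country_li.text.strip()
    # Year items sit one <ul> level below the country item
    years = [year_li.text.strip() for year_li in country_li.findall("ul/li")]
    outline[code] = years

print(outline)   # {'FRA': ['2007', '2017'], 'GBR': ['2007']}
```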

**Q9:** The subtree for `FRA` is fairly straightforward.  Use XPath or XML procedural operations to construct a row dictionary for `FRA` with columns (keys) `code`, `pop2007`, `gdp2007`, `pop2017`, and `gdp2017`.  Store your dictionary in a variable `FRA_rowD`.

Be sure to convert all numerical values to `float`s.

In [76]:
# YOUR CODE HERE
def fra_year_values(year):
    """Return the numeric values listed under the given year for FRA."""
    texts = ind0_list_node.xpath(f"./li[text()='FRA']//li[text()='{year}']/*/*/text()")
    # The number follows the first space in each text item
    return [float(text[text.find(" "):]) for text in texts]

fra_2007 = fra_year_values("2007")
fra_2017 = fra_year_values("2017")

FRA_rowD = {
    "code": "FRA",
    "pop2007": fra_2007[0],
    "gdp2007": fra_2007[1],
    "pop2017": fra_2017[0],
    "gdp2017": fra_2017[1],
}

# Display the resulting data row dictionary
#print(FRA_rowD)

In [77]:
# Testing cell
assert type(FRA_rowD) is dict
assert len(FRA_rowD) == 5
assert FRA_rowD["code"] == "FRA"
assert FRA_rowD["pop2007"] == 64.02
assert FRA_rowD["gdp2017"] == 2586.29
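
The conversion from a labelled text item to a `float` can be isolated in a small helper. The sketch below assumes each item looks like a label, a space, and then a number; the sample strings, the `XXX` code, and the name `value_after_label` are all hypothetical and do not come from the actual page.

```python
def value_after_label(text):
    """Return the number that follows the first space in text as a float."""
    label, sep, number = text.strip().partition(" ")
    if not sep:
        raise ValueError(f"no space-separated value in {text!r}")
    return float(number)

# Made-up items in the assumed "label value" shape
texts_2007 = ["pop: 1.5", "gdp: 20.25"]

pop, gdp = (value_after_label(t) for t in texts_2007)
row = {"code": "XXX", "pop2007": pop, "gdp2007": gdp}

print(row)   # {'code': 'XXX', 'pop2007': 1.5, 'gdp2007': 20.25}
```

Raising an error when no space is found makes an unexpected item format visible immediately instead of producing a wrong number.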

---

## Part C

**Q10:** How much time (in minutes/hours) did you spend on this homework assignment?

30 minutes

**Q11:** Who was your partner for this assignment?  If you worked alone, say so instead.

Alone