In [1]:
!jq --help


jq - commandline JSON processor [version 1.3]
Usage: jq [options] <jq filter> [file...]

For a description of the command line options and
how to write jq filters (and why you might want to)
see the jq manpage, or the online documentation at
http://stedolan.github.com/jq



In [2]:
!curl -sL 'http://en.wikipedia.org/wiki/List_of_countries_and_territories_by_border/area_ratio' > wiki.html

In [3]:
!ls wiki.html

wiki.html


#### Display the first 79 characters of each of the first 10 lines

In [4]:
!head -n 10 wiki.html | cut -c1-79

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>List of countries and territories by border/area ratio - Wikipedia, the 
<script>document.documentElement.className = document.documentElement.className
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonical
mw.user.tokens.set({"editToken":"+\\","patrolToken":"+\\","watchToken":"+\\","c

});mw.loader.load(["mediawiki.page.startup","mediawiki.legacy.wikibits","ext.ce


#### Use grep to examine the table with class=wikitable. The -A option sets how many lines to print after each match.

In [5]:
!< wiki.html grep wikitable -A 21 

<table class="wikitable sortable">
<tr>
<th>Rank</th>
<th>Country or territory</th>
<th>Total length of land borders (km)</th>
<th>Total surface area (km²)</th>
<th>Border/area ratio (km/km²)</th>
</tr>
<tr>
<td>1</td>
<td>Vatican City</td>
<td>3.2</td>
<td>0.44</td>
<td>7.2727273</td>
</tr>
<tr>
<td>2</td>
<td>Monaco</td>
<td>4.4</td>
<td>2</td>
<td>2.2000000</td>
</tr>
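To make the -A behavior concrete, here is a minimal Python sketch that mimics it on a list of lines. The function name `grep_after` and the sample lines are hypothetical. It matches plain substrings rather than regular expressions, and it does not print the `--` separator that grep puts between non-adjacent groups.

```python
def grep_after(lines, needle, after):
    """Return each line containing `needle` plus the `after` lines that follow it."""
    keep = set()
    for i, line in enumerate(lines):
        if needle in line:
            # Mark the matching line and the next `after` lines.
            keep.update(range(i, min(i + after + 1, len(lines))))
    return [lines[i] for i in sorted(keep)]


# Hypothetical sample, shaped like the start of the table above.
sample = [
    "<p>intro</p>",
    '<table class="wikitable sortable">',
    "<tr>",
    "<th>Rank</th>",
    "</tr>",
]
print(grep_after(sample, "wikitable", 2))
```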


#### Scrape elements

#### To install scrape: git clone https://github.com/bjpcjp/data-science-at-the-command-line

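The -e argument takes a CSS selector, and -b wraps the output in html and body tags. The selector below keeps every tr that is a direct child of table.wikitable except the first one, which is the header row.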

In [6]:
! < wiki.html coderepo/tools/scrape -b -e \
'table.wikitable > tr:not(:first-child)' \
> table.html

In [7]:
!head -n 21 table.html

<!DOCTYPE html>
<html>
<body>
<tr>
<td>1</td>
<td>Vatican City</td>
<td>3.2</td>
<td>0.44</td>
<td>7.2727273</td>
</tr>

<tr>
<td>2</td>
<td>Monaco</td>
<td>4.4</td>
<td>2</td>
<td>2.2000000</td>
</tr>

<tr>
<td>3</td>
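If the scrape tool is not available, the same row extraction can be approximated with Python's standard `html.parser` module. This is only a sketch under stated assumptions: the class name `RowExtractor` and the inline sample are hypothetical, and instead of evaluating the CSS selector it skips rows that contain no td cells, which gives the same result for this table because the header row uses th cells only.

```python
from html.parser import HTMLParser


class RowExtractor(HTMLParser):
    """Collect the text of the <td> cells in each <tr> of table.wikitable."""

    def __init__(self):
        super().__init__()
        self.in_table = False
        self.in_cell = False
        self.row = None   # cells of the row being read, or None outside a row
        self.rows = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table" and "wikitable" in classes:
            self.in_table = True
        elif self.in_table and tag == "tr":
            self.row = []
        elif self.row is not None and tag == "td":
            self.in_cell = True
            self.row.append("")

    def handle_data(self, data):
        if self.in_cell:
            self.row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "tr" and self.row is not None:
            # Header rows hold only <th> cells, so they stay empty and are skipped.
            if self.row:
                self.rows.append(self.row)
            self.row = None
        elif tag == "table":
            self.in_table = False


# Hypothetical sample with the same shape as the Wikipedia table.
sample = """
<table class="wikitable sortable">
<tr><th>Rank</th><th>Country or territory</th></tr>
<tr><td>1</td><td>Vatican City</td></tr>
<tr><td>2</td><td>Monaco</td></tr>
</table>
"""
parser = RowExtractor()
parser.feed(sample)
print(parser.rows)
```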


Now we have a dataset in which each row is a country or territory. The next step is to convert it to JSON with xml2json. To install xml2json: git clone https://github.com/Inist-CNRS/node-xml2json-command.git


In [8]:
!< table.html xml2json > table.json

In [9]:
!< table.json jq '.' | head -n 10

{
  "html": {
    "body": {
      "tr": [
        {
          "td": [
            {
              "$t": "1"
            },
            {


In [10]:
!< table.json jq -c '.html.body.tr[] | {country: .td[1][],border:'\
'.td[2][], surface: .td[3][]}' > countries.json

In [11]:
! head -n 10 countries.json

{"surface":"0.44","border":"3.2","country":"Vatican City"}
{"surface":"2","border":"4.4","country":"Monaco"}
{"surface":"61","border":"39","country":"San Marino"}
{"surface":"160","border":"76","country":"Liechtenstein"}
{"surface":"34","border":"10.2","country":"Sint Maarten (Netherlands)"}
{"surface":"468","border":"120.3","country":"Andorra"}
{"surface":"6","border":"1.2","country":"Gibraltar (United Kingdom)"}
{"surface":"54","border":"10.2","country":"Saint Martin (France)"}
{"surface":"2586","border":"359","country":"Luxembourg"}
{"surface":"6220","border":"466","country":"Palestinian territories"}
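The jq filter walks .html.body.tr and, for each row, pulls the value out of the second, third, and fourth td objects. The same reshaping can be sketched in Python. The inline document is a hand-made stand-in for table.json that follows the nesting shown by `jq '.'`, not the real file, and the sketch assumes each td object has a single "$t" key (jq's `.td[1][]` would emit every value in the object).

```python
import json

# Hand-made stand-in for table.json (same nesting as the xml2json output).
doc = json.loads("""
{"html": {"body": {"tr": [
  {"td": [{"$t": "1"}, {"$t": "Vatican City"}, {"$t": "3.2"},
          {"$t": "0.44"}, {"$t": "7.2727273"}]},
  {"td": [{"$t": "2"}, {"$t": "Monaco"}, {"$t": "4.4"},
          {"$t": "2"}, {"$t": "2.2000000"}]}
]}}}
""")


def to_record(row):
    """Mirror {country: .td[1][], border: .td[2][], surface: .td[3][]}."""
    cells = [cell["$t"] for cell in row["td"]]
    return {"country": cells[1], "border": cells[2], "surface": cells[3]}


records = [to_record(row) for row in doc["html"]["body"]["tr"]]
for record in records:
    # One compact JSON object per line, similar to jq -c.
    print(json.dumps(record, separators=(",", ":")))
```

The key order in the printed objects may differ from the jq output above; the values are the same.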


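#### Install json2csv with 'go get github.com/jehiah/json2csv' (an import path, without the https:// prefix). You'll need to set GOPATH and add $GOPATH/bin to PATH, otherwise the shell cannot find the json2csv binary.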

In [12]:
!< countries.json json2csv -p -k border,surface > countries.csv

/bin/sh: 1: json2csv: not found
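Since json2csv was not found here, one fallback is to do the conversion with Python's standard `json` and `csv` modules. This is a sketch, not a replacement for json2csv: the function name `ndjson_to_csv` and the in-memory sample are hypothetical, and it assumes one JSON object per line, as in countries.json. The header row and column choice mirror what -p and -k border,surface ask for.

```python
import csv
import io
import json


def ndjson_to_csv(lines, keys, out):
    """Write one CSV row per JSON line, keeping only `keys`, after a header row."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(keys)
    for line in lines:
        if line.strip():  # skip blank lines
            record = json.loads(line)
            writer.writerow([record.get(key, "") for key in keys])


# Two lines copied from the countries.json listing above.
sample = [
    '{"surface":"0.44","border":"3.2","country":"Vatican City"}',
    '{"surface":"2","border":"4.4","country":"Monaco"}',
]
buffer = io.StringIO()
ndjson_to_csv(sample, ["border", "surface"], buffer)
print(buffer.getvalue(), end="")
```

To run it on the real data, pass an open countries.json file as `lines` and a file opened with `open("countries.csv", "w", newline="")` as `out`.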
