What's up blockchain fam. Hopefully this helps - it's an example of scraping data with Python.

In [1]:
# Import libraries we need
import requests                 # allows us to dl a webpage
from bs4 import BeautifulSoup   # allows us to scrape html

In [2]:
# site to scrape
url = "https://www.walletexplorer.com/wallet/23c39bc811da0d39?from_address=1MMaU5nTrFdPZotfwdbv1wWnFjLCTFbpPY"
response = requests.get(url)      # download the page
sitecontent = response.content    # raw html of the page

Now the html is stored inside of `sitecontent`. That's kind of helpful.
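
One thing to watch: if the request goes wrong (bad url, site down, rate limit), `requests.get` can still hand back a response that just contains an error page. Here's a rough sketch of a slightly more defensive fetch. The `fetch_html` helper and the 10 second timeout are my own picks for illustration, not anything the site requires.

```python
import requests

def fetch_html(url, timeout=10):
    """Download a page and return its raw html bytes (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)  # give up if the server hangs
    response.raise_for_status()                    # raise an error on 4xx/5xx status codes
    return response.content

# usage (not run here, it needs network access):
# sitecontent = fetch_html("https://www.walletexplorer.com/wallet/23c39bc811da0d39")
```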

To actually do stuff with the data, though, we'll need a way to parse through it.
Enter beautifulsoup (ayyee):


In [4]:
# Use BeautifulSoup to make obtained website workable.
content = BeautifulSoup(sitecontent, 'html.parser')

# View original html hierarchy
print(content.prettify())

<!DOCTYPE html>
<html lang="en">
 <meta charset="utf-8"/>
 <title>
  Wallet 23c39bc811da0d39 [WalletExplorer.com]
 </title>
 <link href="/styles.css" rel="stylesheet">
  <meta content="always" name="referrer">
   <div id="topbar">
    <h1>
     <a href="/">
      WalletExplorer.com
     </a>
     : smart Bitcoin block explorer
    </h1>
    <form action="/">
     <input name="q" type="text" value=""/>
     <input type="submit" value="Search address/txid/wallet id/firstbits"/>
    </form>
   </div>
   <div id="main">
    <h2>
     Wallet
     <span class="walletcolor" style="background-color: #da0d39">
     </span>
     [23c39bc811]
     <span class="showother">
      (
      <a href="/wallet/23c39bc811da0d39/addresses">
       show wallet addresses
      </a>
      )
     </span>
    </h2>
    <!-- 23c39bc811da0d39 -->
    <p class="note">
     Displaying wallet
     <span class="walletcolor" style="background-color: #da0d39">
     </span>
     [23c39bc811], of which part is address 1M

## Now we can get to extracting elements. 

If you haven't used html, it's pretty simple. The tags are nested within each other and stand for different things.

To find where a specific element is nested in the webpage, highlight it on the page, right-click the highlighted spot, and click *inspect*.


For the url you gave me, the table is stored inside of

`<table class="txs"> ... </table>`

Let's start grabbing stuff

## Find first element

To find the first occurrence of an element, say the `<p>` tag, use the `.find()` method

Note: Output is a single element (a BeautifulSoup `Tag`), not a Python list. If nothing matches, you get `None`.

In [12]:
# find first paragraph
content.find('p')

<p class="note">Displaying wallet <span class="walletcolor" style="background-color: #da0d39"></span>[23c39bc811], of which part is address 1MMaU5nTrFdPZotfwdbv1wWnFjLCTFbpPY. <a href="/address/1MMaU5nTrFdPZotfwdbv1wWnFjLCTFbpPY">Show only address 1MMaU5nTrFdPZotfwdbv1wWnFjLCTFbpPY</a></p>

In [16]:
# Find first div
content.find('div')

<div id="topbar">
<h1><a href="/">WalletExplorer.com</a>: smart Bitcoin block explorer</h1>
<form action="/">
<input name="q" type="text" value=""/>
<input type="submit" value="Search address/txid/wallet id/firstbits"/>
</form>
</div>
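
Quick sketch of what `.find()` hands back, using a tiny made-up page so it runs without hitting the site (the html string here is hypothetical, not from walletexplorer):

```python
from bs4 import BeautifulSoup

# Hypothetical mini-page, just for illustration
soup = BeautifulSoup(
    '<div id="main"><p class="note">hello</p><p>bye</p></div>',
    "html.parser",
)

first_p = soup.find("p")       # a single Tag (the first <p>), not a list
print(first_p.get_text())      # text inside that first <p>
print(first_p["class"])        # the class attribute comes back as a list

missing = soup.find("table")   # there is no <table> here, so this is None
if missing is None:
    print("no table found")
```

Checking for `None` before calling methods on the result saves you from an `AttributeError` when the page layout changes.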

## Find all elements of a certain tag

Use `find_all()`, which returns a list of every matching tag

In [19]:
# Find all links
content.find_all('a')

[<a href="/">WalletExplorer.com</a>,
 <a href="/wallet/23c39bc811da0d39/addresses">show wallet addresses</a>,
 <a href="/address/1MMaU5nTrFdPZotfwdbv1wWnFjLCTFbpPY">Show only address 1MMaU5nTrFdPZotfwdbv1wWnFjLCTFbpPY</a>,
 <a href="/wallet/23c39bc811da0d39?format=csv">Download as CSV</a>,
 <a href="/wallet/b4a74320e7f069a2"><span class="walletcolor" style="background-color: #f069a2"></span>[b4a74320e7]</a>,
 <a href="/txid/2c44195149bbc0cf9f8aaeb5e434b899611e7e161e0446ab2bf9240287198df5">2c44195149bbc0cf9f8a…</a>,
 <a href="/wallet/2a7bb3480384eed5"><span class="walletcolor" style="background-color: #84eed5"></span>[2a7bb34803]</a>,
 <a href="/txid/eef5e0b1f24194402b89d0af40b1964632996414a40600cd73b82db13fe91d95">eef5e0b1f24194402b89…</a>,
 <a href="/wallet/ec38851dbb8af0a8"><span class="walletcolor" style="background-color: #8af0a8"></span>[ec38851dbb]</a>,
 <a href="/wallet/51743e65e4a27f63"><span class="walletcolor" style="background-color: #a27f63"></span>[51743e65e4]</a>,
 <a hre
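
Each item in that list is a tag, so you can pull out its text and its `href`. A small sketch on a trimmed-down sample (two links copied from the output above, plus a made-up one with no `href`):

```python
from bs4 import BeautifulSoup

# Two links from the output above, plus one hypothetical link without an href
sample_html = """
<a href="/wallet/23c39bc811da0d39/addresses">show wallet addresses</a>
<a href="/wallet/23c39bc811da0d39?format=csv">Download as CSV</a>
<a>no href here</a>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# (link text, href) pairs; .get() returns None when the attribute is missing
links = [(a.get_text(strip=True), a.get("href")) for a in soup.find_all("a")]

# keep only the links that point at a wallet page
wallet_links = [href for _, href in links if href and href.startswith("/wallet/")]
print(wallet_links)
```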

## You can also use CSS selectors to find nested stuff.
e.g. 
`content.select("body p a")` finds all `a` tags inside of a `p` tag inside of a `body` tag.

In [39]:
content.select('table th')

[<th>date</th>, <th>received/sent</th>, <th>balance</th>, <th>transaction</th>]
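
CSS selectors can also match on classes with the `tag.classname` syntax. A minimal sketch, assuming a sample shaped like the page's table (the html below is a made-up, cut-down version, and the second table is only there to show it gets ignored):

```python
from bs4 import BeautifulSoup

# Hypothetical cut-down sample, loosely shaped like the real page
sample_html = """
<table class="txs">
 <tr><th>date</th><th>balance</th></tr>
 <tr>
  <td class="date">2017-02-06 01:02:47</td>
  <td class="amount diff">+0.14915494</td>
  <td class="amount">0.14925494</td>
 </tr>
</table>
<table class="other"><tr><td class="date">ignore me</td></tr></table>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# td.date inside table.txs only
sample_dates = [td.get_text() for td in soup.select("table.txs td.date")]

# td cells with class "amount" but without class "diff"
sample_balances = [td.get_text() for td in soup.select("table.txs td.amount:not(.diff)")]

# select_one() returns just the first match (or None)
first_header = soup.select_one("table.txs th").get_text()
```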

HTML tags can belong to certain 'classes'. You can use this information to find tags belonging to particular classes. 

For you guys (+ girl), the table you want is a `<table>` tag with `class = "txs"`

In [57]:
# find table we want
table = content.find('table', class_ = "txs")
print(len(table))   # number of direct children of the <table> tag
#table = table[0]   # not needed: .find() already returns a single tag
print(table.prettify())


# mmmmmm look at that sweet, sweet html code ;)

12
<table class="txs">
 <tr>
  <th>
   date
  </th>
  <th>
   received/sent
  </th>
  <th>
   balance
  </th>
  <th>
   transaction
  </th>
 </tr>
 <tr class="received">
  <td class="date">
   2017-02-06 01:02:47
  </td>
  <td class="inout">
   <table class="empty">
    <tr>
     <td class="walletid">
      <a href="/wallet/b4a74320e7f069a2">
       <span class="walletcolor" style="background-color: #f069a2">
       </span>
       [b4a74320e7]
      </a>
     </td>
     <td class="amount diff">
      +0.14915494
     </td>
     <td>
     </td>
    </tr>
   </table>
  </td>
  <td class="amount">
   0.14925494
  </td>
  <td class="txid">
   <a href="/txid/2c44195149bbc0cf9f8aaeb5e434b899611e7e161e0446ab2bf9240287198df5">
    2c44195149bbc0cf9f8a…
   </a>
  </td>
 </tr>
 <tr class="received">
  <td class="date">
   2016-11-24 17:36:42
  </td>
  <td class="inout">
   <table class="empty">
    <tr>
     <td class="walletid">
      <a href="/wallet/2a7bb3480384eed5">
       <span class="wall

## Grab stuff we want and coerce it to a workable format

We want the following columns:

- Dates
- Received/Sent
- Balance
- Transaction

We'll use BeautifulSoup to capture each column, and use the 'pandas' library in Python to make it useable.

Ok let's get it fam!

In [111]:
# Get Dates

# Looking at the html, we can see that each date is nested inside the following html
#   <td class="date">2017-02-06 01:02:47</td>
date_list = table.find_all("td", class_="date")
dates = [date.get_text() for date in date_list]
len(dates)

5

In [142]:
# Get sent/received
# stored in <td class="walletid"></td>

# Recall, to find the above tag/class combos, we just highlighted & right-clicked 'inspect' on 
# those elements.

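# get wallet
wallet_list = table.find_all("td", class_="walletid")
wallet = [wal.get_text() for wal in wallet_list]
print(wallet)
# Hard-coded for this page: one transaction went to two wallets (entries 3 and 5),
# so group those into a tuple and drop the blank entries (2 and 4).
wallet = [wallet[0], wallet[1], (wallet[3], wallet[5]), wallet[6], wallet[7]]
print(wallet)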

[u'[b4a74320e7]', u'[2a7bb34803]', u'', u'[ec38851dbb]', u'', u'[51743e65e4]', u'[b4ac413a29]', u'[924d32a14f]']
[u'[b4a74320e7]', u'[2a7bb34803]', (u'[ec38851dbb]', u'[51743e65e4]'), u'[b4ac413a29]', u'[924d32a14f]']


In [158]:
# Get balance
# class_="amount" also matches the <td class="amount diff"> cells, so filter those out.
# A list comprehension (instead of a set difference) keeps the balances in page order,
# which matters because sets are unordered and would scramble the rows.
bal_list = table.find_all("td", class_="amount")
balances = [float(bal.get_text()) for bal in bal_list if "diff" not in bal["class"]]
balances

[0.14925494, 0.0001, 0.0, 0.92898899, 0.893056]

In [162]:
# Get transaction id
trans_list = table.find_all("td", class_="txid")
trans = [t.get_text() for t in trans_list]
trans

[u'2c44195149bbc0cf9f8a\u2026',
 u'eef5e0b1f24194402b89\u2026',
 u'1b1ac78d6f812c6bf80a\u2026',
 u'27e4f4629aca12f13097\u2026',
 u'2ae8a39f65fbbaf64988\u2026']

## Make into dataframe

In [163]:
import pandas as pd
wallet_df = pd.DataFrame({
        "date": dates,
        "received/sent": wallet,
        "balance": balances,
        "transaction": trans
    })
wallet_df

Unnamed: 0,balance,date,received/sent,transaction
0,0.149255,2017-02-06 01:02:47,[b4a74320e7],2c44195149bbc0cf9f8a…
1,0.0001,2016-11-24 17:36:42,[2a7bb34803],eef5e0b1f24194402b89…
2,0.0,2016-08-20 21:23:38,"([ec38851dbb], [51743e65e4])",1b1ac78d6f812c6bf80a…
3,0.928989,2016-07-07 22:37:48,[b4ac413a29],27e4f4629aca12f13097…
4,0.893056,2016-07-01 18:13:53,[924d32a14f],2ae8a39f65fbbaf64988…
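
Grabbing each column on its own only works if every list ends up with one entry per row, in the same order. Another way to go is to walk the table row by row and build one record per transaction, so the values can't drift apart. Here's a rough sketch on a made-up, cut-down sample shaped like the table above (the second txid link is shortened and hypothetical, and `rows` is just a name I picked):

```python
from bs4 import BeautifulSoup

# Hypothetical cut-down sample shaped like <table class="txs"> on the page
sample_html = """
<table class="txs">
 <tr><th>date</th><th>received/sent</th><th>balance</th><th>transaction</th></tr>
 <tr>
  <td class="date">2017-02-06 01:02:47</td>
  <td class="inout"><table class="empty"><tr>
   <td class="walletid"><a href="/wallet/b4a74320e7f069a2">[b4a74320e7]</a></td>
   <td class="amount diff">+0.14915494</td>
  </tr></table></td>
  <td class="amount">0.14925494</td>
  <td class="txid"><a href="/txid/2c44195149bbc0cf9f8aaeb5e434b899611e7e161e0446ab2bf9240287198df5">2c44195149bbc0cf9f8a</a></td>
 </tr>
 <tr>
  <td class="date">2016-08-20 21:23:38</td>
  <td class="inout"><table class="empty">
   <tr><td class="walletid"><a href="/wallet/ec38851dbb8af0a8">[ec38851dbb]</a></td><td class="amount diff">-0.5</td></tr>
   <tr><td class="walletid"><a href="/wallet/51743e65e4a27f63">[51743e65e4]</a></td><td class="amount diff">-0.428801</td></tr>
  </table></td>
  <td class="amount">0</td>
  <td class="txid"><a href="/txid/1b1ac78d6f812c6bf80a">1b1ac78d6f812c6bf80a</a></td>
 </tr>
</table>
"""
table = BeautifulSoup(sample_html, "html.parser").find("table", class_="txs")

rows = []
# recursive=False: only the outer rows, not the rows of the nested "empty" tables
for tr in table.find_all("tr", recursive=False):
    date_td = tr.find("td", class_="date", recursive=False)
    if date_td is None:          # header row has <th> cells only, skip it
        continue
    balance_td = tr.find("td", class_="amount", recursive=False)
    txid_td = tr.find("td", class_="txid", recursive=False)
    wallets = [td.get_text(strip=True) for td in tr.find_all("td", class_="walletid")]
    rows.append({
        "date": date_td.get_text(strip=True),
        "wallets": [w for w in wallets if w],                 # drop blank entries
        "balance": float(balance_td.get_text(strip=True)),
        "txid": txid_td.a["href"].rsplit("/", 1)[-1],         # full id from the link
    })
```

From there `pd.DataFrame(rows)` gives you one dataframe row per transaction.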


Also - I just realised the site lets you download a wallet's transactions as a csv file (the "Download as CSV" link on the page) so idk how useful this is but here you go lol

Below is what the download looks like

In [166]:
pd.read_csv("walletexplorer-23c39bc811da0d39-1.csv")

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,"#Wallet 23c39bc811da0d39, page 1 from 1, transactions 1-5. Updated to block 455049 (2017-02-28 00:59:42 UTC). Source: WalletExplorer.com"
date,received from,received amount,sent amount,sent to,balance,transaction
2017-02-06 01:02:47,b4a74320e7f069a2,0.14915494,,,0.14925494,2c44195149bbc0cf9f8aaeb5e434b899611e7e161e0446...
2016-11-24 17:36:42,2a7bb3480384eed5,0.0001,,,0.0001,eef5e0b1f24194402b89d0af40b1964632996414a40600...
2016-08-20 21:23:38,,,0.5,ec38851dbb8af0a8,0,1b1ac78d6f812c6bf80ad5848fb982592af7e48e295396...
2016-08-20 21:23:38,,,0.428801,51743e65e4a27f63,0,1b1ac78d6f812c6bf80ad5848fb982592af7e48e295396...
2016-08-20 21:23:38,,,0.00018799,(fee),0,1b1ac78d6f812c6bf80ad5848fb982592af7e48e295396...
2016-07-07 22:37:48,b4ac413a2937ee5a,0.03593299,,,0.92898899,27e4f4629aca12f13097dcf581688a0d5a20157d8c190b...
2016-07-01 18:13:53,924d32a14ffafd41,0.893056,,,0.893056,2ae8a39f65fbbaf64988976993074569e34a896e040a93...
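
The `Unnamed` columns show up because the file starts with a `#Wallet ...` comment line, which pandas treats as the header. Skipping that first line should give proper column names. A small sketch, assuming the file layout matches the output above (I'm using an in-memory copy of the first two data rows instead of the real file, and I'm guessing the comment line is quoted in the file):

```python
import io
import pandas as pd

# In-memory stand-in for the downloaded file: comment line, header, two data rows
sample_csv = io.StringIO(
    '"#Wallet 23c39bc811da0d39, page 1 from 1, transactions 1-5. '
    'Updated to block 455049 (2017-02-28 00:59:42 UTC). Source: WalletExplorer.com"\n'
    "date,received from,received amount,sent amount,sent to,balance,transaction\n"
    "2017-02-06 01:02:47,b4a74320e7f069a2,0.14915494,,,0.14925494,"
    "2c44195149bbc0cf9f8aaeb5e434b899611e7e161e0446ab2bf9240287198df5\n"
    "2016-11-24 17:36:42,2a7bb3480384eed5,0.0001,,,0.0001,"
    "eef5e0b1f24194402b89d0af40b1964632996414a40600cd73b82db13fe91d95\n"
)

# skiprows=1 drops the comment line so the real header row is used;
# parse_dates turns the date column into proper timestamps
df = pd.read_csv(sample_csv, skiprows=1, parse_dates=["date"])
print(df.columns.tolist())
```

With the real download you'd pass the filename instead of `sample_csv`, e.g. `pd.read_csv("walletexplorer-23c39bc811da0d39-1.csv", skiprows=1)`.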
