The task is to scrape the English and Arabic data from the VDC website:

http://vdc-sy.net/en/

Index pages on the database are reached through:

http://www.vdc-sy.info/index.php/en/martyrs

The Arabic version is found with this address:

http://www.vdc-sy.info/index.php/ar/martyrs

Initial research suggests that the page structure is the same for both versions. 

Running a search on the entire database returns the first page of results at a URL like:

`http://www.vdc-sy.info/index.php/en/martyrs/1/c29ydGJ5PWEua2lsbGVkX2RhdGV8c29ydGRpcj1ERVNDfGFwcHJvdmVkPXZpc2libGV8ZXh0cmFkaXNwbGF5PTB8`

The `End` link from the page navigation there shows the last page number of results:

`http://www.vdc-sy.info/index.php/en/martyrs/1587/c29ydGJ5PWEua2lsbGVkX2RhdGV8c29ydGRpcj1ERVNDfGFwcHJvdmVkPXZpc2libGV8ZXh0cmFkaXNwbGF5PTB8`
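The long final path segment in these URLs is Base64 text. A minimal sketch of decoding it is below; the helper name `decode_search_token` is made up for illustration, and reading the decoded pairs as search options (sort column, sort direction, and so on) is an inference from their names, not something the site documents.

```python
import base64

# Token copied from the search result URLs above.
TOKEN = "c29ydGJ5PWEua2lsbGVkX2RhdGV8c29ydGRpcj1ERVNDfGFwcHJvdmVkPXZpc2libGV8ZXh0cmFkaXNwbGF5PTB8"


def decode_search_token(token):
    """Decode the Base64 token and split it into key/value pairs (hypothetical helper)."""
    decoded = base64.b64decode(token).decode("utf-8")
    # Pairs are separated by "|" and the string ends with a trailing "|".
    pairs = [part.split("=", 1) for part in decoded.split("|") if part]
    return decoded, dict(pairs)


decoded, params = decode_search_token(TOKEN)
print(decoded)  # sortby=a.killed_date|sortdir=DESC|approved=visible|extradisplay=0|
print(params)
```

Because the token only encodes the search options, the same token can be reused for every page number and for both language versions, which is what the cells below do.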

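Given the last page number from the `End` link (1587 here), the scraper should be able to iterate from 1 to 1587 to gather all of the index results. The count changes as records are added, so check the `End` link again before each run.

The site appears to have measures in place to prevent scraping. If you visit one of the individual pages in a browser, a ref is appended to the end of the URL and a redirect is returned. For example: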

`http://www.vdc-sy.info/index.php/en/details/martyrs/189739`

becomes (with a dynamic parameter):

`http://www.vdc-sy.info/index.php/en/details/martyrs/189739#.WVpVW8aZMnU`
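The part after `#` is a URL fragment, which browsers keep on the client side and do not send in the HTTP request. The table HTML captured further down contains an AddThis config with `data_track_addressbar` set to true, which may be what adds the fragment in a browser; that is an assumption, not something confirmed here. Either way, a scraper can work with the plain URL. A small sketch of normalising a URL copied from the address bar (the helper name `strip_fragment` is hypothetical):

```python
from urllib.parse import urldefrag


def strip_fragment(url):
    """Return the URL without its '#...' fragment (hypothetical helper)."""
    return urldefrag(url).url


browser_url = "http://www.vdc-sy.info/index.php/en/details/martyrs/189739#.WVpVW8aZMnU"
print(strip_fragment(browser_url))
```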

In [1]:
from time import sleep
from bs4 import BeautifulSoup
import requests
import dataset
import json

In [11]:
result = requests.get("http://www.vdc-sy.info/index.php/ar/details/martyrs/189739")

In [19]:
content = result.content
soup = BeautifulSoup(content.decode(),"lxml")

In [20]:
soup

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="XWnKclwOOsL8qoot1wmrHAlAk8n97TsgxmV1zwQOXk8" name="google-site-verification"/>
<title>مركز توثيق الانتهاكات في سوريا - </title>
<meta content="توثيق,انتهاكات,سوريا,توثيق,الفتلى,معتقلين,مخطوفي,مفقودين,مدنيي, جيش النظام, جيش حر" name="keywords"/>
<meta content="مركز توثيق الانتهاكات في سوريا لتوثيق الفتلى والمعتقلين والمخطوفين والمفقودين من المدنيين وغير المدنيين" name="description"/>
<meta content="مركز توثيق الانتهاكات في سوريا " name="author"/>
<!-- Latest compiled and minified CSS -->
<link crossorigin="anonymous" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" rel="stylesheet"/>
<!-- Latest compiled and minified JavaScr

In [26]:
result = requests.get("http://www.vdc-sy.info/index.php/ar/martyrs/1/c29ydGJ5PWEua2lsbGVkX2RhdGV8c29ydGRpcj1ERVNDfGFwcHJvdmVkPXZpc2libGV8ZXh0cmFkaXNwbGF5PTB8")

In [32]:
content = result.content
soup = BeautifulSoup(content,"lxml")
for link in soup.find_all("a"):
    print(link.get("href"))

http://vdc-sy.net/
http://vdc-sy.net
http://www.vdc-sy.info/index.php/ar/martyrs
http://www.vdc-sy.info/index.php/ar/detainees
http://www.vdc-sy.info/index.php/ar/missing
#
http://www.vdc-sy.info/index.php/ar/submit/martyrs
http://www.vdc-sy.info/index.php/ar/submit/detainees
http://www.vdc-sy.info/index.php/ar/submit/missing
http://www.vdc-sy.info/index.php/ar/submit/kidnapping
http://www.vdc-sy.info/index.php/ar/martyrs/2/c29ydGJ5PWEua2lsbGVkX2RhdGV8c29ydGRpcj1ERVNDfGFwcHJvdmVkPXZpc2libGV8ZXh0cmFkaXNwbGF5PTB8
http://www.vdc-sy.info/index.php/ar/martyrs/3/c29ydGJ5PWEua2lsbGVkX2RhdGV8c29ydGRpcj1ERVNDfGFwcHJvdmVkPXZpc2libGV8ZXh0cmFkaXNwbGF5PTB8
http://www.vdc-sy.info/index.php/ar/martyrs/4/c29ydGJ5PWEua2lsbGVkX2RhdGV8c29ydGRpcj1ERVNDfGFwcHJvdmVkPXZpc2libGV8ZXh0cmFkaXNwbGF5PTB8
http://www.vdc-sy.info/index.php/ar/martyrs/5/c29ydGJ5PWEua2lsbGVkX2RhdGV8c29ydGRpcj1ERVNDfGFwcHJvdmVkPXZpc2libGV8ZXh0cmFkaXNwbGF5PTB8
http://www.vdc-sy.info/index.php/ar/martyrs/6/c29ydGJ5PWEua2lsbGVkX2RhdGV8c29y

In [4]:
# http://www.vdc-sy.info/index.php/ar/details/martyrs/189729
link_list = []

def extract_links_from_results_page(idx,code):
    # Fetch one index page and collect the links to individual record pages.
    uri = "http://www.vdc-sy.info/index.php/ar/martyrs/" + str(idx) + "/" + code
    result = requests.get(uri)
    soup = BeautifulSoup(result.content,"lxml")
    for link in soup.find_all("a"):
        href = link.get("href")
        # Anchors without an href return None, which would break the "in" test.
        if href and "/details/martyrs/" in href:
            martyr_link = "http://www.vdc-sy.info" + href
            link_list.append(martyr_link)
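
The navigation links printed earlier are absolute, while the "Return to List" link in the detail table is relative, so the site seems to mix both forms. A hedged sketch of a link filter that tolerates either form, plus anchors with no `href`, is below. The name `to_detail_url` and the `BASE` constant are illustrative and not part of the notebook.

```python
from urllib.parse import urljoin

BASE = "http://www.vdc-sy.info"  # site root, as used in the cell above


def to_detail_url(href, base=BASE):
    """Return an absolute detail-page URL, or None if href is not a detail link."""
    if not href or "/details/martyrs/" not in href:
        return None
    # urljoin leaves absolute hrefs untouched and prefixes relative ones.
    return urljoin(base, href)


print(to_detail_url("/index.php/ar/details/martyrs/190288"))
print(to_detail_url(None))
```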
            

In [5]:
link_list = []
for num in range(1,1735):
    extract_links_from_results_page(num,"c29ydGJ5PWEua2lsbGVkX2RhdGV8c29ydGRpcj1ERVNDfGFwcHJvdmVkPXZpc2libGV8ZXh0cmFkaXNwbGF5PTB8")
    sleep(1)
    if num % 100 == 0:
        print(num)

100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
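
Note that `range(1,1735)` stops at page 1734. A small sketch that makes the inclusive upper bound explicit by generating the index page URLs up front is below; `index_page_urls` and its parameters are hypothetical names, and the last page number still has to be read from the `End` link by hand.

```python
BASE = "http://www.vdc-sy.info/index.php"
TOKEN = "c29ydGJ5PWEua2lsbGVkX2RhdGV8c29ydGRpcj1ERVNDfGFwcHJvdmVkPXZpc2libGV8ZXh0cmFkaXNwbGF5PTB8"


def index_page_urls(lang, last_page, token=TOKEN):
    """Yield index page URLs for pages 1..last_page inclusive (hypothetical helper)."""
    for page in range(1, last_page + 1):
        yield f"{BASE}/{lang}/martyrs/{page}/{token}"


urls = list(index_page_urls("ar", 3))
for url in urls:
    print(url)
```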


In [6]:
len(link_list)

173368

In [3]:
db = dataset.connect("sqlite:///vdc.sqlite")
tab = db['links']

In [54]:
# for link in link_list:
#     en_link = link.replace("/ar/","/en/")
#     rec = {"ar_link":link,"en_link":en_link}
#     tab.insert(rec)
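
The commented-out cell above derives the English link from the Arabic one by swapping the language segment. A sketch of the same idea as small pure functions, with duplicates removed before insertion, is below. The function names are made up for illustration, and whether the index pages ever repeat a record is not known.

```python
def make_link_record(ar_link):
    """Build one row for the links table from an Arabic detail URL."""
    # Replace only the first "/ar/" so nothing later in the path is touched.
    en_link = ar_link.replace("/ar/", "/en/", 1)
    return {"ar_link": ar_link, "en_link": en_link}


def unique_in_order(links):
    """Drop repeated links while keeping the first-seen order."""
    return list(dict.fromkeys(links))


sample = [
    "http://www.vdc-sy.info/index.php/ar/details/martyrs/190288",
    "http://www.vdc-sy.info/index.php/ar/details/martyrs/190290",
    "http://www.vdc-sy.info/index.php/ar/details/martyrs/190288",
]
records = [make_link_record(link) for link in unique_in_order(sample)]
print(records)
# With the `tab` object from the cell above, the rows could then be
# written in one call: tab.insert_many(records)
```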

In [4]:
tab_ar = db['content_ar']
tab_en = db['content_en']

In [6]:
def get_content_table(link):
    result = requests.get(link)
    content = result.content
    soup = BeautifulSoup(content.decode(),"lxml")
    table = soup.find("table")
    if table is None:
        return table
    else:
        return str(table)

for rec in tab.find(_limit=2):
    # Fetch the Arabic and English versions of each record in turn.
    for lang, target in (("ar", tab_ar), ("en", tab_en)):
        url = rec[lang + "_link"]
        try:
            content_table = get_content_table(url)
            success = 1
        except Exception:
            # Record the failure so the page can be retried later.
            content_table = None
            success = 0
        target.insert({"content":content_table,
                       "url":url,
                       "link_id":rec['id'],
                       "lang":lang,
                       "success":success})
        sleep(1)
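
One way to cut down on `success` = 0 rows caused by transient network errors is to retry a fetch a few times before giving up. The sketch below is an illustration under assumptions and not part of the notebook: `fetch_with_retries` is a hypothetical helper, and the attempt count and delays are arbitrary.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url) up to `attempts` times (at least 1), pausing between failures."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as exc:
            last_error = exc
            if attempt < attempts:
                # Linear backoff: wait a little longer after each failure.
                sleep(delay * attempt)
    # Every attempt failed, so surface the last error to the caller.
    raise last_error
```

With the function defined in the cell above, a call might look like `fetch_with_retries(get_content_table, rec['ar_link'])`, wrapped in the same `try`/`except` that records the failure.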

In [46]:
tab.find_one()

OrderedDict([('id', 1),
             ('ar_link',
              'http://www.vdc-sy.info/index.php/ar/details/martyrs/190288'),
             ('en_link',
              'http://www.vdc-sy.info/index.php/en/details/martyrs/190288')])

In [48]:
t = get_content_table("http://www.vdc-sy.info/index.php/ar/details/martyrs/190288")

In [49]:
type(t)

bs4.element.Tag

In [52]:
str(t)

'<table class="peopleListing table-hover table-condensed table-striped">\n<tr>\n<td colspan="2">\n<!-- AddThis Button BEGIN -->\n<div class="addthis_toolbox addthis_default_style addthis_32x32_style" style="padding:10px 0 10px 20px; float:left;">\n<a class="addthis_button_preferred_1"></a>\n<a class="addthis_button_preferred_2"></a>\n<a class="addthis_button_preferred_3"></a>\n<a class="addthis_button_preferred_4"></a>\n<a class="addthis_button_compact"></a>\n<a class="addthis_counter addthis_bubble_style"></a>\n</div>\n<script type="text/javascript">var addthis_config = {"data_track_addressbar":true};</script>\n<script src="//s7.addthis.com/js/300/addthis_widget.js#pubid=ra-4e5936cd1bb03577" type="text/javascript"></script>\n<!-- AddThis Button END -->\n<a href="javascript:window.print()">Print This Page</a>\n</td>\n</tr>\n<tr>\n<td colspan="2">\n<div style="text-align:center; float:right;">\n<a href="/index.php/ar/martyrs/1/">Return to List</a>\n</div>\n<div style="clear:both;"></div

In [10]:
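# Compare against link_id: content_ar.id is that table's own row id,
# not the id of the link it was scraped from.
recs = db.query("""SELECT * FROM links 
                   WHERE id NOT IN (
                    SELECT link_id FROM content_ar
                );""")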

In [11]:
for r in recs:
    print(r)
    break

OrderedDict([('id', 3), ('ar_link', 'http://www.vdc-sy.info/index.php/ar/details/martyrs/190290'), ('en_link', 'http://www.vdc-sy.info/index.php/en/details/martyrs/190290')])
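
The query above is the basis for resuming an interrupted scrape. A self-contained sketch of the same anti-join against an in-memory SQLite database is below. It assumes the column layout shown earlier (`links.id`, `content_ar.link_id`, `content_ar.success`) and goes one step further by also treating links whose only attempts failed as still pending; the sample rows are invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE links (id INTEGER PRIMARY KEY, ar_link TEXT, en_link TEXT);
    CREATE TABLE content_ar (id INTEGER PRIMARY KEY, link_id INTEGER, success INTEGER);
    INSERT INTO links (id, ar_link, en_link) VALUES
        (1, 'ar/1', 'en/1'), (2, 'ar/2', 'en/2'), (3, 'ar/3', 'en/3');
    -- content_ar ids are independent of links ids: link 3 was fetched first,
    -- and the attempt for link 1 failed.
    INSERT INTO content_ar (id, link_id, success) VALUES (1, 3, 1), (2, 1, 0);
""")

# Links with no successful Arabic fetch yet.
pending = [row[0] for row in conn.execute("""
    SELECT l.id FROM links AS l
    WHERE NOT EXISTS (
        SELECT 1 FROM content_ar AS c
        WHERE c.link_id = l.id AND c.success = 1
    )
    ORDER BY l.id
""")]
print(pending)  # [1, 2]
```

`NOT EXISTS` is used here instead of `NOT IN` because it keeps working even if `link_id` were ever NULL.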


In [2]:
a = [1,2,3]
b = []
b + a

[1, 2, 3]