<h1>Web-Scraper and Bulk-Downloader with Python using <br>BeautifulSoup and multiprocessing ThreadPool class.</h1>

<p>In this notebook I present a web scraper and bulk downloader written in Python. It downloads many files from a given URL using the third-party scraping library `BeautifulSoup` and the `ThreadPool` class from Python's standard `multiprocessing` package. The goal is to demonstrate how multithreading can complete repeated tasks, in this case downloading CSV files, in less time than an ordinary loop.</p>

<p>Farshad Davoodifard<br>Berlin, November 2021<br><i>Level: beginner</i></p>

<h3>Content</h3>
<ol>
    <li><a href="#1">Introduction</a></li>
    <li><a href="#2">Data</a></li>
    <li><a href="#3">Exploratory Data Analysis and Preparation</a></li>
    <li><a href="#4">Extracting and Preparing Download Links</a></li>
    <li><a href="#5">Download the Target Files</a></li>
    <li><a href="#6">Checking the Results and some Notes</a></li>
    
</ol>

<div id="1">
    <h3>1. Introduction</h3>
    <p>While searching for suitable public datasets for machine learning exercises (as an alternative to the usual built-in datasets in popular machine learning libraries, such as the well-known 'iris' dataset in scikit-learn), I came across a very interesting GitHub repository with a collection of over 1300 datasets originally distributed in R packages, which could be interesting for ML exercises with Python, too.</p>
    <p>Although the Pandas library can read the content of any CSV file directly from a URL, I was more interested in downloading all the data from this repository before accessing any of the files individually, just for fun.</p>
    <p>So I decided to develop a web scraper to analyse the content of the above-mentioned repository and then download the files automatically using their links.</p>
    <p>In the following sections I present all the steps I took to reach this goal.</p>
    <p>Have fun and feel free to use my code and share it.</p>

 </div>

<div id="2">
    <h3>2. Data</h3>
    <p>As mentioned in the Introduction, the data source is the public Rdatasets repository of Vincent Arel-Bundock, which is publicly accessible <a href='https://vincentarelbundock.github.io/Rdatasets/datasets.html'>here</a>.</p>
</div>

<div id="3">
    <h3>3. Exploratory Data Analysis and Preparation</h3>
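    <p>First of all we import all the libraries we need for this project. Please install the third-party packages pandas, requests, httplib2 and BeautifulSoup (package name <code>beautifulsoup4</code>) before you start, as these are not part of Python's standard library. <code>ThreadPool</code>, in contrast, ships with Python in <code>multiprocessing.pool</code> and needs no installation.</p>
    <p>For more information on these libraries, see their respective documentation.</p>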
</div>

In [1]:
import os
import pandas as pd
import requests
import httplib2
from bs4 import BeautifulSoup
from multiprocessing.pool import ThreadPool
print('Libraries  imported Successfully!')

Libraries  imported Successfully!


Now we are going to have a look at the content of our data source. We just save it in a variable called `url`:

In [2]:
url = 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'
print('URL saved!')

URL saved!


In the next step we read the page content and display it as a Pandas dataframe:

In [3]:
# create dataframe out of html table 
df = pd.read_html(url)[0][:-1]

In [4]:
# show the first 5 rows
df.head()

Unnamed: 0,Package,Item,Title,Rows,Cols,n_binary,n_character,n_factor,n_logical,n_numeric,CSV,Doc
0,AER,Affairs,Fair's Extramarital Affairs Data,601.0,9.0,2.0,0.0,2.0,0.0,7.0,CSV,DOC
1,AER,ArgentinaCPI,Consumer Price Index in Argentina,80.0,2.0,0.0,0.0,0.0,0.0,2.0,CSV,DOC
2,AER,BankWages,Bank Wages,474.0,4.0,2.0,0.0,3.0,0.0,1.0,CSV,DOC
3,AER,BenderlyZwick,"Benderly and Zwick Data: Inflation, Growth and...",31.0,5.0,0.0,0.0,0.0,0.0,5.0,CSV,DOC
4,AER,BondYield,Bond Yield Data,60.0,2.0,0.0,0.0,0.0,0.0,2.0,CSV,DOC


In [5]:
# show the last 5 rows
df.tail()

Unnamed: 0,Package,Item,Title,Rows,Cols,n_binary,n_character,n_factor,n_logical,n_numeric,CSV,Doc
1740,vcd,UKSoccer,UK Soccer Scores,25.0,3.0,0.0,0.0,2.0,0.0,1.0,CSV,DOC
1741,vcd,VisualAcuity,Visual Acuity in Left and Right Eyes,32.0,4.0,1.0,0.0,3.0,0.0,1.0,CSV,DOC
1742,vcd,VonBort,Von Bortkiewicz Horse Kicks Data,280.0,4.0,1.0,0.0,2.0,0.0,2.0,CSV,DOC
1743,vcd,WeldonDice,Weldon's Dice Data,11.0,2.0,0.0,0.0,1.0,0.0,1.0,CSV,DOC
1744,vcd,WomenQueue,Women in Queues,11.0,2.0,0.0,0.0,1.0,0.0,1.0,CSV,DOC


We may also have a look at the shape of our dataframe to find out the number of rows and columns:

In [6]:
df.shape

(1745, 12)

and to check more details:

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1745 entries, 0 to 1744
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Package      1745 non-null   object 
 1   Item         1745 non-null   object 
 2   Title        1745 non-null   object 
 3   Rows         1745 non-null   float64
 4   Cols         1745 non-null   float64
 5   n_binary     1745 non-null   float64
 6   n_character  1745 non-null   float64
 7   n_factor     1745 non-null   float64
 8   n_logical    1745 non-null   float64
 9   n_numeric    1745 non-null   float64
 10  CSV          1745 non-null   object 
 11  Doc          1745 non-null   object 
dtypes: float64(7), object(5)
memory usage: 163.7+ KB


From a first look at the dataframe we can see that it has 1745 rows, which means we can theoretically download 1745 datasets as `csv` files (each with a `doc` page), provided that all the links work and each link refers to a unique dataset.

The dataframe has no column showing the size of each file. Instead, we can see the number of rows and columns in each dataset. Let's see which dataset has the largest number of rows:

In [8]:
max_rows = df.Rows.max() # maximum number of Rows
max_rows

1414593.0

In [9]:
df[df['Rows'] == max_rows] # which sample has the maximum number of Rows?

Unnamed: 0,Package,Item,Title,Rows,Cols,n_binary,n_character,n_factor,n_logical,n_numeric,CSV,Doc
1180,openintro,military,US Military Demographics,1414593.0,6.0,2.0,0.0,4.0,1.0,1.0,CSV,DOC


As we see, the dataset `military` from the package `openintro` has 1414593 rows and is therefore the largest of all 1745 datasets.

In a similar way, we can check the datasets with the smallest number of rows:

In [10]:
min_rows = df.Rows.min() # minimum number of Rows
min_rows

2.0

In [11]:
df[df['Rows'] == min_rows] # which sample(s) have the minimum number of Rows?

Unnamed: 0,Package,Item,Title,Rows,Cols,n_binary,n_character,n_factor,n_logical,n_numeric,CSV,Doc
732,gap,jma.cojo,Internal functions for gap,2.0,16.0,13.0,4.0,0.0,0.0,12.0,CSV,DOC
1127,openintro,fish_oil_18,Findings on n-3 Fatty Acid Supplement Health B...,2.0,48.0,46.0,0.0,0.0,0.0,48.0,CSV,DOC
1332,reshape2,smiths,Demo data describing the Smiths.,2.0,5.0,2.0,0.0,1.0,0.0,4.0,CSV,DOC
1420,Stat2Data,ChemoTHC,THC for Antinausea Treatment in Chemotherapy,2.0,4.0,4.0,0.0,1.0,0.0,3.0,CSV,DOC
1516,Stat2Data,Migraines,Migraines and TMS,2.0,4.0,3.0,0.0,1.0,0.0,3.0,CSV,DOC
1581,Stat2Data,TMS,Migraines and TMS,2.0,4.0,3.0,0.0,1.0,0.0,3.0,CSV,DOC
1698,tidyr,smiths,Some data about the Smith family,2.0,5.0,2.0,1.0,0.0,0.0,4.0,CSV,DOC


We see several datasets with only 2 rows, which may be less relevant for our machine learning exercises. We keep them anyway and do not omit them.

Generating a dataframe out of the HTML table is nice, but it does not give us the links behind 'CSV' (or 'DOC') that we need for downloading. In addition, we need at least the Item names from the dataframe to name the downloaded files later. To get the links from the HTML, we need a web scraper, which we can develop with BeautifulSoup.

<div id="4">
    <h3>4. Extracting and Preparing Download Links</h3>
    <p>To begin with, we create an empty list (links) in which we store the links that we extract from the HTML content of the webpage.  
Then, we create a <strong>BeautifulSoup()</strong> object and pass the HTML content to it. This creates a nested representation of the HTML content.  
As the final step, we need to find the links in the entire HTML content of the webpage. To do so, we use the <strong>.find_all()</strong> method and tell it that we are only interested in the tags that are actually links.</p>
    <p>
        An important note is that the <strong>.request()</strong> method of httplib2 returns a tuple with two elements: the first is an instance of the Response class, and the second is the content of the body of the URL we are working with.</p>
</div>
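
The idea of collecting the `href` of every `a` tag can be tried out without network access. The following is only a minimal sketch that uses the standard library's `html.parser` as a stand-in for BeautifulSoup (it is not what this notebook uses for the real page). The class name `LinkCollector`, the `sample_html` string and the `example.org` addresses are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


# Hypothetical miniature of the datasets table, for illustration only
sample_html = """
<table>
  <tr>
    <td><a href="https://example.org/csv/AER/Affairs.csv">CSV</a></td>
    <td><a href="https://example.org/doc/AER/Affairs.html">DOC</a></td>
    <td><a name="anchor-without-href">no link</a></td>
  </tr>
</table>
"""

collector = LinkCollector()
collector.feed(sample_html)
sample_links = collector.links
print(sample_links)
```

The third `a` tag has no `href` and is skipped, which mirrors what `href=True` does in the `.find_all()` call of the real scraper.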


In [12]:
# create an instance of a class that represents a client HTTP interface
http = httplib2.Http()
# run http-request
response, content = http.request(url)

In [13]:
# create an empty list (links) 
# that we will use to store the links that 
# we will extract from the HTML content of the webpage
links=[]

# find all 'a' tags (links) in the html-content
# and save them in the list of 'links';
# naming the parser explicitly avoids a BeautifulSoup warning
for link in BeautifulSoup(content, 'html.parser').find_all('a', href=True):
    links.append(link['href'])

Now let's check the list of links:

In [14]:
links[:5] # show the first five elements

['https://vincentarelbundock.github.io/Rdatasets/csv/AER/Affairs.csv',
 'https://vincentarelbundock.github.io/Rdatasets/doc/AER/Affairs.html',
 'https://vincentarelbundock.github.io/Rdatasets/csv/AER/ArgentinaCPI.csv',
 'https://vincentarelbundock.github.io/Rdatasets/doc/AER/ArgentinaCPI.html',
 'https://vincentarelbundock.github.io/Rdatasets/csv/AER/BankWages.csv']

As we see, there are two links for each dataset: a link to a CSV file and a link to an HTML documentation page. The length of the list confirms this:

In [15]:
len(links)

3490

This is exactly double the number of rows in our dataframe (i.e. 1745).  
For our machine learning exercises we need the CSV files only. So let us drop the HTML links from our list by slicing the list with a step of 2:

In [16]:
# we need only csv files
csv_links = links[::2]
# show the first five elements
csv_links[:5]

['https://vincentarelbundock.github.io/Rdatasets/csv/AER/Affairs.csv',
 'https://vincentarelbundock.github.io/Rdatasets/csv/AER/ArgentinaCPI.csv',
 'https://vincentarelbundock.github.io/Rdatasets/csv/AER/BankWages.csv',
 'https://vincentarelbundock.github.io/Rdatasets/csv/AER/BenderlyZwick.csv',
 'https://vincentarelbundock.github.io/Rdatasets/csv/AER/BondYield.csv']
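
Slicing with a step of 2 works here because the CSV and DOC links alternate strictly. If a page ever contained an extra link, the alternation would break. A possible alternative, shown as a sketch on a made-up list (`sample_links` and the `example.org` addresses are hypothetical), is to filter by the file extension instead:

```python
# Hypothetical sample in which the CSV/DOC alternation is broken once
sample_links = [
    'https://example.org/csv/AER/Affairs.csv',
    'https://example.org/doc/AER/Affairs.html',
    'https://example.org/index.html',          # stray extra link
    'https://example.org/csv/AER/BankWages.csv',
    'https://example.org/doc/AER/BankWages.html',
]

# slicing picks every second element, whatever it is
by_slicing = sample_links[::2]

# filtering keeps only the links that end with '.csv'
by_suffix = [link for link in sample_links if link.lower().endswith('.csv')]

print(by_slicing)
print(by_suffix)
```

On this sample the slice picks up two HTML links after the stray entry, while the filter still returns only the two CSV links.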

As we see below, this list has exactly 1745 elements (the same as the dataframe):

In [17]:
len(df)

1745

In [18]:
len(csv_links)

1745

For ease of work and a better overview, we combine each link with its index and its Item name from our dataframe:

In [19]:
urls = [(i, df.Item[i], csv_links[i]) for i in range(len(csv_links))]
urls[:5]  # show the first five elements

[(0,
  'Affairs',
  'https://vincentarelbundock.github.io/Rdatasets/csv/AER/Affairs.csv'),
 (1,
  'ArgentinaCPI',
  'https://vincentarelbundock.github.io/Rdatasets/csv/AER/ArgentinaCPI.csv'),
 (2,
  'BankWages',
  'https://vincentarelbundock.github.io/Rdatasets/csv/AER/BankWages.csv'),
 (3,
  'BenderlyZwick',
  'https://vincentarelbundock.github.io/Rdatasets/csv/AER/BenderlyZwick.csv'),
 (4,
  'BondYield',
  'https://vincentarelbundock.github.io/Rdatasets/csv/AER/BondYield.csv')]

_**important**_  
This list will help us to build unique names for the files that we are going to download.  
We need the numbers (i.e. the indices) to avoid overwriting files during the download process: some datasets have the same Item name but different content.
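
To see why the index matters, here is a small sketch that uses a few (package, item) pairs taken from the rows shown earlier in this notebook. The list `datasets` is only a hand-picked sample, not the full table.

```python
from collections import Counter

# (package, item) pairs copied from rows shown earlier in this notebook
datasets = [
    ('reshape2', 'smiths'),
    ('tidyr', 'smiths'),
    ('Stat2Data', 'TMS'),
    ('AER', 'Affairs'),
]

# count how often each item name occurs
item_counts = Counter(item for _, item in datasets)
duplicated_items = sorted(item for item, n in item_counts.items() if n > 1)

# without a prefix, both 'smiths' datasets would map to the same file name
names_without_index = {f'{item}.csv' for _, item in datasets}
names_with_index = {f'{i}_{item}.csv' for i, (_, item) in enumerate(datasets)}

print(duplicated_items)
print(len(names_without_index), len(names_with_index))
```

Four datasets produce only three distinct file names without the index, but four with it.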

<div id="5">
    <h3>5. Download the Target Files</h3>
 </div>
 
<p>The following function downloads a single file and saves it in the subdirectory 'data'.</p>

In [20]:
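def dl_file(url):
    """Build a file name, download the file and save it to disk."""
    # each element of `urls` is a tuple: (index, item name, download link)
    id_, name, url = url
    name = str(id_) + '_' + name + '.csv'
    # make sure the target directory exists before writing to it
    os.makedirs('./data', exist_ok=True)
    path = './data/' + name
    r = requests.get(url, stream=True)
    # raise an exception on HTTP errors instead of saving an error page
    r.raise_for_status()
    with open(path, 'wb') as f:
        # write the response body to disk chunk by chunk
        for ch in r.iter_content(chunk_size=8192):
            f.write(ch)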

To download multiple files at a time, we use the `ThreadPool` method `imap_unordered()`.

In [21]:
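# download using a ThreadPool object with 9 threads
try:
    with ThreadPool(9) as pool:
        # iterate over the results so that we wait for all downloads
        # and so that exceptions raised in the workers reach this block
        for _ in pool.imap_unordered(dl_file, urls):
            pass
except Exception as e:
    print(e)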
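
The stated goal of this notebook is to compare multithreading with an ordinary loop. The following sketch illustrates the difference without touching the network: the function `fake_download` is a made-up stand-in whose `time.sleep` call mimics waiting for a server, and the numbers (18 tasks, 0.05 seconds each) are arbitrary assumptions. Real download times depend on the server and the connection.

```python
import time
from multiprocessing.pool import ThreadPool


def fake_download(task_id):
    """Stands in for one download; the sleep mimics waiting on the network."""
    time.sleep(0.05)
    return task_id


tasks = list(range(18))

# ordinary loop: the waiting times add up
start = time.perf_counter()
sequential_results = [fake_download(t) for t in tasks]
sequential_seconds = time.perf_counter() - start

# ThreadPool with 9 threads: up to 9 tasks wait at the same time
start = time.perf_counter()
with ThreadPool(9) as pool:
    pooled_results = sorted(pool.imap_unordered(fake_download, tasks))
pooled_seconds = time.perf_counter() - start

print(f'loop: {sequential_seconds:.2f}s, ThreadPool(9): {pooled_seconds:.2f}s')
```

Because `imap_unordered()` yields results in the order in which they finish, the results are sorted here before comparing them with the loop version.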


<div id="6">
    <h3>6. Checking the Results and some Notes</h3>
 </div>
 
<p>To check the number of downloaded files we use the following snippet:</p>

In [22]:
len(os.listdir('./data'))

1745

As we see, the number of downloaded files is equal to the number of rows in our original dataframe, so we can be confident that we have downloaded all the files linked under the `url`. Nevertheless, we cannot be sure that each file is unique with regard to its content.  
To check this, you may change the structure of the list `urls` so that every tuple is stored without the index, with just the Item name and the URL, and then check the results. Because some Item names occur more than once, you will lose files: a downloaded file with the same name as an earlier one will overwrite it.  
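
Another way to look for files with identical content is to compare checksums. The following is only a sketch: the helper `group_by_content` is a name I made up, and the demonstration runs on a temporary directory with invented file contents rather than on the real `./data` folder. It reads each file completely into memory, which is acceptable for files of this size but is an assumption worth keeping in mind.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def group_by_content(directory):
    """Maps each SHA-256 digest to the sorted file names sharing that content."""
    groups = defaultdict(list)
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path, 'rb') as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            groups[digest].append(name)
    return groups


# Demonstration on a temporary directory with hypothetical file contents
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [('0_a.csv', 'x,y\n1,2\n'),
                       ('1_b.csv', 'x,y\n1,2\n'),
                       ('2_c.csv', 'x,y\n3,4\n')]:
        with open(os.path.join(tmp, name), 'w', newline='') as f:
            f.write(text)
    duplicates = [names for names in group_by_content(tmp).values()
                  if len(names) > 1]

print(duplicates)
```

To apply the same idea to the downloaded files, one could call `group_by_content('./data')` and inspect the groups that contain more than one file name.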
Thank you for reading.