## Creating A Simple News Article Web Scraper


Large datasets for modern data applications must be collected and compiled before any analysis
can be done. An increasingly popular technique for gathering large sets of data from online
sources is web scraping: using software to collect data over time from the websites a
developer is interested in. Depending on the amount of data needed, and the type of
application it is needed for, scraping might have to run over a long period of time. When a
site applies rate limiting, a scraper may also have to be scheduled at specific times of the day
and run only for a limited time. This tutorial walks through a simple web scraper for
news articles, setting up a database to store article data, and setting up a scheduler to run the
scraper at specified times.

We will use a PostgreSQL database to save the data that the scraper finds.

The following Python libraries are used in this notebook:

    Newspaper3k (or newspaper for Python 2) is a library that makes scraping news data easy.

    SQLAlchemy is designed to make it easy to interface with the Postgres database.

    APScheduler is a package that schedules events at regular intervals, such as scraping a website every day.

In [1]:
import newspaper
import sqlalchemy
import apscheduler

### Create Simple Web Scraper

With the necessary libraries installed, the code cell below specifies the URL of the
website we want to scrape. The newspaper package makes the scraping itself easy.

First, we define the URL that we wish to scrape.

Next, newspaper.build takes that URL and generates a newspaper object: starting from the
URL we gave, it builds a collection of the articles it finds.

The for loop prints the title of every article in the paper.articles list. The if statement
checks for the presence of a title because scraping can be messy, and not every scraped
item has every field.

In [2]:
import newspaper
paper = newspaper.build("http://www.chicagotribune.com/")

for article in paper.articles:
    if article.title is not None:
        print(article.title+"\n************")

unable to cache TLDs in file /usr/lib/python3.4/site-packages/tldextract/.tld_set: [Errno 30] Read-only file system: '/usr/lib/python3.4/site-packages/tldextract/.tld_set'


'Ugly' fruits and vegetables home delivery service coming to Chicago
************
Watchdog
************
Wells St. Market food hall in Loop reveals opening chef line-up
************


3
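
The `is not None` test above still lets through an article whose title is an empty string. A slightly stricter filter, sketched below, treats missing, empty, and whitespace-only titles alike. `has_title` and `titled_articles` are hypothetical helper names, and the `SimpleNamespace` objects only stand in for scraped articles so that the snippet runs without network access.

```python
from types import SimpleNamespace


def has_title(article):
    """Return True when the article carries a non-blank title."""
    title = getattr(article, "title", None)
    return bool(title and title.strip())


def titled_articles(articles):
    """Keep only the articles that have a usable title."""
    return [article for article in articles if has_title(article)]


# Stand-ins for scraped articles (assumption: real objects expose .title)
sample = [
    SimpleNamespace(title="Watchdog"),
    SimpleNamespace(title=""),
    SimpleNamespace(title=None),
    SimpleNamespace(title="   "),
]

for article in titled_articles(sample):
    print(article.title + "\n************")
```

With the scraper above, the same helper could be applied as `titled_articles(paper.articles)`.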

### Make postgres table for data

Amazon Redshift is a distributed data warehouse based on PostgreSQL, so it can be queried with PostgreSQL client libraries. We will store the articles in a Redshift database, which first requires creating a Redshift cluster.

In [3]:
import boto3
import random
import time
import json
import psycopg2
from getpass import getpass
from pandas import read_sql
import datetime


# Create a client object for the AWS Redshift service.
redshift_client = boto3.client('redshift')
pwd = getpass('password')

password········


Set the following variables: the name of the cluster, the name of the database, and the name of the security group.

In [4]:
############### Set the following variables ##################################

cluster_name="newsarticles"

database_name="articles"

Sec_group_name= "newsArticles_Sec_group"

Create an AWS EC2 client object to create a security group for the redshift cluster.

In [5]:
# Create client object for AWS EC2 service.
ec2_client = boto3.client(
    'ec2'
)

Create a security group. A Redshift cluster is built with EC2 instances as its nodes, so a security group is needed when launching the cluster. Store the security group ID in a variable.

In [13]:
sg = ec2_client.create_security_group(
    Description='security group for news articles redshift cluster',
    GroupName=Sec_group_name
)
Sec_group=sg["GroupId"]

The Redshift cluster listens on port 5439. Edit the security group's inbound rules to allow TCP traffic on port 5439 from any IP address (0.0.0.0/0).

In [14]:
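from botocore.exceptions import ClientError

sec_rule = "ALL TCP"
try:
    ec2_client.authorize_security_group_ingress(
        GroupId=Sec_group,
        IpPermissions=[
            {'IpProtocol': 'tcp',
             'FromPort': 5439,
             'ToPort': 5439,
             'IpRanges': [{'CidrIp': '0.0.0.0/0'}]},
        ],
    )
    print("Ingress "+sec_rule+" added")
except ClientError as err:
    # Only a duplicate rule means "already added"; re-raise any other error
    if err.response['Error']['Code'] == 'InvalidPermission.Duplicate':
        print(sec_rule+" already added")
    else:
        raise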

Ingress ALL TCP added
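
The rule above opens port 5439 to every address (`0.0.0.0/0`). One way to narrow it, sketched here under the assumption that you know the address range your notebook connects from, is to build the `IpPermissions` entry from a validated CIDR block. `redshift_ingress_rule` is a hypothetical helper, and `203.0.113.7/32` is a documentation-only example address.

```python
import ipaddress


def redshift_ingress_rule(cidr, port=5439):
    """Build one IpPermissions entry that allows TCP on `port` from `cidr`.

    Only IPv4 ranges are handled here, because the entry uses 'IpRanges'.
    """
    # IPv4Network raises ValueError for malformed input such as '10.0.0.300/32'
    network = ipaddress.IPv4Network(cidr, strict=False)
    return {
        'IpProtocol': 'tcp',
        'FromPort': port,
        'ToPort': port,
        'IpRanges': [{'CidrIp': str(network)}],
    }


rule = redshift_ingress_rule('203.0.113.7/32')
print(rule)
```

The resulting dictionary could be passed as `IpPermissions=[rule]` to `authorize_security_group_ingress` in place of the literal used above.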


The cell below deploys a Redshift cluster.
A default database named "articles" is created during the cluster deployment.
The parameter "NumberOfNodes" sets how many compute nodes the cluster should have.
Network traffic is controlled by the inbound rules of the security group newsArticles_Sec_group.
At the end of the session we will delete the cluster.

In [15]:
response = redshift_client.create_cluster(
    DBName=database_name,            # Optional. A default database named dev is created for the cluster. Optionally, 
                                     # specify a custom database name (e.g. mydb) to create an additional database.
    
    ClusterIdentifier=cluster_name,  # Unique key that identifies a cluster. It is stored as a lowercase string. 
    ClusterType='multi-node',        # single-node is other option
    NodeType='dc1.large',            # other options are dc1.8xlarge ds2.xlarge ds2.8xlarge ds1.xlarge ds1.8xlarge
    MasterUsername='skaf48',     
    MasterUserPassword=pwd,
    ClusterSubnetGroupName='default',
    VpcSecurityGroupIds=[
        Sec_group,
    ],
    ClusterParameterGroupName='default.redshift-1.0',  # Parameter group to associate with this cluster.  
    Port=5439,
    AllowVersionUpgrade=True,
    NumberOfNodes=2,   # Compute nodes store your data and execute your queries. In addition to your compute nodes, a leader 
                       # node will be added to your cluster, free of charge. The leader node is the access point for 
                       # ODBC/JDBC and generates the query plans executed on the compute nodes.
                       # The number of nodes should be a minimum of 
    
    PubliclyAccessible=True, # If True, the cluster is accessible from the public internet. If False, it is accessible only 
                             # from within the private VPC network
    EnhancedVpcRouting=False
)

The poll function below keeps checking the status of the cluster.
Once the cluster reaches the available (or final-snapshot) state, the function breaks out of its loop, indicating that the cluster is ready to use.

In [16]:
def poll_until_completed(client, cluster_id):
    delay = 2
    while True:
        # Get the cluster details
        cluster = client.describe_clusters(ClusterIdentifier=cluster_id)
        # Get the current status of cluster
        status = cluster['Clusters'][0]['ClusterStatus']
        # Get current system time 
        now = str(datetime.datetime.now().time())
        # Print a message with the status of the cluster at the current time
        print("cluster %s is %s at %s" % (cluster_id, status, now))
        
        # Stop polling once the cluster is in the available or final-snapshot state
        if status in ['available', 'final-snapshot']:
            break

        # Otherwise lengthen the delay by a random factor, wait, and check again
        delay *= random.uniform(1.1, 2.0)
        time.sleep(delay)

In [None]:
# Wait until the cluster status changes to available. You can't use the cluster until it is available
poll_until_completed(redshift_client, cluster_id=cluster_name)

cluster newsarticles is creating at 13:28:17.906461
cluster newsarticles is creating at 13:28:20.225458
cluster newsarticles is creating at 13:28:23.451706
cluster newsarticles is creating at 13:28:29.602724
cluster newsarticles is creating at 13:28:40.009695
cluster newsarticles is creating at 13:28:57.695425
cluster newsarticles is creating at 13:29:20.190191
cluster newsarticles is creating at 13:30:00.456962
cluster newsarticles is creating at 13:31:14.647397
cluster newsarticles is creating at 13:33:12.678090
cluster newsarticles is available at 13:36:49.097852
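
The polling loop above multiplies its delay on every pass and never gives up. A variant that caps the delay and stops after a deadline might look like the sketch below. `wait_for_status`, `max_delay`, and `timeout` are hypothetical names, and `get_status` is any zero-argument callable; the simulated status sequence stands in for a real cluster.

```python
import random
import time


def wait_for_status(get_status, done_states, delay=2.0, max_delay=60.0,
                    timeout=1800.0, sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns a value in done_states.

    Returns the final status, or raises TimeoutError once `timeout`
    seconds have passed. `sleep` and `clock` are injectable for testing.
    """
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in done_states:
            return status
        if clock() >= deadline:
            raise TimeoutError("still %r after %s seconds" % (status, timeout))
        sleep(delay)
        # Grow the delay with jitter, but never beyond max_delay
        delay = min(max_delay, delay * random.uniform(1.1, 2.0))


# Simulated status sequence standing in for a real cluster (assumption)
states = iter(['creating', 'creating', 'available'])
final = wait_for_status(lambda: next(states), {'available'},
                        sleep=lambda seconds: None)
print(final)
```

For the cluster, `get_status` could wrap the `describe_clusters` lookup used in `poll_until_completed`, returning `cluster['Clusters'][0]['ClusterStatus']`.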


To connect to the cluster we need its endpoint, that is, its DNS address.
The cell below prints the endpoint, the port on which the cluster listens for incoming requests,
and the database name of each cluster in the account.

In [7]:
cluster_end_point = ''
for cluster in redshift_client.describe_clusters()["Clusters"]:
    print("Cluster endpoint:",str(cluster["Endpoint"]["Address"])+"\nPort:",str(cluster["Endpoint"]["Port"])+"\nDatabase:",str(cluster["DBName"]))
    print('*'*40)
    # Keep the endpoint of the cluster created above, not whichever cluster is listed last
    if cluster["ClusterIdentifier"] == cluster_name:
        cluster_end_point = str(cluster["Endpoint"]["Address"])

Cluster endpoint: climate.cub8zvu6uo1j.us-east-1.redshift.amazonaws.com
Port: 5439
Database: climatecitydata
****************************************
Cluster endpoint: climatesdcc8b.cub8zvu6uo1j.us-east-1.redshift.amazonaws.com
Port: 5439
Database: climatecitydata_sdcc8b
****************************************
Cluster endpoint: newsarticles.cub8zvu6uo1j.us-east-1.redshift.amazonaws.com
Port: 5439
Database: articles
****************************************


The code cell below prints the public and private IP addresses of the nodes in the cluster.

All Amazon EC2 instances are assigned two IP addresses at launch: a private IP address and a public IP address, which are directly mapped to each other through Network Address Translation (NAT). Private IP addresses are reachable only from within the Amazon EC2 network. Public addresses are reachable from the Internet.

Amazon EC2 also provides an internal DNS name and a public DNS name that map to the private and public IP addresses, respectively. The internal DNS name can only be resolved within Amazon EC2. The public DNS name resolves to the public IP address outside the Amazon EC2 network, and to the private IP address within the Amazon EC2 network.

In [8]:
for cluster in redshift_client.describe_clusters()["Clusters"]:
    # Only list the nodes of the cluster created in this notebook
    if cluster["ClusterIdentifier"] == cluster_name:
        for ClusterNode in cluster["ClusterNodes"]:
            print(ClusterNode)

{'NodeRole': 'COMPUTE-0', 'PrivateIPAddress': '172.31.81.100', 'PublicIPAddress': '34.226.106.73'}
{'NodeRole': 'LEADER', 'PrivateIPAddress': '172.31.95.159', 'PublicIPAddress': '34.238.68.222'}
{'NodeRole': 'COMPUTE-1', 'PrivateIPAddress': '172.31.93.55', 'PublicIPAddress': '34.237.248.118'}


The dictionary below holds all the parameters needed to connect to the AWS Redshift cluster.
It is used to connect to the "articles" database in the "newsarticles" cluster on port 5439.

In [9]:
# Define the connection parameters
conn_string = { 'dbname': database_name, 
           'user':'skaf48',
           'pwd':pwd,
           'host':cluster_end_point,
           'port':'5439'
         }

The function below establishes a connection with the cluster using the connect method of the psycopg2 library.

In [10]:
def create_conn(config):
    try:
        # get a connection, if a connect cannot be made an exception will be raised here
        con=psycopg2.connect(dbname=config['dbname'], host=config['host'], 
                              port=config['port'], user=config['user'], 
                              password=config['pwd'])
        return con
    except Exception as err:
        print(err)

In [12]:
# The function call below establishes the connection to the Redshift cluster "newsarticles" using the psycopg2 library.
con = create_conn(config=conn_string)
# create_conn returns None when the connection attempt fails
if con is not None:
    print("Connected to DB!\n")

Connected!



In [13]:
con

<connection object at 0x7fa1a9b9e638; dsn: 'host=newsarticles.cub8zvu6uo1j.us-east-1.redshift.amazonaws.com dbname=articles port=5439 password=xxxxxxxxxxx user=skaf48', closed: 0>

The articles database was already created as the default database, so a table can be created in it directly.
The CREATE TABLE statement below defines a table with columns for the title and the body text.

In [14]:
# Create table query
statement = 'CREATE TABLE articles_table(title varchar, body varchar(max));'

### Cursors

Rather than executing a whole query at once, it is possible to set up a cursor that encapsulates the query, and then read the query result a few rows at a time. One reason for doing this is to avoid memory overrun when the result contains a large number of rows. PL/pgSQL users do not normally need to worry about that, since FOR loops automatically use a cursor internally to avoid memory problems. In psycopg2, a cursor is also the object through which every statement is executed.
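
To illustrate reading a result a few rows at a time, the sketch below calls `fetchmany` on a cursor. It runs against an in-memory SQLite database purely so that it needs no cluster (an assumption made for the example); psycopg2 cursors follow the same Python DB-API and also provide `fetchmany`. The `demo` table and the `in_batches` helper are hypothetical.

```python
import sqlite3


def in_batches(cursor, size):
    """Yield lists of at most `size` rows until the result is exhausted."""
    while True:
        rows = cursor.fetchmany(size)
        if not rows:
            break
        yield rows


# In-memory SQLite stands in for the cluster connection (assumption).
# Note: sqlite3 uses ? placeholders, whereas psycopg2 uses %s.
demo_con = sqlite3.connect(":memory:")
demo_cur = demo_con.cursor()
demo_cur.execute("CREATE TABLE demo(title TEXT)")
demo_cur.executemany("INSERT INTO demo VALUES (?)",
                     [("article %d" % i,) for i in range(7)])

demo_cur.execute("SELECT title FROM demo ORDER BY title")
batches = list(in_batches(demo_cur, 3))
for batch in batches:
    print(len(batch), "rows")
demo_con.close()
```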

In [15]:
# con.cursor will return a cursor object, you can use this cursor to perform queries
cur = con.cursor()

In [16]:
cur.execute(statement)
con.commit()

In [23]:
# con.rollback()

In [17]:
# execute a Query using the con object

df = read_sql("""select column_name, data_type, character_maximum_length
from INFORMATION_SCHEMA.COLUMNS where table_name = 'articles_table';""",con=con)
df

Unnamed: 0,column_name,data_type,character_maximum_length
0,body,character varying,65535
1,title,character varying,256


### Insert from Scraper to Database

In [18]:

# Keep only the articles that have a title
articles = [article for article in paper.articles if article.title is not None]

for article in articles:
    # download the HTML of the article URL
    article.download()

    # parse the downloaded HTML into readable text
    article.parse()

    # clear newlines out of body text
    body = article.text.replace('\n', ' ')

    try:
        # insert into database and commit so the row is saved permanently
        cur.execute("insert into articles_table values(%s,%s);", (article.title, body))
        con.commit()
    except Exception as e:
        print('database exception', e)
        # a failed statement aborts the transaction; roll back so later inserts can run
        con.rollback()
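
The column listing earlier reports a maximum length of 256 for `title` and 65535 for `body`, and Redshift counts `varchar` lengths in bytes, so an over-long value would trigger the database exception handled above. One option, sketched here, is to trim values before inserting them. `fit_varchar` is a hypothetical helper, and truncating (rather than skipping the article) is an assumption about what the application wants.

```python
def fit_varchar(text, max_bytes):
    """Trim text so that its UTF-8 encoding fits in max_bytes."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops a multi-byte character cut in half at the end
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


TITLE_BYTES = 256    # limits taken from the column listing above
BODY_BYTES = 65535

title = fit_varchar("Wells St. Market food hall in Loop", TITLE_BYTES)
body = fit_varchar("é" * 40000, BODY_BYTES)   # 80000 bytes before trimming
print(len(title.encode("utf-8")), len(body.encode("utf-8")))
```

In the insert loop, the values passed to `cur.execute` could then be `fit_varchar(article.title, TITLE_BYTES)` and `fit_varchar(body, BODY_BYTES)`.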

In [27]:
# con.rollback()

In [20]:
df = read_sql("select * from articles_table limit 5;",con=con)
df

Unnamed: 0,title,body
0,'Ugly' fruits and vegetables home delivery ser...,"Beauty is only skin deep, especially with frui..."
1,Chicago Tribune,When Lasheena Hall opened her mail to find yet...
2,Wells St. Market food hall in Loop reveals ope...,The highly anticipated Loop food hall Wells St...


### Delete the cluster

In [21]:
response = redshift_client.delete_cluster(
    ClusterIdentifier=cluster_name,
    SkipFinalClusterSnapshot=True
)
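
Deleting the cluster does not remove the security group created earlier, and a group that is still attached to a cluster typically cannot be deleted. A possible cleanup sequence is sketched below: close the database connection, wait for the deletion to finish, then delete the group. `finish_cleanup` is a hypothetical helper; it assumes the boto3 Redshift waiter named `cluster_deleted` and the EC2 call `delete_security_group`, and the small `Recorder` class exists only so that the example runs without AWS access.

```python
def finish_cleanup(con, redshift_client, ec2_client, cluster_id, group_id):
    """Close the connection, wait for cluster deletion, then drop the group."""
    if con is not None:
        con.close()
    # Blocks until the cluster can no longer be found
    waiter = redshift_client.get_waiter('cluster_deleted')
    waiter.wait(ClusterIdentifier=cluster_id)
    ec2_client.delete_security_group(GroupId=group_id)


class Recorder:
    """Stand-in client that records the calls made on it (for the dry run)."""

    def __init__(self, log, name):
        self.log, self.name = log, name

    def __getattr__(self, method):
        def call(*args, **kwargs):
            self.log.append((self.name, method, args, kwargs))
            return self  # lets get_waiter(...).wait(...) chain
        return call


calls = []
finish_cleanup(Recorder(calls, 'con'), Recorder(calls, 'redshift'),
               Recorder(calls, 'ec2'), 'newsarticles', 'sg-hypothetical')
for name, method, args, kwargs in calls:
    print(name, method, args, kwargs)
```

With the notebook's own objects, the call would be `finish_cleanup(con, redshift_client, ec2_client, cluster_name, Sec_group)`.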