# How to make a request to open a URL in Python

Below I will explain, step by step, one approach (of the many possible) to OPEN / COLLECT / FILTER / STORE the data contained in a webpage.

In these examples we will use two external libraries:
- Requests http://docs.python-requests.org/en/master/
- LXML http://lxml.de/xpathxslt.html

Here is an example of a basic attempt to reach the google.com webpage:

In [5]:

# import a library that lets us make HTTP requests
import requests

# we store in this variable a string with the URL of the website to open
url = 'http://www.google.com'

# send the request and store the response in the r variable
# get is one of the functions of the requests module
r = requests.get(url)
# show the status code
r.status_code


200

In the example, the output was the number 200.
200 means the webpage was reached successfully.

Follow this link for a complete list of the response status codes of the HTTP protocol:
- Status Codes https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
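To get a feeling for what the other codes mean without leaving Python, here is a minimal sketch that uses the standard library's HTTPStatus table. The helper name describe_status and the family labels are only illustrative, they are not part of Requests.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    try:
        # HTTPStatus knows the standard phrase for every registered code
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"
    # the first digit of the code tells us which family it belongs to
    families = {1: "informational", 2: "success", 3: "redirection",
                4: "client error", 5: "server error"}
    family = families.get(code // 100, "unknown")
    return f"{code} {phrase} ({family})"


for code in (200, 301, 404, 500):
    print(describe_status(code))
```

In a real script you would pass r.status_code to the helper after each request.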

Below, we will try to open a random page (in this case a page from Wikipedia) and extract its source code.

In [2]:

# like in the previous example, we store in this variable a string with the URL of the website to open
url = 'https://en.wikipedia.org/wiki/Vice_Squad'

# send the request and store the response in the r variable
# get is one of the functions of the requests module
r = requests.get(url)
# get the text of the r response
r.text


'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Vice Squad - Wikipedia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Vice_Squad","wgTitle":"Vice Squad","wgCurRevisionId":847735856,"wgRevisionId":847735856,"wgArticleId":1774263,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["EngvarB from April 2013","Use dmy dates from April 2013","Articles with hCards","Wikipedia articles with ISNI identifiers","Wikipedia articles with MusicBrainz identifiers","English punk rock groups","Musical groups from Bristol","Street punk groups","Musical groups established in 1978"],"wgBreakFrames":false,"wgPageContentLa


The output of the last example is the same as what you get by opening the URL in your browser and
pressing __alt+cmd+U__ (this works in Chrome on a Mac; Firefox and Safari use a similar combination, and on Windows it is typically ctrl+U; we will check this during the workshop).

To transform the text we collected into structured HTML, we need to __import__ a library that will help us do that.


In [3]:

# we add an HTML parser, it will allow us to parse the content
from lxml import html

# we transform the plain text into structured HTML
tree = html.fromstring(r.text)

# if we print the tree now, we see it has been transformed into an object
print(tree)


<Element html at 0x107ba7e58>



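In the last block of code we transformed a bunch of text into structured HTML.
We tried to print it out, but instead of something fancy we got this:

__Element html at 0x107ba7e58__

Why? Because print only shows the default representation of the object: the tag of the root element (html) and the memory address where the object lives, which can change from run to run.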
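If you want to look inside the object instead of at its address, here is a minimal sketch. It parses a tiny hand-written page (an assumption, so that the example runs without any network access) rather than the Wikipedia response; the variable names page and markup are only illustrative.

```python
from lxml import html

# a tiny hand-written page, so the example runs without any network access
page = ("<html><head><title>Demo</title></head>"
        "<body><h1 id='firstHeading'>Vice Squad</h1></body></html>")
tree = html.fromstring(page)

# the tag name of the root element
print(tree.tag)
# how many direct children the root has (here: head and body)
print(len(tree))
# serialize the tree back to markup to see what is inside it
markup = html.tostring(tree, encoding="unicode")
print(markup)
```

The same three calls work on the tree built from r.text, the output is just much longer.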

Below, we will extract a specific piece of content from the Python object that contains the HTML by giving it an XPath.


In [4]:
# what is the path for the title?
path = '//*[@id="firstHeading"]/text()'


# without the final /text() the path selects the whole element: //*[@id="firstHeading"]

# remember that an XPath can reach different types of information, you can use the last part of the path to
# say which information you want: e.g. @href text() @src

# we store in this variable the content of the title
title = tree.xpath(path)

# and here we go
print(title)

['Vice Squad']
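To make the role of the last part of the path more concrete, here is a small sketch on a hand-written snippet. The snippet, its ids (home, logo) and the variable names are assumptions made up for this illustration; only the firstHeading id comes from the Wikipedia page above.

```python
from lxml import html

# a hand-written snippet with a heading, a link and an image
snippet = """
<html><body>
  <h1 id="firstHeading">Vice Squad</h1>
  <a id="home" href="/wiki/Main_Page">Main page</a>
  <img id="logo" src="/static/logo.png" alt="logo"/>
</body></html>
"""
tree = html.fromstring(snippet)

# text() returns the text inside the element
heading_text = tree.xpath('//*[@id="firstHeading"]/text()')
# @href returns the value of the href attribute
link_target = tree.xpath('//*[@id="home"]/@href')
# @src returns the value of the src attribute
image_source = tree.xpath('//*[@id="logo"]/@src')
# without a final part we get the element objects themselves
heading_element = tree.xpath('//*[@id="firstHeading"]')

print(heading_text, link_target, image_source, heading_element[0].tag)
```

Note that every call returns a list, even when there is only one match.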


## We did it once, now let's scale up our script

Now we need to scale our script so that it can repeat the process every time
with different URLs that use the same HTML structure.

Here is the example:

In [6]:
# time lets us pause between one request and the next
import time

# we put in a list a set of URLs to visit ;)
urls = ["https://en.wikipedia.org/wiki/Yannick_Keith_Liz%C3%A9",
        "https://en.wikipedia.org/wiki/Jeffrey_M._Lacker",
        "https://en.wikipedia.org/wiki/SRCM_Mod._35",
        "https://en.wikipedia.org/wiki/ServiceOps"]

# the XPath will be the same for all of them, if the website template changes we will need to provide a different XPath
title_path = '//*[@id="firstHeading"]/text()'

# loop our previous script through all the URLs
for url in urls:
    # request to open the url
    r = requests.get(url)
    # transform the plain text into structured HTML
    tree = html.fromstring(r.text)
    # parse the tree and get the title by looking for that path
    title = tree.xpath(title_path)
    # print the title
    print(title)
    # wait 2 seconds before the next request, to be gentle with the server
    time.sleep(2)
    
    

['Yannick Keith Lizé']
['Jeffrey M. Lacker']
['SRCM Mod. 35']
['ServiceOps']


As you can see, the title is stored in a list, e.g. ['Yannick Keith Lizé'].
To reach the string inside the list we will later use title[0], which reads the element at position 0 of the list.
We will do it only after making sure that the list has at least 1 element, otherwise the script will crash.
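A minimal sketch of that check, wrapped in a small helper. The name first_or_default is made up for this example, it is not part of lxml.

```python
def first_or_default(results, default=None):
    """Return the first item of an XPath result list, or a default if the list is empty."""
    # an empty list is falsy, so this never raises IndexError
    return results[0] if results else default


# a list like the ones returned by tree.xpath(...)
print(first_or_default(['Yannick Keith Lizé']))
# an empty result, as we would get when the path matches nothing
print(first_or_default([], default='no title found'))
```

Calling first_or_default(tree.xpath(title_path)) inside the loop would give a plain string for every page that has a title.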

## Collect multiple data from the same page

In the next block of code we will play with Macrumors. 

For the sake of the exercise we will run the script only a few times:
accessing the same website too many times from the same source IP will probably get you banned from the site (for the entire day).
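One way to stay on the safe side is to enforce a minimum pause between two requests. The following is only a sketch: the Throttle class and its min_interval parameter are made up for this example, and the 0.2 seconds are chosen so that the demo runs quickly (with a real website a couple of seconds is a more reasonable value).

```python
import time


class Throttle:
    """Make sure at least min_interval seconds pass between two calls to wait()."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last_call = None

    def wait(self):
        now = time.monotonic()
        if self._last_call is not None:
            # how much of the minimum interval is still missing
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


throttle = Throttle(min_interval=0.2)
start = time.monotonic()
for page in ("first", "second", "third"):
    throttle.wait()
    # a real script would call requests.get(...) here
elapsed = time.monotonic() - start
print(round(elapsed, 1))
```

The first call does not wait at all, so three calls take roughly two intervals.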

With the next script we will try to get all the links that go straight to the posts listed on the homepage of the website.

(Sorry Macrumors, it's just an exercise, nothing personal.)

Here we go:

In [7]:
# strategies for pagination
# website: 'macrumors.com'

url = "https://www.macrumors.com/"

# a) get the first page, and print all the post links

# First  Title  //*[@id="content"]/div/div[3]/h2
# Second Title  //*[@id="content"]/div/div[5]/h2
# Third  Title  //*[@id="content"]/div/div[7]/h2
# Fourth Title  //*[@id="content"]/div/div[9]/h2
# Fifth  Title  //*[@id="content"]/div/div[10]/h2
# Sixth  Title  //*[@id="content"]/div/div[11]/h2
#               //*[@id="content"]/div/div[12]/h2
#               //*[@id="content"]/div/div[13]/h2
#               //*[@id="content"]/div/div[14]/h2
#               //*[@id="content"]/div/div[15]/h2
#               //*[@id="content"]/div/div[16]/h2
# Last   Title  //*[@id="content"]/div/div[17]/h2
# Next Page     //*[@id="content"]/div/div[18]/div[3]/a/@href

# in this list we store the XPaths we want to search for in the HTML to extract some data
paths = [
        '//*[@id="content"]/div/div[3]/h2/a/@href',
        '//*[@id="content"]/div/div[5]/h2/a/@href',
        '//*[@id="content"]/div/div[7]/h2/a/@href',
        '//*[@id="content"]/div/div[9]/h2/a/@href',
        '//*[@id="content"]/div/div[10]/h2/a/@href',
        '//*[@id="content"]/div/div[11]/h2/a/@href',
        '//*[@id="content"]/div/div[12]/h2/a/@href',
        '//*[@id="content"]/div/div[13]/h2/a/@href',
        '//*[@id="content"]/div/div[14]/h2/a/@href',
        '//*[@id="content"]/div/div[15]/h2/a/@href',
        '//*[@id="content"]/div/div[16]/h2/a/@href',
        '//*[@id="content"]/div/div[17]/h2/a/@href'
        ]

# here we store the XPath of the next page link
nextpage_path = '//*[@id="content"]/div/div[18]/div[3]/a/@href'

# start the request
r = requests.get(url)
# transform the text into HTML
tree = html.fromstring(r.text)
# go through all the paths in the list
for path in paths:
    # get the result for this XPath (a new name, so we do not overwrite url)
    link = tree.xpath(path)
    # print the link
    print(link)

# now extract the next page link from the tree
nextpage = tree.xpath(nextpage_path)
# print the link
print(nextpage)

['//www.macrumors.com/2018/06/29/microsoft-dual-screen-pocket-surface/']
['//www.macrumors.com/2018/06/29/2018-iphones-embedded-apple-sim/']
['//www.macrumors.com/2018/06/29/apple-maps-to-be-rebuilt/']
['//www.macrumors.com/2018/06/28/five-mac-apps-june-2018/']
['//www.macrumors.com/2018/06/28/att-911-call-outage-fine/']
['//www.macrumors.com/2018/06/28/lg-supply-2-4-million-oled-panels-iphone-x-plus/']
['//www.macrumors.com/2018/06/27/apple-streaming-service-bundle-tv-music-news/']
['//www.macrumors.com/2018/06/27/att-doubles-administrative-fees/']
['//www.macrumors.com/2018/06/27/apple-samsung-patent-dispute-settled/']
['//www.macrumors.com/2018/06/27/samsung-galaxy-note-9-coming-in-august/']
['//www.macrumors.com/2018/06/27/apple-headphone-jack-adapter-top-seller-best-buy/']
['//www.macrumors.com/2018/06/27/play-impossible-gameball/']
['/2/']
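The links above come in two shapes: protocol-relative ones (starting with //www) and a path-only one (/2/). Here is a minimal sketch of how both can be turned into full addresses with urljoin from the standard library. This is one possible approach, not the only one; page_url and the variable names are only illustrative.

```python
from urllib.parse import urljoin

# the page the links were found on
page_url = "https://www.macrumors.com/"

# a protocol-relative link inherits the scheme (https) of the page
post = urljoin(page_url, "//www.macrumors.com/2018/06/29/apple-maps-to-be-rebuilt/")
# a path-only link inherits both the scheme and the host of the page
next_page = urljoin(page_url, "/2/")

print(post)
print(next_page)
```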


## Keep the script running as long as it finds a next page to open (well... more or less)
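Before the full script, the idea of "keep going while there is a next page" can be sketched without touching any website. Everything here is an assumption made for illustration: the function crawl, its get_next_page parameter and the fake_site dictionary stand in for the real request and XPath steps, and a loop is used in place of a function that calls itself.

```python
from collections import deque


def crawl(start_url, get_next_page, max_pages=5):
    """Visit pages one after another until there is no next page or max_pages is reached."""
    to_visit = deque([start_url])
    visited = []
    while to_visit and len(visited) < max_pages:
        # take the first page waiting in the queue
        url = to_visit.popleft()
        visited.append(url)
        # ask where the next page is (None means there is none)
        next_url = get_next_page(url)
        if next_url is not None and next_url not in visited:
            to_visit.append(next_url)
    return visited


# a pretend website: each page points to the next one, the last one points nowhere
fake_site = {"/": "/2/", "/2/": "/3/", "/3/": None}

print(crawl("/", fake_site.get))
print(crawl("/", fake_site.get, max_pages=2))
```

The loop stops for two reasons, just like the script below is meant to: either the page limit is reached or no next page is found.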







In [8]:
# instead of only printing, collect all the post links and the URLs of the pages you scraped

start_url = "https://www.macrumors.com"

# here we create some empty lists that will collect all your data
urls_to_scrape = []
urls_scraped = []
urls_posts = []

# this list stores all the paths we need to parse on the homepage
post_paths = [ '//*[@id="content"]/div/div[3]/h2/a/@href',
               '//*[@id="content"]/div/div[5]/h2/a/@href',
               '//*[@id="content"]/div/div[7]/h2/a/@href',
               '//*[@id="content"]/div/div[9]/h2/a/@href',
               '//*[@id="content"]/div/div[10]/h2/a/@href',
               '//*[@id="content"]/div/div[11]/h2/a/@href',
               '//*[@id="content"]/div/div[12]/h2/a/@href',
               '//*[@id="content"]/div/div[13]/h2/a/@href',
               '//*[@id="content"]/div/div[14]/h2/a/@href',
               '//*[@id="content"]/div/div[15]/h2/a/@href',
               '//*[@id="content"]/div/div[16]/h2/a/@href',
               '//*[@id="content"]/div/div[17]/h2/a/@href'
             ]

# this is the path of the link to the next page
nextpage_path = '//*[@id="content"]/div/div[18]/div[3]/a/@href'

# a counter, we will use it to stop the script after a certain point
counter = 0
run_n_times = 5


def main():

    # counter is defined outside the function, so we declare it global to update it
    global counter

    # if the list of links to scrape is not empty
    if urls_to_scrape:

        # take the first element of the list
        url = urls_to_scrape[0]
        # open the url
        r = requests.get(url)
        # transform the plain text into HTML
        tree = html.fromstring(r.text)
        # for every path in post_paths
        for path in post_paths:
            # extract the url
            u = tree.xpath(path)
            # if not empty
            if u:
                # the url comes with //www, we replace that string to get a proper address
                u = u[0].replace("//www", 'http://www')
                # add the url to the posts url list
                urls_posts.append(u)
                # print the address
                print(u)

        # get the next page (it comes in '/2/' format)
        nextpage = tree.xpath(nextpage_path)
        # only if a next page was found (otherwise nextpage[0] would crash)
        if nextpage:
            # join the start_url 'https://www.macrumors.com' with the '/2/'
            urls_to_scrape.append(start_url + nextpage[0])
        # then we add the current url to the ones we already scraped
        urls_scraped.append(urls_to_scrape[0])
        # now we take out the first element of the list of urls to scrape (it was the homepage)
        urls_to_scrape.pop(0)
        # print the urls scraped so far
        print(urls_scraped)
        # add one to the counter
        counter = counter + 1

        # stop after run_n_times pages, or when there is no next page left
        if counter < run_n_times and urls_to_scrape:
            # run main again
            main()

    else:
        # if the list is empty, add the start url
        urls_to_scrape.append(start_url)
        # then run main
        main()


# here we start the script
main()




http://www.macrumors.com/2018/06/29/microsoft-dual-screen-pocket-surface/
http://www.macrumors.com/2018/06/29/2018-iphones-embedded-apple-sim/
http://www.macrumors.com/2018/06/29/apple-maps-to-be-rebuilt/
http://www.macrumors.com/2018/06/28/five-mac-apps-june-2018/
http://www.macrumors.com/2018/06/28/att-911-call-outage-fine/
http://www.macrumors.com/2018/06/28/lg-supply-2-4-million-oled-panels-iphone-x-plus/
http://www.macrumors.com/2018/06/27/apple-streaming-service-bundle-tv-music-news/
http://www.macrumors.com/2018/06/27/att-doubles-administrative-fees/
http://www.macrumors.com/2018/06/27/apple-samsung-patent-dispute-settled/
http://www.macrumors.com/2018/06/27/samsung-galaxy-note-9-coming-in-august/
http://www.macrumors.com/2018/06/27/apple-headphone-jack-adapter-top-seller-best-buy/
http://www.macrumors.com/2018/06/27/play-impossible-gameball/
['https://www.macrumors.com']
http://www.macrumors.com/2018/06/26/nikkei-airpods-charging-case-plus-iphone/
http://www.macrumors.com/2018/