# Introduction to web scraping with Python
Code examples accompanying the [blog post](https://datawhatnow.com/introduction-web-scraping-python/) on web scraping with Python.

## Initialise libraries
For web scraping we will use [requests](http://docs.python-requests.org/en/master/) and [lxml](http://lxml.de/).

In [1]:
import requests
from lxml import html

## Web scraping
The [datawhatnow](https://www.datawhatnow.com) website is used for scraping. First we request the website's HTML code using requests, then we parse it with lxml. To keep this simple, both steps are wrapped in the **get_parsed_page** function.

In [2]:
url = 'https://www.datawhatnow.com'

def get_parsed_page(url):
    """Return the content of the website on the given url in
    a parsed lxml format that is easy to query."""
    
    response = requests.get(url)
    parsed_page = html.fromstring(response.content)
    return parsed_page

parsed_page = get_parsed_page(url)
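
As written, **get_parsed_page** parses whatever the server returns, including error pages, and waits indefinitely if the server never answers. The following is a minimal sketch of a more defensive variant. The name `get_parsed_page_checked` and the `timeout` and `session` parameters are assumptions made for illustration and are not part of the original notebook.

```python
import requests
from lxml import html


def get_parsed_page_checked(url, timeout=10, session=None):
    """Hypothetical variant of get_parsed_page that fails loudly on HTTP errors.

    `timeout` and `session` are illustrative parameters, not part of the
    original function.
    """
    # Use a caller-supplied session if given, otherwise the requests module itself
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page
    response.raise_for_status()
    return html.fromstring(response.content)
```

Passing a `requests.Session()` as `session` reuses the underlying connection when many pages from the same site are fetched, which is the situation in the crawling section.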

Data is extracted using XPath queries. The website title is the text of an `a` tag whose parent is an `h1` tag, which the query '//h1/a/text()' selects.

In [3]:
# Print website title
parsed_page.xpath('//h1/a/text()')

['Data, what now?']
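
The same kind of query can be tried offline on a small HTML string, which makes it easier to see what each XPath expression selects. The snippet below is a sketch only: `sample_html` and its contents are made up for illustration and are not the real markup of the website.

```python
from lxml import html

# Made-up HTML snippet with a title link in <h1> and post links in <h2>
sample_html = """
<html><body>
  <h1><a href="/">Example site</a></h1>
  <h2><a href="/first-post/">First post</a></h2>
  <h2><a href="/second-post/">Second post</a></h2>
</body></html>
"""

sample_page = html.fromstring(sample_html)

site_title = sample_page.xpath('//h1/a/text()')  # text of <a> directly under <h1>
post_names = sample_page.xpath('//h2/a/text()')  # text of <a> directly under <h2>
post_links = sample_page.xpath('//h2/a/@href')   # href attribute instead of text

print(site_title)
print(post_names)
print(post_links)
```

Each query returns a list, even when only one element matches, which is why the website title is printed as a one-element list.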

In [4]:
# Print post names
parsed_page.xpath('//h2/a/text()')

['SimHash for question deduplication',
 'Feature importance and why it’s important']


## Crawling
The same approach can be used to scrape multiple pages: collect the post links from the front page, then fetch and query each post.

In [5]:
# Get the paragraph titles (h3 headings) of each blog post linked from the front page
post_urls = parsed_page.xpath('//h2//a/@href')

for post_url in post_urls:
    print('Post url:', post_url)
    
    parsed_post_page = get_parsed_page(post_url)
    paragraph_titles = parsed_post_page.xpath('//h3/text()')
    
    paragraph_titles = map(lambda x: ' \n  ' + x, paragraph_titles)    
    print(''.join(paragraph_titles) + '\n')

Post url: https://datawhatnow.com/simhash-question-deduplicatoin/
 
  SimHash 
  Features 
  Model performance 
  Conclusion 
  References 
  Leave a Reply  
  GitHub 
  Newsletter 
  Recent Posts 
  Archives

Post url: https://datawhatnow.com/feature-importance/
 
  Data exploration 
  Feature engineering 
  Baseline model performance 
  Feature importance 
  Model performance with feature importance analysis 
  Conclusion 
  Leave a Reply  
  GitHub 
  Newsletter 
  Recent Posts 
  Archives



In [6]:
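# Fixed XPath query: only select h3 headings directly inside the post body
# (the div with class "entry-content"), which skips the comment form and sidebar headings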

for post_url in post_urls:
    print('Post url:', post_url)
    
    parsed_post_page = get_parsed_page(post_url)
    paragraph_title_xpath = '//div[@class="entry-content"]/h3/text()'
    paragraph_titles = parsed_post_page.xpath(paragraph_title_xpath)
    
    paragraph_titles = map(lambda x: ' \n  ' + x, paragraph_titles)
    print(''.join(paragraph_titles) + '\n')

Post url: https://datawhatnow.com/simhash-question-deduplicatoin/
 
  SimHash 
  Features 
  Model performance 
  Conclusion 
  References

Post url: https://datawhatnow.com/feature-importance/
 
  Data exploration 
  Feature engineering 
  Baseline model performance 
  Feature importance 
  Model performance with feature importance analysis 
  Conclusion
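
The difference between the unscoped '//h3/text()' query and the query restricted to the post body can be reproduced offline. The snippet below is a minimal sketch: the HTML in `sample_post` and the `sidebar` class name are made up for illustration, and only the `entry-content` class comes from the query used above.

```python
from lxml import html

# Made-up page: two headings in the post body, one in a sidebar widget
sample_post = """
<html><body>
  <div class="entry-content">
    <h3>Introduction</h3>
    <p>Some text.</p>
    <h3>Conclusion</h3>
  </div>
  <div class="sidebar">
    <h3>Recent Posts</h3>
  </div>
</body></html>
"""

page = html.fromstring(sample_post)

# Every h3 on the page, including the sidebar heading
all_h3 = page.xpath('//h3/text()')
# Only h3 elements that are direct children of the post body div
content_h3 = page.xpath('//div[@class="entry-content"]/h3/text()')

print(all_h3)
print(content_h3)
```

Note that `[@class="entry-content"]` matches only when the class attribute is exactly that string. If the element carried additional classes, a looser test such as `contains(@class, "entry-content")` would be needed.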

