# Scraping a website using Beautiful Soup

In this notebook, we are going to collect data from the website: https://realpython.com/products/

This website is **static**, which makes our lives easier when we fetch the HTTP response and parse it.
**Dynamic** websites, which build their content with JavaScript, can be trickier to handle. You might need more advanced tools such as Selenium or Scrapy (covered in other notebooks).

## Packages Used


1.   requests
2.   beautifulsoup4 (imported as `bs4`)
3.   pandas


## Output type

*   CSV file


## Steps to follow


1.   Request the link via the requests module
2.   Get the content and turn it into soup
3.   Parse the soup and retrieve the data
4.   Store the data on your local disk
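
The steps above can be sketched end to end on a small inline HTML string, so that nothing is fetched from the network. The markup, the helpers `extract_rows` and `rows_to_csv`, and the column names are illustrative assumptions, not the real page structure. In practice, step 1 would supply the HTML via `requests.get(url).content`.

```python
import csv
import io

from bs4 import BeautifulSoup

# Step 1 stand-in: a hypothetical HTML snippet instead of a real HTTP response
SAMPLE_HTML = """
<ul>
  <li class="item"><h2>Book A</h2><p>First description</p></li>
  <li class="item"><h2>Book B</h2><p>Second description</p></li>
</ul>
"""

def extract_rows(html):
    """Steps 2 and 3: turn the HTML into soup, then pull out one row per item."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("li", class_="item"):
        title = item.find("h2").get_text(strip=True)
        description = item.find("p").get_text(strip=True)
        rows.append([title, description])
    return rows

def rows_to_csv(rows):
    """Step 4: serialize the rows as CSV text (write it to a file to store it)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Title", "Description"])
    writer.writerows(rows)
    return buffer.getvalue()

rows = extract_rows(SAMPLE_HTML)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Keeping the parsing and the saving in separate functions makes it possible to test the parsing on saved HTML without hitting the website each time.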






In [4]:
# Install the beautifulsoup4 package (imported as bs4)
!pip install beautifulsoup4



In [5]:
# Importing the packages we are going to use 
import requests
from bs4 import BeautifulSoup as bs


In [80]:
# URL of the products page on Real Python
url = "https://realpython.com/products/"
# Folder where the downloaded product images are saved
images_folder = "images"

In [81]:
def get_soup(url):
  '''
  Request the url and turn the response into soup
  Return : soup object
  '''
  # Request the HTML
  page = requests.get(url)
  # Stop early on HTTP errors (4xx/5xx) instead of parsing an error page
  page.raise_for_status()
  # Turn the HTML into soup
  soup = bs(page.content, "html.parser")
  return soup

In [82]:
import os

def download_image(url, name, folder):
  """
  Download the image and save it to the folder
  Return : None
  """
  img = requests.get(url).content
  # Create the target folder if it does not exist yet
  os.makedirs(folder, exist_ok=True)
  # Build the destination path from the folder and the file name
  path_of_save = os.path.join(folder, name)
  with open(path_of_save, 'wb') as handler:
    handler.write(img)
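
The `name` argument has to be derived from the image link somewhere. One possible way, sketched here with the standard library only, is to take the last path segment of the URL and ignore any query string. The helper `image_name_from_url` and the example URL are hypothetical and not part of the notebook.

```python
import posixpath
from urllib.parse import urlsplit

def image_name_from_url(url):
    """Return the last path segment of a URL, without query string or fragment."""
    path = urlsplit(url.strip()).path
    return posixpath.basename(path)

# Hypothetical image link used only for illustration
example = "https://example.com/media/pic.abc123.png?width=200"
print(image_name_from_url(example))
```

Unlike slicing the link on dots, this keeps the file's original extension rather than assuming every image is a `.png`.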

In [83]:

def get_products(url):
  '''
  List all the available products on the url page
  Also downloads each product image into images_folder
  Return : list of [name, description, image name] rows
  '''
  results = []
  soup = get_soup(url)
  # Get the list of divs that contain the products
  products = soup.find_all("div", class_="mt-5 mb-3 alert")
  for product in products:
    # The name of the product is in the first paragraph
    name = product.find('p').text
    # The image link is in the src attribute of the img tag
    img_link = product.find('img').get("src")
    # Keep the second-to-last dot-separated piece of the link as the file name
    img_name = img_link.strip().split(".")[-2] + '.png'
    # The description is in a div with the class attribute col
    description = product.find("div", class_="col").find('p').text
    download_image(img_link, img_name, images_folder)
    results.append([name, description, img_name])

  return results
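
One detail of the `find_all` call above is worth knowing: when `class_` is given a string with several class names, Beautiful Soup compares it against the exact value of the `class` attribute, so the same classes written in a different order do not match. A CSS selector via `select` is order-independent. The snippet below is a minimal sketch on made-up markup; whether the real page ever reorders its classes is not something this notebook establishes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same three classes, written in two different orders
html = """
<div class="mt-5 mb-3 alert"><p>Product one</p></div>
<div class="alert mt-5 mb-3"><p>Product two</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact string match on the class attribute: only the first div is found
exact = soup.find_all("div", class_="mt-5 mb-3 alert")

# CSS selector: matches any div carrying all three classes, in any order
by_css = soup.select("div.mt-5.mb-3.alert")

print(len(exact), len(by_css))
```

If the scraper suddenly returns an empty list, comparing the two counts on the live page is a quick way to tell whether the class order changed.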
    

In [84]:
products = get_products(url)

In [85]:
# Turning list into dataframe
import pandas as pd

df = pd.DataFrame(products, columns=['Title', 'Description', 'Image_name'])

df

| Unnamed: 0 | Title | Description | Image_name |
|---|---|---|---|
| 0 | Real Python Membership:Master Real-World Pytho... | Level up with unlimited access to our vast lib... | daf71ae6460c.png |
| 1 | Python Basics:A Practical Introduction to Pyth... | Go from beginner to intermediate in Python wit... | c2d73fad5510.png |
| 2 | Write Cleaner &More Pythonic Code | Discover Python’s best practices with simple e... | 9a0964753d24.png |
| 3 | CPython Internals:Your Guide to the Python 3 I... | Unlock the inner workings of the Python langua... | 6f2dc2c60c45.png |
| 4 | Learn Python Programming, By Example | Learn Python and web development from the grou... | 59e4a237633e.png |
| 5 | Leverage Python’s Third-Party Package Ecosyste... | Become a more efficient coder and get your Pyt... | dc1cb874c7a9.png |
| 6 | Optimize Your Python Workflow for Maximum Prod... | Set up a great Python development environment ... | 96a1d0615ac0.png |
| 7 | A Peer-to-Peer Learning Community for Python E... | PythonistaCafe is an invite-only, online commu... | 601a63434c91.png |
| 8 | Look Pythonic & Support Real Python | Support realpython.com with this collection of... | 5c4bfd7b7f1d.png |
| 9 | Love Python? Show It WithSome Python Swag | Every Pythonista needs a great coffee (or tea!... | 5868ff89bfd9.png |
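
The notebook lists a CSV file as its output type, but the cells above stop at the DataFrame. A minimal sketch of the saving step, assuming pandas' `to_csv`, follows. The sample row is made up, and in-memory buffers stand in for a real file path. Passing `index=False` matters: a CSV written with the default index and read back with `read_csv` typically gains an extra `Unnamed: 0` column.

```python
import io

import pandas as pd

# Made-up row standing in for the scraped products
df = pd.DataFrame(
    [["Sample title", "Sample description", "sample.png"]],
    columns=["Title", "Description", "Image_name"],
)

# Default behaviour: the row index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)

# index=False writes only the three data columns
without_index = io.StringIO()
df.to_csv(without_index, index=False)

# Read both CSV texts back to compare the resulting column names
cols_with = list(pd.read_csv(io.StringIO(with_index.getvalue())).columns)
cols_without = list(pd.read_csv(io.StringIO(without_index.getvalue())).columns)

print(cols_with)
print(cols_without)
```

To write to disk instead of a buffer, pass a path, for example `df.to_csv("products.csv", index=False)`, where `products.csv` is a hypothetical file name.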
