# Aim


The aim of the project is to scrape book listings (link, cover image, rating, title, price and stock status) from the website books.toscrape.com.

# Libraries Required

1. **NumPy** for working with arrays
2. **BeautifulSoup** to parse and scrape the HTML
3. **Requests** to fetch web pages
4. **lxml** as the parser used by BeautifulSoup

# Methods Used

**np.array([])**

This function creates a new NumPy array and returns it so that it can be assigned to a variable. All elements of a NumPy array share one data type, so the values you pass should be of the same type. If they are not, NumPy does not raise an error; it silently converts them to a common type. For example, numpy.array(["Name",45]) produces an array of strings in which the integer 45 has become the string "45".
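
A minimal sketch of how np.array treats values of one type versus mixed types. It assumes a current NumPy release, where mixing strings and numbers coerces everything to strings rather than raising an error; the variable names and values are illustrative only.

```python
import numpy as np

# Values of the same type keep that type.
prices = np.array([51.77, 53.74, 50.10])

# Mixed values are coerced to one common type instead of being rejected.
# Here the integer 45 is stored as the string "45".
mixed = np.array(["Name", 45])

print(prices.dtype.kind)  # "f" stands for floating point
print(mixed.dtype.kind)   # "U" stands for Unicode string
print(mixed[1] == "45")
```

This is why every value that ends up in a string array, including numbers and booleans, is stored as text.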


**range(n)**

This function produces the integers from 0 to n-1 and is typically used to drive a for loop. Because counting starts at 0, the loop body runs n times, but the largest value you get is n-1.
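
A short sketch of range in a loop. The value of n and the list name are arbitrary; the +1 mirrors how a 0-based counter can be turned into 1-based page numbers.

```python
# range(n) yields the integers 0, 1, ..., n-1, so the loop body runs n times.
n = 3
page_numbers = []
for page in range(n):
    # Shift the 0-based counter to a 1-based page number.
    page_numbers.append(page + 1)

print(list(range(n)))  # [0, 1, 2]
print(page_numbers)    # [1, 2, 3]
```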


**requests.get(URL)**

We use this function to fetch the HTML contents of a webpage. It sends a GET request to the specified URL and returns a response object. The body of the response is available as a string through the object's "**text**" attribute. So, requests.get("http://www.example.com").text gives the HTML returned by the server at example.com.
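
One way to wrap this call is sketched below. The helper name fetch_html, the timeout value and the status check are assumptions added for illustration; the notebook itself calls requests.get(URL).text directly.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text (hypothetical helper, not part of the notebook)."""
    # The timeout stops the call from waiting forever on an unresponsive server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.exceptions.HTTPError for 4xx/5xx responses
    # instead of returning the HTML of an error page.
    response.raise_for_status()
    return response.text
```

Calling fetch_html("http://books.toscrape.com/catalogue/page-1.html") would return the HTML of the first catalogue page. A loop that detects the last page by an empty result, rather than by an error, would need to catch the HTTPError raised here.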

**BeautifulSoup(HTMLContent,Parser)**

This constructor converts HTML content into an object that is easy to search and traverse, without having to process the raw tags by hand. In our example, we pass the text returned by requests.get() as the first argument. The second argument names the parser to use, since the content could be HTML, XML or some other markup. We use the "lxml" parser, which is fast and tolerant of imperfect HTML; it requires the lxml package to be installed.
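
A small sketch of parsing and navigating a document. The HTML string is made up, and the built-in "html.parser" is used instead of "lxml" so that the example runs without the lxml package.

```python
from bs4 import BeautifulSoup

# Made-up HTML, used only to show navigation.
html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h3><a href="book_1.html" title="A Sample Book">A Sample ...</a></h3>
  </body>
</html>
"""

# "html.parser" ships with Python; the notebook itself uses "lxml".
soup = BeautifulSoup(html, "html.parser")

print(soup.title.get_text())   # text inside <title>
print(soup.h3.a.get("href"))   # attribute of a nested tag
print(soup.h3.a.get("title"))
```

Tags are reached by attribute access (soup.h3.a), and their HTML attributes by get().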




# Importing the Libraries

In [0]:
import numpy as np
from bs4 import BeautifulSoup
import requests

# Data Headers

In [0]:
# Adding a header to NumPy array for the variables we are extracting
bookInfo=np.array(["Book Link","Image Source","Rating","Title","Price (Euros)","In Stock"])
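
The header array defined above becomes the first row of a 2-D table once rows are stacked beneath it with np.vstack. The sketch below shows this with a single made-up row; the names book_info and row are illustrative.

```python
import numpy as np

# Header row with the same column names as the notebook.
book_info = np.array(["Book Link", "Image Source", "Rating", "Title", "Price (Euros)", "In Stock"])

# A made-up row with one value per header column.
row = ["book_1/index.html", "media/book_1.jpg", "Three", "A Sample Book", "51.77", "True"]

# np.vstack stacks the 1-D header and the row into a 2-D array.
book_info = np.vstack((book_info, row))

print(book_info.shape)   # (2, 6)
print(book_info[1, 3])   # A Sample Book
```

Each call to np.vstack copies the whole array, so for large scrapes it may be cheaper to collect the rows in a Python list and convert once at the end.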

# Process

1. Get the HTML content of a page. This is done using the requests.get(URL) method from the requests library.

2. Pass this content into BeautifulSoup so that we can parse it easily. This is done using BeautifulSoup(HTML,Parser). The HTML content comes from the previous step, and the parser that we are using is **lxml**, a widely used third-party parser for HTML and XML.

3. After going through the HTML of the page, we find that all the product information is available in article tags that have a class of product_pod. Therefore, we use the find_all method from BeautifulSoup to get a list of all the products. **Note**: We have used a lambda function here just to demonstrate one of the ways in which find_all can be used. Passing the tag name and class directly, as in find_all("article", class_="product_pod"), selects the same elements on this page.

4. In the previous step, we got a list of all the items available on a page. So, we loop over the items and extract the details that we require. In this case, we have 6 fields or features - "Book Link","Image Source","Rating","Title","Price (Euros)","In Stock". We extract each piece of information and append it to a list.

5. Once all of the information for a book is extracted, we stack this list onto the end of the NumPy array so that we have everything we need in one array. In the future, this array can be converted into a Pandas DataFrame and used for other purposes.

6. Finally, we put the entire process in a loop over the page numbers, starting at page-1. Rather than hard-coding the exact number of pages, I have set the range to a generous upper bound of 1005 and added a break statement that exits the loop as soon as a page contains no products. This way, the loop does not depend on an exact page count. If a few pages are removed in the future, we won't get an error, because the code leaves the loop by itself when no records are found.
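
The product selection in step 3 can be written in several ways. The sketch below compares four of them on a made-up HTML fragment; it uses the built-in "html.parser" so that it runs without lxml, and the fragment is an assumption, not a copy of the site's markup.

```python
from bs4 import BeautifulSoup

# Made-up HTML with two product cards and one unrelated <article>.
html = """
<article class="product_pod"><h3><a title="Book A">Book A</a></h3></article>
<article class="product_pod"><h3><a title="Book B">Book B</a></h3></article>
<article class="sidebar">Not a product</article>
"""
soup = BeautifulSoup(html, "html.parser")  # the notebook uses "lxml"

# 1. Lambda filter, as in the notebook.
by_lambda = soup.find_all(
    lambda elem: elem.name == "article" and elem.get("class") == ["product_pod"]
)
# 2. Tag name plus the class_ keyword.
by_keyword = soup.find_all("article", class_="product_pod")
# 3. Tag name plus an attribute dictionary.
by_attrs = soup.find_all("article", {"class": "product_pod"})
# 4. CSS selector.
by_css = soup.select("article.product_pod")

titles = [item.h3.a.get("title") for item in by_keyword]
print(len(by_lambda), len(by_keyword), len(by_attrs), len(by_css))  # 2 2 2 2
print(titles)  # ['Book A', 'Book B']
```

The lambda form matches only tags whose class list is exactly ['product_pod'], while the other three also match tags that carry additional classes.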


# Main Loop

In [0]:
# Running a for loop to be used in getting new pages with information
for page in range(1005):

  # Getting the HTML content using Requests
  content = requests.get(f"http://books.toscrape.com/catalogue/page-{page+1}.html").text

  # Passing the HTML content into the BeautifulSoup method to allow easy scraping.
  booklist = BeautifulSoup(content,"lxml")

  # Getting all the items using find_all
  items = booklist.find_all(lambda elem:elem.get('class')==['product_pod'] and elem.name=="article")

  # Stop when a page has no products, which means we are past the last page
  if not items:
    break

  for book in items:
    bookDetails=[]
    # Relative link to the book's page and the source of its cover image
    bookDetails.append(book.div.a.get('href'))
    bookDetails.append(book.div.a.img.get('src'))
    # The rating is the second class of the first <p> tag
    bookDetails.append(book.p.get('class')[1])
    bookDetails.append(book.h3.a.get('title'))
    # Keep only digits and the decimal point, dropping the currency symbol
    price = book.find("p",{"class":"price_color"}).get_text()
    price = ''.join([i for i in price if i.isdigit() or i=="."])
    bookDetails.append(price)
    # Strip the surrounding whitespace before comparing the stock text
    stock = book.find("p",{"class":"instock"}).get_text().strip()
    bookDetails.append(stock=="In stock")
    # Add the book as a new row below the existing rows
    bookInfo = np.vstack((bookInfo,bookDetails))

In [0]:
bookInfo
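
To try the field extraction from the main loop without network access, the same steps can be applied to a single sample product card. The HTML below is a made-up fragment shaped to fit the selectors used in the loop, parse_book is a hypothetical helper name, and "html.parser" stands in for "lxml".

```python
from bs4 import BeautifulSoup

# Made-up product card; the structure is an assumption based on the selectors above.
SAMPLE_CARD = """
<article class="product_pod">
  <div class="image_container">
    <a href="a-sample-book_1/index.html"><img src="../media/sample.jpg" alt="A Sample Book"></a>
  </div>
  <p class="star-rating Three"></p>
  <h3><a href="a-sample-book_1/index.html" title="A Sample Book">A Sample ...</a></h3>
  <div class="product_price">
    <p class="price_color">£51.77</p>
    <p class="instock availability">
      In stock
    </p>
  </div>
</article>
"""


def parse_book(book):
    """Extract the six fields from one <article class="product_pod"> tag."""
    price_text = book.find("p", class_="price_color").get_text()
    stock_text = book.find("p", class_="instock").get_text().strip()
    return [
        book.div.a.get("href"),        # link to the book's page
        book.div.a.img.get("src"),     # cover image source
        book.p.get("class")[1],        # rating word, the second class of the first <p>
        book.h3.a.get("title"),        # full title
        "".join(ch for ch in price_text if ch.isdigit() or ch == "."),
        stock_text == "In stock",      # True when the card says "In stock"
    ]


soup = BeautifulSoup(SAMPLE_CARD, "html.parser")
fields = parse_book(soup.find("article", class_="product_pod"))
print(fields)
```

Keeping the extraction in one function makes it possible to check the selectors against a saved page before running the full loop.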