## In this notebook, I scrape utazi.com to collect its menu and prices

> I'll use the following dependencies:

- Requests to access the site
- BeautifulSoup to parse the HTML file into a readable format
- pandas to store the scraped menu in a DataFrame and export it to CSV
- time to rate-limit scraping the website so I don't send too many requests at once
- Selenium to make sure I get all elements on the menu page, especially the ones rendered through JavaScript



In [1]:
#Importing all dependencies
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
from selenium import webdriver

In [None]:
#Creating a variable called 'url' and using requests' get function to store the site's response in the variable 'data'
url = 'https://utazing.com/menu'
data = requests.get(url)

In [3]:
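#Creating a variable 'driver' that points Selenium at the chromedriver executable
#Selenium 4 takes the path through a Service object; the executable_path argument is deprecated
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service("/Users/JOHN ANALOH/Desktop/chromedriver"))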




In [4]:
#Opening the page in a new browser window controlled by Selenium's driver
#Scrolling far down the page to ensure all information is captured
#Pausing for 2 seconds so the page has time to load before I read it
driver.get(url)
driver.execute_script("window.scrollTo(1,20000)")
time.sleep(2)
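
A fixed scroll offset works as long as the page is shorter than that offset. As a hedged alternative, the sketch below keeps scrolling until the page height stops growing. `scroll_to_bottom`, `pause` and `max_rounds` are hypothetical names I made up for illustration; the only thing it assumes about `driver` is Selenium's `execute_script` method.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=10):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_number in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded items time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_number  # nothing new appeared, so we reached the end
        last_height = new_height
    return max_rounds
```

Calling `scroll_to_bottom(driver)` after `driver.get(url)` would take the place of the single `scrollTo` call. The `max_rounds` cap is there so a page that grows forever cannot hang the notebook.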

In [5]:
#Storing the content of the page to variable 'html'
html = driver.page_source

In [6]:
#Writing the content of the page to a new file to store the data, so we don't have to scrape utazi every time
with open("utazi_full_menu.html", "w+",encoding='utf-8', errors='backslashreplace') as f:
    f.write(html)

In [7]:
#reading the content of utazi and storing it to variable 'utazi_full_data'
with open("utazi_full_menu.html", "r",encoding='utf-8', errors='backslashreplace') as f:
    utazi_full_data = f.read()
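
The write-then-read pattern above can be folded into one helper that only scrapes when no saved copy exists. This is a minimal sketch, not part of the original notebook: `load_or_fetch` and its `fetch` argument are hypothetical names, and `fetch` is assumed to be any function that returns the page HTML as a string.

```python
from pathlib import Path


def load_or_fetch(path, fetch):
    """Return the cached HTML at `path`, calling `fetch()` only on a cache miss."""
    cache_file = Path(path)
    if cache_file.exists():
        # Cache hit: reuse the saved page instead of scraping again
        return cache_file.read_text(encoding="utf-8", errors="backslashreplace")
    html = fetch()
    cache_file.write_text(html, encoding="utf-8", errors="backslashreplace")
    return html
```

With the driver from earlier, a call could look like `load_or_fetch("utazi_full_menu.html", lambda: driver.page_source)`.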

In [8]:
#Here I'm using BeautifulSoup's HTML parser to parse the data into a more structured format, stored in the variable 'soup'
soup = BeautifulSoup(utazi_full_data, 'html.parser')

#### At this point, I inspect utazi's menu page to identify a structure that lets me scrape the data
#### I'm able to deduce that all the menu information is contained within divs with the class 'menu-item-container'

In [16]:
utazi_full_clean = soup.find_all("div", class_="menu-item-container")

In [17]:
#I convert the content of 'menu-item-container' to a list so I can iterate easily
full_menu = list(utazi_full_clean)

In [18]:
#Here I'm testing how to extract the data I need
#In this case I want to get the name, price and a short description
#This code gets the name of the 10th item in the list of containers
full_menu[9].contents[1].text.strip()

#full_menu[9].contents[2].text.strip() gets you the description
#full_menu[9].contents[0].text.strip() gets you the price

'Peppered dry fish'

#### Next, I loop through the containers and, on each iteration, collect the name, price and description into separate lists

In [19]:
#I start out by creating empty lists
item = []
item_description = []
item_price = []
#The loop runs 137 times (indices 1 to 137) because a manual count showed the containers are consistently formatted up to that point
for i in range(1,138):
    #Take the i-th container from the list built earlier instead of searching the whole page on every iteration
    menu = full_menu[i]
    #Store the name, description and price; fall back to empty strings if a container is missing a field
    try:
        food = menu.contents[1].text.strip()
        food_description = menu.contents[2].text.strip()
        food_price = menu.contents[0].text.strip()
    except (IndexError, AttributeError):
        food = ""
        food_description = ""
        food_price = ""
    #Append this item's name, description and price to their respective lists
    item.append(food)
    item_description.append(food_description)
    item_price.append(food_price)
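
The hard-coded count of 137 has to be redone whenever the menu changes. As a hedged sketch of an alternative, the loop can iterate over the containers directly. `SAMPLE_HTML` below is made-up markup that only mirrors the child positions used above (price, name, description), and `extract_rows` is a hypothetical helper; the real page may be structured differently. Unlike the loop above, this visits every container, including the one at index 0.

```python
from bs4 import BeautifulSoup

# Made-up sample: the second container is deliberately missing two fields
SAMPLE_HTML = (
    '<div class="menu-item-container">'
    "<span>NGN4000</span><h4>Peppered dry fish</h4><p>Served with fries</p>"
    "</div>"
    '<div class="menu-item-container"><span>NGN1500</span></div>'
)


def extract_rows(soup):
    """Return one dict per container; malformed containers yield empty strings."""
    rows = []
    for container in soup.find_all("div", class_="menu-item-container"):
        try:
            price, name, description = (
                child.get_text(strip=True) for child in container.contents[:3]
            )
        except ValueError:  # fewer than three children to unpack
            price = name = description = ""
        rows.append({"Name": name, "Description": description, "Price": price})
    return rows


rows = extract_rows(BeautifulSoup(SAMPLE_HTML, "html.parser"))
```

A list of dicts like `rows` can be passed straight to `pd.DataFrame(rows)`, which would replace the three parallel lists.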

In [20]:
#Create a new dataframe with 3 columns to store the values of name, description and price
utazi_full_menu = pd.DataFrame({
    'Name':item,
    'Description':item_description,
    'Price':item_price
})

In [21]:
#The values of price all start with an 'NGN' prefix, so I use a lambda function to remove the first three characters of each
utazi_full_menu['Price'] = list(map(lambda x: x[3:], utazi_full_menu['Price']))
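
Slicing off three characters leaves the prices as strings. If numeric prices are wanted, a small parser is one option. This is a sketch under assumptions: `parse_price` is a hypothetical helper, and I assume prices are whole naira amounts that may or may not contain thousands separators.

```python
import re


def parse_price(raw):
    """Convert a string such as 'NGN7500' or 'NGN 4,500' to an int; None if no digits."""
    match = re.search(r"\d[\d,]*", raw)
    if match is None:
        return None  # e.g. the empty strings stored for malformed containers
    return int(match.group().replace(",", ""))
```

It could be applied to the raw column with `utazi_full_menu['Price'].map(parse_price)` in place of the slicing step.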

In [22]:
utazi_full_menu.head()

|   | Name | Description | Price |
|---|------|-------------|-------|
| 0 | Cat Fish Pepper Soup | Prepared in a spicy fish broth | 7500 |
| 1 | Fish Pepper Soup | Crocker, barracuda, grouper or fisherman catch... | 4500 |
| 2 | Chicken Pepper Soup | Prepared in a spicy chicken broth | 3500 |
| 3 | Goat Meat Pepper Soup | Prepared in a spicy goat meat broth | 4000 |
| 4 | Peppered Snail | Sauteed onions, peppers served with fries | 6000 |


In [23]:
#Finally, I write my output data to a CSV file
utazi_full_menu.to_csv('utazi.csv')
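
By default, `to_csv` also writes the DataFrame index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below compares the two header rows on a made-up one-row frame; passing `index=False` is an optional tweak, not something the original notebook does.

```python
import io

import pandas as pd

# Made-up single-row frame, just to compare the CSV headers
menu = pd.DataFrame({"Name": ["Peppered Snail"], "Price": ["6000"]})

with_index = io.StringIO()
menu.to_csv(with_index)  # default: the index is written as an unnamed first column

without_index = io.StringIO()
menu.to_csv(without_index, index=False)  # only the named columns are written

header_with_index = with_index.getvalue().splitlines()[0]
header_without_index = without_index.getvalue().splitlines()[0]
```

For the menu data this would mean calling `utazi_full_menu.to_csv('utazi.csv', index=False)`.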