In this project we will go through the steps required by the course:

Data acquisition - We will collect the data from the Just Eat and Menulog websites using a combination of two tools, Selenium and BeautifulSoup. We will use Selenium to open the page and BeautifulSoup to extract the data from it. After that, we will arrange the data in a data frame and save it for further use.

EDA quality and comprehension - We will use different graphs and visualization techniques to examine the relationships between the columns we got in the data acquisition step, and try to find a connection between the information we have and the information we wish to predict.

Machine Learning experiments and insights - We will try the different machine learning models we learned about in this course and look for the best one among them. We will train each model on the training part of a train/test split and then apply it to the test data. Afterwards, we will assess the results and see whether we can actually predict if a restaurant will be successful.

Before running any tests we need to gather the data. We used the Selenium webdriver combined with BeautifulSoup to open the pages and get around the restriction we ran into when using an API GET request.

When we used the API request, we got blocked, which we attribute to the fact that we aren't the site owners. This is where Selenium came in handy. We used Selenium to open the website in a browser, which overcame this obstacle, and then we could use BeautifulSoup to extract all of the data from the page.

We took websites from the following countries: England, Ireland, New Zealand and Australia. From our investigation of the worldwide Just Eat websites, these contained the greatest variety of information about their restaurants. We decided to take restaurants from different districts of several cities in order to get the most diverse data for our project.

Another thing to keep in mind is that not all of the countries mentioned above use the same currency. We needed to take that into account, since we didn't want normal data to look like an outlier because we misinterpreted it. For that we made a dictionary that works as a currency converter, with the euro as the base currency. We applied that converter when extracting the data, as can be seen in the code below.
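As a minimal sketch of how such a converter can be applied, the snippet below multiplies an amount by the rate of its country. The rates are the ones from our dictionary; the helper name `to_euro` is only an illustration and is not part of the notebook code.

```python
# Euro-based conversion rates, copied from the scraping code.
currency_converter = {
    "Ireland": 1,
    "New Zeland": 0.6,   # key spelled as in the scraping code
    "Australia": 0.66,
    "England": 1.17,
}


def to_euro(amount, country):
    """Convert an amount in the local currency of `country` to euros.

    Hypothetical helper for illustration; raises KeyError for a country
    that is not in the dictionary.
    """
    return amount * currency_converter[country]


print(to_euro(10, "Australia"))
print(to_euro(2.5, "Ireland"))
```

Looking the rate up by country name keeps every conversion in one place, so a rate only has to be updated in the dictionary.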

We used a function called set_arrays_the_same_length to make sure that our lists are all the same length. This function appends NaN to the second list until it reaches the length of the first list. This lets us create the data frame, since the function that creates the data frame does not allow the lists that become its columns to have different lengths.
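The padding idea can be sketched on its own as follows. This is a simplified variant for illustration, not the exact function from our code: the name `pad_to_length` is hypothetical, it uses `float("nan")` from the standard library instead of `np.nan`, and it returns a new list rather than changing the one passed in.

```python
import math


def pad_to_length(values, target_length):
    """Return a copy of `values` padded with NaN up to `target_length`.

    A list that is already long enough is returned as an unchanged copy.
    """
    missing = max(0, target_length - len(values))
    return list(values) + [float("nan")] * missing


names = ["A", "B", "C"]
ratings = [4.5]

padded = pad_to_length(ratings, len(names))
print(len(padded) == len(names))
print(math.isnan(padded[-1]))
```

Note that padding at the end only equalizes the lengths; it does not tell us which restaurant a missing value belongs to.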

In [5]:
from selenium import webdriver
from bs4 import BeautifulSoup
import os 
from os.path import isfile, join
import pandas as pd 
import re
import numpy as np


def set_arrays_the_same_length(array1,array2):
    """Pad array2 in place with NaN until it is at least as long as array1, then return it."""
    while len(array2) < len(array1):
        array2.append(np.nan)

    return array2

total_ranking = []
total_times = []
total_delivery_fee = []
total_names = []
total_min_order = []
total_countries = []
total_rankers = []
currency_converter = {    #euro base converter
    "Ireland" : 1,
    "New Zeland": 0.6,
    "Australia" : 0.66,
    "England": 1.17
}

countries = {"Ireland":["https://www.just-eat.ie/area/inns_quay-dublin#10000",
                        "https://www.just-eat.ie/area/merchants_quay-dublin#10000",
                        "https://www.just-eat.ie/area/rotunda-dublin#10000"],
             "England":["https://www.just-eat.co.uk/area/e1-aldgate#10000",
                        "https://www.just-eat.co.uk/area/ls12-armley#10000",
                        "https://www.just-eat.co.uk/area/ec2v-london#10000",
                        "https://www.just-eat.co.uk/area/wc1-wc1#10000",
                        "https://www.just-eat.co.uk/area/ec1a-city_of_london#10000",
                        "https://www.just-eat.co.uk/area/ec2-liverpoolstreet#10000",
                        "https://www.just-eat.co.uk/area/ec3v-london_city#10000",
                        "https://www.just-eat.co.uk/area/ec2n-london#10000",
                        "https://www.just-eat.co.uk/area/ec2r-london#10000",
                        "https://www.just-eat.co.uk/area/nw1-regentspark#10000",
                        "https://www.just-eat.co.uk/area/se16-rotherhithe#10000",
                        "https://www.just-eat.co.uk/area/se1-southwark#10000"],
            "Australia": ["https://www.menulog.com.au/area/2015-alexandria",
                          "https://www.menulog.com.au/area/2037-forest-lodge",
                          "https://www.menulog.com.au/area/2009-pyrmont",
                          "https://www.menulog.com.au/area/2011-rushcutters-bay",
                          "https://www.menulog.com.au/area/3053-carlton",
                          "https://www.menulog.com.au/area/3008-docklands",
                         "https://www.menulog.com.au/area/3051-north-melbourne"],
            "New Zeland":[
                           "https://www.menulog.co.nz/area/1021-ponsonby",
                           "https://www.menulog.co.nz/area/1011-freemans-bay",
                           "https://www.menulog.co.nz/area/6021-hataitai",
                           "https://www.menulog.co.nz/area/6021-mount-cook",
                           "https://www.menulog.co.nz/area/6011-wellington-central"]
}


browser = webdriver.Chrome()
for country in countries:
    for url in countries[country]:
        print(url)
        browser.get(url)
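        try:
            # Accept the cookie banner. "xpath" is the value of By.XPATH;
            # find_element(by, value) replaces the deprecated find_element_by_xpath.
            browser.find_element("xpath", '//*[@id="skipToMain"]/div[5]/div[3]/div/div/div/div/div/div/div[2]/button[1]').click()
        except Exception:
            # The button was not found or could not be clicked: skip this page.
            continue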
        soup = BeautifulSoup(browser.page_source, "html.parser") 
        #names
        divs_names = soup.find_all("h3", attrs = {"class":"RestaurantCard_c-restaurantCard-name_1Zwfd"}) 
        names = [name.get_text() for name in divs_names]
        # print("names", names)
        total_names.extend(names) 
        #ranking 
        divs_ranking = soup.find_all("data", attrs = {"class":"RestaurantRating_c-restaurantCard-rating-mean_2nucs"})
        ranking = [rank.get_text() for rank in divs_ranking]
        total_ranking.extend(ranking)
        # print("ranking", ranking)
        divs_rankers = soup.find_all("data", attrs = {"class":"RestaurantRating_c-restaurantCard-rating-count_1HT6D"})
        rankers = [ranker.get_text() for ranker in divs_rankers]
        total_rankers.extend(rankers)
        #Country 
        for i in range(len(names)): 
            total_countries.append(country)
        #time, min order, delivery fee
        divs = soup.find_all("span", attrs = {"class":"IconText_c-restaurantCard-iconText-content_2wOUu"}) 
        for anomaly in divs:
            if "mins" in anomaly.get_text():
                total_times.append(int(anomaly.get_text()[5:8]))
            elif "Delivery" in anomaly.get_text():
                if "FREE" in anomaly.get_text():
                    total_delivery_fee.append(0)
                else:                                         
                    if len(anomaly.get_text()) == 14:
                        if "from" in anomaly.get_text():
                            total_delivery_fee.append(float(anomaly.get_text()[10:]) * currency_converter[country])
                        else:
                            total_delivery_fee.append(float(anomaly.get_text()[11:13]) * currency_converter[country])
                    elif len(anomaly.get_text()) == 19:
                        total_delivery_fee.append(float(anomaly.get_text()[15:])* currency_converter[country])
                    elif len(anomaly.get_text()) == 16:
                        total_delivery_fee.append(float(anomaly.get_text()[11:15])* currency_converter[country])
                    elif len(anomaly.get_text()) == 21:      
                        total_delivery_fee.append(float(anomaly.get_text()[16:20])* currency_converter[country])
                    else:
                        # Unrecognized fee format: record it as missing.
                        total_delivery_fee.append(np.nan)
            elif "min." in anomaly.get_text() or "Min." in anomaly.get_text(): 
                if "No" in anomaly.get_text():
                    total_min_order.append(0)
                elif len(anomaly.get_text()) > 13:
                    total_min_order.append(float(anomaly.get_text()[13:])* currency_converter[country])
                else:
                    # Unrecognized minimum-order format: record it as missing.
                    total_min_order.append(np.nan)
    total_ranking = set_arrays_the_same_length(total_names,total_ranking)
    total_times = set_arrays_the_same_length(total_names,total_times)
    total_delivery_fee = set_arrays_the_same_length(total_names,total_delivery_fee)
    total_min_order = set_arrays_the_same_length(total_names,total_min_order)
    total_countries = set_arrays_the_same_length(total_names,total_countries)
    total_rankers = set_arrays_the_same_length(total_names,total_rankers)
df = pd.DataFrame({"names":total_names,"countries":total_countries,"min_order":total_min_order,"amount_of_rankers":total_rankers,"delivery_fee":total_delivery_fee,"delivery_time_MAX":total_times,"ranking":total_ranking})
df.to_csv("C:\\Users\\almog\\Desktop\\Data-Sience-Project-main1\\DataFrame.csv")
print("Done!\n")     
browser.quit()                    




We have finished the first step of the project: we gathered the data and made a data frame out of it. The next step is to clean the data and prepare it for further analysis, which will be done in a different notebook.

During the gathering of the data we had to use a lot of conditions. We needed to separate the values, since all of them use the same class on the site. After gathering all of the values, we balanced the lists with our function set_arrays_the_same_length.
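The conditions in our code cut the text at fixed positions that depend on its length. As an alternative sketch (not what the notebook code does), the values could be separated with a regular expression that picks out the numbers, so the result does not depend on the exact length of the text. The function name `parse_card_text`, its return convention and the sample strings are assumptions made for illustration; the real text on the site may look different.

```python
import re

# Matches integers and decimals such as "40" or "2.50".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_card_text(text, rate=1.0):
    """Classify one text snippet and return (field, value).

    Returns (None, None) when the text is not recognized, and a value of
    None when the field is recognized but contains no number.
    """
    lowered = text.lower()
    numbers = [float(n) for n in NUMBER.findall(text)]
    last = numbers[-1] if numbers else None

    if "mins" in lowered:
        # For a range such as "25-40 mins", keep the upper bound.
        minutes = None if last is None else int(last)
        return "delivery_time_max", minutes
    if "delivery" in lowered:
        if "free" in lowered:
            return "delivery_fee", 0.0
        fee = None if last is None else last * rate
        return "delivery_fee", fee
    if "min." in lowered:
        if re.search(r"\bno\b", lowered):
            return "min_order", 0.0
        minimum = None if last is None else last * rate
        return "min_order", minimum
    return None, None


# Hypothetical sample strings; 1.17 is the England rate from our converter.
samples = ["25-40 mins", "Delivery FREE", "Delivery from £1.99", "No min. order"]
for sample in samples:
    print(sample, "->", parse_card_text(sample, rate=1.17))
```

Returning the field name together with the value lets the caller decide which list to append to, and an unrecognized text yields None instead of an empty string, so the columns stay numeric.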