# Reboot – Data Analyst Assessment

**Introduction**

The assessment will require you to source, extract, clean, and visualise data to answer the research question presented below. 

**Research Question**

The rise of veganism in the UK has prompted restaurants to include vegan-friendly options on their menus. While some restaurants have adapted quickly, others have lagged behind. The task is to find the UK cities that are most suitable for vegans to eat out in.

To answer this question, we would like you to find the number of vegan-friendly restaurants per capita in each of the top 20 UK cities listed on TripAdvisor.
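
As a rough illustration of the metric, the per-capita figure could be computed along the following lines. This is only a sketch: the function name `restaurants_per_capita`, the "per 100,000 residents" scaling, and the city names and numbers are placeholders invented for the example, not TripAdvisor counts or census figures.

```python
def restaurants_per_capita(counts, populations, per=100_000):
    """Return (city, rate) pairs sorted from highest to lowest rate.

    counts and populations are dicts keyed by city name; the rate is the
    number of restaurants per `per` residents.
    """
    rates = {
        city: counts[city] * per / populations[city]
        for city in counts
        if populations.get(city)  # skip cities with missing or zero population
    }
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Placeholder numbers, for illustration only (not real counts or populations).
counts = {"City A": 500, "City B": 300}
populations = {"City A": 1_000_000, "City B": 200_000}

ranking = restaurants_per_capita(counts, populations)
print(ranking)
```

Scaling by a fixed number of residents keeps the rates readable; a city with fewer restaurants can still rank higher if its population is much smaller.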

**Requirements**

You are free to use any tools to complete this task. We only require the following:

1.	The data collection from TripAdvisor must be automated.
2.	You must include a brief methodology report describing the steps you took and the sources you used.
3.	You must include a link to your GitHub repository containing any code used for the project.
4.	Create a data visualisation dashboard to display your results.
5.	You must include a link to your data visualisation dashboard.



# FEATURES

- Restaurant name - #result-title span;
- Restaurant location - #address-text - div ;
- Restaurant reviews - #review_count - a;
- Mentions of VEGAN - #review-mention-block - div;
- What you can expect during your visit.

# LIBRARIES

In [22]:
#IMPORT LIBRARIES
import requests 
import pandas as pd
import boto3
import os
import re

from datetime import datetime

import selenium
from selenium import webdriver

from bs4 import BeautifulSoup

from secrets import access_key, secret_access_key


# COLLECT TOP 20 UK CITIES NAMES

In [2]:
#URL with 20 top cities
url = 'https://www.tripadvisor.co.uk/Restaurants-g186216-United_Kingdom.html'

#Create User-Agent for requests
headers = { 'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36" }

#result of the URL request
result_cities = requests.get(url, headers=headers )

#SET html.parser and text to result
soup = BeautifulSoup( result_cities.text, 'html.parser' )

#Cities Name = Find the tag DIV with class='geo_name'
allcities = soup.find_all( 'div', class_='geo_name')

#Take the CITIES NAME
allcitiesname = [p.get_text() for p in allcities]

#Replace \n and Restaurant to nothing
allcitiesname = [s.replace('\n', '') for s in allcitiesname]
allcitiesname = [s.replace('Restaurants', '') for s in allcitiesname]

#Take the LINK allcities[0].find('a').get('href')
allcitieslinks = [p.find('a').get('href') for p in allcities]

#Take the City ID
city_id = [i.split('-')[1] for i in allcitieslinks]

#Take the City URL
city_url = [i.split('-')[2] for i in allcitieslinks]

#Create DataFrame with features
data = pd.DataFrame([allcitiesname,city_id,city_url]).T

#Alter columns name
data.columns = ['city','city_id','city_url']

#Code VEGAN
data['code_vegan'] = 'zfz10697'

#Add new column with datetime
data['scrap_datetime'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

In [18]:
data

| | city | city_id | city_url | code_vegan | scrap_datetime |
|---|---|---|---|---|---|
| 0 | London | g186338 | London_England.html | zfz10697 | 2021-08-01 08:26:28 |
| 1 | Manchester | g187069 | Manchester_Greater_Manchester_England.html | zfz10697 | 2021-08-01 08:26:28 |
| 2 | Birmingham | g186402 | Birmingham_West_Midlands_England.html | zfz10697 | 2021-08-01 08:26:28 |
| 3 | Edinburgh | g186525 | Edinburgh_Scotland.html | zfz10697 | 2021-08-01 08:26:28 |
| 4 | Glasgow | g186534 | Glasgow_Scotland.html | zfz10697 | 2021-08-01 08:26:28 |
| 5 | Leeds | g186411 | Leeds_West_Yorkshire_England.html | zfz10697 | 2021-08-01 08:26:28 |
| 6 | Liverpool | g186337 | Liverpool_Merseyside_England.html | zfz10697 | 2021-08-01 08:26:28 |
| 7 | Bristol | g186220 | Bristol_England.html | zfz10697 | 2021-08-01 08:26:28 |
| 8 | Sheffield | g186364 | Sheffield_South_Yorkshire_England.html | zfz10697 | 2021-08-01 08:26:28 |
| 9 | Nottingham | g186356 | Nottingham_Nottinghamshire_England.html | zfz10697 | 2021-08-01 08:26:28 |
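
The `city_id` and `city_url` columns above come from splitting each link on hyphens. A slightly more defensive version of that step might look like the sketch below. The helper name `parse_city_href` is hypothetical, and the href shape (`/Restaurants-<id>-<slug>.html`) is an assumption inferred from the URL and table in this notebook. Unlike `split('-')[2]`, limiting the split to two cuts keeps any hyphens inside the city slug.

```python
def parse_city_href(href):
    """Split a restaurants href into (city_id, city_url).

    Assumes the form '/Restaurants-<city_id>-<city_slug>.html'.
    """
    # maxsplit=2 so hyphens inside the slug are not treated as separators
    parts = href.split("-", 2)
    if len(parts) < 3 or not parts[1].startswith("g"):
        raise ValueError(f"Unexpected href format: {href!r}")
    return parts[1], parts[2]


city_id, city_url = parse_city_href("/Restaurants-g186338-London_England.html")
print(city_id, city_url)
```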


# COLLECT DATA FOR MULTI CITIES

In [21]:
#Create User-Agent for requests
headers = { 'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36" }

#empty data frame
data_final = pd.DataFrame()

for i in range( len ( data ) ):
    #URL of the vegan-filtered restaurant listing for this city
    url2 = 'https://www.tripadvisor.co.uk/Restaurants-' + data.loc[i, 'city_id'] + '-' + data.loc[i, 'code_vegan'] + '-' + data.loc[i, 'city_url']

    #result of the URL request
    result_cities = requests.get(url2, headers=headers )

    soup = BeautifulSoup(result_cities.text, 'html.parser')

    #Take the city ID from the URL (used later as the merge key)
    city_url2 = url2.split('-')[1]

    #Take the SORT BY
    sortby = soup.find( 'div', class_='_1NO-LVmX _1xde6MOz')
    sortby = sortby.text

    #Get the AMOUNT of the RESTAURANTS
    path = r'C:\chromedriver.exe'
    driver = webdriver.Chrome(path)
    driver.get(url2)
    #Find class name _1D_QUaKi
    total_rest = driver.find_element_by_class_name("_1D_QUaKi")
    #Get text without html code.
    qtd_restaurants = total_rest.text
    #close the browser screen
    driver.quit()
    
    #Get first 5 restaurants
    list_item = soup.find( 'div', class_='_1kXteagE')
    #Get first 5 restaurants - each item list
    each_item = soup.find_all('div', attrs={'data-test':re.compile("^[1-5]_list_item")})
    #Get the names of the restaurants
    restaurant_name = [r.find('a', class_='_15_ydu6b').get_text() for r in each_item]
    #create the restaurants name DataFrame
    restaurant_name = pd.DataFrame(restaurant_name)
    #rename the columns
    restaurant_name.columns = ['Restaurants']
    

    #Create DataFrame with features
    data2 = pd.DataFrame([city_url2, sortby, qtd_restaurants, restaurant_name]).T

    #Alter columns name
    data2.columns = ['city_id','sort_by','qtd_restaurants', 'restaurant_name']

    #Create data final
    dataOverview = pd.merge(data, data2, on='city_id')
    
    #all 
    data_final = pd.concat( [data_final, dataOverview], axis=0 )
    

    
#Save data to a date-stamped CSV file (to_csv returns None when given a path)
data_final.to_csv("data"+datetime.now().strftime('%Y%m%d')+".csv")
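
The URL construction and the restaurant count in the loop above can be isolated into small helpers, which makes them easier to check without hitting the site. The following is a minimal sketch: the names `build_vegan_url` and `parse_count` are hypothetical, and the assumption that the count text may contain thousands separators or trailing words (for example `'4,796 results'`) is a guess about the page rather than something shown in this notebook.

```python
import re

BASE_URL = "https://www.tripadvisor.co.uk/Restaurants"


def build_vegan_url(city_id, city_url, code_vegan="zfz10697"):
    """Build the filtered listing URL from the columns collected earlier."""
    return f"{BASE_URL}-{city_id}-{code_vegan}-{city_url}"


def parse_count(text):
    """Return the first integer found in text, or None if there is none.

    Thousands separators are stripped before conversion (assumed format).
    """
    match = re.search(r"\d[\d,]*", text)
    return int(match.group().replace(",", "")) if match else None


print(build_vegan_url("g186338", "London_England.html"))
print(parse_count("4,796 results"))
```

Storing the count as an integer rather than text avoids a conversion step later when dividing by population.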

In [28]:

#Use the credentials imported from the local secrets module; never hard-code keys in the notebook
client = boto3.client('s3',
                      aws_access_key_id=access_key,
                      aws_secret_access_key=secret_access_key)

for file in os.listdir():
    if '.csv' in file:
        upload_file_bucket = 'rebootfelipe'
        upload_file_key = file
        client.upload_file(file,upload_file_bucket,upload_file_key)
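
The upload loop can also be wrapped in a function that takes the client as a parameter, which lets it be exercised without AWS access. This is a sketch under assumptions: `upload_csv_files` is a hypothetical helper, it matches on the `.csv` suffix rather than on `'.csv'` appearing anywhere in the name, and it expects any object exposing `upload_file(Filename, Bucket, Key)` as the boto3 S3 client does.

```python
import os


def upload_csv_files(client, bucket, directory="."):
    """Upload every .csv file in directory to bucket; return the uploaded keys."""
    uploaded = []
    for name in sorted(os.listdir(directory)):
        # Match on the file extension only, case-insensitively
        if name.lower().endswith(".csv"):
            client.upload_file(os.path.join(directory, name), bucket, name)
            uploaded.append(name)
    return uploaded


# Possible usage with boto3, assuming credentials are supplied by the
# environment or the AWS configuration files rather than by the code:
# import boto3
# upload_csv_files(boto3.client("s3"), "rebootfelipe")
```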
        

In [None]:
driver = webdriver.Chrome(executable_path=r"chromedriver\chromedriver.exe")

In [34]:
daaaata = pd.read_csv('https://rebootfelipe.s3.amazonaws.com/data.csv')
daaaata.head()

| | Unnamed: 0 | city | city_id | city_url | code_vegan | scrap_datetime | sort_by | qtd_restaurants | restaurant_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | London | g186338 | London_England.html | zfz10697 | 2021-08-01 08:26:28 | Highest Rating | 4796 | Restaurants\n0 ... |
| 1 | 0 | Manchester | g187069 | Manchester_Greater_Manchester_England.html | zfz10697 | 2021-08-01 08:26:28 | Highest Rating | 494 | Restaurants\n0 ... |
| 2 | 0 | Birmingham | g186402 | Birmingham_West_Midlands_England.html | zfz10697 | 2021-08-01 08:26:28 | Highest Rating | 514 | Restaurants\n0 ... |
| 3 | 0 | Edinburgh | g186525 | Edinburgh_Scotland.html | zfz10697 | 2021-08-01 08:26:28 | Highest Rating | 751 | Restaurants\n0 ... |
| 4 | 0 | Glasgow | g186534 | Glasgow_Scotland.html | zfz10697 | 2021-08-01 08:26:28 | Highest Rating | 572 | Restauran... |
