# **Preply Web Scrape and Upload to Firebase**

### Task Description
<p>I need to write a python script which will scrape records from a website (each record has about ten text fields and two images which need to be downloaded to a local folder) and save all the text information to a pandas data frame along with the image file names. Each record on the website has a unique id in the url string which can be used as the index/primary key. I think using Selenium with ChromeDriver will be the easiest way to do this because I will have to login through the browser window and navigate to the main page and pick some options before starting the scraping loop.

The second part of the project will be to push the scraped information and images to my firebase project. Since each record has a unique identifier it should be pretty straight forward to use that as the primary key to save the text to firestore. Also the images will need to be uploaded to firebase storage and links to the images saved as a field in firestore for each record.</p>

In [1]:
from selenium.common.exceptions import ElementClickInterceptedException
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC 
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from time import sleep
from bs4 import BeautifulSoup
import re
import pandas as pd

In [21]:
URL = "https://preply.com/en/online/english-tutors?page={0}"

to_int = lambda raw_string: int(''.join(re.findall(r'\d+', raw_string))) # strips anything that is not number and converts to integer

class Selectors:
    currency_button = "#__next > section > div._15uGWh._3rnoor.X4eHUe._1o4EVj.hUFwKI._1-5zEl._2x1aeM._1WH1_Q._2JlASK.GWHMDM._1kf68Q._3nnn2i > div.styles_HeaderDropdownWrapper__uIOg5 > button"
    n_of_pages_selector = "#__next > div.styles_PageWrap__lIYEc > main > div.styles_ResultsWrapper__kAXts > div.styles_MainContentWrapper__h3r02 > div > div > ul > li:nth-child(5) > a"
    tutor = "#__next > div.styles_PageWrap__lIYEc > main > div.styles_ResultsWrapper__kAXts > div.styles_MainContentWrapper__h3r02 > ul > li:nth-child({0})"
    tutor_basic_info = "#__next > div.styles_PageWrap__lIYEc > main > div.styles_ResultsWrapper__kAXts > div.styles_MainContentWrapper__h3r02 > ul > li:nth-child({0}) > section > div > div:nth-child(2) > div.styles_SearchCardHeading__fL9u3"
    tutor_country = "#__next > div.styles_PageWrap__lIYEc > main > div.styles_ResultsWrapper__kAXts > div.styles_MainContentWrapper__h3r02 > ul > li:nth-child({0}) > section > div > div:nth-child(2) > div.styles_SearchCardHeading__fL9u3 > span > img"
    is_newly_joined = "#__next > div.styles_PageWrap__lIYEc > main > div.styles_ResultsWrapper__kAXts > div.styles_MainContentWrapper__h3r02 > ul > li:nth-child({0}) > section > div > div:nth-child(2) > div.styles_CoreIndicatorsBlockWrapper__JrZw6 > div > div > div:nth-child(1) > div > span"
    tutor_rating = "#__next > div.styles_PageWrap__lIYEc > main > div.styles_ResultsWrapper__kAXts > div.styles_MainContentWrapper__h3r02 > ul > li:nth-child({0}) > section > div > div:nth-child(2) > div.styles_CoreIndicatorsBlockWrapper__JrZw6 > div > div > div:nth-child(1) > div"
    thumbnail_img = "#__next > div.styles_PageWrap__lIYEc > main > div.styles_ResultsWrapper__kAXts > div.styles_MainContentWrapper__h3r02 > ul > li:nth-child({0}) > section > div > div.styles_SearchCardAvatarWrapper__u_zNw > div > a > picture > img"
    price = "#__next > div.styles_PageWrap__lIYEc > main > div.styles_ResultsWrapper__kAXts > div.styles_MainContentWrapper__h3r02 > ul > li:nth-child({0}) > section > div > div:nth-child(2) > div.styles_CoreIndicatorsBlockWrapper__JrZw6 > div > div > div:nth-child(2) > div > div.styles_PriceIndicatorPrice__vz9WR > div.styles_PriceIndicatorValue__ndpfb.styles_PriceIndicatorValueSecond__Py5XD"
    lesson_duration = "#__next > div.styles_PageWrap__lIYEc > main > div.styles_ResultsWrapper__kAXts > div.styles_MainContentWrapper__h3r02 > ul > li:nth-child({0}) > section > div > div:nth-child(2) > div.styles_CoreIndicatorsBlockWrapper__JrZw6 > div > div > div:nth-child(2) > div > div.styles_PriceIndicatorDuration__GKnmh"
    lessons_and_students = "#__next > div.styles_PageWrap__lIYEc > main > div.styles_ResultsWrapper__kAXts > div.styles_MainContentWrapper__h3r02 > ul > li:nth-child({0}) > section > div > div:nth-child(3) > div.styles_InfoWrap__CBt_k > div._15uGWh._3rnoor.X4eHUe._1o4EVj.qh7on_._22lwVW._2x1aeM._1WH1_Q._2JlASK.GWHMDM"
    tutor_languages = "#__next > div.styles_PageWrap__lIYEc > main > div.styles_ResultsWrapper__kAXts > div.styles_MainContentWrapper__h3r02 > ul > li:nth-child({0}) > section > div > div:nth-child(3) > div.styles_InfoWrap__CBt_k > div.styles_SpeaksWrapper__ssRz1"
    tutor_languages_more = "#__next > div.styles_PageWrap__lIYEc > main > div.styles_ResultsWrapper__kAXts > div.styles_MainContentWrapper__h3r02 > ul > li:nth-child({0}) > section > div > div:nth-child(3) > div.styles_InfoWrap__CBt_k > div.styles_SpeaksWrapper__ssRz1 > ul > span.styles_MoreSubjects__eJRX9"
    tutor_description = "#__next > div.styles_PageWrap__lIYEc > main > div.styles_ResultsWrapper__kAXts > div.styles_MainContentWrapper__h3r02 > ul > li:nth-child({0}) > section > div > div:nth-child(3) > div.styles_InfoWrap__CBt_k > div.styles_TutorDescriptionBlock__Th9E_ > div > div"
    tutor_description_more = "#__next > div.styles_PageWrap__lIYEc > main > div.styles_ResultsWrapper__kAXts > div.styles_MainContentWrapper__h3r02 > ul > li:nth-child({0}) > section > div > div:nth-child(3) > div.styles_InfoWrap__CBt_k > div.styles_TutorDescriptionBlock__Th9E_ > div > span > span"

def get_total_number_of_pages(driver):
    """Scrapes the total number of pages information
    found at the bottom of the page"""
    total_number_of_pages = int(driver.find_element(By.CSS_SELECTOR, Selectors.n_of_pages_selector).text) + 1
    return total_number_of_pages

def refresh_page_currency(driver):
    """
    Refreshes the page currency from EUR to USD.
    # Context:
    Since I'm in Mexico even that the USD currency is default,
    the page is loaded in MXN pesos currency. To turn around that,
    this function switch to EUR and then to USD, what makes the USD to
    be actually applied.
    """
    driver.find_element(By.CSS_SELECTOR, Selectors.currency_button).click()
    select = Select(driver.find_element(By.NAME, 'currency'))
    sleep(1)
    select.select_by_visible_text("EUR")
    sleep(1)
    driver.find_element(By.CSS_SELECTOR, Selectors.currency_button).click()
    select = Select(driver.find_element(By.NAME, 'currency'))
    select.select_by_visible_text("USD")

def scrape_preply_english_tutors(driver, URL, pages_to_iterate=11):
    """This function iterates over the URL https://preply.com/en/online/english-tutors
    and scrapes all english tutors contained information in it.
    Args:
    *driver: chrome driver object
    *URL: url template that is the scraping target
    *pages_to_iterate: number of pages that will be scraped. Default is 10.
    """
    for page_index in range(1,pages_to_iterate):
        tutors_dicttionary_list = []
        driver.get(URL.format(page_index))
        sleep(3)
        for tutor_index in range(1,20):
            try:
                tutor_basic_info = driver.find_element(By.CSS_SELECTOR,Selectors.tutor_basic_info.format(tutor_index))
                tutor_name = tutor_basic_info.text.split('\n')[0]
                tutor_teaches = tutor_basic_info.text.split('\n')[1]
            except NoSuchElementException:
                tutor_basic_info = ""
                tutor_name = ""
                tutor_teaches = ""
                continue
            try:
                driver.find_element(By.CSS_SELECTOR,Selectors.tutor_description_more.format(tutor_index)).click()
                sleep(0.5)
            except:
                print("Description is already fully visible - Nothing to click.")

            try:
                driver.find_element(By.CSS_SELECTOR,Selectors.tutor_languages_more.format(tutor_index)).click()
                sleep(0.5)
            except:
                print("No additional Languages - Nothing to click.")


            try:
                price = driver.find_element(By.CSS_SELECTOR,Selectors.price.format(tutor_index)).text
            except NoSuchElementException:
                price = ""

            try:
                lesson_duration = driver.find_element(By.CSS_SELECTOR,Selectors.lesson_duration.format(tutor_index)).text
            except NoSuchElementException:
                lesson_duration = ""

            try:
                is_newly_joined = driver.find_element(By.CSS_SELECTOR,Selectors.is_newly_joined.format(tutor_index)).text
                is_newly_joined = True
            except NoSuchElementException:
                is_newly_joined = False

            try:
                tutor_rating = driver.find_element(By.CSS_SELECTOR,Selectors.tutor_rating.format(tutor_index)).text
                rating = tutor_rating.split('\n')[0]
                if rating == "Newly joined":
                    rating = ""
                try:
                    n_of_reviews = to_int(tutor_rating.split('\n')[1])
                except:
                    message = "Not rated yet"
                    n_of_reviews = 0
            except NoSuchElementException:
                rating = ""
                message = "Not rated yet"

            try:
                thumbnail_img = driver.find_element(By.CSS_SELECTOR,Selectors.thumbnail_img.format(tutor_index)).get_attribute('src')
            except NoSuchElementException:
                pass

            try:
                tutor_country =  driver.find_element(By.CSS_SELECTOR,Selectors.tutor_country.format(tutor_index)).get_attribute('alt')
            except NoSuchElementException:
                pass

            try:
                lessons_and_students = driver.find_element(By.CSS_SELECTOR,Selectors.lessons_and_students.format(tutor_index)).text
                lessons = to_int(lessons_and_students.split("•")[0])
                try:
                    students = to_int(lessons_and_students.split("•")[1])
                except:
                    students = ""
                    print("No Students yet.")
            except NoSuchElementException:
                lessons = ""

            try:
                languages = driver.find_element(By.CSS_SELECTOR,Selectors.tutor_languages.format(tutor_index)).text
            except NoSuchElementException:
                languages = ""

            try:
                description = driver.find_element(By.CSS_SELECTOR,Selectors.tutor_description.format(tutor_index)).text
            except NoSuchElementException:
                description = ""

            tutor_dict = \
            { 
            "tutor_name":tutor_name
            ,"tutor_teaches":tutor_teaches
            ,"thumbnail_img":thumbnail_img
            ,"tutor_country":tutor_country
            ,"price":price
            ,"lesson_duration":lesson_duration
            ,"is_newly_joined":is_newly_joined
            ,"rating":rating
            ,"n_of_reviews":n_of_reviews
            ,"lessons":lessons
            ,"students":students
            ,"languages":languages
            ,"description":description}

            print(tutor_dict)
            tutors_dicttionary_list.append(tutor_dict)

    return tutors_dicttionary_list
            
def transform_list_of_dicts_in_df(list_of_dicts):
    "Transforms a list of Dictionaries into pandas DataFrame"
    df = pd.DataFrame(data=list_of_dicts)
    return df

### Scraping Execution

In [19]:
# Initialize Chrome Driver
options = webdriver.ChromeOptions()
options.add_argument("--start-maximized")
options.add_argument("disable-infobars")
chrome_driver_path="C:\Program Files\Google\Chrome\Application\chromedriver.exe"
driver = webdriver.Chrome(options=options)
driver.get(URL.format(1))
sleep(2)
print("Driver successfully initialized.")
test_pages = 11
total_pages = get_total_number_of_pages(driver)
try:
    refresh_page_currency(driver)
    print("Currency successfully refreshed.")
except Exception as e:
    print(f"Could not refresh currency, try again. {e}")
    
list_of_tutor_dictionaries = scrape_preply_english_tutors(driver,URL,test_pages)
tutors_df = transform_list_of_dicts_in_df(list_of_tutor_dictionaries)



Driver successfully initialized.
Currency successfully refreshed.
No Students yet.
{'tutor_name': 'Brianna C.', 'tutor_teaches': 'English', 'thumbnail_img': 'https://avatars.preply.com/i/logos/i/logos/avatar_mm6njhw04lb.jpg?d=160x160&f=jpeg', 'tutor_country': 'United States of America', 'price': '30', 'lesson_duration': '50-min lesson', 'is_newly_joined': True, 'rating': '', 'n_of_reviews': 0, 'lessons': 2, 'students': '', 'languages': 'Speaks:\nEnglishNative Chinese (Mandarin)Beginner SpanishUpper-Intermediate', 'description': "Licensed English teacher with 7 years of experience with both kids and adults Hello, my name is Teacher Bri and I am from the USA. I enjoy traveling and learning new languages, so I can't wait to help you learn English!\nSome other things I enjoy are reading, cooking, yoga, and learning about new cultures. Do we have any of the same interests?"}
No additional Languages - Nothing to click.
No Students yet.
{'tutor_name': 'Cristina S.', 'tutor_teaches': 'English'

### Pandas DF form

In [20]:
tutors_df.head()

Unnamed: 0,tutor_name,tutor_teaches,thumbnail_img,tutor_country,price,lesson_duration,is_newly_joined,rating,n_of_reviews,lessons,students,languages,description
0,Joseph M.,English,https://avatars.preply.com/i/logos/i/logos/ava...,South Africa,12,50-min lesson,False,5,10,22,356,Speaks:\nEnglishNative,TEFL Certified and experienced English Tutor. ...
1,Daniel A.,English,https://avatars.preply.com/i/logos/i/logos/ava...,United Kingdom,20,50-min lesson,False,5,11,27,539,Speaks:\nEnglishNative SpanishIntermediate,TEFL certified tutor with over 3 years of expe...
2,Herman P.,English,https://avatars.preply.com/i/logos/i/logos/ava...,South Africa,12,50-min lesson,False,5,23,39,2318,Speaks:\nEnglishNative,Certified English Teacher with 4 years experie...
3,John Ellionico M.,English,https://avatars.preply.com/i/logos/i/logos/ava...,Philippines,12,50-min lesson,False,5,16,7,926,Speaks:\nEnglishNative TagalogNative,Certified English Tutor with Bachelor's Degree...
4,Nicolette V.,English,https://avatars.preply.com/i/logos/i/logos/ava...,South Africa,13,50-min lesson,False,5,36,47,3351,Speaks:\nEnglishNative,"TEFL/TESOL/ESL Certified. Friendly, native Eng..."
