<div>
<img src="images/icon_important.jpg" width="50" align="left"/>
</div>
<br>
<br>

#### __Important Legal Notice__
By running and editing this Jupyter notebook with the corresponding dataset, you agree that you will not use or store the data for purposes other than participating in the Champagne Coding with DNB & Women in Data Science, Oslo. You will delete the data and notebook after the event and will not attempt to identify any of the commenters.

### Scraping Reviews

In [None]:
from pathlib import Path
current_directory = Path.cwd()
reviews_directory = Path(current_directory, 'reviews')
html_directory = Path(current_directory, 'html')

# Create the output folder up front so that writing the CSV files later does not fail on a missing folder
reviews_directory.mkdir(exist_ok=True)

In [None]:
import re
import pandas as pd

import bs4
from bs4 import BeautifulSoup

In [None]:
from selenium import webdriver
from time import sleep
import requests
from selenium.webdriver.common.keys import Keys
import time

In [None]:
# Map a short app name to its Google Play package id (with the English-locale flag appended)
apps_to_crawl = {
    'nordea': 'no.nordea.mobilebank&hl=en',
    'dnb': 'no.apps.dnbnor&hl=en',
    'sparebank': 'no.sparebank1.mobilbank&hl=en',
    'sbanken': 'no.skandiabanken&hl=en',
    'posten': 'no.posten.sporing.controller&hl=en',
    'aftenposten': 'no.cita&hl=en',
    'nrk': 'no.nrk.mobil.app&hl=en',
}

#### Start crawling the webpage.

For each application we open a new driver and loop over the page, alternating between clicking and scrolling. If an element labelled ```Show More``` is present, we click it; otherwise we scroll to the bottom of the page. The loop stops once either the maximum number of clicks (```max_clicks```) or the maximum number of scrolls (five times ```max_clicks```) has been exceeded.
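
The stopping rule can be checked without a browser. The sketch below is only an illustration: ```crawl_steps``` and its ```button_visible``` argument are hypothetical stand-ins for the Selenium lookup of ```Show More```, and nothing is actually clicked or scrolled.

```python
def crawl_steps(button_visible, max_clicks=20):
    """Simulate the click-or-scroll loop with the same limits as the crawler.

    button_visible is a hypothetical callable that returns True when the
    'Show More' element would be found on the page.
    """
    num_clicks = 0
    num_scrolls = 0
    while num_clicks <= max_clicks and num_scrolls <= max_clicks * 5:
        if button_visible():
            num_clicks += 1   # the real loop would click the button here
        else:
            num_scrolls += 1  # the real loop would send Keys.END here
    return num_clicks, num_scrolls


# Button always present: the loop ends on the click limit
always_clicks, always_scrolls = crawl_steps(lambda: True, max_clicks=20)

# Button never present: the loop ends on the scroll limit
never_clicks, never_scrolls = crawl_steps(lambda: False, max_clicks=20)

print(always_clicks, always_scrolls)  # 21 0
print(never_clicks, never_scrolls)    # 0 101
```

Because both comparisons use ```<=```, the loop performs one action more than each limit: 21 clicks or 101 scrolls when ```max_clicks``` is 20.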

We then locate the HTML elements that hold each review and parse them into a dataframe.

These are the HTML elements we identified for parsing the reviews:
- __Entire Review & Contents__: ```div jscontroller="H6eOGe"```
- __Name__: ```span class="X43Kjb"```
- __Date__: ```span class="p2TkOb"```
- __Review Score__: ```div class="pf5lIe"```
- __Review Text__: ```span jsname="bN97Pc"```
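
As a minimal sketch of how these lookups behave, the snippet below runs them on a small hand-written fragment. The markup, the reviewer name, the date and the ```aria-label``` wording are all made up for illustration; real Google Play pages are much larger and their attribute values can change. The review text uses the ```jsname``` value that the scraping code below looks up (```bN97Pc```).

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup that only mimics the structure of one review
sample_html = """
<div jscontroller="H6eOGe">
  <span class="X43Kjb">Example User</span>
  <span class="p2TkOb">January 1, 2020</span>
  <div class="pf5lIe"><div aria-label="Rated 4 stars out of five stars"></div></div>
  <span jsname="bN97Pc">Works fine for me.</span>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# One <div jscontroller="H6eOGe"> wraps each review
all_comments = soup.find_all('div', attrs={'jscontroller': 'H6eOGe'})
review = all_comments[0]

name = review.find('span', attrs={'class': 'X43Kjb'}).text
date = review.find('span', attrs={'class': 'p2TkOb'}).text

# The score is read from the element's markup rather than from its text
score_markup = str(review.find('div', attrs={'class': 'pf5lIe'}))
score = re.search(r'(\d+) stars out of five stars', score_markup).group(1)

review_text = review.find('span', attrs={'jsname': 'bN97Pc'}).text

print(name, date, score, review_text)
```

Note that ```find``` returns ```None``` when an element is missing, so on a real page a review without one of these elements would raise an ```AttributeError``` at ```.text```.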

In [None]:
from webdriver_manager.chrome import ChromeDriverManager

In [None]:
for app_name, app_id in apps_to_crawl.items():
    driver = webdriver.Chrome(ChromeDriverManager().install())
    
    link = "https://play.google.com/store/apps/details?id={}".format(app_id)
    driver.get(link + '&showAllReviews=true')

    # Change this number to get more or fewer reviews
    max_clicks = 20

    # Start crawling
    num_clicks = 0
    num_scrolls = 0
    while num_clicks <= max_clicks and num_scrolls <= max_clicks*5:
        try:
            # Click the 'Show More' element when it is present
            show_more = driver.find_elements_by_xpath("//*[contains(text(), 'Show More')]")
            show_more[0].click()

            num_clicks += 1
        except Exception:
            # No element found (empty list) or the click failed: scroll to the bottom instead
            html = driver.find_element_by_tag_name('html')
            html.send_keys(Keys.END)
            num_scrolls += 1
            time.sleep(1)

    print('Done scrolling')        

    soup = bs4.BeautifulSoup(driver.page_source, 'html.parser')
    # quit() closes the browser window and also ends the driver session
    driver.quit()

    all_comments = soup.body.find_all('div', attrs={'jscontroller': 'H6eOGe'})

    # Loop through each comment and use BeautifulSoup's find() to pull out the relevant parts
    all_reviews_dict = {}
    i = 0

    for each_comment in all_comments:
        current_review = {}

        name = each_comment.find('span', attrs={'class': 'X43Kjb'})
        current_review['Name'] = name.text

        date = each_comment.find('span', attrs={'class': 'p2TkOb'})
        current_review['Date'] = date.text

        # The star rating is extracted from the score element's markup with a regular expression
        score = each_comment.find('div', attrs={'class': 'pf5lIe'})
        current_review['Review_Score'] = re.search(r'(\d+) stars out of five stars', str(score)).group(1)

        review_text = each_comment.find('span', attrs={'jsname': 'bN97Pc'})
        current_review['Review_Text'] = review_text.text
        i += 1

        all_reviews_dict[i] = current_review

    # The dict maps review index -> review, so transpose to get one review per row
    df_allreviews = pd.DataFrame(all_reviews_dict)
    df_allreviews = df_allreviews.T

    df_allreviews.drop_duplicates(inplace=True)
    print("Done reading {} application data. {} reviews were found.".format(app_name, len(df_allreviews)))

    df_allreviews.to_csv(Path(reviews_directory, '{}_reviews.csv'.format(app_name)))