<div>
<img src="images/icon_important.jpg" width="50" align="left"/>
</div>
<br>
<br>

#### __Important Legal Notice__
By running and editing this Jupyter notebook with the corresponding dataset, you agree that you will not use or store the data for purposes other than participating in the Champagne Coding with DNB & Women in Data Science, Oslo. You will delete the data and notebook after the event and will not attempt to identify any of the commenters.

### Scraping Reviews

Here we will review two methods for scraping reviews:
- __Method 1__: Download HTML directly from the website
- __Method 2__: Use a crawler driven by a web driver and the application ID

Both methods require some personalization. With __Method 1__, you first need to scroll to the end of the reviews so that the HTML of all the reviews is loaded. You then need to use the "Inspect" tool in your web browser to find the unique identifiers for the elements of each review (name, date, review score, review text, etc.). We've identified these elements for the DNB application, but they may be different for other applications.

In [None]:
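from pathlib import Path
current_directory = Path.cwd()
reviews_directory = Path(current_directory, 'reviews')
html_directory = Path(current_directory, 'html')

# Create the output directory if needed so that saving the CSV files later does not fail
reviews_directory.mkdir(exist_ok=True)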

### Method 1: Use HTML downloaded directly from Google Play

For this method to work, you first need a saved HTML file in the ```html``` directory. You can download the HTML for an application by first finding it in the Google Play store.

In this case, we will go through the steps for downloading the HTML for the DNB app (https://play.google.com/store/apps/details?id=no.apps.dnbnor&hl=en&showAllReviews=true)

We start by navigating to the Google Play store and finding the application we are interested in. This will show you a page similar to this:
<br>
<br>
<div>
<img src="images/00-google_play_store_screenshot.png" width="500"/>
</div>
<br>

Scroll all the way to the bottom of the reviews, clicking to show all reviews. This may take some time depending on the popularity of the application and number of reviews.

We will download the HTML using the Inspect option, since we need to understand the elements of the code for our scraping purposes. In Google Chrome, you can access this by right-clicking somewhere on the page and clicking ```Inspect```.
<br>
<br>
<div>
<img src="images/01-inspect.png" width="500"/>
</div>
<br>

To save the HTML, you can either copy the highest level element, or find the element that includes the comments. 
<br>
<br>
<div>
<img src="images/01-copy_html.png" width="500"/>
</div>
<br>
Save the copied element in a text editor. I've saved mine as ```dnb.html``` and placed it in a directory called ```html```, located within this folder.

#### Now that you've saved your HTML, we're ready to extract some information.

We will use BeautifulSoup, a Python library that makes it easy to scrape information from web pages. Using an HTML parser, it provides useful methods for iterating, searching, and modifying the parse tree.
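
As a quick illustration of searching the parse tree, here is a minimal sketch on a made-up HTML snippet. The tag names and class values (`card`, `who`) are hypothetical and unrelated to Google Play; only `find`, `find_all`, and `.text` are the point.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, made up for illustration only
sample_html = """
<div class="card"><span class="who">A</span><p>First text</p></div>
<div class="card"><span class="who">B</span><p>Second text</p></div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# find_all returns every matching tag; find returns the first match (or None)
cards = sample_soup.find_all('div', attrs={'class': 'card'})
authors = [card.find('span', attrs={'class': 'who'}).text for card in cards]
texts = [card.find('p').text for card in cards]

print(authors)
print(texts)
```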

In [None]:
import re
import pandas as pd

import bs4
from bs4 import BeautifulSoup

Let's start by reading our saved HTML file. Here, we will read our html file ```dnb.html``` which is located in our html directory.

In [None]:
# read_text opens the file, reads it, and closes it again
read_file = Path(html_directory, 'dnb.html').read_text(encoding='utf-8')

This is where our inspection job becomes important. When using the inspect tool in Google Chrome, we found the names and key-value pairs for each element we are interested in. 

__Note__: These will be different for different applications so make sure that you update these parameters.

Here is what we've found for the DNB application:
- __Entire Review & Contents__: ```div jscontroller="H6eOGe"```
- __Name__: ```span class="X43Kjb"```
- __Date__: ```span class="p2TkOb"```
- __Review Score__: ```div class="pf5lIe"```
- __Review Text__: ```span jsname="bN97Pc"```

We'll start by converting the HTML string that we've read in to a BeautifulSoup object. We name the parser explicitly so that the result does not depend on which parsers happen to be installed.

In [None]:
soup = BeautifulSoup(read_file, 'html.parser')

We want to find all of the comments, so let's go back to our inspect tool and look at what the review elements are called. Here we can see that they are ```div``` elements with the following attribute-value pair: ```'jscontroller': 'H6eOGe'```

<br>
<br>
<div>
<img src="images/all_comments.png" width="500"/>
</div>
<br>

We will use that knowledge to search for all comments with these attributes.

In [None]:
# Search the whole document: soup.body is None when only a fragment of the page was saved
all_comments = soup.find_all('div', attrs={'jscontroller': 'H6eOGe'})

Now, we need to find the corresponding elements for name, date, score, and the review text. Adjust the attribute values in the loop to match your application.

In [None]:
# Loop through the comments and use BeautifulSoup's find() to pull out each field
all_reviews_dict = {}

for i, each_comment in enumerate(all_comments, start=1):
    current_review = {}

    name = each_comment.find('span', attrs={'class': 'X43Kjb'})
    current_review['Name'] = name.text

    date = each_comment.find('span', attrs={'class': 'p2TkOb'})
    current_review['Date'] = date.text

    # The rating is embedded in text such as "4 stars out of five stars"
    score = each_comment.find('div', attrs={'class': 'pf5lIe'})
    current_review['Review_Score'] = re.search(r'(\d+) stars out of five stars', str(score)).group(1)

    review_text = each_comment.find('span', attrs={'jsname': 'bN97Pc'})
    current_review['Review_Text'] = review_text.text

    all_reviews_dict[i] = current_review
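
Calling `.group(1)` directly on the result of `re.search` raises an `AttributeError` when a review has no matching rating text. A minimal sketch of a more forgiving helper follows. The function name `parse_score` is hypothetical, and the sample strings are invented; they only assume the `N stars out of five stars` wording that the loop targets.

```python
import re

# Same wording as in the loop: the rating appears as "<N> stars out of five stars"
SCORE_PATTERN = re.compile(r'(\d+) stars out of five stars')

def parse_score(score_html):
    """Return the star rating as an int, or None when the text does not match."""
    match = SCORE_PATTERN.search(str(score_html))
    return int(match.group(1)) if match else None

# Invented inputs for illustration
found = parse_score('<div aria-label="Rated 4 stars out of five stars"></div>')
missing = parse_score(None)
print(found, missing)
```

Returning `None` instead of raising lets the remaining reviews be parsed; the rows without a score can then be inspected or dropped in the data frame.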

How does this look as a data frame?

In [None]:
df_allreviews = pd.DataFrame(all_reviews_dict)
df_allreviews = df_allreviews.T

df_allreviews.head()

How many reviews do we have?

In [None]:
len(df_allreviews)

Check for duplicates and empty rows.

In [None]:
df_allreviews.drop_duplicates().shape

Let's delete rows that are exact duplicates.

In [None]:
df_allreviews.drop_duplicates(inplace=True)

Are there any empty cells?

In [None]:
df_allreviews.dropna().shape
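
Note that `dropna()` only removes missing values (`NaN`/`None`); a review whose text is an empty string is still counted. The sketch below uses an invented data frame (not real reviews) to count duplicates, missing values, and blank strings separately.

```python
import pandas as pd

# Invented rows for illustration only
sample = pd.DataFrame({
    'Date': ['May 1, 2019', 'May 1, 2019', 'May 2, 2019', 'May 3, 2019'],
    'Review_Score': ['5', '5', '1', '3'],
    'Review_Text': ['Great app', 'Great app', '', None],
})

# Rows that repeat an earlier row in every column
n_duplicates = int(sample.duplicated().sum())
# Review texts that are NaN / None
n_missing = int(sample['Review_Text'].isna().sum())
# Review texts that are missing, empty, or whitespace only
n_blank = int((sample['Review_Text'].fillna('').str.strip() == '').sum())

print(n_duplicates, n_missing, n_blank)
```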

#### Anonymization

Since there were names in some of the reviews, we decided to use the list of names from the scraped reviews and replace them with empty strings in the review text. Make sure that you always prioritize anonymization when working with potentially sensitive data :) 

If you add reviews from another app that contain names, make sure to run this function!

In [None]:
def remove_names(input_string, list_names):
    """Remove every (lowercased) name in list_names from input_string."""
    for n in list_names:
        input_string = input_string.replace(n.lower(), "")

    return input_string

def anonymization(df):
    """
    Create a list of reviewer names and search for them within the review text.
    Replaces these names with blank strings. Note that the review text is
    lowercased as a side effect.
    """
    # Store all of the names in a list
    all_names = df.Name.tolist()
    # Keep the unique names that are longer than 5 characters
    names_list = list(set([n for n in all_names if len(n) > 5]))

    df.Review_Text = df.Review_Text.apply(lambda x: remove_names(str(x).lower(), names_list))

    return df

In [None]:
df_allreviews = anonymization(df_allreviews)
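
The function above lowercases every review and removes names wherever they appear as substrings. As an alternative, here is a hedged sketch that matches whole words only and ignores case, so the original casing of the text is kept. The helper name `remove_names_regex` and the sample data are invented for illustration; it is not the method used in this notebook.

```python
import re

def remove_names_regex(text, names, min_length=6):
    """Remove whole-word, case-insensitive occurrences of each name from text."""
    # Keep unique names with at least min_length characters, longest first
    candidates = sorted({n for n in names if len(n) >= min_length}, key=len, reverse=True)
    if not candidates:
        return text
    pattern = re.compile(
        r'\b(?:' + '|'.join(re.escape(n) for n in candidates) + r')\b',
        flags=re.IGNORECASE,
    )
    return pattern.sub('', text)

# Invented example data
names = ['Kari Nordmann', 'Ola']
cleaned = remove_names_regex('Thanks, kari nordmann! Works well in Norway.', names)
print(cleaned)
```

The `min_length=6` default mirrors the `len(n) > 5` filter above. Word boundaries (`\b`) only work as intended for names that start and end with a letter or digit.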

Save the file.

In [None]:
df_allreviews[['Date', 
               'Review_Score', 
               'Review_Text']].to_csv(Path(reviews_directory, 'dnb_reviews.csv'))

### Method 2: Use a crawler and the app id

Here we will use Selenium's WebDriver API, which allows us to control a browser running locally. There are many drivers for the different web browsers. Here we will use Chrome. If you don't have Chrome installed, you can choose an Internet Explorer, Firefox, or Safari web driver.

More information [here](https://www.seleniumhq.org/projects/webdriver/).

In [None]:
import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

First, find the ID of the application you would like to scrape. It is the value of the ```id``` parameter in the application's Google Play URL.

In [None]:
# Application ID followed by the language parameter (hl=en requests the English page)
app_id = 'no.apps.dnbnor&hl=en'

#### Start the driver

This will open a Chrome instance and load the Google Play page for the application ID you've set.

In [None]:
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().install())

link = "https://play.google.com/store/apps/details?id={}".format(app_id)
driver.get(link + '&showAllReviews=true')
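
The link above is assembled by string concatenation, with the language parameter tucked into `app_id`. A minimal sketch of building the same URL from separate parts with the standard library follows; the function name `build_reviews_url` is hypothetical.

```python
from urllib.parse import urlencode

def build_reviews_url(package_id, language='en'):
    """Build the Google Play details URL from its separate query parameters."""
    query = urlencode({'id': package_id, 'hl': language, 'showAllReviews': 'true'})
    return 'https://play.google.com/store/apps/details?' + query

url = build_reviews_url('no.apps.dnbnor')
print(url)
```

Keeping the package ID and the language separate makes it easier to switch to another application without editing the query string by hand.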

#### Start crawling the webpage.

In this loop, there are two options: if an element labelled ```Show More``` is present, we click it; otherwise, we scroll to the bottom of the page. The loop stops once the maximum number of clicks or the maximum number of scrolls has been reached.

__Note__ : This can take some time depending on how many ```max_clicks``` you set.

##### __TODO__:  Setting the number of maximum clicks is arbitrary in this case. How could you set up this method so that it scrolls and clicks just enough rather than continuing until the max criteria is reached?

In [None]:
from selenium.webdriver.common.by import By

# Change this number to get more or fewer reviews
max_clicks = 40

# Start crawling
num_clicks = 0
num_scrolls = 0
while num_clicks <= max_clicks and num_scrolls <= max_clicks*5:
    try:
        # Click the "Show More" button; an IndexError is raised when it is absent
        show_more = driver.find_elements(By.XPATH, "//*[contains(text(), 'Show More')]")
        show_more[0].click()

        num_clicks += 1
    except Exception:
        # No clickable button: jump to the end of the page to load more reviews
        html = driver.find_element(By.TAG_NAME, 'html')
        html.send_keys(Keys.END)
        num_scrolls += 1
        time.sleep(1)

print('Done scrolling')
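
One possible direction for the TODO is to stop once loading more content no longer increases the number of reviews on the page. The sketch below shows that stopping rule in isolation. The names `scroll_until_stable`, `load_more`, `count_items`, and `FakePage` are all hypothetical, and a fake page stands in for the browser so the logic can be run without Selenium.

```python
import time

def scroll_until_stable(load_more, count_items, patience=3, max_rounds=200, pause=1.0):
    """Call load_more() until count_items() stops growing for `patience` rounds."""
    last_count = count_items()
    stalled = 0
    rounds = 0
    while stalled < patience and rounds < max_rounds:
        load_more()
        time.sleep(pause)  # give the page time to load new content
        rounds += 1
        current = count_items()
        if current > last_count:
            last_count = current
            stalled = 0
        else:
            stalled += 1
    return last_count

# Stand-in for a real page: each call "loads" 10 more items, up to 35
class FakePage:
    def __init__(self):
        self.items = 0

    def load_more(self):
        self.items = min(self.items + 10, 35)

page = FakePage()
total = scroll_until_stable(page.load_more, lambda: page.items, patience=2, pause=0)
print(total)
```

With the real driver, `load_more` could wrap the click-or-scroll step, and `count_items` could be something like `lambda: len(driver.find_elements(By.CSS_SELECTOR, 'div[jscontroller="H6eOGe"]'))`, assuming `By` is imported from `selenium.webdriver.common.by`.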

#### Now we're ready to evaluate the HTML we've scraped.

We'll use ```BeautifulSoup``` again to parse the html.

In [None]:
soup = bs4.BeautifulSoup(driver.page_source, 'html.parser')
# quit() closes the browser window and ends the WebDriver session
driver.quit()

Here, we will find the major H2 categories of the page. You can see that one of them is called ```Reviews```. This is what we are interested in parsing.

In [None]:
h2 = soup.find_all('h2')
h2

#### Let's get parsing! This will use the same code as above

##### __TODO__ : Feeling brave and ambitious? How do you think this could be done in a way that wouldn't require inspecting the HTML for the name as we did earlier? Does there seem to be a method to the madness?

In [None]:
all_comments = soup.body.find_all('div', attrs={'jscontroller': 'H6eOGe'})

### Loop through each of the comments and use a beautiful soup function to find the relevant parts
all_reviews_dict={}
i = 0

for each_comment in all_comments:    
    current_review = {}

    name = each_comment.find('span', attrs= {'class': 'X43Kjb'})
    current_review['Name'] = name.text 

    date = each_comment.find('span', attrs= {'class': 'p2TkOb'})
    current_review['Date'] = date.text 

    score = each_comment.find('div', attrs= {'class': 'pf5lIe'})
    current_review['Review_Score'] = re.search(r'(\d+) stars out of five stars', str(score)).group(1)

    review_text = each_comment.find('span', attrs= {'jsname': 'bN97Pc'}) #jsname="bN97Pc"
    current_review['Review_Text'] = review_text.text
    i += 1

    all_reviews_dict[i] = current_review

df_allreviews_driver = pd.DataFrame(all_reviews_dict)
df_allreviews_driver = df_allreviews_driver.T

df_allreviews_driver.drop_duplicates(inplace=True)
print("Done reading {} application data. {} reviews were found.".format('DNB',
                                                                        len(df_allreviews_driver)))

# Anonymize the reviews using the anonymization function
df_allreviews_driver = anonymization(df_allreviews_driver)

df_allreviews_driver[['Date', 
               'Review_Score', 
               'Review_Text']].to_csv(Path(reviews_directory, '{}_reviews_from-webdriver.csv'.format('DNB')))
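
Since both methods run the same parsing loop, it could be wrapped in one helper that takes an HTML string. The following is a minimal sketch, not the notebook's own code: the names `parse_reviews` and `SELECTORS` are hypothetical, the selector values are the ones found for the DNB app, and the sample HTML is invented to mimic that structure.

```python
import re
from bs4 import BeautifulSoup

# Selectors found for the DNB app; other apps may need different values
SELECTORS = {
    'review': ('div', {'jscontroller': 'H6eOGe'}),
    'Name': ('span', {'class': 'X43Kjb'}),
    'Date': ('span', {'class': 'p2TkOb'}),
    'Review_Text': ('span', {'jsname': 'bN97Pc'}),
    'score': ('div', {'class': 'pf5lIe'}),
}

def parse_reviews(html, selectors=SELECTORS):
    """Return a list of dicts, one per review block found in the HTML string."""
    soup = BeautifulSoup(html, 'html.parser')
    tag, attrs = selectors['review']
    rows = []
    for block in soup.find_all(tag, attrs=attrs):
        row = {}
        for field in ('Name', 'Date', 'Review_Text'):
            f_tag, f_attrs = selectors[field]
            found = block.find(f_tag, attrs=f_attrs)
            row[field] = found.text if found is not None else None
        s_tag, s_attrs = selectors['score']
        score = block.find(s_tag, attrs=s_attrs)
        match = re.search(r'(\d+) stars out of five stars', str(score))
        row['Review_Score'] = match.group(1) if match else None
        rows.append(row)
    return rows

# Invented snippet that mimics the structure described in this notebook
sample_html = '''
<div jscontroller="H6eOGe">
  <span class="X43Kjb">Test User</span>
  <span class="p2TkOb">May 1, 2019</span>
  <div class="pf5lIe"><div aria-label="Rated 4 stars out of five stars"></div></div>
  <span jsname="bN97Pc">Works fine.</span>
</div>
'''
rows = parse_reviews(sample_html)
print(rows)
```

The list of dicts can be passed straight to `pd.DataFrame(rows)`, so the same helper could serve the saved HTML file in Method 1 and the page source captured from the driver in Method 2.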

How many reviews were extracted with this method?

In [None]:
df_allreviews_driver.shape

Is it the same length as the original file where we saved the HTML ourselves?

In [None]:
df_allreviews.shape

Anonymize it.

In [None]:
df_allreviews_driver = anonymization(df_allreviews_driver)

Save it.

In [None]:
df_allreviews_driver = df_allreviews_driver.reset_index(drop=True)
df_allreviews_driver[['Date', 
            'Review_Score',
            'Review_Text']].to_csv(Path(reviews_directory, 'dnb_reviews-autoparsed.csv'))

Now we're ready to start analyzing! Continue to the next notebook.