# Lecture 8

In [7]:
# Please include your names below
# Also, please edit the name of the file and include the names of the two (or three) people answering

# Pair answering the assignment: Daniel Reiss, David Stalder
# Pair giving feedback: Ivan Allinckx, Chritian Aeberhard

## Selenium Download
For the next exercises you will have to install Selenium and download ChromeDriver.

You can read more about the webdriver here (https://chromedriver.chromium.org), but if you want to go straight to the download, go to https://chromedriver.storage.googleapis.com/index.html?path=80.0.3987.106/ and download the build for your operating system.

Moreover, in your terminal type `pip install selenium`. 

Once this is done, you should be able to run:
- `from selenium import webdriver`
- `browser = webdriver.Chrome([the path where you put chromedriver])`

In case of any issues, the https://chromedriver.chromium.org website has some straightforward info on common bugs. 


In [8]:
from selenium import webdriver
import requests
import time

# replace path_chromedriver with your path to chromedriver.exe
# David: r'C:\Users\David\Documents\GitHub\SoComp\notebook_lecture8\David&Dani\chromedriver.exe'

path_chromedriver = r'C:\Users\Dani\Documents\SoComp\notebook_lecture8\David&Dani\chromedriver.exe'
browser = webdriver.Chrome(path_chromedriver)
browser.close()

### 1. Rate limiting

1. By now, you are familiar with 3 APIs, namely Google Books, NYT, and Dribbble. For each one, find and copy the rules about rate limits. Next, pick one and try to exceed the rate limit; explain what you do and what reaction you get from the API.

Google Books: 100000 per day
    
NYT: 4,000 requests per day and 10 requests per minute

Dribbble: 60 requests per minute and 1,440 requests per day per authenticated user
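
To stay under limits like these on the client side, one option is to space requests out evenly. The following is a minimal sketch and not part of the original answer: `min_interval` and `Throttle` are hypothetical helper names, and the 10-requests-per-minute figure is the NYT limit quoted above.

```python
import time


def min_interval(max_requests, per_seconds):
    """Smallest spacing (in seconds) that keeps a client at or under the limit."""
    return per_seconds / max_requests


class Throttle:
    """Blocks so that consecutive calls to wait() are at least `interval` seconds apart."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock   # injectable so the class can be tested without real waiting
        self._sleep = sleep
        self._last = None     # time of the previous call; None before the first call

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# NYT: 10 requests per minute, i.e. one request every 6 seconds
nyt_throttle = Throttle(min_interval(10, 60))
```

Calling `nyt_throttle.wait()` before each request would keep the per-minute limit; the daily limit would still need a separate counter.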

2. In the next problem you will check how many requests you can send to Google Search before getting blocked. Websites protect themselves from automated crawling by checking requests that come from the same computer in a small time frame and after a while, they won't respond to the request. A valid response would be "Response 200", which you can see if you just print the response of `requests.get('https://www.google.com/search?q=zurich')`. 

The questions are:
a) How many requests does it take to get blocked (i.e., until you first get a response other than 200)?
b) What is the status code of the blocked response, and what exactly does it stand for (Google response XXX)? If you still can :)

In [9]:
count = 0
s = requests.Session()
url = 'https://www.google.com/search?q=zurich'

# Keep the response object so the blocked response itself can be inspected afterwards
response = s.get(url)
while response.status_code == 200:
    if count % 5 == 0:
        print(count)
    count += 1
    response = s.get(url)

print(response) # Response 429 - Too Many Requests
print('It takes ' + str(count) + ' requests before getting blocked.')

0
5
10
15
20
25
30
35
40
45
50
55
60
<Response [429]>
It takes 64 requests before getting blocked.
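
The meaning of a status code such as 429 can be looked up with the standard library instead of by hand. The sketch below is illustrative only: `describe_status` and `retry_delay` are hypothetical helpers, and whether a given server actually sends a `Retry-After` header with its 429 response is an assumption that has to be checked per site.

```python
from http import HTTPStatus


def describe_status(code):
    """Return e.g. '429 Too Many Requests' for a numeric HTTP status code."""
    status = HTTPStatus(code)
    return "{} {}".format(status.value, status.phrase)


def retry_delay(headers, default=60.0):
    """Seconds to wait before retrying, taken from Retry-After when it is numeric.

    Retry-After may also be an HTTP date; that form is not handled here and
    falls back to `default`.
    """
    value = headers.get("Retry-After")
    try:
        return float(value)
    except (TypeError, ValueError):
        return default


print(describe_status(429))
```

With a `requests` response, `retry_delay(response.headers)` could be passed to `time.sleep` before trying again.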


### 2. Selenium sessions

Go to a website of your choice where you have an account. It can, for example, be the New York Times API website where you created a login last time, but also tutti.ch, comparis, or any other simple website you often use.

Using Selenium, create a session where you
1. go to the main website 
2. log in 
3. click on an element of your choice 
4. scroll to the bottom of the page
5. then save the page. 

When logging in, you will have to find the name of the login form, enter your credentials into it, and then click the login button. An example of a login using Selenium is available at https://crossbrowsertesting.com/blog/test-automation/automate-login-with-selenium/; if you decide to use it, do not choose Facebook as your website.
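
Since the credentials have to be submitted from code, one option is to read them from environment variables rather than typing them into the notebook. This is only a sketch under assumptions: `load_credentials` and the variable names `SITE_USERNAME` and `SITE_PASSWORD` are hypothetical and would have to be set on your machine beforehand.

```python
import os


def load_credentials(env=os.environ, user_var="SITE_USERNAME", pass_var="SITE_PASSWORD"):
    """Read login credentials from environment variables (hypothetical names)."""
    missing = [name for name in (user_var, pass_var) if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env[user_var], env[pass_var]
```

The returned pair could then be passed to the `send_keys` calls on the username and password fields.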

In [10]:
driver = webdriver.Chrome()

# 1)
driver.get("https://developer.nytimes.com/accounts/login")

# 2)
driver.find_element_by_name("username").send_keys("david.stalder@uzh.ch")
driver.find_element_by_name("password").send_keys("HelloWorld1*")
driver.find_element_by_id("login-button").click()

# 3)
time.sleep(5)
driver.find_element_by_xpath("//div/ng-component/page-content/div/div[2]/mat-card[1]/mat-card-header").click()

# 4)
time.sleep(5)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
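# Scrolling does not seem to work on this page; on other sites the same call worked fine.

# 5)
# Write with an explicit encoding so non-ASCII characters in the page do not raise an error on Windows
with open("page.html", "w", encoding="utf-8") as f:
    f.write(driver.page_source)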
    
driver.close()

### 3. Measuring personalization

In this exercise you will have to imitate the study described in class on a website of your interest. You will have to measure differences in the content that you receive back from the website under varying treatments. 

You will have to choose a website and a treatment. Use selenium for this exercise as well. 
- As for websites, you can pick an online store, a travel site, a news site, Google News, etc. Basically, try to pick something that you suspect gives different results to different searchers.
- Examples for treatments would be location, being logged in with an account, history with the website, being on a phone vs a desktop, etc. 
- You can run multiple searches to make sure you are measuring a real phenomenon and not just noise.
- You can include a control treatment in case you suspect there's A/B testing or noise in how the pages look
- Finally you have to pick a measure for the differences on the page. In case you receive items on a page, for example URLs or products, you can define an overlap metric. In case the page is more unstructured, come up with an explanation for how you define differences.
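
One possible overlap metric for item lists is the Jaccard index: the number of shared items divided by the number of distinct items across both pages. The sketch below is one option among several, not the required measure; `jaccard_overlap` and the item lists are hypothetical. Comparing two control runs first gives a noise baseline against which the treatment can be judged.

```python
def jaccard_overlap(items_a, items_b):
    """Jaccard index: size of the intersection divided by size of the union, in [0, 1]."""
    set_a, set_b = set(items_a), set(items_b)
    union = set_a | set_b
    if not union:
        return 1.0  # convention chosen here: two empty pages count as identical
    return len(set_a & set_b) / len(union)


# Hypothetical item lists, e.g. result URLs collected under each condition
control_1 = ["a", "b", "c", "d"]
control_2 = ["a", "b", "c", "e"]   # same settings, second run (noise baseline)
treatment = ["a", "x", "y", "z"]

print(jaccard_overlap(control_1, control_2))  # 3 shared / 5 distinct = 0.6
print(jaccard_overlap(control_1, treatment))  # 1 shared / 7 distinct
```

If the overlap between control and treatment is clearly lower than the overlap between the two control runs, the difference is less likely to be noise or A/B testing.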

As your answer, explain which of the above you chose, how you implemented the experiment, and what difference you found in the pages you collected. 

You can find more info on how to run multiple browsers at the same time here: https://crossbrowsertesting.com/blog/selenium/run-test-multiple-browsers-parallel-selenium/

In [11]:
from bs4 import BeautifulSoup

'''
First, we open youtube.com in Chrome with our user profile, which we use to visit YouTube on a regular basis. We extract
all video titles from the start page and save them in a list.
Then we open YouTube again, this time in Firefox without a user profile (default). We again save all titles.
Finally, we compare both lists of video titles and print the matching percentage.
'''

# CHROME
options = webdriver.ChromeOptions()
# edit path_google_account_data to your matching path
# you can type chrome://version in chrome address bar and use the value of Profilepath without "\Default\"
# David: r'user-data-dir=C:\Users\David\AppData\Local\Google\Chrome\User Data'
path_google_account_data = r'user-data-dir=C:\Users\Dani\AppData\Local\Google\Chrome\User Data'
options.add_argument(path_google_account_data)

chrome = webdriver.Chrome(path_chromedriver, options=options)
chrome.get("https://www.youtube.com/")
chrome_page = chrome.page_source
chrome_html = BeautifulSoup(chrome_page, "html.parser")
chrome_tags = chrome_html.find_all("yt-formatted-string", id="video-title")
chrome_titles = []
for tag in chrome_tags:
    chrome_titles.append(tag.text)
chrome.close()

#print(chrome_titles)

# FIREFOX
# you will need geckodriver for this: https://github.com/mozilla/geckodriver/releases

ff = webdriver.Firefox()
ff.get("https://www.youtube.com/")
ff_page = ff.page_source
ff_html = BeautifulSoup(ff_page, "html.parser")
ff_tags = ff_html.find_all("yt-formatted-string", id="video-title")
ff_titles = []
for tag in ff_tags:
    ff_titles.append(tag.text)
ff.close()

#print(ff_titles)

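# Each match appears once in both lists, so it is counted twice against the combined length;
# this way two identical lists (without duplicate titles) give 100% instead of 50%
total = len(chrome_titles) + len(ff_titles)
matching_count = 0
for title in chrome_titles:
    if title in ff_titles:
        matching_count += 1

print("The matching quota is: {}%".format(2 * matching_count / total * 100))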

The matching quota is: 0.0%


Congratulations on completing the third notebook! Now it’s time for feedback.
1. Pass your solution to the other pair in your group.
2. Include your feedback in the other pair’s notebook. Don’t forget to add your names at the top.
3. Return the notebook with feedback to the original pair.
4. Upload your notebook, with the feedback included by the other pair, on OLAT.

You can think of/suggest (among other things)
 - improvements in the code (e.g. readability, efficiency)
 - improvements in the answers (e.g. are they easy to understand, are they correct, how can they be improved?)
 - point out differences (e.g. are there any differences between the responses of the two pairs? if yes what are they, what is the cause, and in which way can they be useful?)
 
Not all suggestions about the type of feedback apply to all types of questions. Try to give feedback in a meaningful and constructive way. 

In [6]:
# Below there is space for giving feedback. This space should be used only by the other pair in your group.

'''
Feedback here
'''

'\nFeedback here\n'