# Online data collection using the requests library

In [52]:
# Please include your names below
# Also, please edit the name of the file and include the names of the two (or three) people answering

# Pair answering the assignment: Daniel Reiss, David Stalder
# Pair giving feedback: Christian Aeberhard, Ivan Allinckx

As step 0, pick your favorite Wikipedia page, open it in the browser, and then save it as an html file. Now open it in the browser as well as in a text editor and look at the difference. 

Using the requests library you can retrieve the html source of the page in a response object (using requests.get("url")). The response object has content that you can access through its .text attribute.

Read .text and save the result in a file, then open the file in a browser and check whether you successfully saved the page. Note that you will only be able to open the file in the browser if you give it an html extension.
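
As a rough sketch of this step, the helper below writes a page's HTML text to a file and makes sure the file name ends in .html. The name `save_html` and the demo file name are illustrative choices, not part of the assignment. The demo writes to a temporary folder so it runs without a network connection; the commented lines show where `requests.get(...)` would supply the text.

```python
import tempfile
from pathlib import Path

def save_html(html_text, path):
    """Write html_text to path as UTF-8 and return the path that was written."""
    path = Path(path)
    # Give the file an .html extension so the browser can open it as a page
    if path.suffix != ".html":
        path = path.with_suffix(".html")
    path.write_text(html_text, encoding="utf-8")
    return path

# Offline demo: save a tiny page and read it back
with tempfile.TemporaryDirectory() as tmp:
    saved = save_html("<html><body><p>Hello</p></body></html>", Path(tmp) / "demo")
    saved_name = saved.name
    round_trip = saved.read_text(encoding="utf-8")

# Typical use with a live request (not executed here):
#   import requests
#   response = requests.get("https://en.wikipedia.org/wiki/Pythonidae")
#   save_html(response.text, "pythonidae.html")
```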

### 1) Basic web crawling

URLs have specific formats. For example, any English Wikipedia page has the format https://en.wikipedia.org/wiki/Pythonidae, where the last part is the topic of the article.
Next, we want to automate this saving process by using the requests library to make automated requests to Wikipedia.
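
One possible way to turn topic names into URLs of this format is sketched below, using only the standard library. The helper name `wiki_url` and the example topics are assumptions made for illustration.

```python
from urllib.parse import quote

BASE_URL = "https://en.wikipedia.org/wiki/"

def wiki_url(topic):
    """Build the English Wikipedia URL for a topic name."""
    # Wikipedia titles use underscores where the topic has spaces
    title = topic.strip().replace(" ", "_")
    # Percent-encode characters that are not safe inside a URL path
    return BASE_URL + quote(title)

urls = [wiki_url(word) for word in ["Pythonidae", "Black Sabbath", "Motörhead"]]
for url in urls:
    print(url)
```

Each of these URLs could then be passed to `requests.get` inside a loop.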

Exercise: Pick 5 different words, and write code that loops through these words, retrieves the html content of each associated Wikipedia page, and saves the html text as wiki_htmls/[word].html files. (Choose words that actually have associated Wikipedia pages.)


In [53]:
### your code here
import os
import requests

# Wikipedia article titles to download
names = ['Iron_Maiden', 'Dio_(band)', 'Rainbow_(rock_band)', 'Black_Sabbath', 'Ozzy_Osbourne']

# Create the output folder requested by the exercise
os.makedirs('wiki_htmls', exist_ok=True)

for name in names:
    response = requests.get('https://en.wikipedia.org/wiki/' + name)
    # Save each page as wiki_htmls/[word].html
    with open(os.path.join('wiki_htmls', name + '.html'), 'w', encoding='utf-8') as f:
        f.write(response.text)

### 2) URL formats

What is the common URL in the case of Google searches? And in the case of Yelp? 

In [54]:
# Google: https://www.google.com/search?q={Search Term}&oq={Search Term}&aqs={Browser}..69i57j0l7.1497j0j8&sourceid={Browser}&ie=UTF-8
# Yelp: https://de.yelp.ch/z{location}

And what happens to the URL if you want to define the location as well as the type of venue you are looking for?

In [55]:
# https://de.yelp.ch/search?find_desc={venue}&find_loc={location}
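
A small sketch of how such a search URL could be assembled in code, assuming the `find_desc` and `find_loc` parameters noted above. The helper name `yelp_search_url` and the example values are made up for illustration.

```python
from urllib.parse import urlencode

def yelp_search_url(venue, location):
    """Combine venue type and location into one search URL."""
    params = {"find_desc": venue, "find_loc": location}
    # urlencode joins the pairs with & and escapes spaces and special characters
    return "https://de.yelp.ch/search?" + urlencode(params)

example_url = yelp_search_url("coffee shop", "Zurich")
print(example_url)
```

Building the query string with `urlencode` avoids hand-escaping spaces or umlauts in the search terms.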

Can you find more search parameters for either of the two pages that you can define via the URLs? What do they mean?

In [56]:
# Google Maps: https://www.google.com/maps/search/{Search Term}/@{Coordinates},14z/data=!3m1!4b1?hl={language}
# Yelp for Company Owners: https://de.biz.yelp.ch/?source=consumer_site_header&utm_content=header&utm_medium=www&utm_source=cons_home
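
To see which parameters a URL carries, it can help to split it programmatically. The sketch below takes the Yelp URL from the answer above apart with the standard library. The `utm_*` names follow a common convention for campaign tracking; what Yelp does with them is not confirmed here.

```python
from urllib.parse import urlsplit, parse_qs

url = ("https://de.biz.yelp.ch/?source=consumer_site_header"
       "&utm_content=header&utm_medium=www&utm_source=cons_home")

parts = urlsplit(url)
# parse_qs maps each parameter name to a list, because a name may repeat
params = parse_qs(parts.query)

print(parts.netloc)
for name, values in params.items():
    print(name, "=", values[0])
```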

### 3) And now let's work with the HTML content

In [57]:
import requests
res = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")

Using the BeautifulSoup parser library we will parse the webpage that you just retrieved.

In [58]:
# let's import BeautifulSoup, our parser library
# And make a soup object out of the html of the page

# in case bs4 throws error try
# !pip install --upgrade html5lib==1.0b8

from bs4 import BeautifulSoup
soup = BeautifulSoup(res.content, 'html.parser')

In [59]:
# print a nice version using prettify
print(soup.prettify())

<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <div>
   <p class="inner-text first-item" id="first">
    First paragraph.
   </p>
   <p class="inner-text">
    Second paragraph.
   </p>
  </div>
  <p class="outer-text first-item" id="second">
   <b>
    First outer paragraph.
   </b>
  </p>
  <p class="outer-text">
   <b>
    Second outer paragraph.
   </b>
  </p>
 </body>
</html>


Here's how we can find all instances of a tag at once: Try to predict what the following command will return: `soup.find_all('p')` and then call it to check if you were right. 

In [60]:
soup.find_all('p')

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>, <p class="inner-text">
                 Second paragraph.
             </p>, <p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

Print out the second element of this list.

In [61]:
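# Note: this selects the element with id "second", which is the third item of the list above; the second item would be soup.find_all('p')[1]
soup.find(id="second")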

<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>

Print out the text inside the second element of the list, using .text (or get_text()) on the element.

In [62]:
print(soup.find(id="second").get_text())



                First outer paragraph.
            



When you try to find a specific element on a page you can reach it by finding classes or IDs of the elements.

In [63]:
soup.find_all('p', class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

How many elements would it return for 'inner-text'? Guess, and check your guess by using the find_all command.

In [64]:
# 2
soup.find_all('p', class_='inner-text')

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>, <p class="inner-text">
                 Second paragraph.
             </p>]
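
The class and id lookups above can also be tried on a small inline page. The sketch below assumes bs4 is installed and re-creates the example page as a string. `select` is BeautifulSoup's CSS selector method, shown here only as an alternative way to write the same lookup.

```python
from bs4 import BeautifulSoup

html = """
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches any single class of an element, so both "first-item"
# paragraphs are found even though each carries a second class as well
first_items = [p.get_text(strip=True) for p in soup.find_all("p", class_="first-item")]

# find() with an id returns the first (normally the only) matching element
by_id = soup.find(id="first").get_text(strip=True)

# The same class lookup written as a CSS selector
outer = [p.get_text(strip=True) for p in soup.select("p.outer-text")]

print(first_items, by_id, outer)
```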

### 4) Finding elements in the browser
Since every web page is different and html can get very large and messy, the easiest way to find elements that you are interested in is to start from the browser window. So next we will quickly look at how to find elements using the developer tools in your browser. Open the following webpage in your browser (preferably Chrome): http://forecast.weather.gov/MapClick.php?lat=21.3049&lon=-157.8579#.Wkwh8VQ-fVo 

Find the developer tools in your browser. (In Chrome, it's view --> developer --> developer tools or Control+Shift+C on Windows and Command+Shift+C on Mac) You should end up with a panel at the bottom or the right side of the browser like what you see below. Make sure the Elements panel is highlighted:

In [81]:
res = requests.get("http://forecast.weather.gov/MapClick.php?lat=21.3049&lon=-157.8579")
soup = BeautifulSoup(res.content, 'html.parser')

When trying to find a specific element, you can right click on it on the page and select "inspect". This will also open up the developer tools window. For example if we want to extract the current temperature value:

<img src="inspect.png">

<img src="inspect_class.png">

<br><br>
1. Using the find function, extract and print out the current temperature from the page. 
2. Do the same with the value in Celsius. 

In [82]:
### Fill out and print a full sentence describing the temperature in F and C. 
temp_F = soup.find(class_="myforecast-current-lrg").get_text()
print('The current temperature in Fahrenheit is: ' + temp_F + '.')
temp_C = soup.find(class_="myforecast-current-sm").get_text()
print('The current temperature in Celsius is: ' + temp_C + '.')

The current temperature in Fahrenheit is: 74°F.
The current temperature in Celsius is: 23°C.
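
The extracted values are plain strings such as "74°F". If a number is needed, for instance to check the Celsius value against the Fahrenheit one, a sketch along these lines could be used. The helper names are illustrative and not part of the assignment.

```python
import re

def parse_temperature(text):
    """Split a string such as '74°F' into a number and a unit letter."""
    match = re.search(r"(-?\d+(?:\.\d+)?)\s*°\s*([FC])", text)
    if match is None:
        raise ValueError(f"no temperature found in {text!r}")
    return float(match.group(1)), match.group(2)

def fahrenheit_to_celsius(degrees_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (degrees_f - 32) * 5 / 9

value, unit = parse_temperature("74°F")
celsius = round(fahrenheit_to_celsius(value))
print(f"{value:.0f}°{unit} is about {celsius}°C")
```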


3. In this exercise we will extract each day's forecast from the 7 day extended forecast on the weather report page. <br>
    a. Find the container for the seven day forecast on the weather page we just downloaded. <br>
    b. Make a list with all forecast items (overnight, Wednesday, Wednesday night, etc) <br>
    c. For each time period, print out the name of the period, the short description of the expected weather conditions, and the temperature. 

In [80]:
# For each time period print out something like:
# Overnight the weather will be mostly clear and breezy and the temperature will be 65F.
res = requests.get("http://forecast.weather.gov/MapClick.php?lat=21.3049&lon=-157.8579")
# Remove newlines and turn <br> tags into spaces before parsing
str_page = res.content.decode().replace('\n', '').replace('<br>', ' ')
soup = BeautifulSoup(str_page, 'html.parser')

seven_day_forecast = soup.find(id="seven-day-forecast-list")

for item in seven_day_forecast:
    # Daytime periods carry a high temperature, night periods a low one
    high = item.find(class_='temp temp-high')
    if high:
        temp = high.get_text().replace('High: ', '')
    else:
        temp = item.find(class_='temp temp-low').get_text().replace('Low: ', '')
    print(item.find(class_='period-name').get_text() + ' the weather will be ' +
          item.find(class_='short-desc').get_text() + ' and the temperature will be ' +
          temp + '.\n\n')

Today   the weather will be Scattered Showers and Breezy and the temperature will be 81 °F.


Tonight   the weather will be Isolated Showers and Breezy and the temperature will be 69 °F.


Tuesday   the weather will be Isolated Showers and Breezy then Mostly Sunny and Windy and the temperature will be 80 °F.


Tuesday Night the weather will be Isolated Showers and Windy and the temperature will be 70 °F.


Wednesday   the weather will be Windy. Isolated Showers then Mostly Sunny and the temperature will be 81 °F.


Wednesday Night the weather will be Scattered Showers and Windy and the temperature will be 69 °F.


Thursday   the weather will be Scattered Showers and Windy and the temperature will be 81 °F.


Thursday Night the weather will be Scattered Showers and Windy and the temperature will be 69 °F.


Friday   the weather will be Scattered Showers and Breezy and the temperature will be 81 °F.
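
An alternative to cleaning up the raw HTML string is to ask BeautifulSoup for the list items directly. The sketch below assumes bs4 is installed and runs on a hypothetical, simplified stand-in for the forecast list. Only the id and the class names are taken from the code above; the real page's structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the forecast list on the page
SAMPLE_HTML = """
<ul id="seven-day-forecast-list">
  <li>
    <p class="period-name">Today</p>
    <p class="short-desc">Scattered Showers</p>
    <p class="temp temp-high">High: 81 °F</p>
  </li>
  <li>
    <p class="period-name">Tonight</p>
    <p class="short-desc">Isolated Showers</p>
    <p class="temp temp-low">Low: 69 °F</p>
  </li>
</ul>
"""

def read_forecast(html):
    """Return (period, description, temperature) tuples from the forecast list."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id="seven-day-forecast-list")
    rows = []
    # recursive=False keeps only the direct <li> children and skips the
    # whitespace text nodes that sit between them
    for item in container.find_all("li", recursive=False):
        period = item.find(class_="period-name").get_text(strip=True)
        desc = item.find(class_="short-desc").get_text(strip=True)
        # class_="temp" matches both "temp temp-high" and "temp temp-low"
        temp = item.find(class_="temp").get_text(strip=True)
        temp = temp.replace("High: ", "").replace("Low: ", "")
        rows.append((period, desc, temp))
    return rows

forecast = read_forecast(SAMPLE_HTML)
for period, desc, temp in forecast:
    print(f"{period} the weather will be {desc} and the temperature will be {temp}.")
```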




4. Take a list of jobs (e.g. ['teacher', 'lawyer', 'data-scientist']). For each job, save the html of the result of searching for it on Indeed. The url of a result page looks like: https://www.indeed.com/q-data-scientist-jobs.html. 
<br>
For each job, find the names of the companies on the first result page. Make a dictionary where the keys are the jobs and each value is a list of the company names. 

In [42]:
job_list = ['teacher', 'lawyer', 'data-scientist']
totalDict = {}
for job in job_list:
    res = requests.get("https://www.indeed.com/jobs?q=" + job + "&l=")
    # Company names sit in elements with class "company"
    company_tags = BeautifulSoup(res.content, 'html.parser').find_all(class_="company")
    # Map each job to the list of company names on its first result page
    totalDict[job] = [tag.get_text().replace('\n', '') for tag in company_tags]

print(totalDict)

{'teacher': ['EduWorld China', 'Hamilton County Department Of Education', 'bValue', 'Edgenuity', "The Coeur d'Alene Tribe", 'Winchester Public Schools', 'SkySlate', 'Mosher Tech', 'Seneca Valley School District', 'Hamilton County Department Of Education'], 'lawyer': ['TransPerfect', 'Osano', 'Brenda A. Ray Law Offices', 'Axiom Law', 'Level 2 Legal Solutions', 'Michaels', 'US Department of State', 'Axiom Law', 'State of Minnesota Board of Public Defense', 'Kentucky Housing Corporation'], 'data-scientist': ['Pathrise', 'Hallmark', 'HBSE', 'Degreed', 'Bayer', 'Perrigo Company', 'Microsoft', 'Cisco Systems', 'HBO Max', 'kraken']}


5. Next, we will do a 2-step crawling exercise. First, request the page for one chosen job category. Then make a list of the links to all specific job ads on that page. In a second step, crawl and save the content of all of these links. Name the folders and files in a meaningful way that helps you identify them later. 

In [50]:
import os
import requests
from bs4 import BeautifulSoup

jobCategory = 'teacher'
outputDir = jobCategory + '_jobs_found'
os.makedirs(outputDir, exist_ok=True)

# Step 1: collect the links to the individual job ads on the result page
res = requests.get('https://www.indeed.com/jobs?q=' + jobCategory + '&l=')
titleTags = BeautifulSoup(res.content, 'html.parser').find_all(class_="jobtitle turnstileLink")
linksList = [tag['href'] for tag in titleTags]

# Step 2: request each job ad and save it under its job title
for link in linksList:
    res = requests.get("https://www.indeed.com" + link)
    title = BeautifulSoup(res.content, 'html.parser').find(class_="icl-u-xs-mb--xs icl-u-xs-mt--none jobsearch-JobInfoHeader-title").get_text().replace("/", "")
    # The with block closes the file even if writing fails
    with open(os.path.join(outputDir, title + ".html"), "w", encoding="utf-8") as htmlFile:
        htmlFile.write(res.text)
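
Two details of the second crawling step can be handled with the standard library: turning the relative links from the result page into full URLs, and making a job title safe to use as a file name. The sketch below uses a hypothetical href and job title; the helper names are illustrative as well.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.indeed.com"

def absolute_link(href):
    """Turn a relative href from the result page into a full URL."""
    return urljoin(BASE_URL, href)

def safe_filename(title, extension=".html"):
    """Replace characters that are not allowed in file names on common systems."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return cleaned + extension

# Hypothetical href and job title, for illustration only
link = absolute_link("/rc/clk?jk=abc123")
name = safe_filename("Math/Science Teacher: Grades 6-8")
print(link)
print(name)
```

Replacing problem characters with an underscore, rather than deleting them, keeps titles such as "Math/Science" readable in the saved file names.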

### 5) Headers

Every request you send has a so-called HTTP header, which is separate from the content (body) of the message. It is used, for example, to communicate the size of the message, the browser the request is coming from, or what kind of response is expected back. 

1) Read up on this: What parts does a request contain exactly and what is the purpose of a header? 

2) Look in the browser: Take a URL and find the request header using the developer tools in your browser. (Hint: you will need to look inside 'network'). 

3) If you don’t tell Python otherwise, it will use a default header when sending requests. What is this default when you use the requests library?

4) The requests library allows you to specify the headers of your request exactly. Set the header of your request (for the URL you previously picked) to be the one copied from your browser. 

Your chosen URL: ##

Default header of Python requests: ##

Header in your browser: ##

5) Now compare the response headers for the same URL in the browser, and by calling a function on the response object in your code. What differences do you see? 

Response header in your browser: ##

Response header in the response object in Python: ##

Difference: ##
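
As a starting point for the header questions, the sketch below inspects the default headers of the requests library and builds a request with a custom header without sending it. It assumes requests is installed. The User-Agent string is a placeholder; the real value should be copied from the network tab of your browser.

```python
import requests

# Headers that requests adds when none are specified
default_headers = requests.utils.default_headers()
print(default_headers["User-Agent"])  # starts with "python-requests/"

# Placeholder browser header, for illustration only
custom_headers = {"User-Agent": "Mozilla/5.0 (example browser string)"}

# Build the request without sending it, so its parts can be inspected
request = requests.Request("GET", "https://en.wikipedia.org/wiki/Pythonidae",
                           headers=custom_headers)
prepared = request.prepare()

print(prepared.method)         # the HTTP method
print(prepared.url)            # the target URL
print(dict(prepared.headers))  # the headers set on this request
print(prepared.body)           # None, since this GET request has no body

# To actually send it: response = requests.get(url, headers=custom_headers)
# The response headers are then available as response.headers
```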

Congratulations on completing the first notebook! Now it’s time for feedback.
1.	Pass your solution to the other pair in your group.
2.	Include your feedback in the other pair’s notebook. Don’t forget to add your names at the top.
3.	Return the notebook with feedback to the original pair.
4.	Upload your notebook, with the feedback from the other pair included, on OLAT.

You can think of/suggest (among other things)
 - improvements in the code (e.g. readability, efficiency)
 - improvements in the answers (e.g. are they easy to understand, are they correct, how can they be improved?)
 - point out differences (e.g. are there any differences between the responses of the two pairs? if yes what are they, what is the cause, and in which way can they be useful?)
 
In this specific notebook the questions to focus on for feedback are: 1, 2, 4 and 5. 3 was just an intro to parsing so no need to analyze in detail. Not all suggestions about the type of feedback apply to all types of questions, try to 

In [None]:
# Below there is space for giving feedback. This space should be used only by the other pair in your group.

'''
Feedback here
'''