# Online data collection using the requests library

In [1]:
# Please include your names below
# Also, please edit the name of the file and include the names of the two (or three) people answering

# Pair answering the assignment: Jialu Liu, Michael Blum
# Pair giving feedback: ..., ...

In [98]:
import requests
from pprint import pprint
import os.path

As step 0, pick your favorite Wikipedia page, open it in the browser, and then save it as an html file. Now open it in the browser as well as in a text editor and look at the difference. 

Using the requests library, you can retrieve the HTML source of the page as a response object (using requests.get("url")). The content of the response object is available through its .text attribute.

Read .text and save the result in a file, then open the file in a browser and check whether you successfully saved the page. Note that you will only be able to open the file in the browser if you give it an .html extension.
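
A minimal sketch of this step, assuming the helper names `save_html` and `fetch_and_save` (both are made up for illustration). The text is written with an explicit UTF-8 encoding so that non-ASCII characters survive, and the .html extension is appended if it is missing.

```python
from pathlib import Path


def save_html(text, path):
    """Write a str of HTML to `path` as UTF-8 and return the final Path."""
    path = Path(path)
    if path.suffix != ".html":
        # browsers only render the file as a page with an .html extension
        path = path.with_name(path.name + ".html")
    path.write_text(text, encoding="utf-8")
    return path


def fetch_and_save(url, path):
    """Download `url` and store its HTML source under `path`."""
    import requests  # imported here so save_html can be used without it

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    return save_html(response.text, path)
```

Calling `fetch_and_save("https://en.wikipedia.org/wiki/Pythonidae", "pythonidae")` would then leave a `pythonidae.html` file in the working directory.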

### 1) Basic web crawling

URLs have specific formats. For example, any English Wikipedia article has a URL of the form https://en.wikipedia.org/wiki/Pythonidae, where the last path segment is the topic of the article.
Next, we want to automate this saving process using the requests library and making automated requests to Wikipedia.
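
As a rough illustration of this URL format, the sketch below builds article URLs from topic names. The function name `wiki_url` is an assumption made for this example. Spaces are replaced by underscores, as Wikipedia does, and any other special characters are percent-encoded.

```python
from urllib.parse import quote

WIKI_BASE = "https://en.wikipedia.org/wiki/"


def wiki_url(topic):
    """Build the English Wikipedia URL for a topic such as 'Data science'."""
    title = topic.strip().replace(" ", "_")  # Wikipedia uses underscores for spaces
    return WIKI_BASE + quote(title, safe="()_,:")  # percent-encode the rest


urls = [wiki_url(t) for t in ["Pythonidae", "Data science", "Zürich"]]
```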

Exercise: Pick 5 different words and write code that loops through these words, retrieves the HTML content of each associated Wikipedia page, and saves the HTML text as wiki_htmls/[word].html files. (Choose words that actually have associated Wikipedia pages.)


In [4]:
### your code here

import os
import requests

def getHtml(url):
    """Download a page and return its HTML source as a string."""
    r = requests.get(url)
    html = r.text
    return html

def saveHtml(file_name, file_content):
    """Write bytes to file_name + '.html', creating the folder if needed."""
    path = file_name + '.html'
    os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
    with open(path, 'wb') as f:
        f.write(file_content)

s = 'https://en.wikipedia.org/wiki/'
a = ['Google', 'Baidu', 'Twitter', 'Github', 'QQ']
for i in a:
    web = s + i
    html = str.encode(getHtml(web))
    saveHtml("wiki_htmls/" + '[' + i + ']', html)
    print("End")


End
End
End
End
End


### 2) URL formats

What is the common URL in the case of Google searches? And in the case of Yelp? 

In [None]:
Google searches:
    https://www.google.com/search?
Yelp:
    https://de.yelp.ch/search?find_desc=&find_loc=

And what happens to the URL if you want to define the location as well as the type of venue you are looking for?

In [None]:
Yelp:
    https://de.yelp.ch/search?find_desc=[type of venue]&find_loc=[location]

Can you find more search parameters for either of the two pages that you can define via the URLs? What do they mean?

In [None]:
Google search:
    q: Query (keywords)
    ie: Input encoding
    hl: Interface language
    safe: SafeSearch setting
    start: Offset of the first result shown (0, 10, 20, ...)
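
The search URLs above can also be assembled in code. The following is a sketch using only the standard library; the helper name `build_search_url` and the example search terms are assumptions for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_search_url(base, **params):
    """Append URL-encoded query parameters to a search endpoint."""
    return base + "?" + urlencode(params)


yelp = build_search_url(
    "https://de.yelp.ch/search", find_desc="pizza", find_loc="Zürich"
)
google = build_search_url("https://www.google.com/search", q="web scraping", hl="en")

# Going the other way: split a URL back into its parameters.
parsed = parse_qs(urlsplit(google).query)
```

With the requests library, the same effect is available by passing a dictionary as `params`, as in `requests.get(base, params={...})`.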

### 3) And now let's work with the HTML content

In [8]:
import requests
res = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")

Using the BeautifulSoup parser library we will parse the webpage that you just requested.

In [21]:
# let's import BeautifulSoup, our parser library
# And make a soup object out of the html of the page

# in case bs4 throws an error, try
# !pip install --upgrade html5lib==1.0b8

try:
    from bs4 import BeautifulSoup
except ModuleNotFoundError:
    !pip3 install --upgrade html5lib==1.0b8
    !pip3 install BeautifulSoup4
    # the package is installed as BeautifulSoup4 but imported as bs4
    from bs4 import BeautifulSoup
soup = BeautifulSoup(res.content, 'html.parser')

In [22]:
# print a nice version using prettify
print(soup.prettify())

<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <div>
   <p class="inner-text first-item" id="first">
    First paragraph.
   </p>
   <p class="inner-text">
    Second paragraph.
   </p>
  </div>
  <p class="outer-text first-item" id="second">
   <b>
    First outer paragraph.
   </b>
  </p>
  <p class="outer-text">
   <b>
    Second outer paragraph.
   </b>
  </p>
 </body>
</html>


Here's how we can find all instances of a tag at once: Try to predict what the following command will return: `soup.find_all('p')` and then call it to check if you were right. 

In [26]:
# guess: it will return all <p> (paragraph) elements on the page
find_all_p = soup.find_all('p')
for line in find_all_p:
    print(line, '\n')
# assumption was right; we got all paragraphs

<p class="inner-text first-item" id="first">
                First paragraph.
            </p> 

<p class="inner-text">
                Second paragraph.
            </p> 

<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p> 

<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p> 



Print out the second element of this list.

In [55]:
list1=soup.find_all('p')
print(list1[1])

<p class="inner-text">
                Second paragraph.
            </p>


Print out the text inside the second element of the list, using the .text attribute of the element.

In [52]:
list1[1].text

'\n                Second paragraph.\n            '

To find a specific element on a page, you can search for it by its class or ID.

In [None]:
soup.find_all('p', class_='outer-text')

How many elements would it return for 'inner-text'? Guess, and check your guess by using the find_all command.

In [54]:
# guess: there are 2 paragraphs with the class name 'inner-text'
list2=soup.find_all('p', class_='inner-text')
print(len(list2))

2
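
The lookups from this section can be summarized in one place. The sketch below assumes that bs4 is installed and uses an inline copy of the example page, so it runs without a network request. It also shows `get_text(strip=True)`, which removes the surrounding whitespace seen in the `.text` output above.

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>A simple example page</title></head>
<body>
<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# by ID: at most one element is expected
first = soup.find(id="first")

# by class: matches every element that carries this class among its classes
first_items = soup.find_all(class_="first-item")

# by CSS selector: <p> elements with class inner-text inside a <div>
inner = soup.select("div p.inner-text")

# text of every paragraph without leading/trailing whitespace
texts = [p.get_text(strip=True) for p in soup.find_all("p")]
```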


### 4) Finding elements in the browser
Since every web page is different and html can get very large and messy, the easiest way to find elements that you are interested in is to start from the browser window. So next we will quickly look at how to find elements using the developer tools in your browser. Open the following webpage in your browser (preferably Chrome): http://forecast.weather.gov/MapClick.php?lat=21.3049&lon=-157.8579#.Wkwh8VQ-fVo 

Find the developer tools in your browser. (In Chrome, it's View --> Developer --> Developer Tools, or Control+Shift+C on Windows and Command+Shift+C on Mac.) You should end up with a panel at the bottom or the right side of the browser like the one shown below. Make sure the Elements panel is highlighted:

In [27]:
res = requests.get("http://forecast.weather.gov/MapClick.php?lat=21.3049&lon=-157.8579")
soup = BeautifulSoup(res.content, 'html.parser')

When trying to find a specific element, you can right click on it on the page and select "inspect". This will also open up the developer tools window. For example if we want to extract the current temperature value:

<img src="inspect.png">

<img src="inspect_class.png">

<br><br>
1. Using the find function, extract and print out the current temperature from the page. 
2. Do the same with the value in Celsius. 

In [35]:
### Fill out and print a full sentence describing the temperature in F and C. 

temp_F = soup.find(class_='myforecast-current-lrg').text
temp_C = soup.find(class_='myforecast-current-sm').text
print("The temperature in Fahrenheit is " + temp_F)
print("The temperature in Celsius is " + temp_C)


The temperature in Fahrenheit is 70°F
The temperature in Celsius is 21°C
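
The scraped temperatures are strings. If numbers are needed, one possible approach is a small regular expression, sketched below. The names `parse_temperature` and `fahrenheit_to_celsius` are assumptions for this example; the input strings are taken from the output above.

```python
import re

TEMP_PATTERN = re.compile(r"(-?\d+(?:\.\d+)?)\s*°\s*([FC])")


def parse_temperature(text):
    """Return (value, unit) from a string such as '70°F' or '21°C'."""
    match = TEMP_PATTERN.search(text)
    if match is None:
        raise ValueError(f"no temperature found in {text!r}")
    return float(match.group(1)), match.group(2)


def fahrenheit_to_celsius(deg_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (deg_f - 32) * 5 / 9


value_f, unit_f = parse_temperature("70°F")
value_c, unit_c = parse_temperature("21°C")

# The page shows whole degrees, so compare after rounding.
consistent = round(fahrenheit_to_celsius(value_f)) == value_c
```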


3. In this exercise we will extract each day's forecast from the 7 day extended forecast on the weather report page. <br>
    a. Find the container for the seven day forecast on the weather page we just downloaded. <br>
    b. Make a list with all forecast items (overnight, Wednesday, Wednesday night, etc) <br>
    c. For each time period, print out the name of the period, the short description of the expected weather conditions, and the temperature. 

In [69]:
# For each time period print out something like: 
# Overnight the weather will be mostly clear and breezy and the temperature will be 65F.

tombstones = soup.select('#seven-day-forecast-list > li > .tombstone-container')
for tombstone in tombstones:
    date = tombstone.select('.period-name')[0].get_text(separator=" ")
    short_desc = tombstone.select('.short-desc')[0].get_text(separator=" ").replace('   ', ' ').lower()
    temp = tombstone.select('.temp')[0].get_text()[-5:]
    print(date + ' the weather will be ' + short_desc + ' with a temperature of ' + temp + '.')



Overnight the weather will be showers likely and breezy with a temperature of 69 °F.
Friday the weather will be showers likely and breezy with a temperature of 81 °F.
Friday Night the weather will be heavy rain and breezy with a temperature of 68 °F.
Saturday the weather will be heavy rain with a temperature of 81 °F.
Saturday Night the weather will be heavy rain with a temperature of 68 °F.
Sunday the weather will be heavy rain with a temperature of 82 °F.
Sunday Night the weather will be scattered showers with a temperature of 68 °F.
Monday the weather will be scattered showers then sunny with a temperature of 82 °F.
Monday Night the weather will be scattered showers with a temperature of 68 °F.
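
Not every forecast box necessarily contains all three pieces of information, and indexing with `[0]` fails when a selector matches nothing. The sketch below shows one way to guard against that with `select_one`. The HTML snippet is a simplified, hypothetical stand-in for the real page markup, and the helper names are assumptions.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup that mimics the structure of the forecast list.
SAMPLE = """
<ul id="seven-day-forecast-list">
  <li><div class="tombstone-container">
    <p class="period-name">Overnight</p>
    <p class="short-desc">Showers Likely<br>and Breezy</p>
    <p class="temp temp-low">Low: 69 °F</p>
  </div></li>
  <li><div class="tombstone-container">
    <p class="period-name">Friday</p>
    <p class="short-desc">Heavy Rain</p>
  </div></li>
</ul>
"""


def text_or_none(parent, css):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = parent.select_one(css)
    return node.get_text(" ", strip=True) if node is not None else None


def parse_forecast(html):
    """Collect period, description and temperature for each forecast box."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for box in soup.select("#seven-day-forecast-list > li > .tombstone-container"):
        rows.append({
            "period": text_or_none(box, ".period-name"),
            "short_desc": text_or_none(box, ".short-desc"),
            "temp": text_or_none(box, ".temp"),
        })
    return rows


forecast = parse_forecast(SAMPLE)
```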


4. Take a list of jobs (e.g. ['teacher', 'lawyer', 'data-scientist']). For each job, save the HTML of the result of searching for it on Indeed. The URL of a result page looks like: https://www.indeed.com/q-data-scientist-jobs.html. 
<br>
For each job, find the names of the companies on the first result page. Make a dictionary where the keys are the jobs and the values are lists of the company names. 

In [92]:
jobs = ['engineer', 'artist', 'entertainer']
job_dict = {}
for job in jobs:
    job_dict[job] = []
    url = 'https://www.indeed.com/q-' + job + '-jobs.html'
    
    html = str.encode(getHtml(url))
    saveHtml("indeed/"+'['+job+']', html)
    
    soup = BeautifulSoup(html, 'html.parser')
    
    job_results = soup.select('td#resultsCol [data-tn-component="organicJob"]')
    for job_result in job_results:
        company = job_result.select('.company')[0].text.replace('\n','')
        job_dict[job].append(company)

pprint(job_dict)

{'artist': ['Adriyana Solutions',
            'Matthews International Corporation',
            'On Point Marketing',
            'Spartina 449',
            'Filmless',
            'Player One Trailers',
            'MGA Entertainment Inc',
            'Matthews International Corporation',
            'Rooster Teeth',
            'Transparent Language'],
 'engineer': ['thyssenkrupp Industrial Solutions (USA)',
              'Danville Metal Stamping Co',
              'ASM NEXX',
              'Henkel',
              'ANSYS',
              'Formlabs',
              'Amazon.com Services LLC',
              'Phoenix Contact',
              'Volvo Group',
              'LOCKHEED MARTIN CORPORATION'],
 'entertainer': ['Free Agency',
                 'Free Agency',
                 'Universal Orlando',
                 'Guest Services, Inc.',
                 'Utah Jive',
                 'American Cruise Lines',
                 'Salt Lake Community College',
                 'Free Agency'
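
Two small helpers that may be useful around this exercise are sketched below. Both names (`indeed_url`, `unique_in_order`) are assumptions for illustration. The first turns a job title into the result-page URL, the second removes repeated company names while keeping their order. The company list is the 'artist' entry from the output above.

```python
INDEED_PATTERN = "https://www.indeed.com/q-{slug}-jobs.html"


def indeed_url(job):
    """Turn a job title like 'Data Scientist' into the result-page URL."""
    slug = "-".join(job.lower().split())  # spaces become hyphens
    return INDEED_PATTERN.format(slug=slug)


def unique_in_order(names):
    """Drop repeated names but keep the order of first appearance."""
    return list(dict.fromkeys(names))


artist_companies = [
    "Adriyana Solutions",
    "Matthews International Corporation",
    "On Point Marketing",
    "Spartina 449",
    "Filmless",
    "Player One Trailers",
    "MGA Entertainment Inc",
    "Matthews International Corporation",
    "Rooster Teeth",
    "Transparent Language",
]

job_urls = {job: indeed_url(job) for job in ["teacher", "lawyer", "data scientist"]}
unique_companies = unique_in_order(artist_companies)
```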

In [110]:
url = 'https://www.indeed.com/q-entertainer-jobs.html'

# step 1: request and save the result page for one job category
html = str.encode(getHtml(url))
saveHtml("indeed/[entertainer]", html)

soup = BeautifulSoup(html, 'html.parser')
job_results = soup.select('td#resultsCol [data-tn-component="organicJob"]')

# step 2: follow the link of every job ad and save the page behind it
for index, job_result in enumerate(job_results):
    href = job_result.select('.title > a')[0].get('href')
    # drop the first 8 characters of the relative link and keep the query string
    redirect_url = 'https://www.indeed.com/viewjob?' + href[8:]
    print(redirect_url)

    ad_html = str.encode(getHtml(redirect_url))
    file_name = 'redirectUrls/Entertainer[' + str(index) + ']'
    saveHtml(file_name, ad_html)

https://www.indeed.com/viewjob?jk=3ac71fb15693d7a9&fccid=e00e3209f0319eac&vjs=3
https://www.indeed.com/viewjob?jk=59b0db57aa40e2b1&fccid=e00e3209f0319eac&vjs=3
https://www.indeed.com/viewjob?jk=321b5d1e64863e94&fccid=8a2e5e25a9623039&vjs=3
https://www.indeed.com/viewjob?jk=ed5b3050c7ae6fbd&fccid=cb2a33b61768d00f&vjs=3
https://www.indeed.com/viewjob?jk=d3437b0d2a6d4bb3&fccid=a48ec0707748f15e&vjs=3
https://www.indeed.com/viewjob?jk=c149ff42e1911559&fccid=13ab1df0327a097a&vjs=3
https://www.indeed.com/viewjob?jk=542ac3c576d1de55&fccid=dd616958bd9ddc12&vjs=3
https://www.indeed.com/viewjob?jk=6233fef6785eb7ae&fccid=e00e3209f0319eac&vjs=3
https://www.indeed.com/viewjob?jk=a2e2826e81cd63b7&fccid=b9b8b6021bbe0971&vjs=3
https://www.indeed.com/viewjob?jk=b450cdcbdcb7344b&fccid=e00e3209f0319eac&vjs=3


5. Next, we will do a two-step crawling exercise. First, request the page for one chosen job category. Then make a list of the links to all specific job ads on that page. In a second step, crawl and save the content of all of these links. Name the folders and files in a meaningful way that helps you identify them later. 
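
For the second step, the links found on the result page have to be turned into full URLs and into file names. One possible way is to parse the link instead of slicing it by position, as sketched below. The href shown is a hypothetical example shaped like the job links on the result page, and the helper names and folder layout are assumptions.

```python
from urllib.parse import urlsplit, parse_qs

VIEWJOB_BASE = "https://www.indeed.com/viewjob?"


def viewjob_url(href):
    """Keep the query string of a result link and attach it to /viewjob."""
    return VIEWJOB_BASE + urlsplit(href).query


def job_key(href):
    """Return the value of the 'jk' parameter, usable as a file name."""
    return parse_qs(urlsplit(href).query)["jk"][0]


# Hypothetical relative link, assumed to look like the ones on the result page.
href = "/rc/clk?jk=3ac71fb15693d7a9&fccid=e00e3209f0319eac&vjs=3"

target = viewjob_url(href)
file_name = "indeed/entertainer/" + job_key(href) + ".html"
```

Using the job key in the file name makes it possible to match a saved file to its ad later, which a running index alone does not allow.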

### 5) Headers

Every request you send has a so-called HTTP header (separate from the content of the message). It communicates, for example, the size of the message, the browser the request is coming from, or what kind of response is expected back. 

1) Read up on this: What parts does a request contain exactly and what is the purpose of a header? 

2) Look in the browser: Take a URL and find the request header using the developer tools in your browser. (Hint: you will need to look inside 'network'). 

3) If you don't tell Python otherwise, it will use a default header when sending requests. What is this default when you use the requests library?

4) The requests library allows you to specify the headers of your request exactly. Set the header of your request (for the URL you previously picked) to be the one copied from your browser. 

Your chosen URL: ##

Default header of Python requests: ##

Header in your browser: ##

5) Now compare the response headers for the same URL in the browser, and by calling a function on the response object in your code. What differences do you see? 

Response header in your browser: ##

Response header in the response in python: ##

Difference: ##

In [115]:
url = 'https://de.wikipedia.org/wiki/Sherlock_(Fernsehserie)'
response = requests.get(url)
sent_header = response.request.headers

# 1)  A request consists of a request line (method such as GET or POST, the path
#     and the HTTP version), the header fields, and an optional body.
#     The header carries meta-information about the request, for example the
#     client (User-Agent), the accepted response formats (Accept) or the size
#     of the body (Content-Length). Many more fields can be set.
#     The status code (200 if ok) is part of the response, not of the request.
#     
# 2)  url: https://twitter.com/home
#     Referrer Policy: no-referrer-when-downgrade
# 
# 3)  
print(sent_header)

# 4)
url = 'https://twitter.com/home'
headers = {
            'Sec-Fetch-User': '?1',
            'Upgrade-Insecure-Requests': '1' 
          }

r = requests.get(url, headers = headers)

# 5)
# Difference: many more header fields in the response

{'User-Agent': 'python-requests/2.21.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
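
The default headers can also be inspected without sending anything. The sketch below prepares a request and looks at the headers that would go out. The browser-like header values are hypothetical placeholders; the real ones should be copied from the Network tab of the developer tools.

```python
import requests

# Default headers that requests attaches when none are given.
defaults = requests.utils.default_headers()

# Hypothetical browser-like headers, for illustration only.
browser_headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Accept-Language": "en-US,en;q=0.9",
}

# Prepare (but do not send) a request to see which headers would be used.
session = requests.Session()
request = requests.Request(
    "GET",
    "https://de.wikipedia.org/wiki/Sherlock_(Fernsehserie)",
    headers=browser_headers,
)
prepared = session.prepare_request(request)

for name, value in prepared.headers.items():
    print(name + ": " + value)
```

Headers passed explicitly override the defaults with the same name, and the remaining defaults are kept. After a request has actually been sent, `r.request.headers` holds the request headers and `r.headers` the response headers.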


Congratulations on completing the first notebook! Now it's time for feedback.
1.	Pass your solution to the other pair in your group.
2.	Include your feedback in the other pair’s notebook. Don’t forget to add your names at the top.
3.	Return the notebook with feedback to the original pairs.
4.	Upload your notebook, with the feedback included by the other pair, to OLAT.

You can think of or suggest (among other things):
 - improvements in the code (e.g. readability, efficiency)
 - improvements in the answers (e.g. are they easy to understand, are they correct, how can they be improved?)
 - point out differences (e.g. are there any differences between the responses of the two pairs? if yes what are they, what is the cause, and in which way can they be useful?)
 
In this specific notebook the questions to focus on for feedback are: 1, 2, 4 and 5. 3 was just an intro to parsing so no need to analyze in detail. Not all suggestions about the type of feedback apply to all types of questions, try to 

In [None]:
# Below there is space for giving feedback. This space should be used only by the other pair in your group.

'''
Feedback here
'''