# Web Scraping Lab

You will find in this notebook some scrapy exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit the urls below and take a look at their source code through Chrome DevTools. You'll need to identify the html tags, special class names, etc used in the html content you are expected to extract.

**Resources**:
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide)
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`,  `BeautifulSoup` and `pandas` are already imported for you. If you prefer to use additional libraries feel free to do it.

In [1]:
import requests
import bs4
import pandas as pd

import re

import json

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
import undetected_chromedriver as uc

In [None]:
# visto en clase
url = 'https://github.com/trending/developers'

response = requests.get(url)
html = response.content
parsed_html = bs4.BeautifulSoup(html, "html.parser") 

article = parsed_html.find_all("h1", {"class": "h3 lh-condensed"})

articles = []
for a in article:
    a = str(a)
    soup_a = bs4.BeautifulSoup(a, "html.parser")
    list_a = soup_a.find_all("a")
    for e in list_a:
        articles.append(e.string.strip())
        
print(len(articles))
articles

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [2]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [3]:
# your code here
response = requests.get(url)
html = response.content
parsed_html = bs4.BeautifulSoup(html, "html.parser") 
parsed_html


<!DOCTYPE html>

<html data-a11y-animated-images="system" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-92c7d381038e.css" integrity="sha512-ksfTgQOOnE+FFXf+yNfVjKSlEckJAdufFIYGK7ZjRhWcZgzAGcmZqqArTgMLpu90FwthqcCX4ldDgKXbmVMeuQ==" media="all" rel="stylesheet"><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-d4a90c367f0c.css" integrity="sha512-1KkMNn8M/al/dtzBLupRwkIOgnA9MWkm8oxS+sol

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (Á•ûÊ•ΩÂùÇË¶ö„ÄÖ)',
 'script-8']
 ```

In [4]:
# your code here

# html
response = requests.get(url)
html = response.content
parsed_html = bs4.BeautifulSoup(html, "html.parser") 

# tags
tags = parsed_html.find_all("div", {"class": "col-md-6"})

# developers
list_dev = []
for dev in tags:
    # name
    tag_name = dev.find("h1", {"class": "h3 lh-condensed"})
    if tag_name != None:
        name = tag_name.find("a").string.strip()
        # nick
        tag_nick = dev.find("p")
        if tag_nick != None:
            nick = tag_nick.find("a").string.strip()
        developer = name + " (" + nick + ")"
        list_dev.append(developer)

print(len(list_dev))
list_dev

25


['Marten Seemann (marten-seemann)',
 'Manu MA (manucorporat)',
 'Lee Robinson (leerob)',
 'Hoang (hoangvvo)',
 'Daniel Imms (Tyriar)',
 'Daniel Vaz Gaspar (dpgaspar)',
 'Zhen√´k (Ebazhanov)',
 'Adeeb Shihadeh (adeebshihadeh)',
 'Rick Anderson (Rick-Anderson)',
 'Daniel Mendler (minad)',
 'berstendÃîÃÑÃìÃêÃÑÕõÕòÃÄÃ≤Ã´Ã°ÃπÃ†ÃñÕöÕì (berstend)',
 'Jake Vanderplas (jakevdp)',
 'Alex Gaynor (alex)',
 'Yair Morgenstern (yairm210)',
 'Diego Muracciole (diegomura)',
 'Fernando Cejas (android10)',
 'Brandon Morelli (bmorelli25)',
 'Mu Li (mli)',
 'Jason Quense (jquense)',
 'Facundo Olano (facundoolano)',
 'Jakub T. Jankiewicz (jcubic)',
 'abhishek thakur (abhishekkrthakur)',
 'John Kerl (johnkerl)',
 'Samuel Berthe (samber)',
 'Suyeol Jeon (devxoul)']

#### Display the trending Python repositories in GitHub.

The steps to solve this problem is similar to the previous one except that you need to find out the repository names instead of developer names.

In [5]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'

In [6]:
# your code here

# html
response = requests.get(url)
html = response.content
parsed_html = bs4.BeautifulSoup(html, "html.parser") 

# tags
tags = parsed_html.find_all("h1", {"class": "h3 lh-condensed"})

# repositories
list_rep = []
for tag in tags:
    # repo
    rep = tag.find("span").string.replace('/','').strip()
    list_rep.append(rep)

print(len(list_rep))
list_rep

25


['facundoolano',
 'public-apis',
 'sensity-ai',
 'home-assistant',
 'vinta',
 'nategentile',
 'd2l-ai',
 'pittcsc',
 'WongKinYiu',
 'sei-protocol',
 'ocrmypdf',
 'aeyesec',
 'Orange-Cyberdefense',
 'ucupumar',
 'karpathy',
 'p0dalirius',
 'AstraaDev',
 'Rapptz',
 'healthchecks',
 'snap-research',
 'wilsonfreitas',
 'neetcode-gh',
 'corpnewt',
 'RsaCtfTool',
 'itsnebulalol']

#### Display all the image links from Walt Disney wikipedia page.

In [7]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [8]:
# your code here

# html
response = requests.get(url)
html = response.content
parsed_html = bs4.BeautifulSoup(html, "html.parser") 

# images
images = parsed_html.find_all("img")

# url images
url_images = ["https:" + i["src"] for i in images]
print(len(url))
url = [print(i) for i in url_images]

41
https://upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png
https://upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/20px-Extended-protection-shackle.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG
https://upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Trolley_Troubles_poster.jpg/170px-Trolley_Troubles_poster.jpg
https://upload.wikimedia.org/wikipedia/en/thumb/4/4e/Steamboat-willie.jpg/170px-Steamboat-willie.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/5/57/Walt_Disney_1935.jpg/170px-Walt_Disney_1935.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Walt_Disney_Snow_whi

#### Retrieve an arbitary Wikipedia page of "Python" and create a list of links on that page.

In [9]:
# This is the url you will scrape in this exercise
url ='https://en.wikipedia.org/wiki/Python'

In [10]:
# your code here

# html
response = requests.get(url)
html = response.content
parsed_html = bs4.BeautifulSoup(html, "html.parser") 

# body content (only content)
body_html = parsed_html.find("div", {"id": "bodyContent"})

# links 
links = body_html.find_all("a")

# url of links
#url_links = [url if url.startswith("http") else "https://en.wikipedia.org" + url for url in [i["href"] for i in links if i.has_attr("href") == True]]
url_links = [i["href"] for i in links if i.has_attr("href") == True]
url_links = [url if url.startswith("http") else "https://en.wikipedia.org" + url for url in url_links]
print(len(url_links))
url = [print(i) for i in url_links]

60
https://en.wikipedia.org#mw-head
https://en.wikipedia.org#searchInput
https://en.wiktionary.org/wiki/Python
https://en.wiktionary.org/wiki/python
https://en.wikipedia.org#Snakes
https://en.wikipedia.org#Computing
https://en.wikipedia.org#People
https://en.wikipedia.org#Roller_coasters
https://en.wikipedia.org#Vehicles
https://en.wikipedia.org#Weaponry
https://en.wikipedia.org#Other_uses
https://en.wikipedia.org#See_also
https://en.wikipedia.org/w/index.php?title=Python&action=edit&section=1
https://en.wikipedia.org/wiki/Pythonidae
https://en.wikipedia.org/wiki/Python_(genus)
https://en.wikipedia.org/wiki/Python_(mythology)
https://en.wikipedia.org/w/index.php?title=Python&action=edit&section=2
https://en.wikipedia.org/wiki/Python_(programming_language)
https://en.wikipedia.org/wiki/CMU_Common_Lisp
https://en.wikipedia.org/wiki/PERQ#PERQ_3
https://en.wikipedia.org/w/index.php?title=Python&action=edit&section=3
https://en.wikipedia.org/wiki/Python_of_Aenus
https://en.wikipedia.org/wik

#### Find the number of titles that have changed in the United States Code since its last release point.

In [11]:
# This is the url you will scrape in this exercise
url = 'http://uscode.house.gov/download/download.shtml'

In [12]:
# your code here

# html
response = requests.get(url)
html = response.content
parsed_html = bs4.BeautifulSoup(html, "html.parser") 

# titles
titles = parsed_html.find_all("div", {"class": "usctitlechanged"})

# num of titles
print(len(titles))
# get name a problem with * Appendix

for tit in titles:
    pattern = r'Title(.*)'
    output = re.search(pattern, str(tit))
    if '<span' in output.group(0):
        desc_title = output.group(0).split('<span')[0].strip()
    else:
        desc_title = output.group(0).strip()
    print(desc_title)

9
Title 6 - Domestic Security
Title 7 - Agriculture
Title 11 - Bankruptcy
Title 18 - Crimes and Criminal Procedure
Title 20 - Education
Title 22 - Foreign Relations and Intercourse
Title 28 - Judiciary and Judicial Procedure
Title 34 - Crime Control and Law Enforcement
Title 42 - The Public Health and Welfare


#### Find a Python list with the top ten FBI's Most Wanted names.

In [13]:
# This is the url you will scrape in this exercise
url = 'https://www.fbi.gov/wanted/topten'

In [14]:
# your code here

# html
response = requests.get(url)
html = response.content
parsed_html = bs4.BeautifulSoup(html, "html.parser") 
parsed_html

top_ten = parsed_html.find_all("h3", {"class": "title"})
top_wanted = [wanted.find("a").text for wanted in top_ten]
top_wanted

[]

In [15]:
# ¬° OJO !
# en la red de mi casa esta web a trav√©s de web scraping no funciona igual que en la red de ironhack
# por bs4 no accedo al contenido (incluso indicando headers)
# por selenium s√≠, aunque usando undetected_chromedriver :-O

# selenium

# load URL#
path = '/home/elvira/drivers/chromedriver'
service = Service(executable_path=path)

options = webdriver.ChromeOptions() 
options.add_argument("start-maximized")
#driver = webdriver.Chrome(service=service) ### if not use undetected_chromedriver doesn't load page
driver = uc.Chrome(options=options)
url = 'https://www.fbi.gov/wanted/topten'

driver.get(url)
driver.maximize_window()


In [16]:
# analyze content
top_ten = driver.find_elements(by=By.CLASS_NAME, value='title')[1:]
wanted = [top.text for top in top_ten]
print(len(wanted))
wanted

10


['RUJA IGNATOVA',
 'ARNOLDO JIMENEZ',
 'ALEXIS FLORES',
 'JOSE RODOLFO VILLARREAL-HERNANDEZ',
 'YULAN ADONAY ARCHAGA CARIAS',
 'RAFAEL CARO-QUINTERO',
 'EUGENE PALMER',
 'BHADRESHKUMAR CHETANBHAI PATEL',
 'ALEJANDRO ROSALES CASTILLO',
 'JASON DEREK BROWN']

In [17]:
# exit
driver.quit()

####  Display the 20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe.

In [18]:
# This is the url you will scrape in this exercise
url = 'https://www.emsc-csem.org/Earthquake/'

In [19]:
# your code here

# html
response = requests.get(url)
html = response.content
parsed_html = bs4.BeautifulSoup(html, "html.parser") 

# 20 first earthquakes
tag_earthquakes = parsed_html.find("tbody", {"id": "tbody"}).find_all("tr")[:20]

# list info 20 first earthquakes
list_earthquakes = []
for tag in tag_earthquakes:
    dict_earthquakes = {}
    # date
    dict_earthquakes["date"] = tag.find("td", {"class": "tabev6"}).find_all("a")[0].string.strip()
    # time
    dict_earthquakes["time"] = tag.find("td", {"class": "tabev6"}).find_all("i", {"class": "ago"})[0].string.strip()
    # latitude
    dict_earthquakes["latitude"] = tag.find_all("td", {"class": "tabev1"})[0].string.strip()
    # longitude
    dict_earthquakes["longitude"] = tag.find_all("td", {"class": "tabev1"})[1].string.strip()
    # region name
    dict_earthquakes["region name"] = tag.find("td", {"class": "tb_region"}).string.strip()
    list_earthquakes.append(dict_earthquakes)

df_earthquakes = pd.DataFrame(list_earthquakes)
df_earthquakes

Unnamed: 0,date,time,latitude,longitude,region name
0,2022-07-11¬†¬†¬†07:48:46.0,28min ago,38.81,175.97,NORTH ISLAND OF NEW ZEALAND
1,2022-07-11¬†¬†¬†07:13:46.9,1hr 03min ago,35.51,3.58,STRAIT OF GIBRALTAR
2,2022-07-11¬†¬†¬†07:05:13.2,1hr 12min ago,19.18,155.51,"ISLAND OF HAWAII, HAWAII"
3,2022-07-11¬†¬†¬†07:01:35.0,1hr 15min ago,27.44,175.45,KERMADEC ISLANDS REGION
4,2022-07-11¬†¬†¬†06:24:28.7,1hr 52min ago,41.18,43.92,GEORGIA (SAK'ART'VELO)
5,2022-07-11¬†¬†¬†06:23:29.0,1hr 53min ago,38.78,116.2,NEVADA
6,2022-07-11¬†¬†¬†06:14:09.7,2hr 03min ago,41.21,43.93,GEORGIA (SAK'ART'VELO)
7,2022-07-11¬†¬†¬†06:01:43.8,2hr 15min ago,35.42,3.58,STRAIT OF GIBRALTAR
8,2022-07-11¬†¬†¬†06:00:17.9,2hr 17min ago,39.65,23.43,AEGEAN SEA
9,2022-07-11¬†¬†¬†05:55:58.0,2hr 21min ago,12.42,87.71,NEAR COAST OF NICARAGUA


#### Count the number of tweets by a given Twitter account.
Ask the user for the handle (@handle) of a twitter account. You will need to include a ***try/except block*** for account names not found. 
<br>***Hint:*** the program should count the number of tweets for any provided account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [None]:
# your code here

# selenium

# load URL
path = '/home/elvira/drivers/chromedriver'
service = Service(executable_path=path)
driver = webdriver.Chrome(service=service)

url = 'https://twitter.com/elvestevez'

driver.get(url)
driver.maximize_window()

In [None]:
# NO CONSIGO LLEGAR A LOS TWEETS SIN LOGAR, TWEETER PONE SUS PALITOS PARA LLEGAR ¬ø???

# analyze content
tweets_elements = driver.find_elements(by=By.TAG_NAME, value='div')

# tweets
num_tweets = len([tweets for tweets in tweets_elements if tweets.get_attribute('data-testid') == 'cellInnerDiv'])

In [None]:
# exit
driver.quit()

#### Number of followers of a given twitter account
Ask the user for the handle (@handle) of a twitter account. You will need to include a ***try/except block*** for account names not found. 
<br>***Hint:*** the program should count the followers for any provided account.

In [20]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [21]:
# your code here

# selenium

# load URL
path = '/home/elvira/drivers/chromedriver'
service = Service(executable_path=path)
driver = webdriver.Chrome(service=service)

url = 'https://twitter.com/elvestevez'

driver.get(url)
driver.maximize_window()

In [22]:
# analyze content
tweeter_elements = driver.find_elements(by=By.TAG_NAME, value='a')

# followers
num_followers = 0
for followers in tweeter_elements:
    if followers.get_attribute('href') == 'https://twitter.com/elvestevez/followers':
        num_followers = followers.find_element(by=By.TAG_NAME, value='span').text

print('Followers: ' + num_followers)

Followers: 14


In [23]:
# exit
driver.quit()

#### List all language names and number of related articles in the order they appear in wikipedia.org.

In [24]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'

In [25]:
# your code here

# html
response = requests.get(url)
html = response.content
parsed_html = bs4.BeautifulSoup(html, "html.parser") 

# table of languages
tab_lang = parsed_html.find("div", {"class": "central-featured"})
languages = tab_lang.find_all("div")

# languages
list_languages=[]
for lang in languages:
    dict_languages = {}
    dict_languages["language"] = lang.find("strong").string
    dict_languages["num articles"] = lang.find("bdi").string
    list_languages.append(dict_languages)

print(len(list_languages))
l = [print(i["language"] + " - " + i["num articles"]) for i in list_languages]

10
English - 6¬†458¬†000+
Êó•Êú¨Ë™û - 1¬†314¬†000+
–†—É—Å—Å–∫–∏–π - 1¬†798¬†000+
Deutsch - 2¬†667¬†000+
Espa√±ol - 1¬†755¬†000+
Fran√ßais - 2¬†400¬†000+
Italiano - 1¬†742¬†000+
‰∏≠Êñá - 1¬†256¬†000+
Portugu√™s - 1¬†085¬†000+
Polski - 1¬†512¬†000+


#### A list with the different kind of datasets available in data.gov.uk.

In [28]:
# This is the url you will scrape in this exercise
url = 'https://data.gov.uk/'

In [29]:
# your code here

# html
response = requests.get(url)
html = response.content
parsed_html = bs4.BeautifulSoup(html, "html.parser") 

# list datasets
list_data = parsed_html.find("ul", {"class": "govuk-list dgu-topics__list"}).find_all("a")

datasets = [dat.string for dat in list_data]
len(datasets)
datasets

['Business and economy',
 'Crime and justice',
 'Defence',
 'Education',
 'Environment',
 'Government',
 'Government spending',
 'Health',
 'Mapping',
 'Society',
 'Towns and cities',
 'Transport',
 'Digital service performance',
 'Government reference data']

#### Display the top 10 languages by number of native speakers stored in a pandas dataframe.

In [30]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [31]:
# your code here

# html
response = requests.get(url)
html = response.content
parsed_html = bs4.BeautifulSoup(html, "html.parser") 

# table of languages
tab_lang = parsed_html.find("table").find_all("tbody")[0].find_all("tr")[1:11]

# list info languages and native speakers
list_languages = []
for tag in tab_lang:
    dict_languages = {}       
    # language
    dict_languages["language"] = tag.find_all("td")[1].find("a").string.strip()
    # native speakers
    dict_languages["native speakers"] = tag.find_all("td")[2].string.strip()
    list_languages.append(dict_languages)

df_languages = pd.DataFrame(list_languages)
df_languages

Unnamed: 0,language,native speakers
0,Mandarin Chinese,929.0
1,Spanish,474.7
2,English,372.9
3,Hindi,343.9
4,Bengali,233.7
5,Portuguese,232.4
6,Russian,154.0
7,Japanese,125.3
8,Western Punjabi,92.7
9,Yue Chinese,85.2


## Bonus
#### Scrape a certain number of tweets of a given Twitter account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [None]:
# your code here

# A TWEETER NO LE PILLO... GRGRGRGR


#### Display IMDB's top 250 data (movie name, initial release, director name and stars) as a pandas dataframe.

In [32]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'

In [33]:
# your code here

# html
response = requests.get(url)
html = response.content
parsed_html = bs4.BeautifulSoup(html, "html.parser") 

# table of top films
tab_films = parsed_html.find("tbody", {"class": "lister-list"}).find_all("tr")
print(len(tab_films))

# list info to films
list_films = []
for film in tab_films:
    dict_films = {}       
    # movie name
    dict_films["movie name"] = film.find("td", {"class": "titleColumn"}).find("a").string.strip()
    # initial release
    dict_films["initial release"] = film.find("td", {"class": "titleColumn"}).find("span").string.strip()
    # director name
    dict_films["director name"] = film.find("td", {"class": "titleColumn"}).find("a")["title"].split(",")[0]
    # stars
    dict_films["stars"] = film.find("td", {"class": "ratingColumn imdbRating"}).find("strong").string.strip()
    # add list
    list_films.append(dict_films)

df_films = pd.DataFrame(list_films)
df_films

250


Unnamed: 0,movie name,initial release,director name,stars
0,Cadena perpetua,(1994),Frank Darabont (dir.),9.2
1,El padrino,(1972),Francis Ford Coppola (dir.),9.2
2,El caballero oscuro,(2008),Christopher Nolan (dir.),9.0
3,El padrino: Parte II,(1974),Francis Ford Coppola (dir.),9.0
4,12 hombres sin piedad,(1957),Sidney Lumet (dir.),8.9
...,...,...,...,...
245,Jai Bhim,(2021),T.J. Gnanavel (dir.),8.0
246,Aladd√≠n,(1992),Ron Clements (dir.),8.0
247,Gandhi,(1982),Richard Attenborough (dir.),8.0
248,Criadas y se√±oras,(2011),Tate Taylor (dir.),8.0


#### Display the movie name, year and a brief summary of the top 10 random movies (IMDB) as a pandas dataframe.

In [34]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'

In [35]:
%%time

# your code here

# html
response = requests.get(url)
html = response.content
parsed_html = bs4.BeautifulSoup(html, "html.parser") 

# table of top films
tab_films = parsed_html.find("tbody", {"class": "lister-list"}).find_all("tr")[:10]
print(len(tab_films))

# list info to films
list_films = []
for film in tab_films:
    dict_films = {}       
    # movie name
    dict_films["movie name"] = film.find("td", {"class": "titleColumn"}).find("a").string.strip()
    # initial release
    dict_films["year"] = film.find("td", {"class": "titleColumn"}).find("span").string.replace("(","").replace(")","")
    # director name
    url_film = "https://www.imdb.com" + film.find("td", {"class": "titleColumn"}).find("a")["href"]
    html_film = requests.get(url_film).content
    parsed_html_film = bs4.BeautifulSoup(html_film, "html.parser") 
    dict_films["summary"] = parsed_html_film.find("div", {"class": "sc-16ede01-8 hXeKyz sc-910a7330-11 GYbFb"}).find("span", {"role": "presentation"}, {"data-testid": "plot-xl"}).string
    # add list
    list_films.append(dict_films)

df_films = pd.DataFrame(list_films)
df_films

10
CPU times: user 1.64 s, sys: 39.6 ms, total: 1.68 s
Wall time: 12.4 s


Unnamed: 0,movie name,year,summary
0,Cadena perpetua,1994,Two imprisoned men bond over a number of years...
1,El padrino,1972,The aging patriarch of an organized crime dyna...
2,El caballero oscuro,2008,When the menace known as the Joker wreaks havo...
3,El padrino: Parte II,1974,The early life and career of Vito Corleone in ...
4,12 hombres sin piedad,1957,The jury in a New York City murder trial is fr...
5,La lista de Schindler,1993,"In German-occupied Poland during World War II,..."
6,El se√±or de los anillos: El retorno del rey,2003,Gandalf and Aragorn lead the World of Men agai...
7,Pulp Fiction,1994,"The lives of two mob hitmen, a boxer, a gangst..."
8,El se√±or de los anillos: La comunidad del anillo,2001,A meek Hobbit from the Shire and eight compani...
9,"El bueno, el feo y el malo",1966,A bounty hunting scam joins two men in an unea...


#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [36]:
#https://openweathermap.org/current
city = input('Enter the city: ')
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=9120dd14f0a8fe627dcf03e25c50dded&units=metric'

Enter the city: Moraleja de Enmedio


In [37]:
# your code here

# json
response = requests.get(url)
res_weather = response.content
json_weather = json.loads(res_weather)
json_weather

temperature = json_weather['main']['temp']
print(f"temperature in {city} is {temperature}")

wind_speed = json_weather['wind']['speed']
print(f"wind speed in {city} is {wind_speed}")

description = json_weather['weather'][0]['description']
print(f"description in {city} is {description}")

weather = json_weather['weather'][0]['main'] ## exactly, whats is "weather" ¬ø???
print(f"weather in {city} is {weather}")

temperature in Moraleja de Enmedio is 27.85
wind speed in Moraleja de Enmedio is 3.6
description in Moraleja de Enmedio is clear sky
weather in Moraleja de Enmedio is Clear


#### Find the book name, price and stock availability as a pandas dataframe.

In [38]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'

In [39]:
# your code here

# only first page (with bs4)

# html
response = requests.get(url)
html = response.content
parsed_html = bs4.BeautifulSoup(html, "html.parser") 

# table of top films
tag_books = parsed_html.find_all("article", {"class": "product_pod"})
print(len(tag_books))

# list info to books
list_books = []
for book in tag_books:
    dict_books = {}       
    # book name
    dict_books["book name"] = book.find("h3").find("a")["title"]
    # price
    dict_books["price"] = book.find("div", {"class": "product_price"}).find("p", {"class": "price_color"}).string
    # stock
    if book.find("div", {"class": "product_price"}).find("i", {"class": "icon-ok"}) != None:
        dict_books["stock"] = "yes"
    else:
        dict_books["stock"] = "no"
    # add list
    list_books.append(dict_books)

df_books = pd.DataFrame(list_books)
df_books

20


Unnamed: 0,book name,price,stock
0,A Light in the Attic,¬£51.77,yes
1,Tipping the Velvet,¬£53.74,yes
2,Soumission,¬£50.10,yes
3,Sharp Objects,¬£47.82,yes
4,Sapiens: A Brief History of Humankind,¬£54.23,yes
5,The Requiem Red,¬£22.65,yes
6,The Dirty Little Secrets of Getting Your Dream...,¬£33.34,yes
7,The Coming Woman: A Novel Based on the Life of...,¬£17.93,yes
8,The Boys in the Boat: Nine Americans and Their...,¬£22.60,yes
9,The Black Maria,¬£52.15,yes


In [40]:
def add_books(df_data):
    # title
    tag_title = driver.find_elements(by=By.TAG_NAME, value='img')
    list_title = [title.get_attribute('alt') for title in tag_title]
    
    # price
    tag_price = driver.find_elements(by=By.CLASS_NAME, value='price_color')
    list_price = [price.text for price in tag_price]
    
    # stock
    tag_stock = driver.find_elements(by=By.CLASS_NAME, value='icon-ok')
    list_stock = ['yes' if stock != None else 'no' for stock in tag_stock if stock]
    
    data = list(zip(list_title, list_price, list_stock))
    df_data_aux = pd.DataFrame(data, columns=['book name', 'price','stock'])
    df_data = pd.concat([df_data, df_data_aux])
    
    return df_data

In [44]:
# all pages (with selenium)

# selenium

# load URL
path = '/home/elvira/drivers/chromedriver'
service = Service(executable_path=path)
driver = webdriver.Chrome(service=service)

url = 'http://books.toscrape.com/'

driver.get(url)
driver.maximize_window()

driver.implicitly_wait(10)

In [45]:
%%time

# analyze content

df_data_books = pd.DataFrame([], columns=['book name', 'price','stock'])

i = 0
while i in range(100):
    i += 1
    # add books
    df_data_books = add_books(df_data_books)
    try:
        if driver.find_elements(by=By.TAG_NAME, value="a")[-1].text == 'next':
            driver.find_elements(by=By.TAG_NAME, value="a")[-1].click()
            #print(driver.find_elements(by=By.TAG_NAME, value="a")[-1].get_attribute("href"))
            #print(driver.find_elements(by=By.TAG_NAME, value="a")[-1].text)
        else:
            break

    except:
        print(f'You have reached the total amount of clicks: {click}')
        break

print(df_data_books.count())
df_data_books

book name    1000
price        1000
stock        1000
dtype: int64
CPU times: user 3.31 s, sys: 192 ms, total: 3.5 s
Wall time: 1min 16s


Unnamed: 0,book name,price,stock
0,A Light in the Attic,¬£51.77,yes
1,Tipping the Velvet,¬£53.74,yes
2,Soumission,¬£50.10,yes
3,Sharp Objects,¬£47.82,yes
4,Sapiens: A Brief History of Humankind,¬£54.23,yes
...,...,...,...
15,Alice in Wonderland (Alice's Adventures in Won...,¬£55.53,yes
16,"Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1)",¬£57.06,yes
17,A Spy's Devotion (The Regency Spies of London #1),¬£16.97,yes
18,1st to Die (Women's Murder Club #1),¬£53.98,yes


In [46]:
# exit
driver.quit()