# Web Scraping Lab

This notebook contains a set of web scraping exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit the URLs below and inspect their source code with Chrome DevTools. You'll need to identify the HTML tags, class names, and other attributes used in the content you are expected to extract.
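
The first tip can be wrapped in a small helper. The sketch below is one possible shape and not part of the lab: `fetch_html` is a hypothetical name, and the `get` parameter is an assumption added so the helper can be tried without network access. It relies on `raise_for_status()` from `requests` to turn 4xx and 5xx responses into exceptions.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Return the body of `url` as text, or raise if the request failed.

    `get` is injectable so the helper can be exercised without a network call.
    """
    response = get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(url)` in place of a bare `requests.get(url)` makes a failed download stop the cell instead of silently parsing an error page.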

**Resources**:
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide)
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup` and `pandas` are already imported for you. Feel free to use additional libraries if you prefer.

In [1]:
import requests
import bs4
import pandas as pd

In [None]:
# seen in class
url = 'https://github.com/trending/developers'

response = requests.get(url)
html = response.content
parsed_html = bs4.BeautifulSoup(html, "html.parser") 

headings = parsed_html.find_all("h1", {"class": "h3 lh-condensed"})

articles = []
for heading in headings:
    # each heading is already a parsed tag, so it can be searched directly
    for link in heading.find_all("a"):
        articles.append(link.string.strip())
        
print(len(articles))
articles

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [2]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [3]:
# your code here
response = requests.get(url)
html = response.content
parsed_html = bs4.BeautifulSoup(html, "html.parser") 
parsed_html


<!DOCTYPE html>

<html data-a11y-animated-images="system" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-92c7d381038e.css" integrity="sha512-ksfTgQOOnE+FFXf+yNfVjKSlEckJAdufFIYGK7ZjRhWcZgzAGcmZqqArTgMLpu90FwthqcCX4ldDgKXbmVMeuQ==" media="all" rel="stylesheet"><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-d4a90c367f0c.css" integrity="sha512-1KkMNn8M/al/dtzBLupRwkIOgnA9MWkm8oxS+sol

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
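
The instructions above can be sketched on a small inline document. The HTML below is a made-up fragment that only imitates the structure of the trending page (the tag and class names mirror the ones used in this notebook's solution), so the snippet runs without a network request.

```python
import bs4

# Made-up fragment imitating two entries; not real GitHub markup.
SAMPLE_HTML = """
<article>
  <h1 class="h3 lh-condensed">
    <a href="/octocat">
      The Octocat
    </a>
  </h1>
</article>
<article>
  <h1 class="h3 lh-condensed"><a href="/hubot">hubot</a></h1>
</article>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")

names = []
for heading in soup.find_all("h1", {"class": "h3 lh-condensed"}):
    # get_text() gathers the nested text; split() and join() collapse
    # runs of spaces and line breaks into single spaces.
    names.append(" ".join(heading.get_text().split()))

print(names)
```

The `split()` and `join()` pair is used instead of `strip()` so that line breaks inside a name are removed as well as those around it.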

In [4]:
# your code here

# html
response = requests.get(url)
html = response.content
parsed_html = bs4.BeautifulSoup(html, "html.parser") 

# tags
tags = parsed_html.find_all("div", {"class": "col-md-6"})

# developers
list_dev = []
for dev in tags:
    # name
    tag_name = dev.find("h1", {"class": "h3 lh-condensed"})
    if tag_name is not None:
        name = tag_name.find("a").string.strip()
        # nick: some developers have none, so fall back to the name alone
        # instead of reusing the nick of the previous developer
        tag_nick = dev.find("p")
        if tag_nick is not None:
            nick = tag_nick.find("a").string.strip()
            developer = name + " (" + nick + ")"
        else:
            developer = name
        list_dev.append(developer)

print(len(list_dev))
list_dev

25


['Roman Khavronenko (hagen1778)',
 'Daniel Vaz Gaspar (dpgaspar)',
 'chencheng (云谦) (sorrycc)',
 'Stephen Celis (stephencelis)',
 'Florian Rival (4ian)',
 'Andrew Lock (andrewlock)',
 'Joel Hawksley (joelhawksley)',
 'Roger Peppe (rogpeppe)',
 'disksing',
 'Carlton Gibson (carltongibson)',
 'Marcin Rataj (lidel)',
 'Olivier Halligon (AliSoftware)',
 'Mike Perham (mperham)',
 'Gal Schlezinger (Schniz)',
 'Josh Bleecher Snyder (josharian)',
 'Jonny Borges (jonataslaw)',
 'Arseny Kapoulkine (zeux)',
 'maliming',
 'Abubakar Abid (abidlabs)',
 'Nicolas Gallagher (necolas)',
 'Jaco (jacogr)',
 'Hoang (hoangvvo)',
 'Kenji Miyake (kenji-miyake)',
 'James Henry (JamesHenry)',
 'Jeff Ching (chingor13)']

#### Display the trending Python repositories in GitHub.

The steps to solve this problem are similar to the previous one, except that you need to extract the repository names instead of the developer names.

In [5]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'

In [6]:
# your code here

# html
response = requests.get(url)
html = response.content
parsed_html = bs4.BeautifulSoup(html, "html.parser") 

# tags
tags = parsed_html.find_all("h1", {"class": "h3 lh-condensed"})

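# repository owners
list_rep = []
for tag in tags:
    # owner: the <span> holds "owner /", so drop the slash
    rep = tag.find("span").string.replace('/','').strip()
    list_rep.append(rep)

print(len(list_rep))
list_rep

### TODO: the repository name shown in bold is not captured; it seems to be text
### placed directly inside the <a> tag, where .string returns None because the
### tag has more than one child
#for tag in tags:
#    rep = tag.find("a")
#    print(rep.string)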

25


['babysor',
 'OpenEthan',
 'sherlock-project',
 'secretflow',
 'alexbieber',
 'NVlabs',
 'RasaHQ',
 'pittcsc',
 'gto76',
 'sqlfluff',
 'TheAlgorithms',
 'soimort',
 'ray-project',
 'laluka',
 'benoitc',
 'doccano',
 'streamlit',
 'localstack',
 'MIC-DKFZ',
 'netbox-community',
 'ckan',
 'geohot',
 'pytest-dev',
 'goauthentik',
 'AntixK']
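
The cell above captures only the owner of each repository. A likely reason the repository name is hard to reach is that `.string` returns `None` whenever a tag has more than one child, while `get_text()` concatenates all nested text. The fragment below is a simplified, made-up imitation of one trending entry, so the real markup may differ:

```python
import bs4

# Simplified, made-up imitation of one entry; the real markup may differ.
SAMPLE_HTML = """
<h1 class="h3 lh-condensed">
  <a href="/some-owner/some-repo">
    <span class="text-normal">some-owner /</span>
    some-repo
  </a>
</h1>
"""

link = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser").find("a")

# The <a> tag has several children (text nodes and a <span>), so .string is None.
print(link.string)

# get_text() joins every nested text node; splitting on "/" and stripping
# each part yields the owner and the repository name.
owner, repo = [part.strip() for part in link.get_text().split("/")]
full_name = owner + "/" + repo
print(full_name)
```

Applied to the exercise, the same `get_text()` call on `tag.find("a")` would be one way to build `owner/repository` strings instead of owners only.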

#### Display all the image links from the Walt Disney Wikipedia page.

In [7]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [8]:
# your code here

# html
response = requests.get(url)
html = response.content
parsed_html = bs4.BeautifulSoup(html, "html.parser") 

# images
images = parsed_html.find_all("img")

# image URLs (the src values are protocol-relative, starting with "//")
url_images = ["https:" + i["src"] for i in images]
for image_url in url_images:
    print(image_url)

https://upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png
https://upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/20px-Extended-protection-shackle.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG
https://upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Trolley_Troubles_poster.jpg/170px-Trolley_Troubles_poster.jpg
https://upload.wikimedia.org/wikipedia/en/thumb/4/4e/Steamboat-willie.jpg/170px-Steamboat-willie.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/5/57/Walt_Disney_1935.jpg/170px-Walt_Disney_1935.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Walt_Disney_Snow_whi

#### Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page.

In [9]:
# This is the url you will scrape in this exercise
url ='https://en.wikipedia.org/wiki/Python'

In [10]:
# your code here

# html
response = requests.get(url)
html = response.content
parsed_html = bs4.BeautifulSoup(html, "html.parser") 

# body content (only content)
body_html = parsed_html.find("div", {"id": "bodyContent"})

# links 
links = body_html.find_all("a")

# link URLs: keep absolute URLs and prefix the others with the site root
hrefs = [i["href"] for i in links if i.has_attr("href")]
url_links = [href if href.startswith("http") else "https://en.wikipedia.org" + href for href in hrefs]
print(len(url_links))
for link_url in url_links:
    print(link_url)

60
https://en.wikipedia.org#mw-head
https://en.wikipedia.org#searchInput
https://en.wiktionary.org/wiki/Python
https://en.wiktionary.org/wiki/python
https://en.wikipedia.org#Snakes
https://en.wikipedia.org#Computing
https://en.wikipedia.org#People
https://en.wikipedia.org#Roller_coasters
https://en.wikipedia.org#Vehicles
https://en.wikipedia.org#Weaponry
https://en.wikipedia.org#Other_uses
https://en.wikipedia.org#See_also
https://en.wikipedia.org/w/index.php?title=Python&action=edit&section=1
https://en.wikipedia.org/wiki/Pythonidae
https://en.wikipedia.org/wiki/Python_(genus)
https://en.wikipedia.org/wiki/Python_(mythology)
https://en.wikipedia.org/w/index.php?title=Python&action=edit&section=2
https://en.wikipedia.org/wiki/Python_(programming_language)
https://en.wikipedia.org/wiki/CMU_Common_Lisp
https://en.wikipedia.org/wiki/PERQ#PERQ_3
https://en.wikipedia.org/w/index.php?title=Python&action=edit&section=3
https://en.wikipedia.org/wiki/Python_of_Aenus
https://en.wikipedia.org/wik
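
Prefixing every relative link with the site root turns fragment links such as `#Snakes` into `https://en.wikipedia.org#Snakes`, which drops the page path. One alternative, shown here as a sketch rather than as the notebook's method, is `urllib.parse.urljoin`, which resolves each `href` against the page URL:

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Python"

# A few href values of the kinds that appear in the output above.
hrefs = [
    "#Snakes",
    "/wiki/Pythonidae",
    "https://en.wiktionary.org/wiki/Python",
]

# urljoin leaves absolute URLs untouched and resolves the rest
# against the page they were found on.
absolute = [urljoin(page_url, href) for href in hrefs]
for link in absolute:
    print(link)
```

With this approach the fragment link resolves to `https://en.wikipedia.org/wiki/Python#Snakes`, which points at the section of the page it came from.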

#### Find the number of titles that have changed in the United States Code since its last release point.

In [11]:
# This is the url you will scrape in this exercise
url = 'http://uscode.house.gov/download/download.shtml'

In [17]:
# your code here

# html
response = requests.get(url)
html = response.content
parsed_html = bs4.BeautifulSoup(html, "html.parser") 

# titles
titles = parsed_html.find_all("div", {"class": "usctitlechanged"})

# number of titles
print(len(titles))

# TODO: extracting the title names is a problem for the titles marked with "*"
# (Appendix): .string returns None for them, as the output shows
for tit in titles:
    #print(tit)
    print(tit.string)
    #print(tit.string.strip())

9


          Title 6 - Domestic Security

        


          Title 7 - Agriculture

        
None
None


          Title 20 - Education

        


          Title 22 - Foreign Relations and Intercourse

        
None


          Title 34 - Crime Control and Law Enforcement

        


          Title 42 - The Public Health and Welfare

        


#### Build a Python list with the names on the FBI's Top Ten Most Wanted list.

In [10]:
# This is the url you will scrape in this exercise
url = 'https://www.fbi.gov/wanted/topten'

In [11]:
# your code here

# html
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'}
response = requests.get(url, headers=headers)
html = response.content
parsed_html = bs4.BeautifulSoup(html, "html.parser") 



In [12]:
all_tags = [tag.name for tag in parsed_html.find_all(True)]
set(all_tags)

tag = parsed_html.find_all("h1")
tag


#import dryscrape

#session = dryscrape.Session()
#session.visit(url)
#html = session.body()
#parsed_html = bs4.BeautifulSoup(html, "html.parser") 
#all_tags = [tag.name for tag in parsed_html.find_all(True)]
#set(all_tags)
#tag = parsed_html.find_all("h1")
#tag


### TODO: is this impossible with requests and bs4 alone?


[<h1>Additional security has been added.</h1>,
 <h1 data-translate="turn_on_js" style="color:#bd2426;">Please turn JavaScript on and reload the page.</h1>,
 <h1><span data-translate="checking_browser">Checking your browser before accessing</span> www.fbi.gov.</h1>]
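
The headings above come from a browser-check page rather than from the FBI list, so a parsed response is not proof that the intended content arrived. A small guard can make that failure explicit. The marker strings are copied from the output above; `looks_like_browser_check` is a hypothetical helper name, and the second sample string is invented.

```python
# Phrases copied from the <h1> elements returned above.
CHALLENGE_MARKERS = (
    "Please turn JavaScript on and reload the page.",
    "Checking your browser before accessing",
)


def looks_like_browser_check(html_text):
    """Return True if the page text contains a known browser-check phrase."""
    return any(marker in html_text for marker in CHALLENGE_MARKERS)


blocked = '<h1 data-translate="turn_on_js">Please turn JavaScript on and reload the page.</h1>'
regular = "<h1>Some ordinary page heading</h1>"  # invented sample

print(looks_like_browser_check(blocked))
print(looks_like_browser_check(regular))
```

A check like this only recognises the phrases it was given, so it is a convenience for this exercise and not a general detector.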

In [42]:
# your code here

# selenium

from selenium import webdriver
import undetected_chromedriver as uc

from selenium.webdriver.common.by import By


# load URL
# The page does not load with the plain webdriver.Chrome driver,
# so undetected_chromedriver is used instead.
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
driver = uc.Chrome(options=options)
url = 'https://www.fbi.gov/wanted/topten'

driver.get(url)
driver.maximize_window()


In [43]:
# analyze content
top_ten = driver.find_elements(by=By.CLASS_NAME, value='title')[1:]
wanted = [top.text for top in top_ten]
print(len(wanted))
wanted

10


['RUJA IGNATOVA',
 'ARNOLDO JIMENEZ',
 'ALEXIS FLORES',
 'JOSE RODOLFO VILLARREAL-HERNANDEZ',
 'YULAN ADONAY ARCHAGA CARIAS',
 'RAFAEL CARO-QUINTERO',
 'EUGENE PALMER',
 'BHADRESHKUMAR CHETANBHAI PATEL',
 'ALEJANDRO ROSALES CASTILLO',
 'JASON DEREK BROWN']

In [44]:
# exit
driver.quit()

#### Display the info for the 20 latest earthquakes (date, time, latitude, longitude and region name) reported by the EMSC as a pandas dataframe.

In [21]:
# This is the url you will scrape in this exercise
url = 'https://www.emsc-csem.org/Earthquake/'

In [22]:
# your code here

# html
response = requests.get(url)
html = response.content
parsed_html = bs4.BeautifulSoup(html, "html.parser") 

# 20 first earthquakes
tag_earthquakes = parsed_html.find_all("tbody", {"id": "tbody"})[0].find_all("tr")[:20]

# list info 20 first earthquakes
list_earthquakes = []
for tag in tag_earthquakes:
    dict_earthquakes = {}
    # date
    dict_earthquakes["date"] = tag.find_all("td", {"class": "tabev6"})[0].find_all("a")[0].string.strip()
    # time
    dict_earthquakes["time"] = tag.find_all("td", {"class": "tabev6"})[0].find_all("i", {"class": "ago"})[0].string.strip()
    # latitude
    dict_earthquakes["latitude"] = tag.find_all("td", {"class": "tabev1"})[0].string.strip()
    # longitude
    dict_earthquakes["longitude"] = tag.find_all("td", {"class": "tabev1"})[1].string.strip()
    # region name
    dict_earthquakes["region name"] = tag.find_all("td", {"class": "tb_region"})[0].string.strip()
    list_earthquakes.append(dict_earthquakes)

df_earthquakes = pd.DataFrame(list_earthquakes)
df_earthquakes

| | date | time | latitude | longitude | region name |
|---|---|---|---|---|---|
| 0 | 2022-07-08 08:25:42.1 | 07min ago | 19.51 | 155.67 | ISLAND OF HAWAII, HAWAII |
| 1 | 2022-07-08 07:37:42.4 | 55min ago | 43.61 | 6.4 | NEAR SOUTH COAST OF FRANCE |
| 2 | 2022-07-08 07:37:30.0 | 55min ago | 32.23 | 71.92 | OFFSHORE VALPARAISO, CHILE |
| 3 | 2022-07-08 07:21:21.0 | 1hr 11min ago | 2.14 | 121.15 | CELEBES SEA |
| 4 | 2022-07-08 07:05:20.4 | 1hr 27min ago | 19.23 | 155.43 | ISLAND OF HAWAII, HAWAII |
| 5 | 2022-07-08 06:43:17.7 | 1hr 49min ago | 17.95 | 66.94 | PUERTO RICO |
| 6 | 2022-07-08 06:39:32.0 | 1hr 53min ago | 58.4 | 155.83 | ALASKA PENINSULA |
| 7 | 2022-07-08 06:33:32.3 | 1hr 59min ago | 18.0 | 66.87 | PUERTO RICO |
| 8 | 2022-07-08 06:22:23.0 | 2hr 10min ago | 36.96 | 27.68 | DODECANESE IS.-TURKEY BORDER REG |
| 9 | 2022-07-08 06:04:10.8 | 2hr 28min ago | 58.15 | 138.02 | SOUTHEASTERN ALASKA |


#### Count the number of tweets by a given Twitter account.
Ask the user for the handle (@handle) of a Twitter account. You will need to include a ***try/except block*** for account names that are not found.
<br>***Hint:*** the program should count the number of tweets for any provided account.
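
One way to structure the handle input and the requested `try/except` block is sketched below. The validation rule (1 to 15 letters, digits or underscores) is an assumption about the username format, `profile_url` is a hypothetical helper, and the tweet count itself would still have to be read from the rendered page.

```python
import re

# Assumed username rule: 1 to 15 letters, digits or underscores.
HANDLE_PATTERN = re.compile(r"^[A-Za-z0-9_]{1,15}$")


def profile_url(handle):
    """Build the profile URL for a handle, with or without a leading '@'."""
    cleaned = handle.strip().lstrip("@")
    if not HANDLE_PATTERN.match(cleaned):
        raise ValueError("invalid handle: " + repr(handle))
    return "https://twitter.com/" + cleaned


# In the notebook the value would come from input('Enter the handle: ').
for raw in ["@elvestevez", "not a handle!"]:
    try:
        print(profile_url(raw))
    except ValueError as error:
        print("Could not build a URL:", error)
```

Validating the handle before any request keeps malformed input from reaching the site; an account that is well-formed but does not exist still has to be caught when the page is loaded.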

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [None]:
# your code here

# html
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'}

url = 'https://twitter.com/elvestevez'

response = requests.get(url, headers=headers)
html = response.content
parsed_html = bs4.BeautifulSoup(html, "html.parser") 
#parsed_html

print(url)

following = parsed_html.find_all("a")#, {"href": "/elvestevez/following"})
print(len(following))
following


#### TODO: use Selenium?

###### TODO: try/except?


#### Number of followers of a given Twitter account
Ask the user for the handle (@handle) of a Twitter account. You will need to include a ***try/except block*** for account names that are not found.
<br>***Hint:*** the program should count the followers for any provided account.

In [6]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [7]:
# your code here

# selenium

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service


# load URL
path = '/home/elvira/drivers/chromedriver'
service = Service(executable_path=path)
driver = webdriver.Chrome(service=service)

url = 'https://twitter.com/elvestevez'

driver.get(url)
driver.maximize_window()

In [8]:
# analyze content
twitter_elements = driver.find_elements(by=By.TAG_NAME, value='a')

# followers: find the link to the followers page and read its counter
num_followers = 0
for element in twitter_elements:
    if element.get_attribute('href') == 'https://twitter.com/elvestevez/followers':
        num_followers = element.find_element(by=By.TAG_NAME, value='span').text

print(num_followers)

###### TODO: try/except?


14


In [9]:
# exit
driver.quit()

#### List all language names and number of related articles in the order they appear in wikipedia.org.

In [32]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'

In [33]:
# your code here

# html
response = requests.get(url)
html = response.content
parsed_html = bs4.BeautifulSoup(html, "html.parser") 

# table of languages
tab_lang = parsed_html.find("div", {"class": "central-featured"})
languages = tab_lang.find_all("div")

# languages
list_languages=[]
for lang in languages:
    dict_languages = {}
    dict_languages["language"] = lang.find("strong").string
    dict_languages["num articles"] = lang.find("bdi").string
    list_languages.append(dict_languages)

print(len(list_languages))
for i in list_languages:
    print(i["language"] + " - " + i["num articles"])

10
English - 6 458 000+
日本語 - 1 314 000+
Русский - 1 798 000+
Deutsch - 2 667 000+
Español - 1 755 000+
Français - 2 400 000+
Italiano - 1 742 000+
中文 - 1 256 000+
Português - 1 085 000+
Polski - 1 512 000+
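
The article counts are scraped as display strings such as `6 458 000+`. If they are needed as numbers, the digits can be kept and everything else dropped. A minimal sketch, with `parse_article_count` as a hypothetical helper and two values copied from the output above:

```python
def parse_article_count(text):
    """Convert a display string like '6 458 000+' to an integer."""
    # Keep only the digits, whatever separator or suffix the page uses.
    digits = "".join(ch for ch in text if ch.isdigit())
    return int(digits)


counts = {"English": "6 458 000+", "Polski": "1 512 000+"}
parsed = {language: parse_article_count(value) for language, value in counts.items()}
print(parsed)
```

Filtering on `isdigit()` avoids having to know whether the separators are plain spaces or non-breaking spaces.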


#### List the different kinds of datasets available on data.gov.uk.

In [34]:
# This is the url you will scrape in this exercise
url = 'https://data.gov.uk/'

In [35]:
# your code here

# html
response = requests.get(url)
html = response.content
parsed_html = bs4.BeautifulSoup(html, "html.parser") 

# list datasets
list_data = parsed_html.find("ul", {"class": "govuk-list dgu-topics__list"}).find_all("a")

datasets = [dat.string for dat in list_data]
len(datasets)
datasets

['Business and economy',
 'Crime and justice',
 'Defence',
 'Education',
 'Environment',
 'Government',
 'Government spending',
 'Health',
 'Mapping',
 'Society',
 'Towns and cities',
 'Transport',
 'Digital service performance',
 'Government reference data']

#### Display the top 10 languages by number of native speakers stored in a pandas dataframe.

In [36]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [55]:
# your code here

# html
response = requests.get(url)
html = response.content
parsed_html = bs4.BeautifulSoup(html, "html.parser") 

# table of languages: skip the header row and keep the top 10
tab_lang = parsed_html.find("table").find_all("tbody")[0].find_all("tr")[1:11]

# list info languages and native speakers
list_languages = []
for tag in tab_lang:
    dict_languages = {}       
    # language
    dict_languages["language"] = tag.find_all("td")[1].find("a").string.strip()
    # native speakers
    dict_languages["native speakers"] = tag.find_all("td")[2].string.strip()
    list_languages.append(dict_languages)

df_languages = pd.DataFrame(list_languages)
df_languages

### TODO: the tag content is not extracted because the cell contains a small marker (a similar case was seen earlier)
# position 32 - Indonesian has an odd element in the number of speakers (legend)
#tab_lang = parsed_html.find("table").find_all("tbody")[0].find_all("tr")[32]
#print(tab_lang.find_all("td"))
#print(tab_lang.find_all("td")[1].find("a").string)
#print(tab_lang.find_all("td")[2])
#print(tab_lang.find_all("td")[2].string)


| | language | native speakers |
|---|---|---|
| 0 | Mandarin Chinese | 929.0 |
| 1 | Spanish | 474.7 |
| 2 | English | 372.9 |
| 3 | Hindi | 343.9 |
| 4 | Bengali | 233.7 |
| 5 | Portuguese | 232.4 |
| 6 | Russian | 154.0 |
| 7 | Japanese | 125.3 |
| 8 | Western Punjabi | 92.7 |
| 9 | Yue Chinese | 85.2 |


## Bonus
#### Scrape a certain number of tweets of a given Twitter account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [None]:
# your code here

#### Display IMDB's top 250 data (movie name, initial release, director name and stars) as a pandas dataframe.

In [56]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'

In [57]:
# your code here

# html
response = requests.get(url)
html = response.content
parsed_html = bs4.BeautifulSoup(html, "html.parser") 

# table of top films
tab_films = parsed_html.find("tbody", {"class": "lister-list"}).find_all("tr")
print(len(tab_films))

# list info to films
list_films = []
for film in tab_films:
    dict_films = {}       
    # movie name
    dict_films["movie name"] = film.find("td", {"class": "titleColumn"}).find("a").string.strip()
    # initial release
    dict_films["initial release"] = film.find("td", {"class": "titleColumn"}).find("span").string.strip()
    # director name
    dict_films["director name"] = film.find("td", {"class": "titleColumn"}).find("a")["title"].split(",")[0]
    # stars
    dict_films["stars"] = film.find("td", {"class": "ratingColumn imdbRating"}).find("strong").string.strip()
    # add list
    list_films.append(dict_films)

df_films = pd.DataFrame(list_films)
df_films

250


| | movie name | initial release | director name | stars |
|---|---|---|---|---|
| 0 | Cadena perpetua | (1994) | Frank Darabont (dir.) | 9.2 |
| 1 | El padrino | (1972) | Francis Ford Coppola (dir.) | 9.2 |
| 2 | El caballero oscuro | (2008) | Christopher Nolan (dir.) | 9.0 |
| 3 | El padrino: Parte II | (1974) | Francis Ford Coppola (dir.) | 9.0 |
| 4 | 12 hombres sin piedad | (1957) | Sidney Lumet (dir.) | 8.9 |
| ... | ... | ... | ... | ... |
| 245 | Jai Bhim | (2021) | T.J. Gnanavel (dir.) | 8.0 |
| 246 | Aladdín | (1992) | Ron Clements (dir.) | 8.0 |
| 247 | Gandhi | (1982) | Richard Attenborough (dir.) | 8.0 |
| 248 | Criadas y señoras | (2011) | Tate Taylor (dir.) | 8.0 |


#### Display the movie name, year and a brief summary of the top 10 random movies (IMDB) as a pandas dataframe.

In [58]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'

In [60]:
%%time

# your code here

# html
response = requests.get(url)
html = response.content
parsed_html = bs4.BeautifulSoup(html, "html.parser") 

# table of top films
tab_films = parsed_html.find("tbody", {"class": "lister-list"}).find_all("tr")[:10]
print(len(tab_films))

# list info to films
list_films = []
for film in tab_films:
    dict_films = {}       
    # movie name
    dict_films["movie name"] = film.find("td", {"class": "titleColumn"}).find("a").string.strip()
    # year
    dict_films["year"] = film.find("td", {"class": "titleColumn"}).find("span").string.replace("(","").replace(")","")
    # summary: fetched from the film's own page
    url_film = "https://www.imdb.com" + film.find("td", {"class": "titleColumn"}).find("a")["href"]
    html_film = requests.get(url_film).content
    parsed_html_film = bs4.BeautifulSoup(html_film, "html.parser") 
    # both attribute filters go in one dict; a second dict would be passed as
    # the `recursive` argument of find() and its filter would be ignored
    dict_films["summary"] = parsed_html_film.find("div", {"class": "sc-16ede01-8 hXeKyz sc-910a7330-11 GYbFb"}).find("span", {"role": "presentation", "data-testid": "plot-xl"}).string
    # add list
    list_films.append(dict_films)

df_films = pd.DataFrame(list_films)
df_films

10
CPU times: user 2.38 s, sys: 44.9 ms, total: 2.42 s
Wall time: 13.1 s


| | movie name | year | summary |
|---|---|---|---|
| 0 | Cadena perpetua | 1994 | Two imprisoned men bond over a number of years... |
| 1 | El padrino | 1972 | The aging patriarch of an organized crime dyna... |
| 2 | El caballero oscuro | 2008 | When the menace known as the Joker wreaks havo... |
| 3 | El padrino: Parte II | 1974 | The early life and career of Vito Corleone in ... |
| 4 | 12 hombres sin piedad | 1957 | The jury in a New York City murder trial is fr... |
| 5 | La lista de Schindler | 1993 | In German-occupied Poland during World War II,... |
| 6 | El señor de los anillos: El retorno del rey | 2003 | Gandalf and Aragorn lead the World of Men agai... |
| 7 | Pulp Fiction | 1994 | The lives of two mob hitmen, a boxer, a gangst... |
| 8 | El señor de los anillos: La comunidad del anillo | 2001 | A meek Hobbit from the Shire and eight compani... |
| 9 | El bueno, el feo y el malo | 1966 | A bounty hunting scam joins two men in an unea... |


#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [54]:
#https://openweathermap.org/current
city = input('Enter the city: ')
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=9120dd14f0a8fe627dcf03e25c50dded&units=metric'

Enter the city: Madrid


In [55]:
# your code here

url

#### waiting for the API key to be activated


'http://api.openweathermap.org/data/2.5/weather?q=Madrid&APPID=024fe03284f1c7ae4b0d1484a4e130ba&units=metric'
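
Once the key is active, the response body is JSON and `response.json()` returns a dictionary. The sketch below extracts the requested fields from a hand-written sample that follows the layout of the current weather endpoint; the values are invented and `summarize_weather` is a hypothetical helper.

```python
# Invented sample in the shape of a current-weather response (metric units).
sample_response = {
    "name": "Madrid",
    "weather": [{"main": "Clear", "description": "clear sky"}],
    "main": {"temp": 31.4},
    "wind": {"speed": 3.6},
}


def summarize_weather(data):
    """Pick temperature, wind speed, description and weather from the payload."""
    first = data["weather"][0]
    return {
        "city": data["name"],
        "temperature": data["main"]["temp"],
        "wind speed": data["wind"]["speed"],
        "description": first["description"],
        "weather": first["main"],
    }


report = summarize_weather(sample_response)
print(report)
```

In the notebook, `summarize_weather(requests.get(url).json())` would be the equivalent call; a city that is not found returns an error payload without these keys, so a `try/except KeyError` around it may be useful.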

#### Find the book name, price and stock availability as a pandas dataframe.

In [46]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'

In [49]:
# your code here

# html
response = requests.get(url)
html = response.content
parsed_html = bs4.BeautifulSoup(html, "html.parser") 

# book entries
tag_books = parsed_html.find_all("article", {"class": "product_pod"})
print(len(tag_books))

# list info to books
list_books = []
for book in tag_books:
    dict_books = {}       
    # book name
    dict_books["book name"] = book.find("h3").find("a")["title"]
    # price
    dict_books["price"] = book.find("div", {"class": "product_price"}).find("p", {"class": "price_color"}).string
    # stock: the <p> contains an <i> icon as well as the text, so .string
    # returns None; get_text(strip=True) returns the text on its own
    dict_books["stock"] = book.find("div", {"class": "product_price"}).find("p", {"class": "instock availability"}).get_text(strip=True)
    # add list
    list_books.append(dict_books)

df_books = pd.DataFrame(list_books)
df_books

20

| | book name | price | stock |
|---|---|---|---|
| 0 | A Light in the Attic | £51.77 | In stock |
| 1 | Tipping the Velvet | £53.74 | In stock |
| 2 | Soumission | £50.10 | In stock |
| 3 | Sharp Objects | £47.82 | In stock |
| 4 | Sapiens: A Brief History of Humankind | £54.23 | In stock |
| 5 | The Requiem Red | £22.65 | In stock |
| 6 | The Dirty Little Secrets of Getting Your Dream... | £33.34 | In stock |
| 7 | The Coming Woman: A Novel Based on the Life of... | £17.93 | In stock |
| 8 | The Boys in the Boat: Nine Americans and Their... | £22.60 | In stock |
| 9 | The Black Maria | £52.15 | In stock |
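
The prices are scraped as strings such as `£51.77`. For sorting or arithmetic they need to be numeric. A minimal sketch using `decimal.Decimal`, with sample values taken from the table above (`parse_price` is a hypothetical helper):

```python
from decimal import Decimal


def parse_price(text):
    """Convert a price string like '£51.77' to a Decimal."""
    # Keep only digits and the decimal point, dropping the currency symbol.
    cleaned = "".join(ch for ch in text if ch.isdigit() or ch == ".")
    return Decimal(cleaned)


prices = [parse_price(p) for p in ["£51.77", "£53.74", "£50.10"]]
total = sum(prices)
print(total)
```

`Decimal` is chosen over `float` so that sums of prices stay exact to the penny.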
