# Web Scraping Lab

This notebook contains web scraping exercises to help you practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit the URLs below and inspect their source code with Chrome DevTools. You will need to identify the HTML tags, special class names, etc. used in the HTML content you are expected to extract.
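
As a small illustration of the first tip, the sketch below uses only the standard library's `http.HTTPStatus` to turn a status code into a readable verdict. The helper name `describe_status` is hypothetical; with `requests` you would pass it `response.status_code`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short description of an HTTP status code (hypothetical helper)."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one the standard library knows about.
        phrase = "Unknown status"
    verdict = "OK to parse" if 200 <= code < 300 else "do not parse"
    return f"{code} {phrase}: {verdict}"


print(describe_status(200))
print(describe_status(404))
```

`requests` also provides `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses, if you prefer failing loudly over checking by hand.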

**Resources**:
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide)
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup` and `pandas` are already imported for you. If you prefer to use additional libraries, feel free to do so.

In [546]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [65]:
url = 'https://github.com/trending/developers'

In [66]:
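# Pass only the URL: a second positional argument to requests.get() is
# treated as query parameters, not as a parser name.
response = requests.get(url)
response.text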

'\n\n<!DOCTYPE html>\n<html lang="en" >\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-OZSLZxbfZRavuNMaKn9S2z6nOiqb+cSXqL/eTi4TqwhiRm1fDxQpuwjViN7NzGw/nhXT4O0BZIIg0Ym7szrbpg==" rel="stylesheet" href="https://github.githubassets.com/assets/frameworks-39948b6716df6516afb8d31a2a7f52db.css" />\n  <link crossorigin="anonymous" media="all" integrity="sha512-jaRxAk/R7Eq6XXtxt2dWYc6UfgT/Jk9zYWYh4UpAt5LFRnYVaWqEM3sPhUFL3fOBmHhHoOcn4wfLkMS21Q1yaw==" rel="stylesheet" href="https://github.githubassets.com/assets/site-8da471024fd1ec4aba5d7b71b7675661.css" />\n    <link crossorigin="anonymous" media="all" integrity="sha512-9nE+XgrWtARaS0zwxOiHy2GiHph7

In [67]:
response.content

b'\n\n<!DOCTYPE html>\n<html lang="en" >\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-OZSLZxbfZRavuNMaKn9S2z6nOiqb+cSXqL/eTi4TqwhiRm1fDxQpuwjViN7NzGw/nhXT4O0BZIIg0Ym7szrbpg==" rel="stylesheet" href="https://github.githubassets.com/assets/frameworks-39948b6716df6516afb8d31a2a7f52db.css" />\n  <link crossorigin="anonymous" media="all" integrity="sha512-jaRxAk/R7Eq6XXtxt2dWYc6UfgT/Jk9zYWYh4UpAt5LFRnYVaWqEM3sPhUFL3fOBmHhHoOcn4wfLkMS21Q1yaw==" rel="stylesheet" href="https://github.githubassets.com/assets/site-8da471024fd1ec4aba5d7b71b7675661.css" />\n    <link crossorigin="anonymous" media="all" integrity="sha512-9nE+XgrWtARaS0zwxOiHy2GiHph

In [68]:
html = response.content
# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, 'lxml')
soup

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/frameworks-39948b6716df6516afb8d31a2a7f52db.css" integrity="sha512-OZSLZxbfZRavuNMaKn9S2z6nOiqb+cSXqL/eTi4TqwhiRm1fDxQpuwjViN7NzGw/nhXT4O0BZIIg0Ym7szrbpg==" media="all" rel="stylesheet"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/site-8da471024fd1ec4aba5d7b71b7675661.css" integrity="sha512-jaRxAk/R7Eq6XXtxt2dWYc6UfgT/Jk9zYWYh4UpAt5LFRnYVaWqEM3sPhUFL3fOBmHhHoOcn4wfLkMS21Q1yaw==" media="all" rel="stylesheet"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/behaviors-f6713e5e0ad6b4045a4b4cf0c4e887cb.css" integr

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. No name should contain any HTML tags.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to remove whitespace and line breaks (i.e. `\n`) from the *text* of each HTML element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
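
The whitespace cleanup in step 3 can be sketched with plain string methods. This is a minimal example, not the only way to do it: `collapse_whitespace` and `remove_whitespace` are hypothetical helper names, and the input string is made up for illustration rather than copied from the page.

```python
def collapse_whitespace(raw: str) -> str:
    """Replace every run of spaces, tabs and line breaks with one space."""
    return " ".join(raw.split())


def remove_whitespace(raw: str) -> str:
    """Drop all whitespace characters."""
    return "".join(raw.split())


# Illustrative input, not real page content.
raw_text = "\n\n      charlax\n      (Charles-Axel Dein)\n  "
print(collapse_whitespace(raw_text))
print(remove_whitespace(raw_text))
```

`str.split()` with no arguments splits on any whitespace and discards empty pieces, so it handles leading, trailing and repeated whitespace in one step.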

In [90]:
# Developer display names sit in <h1 class="h3 lh-condensed"> elements.
elements = soup.find_all('h1', attrs={'class': 'h3 lh-condensed'})
data = [element.text.replace('\n', '').strip() for element in elements]
data

['Juliette',
 'Thibault Duplessis',
 'Rico Suter',
 'Hajime Hoshi',
 'Stephan Dilly',
 'Mike McQuaid',
 'Matt (IPv4) Cowley',
 'Evan Wallace',
 'Michael Lynch',
 'Frost Ming',
 'Jesse Duffield',
 'Erik Ejlskov Jensen',
 'Francois Zaninotto',
 'Mo Gorhom',
 'Lee Robinson',
 'XAMPPRocky',
 '/rootzoll',
 'Eduardo San Martin Morote',
 'Kyle Mathews',
 'Tomas Votruba',
 'David Khourshid',
 'Franck Nijhof',
 'Barry vd. Heuvel',
 'Alex Gaynor',
 'Vadim Dalecky']

In [94]:
# GitHub usernames sit in <p class="f4 text-normal mb-1"> elements.
elements = soup.find_all('p', attrs={'class': 'f4 text-normal mb-1'})
data1 = [element.text.replace('\n', '').strip() for element in elements]
data1

['jrfnl',
 'ornicar',
 'RicoSuter',
 'hajimehoshi',
 'extrawurst',
 'MikeMcQuaid',
 'MattIPv4',
 'evanw',
 'mtlynch',
 'frostming',
 'jesseduffield',
 'ErikEJ',
 'fzaninotto',
 'gorhom',
 'leerob',
 'rootzoll',
 'posva',
 'KyleAMathews',
 'TomasVotruba',
 'davidkpiano',
 'frenck',
 'barryvdh',
 'alex',
 'streamich']

#### Display the trending Python repositories in GitHub.

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [97]:
url = 'https://github.com/trending/python?since=daily'

In [98]:
# The parser name belongs in the BeautifulSoup call, not in requests.get().
response = requests.get(url)
response.text

'\n\n<!DOCTYPE html>\n<html lang="en" >\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-OZSLZxbfZRavuNMaKn9S2z6nOiqb+cSXqL/eTi4TqwhiRm1fDxQpuwjViN7NzGw/nhXT4O0BZIIg0Ym7szrbpg==" rel="stylesheet" href="https://github.githubassets.com/assets/frameworks-39948b6716df6516afb8d31a2a7f52db.css" />\n  <link crossorigin="anonymous" media="all" integrity="sha512-jaRxAk/R7Eq6XXtxt2dWYc6UfgT/Jk9zYWYh4UpAt5LFRnYVaWqEM3sPhUFL3fOBmHhHoOcn4wfLkMS21Q1yaw==" rel="stylesheet" href="https://github.githubassets.com/assets/site-8da471024fd1ec4aba5d7b71b7675661.css" />\n    <link crossorigin="anonymous" media="all" integrity="sha512-9nE+XgrWtARaS0zwxOiHy2GiHph7

In [99]:
response.content

b'\n\n<!DOCTYPE html>\n<html lang="en" >\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-OZSLZxbfZRavuNMaKn9S2z6nOiqb+cSXqL/eTi4TqwhiRm1fDxQpuwjViN7NzGw/nhXT4O0BZIIg0Ym7szrbpg==" rel="stylesheet" href="https://github.githubassets.com/assets/frameworks-39948b6716df6516afb8d31a2a7f52db.css" />\n  <link crossorigin="anonymous" media="all" integrity="sha512-jaRxAk/R7Eq6XXtxt2dWYc6UfgT/Jk9zYWYh4UpAt5LFRnYVaWqEM3sPhUFL3fOBmHhHoOcn4wfLkMS21Q1yaw==" rel="stylesheet" href="https://github.githubassets.com/assets/site-8da471024fd1ec4aba5d7b71b7675661.css" />\n    <link crossorigin="anonymous" media="all" integrity="sha512-9nE+XgrWtARaS0zwxOiHy2GiHph

In [100]:
html = response.content
soup = BeautifulSoup(html, 'lxml')
soup

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/frameworks-39948b6716df6516afb8d31a2a7f52db.css" integrity="sha512-OZSLZxbfZRavuNMaKn9S2z6nOiqb+cSXqL/eTi4TqwhiRm1fDxQpuwjViN7NzGw/nhXT4O0BZIIg0Ym7szrbpg==" media="all" rel="stylesheet"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/site-8da471024fd1ec4aba5d7b71b7675661.css" integrity="sha512-jaRxAk/R7Eq6XXtxt2dWYc6UfgT/Jk9zYWYh4UpAt5LFRnYVaWqEM3sPhUFL3fOBmHhHoOcn4wfLkMS21Q1yaw==" media="all" rel="stylesheet"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/behaviors-f6713e5e0ad6b4045a4b4cf0c4e887cb.css" integr

In [111]:
# Repository names sit in <h1 class="h3 lh-condensed"> elements as "owner / repo".
elements = soup.find_all('h1', attrs={'class': 'h3 lh-condensed'})
data2 = [element.text.replace('\n', '').replace(' ', '').strip() for element in elements]
data2

['microsoft/Swin-Transformer',
 'd2l-ai/d2l-zh',
 'DidierRLopes/GamestonkTerminal',
 'Chia-Network/chia-blockchain',
 'huggingface/transformers',
 'soimort/you-get',
 'deepset-ai/haystack',
 'PaddlePaddle/PaddleOCR',
 'scipy/scipy',
 'docker/compose',
 'ageitgey/face_recognition',
 'ansible/awx',
 'pittcsc/Summer2022-Internships',
 'milesial/Pytorch-UNet',
 'heartexlabs/label-studio',
 'mai-lang-chai/Middleware-Vulnerability-detection',
 'Shawn-Shan/fawkes',
 'hindupuravinash/the-gan-zoo',
 'wgpsec/tig',
 'spulec/moto',
 'scikit-learn/scikit-learn',
 'awslabs/aws-data-wrangler',
 'WZMIAOMIAO/deep-learning-for-image-processing',
 'revoxhere/duino-coin',
 'ZhaoJ9014/face.evoLVe.PyTorch']

#### Display all the image links from the Walt Disney Wikipedia page.

In [192]:
url = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [193]:
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'lxml')
soup

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Walt Disney - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"f04ade1e-1e7a-46f2-ad0f-ac59c6d2f84f","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Walt_Disney","wgTitle":"Walt Disney","wgCurRevisionId":1016569860,"wgRevisionId":1016569860,"wgArticleId":32917,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages containing links to subscription-only content","Articles with short description","Short description is different from Wikidata","Wikipedia extended-confirmed-protected p

In [194]:
# Links to the image description pages are <a class="image"> elements.
image_links = soup.find_all('a', attrs={'class': 'image'})
data3 = [link['href'].strip() for link in image_links]
data3

['/wiki/File:Walt_Disney_1946.JPG',
 '/wiki/File:Walt_Disney_1942_signature.svg',
 '/wiki/File:Walt_Disney_envelope_ca._1921.jpg',
 '/wiki/File:Trolley_Troubles_poster.jpg',
 '/wiki/File:Walt_Disney_and_his_cartoon_creation_%22Mickey_Mouse%22_-_National_Board_of_Review_Magazine.jpg',
 '/wiki/File:Steamboat-willie.jpg',
 '/wiki/File:Walt_Disney_1935.jpg',
 '/wiki/File:Walt_Disney_Snow_white_1937_trailer_screenshot_(13).jpg',
 '/wiki/File:Disney_drawing_goofy.jpg',
 '/wiki/File:DisneySchiphol1951.jpg',
 '/wiki/File:WaltDisneyplansDisneylandDec1954.jpg',
 '/wiki/File:Walt_disney_portrait_right.jpg',
 '/wiki/File:Walt_Disney_Grave.JPG',
 '/wiki/File:Roy_O._Disney_with_Company_at_Press_Conference.jpg',
 '/wiki/File:Disney_Display_Case.JPG',
 '/wiki/File:Disney1968.jpg',
 '/wiki/File:Disneyland_Resort_logo.svg',
 '/wiki/File:Animation_disc.svg',
 '/wiki/File:P_vip.svg',
 '/wiki/File:Magic_Kingdom_castle.jpg',
 '/wiki/File:Video-x-generic.svg',
 '/wiki/File:Flag_of_Los_Angeles_County,_Califor

In [227]:
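# lstrip('/') drops the leading '//' of protocol-relative URLs; images without a src are skipped.
links = [img['src'].lstrip('/') for img in soup.find_all('img') if img.get('src')]
links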

['upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png',
 'upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/20px-Extended-protection-shackle.svg.png',
 'upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG',
 'upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png',
 'upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg',
 'upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Newman_Laugh-O-Gram_%281921%29.webm/220px-seek%3D2-Newman_Laugh-O-Gram_%281921%29.webm.jpg',
 'upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Trolley_Troubles_poster.jpg/170px-Trolley_Troubles_poster.jpg',
 'upload.wikimedia.org/wikipedia/commons/thumb/7/71/Walt_Disney_and_his_cartoon_creation_%22Mickey_Mouse%22_-_National_Board_of_Review_Magazine.jpg/170px-Walt_Disney_a
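
The links collected above are relative (`/wiki/...`) or protocol-relative (`//upload.wikimedia.org/...`). One way to turn them into complete URLs is `urllib.parse.urljoin` from the standard library. This is only a sketch: `to_absolute` and `PAGE_URL` are hypothetical names introduced for the example.

```python
from urllib.parse import urljoin

# Assumption: links are resolved against the page they were scraped from.
PAGE_URL = "https://en.wikipedia.org/wiki/Walt_Disney"


def to_absolute(link: str, base: str = PAGE_URL) -> str:
    """Resolve a relative or protocol-relative link against the page URL."""
    return urljoin(base, link)


print(to_absolute("/wiki/File:Walt_Disney_1946.JPG"))
print(to_absolute("//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG"))
```

Resolving against the page URL keeps the scheme and host consistent, which avoids hand-written string concatenation.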

#### Retrieve the Wikipedia page for "Python" and create a list of the links on that page.

In [376]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Python'

In [377]:
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'lxml')
soup

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Python - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"6cab3a54-ba99-4863-9130-4ee43fd9afcf","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Python","wgTitle":"Python","wgCurRevisionId":997582414,"wgRevisionId":997582414,"wgArticleId":46332325,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Disambiguation pages with short descriptions","Short description is different from Wikidata","All article disambiguation pages","All disambiguation pages","Animal common name disambiguation

In [385]:
# Print every link whose target is an internal /wiki/ page.
for link in soup.find_all('a'):
    href = link.get('href')
    if href and href.startswith('/wiki/'):
        print(href)

/wiki/Pythons
/wiki/Python_(genus)
/wiki/Python_(programming_language)
/wiki/CMU_Common_Lisp
/wiki/PERQ#PERQ_3
/wiki/Python_of_Aenus
/wiki/Python_(painter)
/wiki/Python_of_Byzantium
/wiki/Python_of_Catana
/wiki/Python_Anghelo
/wiki/Python_(Efteling)
/wiki/Python_(Busch_Gardens_Tampa_Bay)
/wiki/Python_(Coney_Island,_Cincinnati,_Ohio)
/wiki/Python_(automobile_maker)
/wiki/Python_(Ford_prototype)
/wiki/Python_(missile)
/wiki/Python_(nuclear_primary)
/wiki/Colt_Python
/wiki/PYTHON
/wiki/Python_(film)
/wiki/Python_(mythology)
/wiki/Monty_Python
/wiki/Python_(Monty)_Pictures
/wiki/Cython
/wiki/Pyton
/wiki/Pithon
/wiki/File:Disambig_gray.svg
/wiki/Help:Disambiguation
/wiki/Help:Category
/wiki/Category:Disambiguation_pages
/wiki/Category:Human_name_disambiguation_pages
/wiki/Category:Disambiguation_pages_with_given-name-holder_lists
/wiki/Category:Disambiguation_pages_with_short_descriptions
/wiki/Category:Short_description_is_different_from_Wikidata
/wiki/Category:All_article_disambiguation_p

In [379]:
anchors = [item.find('a') for item in soup.find_all('li')]
hrefs = [anchor.get('href') for anchor in anchors if anchor is not None]
links4 = [href for href in hrefs if href and href.startswith('/wiki/') and 'ython' in href]
links4

['/wiki/Pythons',
 '/wiki/Python_(genus)',
 '/wiki/Python_(programming_language)',
 '/wiki/Python_of_Aenus',
 '/wiki/Python_(painter)',
 '/wiki/Python_of_Byzantium',
 '/wiki/Python_of_Catana',
 '/wiki/Python_Anghelo',
 '/wiki/Python_(Efteling)',
 '/wiki/Python_(Busch_Gardens_Tampa_Bay)',
 '/wiki/Python_(Coney_Island,_Cincinnati,_Ohio)',
 '/wiki/Python_(automobile_maker)',
 '/wiki/Python_(Ford_prototype)',
 '/wiki/Python_(missile)',
 '/wiki/Python_(nuclear_primary)',
 '/wiki/Colt_Python',
 '/wiki/Python_(film)',
 '/wiki/Python_(mythology)',
 '/wiki/Monty_Python',
 '/wiki/Python_(Monty)_Pictures',
 '/wiki/Cython',
 '/wiki/Python',
 '/wiki/Talk:Python',
 '/wiki/Python',
 '/wiki/Special:WhatLinksHere/Python',
 '/wiki/Special:RecentChangesLinked/Python']
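
The list above still contains a duplicate (`/wiki/Python`) and some non-article pages (`Talk:`, `Special:`). A possible cleanup step is sketched below. The function name `article_links` is hypothetical, and treating any colon in the title as a namespace marker is an assumption that would also drop real articles whose titles contain a colon.

```python
def article_links(hrefs):
    """Keep unique /wiki/ article links, preserving first-seen order."""
    seen = set()
    result = []
    for href in hrefs:
        title = href[len("/wiki/"):]
        # Assumption: a colon in the title marks a namespace page such as Talk: or Special:.
        if not href.startswith("/wiki/") or ":" in title or href in seen:
            continue
        seen.add(href)
        result.append(href)
    return result


sample = ["/wiki/Cython", "/wiki/Python", "/wiki/Talk:Python", "/wiki/Python",
          "/wiki/Special:WhatLinksHere/Python"]
print(article_links(sample))
```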

#### Create a Python list with the names of the FBI's Ten Most Wanted fugitives.

In [386]:
url = 'https://www.fbi.gov/wanted/topten'

In [387]:
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'lxml')
soup

<!DOCTYPE html>
<html data-gridsystem="bs3" lang="en">
<head>
<meta charset="utf-8"/>
<meta content="ie=edge" http-equiv="x-ua-compatible"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<link href="https://www.fbi.gov/wanted/topten" rel="canonical"/>
<link href="https://www.fbi.gov/wanted/topten/RSS" rel="alternate" title="Ten Most Wanted Fugitives - RSS 1.0" type="application/rss+xml"/>
<link href="https://www.fbi.gov/wanted/topten/rss.xml" rel="alternate" title="Ten Most Wanted Fugitives - RSS 2.0" type="application/rss+xml"/>
<link href="https://www.fbi.gov/wanted/topten/atom.xml" rel="alternate" title="Ten Most Wanted Fugitives - Atom" type="application/rss+xml"/>
<title>Ten Most Wanted Fugitives — FBI</title>
<meta content="text/plain" name="DC.format"/>
<meta content="2010/07/16 - " name="DC.date.valid_range"/>
<meta content="Folder" name="DC.type"/>
<meta content="Wanted by the FBI, Top Ten Most Wanted, Ten Most Wanted Fugitives, Top Ten Fugitives, Top

In [416]:
# Collect the names in a list, as the exercise asks, then print one per line.
names = [title.text.replace('\n', '') for title in soup.find_all('h3', attrs={'class': 'title'})]
print(*names, sep='\n')

ROBERT WILLIAM FISHER
ALEJANDRO ROSALES CASTILLO
ARNOLDO JIMENEZ
JASON DEREK BROWN
ALEXIS FLORES
JOSE RODOLFO VILLARREAL-HERNANDEZ
EUGENE PALMER
RAFAEL CARO-QUINTERO
BHADRESHKUMAR CHETANBHAI PATEL
YASER ABDEL SAID


#### Display information about the 20 latest earthquakes (date, time, latitude, longitude and region name) reported by the EMSC as a pandas DataFrame.

In [5]:
url = 'https://www.emsc-csem.org/Earthquake/'

In [6]:
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'lxml')
# Region names of the 20 most recent events; '\xa0' is a non-breaking space.
region_cells = soup.find_all('td', attrs={'class': 'tb_region'})[:20]
region_name = [cell.text.replace('\xa0', '') for cell in region_cells]
region_name

['ANTOFAGASTA, CHILE',
 'DODECANESE ISLANDS, GREECE',
 'FIJI REGION',
 'SOUTH OF SUMBAWA, INDONESIA',
 'ANTOFAGASTA, CHILE',
 'NEAR THE COAST OF WESTERN TURKEY',
 'OKLAHOMA',
 'NEAR COAST OF SOUTHERN PERU',
 'OFFSHORE GUERRERO, MEXICO',
 'ISLAND OF HAWAII, HAWAII',
 'SOUTH AUSTRALIA',
 'ISLAND OF HAWAII, HAWAII',
 'SWITZERLAND',
 'FRANCE',
 'ICELAND REGION',
 'NEUQUEN, ARGENTINA',
 'EASTERN MEDITERRANEAN SEA',
 'PUERTO RICO',
 'NEAR ISLANDS, ALEUTIAN ISLANDS',
 'OFFSHORE ATACAMA, CHILE']

In [7]:
response1 = requests.get(url)
html1 = response1.content
soup1 = BeautifulSoup(html1, 'lxml')
# The 'tabev1' cells alternate latitude, longitude, so 20 events span the first 40 cells.
cells = [cell.text.replace('\xa0', '') for cell in soup1.find_all('td', attrs={'class': 'tabev1'})[:40]]
latitude = cells[0::2]
longitude = cells[1::2]
longitude

['69.84',
 '27.51',
 '177.83',
 '117.07',
 '69.18',
 '26.09',
 '97.43',
 '74.64',
 '99.24',
 '155.41',
 '137.84',
 '155.41',
 '8.17',
 '4.44',
 '18.49',
 '70.51',
 '28.50',
 '66.76',
 '171.58',
 '71.41']

In [8]:
latitude

['22.32',
 '35.29',
 '18.00',
 '10.40',
 '24.29',
 '39.29',
 '34.39',
 '15.71',
 '16.12',
 '19.22',
 '29.90',
 '19.20',
 '47.55',
 '48.17',
 '68.13',
 '36.23',
 '35.46',
 '18.01',
 '53.59',
 '28.17']

In [9]:
response2 = requests.get(url)
html2 = response2.content
soup2 = BeautifulSoup(html2, 'lxml')
# Event timestamps are link texts that start with the year (2021 when this was run).
timestamps = [link.text for link in soup2.find_all('a') if link.text.startswith('2021')]
date_time = [text.replace('\xa0\xa0\xa0', ' ') for text in timestamps[:20]]
date_time

['2021-04-15 11:07:46.6',
 '2021-04-15 11:04:56.4',
 '2021-04-15 10:53:34.6',
 '2021-04-15 10:36:20.6',
 '2021-04-15 10:35:27.0',
 '2021-04-15 10:14:25.4',
 '2021-04-15 10:07:36.8',
 '2021-04-15 10:05:50.0',
 '2021-04-15 09:58:22.0',
 '2021-04-15 09:56:32.2',
 '2021-04-15 09:52:25.6',
 '2021-04-15 09:46:39.2',
 '2021-04-15 09:44:42.1',
 '2021-04-15 09:36:28.5',
 '2021-04-15 09:29:54.2',
 '2021-04-15 09:09:52.0',
 '2021-04-15 09:09:07.6',
 '2021-04-15 08:40:28.2',
 '2021-04-15 08:33:40.7',
 '2021-04-15 08:26:16.0']
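
The exercise asks for date and time, and the strings above hold both. If you want them as separate values, one option is to parse each string with `datetime.strptime`. This is a minimal sketch: `parse_timestamp` is a hypothetical helper, and the format string assumes every entry looks like the ones shown above.

```python
from datetime import datetime


def parse_timestamp(text: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM:SS.f' string into a datetime object."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


stamp = parse_timestamp("2021-04-15 11:07:46.6")
print(stamp.date(), stamp.time())
```

With parsed values, `stamp.date()` and `stamp.time()` could fill two separate DataFrame columns.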

In [10]:
data=[]
for i in range(20):
    data.append([region_name[i],date_time[i],latitude[i],longitude[i]])

In [11]:
earthdf = pd.DataFrame(data, columns = ['Region', 'Date and Time', "Latitude", "Longitude"]) 
earthdf

| | Region | Date and Time | Latitude | Longitude |
|---|---|---|---|---|
| 0 | ANTOFAGASTA, CHILE | 2021-04-15 11:07:46.6 | 22.32 | 69.84 |
| 1 | DODECANESE ISLANDS, GREECE | 2021-04-15 11:04:56.4 | 35.29 | 27.51 |
| 2 | FIJI REGION | 2021-04-15 10:53:34.6 | 18.0 | 177.83 |
| 3 | SOUTH OF SUMBAWA, INDONESIA | 2021-04-15 10:36:20.6 | 10.4 | 117.07 |
| 4 | ANTOFAGASTA, CHILE | 2021-04-15 10:35:27.0 | 24.29 | 69.18 |
| 5 | NEAR THE COAST OF WESTERN TURKEY | 2021-04-15 10:14:25.4 | 39.29 | 26.09 |
| 6 | OKLAHOMA | 2021-04-15 10:07:36.8 | 34.39 | 97.43 |
| 7 | NEAR COAST OF SOUTHERN PERU | 2021-04-15 10:05:50.0 | 15.71 | 74.64 |
| 8 | OFFSHORE GUERRERO, MEXICO | 2021-04-15 09:58:22.0 | 16.12 | 99.24 |
| 9 | ISLAND OF HAWAII, HAWAII | 2021-04-15 09:56:32.2 | 19.22 | 155.41 |


#### List all language names and number of related articles in the order they appear in wikipedia.org.

In [66]:
url = 'https://www.wikipedia.org/'

In [74]:
response3 = requests.get(url)
html3 = response3.content
soup3 = BeautifulSoup(html3, 'lxml')
# Each featured language is one <div> inside the 'central-featured' block.
x3 = soup3.find('div', {'class': 'central-featured'}).find_all('div')
result = [div.text.replace('\n', ' ').replace('\xa0', '').strip() for div in x3]
result

['English 6280000+ articles',
 'Español 1673000+ artículos',
 'Deutsch 2562000+ Artikel',
 '日本語 1263000+ 記事',
 'Русский 1714000+ статей',
 'Français 2317000+ articles',
 'Italiano 1685000+ voci',
 '中文 1190000+ 條目',
 'Português 1065000+ artigos',
 'Polski 1467000+ haseł']

In [54]:
# Reuse soup3 from above; the language name is in each <strong> tag.
x3 = soup3.find('div', {'class': 'central-featured'}).find_all('div')
datax = [div.strong.text for div in x3]
print(*datax, sep='\n')

English
Español
Deutsch
日本語
Русский
Français
Italiano
中文
Português
Polski


In [55]:
# Reuse soup3 from above; the article count is in each <small> tag.
x3 = soup3.find('div', {'class': 'central-featured'}).find_all('div')
datay = [div.small.text for div in x3]
print(*datay, sep='\n')

6 280 000+ articles
1 673 000+ artículos
2 562 000+ Artikel
1 263 000+ 記事
1 714 000+ статей
2 317 000+ articles
1 685 000+ voci
1 190 000+ 條目
1 065 000+ artigos
1 467 000+ haseł


In [56]:
d0 = pd.DataFrame({'Language':datax, 'Articles':datay})
d0

| | Language | Articles |
|---|---|---|
| 0 | English | 6 280 000+ articles |
| 1 | Español | 1 673 000+ artículos |
| 2 | Deutsch | 2 562 000+ Artikel |
| 3 | 日本語 | 1 263 000+ 記事 |
| 4 | Русский | 1 714 000+ статей |
| 5 | Français | 2 317 000+ articles |
| 6 | Italiano | 1 685 000+ voci |
| 7 | 中文 | 1 190 000+ 條目 |
| 8 | Português | 1 065 000+ artigos |
| 9 | Polski | 1 467 000+ haseł |
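
The `Articles` column holds text such as `6 280 000+ articles`, which cannot be sorted or compared numerically. A possible conversion to integers is sketched below. `parse_article_count` is a hypothetical helper, and it assumes the number always comes before the `+` sign, as in the output above.

```python
import re


def parse_article_count(text: str) -> int:
    """Extract the leading number from text such as '6 280 000+ articles'."""
    number_part = text.split("+")[0]
    # \D matches anything that is not a digit, including regular and non-breaking spaces.
    return int(re.sub(r"\D", "", number_part))


print(parse_article_count("6\xa0280\xa0000+ articles"))
print(parse_article_count("1 467 000+ haseł"))
```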


#### Create a list of the different kinds of datasets available on data.gov.uk.

In [75]:
url = 'https://data.gov.uk/'

In [82]:
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'lxml')
# Each dataset category title is an <h3> heading.
y = [heading.text for heading in soup.find_all('h3')]
y

['Business and economy',
 'Crime and justice',
 'Defence',
 'Education',
 'Environment',
 'Government',
 'Government spending',
 'Health',
 'Mapping',
 'Society',
 'Towns and cities',
 'Transport',
 'Digital service performance',
 'Government reference data']

In [117]:
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'lxml')
x1 = soup.find_all('p', {'class': 'govuk-body'})
# Skip the first 'govuk-body' paragraph and keep the next 14, one per category.
y1 = [paragraph.text for paragraph in x1[1:15]]
y1

['Small businesses, industry, imports, exports and trade',
 'Courts, police, prison, offenders, borders and immigration',
 'Armed forces, health and safety, search and rescue',
 'Students, training, qualifications and the National Curriculum',
 'Weather, flooding, rivers, air quality, geology and agriculture',
 'Staff numbers and pay, local councillors and department business plans',
 'Includes all payments by government departments over £25,000',
 'Includes smoking, drugs, alcohol, medicine performance and hospitals',
 'Addresses, boundaries, land ownership, aerial photographs, seabed and land terrain',
 'Employment, benefits, household finances, poverty and population',
 'Includes housing, urban planning, leisure, waste and energy, consumption',
 'Airports, roads, freight, electric vehicles, parking, buses and footpaths',
 'Cost, usage, completion rate, digital take-up, satisfaction',
 'Trusted data that is referenced and shared across government departments']

In [118]:
pd.DataFrame({'Dataset': y, 'Description': y1})

| | Dataset | Description |
|---|---|---|
| 0 | Business and economy | Small businesses, industry, imports, exports a... |
| 1 | Crime and justice | Courts, police, prison, offenders, borders and... |
| 2 | Defence | Armed forces, health and safety, search and re... |
| 3 | Education | Students, training, qualifications and the Nat... |
| 4 | Environment | Weather, flooding, rivers, air quality, geolog... |
| 5 | Government | Staff numbers and pay, local councillors and d... |
| 6 | Government spending | Includes all payments by government department... |
| 7 | Health | Includes smoking, drugs, alcohol, medicine per... |
| 8 | Mapping | Addresses, boundaries, land ownership, aerial ... |
| 9 | Society | Employment, benefits, household finances, pove... |


#### Display the top 10 languages by number of native speakers as a pandas DataFrame.

In [200]:
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [220]:
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'lxml')
# Collect the text of the first link in every table cell that contains one.
data = []
for cell in soup.find_all('td'):
    link = cell.find('a')
    if link is not None:
        data.append(link.text)
data

['Mandarin Chinese',
 'Sino-Tibetan',
 'Sinitic',
 'Spanish',
 'Indo-European',
 'Romance',
 'English',
 'Indo-European',
 'Germanic',
 'Hindi',
 'Indo-European',
 'Indo-Aryan',
 'Bengali',
 'Indo-European',
 'Indo-Aryan',
 'Portuguese',
 'Indo-European',
 'Romance',
 'Russian',
 'Indo-European',
 'Balto-Slavic',
 'Japanese',
 'Japonic',
 'Japanese',
 'Western Punjabi',
 'Indo-European',
 'Indo-Aryan',
 'Marathi',
 'Indo-European',
 'Indo-Aryan',
 'Telugu',
 'Dravidian',
 'South-Central',
 'Wu Chinese',
 'Sino-Tibetan',
 'Sinitic',
 'Turkish',
 'Turkic',
 'Oghuz',
 'Korean',
 'Koreanic',
 'language isolate',
 'French',
 'Indo-European',
 'Romance',
 'German',
 'Indo-European',
 'Germanic',
 'Vietnamese',
 'Austroasiatic',
 'Vietic',
 'Tamil',
 'Dravidian',
 'South',
 'Yue Chinese',
 'Sino-Tibetan',
 'Sinitic',
 'Urdu',
 'Indo-European',
 'Indo-Aryan',
 'Javanese',
 'Austronesian',
 'Malayo-Polynesian',
 'Italian',
 'Indo-European',
 'Romance',
 'Egyptian Arabic',
 'Afroasiatic',
 'Semi

In [226]:
# Each table row contributes three links (language, family, branch),
# so every third entry among the first 30 is a language name.
new = data[0:30:3]
new

['Mandarin Chinese',
 'Spanish',
 'English',
 'Hindi',
 'Bengali',
 'Portuguese',
 'Russian',
 'Japanese',
 'Western Punjabi',
 'Marathi']
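
The flat `data` list interleaves language, family and branch. Instead of keeping only every third entry, the list could be regrouped into rows of three. The sketch below uses a hypothetical `chunk` helper and assumes every table row contributes exactly three links, as the output above suggests; a row with a missing link would shift the grouping.

```python
def chunk(items, size):
    """Split a flat list into consecutive rows of `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


flat = ["Mandarin Chinese", "Sino-Tibetan", "Sinitic",
        "Spanish", "Indo-European", "Romance",
        "English", "Indo-European", "Germanic"]
rows = chunk(flat, 3)
languages = [row[0] for row in rows]
print(rows)
print(languages)
```

Rows in this shape can be passed straight to `pd.DataFrame(rows, columns=[...])` with column names of your choice.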

In [231]:
pd.DataFrame({'Languages' : new})

| | Languages |
|---|---|
| 0 | Mandarin Chinese |
| 1 | Spanish |
| 2 | English |
| 3 | Hindi |
| 4 | Bengali |
| 5 | Portuguese |
| 6 | Russian |
| 7 | Japanese |
| 8 | Western Punjabi |
| 9 | Marathi |


## Bonus

#### Display IMDb's top 250 data (movie name, initial release, director name and stars) as a pandas DataFrame.

In [233]:
url = 'https://www.imdb.com/chart/top'

In [398]:
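response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'lxml')
# Titles may come back localized (Portuguese in the run below), presumably based on the request's locale.
x = soup.find_all('td', {'class': 'titleColumn'})
name = [i.find('a').text for i in x]
name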

['Um Sonho de Liberdade',
 'O Poderoso Chefão',
 'O Poderoso Chefão II',
 'Batman: O Cavaleiro das Trevas',
 '12 Homens e uma Sentença',
 'A Lista de Schindler',
 'O Senhor dos Anéis: O Retorno do Rei',
 'Pulp Fiction: Tempo de Violência',
 'Três Homens em Conflito',
 'O Senhor dos Anéis: A Sociedade do Anel',
 'Clube da Luta',
 'Forrest Gump: O Contador de Histórias',
 'A Origem',
 'O Senhor dos Anéis: As Duas Torres',
 'Star Wars, Episódio V: O Império Contra-Ataca',
 'Matrix',
 'Os Bons Companheiros',
 'Um Estranho no Ninho',
 'Os Sete Samurais',
 'Seven: Os Sete Crimes Capitais',
 'A Vida é Bela',
 'Cidade de Deus',
 'O Silêncio dos Inocentes',
 'A Felicidade Não se Compra',
 'O Resgate do Soldado Ryan',
 'Guerra nas Estrelas',
 'À Espera de um Milagre',
 'A Viagem de Chihiro',
 'Interestelar',
 'Parasita',
 'O Profissional',
 'Harakiri',
 'Os Suspeitos',
 'O Rei Leão',
 'O Pianista',
 'O Exterminador do Futuro 2: O Julgamento Final',
 'De Volta para o Futuro',
 'A Outra História A

In [397]:
x = soup.find_all('td', {'class':'titleColumn'})
year = [i.find('span').text.replace('(', '').replace(')', '') for i in x]
year

['1994',
 '1972',
 '1974',
 '2008',
 '1957',
 '1993',
 '2003',
 '1994',
 '1966',
 '2001',
 '1999',
 '1994',
 '2010',
 '2002',
 '1980',
 '1999',
 '1990',
 '1975',
 '1954',
 '1995',
 '1997',
 '2002',
 '1991',
 '1946',
 '1998',
 '1977',
 '1999',
 '2001',
 '2014',
 '2019',
 '1994',
 '1962',
 '1995',
 '1994',
 '2002',
 '1991',
 '1985',
 '1998',
 '1936',
 '2000',
 '1960',
 '2006',
 '1931',
 '2014',
 '2011',
 '1988',
 '2006',
 '1968',
 '1942',
 '1988',
 '1954',
 '1979',
 '1979',
 '2000',
 '1940',
 '1981',
 '2012',
 '2006',
 '1957',
 '2020',
 '2008',
 '2019',
 '1980',
 '2018',
 '1950',
 '1957',
 '2003',
 '2018',
 '1997',
 '1964',
 '2012',
 '1984',
 '2016',
 '1986',
 '2017',
 '2019',
 '2018',
 '1999',
 '1995',
 '1963',
 '1995',
 '1981',
 '2009',
 '1984',
 '2009',
 '1997',
 '1983',
 '2007',
 '1992',
 '1968',
 '2000',
 '2012',
 '1958',
 '1931',
 '2004',
 '1941',
 '2016',
 '1985',
 '1952',
 '1921',
 '1948',
 '1987',
 '1952',
 '2000',
 '1959',
 '1983',
 '1971',
 '2019',
 '1976',
 '2010',
 '2011',
 

In [396]:
rating = []
x = soup.find_all('td', {'class':'ratingColumn'})
for i in x:
    # Remove newlines, the "rate this" widget text and non-breaking spaces
    text = i.text.replace('\n', '').replace('12345678910 NOT YET RELEASED Seen', '').replace('\xa0', '')
    rating.append(text)
# Keep only the cells that still hold a rating after cleaning
rating = list(filter(None, rating))
rating

['9.2',
 '9.1',
 '9.0',
 '9.0',
 '8.9',
 '8.9',
 '8.9',
 '8.8',
 '8.8',
 '8.8',
 '8.8',
 '8.8',
 '8.7',
 '8.7',
 '8.7',
 '8.6',
 '8.6',
 '8.6',
 '8.6',
 '8.6',
 '8.6',
 '8.6',
 '8.6',
 '8.6',
 '8.6',
 '8.6',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.5',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.4',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.3',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
 '8.2',
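The chained `replace` calls in the rating cell depend on the exact widget text in each cell. A possible alternative, sketched here, is to pull the first decimal number out of the cell text with a regular expression. `parse_rating` and the sample strings are hypothetical and only imitate the kind of text a `ratingColumn` cell might contain.

```python
import re


def parse_rating(cell_text):
    """Return the first decimal number in cell_text as a float, or None.

    The pattern requires a decimal point, so digit runs without one
    (such as the widget text) are ignored.
    """
    match = re.search(r'\d+\.\d+', cell_text)
    return float(match.group()) if match else None


# Hypothetical raw cell texts, not copied from the live page.
raw_cells = ['\n9.2\n', '\n\xa0\n', '12345678910 NOT YET RELEASED Seen', '\n 8.9 \n']
ratings = [r for r in map(parse_rating, raw_cells) if r is not None]
print(ratings)
```

This returns floats rather than strings, which makes later sorting or filtering by rating simpler. A rating displayed without a decimal point would be skipped by this pattern.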


In [415]:
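import re

x = soup.find_all('td', {'class':'titleColumn'})
directors0 = [i.find('a')['title'] for i in x]
# Drop the literal "(dir.)" marker and everything after it (parentheses and dot escaped)
directors = [re.sub(r'\s*\(dir\.\).*', '', i).strip() for i in directors0]
directors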

['Frank Darabont',
 'Francis Ford Coppola',
 'Francis Ford Coppola',
 'Christopher Nolan',
 'Sidney Lumet',
 'Steven Spielberg',
 'Peter Jackson',
 'Quentin Tarantino',
 'Sergio Leone',
 'Peter Jackson',
 'David Fincher',
 'Robert Zemeckis',
 'Christopher Nolan',
 'Peter Jackson',
 'Irvin Kershner',
 'Lana Wachowski',
 'Martin Scorsese',
 'Milos Forman',
 'Akira Kurosawa',
 'David Fincher',
 'Roberto Benigni',
 'Fernando Meirelles',
 'Jonathan Demme',
 'Frank Capra',
 'Steven Spielberg',
 'George Lucas',
 'Frank Darabont',
 'Hayao Miyazaki',
 'Christopher Nolan',
 'Bong Joon Ho',
 'Luc Besson',
 'Masaki Kobayashi',
 'Bryan Singer',
 'Roger Allers',
 'Roman Polanski',
 'James Cameron',
 'Robert Zemeckis',
 'Tony Kaye',
 'Charles Chaplin',
 'Ridley Scott',
 'Alfred Hitchcock',
 'Martin Scorsese',
 'Charles Chaplin',
 'Damien Chazelle',
 'Olivier Nakache',
 'Isao Takahata',
 'Christopher Nolan',
 'Sergio Leone',
 'Michael Curtiz',
 'Giuseppe Tornatore',
 'Alfred Hitchcock',
 'Ridley Scott

In [443]:
x = soup.find_all('td', {'class':'titleColumn'})
actors0 = [i.find('a')['title'] for i in x]
actors = [' & '.join(i.split(',')[1:]) for i in actors0]
actors

[' Tim Robbins &  Morgan Freeman',
 ' Marlon Brando &  Al Pacino',
 ' Al Pacino &  Robert De Niro',
 ' Christian Bale &  Heath Ledger',
 ' Henry Fonda &  Lee J. Cobb',
 ' Liam Neeson &  Ralph Fiennes',
 ' Elijah Wood &  Viggo Mortensen',
 ' John Travolta &  Uma Thurman',
 ' Clint Eastwood &  Eli Wallach',
 ' Elijah Wood &  Ian McKellen',
 ' Brad Pitt &  Edward Norton',
 ' Tom Hanks &  Robin Wright',
 ' Leonardo DiCaprio &  Joseph Gordon-Levitt',
 ' Elijah Wood &  Ian McKellen',
 ' Mark Hamill &  Harrison Ford',
 ' Keanu Reeves &  Laurence Fishburne',
 ' Robert De Niro &  Ray Liotta',
 ' Jack Nicholson &  Louise Fletcher',
 ' Toshirô Mifune &  Takashi Shimura',
 ' Morgan Freeman &  Brad Pitt',
 ' Roberto Benigni &  Nicoletta Braschi',
 ' Alexandre Rodrigues &  Leandro Firmino',
 ' Jodie Foster &  Anthony Hopkins',
 ' James Stewart &  Donna Reed',
 ' Tom Hanks &  Matt Damon',
 ' Mark Hamill &  Harrison Ford',
 ' Tom Hanks &  Michael Clarke Duncan',
 ' Daveigh Chase &  Suzanne Pleshette',
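The director and actor cells both read the same `title` attribute, so the two values can be split in one pass. The sketch below assumes the attribute looks like `'Frank Darabont (dir.), Tim Robbins, Morgan Freeman'`, which is inferred from the outputs above rather than taken from the page itself. `split_credits` is a hypothetical helper.

```python
import re


def split_credits(title_attr):
    """Split 'Director (dir.), Star A, Star B' into (director, [stars]).

    Assumes commas only separate names; each name is stripped of spaces.
    """
    parts = [p.strip() for p in title_attr.split(',')]
    director = re.sub(r'\s*\(dir\.\)\s*$', '', parts[0])
    stars = parts[1:]
    return director, stars


# Assumed attribute format, reconstructed from the outputs in this notebook.
director, stars = split_credits('Frank Darabont (dir.), Tim Robbins, Morgan Freeman')
print(director)
print(' & '.join(stars))
```

Stripping each name before joining avoids the leading and doubled spaces seen in the actors list.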

In [444]:
pd.DataFrame({'Movie' : name, 'Year': year, 'Rating': rating, 'Directors': directors, 'Actors': actors})

Unnamed: 0,Movie,Year,Rating,Directors,Actors
0,Um Sonho de Liberdade,1994,9.2,Frank Darabont,Tim Robbins & Morgan Freeman
1,O Poderoso Chefão,1972,9.1,Francis Ford Coppola,Marlon Brando & Al Pacino
2,O Poderoso Chefão II,1974,9.0,Francis Ford Coppola,Al Pacino & Robert De Niro
3,Batman: O Cavaleiro das Trevas,2008,9.0,Christopher Nolan,Christian Bale & Heath Ledger
4,12 Homens e uma Sentença,1957,8.9,Sidney Lumet,Henry Fonda & Lee J. Cobb
...,...,...,...,...,...
245,Milagre na Cela 7,2019,8.0,Mehmet Ada Öztekin,Aras Bulut Iynemli & Nisa Sofiya Aksongur
246,Aconteceu Naquela Noite,1934,8.0,Frank Capra,Clark Gable & Claudette Colbert
247,Tangerinas,2013,8.0,Zaza Urushadze,Lembit Ulfsak & Elmo Nüganen
248,A Batalha de Argel,1966,8.0,Gillo Pontecorvo,Brahim Hadjadj & Jean Martin


#### Display the movie name, year and a brief summary of 10 random movies from IMDB's top chart as a pandas dataframe.

In [467]:
url = 'http://www.imdb.com/chart/top'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'lxml')

In [468]:
from random import shuffle
n_random = 10

x = soup.find_all('td', {'class':'titleColumn'})
shuffle(x)
name = [i.find('a').text for i in x[0:n_random]]
year = [i.find('span').text.replace('(', '').replace(')', '') for i in x[0:n_random]]
links = [i.find('a').get('href') for i in x[0:n_random]]

summary = []
for i in links:
    # Fetch each movie page; separate names keep the chart's `soup` intact
    movie_html = requests.get('https://www.imdb.com' + i).content
    movie_soup = BeautifulSoup(movie_html, "lxml")
    summary.append(movie_soup.find('div', {'class':'summary_text'}).text.strip())
    
pd.DataFrame({'Title': name, 'Release': year, 'Summary': summary})

Unnamed: 0,Title,Release,Summary
0,Garota Exemplar,2014,With his wife's disappearance having become th...
1,Os Caçadores da Arca Perdida,1981,"In 1936, archaeologist and adventurer Indiana ..."
2,Lawrence da Arábia,1962,"The story of T.E. Lawrence, the English office..."
3,Um Corpo que Cai,1958,A former police detective juggles wrestling wi...
4,A Queda! As Últimas Horas de Hitler,2004,"Traudl Junge, the final secretary for Adolf Hi..."
5,A Vida dos Outros,2006,"In 1984 East Berlin, an agent of the secret po..."
6,O Rei Leão,1994,Lion prince Simba and his father are targeted ...
7,O Iluminado,1980,A family heads to an isolated hotel for the wi...
8,A Ponte do Rio Kwai,1957,British POWs are forced to build a railway bri...
9,Ratsasan,2018,A sub-inspector sets out in pursuit of a myste...
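`shuffle` reorders the whole list in place just to take the first ten items. One alternative, sketched below, is `random.sample`, which returns `k` distinct items and leaves the original order untouched. `pick_random` and the `seed` parameter are hypothetical additions for reproducibility, and the titles are a few names taken from the outputs above.

```python
import random


def pick_random(items, k, seed=None):
    """Return k distinct items chosen at random, without modifying `items`.

    Passing a seed makes the selection repeatable between runs.
    """
    rng = random.Random(seed)
    return rng.sample(list(items), k)


titles = ['O Rei Leão', 'O Iluminado', 'Ratsasan', 'Harakiri', 'Matrix']
picked = pick_random(titles, 3, seed=0)
print(picked)
```

With the scraped cells, the same call would be `pick_random(x, n_random)`, after which name, year and link can be read from the selected cells.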


#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [488]:
# API reference: https://openweathermap.org/current
city = input('Enter the city: ')
url = 'http://api.openweathermap.org/data/2.5/weather'
# Pass the query as params so requests URL-encodes the city name
params = {'q': city, 'APPID': 'b35975e18dc93725acb092f7272cc6b8', 'units': 'metric'}
response = requests.get(url, params=params)
x = response.json()

Enter the city: Fortaleza


In [517]:
wind_speed = x['wind']['speed']
description = x['weather'][0]['description']
temp = x['main']['temp']
weather = x['weather'][0]['main']

In [519]:
pd.DataFrame({'Temperature':[temp], 'Wind Speed': [wind_speed], 'Description':[description], 'Weather': [weather]})

Unnamed: 0,Temperature,Wind Speed,Description,Weather
0,26.07,3.09,broken clouds,Clouds
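Indexing the JSON directly raises a `KeyError` when the API returns an error payload, for example for an unknown city. The sketch below shows one defensive way to read the same four fields, plus a way to build the query string with the standard library. `build_url`, `summarize` and the `YOUR_KEY` placeholder are hypothetical, and the sample payload only reuses the values shown in the output above.

```python
from urllib.parse import urlencode

BASE_URL = 'http://api.openweathermap.org/data/2.5/weather'


def build_url(city, api_key, units='metric'):
    """Build the request URL with the city name percent-encoded."""
    return BASE_URL + '?' + urlencode({'q': city, 'APPID': api_key, 'units': units})


def summarize(payload):
    """Read the four report fields, using None for anything missing."""
    weather = (payload.get('weather') or [{}])[0]
    return {
        'Temperature': payload.get('main', {}).get('temp'),
        'Wind Speed': payload.get('wind', {}).get('speed'),
        'Description': weather.get('description'),
        'Weather': weather.get('main'),
    }


# Sample payload shaped like the fields used in this notebook (values from the output above).
sample = {
    'main': {'temp': 26.07},
    'wind': {'speed': 3.09},
    'weather': [{'main': 'Clouds', 'description': 'broken clouds'}],
}
print(build_url('São Paulo', 'YOUR_KEY'))
print(summarize(sample))
```

Wrapping each value of `summarize(x)` in a list gives the same one-row shape passed to `pd.DataFrame` above.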


#### Find the book name, price and stock availability as a pandas dataframe.

In [547]:
url = 'http://books.toscrape.com/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'lxml')

# The listing page shortens long titles, so read the full title from each book's own page
x = soup.find_all('h3')
titles = []
for i in x:
    url1 = url + i.find('a').get('href')
    response1 = requests.get(url1)
    html1 = response1.content
    soup1 = BeautifulSoup(html1, 'lxml')
    for y in soup1.find_all('h1'):
        titles.append(y.text)

x2 = soup.find_all('p', attrs = {'class':'price_color'})
prices = [i.text for i in x2]

x1 = soup.find_all('p', attrs = {'class':'instock availability'})
stocks = [i.text.replace('\n', '').strip() for i in x1]

pd.DataFrame({'Titles': titles, 'Prices':prices, 'Availability':stocks})

Unnamed: 0,Titles,Prices,Availability
0,A Light in the Attic,£51.77,In stock
1,Tipping the Velvet,£53.74,In stock
2,Soumission,£50.10,In stock
3,Sharp Objects,£47.82,In stock
4,Sapiens: A Brief History of Humankind,£54.23,In stock
5,The Requiem Red,£22.65,In stock
6,The Dirty Little Secrets of Getting Your Dream...,£33.34,In stock
7,The Coming Woman: A Novel Based on the Life of...,£17.93,In stock
8,The Boys in the Boat: Nine Americans and Their...,£22.60,In stock
9,The Black Maria,£52.15,In stock
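Two small follow-ups to the book cell, offered as a sketch: `urljoin` resolves a relative `href` against the page URL without relying on string concatenation, and a regular expression turns a price string such as `'£51.77'` into a number. `price_to_float` is a hypothetical helper and the `href` value is an assumed example of a relative link.

```python
import re
from urllib.parse import urljoin


def price_to_float(text):
    """Extract the numeric part of a price string such as '£51.77'."""
    match = re.search(r'\d+(?:\.\d+)?', text)
    if match is None:
        raise ValueError('no number found in ' + repr(text))
    return float(match.group())


base = 'http://books.toscrape.com/'
href = 'catalogue/a-light-in-the-attic_1000/index.html'  # assumed relative link
print(urljoin(base, href))
print(price_to_float('£51.77'))
```

Numeric prices make it possible to sort or filter the dataframe by price, which the string column does not support directly.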
