# Web Scraping Lab

This notebook contains a set of web scraping exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit the URLs below and inspect their source code with Chrome DevTools. You will need to identify the HTML tags, class names, and other attributes used in the content you are expected to extract.
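
The first tip can be turned into a small helper. The sketch below uses the standard library's `http.HTTPStatus` to translate a numeric status code into a readable verdict. The helper name `describe_status` is an assumption made for illustration and is not part of the lab.

```python
from http import HTTPStatus


def describe_status(code):
    """Return (ok, phrase) for an HTTP status code.

    ok is True only for 2xx codes, i.e. when the body is the page you asked for.
    """
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Non-standard code that HTTPStatus does not know about
        phrase = "Unknown"
    return 200 <= code < 300, phrase


# Typical use with requests (not executed here):
#   response = requests.get(url)
#   ok, phrase = describe_status(response.status_code)
#   if not ok:
#       print(f"Request failed: {response.status_code} {phrase}")
for code in (200, 404, 429):
    print(code, describe_status(code))
```

`requests` also provides `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses if you prefer to fail fast.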

**Resources**:
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide)
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup` and `pandas` are already imported for you. Feel free to import additional libraries if you prefer.

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [3]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [4]:
# your code here
html1 = requests.get(url).content
html1[0:600]

b'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossori'

In [6]:
from bs4 import BeautifulSoup
soup1 = BeautifulSoup(html1, "lxml")
soup1

<!DOCTYPE html>
<html data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-d46e2b60992dc114d02a7edf55f254c4.css" integrity="sha512-1G4rYJktwRTQKn7fVfJUxH8RRZFUJlGo77xMZfBfIhZPx4BHVrzPE1VgnafttXI8G3y/PywH3uXyhNkSLp3+oA==" media="all" rel="stylesheet"/><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-848e5bda8a9313d9e37e362b7eecd7a8.css" integrity="sha512-hI5b2oqTE9njfjYrfuzXqA4bSGSNrE5OMc9I

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to remove extra whitespace and line breaks (i.e. `\n`) from the *text* of each HTML element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
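
Step 3 of the instructions can be sketched with built-in string methods alone. The sample strings below are illustrative stand-ins for the `.text` of scraped elements, and `clean_text` is a hypothetical helper name.

```python
def clean_text(raw):
    """Collapse every run of whitespace (spaces, tabs, line breaks) into one space."""
    # str.split() with no argument splits on any whitespace and drops empty pieces
    return " ".join(raw.split())


# Sample strings shaped like the text of scraped elements
raw_names = [
    "\n\n            nick black\n ",
    "\n\n              dankamongmen\n ",
    "\n\n            Fons van der Plas\n ",
]
clean_names = [clean_text(raw) for raw in raw_names]
print(clean_names)
```

Unlike chaining `strip()` and `replace("\n", "")`, this keeps single spaces inside a name while removing line breaks wherever they occur.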

In [57]:
# your code here
tags = ['h1','p']
text = [element.text for element in soup1.find_all(tags)]
text


['Trending',
 '\n      These are the developers building the hot tools today.\n    ',
 '\n\n            nick black\n ',
 '\n\n              dankamongmen\n ',
 '\n\n\n\n\n      notcurses\n ',
 '\n\n            Rick\n ',
 '\n\n              LinuxSuRen\n ',
 '\n\n\n\n\n      remote-jobs-in-china\n ',
 '\n\n            David Pedersen\n ',
 '\n\n              davidpdrsn\n ',
 '\n\n\n\n\n      todo-or-die\n ',
 '\n\n            Seth Vargo\n ',
 '\n\n              sethvargo\n ',
 '\n\n\n\n\n      go-password\n ',
 '\n\n            PySimpleGUI\n ',
 '\n\n              PySimpleGUI\n ',
 '\n\n\n\n\n      PySimpleGUI\n ',
 '\n\n            Anthony Fu\n ',
 '\n\n              antfu\n ',
 '\n\n\n\n\n      unconfig\n ',
 '\n\n            Adrienne Walker\n ',
 '\n\n              quisquous\n ',
 '\n\n\n\n\n      cactbot\n ',
 '\n\n            David Tolnay\n ',
 '\n\n              dtolnay\n ',
 '\n\n\n\n\n      efg\n ',
 '\n\n            Damodar Lohani\n ',
 '\n\n              lohanidamodar\n ',
 '\n\n

In [87]:
# Strip surrounding whitespace and drop any remaining line breaks
b = [t.strip().replace("\n", "") for t in text]
print(b)


['Trending', 'These are the developers building the hot tools today.', 'nick black', 'dankamongmen', 'notcurses', 'Rick', 'LinuxSuRen', 'remote-jobs-in-china', 'David Pedersen', 'davidpdrsn', 'todo-or-die', 'Seth Vargo', 'sethvargo', 'go-password', 'PySimpleGUI', 'PySimpleGUI', 'PySimpleGUI', 'Anthony Fu', 'antfu', 'unconfig', 'Adrienne Walker', 'quisquous', 'cactbot', 'David Tolnay', 'dtolnay', 'efg', 'Damodar Lohani', 'lohanidamodar', 'flutter_ui_challenges', 'Ariel Mashraki', 'a8m', 'golang-cheat-sheet', 'Fons van der Plas', 'fonsp', 'Pluto.jl', 'Philipp Oppermann', 'phil-opp', 'blog_os', 'Josh Larson', 'jplhomer', '@shopify', 'Zoltan Kochan', 'zkochan', '@teambit', 'Vitor Mattos', 'vitormattos', '@PHPRio and @LibreCodeCoop', 'Felix Angelov', 'felangel', 'bloc', 'Klaus Post', 'klauspost', 'cpuid', 'Luigi Ballabio', 'lballabio', 'QuantLib', 'Casey Rodarmor', 'casey', 'just', 'Han Xiao', 'hanxiao', 'bert-as-service', 'Shougo', 'Shougo', 'ddc.vim', 'Alex Goodman', 'wagoodman', 'dive', 

In [107]:
# Entries come in groups of three (name, handle, popular repo), starting at index 2
name1 = b[2::3]
print(name1)

['nick black', 'Rick', 'David Pedersen', 'Seth Vargo', 'PySimpleGUI', 'Anthony Fu', 'Adrienne Walker', 'David Tolnay', 'Damodar Lohani', 'Ariel Mashraki', 'Fons van der Plas', 'Philipp Oppermann', 'Josh Larson', 'Zoltan Kochan', 'Vitor Mattos', 'Felix Angelov', 'Klaus Post', 'Luigi Ballabio', 'Casey Rodarmor', 'Han Xiao', 'Shougo', 'Alex Goodman', 'Taner Şener', 'Bogdan Popa', 'Tomoki Hayashi']


In [108]:
# The handle follows each name, so start one position later
user1 = b[3::3]
print(user1)

['dankamongmen', 'LinuxSuRen', 'davidpdrsn', 'sethvargo', 'PySimpleGUI', 'antfu', 'quisquous', 'dtolnay', 'lohanidamodar', 'a8m', 'fonsp', 'phil-opp', 'jplhomer', 'zkochan', 'vitormattos', 'felangel', 'klauspost', 'lballabio', 'casey', 'hanxiao', 'Shougo', 'wagoodman', 'tanersener', 'Bogdanp', 'kan-bayashi']


In [110]:
# Pair each name with its handle
results = [f"{name} ({user})" for name, user in zip(name1, user1)]
print(results)

['nick black (dankamongmen)', 'Rick (LinuxSuRen)', 'David Pedersen (davidpdrsn)', 'Seth Vargo (sethvargo)', 'PySimpleGUI (PySimpleGUI)', 'Anthony Fu (antfu)', 'Adrienne Walker (quisquous)', 'David Tolnay (dtolnay)', 'Damodar Lohani (lohanidamodar)', 'Ariel Mashraki (a8m)', 'Fons van der Plas (fonsp)', 'Philipp Oppermann (phil-opp)', 'Josh Larson (jplhomer)', 'Zoltan Kochan (zkochan)', 'Vitor Mattos (vitormattos)', 'Felix Angelov (felangel)', 'Klaus Post (klauspost)', 'Luigi Ballabio (lballabio)', 'Casey Rodarmor (casey)', 'Han Xiao (hanxiao)', 'Shougo (Shougo)', 'Alex Goodman (wagoodman)', 'Taner Şener (tanersener)', 'Bogdan Popa (Bogdanp)', 'Tomoki Hayashi (kan-bayashi)']


#### Display the trending Python repositories in GitHub.

The steps to solve this problem are similar to the previous one, except that you need to extract the repository names instead of the developer names.

In [130]:
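# This is the url you will scrape in this exercise
# Note: this is the all-languages trending page; the Python-only page is https://github.com/trending/python
url2 = 'https://github.com/trending'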

In [131]:
# your code here
html2 = requests.get(url2).content
html2[0:600]

b'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossori'

In [132]:
from bs4 import BeautifulSoup
soup2 = BeautifulSoup(html2, "lxml")
soup2

<!DOCTYPE html>
<html data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-d46e2b60992dc114d02a7edf55f254c4.css" integrity="sha512-1G4rYJktwRTQKn7fVfJUxH8RRZFUJlGo77xMZfBfIhZPx4BHVrzPE1VgnafttXI8G3y/PywH3uXyhNkSLp3+oA==" media="all" rel="stylesheet"/><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-848e5bda8a9313d9e37e362b7eecd7a8.css" integrity="sha512-hI5b2oqTE9njfjYrfuzXqA4bSGSNrE5OMc9I

In [144]:
tags2 = ['h1']
text2 = [element.text for element in soup2.find_all(tags2)]
text2

['Trending',
 '\n\n\n\n\n\n        trekhleb /\n\n      javascript-algorithms\n ',
 '\n\n\n\n\n\n        kamranahmedse /\n\n      developer-roadmap\n ',
 '\n\n\n\n\n\n        babysor /\n\n      MockingBird\n ',
 '\n\n\n\n\n\n        yeemachine /\n\n      kalidokit\n ',
 '\n\n\n\n\n\n        commaai /\n\n      openpilot\n ',
 '\n\n\n\n\n\n        prabhatsharma /\n\n      zinc\n ',
 '\n\n\n\n\n\n        public-apis /\n\n      public-apis\n ',
 '\n\n\n\n\n\n        alibaba /\n\n      arthas\n ',
 '\n\n\n\n\n\n        glibg10b /\n\n      ltt-linux-challenge-issues\n ',
 '\n\n\n\n\n\n        EbookFoundation /\n\n      free-programming-books\n ',
 '\n\n\n\n\n\n        SummitRoute /\n\n      csp_security_mistakes\n ',
 '\n\n\n\n\n\n        jwasham /\n\n      coding-interview-university\n ',
 '\n\n\n\n\n\n        danistefanovic /\n\n      build-your-own-x\n ',
 '\n\n\n\n\n\n        donnemartin /\n\n      system-design-primer\n ',
 '\n\n\n\n\n\n        OpenIMSDK /\n\n      Open-IM-Server\n ',
 '

In [155]:
# Skip the first <h1> ('Trending'), then collapse each "owner /\n\n      repo" entry into "owner/repo"
b2 = []
for raw in text2[1:]:
    d = raw.strip().replace(" /\n\n      ", "/").replace("\n", "")
    b2.append(d)
print(b2)

['trekhleb/javascript-algorithms', 'kamranahmedse/developer-roadmap', 'babysor/MockingBird', 'yeemachine/kalidokit', 'commaai/openpilot', 'prabhatsharma/zinc', 'public-apis/public-apis', 'alibaba/arthas', 'glibg10b/ltt-linux-challenge-issues', 'EbookFoundation/free-programming-books', 'SummitRoute/csp_security_mistakes', 'jwasham/coding-interview-university', 'danistefanovic/build-your-own-x', 'donnemartin/system-design-primer', 'OpenIMSDK/Open-IM-Server', 'coder2gwy/coder2gwy', 'loveispapapa/txt_files', 'mehdihadeli/awesome-software-architecture', 'Pycord-Development/pycord', 'iron-fish/ironfish', 'edeng23/binance-trade-bot', 'kon9chunkit/GitHub-Chinese-Top-Charts', 'gofiber/fiber', 'supabase/pg_graphql', 'open-mmlab/mmtracking']
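
Matching the exact run of spaces and line breaks around the slash is fragile, because it breaks as soon as the page indentation changes. A minimal alternative sketch, assuming each heading contains one `/` between owner and repository; `repo_full_name` is a hypothetical helper name.

```python
def repo_full_name(raw):
    """Turn heading text such as 'owner / repo' (with arbitrary whitespace) into 'owner/repo'."""
    owner, slash, repo = raw.partition("/")
    if not slash:
        # No slash found: return the stripped text unchanged
        return raw.strip()
    return f"{owner.strip()}/{repo.strip()}"


raw_headings = [
    "\n\n\n\n\n\n        trekhleb /\n\n      javascript-algorithms\n ",
    "\n\n\n\n\n\n        public-apis /\n\n      public-apis\n ",
]
repos = [repo_full_name(raw) for raw in raw_headings]
print(repos)
```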


#### Display all the image links from the Walt Disney Wikipedia page.

In [157]:
# This is the url you will scrape in this exercise
url3 = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [158]:
# your code here
html3 = requests.get(url3).content
html3[0:600]

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Walt Disney - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"4cc48875-23c7-4fc0-9536-4ad47b731ee2","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"'

In [163]:
from bs4 import BeautifulSoup
soup3 = BeautifulSoup(html3, "lxml")
soup3

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Walt Disney - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"4cc48875-23c7-4fc0-9536-4ad47b731ee2","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Walt_Disney","wgTitle":"Walt Disney","wgCurRevisionId":1058747998,"wgRevisionId":1058747998,"wgArticleId":32917,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages containing links to subscription-only content","Articles with short description","Short description is different from Wikidata","Wikipedia indefinitely mov

In [234]:
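# This assumes every src is protocol-relative (//upload.wikimedia.org/...), so the scheme is prepended
listlinks = []
for img in soup3.find_all('img', src=True):  # src=True skips <img> tags without a src
    listlinks.append("https:" + img['src'])
print(listlinks)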

['https://upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png', 'https://upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/20px-Extended-protection-shackle.svg.png', 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG', 'https://upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png', 'https://upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg', 'https://upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Newman_Laugh-O-Gram_%281921%29.webm/220px-seek%3D2-Newman_Laugh-O-Gram_%281921%29.webm.jpg', 'https://upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Trolley_Troubles_poster.jpg/170px-Trolley_Troubles_poster.jpg', 'https://upload.wikimedia.org/wikipedia/en/thumb/4/4e/Steamboat-willie.jpg/170px-Steamboat-willie.jpg', 'https://uplo
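
Prepending `"https:"` only works for protocol-relative paths. A more general sketch uses `urllib.parse.urljoin`, which resolves any `src` form against the page URL. The second and third sample `src` values below are hypothetical and only illustrate the different forms.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Walt_Disney"

# Sample src values covering the three forms an <img> tag can use
srcs = [
    "//upload.wikimedia.org/wikipedia/en/thumb/4/4e/Steamboat-willie.jpg/170px-Steamboat-willie.jpg",  # protocol-relative
    "/static/images/example.png",        # root-relative (hypothetical path)
    "https://example.org/absolute.png",  # already absolute (hypothetical URL)
]

# urljoin fills in whatever part of the URL the src leaves out
absolute = [urljoin(page_url, src) for src in srcs]
for link in absolute:
    print(link)
```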

#### Retrieve the Wikipedia page for "Python" and create a list of the links on that page.

In [237]:
# This is the url you will scrape in this exercise
url4 ='https://en.wikipedia.org/wiki/Python' 

In [280]:
# your code here
html4 = requests.get(url4).content
soup4 = BeautifulSoup(html4, "lxml")
# href=True skips <a> tags without an href, which would otherwise yield None
listlinks2 = [link['href'] for link in soup4.find_all('a', href=True)]
# Keep only the links that contain an absolute https:// address
listfinal = [href for href in listlinks2 if "https://" in href]
print(listfinal)


['https://en.wiktionary.org/wiki/Python', 'https://en.wiktionary.org/wiki/python', 'https://en.wikipedia.org/w/index.php?title=Special:WhatLinksHere/Python&namespace=0', 'https://en.wikipedia.org/w/index.php?title=Python&oldid=1058841632', 'https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en', 'https://www.wikidata.org/wiki/Special:EntityPage/Q747452', 'https://commons.wikimedia.org/wiki/Category:Python', 'https://af.wikipedia.org/wiki/Python', 'https://als.wikipedia.org/wiki/Python', 'https://ar.wikipedia.org/wiki/%D8%A8%D8%A7%D9%8A%D8%AB%D9%88%D9%86_(%D8%AA%D9%88%D8%B6%D9%8A%D8%AD)', 'https://az.wikipedia.org/wiki/Python', 'https://bn.wikipedia.org/wiki/%E0%A6%AA%E0%A6%BE%E0%A6%87%E0%A6%A5%E0%A6%A8_(%E0%A6%A6%E0%A7%8D%E0%A6%AC%E0%A7%8D%E0%A6%AF%E0%A6%B0%E0%A7%8D%E0%A6%A5%E0%A6%A4%E0%A6%BE_%E0%A6%A8%E0%A6%BF%E0%A6%B0%E0%A6%B8%E0%A6%A8)', 'https://be.wikipedia.org/wiki/Python', 'https://bg.wiki
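
Filtering on the substring `https://` also keeps links that merely mention an address in their query string. A stricter sketch checks the URL scheme with `urllib.parse.urlparse`; the sample `hrefs` are illustrative and `absolute_http_links` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def absolute_http_links(hrefs):
    """Keep only absolute http(s) links, skipping None, in-page anchors and relative paths."""
    return [h for h in hrefs if h and urlparse(h).scheme in ("http", "https")]


# Sample href values of the kinds found on a typical page
hrefs = [
    None,                                   # <a> tag without an href
    "#mw-head",                             # in-page anchor
    "/wiki/Python_(programming_language)",  # relative link
    "https://en.wiktionary.org/wiki/Python",
    "mailto:someone@example.org",
]
print(absolute_http_links(hrefs))
```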

#### Find the number of titles that have changed in the United States Code since its last release point.

In [281]:
# This is the url you will scrape in this exercise
url5 = 'http://uscode.house.gov/download/download.shtml'

In [293]:
# your code here
html5 = requests.get(url5).content
soup5 = BeautifulSoup(html5, "lxml")
print(soup5)
# Titles changed since the last release point carry the class "usctitlechanged"
table6 = soup5.find_all('div', {'class': "usctitlechanged"})
count = len(table6)
count



<?xml version='1.0' encoding='UTF-8' ?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="IE=8" http-equiv="X-UA-Compatible"/>
<meta content="no-cache" http-equiv="pragma"/><!-- HTTP 1.0 -->
<meta content="no-cache,must-revalidate" http-equiv="cache-control"/><!-- HTTP 1.1 -->
<meta content="0" http-equiv="expires"/>
<link href="/javax.faces.resource/favicon.ico.xhtml?ln=images" rel="shortcut icon"/><link href="/javax.faces.resource/cssLayout.css.xhtml?ln=css" rel="stylesheet" type="text/css"/><script src="/javax.faces.resource/jsf.js.xhtml?ln=javax.faces" type="text/javascript"></script><link href="/javax.faces.resource/static.css.xhtml?ln=css" rel="stylesheet" type="text/css"/></head><body style="display:none;"><script src="/javax.faces.resource/browserPreferences.js.xhtml?ln=script

#### Create a Python list with the names of the FBI's Ten Most Wanted.

In [299]:
# This is the url you will scrape in this exercise
url6 = 'https://www.fbi.gov/wanted/topten'

In [323]:
# your code here
html6 = requests.get(url6).content
soup6 = BeautifulSoup(html6, "lxml")

# Each name sits in an <h3>; the last <h3> is skipped on the assumption that it is not a name
text3 = [element.text for element in soup6.find_all('h3')]
b3 = [t.strip().replace("\n", "") for t in text3[:-1]]
print(b3)


    


['JOSE RODOLFO VILLARREAL-HERNANDEZ', 'OCTAVIANO JUAREZ-CORRO', 'RAFAEL CARO-QUINTERO', 'YULAN ADONAY ARCHAGA CARIAS', 'EUGENE PALMER', 'BHADRESHKUMAR CHETANBHAI PATEL', 'ALEJANDRO ROSALES CASTILLO', 'ARNOLDO JIMENEZ', 'JASON DEREK BROWN', 'ALEXIS FLORES']


#### Display the 20 latest earthquakes (date, time, latitude, longitude and region name) reported by the EMSC as a pandas dataframe.

In [4]:
# This is the url you will scrape in this exercise
url7 = 'https://www.emsc-csem.org/Earthquake/'
html7 = requests.get(url7).content
soup7 = BeautifulSoup(html7, "lxml")
soup7


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://opengraphprotocol.org/schema/">
<head><meta content="srFzNKBTd0FbRhtnzP--Tjxl01NfbscjYwkp4yOWuQY" name="google-site-verification"/><meta content="BCAA3C04C41AE6E6AFAF117B9469C66F" name="msvalidate.01"/><meta content="43b36314ccb77957" name="y_key"/><!-- 5-Clk8f50tFFdPTU97Bw7ygWE1A -->
<meta content="en" http-equiv="Content-Language"/><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="all" name="robots"/>
<meta content="earthquake,earthquakes,last earthquake,earthquake today,earthquakes today,earth quake,earth quakes,real time seismicity,seismic,seismicity,seismicity map,seismology,sismologie,EMSC,CSEM,seismicity on google earth,sumatra,tsunami,tsunamis,map,maps,richter,mercalli,moment tensors,epicenter,magnitude,seismology,foreshock,aftershock,tremor" name="keywo

In [76]:
# your code here

tags4 = ['tr']
text4 = [element.text for element in soup7.find_all(tags4)]


b3=[]

for i in range(14,len(text4)-2):
    c=str(text4[i])
    
    d=c.strip().replace("(\xa0)+"," ")
    print(d)
    


earthquake2021-12-09   23:22:15.604min ago19.21 N  155.42 W  33Md2.5 ISLAND OF HAWAII, HAWAII2021-12-09 23:25
1Fearthquake2021-12-09   23:14:40.212min ago6.73 N  73.03 W  147M 4.8 NORTHERN COLOMBIA2021-12-09 23:25
earthquake2021-12-09   23:07:36.019min ago9.37 S  113.53 E  10 M4.6 SOUTH OF JAVA, INDONESIA2021-12-09 23:15
earthquake2021-12-09   23:04:23.222min ago36.59 N  7.16 W  15ML2.2 STRAIT OF GIBRALTAR2021-12-09 23:20
earthquake2021-12-09   23:00:01.027min ago0.74 N  129.27 E  10 M3.8 HALMAHERA, INDONESIA2021-12-09 23:10
earthquake2021-12-09   22:54:49.132min ago35.43 N  3.61 W  16ML2.6 STRAIT OF GIBRALTAR2021-12-09 23:17
earthquake2021-12-09   22:53:08.034min ago17.95 N  67.03 W  13Md2.0 PUERTO RICO REGION2021-12-09 23:09
earthquake2021-12-09   22:47:36.439min ago35.17 N  23.34 E  35ML3.2 CRETE, GREECE2021-12-09 22:50
earthquake2021-12-09   22:37:54.549min ago19.78 N  63.66 W  51Md4.2 NORTH OF THE VIRGIN ISLANDS2021-12-09 23:26
earthquake2021-12-09   22:32:33.354min ago18.33 N  68
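
The rows above are flattened text, so the requested fields still have to be separated. One possible sketch uses a regular expression inferred from the sample output. The pattern is an assumption (it expects exactly one fractional digit in the time and an `... ago` marker before the latitude) and will need adjusting if the page layout differs.

```python
import re

# Pattern inferred from the printed rows; not an official description of the page format
ROW = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d)"
    r".*?ago"
    r"(?P<lat>\d+\.\d+)\s+(?P<ns>[NS])\s+"
    r"(?P<lon>\d+\.\d+)\s+(?P<ew>[EW])\s+"
    r"\d+\s*[A-Za-z]+\s*\d+\.\d+\s+"  # depth, magnitude type, magnitude (not kept)
    r"(?P<region>.+?)\d{4}-\d{2}-\d{2}"
)


def parse_row(text):
    """Return a dict with date, time, latitude, longitude and region, or None if no match."""
    m = ROW.search(text)
    if m is None:
        return None
    # Southern latitudes and western longitudes become negative numbers
    lat = float(m["lat"]) * (1 if m["ns"] == "N" else -1)
    lon = float(m["lon"]) * (1 if m["ew"] == "E" else -1)
    return {"date": m["date"], "time": m["time"], "latitude": lat,
            "longitude": lon, "region": m["region"].strip()}


rows = [
    "earthquake2021-12-09   23:22:15.604min ago19.21 N  155.42 W  33Md2.5 ISLAND OF HAWAII, HAWAII2021-12-09 23:25",
    "earthquake2021-12-09   23:07:36.019min ago9.37 S  113.53 E  10 M4.6 SOUTH OF JAVA, INDONESIA2021-12-09 23:15",
]
records = [r for r in map(parse_row, rows) if r is not None]
for record in records:
    print(record)
```

Passing `records` to `pd.DataFrame(records)` would produce the dataframe the exercise asks for. Extracting the individual table cells with BeautifulSoup instead of parsing flattened text is likely to be more robust.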

#### Count the number of tweets by a given Twitter account.
Ask the user for the handle (@handle) of a Twitter account. You will need to include a ***try/except block*** for account names that are not found. 
<br>***Hint:*** the program should count the number of tweets for any provided account.

In [88]:
pip install tweepy

Note: you may need to restart the kernel to use updated packages.


In [3]:
import tweepy

In [8]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url56 = "https://twitter.com/lemondefr"

In [47]:
# your code here
# Never publish real credentials; substitute your own bearer token here
client = tweepy.Client(bearer_token='YOUR_BEARER_TOKEN')
# Replace with your own search query
query = 'from:Lemondefr'
tweets = client.search_recent_tweets(query=query, tweet_fields=['context_annotations', 'created_at'], max_results=100)

try:
    for tweet in tweets.data:
        print(tweet.text)
        if len(tweet.context_annotations) > 0:
            print(tweet.context_annotations)
except TypeError:
    # tweets.data is None when the search returns no tweets
    print("Not found")
    

Cinq syndicats français critiquent Solidarnosc pour sa complaisance à l’égard de Marine Le Pen et d’Eric Zemmour https://t.co/yFYLOb7U61
[{'domain': {'id': '10', 'name': 'Person', 'description': 'Named people in the world like Nelson Mandela'}, 'entity': {'id': '1466053355365548040', 'name': 'Éric Zemmour'}}, {'domain': {'id': '35', 'name': 'Politician', 'description': 'Politicians in the world, like Joe Biden'}, 'entity': {'id': '1466053355365548040', 'name': 'Éric Zemmour'}}, {'domain': {'id': '10', 'name': 'Person', 'description': 'Named people in the world like Nelson Mandela'}, 'entity': {'id': '822153193526169600', 'name': 'Marine Le Pen', 'description': 'Marine Le Pen'}}, {'domain': {'id': '35', 'name': 'Politician', 'description': 'Politicians in the world, like Joe Biden'}, 'entity': {'id': '822153193526169600', 'name': 'Marine Le Pen', 'description': 'Marine Le Pen'}}]
Etats généraux de la justice : les « citoyens » rejoignent les magistrats pour demander plus de moyens | par

In [None]:
# Never publish real credentials; substitute your own keys here
auth = tweepy.OAuthHandler('YOUR_CONSUMER_KEY', 'YOUR_CONSUMER_SECRET')
auth.set_access_token('YOUR_ACCESS_TOKEN', 'YOUR_ACCESS_TOKEN_SECRET')
api = tweepy.API(auth)
screen_name = "Lemondefr"
try:
    # Tweepy 4 requires a keyword argument here
    user = api.get_user(screen_name=screen_name)
    # statuses_count is the total number of tweets posted by the account
    print(user.statuses_count)
except tweepy.TweepyException:
    print("Account not found or request not authorized")
# I didn't get access

#### Count the number of followers of a given Twitter account
Ask the user for the handle (@handle) of a Twitter account. You will need to include a ***try/except block*** for account names that are not found. 
<br>***Hint:*** the program should count the followers for any provided account.

In [4]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [75]:
# your code here

# Never publish real credentials; substitute your own keys here
auth = tweepy.OAuthHandler('YOUR_CONSUMER_KEY', 'YOUR_CONSUMER_SECRET')
auth.set_access_token('YOUR_ACCESS_TOKEN', 'YOUR_ACCESS_TOKEN_SECRET')

api = tweepy.API(auth)

# the screen name of the user
user_name = 'Lemondefr'

# fetching the user (Tweepy 4 requires a keyword argument)
user = api.get_user(screen_name=user_name)

# fetching the followers_count
followers_count = user.followers_count

# I didn't get access through Tweepy


Unauthorized: 401 Unauthorized
32 - Could not authenticate you.

#### List all language names and number of related articles in the order they appear in wikipedia.org.

In [76]:
# This is the url you will scrape in this exercise
url8 = 'https://meta.wikimedia.org/wiki/List_of_Wikipedias'

In [161]:
# your code here
html8 = requests.get(url8).content
soup8 = BeautifulSoup(html8, "lxml")
tags8 = ['table', 'a']
text8 = [element.text for element in soup8.find_all(tags8)]
# Skip the first 135 entries, which precede the language rows (offset found by inspecting the output)
for raw in text8[135:]:
    d = raw.strip().replace("\n", "")
    print(d)


English
English
en
6,423,359
1,055,690,155
1,067
42,723,196
125,073
895,820
Cebuano
Sinugboanong Binisaya
ceb
6,069,148
33,565,053
6
85,101
182
0
Swedish
Svenska
sv
2,825,989
49,840,782
64
800,934
2,371
0
German
Deutsch
de
2,642,423
216,773,766
189
3,828,076
18,432
129,169
French
Français
fr
2,381,292
188,421,743
159
4,255,413
19,023
66,260
Dutch
Nederlands
nl
2,073,985
60,391,137
37
1,180,242
3,981
20
Russian
Русский
ru
1,778,416
118,344,852
77
3,085,233
11,392
234,776
Spanish
Español
es
1,737,409
139,899,189
65
6,408,902
15,542
0
Italian
Italiano
it
1,730,672
124,238,750
119
2,195,245
8,149
142,688
Polish
Polski
pl
1,500,074
65,389,991
105
1,139,273
4,273
263
Egyptian Arabic
مصرى (Maṣri)
arz
1,468,316
6,018,005
7
169,102
187
1,454
Japanese
日本語
ja
1,305,018
86,825,651
40
1,869,856
15,138
36,649
Vietnamese
Tiếng Việt
vi
1,269,478
67,522,418
21
841,222
2,617
23,646
Waray-Waray
Winaray
war
1,265,600
6,258,594
3
50,266
79
42
Chinese
中文
zh
1,245,308
68,797,657
65
3,165,763
8,135
58,127
Ara

810
74,136
3
11,691
10
0
Romani
romani - रोमानी
rmy
769
49,757
1
15,790
16
0
Tswana
Setswana
tn
722
24,637
1
8,639
6
0
Bambara
Bamanankan
bm
720
39,793
2
9,354
16
0
Tsonga
Xitsonga
ts
709
36,311
2
8,344
6
0
Venda
Tshivenda
ve
638
18,922
1
6,706
14
0
Cheyenne
Tsetsêhestâhese
chy
622
23,877
1
10,011
5
0
Kirundi
Ikirundi
rn
620
21,796
1
8,340
20
0
Tumbuka
chiTumbuka
tum
601
23,084
1
6,791
9
0
Inuktitut
ᐃᓄᒃᑎᑐᑦ
iu
590
44,150
2
16,019
16
0
Akan
Akana
ak
564
27,151
1
11,477
22
0
Swati
SiSwati
ss
543
37,684
3
7,210
9
0
Chamorro
Chamoru
ch
534
22,581
1
13,655
18
0
Pontic
Ποντιακά
pnt
485
34,946
1
8,925
8
0
Adyghe
Адыгэбзэ
ady
448
11,456
1
5,406
14
0
Inupiak
Iñupiatun
ik
403
37,063
1
7,498
9
0
Ewe
Eʋegbe
ee
384
49,104
2
12,626
9
0
Fula
Fulfulde
ff
360
22,579
0
7,344
11
0
Dinka
Thuɔŋjäŋ
din
308
7,064
1
5,319
3
0
Sango
Sängö
sg
278
20,324
2
5,647
9
0
Dzongkha
ཇོང་ཁ
dz
223
28,648
1
8,667
5
0
Tigrinya
ትግርኛ
ti
217
23,334
2
7,974
13
0
Paiwan
Paiwan
pwn
165
6,559
0
301
9
0
Cree
Nehiyaw
cr
158
36,526
2
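
The printed values form repeating groups, so they can be folded back into one record per language. The sketch below assumes nine entries per language, as in the complete groups above, and reads the fourth entry as the article count. Both are assumptions drawn from the sample output, and a truncated or irregular group would need separate handling.

```python
def chunk(items, size):
    """Split a flat list into consecutive groups of `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


FIELDS_PER_ROW = 9  # assumption: matches the complete groups in the sample output

# First two groups copied from the output above
flat = [
    "English", "English", "en", "6,423,359", "1,055,690,155", "1,067", "42,723,196", "125,073", "895,820",
    "Cebuano", "Sinugboanong Binisaya", "ceb", "6,069,148", "33,565,053", "6", "85,101", "182", "0",
]

languages = [
    {
        "language": row[0],
        "local_name": row[1],
        "code": row[2],
        "articles": int(row[3].replace(",", "")),  # "6,423,359" -> 6423359
    }
    for row in chunk(flat, FIELDS_PER_ROW)
    if len(row) == FIELDS_PER_ROW  # ignore an incomplete trailing group
]
for entry in languages:
    print(entry)
```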


#### Create a list of the different kinds of datasets available on data.gov.uk.

In [163]:
# This is the url you will scrape in this exercise
url90 = 'https://data.gov.uk/'

In [170]:
# your code here
response90 = requests.get(url90).content
soup90 = BeautifulSoup(response90, "lxml")

tags29 = ["h3"]
text29 = [element.text for element in soup90.find_all(tags29)]
text29

['Business and economy',
 'Crime and justice',
 'Defence',
 'Education',
 'Environment',
 'Government',
 'Government spending',
 'Health',
 'Mapping',
 'Society',
 'Towns and cities',
 'Transport',
 'Digital service performance',
 'Government reference data']

#### Display the top 10 languages by number of native speakers stored in a pandas dataframe.

In [172]:
# This is the url you will scrape in this exercise
url80 = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [191]:
# your code here

response80 = requests.get(url80).content
soup80 = BeautifulSoup(response80, "lxml")

# Each <tr> is one table row; rows 2 to 11 hold the ten most spoken languages
text291 = [row.text for row in soup80.find_all("tr")]
for row in text291[2:12]:
    print(row)


1

Mandarin Chinese

918

11.922%

Sino-Tibetan

Sinitic


2

Spanish

480

5.994%

Indo-European

Romance


3

English

379

4.922%

Indo-European

Germanic


4

Hindi (sanskritised Hindustani)[11]

341

4.429%

Indo-European

Indo-Aryan


5

Bengali

300

4.000%

Indo-European

Indo-Aryan


6

Portuguese

221

2.870%

Indo-European

Romance


7

Russian

154

2.000%

Indo-European

Balto-Slavic


8

Japanese

128

1.662%

Japonic

Japanese


9

Western Punjabi[12]

92.7

1.204%

Indo-European

Indo-Aryan


10

Marathi

83.1

1.079%

Indo-European

Indo-Aryan
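
To get from the printed rows to a dataframe, each row's text has to be split into cells first. A minimal sketch follows; the column names are assumptions chosen to match the six values printed per row, and `parse_language_row` is a hypothetical helper name.

```python
import re

# Assumed names for the six cells of each row
COLUMNS = ["rank", "language", "speakers_millions", "share", "family", "branch"]


def parse_language_row(text):
    """Split one row's text into cells and map them onto COLUMNS."""
    # Remove footnote markers such as [11]; str.replace cannot do this because it matches literally
    cells = [re.sub(r"\[\d+\]", "", line).strip() for line in text.splitlines()]
    # Drop the empty lines between cells
    cells = [cell for cell in cells if cell]
    return dict(zip(COLUMNS, cells))


# Row text shaped like the fourth row printed above
row_text = "\n4\n\nHindi (sanskritised Hindustani)[11]\n\n341\n\n4.429%\n\nIndo-European\n\nIndo-Aryan\n"
record = parse_language_row(row_text)
print(record)
```

A list of such records can be passed to `pd.DataFrame(...)`. Note that an empty cell would shift the remaining values one column to the left, so this only suits rows where every cell is filled.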



## Bonus
#### Scrape a certain number of tweets of a given Twitter account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [None]:
# your code here

#### Display IMDB's top 250 data (movie name, initial release, director name and stars) as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'

In [None]:
# your code here

#### Display the movie name, year and a brief summary of the top 10 random movies (IMDB) as a pandas dataframe.

In [None]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'

In [None]:
# your code here

#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [None]:
#https://openweathermap.org/current
city = input('Enter the city: ')
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'
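
City names with spaces or accents are not valid in a URL as typed. One way to handle this, sketched here with the standard library's `urllib.parse.urlencode`, is to build the query string from a dictionary. `build_weather_url` is a hypothetical helper, and `YOUR_API_KEY` is a placeholder.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.openweathermap.org/data/2.5/weather"


def build_weather_url(city, api_key, units="metric"):
    """Build the request URL, percent-encoding the city name and other parameters."""
    query = urlencode({"q": city, "APPID": api_key, "units": units})
    return f"{BASE_URL}?{query}"


print(build_weather_url("São Paulo", "YOUR_API_KEY"))
```

With `requests`, passing the same dictionary as `params=` to `requests.get` performs the same encoding.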

In [None]:
# your code here

#### Find the book name, price and stock availability as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'

In [None]:
# your code here