![logo_ironhack_blue 7](https://user-images.githubusercontent.com/23629340/40541063-a07a0a8a-601a-11e8-91b5-2f13e4e6b441.png)

# Lab | Web Scraping Multiple Pages

#### Business goal:

- Check the `case_study_gnod.md` file.
- Make sure you've understood the big picture of your project:

  - the goal of the company (`Gnod`),
  - their current product (`Gnoosic`),
  - their strategy, and
  - how your project fits into this context.

  Re-read the business case and the e-mail from the CTO, take a look at the flowchart and create an initial Trello board with the tasks you think you'll have to accomplish.

#### Instructions 

#### Prioritize the MVP

In the previous lab, you had to scrape data about "hot songs". It's critical to be on track with that part, as it was part of the request from the CTO.

If you couldn't finish the first lab, use this time to go back there.

#### Expand the project

If you're done, you can try to expand the project on your own. Here are a few suggestions:

- Find other lists of hot songs on the internet and scrape them too: having a bigger pool of songs will be awesome!
- Apply the same logic to other "groups" of songs: the best songs from a decade or from a country / culture / language / genre.
- Wikipedia maintains a large collection of lists of songs: https://en.wikipedia.org/wiki/Lists_of_songs

#### Practice web scraping

As you've seen, scraping the internet is a skill that can get you all sorts of information. Here are some little challenges that you can try to gain more experience in the field:

- Retrieve the Wikipedia page for "Python" and create a list of the links on that page: `url = 'https://en.wikipedia.org/wiki/Python'`
- Find the number of titles that have changed in the United States Code since its last release point: `url = 'http://uscode.house.gov/download/download.shtml'`
- Create a Python list with the names on the FBI's Ten Most Wanted list: `url = 'https://www.fbi.gov/wanted/topten'`
- Display information on the 20 latest earthquakes (date, time, latitude, longitude and region name) reported by the EMSC as a pandas DataFrame: `url = 'https://www.emsc-csem.org/Earthquake/'`
- List all language names and the number of related articles in the order they appear on [wikipedia.org](https://www.wikipedia.org/): `url = 'https://www.wikipedia.org/'`
- Create a list of the different kinds of datasets available on [data.gov.uk](https://data.gov.uk/): `url = 'https://data.gov.uk/'`
- Display the top 10 languages by number of native speakers, stored in a pandas DataFrame: `url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'`
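
As a starting point for the first challenge, here is a minimal sketch of link extraction. It parses a small inline HTML string instead of downloading the page, so it runs offline. The `SAMPLE_HTML` markup and the helper name `extract_links` are assumptions made for illustration, not real Wikipedia content.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Tiny stand-in for a downloaded page (hypothetical markup, not real Wikipedia HTML).
SAMPLE_HTML = """
<html><body>
  <a href="/wiki/Python_(programming_language)">Python (programming language)</a>
  <a href="https://www.python.org/">Official site</a>
  <a name="top">Anchor without href</a>
  <a href="#cite_note-1">[1]</a>
</body></html>
"""


def extract_links(html, base_url):
    """Return absolute URLs for every <a> tag with an href, skipping in-page anchors."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if href.startswith("#"):
            continue  # fragment-only links point back into the same page
        links.append(urljoin(base_url, href))
    return links


links = extract_links(SAMPLE_HTML, "https://en.wikipedia.org/wiki/Python")
print(links)
```

For the real page, the `html` argument would be the `response.content` returned by `requests.get(url)`.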

In [50]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [72]:
url = "https://www.basketball-reference.com/draft/"

response = requests.get(url)

response.status_code

200
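
When several pages are scraped in a row, checking `status_code` by eye does not scale. One possible approach, sketched below, wraps the request in a helper that raises on HTTP errors. The helper name `fetch_soup`, the `get` parameter and the 10-second timeout are assumptions for illustration. `raise_for_status()` and `timeout` are standard `requests` features.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, get=requests.get, timeout=10):
    """Download a page and return it parsed, raising if the server reports an error.

    `get` defaults to requests.get; passing another callable makes the helper
    easy to exercise without network access (an assumption of this sketch).
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx / 5xx responses
    return BeautifulSoup(response.content, "html.parser")


# Example call (performs a real request, so it is left commented out):
# soup = fetch_soup("https://www.basketball-reference.com/draft/")
```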

In [73]:
soup = BeautifulSoup(response.content, "html.parser")
soup


<!DOCTYPE html>

<html class="no-js" data-root="/home/bbr/build" data-version="klecko-" lang="en">
<head>
<meta charset="utf-8"/>
<meta content="ie=edge" http-equiv="x-ua-compatible"/>
<meta content="width=device-width, initial-scale=1.0, maximum-scale=2.0" name="viewport">
<link href="https://cdn.ssref.net/req/202309261" rel="dns-prefetch"/>
<!-- Quantcast Choice. Consent Manager Tag v2.0 (for TCF 2.0) -->
<script async="true" type="text/javascript">
    (function() {
	var host = window.location.hostname;
	var element = document.createElement('script');
	var firstScript = document.getElementsByTagName('script')[0];
	var url = 'https://cmp.quantcast.com'
	    .concat('/choice/', 'XwNYEpNeFfhfr', '/', host, 
		    '/choice.js?tag_version=V2');
	var uspTries = 0;
	var uspTriesLimit = 3;
	element.async = true;
	element.type = 'text/javascript';
	element.src = url;
	
	firstScript.parentNode.insertBefore(element, firstScript);
	
	function makeStub() {
	    var TCF_LOCATOR_NAME = '__tcfapiL

In [82]:
# Extract every <div> element to explore the page structure
title = soup.find_all('div')


In [83]:
title

[<div id="wrap">
 <div id="header" role="banner">
 <ul class="notranslate" id="subnav">
 <li><a href="https://www.sports-reference.com/?utm_source=bbr&amp;utm_medium=sr_xsite&amp;utm_campaign=2023_01_srnav"><svg height="15px" width="20px"><use xlink:href="#ic-sr-pennant"></use></svg> Sports Reference ®</a></li>
 <li><a href="https://www.baseball-reference.com/?utm_source=bbr&amp;utm_medium=sr_xsite&amp;utm_campaign=2023_01_srnav">Baseball</a></li>
 <li><a href="https://www.pro-football-reference.com/?utm_source=bbr&amp;utm_medium=sr_xsite&amp;utm_campaign=2023_01_srnav">Football</a> <a href="https://www.sports-reference.com/cfb/">(college)</a></li>
 <li class="current"><a href="https://www.basketball-reference.com/?utm_source=bbr&amp;utm_medium=sr_xsite&amp;utm_campaign=2023_01_srnav">Basketball</a> <a href="https://www.sports-reference.com/cbb/">(college)</a></li>
 <li><a href="https://www.hockey-reference.com/?utm_source=bbr&amp;utm_medium=sr_xsite&amp;utm_campaign=2023_01_srnav">Hoc

In [102]:
years = soup.find_all('th', class_='right')

In [103]:
years

[<th class="right" data-stat="draft" scope="row"><a href="/draft/NBA_2023.html">2023</a></th>,
 <th class="right" data-stat="draft" scope="row"><a href="/draft/NBA_2022.html">2022</a></th>,
 <th class="right" data-stat="draft" scope="row"><a href="/draft/NBA_2021.html">2021</a></th>,
 <th class="right" data-stat="draft" scope="row"><a href="/draft/NBA_2020.html">2020</a></th>,
 <th class="right" data-stat="draft" scope="row"><a href="/draft/NBA_2019.html">2019</a></th>,
 <th class="right" data-stat="draft" scope="row"><a href="/draft/NBA_2018.html">2018</a></th>,
 <th class="right" data-stat="draft" scope="row"><a href="/draft/NBA_2017.html">2017</a></th>,
 <th class="right" data-stat="draft" scope="row"><a href="/draft/NBA_2016.html">2016</a></th>,
 <th class="right" data-stat="draft" scope="row"><a href="/draft/NBA_2015.html">2015</a></th>,
 <th class="right" data-stat="draft" scope="row"><a href="/draft/NBA_2014.html">2014</a></th>,
 <th class="right" data-stat="draft" scope="row"><

In [104]:
# Number of draft years found
len(years)

77
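
Each `<th>` in `years` wraps a link to that year's draft page, which is what makes scraping multiple pages possible. The sketch below turns those relative links into absolute URLs. It runs on two cells copied from the output above rather than on the live page, and the helper name `draft_page_urls` is an assumption.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.basketball-reference.com/draft/"

# Two header cells copied from the output above, wrapped in a minimal table.
SAMPLE_HTML = """
<table><tbody>
<tr><th class="right" data-stat="draft" scope="row"><a href="/draft/NBA_2023.html">2023</a></th></tr>
<tr><th class="right" data-stat="draft" scope="row"><a href="/draft/NBA_2022.html">2022</a></th></tr>
</tbody></table>
"""


def draft_page_urls(soup, base_url=BASE_URL):
    """Map each draft year (as text) to the absolute URL of its detail page."""
    urls = {}
    for th in soup.find_all("th", class_="right"):
        link = th.find("a", href=True)
        if link is not None:
            urls[link.get_text(strip=True)] = urljoin(base_url, link["href"])
    return urls


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
year_urls = draft_page_urls(sample_soup)
for year, page_url in year_urls.items():
    # In a real run, each URL would be requested here, with a pause between
    # requests (for example time.sleep(3)) so the site is not flooded.
    print(year, page_url)
```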

In [112]:
# Extract the team names
teams = soup.find_all('td', class_="left", attrs={"data-stat": "team_name"})

In [113]:
teams

[<td class="left" data-stat="team_name"><a href="/teams/SAS/draft.html">San Antonio Spurs</a></td>,
 <td class="left" data-stat="team_name"><a href="/teams/ORL/draft.html">Orlando Magic</a></td>,
 <td class="left" data-stat="team_name"><a href="/teams/DET/draft.html">Detroit Pistons</a></td>,
 <td class="left" data-stat="team_name"><a href="/teams/MIN/draft.html">Minnesota Timberwolves</a></td>,
 <td class="left" data-stat="team_name"><a href="/teams/NOH/draft.html">New Orleans Pelicans</a></td>,
 <td class="left" data-stat="team_name"><a href="/teams/PHO/draft.html">Phoenix Suns</a></td>,
 <td class="left" data-stat="team_name"><a href="/teams/PHI/draft.html">Philadelphia 76ers</a></td>,
 <td class="left" data-stat="team_name"><a href="/teams/PHI/draft.html">Philadelphia 76ers</a></td>,
 <td class="left" data-stat="team_name"><a href="/teams/MIN/draft.html">Minnesota Timberwolves</a></td>,
 <td class="left" data-stat="team_name"><a href="/teams/CLE/draft.html">Cleveland Cavaliers</a><
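
The same cells can also be matched with a CSS selector, which some readers find easier to scan than the `class_` plus `attrs` combination. This is a sketch on inline HTML. The cell with `data-stat="other"` is hypothetical and only there to show that it is filtered out.

```python
from bs4 import BeautifulSoup

# Team cells copied from the output above, plus one hypothetical non-team cell.
SAMPLE_HTML = """
<table><tbody>
<tr>
  <td class="left" data-stat="team_name"><a href="/teams/SAS/draft.html">San Antonio Spurs</a></td>
  <td class="left" data-stat="other">ignored</td>
</tr>
<tr>
  <td class="left" data-stat="team_name"><a href="/teams/ORL/draft.html">Orlando Magic</a></td>
</tr>
</tbody></table>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# td.left matches the class; [data-stat="team_name"] matches the attribute value.
cells = sample_soup.select('td.left[data-stat="team_name"]')
team_names = [cell.get_text(strip=True) for cell in cells]
print(team_names)
```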

In [114]:
players = soup.find_all('a')

In [115]:
players

[<a href="https://www.sports-reference.com/?utm_source=bbr&amp;utm_medium=sr_xsite&amp;utm_campaign=2023_01_srnav"><svg height="15px" width="20px"><use xlink:href="#ic-sr-pennant"></use></svg> Sports Reference ®</a>,
 <a href="https://www.baseball-reference.com/?utm_source=bbr&amp;utm_medium=sr_xsite&amp;utm_campaign=2023_01_srnav">Baseball</a>,
 <a href="https://www.pro-football-reference.com/?utm_source=bbr&amp;utm_medium=sr_xsite&amp;utm_campaign=2023_01_srnav">Football</a>,
 <a href="https://www.sports-reference.com/cfb/">(college)</a>,
 <a href="https://www.basketball-reference.com/?utm_source=bbr&amp;utm_medium=sr_xsite&amp;utm_campaign=2023_01_srnav">Basketball</a>,
 <a href="https://www.sports-reference.com/cbb/">(college)</a>,
 <a href="https://www.hockey-reference.com/?utm_source=bbr&amp;utm_medium=sr_xsite&amp;utm_campaign=2023_01_srnav">Hockey</a>,
 <a href="https://fbref.com/en/?utm_source=bbr&amp;utm_medium=sr_xsite&amp;utm_campaign=2023_01_srnav">Soccer</a>,
 <a href="ht

In [119]:
# Keep only the players from the historical draft table
players = soup.find_all('a', href=lambda href: href and '/players/' in href)[1:78]
players

[<a href="/players/w/wembavi01.html">Victor Wembanyama</a>,
 <a href="/players/b/banchpa01.html">Paolo Banchero</a>,
 <a href="/players/c/cunnica01.html">Cade Cunningham</a>,
 <a href="/players/e/edwaran01.html">Anthony Edwards</a>,
 <a href="/players/w/willizi01.html">Zion Williamson</a>,
 <a href="/players/a/aytonde01.html">Deandre Ayton</a>,
 <a href="/players/f/fultzma01.html">Markelle Fultz</a>,
 <a href="/players/s/simmobe01.html">Ben Simmons</a>,
 <a href="/players/t/townska01.html">Karl-Anthony Towns</a>,
 <a href="/players/w/wiggian01.html">Andrew Wiggins</a>,
 <a href="/players/b/bennean01.html">Anthony Bennett</a>,
 <a href="/players/d/davisan02.html">Anthony Davis</a>,
 <a href="/players/i/irvinky01.html">Kyrie Irving</a>,
 <a href="/players/w/walljo01.html">John Wall</a>,
 <a href="/players/g/griffbl01.html">Blake Griffin</a>,
 <a href="/players/r/rosede01.html">Derrick Rose</a>,
 <a href="/players/o/odengr01.html">Greg Oden</a>,
 <a href="/players/b/bargnan01.html">Andrea

In [120]:
college = soup.find_all('a', href=lambda href: href and '/friv/draft.fcgi?college=' in href)
college

[<a href="/friv/draft.fcgi?college=duke">Duke</a>,
 <a href="/friv/draft.fcgi?college=okstate">Oklahoma State</a>,
 <a href="/friv/draft.fcgi?college=georgia">Georgia</a>,
 <a href="/friv/draft.fcgi?college=duke">Duke</a>,
 <a href="/friv/draft.fcgi?college=arizona">Arizona</a>,
 <a href="/friv/draft.fcgi?college=washington">Washington</a>,
 <a href="/friv/draft.fcgi?college=lsu">LSU</a>,
 <a href="/friv/draft.fcgi?college=kentucky">Kentucky</a>,
 <a href="/friv/draft.fcgi?college=kansas">Kansas</a>,
 <a href="/friv/draft.fcgi?college=unlv">UNLV</a>,
 <a href="/friv/draft.fcgi?college=kentucky">Kentucky</a>,
 <a href="/friv/draft.fcgi?college=duke">Duke</a>,
 <a href="/friv/draft.fcgi?college=kentucky">Kentucky</a>,
 <a href="/friv/draft.fcgi?college=oklahoma">Oklahoma</a>,
 <a href="/friv/draft.fcgi?college=memphis">Memphis</a>,
 <a href="/friv/draft.fcgi?college=ohiost">Ohio State</a>,
 <a href="/friv/draft.fcgi?college=utah">Utah</a>,
 <a href="/friv/draft.fcgi?college=cincy">Cincin

In [121]:
len(college)

71
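
Note that `len(college)` is 71 while `len(years)` is 77, and the first college in the list is Duke even though the first player is Victor Wembanyama. This suggests that rows without a college link are simply absent from the list, so pairing `college` with the other lists by position would shift the values. One way around this, sketched below under assumptions, is to parse row by row and store `None` where a link is missing. The sample rows are hypothetical (the real cells may carry other attributes), and `text_of` and `parse_rows` are illustrative names.

```python
from bs4 import BeautifulSoup

# Hypothetical rows modelled on the tags shown above; the first row has no college link.
SAMPLE_HTML = """
<table><tbody>
<tr>
  <th class="right" data-stat="draft" scope="row"><a href="/draft/NBA_2023.html">2023</a></th>
  <td class="left" data-stat="team_name"><a href="/teams/SAS/draft.html">San Antonio Spurs</a></td>
  <td><a href="/players/w/wembavi01.html">Victor Wembanyama</a></td>
  <td></td>
</tr>
<tr>
  <th class="right" data-stat="draft" scope="row"><a href="/draft/NBA_2022.html">2022</a></th>
  <td class="left" data-stat="team_name"><a href="/teams/ORL/draft.html">Orlando Magic</a></td>
  <td><a href="/players/b/banchpa01.html">Paolo Banchero</a></td>
  <td><a href="/friv/draft.fcgi?college=duke">Duke</a></td>
</tr>
</tbody></table>
"""


def text_of(tag):
    """Return the stripped text of a tag, or None when the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_rows(soup):
    """Build one dict per table row so that the columns stay aligned."""
    rows = []
    for tr in soup.find_all("tr"):
        year = tr.find("th", attrs={"data-stat": "draft"})
        if year is None:
            continue  # skip header or spacer rows without a year cell
        rows.append({
            "year": text_of(year),
            "team": text_of(tr.find("td", attrs={"data-stat": "team_name"})),
            # First matching link in the row; assumes it is the drafted player.
            "player": text_of(tr.find("a", href=lambda h: h and "/players/" in h)),
            "college": text_of(tr.find("a", href=lambda h: h and "/friv/draft.fcgi?college=" in h)),
        })
    return rows


rows = parse_rows(BeautifulSoup(SAMPLE_HTML, "html.parser"))
for row in rows:
    print(row)
```

A list of dicts like `rows` can be passed straight to `pd.DataFrame(rows)`.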

In [122]:
# Storage lists
anno = []
equipo = []
jugador = []

# Walk through the matched tags and store their text in the empty lists
for year, team, player in zip(years, teams, players):
    anno.append(year.get_text())
    equipo.append(team.get_text())
    jugador.append(player.get_text())

In [124]:
# Create a DataFrame
histo_draft = pd.DataFrame({"year": anno,
                            "team": equipo,
                            "player": jugador})

In [125]:
histo_draft

|     | year | team                    | player            |
| --- | ---- | ----------------------- | ----------------- |
| 0   | 2023 | San Antonio Spurs       | Victor Wembanyama |
| 1   | 2022 | Orlando Magic           | Paolo Banchero    |
| 2   | 2021 | Detroit Pistons         | Cade Cunningham   |
| 3   | 2020 | Minnesota Timberwolves  | Anthony Edwards   |
| 4   | 2019 | New Orleans Pelicans    | Zion Williamson   |
| ... | ...  | ...                     | ...               |
| 72  | 1951 | Baltimore Bullets       | Gene Melchiorre   |
| 73  | 1950 | Boston Celtics          | Chuck Share       |
| 74  | 1949 | Providence Steamrollers | Howie Shannon     |
| 75  | 1948 | Providence Steamrollers | Andy Tonkovich    |
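
Once the DataFrame exists, two small follow-ups may be worth considering: casting `year` from text to integers, and saving with `index=False` so the row index is not written out as an extra nameless column. The sketch below uses a two-row sample taken from the output above and writes to an in-memory buffer instead of a file. The buffer is an assumption made to keep the example self-contained.

```python
import io

import pandas as pd

# Two rows taken from the output above; get_text() returns the years as strings.
records = [
    {"year": "2023", "team": "San Antonio Spurs", "player": "Victor Wembanyama"},
    {"year": "2022", "team": "Orlando Magic", "player": "Paolo Banchero"},
]

histo_draft = pd.DataFrame(records)
histo_draft["year"] = histo_draft["year"].astype(int)  # text -> integer

# Write to an in-memory buffer; a real run would pass a file path instead.
buffer = io.StringIO()
histo_draft.to_csv(buffer, index=False)

# Reading the data back shows that only the three named columns were stored.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.columns.tolist())
```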
