# Single Page, Single Table


### We want to scrape a table that contains favorite morning drinks on this <a href="https://sandeepmj.github.io/scrape-example-page/">demo page</a>.

The webpage is ```https://sandeepmj.github.io/scrape-example-page/```

In [1]:
## import libraries
import pandas as pd  ## used to scrape tables or other tabular data
import requests  ## goes to a page and downloads all of its content

In [2]:
## URL of the website to scrape
url = "https://sandeepmj.github.io/scrape-example-page/"

In [4]:
## request the page (we will need its HTML as a string later)
response = requests.get(url)  # use the requests library's get method; it retrieves all the content at the URL

In [6]:
## response status code
## 2xx (e.g. 200) means the server was reached and returned the data.
## 3xx means the server was reached, but it redirected the request somewhere else.
## 4xx means a client-side error, e.g. 404 when the page does not exist or 403 when access is refused.
## 5xx means the server itself failed while handling the request.
response.status_code

200
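
The status code ranges described in the comments above can be sketched as a small helper. This is only an illustration: `describe_status` is a hypothetical name, not part of `requests`, and the phrases come from Python's standard `http.HTTPStatus` enum.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the broad class of an HTTP status code (hypothetical helper)."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirection"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"


# Print the standard phrase and the broad class for a few common codes.
for code in (200, 301, 404, 503):
    print(code, HTTPStatus(code).phrase, "->", describe_status(code))
```

In a real scrape, calling `response.raise_for_status()` right after `requests.get(url)` raises an exception for 4xx and 5xx responses, which avoids checking the number by hand.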

In [7]:
## what type of data is response?
## it is a requests.models.Response object
type(response)

requests.models.Response

In [9]:
## to see the content, we use the content attribute
response.content

b'<!doctype html>\n<!--\n   Basic template\n-->\n<html lang="en">\n\n<head>\n\n\t<!-- Makes the page responsive and scaled to be read easily -->\n\t<meta name="viewport" content="width=device-width, initial-scale=1">\n\n\t<!-- Links to stylesheet -->\n\t<link rel="stylesheet" type="text/css" href="style.css">\n\t<!-- Remember to update page title -->\n\t<title>Demo Webpage for Scraping</title>\n\n</head>\n\n<body>\n\t<!-- All content goes here -->\n\n<div class="container">\n<div class="headline">Demo Webpage for Scraping</div>\n<div class="text">\n\t<p>This page holds some content to demo scraping.</p>\n\n\t<ul>\n\t\t<li><a href="#bev">Morning Beverages</a></li>\n    \n\t\t<li><a href="#organized">Organized Data</a></li>\n\t\t<li><a href="#disorganized">Disorganized Data</a></li>\n    <li><a href="#spanned">Extra Spans</a></li>\n    <li><a href="#exclude">Exclude a Class in Common</a></li>\n\t\t<li><a href="#nfl_table">Tabular Data</a></li>\n    <li><a href="heaviest-animals-page1.htm

In [10]:
## what is response.content?
## it is bytes (indicated by the b before the output)
type(response.content)

bytes

In [12]:
## we can use the text attribute to get the content decoded as a string (without the b)
response.text

'<!doctype html>\n<!--\n   Basic template\n-->\n<html lang="en">\n\n<head>\n\n\t<!-- Makes the page responsive and scaled to be read easily -->\n\t<meta name="viewport" content="width=device-width, initial-scale=1">\n\n\t<!-- Links to stylesheet -->\n\t<link rel="stylesheet" type="text/css" href="style.css">\n\t<!-- Remember to update page title -->\n\t<title>Demo Webpage for Scraping</title>\n\n</head>\n\n<body>\n\t<!-- All content goes here -->\n\n<div class="container">\n<div class="headline">Demo Webpage for Scraping</div>\n<div class="text">\n\t<p>This page holds some content to demo scraping.</p>\n\n\t<ul>\n\t\t<li><a href="#bev">Morning Beverages</a></li>\n    \n\t\t<li><a href="#organized">Organized Data</a></li>\n\t\t<li><a href="#disorganized">Disorganized Data</a></li>\n    <li><a href="#spanned">Extra Spans</a></li>\n    <li><a href="#exclude">Exclude a Class in Common</a></li>\n\t\t<li><a href="#nfl_table">Tabular Data</a></li>\n    <li><a href="heaviest-animals-page1.html

In [15]:
## check what type it is (it is a str object)
## This matters later on, when we download JSON files (we use response.content),
## but when we want to pull down tables, we (very often) need response.text.
## pandas is very good at pulling down tables
type(response.text)

str
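
The difference between `response.content` (bytes) and `response.text` (str) comes down to decoding. Here is a minimal sketch with a hand-written byte string, so it runs without a network call; the sample markup is hypothetical and is not taken from the demo page.

```python
# Hypothetical raw bytes, similar in kind to what response.content holds.
raw = b"<title>Caf\xc3\xa9 Menu</title>"

# Decoding the bytes with the right encoding yields a str,
# which is the kind of object response.text gives us.
text = raw.decode("utf-8")

print(type(raw).__name__)   # bytes
print(type(text).__name__)  # str
print(text)
```

`requests` does this decoding for us: `response.text` is `response.content` decoded with the encoding stored in `response.encoding`.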

In [16]:
## use Pandas to read tables on page
pd.read_html(response.text)

[         Drink  Serving Size (oz)  Caffeine (mg)
 0    Coke Zero                 16             45
 1   Chai Latte                 16             95
 2  Caffe Latte                 16            150,
                 Player position Played Team  salary 2019
 0         Kirk Cousins              QB  MIN  $27,500,000
 1       Jameis Winston              QB  TAM  $20,922,000
 2       Marcus Mariota              QB  TEN  $20,922,000
 3           Derek Carr              QB  OAK  $19,900,000
 4           Joe Flacco              QB  DEN  $18,500,000
 ...                ...             ...  ...          ...
 1909  D'Ernest Johnson              RB  CLE     $495,000
 1910  Garrett Bradbury              OL  MIN     $495,000
 1911      Alex Redmond               G  CIN     $493,236
 1912       Holton Hill              CB  MIN     $435,882
 1913    Tyrone Swoopes              TE  SEA     $378,034
 
 [1914 rows x 4 columns]]

## What type of object does ```pd.read_html``` return?

In [17]:
## show type of object
type(pd.read_html(response.text))

list

In [18]:
len(pd.read_html(response.text))  ## there are two tables

2

In [20]:
## what type of object is in the list?
type(pd.read_html(response.text)[0])
## it will say it is a pandas core DataFrame

pandas.core.frame.DataFrame
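
Since `pd.read_html` returns a list with one DataFrame per table, here is a small offline sketch of picking a table by position or by its text. The inline HTML is a hypothetical, cut-down stand-in for the demo page (the values are copied from the outputs above). It is wrapped in `StringIO` because recent pandas versions warn when a literal HTML string is passed directly, and it assumes an HTML parser such as `lxml` is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table page, reduced from the demo page's output.
html = """
<table>
  <thead><tr><th>Drink</th><th>Caffeine (mg)</th></tr></thead>
  <tbody>
    <tr><td>Coke Zero</td><td>45</td></tr>
    <tr><td>Caffe Latte</td><td>150</td></tr>
  </tbody>
</table>
<table>
  <thead><tr><th>Player</th><th>Team</th></tr></thead>
  <tbody>
    <tr><td>Kirk Cousins</td><td>MIN</td></tr>
  </tbody>
</table>
"""

# One call returns every table on the page as a list of DataFrames.
tables = pd.read_html(StringIO(html))
print(len(tables))

# Select by position...
first = tables[0]

# ...or keep only tables whose text matches a string or regex.
drinks = pd.read_html(StringIO(html), match="Caffeine")[0]
print(drinks)
```

Storing the list once (as `tables` here) also avoids re-parsing the page every time a table is needed.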

## As a demo, let's target the first table and export it as a CSV.

In [21]:
## let's look at the first table:
pd.read_html(response.text)[0]

         Drink  Serving Size (oz)  Caffeine (mg)
0    Coke Zero                 16             45
1   Chai Latte                 16             95
2  Caffe Latte                 16            150


In [22]:
## we store it in a new variable, or dataframe:
df_drinks = pd.read_html(response.text)[0]

In [23]:
## Let's display the dataframe:
df_drinks

         Drink  Serving Size (oz)  Caffeine (mg)
0    Coke Zero                 16             45
1   Chai Latte                 16             95
2  Caffe Latte                 16            150


In [24]:
## use pandas to write to csv file
df_drinks.to_csv("drinks.csv", encoding="UTF-8", index=False)
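
To see what `index=False` changes, here is a minimal sketch that rebuilds the drinks table by hand (values copied from the output above) and writes the CSV to a string instead of a file. Calling `to_csv` without a path returns the CSV text, which makes the result easy to inspect.

```python
import pandas as pd

# Rebuilt by hand from the table shown above, so no scraping is needed here.
df_drinks = pd.DataFrame(
    {
        "Drink": ["Coke Zero", "Chai Latte", "Caffe Latte"],
        "Serving Size (oz)": [16, 16, 16],
        "Caffeine (mg)": [45, 95, 150],
    }
)

# With no path argument, to_csv returns the CSV as a string.
with_index = df_drinks.to_csv()
without_index = df_drinks.to_csv(index=False)

# Compare the header rows of the two versions.
print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```

With the default `index=True`, the header row starts with an empty column name for the row numbers; reading such a file back typically produces an extra `Unnamed: 0` column, which `index=False` avoids.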

## The old way

## The reason I didn't assign a video or reading on scraping tables

* Most tutorials on scraping tables are convoluted and inefficient.
* Many aren't using the most modern methods (probably because people keep doing what they already know how to do...)




## Grab the correct table onward:
<img src="https://github.com/sandeepmj/fall20-student-practical-python/raw/682ec7738eb7e2d8748e3566846939c5264cab44/support_files/grab-data.png">

## Export to CSV:
<img src="https://github.com/sandeepmj/fall20-student-practical-python/raw/682ec7738eb7e2d8748e3566846939c5264cab44/support_files/export-csv.png">

In [34]:
## scrape the second table
all_dfs = pd.read_html(response.text)
df_nfl = all_dfs[1]
df_nfl

                Player position Played Team  salary 2019
0         Kirk Cousins              QB  MIN  $27,500,000
1       Jameis Winston              QB  TAM  $20,922,000
2       Marcus Mariota              QB  TEN  $20,922,000
3           Derek Carr              QB  OAK  $19,900,000
4           Joe Flacco              QB  DEN  $18,500,000
...                ...             ...  ...          ...
1909  D'Ernest Johnson              RB  CLE     $495,000
1910  Garrett Bradbury              OL  MIN     $495,000
1911      Alex Redmond               G  CIN     $493,236
1912       Holton Hill              CB  MIN     $435,882


In [37]:
df_nfl.to_csv("nfl_salaries.csv", encoding="UTF-8", index=False)