# Web Scraping

## First Web Scraping

In [1]:
import requests 
urltoget = 'https://bradfordtuckfield.com/indexarchive20210903.html' 
pagecode = requests.get(urltoget) 
print(pagecode.text[0:600])

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US" lang="en-US">
  <head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">

    <title>Bradford Tuckfield</title>
    <meta name="description" content="Bradford Tuckfield" />
    <meta name="keywords" content="Bradford Tuckfield" />
    <meta name="google-site-verification" content="eNw-LEFxVf71e-ZlYnv5tGSxTZ7V32coMCV9bxS3MGY" />
<link rel="stylesheet" type="text/css" href=


In [2]:
urltoget = 'https://bradfordtuckfield.com/contactscrape.html'
pagecode = requests.get(urltoget)
mail_beggining = pagecode.text.find('Email:')
mail_beggining

511

In [3]:
print(pagecode.text[mail_beggining:mail_beggining+80])

Email:  <label class="email" href="#">demo@bradfordtuckfield.com</label>
</div>


In [4]:
print(pagecode.text[mail_beggining+38:mail_beggining+64])

demo@bradfordtuckfield.com


In [5]:
at_beginning=pagecode.text.find('@') 
print(at_beginning)

553


In [6]:
print(pagecode.text[at_beginning-4:at_beginning+22])

demo@bradfordtuckfield.com
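
The fixed offsets used above (`+38`, `+64`, `-4`, `+22`) only work for this exact page. A minimal sketch of a less brittle variant follows, assuming the address always sits between the `>` that closes the opening `<label ...>` tag and the next `</label>`. The `sample` string is copied from the output of cell 3 rather than fetched, and `extract_between` is a hypothetical helper name.

```python
def extract_between(text, start_marker, open_end, close_marker):
    """Return the text between open_end and close_marker, searching after start_marker.

    Returns None when any of the three markers is missing.
    """
    start = text.find(start_marker)
    if start == -1:
        return None
    # Skip to the end of the opening tag that follows the marker.
    content_start = text.find(open_end, start)
    if content_start == -1:
        return None
    content_start += len(open_end)
    content_end = text.find(close_marker, content_start)
    if content_end == -1:
        return None
    return text[content_start:content_end]


# Sample copied from the page fragment printed above (not fetched here).
sample = 'Email:  <label class="email" href="#">demo@bradfordtuckfield.com</label>\n</div>'
email = extract_between(sample, 'Email:', '>', '</label>')
print(email)
```

This still depends on the page's markup, so it is only a small step up from hard-coded offsets.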


## Regular expressions

In [7]:
import re 
print(re.search(r'recommend','irrelevant text I recommend irrelevant text').span())

(18, 27)


In [8]:
print(re.search('rec+om+end', 'irrelevant text I recommend irrelevant text').span())

(18, 27)


### Using metacharacters

In [9]:
re.search('10*','My bank balance is 100').span() # Using *

(19, 22)

In [10]:
print(re.search('Clarke?','Please refer questions to Mr. Clark').span()) # Using ?

(30, 35)
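
To see the quantifiers side by side, here is a small sketch that runs `*`, `+` and `?` against made-up strings (the texts in `cases` are illustrative, not taken from the notebook's pages) and records what each pattern actually matched.

```python
import re

# Each tuple: (pattern, text to search). The texts are made-up examples.
cases = [
    (r'10*', 'balance: 1'),      # * allows zero '0' characters
    (r'10*', 'balance: 1000'),   # ... or many
    (r'10+', 'balance: 1'),      # + needs at least one '0', so no match
    (r'10+', 'balance: 1000'),
    (r'Clarke?', 'Mr. Clarke'),  # ? makes the 'e' optional
    (r'Clarke?', 'Mr. Clark'),
]

results = []
for pattern, text in cases:
    match = re.search(pattern, text)
    # match.group() is the matched substring; None means no match at all.
    results.append(match.group() if match else None)
    print(f'{pattern!r:12} on {text!r:18} -> {results[-1]!r}')
```

Note that `re.search` returns `None` when nothing matches, so calling `.span()` directly on the result raises `AttributeError` in that case.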


### Escape character

In [11]:
re.search(r'99\+12=111','Example addition: 99+12=111').span() # Matching a literal + by escaping it as \+

(18, 27)

In [12]:
re.search(r'\d','The loneliest number is 1').span()

(24, 25)

In [13]:
re.search('[a-z]','My Twitter is @fake; my email is abc@def.com').span()

(1, 2)

In [14]:
re.search('[A-Z]','My Twitter is @fake; my email is abc@def.com').span()

(0, 1)

In [15]:
re.search('Manchac[ak]','Lets drive on Manchaca.').span()

(14, 22)
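
A short sketch of how character classes differ from alternation, using a made-up test string. Inside square brackets `|` is an ordinary character, so a class written with a pipe also accepts a literal `|`.

```python
import re

# Made-up text containing three spellings, the last one ending in a literal '|'.
text = 'Manchaca or Manchack or Manchac|'

# [ak] matches exactly one character: 'a' or 'k'.
class_matches = re.findall(r'Manchac[ak]', text)

# [a|k] also accepts a literal '|', because '|' is not special inside brackets.
pipe_class_matches = re.findall(r'Manchac[a|k]', text)

# Alternation belongs in a group: (?:a|k) behaves like [ak].
group_matches = re.findall(r'Manchac(?:a|k)', text)

print(class_matches)
print(pipe_class_matches)
print(group_matches)
```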

### Combined searches

In [16]:
print(re.search(r'school.*\.pdf$','schoolforgottenname.pdf').span())

(0, 23)


In [17]:
print(re.search(r'school.*\.pdf$','school.pdf').span())

(0, 10)


In [18]:
print(re.search(r'school.*\.pdf$','schoolothername.pdf').span())

(0, 19)
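
The three searches above all succeed. A sketch with a few hypothetical file names shows which strings the same pattern rejects, and how `re.search` differs from `re.match` when `school` is not at the start of the string.

```python
import re

# Hypothetical file names, not taken from any real site.
filenames = [
    'school.pdf',
    'schoolforgottenname.pdf',
    'school.pdf.bak',      # ends in .bak, so '$' rejects it
    'schoolnotes_pdf',     # no literal '.', so '\.' rejects it
    'oldschool.pdf',       # re.search finds 'school.pdf' inside it
]

pattern = re.compile(r'school.*\.pdf$')

search_hits = [name for name in filenames if pattern.search(name)]
# re.match only matches at the start of the string, so 'oldschool.pdf' drops out.
match_hits = [name for name in filenames if pattern.match(name)]

print(search_hits)
print(match_hits)
```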


### Practical example: finding an email address

In [19]:
re.search(r'[a-zA-Z]+@[a-zA-Z]+\.[a-zA-Z]+','My Twitter is @fake; my email is abc@def.com').span()

(33, 44)
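
The letters-only pattern skips addresses that contain digits, dots or subdomains. Below is a rough sketch of a broader pattern. `BROADER` is an assumption made for illustration and is not a complete email validator; the sample text is made up.

```python
import re

# Pattern from the cell above: letters only on both sides of the '@'.
SIMPLE = re.compile(r'[a-zA-Z]+@[a-zA-Z]+\.[a-zA-Z]+')

# Assumed broader pattern: also allows digits, dots, underscores, plus signs
# and hyphens in the local part, and dotted subdomains. Still not a full validator.
BROADER = re.compile(r'[A-Za-z0-9._+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+')

# Made-up text with one "difficult" address, one simple address and a Twitter-style handle.
text = 'Write to john.smith99@mail.example.org or abc@def.com; not @fake.'

simple_found = SIMPLE.findall(text)
broader_found = BROADER.findall(text)

print(simple_found)
print(broader_found)
```

Both patterns ignore `@fake`, because each requires at least one character before the `@`.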

## Turning results into tables

Get the page's HTML code:

In [20]:
urltoget = 'https://bradfordtuckfield.com/contactscrape2.html'
pagecode = requests.get(urltoget)

Extract the email addresses with a regular expression:

In [21]:
allmatches=re.finditer(r'[a-zA-Z]+@[a-zA-Z]+\.[a-zA-Z]+',pagecode.text)

Copy the email addresses into a list:

In [22]:
alladdresses = [] 
for match in allmatches: 
    alladdresses.append(match[0]) 
alladdresses

['abc@abc.com',
 'def@def.com',
 'ghi@ghi.com',
 'jkl@jkl.org',
 'mno@mno.net',
 'pqr@pqr.edu',
 'stu@stu.com']

Convert the list of email addresses into a DataFrame:

In [23]:
import pandas as pd
alladdpd=pd.DataFrame(alladdresses)
alladdpd

             0
0  abc@abc.com
1  def@def.com
2  ghi@ghi.com
3  jkl@jkl.org
4  mno@mno.net
5  pqr@pqr.edu
6  stu@stu.com


## Beautiful Soup

In [27]:
from bs4 import BeautifulSoup 
URL = 'https://bradfordtuckfield.com/indexarchive20210903.html' 
response = requests.get(URL) 
soup = BeautifulSoup(response.text, 'lxml') 
all_urls = soup.find_all('a') 
for each in all_urls: 
    print(each['href'])

https://nostarch.com/Dive-Into-Algorithms
https://www.penguinrandomhouse.com/books/645953/dive-into-algorithms-by-bradford-tuckfield/9781718500686/
https://www.amazon.com/dp/1718500688
https://www.amazon.com/Applied-Unsupervised-Learning-relationships-hierarchical/dp/1789956390/
final20190428.pdf
http://thedreamtigers.com/
https://kmbara.com
https://www.theamericanconservative.com/author/bradford-tuckfield/
http://www.the-american-interest.com/byline/bradford-tuckfield/
tai.pdf
tai2018.pdf
https://www.nationalaffairs.com/authors/detail/bradford-tuckfield
nationalaffairs.pdf
https://americanaffairsjournal.org/2017/10/the-incoherence-of-the-economists/
borges.pdf
https://quillette.com/author/bradford-tuckfield/
https://vitadabrutto.wordpress.com/2019/03/13/disuguaglianze-estetiche-ed-economia-del-sesso/
https://nas.org/blogs/dicta/avoiding_scholarships_dead_ends
http://www.newenglishreview.org/custpage.cfm/frm/167043/sec_id/167043
https://web.archive.org/web/20170705161916/http://www.tas
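
Some of the `href` values above are relative (`final20190428.pdf`, `tai.pdf`). A minimal sketch of turning them into absolute URLs with the standard library's `urljoin`, using a few values copied from the output above. Whether each resolved file is actually reachable is not checked here.

```python
from urllib.parse import urljoin

# Page the links were scraped from (same URL as in the cell above).
BASE_URL = 'https://bradfordtuckfield.com/indexarchive20210903.html'

# A few href values copied from the output above.
hrefs = [
    'https://nostarch.com/Dive-Into-Algorithms',
    'final20190428.pdf',
    'tai.pdf',
]

# urljoin leaves absolute URLs untouched and resolves relative ones
# against the directory of BASE_URL.
absolute_urls = [urljoin(BASE_URL, href) for href in hrefs]

for url in absolute_urls:
    print(url)
```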

### Parsing HTML labels

In [29]:
URL = 'https://bradfordtuckfield.com/contactscrape.html' 
response = requests.get(URL) 
soup = BeautifulSoup(response.text, 'lxml') 
email = soup.find('label',{'class':'email'}).text 
mobile = soup.find('label',{'class':'mobile'}).text 
website = soup.find('a',{'class':'website'}).text 
print("Email : {}".format(email)) 
print("Mobile : {}".format(mobile)) 
print("Website : {}".format(website))

Email : demo@bradfordtuckfield.com
Mobile : +1 879-890-9767
Website : www.bradfordtuckfield.com


### Scraping and parsing HTML tables

In [31]:
URL = 'https://bradfordtuckfield.com/user_detailsscrape.html' 
response = requests.get(URL) 
soup = BeautifulSoup(response.text, 'lxml') 
all_user_entries = soup.find_all('tr',{'class':'user-details'}) 
for each_user in all_user_entries: 
    user = each_user.find_all("td") 
    print("User Firstname : {}, Lastname : {}, Age: {}".format(user[0].text, user[1].text, user[2].text))

User Firstname : Jill, Lastname : Smith, Age: 50
User Firstname : Eve, Lastname : Jackson, Age: 44
User Firstname : John, Lastname : Jackson, Age: 24
User Firstname : Kevin, Lastname : Snow, Age: 34

