# S12 T02 - Web scraping
## Description
Learn how to perform web scraping.
___
Objectives
- Web scraping
- Document data collected through web scraping

In [1]:
import pandas as pd

import requests
from bs4 import BeautifulSoup

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver import Chrome
from webdriver_manager.chrome import ChromeDriverManager

from docx import Document
from docx.shared import Inches
from docx.enum.text import WD_LINE_SPACING
from docx import text
from docx.shared import Pt

import warnings
warnings.filterwarnings('ignore')

import scrapy
from scrapy import signals
from scrapy.exporters import CsvItemExporter
from scrapy.http import TextResponse
from scrapy import Selector
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy import Request
from scrapy.spiders import CrawlSpider, Rule, Spider
from scrapy.linkextractors import LinkExtractor
from scrapy.exceptions import CloseSpider
from scrapy.crawler import CrawlerRunner, CrawlerProcess
from fake_useragent import UserAgent

import csv
import json
import logging

## Level 1
### - Exercise 1
Perform web scraping on a page of the Madrid Stock Exchange (https://www.bolsamadrid.es) using BeautifulSoup and Selenium.

> **Web scraping** is the process of collecting information from the Internet.

> *Challenges*: the variety in how information is structured, and the durability of the site (websites change constantly).

First, the HTML of the page is retrieved:

In [20]:
#get url
URL = "https://www.bolsamadrid.es"
page = requests.get(URL)

page

<Response [200]>

### Beautiful Soup
First we use the Beautiful Soup library to parse the data. It lets us interact with the HTML in much the same way as we would inspect the page with the browser's developer tools.

In [3]:
#create Beautiful Soup object
soup = BeautifulSoup(page.content, "html.parser")
#page.content passes the raw bytes so Beautiful Soup can detect the encoding itself; "html.parser" selects Python's built-in HTML parser

In [4]:
#get the title of the webpage
title = soup.find("title").get_text().replace("\r\n","").replace("\t","")
print(title)

Bolsa de Madrid


Next, we look for the URL that contains the share information:

In [6]:
#find the link whose text is 'Acciones' (shares)
shares_tag = soup.find('a', string = 'Acciones')
print(shares_tag)

<a href="/esp/aspx/Mercados/Precios.aspx?indice=ESI100000000" target="_self">Acciones</a>


In [8]:
#get url by the tag
link_s = shares_tag.get(key='href')
print(link_s)

/esp/aspx/Mercados/Precios.aspx?indice=ESI100000000


The main page URL is then combined with the relative path of the shares page, and another Beautiful Soup object is created from the result:

In [21]:
#build the data URL; replacing 'esp' with 'ing' requests the English version of the page
url_shares = URL+link_s.replace('esp','ing')
print(url_shares)
html_shares = requests.get(url_shares)

https://www.bolsamadrid.es/ing/aspx/Mercados/Precios.aspx?indice=ESI100000000
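
String concatenation works here because `URL` has no trailing slash and the `href` starts with one. As a slightly more defensive sketch, `urllib.parse.urljoin` from the standard library resolves a relative link against a base URL regardless of how the slashes fall. The names `BASE_URL`, `relative_link` and `to_english` are illustrative and do not appear in the notebook; the `/esp/` to `/ing/` swap mirrors the language switch used above.

```python
from urllib.parse import urljoin

# Values copied from the cells above; the variable names are illustrative.
BASE_URL = "https://www.bolsamadrid.es"
relative_link = "/esp/aspx/Mercados/Precios.aspx?indice=ESI100000000"

def to_english(path):
    """Swap only the first language segment, leaving the rest of the path intact."""
    return path.replace("/esp/", "/ing/", 1)

# urljoin handles a missing or duplicated slash between the two parts.
shares_url = urljoin(BASE_URL, to_english(relative_link))
print(shares_url)
```

Restricting the replacement to the first `/esp/` segment avoids touching any later part of the path that happens to contain the letters `esp`.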


In [22]:
#create beautifulsoup element for new url
soup_s = BeautifulSoup(html_shares.content, 'html.parser')

Now we locate the element that contains the table, iterate over it to extract the data and, finally, build a dataframe from it.

In [23]:
#find the table by its id
table_tag = soup_s.find(id='ctl00_Contenido_tblAcciones')

One way to get the required information is to walk the page's element hierarchy. Having identified the table element, we iterate over its children (the rows) to collect the share data and build the table:

In [25]:
#iterate over the table to gather the data
rows = []
for child in table_tag.children: #iterate child element
    element = []
    if child != "\n":
        for i in child:
            if i != "\n":
                element.append(i.text)
        rows.append(element)
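
The loop above depends on skipping the `"\n"` text nodes that sit between tags. As an alternative sketch, the same rows can be collected by asking Beautiful Soup for the `tr` elements and their `th`/`td` cells directly. `SAMPLE_HTML` is a made-up miniature of the shares table and `table_to_rows` is a hypothetical helper, so this runs without touching the network; only the table id is taken from the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the shares table; the real page has more columns and rows.
SAMPLE_HTML = """
<table id="ctl00_Contenido_tblAcciones">
  <tr><th>Name</th><th>Last</th><th>% Dif.</th></tr>
  <tr><td>ACCIONA</td><td>169.6</td><td>-0.64</td></tr>
  <tr><td>ACERINOX</td><td>10.27</td><td>1.73</td></tr>
</table>
"""

def table_to_rows(table):
    """Return one list of cell texts per <tr>, header row first."""
    extracted = []
    for tr in table.find_all("tr"):
        cells = tr.find_all(["th", "td"])  # header and data cells alike
        extracted.append([cell.get_text(strip=True) for cell in cells])
    return extracted

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
sample_rows = table_to_rows(sample_soup.find(id="ctl00_Contenido_tblAcciones"))
print(sample_rows[0])   # header row
print(sample_rows[1:])  # data rows
```

Selecting by tag name means whitespace between elements never reaches the loop, so no `"\n"` checks are needed.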

In [28]:
#save to dataframe
df_acciones = pd.DataFrame(rows[1:], columns=rows[0])
df_acciones

| | Name | Last | % Dif. | High | Low | Volume | Turnover (€ Thousands) | Date | Time |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ACCIONA | 169.6 | -0.64 | 173.1 | 168.4 | 182238 | 31085.53 | 29/03/2022 | Close |
| 1 | ACERINOX | 10.27 | 1.73 | 10.3 | 10.1 | 1350393 | 13798.84 | 29/03/2022 | Close |
| 2 | ACS | 24.93 | 0.93 | 25.1 | 24.75 | 879936 | 21932.87 | 29/03/2022 | Close |
| 3 | AENA | 150.2 | 2.18 | 150.6 | 147.55 | 113071 | 16912.78 | 29/03/2022 | Close |
| 4 | ALMIRALL | 11.75 | -5.62 | 12.53 | 11.71 | 666050 | 7925.93 | 29/03/2022 | Close |
| 5 | AMADEUS | 60.7 | 7.28 | 61.04 | 57.02 | 1359484 | 80913.07 | 29/03/2022 | Close |
| 6 | ARCELORMIT. | 29.975 | -1.38 | 30.78 | 29.215 | 798387 | 24017.71 | 29/03/2022 | Close |
| 7 | B.SANTANDER | 3.2485 | 5.47 | 3.263 | 3.114 | 71880085 | 230269.21 | 29/03/2022 | Close |
| 8 | BA.SABADELL | 0.8016 | 7.02 | 0.804 | 0.7512 | 55519643 | 43534.96 | 29/03/2022 | Close |
| 9 | BANKINTER | 5.45 | 6.78 | 5.46 | 5.206 | 4206500 | 22548.98 | 29/03/2022 | Close |


### Selenium
Now the same scraping is done with the Selenium package.

> Selenium is a powerful tool for controlling web browsers programmatically and automating browser tasks.

*https://www.geeksforgeeks.org/selenium-basics-components-features-uses-and-limitations/*

In [55]:
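#create webdriver for chrome (Selenium 4 takes the driver path through a Service object)
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))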



Current google-chrome version is 99.0.4844
Get LATEST chromedriver version for 99.0.4844 google-chrome
There is no [win32] chromedriver for browser 99.0.4844 in cache
Trying to download new driver from https://chromedriver.storage.googleapis.com/99.0.4844.51/chromedriver_win32.zip
Driver has been saved in cache [C:\Users\Usuario\.wdm\drivers\chromedriver\win32\99.0.4844.51]


Now the URL is opened through the Selenium webdriver:

In [61]:
#get the url through the driver
driver.get(url_shares)

The shares table from before is located by its XPath:
> “XPath is a language for selecting nodes in XML documents, which can also be used with HTML.”

*https://www.browserstack.com/guide/find-element-by-xpath-in-selenium*
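
To see how XPath expressions select nodes without launching a browser, the sketch below uses the limited XPath subset supported by the standard library's `xml.etree.ElementTree`. This is a stand-in for illustration, not Selenium's XPath engine. `SAMPLE_XML` and the `shares` id are made up; the point is the contrast between a positional path from the root and a shorter path anchored on an attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page markup.
SAMPLE_XML = """
<html>
  <body>
    <table id="shares">
      <tr><th>Name</th><th>Last</th></tr>
      <tr><td>ACCIONA</td><td>169.6</td></tr>
      <tr><td>ACERINOX</td><td>10.27</td></tr>
    </table>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE_XML)

# Positional path from the root: breaks as soon as the layout changes.
rows_by_path = root.findall("./body/table/tr")

# Attribute predicate: finds the table wherever it sits in the tree.
rows_by_id = root.findall(".//table[@id='shares']/tr")

# Relative path evaluated from each row; the header row is skipped once.
xpath_data = [[td.text for td in tr.findall("td")] for tr in rows_by_id[1:]]
print(xpath_data)
```

Both expressions return the same three rows here, but only the second would survive an extra wrapper element being added around the table.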

In [62]:
#search by xpath
table_rows = driver.find_elements(By.XPATH, "/html/body/div[1]/table/tbody/tr[4]/td[2]/div[1]/form/div[6]/table/tbody/tr")

We iterate over the table again to extract the data:

In [63]:
#iterate over the rows to gather the data

new_rows = []
for row in table_rows[1:]:
    element = []
    values = row.find_elements(By.XPATH, "td")
    for value in values:
        element.append(value.text)
    new_rows.append(element)

The extracted data is converted into a dataframe:

In [64]:
#save to dataframe; table_rows[1:] already skipped the header row, so every entry of new_rows is data
df_shares = pd.DataFrame(new_rows, columns=rows[0])
df_shares

| | Name | Last | % Dif. | High | Low | Volume | Turnover (€ Thousands) | Date | Time |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ACCIONA | 169.6 | -0.64 | 173.1 | 168.4 | 182238 | 31085.53 | 29/03/2022 | Close |
| 1 | ACERINOX | 10.27 | 1.73 | 10.3 | 10.1 | 1350393 | 13798.84 | 29/03/2022 | Close |
| 2 | ACS | 24.93 | 0.93 | 25.1 | 24.75 | 879936 | 21932.87 | 29/03/2022 | Close |
| 3 | AENA | 150.2 | 2.18 | 150.6 | 147.55 | 113071 | 16912.78 | 29/03/2022 | Close |
| 4 | ALMIRALL | 11.75 | -5.62 | 12.53 | 11.71 | 666050 | 7925.93 | 29/03/2022 | Close |
| 5 | AMADEUS | 60.7 | 7.28 | 61.04 | 57.02 | 1359484 | 80913.07 | 29/03/2022 | Close |
| 6 | ARCELORMIT. | 29.975 | -1.38 | 30.78 | 29.215 | 798387 | 24017.71 | 29/03/2022 | Close |
| 7 | B.SANTANDER | 3.2485 | 5.47 | 3.263 | 3.114 | 71880085 | 230269.21 | 29/03/2022 | Close |
| 8 | BA.SABADELL | 0.8016 | 7.02 | 0.804 | 0.7512 | 55519643 | 43534.96 | 29/03/2022 | Close |
| 9 | BANKINTER | 5.45 | 6.78 | 5.46 | 5.206 | 4206500 | 22548.98 | 29/03/2022 | Close |


Additional information about the page title and the type of data in the table is extracted:

In [67]:
#get the page title and the index name from the elements with those CSS classes
TituloPag = driver.find_element(By.CLASS_NAME, "TituloPag").text
Ctr = driver.find_element(By.CLASS_NAME, "Ctr").text
TituloPag + ' ' + Ctr

'Session Prices IBEX 35®'

The data is saved to a CSV file:

In [68]:
#save data to file
filename= './{}_{}_Selenium.csv'.format(TituloPag.replace(" ",""),Ctr.replace(r"®","").replace(r" ",""))
df_shares.to_csv(filename, index=False)
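
The chained `replace` calls above remove exactly the characters seen on this page. As a rough sketch of a more general approach, a regular expression can drop anything that is not an ASCII letter or digit. `compact` and `build_filename` are hypothetical helpers, not part of the notebook, and note that this pattern would also drop accented letters.

```python
import re

def compact(text):
    """Keep only ASCII letters and digits, dropping spaces and symbols such as ®."""
    return re.sub(r"[^A-Za-z0-9]+", "", text)

def build_filename(title, index_name, suffix="Selenium"):
    """Assemble the CSV path from the scraped title and index name."""
    return "./{}_{}_{}.csv".format(compact(title), compact(index_name), suffix)

# The two strings are the values scraped in the cell above.
print(build_filename("Session Prices", "IBEX 35®"))
```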

The driver is closed and terminated:

In [69]:
#close the browser and end the driver session
driver.quit()

## Level 2
### - Exercise 2
Document your generated dataset in a Word file, including the kind of information that Kaggle dataset files provide.

In [53]:
#new document
document = Document()

# Heading
document.add_heading('IBEX 35 Shares', 0)

# Context
document.add_heading(' CONTEXT: ', level = 2)
document.add_paragraph('This document shows the shares listed on the Spanish market. The IBEX 35 is the main reference stock market index of the Spanish stock market prepared by Bolsas y Mercados Españoles.', style = 'List Bullet').runs[0].font.size =Pt(10)


# Content
document.add_heading(' CONTENT: ', level = 2)
document.add_paragraph('Name: Share\'s name', style = 'List Bullet').runs[0].font.size = Pt(10)
document.add_paragraph('Last: Latest share price update', style = 'List Bullet').runs[0].font.size = Pt(10)
document.add_paragraph('% Dif.: Percentage change in the share price', style = 'List Bullet').runs[0].font.size = Pt(10)
document.add_paragraph('High: The maximum share price of the session', style = 'List Bullet').runs[0].font.size = Pt(10)
document.add_paragraph('Low: The minimum share price of the session', style = 'List Bullet').runs[0].font.size = Pt(10)
document.add_paragraph('Volume: Number of shares traded', style = 'List Bullet').runs[0].font.size = Pt(10)
document.add_paragraph('Turnover (€ Thousands): Value traded in the session, in thousands of euros', style = 'List Bullet').runs[0].font.size = Pt(10)
document.add_paragraph('Date: Date of the session', style = 'List Bullet').runs[0].font.size = Pt(10)
document.add_paragraph('Time: Time of the last price, or "Close" for closing prices', style = 'List Bullet').runs[0].font.size = Pt(10)


# Table: header row from the dataframe columns, then one row per share
table = document.add_table(rows = 1, cols = len(df_acciones.columns), style = 'Table Grid')
hdr_cells = table.rows[0].cells
for cell, column in zip(hdr_cells, df_acciones.columns):
    cell.text = str(column)

# df_acciones already holds the scraped rows, so the table is filled from it
for record in df_acciones.itertuples(index = False):
    row_cells = table.add_row().cells
    for cell, value in zip(row_cells, record):
        cell.text = str(value)

paragraph = document.add_paragraph()
paragraph.paragraph_format.line_spacing_rule = WD_LINE_SPACING.EXACTLY

document.add_page_break()

document.save('data_shares.docx')

## Level 3
### - Exercise 3
Choose any web page you like and scrape it with the Scrapy library.
> **Scraping is a technique that we can use to sweep entire websites.** Scrapy's architecture, based on Pipelines, Schedulers, Spiders and Downloaders, gives the developer control over the whole scraping process.

*https://www.theninjacto.xyz/Web-Scraping-Using-Python-in-Jupyter-notebooks/*

A simple pipeline is created to save the scraped items to a JSON Lines file:

In [3]:
class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('quoteresult.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
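
The pipeline above writes JSON Lines: one JSON object per line, rather than a single JSON array. The sketch below reproduces that write step and reads the result back, using an in-memory `io.StringIO` buffer in place of `quoteresult.jl`. The items in `sample_items` are invented placeholders shaped like the dictionaries the spider yields.

```python
import io
import json

# Invented placeholder items with the same keys the spider yields.
sample_items = [
    {"text": "First quote", "author": "Author A", "tags": ["x", "y"]},
    {"text": "Second quote", "author": "Author B", "tags": []},
]

# Write: one JSON document per line, as process_item does.
buffer = io.StringIO()
for item in sample_items:
    buffer.write(json.dumps(dict(item)) + "\n")

# Read back: each non-empty line parses independently of the others.
buffer.seek(0)
restored = [json.loads(line) for line in buffer if line.strip()]
print(restored == sample_items)
```

Because every line is a complete document, items can be appended as they arrive and the file stays readable even if the crawl stops partway, which is why `pd.read_json(..., lines=True)` can load it directly.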

The spider class that crawls the site to obtain the data is defined:

In [4]:
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1}, # Used for pipeline 1
        'FEEDS': {'quoteresult.json': {'format': 'json'}},    # Used for pipeline 2 (FEEDS replaces the deprecated FEED_FORMAT / FEED_URI)
    }
    
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

The crawler is started:

In [26]:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

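#the Twisted reactor can only be set up once per process, so re-running this cell needs a kernel restart
process.crawl(QuotesSpider)
process.start()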

2022-04-03 13:27:47 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-04-03 13:27:47 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.2.0, Python 3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19041-SP0
2022-04-03 13:27:47 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 30,
 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}


ReactorAlreadyInstalledError: reactor already installed

Finally, the data is loaded into a dataframe:

In [9]:
quotes = pd.read_json('quoteresult.jl', lines=True)
quotes

| | text | author | tags |
|---|---|---|---|
| 0 | “The world as we have created it is a process ... | Albert Einstein | [change, deep-thoughts, thinking, world] |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | [abilities, choices] |
| 2 | “There are only two ways to live your life. On... | Albert Einstein | [inspirational, life, live, miracle, miracles] |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen | [aliteracy, books, classic, humor] |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe | [be-yourself, inspirational] |
| 5 | “Try not to become a man of success. Rather be... | Albert Einstein | [adulthood, success, value] |
| 6 | “It is better to be hated for what you are tha... | André Gide | [life, love] |
| 7 | “I have not failed. I've just found 10,000 way... | Thomas A. Edison | [edison, failure, inspirational, paraphrased] |
| 8 | “A woman is like a tea bag; you never know how... | Eleanor Roosevelt | [misattributed-eleanor-roosevelt] |
| 9 | “A day without sunshine is like, you know, nig... | Steve Martin | [humor, obvious, simile] |


Note: After running into difficulties scraping some well-known sites, I decided to use a website that is regularly used for practicing web scraping.