<div class="alert alert-block alert-info"><font size="6"><b>Sprint 12 Task 2 (S12_T02)</b></font><h6 align="right"><u>Author: Alberto Achaval</u></h6></div>

## <SPAN style=color:#1F618D>Level 1</SPAN>

### <SPAN style=color:#1F618D>Practice 1</SPAN>

<SPAN style=color:#1F618D>Perform scraping on the Madrid Stock Exchange site (https://www.bolsamadrid.es) using BeautifulSoup and Selenium.</SPAN>

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

This code issues an HTTP GET request to the given URL, retrieves the HTML that the server sends back, and stores the response in a Python object.

In [2]:
URL = "https://www.bolsamadrid.es/esp/aspx/Mercados/Precios.aspx?indice=ESI100000000"
page = requests.get(URL)

print(page.text)


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head data-idioma="esp" data-hora-act="Fri, 13 May 2022 14:50:40 GMT" data-app-path="/" data-bolsa="BMadrid" data-analytics-id="UA-35966870-2"><meta http-equiv="X-UA-Compatible" content="IE=11" /><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><meta id="ctl00_copyright" name="copyright" content="Copyright © BME 2022" /><title>
	Bolsa de Madrid - Precios de la sesión
</title><link id="ctl00_RSSLink1" rel="alternate" type="application/rss+xml" href="/esp/aspx/RSS/RSS.ashx?feed=Todo" title="Bolsa de Madrid: Todos los contenidos agregados" /><link id="ctl00_RSSLink2" rel="alternate" type="application/rss+xml" href="/esp/aspx/RSS/RSS.ashx?feed=NotasPrensa" title="Bolsa de Madrid: Notas de Prensa" /><link id="ctl00_RSSLink3" rel="alternate" type="application/rss+xml" href="/esp/aspx/RSS/RSS.ashx?feed=R
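Before parsing, it can help to confirm that the request actually succeeded. The sketch below is one possible way to wrap the GET call. The helper name `fetch_html`, the 10-second timeout, and the `get` parameter (present only so the function can be exercised without network access) are assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the body of `url` as text, raising if the server reports an error.

    `get` defaults to requests.get; it is a parameter only so that a stub
    can be passed in when no network is available (an assumption of this sketch).
    """
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

With this helper, `fetch_html(URL)` would stand in for `requests.get(URL).text`, and a blocked or failed request would surface as an exception instead of an unexpected HTML page.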

### Web scraping: BeautifulSoup

Parse the HTML with Beautiful Soup: create a BeautifulSoup object that takes page.content, the HTML retrieved above, as its input.

In [3]:
soup = BeautifulSoup(page.content, 'html.parser')

In [4]:
title = soup.find('title').get_text().replace('\r\n','').replace('\t', '')
title

'Bolsa de Madrid - Precios de la sesión'

In [5]:
results = soup.find(id = "ctl00_Contenido_tblAcciones")

Iterate through the child elements to gather the data:

In [6]:
rows = []
for child in results.children: 
    element = []
    if child != "\n":
        for i in child:
            if i != "\n":
                element.append(i.text)
        rows.append(element)
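The loop above depends on skipping the newline strings that sit between tags. An alternative sketch, assuming the table is made of `tr` rows holding `th` or `td` cells, asks Beautiful Soup for those tags directly. The inline HTML, its values, and the helper name `table_to_rows` are illustrative only; they are not taken from the Bolsa de Madrid page.

```python
from bs4 import BeautifulSoup

# Small stand-in for the real page (hypothetical sample data)
SAMPLE_HTML = """
<table id="ctl00_Contenido_tblAcciones">
  <tr><th>Nombre</th><th>Últ.</th></tr>
  <tr><td>EXAMPLE A</td><td>10,50</td></tr>
  <tr><td>EXAMPLE B</td><td>2,75</td></tr>
</table>
"""


def table_to_rows(table):
    """Return one list of cell texts per <tr>, header row first."""
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all(["th", "td"])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
sample_rows = table_to_rows(sample_soup.find(id="ctl00_Contenido_tblAcciones"))
print(sample_rows)
```

Because `find_all` only returns tags, no explicit check for `"\n"` is needed.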

Now we present the data in a Pandas DataFrame:

In [7]:
df_prices = pd.DataFrame(rows[1:], columns = rows[0])
df_prices

Unnamed: 0,Nombre,Últ.,% Dif.,Máx.,Mín.,Volumen,Efectivo (miles €),Fecha,Hora
0,ACCIONA,1784000,96,1814000,1766000,36.326,"6.491,39",13/05/2022,14:34:05
1,ACERINOX,103550,88,105450,103050,590.249,"6.152,67",13/05/2022,14:35:01
2,ACS,242300,-66,243700,240000,262.619,"6.355,68",13/05/2022,14:35:13
3,AENA,1327500,275,1332000,1300000,34.742,"4.581,47",13/05/2022,14:33:56
4,ALMIRALL,105600,183,106600,103900,175.936,"1.859,56",13/05/2022,14:31:44
5,AMADEUS,587200,184,591200,579600,91.919,"5.393,50",13/05/2022,14:35:08
6,ARCELORMIT.,261150,128,262200,257400,229.930,"5.984,62",13/05/2022,14:31:23
7,B.SANTANDER,27110,217,27210,26755,35.560.615,"95.763,57",13/05/2022,14:35:29
8,BA.SABADELL,7114,111,7178,7050,9.307.309,"6.614,18",13/05/2022,14:35:22
9,BANKINTER,53220,103,53740,52800,630.240,"3.355,19",13/05/2022,14:35:29


There we have it: we scraped the Bolsa de Madrid site and gathered this dataset.
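The scraped cells are all strings, and figures such as `6.491,39` follow the Spanish convention of a dot for thousands and a comma for decimals. A minimal sketch of converting them to floats, assuming every numeric cell follows that convention (the helper name `parse_es_number` is hypothetical):

```python
def parse_es_number(text):
    """Convert a Spanish-formatted number such as '6.491,39' to a float."""
    # Drop thousands separators first, then turn the decimal comma into a dot
    cleaned = text.strip().replace(".", "").replace(",", ".")
    return float(cleaned)


print(parse_es_number("6.491,39"))    # 6491.39
print(parse_es_number("35.560.615"))  # 35560615.0
```

On the DataFrame this could be applied column by column, for example with `df_prices['Volumen'].map(parse_es_number)`.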

### Web scraping: Selenium

In [8]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

This launches Chrome in headful mode, that is, a regular visible Chrome window controlled by your Python code.

In [9]:
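from selenium.webdriver.chrome.service import Service  # Selenium 4 takes the driver path through a Service object instead of executable_path
driver = webdriver.Chrome(service = Service('C:/Program Files (x86)/chromedriver.exe'))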

In [10]:
driver.get(URL)

Locating elements by XPATH:

In [11]:
results2_header = driver.find_elements(By.XPATH, '/html/body/div[1]/table/tbody/tr[4]/td[2]/div[1]/form/div[6]/table/tbody/tr[1]/th')
results2_data = driver.find_elements(By.XPATH, '/html/body/div[1]/table/tbody/tr[4]/td[2]/div[1]/form/div[6]/table/tbody/tr')

And now to get the text of each element into a list:

In [12]:
col_names = []
for column in results2_header:
    col_names.append(column.text)
print(col_names)

['Nombre', 'Últ.', '% Dif.', 'Máx.', 'Mín.', 'Volumen', 'Efectivo (miles €)', 'Fecha', 'Hora']


Iterate over the table rows and collect the text of each cell:

In [13]:
new_rows = []
for row in results2_data:
    element = []
    values = row.find_elements(By.XPATH, 'td')
    for value in values:
        element.append(value.text)
    new_rows.append(element)
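The header row contains `th` cells rather than `td` cells, so its entry in `new_rows` is an empty list; that is why the DataFrame is built from `new_rows[1:]`. A slightly more defensive sketch drops every empty row instead of assuming only the first one is empty. The helper name `drop_empty_rows` and the sample values are illustrative.

```python
def drop_empty_rows(rows):
    """Keep only rows that contain at least one cell."""
    return [row for row in rows if row]


example_rows = [[], ["EXAMPLE A", "10,50"], [], ["EXAMPLE B", "2,75"]]
print(drop_empty_rows(example_rows))
# [['EXAMPLE A', '10,50'], ['EXAMPLE B', '2,75']]
```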

In [14]:
df_prices2 = pd.DataFrame(new_rows[1:], columns = col_names)
df_prices2

Unnamed: 0,Nombre,Últ.,% Dif.,Máx.,Mín.,Volumen,Efectivo (miles €),Fecha,Hora
0,ACCIONA,1784000,96,1814000,1766000,36.326,"6.491,39",13/05/2022,14:34:05
1,ACERINOX,103550,88,105450,103050,590.249,"6.152,67",13/05/2022,14:35:01
2,ACS,242300,-66,243700,240000,262.619,"6.355,68",13/05/2022,14:35:13
3,AENA,1327500,275,1332000,1300000,34.742,"4.581,47",13/05/2022,14:33:56
4,ALMIRALL,105600,183,106600,103900,175.936,"1.859,56",13/05/2022,14:31:44
5,AMADEUS,587200,184,591200,579600,91.919,"5.393,50",13/05/2022,14:35:08
6,ARCELORMIT.,261150,128,262200,257400,229.930,"5.984,62",13/05/2022,14:31:23
7,B.SANTANDER,27110,217,27210,26755,35.560.615,"95.763,57",13/05/2022,14:35:29
8,BA.SABADELL,7114,111,7178,7050,9.307.309,"6.614,18",13/05/2022,14:35:22
9,BANKINTER,53220,103,53740,52800,630.240,"3.355,19",13/05/2022,14:35:29


Now we quit the driver:

In [15]:
driver.quit()
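If an exception is raised between launching Chrome and calling `driver.quit()`, the browser process stays open. One way to guard against that is a small context manager. This is a sketch: the name `managed_driver` and the `factory` parameter are assumptions and not part of Selenium.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with `factory()` and always quit it on exit."""
    driver = factory()
    try:
        yield driver
    finally:
        # Runs even if the body of the `with` block raises
        driver.quit()
```

It could then be used as `with managed_driver(webdriver.Chrome) as driver:` followed by `driver.get(URL)` and the scraping steps inside the block.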

## <SPAN style=color:#1F618D>Level 2</SPAN>

### <SPAN style=color:#1F618D>Practice 2</SPAN>

<SPAN style=color:#1F618D>Document your generated dataset in a Word file, using the information that the different Kaggle files have.</SPAN>

In [16]:
from docx import Document
from docx.shared import Pt
from docx.enum.text import WD_LINE_SPACING
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx import text

We are going to build a Word document with summary information about the dataset.

In [17]:
# New document object
document = Document()

# Principal heading 
document.add_heading('BOLSA DE MADRID: IBEX 35 Shares', 0)


# Summary
document.add_heading('ABOUT THE DATASET', 2)
para1 = document.add_paragraph(text = """This dataset contains the session prices of the IBEX 35 shares, scraped from the Bolsa de Madrid website. The IBEX 35 (IBerian IndEX) is the benchmark stock market index of the Bolsa de Madrid, Spain's principal stock exchange. Initiated in 1992, the index is administered and calculated by Sociedad de Bolsas, a subsidiary of Bolsas y Mercados Españoles (BME), the company which runs Spain's securities markets (including the Bolsa de Madrid). It is a market capitalization weighted index comprising the 35 most liquid Spanish stocks traded in the Madrid Stock Exchange General Index and is reviewed twice annually. Trading on options and futures contracts on the IBEX 35 is provided by MEFF (Mercado Español de Futuros Financieros), another subsidiary of BME.""")
para1.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.JUSTIFY


# Columns
document.add_heading("COLUMNS DESCRIPTION", 2)
document.add_paragraph('Nombre: share name', style = 'List Bullet').runs[0].font.size = Pt(10)
document.add_paragraph('Últ: last traded share price', style = 'List Bullet').runs[0].font.size = Pt(10)
document.add_paragraph('% Dif: percentage change in the share price', style = 'List Bullet').runs[0].font.size = Pt(10)
document.add_paragraph('Máx: highest share price of the session', style = 'List Bullet').runs[0].font.size = Pt(10)
document.add_paragraph('Mín: lowest share price of the session', style = 'List Bullet').runs[0].font.size = Pt(10)
document.add_paragraph('Volumen: number of shares traded', style = 'List Bullet').runs[0].font.size = Pt(10)
document.add_paragraph('Efectivo (miles €): value traded, in thousands of euros', style = 'List Bullet').runs[0].font.size = Pt(10)
document.add_paragraph('Fecha: date of the last update', style = 'List Bullet').runs[0].font.size = Pt(10)
document.add_paragraph('Hora: time of the last update', style = 'List Bullet').runs[0].font.size = Pt(10)


# Table 

table = document.add_table(df_prices.shape[0]+1, df_prices.shape[1], style = "Table Grid")  # look up the style by name; lookup by id ("TableGrid") is deprecated


for j in range(df_prices.shape[1]):
    table.cell(0,j).text = df_prices.columns[j]
    
for i in range(df_prices.shape[0]):
    for j in range(df_prices.shape[1]):
        table.cell(i+1,j).text = str(df_prices.values[i,j])

for row in table.rows:
    for cell in row.cells:
        paragraphs = cell.paragraphs
        for paragraph in paragraphs:
            for run in paragraph.runs:
                font = run.font
                font.size = Pt(6)

# Save the word doc
document.save('ibex_35_stocks.docx')



## <SPAN style=color:#1F618D>Level 3</SPAN>

### <SPAN style=color:#1F618D>Practice 3</SPAN>

<SPAN style=color:#1F618D>Choose a web page you want and do web scraping using the Scrapy library.</SPAN>

Now let's do a small example with the Scrapy library. Running a full Scrapy project inside a Jupyter Notebook is quite complicated, so I will reproduce an example found online that targets a site created for scraping practice. I made several attempts with other sites without success, as most of them block this kind of access. I tried changing the user agent and other workarounds, but they didn't work.

In [18]:
import requests
from scrapy.http import TextResponse

We send a request to the URL:

In [19]:
res = requests.get('http://quotes.toscrape.com/page/1/')
response = TextResponse(res.url, body = res.text, encoding = 'utf-8') # object that represents an HTTP response

In [20]:
quotes = []

In [21]:
for quote in response.css('div.quote'):
    quotes.append({
        'text': quote.css('span.text::text').get(),          # first match, or None
        'author': quote.css('span small::text').get(),
        'tags': quote.css('div.tags a.tag::text').getall(),  # every match, as a list
    })

In [22]:
quotes

[{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
  'author': 'Albert Einstein',
  'tags': ['change', 'deep-thoughts', 'thinking', 'world']},
 {'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
  'author': 'J.K. Rowling',
  'tags': ['abilities', 'choices']},
 {'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
  'author': 'Albert Einstein',
  'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']},
 {'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
  'author': 'Jane Austen',
  'tags': ['aliteracy', 'books', 'classic', 'humor']},
 {'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
  'author': 'Marilyn Monroe',
  'tags': ['be-

In [23]:
quotes = pd.DataFrame(quotes)
quotes

Unnamed: 0,text,author,tags
0,“The world as we have created it is a process ...,Albert Einstein,"[change, deep-thoughts, thinking, world]"
1,"“It is our choices, Harry, that show what we t...",J.K. Rowling,"[abilities, choices]"
2,“There are only two ways to live your life. On...,Albert Einstein,"[inspirational, life, live, miracle, miracles]"
3,"“The person, be it gentleman or lady, who has ...",Jane Austen,"[aliteracy, books, classic, humor]"
4,"“Imperfection is beauty, madness is genius and...",Marilyn Monroe,"[be-yourself, inspirational]"
5,“Try not to become a man of success. Rather be...,Albert Einstein,"[adulthood, success, value]"
6,“It is better to be hated for what you are tha...,André Gide,"[life, love]"
7,"“I have not failed. I've just found 10,000 way...",Thomas A. Edison,"[edison, failure, inspirational, paraphrased]"
8,“A woman is like a tea bag; you never know how...,Eleanor Roosevelt,[misattributed-eleanor-roosevelt]
9,"“A day without sunshine is like, you know, nig...",Steve Martin,"[humor, obvious, simile]"


There we have it: we scraped the quotes and displayed them in a DataFrame.