### DATA SOURCES SCRAPED WITH THE BeautifulSoup LIBRARY

1.- BeerAdvocate WEB PAGE LISTING THE BEER STYLES AND THE FAMILIES THAT GROUP THEM:

'https://www.beeradvocate.com/beer/styles/'

2.- EACH BeerAdvocate WEB PAGE WITH THE DATA OF ONE BEER STYLE AND THE LIST OF BEERS IT INCLUDES, for example:

'https://www.beeradvocate.com/beer/styles/32/'


### GOAL OF THIS NOTEBOOK:

The goal is to build a BEER STYLES DATASET whose data will be used for:

1.- The subsequent analyses.

2.- Having ONE DATA SHEET PER BEER STYLE that could be displayed in a future web recommender.

3.- Having beer style data that a future web recommender could use to:

	3.a.- Create user profiles automatically.

	3.b.- Filter or sort the list of recommended beers.


In [1]:
import os
os.chdir('/home/dsc/Python_notebooks/TFM-ajao-Beer-Recommender/Data')

print(os.getcwd())

/home/dsc/Python_notebooks/TFM-ajao-Beer-Recommender/Data


In [2]:
import requests
import bs4
import re
import numpy as np
import pandas as pd

### HELPER FUNCTIONS

Function that drops the separator tokens listed in separatorCharList from a string and rejoins the remaining words with a single blank space ' ' as separator.

In [3]:
def separator_cleaning(separatorCharList, string):
    auxString = ''
    for elem in string.split(' '):
        if elem not in separatorCharList:
            auxString += elem + ' '
    return auxString.strip()
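
As a quick check of the behaviour described above, the sketch below repeats the function and calls it on two strings that appear in this notebook. Only tokens that are exactly equal to a listed separator are dropped, so a '/' embedded inside a word is left untouched.

```python
def separator_cleaning(separatorCharList, string):
    # Same logic as the cell above: skip the tokens listed as separators
    auxString = ''
    for elem in string.split(' '):
        if elem not in separatorCharList:
            auxString += elem + ' '
    return auxString.strip()

# A standalone '/' token is removed and the words are rejoined with one space
print(separator_cleaning(['/', ''], 'English Sweet / Milk Stout'))  # English Sweet Milk Stout

# A '/' inside a word is not a token of its own, so it is kept
print(separator_cleaning(['/', ''], 'Wild/Sour Beers'))  # Wild/Sour Beers
```

This is why 'family_name' still needs a separate replacement of '/' later in the notebook.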

Function that obtains the ABV and IBU RANGES from the data page of a BEER STYLE on BeerAdvocate.

In [4]:
def getStyleBeerData(styleBeerUrl, strABVRangeHtmlMark, strIBURangeHtmlMark):
    """Return [abv_range, ibu_range] scraped from a BeerAdvocate style page."""
    htmlReq = requests.get(styleBeerUrl)
    soup = bs4.BeautifulSoup(htmlReq.text, "lxml")
    strABVRange = ''
    strIBURange = ''
    for s in soup.body.find_all('div'):
        if s.text:
            # HTML of the first nested <div> (the string 'None' when there is none)
            familyStrDataAux = str(s.div)
            # Each range is the text between '</b>' and '</span>' after its marker
            familyDivsIndex0 = familyStrDataAux.find(strABVRangeHtmlMark)
            if familyDivsIndex0 != -1:
                strABVRange = familyStrDataAux[familyDivsIndex0:].split('</b>')[1].split('</span>')[0].lstrip()
            familyDivsIndex0 = familyStrDataAux.find(strIBURangeHtmlMark)
            if familyDivsIndex0 != -1:
                strIBURange = familyStrDataAux[familyDivsIndex0:].split('</b>')[1].split('</span>')[0].lstrip()
            # Stop as soon as both ranges have been found
            if strABVRange != '' and strIBURange != '':
                return [strABVRange, strIBURange]
    # A marker was missing: return what was found (possibly empty strings)
    # rather than an implicit None
    return [strABVRange, strIBURange]
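
The extraction step can be tried without any network access. The following is a minimal sketch: `extract_range` and `sample_html` are hypothetical names, and the HTML fragment is only an assumption about the markup, shaped the way the split logic expects (marker, then a `<b>` label, then the range, then `</span>`). The real page may differ.

```python
def extract_range(html, marker):
    """Return the text between '</b>' and '</span>' that follows marker, or ''."""
    start = html.find(marker)
    if start == -1:
        return ''
    return html[start:].split('</b>')[1].split('</span>')[0].lstrip()

# Marker string used later in this notebook for the ABV range
ABV_MARK = '<span class="Tooltip" title="The percentage of alcohol by volume">'

# Hypothetical fragment (assumption, not copied from the live page)
sample_html = '<div>' + ABV_MARK + '<b>ABV:</b> 6.3-7.6%</span></div>'

print(extract_range(sample_html, ABV_MARK))          # 6.3-7.6%
print(repr(extract_range('<div></div>', ABV_MARK)))  # ''
```

Because the lookup depends on the exact `title` text of the tooltip, any change in the site's wording makes the marker search return an empty string.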

#### GUIDE TABLE OF ABV (ALCOHOL BY VOLUME) RANGES AND STRENGTH LEVELS

According to http://dev.bjcp.org/beer-styles/introduction-to-beer-styles

Low Strength = low intensity = 1: <4% ABV

Medium Strength = standard intensity = 2: 4-6% ABV

High Strength = high intensity = 3: 6-9% ABV

Strong Strength = very high intensity = 4: >9% ABV


##### Functions get_abv_strength(abv_val) and get_abv_range_numbers(str_abv_range)

For each BEER STYLE, the MEAN ABV is computed from the MINIMUM AND MAXIMUM of its RANGE, and these functions use it to assign the style its ABV STRENGTH level. A mean that falls exactly on a boundary belongs to the higher level (for example, exactly 6% maps to 3).

In [5]:
def get_abv_strength(abv_val):
    # Below 4% is low strength; this includes values under 1%
    if abv_val < 4.0:
        return 1
    elif abv_val < 6.0:
        return 2
    elif abv_val < 9.0:
        return 3
    else:
        return 4

In [6]:
print((4.2 + 8.8) / 2)
print(get_abv_strength((4.2 + 8.8) / 2))
print((2.5 + 3.6) / 2)
print(get_abv_strength((2.5 + 3.6) / 2))

6.5
3
3.05
1


In [7]:
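def get_abv_range_numbers(str_abv_range):
    # The range separator is either an en dash '–' or a hyphen '-'
    if '–' in str_abv_range:
        separatorChar = '–'
    elif '-' in str_abv_range:
        separatorChar = '-'
    else:
        raise ValueError('No range separator found in: ' + repr(str_abv_range))

    parts = str_abv_range.split(separatorChar)
    rMin = float(parts[0])
    # Drop the trailing '%' before converting the maximum
    rMax = float(parts[1].split('%')[0])
    return rMin, rMax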

In [8]:
rMin, rMax = get_abv_range_numbers('6.3-7.6%')
print(rMin)
print(rMax)

6.3
7.6


In [9]:
rMin, rMax = get_abv_range_numbers('4.2–8.8%')
print(rMin)
print(rMax)

4.2
8.8
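
The whole chain from range string to strength level can be sketched in one place. This is an illustrative variant, not the notebook's own code: `parse_range` and `abv_strength` are hypothetical names, the parser uses a regular expression instead of `split`, and every mean below 4% is treated as level 1, following the guide table.

```python
import re

def parse_range(text):
    """Split 'min-max' or 'min–max' (optionally ending in '%') into two floats."""
    pattern = r'\s*(\d+(?:\.\d+)?)\s*[-–]\s*(\d+(?:\.\d+)?)\s*%?\s*'
    match = re.fullmatch(pattern, text)
    if match is None:
        raise ValueError('Unrecognised range: ' + repr(text))
    return float(match.group(1)), float(match.group(2))

def abv_strength(abv_val):
    # Thresholds taken from the guide table; boundaries go to the higher level
    if abv_val < 4.0:
        return 1
    elif abv_val < 6.0:
        return 2
    elif abv_val < 9.0:
        return 3
    return 4

# Ranges that appear in the dataset shown later in this notebook
for abv_range in ['6.3-7.6%', '4.2–8.8%', '7.0-14.0%']:
    low, high = parse_range(abv_range)
    mean = (low + high) / 2
    print(abv_range, '-> level', abv_strength(mean))
```

A single pattern handles both separators and rejects malformed input with an explicit error, at the cost of being stricter about the accepted format.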


#### TABLE OF BITTERNESS RANGES AND STRENGTH LEVELS BY THE IBU VALUE OF A BEER

Source: https://birrapedia.com/enciclopedia-de-la-cerveza/ibu---ebu/i

GUIDE TABLE OF BITTERNESS RANGES BY THE IBU VALUE OF A BEER:

From 5 to 10 = 1: slightly bitter.

From 11 to 20 = 2: moderately bitter (ADDED)

From 21 to 35 = 3: bitter.

From 36 to 46 = 4: quite bitter.

Above 46 = 5: very bitter


##### Functions get_ibu_strength(ibu_val) and get_ibu_range_numbers(str_ibu_range)

For each BEER STYLE, the MEAN IBU (BITTERNESS) is computed from the MINIMUM AND MAXIMUM of its RANGE, and these functions use it to assign the style its IBU BITTERNESS STRENGTH level.

In [10]:
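def get_ibu_strength(ibu_val):
    # Levels follow the bitterness guide table (upper bounds are inclusive)
    if (ibu_val >= 5.0 and ibu_val <= 10.0):
        return 1
    elif (ibu_val > 10.0 and ibu_val <= 20.0):
        return 2
    elif (ibu_val > 20.0 and ibu_val <= 35.0):
        return 3
    elif (ibu_val > 35.0 and ibu_val <= 46.0):
        return 4
    else:
        # Above 46. Note that a mean below 5 (for example from the range '0-0')
        # matches no branch above either, so it also ends up here
        return 5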

In [11]:
def get_ibu_range_numbers(str_ibu_range):
    # The range separator is either an en dash '–' or a hyphen '-'
    if '–' in str_ibu_range:
        separatorChar = '–'
    elif '-' in str_ibu_range:
        separatorChar = '-'
    else:
        raise ValueError('No range separator found in: ' + repr(str_ibu_range))

    parts = str_ibu_range.split(separatorChar)
    rMin = float(parts[0])
    rMax = float(parts[1])

    return rMin, rMax

In [12]:
rMin, rMax = get_ibu_range_numbers('5-30')
print(rMin)
print(rMax)

5.0
30.0


In [13]:
rMin, rMax = get_ibu_range_numbers('60–100')
print(rMin)
print(rMax)

60.0
100.0
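
The bitterness table can also be expressed as a sorted list of upper bounds and looked up with `bisect`. This is only a sketch of an alternative, and it makes one assumption that differs from get_ibu_strength(): a mean below 5 is counted as level 1 instead of falling through to level 5. The names `IBU_UPPER_BOUNDS` and `ibu_strength_lookup` are hypothetical.

```python
from bisect import bisect_left

# Inclusive upper bound of levels 1 to 4; anything above 46 is level 5
IBU_UPPER_BOUNDS = [10.0, 20.0, 35.0, 46.0]

def ibu_strength_lookup(ibu_val):
    # Assumption: values below 5 are treated as level 1 (slightly bitter)
    return bisect_left(IBU_UPPER_BOUNDS, ibu_val) + 1

for low, high in [(5, 30), (60, 100), (0, 10), (0, 0)]:
    mean = (low + high) / 2
    print(low, high, '-> level', ibu_strength_lookup(mean))
```

`bisect_left` returns the index of the first bound that is greater than or equal to the value, which reproduces the inclusive upper limits of the table.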


#### BUILDING THE BEER STYLES DATASET

The BEER STYLES DATASET is built in these steps:

1.- Web scraping 'https://www.beeradvocate.com/beer/styles/' gives us:

	'family_name', 'style_name' and 'style_URL'

NOTE: In this step the function separator_cleaning() is applied to 'style_name' so that a single blank space ' ' is the only word separator.

2.- Web scraping the page of each beer style (of the form 'https://www.beeradvocate.com/beer/styles/' + number + '/') gives us:

	'abv_range' and 'ibu_range'

3.- For each beer style we add these computed columns:

	'abv_strength', based on the mean of the MINIMUM AND MAXIMUM ABV recorded in 'abv_range'

	'ibu_strength', based on the mean of the MINIMUM AND MAXIMUM IBU recorded in 'ibu_range'

4.- Before saving the BEER STYLES DATASET to a CSV file, the '/' separator in 'family_name' is replaced with a blank space ' '.

Normalizing the separators in 'family_name' and 'style_name' will be useful mainly for the step that follows this one: mapping the BEER STYLES from 2015 and earlier to the current 2019 ones.
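
Step 1 can be illustrated offline with the standard-library `html.parser` module, swapped in here for the string splitting on BeautifulSoup output that the notebook uses. The class `StyleLinkParser` and the `sample` fragment are hypothetical. The fragment is an assumption about the page markup, built around the `stylebreak` class and the `/beer/styles/<number>/` links that the scraping cell relies on.

```python
from html.parser import HTMLParser

STYLE_PREFIX = 'https://www.beeradvocate.com/beer/styles/'

class StyleLinkParser(HTMLParser):
    """Collect (page_number, style_name) pairs from links to /beer/styles/<number>/."""

    def __init__(self):
        super().__init__()
        self.styles = []
        self._page = None  # page number of the link currently open, if any

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get('href') or ''
        parts = href.strip('/').split('/')
        # Keep only links shaped like /beer/styles/32/
        is_style_link = (len(parts) == 3 and parts[:2] == ['beer', 'styles']
                         and parts[2].isdigit())
        if tag == 'a' and is_style_link:
            self._page = parts[2]

    def handle_data(self, data):
        # The first non-empty text inside a style link is the style name
        if self._page is not None and data.strip():
            self.styles.append((self._page, data.strip()))
            self._page = None

    def handle_endtag(self, tag):
        if tag == 'a':
            self._page = None

# Hypothetical fragment (assumption, not copied from the live page)
sample = ('<div class="stylebreak"><b>Bocks</b><br>'
          '<a href="/beer/styles/32/">German Bock</a><br>'
          '<a href="/beer/styles/35/">German Doppelbock</a></div>')

parser = StyleLinkParser()
parser.feed(sample)
parser.close()
for page, name in parser.styles:
    print(name, STYLE_PREFIX + page + '/')
```

Matching on the link structure rather than on raw substrings keeps the extraction independent of the text that surrounds each link.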


In [14]:
page1 = 'https://www.beeradvocate.com/beer/styles/'

In [15]:
req1 = requests.get(page1)

In [16]:
soup1 = bs4.BeautifulSoup(req1.text, "lxml")

In [17]:
print(soup1.prettify())

<!DOCTYPE html>
<html class="Public NoJs uix_javascriptNeedsInit LoggedOut Sidebar Responsive pageIsLtr not_hasTabLinks hasSearch is-sidebarOpen hasRightSidebar is-setWidth navStyle_0 pageStyle_0 hasFlexbox" dir="LTR" id="XenForo" lang="en-US" xmlns:fb="http://www.facebook.com/2008/fbml">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=Edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <base href="https://www.beeradvocate.com/community/"/>
  <script>
   var _b = document.getElementsByTagName('base')[0], _bH = "https://www.beeradvocate.com/community/";
			if (_b && _b.href != _bH) _b.href = _bH;
  </script>
  <title>
   Beer Styles | BeerAdvocate
  </title>
  <noscript>
   <style>
    .JsOnly, .jsOnly { display: none !important; }
   </style>
  </noscript>
  <link href="css.php?css=xenforo,form,public&amp;style=6&amp;dir=LTR&amp;d=1577471330" rel="stylesheet"/>
  <link href="css.php?css=login_bar,moderator_bar,ui

In [18]:
styleDataHtmlPrefix = 'https://www.beeradvocate.com/beer/styles/'

In [19]:
beerBeerStylesDictList = []
beerStyleId = -1
print('BEGIN\n')
for s in soup1.body.find_all('div'):
    if s.text:
        familyDivsIndex = str(s.div).find('div class="stylebreak"')
        # Found the div block that holds the beer styles
        if familyDivsIndex != -1:
            familyDivs = str(s.div)[familyDivsIndex:].split('<div class="stylebreak"')
            beerFamilyStylesDict = {}
            for familyDiv in familyDivs:
                #print('BEGIN Family ++++++++++++++++++++++++++++++++++++++++++++++')
                familyName = separator_cleaning(['/', ''], \
                             str(familyDiv).split('<b>')[1].split('</b>')[0].rstrip().lstrip())
                familyBeerStylesData = str(familyDiv).split('<b>')[1].split('</b>')[1]
                #print(familyName)                
                beerStylesDataList1 = familyBeerStylesData.split('<a href="/beer/styles/')
                # Filtering beer style data - Step 1: keep only the chunks that start with a style page number
                beerStylesDataList2 = []
                for stylesDataElement1 in beerStylesDataList1:
                    # Slicing with [:1] avoids an IndexError on an empty chunk
                    if '0' <= stylesDataElement1[:1] <= '9':
                        beerStylesDataList2.append(stylesDataElement1)
                # Filtering beer style data - Step 2
                beerStylePages = []
                beerStyleNames = []
                beerStyleURLs = []
                beerStyleABVRanges = []
                beerStyleAvgABVStrengths = []
                beerStyleIBURanges = []
                beerStyleAvgIBUStrengths = []
                totalFamilyBeerStyles = len(beerStylesDataList2)
                print(familyName, ' processing')
                for stylesDataElement2 in beerStylesDataList2:
                    # Splitting on '/"' (the end of the href) keeps names such as 'English Sweet / Milk Stout' intact
                    beerStylesDataList3 = str(stylesDataElement2).split('/"') 
                    # Filtering beer style data - Step 3
                    
                    # Beer Style Name
                    beerStyleId += 1
                    beerStyleName = separator_cleaning(['/', ''], \
                                    beerStylesDataList3[1].split('>')[1].replace('</a', '').rstrip().lstrip())                    
                    beerStyleNames.append(beerStyleName)
                    
                    # https://www.beeradvocate.com/beer/styles/173/
                    # Beer Style PageNumber
                    beerStylePage = beerStylesDataList3[0]
                    beerStylePages.append(beerStylePage)
                    beerStyleURL = styleDataHtmlPrefix + beerStylesDataList3[0] + '/'
                    beerStyleURLs.append(beerStyleURL)
                    print('URL:', beerStyleURL, 'processing', beerStyleName)
                    
                    # ABV and IBU Ranges
                    l = getStyleBeerData(beerStyleURL, \
                        '<span class="Tooltip" title="The percentage of alcohol by volume">', \
                        '<span class="Tooltip" title="International Bitterness Units is a measurement the beer\'s bitterness from hops">')
                    
                    beerStyleABVRange = l[0]
                    beerStyleABVRanges.append(beerStyleABVRange)
                    minABV, maxABV = get_abv_range_numbers(beerStyleABVRange)
                    beerStyleAvgABVStrength = get_abv_strength((minABV + maxABV) / 2)
                    beerStyleAvgABVStrengths.append(beerStyleAvgABVStrength)
                    
                    beerStyleIBURange = l[1]
                    beerStyleIBURanges.append(beerStyleIBURange)
                    minIBU, maxIBU = get_ibu_range_numbers(beerStyleIBURange)
                    beerStyleAvgIBUStrength = get_ibu_strength((minIBU + maxIBU) / 2)
                    beerStyleAvgIBUStrengths.append(beerStyleAvgIBUStrength)
                    beerStyleDict = { \
                                      'family_name': familyName, \
                                      'style_name': beerStyleName, \
                                      'style_id': beerStyleId,
                                      'style_URL': beerStyleURL, \
                                      'abv_range': beerStyleABVRange, \
                                      'abv_strength': beerStyleAvgABVStrength, \
                                      'ibu_range': beerStyleIBURange, \
                                      'ibu_strength': beerStyleAvgIBUStrength \
                                    }
                    beerBeerStylesDictList.append(beerStyleDict)
                    
            break

print('END\n')

#beerBeerStylesDictList

BEGIN

Bocks  processing
URL: https://www.beeradvocate.com/beer/styles/32/ processing German Bock
URL: https://www.beeradvocate.com/beer/styles/35/ processing German Doppelbock
URL: https://www.beeradvocate.com/beer/styles/36/ processing German Eisbock
URL: https://www.beeradvocate.com/beer/styles/33/ processing German Maibock
URL: https://www.beeradvocate.com/beer/styles/92/ processing German Weizenbock
Brown Ales  processing
URL: https://www.beeradvocate.com/beer/styles/73/ processing American Brown Ale
URL: https://www.beeradvocate.com/beer/styles/74/ processing English Brown Ale
URL: https://www.beeradvocate.com/beer/styles/75/ processing English Dark Mild Ale
URL: https://www.beeradvocate.com/beer/styles/86/ processing German Altbier
Dark Ales  processing
URL: https://www.beeradvocate.com/beer/styles/175/ processing American Black Ale
URL: https://www.beeradvocate.com/beer/styles/119/ processing Belgian Dark Ale
URL: https://www.beeradvocate.com/beer/styles/57/ processing Belgian 

URL: https://www.beeradvocate.com/beer/styles/87/ processing Berliner Weisse
URL: https://www.beeradvocate.com/beer/styles/91/ processing German Dunkelweizen
URL: https://www.beeradvocate.com/beer/styles/89/ processing German Hefeweizen
URL: https://www.beeradvocate.com/beer/styles/90/ processing German Kristalweizen
Wild/Sour Beers  processing
URL: https://www.beeradvocate.com/beer/styles/198/ processing American Brett
URL: https://www.beeradvocate.com/beer/styles/171/ processing American Wild Ale
URL: https://www.beeradvocate.com/beer/styles/15/ processing Belgian Faro
URL: https://www.beeradvocate.com/beer/styles/10/ processing Belgian Fruit Lambic
URL: https://www.beeradvocate.com/beer/styles/14/ processing Belgian Gueuze
URL: https://www.beeradvocate.com/beer/styles/50/ processing Belgian Lambic
URL: https://www.beeradvocate.com/beer/styles/52/ processing Flanders Oud Bruin
URL: https://www.beeradvocate.com/beer/styles/53/ processing Flanders Red Ale
URL: https://www.beeradvocate.

### SAVE THE BEER STYLES DATASET TO A ".csv" FILE

In [20]:
# Convert the list of beer style dictionaries into a DataFrame
df = pd.DataFrame(beerBeerStylesDictList)
df

Unnamed: 0,abv_range,abv_strength,family_name,ibu_range,ibu_strength,style_URL,style_id,style_name
0,6.3-7.6%,3,Bocks,20-30,3,https://www.beeradvocate.com/beer/styles/32/,0,German Bock
1,6.6-7.9%,3,Bocks,17-27,3,https://www.beeradvocate.com/beer/styles/35/,1,German Doppelbock
2,7.0-14.0%,4,Bocks,25-35,3,https://www.beeradvocate.com/beer/styles/36/,2,German Eisbock
3,6.3-8.1%,3,Bocks,20-38,3,https://www.beeradvocate.com/beer/styles/33/,3,German Maibock
4,7.0-9.5%,3,Bocks,15-35,3,https://www.beeradvocate.com/beer/styles/92/,4,German Weizenbock
5,4.2–8.8%,3,Brown Ales,25–45,3,https://www.beeradvocate.com/beer/styles/73/,5,American Brown Ale
6,4.2–7.0%,2,Brown Ales,15–25,2,https://www.beeradvocate.com/beer/styles/74/,6,English Brown Ale
7,3.0–6.0%,2,Brown Ales,20–30,3,https://www.beeradvocate.com/beer/styles/75/,7,English Dark Mild Ale
8,4.0–7.0%,2,Brown Ales,25–50,4,https://www.beeradvocate.com/beer/styles/86/,8,German Altbier
9,6.3-7.6%,3,Dark Ales,50-70,5,https://www.beeradvocate.com/beer/styles/175/,9,American Black Ale


REPLACE THE '/' SEPARATOR IN 'family_name' WITH A BLANK ' '.

Example: beer_name = 'Barrel Aged Port Royal Stout W/ Vanilla Beans', 'Wild/Sour Beers', ...


In [21]:
df['family_name'] = df['family_name'].str.replace('/', ' ')
df[df['family_name'] == 'Wild Sour Beers'].head()

Unnamed: 0,abv_range,abv_strength,family_name,ibu_range,ibu_strength,style_URL,style_id,style_name
102,6.0-9.0%,3,Wild Sour Beers,0-0,5,https://www.beeradvocate.com/beer/styles/198/,102,American Brett
103,6.0-10.0%,3,Wild Sour Beers,5-30,2,https://www.beeradvocate.com/beer/styles/171/,103,American Wild Ale
104,2.0-5.0%,1,Wild Sour Beers,0-10,1,https://www.beeradvocate.com/beer/styles/15/,104,Belgian Faro
105,5.0-8.9%,3,Wild Sour Beers,15-21,2,https://www.beeradvocate.com/beer/styles/10/,105,Belgian Fruit Lambic
106,5.0-8.0%,3,Wild Sour Beers,0-10,1,https://www.beeradvocate.com/beer/styles/14/,106,Belgian Gueuze
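
A plain replacement of '/' with ' ' leaves several consecutive spaces when the separator is already surrounded by blanks, as in 'A / B'. The sketch below, with the hypothetical helper `normalize_separator`, shows one possible way to collapse the separator and its surrounding whitespace into a single space.

```python
import re

def normalize_separator(name):
    # Replace '/' together with any surrounding whitespace by one space
    return re.sub(r'\s*/\s*', ' ', name).strip()

print(normalize_separator('Wild/Sour Beers'))             # Wild Sour Beers
print(normalize_separator('English Sweet / Milk Stout'))  # English Sweet Milk Stout
```

The same pattern could presumably be applied to the whole column with `df['family_name'].str.replace(r'\s*/\s*', ' ', regex=True)`, assuming a pandas version whose `str.replace` accepts the `regex` argument.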


In [22]:
df.head()

Unnamed: 0,abv_range,abv_strength,family_name,ibu_range,ibu_strength,style_URL,style_id,style_name
0,6.3-7.6%,3,Bocks,20-30,3,https://www.beeradvocate.com/beer/styles/32/,0,German Bock
1,6.6-7.9%,3,Bocks,17-27,3,https://www.beeradvocate.com/beer/styles/35/,1,German Doppelbock
2,7.0-14.0%,4,Bocks,25-35,3,https://www.beeradvocate.com/beer/styles/36/,2,German Eisbock
3,6.3-8.1%,3,Bocks,20-38,3,https://www.beeradvocate.com/beer/styles/33/,3,German Maibock
4,7.0-9.5%,3,Bocks,15-35,3,https://www.beeradvocate.com/beer/styles/92/,4,German Weizenbock


In [23]:
df.to_csv('./beer_styles_v1.csv', header=True, sep=',', index=False)