In [1]:
# Imports
import os
import time

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

from bs4 import BeautifulSoup

import networkx as nx

# "Agrupaciones Gaiteras" collaboration network

I love "Gaita Zuliana", a folk music genre from my hometown, Maracaibo, Venezuela. As a first gift to my hometown and to the "Gaiteros", I will build a collaboration network analysis from a small dataset collected from the internet.

To do that, I will gather the data by web scraping with __Selenium__ and __BeautifulSoup__. This notebook covers that scraping step.

The resulting data will then be analyzed with __NetworkX__, a Python library for the study of complex networks, in a separate notebook.

In this notebook we will also use a language model from the Hugging Face hub for named entity recognition __(NER)__ to extract the artists' names. Through this process, we aim to uncover the relationships and collaborations within the beautiful Gaita Zuliana music scene.

## Scraping the links that point to the information

The information comes from a Spanish-language website called "Sabor Gaitero", which presents it as follows:

* A section called "Agrupaciones Gaiteras" works much like a search results page: it lists each group, with a hyperlink that points to that group's page. You can see it at the following link:

http://saborgaitero.com/agrupaciones-gaiteras/

* Each "Gaita" group page has several paragraphs of text about the group, and this text is exactly what we will scrape, as seen here:

http://saborgaitero.com/zagalines-los/

As a first step, let's create a browser starter function that uses __Selenium__:

In [2]:
# HOME_DIR = os.path.expanduser('~')
DRIVER_FILE = r'C:\Users\armedina\Documents\DS-Projects\Deep_Gaitas\Drivers\chromedriver.exe'
# DRIVER_FILE = os.path.join()

def browser_starter(driver_file):
    """
    Initializes the browser driver object.

    * Params:
        * driver_file : path to the chromedriver executable.

    * Returns:
        A "webdriver.Chrome()" object used in the different scraping activities.
    """
    # Set up Chrome options
    chrome_options = Options()
    chrome_options.add_argument("--headless") # Ensure GUI is off
    chrome_options.add_argument("--no-sandbox")

    # Instantiate the browser driver from the path passed in
    webdriver_service = Service(driver_file)
    browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)

    return browser
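
The driver path above is hard-coded for one Windows machine. As a minimal sketch, assuming the driver lives under the user's home directory (the folder names simply mirror the path above, and `default_driver_path` is a hypothetical helper that is not part of the original notebook), the path could be built portably with `pathlib`:

```python
from pathlib import Path

def default_driver_path(home=None):
    """Build the chromedriver path relative to a home directory (hypothetical helper)."""
    # Assumption: same folder layout as the hard-coded path, rooted at the user's home
    base = Path(home) if home is not None else Path.home()
    return base / 'Documents' / 'DS-Projects' / 'Deep_Gaitas' / 'Drivers' / 'chromedriver.exe'

# Demo with an explicit, made-up home directory
demo_path = default_driver_path('/home/demo')
print(demo_path.name)
```

In the notebook this could be used as `DRIVER_FILE = str(default_driver_path())`, provided the driver really is stored in that location.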

The next part is a function that takes a BS4 soup object and extracts the links for each "Gaita" group (from the first page shown in the intro):

In [3]:
def get_all_pages_from_agrupaciones_sabor_gaitero(sopa):
    """
    Function that obtains all the links of the "Gaita" groups
    from the specified web page.

    * Params:
        * sopa : BS4 soup object.

    * Returns:
        A list with the links of the "Gaita" groups, as text strings.
    """

    links = []
    output_links = []
    UMBRAL_LONGITUD_TITULO = 6
    # Iterates through the different titles (which are hyperlinks), and gets the links.
    for div in sopa.find_all('h3'):
        if len((str(div.a.get('href'))).split(' ')) > UMBRAL_LONGITUD_TITULO:
            continue
        else:
            links.append(str(div.a.get('href')).lower())
    # Some words within the inspected elements are not required, so
    # we filter those links.
    # Note: a link is appended once for each stop word it does not contain,
    # so the output holds repeated links; they are deduplicated later with set().
    STOP_WORDS = ['ano', 'año', 'himno', 'escudo', 'de', 'la', 'puente', 'sobre', 'lago']
    for enlace in links:
        for stop_word in STOP_WORDS:
            if stop_word in enlace:
                continue
            else:
                output_links.append(enlace)
    return output_links
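
The stop-word loop above tests each stop word separately, so a link is only skipped for the words it contains and is still appended for all the others. As a minimal sketch of a stricter alternative, assuming a link should be dropped whenever any whole word of its URL slug is a stop word (the helper name, the shortened stop-word set, and the example URLs are all made up for illustration):

```python
from urllib.parse import urlparse

# Hypothetical, shortened stop-word set for illustration only
STOP_WORDS_DEMO = {'himno', 'escudo', 'puente', 'lago'}

def filter_links(links, stop_words=STOP_WORDS_DEMO):
    """Keep each link once, unless a whole word of its slug is a stop word."""
    kept = []
    for link in links:
        # '/zagalines-los/' -> 'zagalines-los' -> {'zagalines', 'los'}
        slug = urlparse(link).path.strip('/')
        words = set(slug.split('-'))
        if words & set(stop_words):
            continue  # at least one stop word appears in the slug
        if link not in kept:
            kept.append(link)
    return kept

demo_result = filter_links([
    'http://example.com/zagalines-los/',
    'http://example.com/himno-del-estado/',
    'http://example.com/zagalines-los/',
])
print(demo_result)
```

Matching whole slug words rather than substrings avoids dropping a link just because a short word such as "de" or "la" happens to occur inside a longer one. Short words like these are still common in group names, so whether they belong in the stop-word list at all is a separate judgment call.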

Next, we cycle through the result pages of the search page to collect the links that point to all the "Gaita" group pages (like the second link shown in the intro):

In [4]:
# Get the links from the "Gaita" groups page
PAGINA_WEB_AGRUPACIONES = r'http://saborgaitero.com/agrupaciones-gaiteras/'
TIEMPO_ESPERA = 1.2
# Opens the web page
browser = browser_starter(DRIVER_FILE)
browser.get(PAGINA_WEB_AGRUPACIONES)
# Waits for the page to load, then obtains its html code.
time.sleep(TIEMPO_ESPERA)
html = browser.page_source
soup = BeautifulSoup(html,'html.parser')

# First links
enlaces_agrupaciones = get_all_pages_from_agrupaciones_sabor_gaitero(soup)

# Following links
BOTON_UNICO = '/html[1]/body[1]/div[5]/div[2]/div[1]/div[1]/div[1]/div[1]/div[1]/div[1]/div[3]/a[2]'


# Iterate through the results pages, and extract the links of the
# "Gaita" groups (although some unwanted extra topics slip in)
for _ in range(20):
    try:
        browser.find_element(By.XPATH, BOTON_UNICO).click()
        # Waits for the next page to load before reading its html code.
        time.sleep(TIEMPO_ESPERA)
        html = browser.page_source
        soup = BeautifulSoup(html,'html.parser')
        enlaces_agrupaciones += get_all_pages_from_agrupaciones_sabor_gaitero(soup)
    except Exception as e:
        print('A problem has occurred...')
        print(e)

# quit() closes every window and ends the driver session
browser.quit()
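
The loop above always attempts 20 clicks, so once the "next" button is gone every remaining iteration prints an error. As a minimal sketch, assuming it is acceptable to stop at the first failed click, the pattern can be written as a small helper (`collect_paginated` and the stand-in callables are hypothetical and replace the real browser calls for illustration):

```python
from typing import Callable, Iterable, List

def collect_paginated(read_links: Callable[[], Iterable[str]],
                      go_to_next_page: Callable[[], None],
                      max_pages: int = 20) -> List[str]:
    """Collect links page by page, stopping at the first failed 'next' step."""
    collected = list(read_links())      # links on the first page
    for _ in range(max_pages):
        try:
            go_to_next_page()           # e.g. click the "next" button and wait
        except Exception as error:      # no button left, or the click failed
            print(f'Stopping pagination: {error}')
            break
        collected.extend(read_links())
    return collected

# Stand-in data and callables that imitate three result pages
pages = [['a', 'b'], ['c'], ['d', 'e']]
position = {'index': 0}

def read_links():
    return pages[position['index']]

def go_to_next_page():
    if position['index'] + 1 >= len(pages):
        raise RuntimeError('no next page')
    position['index'] += 1

demo_links = collect_paginated(read_links, go_to_next_page)
print(demo_links)
```

In the notebook, `read_links` would wrap the `BeautifulSoup` parsing of `browser.page_source`, and `go_to_next_page` would wrap the click on `BOTON_UNICO` followed by the wait.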

As the final part of the scraping phase, we cycle through all the pages that contain the actual text describing each "Gaita" group.

In [18]:
# Extract the text information from each individual "Gaita" group web page.
browser = browser_starter(DRIVER_FILE)
soups = []
for enlace in set(enlaces_agrupaciones):
    try:
        browser.get(enlace)

        # Waits for the page to load, then obtains its html code.
        time.sleep(TIEMPO_ESPERA)
        html = browser.page_source
        soup = BeautifulSoup(html,'html.parser')
        soups.append(soup)
    except Exception as e:
        print('A problem has occurred...')
        print(e)

# The browser is no longer needed once every page has been downloaded
browser.quit()

agrupaciones = []

for sopa in soups:
    # The page heading is the group name; the paragraphs hold the description
    titulo = sopa.find('h1').text.lower()
    texto = ' '.join(parrafo.text.lower() for parrafo in sopa.find_all('p'))
    agrupaciones.append({'TITLE': titulo, 'TEXT': texto})
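
Iterating over `set(enlaces_agrupaciones)` removes repeated links, but a set of strings has no guaranteed order, so the pages may be visited in a different order on each run. A minimal sketch, assuming first-seen order is worth keeping (`dedupe_keep_order` is a hypothetical helper and the URLs are made-up stand-ins):

```python
def dedupe_keep_order(links):
    """Remove repeated links while keeping the order in which they first appear."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(links))

demo_links = [
    'http://example.com/group-b/',
    'http://example.com/group-a/',
    'http://example.com/group-b/',
]
unique_links = dedupe_keep_order(demo_links)
print(unique_links)
```

A stable order makes runs easier to compare, for example when checking which page failed to load.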

We store the data in a pandas DataFrame as follows:

In [19]:
agrupaciones_df = pd.DataFrame.from_dict(agrupaciones)
agrupaciones_df = agrupaciones_df.loc[~agrupaciones_df['TITLE'].duplicated()]

The columns are:

* TITLE: contains the name of the "Gaita" group.
* TEXT: the text extracted from the group's web page.

In [20]:
agrupaciones_df

|   | TITLE | TEXT |
|---|---|---|
| 0 | alegres gaiteros | fundado 1981 nerio ríos / jesús pirele / otros... |
| 1 | colorama | fundado 1964 arsacio acurero / quintiliano sán... |
| 2 | gaiteros de ziruma | fundado 2004 ender rojas / alí ojeda / daniela... |
| 3 | rafael rincón gonzález, arbol de zulianidad | la hermosa y extraña palabra “biombo”, que sig... |
| 4 | gran coquivacoa | fundado 1968 neguito borjas / osías acosta / e... |
| 5 | maracaibo 15 | fundado 1974 betulio medina / otros amparito /... |
| 6 | santanita | fundado 1964 gladys vera / raiza portillo / ot... |
| 7 | profesionales de la gaita | fundado 1986 enrique gotera / pedro rosell / h... |
| 8 | dragones, los | fundado 1964 hugo huerta / josé ángel huerta /... |
| 9 | birimbao | fundado 1980 jerry sánchez / eloy montiel / ot... |


## Data preparation and cleaning

To continue, we need some domain knowledge to check whether all the extracted information is valid, and to proceed with the data cleaning.

The first cleaning stage simply filters out the rows that are not "Gaiteros" entries. To do this, we use our domain knowledge to build a list of the non-gaitero titles and use it as a filter:

In [21]:
non_gaiteros = [
    'bachelet regresó a presidir la patria de jara. crónica semanal por @leonmagnom',
    '13 años de saborgaitero.com',
    'marisela árraga, una mujer ideal',
    'las 15 grandes gaitas de la temporada 2015',
    'argenis carruyo 69 años de vida y cantando',
    'banco occidental de descuento en gaitas',
    'somos',
    'san isidro',
    'kongaby',
    'arbol de zulianidad',
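]

# .copy() makes the filtered result an independent DataFrame, so that
# adding columns later does not trigger pandas' SettingWithCopyWarning
agrupaciones_df = agrupaciones_df[~agrupaciones_df['TITLE'].isin(non_gaiteros)].copy()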

This takes the DataFrame from 48 rows to 39.

In [22]:
agrupaciones_df.shape

(39, 2)
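
Note that `isin` compares whole values, so a list entry only removes a row whose title is exactly that string. The sketch below uses plain Python lists as a stand-in for the `TITLE` column (an assumption made to keep it self-contained) and contrasts exact matching with substring matching on three titles taken from the data above:

```python
titles = [
    'colorama',
    'rafael rincón gonzález, arbol de zulianidad',
    'somos',
]
excluded = ['somos', 'arbol de zulianidad']

# Exact matching, the same comparison that isin performs
kept_exact = [title for title in titles if title not in excluded]

# Substring matching: drop a title if any excluded entry occurs inside it
kept_partial = [
    title for title in titles
    if not any(entry in title for entry in excluded)
]

print(kept_exact)
print(kept_partial)
```

Exact matching keeps the longer "arbol de zulianidad" title because it differs from the list entry. Substring matching removes it, but it is also riskier: a short entry such as "somos" would remove any title that merely contains that word.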

## Using NLP to extract the precise information

For the second step, we need to extract the names of the group members from the text. We are going to do this with __NER__ (Named Entity Recognition), a __"Token classification"__ task in which a language model is trained to recognize names of people within the evaluated text (among other things, such as names of cities or countries).

Specifically, we will use the __Flair__ library for language processing, which provides a __"Sequence Tagger"__ model, together with its __"ner-spanish-large"__ language model for NER, which can be downloaded from the __Hugging Face__ hub at:

https://huggingface.co/flair/ner-spanish-large

For a great, short tutorial on how to leverage this kind of language model, you can refer to the Hugging Face tutorial page for LMs.

In [23]:
from flair.data import Sentence
from flair.models import SequenceTagger

In [24]:
# load tagger
tagger = SequenceTagger.load("flair/ner-spanish-large")



2023-02-08 21:21:16,060 loading file C:\Users\armedina\.flair\models\ner-spanish-large\045ad6c7dc21e0eb85935dce0544eec65f8c63c58412154df4dee7ff5f11665b.d4d3456316d2951bc100d060bd63a690b33af6d273adffa1b90df32328ed3257
2023-02-08 21:21:32,168 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-LOC, S-ORG, B-PER, I-PER, E-PER, S-MISC, B-ORG, E-ORG, S-PER, I-ORG, B-LOC, E-LOC, B-MISC, E-MISC, I-MISC, I-LOC, <START>, <STOP>


Next, we define a function that takes the text and extracts the names of the "Gaiteros" using __Token Classification__, as mentioned before (this function is adapted from the example section of the language model's page on the __Hugging Face__ hub, linked in the introduction to this section):

In [25]:
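def ner_converver(texto):
    """
    Returns the person names found in the text as one comma-separated string.
    """
    sentence = Sentence(texto)
    tagger.predict(sentence)
    entities = []
    # Keep only the spans tagged as persons (PER)
    for entity in sentence.get_spans('ner'):
        if entity.tag == 'PER':
            entities.append(entity.text)

    return ','.join(entities)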

Then we create a new column named __NER__ that holds, for each group, a comma-separated string with the names of the "Gaiteros":

In [26]:
# predict NER tags
agrupaciones_df['NER'] = agrupaciones_df['TEXT'].apply(ner_converver)



In [27]:
agrupaciones_df

|   | TITLE | TEXT | NER |
|---|---|---|---|
| 0 | alegres gaiteros | fundado 1981 nerio ríos / jesús pirele / otros... | nerio ríos,jesús pirele,nelson romero,josé rom... |
| 1 | colorama | fundado 1964 arsacio acurero / quintiliano sán... | arsacio acurero,quintiliano sánchez,alfonso hu... |
| 2 | gaiteros de ziruma | fundado 2004 ender rojas / alí ojeda / daniela... | ender rojas,alí ojeda,daniela palma,gerardo mo... |
| 3 | rafael rincón gonzález, arbol de zulianidad | la hermosa y extraña palabra “biombo”, que sig... | rafael rincón gonzález,rafael caldera,rafael a... |
| 4 | gran coquivacoa | fundado 1968 neguito borjas / osías acosta / e... | neguito borjas,osías acosta,ender carruyo,punt... |
| 5 | maracaibo 15 | fundado 1974 betulio medina / otros amparito /... | betulio medina,betulio medina,medina,jesús rod... |
| 6 | santanita | fundado 1964 gladys vera / raiza portillo / ot... | gladys vera,raiza portillo,raiza portillo,glad... |
| 7 | profesionales de la gaita | fundado 1986 enrique gotera / pedro rosell / h... | enrique gotera,pedro rosell,huascar pacheco,ja... |
| 8 | dragones, los | fundado 1964 hugo huerta / josé ángel huerta /... | hugo huerta,josé ángel huerta,hugo huerta,fern... |
| 9 | birimbao | fundado 1980 jerry sánchez / eloy montiel / ot... | jerry sánchez,eloy montiel,emerson martínez,je... |
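
Several rows of the NER column repeat the same name (for example "raiza portillo" appears twice for santanita). As a minimal sketch, assuming exact repeats should count once, a comma-separated string can be reduced to its unique names in first-seen order (`unique_names` is a hypothetical helper, not part of the original notebook):

```python
def unique_names(ner_string):
    """Split a comma-separated NER string and drop exact repeats, keeping order."""
    seen = {}
    for name in ner_string.split(','):
        cleaned = name.strip()
        if cleaned:                       # skip empty pieces
            seen.setdefault(cleaned, None)
    return list(seen)

names = unique_names('gladys vera,raiza portillo,raiza portillo')
print(names)
```

This only merges exact repeats. Partial mentions, such as a surname on its own, would still need domain knowledge to resolve.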


Finally, we save the data in an Excel file for the next phase, where we'll create the collaboration network using __NetworkX__ as the main tool for data exploration:

In [28]:
nombre_archivo = 'Agrupaciones_NER_flair(v3).xlsx'
agrupaciones_df.to_excel(nombre_archivo)
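
To give an idea of how the saved NER strings could feed the network phase, here is a minimal sketch. It assumes that two groups are linked whenever they share at least one artist name, which is only one possible way to define the network and not necessarily the one used in the follow-up notebook. The group and artist names are made up for illustration.

```python
from itertools import combinations

# Made-up stand-in for the TITLE -> NER mapping
ner_by_group = {
    'group one': 'artist a,artist b',
    'group two': 'artist b,artist c',
    'group three': 'artist d',
}

# One set of artist names per group
artists = {group: set(names.split(',')) for group, names in ner_by_group.items()}

# An edge joins two groups that share at least one artist
edges = []
for first, second in combinations(sorted(artists), 2):
    shared = artists[first] & artists[second]
    if shared:
        edges.append((first, second, sorted(shared)))

print(edges)
```

Each tuple holds the two groups and the artists they have in common, a shape that can be turned into graph edges with an attribute for the shared names.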