## Get Plataforma Sucupira's datasheets 

**Objective**: This script scrapes the datasheets with the up-to-date list of UFPB's graduate programs directly from Plataforma Sucupira.

### Follow these guidelines to get the Chrome webdriver working:

1. Download the latest version for your OS from the [repo](https://chromebrowser.storage.googleapis.com/index.html).
2. On macOS, remove it from quarantine with
`xattr -d com.apple.quarantine <chromedriver>`, where `<chromedriver>` is the file you downloaded.
3. The driver can then be executed. If it still fails, check whether your current Chrome version requires an older webdriver release. In any case, keep Chrome up to date.
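
A quick way to check step 3 is to compare the major versions reported by Chrome and by the driver. The sketch below assumes both tools print a dotted version such as `102.0.5005.61` when called with `--version`; the helper names `major_version` and `versions_match` are hypothetical and not part of this repo.

```python
import re

def major_version(version_output):
    """Return the major version found in a `--version` string, or None."""
    # Assumption: the output contains a dotted version such as 102.0.5005.61
    match = re.search(r"(\d+)\.\d+\.\d+", version_output)
    return int(match.group(1)) if match else None

def versions_match(chrome_output, driver_output):
    """True when Chrome and chromedriver report the same major version."""
    chrome = major_version(chrome_output)
    driver = major_version(driver_output)
    return chrome is not None and chrome == driver
```

For instance, `versions_match("Google Chrome 102.0.5005.61", "ChromeDriver 102.0.5005.61")` compares 102 against 102. The two strings would normally be pasted in from running each binary with `--version` in a terminal.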

### Information

- This script tries to download the spreadsheets automatically. If it fails because Sucupira blocks the `onclick` event, the workaround is to download the XLS files manually.

- CAPES may update the spreadsheets periodically, or every four years after the Quadrienal evaluation.

- Oddly, UFPB has 3 spreadsheets with 3 distinct codes, one per campus. This issue is being handled by PRPG.

- The full CNPq knowledge table is found [here](http://lattes.cnpq.br/documents/11871/24930/TabeladeAreasdoConhecimento.pdf/d192ff6b-3e0a-4074-a74d-c280521bd5f7).
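
If the manual workaround is needed, each downloaded file still has to end up in the destination folder under the campus prefix. A minimal sketch of that step follows; the helper name `place_manual_download` is hypothetical, and it assumes the file was saved somewhere locally with the `.xls` extension.

```python
import os
import shutil

def place_manual_download(src, prefix, save_path):
    """Copy a manually downloaded XLS into save_path as '<prefix>.xls'."""
    # create the destination folder if it does not exist yet
    os.makedirs(save_path, exist_ok=True)
    dst = os.path.join(save_path, prefix + ".xls")
    # copy instead of move, so the original download is preserved
    shutil.copy(src, dst)
    return dst
```

A call such as `place_manual_download('quantitativo_programas.xls', 'UFPB-JP', '../input/sucupira/')` would be repeated once per campus link.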

In [None]:
import os, re, time, shutil, pandas as pd, numpy as np
from urllib.request import urlopen
from bs4 import BeautifulSoup
from selenium import webdriver

In [None]:
# settings

# Webdriver required for Selenium
webdriver_path = '../utils/chromedriver-1020500561-m1-apple'

# Destination path of processed spreadsheets
save_path = '../input/sucupira/'


# Links for PPG tables
sucupira_links = {
    'UFPB-JP':'https://sucupira.capes.gov.br/sucupira/public/consultas/coleta/programa/quantitativos/quantitativoPrograma.jsf?areaAvaliacao=0&cdRegiao=2&sgUf=PB&ies=338423',
    'UFPB-AREIA':'https://sucupira.capes.gov.br/sucupira/public/consultas/coleta/programa/quantitativos/quantitativoPrograma.jsf?areaAvaliacao=0&cdRegiao=2&sgUf=PB&ies=338424',
    'UFPB-RT':'https://sucupira.capes.gov.br/sucupira/public/consultas/coleta/programa/quantitativos/quantitativoPrograma.jsf?areaAvaliacao=0&cdRegiao=2&sgUf=PB&ies=338425'}    
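
The three links above differ only in the `ies` parameter, so they could also be generated from the per-campus codes. This is only a sketch: `BASE_URL`, `IES_CODES` and `build_links` are illustrative names, and the query parameters are copied from the links above.

```python
from urllib.parse import urlencode

BASE_URL = ("https://sucupira.capes.gov.br/sucupira/public/consultas/coleta/"
            "programa/quantitativos/quantitativoPrograma.jsf")

# IES codes taken from the links above, one per campus
IES_CODES = {"UFPB-JP": 338423, "UFPB-AREIA": 338424, "UFPB-RT": 338425}

def build_links(ies_codes, base_url=BASE_URL):
    """Return a dict mapping each campus prefix to its Sucupira URL."""
    links = {}
    for prefix, ies in ies_codes.items():
        # same query parameters, in the same order, as the hand-written links
        query = urlencode({"areaAvaliacao": 0, "cdRegiao": 2,
                           "sgUf": "PB", "ies": ies})
        links[prefix] = f"{base_url}?{query}"
    return links
```

`build_links(IES_CODES)` yields the same three URLs, which keeps the per-campus codes in one place if PRPG later merges or changes them.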


In [None]:
def get_sheets_from_sucupira(url,
                             webdriver_path,
                             save_path,
                             fn='quantitativo_programas.xls'):
    '''Download XLS sheets directly from the Sucupira Platform.

    url: dict with links to Graduate Program datasheets
    webdriver_path: path to Selenium driver
    save_path: destination path where the final datasheets are saved
    fn: XLS file name (standard by CAPES)

    REMARK: here, we use the Selenium webdriver for Chrome.
            The Chrome driver depends on your OS architecture and
            should be declared accordingly. For this repo,
            an Apple M1 version is saved into utils/.
            Notice that you should run the Chrome driver with
            a compatible Chrome version.

    GOTCHA: if Sucupira refuses to serve the file, try:

            i) passing the dict 'url' with one link at a time
            ii) downloading the file manually

    '''

    for prefix, url_ppg in url.items():
    
    
        # initialize webdriver in silent (headless) mode
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")

        # execute webdriver and get PPG's url
        browser = webdriver.Chrome(webdriver_path,options=options)
        browser.get(url_ppg)

        # find the link with the JSF script of the datasheet
        # (the find_elements_by_* helpers were removed in Selenium 4.3)
        link = browser.find_elements('xpath', "//a[contains(@onclick,'areaAvaliacao')]")

        # simple check. A single link should be found
        if len(link) == 1 and link[0].text == 'Gerar arquivo XLS':
            # 1st way
            #link[0].click()
            
            # 2nd way
            webdriver.ActionChains(browser).move_to_element(link[0]).click(link[0]).perform()        
            
            #print(f'XLS downloaded from Sucupira')

        # scan current directory for the XLS just downloaded
        this = os.listdir(os.curdir)    
        _,ext = os.path.splitext(fn)

                
        # create save directory
        if not os.path.exists(save_path):
            os.makedirs(save_path)

        # move files 
        if fn in this:
            os.rename(fn, prefix + ext)
            src = os.path.join(os.curdir,prefix + ext)
            dst = os.path.join(save_path,prefix + ext)                                           
            shutil.move(src,dst)
            print(f'File: \'{prefix + ext}\' moved to {save_path}')
            
        # quit the browser opened in this iteration before moving to the next link
        browser.quit()
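
The function above lists the current directory right after the click, so a slow download may not have landed yet when the check runs. One possible remedy is to poll for the file first. The helper below is a sketch with a hypothetical name (`wait_for_file`) and arbitrary default timings; it assumes the browser only gives the file its final name once the download is complete.

```python
import os
import time

def wait_for_file(path, timeout=30.0, poll=0.5):
    """Poll until `path` exists or `timeout` seconds elapse; return True if found."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        # sleep briefly before checking again
        time.sleep(poll)
    # one last check after the deadline
    return os.path.exists(path)
```

Calling `wait_for_file(fn)` between the click and the `os.listdir` call would give the download up to 30 seconds to finish before the file is renamed and moved.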

In [None]:
def fix_datasheet(save_path):
    '''Fix the spreadsheets and write a single clean CSV to ../input/lista-ppg.csv.'''
    
    sheets = os.listdir(save_path)

    # flag which of the expected per-campus files are present
    ok = {}
    for prefix in ['JP','AREIA','RT']:
        name = f'UFPB-{prefix}.xls'
        ok[name] = sheets.count(name) == 1
                    
    if not all(ok.values()):
        raise NameError(f'Sucupira files should be renamed to:\n {list(ok.keys())}')

    
    fix = {'UFPB-JP':'João Pessoa',
           'UFPB-AREIA':'Areia',
           'UFPB-RT':'Rio Tinto'}    

    modalidade = {'ME':'Mestrado',
                  'DO':'Doutorado',
                  'MP':'Mestrado Profissional',
                  'DP':'Doutorado Profissional'}
    
    dfs = []
    

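    for f in ok.keys():
        # the file name without extension doubles as the campus key, e.g. 'UFPB-JP'
        campus,_ = f.split('.')
        df = pd.read_excel(os.path.join(save_path,f))
        # replace the campus code with its readable name
        df['Sigla da IES'] = df['Sigla da IES'].str.replace(campus,fix[campus])
        # tag every row with the campus name
        df['Campus'] = fix[campus]
        # keep only the columns needed for the final list
        df = df[['Nome do Programa', 'Campus','Código do Programa', 'ME', 'DO', 'MP', 'DP']]
        dfs.append(df)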
    
    lista_ppg = pd.concat(dfs)
    lista_ppg['Nome do Programa'] = lista_ppg['Nome do Programa'].apply(lambda x: x.upper())
    lista_ppg = lista_ppg.sort_values(by='Nome do Programa')
    lista_ppg = lista_ppg.rename(columns={'Nome do Programa':'nome_ppg','Campus':'campus','Código do Programa':'codigo_CAPES'})
    lista_ppg.to_csv(os.path.join(os.pardir,'input','lista-ppg.csv'),index=False)    
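
When the file check in `fix_datasheet` fails, the error lists every expected name rather than the ones actually missing. A small helper could narrow that down; the sketch below uses hypothetical names (`EXPECTED`, `missing_sheets`) and only the standard library.

```python
import os

# file names expected by fix_datasheet, one per campus
EXPECTED = ["UFPB-JP.xls", "UFPB-AREIA.xls", "UFPB-RT.xls"]

def missing_sheets(save_path, expected=EXPECTED):
    """Return the expected spreadsheet names that are absent from save_path."""
    # a missing folder counts as "nothing present"
    present = set(os.listdir(save_path)) if os.path.isdir(save_path) else set()
    return [name for name in expected if name not in present]
```

An empty list means all three spreadsheets are in place; otherwise the returned names indicate which campus links need to be downloaded again.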

### Main execution

Do everything by calling functions here.

In [None]:
get_sheets_from_sucupira(sucupira_links,
                         webdriver_path,
                         save_path,
                         fn='quantitativo_programas.xls')

fix_datasheet(save_path)


# no cleanup needed here: the webdriver is created and closed inside get_sheets_from_sucupira