## Acquiring the Dataset

To run our algorithm, we decided to build our own _dataset_ through _web scraping_.

For that, the packages below are required.

In [1]:
import numpy as np
import numpy.linalg as la
import pandas as pd
import requests
import re
from urllib.parse import urlparse, urldefrag
from bs4 import BeautifulSoup as bs

In case you are not familiar with the modules other than `numpy` and `pandas`: `re` and `urllib` ship with Python's standard library, while `requests` is a third-party package that must be installed separately. Respectively, `requests`, `re`, and `urllib` make _HTTP_ requests, handle regular expressions, and provide _URL parsing_.

The `BeautifulSoup` class from the `bs4` module (also a third-party package) does the _web scraping_ proper: it parses the HTML and lets us search it for elements.

Our data acquisition idea was, first, to choose a Wikipedia page as the starting point and collect all of its links (and the links inside the pages those links point to). Then we visit each of these links, filter the links found there against that first listing, group the counts (which links repeat, and how many times), and finally compute, for each page, the probability of landing on each link when clicking one uniformly at random.

This gives us a square matrix that meets our requirements.
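
To make the counting step concrete, here is a minimal sketch for a single page. The list `page_links` is a hypothetical example, not data from the scrape; it only illustrates how repeated links turn into click probabilities.

```python
from collections import Counter

# Hypothetical links found on one page (one of them appears twice).
page_links = ['./A', './B', './A', './C']

# Count how many times each link appears on the page.
counts = Counter(page_links)

# Probability of landing on each link when clicking one uniformly at random.
total = sum(counts.values())
probs = {link: n / total for link, n in counts.items()}

print(probs)
```

Each page contributes one such set of probabilities, which becomes one column of the matrix.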

### Defining functions

The function below returns a _dataframe_ with the links of a given _URL_.

In [2]:
def get_df(url: str):
    ################
    ### SCRAPING ###
    ################
    
    # Request the page.
    raw_page = requests.get(url)
    
    # Get its HTML.
    html_page = raw_page.text
    
    # Create the scraper object.
    soup = bs(html_page, 'lxml')
    
    # Capture every link that points back to Wikipedia itself.
    links = soup.find_all('a', attrs={'rel': 'mw:WikiLink'})
    
    # Build a list with the links.
    data = [link.get('href') for link in links]
    
    # Build a dataframe from this data.
    df = pd.DataFrame(data, columns=['link'])
    
    return df
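
To see what the `rel` filter keeps without making a network request, the sketch below runs the same idea on a small HTML fragment. It deliberately swaps `BeautifulSoup` for the standard library's `html.parser`, so it runs without extra packages; the fragment and the class name `WikiLinkParser` are hypothetical, and the match here is a plain string comparison on the `rel` attribute.

```python
from html.parser import HTMLParser

class WikiLinkParser(HTMLParser):
    """Collects the href of every <a> tag whose rel is 'mw:WikiLink'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and attrs.get('rel') == 'mw:WikiLink':
            self.hrefs.append(attrs.get('href'))

# Hypothetical fragment: two internal links and one external link.
sample_html = '''
<p>
  <a rel="mw:WikiLink" href="./Matriz">matriz</a>
  <a rel="mw:ExtLink" href="https://example.org">externo</a>
  <a rel="mw:WikiLink" href="./Vetor#Notas">vetor</a>
</p>
'''

parser = WikiLinkParser()
parser.feed(sample_html)
print(parser.hrefs)
```

Only the two internal links survive, and the second one still carries its `#Notas` fragment at this stage.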

Next, this function adds a probability column to the dataframe returned by the previous function. It also receives the page's _URL_, which is used to name that column.

In [3]:
def find_probs(url: str, df: pd.DataFrame): 
    ###################
    ### PROBABILITY ###
    ###################
    
    # Strip the fragment from each url (without mutating the caller's dataframe).
    df = df.assign(link=df['link'].apply(lambda x: urldefrag(x)[0]))
    
    # Count how many references to each link the page contains.
    df = df.groupby(['link']).size().reset_index(name='count')
    
    # Build the column name from the url (to keep names consistent).
    title = url.replace('https://pt.wikipedia.org/api/rest_v1/page/html', '.')
    
    # Create the probability column, named after the page.
    df[title] = df['count']/df['count'].sum()
    
    return df[['link', title]]
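
The two string manipulations used throughout these functions can be checked in isolation. This is a small sketch with a hypothetical link, `./Vetor#Notas`; the base URL is the one used in the functions, stored here in a constant named `BASE` for convenience.

```python
from urllib.parse import urldefrag

BASE = 'https://pt.wikipedia.org/api/rest_v1/page/html'

# urldefrag splits a link into (url, fragment); index 0 keeps only the url.
link = './Vetor#Notas'
clean = urldefrag(link)[0]

# Relative -> absolute, as done before requesting a page.
absolute = clean.replace('./', BASE + '/')

# Absolute -> relative, as done when naming a probability column.
title = absolute.replace(BASE, '.')

print(clean)
print(absolute)
print(title)
```

Because the column name and the cleaned link end up in the same relative form, columns can later be compared against the values in the `link` column.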

Since we will restrict ourselves to "our internet", this function returns a _dataframe_ with all the links reachable from a given _URL_. Note that a "depth" can be chosen. In our case it will be 1, meaning that it captures all the links of every page linked from the initial `url`.

In [4]:
def get_links(url: str, depth: int):
    #######################################
    ### CAPTURE ALL LINKS UP TO DEPTH n ###
    #######################################
    
    # Request the page.
    raw_page = requests.get(url)
    
    # Get its HTML.
    html_page = raw_page.text
    
    # Create the scraper object.
    soup = bs(html_page, 'lxml')
    
    # Capture every link that points back to Wikipedia itself.
    links = soup.find_all('a', attrs={'rel': 'mw:WikiLink'})
    
    # Build a set with the links (dropping the #fragment).
    link_set = {urldefrag(link.get('href'))[0] for link in links}
    
    
    if depth > 0:
        for link in link_set:
            # Turn the relative link into an absolute one.
            full_link = link.replace('./', 'https://pt.wikipedia.org/api/rest_v1/page/html/') 
            # Recurse one level deeper.
            new_link_set = get_links(full_link, depth - 1)
            # When depth - 1 > 0 the recursive call returns a dataframe; turn it back into a set.
            if isinstance(new_link_set, pd.DataFrame):
                new_link_set = set(new_link_set['link'])
            # Merge the sets.
            link_set = set.union(link_set, new_link_set)
        
        # Add the original url to the set.
        title = url.replace('https://pt.wikipedia.org/api/rest_v1/page/html', '.')
        link_set.add(title)
        
        # Build a dataframe from this set (sorted, so the row order is deterministic).
        link_filter = pd.DataFrame(sorted(link_set), columns=['link'])

        return link_filter
    else:
        return link_set
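
The effect of the depth parameter is easier to follow on a toy graph than on live pages. The sketch below is an illustration only: `toy_web` is a hypothetical in-memory stand-in for Wikipedia, and `collect_links` is a simplified helper that always returns a set, unlike the function above.

```python
# Hypothetical link graph standing in for real pages: page -> links it contains.
toy_web = {
    './A': ['./B', './C'],
    './B': ['./D'],
    './C': ['./A'],
    './D': ['./E'],
}

def collect_links(page: str, depth: int) -> set:
    """Return the links on `page`, plus links up to `depth` hops further."""
    found = set(toy_web.get(page, []))
    if depth > 0:
        # Iterate over a snapshot, since `found` grows inside the loop.
        for link in list(found):
            found |= collect_links(link, depth - 1)
    return found

links_depth_0 = collect_links('./A', 0)
# Add the start page itself, as the function above does.
links_depth_1 = collect_links('./A', 1) | {'./A'}

print(sorted(links_depth_0))
print(sorted(links_depth_1))
```

With depth 1, `./E` is not reached because it is two hops away from the pages linked by `./A`.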
    

This is the main function, where the final _dataframe_ is built. It receives the first _URL_ used (only so that it becomes the first column) and the list of _links_, which is used as the filter.

In [5]:
def scraping_loop(first_url: str, link_filter: pd.DataFrame):
    ############################
    ### MAIN SCRAPING LOOP 2 ###
    ############################
    
    # Set with every url that has already been visited.
    all_links = set()
    
    # Add the starting url to this set.
    all_links.add(first_url)
    
    # Find the links (and their probabilities) for the starting url.
    links_df = get_df(first_url)
    links = find_probs(first_url, links_df)
    
    # Make a copy to be returned after the merges.
    main_frame = links.copy()
    
    # Iterate over the links in the filter.
    for link in link_filter['link']:
        
        # Turn the relative url into an absolute one.
        full_link = link.replace('./', 'https://pt.wikipedia.org/api/rest_v1/page/html/')
        
        # Skip the link if it has already been visited.
        if full_link in all_links:
            continue
        else:
            # Mark the link as visited.
            all_links.add(full_link)
            
            # Build the df for this link.
            df = get_df(full_link)
            
            # Keep only the links that belong to our internet (copy to avoid working on a slice).
            df = df[df['link'].isin(link_filter['link'])].copy()
            
            # Build the probability column (as a df).
            probs = find_probs(full_link, df) 
            
            # Outer join with main_frame.
            main_frame = pd.merge(main_frame, probs, how='outer', on='link')
            
    return main_frame.fillna(0)
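
The outer join followed by `fillna(0)` is what turns the per-page columns into one wide table. Here is a minimal sketch with two hypothetical pages, `./A` and `./B`, whose frames have the same layout as the output of `find_probs`.

```python
import pandas as pd

# Hypothetical per-page probability columns, shaped like find_probs output.
page_a = pd.DataFrame({'link': ['./B', './C'], './A': [0.5, 0.5]})
page_b = pd.DataFrame({'link': ['./C'], './B': [1.0]})

# The outer join keeps every link seen on either page;
# a link missing from one page shows up as NaN in that page's column.
merged = pd.merge(page_a, page_b, how='outer', on='link')

# A missing entry means "no link from that page", i.e. probability 0.
merged = merged.fillna(0)

print(merged)
```

Every new page adds one column, and rows are added only when a page links to something not seen before.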

### Acquisition procedure

In the lines below, we walk through the procedure used, based on the functions defined above.

First, we choose a _URL_.

In [6]:
url = 'https://pt.wikipedia.org/api/rest_v1/page/html/Álgebra_linear'

Next, we find the list of _links_ that will serve as the filter. Note that this call took 29.8 seconds of wall time.

In [7]:
%%time
links = get_links(url, 1)

CPU times: user 11 s, sys: 154 ms, total: 11.1 s
Wall time: 29.8 s


Finally, we can build our final _dataframe_. As the timing below shows, this call took roughly 1 h 26 min of wall time.

In [8]:
%%time
df = scraping_loop(url, links)

CPU times: user 1h 10min 46s, sys: 7.11 s, total: 1h 10min 53s
Wall time: 1h 25min 59s


Let us look at the _dataframe_.

In [9]:
df

index,link,./Álgebra_linear,./Processo_Penrose,./Visão_computacional,./Axiomas,./Michael_Green,./Complexidade_NP,./Instabilidade,./Experimento_de_Franck-Hertz,./Processo_empírico,...,./Grande_Teoria_Unificada,./Limites_e_colimites,./Órbita,./Galileu_Galilei,./Conjunto_causal,./Núcleo,./Regressão_não_linear,./Transformação_projetiva,./Grupoide_(teoria_das_categorias),./Peter_Higgs
0,./Ajuda:Controle_de_autoridade,0.005319,0.0,0.0,0.0,0.011905,0.0,0.0,0.0,0.0,...,0.0,0.0,0.006711,0.002427,0.0,0.0,0.0,0.0,0.0,0.007246
1,./Anel_(matemática),0.005319,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
2,./Análise_complexa,0.005319,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
3,./Análise_funcional,0.010638,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
4,./Análise_matemática,0.005319,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7055,./Lógica_em_ciência_da_computação,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
7056,./NAND,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
7057,./Proposições_simples,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
7058,./Soma_de_vetores,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000


To guarantee that the matrix is square, we can remove the leftover columns, that is, the columns whose page does not appear among the rows.

In [11]:
# Drop the link column from the df.
df2 = df.drop(columns = ['link'])
# Collect the columns whose page does not appear in the 'link' rows.
cols = list(df2.columns[~df2.columns.isin(df['link'])])

In [12]:
# Drop the leftover columns, if any. With an empty list this is a no-op,
# so `wiki` is always defined.
wiki = df.drop(columns = cols)
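
Dropping leftover columns makes the shape square, but it does not by itself put rows and columns in the same page order. If the downstream algorithm assumes that row i and column i describe the same page (an assumption; the text does not state it), one possible way to align them is sketched below on a hypothetical three-page frame named `toy`.

```python
import pandas as pd

# Hypothetical small frame in the same layout: a 'link' column plus one column per page.
toy = pd.DataFrame({
    'link': ['./A', './B', './C'],
    './C': [0.5, 0.5, 0.0],
    './A': [0.0, 1.0, 0.0],
    './B': [1.0, 0.0, 0.0],
})

# Use the link as the row label, then order the rows like the columns.
square = toy.set_index('link')
square = square.reindex(index=square.columns)

# Rows and columns now list the same pages in the same order.
print(list(square.index) == list(square.columns))
```

The values themselves are unchanged; only the row order is rearranged to follow the column order.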

So that we do not have to run the previous functions again, we can export our _dataframe_ in compressed form.

In [13]:
wiki.to_csv("datasets/wikipedia.gzip",compression='gzip', index=False)

Next, we create our `np.array`.

In [14]:
wiki_array = np.array(wiki.drop(columns = ['link']))

We check its dimensions.

In [15]:
wiki_array.shape

(7060, 7060)

For the same reason as before, we also export the _array_.

In [16]:
with open('datasets/wiki_db.npz', 'wb') as f:
    np.savez_compressed(f, wiki_array)

### Loading the data back

Using the data from the exports made earlier is very simple.

For `numpy`, follow the example below.

In [17]:
wiki_saved = np.load('datasets/wiki_db.npz')['arr_0']
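
The key `'arr_0'` comes from how `np.savez_compressed` names arrays passed positionally. The round trip can be tried without touching the disk, as in this sketch, which uses an in-memory buffer and a hypothetical small array instead of the real file.

```python
import io
import numpy as np

toy = np.eye(3)

# Save into an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
np.savez_compressed(buffer, toy)  # positional argument, stored under 'arr_0'

# Rewind the buffer and load the archive back.
buffer.seek(0)
loaded = np.load(buffer)

print(loaded.files)
restored = loaded['arr_0']
print(np.array_equal(restored, toy))
```

Passing the array as a keyword argument instead (for example `wiki=toy`) would store it under that name rather than `'arr_0'`.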

We can confirm that every column indeed sums to 1 or 0 (a column sums to 0 when its page has no links inside our set).

In [23]:
# Column sums
somatory = np.sum(wiki_saved,axis =0)

Below are the unique values. Note that 1 appears several times because of floating-point rounding, which will not affect our algorithms.

In [24]:
np.unique(somatory)

array([0., 1., 1., 1., 1., 1., 1.])
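
Since the sums differ from 1 only by rounding, a tolerance-based comparison gives a cleaner check than listing unique values. This is a minimal sketch on a hypothetical 3x3 matrix named `toy`; the same two `np.isclose` calls could be applied to the column sums computed above.

```python
import numpy as np

# Hypothetical 3x3 matrix whose columns sum to 1, except one all-zero column.
toy = np.array([
    [0.1, 0.0, 1/3],
    [0.2, 0.0, 1/3],
    [0.7, 0.0, 1/3],
])

col_sums = toy.sum(axis=0)

# Compare with a tolerance instead of exact equality,
# so sums that are off by a rounding error still count as 1.
sums_to_one = np.isclose(col_sums, 1.0)
sums_to_zero = np.isclose(col_sums, 0.0)

print(np.all(sums_to_one | sums_to_zero))
print(int(sums_to_zero.sum()), 'column(s) with no links inside the set')
```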

For `pandas`, it is similar. See:

In [None]:
df_saved = pd.read_csv('datasets/wikipedia.gzip', compression='gzip')