# PanelJam Web Scraping
## Collaborations Graph construction

***PanelJam.com*** is a small online community of artists where cartoons are published. Each cartoon (also called a jam) is made up of panels drawn by different users, so it is the result of a collaboration between several artists.

This script performs **Web Scraping**: it analyzes the HTML code of the web pages (without using any API) to collect information about collaborations between users. This information is used to model an undirected, weighted **Collaborations Graph**. If two users are connected, they worked together on at least one cartoon; the weight of the edge is the number of cartoons on which they worked together.
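
As a minimal sketch of how these edge weights arise, the example below counts shared jams for each pair of users with the standard library only. The jam identifiers and user names are made up for illustration, and `Counter` stands in for the graph that the notebook builds with networkx.

```python
from collections import Counter
from itertools import combinations

# Hypothetical data: each made-up jam identifier maps to the authors of that jam.
jams = {
    101: ["alice", "bob", "carol"],
    102: ["alice", "bob"],
    103: ["carol", "dave"],
}

# For each unordered pair of users, counts the jams they have in common.
weights = Counter()
for authors in jams.values():
    # sorted(set(...)) drops repeated names and gives each pair a fixed order
    for u, v in combinations(sorted(set(authors)), 2):
        weights[(u, v)] += 1

for (u, v), w in sorted(weights.items()):
    print(u, v, w)
```

Here `alice` and `bob` share two jams, so their edge would have weight 2, while every other pair would have weight 1.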

The libraries used are:
- ***requests***: a Python HTTP library, used to make HTTP requests easily.
- ***bs4***: BeautifulSoup is a Python library for pulling data out of HTML and XML files. It works with a parser of your choice to provide idiomatic ways of navigating, searching, and modifying the parse tree.
- ***networkx***: a Python library for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.
- ***pickle***: a standard library module which implements binary protocols for serializing and de-serializing a Python object structure.
- ***re***: the standard library module for regular expressions, used here to check the format of the jam links.

In [None]:
import requests
import bs4
import networkx as nx
import pickle
import re

The defined functions are:

- ***getJamsOnPage ( page )***: this function takes as input an integer, the number of a page in the jam listing on *PanelJam.com*. It returns a list containing the integer identifiers of the jams shown on that page.

In [None]:
def getJamsOnPage(page):
    #sends an HTTP request
    res = requests.get('https://www.paneljam.com/jams/?page=' + str(page))
    #parses the obtained page
    soup = bs4.BeautifulSoup(res.text, 'lxml')
    #searches all the a elements with class: strip-preview-click, containing jams information
    soupJams = soup.find_all("a", {"class": "strip-preview-click"})
    
    jamsList = []
    
    #adds jams identifiers to jamsList
    for i in soupJams:
        #checks that the retrieved link starts with the expected jam URL (dots escaped so they match literally)
        if re.search(r"^https://www\.paneljam\.com/jams/", i['href']):
            jamsList.append(int(i['href'].replace("https://www.paneljam.com/jams/","").replace("/panels/","")))
        
    return jamsList
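
The check and the two `replace` calls can also be combined into a single regular expression that captures the identifier directly. The sketch below assumes jam links have the form `https://www.paneljam.com/jams/<number>/panels/`, as the code above suggests; `parseJamId` is a hypothetical helper, not part of the original script.

```python
import re

# Assumed link format, inferred from the replace() calls in getJamsOnPage
JAM_LINK = re.compile(r"^https://www\.paneljam\.com/jams/(\d+)/panels/?$")

def parseJamId(href):
    """Returns the jam identifier as an int, or None if href is not a jam link."""
    match = JAM_LINK.match(href)
    return int(match.group(1)) if match else None

print(parseJamId("https://www.paneljam.com/jams/1234/panels/"))
print(parseJamId("https://www.paneljam.com/jams/new/"))
```

Unlike the `int(...)` conversion above, this version returns `None` for a link that starts with the jam URL but does not contain a numeric identifier, instead of raising a `ValueError`.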

- ***getAuthorsOfJam ( jam )***: this function takes as input the integer identifier of a jam, and returns a list containing the names of its authors.

In [None]:
def getAuthorsOfJam(jam):
    #sends an HTTP request
    res = requests.get('https://www.paneljam.com/jams/' + str(jam) + '/panels/')
    #parses the obtained page
    soup = bs4.BeautifulSoup(res.text, 'lxml')
    #takes the first div element with class: left, and searches all the a elements inside it, containing authors information
    soupAuthors = soup.find("div", {"class": "left"}).find_all("a")
    
    authorsList = []
    
    #adds authors names to authorsList
    for i in soupAuthors:
        authorsList.append(i['href'].replace('/',''))
    
    return authorsList
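
`getAuthorsOfJam` returns one name per link it finds. If the same user can appear more than once on a jam page (an assumption that has not been checked against the site), the returned list would contain duplicates, and pairing its elements would then produce a self-loop and count the same collaboration twice. A minimal sketch of an order-preserving de-duplication, with `uniqueAuthors` as a hypothetical helper:

```python
def uniqueAuthors(authors):
    """Removes repeated names while keeping the order of first appearance."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(authors))

# Made-up author names for illustration
print(uniqueAuthors(["ann", "bob", "ann", "cy"]))
```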

The following code collects the collaboration relationships between *PanelJam.com* users and builds a graph from them. The information is gathered by browsing the pages that list all the completed jams on *PanelJam.com* and retrieving the authors of each jam. Finally, the resulting graph is saved in a *collaborationGraph.pckl* file.

In [None]:
#loads the friendshipGraph previously built, to get the list of all PanelJam.com users
with open('friendshipGraph.pckl', 'rb') as f:
    friendshipGraph = pickle.load(f)

#creates an undirected graph, and puts as nodes the PanelJam.com users
G = nx.Graph()
G.add_nodes_from(list(friendshipGraph.nodes))

page = 1
while True:

    #print('Page: ' + str(page) + ', nodes: ' + str(len(G.nodes)) + ', edges: ' + str(len(G.edges)))

    #gets the jams shown on the current page
    jams = getJamsOnPage(page)

    #an empty page means that there are no more jams, so the search stops
    if not jams:
        break

    for jam in jams:

        #gets the authors of the jam, and adds them as nodes in the graph
        authors = getAuthorsOfJam(jam)
        G.add_nodes_from(authors)

        #for each pair of authors, adds the edge or increments its weight
        for i in range(len(authors)):
            for j in range(i + 1, len(authors)):
                u, v = authors[i], authors[j]
                if G.has_edge(u, v):
                    G[u][v]['weight'] += 1
                else:
                    G.add_edge(u, v, weight=1)

    #moves to the next page
    page += 1
            
#saves the graph
with open('collaborationGraph.pckl', 'wb') as f:
    pickle.dump(G, f)
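
The paging logic used above (request page 1, 2, 3, ... until a page comes back empty) can be isolated in a small generator, which makes it possible to test the stopping condition without sending any request. This is only a sketch: `iterJams`, `fakeFetch` and the identifiers in `fakePages` are hypothetical, and `fakeFetch` stands in for `getJamsOnPage`.

```python
def iterJams(fetchPage, firstPage=1):
    """Yields jam identifiers page by page until fetchPage returns an empty list."""
    page = firstPage
    while True:
        jams = fetchPage(page)
        if not jams:
            return
        yield from jams
        page += 1

# Stand-in for getJamsOnPage with made-up identifiers, so no request is sent
fakePages = {1: [11, 12], 2: [13]}

def fakeFetch(page):
    return fakePages.get(page, [])

print(list(iterJams(fakeFetch)))
```

With the real function, the main loop would read `for jam in iterJams(getJamsOnPage):`.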