# PanelJam Web Scraping
## Liked Jams Graph construction

***PanelJam.com*** is a small online community of artists where cartoons are published. Each cartoon (also called a jam) is made up of several panels drawn by different users, so it is the result of a collaboration among artists.

This script performs **Web Scraping**: it analyzes the HTML code of the web pages (without using any API) to get information about the jams liked by users. This information is used to model a directed, weighted **Liked Jams Graph**. An edge from one user to another means that the first user liked at least one jam on which the second one has worked; the weight of the edge is the number of such liked jams.
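As a small illustration of these edge semantics, the sketch below counts the weights with a plain `Counter` instead of a graph object. The usernames, the jam identifiers, and the `edge_weights` helper are hypothetical and are not part of the script.

```python
from collections import Counter

def edge_weights(liked_jams, jam_authors):
    """Count, for each (liker, author) pair, how many liked jams the author worked on.

    liked_jams maps a username to the jam ids that user liked;
    jam_authors maps a jam id to the usernames of its authors.
    Both mappings are made-up stand-ins for the scraped data.
    """
    weights = Counter()
    for user, jams in liked_jams.items():
        for jam in jams:
            for author in jam_authors[jam]:
                # a like on one's own jam does not create an edge
                if author != user:
                    weights[(user, author)] += 1
    return weights

# hypothetical data: alice liked jams 1 and 2, bob liked jam 2
liked_jams = {"alice": [1, 2], "bob": [2]}
jam_authors = {1: ["bob", "carol"], 2: ["bob", "alice"]}

weights = edge_weights(liked_jams, jam_authors)
print(weights[("alice", "bob")])    # 2
print(weights[("alice", "carol")])  # 1
print(weights[("bob", "alice")])    # 1
```

Each key is an ordered pair, so the edge from alice to bob is distinct from the edge from bob to alice, which is what makes the graph directed.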

The libraries used are:
- ***requests***: a Python HTTP library, used to easily make HTTP requests.
- ***bs4***: Beautiful Soup, a Python library for pulling data out of HTML and XML files. It works with a parser of your choice to provide idiomatic ways of navigating, searching, and modifying the parse tree. This script uses the `lxml` parser, so the lxml package must be installed as well.
- ***networkx***: a Python library for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.
- ***pickle***: a standard-library module that implements binary protocols for serializing and de-serializing a Python object structure.
- ***re***: a standard-library module for working with regular expressions.

In [None]:
import requests
import bs4
import networkx as nx
import pickle
import re

The functions defined are:

- ***getJamsLikedBy ( user )***: this function takes as input the username of a user, and returns a list containing the identifiers of the jams liked by that user.

In [None]:
def getJamsLikedBy(user):

    jamsList = []

    page = 1
    while True:

        #print("  Jams liked by " + user + ": page " + str(page))

        #requests one page of the user's liked jams
        res = requests.get('https://www.paneljam.com/' + str(user) + '/liked/?page=' + str(page))
        #parses the obtained page
        soup = bs4.BeautifulSoup(res.text, 'lxml')
        #finds all the a elements with class strip-preview-click, which link to the liked jams
        soupJams = soup.find_all("a", {"class": "strip-preview-click"})

        #an empty page means there are no more liked jams, so the search stops
        if not soupJams:
            break

        #extracts the integer identifier from each link and adds it to jamsList
        for a in soupJams:
            jamsList.append(int(a['href'].replace("/jams/", "").replace("/panels/", "")))

        #moves to the next page
        page += 1

    return jamsList
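The chained `replace` calls above assume that every link has exactly the form `/jams/<id>/panels/`; any other link would make `int()` raise a `ValueError`. Since `re` is already imported, a stricter variant could extract the identifier with a regular expression and skip links that do not match. This is only a sketch: `JAM_HREF` and `jam_id_from_href` are hypothetical names, and the link format is inferred from the code above rather than from any PanelJam documentation.

```python
import re

# hypothetical pattern for links such as "/jams/123/panels/"
JAM_HREF = re.compile(r"^/jams/(\d+)/panels/?$")

def jam_id_from_href(href):
    """Return the jam id as an int, or None if href is not a jam link."""
    match = JAM_HREF.match(href)
    return int(match.group(1)) if match else None

print(jam_id_from_href("/jams/123/panels/"))  # 123
print(jam_id_from_href("/someuser/"))         # None
```

Returning `None` for unexpected links lets the caller decide whether to skip them or to report them.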

- ***getAuthorsOfJam ( jam )***: this function takes as input the integer identifier of a jam, and returns a list containing the names of its authors.

In [None]:
def getAuthorsOfJam(jam):
    #sends an HTTP request
    res = requests.get('https://www.paneljam.com/jams/' + str(jam) + '/panels/')
    #parses the obtained page
    soup = bs4.BeautifulSoup(res.text, 'lxml')
    #finds the first div element with class left, which contains the links to the authors
    leftDiv = soup.find("div", {"class": "left"})

    #if the div is missing (e.g. the jam page is unavailable), no authors are returned
    if leftDiv is None:
        return []

    authorsList = []

    #adds the authors' names to authorsList, stripping the slashes from each profile link
    for a in leftDiv.find_all("a"):
        authorsList.append(a['href'].replace('/', ''))

    return authorsList
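If several users liked the same jam, the main loop of this script requests that jam's page once per user. One possible way to avoid the repeated requests is to memoize the lookup with `functools.lru_cache`. The sketch below replaces the real HTTP call with a hypothetical `fetch_authors` stand-in and made-up data, so it runs offline; `cached_authors` and `fetch_count` are likewise illustrative names.

```python
from functools import lru_cache

# counts how many times the (simulated) page is actually fetched
fetch_count = 0

def fetch_authors(jam):
    """Hypothetical stand-in for getAuthorsOfJam: no network access."""
    global fetch_count
    fetch_count += 1
    fake_pages = {1: ["bob", "carol"], 2: ["alice"]}
    return fake_pages.get(jam, [])

@lru_cache(maxsize=None)
def cached_authors(jam):
    # a tuple is returned so that callers cannot mutate the cached value
    return tuple(fetch_authors(jam))

# five lookups, but only two distinct jams
for jam in [1, 2, 1, 1, 2]:
    cached_authors(jam)

print(fetch_count)  # 2
```

The trade-off is memory: with `maxsize=None` every distinct jam stays in the cache for the lifetime of the process.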

The following code collects information about the jams liked by users on *PanelJam.com* and builds a graph from it. This information is obtained by browsing the users' profile pages and accessing their liked jams. Finally, the resulting graph is saved in a *likeGraph.pckl* file.

In [None]:
#imports the friendshipGraph previously built, to get the list of all PanelJam.com users
with open('friendshipGraph.pckl', 'rb') as f:
    friendshipGraph = pickle.load(f)
users = list(friendshipGraph.nodes)

#creates an empty directed graph
G = nx.DiGraph()

for u in users:
    
    #print("User " + str(users.index(u) + 1) + " of " + str(len(users)))
    
    #gets the jams liked by the current user
    jams = getJamsLikedBy(u)
    
    likedAuthors = []
    
    #gets the authors of the liked jams, and adds them to likedAuthors
    for j in jams:
        
        #print("  getting authors of jam " + str(jams.index(j) + 1) + " of " + str(len(jams)))
        
        authors = getAuthorsOfJam(j)        
        likedAuthors.extend(authors)
    
    #adds the current user and the authors of the liked jams as nodes of the graph
    G.add_nodes_from([u] + likedAuthors)
    
    #adds or updates edges between users
    for l in likedAuthors:
        
        #no edge is added when the current user is also the author of the jam
        if u != l:

            if G.has_edge(u, l):
                G[u][l]['weight'] += 1

            else:
                G.add_edge(u, l, weight=1)
            
#saves the graph
with open('likeGraph.pckl', 'wb') as f:
    pickle.dump(G, f)