Assignment:
    https://github.com/datsoftlyngby/dat4sem2020spring-python/blob/master/Facebook_exercise.ipynb
    
Facebook Assignment

This assignment requires you to work with Facebook network data, 
data preprocessing and networkx. Note that this is real data 
from real people!

Part 1: Preparing data

The dataset you will be working with is available here: 
    https://snap.stanford.edu/data/egonets-Facebook.html

Your first job is to

    Download the data
    Unpack the data
    Import the data as an undirected graph in networkx

This should all be done from your notebook in Python. 
Automating these steps is an important part of data preprocessing.

Note: this could take a while, so if you feel adventurous you 
    can use the multiprocessing library to speed things up.

Hand-in:

    The code for downloading, unpacking and loading the dataset
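
The note about the multiprocessing library could be approached roughly as follows. This is only a sketch: `parse_chunk` and `parse_parallel` are hypothetical helper names, the sample lines stand in for the real file, and for a file of this size the process start-up cost may well outweigh any gain.

```python
from concurrent.futures import ProcessPoolExecutor


def split_list(items, parts):
    """Split items into `parts` contiguous chunks of near-equal size."""
    n = len(items)
    return [items[i * n // parts:(i + 1) * n // parts] for i in range(parts)]


def parse_chunk(lines):
    """Turn lines such as '0 1' into (int, int) tuples, skipping bad lines."""
    edges = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0].isdigit() and fields[1].isdigit():
            edges.append((int(fields[0]), int(fields[1])))
    return edges


def parse_parallel(lines, workers=2):
    """Parse the chunks in separate processes and concatenate the results."""
    with ProcessPoolExecutor(workers) as ex:
        parsed = list(ex.map(parse_chunk, split_list(lines, workers)))
    return [edge for chunk in parsed for edge in chunk]


if __name__ == "__main__":
    # Hypothetical sample lines standing in for facebook_combined.txt
    sample = ["0 1\n", "0 2\n", "1 2\n", "broken line\n", "2 3\n"]
    print(parse_parallel(sample))
```

The worker function is defined at module level so it can be pickled and sent to the worker processes; the `__main__` guard keeps the pool from being started again when a worker re-imports the module.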



In [59]:
import tarfile
from tqdm import tqdm
import wget
import gzip
import shutil
import os
import numpy as np 
import networkx as nx
import matplotlib.pyplot as plt
from networkx.drawing.nx_agraph import graphviz_layout, write_dot
import pygraphviz

def download_file(url, file_name="data_file"):
    """download and save file"""
    print('Beginning file download with wget module')
    wget.download(url, './' + file_name)

    
def unpack_tar(file_name):
    """unpack tar.gz file"""
    if (file_name.endswith("tar.gz")):
        tar = tarfile.open(file_name)
        tar.extractall()
        tar.close()
        print("Extracted in Current Directory")
    else:
        print ("Not a tar.gz file: '%s '" % file_name)

        
def unpack_gz(file_name, file_type='.txt'):
    """unpack gz file"""
    with gzip.open(file_name, 'rb') as f_in:
        # strip the .gz extension and append the wanted file type
        raw_file_name = os.path.splitext(file_name)[0]
        print(raw_file_name + file_type)
        with open(raw_file_name + file_type, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
        
        
def get_files_from_folder(folder):
    """return list of files in folder. Not used in this exercise"""
    from os import listdir
    from os.path import isfile, join
    return [f for f in listdir(folder) if isfile(join(folder, f))]
        
    
def load_data(file):
    """load data into a list of tuples"""
    result = []
    with open(file, "r") as fp:
        for line in fp:
            tmp = line.split()
            try:
                result.append((int(tmp[0]), int(tmp[1])))
            except (IndexError, ValueError):
                # skip blank or malformed lines
                continue

    return result


def create_graph(edges):
    """build an undirected graph from a list of (node, node) tuples"""
    graph_temp = nx.Graph()
    # add_edges_from creates the endpoint nodes automatically,
    # so the nodes do not have to be added one by one
    graph_temp.add_edges_from(edges)
    return graph_temp
    
url = 'https://snap.stanford.edu/data/facebook_combined.txt.gz'

#download_file(url, "facebook_combined.gz")
#unpack_gz("facebook_combined.gz")
dataset = load_data("facebook_combined.txt")
#print(dataset)
print(len(dataset))
graph = create_graph(dataset)

88234
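
The loader above parses the file by hand. As an alternative sketch, networkx can read a gzip-compressed edge list directly, which covers the unpack and import steps in one call. The helper names `fetch` and `load_undirected` are made up for illustration, and the demo uses a small hand-written file rather than the real download.

```python
import gzip
import os
import tempfile
import urllib.request

import networkx as nx

URL = "https://snap.stanford.edu/data/facebook_combined.txt.gz"


def fetch(url, path):
    """Download url to path unless the file is already there."""
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path


def load_undirected(path):
    """Read a whitespace-separated edge list (plain or .gz) as an undirected graph."""
    return nx.read_edgelist(path, nodetype=int, create_using=nx.Graph)


# Small offline demo with a hand-written edge list instead of the real download
demo_path = os.path.join(tempfile.mkdtemp(), "demo_edges.txt.gz")
with gzip.open(demo_path, "wt") as fh:
    fh.write("0 1\n0 2\n1 2\n2 3\n")

demo = load_undirected(demo_path)
print(demo.number_of_nodes(), demo.number_of_edges())
```

For the real dataset the call would presumably look like `load_undirected(fetch(URL, "facebook_combined.txt.gz"))`.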


Part 2: Analyse the data

Now, let's take a look at the network you imported.

By node degree we mean the number of edges to and from a node. The two cases differ: in an undirected network every edge counts toward the degree of both its endpoints, so in-degree and out-degree coincide, whereas in a directed network they can differ.

By graph degree we mean the number of edges in the entire network.

Hand in code that displays:

    The number of nodes in the network
    The number of edges in the network
    The average degree in the network
    A visualisation of the network inside your notebook
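
To make the degree definitions concrete, here is a minimal sketch on a hypothetical four-node edge list, using only the standard library. In an undirected network every edge adds one to the degree of each of its two endpoints, so the degrees sum to twice the number of edges and the average degree is 2E / N.

```python
from collections import Counter

# Hypothetical toy edge list; each pair is one undirected friendship
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

# Count how many edges touch each node
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

n_nodes = len(degree)
n_edges = len(edges)
average_degree = sum(degree.values()) / n_nodes

# The degrees always sum to twice the number of edges
assert sum(degree.values()) == 2 * n_edges
print(n_nodes, n_edges, average_degree)  # prints 4 4 2.0
```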



In [92]:
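undirected = nx.Graph(dataset)  # the assignment asks for an undirected graph, so build one from the edge list
n, e = undirected.number_of_nodes(), undirected.number_of_edges()
print("The number of nodes in the network: ", n)
print("The number of edges in the network: ", e)
# every edge adds 1 to the degree of two nodes, so the average degree is 2E / N
print("The average degree in the network: ", round(2 * e / n, 2))



The number of nodes in the network:  4039
The number of edges in the network:  88234
The average degree in the network:  43.69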


In [None]:
from concurrent.futures import ProcessPoolExecutor
import multiprocessing


def split_list(alist, wanted_parts=1):
    length = len(alist)
    return [ alist[i*length // wanted_parts: (i+1)*length // wanted_parts] 
             for i in range(wanted_parts) ]
    
def run_in_parallel(func, args, workers=multiprocessing.cpu_count()):
    """apply func to every item in args using a pool of worker processes"""
    with ProcessPoolExecutor(workers) as ex:
        res = ex.map(func, args)
    return list(res)
    
def draw_graph(graph):
    nx.draw(graph, pos=graphviz_layout(graph), 
            node_size=30, width=.05, cmap=plt.cm.Blues, 
            with_labels=True, node_color=range(len(graph)))

print("A visualisation of the network inside your notebook")
#draw_graph(graph)
## Note: this will probably not finish in this lifetime. Could not get multiprocessing to work
nx.draw(graph)
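
Drawing all edges at once is what makes the cell above so slow. One possible workaround, sketched here under assumptions, is to draw only the subgraph induced by the best-connected nodes and to use `nx.spring_layout` so that graphviz is not needed. The helper names and the default of 300 nodes are arbitrary choices for illustration.

```python
import networkx as nx


def top_degree_subgraph(graph, k):
    """Return the subgraph induced by the k highest-degree nodes."""
    ranked = sorted(graph.degree, key=lambda pair: pair[1], reverse=True)
    keep = [node for node, _ in ranked[:k]]
    return graph.subgraph(keep)


def draw_sample(graph, k=300):
    """Draw only the k best-connected nodes; k=300 is an arbitrary choice."""
    # Imported here so the selection helper above works without matplotlib
    import matplotlib.pyplot as plt

    sub = top_degree_subgraph(graph, k)
    pos = nx.spring_layout(sub, seed=0)  # no graphviz needed
    nx.draw(sub, pos=pos, node_size=30, width=0.05, with_labels=False)
    plt.show()


# Toy graph to show the selection step
toy = nx.Graph([(0, 1), (0, 2), (0, 3), (1, 2), (3, 4)])
sample = top_degree_subgraph(toy, 4)
print(sorted(sample.nodes()), sample.number_of_edges())
```

In the notebook, `draw_sample(graph)` would then render the reduced picture. It shows the core of the network rather than the whole of it, which is a trade-off rather than a full answer to the hand-in.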


Part 3: Find the most popular people

We're naturally interested in who has the most friends, so we want to extract top 10. That is, the 10 most connected people.

Hand-in:

    Code that extracts and reports the 10 people with the most connections in the network



In [81]:
deg_vec = np.array([graph.degree(n) for n in graph.nodes()])
print("10 people with the most connections in the network:")
# sort the degrees in descending order and keep the ten largest
for idx, person in enumerate(sorted(deg_vec, reverse=True)[:10]):
    print("#" + str(idx+1), person)


10 people with the most connections in the network:
#1 1045
#2 792
#3 755
#4 547
#5 347
#6 294
#7 291
#8 254
#9 245
#10 235
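
The listing above shows the ten highest degrees but not which person each one belongs to. A minimal sketch that keeps the node id next to its degree could look like this; `most_connected` is a hypothetical helper name and the toy graph stands in for the real network.

```python
import heapq

import networkx as nx


def most_connected(graph, k=10):
    """Return the k (node, degree) pairs with the highest degree."""
    return heapq.nlargest(k, graph.degree, key=lambda pair: pair[1])


# Toy graph; node 0 has three friends
toy = nx.Graph([(0, 1), (0, 2), (0, 3), (1, 2)])
for rank, (node, degree) in enumerate(most_connected(toy, k=2), start=1):
    print(f"#{rank} node {node}: {degree} connections")
```

Calling `most_connected(graph)` on the Facebook graph would report the node ids together with their connection counts.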
