### From relational to graph databases
Due to the poor performance of our relational database in answering graph-like questions, we may
want to move our tabular data into a graph format.
First, we will consider a sensible graph schema for our data, based on the information we have available,
before writing a pipeline to move data from MySQL into a Python igraph network. By doing this, we
can benchmark how a graphical approach to our path-based question performs, in comparison to
the same question we answered with SQL.

#### Schema design
In our tables, we have two types of entities, users and games, which have different properties. Because
of this, it is wise to consider users and games as different node types.
For users, we only have a unique ID for each user. To add data to an igraph graph, we will need to add
an increasing integer igraph node ID for each distinct node, as we learned in Chapter 1, Introducing
Graphs in the Real World, and Chapter 2, Working with Graph Data Models.

In [44]:
from graphtastic.database.mysql import query_mysql

In [46]:
play_query = "SELECT id, game_name, hours FROM steam_play;"

play_data = query_mysql(play_query, password = "")
print(play_data[:10])

NameError: name 'getpass' is not defined

In [5]:
purchase_query = "SELECT id, game_name FROM steam_purchase;"
purchase_data = query_mysql(purchase_query, password="")
print(purchase_data[:10])

[('151603712', 'Fallout 4'), ('151603712', 'Spore'), ('151603712', 'Fallout New Vegas'), ('151603712', 'Left 4 Dead 2'), ('151603712', 'HuniePop'), ('151603712', 'Path of Exile'), ('151603712', 'Poly Bridge'), ('151603712', 'Left 4 Dead'), ('151603712', 'Team Fortress 2'), ('151603712', 'Tomb Raider')]


In play_data, we have information on users, the games they have played, and the time they
have spent playing each game. In purchase_data, we only need users and the games they
have purchased.

Next, as stated in the process of schema design, to use igraph, we will need to add an increasing
integer igraph node ID to both the User and Game nodes, starting from 0.

In [10]:
users = set([row[0] for row in play_data] + [row[0] for row in purchase_data])
user_ids = {user_id: igraph_id for igraph_id, user_id in enumerate(users)}
print(len(user_ids))

12393


The following lines of code will generate a set() of unique user IDs and users in both the `play_data` and `purchase_data` list. 

Then, a combination of a dictionary comprehension and the enumerate() method will generate a dictionary with keys containing Steam users IDs, and values containing our igraph IDs. 

In [11]:
games = set([row[1] for row in play_data] + [row[1] for row in purchase_data])
game_ids = {user_id: igraph_id for igraph_id, user_id in enumerate(games, len(user_ids))}
print(len(game_ids))

5155


As with users, our print statement shows the number of unique games in our datasets, 5155. Using enumerate() with the second parameter should avoid any ID conflicts between users and games. 

In [12]:
print(sorted(user_ids.values(), reverse=True)[:10])
print(sorted(game_ids.values(), reverse=False)[:10])

[12392, 12391, 12390, 12389, 12388, 12387, 12386, 12385, 12384, 12383]
[12393, 12394, 12395, 12396, 12397, 12398, 12399, 12400, 12401, 12402]


This shows that the highest generated ID for users is 12,392, while the lowest generated ID for games is 12393, as expected. 

In [13]:
all_ids = sorted(list(user_ids.values()) + list(game_ids.values()))
assert all_ids == list(range(len(all_ids)))

Because this assert statement raises no exceptions, we can be confident that our generated IDs have been created correctly.

##### Building Graph 

In [14]:
import igraph as ig
g = ig.Graph(directed=True)

In [15]:
users_ids =  dict(sorted(user_ids.items(), key=lambda item: item[1]))
game_ids = dict(sorted(game_ids.items(), key=lambda item: item[1]))

# We can now take the keys from these dictionaries and convert them into lists, ready  to be added as propertiess
steam_user_ids = list(user_ids.keys())
steam_game_ids = list(game_ids.keys())

In [17]:
g.add_vertices(len(steam_user_ids) + len(steam_game_ids))
assert len(g.vs) == len(steam_user_ids) + len(steam_game_ids)

In [19]:
all_steam_ids = steam_user_ids + steam_game_ids
#print(all_steam_ids)

In [20]:
# Let's also use list comprehension to create a list containing our nodes types
node_types = ['user' for _ in steam_user_ids] + ['game' for _ in steam_game_ids]

With our lists prepared, we can now add properties listwise to all the nodes in our graph by
accessing the vs attribute of our igraph Graph() object:

In [23]:
g.vs['steam_id'] = all_steam_ids
g.vs['type'] = node_types

In [25]:
print(g.vs['steam_id'][:10])
print(g.vs['type'][:10])

game_nodes = g.vs.select(type_eq='game')

print(len(game_nodes))

['208513774', '130931340', '188250158', '127461395', '66748534', '27168078', '159424645', '152741550', '229203462', '158959728']
['user', 'user', 'user', 'user', 'user', 'user', 'user', 'user', 'user', 'user']
5155


Next, we need to add edges to our graph. Edges are contained in both the data from steam_purchase and steam_play, now contained in the purchase_data and play_data variables.

Let's generate the edges for both of these types of transactions by finding the igraph IDs for the users and games.

In [26]:
purchase_edges = [[user_ids[user], game_ids[purchase]] for user, purchase in purchase_data]

play_edges = [[user_ids[user], game_ids[game], hours] for user, game, hours in play_data]

For play_edges, the number of hours is also included, so we can add this to our graph as edge properties. Let's add the PLAYED edges first along with thier hoursr attribute, in listwise fashion, using more list comprehensions.

In [27]:
g.add_edges([(n, m) for n, m, _ in play_edges])
g.es['hours'] = [hours for _, _, hours in play_edges]

Now, we can add the edges representing the purchased relationships. There are no attributes to add to our dataset that are specifically related to a game's purchase, so we can just use the following 

In [28]:
g.add_edges(purchase_edges)

Finally, to complete this graph, we can add edge_type as an edge attribute to all edges

In [29]:
edge_type = ['PLAYED' for _ in play_edges] + ['PURCHASED' for _ in purchase_edges]
g.es['edge_type'] = edge_type

In [30]:
user_id_ex = g.vs.select(steam_id_eq='151603712')[0].index
purchased_ex = g.es.select(_source_eq=user_id_ex, edge_type='PURCHASED')
print(len(list(purchased_ex)))

39


In [33]:
paths = g.get_all_simple_paths(user_id_ex, cutoff=3, mode='all')
print(paths[:10])

[[4240, 12498], [4240, 12498, 1592], [4240, 12498, 1592, 12608], [4240, 12498, 1592, 13018], [4240, 12498, 1592, 13277], [4240, 12498, 1592, 13400], [4240, 12498, 1592, 13474], [4240, 12498, 1592, 13579], [4240, 12498, 1592, 14657], [4240, 12498, 1592, 14728]]


With our print() statement, we can take a look at some of the patths that have been found. Our paths variable contains the igraph IDs of nodes that are traversed from our original user node as a list of lists.

In [35]:
rec_game_ids = [path[3] for path in paths if len(path) == 4]
print(rec_game_ids[:4])

[12608, 13018, 13277, 13400]


In [37]:
game_names = [g.vs[game_id]['steam_id'] for game_id in rec_game_ids]
print(game_names[:4])

['The Mighty Quest For Epic Loot', 'Half-Life Blue Shift', 'Ricochet', 'Rise of Incarnates']


Now, we have a list of igraph IDs for nodes representing our games. 

In [38]:
neighbors = g.neighbors(user_id_ex)
purchased_games = [g.vs[node_id]['steam_id'] for node_id in g.neighbors(user_id_ex)]
print(purchased_games[:3])

['Eldevin', 'Eldevin', 'BioShock 2']


We found the corresponding game names with a similar list comprehension as we used in the previous code snippet, looking for steam_id values with g.vs

In [39]:
games_names = [game for game in game_names if game not in purchased_games]

In [40]:
from collections import Counter
game_frequency = Counter(game_names)
print(game_frequency)



Looking at the start of the printed counter() object, we can see the top three games we might want to recommend to our user are Counter-Strike Global offensive The Elder Scrolls V Skyrim, Left 4 Dead 2