# 2. Dataset Preparation

This notebook converts the dataset from JSON Lines to CSV.

## Setup

The `json_lines` library is used to read the raw data files, and the `csv` module is used to write the prepared ones.

In [1]:
from csv import writer as csv_writer
from json_lines import reader as jl_reader

The prepared friend and group data are each written to a single CSV file, while the review data is split across multiple files, each containing at most 300,000 rows.

In [2]:
# raw data
DIR_RAW  = '../data/raw/'
PATH_FRIENDS_RAW = DIR_RAW + 'friends.jl'
PATH_GROUPS_RAW  = DIR_RAW + 'groups.jl'
PATH_REVIEWS_RAW = DIR_RAW + 'reviews/review_page%d.jl'
PATH_TEXTS_RAW   = DIR_RAW + 'reviews_text/reviewtext_page%d.jl'
REVIEW_PAGES = 524
# prepared data
DIR_PREP = '../data/prepared/'
PATH_FRIENDS_PREP = DIR_PREP + 'friends.csv'
PATH_GROUPS_PREP  = DIR_PREP + 'groups.csv'
PATH_REVIEWS_PREP = DIR_PREP + 'reviews/%d.csv'
ROWS_PER_PAGE = 300000
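
The paging scheme implied by `PATH_REVIEWS_PREP` and `ROWS_PER_PAGE` can be illustrated on a toy input. This is a minimal sketch, not the code used below: the helper name `split_into_pages`, the 7-row input, and the page size of 3 are all assumptions chosen for illustration.

```python
def split_into_pages(rows, rows_per_page):
    """Yield (page_num, page_rows) pairs, numbering pages from 1."""
    for start in range(0, len(rows), rows_per_page):
        page_num = start // rows_per_page + 1
        yield page_num, rows[start:start + rows_per_page]

# Hypothetical example: 7 rows, 3 rows per page
rows = [[i] for i in range(7)]
pages = list(split_into_pages(rows, rows_per_page=3))
for page_num, page_rows in pages:
    # same '%d' path template style as PATH_REVIEWS_PREP
    print('reviews/%d.csv' % page_num, len(page_rows))
```

Only the last page may hold fewer rows than the page size.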

Helper functions for converting user IDs and group names to simpler formats:

In [3]:
def profile_str_to_int(profile):
    """Drop the fixed 16-character prefix and parse the rest as an integer."""
    return int(profile[16:])

In [4]:
def group_str_to_name(group):
    """Drop the fixed 34-character prefix, keeping only the group name."""
    return group[34:]
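
The offset of 34 matches the length of the prefix `https://steamcommunity.com/groups/`, so the raw group strings are presumably full group URLs. That URL shape is an assumption, not something the raw files confirm here. Under that assumption, a more defensive variant could check the prefix before slicing. The names `GROUP_PREFIX` and `group_url_to_name` and the example URL are hypothetical.

```python
# Assumed URL shape (not confirmed by the raw data)
GROUP_PREFIX = 'https://steamcommunity.com/groups/'

def group_url_to_name(url):
    """Hypothetical variant of group_str_to_name that validates the prefix."""
    if not url.startswith(GROUP_PREFIX):
        raise ValueError('unexpected group URL: %r' % url)
    return url[len(GROUP_PREFIX):]

# made-up group name, for illustration only
name = group_url_to_name('https://steamcommunity.com/groups/examplegroup')
print(len(GROUP_PREFIX), name)
```

Slicing by a fixed offset is faster but silently returns garbage if a string has a different prefix, whereas the check above fails loudly.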

Helper function for writing data to CSV files:

In [5]:
def write_buffer(buffer, path, page_num=None):
    """Write the rows in `buffer` to a CSV file at `path`."""
    # fill in the page number for paged output paths
    if page_num is not None:
        path = path % page_num
    with open(path, 'w+', encoding='utf-8', newline='') as f:
        writer = csv_writer(f, delimiter=',')
        writer.writerows(buffer)
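
Review text can contain commas, quotes, and line breaks, so it is worth checking that such rows survive a round trip through `csv`. The sketch below writes two made-up rows to a temporary file with the same `open` arguments as above and reads them back. The row values and the file name `sample.csv` are illustrative assumptions.

```python
import csv
import os
import tempfile

# made-up rows: [user_id, appid, voted_up, text]
rows = [
    [0, 440, 1, 'Great game, would play again'],
    [1, 570, 0, 'Line one\nline two with "quotes"'],
]

path = os.path.join(tempfile.mkdtemp(), 'sample.csv')
# newline='' lets the csv module control line endings itself
with open(path, 'w', encoding='utf-8', newline='') as f:
    csv.writer(f, delimiter=',').writerows(rows)

with open(path, encoding='utf-8', newline='') as f:
    read_back = list(csv.reader(f))

print(read_back[1][3])
```

Note that every value comes back as a string, so numeric columns need to be converted again when the prepared files are loaded.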

## Preparation Functions

### Reviews

In [6]:
def prepare_review_data():
    # map of Steam IDs to custom counter
    user_map = {}
    user_map_counter = 0
    # buffer and page setup
    buffer = []
    out_page_num = 1
    for in_page_num in range(1, REVIEW_PAGES + 1):
        # open files
        fr = open(PATH_REVIEWS_RAW % in_page_num, 'rb')
        ft = open(PATH_TEXTS_RAW % in_page_num, 'rb')
        # iterate over both files simultaneously
        for ir, it in zip(jl_reader(fr), jl_reader(ft)):
            # convert Steam ID to counter value
            user_int = profile_str_to_int(ir['steamid'])
            if user_int not in user_map:
                user_map[user_int] = user_map_counter
                user_map_counter += 1
            user_id = user_map[user_int]
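            # DEBUG: only keep a hand-picked sample of users for inspection;
            # remove this filter to prepare the full dataset
            if user_id not in [28, 17290, 126400, 261875, 280165, 1518820, 1561787, 1843165, 2352954, 3007801, 3431419, 4030277]:
                continue
            for dr, dt in zip(ir['reviews'], it['reviews']):
                data = [
                    user_id,
                    int(dr['appid']),
                    int(dr['voted_up']),
                    int(dr['early_access']),
                    dr['playtime_forever'],
                    dr['playtime_forever'],     # playtime at review (default: total)
                    int(dr['tstamp_created']),
                    int(dr['tstamp_created']),  # update time (default: creation time)
                    dr['votes_up'],
                    dr['votes_funny'],
                    dt['text']
                ]
                # override the defaults when the optional fields are present
                if 'playtime_atreview' in dr:
                    data[5] = dr['playtime_atreview']
                if 'tstamp_updated' in dr:
                    data[7] = int(dr['tstamp_updated'])
                buffer.append(data)
                # DEBUG: show user, vote, and review text for the sampled users
                print(user_id, data[2], data[-1])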
        # write buffer(s)
        while len(buffer) >= ROWS_PER_PAGE:
            write_buffer(buffer[:ROWS_PER_PAGE], PATH_REVIEWS_PREP,
                         page_num=out_page_num)
            buffer = buffer[ROWS_PER_PAGE:]
            out_page_num += 1
        # close files
        fr.close()
        ft.close()
    # write the remaining buffer
    if len(buffer) > 0:
        write_buffer(buffer, PATH_REVIEWS_PREP,
                     page_num=out_page_num)
    return user_map
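
The `user_map` logic above assigns each Steam ID a compact integer in order of first appearance. The same idea can be sketched in isolation. The helper `make_id_mapper` and the sample IDs are hypothetical, and this variant uses `len(mapping)` as the next ID instead of a separate counter variable.

```python
def make_id_mapper():
    """Return a dict and a function that maps raw IDs to 0, 1, 2, ..."""
    mapping = {}

    def to_compact(raw_id):
        # the next free compact ID equals the number of IDs seen so far
        if raw_id not in mapping:
            mapping[raw_id] = len(mapping)
        return mapping[raw_id]

    return mapping, to_compact

mapping, to_compact = make_id_mapper()
# made-up raw IDs; 9001 appears twice and keeps its first compact ID
compact = [to_compact(raw_id) for raw_id in [9001, 42, 9001, 7]]
print(compact, mapping)
```

Because IDs are handed out in order of first appearance, the mapping depends on the order in which the raw review pages are read.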

### Friends

In [108]:
def prepare_friend_data(user_map):
    data = []
    with open(PATH_FRIENDS_RAW, 'rb') as f:
        for item in jl_reader(f):
            user_int = profile_str_to_int(item['steamid'])
            if user_int not in user_map: continue
            row = [user_map[user_int]]
            for friend in item['ids']:
                friend_int = profile_str_to_int(friend)
                if friend_int not in user_map: continue
                row.append(user_map[friend_int])
            data.append(row)
    write_buffer(data, PATH_FRIENDS_PREP)

### Groups

In [109]:
def prepare_group_data(user_map):
    group_map = {}
    group_map_counter = 0
    data = []
    with open(PATH_GROUPS_RAW, 'rb') as f:
        for item in jl_reader(f):
            user_int = profile_str_to_int(item['steamid'])
            if user_int not in user_map: continue
            row = [user_map[user_int]]
            for group in item['urls']:
                group_name = group_str_to_name(group)
                if group_name not in group_map:
                    group_map[group_name] = group_map_counter
                    group_map_counter += 1
                row.append(group_map[group_name])
            data.append(row)
    write_buffer(data, PATH_GROUPS_PREP)
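
Unlike the review files, the friend and group files have rows of varying length: the first value is the user ID and the remaining values are the IDs of that user's friends or groups. A minimal sketch of reading such a file back into a dictionary follows. The in-memory sample data and the name `adjacency` are assumptions for illustration.

```python
import csv
import io

# made-up file contents: user 0 has two entries, user 1 none, user 2 one
sample = io.StringIO('0,3,5\n1\n2,0\n')

adjacency = {}
for row in csv.reader(sample):
    # first column is the user ID, the rest are friend or group IDs
    user_id, *others = map(int, row)
    adjacency[user_id] = others

print(adjacency)
```

Reading the rows with `csv.reader` avoids the problems a fixed-width table loader would have with rows of unequal length.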

## Preparing the Data

In [7]:
user_map = prepare_review_data()

28 1 How the hell does some other reviewer have 86 hours...the game just came out?
28 1 This game has a LOT of potential. Crazy how only one person has made this so far. There were a few things I'm not sure was intentional or not like being able to fly around the map by spamming SHIFT + A + S + SPACE or something like that. I think it made the bosses easier because I was able to just shoot a lot and then SHIFT into some geometry and fly across the map but it was still a lot of fun. Can't wait for the full release! Should probably let you know that this is more of a Demo than a full game. I was able to finish it in about 50 minutes. Still worth picking it up.
28 1 Are you seriously not going to buy the new Half-Life game that took 13 years to release?
28 1 Great VR game! Should definitely be in everyone's VR list even if they play it rarely. There are two issues I've noticed though: 1) Hands often get stuck in game because I didn't realise they were behind a wall or something. I don't b

In [111]:
prepare_friend_data(user_map)

In [112]:
prepare_group_data(user_map)