# 2.2 The Dataset

This notebook runs sanity checks on the raw dataset described in section 2.2 of the dissertation.

## Setup

The `json_lines` library will be used to parse the data files.

In [10]:
from json_lines import reader as jl_reader

The friend data and the group data are each stored in a single file, whereas the review data and the review text data are each split across 524 numbered files.

In [30]:
DIR = '../data/raw/'
PATH_FRIENDS = DIR + 'friends.jl'
PATH_GROUPS  = DIR + 'groups.jl'
PATH_REVIEWS = DIR + 'reviews/review_page%d.jl'
PATH_TEXTS   = DIR + 'reviews_text/reviewtext_page%d.jl'
REVIEW_PAGES = 524

The following function extracts the unique integer ID from a Steam profile URL string. Storing the IDs as integers rather than strings reduces the program's memory use.

In [31]:
def profile_str_to_int(profile):
    # Drop the first 16 characters (the fixed prefix) and parse the rest as an integer.
    return int(profile[16:])
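
The slice above relies on the prefix being exactly 16 characters long. As a hedged alternative, the sketch below pulls out the trailing run of digits with a regular expression, so it does not depend on the prefix length. The name `profile_to_int_by_regex` is hypothetical and not part of the original code, and the sketch assumes the numeric ID is always the last run of digits in the string (optionally followed by a slash).

```python
import re

# Hypothetical alternative to profile_str_to_int (not from the original code).
# Assumption: the numeric ID is the final run of digits in the profile string,
# optionally followed by a single trailing slash.
_TRAILING_DIGITS = re.compile(r'(\d+)/?$')

def profile_to_int_by_regex(profile):
    match = _TRAILING_DIGITS.search(profile.strip())
    if match is None:
        # Fail loudly rather than silently returning a wrong ID.
        raise ValueError('no trailing numeric ID in %r' % profile)
    return int(match.group(1))
```

Raising `ValueError` on malformed input mirrors what `int()` does in the original function, so either version could be used inside `read_user_ids_from_file` without changing its error behaviour.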

## Sanity Checks

### User Consistency Check

This check iterates over every entry in all four sections of the dataset (friends, groups, reviews and review texts) and checks whether every user who appears in any one section appears in all four. For example, if the dataset contains a user's friend list, it should also contain their review, review text and group membership data.

In [22]:
def read_user_ids_from_file(path):
    """Return the set of integer user IDs found in one JSON Lines file."""
    user_ids = set()
    with open(path, 'rb') as f:
        for entry in jl_reader(f):
            user_ids.add(profile_str_to_int(entry['steamid']))
    return user_ids

# Friend and group data each live in a single file.
friend_uids = read_user_ids_from_file(PATH_FRIENDS)
group_uids = read_user_ids_from_file(PATH_GROUPS)
# Review and review text data are paginated; pages are numbered from 1.
review_uids = set()
text_uids = set()
for i in range(1, REVIEW_PAGES + 1):
    review_uids |= read_user_ids_from_file(PATH_REVIEWS % i)
    text_uids |= read_user_ids_from_file(PATH_TEXTS % i)

In [28]:
combined_uids = friend_uids & group_uids & review_uids & text_uids

In [29]:
print('Friends: ', len(friend_uids))
print('Groups:  ', len(group_uids))
print('Reviews: ', len(review_uids))
print('Texts:   ', len(text_uids))
print('Combined:', len(combined_uids))

Friends:  4000093
Groups:   4183277
Reviews:  4183276
Texts:    4183276
Combined: 4000033
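
Because `combined_uids` is a subset of `friend_uids`, the counts above imply that 4000093 − 4000033 = 60 users with friend data are missing from at least one other section. The counts alone do not say which section each user is missing from. A minimal sketch of how set differences could show this is given below. The helper `missing_per_section` and the toy data are hypothetical and are not part of the original analysis.

```python
def missing_per_section(sections):
    """Map each section name to the IDs seen in some section but absent from it."""
    all_uids = set().union(*sections.values())
    return {name: all_uids - uids for name, uids in sections.items()}

# Toy data for illustration only; these IDs are not taken from the dataset.
toy_sections = {
    'friends': {1, 2, 3},
    'groups': {1, 2, 3, 4},
    'reviews': {1, 2, 4},
    'texts': {1, 2, 4},
}

for name, missing in missing_per_section(toy_sections).items():
    print(name, sorted(missing))
```

With the real data, the same helper could be called with `{'friends': friend_uids, 'groups': group_uids, 'reviews': review_uids, 'texts': text_uids}`.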
