<h1>Predicting Gender from Foursquare Attributes</h1>
<h2>MS&E 234</h2>

<h3>Setup</h3>

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from scipy.sparse import csc_matrix
from typing import List

Download Dingqi Yang's 'User Profile Dataset' and 'Global-scale Check-in Dataset with User Social Networks' from the [Foursquare dataset page](https://sites.google.com/site/yangdingqi/home/foursquare-dataset) and place them in this directory, so that the paths below resolve.

In [2]:
# User Profile Paths
user_profile_dir = 'dataset_UbiComp2016/'
nyc = 'dataset_UbiComp2016_UserProfile_NYC.txt'
tky = 'dataset_UbiComp2016_UserProfile_TKY.txt'

# Social Network Paths
social_network_dir = 'dataset_WWW2019/'
friendships = 'dataset_WWW_friendship_new.txt'
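
Because every later cell depends on these files being in place, a quick existence check can save a confusing parse error. The following is a minimal sketch, not part of the original notebook: the helper name `missing_files` is hypothetical, and the paths are simply the directory and file names from the cell above joined together.

```python
from pathlib import Path

def missing_files(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [p for p in paths if not Path(p).is_file()]

# Same directory and file names as in the path cell above.
expected = [
    'dataset_UbiComp2016/dataset_UbiComp2016_UserProfile_NYC.txt',
    'dataset_UbiComp2016/dataset_UbiComp2016_UserProfile_TKY.txt',
    'dataset_WWW2019/dataset_WWW_friendship_new.txt',
]

missing = missing_files(expected)
if missing:
    print('Missing dataset files:', missing)
```

If the list printed is non-empty, the archives were probably extracted under different folder names than the ones assumed here.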

<h3>Preprocessing</h3>

The following functions parse the user profile and friendship data. They return the gender of each user and the adjacency matrix of the social network.

In [3]:
def parse_genders(filenames):
    """ Parse user profile datasets for gender.
        
        Parameters:
        - filenames = list of paths to user profile datasets
        Returns:
        - genders = dict mapping each user_id to 1 for 'male' and 0 for any other value
    """
    genders = {}
    # 'U16' is a fixed-width Unicode string of up to 16 characters.
    dtype = [('user_id', np.int32), ('gender', 'U16')]
    for f in filenames:
        # Only the first two columns (user ID and gender) are read.
        parsed = np.genfromtxt(f, usecols=(0, 1), dtype=dtype)
        genders.update({int(user_id): int(gender == 'male') for (user_id, gender) in parsed})
    return genders

genders = parse_genders([user_profile_dir + nyc, user_profile_dir + tky])
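
Since the labels are binary, it can help to look at the class balance before training anything: the majority-class share is the accuracy a trivial predictor would reach. The sketch below is an illustration under assumptions, not part of the original notebook. The helper `class_balance` is hypothetical, and `sample_genders` is made-up data in the same `{user_id: label}` shape that `parse_genders` returns.

```python
from collections import Counter

# Made-up sample in the {user_id: label} shape returned by parse_genders,
# where 1 encodes 'male' and 0 encodes any other value.
sample_genders = {101: 1, 102: 0, 103: 1, 104: 1}

def class_balance(labels):
    """Return the fraction of users carrying each label."""
    counts = Counter(labels.values())
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

balance = class_balance(sample_genders)
majority_share = max(balance.values())
print(balance, majority_share)
```

Calling `class_balance(genders)` on the real dictionary would give the corresponding figures for the NYC and Tokyo users.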

In [4]:
def parse_friendships(filename):
    """ Parse friendship dataset.
        
        Parameters:
        - filename = path to friendship dataset
        Returns:
        - adj = sparse symmetric adjacency matrix of friendships between users with a known gender
    """
    parsed = np.genfromtxt(filename, dtype=[('user_id', np.int32), ('friend_id', np.int32)])
    # Keep only edges whose endpoints both appear in the global `genders` dict.
    edges = [(u, v) for (u, v) in parsed if u in genders and v in genders]
    rows = [u for (u, _) in edges]
    cols = [v for (_, v) in edges]
    max_id = max(max(rows), max(cols))
    # Add each edge in both directions so the matrix is symmetric.
    # csc_matrix sums repeated (row, col) pairs, so an edge listed more than once gets a value above 1.
    data = [1] * (2 * len(edges))
    adj = csc_matrix((data, (rows + cols, cols + rows)), shape=(max_id + 1, max_id + 1), dtype=np.int8)
    return adj

adj = parse_friendships(social_network_dir + friendships)
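
One detail of this construction is worth checking on a toy example: when a sparse matrix is built from `(data, (rows, cols))`, SciPy sums entries that share the same position. If the friendship file happens to list an edge in both directions (an assumption, not something verified here), mirroring the edge list would produce entries of 2 rather than 1. The sketch below uses a made-up three-node edge list to show the effect and one way to clamp the matrix back to 0/1.

```python
import numpy as np
from scipy.sparse import csc_matrix

# Made-up toy edge list; (0, 1) deliberately appears in both directions.
edges = [(0, 1), (1, 0), (1, 2)]
rows = [u for u, v in edges] + [v for u, v in edges]
cols = [v for u, v in edges] + [u for u, v in edges]
data = np.ones(len(rows), dtype=np.int8)

toy_adj = csc_matrix((data, (rows, cols)), shape=(3, 3), dtype=np.int8)

# Repeated (row, col) pairs were summed during construction.
has_duplicates = toy_adj.max() > 1

# Clamp every stored entry to 1 to recover a 0/1 adjacency matrix.
toy_adj.data[:] = 1

# A symmetric matrix differs from its transpose in zero positions.
is_symmetric = (toy_adj != toy_adj.T).nnz == 0

# Row sums of a 0/1 adjacency matrix are node degrees.
degrees = np.asarray(toy_adj.sum(axis=1)).ravel()
```

The same `adj.max()` check on the real matrix would show whether clamping is needed before `adj` is used to count friends.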