# Processing Gowalla Checkins

This notebook processes the [original Gowalla checkins data](https://snap.stanford.edu/data/loc-gowalla.html) into a database-style format, with a few normalised tables (augmented with some freely-available data).

It is the first step in a wider project: a pipeline from a PostgreSQL database to BigQuery that automatically triggers updates and feeds some simple dashboards in Looker.

---

## Setup

Import packages and define a few constants.

In [51]:
## Packages and modules
import pandas as pd
from os import path
from randomuser import RandomUser

In [41]:
## Constants

# path to raw Gowalla data
RAW_DATA = path.join("raw", "loc-gowalla_totalCheckins.txt.gz")
# number of checkins to include in sample
SAMPLE_SIZE = 2000

---

## Read data from file, and sample

Pandas has no issues reading the data from a .gz archive, so we can just load the file directly. However, there are a couple of gotchas:
- There are no headers in the file, just the raw data;
- The delimiters used are tabs (i.e. `\t`), not commas.
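
As a quick check of the two gotchas above, here is a minimal sketch that writes a tiny gzipped, tab-separated, header-less file and reads it back. The file name `toy_checkins.txt.gz` and the variable names are made up for illustration; the two rows are copied from the output shown in this notebook. It relies on `pd.read_csv` inferring gzip compression from the `.gz` extension, which is its default behaviour for file paths.

```python
import gzip
import os
import tempfile

import pandas as pd

# Two rows in the same layout as the Gowalla file: tab-separated, no header line.
raw_text = (
    "0\t2010-10-19T23:55:27Z\t30.235909\t-97.79514\t22847\n"
    "0\t2010-10-18T22:17:43Z\t30.269103\t-97.749395\t420315\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, used only for this illustration.
    toy_path = os.path.join(tmp_dir, "toy_checkins.txt.gz")
    with gzip.open(toy_path, "wt", encoding="utf-8") as handle:
        handle.write(raw_text)

    # header=None stops pandas treating the first row as column names;
    # sep="\t" tells it the fields are tab-separated.
    toy_df = pd.read_csv(
        toy_path,
        header=None,
        names=["user", "check_in_time", "latitude", "longitude", "location_id"],
        sep="\t",
    )

print(toy_df.shape)
print(list(toy_df.columns))
```

Without `header=None`, the first checkin would be consumed as the header row and silently dropped from the data.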

In [42]:
df = pd.read_csv(
    RAW_DATA,
    header=None,
    names=["user", "check_in_time", "latitude", "longitude", "location_id"],
    delimiter="\t"
)
df.head()

Unnamed: 0,user,check_in_time,latitude,longitude,location_id
0,0,2010-10-19T23:55:27Z,30.235909,-97.79514,22847
1,0,2010-10-18T22:17:43Z,30.269103,-97.749395,420315
2,0,2010-10-17T23:42:03Z,30.255731,-97.763386,316637
3,0,2010-10-17T19:26:05Z,30.263418,-97.757597,16516
4,0,2010-10-16T18:50:42Z,30.274292,-97.740523,5535878


That's what we're after. Let's check a few stats...

In [43]:
print(f"Length of df: {len(df)}")

Length of df: 6442892


In [44]:
for col in ["user", "location_id"]:
    print(f"Number of unique vals in '{col}': {len(df[col].unique())}")

Number of unique vals in 'user': 107092
Number of unique vals in 'location_id': 1280969


In [45]:
107092/6442892

0.016621728254951347

So, there are relatively few unique users (about 60 checkins per user on average), but quite a lot of unique locations (about 5 checkins per location). Not good or bad, exactly - just notable.
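
The ratio above can also be read the other way round, as the average number of rows per distinct value. The sketch below shows that calculation on a made-up toy frame; `rows_per_unique` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Toy data, invented for illustration: 6 checkins, 3 users, 5 locations.
toy = pd.DataFrame({
    "user": [0, 0, 0, 1, 1, 2],
    "location_id": [10, 11, 10, 12, 13, 14],
})


def rows_per_unique(frame, column):
    """Average number of rows per distinct value in `column`."""
    return len(frame) / frame[column].nunique()


for col in ["user", "location_id"]:
    print(f"Average rows per unique '{col}': {rows_per_unique(toy, col):.2f}")
```

Applied to the full `df`, the same helper would give the average checkins per user and per location.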

---

Now, take a randomly selected sample of these rows (we don't need or want to work with all 6.4m rows for this).

To reduce the overhead a bit when it comes to building the db, we'll count the number of times each user appears, add this count to the original df, and use it as the sampling weight for each row (so that users with more checkins are more likely to appear in the sample).
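
To see what this weighting does, here is a minimal sketch on a made-up toy frame (the users `"a"` and `"b"` and the `row_prob` column are illustrative only). Because a frequent user already contributes more rows, weighting every row by that user's count makes the chance of a single draw landing on a user proportional to the square of their count, so the bias towards heavy users is stronger than it might first appear.

```python
import pandas as pd

# Toy data, invented for illustration: user "a" has 3 checkins, user "b" has 1.
toy = pd.DataFrame({"user": ["a", "a", "a", "b"]})

# Attach each user's row count to every one of their rows.
toy["counts"] = toy["user"].map(toy["user"].value_counts())

# Probability of each row being picked in a single weighted draw.
toy["row_prob"] = toy["counts"] / toy["counts"].sum()

# Summing per user gives the chance that one draw lands on that user.
user_prob = toy.groupby("user")["row_prob"].sum()
print(user_prob)

# The same weights passed to DataFrame.sample, as in the notebook.
sample = toy.sample(n=2, weights="counts", random_state=1)
print(sample)
```

In this toy case an unweighted draw would pick user `"a"` with probability 3/4, while the weighted draw picks them with probability 9/10.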

In [46]:
df_user_counts = df["user"].value_counts().rename_axis("user").reset_index(name="counts")
df_user_counts.head()

Unnamed: 0,user,counts
0,10971,2175
1,776,2175
2,18931,2150
3,49918,2125
4,620,2125


In [47]:
df = df.merge(df_user_counts, on="user")
df.head()

Unnamed: 0,user,check_in_time,latitude,longitude,location_id,counts
0,0,2010-10-19T23:55:27Z,30.235909,-97.79514,22847,225
1,0,2010-10-18T22:17:43Z,30.269103,-97.749395,420315,225
2,0,2010-10-17T23:42:03Z,30.255731,-97.763386,316637,225
3,0,2010-10-17T19:26:05Z,30.263418,-97.757597,16516,225
4,0,2010-10-16T18:50:42Z,30.274292,-97.740523,5535878,225


In [48]:
df_sample = df.sample(
    n=SAMPLE_SIZE,
    random_state=1,
    ignore_index=True,
    weights="counts"
)
df_sample.head()

Unnamed: 0,user,check_in_time,latitude,longitude,location_id,counts
0,5479,2010-10-03T16:10:09Z,51.924049,4.470105,332428,1693
1,39612,2010-02-21T20:36:02Z,40.763861,-73.972932,12535,1825
2,2,2010-09-21T02:33:01Z,34.089709,-118.268309,167337,2100
3,3384,2010-09-04T20:13:41Z,47.66795,-122.313285,1049172,1950
4,1201,2010-10-20T20:08:59Z,30.274481,-97.739068,898204,1825


Looks good. Now, a quick bit of checking...

In [49]:
for col in df_sample.columns.values:
    print(f"Number of NaN vals in '{col}': {df_sample[col].isna().sum()}")

Number of NaN vals in 'user': 0
Number of NaN vals in 'check_in_time': 0
Number of NaN vals in 'latitude': 0
Number of NaN vals in 'longitude': 0
Number of NaN vals in 'location_id': 0
Number of NaN vals in 'counts': 0


In [50]:
for col in ["user", "location_id"]:
    print(f"Number of unique vals in '{col}': {len(df_sample[col].unique())}")

Number of unique vals in 'user': 1366
Number of unique vals in 'location_id': 1969


So, there are roughly two-thirds as many unique users as there are rows (1366 of 2000), and nearly all the locations are unique (1969 of 2000). That's not ideal, but the level of effort to improve this isn't worth the gain for a relatively quick test.

---

## Create users and normalise data sources

The next step is to add a bit of richness to the user IDs, by creating a set of dummy users (that will be in their own table). We'll do this with the [randomuser package](https://pypi.org/project/randomuser/), which in turn uses the [randomuser.me](https://randomuser.me/) API.

In [55]:
user_list = RandomUser.generate_users(len(df_sample["user"].unique()))

Each RandomUser object has a series of getter methods for retrieving its info. Here's an example:

In [65]:
user_list[1].get_nat()

'DE'

We want a dataframe of various properties from each object. There are two very obvious ways of constructing this:
- Use a series of list comprehensions inside the pd.DataFrame constructor call (this iterates through the list of objects multiple times);
- Iterate through the list of users once, build a list for each property required, then feed these lists to the df constructor.

Tbh, it probably makes little difference with such a short list. However, I've gone for the 2nd option.
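
For comparison, a variant of the single-pass approach builds one dict per user and hands the list of dicts to the constructor. The sketch below is only an illustration: `FakeUser` is a hypothetical stand-in for a RandomUser object (so that no API call is needed), implementing just three of the getters used in this notebook, and the names and IDs are taken from the output further down.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class FakeUser:
    """Hypothetical stand-in for a RandomUser object (no network call needed)."""

    first_name: str
    last_name: str
    nat: str

    def get_first_name(self):
        return self.first_name

    def get_last_name(self):
        return self.last_name

    def get_nat(self):
        return self.nat


fake_users = [FakeUser("Daniel", "Ruiz", "ES"), FakeUser("Karla", "Holvik", "NO")]
user_ids = [5479, 2]

# One pass over the users: each dict becomes one row of the dataframe.
records = [
    {
        "id": user_id,
        "first_name": user.get_first_name(),
        "last_name": user.get_last_name(),
        "nationality": user.get_nat(),
    }
    for user_id, user in zip(user_ids, fake_users)
]

toy_users_df = pd.DataFrame(records)
print(toy_users_df)
```

This keeps each user's properties together in one place, at the cost of building a dict per user rather than a list per column.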

In [66]:
first_names, last_names, dob, nationality, gender, email = [], [], [], [], [], []
# single pass over the users, collecting each property into its own list
for user in user_list:
    first_names.append(user.get_first_name())
    last_names.append(user.get_last_name())
    dob.append(user.get_dob())
    nationality.append(user.get_nat())
    gender.append(user.get_gender())
    email.append(user.get_email())

users_df = pd.DataFrame({
    "id": df_sample["user"].unique().tolist(),
    "first_name": first_names,
    "last_name": last_names,
    "dob": dob,
    "nationality": nationality,
    "gender": gender,
    "email": email
})

users_df.head()

Unnamed: 0,id,first_name,last_name,dob,nationality,gender,email
0,5479,Daniel,Ruiz,1974-06-27T15:07:58.533Z,ES,male,daniel.ruiz@example.com
1,39612,Magarete,Schönherr,1991-09-23T13:41:58.704Z,DE,female,magarete.schonherr@example.com
2,2,Karla,Holvik,1979-11-21T02:50:23.636Z,NO,female,karla.holvik@example.com
3,3384,Esperanza,Fernandez,1984-12-06T16:23:21.003Z,ES,female,esperanza.fernandez@example.com
4,1201,Jimmie,Williams,1979-12-12T21:16:29.980Z,US,male,jimmie.williams@example.com
