# Exploratory Data Analysis of Book Readers

## Getting acquainted with the data

Before diving into the exercises, let's look at how the data is structured in the database.

In [None]:
import os
import pandas as pd
from sqlalchemy import MetaData, create_engine
from IPython.display import display
from source.utils import *

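def as_generator(iterator):
    """Wrap an iterable in a plain generator."""
    return (x for x in iterator)

def format_results(engine, query):
    """Run `query` on `engine` and return the result set as a DataFrame."""
    # Engine.execute() was removed in SQLAlchemy 2.0, so this assumes a 1.x release
    result_set = engine.execute(query)
    return pd.DataFrame(as_generator(result_set), columns=result_set.keys())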

psql = create_engine('postgresql://postgres:{}@db/postgres'.format(os.environ['POSTGRES_PASSWORD']))

metadata = MetaData()
metadata.reflect(bind=psql)
for table in metadata.sorted_tables:
    print(table.name)
    for column in table.columns:
        print('- {} {}'.format(column.name, column.type))
    print('-' * 4)

    
print("\nDenormalized data:")
format_results(psql,
               """
               SELECT r.visit_id AS reader_id, title, category_one, category_two, country, timezone,
                      location_accuracy, tracking_time, created_at
               FROM reading r INNER JOIN visits v ON r.visit_id = v.visitor_id
               INNER JOIN stories s ON r.story_id = s.id
               ORDER BY r.visit_id, r.tracking_time
               LIMIT 5
               """)

Okay, now let's proceed to the action.

## How much reading did horror readers do each day
**IMPORTANT** Dear reviewer, please refer to [my notes](#Notes) at the end of the notebook for clarification.

### SQL

In [None]:
# We have to account for the fact that horror readers can also read other genres
n1_sql = format_results(psql,
                        """
                        SELECT date_trunc('day', tracking_time) AS date, COUNT(id) AS reads
                        FROM reading
                        WHERE reading.visit_id IN (
                            SELECT DISTINCT r.visit_id
                            FROM reading r INNER JOIN stories s ON r.story_id = s.id
                            WHERE s.category_one = 'horror' OR s.category_two = 'horror'
                        )
                        GROUP BY date_trunc('day', reading.tracking_time)
                        ORDER BY date
                        """)

display(n1_sql)

### Pandas

In [None]:
def first(arr):
    return arr[0]

def truncate_to_day(timestamp):
    return first(timestamp.split('T'))

visits = pd.read_csv('data/visits.csv')
reading = pd.read_csv('data/reading.csv')
stories = pd.read_csv('data/stories.csv')

reading.tracking_time = reading.tracking_time.apply(truncate_to_day)
reading.created_at = reading.created_at.apply(truncate_to_day)

horror_stories = stories[(stories.category_one == 'horror') | (stories.category_two == 'horror')]
horror_readers = pd.merge(left=reading, right=horror_stories, left_on='story_id', right_on='id').visit_id

horror_reading = reading[reading.visit_id.isin(horror_readers)]
horror_reading_per_day = horror_reading.groupby('tracking_time').size()

n1_pandas = pd.DataFrame({'date': horror_reading_per_day.index, 'readings': horror_reading_per_day.values})
display(n1_pandas)
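
`truncate_to_day` relies on the timestamps being ISO 8601 strings with a `T` separator. As a minimal sketch of an alternative, assuming the column parses cleanly with `pd.to_datetime`, the strings can be converted to real datetimes and truncated from there. The `sample` frame below is made up for illustration and only stands in for `data/reading.csv`.

```python
import pandas as pd

# Hypothetical sample standing in for data/reading.csv
sample = pd.DataFrame({
    'visit_id': [1, 1, 2],
    'tracking_time': ['2017-10-27T23:48:50', '2017-10-28T00:05:10', '2017-10-28T09:30:00'],
})

# Parse the strings into datetimes, then drop the time-of-day component
sample['day'] = pd.to_datetime(sample['tracking_time']).dt.normalize()

# Number of readings per calendar day
reads_per_day = sample.groupby('day').size()
print(reads_per_day)
```

Keeping the column as a datetime should also make the `date` column line up in type with what `date_trunc` returns on the SQL side.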

### Testing
Let's compare the results of both methods side by side.

In [None]:
pd.concat([n1_sql, n1_pandas], axis=1)
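
Comparing by eye works for a handful of rows, but the check can also be done programmatically. The sketch below assumes that both frames are sorted the same way and that the count is the last column in each, as is the case for the frames built in this notebook. `same_counts` and the two sample frames are hypothetical names introduced only for this example.

```python
import pandas as pd

def same_counts(left, right):
    """Return True when both frames hold the same counts in the same row order."""
    # Compare only the last column (the counts); the date columns may differ in dtype
    left_counts = left.iloc[:, -1].reset_index(drop=True)
    right_counts = right.iloc[:, -1].reset_index(drop=True)
    return len(left_counts) == len(right_counts) and bool((left_counts == right_counts).all())

# Hypothetical stand-ins for the SQL and pandas results
sql_like = pd.DataFrame({'date': pd.to_datetime(['2017-10-27', '2017-10-28']), 'reads': [3, 5]})
pandas_like = pd.DataFrame({'date': ['2017-10-27', '2017-10-28'], 'readings': [3, 5]})

print(same_counts(sql_like, pandas_like))
```

The same helper could be applied to the other result pairs in this notebook.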

## How many horror readers are there
### SQL

In [None]:
# We have to account for the fact that horror readers can also read other genres
n2_sql = format_results(psql,
                """
                SELECT date_trunc('day', reading.tracking_time) AS date, COUNT(DISTINCT visit_id) AS unique_readers
                FROM reading
                WHERE reading.visit_id IN (
                    SELECT DISTINCT r.visit_id
                    FROM reading r INNER JOIN stories s ON r.story_id = s.id
                    WHERE s.category_one = 'horror' OR s.category_two = 'horror'
                )
                GROUP BY date_trunc('day', reading.tracking_time)
                ORDER BY date
                """)

display(n2_sql)

### Pandas

In [None]:
def count_unique(x):
    return len(set(x))

grouped_by_day = horror_reading.groupby('tracking_time')
unique_readers_per_day = grouped_by_day.visit_id.agg(count_unique)

n2_pandas = pd.DataFrame({'date': unique_readers_per_day.index, 'unique_readers': unique_readers_per_day.values})
display(n2_pandas)
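
pandas also ships a built-in for this aggregation. As a small sketch on made-up rows, `nunique()` on a grouped column counts distinct values per group, much like `COUNT(DISTINCT ...)` does in SQL. The `sample` frame is hypothetical.

```python
import pandas as pd

# Hypothetical rows: reader 1 reads twice on the first day
sample = pd.DataFrame({
    'tracking_time': ['2017-10-27', '2017-10-27', '2017-10-27', '2017-10-28'],
    'visit_id': [1, 1, 2, 2],
})

# nunique() counts distinct values per group
unique_per_day = sample.groupby('tracking_time').visit_id.nunique()
print(unique_per_day)
```

One difference worth keeping in mind: by default `nunique()` skips missing values, as `COUNT(DISTINCT ...)` skips `NULL`s, whereas `len(set(x))` would count them.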

### Testing
Let's compare the results of both methods side by side.

In [None]:
pd.concat([n2_sql, n2_pandas], axis=1)

## What countries are the readers from
### SQL

In [None]:
n3_sql = format_results(psql,
              """
              SELECT date_trunc('day', reading.tracking_time) AS date, country, COUNT(DISTINCT reading.visit_id) AS unique_readers
              FROM reading INNER JOIN visits ON reading.visit_id = visits.visitor_id
              WHERE reading.visit_id IN (
                SELECT DISTINCT r.visit_id
                FROM reading r INNER JOIN stories s ON r.story_id = s.id
                WHERE s.category_one = 'horror' OR s.category_two = 'horror'
              )
              GROUP BY date_trunc('day', reading.tracking_time), country
              ORDER BY date, country
              """)

display(n3_sql)

### Pandas

In [None]:
def transpose(list_tuples):
    return list(zip(*list_tuples))

grouped_by_day_and_country = pd.merge(left=horror_reading, right=visits, left_on='visit_id', right_on='visitor_id') \
                                .groupby(['tracking_time', 'country'])
                                      
unique_readers_per_day_country = grouped_by_day_and_country.visit_id.agg(count_unique)
dates, countries = transpose(unique_readers_per_day_country.index)

n3_pandas = pd.DataFrame({'date': dates,
                          'country': countries,
                          'unique_readers': unique_readers_per_day_country.values})

display(n3_pandas)
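
Unpacking the `MultiIndex` by hand works, but the same table can be obtained by letting pandas turn the index levels back into columns. The sketch below uses made-up rows in place of the merged readings and visits; `sample` and `per_day_country` are hypothetical names.

```python
import pandas as pd

# Hypothetical merged rows of readings and visits
sample = pd.DataFrame({
    'tracking_time': ['2017-10-27', '2017-10-27', '2017-10-27', '2017-10-27', '2017-10-28'],
    'country': ['BR', 'BR', 'BR', 'US', 'US'],
    'visit_id': [1, 1, 3, 2, 2],
})

per_day_country = (
    sample.groupby(['tracking_time', 'country'])
          .visit_id.nunique()
          .reset_index(name='unique_readers')  # index levels become regular columns
          .rename(columns={'tracking_time': 'date'})
)
print(per_day_country)
```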

### Testing
Let's compare the results of both methods side by side.

In [None]:
pd.concat([n3_sql, n3_pandas], axis=1)

## Notes
Looking at the columns `tracking_time` and `created_at` of table `reading`, the values appear to always be the same. Upon further inspection, however, this is not always the case. Most likely, `tracking_time` records when the reading actually happened and `created_at` records when the row was inserted into the database, since `created_at` >= `tracking_time` throughout the data. Events are generated when the user scrolls down while reading a story, and contain the following information:
```JSON
{
    "events": [
        {
            "id": "c0e778e7-522d-47d8-a726-0b4e2ca1fedd",
            "name": "Reading Chapter",
            "properties": {
                "storyId": 18965,
                "chapterNumber": 2,
                "from": {}
            },
            "time": 1509148130.507
        }
    ],
    "visit_token": "ca96c8c6-9a51-4bb2-8b14-4f1a5575f18f",
    "visitor_token": "06509f38-4325-45d4-af65-4a8c585e904b"
}
```
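
To make the payload easier to reason about, here is a minimal sketch that flattens it into one row per event. It assumes `time` is expressed in seconds since the Unix epoch (UTC), which the payload itself does not state. `flatten_events` and the output field names are hypothetical and do not claim to match the columns of `reading`.

```python
import json
from datetime import datetime, timezone

# Payload copied from the example event above
payload = json.loads("""
{
    "events": [
        {
            "id": "c0e778e7-522d-47d8-a726-0b4e2ca1fedd",
            "name": "Reading Chapter",
            "properties": {"storyId": 18965, "chapterNumber": 2, "from": {}},
            "time": 1509148130.507
        }
    ],
    "visit_token": "ca96c8c6-9a51-4bb2-8b14-4f1a5575f18f",
    "visitor_token": "06509f38-4325-45d4-af65-4a8c585e904b"
}
""")

def flatten_events(payload):
    """Yield one flat dict per event, carrying the visit and visitor tokens along."""
    for event in payload['events']:
        yield {
            'event_id': event['id'],
            'story_id': event['properties']['storyId'],
            'chapter_number': event['properties']['chapterNumber'],
            'visit_token': payload['visit_token'],
            'visitor_token': payload['visitor_token'],
            # Assumption: "time" is seconds since the Unix epoch, in UTC
            'event_time': datetime.fromtimestamp(event['time'], tz=timezone.utc),
        }

rows = list(flatten_events(payload))
print(rows[0]['event_time'].isoformat())
```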

Still on `reading`, it isn't clear which field holds the identity of the reader. Judging by the names, it could be either `visitor_id` or `user_id`, but both have problems. Inspecting the `visits` table, we can see the same `user_id` associated with **multiple values for `country`**, which would be very unlikely in a real setting. On the other hand, `visitor_id` cannot be used because there is no overlap between the values of `reading.visitor_id` and `visits.visitor_id`, so a `JOIN` on that column returns no rows. However, all is not lost: we can join the tables using `reading.visit_id` and `visits.visitor_id`, which was my choice for the test. See the queries below.

In [None]:
print("Joining on 'visitor_id' in both tables")
display(format_results(psql, "SELECT * FROM reading r INNER JOIN visits v ON r.visitor_id=v.visitor_id"))
print("\n")
print("Joining on 'reading.visit_id' and 'visits.visitor_id'")
display(format_results(psql, "SELECT * FROM reading r INNER JOIN visits v ON r.visit_id=v.visitor_id LIMIT 3"))
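
The two checks described in these notes can also be sketched in pandas. The miniature tables below are invented for illustration and only mimic the situation described: no shared values between the two `visitor_id` columns, and one `user_id` seen in more than one country. `overlap` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
reading = pd.DataFrame({'visit_id': [10, 11, 11], 'visitor_id': [900, 901, 901]})
visits = pd.DataFrame({
    'visitor_id': [10, 11, 12],
    'user_id': [1, 1, 2],
    'country': ['BR', 'US', 'US'],
})

def overlap(left, right):
    """Count the distinct values two key columns have in common."""
    return len(set(left.dropna()) & set(right.dropna()))

# Candidate join keys: an overlap of zero means the join returns no rows
print(overlap(reading.visitor_id, visits.visitor_id))
print(overlap(reading.visit_id, visits.visitor_id))

# Users that appear with more than one country
countries_per_user = visits.groupby('user_id').country.nunique()
multi_country_users = countries_per_user[countries_per_user > 1].index.tolist()
print(multi_country_users)
```

Running the same functions on the real `reading` and `visits` frames loaded earlier would be one way to cross-check the SQL queries above.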