# Number of Users Active Daily in a Week

Question: Given a dataset of activity at a website, count how many users accessed the website every day in a given week from their mobile device. 

## Generating the Data

In [81]:
from random import randrange
from datetime import timedelta, datetime
import numpy as np

def random_date(start, end):
    """
    This function will return a random datetime between two datetime 
    objects.
    """
    delta = end - start
    int_delta = (delta.days * 24 * 60 * 60) + delta.seconds
    random_second = randrange(int_delta)
    return start + timedelta(seconds=random_second)

def write_csv():
    """Write 100,000 random visit records to data.csv."""
    d1 = datetime.strptime('1/1/2015 1:30 PM', '%m/%d/%Y %I:%M %p')
    d2 = datetime.strptime('1/1/2016 4:50 AM', '%m/%d/%Y %I:%M %p')
    # The context manager closes the file even if a write fails.
    with open('data.csv', 'w') as f:
        # Note: the space before "page" ends up in the column name (' page').
        f.write('timestamp,user_id,media, page\n')
        for i in range(100000):
            timestamp = str(random_date(d1, d2))
            user_id = str(np.random.randint(1, 100))  # user ids 1..99
            media = str(np.random.choice(['mobile', 'desktop']))
            page = str(np.random.rand())
            f.write(','.join([timestamp, user_id, media, page]) + '\n')

# Generate data.csv so the next cell can read it.
write_csv()

In [82]:
import pandas as pd
df = pd.read_csv('data.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 4 columns):
timestamp    100000 non-null object
user_id      100000 non-null int64
media        100000 non-null object
 page        100000 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 3.8+ MB


## The Solution

The high-level process is as follows:

1. Filter the data for the week of interest.
2. Remove visits that were not from a mobile device.
3. Group by `user_id` and `timestamp` (truncated to the day) to collapse repeat visits by the same user on the same day.
4. Count the users that appear on all 7 days.
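
The four steps above can also be sketched as a single function. This is only an illustration, not the notebook's own code: the name `count_daily_mobile_users` and its `first_day` / `last_day` parameters are made up here, and it assumes the `timestamp`, `user_id` and `media` columns of the generated CSV.

```python
import pandas as pd

def count_daily_mobile_users(df, first_day, last_day):
    """Count users with a mobile visit on every day in [first_day, last_day].

    Hypothetical helper: assumes columns 'timestamp', 'user_id', 'media'.
    """
    # Truncate each timestamp to midnight so visits compare by calendar day.
    days = pd.to_datetime(df["timestamp"]).dt.normalize()
    start, end = pd.Timestamp(first_day), pd.Timestamp(last_day)
    n_days = (end - start).days + 1

    # Steps 1 and 2: keep mobile visits inside the (inclusive) date range.
    mask = (days >= start) & (days <= end) & (df["media"] == "mobile")

    # Step 3: number of distinct active days per user.
    days_per_user = days[mask].groupby(df.loc[mask, "user_id"]).nunique()

    # Step 4: users who were active on every day of the range.
    return int((days_per_user == n_days).sum())
```

With the data generated above, a call such as `count_daily_mobile_users(df, "2015-01-03", "2015-01-09")` would cover the same seven days as the cells that follow.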

In [83]:
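# Drop the dashes so '2015-01-03 13:45:10' becomes '20150103 13:45:10'
df.timestamp = df.timestamp.str.replace('-', '')

In [84]:
# Keep only the first 8 characters: a YYYYMMDD day string
df.timestamp = df.timestamp.str[:8]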

In [85]:
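# Keep days strictly after 2015-01-02, i.e. from 2015-01-03 onward
last_week = df[df.timestamp > '20150102']

In [86]:
# Keep days strictly before 2015-01-10, leaving the 7 days 2015-01-03 to 2015-01-09
last_week = last_week[last_week.timestamp < '20150110']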

In [98]:
last_week = last_week[last_week.media == 'mobile']

In [109]:
# One row per (user, day); count() is only used to collapse duplicate visits
unique_pairs = last_week.groupby(['user_id', 'timestamp']).count()

In [111]:
unique_pairs = unique_pairs.reset_index()

In [112]:
sum(unique_pairs.user_id.value_counts() > 6)

10

So 10 users connected from a mobile device on every one of the 7 days of this particular week.

## Performance Concerns

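The date and media filters each make a single pass over the dataset. The `groupby` and `value_counts` steps are hash-based, so they run in expected linear time, plus the cost of sorting the resulting keys, which is small here because there are at most (number of users) × 7 distinct pairs. Overall the solution is $O(n)$ in the number of rows $n$.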
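
To make the linear-time claim concrete, here is a minimal pure-Python sketch that computes the same count in one pass. It is not part of the original notebook: the function name and the `(day, user_id, media)` tuple layout are assumptions chosen for illustration, with days given as `YYYYMMDD` strings.

```python
from collections import defaultdict

def count_daily_mobile_users_single_pass(rows, week_days):
    """Count users with a mobile visit on every day in week_days.

    rows: iterable of (day, user_id, media) tuples (hypothetical layout).
    week_days: set of 'YYYYMMDD' strings for the week of interest.
    """
    days_seen = defaultdict(set)  # user_id -> set of active days
    for day, user_id, media in rows:
        # Each row costs O(1) expected time: two hash lookups and a set insert.
        if media == "mobile" and day in week_days:
            days_seen[user_id].add(day)
    # A user qualifies only if every day of the week was seen.
    return sum(1 for days in days_seen.values() if len(days) == len(week_days))
```

Each row is touched once and every per-row operation is a hash lookup or insert, so the running time grows linearly with the number of rows.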