# 1. Introduction

## Guidelines
Please complete the assignment inside this notebook. Make sure the code can be executed easily.

- Write production-ready code using OOP when relevant.
- For question 1, create simple unit tests for your code where applicable.
- For question 1, add comments and documentation strings for all methods. 
- Discuss your design choices.
- For question 1 and 2, discuss the complexity (Big O notation) of your solutions, both memory wise and performance wise.
- For question 3, provide a map visualization when relevant.
- Try to stick to the most popular scientific Python libraries.
- Provide us with the instructions needed to run your code (e.g. requirements.txt, setup.py). Ideally, a simple virtualenv build and a `pip install` will do the trick.

## Input data for question 1 and 2
You should have received three csv files. Each csv-file represents the locations where a person was stationary for a certain amount of time. 
The csv-files contain the following fields:

- Latitude: The latitude of the detected GPS coordinates
- Longitude: The longitude of the detected GPS coordinates
- Timestamp: The start time of the stationary in the following format:
    - YYYY = year
    - MM = month of year
    - dd = day of month
    - HH = hour of day
    - mm = minute of hour
    - Z = timezone offset
- Duration: The length of time the person was stationary (in milliseconds)
    
All questions in this assignment are related to this data.

In [1]:
import json

import plotly.offline as py
from plotly.subplots import make_subplots
import plotly.graph_objs as go
import plotly.express as px

from sentiance.location import Location, UserLocations
# Import the two entry points used below explicitly instead of a wildcard import.
from sentiance.home_work import apply_cluster_model, probable_home_work_locations

## Question 1: Data lookup
Create a method that generates a lookup table allowing us to efficiently check whether or not a user has ever visited a location, even if the new location is not exactly the same as the visited location (some noise is added to the longitude/latitude pairs).

I've created a class `Location`, an immutable tuple containing latitude and longitude coordinates.<br>

The latitude and longitude are encoded into a single number (a geohash), so that locations that are close to each other usually encode to numbers that are close as well.<br>

You can also specify an encoding precision (in bits). Locations whose encodings are equal at that precision are considered the same location.<br>

There is also a class `UserLocations`, which is essentially a set of `Location` objects built from the specified file, again with a precision that defines which locations are considered the same.<br>

The computation is based on:<br>

  - https://www.factual.com/blog/how-geohashes-work/
  - https://en.wikipedia.org/wiki/Geohash
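
The encoding described above can be sketched as follows. This is a minimal illustration of the standard geohash bit interleaving, not the actual `Location` implementation; the function name `encode_geohash` and the choice to start with a longitude bit are assumptions made for this example.

```python
def encode_geohash(latitude: float, longitude: float, nbits: int) -> int:
    """Encode a coordinate pair as an nbits-wide integer geohash.

    Hypothetical helper: bits alternate between longitude (even steps)
    and latitude (odd steps); each bit halves the remaining interval.
    """
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    code = 0
    for step in range(nbits):
        # Pick the axis to bisect on this step.
        value, bounds = (
            (longitude, lon_range) if step % 2 == 0 else (latitude, lat_range)
        )
        middle = (bounds[0] + bounds[1]) / 2
        bit = 1 if value >= middle else 0
        # Keep the half of the interval that contains the value.
        bounds[bit ^ 1] = middle
        code = (code << 1) | bit
    return code


# A visited location and a slightly noisy query, both encoded with 34 bits.
visited = {encode_geohash(51.209335, 4.3883753, 34)}
query = encode_geohash(51.209325, 4.3883763, 34)
print(query in visited)
```

Because both coordinates fall in the same 34-bit cell, they produce the same integer, so a plain set membership test is enough to treat them as the same location.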

You can invoke the script as follows (use --help for information about the input arguments):<br>

  .. code-block:: bash

    $ python location.py "../data/Copy of person.1.csv" 51.209335 4.3883753
    (51.209335, 4.3883753) found in set

Complexity<br>

  Computing the geohash is linear in the number of bits, O(nbits). The implementation could be made faster by using bit operations only.<br>
  To build the set of user locations, each record in the csv file must be processed and its geohash computed, so construction is linear in the number of locations: O(nbits * locations).<br>
  Once the set is built, a lookup only requires computing the geohash of the query location followed by a set membership test, i.e. O(nbits), or O(1) on average if we consider the precision constant.<br>
  Memory complexity: this depends on the precision; the lower the precision, the more locations collapse onto one another. Each stored record only needs memory for one Location object (latitude, longitude, nbits, geohash), so memory grows with the number of distinct geohashes, which is at most linear in the number of records and sublinear in practice if the precision is not too high. There is also some overhead for the set structure itself.<br>
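
To get a feeling for how the precision controls which locations collapse onto one another, the size of one geohash cell can be estimated. The sketch below is only an approximation under stated assumptions: longitude receives the first bit (and the extra bit for odd precisions), and one degree of latitude is taken as roughly 111,320 m. The helper name `cell_size_m` is hypothetical.

```python
import math

METERS_PER_DEGREE = 111_320.0  # assumption: approximate length of one degree of latitude


def cell_size_m(nbits: int, latitude: float) -> tuple:
    """Return the approximate (width, height) in meters of one geohash cell."""
    lon_bits = (nbits + 1) // 2  # assumption: longitude gets the first bit
    lat_bits = nbits // 2
    width_deg = 360.0 / 2 ** lon_bits
    height_deg = 180.0 / 2 ** lat_bits
    # A degree of longitude shrinks with the cosine of the latitude.
    width_m = width_deg * METERS_PER_DEGREE * math.cos(math.radians(latitude))
    height_m = height_deg * METERS_PER_DEGREE
    return width_m, height_m


for nbits in (30, 34, 38):
    width, height = cell_size_m(nbits, 51.2)
    print(f"{nbits} bits: about {width:.0f} m x {height:.0f} m")
```

Under these assumptions a 34-bit cell near latitude 51.2 is roughly 190 m wide and 150 m tall, so every record inside such a cell is stored only once.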



In [5]:
datafile = "../data/Copy of person.1.csv"
precision = 34

locations = UserLocations(datafile, precision)
user_location = Location(51.209325, 4.3883763, precision)

if user_location in locations:
    print(f"{user_location} found in set")
else:
    print(f"{user_location} not found in set (maybe decrease precision?)")


(51.209325, 4.3883763) found in set


## Question 2: Home and work detection
The goal of this question is to design an unsupervised algorithm that allows one to distinguish the likely home locations of a user from his or her likely work locations.

Note that a person might have multiple home and work locations, or might not have a work location at all. Also note that some data points might be noisy, incorrect and/or incomplete.

Discuss your choice of algorithms, rules, methods, distance measures, etc.

  The algorithm:<br>
  The goal is to compute some features per location, where location is determined by the precision of the Location object (as discussed in Question 1). So samples with lat long coordinates that are close enough will be mapped to the same location.<br>

  Per sample we first add some columns/features: <br>
  
    - date
    - day of the week that the person visited the location
    - hour of the day that the person arrived at the location
    - duration in hours that the person stayed here
    - geohash: the location encoding with precision of 30 bits

  This step is linear in the number of samples<br>
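
The per-sample features listed above could be derived roughly as follows. This is a sketch, not the code in `sentiance.home_work`: the function name `sample_features` is hypothetical, and the timestamp layout is an assumption (fields concatenated as YYYYMMddHHmmZ, for example `201507061830+0200`). The geohash column is left out because it depends on the `Location` class.

```python
from datetime import datetime

# Assumption: timestamps look like "201507061830+0200".
TIMESTAMP_FORMAT = "%Y%m%d%H%M%z"
MS_PER_HOUR = 3_600_000


def sample_features(latitude, longitude, timestamp, duration_ms):
    """Derive the date, weekday, start hour and duration for one record."""
    start = datetime.strptime(timestamp, TIMESTAMP_FORMAT)
    return {
        "latitude": latitude,
        "longitude": longitude,
        "date": start.date(),
        "weekday": start.weekday(),  # 0 = Monday ... 6 = Sunday
        "start_hour": start.hour + start.minute / 60,
        "duration_hours": duration_ms / MS_PER_HOUR,
    }


row = sample_features(51.209335, 4.3883753, "201507061830+0200", 5_400_000)
print(row["date"], row["weekday"], row["start_hour"], row["duration_hours"])
```

Each record is handled independently with a constant amount of work, which is why this step is linear in the number of samples.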

  In the next step we group all samples together that map to the same location/geohash. Per group we calculate a new sample with the following features:<br>
  
    - day frequency: how frequently the user visits the place
    - mean duration: how long he/she spends at that location on average
    - mean starthour: the hour of the day at which he/she arrives at the location, on average
    - mean latitude
    - mean longitude
    - number_of_days: the number of weekdays (out of 7) on which he/she spends more than MIN_DAY_DENSITY (=0.5) of the time at that location (e.g. for home this will likely be close to 7)
    

  This step should also be linear in the number of samples (and it reduces the data set, since samples at the same location are collapsed into one).<br>
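
The grouping step can be sketched with plain dictionaries. The definitions used here are one possible reading of the features above and are assumptions: day frequency is taken as the fraction of observed dates with at least one visit, and a weekday counts towards `number_of_days` when the location was visited on more than `MIN_DAY_DENSITY` of the observed dates falling on that weekday. The function name and the dictionary keys are hypothetical.

```python
from collections import defaultdict
from statistics import mean

MIN_DAY_DENSITY = 0.5  # threshold named in the text


def aggregate_by_location(samples):
    """Collapse per-sample rows into one feature row per geohash."""
    all_dates = set()
    dates_by_weekday = defaultdict(set)  # weekday -> every observed date
    rows_by_geohash = defaultdict(list)
    for s in samples:  # single pass over the samples
        all_dates.add(s["date"])
        dates_by_weekday[s["weekday"]].add(s["date"])
        rows_by_geohash[s["geohash"]].append(s)

    features = {}
    for geohash, rows in rows_by_geohash.items():
        visited_by_weekday = defaultdict(set)
        for r in rows:
            visited_by_weekday[r["weekday"]].add(r["date"])
        visited_dates = {r["date"] for r in rows}
        features[geohash] = {
            "day_frequency": len(visited_dates) / len(all_dates),
            "mean_duration": mean(r["duration_hours"] for r in rows),
            "mean_start_hour": mean(r["start_hour"] for r in rows),
            "mean_latitude": mean(r["latitude"] for r in rows),
            "mean_longitude": mean(r["longitude"] for r in rows),
            # Count weekdays on which the location is visited often enough.
            "number_of_days": sum(
                len(dates) / len(dates_by_weekday[weekday]) > MIN_DAY_DENSITY
                for weekday, dates in visited_by_weekday.items()
            ),
        }
    return features
```

Every sample is touched a constant number of times, so the sketch stays linear in the number of samples, in line with the estimate above.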

  We filter out locations where the user spends less than 1 hour and whose day frequency is below the average.<br>
 
  This leaves very few locations (on the order of 5). We then check that the distances between the remaining locations are within reason (200 km); locations that fall outside that range are removed. Computing these distances has quadratic complexity, but as the number of remaining locations is very low, this can be ignored.<br>
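
The distance check can be illustrated with the haversine formula. The removal rule below is an assumption chosen for the example, since the text does not spell it out: a location is dropped when its median distance to the other remaining locations exceeds the limit. The names `haversine_km` and `drop_far_locations` are hypothetical.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) pairs."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def drop_far_locations(locations, max_km=200.0):
    """Keep locations whose median distance to the others is within max_km."""
    if len(locations) < 2:
        return list(locations)
    kept = []
    for i, loc in enumerate(locations):
        # Quadratic overall, which is acceptable for a handful of locations.
        distances = [
            haversine_km(loc, other)
            for j, other in enumerate(locations)
            if j != i
        ]
        if median(distances) <= max_km:
            kept.append(loc)
    return kept


candidates = [
    (51.171635, 4.347075),
    (51.207032, 4.387411),
    (51.215588, 4.393516),
    (48.8566, 2.3522),  # far away from the other three
]
print(drop_far_locations(candidates))
```

Using the median makes a single distant outlier easy to remove without letting it disqualify the locations that are close together.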

  Locations that are visited on 6 or more weekdays (number_of_days >= 6) are classified as home, the others as work.<br>
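
The final rule is a simple threshold, sketched below. The function name `classify_home_work` and the input keys are hypothetical; the output shape mirrors the JSON printed by the cells further down.

```python
HOME_MIN_DAYS = 6  # threshold on number_of_days taken from the text


def classify_home_work(locations):
    """Split aggregated locations into home and work by number_of_days."""
    result = {"home": [], "work": []}
    for loc in locations:
        label = "home" if loc["number_of_days"] >= HOME_MIN_DAYS else "work"
        result[label].append([loc["mean_latitude"], loc["mean_longitude"]])
    return result


example = [
    {"mean_latitude": 51.17, "mean_longitude": 4.35, "number_of_days": 7},
    {"mean_latitude": 51.21, "mean_longitude": 4.39, "number_of_days": 5},
]
print(classify_home_work(example))
```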

  I've also applied the MeanShift clustering algorithm, which finds clusters that contain almost the same home and work locations. According to the documentation, the complexity of MeanShift tends towards O(T*n*log(n)) in lower dimensions, where n is the number of samples and T the number of points (not the dimension).<br>


In [6]:
home_work_1 = probable_home_work_locations("../data/Copy of person.1.csv")
print("Probably home an work locations for person 1")
print(json.dumps(home_work_1, indent=2))
home_work_2 = probable_home_work_locations("../data/Copy of person.2.csv")
print("Probably home an work locations for person 2")
print(json.dumps(home_work_2, indent=2))
home_work_3 = probable_home_work_locations("../data/Copy of person.3.csv")
print("Probably home an work locations for person 3")
print(json.dumps(home_work_3, indent=2))

Probably home an work locations for person 1
{
  "home": [
    [
      51.171635,
      4.347074890677966
    ]
  ],
  "work": [
    [
      51.207032244444456,
      4.387411042222222
    ],
    [
      51.21558766666668,
      4.393515745238095
    ]
  ]
}
Probably home an work locations for person 2
{
  "home": [
    [
      51.21486793913043,
      4.410902866086957
    ]
  ],
  "work": [
    [
      51.2068213678161,
      4.3880423252873575
    ],
    [
      51.21592946666667,
      4.393929054999999
    ]
  ]
}
Probably home an work locations for person 3
{
  "home": [
    [
      50.951219956989235,
      4.889916853763441
    ]
  ],
  "work": [
    [
      51.20687137777779,
      4.388040984444444
    ],
    [
      51.2160074,
      4.393897115
    ],
    [
      51.2311558974359,
      4.4039641948717945
    ]
  ]
}


In [38]:
# The same serialized cluster model is applied to every person's data file.
model_file = "../models/work_home_classifier.bin"

home_work_person1 = apply_cluster_model("../data/Copy of person.1.csv", model_file)
print("Probably home an work locations for person 1")
print(json.dumps(home_work_person1, indent=2))
home_work_person2 = apply_cluster_model("../data/Copy of person.2.csv", model_file)
print("Probably home an work locations for person 2")
print(json.dumps(home_work_person2, indent=2))
home_work_person3 = apply_cluster_model("../data/Copy of person.3.csv", model_file)
print("Probably home an work locations for person 3")
print(json.dumps(home_work_person3, indent=2))

Probably home an work locations for person 1
{
  "home": [
    [
      51.171635,
      4.347074890677966
    ]
  ],
  "work": [
    [
      51.21558766666668,
      4.393515745238095
    ],
    [
      51.207032244444456,
      4.387411042222222
    ]
  ]
}
Probably home an work locations for person 2
{
  "home": [
    [
      51.21486793913043,
      4.410902866086957
    ]
  ],
  "work": [
    [
      51.21592946666667,
      4.393929054999999
    ],
    [
      51.2068213678161,
      4.3880423252873575
    ],
    [
      51.21948917647058,
      4.402292326470588
    ]
  ]
}
Probably home an work locations for person 3
{
  "home": [
    [
      50.951219956989235,
      4.889916853763441
    ]
  ],
  "work": [
    [
      51.20687137777779,
      4.388040984444444
    ],
    [
      51.2311558974359,
      4.4039641948717945
    ],
    [
      51.2160074,
      4.393897115
    ]
  ]
}


## Question 3: Activity classification

The goal of this question is to design a very simple linear classifier based on mobile accelerometer sensor data to discriminate between car and walking movements, and to implement it on a device. The pipeline consists of data collection, feature extraction, and inference on short sensor segments.

The dataset for question 3 is very limited on purpose. sensors.csv consists of one-minute segments, 10 for each class. The accelerometer reading of a segment i has three acceleration axes called seg_i_x, seg_i_y, seg_i_z and one time axis called seg_i_taxis. The time axis is given in nanoseconds since epoch. labels.csv provides the mapping between a segment id and the corresponding label (0 for one class, 1 for the other).

The goal is to show your understanding of the problem, your way of reasoning, and your mobile programming skills. As the dataset is very limited, we don’t expect you to design an algorithm that will generalize efficiently.  

1. Can you visualize the data and use it to intuitively deduce what label corresponds to walking and what label to car? Why?
2. What is the sampling rate of the signals? In order to denoise them, a low-pass filter was applied. What was, approximately, the cut-off frequency of the filter?
3. After a statistical analysis of the signals, define 2 features to compute per sensor segment that will allow an algorithm to discriminate between the two classes. 
4. Train a simple 2-dimensional linear classifier (threshold function) on that dataset. Use the entire dataset for training; no need to use a validation or testing set.
5. Plot the two clusters on a 2-dimensional plane with the corresponding decision function.
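
For the sampling-rate part of item 2, one possible starting point is to look at the spacing of the time axis. The sketch below assumes the values of a `seg_i_taxis` column have already been loaded into a list of nanosecond timestamps; the function name `sampling_rate_hz` is hypothetical, and the synthetic time axis is made up for the example rather than taken from sensors.csv.

```python
from statistics import median

NS_PER_SECOND = 1_000_000_000


def sampling_rate_hz(taxis_ns):
    """Estimate the sampling rate from nanosecond timestamps."""
    deltas = [later - earlier for earlier, later in zip(taxis_ns, taxis_ns[1:])]
    # The median spacing is robust to occasional jitter or dropped samples.
    return NS_PER_SECOND / median(deltas)


# Synthetic time axis: one sample every 20 ms, for illustration only.
synthetic_taxis = [i * 20_000_000 for i in range(100)]
rate = sampling_rate_hz(synthetic_taxis)
print(rate, "Hz; Nyquist limit", rate / 2, "Hz")
```

Any low-pass cut-off must lie below the Nyquist limit, i.e. half the estimated sampling rate; locating it more precisely would require looking at the spectrum of the signals.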

Note: Acceleration units and conventions differ between iOS and Android. Take this into account if you port the pipeline to iOS, as the training dataset was collected on Android.

