### Add Name

### Add Student Number 

# Assignment 1: Time-series data

**The use of generative AI is not allowed** 

**Objective**
The goal of this assignment is to develop an understanding of how trajectory data can be analyzed in both the time domain (using autocorrelation) and the frequency domain (using periodograms).
Refer to **Lecture 2**.

You will learn to:

- Understand how trajectory data is recorded and tracked.

- Simulate synthetic trajectory data with realistic imperfections (e.g., noise, missing points, irregularities).

- Apply autocorrelation and periodogram methods to analyze patterns in the simulated data.

- Extend the analysis to real-life trajectory data and interpret the results.


**This assignment consists of five parts:**

SECTION I

1. Create synthetic data to test the algorithm you design.
2. Write two functions:
    - autocorrelation.
    - period extraction function.
3. Use a periodogram function.
4. Compare the sensitivity of the algorithms to typical imperfections that occur in real data (noise, missing data, random and non-periodic patterns).

SECTION II

Test the algorithms on Geolife data.

5. Handle missing values, apply the autocorrelation and periodogram, and explain your findings.
   
 


**WARNING: Make sure to read through the entire assignment before starting to code. The tasks build on each other!**

Good luck with the assignment! The deadline is **September 30th at 23:59**. Please push your code to the GitHub classroom before the deadline. **Make sure to read the *"Submission procedure"* section in the *"README.md"* file to ensure reproducibility.**


## 0th part: Prerequisites
Please add any packages you use to the code cell below.

In [1]:
import numpy as np
import matplotlib.pyplot as plt

# etc.



## 1st part: Create Synthetic Data

Imagine that you have a GPS device that takes a measurement every hour. Create a synthetic trajectory covering 365 days, with a 1-hour interval between records. For example, assume that travelling from point A to point B takes 60 minutes. The trajectory should include two periodicities (e.g., one daily and one weekly). For example, you can take your home-Snellius trajectory, repeated from Monday to Friday, as the daily period, and a weekly trip from home to the supermarket or city center during the weekend as the weekly period. Assume that you have a regular schedule (e.g., leaving every day at 8 and returning at 5). Simulate the whole trajectory (the path and the time you spend in each place). You can find GPS locations via Google Maps (right-click on the map and select "What's here?") or any other online map service. Try to think of scenarios that make this data as realistic as possible. You can be creative about your home location because that's private information!

The GPS location of the entrance of the Snellius building is: 52.169709, 4.457111. 

Use a constant time between waypoints / GPS locations for this exercise.

Tip I: A periodicity of 24 hrs is an event recurring every 24 hrs (i.e., a daily event); a periodicity of 168 hrs is a weekly event.

Tip II: So far we have talked about processing a single time series. If handling two time series (one per coordinate) is difficult, try to use only one, or combine them into one value (e.g., sqrt(lat^2 + long^2)).

Tip III: Stuck with how to simulate data? Check the Lab of the first week and get some inspiration there.
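
As a starting point for Tips I and II, here is a minimal sketch of an hourly time axis with a weekday-only pattern and a single combined series. The home coordinate and the 09:00-17:00 schedule are made-up placeholders, and `hours`, `at_work`, and `combined` are illustrative names that are not part of the template. Travel time between places and the weekly trip are deliberately left out.

```python
import numpy as np

# Hourly sample index for one year: 24 samples per day, 365 days.
hours = np.arange(24 * 365)
hour_of_day = hours % 24          # 0..23
day_of_week = (hours // 24) % 7   # 0..6, assuming day 0 is a Monday

# Assumed schedule: at Snellius on weekdays between 09:00 and 17:00.
at_work = (day_of_week < 5) & (hour_of_day >= 9) & (hour_of_day < 17)

home = (52.1600, 4.4900)          # hypothetical home location
snellius = (52.169709, 4.457111)  # entrance coordinate from the assignment text

lat = np.where(at_work, snellius[0], home[0])
lon = np.where(at_work, snellius[1], home[1])

# Tip II: collapse both coordinates into a single series.
combined = np.hypot(lat, lon)     # sqrt(lat^2 + lon^2)
```

Because the modulo operations repeat every 24 and 168 samples, any pattern defined on `hour_of_day` and `day_of_week` is periodic by construction.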

In [None]:
# simulate data in the form of two time series, latitude and longitude, each of length 24*365 (hourly samples):
def simulate(home_coord, supermarket_coord, snellius_coord):

    return simulated_data


## 2nd part: Write two functions

Your task is to write:

 1. A function that performs an [autocorrelation](https://en.wikipedia.org/wiki/Autocorrelation) (Lecture slide 20) on the synthetic data and returns the correlation value and the corresponding delay for every possible delay. Note: you have to write this function yourself. You will not get points if you use a ready-to-use autocorrelation function from a library.

 2. A visualisation of the autocorrelation function that shows the correlation value as a function of the delay. The quality of the graph will also be graded (e.g., axis labels).
 3. A function that evaluates the output of the autocorrelation function, extracts the two simulated periodicities (i.e., 24 and 168 hrs) as output, and indicates which periodicity is more prominent (i.e., more frequent).
 
Tip I: For the autocorrelation function, you can implement a circular version (shifting values from the end of the time series to its beginning).

Tip II: If your function takes too much time to run, you can also check out the vectorized operations provided by NumPy or SciPy.

Tip III: For guidance on implementing the autocorrelation function and visualizing its results, refer to the paper (“**Recognition of Periodic Behavioral Patterns from Streaming Mobility Data**”) cited in **Lecture Slide 22**.
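
The circular shift mentioned in Tip I can be illustrated on a toy signal. The sketch below computes the correlation for a single lag only; the function name `circular_corr_at_lag` and the toy signal are assumptions made for illustration, and this is not a complete autocorrelation function.

```python
import numpy as np

def circular_corr_at_lag(x, lag):
    """Correlation between x and a circularly shifted copy of x, for one lag."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    # np.roll wraps values pushed off the end back to the beginning.
    shifted = np.roll(centered, lag)
    # Note: a constant signal has zero variance, so this would divide by zero.
    return np.dot(centered, shifted) / np.dot(centered, centered)

# Toy signal with a period of 4 samples, repeated 6 times.
toy = np.tile([0.0, 1.0, 0.0, -1.0], 6)

print(np.roll([1, 2, 3, 4], 1))       # [4 1 2 3]
print(circular_corr_at_lag(toy, 4))   # 1.0: shifted by one full period
print(circular_corr_at_lag(toy, 2))   # -1.0: shifted by half a period
```

With a circular shift every lag uses the full series, so the correlation does not shrink at large lags the way it does when the overlapping part gets shorter.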




In [None]:
# write your own ACF
def autocorrelation(data):

    return autocorr


In [None]:
# display ACF on y-axis, time on x-axis


In [None]:
# write your evaluation function
def autocorr_periodicity(simulated_data):

    autocorr = autocorrelation(simulated_data)
    #
    #  finding periods out of autocorrelation by automatically identifying the most dominant peaks
    #
    return autocorr_periods


## 3rd part: Periodograms
Take an existing periodogram function from a Python package (for example SciPy) and run it on the simulated data. Are the results of your evaluation function from the 2nd part similar to the results of the periodogram?

Generate two periodograms:

a) Using only longitude or latitude data

b) Using the combined longitude and latitude data

Explain the difference in findings. Do they have similar periodograms? 
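
For orientation, here is a minimal sketch of calling `scipy.signal.periodogram` on a toy signal and converting the peak frequency into a period. The toy sinusoid stands in for real data, and treating one sample as one hour (`fs=1.0`) is an assumption that matches the hourly sampling described above.

```python
import numpy as np
from scipy.signal import periodogram

# Toy hourly signal with a 24 h cycle over 50 days (not the assignment data).
t = np.arange(24 * 50)
toy = np.sin(2 * np.pi * t / 24)

# fs=1.0 means one sample per hour, so frequencies are in cycles per hour.
freqs, power = periodogram(toy, fs=1.0)

# Skip the zero-frequency bin before converting frequency to period.
peak = np.argmax(power[1:]) + 1
peak_period_hours = 1.0 / freqs[peak]
print(round(peak_period_hours, 2))  # 24.0
```

The periodogram reports frequencies, so a period in hours is the reciprocal of the peak frequency. Periods that do not divide the series length evenly fall between frequency bins and show up as slightly smeared peaks.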

In [None]:
# work with periodogram, find a function available in one of the python packages


In [None]:
# write your own periodogram function if needed
def periodogram():

    return ...


In [None]:
# evaluate results of periodogram, write your evaluation function
def periodogram_periodicity(simulated_data):

    # call the periodogram
    #
    # find periods from periodogram by automatically identifying the most dominant peaks
    #
    return periodogram_periods


It might be difficult to find all the correct dominant peaks so we will be lenient in grading if we can see that the code is correct.
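
One possible way to pick peaks automatically is `scipy.signal.find_peaks`. The sketch below runs it on a toy curve. The curve and the `prominence` threshold of 0.5 are arbitrary illustrative choices, not recommended settings.

```python
import numpy as np
from scipy.signal import find_peaks

# Toy curve standing in for an ACF: local maxima every 24 samples.
lags = np.arange(120)
toy_curve = np.cos(2 * np.pi * lags / 24)

# `prominence` filters out small bumps; 0.5 is an arbitrary illustrative value.
peaks, props = find_peaks(toy_curve, prominence=0.5)
spacing = np.diff(peaks)

print(peaks)    # [24 48 72 96]
print(spacing)  # [24 24 24]
```

`find_peaks` never reports the first or last sample as a peak, so lag 0 is excluded automatically. On noisy data the threshold usually needs tuning.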

## 4th part: Performance

Noise in the data can have different causes:

- Missing measurements at different proportions (randomly or in bursts).
- Noise around the location data. Let's assume your GPS sensor has an approximate range of 50 meters, and noisy points (hundreds of meters away) occur occasionally due to cloud cover or the multipath effect of the GPS signal.
- Noise around the temporal data. For example, you assume that you leave home every day at 8 and return at 5, but you might actually leave a bit earlier or a bit later.
- Irregular behavior caused by skipping or adding a trajectory. For example, going to school on a Saturday or skipping groceries for a week. You can also define a number of new places and paths and add them to the trajectory randomly (e.g., going to the cinema every month with some probability).

Choose at least two noise sources, add them to your simulated data, and compare the performance of your ACF and periodogram functions. Parametrize the noise injection by a rate, and check how sensitive your algorithms are to different proportions of these imperfections. Which one performs best? Under what circumstances?
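
Spatial noise is specified in meters while the coordinates are in degrees, so a conversion is needed. The sketch below uses the common approximation of about 111.32 km per degree of latitude. The helper name `meters_to_degrees` is hypothetical, and reading the 50-meter range as a Gaussian standard deviation is an assumption.

```python
import numpy as np

METERS_PER_DEG_LAT = 111_320.0  # rough average; an approximation

def meters_to_degrees(meters, lat_deg):
    """Convert a ground distance into approximate (dlat, dlon) offsets in degrees."""
    dlat = meters / METERS_PER_DEG_LAT
    # A degree of longitude shrinks with cos(latitude) towards the poles.
    dlon = meters / (METERS_PER_DEG_LAT * np.cos(np.radians(lat_deg)))
    return dlat, dlon

rng = np.random.default_rng(seed=0)
dlat, dlon = meters_to_degrees(50.0, 52.17)

# Gaussian jitter with an assumed 50 m standard deviation around the Snellius entrance.
noisy_lat = 52.169709 + rng.normal(0.0, dlat, size=5)
noisy_lon = 4.457111 + rng.normal(0.0, dlon, size=5)
```

At the latitude of Leiden, 50 meters corresponds to a larger offset in longitude degrees than in latitude degrees, which is why the two offsets are computed separately.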



In [None]:
# rate has a value in [0,1] and is used as parameter to define the level of noise added
def add_noise_one(data, rate):
  
    return noisy_data


In [None]:
def add_noise_two(data, rate):

    return noisy_data


In [None]:
# compare performance
noisy_data1 = add_noise_one(data, 0.5)
noisy_data2 = add_noise_two(data, 0.5)

# autocorr_periodicity(noisy_data1)
# periodogram_periodicity(noisy_data2)


### Conclusion
Write a brief report on your findings (150 words max):

(Write your report here)

## 5th part: Real life data

You will use data from the Geolife dataset to test the time-series analysis methods that you developed. Your task is to apply the methods you just learned and interpret the data.

[Download the data here](https://www.microsoft.com/en-us/download/details.aspx?id=52367). You will find multiple participants' trajectories in the dataset. For this assignment, we will focus on participant **125**.

[The user guide of the entire dataset can be found here](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/User20Guide-1.2.pdf).
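
As described in the user guide, each `.plt` file starts with six header lines, followed by comma-separated records containing latitude, longitude, an unused field, altitude in feet, a numeric date, a date string, and a time string. Here is a minimal sketch of parsing that layout with pandas. The column names, the `read_plt` helper, and the sample rows are made up for illustration, and the folder layout (e.g., `Data/125/Trajectory/*.plt`) is an assumption that should be checked against your download.

```python
import io
import pandas as pd

PLT_COLUMNS = ["lat", "lon", "unused", "altitude_ft", "days_since_1899", "date", "time"]

def read_plt(source):
    """Read one Geolife .plt file (path or file-like object) into a DataFrame."""
    df = pd.read_csv(source, skiprows=6, header=None, names=PLT_COLUMNS)
    df["timestamp"] = pd.to_datetime(df["date"] + " " + df["time"])
    return df[["timestamp", "lat", "lon"]]

# Made-up sample in the .plt layout, only to show the parsing step.
sample = io.StringIO(
    "Geolife trajectory\nWGS 84\nAltitude is in Feet\nReserved 3\n"
    "0,2,255,My Track,0,0,2,8421376\n0\n"
    "39.9750,116.3300,0,492,39744.1201,2008-10-23,02:53:04\n"
    "39.9751,116.3302,0,492,39744.1202,2008-10-23,02:53:10\n"
)
track = read_plt(sample)
print(track)
```

For a whole participant, the same function could be applied to every file in the trajectory folder and the results concatenated and sorted by timestamp.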

Use these questions as a guideline for your analysis: 

a) What is the general structure of the data? How much noise do you observe? What is the temporal granularity of your data? How long did your participant log their movement? 


b) Do you see any missing recordings if you create a time series like the one you simulated? Can you find a way to fill in the missing values in the time series of both coordinates?


c) Apply autocorrelation (your own from Section I) and a periodogram. Did you find periodic behaviors? What are the periodicities? Briefly summarize your observations in the conclusion below (max 100 words).


d) If you cannot identify periodic behaviors: can you explain why? What makes your data challenging? What realistic aspect of the data is missing in your simulation? With these challenges in mind, what would be your top priorities if you were to design a data collection protocol? Please explain in the conclusion below (max 200 words).
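
For question b, one simple option among several is linear interpolation over the gaps. The sketch below assumes that the series has already been placed on a regular time grid with `NaN` marking missing samples. The helper name `fill_gaps_linear` and the example values are illustrative.

```python
import numpy as np

def fill_gaps_linear(values):
    """Fill NaN entries by linear interpolation between the nearest known samples."""
    values = np.asarray(values, dtype=float)
    idx = np.arange(values.size)
    known = ~np.isnan(values)
    # np.interp holds the first/last known value constant beyond the ends.
    return np.interp(idx, idx[known], values[known])

lat_with_gaps = np.array([52.16, np.nan, np.nan, 52.19, np.nan])
filled = fill_gaps_linear(lat_with_gaps)
print(filled)
```

Linear interpolation implies straight-line movement during a gap. Holding the last known position instead (forward fill) may be more plausible when the device was simply switched off, so the choice is worth justifying in your conclusion.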





In [3]:
# import data


In [None]:
# handle missing values


In [None]:
# run (your own) ACF


In [None]:
# run periodogram on the data


### Conclusion 

All observations related to questions (a–d) are expected to be included in the Conclusion (350 words max).

## How do we grade this assignment?
Please pay attention to the following points. We consider these in calculating your final grade.

First of all, please check that you have pushed this notebook to the GitHub classroom before the deadline. You can check online on the GitHub classroom whether the most up-to-date version of your code is present there. We can't grade your notebook if it is not there! **Make sure to read the *"Submission procedure"* section in the *"README.md"* file to ensure reproducibility.**

**1st part:**

1.   Your simulated data should have latitude and longitude.
2.   Your simulated data should have the correct dimensionality/frequency.
3.   It should also have both a daily and a weekly cycle.



**2nd part:**


1. You should have correctly implemented the autocorrelation function.
2. We consider the graphs generated in grading.
3. We consider if the function returns 2 periodic cycles correctly.
4. We consider the extraction of the more prominent cycle. 



**3rd part:**

1.   We consider the implementation of the function based on the periodogram.
2.   We consider whether you applied the periodogram to both types of data (i.e., longitude or latitude alone, and latitude and longitude combined) and explained your findings.


**4th part:**
1. We consider if two noise sources are added.
2. We check if both the autocorrelation and the periodogram are applied to the data.
3. We check if the results of the experiments are presented in a useful way.
4. We check the conclusions.


**5th part:**
1. We check if you have explored the data.
2. We check if you tried to handle missing values, if any.
3. We check if you have used the periodogram and the autocorrelation function.
4. We check your conclusion.