# Web-Mining - Assignment 4 (12 points total)

This **Home Assignment** is to be submitted, and you will be given points for each of the tasks. The assignment is all about time series. You will first write a function that retrieves the view count as a function of time for Wikipedia articles. You will then look for view count histories that show interesting behaviour. Finally, you are asked to implement single exponential smoothing and Holt's smoothing.
You can use numpy, requests, and all of the standard library.

## Formalities
**Submit in a group of 3-4 people by 13.07.2021 23:59 CET. The deadline is strict!**

## Evaluation and Grading
General advice for programming exercises at *CSSH*:
Evaluation of your submission is done semi-automatically. Think of it as this notebook being 
executed once. Afterwards, some test functions are appended to this file and executed as well.

Therefore:
* Submit valid _Python3_ code only!
* Use external libraries only when specified by task.
* Ensure your definitions (functions, classes, methods, variables) follow the specification if
  given. The concrete signature of e.g. a function usually can be inferred from task description, 
  code skeletons and test cases.
* Ensure the notebook does not rely on current notebook or system state!
  * Use `Kernel --> Restart & Run All` to see if you are using any definitions, variables etc. that 
    are not in scope anymore.
  * Double check if your code relies on presence of files or directories other than those mentioned
    in given tasks. Tests run under Linux, hence don't use Windows style paths 
    (`some\path`, `C:\another\path`). Also, only use paths that are relative to and within your
    working directory (OK: `some/path`, `./some/path`; NOT OK: `/home/alice/python`, 
    `../../python`).
* Keep your code idempotent! Running it or parts of it multiple times must not yield different
  results. Minimize usage of global variables.
* Ensure your code / notebook terminates in reasonable time.

**There's a story behind each of these points! Don't expect us to fix your stuff!**

Regarding the scores, you will get no points for a task if your function:
- throws an unexpected error (e.g. takes the wrong number of arguments)
- gets stuck in an infinite loop
- takes much longer than expected (e.g. >1s to compute the mean of two numbers)
- does not produce the desired output (e.g. returns a list sorted in descending order even though we asked for ascending, returns the mean and the std even though we asked for only the mean, prints an output instead of returning it, ...)

In [1]:
# credentials of all team members (you may add or remove items from the list)
team_members = [
    {
        'first_name': 'Pruthvi',
        'last_name': 'Hegde',
        'student_id': 404809
    },
    {
        'first_name': 'Mike',
        'last_name': 'Grüne',
        'student_id': 381076
    },
    {
        'first_name': 'Seyed Pouria',
        'last_name': 'Mirelmi',
        'student_id': 416910
    },
    {
        'first_name': 'Haron',
        'last_name': 'Nqiri',
        'student_id': 343289
    }
]

# Task 1: Time series (12 points total)

## a) Fetching wikipedia view counts (3)

Write a function `get_counts(title, start, end, language_edition="en")` that takes the title of a Wikipedia page, a starting date and an ending date, as well as an optional language edition for that particular title. `start` and `end` are supplied as Python `date` objects.

It returns the timestamps (as a list of `date` objects) and the view counts (as a list of integers) for that particular article in the given timespan. Both the start date and the end date are inclusive. If there are errors when retrieving the data, return empty lists.

Example link for Albert Einstein, covering 1 November 2017 up to and including 30 November 2018: 'https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Albert_Einstein/daily/20171101/20181130'.
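The URL pattern in the example link can be sketched as a small helper. This is only an illustration: the name `build_pageviews_url` and the constant `BASE` are hypothetical and not part of the required interface, and percent-encoding the title with `urllib.parse.quote` is an assumption made here so that titles containing characters such as `/` or umlauts still yield a valid URL.

```python
from datetime import date
from urllib.parse import quote

# Prefix taken from the example link in the task description
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def build_pageviews_url(title: str, start: date, end: date, language_edition: str = "en") -> str:
    # Hypothetical helper: dates are formatted as YYYYMMDD, the title is percent-encoded
    return (f"{BASE}/{language_edition}.wikipedia/all-access/all-agents/"
            f"{quote(title, safe='')}/daily/{start:%Y%m%d}/{end:%Y%m%d}")

# Reproduces the example link for Albert Einstein
print(build_pageviews_url("Albert_Einstein", date(2017, 11, 1), date(2018, 11, 30)))
```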

In [2]:
from typing import List, Tuple
from datetime import date
import requests
from time import sleep
import urllib.request
import urllib.error
import urllib.parse
import json
import datetime

In [3]:
def get_counts(title: str, start: date, end: date, language_edition: str = "en",
               sleep_duration: float = 1) -> Tuple[List[date], List[int]]:
    sleep(sleep_duration)  # please use sleep so as to not flood wikimedia with requests
    # The API expects dates in the form YYYYMMDD
    start_date = start.strftime("%Y%m%d")
    end_date = end.strftime("%Y%m%d")

    # Percent-encode the title so that characters such as "/" or umlauts give a valid URL
    url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
           + language_edition + ".wikipedia/all-access/all-agents/"
           + urllib.parse.quote(title, safe="") + "/daily/" + start_date + "/" + end_date)
    try:
        req = urllib.request.Request(url)
        with urllib.request.urlopen(req) as response:
            res = json.loads(response.read())
        # Timestamps look like "2018113000"; the first 8 characters are the date
        timestamps = [datetime.datetime.strptime(r['timestamp'][:8], "%Y%m%d").date() for r in res['items']]
        view_counts = [r['views'] for r in res['items']]
        return timestamps, view_counts
    except (urllib.error.URLError, ValueError, KeyError):
        # HTTP errors (HTTPError is a URLError), network failures, or a malformed response
        return [], []

In [4]:

s = date(2018,11,30)
e = date(2018,12,3)
print(get_counts("Albert_Einstein",s,e))
#([datetime.date(2018, 11, 30),
#  datetime.date(2018, 12, 1),
#  datetime.date(2018, 12, 2),
#  datetime.date(2018, 12, 3)],
# [17631, 14710, 16126, 17995])

s = date(2018,12,30)
e = date(2019, 1,3)
print(get_counts("Bier",s,e, "de"))
#([datetime.date(2018, 12, 30),
# datetime.date(2018, 12, 31),
# datetime.date(2019, 1, 1),
# datetime.date(2019, 1, 2),
# datetime.date(2019, 1, 3)], 
# [1090, 783, 732, 790, 842])


s = date(2010,12,30)
e = date(2011, 1,3)
print(get_counts("Albert_Einstein",s,e))
# error case
# [], []

s = date(2019,12,30)
e = date(2019, 1, 3)
print(get_counts("Albert_Einstein",s,e))
# error case
# [], []

s = date(2019,12,30)
e = date(2020, 1, 3)
print(get_counts("asdjhsalkjhdkash",s,e))
# error case
# [], []

([datetime.date(2018, 11, 30), datetime.date(2018, 12, 1), datetime.date(2018, 12, 2), datetime.date(2018, 12, 3)], [17631, 14710, 16126, 17995])
([datetime.date(2018, 12, 30), datetime.date(2018, 12, 31), datetime.date(2019, 1, 1), datetime.date(2019, 1, 2), datetime.date(2019, 1, 3)], [1090, 783, 732, 790, 842])
([], [])
([], [])
([], [])


## b) Experimentation (1+1.5+1.5 = 4)

Find combinations of `(title, start, end, language_edition)` such that when plotted you can clearly see

1) There is no clear seasonality or periodicity (`nothing.png`)

2) There is a clear seasonality (`seasonality.png`)

3) There is clearly something repetitive but it is not seasonality (`repetitive.png`)

Save plots of the three different kinds of time series under their corresponding names.
For the seasonality plot also show the length of one period in the plot. Use a different article for each of the tasks. To have good enough statistics, make sure there are at least ~100 views per day.
Put the article title and the language edition in the title of the plot.
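One way to sanity-check the length of a period before marking it in the plot is to look at the autocorrelation of the series. The following is a rough heuristic sketch, not a required part of the task: the function name `estimate_period` and its `min_lag` parameter are hypothetical, the data is synthetic, and on real view counts trends and outliers can easily mislead the estimate.

```python
import numpy as np

def estimate_period(values, min_lag=2):
    # Hypothetical helper: returns the lag (in samples) with the highest autocorrelation
    x = np.asarray(values, dtype=float)
    x = x - x.mean()                                    # remove the constant offset
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]  # keep lags 0, 1, 2, ...
    acf = acf / acf[0]                                  # normalise so that lag 0 equals 1
    max_lag = len(x) // 2                               # long lags are based on too few samples
    return int(np.argmax(acf[min_lag:max_lag])) + min_lag

# Synthetic daily series with a weekly rhythm (10 weeks)
t = np.arange(70)
weekly = 100 + 20 * np.sin(2 * np.pi * t / 7)
print(estimate_period(weekly))  # 7
```

The lag returned is measured in samples, so for daily view counts it is a number of days.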

In [5]:
import matplotlib.pyplot as plt

In [6]:
s = date(2018,1,1)
e = date(2020, 12, 31)
dates,counts = get_counts("Albert_Einstein",s,e, "en")
fig_nothing = plt.figure(figsize=(12,12), facecolor=(1, 1, 1))
plt.plot(dates, counts)
plt.title("Albert_Einstein, en")
plt.xlabel("Date")
plt.ylabel("View count")
plt.savefig("nothing.png")
plt.close()

In [7]:
s = date(2018,1,1)
e = date(2020, 12, 31)
dates,counts = get_counts("Sunburn",s,e, "en")
fig_seasonality = plt.figure(figsize=(12,12), facecolor=(1, 1, 1))
plt.plot(dates, counts)
plt.title("Sunburn, en")
plt.xlabel("Date")
plt.ylabel("View count")
plt.axvspan(date(2018,1,1), date(2018,12,31), color='red', alpha=0.5)
plt.axvspan(date(2019,1,1), date(2019,12,31), color='green', alpha=0.5)
plt.axvspan(date(2020,1,1), date(2020,12,31), color='blue', alpha=0.5)
plt.savefig("seasonality.png")
plt.close()

## c) Time Series prediction (single exponential) (2)

Write a function `single_exponential_smoothing(values, h, alpha)` that performs time series prediction/forecast for the next 1..h time-steps using single exponential smoothing with parameter `alpha`. Initialise l_0 = `values[0]`.
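Written out, the recursion can be sketched as follows, assuming the usual textbook form of single exponential smoothing. The symbols $y_t$ (standing for `values[t]`) and $n$ (standing for `len(values)`) are notation introduced here for illustration only.

$$
\ell_0 = y_0, \qquad
\ell_t = \alpha\, y_t + (1 - \alpha)\, \ell_{t-1} \quad (t = 1, \dots, n-1), \qquad
\hat{y}_{n-1+k} = \ell_{n-1} \quad (k = 1, \dots, h)
$$

For example, with `values = [1, 2, 3, 4]` and $\alpha = 0.5$: $\ell_1 = 0.5 \cdot 2 + 0.5 \cdot 1 = 1.5$, $\ell_2 = 0.5 \cdot 3 + 0.5 \cdot 1.5 = 2.25$, $\ell_3 = 0.5 \cdot 4 + 0.5 \cdot 2.25 = 3.125$, so every forecast step equals $3.125$.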

In [8]:
import numpy as np

In [9]:
def single_exponential_smoothing(values : List[float], h : int, alpha : float) -> List[float]:
    # l[t] is the smoothed level after observing values[t]; l[0] = values[0]
    l = [values[0]]
    for i in range(len(values) - 1):
        l.append(alpha * values[i+1] + (1 - alpha) * l[i])

    # The forecast is flat: each of the next h steps equals the last level
    return [l[-1]] * h

In [10]:
print(single_exponential_smoothing([1,2,3,4], 2, 0.5))
# l = [1, 1.5, 2.25, 3.125]
# [3.125, 3.125]
print(single_exponential_smoothing([1,1.9,3.1,4.1], 2, 0.3))
# l = [1, 1.27, 1.819, 2.5032999999999994]
# [2.5032999999999994, 2.5032999999999994]
print(single_exponential_smoothing(list(np.sin(np.linspace(0,3))), 4, 0.1))
# [0.5173425352401494, 0.5173425352401494, 0.5173425352401494, 0.5173425352401494]

[3.125, 3.125]
[2.5032999999999994, 2.5032999999999994]
[0.5173425352401494, 0.5173425352401494, 0.5173425352401494, 0.5173425352401494]


## d) Time Series prediction (Holt's) (3)

Write a function `holts_smoothing(values, h, alpha, beta)` that performs time series prediction/forecast for the next 1..h time-steps using Holt's smoothing with parameters `alpha` and `beta`. Initialise l_0 = `values[0]` and b_0 = `values[1] - values[0]`.
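The level and trend updates can be sketched as follows, assuming the usual textbook form of Holt's linear trend method. As before, $y_t$ (for `values[t]`) and $n$ (for `len(values)`) are notation introduced here for illustration only.

$$
\begin{aligned}
\ell_0 &= y_0, \qquad b_0 = y_1 - y_0, \\
\ell_t &= \alpha\, y_t + (1 - \alpha)\,(\ell_{t-1} + b_{t-1}), \\
b_t &= \beta\,(\ell_t - \ell_{t-1}) + (1 - \beta)\, b_{t-1} \qquad (t = 1, \dots, n-1), \\
\hat{y}_{n-1+k} &= \ell_{n-1} + k\, b_{n-1} \qquad (k = 1, \dots, h)
\end{aligned}
$$

For example, with `values = [1, 2, 3, 4]` and $\alpha = \beta = 0.5$: $\ell_0 = 1$, $b_0 = 1$, then $\ell_1 = 0.5 \cdot 2 + 0.5 \cdot (1 + 1) = 2$ and $b_1 = 0.5 \cdot (2 - 1) + 0.5 \cdot 1 = 1$. The same pattern gives $\ell_3 = 4$ and $b_3 = 1$, so the forecasts for $k = 1, 2$ are $5$ and $6$.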

In [11]:
def holts_smoothing(values : List[float], h : int, alpha : float, beta : float) -> List[float]:
    l = [values[0]]              # level, l[0] = values[0]
    b = [values[1] - values[0]]  # trend, b[0] = values[1] - values[0]
    for i in range(len(values) - 1):
        # New level: blend the observation with the previous level plus trend
        l.append(alpha * values[i+1] + (1 - alpha) * (l[i] + b[i]))
        # New trend: blend the latest change in level with the previous trend
        b.append(beta * (l[i+1] - l[i]) + (1 - beta) * b[i])
    # k-step-ahead forecast: last level plus k times the last trend
    return [l[-1] + k * b[-1] for k in range(1, h + 1)]

In [12]:
print(holts_smoothing([1,2,3,4], 2, 0.5, 0.5))
# [5.0, 6.0]
# l=[1, 2.0, 3.0, 4.0]
# b=[1, 1.0, 1.0, 1.0]
print(holts_smoothing([1,1.9,3.1,4.1], 2, 0.3, 0.3))
# [4.854369999999999, 5.806839999999999]
# l= [1, 1.9, 2.8899999999999997, 3.9018999999999995]
# b= [0.8999999999999999, 0.8999999999999999, 0.9269999999999998, 0.9524699999999997]
print(holts_smoothing(list(np.sin(np.linspace(0,3))), 4, 0.1, 0.5))
# [0.09857994080989951, 0.035351062398172095, -0.027877816013555323, -0.09110669442528274]

[5.0, 6.0]
[4.854369999999999, 5.806839999999999]
[0.09857994080989951, 0.035351062398172095, -0.027877816013555323, -0.09110669442528274]
