---
title: "An introduction to changepoints"
subtitle: "Using R"
author: "Rebecca Killick(r.killick@lancs.ac.uk)"
date: "eRum Workshop 2016"
output:
  #html_notebook:
  beamer_presentation:
    theme: lancaster
    includes:
      in_header: header.tex
# Robin changed the theme so options weren't needed, i.e. footline and logo
---
```{r,include=FALSE}
library(changepoint)
library(changepoint.np)
knitr::opts_chunk$set(fig.align='center',fig.show='hold',fig.width=4,fig.height=4,size='footnotesize')
knitr::opts_knit$set(progress = FALSE,verbose = FALSE)
```
## Workshop Plan
* What are changepoints?
* Notation
* Likelihood based changepoints
+ Change in mean
+ Change in variance
+ (coffee break)
+ Change in mean & variance
* How many changes?
* Non-parametric changepoints
* Checking assumptions (if time allows)
There will be tasks throughout the sections.
## What are Changepoints?
Changepoints are also known as:
* breakpoints
* segmentation
* structural breaks
* regime switching
* detecting disorder
and can be found in a wide range of literature including
* quality control
* economics
* medicine
* environment
* linguistics
* \ldots
## What are changepoints?
For data $y_1, \ldots, y_n$, if a changepoint exists at $\tau$, then $y_1,\ldots,y_{\tau}$ differ from $y_{\tau+1},\ldots,y_n$ in some way.
There are many different types of change.
```{r, echo=F, out.width='.33\\textwidth'}
par(mar=c(4,4,.3,.3))
set.seed(1)
# Change in mean example following EFK
x=1:500
y=c(rnorm(100,1,sd=0.5),rnorm(150,0,sd=0.5),rnorm(200,2,sd=0.5),rnorm(50,0.5,sd=0.5))
plot(x,y,type='l',xlab='',ylab='')
lines(x=1:100,y=rep(1,100),col='red',lwd=3)
lines(x=101:250,y=rep(0,150),col='red',lwd=3)
lines(x=251:450,y=rep(2,200),col='red',lwd=3)
lines(x=451:500,y=rep(0.5,50),col='red',lwd=3)
# Change in variance example following EFK
x=1:500
y=c(rnorm(100,0,sd=0.1),rnorm(150,0,sd=0.7),rnorm(200,0,sd=0.25),rnorm(50,0,sd=1))
plot(x,y,type='l',xlab='',ylab='')
# Change in regression
x=1:500
y=c(0.01*x[1:100],1.5-0.02*(x[101:250]-101),(10^-5)*(-150000+2.5*(x[251:450]^2-251^2)-(x[251:450]-250)),rep(1,50))
ynoise=y+rnorm(500,0,0.2)
plot(x,ynoise,type='l',xlab='',ylab='')
lines(x=1:100,y=0.01*x[1:100],lwd=3,col='red')
lines(x=101:250,y=1.5-0.02*(x[101:250]-101),lwd=3,col='red')
lines(x=251:450,y=(10^-5)*(-150000+2.5*(x[251:450]^2-251^2)-(x[251:450]-250)),lwd=3,col='red')
lines(x=451:500,y=rep(1,50),lwd=3,col='red')
```
## What is the goal?
* Has a change occurred?
* If yes, where is the change?
* What is the difference between the pre and post change data?
+ Maybe this is the type of change
+ Maybe it is the parameter values before and after the change
* What is the probability that a change has occurred?
* How certain are we of the changepoint location?
* How many changes have occurred (+ all the above for each change)?
* Why has there been a change?
## Notation and Concepts
```{r, echo=F,out.height='.8\\textheight'}
par(mar = c(4, 4, 0.1, 0.1))
set.seed(1)
data=c(rnorm(100,0,1),rnorm(100,5,1),rnorm(100,0,1))
plot(data,xlab='',ylab='',xaxt='n',type='l')
axis(1,at=c(0,100,200,300),labels=c(0,expression(tau[1]),expression(tau[2]),300))
```
## Notation and Concepts
Thus a changepoint model for a change in mean has the following formulation:
$$
y_t = \left\{ \begin{array}{lcl} \mu_1 & \mbox{if} & 1\leq t \leq \tau_1 \\
\mu_2 & \mbox{if} & \tau_1 < t \leq \tau_2 \\
\vdots & & \vdots \\
\mu_{m+1} & \mbox{if} & \tau_m < t \leq \tau_{m+1}=n \end{array} \right.
$$
## More complicated changes
```{r, echo=F,out.width='.5\\textwidth'}
set.seed(1)
par(mar=c(4,4,.3,.3))
# Change in ar example
x=1:500
y=c(arima.sim(model=list(ar=0.8),n=100),arima.sim(model=list(ar=c(0.5,0.2)),n=150),arima.sim(model=list(ar=c(-0.2)),n=200),arima.sim(model=list(ar=c(-0.5)),n=50))
plot(x,y,type='l',xlab='',ylab='')
# Change in seasonality and noise
x=1:500
y=c(sin(x[1:250]/21)+cos(x[1:250]/21),sin((1.1*x[251:500]/15))+cos((1.1*x[251:500]/15)))
ynoise=y+c(rnorm(100,0,sd=0.1),rnorm(150,0,sd=0.25),rnorm(200,0,sd=0.3),rnorm(50,0,sd=.4))
plot(x,ynoise,type='l',xlab='',ylab='')
```
## Online vs Offline
* Online
+ Processes data as it arrives or in batches
+ Goal is quickest detection of a change
+ Often used in process control, intrusion detection
* Offline
+ Processes all the data in one go
+ Goal is accurate detection of a change
+ Often used in genome analysis, audiology
## Online vs Offline
```{r, echo=F,out.width='.5\\textwidth'}
set.seed(1)
par(mar=c(4,4,.3,.3))
x=1:110
y=c(rnorm(100),rnorm(10,2,1))
# online example
library(cpm)
cpm=detectChangePoint(y,cpmType="Student")
plot(x,y,type='n',xlab='',ylab='')
lines(x[1:cpm$detectionTime],y[1:cpm$detectionTime])
lines(x[(cpm$detectionTime+1):110],y[(cpm$detectionTime+1):110],lty=5,col='grey')
abline(v=cpm$changePoint)
abline(v=cpm$detectionTime,col='red')
#offline example
plot(x,y,type='l',xlab='',ylab='')
cpt=cpt.mean(y)
abline(v=cpts(cpt))
```
## Packages
Today we will use the
`library(changepoint)`
`library(changepoint.np)`
packages.
Other notable `R` packages are available for changepoint analysis including
* `strucchange` - for changes in regression
* `bcp` - if you want to be Bayesian
* `cpm` - for online changes (`changepoint.online` coming soon)
See my talk tomorrow in the **15:15** session for AR(1) and trend detection.
## Change in mean
Assume we have time-series data where
$$
Y_t|\theta_t \sim \mbox{N}(\theta_t,1),
$$
but where the means, $\theta_t$, are piecewise constant through time.
```{r, echo=FALSE, out.width='.5\\textwidth'}
# Change in mean example following EFK
x=1:500
y=c(rnorm(100,1,sd=0.5),rnorm(150,0,sd=0.5),rnorm(200,2,sd=0.5),rnorm(50,0.5,sd=0.5))
plot(x,y,type='l',xlab='',ylab='')
lines(x=1:100,y=rep(1,100),col='red',lwd=3)
lines(x=101:250,y=rep(0,150),col='red',lwd=3)
lines(x=251:450,y=rep(2,200),col='red',lwd=3)
lines(x=451:500,y=rep(0.5,50),col='red',lwd=3)
```
## Inferring Changepoints
We want to infer the number and position of the points at which the mean changes. One approach:
**Likelihood Ratio Test**
To detect a single changepoint we can use the likelihood ratio test statistic:
$$
LR=\max_\tau\{\ell(y_{1:\tau})+\ell(y_{\tau+1:n})-\ell(y_{1:n})\}.
$$
We infer a changepoint if $LR>\beta$ for some (suitably chosen) $\beta$. If we infer a changepoint its position is estimated as
$$
\hat{\tau}=\arg\max_\tau \{\ell(y_{1:\tau})+\ell(y_{\tau+1:n})-\ell(y_{1:n})\}.
$$
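To make this concrete, the scan over $\tau$ can be coded by brute force. Below is a minimal sketch (the helper `lr.mean` is illustrative, not part of the `changepoint` package), assuming Normal data with known unit variance so that $-\ell(y_{s:t})$ is, up to a constant, half the sum of squared deviations from the segment mean:
```{r}
# Brute-force LR test for a single change in mean (a sketch).
# Assumes Normal data with known variance 1.
lr.mean <- function(y, beta) {
  n <- length(y)
  cost <- function(x) sum((x - mean(x))^2)/2  # -max log-likelihood, up to constants
  lr <- sapply(1:(n - 1),
               function(tau) cost(y) - cost(y[1:tau]) - cost(y[(tau + 1):n]))
  if (max(lr) > beta) which.max(lr) else NA  # estimated location, or NA if no change
}
set.seed(1)
lr.mean(c(rnorm(100, 0), rnorm(100, 5)), beta = 2*log(200))
```
This brute-force scan costs $\mathcal{O}(n^2)$; using cumulative sums it can be reduced to $\mathcal{O}(n)$.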
## `changepoint` R package
The `changepoint` R package contains 3 wrapper functions:
* `cpt.mean` - mean only changes
* `cpt.var` - variance only changes
* `cpt.meanvar` - mean and variance changes
The package also contains:
* functions/methods for the `cpt` S4 class
* 5 data sets
* 4 other R functions that are made available for those who know what they are doing and might want to extend/modify the package.
## The `cpt` class
* S4 class
* Slots store all the information from the analysis
+ e.g. data.set, cpts, param.est, pen.value, ncpts.max
* Slots are accessed via their names e.g. `cpts(x)`
* Standard methods are available for the class e.g. plot, summary
* Additional generic functions are available e.g. seg.len, ncpts
* Each core function outputs a `cpt` object
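For example, fitting to a toy series and querying the returned object (a sketch; `fit` is an illustrative name):
```{r}
set.seed(10)
fit <- cpt.mean(c(rnorm(50), rnorm(50, 3)))  # returns a cpt object
cpts(fit)       # estimated changepoint locations
ncpts(fit)      # number of changepoints
seg.len(fit)    # length of each segment
pen.value(fit)  # penalty value used
```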
## `cpt.mean`
`cpt.mean(data, penalty="MBIC", pen.value=0, method="AMOC", Q=5, test.stat="Normal", class=TRUE, param.estimates=TRUE,minseglen=1)`
* `data` - vector or `ts` object
* `penalty` - the cut-off $\beta$; choices: MBIC, SIC, BIC, AIC, Hannan-Quinn, Asymptotic, Manual.
* `pen.value` - Type I error for Asymptotic, number or character for manual.
* `method` - AMOC, PELT, SegNeigh, BinSeg.
* `Q` - max number of changes for SegNeigh or BinSeg.
* `test.stat` - Test statistic, Normal or CUSUM.
* `class` - return a `cpt` object or not.
* `param.estimates` - return parameter estimates or not.
* `minseglen` - minimum number of data points between changes.
## Single Change in Mean
```{r, out.width='.3\\textwidth'}
set.seed(1)
m1=c(rnorm(100,0,1),rnorm(100,5,1))
m1.amoc=cpt.mean(m1)
cpts(m1.amoc)
m1.cusum=cpt.mean(m1,pen.value=1,penalty='Manual',
test.stat='CUSUM')
```
## Single Change in Mean
```{r}
plot(m1.amoc)
```
## Example: Nile
Data from Cobb (1978): readings of the annual flow volume of the Nile River at Aswan from 1871 to 1970.
```{r,fig.height=3,fig.width=7,out.height='0.35\\textheight',out.width='\\textwidth'}
data(Nile)
ts.plot(Nile)
```
It has been hypothesised that there was a change around the turn of the century.
## Task
Use the `cpt.mean` function to see if there is evidence for a change in mean in the Nile river data.
```{r}
data(Nile)
```
If you identify a change, where is it and what are the pre and post change means?
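One possible approach, using the defaults (AMOC search with the MBIC penalty); treat this as a sketch rather than the only answer:
```{r}
nile.amoc <- cpt.mean(Nile)
cpts(nile.amoc)       # index of the estimated change
param.est(nile.amoc)  # pre and post change means
```
Since the series starts in 1871, adding the returned index to 1870 gives the calendar year of the change.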
## Multiple Changepoints
## Likelihood Ratio Test
Define $m$ to be the number of changepoints, with positions $\bm{\tau}=(\tau_0,\tau_1,\ldots,\tau_{m+1})$ where $\tau_0=0$ and $\tau_{m+1}=n$.
Then one application of the likelihood ratio test can be viewed as solving
$$
\min_{m\in\{0,1\},\bm{\tau}} \left\{
\sum_{i=1}^{m+1} \left[-\ell(y_{\tau_{i-1}:\tau_{i}})\right] + \beta m \right\}
$$
Repeated application thus aims to solve
$$
\min_{m,\bm{\tau}} \left\{
\sum_{i=1}^{m+1} \left[-\ell(y_{\tau_{i-1}:\tau_{i}})\right] + \beta m \right\}
$$
## Penalised Likelihood
The above can be viewed as a special case of penalised likelihood. Here the aim is to maximise the *likelihood* over the number and position of the changepoints, *subject to* a penalty that depends on the number of changepoints; the penalty guards against over-fitting.
This is equivalent to solving
$$
\min_{m,\bm{\tau}} \left\{
\sum_{i=1}^{m+1} \left[-\ell(y_{\tau_{i-1}:\tau_{i}})\right] + \beta f(m) \right\}
$$
for a suitable penalty function $f(m)$ and penalty constant $\beta$.
## Identifying changes?
All these methods can be cast in terms of minimising a function of $m$ and $\bm{\tau}$ of the form:
$$
\sum_{i=1}^{m+1}{\left[\mathcal{C}(y_{(\tau_{i-1}+1):\tau_i})\right] + \beta f(m)}.
$$
This function depends on the data only through a sum of a *cost* for each segment; a penalty term that depends on the number of segments is then added.
### Open Research Question
What penalty should I use?
Several authors have attempted to answer this question, but in reality each has added their own criterion to the list. At best, we have specific criteria shown to be optimal in very specific settings.
## The Challenge
* What are the values of $\tau_1,\ldots,\tau_m$?
* What is $m$?
* For $n$ data points there are $2^{n-1}$ possible solutions
* If $m$ is known there are still $\binom{n-1}{m-1}$ solutions
* If $n=1000$ and $m=10$, $2.634096 \times 10^{21}$ solutions
* How do we search the solution space efficiently?
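These counts are easy to verify directly in R:
```{r}
2^(1000 - 1)              # all possible segmentations of n = 1000 points
choose(1000 - 1, 10 - 1)  # with m = 10 fixed, as quoted above
```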
## Methods in `changepoint`
* At Most One Change (`AMOC`)
*Approximate* but computationally *fast*:
* Binary Segmentation (`BinSeg`) (Scott and Knott (1974)) which is $\mathcal{O}(n\log n)$ in CPU time.
*Slower* but *exact*:
* Segment Neighbourhood (`SegNeigh`) (Auger and Lawrence (1989)) is $\mathcal{O}(Qn^2)$.
*Fast* and *exact*:
* Pruned Exact Linear Time (`PELT`) (Killick et al. (2012)) is at worst $\mathcal{O}(n^2)$; for linear penalties $f(m)=m$, with the number of changes growing linearly in $n$, it is $\mathcal{O}(n)$ in expectation.
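As a quick sanity check, the search methods can be compared on the earlier simulated series `m1`; with a single well-separated change they should all agree (a sketch):
```{r}
cpts(cpt.mean(m1, method = "AMOC"))
cpts(cpt.mean(m1, method = "BinSeg", Q = 5))
cpts(cpt.mean(m1, method = "PELT"))
```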
## cpt.var
`cpt.var(data, penalty, pen.value, know.mean=FALSE, mu=NA, method, Q, test.stat="Normal", class, param.estimates, minseglen=2)`
Majority of arguments are the same as for `cpt.mean`
* `know.mean` - if the mean is known it is not counted as an estimated parameter when calculating penalties.
* `mu` - Mean if known.
* `test.stat` - Normal or CSS (cumulative sums of squares)
* `minseglen` - Default is 2
## Changes in Variance
```{r,results='hold'}
set.seed(1)
v1=c(rnorm(100,0,1),rnorm(100,0,2),rnorm(100,0,10),
rnorm(100,0,9))
v1.man=cpt.var(v1,method='PELT',penalty='Manual',
pen.value='2*log(n)')
cpts(v1.man)
param.est(v1.man)
```
## Changes in Variance
Ratios of successive true variances: 4, 25, 0.81.
```{r,fig.height=3,fig.width=7,out.height='0.4\\textheight',out.width='\\textwidth'}
plot(v1.man,cpt.width=3)
```
## FTSE100
Yahoo! Finance data: daily returns of the FTSE 100 index, 2nd April 1984 until 13th September 2012.
```{r,fig.height=3,fig.width=7,out.height='0.4\\textheight',out.width='\\textwidth'}
data(ftse100)
plot(ftse100,type='l')
```
## Task
Use the `cpt.var` function to see if there is evidence for changes in variance in the FTSE100 data.
```{r}
data(ftse100)
```
If you identify changes, where are they and what are the variances in each segment?
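One possible starting point (a sketch, assuming the daily returns are in the second column of `ftse100`):
```{r}
ftse.pelt <- cpt.var(ftse100[, 2], method = "PELT")
ncpts(ftse.pelt)                     # number of variance changes found
head(param.est(ftse.pelt)$variance)  # variance estimates, first few segments
```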
## `cpt.meanvar`
`cpt.meanvar(data, penalty, pen.value, method, Q, test.stat="Normal", class, param.estimates, shape=1,minseglen=2)`
Again the same underlying structure as `cpt.mean`.
* `test.stat` - choice of Normal, Gamma, Exponential, Poisson.
* `shape` - assumed shape parameter for Gamma.
* `minseglen` - minimum segment length of 2
## Mean & Variance
```{r}
set.seed(1)
mv1=c(rexp(50,rate=1),rexp(50,5),rexp(50,2),rexp(50,7))
mv1.binseg=cpt.meanvar(mv1,test.stat='Exponential',
method='BinSeg',Q=10,penalty="SIC")
cpts(mv1.binseg)
param.est(mv1.binseg)
```
## Mean & Variance
```{r}
plot(mv1.binseg,cpt.width=3,cpt.col='blue')
```
## Task
G+C content within part of Human Chromosome 1; data from NCBI.
3kb windows along the chromosome from 10Mb to 33Mb.
Use the `cpt.meanvar` function to identify regions with different G+C content.
```{r,fig.height=3,fig.width=7,out.height='0.4\\textheight',out.width='\\textwidth'}
data(HC1)
ts.plot(HC1)
```
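One possible starting point (a sketch, using the defaults apart from the search method):
```{r}
hc1.pelt <- cpt.meanvar(HC1, method = "PELT")
ncpts(hc1.pelt)       # number of changes found
head(cpts(hc1.pelt))  # first few changepoint locations (window index)
```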
## CROPS
**C**hangepoints for a **r**ange **o**f **p**enaltie**s**
Use `penalty='CROPS'` with `method='PELT'` to get all segmentations for a range of penalty values.
```{r}
v1.crops=cpt.var(v1,method="PELT",penalty="CROPS",
pen.value=c(5,500))
```
## CROPS
```{r}
cpts.full(v1.crops)
```
## CROPS
```{r,fig.height=3,fig.width=7,out.height='0.4\\textheight',out.width='\\textwidth'}
pen.value.full(v1.crops)
plot(v1.crops,ncpts=5)
```
## CROPS
```{r,fig.height=4,fig.width=4,out.height='0.7\\textheight'}
plot(v1.crops,diagnostic=TRUE)
```
<!-- ## Lavielle (2005) -->
<!-- For $1\leq K\leq K_{MAX}$: -->
<!-- $$ -->
<!-- J_K = \frac{\ell_{K_{MAX}}-\ell_K}{\ell_{K_{MAX}}-\ell_1} \left(K_{MAX}-1\right) + 1 -->
<!-- $$ -->
<!-- Then for $2\leq K\leq K_{MAX}-1$: -->
<!-- \begin{align} -->
<!-- D_K &= J_{K-1}-2J_K+J_{K+1} \\ -->
<!-- D_1 &= \infty -->
<!-- \end{align} -->
<!-- Then -->
<!-- $$ -->
<!-- \hat{K} = \max\{1\leq K\leq K_{MAX}-1 : D_K>C\} -->
<!-- $$ -->
<!-- $C$ is the threshold for a change. -->
## Task
Look at the FTSE100 data again and use the CROPS technique to determine an appropriate number of changes.
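A sketch of one route in, again assuming the returns are in the second column of `ftse100`; the diagnostic plot helps pick an elbow:
```{r,fig.height=4,fig.width=4,out.height='0.7\\textheight'}
ftse.crops <- cpt.var(ftse100[, 2], method = "PELT", penalty = "CROPS",
                      pen.value = c(5, 500))
plot(ftse.crops, diagnostic = TRUE)
```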
## `cpt.np`
`cpt.np(data, penalty, pen.value, method, test.stat="empirical_distribution", class, minseglen=1, nquantiles=10)`
Again the same underlying structure as `cpt.mean`.
* `test.stat` - currently only `empirical_distribution` is available
* `minseglen` - minimum segment length of 1
* `nquantiles` - number of quantiles to use
## Example
```{r}
set.seed(12)
J <- function(x){(1+sign(x))/2}  # unit step function
n <- 1000
# changepoint locations (fractions of n) and jump sizes
tau <- c(0.1,0.13,0.15,0.23,0.25,0.4,0.44,0.65,0.76,0.78,
         0.81)*n
h <- c(2.01, -2.51, 1.51, -2.01, 2.51, -2.11, 1.05, 2.16,
       -1.56, 2.56, -2.11)
sigma <- 0.5
t <- seq(0,1,length.out = n)
data <- array()
for (i in 1:n){
  # piecewise constant signal (sum of steps) plus Gaussian noise
  data[i] <- sum(h*J(n*t[i] - tau)) + (sigma * rnorm(1))
}
```
## Example
```{r}
out <- cpt.np(data, method="PELT",minseglen=2,
nquantiles =4*log(length(data)))
cpts(out)
```
## Example
```{r,fig.height=3,fig.width=7,out.height='0.4\\textheight',out.width='\\textwidth'}
plot(out)
```
## Task
Look at the `HeartRate` data from the `changepoint.np` package. Use one of the non-parametric functions to see if there is evidence for changes in heart rate.
```{r}
data(HeartRate)
```
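One possible approach, mirroring the earlier `cpt.np` call (a sketch):
```{r}
hr.np <- cpt.np(HeartRate, method = "PELT",
                nquantiles = 4*log(length(HeartRate)))
ncpts(hr.np)  # number of heart rate changes found
```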
## Checking Assumptions
The main assumptions for a Normal likelihood ratio test for a change in mean are:
* Independent data points;
* Normally distributed points pre and post change;
* Constant variance across the data.
How can we check these?
## Checking Assumptions
In reality we can't check the assumptions prior to the analysis: the changes themselves distort marginal summaries of the data, as the following plots and tests on `m1` show.
```{r,out.height='0.5\\textheight'}
ts.plot(m1)
```
## Checking Assumptions
```{r,out.height='0.6\\textheight'}
hist(m1)
```
## Checking Assumptions
```{r}
shapiro.test(m1)
ks.test(m1,pnorm,mean=mean(m1),sd=sd(m1))
```
## Checking Assumptions
```{r,out.height='0.6\\textheight'}
acf(m1)
```
## How to check
* Check each segment independently
```{r}
# segment start (0-indexed) and length, one row per segment
cpt.seg=cbind(c(0,cpts(m1.amoc)),seg.len(m1.amoc))
data=data.set(m1.amoc)
# Shapiro-Wilk normality test within each segment
shapiro.func=function(x){
  out=shapiro.test(data[(x[1]+1):(x[1]+x[2])])
  return(c(out$statistic,p=out$p.value))}
apply(cpt.seg,1,shapiro.func)
```
## Segment Check
```{r}
ks.func=function(x){
tmp=data[(x[1]+1):(x[1]+x[2])]
out=ks.test(tmp,pnorm,mean=mean(tmp),sd=sd(tmp))
return(c(out$statistic,p=out$p.value))}
apply(cpt.seg,1,ks.func)
```
## Segment Check
```{r,out.width='0.45\\textwidth',out.height='0.5\\textheight'}
qqnorm.func=function(x){
qqnorm(data[(x[1]+1):(x[1]+x[2])])
qqline(data[(x[1]+1):(x[1]+x[2])])}
out=apply(cpt.seg,1,qqnorm.func)
```
## Segment Check
```{r,out.width='0.45\\textwidth',out.height='0.5\\textheight'}
acf.func=function(x){
acf(data[(x[1]+1):(x[1]+x[2])])}
out=apply(cpt.seg,1,acf.func)
```
## How to check
* Check the residuals
```{r}
means=param.est(m1.amoc)$mean
m1.resid=m1-rep(means,seg.len(m1.amoc))
shapiro.test(m1.resid)
```
## Residual Check
```{r}
ks.test(m1.resid,pnorm,mean=mean(m1.resid),sd=sd(m1.resid))
```
## Residual Check
```{r,out.height='0.6\\textheight'}
qqnorm(m1.resid)
qqline(m1.resid)
```
## Residual Check
```{r,out.height='0.6\\textheight'}
acf(m1.resid)
```
## Task
Check the assumptions you have made on the simulated, Nile, FTSE100 and HeartRate data using either the segment or residual check.
What effect might any invalid assumptions have on the inference?
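For instance, the residual check can be applied to the Nile fit (a sketch, assuming the fit from the earlier task is stored as `nile.amoc`):
```{r,out.height='0.5\\textheight'}
nile.resid <- Nile - rep(param.est(nile.amoc)$mean, seg.len(nile.amoc))
shapiro.test(nile.resid)
acf(nile.resid)
```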
## Consolidating Task
Download the ratings for the following TV shows from the [IMDb](http://www.imdb.com/) and analyse each series using some of the techniques you have learnt today. For each series, do you identify any changes? Are the assumptions you are making valid? What effect might any invalid assumptions have on the inference?
* [Doctor Who](http://www.imdb.com/title/tt0436992/epdate)
* [Grey's Anatomy](http://www.imdb.com/title/tt0413573/epdate)
* [Mistresses](http://www.imdb.com/title/tt2295809/epdate)
* [The Simpsons](http://www.imdb.com/title/tt0096697/epdate)
* [Top Gear](http://www.imdb.com/title/tt1628033/epdate)
(Understandably, IMDb does not allow screen scraping or downloads of information for redistribution, so you will have to copy and paste the table into Excel, or equivalent, yourself in order to get the ratings data into R.)
## Bonus
Just from looking at the data, can you predict which shows have been cancelled?
## References
[JSS:](https://www.jstatsoft.org/article/view/v058i03) Killick, Eckley (2014)
[PELT:](http://www.tandfonline.com/doi/abs/10.1080/01621459.2012.737745) Killick, Fearnhead, Eckley (2012)
[CROPS:](http://dx.doi.org/10.1080/10618600.2015.1116445) Haynes, Eckley, Fearnhead (2015)
[cpt.np:](http://link.springer.com/article/10.1007/s11222-016-9687-5) Haynes, Fearnhead, Eckley (2016)
<!-- [Lavielle](http://dx.doi.org/10.1016/j.sigpro.2005.01.012) (2005) -->
## Coming soon
$\ldots$ to a changepoint package near you
* Join-point regression
* FPOP (faster than binary segmentation but exact)
* Online PELT
* Multivariate changepoints
* Long memory or changepoint?