
# Advanced Recommender Systems with Python

Welcome to the code notebook for creating Advanced Recommender Systems with Python. This is an optional lecture notebook for you to check out. Currently there is no video for this lecture because of the level of mathematics used and the heavy use of SciPy here.

Recommendation systems usually rely on larger data sets that need to be organized in a particular fashion. Because of this, we won't have a project to go along with this topic. Instead, we will have a more intensive walkthrough of creating a recommendation system with Python, using the same MovieLens data set.

*Note: The actual mathematics behind recommender systems is pretty heavy in Linear Algebra.*
___

## Methods Used

The two most common types of recommender systems are **Content-Based** and **Collaborative Filtering (CF)**. 

* Collaborative filtering produces recommendations based on knowledge of users’ attitudes toward items; that is, it uses the "wisdom of the crowd" to recommend items. 
* Content-based recommender systems focus on the attributes of the items and give you recommendations based on the similarity between them.

## Collaborative Filtering

In general, collaborative filtering (CF) is more commonly used than content-based systems because it usually gives better results and is relatively easy to understand (from an overall implementation perspective). The algorithm can also do feature learning on its own, which means that it can start to learn for itself which features to use. 

CF can be divided into **Memory-Based Collaborative Filtering** and **Model-Based Collaborative Filtering**. 

In this tutorial, we will implement Model-Based CF by using singular value decomposition (SVD) and Memory-Based CF by computing cosine similarity. 

## The Data

We will use the famous MovieLens dataset, which is one of the most common datasets used when implementing and testing recommender engines. It contains 100k movie ratings from 943 users on a selection of 1682 movies.

You can download the dataset [here](http://files.grouplens.org/datasets/movielens/ml-100k.zip) or just use the u.data file that is already included in this folder.

____
## Getting Started

Let's import some libraries we will need:

In [1]:
import numpy as np
import pandas as pd

We can then read in the **u.data** file, which contains the full dataset. You can read a brief description of the dataset [here](http://files.grouplens.org/datasets/movielens/ml-100k-README.txt).

Note how we specify the separator argument for a tab-separated file.

In [2]:
column_names = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('u.data', sep='\t', names=column_names)

Let's take a quick look at the data.

In [3]:
df.head()

| | user_id | item_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 0 | 50 | 5 | 881250949 |
| 1 | 0 | 172 | 5 | 881250949 |
| 2 | 0 | 133 | 1 | 881250949 |
| 3 | 196 | 242 | 3 | 881250949 |
| 4 | 186 | 302 | 3 | 891717742 |


Note how we only have the item_id, not the movie name. We can use the Movie_Id_Titles csv file to grab the movie names and merge them with this dataframe:

In [4]:
movie_titles = pd.read_csv("Movie_Id_Titles")
movie_titles.head()

| | item_id | title |
|---|---|---|
| 0 | 1 | Toy Story (1995) |
| 1 | 2 | GoldenEye (1995) |
| 2 | 3 | Four Rooms (1995) |
| 3 | 4 | Get Shorty (1995) |
| 4 | 5 | Copycat (1995) |


Then merge the dataframes:

In [5]:
df = pd.merge(df,movie_titles,on='item_id')
df.head()

| | user_id | item_id | rating | timestamp | title |
|---|---|---|---|---|---|
| 0 | 0 | 50 | 5 | 881250949 | Star Wars (1977) |
| 1 | 290 | 50 | 5 | 880473582 | Star Wars (1977) |
| 2 | 79 | 50 | 4 | 891271545 | Star Wars (1977) |
| 3 | 2 | 50 | 5 | 888552084 | Star Wars (1977) |
| 4 | 8 | 50 | 5 | 879362124 | Star Wars (1977) |
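To make the merge step concrete, here is a minimal sketch on two tiny hand-made frames. The names `ratings`, `titles`, and `merged` and the toy rows are assumptions for illustration only; the `validate="many_to_one"` argument is an optional extra that is not used in this notebook, and it asks pandas to raise an error if an `item_id` appears more than once in the titles table.

```python
import pandas as pd

# Toy stand-ins for the ratings and title tables (hypothetical rows).
ratings = pd.DataFrame({
    'user_id': [0, 290, 79],
    'item_id': [50, 50, 1],
    'rating':  [5, 5, 4],
})
titles = pd.DataFrame({
    'item_id': [1, 50],
    'title':   ['Toy Story (1995)', 'Star Wars (1977)'],
})

# Inner join on item_id: every rating row gains the matching title.
# validate='many_to_one' checks that each item_id is unique in `titles`.
merged = pd.merge(ratings, titles, on='item_id', validate='many_to_one')

print(merged)
```

Because the default join is an inner join, a rating whose `item_id` is missing from the titles table would be dropped, so comparing `len(merged)` with `len(ratings)` is a quick sanity check.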


Now let's take a quick look at the number of unique users and movies.

In [6]:
n_users = df.user_id.nunique()
n_items = df.item_id.nunique()

print('Num. of Users: '+ str(n_users))
print('Num of Movies: '+str(n_items))

Num. of Users: 944
Num of Movies: 1682


## Train Test Split

Recommendation systems are by their very nature difficult to evaluate, but we will still show you how to evaluate them in this tutorial. To do this, we'll split our data into two sets. However, we won't do our classic X_train, X_test, y_train, y_test split. Instead, we can simply segment the data into two sets:

In [8]:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(df, test_size=0.25)
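The split above works row by row, so each individual rating (not each user) lands in either the training or the testing set. The sketch below shows the same idea by hand using pandas `sample` instead of `train_test_split`. This is a swapped-in technique for illustration; the toy frame and the names `ratings`, `train_part`, and `test_part` are assumptions, and `random_state` is only there to make the draw repeatable.

```python
import pandas as pd

# Eight toy ratings (hypothetical values).
ratings = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3, 3, 4, 4],
    'item_id': [1, 2, 1, 3, 2, 3, 1, 2],
    'rating':  [5, 3, 4, 2, 1, 5, 3, 4],
})

# Hold out 25% of the rows at random for testing.
test_part = ratings.sample(frac=0.25, random_state=42)
# Everything that was not sampled stays in the training set.
train_part = ratings.drop(test_part.index)

print(len(train_part), len(test_part))  # 6 2
```

Note that the same user can appear in both parts, which is what later lets the training and testing user-item matrices share the same shape.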

## Memory-Based Collaborative Filtering

Memory-Based Collaborative Filtering approaches can be divided into two main approaches: **user-item filtering** and **item-item filtering**. 

*User-item filtering* takes a particular user, finds users that are similar to that user based on similarity of ratings, and recommends items that those similar users liked. 

In contrast, *item-item filtering* will take an item, find users who liked that item, and find other items that those users or similar users also liked. It takes items and outputs other items as recommendations. 

* *Item-Item Collaborative Filtering*: “Users who liked this item also liked …”
* *User-Item Collaborative Filtering*: “Users who are similar to you also liked …”

In both cases, you create a user-item matrix, which is built from the entire dataset.

Since we have split the data into testing and training sets, we will need to create two ``[n_users x n_items]`` matrices (all users by all movies), one for each set. With the counts printed above, that is ``[944 x 1682]``. 

The training matrix contains 75% of the ratings and the testing matrix contains 25% of the ratings.  
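As a rough sketch of what such a matrix looks like in code, the example below fills a small NumPy array from a toy ratings table. The frame, its values, and the name `user_item` are assumptions for illustration. It also assumes that IDs run from 1 to N with no gaps, so an ID of 0 (as in the first rows shown earlier) would need a different offset.

```python
import numpy as np
import pandas as pd

# Toy ratings (hypothetical values) with the same column names as df.
ratings = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3],
    'item_id': [1, 3, 2, 1, 2],
    'rating':  [5, 3, 4, 1, 2],
})

n_users = ratings.user_id.max()
n_items = ratings.item_id.max()

# Rows are users, columns are items; 0 means "not rated".
user_item = np.zeros((n_users, n_items))
for row in ratings.itertuples(index=False):
    # IDs start at 1, so shift by one to get zero-based indices.
    user_item[row.user_id - 1, row.item_id - 1] = row.rating

print(user_item)
```

Most entries of a real user-item matrix stay at 0, because each user rates only a small fraction of the movies.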

Example of user-item matrix:
<img class="aligncenter size-thumbnail img-responsive" src="http://s33.postimg.org/ay0ty90fj/BLOG_CCA_8.png" alt="blog8"/>

After you have built the user-item matrix you calculate the similarity and create a similarity matrix. 

The similarity values between items in *Item-Item Collaborative Filtering* are measured by observing all the users who have rated both items.  

<img class="aligncenter size-thumbnail img-responsive" style="max-width:100%; width: 50%; max-width: none" src="http://s33.postimg.org/i522ma83z/BLOG_CCA_10.png"/>

For *User-Item Collaborative Filtering* the similarity values between users are measured by observing all the items that are rated by both users.

<img class="aligncenter size-thumbnail img-responsive" style="max-width:100%; width: 50%; max-width: none" src="http://s33.postimg.org/mlh3z3z4f/BLOG_CCA_11.png"/>

A similarity measure commonly used in recommender systems is *cosine similarity*, where the ratings are seen as vectors in ``n``-dimensional space and the similarity is calculated based on the angle between these vectors. 
Cosine similarity for users *k* and *a* can be calculated using the formula below, where you take the dot product of the user vector *$u_k$* and the user vector *$u_a$* and divide it by the product of the Euclidean lengths of the vectors.
<img class="aligncenter size-thumbnail img-responsive" src="https://latex.codecogs.com/gif.latex?s_u^{cos}(u_k,u_a)=\frac{u_k&space;\cdot&space;u_a&space;}{&space;\left&space;\|&space;u_k&space;\right&space;\|&space;\left&space;\|&space;u_a&space;\right&space;\|&space;}&space;=\frac{\sum&space;x_{k,m}x_{a,m}}{\sqrt{\sum&space;x_{k,m}^2\sum&space;x_{a,m}^2}}"/>

To calculate similarity between items *m* and *b* you use the formula:

<img class="aligncenter size-thumbnail img-responsive" src="https://latex.codecogs.com/gif.latex?s_u^{cos}(i_m,i_b)=\frac{i_m&space;\cdot&space;i_b&space;}{&space;\left&space;\|&space;i_m&space;\right&space;\|&space;\left&space;\|&space;i_b&space;\right&space;\|&space;}&space;=\frac{\sum&space;x_{a,m}x_{a,b}}{\sqrt{\sum&space;x_{a,m}^2\sum&space;x_{a,b}^2}}
"/>

Your first step will be to create the user-item matrix. Since you have both testing and training data you need to create two matrices.  
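Once a user-item matrix exists, the two cosine formulas above reduce to a few array operations. The sketch below is one way to write them with plain NumPy. The helper name `cosine_similarity_rows` and the toy matrix are assumptions, and unrated entries are treated as 0, so they add nothing to the dot product in the numerator.

```python
import numpy as np

def cosine_similarity_rows(matrix):
    """Cosine similarity between every pair of rows of a 2-D array."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    # Guard against all-zero rows (users or items with no ratings).
    norms[norms == 0] = 1.0
    unit = matrix / norms
    # Dot products of unit-length rows are the cosines of the angles.
    return unit @ unit.T

# Toy user-item matrix: 3 users (rows) by 3 items (columns).
user_item = np.array([
    [5.0, 0.0, 3.0],
    [0.0, 4.0, 0.0],
    [1.0, 2.0, 0.0],
])

user_similarity = cosine_similarity_rows(user_item)    # users x users
item_similarity = cosine_similarity_rows(user_item.T)  # items x items

print(user_similarity.round(3))
print(item_similarity.round(3))
```

Transposing the matrix is all it takes to switch from user-user to item-item similarity, which mirrors how the two formulas differ only in which index is summed over.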

In [12]:
train_data.head()

| | user_id | item_id | rating | timestamp | title |
|---|---|---|---|---|---|
| 97440 | 655 | 1167 | 3 | 887428384 | Sum of Us, The (1994) |
| 6580 | 363 | 88 | 2 | 891498087 | Sleepless in Seattle (1993) |
| 60505 | 401 | 147 | 2 | 891032662 | Long Kiss Goodnight, The (1996) |
| 9523 | 269 | 181 | 2 | 891448871 | Return of the Jedi (1983) |
| 38156 | 618 | 204 | 3 | 891307098 | Back to the Future (1985) |


In [11]:

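# itertuples() yields (index, user_id, item_id, rating, timestamp, title),
# so line[1] is the user_id of each training row.
for line in train_data.itertuples():
    print(line[1])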

655
363
401
269
618
...
882
389
551
495
763
224
2
885
160
846
95
357
315
95
437
141
275
628
782
65
863
178
786
387
399
527
523
670
144
588
313
416
293
707
711
450
221
216
280
1

In [13]:
#Create two user-item matrices, one for training and another for testing
# User ID on rows, item ID on columns, rating is the value at the intersection
train_data_matrix = np.zeros((n_users, n_items))
for line in train_data.itertuples():
    train_data_matrix[line[1]-1, line[2]-1] = line[3]  

test_data_matrix = np.zeros((n_users, n_items))
for line in test_data.itertuples():
    test_data_matrix[line[1]-1, line[2]-1] = line[3]
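
The loop above fills the matrix one rating at a time. As a minimal sketch, assuming ratings arrive as `(user_id, item_id, rating)` triples with 1-based ids, the same matrix can also be filled in one step with NumPy integer-array indexing. The toy triples and the names `toy_ratings`, `loop_matrix` and `vec_matrix` are made up for illustration and are not part of the notebook.

```python
import numpy as np

# Hypothetical toy ratings: (user_id, item_id, rating) with 1-based ids.
toy_ratings = np.array([
    [1, 1, 5],
    [1, 3, 3],
    [2, 2, 4],
    [3, 1, 2],
    [3, 3, 1],
])
toy_n_users, toy_n_items = 3, 3

# Loop version: one assignment per rating, shifting ids to 0-based indices.
loop_matrix = np.zeros((toy_n_users, toy_n_items))
for user_id, item_id, rating in toy_ratings:
    loop_matrix[user_id - 1, item_id - 1] = rating

# Vectorized version: index with whole columns of ids at once.
vec_matrix = np.zeros((toy_n_users, toy_n_items))
vec_matrix[toy_ratings[:, 0] - 1, toy_ratings[:, 1] - 1] = toy_ratings[:, 2]

print(vec_matrix)
```

Both versions leave a 0 wherever a user has not rated an item, which is the convention the rest of the notebook relies on.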

In [27]:
train_data_matrix

array([[0., 3., 4., ..., 0., 0., 0.],
       [4., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 5., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [20]:
train_data_matrix.shape

(944, 1682)

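You can use the [pairwise_distances](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html) function from sklearn with `metric='cosine'`. Note that this function returns the cosine *distance*, which is 1 minus the cosine similarity, so identical rows give 0 rather than 1. Since the ratings are all non-negative, the output ranges from 0 to 1. Subtract the result from 1 if you want similarity values; the cells below use the output of `pairwise_distances` as is.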

In [17]:
# pairwise_distances goes row by row and outputs an n x n matrix that compares each row
# to every other row, based on the column values. With metric='cosine' the entries are
# cosine distances (1 - cosine similarity), so each row compared to itself gives 0 on the diagonal.
from sklearn.metrics.pairwise import pairwise_distances
user_similarity = pairwise_distances(train_data_matrix, metric='cosine')
item_similarity = pairwise_distances(train_data_matrix.T, metric='cosine')

In [19]:
user_similarity.shape

(944, 944)

In [21]:
item_similarity.shape

(1682, 1682)
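
To see what cosine similarity between rows actually computes, and how it relates to cosine distance (distance = 1 - similarity, which is what `pairwise_distances` with `metric='cosine'` returns), here is a small NumPy-only sketch. The helper `cosine_similarity_rows` and the toy matrix are hypothetical and only meant for illustration.

```python
import numpy as np

def cosine_similarity_rows(matrix):
    """Cosine similarity between every pair of rows of `matrix`."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows (no ratings at all).
    safe_norms = np.where(norms == 0, 1.0, norms)
    unit_rows = matrix / safe_norms
    return unit_rows @ unit_rows.T

# Hypothetical toy user-item matrix: 3 users, 3 items, 0 means "not rated".
toy = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 0.0],
    [0.0, 0.0, 2.0],
])

toy_similarity = cosine_similarity_rows(toy)
toy_distance = 1.0 - toy_similarity  # cosine distance

print(np.round(toy_similarity, 3))
print(np.round(toy_distance, 3))
```

Users 0 and 1 share a rated item, so their similarity is high, while users 0 and 2 have no item in common and get a similarity of 0 (distance 1).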

The next step is to make predictions. You have already created the matrices `user_similarity` and `item_similarity`, so you can make a prediction by applying the following formula for user-based CF:

<img class="aligncenter size-thumbnail img-responsive" src="https://latex.codecogs.com/gif.latex?\hat{x}_{k,m}&space;=&space;\bar{x}_{k}&space;&plus;&space;\frac{\sum\limits_{u_a}&space;sim_u(u_k,&space;u_a)&space;(x_{a,m}&space;-&space;\bar{x_{u_a}})}{\sum\limits_{u_a}|sim_u(u_k,&space;u_a)|}"/>

You can look at the similarity between users *k* and *a* as weights that are multiplied by the ratings of a similar user *a* (corrected for the average rating of that user). You need to normalize the weighted sum so that the ratings stay between 1 and 5 and, as a final step, add back the average rating of the user for whom you are predicting. 

The idea here is that some users tend to always give high or low ratings to all movies. The relative difference in the ratings that these users give is more important than the absolute values. To give an example: suppose user *k* gives 4 stars to their favourite movies and 3 stars to all other good movies. Suppose now that another user *t* rates movies they like with 5 stars, and the movies they fell asleep over with 3 stars. These two users could have very similar taste but treat the rating system differently. 

When making a prediction for item-based CF, you don't need to correct for the user's average rating, since the query user's own ratings are used to make the predictions.

<img class="aligncenter size-thumbnail img-responsive" src="https://latex.codecogs.com/gif.latex?\hat{x}_{k,m}&space;=&space;\frac{\sum\limits_{i_b}&space;sim_i(i_m,&space;i_b)&space;(x_{k,b})&space;}{\sum\limits_{i_b}|sim_i(i_m,&space;i_b)|}"/>

In [23]:
def predict(ratings, similarity, type='user'):
    if type == 'user':
        mean_user_rating = ratings.mean(axis=1)
        #You use np.newaxis so that mean_user_rating has same format as ratings
        ratings_diff = (ratings - mean_user_rating[:, np.newaxis]) 
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])     
    return pred

In [24]:
item_prediction = predict(train_data_matrix, item_similarity, type='item')
user_prediction = predict(train_data_matrix, user_similarity, type='user')
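
As a sanity check on the two formulas, the sketch below evaluates them entry by entry with explicit sums and compares the result with a vectorized form equivalent to `predict`. The toy ratings and the hand-picked weight matrices are hypothetical, and, as in `predict`, user means are taken over all entries including the zeros that stand for missing ratings.

```python
import numpy as np

# Hypothetical toy data: 3 users x 3 items, 0 means "not rated".
toy_ratings = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 2.0],
    [1.0, 1.0, 5.0],
])
# Hand-picked symmetric weights, for illustration only.
toy_user_sim = np.array([
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.3],
    [0.2, 0.3, 1.0],
])
toy_item_sim = np.array([
    [1.0, 0.6, 0.1],
    [0.6, 1.0, 0.4],
    [0.1, 0.4, 1.0],
])

def user_entry(ratings, sim, k, m):
    """User-based formula for one (user k, item m) pair, written as explicit sums."""
    means = ratings.mean(axis=1)
    num = sum(sim[k, a] * (ratings[a, m] - means[a]) for a in range(ratings.shape[0]))
    return means[k] + num / np.abs(sim[k]).sum()

def item_entry(ratings, sim, k, m):
    """Item-based formula for one (user k, item m) pair, written as explicit sums."""
    num = sum(sim[m, b] * ratings[k, b] for b in range(ratings.shape[1]))
    return num / np.abs(sim[m]).sum()

# Vectorized forms, equivalent to the two branches of predict().
means = toy_ratings.mean(axis=1, keepdims=True)
user_vectorized = means + toy_user_sim @ (toy_ratings - means) / np.abs(toy_user_sim).sum(axis=1, keepdims=True)
item_vectorized = toy_ratings @ toy_item_sim / np.abs(toy_item_sim).sum(axis=1)[np.newaxis, :]

print(user_entry(toy_ratings, toy_user_sim, 0, 2), user_vectorized[0, 2])
print(item_entry(toy_ratings, toy_item_sim, 0, 2), item_vectorized[0, 2])
```

The explicit sums are slow but mirror the formulas term by term, which makes them useful for checking the matrix version on small inputs.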

### Evaluation
There are many evaluation metrics, but one of the most popular metrics used to evaluate the accuracy of predicted ratings is *Root Mean Squared Error (RMSE)*. 
<img src="https://latex.codecogs.com/gif.latex?RMSE&space;=\sqrt{\frac{1}{N}&space;\sum&space;(x_i&space;-\hat{x_i})^2}" title="RMSE =\sqrt{\frac{1}{N} \sum (x_i -\hat{x_i})^2}" />

You can use the [mean_squared_error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html) (MSE) function from `sklearn`, where the RMSE is just the square root of MSE. To read more about different evaluation metrics you can take a look at [this article](http://research.microsoft.com/pubs/115396/EvaluationMetrics.TR.pdf). 

Since you only want to consider predicted ratings that are in the test dataset, you filter out all other elements in the prediction matrix with `prediction[ground_truth.nonzero()]`. 

In [25]:
from sklearn.metrics import mean_squared_error
from math import sqrt
def rmse(prediction, ground_truth):
    prediction = prediction[ground_truth.nonzero()].flatten() 
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, ground_truth))
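
To make the masking step concrete, here is a small NumPy-only sketch of the same calculation on made-up numbers. The arrays `toy_prediction` and `toy_truth` are hypothetical; zeros in `toy_truth` mark user-item pairs that have no test rating and are therefore excluded.

```python
import numpy as np

toy_prediction = np.array([
    [4.0, 2.0, 3.0],
    [1.0, 5.0, 2.0],
])
# Zeros mark (user, item) pairs with no rating in the test set.
toy_truth = np.array([
    [5.0, 0.0, 3.0],
    [0.0, 4.0, 0.0],
])

# nonzero() returns a tuple of row and column index arrays for the rated pairs.
mask = toy_truth.nonzero()
errors = toy_prediction[mask] - toy_truth[mask]  # only 3 pairs survive the mask
toy_rmse = np.sqrt(np.mean(errors ** 2))

print(errors)
print(toy_rmse)
```

Only the three rated pairs contribute, with errors of -1, 0 and 1, so the RMSE is the square root of 2/3.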

In [26]:
print('User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix)))
print('Item-based CF RMSE: ' + str(rmse(item_prediction, test_data_matrix)))

User-based CF RMSE: 3.131107272662847
Item-based CF RMSE: 3.458356206303761


Memory-based algorithms are easy to implement and produce reasonable prediction quality. 
The drawback of memory-based CF is that it doesn't scale to real-world scenarios and doesn't address the well-known cold-start problem, that is, when a new user or a new item enters the system. Model-based CF methods are scalable and can deal with a higher sparsity level than memory-based models, but they also suffer when new users or items that don't have any ratings enter the system. I would like to thank Ethan Rosenthal for his [post](http://blog.ethanrosenthal.com/2015/11/02/intro-to-collaborative-filtering/) about Memory-Based Collaborative Filtering. 

# Model-based Collaborative Filtering

Model-based Collaborative Filtering is based on **matrix factorization (MF)**, which has received wide exposure, mainly as an unsupervised learning method for latent variable decomposition and dimensionality reduction. Matrix factorization is widely used for recommender systems, where it deals better with scalability and sparsity than memory-based CF. The goal of MF is to learn the latent preferences of users and the latent attributes of items from known ratings (that is, to learn features that describe the characteristics of ratings) and then to predict the unknown ratings through the dot product of the latent features of users and items. 
When you have a very sparse matrix with a lot of dimensions, matrix factorization lets you restructure the user-item matrix into a low-rank structure: you represent the matrix as the product of two low-rank matrices whose rows contain the latent vectors. You fit this product to approximate your original matrix as closely as possible, and multiplying the low-rank matrices together fills in the entries missing from the original matrix.

Let's calculate the sparsity level of the MovieLens dataset:

100003

In [34]:
# len(df) is the number of rows of the original event/timestamped rating data frame, i.e. the ratings we actually
# have in our data. The number of users and items is the distinct/unique count, so the total possible ratings we 
# could have. 
sparsity=round(1.0-len(df)/float(n_users*n_items),3)
print('The sparsity level of MovieLens100K is ' +  str(sparsity*100) + '%')

The sparsity level of MovieLens100K is 93.7%
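
With the numbers shown in this notebook (100003 rating rows and a 944 x 1682 matrix, i.e. 1587808 cells), this works out to 1 - 100003/1587808 ≈ 0.937. As a minimal sketch, the same quantity can be read directly off a user-item matrix, assuming that 0 always means "no rating". The toy matrix below is made up for illustration.

```python
import numpy as np

# Hypothetical toy user-item matrix: 3 users x 4 items, 0 means "not rated".
toy_matrix = np.array([
    [5, 0, 0, 1],
    [0, 0, 3, 0],
    [0, 2, 0, 0],
])

n_known = np.count_nonzero(toy_matrix)      # ratings we actually have
n_possible = toy_matrix.size                # users x items
toy_sparsity = 1.0 - n_known / n_possible   # share of empty cells

print(f"{n_known} known ratings out of {n_possible} cells")
print(f"sparsity: {toy_sparsity:.1%}")
```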


To give an example of the learned latent preferences of the users and items: let's say for the MovieLens dataset you have the following information: _(user id, age, location, gender, movie id, director, actor, language, year, rating)_. By applying matrix factorization the model learns that important user features are _age group (under 10, 10-18, 18-30, 30-90)_, _location_ and _gender_, and for movie features it learns that _decade_, _director_ and _actor_ are most important. Now if you look into the information you have stored, there is no such feature as the _decade_, but the model can learn it on its own. The important aspect is that the CF model only uses data (user_id, movie_id, rating) to learn the latent features. If there is little data available, a model-based CF model will predict poorly, since it will be more difficult to learn the latent features. 

Models that use both ratings and content features are called **Hybrid Recommender Systems**, where Collaborative Filtering and Content-based Models are combined. Hybrid recommender systems usually show higher accuracy than Collaborative Filtering or Content-based Models on their own: they can address the cold-start problem better, since if you don't have any ratings for a user or an item you can use the metadata of the user or item to make a prediction. 

### SVD
A well-known matrix factorization method is **Singular value decomposition (SVD)**. Collaborative Filtering can be formulated by approximating a matrix `X` using singular value decomposition. The winning team at the Netflix Prize competition used SVD matrix factorization models to produce product recommendations. For more information I recommend reading these articles: [Netflix Recommendations: Beyond the 5 stars](http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html) and [Netflix Prize and SVD](http://buzzard.ups.edu/courses/2014spring/420projects/math420-UPS-spring-2014-gower-netflix-SVD.pdf).
The general equation can be expressed as follows:
<img src="https://latex.codecogs.com/gif.latex?X=USV^T" title="X=USV^T" />


Given an `m x n` matrix `X`:
* *`U`* is an *`(m x r)`* matrix with orthonormal columns
* *`S`* is an *`(r x r)`* diagonal matrix with non-negative real numbers on the diagonal
* *`V^T`* is an *`(r x n)`* matrix with orthonormal rows

The elements on the diagonal of `S` are known as the *singular values of `X`*. 
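
The properties listed above can be checked numerically on a small matrix. This is a sketch using `np.linalg.svd` on a made-up dense matrix rather than the sparse `svds` routine used later in the notebook; the matrix `X` and the choice `k = 2` are assumptions for illustration.

```python
import numpy as np

# Hypothetical toy matrix: 3 users x 4 items.
X = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
])

# Reduced SVD: U is (3, 3), s holds the singular values, Vt is (3, 4).
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Multiplying the three factors back together recovers X.
X_rebuilt = U @ np.diag(s) @ Vt

# Keeping only the k largest singular values gives a rank-k approximation.
k = 2
X_rank_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(s, 3))
print(np.round(X_rank_k, 2))
```

The rank-`k` product is no longer equal to `X`, and that is the point: the entries it produces where `X` holds a 0 are what a recommender reads as predicted ratings.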


Matrix *`X`* can be factorized into *`U`*, *`S`* and *`V`*. The *`U`* matrix represents the feature vectors corresponding to the users in the hidden feature space and the *`V`* matrix represents the feature vectors corresponding to the items in the hidden feature space.
<img class="aligncenter size-thumbnail img-responsive" style="max-width:100%; width: 50%; max-width: none" src="http://s33.postimg.org/kwgsb5g1b/BLOG_CCA_5.png"/>

Now you can make a prediction by taking the dot product of *`U`*, *`S`* and *`V^T`*.

<img class="aligncenter size-thumbnail img-responsive" style="max-width:100%; width: 50%; max-width: none" src="http://s33.postimg.org/ch9lcm6pb/BLOG_CCA_4.png"/>

In [35]:
import scipy.sparse as sp
from scipy.sparse.linalg import svds

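# Get the SVD components from the train matrix. k is the number of singular values to keep.
u, s, vt = svds(train_data_matrix, k = 20)
# svds returns the singular values as a 1-D array, so turn them into a diagonal matrix
s_diag_matrix=np.diag(s)
# Multiply the factors back together to get the predicted ratings
X_pred = np.dot(np.dot(u, s_diag_matrix), vt)
print('SVD-based CF RMSE: ' + str(rmse(X_pred, test_data_matrix)))

SVD-based CF RMSE: 2.727093975231784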


Carelessly addressing only the relatively few known entries is highly prone to overfitting. SVD can also be very slow and computationally expensive. More recent work minimizes the squared error by applying alternating least squares or stochastic gradient descent and uses regularization terms to prevent overfitting. Alternating least squares and stochastic gradient descent methods for CF will be covered in the next tutorials.
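
To give a rough idea of what the stochastic gradient descent approach looks like, here is a minimal sketch on a made-up matrix. It is not the method used in this notebook. The factor matrices `P` and `Q`, the learning rate, the regularization strength and the number of epochs are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy ratings: 4 users x 4 items, 0 means "not rated".
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 1.0, 5.0, 4.0],
])
known = np.argwhere(R > 0)  # (user, item) index pairs with a rating

# Assumed hyperparameters, not tuned.
n_factors, lr, reg, n_epochs = 2, 0.01, 0.02, 500

# Small random starting values for the user factors P and item factors Q.
P = rng.normal(scale=0.1, size=(R.shape[0], n_factors))
Q = rng.normal(scale=0.1, size=(R.shape[1], n_factors))

def known_rmse(P, Q):
    """RMSE over the known ratings only."""
    pred = P @ Q.T
    errs = [R[u, i] - pred[u, i] for u, i in known]
    return float(np.sqrt(np.mean(np.square(errs))))

initial_rmse = known_rmse(P, Q)
for _ in range(n_epochs):
    for u, i in known:
        err = R[u, i] - P[u] @ Q[i]
        p_old = P[u].copy()
        # Gradient step on the squared error plus an L2 penalty on the factors.
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * p_old - reg * Q[i])
final_rmse = known_rmse(P, Q)

print(initial_rmse, final_rmse)
```

Only the known ratings drive the updates, and the `reg` term keeps the factors small, which is the regularization against overfitting described above.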


Review:

* We have covered how to implement simple **Collaborative Filtering** methods, both memory-based CF and model-based CF.
* **Memory-based models** are based on similarity between items or users, where we use cosine-similarity.
* **Model-based CF** is based on matrix factorization where we use SVD to factorize the matrix.
* Building recommender systems that perform well in cold-start scenarios (where little data is available on new users and items) remains a challenge. The standard collaborative filtering method performs poorly in such settings. 

## Looking for more?

If you want to tackle your own recommendation system analysis, check out these data sets. Note: the files are quite large in most cases, and not all of the links may still be up, but the majority of them still work. Or just Google for your own data set!

**Movies Recommendation:**

MovieLens - Movie Recommendation Data Sets http://www.grouplens.org/node/73

Yahoo! - Movie, Music, and Images Ratings Data Sets http://webscope.sandbox.yahoo.com/catalog.php?datatype=r

Jester - Movie Ratings Data Sets (Collaborative Filtering Dataset) http://www.ieor.berkeley.edu/~goldberg/jester-data/

Cornell University - Movie-review data for use in sentiment-analysis experiments http://www.cs.cornell.edu/people/pabo/movie-review-data/

**Music Recommendation:**

Last.fm - Music Recommendation Data Sets http://www.dtic.upf.edu/~ocelma/MusicRecommendationDataset/index.html

Yahoo! - Movie, Music, and Images Ratings Data Sets http://webscope.sandbox.yahoo.com/catalog.php?datatype=r

Audioscrobbler - Music Recommendation Data Sets http://www-etud.iro.umontreal.ca/~bergstrj/audioscrobbler_data.html

Amazon - Audio CD recommendations http://131.193.40.52/data/

**Books Recommendation:**

Institut für Informatik, Universität Freiburg - Book Ratings Data Sets http://www.informatik.uni-freiburg.de/~cziegler/BX/

**Food Recommendation:**

Chicago Entree - Food Ratings Data Sets http://archive.ics.uci.edu/ml/datasets/Entree+Chicago+Recommendation+Data

**Merchandise Recommendation:**

**Healthcare Recommendation:**

Nursing Home - Provider Ratings Data Set http://data.medicare.gov/dataset/Nursing-Home-Compare-Provider-Ratings/mufm-vy8d

Hospital Ratings - Survey of Patients Hospital Experiences http://data.medicare.gov/dataset/Survey-of-Patients-Hospital-Experiences-HCAHPS-/rj76-22dk

**Dating Recommendation:**

www.libimseti.cz - Dating website recommendation (collaborative filtering) http://www.occamslab.com/petricek/data/

**Scholarly Paper Recommendation:**

National University of Singapore - Scholarly Paper Recommendation http://www.comp.nus.edu.sg/~sugiyama/SchPaperRecData.html

# Great Job!