Gu Jinhao: Assignment 10, Part 02 (run with Python 2)

Building a recommender system for music data with GraphLab

In [1]:
import graphlab
In [2]:
graphlab.canvas.set_target("ipynb")
In [4]:
train_file = 'D:/data/data/10000.txt'
sf = graphlab.SFrame.read_csv(train_file, header=False, delimiter='\t',
                       verbose=False) 
# SFrame is the main data structure for pulling data from other sources into Turi Create.
# verbose=False suppresses the detailed progress output while the file is read.
sf = sf.rename({'X1':'user_id', 'X2':'music_id', 'X3':'rating'})
This non-commercial license of GraphLab Create for academic use is assigned to 1207567528@qq.com and will expire on December 11, 2020.
[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: C:\Users\GU\AppData\Local\Temp\graphlab_server_1576383550.log.0

Splitting into training and test sets

In [5]:
train_set, test_set = sf.random_split(0.8, seed=1)
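
The split above is random, so the 0.8 fraction is approximate rather than exact, and fixing the seed makes it repeatable. A minimal pure-Python sketch of that idea, assuming each row is sent to the training set independently with probability 0.8. The helper `random_split` and the toy rows are illustrative only and are not GraphLab's implementation:

```python
import random

def random_split(rows, fraction, seed=None):
    """Send each row to the first list with probability `fraction`."""
    rng = random.Random(seed)  # a seeded generator makes the split repeatable
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

# Toy (user_id, music_id, rating) triples, made up for illustration.
rows = [("u%d" % (i % 50), "m%d" % (i % 20), 1 + i % 5) for i in range(1000)]
train_rows, test_rows = random_split(rows, 0.8, seed=1)
print("train: %d, test: %d" % (len(train_rows), len(test_rows)))
```

Calling the helper again with the same seed returns the same two lists, which is what makes later model comparisons reproducible.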

Popularity

In [6]:
popularity_model = graphlab.popularity_recommender.create(train_set, 
                                                    'user_id', 'music_id', 
                                                    target = 'rating')
Recsys training: model = popularity
Preparing data set.
    Data has 1599753 observations with 76085 users and 10000 items.
    Data prepared in: 0.917896s
1599753 observations to process; with 10000 unique items.
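
A popularity model gives every user the same ranking of items. As far as I understand the GraphLab documentation, when a `target` column is supplied the item score is the item's mean target value. The sketch below works under that assumption; the function names and toy data are hypothetical:

```python
from collections import defaultdict

def fit_popularity(triples):
    """Return {item: mean rating} from (user, item, rating) triples."""
    total = defaultdict(float)
    count = defaultdict(int)
    for _user, item, rating in triples:
        total[item] += rating
        count[item] += 1
    return {item: total[item] / count[item] for item in total}

def recommend_popular(scores, seen, k):
    """Top-k items by score, skipping items the user has already seen."""
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return [item for item in ranked if item not in seen][:k]

# Toy training data, made up for illustration.
train = [("u1", "a", 5), ("u2", "a", 3), ("u1", "b", 2),
         ("u3", "c", 4), ("u2", "c", 5)]
scores = fit_popularity(train)
top = recommend_popular(scores, seen={"a"}, k=2)
print("scores: %r, top: %r" % (sorted(scores.items()), top))
```

Because the ranking ignores who the user is, a model like this can predict rating values reasonably while still rarely placing a user's own held-out items in the top positions.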

Item similarity

In [11]:
item_sim_model = graphlab.item_similarity_recommender.create(train_set, 
                                                       'user_id', 'music_id', 
                                                       target = 'rating', 
                                                       similarity_type='cosine')
Recsys training: model = item_similarity
Preparing data set.
    Data has 1599753 observations with 76085 users and 10000 items.
    Data prepared in: 0.897625s
Training model from provided data.
Gathering per-item and per-user statistics.
+--------------------------------+------------+
| Elapsed Time (Item Statistics) | % Complete |
+--------------------------------+------------+
| 972us                          | 3.75       |
| 15.935ms                       | 100        |
+--------------------------------+------------+
Setting up lookup tables.
Processing data in one pass using dense lookup tables.
+-------------------------------------+------------------+-----------------+
| Elapsed Time (Constructing Lookups) | Total % Complete | Items Processed |
+-------------------------------------+------------------+-----------------+
| 223.38ms                            | 0                | 0               |
| 1.16s                               | 100              | 10000           |
+-------------------------------------+------------------+-----------------+
Finalizing lookup tables.
Generating candidate set for working with new users.
Finished training in 1.23966s
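
With `similarity_type='cosine'`, two items are compared through the rating vectors of the users who rated them: the similarity of items $i$ and $j$ is $\sum_u r_{ui} r_{uj} / (\lVert r_i \rVert \, \lVert r_j \rVert)$. A small sketch of that computation on toy data follows. The helper names are hypothetical, and GraphLab's lookup-table implementation is certainly more elaborate:

```python
import math
from collections import defaultdict

def item_vectors(triples):
    """Map each item to a sparse {user: rating} vector."""
    vectors = defaultdict(dict)
    for user, item, rating in triples:
        vectors[item][user] = float(rating)
    return vectors

def cosine(vec_a, vec_b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    shared = set(vec_a) & set(vec_b)  # only shared users contribute to the dot product
    dot = sum(vec_a[u] * vec_b[u] for u in shared)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy training data, made up for illustration.
train = [("u1", "a", 3), ("u1", "b", 4), ("u2", "a", 4),
         ("u2", "b", 3), ("u3", "c", 5)]
vectors = item_vectors(train)
sim_ab = cosine(vectors["a"], vectors["b"])  # the same two users rated both items
sim_ac = cosine(vectors["a"], vectors["c"])  # no user in common
print("sim(a, b) = %.2f, sim(a, c) = %.2f" % (sim_ab, sim_ac))
```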

Matrix factorization

In [12]:
factorization_machine_model = graphlab.recommender.factorization_recommender.create(train_set, 
                                                                              'user_id', 'music_id',
                                                                              target='rating')
Recsys training: model = factorization_recommender
Preparing data set.
    Data has 1599753 observations with 76085 users and 10000 items.
    Data prepared in: 0.890721s
Training factorization_recommender for recommendations.
+--------------------------------+--------------------------------------------------+----------+
| Parameter                      | Description                                      | Value    |
+--------------------------------+--------------------------------------------------+----------+
| num_factors                    | Factor Dimension                                 | 8        |
| regularization                 | L2 Regularization on Factors                     | 1e-008   |
| solver                         | Solver used for training                         | sgd      |
| linear_regularization          | L2 Regularization on Linear Coefficients         | 1e-010   |
| max_iterations                 | Maximum Number of Iterations                     | 50       |
+--------------------------------+--------------------------------------------------+----------+
  Optimizing model using SGD; tuning step size.
  Using 199969 / 1599753 points for tuning the step size.
+---------+-------------------+------------------------------------------+
| Attempt | Initial Step Size | Estimated Objective Value                |
+---------+-------------------+------------------------------------------+
| 0       | 25                | No Decrease (249.914 >= 64.5212)         |
| 1       | 6.25              | No Decrease (238.586 >= 64.5212)         |
| 2       | 1.5625            | No Decrease (215.147 >= 64.5212)         |
| 3       | 0.390625          | No Decrease (111.655 >= 64.5212)         |
| 4       | 0.0976562         | 38.0152                                  |
| 5       | 0.0488281         | 31.7788                                  |
| 6       | 0.0244141         | 45.2587                                  |
| 7       | 0.012207          | 54.1741                                  |
| 8       | 0.00610352        | 58.7081                                  |
+---------+-------------------+------------------------------------------+
| Final   | 0.0488281         | 31.7788                                  |
+---------+-------------------+------------------------------------------+
Starting Optimization.
+---------+--------------+-------------------+-----------------------+-------------+
| Iter.   | Elapsed Time | Approx. Objective | Approx. Training RMSE | Step Size   |
+---------+--------------+-------------------+-----------------------+-------------+
| Initial | 0us          | 43.795            | 6.61778               |             |
+---------+--------------+-------------------+-----------------------+-------------+
| 1       | 139.627ms    | 43.4513           | 6.59136               | 0.0488281   |
| 2       | 273.276ms    | 40.8715           | 6.39272               | 0.0290334   |
| 3       | 402.923ms    | 37.9983           | 6.16392               | 0.0214205   |
| 4       | 545.543ms    | 35.1521           | 5.92853               | 0.0172633   |
| 5       | 663.227ms    | 32.4723           | 5.69801               | 0.014603    |
| 6       | 778.918ms    | 30.5578           | 5.52743               | 0.0127367   |
| 10      | 1.24s        | 24.4957           | 4.94866               | 0.008683    |
| 11      | 1.37s        | 23.415            | 4.83821               | 0.00808399  |
| 20      | 2.34s        | 17.5223           | 4.18503               | 0.00516295  |
| 30      | 3.42s        | 14.1436           | 3.75966               | 0.00320311  |
| 40      | 4.51s        | 10.5326           | 3.244                 | 0.00182538  |
| 50      | 5.70s        | 9.45221           | 3.07291               | 0.00154408  |
+---------+--------------+-------------------+-----------------------+-------------+
Optimization Complete: Maximum number of passes through the data reached.
Computing final objective value and training RMSE.
       Final objective value: 8.51022
       Final training RMSE: 2.91561
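
The log above reports 8 latent factors, an SGD solver, small L2 penalties and 50 passes over the data. The sketch below borrows those settings to show the general technique: it predicts a rating as a global mean plus user and item biases plus a dot product of latent factors, and updates them by stochastic gradient descent. It is a simplified illustration with hypothetical names and a fixed step size, whereas the log shows GraphLab tuning and then shrinking its step size:

```python
import math
import random

def train_mf(triples, num_factors=8, step=0.05, reg=1e-8, lin_reg=1e-10,
             epochs=50, seed=0):
    """SGD matrix factorization; returns (predict function, RMSE per epoch)."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in triples) / float(len(triples))
    users = sorted({u for u, _, _ in triples})
    items = sorted({i for _, i, _ in triples})
    bu = {u: 0.0 for u in users}  # user biases
    bi = {i: 0.0 for i in items}  # item biases
    P = {u: [rng.gauss(0, 0.1) for _ in range(num_factors)] for u in users}
    Q = {i: [rng.gauss(0, 0.1) for _ in range(num_factors)] for i in items}

    def predict(u, i):
        return mu + bu[u] + bi[i] + sum(a * b for a, b in zip(P[u], Q[i]))

    def rmse():
        total = sum((r - predict(u, i)) ** 2 for u, i, r in triples)
        return math.sqrt(total / len(triples))

    history = [rmse()]  # training RMSE before any update
    data = list(triples)
    for _ in range(epochs):
        rng.shuffle(data)
        for u, i, r in data:
            err = r - predict(u, i)
            bu[u] += step * (err - lin_reg * bu[u])
            bi[i] += step * (err - lin_reg * bi[i])
            for f in range(num_factors):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += step * (err * qi - reg * pu)
                Q[i][f] += step * (err * pu - reg * qi)
        history.append(rmse())
    return predict, history

# Toy ratings with an additive user/item structure, made up for illustration.
triples = [("u%d" % u, "i%d" % i, 1 + u % 3 + i % 3)
           for u in range(10) for i in range(8)]
predict, history = train_mf(triples)
print("training RMSE: %.3f -> %.3f" % (history[0], history[-1]))
```

As in the log, the number to watch is the training RMSE falling from one pass to the next. A low training RMSE alone does not show that the model generalizes, which is why the models are compared on the held-out set afterwards.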
In [13]:
len(train_set)
Out[13]:
1599753

Comparing the models

In [14]:
result = graphlab.recommender.util.compare_models(test_set, 
                                            [popularity_model, item_sim_model, factorization_machine_model],
                                            user_sample=.5, skip_set=train_set)
compare_models: using 34355 users to estimate model performance
PROGRESS: Evaluate model M0
recommendations finished on 1000/34355 queries. users per second: 31335.2
recommendations finished on 2000/34355 queries. users per second: 34575.8
recommendations finished on 3000/34355 queries. users per second: 34575.4
recommendations finished on 4000/34355 queries. users per second: 33702.9
recommendations finished on 5000/34355 queries. users per second: 33201.4
recommendations finished on 6000/34355 queries. users per second: 33798.2
recommendations finished on 7000/34355 queries. users per second: 33582.6
recommendations finished on 8000/34355 queries. users per second: 34875.9
recommendations finished on 9000/34355 queries. users per second: 35810.1
recommendations finished on 10000/34355 queries. users per second: 35809.9
recommendations finished on 11000/34355 queries. users per second: 35350.8
recommendations finished on 12000/34355 queries. users per second: 35703.7
recommendations finished on 13000/34355 queries. users per second: 35614.2
recommendations finished on 14000/34355 queries. users per second: 36086.1
recommendations finished on 15000/34355 queries. users per second: 36328.8
recommendations finished on 16000/34355 queries. users per second: 36627.5
recommendations finished on 17000/34355 queries. users per second: 36975
recommendations finished on 18000/34355 queries. users per second: 37136.1
recommendations finished on 19000/34355 queries. users per second: 37501.7
recommendations finished on 20000/34355 queries. users per second: 37553.4
recommendations finished on 21000/34355 queries. users per second: 37735.2
recommendations finished on 22000/34355 queries. users per second: 37515.1
recommendations finished on 23000/34355 queries. users per second: 36780.7
recommendations finished on 24000/34355 queries. users per second: 36908.3
recommendations finished on 25000/34355 queries. users per second: 36754.9
recommendations finished on 26000/34355 queries. users per second: 36769.5
recommendations finished on 27000/34355 queries. users per second: 36933.4
recommendations finished on 28000/34355 queries. users per second: 37087
recommendations finished on 29000/34355 queries. users per second: 37278.9
recommendations finished on 30000/34355 queries. users per second: 37366.8
recommendations finished on 31000/34355 queries. users per second: 37585.2
recommendations finished on 32000/34355 queries. users per second: 37659.2
recommendations finished on 33000/34355 queries. users per second: 37177.8
recommendations finished on 34000/34355 queries. users per second: 36500
Precision and recall summary statistics by cutoff
+--------+-------------------+-------------------+
| cutoff |   mean_precision  |    mean_recall    |
+--------+-------------------+-------------------+
|   1    | 0.000349294134769 | 2.54228239206e-05 |
|   2    | 0.000334740212487 | 8.64040383002e-05 |
|   3    | 0.000349294134769 |  0.00020999007583 |
|   4    | 0.000349294134769 | 0.000314076153343 |
|   5    | 0.000442439237374 | 0.000499685915434 |
|   6    | 0.000412361131325 | 0.000553750166486 |
|   7    | 0.000407509823898 | 0.000649420154093 |
|   8    | 0.000385678940474 | 0.000676837619798 |
|   9    | 0.000456022898171 | 0.000939422792513 |
|   10   | 0.000427885315092 |  0.00098193073489 |
+--------+-------------------+-------------------+
[10 rows x 3 columns]

('\nOverall RMSE: ', 6.198692978789529)

Per User RMSE (best)
+-------------------------------+-------+------+
|            user_id            | count | rmse |
+-------------------------------+-------+------+
| cafbf96566378466408b7b3c76... |   1   | 0.0  |
+-------------------------------+-------+------+
[1 rows x 3 columns]


Per User RMSE (worst)
+-------------------------------+-------+---------------+
|            user_id            | count |      rmse     |
+-------------------------------+-------+---------------+
| 38767872c514c1b43bab5c7b21... |   2   | 341.207176087 |
+-------------------------------+-------+---------------+
[1 rows x 3 columns]


Per Item RMSE (best)
+--------------------+-------+-----------------+
|      music_id      | count |       rmse      |
+--------------------+-------+-----------------+
| SOIFQIE12A6D4F78A4 |   2   | 0.0512820512821 |
+--------------------+-------+-----------------+
[1 rows x 3 columns]


Per Item RMSE (worst)
+--------------------+-------+---------------+
|      music_id      | count |      rmse     |
+--------------------+-------+---------------+
| SOXQIUR12A8AE4654A |   9   | 148.225142853 |
+--------------------+-------+---------------+
[1 rows x 3 columns]

PROGRESS: Evaluate model M1
recommendations finished on 1000/34355 queries. users per second: 30384.1
recommendations finished on 2000/34355 queries. users per second: 32344.7
recommendations finished on 3000/34355 queries. users per second: 32000
recommendations finished on 4000/34355 queries. users per second: 31830.9
recommendations finished on 5000/34355 queries. users per second: 31932.4
recommendations finished on 6000/34355 queries. users per second: 31333.7
recommendations finished on 7000/34355 queries. users per second: 31903.2
recommendations finished on 8000/34355 queries. users per second: 31830.9
recommendations finished on 9000/34355 queries. users per second: 31552.6
recommendations finished on 10000/34355 queries. users per second: 31042.5
recommendations finished on 11000/34355 queries. users per second: 29971.1
recommendations finished on 12000/34355 queries. users per second: 30080.2
recommendations finished on 13000/34355 queries. users per second: 30243
recommendations finished on 14000/34355 queries. users per second: 30253.1
recommendations finished on 15000/34355 queries. users per second: 30507.3
recommendations finished on 16000/34355 queries. users per second: 30557.7
recommendations finished on 17000/34355 queries. users per second: 30712.6
recommendations finished on 18000/34355 queries. users per second: 30589.5
recommendations finished on 19000/34355 queries. users per second: 30579.2
recommendations finished on 20000/34355 queries. users per second: 30430.2
recommendations finished on 21000/34355 queries. users per second: 30296.6
recommendations finished on 22000/34355 queries. users per second: 30052.9
recommendations finished on 23000/34355 queries. users per second: 29950
recommendations finished on 24000/34355 queries. users per second: 29893.4
recommendations finished on 25000/34355 queries. users per second: 29984.2
recommendations finished on 26000/34355 queries. users per second: 29964.9
recommendations finished on 27000/34355 queries. users per second: 29684.4
recommendations finished on 28000/34355 queries. users per second: 29677.5
recommendations finished on 29000/34355 queries. users per second: 29520.3
recommendations finished on 30000/34355 queries. users per second: 29232.4
recommendations finished on 31000/34355 queries. users per second: 29351.2
recommendations finished on 32000/34355 queries. users per second: 29221.8
recommendations finished on 33000/34355 queries. users per second: 28797.4
recommendations finished on 34000/34355 queries. users per second: 28314.7
Precision and recall summary statistics by cutoff
+--------+-----------------+-----------------+
| cutoff |  mean_precision |   mean_recall   |
+--------+-----------------+-----------------+
|   1    | 0.0505021103187 | 0.0143594006398 |
|   2    |  0.063003929559 | 0.0332542799912 |
|   3    | 0.0741085722602 | 0.0535576966113 |
|   4    | 0.0762552757968 |  0.069991135306 |
|   5    | 0.0758375782273 | 0.0846796386471 |
|   6    | 0.0740406539562 | 0.0970926584385 |
|   7    | 0.0715678732561 |  0.107440751262 |
|   8    | 0.0693821859991 |   0.1171303127  |
|   9    | 0.0668413137341 |  0.125932191148 |
|   10   | 0.0645641100277 |  0.133722231481 |
+--------+-----------------+-----------------+
[10 rows x 3 columns]

('\nOverall RMSE: ', 6.912246363901288)

Per User RMSE (best)
+-------------------------------+-------+-------------------+
|            user_id            | count |        rmse       |
+-------------------------------+-------+-------------------+
| dad5cd4678a6f6df34932432bc... |   1   | 0.000917145184108 |
+-------------------------------+-------+-------------------+
[1 rows x 3 columns]


Per User RMSE (worst)
+-------------------------------+-------+---------------+
|            user_id            | count |      rmse     |
+-------------------------------+-------+---------------+
| 38767872c514c1b43bab5c7b21... |   2   | 343.853110153 |
+-------------------------------+-------+---------------+
[1 rows x 3 columns]


Per Item RMSE (best)
+--------------------+-------+----------------+
|      music_id      | count |      rmse      |
+--------------------+-------+----------------+
| SOBJHIC12A6D4F4A2D |   1   | 0.801424074173 |
+--------------------+-------+----------------+
[1 rows x 3 columns]


Per Item RMSE (worst)
+--------------------+-------+---------------+
|      music_id      | count |      rmse     |
+--------------------+-------+---------------+
| SOXQIUR12A8AE4654A |   9   | 151.768616398 |
+--------------------+-------+---------------+
[1 rows x 3 columns]

PROGRESS: Evaluate model M2
recommendations finished on 1000/34355 queries. users per second: 25068.3
recommendations finished on 2000/34355 queries. users per second: 25709.9
recommendations finished on 3000/34355 queries. users per second: 25931.6
recommendations finished on 4000/34355 queries. users per second: 25546
recommendations finished on 5000/34355 queries. users per second: 25320.2
recommendations finished on 6000/34355 queries. users per second: 25709.8
recommendations finished on 7000/34355 queries. users per second: 25899.5
recommendations finished on 8000/34355 queries. users per second: 25959.4
recommendations finished on 9000/34355 queries. users per second: 25783.1
recommendations finished on 10000/34355 queries. users per second: 25976.1
recommendations finished on 11000/34355 queries. users per second: 25531.1
recommendations finished on 12000/34355 queries. users per second: 25709.6
recommendations finished on 13000/34355 queries. users per second: 25508.4
recommendations finished on 14000/34355 queries. users per second: 25522.7
recommendations finished on 15000/34355 queries. users per second: 25709.6
recommendations finished on 16000/34355 queries. users per second: 25792.3
recommendations finished on 17000/34355 queries. users per second: 25944.4
recommendations finished on 18000/34355 queries. users per second: 26006
recommendations finished on 19000/34355 queries. users per second: 26097
recommendations finished on 20000/34355 queries. users per second: 26179.5
recommendations finished on 21000/34355 queries. users per second: 26059.6
recommendations finished on 22000/34355 queries. users per second: 26012.8
recommendations finished on 23000/34355 queries. users per second: 25911.8
recommendations finished on 24000/34355 queries. users per second: 25987.2
recommendations finished on 25000/34355 queries. users per second: 25922.3
recommendations finished on 26000/34355 queries. users per second: 25914
recommendations finished on 27000/34355 queries. users per second: 25906.4
recommendations finished on 28000/34355 queries. users per second: 25899.3
recommendations finished on 29000/34355 queries. users per second: 25938.9
recommendations finished on 30000/34355 queries. users per second: 25976
recommendations finished on 31000/34355 queries. users per second: 25837.8
recommendations finished on 32000/34355 queries. users per second: 25668.4
recommendations finished on 33000/34355 queries. users per second: 25590.3
recommendations finished on 34000/34355 queries. users per second: 25085.3
Precision and recall summary statistics by cutoff
+--------+-------------------+-------------------+
| cutoff |   mean_precision  |    mean_recall    |
+--------+-------------------+-------------------+
|   1    | 0.000582156891282 | 0.000106979963428 |
|   2    | 0.000669480424975 | 0.000273252548199 |
|   3    | 0.000688885654684 | 0.000445507693558 |
|   4    | 0.000662203463834 | 0.000540984718356 |
|   5    | 0.000657837287149 | 0.000668722486175 |
|   6    | 0.000684034347257 | 0.000829871087798 |
|   7    |  0.00069027174252 | 0.000957046365855 |
|   8    | 0.000662203463834 |  0.00102257662173 |
|   9    | 0.000675948834878 |  0.00117115080382 |
|   10   | 0.000692766700626 |  0.00134605299146 |
+--------+-------------------+-------------------+
[10 rows x 3 columns]

('\nOverall RMSE: ', 8.039708044425241)

Per User RMSE (best)
+-------------------------------+-------+-------------------+
|            user_id            | count |        rmse       |
+-------------------------------+-------+-------------------+
| 37c9c3c41472d1b8e9b02d8595... |   1   | 3.27778519269e-05 |
+-------------------------------+-------+-------------------+
[1 rows x 3 columns]


Per User RMSE (worst)
+-------------------------------+-------+--------------+
|            user_id            | count |     rmse     |
+-------------------------------+-------+--------------+
| d2232ac7a1ec17b283b5dff243... |   7   | 334.35977482 |
+-------------------------------+-------+--------------+
[1 rows x 3 columns]


Per Item RMSE (best)
+--------------------+-------+-----------------+
|      music_id      | count |       rmse      |
+--------------------+-------+-----------------+
| SOWYUFF12AB0185E62 |   2   | 0.0512414403283 |
+--------------------+-------+-----------------+
[1 rows x 3 columns]


Per Item RMSE (worst)
+--------------------+-------+-------------+
|      music_id      | count |     rmse    |
+--------------------+-------+-------------+
| SOLGIWB12A58A77A05 |   43  | 134.0908771 |
+--------------------+-------+-------------+
[1 rows x 3 columns]
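
Reading the three reports together: the popularity model (M0) has the lowest overall RMSE (6.20) but almost no precision or recall, the item-similarity model (M1) has by far the best precision and recall (about 0.065 and 0.134 at cutoff 10) with an RMSE of 6.91, and the factorization model (M2) has the highest RMSE (8.04). The two kinds of metric answer different questions, which the sketch below tries to make concrete. The function names and toy inputs are hypothetical, and the definitions are the standard ones, which I assume match what `compare_models` reports:

```python
import math

def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the first k recommendations."""
    hits = len(set(recommended[:k]) & set(relevant))
    precision = hits / float(k)
    recall = hits / float(len(relevant)) if relevant else 0.0
    return precision, recall

def rmse(actual, predicted):
    """Root-mean-square error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / float(len(squared)))

# Toy inputs, made up for illustration.
recommended = ["s1", "s2", "s3", "s4", "s5"]  # ranked list for one user
relevant = {"s2", "s5", "s9", "s10"}          # that user's held-out items
p3, r3 = precision_recall_at_k(recommended, relevant, 3)  # one hit: s2
error = rmse([1, 2, 9], [1, 4, 5])
print("precision@3 = %.3f, recall@3 = %.3f, RMSE = %.3f" % (p3, r3, error))
```

Precision and recall measure whether held-out items appear near the top of the ranking, while RMSE measures how close predicted rating values are. RMSE is also sensitive to a few very large targets, which may explain the per-user and per-item worst cases above 100 in the reports.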

In [16]:
K = 10
users = graphlab.SArray(sf['user_id'].unique().head(100))

Recommendations

In [17]:
recs = item_sim_model.recommend(users=users, k=K)
recs.head()
Out[17]:
+----------------------------------------------+--------------------+----------------+------+
| user_id                                      | music_id           | score          | rank |
+----------------------------------------------+--------------------+----------------+------+
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | SOXUQNR12AF72A69D6 | 3.02242265145  | 1    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | SOUFAZA12AC3DFAB20 | 1.33684277534  | 2    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | SOSFSTC12A8C141219 | 1.09198212624  | 3    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | SOVIWFP12A58A7D1BD | 1.04516386986  | 4    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | SOBMTQD12AB01833D0 | 1.02945168813  | 5    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | SOCMNRG12AB0189D3F | 0.975643793742 | 6    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | SOXOHUM12A67ADC826 | 0.950687328974 | 7    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | SOWBFVW12A6D4F612B | 0.909237066905 | 8    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | SOXFYTY127E9433E7D | 0.897727807363 | 9    |
| 279292bb36dbfc7f505e36ebf038c81eb1d1d63e ... | SOYBLYP12A58A79D32 | 0.897092819214 | 10   |
+----------------------------------------------+--------------------+----------------+------+
[10 rows x 4 columns]
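
Each row of the output pairs a user with a candidate song, a score and a rank. One common way to get such a ranking from item similarities is to score every unseen item by its similarity to the items the user already has, then keep the top k. The sketch below uses a plain mean of similarities, which is an assumption: GraphLab's scoring also involves the `rating` target (the scores above exceed 1), so its numbers will differ. All names and values are made up for illustration:

```python
def recommend(user_items, similarity, k):
    """Rank unseen items by their mean similarity to the user's items."""
    candidates = set()
    for item in user_items:
        candidates.update(similarity.get(item, {}))
    candidates -= set(user_items)  # never recommend what the user already has
    scored = []
    for cand in candidates:
        total = sum(similarity.get(item, {}).get(cand, 0.0)
                    for item in user_items)
        scored.append((total / len(user_items), cand))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(cand, score, rank)
            for rank, (score, cand) in enumerate(scored[:k], start=1)]

# Toy item-to-item similarities, made up for illustration.
similarity = {
    "a": {"b": 0.9, "c": 0.2},
    "b": {"a": 0.9, "c": 0.5, "d": 0.4},
}
toy_recs = recommend(["a", "b"], similarity, k=2)
for music_id, score, rank in toy_recs:
    print("%s %.2f %d" % (music_id, score, rank))
```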
In [ ]: