# 2. Content-based recommender system for educational courses – Spark DataFrames

`Author`: Елена Сидорова

`e-mail`: e_sidorova_94@mail.ru

## Task description

Using the available data from the eclass.cc portal, build content-based recommendations for educational courses. The pandas, sklearn, and similar libraries must not be used.

## Data description

The following input data is available:
- a dataset describing all courses;
- the ids of the courses that need recommendations.


A single record looks like this:


`{"lang": "en",
"name": "Accounting Cycle: The Foundation of Business Measurement and Reporting",
"cat": "3/business_management|6/economics_finance",
"provider": "Canvas Network",
"id": 4,
"desc": "This course introduces the basic financial statements used by most businesses, as well as the essential tools used to prepare them. This course will serve as a resource to help business students succeed in their upcoming university-level accounting classes, and as a refresher for upper division accounting students who are struggling to recall elementary concepts essential to more advanced accounting topics. Business owners will also benefit from this class by gaining essential skills necessary to organize and manage information pertinent to operating their business. At the conclusion of the class, students will understand the balance sheet, income statement, and cash flow statement. They will be able to differentiate between cash basis and accrual basis techniques, and know when each is appropriate. They\u2019ll also understand the accounting equation, how to journalize and post transactions, how to adjust and close accounts, and how to prepare key financial reports. All material for this class is written and delivered by the professor, and can be previewed here. Students must have access to a spreadsheet program to participate."}`

## Result

For each course id from the personal account, return the top 10 courses most similar to it. Recommended courses must be in the same language as the course the recommendation is built for.

## Solution

In [55]:
from IPython.display import IFrame, Image

***Importing Spark***

In [1]:
import os
import sys
os.environ["PYSPARK_PYTHON"]='/opt/anaconda/envs/bd9/bin/python'
os.environ["SPARK_HOME"]='/usr/hdp/current/spark2-client'
os.environ["PYSPARK_SUBMIT_ARGS"]='--num-executors 3 pyspark-shell'

spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')

sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.10.7-src.zip'))
exec(open(os.path.join(spark_home, 'python/pyspark/shell.py')).read())

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.7
      /_/

Using Python version 3.6.5 (default, Apr 29 2018 16:14:56)
SparkSession available as 'spark'.


***Configuring the Spark session***

In [3]:
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import *
from pyspark import Row
import json

conf = SparkConf()

spark = (SparkSession
         .builder
         .config(conf=conf)
         .appName("sparkdf-ees")
         .getOrCreate())

***Data***

In [4]:
# dataset with all courses
data = spark.read.json("/labs/slaba02/DO_record_per_line.json")

Let's examine the structure of the data on all courses:

In [5]:
data.show(2,False,True)


Ids of the courses that need recommendations:

In [6]:
given_courses = [[23126, u'en', u'Compass - powerful SASS library that makes your life easier'], 
                 [21617, u'en', u'Preparing for the AP* Computer Science A Exam \u2014 Part 2'], 
                 [16627, u'es', u'Aprende Excel: Nivel Intermedio by Alfonso Rinsche'], 
                 [11556, u'es', u'Aprendizaje Colaborativo by UNID Universidad Interamericana para el Desarrollo'], 
                 [16704, u'ru', u'\u041f\u0440\u043e\u0433\u0440\u0430\u043c\u043c\u0438\u0440\u043e\u0432\u0430\u043d\u0438\u0435 \u043d\u0430 Lazarus'], 
                 [13702, u'ru', u'\u041c\u0430\u0442\u0435\u043c\u0430\u0442\u0438\u0447\u0435\u0441\u043a\u0430\u044f \u044d\u043a\u043e\u043d\u043e\u043c\u0438\u043a\u0430']]

***Pipeline***

In [7]:
# Pipeline = preprocessing + tokenizer + HashingTF + IDF + dataset join + cos_sim (udf) + building the recommendations

In [8]:
import pyspark.sql.functions as f

In [9]:
from pyspark.sql.functions import regexp_replace, trim
from nltk.stem.snowball import SnowballStemmer

In [10]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, Normalizer, StopWordsRemover
from pyspark.ml import Pipeline
from pyspark.sql.functions import udf, col, isnan, isnull, broadcast, desc, lower
from pyspark.sql.types import FloatType, ArrayType, StringType
import json
import re

In [11]:
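# Create a UDF that uses a regular expression to strip punctuation and tokenize the text
import pyspark.sql.functions as f
from pyspark.sql.functions import pandas_udf
import re

def clear_string(series):
    # Keep runs of two or more word characters (letters, digits, underscore);
    # re.U makes \w match non-ASCII letters as well, e.g. Cyrillic
    regex = re.compile(r'[\w\d]{2,}', re.U)
    # series is a pandas Series of strings; findall yields a list of tokens per row
    words = series.str.findall(regex)
    return words

# Scalar pandas UDF returning an array of string tokens for each row
tokenizer_udf = pandas_udf(clear_string, ArrayType(StringType()))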
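
To see what the regular expression keeps, the same pattern can be tried in plain Python on single strings. This is only an illustrative sketch outside Spark; the helper name `tokenize` is hypothetical and not part of the notebook, and lowercasing is folded into it for brevity.

```python
import re

# Same pattern as in clear_string: runs of two or more word characters.
# With re.U, \w also matches non-ASCII letters such as Cyrillic.
TOKEN_RE = re.compile(r'[\w\d]{2,}', re.U)

def tokenize(text):
    """Hypothetical helper: lowercase the text and return tokens of length >= 2."""
    return TOKEN_RE.findall(text.lower())

print(tokenize("Compass - powerful SASS library"))
print(tokenize("Preparing for the AP* Computer Science A Exam"))
print(tokenize("Программирование на Lazarus"))
```

Punctuation and single-character tokens (the standalone "A", for example) are dropped because the pattern requires at least two word characters.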

***Analysis***

The source dataset in the required format:

In [12]:
df = data.select(data.id, data.lang, data.desc)

In [13]:
df.show(6)

+---+----+--------------------+
| id|lang|                desc|
+---+----+--------------------+
|  4|  en|This course intro...|
|  5|  en|This online cours...|
|  6|  fr|This course is ta...|
|  7|  en|We live in a digi...|
|  8|  en|This self-paced c...|
|  9|  en|This game-based c...|
+---+----+--------------------+
only showing top 6 rows



In [14]:
df.printSchema()

root
 |-- id: long (nullable = true)
 |-- lang: string (nullable = true)
 |-- desc: string (nullable = true)



Convert the list of given courses, the ones we build recommendations for, into a Spark DataFrame:

In [15]:
gc = spark.createDataFrame(given_courses)
type(gc)

pyspark.sql.dataframe.DataFrame

In [16]:
gc = gc.withColumnRenamed("_1","id").withColumnRenamed("_2","lang").withColumnRenamed("_3","desc")

In [17]:
gc.printSchema()

root
 |-- id: long (nullable = true)
 |-- lang: string (nullable = true)
 |-- desc: string (nullable = true)



In [18]:
gc.show()

+-----+----+--------------------+
|   id|lang|                desc|
+-----+----+--------------------+
|23126|  en|Compass - powerfu...|
|21617|  en|Preparing for the...|
|16627|  es|Aprende Excel: Ni...|
|11556|  es|Aprendizaje Colab...|
|16704|  ru|Программирование ...|
|13702|  ru|Математическая эк...|
+-----+----+--------------------+



***Preprocessing***

*Case*

Convert the text to lowercase:

In [29]:
df_clean = df.select('id', 'lang', 'desc', f.lower(df.desc).alias("desc_clean"))

In [31]:
df_clean.show(5)

+---+----+--------------------+--------------------+
| id|lang|                desc|          desc_clean|
+---+----+--------------------+--------------------+
|  4|  en|This course intro...|this course intro...|
|  5|  en|This online cours...|this online cours...|
|  6|  fr|This course is ta...|this course is ta...|
|  7|  en|We live in a digi...|we live in a digi...|
|  8|  en|This self-paced c...|this self-paced c...|
+---+----+--------------------+--------------------+
only showing top 5 rows



In [30]:
gc_clean = gc.select('id', 'lang', 'desc', f.lower(gc.desc).alias('desc_clean'))

In [32]:
gc_clean.show(6)

+-----+----+--------------------+--------------------+
|   id|lang|                desc|          desc_clean|
+-----+----+--------------------+--------------------+
|23126|  en|Compass - powerfu...|compass - powerfu...|
|21617|  en|Preparing for the...|preparing for the...|
|16627|  es|Aprende Excel: Ni...|aprende excel: ni...|
|11556|  es|Aprendizaje Colab...|aprendizaje colab...|
|16704|  ru|Программирование ...|программирование ...|
|13702|  ru|Математическая эк...|математическая эк...|
+-----+----+--------------------+--------------------+



*Text cleaning*

Remove punctuation and special characters and tokenize the text:

In [33]:
df_clean = df_clean.withColumn('cleaned', tokenizer_udf('desc_clean'))

In [34]:
gc_clean = gc_clean.withColumn('cleaned', tokenizer_udf('desc_clean'))

In [35]:
df_clean.show(1, truncate=False)


In [37]:
gc_clean.show(6, truncate=False)

+-----+----+------------------------------------------------------------------------------+------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+
|id   |lang|desc                                                                          |desc_clean                                                                    |cleaned                                                                                 |
+-----+----+------------------------------------------------------------------------------+------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+
|23126|en  |Compass - powerful SASS library that makes your life easier                   |compass - powerful sass library that makes your life easier                   |[compass, powerful, sass, library, that, makes, yo

In [24]:
#gc_clean = gc_clean.withColumn('cleaned', regexp_replace('desc_clean', pattern_punct, ''))

*Stop-word removal*

In [39]:
remover = StopWordsRemover(inputCol='cleaned', outputCol='desc_nosw')

In [41]:
df_no_sw = remover.transform(df_clean).select('id', 'lang', 'desc_nosw')

In [53]:
gc_no_sw = remover.transform(gc_clean).select('id', 'lang', 'desc_nosw')
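
What the stop-word step does to a token list can be sketched in plain Python. The stop-word set below is a tiny hypothetical subset chosen for illustration; `StopWordsRemover` with default parameters uses a much longer English list, so Spanish and Russian tokens are likely to pass through largely unchanged.

```python
# Tiny, hypothetical subset of an English stop-word list (illustration only).
STOP_WORDS = {"that", "your", "for", "the", "this", "is", "a"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens that are not in the stop-word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["compass", "powerful", "sass", "library", "that",
          "makes", "your", "life", "easier"]
print(remove_stop_words(tokens))
```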

In [43]:
df_no_sw.show(10)

+---+----+--------------------+
| id|lang|           desc_nosw|
+---+----+--------------------+
|  4|  en|[course, introduc...|
|  5|  en|[online, course, ...|
|  6|  fr|[course, taught, ...|
|  7|  en|[live, digitally,...|
|  8|  en|[self, paced, cou...|
|  9|  en|[game, based, cou...|
| 10|  en|[digital, teachin...|
| 11|  en|[goal, digital, l...|
| 12|  en|[ready, explore, ...|
| 13|  en|[self, paced, cou...|
+---+----+--------------------+
only showing top 10 rows



In [56]:
gc_no_sw.show(truncate=False, vertical=True)

-RECORD 0-----------------------------------------------------------------------------------------
 id        | 23126                                                                                
 lang      | en                                                                                   
 desc_nosw | [compass, powerful, sass, library, makes, life, easier]                              
-RECORD 1-----------------------------------------------------------------------------------------
 id        | 21617                                                                                
 lang      | en                                                                                   
 desc_nosw | [preparing, ap, computer, science, exam, part]                                       
-RECORD 2-----------------------------------------------------------------------------------------
 id        | 16627                                                                                
 lang     

In [46]:
df_no_sw.show(1, truncate=False, vertical=True)

-RECORD 0------------
 id        | 4

*Compute TF-IDF:*

In [80]:
from pyspark.ml.feature import HashingTF, IDF

In [81]:
hashingTF = HashingTF(inputCol="desc_nosw", outputCol="tf")
tf = hashingTF.transform(df_no_sw)

In [82]:
idf = IDF(inputCol="tf", outputCol="feature").fit(tf)
tfidf = idf.transform(tf)
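
The two stages above can be illustrated with a small plain-Python sketch: the hashing trick maps each token to a bucket index and counts occurrences, and the IDF weight uses the smoothed formula log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the bucket. The function names, the CRC32 hash, and the 16-bucket table are assumptions made for illustration; Spark's `HashingTF` uses a different hash function and 2**18 buckets by default, so real indices will differ.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 16  # tiny table for illustration only

def bucket(term, num_features=NUM_FEATURES):
    # Stand-in hash (CRC32); an assumption, not the hash Spark uses.
    return zlib.crc32(term.encode("utf-8")) % num_features

def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Sparse term-frequency vector as {bucket index: count}."""
    return dict(Counter(bucket(t, num_features) for t in tokens))

def fit_idf(tf_vectors):
    """IDF per bucket: log((m + 1) / (df + 1)) over the fitted documents."""
    m = len(tf_vectors)
    df = Counter(idx for vec in tf_vectors for idx in vec)
    return {idx: math.log((m + 1) / (count + 1)) for idx, count in df.items()}

def tfidf(tf_vector, idf):
    """Multiply each term frequency by the IDF weight of its bucket."""
    return {idx: tf * idf[idx] for idx, tf in tf_vector.items()}

docs = [["excel", "course", "basics"], ["excel", "advanced"], ["python", "course"]]
tf_vectors = [hashing_tf(d) for d in docs]
idf = fit_idf(tf_vectors)
vectors = [tfidf(v, idf) for v in tf_vectors]
print(vectors[0])
```

A bucket that occurs in every document gets weight log(1) = 0, so tokens shared by all descriptions contribute nothing to the similarity.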

In [86]:
#tf.show(1, truncate = False)

In [84]:
tfidf.show(1, truncate = False)


In [68]:
from pyspark.ml.feature import Normalizer
normalizer = Normalizer(inputCol="feature", outputCol="norm")
data = normalizer.transform(tfidf)
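
Normalizing first is what lets a plain dot product serve as the similarity score later. The following plain-Python sketch, with hypothetical helper names and made-up toy vectors, checks that the dot product of two L2-normalized sparse vectors equals their cosine similarity.

```python
import math

def dot(a, b):
    """Dot product of two sparse vectors stored as {index: value}."""
    return sum(v * b[i] for i, v in a.items() if i in b)

def l2_normalize(vec):
    """Scale a sparse vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(vec, vec))
    return {i: v / norm for i, v in vec.items()} if norm > 0 else dict(vec)

def cosine(a, b):
    """Cosine similarity computed directly from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

x = {0: 3.0, 2: 4.0}          # length 5
y = {0: 1.0, 1: 2.0, 2: 2.0}  # length 3

print(dot(l2_normalize(x), l2_normalize(y)), cosine(x, y))
```

Both printed values are 11/15 up to floating-point rounding, since the raw dot product is 11 and the vector lengths are 5 and 3.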

In [88]:
data.show(1, truncate=False, vertical = True)


In [89]:
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

In [93]:
dot_udf = f.udf(lambda x,y: float(x.dot(y)), DoubleType())

In [None]:
df_no_sw

In [97]:
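# Pair each of the six given courses (i) with every other course (j) and score the pair
# with the dot product of their normalized TF-IDF vectors.
# Note: the join condition does not restrict pairs to the same language.
df_final = data.alias("i").join(data.alias("j"), (f.col("i.id") != f.col("j.id")) & f.col("i.id").isin(23126, 21617, 16627, 11556, 16704, 13702))\
    .select(
        f.col("i.id").alias("i"),
        f.col("j.id").alias("j"),
        dot_udf("i.norm", "j.norm").alias("dot"))\
    .sort("i", "j")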
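
The task statement asks for recommendations in the same language as the given course, while the join above only excludes a course paired with itself. A plain-Python sketch of the stricter pairing rule is shown here; `candidate_pairs` is a hypothetical helper, and the (id, lang) tuples are taken from the tables printed earlier. In Spark the analogous change would presumably be an extra equality between the two `lang` columns in the join condition, which is an assumption about how the cell could be adapted rather than something the notebook does.

```python
def candidate_pairs(courses, query_ids):
    """Yield (query_id, candidate_id) pairs where the candidate is a different
    course written in the same language as the query course.

    courses: list of (id, lang) tuples.
    """
    lang_by_id = dict(courses)
    for qid in query_ids:
        for cid, lang in courses:
            if cid != qid and lang == lang_by_id[qid]:
                yield qid, cid

courses = [(4, "en"), (5, "en"), (6, "fr"), (11556, "es"), (16627, "es")]
pairs = list(candidate_pairs(courses, [11556]))
print(pairs)
```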

In [98]:
df_final.show()

+-----+---+--------------------+
|    i|  j|                 dot|
+-----+---+--------------------+
|11556|  4|6.356649296700284E-5|
|11556|  5|6.766219932768996E-5|
|11556|  6| 0.09439590552671846|
|11556|  7|0.001722725314314...|
|11556|  8|1.215670544295139E-4|
|11556|  9|0.001353692530605...|
|11556| 10|7.949580488215974E-5|
|11556| 11|8.314847555647642E-5|
|11556| 12|0.001198424492890...|
|11556| 13|0.001776946563245...|
|11556| 14|2.619132702270784...|
|11556| 15|1.042156969979818...|
|11556| 16|7.758117885137207E-5|
|11556| 17|6.860815483366755E-5|
|11556| 18|                 0.0|
|11556| 19|6.520649841016345E-5|
|11556| 20|5.684917551517781E-4|
|11556| 21|3.139817032014577E-5|
|11556| 22|3.363110563680040...|
|11556| 23|0.003431198480861...|
+-----+---+--------------------+
only showing top 20 rows



Select the top 10 courses for each given course:

In [99]:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
windowSpec  = Window.partitionBy("i").orderBy(col("dot").desc())

In [103]:
ranks = df_final.withColumn("row_number",row_number().over(windowSpec))
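
The window above partitions the scored pairs by the given course `i`, orders each partition by `dot` in descending order, and numbers the rows from 1. The same idea can be sketched in plain Python; `top_n_per_group` is a hypothetical helper and the scores are made-up toy values.

```python
from collections import defaultdict

def top_n_per_group(rows, n=10):
    """rows: iterable of (i, j, score).
    Returns {i: [(j, score, rank), ...]} with the n highest scores per i,
    ranked from 1 in descending score order."""
    groups = defaultdict(list)
    for i, j, score in rows:
        groups[i].append((j, score))
    result = {}
    for i, items in groups.items():
        items.sort(key=lambda pair: pair[1], reverse=True)
        result[i] = [(j, score, rank)
                     for rank, (j, score) in enumerate(items[:n], start=1)]
    return result

rows = [(1, 10, 0.2), (1, 11, 0.9), (1, 12, 0.5), (2, 10, 0.1), (2, 13, 0.7)]
top = top_n_per_group(rows, n=2)
print(top)
```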

In [130]:
ranks.filter(ranks.row_number<11) \
    .show(60, truncate=False)

+-----+-----+-------------------+----------+
|i    |j    |dot                |row_number|
+-----+-----+-------------------+----------+
|23126|14760|0.6735876684588573 |1         |
|23126|13665|0.6418197548622134 |2         |
|23126|13782|0.6323685742905414 |3         |
|23126|15909|0.4568708744264577 |4         |
|23126|25782|0.3165326572724938 |5         |
|23126|17499|0.2987613080523377 |6         |
|23126|19270|0.28898002193673855|7         |
|23126|13348|0.28562514570281733|8         |
|23126|25071|0.24736140444285365|9         |
|23126|7153 |0.23640836528890993|10        |
|16627|11431|0.6556959673640088 |1         |
|16627|12247|0.5208318852331626 |2         |
|16627|17964|0.5011261300667872 |3         |
|16627|11575|0.49228274783942605|4         |
|16627|12660|0.4854467733287461 |5         |
|16627|5687 |0.4791412945989792 |6         |
|16627|25010|0.4738549672797989 |7         |
|16627|5558 |0.4738044200461289 |8         |
|16627|10738|0.4723437248696306 |9         |
|16627|179

***Final result***

In [17]:
import json

In [131]:
answer = {
"23126": [14760, 13665, 13782, 15909, 25782, 17499, 19270, 13348, 25071, 7153],  
"21617": [21609, 21616, 22298, 21608, 21628, 21630, 21623, 21081, 19417, 21624],
"16627": [11431, 12247, 17964, 11575, 12660, 5687, 25010, 5558, 10738, 17961],  
"11556": [16488, 13461, 468, 23357, 19330, 7833, 9289, 16929, 22710, 10447],
"16704": [4592, 1247, 1236, 1228, 1365, 1164, 1273, 20288, 1233, 8203],
"13702": [864, 21079, 15946, 8313, 8123, 1041, 28074, 13057, 8617, 21987]
}

In [132]:
with open("lab02.json", "w") as fout:  # `fout` avoids shadowing the `f` alias for pyspark.sql.functions
    json.dump(answer, fout, indent=4)
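
The answer dictionary above is typed in by hand from the ranked output. As a hedged alternative, it could be assembled from collected (given id, recommended id, rank) tuples; `build_answer` is a hypothetical helper, and the three sample tuples reuse values from the ranked table shown earlier.

```python
import json

def build_answer(ranked_rows):
    """ranked_rows: iterable of (query_id, candidate_id, rank).
    Returns {str(query_id): [candidate ids ordered by rank]}."""
    answer = {}
    for qid, cid, _rank in sorted(ranked_rows, key=lambda r: (r[0], r[2])):
        answer.setdefault(str(qid), []).append(cid)
    return answer

ranked = [(23126, 13665, 2), (23126, 14760, 1), (16627, 11431, 1)]
print(json.dumps(build_answer(ranked), indent=4))
```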

Stop the Spark session:

In [133]:
sc.stop()