# Бинарные файлы

Как можно хранить произвольные данные на диске? Для этого у нас есть два варианта: текстовый и бинарный формат. 

Текстовый формат это человекочитаемый формат, где все записано как текст - строки, числа, другие типы. Такой формат удобно читать глазами, но работать с ним может быть крайне неэффективно: для того, чтобы его прочитать, надо перевести байтики в текст, а затем уже парсить текст в какие-то желаемые типы. Так, например, происходит в известном всеми формате csv. 

Бинарный формат предполагает, что в нем лежат байтики, но только читающий знает, как их правильно интепретировать. А значит, можно пропустить текстовый этап в парсинге и сразу переводить байты в то, что задано в структуре. Как правило, это получается сильно эффективнее.

In [3]:
!kaggle datasets download muthuj7/weather-dataset

Downloading weather-dataset.zip to /home/ubuntu/lsml-2024
 90%|██████████████████████████████████    | 2.00M/2.23M [00:00<00:00, 3.32MB/s]
100%|██████████████████████████████████████| 2.23M/2.23M [00:00<00:00, 3.18MB/s]


In [5]:
!unzip weather-dataset.zip 

Archive:  weather-dataset.zip
  inflating: weatherHistory.csv      


Возьмем датасет погоды и выделим оттуда несколько интерсных для нас переменных. 

In [60]:
sub_df = df[['Summary', 'Precip Type', 'Temperature (C)',
       'Apparent Temperature (C)', 'Humidity', 'Wind Speed (km/h)',
       'Wind Bearing (degrees)', 'Visibility (km)', 'Loud Cover',
       'Pressure (millibars)']]

In [61]:
sub_df.loc[:, 'Precip Type'][pd.isna(sub_df['Precip Type'])] = 'nan'

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  sub_df.loc[:, 'Precip Type'][pd.isna(sub_df['Precip Type'])] = 'nan'


In [63]:
from sklearn.preprocessing import LabelEncoder


sub_df.loc[:, 'Summary'] = LabelEncoder().fit_transform(sub_df['Summary']).astype(int)
sub_df.loc[:, 'Precip Type'] = LabelEncoder().fit_transform(sub_df['Precip Type']).astype(int)
sub_df.to_csv("weather_small.csv", index=False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(ilocs[0], value)


Теперь в weather_small.csv у нас будут храниться 2 int64 и 7 float32 переменных. 

In [66]:
sub_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96453 entries, 0 to 96452
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Summary                   96453 non-null  int64  
 1   Precip Type               96453 non-null  int64  
 2   Temperature (C)           96453 non-null  float64
 3   Apparent Temperature (C)  96453 non-null  float64
 4   Humidity                  96453 non-null  float64
 5   Wind Speed (km/h)         96453 non-null  float64
 6   Wind Bearing (degrees)    96453 non-null  float64
 7   Visibility (km)           96453 non-null  float64
 8   Loud Cover                96453 non-null  float64
 9   Pressure (millibars)      96453 non-null  float64
dtypes: float64(8), int64(2)
memory usage: 7.4 MB


Оказывается, в numpy есть поддержка сложных типов. Давайте попробуем записать наш csv в numpy, а затем буквально положить этот numpy байтик за байтиком в файл.

In [194]:
import numpy as np

sub_df_numpy = np.ndarray(shape=sub_df.shape[0], dtype=np.dtype(f"<i8,<i8,<7f4"))
sub_df_numpy['f0'] = sub_df.iloc[:, 0].values
sub_df_numpy['f1'] = sub_df.iloc[:, 1].values
sub_df_numpy['f2'] = sub_df.iloc[:, 2:-1].values

In [182]:
sub_df_numpy.tofile('weather_small.bin')

А теперь сравним скорость считывания:

In [183]:
%%timeit
tmp_csv = pd.read_csv("weather_small.csv")

106 ms ± 593 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [185]:
%%timeit
tmp_np = np.fromfile("weather_small.bin", dtype=np.dtype(f"<i8,<i8,<7f4"))

686 µs ± 47.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Видим, что прирост в скорости чтения получился почти в ~50 раз. Из минусов, нам пришлось знать заранее структуру файла и передать ее в чтение. 


Почему бы не воспользоваться np.save/np.load? Формально, это бинарный формат, который создан для произвольного парсинга numpy файлов:

In [195]:
np.save("weather_small.npy", sub_df_numpy)

In [196]:
%%timeit
np.load("weather_small.npy")

950 µs ± 22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Получили некоторое замедление... Откуда - не совсем понятно.

Что если мы попробуем закодировать структуру файла руками в этот же файл? Затем при чтении, зная, что структура занимает N байт, и мы умеем ее интерпретировать, мы можем прочитать и сами данные.

In [188]:
import struct


# для упрощения будем работать с файлом из N флотов
sub_df_numpy = np.ndarray(shape=sub_df.shape[0], dtype=np.dtype(f"<7f4"))
sub_df_numpy = sub_df.iloc[:, 2:-1].values.astype(np.float32)

    
def binary_write_of_arbitrary_size(data, file):
    # кодируем в начале файла количество флотов, это будет наш заголовок
    with open(file, 'wb') as f:
        f.write(struct.pack("l", data.shape[1]))
    with open(output_file, 'a') as f:
        f.seek(8) # делаем сдвиг в файле, чтобы не перезаписать заголовок
        a = np.ndarray(shape=data.shape[0], dtype=np.dtype(f"<{data.shape[1]}f4"))
        a[:] = sub_df_numpy
        a.tofile(f)


def binary_read_of_arbitrary_size(file):
    with open(file, 'rb') as f:
        f_string = f.read()
        read_size = struct.unpack("l", f_string[:8]) # сначала читаем количество флотов
        data = np.frombuffer(f_string[8:], dtype=np.dtype(f"<{read_size}f4")) # а затем передаем его в парсинг
    return data

binary_write_of_arbitrary_size(sub_df_numpy, "weather_small.bin")  

assert (binary_read_of_arbitrary_size("weather_small.bin") == sub_df_numpy).all()

In [189]:
%%timeit 

tmp = binary_read_of_arbitrary_size("weather_small.bin")

643 µs ± 17.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [190]:
np.save("weather_small.npy", sub_df_numpy)

In [191]:
%%timeit 
np.load("weather_small.npy")

606 µs ± 7.84 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


А тут никакого преимущества мы уже не получили.

Далее мы поговорим о более продвинутых бинарных форматах для больших данных: Parquet и Arrow.

# Advanced Spark

Сегодня пройдемся по каким-то аспектам фреймворка Spark, которые не затрагивали в предыдущий раз, но которые могут оказаться очень полезными.

Датасет на сегодня - данные с сайта Airbnb

In [1]:
import pandas as pd
pd.set_option('display.max_columns', 500)

In [2]:
! hdfs dfs -mkdir -p /user/airbnb

Скачаем датасет из https://public.opendatasoft.com/explore/dataset/airbnb-listings/information

Можно также сказать с нашего зеркала - https://storage.yandexcloud.net/lsml-kosmos/mirror/airbnb-data.csv

In [4]:
# ! wget 'https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/airbnb-listings/exports/csv?lang=en&timezone=Europe%2FMoscow&use_labels=true&csv_separator=%3B' -O airbnb.csv

In [132]:
! file airbnb.csv

airbnb.csv: UTF-8 Unicode (with BOM) text, with very long lines, with CRLF, LF line terminators


In [133]:
! hdfs dfs -put airbnb.csv /user/airbnb/data.csv

In [134]:
! hdfs dfs -ls -h /user/airbnb

Found 1 items
-rw-r--r--   1 ubuntu hadoop      1.7 G 2024-02-25 11:57 /user/airbnb/data.csv


In [7]:
!pip install pyspark[sql]

Collecting pyspark[sql]
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
[K     |██████████████████████████████▉ | 305.0 MB 78.7 MB/s eta 0:00:01

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



[K     |████████████████████████████████| 316.9 MB 9.6 kB/s 
[?25hCollecting py4j==0.10.9.7
  Downloading py4j-0.10.9.7-py2.py3-none-any.whl (200 kB)
[K     |████████████████████████████████| 200 kB 50.0 MB/s eta 0:00:01
[?25hCollecting numpy>=1.15
  Downloading numpy-1.24.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
[K     |████████████████████████████████| 17.3 MB 56.9 MB/s eta 0:00:01
[?25hCollecting pandas>=1.0.5
  Downloading pandas-2.0.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.4 MB)
[K     |████████████████████████████████| 12.4 MB 53.5 MB/s eta 0:00:01
[?25hCollecting pyarrow>=4.0.0
  Downloading pyarrow-15.0.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (38.5 MB)
[K     |████████████████████████████████| 38.5 MB 58.1 MB/s eta 0:00:01
[?25hCollecting tzdata>=2022.1
  Downloading tzdata-2024.1-py2.py3-none-any.whl (345 kB)
[K     |████████████████████████████████| 345 kB 67.5 MB/s eta 0:00:01
[?25hCollecting py

In [1]:
import findspark
findspark.init()

In [2]:
import pyspark
sc = pyspark.SparkContext(appName="lsml-app-1")

In [3]:
from pyspark.sql import SparkSession, Row

In [2]:
se = SparkSession(sc)

In [4]:
data = se.read.option("mode", "DROPMALFORMED").option('sep', ';').csv("/user/airbnb/data.csv", header=True, inferSchema=True)

In [140]:
data.printSchema()

root
 |-- ID: string (nullable = true)
 |-- Listing Url: string (nullable = true)
 |-- Scrape ID: string (nullable = true)
 |-- Last Scraped: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Summary: string (nullable = true)
 |-- Space: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Experiences Offered: string (nullable = true)
 |-- Neighborhood Overview: string (nullable = true)
 |-- Notes: string (nullable = true)
 |-- Transit: string (nullable = true)
 |-- Access: string (nullable = true)
 |-- Interaction: string (nullable = true)
 |-- House Rules: string (nullable = true)
 |-- Thumbnail Url: string (nullable = true)
 |-- Medium Url: string (nullable = true)
 |-- Picture Url: string (nullable = true)
 |-- XL Picture Url: string (nullable = true)
 |-- Host ID: string (nullable = true)
 |-- Host URL: string (nullable = true)
 |-- Host Name: string (nullable = true)
 |-- Host Since: string (nullable = true)
 |-- Host Location: string (nullable

In [141]:
data.limit(4).toPandas()

Unnamed: 0,ID,Listing Url,Scrape ID,Last Scraped,Name,Summary,Space,Description,Experiences Offered,Neighborhood Overview,...,Review Scores Communication,Review Scores Location,Review Scores Value,License,Jurisdiction Names,Cancellation Policy,Calculated host listings count,Reviews per Month,Geolocation,Features
0,984863,https://www.airbnb.com/rooms/984863,20160504002227,2016-05-04,Condo à Montréal,Condo deux chambres et capacité de 6 personnes...,Condo 4 et demi à louer à 7 minutes à pieds du...,Condo 4 et demi à louer à 7 minutes à pieds du...,none,,...,9.0,7.0,10.0,,,flexible,1,0.09,"45.545578373282225, -73.54708801812593","Host Has Profile Pic,Host Identity Verified,Is..."
1,8884113,https://www.airbnb.com/rooms/8884113,20160504002227,2016-05-04,"Chambre, lit confortable",Endroit calme ayant seulement 2 appartement da...,"À seulement 5 min de marche, vous avez accès a...",Endroit calme ayant seulement 2 appartement da...,none,Il a des épiceries spécialisée (bio et asiatiq...,...,10.0,8.0,9.0,,,strict,2,1.6,"45.44670375240608, -73.64160993835216","Host Has Profile Pic,Host Identity Verified,Is..."
2,7698993,https://www.airbnb.com/rooms/7698993,20160504002227,2016-05-04,1 1/2 petit studio comfo & propre!,Je suis une jeune femme qui travaille en commu...,,Je suis une jeune femme qui travaille en commu...,none,,...,9.0,6.0,8.0,,,flexible,1,0.22,"45.56100621774929, -73.57475396375014","Host Has Profile Pic,Is Location Exact"
3,6162989,https://www.airbnb.com/rooms/6162989,20160504002227,2016-05-04,Grande et lumineuse chambre,Quartier tranquille avec tout à proximité. Pr...,Très belle chambre dans une maison ensoleillée...,Quartier tranquille avec tout à proximité. Pr...,none,Quartier dynamique et tranquille. À moins de ...,...,,,,,,flexible,1,,"45.54665669341026, -73.67346630877904","Host Has Profile Pic,Is Location Exact"


## Advanced Binary Formats


Как обещали, сейчас мы поговорим про более продвинутые бинарные форматы. Начнем с Parquet

### Parquet


Что же такое паркет? Это бинарный формат, основная цель которого это оптимизация I/O - то есть хранения данных, чтения, быстрой передачи через сеть. Есть несколько основных ингредиентов, за счет которых он работает:


1. Hybrid Storage Format

Часто говорят, что Parquet это колоночный формат. На самом деле это не совсем так:

<img src="https://miro.medium.com/v2/resize:fit:4800/format:webp/1*QEQJjtnDb3JQ2xqhzARZZw.png" alt="Alternative text" />


Это позволяет соблюдать trade-off между более row-intensive и column-intensive командами. 

Данные батчи данных в гибридном формате называются Row Groups.

2. Хранение метаинформации

Допустим, мы хотим посчитать какой-то предикат вида "колонка i < 5". Тогда, сохраняя метаинформацию для каждой Row Group, например, максимум и минимум, мы можем быстро пропускать лишние блоки данных:

<img src="https://miro.medium.com/v2/resize:fit:700/format:webp/1*EzPLuhgFw2hbsQHTP7lEvA.png" alt="Alternative text" />

3. Dictionary encoding

Если мы храним строки, которые встречаются неуникально много раз, то можно закодировать их:

<img src="https://malinxiao.files.wordpress.com/2021/12/image-47.png?w=396" alt="Alternative text" />


4. И многое другое...
Классное видео для более подробного знакомства: https://www.youtube.com/watch?v=1j8SdS7s_NY

In [197]:
%%time

data.count()

CPU times: user 3.4 ms, sys: 16 µs, total: 3.42 ms
Wall time: 5.65 s


914210

Посчитаем например среднее квадрата значения в колонке price

In [206]:
%%time

data.rdd.map(lambda x: float(x.Price or 0.0) ** 2).mean()

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times, most recent failure: Lost task 0.3 in stage 8.0 (TID 63, rc1a-dataproc-d-t299z6uw2n0sko0q.mdb.yandexcloud.net, executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/hadoop/yarn/nm-local-dir/usercache/ubuntu/appcache/application_1708859170992_0001/container_1708859170992_0001_01_000002/pyspark.zip/pyspark/sql/types.py", line 1595, in __getattr__
    idx = self.__fields__.index(item)
ValueError: 'Price' is not in list

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/hadoop/yarn/nm-local-dir/usercache/ubuntu/appcache/application_1708859170992_0001/container_1708859170992_0001_01_000002/pyspark.zip/pyspark/worker.py", line 605, in main
    process()
  File "/hadoop/yarn/nm-local-dir/usercache/ubuntu/appcache/application_1708859170992_0001/container_1708859170992_0001_01_000002/pyspark.zip/pyspark/worker.py", line 595, in process
    out_iter = func(split_index, iterator)
  File "/usr/lib/spark/python/pyspark/rdd.py", line 2596, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/usr/lib/spark/python/pyspark/rdd.py", line 2596, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/usr/lib/spark/python/pyspark/rdd.py", line 425, in func
    return f(iterator)
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1151, in <lambda>
    return self.mapPartitions(lambda i: [StatCounter(i)]).reduce(redFunc)
  File "/hadoop/yarn/nm-local-dir/usercache/ubuntu/appcache/application_1708859170992_0001/container_1708859170992_0001_01_000002/pyspark.zip/pyspark/statcounter.py", line 42, in __init__
    for v in values:
  File "/hadoop/yarn/nm-local-dir/usercache/ubuntu/appcache/application_1708859170992_0001/container_1708859170992_0001_01_000002/pyspark.zip/pyspark/util.py", line 107, in wrapper
    return f(*args, **kwargs)
  File "<timed eval>", line 1, in <lambda>
  File "/hadoop/yarn/nm-local-dir/usercache/ubuntu/appcache/application_1708859170992_0001/container_1708859170992_0001_01_000002/pyspark.zip/pyspark/sql/types.py", line 1600, in __getattr__
    raise AttributeError(item)
AttributeError: Price

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:503)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:638)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:621)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:456)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
	at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
	at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
	at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
	at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
	at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
	at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
	at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
	at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
	at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
	at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1004)
	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2154)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:463)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:466)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2114)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2135)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2154)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2179)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:168)
	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/hadoop/yarn/nm-local-dir/usercache/ubuntu/appcache/application_1708859170992_0001/container_1708859170992_0001_01_000002/pyspark.zip/pyspark/sql/types.py", line 1595, in __getattr__
    idx = self.__fields__.index(item)
ValueError: 'Price' is not in list

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/hadoop/yarn/nm-local-dir/usercache/ubuntu/appcache/application_1708859170992_0001/container_1708859170992_0001_01_000002/pyspark.zip/pyspark/worker.py", line 605, in main
    process()
  File "/hadoop/yarn/nm-local-dir/usercache/ubuntu/appcache/application_1708859170992_0001/container_1708859170992_0001_01_000002/pyspark.zip/pyspark/worker.py", line 595, in process
    out_iter = func(split_index, iterator)
  File "/usr/lib/spark/python/pyspark/rdd.py", line 2596, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/usr/lib/spark/python/pyspark/rdd.py", line 2596, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/usr/lib/spark/python/pyspark/rdd.py", line 425, in func
    return f(iterator)
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1151, in <lambda>
    return self.mapPartitions(lambda i: [StatCounter(i)]).reduce(redFunc)
  File "/hadoop/yarn/nm-local-dir/usercache/ubuntu/appcache/application_1708859170992_0001/container_1708859170992_0001_01_000002/pyspark.zip/pyspark/statcounter.py", line 42, in __init__
    for v in values:
  File "/hadoop/yarn/nm-local-dir/usercache/ubuntu/appcache/application_1708859170992_0001/container_1708859170992_0001_01_000002/pyspark.zip/pyspark/util.py", line 107, in wrapper
    return f(*args, **kwargs)
  File "<timed eval>", line 1, in <lambda>
  File "/hadoop/yarn/nm-local-dir/usercache/ubuntu/appcache/application_1708859170992_0001/container_1708859170992_0001_01_000002/pyspark.zip/pyspark/sql/types.py", line 1600, in __getattr__
    raise AttributeError(item)
AttributeError: Price

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:503)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:638)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:621)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:456)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
	at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
	at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
	at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
	at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
	at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
	at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
	at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
	at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
	at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
	at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1004)
	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2154)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:463)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:466)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more


~32 секунды. Пробуем сконвертировать в паркет.

In [198]:
! hdfs dfs -rm -r /user/airbnb/parquet
! hdfs dfs -mkdir -p /user/airbnb/parquet

rm: `/user/airbnb/parquet': No such file or directory


In [199]:
for column in data.columns:
    data = data.withColumnRenamed(column, column.lower().replace(' ', '_'))

In [200]:
data.write.parquet("/user/airbnb/parquet/data.parquet")

In [214]:
!hdfs dfs -du -h /user/airbnb/parquet/data.parquet  | wc -l

15


In [209]:
!hdfs dfs -du -h /user/airbnb/parquet/data.parquet 

0       0       /user/airbnb/parquet/data.parquet/_SUCCESS
40.6 M  40.6 M  /user/airbnb/parquet/data.parquet/part-00000-0848ac78-8c7b-494b-9c16-7d39e570ce51-c000.snappy.parquet
41.0 M  41.0 M  /user/airbnb/parquet/data.parquet/part-00001-0848ac78-8c7b-494b-9c16-7d39e570ce51-c000.snappy.parquet
41.0 M  41.0 M  /user/airbnb/parquet/data.parquet/part-00002-0848ac78-8c7b-494b-9c16-7d39e570ce51-c000.snappy.parquet
40.7 M  40.7 M  /user/airbnb/parquet/data.parquet/part-00003-0848ac78-8c7b-494b-9c16-7d39e570ce51-c000.snappy.parquet
41.0 M  41.0 M  /user/airbnb/parquet/data.parquet/part-00004-0848ac78-8c7b-494b-9c16-7d39e570ce51-c000.snappy.parquet
40.8 M  40.8 M  /user/airbnb/parquet/data.parquet/part-00005-0848ac78-8c7b-494b-9c16-7d39e570ce51-c000.snappy.parquet
40.7 M  40.7 M  /user/airbnb/parquet/data.parquet/part-00006-0848ac78-8c7b-494b-9c16-7d39e570ce51-c000.snappy.parquet
40.9 M  40.9 M  /user/airbnb/parquet/data.parquet/part-00007-0848ac78-8c7b-494b-9c16-7d39e570ce51-c000.snap

In [215]:
15 * 40

600

In [212]:
!hdfs dfs -du -h /user/airbnb/data.csv

1.7 G  1.7 G  /user/airbnb/data.csv


In [201]:
data_parquet = se.read.parquet("/user/airbnb/parquet/data.parquet")

In [202]:
data_parquet.printSchema()

root
 |-- id: string (nullable = true)
 |-- listing_url: string (nullable = true)
 |-- scrape_id: string (nullable = true)
 |-- last_scraped: string (nullable = true)
 |-- name: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- space: string (nullable = true)
 |-- description: string (nullable = true)
 |-- experiences_offered: string (nullable = true)
 |-- neighborhood_overview: string (nullable = true)
 |-- notes: string (nullable = true)
 |-- transit: string (nullable = true)
 |-- access: string (nullable = true)
 |-- interaction: string (nullable = true)
 |-- house_rules: string (nullable = true)
 |-- thumbnail_url: string (nullable = true)
 |-- medium_url: string (nullable = true)
 |-- picture_url: string (nullable = true)
 |-- xl_picture_url: string (nullable = true)
 |-- host_id: string (nullable = true)
 |-- host_url: string (nullable = true)
 |-- host_name: string (nullable = true)
 |-- host_since: string (nullable = true)
 |-- host_location: string (nullable

Попробуем повторить запрос выше.

In [203]:
%%time

data_parquet.rdd.map(lambda x: float(x.price or 0) ** 2).mean()

CPU times: user 12.1 ms, sys: 7.75 ms, total: 19.9 ms
Wall time: 31.6 s


42017.68492409332

~26 секунд! Не фантастика, но 10 секунд сэкономили. И тут нам поможет следующий формат:

## Arrow

Arrow это **in-memory** формат хранения данных. То есть он не особо оптимизирован под хранение как Parquet (хотя он тоже бинарный). Но сила его в вычислениях и универсальном формате хранения. Какие же там фишки?

**Cache locality, pipelining, and SIMD instructions**

Если с первыми двумя мы знакомы, то что такое SIMD?

<img src="https://lh3.googleusercontent.com/proxy/TAv_dMn00xlWRJ3lrVkh9DQ8wQzbGpKzRJE-LiiPV8NDa2F4COQVoUB051cLxSMERPybntoKJjpltBsGCp3YeT75k7pUDT4qNmq_86TdRVDWsuf8kxQrmJT31roPcW3PyWkNTesvSo2_fZkuox0Mntret5iltjmO1LSVaZXJvGwDqITo3XhV7XgRZ5UNXXKTwKPfYG8" alt="Alternative text" />


In [204]:
data = data_parquet

In [224]:
data.registerTempTable("airbnb")

#### Посмотрим еще раз на таблицу

In [225]:
data.limit(4).toPandas()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_communication,review_scores_location,review_scores_value,license,jurisdiction_names,cancellation_policy,calculated_host_listings_count,reviews_per_month,geolocation,features
0,7354215,https://www.airbnb.com/rooms/7354215,20170304065726,2017-03-05,Garden flat in the heart of Hoxton,My modern flat in Hoxton is available for rent...,Very modern ground floor flat with a unique pr...,My modern flat in Hoxton is available for rent...,none,"Hoxton is full of interesting restaurants, ba...",...,10,9,9,,,strict,2,0.92,"51.53135864588921, -0.08121186956367775","Host Has Profile Pic,Host Identity Verified"
1,15171031,https://www.airbnb.com/rooms/15171031,20170304065726,2017-03-05,Spacious 1BR Flat near Canary Wharf,This modern and stylish one bedroom apartment ...,This large and comfortable one bedroom apartme...,This modern and stylish one bedroom apartment ...,none,The Canary Wharf and Docklands area is a thriv...,...,10,9,9,,,moderate,3,2.59,"51.49939453571559, -0.013528809923329303","Host Has Profile Pic,Is Location Exact,Instant..."
2,7142115,https://www.airbnb.com/rooms/7142115,20170304065726,2017-03-05,Luxury 2 Bed Lantern Court Aprt-III,"With exquisite views, an excellent range of am...","With an exciting mix of classy décor, ideal lo...","With exquisite views, an excellent range of am...",none,Our serviced apartments offer the perfect bas...,...,6,7,6,,,strict,71,0.11,"51.49704207630152, -0.015862151838027014","Host Has Profile Pic,Host Identity Verified,Is..."
3,14480308,https://www.airbnb.com/rooms/14480308,20170304065726,2017-03-05,6.3 Double room in Brick Lane.,This comfy room is part of a 3 bed flat with a...,,This comfy room is part of a 3 bed flat with a...,none,,...,9,7,7,,,strict,53,0.45,"51.525368157427444, -0.06760488564204839","Host Has Profile Pic,Host Identity Verified,Is..."


В SQL есть готовые функции для манипуляции с данными. Весь список можно найти здесь - https://spark.apache.org/docs/2.3.0/api/sql/index.html . Конкретно сейчас воспользуемся split, которая превращает строку в массив строк, разбивая ее по указанному символу.

In [279]:

se.sql("""
SELECT cast(host_response_rate as int) * 2 as f_host_response_rate
FROM airbnb
LIMIT 10
""").toPandas()

126 ms ± 3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


`explode` распупочивает список в отдельные записи в таблице

In [14]:
se.sql("""
    SELECT id, explode(split(amenities, ',')) as amenities
    FROM airbnb
    LIMIT 10
""").toPandas()

Unnamed: 0,id,amenities
0,761378,TV
1,761378,Wireless Internet
2,761378,Elevator in building
3,761378,Buzzer/wireless intercom
4,10600490,TV
5,10600490,Cable TV
6,10600490,Internet
7,10600490,Wireless Internet
8,10600490,Kitchen
9,10600490,Breakfast


Чтож, давайте немного позанимается машинным обучением. Задача будет простая и понятная - предсказываем цену для квартиры.

Давайте придумывать признаки

In [15]:
se.sql("""
SELECT (cast(now() as long) - cast(cast(host_since as timestamp) as long)) / (60 * 60 * 24) as f_host_for
FROM airbnb
LIMIT 10
""").toPandas()

Unnamed: 0,f_host_for
0,3728.81059
1,2556.81059
2,2743.81059
3,2310.81059
4,2999.81059
5,3260.81059
6,3200.81059
7,3843.81059
8,2804.81059
9,3736.81059


In [230]:
se.sql("""
SELECT cast(host_response_rate as int) as f_host_response_rate, cast(host_acceptance_rate as int) as f_host_acceptance_rate, cast(host_total_listings_count as int) as f_host_total_listings_count
FROM airbnb
LIMIT 10
""").toPandas()

Unnamed: 0,f_host_response_rate,f_host_acceptance_rate,f_host_total_listings_count
0,100.0,,2
1,100.0,,3
2,100.0,,74
3,100.0,,63
4,,,1
5,,,1
6,53.0,,1
7,83.0,,3
8,,,1
9,100.0,,1


In [17]:
se.sql("""
SELECT size(split(host_verifications, ',')) as f_num_of_ver
FROM airbnb
LIMIT 10
""").toPandas()

Unnamed: 0,f_num_of_ver
0,2
1,3
2,4
3,3
4,4
5,6
6,3
7,5
8,3
9,5


#### Used defined functions

Все числа было бы неплохо привести к нормальному виду. 
Делать это чисто из SQL достаточно непросто, поэтому можно попробовать заиспользовать питоновский код прямо в SQL

In [280]:
def to_number(raw_value):
    try:
        return float(raw_value)
    except:
        return 0.0

In [281]:
se.udf.register("to_number", to_number, "float")

<function __main__.to_number(raw_value)>

Соберем в одну табличку с признаками

In [292]:
se.sql("""
SELECT id, 
       (cast(now() as long) - cast(cast(host_since as timestamp) as long)) / (60 * 60 * 24) as f_host_for,
       to_number(cast(host_response_rate as int)) as f_host_response_rate, 
       to_number(cast(host_acceptance_rate as int)) as f_host_acceptance_rate, 
       to_number(cast(host_total_listings_count as int)) as f_host_total_listings_count,
       to_number(size(split(host_verifications, ','))) as f_num_of_ver,
       to_number(cast(host_listings_count as int)) as f_host_listings_count
FROM airbnb
""").registerTempTable("hosts_features")

In [293]:
%%timeit
se.sql("""
SELECT *
FROM hosts_features
LIMIT 5
""").toPandas()

364 ms ± 17.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [22]:
review_columns = [
    c
    for c in data.columns
    if c.startswith('review')
] + ["number_of_reviews"]

In [23]:
review_columns

['review_scores_rating',
 'review_scores_accuracy',
 'review_scores_cleanliness',
 'review_scores_checkin',
 'review_scores_communication',
 'review_scores_location',
 'review_scores_value',
 'reviews_per_month',
 'number_of_reviews']

Пошаманим с запросом через вставку в SQL

In [24]:
query = ", ".join([
    "to_number({c}) as f_{c}".format(c=c)
    for c in review_columns
])

In [25]:
query

'to_number(review_scores_rating) as f_review_scores_rating, to_number(review_scores_accuracy) as f_review_scores_accuracy, to_number(review_scores_cleanliness) as f_review_scores_cleanliness, to_number(review_scores_checkin) as f_review_scores_checkin, to_number(review_scores_communication) as f_review_scores_communication, to_number(review_scores_location) as f_review_scores_location, to_number(review_scores_value) as f_review_scores_value, to_number(reviews_per_month) as f_reviews_per_month, to_number(number_of_reviews) as f_number_of_reviews'

In [26]:
se.sql("""
SELECT id, {}
FROM airbnb
""".format(query)).registerTempTable("reviews_features")

In [27]:
se.sql("""
SELECT *
FROM reviews_features
LIMIT 5
""").toPandas()

Unnamed: 0,id,f_review_scores_rating,f_review_scores_accuracy,f_review_scores_cleanliness,f_review_scores_checkin,f_review_scores_communication,f_review_scores_location,f_review_scores_value,f_reviews_per_month,f_number_of_reviews
0,761378,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,10600490,100.0,10.0,10.0,10.0,10.0,10.0,6.0,0.16,2.0
2,7490732,97.0,10.0,10.0,10.0,10.0,10.0,10.0,2.13,42.0
3,15097313,96.0,10.0,10.0,10.0,10.0,10.0,10.0,1.45,6.0
4,6987332,60.0,8.0,6.0,10.0,10.0,10.0,8.0,0.05,1.0


In [28]:
data.limit(4).toPandas()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_communication,review_scores_location,review_scores_value,license,jurisdiction_names,cancellation_policy,calculated_host_listings_count,reviews_per_month,geolocation,features
0,761378,https://www.airbnb.com/rooms/761378,20170404145355,2017-04-06,1 private bedroom (25m²) in a parisian appartment,,"Location: Paris Arrondissement 8, Paris, Ile-d...","Location: Paris Arrondissement 8, Paris, Ile-d...",none,,...,,,,,Paris,flexible,1,,"48.87822120123633, 2.323067504549647","Host Has Profile Pic,Is Location Exact"
1,10600490,https://www.airbnb.com/rooms/10600490,20170404145355,2017-04-06,"Near to Champs Elysées , 200 m²","A few meters from Champs Elysees, very nice re...",,"A few meters from Champs Elysees, very nice re...",none,,...,10.0,10.0,6.0,,Paris,strict,1,0.16,"48.872101253444285, 2.3094301914142195","Host Has Profile Pic,Is Location Exact,Instant..."
2,7490732,https://www.airbnb.com/rooms/7490732,20170404145355,2017-04-05,Cosy appartment at Champs-Elysees,One bedroom 50m2 cosy appartment at Champs Ely...,A 50m² appartment on the 6th floor (with a lif...,One bedroom 50m2 cosy appartment at Champs Ely...,none,The famous Champs Elysees avenue is just aroun...,...,10.0,10.0,10.0,,Paris,strict,1,2.13,"48.869850206500956, 2.310182986403705","Host Is Superhost,Host Has Profile Pic,Host Id..."
3,15097313,https://www.airbnb.com/rooms/15097313,20170404145355,2017-04-05,Quiet apartment in the Heart of Paris,"L'appartement (70 m²) est au cœur de Paris, à ...",Un appartement refait à neuf très récemment. I...,"L'appartement (70 m²) est au cœur de Paris, à ...",none,"A 2 pas des Champs Elysées, où bat le cœur de ...",...,10.0,10.0,10.0,,Paris,moderate,1,1.45,"48.86832315709884, 2.3028395028763975","Host Has Profile Pic,Is Location Exact,Instant..."


Есть еще целая пачка хороший признаков в том числе и категориальных про саму квартиру. Закодируем их.

In [29]:
se.sql("""
SELECT distinct(property_type)
FROM airbnb
""").toPandas()

Unnamed: 0,property_type
0,Heritage hotel (India)
1,Apartment
2,Townhouse
3,Bed & Breakfast
4,Earth House
5,Pension (Korea)
6,Guest suite
7,Timeshare
8,Hut
9,


In [30]:
se.sql("""
SELECT distinct(room_type)
FROM airbnb
""").toPandas()

Unnamed: 0,room_type
0,Shared room
1,
2,Entire home/apt
3,9
4,Private room


In [31]:
se.sql("""
SELECT distinct(bed_type)
FROM airbnb
""").toPandas()

Unnamed: 0,bed_type
0,
1,Airbed
2,Futon
3,Pull-out Sofa
4,Couch
5,9
6,Real Bed


In [32]:
se.sql("""
    SELECT distinct(explode(split(amenities, ',')))
    FROM airbnb
""").toPandas()

Unnamed: 0,col
0,Lock on Bedroom Door
1,Indoor fireplace
2,Wheelchair accessible
3,Private bathroom
4,Private living room
...,...
130,Baby bath
131,Elevator in building
132,Free parking on premises
133,24-hour check-in


#### Программно создаваемые запросы через DataFrame API

Чтобы более гибко контролировать запросы, можно использовать не тестовый способ запуска SQL, а программный через DataFrame API.

Подробнее про то, как составлять такие запросы и какие готовые фукнции уже существуют - на официальном сайте https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html

In [9]:
from pyspark.sql import functions as F

In [34]:
data.printSchema()

root
 |-- id: string (nullable = true)
 |-- listing_url: string (nullable = true)
 |-- scrape_id: string (nullable = true)
 |-- last_scraped: string (nullable = true)
 |-- name: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- space: string (nullable = true)
 |-- description: string (nullable = true)
 |-- experiences_offered: string (nullable = true)
 |-- neighborhood_overview: string (nullable = true)
 |-- notes: string (nullable = true)
 |-- transit: string (nullable = true)
 |-- access: string (nullable = true)
 |-- interaction: string (nullable = true)
 |-- house_rules: string (nullable = true)
 |-- thumbnail_url: string (nullable = true)
 |-- medium_url: string (nullable = true)
 |-- picture_url: string (nullable = true)
 |-- xl_picture_url: string (nullable = true)
 |-- host_id: string (nullable = true)
 |-- host_url: string (nullable = true)
 |-- host_name: string (nullable = true)
 |-- host_since: string (nullable = true)
 |-- host_location: string (nullable

In [35]:
data.select(['id', 'name', 'price']).limit(10).toPandas()

Unnamed: 0,id,name,price
0,761378,1 private bedroom (25m²) in a parisian appartment,89
1,10600490,"Near to Champs Elysées , 200 m²",789
2,7490732,Cosy appartment at Champs-Elysees,100
3,15097313,Quiet apartment in the Heart of Paris,85
4,6987332,Appartement Paris Centre (9ème),75
5,9340539,09-LUXURY LOFT CHAMPS ELYSÉES,199
6,6064355,Bright apartment Paris 8,95
7,13794754,ICONIC LUXURY~3BR/3BATH &BALCONY IN CHAMPS ELY...,650
8,6371225,Chambre dans appartement - Marais,52
9,3495940,Cosy Studio Arts & Metiers Paris,85


In [36]:
data.select(['id', 'name', 'price']).where(F.col('id') == '14916824').limit(10).toPandas()

Unnamed: 0,id,name,price
0,14916824,Chambre privée centre-ville/Private room downtown,25


In [37]:
amenities_c = se.sql("""
    SELECT distinct(explode(split(lower(amenities), ',')))
    FROM airbnb
""").rdd.map(lambda x: x.col).collect()

In [38]:
amenities_c

['refrigerator',
 'step-free access',
 'stove',
 'wide hallway clearance',
 'path to entrance lit at night',
 'ev charger',
 'wide doorway',
 'grab-rails for shower and toilet',
 'pets allowed',
 'cooking basics',
 'heating',
 'lake access',
 'patio or balcony',
 'washer / dryer',
 'wide clearance to shower and toilet',
 'doorman',
 'private living room',
 'game console',
 'long term stays allowed',
 'buzzer/wireless intercom',
 'coffee maker',
 'pocket wifi',
 'oven',
 'tub with shower bench',
 'host greets you',
 'tv',
 'pets live on this property',
 'garden or backyard',
 'crib',
 'carbon monoxide detector',
 'laptop friendly workspace',
 'hair dryer',
 'dishes and silverware',
 'wireless internet',
 'hangers',
 'pool',
 'kitchen',
 'safety card',
 'extra pillows and blankets',
 'fire extinguisher',
 'table corner guards',
 'family/kid friendly',
 'wide clearance to bed',
 'paid parking off premises',
 'indoor fireplace',
 'translation missing: en.hosting_amenity_50',
 'washer',
 'l

`F.when` проверяет условие и если оно верно то выставляет указанное значение, если не верно, то значение указанное в `otherwise`. 

In [39]:
import string

allowed = set(string.ascii_letters + string.digits + " ,")

def slugify(text):
    if not text:
        return ""
    text = "".join([ch for ch in text.lower() if ch in allowed])
    return text.replace(' ', '_')

In [40]:
f_slugify = se.udf.register("slugify", slugify, "string")

In [41]:
exprs = [
    F.when(
        F.array_contains(F.split(f_slugify('amenities'), ','), slugify(amenity)),
        1
    )
    .otherwise(0).alias("f_cat_amenity_" + slugify(amenity)) 
    for amenity in amenities_c
]

In [42]:
data.select('id', *exprs).limit(5).toPandas()

Unnamed: 0,id,f_cat_amenity_refrigerator,f_cat_amenity_stepfree_access,f_cat_amenity_stove,f_cat_amenity_wide_hallway_clearance,f_cat_amenity_path_to_entrance_lit_at_night,f_cat_amenity_ev_charger,f_cat_amenity_wide_doorway,f_cat_amenity_grabrails_for_shower_and_toilet,f_cat_amenity_pets_allowed,...,f_cat_amenity_bbq_grill,f_cat_amenity_dishwasher,f_cat_amenity_smart_lock,f_cat_amenity_babysitter_recommendations,f_cat_amenity_pack_n_playtravel_crib,f_cat_amenity_essentials,f_cat_amenity_beach_essentials,f_cat_amenity_beachfront,f_cat_amenity_accessibleheight_bed,f_cat_amenity_24hour_checkin
0,761378,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,10600490,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
2,7490732,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
3,15097313,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,6987332,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


In [43]:
data.select('id', *exprs).limit(1).collect()

[Row(id='761378', f_cat_amenity_refrigerator=0, f_cat_amenity_stepfree_access=0, f_cat_amenity_stove=0, f_cat_amenity_wide_hallway_clearance=0, f_cat_amenity_path_to_entrance_lit_at_night=0, f_cat_amenity_ev_charger=0, f_cat_amenity_wide_doorway=0, f_cat_amenity_grabrails_for_shower_and_toilet=0, f_cat_amenity_pets_allowed=0, f_cat_amenity_cooking_basics=0, f_cat_amenity_heating=0, f_cat_amenity_lake_access=0, f_cat_amenity_patio_or_balcony=0, f_cat_amenity_washer__dryer=0, f_cat_amenity_wide_clearance_to_shower_and_toilet=0, f_cat_amenity_doorman=0, f_cat_amenity_private_living_room=0, f_cat_amenity_game_console=0, f_cat_amenity_long_term_stays_allowed=0, f_cat_amenity_buzzerwireless_intercom=1, f_cat_amenity_coffee_maker=0, f_cat_amenity_pocket_wifi=0, f_cat_amenity_oven=0, f_cat_amenity_tub_with_shower_bench=0, f_cat_amenity_host_greets_you=0, f_cat_amenity_tv=1, f_cat_amenity_pets_live_on_this_property=0, f_cat_amenity_garden_or_backyard=0, f_cat_amenity_crib=0, f_cat_amenity_carbo

In [44]:
data.select('id', *exprs).registerTempTable("amenity_features")

In [45]:
bed_types = data.select('bed_type').distinct().rdd.map(lambda x: x.bed_type).collect()

In [46]:
bed_types = [x for x in bed_types if x is not None]

In [47]:
exprs = [
    F.when(
        F.col('bed_type') == btype,
        1
    )
    .otherwise(0).alias("f_cat_bad_type_" + slugify(btype))
    for btype in bed_types
]

In [48]:
data.select('id', *exprs).limit(5).toPandas()

Unnamed: 0,id,f_cat_bad_type_airbed,f_cat_bad_type_futon,f_cat_bad_type_pullout_sofa,f_cat_bad_type_couch,f_cat_bad_type_9,f_cat_bad_type_real_bed
0,761378,0,0,0,0,0,1
1,10600490,0,0,0,0,0,1
2,7490732,0,0,0,0,0,1
3,15097313,0,0,0,0,0,1
4,6987332,0,0,0,0,0,1


In [49]:
data.select('id', *exprs).registerTempTable("bed_type_features")

In [50]:
room_types = data.select('room_type').distinct().rdd.map(lambda x: x.room_type).collect()
room_types = [x for x in room_types if x is not None]
exprs = [
    F.when(
        F.col('bed_type') == btype,
        1
    )
    .otherwise(0).alias("f_cat_room_type_" + slugify(btype))
    for btype in room_types
]
data.select('id', *exprs).registerTempTable("room_types_features")

In [51]:
property_types = data.select('property_type').distinct().rdd.map(lambda x: x.property_type).collect()
property_types = [x for x in property_types if x is not None]
exprs = [
    F.when(
        F.col('property_type') == btype,
        1
    )
    .otherwise(0).alias("f_cat_property_type_" + slugify(btype))
    for btype in property_types
]
data.select('id', *exprs).registerTempTable("property_types_features")

In [52]:
app_features_c = se.sql("""
    SELECT distinct(explode(split(lower(features), ',')))
    FROM airbnb
""").rdd.map(lambda x: x.col).collect()

exprs = [
    F.when(
        F.array_contains(F.split(f_slugify('features'), ','), slugify(appf)),
        1
    )
    .otherwise(0).alias("f_cat_app_feature_" + slugify(appf)) 
    for appf in app_features_c
]
data.select('id', *exprs).registerTempTable("app_features_features")

In [53]:
se.sql("""
SELECT id,
       to_number(accommodates) as f_accommodates,
       to_number(bathrooms) as f_bathrooms,
       to_number(bedrooms) as f_bedrooms,
       to_number(beds) as f_beds,
       to_number(guests_included) as f_guests_included,
       to_number(cleaning_fee) as f_cleaning_fee,
       to_number(square_feet) as f_square_feet,
       to_number(extra_people) as f_extra_people
FROM airbnb
""").registerTempTable("accommodation_features")

In [54]:
se.sql("""
SELECT *
FROM accommodation_features
LIMIT 5
""").toPandas()

Unnamed: 0,id,f_accommodates,f_bathrooms,f_bedrooms,f_beds,f_guests_included,f_cleaning_fee,f_square_feet,f_extra_people
0,761378,2.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0
1,10600490,6.0,2.0,3.0,4.0,1.0,0.0,0.0,0.0
2,7490732,2.0,1.5,1.0,1.0,2.0,40.0,0.0,0.0
3,15097313,3.0,2.0,1.0,2.0,1.0,70.0,0.0,0.0
4,6987332,4.0,1.0,2.0,2.0,1.0,0.0,0.0,0.0


In [55]:
data.limit(4).toPandas()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_communication,review_scores_location,review_scores_value,license,jurisdiction_names,cancellation_policy,calculated_host_listings_count,reviews_per_month,geolocation,features
0,761378,https://www.airbnb.com/rooms/761378,20170404145355,2017-04-06,1 private bedroom (25m²) in a parisian appartment,,"Location: Paris Arrondissement 8, Paris, Ile-d...","Location: Paris Arrondissement 8, Paris, Ile-d...",none,,...,,,,,Paris,flexible,1,,"48.87822120123633, 2.323067504549647","Host Has Profile Pic,Is Location Exact"
1,10600490,https://www.airbnb.com/rooms/10600490,20170404145355,2017-04-06,"Near to Champs Elysées , 200 m²","A few meters from Champs Elysees, very nice re...",,"A few meters from Champs Elysees, very nice re...",none,,...,10.0,10.0,6.0,,Paris,strict,1,0.16,"48.872101253444285, 2.3094301914142195","Host Has Profile Pic,Is Location Exact,Instant..."
2,7490732,https://www.airbnb.com/rooms/7490732,20170404145355,2017-04-05,Cosy appartment at Champs-Elysees,One bedroom 50m2 cosy appartment at Champs Ely...,A 50m² appartment on the 6th floor (with a lif...,One bedroom 50m2 cosy appartment at Champs Ely...,none,The famous Champs Elysees avenue is just aroun...,...,10.0,10.0,10.0,,Paris,strict,1,2.13,"48.869850206500956, 2.310182986403705","Host Is Superhost,Host Has Profile Pic,Host Id..."
3,15097313,https://www.airbnb.com/rooms/15097313,20170404145355,2017-04-05,Quiet apartment in the Heart of Paris,"L'appartement (70 m²) est au cœur de Paris, à ...",Un appartement refait à neuf très récemment. I...,"L'appartement (70 m²) est au cœur de Paris, à ...",none,"A 2 pas des Champs Elysées, où bat le cœur de ...",...,10.0,10.0,10.0,,Paris,moderate,1,1.45,"48.86832315709884, 2.3028395028763975","Host Has Profile Pic,Is Location Exact,Instant..."


In [56]:
location_c = data.select('country_code').distinct().rdd.map(lambda x: x.country_code).collect()
location_c = {slugify(x) for x in location_c if x is not None}
exprs = [
    F.when(
        F.col('country_code') == x,
        1
    )
    .otherwise(0).alias("f_cat_country_" + x)
    for x in location_c
]
data.select('id', *exprs).registerTempTable("location_features")

In [57]:
data.select('id', *exprs).columns

['id',
 'f_cat_country_va',
 'f_cat_country_cn',
 'f_cat_country_fr',
 'f_cat_country_gb',
 'f_cat_country_ca',
 'f_cat_country_it',
 'f_cat_country_dk',
 'f_cat_country_vu',
 'f_cat_country_au',
 'f_cat_country_uy',
 'f_cat_country_nl',
 'f_cat_country_at',
 'f_cat_country_de',
 'f_cat_country_hk',
 'f_cat_country_gr',
 'f_cat_country_ch',
 'f_cat_country_us',
 'f_cat_country_ie',
 'f_cat_country_es',
 'f_cat_country_be']

In [58]:
bias = data.select('id', F.when(F.col('id') == F.col('id'), 1).alias('f_bias'))

In [59]:
bias.limit(10).toPandas()

Unnamed: 0,id,f_bias
0,761378,1
1,10600490,1
2,7490732,1
3,15097313,1
4,6987332,1
5,9340539,1
6,6064355,1
7,13794754,1
8,6371225,1
9,3495940,1


In [60]:
bias.registerTempTable("bias")

Чтож, для нашего примера должно быть достаточно. Соберем итоговый датасет.

In [64]:
datadet_df = se.sql("""
SELECT to_number(price) as target, *
FROM
    airbnb a
    join hosts_features hf on hf.id = a.id
    join reviews_features rf on rf.id = a.id
    join amenity_features af on af.id = a.id
    join bed_type_features btf on btf.id = a.id
    join room_types_features rtf on rtf.id = a.id
    join property_types_features ptf on ptf.id = a.id
    join accommodation_features accf on accf.id = a.id
    join location_features lf on lf.id = a.id
    join app_features_features aff on aff.id = a.id
    join bias b on b.id = a.id
WHERE 
    to_number(price) > 0
""")

In [65]:
datadet_df.count()

356437

In [66]:
cols = datadet_df.columns
non_features_c = [
    c for c in cols
    if not (c == 'target' or c.startswith('f_'))
]

In [67]:
non_features_c[:10]

['id',
 'listing_url',
 'scrape_id',
 'last_scraped',
 'name',
 'summary',
 'space',
 'description',
 'experiences_offered',
 'neighborhood_overview']

In [70]:
se.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

In [75]:
datadet_df.drop(*non_features_c).write.parquet("/user/airbnb/parquet/dataset.parquet")

In [76]:
! hdfs dfs -ls /user/airbnb/parquet

Found 2 items
drwxr-xr-x   - ubuntu hadoop          0 2023-01-14 18:01 /user/airbnb/parquet/data.parquet
drwxr-xr-x   - ubuntu hadoop          0 2023-01-14 19:43 /user/airbnb/parquet/dataset.parquet


In [6]:
dataset_fd = se.read.parquet('/user/airbnb/parquet/dataset.parquet')

In [10]:
dataset_fd.select(F.mean('f_host_for').alias('f_host_for_mean'), F.stddev('f_host_for').alias('f_host_for_dev')).limit(5).toPandas()

Unnamed: 0,f_host_for_mean,f_host_for_dev
0,3012.319323,565.931719


In [11]:
f_real_cols = [f for f in dataset_fd.columns if f.startswith('f') and not f.startswith('f_cat') and f != 'f_bias']
f_cat_cols = [f for f in dataset_fd.columns if f.startswith('f') and (f.startswith('f_cat') or f == 'f_bias')]

In [13]:
exps_mean = [
    F.mean(c).alias('{}_mean'.format(c))
    for c in f_real_cols
]

exps_dev = [
    F.stddev(c).alias('{}_dev'.format(c))
    for c in f_real_cols
]

norm_f = dataset_fd.select(*exps_mean, *exps_dev).rdd.take(1)

In [14]:
norm_dict = norm_f[0].asDict()

In [16]:
exps = [
    (
        (F.col(c) - norm_dict["{}_mean".format(c)]) / (1 + norm_dict["{}_dev".format(c)])
    ).alias(c)
    for c in f_real_cols
]

In [17]:
dataset_fd.select("target", *f_cat_cols, *exps).limit(5).toPandas()

Unnamed: 0,target,f_cat_amenity_refrigerator,f_cat_amenity_stepfree_access,f_cat_amenity_stove,f_cat_amenity_wide_hallway_clearance,f_cat_amenity_path_to_entrance_lit_at_night,f_cat_amenity_ev_charger,f_cat_amenity_wide_doorway,f_cat_amenity_grabrails_for_shower_and_toilet,f_cat_amenity_pets_allowed,f_cat_amenity_cooking_basics,f_cat_amenity_heating,f_cat_amenity_lake_access,f_cat_amenity_patio_or_balcony,f_cat_amenity_washer__dryer,f_cat_amenity_wide_clearance_to_shower_and_toilet,f_cat_amenity_doorman,f_cat_amenity_private_living_room,f_cat_amenity_game_console,f_cat_amenity_long_term_stays_allowed,f_cat_amenity_buzzerwireless_intercom,f_cat_amenity_coffee_maker,f_cat_amenity_pocket_wifi,f_cat_amenity_oven,f_cat_amenity_tub_with_shower_bench,f_cat_amenity_host_greets_you,f_cat_amenity_tv,f_cat_amenity_pets_live_on_this_property,f_cat_amenity_garden_or_backyard,f_cat_amenity_crib,f_cat_amenity_carbon_monoxide_detector,f_cat_amenity_laptop_friendly_workspace,f_cat_amenity_hair_dryer,f_cat_amenity_dishes_and_silverware,f_cat_amenity_wireless_internet,f_cat_amenity_hangers,f_cat_amenity_pool,f_cat_amenity_kitchen,f_cat_amenity_safety_card,f_cat_amenity_extra_pillows_and_blankets,f_cat_amenity_fire_extinguisher,f_cat_amenity_table_corner_guards,f_cat_amenity_familykid_friendly,f_cat_amenity_wide_clearance_to_bed,f_cat_amenity_paid_parking_off_premises,f_cat_amenity_indoor_fireplace,f_cat_amenity_translation_missing_enhostingamenity50,f_cat_amenity_washer,f_cat_amenity_lockbox,f_cat_amenity_gym,f_cat_amenity_cable_tv,f_cat_amenity_keypad,f_cat_amenity_waterfront,f_cat_amenity_bed_linens,f_cat_amenity_accessibleheight_toilet,f_cat_amenity_hot_tub,f_cat_amenity_dogs,f_cat_amenity_elevator_in_building,f_cat_amenity_wheelchair_accessible,f_cat_amenity_other_pets,f_cat_amenity_cats,f_cat_amenity_iron,f_cat_amenity_fireplace_guards,f_cat_amenity_changing_table,f_cat_amenity_suitable_for_events,f_cat_amenity_first_aid_kit,f_cat_amenity_ethernet_connection,f_cat_amenity_self_checkin,f_cat_amenity_flat_smooth_pathway_to_front_door,f_cat_amenity_internet,f_cat_amenity_window_guards,f_cat_amenity_lock_on_bedroom_door,f_cat_amenity_breakfast,f_cat_amenity_childrens_books_and_toys,f_cat_amenity_childrens_dinnerware,f_cat_amenity_firm_matress,f_cat_amenity_baby_bath,f_cat_amenity_doorman_entry,f_cat_amenity_microwave,f_cat_amenity_dryer,f_cat_amenity_free_parking_on_street,f_cat_amenity_9,f_cat_amenity_private_bathroom,f_cat_amenity_smartlock,f_cat_amenity_shampoo,f_cat_amenity_rollin_shower_with_shower_bench_or_chair,f_cat_amenity_high_chair,f_cat_amenity_hot_water,f_cat_amenity_firm_mattress,f_cat_amenity_free_parking_on_premises,f_cat_amenity_smoking_allowed,f_cat_amenity_single_level_home,f_cat_amenity_roomdarkening_shades,f_cat_amenity_private_entrance,f_cat_amenity_outlet_covers,f_cat_amenity_cleaning_before_checkout,f_cat_amenity_disabled_parking_spot,f_cat_amenity_baby_monitor,f_cat_amenity_bathtub,f_cat_amenity_smoke_detector,f_cat_amenity_air_conditioning,f_cat_amenity_translation_missing_enhostingamenity49,f_cat_amenity_luggage_dropoff_allowed,f_cat_amenity_stair_gates,f_cat_amenity_bbq_grill,f_cat_amenity_dishwasher,f_cat_amenity_smart_lock,f_cat_amenity_babysitter_recommendations,f_cat_amenity_pack_n_playtravel_crib,f_cat_amenity_essentials,f_cat_amenity_beach_essentials,f_cat_amenity_beachfront,f_cat_amenity_accessibleheight_bed,f_cat_amenity_24hour_checkin,f_cat_bad_type_airbed,f_cat_bad_type_futon,f_cat_bad_type_pullout_sofa,f_cat_bad_type_couch,f_cat_bad_type_9,f_cat_bad_type_real_bed,f_cat_room_type_shared_room,f_cat_room_type_entire_homeapt,f_cat_room_type_9,f_cat_room_type_private_room,f_cat_property_type_heritage_hotel_india,f_cat_property_type_apartment,f_cat_property_type_townhouse,f_cat_property_type_bed__breakfast,f_cat_property_type_earth_house,f_cat_property_type_pension_korea,f_cat_property_type_guest_suite,f_cat_property_type_timeshare,f_cat_property_type_hut,f_cat_property_type_camperrv,f_cat_property_type_boutique_hotel,f_cat_property_type_castle,f_cat_property_type_loft,f_cat_property_type_guesthouse,f_cat_property_type_hostel,f_cat_property_type_lighthouse,f_cat_property_type_cave,f_cat_property_type_villa,f_cat_property_type_ryokan_japan,f_cat_property_type_car,f_cat_property_type_entire_floor,f_cat_property_type_other,f_cat_property_type_serviced_apartment,f_cat_property_type_treehouse,f_cat_property_type_inlaw,f_cat_property_type_nature_lodge,f_cat_property_type_dorm,f_cat_property_type_igloo,f_cat_property_type_condominium,f_cat_property_type_house,f_cat_property_type_chalet,f_cat_property_type_yurt,f_cat_property_type_tipi,f_cat_property_type_parking_space,f_cat_property_type_island,f_cat_property_type_tent,f_cat_property_type_train,f_cat_property_type_boat,f_cat_property_type_vacation_home,f_cat_property_type_20170402,f_cat_property_type_casa_particular,f_cat_property_type_bungalow,f_cat_property_type_plane,f_cat_property_type_cabin,f_cat_country_va,f_cat_country_cn,f_cat_country_fr,f_cat_country_gb,f_cat_country_ca,f_cat_country_it,f_cat_country_dk,f_cat_country_vu,f_cat_country_au,f_cat_country_uy,f_cat_country_nl,f_cat_country_at,f_cat_country_de,f_cat_country_hk,f_cat_country_gr,f_cat_country_ch,f_cat_country_us,f_cat_country_ie,f_cat_country_es,f_cat_country_be,f_cat_app_feature_host_has_profile_pic,f_cat_app_feature_host_identity_verified,f_cat_app_feature_instant_bookable,f_cat_app_feature_host_is_superhost,f_cat_app_feature_requires_license,f_cat_app_feature_require_guest_profile_picture,f_cat_app_feature_require_guest_phone_verification,f_cat_app_feature_is_location_exact,f_bias,f_host_for,f_host_response_rate,f_host_acceptance_rate,f_host_total_listings_count,f_num_of_ver,f_host_listings_count,f_review_scores_rating,f_review_scores_accuracy,f_review_scores_cleanliness,f_review_scores_checkin,f_review_scores_communication,f_review_scores_location,f_review_scores_value,f_reviews_per_month,f_number_of_reviews,f_accommodates,f_bathrooms,f_bedrooms,f_beds,f_guests_included,f_cleaning_fee,f_square_feet,f_extra_people
0,450.0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,1,0.907513,0.705464,0.0,-0.08824,0.877686,-0.08824,0.654941,0.598025,0.445704,0.562951,0.561807,0.415295,0.438672,-0.113266,-0.032952,-0.407954,-0.144854,-0.184156,-0.361613,-0.228716,-0.6095,-0.043745,-0.381889
1,65.0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,1,0,1,1,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,1,-0.267232,-1.524437,0.0,-0.12204,-1.271305,-0.12204,-1.53855,-1.261773,-1.249272,-1.284841,-1.287762,-1.276485,-1.268037,-0.401075,-0.483396,-0.407954,-0.144854,-0.184156,-0.361613,-0.228716,-0.6095,-0.043745,-0.381889
2,300.0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,1,0.366002,-1.524437,0.0,-0.12204,-0.411709,-0.12204,-1.53855,-1.261773,-1.249272,-1.284841,-1.287762,-1.276485,-1.268037,-0.401075,-0.483396,0.258448,0.494909,0.85872,0.891621,-0.228716,-0.6095,-0.043745,-0.381889
3,199.0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,1,0,1,0,0,0,0,1,0,0,0,0,1,0,1,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,1,-0.290162,0.660866,0.0,25.160765,0.01809,25.160765,-1.53855,-1.261773,-1.249272,-1.284841,-1.287762,-1.276485,-1.268037,-0.401075,-0.483396,0.591649,0.494909,0.337282,0.056132,-0.228716,1.123896,-0.043745,-0.381889
4,80.0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,1,0,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,1,0.440085,0.705464,0.0,-0.12204,-0.411709,-0.12204,-1.53855,-1.261773,-1.249272,-1.284841,-1.287762,-1.276485,-1.268037,-0.401075,-0.483396,0.258448,-0.144854,-0.184156,0.056132,-0.228716,-0.6095,-0.043745,-0.381889


In [18]:
train_df, test_df = dataset_fd.select("target", *f_cat_cols, *exps).randomSplit([0.8, 0.2], 432)

In [19]:
train = train_df.rdd.cache()
test = test_df.rdd.cache()

In [20]:
example = train.first()

In [21]:
example

Row(target=1.0, f_cat_amenity_refrigerator=0, f_cat_amenity_stepfree_access=0, f_cat_amenity_stove=0, f_cat_amenity_wide_hallway_clearance=0, f_cat_amenity_path_to_entrance_lit_at_night=0, f_cat_amenity_ev_charger=0, f_cat_amenity_wide_doorway=0, f_cat_amenity_grabrails_for_shower_and_toilet=0, f_cat_amenity_pets_allowed=1, f_cat_amenity_cooking_basics=0, f_cat_amenity_heating=0, f_cat_amenity_lake_access=0, f_cat_amenity_patio_or_balcony=0, f_cat_amenity_washer__dryer=0, f_cat_amenity_wide_clearance_to_shower_and_toilet=0, f_cat_amenity_doorman=0, f_cat_amenity_private_living_room=0, f_cat_amenity_game_console=0, f_cat_amenity_long_term_stays_allowed=0, f_cat_amenity_buzzerwireless_intercom=0, f_cat_amenity_coffee_maker=0, f_cat_amenity_pocket_wifi=0, f_cat_amenity_oven=0, f_cat_amenity_tub_with_shower_bench=0, f_cat_amenity_host_greets_you=0, f_cat_amenity_tv=0, f_cat_amenity_pets_live_on_this_property=0, f_cat_amenity_garden_or_backyard=0, f_cat_amenity_crib=0, f_cat_amenity_carbon_

In [22]:
features_num = len(example.asDict()) - 1

In [23]:
features_num

219

In [24]:
import numpy as np
from functools import partial

#### Broadcasts & Accumulators

Обучать будем обычную линейную регрессию

In [25]:
def compute_gradient(weights_broadcast, loss, example):
    # достаем целевую переменную и признаки из наблюдения
    gradient = np.zeros(len(weights_broadcast.value))
    data = example.asDict()
    
    y = data['target']
    data.pop('target')

    # признаки сортируем по названию для того, чтобы позиции точно не разъезались
    x = np.array([v or 0 for k, v in sorted(data.items(), key=lambda x: x[0])])

    # делаем предсказание с текущими весами
    prediction = x.dot(weights_broadcast.value)

    # считаем градиент на объекте
    gradient = x * (prediction - y) * 2
    
    # считаем потери
    loss.add((prediction - y) ** 2)
    
    return gradient

In [26]:

# Параметры
epochs = 20
l2_lambda = 0.01
l1_lambda = 0.01

np.random.seed(42201)

# Изначальные веса инициализируем случайно
weights = np.random.random(features_num)

N = train.count()

# Цикл по эпохам
for i in range(epochs):
    weights_broadcast = sc.broadcast(weights)  # Эту переменную будет бродкастить на всех воркеров
    loss = sc.accumulator(0.0) # В эту переменную будет частичные лоссы на объектах
    
    # Считаем средний градиент
    gradient = (
        train
        .map(partial(compute_gradient, weights_broadcast, loss))
        .mean()
    )
    
    gradient += 2 * l2_lambda * weights  # L2 регуляризация
    gradient += l1_lambda * np.sign(weights) # L1 регуляризация
    
    learning_rate = 1 / (10 + i)
    
    weights -= learning_rate * gradient
    weights_broadcast.destroy()
    
    print("epoch:", i, "loss:", loss.value / N)

epoch: 0 loss: 39050.76561167529
epoch: 1 loss: 58635.00059953672
epoch: 2 loss: 87842.85564740805
epoch: 3 loss: 107631.84648126252
epoch: 4 loss: 102506.42292118471
epoch: 5 loss: 75564.15227470099
epoch: 6 loss: 45770.29601907248
epoch: 7 loss: 26332.00466094047
epoch: 8 loss: 18094.909810619843
epoch: 9 loss: 15650.381284590347
epoch: 10 loss: 15134.347868910017
epoch: 11 loss: 15016.615026575377
epoch: 12 loss: 14964.685466258139
epoch: 13 loss: 14923.609671540673
epoch: 14 loss: 14888.382114634098
epoch: 15 loss: 14857.464665680298
epoch: 16 loss: 14830.04404165144
epoch: 17 loss: 14805.491232427556
epoch: 18 loss: 14783.329652838325
epoch: 19 loss: 14763.187994003372


In [27]:
weights

array([ 2.13498808e+01,  9.99790051e+00,  1.76983052e+01,  1.20125372e+01,
        1.76447001e+01, -2.94708651e+00,  9.51532007e-01,  6.32847066e-01,
        2.60437351e-01,  6.83234316e+00,  9.22838270e-01,  5.12362716e-02,
        1.79449840e-01,  3.99086374e-01,  7.88091238e-01, -2.50470642e-05,
        3.93356898e-01,  5.02336515e-01, -2.30211639e+00,  6.42306244e+00,
        8.27720939e+00, -3.77762674e+00, -7.34182534e-01,  1.26219709e+00,
        8.20493428e-01,  7.89428339e-01,  9.25506129e-01,  9.16597603e-01,
        1.00629468e+00,  1.77254528e-01,  2.87292503e-01,  9.88629077e-01,
        1.01767545e+00,  1.72209863e-02,  2.52901199e+00, -7.23569698e-02,
        8.86850260e+00,  1.15715925e+00,  8.00017394e+00,  8.65537471e-01,
        2.50109087e-01,  8.90924026e-01,  1.15032919e+01,  1.30691557e+00,
        2.74303112e-01,  6.93149184e-01,  1.86223212e-01, -2.64808370e+00,
        8.79509952e-01,  5.39279115e+00,  4.19614413e-02,  8.37308530e-01,
        7.37299625e-01,  

In [45]:
features_names = sorted(train.first().asDict().keys())
features_names.remove('target')

top_10_f = sorted(
    zip(weights.tolist(), features_names),
    key=lambda x: -abs(x[0])
)[:10]

In [47]:
import json

In [48]:
print(json.dumps(top_10_f, indent=2))

[
  [
    57.21647738622203,
    "f_cleaning_fee"
  ],
  [
    21.349880827617387,
    "f_accommodates"
  ],
  [
    21.158322069654044,
    "f_extra_people"
  ],
  [
    17.6983052347302,
    "f_bedrooms"
  ],
  [
    17.64470012058097,
    "f_bias"
  ],
  [
    17.590923198790946,
    "f_cat_bad_type_real_bed"
  ],
  [
    17.38004386910084,
    "f_cat_app_feature_host_has_profile_pic"
  ],
  [
    12.167257548196062,
    "f_cat_amenity_tv"
  ],
  [
    12.012537211780298,
    "f_beds"
  ],
  [
    11.503291944412933,
    "f_cat_amenity_familykid_friendly"
  ]
]


In [49]:
def calc_ss_res(weights_broadcast, example):
    # достаем целевую переменную и признаки из наблюдения
    gradient = np.zeros(len(weights_broadcast.value))
    data = example.asDict()
    
    y = data['target']
    data.pop('target')

    # признаки сортируем по названию для того, чтобы позиции точно не разъезались
    x = np.array([v or 0 for k, v in sorted(data.items(), key=lambda x: x[0])])

    # делаем предсказание с текущими весами
    prediction = x.dot(weights_broadcast.value)
    return (y - prediction) ** 2

In [50]:
weights_broadcast = sc.broadcast(weights)

y_avg = train.map(lambda x: x.target).mean()
ss_tot = train.map(lambda x: (x.target - y_avg) ** 2).sum()
ss_res = train.map(partial(calc_ss_res, weights_broadcast)).sum()

r2_score = 1 - ss_res / ss_tot
print(r2_score)

0.36023435565888606


In [51]:
weights_broadcast = sc.broadcast(weights)

y_avg = test.map(lambda x: x.target).mean()
ss_tot = test.map(lambda x: (x.target - y_avg) ** 2).sum()
ss_res = test.map(partial(calc_ss_res, weights_broadcast)).sum()

r2_score = 1 - ss_res / ss_tot
print(r2_score)

0.35992947703086775


In [53]:
weights_broadcast = sc.broadcast(weights)
rmse = train.map(partial(calc_ss_res, weights_broadcast)).mean() ** 0.5

print(rmse)

121.42800970659499


Это конечно не самый впечетляющий результат, но все таки уже что-то! Это сделанная собственными руками модель, которую мы можем обучить на произвольно большом датасете!

### Вы жжете бабло

<img src="https://grizzle.com/wp-content/uploads/2020/01/money-cash-fire-1200x900.png" width="300">

Напоминаю, выключайте ресурсы. Они жрут ваши деньги.