Skip to content

Latest commit

 

History

History
168 lines (98 loc) · 7.3 KB

20210518_02.md

File metadata and controls

168 lines (98 loc) · 7.3 KB

结合PostgreSQL, MADlib, Tensorflow 实现机器学习是时序分析. 使用本地数据, 不需要move data.

作者

digoal

日期

2021-05-18

标签

PostgreSQL , tensorflow , plpython , madlib


背景

结合PostgreSQL, MADlib, Tensorflow, plpythonu编程接口, 实现机器学习和时序分析.

优点: 效率最高. 使用本地数据, 不需要move data.

参考文档:

integrate machine learning into our databases and showed examples using Apache MADlib and Tensorflow to analyse the Boston Housing Dataset.

https://www.enterprisedb.com/blog/postgresql-and-machine-learning

In this blog mini-series, we'll take a deeper dive into the code I wrote to perform this analysis using Tensorflow. Whilst MADlib is certainly useful for building intelligent analytics into your databases, use of Tensorflow is arguably much more interesting as we can easily build whatever we want using pl/python3 as we'll have access to all the nuts and bolts of Tensorflow (or PyTorch or scikit-learn etc.) as well as virtually the entire Python package ecosystem which includes incredibly handy libraries such as Pandas and Numpy.

https://www.enterprisedb.com/blog/regression-analysis-postgresql-tensorflow-part-1-getting-started

https://www.enterprisedb.com/blog/regression-analysis-postgresql-tensorflow-part-2-data-pre-processing

https://www.enterprisedb.com/blog/regression-analysis-postgresql-tensorflow-part-3-data-analysis

参考代码

https://github.com/dpage/ml-experiments/tree/main/regression

Regression

We can use Tensorflow from within PostgreSQL to perform regression tasks for
modelling and prediction. Essentially we're using a neural network to learn the
relationship between inputs and outputs; for example, given:

x = y [some operation] z

we're modelling [some operation], essentially approximating the formula.

Note that whilst in some cases there may be a specific, defined relationship
between the inputs and outputs (e.g. x2 = y2 +
z2 - Pythagoras' theorem), in other cases there may not be. These
are typically the cases that interest us as they allow us to analyse data for
business intelligence purposes. A good example is predicting the price of a
house based on factors such as location, number of bedrooms and reception rooms,
type of build etc.

PostgreSQL

Tensorflow

We need to configure PostgreSQL in order to run Tensorflow. This consists of a
couple of steps:

  1. Install pl/plython3 in your PostgreSQL database, e.g:

    CREATE EXTENSION plpython3u;    
    
  2. Install Tensorflow (and any other required modules) in the Python environment
    used by the PostgreSQL server. In my case, that's the EDB LanguagePack on
    macOS:

    % sudo /Library/edb/languagepack/v1/Python-3.7/bin/pip3 install tensorflow numpy    

It should then be possible to create pl/python3 functions in PostgreSQL.

Apache Madlib

Some examples use Apache MADLib instead of Tensorflow. See the documentation on
the MADLib Confluence page.

Scripts

There are various SQL scripts in this directory that represent the various
experiments I've worked on. Most create:

  • A table to store training inputs and outputs.
  • A function to generate data for the training table.
  • A function to train a model and make a prediction based on that model.

Obviously in real-world applications the model creation and prediction functions
would probably be separated, and training data would likely come from existing
tables/views.

Note: All files written by the scripts will be owned by the user account
under which PostgreSQL is running, and unless a full path is given, will be
written relative to the data directory. I've used /Users/Shared/tf as the
working directory; you may want to change that.

ohms_law.sql

This attempts to teach a network a basic operation based on Ohms Law;

voltage (v) = i (current) * r (resistance)

It's worth noting that the results of this model are terrible, so don't try
to use it as the basis for anything else. At the time of writing I haven't yet
figured out why this is the case, though I have some hunches.

pythagoras.sql

This attempts to teach a network Pythagoras' Theorem:

x2 = y2 + z2

The square of the length of the hypotenuse of a right angled triangle is the
sum of the square of the other sides.

random1.sql

This attempts to teach a network a completely fictitious equation with five
input variables (in pl/pgsql):

z := cbrt((a * b) / (sin(c) * sqrt(d)) + (e * e * e));

tf-housing.sql

This is based on the well known Boston Housing dataset.
The SQL file contains the definition for a table to hold the data (loading it
is an exercise for the reader) and a function for training, testing and using
a model. This function differs from other in that Pandas data frames are used
in place of Numpy, and an attempt is made to remove rows container outliers
from the dataset before training, in order to increase accuracy of results.

ml-housing.sql

This script implements the same regression analysis as tf-housing.sql, except
that it uses Apache MADLib instead of Tensorflow.

您的愿望将传达给PG kernel hacker、数据库厂商等, 帮助提高数据库产品质量和功能, 说不定下一个PG版本就有您提出的功能点. 针对非常好的提议,奖励限量版PG文化衫、纪念品、贴纸、PG热门书籍等,奖品丰富,快来许愿。开不开森.

digoal's wechat