% Tutorial of Frovedis Python Interface
This document is a tutorial of Frovedis Python interface.
Frovedis is a MPI library that provides
- Matrix library using above API
- Machine learning algorithm library
- Dataframe for preprocessing
The Python interface wraps these functionalities and makes it possible to call them from Python script. Since the library is optimized for SX-Aurora TSUBASA, you can utilize vector architecture without being aware of it. You can use it also on x86 servers.
It is implemented by using a server program. An MPI program with Frovedis functionalities (frovedis_server) is invoked and the Python interpreter communicates with it.
In this tutorial, we assume that Frovedis is installed from rpm. Please follow /opt/nec/frovedis/ getting_started.md. As described in the file, if you want to use frovedis_server on x86, please do:
$ source /opt/nec/frovedis/x86/bin/x86env.sh
If you want to use vector engine (VE), please do:
$ source /opt/nec/frovedis/ve/bin/veenv.sh
Main purpose of the script is to set PYTHONPATH
and LD_LIBRARY_PATH
.
It also switches mpirun
to call (x86 or ve). If you did not source
MPI set up script for VE, veenv.sh
also source it internally.
We tested the wrapper using Python version 2.7 and version 3.6. Python version 2.7 is installed in CentOS/RedHat7 by default; Python version 3.6 can be installed using software collection on CentOS/RedHat7 and installed in CentOS/RedHat8 by default.
Since our wrapper is just a Python library and shared library, you can use tools like virtualenv, Jupyter, etc. together with the wrapper.
In this tutorial, we use python with virtualenv, because using pip for system installed Python is dangerous (virtualenv and pip will be installed together with Frovedis by yum).
First, please create your environment. In the case of Python 2.7:
$ virtualenv frovedis_tutorial
In the case of Python 3:
$ python3 -m venv frovedis_tutorial
If you want to use Python version 3.6 on CentOS/RedHat7, please install it from software collection and enable it as follows:
$ sudo yum install centos-release-scl
$ sudo yum install rh-python36
$ scl enable rh-python36 bash
Then, please activate the environment and install scikit-learn and pandas:
$ source frovedis_tutorial/bin/activate
(frovedis_tutorial) $ pip install scikit-learn pandas
If you want to run the tutorials on jupyter-notebook in the virtual environment, you need to run following:
(frovedis_tutorial) $ pip install jupyter
If you run jupyter notebook server on a server machine and run your browser on a client machine, following setting would need to be added in your ~/.jupyter/jupyter_notebook_config.py
c = get_config()
c.NotebookApp.ip = '0.0.0.0'
c.NotebookApp.open_browser = False
c.NotebookApp.notebook_dir = '/path/to/save/notebook'
Then, you can run
(frovedis_tutorial) $ jupyter-notebook
and access the server with the token printed by the command.
In addition, please copy the ./src
directory to somewhere you have write
permission, because it will create files.
Please look at "src/tut3/tut.py". It loads "breast cancer" data from scikit-learn, and run logistic regression on the data.
Lines with trailing # frovedis
is specific for Frovedis. Lines with
trailing # sklearn
is for scikit-learn instead.
To use Frovedis, you need to import FrovedisServer:
from frovedis.exrpc.server import FrovedisServer
Then, import LogisticRegression in this case:
from frovedis.mllib.linear_model import LogisticRegression
In the case of scikit-learn, following module is imported instead:
from sklearn.linear_model import LogisticRegression
Before using the logistic regression routine, you need to invoke frovedis_server:
FrovedisServer.initialize("mpirun -np 4 {}".format(os.environ['FROVEDIS_SERVER']))
You need to specify the command to invoke the server as the argument
of initialize. Since the server is an MPI program, mpirun
is used
here. The option -np
is for specifying the number of MPI
processes. Here, 4 processes will be used. You can use multiple
cards (in the case of vector engine) and/or multiple servers by
specifying command line option appropriately.
The last argument of mpirun
is the binary to execute. Here, the path
of the binary is obtained from the environment variable
FROVEDIS_SERVER
, which is set in x86env.sh
or veenv.sh
.
The LogisticRegression call is the same as scikit-learn. Within the call, the data in Python interpreter is sent to frovedis_server and the machine learning algorithm is executed there.
After executing the machine learning algorithm, please shutdown the server:
FrovedisServer.shut_down()
As you can see, what you need to do is changing the importing module and add initialize / shutdown the server.
You can run the sample by
(frovedis_tutorial) $ python tut.py
score: 0.9507908611599297
Even if you change the import to use scikit-learn, it should produce similar result.
In this case, the speed of training of Frovedis is actually slower than scikit-learn. This is because the size of the data is very small (569, 30).
The frovedis server will be terminated when the interpreter exits.
If it is not terminated because of abnormal termination, please kill the
server manually by calling command like pkill mpi
. In the case of
VE, you can check if the server is running or not by
/opt/nec/ve/bin/ps -elf
, for example (or $ VE_NODE_NUMBER=0 /opt/nec/ve/bin/top
, where you can change the VE node number by the
environment variable).
You can also refer to the notebooks installed in
${INSTALLPATH}/doc/notebook
.
At this moment, we support following algorithms (sklearn is link to scikit-learn manual):
linear_model.LogisticRegression
(sklearn)linear_model.LinearRegression
(sklearn)linear_model.Ridge
(sklearn)linear_model.Lasso
(sklearn)linear_model.SGDClassifier
(sklearn)linear_model.SGDRegressor
(sklearn)svm.LinearSVC
(sklearn)svm.LinearSVR
(sklearn)svm.SVC
(sklearn)tree.DecisionTreeClassifier
(sklearn)tree.DecisionTreeRegressor
(sklearn)ensemble.RandomForestClassifier
(sklearn)ensemble.RandomForestRegressor
(sklearn)ensemble.GradientBoostingClassifier
(sklearn)ensemble.GradientBoostingRegressor
(sklearn)neighbors.KNeighborsClassifier
(sklearn)neighbors.KneighborsRegressor
(sklearn)neighbors.NearestNeighbors
(sklearn)naive_bayes.MultinomialNB
(sklearn)naive_bayes.BernoulliNB
(sklearn)cluster.KMeans
(sklearn)cluster.AgglomerativeClustering
(sklearn)cluster.DBSCAN
(sklearn)cluster.SpectralClustering
(sklearn)mixture.GaussianMixture
(sklearn)manifold.SpectralEmbedding
(sklearn)manifold.TSNE
(sklearn)decomposition.TruncatedSVD
(sklearn)decomposition.PCA
(sklearn)decomposition.LatentDirichletAllocation
(sklearn)preprocessing.StandardScaler
(sklearn)
Please add frovedis.mllib.
to import these modules.
(In the case of scikit-learn, sklearn.
is added to import them.)
The interface is almost the same as scikit-learn.
Other than scikit-learn algorithms, we support following algorithms.
frovedis.mllib.fm.FactorizationMachineClassifier
frovedis.mllib.recommendation.ALS
frovedis.mllib.fpm.FPGrowth
frovedis.mllib.feature.Word2Vec
In addition, following graph algorithms are supported. The interface is almost the same as networkx.
frovedis.graph.pagerank
frovedis.graph.connected_components
frovedis.graph.single_source_shortest_path
frovedis.graph.bfs_edges
frovedis.graph.bfs_tree
frovedis.graph.bfs_predecessors
frovedis.graph.bfs_successors
frovedis.graph.descendants_at_distance
You can use both dense and sparse matrix as the input of machine learning just like scikit-learn. It is automatically sent to Frovedis server, and automatically distributed among MPI processes. (SX-Aurora TSUBASA shows much better performance with sparse matrix.)
For more information, please refer to the manual. You can also find other samples in /opt/nec/frovedis/ x86/foreign_if_demo/python/.
As we mentioned, you can use variable of Python side directly as the input of machine learning algorithms that works on Frovedis server. In addition, you can also use the distributed matrix and vector at Frovedis server explicitly, which can be used as input of the machine learning algorithms.
Since you can keep the data at Frovedis server side, you can reduce the communication cost of sending data from Python to the server if you reuse the data.
Please look at "src/tut5-1/tut.py". It creates sparse matrix at the Frovedis server side from scipy csr matrix.
mat = csr_matrix((data, indices, indptr),
dtype=np.float64,
shape=(3, 3))
Here, mat is scipy's csr format of sparse matrix. (in Frovedis, it is
called as crs format.)
Then, FrovedisServer.initialize
is called. This time, -np
is 2.
After that,
fmat = FrovedisCRSMatrix(mat)
creates crs matrix at Frovedis server. To check if it is really
created, debug_print()
is called. It should print like:
matrix:
num_row = 3, num_col = 3
node 0
local_num_row = 2, local_num_col = 3
val : 1 2 3
idx : 0 2 2
off : 0 2 3
node 1
local_num_row = 1, local_num_col = 3
val : 4 5 6
idx : 0 1 2
off : 0 3
It is printed at the server side. It shows that first 2 rows are in the node 0 and third row is in the node 1.
The data at Frovedis server is saved by fmat.save("./result")
.
The contents of this file should look like:
0:1 2:2
2:3
0:4 1:5 2:6
Each item is separated by space, and each row is separated as line. Each item is like "POS:VAL"; POS is 0-based column position. This is the sparse matrix text file format of Frovedis.
The memory of the server side is released when the variable fmat
is
garbage collected. But you can explicitly release it by calling
fmat.release()
.
You can create sparse matrix by loading from a file.
fmat2 = FrovedisCRSMatrix().load_text("./result")
creates a new matrix from the saved data. fmat2.debug_print()
should produce the same output as the above.
In this case, we used text file format, but you can also use binary
file format by using save_binary
and load_binary
. It should be
much faster than text format on vector engine. Please refer to the C++
tutorial for binary format.
The file "src/tut5-2/tut.py" is dense matrix version. In this case,
FrovedisRowmajorMatrix
is created from numpy.matrix
.
You can try FrovedisColmajorMatrix
version that is written as
comment. Here, debug_print()
shows internal data. If you want to see
the data as row major way, use get_rowmajor_view()
instead.
The text format of rowmajor matrix is like:
1 2 3 4
5 6 7 8
8 7 6 5
4 3 2 1
If you use data at Frovedis server side as the input of machine learning algorithms, you need to be aware of the type; for example, LogisticRegression takes FrovedisColmajorMatrix, but does not take FrovedisRowmajorMatrix. Please refer to the manuals for more details
Label of the machine learning algorithms is a vector, and you can also use the distributed vector at Frovedis server explicitly. The file "src/tut5.3/tut.py" shows how to create it.
dv = FrovedisDvector([1,2,3,4,5,6,7,8],dtype=np.float64)
dv.debug_print()
The debug_print()
should print like this:
dvector(size: 8):
1 2 3 4 5 6 7 8
So far, we explained sparse matrix (FrovedisCRSMatrix), dense matrix (FrovedisRowmajorMatrix, FrovedisColmajorMatrix), and distributed vector (FrovedisDvector). We also another kind of distributed dense matrix called FrovedisBlockcyclicMatrix.
FrovedisBlockcyclicMatrix supports distributed matrix operations that is backed by ScaLAPACK/PBLAS. It can be utilized for large scale matrix operations. Please see "src/tut5-4/tut.py". It contains examples of various PBLAS functionalities.
First, input numpy matrices x, y, m, and n are created. Frovedis server side block cyclic matrix can be created like:
bcx = FrovedisBlockcyclicMatrix(x)
In ScaLAPACK/PBLAS, vectors are represented as one dimensional matrix.
First example swaps two vectors by PBLAS.swap(bcx,bcy)
. To check if
they are swapped, you can call debug_print()
of these
variables. However in this example, the blockcyclic matrix is copied
back to Python interpreter and converted to numpy matrix by
to_numpy_matrix()
and printed.
Next example is multiplying by scalar: PBLAS.scal(bcx,2)
. As you
see, PBLAS interface overwrites the original matrix.
PBLAS.axpy(bcx,bcy,2)
does y = ax + y, here a is 2.
PBLAS.copy(bcx,bcy)
copies the matrix (y = x).
PBLAS.dot(bcx,bcy)
calculates dot product of x and y. Here, you can
use numpy matrix x
and y
instead of bcx
and bcy
. In this case,
blockcyclic matrix is created automatically. Other operations like
nrm2
, gemv
, ger
, gemm
, and geadd
also take numpy matrix as
input.
PBLAS.nrm2(bcx)
calculates L2 norm of the vector.
PBLAS.gemv(bcm,bcx)
calculates matrix vector multiplication (m *
x). The result is newly created blockcyclic matrix (vector).
PBLAS.gemm(bcm,bcn)
does matrix-matrix multiplication (m * n).
The result is also newly created blockcyclic matrix.
PBLAS.geadd(bcm,bcn)
does matrix addition like n = m + n.
Lastly, you can explicitly release the blockcyclic matrix by calling
release()
, though they are automatically released when the variable
is garbage collected.
Next, we will explain ScaLAPACK functionalities. Please see "src/tut5-5/tut.py".
This time, FrovedisBlockcyclicMatrix is created by loading from a file.
bcm = FrovedisBlockcyclicMatrix(dtype=np.float64)
bcm.load("./input")
FrovedisBlockcyclicMatrix can be saved by save, and binary format can
also be used by load_binary
and save_binary
.
To save the matrix, it is converted to Python numpy matrix.
m = bcm.to_numpy_matrix()
First example is getrf
, which does LU factorization.
rf = SCALAPACK.getrf(bcm)
The argument matrix is overwritten to factorized matrix. The return value contains pivoting information (ipiv), which is needed to use the factorized matrix later.
Next, by using the factorized matrix, inverse of the matrix is
calculated using getri
.
SCALAPACK.getri(bcm,rf.ipiv())
As mentioned, rf.ipif()
is used as the input of getri
. The result
is overwritten to the argument matrix. The result is printed by
print (bcm.to_numpy_matrix())
. The result would be like:
[[ 2.53333333 -0.36666667 -0.03333333]
[-1.46666667 0.63333333 -0.03333333]
[-0.03333333 -0.13333333 0.03333333]]
You can also use the result of LU factorization for solving the system
of linear equation by using getrs
.
Next example solves the system of linear equation directly using
gesv
.
bcm = FrovedisBlockcyclicMatrix(m)
x = np.matrix([[1],[2],[3]], dtype=np.float64)
bcx = FrovedisBlockcyclicMatrix(x)
SCALAPACK.gesv(bcm,bcx)
The variable bcm
is set again (since it was modified) and bcx
is
created from numpy matrix x
; then gesv(bcm,bcx)
is called.
The result is overwritten to bcx
; print (bcx.to_numpy_matrix())
would produce:
[[ 1.7]
[-0.3]
[-0.2]]
Last example is singular value decomposition (SVD) by gesvd
. Unlike
TruncatedSVD
, it computes full SVD (it takes more time than
TruncatedSVD
if you only need part of the SVD result).
bcm = FrovedisBlockcyclicMatrix(m)
svd = SCALAPACK.gesvd(bcm)
Calling gesvd(bcm)
creates an object svd
that contains result.
The to_numpy_results()
function extracts left singular vectors
(umat), singular values (svec), and right singular vectors (vmat).
(umat,svec,vmat) = svd.to_numpy_results()
print (umat)
print (svec)
print (vmat)
It would produce like:
[[-0.03411749 -0.21215376 -0.97664056]
[-0.13817611 -0.96682347 0.21484819]
[-0.98981986 0.14227847 0.00367101]]
[69.30483143 2.5940231 0.33374433]
[[-0.19214106 -0.48689005 -0.85206801]
[-0.31038551 -0.79352539 0.52342936]
[-0.93099014 0.36504183 0.0013452 ]]
You can also save and load the SVD result.
In addition to machine learning algorithms, we support Pandas like DataFrame.
First, please install pandas
to your virtual environment.
Though pandas is installed to the system Python when Frovedis is
installed, virtualenv does not copy system installed packages by
default.
(frovedis_tutorial) $ pip install pandas
Then, please see "src/tut6-1/tut.py".
First, pandas DataFrame pdf1
and pdf2
are created. Then,
FrovedisDataframe is created from pandas DataFrame as fdf1
and
fdf2
.
peopleDF = {
'Ename' : ['Michael', 'Andy', 'Tanaka', 'Raul', 'Yuta'],
'Age' : [29, 30, 27, 19, 31],
'Country' : ['USA', 'England', 'Japan', 'France', 'Japan']
}
countryDF = {
'Ccode' : [1, 2, 3, 4],
'Country' : ['USA', 'England', 'Japan', 'France']
}
pdf1 = pd.DataFrame(peopleDF)
pdf2 = pd.DataFrame(countryDF)
fdf1 = FrovedisDataframe(pdf1)
fdf2 = FrovedisDataframe(pdf2)
To show the contents of FrovedisDataframe, you can use show():
fdf1.show()
fdf2.show()
They should produce output like:
index Ename Age Country
0 Michael 29 USA
1 Andy 30 England
2 Tanaka 27 Japan
3 Raul 19 France
4 Yuta 31 Japan
index Ccode Country
0 1 USA
1 2 England
2 3 Japan
3 4 France
To select columns, you can write like:
fdf1[["Ename","Age"]].show()
It should produce output like:
Ename Age
Michael 29
Andy 30
Tanaka 27
Raul 19
Yuta 31
To filter the rows, you can write like:
fdf1[(fdf1.Age > 19) & (fdf1.Country == 'Japan')].show()
It should produce output like:
index Ename Age Country
2 Tanaka 27 Japan
4 Yuta 31 Japan
To sort the rows, you can write like:
fdf1.sort("Age",ascending=False).show()
Since ascending=False
, it is sorted in descending order of Age.
Output should be like:
index Ename Age Country
4 Yuta 31 Japan
1 Andy 30 England
0 Michael 29 USA
2 Tanaka 27 Japan
3 Raul 19 France
You can specify multiple columns for sorting.
fdf1.sort(["Country", "Age"]).show()
This sorts the rows by Country, and then by Age in the same Country name. The output should be like:
index Ename Age Country
1 Andy 30 England
3 Raul 19 France
2 Tanaka 27 Japan
4 Yuta 31 Japan
0 Michael 29 USA
Please note that the rows whose Country is Japan is sorted by Age.
To groupby the table, first call groupby
and then call agg
to
aggregate the value like:
fdf1.groupby('Country').agg({'Age': ['max','min','mean'],
'Ename': ['count']}).show()
It should produce output like:
Country max_Age min_Age mean_Age count_Ename
England 30 30 30 1
Japan 31 27 29 2
France 19 19 19 1
USA 29 29 29 1
To join (or merge in Pandas term) tables, it is required that the column names are unique in the current implementation. So first we rename the column name.
fdf3 = fdf2.rename({'Country' : 'Cname'})
Then, join like this:
fdf1.merge(fdf3, left_on="Country", right_on="Cname").show()
It produces output like:
index Ename Age Country index_right Ccode Cname
0 Michael 29 USA 0 1 USA
1 Andy 30 England 1 2 England
2 Tanaka 27 Japan 2 3 Japan
3 Raul 19 France 3 4 France
4 Yuta 31 Japan 2 3 Japan
You can chain operations. Here, join, sort, and select are chained.
fdf1.merge(fdf3, left_on="Country", right_on="Cname") \
.sort("Age")[["Age", "Ename", "Country"]].show()
It produces output like:
Age Ename Country
19 Raul France
27 Tanaka Japan
29 Michael USA
30 Andy England
31 Yuta Japan
You can get the statistics of the columns like min, max, sum, avg,
std, and count by calling describe()
to see all these information.
print (fdf1.describe())
This prints like:
Age
count 5.000000
mean 27.200000
std 4.816638
sum 136.000000
min 19.000000
max 31.000000
So far, we only used Frovedis side DataFrame. It is also possible to convert to Pandas DataFrame or use Pandas DataFrame together.
pdf2.rename(columns={'Country' : 'Cname'},inplace=True)
joined = fdf1.merge(pdf2, left_on="Country", right_on="Cname")
Here, Frovedis DataFrame is joined with Pandas DataFrame. The output should be the same as previous join.
You can convert Frovedis DataFrame using to_pandas()
.
Frovedis DataFrame can be converted to matrix. Please see "src/tut6-2/tut.py".
First, Pandas DataFrame is created and converted to Frovedis DataFrame.
data = {'A': [10, 12, 13, 15],
'B': [10.23, 12.20, 34.90, 100.12],
'C': ['male', 'female', 'female', 'male'],
}
pdf = pd.DataFrame(data)
df = FrovedisDataframe(pdf)
print (pdf)
The DataFrame is:
A B C
0 10 10.23 male
1 12 12.20 female
2 13 34.90 female
3 15 100.12 male
You can create FrovedisRowmajorMatrix
by specifying the columns. The
columns should be integer or floating point values. In this case,
row_mat = df.to_frovedis_rowmajor_matrix(['A', 'B'], dtype=np.float64)
print (row_mat.to_numpy_matrix())
In this case, columns A
and B
are selected and converted to
matrix. This produces
[[ 10. 10.23]
[ 12. 12.2 ]
[ 13. 34.9 ]
[ 15. 100.12]]
You can also create FrovedisColmajorMatrix
by to_frovedis_colmajor_matrix
.
Then, you can specify columns as category variable. In this case, it can be any data type; it is converted using on-hot encoding. In this case, the result becomes FrovedisCRSMatrix.
crs_mat,info = df.to_frovedis_crs_matrix(['A', 'B', 'C'],
['C'], need_info=True)
crs_mat.debug_print()
Here, columns 'A', 'B', and 'C' is selected to create the matrix.
The second argument is to specify which column is used as categorical
variable. In this case column 'C' is specified. If need_info=True
,
info
data structure is also returned. It is used to create a matrix
from FrovedisDataFrame next time (explained later).
The result of debug print is as follows:
num_row = 4, num_col = 4
node 0
local_num_row = 2, local_num_col = 4
val : 10 10.23 1 12 12.2 1
idx : 0 1 3 0 1 2
off : 0 3 6
node 1
local_num_row = 2, local_num_col = 4
val : 13 34.9 1 15 100.12 1
idx : 0 1 2 0 1 3
off : 0 3 6
If it is shown as dense matrix, it should look like:
10 10.23 0 1
12 12.2 1 0
13 34.9 1 0
15 100.12 0 1
Here, 'female' is assigned to 2nd column (start from 0), and 'male' is assigned to 3rd column.
If you use this data for machine learning, you would want to convert
other matrix using the same way for inference, for example. The info
structure is used for this purpose.
For example,
A B C
0 12 34.56 male
1 13 78.90 male
This DataFrame is converted to FrovedisCRSMatrix using the info
created above:
crs_mat2 = df2.to_frovedis_crs_matrix_using_info(info)
crs_mat2.debug_print()
This should produce output like:
matrix:
num_row = 4, num_col = 4
node 0
local_num_row = 1, local_num_col = 4
val : 12 34.56 1
idx : 0 1 3
off : 0 3
node 1
local_num_row = 1, local_num_col = 4
val : 13 78.9 1
idx : 0 1 3
off : 0 3
If it is shown as dense matrix, it should look like:
12 34.56 0 1
13 78.9 0 1
As you can see, 'male' is assigned to 3rd column, not 2nd column.
The data structure info
can be saved and loaded to a file.
In addition, Frovedis DataFrame has out-of-core functionality, which allows to handle larger data than memory capacity by spilling out part of the data into storage.
To enable out-of-core functionality, users just need to set an
environment variable: FROVEDIS_DFCOLUMN_SPILLABLE=true
. You can set
the environment by exporting it before calling python interpreter,
like:
$ export FROVEDIS_DFCOLUMN_SPILLABLE=true
$ python your_program.py
or, you can specify it when you call frovedis server, like:
FrovedisServer.initialize("FROVEDIS_DFCOLUMN_SPILLABLE=true mpirun -np 4 {}".
format(os.environ['FROVEDIS_SERVER']))
The default directory that is used to spill the data is /var/tmp
.
If you want to change the directory to use, please set an environment
variable: FROVEDIS_TMPDIR=/path/to/spill
.
Basically, the system spills all the columns; since all the dataframe operations needs to access only the specified columns, the required columns are restored from the storage and processed.
However, spilling all the columns is sometimes inefficient. So there
is buffer (we call this queue) to keep the column in memory before
spilling to storage. The size of the queue is 1GB by default. You can
change the size by using an environment variable
FROVEDIS_DFCOLUMN_SPILLQ_SIZE
. The value is in MB
(e.g. FROVEDIS_DFCOLUMN_SPILLQ_SIZE=1024
, in this case 1GB).
Manuals are in ../manual/python
directory.
In addition to PDF file, you can also use man
command (MANPATH is
set in x86env.sh
or veenv.sh
). For python interface, the section is
3p
(same name of the manual may exist in section 3
or 3s
.), so
you can run like man -s 3p logistic_regression
.