# Vaex

### Vaex, what's that?
- Vaex is a high-performance Python library for lazy, out-of-core DataFrames (similar to pandas), used to visualize and explore big tabular datasets.
- It calculates statistics such as mean, sum, count, and standard deviation on an N-dimensional grid at more than a billion (10^9) samples/rows per second.
- Visualization is done using histograms, density plots, and 3D volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, a zero-memory-copy policy, and lazy computations for best performance (no memory wasted).
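
The memory-mapping idea mentioned in the last bullet can be tried without Vaex. The following is a minimal sketch using only Python's standard library; the file name `column.bin` and the sizes are made up for illustration, and it shows the underlying operating-system mechanism rather than Vaex's actual implementation.

```python
import mmap
import os
import tempfile
from array import array

# Write one million native C ints to a temporary binary file.
values = array('i', range(1_000_000))
path = os.path.join(tempfile.mkdtemp(), 'column.bin')  # hypothetical file name
with open(path, 'wb') as f:
    values.tofile(f)

# Map the file into memory: pages are read from disk only when touched.
with open(path, 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mm).cast('i')   # typed view over the file, no copy
    window = view[500_000:500_005]    # slicing a memoryview does not copy either
    first_five = list(window)         # only these five values are materialized
    window.release()                  # release the views before closing the map
    view.release()
    mm.close()

print(first_five)
```

Only the small slice is turned into Python objects; the rest of the file is never copied into process memory by this code.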

In [38]:
!pip install vaex



In [39]:
import vaex

import pandas as pd
import numpy as np

In [40]:
n_rows = 100000  # one hundred thousand rows of random data
n_cols = 10
df = pd.DataFrame(np.random.randint(0, 100, size=(n_rows, n_cols)), columns=['c%d' % i for i in range(n_cols)])
df.head()

,c0,c1,c2,c3,c4,c5,c6,c7,c8,c9
0,4,42,80,16,18,40,80,33,33,18
1,80,89,0,30,48,20,74,65,51,99
2,19,93,70,14,38,83,60,48,60,91
3,3,95,96,64,31,82,66,74,80,13
4,94,22,55,27,3,67,16,23,79,46


In [41]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 10 columns):
c0    100000 non-null int32
c1    100000 non-null int32
c2    100000 non-null int32
c3    100000 non-null int32
c4    100000 non-null int32
c5    100000 non-null int32
c6    100000 non-null int32
c7    100000 non-null int32
c8    100000 non-null int32
c9    100000 non-null int32
dtypes: int32(10)
memory usage: 3.8 MB


### Creating CSV files

In [42]:
file_path = 'main_dataset.csv'
df.to_csv(file_path, index=False)

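### Create HDF5 files

Vaex works best with memory-mappable file formats such as HDF5, so we convert the CSV file to HDF5. With `convert=True`, `vaex.from_csv` writes an HDF5 copy next to the CSV file (here `main_dataset.csv.hdf5`).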

In [43]:
vaex_df = vaex.from_csv(file_path, convert=True, chunk_size=5_000_000)

In [44]:
type(vaex_df)

vaex.hdf5.dataset.Hdf5MemoryMapped

### Read HDF5 files using the Vaex library

In [45]:
vaex_df = vaex.open('main_dataset.csv.hdf5')

In [46]:
type(vaex_df)

vaex.hdf5.dataset.Hdf5MemoryMapped

In [47]:
vaex_df.head()

#,c0,c1,c2,c3,c4,c5,c6,c7,c8,c9
0,85,51,28,42,36,33,30,80,33,8
1,99,42,56,36,24,32,3,29,76,40
2,44,23,44,64,27,14,6,94,90,62
3,30,48,34,90,57,93,61,78,35,25
4,7,33,44,69,75,60,71,83,6,67
5,95,84,78,81,6,99,55,38,52,35
6,14,96,59,94,55,79,85,18,51,22
7,59,98,31,22,83,36,60,36,72,58
8,4,92,77,20,70,48,38,97,84,47
9,22,29,88,6,1,40,12,89,80,48


### Expression system
- Let's implement some expressions using Vaex.
- Don't waste memory or time on feature engineering: Vaex (lazily) transforms your data only when needed.

In [48]:
%%time
vaex_df['multiplication_col13'] = vaex_df.c1 * vaex_df.c3

Wall time: 0 ns


In [49]:
vaex_df['multiplication_col13']

Expression = multiplication_col13
Length: 100,000 dtype: int64 (column)
-------------------------------------
    0  2142
    1  1512
    2  1472
    3  4320
    4  2277
    ...    
99995  7332
99996  2852
99997  2200
99998   510
99999  1040
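
The `Wall time: 0 ns` reported for the assignment is what you would expect if the assignment only records a recipe and defers the arithmetic until values are requested. The toy class below sketches that idea in plain Python. `LazyColumn` is a hypothetical name invented for this illustration and is not part of Vaex; the sample values are the first five rows of `c1` and `c3` from the `vaex_df.head()` output.

```python
class LazyColumn:
    """Toy stand-in for a lazily evaluated column (hypothetical, not a Vaex class)."""

    def __init__(self, compute, length):
        self._compute = compute  # function (start, stop) -> list of values
        self.length = length

    @classmethod
    def from_list(cls, data):
        # A "real" column simply slices its stored data.
        return cls(lambda start, stop: data[start:stop], len(data))

    def __mul__(self, other):
        # Build a new recipe; no multiplication happens here.
        def compute(start, stop):
            left = self._compute(start, stop)
            right = other._compute(start, stop)
            return [a * b for a, b in zip(left, right)]
        return LazyColumn(compute, self.length)

    def evaluate(self, start=0, stop=None):
        # Values are computed only for the requested range.
        stop = self.length if stop is None else stop
        return self._compute(start, stop)


c1 = LazyColumn.from_list([51, 42, 23, 48, 33])
c3 = LazyColumn.from_list([42, 36, 64, 90, 69])
product = c1 * c3                # instant: only the recipe is stored
print(product.evaluate(0, 3))    # the first three products are computed now
```

The derived column costs no extra memory until `evaluate` is called, and even then only the requested range is computed.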

### Out-of-core DataFrame
Filtering and evaluating expressions does not waste memory by making copies; the data is kept untouched on disk and is streamed only when needed. This delays the point at which you need a cluster.

In [50]:
vaex_df[vaex_df.c2>70]

#,c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,multiplication_col13
0,95,84,78,81,6,99,55,38,52,35,6804
1,4,92,77,20,70,48,38,97,84,47,1840
2,22,29,88,6,1,40,12,89,80,48,174
3,56,52,76,53,85,6,69,49,4,40,2756
4,26,94,88,51,22,30,21,32,38,93,4794
...,...,...,...,...,...,...,...,...,...,...,...
28961,76,38,80,91,39,74,64,86,24,45,3458
28962,2,35,98,94,28,45,35,64,59,80,3290
28963,56,98,73,85,0,23,51,58,15,16,8330
28964,1,34,99,0,45,38,80,26,26,37,0


#### Filtering will not make a memory copy

In [51]:
dff = vaex_df[vaex_df.c2 > 70]

#### All the algorithms work out of core; the limit is the size of your hard drive

In [52]:
dff.c2.minmax(progress='widget')

HBox(children=(FloatProgress(value=0.0, max=1.0), Label(value='In progress...')))

array([71, 99], dtype=int64)
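
To see why a filtered min/max does not need the whole dataset in memory, here is a minimal sketch of a streaming reduction in plain Python. It illustrates the general out-of-core pattern and is not Vaex's implementation; `iter_chunks` and `filtered_minmax` are hypothetical helpers, and the chunks are generated deterministically as a stand-in for blocks read from disk.

```python
def iter_chunks(n_rows, chunk_size):
    """Yield the data one chunk at a time, standing in for reads from disk."""
    for start in range(0, n_rows, chunk_size):
        stop = min(start + chunk_size, n_rows)
        # Deterministic values in 0..99 (a stand-in for the random column c2).
        yield [(i * 37) % 100 for i in range(start, stop)]


def filtered_minmax(chunks, predicate):
    """Running min/max over the values that pass the filter, chunk by chunk."""
    lo, hi = None, None
    for chunk in chunks:
        kept = [v for v in chunk if predicate(v)]  # the filter is applied per chunk
        if not kept:
            continue
        lo = min(kept) if lo is None else min(lo, min(kept))
        hi = max(kept) if hi is None else max(hi, max(kept))
    return lo, hi


lo, hi = filtered_minmax(iter_chunks(1_000, 100), lambda v: v > 70)
print(lo, hi)
```

At any moment only one chunk and two running values are held in memory, so the dataset size is bounded by disk space rather than RAM.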

### Fast groupby / aggregations
Vaex implements parallelized, high-performance groupby operations, especially when using categories (>1 billion rows/second).

In [53]:
%%time
vaex_df_group = vaex_df.groupby(vaex_df.c1, agg=vaex.agg.mean(vaex_df.c4))
vaex_df_group

Wall time: 29.6 ms


In [54]:
%%time
vaex_df.groupby(vaex_df.c1, agg='count')

Wall time: 22.3 ms


#,c1,count
0,51,1018
1,42,927
2,23,932
3,48,982
4,33,980
...,...,...
95,28,1054
96,19,1026
97,74,986
98,36,996
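
One plausible reason grouping by categories can be fast is that a small, known set of integer keys lets each row update a fixed slot directly, with no hashing or sorting. The sketch below shows that general technique in plain Python. It is an assumption about the approach, not a description of Vaex's internals, and `group_count_mean` is a hypothetical helper with made-up sample data.

```python
def group_count_mean(keys, values, n_categories):
    """Count and mean per category, using the key as a direct index."""
    counts = [0] * n_categories
    sums = [0] * n_categories
    for k, v in zip(keys, values):   # single pass over the rows
        counts[k] += 1
        sums[k] += v
    means = [s / c if c else None for s, c in zip(sums, counts)]
    return counts, means


keys = [0, 1, 0, 2, 1, 0]            # category codes (like c1 in the notebook)
values = [10, 20, 30, 40, 50, 60]    # values to aggregate (like c4)
counts, means = group_count_mean(keys, values, n_categories=3)
print(counts)
print(means)
```

Because each chunk of rows can fill its own `counts` and `sums` arrays and the arrays can be added together afterwards, this pattern also parallelizes naturally.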


### Great! We learned Vaex.