In [1]:
import vaex

Download one of the "-totals" files from <https://dumps.wikimedia.org/other/pagecounts-ez/merged/>.

(See <https://dumps.wikimedia.org/other/pagecounts-ez/> for rudimentary documentation.)

In [2]:
month = '2019-12'
filename_base = f'pagecounts-{month}-views-ge-5-totals'
download_filename = f'{filename_base}.bz2'

In [3]:
!wget https://dumps.wikimedia.org/other/pagecounts-ez/merged/{download_filename}

--2020-01-21 14:24:36--  https://dumps.wikimedia.org/other/pagecounts-ez/merged/pagecounts-2019-12-views-ge-5-totals.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.7, 2620:0:861:1:208:80:154:7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 554304140 (529M) [application/octet-stream]
Saving to: ‘pagecounts-2019-12-views-ge-5-totals.bz2’


2020-01-21 14:41:02 (549 KB/s) - ‘pagecounts-2019-12-views-ge-5-totals.bz2’ saved [554304140/554304140]



The bzip2-compressed text file consists of lines (rows) with space-separated fields (columns). We have these 3 columns:

In [4]:
column_names = ['wiki code', 'article title', 'monthly total']
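
To check the layout described above before importing anything, the first few rows can be read directly from the compressed file. The following is only a minimal sketch using Python's standard `bz2` module; the helper name `peek_rows` is made up for illustration and is not part of vaex. It assumes the file is UTF-8 text and replaces any undecodable bytes.

```python
import bz2
from itertools import islice


def peek_rows(path, n=5):
    """Return the first n rows of a bzip2-compressed, space-separated text file."""
    # Text mode decompresses lazily, so only the beginning of the file is read.
    with bz2.open(path, mode='rt', encoding='utf-8', errors='replace') as f:
        return [line.rstrip('\n').split(' ') for line in islice(f, n)]

# In this notebook one would call: peek_rows(download_filename)
```

Each returned row should be a list of three strings, matching the three column names.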

In order to make use of vaex's "out-of-core" functionality
(i.e., **not** loading the complete data into RAM),
[`vaex.open(...)`](https://vaex.readthedocs.io/en/latest/api.html#vaex.open) has to be used.
So do **not** use [`vaex.from_csv(...)`](https://vaex.readthedocs.io/en/latest/api.html#vaex.from_csv)
or [`vaex.from_ascii(...)`](https://vaex.readthedocs.io/en/latest/api.html#vaex.from_ascii)
as they would (try to) load all data into RAM.
(`vaex.open(...)` [seems to call `from_csv(...)`](https://github.com/vaexio/vaex/blob/core-v1.4.0/packages/vaex-core/vaex/__init__.py#L207),
but probably with special arguments to avoid in-memory caching?)

However, `vaex.open(...)` requires specific file name extensions to identify the data format used.
For character-separated values, [it's `.csv` or `.csv.bz2`](https://github.com/vaexio/vaex/blob/core-v1.4.0/packages/vaex-core/vaex/__init__.py#L206).
bzip2-decompression is handled by vaex automatically.

So for `vaex.open(...)` to correctly recognize the file, we have to rename it.

In [5]:
import_filename = f'{filename_base}.csv.bz2'

In [6]:
!mv {download_filename} {import_filename}
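
The `!mv` shell command fails when the cell is executed a second time, because the source file is gone by then. As a possible alternative, here is a small sketch based on `pathlib` that tolerates re-runs; the function name `rename_for_vaex` is hypothetical and chosen only for this example.

```python
from pathlib import Path


def rename_for_vaex(download_name, import_name):
    """Rename the download so that it ends in .csv.bz2 and return the new path."""
    source, target = Path(download_name), Path(import_name)
    if source.exists():
        # Same effect as `mv` within one directory.
        source.rename(target)
    # If the source is missing, the rename may already have happened in an earlier run.
    if not target.exists():
        raise FileNotFoundError(f'neither {source} nor {target} exists')
    return target

# In this notebook one would call: rename_for_vaex(download_filename, import_filename)
```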

The default separator for CSV files is `','`, so we have to override it.
Unlike in `vaex.from_ascii(...)`, the argument for that is called `sep` rather than `seperator`.
(It's actually passed on to [pandas' `read_csv(...)`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv).)

The data contains no header row,
so we pass the column names as an argument.
This implicitly prevents the first line of data from being treated
as a header row rather than as payload content.

By setting the `convert` flag argument, we tell vaex
to create an HDF5 file with the content of the opened file.
This speeds up both out-of-core operations
and subsequent `vaex.open(...)` calls on the same file.

In [7]:
df = vaex.open(
    import_filename,
    sep=' ',
    names=column_names,
    convert=True,
)

In the wiki code, the suffix `.z` stands for Wikipedia, so `en.z` denotes the English-language Wikipedia.

In [8]:
lang = 'en'
article_title = 'Neuschwanstein_Castle'

# The spaces in the column names appear as underscores in the attribute names.
df.filter(df.wiki_code == f'{lang}.z').filter(df.article_title == article_title).monthly_total

Expression = monthly_total
Length: 1 dtype: int64 (column)
-------------------------------
0  28965
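
As an independent sanity check of the lookup above, the same row can be found without vaex by scanning the compressed file line by line. This is only a sketch: the function name `lookup_monthly_total` is invented for this example, it assumes the three-column layout described earlier, and in the worst case it has to decompress the whole file, so it is far slower than the vaex query.

```python
import bz2


def lookup_monthly_total(path, wiki_code, title):
    """Scan the compressed file and return the matching monthly total, or None."""
    # The trailing space ensures that only the exact title matches.
    prefix = f'{wiki_code} {title} '
    with bz2.open(path, mode='rt', encoding='utf-8', errors='replace') as f:
        for line in f:
            if line.startswith(prefix):
                return int(line[len(prefix):].split()[0])
    return None

# In this notebook one would call:
# lookup_monthly_total(import_filename, f'{lang}.z', article_title)
```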