<a href="https://colab.research.google.com/github/cateto/python4NLP/blob/main/util/alternative_for_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# pandas 2.0

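- Limitations of earlier pandas versions
  - Data in the 100 MB to 1 GB range is handled efficiently with almost no performance problems, but beyond that performance drops and run times grow long. pandas also lacked multiprocessing support, which limited its ability to handle large datasets.
- Improvements in 2.0
  - Supports pyarrow (data is laid out so that it is easy to read and process in parallel)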

In [10]:
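# Note: a bare `%time` line times an empty statement, so the figures below
# do not measure this cell; `%%time` as the first line would time the whole cell.
%time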
import pandas as pd

df = pd.DataFrame({'Patient':['Anna','Be','Charlie','Duke','Earth','Faux','Goal','Him'], 
                   'Weight':[41,56,78,55,80,84,36,91], 
                   'Segment':[1,2,1,1,3,2,1,1] })

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 6.2 µs


In [11]:
df

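| | Patient | Weight | Segment |
| --- | --- | --- | --- |
| 0 | Anna | 41 | 1 |
| 1 | Be | 56 | 2 |
| 2 | Charlie | 78 | 1 |
| 3 | Duke | 55 | 1 |
| 4 | Earth | 80 | 3 |
| 5 | Faux | 84 | 2 |
| 6 | Goal | 36 | 1 |
| 7 | Him | 91 | 1 |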


# Polars

- A data analysis and processing tool implemented in Rust
- Excellent performance on large datasets
- Supports a wider variety of data types than pandas
- Supports parallel processing
- API similar to pandas

In [1]:
!pip install polars

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [8]:
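# Note: a bare `%time` line times an empty statement, so the figures below
# do not measure this cell; `%%time` as the first line would time the whole cell.
%time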
import polars as pl

 
df = pl.DataFrame({'Patient':['Anna','Be','Charlie','Duke','Earth','Faux','Goal','Him'], 
                   'Weight':[41,56,78,55,80,84,36,91], 
                   'Segment':[1,2,1,1,3,2,1,1] })

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 7.15 µs


In [9]:
df

| Patient | Weight | Segment |
| --- | --- | --- |
| str | i64 | i64 |
| "Anna" | 41 | 1 |
| "Be" | 56 | 2 |
| "Charlie" | 78 | 1 |
| "Duke" | 55 | 1 |
| "Earth" | 80 | 3 |
| "Faux" | 84 | 2 |
| "Goal" | 36 | 1 |
| "Him" | 91 | 1 |


# Dask

- Multiprocessing
- Task parallelization
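
One way to picture task parallelization is `dask.delayed`, which turns ordinary function calls into a task graph that is executed only on request. The sketch below assumes `dask` is installed; the `square` function is a hypothetical stand-in for real work.

```python
import dask


@dask.delayed
def square(value):
    # Hypothetical unit of work; each call becomes one task in the graph.
    return value * value


# Building the list runs nothing yet: each element is a lazy Delayed object.
tasks = [square(i) for i in range(5)]

# Combine the independent tasks with one more lazy step.
total = dask.delayed(sum)(tasks)

# compute() executes the graph; independent tasks may run in parallel.
result = total.compute()
print(result)
```

Because the five `square` tasks do not depend on each other, the scheduler is free to run them concurrently before the final `sum`.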

In [12]:
!pip install dask

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [13]:
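import dask
from dask.distributed import Client, progress

# processes=False runs the single worker in-process with 4 threads
# (no separate worker processes); memory is capped at 2 GB.
client = Client(processes=False, threads_per_worker=4,
                n_workers=1, memory_limit='2GB')
client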

INFO:distributed.http.proxy:To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
INFO:distributed.scheduler:State start
INFO:distributed.scheduler:  Scheduler at: inproc://172.28.0.12/588/1
INFO:distributed.scheduler:  dashboard at:            localhost:8787
INFO:distributed.worker:      Start worker at: inproc://172.28.0.12/588/4
INFO:distributed.worker:         Listening to:          inproc172.28.0.12
INFO:distributed.worker:          Worker name:                          0
INFO:distributed.worker:         dashboard at:          172.28.0.12:42917
INFO:distributed.worker:Waiting to connect to: inproc://172.28.0.12/588/1
INFO:distributed.worker:-------------------------------------------------
INFO:distributed.worker:              Threads:                          4
INFO:distributed.worker:               Memory:                   1.86 GiB
INFO:distributed.worker:      Local Directory: /tmp/dask-worker-space/worker-ub

- Client
  - Connection method: Cluster object
  - Cluster type: distributed.LocalCluster
  - Dashboard: http://172.28.0.12:8787/status
- LocalCluster
  - Dashboard: http://172.28.0.12:8787/status
  - Workers: 1
  - Total threads: 4
  - Total memory: 1.86 GiB
  - Status: running
  - Using processes: False
- Scheduler
  - Comm: inproc://172.28.0.12/588/1
  - Workers: 1
  - Dashboard: http://172.28.0.12:8787/status
  - Total threads: 4
  - Started: Just now
  - Total memory: 1.86 GiB
- Worker 0
  - Comm: inproc://172.28.0.12/588/4
  - Total threads: 4
  - Dashboard: http://172.28.0.12:42917/status
  - Memory: 1.86 GiB
  - Nanny: None
  - Local directory: /tmp/dask-worker-space/worker-ubzdec3g


In [14]:
import dask.array as da
x = da.random.random((10000, 10000), chunks=(1000, 1000))
x

| | Array | Chunk |
| --- | --- | --- |
| Bytes | 762.94 MiB | 7.63 MiB |
| Shape | (10000, 10000) | (1000, 1000) |
| Dask graph | 100 chunks in 1 graph layer | 100 chunks in 1 graph layer |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |
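
The array summary above describes a task graph, not computed values. The following sketch shows the same pattern on a smaller array of ones (an assumption made so the result is easy to check): operations stay lazy until `.compute()` is called.

```python
import dask.array as da

# A 1000 x 1000 array of ones split into 100 chunks of 100 x 100.
x = da.ones((1000, 1000), chunks=(100, 100))

# sum() only extends the task graph; no chunk has been materialized yet.
total = x.sum()

# compute() runs the per-chunk sums and combines the partial results.
result = float(total.compute())
print(x.numblocks, result)
```

Each chunk is an ordinary NumPy array, so per-chunk work can be spread across the threads of the worker created earlier.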


# Vaex

- Lazy evaluation
- Visualization tools
- Optimizes use of available computing resources

In [16]:
!pip install vaex

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting vaex
  Downloading vaex-4.16.0-py3-none-any.whl (4.7 kB)
Collecting vaex-core<4.17,>=4.16.0 (from vaex)
  Downloading vaex_core-4.16.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.5/4.5 MB 38.7 MB/s eta 0:00:00
Collecting vaex-astro<0.10,>=0.9.3 (from vaex)
  Downloading vaex_astro-0.9.3-py3-none-any.whl (20 kB)
Collecting vaex-hdf5<0.15,>=0.13.0 (from vaex)
  Downloading vaex_hdf5-0.14.1-py3-none-any.whl (16 kB)
Collecting vaex-viz<0.6,>=0.5.4 (from vaex)
  Downloading vaex_viz-0.5.4-py3-none-any.whl (19 kB)
Collecting vaex-server<0.9,>=0.8.1 (from vaex)
  Downloading vaex_server-0.8.1-py3-none-any.whl (23 kB)
Collecting vaex-jupyter<0.9,>=0.8.1 (from vaex)
  Downloading vaex_jupyter-0.8.1-py3-none-any.whl (43 kB)

In [17]:
import vaex
df = vaex.example()
df

Downloading https://github.com/vaexio/vaex-datasets/releases/download/v1.0/helmi-dezeeuw-2000-FeH-v2-10percent.hdf5 to /root/.vaex/data/helmi-dezeeuw-2000-FeH-v2-10percent.hdf5


| # | id | x | y | z | vx | vy | vz | E | L | Lz | FeH |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1.2318684 | -0.39692867 | -0.59805775 | 301.15527 | 174.05948 | 27.427546 | -149431.4 | 407.38898 | 333.95554 | -1.0053853 |
| 1 | 23 | -0.16370061 | 3.6542213 | -0.25490645 | -195.00023 | 170.47217 | 142.53023 | -124247.95 | 890.24115 | 684.6676 | -1.708667 |
| 2 | 32 | -2.120256 | 3.3260527 | 1.7078403 | -48.63423 | 171.6473 | -2.0794373 | -138500.55 | 372.2411 | -202.17618 | -1.8336141 |
| 3 | 8 | 4.715589 | 4.585251 | 2.2515438 | -232.42084 | -294.85083 | 62.85865 | -60037.04 | 1297.6304 | -324.6875 | -1.4786882 |
| 4 | 16 | 7.217187 | 11.994717 | -1.0645622 | -1.6891745 | 181.32935 | -11.333611 | -83206.84 | 1332.799 | 1328.949 | -1.8570484 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 329995 | 21 | 1.9938701 | 0.7892761 | 0.2220599 | -216.9299 | 16.12442 | -211.24438 | -146457.44 | 457.72247 | 203.36758 | -1.7451677 |
| 329996 | 25 | 3.7180912 | 0.7213376 | 1.6415337 | -185.9216 | -117.250824 | -105.49866 | -126627.11 | 335.00256 | -301.837 | -0.9822322 |
| 329997 | 14 | 0.36885077 | 13.029609 | -3.6339347 | -53.677147 | -145.15771 | 76.7091 | -84912.26 | 817.1376 | 645.8507 | -1.7645613 |
| 329998 | 18 | -0.112592645 | 1.4529126 | 2.1689527 | 179.30865 | 205.7971 | -68.75873 | -133498.47 | 724.00024 | -283.69104 | -1.8808953 |
