## CAUTION: Whenever you reach a cell containing `import` statements, restart your Jupyter kernel (or stop and restart the Docker container) before running it

You will **run out of memory** if you don't do that!

In [None]:
import torch
import time

## Matrix storage - contiguous rows vs contiguous columns

In [None]:
A = torch.rand(10_000, 10_000)   # default is float32, so this is about 400 MB

Which will be faster: summing over rows or over columns? Rows. By default PyTorch stores a tensor in row-major order, so the elements of each row sit next to each other in memory, while the elements of a column are spread far apart.
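
To make "stored contiguously" concrete, here is a minimal sketch in plain Python (no PyTorch needed) of how a row-major layout maps a `(row, col)` index to a position in one flat buffer. The helper name `flat_offset` and the tiny 3x4 shape are illustrative assumptions, not part of any library.

```python
# Hypothetical helper (not a PyTorch API): position of element (row, col)
# in the flat buffer of a row-major matrix with n_cols columns.
def flat_offset(row, col, n_cols):
    return row * n_cols + col

n_rows, n_cols = 3, 4  # tiny illustrative shape

# Offsets visited when walking along row 1: neighbours in memory (step of 1).
row_walk = [flat_offset(1, c, n_cols) for c in range(n_cols)]

# Offsets visited when walking down column 1: jumps of n_cols elements.
col_walk = [flat_offset(r, 1, n_cols) for r in range(n_rows)]

print(row_walk)  # [4, 5, 6, 7]
print(col_walk)  # [1, 5, 9]
```

Walking along a row touches adjacent positions, which tends to be friendly to the CPU cache; walking down a column jumps `n_cols` positions at every step. On a real tensor, `A.stride()` reports these step sizes.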

Sum over columns:

In [None]:
start = time.time()
A.sum(dim=0)       # dim=0 runs down the rows, so this sums each column
end = time.time()
(end-start)*1000   # milliseconds

Sum over rows:

In [None]:
start = time.time()
A.sum(dim=1)       # dim=1 runs across the columns, so this sums each row
end = time.time()
(end-start)*1000   # milliseconds

Now let's create the matrix as a transpose. Summation over columns should now be faster than summation over rows.

Taking a transpose with `.T` does not move any data in memory! It returns a view of the same buffer with the two strides swapped. The columns of the new `A` are therefore the contiguous rows of the matrix that `torch.rand` originally generated.
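
As an illustration of why a transpose can be free, here is a minimal sketch in plain Python of a 2-D view over a flat buffer. The `StridedView` class is a hypothetical, simplified model written for this example; it is not how PyTorch is actually implemented.

```python
# Hypothetical, simplified model of a 2-D strided view (not PyTorch's actual
# implementation): a flat buffer plus a shape and per-dimension strides.
class StridedView:
    def __init__(self, buffer, shape, strides):
        self.buffer = buffer
        self.shape = shape
        self.strides = strides

    def __getitem__(self, index):
        row, col = index
        return self.buffer[row * self.strides[0] + col * self.strides[1]]

    @property
    def T(self):
        # Transpose: same buffer object, shape and strides swapped, no copy.
        return StridedView(self.buffer, self.shape[::-1], self.strides[::-1])


buffer = list(range(6))                   # 2x3 matrix stored row by row
A = StridedView(buffer, (2, 3), (3, 1))   # row-major: row step 3, col step 1
At = A.T

print(At.shape, At.strides)    # (3, 2) (1, 3)
print(At.buffer is A.buffer)   # True
print(A[1, 2], At[2, 1])       # 5 5
```

In `At` the stride along dim 0 is 1, so moving down a column of the transposed view visits adjacent buffer positions. With real tensors, `A.stride()` and `A.is_contiguous()` expose the same information, and `A.contiguous()` returns a copy whose memory is rearranged into row-major order.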

In [None]:
A = torch.rand(10_000, 10_000).T   # default is float32, so this is about 400 MB

Sum over columns (should now be faster):

In [None]:
start = time.time()
A.sum(dim=0)       # dim=0 runs down the rows, so this sums each column
end = time.time()
(end-start)*1000   # milliseconds

Sum over rows (should now be slower):

In [None]:
start = time.time()
A.sum(dim=1)       # dim=1 runs across the columns, so this sums each row
end = time.time()
(end-start)*1000   # milliseconds

## PyArrow

In [None]:
!wget https://pages.cs.wisc.edu/~harter/cs544/data/hdma-wi-2021.zip

In [None]:
!unzip hdma-wi-2021.zip

In [None]:
!ls -lah

In [None]:
import time
import pandas as pd

In [None]:
start = time.time()
pd.read_csv("hdma-wi-2021.csv")
end = time.time()
end-start # seconds

In [None]:
import pyarrow.csv
import time
import pyarrow.compute as pc

In [None]:
start = time.time()
tbl = pyarrow.csv.read_csv("hdma-wi-2021.csv")
end = time.time()

In [None]:
end-start # seconds

In [None]:
start = time.time()
df = tbl.to_pandas()
end = time.time()
end-start # seconds

In [None]:
pc.utf8_lower(tbl["lei"]).to_pandas()

In [None]:
pc.mean(tbl["income"].drop_null()).as_py()

In [None]:
tbl[:10].to_pandas()