### Install requirements (on your VM)

- `pip3 install jupyter pyarrow`
- Launch `jupyter notebook` and make sure to set up an SSH port-forwarding tunnel to your VM

In [1]:
!wget https://ms.sites.cs.wisc.edu/cs544/data/hdma-wi-2021.zip

--2024-03-01 17:20:08--  https://ms.sites.cs.wisc.edu/cs544/data/hdma-wi-2021.zip
Resolving ms.sites.cs.wisc.edu (ms.sites.cs.wisc.edu)... 108.156.107.32, 108.156.107.107, 108.156.107.90, ...
Connecting to ms.sites.cs.wisc.edu (ms.sites.cs.wisc.edu)|108.156.107.32|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21494278 (20M) [application/zip]
Saving to: ‘hdma-wi-2021.zip’


2024-03-01 17:20:09 (22.4 MB/s) - ‘hdma-wi-2021.zip’ saved [21494278/21494278]



In [2]:
!unzip hdma-wi-2021.zip

Archive:  hdma-wi-2021.zip
  inflating: hdma-wi-2021.csv        


In [3]:
import pyarrow as pa
import pyarrow.parquet
import pyarrow.csv

In [4]:
!ls -lah

total 216M
drwxrwxr-x  3 meenakshisyamkumar meenakshisyamkumar 4.0K Mar  1 17:20 .
drwxrwxr-x 14 meenakshisyamkumar meenakshisyamkumar 4.0K Feb 28 18:30 ..
drwxrwxr-x  2 meenakshisyamkumar meenakshisyamkumar 4.0K Mar  1 16:58 .ipynb_checkpoints
-rw-rw-r--  1 meenakshisyamkumar meenakshisyamkumar 8.5K Mar  1 17:19 demo.ipynb
-rw-rw-r--  1 meenakshisyamkumar meenakshisyamkumar  266 Feb 28 18:50 demo_commands
-rw-rw-r--  1 meenakshisyamkumar meenakshisyamkumar  13M Mar  1 17:19 gzip_parquet
-rw-r-----  1 meenakshisyamkumar meenakshisyamkumar 167M Nov  1  2022 hdma-wi-2021.csv
-rw-rw-r--  1 meenakshisyamkumar meenakshisyamkumar  21M Feb 29 16:21 hdma-wi-2021.zip
-rw-rw-r--  1 meenakshisyamkumar meenakshisyamkumar  16M Mar  1 17:19 snappy_parquet


#### Built-in magic commands

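- Cell magics begin with `%%` (line magics begin with a single `%`) and only work inside IPython environments such as `jupyter notebook`
- Reference: https://ipython.readthedocs.io/en/stable/interactive/magics.html
- Example: `%%time` measures the execution time of the whole cell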

In [5]:
%%time
t = pa.csv.read_csv("hdma-wi-2021.csv")

CPU times: user 2.74 s, sys: 1.99 s, total: 4.73 s
Wall time: 2.45 s
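
`%%time` reports both CPU time and wall time, as in the output above. Outside a notebook the magic is not available, so here is a minimal sketch of a rough equivalent using only the standard library. The helper name `time_call` and the `sum` workload are made up for illustration; this is not how IPython implements `%%time`.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()  # wall-clock timer
    cpu_start = time.process_time()   # user + sys CPU time of this process
    result = func(*args, **kwargs)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, wall_seconds, cpu_seconds


# Hypothetical workload standing in for a slow call such as reading a file.
total, wall, cpu = time_call(sum, range(1_000_000))
print(f"CPU time: {cpu:.3f} s, Wall time: {wall:.3f} s")
```

CPU time can exceed wall time when the work runs on several threads at once, which is consistent with the 4.73 s total versus 2.45 s wall time reported for `read_csv` above.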


Let's write the data into parquet format.

In [6]:
pa.parquet.write_table(t, "hdma-wi-2021.parquet")

## WARNING: You will run out of memory and your VM will freeze if you try to read this data twice (csv and parquet format)

- Make sure to do "Kernel > Restart Kernel..." before continuing

In [1]:
import pyarrow as pa
import pyarrow.parquet
import pyarrow.csv

In [2]:
%%time
t = pa.parquet.read_table("hdma-wi-2021.parquet")

CPU times: user 788 ms, sys: 345 ms, total: 1.13 s
Wall time: 626 ms


### Binary versus Text

Let's read the first 100 bytes of the CSV file.

In [3]:
with open("hdma-wi-2021.csv", "rb") as f:
    print(f.read(100))

b'activity_year,lei,derived_msa-md,state_code,county_code,census_tract,conforming_loan_limit,derived_l'


Unlike the CSV file, which stores human-readable text, the Parquet file stores its data in a binary format. Here are the first 100 bytes of the Parquet file.

In [4]:
with open("hdma-wi-2021.parquet", "rb") as f:
    print(f.read(100))

b'PAR1\x15\x04\x15\x10\x15\x14L\x15\x02\x15\x00\x12\x00\x00\x08\x1c\xe5\x07\x00\x00\x00\x00\x00\x00\x15\x00\x15\x1a\x15\x1e,\x15\x8e\xce6\x15\x10\x15\x06\x15\x06\x1c\x18\x08\xe5\x07\x00\x00\x00\x00\x00\x00\x18\x08\xe5\x07\x00\x00\x00\x00\x00\x00\x16\x00(\x08\xe5\x07\x00\x00\x00\x00\x00\x00\x18\x08\xe5\x07\x00\x00\x00\x00\x00\x00\x00\x00\x00\r0\x04\x00\x00\x00\x8e\xce6'
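
The `PAR1` prefix in the output above is the Parquet magic number: a Parquet file starts and ends with these four bytes. Here is a minimal sketch of a check built on that fact, using only the standard library. The helper name `looks_like_parquet` and the temporary files are made up for illustration, and the "fake" file is not a valid Parquet file, just a stand-in with the right first and last bytes.

```python
import os
import tempfile

PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    # Too small to hold both the leading and trailing magic number.
    if os.path.getsize(path) < 2 * len(PARQUET_MAGIC):
        return False
    with open(path, "rb") as f:
        head = f.read(len(PARQUET_MAGIC))
        f.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = f.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


with tempfile.TemporaryDirectory() as tmp:
    fake_parquet = os.path.join(tmp, "fake.parquet")  # hypothetical file
    text_file = os.path.join(tmp, "data.csv")         # hypothetical file
    with open(fake_parquet, "wb") as f:
        f.write(PARQUET_MAGIC + b"\x00" * 16 + PARQUET_MAGIC)
    with open(text_file, "w") as f:
        f.write("activity_year,lei\n2021,ABC\n")
    print(looks_like_parquet(fake_parquet))
    print(looks_like_parquet(text_file))
```

A check like this only inspects eight bytes, so it says nothing about whether the rest of the file is well-formed.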


### Orientation: parquet - column orientation

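Let's say that we are only interested in two columns. Because Parquet stores data column by column, the reader can fetch just those columns and skip the rest of the file, which is very fast. Compare the wall time below with the 626 ms it took to read the full table.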

In [5]:
%%time
t2 = pa.parquet.read_table("hdma-wi-2021.parquet", columns=["lei", "census_tract"])

CPU times: user 10.5 ms, sys: 33.9 ms, total: 44.4 ms
Wall time: 34.8 ms


### Compression: snappy versus gzip

In [6]:
%%time
pa.parquet.write_table(t, "snappy_parquet", compression="snappy")

CPU times: user 1.44 s, sys: 39.7 ms, total: 1.48 s
Wall time: 1.8 s


In [7]:
%%time
pa.parquet.write_table(t, "gzip_parquet", compression="gzip")

CPU times: user 3.7 s, sys: 36.8 ms, total: 3.74 s
Wall time: 3.95 s


In [8]:
!ls -lah

total 232M
drwxrwxr-x  3 meenakshisyamkumar meenakshisyamkumar 4.0K Mar  1 17:20 .
drwxrwxr-x 14 meenakshisyamkumar meenakshisyamkumar 4.0K Feb 28 18:30 ..
drwxrwxr-x  2 meenakshisyamkumar meenakshisyamkumar 4.0K Mar  1 16:58 .ipynb_checkpoints
-rw-rw-r--  1 meenakshisyamkumar meenakshisyamkumar 8.0K Mar  1 17:20 demo.ipynb
-rw-rw-r--  1 meenakshisyamkumar meenakshisyamkumar  266 Feb 28 18:50 demo_commands
-rw-rw-r--  1 meenakshisyamkumar meenakshisyamkumar  13M Mar  1 17:22 gzip_parquet
-rw-r-----  1 meenakshisyamkumar meenakshisyamkumar 167M Nov  1  2022 hdma-wi-2021.csv
-rw-rw-r--  1 meenakshisyamkumar meenakshisyamkumar  16M Mar  1 17:20 hdma-wi-2021.parquet
-rw-rw-r--  1 meenakshisyamkumar meenakshisyamkumar  21M Feb 29 16:21 hdma-wi-2021.zip
-rw-rw-r--  1 meenakshisyamkumar meenakshisyamkumar  16M Mar  1 17:21 snappy_parquet
