In [77]:
import pyarrow
import numpy as np
import pandas as pd

## Pyarrow arrays

In [48]:
import pyarrow as pa

name = pa.array(['John', 'Cameron', 'Jacob', 'Arlette'])
age = pa.array([10, 15, 20, 18])
country = pa.array(['UK', 'UK', 'US', 'France'])

In [49]:
name

<pyarrow.lib.StringArray object at 0x7f7bb9aca460>
[
  "John",
  "Cameron",
  "Jacob",
  "Arlette"
]

In [50]:
age

<pyarrow.lib.Int64Array object at 0x7f7b88fca220>
[
  10,
  15,
  20,
  18
]

You can specify the type of an array by passing it as the second argument (`type`), after the values.

For example:

In [51]:
age = pa.array([10, 15, 20, 18], 'int32')
age

<pyarrow.lib.Int32Array object at 0x7f7bb9aca580>
[
  10,
  15,
  20,
  18
]

## Pyarrow table

Pyarrow tables are represented by the `pa.Table` class.

A `Table` is essentially a list of arrays of equal length, each with a string identifier (the column name) and a type.

For example:

In [71]:
table = pa.Table.from_arrays([name, age, country], names=['name','age','country'])
table

pyarrow.Table
name: string
age: int32
country: string
----
name: [["John","Cameron","Jacob","Arlette"]]
age: [[10,15,20,18]]
country: [["UK","UK","US","France"]]

One can access a particular column of a `Table` with the syntax `Table[col]`, where `col` is a column name.

In [72]:
table['age']

<pyarrow.lib.ChunkedArray object at 0x7f7bb9b704f0>
[
  [
    10,
    15,
    20,
    18
  ]
]

A column can also be accessed by its integer position.

In [76]:
table[0]

<pyarrow.lib.ChunkedArray object at 0x7f7bb9ee8810>
[
  [
    "John",
    "Cameron",
    "Jacob",
    "Arlette"
  ]
]

`Table` objects can be constructed **from pandas**

In [84]:
df = pd.DataFrame({ 'name': ['John', 'Cameron', 'Jacob', 'Arlette'],
                    'age': [10, 15, 20, 18],
                    'country': ['UK', 'UK', 'US', 'France']})
df

      name  age  country
0     John   10       UK
1  Cameron   15       UK
2    Jacob   20       US
3  Arlette   18   France

In [83]:
pa.Table.from_pandas(df)

pyarrow.Table
name: string
age: int64
country: string
----
name: [["John","Cameron","Jacob","Arlette"]]
age: [[10,15,20,18]]
country: [["UK","UK","US","France"]]

`Table` objects can be constructed **from a dict defining columns**

In [86]:
d = {'name': ['John', 'Cameron', 'Jacob', 'Arlette'],
     'age': [10, 15, 20, 18],
     'country': ['UK', 'UK', 'US', 'France']}

In [88]:
d

{'name': ['John', 'Cameron', 'Jacob', 'Arlette'],
 'age': [10, 15, 20, 18],
 'country': ['UK', 'UK', 'US', 'France']}

In [95]:
pa.Table.from_pydict(d)

pyarrow.Table
name: string
age: int64
country: string
----
name: [["John","Cameron","Jacob","Arlette"]]
age: [[10,15,20,18]]
country: [["UK","UK","US","France"]]

A `Table` can also be constructed **from a list of dicts defining rows**

In [100]:
plist = [{'name': 'John', 'age': 10, 'country': 'UK'},
         {'name': 'Cameron', 'age': 15, 'country': 'UK'},
         {'name': 'Jacob', 'age': 20, 'country': 'US'},
         {'name': 'Arlette', 'age': 18, 'country': 'France'}]

In [103]:
pa.Table.from_pylist(plist)

pyarrow.Table
name: string
age: int64
country: string
----
name: [["John","Cameron","Jacob","Arlette"]]
age: [[10,15,20,18]]
country: [["UK","UK","US","France"]]

## Pyarrow parquet

In [105]:
import pyarrow.parquet as pq

In [119]:
path_single_parquet = '/Users/dbuchaca/Datasets/CS_2023_12_05/part-00049-80bc68a9-066b-46aa-8fbe-c51575f8a275.c000.snappy.parquet'
path_parquets = '/Users/dbuchaca/Datasets/CS_2023_12_05/'

**`pq.read_metadata`**: reads basic information (metadata) about a `.parquet` file without loading its data.

In [124]:
pq.read_metadata(path_single_parquet)

<pyarrow._parquet.FileMetaData object at 0x7f7bb9e7d4f0>
  created_by: parquet-mr version 1.10.1 (build 1245db4d86bba34408789caad27aed2dea7a5f5b)
  num_columns: 33
  num_rows: 420108
  num_row_groups: 2
  format_version: 1.0
  serialized_size: 15006

**`pq.read_table`**: reads a table from a `.parquet` file; the `columns` argument restricts reading to the listed columns.

In [130]:
asin = pq.read_table(path_single_parquet, columns=['asin'])
title = pq.read_table(path_single_parquet, columns=['title'])
asin_title = pq.read_table(path_single_parquet, columns=['asin', 'title'])

In [153]:
display(asin[0][1])
display(title[0][1])

<pyarrow.StringScalar: 'B0CCZRBS9W'>

<pyarrow.StringScalar: 'Pixy Canvas Floater Frame 29x29 for 0.75 inch Deep Canvas Paintings/Canvas Prints/Wood Canvas Panels/Wall Art/Wall Decor/Home Decor/Artwork, (Rustic Grey, 29 x 29 inch, Square)'>

In [150]:
display(asin_title[0][2])
display(asin_title[1][1])

<pyarrow.StringScalar: 'B08QZS2BSD'>

<pyarrow.StringScalar: 'Pixy Canvas Floater Frame 29x29 for 0.75 inch Deep Canvas Paintings/Canvas Prints/Wood Canvas Panels/Wall Art/Wall Decor/Home Decor/Artwork, (Rustic Grey, 29 x 29 inch, Square)'>

## Pyarrow regex

In [5]:
from pyarrow import compute
from pyarrow.compute import extract_regex

In [6]:
res = extract_regex(['hi there','hello hio'], pattern=r'hi\s')
res

<pyarrow.lib.StructArray object at 0x7f7b8840d040>
-- is_valid:
  [
    true,
    false
  ]

In [7]:
res = extract_regex(['hi there','hello hio'], pattern='hio')

In [8]:
res

<pyarrow.lib.StructArray object at 0x7f7be92b1340>
-- is_valid:
  [
    false,
    true
  ]

In [9]:
res.is_valid()[0]

<pyarrow.BooleanScalar: False>

In [20]:
res.is_valid()

<pyarrow.lib.BooleanArray object at 0x7f7b9078fd60>
[
  false,
  true
]

The result can be converted to a NumPy array.

In [21]:
np.array(res.is_valid())

array([False,  True])