## 02. Data Types of Polars

### Polars DataFrame

In [1]:
import polars as pl
import numpy as np

In [2]:
csv_file = './data/titanic.csv'

In [3]:
df = pl.read_csv(csv_file)

df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""
4,1,1,"""Futrelle, Mrs. Jacques Heath (…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S"""
5,0,3,"""Allen, Mr. William Henry""","""male""",35.0,0,0,"""373450""",8.05,,"""S"""


A Polars `DataFrame`:
- is a tabular dataset stored in an Arrow Table (see below)
- has a height and a width
- has unique string column names
- has a data type for each column
- has methods for transforming the data stored in the Arrow Table

In [5]:
# number of columns
df.width

12

In [6]:
# number of rows (samples)
df.height

891

In [7]:
# shape is (number of rows, number of columns)
df.shape

(891, 12)

## Data type schema

Every column in a `DataFrame` has a data type called a `dtype`. <p>
We can get a `pl.Schema` that maps column names to dtypes with the `.schema` attribute.

In [8]:
df.schema

Schema([('PassengerId', Int64),
        ('Survived', Int64),
        ('Pclass', Int64),
        ('Name', String),
        ('Sex', String),
        ('Age', Float64),
        ('SibSp', Int64),
        ('Parch', Int64),
        ('Ticket', String),
        ('Fare', Float64),
        ('Cabin', String),
        ('Embarked', String)])

We can also create a `pl.Schema` manually

In [9]:
pl.Schema(
    [
        ("a", pl.Int64),
        ("b", pl.Float64)
    ]
)

Schema([('a', Int64), ('b', Float64)])

Besides the `pl.Schema`, the dtypes can also be listed with the `.dtypes` attribute.

In [10]:
df.dtypes

[Int64,
 Int64,
 Int64,
 String,
 String,
 Float64,
 Int64,
 Int64,
 String,
 Float64,
 String,
 String]

Naturally, the `dtype` of a single column can be retrieved as well.

In [11]:
df['Name'].dtype

String

## Apache Arrow

pandas stores its data in NumPy arrays by default, whereas Polars stores its data in Arrow.

In [12]:
df.to_arrow()

pyarrow.Table
PassengerId: int64
Survived: int64
Pclass: int64
Name: large_string
Sex: large_string
Age: double
SibSp: int64
Parch: int64
Ticket: large_string
Fare: double
Cabin: large_string
Embarked: large_string
----
PassengerId: [[1,2,3,4,5,...,54,55,56,57,58],[59,60,61,62,63,...,116,117,118,119,120],...,[830,831,832,833,834,...,884,885,886,887,888],[889,890,891]]
Survived: [[0,1,1,1,0,...,1,0,1,1,0],[1,0,0,1,0,...,0,0,0,0,0],...,[1,1,1,0,0,...,0,0,0,0,1],[0,1,0]]
Pclass: [[3,1,3,1,3,...,2,1,1,2,3],[2,3,3,1,1,...,3,3,2,1,3],...,[1,3,2,3,3,...,2,3,3,2,1],[3,1,3]]
Name: [["Braund, Mr. Owen Harris","Cumings, Mrs. John Bradley (Florence Briggs Thayer)","Heikkinen, Miss. Laina","Futrelle, Mrs. Jacques Heath (Lily May Peel)","Allen, Mr. William Henry",...,"Faunthorpe, Mrs. Lizzie (Elizabeth Anne Wilkinson)","Ostby, Mr. Engelhart Cornelius","Woolner, Mr. Hugh","Rugg, Miss. Emily","Novel, Mr. Mansouer"],["West, Miss. Constance Mirium","Goodwin, Master. William Frederick","Sirayanian, Mr. O

In [15]:
type(df['Age'])

polars.series.series.Series

In [17]:
df['Age'].to_arrow()

<pyarrow.lib.DoubleArray object at 0x10af2de40>
[
  22,
  38,
  26,
  35,
  35,
  null,
  54,
  2,
  27,
  14,
  ...
  33,
  22,
  28,
  25,
  39,
  27,
  19,
  null,
  26,
  32
]

In [18]:
# a Polars Series stored in Arrow can also be converted to a NumPy array
df['Age'].to_numpy()

array([22.  , 38.  , 26.  , 35.  , 35.  ,   nan, 54.  ,  2.  , 27.  ,
       14.  ,  4.  , 58.  , 20.  , 39.  , 14.  , 55.  ,  2.  ,   nan,
       31.  ,   nan, 35.  , 34.  , 15.  , 28.  ,  8.  , 38.  ,   nan,
       19.  ,   nan,   nan, 40.  ,   nan,   nan, 66.  , 28.  , 42.  ,
         nan, 21.  , 18.  , 14.  , 40.  , 27.  ,   nan,  3.  , 19.  ,
         nan,   nan,   nan,   nan, 18.  ,  7.  , 21.  , 49.  , 29.  ,
       65.  ,   nan, 21.  , 28.5 ,  5.  , 11.  , 22.  , 38.  , 45.  ,
        4.  ,   nan,   nan, 29.  , 19.  , 17.  , 26.  , 32.  , 16.  ,
       21.  , 26.  , 32.  , 25.  ,   nan,   nan,  0.83, 30.  , 22.  ,
       29.  ,   nan, 28.  , 17.  , 33.  , 16.  ,   nan, 23.  , 24.  ,
       29.  , 20.  , 46.  , 26.  , 59.  ,   nan, 71.  , 23.  , 34.  ,
       34.  , 28.  ,   nan, 21.  , 33.  , 37.  , 28.  , 21.  ,   nan,
       38.  ,   nan, 47.  , 14.5 , 22.  , 20.  , 17.  , 21.  , 70.5 ,
       29.  , 24.  ,  2.  , 21.  ,   nan, 32.5 , 32.5 , 54.  , 12.  ,
         nan, 24.  ,

In [21]:
# NumPy functions can be applied directly to a Polars Series (backed by Arrow arrays)
np.sqrt(df['Age'])

Age
f64
4.690416
6.164414
5.09902
5.91608
5.91608
…
5.196152
4.358899
""
5.09902


Arrow has the advantage of being lighter and faster than NumPy, but <p>
NumPy can be the better choice in the following cases: <p>
* transposing a dataframe
* doing matrix multiplication/linear algebra on a dataframe

## Series and DataFrame

We can create a `Series` from a `DataFrame` column with square brackets

In [22]:
(
    df['Age']
    .head(3)
)

Age
f64
22.0
38.0
26.0


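Note that a `Series` has a name (`Age`) and a dtype (`f64`, 64-bit floating point). The same `Series` can also be obtained by selecting the column with an expression and calling `to_series`: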

In [24]:
(
    df
    .select(
        pl.col("Age")
    )
    .to_series()
    .head(3)
)

Age
f64
22.0
38.0
26.0


We can convert a `Series` into a one-column `DataFrame` using `to_frame`

In [28]:
series_ = df['Name']

print(series_.shape) # (891,) -> a single vector with no column dimension
series_


(891,)


Name
str
"""Braund, Mr. Owen Harris"""
"""Cumings, Mrs. John Bradley (Fl…"
"""Heikkinen, Miss. Laina"""
"""Futrelle, Mrs. Jacques Heath (…"
"""Allen, Mr. William Henry"""
…
"""Montvila, Rev. Juozas"""
"""Graham, Miss. Margaret Edith"""
"""Johnston, Miss. Catherine Hele…"
"""Behr, Mr. Karl Howell"""


In [30]:
frame_ = (
    series_
    .to_frame()
)

print(frame_.shape) # (891, 1) -> now it has a column dimension
frame_

(891, 1)


Name
str
"""Braund, Mr. Owen Harris"""
"""Cumings, Mrs. John Bradley (Fl…"
"""Heikkinen, Miss. Laina"""
"""Futrelle, Mrs. Jacques Heath (…"
"""Allen, Mr. William Henry"""
…
"""Montvila, Rev. Juozas"""
"""Graham, Miss. Margaret Edith"""
"""Johnston, Miss. Catherine Hele…"
"""Behr, Mr. Karl Howell"""


## Create `Series` or `DataFrame` from `list` or `dict`

In [31]:
values = [1, 2, 3]
pl.Series(values)

1
2
3


If the `name` argument is not set then it defaults to an empty string. <p>
The name can be passed as the **first** argument.

In [32]:
pl.Series('Values', values)

Values
i64
1
2
3


We can also convert a `Series` to a `list` with `to_list`

In [35]:
pl.Series(name = 'Values', values = values).to_list()

[1, 2, 3]

In [36]:
# the same call, formatted in a chained, more Polars-like style
(
    pl.Series(
        name = 'Values',
        values = values
    )
    .to_list()
)

[1, 2, 3]

In [38]:
data = [
    [1, 2, 3],
    [4, 5, 6]
]

(
    pl.DataFrame(
        data = data, # the data to turn into a DataFrame
        schema= ["Col0", "Col1"] # column names, one per inner list
    )
)

Col0,Col1
i64,i64
1,4
2,5
3,6


In [42]:
data = {
    "col0" : [1, 2, 3],
    "col1" : [4, 5, 6]
}

ex_df = (
    pl.DataFrame(
        data = data
    )
)

print(ex_df)
print(ex_df.schema)

shape: (3, 2)
┌──────┬──────┐
│ col0 ┆ col1 │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 1    ┆ 4    │
│ 2    ┆ 5    │
│ 3    ┆ 6    │
└──────┴──────┘
Schema({'col0': Int64, 'col1': Int64})


In [43]:
data_dict = {"col0":[1,2,3],"col1":[4,5,6]}

ex_df_2 = (
    pl.DataFrame(
        data_dict,
        schema={
            "col0":pl.Int64,
            "col1":pl.Int32 # We can specify dtypes by passing a `dict` to the schema argument.
        }
    )
)

print(ex_df_2)
print(ex_df_2.schema)

shape: (3, 2)
┌──────┬──────┐
│ col0 ┆ col1 │
│ ---  ┆ ---  │
│ i64  ┆ i32  │
╞══════╪══════╡
│ 1    ┆ 4    │
│ 2    ┆ 5    │
│ 3    ┆ 6    │
└──────┴──────┘
Schema({'col0': Int64, 'col1': Int32})
