# Accessing your data in DataFrames


---

*To preserve the mystery, select from the notebook menus*

`Edit -> Clear All Outputs`

---

In [1]:
import polars as pl

In [2]:
data_dict = {'letters':['A','B','c','D','eee'], 
             'hundreds':[100,200,300,400,500], 
             'tens':[10.0,20.0,30.0,40.0,50.0],
             'boolean':[True,False,True,True,False]}

In [3]:
df = pl.DataFrame(data_dict)
df

letters,hundreds,tens,boolean
str,i64,f64,bool
"""A""",100,10.0,True
"""B""",200,20.0,False
"""c""",300,30.0,True
"""D""",400,40.0,True
"""eee""",500,50.0,False


## DataFrame Attributes

### Each column has a data "type"

- **String** is a text string
- **Int64** is a 64-bit integer (whole number). The number of bits is just the amount of internal storage used for that number. *For integers it limits how big the number can be.*
- **Float64** is a "floating point" number (number with decimal places). *For floats the number of bits limits the precision of the number.*
- **Boolean** is a boolean value, which is just True/False

### Schema includes both the column names and types

This is handier than `df.dtypes` (which still exists in Polars), because `df.dtypes` lists only the types without the column names.

In [5]:
df.schema

Schema([('letters', String),
        ('hundreds', Int64),
        ('tens', Float64),
        ('boolean', Boolean)])

### Polars has no index

In Pandas, the index is the "row names", but Polars stores its values in a different way (columns), so it's not as focused on accessing rows of data. (There are still ways to do it, but you'll have to get used to not having as easy row access.)

### DataFrame columns

You can get a list of column names

In [5]:
df.columns

['letters', 'hundreds', 'tens', 'boolean']

---

## Returning columns with `.select()`

### Single name gives single column

- This returns a single-column DataFrame, not a Series.
- *This is a shortcut method. The preferred syntax is to explicitly call `pl.col('name')` as shown below.*

In [6]:
df.select('hundreds')

hundreds
i64
100
200
300
400
500


### Comma-separated column names inside the parentheses gives multiple columns

In Pandas you need a Python list, but here you can just pass a comma-separated sequence of names. *Again, this is a shortcut method.*

In [7]:
df.select('tens','hundreds')

tens,hundreds
f64,i64
10.0,100
20.0,200
30.0,300
40.0,400
50.0,500


### Preferred method for specifying a column: `pl.col('name')`

In [11]:
df.select(pl.col('tens','hundreds'))

tens,hundreds
f64,i64
10.0,100
20.0,200
30.0,300
40.0,400
50.0,500


### Other methods for specifying which columns you want – not just by name!

This will give you back all of the `Int64` integer columns

In [10]:
df.select(pl.col(pl.Int64))

hundreds
i64
100
200
300
400
500


In [17]:
df.select(pl.all())

letters,hundreds,tens,boolean
str,i64,f64,bool
"""A""",100,10.0,True
"""B""",200,20.0,False
"""c""",300,30.0,True
"""D""",400,40.0,True
"""eee""",500,50.0,False


#### A RegEx pattern needs both the beginning (`^`) and end (`$`) marks to be treated as a RegEx

In [30]:
df.select(pl.all().exclude('^boo.*$'))

letters,hundreds,tens
str,i64,f64
"""A""",100,10.0
"""B""",200,20.0
"""c""",300,30.0
"""D""",400,40.0
"""eee""",500,50.0


### Polars Selectors

There is a super handy set of generic selectors for columns by type or content: [Polars Selectors docs](https://docs.pola.rs/api/python/stable/reference/selectors.html)

In [8]:
import polars.selectors as cs

In [9]:
df.select(cs.numeric())

hundreds,tens
i64,f64
100,10.0
200,20.0
300,30.0
400,40.0
500,50.0


In [14]:
df.select(cs.string())

letters
str
"""A"""
"""B"""
"""c"""
"""D"""
"""eee"""


In [15]:
df.select(cs.boolean())

boolean
bool
True
False
True
True
False


In [25]:
df.select(cs.contains('en'))

tens
f64
10.0
20.0
30.0
40.0
50.0


In [10]:
df.select(cs.exclude('^boo.*$'))

letters,hundreds,tens
str,i64,f64
"""A""",100,10.0
"""B""",200,20.0
"""c""",300,30.0
"""D""",400,40.0
"""eee""",500,50.0


In [11]:
df.select(cs.exclude('^boo.*$',pl.Int64))

letters,tens
str,f64
"""A""",10.0
"""B""",20.0
"""c""",30.0
"""D""",40.0
"""eee""",50.0


---

## Returning rows using `.filter()`

- In Pandas, you select rows and columns with the same `.loc[rows,columns]` method
- **In Polars, you return rows with `.filter()`**
- So, again, you normally won't use a single method like `.loc[]` in Polars. Instead, you choose
    - columns with `.select()` and
    - rows with `.filter()`

### Boolean tests

- **It's very common to want all the rows from your DataFrame which pass a certain test**, or set of criteria.
- In Polars this test is an expression (whereas in Pandas it would return a boolean Series)

In [16]:
pl.col('tens') < 35

### Return a single-column boolean DataFrame by putting the test in a select call

In [32]:
df.select(pl.col('tens') < 35)

tens
bool
True
True
True
False
False


In [36]:
df.select((pl.col('tens') < 35).alias('tens_lt_35'))

tens_lt_35
bool
True
True
True
False
False


### Rows are returned where boolean test == True

**This avoids one of the screwy bits of Pandas! In Pandas, if we use a single bracket `df[]` with a boolean Series inside, we get back rows instead of columns!**

Here we know we're going to get back rows that pass the test because we're using `.filter()`


In [33]:
df.filter(pl.col('tens') < 35)

letters,hundreds,tens,boolean
str,i64,f64,bool
"""A""",100,10.0,True
"""B""",200,20.0,False
"""c""",300,30.0,True


In [34]:
df.filter(pl.col('boolean'))

letters,hundreds,tens,boolean
str,i64,f64,bool
"""A""",100,10.0,True
"""c""",300,30.0,True
"""D""",400,40.0,True


### More complicated conditionals

You can combine multiple conditions if you put them in parentheses and put a logical operator between.

- `&` = "and"
- `|` = "or"  *(It's called a "pipe" character, and it's `shift-\` above the Enter/Return key)*

In [37]:
df.filter((pl.col('tens') < 35) & (pl.col('hundreds') > 200))

letters,hundreds,tens,boolean
str,i64,f64,bool
"""c""",300,30.0,True


---

*To try the exercise below, if you haven't been executing the cells above, select this cell and from the Jupyter menus choose*

`Run -> Run All Above Selected Cell`

### **EXERCISE**

**Return all rows in `df` where the "hundreds" column value is greater than or equal to 400**

*Note: Type instead of using copy/paste for better retention*

---

### Returning a single item from a DataFrame

*NOTE: Not sure how important this one is. I'm including it for now because I wondered how to do it. This one is a little strange because you're using row indices, which I'm not sure how to find for a given row...*

from the [docs](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.item.html)

- If row/col not provided, this is equivalent to `df[0,0]`, with a check that the shape is (1,1).
- With row/col, this is equivalent to `df[row,col]`.

In [38]:
df.item(0,'letters')

'A'

### Return a single row as a tuple

*Again, not sure how much you'd use this. Filter always gives back a DataFrame. This you have to access through a row integer index, which I'm not sure how you find...*

In [42]:
df.row(0)

('A', 100, 10.0, True)

### Colon `:` for selecting whole or slices

There are ways of doing slicing with Polars, but again, I'm not sure how important it is...

---

*To try the exercise below, if you haven't been executing the cells above, select this cell and from the Jupyter menus choose*

`Run -> Run All Above Selected Cell`

### **EXERCISE**

**Using the `df.filter()` notation, return all columns for all rows in which the "boolean" column is True**

*Note: Type instead of using copy/paste for better retention*

---

## `df.sql()` for returning a DataFrame with SQL

There doesn't seem to be a way of inserting variable values into the SQL string besides standard Python string formatting, but that risks SQL injection if you're not controlling that input.

### Format your query as a string in quotes

- You can refer to column names (that don't have spaces) directly in the expression

In [48]:
df.sql("""
    SELECT * FROM self
    WHERE tens < 35
    """)

letters,hundreds,tens,boolean
str,i64,f64,bool
"""A""",100,10.0,True
"""B""",200,20.0,False
"""c""",300,30.0,True


---

## No SettingWithCopyWarning in Polars

In Pandas, if you select a subset of your DataFrame and assign to a new variable as a way to keep just a couple columns of your DataFrame, you will run into a **SettingWithCopyWarning** if you try to change that new DataFrame. *Because of some Pandas inner-workings, when you slice or index into a DataFrame it's not actually clear whether it will create a "view" into the original DataFrame or a copy of the data!* 

That doesn't happen in Polars. **Columns are Immutable in Polars!**

In [57]:
df_nums = df.select(pl.col('hundreds','tens'))
df_nums

hundreds,tens
i64,f64
100,10.0
200,20.0
300,30.0
400,40.0
500,50.0


## Return DataFrame with new columns using `.with_columns()`

In [58]:
df_nums.with_columns(sums = pl.col('hundreds') + pl.col('tens'))


hundreds,tens,sums
i64,f64,f64
100,10.0,110.0
200,20.0,220.0
300,30.0,330.0
400,40.0,440.0
500,50.0,550.0


In [59]:
df_nums.with_columns((pl.col('hundreds') + pl.col('tens')).alias('sums'))

hundreds,tens,sums
i64,f64,f64
100,10.0,110.0
200,20.0,220.0
300,30.0,330.0
400,40.0,440.0
500,50.0,550.0


### Not sure yet how to change values in Polars since columns are immutable...