# 3. <a id='intro'>Polars</a>

This is the **Polars version** of the original **Lecture 2 (Pandas)**.

We keep the same learning goals and the same section structure:

- **Series**
- **DataFrames**
- **Importing Data**
- **Filtering**
- **Nulls**
- **Duplicates**
- **Groupby**
- **Reshape**
- **Merge**

> Key difference: **Polars has no index** (unlike pandas).  
> In Polars, keep identifiers as explicit columns (e.g., `state`, `date`, `id`), then filter/join using columns.

## 3.1. <a id='def'>Definition</a>

**Polars** is a fast, columnar DataFrame library for Python, implemented in Rust and using the Apache Arrow columnar memory format.  
It emphasizes:

- expression-based transformations (`with_columns`, `select`, `filter`)
- explicit columns (no hidden index)
- performance for larger datasets

In [1]:
import numpy as np
import polars as pl

## 3.2. <a id='series'>Polars Series</a>

A `pl.Series` is a single named, typed column. Unlike a pandas Series, it is **not** tied to an index: elements are addressed by position only.

### 3.2.1. <a id='3.2.1'>From `lists` to `Series`</a>

In [2]:
list_1 = [0.25, 0.5, 0.75, 1.0]
list_1

[0.25, 0.5, 0.75, 1.0]

In [3]:
data = pl.Series([0.25, 0.5, 0.75, 1.0])
data


0.25
0.5
0.75
1.0


In [4]:
data_2 = pl.Series("data_2", list_1)
data_2

data_2
f64
0.25
0.5
0.75
1.0


In [999]:
type(data)

polars.series.series.Series

In [1000]:
data

0.25
0.5
0.75
1.0


In [1001]:
print(data)

shape: (4,)
Series: '' [f64]
[
	0.25
	0.5
	0.75
	1.0
]


### 3.2.2. <a id='3.2.2'> From `NumPy array` to `Series` </a>

In [1002]:
vector_1 = np.array([10, 20, 1, 2, 3, 4, 5, 6, 7])
vector_1

array([10, 20,  1,  2,  3,  4,  5,  6,  7])

In [1003]:
df = pl.DataFrame({"vector_1": vector_1})
df


vector_1
i64
10
20
1
2
3
4
5
6
7


In [1004]:
array = np.array([10, 20, 1, 2, 3, 4, 5, 6, 7])

series1 = pl.Series("series1", array)
series1

series1
i64
10
20
1
2
3
4
5
6
7


### 3.2.3.  <a id='3.2.3'> From `Dictionary` to `Series` </a>

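Because a Polars `Series` has no index, dictionary keys cannot serve as row labels the way they do in pandas. Store the keys and values as two columns of a **DataFrame**, or build a separate `Series` from the keys and from the values.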

In [5]:
population_dict = {
    "California": 38332521,
    "Texas": 26448193,
    "New York": 19651127,
    "Florida": 19552860,
    "Illinois": 12882135,
}

population_df = pl.DataFrame({
    "state": list(population_dict.keys()),
    "population": list(population_dict.values()),
})

population_df

state,population
str,i64
"""California""",38332521
"""Texas""",26448193
"""New York""",19651127
"""Florida""",19552860
"""Illinois""",12882135


In [6]:
pop_series = pl.Series("population", list(population_dict.values()))
pop_series


population
i64
38332521
26448193
19651127
19552860
12882135


In [7]:
state = pl.Series("state", list(population_dict.keys()))
state


state
str
"""California"""
"""Texas"""
"""New York"""
"""Florida"""
"""Illinois"""


### 3.2.4.  <a id='3.2.4'> `Series` vs `NumPy`</a>

Both libraries support vectorized operations, but on different containers:

- Polars: column operations inside a DataFrame
- NumPy: array operations


In [8]:
# polars
claudia_pl = pl.Series("claudia", list(range(5, 21, 2)))
claudia_pl

claudia
i64
5
7
9
11
13
15
17
19


In [9]:
# numpy
claudia = np.arange(5, 21, 2)
claudia_pl = pl.Series("claudia", claudia)

claudia_pl

claudia
i64
5
7
9
11
13
15
17
19


## 3.3.  <a id='3.3'> DataFrame</a>

### 3.3.1. <a id='3.3.1'> DataFrame Generation</a>

#### From `lists` and `dict` to `DataFrame`

In [10]:
students = ["Alejandro", "Pedro", "Ramiro", "Axel", "Juan"]
math     = [15, 16, 10, 12, 13]
english  = [13, 9, 16, 14, 17]
art      = [12, 16, 15, 19, 10]

grades_A = {"Students": students, "Math": math, "English": english, "Art": art}
grades_A

{'Students': ['Alejandro', 'Pedro', 'Ramiro', 'Axel', 'Juan'],
 'Math': [15, 16, 10, 12, 13],
 'English': [13, 9, 16, 14, 17],
 'Art': [12, 16, 15, 19, 10]}

In [11]:
grades_A = {"Students": students, "Math": math, "English": english, "Art": art}
gradesA1 = pl.DataFrame(grades_A)
gradesA1

Students,Math,English,Art
str,i64,i64,i64
"""Alejandro""",15,13,12
"""Pedro""",16,9,16
"""Ramiro""",10,16,15
"""Axel""",12,14,19
"""Juan""",13,17,10


In [12]:
pl.DataFrame(grades_A)

Students,Math,English,Art
str,i64,i64,i64
"""Alejandro""",15,13,12
"""Pedro""",16,9,16
"""Ramiro""",10,16,15
"""Axel""",12,14,19
"""Juan""",13,17,10


#### From `lists` and `NumPy` to `DataFrame`

In [13]:
matrix_1 = np.array([[1, 2, 3],
                     [4, 5, 6],
                     [7, 8, 9]])

df_matrix = pl.DataFrame(matrix_1, schema=["col1", "col2", "col3"])
df_matrix

col1,col2,col3
i64,i64,i64
1,2,3
4,5,6
7,8,9


In [1014]:
col_names = ["a", "b", "c"]

df = pl.DataFrame(matrix_1, schema=col_names)
df

a,b,c
i64,i64,i64
1,2,3
4,5,6
7,8,9


### 3.3.2. <a id='3.3.2'> Indexing</a>

Pandas uses `.loc`/`.iloc`.

Polars uses:
- `select` to choose columns
- `filter` to choose rows by conditions
- `slice` to choose rows by position

In [14]:
# Grades
students = ["Gissela", "Daniel", "Andres", "Sandra", "Rosalyn"]
math     = [16, 14, 17, 17, 17]
english  = [16, 17, 19, 18, 15]
art      = [11, 17, 13, 14, 17]

# Dictionary
diplomado = {"Students": students, "Math": math, "English": english, "Art": art}

# Polars DataFrame
gradesA1 = pl.DataFrame(diplomado)
gradesA1

Students,Math,English,Art
str,i64,i64,i64
"""Gissela""",16,16,11
"""Daniel""",14,17,17
"""Andres""",17,19,13
"""Sandra""",17,18,14
"""Rosalyn""",17,15,17


In [1016]:
gradesA1.slice(1, 1)   # start=1, length=1


Students,Math,English,Art
str,i64,i64,i64
"""Daniel""",14,17,17


In [15]:
gradesA1.filter(pl.col("Math") >= 13)


Students,Math,English,Art
str,i64,i64,i64
"""Gissela""",16,16,11
"""Daniel""",14,17,17
"""Andres""",17,19,13
"""Sandra""",17,18,14
"""Rosalyn""",17,15,17


In [1018]:
gradesA1.slice(1, 3)


Students,Math,English,Art
str,i64,i64,i64
"""Daniel""",14,17,17
"""Andres""",17,19,13
"""Sandra""",17,18,14


### 3.3.3. <a id='3.3.3'> General Methods</a>

- `head`, `tail`
- `describe`
- sorting with `sort`
- adding columns with `with_columns`

In [16]:
deps = {
    "dep": ["Lima", "Piura", "Tumbes", "Cuzco", "Ica", "Puno"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
}

dep1 = pl.DataFrame(deps)
dep1

dep,year,pop
str,i64,f64
"""Lima""",2000,1.5
"""Piura""",2001,1.7
"""Tumbes""",2002,3.6
"""Cuzco""",2001,2.4
"""Ica""",2002,2.9
"""Puno""",2003,3.2


In [1020]:
dep1 = dep1.sort("dep", descending=False)

In [1021]:
dep1

dep,year,pop
str,i64,f64
"""Cuzco""",2001,2.4
"""Ica""",2002,2.9
"""Lima""",2000,1.5
"""Piura""",2001,1.7
"""Puno""",2003,3.2
"""Tumbes""",2002,3.6


In [1022]:
dep1_sort = dep1.sort(["year", "pop"], descending=True)
dep1_sort


dep,year,pop
str,i64,f64
"""Puno""",2003,3.2
"""Tumbes""",2002,3.6
"""Ica""",2002,2.9
"""Cuzco""",2001,2.4
"""Piura""",2001,1.7
"""Lima""",2000,1.5


In [1023]:
dep1

dep,year,pop
str,i64,f64
"""Cuzco""",2001,2.4
"""Ica""",2002,2.9
"""Lima""",2000,1.5
"""Piura""",2001,1.7
"""Puno""",2003,3.2
"""Tumbes""",2002,3.6


In [1024]:
# Grades
students = [ "Gissela", "Daniel", "Andres", "Sandra", "Rosalyn" ]
math     = [ 16, 14, 17, 17, 17 ]
english  = [ 16, 17, 19, 18, 15 ]
art      = [ 11, 17, 13, 14, 17 ]

# Dictionary
diplomado = {'Students':students, 'Math':math, 'English':english, 'Art':art}
gradesA1 = pl.DataFrame( data = diplomado )
gradesA1

Students,Math,English,Art
str,i64,i64,i64
"""Gissela""",16,16,11
"""Daniel""",14,17,17
"""Andres""",17,19,13
"""Sandra""",17,18,14
"""Rosalyn""",17,15,17


In [1025]:
gradesA1 = gradesA1.with_columns(
    (((pl.col("Math") + pl.col("English") + pl.col("Art")) / 3).round(2)).alias("avg")
)
gradesA1

Students,Math,English,Art,avg
str,i64,i64,i64,f64
"""Gissela""",16,16,11,14.33
"""Daniel""",14,17,17,16.0
"""Andres""",17,19,13,16.33
"""Sandra""",17,18,14,16.33
"""Rosalyn""",17,15,17,16.33


In [1026]:
# head
gradesA1.head( 4 )


Students,Math,English,Art,avg
str,i64,i64,i64,f64
"""Gissela""",16,16,11,14.33
"""Daniel""",14,17,17,16.0
"""Andres""",17,19,13,16.33
"""Sandra""",17,18,14,16.33


In [1027]:
gradesA1.head(3)

Students,Math,English,Art,avg
str,i64,i64,i64,f64
"""Gissela""",16,16,11,14.33
"""Daniel""",14,17,17,16.0
"""Andres""",17,19,13,16.33


In [17]:
# add new data gradesA2
students = [ "Rebeca", "Xavi", "Cristiano", "Ronaldo", "Leo" ]
math     = [ 15, 18, 14, 7, 10 ]
english  = [ 18, 9, 11, 12, 20 ]
art      = [ 10, 16, 20, 19, 5 ]

# Dictionary
grades_A2 = {'Students':students, 'Math':math, 'English':english, 'Art':art}
gradesA2 = pl.DataFrame( grades_A2 )
print( gradesA2 , "\n")

shape: (5, 4)
┌───────────┬──────┬─────────┬─────┐
│ Students  ┆ Math ┆ English ┆ Art │
│ ---       ┆ ---  ┆ ---     ┆ --- │
│ str       ┆ i64  ┆ i64     ┆ i64 │
╞═══════════╪══════╪═════════╪═════╡
│ Rebeca    ┆ 15   ┆ 18      ┆ 10  │
│ Xavi      ┆ 18   ┆ 9       ┆ 16  │
│ Cristiano ┆ 14   ┆ 11      ┆ 20  │
│ Ronaldo   ┆ 7    ┆ 12      ┆ 19  │
│ Leo       ┆ 10   ┆ 20      ┆ 5   │
└───────────┴──────┴─────────┴─────┘ 



In [1029]:
gradesA2

Students,Math,English,Art
str,i64,i64,i64
"""Rebeca""",15,18,10
"""Xavi""",18,9,16
"""Cristiano""",14,11,20
"""Ronaldo""",7,12,19
"""Leo""",10,20,5


In [1030]:
gradesA1

Students,Math,English,Art,avg
str,i64,i64,i64,f64
"""Gissela""",16,16,11,14.33
"""Daniel""",14,17,17,16.0
"""Andres""",17,19,13,16.33
"""Sandra""",17,18,14,16.33
"""Rosalyn""",17,15,17,16.33


In [18]:
grades_total = pl.concat([gradesA1, gradesA2], how="diagonal")
grades_total

Students,Math,English,Art
str,i64,i64,i64
"""Gissela""",16,16,11
"""Daniel""",14,17,17
"""Andres""",17,19,13
"""Sandra""",17,18,14
"""Rosalyn""",17,15,17
"""Rebeca""",15,18,10
"""Xavi""",18,9,16
"""Cristiano""",14,11,20
"""Ronaldo""",7,12,19
"""Leo""",10,20,5


In [1032]:
grades_total.slice(9, 1)


Students,Math,English,Art,avg
str,i64,i64,i64,f64
"""Leo""",10,20,5,


In [1033]:
gradesA1_1 = gradesA1.clone()  
grades_total = pl.concat([gradesA1_1, gradesA2], how="diagonal")
grades_total


Students,Math,English,Art,avg
str,i64,i64,i64,f64
"""Gissela""",16,16,11,14.33
"""Daniel""",14,17,17,16.0
"""Andres""",17,19,13,16.33
"""Sandra""",17,18,14,16.33
"""Rosalyn""",17,15,17,16.33
"""Rebeca""",15,18,10,
"""Xavi""",18,9,16,
"""Cristiano""",14,11,20,
"""Ronaldo""",7,12,19,
"""Leo""",10,20,5,


### 3.3.4. <a id='3.3.4'> Importing Data</a>
Polars does not read `.sav` directly, so we use **pyreadstat** to read the file, then convert to Polars.

The cell below assumes the file exists at that path. If it is missing, the read fails, so substitute a small placeholder DataFrame to keep the notebook running.

In [19]:
from pathlib import Path
import pyreadstat
import polars as pl

file_path = Path("..") / "_data" / "CAP_100_URBANO_RURAL_3.sav"  # because you're in Lectures/
df_pd, meta = pyreadstat.read_sav(str(file_path))

enapres2020_1 = pl.from_pandas(df_pd)
enapres2020_1


PER,ANIO,MES,CONGLOMERADO,NSELV,TSELV,VIVREM,NUMVIVREM,AREA,CCDD,NOMBREDD,CCPP,NOMBREPP,CCDI,NOMBREDI,VIVIENDA,TOT_HOGAR,HOGAR,RESFIN,P100_C,P101,P101_O,P102A,P102A_O,P103,P103_O,P104B,P104B_O,P105,P106,P106_O,P106A,P107,P107A,P107A_O,P107B,P107B_O,…,P184A_11,P184A_12,P184A_13,P184A_14,P184A_15,P184A_16,P184A_17,P184A_18,P185,P185A,P186,P186A,P187_1,P187_2,P187_3,P187_4,P187_5,P187_6,P187_7,P187_8,P187_9,P187_9_O,P188,P189_1,P189_2,P189_3,P189_4,P189_5,P189_6,P189_7,P189_8,P189_8_O,REGIONNATU,ESTRATO,MOD_ENC,FACTOR,FACTOR_CALIBRADO
f64,str,str,str,str,f64,f64,f64,f64,str,str,str,str,str,str,str,f64,f64,f64,f64,f64,str,f64,str,f64,str,f64,str,f64,f64,str,f64,f64,f64,str,f64,str,…,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,str,f64,f64,f64,f64
2.0,"""2020""","""07""","""04046""","""0046""",5.0,0.0,,2.0,"""14""","""LAMBAYEQUE""","""01""","""CHICLAYO""","""08""","""MONSEFU""","""003""",1.0,1.0,1.0,1.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""1""",,2.0,121.27014,
2.0,"""2020""","""09""","""03981""","""0120""",1.0,0.0,,1.0,"""14""","""LAMBAYEQUE""","""01""","""CHICLAYO""","""06""","""LA VICTORIA""","""006""",1.0,1.0,1.0,1.0,1.0,"""""",1.0,"""""",5.0,"""""",4.0,"""""",4.0,2.0,"""""",1.0,1.0,5.0,"""""",4.0,"""""",…,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,"""""",2.0,,,,,,,,,"""""","""1""",4.0,1.0,245.10805,395.8496
2.0,"""2020""","""09""","""03981""","""0189""",1.0,0.0,,1.0,"""14""","""LAMBAYEQUE""","""01""","""CHICLAYO""","""06""","""LA VICTORIA""","""008""",2.0,1.0,1.0,1.0,1.0,"""""",1.0,"""""",6.0,"""""",4.0,"""""",2.0,2.0,"""""",1.0,1.0,3.0,"""""",4.0,"""""",…,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,,,,2.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,"""""",2.0,,,,,,,,,"""""","""1""",4.0,1.0,245.10805,395.8496
2.0,"""2020""","""09""","""03981""","""0189""",1.0,0.0,,1.0,"""14""","""LAMBAYEQUE""","""01""","""CHICLAYO""","""06""","""LA VICTORIA""","""008""",2.0,2.0,1.0,1.0,1.0,"""""",1.0,"""""",6.0,"""""",4.0,"""""",2.0,2.0,"""""",1.0,1.0,3.0,"""""",4.0,"""""",…,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,,,,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,"""""",2.0,,,,,,,,,"""""","""1""",4.0,1.0,245.10805,395.8496
2.0,"""2020""","""07""","""04046""","""0043""",5.0,0.0,,2.0,"""14""","""LAMBAYEQUE""","""01""","""CHICLAYO""","""08""","""MONSEFU""","""001""",1.0,1.0,1.0,1.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""1""",,2.0,121.27014,
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
2.0,"""2020""","""05""","""51489""","""0073""",1.0,0.0,,1.0,"""18""","""MOQUEGUA""","""03""","""ILO""","""02""","""EL ALGARROBAL""","""007""",1.0,1.0,1.0,1.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""1""",4.0,2.0,8.948965,
2.0,"""2020""","""05""","""51489""","""0079""",1.0,0.0,,1.0,"""18""","""MOQUEGUA""","""03""","""ILO""","""02""","""EL ALGARROBAL""","""008""",1.0,1.0,1.0,1.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""1""",4.0,2.0,8.948965,
3.0,"""2020""","""08""","""51548""","""0026""",1.0,0.0,,1.0,"""23""","""TACNA""","""01""","""TACNA""","""10""","""CORONEL GREGORIO ALBARRACIN LA…","""005""",1.0,1.0,1.0,1.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""1""",5.0,2.0,102.00987,
3.0,"""2020""","""08""","""51548""","""0046""",1.0,0.0,,1.0,"""23""","""TACNA""","""01""","""TACNA""","""10""","""CORONEL GREGORIO ALBARRACIN LA…","""008""",1.0,1.0,1.0,2.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""1""",5.0,2.0,102.00987,


In [20]:
enapres2020_1.shape

(42153, 491)

### 3.3.5. <a id='3.3.5'>Filtering data</a>

In [21]:
df_urban_main = enapres2020_1.filter(pl.col("AREA") == 1)
df_urban_main


PER,ANIO,MES,CONGLOMERADO,NSELV,TSELV,VIVREM,NUMVIVREM,AREA,CCDD,NOMBREDD,CCPP,NOMBREPP,CCDI,NOMBREDI,VIVIENDA,TOT_HOGAR,HOGAR,RESFIN,P100_C,P101,P101_O,P102A,P102A_O,P103,P103_O,P104B,P104B_O,P105,P106,P106_O,P106A,P107,P107A,P107A_O,P107B,P107B_O,…,P184A_11,P184A_12,P184A_13,P184A_14,P184A_15,P184A_16,P184A_17,P184A_18,P185,P185A,P186,P186A,P187_1,P187_2,P187_3,P187_4,P187_5,P187_6,P187_7,P187_8,P187_9,P187_9_O,P188,P189_1,P189_2,P189_3,P189_4,P189_5,P189_6,P189_7,P189_8,P189_8_O,REGIONNATU,ESTRATO,MOD_ENC,FACTOR,FACTOR_CALIBRADO
f64,str,str,str,str,f64,f64,f64,f64,str,str,str,str,str,str,str,f64,f64,f64,f64,f64,str,f64,str,f64,str,f64,str,f64,f64,str,f64,f64,f64,str,f64,str,…,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,str,f64,f64,f64,f64
2.0,"""2020""","""09""","""03981""","""0120""",1.0,0.0,,1.0,"""14""","""LAMBAYEQUE""","""01""","""CHICLAYO""","""06""","""LA VICTORIA""","""006""",1.0,1.0,1.0,1.0,1.0,"""""",1.0,"""""",5.0,"""""",4.0,"""""",4.0,2.0,"""""",1.0,1.0,5.0,"""""",4.0,"""""",…,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,"""""",2.0,,,,,,,,,"""""","""1""",4.0,1.0,245.10805,395.8496
2.0,"""2020""","""09""","""03981""","""0189""",1.0,0.0,,1.0,"""14""","""LAMBAYEQUE""","""01""","""CHICLAYO""","""06""","""LA VICTORIA""","""008""",2.0,1.0,1.0,1.0,1.0,"""""",1.0,"""""",6.0,"""""",4.0,"""""",2.0,2.0,"""""",1.0,1.0,3.0,"""""",4.0,"""""",…,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,,,,2.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,"""""",2.0,,,,,,,,,"""""","""1""",4.0,1.0,245.10805,395.8496
2.0,"""2020""","""09""","""03981""","""0189""",1.0,0.0,,1.0,"""14""","""LAMBAYEQUE""","""01""","""CHICLAYO""","""06""","""LA VICTORIA""","""008""",2.0,2.0,1.0,1.0,1.0,"""""",1.0,"""""",6.0,"""""",4.0,"""""",2.0,2.0,"""""",1.0,1.0,3.0,"""""",4.0,"""""",…,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,,,,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,"""""",2.0,,,,,,,,,"""""","""1""",4.0,1.0,245.10805,395.8496
2.0,"""2020""","""09""","""06703""","""0037""",1.0,0.0,,1.0,"""01""","""AMAZONAS""","""07""","""UTCUBAMBA""","""01""","""BAGUA GRANDE""","""004""",1.0,1.0,1.0,1.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""3""",5.0,2.0,79.14997,
2.0,"""2020""","""09""","""03981""","""0057""",1.0,0.0,,1.0,"""14""","""LAMBAYEQUE""","""01""","""CHICLAYO""","""06""","""LA VICTORIA""","""004""",1.0,1.0,1.0,1.0,1.0,"""""",3.0,"""""",5.0,"""""",4.0,"""""",3.0,5.0,"""""",16.0,3.0,,"""""",,"""""",…,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,2.0,,2.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,"""""",2.0,,,,,,,,,"""""","""1""",4.0,1.0,245.10805,395.8496
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
2.0,"""2020""","""05""","""51489""","""0073""",1.0,0.0,,1.0,"""18""","""MOQUEGUA""","""03""","""ILO""","""02""","""EL ALGARROBAL""","""007""",1.0,1.0,1.0,1.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""1""",4.0,2.0,8.948965,
2.0,"""2020""","""05""","""51489""","""0079""",1.0,0.0,,1.0,"""18""","""MOQUEGUA""","""03""","""ILO""","""02""","""EL ALGARROBAL""","""008""",1.0,1.0,1.0,1.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""1""",4.0,2.0,8.948965,
3.0,"""2020""","""08""","""51548""","""0026""",1.0,0.0,,1.0,"""23""","""TACNA""","""01""","""TACNA""","""10""","""CORONEL GREGORIO ALBARRACIN LA…","""005""",1.0,1.0,1.0,1.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""1""",5.0,2.0,102.00987,
3.0,"""2020""","""08""","""51548""","""0046""",1.0,0.0,,1.0,"""23""","""TACNA""","""01""","""TACNA""","""10""","""CORONEL GREGORIO ALBARRACIN LA…","""008""",1.0,1.0,1.0,2.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""1""",5.0,2.0,102.00987,


In [25]:
# Map codes -> labels for RESFIN using the SPSS metadata returned by pyreadstat.
# `meta.variable_value_labels` is a dict of {variable: {code: label}}.
resfin_map = meta.variable_value_labels.get("RESFIN", {})

map_resfin = pl.DataFrame(
    {
        "RESFIN": [float(k) for k in resfin_map.keys()],
        "RESFIN_label": list(resfin_map.values()),
    },
    schema={"RESFIN": pl.Float64, "RESFIN_label": pl.Utf8},
)

# Join the labels and keep completed interviews ("Completa").
# df_urban_main itself is left unchanged.
df_urban = (
    df_urban_main
    .join(map_resfin, on="RESFIN", how="left")
    .filter(pl.col("RESFIN_label") == "Completa")
)
df_urban

### 3.3.6. <a id='3.3.6'>Dealing with nulls</a>

Polars null handling:
- `drop_nulls`
- `fill_null`
- `is_null`

In [27]:
# Continue with all urban rows (no RESFIN filter) for the null-handling examples
df_urban = df_urban_main
df_urban_no_null = df_urban.drop_nulls(["P172D"])
df_urban_no_null


PER,ANIO,MES,CONGLOMERADO,NSELV,TSELV,VIVREM,NUMVIVREM,AREA,CCDD,NOMBREDD,CCPP,NOMBREPP,CCDI,NOMBREDI,VIVIENDA,TOT_HOGAR,HOGAR,RESFIN,P100_C,P101,P101_O,P102A,P102A_O,P103,P103_O,P104B,P104B_O,P105,P106,P106_O,P106A,P107,P107A,P107A_O,P107B,P107B_O,…,P184A_11,P184A_12,P184A_13,P184A_14,P184A_15,P184A_16,P184A_17,P184A_18,P185,P185A,P186,P186A,P187_1,P187_2,P187_3,P187_4,P187_5,P187_6,P187_7,P187_8,P187_9,P187_9_O,P188,P189_1,P189_2,P189_3,P189_4,P189_5,P189_6,P189_7,P189_8,P189_8_O,REGIONNATU,ESTRATO,MOD_ENC,FACTOR,FACTOR_CALIBRADO
f64,str,str,str,str,f64,f64,f64,f64,str,str,str,str,str,str,str,f64,f64,f64,f64,f64,str,f64,str,f64,str,f64,str,f64,f64,str,f64,f64,f64,str,f64,str,…,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,str,f64,f64,f64,f64
2.0,"""2020""","""09""","""03981""","""0120""",1.0,0.0,,1.0,"""14""","""LAMBAYEQUE""","""01""","""CHICLAYO""","""06""","""LA VICTORIA""","""006""",1.0,1.0,1.0,1.0,1.0,"""""",1.0,"""""",5.0,"""""",4.0,"""""",4.0,2.0,"""""",1.0,1.0,5.0,"""""",4.0,"""""",…,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,"""""",2.0,,,,,,,,,"""""","""1""",4.0,1.0,245.10805,395.8496
2.0,"""2020""","""09""","""03981""","""0189""",1.0,0.0,,1.0,"""14""","""LAMBAYEQUE""","""01""","""CHICLAYO""","""06""","""LA VICTORIA""","""008""",2.0,1.0,1.0,1.0,1.0,"""""",1.0,"""""",6.0,"""""",4.0,"""""",2.0,2.0,"""""",1.0,1.0,3.0,"""""",4.0,"""""",…,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,,,,2.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,"""""",2.0,,,,,,,,,"""""","""1""",4.0,1.0,245.10805,395.8496
2.0,"""2020""","""09""","""03981""","""0189""",1.0,0.0,,1.0,"""14""","""LAMBAYEQUE""","""01""","""CHICLAYO""","""06""","""LA VICTORIA""","""008""",2.0,2.0,1.0,1.0,1.0,"""""",1.0,"""""",6.0,"""""",4.0,"""""",2.0,2.0,"""""",1.0,1.0,3.0,"""""",4.0,"""""",…,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,,,,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,"""""",2.0,,,,,,,,,"""""","""1""",4.0,1.0,245.10805,395.8496
2.0,"""2020""","""09""","""06703""","""0037""",1.0,0.0,,1.0,"""01""","""AMAZONAS""","""07""","""UTCUBAMBA""","""01""","""BAGUA GRANDE""","""004""",1.0,1.0,1.0,1.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""3""",5.0,2.0,79.14997,
2.0,"""2020""","""09""","""03981""","""0057""",1.0,0.0,,1.0,"""14""","""LAMBAYEQUE""","""01""","""CHICLAYO""","""06""","""LA VICTORIA""","""004""",1.0,1.0,1.0,1.0,1.0,"""""",3.0,"""""",5.0,"""""",4.0,"""""",3.0,5.0,"""""",16.0,3.0,,"""""",,"""""",…,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,2.0,,2.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,"""""",2.0,,,,,,,,,"""""","""1""",4.0,1.0,245.10805,395.8496
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
2.0,"""2020""","""05""","""51489""","""0073""",1.0,0.0,,1.0,"""18""","""MOQUEGUA""","""03""","""ILO""","""02""","""EL ALGARROBAL""","""007""",1.0,1.0,1.0,1.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""1""",4.0,2.0,8.948965,
2.0,"""2020""","""05""","""51489""","""0079""",1.0,0.0,,1.0,"""18""","""MOQUEGUA""","""03""","""ILO""","""02""","""EL ALGARROBAL""","""008""",1.0,1.0,1.0,1.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""1""",4.0,2.0,8.948965,
3.0,"""2020""","""08""","""51548""","""0026""",1.0,0.0,,1.0,"""23""","""TACNA""","""01""","""TACNA""","""10""","""CORONEL GREGORIO ALBARRACIN LA…","""005""",1.0,1.0,1.0,1.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""1""",5.0,2.0,102.00987,
3.0,"""2020""","""08""","""51548""","""0046""",1.0,0.0,,1.0,"""23""","""TACNA""","""01""","""TACNA""","""10""","""CORONEL GREGORIO ALBARRACIN LA…","""008""",1.0,1.0,1.0,2.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""1""",5.0,2.0,102.00987,


In [28]:
df_urban_fill = df_urban.with_columns(
    # Cast to string first so the "MISSING" fill value matches the column dtype
    pl.col("P172D").cast(pl.Utf8).fill_null("MISSING").alias("P172D_filled")
)
df_urban_fill


PER,ANIO,MES,CONGLOMERADO,NSELV,TSELV,VIVREM,NUMVIVREM,AREA,CCDD,NOMBREDD,CCPP,NOMBREPP,CCDI,NOMBREDI,VIVIENDA,TOT_HOGAR,HOGAR,RESFIN,P100_C,P101,P101_O,P102A,P102A_O,P103,P103_O,P104B,P104B_O,P105,P106,P106_O,P106A,P107,P107A,P107A_O,P107B,P107B_O,…,P184A_12,P184A_13,P184A_14,P184A_15,P184A_16,P184A_17,P184A_18,P185,P185A,P186,P186A,P187_1,P187_2,P187_3,P187_4,P187_5,P187_6,P187_7,P187_8,P187_9,P187_9_O,P188,P189_1,P189_2,P189_3,P189_4,P189_5,P189_6,P189_7,P189_8,P189_8_O,REGIONNATU,ESTRATO,MOD_ENC,FACTOR,FACTOR_CALIBRADO,P172D_filled
f64,str,str,str,str,f64,f64,f64,f64,str,str,str,str,str,str,str,f64,f64,f64,f64,f64,str,f64,str,f64,str,f64,str,f64,f64,str,f64,f64,f64,str,f64,str,…,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,str,f64,f64,f64,f64,str
2.0,"""2020""","""09""","""03981""","""0120""",1.0,0.0,,1.0,"""14""","""LAMBAYEQUE""","""01""","""CHICLAYO""","""06""","""LA VICTORIA""","""006""",1.0,1.0,1.0,1.0,1.0,"""""",1.0,"""""",5.0,"""""",4.0,"""""",4.0,2.0,"""""",1.0,1.0,5.0,"""""",4.0,"""""",…,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,"""""",2.0,,,,,,,,,"""""","""1""",4.0,1.0,245.10805,395.8496,"""1.0"""
2.0,"""2020""","""09""","""03981""","""0189""",1.0,0.0,,1.0,"""14""","""LAMBAYEQUE""","""01""","""CHICLAYO""","""06""","""LA VICTORIA""","""008""",2.0,1.0,1.0,1.0,1.0,"""""",1.0,"""""",6.0,"""""",4.0,"""""",2.0,2.0,"""""",1.0,1.0,3.0,"""""",4.0,"""""",…,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,,,,2.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,"""""",2.0,,,,,,,,,"""""","""1""",4.0,1.0,245.10805,395.8496,"""1.0"""
2.0,"""2020""","""09""","""03981""","""0189""",1.0,0.0,,1.0,"""14""","""LAMBAYEQUE""","""01""","""CHICLAYO""","""06""","""LA VICTORIA""","""008""",2.0,2.0,1.0,1.0,1.0,"""""",1.0,"""""",6.0,"""""",4.0,"""""",2.0,2.0,"""""",1.0,1.0,3.0,"""""",4.0,"""""",…,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,,,,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,"""""",2.0,,,,,,,,,"""""","""1""",4.0,1.0,245.10805,395.8496,"""1.0"""
2.0,"""2020""","""09""","""06703""","""0037""",1.0,0.0,,1.0,"""01""","""AMAZONAS""","""07""","""UTCUBAMBA""","""01""","""BAGUA GRANDE""","""004""",1.0,1.0,1.0,1.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,0.0,0.0,0.0,0.0,1.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""3""",5.0,2.0,79.14997,,"""1.0"""
2.0,"""2020""","""09""","""03981""","""0057""",1.0,0.0,,1.0,"""14""","""LAMBAYEQUE""","""01""","""CHICLAYO""","""06""","""LA VICTORIA""","""004""",1.0,1.0,1.0,1.0,1.0,"""""",3.0,"""""",5.0,"""""",4.0,"""""",3.0,5.0,"""""",16.0,3.0,,"""""",,"""""",…,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,2.0,,2.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,"""""",2.0,,,,,,,,,"""""","""1""",4.0,1.0,245.10805,395.8496,"""1.0"""
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
2.0,"""2020""","""05""","""51489""","""0073""",1.0,0.0,,1.0,"""18""","""MOQUEGUA""","""03""","""ILO""","""02""","""EL ALGARROBAL""","""007""",1.0,1.0,1.0,1.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""1""",4.0,2.0,8.948965,,"""1.0"""
2.0,"""2020""","""05""","""51489""","""0079""",1.0,0.0,,1.0,"""18""","""MOQUEGUA""","""03""","""ILO""","""02""","""EL ALGARROBAL""","""008""",1.0,1.0,1.0,1.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""1""",4.0,2.0,8.948965,,"""1.0"""
3.0,"""2020""","""08""","""51548""","""0026""",1.0,0.0,,1.0,"""23""","""TACNA""","""01""","""TACNA""","""10""","""CORONEL GREGORIO ALBARRACIN LA…","""005""",1.0,1.0,1.0,1.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""1""",5.0,2.0,102.00987,,"""2.0"""
3.0,"""2020""","""08""","""51548""","""0046""",1.0,0.0,,1.0,"""23""","""TACNA""","""01""","""TACNA""","""10""","""CORONEL GREGORIO ALBARRACIN LA…","""008""",1.0,1.0,1.0,2.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""1""",5.0,2.0,102.00987,,"""1.0"""


In [1039]:
df_urban = df_urban.with_columns(
    pl.concat_str(
        [
            pl.col("PER").cast(pl.Utf8),
            pl.col("MES").cast(pl.Utf8),
            pl.col("CCDD").cast(pl.Utf8),
            pl.col("CCPP").cast(pl.Utf8),
            pl.col("CCDI").cast(pl.Utf8),
            pl.col("CONGLOMERADO").cast(pl.Utf8),
            pl.col("NSELV").cast(pl.Utf8),
            pl.col("VIVIENDA").cast(pl.Utf8),
            pl.col("HOGAR").cast(pl.Int64).cast(pl.Utf8),
        ],
        separator="_"
    ).alias("ID")
)

# Equivalent to pandas .is_unique
is_unique = df_urban.select((pl.col("ID").n_unique() == pl.len()).alias("is_unique")).item()
is_unique



True

### 3.3.7. <a id='3.3.7'>Duplicates</a>

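- `is_duplicated` marks rows that repeat across **all** columns; to flag repeats on a subset of key columns, count rows per key with `pl.len().over(cols)` and keep counts above 1
- `unique(subset=...)` removes duplicates, keeping one row per key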

In [29]:
cols = ["CCDD", "CCPP", "CCDI", "CONGLOMERADO", "NSELV", "VIVIENDA", "HOGAR"]

dup_rows = (
    df_urban
    .with_columns(pl.len().over(cols).alias("_n"))
    .filter(pl.col("_n") > 1)
    .drop("_n")
)

dup_rows

PER,ANIO,MES,CONGLOMERADO,NSELV,TSELV,VIVREM,NUMVIVREM,AREA,CCDD,NOMBREDD,CCPP,NOMBREPP,CCDI,NOMBREDI,VIVIENDA,TOT_HOGAR,HOGAR,RESFIN,P100_C,P101,P101_O,P102A,P102A_O,P103,P103_O,P104B,P104B_O,P105,P106,P106_O,P106A,P107,P107A,P107A_O,P107B,P107B_O,…,P184A_11,P184A_12,P184A_13,P184A_14,P184A_15,P184A_16,P184A_17,P184A_18,P185,P185A,P186,P186A,P187_1,P187_2,P187_3,P187_4,P187_5,P187_6,P187_7,P187_8,P187_9,P187_9_O,P188,P189_1,P189_2,P189_3,P189_4,P189_5,P189_6,P189_7,P189_8,P189_8_O,REGIONNATU,ESTRATO,MOD_ENC,FACTOR,FACTOR_CALIBRADO
f64,str,str,str,str,f64,f64,f64,f64,str,str,str,str,str,str,str,f64,f64,f64,f64,f64,str,f64,str,f64,str,f64,str,f64,f64,str,f64,f64,f64,str,f64,str,…,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,str,f64,f64,f64,f64
1.0,"""2020""","""07""","""39066""","""0140""",1.0,0.0,,1.0,"""17""","""MADRE DE DIOS""","""03""","""TAHUAMANU""","""02""","""IBERIA""","""014""",1.0,1.0,1.0,1.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""3""",4.0,2.0,17.688435,
3.0,"""2020""","""12""","""36856""","""0003""",1.0,0.0,,1.0,"""03""","""APURIMAC""","""05""","""COTABAMBAS""","""04""","""HAQUIRA""","""001""",1.0,1.0,1.0,2.0,2.0,"""""",1.0,"""""",5.0,"""""",4.0,"""""",3.0,2.0,"""""",14.0,1.0,1.0,"""""",1.0,"""""",…,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,2.0,2.0,2.0,"""""",2.0,,,,,,,,,"""""","""2""",4.0,1.0,68.09559,178.85709
1.0,"""2020""","""02""","""06712""","""0020""",1.0,0.0,,1.0,"""01""","""AMAZONAS""","""07""","""UTCUBAMBA""","""01""","""BAGUA GRANDE""","""002""",1.0,1.0,2.0,1.0,1.0,"""""",3.0,"""""",5.0,"""""",4.0,"""""",2.0,2.0,"""""",1.0,1.0,1.0,"""""",1.0,"""""",…,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,,1.0,1.0,2.0,2.0,2.0,"""""",2.0,,,,,,,,,"""""","""3""",4.0,1.0,66.73552,172.22223
3.0,"""2020""","""11""","""06712""","""0020""",1.0,0.0,,1.0,"""01""","""AMAZONAS""","""07""","""UTCUBAMBA""","""01""","""BAGUA GRANDE""","""002""",1.0,1.0,1.0,1.0,1.0,"""""",3.0,"""""",6.0,"""""",4.0,"""""",3.0,2.0,"""""",9.0,1.0,1.0,"""""",1.0,"""""",…,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,,,,1.0,2.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,"""""",2.0,,,,,,,,,"""""","""3""",3.0,1.0,102.685425,154.64967
2.0,"""2020""","""05""","""15120""","""0110""",1.0,0.0,,1.0,"""25""","""UCAYALI""","""03""","""PADRE ABAD""","""01""","""PADRE ABAD""","""015""",1.0,1.0,1.0,1.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""3""",4.0,2.0,15.969518,
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
2.0,"""2020""","""06""","""39066""","""0140""",1.0,0.0,,1.0,"""17""","""MADRE DE DIOS""","""03""","""TAHUAMANU""","""02""","""IBERIA""","""014""",1.0,1.0,1.0,1.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""3""",4.0,2.0,42.489746,
3.0,"""2020""","""07""","""43264""","""0008""",1.0,0.0,,1.0,"""04""","""AREQUIPA""","""01""","""AREQUIPA""","""16""","""SABANDIA""","""001""",1.0,1.0,1.0,1.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""2""",4.0,2.0,526.62616,
1.0,"""2020""","""10""","""43264""","""0008""",1.0,0.0,,1.0,"""04""","""AREQUIPA""","""01""","""AREQUIPA""","""16""","""SABANDIA""","""001""",1.0,1.0,1.0,1.0,1.0,"""""",1.0,"""""",5.0,"""""",1.0,"""""",7.0,2.0,"""""",1.0,1.0,3.0,"""""",4.0,"""""",…,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,3.0,1.0,3.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,2.0,2.0,"""""",2.0,,,,,,,,,"""""","""2""",5.0,1.0,241.59677,360.54993
2.0,"""2020""","""01""","""43856""","""0005""",1.0,0.0,,1.0,"""18""","""MOQUEGUA""","""01""","""MARISCAL NIETO""","""01""","""MOQUEGUA""","""001""",1.0,1.0,2.0,1.0,1.0,"""""",1.0,"""""",5.0,"""""",1.0,"""""",3.0,2.0,"""""",1.0,1.0,1.0,"""""",2.0,"""""",…,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,2.0,2.0,2.0,2.0,2.0,"""""",2.0,,,,,,,,,"""""","""1""",4.0,1.0,16.738457,43.47482


In [30]:
df_urban_no_dpl = df_urban.unique(subset=cols, keep="first")
df_urban_no_dpl

PER,ANIO,MES,CONGLOMERADO,NSELV,TSELV,VIVREM,NUMVIVREM,AREA,CCDD,NOMBREDD,CCPP,NOMBREPP,CCDI,NOMBREDI,VIVIENDA,TOT_HOGAR,HOGAR,RESFIN,P100_C,P101,P101_O,P102A,P102A_O,P103,P103_O,P104B,P104B_O,P105,P106,P106_O,P106A,P107,P107A,P107A_O,P107B,P107B_O,…,P184A_11,P184A_12,P184A_13,P184A_14,P184A_15,P184A_16,P184A_17,P184A_18,P185,P185A,P186,P186A,P187_1,P187_2,P187_3,P187_4,P187_5,P187_6,P187_7,P187_8,P187_9,P187_9_O,P188,P189_1,P189_2,P189_3,P189_4,P189_5,P189_6,P189_7,P189_8,P189_8_O,REGIONNATU,ESTRATO,MOD_ENC,FACTOR,FACTOR_CALIBRADO
f64,str,str,str,str,f64,f64,f64,f64,str,str,str,str,str,str,str,f64,f64,f64,f64,f64,str,f64,str,f64,str,f64,str,f64,f64,str,f64,f64,f64,str,f64,str,…,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,str,f64,f64,f64,f64
3.0,"""2020""","""02""","""18367""","""0088""",1.0,0.0,,1.0,"""07""","""CALLAO""","""01""","""CALLAO""","""01""","""CALLAO""","""004""",1.0,1.0,1.0,1.0,3.0,"""""",1.0,"""""",5.0,"""""",1.0,"""""",6.0,2.0,"""""",1.0,1.0,1.0,"""""",2.0,"""""",…,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,,,,1.0,1.0,2.0,,2.0,2.0,2.0,2.0,2.0,"""""",2.0,,,,,,,,,"""""","""1""",3.0,1.0,180.76799,320.7995
2.0,"""2020""","""08""","""23304""","""0152""",1.0,0.0,,1.0,"""15""","""LIMA""","""01""","""LIMA""","""18""","""LURIGANCHO""","""009""",1.0,1.0,1.0,1.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""1""",5.0,2.0,660.2094,
2.0,"""2020""","""03""","""07484""","""0119""",1.0,0.0,,1.0,"""16""","""LORETO""","""01""","""MAYNAS""","""08""","""PUNCHANA""","""007""",1.0,1.0,1.0,1.0,1.0,"""""",7.0,"""""",4.0,"""""",4.0,"""""",1.0,2.0,"""""",5.0,1.0,6.0,"""""",2.0,"""""",…,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,,1.0,2.0,2.0,2.0,2.0,"""""",2.0,,,,,,,,,"""""","""3""",5.0,1.0,134.11713,467.53418
3.0,"""2020""","""01""","""31806""","""0090""",1.0,0.0,,1.0,"""12""","""JUNIN""","""01""","""HUANCAYO""","""14""","""EL TAMBO""","""005""",1.0,1.0,1.0,1.0,1.0,"""""",1.0,"""""",5.0,"""""",1.0,"""""",3.0,2.0,"""""",1.0,1.0,3.0,"""""",4.0,"""""",…,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,2.0,1.0,1.0,1.0,,1.0,2.0,2.0,2.0,2.0,"""""",1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,"""EMPRESA MINERA""","""2""",3.0,1.0,277.49554,476.47592
1.0,"""2020""","""04""","""00737""","""0013""",1.0,0.0,,1.0,"""20""","""PIURA""","""05""","""PAITA""","""04""","""COLAN""","""003""",1.0,1.0,1.0,1.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""1""",5.0,2.0,403.10953,
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
3.0,"""2020""","""02""","""04667""","""0202""",1.0,0.0,,1.0,"""06""","""CAJAMARCA""","""08""","""JAEN""","""01""","""JAEN""","""007""",1.0,1.0,1.0,1.0,1.0,"""""",3.0,"""""",6.0,"""""",4.0,"""""",2.0,2.0,"""""",1.0,1.0,1.0,"""""",2.0,"""""",…,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,,2.0,2.0,2.0,2.0,2.0,"""""",2.0,,,,,,,,,"""""","""3""",4.0,1.0,185.26337,213.32634
2.0,"""2020""","""11""","""16446""","""0006""",1.0,0.0,,1.0,"""15""","""LIMA""","""02""","""BARRANCA""","""01""","""BARRANCA""","""001""",1.0,1.0,1.0,3.0,2.0,"""""",1.0,"""""",3.0,"""""",1.0,"""""",3.0,2.0,"""""",1.0,1.0,4.0,"""""",6.0,"""""",…,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,2.0,,,,1.0,1.0,2.0,1.0,1.0,2.0,1.0,2.0,2.0,"""""",2.0,,,,,,,,,"""""","""1""",2.0,1.0,451.69406,803.3122
1.0,"""2020""","""12""","""13908""","""0009""",1.0,0.0,,1.0,"""02""","""ANCASH""","""05""","""BOLOGNESI""","""01""","""CHIQUIAN""","""002""",1.0,1.0,1.0,1.0,1.0,"""""",3.0,"""""",6.0,"""""",4.0,"""""",3.0,2.0,"""""",1.0,1.0,1.0,"""""",1.0,"""""",…,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,2.0,2.0,1.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,"""""",2.0,,,,,,,,,"""""","""2""",3.0,1.0,466.82516,606.8727
3.0,"""2020""","""09""","""38907""","""0035""",1.0,0.0,,1.0,"""17""","""MADRE DE DIOS""","""02""","""MANU""","""04""","""HUEPETUHE""","""005""",1.0,1.0,1.0,2.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""3""",1.0,2.0,31.101513,


In [1042]:
df_urban_no_dpl.select("ESTRATO")


ESTRATO
f64
3.0
4.0
3.0
3.0
2.0
…
5.0
1.0
3.0
2.0


### 3.3.8. <a id='3.3.8'>Groupby</a>

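Polars groups rows with `group_by(...)` and aggregates them with `.agg(...)`. The group keys come back as regular columns, not as an index.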

In [1043]:
df_tmp = df_urban_no_dpl.clone()

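# Attempt a numeric version of P172D (mirrors pandas astype('string').astype('Int64')).
# Caveat: P172D is f64, so its string form is "1.0"/"2.0", which does not parse as an
# integer; with strict=False every value becomes null, hence the sums of 0 in the output.
# A direct pl.col("P172D").cast(pl.Int64) avoids the string round trip.
df_tmp = df_tmp.with_columns(
    pl.col("P172D")
      .cast(pl.Utf8)
      .cast(pl.Int64, strict=False)
      .alias("P172D_num")
)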

# Groupby + sum
out = (
    df_tmp
    .group_by(["CCDD", "CCPP", "CCDI", "P172D"])
    .agg(pl.col("P172D_num").sum().alias("P172D_num_sum"))
)

out

CCDD,CCPP,CCDI,P172D,P172D_num_sum
str,str,str,f64,i64
"""12""","""07""","""08""",1.0,0
"""06""","""01""","""01""",2.0,0
"""15""","""01""","""13""",2.0,0
"""12""","""07""","""05""",1.0,0
"""12""","""02""","""12""",1.0,0
…,…,…,…,…
"""05""","""05""","""07""",2.0,0
"""05""","""08""","""01""",1.0,0
"""03""","""02""","""16""",1.0,0
"""04""","""03""","""07""",2.0,0


In [1044]:
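# Indicator for "SI" answers; null answers stay null.
# Caveat: P172D holds numeric codes (f64), so its string form is never "SI" and the
# indicator is 0 for every non-null row, which is why the means in the output are all 0.0.
df_tmp = df_tmp.with_columns(
    pl.when(pl.col("P172D").is_null())
      .then(None)
      .otherwise((pl.col("P172D").cast(pl.Utf8).str.to_uppercase() == "SI").cast(pl.Int64))
      .alias("P172D_si")
)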

# Groupby mean
out = (
    df_tmp
    .group_by(["CCDD", "CCPP", "CCDI"])
    .agg(pl.col("P172D_si").mean().alias("P172D_si_mean"))
)

out

CCDD,CCPP,CCDI,P172D_si_mean
str,str,str,f64
"""15""","""01""","""04""",0.0
"""11""","""05""","""01""",0.0
"""21""","""09""","""01""",0.0
"""13""","""05""","""01""",0.0
"""14""","""01""","""05""",0.0
…,…,…,…
"""12""","""03""","""02""",0.0
"""13""","""06""","""14""",0.0
"""11""","""05""","""07""",0.0
"""16""","""03""","""05""",0.0


In [1045]:
# P172D is already numeric (f64), so a direct cast is enough; no string round trip needed.
df_tmp = df_tmp.with_columns(
    pl.col("P172D")
      .cast(pl.Float64, strict=False)
      .alias("P172D_num")
)

df_urban_no_dpl_mean = (
    df_tmp
    .group_by(["CCDD", "CCPP", "CCDI"])
    .agg(pl.col("P172D_num").mean().alias("P172D"))
)

df_urban_no_dpl_mean

CCDD,CCPP,CCDI,P172D
str,str,str,f64
"""22""","""10""","""01""",1.833333
"""13""","""12""","""02""",1.631579
"""19""","""01""","""03""",1.692308
"""15""","""05""","""14""",1.625
"""04""","""07""","""01""",1.615385
…,…,…,…
"""19""","""01""","""10""",1.0
"""10""","""06""","""04""",1.693878
"""25""","""01""","""01""",1.703297
"""10""","""01""","""11""",1.454545


#### Agg

Pass several expressions to `agg` to compute multiple statistics in a single `group_by`.

In [31]:
df_urban_no_dpl = df_urban_no_dpl.with_columns(
    pl.col("P172D").cast(pl.Float64, strict=False).alias("P172D")
)

df_urban_no_dpl

PER,ANIO,MES,CONGLOMERADO,NSELV,TSELV,VIVREM,NUMVIVREM,AREA,CCDD,NOMBREDD,CCPP,NOMBREPP,CCDI,NOMBREDI,VIVIENDA,TOT_HOGAR,HOGAR,RESFIN,P100_C,P101,P101_O,P102A,P102A_O,P103,P103_O,P104B,P104B_O,P105,P106,P106_O,P106A,P107,P107A,P107A_O,P107B,P107B_O,…,P184A_11,P184A_12,P184A_13,P184A_14,P184A_15,P184A_16,P184A_17,P184A_18,P185,P185A,P186,P186A,P187_1,P187_2,P187_3,P187_4,P187_5,P187_6,P187_7,P187_8,P187_9,P187_9_O,P188,P189_1,P189_2,P189_3,P189_4,P189_5,P189_6,P189_7,P189_8,P189_8_O,REGIONNATU,ESTRATO,MOD_ENC,FACTOR,FACTOR_CALIBRADO
f64,str,str,str,str,f64,f64,f64,f64,str,str,str,str,str,str,str,f64,f64,f64,f64,f64,str,f64,str,f64,str,f64,str,f64,f64,str,f64,f64,f64,str,f64,str,…,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,str,f64,f64,f64,f64
3.0,"""2020""","""02""","""18367""","""0088""",1.0,0.0,,1.0,"""07""","""CALLAO""","""01""","""CALLAO""","""01""","""CALLAO""","""004""",1.0,1.0,1.0,1.0,3.0,"""""",1.0,"""""",5.0,"""""",1.0,"""""",6.0,2.0,"""""",1.0,1.0,1.0,"""""",2.0,"""""",…,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,,,,1.0,1.0,2.0,,2.0,2.0,2.0,2.0,2.0,"""""",2.0,,,,,,,,,"""""","""1""",3.0,1.0,180.76799,320.7995
2.0,"""2020""","""08""","""23304""","""0152""",1.0,0.0,,1.0,"""15""","""LIMA""","""01""","""LIMA""","""18""","""LURIGANCHO""","""009""",1.0,1.0,1.0,1.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""1""",5.0,2.0,660.2094,
2.0,"""2020""","""03""","""07484""","""0119""",1.0,0.0,,1.0,"""16""","""LORETO""","""01""","""MAYNAS""","""08""","""PUNCHANA""","""007""",1.0,1.0,1.0,1.0,1.0,"""""",7.0,"""""",4.0,"""""",4.0,"""""",1.0,2.0,"""""",5.0,1.0,6.0,"""""",2.0,"""""",…,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,,1.0,2.0,2.0,2.0,2.0,"""""",2.0,,,,,,,,,"""""","""3""",5.0,1.0,134.11713,467.53418
3.0,"""2020""","""01""","""31806""","""0090""",1.0,0.0,,1.0,"""12""","""JUNIN""","""01""","""HUANCAYO""","""14""","""EL TAMBO""","""005""",1.0,1.0,1.0,1.0,1.0,"""""",1.0,"""""",5.0,"""""",1.0,"""""",3.0,2.0,"""""",1.0,1.0,3.0,"""""",4.0,"""""",…,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,2.0,1.0,1.0,1.0,,1.0,2.0,2.0,2.0,2.0,"""""",1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,"""EMPRESA MINERA""","""2""",3.0,1.0,277.49554,476.47592
1.0,"""2020""","""04""","""00737""","""0013""",1.0,0.0,,1.0,"""20""","""PIURA""","""05""","""PAITA""","""04""","""COLAN""","""003""",1.0,1.0,1.0,1.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""1""",5.0,2.0,403.10953,
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
3.0,"""2020""","""02""","""04667""","""0202""",1.0,0.0,,1.0,"""06""","""CAJAMARCA""","""08""","""JAEN""","""01""","""JAEN""","""007""",1.0,1.0,1.0,1.0,1.0,"""""",3.0,"""""",6.0,"""""",4.0,"""""",2.0,2.0,"""""",1.0,1.0,1.0,"""""",2.0,"""""",…,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,,2.0,2.0,2.0,2.0,2.0,"""""",2.0,,,,,,,,,"""""","""3""",4.0,1.0,185.26337,213.32634
2.0,"""2020""","""11""","""16446""","""0006""",1.0,0.0,,1.0,"""15""","""LIMA""","""02""","""BARRANCA""","""01""","""BARRANCA""","""001""",1.0,1.0,1.0,3.0,2.0,"""""",1.0,"""""",3.0,"""""",1.0,"""""",3.0,2.0,"""""",1.0,1.0,4.0,"""""",6.0,"""""",…,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,2.0,,,,1.0,1.0,2.0,1.0,1.0,2.0,1.0,2.0,2.0,"""""",2.0,,,,,,,,,"""""","""1""",2.0,1.0,451.69406,803.3122
1.0,"""2020""","""12""","""13908""","""0009""",1.0,0.0,,1.0,"""02""","""ANCASH""","""05""","""BOLOGNESI""","""01""","""CHIQUIAN""","""002""",1.0,1.0,1.0,1.0,1.0,"""""",3.0,"""""",6.0,"""""",4.0,"""""",3.0,2.0,"""""",1.0,1.0,1.0,"""""",1.0,"""""",…,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,2.0,2.0,1.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,"""""",2.0,,,,,,,,,"""""","""2""",3.0,1.0,466.82516,606.8727
3.0,"""2020""","""09""","""38907""","""0035""",1.0,0.0,,1.0,"""17""","""MADRE DE DIOS""","""02""","""MANU""","""04""","""HUEPETUHE""","""005""",1.0,1.0,1.0,2.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""3""",1.0,2.0,31.101513,


In [32]:
out = (
    df_urban_no_dpl
    .group_by(["CCDD", "CCPP", "CCDI"])
    .agg(pl.col("P172D").mean().alias("P172D"))
)

out


CCDD,CCPP,CCDI,P172D
str,str,str,f64
"""16""","""02""","""06""",2.0
"""25""","""02""","""01""",1.888889
"""20""","""04""","""01""",1.181818
"""12""","""04""","""34""",1.0
"""20""","""02""","""02""",1.285714
…,…,…,…
"""25""","""03""","""04""",2.0
"""04""","""03""","""07""",1.857143
"""15""","""08""","""03""",1.5
"""13""","""07""","""04""",1.533333


In [33]:
df3_rec = (
    df_urban_no_dpl
    .group_by(["CCDD", "CCPP", "CCDI"])
    .agg(
        pl.col("P172D").median().alias("recycle_median"),
        pl.col("P172D").mean().alias("recycle_mean"),
    )
)

df3_rec

CCDD,CCPP,CCDI,recycle_median,recycle_mean
str,str,str,f64,f64
"""02""","""17""","""02""",2.0,1.875
"""25""","""03""","""04""",2.0,2.0
"""20""","""07""","""01""",2.0,1.791045
"""18""","""01""","""01""",2.0,1.705376
"""13""","""08""","""01""",2.0,1.8125
…,…,…,…,…
"""10""","""01""","""11""",1.0,1.476744
"""10""","""09""","""01""",2.0,1.916667
"""06""","""03""","""01""",1.0,1.241379
"""03""","""02""","""01""",1.0,1.438596


### 3.3.9. <a id='3.3.9'>Reshape</a>

#### From Wide to Long

Use `melt` (renamed `unpivot` in Polars 1.0, where `melt` is deprecated).

In [34]:
df3_rec.head(2)

CCDD,CCPP,CCDI,recycle_median,recycle_mean
str,str,str,f64,f64
"""02""","""17""","""02""",2.0,1.875
"""25""","""03""","""04""",2.0,2.0


In [1050]:
# `unpivot` replaces `melt`, which is deprecated since Polars 1.0 (`index` was `id_vars`)
df3_rec_stack = df3_rec.unpivot(
    index=["CCDD", "CCPP", "CCDI"],
    variable_name="STATS",
    value_name="VALUES",
)

df3_rec_stack.head()

CCDD,CCPP,CCDI,STATS,VALUES
str,str,str,str,f64
"""03""","""02""","""01""","""recycle_median""",1.0
"""14""","""01""","""01""","""recycle_median""",2.0
"""23""","""01""","""01""","""recycle_median""",2.0
"""06""","""09""","""03""","""recycle_median""",2.0
"""15""","""05""","""13""","""recycle_median""",2.0


#### From Long to Wide

Use `pivot`.

In [1051]:

# Polars >= 1.0 names the pivoted column `on` (formerly `columns`)
df_wide = df3_rec_stack.pivot(
    on="STATS",
    index=["CCDD", "CCPP", "CCDI"],
    values="VALUES",
    aggregate_function="first",
)

df_wide.head()


CCDD,CCPP,CCDI,recycle_median,recycle_mean
str,str,str,f64,f64
"""03""","""02""","""01""",1.0,1.429907
"""14""","""01""","""01""",2.0,1.578595
"""23""","""01""","""01""",2.0,1.663366
"""06""","""09""","""03""",2.0,1.533333
"""15""","""05""","""13""",2.0,1.666667


In [1052]:
# Same pivot as the previous cell, stored under the name used in the Merge section
df_l_w = df3_rec_stack.pivot(
    on="STATS",
    index=["CCDD", "CCPP", "CCDI"],
    values="VALUES",
    aggregate_function="first",   # use "mean" if duplicates exist
)

df_l_w.head()


CCDD,CCPP,CCDI,recycle_median,recycle_mean
str,str,str,f64,f64
"""03""","""02""","""01""",1.0,1.429907
"""14""","""01""","""01""",2.0,1.578595
"""23""","""01""","""01""",2.0,1.663366
"""06""","""09""","""03""",2.0,1.533333
"""15""","""05""","""13""",2.0,1.666667


### 3.3.10. <a id='3.3.10'>Merge</a>

Polars uses `join` (the counterpart of pandas `merge`): pass the key columns with `on` and the join type with `how`.


In [1053]:
# Keep the full merged table; display only the first rows
df_urban_merge = df_urban_no_dpl.join(df_l_w, on=["CCDD", "CCPP", "CCDI"], how="left")
df_urban_merge.head()


PER,ANIO,MES,CONGLOMERADO,NSELV,TSELV,VIVREM,NUMVIVREM,AREA,CCDD,NOMBREDD,CCPP,NOMBREPP,CCDI,NOMBREDI,VIVIENDA,TOT_HOGAR,HOGAR,RESFIN,P100_C,P101,P101_O,P102A,P102A_O,P103,P103_O,P104B,P104B_O,P105,P106,P106_O,P106A,P107,P107A,P107A_O,P107B,P107B_O,…,P184A_15,P184A_16,P184A_17,P184A_18,P185,P185A,P186,P186A,P187_1,P187_2,P187_3,P187_4,P187_5,P187_6,P187_7,P187_8,P187_9,P187_9_O,P188,P189_1,P189_2,P189_3,P189_4,P189_5,P189_6,P189_7,P189_8,P189_8_O,REGIONNATU,ESTRATO,MOD_ENC,FACTOR,FACTOR_CALIBRADO,RESFIN_label,ID,recycle_median,recycle_mean
f64,str,str,str,str,f64,f64,f64,f64,str,str,str,str,str,str,str,f64,f64,f64,f64,f64,str,f64,str,f64,str,f64,str,f64,f64,str,f64,f64,f64,str,f64,str,…,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,str,f64,f64,f64,f64,str,str,f64,f64
3.0,"""2020""","""09""","""44536""","""0040""",1.0,0.0,,1.0,"""23""","""TACNA""","""01""","""TACNA""","""01""","""TACNA""","""005""",1.0,1.0,1.0,1.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""1""",3.0,2.0,221.78146,,"""Completa""","""3.0_09_23_01_01_44536_0040_005…",2.0,1.663366
1.0,"""2020""","""11""","""17351""","""0086""",1.0,0.0,,1.0,"""07""","""CALLAO""","""01""","""CALLAO""","""06""","""VENTANILLA""","""005""",1.0,1.0,1.0,1.0,1.0,"""""",1.0,"""""",5.0,"""""",1.0,"""""",4.0,2.0,"""""",1.0,1.0,4.0,"""""",6.0,"""""",…,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,"""""",2.0,,,,,,,,,"""""","""1""",4.0,1.0,96.8754,181.64433,"""Completa""","""1.0_11_07_01_06_17351_0086_005…",1.0,1.451049
4.0,"""2020""","""09""","""14615""","""0026""",1.0,0.0,,1.0,"""10""","""HUANUCO""","""01""","""HUANUCO""","""02""","""AMARILIS""","""004""",1.0,1.0,1.0,1.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,0.0,1.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""2""",3.0,2.0,173.63954,,"""Completa""","""4.0_09_10_01_02_14615_0026_004…",2.0,1.517007
4.0,"""2020""","""03""","""04326""","""0144""",1.0,0.0,,1.0,"""14""","""LAMBAYEQUE""","""01""","""CHICLAYO""","""20""","""TUMAN""","""010""",1.0,1.0,1.0,1.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""1""",3.0,2.0,290.97446,,"""Completa""","""4.0_03_14_01_20_04326_0144_010…",1.0,1.3125
1.0,"""2020""","""08""","""12144""","""0004""",1.0,0.0,,1.0,"""02""","""ANCASH""","""18""","""SANTA""","""01""","""CHIMBOTE""","""001""",1.0,1.0,1.0,1.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""1""",2.0,2.0,320.32922,,"""Completa""","""1.0_08_02_18_01_12144_0004_001…",2.0,1.669604


In [1054]:
df_urban_merge.head(2)

PER,ANIO,MES,CONGLOMERADO,NSELV,TSELV,VIVREM,NUMVIVREM,AREA,CCDD,NOMBREDD,CCPP,NOMBREPP,CCDI,NOMBREDI,VIVIENDA,TOT_HOGAR,HOGAR,RESFIN,P100_C,P101,P101_O,P102A,P102A_O,P103,P103_O,P104B,P104B_O,P105,P106,P106_O,P106A,P107,P107A,P107A_O,P107B,P107B_O,…,P184A_15,P184A_16,P184A_17,P184A_18,P185,P185A,P186,P186A,P187_1,P187_2,P187_3,P187_4,P187_5,P187_6,P187_7,P187_8,P187_9,P187_9_O,P188,P189_1,P189_2,P189_3,P189_4,P189_5,P189_6,P189_7,P189_8,P189_8_O,REGIONNATU,ESTRATO,MOD_ENC,FACTOR,FACTOR_CALIBRADO,RESFIN_label,ID,recycle_median,recycle_mean
f64,str,str,str,str,f64,f64,f64,f64,str,str,str,str,str,str,str,f64,f64,f64,f64,f64,str,f64,str,f64,str,f64,str,f64,f64,str,f64,f64,f64,str,f64,str,…,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,str,f64,f64,f64,f64,str,str,f64,f64
3.0,"""2020""","""09""","""44536""","""0040""",1.0,0.0,,1.0,"""23""","""TACNA""","""01""","""TACNA""","""01""","""TACNA""","""005""",1.0,1.0,1.0,1.0,,"""""",,"""""",,"""""",,"""""",,,"""""",,,,"""""",,"""""",…,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,"""""",,,,,,,,,,"""""","""1""",3.0,2.0,221.78146,,"""Completa""","""3.0_09_23_01_01_44536_0040_005…",2.0,1.663366
1.0,"""2020""","""11""","""17351""","""0086""",1.0,0.0,,1.0,"""07""","""CALLAO""","""01""","""CALLAO""","""06""","""VENTANILLA""","""005""",1.0,1.0,1.0,1.0,1.0,"""""",1.0,"""""",5.0,"""""",1.0,"""""",4.0,2.0,"""""",1.0,1.0,4.0,"""""",6.0,"""""",…,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,"""""",2.0,,,,,,,,,"""""","""1""",4.0,1.0,96.8754,181.64433,"""Completa""","""1.0_11_07_01_06_17351_0086_005…",1.0,1.451049


In [1055]:
%whos DataFrame


Variable               Type         Data/Info
---------------------------------------------
cars                   DataFrame    shape: (6, 4)\n┌─────────<...>───┴────────────┴───────┘
dep1                   DataFrame    shape: (6, 3)\n┌────────┬<...>\n└────────┴──────┴─────┘
dep1_sort              DataFrame    shape: (6, 3)\n┌────────┬<...>\n└────────┴──────┴─────┘
df                     DataFrame    shape: (3, 3)\n┌─────┬───<...>   │\n└─────┴─────┴─────┘
df3_rec                DataFrame    shape: (550, 5)\n┌──────┬<...>─────────┴──────────────┘
df3_rec_stack          DataFrame    shape: (1_100, 5)\n┌─────<...>─────────────┴──────────┘
df_l_w                 DataFrame    shape: (550, 5)\n┌──────┬<...>─────────┴──────────────┘
df_matrix              DataFrame    shape: (3, 3)\n┌──────┬──<...>│\n└──────┴──────┴──────┘
df_pd                  DataFrame           PER  ANIO MES CONG<...>42153 rows x 491 columns]
df_tmp                 DataFrame    shape: (24_342, 495)\n┌──<...>─┴───────────┴

## 3.4. <a id='3.4'>References</a>

- Polars User Guide: https://docs.pola.rs/
- Polars expressions: https://docs.pola.rs/user-guide/expressions/
- Polars join: https://docs.pola.rs/user-guide/transformations/joins/
- Polars pivot/melt: https://docs.pola.rs/user-guide/transformations/pivot/