<a href="https://colab.research.google.com/github/densmyslov/ais-data-pipeline/blob/main/notebooks/0_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preprocessing Large-Scale Datasets at Scale

Working with multi-gigabyte datasets is challenging not because of complexity but because of size. A single CSV file of several gigabytes can expand to many times that size in memory when loaded with Pandas (see Step 1 below).

This notebook demonstrates a preprocessing pipeline that transforms bulky CSVs into compact, query-ready Parquet files while keeping memory usage under control. The Dubai real estate dataset serves as an illustrative example.

# Setup

In [1]:
import os
import json
import pandas as pd
import numpy as np
from random import choice, choices
from time import time
import pyarrow as pa
import pyarrow.parquet as pq

import polars as pl
import polars.selectors as cs

# Dubai Rent_contracts.csv Dataset pre-processing

## Prerequisites
* At least the rent_contracts.csv dataset has been downloaded using the "data_ingestion_with_profiling.ipynb" notebook

In [2]:
# change the current folder to the one where you downloaded the rent_contracts.csv dataset

project_name = '0_real-estate-agent'
os.chdir(f'/content/drive/MyDrive/0_Projects/{project_name}')

# change the date to the actual date of the downloaded dataset
date = '2025-08-20'

Let's check the size of the downloaded rent_contracts.csv dataset.

In [3]:
fn = f"datasets/rent_contracts_8192_{date}.csv"
file_size = os.path.getsize(fn) / 1_000_000  # bytes -> decimal megabytes (MB)
print(f"File size: {file_size: .2f} MB")

File size:  4225.93 MB
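
The cell above divides the byte count by 1,000,000, which yields decimal megabytes (MB) rather than binary mebibytes (MiB). The sketch below shows how the same byte count reads in the three units used in this notebook. `to_units` is a hypothetical helper, and `approx_bytes` is reconstructed from the printed 4225.93 value, so it is an assumption rather than the exact file size.

```python
def to_units(n_bytes: int) -> dict[str, float]:
    """Express a byte count in decimal MB and binary MiB / GiB."""
    return {
        "MB": n_bytes / 1000**2,   # decimal megabytes
        "MiB": n_bytes / 1024**2,  # binary mebibytes
        "GiB": n_bytes / 1024**3,  # binary gibibytes
    }


# Assumption: approximate size rebuilt from the printed value, not the exact byte count
approx_bytes = 4_225_930_000
sizes = to_units(approx_bytes)

for unit, value in sizes.items():
    print(f"{value:10.2f} {unit}")
```

Read this way, the file is a little over 4 GB but slightly under 4 GiB, which matters when comparing it with memory limits that are quoted in binary units.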


## Step 1. Experiment: how to get OOM ("out-of-memory error")

The file on disk is over 4 GB (about 3.9 GiB). That is already a lot, given that we are going to process it with serverless functions, which have a hard memory limit. For example, AWS Lambda functions can be configured with at most 10 GiB of memory.  


If we try to read this file into memory using pandas, it will exceed 20 GiB.  

You can try this yourself; the notebook will most likely crash unless you selected a high-memory runtime.

In [28]:
df = pd.read_csv(fn, engine='python', on_bad_lines='skip')
print(df.shape)

(8915752, 40)


## Step 2: Memory-Smart Transformations

Let's create a `LazyFrame` using Polars; nothing is loaded into memory yet.

In [3]:
fn = f"datasets/rent_contracts_8192_{date}.csv"

lf = pl.scan_csv(
    fn,
    null_values=["", "null", "NULL", "None"],
    infer_schema_length=10000,
)

Let's get the number of rows in the dataset.

In [4]:
n_rows = lf.select(pl.len()).collect().item()
print(f"Number of rows: {n_rows}")

Number of rows: 8915752


### step 2.0: initial memory consumption by columns

Let's load a sample of the dataset into memory. We will use that sample to estimate the memory consumption of individual columns.

In [5]:
n_sample_rows = 20_000
df_sample = lf.head(n_sample_rows).collect(engine="streaming")

Next, we convert the sample to a pandas DataFrame and estimate how many MiB each column would "weigh" if we loaded the whole dataset into memory.

In [6]:
pdf_sample = df_sample.to_pandas()
col_bytes_sample = pdf_sample.memory_usage(index=False, deep=True).astype("int64")

col_mem_size_df = (
    pd.DataFrame({"col": col_bytes_sample.index, "bytes_sample": col_bytes_sample.values})
      .assign(
          bytes_per_row=lambda d: d["bytes_sample"] / len(pdf_sample),
          est_total_bytes=lambda d: d["bytes_per_row"] * n_rows,
          est_total_mib=lambda d: d["est_total_bytes"] / (1024 ** 2),
      )
      .sort_values("est_total_bytes", ascending=False, ignore_index=True)
)

col_mem_size_df.shape

(40, 5)

In [7]:
total_mib = col_mem_size_df['est_total_mib'].sum()  # the column holds MiB, not GiB
print(f"In-memory pandas df size: {total_mib: 2.2f} MiB")

In-memory pandas df size:  16137.33 MiB
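
The sample-based estimate is recomputed after every transformation in this notebook. One way to avoid repeating the same lines is a small helper that takes a pandas sample and the full row count. `estimate_column_memory` and the four-row frame below are hypothetical, introduced only for illustration.

```python
import pandas as pd


def estimate_column_memory(pdf_sample: pd.DataFrame, n_rows: int) -> pd.DataFrame:
    """Extrapolate per-column memory of a pandas sample to `n_rows` rows."""
    col_bytes = pdf_sample.memory_usage(index=False, deep=True).astype("int64")
    return (
        pd.DataFrame({"col": col_bytes.index, "bytes_sample": col_bytes.values})
        .assign(
            bytes_per_row=lambda d: d["bytes_sample"] / len(pdf_sample),
            est_total_bytes=lambda d: d["bytes_per_row"] * n_rows,
            est_total_mib=lambda d: d["est_total_bytes"] / (1024 ** 2),
        )
        .sort_values("est_total_bytes", ascending=False, ignore_index=True)
    )


# Hypothetical four-row sample; in the notebook this would be df_sample.to_pandas()
tiny = pd.DataFrame(
    {
        "area_id": pd.Series([526, 442, 506, 463], dtype="int64"),
        "area_name_en": ["Business Bay", "Al Barsha South Fifth", "Al Yelayiss 1", "Wadi Al Safa 7"],
    }
)
est = estimate_column_memory(tiny, n_rows=1_000)
print(est)
```

Because the estimate scales linearly from the first rows of the file, it is only as representative as that sample.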


In [8]:
col_mem_size_df

Unnamed: 0,col,bytes_sample,bytes_per_row,est_total_bytes,est_total_mib
0,nearest_landmark_ar,1896638,94.8319,845497700.0,806.329443
1,nearest_metro_ar,1807022,90.3511,805548000.0,768.230439
2,area_name_ar,1803314,90.1657,803895000.0,766.654034
3,ejari_property_sub_type_ar,1734718,86.7359,773315800.0,737.491392
4,master_project_ar,1539542,76.9771,686308700.0,654.515012
5,nearest_mall_ar,1534266,76.7133,683956800.0,652.271994
6,contract_reg_type_ar,1497444,74.8722,667542000.0,636.617629
7,property_usage_ar,1487346,74.3673,663040400.0,632.324604
8,ejari_bus_property_type_ar,1479800,73.99,659676500.0,629.116526
9,ejari_property_type_ar,1457790,72.8895,649864700.0,619.759279


There are paired columns with '_ar' and '_en' suffixes. These are string columns in Arabic and English, respectively.  

We can drop the Arabic duplicates of the English columns.

That will be our first step towards reducing the memory consumption of our dataset.

### step 2.1. remove Arabic duplicates of English columns

In [9]:
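# remove columns with suffix '_ar'
# Polars treats a string as a regex only when it starts with '^' and ends with '$'
lf = lf.select(pl.exclude(r"^.*_ar$"))
names = lf.collect_schema().names()
# NOTE: this substring test also drops any column that merely contains "_ar";
# use `not c.endswith("_ar")` if only the suffix should match
keep = [c for c in names if "_ar" not in c]
lf = lf.select(keep)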
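
A substring test such as `"_ar" not in c` is broader than a suffix test: it also removes any column whose name merely contains `_ar` somewhere in the middle. The sketch below contrasts the two on three column names; `total_area_sqm` is a hypothetical name used only to show the difference.

```python
# "total_area_sqm" is hypothetical; the other two names appear in the dataset
names = ["area_name_ar", "area_name_en", "total_area_sqm"]

# Substring test: drops every name that contains "_ar" anywhere
kept_by_substring = [c for c in names if "_ar" not in c]

# Suffix test: drops only names that end with "_ar"
kept_by_suffix = [c for c in names if not c.endswith("_ar")]

print(kept_by_substring)  # ['area_name_en']
print(kept_by_suffix)     # ['area_name_en', 'total_area_sqm']
```

Comparing the column count before and after the drop is a quick way to confirm that only the intended columns were removed.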

Let's check that we actually dropped the Arabic columns.

In [10]:
schema = lf.collect_schema()

In [11]:
schema

Schema([('contract_id', String),
        ('contract_reg_type_id', Int64),
        ('contract_reg_type_en', String),
        ('contract_start_date', String),
        ('contract_end_date', String),
        ('contract_amount', Int64),
        ('annual_amount', Int64),
        ('no_of_prop', Int64),
        ('line_number', Int64),
        ('is_free_hold', Int64),
        ('ejari_bus_property_type_id', Int64),
        ('ejari_bus_property_type_en', String),
        ('ejari_property_type_id', Int64),
        ('ejari_property_type_en', String),
        ('ejari_property_sub_type_id', Int64),
        ('ejari_property_sub_type_en', String),
        ('property_usage_en', String),
        ('project_number', Int64),
        ('project_name_en', String),
        ('master_project_en', String),
        ('area_id', Int64),
        ('area_name_en', String),
        ('nearest_landmark_en', String),
        ('nearest_metro_en', String),
        ('nearest_mall_en', String),
        ('tenant_type_id', Int64)

### step 2.2. Estimate the size of remaining columns

In [12]:
n_sample_rows = 20_000
df_sample = lf.head(n_sample_rows).collect(engine="streaming")

pdf_sample = df_sample.to_pandas()
col_bytes_sample = pdf_sample.memory_usage(index=False, deep=True).astype("int64")

col_mem_size_df = (
    pd.DataFrame({"col": col_bytes_sample.index, "bytes_sample": col_bytes_sample.values})
      .assign(
          bytes_per_row=lambda d: d["bytes_sample"] / len(pdf_sample),
          est_total_bytes=lambda d: d["bytes_per_row"] * n_rows,
          est_total_mib=lambda d: d["est_total_bytes"] / (1024 ** 2),
      )
      .sort_values("est_total_bytes", ascending=False, ignore_index=True)
)

col_mem_size_df.shape

(27, 5)

In [13]:
total_mib = col_mem_size_df['est_total_mib'].sum()  # the column holds MiB, not GiB
print(f"In-memory pandas df size: {total_mib: 2.2f} MiB")

In-memory pandas df size:  8119.15 MiB


We reduced the estimated size from about 16,137 MiB (15.8 GiB) to about 8,119 MiB (7.9 GiB), which is straightforward: we dropped 13 columns, including the ten heaviest ones.

In [14]:
col_mem_size_df

Unnamed: 0,col,bytes_sample,bytes_per_row,est_total_bytes,est_total_mib
0,nearest_landmark_en,1300539,65.02695,579764200.0,552.906189
1,nearest_metro_en,1283733,64.18665,572272300.0,545.76135
2,area_name_en,1265168,63.2584,563996200.0,537.868697
3,contract_id,1239989,61.99945,552771700.0,527.164193
4,ejari_property_sub_type_en,1221382,61.0691,544477000.0,519.253684
5,property_usage_en,1195115,59.75575,532767400.0,508.086631
6,contract_end_date,1180000,59.0,526029400.0,501.660698
7,contract_start_date,1180000,59.0,526029400.0,501.660698
8,nearest_mall_en,1125563,56.27815,501762000.0,478.517559
9,tenant_type_en,1079008,53.9504,481008400.0,458.725344


### step 2.3: date columns

In `col_mem_size_df`, rows 6 and 7 (`contract_end_date` and `contract_start_date`) hold dates, so they should have dtype `Date`, but they are stored as String and take up a lot of memory: over 500 MiB each.   

We can convert these columns to the compact Polars/Arrow `Date` dtype.



In [15]:
date_cols = ["contract_start_date", "contract_end_date"]
# strict=False turns values that fail to parse into nulls instead of raising
lf = lf.with_columns(
    pl.col(date_cols).str.strptime(pl.Date, strict=False)
)

Let's check how much memory we saved with this conversion.

In [16]:
n_sample_rows = 20_000
df_sample = lf.head(n_sample_rows).collect(engine="streaming")

pdf_sample = df_sample.to_pandas()
col_bytes_sample = pdf_sample.memory_usage(index=False, deep=True).astype("int64")

col_mem_size_df = (
    pd.DataFrame({"col": col_bytes_sample.index, "bytes_sample": col_bytes_sample.values})
      .assign(
          bytes_per_row=lambda d: d["bytes_sample"] / len(pdf_sample),
          est_total_bytes=lambda d: d["bytes_per_row"] * n_rows,
          est_total_mib=lambda d: d["est_total_bytes"] / (1024 ** 2),
      )
      .sort_values("est_total_bytes", ascending=False, ignore_index=True)
)

total_mib = col_mem_size_df['est_total_mib'].sum()  # the column holds MiB, not GiB
print(f"In-memory pandas df size: {total_mib: 2.2f} MiB")

In-memory pandas df size:  7251.87 MiB


In [17]:
col_mem_size_df

Unnamed: 0,col,bytes_sample,bytes_per_row,est_total_bytes,est_total_mib
0,nearest_landmark_en,1300539,65.02695,579764200.0,552.906189
1,nearest_metro_en,1283733,64.18665,572272300.0,545.76135
2,area_name_en,1265168,63.2584,563996200.0,537.868697
3,contract_id,1239989,61.99945,552771700.0,527.164193
4,ejari_property_sub_type_en,1221382,61.0691,544477000.0,519.253684
5,property_usage_en,1195115,59.75575,532767400.0,508.086631
6,nearest_mall_en,1125563,56.27815,501762000.0,478.517559
7,tenant_type_en,1079008,53.9504,481008400.0,458.725344
8,ejari_property_type_en,1071947,53.59735,477860700.0,455.723458
9,master_project_en,1071806,53.5903,477797800.0,455.663514


### step 2.4 String to categorical

If we can convert string columns to categorical values, we can save a lot of bytes.  


First, let's separate the string columns from the date columns.

As a rule of thumb, we assume that a column can be converted to categorical if its number of unique values divided by its number of rows is at most 0.1. The ratio is computed on the 20,000-row sample, so it only approximates the ratio on the full dataset.
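
The saving comes from dictionary encoding: each distinct string is stored once, and every row keeps only a small integer code. The pure-Python sketch below illustrates the idea and the unique-ratio rule on made-up values; it is a conceptual illustration, not how Polars implements `Categorical` internally.

```python
# Made-up column with 20 rows and 2 distinct values
values = ["Villa", "Unit"] * 10

# Store each distinct string once ...
categories = sorted(set(values))
code_of = {cat: i for i, cat in enumerate(categories)}

# ... and keep only a small integer code per row
codes = [code_of[v] for v in values]

# Unique ratio used as the eligibility rule (threshold 0.10 as in this notebook)
unique_ratio = len(categories) / len(values)
is_candidate = unique_ratio <= 0.10

# Decoding restores the original column exactly
decoded = [categories[i] for i in codes]

print(categories, codes[:4], unique_ratio, is_candidate)
```

The lower the unique ratio, the more rows share each stored string, which is why identifier-like columns with mostly unique values gain little from this encoding.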

In [18]:
# Columns eligible for categorical (strings only, excluding date cols)
schema = lf.collect_schema()
string_cols = [c for c in schema.names() if schema[c] == pl.Utf8 and c not in date_cols]


# Compute unique-ratio on the sample
ratios = (
    df_sample.select([(pl.col(c).n_unique() / pl.len()).alias(c) for c in string_cols])
              .row(0, named=True)
)
category_cols = [c for c, r in ratios.items() if r <= 0.10]

# Lazily cast on the FULL dataset (no full collect here)
if category_cols:
    lf = lf.with_columns(pl.col(category_cols).cast(pl.Categorical))


In [19]:
n_sample_rows = 20_000
df_sample = lf.head(n_sample_rows).collect(engine="streaming")

pdf_sample = df_sample.to_pandas()
col_bytes_sample = pdf_sample.memory_usage(index=False, deep=True).astype("int64")

col_mem_size_df = (
    pd.DataFrame({"col": col_bytes_sample.index, "bytes_sample": col_bytes_sample.values})
      .assign(
          bytes_per_row=lambda d: d["bytes_sample"] / len(pdf_sample),
          est_total_bytes=lambda d: d["bytes_per_row"] * n_rows,
          est_total_mib=lambda d: d["est_total_bytes"] / (1024 ** 2),
      )
      .sort_values("est_total_bytes", ascending=False, ignore_index=True)
)

total_mib = col_mem_size_df['est_total_mib'].sum()  # the column holds MiB, not GiB
print(f"In-memory pandas df size: {total_mib: 2.2f} MiB")

In-memory pandas df size:  1641.46 MiB


That's quite a result! By converting string columns to categorical, we reduced the estimated size of the dataset from about 7,252 MiB (7.1 GiB) to about 1,641 MiB (1.6 GiB).

In [21]:
col_mem_size_df

Unnamed: 0,col,bytes_sample,bytes_per_row,est_total_bytes,est_total_mib
0,contract_id,1239989,61.99945,552771700.0,527.164193
1,contract_reg_type_id,160000,8.0,71326020.0,68.02179
2,contract_start_date,160000,8.0,71326020.0,68.02179
3,no_of_prop,160000,8.0,71326020.0,68.02179
4,contract_end_date,160000,8.0,71326020.0,68.02179
5,contract_amount,160000,8.0,71326020.0,68.02179
6,annual_amount,160000,8.0,71326020.0,68.02179
7,is_free_hold,160000,8.0,71326020.0,68.02179
8,line_number,160000,8.0,71326020.0,68.02179
9,ejari_bus_property_type_id,160000,8.0,71326020.0,68.02179


### Step 2.5 numerical columns

Let's check which columns from our top-10 list of the largest columns didn't make it into the list of categorical columns.

In [20]:
set(col_mem_size_df.head(10)['col']) - set(category_cols)

{'annual_amount',
 'contract_amount',
 'contract_end_date',
 'contract_id',
 'contract_reg_type_id',
 'contract_start_date',
 'ejari_bus_property_type_id',
 'is_free_hold',
 'line_number',
 'no_of_prop'}

In [41]:
import polars as pl

def get_min_int_type_with_unsigned(
    min_val: int | None,
    max_val: int | None,
    *,
    allow_unsigned: bool = True,
    prefer_default_if_null: pl.DataType | None = pl.Int32,
    allow_boolean: bool = True,
) -> pl.DataType | None:
    """
    Return the smallest integer dtype that fits [min_val, max_val].
    - If both min/max are None (all-null), returns prefer_default_if_null (or None to skip).
    - If allow_unsigned and range is non-negative, prefer UInt*.
    - If allow_boolean and range is within {0,1}, return Boolean.
    """
    if min_val is None and max_val is None:
        return prefer_default_if_null  # or None to "skip casting"

    if min_val is None or max_val is None:
        # Some nulls present but we have at least one bound; keep conservative
        min_val = min_val if min_val is not None else 0
        max_val = max_val if max_val is not None else 0

    # Optional: boolean
    if allow_boolean and 0 <= min_val and max_val <= 1:
        return pl.Boolean

    if allow_unsigned and min_val >= 0:
        if max_val <= 255: return pl.UInt8
        if max_val <= 65_535: return pl.UInt16
        if max_val <= 4_294_967_295: return pl.UInt32
        return pl.UInt64

    # signed ladder
    if -128 <= min_val and max_val <= 127: return pl.Int8
    if -32_768 <= min_val and max_val <= 32_767: return pl.Int16
    if -2_147_483_648 <= min_val and max_val <= 2_147_483_647: return pl.Int32
    return pl.Int64
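
The integer bounds in the function above are typed by hand, which is easy to get wrong. As a cross-check, the same ladder can be derived from `np.iinfo`. This is a NumPy-based sketch, not the notebook's function: `smallest_int_dtype` is a hypothetical name, and it leaves out the Boolean and all-null branches.

```python
import numpy as np

# Candidate dtypes ordered from narrowest to widest
_UNSIGNED = (np.uint8, np.uint16, np.uint32, np.uint64)
_SIGNED = (np.int8, np.int16, np.int32, np.int64)


def smallest_int_dtype(min_val: int, max_val: int):
    """Return the narrowest NumPy integer dtype that holds [min_val, max_val]."""
    candidates = _UNSIGNED if min_val >= 0 else _SIGNED
    for dt in candidates:
        info = np.iinfo(dt)  # the bounds come from NumPy, not from literals
        if info.min <= min_val and max_val <= info.max:
            return dt
    raise OverflowError(f"no 64-bit integer dtype fits [{min_val}, {max_val}]")


print(smallest_int_dtype(0, 255).__name__)     # uint8
print(smallest_int_dtype(0, 256).__name__)     # uint16
print(smallest_int_dtype(-129, 0).__name__)    # int16
```

Checking the values just on either side of each boundary, as above, is a cheap way to test any hand-written ladder of this kind.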


In [42]:
# 1) compute min/max per candidate column (streaming to keep RAM low)
i_cols = [c for c, dt in lf.collect_schema().items() if dt in (pl.Int64, pl.Int32, pl.Int16, pl.Int8, pl.UInt64, pl.UInt32, pl.UInt16, pl.UInt8)]
mxmn = lf.select(
    [pl.col(c).min().alias(f"{c}__min") for c in i_cols] +
    [pl.col(c).max().alias(f"{c}__max") for c in i_cols]
).collect(engine="streaming")

# 2) decide targets
casts = []
for c in i_cols:
    mn = mxmn[f"{c}__min"][0]
    mx = mxmn[f"{c}__max"][0]
    tgt = get_min_int_type_with_unsigned(mn, mx, allow_unsigned=True, prefer_default_if_null=pl.Int32, allow_boolean=True)

    if tgt is None:
        continue  # skip all-null
    # null-safe cast; values outside range become null (optional)
    lo, hi = {
        pl.Boolean: (0, 1),
        pl.UInt8: (0, 255),
        pl.UInt16: (0, 65_535),
        pl.UInt32: (0, 4_294_967_295),
        pl.UInt64: (0, 18_446_744_073_709_551_615),
        pl.Int8: (-128, 127),
        pl.Int16: (-32_768, 32_767),
        pl.Int32: (-2_147_483_648, 2_147_483_647),
        pl.Int64: (-9_223_372_036_854_775_808, 9_223_372_036_854_775_807),
    }[tgt]

    casts.append(
        pl.when(pl.col(c).is_between(lo, hi, closed="both") | pl.col(c).is_null())
          .then(pl.col(c).cast(tgt))
          .otherwise(pl.lit(None, dtype=tgt))
          .alias(c)
    )

lf_opt = lf.with_columns(casts)


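In the next cell we find the rows whose Int64 values do not fit into Int32, save them to a separate file for inspection, filter them out of the main dataset, and cast the Int64 columns to Int32.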

In [22]:
INT32_MIN, INT32_MAX = -2_147_483_648, 2_147_483_647

schema = lf.collect_schema()
i64_cols = [c for c, dt in schema.items() if dt == pl.Int64]

# First, identify and save problematic rows
problem_conditions = [
    ~(pl.col(c).is_between(INT32_MIN, INT32_MAX, closed="both") | pl.col(c).is_null())
    for c in i64_cols
]

if problem_conditions:
    # Find rows with problems
    problematic_rows = lf.filter(pl.any_horizontal(problem_conditions))

    # Save them to a separate file for inspection
    problematic_rows.sink_parquet("problematic_rows.parquet")
    # Or collect a sample to inspect
    print(f"Found {problematic_rows.select(pl.len()).collect().item()} problematic rows")
    print("Sample of problematic rows:")
    print(problematic_rows.limit(5).collect())

    # Now filter them out from main dataset
    filter_conditions = [
        pl.col(c).is_between(INT32_MIN, INT32_MAX, closed="both") | pl.col(c).is_null()
        for c in i64_cols
    ]
    lf = lf.filter(pl.all_horizontal(filter_conditions))

# Safe to cast now
lf = lf.with_columns([
    pl.col(c).cast(pl.Int32) for c in i64_cols
])

# df = lf.collect()

Found 30 problematic rows
Sample of problematic rows:
shape: (5, 27)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ contract_ ┆ contract_ ┆ contract_ ┆ contract_ ┆ … ┆ nearest_m ┆ nearest_m ┆ tenant_ty ┆ tenant_t │
│ id        ┆ reg_type_ ┆ reg_type_ ┆ start_dat ┆   ┆ etro_en   ┆ all_en    ┆ pe_id     ┆ ype_en   │
│ ---       ┆ id        ┆ en        ┆ e         ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---      │
│ str       ┆ ---       ┆ ---       ┆ ---       ┆   ┆ cat       ┆ cat       ┆ i64       ┆ cat      │
│           ┆ i64       ┆ cat       ┆ date      ┆   ┆           ┆           ┆           ┆          │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ CRT207994 ┆ 2         ┆ Renew     ┆ 2023-04-0 ┆ … ┆ Nakheel   ┆ Marina    ┆ 1         ┆ Authorit │
│ 4156      ┆           ┆           ┆ 1         ┆   ┆ Metro     ┆ Mall      ┆           ┆ y        │
│           ┆         

In [43]:
n_sample_rows = 20_000
df_sample = lf_opt.head(n_sample_rows).collect(engine="streaming")

pdf_sample = df_sample.to_pandas()
col_bytes_sample = pdf_sample.memory_usage(index=False, deep=True).astype("int64")

col_mem_size_df = (
    pd.DataFrame({"col": col_bytes_sample.index, "bytes_sample": col_bytes_sample.values})
      .assign(
          bytes_per_row=lambda d: d["bytes_sample"] / len(pdf_sample),
          est_total_bytes=lambda d: d["bytes_per_row"] * n_rows,
          est_total_mib=lambda d: d["est_total_bytes"] / (1024 ** 2),
      )
      .sort_values("est_total_bytes", ascending=False, ignore_index=True)
)

total_mib = col_mem_size_df['est_total_mib'].sum()  # the column holds MiB, not GiB
print(f"In-memory pandas df size: {total_mib: 2.2f} MiB")

In-memory pandas df size:  1827.07 MiB


In [29]:
schema

Schema([('contract_id', String),
        ('contract_reg_type_id', Int32),
        ('contract_reg_type_en', Categorical(ordering='physical')),
        ('contract_start_date', Date),
        ('contract_end_date', Date),
        ('contract_amount', Int32),
        ('annual_amount', Int32),
        ('no_of_prop', Int32),
        ('line_number', Int32),
        ('is_free_hold', Int32),
        ('ejari_bus_property_type_id', Int32),
        ('ejari_bus_property_type_en', Categorical(ordering='physical')),
        ('ejari_property_type_id', Int32),
        ('ejari_property_type_en', Categorical(ordering='physical')),
        ('ejari_property_sub_type_id', Int32),
        ('ejari_property_sub_type_en', Categorical(ordering='physical')),
        ('property_usage_en', Categorical(ordering='physical')),
        ('project_number', Int32),
        ('project_name_en', Categorical(ordering='physical')),
        ('master_project_en', Categorical(ordering='physical')),
        ('area_id', Int32),
     

In [24]:
col_mem_size_df

Unnamed: 0,col,bytes_sample,bytes_per_row,est_total_bytes,est_total_mib
0,contract_id,1239989,61.99945,552771700.0,527.164193
1,contract_end_date,160000,8.0,71326020.0,68.02179
2,contract_start_date,160000,8.0,71326020.0,68.02179
3,is_free_hold,160000,8.0,71326020.0,68.02179
4,ejari_bus_property_type_id,160000,8.0,71326020.0,68.02179
5,ejari_property_sub_type_id,160000,8.0,71326020.0,68.02179
6,ejari_property_type_id,160000,8.0,71326020.0,68.02179
7,tenant_type_id,160000,8.0,71326020.0,68.02179
8,project_number,160000,8.0,71326020.0,68.02179
9,project_name_en,96625,4.83125,43074230.0,41.078784


In [31]:
# `streaming=True` is deprecated in recent Polars releases; use `engine="streaming"` as in the earlier cells
df = lf.collect(engine="streaming")

In [32]:
df

contract_id,contract_reg_type_id,contract_reg_type_en,contract_start_date,contract_end_date,contract_amount,annual_amount,no_of_prop,line_number,is_free_hold,ejari_bus_property_type_id,ejari_bus_property_type_en,ejari_property_type_id,ejari_property_type_en,ejari_property_sub_type_id,ejari_property_sub_type_en,property_usage_en,project_number,project_name_en,master_project_en,area_id,area_name_en,nearest_landmark_en,nearest_metro_en,nearest_mall_en,tenant_type_id,tenant_type_en
str,i32,cat,date,date,i32,i32,i32,i32,i32,i32,cat,i32,cat,i32,cat,cat,i32,cat,cat,i32,cat,cat,cat,cat,i32,cat
"""CRT1012981266""",1,"""New""",2019-04-07,2020-04-06,85000,85000,1,1,1,2,"""Unit""",2,"""Office""",422,"""Office""","""Commercial""",467,"""EMPIRE HEIGHTS""","""Business Bay""",526,"""Business Bay""","""Downtown Dubai""","""Buj Khalifa Dubai Mall Metro S…","""Dubai Mall""",1,"""Authority"""
"""CRT1012983196""",1,"""New""",2019-04-20,2020-04-19,110000,110000,1,1,1,4,"""Villa""",841,"""Villa""",2,"""2 bed rooms+hall""","""Residential""",,,"""Jumeirah Village Triangle""",442,"""Al Barsha South Fifth""","""Sports City Swimming Academy""","""Nakheel Metro Station""","""Marina Mall""",1,"""Authority"""
"""CRT1012984226""",1,"""New""",2019-04-11,2020-04-10,100000,100000,1,1,1,4,"""Villa""",841,"""Villa""",3,"""3 bed rooms+hall""","""Residential""",1488,"""REEM - MIRA OASIS COMMUNITY""",,506,"""Al Yelayiss 1""","""Dubai Cycling Course""",,,1,"""Authority"""
"""CRT1012984996""",2,"""Renew""",2019-03-18,2020-03-17,150000,150000,1,1,1,4,"""Villa""",841,"""Villa""",3,"""3 bed rooms+hall""","""Residential""",1377,"""ARABIAN RANCHES - PALMA COMMUN…","""Arabian Ranches II - PALMA""",463,"""Wadi Al Safa 7""","""Motor City""",,,1,"""Authority"""
"""CRT1012986616""",1,"""New""",2019-04-15,2020-04-14,95000,95000,1,1,1,2,"""Unit""",842,"""Flat""",1,"""1bed room+Hall""","""Residential""",,,"""Jumeriah Beach Residence - JB…",330,"""Marsa Dubai""","""Burj Al Arab""","""Jumeirah Beach Residency""","""Marina Mall""",0,"""Person"""
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""CNT9995""",1,"""New""",2010-12-01,2011-11-30,81840,81840,1,1,0,2,"""Unit""",842,"""Flat""",4,"""4 bed rooms+hall""","""Residential""",,,,234,"""Hor Al Anz East""","""Dubai International Airport""","""Abu Hail Metro Station""","""City Centre Mirdif""",0,"""Person"""
"""CNT9996""",1,"""New""",2010-07-01,2011-06-30,28000,28000,1,1,0,2,"""Unit""",842,"""Flat""",11,"""Studio""","""Residential""",,,,234,"""Hor Al Anz East""","""Dubai International Airport""","""Abu Hail Metro Station""","""City Centre Mirdif""",0,"""Person"""
"""CNT9997""",1,"""New""",2010-09-01,2011-08-31,28000,28000,1,1,0,2,"""Unit""",842,"""Flat""",11,"""Studio""","""Residential""",,,,234,"""Hor Al Anz East""","""Dubai International Airport""","""Abu Hail Metro Station""","""City Centre Mirdif""",0,"""Person"""
"""CNT9998""",1,"""New""",2010-08-01,2011-07-31,30000,30000,1,1,0,2,"""Unit""",842,"""Flat""",11,"""Studio""","""Residential""",,,,234,"""Hor Al Anz East""","""Dubai International Airport""","""Abu Hail Metro Station""","""City Centre Mirdif""",0,"""Person"""


In [33]:
df0 = df.to_pandas()

In [34]:
df0.to_parquet(f"datasets/rent_contracts_8192_{date}.parquet",
               compression='brotli')

In [35]:
fn = f"datasets/rent_contracts_8192_{date}.parquet"
file_size = os.path.getsize(fn)
print(f"File size: {file_size / 1_000_000: .2f} MB")  # decimal megabytes

File size:  127.03 MB


In [39]:
df0.memory_usage(deep=True).sum()/(1024 * 1024)

np.float64(1433.4464874267578)

In [40]:
df0.memory_usage(deep=True)

Unnamed: 0,0
Index,132
contract_id,548903742
contract_reg_type_id,35662888
contract_reg_type_en,8915936
contract_start_date,71325776
contract_end_date,71325776
contract_amount,35662888
annual_amount,35662888
no_of_prop,35662888
line_number,35662888
