# Preprocessing Kickstarter Data with Polars


Import `polars`.


In [4]:
import polars as pl

## Reading CSV with Polars


:::{caution} Polars and "null" Strings

Reading the CSV files with Polars' default settings can fail because the data contains the literal string "null". Polars is stricter than pandas about types: it infers each column's data type from the column's content, and a "null" string in an otherwise numeric column can conflict with the inferred type and raise an error.

To handle this, pass `null_values=["null"]` to `pl.read_csv` so that Polars treats the string as a missing value. For reference, the call below omits that argument and is the version that can fail.

```python
# This code can throw an error
df_pl = pl.read_csv("my_kickstarter_dataset.csv")
df_pl.head(3)
```

:::


:::{note} Overriding Schema

You can also set column types explicitly with `schema_overrides` when reading the CSV file. An override does not by itself make the string "null" parseable as a number, so keep `null_values=["null"]` alongside it.

```python
# Sample code to read CSV with schema overrides
df_pl = pl.read_csv(
    "my_kickstarter_dataset.csv",
    null_values=["null"],
    schema_overrides={"converted_pledged_amount": pl.Float64},
)
```

:::


In [5]:
df_pl = pl.read_csv(
    "http://raw.githubusercontent.com/bdi593/datasets/refs/heads/main/kickstarter-projects/kickstarter-sample-data.csv",
    null_values=["null"],
)
display(df_pl.head(2))
df_pl.shape

backers_count,blurb,category,converted_pledged_amount,country,country_displayable_name,created_at,creator,currency,currency_symbol,currency_trailing_code,current_currency,deadline,disable_communication,fx_rate,goal,id,is_disliked,is_in_post_campaign_pledging_phase,is_launched,is_liked,is_starrable,launched_at,location,name,percent_funded,photo,pledged,prelaunch_activated,profile,slug,source_url,spotlight,staff_pick,state,state_changed_at,static_usd_rate,urls,usd_exchange_rate,usd_pledged,usd_type,video
i64,str,str,i64,str,str,i64,str,str,str,bool,str,i64,bool,f64,i64,i64,bool,bool,bool,bool,bool,i64,str,str,f64,str,f64,bool,str,str,str,bool,bool,str,i64,f64,str,f64,f64,str,str
4,"""Original works of art created …","""{""id"":289,""name"":""Textiles"",""a…",13,"""US""","""the United States""",1468927141,"""{""id"":1249027971,""name"":""Andi …","""USD""","""$""",True,"""USD""",1471559704,False,1.0,1500,1487477843,False,False,True,False,False,1468967704,"""{""id"":2399954,""name"":""Epping"",…","""Stained Glass Mosaics""",0.866667,"""{""key"":""assets/013/105/528/54b…",13.0,False,"""{""id"":2605767,""project_id"":260…","""stained-glass-mosaics""","""https://www.kickstarter.com/di…",False,False,"""failed""",1471559704,1.0,"""{""web"":{""project"":""https://www…",1.0,13.0,"""domestic""",
1,"""You are the focus of luxury cr…","""{""id"":345,""name"":""DIY"",""analyt…",7,"""CA""","""Canada""",1766273106,"""{""id"":1961998114,""name"":""Ivana…","""CAD""","""$""",True,"""USD""",1769903400,False,0.739305,1000,1899748009,False,False,True,False,False,1766344490,"""{""id"":3531,""name"":""Windsor"",""s…","""Personalized Boutique Balloons""",1.0,"""{""key"":""assets/052/009/799/238…",10.0,False,"""{""id"":5303540,""project_id"":530…","""personalized-boutique-balloons""","""https://www.kickstarter.com/di…",False,False,"""failed""",1769903400,0.724874,"""{""web"":{""project"":""https://www…",0.733972,7.2487405,"""domestic""",


(3173, 42)

### Select Relevant Columns


There are 42 columns in the dataset. Let's begin by selecting only the columns that are relevant to our analysis.


In [6]:
keep = [
    "name",
    "state",
    "backers_count",
    "usd_pledged",
    "goal",
    "percent_funded",
    "launched_at",
    "state_changed_at",
    "country",
    "currency",
    "staff_pick",
    "spotlight",
    "category",
    "video",
    "blurb",
]

df_pl = df_pl.select([c for c in keep if c in df_pl.columns])
display(df_pl.head(3))
df_pl.shape

name,state,backers_count,usd_pledged,goal,percent_funded,launched_at,state_changed_at,country,currency,staff_pick,spotlight,category,video,blurb
str,str,i64,f64,i64,f64,i64,i64,str,str,bool,bool,str,str,str
"""Stained Glass Mosaics""","""failed""",4,13.0,1500,0.866667,1468967704,1471559704,"""US""","""USD""",False,False,"""{""id"":289,""name"":""Textiles"",""a…",,"""Original works of art created …"
"""Personalized Boutique Balloons""","""failed""",1,7.2487405,1000,1.0,1766344490,1769903400,"""CA""","""CAD""",False,False,"""{""id"":345,""name"":""DIY"",""analyt…",,"""You are the focus of luxury cr…"
"""From Pop-Up to Permanent""","""failed""",2,316.0,4500,7.022222,1765431741,1768455741,"""US""","""USD""",False,False,"""{""id"":343,""name"":""Candles"",""an…",,"""Join us in building a studio h…"


(3173, 15)

### Filter based on `"state"`


First, check how often each value of the `"state"` column occurs. This shows which states are present, how the projects are distributed across them, and whether any values look unexpected.


In [7]:
df_pl["state"].value_counts()

state,count
str,u32
"""submitted""",144
"""live""",131
"""canceled""",157
"""failed""",1030
"""started""",35
"""successful""",1676


We are mainly interested in the "successful" and "failed" states, as these represent the outcomes of the Kickstarter projects. The other states in the output ("submitted", "live", "canceled", and "started") could be kept as separate categories, but for this analysis we filter them out.

Keep only the completed campaigns by filtering based on the "state" column.


In [8]:
df_pl = df_pl.filter(pl.col("state").is_in(["successful", "failed"]))

:::{seealso} Does Polars have an `inplace=True` parameter?

In pandas, many methods accept `inplace=True` to modify a DataFrame in place. Polars has no `inplace` parameter because it follows a different design philosophy: methods such as `filter`, `select`, and `with_columns` return a new DataFrame and leave the original unchanged. This is why the cells in this notebook assign the result back to `df_pl`. Treating DataFrames as immutable makes chained operations predictable and debugging easier.

:::


### Check Other Boolean Columns

While this step is optional, it can be helpful to check the frequency of other boolean columns in the dataset to understand their distribution and how they might relate to the "state" column.


#### `"staff_pick"`


The `"staff_pick"` column indicates whether a project was selected as a staff pick by Kickstarter. This could be an interesting feature to analyze, as being a staff pick may have an impact on the success of a campaign. We can check the frequency of this column to see how many projects were staff picks and how many were not.


In [9]:
df_pl["staff_pick"].value_counts()

staff_pick,count
bool,u32
True,491
False,2215


#### `"spotlight"`


The `"spotlight"` column indicates whether a project has Kickstarter's Spotlight feature enabled, which Kickstarter offers to successfully funded projects. Because such a flag may simply mirror the outcome rather than predict it, check its frequency and compare it with the `"state"` counts before using it as a feature.


In [10]:
df_pl["spotlight"].value_counts()

spotlight,count
bool,u32
False,1030
True,1676


### Convert Epoch Timestamp to Datetime


Convert the `"launched_at"` and `"state_changed_at"` columns from epoch time to datetime format for easier analysis.


In [11]:
df_pl = df_pl.with_columns(
    pl.from_epoch("launched_at", time_unit="s").alias("launched_at"),
    pl.from_epoch("state_changed_at", time_unit="s").alias("state_changed_at"),
)

df_pl.select(["launched_at", "state_changed_at"]).head(3)

launched_at,state_changed_at
datetime[μs],datetime[μs]
2016-07-19 22:35:04,2016-08-18 22:35:04
2025-12-21 19:14:50,2026-01-31 23:50:00
2025-12-11 05:42:21,2026-01-15 05:42:21


:::{hint} Why Convert Epoch Time to Datetime?

Converting epoch time to datetime format allows us to easily perform time-based analyses, such as calculating the duration of campaigns, analyzing trends over time, and visualizing data in a more human-readable format. It also enables us to easily extract components like year, month, day, etc., which can be useful for further analysis.

For performance reasons, we can choose to convert these columns to datetime format only when we need to perform time-based analyses, rather than converting them immediately after reading the data. Epoch timestamps are stored as integers, which can be more efficient for storage and certain types of calculations. We can convert them to datetime format on-the-fly when we need to work with them in a more human-readable way.

:::


:::{caution} Do not run this code multiple times

The conversion cell overwrites `"launched_at"` and `"state_changed_at"`, so it should run exactly once on the original integer columns. If you re-run the cell, `from_epoch()` receives datetime values instead of epoch seconds, which can raise an error or silently produce incorrect datetimes. If that happens, reload the data and run the notebook again from the top.

:::


### Parse Video Information


The `"video"` column contains JSON strings with information about the project's video, such as the video URLs, width, height, and codecs. We can parse this JSON data to extract relevant information about the videos associated with each project.

Sample a few rows where the video information is not null.


In [12]:
df_video_samples = (
    df_pl.filter(pl.col("video").is_not_null())
    .sample(n=5, seed=42)
    .select(["name", "video"])
)
df_video_samples

name,video
str,str
"""Riven By Ravens: Album Copyrig…","""{""id"":1260668,""status"":""succes…"
"""Hell On Mask - Decameroom""","""{""id"":1031556,""status"":""succes…"
"""Michael McDermott's New Album …","""{""id"":1114575,""status"":""succes…"
"""Experimental Opera- ""Tydrus th…","""{""id"":13226,""status"":""successf…"
"""Hill's House Provisions - Nati…","""{""id"":951571,""status"":""success…"


The JSON strings are truncated in the output. We can print the full JSON string for a sample row to see the complete structure of the video information.


In [13]:
video_json = df_video_samples.row(0)[1]
print(video_json)

{"id":1260668,"status":"successful","hls":"https://v2.kickstarter.com/1770873151-S0kszdLXafjMVmF4ZVvrjPK4tMfnFbRgX5mJxHQE74s%3D/projects/4687488/video-1260668-hls_playlist.m3u8","hls_type":"application/x-mpegURL","high":"https://v2.kickstarter.com/1770873151-S0kszdLXafjMVmF4ZVvrjPK4tMfnFbRgX5mJxHQE74s%3D/projects/4687488/video-1260668-h264_high.mp4","high_type":"video/mp4; codecs="avc1.64001E, mp4a.40.2"","base":"https://v2.kickstarter.com/1770873151-S0kszdLXafjMVmF4ZVvrjPK4tMfnFbRgX5mJxHQE74s%3D/projects/4687488/video-1260668-h264_base.mp4","base_type":"video/mp4; codecs="avc1.42E01E, mp4a.40.2"","tracks":"[]","width":640,"height":360,"frame":"https://d15chbti7ht62o.cloudfront.net/projects/4687488/video-1260668-h264_base.jpg?2023"}


In [14]:
df_pl = df_pl.with_columns(
    [
        # Extract digits following the "width": key
        pl.col("video")
        .str.extract(r'"width":\s*(\d+)', 1)
        .cast(pl.Int64)
        .alias("video_width"),
        # Extract digits following the "height": key
        pl.col("video")
        .str.extract(r'"height":\s*(\d+)', 1)
        .cast(pl.Int64)
        .alias("video_height"),
    ]
)

df_pl.filter(
    pl.col("video_width").is_not_null(), pl.col("video_height").is_not_null()
).select(["name", "video_width", "video_height"]).head(10)

name,video_width,video_height
str,i64,i64
"""43 Amp Arduino Motor shield, t…",640,360
"""DIY Your Own Robot: It's not t…",640,360
"""Fletcher's Myth Adventures""",640,480
"""Rendez-vous l'année dernière a…",640,268
"""Everything Happens at Once Mov…",640,360
"""MALICE: Wars""",640,480
"""The Supernova Helmet""",640,1138
"""Smog-A-Rator""",640,360
"""Pre-made Reusable, Latex-Free …",640,1138
"""HOPEN HEART / COEXIST jewelry …",640,512


:::{attention} Can we parse JSON programmatically instead of using Regex?

In the code above, we used regular expressions to extract the video width and height from the JSON string in the `"video"` column. However, this approach can be error-prone and may not handle all cases correctly, especially if the JSON structure changes or if there are variations in the formatting.

Polars can also parse JSON natively with the `str.json_decode` expression method, which decodes each string into a struct and does not depend on how the keys happen to be formatted. Here's how you can use `json_decode` to extract the video width and height:

```python
# 1. Define a struct schema for the fields we need
video_schema = pl.Struct([
    pl.Field("width", pl.Int64),
    pl.Field("height", pl.Int64),
])

# 2. Pass the schema into json_decode
#    (`my_dataframe` is a placeholder for your DataFrame, for example `df_pl`)
df_with_dims = my_dataframe.with_columns(
    pl.col("video").str.json_decode(dtype=video_schema).alias("video_struct")
).with_columns([
    # 3. Unpack the struct fields into their own columns
    pl.col("video_struct").struct.field("width").alias("video_width"),
    pl.col("video_struct").struct.field("height").alias("video_height"),
]).drop("video_struct")
```

However, this only works if the JSON is not malformed and follows a consistent structure. If the JSON data is inconsistent or contains errors, the entire operation will fail. This is because JSON parsing happens at the Rust level for high performance (Polars is written in Rust). If the parser hits a syntax error (like unescaped quotes, missing commas, etc.), it considers the entire operation a failure and throws a `ComputeError` rather than silently returning a `null`.

:::


In [15]:
df_pl.head(2)

name,state,backers_count,usd_pledged,goal,percent_funded,launched_at,state_changed_at,country,currency,staff_pick,spotlight,category,video,blurb,video_width,video_height
str,str,i64,f64,i64,f64,datetime[μs],datetime[μs],str,str,bool,bool,str,str,str,i64,i64
"""Stained Glass Mosaics""","""failed""",4,13.0,1500,0.866667,2016-07-19 22:35:04,2016-08-18 22:35:04,"""US""","""USD""",False,False,"""{""id"":289,""name"":""Textiles"",""a…",,"""Original works of art created …",,
"""Personalized Boutique Balloons""","""failed""",1,7.2487405,1000,1.0,2025-12-21 19:14:50,2026-01-31 23:50:00,"""CA""","""CAD""",False,False,"""{""id"":345,""name"":""DIY"",""analyt…",,"""You are the focus of luxury cr…",,


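### Parse Category Information


Like `"video"`, the `"category"` column holds a JSON string. The cell below replaces it with the category's `"name"` and stores the `"parent_name"` in a new `"category_parent"` column. Both expressions read the original JSON string, because all expressions in a single `with_columns` call are evaluated against the input DataFrame.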


In [16]:
import json

df_pl = df_pl.with_columns(
    pl.col("category")
    .map_elements(
        lambda x: json.loads(x).get("name") if x else None, return_dtype=pl.Utf8
    )
    .alias("category"),
    pl.col("category")
    .map_elements(
        lambda x: json.loads(x).get("parent_name") if x else None, return_dtype=pl.Utf8
    )
    .alias("category_parent"),
)

df_pl.head(3)

name,state,backers_count,usd_pledged,goal,percent_funded,launched_at,state_changed_at,country,currency,staff_pick,spotlight,category,video,blurb,video_width,video_height,category_parent
str,str,i64,f64,i64,f64,datetime[μs],datetime[μs],str,str,bool,bool,str,str,str,i64,i64,str
"""Stained Glass Mosaics""","""failed""",4,13.0,1500,0.866667,2016-07-19 22:35:04,2016-08-18 22:35:04,"""US""","""USD""",False,False,"""Textiles""",,"""Original works of art created …",,,"""Art"""
"""Personalized Boutique Balloons""","""failed""",1,7.2487405,1000,1.0,2025-12-21 19:14:50,2026-01-31 23:50:00,"""CA""","""CAD""",False,False,"""DIY""",,"""You are the focus of luxury cr…",,,"""Crafts"""
"""From Pop-Up to Permanent""","""failed""",2,316.0,4500,7.022222,2025-12-11 05:42:21,2026-01-15 05:42:21,"""US""","""USD""",False,False,"""Candles""",,"""Join us in building a studio h…",,,"""Crafts"""


Rearrange the columns so that `"category_parent"` comes right after `"category"` for easier analysis. We can use the `select` method to specify the order of the columns in the DataFrame.


In [17]:
cols = df_pl.columns
idx = cols.index("category")

df_pl = df_pl.select(
    cols[: idx + 1]
    + ["category_parent"]
    + [c for c in cols[idx + 1 :] if c != "category_parent"]
)

df_pl.head(3)

name,state,backers_count,usd_pledged,goal,percent_funded,launched_at,state_changed_at,country,currency,staff_pick,spotlight,category,category_parent,video,blurb,video_width,video_height
str,str,i64,f64,i64,f64,datetime[μs],datetime[μs],str,str,bool,bool,str,str,str,str,i64,i64
"""Stained Glass Mosaics""","""failed""",4,13.0,1500,0.866667,2016-07-19 22:35:04,2016-08-18 22:35:04,"""US""","""USD""",False,False,"""Textiles""","""Art""",,"""Original works of art created …",,
"""Personalized Boutique Balloons""","""failed""",1,7.2487405,1000,1.0,2025-12-21 19:14:50,2026-01-31 23:50:00,"""CA""","""CAD""",False,False,"""DIY""","""Crafts""",,"""You are the focus of luxury cr…",,
"""From Pop-Up to Permanent""","""failed""",2,316.0,4500,7.022222,2025-12-11 05:42:21,2026-01-15 05:42:21,"""US""","""USD""",False,False,"""Candles""","""Crafts""",,"""Join us in building a studio h…",,
