# Value counts

In [1]:
import polars as pl

In [2]:
csv_file = 'data/titanic.csv'

In [3]:
df = pl.read_csv(csv_file)
df.head(3)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""


## Count occurrences in a `Series`
`value_counts` counts how many times each unique value occurs in a `Series`.

In [4]:
df["Pclass"].value_counts()

Pclass,count
i64,u32
2,184
1,216
3,491


In Pandas, the output of this operation is a `Series`.

In Polars, however, the output is a `DataFrame` with one column for the categories and one for the counts.

The row order can vary from run to run unless you pass `sort=True`, which sorts by count from highest to lowest.

In [6]:
df["Pclass"].value_counts(sort=True)

Pclass,count
i64,u32
3,491
1,216
2,184


Use the `sort` method to sort by category instead.

In [7]:
df["Pclass"].value_counts().sort("Pclass")

Pclass,count
i64,u32
1,216
2,184
3,491


`value_counts` works on a single column, so it is not run in parallel by default.

For a long `Series`, it might be worth running it in parallel with the `parallel` argument.

In [8]:
df['Pclass'].value_counts(parallel=True)

Pclass,count
i64,u32
1,216
2,184
3,491


## Value counts as an expression

When `value_counts` is used as an expression, the output is a one-column `DataFrame` whose column has the `pl.Struct` dtype.

In [9]:
df.select(
    pl.col("Pclass").value_counts()
)

Pclass
struct[2]
"{1,216}"
"{3,491}"
"{2,184}"


Get the output as a two-column `DataFrame` by calling `.struct.unnest` on the `Series`.

In [10]:
df.select(
    pl.col("Pclass").value_counts()
)["Pclass"].struct.unnest()

Pclass,count
i64,u32
2,184
3,491
1,216


## Plotting the value counts

In [17]:
df["Pclass"]\
.value_counts()\
.sort("Pclass")\
.with_columns(
    pl.col("Pclass").cast(pl.Utf8)
)\
.plot\
.bar(
    x="Pclass",
    y="count",
)\
.properties(width=500)

## Value counts in lazy mode
To calculate value counts in lazy mode, call `value_counts` as an expression on a `LazyFrame`.

In [19]:
pl.scan_csv(csv_file)\
.select(
    pl.col("Pclass").value_counts()
)\
.collect()["Pclass"]\
.struct.unnest()

Pclass,count
i64,u32
1,216
2,184
3,491


In [21]:
print(pl.scan_csv(csv_file)\
.select(
    pl.col("Pclass").value_counts()
)\
.explain())

SELECT [col("Pclass").value_counts()]
  Csv SCAN [data/titanic.csv]
  PROJECT 1/12 COLUMNS
  ESTIMATED ROWS: 971


## Exercises

### Exercise 1
Calculate the value counts on the `Survived` column as a `Series`. 

In [22]:
df["Survived"].value_counts()

Survived,count
i64,u32
1,342
0,549


Sort the output from highest to lowest

In [23]:
df["Survived"].value_counts(sort=True)

Survived,count
i64,u32
0,549
1,342


Calculate the value counts on the `Survived` column as an expression 

In [24]:
df.select(
    pl.col("Survived").value_counts()
)

Survived
struct[2]
"{1,342}"
"{0,549}"


Calculate the value counts on the `Survived` column as an expression and convert the `pl.Struct` column to a `DataFrame`

In [25]:
df.select(
    pl.col("Survived").value_counts()
)["Survived"]\
.struct.unnest()

Survived,count
i64,u32
1,342
0,549


### Exercise 2
As in the first part of Exercise 1, calculate the value counts on the `Survived` column as a `Series`, this time sorted from highest to lowest.

In [26]:
df["Survived"].value_counts(sort=True)

Survived,count
i64,u32
0,549
1,342


Add an additional column with the share of passengers in each category (divide the `count` column by the sum of the `count` column).

In [27]:
df["Survived"].value_counts(sort=True)\
.with_columns(
    percent = pl.col("count") / pl.col("count").sum()
)

Survived,count,percent
i64,u32,f64
0,549,0.616162
1,342,0.383838


Express the percentages as values ranging from 0 to 100.

In [29]:
df["Survived"].value_counts(sort=True)\
.with_columns(
    percent = ((pl.col("count") / pl.col("count").sum()) * 100)
)

Survived,count,percent
i64,u32,f64
0,549,61.616162
1,342,38.383838


Visualise the percentage values for each category in a bar chart.

In [33]:
df["Survived"].value_counts(sort=True)\
.with_columns(
    pl.col("Survived").cast(pl.Utf8),
    percent = ((pl.col("count") / pl.col("count").sum()) * 100)
)\
.plot\
.bar(
    x="Survived",
    y="percent",
    color="Survived"
)\
.properties(width=500)

### Exercise 3

Construct the query that produces the following optimized query plan
```
 SELECT [col("Age").round().value_counts()] FROM

    Csv SCAN ../data/titanic.csv
    PROJECT 1/12 COLUMNS
```


In [40]:
lf = pl.scan_csv(csv_file)\
.select(
    pl.col("Age").round(0).value_counts()
)

print(lf.explain())

SELECT [col("Age").round().value_counts()]
  Csv SCAN [data/titanic.csv]
  PROJECT 1/12 COLUMNS
  ESTIMATED ROWS: 971


### Exercise 4
We create a `DataFrame` from the Spotify data

In [41]:
pl.Config.set_fmt_str_lengths(100)
pl.Config.set_tbl_rows(10)
spotify_csv = "data/spotify-charts-2017-2021-global-top200.csv.gz"
spotify_df = pl.read_csv(spotify_csv,try_parse_dates=True)
spotify_df.head(3)

title,rank,date,artist,url,region,chart,trend,streams
str,i64,date,str,str,str,str,str,i64
"""Starboy""",1,2017-01-01,"""The Weeknd, Daft Punk""","""https://open.spotify.com/track/5aAx2yezTd8zXrkmtKl66Z""","""Global""","""top200""","""SAME_POSITION""",3135625
"""Closer""",2,2017-01-01,"""The Chainsmokers, Halsey""","""https://open.spotify.com/track/7BKLCZ1jbUBVqRi2FVlTVw""","""Global""","""top200""","""SAME_POSITION""",3015525
"""Let Me Love You""",3,2017-01-01,"""DJ Snake, Justin Bieber""","""https://open.spotify.com/track/4pdPtRcBmOSQDlJ3Fk945m""","""Global""","""top200""","""MOVE_UP""",2545384


Create a `DataFrame` with the 5 most common tracks by count of rows

In [42]:
spotify_df["title"].value_counts(sort=True).head(5)

title,count
str,u32
"""Shape of You""",1784
"""Believer""",1776
"""Say You Won't Let Go""",1768
"""Perfect""",1747
"""goosebumps""",1564


Create a bar chart of the 5 most common tracks by count of rows in your preferred plotting library

In [50]:
import altair as alt

spotify_df["title"].value_counts(sort=True).head(5)\
.plot\
.bar(
    x=alt.X("title", sort=None),
    y="count",
    color="title"
)\
.properties(width=500)