# Clustering

## Import the BipartitePandas package

Make sure to install it using `pip install bipartitepandas`.

In [1]:
import bipartitepandas as bpd

## Get your data ready

For this notebook, we simulate data.

In [2]:
df = bpd.SimBipartite().simulate()
bdf = bpd.BipartiteDataFrame(i=df['i'], j=df['j'], y=df['y'], t=df['t']).clean()
display(bdf)

checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how=False)
making 'i' ids contiguous
making 'j' ids contiguous
computing largest connected set (how=None)
sorting columns
resetting index


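| | i | j | y | t | m |
|---|---|---|---|---|---|
| 0 | 0 | 179 | 2.761114 | 0 | 1 |
| 1 | 0 | 173 | 2.648802 | 1 | 2 |
| 2 | 0 | 189 | 4.480470 | 2 | 2 |
| 3 | 0 | 180 | 1.011759 | 3 | 1 |
| 4 | 0 | 180 | 2.767630 | 4 | 0 |
| ... | ... | ... | ... | ... | ... |
| 49995 | 9999 | 76 | 1.191633 | 0 | 1 |
| 49996 | 9999 | 66 | 0.577275 | 1 | 2 |
| 49997 | 9999 | 30 | 0.633013 | 2 | 1 |
| 49998 | 9999 | 30 | 0.726568 | 3 | 0 |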
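
Two of the steps reported in the cleaning log can be illustrated in plain Python. This is only a sketch of the idea, not how BipartitePandas implements `.clean()`: `keep_highest_paying` and `make_contiguous` are hypothetical helpers, and numbering the new ids in sorted order is an assumption.

```python
def keep_highest_paying(rows):
    """For each (worker, period) pair, keep only the row with the highest income.

    Each row is a tuple (i, j, y, t).
    """
    best = {}
    for i, j, y, t in rows:
        if (i, t) not in best or y > best[(i, t)][2]:
            best[(i, t)] = (i, j, y, t)
    # Sort by worker, then by period, to get a stable ordering
    return sorted(best.values(), key=lambda r: (r[0], r[3]))


def make_contiguous(ids):
    """Relabel ids as 0, 1, 2, ... (assumption: new ids follow sorted order)."""
    mapping = {old: new for new, old in enumerate(sorted(set(ids)))}
    return [mapping[x] for x in ids]


# Toy data: worker 7 holds two jobs in period 0
rows = [(7, 10, 1.0, 0), (7, 30, 2.5, 0), (7, 30, 2.0, 1), (9, 10, 1.5, 0)]
cleaned = keep_highest_paying(rows)
new_i = make_contiguous([r[0] for r in cleaned])
new_j = make_contiguous([r[1] for r in cleaned])
```

In the toy data, worker 7's lower-paying job in period 0 is dropped, and the remaining worker and firm ids are relabeled without gaps.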


## Basic clustering

Clustering in BipartitePandas assigns each firm to a group.

Clustering is simple: run `.cluster()` and notice the new `g` column, which stores each firm's group.

In [3]:
bdf = bdf.cluster()
display(bdf)

AttributeError: module 'pandas' has no attribute '_lib'

## Advanced clustering

You can investigate all clustering parameters by running `bpd.cluster_params().describe_all()`. We are going to go through some of the most important options.

#### Computing measures and selecting how to group on them

We compute measures using the `bpd.measures` module, and group on the computed measures using the `bpd.grouping` module.

Let's use firm-level income CDFs (cumulative distribution functions) as our measure, and group using KMeans.

In [None]:
measures = bpd.measures.CDFs()
grouping = bpd.grouping.KMeans()
bdf = bdf.cluster(bpd.cluster_params({'measures': measures, 'grouping': grouping}))
display(bdf)
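
To make the idea concrete, here is a minimal, library-free sketch of what "measure, then group" can look like: each firm's empirical income CDF is evaluated on a shared grid of income thresholds, and the resulting vectors are grouped with a plain k-means loop. This is an illustration under stated assumptions, not the BipartitePandas implementation: `firm_cdfs`, `kmeans`, the threshold grid, and the toy data are all hypothetical.

```python
import random


def firm_cdfs(rows, grid):
    """Evaluate each firm's empirical income CDF at the thresholds in `grid`.

    Each row is a tuple (j, y): firm id and income.
    """
    incomes = {}
    for j, y in rows:
        incomes.setdefault(j, []).append(y)
    return {
        j: [sum(y <= c for y in ys) / len(ys) for c in grid]
        for j, ys in incomes.items()
    }


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: each center moves to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy data: firms 0 and 1 pay low incomes, firms 2 and 3 pay high incomes
rows = [
    (0, 1.0), (0, 1.2), (0, 0.9), (1, 1.1), (1, 0.8), (1, 1.3),
    (2, 3.0), (2, 3.2), (2, 2.9), (3, 3.1), (3, 2.8), (3, 3.3),
]
cdfs = firm_cdfs(rows, grid=[1.0, 2.0, 3.0])
firms = sorted(cdfs)
labels = kmeans([cdfs[j] for j in firms], k=2)
g = dict(zip(firms, labels))  # firm id -> group, analogous to the `g` column
```

The two low-income firms end up in one group and the two high-income firms in the other. The label numbers themselves are arbitrary.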

We can even group on multiple measures!

In [None]:
measures = [bpd.measures.CDFs(), bpd.measures.Moments(measures=['mean', 'var'])]
grouping = bpd.grouping.KMeans()
bdf = bdf.cluster(bpd.cluster_params({'measures': measures, 'grouping': grouping}))
display(bdf)
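
One natural way to think about grouping on several measures is that each firm's measure vectors are concatenated into one longer feature vector before grouping. The sketch below illustrates that idea only; whether BipartitePandas combines measures this way is an assumption, and `firm_cdfs`, `firm_moments`, and `combine` are hypothetical helpers.

```python
from statistics import mean, pvariance


def firm_cdfs(rows, grid):
    """Empirical income CDF per firm, evaluated at the thresholds in `grid`."""
    incomes = {}
    for j, y in rows:
        incomes.setdefault(j, []).append(y)
    return {j: [sum(y <= c for y in ys) / len(ys) for c in grid] for j, ys in incomes.items()}


def firm_moments(rows):
    """Mean and (population) variance of income per firm."""
    incomes = {}
    for j, y in rows:
        incomes.setdefault(j, []).append(y)
    return {j: [mean(ys), pvariance(ys)] for j, ys in incomes.items()}


def combine(*measures):
    """Concatenate several per-firm measure vectors into one feature vector."""
    firms = sorted(set.intersection(*(set(m) for m in measures)))
    return {j: [x for m in measures for x in m[j]] for j in firms}


# Toy data: both firms have the same mean income but different spreads
rows = [(0, 1.0), (0, 3.0), (1, 2.0), (1, 2.0)]
features = combine(firm_cdfs(rows, grid=[1.5, 2.5]), firm_moments(rows))
```

Here the two firms share the same mean income, so the mean alone cannot tell them apart. The variance and CDF entries can.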

#### Clustering on subsets of the data - stayers/movers/stays/moves

What if we want our measures to be computed using only movers or only stayers? We can specify `stayers_movers`. Note that some firms may not be clustered; these firms will have `g=pd.NA` (set `'dropna': True` if you want to drop firms that don't get clustered).

In [None]:
bdf = bdf.cluster(bpd.cluster_params({'stayers_movers': 'movers'}))
display(bdf)
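
The sketch below shows why restricting to a subset can leave some firms without a group. It is not the library's code: a mover is assumed here to be a worker observed at more than one firm, and `high_low_groups` is a hypothetical stand-in for a real grouping method.

```python
from statistics import mean


def movers_only(rows):
    """Keep rows of workers observed at more than one firm (assumed definition of a mover).

    Each row is a tuple (i, j, y): worker id, firm id, income.
    """
    firms_by_worker = {}
    for i, j, _ in rows:
        firms_by_worker.setdefault(i, set()).add(j)
    movers = {i for i, js in firms_by_worker.items() if len(js) > 1}
    return [r for r in rows if r[0] in movers]


def high_low_groups(subset, cutoff=2.0):
    """Hypothetical stand-in for a grouping method: split firms by mean income."""
    incomes = {}
    for _, j, y in subset:
        incomes.setdefault(j, []).append(y)
    return {j: int(mean(ys) > cutoff) for j, ys in incomes.items()}


# Toy data: workers 0 and 2 move between firms 0 and 1; worker 1 stays at firm 2
rows = [(0, 0, 1.0), (0, 1, 3.0), (1, 2, 2.0), (1, 2, 2.1), (2, 1, 3.2), (2, 0, 0.8)]
groups = high_low_groups(movers_only(rows))

# Firms that never appear in the subset get no group (None stands in for pd.NA)
g = {j: groups.get(j) for j in sorted({j for _, j, _ in rows})}

# Analogue of 'dropna': True, keeping only the firms that were clustered
g_dropna = {j: v for j, v in g.items() if v is not None}
```

Firm 2 employs only a stayer, so it never enters the measure computation and ends up without a group.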

#### Clustering on subsets of the data - time

What if we want to cluster using only particular periods of the data? We can specify `t`. Again, note that some firms may not be clustered; these firms will have `g=pd.NA` (set `'dropna': True` if you want to drop firms that don't get clustered).

In [None]:
bdf = bdf.cluster(bpd.cluster_params({'t': [0, 1, 2]}))
display(bdf)
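
As a row-level illustration, the sketch below groups firms using only the selected periods and then attaches the resulting group to every row, including rows outside those periods. It is a sketch under assumptions, not the library's behavior: `cluster_periods` and `parity_groups` are hypothetical, and `dropna=True` is assumed to drop every row belonging to an unclustered firm.

```python
def parity_groups(subset):
    """Hypothetical stand-in for a grouping method: group firms by id parity."""
    return {r["j"]: r["j"] % 2 for r in subset}


def cluster_periods(rows, periods, group_fn, dropna=False):
    """Group firms using only rows with t in `periods`, then label all rows."""
    subset = [r for r in rows if r["t"] in periods]
    groups = group_fn(subset)
    # Firms never observed in `periods` get None (standing in for pd.NA)
    out = [dict(r, g=groups.get(r["j"])) for r in rows]
    if dropna:
        out = [r for r in out if r["g"] is not None]
    return out


# Toy data: firm 5 is only observed in period 3, outside the selected periods
rows = [
    {"i": 0, "j": 0, "t": 0},
    {"i": 0, "j": 1, "t": 1},
    {"i": 0, "j": 5, "t": 3},
]
labeled = cluster_periods(rows, periods={0, 1, 2}, group_fn=parity_groups)
labeled_dropna = cluster_periods(rows, periods={0, 1, 2}, group_fn=parity_groups, dropna=True)
```

Firm 5 is never observed in periods 0 to 2, so its row has no group in `labeled` and is absent from `labeled_dropna`.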