## Example

In [1]:
from imputepy import LGBMimputer, cols_to_impute, find_cat, column_filter
import pandas as pd
import numpy as np

The first 17 columns of the [NFL data](https://www.kaggle.com/datasets/maxhorowitz/nflplaybyplay2009to2016) are used for demonstration.

In [2]:
df = pd.read_csv('data/df.csv')

In [3]:
df.head()

| | Drive | qtr | down | time | TimeUnder | TimeSecs | PlayTimeDiff | SideofField | yrdln | yrdline100 | ydstogo | ydsnet | GoalToGo | FirstDown | posteam | DefensiveTeam | desc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | NaN | 15:00 | 15 | 3600.0 | 0.0 | TEN | 30.0 | 30.0 | 0 | 0 | 0.0 | NaN | PIT | TEN | R.Bironas kicks 67 yards from TEN 30 to PIT 3.... |
| 1 | 1 | 1 | 1.0 | 14:53 | 15 | 3593.0 | 7.0 | PIT | 42.0 | 58.0 | 10 | 5 | 0.0 | 0.0 | PIT | TEN | (14:53) B.Roethlisberger pass short left to H.... |
| 2 | 1 | 1 | 2.0 | 14:16 | 15 | 3556.0 | 37.0 | PIT | 47.0 | 53.0 | 5 | 2 | 0.0 | 0.0 | PIT | TEN | (14:16) W.Parker right end to PIT 44 for -3 ya... |
| 3 | 1 | 1 | 3.0 | 13:35 | 14 | 3515.0 | 41.0 | PIT | 44.0 | 56.0 | 8 | 2 | 0.0 | 0.0 | PIT | TEN | (13:35) (Shotgun) B.Roethlisberger pass incomp... |
| 4 | 1 | 1 | 4.0 | 13:27 | 14 | 3507.0 | 8.0 | PIT | 44.0 | 56.0 | 8 | 2 | 0.0 | 1.0 | PIT | TEN | (13:27) (Punt formation) D.Sepulveda punts 54 ... |


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 407688 entries, 0 to 407687
Data columns (total 17 columns):
 #   Column         Non-Null Count   Dtype   
---  ------         --------------   -----   
 0   Drive          407688 non-null  int64   
 1   qtr            407688 non-null  category
 2   down           346534 non-null  category
 3   time           407464 non-null  object  
 4   TimeUnder      407688 non-null  int64   
 5   TimeSecs       407464 non-null  float64 
 6   PlayTimeDiff   407244 non-null  float64 
 7   SideofField    407160 non-null  category
 8   yrdln          406848 non-null  float64 
 9   yrdline100     406848 non-null  float64 
 10  ydstogo        407688 non-null  int64   
 11  ydsnet         407688 non-null  int64   
 12  GoalToGo       406848 non-null  category
 13  FirstDown      378877 non-null  category
 14  posteam        382696 non-null  category
 15  DefensiveTeam  382696 non-null  category
 16  desc           407686 non-null  object  
dtypes: category(7), float64(4), int64(4), object(2)

Twelve of the 17 columns have missing values.

In [6]:
columns_with_missing_values = cols_to_impute(df)
print(columns_with_missing_values)
print(len(columns_with_missing_values))

['down', 'time', 'TimeSecs', 'PlayTimeDiff', 'SideofField', 'yrdln', 'yrdline100', 'GoalToGo', 'FirstDown', 'posteam', 'DefensiveTeam', 'desc']
12
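For readers who want to cross-check this list with plain pandas, the result can be approximated as follows. This is a sketch on a small made-up frame, assuming `cols_to_impute` simply reports the columns that contain at least one missing value; the library's actual implementation may differ.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the NFL data (values are made up for illustration).
toy = pd.DataFrame({
    "Drive": [1, 1, 2],
    "down": [np.nan, 1.0, 2.0],
    "posteam": ["PIT", None, "TEN"],
})

# Columns with at least one missing value, in original column order.
missing_cols = toy.columns[toy.isna().any()].tolist()
print(missing_cols)
print(len(missing_cols))
```

`Drive` has no missing entries, so only `down` and `posteam` are reported.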


With `unique_count_limit=15`, `find_cat` identifies `qtr`, `down`, `GoalToGo` and `FirstDown` as categorical columns. Increasing `unique_count_limit` lets more columns qualify as categorical, at the expense of computation time.

In [5]:
find_cat(df, unique_count_limit=15)

['qtr', 'down', 'GoalToGo', 'FirstDown']
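To illustrate how a unique-count threshold of this kind behaves, here is a minimal pandas sketch. The helper `find_low_cardinality` is hypothetical and is not part of `imputepy`; it assumes a column qualifies when its number of distinct non-null values is at most the limit, which may not match the exact rule `find_cat` applies.

```python
import pandas as pd

def find_low_cardinality(frame: pd.DataFrame, unique_count_limit: int = 15) -> list:
    """Hypothetical stand-in: columns with at most `unique_count_limit` distinct non-null values."""
    return [
        col for col in frame.columns
        if frame[col].nunique(dropna=True) <= unique_count_limit
    ]

# Toy frame: `qtr` has 4 distinct values, `TimeSecs` has 6 (made-up data).
toy = pd.DataFrame({
    "qtr": [1, 2, 3, 4, 1, 2],
    "TimeSecs": [3600.0, 3593.0, 3556.0, 3515.0, 3507.0, 3490.0],
})

strict = find_low_cardinality(toy, unique_count_limit=4)
loose = find_low_cardinality(toy, unique_count_limit=10)
print(strict)
print(loose)
```

Raising the limit from 4 to 10 lets `TimeSecs` through as well, which is the trade-off described above: more columns are treated as categorical, and each one adds fitting work.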

By default, categorical columns with more than 50 unique values are skipped and not imputed, because fitting them takes considerably longer. The threshold can be changed with `filter_upper_limit`.

In [None]:
%%time
df_imp = LGBMimputer(df)

10 columns will be imputed: ['DefensiveTeam', 'yrdln', 'PlayTimeDiff', 'posteam', 'down', 'yrdline100', 'TimeSecs', 'FirstDown', 'GoalToGo', 'SideofField']
target column: DefensiveTeam
1/10 columns fitted
target column: yrdln
2/10 columns fitted
target column: PlayTimeDiff
3/10 columns fitted
target column: posteam
4/10 columns fitted
target column: down
5/10 columns fitted
target column: yrdline100
6/10 columns fitted
target column: TimeSecs
7/10 columns fitted
target column: FirstDown
8/10 columns fitted
target column: GoalToGo
9/10 columns fitted
target column: SideofField
10/10 columns fitted
CPU times: total: 11min 30s
Wall time: 59.5 s
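Note that `time` and `desc` appear among the 12 columns with missing values but not among the 10 imputed columns. The skip rule for high-cardinality columns can be sketched in plain pandas as below. The function `columns_kept_for_imputation` is hypothetical, and the sketch assumes the rule applies to every non-numeric column (categorical, object or string); the library's own filter may be implemented differently.

```python
import numpy as np
import pandas as pd

def columns_kept_for_imputation(frame: pd.DataFrame, filter_upper_limit: int = 50) -> list:
    """Hypothetical sketch: keep columns with missing values, but skip
    non-numeric columns whose distinct-value count exceeds the limit."""
    kept = []
    for col in frame.columns:
        series = frame[col]
        if not series.isna().any():
            continue  # nothing to impute
        non_numeric = not pd.api.types.is_numeric_dtype(series)
        if non_numeric and series.nunique() > filter_upper_limit:
            continue  # too many classes, skipped to save fitting time
        kept.append(col)
    return kept

# Toy frame with 61 rows (made-up data); the last row holds the missing values.
toy = pd.DataFrame({
    "Drive": list(range(61)),                               # no missing values
    "yrdln": [float(i) for i in range(60)] + [np.nan],      # numeric, always kept
    "posteam": ["PIT", "TEN"] * 30 + [None],                # 2 classes, kept
    "desc": [f"play {i}" for i in range(60)] + [None],      # 60 classes, skipped
})

kept = columns_kept_for_imputation(toy)
print(kept)
```

In this toy example `desc` has 60 distinct values and is dropped, while the numeric `yrdln` is kept regardless of how many distinct values it has.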


In [18]:
df_imp

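| | DefensiveTeam | yrdln | PlayTimeDiff | posteam | down | yrdline100 | TimeSecs | FirstDown | GoalToGo | SideofField |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TEN | 30.0 | 0.0 | PIT | 1.0 | 30.0 | 3600.0 | 0.0 | 0.0 | TEN |
| 1 | TEN | 42.0 | 7.0 | PIT | 1.0 | 58.0 | 3593.0 | 0.0 | 0.0 | PIT |
| 2 | TEN | 47.0 | 37.0 | PIT | 2.0 | 53.0 | 3556.0 | 0.0 | 0.0 | PIT |
| 3 | TEN | 44.0 | 41.0 | PIT | 3.0 | 56.0 | 3515.0 | 0.0 | 0.0 | PIT |
| 4 | TEN | 44.0 | 8.0 | PIT | 4.0 | 56.0 | 3507.0 | 1.0 | 0.0 | PIT |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 407683 | TEN | 32.0 | 4.0 | PIT | 2.0 | 32.0 | 28.0 | 0.0 | 0.0 | BAL |
| 407684 | CIN | 23.0 | 0.0 | BAL | 3.0 | 77.0 | 28.0 | 0.0 | 0.0 | BAL |
| 407685 | CIN | 23.0 | 4.0 | BAL | 4.0 | 77.0 | 24.0 | 1.0 | 0.0 | BAL |
| 407686 | BAL | 36.0 | 10.0 | CIN | 1.0 | 36.0 | 14.0 | 0.0 | 0.0 | BAL |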


In [6]:
df_imp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 407688 entries, 0 to 407687
Data columns (total 10 columns):
 #   Column         Non-Null Count   Dtype   
---  ------         --------------   -----   
 0   DefensiveTeam  407688 non-null  category
 1   yrdln          407688 non-null  float64 
 2   PlayTimeDiff   407688 non-null  float64 
 3   posteam        407688 non-null  category
 4   down           407688 non-null  category
 5   yrdline100     407688 non-null  float64 
 6   TimeSecs       407688 non-null  float64 
 7   FirstDown      407688 non-null  category
 8   GoalToGo       407688 non-null  category
 9   SideofField    407688 non-null  category
dtypes: category(6), float64(4)
memory usage: 14.8 MB
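Since the returned frame contains only the imputed columns, a common next step is to write them back into the original frame and confirm that nothing is left missing in those columns. The following is a sketch with made-up stand-in frames (`original` and `imputed` are illustrative names, not library objects), assuming the imputed frame shares the original's index, as the matching `RangeIndex` in the two `info()` outputs suggests.

```python
import numpy as np
import pandas as pd

# Stand-ins for `df` and `df_imp` (values are made up for illustration).
original = pd.DataFrame({
    "Drive": [1, 1, 2],
    "down": [np.nan, 1.0, 2.0],
    "desc": ["kickoff", None, "punt"],  # not imputed in this example
})
imputed = pd.DataFrame({"down": [1.0, 1.0, 2.0]})

# The write-back relies on index alignment, so check it first.
assert imputed.index.equals(original.index)

# Overwrite only the imputed columns; all other columns stay untouched.
merged = original.copy()
merged[list(imputed.columns)] = imputed

# Remaining missing values per column.
remaining = merged.isna().sum()
print(remaining)
```

Columns that were skipped by the imputer, such as `desc` here, keep their missing values and need separate handling.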
