## Example

In [1]:
from imputepy import LGBMimputer, cols_to_impute, find_cat, column_filter
import pandas as pd
import numpy as np

The first 35,000 rows and the first 17 columns of the [NFL data](https://www.kaggle.com/datasets/maxhorowitz/nflplaybyplay2009to2016) are used for demonstration.

In [5]:
df = pd.read_csv('data/df.csv')

In [6]:
df.head()

| | Drive | qtr | down | time | TimeUnder | TimeSecs | PlayTimeDiff | SideofField | yrdln | yrdline100 | ydstogo | ydsnet | GoalToGo | FirstDown | posteam | DefensiveTeam | desc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | NaN | 15:00 | 15 | 3600.0 | 0.0 | TEN | 30.0 | 30.0 | 0 | 0 | 0.0 | NaN | PIT | TEN | R.Bironas kicks 67 yards from TEN 30 to PIT 3.... |
| 1 | 1 | 1 | 1.0 | 14:53 | 15 | 3593.0 | 7.0 | PIT | 42.0 | 58.0 | 10 | 5 | 0.0 | 0.0 | PIT | TEN | (14:53) B.Roethlisberger pass short left to H.... |
| 2 | 1 | 1 | 2.0 | 14:16 | 15 | 3556.0 | 37.0 | PIT | 47.0 | 53.0 | 5 | 2 | 0.0 | 0.0 | PIT | TEN | (14:16) W.Parker right end to PIT 44 for -3 ya... |
| 3 | 1 | 1 | 3.0 | 13:35 | 14 | 3515.0 | 41.0 | PIT | 44.0 | 56.0 | 8 | 2 | 0.0 | 0.0 | PIT | TEN | (13:35) (Shotgun) B.Roethlisberger pass incomp... |
| 4 | 1 | 1 | 4.0 | 13:27 | 14 | 3507.0 | 8.0 | PIT | 44.0 | 56.0 | 8 | 2 | 0.0 | 1.0 | PIT | TEN | (13:27) (Punt formation) D.Sepulveda punts 54 ... |


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35000 entries, 0 to 34999
Data columns (total 17 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Drive          35000 non-null  int64  
 1   qtr            35000 non-null  int64  
 2   down           29668 non-null  float64
 3   time           34975 non-null  object 
 4   TimeUnder      35000 non-null  int64  
 5   TimeSecs       34975 non-null  float64
 6   PlayTimeDiff   34950 non-null  float64
 7   SideofField    34952 non-null  object 
 8   yrdln          34933 non-null  float64
 9   yrdline100     34933 non-null  float64
 10  ydstogo        35000 non-null  int64  
 11  ydsnet         35000 non-null  int64  
 12  GoalToGo       34933 non-null  float64
 13  FirstDown      32542 non-null  float64
 14  posteam        32691 non-null  object 
 15  DefensiveTeam  32691 non-null  object 
 16  desc           35000 non-null  object 
dtypes: float64(7), int64(5), object(5)
memory usage: 4

`cols_to_impute` lists the columns that contain missing values. Here it finds 11 of the 17.

In [8]:
columns_with_missing_values = cols_to_impute(df)
print(columns_with_missing_values)
print(len(columns_with_missing_values))

['down', 'time', 'TimeSecs', 'PlayTimeDiff', 'SideofField', 'yrdln', 'yrdline100', 'GoalToGo', 'FirstDown', 'posteam', 'DefensiveTeam']
11
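
To cross-check this list without the library, the same columns can be recovered with plain pandas. The sketch below is not the implementation of `cols_to_impute`. The helper name `columns_with_nulls` and the toy frame are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd


def columns_with_nulls(frame: pd.DataFrame) -> list:
    """Return the names of columns that contain at least one missing value."""
    # isna().any() gives one boolean per column; use it to mask the column index.
    return frame.columns[frame.isna().any()].tolist()


# Toy data standing in for the NFL frame (hypothetical values).
toy = pd.DataFrame({
    "qtr": [1, 1, 2],
    "down": [np.nan, 1.0, 2.0],
    "posteam": ["PIT", None, "TEN"],
})

print(columns_with_nulls(toy))  # ['down', 'posteam']
```

Columns are returned in their original order, which matches how the list above follows the column order of `df.info()`.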


Columns `qtr`, `down`, `GoalToGo` and `FirstDown` are treated as categorical even though their values are numeric. Increasing `unique_count_limit` allows more columns to be treated as categorical, at the expense of computation time.

In [9]:
find_cat(df, unique_count_limit=15)

['qtr', 'down', 'GoalToGo', 'FirstDown']
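
A unique-count heuristic of this kind can be sketched in a few lines of pandas. This is only an approximation of what `find_cat` might do. Two details are assumptions, not taken from the library: that only numeric columns are inspected, and that the comparison is "less than or equal to". The helper name `low_cardinality_columns` is hypothetical.

```python
import pandas as pd


def low_cardinality_columns(frame: pd.DataFrame, unique_count_limit: int = 15) -> list:
    """Return numeric columns with at most `unique_count_limit` distinct non-null values."""
    # Assumption: only numeric columns are candidates for this check.
    numeric = frame.select_dtypes(include="number")
    # nunique() ignores missing values by default.
    return [col for col in numeric.columns
            if numeric[col].nunique() <= unique_count_limit]


# Toy data (hypothetical): `qtr` has 4 distinct values, `TimeSecs` has 20.
toy = pd.DataFrame({
    "qtr": [1, 2, 3, 4] * 5,
    "TimeSecs": [float(s) for s in range(3600, 3580, -1)],
    "posteam": ["PIT", "TEN"] * 10,
})

print(low_cardinality_columns(toy, unique_count_limit=15))  # ['qtr']
```

Raising the limit to 20 in this toy example would also flag `TimeSecs`. This is the trade-off described above: more columns are treated as categorical, each with more classes to fit.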

By default, categorical columns with more than 50 unique values are skipped and not imputed, because fitting them takes much longer. Column `time` is one example. The threshold can be changed with `filter_upper_limit`.
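
The filter described above can be illustrated with a small stand-alone sketch. It assumes that "categorical" here means a non-numeric column and that the limit is exclusive (more than 50 is skipped). Neither assumption is confirmed by the library, and `skip_high_cardinality` is a hypothetical helper, not part of `imputepy`.

```python
import pandas as pd


def skip_high_cardinality(frame: pd.DataFrame, candidates: list,
                          filter_upper_limit: int = 50):
    """Split candidate columns into (kept, skipped) by distinct-value count."""
    kept, skipped = [], []
    for col in candidates:
        # Assumption: only non-numeric columns are subject to the limit.
        is_text = not pd.api.types.is_numeric_dtype(frame[col])
        if is_text and frame[col].nunique() > filter_upper_limit:
            skipped.append(col)
        else:
            kept.append(col)
    return kept, skipped


# Toy data (hypothetical): `time` has 60 distinct strings, `SideofField` has 2.
toy = pd.DataFrame({
    "time": [f"{m:02d}:00" for m in range(60)],
    "SideofField": ["PIT", "TEN"] * 30,
    "TimeSecs": [float(s) for s in range(60)],
})

kept, skipped = skip_high_cardinality(toy, ["time", "SideofField", "TimeSecs"])
print(kept)     # ['SideofField', 'TimeSecs']
print(skipped)  # ['time']
```

A text column with many distinct values becomes a classification target with that many classes, which is why skipping it saves fitting time.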

In [10]:
%%time
df_imp = LGBMimputer(df, exclude=['DefensiveTeam'])

9 columns will be imputed: ['TimeSecs', 'SideofField', 'GoalToGo', 'PlayTimeDiff', 'FirstDown', 'down', 'yrdline100', 'posteam', 'yrdln']
target column: TimeSecs
1/9 columns fitted
target column: SideofField
2/9 columns fitted
target column: GoalToGo
3/9 columns fitted
target column: PlayTimeDiff
4/9 columns fitted
target column: FirstDown
5/9 columns fitted
target column: down
6/9 columns fitted
target column: yrdline100
7/9 columns fitted
target column: posteam
8/9 columns fitted
target column: yrdln
9/9 columns fitted
CPU times: total: 1min 37s
Wall time: 8.8 s


In [11]:
df_imp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35000 entries, 0 to 34999
Data columns (total 17 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Drive          35000 non-null  int64  
 1   qtr            35000 non-null  int64  
 2   down           35000 non-null  float64
 3   time           34975 non-null  object 
 4   TimeUnder      35000 non-null  int64  
 5   TimeSecs       35000 non-null  float64
 6   PlayTimeDiff   35000 non-null  float64
 7   SideofField    35000 non-null  object 
 8   yrdln          35000 non-null  float64
 9   yrdline100     35000 non-null  float64
 10  ydstogo        35000 non-null  int64  
 11  ydsnet         35000 non-null  int64  
 12  GoalToGo       35000 non-null  float64
 13  FirstDown      35000 non-null  float64
 14  posteam        35000 non-null  object 
 15  DefensiveTeam  32691 non-null  object 
 16  desc           35000 non-null  object 
dtypes: float64(7), int64(5), object(5)
memory usage: 4
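
Comparing two `info()` listings by eye is error-prone, so a per-column null count before and after imputation can help. The sketch below runs on toy data with a trivial `fillna` standing in for the imputer. The helper name `null_count_change` is hypothetical and not part of `imputepy`.

```python
import numpy as np
import pandas as pd


def null_count_change(before: pd.DataFrame, after: pd.DataFrame) -> pd.DataFrame:
    """Tabulate per-column null counts before and after imputation."""
    report = pd.DataFrame({
        "before": before.isna().sum(),
        "after": after.isna().sum(),
    })
    report["filled"] = report["before"] - report["after"]
    # Keep only columns that had missing values to begin with.
    return report[report["before"] > 0]


# Toy data (hypothetical values).
before = pd.DataFrame({
    "down": [np.nan, 1.0, 2.0],
    "time": [None, "14:53", "14:16"],
    "qtr": [1, 1, 1],
})
# Stand-in for an imputer: fill only `down`, leave `time` untouched.
after = before.assign(down=before["down"].fillna(1.0))

report = null_count_change(before, after)
print(report)
```

Applied to `df` and `df_imp` from the cells above, a report like this would show non-zero `after` counts only for `time` (skipped by the unique-value threshold) and `DefensiveTeam` (passed to `exclude`), consistent with the `info()` output.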