# DataWig Examples

## Installation

Clone the repository and create a virtual environment in the root directory of the package:

```
python3 -m venv venv
```

Install the package from local sources:

```
./venv/bin/pip install -e .
```
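
To confirm that the editable install is visible to the interpreter, a small check like the one below can help. It is only a sketch: `is_installed` is a hypothetical helper name, and it relies solely on the standard library's `importlib.util.find_spec`.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the current interpreter can locate `package_name`."""
    # find_spec returns None for a top-level package that cannot be found
    return importlib.util.find_spec(package_name) is not None


print("datawig installed:", is_installed("datawig"))
```

Run it with `./venv/bin/python` so that the check uses the virtual environment created above.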

## Running DataWig
The DataWig API expects your data as a [pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html). Here is an example of how the dataframe might look:

|Product Type | Description           | Size | Color |
|-------------|-----------------------|------|-------|
|   Shoe      | Ideal for Running     | 12UK | Black |
| SDCards     | Best SDCard ever ...  | 8GB  | Blue  |
| Dress       | This **yellow** dress | M    | **?** |
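
As a rough illustration, the table above could be built as a DataFrame like this. The `**?**` cell is assumed here to stand for a missing value, which is represented as `NaN`.

```python
import numpy as np
import pandas as pd

# The example table as a DataFrame; the unknown color is stored as NaN (assumption)
df = pd.DataFrame({
    "Product Type": ["Shoe", "SDCards", "Dress"],
    "Description": ["Ideal for Running", "Best SDCard ever ...", "This yellow dress"],
    "Size": ["12UK", "8GB", "M"],
    "Color": ["Black", "Blue", np.nan],
})

# Count missing values per column
print(df.isna().sum())
```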

DataWig lets you impute missing values in two ways:
  * A `.complete` method inspired by [`fancyimpute`](https://github.com/iskandr/fancyimpute)
  * A `sklearn`-like API with `.fit` and `.predict` methods
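
To make the difference between the two styles concrete, here is a toy sketch. `ToyMeanImputer` is a hypothetical stand-in that fills numeric columns with their mean; it is not DataWig's implementation and only mirrors the shape of the two interfaces.

```python
import numpy as np
import pandas as pd


class ToyMeanImputer:
    """Hypothetical stand-in (NOT DataWig): imputes one numeric column with its mean."""

    def __init__(self, output_column):
        self.output_column = output_column
        self.mean_ = None

    def fit(self, train_df):
        # pandas skips NaN values when computing the mean
        self.mean_ = train_df[self.output_column].mean()
        return self

    def predict(self, df):
        # return a copy with an extra '<column>_imputed' column
        out = df.copy()
        out[self.output_column + "_imputed"] = self.mean_
        return out

    @staticmethod
    def complete(df):
        # one call that fills every numeric column in place of fit + predict
        out = df.copy()
        for column in out.select_dtypes(include="number").columns:
            out[column] = out[column].fillna(ToyMeanImputer(column).fit(out).mean_)
        return out


df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0], "y": [2.0, np.nan, 6.0, 8.0]})
completed = ToyMeanImputer.complete(df)              # fancyimpute-like style
predicted = ToyMeanImputer("y").fit(df).predict(df)  # sklearn-like style
```

The `.complete` style needs no configuration and fills the whole frame, while the `.fit`/`.predict` style targets one column and keeps the fitted model around for reuse on new data.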

## Quickstart Example

### Using `AutoGluonImputer.complete`


    import datawig, numpy, random, warnings

    # seed both generators: the mask below draws from numpy.random
    random.seed(0)
    numpy.random.seed(0)
    warnings.filterwarnings("ignore")

    # generate some data with a simple nonlinear dependency
    df = datawig.utils.generate_df_numeric()
    # mask roughly 20% of the values (cells where the random draw exceeds 0.8)
    df_with_missing = df.mask(numpy.random.rand(*df.shape) > .8)

    # impute missing values
    df_with_missing_imputed = datawig.AutoGluonImputer.complete(df_with_missing)

    # compare original, masked and imputed values side by side
    df['f(x) with_missing'] = df_with_missing['f(x)']
    df['f(x) imputed'] = df_with_missing_imputed['f(x)']
    df[-5:]

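|    | x         | f(x)      | f(x) with_missing | f(x) imputed |
|----|-----------|-----------|-------------------|--------------|
| 95 | -0.038983 | -0.006638 | -0.006638         | -0.006638    |
| 96 | 0.142835  | 0.019631  | 0.019631          | 0.019631     |
| 97 | -0.455273 | 0.210685  | 0.210685          | 0.210685     |
| 98 | -2.98188  | 8.894373  | NaN               | 8.282938     |
| 99 | -2.463691 | 6.078044  | 6.078044          | 6.078044     |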
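
The masking step can be checked in isolation. The sketch below uses only numpy and pandas; the data is a hypothetical stand-in for the generated frame, and it shows that a threshold of `0.8` on uniform random draws hides about one cell in five.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Stand-in data (assumption): x drawn uniformly, f(x) = x squared
x = rng.uniform(-3, 3, size=1000)
df = pd.DataFrame({"x": x, "f(x)": x ** 2})

# DataFrame.mask replaces cells with NaN wherever the condition is True
hide = rng.random(df.shape) > 0.8
df_with_missing = df.mask(hide)

missing_fraction = df_with_missing.isna().to_numpy().mean()
print(f"fraction of missing cells: {missing_fraction:.3f}")
```

Lowering the threshold hides more cells; `> 0.9` would hide about one in ten.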


### Using `AutoGluonImputer.fit` and `.predict`

You can also impute values in one specific column (called `output_column` below) using the values in other columns (called `input_columns` below). DataWig currently supports imputation of categorical and numeric columns. Column types are inferred using `pandas`.
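
One way to picture pandas-based type inference is the sketch below. It is an assumption for illustration, not DataWig's actual logic: `guess_column_kind` is a hypothetical helper that treats every numeric dtype as numeric and everything else as categorical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def guess_column_kind(series: pd.Series) -> str:
    """Hypothetical rule: numeric dtypes are 'numeric', all others 'categorical'."""
    return "numeric" if is_numeric_dtype(series) else "categorical"


df = pd.DataFrame({
    "sentences": ["56KkS Of1Ui", "XQTck UZd7R"],
    "label": ["cq9GF", "cq9GF"],
    "y": [3.617395, 1.024857],
})
kinds = {name: guess_column_kind(df[name]) for name in df.columns}
print(kinds)
```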

#### Imputation of categorical columns

Let's first generate some random strings hidden in longer random strings:

    import datawig

    df = datawig.utils.generate_df_string(num_samples=200,
                                          data_column_name='sentences',
                                          label_column_name='label')
    df.head(n=2)

|   | sentences                           | label |
|---|-------------------------------------|-------|
| 0 | 56KkS Of1Ui vHqhM ZAmJ9 cq9GF h3qiP | cq9GF |
| 1 | XQTck UZd7R O2NiT NQqEe 9ZMJL cq9GF | cq9GF |
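
Before fitting, the data is split into a training and a test part with `datawig.utils.random_split`. A minimal pandas sketch of such a split follows; `toy_random_split` is a hypothetical helper, and the 80/20 ratio and the seed are assumptions rather than the library's documented defaults.

```python
import pandas as pd


def toy_random_split(df, train_fraction=0.8, seed=0):
    """Hypothetical helper: shuffle rows and split them into train and test frames."""
    train = df.sample(frac=train_fraction, random_state=seed)
    # everything not sampled into the training set becomes the test set
    test = df.drop(train.index)
    return train, test


df = pd.DataFrame({
    "sentences": [f"s{i}" for i in range(200)],
    "label": ["a", "b"] * 100,
})
df_train, df_test = toy_random_split(df)
print(len(df_train), len(df_test))
```

Because rows are sampled at random, the original index values stay attached to each row, which is why test rows can carry indices such as 57 or 140.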


    df_train, df_test = datawig.utils.random_split(df)

    imputer = datawig.AutoGluonImputer(
        input_columns=['sentences'],  # column(s) containing information about the column we want to impute
        output_column='label'  # the column we'd like to impute values for
    )

    # Fit an imputer model on the train data
    imputer.fit(train_df=df_train, time_limit=100)

    # Impute missing values and return the original dataframe with predictions
    imputed = imputer.predict(df_test)
    imputed.head(n=5)

|     | sentences                           | label | label_imputed |
|-----|-------------------------------------|-------|---------------|
| 57  | uanaT NgmhM UuT3e qJu8x 2yW4A h3jr2 | 2yW4A | 2yW4A         |
| 31  | NgmhM n9JEC 2V9SM 9EwNs CeqSk 2yW4A | 2yW4A | 2yW4A         |
| 65  | gSzmq lThEn uanaT cq9GF M5jtg m6Kop | cq9GF | cq9GF         |
| 140 | zfx1h 2V9SM 2yW4A bIBNg 5S71I CH4F6 | 2yW4A | 2yW4A         |
| 89  | 2mpbU YwfuN lThEn Rn9Xd cq9GF bccnR | cq9GF | cq9GF         |
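
Since the returned frame holds both the true `label` and the `label_imputed` column, the predictions can be scored directly. The sketch below rebuilds the five rows shown above by hand and computes plain accuracy; it says nothing about performance on the remaining test rows.

```python
import pandas as pd

# The five rows shown above, typed in by hand
imputed = pd.DataFrame(
    {
        "label": ["2yW4A", "2yW4A", "cq9GF", "2yW4A", "cq9GF"],
        "label_imputed": ["2yW4A", "2yW4A", "cq9GF", "2yW4A", "cq9GF"],
    },
    index=[57, 31, 65, 140, 89],
)

# Fraction of rows where the imputed label equals the true label
accuracy = (imputed["label"] == imputed["label_imputed"]).mean()
print(f"accuracy on these rows: {accuracy:.2f}")
```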


#### Imputation of numerical columns

Imputation of numerical values works the same way as for categorical values.

Let's first generate some numeric values with a quadratic dependency:


    import datawig

    df = datawig.utils.generate_df_numeric(num_samples=200,
                                           data_column_name='x',
                                           label_column_name='y')
    df.head(n=5)

|   | x         | y        |
|---|-----------|----------|
| 0 | 1.895813  | 3.617395 |
| 1 | -1.008764 | 1.024857 |
| 2 | 1.978105  | 3.919697 |
| 3 | -2.638216 | 6.96594  |
| 4 | 2.480706  | 6.151376 |
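
For readers without DataWig at hand, data with a similar quadratic dependency can be sketched with numpy alone. `toy_quadratic_df`, the value range, and the noise level are assumptions made for illustration, not the behavior of `generate_df_numeric`.

```python
import numpy as np
import pandas as pd


def toy_quadratic_df(num_samples=200, noise_scale=0.01, seed=0):
    """Hypothetical generator: y is x squared plus a little Gaussian noise."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-3, 3, size=num_samples)
    y = x ** 2 + rng.normal(0, noise_scale, size=num_samples)
    return pd.DataFrame({"x": x, "y": y})


df = toy_quadratic_df()
print(df.head(n=5))
```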


    df_train, df_test = datawig.utils.random_split(df)

    imputer = datawig.AutoGluonImputer(
        input_columns=['x'],  # column(s) containing information about the column we want to impute
        output_column='y',  # the column we'd like to impute values for
    )

    # Fit an imputer model on the train data
    imputer.fit(train_df=df_train, time_limit=100)

    # Impute missing values and return the original dataframe with predictions
    imputed = imputer.predict(df_test)
    imputed.head(n=5)

|     | x         | y        |
|-----|-----------|----------|
| 57  | 1.464692  | 2.149859 |
| 31  | -2.687957 | 7.225748 |
| 65  | 2.226667  | 4.958026 |
| 140 | 2.124441  | 4.502884 |
| 89  | -0.434246 | 0.176235 |
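
Numeric imputations are usually judged by an error measure such as the mean squared error. The sketch below uses the five rows above; because the listing shows no prediction column, the noiseless `x ** 2` serves as a stand-in for the imputer's output, and the column name `y_imputed` is an assumption made by analogy with `label_imputed`.

```python
import pandas as pd

# The five test rows shown above, typed in by hand
df_test = pd.DataFrame(
    {
        "x": [1.464692, -2.687957, 2.226667, 2.124441, -0.434246],
        "y": [2.149859, 7.225748, 4.958026, 4.502884, 0.176235],
    },
    index=[57, 31, 65, 140, 89],
)

# Stand-in predictions (assumption): the noiseless quadratic, not real imputer output
df_test["y_imputed"] = df_test["x"] ** 2

# Mean squared error between the true and the stand-in values
mse = ((df_test["y"] - df_test["y_imputed"]) ** 2).mean()
print(f"mean squared error: {mse:.6f}")
```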
