# Simple pandas to vowpalwabbit conversion tutorial

In [1]:
import pandas as pd
from vowpalwabbit.dftovw import DFtoVW
from vowpalwabbit import Workspace

### Building simple examples using `DFtoVW.from_column_names`

Let's create the following pandas dataframe:

In [2]:
df = pd.DataFrame(
    [
     {'income': 0,
      'age': 27,
      'marital-status': 'Separated',
      'education': 'HS-grad',
      'occupation': 'Handlers-cleaners',
      'hours-per-week': 25},
     {'income': 1,
      'age': 34,
      'marital-status': 'Married-civ-spouse',
      'education': 'Bachelors',
      'occupation': 'Prof-specialty',
      'hours-per-week': 40},
     {'income': 0,
      'age': 44,
      'marital-status': 'Never-married',
      'education': 'Assoc-voc',
      'occupation': 'Priv-house-serv',
      'hours-per-week': 25},
     {'income': 1,
      'age': 38,
      'marital-status': 'Married-civ-spouse',
      'education': 'Bachelors',
      'occupation': 'Prof-specialty',
      'hours-per-week': 60},
     {'income': 0,
      'age': 34,
      'marital-status': 'Married-civ-spouse',
      'education': 'HS-grad',
      'occupation': 'Other-service',
      'hours-per-week': 36
     }
    ]
)

Build the examples with the class method `DFtoVW.from_column_names`, passing the dataframe (`df`), the name of the label column (`y`), and the names of the feature columns (`x`). The `convert_df` method then performs the conversion to Vowpal Wabbit examples:

In [3]:
converter = DFtoVW.from_column_names(
    df=df,
    y="income",
    x=["age", "marital-status", "education", "occupation", "hours-per-week"],
)
examples = converter.convert_df()
examples

['0 | age:27 marital-status=Separated education=HS-grad occupation=Handlers-cleaners hours-per-week:25',
 '1 | age:34 marital-status=Married-civ-spouse education=Bachelors occupation=Prof-specialty hours-per-week:40',
 '0 | age:44 marital-status=Never-married education=Assoc-voc occupation=Priv-house-serv hours-per-week:25',
 '1 | age:38 marital-status=Married-civ-spouse education=Bachelors occupation=Prof-specialty hours-per-week:60',
 '0 | age:34 marital-status=Married-civ-spouse education=HS-grad occupation=Other-service hours-per-week:36']

Note that the Vowpal Wabbit format for categorical features is `feature_name=feature_value`, whereas the format for numerical features is `feature_name:feature_value`. When using the `DFtoVW` class, the appropriate format is inferred from the types of the dataframe columns.
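The formatting rule can be sketched in a few lines of plain pandas. This is only an illustration of the rule described above, not the code `DFtoVW` actually runs: `format_feature` and the `toy` dataframe are hypothetical names introduced for this example.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def format_feature(df, column):
    """Render one column as VW feature strings.

    Hypothetical helper for illustration; it is not part of vowpalwabbit.
    """
    # Numerical columns use "name:value"; every other dtype uses "name=value".
    sep = ":" if is_numeric_dtype(df[column]) else "="
    return [f"{column}{sep}{value}" for value in df[column]]


# Small stand-in dataframe with one numerical and one categorical column.
toy = pd.DataFrame({"age": [27, 34], "education": ["HS-grad", "Bachelors"]})

age_features = format_feature(toy, "age")
education_features = format_feature(toy, "education")

print(age_features)        # ['age:27', 'age:34']
print(education_features)  # ['education=HS-grad', 'education=Bachelors']
```

The separator depends only on the column dtype, which is why `age` and `hours-per-week` appear with `:` in the output above while the string columns appear with `=`.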

We then train the model on these examples:

In [4]:
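# P=1 prints a progress update after every example.
model = Workspace(P=1, enable_logging=True)

# Feed the examples to the learner one at a time.
for ex in examples:
    model.learn(ex)
model.finish()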

### Building more complex examples

The class method `DFtoVW.from_column_names` is a quick and simple way to build examples. For more control over how the examples are created, use either the `Feature` class or the `Namespace` class to build the features, together with one of the available label classes (see below), chosen according to the nature of the task.

- When using the `Namespace` class (see https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Namespaces for what a namespace is), specify the name of the namespace in the `name` field and pass one `Feature` object, or a list of them, to the `features` field.

- The `Feature` class has a `value` field, which takes the name of the column. The feature can be renamed with the `rename_feature` field, and a specific type (`"numerical"` or `"categorical"`) can be enforced with the `as_type` field.

Several label classes are available:
- `SimpleLabel` for regression
- `MulticlassLabel` and `Multilabel` for classification
- `ContextualbanditLabel` for contextual bandit problems

In the following example, we build two namespaces: one for the socio-demographic features and one for the job features.

In [5]:
from vowpalwabbit.dftovw import SimpleLabel, Namespace, Feature

ns_sociodemo = Namespace(features=[Feature(col) for col in ["age", "marital-status", "education"]], name="ns_sociodemo")
ns_job = Namespace(features=[Feature(col) for col in ["occupation", "hours-per-week"]], name="ns_job")
label = SimpleLabel("income")

converter_advanced = DFtoVW(df=df, namespaces=[ns_sociodemo, ns_job], label=label)
examples_advanced = converter_advanced.convert_df()
examples_advanced[:5]

['0 |ns_sociodemo age:27 marital-status=Separated education=HS-grad |ns_job occupation=Handlers-cleaners hours-per-week:25',
 '1 |ns_sociodemo age:34 marital-status=Married-civ-spouse education=Bachelors |ns_job occupation=Prof-specialty hours-per-week:40',
 '0 |ns_sociodemo age:44 marital-status=Never-married education=Assoc-voc |ns_job occupation=Priv-house-serv hours-per-week:25',
 '1 |ns_sociodemo age:38 marital-status=Married-civ-spouse education=Bachelors |ns_job occupation=Prof-specialty hours-per-week:60',
 '0 |ns_sociodemo age:34 marital-status=Married-civ-spouse education=HS-grad |ns_job occupation=Other-service hours-per-week:36']
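The layout of the lines above can be reproduced by hand to make the role of each namespace explicit. The sketch below is plain Python and does not call vowpalwabbit; `build_vw_line` is a hypothetical helper, and the feature strings are copied from the first converted example.

```python
def build_vw_line(label, namespaces):
    """Join a label and a {namespace name: feature strings} mapping into one line.

    Hypothetical helper for illustration; it is not part of vowpalwabbit.
    """
    parts = [str(label)]
    for name, features in namespaces.items():
        # Each namespace starts with "|" immediately followed by its name,
        # then its features separated by single spaces.
        parts.append("|" + name + " " + " ".join(features))
    return " ".join(parts)


line = build_vw_line(
    0,
    {
        "ns_sociodemo": ["age:27", "marital-status=Separated", "education=HS-grad"],
        "ns_job": ["occupation=Handlers-cleaners", "hours-per-week:25"],
    },
)
print(line)
```

The result matches the first element of `examples_advanced`: the label comes first, and each `|name` marker opens a group of features that belong to that namespace.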

We then train the model, this time also including interactions between the features of the two namespaces:

In [6]:
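model_advanced = Workspace(
    # Alternative kept for reference:
    # arg_str="--interactions ns_sociodemo:ns_job", P=1, enable_logging=True
    # Alias each namespace to a single letter, then request quadratic (-q) interactions between them.
    arg_str="--redefine a:=ns_job b:=ns_sociodemo -q ab", P=1, enable_logging=True
)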

for ex in examples_advanced:
    model_advanced.learn(ex)

model_advanced.finish()
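To give an intuition for what the quadratic interaction adds, the sketch below enumerates the crossed features for the first example in plain Python. It assumes the usual Vowpal Wabbit behaviour in which `-q ab` pairs every feature of namespace `a` with every feature of namespace `b` and the value of the crossed feature is the product of the two values. The helper names and the `^` separator are hypothetical and used only for display; Vowpal Wabbit works on hashed indices rather than on joined strings.

```python
from itertools import product


def parse_feature(token):
    """Split a VW feature token into (name, value).

    A categorical token such as "education=HS-grad" has no explicit value,
    so it is treated as a feature with value 1.
    """
    if ":" in token:
        name, value = token.rsplit(":", 1)
        return name, float(value)
    return token, 1.0


def quadratic_pairs(features_a, features_b):
    """Cross every feature of one namespace with every feature of the other."""
    pairs = {}
    for token_a, token_b in product(features_a, features_b):
        name_a, value_a = parse_feature(token_a)
        name_b, value_b = parse_feature(token_b)
        # The crossed feature's value is the product of the two values.
        pairs[f"{name_a}^{name_b}"] = value_a * value_b
    return pairs


# Feature strings of the first example, grouped by namespace.
ns_job_features = ["occupation=Handlers-cleaners", "hours-per-week:25"]
ns_sociodemo_features = ["age:27", "marital-status=Separated", "education=HS-grad"]

crossed = quadratic_pairs(ns_job_features, ns_sociodemo_features)
for name, value in crossed.items():
    print(name, value)
```

With two job features and three socio-demographic features, the example gains six crossed features on top of the five original ones.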

Finally, we can retrieve the estimated weight associated with each namespace and feature:

In [7]:
[
    (ns.name, feature.name, model_advanced.get_weight_from_name(feature.name, ns.name))
    for ns in [ns_job, ns_sociodemo]
    for feature in ns.features
]

[('ns_job', 'occupation', 0.0),
 ('ns_job', 'hours-per-week', 0.0019117757910862565),
 ('ns_sociodemo', 'age', 0.001858704723417759),
 ('ns_sociodemo', 'marital-status', 0.0),
 ('ns_sociodemo', 'education', 0.0)]