# Sklearn Custom Transformer & Sklearn-Pandas DataFrameMapper

- sklearn pipeline
- sklearn_pandas DataFrameMapper

---

## Import Modules

In [1]:
# Standard
import pandas as pd

# Custom Transformer - Classes for inheritance
from sklearn.base import BaseEstimator, TransformerMixin

# DataFrame Mapper
from sklearn_pandas import DataFrameMapper

---

## Toy Data

In [2]:
data = pd.DataFrame({'Title':      ['EEO - New York', 'EEO - Miami', 'EEO - New Jersey', 'Military Leave - Maine', 'Military Leave - Illinois'], 
                     'Policy': ['EEO', 'EEO', 'EEO', 'Military Leave', 'Military Leave'], 
                     'bodytext':   ['EEO - sent 1', 'EEO - sent 2', 'EEO - sent 3', 'Military Leave - sent 4', 'Military Leave - sent 5'], 
                     'ArticleID': [100, 200, 300, 400, 500]})

In [3]:
data

| | Title | Policy | bodytext | ArticleID |
|---|---|---|---|---|
| 0 | EEO - New York | EEO | EEO - sent 1 | 100 |
| 1 | EEO - Miami | EEO | EEO - sent 2 | 200 |
| 2 | EEO - New Jersey | EEO | EEO - sent 3 | 300 |
| 3 | Military Leave - Maine | Military Leave | Military Leave - sent 4 | 400 |
| 4 | Military Leave - Illinois | Military Leave | Military Leave - sent 5 | 500 |


---

## I. Sklearn - Custom Transformer

- The objective is to create a custom transformer that can be applied to different datasets in an automated fashion.
- Building it on sklearn's transformer interface (`fit` / `transform`) lets the same operation be repeated across datasets in a streamlined manner.
- The transformation groups all the text in the "bodytext" column into one string per "Policy".
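
Before wrapping it in a class, the core transformation described in the last bullet can be sketched with plain pandas. This is a minimal illustration on a shortened, made-up copy of the toy data (the `toy` frame and the `grouped` name are assumptions for this sketch, not part of the notebook):

```python
import pandas as pd

# Shortened stand-in for the notebook's toy data (hypothetical values)
toy = pd.DataFrame({
    "Policy": ["EEO", "EEO", "Military Leave"],
    "bodytext": ["EEO - sent 1", "EEO - sent 2", "Military Leave - sent 4"],
})

# One space-joined string per policy; groupby sorts the group keys by default
grouped = toy.groupby("Policy")["bodytext"].agg(" ".join)
print(grouped.to_dict())
```

The custom transformer below packages the same idea behind `fit` / `transform` so it can be reused on other datasets.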

In [4]:
class TextGroupingCustomTransformer(BaseEstimator, TransformerMixin):
    """
    Descr: 
        This class contains methods to undertake grouping of
        text within each policy by the overall policy it falls
        under. This custom transformer will be used in sklearn
        pipelines to prevent data leakage and to automate similar
        transformation on unseen data.
        In order to make this custom transformer compatible with
        sklearn, it is implemented as a class with fit and transform
        methods that inherits from BaseEstimator and
        TransformerMixin
    Return:
        - Pandas Series
    """
    
    # Class constructor
    def __init__(self, grp_by_col, text_content_column):
        """
        Descr: 
            Class constructor that initializes the class
        Input:
            - grp_by_col (str): Column to group on
            - text_content_column (str): Column whose contents are 
              to be concatenated per group
        """
        # Check that a string was passed for each column name
        if not isinstance(grp_by_col, str):
            raise ValueError('Inappropriate type: {} for grp_by_col. A string '
                             'is expected'.format(type(grp_by_col)))

        if not isinstance(text_content_column, str):
            raise ValueError('Inappropriate type: {} for text_content_column. '
                             'A string is expected'.format(type(text_content_column)))
        
        # Store the arguments as instance attributes
        self.grp_by_col = grp_by_col
        self.text_content_column = text_content_column

        
    # Fit Method
    def fit(self, X, y=None):
        """
        Descr: 
            Fit method
        Input:
            - X (pd.DataFrame or np.array): features
            - y (pd.Series or np.array): target
        Return:
            - self
        """        
        return self

    
    # Text Grouping function
    def text_grping_func(self, grped_txt_col):
        """
        Descr: 
            Text grouping function used by the transform method below
        Input:
            - grped_txt_col (pd.Series): column from grouped data
        Return:
            - string of text concatenated across titles within a policy
        """
        return " ".join(content for content in grped_txt_col)

    
    # Transform Method
    def transform(self, X, y=None):
        """
        Descr: 
            Transform Method for custom transformations
        Input:
            - X (pd.DataFrame or np.array): features
            - y (pd.Series or np.array): target
        Return:
            - pd.Series
        """        
        # Create a policy group to group df by policy
        policy_grp = X.groupby(self.grp_by_col)
        
        # Checking the groupings done above
        print('Grouped by:')
        # print(policy_grp.groups.keys())
        for i, key in enumerate(policy_grp.groups.keys()):
            print(f"{i:<{4}} : {key}")
        print('-'*30)
        
        # Apply the text grouping transformation to each group
        # via a lambda function
        tf_return = policy_grp.apply(lambda g: self.text_grping_func(g[self.text_content_column]))
        
        print('Returning an object of type: ', type(tf_return))
        
        return tf_return

---

### Custom Transformer via sklearn

In [5]:
# Instantiate the custom transformer

txt_grp_tf = TextGroupingCustomTransformer('Policy', 'bodytext')
txt_grp_tf

TextGroupingCustomTransformer(grp_by_col='Policy',
                              text_content_column='bodytext')
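
The repr shown above comes from `BaseEstimator`, which reads the constructor arguments back through `get_params()`. The sketch below illustrates that mechanism with a hypothetical, stripped-down class (`MinimalGrouper` is an illustrative name, not the notebook's transformer), assuming only that each constructor argument is stored under an attribute of the same name:

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone


class MinimalGrouper(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for the notebook's transformer."""

    def __init__(self, grp_by_col, text_content_column):
        # Store the arguments unchanged, under the same names
        self.grp_by_col = grp_by_col
        self.text_content_column = text_content_column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Pass-through; the grouping logic is omitted in this sketch
        return X


grouper = MinimalGrouper("Policy", "bodytext")
params = grouper.get_params()   # dict of constructor arguments
cloned = clone(grouper)         # fresh, unfitted copy built from those arguments
print(params)
```

Because `clone` rebuilds the object from `get_params()`, scikit-learn's developer guide recommends that `__init__` only store its arguments, with any validation deferred to `fit`; the type checks in the class above still work, but moving them would follow that convention more closely.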

In [6]:
# Run the transformer on the toy data established earlier

tf_output = txt_grp_tf.fit_transform(data)

Grouped by:
0    : EEO
1    : Military Leave
------------------------------
Returning an object of type:  <class 'pandas.core.series.Series'>


In [7]:
# Output of the custom transformer

tf_output

Policy
EEO                        EEO - sent 1 EEO - sent 2 EEO - sent 3
Military Leave    Military Leave - sent 4 Military Leave - sent 5
dtype: object

### Takeaway:
- As seen above, we've created a custom transformer that is easy to call and, **more importantly, can be used within a `sklearn pipeline` so that it facilitates streamlined transformations across datasets** (e.g. train, validation, test).
- The simple objective established at the beginning is now met: the titles are grouped by policy, and the text from each policy group can now be used for downstream tasks.
    - *For instance*: this text can be preprocessed and **NLP techniques can be applied to it to examine document similarity** across policies.
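
As a rough sketch of the pipeline and document-similarity idea in the takeaway, the example below chains a trimmed-down grouping transformer with TF-IDF and computes cosine similarity between policies. The `GroupText` class, the `toy` sentences, and the choice of `TfidfVectorizer` are assumptions made for illustration; the notebook itself does not prescribe a particular NLP method.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.pipeline import Pipeline


class GroupText(BaseEstimator, TransformerMixin):
    """Hypothetical, trimmed-down version of the grouping transformer."""

    def __init__(self, grp_by_col, text_content_column):
        self.grp_by_col = grp_by_col
        self.text_content_column = text_content_column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # One space-joined string per group, indexed by the group key
        return X.groupby(self.grp_by_col)[self.text_content_column].agg(" ".join)


# Made-up sentences so that the two policies share some vocabulary
toy = pd.DataFrame({
    "Policy": ["EEO", "EEO", "Military Leave", "Military Leave"],
    "bodytext": ["equal opportunity hiring", "equal pay rules",
                 "military leave rules", "leave for service members"],
})

pipe = Pipeline([
    ("group", GroupText("Policy", "bodytext")),
    ("tfidf", TfidfVectorizer()),
])

tfidf = pipe.fit_transform(toy)        # sparse matrix, one row per policy
similarity = cosine_similarity(tfidf)  # pairwise similarity between policies
print(similarity.round(2))
```

Note that the grouping step changes the number of rows, so this pattern suits unsupervised steps like the one shown; a supervised pipeline would need the target `y` aggregated in the same way.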

---

### Custom Transformer via sklearn-pandas DataFrameMapper

In [8]:
# 1. Without Alias

In [9]:
tf_output_dfm = DataFrameMapper(
    [
        ( ['Policy', 'bodytext'], txt_grp_tf )  # list of column names from the I/P pandas dataframe, object that performs the transformation on the columns
    ], 
    df_out = True,  # if True, a DataFrame is returned; if False, a numpy array is returned
    input_df = True  # if True, the selected columns are passed to the transformer as a DataFrame/Series rather than a numpy array
) 

In [10]:
tf_output_dfm

DataFrameMapper(df_out=True, drop_cols=[],
                features=[(['Policy', 'bodytext'],
                           TextGroupingCustomTransformer(grp_by_col='Policy',
                                                         text_content_column='bodytext'))],
                input_df=True)

In [11]:
tf_output_dfm.fit_transform(data)

Grouped by:
0    : EEO
1    : Military Leave
------------------------------
Returning an object of type:  <class 'pandas.core.series.Series'>


| | Policy_bodytext |
|---|---|
| 0 | EEO - sent 1 EEO - sent 2 EEO - sent 3 |
| 1 | Military Leave - sent 4 Military Leave - sent 5 |


In [12]:
# 2. With Alias provided

In [13]:
tf_output_dfm_with_alias = DataFrameMapper(
    [
        ( ['Policy', 'bodytext'], txt_grp_tf, {'alias':'Text_grouped_by_Policy'} )
    ], 
    df_out = True,
    input_df = True
) 

In [14]:
tf_output_dfm_with_alias

DataFrameMapper(df_out=True, drop_cols=[],
                features=[(['Policy', 'bodytext'],
                           TextGroupingCustomTransformer(grp_by_col='Policy',
                                                         text_content_column='bodytext'),
                           {'alias': 'Text_grouped_by_Policy'})],
                input_df=True)

In [15]:
tf_output_dfm_with_alias.fit_transform(data)

Grouped by:
0    : EEO
1    : Military Leave
------------------------------
Returning an object of type:  <class 'pandas.core.series.Series'>


| | Text_grouped_by_Policy |
|---|---|
| 0 | EEO - sent 1 EEO - sent 2 EEO - sent 3 |
| 1 | Military Leave - sent 4 Military Leave - sent 5 |


---

## II. Sklearn - Custom Transformer (return pd.df)

- This version of the custom transformer defined above returns a pandas DataFrame instead of a Series.
- The objective is to compare its output columns with those produced in the **DataFrameMapper** exercise below.

In [62]:
class TextGroupingCustomTransformer_df(BaseEstimator, TransformerMixin):
    """
    Descr: 
        This class contains methods to undertake grouping of
        text within each policy by the overall policy it falls
        under. This custom transformer will be used in sklearn
        pipelines to prevent data leakage and to automate similar
        transformation on unseen data.
        In order to make this custom transformer compatible with
        sklearn, it is implemented as a class with fit and transform
        methods that inherits from BaseEstimator and
        TransformerMixin
    Return:
        - Pandas DataFrame
    """
    
    # Class constructor
    def __init__(self, grp_by_col, text_content_column):
        """
        Descr: 
            Class constructor that initializes the class
        Input:
            - grp_by_col (str): Column to group on
            - text_content_column (str): Column whose contents are 
              to be concatenated per group
        """
        # Check that a string was passed for each column name
        if not isinstance(grp_by_col, str):
            raise ValueError('Inappropriate type: {} for grp_by_col. A string '
                             'is expected'.format(type(grp_by_col)))

        if not isinstance(text_content_column, str):
            raise ValueError('Inappropriate type: {} for text_content_column. '
                             'A string is expected'.format(type(text_content_column)))
        
        # Store the arguments as instance attributes
        self.grp_by_col = grp_by_col
        self.text_content_column = text_content_column

        
    # Fit Method
    def fit(self, X, y=None):
        """
        Descr: 
            Fit method
        Input:
            - X (pd.DataFrame or np.array): features
            - y (pd.Series or np.array): target
        Return:
            - self
        """        
        return self

    
    # Text Grouping function
    def text_grping_func(self, grped_txt_col):
        """
        Descr: 
            Text grouping function used by the transform method below
        Input:
            - grped_txt_col (pd.Series): column from grouped data
        Return:
            - string of text concatenated across titles within a policy
        """
        return " ".join(content for content in grped_txt_col)

    
    # Transform Method
    def transform(self, X, y=None):
        """
        Descr: 
            Transform Method for custom transformations
        Input:
            - X (pd.DataFrame or np.array): features
            - y (pd.Series or np.array): target
        Return:
            - pd.DataFrame
        """        
        # Create a policy group to group df by policy
        policy_grp = X.groupby(self.grp_by_col)
        
        # Checking the groupings done above
        print('Grouped by:')
        # print(policy_grp.groups.keys())
        for i, key in enumerate(policy_grp.groups.keys()):
            print(f"{i:<{4}} : {key}")
        print('-'*30)
        
        # Apply the text grouping transformation to each group
        # via a lambda function
        tf_return = policy_grp.apply(lambda g: self.text_grping_func(g[self.text_content_column]))
        
        tf_return = pd.DataFrame(tf_return).reset_index()
        # Label the output columns explicitly; without this line the
        # concatenated-text column keeps the default pandas label 0.
        # get_feature_names() below supplies the names used when the
        # transformer is wrapped in a DataFrameMapper.
        tf_return.columns = [self.grp_by_col, self.text_content_column + '_joined']

        print('Returning an object of type: ', type(tf_return))
        
        return tf_return

    def get_feature_names(self):
        """
        Descr: 
            Function to return the names of the transformed columns
        Return:
            - list of names of transformed columns
        """
        return (['Policy_grouped','Text_across_policies_concatenated'])
    

---

### Custom Transformer via sklearn (df version)

In [63]:
# Instantiate the custom transformer

txt_grp_tf_df = TextGroupingCustomTransformer_df('Policy', 'bodytext')
txt_grp_tf_df

TextGroupingCustomTransformer_df(grp_by_col='Policy',
                                 text_content_column='bodytext')

In [64]:
# Run the transformer on the toy data established earlier

tf_output_df = txt_grp_tf_df.fit_transform(data)

Grouped by:
0    : EEO
1    : Military Leave
------------------------------
Returning an object of type:  <class 'pandas.core.frame.DataFrame'>


In [65]:
# Output of the custom transformer

tf_output_df

| | Policy | bodytext_joined |
|---|---|---|
| 0 | EEO | EEO - sent 1 EEO - sent 2 EEO - sent 3 |
| 1 | Military Leave | Military Leave - sent 4 Military Leave - sent 5 |


### Takeaway:
- The only essential difference between this version (returns a DataFrame) and the earlier one (returns a Series) is the output format.
- We will compare this with the output of a DataFrameMapper that wraps the same custom transformer on the same data.
- If `tf_return.columns` is not set (comment it out in the custom transformer definition), the concatenated-text column keeps the default numeric label `0`. When the transformer is run through DataFrameMapper, the output names are instead built from the names returned by `get_feature_names()`, as the next section shows.
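
The numeric default label mentioned in the last bullet can be reproduced with plain pandas. The sketch below builds an unnamed Series by hand to stand in for the result of the groupby/apply step (the `joined`, `unlabelled`, and `labelled` names are illustrative only):

```python
import pandas as pd

# Unnamed Series indexed by policy, standing in for the groupby/apply result
joined = pd.Series(
    ["EEO - sent 1 EEO - sent 2", "Military Leave - sent 4"],
    index=pd.Index(["EEO", "Military Leave"], name="Policy"),
)

# Without explicit labels, the value column gets the default label 0
unlabelled = pd.DataFrame(joined).reset_index()
print(list(unlabelled.columns))

# Assigning to .columns replaces the default label
labelled = unlabelled.copy()
labelled.columns = ["Policy", "bodytext_joined"]
print(list(labelled.columns))
```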

---

### Custom Transformer via sklearn-pandas DataFrameMapper (df version)

In [51]:
# 1. Without Alias

In [52]:
tf_output_dfm_df = DataFrameMapper(
    [
        ( ['Policy', 'bodytext'], txt_grp_tf_df )  # list of column names from the I/P pandas dataframe, object that performs the transformation on the columns
    ], 
    df_out = True,
    input_df = True
) 

In [53]:
tf_output_dfm_df

DataFrameMapper(df_out=True, drop_cols=[],
                features=[(['Policy', 'bodytext'],
                           TextGroupingCustomTransformer_df(grp_by_col='Policy',
                                                            text_content_column='bodytext'))],
                input_df=True)

In [54]:
tf_output_dfm_df.fit_transform(data)

Grouped by:
0    : EEO
1    : Military Leave
------------------------------
Returning an object of type:  <class 'pandas.core.frame.DataFrame'>


| | Policy_bodytext_Policy_grouped | Policy_bodytext_Text_across_policies_concatenated |
|---|---|---|
| 0 | EEO | EEO - sent 1 EEO - sent 2 EEO - sent 3 |
| 1 | Military Leave | Military Leave - sent 4 Military Leave - sent 5 |


In [55]:
# 2. With Alias provided

In [59]:
tf_output_dfm_df_with_alias = DataFrameMapper(
    [
        ( ['Policy', 'bodytext'], txt_grp_tf_df, {'alias':'tf_col'} )
    ], 
    df_out = True,
    input_df = True
) 

In [60]:
tf_output_dfm_df_with_alias

DataFrameMapper(df_out=True, drop_cols=[],
                features=[(['Policy', 'bodytext'],
                           TextGroupingCustomTransformer_df(grp_by_col='Policy',
                                                            text_content_column='bodytext'),
                           {'alias': 'tf_col'})],
                input_df=True)

In [61]:
tf_output_dfm_df_with_alias.fit_transform(data)

Grouped by:
0    : EEO
1    : Military Leave
------------------------------
Returning an object of type:  <class 'pandas.core.frame.DataFrame'>


| | tf_col_Policy_grouped | tf_col_Text_across_policies_concatenated |
|---|---|---|
| 0 | EEO | EEO - sent 1 EEO - sent 2 EEO - sent 3 |
| 1 | Military Leave | Military Leave - sent 4 Military Leave - sent 5 |


### Takeaway:

- sklearn transformers normally return a NumPy array (or a SciPy sparse matrix) of the transformed features, so column labels are lost and have to be re-attached if the result is converted to a pandas DataFrame for further analysis.
- With **`sklearn-pandas DataFrameMapper`** (`df_out=True`), this step is streamlined, which comes in handy if we intend to dig further into the transformed features.
- If `alias` is provided, it is prefixed to the column names defined in the `get_feature_names()` method (e.g. `tf_col_Policy_grouped`).
- If `alias` is not provided, the input column names joined by underscores are used as the prefix instead (e.g. `Policy_bodytext_Policy_grouped`).
- The names assigned to `tf_return.columns` inside the transformer (`Policy`, `bodytext_joined`) do not appear in either DataFrameMapper output above; they only show up when the transformer is called directly.
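
The naming pattern seen in the two DataFrameMapper outputs above can be summarised in a few lines of plain Python. This is a sketch of the observed pattern only, not sklearn-pandas' actual implementation, and `compose_output_names` is a hypothetical helper introduced here for illustration:

```python
def compose_output_names(input_columns, feature_names, alias=None):
    """Hypothetical helper reproducing the naming pattern seen in the outputs."""
    # The alias, when given, replaces the underscore-joined input column names
    prefix = alias if alias is not None else "_".join(input_columns)
    return [f"{prefix}_{name}" for name in feature_names]


# Names returned by get_feature_names() in the transformer above
feature_names = ["Policy_grouped", "Text_across_policies_concatenated"]

without_alias = compose_output_names(["Policy", "bodytext"], feature_names)
with_alias = compose_output_names(["Policy", "bodytext"], feature_names, alias="tf_col")
print(without_alias)
print(with_alias)
```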

---