# Custom Transformer

- sklearn pipeline
- sklearn_pandas DataFrameMapper

---

## Import Modules

In [8]:
# Standard
import pandas as pd

# Custom Transformer - Classes for inheritance
from sklearn.base import BaseEstimator, TransformerMixin

# DataFrame Mapper
from sklearn_pandas import DataFrameMapper

## Toy Data

In [6]:
data = pd.DataFrame({
    'Title': ['EEO - New York', 'EEO - Miami', 'EEO - New Jersey', 'Military Leave - Maine', 'Military Leave - Illinois'],
    'Policy': ['EEO', 'EEO', 'EEO', 'Military Leave', 'Military Leave'],
    'bodytext': ['EEO - sent 1', 'EEO - sent 2', 'EEO - sent 3', 'Military Leave - sent 4', 'Military Leave - sent 5'],
    'ArticleID': [100, 200, 300, 400, 500],
})

In [7]:
data

|   | Title | Policy | bodytext | ArticleID |
|---|-------|--------|----------|-----------|
| 0 | EEO - New York | EEO | EEO - sent 1 | 100 |
| 1 | EEO - Miami | EEO | EEO - sent 2 | 200 |
| 2 | EEO - New Jersey | EEO | EEO - sent 3 | 300 |
| 3 | Military Leave - Maine | Military Leave | Military Leave - sent 4 | 400 |
| 4 | Military Leave - Illinois | Military Leave | Military Leave - sent 5 | 500 |


## Sklearn - Custom Transformer

- The idea is to create a custom transformer that can be reused across datasets in an automated fashion
- Implementing it as an sklearn custom transformer means the same operations can be applied to every dataset in a streamlined manner
- The objective is to concatenate all the text in "bodytext" into a single string per policy group, where the groups are given by the "Policy" column
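
Before wrapping the logic in a class, the target result can be sketched in plain pandas. This is only an illustration on a smaller, hypothetical frame (`toy`), not the notebook's `data`; it uses `groupby` followed by `agg` with `" ".join` to build one string per policy.

```python
import pandas as pd

# Hypothetical three-row frame with the same column names as the toy data
toy = pd.DataFrame({
    "Policy": ["EEO", "EEO", "Military Leave"],
    "bodytext": ["EEO - sent 1", "EEO - sent 2", "Military Leave - sent 4"],
})

# Concatenate all body text within each policy group, separated by spaces
grouped_text = toy.groupby("Policy")["bodytext"].agg(" ".join)
print(grouped_text)
```

The result is a Series indexed by policy, which is the shape the custom transformer is expected to return.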

In [66]:
class TextGroupingCustomTransformer(BaseEstimator, TransformerMixin):
    """
    Descr: 
    This class contains methods to undertake grouping of text
    within each policy by the overall policy it falls under.
    This custom transformer will be used in sklearn pipelines
    to prevent data leakage and to automate similar transformations
    on unseen data.
    In order to make this custom transformer compatible with
    sklearn, it is implemented as a class with fit and transform
    methods that inherits from BaseEstimator and TransformerMixin
    ------------------------------------------------------------------
    Input:
        - Column to group on
        - Column whose contents are to be concatenated per group
    Return:
        - Pandas Series
    """
    
    # Class constructor
    def __init__(self, grp_by_col, text_content_column):
        self.grp_by_col = grp_by_col
        self.text_content_column = text_content_column
    
    # Fit Method - returns self
    def fit(self, X, y=None):
        return self
    
    # Text Grouping function used by Transform method below
    def text_grping_func(self, grped_txt_col):
        return " ".join(content for content in grped_txt_col)
    
    # Transform Method for custom transformations
    def transform(self, X, y=None):
        
        # Create a policy group to group df by policy
        policy_grp = X.groupby(self.grp_by_col)
        
        # Checking the groupings done above
        print('Grouped by:')
        # print(policy_grp.groups.keys())
        for i, key in enumerate(policy_grp.groups.keys()):
            print(f"{i:<{4}} : {key}")
        print('-'*30)
        
        # Apply the text-grouping function to each group's text column
        # so that every group yields one concatenated string
        tf_return = policy_grp.apply(lambda g: self.text_grping_func(g[self.text_content_column]))
        
        print('Returning an object of type: ', type(tf_return))
        
        return tf_return

In [67]:
# Instantiate the custom transformer

txt_grp_tf = TextGroupingCustomTransformer('Policy', 'bodytext')
txt_grp_tf

TextGroupingCustomTransformer(grp_by_col='Policy',
                              text_content_column='bodytext')

In [68]:
# Run the transformer on the toy data established earlier

tf_output = txt_grp_tf.fit_transform(data)

Grouped by:
0    : EEO
1    : Military Leave
------------------------------
Returning an object of type:  <class 'pandas.core.series.Series'>


In [71]:
# Output of the custom transformer

tf_output

Policy
EEO                        EEO - sent 1 EEO - sent 2 EEO - sent 3
Military Leave    Military Leave - sent 4 Military Leave - sent 5
dtype: object

### Takeaway:
- As seen above, we've created a custom transformer which can be easily called and, **more importantly, used within a `sklearn pipeline` so that it facilitates streamlined transformations across datasets** (e.g. train, validation, test)
- The simple objective established at the beginning is now met: the body text has been grouped by policy, and the text from each policy group can now feed downstream tasks.
    - *For instance*: this text can be preprocessed and **NLP techniques can be applied to it to examine document similarity** across policies.
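
The pipeline use and the document-similarity idea from the takeaway could look roughly like the sketch below. It makes several assumptions: `TextGrouper` is a hypothetical, print-free restatement of the transformer above, the frame `toy` is a reduced stand-in for the toy data, and TF-IDF with cosine similarity is just one possible choice of NLP technique, not something the notebook prescribes.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.pipeline import Pipeline


class TextGrouper(BaseEstimator, TransformerMixin):
    """Hypothetical compact variant of the transformer above (no printing)."""

    def __init__(self, grp_by_col, text_content_column):
        self.grp_by_col = grp_by_col
        self.text_content_column = text_content_column

    def fit(self, X, y=None):
        # Nothing is learned from the data, so fit simply returns self
        return self

    def transform(self, X, y=None):
        # One concatenated string per group, indexed by the group key
        return X.groupby(self.grp_by_col)[self.text_content_column].agg(" ".join)


# Reduced stand-in for the toy data
toy = pd.DataFrame({
    "Policy": ["EEO", "EEO", "Military Leave"],
    "bodytext": ["EEO - sent 1", "EEO - sent 2", "Military Leave - sent 4"],
})

# Group the text per policy, then vectorize each policy's combined text
pipe = Pipeline([
    ("group_text", TextGrouper("Policy", "bodytext")),
    ("tfidf", TfidfVectorizer()),
])

tfidf_matrix = pipe.fit_transform(toy)

# Pairwise cosine similarity between the policy-level documents
similarity = cosine_similarity(tfidf_matrix)
print(similarity.shape)
```

Note that this transformer returns one row per policy rather than one row per input row, so a pipeline like this only works when no row-aligned target `y` is passed to the later steps.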

## Sklearn DataFrameMapper