# Feature Explorer and Recommender Demo

In this notebook, we cover Feature Recommender, a part of the ANOVOS package. Feature Recommender is a set of methods and functions that help users figure out which features are likely to be useful for a given problem, a cold-start challenge that almost every data scientist faces. Feature Recommender consists of 2 parts:
- Feature Exploration: This part of Feature Recommender provides a list of methods that let users explore existing features based on their desired industry and/or use case.
    - `list_all_industry()`
    - `list_all_usecase()`
    - `list_all_pair()`
    - `list_usecase_by_industry(industry, semantic=True)`
    - `list_industry_by_usecase(usecase, semantic=True)`
    - `list_feature_by_industry(industry, num_of_feat=100, semantic=True)`
    - `list_feature_by_usecase(usecase, num_of_feat=100, semantic=True)`
    - `list_feature_by_pair(industry, usecase, num_of_feat=100, semantic=True)`
    
    
- Feature Recommendation: This part of Feature Recommender provides 3 methods that recommend features to users based on their input attributes and offer a comprehensive mapping from those input attributes to either the available features or the users' own feature corpus.
    - `feature_recommendation(df, name_column=None, desc_column=None, suggested_industry='all', suggested_usecase='all', semantic=True, top_n=2, threshold=0.3)`
    - `find_attr_by_relevance(df, building_corpus, name_column=None, desc_column=None, threshold=0.3)`
    - `sankey_visualization(df, industry_included=False, usecase_included=False)`
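
The `semantic` flag in the exploration methods controls what happens when the requested industry or use case is not an exact match: the closest known entry is used instead. As a minimal sketch of that fallback, assuming a toy bag-of-words similarity in place of the sentence-transformer embeddings used later in this notebook, the idea looks roughly like this. `KNOWN_INDUSTRIES`, `embed`, `cosine`, and `resolve` are hypothetical names for illustration and are not part of ANOVOS.

```python
import math
from collections import Counter

# Hypothetical stand-in for the values returned by list_all_industry().
KNOWN_INDUSTRIES = ["banking", "healthcare", "telecommunication", "retail banking"]


def embed(text):
    """Toy embedding: bag-of-words counts (an assumption, NOT a sentence-transformer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def resolve(name, known, semantic=True):
    """Return `name` if it is known; otherwise, when `semantic` is True,
    return the known entry most similar to it."""
    name = name.lower().strip()
    if not semantic or name in known:
        return name
    query = embed(name)
    return max(known, key=lambda candidate: cosine(query, embed(candidate)))


exact = resolve("Healthcare ", KNOWN_INDUSTRIES)
fallback = resolve("banking services", KNOWN_INDUSTRIES)
unchanged = resolve("banking services", KNOWN_INDUSTRIES, semantic=False)
print(exact, "|", fallback, "|", unchanged)
```

With `semantic=False` the input is passed through unchanged, so a later exact-match filter would simply return no rows for an unknown industry.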

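The `top_n` and `threshold` parameters of `feature_recommendation` interact: for each input attribute the `top_n` most similar features are considered, and any of them scoring below `threshold` is reported as an empty match. The following is a rough sketch of that selection logic only, again assuming a toy bag-of-words similarity rather than the real embedding model. The function `recommend` and the sample data are hypothetical and not part of ANOVOS.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts (an assumption, NOT a sentence-transformer)."""
    return Counter(text.lower().replace("_", " ").split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(attributes, corpus, top_n=2, threshold=0.3):
    """Return one row per (attribute, top-n candidate).

    Candidates scoring below `threshold` are kept as (attribute, None, None)
    placeholders, mirroring the 'Null' rows in the notebook's output.
    """
    corpus_vectors = [(feature, embed(feature)) for feature in corpus]
    rows = []
    for attribute in attributes:
        query = embed(attribute)
        scored = sorted(
            ((cosine(query, vector), feature) for feature, vector in corpus_vectors),
            key=lambda pair: pair[0],
            reverse=True,
        )
        for score, feature in scored[:top_n]:
            if score >= threshold:
                rows.append((attribute, feature, round(score, 4)))
            else:
                rows.append((attribute, None, None))
    return rows


# Hypothetical feature corpus and attribute names.
corpus = [
    "customer age in years",
    "total transaction amount",
    "number of transactions last month",
]
rows = recommend(["customer_age", "zip code"], corpus, top_n=2, threshold=0.3)
for row in rows:
    print(row)
```

Raising `threshold` trades recall for precision: fewer attributes receive a recommendation, but the ones that do are closer matches.
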
## Setting Up

As Feature Recommender is a part of ANOVOS, it is recommended to install ANOVOS to try out all the functions inside Feature Recommender. In this notebook, we will try both approaches: installing from the ANOVOS package and manually setting up the Feature Recommender functions.

### Install from ANOVOS Package

In [1]:
# !pip install git+https://github.com/anovos/anovos.git@feature_recommender_beta
# from anovos.feature_recommender.feature_exploration import *
# from anovos.feature_recommender.feature_recommendation import *

### Set up the functions manually

In [2]:
import pandas as pd
from sentence_transformers import SentenceTransformer
from re import finditer
import copy
from sentence_transformers import util
import numpy as np
import re
import plotly.graph_objects as go
import random
import matplotlib.pyplot as plt

model = SentenceTransformer('all-mpnet-base-v2')
input_path = 'https://raw.githubusercontent.com/anovos/anovos/feature_recommender_beta/data/feature_recommender/flatten_fr_db.csv'
df_input = pd.read_csv(input_path)


def list_all_industry():
    """
    :return: DataFrame
    """
    odf_uni = df_input['Industry'].unique()
    odf = pd.DataFrame(odf_uni, columns=['Industry'])
    return odf


def list_all_usecase():
    """
    :return: DataFrame
    """
    odf_uni = df_input['Usecase'].unique()
    odf = pd.DataFrame(odf_uni, columns=['Usecase'])
    return odf


def list_all_pair():
    """
    :return: DataFrame
    """
    odf = df_input[['Industry', 'Usecase']].drop_duplicates(keep='last',
                                                            ignore_index=True)
    return odf


def process_usecase(usecase, semantic):
    """
    :param usecase: Input usecase (string)
    :param semantic: Input semantic (boolean) - Whether the input needs to go through semantic matching or not. Default is True.
    :return: Processed Usecase(string)
    """
    if type(semantic) != bool:
        raise TypeError('Invalid input for semantic')
    if type(usecase) != str:
        raise TypeError('Invalid input for usecase')
    usecase = usecase.lower().strip()
    # str.replace does not interpret regex patterns, so use re.sub instead
    usecase = re.sub("[^A-Za-z0-9 ]+", " ", usecase)
    all_usecase = list_all_usecase()['Usecase'].to_list()
    if semantic and usecase not in all_usecase:
        all_usecase_embeddings = model.encode(all_usecase,
                                              convert_to_tensor=True)
        usecase_embeddings = model.encode(usecase, convert_to_tensor=True)
        cos_scores = util.pytorch_cos_sim(usecase_embeddings,
                                          all_usecase_embeddings)[0]
        first_match_index = int(np.argpartition(-cos_scores, 0)[0])
        processed_usecase = all_usecase[first_match_index]
        print(
            "Input Usecase not available. Showing the most semantically relevant Usecase result: ",
            processed_usecase)
    else:
        processed_usecase = usecase
    return processed_usecase


def process_industry(industry, semantic):
    """
    :param industry: Input industry (string)
    :param semantic: Input semantic (boolean) - Whether the input needs to go through semantic matching or not. Default is True.
    :return: Processed Industry(string)
    """
    if type(semantic) != bool:
        raise TypeError('Invalid input for semantic')
    if type(industry) != str:
        raise TypeError('Invalid input for industry')
    industry = industry.lower().strip()
    # str.replace does not interpret regex patterns, so use re.sub instead
    industry = re.sub("[^A-Za-z0-9 ]+", " ", industry)
    all_industry = list_all_industry()['Industry'].to_list()
    if semantic and industry not in all_industry:
        all_industry_embeddings = model.encode(all_industry,
                                               convert_to_tensor=True)
        industry_embeddings = model.encode(industry, convert_to_tensor=True)
        cos_scores = util.pytorch_cos_sim(industry_embeddings,
                                          all_industry_embeddings)[0]
        first_match_index = int(np.argpartition(-cos_scores, 0)[0])
        processed_industry = all_industry[first_match_index]
        print(
            "Input Industry not available. Showing the most semantically relevant Industry result: ",
            processed_industry)
    else:
        processed_industry = industry
    return processed_industry


def list_usecase_by_industry(industry, semantic=True):
    """
    :param industry: Input industry (string)
    :param semantic: Input semantic (boolean) - Whether the input needs to go through semantic matching or not. Default is True.
    :return: DataFrame
    """
    industry = process_industry(industry, semantic)
    odf = df_input.loc[df_input['Industry'] == industry][[
        'Usecase'
    ]].drop_duplicates(keep='last', ignore_index=True)
    return odf


def list_industry_by_usecase(usecase, semantic=True):
    """
    :param usecase: Input usecase (string)
    :param semantic: Input semantic (boolean) - Whether the input needs to go through semantic matching or not. Default is True.
    :return: DataFrame
    """
    usecase = process_usecase(usecase, semantic)
    odf = df_input.loc[df_input['Usecase'] == usecase][[
        'Industry'
    ]].drop_duplicates(keep='last', ignore_index=True)
    return odf


def list_feature_by_industry(industry, num_of_feat=100, semantic=True):
    """
    :param industry: Input industry (string)
    :param num_of_feat: Number of features displayed (integer). Default is 100.
    :param semantic: Input semantic (boolean) - Whether the input needs to go through semantic matching or not. Default is True.
    :return: DataFrame
    """
    if type(num_of_feat) != int or num_of_feat < 0:
        if num_of_feat != 'all':
            raise TypeError('Invalid input for num_of_feat')
    industry = process_industry(industry, semantic)
    odf = df_input.loc[df_input['Industry'] == industry].drop_duplicates(
        keep='last', ignore_index=True)
    if len(odf) > 0:
        odf['count'] = odf.groupby('Usecase')['Usecase'].transform('count')
        odf.sort_values('count', inplace=True, ascending=False)
        odf = odf.drop('count', axis=1)
        if num_of_feat != 'all':
            odf = odf.head(num_of_feat).reset_index(drop=True)
        else:
            odf = odf.reset_index(drop=True)
    return odf


def list_feature_by_usecase(usecase, num_of_feat=100, semantic=True):
    """
    :param usecase: Input usecase (string)
    :param num_of_feat: Number of features displayed (integer). Default is 100.
    :param semantic: Input semantic (boolean) - Whether the input needs to go through semantic matching or not. Default is True.
    :return: DataFrame
    """
    if type(num_of_feat) != int or num_of_feat < 0:
        if num_of_feat != 'all':
            raise TypeError('Invalid input for num_of_feat')
    usecase = process_usecase(usecase, semantic)
    odf = df_input.loc[df_input['Usecase'] == usecase].drop_duplicates(
        keep='last', ignore_index=True)
    if len(odf) > 0:
        odf['count'] = odf.groupby('Industry')['Industry'].transform('count')
        odf.sort_values('count', inplace=True, ascending=False)
        odf = odf.drop('count', axis=1)
        if num_of_feat != 'all':
            odf = odf.head(num_of_feat).reset_index(drop=True)
        else:
            odf = odf.reset_index(drop=True)
    return odf


def list_feature_by_pair(industry, usecase, num_of_feat=100, semantic=True):
    """
    :param industry: Input industry (string)
    :param usecase: Input usecase (string)
    :param num_of_feat: Number of features displayed (integer). Default is 100.
    :param semantic: Input semantic (boolean) - Whether the input needs to go through semantic matching or not. Default is True.
    :return: DataFrame
    """
    if type(num_of_feat) != int or num_of_feat < 0:
        if num_of_feat != 'all':
            raise TypeError('Invalid input for num_of_feat')
    industry = process_industry(industry, semantic)
    usecase = process_usecase(usecase, semantic)
    if num_of_feat != 'all':
        odf = df_input.loc[(df_input['Industry'] == industry)
                           & (df_input['Usecase'] == usecase)].drop_duplicates(
                               keep='last',
                               ignore_index=True).head(num_of_feat)
    else:
        odf = df_input.loc[(df_input['Industry'] == industry)
                           & (df_input['Usecase'] == usecase)].drop_duplicates(
                               keep='last', ignore_index=True)
    return odf


def camel_case_split(input):
    """
    :param input: Input needs to be cleaned (string)
    :return: Processed Input (string)
    """
    processed_input = ''
    matches = finditer('.+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)',
                       input)
    for m in matches:
        processed_input += str(m.group(0)) + str(' ')
    return processed_input


def recommendation_data_prep(df, name_column, desc_column):
    """
    :param df: Input DataFrame
    :param name_column: Input column name in Input DataFrame (string)
    :param desc_column: Input column description in Input DataFrame (string)
    :return list_corpus: List of prepared data for Feature Recommender functions
    :return df_prep: Processed DataFrame for Feature Recommender functions
    """
    if not isinstance(df, pd.DataFrame):
        raise TypeError('Invalid input for df')
    if name_column is not None and name_column not in df.columns:
        raise TypeError('Invalid input for name_column')
    if desc_column is not None and desc_column not in df.columns:
        raise TypeError('Invalid input for desc_column')
    if name_column is None and desc_column is None:
        raise TypeError(
            'Need at least one input for either name_column or desc_column')
    df_prep = copy.deepcopy(df)
    if name_column == None:
        df_prep[desc_column] = df_prep[desc_column].astype(str)
        df_prep_com = df_prep[desc_column]
    elif desc_column == None:
        df_prep[name_column] = df_prep[name_column].astype(str)
        df_prep_com = df_prep[name_column]
    else:
        # Cast to str first so the .str accessor also works on non-string columns
        df_prep[name_column] = df_prep[name_column].astype(str)
        df_prep[name_column] = df_prep[name_column].str.replace('_', ' ')
        df_prep[desc_column] = df_prep[desc_column].astype(str)
        df_prep_com = df_prep[[name_column, desc_column]].agg(' '.join, axis=1)
    df_prep_com = df_prep_com.replace({"[^A-Za-z0-9 ]+": " "}, regex=True)
    for i in range(len(df_prep_com)):
        df_prep_com[i] = df_prep_com[i].strip()
        df_prep_com[i] = camel_case_split(df_prep_com[i])
    list_corpus = df_prep_com.to_list()
    return list_corpus, df_prep


df_groupby = df_input.groupby(['Feature Name', 'Feature Description']).agg({
    'Industry':
    lambda x: ', '.join(set(x.dropna())),
    'Usecase':
    lambda x: ', '.join(set(x.dropna())),
    'Source':
    lambda x: ', '.join(set(x.dropna()))
}).reset_index()
list_train, df_rec = recommendation_data_prep(df_groupby, 'Feature Name',
                                              'Feature Description')
list_embedding_train = model.encode(list_train, convert_to_tensor=True)


def feature_recommendation(df,
                           name_column=None,
                           desc_column=None,
                           suggested_industry='all',
                           suggested_usecase='all',
                           semantic=True,
                           top_n=2,
                           threshold=0.3):
    """
    :param df: Input DataFrame - Attribute Dictionary
    :param name_column: Input column name for Attribute Name in Input DataFrame (string). Default is None.
    :param desc_column: Input column name for Attribute Description in Input DataFrame (string). Default is None.
    :param suggested_industry: Input suggested Industry (string). Default is 'all', meaning all Industries available.
    :param suggested_usecase: Input suggested Usecase (string). Default is 'all', meaning all Usecases available.
    :param semantic: Input semantic (boolean) - Whether the input needs to go through semantic matching or not. Default is True.
    :param top_n: Number of feature displayed (int). Default is 2
    :param threshold: Input threshold value (float). Default is 0.3
    :return: DataFrame with Recommended Features
    """
    if not isinstance(df, pd.DataFrame):
        raise TypeError('Invalid input for df')
    if type(top_n) != int or top_n < 0:
        raise TypeError('Invalid input for top_n')
    if top_n > len(list_train):
        raise TypeError('top_n value is too large')
    if type(threshold) != float:
        raise TypeError('Invalid input for threshold')
    if threshold < 0 or threshold > 1:
        raise TypeError(
            'Invalid input for threshold. Threshold value is between 0 and 1')
    list_user, df_user = recommendation_data_prep(df, name_column, desc_column)

    if suggested_industry != 'all' and suggested_usecase == 'all':
        suggested_industry = process_industry(suggested_industry, semantic)
        df_rec_fr = df_rec[df_rec['Industry'].str.contains(suggested_industry)]
        list_keep = list(df_rec_fr.index)
        list_embedding_train_fr = [
            list_embedding_train.tolist()[x] for x in list_keep
        ]
        df_rec_fr = df_rec_fr.reset_index(drop=True)
    elif suggested_usecase != 'all' and suggested_industry == 'all':
        suggested_usecase = process_usecase(suggested_usecase, semantic)
        df_rec_fr = df_rec[df_rec['Usecase'].str.contains(suggested_usecase)]
        list_keep = list(df_rec_fr.index)
        list_embedding_train_fr = [
            list_embedding_train.tolist()[x] for x in list_keep
        ]
        df_rec_fr = df_rec_fr.reset_index(drop=True)
    elif suggested_usecase != 'all' and suggested_industry != 'all':
        suggested_industry = process_industry(suggested_industry, semantic)
        suggested_usecase = process_usecase(suggested_usecase, semantic)
        df_rec_fr = df_rec[df_rec['Industry'].str.contains(suggested_industry)
                           & df_rec['Usecase'].str.contains(suggested_usecase)]
        if len(df_rec_fr) > 0:
            list_keep = list(df_rec_fr.index)
            list_embedding_train_fr = [
                list_embedding_train.tolist()[x] for x in list_keep
            ]
            df_rec_fr = df_rec_fr.reset_index(drop=True)
        else:
            df_out = pd.DataFrame(columns=[
                'Input Attribute Name', 'Input Attribute Description',
                'Recommended Feature Name', 'Recommended Feature Description',
                'Feature Similarity Score', 'Industry', 'Usecase', 'Source'
            ])
            print('Industry/Usecase pair does not exist.')
            return df_out
    else:
        df_rec_fr = df_rec
        list_embedding_train_fr = list_embedding_train

    if name_column == None:
        df_out = pd.DataFrame(columns=[
            'Input Attribute Description', 'Recommended Feature Name',
            'Recommended Feature Description', 'Feature Similarity Score',
            'Industry', 'Usecase', 'Source'
        ])
    elif desc_column == None:
        df_out = pd.DataFrame(columns=[
            'Input Attribute Name', 'Recommended Feature Name',
            'Recommended Feature Description', 'Feature Similarity Score',
            'Industry', 'Usecase', 'Source'
        ])
    else:
        df_out = pd.DataFrame(columns=[
            'Input Attribute Name', 'Input Attribute Description',
            'Recommended Feature Name', 'Recommended Feature Description',
            'Feature Similarity Score', 'Industry', 'Usecase', 'Source'
        ])
    list_embedding_user = model.encode(list_user, convert_to_tensor=True)
    for i, feature in enumerate(list_user):
        cos_scores = util.pytorch_cos_sim(list_embedding_user,
                                          list_embedding_train_fr)[i]
        top_results = np.argpartition(-cos_scores, range(top_n))[0:top_n]
        for idx in top_results[0:top_n]:
            single_score = "%.4f" % (cos_scores[idx])
            if name_column == None:
                if float(single_score) >= threshold:
                    df_append = pd.DataFrame(
                        [[
                            df_user[desc_column].iloc[i],
                            df_rec_fr['Feature Name'].iloc[int(idx)],
                            df_rec_fr['Feature Description'].iloc[int(idx)],
                            "%.4f" % (cos_scores[idx]),
                            df_rec_fr['Industry'].iloc[int(idx)],
                            df_rec_fr['Usecase'].iloc[int(idx)],
                            df_rec_fr['Source'].iloc[int(idx)]
                        ]],
                        columns=[
                            'Input Attribute Description',
                            'Recommended Feature Name',
                            'Recommended Feature Description',
                            'Feature Similarity Score', 'Industry', 'Usecase',
                            'Source'
                        ])
                else:
                    df_append = pd.DataFrame(
                        [[
                            df_user[desc_column].iloc[i], 'Null', 'Null',
                            'Null', 'Null', 'Null', 'Null'
                        ]],
                        columns=[
                            'Input Attribute Description',
                            'Recommended Feature Name',
                            'Recommended Feature Description',
                            'Feature Similarity Score', 'Industry', 'Usecase',
                            'Source'
                        ])
            elif desc_column == None:
                if float(single_score) >= threshold:
                    df_append = pd.DataFrame(
                        [[
                            df_user[name_column].iloc[i],
                            df_rec_fr['Feature Name'].iloc[int(idx)],
                            df_rec_fr['Feature Description'].iloc[int(idx)],
                            "%.4f" % (cos_scores[idx]),
                            df_rec_fr['Industry'].iloc[int(idx)],
                            df_rec_fr['Usecase'].iloc[int(idx)],
                            df_rec_fr['Source'].iloc[int(idx)]
                        ]],
                        columns=[
                            'Input Attribute Name', 'Recommended Feature Name',
                            'Recommended Feature Description',
                            'Feature Similarity Score', 'Industry', 'Usecase',
                            'Source'
                        ])
                else:
                    df_append = pd.DataFrame(
                        [[
                            df_user[name_column].iloc[i], 'Null', 'Null',
                            'Null', 'Null', 'Null', 'Null'
                        ]],
                        columns=[
                            'Input Attribute Name', 'Recommended Feature Name',
                            'Recommended Feature Description',
                            'Feature Similarity Score', 'Industry', 'Usecase',
                            'Source'
                        ])
            else:
                if float(single_score) >= threshold:
                    df_append = pd.DataFrame(
                        [[
                            df_user[name_column].iloc[i],
                            df_user[desc_column].iloc[i],
                            df_rec_fr['Feature Name'].iloc[int(idx)],
                            df_rec_fr['Feature Description'].iloc[int(idx)],
                            "%.4f" % (cos_scores[idx]),
                            df_rec_fr['Industry'].iloc[int(idx)],
                            df_rec_fr['Usecase'].iloc[int(idx)],
                            df_rec_fr['Source'].iloc[int(idx)]
                        ]],
                        columns=[
                            'Input Attribute Name',
                            'Input Attribute Description',
                            'Recommended Feature Name',
                            'Recommended Feature Description',
                            'Feature Similarity Score', 'Industry', 'Usecase',
                            'Source'
                        ])
                else:
                    df_append = pd.DataFrame(
                        [[
                            df_user[name_column].iloc[i],
                            df_user[desc_column].iloc[i], 'Null', 'Null',
                            'Null', 'Null', 'Null', 'Null'
                        ]],
                        columns=[
                            'Input Attribute Name',
                            'Input Attribute Description',
                            'Recommended Feature Name',
                            'Recommended Feature Description',
                            'Feature Similarity Score', 'Industry', 'Usecase',
                            'Source'
                        ])
            # DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
            df_out = pd.concat([df_out, df_append], ignore_index=True)
    return df_out


def find_attr_by_relevance(df,
                           building_corpus,
                           name_column=None,
                           desc_column=None,
                           threshold=0.3):
    """
    :param df: Input DataFrame - Attribute Dictionary
    :param building_corpus: Input Feature Description (list)
    :param name_column: Input column name for Attribute Name in Input DataFrame (string). Default is None.
    :param desc_column: Input column name for Attribute Description in Input DataFrame (string). Default is None.
    :param threshold: Input threshold value (float). Default is 0.3
    :return: DataFrame with Input Feature Description and Input Attribute matching
    """
    if not isinstance(df, pd.DataFrame):
        raise TypeError('Invalid input for df')
    if type(building_corpus) != list:
        raise TypeError('Invalid input for building_corpus')
    if type(threshold) != float:
        raise TypeError('Invalid input for threshold')
    if threshold < 0 or threshold > 1:
        raise TypeError(
            'Invalid input for threshold. Threshold value is between 0 and 1')
    for i in range(len(building_corpus)):
        if type(building_corpus[i]) != str:
            raise TypeError('Invalid input inside building_corpus:',
                            building_corpus[i])
        building_corpus[i] = re.sub("[^A-Za-z0-9]+", " ", building_corpus[i])
        building_corpus[i] = camel_case_split(building_corpus[i])
        building_corpus[i] = building_corpus[i].lower().strip()
    if name_column == None:
        df_out = pd.DataFrame(columns=[
            'Input Feature Desc', 'Recommended Input Attribute Description',
            'Input Attribute Similarity Score'
        ])
    elif desc_column == None:
        df_out = pd.DataFrame(columns=[
            'Input Feature Desc', 'Recommended Input Attribute Name',
            'Input Attribute Similarity Score'
        ])
    else:
        df_out = pd.DataFrame(columns=[
            'Input Feature Desc', 'Recommended Input Attribute Name',
            'Recommended Input Attribute Description',
            'Input Attribute Similarity Score'
        ])
    list_user, df_user = recommendation_data_prep(df, name_column, desc_column)
    list_embedding_user = model.encode(list_user, convert_to_tensor=True)
    list_embedding_building = model.encode(building_corpus,
                                           convert_to_tensor=True)
    for i, feature in enumerate(building_corpus):
        if name_column == None:
            df_append = pd.DataFrame(columns=[
                'Input Feature Desc',
                'Recommended Input Attribute Description',
                'Input Attribute Similarity Score'
            ])
        elif desc_column == None:
            df_append = pd.DataFrame(columns=[
                'Input Feature Desc', 'Recommended Input Attribute Name',
                'Input Attribute Similarity Score'
            ])
        else:
            df_append = pd.DataFrame(columns=[
                'Input Feature Desc', 'Recommended Input Attribute Name',
                'Recommended Input Attribute Description',
                'Input Attribute Similarity Score'
            ])
        cos_scores = util.pytorch_cos_sim(list_embedding_building,
                                          list_embedding_user)[i]
        top_results = np.argpartition(-cos_scores,
                                      range(len(list_user)))[0:len(list_user)]
        for idx in top_results[0:len(list_user)]:
            single_score = "%.4f" % (cos_scores[idx])
            if float(single_score) >= threshold:
                if name_column == None:
                    df_append = pd.concat([
                        df_append,
                        pd.DataFrame([{
                            'Input Feature Desc':
                            feature,
                            'Recommended Input Attribute Description':
                            df_user[desc_column].iloc[int(idx)],
                            'Input Attribute Similarity Score':
                            single_score
                        }])
                    ], ignore_index=True)
                elif desc_column == None:
                    df_append = pd.concat([
                        df_append,
                        pd.DataFrame([{
                            'Input Feature Desc':
                            feature,
                            'Recommended Input Attribute Name':
                            df_user[name_column].iloc[int(idx)],
                            'Input Attribute Similarity Score':
                            single_score
                        }])
                    ], ignore_index=True)
                else:
                    df_append = pd.concat([
                        df_append,
                        pd.DataFrame([{
                            'Input Feature Desc':
                            feature,
                            'Recommended Input Attribute Name':
                            df_user[name_column].iloc[int(idx)],
                            'Recommended Input Attribute Description':
                            df_user[desc_column].iloc[int(idx)],
                            'Input Attribute Similarity Score':
                            single_score
                        }])
                    ], ignore_index=True)
        if len(df_append) == 0:
            if name_column == None:
                df_append = pd.concat([
                    df_append,
                    pd.DataFrame([{
                        'Input Feature Desc': feature,
                        'Recommended Input Attribute Description': 'Null',
                        'Input Attribute Similarity Score': 'Null'
                    }])
                ], ignore_index=True)
            elif desc_column == None:
                df_append = pd.concat([
                    df_append,
                    pd.DataFrame([{
                        'Input Feature Desc': feature,
                        'Recommended Input Attribute Name': 'Null',
                        'Input Attribute Similarity Score': 'Null'
                    }])
                ], ignore_index=True)
            else:
                df_append = pd.concat([
                    df_append,
                    pd.DataFrame([{
                        'Input Feature Desc': feature,
                        'Recommended Input Attribute Name': 'Null',
                        'Recommended Input Attribute Description': 'Null',
                        'Input Attribute Similarity Score': 'Null'
                    }])
                ], ignore_index=True)
        # DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
        df_out = pd.concat([df_out, df_append], ignore_index=True)
    return df_out


def sankey_visualization(df, industry_included=False, usecase_included=False):
    """
    :param df: Input DataFrame. This DataFrame needs to be the output of feature_recommendation or find_attr_by_relevance, or in the same format.
    :param industry_included: Whether the plot needs to include industry mapping or not (boolean). Default is False
    :param usecase_included: Whether the plot needs to include usecase mapping or not (boolean). Default is False
    :return: Sankey plot
    """
    fr_proper_col_list = [
        'Recommended Feature Name', 'Recommended Feature Description',
        'Feature Similarity Score', 'Industry', 'Usecase', 'Source'
    ]
    attr_proper_col_list = [
        'Input Feature Desc', 'Input Attribute Similarity Score'
    ]
    if not isinstance(df, pd.DataFrame):
        raise TypeError('Invalid input for df')
    if not all(x in list(df.columns) for x in fr_proper_col_list) and not all(
            x in list(df.columns) for x in attr_proper_col_list):
        raise TypeError(
            'df is not output DataFrame of Feature Recommendation functions')
    if type(industry_included) != bool:
        raise TypeError('Invalid input for industry_included')
    if type(usecase_included) != bool:
        raise TypeError('Invalid input for usecase_included')
    if 'Feature Similarity Score' in df.columns:
        if 'Input Attribute Name' in df.columns:
            name_source = 'Input Attribute Name'
        else:
            name_source = 'Input Attribute Description'
        name_target = 'Recommended Feature Name'
        name_score = 'Feature Similarity Score'
    else:
        name_source = 'Input Feature Desc'
        if 'Recommended Input Attribute Name' in df.columns:
            name_target = 'Recommended Input Attribute Name'
        else:
            name_target = 'Recommended Input Attribute Description'
        name_score = 'Input Attribute Similarity Score'
        if industry_included != False or usecase_included != False:
            print(
                'Input is find_attr_by_relevance output DataFrame. There is no suggested Industry and/or Usecase.'
            )
        industry_included = False
        usecase_included = False
    industry_target = 'Industry'
    usecase_target = 'Usecase'
    df_iter = copy.deepcopy(df)
    for i in range(len(df_iter)):
        if str(df_iter[name_score][i]) == 'Null':
            df = df.drop([i])
    df = df.reset_index(drop=True)
    source = []
    target = []
    value = []
    if industry_included == False and usecase_included == False:
        source_list = df[name_source].unique().tolist()
        target_list = df[name_target].unique().tolist()
        label = source_list + target_list
        for i in range(len(df)):
            source.append(label.index(str(df[name_source][i])))
            target.append(label.index(str(df[name_target][i])))
            value.append(float(df[name_score][i]))
    elif not industry_included and usecase_included:
        source_list = df[name_source].unique().tolist()
        target_list = df[name_target].unique().tolist()
        raw_usecase_list = df[usecase_target].unique().tolist()
        usecase_list = []
        for i, item in enumerate(raw_usecase_list):
            if ', ' in raw_usecase_list[i]:
                raw_usecase_list[i] = raw_usecase_list[i].split(', ')
                for j, sub_item in enumerate(raw_usecase_list[i]):
                    usecase_list.append(sub_item)
            else:
                usecase_list.append(item)
        label = source_list + target_list + usecase_list
        for i in range(len(df)):
            source.append(label.index(str(df[name_source][i])))
            target.append(label.index(str(df[name_target][i])))
            value.append(float(df[name_score][i]))
            temp_list = df[usecase_target][i].split(', ')
            for k, item in enumerate(temp_list):
                source.append(label.index(str(df[name_target][i])))
                target.append(label.index(str(item)))
                value.append(float(1))
    elif industry_included and not usecase_included:
        source_list = df[name_source].unique().tolist()
        target_list = df[name_target].unique().tolist()
        raw_industry_list = df[industry_target].unique().tolist()
        industry_list = []
        for i, item in enumerate(raw_industry_list):
            if ', ' in raw_industry_list[i]:
                raw_industry_list[i] = raw_industry_list[i].split(', ')
                for j, sub_item in enumerate(raw_industry_list[i]):
                    industry_list.append(sub_item)
            else:
                industry_list.append(item)
        label = source_list + target_list + industry_list
        for i in range(len(df)):
            source.append(label.index(str(df[name_source][i])))
            target.append(label.index(str(df[name_target][i])))
            value.append(float(df[name_score][i]))
            temp_list = df[industry_target][i].split(', ')
            for k, item in enumerate(temp_list):
                source.append(label.index(str(df[name_target][i])))
                target.append(label.index(str(item)))
                value.append(float(1))
    else:
        source_list = df[name_source].unique().tolist()
        target_list = df[name_target].unique().tolist()
        raw_industry_list = df[industry_target].unique().tolist()
        raw_usecase_list = df[usecase_target].unique().tolist()
        industry_list = []
        for i, item in enumerate(raw_industry_list):
            if ', ' in raw_industry_list[i]:
                raw_industry_list[i] = raw_industry_list[i].split(', ')
                for j, sub_item in enumerate(raw_industry_list[i]):
                    industry_list.append(sub_item)
            else:
                industry_list.append(item)

        usecase_list = []
        for i, item in enumerate(raw_usecase_list):
            if ', ' in raw_usecase_list[i]:
                raw_usecase_list[i] = raw_usecase_list[i].split(', ')
                for j, sub_item in enumerate(raw_usecase_list[i]):
                    usecase_list.append(sub_item)
            else:
                usecase_list.append(item)

        label = source_list + target_list + industry_list + usecase_list
        for i in range(len(df)):
            source.append(label.index(str(df[name_source][i])))
            target.append(label.index(str(df[name_target][i])))
            value.append(float(df[name_score][i]))
            temp_list_industry = df[industry_target][i].split(', ')
            temp_list_usecase = df[usecase_target][i].split(', ')
            for k, item_industry in enumerate(temp_list_industry):
                source.append(label.index(str(df[name_target][i])))
                target.append(label.index(str(item_industry)))
                value.append(float(1))
                for j, item_usecase in enumerate(temp_list_usecase):
                    if item_usecase in list_usecase_by_industry(
                            item_industry)['Usecase'].tolist():
                        source.append(label.index(str(item_industry)))
                        target.append(label.index(str(item_usecase)))
                        value.append(float(1))
    line_color = [
        "#" + ''.join([random.choice('0123456789ABCDEF') for j in range(6)])
        for k in range(len(value))
    ]
    label_color = [
        "#" + ''.join([random.choice('0123456789ABCDEF') for e in range(6)])
        for f in range(len(label))
    ]
    fig = go.Figure(data=[
        go.Sankey(node=dict(pad=15,
                            thickness=20,
                            line=dict(color=line_color, width=0.5),
                            label=label,
                            color=label_color),
                  link=dict(source=source, target=target, value=value))
    ])
    fig.update_layout(title_text="Feature Recommendation Sankey Visualization",
                      font_size=10)
    return fig
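
The function above gives every distinct name a position in `label` and describes each link as a pair of positions plus a value. The sketch below reproduces that bookkeeping with plain Python lists so it can be inspected without plotting anything. The helper name `build_links` and the short dictionary keys are hypothetical, and unlike the function above it de-duplicates use case labels, so treat it as an illustration rather than the package's code.

```python
def build_links(rows, usecase_included=False):
    """Return (label, source, target, value) in the layout a Sankey diagram expects."""
    # dict.fromkeys keeps first-seen order while removing duplicates
    sources = list(dict.fromkeys(r["attribute"] for r in rows))
    targets = list(dict.fromkeys(r["feature"] for r in rows))
    usecases = []
    if usecase_included:
        for r in rows:
            # A cell may hold several use cases separated by ', '
            for uc in r["usecase"].split(", "):
                if uc not in usecases:
                    usecases.append(uc)
    label = sources + targets + usecases

    source, target, value = [], [], []
    for r in rows:
        # Attribute -> feature link, weighted by the similarity score
        source.append(label.index(r["attribute"]))
        target.append(label.index(r["feature"]))
        value.append(float(r["score"]))
        if usecase_included:
            # Feature -> use case links all get the constant weight 1.0
            for uc in r["usecase"].split(", "):
                source.append(label.index(r["feature"]))
                target.append(label.index(uc))
                value.append(1.0)
    return label, source, target, value


rows = [
    {"attribute": "ContractRenewal", "feature": "contract", "score": 0.5738,
     "usecase": "customer churn prediction, customer segmentation"},
    {"attribute": "DataPlan", "feature": "subscriber's plan", "score": 0.4921,
     "usecase": "sim swap detection"},
]
label, source, target, value = build_links(rows, usecase_included=True)
print(label)
print(list(zip(source, target, value)))
```

Because `label.index` returns the first occurrence, a name that appears both as a source and as a target resolves to a single node.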

## Feature Exploration

Feature Exploration is the first part of Feature Recommender. It helps users explore existing features for their cold-start problems.

### list_all_industry( )
Argument: None

This function lists all the Industries that are supported in the Feature Recommender package.

In [3]:
list_all_industry()

Unnamed: 0,Industry
0,gaming
1,retail
2,banking financial service and insurance
3,telecommunication
4,healthcare
5,transportation
6,supply chain


### list_all_usecase( )
Argument: None

This function lists all the Use cases that are supported in the Feature Recommender package.

In [4]:
list_all_usecase()

Unnamed: 0,Usecase
0,customer churn prediction
1,delinquency prediction
2,gender prediction using voice
3,mobile data speed prediction
4,customer segmentation
5,fraud detection
6,network traffic analysis and prediction
7,personality prediction
8,sim swap detection
9,malware detection


### list_all_pair( )
Argument: None

This function lists all the Industry/Use case pairs that are supported in the Feature Recommender package.

In [5]:
list_all_pair()

Unnamed: 0,Industry,Usecase
0,gaming,customer churn prediction
1,retail,customer churn prediction
2,banking financial service and insurance,customer churn prediction
3,telecommunication,customer churn prediction
4,telecommunication,delinquency prediction
5,telecommunication,gender prediction using voice
6,telecommunication,mobile data speed prediction
7,retail,customer segmentation
8,banking financial service and insurance,customer segmentation
9,telecommunication,customer segmentation


### list_usecase_by_industry(industry, semantic=True)
Argument: 
- industry(string): Input industry from user
- semantic(boolean): Whether the input needs to go through semantic matching or not. Default is `True`

This function lists all the Use cases that are supported in the Feature Recommender package for the input Industry.

From the output of `list_all_industry()`, we can see the list of supported industries. Let's first try with `telco`, an abbreviation that is not on that list.

In [32]:
list_usecase_by_industry('telco')

Input Industry not available. Showing the most semantically relevant Usecase result:  telecommunication


Unnamed: 0,Usecase
0,customer churn prediction
1,delinquency prediction
2,gender prediction using voice
3,mobile data speed prediction
4,customer segmentation
5,fraud detection
6,network traffic analysis and prediction
7,personality prediction
8,sim swap detection
9,malware detection


Next, we will try `logistics`, another common industry that is not available in Feature Recommender, with the `semantic` argument left at its default value of `True`.

In [7]:
list_usecase_by_industry('logistics')

Input Industry not available. Showing the most semantically relevant Usecase result:  supply chain


Unnamed: 0,Usecase
0,user mobility


We can see that `logistics` does not literally match any of the available Industries. With semantic matching, however, `logistics` is matched to `supply chain`, which is a very close pair.
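
The two-step behaviour seen above (take a literal match if there is one, otherwise fall back to the closest candidate when `semantic=True`) can be sketched as follows. This is a sketch under stated assumptions, not the package's implementation: the helper names `resolve_name` and `char_similarity` are hypothetical, and the default scorer uses `difflib` character overlap purely as a stand-in for the semantic model used by the real functions.

```python
from difflib import SequenceMatcher


def char_similarity(a, b):
    """Stand-in scorer: character overlap ratio in [0, 1], not a semantic model."""
    return SequenceMatcher(None, a, b).ratio()


def resolve_name(query, candidates, semantic=True, scorer=char_similarity):
    """Return the literal match if present, else the best-scoring candidate."""
    query = query.strip().lower()
    if query in candidates:
        return query          # literal match, no scoring needed
    if not semantic:
        return None           # literal matching only: nothing found
    # Fallback: pick the candidate with the highest similarity score
    return max(candidates, key=lambda c: scorer(query, c))


industries = ["gaming", "retail", "banking financial service and insurance",
              "telecommunication", "healthcare", "transportation", "supply chain"]

print(resolve_name("retail", industries))
print(resolve_name("telco", industries))
print(resolve_name("telco", industries, semantic=False))
```

A character-based scorer can recover abbreviations such as `telco`, but it has no way to relate `logistics` to `supply chain`. That kind of match needs a scorer based on meaning, which is why `scorer` is left pluggable here.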

### list_industry_by_usecase(usecase, semantic=True)
Argument: 
- usecase(string): Input usecase from user
- semantic(boolean): Whether the input needs to go through semantic matching or not. Default is `True`

This function lists all the Industries that are supported in the Feature Recommender package for the input Use case.

From the output of `list_all_usecase()`, we can see the list of supported use cases. Let's first try with `fraud detection`.

In [8]:
list_industry_by_usecase('fraud detection')

Unnamed: 0,Industry
0,healthcare
1,banking financial service and insurance
2,telecommunication


Next, we will try `demographic inference`, a use case that is not available in Feature Recommender. This time, let's first set `semantic` to `False`.

In [9]:
list_industry_by_usecase('demographic inference', semantic=False)

Unnamed: 0,Industry


Without semantic matching, `demographic inference` does not literally match any supported use case, so the function returns no result. Now we set `semantic` back to `True`.

In [10]:
list_industry_by_usecase('demographic inference', semantic=True)

Input Usecase not available. Showing the most semantically relevant Usecase result:  age and gender prediction


Unnamed: 0,Industry
0,transportation
1,retail
2,banking financial service and insurance
3,telecommunication


With semantic matching, `demographic inference` is matched to `age and gender prediction`, a very close pair as well.

### list_feature_by_industry(industry, num_of_feat=100, semantic=True)
Argument: 
- industry(string): Input industry from user
- num_of_feat(int): Number of features displayed. Default is `100`
- semantic(boolean): Whether the input needs to go through semantic matching or not. Default is `True`

This function lists all the Features that are available in the Feature Recommender package for the input Industry.

The output is returned in the form of a DataFrame. Columns are:

- Feature Name: Name of the suggested Feature
- Feature Description: Description of the suggested Feature
- Industry: Industry name of the suggested Feature
- Usecase: Usecase name of the suggested Feature
- Source: Source of the suggested Feature

The list of features is sorted by the Usecases' Relevance to the Input Industry.

Let's try the function with the `banking financial service and insurance` industry, with `num_of_feat` set to `10`, and see its output.

In [11]:
list_feature_by_industry('banking financial service and insurance', num_of_feat=10)

Unnamed: 0,Feature Name,Feature Description,Industry,Usecase,Source
0,payable turnover - days,Sales / (Accounts Payable/365),banking financial service and insurance,credit risk modeling,https://www.researchgate.net/publication/31837...
1,monthly_household_cost,monthly cost of household,banking financial service and insurance,credit risk modeling,https://www.researchgate.net/publication/28587...
2,roa,return on assets,banking financial service and insurance,credit risk modeling,https://www.researchgate.net/publication/31837...
3,roe,return on equity,banking financial service and insurance,credit risk modeling,https://www.researchgate.net/publication/31837...
4,credit history,no credits taken/all credits paid back duly/ex...,banking financial service and insurance,credit risk modeling,https://www.researchgate.net/publication/31837...
5,no. annual payments,number of payments per year,banking financial service and insurance,credit risk modeling,https://www.researchgate.net/publication/28587...
6,loans_defaulted,loans defaulted or delinquent,banking financial service and insurance,credit risk modeling,https://www.researchgate.net/publication/28587...
7,avg_CA_balance,average balance in current account,banking financial service and insurance,credit risk modeling,https://www.researchgate.net/publication/28587...
8,employment_type,type of employment,banking financial service and insurance,credit risk modeling,https://www.researchgate.net/publication/28587...
9,monetary values of each product/ service used ...,Monetary values: total assets values of the sp...,banking financial service and insurance,credit risk modeling,https://www.researchgate.net/publication/28587...


The output contains 10 features for the input Industry `banking financial service and insurance`, all from the Use case `credit risk modeling`, which is highly relevant to that Industry.

The name `banking financial service and insurance` is long, and users may mistype it or forget the exact full name. This time we will try with just `finance` and see the results.

In [12]:
list_feature_by_industry('finance', num_of_feat=10)

Input Industry not available. Showing the most semantically relevant Usecase result:  banking financial service and insurance


Unnamed: 0,Feature Name,Feature Description,Industry,Usecase,Source
0,payable turnover - days,Sales / (Accounts Payable/365),banking financial service and insurance,credit risk modeling,https://www.researchgate.net/publication/31837...
1,monthly_household_cost,monthly cost of household,banking financial service and insurance,credit risk modeling,https://www.researchgate.net/publication/28587...
2,roa,return on assets,banking financial service and insurance,credit risk modeling,https://www.researchgate.net/publication/31837...
3,roe,return on equity,banking financial service and insurance,credit risk modeling,https://www.researchgate.net/publication/31837...
4,credit history,no credits taken/all credits paid back duly/ex...,banking financial service and insurance,credit risk modeling,https://www.researchgate.net/publication/31837...
5,no. annual payments,number of payments per year,banking financial service and insurance,credit risk modeling,https://www.researchgate.net/publication/28587...
6,loans_defaulted,loans defaulted or delinquent,banking financial service and insurance,credit risk modeling,https://www.researchgate.net/publication/28587...
7,avg_CA_balance,average balance in current account,banking financial service and insurance,credit risk modeling,https://www.researchgate.net/publication/28587...
8,employment_type,type of employment,banking financial service and insurance,credit risk modeling,https://www.researchgate.net/publication/28587...
9,monetary values of each product/ service used ...,Monetary values: total assets values of the sp...,banking financial service and insurance,credit risk modeling,https://www.researchgate.net/publication/28587...


With semantic matching, `finance` is resolved to `banking financial service and insurance`, and the returned features are the same as in the previous call.

### list_feature_by_usecase(usecase, num_of_feat=100, semantic=True)
Argument: 
- usecase(string): Input usecase from user
- num_of_feat(int): Number of features displayed. Default is `100`
- semantic(boolean): Whether the input needs to go through semantic matching or not. Default is `True`

This function lists all the Features that are available in the Feature Recommender package for the input Usecase.

The output is returned in the form of a DataFrame. Columns are:

- Feature Name: Name of the suggested Feature
- Feature Description: Description of the suggested Feature
- Industry: Industry name of the suggested Feature
- Usecase: Usecase name of the suggested Feature
- Source: Source of the suggested Feature

The list of features is sorted by the Industries' Relevance to the Input Usecase.

Let's try with `brand affinity and propensity`, which is an available use case.

In [13]:
list_feature_by_usecase('brand affinity and propensity')

Unnamed: 0,Feature Name,Feature Description,Industry,Usecase,Source
0,Number days the consumer observed in the brand...,Number days the consumer observed in the brand...,retail,brand affinity and propensity,https://ieeexplore.ieee.org/document/8622225
1,Total Active days of the consumer,the total number of days in the lookback horiz...,retail,brand affinity and propensity,https://ieeexplore.ieee.org/document/8622225
2,Interim Brand Affinity (Q),Number days the consumer observed in the brand...,retail,brand affinity and propensity,https://ieeexplore.ieee.org/document/8622225
3,Total number of\nconsumers seen in all\nof bra...,Total number of\nconsumers seen in all\nof bra...,retail,brand affinity and propensity,https://ieeexplore.ieee.org/document/8622225
4,Total Number of Consumers Observed,total number of consumers observed at the geog...,retail,brand affinity and propensity,https://ieeexplore.ieee.org/document/8622225
5,Consumer Share,Consumer Share,retail,brand affinity and propensity,https://ieeexplore.ieee.org/document/8622225
6,Discounting Factor,total consumers observed at brand / total cons...,retail,brand affinity and propensity,https://ieeexplore.ieee.org/document/8622225


This time we rely on semantic matching and pass `customer preference`, which is not an available use case.

In [14]:
list_feature_by_usecase('customer preference')

Input Usecase not available. Showing the most semantically relevant Usecase result:  high value customer acquisition


Unnamed: 0,Feature Name,Feature Description,Industry,Usecase,Source
0,gender,Gender of the client,transportation,high value customer acquisition,https://ieeexplore.ieee.org/document/9006106
1,age,age of customer,transportation,high value customer acquisition,https://ieeexplore.ieee.org/document/9006106
2,radius of gyration,radius of the smallest circle that contains al...,transportation,high value customer acquisition,https://ieeexplore.ieee.org/document/9006106
3,App Usage,No. of distinct apps captured by signals from ...,transportation,high value customer acquisition,https://ieeexplore.ieee.org/document/9006106
4,No. of Active Days,No. of days with at least one signal from\nthe...,transportation,high value customer acquisition,https://ieeexplore.ieee.org/document/9006106
5,Day Engagement Pattern,No. of Active Days/Device Age,transportation,high value customer acquisition,https://ieeexplore.ieee.org/document/9006106
6,Internet Connectivity,WIFI or Cellular Data Usage for the device,transportation,high value customer acquisition,https://ieeexplore.ieee.org/document/9006106
7,Mobility of a Device,No. of different geographical areas captured i...,transportation,high value customer acquisition,https://ieeexplore.ieee.org/document/9006106
8,Device's Category Engagement,Signal Distribution across Android or iOS cate...,transportation,high value customer acquisition,https://ieeexplore.ieee.org/document/9006106
9,Affluency Index,"High, Medium or Low",transportation,high value customer acquisition,https://ieeexplore.ieee.org/document/9006106


### list_feature_by_pair(industry, usecase, num_of_feat=100, semantic=True)
Argument: 
- industry(string): Input industry from user
- usecase(string): Input usecase from user
- num_of_feat(int): Number of features displayed. Default is `100`
- semantic(boolean): Whether the input needs to go through semantic matching or not. Default is `True`

This function lists all the Features that are available in the Feature Recommender package for the input Industry/Usecase pair.

The output is returned in the form of a DataFrame. Columns are:

- Feature Name: Name of the suggested Feature
- Feature Description: Description of the suggested Feature
- Industry: Industry name of the suggested Feature
- Usecase: Usecase name of the suggested Feature
- Source: Source of the suggested Feature
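
The Sankey helper earlier in this notebook splits Industry and Usecase cells on `', '`, which suggests that a single cell can hold several values. Under that assumption, filtering a corpus by an Industry/Usecase pair means testing membership in the split cell rather than comparing whole strings. The sketch below illustrates the idea on a tiny hand-written corpus. The helper name `filter_by_pair` is hypothetical, and the package may implement the lookup differently.

```python
def filter_by_pair(corpus, industry, usecase, num_of_feat=100):
    """Keep rows whose Industry and Usecase cells both contain the requested values."""
    matches = [
        row for row in corpus
        if industry in row["Industry"].split(", ")
        and usecase in row["Usecase"].split(", ")
    ]
    # num_of_feat caps how many rows are returned
    return matches[:num_of_feat]


# Tiny illustrative corpus (not the package's data file)
corpus = [
    {"Feature Name": "contract", "Industry": "telecommunication",
     "Usecase": "customer churn prediction, customer segmentation"},
    {"Feature Name": "voice mail plan", "Industry": "telecommunication",
     "Usecase": "customer churn prediction"},
    {"Feature Name": "recency value", "Industry": "retail",
     "Usecase": "customer segmentation"},
]

for row in filter_by_pair(corpus, "telecommunication", "customer segmentation"):
    print(row["Feature Name"])
```

Comparing whole strings instead would miss `contract` here, because its Usecase cell lists two use cases.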

Let's try with `healthcare` and `length of stay prediction`, with `num_of_feat` set to `15`.

In [15]:
list_feature_by_pair('healthcare', 'length of stay prediction', num_of_feat = 15)

Unnamed: 0,Feature Name,Feature Description,Industry,Usecase,Source
0,marital_status,"is the client married (Yes, No)",healthcare,length of stay prediction,https://pdfs.semanticscholar.org/51aa/cfcf3f3c...
1,gender,Gender of the client,healthcare,length of stay prediction,https://pdfs.semanticscholar.org/51aa/cfcf3f3c...
2,age,age of customer,healthcare,length of stay prediction,https://pdfs.semanticscholar.org/51aa/cfcf3f3c...
3,Social history,"Smoking, Alcohol, Living situation, Employment",healthcare,length of stay prediction,https://pdfs.semanticscholar.org/51aa/cfcf3f3c...
4,Diastolic blood pressure,Diastolic blood pressure (mm Hg),healthcare,length of stay prediction,https://pdfs.semanticscholar.org/51aa/cfcf3f3c...
5,Surgery type,Surgery type,healthcare,length of stay prediction,https://pdfs.semanticscholar.org/51aa/cfcf3f3c...
6,Average number of visits per day,Average number of visits per day,healthcare,length of stay prediction,https://pdfs.semanticscholar.org/51aa/cfcf3f3c...
7,Number of consultations,Number of medical consultations,healthcare,length of stay prediction,https://pdfs.semanticscholar.org/51aa/cfcf3f3c...
8,Number of surgeries,Number of surgeries,healthcare,length of stay prediction,https://pdfs.semanticscholar.org/51aa/cfcf3f3c...
9,Interval between discharge order and discharge,Interval between discharge order and discharge,healthcare,length of stay prediction,https://pdfs.semanticscholar.org/51aa/cfcf3f3c...


For semantic matching, we try `sale` and `sort customer`, neither of which is an available name.

In [16]:
list_feature_by_pair('sale', 'sort customer')

Input Industry not available. Showing the most semantically relevant Usecase result:  retail
Input Usecase not available. Showing the most semantically relevant Usecase result:  customer segmentation


Unnamed: 0,Feature Name,Feature Description,Industry,Usecase,Source
0,recency value,The last time the customer has made a transaction,retail,customer segmentation,https://link.springer.com/content/pdf/10.1057/...
1,buyer,Corresponding to each distinct postcode,retail,customer segmentation,https://link.springer.com/content/pdf/10.1057/...
2,first_purchase,Time in month since the first purchase in 2011,retail,customer segmentation,https://link.springer.com/content/pdf/10.1057/...
3,frequency_per_postcode,Frequency of purchase per postcode,retail,customer segmentation,https://link.springer.com/content/pdf/10.1057/...
4,monetary_per_postcode,Total amount spent per postcode,retail,customer segmentation,https://link.springer.com/content/pdf/10.1057/...
5,postcode_minimum_spend,Minimum spending per postcode,retail,customer segmentation,https://link.springer.com/content/pdf/10.1057/...
6,postcode_maximum_spend,Maximum spending per postcode,retail,customer segmentation,https://link.springer.com/content/pdf/10.1057/...
7,postcode_median_spend,Median spending per postcode,retail,customer segmentation,https://link.springer.com/content/pdf/10.1057/...


## Feature Recommendation

Feature Recommendation is the second part of Feature Recommender. It recommends features to users based on their input attributes, and provides a comprehensive method for mapping their own input attributes to either the available features or their own feature corpus.

### Example User Input Attribute Dictionary

To test the functions in the Feature Recommendation part, a User Input Attribute Dictionary is required. In this notebook, we will use 2 example User Input Attribute Dictionaries.

The first Attribute Dictionary contains churn attributes for Telecoms.

In [17]:
import pandas as pd

df_attr_1 = pd.read_csv(
    'https://raw.githubusercontent.com/anovos/anovos/feature_recommender_beta/data/feature_recommender/test_input_fr.csv'
)
df_attr_1

Unnamed: 0,Attribute Name,Attribute Description
0,churn,"1 if customer cancelled service, 0 if not"
1,AccountWeeks,number of weeks customer has had active account
2,ContractRenewal,"1 if customer recently renewed contract, 0 if not"
3,DataPlan,"1 if customer has data plan, 0 if not"
4,DataUsage,gigabytes of monthly data usage
5,CustServCalls,number of calls into customer service
6,DayMins,average daytime minutes per month
7,DayCalls,average number of daytime calls
8,MonthlyCharge,average monthly bill
9,OverageFee,largest overage fee in last 12 months


The second Attribute Dictionary contains mobility pricing attributes for Transportation services.

In [18]:
import pandas as pd

df_attr_2 = pd.read_csv(
    'https://raw.githubusercontent.com/anovos/anovos/feature_recommender_beta/data/feature_recommender/test_input_fr_2.csv'
)
df_attr_2

Unnamed: 0,Name,Desc,Industry,Usecase
0,key,a unique identifier for each trip,Transportation,Ridepooling Pricing
1,fare_amount,the cost of each trip in usd,Transportation,Ridepooling Pricing
2,pickup_datetime,date and time where the meter was engaged,Transportation,Ridepooling Pricing
3,passenger_count,the number of passengers in the vehicle (drive...,Transportation,Ridepooling Pricing
4,pickup_longitude,the longitude where the meter was engaged,Transportation,Ridepooling Pricing
5,pickup_latitude,the latitude where the meter was engaged,Transportation,Ridepooling Pricing
6,dropoff_longitude,the longitude where the meter was disengaged,Transportation,Ridepooling Pricing
7,dropoff_latitude,the latitude where the meter was disengaged,Transportation,Ridepooling Pricing


### feature_recommendation(df, name_column=None, desc_column=None, suggested_industry='all', suggested_usecase='all', semantic=True, top_n=2, threshold=0.3)
Argument: 
- df(DataFrame): Input User Attribute Dictionary
- name_column(string): Name of the column that contains attribute names. Default is `None`
- desc_column(string): Name of the column that contains attribute descriptions. Default is `None`
- suggested_industry(string): Input goal industry from user. Default is `all`
- suggested_usecase(string): Input goal usecase from user. Default is `all`
- semantic(boolean): Whether the input needs to go through semantic matching or not. Default is `True`
- top_n(int): Number of most similar features displayed matched to input attributes. Default is `2`
- threshold(float): Floor limit of the similarity score to be matched. Default is `0.3`

This function recommends features to users based on their input attributes and their goal industry and/or use case.

The output is returned in the form of a DataFrame. Columns are:

- Input Attribute Name: Name of the input Attribute
- Input Attribute Description: Description of the input Attribute
- Recommended Feature Name: Name of the recommended Feature
- Recommended Feature Description: Description of the recommended Feature
- Feature Similarity Score: Semantic similarity score between input Attribute and recommended Feature
- Industry: Industry name of the recommended Feature
- Usecase: Usecase name of the recommended Feature
- Source: Source of the recommended Feature
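
To make the roles of `top_n` and `threshold` concrete, the sketch below applies both to a dictionary of similarity scores for one input attribute. It is a minimal illustration, not the package's code: the helper name `pick_top_matches` is hypothetical, the scores are illustrative, and treating the threshold as inclusive (`>=`) is an assumption.

```python
def pick_top_matches(scores, top_n=2, threshold=0.3):
    """Return up to top_n (feature, score) pairs with score >= threshold, best first."""
    # Assumption: a score equal to the threshold still counts as a match
    kept = [(name, score) for name, score in scores.items() if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_n]


# Illustrative similarity scores for a single attribute
scores_for_dataplan = {
    "Internet Connectivity": 0.5072,
    "subscriber's plan": 0.4921,
    "voice mail plan": 0.441,
    "unit price": 0.12,  # made-up low score, filtered out by the threshold
}

print(pick_top_matches(scores_for_dataplan))
print(pick_top_matches(scores_for_dataplan, top_n=5))
print(pick_top_matches(scores_for_dataplan, threshold=0.5))
```

Raising `threshold` can leave an attribute with fewer than `top_n` matches, or with none at all.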

First, we will try the first User Attribute Dictionary example, with `name_column` set to `Attribute Name` and `desc_column` set to `Attribute Description`. All other arguments keep their default values.

In [19]:
feature_recommendation(df_attr_1, name_column='Attribute Name', desc_column='Attribute Description')

Unnamed: 0,Input Attribute Name,Input Attribute Description,Recommended Feature Name,Recommended Feature Description,Feature Similarity Score,Industry,Usecase,Source
0,churn,"1 if customer cancelled service, 0 if not",churn probability,the percentage of users that discontinue subsc...,0.7258,banking financial service and insurance,customer lifetime value prediction,https://www.researchgate.net/publication/26168...
1,churn,"1 if customer cancelled service, 0 if not",Status,Either 1 for churn or 0 for non-churn,0.6963,healthcare,patient churn,http://www.ieomsociety.org/singapore2021/paper...
2,AccountWeeks,number of weeks customer has had active account,years of credit history,The years since the first entry in the custome...,0.5499,banking financial service and insurance,credit risk modeling,https://www.kaggle.com/c/credit-risk-modeling-...
3,AccountWeeks,number of weeks customer has had active account,length of customer association,Number of years the customer is associated wit...,0.5445,banking financial service and insurance,customer churn prediction,https://www.researchgate.net/publication/30638...
4,ContractRenewal,"1 if customer recently renewed contract, 0 if not",contract,"type of customer contract (Month to month, on...",0.5738,telecommunication,"customer churn prediction, customer segmentation",https://link.springer.com/article/10.1057/dbm....
5,ContractRenewal,"1 if customer recently renewed contract, 0 if not",new customer,New customer flag. 1 if the customer registere...,0.5197,banking financial service and insurance,recommender system,https://github.com/anshuljdhingra/Bank-Recomme...
6,DataPlan,"1 if customer has data plan, 0 if not",Internet Connectivity,WIFI or Cellular Data Usage for the device,0.5072,"banking financial service and insurance, trans...","age and gender prediction, high value customer...","https://ieeexplore.ieee.org/document/9006106, ..."
7,DataPlan,"1 if customer has data plan, 0 if not",subscriber's plan,Subscriber plan,0.4921,telecommunication,sim swap detection,https://logrhythm.com/telecommunication-use-ca...
8,DataUsage,gigabytes of monthly data usage,Cellular Data Usage,Cellular data is more expensive than WIFI data_,0.5012,banking financial service and insurance,credit scoring using mobile engagement data,https://www.mobilewalla.com/financial-services
9,DataUsage,gigabytes of monthly data usage,basic phone use,"number of calls, number of texts",0.501,telecommunication,personality prediction,https://web.media.mit.edu/~yva/papers/deMontjo...


We can see that every attribute is matched to the 2 features with which it has the highest similarity score. For example, the `AccountWeeks` attribute is matched to the `years of credit history` and `length of customer association` features.
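
For downstream use it can be handy to collapse this row-per-match table into one entry per input attribute. The sketch below does that with plain dictionaries, using the first four rows of the output above. The helper name `group_matches` is hypothetical. With a real result DataFrame, records of this shape could be obtained with pandas' `to_dict('records')`.

```python
from collections import defaultdict


def group_matches(records):
    """Map each input attribute to its list of (recommended feature, score) pairs."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["Input Attribute Name"]].append(
            (rec["Recommended Feature Name"], rec["Feature Similarity Score"])
        )
    return dict(grouped)


# First four rows of the output above, reduced to the three relevant columns
records = [
    {"Input Attribute Name": "churn", "Recommended Feature Name": "churn probability",
     "Feature Similarity Score": 0.7258},
    {"Input Attribute Name": "churn", "Recommended Feature Name": "Status",
     "Feature Similarity Score": 0.6963},
    {"Input Attribute Name": "AccountWeeks", "Recommended Feature Name": "years of credit history",
     "Feature Similarity Score": 0.5499},
    {"Input Attribute Name": "AccountWeeks", "Recommended Feature Name": "length of customer association",
     "Feature Similarity Score": 0.5445},
]

for attribute, matches in group_matches(records).items():
    print(attribute, matches)
```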

Next, suppose that, with these input attributes, users want to explore only features related to the `telecoms` industry and the `churn prediction` use case.

We will set `suggested_industry` to `telecoms` and `suggested_usecase` to `churn prediction` and run the function again.

In [20]:
feature_recommendation(df_attr_1,
                       name_column='Attribute Name',
                       desc_column='Attribute Description',
                       suggested_industry='telecoms',
                       suggested_usecase='churn prediction')

Input Industry not available. Showing the most semantically relevant Usecase result:  telecommunication
Input Usecase not available. Showing the most semantically relevant Usecase result:  customer churn prediction


Unnamed: 0,Input Attribute Name,Input Attribute Description,Recommended Feature Name,Recommended Feature Description,Feature Similarity Score,Industry,Usecase,Source
0,churn,"1 if customer cancelled service, 0 if not",retired flag,"is the client retired (1, 0)",0.4096,"telecommunication, healthcare, banking financi...","customer churn prediction, fraud detection, st...",https://www.kaggle.com/mathchi/churn-for-bank-...
1,churn,"1 if customer cancelled service, 0 if not",number customer service calls,Number of call to customer service,0.4075,telecommunication,customer churn prediction,https://www.researchgate.net/publication/28298...
2,AccountWeeks,number of weeks customer has had active account,last rech date ma,Number of days till last recharge of main account,0.4841,telecommunication,customer churn prediction,https://www.kaggle.com/sivakrishna3311/delinqu...
3,AccountWeeks,number of weeks customer has had active account,last rech date da,Number of days till last recharge of data account,0.4473,telecommunication,customer churn prediction,https://www.kaggle.com/sivakrishna3311/delinqu...
4,ContractRenewal,"1 if customer recently renewed contract, 0 if not",contract,"type of customer contract (Month to month, on...",0.5738,telecommunication,"customer churn prediction, customer segmentation",https://link.springer.com/article/10.1057/dbm....
5,ContractRenewal,"1 if customer recently renewed contract, 0 if not",tenure,Tenure of credit card service for user,0.3821,"telecommunication, banking financial service a...","customer churn prediction, customer segmentation","https://www.kaggle.com/arjunbhasin2013/ccdata,..."
6,DataPlan,"1 if customer has data plan, 0 if not",contract,"type of customer contract (Month to month, on...",0.4543,telecommunication,"customer churn prediction, customer segmentation",https://link.springer.com/article/10.1057/dbm....
7,DataPlan,"1 if customer has data plan, 0 if not",voice mail plan,Voice mail usage,0.441,telecommunication,customer churn prediction,https://www.researchgate.net/publication/28298...
8,DataUsage,gigabytes of monthly data usage,monthlycharges,current monthly payment,0.4644,telecommunication,customer churn prediction,https://www.kaggle.com/radmirzosimov/telecom-u...
9,DataUsage,gigabytes of monthly data usage,last rech date da,Number of days till last recharge of data account,0.4275,telecommunication,customer churn prediction,https://www.kaggle.com/sivakrishna3311/delinqu...


Semantic matching for Industry and Use case still works here. Input attributes are now mapped only to relevant features in the selected Industry and Use case.

If our attribute dictionary has only one column (either attribute names or attribute descriptions), we can pass just that column through `name_column` or `desc_column` and leave the other at its default value, `None`.

Let's try this with the Second User Attribute Dictionary, assigning only `desc_column`.

In [21]:
feature_recommendation(df_attr_2, desc_column='Desc')

Unnamed: 0,Input Attribute Description,Recommended Feature Name,Recommended Feature Description,Feature Similarity Score,Industry,Usecase,Source
0,a unique identifier for each trip,Total Visits,Total number of times a member made and attend...,0.4541,healthcare,patient churn,http://www.ieomsociety.org/singapore2021/paper...
1,a unique identifier for each trip,Insured Id,Unique ID given to insured,0.4504,healthcare,fraud detection,https://arxiv.org/pdf/2102.10978
2,the cost of each trip in usd,unit price,Price of each product in dollar,0.4767,retail,sales prediction,https://www.kaggle.com/aungpyaeap/supermarket-...
3,the cost of each trip in usd,total,Total price including tax,0.4602,retail,sales prediction,https://www.kaggle.com/aungpyaeap/supermarket-...
4,date and time where the meter was engaged,Device Age,No. of days between first observation and last...,0.5013,"transportation, retail, telecommunication, ban...",age and gender prediction,https://ieeexplore.ieee.org/document/8621942
5,date and time where the meter was engaged,call date time,Date and time of the call,0.4483,telecommunication,fraud detection,https://www.sciencedirect.com/science/article/...
6,the number of passengers in the vehicle (drive...,Total Visits,Total number of times a member made and attend...,0.4269,healthcare,patient churn,http://www.ieomsociety.org/singapore2021/paper...
7,the number of passengers in the vehicle (drive...,total number of employees,the number of employees engaged at the place o...,0.4177,banking financial service and insurance,"credit risk modeling, recommender system, cust...",https://www.researchgate.net/publication/28587...
8,the longitude where the meter was engaged,geomagnetic field,Magnetometer measures the ambient geomagnetic ...,0.5041,telecommunication,transportation model detection,https://infoscience.epfl.ch/record/229181
9,the longitude where the meter was engaged,lat,Relative latitude of the given base station,0.4494,telecommunication,network traffic analysis and prediction,https://www.researchgate.net/publication/34218...


### find_attr_by_relevance(df, building_corpus, name_column=None, desc_column=None, threshold=0.3)
Argument: 
- df(DataFrame): Input User Attribute Dictionary
- building_corpus(list): Input User Feature Corpus
- name_column(string): Name of the column that contains attribute names. Default is `None`
- desc_column(string): Name of the column that contains attribute descriptions. Default is `None`
- threshold(float): Floor limit of the similarity score to be matched. Default is `0.3`

This function maps users' input attributes to their own feature corpus, which helps with feature creation in cold-start problems.

The output is returned in the form of a DataFrame. Columns are:

- Input Feature Desc: Description of the input feature, taken from the feature corpus
- Recommended Input Attribute Name: Name of the recommended input attribute
- Recommended Input Attribute Description: Description of the recommended input attribute
- Input Attribute Similarity Score: Semantic similarity score between the input feature and the recommended input attribute

Let's start with the First User Attribute Dictionary example, with `name_column` set to `Attribute Name` and `desc_column` set to `Attribute Description`. For `building_corpus`, we will use `['number of customers using products', 'number of call customer make daily']`, and we will keep the default threshold value of `0.3` for now.

In [22]:
test_building_corpus = ['number of customers using products', 'number of call customer make daily']

In [23]:
find_attr_by_relevance(df_attr_1,
                       building_corpus=test_building_corpus,
                       name_column='Attribute Name',
                       desc_column='Attribute Description')

Unnamed: 0,Input Feature Desc,Recommended Input Attribute Name,Recommended Input Attribute Description,Input Attribute Similarity Score
0,number of customers using products,CustServCalls,number of calls into customer service,0.4228
1,number of customers using products,AccountWeeks,number of weeks customer has had active account,0.3764
2,number of customers using products,churn,"1 if customer cancelled service, 0 if not",0.3409
3,number of customers using products,DayCalls,average number of daytime calls,0.3245
4,number of customers using products,DataUsage,gigabytes of monthly data usage,0.3198
5,number of customers using products,DataPlan,"1 if customer has data plan, 0 if not",0.3089
6,number of call customer make daily,DayCalls,average number of daytime calls,0.7598
7,number of call customer make daily,CustServCalls,number of calls into customer service,0.6555
8,number of call customer make daily,DayMins,average daytime minutes per month,0.4595
9,number of call customer make daily,churn,"1 if customer cancelled service, 0 if not",0.4411


The function matches each entry in the input feature corpus against the user's attribute dictionary and keeps only the pairs whose similarity score is higher than the input threshold.
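The thresholded matching described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the ANOVOS implementation: the helper names `tokenize`, `cosine_similarity` and `match_corpus` are hypothetical, and the score is a plain word-count cosine similarity standing in for whatever semantic model the library actually uses.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts.

    Stand-in for a semantic similarity model (assumption, not the library's).
    """
    counts_a, counts_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(counts_a[tok] * counts_b[tok] for tok in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_corpus(corpus, attributes, threshold=0.3):
    """Pair each corpus entry with the attributes scoring above the threshold.

    `attributes` maps attribute name -> attribute description. Rows are
    grouped by corpus entry and sorted by descending score within each group.
    """
    rows = []
    for feature_desc in corpus:
        scored = [
            (feature_desc, name, desc, round(cosine_similarity(feature_desc, desc), 4))
            for name, desc in attributes.items()
        ]
        kept = [row for row in scored if row[3] > threshold]
        rows.extend(sorted(kept, key=lambda row: row[3], reverse=True))
    return rows


attributes = {
    "DayCalls": "average number of daytime calls",
    "CustServCalls": "number of calls into customer service",
    "MonthlyCharge": "average monthly bill",
}
for row in match_corpus(["number of call customer make daily"], attributes):
    print(row)
```

With this stand-in score, `CustServCalls` ranks above `DayCalls`, the reverse of the library output shown earlier, and `MonthlyCharge` is dropped because it shares no words with the corpus entry. The ranking therefore depends heavily on the similarity model, while the threshold-and-sort logic stays the same.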

Users can tailor this tool to their own needs when creating features: adjust the input feature corpus to better match their existing attributes, and/or lower the threshold to make the matching less strict.

Next, we will use the Second User Attribute Dictionary example, with `['usual location of the customer', 'how much customer spend for each trip on average']` as the input feature corpus and the threshold set to `0.25`.

In [24]:
test_building_corpus_2 = [
    'usual location of the customer',
    'how much customer spend for each trip on average'
]

In [25]:
df_attr_2

Unnamed: 0,Name,Desc,Industry,Usecase
0,key,a unique identifier for each trip,Transportation,Ridepooling Pricing
1,fare_amount,the cost of each trip in usd,Transportation,Ridepooling Pricing
2,pickup_datetime,date and time where the meter was engaged,Transportation,Ridepooling Pricing
3,passenger_count,the number of passengers in the vehicle (drive...,Transportation,Ridepooling Pricing
4,pickup_longitude,the longitude where the meter was engaged,Transportation,Ridepooling Pricing
5,pickup_latitude,the latitude where the meter was engaged,Transportation,Ridepooling Pricing
6,dropoff_longitude,the longitude where the meter was disengaged,Transportation,Ridepooling Pricing
7,dropoff_latitude,the latitude where the meter was disengaged,Transportation,Ridepooling Pricing


In [26]:
find_attr_by_relevance(df_attr_2,
                       building_corpus=test_building_corpus_2,
                       name_column='Name',
                       desc_column='Desc',
                       threshold=0.25)

Unnamed: 0,Input Feature Desc,Recommended Input Attribute Name,Recommended Input Attribute Description,Input Attribute Similarity Score
0,usual location of the customer,pickup latitude,the latitude where the meter was engaged,0.3327
1,usual location of the customer,passenger count,the number of passengers in the vehicle (drive...,0.2881
2,usual location of the customer,pickup longitude,the longitude where the meter was engaged,0.284
3,usual location of the customer,dropoff latitude,the latitude where the meter was disengaged,0.2829
4,how much customer spend for each trip on average,fare amount,the cost of each trip in usd,0.5199
5,how much customer spend for each trip on average,passenger count,the number of passengers in the vehicle (drive...,0.3881
6,how much customer spend for each trip on average,key,a unique identifier for each trip,0.3358


With a lower threshold, more matching attributes are returned. For each entry in the input feature corpus, they are sorted by relevance.
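To see how much the threshold changes the result, the scores from the `threshold=0.25` output above can be re-filtered by hand. The sketch below assumes the threshold acts as a simple floor on the similarity score, as the argument description suggests. `filter_by_threshold` is a hypothetical helper, not part of the package.

```python
# (input feature description, recommended attribute name, similarity score),
# copied from the threshold=0.25 output above.
rows = [
    ("usual location of the customer", "pickup latitude", 0.3327),
    ("usual location of the customer", "passenger count", 0.2881),
    ("usual location of the customer", "pickup longitude", 0.284),
    ("usual location of the customer", "dropoff latitude", 0.2829),
    ("how much customer spend for each trip on average", "fare amount", 0.5199),
    ("how much customer spend for each trip on average", "passenger count", 0.3881),
    ("how much customer spend for each trip on average", "key", 0.3358),
]


def filter_by_threshold(rows, threshold):
    """Keep only the rows whose score is strictly above the threshold."""
    return [row for row in rows if row[2] > threshold]


for threshold in (0.25, 0.3, 0.4):
    kept = filter_by_threshold(rows, threshold)
    print(threshold, len(kept), [name for _, name, _ in kept])
```

On these rows, a floor of `0.3` keeps four of the seven matches, and `0.4` keeps only `fare amount`.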

### Sankey Matching Visualization

To better understand the matching done in the Feature Recommendation part, this section uses Sankey plots to visualize all the matches produced above.

#### sankey_visualization(df, industry_included=False, usecase_included=False)
Argument: 
- df(DataFrame): Input DataFrame. It must be the output of `feature_recommendation` or `find_attr_by_relevance`, or follow the same format.
- industry_included(boolean): Whether the plot needs to include industry mapping or not. Default is `False`
- usecase_included(boolean): Whether the plot needs to include usecase mapping or not. Default is `False`

This function visualizes the output of the Feature Recommendation functions as Sankey plots.
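A Sankey plot is defined by a list of node labels plus links given as source index, target index and value. As a rough sketch of how a matching table could be turned into that structure, the hypothetical helper `build_sankey_data` below converts (source, target, score) triples into those lists. This is an assumption about the general technique, not the code inside `sankey_visualization`.

```python
def build_sankey_data(pairs):
    """Turn (source_label, target_label, score) triples into Sankey inputs.

    Returns (labels, sources, targets, values), where sources and targets
    are integer indices into labels.
    """
    labels = []
    index = {}

    def node_id(side, label):
        # Key by side so a label that appears on both sides gets two nodes.
        key = (side, label)
        if key not in index:
            index[key] = len(labels)
            labels.append(label)
        return index[key]

    sources, targets, values = [], [], []
    for source_label, target_label, score in pairs:
        sources.append(node_id("source", source_label))
        targets.append(node_id("target", target_label))
        values.append(score)
    return labels, sources, targets, values


# Three matches taken from the find_attr_by_relevance output shown earlier.
pairs = [
    ("number of call customer make daily", "DayCalls", 0.7598),
    ("number of call customer make daily", "CustServCalls", 0.6555),
    ("number of customers using products", "CustServCalls", 0.4228),
]
labels, sources, targets, values = build_sankey_data(pairs)
print(labels)
print(sources, targets, values)
```

If Plotly is the plotting backend (a guess based on the `fig.show()` calls in this notebook), these lists are the shape that `plotly.graph_objects.Sankey` expects through `node=dict(label=labels)` and `link=dict(source=sources, target=targets, value=values)`.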

#### feature_recommendation( ) Sankey Plots

In [35]:
df_attr_1

Unnamed: 0,Attribute Name,Attribute Description
0,churn,"1 if customer cancelled service, 0 if not"
1,AccountWeeks,number of weeks customer has had active account
2,ContractRenewal,"1 if customer recently renewed contract, 0 if not"
3,DataPlan,"1 if customer has data plan, 0 if not"
4,DataUsage,gigabytes of monthly data usage
5,CustServCalls,number of calls into customer service
6,DayMins,average daytime minutes per month
7,DayCalls,average number of daytime calls
8,MonthlyCharge,average monthly bill
9,OverageFee,largest overage fee in last 12 months


In [36]:
df = feature_recommendation(df_attr_1,
                            name_column='Attribute Name',
                            desc_column='Attribute Description')
fig = sankey_visualization(df)
fig.show()

In [28]:
df_2 = feature_recommendation(df_attr_1,
                              name_column='Attribute Name',
                              desc_column='Attribute Description',
                              suggested_industry='telecoms',
                              suggested_usecase='churn prediction')
fig_2 = sankey_visualization(df_2, industry_included=True, usecase_included=True)
fig_2.show()

Input Industry not available. Showing the most semantically relevant Usecase result:  telecommunication
Input Usecase not available. Showing the most semantically relevant Usecase result:  customer churn prediction


In [37]:
df_attr_2

Unnamed: 0,Name,Desc,Industry,Usecase
0,key,a unique identifier for each trip,Transportation,Ridepooling Pricing
1,fare_amount,the cost of each trip in usd,Transportation,Ridepooling Pricing
2,pickup_datetime,date and time where the meter was engaged,Transportation,Ridepooling Pricing
3,passenger_count,the number of passengers in the vehicle (drive...,Transportation,Ridepooling Pricing
4,pickup_longitude,the longitude where the meter was engaged,Transportation,Ridepooling Pricing
5,pickup_latitude,the latitude where the meter was engaged,Transportation,Ridepooling Pricing
6,dropoff_longitude,the longitude where the meter was disengaged,Transportation,Ridepooling Pricing
7,dropoff_latitude,the latitude where the meter was disengaged,Transportation,Ridepooling Pricing


In [29]:
df_3 = feature_recommendation(df_attr_2, desc_column='Desc')
fig_3 = sankey_visualization(df_3, industry_included=True, usecase_included=True)
fig_3.show()

#### find_attr_by_relevance( ) Sankey Plots

In [40]:
df_attr_1

Unnamed: 0,Attribute Name,Attribute Description
0,churn,"1 if customer cancelled service, 0 if not"
1,AccountWeeks,number of weeks customer has had active account
2,ContractRenewal,"1 if customer recently renewed contract, 0 if not"
3,DataPlan,"1 if customer has data plan, 0 if not"
4,DataUsage,gigabytes of monthly data usage
5,CustServCalls,number of calls into customer service
6,DayMins,average daytime minutes per month
7,DayCalls,average number of daytime calls
8,MonthlyCharge,average monthly bill
9,OverageFee,largest overage fee in last 12 months


In [30]:
test_building_corpus = ['number of customers using products', 'number of call customer make daily']
df_4 = find_attr_by_relevance(df_attr_1,
                              building_corpus=test_building_corpus,
                              name_column='Attribute Name',
                              desc_column='Attribute Description')
fig_4 = sankey_visualization(df_4)
fig_4.show()

In [38]:
df_attr_2

Unnamed: 0,Name,Desc,Industry,Usecase
0,key,a unique identifier for each trip,Transportation,Ridepooling Pricing
1,fare_amount,the cost of each trip in usd,Transportation,Ridepooling Pricing
2,pickup_datetime,date and time where the meter was engaged,Transportation,Ridepooling Pricing
3,passenger_count,the number of passengers in the vehicle (drive...,Transportation,Ridepooling Pricing
4,pickup_longitude,the longitude where the meter was engaged,Transportation,Ridepooling Pricing
5,pickup_latitude,the latitude where the meter was engaged,Transportation,Ridepooling Pricing
6,dropoff_longitude,the longitude where the meter was disengaged,Transportation,Ridepooling Pricing
7,dropoff_latitude,the latitude where the meter was disengaged,Transportation,Ridepooling Pricing


In [39]:
test_building_corpus_2 = [
    'usual location of the customer',
    'how much customer spend for each trip on average'
]
df_5 = find_attr_by_relevance(df_attr_2,
                              building_corpus=test_building_corpus_2,
                              name_column='Name',
                              desc_column='Desc',
                              threshold=0.25)
fig_5 = sankey_visualization(df_5)
fig_5.show()