### Stones Test - Angelo Varella

The main goal is to create a model to predict total TPV based on the provided information. 

The selected model is a robust multiple linear regression (Huber regressor), chosen based on the data characteristics and the proposed challenge.
The test procedures and resulting outputs were kept short due to the limited timeframe. Even so, the results shown are enough to demonstrate a minimum viable prototype.
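
To illustrate what a Huber-type robust regression does, here is a minimal NumPy sketch of iteratively reweighted least squares (IRLS) with Huber weights and a MAD scale estimate. It is not the implementation in functions/model.py: the helper name `huber_irls`, the threshold `t=1.345` (a commonly used default) and the synthetic data are assumptions made for illustration only.

```python
import numpy as np

def huber_irls(X, y, t=1.345, max_iter=50, tol=1e-8):
    """Fit y ~ X by iteratively reweighted least squares with Huber weights."""
    X = np.column_stack([np.ones(len(X)), X])      # add intercept column
    beta = np.linalg.lstsq(X, y, rcond=None)[0]    # ordinary least squares start
    for _ in range(max_iter):
        resid = y - X @ beta
        # Robust scale: median absolute deviation, rescaled to be comparable
        # to a standard deviation under normal errors
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745
        if scale == 0:
            break
        u = np.abs(resid / scale)
        # Huber weights: 1 inside the threshold, t/|u| outside
        w = np.where(u <= t, 1.0, t / np.maximum(u, 1e-12))
        sw = np.sqrt(w)
        new_beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta

# Synthetic example: a clean linear relation with 5% gross outliers
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 + 0.5 * x + rng.normal(scale=0.1, size=500)
y[:25] += 20.0

X1 = np.column_stack([np.ones_like(x), x])
ols_beta = np.linalg.lstsq(X1, y, rcond=None)[0]   # pulled toward the outliers
huber_beta = huber_irls(x.reshape(-1, 1), y)       # stays close to (2.0, 0.5)
```

In this toy setting the ordinary least squares intercept is dragged upward by the contaminated points, while the Huber fit down-weights them, which is the property that motivates a robust regressor for heavy-tailed targets such as TPV.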

##### Imports

In [1]:
# Imports
import sys
import pandas as pd
import numpy as np
from dotenv import load_dotenv 

sys.path.append('../')
load_dotenv()

from functions.model import (
    create_categories,
    robust_regression_sm
)

# Configuration
pd.set_option('display.max_rows', 100)

##### Data loading and preparation

In [2]:
# Load data
data = pd.read_csv('../test_data/data.csv')
# data.info()
# data

In [3]:
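# Perform data preparation

# Log-transform the target (assumes strictly positive TPV; log(0) would yield -inf)
data['norm_tpv'] = np.log(data['stone_tpv_acquirer_total'])
# Dummy: 1 if the segment code is 0, 17 or 26, else 0
data['segmento_d'] = data['segmento'].isin([0, 17, 26]).astype(int)
# Dummy: 1 if total_estab equals 1, else 0
data['estab_d'] = (data['total_estab'] == 1).astype(int)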

# Mapping of Brazilian regions to UF codes
region_mapping = {
    'N': ['AC', 'AM', 'AP', 'PA', 'RO', 'RR', 'TO'],
    'NE': ['AL', 'BA', 'CE', 'MA', 'PB', 'PE', 'PI', 'RN', 'SE'],
    'CO': ['DF', 'GO', 'MT', 'MS'],
    'SE': ['ES', 'MG', 'RJ', 'SP'],
    'S': ['PR', 'RS', 'SC']
}

# Function to map UF to region
def get_region(uf):
    for region, uf_list in region_mapping.items():
        if uf in uf_list:
            return region
    return 'Unknown'

# Create the new column 'region' based on 'uf'
data['region'] = data['uf'].apply(get_region)

# Transform capital_social into categories (adds the 'capital_social_cat' column used below)
create_categories(data, 'capital_social')

In [4]:
# Create dataframes for testing

dependent_data = data['norm_tpv']
data_temp = data[['segmento_d', 'mei', 'estab_d']]

independent_data_temp = pd.DataFrame()
cat_list = ['region', 'capital_social_cat', 'porte', 'faixa_empregados', 'tier']

# One-hot encode each categorical column, prefixing the dummies with the column name
for item in cat_list:
    cat_df = pd.get_dummies(data[item])
    cat_df.columns = [f"{item}_{col}" for col in cat_df.columns]
    cat_df = cat_df.astype(int)
    independent_data_temp = pd.concat([independent_data_temp, cat_df], axis=1)

independent_data = pd.concat([data_temp, independent_data_temp], axis=1)
total_data = pd.concat([dependent_data, independent_data], axis=1)

In [5]:
total_data.columns

Index(['norm_tpv', 'segmento_d', 'mei', 'estab_d', 'region_CO', 'region_N',
       'region_NE', 'region_S', 'region_SE', 'capital_social_cat_0',
       'capital_social_cat_1', 'capital_social_cat_2', 'capital_social_cat_3',
       'porte_0', 'porte_1', 'porte_2', 'porte_3', 'faixa_empregados_0',
       'faixa_empregados_1', 'faixa_empregados_2', 'faixa_empregados_3',
       'faixa_empregados_4', 'tier_0', 'tier_1', 'tier_2', 'tier_3', 'tier_4'],
      dtype='object')
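
As a side note, the dummy-encoding loop can be written as a single `pd.get_dummies` call using its `columns` and `dtype` parameters, which prefixes each dummy with its source column name. The sketch below checks the equivalence on a small made-up frame; the `toy` data and variable names are illustrative assumptions, not part of the challenge data.

```python
import pandas as pd

# Toy frame standing in for `data`; the values are made up for illustration
toy = pd.DataFrame({
    'region': ['SE', 'N', 'SE', 'S'],
    'porte': [0, 1, 1, 3],
})
cat_list = ['region', 'porte']

# Loop-based encoding, mirroring the notebook cell
loop_parts = []
for item in cat_list:
    cat_df = pd.get_dummies(toy[item]).astype(int)
    cat_df.columns = [f"{item}_{col}" for col in cat_df.columns]
    loop_parts.append(cat_df)
loop_encoded = pd.concat(loop_parts, axis=1)

# Single-call version: dummies are named "<column>_<category>"
one_call = pd.get_dummies(toy[cat_list], columns=cat_list, dtype=int)

print(list(one_call.columns))
```

Both approaches yield the same columns in the same order, so the choice is mostly a matter of readability.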

##### Tests and assumptions

For testing purposes, the selected model was first fitted to each group of variables separately, producing simplified results (coefficients and significance levels) that indicate which variables carry relevant signal. The key variables were identified from this analysis.

Note that under normal circumstances additional assumption tests would be conducted. Even so, this approach remains a valid way to assess the importance of variables in the prediction model.
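
One example of an additional check that could be run is multicollinearity among the regressors, for instance through variance inflation factors (VIF). The following is a minimal NumPy sketch on synthetic data; the `vif` helper is hypothetical and is not part of functions/model.py.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (2-D array, no constant)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Regress column j on a constant plus all remaining columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        fitted = others @ np.linalg.lstsq(others, target, rcond=None)[0]
        ss_res = np.sum((target - fitted) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        out[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(1)
a = rng.normal(size=1000)
b = rng.normal(size=1000)
c = a + 0.05 * rng.normal(size=1000)   # nearly a copy of `a`
vifs = vif(np.column_stack([a, b, c]))
print(vifs)
```

Columns `a` and `c` get very large VIF values because each is almost fully explained by the other, while `b` stays close to 1. The same logic explains why one dummy per categorical group has to be left out when a constant is included.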

In [6]:
# Perform robust regression on each variable group to test and evaluate coefficients
# (one category per group is left out as the reference level)

# robust_regression_sm(total_data,'norm_tpv',['segmento_d', 'mei', 'estab_d'])
# robust_regression_sm(total_data,'norm_tpv',['region_CO','region_NE','region_S','region_SE'])
# robust_regression_sm(total_data,'norm_tpv',['capital_social_cat_0','capital_social_cat_1', 'capital_social_cat_2'])
# robust_regression_sm(total_data,'norm_tpv',['porte_0', 'porte_1', 'porte_2'])
# robust_regression_sm(total_data,'norm_tpv',['faixa_empregados_1', 'faixa_empregados_2', 'faixa_empregados_3','faixa_empregados_4'])
# robust_regression_sm(total_data,'norm_tpv',['tier_0', 'tier_1', 'tier_2', 'tier_3'])

In [7]:
# Perform robust regression on the combined variables to test and evaluate coefficients

target = 'norm_tpv'

independents = [
        'estab_d',
        'region_CO',
        'region_NE',
        'region_S',
        'region_SE',
        'capital_social_cat_0',
        'capital_social_cat_1',
        'capital_social_cat_2',
        'porte_0',
        'porte_1',
        'porte_2',
        'faixa_empregados_1',
        'faixa_empregados_2',
        'faixa_empregados_3',
        'faixa_empregados_4',
        'tier_0',
        'tier_1',
        'tier_2',
        'tier_3'
    ]

robust_regression_sm(total_data,target,independents)

Test Set Results:
R-squared: 0.206615712194674
Mean Squared Error (MSE): 2.4899215840303137
Mean Absolute Error (MAE): 1.026347786062498


| Dep. Variable: | norm_tpv | No. Observations: | 80000 |
| --- | --- | --- | --- |
| Model: | RLM | Df Residuals: | 79980 |
| Method: | IRLS | Df Model: | 19 |
| Norm: | HuberT | | |
| Scale Est.: | mad | | |
| Cov Type: | H1 | | |
| Date: | Mon, 29 Apr 2024 | | |
| Time: | 23:51:03 | | |
| No. Iterations: | 25 | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| const | 11.5835 | 0.130 | 89.187 | 0.000 | 11.329 | 11.838 |
| estab_d | 0.2686 | 0.014 | 18.584 | 0.000 | 0.240 | 0.297 |
| region_CO | 0.1244 | 0.022 | 5.630 | 0.000 | 0.081 | 0.168 |
| region_NE | -0.0450 | 0.020 | -2.248 | 0.025 | -0.084 | -0.006 |
| region_S | -0.0062 | 0.020 | -0.314 | 0.753 | -0.045 | 0.032 |
| region_SE | 0.0465 | 0.019 | 2.473 | 0.013 | 0.010 | 0.083 |
| capital_social_cat_0 | -0.0200 | 0.014 | -1.445 | 0.149 | -0.047 | 0.007 |
| capital_social_cat_1 | -0.0063 | 0.012 | -0.521 | 0.603 | -0.030 | 0.017 |
| capital_social_cat_2 | 0.0141 | 0.011 | 1.268 | 0.205 | -0.008 | 0.036 |
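
For reference, the three test-set metrics reported above can be computed as in the following sketch. It uses plain NumPy and made-up arrays; the helper `regression_metrics` is an illustrative assumption and not necessarily how robust_regression_sm computes them.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R-squared, MSE and MAE for predictions on the same scale as y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    mse = np.mean(resid ** 2)
    mae = np.mean(np.abs(resid))
    r2 = 1.0 - np.sum(resid ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {'r2': r2, 'mse': mse, 'mae': mae}

# Made-up values on the log-TPV scale
y_true = np.array([10.0, 11.0, 12.0, 13.0])
y_pred = np.array([10.5, 11.0, 11.5, 13.0])
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Because the target is norm_tpv, these metrics are expressed in log units. An MAE of about 1.03, for example, roughly corresponds to predictions that are off by a factor of around exp(1.03), or about 2.8, on the original TPV scale.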


##### Model prediction

Based on the preceding tests, the model below is proposed for the challenge. Its test-set metrics were deemed satisfactory for a prototype and demonstrate the development of this prediction model.

Note that the model.py file in the functions folder encapsulates the model and its associated parameters. In a typical production scenario, these components would be split into separate modules for easier management.

Similarly, the use of .csv files for outputs may not be optimal for this experiment. Nevertheless, this example sufficiently demonstrates the processes involved in developing a prediction model.

In [8]:
# Define the target and independent variables based on the previous tests
# (porte dummies dropped; region_N included instead of region_NE)
target = 'norm_tpv'
independents = [
        'estab_d',
        'region_CO',
        'region_N',
        'region_S',
        'region_SE',
        'capital_social_cat_0',
        'capital_social_cat_1',
        'capital_social_cat_2',
        'faixa_empregados_1',
        'faixa_empregados_2',
        'faixa_empregados_3',
        'faixa_empregados_4',
        'tier_0',
        'tier_1',
        'tier_2',
        'tier_3'
    ]

In [9]:
# Perform the robust regression with the selected variables
robust_regression_sm(total_data, target, independents)

Test Set Results:
R-squared: 0.2050439821872002
Mean Squared Error (MSE): 2.494854230832159
Mean Absolute Error (MAE): 1.0279753133986178


| Dep. Variable: | norm_tpv | No. Observations: | 80000 |
| --- | --- | --- | --- |
| Model: | RLM | Df Residuals: | 79983 |
| Method: | IRLS | Df Model: | 16 |
| Norm: | HuberT | | |
| Scale Est.: | mad | | |
| Cov Type: | H1 | | |
| Date: | Mon, 29 Apr 2024 | | |
| Time: | 23:51:04 | | |
| No. Iterations: | 26 | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| const | 11.8750 | 0.127 | 93.442 | 0.000 | 11.626 | 12.124 |
| estab_d | 0.2480 | 0.014 | 17.226 | 0.000 | 0.220 | 0.276 |
| region_CO | 0.1593 | 0.016 | 9.671 | 0.000 | 0.127 | 0.192 |
| region_N | 0.0346 | 0.020 | 1.723 | 0.085 | -0.005 | 0.074 |
| region_S | 0.0435 | 0.013 | 3.433 | 0.001 | 0.019 | 0.068 |
| region_SE | 0.0896 | 0.011 | 7.893 | 0.000 | 0.067 | 0.112 |
| capital_social_cat_0 | -0.0925 | 0.013 | -7.022 | 0.000 | -0.118 | -0.067 |
| capital_social_cat_1 | -0.0588 | 0.012 | -5.009 | 0.000 | -0.082 | -0.036 |
| capital_social_cat_2 | -0.0032 | 0.011 | -0.295 | 0.768 | -0.025 | 0.018 |