# MACHINE LEARNING

We built a **predictive model** based on **linear regression** to predict the value of the investment (the `Valuation` column) from the following variables: funding, country, continent, industry, the year the company was founded, and the return expected on the investment.

We implemented this model with the **pycaret** library.


In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
import category_encoders as ce
from sklearn.ensemble import GradientBoostingRegressor

In [18]:
df = pd.read_csv("Data/df_reg.csv")

In [19]:
df.head()

Unnamed: 0,Industry,Country,Year_Founded,Funding,Continent,Valuation
0,Artificial intelligence,China,2012,8000,Asia,180000
1,Other,United States,2002,7000,North America,100000
2,E-commerce & direct-to-consumer,China,2008,2000,Asia,100000
3,Fintech,United States,2010,2000,North America,95000
4,Fintech,Sweden,2005,4000,Europe,46000


In [20]:
# years since the company was founded
df['Years_Since_Founded'] = pd.Timestamp.now().year - df['Year_Founded']

# ratio of funding to company age
df['Funding_Age_Ratio'] = df['Funding'] / df['Years_Since_Founded']

# return on the investment, computed as the absolute gain (valuation minus funding)
df['ROI'] = df['Valuation'] - df['Funding']
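
The derived columns depend on the year in which the cell is run, because `pd.Timestamp.now().year` changes over time (the outputs shown in this notebook correspond to 2023). As a minimal sketch, the same three features can be computed with an explicit reference year so that results are reproducible. The helper name `add_features` and its `reference_year` parameter are hypothetical, introduced here only for illustration; the sample values are the first two rows of `df.head()`.

```python
import pandas as pd


def add_features(df: pd.DataFrame, reference_year: int) -> pd.DataFrame:
    """Return a copy of df with the three derived columns used in this notebook."""
    out = df.copy()
    # Company age relative to a fixed year instead of the current date.
    out["Years_Since_Founded"] = reference_year - out["Year_Founded"]
    # Funding per year of existence.
    out["Funding_Age_Ratio"] = out["Funding"] / out["Years_Since_Founded"]
    # Absolute gain: valuation minus funding.
    out["ROI"] = out["Valuation"] - out["Funding"]
    return out


# First two rows of the dataset as shown by df.head().
sample = pd.DataFrame({
    "Year_Founded": [2012, 2002],
    "Funding": [8000, 7000],
    "Valuation": [180000, 100000],
})
features = add_features(sample, reference_year=2023)
print(features)
```

Note that a company founded in the reference year would have an age of zero, which makes `Funding_Age_Ratio` infinite; such rows would need separate handling.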

In [21]:
df.head()

Unnamed: 0,Industry,Country,Year_Founded,Funding,Continent,Valuation,Years_Since_Founded,Funding_Age_Ratio,ROI
0,Artificial intelligence,China,2012,8000,Asia,180000,11,727.272727,172000
1,Other,United States,2002,7000,North America,100000,21,333.333333,93000
2,E-commerce & direct-to-consumer,China,2008,2000,Asia,100000,15,133.333333,98000
3,Fintech,United States,2010,2000,North America,95000,13,153.846154,93000
4,Fintech,Sweden,2005,4000,Europe,46000,18,222.222222,42000


In [13]:
df.columns

Index(['Industry', 'Country', 'Year_Founded', 'Funding', 'Continent',
       'Valuation', 'Years_Since_Founded', 'Industry_Country',
       'Funding_Age_Ratio', 'Industry_Funding', 'ROI'],
      dtype='object')

In [14]:
df= df[['Industry', 'Country', 'Year_Founded', 'Funding', 'Continent', 'Years_Since_Founded', 'Industry_Country',
       'Funding_Age_Ratio', 'Industry_Funding', 'ROI','Valuation']]

Because your sample data is not enough to train a model, I decided to generate new rows by resampling the existing ones and adding noise to them.

In [22]:
num_rows_to_generate = 900

# bounds of the numeric columns, used to keep noisy values within the observed range
numeric_cols = df.select_dtypes(include=[np.number]).columns
min_values = df[numeric_cols].min()
max_values = df[numeric_cols].max()

# bootstrap: sample existing rows with replacement
new_rows = df.sample(n=num_rows_to_generate, replace=True)

# add standard normal noise to every numeric column and clip to the observed range
for column in numeric_cols:
    noise = np.random.normal(0, 1, new_rows[column].shape)
    new_rows[column] = np.clip(new_rows[column] + noise, min_values[column], max_values[column])

# append the synthetic rows to the original data
df = pd.concat([df, new_rows], ignore_index=True)
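
Noise with a fixed standard deviation of 1 perturbs every numeric column by the same absolute amount, whatever its scale. One possible variation, sketched below, scales the noise to each column's own standard deviation and seeds the random generator so the augmentation can be repeated. This is not the method used in the notebook: the function name `augment_with_noise` and the parameters `noise_scale` and `seed` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def augment_with_noise(df, n_rows, noise_scale=0.01, seed=0):
    """Append n_rows bootstrapped rows whose numeric columns carry scaled Gaussian noise."""
    rng = np.random.default_rng(seed)
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    # Sample existing rows with replacement (categorical values are copied unchanged).
    new_rows = df.sample(n=n_rows, replace=True, random_state=seed).copy()
    for col in numeric_cols:
        # Noise proportional to the spread of the column.
        sigma = noise_scale * df[col].std()
        noisy = new_rows[col] + rng.normal(0.0, sigma, size=len(new_rows))
        # Keep the synthetic values within the observed range.
        new_rows[col] = noisy.clip(df[col].min(), df[col].max())
    return pd.concat([df, new_rows], ignore_index=True)


# Toy frame with illustrative values.
toy = pd.DataFrame({
    "Industry": ["Fintech", "Other", "Fintech"],
    "Funding": [2000, 7000, 4000],
    "Valuation": [95000, 100000, 46000],
})
augmented = augment_with_noise(toy, n_rows=5)
print(augmented.shape)
```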

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1945 entries, 0 to 1944
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Industry             1945 non-null   object 
 1   Country              1945 non-null   object 
 2   Year_Founded         1945 non-null   float64
 3   Funding              1945 non-null   float64
 4   Continent            1945 non-null   object 
 5   Years_Since_Founded  1945 non-null   float64
 6   Industry_Country     1945 non-null   object 
 7   Funding_Age_Ratio    1945 non-null   float64
 8   Industry_Funding     1945 non-null   object 
 9   ROI                  1945 non-null   float64
 10  Valuation            1945 non-null   float64
dtypes: float64(6), object(5)
memory usage: 167.3+ KB


# Modeling

In [23]:
# import pycaret's regression module to predict the valuation
from pycaret.regression import *

In [24]:
# set up the experiment with Valuation as the target
# (the result is stored in `exp` so that the `setup` function is not overwritten)
exp = setup(df, target='Valuation', remove_outliers=True)

Unnamed: 0,Description,Value
0,Session id,4195
1,Target,Valuation
2,Target type,Regression
3,Original data shape,"(1945, 9)"
4,Transformed data shape,"(1877, 29)"
5,Transformed train set shape,"(1293, 29)"
6,Transformed test set shape,"(584, 29)"
7,Numeric features,5
8,Categorical features,3
9,Preprocess,True


In [17]:
# compare the candidate models
best = compare_models()

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
lr,Linear Regression,1.0244,2.3413,1.5256,1.0,0.001,0.0006,9.816
omp,Orthogonal Matching Pursuit,1.0178,2.3885,1.5394,1.0,0.001,0.0006,5.928
lasso,Lasso Regression,1.0288,2.4406,1.5554,1.0,0.001,0.0006,5.604
par,Passive Aggressive Regressor,2.195,13.4988,3.1704,1.0,0.0016,0.0012,5.793
br,Bayesian Ridge,1.0185,2.3374,1.5241,1.0,0.001,0.0006,5.883
huber,Huber Regressor,1.0064,2.3473,1.527,1.0,0.001,0.0006,5.867
llar,Lasso Least Angle Regression,1.0154,2.3609,1.531,1.0,0.001,0.0006,5.884
en,Elastic Net,1.0285,2.4377,1.5542,1.0,0.001,0.0006,5.781
ridge,Ridge Regression,1.0241,2.3411,1.5255,1.0,0.001,0.0006,6.169
lar,Least Angle Regression,50.9793,52257.9686,86.0653,0.9967,0.0519,0.0286,5.836
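
An R2 of 1.0 for almost every model is what one would expect when the target can be rebuilt from the features. Here `ROI` was defined as `Valuation - Funding`, so `Valuation` equals `ROI + Funding` up to the noise added during augmentation. The sketch below uses synthetic values that are purely illustrative (not the project's data) and fits an ordinary least-squares model with NumPy to show the effect; the value ranges and sample size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic, illustrative values only (not the project's dataset).
funding = rng.uniform(100, 10_000, size=n)
valuation = rng.uniform(1_000, 200_000, size=n)
roi = valuation - funding  # same definition as the ROI feature

# Independent unit-variance noise on each column, as in the augmentation step.
funding_noisy = funding + rng.normal(0, 1, size=n)
roi_noisy = roi + rng.normal(0, 1, size=n)
valuation_noisy = valuation + rng.normal(0, 1, size=n)

# Ordinary least squares: Valuation ~ a * Funding + b * ROI + c
X = np.column_stack([funding_noisy, roi_noisy, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, valuation_noisy, rcond=None)

pred = X @ coef
ss_res = np.sum((valuation_noisy - pred) ** 2)
ss_tot = np.sum((valuation_noisy - valuation_noisy.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
rmse = np.sqrt(np.mean((valuation_noisy - pred) ** 2))

print("coefficients for Funding and ROI:", coef[:2])
print("R2:", r2, "RMSE:", rmse)
```

Both coefficients come out close to 1 and the error stays at the scale of the injected noise, which suggests the scores in the table reflect this identity rather than predictive skill. Evaluating the model without `ROI` among the features would give a more informative picture.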


In [25]:
# linear regression is among the best-scoring models, so we select it
lr = create_model('lr')  # create the model

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,0.5871,0.9209,0.9596,1.0,0.0005,0.0003
1,0.648,1.1502,1.0725,1.0,0.0006,0.0003
2,0.8362,1.5934,1.2623,1.0,0.0009,0.0006
3,0.7692,1.6341,1.2783,1.0,0.0008,0.0004
4,0.8006,1.5518,1.2457,1.0,0.0007,0.0004
5,0.7269,1.322,1.1498,1.0,0.0007,0.0004
6,0.7161,1.3602,1.1663,1.0,0.0008,0.0004
7,0.9215,1.9929,1.4117,1.0,0.001,0.0006
8,0.6311,1.0891,1.0436,1.0,0.0007,0.0004
9,0.6995,1.326,1.1515,1.0,0.0008,0.0004


In [26]:
# tune its hyperparameters
tuned_lr = tune_model(lr)

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,0.586,0.9173,0.9577,1.0,0.0005,0.0003
1,0.648,1.1502,1.0725,1.0,0.0006,0.0004
2,0.8364,1.594,1.2625,1.0,0.0009,0.0006
3,0.7692,1.6341,1.2783,1.0,0.0008,0.0004
4,0.8006,1.5518,1.2457,1.0,0.0007,0.0004
5,0.7268,1.3217,1.1496,1.0,0.0007,0.0004
6,0.7162,1.3599,1.1662,1.0,0.0008,0.0004
7,0.9217,1.9931,1.4118,1.0,0.001,0.0006
8,0.6311,1.0891,1.0436,1.0,0.0007,0.0004
9,0.6993,1.3254,1.1513,1.0,0.0008,0.0004


Fitting 10 folds for each of 2 candidates, totalling 20 fits
Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).


In [27]:
# inspect how the model behaves
evaluate_model(tuned_lr)


In [28]:
# finalize the model (refit it on the full dataset, including the hold-out set)
final_lr_best = finalize_model(tuned_lr)

In [29]:
# save the model
save_model(final_lr_best, 'ml_unicornios')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=FastMemory(location=C:\Users\fabia\AppData\Local\Temp\joblib),
          steps=[('numerical_imputer',
                  TransformerWrapper(include=['Year_Founded', 'Funding',
                                              'Years_Since_Founded',
                                              'Funding_Age_Ratio', 'ROI'],
                                     transformer=SimpleImputer())),
                 ('categorical_imputer',
                  TransformerWrapper(include=['Industry', 'Country',
                                              'Continent'],
                                     transformer=SimpleImputer(s...
                                     transformer=OneHotEncoder(cols=['Industry',
                                                                     'Continent'],
                                                               handle_missing='return_nan',
                                                               use_cat_names=True))),
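
Once saved, the pipeline can be loaded again to score new rows. The sketch below wraps PyCaret's `load_model` and `predict_model` in a small helper; the helper name `predict_valuations` is hypothetical, and it assumes that `ml_unicornios.pkl` is in the working directory and that `new_data` contains the same feature columns used in training. In PyCaret 3.x the predictions are expected in a `prediction_label` column, which may differ in other versions.

```python
import pandas as pd


def predict_valuations(new_data: pd.DataFrame, model_name: str = "ml_unicornios") -> pd.DataFrame:
    """Load the saved pipeline and return new_data with the model's predictions appended."""
    # Imported lazily so the helper can be defined even where pycaret is not installed.
    from pycaret.regression import load_model, predict_model

    # load_model takes the name without the .pkl extension.
    pipeline = load_model(model_name)
    # The pipeline applies the stored preprocessing before predicting.
    return predict_model(pipeline, data=new_data)
```

A call would look like `predict_valuations(df.drop(columns=['Valuation']).head())`, run in the same environment where the model was saved.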
                

In [5]:
df.columns

Index(['Industry', 'Country', 'Year_Founded', 'Funding', 'Continent',
       'Valuation'],
      dtype='object')

In [6]:
print(df['Continent'].unique())

['Asia' 'North America' 'Europe' 'Oceania' 'South America' 'Africa']
