# PC2 - Group 2 (Python)

#### Members
- GARCIA RODRIGUEZ, EMILIO ALONSO
- PADILLA AQUISE, ALESSANDRO PIERO
- RIEGA NUÑEZ, GABRIEL ANTONIO FERMIN
- SALAMANCA FERNANDEZ, LUCAS PABLO
- SILVA ANDUJAR, NICOLAS

#### 1. Loading and processing the data


In [1]:
import numpy as np
import pandas as pd

In [2]:
data = "https://raw.githubusercontent.com/d2cml-ai/CausalAI-Course/main/data/wage2015_subsample_inference.csv"

df = pd.read_csv(data)

In [3]:
df = df.set_index('rownames')

#### As in Group Assignment 1 2024 - 2 #1044, generate the extra-flexible model. This means that it contains all two-way interactions between the experience polynomials and the indicator variables.

In [4]:
df_with_dummies = pd.get_dummies(df, columns=['occ2', 'ind2'], drop_first=True)
extra_flexible_model_vars = ["sex",'exp1', 'exp2', 'exp3', 'exp4', 'hsg', 'scl', 'clg', 'ad', 
                             'so', 'we', 'ne'] + \
                             [col for col in df_with_dummies.columns if col.startswith('occ2_') or col.startswith('ind2_')]

two_way_interactions = []
for i, var1 in enumerate(extra_flexible_model_vars):
    for var2 in extra_flexible_model_vars[i+1:]:
        interaction_var = df_with_dummies[var1] * df_with_dummies[var2]
        two_way_interactions.append(interaction_var.values.reshape(-1, 1))

interactions_array = np.hstack(two_way_interactions)

extra_flexible_model_array = np.hstack([df_with_dummies[extra_flexible_model_vars].values, interactions_array])
print(extra_flexible_model_array.shape)

(5150, 1431)
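
The shape above is consistent with 53 base columns plus every distinct pair among them, since 53 + 53·52/2 = 1431. As a rough illustration of that column count, here is a minimal sketch on a toy array; the helper name `add_pairwise_interactions` is hypothetical and is not part of the notebook.

```python
from itertools import combinations

import numpy as np


def add_pairwise_interactions(X):
    """Append the product of every distinct pair of columns to X (hypothetical helper)."""
    n_cols = X.shape[1]
    # One product column per unordered pair (i, j) with i < j
    products = [X[:, i] * X[:, j] for i, j in combinations(range(n_cols), 2)]
    return np.column_stack([X] + products)


toy = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])
expanded = add_pairwise_interactions(toy)

# 3 base columns + 3 pairs = 6 columns in total
n_base = toy.shape[1]
expected_cols = n_base + n_base * (n_base - 1) // 2
print(expanded.shape, expected_cols)
```

With `p` base columns the expanded array has `p + p(p-1)/2` columns, which is the same pairing pattern the double loop above produces.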


#### 2.1. Generate the array for the outcome variable Y and normalize it

In [5]:
df_logwage = df[['lwage']]
log_w = np.array(df_logwage['lwage'])
log_w = log_w.reshape(-1, 1)

In [6]:
norm_log_w = (log_w - np.mean(log_w)) / np.std(log_w)

print(norm_log_w)

[[-1.24037498]
 [ 1.5815695 ]
 [-0.99532021]
 ...
 [ 1.19031569]
 [ 0.9200321 ]
 [-0.20976589]]
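
The normalization used here subtracts the mean and divides by the standard deviation. A small sketch on made-up numbers shows what that does; the helper `standardize` and the toy values are assumptions for illustration only. Note that `np.std` defaults to the population standard deviation (`ddof=0`).

```python
import numpy as np


def standardize(values):
    """Return (values - mean) / std, using NumPy's default population std (ddof=0)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()


# Toy data with mean 5 and population standard deviation 2
toy = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
z = standardize(toy)

# After standardizing, the mean is 0 and the standard deviation is 1 (up to rounding)
print(z.mean(), z.std())
```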


In [13]:
experience_vars = ['exp1', 'exp2', 'exp3', 'exp4']

experience_var_indices = [extra_flexible_model_vars.index(var) for var in experience_vars]

extra_flexible_model_array_normalized = extra_flexible_model_array.copy()

for idx in experience_var_indices:
    col_mean = np.mean(extra_flexible_model_array[:, idx])
    col_std = np.std(extra_flexible_model_array[:, idx])
    extra_flexible_model_array_normalized[:, idx] = (extra_flexible_model_array[:, idx] - col_mean) / col_std

print(extra_flexible_model_array_normalized.shape)

(5150, 1431)


In [14]:
print(extra_flexible_model_array_normalized)


[[1.0 -0.6372836834860737 -0.6321497674155142 ... 0.0 0.0 0.0]
 [0.0 1.6250669865566842 1.647556165145875 ... 0.0 0.0 0.0]
 [0.0 0.39962704028352375 0.055261560933588944 ... 0.0 0.0 0.0]
 ...
 [0.0 -0.26022523847894735 -0.45217298326593086 ... 0.0 0.0 0.0]
 [0.0 -0.35448984973072895 -0.5046662119762261 ... 0.0 0.0 0.0]
 [0.0 0.02256859527639739 -0.2646971664434482 ... 0.0 0.0 0.0]]


#### Split between training and testing samples. The testing sample should be 10% of the total.

In [15]:
from sklearn.model_selection import train_test_split
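ef_model = extra_flexible_model_array_normalized
y = norm_log_w

In [16]:
# test_size=0.1 holds out 10% of the rows for testing; the remaining 90% are used for training
ef_train, ef_test, y_ef_train, y_ef_test = train_test_split(ef_model, y, test_size=0.1)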
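
For a sense of the sizes involved: with 5150 rows, a 10% testing sample is 515 rows, leaving 4635 for training. The NumPy-only sketch below reproduces those counts with a shuffled index split. The helper `split_rows` and its seed are hypothetical, and this is not how `train_test_split` is implemented internally.

```python
import numpy as np


def split_rows(n_rows, test_share=0.1, seed=0):
    """Return (train_idx, test_idx) from a shuffled range of row indices (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    n_test = int(round(test_share * n_rows))
    # The first n_test shuffled indices form the testing sample, the rest the training sample
    return shuffled[n_test:], shuffled[:n_test]


train_idx, test_idx = split_rows(5150)
print(len(train_idx), len(test_idx))
```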


### 3. Creating the Lasso Cross-Validation Procedure

4. Program a function that generates a logarithmically spaced grid. The input arguments should be the lower and upper bounds of the grid, as well as the natural logarithm of the spacing between each element of the grid. The output should be the logarithmically spaced grid, meaning that if we take the natural logarithm of each entry in the grid, they will be equally spaced. This will be the grid of λ values to try during cross-validation.

In [12]:
import numpy as np

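def log_spaced_grid(lower_bound, upper_bound, log_spacing=None, num_points=50):
    # Work on the natural-log scale, where the grid must be equally spaced
    log_lower = np.log(lower_bound)
    log_upper = np.log(upper_bound)

    if log_spacing is not None:
        # Step by log_spacing from log_lower without overshooting log_upper
        n_steps = int(np.floor((log_upper - log_lower) / log_spacing + 1e-9))
        log_grid = log_lower + log_spacing * np.arange(n_steps + 1)
    else:
        # Fallback: fix the number of points instead (the default of 50 is arbitrary)
        log_grid = np.linspace(log_lower, log_upper, num_points)

    return np.exp(log_grid)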
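
One way to check the defining property of such a grid is to take logs and look at the gaps between consecutive entries. The sketch below builds a grid directly from a log spacing and verifies that the gaps are constant; the function name `log_grid_by_spacing` and the bounds and spacing used are illustrative assumptions.

```python
import numpy as np


def log_grid_by_spacing(lower_bound, upper_bound, log_spacing):
    """Grid whose natural logs are log_spacing apart, starting at lower_bound (hypothetical helper)."""
    log_lower, log_upper = np.log(lower_bound), np.log(upper_bound)
    # Number of whole steps that fit between the bounds on the log scale
    n_steps = int(np.floor((log_upper - log_lower) / log_spacing + 1e-9))
    return np.exp(log_lower + log_spacing * np.arange(n_steps + 1))


grid = log_grid_by_spacing(0.01, 1.0, 0.5)
gaps = np.diff(np.log(grid))

# Every gap on the log scale equals the requested spacing
print(len(grid), np.allclose(gaps, 0.5))
```

Because the spacing is fixed, the last grid point can fall short of the upper bound when the log range is not a whole multiple of the spacing.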


5. Program a function to generate k folds. It should take as input the array to be split row-wise and the number of folds desired. It should output a list of k 1-D arrays of booleans; these arrays should all be the same length as the number of rows in the input array, and when they are all summed together they should add up to an array of all true values. Create your own procedure for splitting. You can use third-party packages like NumPy in Python or Statistics in Julia, but do not use a pre-programmed third-party splitting procedure such as scikit-learn's KFold in Python.

In [13]:
def generate_k_folds(X, k):
    n_samples = X.shape[0]
    fold_sizes = np.full(k, n_samples // k, dtype=int)
    fold_sizes[:n_samples % k] += 1  # Distribute remaining samples
    
    # Create an array of indices and shuffle it
    indices = np.arange(n_samples)
    np.random.shuffle(indices)
    
    # Create the folds
    folds = []
    current = 0
    for fold_size in fold_sizes:
        start, stop = current, current + fold_size
        mask = np.zeros(n_samples, dtype=bool)
        mask[indices[start:stop]] = True
        folds.append(mask)
        current = stop
    
    return folds
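
The requirement that the boolean masks sum to an all-true array can be checked directly. The sketch below builds folds a different way, by shuffling fold labels rather than indices, and then verifies that every row belongs to exactly one fold. The helper `folds_from_labels`, the seed, and the toy sizes are assumptions for illustration.

```python
import numpy as np


def folds_from_labels(n_rows, k, seed=0):
    """Return k boolean masks built from shuffled fold labels 0..k-1 (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Labels cycle 0, 1, ..., k-1 so fold sizes differ by at most one, then get shuffled
    labels = rng.permutation(np.arange(n_rows) % k)
    return [labels == j for j in range(k)]


folds = folds_from_labels(n_rows=103, k=5)

# Summing the masks counts how many folds each row belongs to
coverage = np.sum(folds, axis=0)
fold_sizes = [int(mask.sum()) for mask in folds]
print((coverage == 1).all(), sorted(fold_sizes))
```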

6. Program a function that integrates those that you programmed in the last two items to find the value of λ that minimizes the testing mean squared error across folds.

In [16]:
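from sklearn.linear_model import Lasso

def find_optimal_lambda(Y, X, lambda_bounds, k, log_spacing=0.5, num_points=50):
    # log_spacing and num_points are forwarded to log_spaced_grid (defaults are arbitrary)
    lambdas = log_spaced_grid(lambda_bounds[0], lambda_bounds[1],
                              log_spacing=log_spacing, num_points=num_points)
    Y = np.ravel(Y)  # flatten (n, 1) to (n,) so residuals do not broadcast to (n, n)

    folds = generate_k_folds(X, k)

    all_mse = np.zeros((len(lambdas), k))

    for i, lambda_val in enumerate(lambdas):
        for j, fold in enumerate(folds):
            # Rows flagged True form the held-out fold; the rest are used for fitting
            X_train, X_test = X[~fold], X[fold]
            Y_train, Y_test = Y[~fold], Y[fold]

            model = Lasso(alpha=lambda_val, fit_intercept=True)
            model.fit(X_train, Y_train)
            Y_pred = model.predict(X_test)
            all_mse[i, j] = np.mean((Y_test - Y_pred)**2)

    # Average the held-out MSE across folds for each lambda
    avg_mse = np.mean(all_mse, axis=1)

    optimal_index = np.argmin(avg_mse)
    optimal_lambda = lambdas[optimal_index]

    # Refit on the full sample with the selected lambda
    optimal_model = Lasso(alpha=optimal_lambda, fit_intercept=True)
    optimal_model.fit(X, Y)

    result = {
        'optimal_lambda': optimal_lambda,
        'optimal_coef': optimal_model.coef_,
        'optimal_intercept': optimal_model.intercept_,
        'all_lambdas': lambdas,
        'all_mse': avg_mse
    }

    return result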
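
One detail in the mean squared error step is worth checking by hand. The outcome array in this notebook is a column of shape `(n, 1)`, while predictions for a single target are expected to come back as a 1-D array of shape `(n,)` (an assumption about the estimator's output, not something shown above). Subtracting the two broadcasts to an `(n, n)` matrix, and the mean of its squares is not the MSE. The toy numbers below are made up to illustrate this.

```python
import numpy as np

y_true = np.array([[1.0], [2.0], [3.0]])  # column vector, shape (3, 1)
y_pred = np.array([1.0, 2.0, 4.0])        # flat vector, shape (3,)

# Mixed shapes broadcast to a 3 x 3 matrix of all pairwise differences
broadcast_diff = y_true - y_pred
wrong_mse = np.mean(broadcast_diff ** 2)

# Flattening both sides first gives the element-wise residuals
right_mse = np.mean((y_true.ravel() - y_pred.ravel()) ** 2)

print(broadcast_diff.shape, wrong_mse, right_mse)
```

Flattening the outcome with `np.ravel` before fitting and scoring avoids the problem regardless of which shape the predictions take.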

7. Program a function for predicting the outcome variable through the model estimated with the optimal lambda. It should take as inputs

In [17]:

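def lasso_predict(optimal_model, X):
    # optimal_model is the dictionary returned by find_optimal_lambda
    optimal_coef = np.ravel(optimal_model['optimal_coef'])
    # Use the fitted intercept if it was stored; fall back to 0 when the key is absent
    optimal_intercept = optimal_model.get('optimal_intercept', 0.0)

    # Linear prediction: X @ coef + intercept
    predictions = np.asarray(X, dtype=float) @ optimal_coef + optimal_intercept

    return predictions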