### Try-it 9.2: Predicting Wages

This activity summarizes your work with regularized regression models.  You will combine your earlier work on data preparation and pipelines with what you've learned about grid searches to determine an optimal model.  This example is also a good opportunity to use the `TransformedTargetRegressor` estimator in scikit-learn.

### The Data

This dataset is loaded from the OpenML resource library.  Originally drawn from census data, it contains wage and demographic information on 534 individuals.  From the [dataset documentation](https://www.openml.org/d/534):

```
The Current Population Survey (CPS) is used to supplement census information between census years. These data consist of a random sample of 534 persons from the CPS, with information on wages and other characteristics of the workers, including sex, number of years of education, years of work experience, occupational status, region of residence and union membership. 
```

In [1]:
from sklearn.datasets import fetch_openml

In [2]:
wages = fetch_openml(data_id=534, as_frame=True)

In [3]:
wages.frame.head()

| | EDUCATION | SOUTH | SEX | EXPERIENCE | UNION | WAGE | AGE | RACE | OCCUPATION | SECTOR | MARR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.0 | no | female | 21.0 | not_member | 5.1 | 35.0 | Hispanic | Other | Manufacturing | Married |
| 1 | 9.0 | no | female | 42.0 | not_member | 4.95 | 57.0 | White | Other | Manufacturing | Married |
| 2 | 12.0 | no | male | 1.0 | not_member | 6.67 | 19.0 | White | Other | Manufacturing | Unmarried |
| 3 | 12.0 | no | male | 4.0 | not_member | 4.0 | 22.0 | White | Other | Other | Unmarried |
| 4 | 12.0 | no | male | 17.0 | not_member | 7.5 | 35.0 | White | Other | Other | Married |


#### Task

Build regression models to predict `WAGE`.  Incorporate the categorical features and transform the target using a logarithm.  Build `Ridge` models and consider some different amounts of regularization.  
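One way to keep the log transform inside the model, so that predictions come back in the original wage units, is `TransformedTargetRegressor`. The following is a minimal sketch under stated assumptions: the data is synthetic (the `demo_X` and `demo_y` names and values are hypothetical, not the CPS sample), and `np.log1p` / `np.expm1` are used as the transform pair.

```python
import numpy as np
import pandas as pd
from sklearn.compose import TransformedTargetRegressor, make_column_transformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the wage data (hypothetical values, not the CPS sample).
rng = np.random.default_rng(0)
n = 200
demo_X = pd.DataFrame({
    "EDUCATION": rng.integers(8, 19, size=n).astype(float),
    "EXPERIENCE": rng.integers(0, 40, size=n).astype(float),
    "UNION": rng.choice(["member", "not_member"], size=n),
})
# Wages are positive and right-skewed, so they are generated on a log scale.
demo_y = np.exp(0.5 + 0.1 * demo_X["EDUCATION"] + 0.01 * demo_X["EXPERIENCE"]
                + rng.normal(0, 0.3, size=n))

preprocess = make_column_transformer(
    (StandardScaler(), ["EDUCATION", "EXPERIENCE"]),
    (OneHotEncoder(handle_unknown="ignore"), ["UNION"]),
)

# The inner pipeline is fit on log1p(y); predict() applies expm1 automatically.
model = TransformedTargetRegressor(
    regressor=make_pipeline(preprocess, Ridge(alpha=1.0)),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(demo_X, demo_y)
demo_preds = model.predict(demo_X)  # predictions in original wage units
```

With this wrapper the target does not need to be transformed by hand before `train_test_split`, and error metrics computed from `predict` are in wage units rather than log units.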

After fitting your model, interpret the model and try to understand what features led to higher wages.  Consider using `permutation_importance` that you encountered in module 8.  Discuss your findings in the class forum.

For an in-depth discussion of the perils of interpreting coefficients, see [this example](https://scikit-learn.org/stable/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html) in the scikit-learn documentation.

In [4]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
from numpy import absolute, mean, std
from pandas import read_csv
from sklearn import set_config
from sklearn.compose import (ColumnTransformer, TransformedTargetRegressor,
                             make_column_selector, make_column_transformer)
from sklearn.feature_selection import RFE, SelectFromModel, SequentialFeatureSelector
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             f1_score, mean_squared_error)
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, PolynomialFeatures, StandardScaler
from sklearn.svm import SVR

set_config(display="diagram")

In [5]:
wages = wages.frame

In [6]:
wages.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 534 entries, 0 to 533
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   EDUCATION   534 non-null    float64 
 1   SOUTH       534 non-null    category
 2   SEX         534 non-null    category
 3   EXPERIENCE  534 non-null    float64 
 4   UNION       534 non-null    category
 5   WAGE        534 non-null    float64 
 6   AGE         534 non-null    float64 
 7   RACE        534 non-null    category
 8   OCCUPATION  534 non-null    category
 9   SECTOR      534 non-null    category
 10  MARR        534 non-null    category
dtypes: category(7), float64(4)
memory usage: 21.4 KB


## Organize the Data

In [7]:
num_cols = ['EDUCATION', 'EXPERIENCE', 'AGE']
cat_cols = ['SOUTH', 'SEX', 'UNION', 'RACE', 'OCCUPATION', 'SECTOR', 'MARR']
X = wages.drop(columns = ['WAGE'])
y = wages['WAGE']
X_train, X_test, y_train, y_test = train_test_split(X, np.log1p(y), test_size = 0.3)

In [8]:
print(wages.nunique())

EDUCATION      17
SOUTH           2
SEX             2
EXPERIENCE     52
UNION           2
WAGE          238
AGE            47
RACE            3
OCCUPATION      6
SECTOR          3
MARR            2
dtype: int64


## Column Pipelines and Transformer

In [10]:
num_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy = 'median')),
    ('scaler', StandardScaler())
])
cat_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy = 'most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown = 'ignore', sparse = False))  # scikit-learn >= 1.2 renames `sparse` to `sparse_output`
])

In [11]:
col_trans = ColumnTransformer([
    ('num_pipeline', num_pipeline, num_cols),
    ('cat_pipeline', cat_pipeline, cat_cols)
])

## Linear Regression

In [12]:
# Linear Regression Pipeline
linreg = LinearRegression()
linreg_pipeline = Pipeline([
    ('col_trans', col_trans),
    ('model', linreg)
])
linreg_pipeline.fit(X_train, y_train)
linreg_train_mse = mean_squared_error(y_train, linreg_pipeline.predict(X_train))
linreg_test_mse = mean_squared_error(y_test, linreg_pipeline.predict(X_test))
print(f'The Linreg training MSE is: {linreg_train_mse}')
print(f'The Linreg testing MSE is: {linreg_test_mse}')
# 5-fold cross-validated RMSE on the log-wage scale (scores are negated, so closer to 0 is better)
scores = cross_val_score(linreg_pipeline, X_train, y_train, cv = 5, scoring='neg_root_mean_squared_error')
print(scores)
print(f'Mean score of {scores.mean()} with a standard deviation of {scores.std()}')

The Linreg training MSE is: 0.13592560272053014
The Linreg testing MSE is: 0.14403422171668676
[-0.35880546 -0.39159976 -0.30284768 -0.37341695 -0.47926368]
Mean score of -0.3811867068779426 with a standard deviation of 0.057316342689703106


In [53]:
r = permutation_importance(linreg_pipeline, X_test, y_test, n_repeats=30, random_state=0)
# Report only features whose mean importance is more than two standard deviations above zero
for i in r.importances_mean.argsort()[::-1]:
    if r.importances_mean[i] - 2 * r.importances_std[i] > 0:
        print(f"{X.columns[i]:<8}"
              f"  {r.importances_mean[i]:.3f}"
              f" +/- {r.importances_std[i]:.3f}")


EXPERIENCE  2.317 +/- 0.253
AGE       2.285 +/- 0.244


In [55]:
r = permutation_importance(linreg_pipeline, X_test, y_test, n_repeats=30, random_state=0)
# Same report without the significance filter, so every feature is listed
for i in r.importances_mean.argsort()[::-1]:
    print(f"{X.columns[i]:<8}"
          f"  {r.importances_mean[i]:.3f}"
          f" +/- {r.importances_std[i]:.3f}")

EXPERIENCE  2.317 +/- 0.253
AGE       2.285 +/- 0.244
SEX       0.023 +/- 0.031
UNION     0.012 +/- 0.026
SECTOR    0.010 +/- 0.013
EDUCATION  0.004 +/- 0.112
SOUTH     -0.009 +/- 0.012
RACE      -0.015 +/- 0.015
MARR      -0.034 +/- 0.016
OCCUPATION  -0.089 +/- 0.043


## Ridge Regression Pipeline

In [13]:
# Ridge Regression Pipeline
ridge = Ridge()
ridge_pipeline = Pipeline([
    ('col_trans', col_trans),
    ('model', ridge)
])
ridge_pipeline.fit(X_train, y_train)
ridge_train_mse = mean_squared_error(y_train, ridge_pipeline.predict(X_train))
ridge_test_mse = mean_squared_error(y_test, ridge_pipeline.predict(X_test))
print(f'The Ridge training MSE is: {ridge_train_mse}')
print(f'The Ridge testing MSE is: {ridge_test_mse}')
scores = cross_val_score(ridge_pipeline, X_train, y_train, cv = 5, scoring='neg_root_mean_squared_error')
print(scores)
print(f'Mean score of {scores.mean()} with a standard deviation of {scores.std()}')

The Ridge training MSE is: 0.1360103414734535
The Ridge testing MSE is: 0.1439715024139282
[-0.35942331 -0.3910038  -0.30317511 -0.37364975 -0.47835688]
Mean score of -0.3811217685202498 with a standard deviation of 0.05684128266009909


In [56]:
r = permutation_importance(ridge_pipeline, X_test, y_test, n_repeats=30, random_state=0)
for i in r.importances_mean.argsort()[::-1]:
    print(f"{X.columns[i]:<8}"
          f"  {r.importances_mean[i]:.3f}"
          f" +/- {r.importances_std[i]:.3f}")

SEX       0.023 +/- 0.030
UNION     0.012 +/- 0.025
SECTOR    0.010 +/- 0.013
SOUTH     -0.009 +/- 0.012
EXPERIENCE  -0.010 +/- 0.016
RACE      -0.014 +/- 0.014
MARR      -0.033 +/- 0.016
AGE       -0.034 +/- 0.011
OCCUPATION  -0.088 +/- 0.042
EDUCATION  -0.151 +/- 0.069


In [14]:
ridge_pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'col_trans', 'model', 'col_trans__n_jobs', 'col_trans__remainder', 'col_trans__sparse_threshold', 'col_trans__transformer_weights', 'col_trans__transformers', 'col_trans__verbose', 'col_trans__verbose_feature_names_out', 'col_trans__num_pipeline', 'col_trans__cat_pipeline', 'col_trans__num_pipeline__memory', 'col_trans__num_pipeline__steps', 'col_trans__num_pipeline__verbose', 'col_trans__num_pipeline__impute', 'col_trans__num_pipeline__scaler', 'col_trans__num_pipeline__impute__add_indicator', 'col_trans__num_pipeline__impute__copy', 'col_trans__num_pipeline__impute__fill_value', 'col_trans__num_pipeline__impute__missing_values', 'col_trans__num_pipeline__impute__strategy', 'col_trans__num_pipeline__impute__verbose', 'col_trans__num_pipeline__scaler__copy', 'col_trans__num_pipeline__scaler__with_mean', 'col_trans__num_pipeline__scaler__with_std', 'col_trans__cat_pipeline__memory', 'col_trans__cat_pipeline__steps', 'col_trans__cat_pipeline__verb

In [15]:
# Using GridSearch to find best alpha
parameters = {'model__alpha': [0.1, .5, 1, 5, 10, 50, 100]}
gd = GridSearchCV(ridge_pipeline, parameters)
gd.fit(X_train, y_train)
gd.best_params_

{'model__alpha': 10}

## Feature Selection

In [58]:
## Feature Selection Pipeline
selector_pipe = Pipeline([
    ('col_trans', col_trans),
    ('selector', SequentialFeatureSelector(LinearRegression())),
    ('model', LinearRegression())
])
selector_pipe.fit(X_train, y_train)

In [47]:
param_dict = {'selector__n_features_to_select': [8,9,10,11]} # started with 2-5, then moved the range upward; the best value varies with the random split
selector_grid = GridSearchCV(selector_pipe, param_grid=param_dict)
selector_grid.fit(X_train, y_train)
train_preds = selector_grid.predict(X_train)
test_preds = selector_grid.predict(X_test)
selector_train_mse = mean_squared_error(y_train, train_preds)
selector_test_mse = mean_squared_error(y_test, test_preds)
print(f'Train MSE: {selector_train_mse}')
print(f'Test MSE: {selector_test_mse}')
print(f'Best params: {selector_grid.best_params_}')


Train MSE: 0.2043937749836083
Test MSE: 0.22992639376943913
Best params: {'selector__n_features_to_select': 9}


In [59]:
r = permutation_importance(selector_pipe, X_test, y_test, n_repeats=30, random_state=0)
for i in r.importances_mean.argsort()[::-1]:
    print(f"{X.columns[i]:<8}"
          f"  {r.importances_mean[i]:.3f}"
          f" +/- {r.importances_std[i]:.3f}")

AGE       0.075 +/- 0.020
EXPERIENCE  0.022 +/- 0.025
UNION     0.022 +/- 0.013
OCCUPATION  0.018 +/- 0.011
EDUCATION  0.013 +/- 0.010
MARR      0.000 +/- 0.000
SOUTH     0.000 +/- 0.000
SECTOR    -0.001 +/- 0.002
RACE      -0.001 +/- 0.002
SEX       -0.016 +/- 0.011


In [48]:
best_estimator = selector_grid.best_estimator_
best_selector = best_estimator.named_steps['selector']
best_model = best_estimator.named_steps['model']
# Note: best_selector.get_support() masks the transformed (scaled and one-hot encoded) columns,
# so it cannot be used to index X_train.columns directly.
# feature_names = X_train.columns[best_selector.get_support()]
coefs = best_model.coef_
print(best_estimator)
# print(f'Features from best selector: {feature_names}.')
print('Coefficient values: ')
print('===================')
# pd.DataFrame([coefs.T], columns = feature_names, index = ['model'])

Pipeline(steps=[('col_trans',
                 ColumnTransformer(transformers=[('num_pipeline',
                                                  Pipeline(steps=[('impute',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['EDUCATION', 'EXPERIENCE',
                                                   'AGE']),
                                                 ('cat_pipeline',
                                                  Pipeline(steps=[('impute',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('ohe',
                                                                   OneHotEncoder(handle_unknown=

## Lasso with Polynomial Features

In [19]:
# Lasso with Polynomial Features 
Pipe_pf_lasso = Pipeline([
    ('col_trans', col_trans),
    ('polyfeatures', PolynomialFeatures(degree = 2, include_bias = False)),
    ('scaler', StandardScaler()),
    ('lasso', Lasso(random_state = 42))
])
Pipe_pf_lasso.fit(X_train, y_train)
lasso_train_mse = mean_squared_error(y_train, Pipe_pf_lasso.predict(X_train))
lasso_test_mse = mean_squared_error(y_test, Pipe_pf_lasso.predict(X_test))
print(lasso_train_mse)
print(lasso_test_mse)
scores = cross_val_score(Pipe_pf_lasso, X_train, y_train, cv = 5, scoring='neg_root_mean_squared_error')
print(scores)
print("Mean score of %0.2f with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

0.21263086057655384
0.2221058364513119
[-0.41648133 -0.48654272 -0.4379192  -0.47816861 -0.48847584]
Mean score of -0.46 with a standard deviation of 0.03


In [60]:
r = permutation_importance(Pipe_pf_lasso, X_test, y_test, n_repeats=30, random_state=0)
for i in r.importances_mean.argsort()[::-1]:
    print(f"{X.columns[i]:<8}"
          f"  {r.importances_mean[i]:.3f}"
          f" +/- {r.importances_std[i]:.3f}")

MARR      0.000 +/- 0.000
SECTOR    0.000 +/- 0.000
OCCUPATION  0.000 +/- 0.000
RACE      0.000 +/- 0.000
AGE       0.000 +/- 0.000
UNION     0.000 +/- 0.000
EXPERIENCE  0.000 +/- 0.000
SEX       0.000 +/- 0.000
SOUTH     0.000 +/- 0.000
EDUCATION  0.000 +/- 0.000


In [20]:
columns = Pipe_pf_lasso.named_steps['polyfeatures'].get_feature_names_out()
columns

array(['x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10',
       'x11', 'x12', 'x13', 'x14', 'x15', 'x16', 'x17', 'x18', 'x19',
       'x20', 'x21', 'x22', 'x0^2', 'x0 x1', 'x0 x2', 'x0 x3', 'x0 x4',
       'x0 x5', 'x0 x6', 'x0 x7', 'x0 x8', 'x0 x9', 'x0 x10', 'x0 x11',
       'x0 x12', 'x0 x13', 'x0 x14', 'x0 x15', 'x0 x16', 'x0 x17',
       'x0 x18', 'x0 x19', 'x0 x20', 'x0 x21', 'x0 x22', 'x1^2', 'x1 x2',
       'x1 x3', 'x1 x4', 'x1 x5', 'x1 x6', 'x1 x7', 'x1 x8', 'x1 x9',
       'x1 x10', 'x1 x11', 'x1 x12', 'x1 x13', 'x1 x14', 'x1 x15',
       'x1 x16', 'x1 x17', 'x1 x18', 'x1 x19', 'x1 x20', 'x1 x21',
       'x1 x22', 'x2^2', 'x2 x3', 'x2 x4', 'x2 x5', 'x2 x6', 'x2 x7',
       'x2 x8', 'x2 x9', 'x2 x10', 'x2 x11', 'x2 x12', 'x2 x13', 'x2 x14',
       'x2 x15', 'x2 x16', 'x2 x17', 'x2 x18', 'x2 x19', 'x2 x20',
       'x2 x21', 'x2 x22', 'x3^2', 'x3 x4', 'x3 x5', 'x3 x6', 'x3 x7',
       'x3 x8', 'x3 x9', 'x3 x10', 'x3 x11', 'x3 x12', 'x3 x13', 'x3 x14',
       'x

In [21]:
Pipe_pf_lasso.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'col_trans', 'polyfeatures', 'scaler', 'lasso', 'col_trans__n_jobs', 'col_trans__remainder', 'col_trans__sparse_threshold', 'col_trans__transformer_weights', 'col_trans__transformers', 'col_trans__verbose', 'col_trans__verbose_feature_names_out', 'col_trans__num_pipeline', 'col_trans__cat_pipeline', 'col_trans__num_pipeline__memory', 'col_trans__num_pipeline__steps', 'col_trans__num_pipeline__verbose', 'col_trans__num_pipeline__impute', 'col_trans__num_pipeline__scaler', 'col_trans__num_pipeline__impute__add_indicator', 'col_trans__num_pipeline__impute__copy', 'col_trans__num_pipeline__impute__fill_value', 'col_trans__num_pipeline__impute__missing_values', 'col_trans__num_pipeline__impute__strategy', 'col_trans__num_pipeline__impute__verbose', 'col_trans__num_pipeline__scaler__copy', 'col_trans__num_pipeline__scaler__with_mean', 'col_trans__num_pipeline__scaler__with_std', 'col_trans__cat_pipeline__memory', 'col_trans__cat_pipeline__steps', 'col

In [22]:
# Using GridSearch to find the best polynomial degree (the Lasso alpha stays at its default of 1.0)
parameters = {'polyfeatures__degree': [1, 2, 3, 4, 5]}
grid_search  = GridSearchCV(Pipe_pf_lasso, parameters, scoring = 'neg_mean_squared_error', cv = 5)
grid_search.fit(X_train, y_train)
grid_search.best_params_
# best_model = grid_search.best_estimator_
# best_model

 

{'polyfeatures__degree': 1}

In [23]:
grid_search.cv_results_

{'mean_fit_time': array([0.00785751, 0.00926814, 0.01756983, 0.07701535, 0.7259902 ]),
 'std_fit_time': array([0.00133779, 0.00135696, 0.00277343, 0.02031496, 0.07377824]),
 'mean_score_time': array([0.00345993, 0.00400586, 0.00478821, 0.01067848, 0.03652492]),
 'std_score_time': array([0.00033551, 0.00066472, 0.00049428, 0.00186793, 0.01481098]),
 'param_polyfeatures__degree': masked_array(data=[1, 2, 3, 4, 5],
              mask=[False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'polyfeatures__degree': 1},
  {'polyfeatures__degree': 2},
  {'polyfeatures__degree': 3},
  {'polyfeatures__degree': 4},
  {'polyfeatures__degree': 5}],
 'split0_test_score': array([-0.17345669, -0.17345669, -0.17345669, -0.17345669, -0.17345669]),
 'split1_test_score': array([-0.23672382, -0.23672382, -0.23672382, -0.23672382, -0.23672382]),
 'split2_test_score': array([-0.19177323, -0.19177323, -0.19177323, -0.19177323, -0.19177323]),
 'split3_test_score': 

In [24]:
grid_search.best_estimator_.predict(X_train)    # same as grid_search.predict(X_train) when refit=True (the default)

array([2.16170443, 2.16170443, 2.16170443, 2.16170443, 2.16170443,
       2.16170443, 2.16170443, 2.16170443, 2.16170443, 2.16170443,
       2.16170443, 2.16170443, 2.16170443, 2.16170443, 2.16170443,
       2.16170443, 2.16170443, 2.16170443, 2.16170443, 2.16170443,
       2.16170443, 2.16170443, 2.16170443, 2.16170443, 2.16170443,
       2.16170443, 2.16170443, 2.16170443, 2.16170443, 2.16170443,
       2.16170443, 2.16170443, 2.16170443, 2.16170443, 2.16170443,
       2.16170443, 2.16170443, 2.16170443, 2.16170443, 2.16170443,
       2.16170443, 2.16170443, 2.16170443, 2.16170443, 2.16170443,
       2.16170443, 2.16170443, 2.16170443, 2.16170443, 2.16170443,
       2.16170443, 2.16170443, 2.16170443, 2.16170443, 2.16170443,
       2.16170443, 2.16170443, 2.16170443, 2.16170443, 2.16170443,
       2.16170443, 2.16170443, 2.16170443, 2.16170443, 2.16170443,
       2.16170443, 2.16170443, 2.16170443, 2.16170443, 2.16170443,
       2.16170443, 2.16170443, 2.16170443, 2.16170443, 2.16170

In [25]:
# Label each Lasso coefficient with its polynomial feature name
poly_names = Pipe_pf_lasso.named_steps['polyfeatures'].get_feature_names_out()
pf_df = pd.DataFrame(Pipe_pf_lasso.named_steps['lasso'].coef_, index=poly_names).T

In [26]:
pf_df.head()

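| | x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | ... | x19^2 | x19 x20 | x19 x21 | x19 x22 | x20^2 | x20 x21 | x20 x22 | x21^2 | x21 x22 | x22^2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.0 | -0.0 | 0.0 | 0.0 | -0.0 | -0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | -0.0 | 0.0 | -0.0 | 0.0 | 0.0 | -0.0 |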
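The all-zero coefficients and the constant predictions above are consistent with `Lasso`'s default `alpha=1.0` being large relative to the scale of a log-transformed target, in which case every degree in the grid scores the same. The sketch below illustrates the effect on synthetic data and searches over `alpha` instead; the data, the variable names, and the candidate `alpha` values are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
lasso_X = rng.normal(size=(300, 5))
# Small-scale target, similar in spirit to a log-transformed wage.
lasso_y = 2.0 + 0.3 * lasso_X[:, 0] + rng.normal(0, 0.1, size=300)

# With alpha=1.0 the penalty outweighs the signal and all coefficients are zero.
default_lasso = make_pipeline(StandardScaler(), Lasso(alpha=1.0)).fit(lasso_X, lasso_y)
print(default_lasso[-1].coef_)

# Searching over alpha lets cross-validation pick a weaker penalty.
alpha_search = GridSearchCV(
    make_pipeline(StandardScaler(), Lasso(max_iter=10_000)),
    param_grid={"lasso__alpha": [0.001, 0.01, 0.1, 1.0]},
    scoring="neg_mean_squared_error",
    cv=5,
)
alpha_search.fit(lasso_X, lasso_y)
print(alpha_search.best_params_)
```

In this notebook the equivalent change would be to add `'lasso__alpha'` to the parameter grid for `Pipe_pf_lasso`.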


## Sequential Features

In [24]:
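# Sequential feature selection with linear regression on the numeric columns only
num_cols = ['EDUCATION', 'EXPERIENCE', 'AGE']
cat_cols = ['SOUTH', 'SEX', 'UNION', 'RACE', 'OCCUPATION', 'SECTOR', 'MARR']
X_num = wages[num_cols]
y = wages['WAGE']
# Split X_num (not the full X) and use separate target names so the earlier y_train/y_test are not overwritten
X_num_train, X_num_test, y_num_train, y_num_test = train_test_split(X_num, np.log1p(y), test_size = 0.3)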

In [25]:
selector = SequentialFeatureSelector(estimator=LinearRegression(),
                                    n_features_to_select=2,
                                    cv = 5,
                                    scoring = 'neg_mean_squared_error')
Xt = selector.fit_transform(X_num, y)

In [27]:
# Reuse the selector fitted in the previous cell instead of fitting it a second time
best_2 = pd.DataFrame(selector.transform(X_num), columns = selector.get_feature_names_out())
best_2

| | EDUCATION | EXPERIENCE |
|---|---|---|
| 0 | 8.0 | 21.0 |
| 1 | 9.0 | 42.0 |
| 2 | 12.0 | 1.0 |
| 3 | 12.0 | 4.0 |
| 4 | 12.0 | 17.0 |
| ... | ... | ... |
| 529 | 18.0 | 5.0 |
| 530 | 12.0 | 33.0 |
| 531 | 17.0 | 25.0 |
| 532 | 12.0 | 13.0 |


## Dummies > PolynomialFeatures > SequentialFeatureSelector

In [62]:
# One-hot encode the categorical columns with pandas
wages_no_wage = wages.drop(columns = 'WAGE')
wages_no_wage_dum = pd.get_dummies(wages_no_wage, columns = ['SOUTH', 'SEX', 'UNION', 'RACE', 'OCCUPATION', 'SECTOR', 'MARR'])
# Expand to all degree-2 terms (squares and pairwise interactions)
pfeatures = PolynomialFeatures(degree = 2, include_bias = False)
quad_features = pfeatures.fit_transform(wages_no_wage_dum)
poly_features_df = pd.DataFrame(quad_features, columns = pfeatures.get_feature_names_out())
# Forward selection of the four terms that best predict WAGE
selector2 = SequentialFeatureSelector(estimator=LinearRegression(),
                                    n_features_to_select=4,
                                    cv = 5,
                                    scoring = 'neg_mean_squared_error')
Xt = selector2.fit_transform(poly_features_df, y)
best_4 = pd.DataFrame(Xt, columns = selector2.get_feature_names_out())
best_4

| | EDUCATION^2 | EDUCATION OCCUPATION_Sales | EXPERIENCE SEX_male | SOUTH_yes UNION_not_member |
|---|---|---|---|---|
| 0 | 64.0 | 0.0 | 0.0 | 0.0 |
| 1 | 81.0 | 0.0 | 0.0 | 0.0 |
| 2 | 144.0 | 0.0 | 1.0 | 0.0 |
| 3 | 144.0 | 0.0 | 4.0 | 0.0 |
| 4 | 144.0 | 0.0 | 17.0 | 0.0 |
| ... | ... | ... | ... | ... |
| 529 | 324.0 | 0.0 | 5.0 | 0.0 |
| 530 | 144.0 | 0.0 | 0.0 | 0.0 |
| 531 | 289.0 | 0.0 | 0.0 | 0.0 |
| 532 | 144.0 | 0.0 | 13.0 | 0.0 |


## The Same Steps as a Pipeline (Dummies > PolynomialFeatures > SequentialFeatureSelector > Ridge)

In [64]:
selector = SequentialFeatureSelector(estimator=Ridge(),
                                    n_features_to_select= 4,
                                    cv = 5,
                                    scoring = 'neg_mean_squared_error')

Pipe_pf__sf_ridge = Pipeline([
    ('col_trans', col_trans),
    ('polyfeatures', PolynomialFeatures(degree = 2, include_bias = False)),
    ('seq_feat', selector),
    ('scaler', StandardScaler()),
    ('ridge', Ridge(alpha = 0.1))
])
Pipe_pf__sf_ridge.fit(X_train, y_train)

Train_MSE = mean_squared_error(y_train, Pipe_pf__sf_ridge.predict(X_train))
Test_MSE = mean_squared_error(y_test, Pipe_pf__sf_ridge.predict(X_test))
print(f'Train MSE: {Train_MSE}')
print(f'Test MSE: {Test_MSE}')

Train MSE: 0.19383284471899445
Test MSE: 0.23329498575724053


In [65]:
r = permutation_importance(Pipe_pf__sf_ridge, X_test, y_test, n_repeats=30, random_state=0)
for i in r.importances_mean.argsort()[::-1]:
    print(f"{X.columns[i]:<8}"
          f"  {r.importances_mean[i]:.3f}"
          f" +/- {r.importances_std[i]:.3f}")

OCCUPATION  0.095 +/- 0.045
EDUCATION  0.078 +/- 0.041
MARR      0.000 +/- 0.000
SECTOR    0.000 +/- 0.000
RACE      0.000 +/- 0.000
AGE       0.000 +/- 0.000
UNION     0.000 +/- 0.000
EXPERIENCE  0.000 +/- 0.000
SOUTH     0.000 +/- 0.000
SEX       -0.045 +/- 0.020


In [204]:
Pipe_pf__sf_ridge.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'col_trans', 'polyfeatures', 'seq_feat', 'scaler', 'ridge', 'col_trans__n_jobs', 'col_trans__remainder', 'col_trans__sparse_threshold', 'col_trans__transformer_weights', 'col_trans__transformers', 'col_trans__verbose', 'col_trans__verbose_feature_names_out', 'col_trans__num_pipeline', 'col_trans__cat_pipeline', 'col_trans__num_pipeline__memory', 'col_trans__num_pipeline__steps', 'col_trans__num_pipeline__verbose', 'col_trans__num_pipeline__impute', 'col_trans__num_pipeline__scaler', 'col_trans__num_pipeline__impute__add_indicator', 'col_trans__num_pipeline__impute__copy', 'col_trans__num_pipeline__impute__fill_value', 'col_trans__num_pipeline__impute__missing_values', 'col_trans__num_pipeline__impute__strategy', 'col_trans__num_pipeline__impute__verbose', 'col_trans__num_pipeline__scaler__copy', 'col_trans__num_pipeline__scaler__with_mean', 'col_trans__num_pipeline__scaler__with_std', 'col_trans__cat_pipeline__memory', 'col_trans__cat_pipeline__

In [34]:
# Using GridSearch to find best alpha
parameters = {# 'polyfeatures__degree': [1, 2, 3],
             # 'seq_feat__n_features_to_select': [4, 6, 8,10],
             'ridge__alpha': [0.1,1,10]}
grid_search  = GridSearchCV(Pipe_pf__sf_ridge, parameters, scoring = 'neg_mean_squared_error', cv = 5)
grid_search.fit(X_train, y_train)
grid_search.best_params_
# grid_search.best_estimator_ holds the pipeline refit with the best alpha (refit=True is the default);
# grid_search.cv_results_ holds the per-fold scores for every alpha tried

{'ridge__alpha': 10}

## Next: Ridge with GridSearchCV to Find the Best Alpha (Use linspace for Alpha Values)
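A minimal sketch of this idea follows, assuming a simple scaler-plus-`Ridge` pipeline and synthetic data so that it runs on its own. The names `ridge_X`, `ridge_y`, `ridge_demo_pipe`, and the range passed to `np.linspace` are hypothetical choices, not values taken from this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (hypothetical), on a scale similar to log wages.
rng = np.random.default_rng(3)
ridge_X = rng.normal(size=(200, 4))
ridge_y = 2.0 + ridge_X @ np.array([0.3, -0.2, 0.0, 0.1]) + rng.normal(0, 0.2, size=200)

ridge_demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", Ridge()),
])

# 50 evenly spaced candidate alphas between 0.1 and 100.
alphas = np.linspace(0.1, 100, 50)
ridge_search = GridSearchCV(
    ridge_demo_pipe,
    param_grid={"model__alpha": alphas},
    scoring="neg_mean_squared_error",
    cv=5,
)
ridge_search.fit(ridge_X, ridge_y)
print(ridge_search.best_params_)
print(-ridge_search.best_score_)  # cross-validated MSE of the best alpha
```

In this notebook the same grid could be passed to `GridSearchCV(ridge_pipeline, ...)` and fit on `X_train`, `y_train`. Because regularization strength often matters on a multiplicative scale, `np.logspace` is a common alternative to `np.linspace` for the candidate values.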