# Python Coding Exercise for Module 2, Lesson 2: Subset Selection

Create a Python function that performs forward stepwise selection to identify a subset of predictor variables that are most related to a target variable in a linear regression context.

### Your task is to write a function that does the following:

- Accept three parameters: X, a pandas DataFrame of predictor variables; y, a pandas Series holding the target variable; and criterion, a string naming the metric used for variable selection ('AIC', 'BIC', or 'R-squared'). The starter code names the first two parameters `data` and `response_variable`.
- Start with an empty model and, at each step, add the variable that improves the score the most under the specified criterion. Stop when no remaining variable improves the score.
- Return a list of (features, score) tuples, one for each model built along the way, sorted by score from low to high.

### Constraints:

- Use the statsmodels library to fit linear regression models.
- Assume that X and y have been preprocessed and are ready for modeling (e.g., no missing values or categorical variables needing encoding).

### Important Considerations:

Although we have not discussed this yet, smaller AIC and BIC scores indicate better models, so you will need to adapt the earlier forward stepwise selection code to prefer smaller values over larger ones. Several helper functions are provided below; you can use them in your solution and when answering the questions.
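
To make the "smaller is better" convention concrete, here is a minimal sketch that recomputes AIC, BIC, and $R^2$ by hand with NumPy. It assumes a Gaussian log-likelihood and counts k as the number of fitted coefficients including the intercept, which should correspond to what statsmodels reports as `model.aic` and `model.bic`. The helper name `gaussian_ols_scores` and the synthetic data are made up for this illustration and are not part of the exercise.

```python
import numpy as np

def gaussian_ols_scores(X, y):
    """Fit OLS with an intercept and return (aic, bic, r_squared)."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])      # prepend the intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))  # residual sum of squares
    tss = float(np.sum((y - y.mean()) ** 2))       # total sum of squares
    k = design.shape[1]                            # fitted coefficients, incl. intercept
    log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    aic = 2 * k - 2 * log_lik                      # penalty of 2 per coefficient
    bic = k * np.log(n) - 2 * log_lik              # penalty of ln(n) per coefficient
    return aic, bic, 1 - rss / tss

# Synthetic data (illustration only): one useful predictor, one pure-noise column.
rng = np.random.default_rng(0)
n = 100
signal = rng.normal(size=n)
noise_col = rng.normal(size=n)
y = 3.0 * signal + rng.normal(scale=0.5, size=n)

small = gaussian_ols_scores(signal.reshape(-1, 1), y)
large = gaussian_ols_scores(np.column_stack([signal, noise_col]), y)

for name, (aic, bic, r2) in (("1 predictor", small), ("2 predictors", large)):
    print(f"{name}: AIC={aic:.1f}  BIC={bic:.1f}  R^2={r2:.4f}")
```

$R^2$ cannot go down when a column is added, whereas AIC and BIC charge a penalty for every extra coefficient, so an added variable has to improve the fit by enough to offset that penalty. BIC's per-coefficient penalty, ln(n), exceeds AIC's penalty of 2 once n is 8 or more.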

In [None]:
import pandas as pd
from sklearn.datasets import make_regression
import statsmodels.api as sm

Below is a function you can use inside forward_stepwise_selection to score your models. 

In [None]:
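def score(criterion, model):
    """Return a lower-is-better score for a fitted statsmodels OLS result."""
    if criterion == 'AIC':
        return model.aic
    if criterion == 'BIC':
        return model.bic
    if criterion == 'R-squared':
        return -model.rsquared  # negated so that lower is better, like AIC and BIC
    raise ValueError(f"Unknown criterion: {criterion!r}")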
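
Because `score` negates $R^2$, a single "pick the minimum" rule works for all three criteria. The sketch below illustrates this with stand-in objects rather than real fitted models: the `SimpleNamespace` instances and their numbers are hypothetical and only mimic the `aic`, `bic`, and `rsquared` attributes of a statsmodels result.

```python
from types import SimpleNamespace

def score(criterion, model):
    # Same scoring rule as in the notebook: lower is always better.
    if criterion == 'AIC':
        return model.aic
    if criterion == 'BIC':
        return model.bic
    if criterion == 'R-squared':
        return -model.rsquared  # negated since higher R-squared is better

# Hypothetical fit results; the numbers are invented for illustration only.
candidates = {
    "model_a": SimpleNamespace(aic=210.0, bic=215.2, rsquared=0.81),
    "model_b": SimpleNamespace(aic=198.5, bic=206.3, rsquared=0.88),
}

for criterion in ("AIC", "BIC", "R-squared"):
    # The same min() call selects the preferred model under every criterion.
    best = min(candidates, key=lambda name: score(criterion, candidates[name]))
    print(criterion, "->", best)
```

One consequence of this design is that the comparison inside the selection loop never has to branch on the criterion.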

In [None]:
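def fit_model(selected_features):
    # Fit an OLS model with an intercept on the chosen columns.
    # Note: this reads the notebook-level X and y, which are defined in a later cell.
    model = sm.OLS(y, sm.add_constant(X[selected_features])).fit()
    return model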

def forward_stepwise_selection(data, response_variable, criterion='AIC'):
    # Initialize variables


    # Loop through the features
    
        # Initialize best_candidate_score, best_candidate, and trial_columns
        # Keep in mind that smaller scores are better. What should the initial score be?

        # for each predictor, consider the addition, the model, and the associated score


            # if the score is better, update best_candidate and best_candidate_score

    
    # Return the list of (features, score) tuples, sorted by score from low to high


In [None]:
# Generate a dataset for demonstration (100 samples, 10 features).
# make_regression defaults to n_informative=10, so all 10 features carry signal here.
X, y = make_regression(n_samples=100, n_features=10, noise=0.1, random_state=42)
X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(1, 11)])
y = pd.Series(y, name='target')

In [None]:
# What do the statistics on the full model tell us?

full_model = fit_model([f'feature_{i}' for i in range(1, 11)])
print(full_model.summary())

In [None]:
# Run forward stepwise selection; change criterion to 'AIC' or 'BIC' to answer the questions below
models = forward_stepwise_selection(X, y, criterion='R-squared')
print(models)

Here, we can plot the list of models, showing each model's number of features against its score. What does this graph suggest?

In [None]:
import matplotlib.pyplot as plt

def plot_features_vs_score(data):
    # Unpack the number of features and their associated scores
    num_features = [len(features) for features, score in data]
    scores = [score for features, score in data]
    
    # Create a plot
    plt.figure(figsize=(10, 5))
    plt.plot(num_features, scores, marker='o')
    
    # Setting the axis labels
    plt.xlabel('Number of Features')
    plt.ylabel('Score')
    
    # Set title
    plt.title('Number of Features vs. Score')
    
    # Show grid
    plt.grid(True)
    
    # Display the plot
    plt.show()

plot_features_vs_score(models)

## Questions:
Based on your work above, answer the following questions. 

1. When using AIC to score, what is the recommended model? What is the score this model receives? Does the graph confirm this?
2. When using BIC to score, what is the recommended model? What is the score this model receives? Does the graph confirm this?
3. When using $R^2$ to score, we know that adding a variable never lowers $R^2$, so the full model scores at least as well as any smaller one. What score did the full model receive? Based on the graph, which model would you pick, and what score did it receive?
4. Considering all of the results you've looked at, what model would you recommend using?