**INSTRUCTIONS**

Every learner must submit their own homework solutions. You may discuss the homework with each other, but you may not copy someone else's solution.

The homework consists of two parts:

    1. Data from our lives
    2. Variable selection

Follow the prompts in the attached jupyter notebook.

**We are using the same data as for the previous homework**. Use the version you created called **df2** where you already cleaned, dropped some of the variables and also created the dummy variables.

Add markdown cells to your analysis to include your solutions, comments, and answers. Add as many cells as you need, and comment your code where possible for readability. Hopefully this homework will help you develop your skills, understand the flow of an EDA, and prepare you for individual work.

**Note:** This homework has a bonus question, so the highest mark that can be earned is a 105.

Submission: Send in both an ipynb and a pdf file of your work.

Good luck!


# 1. Data from our lives:

### Describe a situation or problem from your job, everyday life, current events, etc., for which a variable selection/feature reduction would be appropriate.

*Your Answer:*

In variable selection, the goal is to identify the most relevant features or variables that significantly contribute to the predictive power of the model. For a cricket scorecard predictive model, these variables might include:

**Player Form:**

Recent performance of key players, such as runs scored, wickets taken, and batting/bowling averages.

**Team Composition:**

The combination of players in the playing XI, considering factors like the balance between batsmen and bowlers, and the presence of all-rounders.

**Match Conditions:**

Weather conditions, pitch type, and venue-specific factors that could influence the performance of players and teams.

**Head-to-Head Records:**

Historical performance of teams against each other, providing insights into team dynamics and potential psychological advantages.

**Toss Outcome:**

The impact of winning or losing the toss on the outcome of the match, as it can influence the team's strategy.

**Feature Reduction:**

Feature reduction involves techniques that decrease the number of input variables in the model, simplifying its structure without significantly compromising its predictive accuracy. In the cricket scorecard predictive model, feature reduction techniques include:

**Correlation Analysis:**

Identifying and removing highly correlated features, as they may provide redundant information.

**Principal Component Analysis (PCA):**

Transforming the original variables into a smaller set of uncorrelated variables, known as principal components, while retaining as much of the original variability as possible.

**Recursive Feature Elimination (RFE):**

Iteratively removing the least significant features from the model until the optimal set of features is obtained.

In summary, variable selection and feature reduction are crucial steps in developing a predictive model for a cricket scorecard. They help streamline the model, enhance its interpretability, and improve its efficiency in making accurate predictions about match outcomes and player performances.
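The correlation analysis described above can be sketched as follows. This is a minimal illustration on made-up data, not part of the homework dataset: the column names (recent_runs, batting_average, wickets_taken, won_toss), the helper drop_correlated and the 0.9 threshold are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Synthetic, made-up match data; the column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
recent_runs = rng.normal(35, 10, n)
matches = pd.DataFrame({
    "recent_runs": recent_runs,
    # Nearly a rescaled copy of recent_runs, so it carries redundant information.
    "batting_average": recent_runs * 0.9 + rng.normal(0, 1, n),
    "wickets_taken": rng.poisson(2, n).astype(float),
    "won_toss": rng.integers(0, 2, n).astype(float),
})


def drop_correlated(frame, threshold=0.9):
    """Drop the later column of every pair whose |correlation| exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop


reduced, dropped = drop_correlated(matches)
print("Dropped:", dropped)
print("Kept:", list(reduced.columns))
```

Dropping the later column of each correlated pair is an arbitrary tie-break; in practice one might instead keep whichever column is easier to collect or interpret.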

In [123]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [124]:
df = pd.read_csv('df2_dataset.csv')
df.head()

Unnamed: 0,wheel_base,length,width,heights,curb_weight,engine_size,bore,stroke,comprassion,horse_power,peak_rpm,city_mpg,highway_mpg,price,fuel_type_gas
0,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111.0,5000.0,21,27,13495,1
1,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111.0,5000.0,21,27,16500,1
2,94.5,171.2,65.5,52.4,2823,152,2.68,3.47,9.0,154.0,5000.0,19,26,16500,1
3,99.8,176.6,66.2,54.3,2337,109,3.19,3.4,10.0,102.0,5500.0,24,30,13950,1
4,99.4,176.6,66.4,54.3,2824,136,3.19,3.4,8.0,115.0,5500.0,18,22,17450,1


The df2 DataFrame from the previous homework was saved to CSV using [df2.to_csv('cleaned_data.csv', index=False)], and the saved file is read back in here as df.


In [125]:
# Replace the '?' placeholder with NaN so that pandas treats it as missing
df.replace('?', np.nan, inplace=True)
df.head(10)

Unnamed: 0,wheel_base,length,width,heights,curb_weight,engine_size,bore,stroke,comprassion,horse_power,peak_rpm,city_mpg,highway_mpg,price,fuel_type_gas
0,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111.0,5000.0,21,27,13495,1
1,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111.0,5000.0,21,27,16500,1
2,94.5,171.2,65.5,52.4,2823,152,2.68,3.47,9.0,154.0,5000.0,19,26,16500,1
3,99.8,176.6,66.2,54.3,2337,109,3.19,3.4,10.0,102.0,5500.0,24,30,13950,1
4,99.4,176.6,66.4,54.3,2824,136,3.19,3.4,8.0,115.0,5500.0,18,22,17450,1
5,99.8,177.3,66.3,53.1,2507,136,3.19,3.4,8.5,110.0,5500.0,19,25,15250,1
6,105.8,192.7,71.4,55.7,2844,136,3.19,3.4,8.5,110.0,5500.0,19,25,17710,1
7,105.8,192.7,71.4,55.7,2954,136,3.19,3.4,8.5,110.0,5500.0,19,25,18920,1
8,105.8,192.7,71.4,55.9,3086,131,3.13,3.4,8.3,140.0,5500.0,17,20,23875,1
9,101.2,176.8,64.8,54.3,2395,108,3.5,2.8,8.8,101.0,5800.0,23,29,16430,1


The code replaces every occurrence of the placeholder '?' with a missing-value marker and then displays the first 10 rows of the DataFrame to inspect the result. This is a common preprocessing step when a dataset encodes unknown values with a sentinel string.
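As a small sketch of the same preprocessing step, the example below uses a made-up three-row frame (not the homework data) to show how the replacement can be verified by counting missing values afterwards. The frame and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Small made-up frame; '?' marks unknown values, as in the raw auto data.
raw = pd.DataFrame({
    "bore": ["3.47", "?", "3.19"],
    "horse_power": ["111", "154", "?"],
    "price": [13495, 16500, 13950],
})

# Replace the sentinel with NaN so pandas treats it as missing.
clean = raw.replace("?", np.nan)

# Columns that held '?' are still stored as strings; convert them to numbers.
for col in ["bore", "horse_power"]:
    clean[col] = pd.to_numeric(clean[col])

# Count missing values per column to confirm the replacement worked.
missing_counts = clean.isna().sum()
print(missing_counts)
```

Checking the counts (and the column dtypes) makes it clear whether any placeholder values were actually present, which the first rows of a DataFrame alone cannot show.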



# 2. Variable selection

In our class so far we covered three types of feature selection techniques. They were:
1. Filter methods
2. Wrapper methods
3. Embedded methods

Use the dataset 'auto_imports1.csv' from our previous homework. More specifically, use the version you created called **df2** where you already cleaned, dropped some of the variables and also created the dummy variables.

### 2.1. Filter methods

Choose one (you may do more, one is required) of the filter methods to conduct variable selection. Report your findings.

In [126]:
from sklearn.feature_selection import SelectKBest, f_regression

# Assuming 'df' is DataFrame with the specified columns
# 'price' is the continuous target variable

# Extract features and target variable
features = df.drop(['price'], axis=1)
target = df['price']
# Ensure numeric data (might need to encode categorical variables)
features = pd.get_dummies(features)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Choose a predictive model (Linear Regression in this case)
model = LinearRegression()

# Use SelectKBest for feature selection
k_best = SelectKBest(score_func=f_regression, k=5)

# Fit the selector on the training set only
k_best.fit(X_train, y_train)
selected_features_kbest = X_train.columns[k_best.get_support()]

# Display the selected features using SelectKBest
print("Selected Features using SelectKBest:")
print(selected_features_kbest)

# Transform both training and testing sets with selected features
X_train_kbest = k_best.transform(X_train)
X_test_kbest = k_best.transform(X_test)

# Train the model with the selected features
model.fit(X_train_kbest, y_train)

# Make predictions on the test set
predictions = model.predict(X_test_kbest)

# Evaluate the model performance using mean squared error for regression
mse = mean_squared_error(y_test, predictions)
print(f"Mean Squared Error with selected features: {mse:.2f}")


Selected Features using SelectKBest:
Index(['width', 'curb_weight', 'engine_size', 'horse_power', 'highway_mpg'], dtype='object')
Mean Squared Error with selected features: 20660233.85


In [127]:
df_selected.head()

Unnamed: 0,width,bore,stroke,city_mpg,fuel_type_gas
0,64.1,3.47,2.68,21,1
1,64.1,3.47,2.68,21,1
2,65.5,2.68,3.47,19,1
3,66.2,3.19,3.4,24,1
4,66.4,3.19,3.4,18,1


**Data Preparation:**

Extracts the features and the target variable (price) from the DataFrame (df), then applies one-hot encoding with pd.get_dummies so that all features are numeric.

**Train-Test Split:**

Splits the dataset into training and testing sets (80% training, 20% testing) using train_test_split.

**Model Selection:**

Chooses a linear regression model (LinearRegression) as the predictive model.

**Feature Selection using SelectKBest:**

Uses SelectKBest with the f_regression score function, which ranks each feature by the F-statistic of its univariate linear relationship with the target, and keeps the top 5 features. The selector is fit on the training set only.

**Display Selected Features:**

Prints the names of the selected features.

**Transforming Both Training and Testing Sets:**

Transforms both the training and testing sets so that they contain only the selected features.

**Model Training:**

Trains the linear regression model on the training set containing the selected features.

**Making Predictions:**

Uses the trained model to make predictions on the testing set.

**Model Evaluation:**

Evaluates the performance of the model on the test set using mean squared error (mean_squared_error).

In summary, the code performs feature selection using the f_regression score and SelectKBest, then trains a linear regression model using only the top 5 selected features: width, curb_weight, engine_size, horse_power and highway_mpg. The resulting test-set mean squared error is about 20.66 million. Restricting the model to the most relevant features aims to make it simpler and easier to interpret.
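To see why SelectKBest picks particular features, the fitted selector's scores can be inspected directly. The sketch below uses synthetic stand-in data rather than df; the column names x_strong, x_weak and x_noise are hypothetical. It relies on the scores_, pvalues_ and get_support() members of a fitted SelectKBest.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in data; only x_strong and x_weak drive the target.
rng = np.random.default_rng(42)
n = 300
X_demo = pd.DataFrame({
    "x_strong": rng.normal(size=n),
    "x_weak": rng.normal(size=n),
    "x_noise": rng.normal(size=n),
})
y_demo = 5.0 * X_demo["x_strong"] + 2.0 * X_demo["x_weak"] + rng.normal(size=n)

selector = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)

# scores_ holds each feature's univariate F-statistic; pvalues_ the p-values.
ranking = pd.DataFrame(
    {
        "f_score": selector.scores_,
        "p_value": selector.pvalues_,
        "selected": selector.get_support(),
    },
    index=X_demo.columns,
).sort_values("f_score", ascending=False)
print(ranking)
```

Because each feature is scored on its own, a filter method like this cannot account for redundancy between features; two strongly correlated columns can both rank highly.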



### 2.2. Wrapper methods

Choose one (you may do more, one is required) of the wrapper methods to conduct variable selection. Report your findings.

In [128]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Assuming 'df' is your DataFrame with the specified columns
# 'price' is the continuous target variable

# Extract features and target variable
X = df.drop(['price'], axis=1)
y = df['price']

# Ensure numeric data
X = pd.get_dummies(X)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize RandomForestRegressor
algo = RandomForestRegressor(n_estimators=100)

# Initialize RFE
rfe = RFE(algo, n_features_to_select=10)

# Fit RFE to the entire dataset
model = rfe.fit(X, y)

# Get the selected feature indices
selected_feature_indices_rfe = model.support_

# Display the selected features using RFE
selected_features = X.columns[selected_feature_indices_rfe]
print("Selected Features using RFE:")
print(selected_features)

# Train RandomForestRegressor on the selected features
algo.fit(X_train[selected_features], y_train)

# Get feature importances
feature_importances = algo.feature_importances_

# Display feature importances
print("Feature Importances:")
print(feature_importances)

# Make predictions on the test set
predictions = algo.predict(X_test[selected_features])

# Evaluate the model performance using mean squared error for regression
mse = mean_squared_error(y_test, predictions)
print("Mean Squared Error on Test Set with selected features:", mse)


Selected Features using RFE:
Index(['wheel_base', 'length', 'width', 'curb_weight', 'engine_size',
       'comprassion', 'horse_power', 'peak_rpm', 'city_mpg', 'highway_mpg'],
      dtype='object')
Feature Importances:
[0.01723533 0.01055113 0.01435721 0.34907272 0.49912348 0.00626503
 0.02848176 0.00720248 0.01483817 0.05287269]
Mean Squared Error on Test Set with selected features: 12259444.693161927


1. X and y are defined as the feature matrix and target variable, respectively, by extracting them from the DataFrame df.
2. pd.get_dummies(X) is used to one-hot encode categorical variables in the feature matrix X.
3. train_test_split is used to split the dataset into training and testing sets (X_train, X_test, y_train, y_test) with 80% for training and 20% for testing.
4. RandomForestRegressor is initialized as algo with 100 trees.
5. RFE (Recursive Feature Elimination) is initialized with algo as the estimator and the desired number of features to select (n_features_to_select=10).
6. rfe.fit(X, y) fits the RFE model to the entire dataset, ranking and selecting the top 10 features based on their importance.
7. model.support_ gives the boolean mask of selected features.
8. X.columns[selected_feature_indices_rfe] retrieves the names of the selected features, which are then printed.
9. algo.fit(X_train[selected_features], y_train) trains the RandomForestRegressor on the training set using only the selected features.
10. algo.feature_importances_ retrieves the feature importances from the trained model, which are then printed.
11. algo.predict(X_test[selected_features]) makes predictions on the test set using the selected features.
12. mean_squared_error calculates the mean squared error between the predicted and actual target values on the test set, and the result is printed as a measure of the model's performance.


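In summary, the code performs feature selection using Recursive Feature Elimination (RFE) with a RandomForestRegressor, trains the model on the 10 selected features, and evaluates its performance on the test set using mean squared error (about 12.26 million). The feature importances show that engine_size (about 0.50) and curb_weight (about 0.35) dominate the predictions. Because RFE was fit on the full dataset rather than on the training split alone, the test rows influenced the selection, so the test-set MSE may be somewhat optimistic.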
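A variant that keeps the test rows out of the selection step is to fit RFE on the training split only. The sketch below illustrates this on synthetic stand-in data (hypothetical columns f0 to f5, of which only f0 and f1 matter); it is not a rerun of the homework analysis, and the smaller forest (50 trees) is chosen only to keep the example fast.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: only f0 and f1 influence the target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 6)), columns=[f"f{i}" for i in range(6)])
y_demo = 4.0 * X_demo["f0"] - 3.0 * X_demo["f1"] + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit the selector on the training rows only, so the test rows stay unseen.
rfe_demo = RFE(
    RandomForestRegressor(n_estimators=50, random_state=0),
    n_features_to_select=2,
)
rfe_demo.fit(X_tr, y_tr)
chosen = list(X_tr.columns[rfe_demo.support_])

# RFE.predict applies the same column mask before calling the final estimator.
mse_demo = mean_squared_error(y_te, rfe_demo.predict(X_te))
print("Selected:", chosen)
print("Test MSE:", round(mse_demo, 3))
```

Setting random_state on the forest also makes the selection reproducible between runs, which the importance-based ranking otherwise is not guaranteed to be.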







### 2.3. Embedded methods

Choose one (you may do more, one is required) of the embedded methods to conduct variable selection. Report your findings.

In [129]:
# Choose a predictive model (Lasso Regression in this case)
model_lasso = Lasso(alpha=0.01)  # You can adjust the regularization strength (alpha) based on your data

# Fit Lasso to training data
model_lasso.fit(X_train, y_train)

# Get the coefficients and identify non-zero coefficients
coefficients = pd.Series(model_lasso.coef_, index=X_train.columns)
selected_features_lasso = coefficients[coefficients != 0].index

# Display the selected features using Lasso regression
print("Selected Features using Lasso Regression:")
print(selected_features_lasso)

# Evaluate the model on the test set. The model was fit on all columns, so it
# must be given all columns; zeroed coefficients already drop unselected features.
y_pred_lasso = model_lasso.predict(X_test)
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
print("Mean Squared Error on Test Set with selected features:", mse_lasso)
# Access the coefficients from the Lasso model
coefficients_embedded_filter = model_lasso.coef_

# Print or use the coefficients as needed
print("Coefficients using Lasso Regression:")
print(coefficients_embedded_filter)


Selected Features using Lasso Regression:
Index(['wheel_base', 'length', 'width', 'heights', 'curb_weight',
       'engine_size', 'bore', 'stroke', 'comprassion', 'horse_power',
       'peak_rpm', 'city_mpg', 'highway_mpg', 'fuel_type_gas'],
      dtype='object')
Mean Squared Error on Test Set with selected features: 17473007.61161927
Coefficients using Lasso Regression:
[ 5.60905142e+01 -9.10282486e+01  6.58026413e+02  2.68102731e+02
  1.85837941e+00  1.00770609e+02 -1.14496244e+03 -2.89406449e+03
  1.29488217e+01  4.77104788e+01  1.52362991e+00 -3.16227173e+02
  2.60942574e+02 -4.03246919e+03]


  model = cd_fast.enet_coordinate_descent(


1. The code initializes a Lasso regression model with regularization strength alpha=0.01. A higher alpha applies more regularization and pushes more coefficients to exactly zero.
2. The fit method trains the Lasso model on the training data (X_train and y_train).
3. model_lasso.coef_ retrieves the coefficient assigned to each feature by the trained model.
4. coefficients[coefficients != 0].index keeps the features with non-zero coefficients, which are the features selected by Lasso, and the code prints them.
5. Predictions on the test set use the features retained by Lasso, and mean_squared_error() computes the mean squared error between the predicted and actual values, which is then printed.

With alpha=0.01 all 14 coefficients are non-zero, so Lasso did not eliminate any feature here, and the test-set MSE is about 17.47 million. A larger alpha, ideally combined with standardized features, would be needed for Lasso to act as a feature selector on this data.
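One common way to choose alpha rather than fixing it by hand is cross-validation on standardized features. The sketch below shows this with LassoCV inside a pipeline, on synthetic stand-in data (hypothetical columns f0 to f5 on deliberately different scales, with only f0 and f1 affecting the target). It illustrates the idea and is not a result for the homework dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data on very different scales; f2..f5 are pure noise.
rng = np.random.default_rng(1)
n = 300
X_demo = pd.DataFrame({
    "f0": rng.normal(0, 1, n),
    "f1": rng.normal(0, 1000, n),
    "f2": rng.normal(0, 1, n),
    "f3": rng.normal(0, 50, n),
    "f4": rng.normal(0, 1, n),
    "f5": rng.normal(0, 500, n),
})
y_demo = 10.0 * X_demo["f0"] + 0.01 * X_demo["f1"] + rng.normal(0, 1, n)

# Standardize first so the L1 penalty treats every feature alike,
# then let 5-fold cross-validation choose alpha.
pipe = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
pipe.fit(X_demo, y_demo)

lasso_cv = pipe.named_steps["lassocv"]
coef = pd.Series(lasso_cv.coef_, index=X_demo.columns)
kept = list(coef[coef != 0].index)
print("Chosen alpha:", lasso_cv.alpha_)
print("Non-zero features:", kept)
```

The noise columns should receive zero or near-zero coefficients, although the alpha that minimizes cross-validated error can still leave a few small non-zero values.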

### 2.4. Compare your results
Compare your results from the three methods and also compare the coefficients to the full linear regression model (model1) from the previous homework.

In [136]:
from sklearn.linear_model import LinearRegression
model1 = LinearRegression()
model1.fit(X,y)

LinearRegression() creates an instance of the linear regression model, model1, which provides methods for training and prediction.
model1.fit(X, y) then trains the model on the full feature matrix X (the independent variables) and the target values y (price, the variable being predicted).

The fit method adjusts the parameters of the linear regression model to minimize the difference between the predicted values and the actual target values. In other words, it finds the best-fitting line (or hyperplane in the case of multiple features) that describes the relationship between the features and the target variable.

In [137]:
# Coefficients of the full linear regression model
model1.coef_

array([ 3.95305120e+01, -6.06332818e+01,  6.03641360e+02,  3.29566909e+02,
        1.17984542e+00,  1.38453665e+02, -1.20841366e+03, -3.70605308e+03,
       -6.17149719e+02,  3.46328246e+01,  2.55167287e+00, -2.88286769e+02,
        3.16633429e+02, -1.17330063e+04])

In [132]:
coefficients_embedded_filter


array([ 5.60905142e+01, -9.10282486e+01,  6.58026413e+02,  2.68102731e+02,
        1.85837941e+00,  1.00770609e+02, -1.14496244e+03, -2.89406449e+03,
        1.29488217e+01,  4.77104788e+01,  1.52362991e+00, -3.16227173e+02,
        2.60942574e+02, -4.03246919e+03])

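coefficients_embedded_filter and model1.coef_ contain coefficients for the same 14 features. Most agree in sign and are of similar size, but comprassion changes from about -617 in model1 to about 13 under Lasso, and fuel_type_gas shrinks from about -11,733 to about -4,032. Part of the difference may come from the data used: model1 was fit on all rows, while the Lasso model was fit on the training split only. Because the features are measured on very different scales, the raw coefficients are not comparable across features; standardizing the features before fitting would make the comparison more meaningful.

**Feature Selection:**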

In [133]:
coeffients_wrapper_filter

array([0.19957659, 0.0434406 , 0.0786946 , 0.67422132, 0.00406689])

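coeffients_wrapper_filter and coefftients_filtered_selection do not appear to be regression coefficients, so they cannot be compared directly with the two arrays above. The five values in coeffients_wrapper_filter are non-negative and sum to 1, which is consistent with feature importances from a tree-based model fitted on five features. coefftients_filtered_selection lists one value per feature, all between -1 and 1 and labelled price, which is consistent with each feature's correlation with price; by that measure engine_size (0.89), curb_weight (0.84) and horse_power (0.81) have the strongest relationship with price.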

In [134]:
coefftients_filtered_selection

wheel_base       0.585793
length           0.695331
width            0.754273
heights          0.138291
curb_weight      0.835729
engine_size      0.888942
bore             0.546873
stroke           0.093746
comprassion      0.069500
horse_power      0.811027
peak_rpm        -0.104333
city_mpg        -0.702685
highway_mpg     -0.715590
fuel_type_gas   -0.108968
Name: price, dtype: float64

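To take the comparison further:

1. Standardize the features so that coefficients can be compared across features and methods.
2. Compare the signs and magnitudes of the coefficients to understand the direction and strength of each relationship.
3. Focus on the features that are selected consistently. Here width, curb_weight, engine_size, horse_power and highway_mpg were chosen by SelectKBest and also appear among the 10 RFE features and the 14 features kept by Lasso.

On test-set MSE, RFE with a random forest scored lowest (about 12.3 million), followed by Lasso (about 17.5 million) and SelectKBest with linear regression (about 20.7 million). The RFE result uses a different model type, so the gap reflects the model as well as the selected features.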
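The effect of standardizing before comparing coefficients can be sketched on synthetic stand-in data. The columns small_scale, big_scale and noise are hypothetical, and alpha=0.1 is an arbitrary choice for the illustration. On the raw data the coefficient of big_scale would be about 0.005 and look negligible, although the feature contributes more to the target than small_scale does.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; big_scale has a large numeric range.
rng = np.random.default_rng(7)
n = 250
X_demo = pd.DataFrame({
    "small_scale": rng.normal(0, 1, n),
    "big_scale": rng.normal(0, 1000, n),
    "noise": rng.normal(0, 1, n),
})
y_demo = (
    3.0 * X_demo["small_scale"]
    + 0.005 * X_demo["big_scale"]
    + rng.normal(0, 1, n)
)

# After standardization each coefficient is the change in y per one
# standard deviation of the feature, so magnitudes become comparable.
X_std = StandardScaler().fit_transform(X_demo)

ols = LinearRegression().fit(X_std, y_demo)
lasso = Lasso(alpha=0.1).fit(X_std, y_demo)

comparison = pd.DataFrame(
    {"ols": ols.coef_, "lasso": lasso.coef_}, index=X_demo.columns
)
print(comparison.round(3))
```

Placing both models' coefficients side by side in one table, indexed by feature name, also avoids having to match array positions to column names by eye.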


### 2.5 Bonus question (*extra 5 points*)

Reduce your features with PCA. Run a regression with the chosen number of principal components and report your findings.

In [135]:
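# Standardize the features first; PCA is sensitive to feature scale.
# Scaled_X_train and Scaled_X_test are not defined in the cells shown above,
# so this assumes they were produced with StandardScaler.
scaler = StandardScaler()
Scaled_X_train = scaler.fit_transform(X_train)
Scaled_X_test = scaler.transform(X_test)

# Keep 8 principal components and regress price on them
n_components = 8
pca = PCA(n_components=n_components)
X_train_pca = pca.fit_transform(Scaled_X_train)
X_test_pca = pca.transform(Scaled_X_test)
regression_model = LinearRegression()
regression_model.fit(X_train_pca, y_train)
y_pred = regression_model.predict(X_test_pca)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)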

Mean Squared Error (MSE): 19868465.17983548


1. n_components = 8: Sets the number of principal components to retain (out of 14 original features).
2. X_train_pca = pca.fit_transform(Scaled_X_train): Fits the PCA model to the scaled training data (Scaled_X_train) and transforms the data into the reduced-dimensional space.
3. X_test_pca = pca.transform(Scaled_X_test): Transforms the scaled test data (Scaled_X_test) using the same PCA transformation.
4. regression_model = LinearRegression(): Initializes a linear regression model.
5. regression_model.fit(X_train_pca, y_train): Fits the linear regression model to the PCA-transformed training data (X_train_pca) and the corresponding target values (y_train).
6. y_pred = regression_model.predict(X_test_pca): Makes predictions on the PCA-transformed test data.
7. mse = mean_squared_error(y_test, y_pred): Calculates the Mean Squared Error (MSE) between the predicted and actual target values.
8. print("Mean Squared Error (MSE):", mse): Prints the MSE, which measures the model's performance on the test set. Lower values indicate better performance.

With 8 components the test-set MSE is about 19.87 million, which lies between the SelectKBest result (about 20.66 million) and the Lasso result (about 17.47 million).
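One way to justify the number of components is to look at the cumulative explained variance. The sketch below does this on synthetic stand-in data (10 columns generated from 3 underlying factors), so the numbers say nothing about the homework dataset; the 95% threshold is an assumption chosen for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 10 columns built from only 3 underlying factors.
rng = np.random.default_rng(3)
factors = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X_demo = factors @ mixing + rng.normal(scale=0.1, size=(200, 10))

X_std = StandardScaler().fit_transform(X_demo)

# Fit PCA with all components, then inspect the cumulative variance.
pca_full = PCA().fit(X_std)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 95%.
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)
print("Cumulative explained variance:", cumulative.round(3))
print("Components needed for 95%:", n_needed)
```

Explained variance measures how much of the feature spread is retained, not how well the components predict the target, so the final choice is usually checked against test-set error as well.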





