**INSTRUCTIONS**

Every learner should submit their own homework solutions. You are allowed to discuss the homework with each other, but you may not copy someone else's solution.

The homework consists of two parts:

    1. Data from our lives
    2. Variable selection

Follow the prompts in the attached Jupyter notebook.

**We are using the same data as for the previous homework**. Use the version you created called **df2** where you already cleaned, dropped some of the variables and also created the dummy variables.

Add markdown cells to your analysis for your solutions, comments, and answers. Add as many cells as you need, and comment your code wherever possible so that it is easy to read. Hopefully this homework will help you develop your skills, understand the flow of an EDA, and get ready for individual work.

**Note:** This homework has a bonus question, so the highest mark that can be earned is 105.

Submission: Send in both an ipynb file and a PDF file of your work.

Good luck!


# 1. Data from our lives:

### Describe a situation or problem from your job, everyday life, current events, etc., for which a variable selection/feature reduction would be appropriate.

*Your Answer:*

**Answer:** A real-life situation where variable selection/feature reduction would be appropriate is **loan risk prediction**: predicting whether a particular applicant is a safe candidate for a loan. When a regression model is used for this, the goal of variable selection is to include only the features that contribute meaningfully to the prediction of loan risk and to exclude irrelevant or redundant ones. This offers several advantages:

**Improved Accuracy:** Variable selection lets the model focus on the most relevant predictors, which can improve accuracy. For example, if Credit History and Cash Flow are identified as key factors, the model emphasizes these influential features and can produce more accurate risk assessments.

**Simple Models Are Easier to Interpret**: Feature reduction ensures that only essential variables are included in the model. This simplicity enhances interpretability, making it easier for stakeholders, such as loan officers, to understand and trust the model's decisions. For instance, a straightforward model highlighting the importance of Credit History and Cash Flow simplifies the interpretation process.

**Shorter Training Times:** With a reduced set of features, the model training process becomes more efficient. This is particularly important in the context of loan risk prediction, where quick decision-making is crucial. Shorter training times enable faster deployment of models, ensuring timely assessments of loan applications.

**Enhanced Generalization by Reducing Overfitting:** Feature reduction mitigates the risk of overfitting by focusing on the most relevant information. Instead of learning noise in the data, the model generalizes better to new, unseen loan applications. This is vital for making accurate predictions and avoiding the pitfalls of overfitting in the dynamic context of loan risk assessment.

**Easier to Implement by Software Developers:**  A streamlined model with a reduced set of features is easier for software developers to implement. It simplifies the integration of the model into the bank's existing software infrastructure, allowing for smoother deployment and utilization in real-world lending scenarios.

**Reduced Risk of Data Errors During Model Use:** Feature reduction helps improve the overall quality of the data used by the model. With irrelevant or noisy variables excluded, the model is less susceptible to errors caused by misleading information, so its predictions rest on more reliable and relevant data.

**Variable Redundancy:** Techniques like correlation analysis and feature importance ranking help identify and eliminate redundant variables. In the loan risk prediction example, if two features are highly correlated (e.g., Collateral and Capitalization), one of them might be selected while the redundant one is excluded, reducing redundancy and enhancing the model's efficiency.

**Avoiding Bad Learning Behavior in High Dimensional Spaces:** In high-dimensional spaces the data become sparse relative to the number of features, which makes models harder to fit reliably and increases computational cost. Feature reduction keeps the model in a more manageable space and reduces the risk of these problems.






# 2. Variable selection

In our class so far we covered three types of feature selection techniques. They were: 
1. Filter methods
2. Wrapper methods
3. Embedded methods

Use the dataset 'auto_imports1.csv' from our previous homework. More specifically, use the version you created called **df2** where you already cleaned, dropped some of the variables and also created the dummy variables.

### 2.1. Filter methods

Choose one (you may do more, one is required) of the filter methods to conduct variable selection. Report your findings.

In [1]:
%store -r df2


Retrieving the stored df2 DataFrame from the HW2 notebook into this notebook.

In [2]:
df2.head()

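|   | wheel_base | length | width | heights | curb_weight | engine_size | bore | stroke | comprassion | horse_power | peak_rpm | city_mpg | highway_mpg | price | fuel_type_gas |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 | 13495 | 1 |
| 1 | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 | 16500 | 1 |
| 2 | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | 152 | 2.68 | 3.47 | 9.0 | 154.0 | 5000.0 | 19 | 26 | 16500 | 1 |
| 3 | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | 109 | 3.19 | 3.4 | 10.0 | 102.0 | 5500.0 | 24 | 30 | 13950 | 1 |
| 4 | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | 136 | 3.19 | 3.4 | 8.0 | 115.0 | 5500.0 | 18 | 22 | 17450 | 1 |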


In [3]:
df2.shape

(195, 15)

**Filter methods** are generally used as a preprocessing step. 
The selection of features is independent of any machine learning algorithms. 
Instead, features are selected on the basis of their scores in various statistical tests for their correlation with the outcome variable.
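
As a rough illustration of the idea described above, the sketch below scores each feature by its absolute Pearson correlation with the outcome and keeps those above a cutoff. The data are synthetic, and the column names and the cutoff of 0.3 are assumptions made for the example, not values taken from this homework.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a cleaned numeric frame such as df2 (hypothetical data).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
demo["target"] = 3.0 * demo["strong"] + 0.5 * demo["weak"] + rng.normal(size=n)

# Score each feature by its absolute Pearson correlation with the target.
scores = (
    demo.drop(columns="target")
        .corrwith(demo["target"])
        .abs()
        .sort_values(ascending=False)
)

# Keep the features whose score exceeds an arbitrary, illustrative cutoff.
cutoff = 0.3
kept = scores[scores > cutoff].index.tolist()
print(scores)
print("Kept:", kept)
```

No model is trained here: the ranking depends only on the statistic, which is what distinguishes a filter method from wrapper and embedded methods.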


2.1.1 **Basic Methods**

In [4]:
# Import the required packages
from sklearn.feature_selection import VarianceThreshold
import pandas as pd

# Variance threshold below which a feature is dropped
threshold = 0.1
# Separate the predictors from the dependent variable 'price'
X = df2.drop(['price'], axis=1) if 'price' in df2.columns else df2

selector = VarianceThreshold(threshold)
X_filtered = selector.fit_transform(X)
features_to_keep = X.columns[selector.get_support()]
# Reuse X.index so that the 'price' assignment below aligns row by row
df2_filtered = pd.DataFrame(X_filtered, columns=features_to_keep, index=X.index)
df2_filtered['price'] = df2['price']
df2_filtered.head()






|   | wheel_base | length | width | heights | curb_weight | engine_size | comprassion | horse_power | peak_rpm | city_mpg | highway_mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 88.6 | 168.8 | 64.1 | 48.8 | 2548.0 | 130.0 | 9.0 | 111.0 | 5000.0 | 21.0 | 27.0 | 13495.0 |
| 1 | 88.6 | 168.8 | 64.1 | 48.8 | 2548.0 | 130.0 | 9.0 | 111.0 | 5000.0 | 21.0 | 27.0 | 16500.0 |
| 2 | 94.5 | 171.2 | 65.5 | 52.4 | 2823.0 | 152.0 | 9.0 | 154.0 | 5000.0 | 19.0 | 26.0 | 16500.0 |
| 3 | 99.8 | 176.6 | 66.2 | 54.3 | 2337.0 | 109.0 | 10.0 | 102.0 | 5500.0 | 24.0 | 30.0 | 13950.0 |
| 4 | 99.4 | 176.6 | 66.4 | 54.3 | 2824.0 | 136.0 | 8.0 | 115.0 | 5500.0 | 18.0 | 22.0 | 17450.0 |


In [5]:
df2_filtered.shape

(195, 12)

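Here we applied a basic filter method: VarianceThreshold removes every feature whose variance does not exceed the threshold (0.1), which is the usual way to remove constant and quasi-constant features. Three columns were dropped ('bore', 'stroke', and 'fuel_type_gas'), leaving 11 predictors plus 'price'. Variance depends on the units of each feature, so small-scale columns such as 'bore' and 'stroke' fall below the threshold even though their values clearly vary from car to car.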
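
The sketch below illustrates, on synthetic data, how the outcome of a variance threshold depends on the units of each column. The column names, value ranges, and the second threshold of 0.02 are assumptions chosen for the example. Rescaling with MinMaxScaler before thresholding is one possible remedy, not the method used in this homework.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(1)
n = 200
demo = pd.DataFrame({
    "small_scale": rng.uniform(2.6, 3.4, size=n),      # varies, but in small units
    "large_scale": rng.uniform(1500, 4000, size=n),    # varies in large units
    "almost_constant": np.r_[np.ones(n - 2), [0, 0]],  # genuinely quasi-constant
})

# On raw units the small-scale column is dropped along with the quasi-constant one.
raw = VarianceThreshold(threshold=0.1).fit(demo)
raw_kept = list(demo.columns[raw.get_support()])
print("Raw units:", raw_kept)

# After rescaling every column to [0, 1], only the quasi-constant column is dropped.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(demo), columns=demo.columns)
rescaled = VarianceThreshold(threshold=0.02).fit(scaled)
rescaled_kept = list(demo.columns[rescaled.get_support()])
print("Rescaled:", rescaled_kept)
```

On comparable scales the threshold separates columns by how much they vary, not by the units they happen to be measured in.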


### 2.2. Wrapper methods

Choose one (you may do more, one is required) of the wrapper methods to conduct variable selection. Report your findings.

In **wrapper methods**, subsets of features are selected iteratively and a model is trained on each subset. Whether a feature is added to or removed from the subset depends on the performance of the previous model, so the process becomes a search problem: find the combination of features that gives the best model performance. Wrapper methods tend to be computationally expensive, because a new model has to be trained for every candidate subset.
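
For comparison, scikit-learn provides a generic wrapper in `SequentialFeatureSelector`, which scores candidate subsets by cross-validation instead of p-values. The sketch below runs it on synthetic data. The feature names, the true coefficients, and the choice of two features are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 200
X_demo = pd.DataFrame(rng.normal(size=(n, 6)), columns=[f"x{i}" for i in range(6)])
# Only x0 and x3 actually drive the (hypothetical) response.
y_demo = 4.0 * X_demo["x0"] - 3.0 * X_demo["x3"] + rng.normal(scale=0.5, size=n)

# Forward search: start from no features and add the one that most improves
# the cross-validated R^2 until two features have been chosen.
sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,
    direction="forward",
    scoring="r2",
    cv=5,
)
sfs.fit(X_demo, y_demo)
chosen = list(X_demo.columns[sfs.get_support()])
print(chosen)
```

Each step refits the model once per remaining candidate and per fold, which is where the computational cost of wrapper methods comes from.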







2.2.1 **Forward Selection**

In [6]:
#importing the required libraries
import statsmodels.api as sm
#defining the dependent variable
target_variable = 'price'
# Separating features from target variable
X = df2.drop([target_variable], axis=1)
y = df2[target_variable]

# Setting the significance threshold
significance_threshold = 0.01
#storing the selected features 
selected_features = []
# Intercept-only (null) model, kept for reference; the selection loop below does not use it
null_model = sm.OLS(y, sm.add_constant(pd.Series([1]*len(y), index=y.index))).fit()
while True:
    best_feature = None
    best_pvalue = float('inf')  
    
    # Iterate through  features 
    for feature in X.columns:
        if feature in selected_features:
            continue
        
        # Adding the features 
        current_features = selected_features + [feature]
        
        # Training the model with the current feature set
        model = sm.OLS(y, sm.add_constant(X[current_features])).fit()
        
        # p-value for the added feature
        pvalue = model.pvalues[feature]
        
        # Keep this feature as the best candidate if its p-value is the smallest so far
        if pvalue < best_pvalue:
            best_feature = feature
            best_pvalue = pvalue
    
    # If the best feature's p-value is below the significance threshold, add it to the selected set
    if best_pvalue < significance_threshold:
        selected_features.append(best_feature)
    else:
        break

#final selected features
print("Final Selected Features:")
print(selected_features)


Final Selected Features:
['engine_size', 'width', 'horse_power', 'stroke', 'fuel_type_gas', 'peak_rpm']


**Forward Stepwise Selection**: This is a method for variable selection in statistical modeling. The process starts with a null model and then adds the most statistically significant variable to the model, one at a time. Variables are added until a predetermined stopping rule is met or until all variables under consideration have been included. The objective is to build up the model from the most relevant variables, judged by statistical significance.

We set the threshold to 0.01. At each step, the candidate with the smallest p-value was added, provided that its p-value was below the threshold. The features obtained from this forward selection process are 'engine_size', 'width', 'horse_power', 'stroke', 'fuel_type_gas', and 'peak_rpm'.

### 2.3. Embedded methods

Choose one (you may do more, one is required) of the embedded methods to conduct variable selection. Report your findings.

2.3.1 **LASSO Regression**

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso

# Target variable
target_variable = 'price'

# Separate features and target variable
X = df2.drop([target_variable], axis=1)
y = df2[target_variable]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a LASSO regression model (alpha sets the strength of the L1 penalty;
# the features are used in their original units, without scaling)
lasso_model = Lasso(alpha=1000)

# Fit the LASSO model to the training data
lasso_model.fit(X_train, y_train)

# Get the selected features (non-zero coefficients)
selected_features = X.columns[lasso_model.coef_ != 0]

# Print the selected features
print("Selected Features:")
print(selected_features)



Selected Features:
Index(['curb_weight', 'engine_size', 'comprassion', 'horse_power', 'peak_rpm'], dtype='object')


**LASSO REGRESSION**: LASSO stands for **L**east **A**bsolute **S**hrinkage and **S**election **O**perator. Lasso regression is a variant of linear regression that uses shrinkage: the coefficient estimates are pulled toward zero. The method is designed to produce simple, sparse models, that is, models with fewer parameters. It is particularly useful when there is substantial multicollinearity or when parts of model selection, such as variable selection or parameter elimination, should be automated.

The key mechanism in lasso regression involves L1 regularization, which imposes a penalty equivalent to the absolute value of the coefficients' magnitudes. Regularization, in general, entails introducing a penalty to the various parameters of a machine learning model to constrain its flexibility and, consequently, mitigate the risk of overfitting. In the context of linear models, this penalty is applied to the coefficients that scale each predictor.

Lasso's distinctive attribute within the spectrum of regularization techniques is its ability to shrink certain coefficients all the way to zero. Consequently, features associated with these zeroed-out coefficients can be effectively removed from the model. This property makes lasso regression a powerful tool for feature selection, providing a means to streamline models by automatically identifying and excluding less influential predictors.

The features selected by the LASSO regression model, based on their non-zero coefficients, are 'curb_weight', 'engine_size', 'comprassion', 'horse_power', and 'peak_rpm'.
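
Because the L1 penalty acts on the size of the coefficients, the features a lasso model keeps can depend on the units of each column and on the value of alpha. The sketch below shows one common alternative setup, assuming synthetic data: the features are standardized in a pipeline and alpha is chosen by cross-validation with `LassoCV`. The column names and coefficients are hypothetical, and this is not the configuration used in the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n = 200
X_demo = pd.DataFrame({
    "big_units": rng.normal(2500, 500, size=n),   # e.g. a weight-like column
    "small_units": rng.normal(3.0, 0.3, size=n),  # e.g. a bore-like column
    "irrelevant": rng.normal(size=n),
})
y_demo = (5.0 * X_demo["big_units"]
          + 8000.0 * X_demo["small_units"]
          + rng.normal(scale=500, size=n))

# Standardize first so the penalty treats all columns alike,
# then let cross-validation pick alpha.
pipe = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
pipe.fit(X_demo, y_demo)

lasso = pipe.named_steps["lassocv"]
coefs = pd.Series(lasso.coef_, index=X_demo.columns)
print("Chosen alpha:", lasso.alpha_)
print(coefs)
print("Selected:", list(coefs[coefs != 0].index))
```

The coefficients printed here are on the standardized scale, so their magnitudes can be compared across features.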






### 2.4. Compare your results
Compare your results from the three methods and also compare the coefficients to the full linear regression model (model1) from the previous homework.

In [8]:
# Loading model1 from HW2
import pickle
import statsmodels.api as sm

# Opening the pickled model
with open('model1.pkl', 'rb') as file:
    model1_loaded = pickle.load(file)

# Printing the summary
print(model1_loaded.summary())


                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.860
Model:                            OLS   Adj. R-squared:                  0.849
Method:                 Least Squares   F-statistic:                     78.89
Date:                Sun, 19 Nov 2023   Prob (F-statistic):           5.84e-69
Time:                        14:30:22   Log-Likelihood:                -1838.5
No. Observations:                 195   AIC:                             3707.
Df Residuals:                     180   BIC:                             3756.
Df Model:                          14                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const          -4.45e+04   1.84e+04     -2.419

In [10]:
# Refit the full linear regression model (model1) on all predictors
X = df2.drop(columns=["price"])
y = df2["price"]
X = sm.add_constant(X)
model1 = sm.OLS(y, X).fit()

# Features kept by each method
features_method1 = df2_filtered.columns[:-1]  # Filter: every column except 'price'
# Forward selection: copied from the output in 2.2.1, because the LASSO cell
# later reassigned the variable selected_features
features_method2 = ['engine_size', 'width', 'horse_power', 'stroke', 'fuel_type_gas', 'peak_rpm']
features_method3 = selected_features  # LASSO: the current value of selected_features

# Predictors in the full linear regression model
features_model1 = model1.params.index[1:]  # Exclude the intercept

# Features selected by all three methods (all of them are also in model1)
common_features = set(features_method1) & set(features_method2) & set(features_method3) & set(features_model1)

print("Common Features Selected by All Methods:")
print(sorted(common_features))

# Predictors of the full linear regression model
print("\nFeatures in Model1:")
print(features_model1)


Common Features Selected by All Methods:
['engine_size', 'horse_power', 'peak_rpm']

Features in Model1:
Index(['wheel_base', 'length', 'width', 'heights', 'curb_weight',
       'engine_size', 'bore', 'stroke', 'comprassion', 'horse_power',
       'peak_rpm', 'city_mpg', 'highway_mpg', 'fuel_type_gas'],
      dtype='object')


Interpreting the results of feature selection and comparing the selected features:

**Common Features Selected by All Methods:**

'engine_size': The size of the car's engine.
'horse_power': The horsepower of the engine.
'peak_rpm': The engine speed (revolutions per minute) at which peak power is produced.
These three features are selected by the variance filter, forward selection, and LASSO alike, which suggests they are important for predicting the dependent variable 'price'. 'curb_weight' and 'comprassion' were kept by the filter and by LASSO but not by forward selection. 'width' was kept by the filter and by forward selection but not by LASSO. 'stroke' and 'fuel_type_gas' were chosen by forward selection only, since the variance filter had removed them.

**Features in Model1:**

'wheel_base': The distance between the centers of the front and rear wheels.
'length': The length of the car.
'width': The width of the car.
'heights': The height of the car.
'curb_weight': The weight of the car when it's ready to drive.
'engine_size': The size of the car's engine.
'bore': The diameter of the engine cylinders.
'stroke': The distance the piston travels inside the cylinder.
'comprassion': Compression ratio of the engine.
'horse_power': The horsepower of the engine.
'peak_rpm': The engine speed (revolutions per minute) at which peak power is produced.
'city_mpg': Miles per gallon in the city.
'highway_mpg': Miles per gallon on the highway.
'fuel_type_gas': Binary indicator for gas fuel type.
OLS estimates a coefficient for every predictor, so all 14 features appear in the full linear regression model (model1) with non-zero coefficients. A non-zero coefficient does not by itself mean that a feature is significant. Significance is judged from the p-values in the model summary.


Taken together, 'engine_size', 'horse_power', and 'peak_rpm' are selected consistently across all methods, which points to the important role of the car's engine characteristics in its price.
The remaining features ('wheel_base', 'length', 'width', 'heights', 'curb_weight', 'bore', 'stroke', 'comprassion', 'city_mpg', 'highway_mpg', and 'fuel_type_gas') are all included in the full linear regression model, but each of them is left out by at least one of the selection methods.
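
One way to lay the three selections side by side is a small membership table. The sketch below builds it from the feature lists printed in sections 2.1 to 2.3. The lists are typed in by hand here, and the variable names are chosen for the example.

```python
import pandas as pd

# Feature lists copied from the outputs printed in sections 2.1 to 2.3.
all_features = ['wheel_base', 'length', 'width', 'heights', 'curb_weight',
                'engine_size', 'bore', 'stroke', 'comprassion', 'horse_power',
                'peak_rpm', 'city_mpg', 'highway_mpg', 'fuel_type_gas']
filter_kept = ['wheel_base', 'length', 'width', 'heights', 'curb_weight',
               'engine_size', 'comprassion', 'horse_power', 'peak_rpm',
               'city_mpg', 'highway_mpg']
forward_kept = ['engine_size', 'width', 'horse_power', 'stroke',
                'fuel_type_gas', 'peak_rpm']
lasso_kept = ['curb_weight', 'engine_size', 'comprassion', 'horse_power',
              'peak_rpm']

# One row per feature, one boolean column per selection method.
comparison = pd.DataFrame({
    'filter': [f in filter_kept for f in all_features],
    'forward': [f in forward_kept for f in all_features],
    'lasso': [f in lasso_kept for f in all_features],
}, index=all_features)

# Count how many of the three methods kept each feature.
comparison['n_methods'] = comparison[['filter', 'forward', 'lasso']].sum(axis=1)
print(comparison.sort_values('n_methods', ascending=False))
```

Keeping each method's result in its own list also avoids relying on a variable that a later cell may have reassigned.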


### 2.5 Bonus question (*extra 5 points*)

Reduce your features with PCA. Run a regression with the chosen number of principal components and report your findings.

In [14]:
#importing the required libraries
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
#determining the features column
features = ['wheel_base', 'length', 'width', 'heights', 'curb_weight', 'engine_size', 'bore', 'stroke',
            'comprassion', 'horse_power', 'peak_rpm', 'city_mpg', 'highway_mpg', 'fuel_type_gas']

# Separate features and target variable
X = df2[features]
y = df2['price']

# Standardizing  the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Applying  PCA with 5 components
num_components = 5
pca = PCA(n_components=num_components)
X_pca = pca.fit_transform(X_scaled)

# Creating a DataFrame with the reduced features
# (it receives a fresh 0..n-1 index rather than the index of df2)
columns_pca = [f'PC{i+1}' for i in range(num_components)]
df_pca = pd.DataFrame(data=X_pca, columns=columns_pca)

# Concatenate the reduced features with the target variable
# (pd.concat aligns rows by index label, not by position)
df_final = pd.concat([df_pca, y], axis=1)

# Display the reduced features DataFrame
print("DataFrame with Reduced Features:")
print(df_final.head())


DataFrame with Reduced Features:
        PC1       PC2       PC3       PC4       PC5    price
0 -0.727050 -1.869607 -0.315350  2.645986  0.562464  13495.0
1 -0.727050 -1.869607 -0.315350  2.645986  0.562464  16500.0
2  0.271794 -1.194953 -1.574672 -0.688328  0.040793  16500.0
3 -0.236626 -0.389173 -0.015361 -1.151228  0.318649  13950.0
4  1.122627 -1.234495 -0.203774 -1.105527  0.337484  17450.0


The DataFrame above contains the reduced features obtained through Principal Component Analysis (PCA). Each row represents an observation from our original dataset. The columns 'PC1' through 'PC5' are the principal components resulting from the PCA transformation, and 'price' is the dependent variable.

Each principal component (PC) is a linear combination of the standardized original features.
The principal components are orthogonal to each other and are ordered by explained variance: PC1 captures the largest share of the variance in the features, and each later component captures the largest share of what remains.
The values in each row of the 'PC1' to 'PC5' columns are the coordinates of that observation in the reduced feature space.

PC4 and PC5, being orthogonal to the first three components, capture additional variation that those components do not cover.

Price Column:

The 'price' column is the dependent variable. Each row pairs a price with a set of principal component values.

Overall Interpretation:

The reduced feature space 'PC1' to 'PC5' condenses the 14 original features into 5 columns while retaining the directions of greatest variation.
Keeping 'price' alongside the components makes it possible to fit a regression on this lower-dimensional representation.
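
The number of components is often chosen from the cumulative explained variance. The sketch below shows that calculation on synthetic data. The data-generating setup and the 90% target are assumptions for illustration and say nothing about how many components df2 needs.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n = 200
# Two hidden factors drive six observed features (hypothetical data).
latent = rng.normal(size=(n, 2))
mixing = rng.normal(size=(2, 6))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(n, 6))

# Standardize, fit PCA with all components, and accumulate the variance shares.
X_std = StandardScaler().fit_transform(X_demo)
pca_full = PCA().fit(X_std)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the target.
target_share = 0.90
n_needed = int(np.searchsorted(cumulative, target_share) + 1)
print(np.round(cumulative, 3))
print("Components needed:", n_needed)
```

The same cumulative curve can be computed from `pca.explained_variance_ratio_` on the real features to justify a particular choice such as five components.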

In [17]:
# Loadings: the weight of each original feature in each principal component
contributions = pd.DataFrame(pca.components_, columns=features, index=columns_pca)

# Displaying the contributions
print("Contributions:")
print(contributions)


Contributions:
     wheel_base    length     width   heights  curb_weight  engine_size  \
PC1    0.311628  0.351393  0.342885  0.128115     0.370784     0.330728   
PC2    0.196130  0.096381  0.089492  0.263685     0.042755    -0.060528   
PC3    0.203115  0.150999 -0.028725  0.574718    -0.059040    -0.263229   
PC4   -0.256601 -0.139484 -0.099469 -0.395476     0.019719     0.160582   
PC5   -0.087765  0.016078  0.078358 -0.020250     0.059017    -0.040486   

         bore    stroke  comprassion  horse_power  peak_rpm  city_mpg  \
PC1  0.275624  0.059112     0.024492     0.303223 -0.096411 -0.321601   
PC2 -0.026328  0.132418     0.523976    -0.243137 -0.358649  0.259522   
PC3  0.117597 -0.616858    -0.187490    -0.237090 -0.066608 -0.013667   
PC4  0.424716 -0.547628     0.128795     0.119776 -0.443762  0.025908   
PC5 -0.130874 -0.433406     0.391050     0.191786  0.685783 -0.048707   

     highway_mpg  fuel_type_gas  
PC1    -0.333626      -0.050302  
PC2     0.217083      -0.52

This table shows the loading of each original feature on each principal component, that is, how strongly the feature contributes to that component.

**REGRESSION**

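Checking for null values (pd.concat aligns rows by index label, so missing values in the PC columns mean that the index of df_pca does not match the index of df2):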

In [45]:
df_final.isnull().sum()

PC1      6
PC2      6
PC3      6
PC4      6
PC5      6
price    0
dtype: int64
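
Missing values of this kind typically appear when two objects with different index labels are concatenated. The toy sketch below reproduces the effect and shows one way to avoid it by reusing the source index. The frame, the index values, and the score array are made up for illustration.

```python
import numpy as np
import pandas as pd

# A frame whose index has gaps, as happens after rows are dropped during cleaning.
source = pd.DataFrame({"price": [10.0, 20.0, 30.0, 40.0]}, index=[0, 1, 3, 5])
scores = np.array([[0.1], [0.2], [0.3], [0.4]])  # stand-in for pca.fit_transform output

# A fresh default index (0..3) no longer matches the source index (0, 1, 3, 5):
# concat aligns on labels, so NaNs appear and some scores meet the wrong price.
misaligned = pd.concat(
    [pd.DataFrame(scores, columns=["PC1"]), source["price"]], axis=1
)
print(misaligned)

# Reusing the source index keeps each score next to its own price.
aligned = pd.concat(
    [pd.DataFrame(scores, columns=["PC1"], index=source.index), source["price"]],
    axis=1,
)
print(aligned)
```

In the misaligned frame, dropping the NaN rows removes the gaps but leaves the remaining rows paired by label, not by position, so a regression fitted on it can relate scores to prices from different observations.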

Dropping the rows with null values and checking again:

In [54]:
df_final = df_final.dropna()

In [53]:
df_final.isnull().sum()

PC1      0
PC2      0
PC3      0
PC4      0
PC5      0
price    0
dtype: int64

In [52]:
# Dropping  rows with NaN values in price 
df_final = df_final.dropna(subset=['price'])

# Extracting  features and price
X_pca = df_final[['PC1', 'PC2', 'PC3', 'PC4', 'PC5']]
y = df_final['price']

# Adding a constant to the features 
X_pca = sm.add_constant(X_pca)

# Creating  the linear regression model
model_pca = sm.OLS(y, X_pca).fit()

# Print the model summary
print(model_pca.summary())



                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.236
Model:                            OLS   Adj. R-squared:                  0.215
Method:                 Least Squares   F-statistic:                     11.32
Date:                Sun, 19 Nov 2023   Prob (F-statistic):           1.60e-09
Time:                        15:41:33   Log-Likelihood:                -1942.6
No. Observations:                 189   AIC:                             3897.
Df Residuals:                     183   BIC:                             3917.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       1.304e+04    520.536     25.061      0.0

The statistical significance of each variable is determined by the p-value (P>|t|) associated with its coefficient. A common threshold for statistical significance is a p-value below 0.05. Examining the p-values for each variable:

const (Constant): The p-value is 0.000, which is less than 0.05. Therefore, the constant term is statistically significant.

PC1: The p-value is 0.000, indicating that PC1 is statistically significant.

PC2: The p-value is 0.299, which is greater than 0.05. Therefore, PC2 is not statistically significant at the 0.05 level.

PC3: The p-value is 0.947, which is much greater than 0.05. PC3 is not statistically significant.

PC4: The p-value is 0.756, which is greater than 0.05. PC4 is not statistically significant.

PC5: The p-value is 0.191, which is greater than 0.05. Therefore, PC5 is not statistically significant at the 0.05 level.

In summary, the statistically significant variables are the constant term and PC1. These variables have p-values less than 0.05, suggesting that they have a statistically significant impact on the dependent variable 'price' in this regression model. The other variables (PC2, PC3, PC4, and PC5) are not statistically significant based on the 0.05 significance level.

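The R-squared value of 0.236 indicates that the model explains about 23.6% of the variability in the dependent variable, considerably less than the R-squared of 0.860 for the full model (model1). The low p-value associated with the F-statistic (1.60e-09) suggests that the overall regression model is statistically significant, meaning that there is evidence that at least one of the independent variables is related to the dependent variable. Because the PC scores and the prices were joined on index labels that did not match (see the null-value check above), some rows may pair the components of one car with the price of another, so this fit should be re-examined with aligned rows before concluding that the principal components predict price poorly.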
