<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Python-Notebook-Banners/Exercise.png" style="display: block; margin-left: auto; margin-right: auto;"/>
</div>

# Exercise: Variables and variable selection
© ExploreAI Academy

In this exercise, we apply variance thresholding to select features from a dataset.  

## Learning objectives

By the end of this exercise, you should be able to:
* Perform dummy variable encoding.
* Implement variance thresholding in Python.
* Use a variance threshold to filter out some features in a dataset.

## Exercises

We are provided with the `Crop_yield` dataset that contains various factors that could influence the yield of a particular crop across different regions.

### Import libraries and dataset

In [1]:
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

In [2]:
# Load dataset
df = pd.read_csv("https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/Python/Crop_yield.csv")
df.head(5)

| | Region | Temperature | Rainfall | Soil_Type | Fertilizer_Usage | Pesticide_Usage | Irrigation | Crop_Variety | Yield |
|---|---|---|---|---|---|---|---|---|---|
| 0 | East | 23.152156 | 803.362573 | Clayey | 204.792011 | 20.76759 | 1 | Variety B | 40.316318 |
| 1 | West | 19.382419 | 571.56767 | Sandy | 256.201737 | 49.290242 | 0 | Variety A | 26.846639 |
| 2 | North | 27.89589 | -8.699637 | Loamy | 222.202626 | 25.316121 | 0 | Variety C | -0.323558 |
| 3 | East | 26.741361 | 897.426194 | Loamy | 187.98409 | 17.115362 | 0 | Variety C | 45.440871 |
| 4 | East | 19.090286 | 649.384694 | Loamy | 110.459549 | 24.068804 | 1 | Variety B | 35.478118 |


In [3]:
df.shape

(1000, 9)

### Exercise 1

Our dataset contains several categorical features: `Region`, `Soil_Type`, and `Crop_Variety`. 

Use dummy variable encoding to convert these features into a numerical format suitable for model training. Verify the transformation by displaying the first five rows of the modified dataset.

> How has the number of variables in our dataset changed?

In [4]:
# Dummy variable encoding our dataset

df_dummies = pd.get_dummies(df)

# Again we make sure that all the column names have underscores instead of whitespaces
df_dummies.columns = [col.replace(" ","_") for col in df_dummies.columns]

df_dummies.head()

| | Temperature | Rainfall | Fertilizer_Usage | Pesticide_Usage | Irrigation | Yield | Region_East | Region_North | Region_South | Region_West | Soil_Type_Clayey | Soil_Type_Loamy | Soil_Type_Sandy | Crop_Variety_Variety_A | Crop_Variety_Variety_B | Crop_Variety_Variety_C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 23.152156 | 803.362573 | 204.792011 | 20.76759 | 1 | 40.316318 | True | False | False | False | True | False | False | False | True | False |
| 1 | 19.382419 | 571.56767 | 256.201737 | 49.290242 | 0 | 26.846639 | False | False | False | True | False | False | True | True | False | False |
| 2 | 27.89589 | -8.699637 | 222.202626 | 25.316121 | 0 | -0.323558 | False | True | False | False | False | True | False | False | False | True |
| 3 | 26.741361 | 897.426194 | 187.98409 | 17.115362 | 0 | 45.440871 | True | False | False | False | False | True | False | False | False | True |
| 4 | 19.090286 | 649.384694 | 110.459549 | 24.068804 | 1 | 35.478118 | True | False | False | False | False | True | False | False | True | False |


In [5]:
df_dummies.shape

(1000, 16)

### Exercise 2

We want to determine which variables from the new dataset we will use for model training.

Write a function `variance_thresholding` that uses variance thresholding to filter out features. The function should accept two parameters: the DataFrame and the threshold value. It should return two DataFrames: one containing only the features that meet the variance threshold criterion, and one containing all the scaled features.

**Hint:** Scaling is crucial as it allows the variance thresholding to be applied uniformly across features. Read up on using the `MinMaxScaler()` function from the `sklearn.preprocessing` package.
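
To see why scaling matters, here is a minimal sketch on made-up data. The column names and values are illustrative assumptions and are not taken from `Crop_yield`. Raw variances depend on each feature's units, whereas after min-max scaling every feature lies in the range 0 to 1, so a single threshold can be compared across features.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: two features measured on very different scales
toy = pd.DataFrame({
    "rainfall_mm": [200.0, 400.0, 600.0, 800.0],
    "irrigated": [0, 1, 0, 1],
})

# Raw (population) variances are dominated by the units of measurement
raw_var = toy.var(ddof=0)

# After min-max scaling, every feature lies in [0, 1]
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)
scaled_var = scaled.var(ddof=0)

print(raw_var)
print(scaled_var)
```

Before scaling, the two variances differ by several orders of magnitude; after scaling, they are of comparable size, so one threshold value treats both features on the same footing.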

In [35]:
def variance_thresholding(df: pd.DataFrame, threshold: float):
    # Splitting the dataset into features and target variable for scaling and training
    x = df.drop(columns=["Yield"])
    y = df["Yield"]
    
    # Initialize and fit the scaler to the features only
    scaler = MinMaxScaler()
    scaled_features = scaler.fit_transform(x)
    # Convert the scaled features back to a DataFrame
    df_scaled = pd.DataFrame(scaled_features, columns=x.columns)

    # Initialize the VarianceThreshold object with the specified threshold value
    selector = VarianceThreshold(threshold=threshold)

    # Apply the selector to the scaled feature DataFrame
    df_filtered_values = selector.fit_transform(df_scaled)
    # Convert the array result into a DataFrame with only the selected features
    df_filtered = pd.DataFrame(df_filtered_values, columns=df_scaled.columns[selector.get_support(indices=True)])

    return (df_filtered, df_scaled)

### Exercise 3

Using the function we created in **Exercise 2**, apply variance threshold filtering to our encoded dataset, with a threshold of `0.03`. Compare the number of features before and after applying the variance threshold.

In [36]:
features, scaled_frame = variance_thresholding(df_dummies, threshold=0.03)
features.shape

(1000, 13)

### Exercise 4

Train two linear regression models:

**a)** Using all the available features in our dummy encoded dataset from **Exercise 1**.

In [13]:
target_variable = "Yield"
columns = [col for col in df_dummies.columns if col != target_variable]
X_data = df_dummies[columns]
y_data = df_dummies[target_variable]
# Train-test split the original dataset
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.20, shuffle=False)
lm = LinearRegression()
lm.fit(X_train, y_train)
train_prediction = lm.predict(X_train)
test_prediction = lm.predict(X_test)

In [17]:
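from sklearn.metrics import mean_squared_error, r2_score
# Note: scikit-learn metrics expect the arguments in the order (y_true, y_pred).
# MSE is the same either way, but r2_score is not, so the R² values
# printed by this cell are computed with the arguments swapped.
print("test prediction MSE",mean_squared_error(test_prediction, y_test))
print("train prediction MSE",mean_squared_error(train_prediction, y_train))
print("test prediction r2_score",r2_score(test_prediction, y_test))
print("train prediction r2_score",r2_score(train_prediction, y_train))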

test prediction MSE 0.2410217473797156
train prediction MSE 0.24748513751666118
test prediction r2_score 0.9976125451765713
train prediction r2_score 0.9976213924633842


**b)** Using only the features selected through the variance thresholding process in **Exercise 3**.

In [38]:
X_data = df_dummies[features.columns]
y_data = df_dummies["Yield"]
# Train-test split the original dataset
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.20, shuffle=False)
lm = LinearRegression()
lm.fit(X_train, y_train)
train_prediction = lm.predict(X_train)
test_prediction = lm.predict(X_test)

In [27]:
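from sklearn.metrics import mean_squared_error, r2_score
# Note: scikit-learn metrics expect the arguments in the order (y_true, y_pred).
# MSE is the same either way, but r2_score is not, so the R² values
# printed by this cell are computed with the arguments swapped.
print("test prediction MSE",mean_squared_error(test_prediction, y_test))
print("train prediction MSE",mean_squared_error(train_prediction, y_train))
print("test prediction r2_score",r2_score(test_prediction, y_test))
print("train prediction r2_score",r2_score(train_prediction, y_train))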

test prediction MSE 99.7058593914497
train prediction MSE 100.26807133865265
test prediction r2_score -22.256050625786937
train prediction r2_score -23.907362379498508
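
The strongly negative R² values above deserve a second look. scikit-learn metrics take their arguments in the order `(y_true, y_pred)`. `mean_squared_error` gives the same result either way, but `r2_score` normalizes by the variance of its first argument, so passing the predictions first can make a weak model look far worse than it is. The sketch below uses made-up numbers, not the exercise data, to illustrate the difference; how much of the gap in the exercise output is due to argument order is not something this sketch can establish.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical values chosen for illustration only
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([24.0, 25.0, 25.0, 26.0])  # predictions that barely vary

# MSE is symmetric in its arguments
mse_a = mean_squared_error(y_true, y_pred)
mse_b = mean_squared_error(y_pred, y_true)

# R^2 divides by the variance of the FIRST argument, so the order matters
r2_correct = r2_score(y_true, y_pred)
r2_swapped = r2_score(y_pred, y_true)

print(mse_a, mse_b)
print(r2_correct, r2_swapped)
```

Here the correctly ordered call gives a small positive R², while the swapped call gives a large negative one, even though the MSE is identical in both cases.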


## Solutions

**Note:** Use the comments provided to better understand the various parts of the code solutions below.

### Exercise 1

In [28]:
# Apply dummy variable encoding to the categorical variables
df_encoded = pd.get_dummies(df, columns=["Region", "Soil_Type", "Crop_Variety"], dtype=int)

# Display the first few rows of the modified dataset to confirm the transformation
df_encoded.head()

| | Temperature | Rainfall | Fertilizer_Usage | Pesticide_Usage | Irrigation | Yield | Region_East | Region_North | Region_South | Region_West | Soil_Type_Clayey | Soil_Type_Loamy | Soil_Type_Sandy | Crop_Variety_Variety A | Crop_Variety_Variety B | Crop_Variety_Variety C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 23.152156 | 803.362573 | 204.792011 | 20.76759 | 1 | 40.316318 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | 19.382419 | 571.56767 | 256.201737 | 49.290242 | 0 | 26.846639 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| 2 | 27.89589 | -8.699637 | 222.202626 | 25.316121 | 0 | -0.323558 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 26.741361 | 897.426194 | 187.98409 | 17.115362 | 0 | 45.440871 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 4 | 19.090286 | 649.384694 | 110.459549 | 24.068804 | 1 | 35.478118 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |


In [29]:
# Check the new number of columns
df_encoded.shape

(1000, 16)

The categorical features have been successfully transformed into numerical format. Each unique value in these columns now has its own column containing a binary indicator: `1` if the row belongs to that category and `0` if it does not. Note: since pandas 2.0, `get_dummies` returns boolean `True`/`False` columns by default, which is why we pass `dtype=int` to obtain `1`/`0` indicators.

We examine the new number of columns using the `.shape` attribute. The number of columns has increased from `9` to `16`.
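
The column count follows from the encoding itself: each of the three categorical columns is replaced by one indicator column per category, giving 9 − 3 + (4 + 3 + 3) = 16 columns. As a minimal sketch of the same mechanism, the toy DataFrame below (made-up values, for illustration only) encodes a single categorical column with three categories.

```python
import pandas as pd

# Hypothetical toy frame, for illustration only
toy = pd.DataFrame({
    "Soil_Type": ["Clayey", "Sandy", "Loamy"],
    "Yield": [40.3, 26.8, 35.5],
})

# dtype=int yields 1/0 indicator columns regardless of the pandas default
encoded = pd.get_dummies(toy, columns=["Soil_Type"], dtype=int)

print(encoded)
print(encoded.shape)
```

The single `Soil_Type` column is replaced by three indicator columns, so the toy frame grows from 2 to 4 columns.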

### Exercise 2

In [30]:
def variance_thresholding(df_encoded, threshold_value):
    
    # Splitting the dataset into features and target variable for scaling and training
    X = df_encoded.drop(columns=['Yield']) 
    y = df_encoded['Yield']
    
    # Initialize and fit the scaler to the features only
    scaler = MinMaxScaler()
    scaled_features = scaler.fit_transform(X)
    
    # Convert the scaled features back to a DataFrame
    df_scaled = pd.DataFrame(scaled_features, columns=X.columns)
    
    # Initialize the VarianceThreshold object with the specified threshold value
    selector = VarianceThreshold(threshold=threshold_value)
    
    # Apply the selector to the scaled feature DataFrame
    df_filtered_values = selector.fit_transform(df_scaled)
    
    # Convert the array result into a DataFrame with only the selected features
    df_filtered = pd.DataFrame(df_filtered_values, columns=df_scaled.columns[selector.get_support(indices=True)])
    
    # Return the filtered DataFrame and the full scaled DataFrame
    return df_filtered, df_scaled

We start by scaling our features using the `MinMaxScaler()`.

We then use the `threshold_value` passed as a parameter to filter out features whose variance falls below this value.

The function returns two DataFrames: `df_filtered`, which holds only the features whose variance is above the given threshold, and `df_scaled`, which holds all the scaled features.
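
To check which features a selector keeps and why, the fitted `VarianceThreshold` object exposes the per-feature variances in its `variances_` attribute and a boolean mask through `get_support()`. The sketch below uses a made-up, already-scaled toy DataFrame; the column names are hypothetical.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical, already-scaled toy features (values in [0, 1])
toy_scaled = pd.DataFrame({
    "balanced_flag": [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    "nearly_constant": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.6],
})

selector = VarianceThreshold(threshold=0.03)
selector.fit(toy_scaled)

# variances_ holds the per-feature variance computed during fit
variances = pd.Series(selector.variances_, index=toy_scaled.columns)

# get_support() returns a boolean mask: True for features that are kept
mask = selector.get_support()
kept = toy_scaled.columns[mask].tolist()
dropped = toy_scaled.columns[~mask].tolist()

print(variances)
print("kept:", kept, "dropped:", dropped)
```

The same two attributes can be inspected on a selector fitted to `df_scaled` to list the features that were removed.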

### Exercise 3

In [31]:
# Call the variance_thresholding() function and pass the given threshold
df_filtered, df_scaled = variance_thresholding(df_encoded, 0.03)

# Compare the number of features before and after variance thresholding
print("Number of features before variance thresholding:", df_scaled.shape[1])
print("Number of features after variance thresholding:", df_filtered.shape[1])  

Number of features before variance thresholding: 15
Number of features after variance thresholding: 13


Using a `0.03` threshold, the number of features drops from `15` to `13`, meaning that 2 of the features have been excluded.
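
For intuition about what a `0.03` threshold means for dummy columns: a 0/1 column in which both values occur is left unchanged by min-max scaling, and its variance equals p(1 − p), where p is the share of rows equal to 1. As a rough worked check (an illustration, not part of the original solution, and not a statement about which features were dropped here), solving p(1 − p) = 0.03 shows how rare a category must be before its dummy column falls below the threshold.

```python
import math
import numpy as np

threshold = 0.03

# Solve p * (1 - p) = threshold with the quadratic formula
disc = math.sqrt(1 - 4 * threshold)
p_low, p_high = (1 - disc) / 2, (1 + disc) / 2

# Check against made-up dummy columns of 100 rows each
col_3pct = np.array([1] * 3 + [0] * 97)  # category present in 3% of rows
col_4pct = np.array([1] * 4 + [0] * 96)  # category present in 4% of rows

print(round(p_low, 4), round(p_high, 4))
print(col_3pct.var(), col_4pct.var())
```

A dummy column falls below the threshold only when its category covers less than roughly 3.1% or more than roughly 96.9% of the rows.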

### Exercise 4

**a)**

In [None]:
# Separate the features from the target variable
X_all = df_encoded.drop(columns=['Yield'])
y = df_encoded['Yield']
# Split the full feature set into training and testing sets
X_train_all, X_test_all, y_train_all, y_test_all = train_test_split(X_all, y, test_size=0.2, random_state=42)

# Training the model using all available features
model_all = LinearRegression()
model_all.fit(X_train_all, y_train_all)

**b)**

In [None]:
# Split the filtered (scaled) features into training and testing sets, reusing y from part a)
X_train_filtered, X_test_filtered, y_train_filtered, y_test_filtered = train_test_split(df_filtered, y, test_size=0.2, random_state=42)

# Training the model using selected features
model_filtered = LinearRegression()
model_filtered.fit(X_train_filtered, y_train_filtered)
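
The solution cells above stop after fitting. One way to compare the two models is to score each on its own test split. The sketch below is a minimal illustration: `evaluate` is a hypothetical helper, and the data is a synthetic stand-in rather than the `Crop_yield` dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate(model, X_test, y_test):
    """Return (MSE, R^2) for a fitted model. Hypothetical helper."""
    y_pred = model.predict(X_test)
    # True values go first, predictions second
    return mean_squared_error(y_test, y_pred), r2_score(y_test, y_pred)


# Synthetic stand-in data, for illustration only
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)

mse, r2 = evaluate(model, X_test, y_test)
print(f"MSE: {mse:.4f}, R^2: {r2:.4f}")
```

With the objects defined in the solution, the equivalent calls would be `evaluate(model_all, X_test_all, y_test_all)` and `evaluate(model_filtered, X_test_filtered, y_test_filtered)`.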

<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png" style="width:200px"/>
</div>