## `Task` Do feature selection on the SECOM dataset using the methods taught in session 54.

Dataset Link : https://archive.ics.uci.edu/ml/datasets/SECOM

Drive Link : https://docs.google.com/spreadsheets/d/1dFCe1zgokabsiEr6BbWmMJtiMefkrChpJWLiG_0dDkk/edit?usp=share_link

In [None]:
# Write your Code here

# Feature Selection Using Variance Threshold

Feature selection is an essential step in preparing data for machine learning models. The Variance Threshold method is one of the techniques used to select relevant features while filtering out those with low variance. Features with low variance typically do not provide much information and may be considered noise in the dataset.

## What is Variance Threshold?

Variance measures the spread or variability of a feature's values. Features with low variance have values that are mostly the same, making them less informative for predictive modeling. On the other hand, features with high variance have values that vary more, potentially containing useful information for the model.

The Variance Threshold method helps identify and retain features with variance above a specified threshold while discarding those below the threshold.
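
To make the idea concrete, here is a minimal sketch that computes per-feature variance by hand and compares it with a threshold. The toy DataFrame and its column names are hypothetical and are not part of SECOM. The sketch uses population variance (`ddof=0`), which is the definition `VarianceThreshold` applies, and keeps features whose variance is strictly above the threshold.

```python
import pandas as pd

# Hypothetical toy data: one constant, one nearly constant, one varying feature.
toy = pd.DataFrame({
    "constant": [100.0, 100.0, 100.0, 100.0],
    "nearly_constant": [0.50, 0.50, 0.51, 0.50],
    "varying": [1.0, 4.0, 2.0, 7.0],
})

threshold = 0.01

# Population variance (ddof=0) of every column.
variances = toy.var(ddof=0)

# Keep only the features whose variance is strictly above the threshold.
kept = variances[variances > threshold].index.tolist()

print(variances)
print("Kept:", kept)
```

Only `varying` survives here: the constant column has zero variance, and the nearly constant one falls well below 0.01.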

## Steps for Using Variance Threshold

1. **Import Libraries**: Import the necessary libraries. In Python, you can use the `VarianceThreshold` class from the `sklearn.feature_selection` module.

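2. **Create and Fit a VarianceThreshold Object**: Initialize a `VarianceThreshold` object with the desired threshold, fit it to the DataFrame (`df` here), and use `get_support()` to recover the names of the retained columns. For example:
   
   ```python
   from sklearn.feature_selection import VarianceThreshold

   # keep only features whose variance is above 0.01
   selector = VarianceThreshold(threshold=0.01)
   sel = selector.fit(df)

   # get_support() returns a Boolean mask over df's columns
   columns = df.columns[sel.get_support()]
   ```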
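
One point worth keeping in mind when choosing the threshold: it is compared against raw variance, so the result depends on each feature's units. The sketch below illustrates this with a hypothetical toy frame (the column names are made up) that holds the same measurement in metres and in millimetres.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy frame: the same measurement in two different units.
toy = pd.DataFrame({
    "length_m": [0.10, 0.12, 0.11, 0.13],
    "length_mm": [100.0, 120.0, 110.0, 130.0],
})

selector = VarianceThreshold(threshold=0.01).fit(toy)

# variances_ holds the per-feature variance computed during fit.
print(dict(zip(toy.columns, selector.variances_)))

# get_support() is a Boolean mask over the input columns.
selected = toy.columns[selector.get_support()].tolist()
print("Selected:", selected)
```

Both columns carry identical information, yet only `length_mm` passes the threshold. With features on very different scales, a single fixed threshold may therefore need some care.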

### `Solution`

In [1]:
import pandas as pd

data = pd.read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQtBXo5cBnDsM2fmfHPm6u72KGUS5FjPHNGMxOfYjA9-CAhmnRpwkIw_rOR3sANJIToiUU__6fbBvig/pub?gid=572763137&single=true&output=csv")

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1567 entries, 0 to 1566
Columns: 592 entries, Time to Pass/Fail
dtypes: float64(590), int64(1), object(1)
memory usage: 7.1+ MB


In [5]:
data.head()

Unnamed: 0,Time,0,1,2,3,4,5,6,7,8,...,581,582,583,584,585,586,587,588,589,Pass/Fail
0,2008-07-19 11:55:00,3030.93,2564.0,2187.7333,1411.1265,1.3602,100.0,97.6133,0.1242,1.5005,...,,0.5005,0.0118,0.0035,2.363,,,,,-1
1,2008-07-19 12:32:00,3095.78,2465.14,2230.4222,1463.6606,0.8294,100.0,102.3433,0.1247,1.4966,...,208.2045,0.5019,0.0223,0.0055,4.4447,0.0096,0.0201,0.006,208.2045,-1
2,2008-07-19 13:17:00,2932.61,2559.94,2186.4111,1698.0172,1.5102,100.0,95.4878,0.1241,1.4436,...,82.8602,0.4958,0.0157,0.0039,3.1745,0.0584,0.0484,0.0148,82.8602,1
3,2008-07-19 14:43:00,2988.72,2479.9,2199.0333,909.7926,1.3204,100.0,104.2367,0.1217,1.4882,...,73.8432,0.499,0.0103,0.0025,2.0544,0.0202,0.0149,0.0044,73.8432,-1
4,2008-07-19 15:22:00,3032.24,2502.87,2233.3667,1326.52,1.5334,100.0,100.3967,0.1235,1.5031,...,,0.48,0.4766,0.1045,99.3032,0.0202,0.0149,0.0044,73.8432,-1


In [9]:
data["1"][2]

2559.94

In [15]:
# importing dependencies
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# dropping column Time
df = data.drop("Time" , axis=1)

# fill NaN values with random values drawn between each column's min and max
for column in df.columns:
    # get the minimum and maximum values of the column (NaN is skipped)
    min_value = df[column].min()
    max_value = df[column].max()

    # generate one random number per missing value
    missing = df[column].isnull()
    random_values = np.random.uniform(min_value, max_value, size=missing.sum())

    random_series = pd.Series(random_values, index=df.index[missing])
    # fill NaN values with the random series; assigning back avoids
    # calling fillna(inplace=True) on a column selection
    df[column] = df[column].fillna(random_series)



In [16]:
df.isnull().sum()

0            0
1            0
2            0
3            0
4            0
            ..
586          0
587          0
588          0
589          0
Pass/Fail    0
Length: 591, dtype: int64
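
Because the fill values are random, the filled data, and hence the accuracies reported further down, can change from run to run. A minimal sketch of the same min/max uniform fill with a seeded generator is shown below. The toy frame is hypothetical and the seed value is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing values.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [10.0, 20.0, np.nan, 40.0],
})

rng = np.random.default_rng(seed=0)  # arbitrary seed, chosen only for repeatability

filled = toy.copy()
for column in filled.columns:
    missing = filled[column].isna()
    low, high = filled[column].min(), filled[column].max()
    # Draw one value per missing cell, uniformly between the observed min and max.
    draws = rng.uniform(low, high, size=missing.sum())
    filled.loc[missing, column] = draws

print(filled)
```

Observed values are left untouched; only the missing cells receive draws, and rerunning the cell with the same seed gives the same frame.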

In [21]:
# Separate the data into x and y

x = df.drop("Pass/Fail", axis=1)
y = df["Pass/Fail"]

# split the dataset into training and testing data
x_train ,x_test , y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=8)
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

# Initialize and train the logistic regression model
log_reg  = LogisticRegression(max_iter=100000)
log_reg.fit(x_train , y_train)

# make predictions on the test data
y_pred = log_reg.predict(x_test)

# calculate the accuracy score
accuracy = accuracy_score(y_test,y_pred)
print("Accuracy is :" , accuracy)

(1253, 590)
(1253,)
(314, 590)
(314,)
Accuracy is : 0.8821656050955414


STOP: TOTAL NO. of f AND g EVALUATIONS EXCEEDS LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
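
The warning above suggests scaling the data. One possible way to do that is to put a `StandardScaler` in front of the classifier in a pipeline, sketched below on synthetic data (a stand-in generated with `make_classification`, not SECOM). The feature scales and `max_iter=1000` are assumptions picked for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for sensor data: 20 features on very different scales.
X, y = make_classification(n_samples=500, n_features=20, random_state=8)
X = X * np.logspace(-3, 3, num=20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=8
)

# The scaler is fitted on the training split only, then applied to the test split.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

n_iter = int(model[-1].n_iter_[0])  # iterations the solver actually used
test_accuracy = model.score(X_test, y_test)
print("Iterations used:", n_iter)
print("Test accuracy:", test_accuracy)
```

Wrapping the scaler in the pipeline keeps the test split out of the scaling statistics, which would not be the case if the whole frame were scaled before splitting.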


In [28]:
duplicate_columns = data.columns[data.T.duplicated()]

In [30]:
df.drop(columns=duplicate_columns , inplace =True)

In [33]:
print("Columns before reduction :",data.shape[1])
print("Columns after reduction :",df.shape[1])


Columns before reduction : 592
Columns after reduction : 487
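
The duplicate-column step relies on transposing the frame so that columns become rows, which `duplicated()` can then compare. A small sketch on a hypothetical toy frame (the column names are made up) shows the mechanics:

```python
import pandas as pd

# Hypothetical toy frame in which "a_copy" repeats "a" exactly.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [5.0, 5.0, 6.0],
    "a_copy": [1.0, 2.0, 3.0],
})

# After transposing, each original column is a row, so duplicated() flags
# columns that repeat an earlier one (the first occurrence is kept).
duplicate_columns = toy.columns[toy.T.duplicated()]
deduplicated = toy.drop(columns=duplicate_columns)

print(list(duplicate_columns))
print(list(deduplicated.columns))
```

Note that the notebook computes the duplicates on the raw `data` rather than on the filled `df`. Once missing values have been replaced by independent random draws, columns that were identical in the raw data would generally no longer match.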


In [35]:
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.01)
# note: df still contains the Pass/Fail target; it is retained by the selector
# and separated from the features in the next cell
sel = selector.fit(df)
# get_support() returns a Boolean mask marking the features whose variance
# exceeds the threshold; use it to retrieve the selected column names
columns = df.columns[sel.get_support()]
df1 = sel.transform(df)

# convert the transformed array back into a DataFrame
df1 = pd.DataFrame(df1 , columns=columns)

print("Number of columns after applying the Variance threshold is :", df1.shape[1])

Number of columns after applying the Variance threshold is : 324
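
The cell above fits the selector on the full frame, target included. An alternative, sketched here under assumptions, is to separate the target first and fit the selector on the training features only, then apply the same mask to the test features. The toy frame below is hypothetical; only the `Pass/Fail` column name is borrowed from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: one spread-out feature, one nearly flat feature, one target.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "informative": rng.normal(0.0, 1.0, size=100),
    "flat": rng.normal(5.0, 0.01, size=100),
    "Pass/Fail": rng.choice([-1, 1], size=100),
})

X = toy.drop(columns="Pass/Fail")
y = toy["Pass/Fail"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=8
)

# Fit on the training features only, then reuse the same mask for the test features.
selector = VarianceThreshold(threshold=0.01).fit(X_train)
kept = X_train.columns[selector.get_support()]

X_train_sel = pd.DataFrame(selector.transform(X_train), columns=kept, index=X_train.index)
X_test_sel = pd.DataFrame(selector.transform(X_test), columns=kept, index=X_test.index)

print(list(kept), X_train_sel.shape, X_test_sel.shape)
```

Passing `index=` when rebuilding the DataFrames keeps the rows aligned with `y_train` and `y_test`.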


### After removing (590 - 323) = 267 feature columns (duplicates plus low-variance features) and retraining the model

In [36]:
x = df1.drop("Pass/Fail", axis=1)
y = df1["Pass/Fail"]

# split the dataset into training and testing data
x_train ,x_test , y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=8)
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

# Initialize and train the logistic regression model
log_reg  = LogisticRegression(max_iter=100000)
log_reg.fit(x_train , y_train)

# make predictions on the test data
y_pred = log_reg.predict(x_test)

# calculate the accuracy score
accuracy = accuracy_score(y_test,y_pred)
print("Accuracy is :" , accuracy)

(1253, 323)
(1253,)
(314, 323)
(314,)
Accuracy is : 0.8789808917197452


STOP: TOTAL NO. of f AND g EVALUATIONS EXCEEDS LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
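
When comparing the two accuracy figures, it can help to know what a trivial model would score. If the classes are imbalanced (this can be checked with `y.value_counts()`, which the notebook does not show), plain accuracy may look high even for a model that always predicts the majority class. The sketch below uses made-up labels, not SECOM, to show a majority-class baseline alongside balanced accuracy.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical imbalanced labels: 93 passes (-1) and 7 fails (1).
y_true = np.array([-1] * 93 + [1] * 7)
X_dummy = np.zeros((len(y_true), 1))  # DummyClassifier ignores the features

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_base = baseline.predict(X_dummy)

base_accuracy = accuracy_score(y_true, y_base)
base_balanced = balanced_accuracy_score(y_true, y_base)
print("Baseline accuracy:", base_accuracy)
print("Baseline balanced accuracy:", base_balanced)
```

On these made-up labels the baseline reaches 0.93 accuracy but only 0.5 balanced accuracy, because it never identifies a single fail.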
