## Handling Missing Values in Large-scale ML Pipelines:

**Task 1**: Impute with Mean or Median
- Step 1: Load a dataset with missing values (e.g., Boston Housing dataset).
- Step 2: Identify columns with missing values.
- Step 3: Impute missing values using the mean or median of the respective columns.

In [None]:
# write your code from here

In [1]:
from sklearn.datasets import fetch_openml
from sklearn.impute import SimpleImputer
import pandas as pd

# Load the Boston Housing dataset
boston = fetch_openml(name="boston", version=1, as_frame=True)
boston_df = boston.frame

# Identify columns with missing values
missing_columns = boston_df.columns[boston_df.isnull().any()]
print("Columns with missing values:", missing_columns)

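# Impute missing values using the mean (numeric columns only).
# The OpenML Boston dataset has no missing values, so guard against an empty
# selection: SimpleImputer raises a ValueError when it is given zero columns.
numeric_missing = boston_df[missing_columns].select_dtypes(include="number").columns
if len(numeric_missing) > 0:
    imputer = SimpleImputer(strategy='mean')
    boston_df[numeric_missing] = imputer.fit_transform(boston_df[numeric_missing])
else:
    print("No numeric columns with missing values; nothing to impute.")

# Verify that there are no missing values
print("Missing values after imputation:", boston_df.isnull().sum().sum())

Columns with missing values: Index([], dtype='object')
No numeric columns with missing values; nothing to impute.
Missing values after imputation: 0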
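Because the Boston data contains no missing values, the difference between the two strategies is easier to see on a small example. The sketch below uses a hypothetical toy frame (`toy_df` and its values are made up for illustration) in which column `a` contains an outlier, so the mean and median fills diverge.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: column "a" has an outlier (100.0); each column has one gap
toy_df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 3.0, 100.0],
    "b": [10.0, np.nan, 30.0, 40.0, 50.0],
})

# Fill the gaps with the column mean
mean_filled = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(toy_df), columns=toy_df.columns
)
# Fill the gaps with the column median
median_filled = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(toy_df), columns=toy_df.columns
)

print("Mean fill for 'a':  ", mean_filled.loc[2, "a"])    # 26.5, pulled up by the outlier
print("Median fill for 'a':", median_filled.loc[2, "a"])  # 2.5, unaffected by the outlier
```

The median is usually the safer default for skewed columns or columns with outliers; the mean is reasonable when the column is roughly symmetric.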

**Task 2**: Impute with the Most Frequent Value
- Step 1: Use the Titanic dataset and identify columns with missing values.
- Step 2: Impute categorical columns using the most frequent value.

In [None]:
# write your code from here

In [2]:
from sklearn.datasets import fetch_openml
from sklearn.impute import SimpleImputer
import pandas as pd

# Load the Titanic dataset
titanic = fetch_openml(name="titanic", version=1, as_frame=True)
titanic_df = titanic.frame

# Identify columns with missing values
missing_columns_titanic = titanic_df.columns[titanic_df.isnull().any()]
print("Columns with missing values in Titanic dataset:", missing_columns_titanic)

# Impute categorical columns using the most frequent value
categorical_columns = titanic_df.select_dtypes(include=['category', 'object']).columns
categorical_missing_columns = [col for col in categorical_columns if col in missing_columns_titanic]

imputer_most_frequent = SimpleImputer(strategy='most_frequent')
titanic_df[categorical_missing_columns] = imputer_most_frequent.fit_transform(titanic_df[categorical_missing_columns])

# Verify that there are no missing values in categorical columns
print("Missing values after imputation in categorical columns:", titanic_df[categorical_missing_columns].isnull().sum().sum())

Columns with missing values in Titanic dataset: Index(['age', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'], dtype='object')
Missing values after imputation in categorical columns: 0
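The output above shows that numeric columns such as `age` and `fare` also contain gaps, which this task leaves untouched. One possible way to handle both column types in a single step is a `ColumnTransformer`. The sketch below assumes a hypothetical two-column toy frame rather than the real Titanic data, and the choice of the median for numeric columns is an assumption, not part of the task.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one categorical and one numeric column
toy_df = pd.DataFrame({
    "embarked": pd.Series(["S", "C", np.nan, "S", "Q"], dtype=object),
    "age": [22.0, np.nan, 30.0, 40.0, 28.0],
})

# Most frequent value for the categorical column, median for the numeric one
imputer = ColumnTransformer([
    ("cat", SimpleImputer(strategy="most_frequent"), ["embarked"]),
    ("num", SimpleImputer(strategy="median"), ["age"]),
])

# The transformer returns columns in the order the transformers are listed
filled = pd.DataFrame(imputer.fit_transform(toy_df), columns=["embarked", "age"])
print(filled)
```

Note that the combined output is an object-typed array, so numeric columns may need to be cast back with `astype(float)` before modelling.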


**Task 3**: Advanced Imputation - k-Nearest Neighbors
- Step 1: Implement KNN imputation using the KNNImputer from sklearn.
- Step 2: Explore how KNN imputation improves data completion over simpler methods.

In [None]:
# write your code from here

In [3]:
from sklearn.impute import KNNImputer

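# Implement KNN imputation
knn_imputer = KNNImputer(n_neighbors=5)

# KNNImputer works on numeric arrays, so cast any categorical columns
# (CHAS and RAD in the OpenML version) to float before fitting.
# Building a new DataFrame avoids the dtype warnings that in-place
# assignment through .iloc can trigger.
boston_numeric = boston_df.astype(float)
boston_df_imputed = pd.DataFrame(
    knn_imputer.fit_transform(boston_numeric),
    columns=boston_numeric.columns,
    index=boston_numeric.index,
)
# Note: the Boston data has no missing values, so nothing is actually filled in here.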

# Verify that there are no missing values
print("Missing values after KNN imputation:", boston_df_imputed.isnull().sum().sum())

Missing values after KNN imputation: 0
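Since the Boston data has no gaps to fill, Step 2 is easier to explore on data where the true values are known. The following is a minimal sketch on hypothetical synthetic data: two strongly related features are generated, some values of the second are hidden, and the reconstruction error of mean imputation is compared with that of `KNNImputer`. The sample size, noise level, and 20% masking rate are arbitrary assumptions.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(0)

# Hypothetical data: the second feature is a noisy linear function of the first
x = rng.uniform(0, 10, size=300)
full = np.column_stack([x, 2 * x + rng.normal(0, 0.5, size=300)])

# Hide roughly 20% of the second feature
mask = rng.random(300) < 0.2
data = full.copy()
data[mask, 1] = np.nan


def rmse(imputed):
    """Root mean squared error on the hidden entries only."""
    return np.sqrt(np.mean((imputed[mask, 1] - full[mask, 1]) ** 2))


mean_rmse = rmse(SimpleImputer(strategy="mean").fit_transform(data))
knn_rmse = rmse(KNNImputer(n_neighbors=5).fit_transform(data))

print("RMSE with mean imputation:", mean_rmse)
print("RMSE with KNN imputation: ", knn_rmse)
```

KNN imputation does better here because neighbours are found through the observed feature, which is informative about the missing one. When features are unrelated, the advantage over the mean largely disappears.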



## Feature Scaling & Normalization Best Practices:

**Task 1**: Standardization
- Step 1: Standardize features using StandardScaler.
- Step 2: Observe how standardization affects data distribution.

In [None]:
# write your code from here

In [4]:
from sklearn.preprocessing import StandardScaler

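# Standardize features using StandardScaler
scaler = StandardScaler()
boston_df_standardized = pd.DataFrame(
    scaler.fit_transform(boston_df_imputed),
    columns=boston_df_imputed.columns,
    index=boston_df_imputed.index,
)

# Verify the mean and standard deviation of standardized features.
# StandardScaler divides by the population standard deviation (ddof=0), while
# pandas .std() defaults to the sample estimate (ddof=1), so the value printed
# below is sqrt(n / (n - 1)), slightly above 1, rather than exactly 1.
print("Mean of standardized features:", boston_df_standardized.mean().mean())
print("Standard deviation of standardized features:", boston_df_standardized.std().mean())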

Mean of standardized features: -1.7866378431291338e-16
Standard deviation of standardized features: 1.0009896093465718
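In a modelling pipeline the scaler is normally fitted on the training split only and then applied to the test split, so that test statistics do not leak into preprocessing. A minimal sketch on hypothetical skewed data (the exponential features and the 75/25 split are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.exponential(scale=3.0, size=(200, 2))  # hypothetical right-skewed features

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Learn the mean and standard deviation from the training data only
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print("Train mean:", X_train_std.mean(axis=0))        # approximately 0
print("Train std (ddof=0):", X_train_std.std(axis=0))  # approximately 1
print("Test mean:", X_test_std.mean(axis=0))          # close to 0, but not exact
```

Standardization shifts and rescales each feature but does not change the shape of its distribution: a skewed feature stays skewed after scaling.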


**Task 2**: Min-Max Scaling

- Step 1: Scale features to lie between 0 and 1 using MinMaxScaler.
- Step 2: Compare with standardization.

In [None]:
# write your code from here

In [5]:
from sklearn.preprocessing import MinMaxScaler

# Scale features to lie between 0 and 1 using MinMaxScaler
min_max_scaler = MinMaxScaler()
boston_df_minmax_scaled = pd.DataFrame(
    min_max_scaler.fit_transform(boston_df_imputed),
    columns=boston_df_imputed.columns,
    index=boston_df_imputed.index,
)

# Verify the minimum and maximum values of scaled features
print("Minimum value of scaled features:", boston_df_minmax_scaled.min().min())
print("Maximum value of scaled features:", boston_df_minmax_scaled.max().max())

Minimum value of scaled features: 0.0
Maximum value of scaled features: 1.0
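For the comparison in Step 2, a single feature with one outlier makes the difference visible. The values below are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature: four ordinary values and one outlier
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

minmax = MinMaxScaler().fit_transform(values).ravel()
standard = StandardScaler().fit_transform(values).ravel()

print("Min-max scaled: ", np.round(minmax, 4))
print("Standardized:   ", np.round(standard, 4))
```

Min-max scaling guarantees the [0, 1] range, but the outlier defines the maximum, so the four ordinary values are squeezed into a narrow band near 0. Standardization has no fixed range and is also pulled by the outlier, because both the mean and the standard deviation are sensitive to it.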


**Task 3**: Robust Scaling
- Step 1: Scale features using RobustScaler, which is useful for data with outliers.
- Step 2: Assess changes in data scaling compared to other scaling methods.

In [None]:
# write your code from here

In [6]:
from sklearn.preprocessing import RobustScaler

# Scale features using RobustScaler
robust_scaler = RobustScaler()
boston_df_robust_scaled = pd.DataFrame(
    robust_scaler.fit_transform(boston_df_imputed),
    columns=boston_df_imputed.columns,
    index=boston_df_imputed.index,
)

# Verify the median and interquartile range of scaled features
print("Median of robust scaled features:", boston_df_robust_scaled.median().median())
print("Interquartile range of robust scaled features:", (boston_df_robust_scaled.quantile(0.75) - boston_df_robust_scaled.quantile(0.25)).median())

Median of robust scaled features: 0.0
Interquartile range of robust scaled features: 1.0
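To assess how robust scaling differs from the other methods, it helps to apply it to a feature with an obvious outlier. This sketch uses hypothetical values; with its default settings `RobustScaler` subtracts the median and divides by the interquartile range.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical feature: four ordinary values and one outlier
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Median is 3 and the IQR is 4 - 2 = 2, so each value becomes (x - 3) / 2
robust = RobustScaler().fit_transform(values).ravel()
standard = StandardScaler().fit_transform(values).ravel()

print("Robust scaled:", robust)  # [-1.  -0.5  0.   0.5 48.5]
print("Standardized: ", np.round(standard, 3))
```

The ordinary values keep a sensible spread after robust scaling, while the outlier remains clearly extreme. Robust scaling does not remove or clip outliers; it only prevents them from dominating the scale.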


## Feature Selection Techniques:
### Removing Highly Correlated Features:

**Task 1**: Correlation Matrix
- Step 1: Compute correlation matrix.
- Step 2: Remove highly correlated features (absolute correlation > 0.9), keeping one feature from each correlated pair.

In [None]:
# write your code from here

In [7]:
# Compute the correlation matrix for the Boston dataset
correlation_matrix = boston_df.corr()

# Identify highly correlated features (correlation > 0.9)
highly_correlated_features = set()
for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) > 0.9:
            colname = correlation_matrix.columns[i]
            highly_correlated_features.add(colname)

# Drop highly correlated features from the dataset
boston_df_reduced = boston_df.drop(columns=highly_correlated_features)

# Display the reduced dataset
print("Reduced dataset shape:", boston_df_reduced.shape)
print("Dropped features due to high correlation:", highly_correlated_features)

Reduced dataset shape: (506, 13)
Dropped features due to high correlation: {'TAX'}
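The nested loop above can also be written with an upper-triangle mask, which is a common vectorized alternative. The sketch below runs on hypothetical synthetic data (`toy_df` is made up) and follows the same rule as the loop: for each highly correlated pair, the later column is dropped.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)

# Hypothetical data: "a_copy" is a near-duplicate of "a"; "b" is independent
toy_df = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})

corr = toy_df.corr().abs()

# Keep only entries strictly above the diagonal so each pair is checked once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# A column is dropped if it correlates strongly with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = toy_df.drop(columns=to_drop)

print("Dropped:", to_drop)
print("Remaining columns:", reduced.columns.tolist())
```

When the frame includes the target column, it is usually worth excluding it before computing the matrix, so that a feature is never dropped merely for being strongly correlated with the target.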


### Using Mutual Information & Variance Thresholds:

**Task 2**: Mutual Information
- Step 1: Compute mutual information between features and target.
- Step 2: Retain features with high mutual information scores.

In [None]:
# write your code from here

In [8]:
from sklearn.feature_selection import mutual_info_regression

# Compute mutual information between features and target
target = boston_df['MEDV']
features = boston_df.drop(columns=['MEDV'])

mutual_info = mutual_info_regression(features, target)
mutual_info_series = pd.Series(mutual_info, index=features.columns)

# Retain features with high mutual information scores
high_mutual_info_features = mutual_info_series[mutual_info_series > 0.1].index
boston_df_high_mi = boston_df[high_mutual_info_features]

# Display the retained features
print("Retained features with high mutual information scores:", high_mutual_info_features.tolist())

Retained features with high mutual information scores: ['CRIM', 'ZN', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
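Mutual information can pick up nonlinear dependence that a correlation coefficient misses. The following sketch illustrates this on hypothetical synthetic data, where the target depends on the square of one feature. Passing `random_state` makes the estimate reproducible, since the estimator involves randomness; the 0.1 threshold is the same arbitrary cut-off used above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 500

# Hypothetical features: one drives the target, the other is pure noise
X = pd.DataFrame({
    "informative": rng.uniform(-3, 3, size=n),
    "noise": rng.normal(size=n),
})
# Nonlinear (quadratic) dependence on the informative feature
y = X["informative"] ** 2 + rng.normal(scale=0.1, size=n)

scores = pd.Series(
    mutual_info_regression(X, y, random_state=0), index=X.columns
)
selected = scores[scores > 0.1].index.tolist()

print("Pearson correlation with target:")
print(X.corrwith(y))
print("Mutual information scores:")
print(scores)
print("Selected features:", selected)
```

The linear correlation between `informative` and the target is close to zero because the relationship is symmetric around zero, yet its mutual information score is high.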


**Task 3**: Variance Threshold
- Step 1: Implement VarianceThreshold to remove features with low variance.
- Step 2: Analyze impact on feature space.

In [9]:
from sklearn.feature_selection import VarianceThreshold

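# Implement VarianceThreshold to remove features with low variance
variance_threshold = VarianceThreshold(threshold=0.01)
variance_threshold.fit(boston_df)

# Select the surviving columns with the boolean support mask.
# Assigning the transformed array back into a full-size copy would always
# report the original shape and would fail as soon as any feature is removed.
kept_columns = boston_df.columns[variance_threshold.get_support()]
boston_df_high_variance = boston_df[kept_columns]

# Display the shape of the dataset after applying VarianceThreshold
print("Dataset shape after applying VarianceThreshold:", boston_df_high_variance.shape)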

Dataset shape after applying VarianceThreshold: (506, 14)
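No feature is removed from the Boston data at this threshold, so the effect on the feature space is easier to analyze on a small example. The sketch below uses a hypothetical toy frame with one constant and one nearly constant column.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: two low-variance columns and one informative column
toy_df = pd.DataFrame({
    "constant": [1.0] * 6,
    "almost_constant": [0.0, 0.0, 0.0, 0.0, 0.0, 0.1],
    "varied": [1.0, 5.0, 2.0, 8.0, 3.0, 9.0],
})

selector = VarianceThreshold(threshold=0.01).fit(toy_df)

# get_support() returns a boolean mask over the input columns
kept = toy_df.columns[selector.get_support()]
reduced = toy_df[kept]

print("Variances:", dict(zip(toy_df.columns, selector.variances_)))
print("Kept columns:", kept.tolist())
print("Shape before:", toy_df.shape, "after:", reduced.shape)
```

Variance depends on the units of each feature, so a fixed threshold is only comparable across features that share a scale. Applying the threshold after standardization is not useful, however, because every standardized feature has variance 1.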


In [None]:
# write your code from here
