### 1. Error Analysis Conclusions & Work Plan

#### **Conclusions from Error Analysis**
Based on the initial error analysis, the following factors were identified as the main causes of errors in the baseline model:

1. **Skewed Distributions**
   - Features like `residual sugar` and `total sulfur dioxide` have highly skewed distributions.
   - The skewness introduces bias in the model’s learning process, leading to poor generalization on extreme values and outliers.

2. **Underrepresentation of Key Patterns**
   - Certain combinations of features (e.g., very low `residual sugar` with high `total sulfur dioxide`) are rare in the dataset, causing the model to underperform on these edge cases.

3. **Feature Redundancy**
   - High correlation between `free sulfur dioxide` and `total sulfur dioxide` indicates redundancy. The two features carry largely the same information, which splits feature importance between them and adds little predictive value.

4. **Bias-Variance Tradeoff**
   - The current model may overfit dominant patterns while underfitting on edge cases, likely due to suboptimal hyperparameters or insufficient regularization.
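
The skewness and redundancy findings above can be checked with a few lines of pandas. The sketch below is a minimal illustration on synthetic data: the `demo` frame and its values are invented stand-ins, not the wine dataset, so only the pattern of the checks carries over.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the wine data; values are illustrative only
rng = np.random.default_rng(0)
total_so2 = rng.lognormal(mean=4.5, sigma=0.6, size=500)
demo = pd.DataFrame({
    "residual sugar": rng.lognormal(mean=1.0, sigma=0.9, size=500),
    "total sulfur dioxide": total_so2,
    "free sulfur dioxide": 0.3 * total_so2 + rng.normal(0, 5, size=500),
    "pH": rng.normal(3.2, 0.15, size=500),
})

# Skewness per feature; values well above 0 indicate a long right tail
skewness = demo.skew().sort_values(ascending=False)
print(skewness)

# Pearson correlation between the two sulfur dioxide features
so2_corr = demo["free sulfur dioxide"].corr(demo["total sulfur dioxide"])
print(f"free vs. total SO2 correlation: {so2_corr:.2f}")
```

On the real data, the same two calls (`DataFrame.skew` and `Series.corr`) would be run on the feature frame before any transformation.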

---

#### **Work Plan for Addressing Errors**
To address the issues identified, the following steps will be implemented:

1. **Handling Skewed Distributions**
   - Apply log transformations to skewed features (`residual sugar`, `total sulfur dioxide`) to stabilize variance and reduce skewness.
   - Use visualizations like histograms to confirm the effect of the transformation.

2. **Improving Representation of Key Patterns**
   - Oversample underrepresented feature combinations, for example with SMOTE (Synthetic Minority Oversampling Technique). SMOTE operates on class labels, so the rare combinations (or binned `quality` scores) must first be labeled as a minority group.
   - Stratify the training data to ensure even representation of different feature ranges during training.

3. **Reducing Feature Redundancy**
   - Create a new composite feature: the sulfur dioxide ratio (`free sulfur dioxide / total sulfur dioxide`) to capture relationships between the redundant features.
   - Remove the individual redundant features after validating the new feature’s utility.

4. **Hyperparameter Tuning**
   - Optimize key hyperparameters such as `max_depth`, `learning_rate`, `n_estimators`, and `min_child_weight` using Grid Search or Randomized Search.
   - Adjust regularization parameters (e.g., `lambda`, `alpha`) to balance bias and variance effectively.

5. **Outlier Handling**
   - Detect and mitigate outliers using techniques like IQR-based capping or robust scaling methods to reduce their influence on the model.
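
As a rough sketch of the IQR-based capping mentioned in the last step, the helper below clips values to the usual 1.5 × IQR fences. The function name `cap_outliers_iqr`, the multiplier `k`, and the sample values are assumptions made for illustration.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest fence."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Toy series with one extreme value at the end
values = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 50.0])
capped = cap_outliers_iqr(values)
print(capped.tolist())
```

In a train/test setting, the fences would normally be computed on the training split and then reused to clip the test split, so that no information from the test data leaks into preprocessing.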


### 2. Improving Model Performance

#### **Identifying Weaknesses in the Baseline Model**
The baseline model exhibits the following weaknesses:
1. **Sensitivity to Skewed Features**:
   - Skewed distributions in `residual sugar` and `total sulfur dioxide` negatively impact the model’s ability to generalize.
2. **Redundant and Low-Importance Features**:
   - Redundant features (e.g., `free sulfur dioxide`, `total sulfur dioxide`) dilute the model’s focus.
   - Features with low correlation to the target (e.g., `pH`) may introduce noise.
3. **Lack of Robust Outlier Handling**:
   - Extreme values adversely affect the model’s predictions, particularly on edge cases.
4. **Limited Hyperparameter Optimization**:
   - The current hyperparameter configuration may not adequately balance bias and variance.

---

#### **Steps to Improve Performance**

### **1. Hyperparameter Tuning**

- Use a systematic approach like Grid Search or Randomized Search to find optimal hyperparameters:
  - `max_depth`: Increase to allow the model to capture more complex patterns.
  - `n_estimators`: Experiment with higher values for better ensemble performance.
  - `learning_rate`: Lower the learning rate while increasing `n_estimators` for a more gradual learning process.
  - `min_child_weight`: Optimize to control the model’s sensitivity to small sample sizes.
- Explore regularization parameters (`lambda`, `alpha`) to mitigate overfitting.


In [15]:
# Import libraries
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBRegressor

# Load the dataset
data = pd.read_csv('wine-quality-white-and-red.csv')

# Check and encode categorical columns
if 'type' in data.columns:
    label_encoder = LabelEncoder()
    data['type'] = label_encoder.fit_transform(data['type'])  # Convert 'type' to numerical

# Prepare features (X) and target (y)
X = data.drop(columns=['quality'])  # Features
y = data['quality']  # Target variable

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the base XGBoost regressor ('type' was label-encoded above, so no categorical support is needed)
xgb_model = XGBRegressor(
    random_state=42,
    objective='reg:squarederror'
)

# Define the parameter grid
param_grid = {
    'max_depth': [3, 5, 7],           # Control the complexity of the model
    'n_estimators': [100, 200, 300],  # Number of trees in the ensemble
    'learning_rate': [0.01, 0.1, 0.2], # Step size for weight updates
    'min_child_weight': [1, 3, 5],    # Minimum sum of instance weight in a child
    'lambda': [1, 2, 5],              # L2 regularization term
    'alpha': [0, 0.1, 0.5]            # L1 regularization term
}

# Perform Grid Search with 5-fold cross-validation
grid_search = GridSearchCV(
    estimator=xgb_model,
    param_grid=param_grid,
    scoring='neg_mean_squared_error', # Use MSE as evaluation metric
    cv=5,                             # 5-fold cross-validation
    verbose=1,                        # Output progress
    n_jobs=-1                         # Use all available processors
)

# Fit the model to the training data
grid_search.fit(X_train, y_train)

# Get the best parameters and the best score
best_params = grid_search.best_params_
best_score = -grid_search.best_score_  # Convert back to positive MSE

print("Best Parameters:", best_params)
print("Best MSE:", best_score)


Fitting 5 folds for each of 729 candidates, totalling 3645 fits
Best Parameters: {'alpha': 0.5, 'lambda': 1, 'learning_rate': 0.1, 'max_depth': 7, 'min_child_weight': 1, 'n_estimators': 300}
Best MSE: 0.4032341796945539



### **2. Feature Engineering**

To enhance the model's performance, the following feature engineering techniques are applied:

**Transform Skewed Features**
- Apply log transformations to features like `residual sugar` and `total sulfur dioxide` to reduce skewness and stabilize variance.
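- Log transformations bring the distribution closer to normal. Tree-based models such as XGBoost split on feature thresholds, so the benefit of a monotonic transform can be small and is worth confirming against validation error.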

In [16]:
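# Log-transform the skewed features; log1p handles zero values safely
import numpy as np

skewed_features = ['residual sugar', 'total sulfur dioxide']
# Work on copies and transform train and test identically so both share the same scale
X_train, X_test = X_train.copy(), X_test.copy()
X_train[skewed_features] = np.log1p(X_train[skewed_features])
X_test[skewed_features] = np.log1p(X_test[skewed_features])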
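
To confirm that the transformation actually reduces skewness, the skew statistic can be compared before and after. This is a small sketch on a synthetic right-skewed sample; the `raw` series is an invented stand-in for `residual sugar`, not real data.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample standing in for `residual sugar`
rng = np.random.default_rng(42)
raw = pd.Series(rng.lognormal(mean=1.0, sigma=0.9, size=1000), name="residual sugar")
transformed = np.log1p(raw)

skew_before = raw.skew()
skew_after = transformed.skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

For the visual check mentioned in the work plan, calling `.hist()` on `raw` and `transformed` (which requires matplotlib) gives the corresponding histograms.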

**Create Composite Features**

Introduce a new feature, the **sulfur dioxide ratio**, calculated as:

$$\text{sulfur dioxide ratio} = \frac{\text{free sulfur dioxide}}{\text{total sulfur dioxide}}$$

This composite feature replaces the redundant features (`free sulfur dioxide` and `total sulfur dioxide`).


In [17]:
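# Create the sulfur dioxide ratio feature.
# 'total sulfur dioxide' was log-transformed in the previous cell, so undo that with
# np.expm1 to compute the ratio from raw values; 1e-6 guards against division by zero.
total_so2_raw = np.expm1(X_train['total sulfur dioxide'])
X_train['sulfur_dioxide_ratio'] = X_train['free sulfur dioxide'] / (total_so2_raw + 1e-6)

# Drop the redundant features (repeat both steps on X_test before evaluating)
X_train = X_train.drop(columns=['free sulfur dioxide', 'total sulfur dioxide'])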


### **3. Remove Low-Importance Features**

Use feature importance scores from the model to identify and remove features contributing minimally to predictions.

This step simplifies the model and reduces noise from irrelevant features, ensuring the model focuses on the most impactful variables.


In [18]:
# Refit on the engineered features so the importances line up with the current columns
# (xgb_model itself has not been fitted yet; GridSearchCV fits clones of it)
xgb_model.fit(X_train, y_train)
feature_importances = xgb_model.feature_importances_

# Rank features by importance
feature_importances_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)
print(feature_importances_df)

# Drop features with importance below a certain threshold (e.g., 0.01)
low_mask = feature_importances_df['Importance'] < 0.01
low_importance_features = feature_importances_df.loc[low_mask, 'Feature'].tolist()
X_train = X_train.drop(columns=low_importance_features)
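
Any held-out data has to pass through the same steps before the model is evaluated on it. One way to keep the splits in sync is to collect the steps in a single helper. The sketch below is a hypothetical illustration: `engineer_features`, its `drop_columns` argument, and the toy values are assumptions, and the log of `total sulfur dioxide` is omitted because that column is dropped once the ratio is computed.

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame, drop_columns=()) -> pd.DataFrame:
    """Apply the feature engineering steps to a raw feature frame and return a new frame."""
    out = df.copy()
    # Ratio computed from the raw values; 1e-6 guards against division by zero
    out["sulfur_dioxide_ratio"] = out["free sulfur dioxide"] / (out["total sulfur dioxide"] + 1e-6)
    # Log-transform the skewed feature
    out["residual sugar"] = np.log1p(out["residual sugar"])
    # Drop the redundant inputs plus any features flagged as low importance
    return out.drop(columns=["free sulfur dioxide", "total sulfur dioxide", *drop_columns])

# Toy frame with made-up values
raw = pd.DataFrame({
    "residual sugar": [1.2, 8.0, 20.5],
    "free sulfur dioxide": [15.0, 40.0, 60.0],
    "total sulfur dioxide": [60.0, 160.0, 200.0],
    "pH": [3.1, 3.3, 3.0],
})
engineered = engineer_features(raw, drop_columns=["pH"])
print(engineered)
```

The list passed as `drop_columns` would come from the importances computed on the training split only, and the same list would then be reused unchanged for the test split.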
