In [11]:
# Aggregate all the code into one cell

import pandas as pd
from scipy.stats import skew
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
import seaborn as sns

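# Load the data
# A Drive "view" link returns an HTML page rather than the CSV,
# so use the direct-download form of the same file ID.
url = "https://drive.google.com/uc?export=download&id=1b8qJrZJ36Qu4a7xSfhXG1ibbLmBIlF4h"
wine_df = pd.read_csv(url, sep=';')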

# Check the skewness of the attributes
skewness = wine_df.apply(lambda x: skew(x))
print(pd.DataFrame(skewness))

# Standardize every column to zero mean and unit variance (this also scales the target, 'quality')
scaler = StandardScaler()
wine_df_scaled = pd.DataFrame(scaler.fit_transform(wine_df), columns=wine_df.columns)
print(wine_df_scaled.head())

# Histograms of the attributes
wine_df_scaled.hist(bins=30, figsize=(20,15))
plt.show()

# Box plots of the attributes
plt.figure(figsize=(20,15))
sns.boxplot(data=wine_df_scaled)
plt.show()

# Scatter plots of the attributes against the target variable
plt.figure(figsize=(20,15))
for i, column in enumerate(wine_df_scaled.columns[:-1]):
    plt.subplot(3, 4, i+1)
    sns.scatterplot(data=wine_df_scaled, x=column, y='quality')
plt.tight_layout()
plt.show()

# Split the data into training and test sets
X = wine_df_scaled.drop('quality', axis=1)
y = wine_df_scaled['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

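# Define the hyperparameters to tune
# Note: scikit-learn 1.2+ no longer accepts 'squared_loss' (it was renamed 'squared_error'),
# and recent releases may expect None rather than the string 'none' for no penalty.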
param_grid = {
    'max_iter': [1000, 5000, 10000],
    'learning_rate': ['constant', 'optimal', 'invscaling', 'adaptive'],
    'loss': ['squared_loss', 'huber', 'epsilon_insensitive', 'squared_epsilon_insensitive'],
    'penalty': ['none', 'l2', 'l1', 'elasticnet']
}

# Create a SGDRegressor object
sgd = SGDRegressor()

# Use GridSearchCV to find the best hyperparameters
grid_search = GridSearchCV(sgd, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# Print the best hyperparameters
print('Best hyperparameters:\n', grid_search.best_params_)

# Evaluate the model on the test set
y_pred = grid_search.predict(X_test)
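print('Test MSE:', mean_squared_error(y_test, y_pred))
# With scoring='neg_mean_squared_error', score() returns the negated MSE, not an accuracy
print('Test score (negative MSE):', grid_search.score(X_test, y_test))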
print('R-squared statistic:', r2_score(y_test, y_pred))

URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)>

In [10]:
The Stochastic Gradient Descent (SGD) Regressor was used to predict the quality of the wine in the dataset. The SGDRegressor is a linear model fitted by minimizing a regularized empirical loss with SGD. It is particularly useful when the number of samples is very large.
The hyperparameters of the SGDRegressor were tuned using GridSearchCV, which performs an exhaustive search over the specified parameter grid and scores every combination by 5-fold cross-validation on the training set. The combination with the best mean cross-validated score (here, the lowest mean squared error) was selected. The best hyperparameters found were: {'learning_rate': 'adaptive', 'loss': 'squared_loss', 'max_iter': 1000, 'penalty': 'elasticnet'}
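To make the size of this search concrete, the sketch below enumerates the same parameter grid using only the standard library. It assumes the four parameter lists and cv=5 from the tuning cell; the parameter values act purely as labels here (nothing is fitted), and the count excludes the final refit on the full training set.

```python
from itertools import product

# Same grid as in the tuning cell; the values are only labels in this sketch.
param_grid = {
    'max_iter': [1000, 5000, 10000],
    'learning_rate': ['constant', 'optimal', 'invscaling', 'adaptive'],
    'loss': ['squared_loss', 'huber', 'epsilon_insensitive', 'squared_epsilon_insensitive'],
    'penalty': ['none', 'l2', 'l1', 'elasticnet'],
}
cv_folds = 5  # matches cv=5 in the GridSearchCV call

# One candidate per element of the Cartesian product of the value lists.
candidates = [dict(zip(param_grid, combo)) for combo in product(*param_grid.values())]
n_candidates = len(candidates)
n_fits = n_candidates * cv_folds  # each candidate is fitted once per fold

print(n_candidates, "candidates,", n_fits, "cross-validation fits")
```

Each candidate is trained once per fold, which is why the search time grows quickly as values are added to any one list.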
The model was then evaluated on the test set. The test error, which is the mean squared error between the predicted and actual values on the test set, was found to be: 0.7256889648345134
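The value returned by grid_search.score on the test set was: -0.7256889648345134. Because the search was configured with scoring='neg_mean_squared_error', this is the negated test MSE, not an accuracy and not the coefficient of determination R^2.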
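A small check on synthetic data (not the wine data) illustrates what the score method of a fitted GridSearchCV returns when scoring='neg_mean_squared_error' is set: the negated mean squared error rather than R^2. The generated data, the seeds, and the two-value grid below are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic linear data with noise (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# A deliberately tiny grid, scored the same way as in the notebook.
grid = GridSearchCV(
    SGDRegressor(random_state=0),
    {'penalty': ['l2', 'l1']},
    cv=5,
    scoring='neg_mean_squared_error',
)
grid.fit(X_train, y_train)

y_pred = grid.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
score = grid.score(X_test, y_test)  # uses the configured scorer
r2 = r2_score(y_test, y_pred)

print('MSE:  ', mse)
print('score:', score)
print('R^2:  ', r2)
```

The printed score should equal the MSE with its sign flipped, while R^2 is a separate quantity that has to be computed with r2_score.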
The R-squared statistic, which measures the proportion of the variance in the dependent variable that is explained by the independent variables in a regression model, was: 0.2652010398906337
The results suggest that the model has only modest predictive power. The R-squared statistic of approximately 0.27 indicates that about 27% of the variance in the quality of the wine is explained by the model. This suggests that there may be other factors not included in the dataset that influence the quality of the wine, or that the relationship is not well captured by a linear model.
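The "proportion of variance explained" reading follows from the definition R^2 = 1 - SS_res / SS_tot. The following is a minimal sketch of that formula in plain Python; the helper name r_squared and the four-point example are made up for illustration and are not the wine results.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1.0 - ss_res / ss_tot

# Made-up values for illustration.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [4.0, 5.0, 6.0, 9.0]

r2_model = r_squared(y_true, y_pred)
# Predicting the mean of y_true everywhere explains none of the variance.
r2_mean_baseline = r_squared(y_true, [6.0] * 4)

print(r2_model, r2_mean_baseline)
```

A model that does no better than predicting the mean scores 0, so a value near 0.27 means the predictions remove roughly a quarter of the squared error of that baseline.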
In conclusion, the SGDRegressor model with the tuned hyperparameters captures some of the variation in wine quality, but there is considerable room for improvement. Other models or feature engineering techniques could potentially improve the predictive power.




SyntaxError: invalid syntax (3853803651.py, line 1)

In [4]:
import statsmodels.api as sm

# Add a constant to the independent variables
X_train_ols = sm.add_constant(X_train)
X_test_ols = sm.add_constant(X_test)

# Fit the OLS model
ols_model = sm.OLS(y_train, X_train_ols)
ols_results = ols_model.fit()

# Print the model summary
print(ols_results.summary())
# Predict on the test set
y_pred_ols = ols_results.predict(X_test_ols)

# Evaluate the model on the test set
print('OLS Test MSE:', mean_squared_error(y_test, y_pred_ols))
print('OLS R-squared statistic:', r2_score(y_test, y_pred_ols))


                            OLS Regression Results                            
Dep. Variable:                quality   R-squared:                       0.284
Model:                            OLS   Adj. R-squared:                  0.282
Method:                 Least Squares   F-statistic:                     141.1
Date:                Sat, 30 Sep 2023   Prob (F-statistic):          8.15e-274
Time:                        20:15:03   Log-Likelihood:                -4909.6
No. Observations:                3918   AIC:                             9843.
Df Residuals:                    3906   BIC:                             9919.
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                   -0.0025 