# Model Building

In [3]:
%%capture
%run 05_feature_selection.ipynb

## 1. Linear regression using sm.OLS

In [4]:
# explicit imports; these may already be in scope via the %run above
import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def evaluate_ols_model(X, y, test_size=0.2, random_state=42):
    """
    Create an OLS regression model and evaluate it.

    Parameters:
    - X (pandas.DataFrame): Independent variables.
    - y (pandas.Series): Dependent variable.
    - test_size (float): Size of the test set in the train-test split (default: 0.2).
    - random_state (int): Random seed for reproducibility (default: 42).

    Returns:
    OLS model results summary, R-squared score, and Mean Squared Error (MSE) on the test set.
    """
    # train-test split (use the function arguments rather than hard-coded values)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )

    # add a constant term to the independent variables for the OLS model
    X_train = sm.add_constant(X_train)
    X_test = sm.add_constant(X_test)

    # fit & predict
    model = sm.OLS(y_train, X_train)
    model_results = model.fit()
    y_pred = model_results.predict(X_test)

    # evaluate
    test_r2 = r2_score(y_test, y_pred)
    test_mse = mean_squared_error(y_test, y_pred)

    return model_results, test_r2, test_mse 
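
To make the two returned test metrics concrete, here is a minimal pure-Python sketch of what R-squared and MSE measure. The helper names `r2` and `mse` and the toy values are illustrative assumptions, not part of the notebook, which presumably relies on the scikit-learn implementations.

```python
def mse(y_true, y_pred):
    """Mean of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """1 - SS_res / SS_tot; negative when worse than predicting the mean."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# toy values, for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
y_good = [2.5, 5.0, 7.5, 9.0]
y_bad = [9.0, 7.0, 5.0, 3.0]

print(round(mse(y_true, y_good), 3))  # 0.125
print(round(r2(y_true, y_good), 3))   # 0.975
print(round(r2(y_true, y_bad), 3))    # -3.0
```

The last line shows that R-squared on held-out data is not bounded below by zero, which matters when test scores are compared across models.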

In [5]:
results_df = pd.DataFrame()

def find_best_model(df, features=None):
    """
    Find the best OLS model among different feature sets for a given dataset.

    Parameters:
    - df (pandas.DataFrame): Dataset.
    - features (list): List of dicts mapping a dataset name to its feature list
      (default: [top_features_dict, features_among_highly_associated_dict]).

    Returns:
    - pandas.Series: Series containing information about the best OLS model: model name, model, features list, r2 for test set, MSE for test set.
    """
    if features is None:
        features = [top_features_dict, features_among_highly_associated_dict]

    dataset_name = get_wine_str(df)
    model_name = f"Model_{dataset_name}"
    # test R-squared can be negative, so start below any attainable score
    best_r2 = float('-inf')
    best_model, best_features, best_mse = None, None, None

    for feature_set in features:
        # create X and y sets
        X = df[feature_set[dataset_name]]
        y = df['quality']

        # evaluate the OLS model
        model, r2, mse = evaluate_ols_model(X, y)

        # keep the feature set with the highest test R-squared
        if r2 > best_r2:
            best_model, best_features, best_r2, best_mse = model, list(X.columns), r2, mse

    # build the result once, after all feature sets have been compared
    return pd.Series({
        'Model name': model_name,
        'model': best_model,
        'Features': best_features,
        'Test r2': best_r2,
        'Test MSE': best_mse
    })
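
The selection step in `find_best_model` amounts to keeping the candidate with the highest test R-squared. Because R-squared on held-out data can be negative, a running best that starts at negative infinity always ends with a winner. A minimal sketch, in which `pick_best`, the candidate names, and the scores are all hypothetical:

```python
def pick_best(candidates):
    """Return the (name, score) pair with the highest score."""
    best_name, best_score = None, float("-inf")
    for name, score in candidates:
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


# hypothetical test R-squared values; both are negative
candidates = [("feature_set_a", -0.12), ("feature_set_b", -0.03)]
print(pick_best(candidates))  # ('feature_set_b', -0.03)
```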

Now we evaluate every candidate feature set on each dataset and keep the best model per dataset.

In [6]:
for dataset in wine_quality_without_outliers_dfs:
    # one column per dataset; the frame is transposed into rows in the next section
    best_model_info = find_best_model(dataset)
    results_df = pd.concat([results_df, best_model_info], axis=1)

## 2. The list of the best models with their scores:

In [7]:
results_df = results_df.transpose()
results_df

| | Model name | model | Features | Test r2 | Test MSE |
|---|---|---|---|---|---|
| 0 | Model_White Wine Poor (Without Outliers) | `<statsmodels.regression.linear_model.Regressio...` | [free sulfur dioxide, volatile acidity, residu... | 0.0702 | 0.136345 |
| 0 | Model_White Wine Good (Without Outliers) | `<statsmodels.regression.linear_model.Regressio...` | [alcohol, density, chlorides, total sulfur dio... | 0.119301 | 0.32609 |
| 0 | Model_Red Wine Poor (Without Outliers) | `<statsmodels.regression.linear_model.Regressio...` | [volatile acidity, total sulfur dioxide, pH, c... | 0.133565 | 0.082503 |
| 0 | Model_Red Wine Good (Without Outliers) | `<statsmodels.regression.linear_model.Regressio...` | [alcohol, volatile acidity, sulphates, citric ... | 0.284669 | 0.147073 |


### 2.1 Comparison of poor wine:

In [8]:
def describe_the_model(number_of_the_model):
    '''
    Print the OLS summary and the test-set scores for one model.

    Parameters:
    - number_of_the_model (int): Row position of the model in results_df.
    '''
    model = results_df.iloc[number_of_the_model]
    print('-------------------------------------------------------')
    print(f'The summary for the {model["Model name"]}:\n\n{model["model"].summary()}.\n')
    print('-------------------------------------------------------')
    print(f'R-squared on Test Set: {model["Test r2"]}')
    print(f'Mean Squared Error on Test Set: {model["Test MSE"]}\n')

In [9]:
describe_the_model(0)

-------------------------------------------------------
The summary for the Model_White Wine Poor (Without Outliers):

                            OLS Regression Results                            
Dep. Variable:                quality   R-squared:                       0.067
Model:                            OLS   Adj. R-squared:                  0.061
Method:                 Least Squares   F-statistic:                     11.69
Date:                Wed, 22 Nov 2023   Prob (F-statistic):           3.67e-16
Time:                        16:51:27   Log-Likelihood:                -443.99
No. Observations:                1308   AIC:                             906.0
Df Residuals:                    1299   BIC:                             952.6
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
---

In [10]:
describe_the_model(2)

-------------------------------------------------------
The summary for the Model_Red Wine Poor (Without Outliers):

                            OLS Regression Results                            
Dep. Variable:                quality   R-squared:                       0.099
Model:                            OLS   Adj. R-squared:                  0.087
Method:                 Least Squares   F-statistic:                     8.017
Date:                Wed, 22 Nov 2023   Prob (F-statistic):           2.76e-10
Time:                        16:51:27   Log-Likelihood:                -180.16
No. Observations:                 592   AIC:                             378.3
Df Residuals:                     583   BIC:                             417.8
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----

#### Discussion:

**White Poor Wine (N=1639)**

* **R-squared on the test = 0.070:**
  - The R-squared value on the test set is 0.070, indicating that only 7% of the variability in the dependent variable (quality) is explained by the model. On the training set, R-squared is 0.067 and the adjusted R-squared, which penalizes the number of predictor variables, is 0.061. With so little variance explained, some of the eight predictors likely add little, so a smaller feature set may perform about as well.

* **F-Statistic = 11.69:**
  - With Prob (F-statistic) = 3.67e-16, the model is statistically significant overall.

* **Features Significantly Constituting the Model:**
  - **free sulfur dioxide:** A one-unit increase in free sulfur dioxide is associated with a 0.0020 increase in quality.
  - **volatile acidity:** A one-unit increase in volatile acidity is associated with a 0.1660 decrease in quality.
  - **residual sugar:** A one-unit increase in residual sugar is associated with a 0.0043 increase in quality.

**Red Poor Wine (N=741)**

* **R-squared on the test = 0.134:**
  - The R-squared value on the test set is 0.134, indicating that approximately 13.4% of the variability in the dependent variable (quality) is explained by the model. On the training set, R-squared is 0.099 and the adjusted R-squared, which penalizes the number of predictor variables, is 0.087. The adjusted value is slightly lower than the R-squared, suggesting that not all added variables contribute meaningfully to the model.

* **F-Statistic = 8.017:**
  - The F-statistic tests the overall significance of the model. With Prob (F-statistic) = 2.76e-10, the model is statistically significant.

* **Features Significantly Constituting the Model:**
  - **volatile acidity:** A one-unit increase in volatile acidity is associated with a decrease of approximately 0.537 in wine quality.
  - **total sulfur dioxide:** A one-unit increase in total sulfur dioxide is associated with an increase of 0.0017 in wine quality.
  - **pH:** A one-unit increase in pH is associated with a decrease of approximately 0.233 in wine quality.
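
The adjusted R-squared values quoted in the summaries can be reproduced from the training R-squared, the number of observations, and the number of predictors (`Df Model`). The helper below is a small sketch; the function name `adjusted_r2` is an assumption, and the inputs are the rounded figures printed in the OLS summaries.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# white poor wine: R-squared 0.067, 1308 observations, 8 predictors
print(round(adjusted_r2(0.067, 1308, 8), 3))  # 0.061
# red poor wine: R-squared 0.099, 592 observations, 8 predictors
print(round(adjusted_r2(0.099, 592, 8), 3))   # 0.087
```

The penalty grows as the sample shrinks relative to the number of predictors, which is why the gap is wider for the smaller red wine sample.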

**Summary**: 

Based on our models, volatile acidity and the level of free or total sulfur dioxide are important features in predicting the quality of poor wine. Neither model performs well, and the adjusted R-squared values suggest that we could reduce the number of features without losing much information. It would also be interesting to observe how multicollinear features, such as total sulfur dioxide and free sulfur dioxide, influence the model when combined.
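
One hedged way to quantify the overlap between two such features is the variance inflation factor (VIF), which for exactly two predictors reduces to 1 / (1 - r²), where r is their Pearson correlation. The sketch below uses made-up values purely for illustration; they are not taken from the wine data, and the function names are assumptions.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def vif_two_predictors(x, y):
    """VIF when the model contains only these two predictors."""
    r = pearson_r(x, y)
    return 1 / (1 - r ** 2)


# made-up values, for illustration only (not from the dataset)
free_so2 = [10.0, 15.0, 22.0, 30.0, 41.0]
total_so2 = [60.0, 75.0, 110.0, 120.0, 170.0]

print(pearson_r(free_so2, total_so2))
print(vif_two_predictors(free_so2, total_so2))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign that one of the features could be dropped or the two combined. With more than two predictors, the VIF of a feature comes from regressing it on all the others.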


### 2.2 Comparison of good wine:

In [11]:
describe_the_model(1)

-------------------------------------------------------
The summary for the Model_White Wine Good (Without Outliers):

                            OLS Regression Results                            
Dep. Variable:                quality   R-squared:                       0.116
Model:                            OLS   Adj. R-squared:                  0.113
Method:                 Least Squares   F-statistic:                     42.62
Date:                Wed, 22 Nov 2023   Prob (F-statistic):           1.61e-64
Time:                        16:51:27   Log-Likelihood:                -2161.5
No. Observations:                2603   AIC:                             4341.
Df Residuals:                    2594   BIC:                             4394.
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
--

In [12]:
describe_the_model(3)

-------------------------------------------------------
The summary for the Model_Red Wine Good (Without Outliers):

                            OLS Regression Results                            
Dep. Variable:                quality   R-squared:                       0.196
Model:                            OLS   Adj. R-squared:                  0.187
Method:                 Least Squares   F-statistic:                     20.48
Date:                Wed, 22 Nov 2023   Prob (F-statistic):           7.36e-28
Time:                        16:51:27   Log-Likelihood:                -419.30
No. Observations:                 680   AIC:                             856.6
Df Residuals:                     671   BIC:                             897.3
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----

#### Discussion:

**White Good Wine (N=3249)**

* **R-squared on the test = 0.119:** 
  - The R-squared value on the test is 0.119, indicating that approximately 11.9% of the variability in the dependent variable (quality) is explained by the model.

* **F-Statistic = 42.62:**
  - The F-statistic tests the overall significance of the model. A larger F-statistic with a small p-value suggests that at least one predictor variable is significantly related to the dependent variable.
  - The model is statistically significant.

* **Features Significantly Constituting the Model:**
  - **density:** A one-unit increase in density is associated with a quality decrease of approximately 113.8570 units.
  - **chlorides:** A one-unit increase in chloride content is associated with a quality decrease of approximately 0.0868 units.
  - **residual sugar:** A one-unit increase in residual sugar is associated with a quality increase of approximately 0.0499 units.
  - **pH:** A one-unit increase in pH is associated with a quality increase of approximately 0.6328 units.
  - **fixed acidity:** A one-unit increase in fixed acidity is associated with a quality increase of approximately 0.0952 units.
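
The density coefficient looks extreme mainly because a "one-unit" change is far outside the range over which density plausibly varies. Assuming density differs between wines only by a few thousandths (an assumption about the data, not something stated in this notebook), it is more informative to rescale the coefficient to a realistic step. A small sketch with a hypothetical helper:

```python
def effect_for_change(coef, delta):
    """Expected change in the response for a change `delta` in one predictor."""
    return coef * delta


# coefficient taken from the discussion above (negative: quality decreases)
density_coef = -113.8570

# assumed realistic step in density: 0.001 units
print(round(effect_for_change(density_coef, 0.001), 4))  # -0.1139
```

On that scale the density effect is comparable in size to the other coefficients in the list.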

**Red Good Wine (N=851)**

* **R-squared on the test = 0.285:** 
  - The R-squared value on the test is 0.285, indicating that approximately 28.5% of the variability in the dependent variable (quality) is explained by the model. This is much higher than the training R-squared (0.196), which may be caused by random fluctuation in the train-test split.

* **F-Statistic = 20.48:** 
  - The model is statistically significant.

* **Features Significantly Constituting the Model:**
  - **alcohol:** A one-unit increase in alcohol is associated with a quality decrease of approximately 0.1289 units.
  - **sulphates:** A one-unit increase in sulphates content is associated with a quality decrease of approximately 0.6421 units.
  - **chlorides:**  A one-unit increase in chloride content is associated with a quality decrease of approximately 0.1498 units.
  - **total sulfur dioxide:** A one-unit increase in total sulfur dioxide is associated with a quality increase of approximately 0.0018 units.
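
For an OLS model with an intercept, the overall F-statistic is tied to the training R-squared, so the reported values can be sanity-checked by hand. The sketch below assumes the rounded R-squared values from the summaries, so small gaps from the printed F-statistics are expected; the function name is illustrative.

```python
def f_statistic(r2, n_obs, n_predictors):
    """Overall F-statistic of an OLS model with an intercept, from R-squared."""
    explained = r2 / n_predictors
    unexplained = (1 - r2) / (n_obs - n_predictors - 1)
    return explained / unexplained


# white good wine: reported F-statistic 42.62
print(round(f_statistic(0.116, 2603, 8), 1))  # 42.5
# red good wine: reported F-statistic 20.48
print(round(f_statistic(0.196, 680, 8), 1))   # 20.4
```

Even a low R-squared yields a large F-statistic when the sample is big, which is why these models are statistically significant while explaining little of the variance.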


**Summary**: 

Based on our models, the chlorides level can be an important predictor of quality among good wines. Interestingly, the large coefficient and standard error for **density** may suggest that there are two clusters of points with different density levels. In the future, it would be interesting to split the data into smaller groups and look for subclusters, which may be characterised by different quality levels. One way to check this would be to compare how models fitted on the different density-level clusters perform.
