# In this notebook we explore the Red Wine dataset, perform exploratory data analysis, train 8 regression models using GridSearchCV, and then visualize the results.

# About Red Wine Dataset

The dataset describes red variants of the Portuguese "Vinho Verde" wine. For additional information, see [Cortez et al., 2009].

The dataset contains 12 columns in total, described below:
1. **fixed acidity:** Most acids found in wine are fixed (nonvolatile), meaning they do not evaporate easily. These acids play a significant role in shaping the wine's overall taste, mouthfeel, and balance.

2. **volatile acidity:** The amount of acetic acid in the wine. At excessive levels, acetic acid gives the wine an unpleasant, vinegar-like taste, so winemakers monitor and control it to protect the quality of the final product.

3. **citric acid:** In small quantities, citric acid can add a sense of 'freshness' and enhance the flavor of wines.

4. **residual sugar:** The sugar remaining in the wine after fermentation stops. Wines with less than 1 gram per liter of residual sugar are uncommon.

5. **chlorides:** The amount of salt in the wine, which can affect its overall taste and flavor profile.

6. **free sulfur dioxide:** Free sulfur dioxide exists in wine in equilibrium between molecular SO2 (as a dissolved gas) and the bisulfite ion. Free SO2 acts as a preservative: it helps prevent microbial growth and oxidation, protecting the wine from spoilage and maintaining its freshness.

7. **total sulfur dioxide:** The combined quantity of the free and bound forms of SO2 in the wine. At low concentrations SO2 is mostly undetectable, but it becomes noticeable as the concentration of free SO2 rises. This compound serves as a preservative, safeguarding the wine from spoilage and oxidation.

8. **density:** The density of wine is close to that of water and varies with the alcohol and sugar content. A density (specific gravity) measurement therefore provides information about the wine's composition.

9. **pH:** The pH of wine indicates how acidic or basic it is on a scale ranging from 0 (very acidic) to 14 (very basic). Most wines fall within the pH range of 3 to 4. This measurement allows winemakers to understand and control the wine's acidity, which is crucial in determining its overall taste, stability, and how well it pairs with different foods.

10. **sulphates:** Sulphates are wine additives that can increase the level of sulfur dioxide gas (SO2), which acts as an antimicrobial agent and preservative. By adding sulphates, winemakers improve the wine's resistance to unwanted microbial growth and oxidation, and therefore its shelf life and overall stability.

11. **alcohol:**  Alcohol is a key component in wine, formed through the fermentation process when yeast converts sugar into ethanol and carbon dioxide. It plays a crucial role in defining a wine's character, affecting its body, aroma, and overall flavor profile. 

12. **quality:** The quality score of the red wine. This is the target variable, which we model as a function of the concentrations of the other features/columns explained above.
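
Because pH is a base-10 logarithmic scale, the narrow 3 to 4 range mentioned above still covers a large difference in acidity. The sketch below uses the standard definition pH = -log10[H+]; the helper name `hydrogen_ion_concentration` is introduced here for illustration only and is not used elsewhere in the notebook.

```python
def hydrogen_ion_concentration(ph: float) -> float:
    """Return the hydrogen-ion concentration in mol/L for a given pH."""
    return 10.0 ** (-ph)


# Compare the two ends of the typical wine pH range.
ratio = hydrogen_ion_concentration(3.0) / hydrogen_ion_concentration(4.0)
print(f"pH 3 is about {ratio:.0f} times more acidic than pH 4")
```

A difference of one pH unit corresponds to a tenfold change in hydrogen-ion concentration, which is one reason small shifts in this column can matter.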

In this notebook we predict the quality of red wine based on other features/columns.


In [None]:
# Importing all the necessary libraries and models used in the experiment.
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.svm import SVR
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
import warnings
warnings.filterwarnings("ignore")

In [None]:
# Read the dataset with pandas
df = pd.read_csv("/kaggle/input/red-wine-quality-cortez-et-al-2009/winequality-red.csv")
df.head()

# Exploratory Data Analysis

In [None]:
# Description of dataset
df.info()

In [None]:
df.describe()

In [None]:
# To check the null values in the dataset
df.isnull().sum()

In [None]:
# dataset shape
df.shape

In [None]:
# To check the duplicate values in the dataset
df.duplicated().sum()

In [None]:
# Remove the duplicated rows from the dataset
df.drop_duplicates(inplace=True)

In [None]:
# check the shape of dataset after removing duplicates
df.shape

In [None]:
# Pairwise correlation between columns (a negative value indicates a negative correlation, a positive value a positive correlation)
corr_matrix = df.corr()
corr_matrix
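
By default `df.corr()` computes the Pearson correlation coefficient for every pair of columns. As a minimal sketch of what a single entry of that matrix means, the example below computes the coefficient by hand on made-up toy data (not values from the wine dataset) and compares it with NumPy's built-in result.

```python
import numpy as np

# Toy data, made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson r = sum of products of deviations / sqrt(product of summed squared deviations)
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# np.corrcoef returns the full 2x2 correlation matrix; take the off-diagonal entry.
r_numpy = np.corrcoef(x, y)[0, 1]

print(round(float(r_manual), 4), round(float(r_numpy), 4))
```

The coefficient always lies between -1 and 1 and only captures linear relationships, so a value near 0 does not rule out a nonlinear dependence.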

In [None]:
# Plot the correlation matrix as a heatmap for easier visualisation
plt.figure(figsize=(10,7))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt='.2f')
plt.title("Correlation Matrix", fontsize=14)
plt.show()

In [None]:
# Plot a histogram for every column
df.hist(bins=10, figsize=(10,11))
plt.suptitle("Data Distribution of all the columns")
plt.show()

In [None]:
# Visualise the median and percentile-based distribution of each column (box plots help us see the outliers in the dataset)
df.boxplot(column=df.columns.tolist(), figsize=(20,20), grid=True, rot=45, fontsize=16)
plt.suptitle("Median and percentile-based distribution of all the columns", fontsize=25)
plt.show()

# Detecting Outliers from Dataset

In [None]:
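# Detect outliers: a row is flagged if any of its values lies outside
# the 1st-99th percentile range of the corresponding column
columns = df.columns.tolist()
outliers = []   # row positions flagged as outliers (may contain repeats)

for col in columns:
    lower = np.percentile(df[col], 1)
    upper = np.percentile(df[col], 99)

    # Positions of the rows whose value falls outside [lower, upper]
    mask = ((df[col] < lower) | (df[col] > upper)).to_numpy()
    col_outliers = np.flatnonzero(mask).tolist()
    outliers.extend(col_outliers)

    print(f"{col}: {len(col_outliers)} rows outside [{lower:.3f}, {upper:.3f}]")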

In [None]:
# Remove duplicate positions from the outliers list
outliers_set= set(outliers)
final_outliers=list(outliers_set)

In [None]:
# Percentage of rows in the dataset that were flagged as outliers
ratio_outliers=len(final_outliers)/len(df)
ratio_outliers*100

In [None]:
# Drop the outliers from our dataset
df.drop(df.index[final_outliers], inplace=True)

In [None]:
# length of dataset after removing outliers 
len(df)
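
The filter used above keeps only the rows whose values lie between the 1st and 99th percentiles of every column. The following is a minimal sketch of the same idea on a single made-up column, written with a boolean mask instead of a Python loop; the toy values are assumptions chosen so that one extreme sits on each side.

```python
import numpy as np

# Toy column (made up): 98 moderate values plus one extreme value on each side.
values = np.array([-50.0] + list(range(1, 99)) + [500.0])

# 1st and 99th percentiles define the accepted band.
lower, upper = np.percentile(values, [1, 99])

# Keep only the values inside the band (the notebook flags values strictly outside it).
inside = (values >= lower) & (values <= upper)
filtered = values[inside]

removed = int(np.count_nonzero(~inside))
print(f"band = [{lower:.2f}, {upper:.2f}], removed {removed} of {len(values)} values")
```

Note that this rule always flags roughly the most extreme 2% of each column, whether or not those values are unusual, which is why applying it across many columns removes a noticeably larger share of rows.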

In [None]:
# Box plots after outlier removal; compare with the earlier box plots to see the effect
df.boxplot(column=df.columns.tolist(), figsize=(20,20), grid=True, rot=45, fontsize=16)
plt.suptitle("Median and percentile-based distribution of all the columns after removing outliers", fontsize=25)
plt.show()

# Split the dataset into train and test sets

In [None]:
# Split the data into train and test sets, holding out 20 percent of the rows for testing
x_train,x_test,y_train,y_test= train_test_split(df.drop("quality", axis=1),
                                                df["quality"],
                                                test_size=0.2,
                                                random_state=42)
x_train.shape,x_test.shape,y_train.shape, y_test.shape

# Data Preprocessing step

In [None]:
# Standardise the features to zero mean and unit variance (fit on the training set only, then apply to the test set)
std= StandardScaler()
x_train= std.fit_transform(x_train)  
x_test=std.transform(x_test)
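
`StandardScaler` subtracts the per-feature mean and divides by the per-feature standard deviation, and the cell above learns both statistics from the training set only. The sketch below reproduces that arithmetic with plain NumPy on made-up toy matrices, to show why the test set is transformed with the training statistics rather than its own.

```python
import numpy as np

# Toy feature matrices (made up): 4 training rows and 2 test rows, 2 features each.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
test = np.array([[2.5, 25.0], [5.0, 50.0]])

# Statistics are learned from the training rows only.
train_mean = train.mean(axis=0)
train_std = train.std(axis=0)   # population std (ddof=0), as StandardScaler uses

train_scaled = (train - train_mean) / train_std
test_scaled = (test - train_mean) / train_std   # reuse the training statistics

print(train_scaled.mean(axis=0))  # close to 0 for each feature
print(train_scaled.std(axis=0))   # close to 1 for each feature
print(test_scaled)
```

Scaling the test set with its own mean and standard deviation would leak information about the test distribution into preprocessing and put the two sets on slightly different scales.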

# Creating the 8 Regression Models Used in the Experiment

In [None]:
# Defining Models
models = [
    LinearRegression(),
    RandomForestRegressor(),
    DecisionTreeRegressor(),
    GradientBoostingRegressor(),
    SVR(),
    Lasso(),
    Ridge(),
    ElasticNet(),
]

# Creating the Parameters List for all Regression Models

In [None]:
# Defining the parameter grid for each model
Linear_param = {'n_jobs': [-1]}

Random_param = {'n_estimators': [100, 200],
                'max_depth': [6, 8],
                'min_samples_split': [2, 4],
                'criterion': ['squared_error']}

Decision_param = {'splitter': ['best'],
                  'max_depth': [8, 10],
                  'min_samples_split': [2],
                  'criterion': ['squared_error']}

gradient_param = {'n_estimators': [100, 200],
                  'learning_rate': [0.1, 0.01, 0.001],
                  'max_depth': [8, 10],
                  'min_samples_leaf': [2, 4, 5],
                  'loss': ['squared_error']}

SVR_param = {'kernel': ['rbf', 'poly'],
             'gamma': ['scale', 'auto']}

Lasso_param = {'alpha': [1.0, 1.1],
               'max_iter': [1000, 1200],
               'selection': ['cyclic', 'random']}

Ridge_param = {'alpha': [1.0, 1.1],
               'max_iter': [1000, 1200],
               'solver': ['auto', 'svd', 'lsqr']}

ElasticNet_param = {'alpha': [1.0, 1.1],
                    'max_iter': [1000, 1400],
                    'selection': ['cyclic', 'random']}

# The grids must be listed in the same order as the models
parameters = [
    Linear_param,
    Random_param,
    Decision_param,
    gradient_param,
    SVR_param,
    Lasso_param,
    Ridge_param,
    ElasticNet_param,
]
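
A grid search evaluates every combination of the listed values, so the cost of a grid grows multiplicatively with each added parameter. As a rough illustration, the sketch below expands a copy of the gradient boosting grid (`gradient_param`) with `itertools.product` and counts the resulting candidates; the variable names `grid`, `candidates`, and `cv_folds` are introduced here for illustration.

```python
from itertools import product

# Copy of the gradient boosting grid defined above.
grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.1, 0.01, 0.001],
    "max_depth": [8, 10],
    "min_samples_leaf": [2, 4, 5],
    "loss": ["squared_error"],
}

# Each candidate is one element of the Cartesian product of the value lists.
candidates = [dict(zip(grid, combo)) for combo in product(*grid.values())]

cv_folds = 2  # assumed to match the cv=2 setting used with GridSearchCV
print(len(candidates), "candidates,", len(candidates) * cv_folds, "cross-validation fits")
print(candidates[0])
```

With the default `refit=True`, `GridSearchCV` also trains the best candidate once more on the full training set, so the total number of fits is one more than the count printed here.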
                            

# Apply GridSearchCV to each model with its parameter grid

In [None]:
# Tune each model with GridSearchCV, then evaluate the best estimator on the test set
result = {}

for model, param_grid in zip(models, parameters):
    name = model.__class__.__name__
    regressor = GridSearchCV(model, param_grid, cv=2, scoring="r2", n_jobs=-1)
    regressor.fit(x_train, y_train)
    print(name, "best parameters:", regressor.best_params_)

    # predict() uses the best estimator, refit on the whole training set
    y_pred = regressor.predict(x_test)
    mse = mean_squared_error(y_test, y_pred)
    result[name] = [mse]   # one-element list so that pd.DataFrame(result) works
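
The loop above selects hyperparameters by cross-validated R² (`scoring="r2"`) but reports the mean squared error on the test set. The two metrics are closely related: on a fixed set of targets, R² equals 1 minus the MSE divided by the variance of the targets. The sketch below checks this on made-up toy values (not results from the notebook).

```python
import numpy as np

# Toy targets and predictions, made up for illustration.
y_true = np.array([5.0, 6.0, 5.0, 7.0, 6.0])
y_hat = np.array([5.2, 5.8, 5.5, 6.4, 6.1])

# Mean squared error: average squared residual.
mse_toy = np.mean((y_true - y_hat) ** 2)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_toy = 1.0 - ss_res / ss_tot

print(f"MSE = {mse_toy:.3f}, R^2 = {r2_toy:.3f}")
print(np.isclose(r2_toy, 1.0 - mse_toy / np.var(y_true)))
```

On the same test set, ranking models by lowest MSE therefore gives the same order as ranking them by highest R².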


# Display the results of all models in a dictionary

In [None]:
result

# Create a Dataframe for results 

In [None]:
final_results= pd.DataFrame(result)
final_results=final_results.T
final_results.columns = ["MeanSquaredError"]
final_results

# Visualise the Results

In [None]:
final_results.plot(kind="bar", figsize=(10, 7)).legend(bbox_to_anchor=(1.0, 1.0));

# Conclusions
1. In the exploratory data analysis, we found that quality (the target variable) is most strongly correlated with alcohol (an input feature), with a coefficient of 0.48, followed by sulphates (0.25). This suggests that these features play an important role in predicting wine quality.
2. Using percentile-based filtering (values outside the 1st to 99th percentile range of any column), we flagged about 13% of the rows as outliers and removed them. The effect is visible when comparing the box plots before and after removal.
3. After running the experiment on the dataset with 8 different regression models, we conclude that the RandomForest Regressor has the lowest Mean Squared Error.