# GDAA 2010 – Data Mining Modelling

## Project #1 - Comprehensive Regression Analysis in Python

### Alex Moss

In [None]:
# This first section of code will just be to import all of the various packages required for this Notebook to run.
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression




### Problem Definition

The problem at hand involves predicting a continuous target variable using regression analysis. In this project, we aim to predict the quality of red wine based on various physicochemical properties. The quality of the red wine is rated on a scale from 0 to 10, making it a suitable target variable for regression analysis.

## Data Collection and Preprocessing

Data Description - 

The target variable in this dataset is 'Quality'. 'Quality' refers to the overall quality of the wine, scored on a scale of 0 to 10.

There are a total of 11 predictor variables in this dataset, all of which are numeric physicochemical properties of the wine. These predictors include 'Fixed Acidity', 'Volatile Acidity', 'Citric Acid', 'Residual Sugar', 'Chlorides', 'Free Sulfur Dioxide', 'Total Sulfur Dioxide', 'Density', 'pH', 'Sulphates', and 'Alcohol'. Before any of the data preprocessing takes place, there are a total of 1599 rows in the dataset. 

The dataset comes from the UC Irvine Machine Learning Repository website. The full dataset actually comes with two different CSV files, one for red wine and another for white. For this project, the red wine CSV was chosen. Here is the link to the dataset -> https://archive.ics.uci.edu/dataset/186/wine+quality.




Data Preprocessing - 

The first few chunks of code take the original CSV file found on the UC Irvine Machine Learning Repository and make it usable. The CSV file uses semicolons instead of commas to separate the data, which causes all of the data to be bunched into one column when it is read with default settings. The first code chunk fixes that, and the second chunk brings in the newly changed CSV as a data frame.
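
As an aside, pandas can also read a semicolon-delimited file directly through the `sep` argument of `pd.read_csv`, which would make the intermediate rewrite optional. The snippet below is a minimal sketch on a two-row in-memory stand-in (the values and the name `demo_df` are illustrative assumptions, not the project's file):

```python
import io
import pandas as pd

# Two-row stand-in for the semicolon-delimited file (illustrative values only)
raw_text = (
    '"fixed acidity";"volatile acidity";"quality"\n'
    '7.4;0.7;5\n'
    '7.8;0.88;5\n'
)

# sep=';' tells pandas to split on semicolons instead of commas
demo_df = pd.read_csv(io.StringIO(raw_text), sep=';')

print(demo_df.shape)
print(list(demo_df.columns))
```

The same `sep=';'` argument should work when a file path is passed in place of the `StringIO` object.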

In [None]:
# Open the input CSV file
with open('winequality-red.csv', 'r') as infile:
    # Read the CSV file with semicolon delimiter
    reader = csv.reader(infile, delimiter=';')
    # Read all rows and store them
    data = list(reader)

# Open a new CSV file for writing
with open('winequalityred-fixed.csv', 'w', newline='') as outfile:
    # Create a CSV writer with comma delimiter
    writer = csv.writer(outfile)
    # Write the rows with commas as delimiter
    writer.writerows(data)

In [None]:
file_path = "E:\\NSCC\\Semester_2\\GDAA2010_DataMiningModelling\\Project_1\\winequalityred-fixed.csv"
redwine = pd.read_csv(file_path)
redwine.head() 

In [None]:
# Verify the data types of each variable
print(redwine.dtypes)

Check for any missing data. Data imputation will be necessary if there is missing data.
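
Should the check turn up gaps, median imputation is one simple option. The sketch below uses a toy data frame (the values and the names `demo` and `imputed` are assumptions for illustration); it is not needed for this dataset if no values are missing.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap per column (illustrative values, not from the wine data)
demo = pd.DataFrame({
    'pH': [3.2, np.nan, 3.4, 3.3],
    'alcohol': [9.4, 9.8, np.nan, 10.0],
})

# Count missing values per column
missing_counts = demo.isnull().sum()
print(missing_counts)

# Fill each gap with the median of its own column
imputed = demo.fillna(demo.median())
print(imputed)
```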

In [None]:
# Check for missing data
missing_data = redwine.isnull().sum()

# Print the summary of missing data
print(missing_data[missing_data > 0])

With no missing data, we can move on to trimming outliers. There are three main methods that could be employed to trim the outliers from the dataset in this project:

- Standard Deviation Method:
    - Use when the data is (approximately) normally distributed or when the underlying distribution is unknown.
    - Good for detecting outliers that are symmetrically distributed around the mean.
    - May not be robust to outliers if the data is heavily skewed or contains extreme values.

- Interquartile Range (IQR) Method:
    - Use when the data is skewed or contains extreme values.
    - Robust to outliers and resistant to extreme values in the dataset.
    - Suitable for detecting outliers that are not normally distributed and may be asymmetrically distributed.

- Percentile Method:
    - Use when you want to specify the exact percentage of data to trim from both ends of the distribution.
    - Provides flexibility in choosing the percentage of data to trim based on domain knowledge or specific requirements.
    - Suitable for situations where you need to customize the trimming level based on the characteristics of the data.

Based on the description of each method, only the STD and IQR methods will be considered. The percentile method makes more sense when there are specific trimming requirements in mind for the data; the other two are more general, blanket methods that are better suited for this project.
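
For reference only, the percentile method that is set aside here could look roughly like the sketch below. It runs on synthetic data, and the 1% / 99% cut-offs are hypothetical choices, not values used anywhere in this project.

```python
import numpy as np
import pandas as pd

# Synthetic single-column data (not the wine data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({'x': rng.normal(size=1000)})

# Hypothetical trimming levels: drop roughly 1% from each tail
lower_pct, upper_pct = 0.01, 0.99
lower = demo['x'].quantile(lower_pct)
upper = demo['x'].quantile(upper_pct)

# Keep only the rows that fall inside the two percentile bounds
trimmed = demo[(demo['x'] >= lower) & (demo['x'] <= upper)].reset_index(drop=True)

print(len(demo), '->', len(trimmed))
```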

Firstly, let's look at the original predictor distributions.

In [None]:
# List of predictor variables
predictor_variables = ['fixed acidity', 'volatile acidity', 'citric acid', 
                      'residual sugar', 'chlorides', 'free sulfur dioxide', 
                      'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', ]


# Create one subplot per predictor variable
fig, axes = plt.subplots(len(predictor_variables), 1, figsize=(10, 6 * len(predictor_variables)))

# Iterate through each predictor variable
for i, target_var in enumerate(predictor_variables):
    # Create a histogram using seaborn
    sns.histplot(data=redwine, x=target_var, bins=30, kde=True, color='blue', ax=axes[i])
    axes[i].set_title(f'Distribution of {target_var}')
    axes[i].set_xlabel(f'{target_var}')
    axes[i].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

Both the IQR and STD methods will be applied, and the one that produces the better result will be carried forward through the rest of the project.

In [None]:
# Specify the multiplier for the IQR method
iqr_multiplier = 1.5

# Calculate the first and third quartiles (Q1 and Q3) of predictors
Q1 = redwine.iloc[:, :-1].quantile(0.25)
Q3 = redwine.iloc[:, :-1].quantile(0.75)

# Calculate the interquartile range (IQR) for each predictor
IQR = Q3 - Q1

# Define the lower and upper bounds for outliers detection
lower_bound = Q1 - iqr_multiplier * IQR
upper_bound = Q3 + iqr_multiplier * IQR

# Identify rows containing outliers and drop them entirely from the dataset
trimmed_redwine_iqr = redwine.copy()
outliers_index = (trimmed_redwine_iqr.iloc[:, :-1] < lower_bound) | (trimmed_redwine_iqr.iloc[:, :-1] > upper_bound)
outliers_index = outliers_index.any(axis=1)
trimmed_redwine_df_iqr = trimmed_redwine_iqr[~outliers_index]

# Reset the index so it runs from 0 again; the gaps left by the dropped rows
# were causing issues later on
trimmed_redwine_df_iqr = trimmed_redwine_df_iqr.reset_index(drop=True)

print(trimmed_redwine_df_iqr.head())

fig, axes = plt.subplots(len(predictor_variables), 1, figsize=(12, 6 * len(predictor_variables)))

# Iterate through each predictor variable
for i, target_var in enumerate(predictor_variables):
    # Create a histogram using seaborn
    sns.histplot(data=trimmed_redwine_df_iqr, x=target_var, bins=30, alpha=0.7, kde=True, color='green', ax=axes[i], label='Trimmed (IQR)')
    axes[i].set_title(f'Trimmed {target_var} Histogram (IQR Method)')
    axes[i].set_xlabel(f'{target_var}')
    axes[i].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

In [None]:
print("There are", trimmed_redwine_df_iqr.shape[0], "records with", trimmed_redwine_df_iqr.shape[1], "variables.")

In [None]:
# Specify the number of standard deviations from the mean to consider as outliers
num_std = 2

# Calculate the mean and standard deviation of predictors
# iloc[:, :-1] selects every column except the last one, which is our target variable
predictors_mean = redwine.iloc[:, :-1].mean()
predictors_std = redwine.iloc[:, :-1].std() 

# Identify rows containing outliers and drop them from the dataset
trimmed_redwine_df_std = redwine.copy()
outliers_index = np.abs((trimmed_redwine_df_std.iloc[:, :-1] - predictors_mean) / predictors_std) > num_std
outliers_index = outliers_index.any(axis=1) # axis=1 means scan the dataset across columns
trimmed_redwine_df_std = trimmed_redwine_df_std[~outliers_index]

trimmed_redwine_df_std.reset_index(drop=True, inplace=True)
print(trimmed_redwine_df_std.head())

fig, axes = plt.subplots(len(predictor_variables), 1, figsize=(12, 6 * len(predictor_variables)))

# Iterate through each predictor variable
for i, target_var in enumerate(predictor_variables):
    # Create a histogram using seaborn
    sns.histplot(data=trimmed_redwine_df_std, x=target_var, bins=30, alpha=0.7, kde=True, color='pink', ax=axes[i], label='Trimmed (STD)')
    axes[i].set_title(f'Trimmed {target_var} Histogram (STD Method)')
    axes[i].set_xlabel(f'{target_var}')
    axes[i].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

In [None]:
print("There are", trimmed_redwine_df_std.shape[0], "records with", trimmed_redwine_df_std.shape[1], "variables.")

Looking over the histograms produced after both outlier trimming methods, we are inclined to choose the STD method. This is mostly due to the strange distribution of the residual sugar predictor after the IQR outlier trimming. It can almost be described as having missing data. 

Now that the outliers in our data have been removed, there are only 1124 rows of data, down 475 rows from the non-trimmed dataset.

With the trimming complete, the remaining data needs to be scaled. Just like with the outlier trimming, there are three main methods that could be employed to scale our data in this project. 

Z-score standardization, also known as z-score scaling or standardization, is a common method used to scale numeric data. 

Here's how it works:

1. Calculate Mean and Standard Deviation
    - For each feature (predictor variable) in your dataset, calculate the mean (average) and the standard deviation.
    - Subtract Mean: Subtract the mean from each data point in the feature. This centers the data around zero.

2. Divide by Standard Deviation
    - Divide each centered data point by the standard deviation of the feature. This scales the data so that it has a standard deviation of 1.

The resulting values after standardization have a mean of 0 and a standard deviation of 1. This process does not change the shape of the distribution, but it puts all features on the same scale, which matters for algorithms that are sensitive to feature scale, such as regularized linear models like ridge regression.
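
The two steps above can be checked by hand against scikit-learn's `StandardScaler`. This is a minimal sketch on a toy column (the array `values` is an assumption for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (illustrative values)
values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Manual z-score: subtract the mean, then divide by the standard deviation.
# np.std defaults to the population formula (ddof=0), which StandardScaler also uses.
manual = (values - values.mean(axis=0)) / values.std(axis=0)

# Same operation through scikit-learn
scaled = StandardScaler().fit_transform(values)

print(np.allclose(manual, scaled))
```

Note that pandas' `.std()` defaults to the sample formula (`ddof=1`), so a hand calculation done with pandas will differ slightly from `StandardScaler` unless `ddof=0` is passed.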


Min-Max scaling, also known as normalization, is a method used to scale numeric data to a fixed range, typically between 0 and 1. 

Here's how it works:

1. Determine Range
    - For each feature (predictor variable) in your dataset, determine the minimum and maximum values.

2. Scale Data
    - Subtract the minimum value from each data point in the feature and then divide by the range (the maximum value minus the minimum value).

The resulting values are scaled to a range between 0 and 1, with the minimum value transformed to 0 and the maximum value transformed to 1.
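
As a small illustration of the formula, the sketch below compares a manual calculation with scikit-learn's `MinMaxScaler` on a toy column (the array `values` is an assumption for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data (illustrative values)
values = np.array([[2.0], [4.0], [6.0], [10.0]])

# Manual min-max scaling: (x - min) / (max - min)
value_range = values.max(axis=0) - values.min(axis=0)
manual = (values - values.min(axis=0)) / value_range

# Same operation through scikit-learn
scaled = MinMaxScaler().fit_transform(values)

print(scaled.ravel())
```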


Robust scaling, also known as robust standardization, is a method used to scale numeric data by centering and scaling it based on the median and interquartile range (IQR) rather than the mean and standard deviation. 

Here's how it works:

1. Calculate Median and Interquartile Range (IQR)
    - For each feature (predictor variable) in your dataset, calculate the median (the middle value) and the interquartile range, which is the difference between the 75th percentile (Q3) and the 25th percentile (Q1).

2. Scale Data
    - Subtract the median from each data point in the feature and then divide by the IQR.

The resulting values are scaled based on the median and IQR, making robust scaling less sensitive to outliers compared to z-score standardization.
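
A minimal sketch of robust scaling on a toy column containing one extreme value (the array `values` is an assumption for illustration). It compares a manual calculation with scikit-learn's `RobustScaler`, which by default centers on the median and scales by the 25th to 75th percentile range:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy single-feature data with one extreme value (illustrative values)
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Manual robust scaling: (x - median) / (Q3 - Q1)
median = np.median(values, axis=0)
q1, q3 = np.percentile(values, [25, 75], axis=0)
manual = (values - median) / (q3 - q1)

# Same operation through scikit-learn
scaled = RobustScaler().fit_transform(values)

print(scaled.ravel())
```

The extreme value stays extreme after scaling, but it does not distort the center or spread used to scale the other points.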

For this project, Z-score standardization will be used to scale our trimmed dataset. This is due to our familiarity with this method of scaling data. Also, looking back over sample code, there was no visual difference when all scaling methods were performed on the sample dataset. Moving forward with a familiar method will hopefully allow for the rest of the project to roll out smoothly.

In [None]:
target = trimmed_redwine_df_std['quality']
predictors = trimmed_redwine_df_std.drop(columns=['quality'])

scaler = StandardScaler()
scaled_predictors = scaler.fit_transform(predictors)

# Convert the scaled predictors back to a DataFrame
scaledz_trimmed_redwine = pd.DataFrame(scaled_predictors, columns=predictors.columns)

# Concatenate the scaled predictors with the target variable
scaledz_trimmed_redwine['quality'] = target

print(scaledz_trimmed_redwine.head())
print("Dimensions of the DataFrame:", scaledz_trimmed_redwine.shape)

The above output shows that our predictor variables have been successfully scaled using the Z-score standardization. Now to take a look at the predictor histograms after outlier trimming and scaling took place.

In [None]:
# List of predictor variables
predictor_variables = ['fixed acidity', 'volatile acidity', 'citric acid', 
                      'residual sugar', 'chlorides', 'free sulfur dioxide', 
                      'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', ]

# Create one subplot per predictor variable
fig, axes = plt.subplots(len(predictor_variables), 1, figsize=(10, 6 * len(predictor_variables)))

# Iterate through each predictor variable
for i, target_var in enumerate(predictor_variables):
    # Create a histogram using seaborn
    sns.histplot(data=scaledz_trimmed_redwine, x=target_var, bins=30, kde=True, color='blue', ax=axes[i])
    axes[i].set_title(f'Distribution of Trimmed & Scaled {target_var}')
    axes[i].set_xlabel(f'{target_var}')
    axes[i].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

## Exploratory Data Analysis

For the exploratory data analysis section, we are first going to compute some summary statistics for our variables.

In [None]:
processed_redwine = scaledz_trimmed_redwine 

print(processed_redwine.info())  # Display the information about the dataset, including data types and missing values


summary_statistics = processed_redwine.describe()
print(summary_statistics)

Now, we want to compute correlation coefficients accompanied by a scatter plot matrix. The results could indicate whether certain variables exhibit multicollinearity.

In [None]:
# First, let's output a correlation matrix so that just the values can be looked at.
correlation_matrix = processed_redwine.corr()
print(correlation_matrix)

In [None]:
# Let's look at the values in descending order.
correlation_df = correlation_matrix.unstack().sort_values(ascending=False)

# Keep track of pairs of variables that have already been printed
printed_pairs = set()

# Print the correlation coefficients and their corresponding variable names
for index, value in correlation_df.items():
    variable1, variable2 = index
    if variable1 != variable2 and (variable1, variable2) not in printed_pairs and (variable2, variable1) not in printed_pairs:
        print(f"{variable1:20} {variable2:20} {value:.6f}")
        printed_pairs.add((variable1, variable2))

In [None]:
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title("Correlation Matrix")
plt.show()

In [None]:
sns.pairplot(processed_redwine, kind='reg', plot_kws={'scatter_kws': {'s': 10}, 'line_kws': {'color': 'red'}})

plt.show()

The correlation coefficient that has been calculated is the Pearson correlation coefficient (r). Typically, 0 <= |r| < 0.3 suggests that two variables are weakly correlated, 0.3 <= |r| < 0.7 suggests a moderate correlation, and |r| >= 0.7 suggests a strong correlation. For the sake of this project, any two variables with a 'strong correlation' will be identified as having potential multicollinearity.
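
The thresholds above can be written as a small helper. This is a sketch on synthetic data; the function name `correlation_strength` and the columns `a`, `b`, `c` are hypothetical and not part of the project code.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def correlation_strength(r):
    """Label a Pearson r using the thresholds described above."""
    magnitude = abs(r)
    if magnitude >= 0.7:
        return 'strong'
    if magnitude >= 0.3:
        return 'moderate'
    return 'weak'


# Synthetic data: 'b' is nearly a copy of 'a', 'c' is unrelated noise
rng = np.random.default_rng(42)
a = rng.normal(size=500)
demo = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.1, size=500),
    'c': rng.normal(size=500),
})

corr = demo.corr()

# Visit each pair of variables exactly once
pairs = {(v1, v2): corr.loc[v1, v2] for v1, v2 in combinations(corr.columns, 2)}
for (v1, v2), r in pairs.items():
    print(f"{v1:5} {v2:5} {r: .3f} {correlation_strength(r)}")
```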

In our case, no two variables shared an |r| value of 0.7 or above. However, there were 4 different combinations that produced |r| values above 0.6, which suggests a moderate-to-high correlation. These combinations can be seen here:

total sulfur dioxide free sulfur dioxide  0.652327

fixed acidity        citric acid          0.648852

citric acid          volatile acidity     -0.640712

fixed acidity        pH                   -0.656122



Checking in on the correlation coefficient values with the target variable, below are the two combinations that produced the highest coefficient values: 

quality              alcohol              0.503886

sulphates            quality              0.406099

The code below will show the histograms for these two predictors to see what their distribution is.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 6))

axes[0].hist(processed_redwine['alcohol'], bins=30, alpha=0.7, color='red')
axes[0].set_title('alcohol histogram')
axes[0].set_xlabel('alcohol')
axes[0].set_ylabel('Frequency')

axes[1].hist(processed_redwine['sulphates'], bins=30, alpha=0.7, color='green')
axes[1].set_title('sulphates histogram')
axes[1].set_xlabel('sulphates')
axes[1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

Both histograms show right-skewed data, with the skew seemingly more pronounced for alcohol. Neither has a normal distribution.

## Feature Engineering

For this section, there are three main tools we will be looking to apply. Firstly, transformations will be applied to any non-normal distributions for our predictors. A lot of ML algorithms that could be implemented later on in the project work much better when variables are normally, or close to normally distributed. They also often assume normal distribution among predictors. Next, dimensionality reduction will be considered, using either PCA or factor analysis. Finally, data augmentation will be implemented to generate synthetic samples to increase the number of samples in the dataset.

Looking back on the histograms created earlier for all of our predictors, only 3 out of 11 had normal(ish) distributions, which were 'volatile acidity', 'density' and 'pH'. Transformations will be applied to the other predictors to try and normalize these distributions, then they will be rescaled using the z-score standardization. 

Since all of our predictors have been scaled using z-score standardization, they all contain negative values. Neither a plain log nor a square root transformation is defined for negative values, so a shifted log transformation, log(s * x + 1), is used instead. The scaling factor s is chosen per predictor by trial and must keep s * x + 1 positive for every value.
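
Because the log is only defined for positive arguments, it can help to check the shifted values before transforming. The sketch below does this on synthetic right-skewed data; the data and the scaling factor of 0.3 are assumptions for illustration, not the wine predictors.

```python
import numpy as np
import pandas as pd

# Right-skewed toy data, standardized so that it contains negative values
rng = np.random.default_rng(1)
raw = pd.Series(rng.lognormal(mean=0.0, sigma=0.5, size=1000))
z = (raw - raw.mean()) / raw.std()

scaling_factor = 0.3  # hypothetical value, chosen by trial

# The log is only defined where this argument is positive
argument = z * scaling_factor + 1
n_invalid = int((argument <= 0).sum())
print('values that would produce NaN:', n_invalid)

transformed = np.log(argument)

# Skewness closer to 0 indicates a more symmetric distribution
skew_before = z.skew()
skew_after = transformed.skew()
print('skew before:', round(skew_before, 3))
print('skew after :', round(skew_after, 3))
```

Comparing the skewness before and after gives a numeric complement to judging the histograms by eye.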

In [None]:
### Fixed Acidity

scaling_factor_log_FA = 0.2 #This value was changed and tested multiple times

pos_skewed_FA_log_scaled = np.log(processed_redwine['fixed acidity'] * scaling_factor_log_FA + 1)


plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
plt.hist(processed_redwine['fixed acidity'], bins=30, color='yellow', edgecolor='black', alpha=0.7)
plt.title('Original fixed acidity')
plt.xlabel('Value')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
plt.hist(pos_skewed_FA_log_scaled, bins=30, color='purple', edgecolor='black', alpha=0.7)
plt.title('Log Transformed fixed acidity')
plt.xlabel('Value')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

Distribution still is not normal but the skew is not as exaggerated.

In [None]:
### Citric Acid
scaling_factor_log_CA = 0.5
pos_skewed_CA_log_scaled = np.log(processed_redwine['citric acid'] * scaling_factor_log_CA + 1 )

plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
plt.hist(processed_redwine['citric acid'], bins=30, color='yellow', edgecolor='black', alpha=0.7)
plt.title('Original citric acid')
plt.xlabel('Value')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
plt.hist(pos_skewed_CA_log_scaled, bins=30, color='purple', edgecolor='black', alpha=0.7)
plt.title('Log Transformed citric acid')
plt.xlabel('Value')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

No scaling factor provided any distribution that even remotely resembled normality.

In [None]:
### Residual Sugar

scaling_factor_log_RS = 0.45

pos_skewed_RS_log_scaled = np.log(processed_redwine['residual sugar'] * scaling_factor_log_RS + 1)


plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
plt.hist(processed_redwine['residual sugar'], bins=30, color='yellow', edgecolor='black', alpha=0.7)
plt.title('Original residual sugar')
plt.xlabel('Value')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
plt.hist(pos_skewed_RS_log_scaled, bins=30, color='purple', edgecolor='black', alpha=0.7)
plt.title('Log Transformed residual sugar')
plt.xlabel('Value')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

Distribution is now close to normal. The empty bins among the negative values are, however, a little strange.

In [None]:
### Chlorides

scaling_factor_log_Ch = 0.25

pos_skewed_Ch_log_scaled = np.log(processed_redwine['chlorides'] * scaling_factor_log_Ch + 1)


plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
plt.hist(processed_redwine['chlorides'], bins=30, color='yellow', edgecolor='black', alpha=0.7)
plt.title('Original chlorides')
plt.xlabel('Value')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
plt.hist(pos_skewed_Ch_log_scaled, bins=30, color='purple', edgecolor='black', alpha=0.7)
plt.title('Log Transformed chlorides')
plt.xlabel('Value')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

So far, this transformation looks the best.

In [None]:
### Free Sulfur Dioxide

scaling_factor_log_FSD = 0.3

pos_skewed_FSD_log_scaled = np.log(processed_redwine['free sulfur dioxide'] * scaling_factor_log_FSD + 1)


plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
plt.hist(processed_redwine['free sulfur dioxide'], bins=30, color='yellow', edgecolor='black', alpha=0.7)
plt.title('Original free sulfur dioxide')
plt.xlabel('Value')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
plt.hist(pos_skewed_FSD_log_scaled, bins=30, color='purple', edgecolor='black', alpha=0.7)
plt.title('Log Transformed free sulfur dioxide')
plt.xlabel('Value')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

This is another case where the new distribution is not normal but more centered.

In [None]:
### Total Sulfur Dioxide

scaling_factor_log_TSD = 0.45

pos_skewed_TSD_log_scaled = np.log(processed_redwine['total sulfur dioxide'] * scaling_factor_log_TSD + 1)


plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
plt.hist(processed_redwine['total sulfur dioxide'], bins=30, color='yellow', edgecolor='black', alpha=0.7)
plt.title('Original total sulfur dioxide')
plt.xlabel('Value')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
plt.hist(pos_skewed_TSD_log_scaled, bins=30, color='purple', edgecolor='black', alpha=0.7)
plt.title('Log Transformed total sulfur dioxide')
plt.xlabel('Value')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

This is as close to a normal distribution as we could achieve.

In [None]:
### Sulphates

scaling_factor_log_Sul = 0.2

pos_skewed_Sul_log_scaled = np.log(processed_redwine['sulphates'] * scaling_factor_log_Sul + 1)


plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
plt.hist(processed_redwine['sulphates'], bins=30, color='yellow', edgecolor='black', alpha=0.7)
plt.title('Original sulphates')
plt.xlabel('Value')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
plt.hist(pos_skewed_Sul_log_scaled, bins=30, color='purple', edgecolor='black', alpha=0.7)
plt.title('Log Transformed sulphates')
plt.xlabel('Value')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

This distribution is close to being normal.

In [None]:
### Alcohol

scaling_factor_log_Al = 0.3

pos_skewed_Al_log_scaled = np.log(processed_redwine['alcohol'] * scaling_factor_log_Al + 1)


plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
plt.hist(processed_redwine['alcohol'], bins=30, color='yellow', edgecolor='black', alpha=0.7)
plt.title('Original alcohol')
plt.xlabel('Value')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
plt.hist(pos_skewed_Al_log_scaled, bins=30, color='purple', edgecolor='black', alpha=0.7)
plt.title('Log Transformed alcohol')
plt.xlabel('Value')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

This tightened the spread of the distribution more than normalized it.

Now that the required variables have been transformed, a new dataframe must be created using the untransformed predictors as well as the transformed predictors and our target variable.
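
One compact way to assemble such a data frame is to copy the original and overwrite only the transformed columns. The sketch below uses a toy frame and a hypothetical `scaling_factors` dictionary; it is an alternative to, not a description of, the cell that follows.

```python
import numpy as np
import pandas as pd

# Toy standardized frame standing in for the processed data (illustrative values)
rng = np.random.default_rng(7)
demo = pd.DataFrame({
    'pH': rng.normal(size=200),
    'alcohol': rng.uniform(-1.5, 3.0, size=200),
    'sulphates': rng.uniform(-1.5, 3.0, size=200),
    'quality': rng.integers(3, 9, size=200),
})

# Hypothetical mapping of column name -> scaling factor
scaling_factors = {'alcohol': 0.3, 'sulphates': 0.2}

# Copy everything, then overwrite only the columns that get transformed
transformed_demo = demo.copy()
for column, factor in scaling_factors.items():
    transformed_demo[column] = np.log(demo[column] * factor + 1)

print(transformed_demo.head())
```

With this approach the column order of the original frame is preserved, and the untransformed predictors and the target carry over unchanged.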

In [None]:
# Extract the untransformed predictors and the target variable from processed_redwine
untransformed_variables = processed_redwine[['volatile acidity', 'density', 'pH']]
target_variable = processed_redwine[['quality']]

# Create new data frames for all of the transformed variables
df1 = pd.DataFrame({'fixed acidity': pos_skewed_FA_log_scaled})
df2 = pd.DataFrame({'citric acid': pos_skewed_CA_log_scaled})
df3 = pd.DataFrame({'residual sugar': pos_skewed_RS_log_scaled})
df4 = pd.DataFrame({'chlorides': pos_skewed_Ch_log_scaled})
df5 = pd.DataFrame({'free sulfur dioxide': pos_skewed_FSD_log_scaled})
df6 = pd.DataFrame({'total sulfur dioxide': pos_skewed_TSD_log_scaled})
df7 = pd.DataFrame({'sulphates': pos_skewed_Sul_log_scaled})
df8 = pd.DataFrame({'alcohol': pos_skewed_Al_log_scaled})

# Concatenate the untransformed variables, transformed variables, and target variable
new_dataframe = pd.concat([untransformed_variables, df1, df2, df3, df4, df5, df6, df7, df8, target_variable], axis=1)

print(new_dataframe.dtypes)

In [None]:
# Split predictors and target variable to scale the predictors
target = new_dataframe['quality']
predictors = new_dataframe.drop(columns=['quality'])

scaler = StandardScaler()
scaled_predictors = scaler.fit_transform(predictors)

# Convert the scaled predictors back to a DataFrame
transformed_redwine = pd.DataFrame(scaled_predictors, columns=predictors.columns)

# Concatenate the scaled predictors with the target variable
transformed_redwine['quality'] = target

# Verify everything looks as it should
print(transformed_redwine.head())
print(transformed_redwine.info())

Check back on all of the predictor histograms.

In [None]:
# List of predictor variables
predictor_variables = ['fixed acidity', 'volatile acidity', 'citric acid', 
                      'residual sugar', 'chlorides', 'free sulfur dioxide', 
                      'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']

# Create one subplot per predictor variable
fig, axes = plt.subplots(len(predictor_variables), 1, figsize=(10, 6 * len(predictor_variables)))

# Iterate through each predictor variable
for i, target_var in enumerate(predictor_variables):
    # Create a histogram using seaborn
    sns.histplot(data=transformed_redwine, x=target_var, bins=30, kde=True, color='red', ax=axes[i])
    axes[i].set_title(f'Distribution of {target_var}')
    axes[i].set_xlabel(f'{target_var}')
    axes[i].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

These distributions are definitely not all normalized but they have been improved from before the transformations.

Next on the list for Feature Engineering is a potential Dimensionality Reduction. In this case, because there was no multicollinearity found, represented for this project's sake by strong variable correlation (|r| >= 0.7), no Dimensionality Reduction will take place.

The final item for this section is Data Augmentation. Again, in our case, this does not feel necessary. Our trimmed down dataset has over 1100 rows of data, so there isn't a pressing need for more samples.

## Model Selection
Here is a list of five regression algorithms that could be used in the model training for this project:

1. Decision Trees
    
    Strengths  : Interpretability, able to capture non-linear relationships and interactions between predictors, robust to outliers. 
    
    Weaknesses : Prone to overfitting, sensitive to small variations in the data, may create complex trees.
    
    Assumptions: No strict assumptions on the distribution of the data.
    
    Limitations: May not generalize well to unseen data, may require boosting or ensemble methods to improve performance.

2. Ridge Regression
    
    Strengths  : Handles multicollinearity well by adding a penalty term to the coefficients, reduces model variance.
    
    Weaknesses : Adds bias to the model, requires tuning of regularization parameter (alpha).
    
    Assumptions: Assumes linearity, independence of errors, constant variance of errors, and normality of errors.
    
    Limitations: May not perform well with highly sparse datasets, less interpretable due to regularization.

3. Polynomial Regression
    
    Strengths  : Can capture non-linear relationships between predictors and the target variable, flexible model.
    
    Weaknesses : Susceptible to overfitting with high polynomial degrees, requires careful selection of polynomial degree.
    
    Assumptions: Assumes a polynomial relationship between predictors and the target variable.
    
    Limitations: May not generalize well to unseen data, interpretation becomes more complex with higher polynomial degrees.

4. Multiple Linear Regression
    
    Strengths  : Extends linear regression to multiple predictors, provides interpretable coefficients for each predictor.
    
    Weaknesses : Assumes a linear relationship between predictors and the target variable, sensitive to outliers and multicollinearity.
    
    Assumptions: Assumes a linear relationship between predictors and the target variable, independence and constant variance of errors, and normality of errors.
   
    Limitations: May not capture complex interactions between predictors, requires careful handling of multicollinearity.

5. Random Forest Regression
    
    Strengths  : Ensemble of decision trees that reduces overfitting, provides feature importance ranking.
    
    Weaknesses : Less interpretable than a single decision tree, requires tuning of hyperparameters.
    
    Assumptions: No strict assumptions, but may still overfit if not properly tuned.
    
    Limitations: Can be computationally expensive, may not perform well with very high-dimensional data.

Before entering the model training step of the project, the model algorithms need to be narrowed down to three models from the five listed above. 

Ridge Regression will be the first model used for model training. We wanted at least one model that is linear and on the simpler side, but not as simple as plain linear/multiple linear regression. The alpha value will be tuned to minimize the RMSE.

Random Forest will be moving forward because the ensemble of decision trees reduces overfitting, which single decision trees are prone to. Random or Grid Search will be used to tune the hyperparameters.

Polynomial Regression will be the final model because it is flexible and can capture non-linear relationships between the predictors and the target variable. The degree value will be tuned to minimize the RMSE.
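
As an illustration of what tuning the degree by RMSE might look like, here is a minimal sketch using a `Pipeline` with `GridSearchCV` on synthetic data. The data, the step names `poly` and `linreg`, and the candidate degrees are all assumptions; this is not the project's tuning code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic signal (not the wine data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-2, 2, size=(300, 2))
y_demo = 1.0 + X_demo[:, 0] ** 2 - X_demo[:, 1] + rng.normal(scale=0.2, size=300)

# Polynomial regression = polynomial feature expansion followed by linear regression
pipeline = Pipeline([
    ('poly', PolynomialFeatures(include_bias=False)),
    ('linreg', LinearRegression()),
])

# Try several degrees and score each by cross-validated RMSE
search = GridSearchCV(
    pipeline,
    param_grid={'poly__degree': [1, 2, 3, 4]},
    scoring='neg_root_mean_squared_error',
    cv=5,
)
search.fit(X_demo, y_demo)

best_degree = search.best_params_['poly__degree']
best_rmse = -search.best_score_  # the scorer returns negative RMSE
print('best degree:', best_degree, 'CV RMSE:', round(best_rmse, 3))
```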




## Model Training

The following code chunk prepares our dataset for training/testing. First, the dataset needs to be split into two separate objects, one for our predictors and another for the target variable. Both objects are then split into training and testing sets. For this project, an 80/20 split will be used, meaning 80% of the data will be used for training and the remaining 20% for testing.
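
The sketch below shows the shapes an 80/20 split produces on synthetic data. It also shows one possible variation, assuming you would rather have the scaler learn its mean and standard deviation from the training rows only: wrapping the scaler and model in a pipeline. In this notebook the predictors are scaled before the split, so this is an alternative, not the approach used here; all names ending in `_demo`, `_tr` and `_te` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with a linear signal (not the wine data)
rng = np.random.default_rng(2024)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = X_demo @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.1, size=100)

# 80/20 split: 80 rows for training, 20 rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2024
)
print(X_tr.shape, X_te.shape)

# The pipeline fits the scaler on the training rows only,
# then reuses those statistics when scoring the test rows
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X_tr, y_tr)
test_r2 = model.score(X_te, y_te)
print('test R^2:', round(test_r2, 3))
```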

In [None]:
# Separate predictors and target variable
X = transformed_redwine[['fixed acidity', 'volatile acidity', 'citric acid', 
                      'residual sugar', 'chlorides', 'free sulfur dioxide', 
                      'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']]
y = transformed_redwine['quality']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2024)

### Ridge Regression

In [None]:
# Ridge Regression
alpha = 1.0
ridge_model = Ridge(alpha=alpha)
ridge_model.fit(X_train, y_train)

# Perform cross-validation
cv_scores_ridge = cross_val_score(ridge_model, X_train, y_train, cv=10, scoring='neg_mean_squared_error')

# Calculate RMSE using cross-validation scores
train_rmse_ridge = np.sqrt(-cv_scores_ridge.mean())
print("RMSE (Ridge with Cross-Validation):", train_rmse_ridge)

# Predict on the test set
y_pred_ridge = ridge_model.predict(X_test)

# Calculate RMSE on the test set
test_rmse_ridge = np.sqrt(mean_squared_error(y_test, y_pred_ridge))
print("RMSE (Ridge on Test Set):", test_rmse_ridge)
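
A note on how the cross-validated RMSE is aggregated: the Ridge cell above takes the square root of the mean fold MSE, while the alpha sweep and the polynomial cells later in this notebook average the per-fold RMSE values. The two are close but not identical, and the mean of the per-fold RMSEs can never be larger than the root of the mean MSE. The sketch below illustrates this with made-up fold scores (the `fold_mse` values are hypothetical and not taken from the wine data).

```python
import numpy as np

# Hypothetical per-fold MSE values, invented purely for illustration
fold_mse = np.array([0.30, 0.36, 0.42, 0.49])

# Aggregation 1: square root of the mean MSE across folds
rmse_of_mean = np.sqrt(fold_mse.mean())

# Aggregation 2: mean of the per-fold RMSE values
mean_of_rmse = np.sqrt(fold_mse).mean()

print("sqrt(mean MSE):", rmse_of_mean)
print("mean(sqrt MSE):", mean_of_rmse)
```

Because of this, small differences between RMSE values computed with different aggregations should be read with some caution.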

The above test used only a single alpha value of 1. What happens if we check alpha values from 0.1 to just under 100, in increments of 0.1?

In [None]:
# Define alpha values
alpha_values_ridge = np.arange(0.1, 100, 0.1)

rmse_values_ridge = []

# Iterate through alpha values
for alpha in alpha_values_ridge:
    ridge_model = Ridge(alpha=alpha)
    cv_scores_ridge = cross_val_score(ridge_model, X_train, y_train, cv=10, scoring='neg_mean_squared_error')
    rmse_scores_ridge = np.sqrt(-cv_scores_ridge)
    mean_rmse_ridge = np.mean(rmse_scores_ridge)
    rmse_values_ridge.append(mean_rmse_ridge)

# Find the index of the minimum RMSE value
min_rmse_index = np.argmin(rmse_values_ridge)

# Plot alpha values vs. RMSE values
plt.figure(figsize=(10, 6))
plt.plot(alpha_values_ridge, rmse_values_ridge, marker='o', linestyle='-')
plt.title('Alpha vs. RMSE for Ridge Regression')
plt.xlabel('Alpha')
plt.ylabel('RMSE')
plt.show()

rmse_ridge_opt = rmse_values_ridge[min_rmse_index]

print("Lowest overall RMSE value achieved with Ridge regression:", rmse_ridge_opt)
print("Corresponding alpha value:", alpha_values_ridge[min_rmse_index])
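
As an alternative to the manual loop, the same kind of search can be sketched with `GridSearchCV` over a log-spaced alpha grid. This is only an illustration under stated assumptions: `X_demo` and `y_demo` are synthetic stand-ins created with `make_regression`, not the wine data, and the five-value grid is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train (11 predictors, like the wine data)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=11, noise=10.0, random_state=2024
)

# Log-spaced grid: alpha = 0.01, 0.1, 1, 10, 100
param_grid = {"alpha": np.logspace(-2, 2, 5)}

# 10-fold CV scored by (negated) RMSE, so best_score_ is a mean of fold RMSEs
search = GridSearchCV(
    Ridge(), param_grid, cv=10, scoring="neg_root_mean_squared_error"
)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["alpha"]
best_cv_rmse = -search.best_score_
print("Best alpha:", best_alpha)
print("Mean CV RMSE at best alpha:", best_cv_rmse)
```

A log-spaced grid covers several orders of magnitude with far fewer fits than a fine linear grid, which is often enough to locate the region where alpha matters.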

### Random Forests

In [None]:
# Standard Random Forest model with cross-validation
rf_model = RandomForestRegressor(random_state=2024)

# Perform cross-validation
cv_scores_rf = cross_val_score(rf_model, X_train, y_train, cv=10, scoring='neg_mean_squared_error')

rf_rmse_train = np.sqrt(-cv_scores_rf.mean())

# Print cross-validation scores
print("RMSE (Random Forest with Cross-Validation):", rf_rmse_train)

# Fit the model on the entire training set
rf_model.fit(X_train, y_train)

# Predictions
y_pred_rf = rf_model.predict(X_test)

# Evaluation
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
print("RMSE (Random Forest on Test Set):", rmse_rf)


Now, let's use Randomized Search to look for better hyperparameters.

In [None]:
# Define the Random Forest model
rf_model = RandomForestRegressor(random_state=2024)

# Define the parameter distributions for RandomizedSearchCV
param_distributions = {
    'n_estimators': [int(x) for x in np.linspace(start=200, stop=2000, num=10)],
    'max_depth': [int(x) for x in np.linspace(10, 110, num=11)],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

# Perform RandomizedSearchCV (random_state fixes which combinations are sampled, for reproducibility)
random_search = RandomizedSearchCV(rf_model, param_distributions, n_iter=250, cv=10, scoring='neg_mean_squared_error', verbose=1, n_jobs=-1, random_state=2024)
random_search.fit(X_train, y_train)

# Get the best estimator and best hyperparameters
best_estimator = random_search.best_estimator_
best_params = random_search.best_params_

# Predict using the best estimator
y_pred = best_estimator.predict(X_test)

# Compute RMSE using the test data
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("Test set RMSE of the tuned Random Forest:", rmse)
print("Corresponding hyperparameter values:")
for param, value in best_params.items():
    print(param, ":", value)
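
The cell above reports the tuned model's RMSE on the test set, while the Ridge and Polynomial tuning steps report a cross-validated RMSE. For a like-for-like comparison, the cross-validated RMSE of the best configuration can be read from `best_score_`. The sketch below shows the idea on synthetic data with a deliberately tiny search space so it runs quickly. The names `X_demo`, `small_distributions`, and the search settings are assumptions for illustration, not the settings used in this project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (not the wine dataset)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=11, noise=10.0, random_state=2024
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2024
)

# Small, hypothetical search space so the example finishes in seconds
small_distributions = {
    "n_estimators": [20, 40],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=2024),
    small_distributions,
    n_iter=4,
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=2024,
)
search.fit(X_tr, y_tr)

# Cross-validated RMSE of the best configuration (training data only)
cv_rmse = np.sqrt(-search.best_score_)

# Test set RMSE of the refitted best estimator
test_rmse = np.sqrt(mean_squared_error(y_te, search.best_estimator_.predict(X_te)))

print("CV RMSE of best configuration:", cv_rmse)
print("Test RMSE of best configuration:", test_rmse)
```

Reporting both values keeps the tuning comparison (CV against CV) separate from the final check on unseen data.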

### Polynomial Regression

In [None]:
# Initialize Polynomial Regression model
degree = 2  # Set the degree of polynomial features
polynomial_model = make_pipeline(PolynomialFeatures(degree), LinearRegression())

# Cross-validation on training data
cv_scores = cross_val_score(polynomial_model, X_train, y_train, cv=10, scoring='neg_mean_squared_error')
rmse_poly = np.sqrt(-cv_scores)
mean_rmse_poly = np.mean(rmse_poly)
print("Mean Root Mean Squared Error (Polynomial Regression with 10-fold CV):", mean_rmse_poly)

# Fit the polynomial model on the training data
polynomial_model.fit(X_train, y_train)

# Predict using the trained polynomial model
y_pred_test = polynomial_model.predict(X_test)

# Compute RMSE using the test data
rmse_poly_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
print("RMSE on Test Data (Polynomial Regression):", rmse_poly_test)

The degree of the polynomial features was arbitrarily set to 2 for this first run. In the next code chunk, the optimal degree will be found.

In [None]:
np.random.seed(2024)

degrees = range(1, 10)  # Polynomial degrees 1 through 9 to evaluate

mean_rmse_poly_scores = []

# Iterate over each degree
for degree in degrees:
    polynomial_model_opt = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    cv_scores = cross_val_score(polynomial_model_opt, X_train, y_train, cv=10, scoring='neg_mean_squared_error')
    rmse_poly_opt = np.sqrt(-cv_scores)
    mean_rmse_degree = np.mean(rmse_poly_opt)  # separate name so mean_rmse_poly from the first run is kept
    mean_rmse_poly_scores.append(mean_rmse_degree)

# Find the optimal degree with the lowest mean RMSE
optimal_degree = degrees[np.argmin(mean_rmse_poly_scores)]
optimal_rmse = np.min(mean_rmse_poly_scores)

# Plot the mean RMSE scores against degrees
plt.figure(figsize=(10, 6))
plt.plot(degrees, mean_rmse_poly_scores, marker='o', linestyle='-')
plt.title('Mean RMSE vs. Degree of Polynomial Features')
plt.xlabel('Degree of Polynomial Features')
plt.ylabel('Mean RMSE')
plt.xticks(degrees)
plt.grid(True)
plt.show()

print("Optimal Degree:", optimal_degree)
print("Lowest Mean RMSE:", optimal_rmse)
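
To see why higher degrees become expensive and prone to overfitting, it helps to count the columns that `PolynomialFeatures` generates. With `n` predictors and degree `d` (bias column included), the count is the binomial coefficient C(n + d, d). The sketch below tabulates this for the 11 predictors used here. The variable names are illustrative only.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_predictors = 11  # number of predictors in the wine dataset

# Number of polynomial terms (including the bias column) for each degree
terms_by_degree = {d: comb(n_predictors + d, d) for d in range(1, 10)}

for d, n_terms in terms_by_degree.items():
    print(f"degree {d}: {n_terms} features")

# Cross-check the formula against scikit-learn for degree 2
poly = PolynomialFeatures(2).fit(np.zeros((1, n_predictors)))
print("PolynomialFeatures(2) output columns:", poly.n_output_features_)
```

From degree 4 onward the number of terms (1,365) exceeds roughly 1,279, which is 80% of the original 1,599 rows and therefore an upper bound on the training set size. Beyond that point a plain linear regression has more columns than training rows to fit them with.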

## Model Evaluation

Each model produced three meaningful RMSE results: a mean RMSE from cross-validation on the training data, an RMSE on the unseen test data, and a lowest RMSE after tuning the hyperparameters.

| Model | CV Training mean RMSE | Test set RMSE | Lowest RMSE with optimization |
|---|---|---|---|
| Ridge Regression | 0.6047955177720854 | 0.5709275391082791 | 0.6025522657022474 |
| Random Forest | 0.5525856913792679 | 0.5027015902987466 | 0.5054039305931296 |
| Polynomial Regression | 0.5955428930200941 | 0.5892338805516505 | 0.5955428930200941 |


Let's now visualize these results with bar plots. To do this, the stored RMSE values need to be combined into data frames.



In [None]:
# Training
data_training = {
    'Model_Training': ['ridge', 'random forest', 'polynomial'],
    'RMSE_Training': [train_rmse_ridge, rf_rmse_train, mean_rmse_poly]
}

# Create the DataFrame
df_train = pd.DataFrame(data_training)

df_train.head()

In [None]:
# Testing
data_testing = {
    'Model_Testing': ['ridge', 'random forest', 'polynomial'],
    'RMSE_Testing': [test_rmse_ridge, rmse_rf, rmse_poly_test]
}

# Create the DataFrame
df_test = pd.DataFrame(data_testing)

df_test.head()

In [None]:
# Optimization
data_opt = {
    'Model_Opt': ['ridge', 'random forest', 'polynomial'],
    'RMSE_Opt': [rmse_ridge_opt, rmse, optimal_rmse]
}

# Create the DataFrame
df_opt = pd.DataFrame(data_opt)

df_opt.head()

In [None]:
# Now, let's output the three different bar plots

# Train
# Sort the DataFrame by RMSE in descending order
df_train_sorted = df_train.sort_values(by='RMSE_Training', ascending=False)

# Create a bar plot
plt.figure(figsize=(10, 6))
plt.barh(df_train_sorted['Model_Training'], df_train_sorted['RMSE_Training'], color='slateblue')
plt.xlabel('RMSE')
plt.ylabel('Model')
plt.title('Training RMSE for Different Models')
plt.grid(axis='x')
plt.show()


# Test
# Sort the DataFrame by RMSE in descending order
df_test_sorted = df_test.sort_values(by='RMSE_Testing', ascending=False)

# Create a bar plot
plt.figure(figsize=(10, 6))
plt.barh(df_test_sorted['Model_Testing'], df_test_sorted['RMSE_Testing'], color='coral')
plt.xlabel('RMSE')
plt.ylabel('Model')
plt.title('Testing RMSE for Different Models')
plt.grid(axis='x')
plt.show()


# Optimization
# Sort the DataFrame by RMSE in descending order
df_opt_sorted = df_opt.sort_values(by='RMSE_Opt', ascending=False)

# Create a bar plot
plt.figure(figsize=(10, 6))
plt.barh(df_opt_sorted['Model_Opt'], df_opt_sorted['RMSE_Opt'], color='palegreen')
plt.xlabel('RMSE')
plt.ylabel('Model')
plt.title('Optimized RMSE for Different Models')
plt.grid(axis='x')
plt.show()

Looking at the visualized RMSE values from the training, testing, and optimization runs, Random Forest was clearly the best performing model for the project. It outperformed Ridge and Polynomial Regression in all three comparisons. With that said, all three models were relatively close in RMSE values.

## Interpretation and Discussion

### Does your model perform better or worse than expected?
For this project, the models performed better than expected. It was expected that there would be more issues with multicollinearity among the variables or with the models overfitting. This, however, was not really the case. No variables showed strong correlation, and the test set RMSE was the lowest score for all three models. Overfitting would normally show up as a test RMSE noticeably higher than the training RMSE, so a wider gap in that direction was expected. The slightly lower test scores may simply reflect the particular test split, and the scores from the other runs were not far off.
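
The gap between the cross-validated training RMSE and the test RMSE can be tabulated directly from the values reported in the Model Evaluation section. The sketch below copies those reported numbers into a small data frame. The column names are my own choice for illustration.

```python
import pandas as pd

# RMSE values copied from the Model Evaluation section of this notebook
reported = pd.DataFrame({
    "model": ["ridge", "random forest", "polynomial"],
    "cv_rmse": [0.6047955177720854, 0.5525856913792679, 0.5955428930200941],
    "test_rmse": [0.5709275391082791, 0.5027015902987466, 0.5892338805516505],
})

# Negative values mean the test RMSE was lower than the CV training RMSE
reported["test_minus_cv"] = reported["test_rmse"] - reported["cv_rmse"]

print(reported.round(4))
```

A negative difference for every model means that none of them did worse on unseen data than in cross-validation, which is the opposite of the usual overfitting pattern.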


### Based on your domain knowledge, is this model worth deploying?
This is a difficult question to answer, as my domain knowledge for this project does not extend very far. It would be helpful if a comparison could be made to the performance of alternative models built by other people. As it stands, since there is no way to know how my model would perform in a real-world scenario, it is impossible to say whether the model in this project should be deployed.

### What needs to be improved in order to increase model performance?
To improve this model, a few things come to mind. More could have been done on the feature engineering front. The distributions of some of the predictors looked a little strange and were far from normally distributed, which could have had a negative impact on the model. Also, no PCA or factor analysis was explored, but there was perhaps room for those techniques in this project. There is also a chance that the model could have benefitted from synthetic data created and injected into the project. On the modelling side, it would have been interesting to explore far more model types, and to tweak and re-run the models that were used to potentially squeeze out better results. Ideally, a project like this also involves a lot more background research. As stated earlier, the domain knowledge for this project was not overly strong. Had more research taken place, it might have led to better analysis of the results or even to different choices during steps 2 to 5.

### Did hyperparameter tuning lead to better or worse model performance? Explain your answer.
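In the case of Polynomial Regression, the degree of 2 used for the initial run turned out to be the optimal degree, so tuning produced no improvement. The largest change was seen with the Random Forest model, where the cross-validated training RMSE started at 0.5526 and the tuned model reached an RMSE of 0.5054. That tuned value was computed on the test set, however, and the untuned model's test RMSE was 0.5027, so on a like-for-like basis tuning did not clearly improve the Random Forest. There was also a small improvement after finding the optimal alpha value for the Ridge Regression (0.6048 to 0.6026), although those two values aggregate the fold scores slightly differently.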

### Other Thoughts and Improvements
For this last part, I will be candid and say that a lot of procrastination and a lack of motivation preceded the main work on this project. Ideally, I would have done more with the models than the bare minimum. There was some code you provided that acted as a test for the Polynomial Regression, which I did not look into but would have liked to. The Random Forest ensemble method provides a feature importance ranking that would have provided super cool insight into which predictors were most influential in predicting wine quality. Initially, I was interested in the Stochastic Gradient Descent model, but after looking into it, that model is mostly recommended for very large datasets of 10,000+ rows. It would have been interesting to create loads of synthetic data and put that model to the test. I also did not run many iterations of my models because it would have been too time-consuming and I didn't leave myself enough time to do that.
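
As a pointer for that follow-up, a feature importance ranking can be read from a fitted `RandomForestRegressor` through its `feature_importances_` attribute. The sketch below is a minimal illustration on synthetic data: the wine column names are used only as labels, so the ranking it prints says nothing about real wine quality.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

feature_names = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol",
]

# Synthetic stand-in data; the column names are labels only
X_arr, y_demo = make_regression(
    n_samples=300, n_features=11, n_informative=4, noise=5.0, random_state=2024
)
X_demo = pd.DataFrame(X_arr, columns=feature_names)

rf = RandomForestRegressor(n_estimators=100, random_state=2024)
rf.fit(X_demo, y_demo)

# Impurity-based importances, sorted from most to least influential
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)

print(importances)
```

In the real notebook, the same two lines that build and sort `importances` could be applied to the fitted `rf_model` or `best_estimator` with `X_train.columns` as the index.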
