### Wine Project

Data Import:

In [None]:
import pandas as pd

red_path = './data/winequality-red.csv'
white_path = './data/winequality-white.csv'

red_dataset = pd.read_csv(red_path, header=0, sep=';')
white_dataset = pd.read_csv(white_path, header=0, sep=';')

# Task 1 Exploring the data:

In [None]:
display(red_dataset.describe())
display(red_dataset.head())
display(white_dataset.describe())
display(white_dataset.head())

## White wine quality distribution:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize = (5,5))
sns.countplot(x = "quality", data = white_dataset)
plt.title("Count number of times a quality rating was given to a white wine")
plt.xlabel("quality rating")
plt.ylabel("Count")
plt.show()
plt.close()


The above plot shows that ratings of 5, 6 and 7 are the most commonly given among the white wine samples, with 6 being by far the most common. A count plot was used as it clearly displays the distribution of quality ratings in the white wine data set.

## Red wine quality distribution:

In [None]:
plt.figure(figsize = (5,5))
sns.countplot(x = "quality", data = red_dataset)
plt.title("Count number of times a quality rating was given to a red wine")
plt.xlabel("quality rating")
plt.ylabel("Count")
plt.show()
plt.close()

The above plot shows that 5 and 6 are by far the most commonly given quality ratings in the red wine data set. A count plot was used for the same reason as in the white wine example: it gives a clear view of the distribution of quality ratings across the samples.

## Comparison of red and white quality distributions:


In [None]:
fig, ax =plt.subplots(1,2)
sns.countplot(x="quality",data=white_dataset, ax=ax[0])
sns.countplot(x="quality",data=red_dataset , ax=ax[1])
fig.suptitle("Comparison of quality distributions across both red and white wines")
ax[0].set_title("White Wine")
ax[1].set_title("Red Wine")

The above plots compare the distribution of quality ratings across the red and white wine samples. They clearly show that there are significantly more white wine samples than red wine samples. Both wine types share a similar distribution of quality ratings. This could be because most wines fall into an average category, and only a notably good or notably bad wine warrants a rating below 5 or above 6.

The graphs also show that reviewers are more willing to give a rating of 7 than any rating below 5. This could be due to the reviewers having a generally positive attitude towards wine, so that a stronger negative impression is needed before they give a below-average rating, but not enough data on the reviewers is available to test this. Finally, the plots show that on average the white wines sampled were rated higher than the red wines. This could be due to reviewer preference or to lower quality red wines actually being sampled, but again not enough information is available to confirm this.
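Because the two data sets differ so much in size, raw counts can make the shapes of the distributions hard to compare. A minimal sketch of comparing proportions instead is given below. It uses small made-up rating lists rather than the real data, and the helper name `rating_proportions` is hypothetical.

```python
import pandas as pd

def rating_proportions(white_quality, red_quality):
    """Return the share of samples at each rating for both wine types."""
    props = pd.DataFrame({
        "white": pd.Series(white_quality).value_counts(normalize=True),
        "red": pd.Series(red_quality).value_counts(normalize=True),
    })
    # Ratings missing from one data set become 0 rather than NaN
    return props.fillna(0).sort_index()

# Toy ratings for illustration only, not taken from the wine data
toy_props = rating_proportions([5, 5, 6, 6, 6, 7], [5, 5, 6])
print(toy_props)
```

With the real data this could be called as `rating_proportions(white_dataset["quality"], red_dataset["quality"])`.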

## Categorizing Wine based on alcohol content white wine:

In [None]:
white_lower = white_dataset["alcohol"].mean() - white_dataset["alcohol"].std()
white_upper = white_dataset["alcohol"].mean() + white_dataset["alcohol"].std()
white_dataset["alcohol_cat"] = pd.cut(
    x=white_dataset["alcohol"],
    bins=[0,white_lower, white_upper, white_dataset["alcohol"].max()],
    labels=["low", "average", "high"]
)
display(white_dataset.head())

In the code above, lower and upper bounds on the alcohol content of white wine are calculated, where the upper bound is one standard deviation above the mean and the lower bound is one standard deviation below the mean. These bounds are used to categorise the white wine samples as low, average or high in alcohol content.
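The same binning logic can be wrapped in a small helper and checked on toy values. This is only a sketch: the function name `categorise_by_std` and the toy alcohol percentages are assumptions made for illustration, not part of the original analysis.

```python
import pandas as pd

def categorise_by_std(values):
    """Label values as low/average/high using mean +/- one standard deviation."""
    lower = values.mean() - values.std()
    upper = values.mean() + values.std()
    return pd.cut(
        x=values,
        bins=[0, lower, upper, values.max()],
        labels=["low", "average", "high"],
    )

# Toy alcohol percentages for illustration only
toy_alcohol = pd.Series([8, 9, 10, 10, 10, 11, 12, 14])
toy_counts = categorise_by_std(toy_alcohol).value_counts()
print(toy_counts)
```

Calling `value_counts()` on the result is a quick way to see how many samples fall into each category before plotting.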

## Categorizing wine based on alcohol content Red Wine:

In [None]:
red_lower = red_dataset["alcohol"].mean() - red_dataset["alcohol"].std()
red_upper = red_dataset["alcohol"].mean() + red_dataset["alcohol"].std()
red_dataset["alcohol_cat"] = pd.cut(
    x=red_dataset["alcohol"],
    bins=[0,red_lower, red_upper, red_dataset["alcohol"].max()],
    labels=["low", "average", "high"]
)
display(red_dataset.head())

The same method as with the white wine samples was used to categorise the alcohol content of the red wine samples, but the boundaries were calculated from the red wine data.

## Quality Distribution based on alcohol content for white wines:

In [None]:
plt.figure(figsize = (5,5))
sns.countplot(x = "quality", data = white_dataset, hue="alcohol_cat")
plt.title("Count number of times a quality rating was given to a white wine")
plt.xlabel("quality rating")
plt.ylabel("Count")
plt.show()
plt.close()

The plot above shows the effect alcohol content has on the distribution of quality ratings. It shows that few white wines fall outside the average alcohol category, so alcohol content is quite tightly grouped across the sample. A count plot was used because, with the `hue` parameter separating the alcohol categories, it clearly shows how alcohol content relates to the quality rating distribution.

## Quality Distribution based on alcohol content for red wines:

In [None]:
plt.figure(figsize = (5,5))
sns.countplot(x = "quality", data = red_dataset, hue="alcohol_cat")
plt.title("Count number of times a quality rating was given to a red wine")
plt.xlabel("quality rating")
plt.ylabel("Count")
plt.show()
plt.close()

The plot above was used for much the same reasons as the plot for the white wine, which is explained above.

## Quality Distribution comparison between red and white wine based on alcohol content:

In [None]:
fig, ax =plt.subplots(1,2, figsize=(10,5))
sns.countplot(x="quality",data=white_dataset, ax=ax[0], hue="alcohol_cat")
sns.countplot(x="quality",data=red_dataset , ax=ax[1], hue="alcohol_cat")
fig.suptitle("Comparison of quality distributions across both red and white wines based on their alcohol content")
ax[0].set_title("White Wine")
ax[1].set_title("Red Wine")

The plot above compares the effect that alcohol content has on the quality ratings given to red and white wine. Overall, both graphs exhibit very similar features, showing that there was not a huge variation in the alcohol content of the wines sampled. An interesting observation from both graphs is that lower alcohol wines of both types were given lower ratings on average, shown by the decline in the number of low alcohol wines as the rating increases. This could be for a number of reasons, and there is not enough data to speculate on them, but it does seem that at least a certain level of alcohol content is needed to create a higher quality wine.
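The visual impression that ratings rise with alcohol category could be backed up numerically by averaging the ratings within each category. The sketch below does this on a made-up table; the values are illustrative only and are not taken from the wine data.

```python
import pandas as pd

# Toy data for illustration only, not taken from the wine data
toy = pd.DataFrame({
    "alcohol_cat": pd.Categorical(
        ["low", "low", "average", "average", "average", "high", "high"],
        categories=["low", "average", "high"],
    ),
    "quality": [4, 5, 5, 6, 7, 7, 8],
})

# Mean rating and sample count within each alcohol category
summary = toy.groupby("alcohol_cat", observed=True)["quality"].agg(["mean", "count"])
print(summary)
```

The same `groupby` call could be applied to `white_dataset` and `red_dataset` once the `alcohol_cat` column has been added.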

## Categorising wine based on residual sugar white wine:

In [None]:
white_dataset["is_sweet"] = pd.cut(
    x=white_dataset["residual sugar"],
    bins=[0, white_dataset["residual sugar"].median(), white_dataset["residual sugar"].max()],
    labels=["dry","sweet"])
display(white_dataset.head())


The above code divides the white wine samples into sweet and dry wines based on their residual sugar content, taking the median residual sugar value as the splitting point. This point was chosen to balance the two groups and make them more useful for machine learning applications later.

## Categorising wine based on residual sugar red wine:

In [None]:
red_dataset["is_sweet"] = pd.cut(
    x=red_dataset["residual sugar"],
    bins=[0, red_dataset["residual sugar"].median(), red_dataset["residual sugar"].max()],
    labels=["dry","sweet"])
display(red_dataset.head())


The above code divides the red wine samples into sweet and dry wines based on their residual sugar content, using the same method as for the white wine samples and for the same reasons. This data set did not split with the same balance as the white wine set and is slightly skewed toward dry wines; this will be taken into account in future machine learning applications.
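One possible explanation for an uneven median split (an assumption, not verified against the wine data) is ties at the median: `pd.cut` bins are right-inclusive, so every sample exactly equal to the median is labelled "dry". The toy values below are made up to show the effect.

```python
import pandas as pd

# Toy residual sugar values with several ties at the median (2.0)
toy_sugar = pd.Series([1.0, 2.0, 2.0, 2.0, 2.0, 3.0, 4.0])

toy_sweet = pd.cut(
    x=toy_sugar,
    bins=[0, toy_sugar.median(), toy_sugar.max()],
    labels=["dry", "sweet"],
)

# Bins are right-inclusive, so every value equal to the median lands in "dry"
balance = toy_sweet.value_counts()
print(balance)
```

Running `value_counts()` on the real `is_sweet` columns would show how far each data set is from an even split.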

In [None]:
fig, ax =plt.subplots(1,2, figsize=(10,5))
sns.countplot(x="quality",data=white_dataset, ax=ax[0], hue="is_sweet")
sns.countplot(x="quality",data=red_dataset , ax=ax[1], hue="is_sweet")
fig.suptitle("Comparison of quality distributions across both red and white wines based on their sweetness")
ax[0].set_title("White Wine")
ax[1].set_title("Red Wine")

The above plot shows the effect sweetness has on the quality ratings given to the wines sampled. Count plots were used with a hue based on the sweet or dry categorisation made above to clearly show the effect of sweetness on quality ratings. With the categorisation chosen, sweetness does not seem to have much of an effect on the quality rating. There is perhaps a slight skew towards higher quality white wines being dry, but the link is not strong enough to say anything conclusive.

# Task 2: Determining which subset of variables is the most useful for learning

## Determining the correlation of the data in the white wine data set

In [None]:
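# numeric_only=True skips the categorical columns added in Task 1 (alcohol_cat, is_sweet)
sns.heatmap(white_dataset.corr(method="pearson", numeric_only=True).round(2), annot=True, vmin=-1, vmax=1)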

The above plot shows a visual representation of the correlation matrix for the white wine data set. Pearson's correlation coefficient was used because most of the variables appeared approximately normally distributed and all of them are quantitative, so Pearson's appeared to be the best fit. A heat map was used to give a clear and easy to understand visualisation of the correlation matrix produced from the white wine sample.

The plot shows that some variables have a reasonable correlation while others have almost none. Alcohol content has quite a strong negative correlation with density: as the alcohol content of the wine increases, its density decreases. This could be because alcohol has a relatively low density, so as it makes up a larger percentage of the wine it lowers the overall density. There is a strong positive correlation between residual sugar and density: as the residual sugar content increases, so does the density. This could be due to residual sugar having a high density in and of itself, raising the density of the wines which contain more of it.

The free and total sulfur dioxide values share a medium positive correlation, which is not surprising, as more free sulfur dioxide would suggest more total sulfur dioxide on average. Total sulfur dioxide also has a medium positive correlation with density, again suggesting that sulfur dioxide could be one of the higher density ingredients in the wine. For the quality rating, the highest correlation is with alcohol content, a medium positive correlation, showing that higher quality wines may contain more alcohol.

All of these are assumptions based on the data which has been collected. A more in-depth quality scaling system may have helped to provide more context, and more information on the reviewers may have allowed for more conclusive analysis.

## Determining the correlation of the data in the red wine data set

In [None]:
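# numeric_only=True skips the categorical columns added in Task 1 (alcohol_cat, is_sweet)
sns.heatmap(red_dataset.corr(method="pearson", numeric_only=True).round(2), annot=True, vmin=-1, vmax=1)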

The above plot shows a visual representation of the correlation matrix for the red wine data set, using Pearson's correlation coefficient and a heat map. These options were chosen for the same reasons as explained for the white wine plot above.

Some of the more interesting findings are as follows. Fixed acidity and citric acid have a reasonably strong positive correlation, while volatile acidity has a medium strength negative correlation with citric acid. This could suggest that the fixed acidity comes from the citric acid contained in red wine and that the volatile acidity is supplied by a different ingredient. The free and total sulfur dioxide values share a reasonably strong correlation, as in the white wine, which could again be for the reasons given for the white wine plot above. The density of the red wine samples has a reasonably strong positive correlation with fixed acidity, which could suggest that the ingredients which give the wine its fixed acidity are also some of the denser ingredients. The pH has a reasonably strong negative correlation with fixed acidity and citric acid content, which makes sense, as an increased amount of these ingredients would make the wine more acidic, bringing the pH down.

The quality rating shares a medium strength positive correlation with alcohol content, which could suggest that a higher alcohol content, up to a point, produces a higher quality red wine. Volatile acidity has a medium strength negative correlation with the quality rating, which could suggest that volatile acidity is not a desirable trait in red wine.

## Comparison of the red and white wine correlation matrices:

In [None]:
fig, ax =plt.subplots(1,2, figsize=(15,5))
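# numeric_only=True skips the categorical columns added in Task 1 (alcohol_cat, is_sweet)
sns.heatmap(white_dataset.corr(method="pearson", numeric_only=True).round(2), annot=True, vmin=-1, vmax=1, ax=ax[0])
sns.heatmap(red_dataset.corr(method="pearson", numeric_only=True).round(2), annot=True, vmin=-1, vmax=1, ax=ax[1])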
fig.suptitle("Comparison of red and white wine correlation matrices")
ax[0].set_title("White Wine")
ax[1].set_title("Red Wine")

The above plot shows the correlation matrices for the red and white wine data sets side by side. Most of the correlations are shared between the two, but to varying degrees. For example, total sulfur dioxide has a much stronger correlation with density in white wine than in red wine, and the same goes for residual sugar. This could be because red wines on average contain lower quantities of these ingredients, so they have less impact on the overall density, or because other ingredients simply have a much larger impact on the density. Further studies on the wines would be needed to make any real claims. An interesting observation is that in both plots alcohol content has a medium strength positive correlation with the quality rating, which poses the question of whether a higher alcohol content causes a wine to be of higher quality or whether higher quality wines simply contain more alcohol.
From these plots it is still difficult to determine which parameters would be useful for a successful machine learning model, but going forward I will exclude any variable whose correlation with the quality rating lies between -0.1 and 0.1.
For white wine, sulphates, free sulfur dioxide and citric acid will be excluded.
For red wine, pH, free sulfur dioxide and residual sugar will be excluded.
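The exclusion rule can also be applied programmatically rather than by reading values off the heat map. The sketch below is one way to do it; the helper name `weakly_correlated_features` and the toy table are hypothetical, and `numeric_only=True` assumes a reasonably recent pandas version.

```python
import pandas as pd

def weakly_correlated_features(df, target="quality", threshold=0.1):
    """Return columns whose absolute Pearson correlation with target is below threshold."""
    corr = df.corr(method="pearson", numeric_only=True)[target].drop(target)
    return corr[corr.abs() < threshold].index.tolist()

# Toy table for illustration only
toy = pd.DataFrame({
    "quality": [1, 2, 3, 4, 5],
    "a": [2, 4, 6, 8, 10],   # perfectly positively correlated with quality
    "b": [5, 4, 3, 2, 1],    # perfectly negatively correlated with quality
    "c": [1, -1, 0, -1, 1],  # uncorrelated with quality
})
to_drop = weakly_correlated_features(toy)
print(to_drop)
```

The returned list could then be passed to `DataFrame.drop(columns=...)` for each wine type.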

## preparing the data for machine learning models:

In [None]:
white_dataset_prep =white_dataset.drop(columns=["sulphates", "free sulfur dioxide", "citric acid"])
red_dataset_prep = red_dataset.drop(columns=["pH", "free sulfur dioxide", "residual sugar"])


cats = ["is_sweet", "alcohol_cat"]

for feature in cats:
    dummies = pd.get_dummies(white_dataset_prep[feature])
    white_dataset_prep = pd.concat((white_dataset_prep, dummies), axis = 1)
    white_dataset_prep = white_dataset_prep.drop(feature, axis = 1)

    dummies = pd.get_dummies(red_dataset_prep[feature])
    red_dataset_prep = pd.concat((red_dataset_prep, dummies), axis = 1)
    red_dataset_prep = red_dataset_prep.drop(feature, axis = 1)

display(white_dataset_prep.head())
display(red_dataset_prep.head())



The above code removes the features with the lowest correlation to quality from the data sets and converts the categorical values to numerical (one-hot encoded) columns so that machine learning algorithms can be applied.

# Task 3: machine learning approaches:

## Classification model white wine:

In [None]:



white_dataset_num_quality_0 = pd.cut(
    x=white_dataset["quality"],
    bins=[0, 5, 10],
    labels=[0,1])
white_dataset_onehot_0 = white_dataset_prep.drop(columns="quality")
white_dataset_onehot_0 = pd.concat((white_dataset_onehot_0, white_dataset_num_quality_0), axis=1)

white_dataset_num_quality_1 = pd.cut(
    x=white_dataset["quality"],
    bins=[0, 6, 10],
    labels=[0,1])

white_dataset_onehot_1 = white_dataset_prep.drop(columns="quality")
white_dataset_onehot_1 = pd.concat((white_dataset_onehot_1, white_dataset_num_quality_1), axis=1)

white_dataset_num_quality_2 = pd.cut(
    x=white_dataset["quality"],
    bins=[0, 7, 10],
    labels=[0,1])

white_dataset_onehot_2 = white_dataset_prep.drop(columns="quality")
white_dataset_onehot_2 = pd.concat((white_dataset_onehot_2, white_dataset_num_quality_2), axis=1)




The above code converts the quality ratings from the original numeric scale to a binary rating, where 1 is high quality and 0 is low quality. Three versions of the split are created, with the threshold between low and high quality increasing from 5 to 6 and finally 7 (ratings above the threshold count as high quality). This allows the effect of the threshold on the training of the machine learning model to be observed.
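To make the thresholding explicit, the sketch below expresses the same split as a simple comparison and checks it against `pd.cut` on a few made-up ratings. The helper name `binarise_quality` is hypothetical.

```python
import pandas as pd

def binarise_quality(quality, threshold):
    """Return 1 for ratings above threshold and 0 otherwise."""
    return (quality > threshold).astype(int)

# Toy ratings for illustration only
toy_quality = pd.Series([3, 4, 5, 6, 7, 8])
via_compare = binarise_quality(toy_quality, threshold=5)

# The same split expressed with pd.cut, as in the cell above
via_cut = pd.cut(x=toy_quality, bins=[0, 5, 10], labels=[0, 1]).astype(int)
print(via_compare.tolist())
```

The two approaches agree for ratings from 1 to 10. A rating of 0 would fall outside the first `pd.cut` bin, which excludes its left edge, and would become NaN there.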

## splitting the white wine data:

In [None]:
from sklearn.model_selection import train_test_split

labels_0 = white_dataset_onehot_0['quality'].values
data_0 = white_dataset_onehot_0.drop(['quality'], axis = 1).values

data_train_0, data_test_0, labels_train_0, labels_test_0 = train_test_split(data_0,
                                                                            labels_0,
                                                                            test_size = 0.33,
                                                                            random_state = 10)

labels_1 = white_dataset_onehot_1['quality'].values
data_1 = white_dataset_onehot_1.drop(['quality'], axis = 1).values

data_train_1, data_test_1, labels_train_1, labels_test_1 = train_test_split(data_1,
                                                                            labels_1,
                                                                            test_size = 0.33,
                                                                            random_state = 10)

labels_2 = white_dataset_onehot_2['quality'].values
data_2 = white_dataset_onehot_2.drop(['quality'], axis = 1).values

data_train_2, data_test_2, labels_train_2, labels_test_2 = train_test_split(data_2,
                                                                            labels_2,
                                                                            test_size = 0.33,
                                                                            random_state = 10)


The above code splits the 3 datasets into testing and training data with a split of 33% of the data being used as test data leaving 67% as training data.

## Addressing the class imbalance in the datasets:

In [None]:
from imblearn.over_sampling import SMOTE

data_train_0, labels_train_0 = SMOTE(random_state=10).fit_resample(data_train_0, labels_train_0)
data_train_1, labels_train_1 = SMOTE(random_state=10).fit_resample(data_train_1, labels_train_1)
data_train_2, labels_train_2 = SMOTE(random_state=10).fit_resample(data_train_2, labels_train_2)



The above code cell balances the training sets, as the ratio of high to low quality wines becomes quite unbalanced, especially as the split point from low to high quality increases. SMOTE was used to oversample the minority class in the training data only, leaving the test data untouched, to improve the application of the machine learning algorithm.
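To illustrate the idea behind SMOTE, the sketch below is a simplified NumPy-only version of its interpolation step: each synthetic sample lies on the line between a minority sample and one of its nearest minority neighbours. It is not the imblearn implementation, and the function name, parameters and toy points are all hypothetical.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=10):
    """Create n_new synthetic minority samples by interpolating between
    a minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    # Pick a base sample and one of its neighbours for every new point
    base = rng.integers(0, len(X_min), size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # position along the connecting line
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# Toy minority-class points for illustration only (requires k < number of points)
toy_minority = [[0, 0], [1, 0], [0, 1], [1, 1]]
synthetic = smote_like_oversample(toy_minority, n_new=5)
print(synthetic)
```

Because each new point is an interpolation between two existing minority points, the synthetic samples stay within the region already covered by that class.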

## scaling the data:

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

data_train_scaled_0 = scaler.fit_transform(data_train_0)
data_test_scaled_0 = scaler.transform(data_test_0)

data_train_scaled_1 = scaler.fit_transform(data_train_1)
data_test_scaled_1 = scaler.transform(data_test_1)

data_train_scaled_2 = scaler.fit_transform(data_train_2)
data_test_scaled_2 = scaler.transform(data_test_2)

The above code cell scales the datasets using the standard scaler from sklearn, again to improve the data's usefulness for machine learning. The scaler is fitted on the training data and then applied unchanged to the test data.
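As a small illustration of what the standard scaler does, the sketch below compares its output with the manual calculation z = (x - mean) / std, where the mean and standard deviation come from the training data only. The toy matrices are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrices for illustration only
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
toy_test = np.array([[2.5, 25.0], [5.0, 50.0]])

toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean and std from train
toy_test_scaled = toy_scaler.transform(toy_test)        # reuses the train statistics

# Equivalent manual calculation: z = (x - train_mean) / train_std
manual_test_scaled = (toy_test - toy_train.mean(axis=0)) / toy_train.std(axis=0)
print(toy_test_scaled)
```

Reusing the training statistics on the test data keeps information about the test set from leaking into the preprocessing step.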

## Random forest classifier:

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

random_forest_0 = RandomForestClassifier(n_estimators=50, max_depth=50, random_state=10)
random_forest_0.fit(data_train_scaled_0, labels_train_0)

random_forest_predictions_0 = random_forest_0.predict(data_train_scaled_0)
train_acc_0 = accuracy_score(labels_train_0, random_forest_predictions_0)

random_forest_predictions_0 = random_forest_0.predict(data_test_scaled_0)
test_acc_0 = accuracy_score(labels_test_0, random_forest_predictions_0)

print(f"Train_0 acc: {train_acc_0 * 100}%")
print(f"Test_0 acc: {test_acc_0 * 100}%")

random_forest_1 = RandomForestClassifier(n_estimators=50, max_depth=50, random_state=10)
random_forest_1.fit(data_train_scaled_1, labels_train_1)

random_forest_predictions_1 = random_forest_1.predict(data_train_scaled_1)
train_acc_1 = accuracy_score(labels_train_1, random_forest_predictions_1)

random_forest_predictions_1 = random_forest_1.predict(data_test_scaled_1)
test_acc_1 = accuracy_score(labels_test_1, random_forest_predictions_1)

print(f"Train_1 acc: {train_acc_1 * 100}%")
print(f"Test_1 acc: {test_acc_1 * 100}%")

random_forest_2 = RandomForestClassifier(n_estimators=50, max_depth=50, random_state=10)
random_forest_2.fit(data_train_scaled_2, labels_train_2)

random_forest_predictions_2 = random_forest_2.predict(data_train_scaled_2)
train_acc_2 = accuracy_score(labels_train_2, random_forest_predictions_2)

random_forest_predictions_2 = random_forest_2.predict(data_test_scaled_2)
test_acc_2 = accuracy_score(labels_test_2, random_forest_predictions_2)

print(f"Train_2 acc: {train_acc_2 * 100}%")
print(f"Test_2 acc: {test_acc_2 * 100}%")


The above code cell trains and then tests a random forest classifier for predicting the quality rating of a wine sample from its measurements. A random forest was used because it is well suited to classification and has a significantly lower chance of overfitting than a single decision tree. The train and test accuracies produced for each dataset still seem to suggest overfitting, as the training accuracies are all extremely close to 100%. To try to reduce this further, hyperparameter tuning must take place.
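The fit-and-score steps repeated for each dataset could be gathered into one helper that returns both accuracies, which also makes the train/test gap easy to inspect. The sketch below uses synthetic data from `make_classification` as a stand-in for the wine arrays, and the helper name `evaluate_forest` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def evaluate_forest(X_train, y_train, X_test, y_test, **params):
    """Fit a random forest and return (train accuracy, test accuracy)."""
    forest = RandomForestClassifier(random_state=10, **params)
    forest.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, forest.predict(X_train))
    test_acc = accuracy_score(y_test, forest.predict(X_test))
    return train_acc, test_acc

# Synthetic stand-in data; the wine arrays would be passed in the same way
X, y = make_classification(n_samples=300, n_features=8, random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=10)

train_acc, test_acc = evaluate_forest(X_tr, y_tr, X_te, y_te, n_estimators=50, max_depth=50)
print(f"Train acc: {train_acc:.3f}, test acc: {test_acc:.3f}, gap: {train_acc - test_acc:.3f}")
```

A large gap between the two numbers is the usual sign of overfitting, whereas a high training accuracy on its own is expected for deep forests.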

## Hyperparameter tuning:

In [None]:
from sklearn.model_selection import GridSearchCV

random_state = 10
number_of_folds = 5

parameters_to_tune = [{'n_estimators': [10, 50],
                       'max_depth': [5, 50]}]

search = GridSearchCV(RandomForestClassifier(random_state = random_state), parameters_to_tune, cv = number_of_folds)
search.fit(data_train_scaled_0, labels_train_0)

print(f"Best parameters set found: {search.best_params_}")

random_forest = RandomForestClassifier(n_estimators=search.best_params_['n_estimators'],
                                       max_depth=search.best_params_['max_depth'],
                                       random_state=10)
random_forest.fit(data_train_scaled_0, labels_train_0)

random_forest_predictions = random_forest.predict(data_train_scaled_0)
train_acc = accuracy_score(labels_train_0, random_forest_predictions)

random_forest_predictions = random_forest.predict(data_test_scaled_0)
test_acc = accuracy_score(labels_test_0, random_forest_predictions)

print(f"Train acc: {train_acc * 100}%")
print(f"Test acc: {test_acc * 100}%")


The code cell above uses a 5-fold cross-validated grid search to find the best parameters for the first dataset (threshold of 5). Surprisingly, it supports the parameters which were already used. This may suggest that there is no overfitting and that this learning algorithm has in fact produced a relatively accurate wine quality predictor when quality is categorised as a binary high or low value.
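As a possible refinement, the grid search object already exposes the cross-validated score and the refitted best model, so the manual refit is optional. The sketch below shows this on synthetic stand-in data; the variable names are hypothetical and the data is not the wine data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data for illustration only
X, y = make_classification(n_samples=200, n_features=8, random_state=10)

param_grid = {"n_estimators": [10, 50], "max_depth": [5, 50]}
toy_search = GridSearchCV(RandomForestClassifier(random_state=10), param_grid, cv=5)
toy_search.fit(X, y)

# Mean cross-validated accuracy of the winning parameter combination
print(f"Best CV accuracy: {toy_search.best_score_:.3f}")

# With the default refit=True, the best model is already refitted on all of X, y
best_forest = toy_search.best_estimator_
print(best_forest.get_params()["n_estimators"], best_forest.get_params()["max_depth"])
```

The cross-validated score is measured on held-out folds, so it gives a less optimistic estimate than the training accuracy.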

## Classification model red wine:

In [None]:
red_dataset_num_quality_0 = pd.cut(
    x=red_dataset["quality"],
    bins=[0, 5, 10],
    labels=[0,1])
red_dataset_onehot_0 = red_dataset_prep.drop(columns="quality")
red_dataset_onehot_0 = pd.concat((red_dataset_onehot_0, red_dataset_num_quality_0), axis=1)

red_dataset_num_quality_1 = pd.cut(
    x=red_dataset["quality"],
    bins=[0, 6, 10],
    labels=[0,1])

red_dataset_onehot_1 = red_dataset_prep.drop(columns="quality")
red_dataset_onehot_1 = pd.concat((red_dataset_onehot_1, red_dataset_num_quality_1), axis=1)

red_dataset_num_quality_2 = pd.cut(
    x=red_dataset["quality"],
    bins=[0, 7, 10],
    labels=[0,1])

red_dataset_onehot_2 = red_dataset_prep.drop(columns="quality")
red_dataset_onehot_2 = pd.concat((red_dataset_onehot_2, red_dataset_num_quality_2), axis=1)




## Splitting Red wine data:

In [None]:
labels_0 = red_dataset_onehot_0['quality'].values
data_0 = red_dataset_onehot_0.drop(['quality'], axis = 1).values

data_train_0, data_test_0, labels_train_0, labels_test_0 = train_test_split(data_0,
                                                                            labels_0,
                                                                            test_size = 0.33,
                                                                            random_state = 10)

labels_1 = red_dataset_onehot_1['quality'].values
data_1 = red_dataset_onehot_1.drop(['quality'], axis = 1).values

data_train_1, data_test_1, labels_train_1, labels_test_1 = train_test_split(data_1,
                                                                            labels_1,
                                                                            test_size = 0.33,
                                                                            random_state = 10)

labels_2 = red_dataset_onehot_2['quality'].values
data_2 = red_dataset_onehot_2.drop(['quality'], axis = 1).values

data_train_2, data_test_2, labels_train_2, labels_test_2 = train_test_split(data_2,
                                                                            labels_2,
                                                                            test_size = 0.33,
                                                                            random_state = 10)

## Addressing the class imbalance in the datasets:

In [None]:
data_train_0, labels_train_0 = SMOTE(random_state=10).fit_resample(data_train_0, labels_train_0)
data_train_1, labels_train_1 = SMOTE(random_state=10).fit_resample(data_train_1, labels_train_1)
data_train_2, labels_train_2 = SMOTE(random_state=10).fit_resample(data_train_2, labels_train_2)


## scaling the data:

In [None]:
data_train_scaled_0 = scaler.fit_transform(data_train_0)
data_test_scaled_0 = scaler.transform(data_test_0)

data_train_scaled_1 = scaler.fit_transform(data_train_1)
data_test_scaled_1 = scaler.transform(data_test_1)

data_train_scaled_2 = scaler.fit_transform(data_train_2)
data_test_scaled_2 = scaler.transform(data_test_2)

## Random forest classifier:

In [None]:
random_forest_0 = RandomForestClassifier(n_estimators=50, max_depth=50, random_state=10)
random_forest_0.fit(data_train_scaled_0, labels_train_0)

random_forest_predictions_0 = random_forest_0.predict(data_train_scaled_0)
train_acc_0 = accuracy_score(labels_train_0, random_forest_predictions_0)

random_forest_predictions_0 = random_forest_0.predict(data_test_scaled_0)
test_acc_0 = accuracy_score(labels_test_0, random_forest_predictions_0)

print(f"Train_0 acc: {train_acc_0 * 100}%")
print(f"Test_0 acc: {test_acc_0 * 100}%")

random_forest_1 = RandomForestClassifier(n_estimators=50, max_depth=50, random_state=10)
random_forest_1.fit(data_train_scaled_1, labels_train_1)

random_forest_predictions_1 = random_forest_1.predict(data_train_scaled_1)
train_acc_1 = accuracy_score(labels_train_1, random_forest_predictions_1)

random_forest_predictions_1 = random_forest_1.predict(data_test_scaled_1)
test_acc_1 = accuracy_score(labels_test_1, random_forest_predictions_1)

print(f"Train_1 acc: {train_acc_1 * 100}%")
print(f"Test_1 acc: {test_acc_1 * 100}%")

random_forest_2 = RandomForestClassifier(n_estimators=50, max_depth=50, random_state=10)
random_forest_2.fit(data_train_scaled_2, labels_train_2)

random_forest_predictions_2 = random_forest_2.predict(data_train_scaled_2)
train_acc_2 = accuracy_score(labels_train_2, random_forest_predictions_2)

random_forest_predictions_2 = random_forest_2.predict(data_test_scaled_2)
test_acc_2 = accuracy_score(labels_test_2, random_forest_predictions_2)

print(f"Train_2 acc: {train_acc_2 * 100}%")
print(f"Test_2 acc: {test_acc_2 * 100}%")


## Hyperparameter tuning:

In [None]:
random_state = 10
number_of_folds = 5

parameters_to_tune = [{'n_estimators': [10, 50],
                       'max_depth': [5, 50]}]

search = GridSearchCV(RandomForestClassifier(random_state = random_state), parameters_to_tune, cv = number_of_folds)
search.fit(data_train_scaled_0, labels_train_0)

print(f"Best parameters set found: {search.best_params_}")

random_forest = RandomForestClassifier(n_estimators=search.best_params_['n_estimators'],
                                       max_depth=search.best_params_['max_depth'],
                                       random_state=10)
random_forest.fit(data_train_scaled_0, labels_train_0)

random_forest_predictions = random_forest.predict(data_train_scaled_0)
train_acc = accuracy_score(labels_train_0, random_forest_predictions)

random_forest_predictions = random_forest.predict(data_test_scaled_0)
test_acc = accuracy_score(labels_test_0, random_forest_predictions)

print(f"Train acc: {train_acc * 100}%")
print(f"Test acc: {test_acc * 100}%")


# Regression problem white wine:

## Splitting the data:

In [None]:
data = white_dataset_prep.drop(['quality'], axis = 1).values
labels = white_dataset_prep['quality'].values

data_train, data_test, labels_train, labels_test = train_test_split(data,
                                                                    labels,
                                                                    test_size = 0.33,
                                                                    random_state = 10)

## Balancing the data:

In [None]:
#data_train, labels_train = SMOTE(random_state=10).fit_resample(data_train, labels_train)

## Scaling the data:

In [None]:
data_train_scaled = scaler.fit_transform(data_train)
data_test_scaled = scaler.transform(data_test)

In [None]:
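from sklearn.svm import SVC

# Hyperparameters for a polynomial-kernel SVM.
# Note: SVC is a classifier, so each quality rating is treated as its own class.
C = 100
kernel = 'poly'
random_state = 10
degree = 10
coef0 = 9

# C (the regularisation strength) is passed explicitly alongside the kernel settings
non_linear_svm = SVC(C=C, kernel=kernel, random_state=random_state, degree=degree, coef0=coef0)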

non_linear_svm.fit(data_train_scaled, labels_train)

non_linear_svm_predictions = non_linear_svm.predict(data_test_scaled)

test_acc = accuracy_score(labels_test, non_linear_svm_predictions)

non_linear_svm_predictions = non_linear_svm.predict(data_train_scaled)
train_acc = accuracy_score(labels_train, non_linear_svm_predictions)

print(f"Train acc: {train_acc * 100}%")
print(f"Test acc: {test_acc * 100}%")