
# CSE 40 Take-Home Final: ajbhatia

Your unique dataset consists of physicochemical properties of a selection of Portuguese Vinho Verde wines.

Some wines are red, some are white. A boolean label for high-quality white wines has been provided.

You are free to use any library code provided within the `cse40` conda environment.

In [None]:
import re

import numpy as np
import pandas as pd

from matplotlib import pyplot as plt

from sklearn.model_selection import KFold, cross_validate, cross_val_score
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.svm import SVC

%matplotlib widget

# Import stats to be used for outlier detection
from scipy import stats 

df = pd.read_csv('data.csv')
df.info()

In [None]:
# DATA CLEANING AND MANIPULATION

# extract the first number from a string and return it as a float
def extract_first_number(source_str):
    '''
    extract the first numerical value appearing in a string
    as a float and return it, allowing negative numbers and 
    decimal values
    '''
    
    # `source_str` does not contain interpretable data
    if pd.isna(source_str) or source_str.startswith('NA'):
        
        # use `np.nan` to represent missing data
        return np.nan

    # find the first valid numerical substring in the source string
    # see https://regex101.com/r/gwP6Qy/2 for an explanation
    matches = re.search(r'-?\d+\.?\d*', source_str.replace(',', ''))
    
    # if there are any valid numerical substrings
    if matches:
        return float(matches[0])                                  # convert the first one to a float and return the value
    else:
        return np.nan                                             # otherwise, treat as missing data
    
# change True/False values into 1s and 0s
def one_hot(source_bool):
    if source_bool == True:
        return 1                                                  # if value is true, replace it with 1
    else:
        return 0                                                  # if value is false, replace it with 0
    
# call extract_first_number on every column not listed in exclude_columns
def replace_str(df, exclude_columns=()):                          # tuple default avoids a shared mutable default argument
    
    for c in df.columns:                                          # loop through the columns and extract the first number from each entry
        if c not in exclude_columns:          
            df[c] = (df[c].apply(extract_first_number))
        
    return df                                                     # return the df with numeric columns

# count the number of true values and false values and print them
def true_false(df):
    true = 0
    false = 0
    for i in df['quality white']:
        if i == True:
            true += 1                                            # if a true value appears, increment true variable
        else:
            false += 1                                           # if a false value appears, increment false variable
    print("The number of trues in quality white:", true)
    print("The number of falses in quality white:", false)
    
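df = replace_str(df, ['quality white','pH'])                      # strip units and stray text, leaving numeric values

df = df.dropna()                                                  # drop rows with missing values, including entries that failed to parse

df['quality white'] = (df['quality white'].apply(one_hot))        # encode the boolean label as 1/0

# detect and remove outliers: keep rows whose feature values all lie within
# 3 standard deviations of the column mean (the label is left out of the check
# so that a rare class is not discarded as an "outlier")
feature_z = np.abs(stats.zscore(df.drop(columns=['quality white'])))
df = df[(feature_z < 3).all(axis=1)]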

true_false(df)

In [None]:
# VISUALIZATION WITH CORRELATION PLOTS

# adds new feature value plot at a specific spot in the vertical stack
def plot_bestfit(column, n): 
    x = (df[column]).to_numpy()
    y = (df["quality white"]).to_numpy()
    m, b = np.polyfit(x, y, 1)
    axs[n].plot(x, m*x + b)
    axs[n].set_title(column)

# plot correlation matrix for feature values with "quality white"
f = plt.figure(figsize=(12, 6))
plt.matshow(df.corr(), fignum=f.number)
plt.xticks(range(df.select_dtypes(['number']).shape[1]), df.select_dtypes(['number']).columns, fontsize=14, rotation=45)
plt.yticks(range(df.select_dtypes(['number']).shape[1]), df.select_dtypes(['number']).columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title('Correlation Matrix', fontsize=16);
    
# plot vertically stacked correlation plots
fig, axs = plt.subplots(5)
fig.suptitle("Feature Value Correlation Plots")
fig.set_size_inches(10, 10)

plot_bestfit("volatile acidity", 0)
plot_bestfit("alcohol", 1)
plot_bestfit("citric acid", 2)
plot_bestfit("free sulfur dioxide", 3)
plot_bestfit("pH", 4)

plt.subplots_adjust(left=0.1,
                    bottom=0.1, 
                    right=0.9, 
                    top=0.9, 
                    wspace=0, 
                    hspace=1.5)

plt.show()


In [None]:
# drop columns that are least correlated with column "quality white"
df = df.drop(columns=["volatile acidity", "citric acid", "pH"])

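# standardize every column to zero mean and unit variance
# (this also rescales the label, so later code recovers the True class with `> 0`)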
scaler = StandardScaler()
scaled_array = scaler.fit_transform(df)
df = pd.DataFrame(scaled_array, columns=df.columns, index=df.index)

In [None]:
# TRAIN, PLOT, AND VALIDATE MODELS

# find k-fold validate scores
def K_Fold_Validate(models, kf, X, y):
    
    # initialize dictionary to hold scores
    scores = {
        name: [] for name in models
    }
    
    for train_index, test_index in kf.split(X):
    
        # split training test sets by index
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        # compute the cross-validation score for each model
        # using the same splits
        for name, model in models.items():

            model.fit(X_train, y_train.to_numpy().ravel())
        
            scores[name].append(
                f1_score(y_test, model.predict(X_test), zero_division=0)
            )

    return {k: np.array(v) for k, v in scores.items()}

# train models with model.fit function
def train(model, df):
    model.fit(df[["alcohol", "free sulfur dioxide"]], (df["quality white"] > 0).values.ravel())
    return None

# find K Fold Validate scores
linear_model = SGDClassifier(
    loss='log',          # logistic regression; scikit-learn 1.1+ names this loss 'log_loss'
    random_state=0
)
tree_model = DecisionTreeClassifier(
    criterion='entropy',
    max_depth=5,
    random_state=0
)
knn_model = KNeighborsClassifier(
    n_neighbors=8
)

# define models
models = {
    'linear': linear_model,
    'tree': tree_model, 
    'KNN': knn_model
}

# fit on dummy data first so the plotting code below never sees an unfitted model
tree_model.fit(np.array([[1,1]]), np.array([1]))
knn_model.fit(np.array([[1,1]]*10), np.array([1]*10))

# call your function
train(tree_model, df)
train(knn_model, df)

# plot decision boundaries
features = ['alcohol', 'free sulfur dioxide']
label = ['quality white']

# data range
x_min, x_max = df[features[0]].min() - 1, df[features[0]].max() + 1
y_min, y_max = df[features[1]].min() - 1, df[features[1]].max() + 1

# meshgrid
res = (x_max - x_min) / 100
xx, yy = np.meshgrid(np.arange(x_min, x_max, res), np.arange(y_min, y_max, res))

Z_tree = tree_model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
Z_knn = knn_model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(10, 4))
ax1.scatter(df[features[0]], df[features[1]], c=np.array(df[label] > 0), alpha=0.5)
ax1.contourf(xx, yy, Z_tree, alpha=0.2)
ax1.set_xlabel(f'Relative {features[0]}')
ax1.set_ylabel(f'Relative {features[1]}')
ax1.set_title('Decision Tree')

ax2.scatter(df[features[0]], df[features[1]], c=np.array(df[label] > 0), alpha=0.5)
ax2.contourf(xx, yy, Z_knn, alpha=0.2)
ax2.set_xlabel(f'Relative {features[0]}')
ax2.set_title('K Nearest Neighbors')

plt.show()

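# define X and y values; after standardization the True class has a positive label value,
# matching the `> 0` test used in train() and in the scatter plots
X, y = df[features], df[label] > 0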

k = 5
kf = KFold(k)

scores = K_Fold_Validate(models, kf, X, y)

# print out k-fold validation scores for each model
print('K Fold Validation:')
for name, fold_scores in scores.items():
    print(name, fold_scores)

In [None]:
# FIND AND PLOT P-VALUES FOR THE DIFFERENT MODELS USED

# return the two-tailed p-value of a paired t-test on the per-fold scores
def t_stat(model1_scores, model2_scores):
    deltaI = np.subtract(model1_scores, model2_scores)
    deltaBar = np.mean(deltaI)
    # population standard deviation of the differences; dividing it by
    # sqrt(n - 1) below gives the usual standard error of the mean difference
    sDelta = np.sqrt(np.mean((deltaI-deltaBar)*(deltaI-deltaBar)))
    denominator = (sDelta)/(np.sqrt(len(model1_scores)-1))
    t = (deltaBar)/denominator
    
    # compute the pvalue from a two-tailed t-statistic `t`
    pval = stats.t.sf(np.abs(t), len(model1_scores)-1) * 2
    
    return pval

# use about sqrt(n) folds so the t-test has more paired f1 scores to compare
k = int(np.sqrt(len(X)))
kf = KFold(k)

# recompute the scores with the new folds
scores = K_Fold_Validate(models, kf, X, y)

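# print the p-value for a pair of models; a small value is evidence
# against the null hypothesis that both models have the same mean f1 score
def report(a, b):
    pval = t_stat(scores[a], scores[b])
    print(f'p-value for the difference in mean f1 score between {a} and {b}:', pval)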

# call report with each pair of model
report('linear', 'tree')
report('linear', 'KNN')
report('KNN', 'tree')
    
# plot f1 values
fig = plt.figure()
fig.suptitle('f1 Score Comparison using K-Fold Cross-Validation')
ax = fig.add_subplot(111)
for i, (k, v) in enumerate(scores.items()):
    ax.scatter(i * np.ones(len(v)) + 1, v, label=k, alpha=0.5)
for i in range(len(scores['linear'])):
    ax.plot([1, 2, 3], [scores[k][i] for k in scores.keys()], c='k', alpha=0.2)
ax.boxplot(list(scores.values()))
plt.legend()
plt.show()

## <span style="color: blue;">Analysis</span>


The main relationships I noticed were the correlations between the label, quality white, and two features: alcohol and free sulfur dioxide. This was shown through the models I used and their results on the test data. To make the most of these relationships, I trained the models on only alcohol and free sulfur dioxide and left out all other feature variables. I believe that removing the other features improved my models' accuracy.
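One way to make this kind of feature choice repeatable is to rank the features by the absolute value of their correlation with the label. The sketch below is only an illustration: the data is synthetic (made-up values, not the wine data), and `rank_by_correlation`, `demo`, and the column names are hypothetical.

```python
import numpy as np
import pandas as pd


def rank_by_correlation(frame, label):
    """Return the features sorted by |Pearson correlation| with `label`."""
    corr = frame.corr()[label].drop(label)
    return corr.abs().sort_values(ascending=False)


# synthetic stand-in data; NOT the wine dataset
rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)
demo = pd.DataFrame({
    "strong": signal + 0.1 * rng.normal(size=n),  # closely tracks the label
    "weak": rng.normal(size=n),                   # unrelated noise
    "label": (signal > 0).astype(int),
})

ranking = rank_by_correlation(demo, "label")
print(ranking)
```

Features near the bottom of such a ranking are candidates to drop, although a low linear correlation does not rule out a nonlinear relationship with the label.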

I applied several concepts from this class to complete this project. First, I used the eight steps of data cleaning: importing data, merging sets, rebuilding missing data, standardization, normalization, deduplication, verification, and exporting the data. I focused on the steps most relevant to this data set, mainly removing rows with missing data, standardization, and normalization. Second, when training my models, I kept overfitting in mind so that my models would not become overly complex. This motivated me to look for a middle ground between a model complex enough to overfit the training data and one too simple to represent the data accurately. Lastly, I applied what we learned about p-values to compare my models. A p-value tells us whether to reject or fail to reject a null hypothesis, so I used a paired t-test on the per-fold F1 scores to check whether the differences between my models could be explained by random chance. None of my p-values were below 0.05, so there was not enough evidence to reject any of the null hypotheses. In other words, the differences in F1 score between the linear, decision tree, and KNN models were not statistically significant.
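The paired t-test can be cross-checked against `scipy.stats.ttest_rel`, which performs the same test. The sketch below uses made-up per-fold scores for two hypothetical models (not results from this project), and `paired_t_pvalue` is an illustrative helper name.

```python
import numpy as np
from scipy import stats


def paired_t_pvalue(a, b):
    """Two-tailed p-value for a paired t-test on per-fold scores."""
    delta = np.asarray(a) - np.asarray(b)
    n = len(delta)
    # sample standard deviation (ddof=1) over sqrt(n) is the standard error
    t = delta.mean() / (delta.std(ddof=1) / np.sqrt(n))
    return stats.t.sf(np.abs(t), n - 1) * 2


# made-up per-fold f1 scores for two hypothetical models
scores_a = np.array([0.71, 0.68, 0.74, 0.70, 0.69])
scores_b = np.array([0.66, 0.69, 0.70, 0.65, 0.67])

manual = paired_t_pvalue(scores_a, scores_b)
library = stats.ttest_rel(scores_a, scores_b).pvalue
print(manual, library)
```

The two values should agree up to floating-point error. If the p-value is at or above 0.05, we fail to reject the null hypothesis that both models have the same mean score. That is not proof that the models are equal.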

Changing the max_depth variable produced an interesting problem with how the model displayed the data: with a higher max_depth, the model would sometimes display nothing. I think this occurred because a high max_depth makes the model increasingly complex, which can cause overfitting and make the test accuracy very low. To avoid overfitting and ensure that the model ran correctly, I chose a smaller max_depth so that the model was simpler and properly represented the data.
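The effect of max_depth on overfitting can be checked by comparing training and test accuracy at several depths. This is a minimal sketch on synthetic data from `make_classification` (an assumption for illustration, not the wine data); `depth_results` and the other names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic two-feature data with 20% label noise; NOT the wine dataset
X_demo, y_demo = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0,
    flip_y=0.2, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

depth_results = {}
for depth in (1, 3, 5, 10, None):  # None lets the tree grow without a limit
    tree = DecisionTreeClassifier(
        criterion="entropy", max_depth=depth, random_state=0
    )
    tree.fit(X_tr, y_tr)
    # store (training accuracy, test accuracy) for this depth
    depth_results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(depth, depth_results[depth])
```

A widening gap between training and test accuracy as the depth grows is the usual sign of overfitting.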

Like max_depth, the n_neighbors value controls how complex the model is, but in the opposite direction: a small n_neighbors gives a jagged, complex decision boundary that can overfit, while a large n_neighbors smooths the boundary and can underfit. When changing the value, I found that a lower value sometimes resulted in lower mean accuracy. To balance this, I chose a middle ground between a complicated model and a simple one, so as to have the largest possible mean accuracy while still avoiding overfitting.
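A simple way to look for that middle ground is to sweep n_neighbors and compare the mean cross-validated accuracy. The sketch below again uses synthetic data from `make_classification` as a stand-in (not the wine data), and `knn_results` and `best_n` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# synthetic two-feature data with 20% label noise; NOT the wine dataset
X_demo, y_demo = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0,
    flip_y=0.2, random_state=0,
)

knn_results = {}
for n_neighbors in (1, 3, 8, 15, 30):
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    # mean accuracy over 5 cross-validation folds
    knn_results[n_neighbors] = cross_val_score(knn, X_demo, y_demo, cv=5).mean()

# value of n_neighbors with the highest mean accuracy in this sweep
best_n = max(knn_results, key=knn_results.get)
print(knn_results, best_n)
```

Which value wins depends on the data, so the same sweep would need to be rerun on the real features before drawing any conclusion.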

If I had more time to work on this data set, I would try to build a model that predicts something other than a categorical label. I have rarely seen models without categorical labels, so I think it would be interesting to predict a value without much variation, like the pH, from the citric acid and alcohol. Additionally, I would like to explore using more than two feature values to classify the label. With three or more feature values, I could compare the F1 scores of the new models to the ones from this project and see whether they are better or worse. I could also see how the extra features affect the complexity of the models and whether they make the models more likely to overfit the training data. There are many different things I could do with this data set if I had more time, but the relationships I found between my feature values and the label were incredible to see, especially because of how well my models identified the proper classification.
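Predicting a continuous value such as pH would be a regression task rather than a classification task. A minimal sketch of that idea, assuming a linear model and three features, is shown below. The data is randomly generated for illustration (it is not the wine data), and all names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n = 300

# synthetic stand-ins for three standardized features
features_demo = rng.normal(size=(n, 3))
# synthetic continuous target: linear in the features plus a little noise
weights = np.array([0.10, -0.05, 0.02])
target_demo = 3.2 + features_demo @ weights + 0.01 * rng.normal(size=n)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)
# R^2 replaces the f1 score, which only applies to classification
r2_scores = cross_val_score(
    LinearRegression(), features_demo, target_demo, cv=kf_demo, scoring="r2"
)
print(r2_scores.mean())
```

The per-fold scores from a setup like this could be compared across models with the same paired t-test used for the classifiers.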