# <font color="red">What's Making Red Wine "Good"?</font>

This notebook classifies the quality of red wine from its physicochemical features. It is treated as a classification problem, and several classification models are compared to find the best accuracy score.

In [None]:
# insert about the data here

In [None]:
#Python and Visualization Imports
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import graphviz  

#Wrangling/ Exploration
import explore
import wrangle 
from wrangle import get_wine_data, split_wine_data 

#Math
from scipy import stats

#sklearn
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit

#Warnings





## <font color ="darkgreen">Executive Summary</font>
* __The Problem__
    - What is driving the quality in red wine?

* __The Goal__
    - Identify the drivers for quality rankings in red wine.
    - Document my process/ workflow used to accomplish the project goals.
    - Demonstrate my process and summarize my findings.

* __The Process/ Pipeline__
    1. Acquire the Data
    2. Prepare
    3. Explore
    4. Model
    5. Create Recommendations Based On Findings


## <font color ="blue">The Findings</font>
* Alcohol was the most significant feature in predicting red wine quality.



## <font color="purple">Project Planning</font>
* The Trello board I used to map out my project planning can be found <a href="https://trello.com/b/NJcVVZvd/individual-project-board">[here]</a>.

   * `Data Acquisition`: Data is collected from the UCI database, using a function that reads the red wine data from a file path into a dataframe
   * `Data Prep`: Column data types are appropriate for the data they contain
   * `Data Prep`: Missing values are investigated and handled
   * `Exploration`: The interaction between independent variables and the target variable is explored using visualization and statistical testing
   * `Modeling`: Different classification models are created and their performance is compared. 
   
## <font color="red">Hypotheses:</font>
  * Is there a correlation between alcohol and red wine quality ranking?
  * Is there a correlation between sulphates and red wine quality ranking?
  * Is there a correlation between citric acid and red wine quality ranking?

#### <a href="https://github.com/david-and-brandon-the-sa-se-bros/zillow-clustering-project">[Data Dictionary can be found here]</a>



## <font color="green">Acquisition</font>
Data was obtained from the UCI database, which can be found <a href="https://archive.ics.uci.edu/ml/datasets/wine+quality">here</a>, and acquired using this function housed in my `wrangle.py` file:

>def get_wine_data():
 
   >   df = pd.read_csv('winequality-red.csv')
   
   >  return df


## <font color="blue">Preparation</font>
* This data set contained very few nulls or missing values; the few that remained were identified and handled using functions housed in my `wrangle.py` file.

In [None]:
#Acquire the data
df = wrangle.get_wine_data()


In [None]:
df.head()

In [None]:
df.info()

In [None]:
print(f'My original dataframe is coming in with {df.shape[0]} rows and {df.shape[1]} columns.')

In [None]:
df.isnull().sum()

In [None]:
#Dataframe now has zero nulls to address

# Explore 

In [None]:
#7 quality scores
#Insert percentages here 
df.quality.value_counts()
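
The percentages mentioned in the comment above could be produced with `value_counts(normalize=True)`. This is a minimal sketch on a made-up `quality` series (the values are illustrative, not the wine data); in the notebook the same chain would be applied to `df.quality`.

```python
import pandas as pd

# Hypothetical stand-in for df.quality; values are invented for illustration.
quality = pd.Series([5, 5, 5, 6, 6, 7, 5, 6, 4, 8], name="quality")

# normalize=True returns proportions instead of raw counts;
# sort_index orders the result by quality score rather than by frequency.
quality_pct = (
    quality.value_counts(normalize=True)
    .sort_index()
    .mul(100)
    .round(1)
)
print(quality_pct)
```

Sorting by index keeps the scores in ascending order, which lines up with the x-axis of the count plot.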

In [None]:
sns.countplot(x='quality',data=df)

In [None]:
#5 is the most populated quality score

In [None]:
plt.figure()
df.hist(rwidth=0.9)
plt.tight_layout()

In [None]:
df.describe()

In [None]:
#Univariate Takeaways

In [None]:
#Comparing each feature with quality feature

In [None]:
sns.set_style('whitegrid')
sns.lineplot(data=df, x="quality", y="fixed acidity")

In [None]:
sns.set_style('whitegrid')
sns.lineplot(x = 'quality', y = 'volatile acidity', data = df)

In [None]:
sns.set_style('whitegrid')
sns.lineplot(x = 'quality', y = 'citric acid', data = df)

In [None]:
sns.set_style('whitegrid')
sns.lineplot(x = 'quality', y = 'residual sugar', data = df)

In [None]:
sns.set_style('whitegrid')
sns.lineplot(x = 'quality', y = 'chlorides', data = df)

In [None]:
sns.set_style('whitegrid')
sns.lineplot(x = 'quality', y = 'free sulfur dioxide', data = df)

In [None]:
sns.set_style('whitegrid')
sns.lineplot(x = 'quality', y = 'total sulfur dioxide', data = df)

In [None]:
sns.set_style('whitegrid')
sns.lineplot(x = 'quality', y = 'density', data = df)

In [None]:
sns.set_style('whitegrid')
sns.lineplot(x = 'quality', y = 'pH', data = df)

In [None]:
sns.set_style('whitegrid')
sns.lineplot(x = 'quality', y = 'sulphates', data = df)

In [None]:
sns.set_style('whitegrid')
sns.lineplot(x = 'quality', y = 'alcohol', data = df)

In [None]:
# SUMMARY

#### As the quality score increases, chlorides and volatile acidity decrease.

#### As the quality score increases, alcohol, sulphates, and citric acid increase.
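
The trends read off the line plots can also be checked numerically. The sketch below uses a small invented table (`toy`, with made-up values) to show the idea: group by quality for the per-score means, then use a rank correlation to summarize the direction of each relationship. In the notebook, `toy` would be replaced by `df`.

```python
import pandas as pd

# Toy data, invented for illustration only; swap in the notebook's df.
toy = pd.DataFrame({
    "quality": [3, 3, 5, 5, 7, 7],
    "alcohol": [9.0, 9.4, 9.8, 10.2, 11.4, 11.8],
    "volatile acidity": [0.90, 0.80, 0.60, 0.55, 0.40, 0.35],
})

# Mean of each feature per quality score: the same quantity the line plots trace.
means_by_quality = toy.groupby("quality").mean()

# Spearman (rank) correlation with quality: positive means the feature
# tends to rise with quality, negative means it tends to fall.
direction = toy.corr(method="spearman")["quality"].drop("quality")

print(means_by_quality)
print(direction)
```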

In [None]:
from wrangle import split_wine_data

In [None]:
train, validate, test = split_wine_data(df, stratify_by='quality')
train.shape, validate.shape, test.shape
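
The body of `split_wine_data` lives in `wrangle.py` and is not shown here. As a rough idea of what a stratified three-way split can look like, here is a sketch with a hypothetical helper, `split_three_ways`; the 20% test and 30% validate proportions and the seed are assumptions, not the values used in this project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_three_ways(df, stratify_by, seed=123):
    """Hypothetical stand-in for wrangle.split_wine_data (proportions assumed)."""
    # First carve off the test set, keeping class proportions intact.
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[stratify_by]
    )
    # Then split the remainder into train and validate, again stratified.
    train, validate = train_test_split(
        train_validate,
        test_size=0.3,
        random_state=seed,
        stratify=train_validate[stratify_by],
    )
    return train, validate, test


# Invented data: 100 rows with three quality classes.
toy = pd.DataFrame({
    "alcohol": range(100),
    "quality": [5] * 50 + [6] * 40 + [7] * 10,
})
toy_train, toy_validate, toy_test = split_three_ways(toy, stratify_by="quality")
print(toy_train.shape, toy_validate.shape, toy_test.shape)
```

Stratifying on the target keeps rare quality scores represented in every split, which matters when some scores have only a handful of rows.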

In [None]:
train.head()

In [None]:
#correlation heatmap view
train.corr() 
f, ax = plt.subplots(figsize = (10,10))
sns.heatmap(train.corr(), annot = True, linewidths=.5, fmt = ".2f", ax=ax)
plt.show()

In [None]:
sns.barplot(x = train['quality'], y = train['alcohol'],palette='magma')

In [None]:
pd.DataFrame(train.groupby('quality')['alcohol'].value_counts())

In [None]:
sns.pairplot(df, hue="quality", palette="rocket")

## Bivariate Takeaways

- Quality has a positive correlation with alcohol.
- Quality has a weak negative correlation with volatile acidity.
- Quality has almost no relationship with residual sugar, free sulfur dioxide, and pH. These columns should be dropped.
- Alcohol has a weak correlation with pH.
- Volatile acidity has a strong negative correlation with citric acid.
- Density has a positive correlation with fixed acidity.
- Citric acid has a positive correlation with fixed acidity.
- Citric acid has a negative relationship with volatile acidity and pH.
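
The hypotheses ask whether a feature is correlated with quality, and a heatmap alone does not say whether a correlation is statistically significant. One way to test it is a Spearman rank correlation, which suits an ordinal target such as a quality score. The sketch below runs on synthetic data (the arrays and the 0.05 significance level are assumptions); in the notebook the inputs would be `train['alcohol']` and `train['quality']`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic stand-ins for train['quality'] and train['alcohol'];
# the relationship is built in on purpose so the test has something to find.
quality = rng.integers(3, 9, size=200)
alcohol = 9 + 0.4 * quality + rng.normal(0, 0.5, size=200)

alpha = 0.05  # assumed significance level

# H0: there is no monotonic relationship between alcohol and quality.
rho, p_value = stats.spearmanr(alcohol, quality)
reject_null = p_value < alpha

print(f"rho = {rho:.2f}, p = {p_value:.3g}, reject H0: {reject_null}")
```

The same call can be repeated for sulphates and citric acid to cover the other two hypotheses.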



In [None]:

train.info()

In [None]:
#Categorical and quantitative variable lists
cat_vars =["type","quality"]
quant_vars =["fixed acidity","volatile acidity","citric acid","residual sugar","chlorides","free sulfur dioxide","total sulfur dioxide","density","pH","sulphates","alcohol"]

In [None]:
corr = train.corr()


In [None]:
corr = pd.DataFrame(corr)
corr_wine_quality = pd.DataFrame(corr.quality)
corr_wine_quality


In [None]:
#Here I set a quality threshold to visualize the distribution of "good" quality wines and "bad" quality wines
#If the wine was ranked above a six, I placed it in the good category (1); otherwise it is bad (0)
df['quality'] = (df['quality'] > 6).astype(int)
sns.countplot(x=df['quality'])

In [None]:
#sort_index guarantees the counts are ordered 0 (Bad), then 1 (Good)
x = df.quality.value_counts().sort_index()
sns.barplot(x=['Bad','Good'], y=x.values)
plt.show()

In [None]:
#drop columns I don't need
df=df.drop(['residual sugar','free sulfur dioxide','pH'],axis=1)

# <font color ="brown">Modeling</font>

In [None]:
#Find median or mode
train.quality.value_counts()

In [None]:
# Establish new column that contains the mode (most frequent quality score)
train["most_frequent"] = train.quality.mode()[0]

# Calculate the baseline accuracy
baseline_accuracy = (train.quality == train.most_frequent).mean()
print(f'My baseline prediction is quality = {train.most_frequent.iloc[0]}')
print(f'My baseline accuracy is: {baseline_accuracy:.2%}')

## I learned a variety of new modeling methods that I implemented in this project. The comments explain what each new function is for.

In [None]:
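#This function standardizes each feature to zero mean and unit variance (z-scores), so all features share a comparable scale
def normalization(X):
    mean = X.mean(axis=0)          #column-wise mean
    std = X.std(axis=0, ddof=0)    #column-wise population std, matching np.std
    X_t = (X - mean)/std
    return X_t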

#Train and Test splitting of data     
def train_test(X_t, y):
    x_train, x_test, y_train, y_test = train_test_split(X_t, y, test_size = 0.3, random_state = 42)
    print("Train:",len(x_train), " - Test:", len(x_test))
    return x_train, x_test, y_train, y_test

#This function finds the optimal hyperparameters of a model which results in the most 'accurate' predictions.
def grid_search(name_clf, clf, x_train, x_test, y_train, y_test):
    if name_clf == 'Logistic_Regression':
        # Logistic Regression (liblinear supports both the l1 and l2 penalties searched below)
        log_reg_params = {"penalty": ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
        grid_log_reg = GridSearchCV(LogisticRegression(solver='liblinear'), log_reg_params)
        grid_log_reg.fit(x_train, y_train)
        # We automatically get the logistic regression with the best parameters.
        log_reg = grid_log_reg.best_estimator_
        print("Best Parameters for Logistic Regression: ", grid_log_reg.best_estimator_)
        print("Best Score for Logistic Regression: ", grid_log_reg.best_score_)
        print("------------------------------------------")
        return log_reg

    
    elif name_clf == 'Decision_Tree':
        # DecisionTree Classifier
        tree_params = {"criterion": ["gini", "entropy"], "max_depth": list(range(2,30,1)), 
                  "min_samples_leaf": list(range(5,20,1))}
        grid_tree = GridSearchCV(DecisionTreeClassifier(), tree_params)
        grid_tree.fit(x_train, y_train)
        # tree best estimator
        tree_clf = grid_tree.best_estimator_
        print("Best Parameters for Decision Tree: ", grid_tree.best_estimator_)
        print("Best Score for Decision Tree: ", grid_tree.best_score_)
        print("------------------------------------------")
        
        #Feature importance for decision tree
        importance = tree_clf.feature_importances_
        plt.figure(figsize=(10,10))
        plt.title("Feature Importances of Decision Tree")
        plt.barh(x_train.columns, importance, align="center")
        
        return tree_clf
    
    elif name_clf == 'Random_Forest':
        forest_params = {"bootstrap":[True, False], "max_depth": list(range(2,10,1)), 
                  "min_samples_leaf": list(range(5,20,1))}
        grid_forest = GridSearchCV(RandomForestClassifier(), forest_params)
        grid_forest.fit(x_train, y_train)
        # forest best estimator
        forest_clf = grid_forest.best_estimator_
        print("Best Parameters for Random Forest: ", grid_forest.best_estimator_)
        print("Best Score for Random Forest: ", grid_forest.best_score_)
        print("------------------------------------------")
        
        #Feature importance for random forest
        importance = forest_clf.feature_importances_
        plt.figure(figsize=(10,10))
        plt.title("Feature Importances of Random Forest")
        plt.barh(x_train.columns, importance, align="center")
        
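        return forest_clf

    else:
        #Fail loudly instead of silently returning None for an unsupported name
        raise ValueError(f"Unknown classifier name: {name_clf}")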
    
def plot_learning_curve(estimator,title, X, y, ylim=None, cv=None, n_jobs=None,
                        train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, 
                                                            n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

#Apply a classifier: tune it, plot its learning curve, cross-validate it, and score it on the test split
def apply_classification(name_clf, clf, x_train, x_test, y_train, y_test):
    #Find the best parameters and get the classifier with the best parameters as the return value of grid search
    grid_clf = grid_search(name_clf, clf, x_train, x_test, y_train, y_test)
    
    #Plotting the learning curve
    # score curves, each time with 30% data randomly selected as a validation set.
    cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
    plot_learning_curve(grid_clf, name_clf, x_train, y_train, 
                    ylim=(0.1, 1.01), cv=cv, n_jobs=4)
    
    #Apply 10-fold cross validation to estimate the skill of the model using the best parameters
    scores = cross_val_score(grid_clf, x_train, y_train, cv=10)
    print(f"Mean Accuracy of Cross Validation: {scores.mean()*100:.2f}%")
    print(f"Std of Accuracy of Cross Validation: {scores.std()*100:.2f}%")
    print("------------------------------------------")
    
    #Predict the test data with the selected classifier
    clf_prediction = grid_clf.predict(x_test)
    clf1_accuracy = sum(y_test == clf_prediction)/len(y_test)
    print("Accuracy of",name_clf,":",clf1_accuracy*100)
    
    #Print the confusion matrix for the classifier with the best parameters
    clf1_conf_matrix = confusion_matrix(y_test, clf_prediction)
    print("Confusion matrix of",name_clf,":\n", clf1_conf_matrix)
    print("==========================================")
    return grid_clf

In [None]:
#Setting my X and y
X = df.drop(['quality'], axis = 1)
y = df['quality']

In [None]:
#Normalization
X_t = normalization(X)
print("X_t:", X_t.shape)

#Train and Test splitting of data 
x_train, x_test, y_train, y_test = train_test(X_t, y)
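
The cell above standardizes all rows before splitting, so the test rows contribute to the mean and standard deviation. A common alternative, sketched here on synthetic data, is to put scikit-learn's `StandardScaler` in a pipeline so the scaling statistics are learned from the training split only. The feature values and the `X_demo` / `y_demo` names are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features and a binary good/bad target, invented for illustration.
X_demo = pd.DataFrame({
    "alcohol": rng.normal(10.4, 1.0, 300),
    "sulphates": rng.normal(0.65, 0.15, 300),
})
y_demo = (X_demo["alcohol"] + rng.normal(0, 0.5, 300) > 11).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# The scaler is fit inside the pipeline, so it only ever sees X_tr.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)

test_accuracy = model.score(X_te, y_te)
print(f"Held-out accuracy: {test_accuracy:.2%}")
```

A pipeline like this can also be passed to `GridSearchCV`, in which case the scaler is refit on the training portion of each fold.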

In [None]:
# Logistic Regression
lr = LogisticRegression()
apply_classification('Logistic_Regression', lr, x_train, x_test, y_train, y_test)


In [None]:
# Decision Tree
clf = DecisionTreeClassifier(max_depth=3, random_state=123)
dt_clf = apply_classification('Decision_Tree', clf, x_train, x_test, y_train, y_test)

In [None]:
#Random Forest
rf = RandomForestClassifier(n_estimators=100)
apply_classification('Random_Forest', rf, x_train, x_test, y_train, y_test)

In [None]:
#Model Takeaways
# Random Forest best model 
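
Because far fewer wines fall in the "good" class than the "bad" class, plain accuracy can look strong even for a model that rarely predicts "good". The sketch below, on synthetic imbalanced data from `make_classification` (class weights and feature counts are made up), shows how a per-class report and balanced accuracy could sit alongside the accuracy figures when comparing models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced two-class data; proportions are invented for illustration.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.86, 0.14], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_tr, y_tr)
pred = forest.predict(X_te)

# Accuracy obtained by always predicting the majority class.
majority_accuracy = max(y_te.mean(), 1 - y_te.mean())
# Balanced accuracy averages recall over both classes.
balanced = balanced_accuracy_score(y_te, pred)

print(f"Majority-class accuracy: {majority_accuracy:.2%}")
print(f"Balanced accuracy: {balanced:.2%}")
print(classification_report(y_te, pred, target_names=["Bad", "Good"]))
```

Comparing a model's accuracy with the majority-class figure shows how much of the score comes from the class imbalance alone.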

# Conclusion
* For this individual project, I aimed to analyze which physicochemical properties are most related to higher quality wine.
* Although I ran out of time and was unable to see how my model performs on unseen data, I am confident that the Random Forest model would be sufficient in correctly predicting on test data.

I was able to utilize new methods to explore my data as well as model it. If I had more time, I would evaluate my best model's performance on unseen data and enhance my exploratory phase. This project would also benefit from access to more features, such as grapes, price, etc., that I did not have access to using this particular data set.

Based on my data exploration, I found that:
* Alcohol is the most important feature in deciding the quality of the wine. A higher alcohol percentage is associated with better wine quality.

* Sulphates are another selection criterion for good wines: wine quality increases with a higher percentage of sulphates.

* Citric acid is another selection criterion: higher values are associated with better wine.

Should be lower:

* Volatile acidity should be lower in good wine.

* Sulfur dioxide is another factor that decreases wine quality, and it also causes headaches, so wine with less sulfur dioxide should be selected.

* Chlorides have very little effect on the quality of the wine, but higher values are still associated with lower quality.

Additionally, from a marketing point of view, a customer who wants to buy a wine can decide what to buy just by looking at some physicochemical values. Of course, brand and price were not evaluated in this research, so it is not a sufficient analysis for saying "this is a good wine." However, it can give some idea to people who do not know much about wine and want to select a good one, maybe just for dinner or as a gift for friends!


