## Project Proposal - DRAFT TO BE REVIEWED & IMPROVED

## Wine Quality Analysis of Vinho Verde Red Wines 

---
## Executive Summary

This document outlines the flow and intended outcome of this project. The project will explore the quality of Vinho Verde red wines within a single region, identify the physicochemical features most relevant to wine quality, and describe how they contribute to it. The underlying assumption is that assessing and refining wine quality to produce better wines will increase sales within the region; analysis of the region's sales data is outside the scope of this project. One limitation is that the wines sampled in this dataset are mostly of normal quality rather than excellent or poor.
To conduct this analysis, the first portion of the project will be feature selection using stepwise binary logistic regression. Once the features are selected, the second portion will use classification to assess each feature's effect on quality. The project will require a project manager to keep the project moving, a scrum master to make sure hurdles are overcome, a few data analysts for exploratory analysis, a few data scientists to make sure the machine learning model is robust and the data visualization is accurate, and a business analyst to make sure the project meets the requirements of the business.
The dataset has no data on grape types, wine brand, or price point. It is also limited in size, and the project timeline is short. The risks are similar to the limitations, with the addition that the dataset is imbalanced toward normal-quality wines and is somewhat outdated (from 2009), so the physicochemical features recorded then might differ from today's because of the effects of climate change.

Autumn: Still working on my comments for this section.

---
## Business Objectives

Because red wine is rarely produced in the region, this analysis will build a better understanding of the current quality of the wine being sold and provide a basis for improving it.

Autumn: Quality evaluation is part of the wine certification process, a process assessed through physicochemical and sensory tests.  Knowledge gained by identifying the most influential factors in wine quality can improve wine making and help determine relative price points, distinguishing premium brands from average or lower quality brands.

Although the process seems straightforward, the relationships between the physicochemical and sensory analyses are complex and not fully understood.  Wine tasting is a complicating factor in this process: it is a crucial component of certification and quality recognition, yet taste is the least understood of the human senses.  This makes wine classification an even more daunting task.
Before the original ‘Modeling Wine Preferences’ data mining project in 2009, work on predicting wine quality from physicochemical data was limited, despite its potential.  Integrating such techniques and unearthing trends and patterns in the data could vastly improve winemakers’ decision-making.

Because red wine is rare in the Minho region, we will focus exclusively on the Vinho Verde red portion of the dataset, which contains 1599 cases and will reduce to 1143 cases once the data is cleaned and wrangled.  Through this process, we aim to build a model for evaluating which physicochemical properties have the most significant influence on wine quality.
Our evaluation questions and methodology are as follows.

Evaluation question 1.  Which physicochemical properties of wine show the greatest significance and impact on wine quality?

Methodology for answering evaluation question 1.  To answer this first question, we intend to conduct an ordinal logistic regression to evaluate the effect of 11 independent (continuous) variables on wine quality (the dependent, ordinal variable).  The previous regression models we explored used stepwise binary logistic regression.  As a data science team, we felt that collapsing quality scores originally ranked on an 11-point scale (0-10) into only two groups did the data a disservice.  The quality scores in this dataset range from 3 to 8, and we believe grouping them into three ordered bands is a better approach: 3-4 (below average), 5-6 (average or good), and 7-8 (above average).
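The three-band grouping described above can be sketched in a few lines. The proposal plans to do this work in R; the Python below is only an illustration, and the function name `quality_band`, the shortened label "average" (for "average or good"), and the choice to reject scores outside the observed 3-8 range are assumptions rather than part of the team's plan.

```python
# Ordered band labels, lowest to highest, following the grouping in the text.
BANDS = ("below average", "average", "above average")


def quality_band(score: int) -> str:
    """Map a raw quality score (3-8 in this dataset) to one of three ordered bands."""
    if score in (3, 4):
        return BANDS[0]
    if score in (5, 6):
        return BANDS[1]
    if score in (7, 8):
        return BANDS[2]
    # Assumption: scores outside the observed range are treated as data errors.
    raise ValueError(f"quality score {score} is outside the observed 3-8 range")


# Hypothetical scores for illustration only; not taken from the dataset.
scores = [3, 5, 6, 7, 8, 4]
banded = [quality_band(s) for s in scores]
```

Keeping the bands in an ordered tuple preserves the ordinal nature of the outcome, which is what distinguishes this approach from the two-group binary split.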

Evaluation question 2.  For the properties (variables), if any, that show the most significant impact on wine quality, what balance or interplay between these variables is necessary for the best quality wine?

Methodology for answering evaluation question 2.  To answer evaluation question 2, we will use the ordinal logistic regression results to build a model for a discriminant function analysis.  From that model, we anticipate a clearer idea of which variables play a significant role in quality and of the balance between these properties that produces the highest-quality Vinho Verde red wine.

---
## Background

There is a lack of data on red wine quality in the tested area. As data scientists, we want to examine the current quality of the wine to find opportunities to boost sales and improve wine quality in the region.

Autumn: 
Portugal is the fifth largest wine producer in Europe and the eleventh largest in the world.  Despite the COVID-19 pandemic, Portugal's wine exports grew by 5% compared to the previous year.  More than 306 million liters of wine were exported during that period, valued at $936 million.  In 2021, Portuguese wines grew an additional 3.8% in volume, 8.6% in value, and 4.6% in average price from the same reporting period in 2020. 
The country boasts 14 renowned wine regions for quality wine production.  Because of the increasing value of Portuguese wine, in competition with France, the United States, the United Kingdom, Brazil, and Germany, the Ministry of Agriculture has honed its focus on monitoring farmers and producers to define and implement necessary measures to guarantee predictability and stability in Portuguese wine production.  Wine certification and quality assessment are key elements within this context. 

Vinho Verde hails from a small region, Minho, nestled in Northern Portugal.  Minho is particularly known for its bright, summery white wines with subtle carbonation.  However, red and rosé Vinho Verde wines are the region's rarer gems.  The cool and rainy climate in Minho makes it more difficult to ripen the red Vinhão and even rarer Padeiro grapes.

Red wine quality data is scarce, particularly for the Minho region of Portugal, and the country has initiatives to become a larger regional and global competitor in the wine export industry. As data scientists, we see an opportunity for winemakers to capitalize on this data, improve their decision-making processes, and bring these rare red wines to more tables across the globe.



---
## Scope

This project covers the factors affecting the quality of red wine in the Vinho Verde region: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol. Analysis of sales data and analysis of white wine are outside the scope of this project.

---
## Functional requirements

We will use stepwise binary logistic regression to analyze the importance of the given factors to the quality of the wines listed. We will also use k-means clustering to analyze how each of the most impactful factors affects wine quality. Tableau will provide access to the visualizations, and a PowerPoint presentation will provide the in-depth analysis outcomes of this project.

Autumn: In this data analysis, we will be utilizing the ‘Wine Quality Data Set’, located on Kaggle: https://www.kaggle.com/datasets/yasserh/wine-quality-dataset.  This dataset comprises red "Vinho Verde" wine from the Minho region of Portugal, produced between 2004 and 2007. The original dataset is located on the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/wine+quality.  

The dataset provides 11 physicochemical variables measured for each wine, along with a quality rating. No additional information is provided to name the wines collected as part of this sample or their price point. The quality classes within this dataset are ordered and are not balanced.

To complete the exploratory plots, build a predictive model, and run an ordinal logistic regression, we will perform the analysis and test the model assumptions in R. Based on background research, we anticipate using the following R packages: brant, car, caret, corpcor, corrplot, dplyr, e1071, effects, ggplot2, lmtest, magrittr, MASS, PerformanceAnalytics, popbio, tidyr, and tidyverse.

The ordinal logistic regression will look at which of the following 11 independent variables have the most influence on wine quality (the dependent variable): fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, sulphates, density, pH, and alcohol. From the ordinal logistic regression results, data scientists will take the variables that show statistical significance in influencing wine quality and use them to run a discriminant function analysis.
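For reference, a common form of ordinal logistic regression is the proportional-odds (cumulative logit) model, sketched below. The proposal does not state which form the team will use, so treat this as an assumption. Here $Y$ is the ordered quality outcome with $J$ categories, $x_1, \dots, x_{11}$ are the physicochemical predictors, $\theta_j$ are the category cut-points, and $\beta_k$ are the coefficients; all of these symbols are introduced here for illustration.

$$
\log \frac{P(Y \le j \mid \mathbf{x})}{P(Y > j \mid \mathbf{x})} = \theta_j - \sum_{k=1}^{11} \beta_k x_k, \qquad j = 1, \dots, J - 1
$$

Under this parameterization, which is the one documented for `polr` in the MASS package, a positive $\beta_k$ means that higher values of $x_k$ shift probability toward higher quality categories. The same $\beta_k$ applies at every cut-point; this is the proportional-odds assumption, and it is the assumption that a Brant test checks.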

In the discriminant function analysis, data scientists will examine the relative ratios and influence, if any, between the significant variables (the balance between them) that constitute a poor, average, or above-average quality wine. Data scientists will use R to complete the discriminant function analysis and check for influential outliers, anticipating the following packages: ggplot2 and MASS.
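As a point of reference, and assuming the linear form of discriminant analysis (the proposal does not specify the variant), each quality group $g$ gets a score for a wine with predictor vector $\mathbf{x}$, and the wine is assigned to the group with the highest score. The symbols are illustrative: $\boldsymbol{\mu}_g$ is the group mean, $\boldsymbol{\Sigma}$ is the covariance matrix assumed to be shared by all groups, and $\pi_g$ is the prior proportion of group $g$.

$$
\delta_g(\mathbf{x}) = \mathbf{x}^{\top} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_g - \tfrac{1}{2}\, \boldsymbol{\mu}_g^{\top} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_g + \log \pi_g
$$

Because $\pi_g$ enters the score directly, a heavily imbalanced sample pulls classifications toward the largest group, which is worth keeping in mind when interpreting the results.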

Once all exploratory work, modeling, and analyses have been completed, data scientists will summarize and present the final results in a PowerPoint presentation. Data scientists will also use Tableau to supplement the presentation with clear, in-depth visualizations.

---
## Personnel requirements

Five data scientists will work on this project. We will need a project manager to oversee the project, a scrum master to help overcome any hurdles that arise, a data analyst to work on exploratory analysis, a data scientist to work on machine learning and visualization tools, and a business analyst to ensure the needs of the business are being met.

Autumn: 
This project's scope requires three data scientists, a data science mentor for consultation, and an instructor/industry professional for project oversight.  Scrum master duties for this project will be rotated between the participating data scientists and posted on the team Kanban board in Trello.  The weekly scrum master will be responsible for facilitating the structure of the project duties for the week, conducting weekly team stand-up meetings, presenting weekly progress to the instructor, and coordinating/organizing team feedback and communication.

The expected timeline for completion is six weeks from onset to the final presentation, with each data scientist devoting approximately 20 hours per week to the project.  Team meetings will take place at least twice per week for 30-40 minutes via Zoom or a Slack call.  One team meeting will take place at the beginning of each project week.  A second team meeting will occur immediately following the weekly 30-minute meeting with the instructor and data science mentor.

The team will use its Slack channel to communicate additional feedback, ideas, and obstacles.  The weekly scrum master will be responsible for disseminating guidance through the channel and recommending a course of action to overcome barriers or integrate new ideas, as necessary.


---
## Delivery schedule

    Week 1: Project Planning
    Week 2: Data Wrangling
    Week 3: Data Exploration
    Week 4: Data Analysis
    Week 5: Data Visualization
    Week 6: Data Reporting and Presentation

 Autumn: I need to think about this one for a bit to see how I want to organize my feedback, considering the changes to the timeline (without including all of the 'extra' stuff). 
 
 I can also add a Gantt chart here if that's something the team is interested in/if there's time.  Hoping to have my version of feedback here by Monday.

---
## Assumptions

We assume that, despite their rarity, the wines are available for testing and that testers are available in the area. We also assume that the data sample is large enough to represent the population of red Vinho Verde wines.

---
## Limitations

There is no data on grape types, wine brand, or selling price in this dataset. The sample size is small. The project has a six-week deadline.

Autumn:
Here we address limitations for consideration upon review of the project proposal. In terms of running the selected analyses, the data scientists involved in the project have not previously completed projects that required ordinal logistic regression or discriminant function analysis. To plan around these obstacles, data scientists have devoted time to researching the processes and gathering sound, published, and peer-reviewed resources in anticipation of the project. Data scientists have consulted with their data science mentor and instructor before the project proposal phase to gather additional feedback and guidance.

The second limitation we anticipate concerns scheduling and timing. Data scientists have each blocked time throughout the six-week process to work on this data science project. There are, however, blocks of time in which data scientists are not immediately available to consult via Zoom or Slack if a question or issue arises. To counterbalance this, data scientists will summarize their goals and availability at the beginning of each workday, so that teammates know when to expect a response or progress. A variety of tasks will also be published at the beginning of each week, so that a team member can pick up another card to work on while awaiting a response from the team or scrum master.

Lastly, this project proposal follows an original proposal in which team members felt pulled to explore different models and additional evaluation questions relevant to this dataset. The changes that arose, separating the team into other courses of analysis, shortened the time allowance for the project. The three data scientists on the current project proposal team anticipate needing to rework some of the data wrangling and exploratory analyses required to move forward with the project's new direction.


---
## Risks

Because of the small sample size, the analysis may not accurately reflect the actual quality of the wine. The data might be skewed by cleaning performed by a previous user. Because of the age of the data, the physicochemical properties might have changed. The classification is skewed by class imbalance: there are more normal wines than excellent or poor ones.

Autumn:
Risks are anticipated in this project, and the data science team has discussed them in order to mitigate and work through them where possible.  The initial risks come with the dataset itself.  In the project selection process, we chose the red Vinho Verde data from the original dataset; red wine is not as readily produced in the Minho region because of the area's cool, wet climate.  Because of this, excellent red wines are harder to find in this region.  The red dataset also has fewer samples than the white Vinho Verde dataset (1599 vs. 4898).  The quality scale in our dataset is also highly imbalanced: there are far more average or ‘normal’ wines than poor or above-average ones.
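One quick way to quantify the class imbalance described above is to compute each quality class's share of the sample and the ratio of the largest class to the smallest. The sketch below is illustrative Python rather than the R the team plans to use; the function names and the sample labels are hypothetical and do not reflect the real class counts.

```python
from collections import Counter


def class_shares(labels):
    """Return each class's share of the sample as a fraction between 0 and 1."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def imbalance_ratio(labels):
    """Return the size of the largest class divided by the size of the smallest."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical labels for illustration only; not the real class counts.
sample = ["average"] * 8 + ["below average"] + ["above average"]
shares = class_shares(sample)
ratio = imbalance_ratio(sample)
```

Reporting these two numbers alongside any accuracy figure helps show whether a model is doing better than simply predicting the majority class.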

Another risk the team has considered concerns the analyses themselves and outside factors we may not be able to control (e.g., illness on the team or other acute emergencies that could disrupt the project). Time may run out before the team can complete every portion of the project, given the tightened timeline after the previous separation.  The data science team has taken great care in the planning phases of this second project so that, if obstacles arise, tasks can be temporarily reassigned as necessary until the affected team member can return to the project.