Analysis of the relationship between wine type and quality, and between acidity level and quality rating.

Q1. Is one type of wine (red or white) associated with better quality?

Q2. Which acidity level (pH value) receives the highest average rating?

The acidity level groups are defined by pH quartiles (a lower pH means higher acidity):

High: lowest 25% of pH values;
Moderately high: 25% to 50% of pH values;
Medium: 50% to 75% of pH values;
Low: highest 25% of pH values.

In [5]:
import pandas as pd
import numpy as np

%matplotlib inline

df = pd.read_csv('winequality_edited.ipynb.csv')
display(df.head())

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,color
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,red
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,red
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,red
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red


In [6]:
df.groupby('color')['quality'].mean()

color
red      5.636023
white    5.877909
Name: quality, dtype: float64

Q1: Is one type of wine (red or white) associated with better quality?

A1: Yes. White wine has a mean quality of 5.88 versus 5.64 for red wine, a difference of about 0.24 points, so in this dataset white wine is associated with slightly better quality.
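A comparison of means is easier to judge next to the group sizes and spread. The following is a minimal sketch, assuming a small made-up frame (`toy` and its values are hypothetical, not the wine data), of how the same groupby could report those figures together:

```python
import pandas as pd

# Hypothetical data for illustration only; not taken from the wine dataset.
toy = pd.DataFrame({
    "color": ["red", "red", "red", "white", "white", "white"],
    "quality": [5, 6, 5, 6, 7, 5],
})

# Count, mean and standard deviation of quality for each color.
summary = toy.groupby("color")["quality"].agg(["count", "mean", "std"])
print(summary)
```

Applied to the real `df`, the same call would show how many wines of each color sit behind the two means.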

In [7]:
# Use pandas describe() to get the summary statistics of pH that define the categories: min, 25%, 50%, 75% and max

ph_desc = df["pH"].describe()
display(ph_desc)

count    6497.000000
mean        3.218501
std         0.160787
min         2.720000
25%         3.110000
50%         3.210000
75%         3.320000
max         4.010000
Name: pH, dtype: float64

In [10]:
# Edges of the intervals used to split the data into groups: the min, 25%, 50%, 75% and max values found above.
# The same values can be read from ph_desc instead of being typed in:
# bin_edges = [ph_desc['min'], ph_desc['25%'], ph_desc['50%'], ph_desc['75%'], ph_desc['max']]

bin_edges = [2.72, 3.11, 3.21, 3.32, 4.01]
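The edges above are copied by hand from the `describe()` output. As a minimal sketch, assuming a short made-up pH series (`ph` below is hypothetical), the list can also be built directly from the summary so that it stays in sync with the data:

```python
import pandas as pd

# Hypothetical pH values for illustration only; not the wine data.
ph = pd.Series([2.9, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7])

# describe() returns a Series indexed by statistic name.
ph_desc = ph.describe()

# Pick the five statistics that delimit the four quartile groups.
bin_edges = [ph_desc[k] for k in ["min", "25%", "50%", "75%", "max"]]
print(bin_edges)
```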

In [12]:
# Labels for the four acidity level groups (Portuguese for High, Moderately High, Medium, Low)

bin_names = ["Alto", "Moderadamente Alto", "Médio", "Baixo"]

In [13]:
# Create the acidity_levels column with the acidity level labels.
# pd.cut segments numeric data into custom intervals. The intervals are right-closed by default,
# so a pH exactly equal to the lowest edge (2.72) is left unlabeled (NaN) unless include_lowest=True is passed.

df['acidity_levels'] = pd.cut(df['pH'], bin_edges, labels=bin_names)

# Check that the column was created correctly

df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,color,acidity_levels
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red,Baixo
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,red,Moderadamente Alto
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,red,Médio
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,red,Moderadamente Alto
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red,Baixo
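One detail of `pd.cut` is worth checking on a small example: bins are right-closed by default, so a value equal to the lowest edge falls outside the first interval. The following is a minimal sketch on made-up pH values (the series `ph` is hypothetical), reusing the edges and labels defined above:

```python
import pandas as pd

bin_edges = [2.72, 3.11, 3.21, 3.32, 4.01]
bin_names = ["Alto", "Moderadamente Alto", "Médio", "Baixo"]

# Hypothetical pH values chosen to hit the lowest edge and each bin.
ph = pd.Series([2.72, 3.11, 3.20, 3.26, 3.51])

# Default behavior: intervals are (left, right], so 2.72 gets no label.
default_cut = pd.cut(ph, bin_edges, labels=bin_names)

# include_lowest=True makes the first interval include its left edge.
inclusive_cut = pd.cut(ph, bin_edges, labels=bin_names, include_lowest=True)

print(pd.DataFrame({"pH": ph, "default": default_cut, "include_lowest": inclusive_cut}))
```

In the default column the first row is NaN, while with `include_lowest=True` it is labeled "Alto". A call such as `df['acidity_levels'].isna().sum()` would show whether any rows in the real data are affected.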


In [14]:
# Find the mean quality of each acidity level with groupby.
# Selecting 'quality' before calling mean() avoids averaging the non-numeric 'color' column.

df.groupby("acidity_levels")["quality"].mean()

acidity_levels
Alto                  5.783343
Moderadamente Alto    5.784540
Médio                 5.850832
Baixo                 5.859593
Name: quality, dtype: float64

Q2: Which acidity level (pH value) receives the highest average rating?

This question is more complex than Q1. Color has clear categories to group by (red or white), whereas pH is a quantitative variable with no natural categories. Using the pandas cut function, the numerical pH data was segmented into custom ranges, creating a categorical variable (acidity levels). The groupby function was then used to obtain the mean quality rating for each acidity level.

A2: The low acidity level receives the highest average rating (5.86), although the four group means differ by less than 0.08 points.
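As an alternative to computing the edges and calling `pd.cut`, pandas also provides `pd.qcut`, which derives quantile-based bins itself. This is not what the notebook does. It is only a sketch of the alternative, on a made-up frame (`toy` and its values are hypothetical):

```python
import pandas as pd

bin_names = ["Alto", "Moderadamente Alto", "Médio", "Baixo"]

# Hypothetical data for illustration only; not the wine data.
toy = pd.DataFrame({
    "pH": [2.9, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6],
    "quality": [5, 6, 6, 5, 6, 7, 6, 7],
})

# q=4 splits the values into quartiles; the minimum is included in the first bin.
toy["acidity_levels"] = pd.qcut(toy["pH"], q=4, labels=bin_names)

# Mean quality per acidity level; observed=True keeps only categories that occur.
mean_quality = toy.groupby("acidity_levels", observed=True)["quality"].mean()
print(mean_quality)
```

Because `pd.qcut` computes its own edges, results on the real data could differ slightly from the hand-entered edges, for example when many wines share a pH value that sits exactly on a quartile boundary.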

In [None]:
# Save

#df.to_csv('winequality_edited.csv', index=False)