Analysis of alcohol content in the classification of wine quality.

Q: Do wines with higher alcohol content receive higher ratings?

Low alcohol: samples with an alcohol content below the median.
High alcohol: samples with an alcohol content greater than or equal to the median.

In [47]:
import pandas as pd
import numpy as np

%matplotlib inline

df = pd.read_csv('winequality_edited.ipynb.csv')
display(df.head())

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,color
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,red
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,red
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,red
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red


Using the pandas query function, it is possible to select the samples below the median and the samples at or above it, and thus analyze the low and high alcohol groups separately. The median is computed first:

In [5]:
alcohol_median = df["alcohol"].median()
display(alcohol_median)

10.3

The samples are classified as low or high alcohol content according to the median.

To use a Python variable inside a query string, prefix its name with @, as shown below:

In [9]:
low_alcohol = df.query('alcohol < @alcohol_median')
display(low_alcohol.head())

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,color
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,red
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,red
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,red
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red


In [10]:
high_alcohol = df.query('alcohol >= @alcohol_median')
display(high_alcohol.head())

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,color
9,7.5,0.5,0.36,6.1,0.071,17.0,102.0,0.9978,3.35,0.8,10.5,5,red
11,7.5,0.5,0.36,6.1,0.071,17.0,102.0,0.9978,3.35,0.8,10.5,5,red
16,8.5,0.28,0.56,1.8,0.092,35.0,103.0,0.9969,3.3,0.75,10.5,7,red
31,6.9,0.685,0.0,2.5,0.105,22.0,37.0,0.9966,3.46,0.57,10.6,6,red
36,7.8,0.6,0.14,2.4,0.086,3.0,15.0,0.9975,3.42,0.6,10.8,6,red
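
For comparison, the same split can be written with boolean masks instead of query strings. The following is a minimal sketch on a small hypothetical DataFrame (the names toy and threshold and the five values are illustrative assumptions, not taken from the wine dataset); it only checks that both styles select the same rows.

```python
import pandas as pd

# Hypothetical toy data standing in for the wine dataset
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.3, 10.5, 11.2],
    "quality": [5, 5, 6, 6, 7],
})
threshold = toy["alcohol"].median()

# query() resolves @threshold from the surrounding Python scope
low_q = toy.query("alcohol < @threshold")
high_q = toy.query("alcohol >= @threshold")

# Equivalent selections using boolean masks
low_m = toy[toy["alcohol"] < threshold]
high_m = toy[toy["alcohol"] >= threshold]

# Both styles should return identical rows
print(low_q.equals(low_m), high_q.equals(high_m))
```

The query form tends to read more naturally for simple conditions, while boolean masks avoid string parsing; either should work for this analysis.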


In [11]:
# make sure these queries included each sample exactly once

num_samples = df.shape[0]
num_samples == low_alcohol['quality'].count() + high_alcohol['quality'].count() # should be True if the two group counts add up to the total

True
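
The count comparison above confirms that the two group sizes add up to the total. A slightly stricter check, sketched here on hypothetical toy data (toy, threshold, low and high are illustrative names, not the notebook's variables), compares the row indexes directly to confirm that the groups do not overlap and that together they cover every row.

```python
import pandas as pd

# Hypothetical toy data standing in for the wine dataset
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.3, 10.5, 11.2],
    "quality": [5, 5, 6, 6, 7],
})
threshold = toy["alcohol"].median()
low = toy.query("alcohol < @threshold")
high = toy.query("alcohol >= @threshold")

# No row label appears in both groups
no_overlap = low.index.intersection(high.index).empty

# Every row label of the original frame appears in one of the groups
full_cover = sorted(low.index.union(high.index)) == list(toy.index)

print(no_overlap, full_cover)
```

This kind of check would also catch rows with a missing alcohol value, since those would satisfy neither condition and so fall outside both groups.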

In [15]:
display(low_alcohol['quality'].count() + high_alcohol['quality'].count())

6497

In [16]:
df.info()  # info() prints its summary and returns None, so display() is not needed

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         6497 non-null   float64
 1   volatile acidity      6497 non-null   float64
 2   citric acid           6497 non-null   float64
 3   residual sugar        6497 non-null   float64
 4   chlorides             6497 non-null   float64
 5   free sulfur dioxide   6497 non-null   float64
 6   total sulfur dioxide  6497 non-null   float64
 7   density               6497 non-null   float64
 8   pH                    6497 non-null   float64
 9   sulphates             6497 non-null   float64
 10  alcohol               6497 non-null   float64
 11  quality               6497 non-null   int64  
 12  color                 6497 non-null   object 
dtypes: float64(11), int64(1), object(1)
memory usage: 660.0+ KB

In [18]:
display(low_alcohol['quality'].mean())

5.475920679886686

In [20]:
display(high_alcohol['quality'].mean())

6.146084337349397
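
The two group means can also be obtained in a single step by labelling each sample and grouping on the label. This is a minimal sketch on hypothetical toy data (toy, threshold, group and means are illustrative names and the values are made up), not the notebook's own method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the wine dataset
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.3, 10.5, 11.2],
    "quality": [5, 5, 6, 6, 7],
})
threshold = toy["alcohol"].median()

# Label each row "high" (at or above the median) or "low" (below it)
group = np.where(toy["alcohol"] >= threshold, "high", "low")

# Mean quality per label
means = toy.groupby(group)["quality"].mean()
print(means)
```

Grouping on a label keeps both results in one Series, which may be convenient if more groups (for example quartiles) are added later.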

Another way to write the answer:

In [21]:
# Average quality rating for high and low alcohol groups:

print(" Average rating for high alcohol:{0} \n Average rating for low alcohol:{1}".format(
        high_alcohol["quality"].mean(), low_alcohol["quality"].mean() ))

 Average rating for high alcohol:6.146084337349397 
 Average rating for low alcohol:5.475920679886686


Q1: Do wines with higher alcohol content receive higher ratings?

A1: Yes, on average. The mean quality rating of the high alcohol group (6.15) is higher than that of the low alcohol group (5.48).