# Description of Physicochemical Properties

Citation Request: This dataset is publicly available for research. The details are described in [Cortez et al., 2009]. Please include this citation if you plan to use this database:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

Title: Wine Quality

Sources Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009

Past Usage:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

In the above reference, two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model these datasets under a regression approach. The support vector machine model achieved the best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T), etc. Also, we plot the relative importances of the input variables (as measured by a sensitivity analysis procedure).
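As a rough illustration of the two metrics named above, the sketch below computes the mean absolute deviation (MAD) and the share of predictions that fall within a tolerance T of the true score. The five alcohol/quality pairs are the rows shown in this notebook's `head()` preview, and the intercept and slope are copied from the univariate alcohol fit reported for red wines further down. The helper names and the choice T = 0.5 are assumptions made for this example, not the paper's code.

```python
# Alcohol and quality for the five red-wine rows shown in the head() preview.
alcohol = [9.4, 9.8, 9.8, 9.8, 9.4]
quality = [5, 5, 5, 6, 5]

# Coefficients copied from the univariate alcohol fit reported for red wines.
INTERCEPT, SLOPE = 1.8750, 0.3608


def mean_absolute_deviation(actual, predicted):
    """Average of |actual - predicted| over all samples."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def accuracy_within_tolerance(actual, predicted, tolerance):
    """Share of samples whose prediction lies within `tolerance` of the true score."""
    hits = sum(abs(a - p) <= tolerance for a, p in zip(actual, predicted))
    return hits / len(actual)


predicted = [INTERCEPT + SLOPE * a for a in alcohol]
mad = mean_absolute_deviation(quality, predicted)
acc = accuracy_within_tolerance(quality, predicted, tolerance=0.5)
print(f"MAD = {mad:.3f}, accuracy at T=0.5 = {acc:.2f}")
```

Five rows say nothing about model quality; the point is only to show what each metric measures.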

Relevant Information:

The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods.
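A quick way to see the imbalance is to tally how often each quality score occurs. The sketch below does this for the five quality values visible in this notebook's `head()` preview, so the counts are only a toy stand-in for the full distribution.

```python
from collections import Counter

# Quality scores of the five red-wine rows in the head() preview.
quality = [5, 5, 5, 6, 5]

counts = Counter(quality)
total = len(quality)
for score in sorted(counts):
    share = counts[score] / total
    print(f"quality {score}: {counts[score]} wines ({share:.0%})")
```

On the full frame loaded later in this notebook, the same tally would be `red_wine_data["quality"].value_counts().sort_index()`.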

Number of Instances: red wine - 1599; white wine - 4898.

Number of Attributes: 11 + output attribute

Note: several of the attributes may be correlated, thus it makes sense to apply some sort of feature selection.
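One simple check for correlated attributes is a pairwise Pearson correlation matrix. The sketch below uses three columns from the five rows in this notebook's `head()` preview; the 0.8 cutoff is an arbitrary value chosen for illustration, and with so few rows the coefficients are not representative of the full dataset.

```python
import numpy as np

# Columns from the five red-wine rows shown in the head() preview.
fixed_acidity = np.array([7.4, 7.8, 7.8, 11.2, 7.4])
citric_acid = np.array([0.0, 0.0, 0.04, 0.56, 0.0])
ph = np.array([3.51, 3.2, 3.26, 3.16, 3.51])

names = ["fixed.acidity", "citric.acid", "pH"]
# np.corrcoef treats each row of its input as one variable.
corr = np.corrcoef(np.vstack([fixed_acidity, citric_acid, ph]))

# Report pairs whose absolute correlation exceeds an illustrative cutoff.
THRESHOLD = 0.8
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if abs(corr[i, j]) > THRESHOLD:
            print(f"{names[i]} vs {names[j]}: r = {corr[i, j]:.2f}")
```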

Attribute information:

For more information, read [Cortez et al., 2009].

Input variables (based on physicochemical tests):

1 - fixed acidity (tartaric acid - g / dm^3)

2 - volatile acidity (acetic acid - g / dm^3)

3 - citric acid (g / dm^3)

4 - residual sugar (g / dm^3)

5 - chlorides (sodium chloride - g / dm^3)

6 - free sulfur dioxide (mg / dm^3)

7 - total sulfur dioxide (mg / dm^3)

8 - density (g / cm^3)

9 - pH

10 - sulphates (potassium sulphate - g / dm^3)

11 - alcohol (% by volume)

Output variable (based on sensory data):

12 - quality (score between 0 and 10)

Missing Attribute Values: None

Description of attributes:

1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at high levels can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add 'freshness' and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops; it's rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

Output variable (based on sensory data): 12 - quality (score between 0 and 10)

# Univariate Linear Regression: Red Wines


In [62]:
import pandas as pd
import numpy as np
import statsmodels.api as sm # import statsmodels

red_wine_data_to_load = "/Users/kurtshiple/Desktop/Project1/projectcsvs/wineQualityReds.csv"

red_wine_data = pd.read_csv(red_wine_data_to_load)

white_wine_data_to_load = "/Users/kurtshiple/Desktop/Project1/projectcsvs/wineQualityWhites.csv"

white_wine_data = pd.read_csv(white_wine_data_to_load)

In [63]:

red_wine_data.head()

Unnamed: 0.1,Unnamed: 0,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
0,1,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,2,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,3,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,4,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,5,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
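The preview above starts with two columns named `Unnamed: 0.1` and `Unnamed: 0`, which look like row counters rather than measurements. That reading is an assumption, since the CSV file itself is not shown here. If it holds, a small helper can drop them before any modelling. The sketch below rebuilds the first two preview rows in memory so it runs without the original files; `drop_index_columns` is a hypothetical name introduced for this example.

```python
import io

import pandas as pd

# Header and first two rows copied from the head() preview above.
csv_text = (
    "Unnamed: 0.1,Unnamed: 0,fixed.acidity,volatile.acidity,citric.acid,"
    "residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,"
    "density,pH,sulphates,alcohol,quality\n"
    "0,1,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5\n"
    "1,2,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5\n"
)


def drop_index_columns(frame):
    """Return a copy of `frame` without columns whose name starts with 'Unnamed'."""
    keep = [col for col in frame.columns if not str(col).startswith("Unnamed")]
    return frame[keep].copy()


preview = pd.read_csv(io.StringIO(csv_text))
clean = drop_index_columns(preview)
print(list(clean.columns))
```

The regressions below select columns by name, so the leftover counters do not affect their results; dropping them mainly keeps the frame tidy.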


In [64]:


X = red_wine_data["fixed.acidity"] ## X usually means our input variables (or independent variables)
y = red_wine_data["quality"] ## Y usually means our output/dependent variable
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model

# Note the difference in argument order
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)
predictions = model.predict(X)

# Print out the statistics
model.summary()




0,1,2,3
Dep. Variable:,quality,R-squared:,0.015
Model:,OLS,Adj. R-squared:,0.015
Method:,Least Squares,F-statistic:,24.96
Date:,"Fri, 15 Mar 2019",Prob (F-statistic):,6.5e-07
Time:,18:53:04,Log-Likelihood:,-1914.2
No. Observations:,1599,AIC:,3832.0
Df Residuals:,1597,BIC:,3843.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,5.1573,0.098,52.684,0.000,4.965,5.349
fixed.acidity,0.0575,0.012,4.996,0.000,0.035,0.080

0,1,2,3
Omnibus:,17.047,Durbin-Watson:,1.743
Prob(Omnibus):,0.0,Jarque-Bera (JB):,18.553
Skew:,0.205,Prob(JB):,9.36e-05
Kurtosis:,3.333,Cond. No.,42.1
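To make the `coef`, `std err`, `t` and `R-squared` entries in the summary less opaque, here is a sketch that computes the same quantities for a one-feature model with the textbook closed-form formulas in plain NumPy. It is a swapped-in technique for illustration, not the statsmodels code path, and it uses only the five `head()` rows, so its numbers will not match the table above.

```python
import numpy as np

# fixed.acidity and quality from the five head() rows (not the full 1599 samples).
x = np.array([7.4, 7.8, 7.8, 11.2, 7.4])
y = np.array([5.0, 5.0, 5.0, 6.0, 5.0])

n = len(x)
x_centered = x - x.mean()
y_centered = y - y.mean()

# Least-squares slope and intercept for y = intercept + slope * x.
slope = (x_centered @ y_centered) / (x_centered @ x_centered)
intercept = y.mean() - slope * x.mean()

# R-squared: share of the variance in y explained by the fitted line.
residuals = y - (intercept + slope * x)
ss_res = residuals @ residuals
ss_tot = y_centered @ y_centered
r_squared = 1.0 - ss_res / ss_tot

# Standard error of the slope, using n - 2 residual degrees of freedom.
slope_se = np.sqrt(ss_res / (n - 2) / (x_centered @ x_centered))
t_stat = slope / slope_se

print(f"slope={slope:.4f} intercept={intercept:.4f} R^2={r_squared:.3f} t={t_stat:.2f}")
```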


In [65]:
X = red_wine_data["volatile.acidity"] ## X usually means our input variables (or independent variables)
y = red_wine_data["quality"] ## Y usually means our output/dependent variable
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model

# Note the difference in argument order
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)
predictions = model.predict(X)

# Print out the statistics
model.summary()

0,1,2,3
Dep. Variable:,quality,R-squared:,0.153
Model:,OLS,Adj. R-squared:,0.152
Method:,Least Squares,F-statistic:,287.4
Date:,"Fri, 15 Mar 2019",Prob (F-statistic):,2.05e-59
Time:,18:53:05,Log-Likelihood:,-1794.3
No. Observations:,1599,AIC:,3593.0
Df Residuals:,1597,BIC:,3603.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,6.5657,0.058,113.388,0.000,6.452,6.679
volatile.acidity,-1.7614,0.104,-16.954,0.000,-1.965,-1.558

0,1,2,3
Omnibus:,20.577,Durbin-Watson:,1.736
Prob(Omnibus):,0.0,Jarque-Bera (JB):,21.905
Skew:,0.242,Prob(JB):,1.75e-05
Kurtosis:,3.306,Cond. No.,7.18


In [66]:
X = red_wine_data["citric.acid"] ## X usually means our input variables (or independent variables)
y = red_wine_data["quality"] ## Y usually means our output/dependent variable
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model

# Note the difference in argument order
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)
predictions = model.predict(X)

# Print out the statistics
model.summary()

0,1,2,3
Dep. Variable:,quality,R-squared:,0.051
Model:,OLS,Adj. R-squared:,0.051
Method:,Least Squares,F-statistic:,86.26
Date:,"Fri, 15 Mar 2019",Prob (F-statistic):,4.99e-20
Time:,18:53:06,Log-Likelihood:,-1884.6
No. Observations:,1599,AIC:,3773.0
Df Residuals:,1597,BIC:,3784.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,5.3817,0.034,159.610,0.000,5.316,5.448
citric.acid,0.9385,0.101,9.288,0.000,0.740,1.137

0,1,2,3
Omnibus:,11.279,Durbin-Watson:,1.74
Prob(Omnibus):,0.004,Jarque-Bera (JB):,11.967
Skew:,0.162,Prob(JB):,0.00252
Kurtosis:,3.272,Cond. No.,5.53


In [67]:
X = red_wine_data["residual.sugar"] ## X usually means our input variables (or independent variables)
y = red_wine_data["quality"] ## Y usually means our output/dependent variable
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model

# Note the difference in argument order
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)
predictions = model.predict(X)

# Print out the statistics
model.summary()

0,1,2,3
Dep. Variable:,quality,R-squared:,0.0
Model:,OLS,Adj. R-squared:,-0.0
Method:,Least Squares,F-statistic:,0.3012
Date:,"Fri, 15 Mar 2019",Prob (F-statistic):,0.583
Time:,18:53:06,Log-Likelihood:,-1926.5
No. Observations:,1599,AIC:,3857.0
Df Residuals:,1597,BIC:,3868.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,5.6161,0.042,134.950,0.000,5.534,5.698
residual.sugar,0.0079,0.014,0.549,0.583,-0.020,0.036

0,1,2,3
Omnibus:,16.985,Durbin-Watson:,1.729
Prob(Omnibus):,0.0,Jarque-Bera (JB):,18.016
Skew:,0.215,Prob(JB):,0.000122
Kurtosis:,3.292,Cond. No.,6.54


In [68]:
X = red_wine_data["chlorides"] ## X usually means our input variables (or independent variables)
y = red_wine_data["quality"] ## Y usually means our output/dependent variable
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model

# Note the difference in argument order
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)
predictions = model.predict(X)

# Print out the statistics
model.summary()

0,1,2,3
Dep. Variable:,quality,R-squared:,0.017
Model:,OLS,Adj. R-squared:,0.016
Method:,Least Squares,F-statistic:,26.99
Date:,"Fri, 15 Mar 2019",Prob (F-statistic):,2.31e-07
Time:,18:53:07,Log-Likelihood:,-1913.2
No. Observations:,1599,AIC:,3830.0
Df Residuals:,1597,BIC:,3841.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,5.8295,0.042,137.852,0.000,5.747,5.912
chlorides,-2.2118,0.426,-5.195,0.000,-3.047,-1.377

0,1,2,3
Omnibus:,14.102,Durbin-Watson:,1.738
Prob(Omnibus):,0.001,Jarque-Bera (JB):,14.663
Skew:,0.199,Prob(JB):,0.000655
Kurtosis:,3.249,Cond. No.,21.4


In [69]:
X = red_wine_data["free.sulfur.dioxide"] ## X usually means our input variables (or independent variables)
y = red_wine_data["quality"] ## Y usually means our output/dependent variable
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model

# Note the difference in argument order
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)
predictions = model.predict(X)

# Print out the statistics
model.summary()

0,1,2,3
Dep. Variable:,quality,R-squared:,0.003
Model:,OLS,Adj. R-squared:,0.002
Method:,Least Squares,F-statistic:,4.109
Date:,"Fri, 15 Mar 2019",Prob (F-statistic):,0.0428
Time:,18:53:07,Log-Likelihood:,-1924.6
No. Observations:,1599,AIC:,3853.0
Df Residuals:,1597,BIC:,3864.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,5.6981,0.037,155.357,0.000,5.626,5.770
free.sulfur.dioxide,-0.0039,0.002,-2.027,0.043,-0.008,-0.000

0,1,2,3
Omnibus:,16.011,Durbin-Watson:,1.728
Prob(Omnibus):,0.0,Jarque-Bera (JB):,17.376
Skew:,0.197,Prob(JB):,0.000169
Kurtosis:,3.324,Cond. No.,34.6


In [70]:
X = red_wine_data["total.sulfur.dioxide"] ## X usually means our input variables (or independent variables)
y = red_wine_data["quality"] ## Y usually means our output/dependent variable
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model

# Note the difference in argument order
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)
predictions = model.predict(X)

# Print out the statistics
model.summary()

0,1,2,3
Dep. Variable:,quality,R-squared:,0.034
Model:,OLS,Adj. R-squared:,0.034
Method:,Least Squares,F-statistic:,56.66
Date:,"Fri, 15 Mar 2019",Prob (F-statistic):,8.62e-14
Time:,18:53:08,Log-Likelihood:,-1898.8
No. Observations:,1599,AIC:,3802.0
Df Residuals:,1597,BIC:,3812.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,5.8472,0.034,170.140,0.000,5.780,5.915
total.sulfur.dioxide,-0.0045,0.001,-7.527,0.000,-0.006,-0.003

0,1,2,3
Omnibus:,20.665,Durbin-Watson:,1.769
Prob(Omnibus):,0.0,Jarque-Bera (JB):,30.817
Skew:,0.117,Prob(JB):,2.03e-07
Kurtosis:,3.638,Cond. No.,98.6


In [71]:
X = red_wine_data["density"] ## X usually means our input variables (or independent variables)
y = red_wine_data["quality"] ## Y usually means our output/dependent variable
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model

# Note the difference in argument order
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)
predictions = model.predict(X)

# Print out the statistics
model.summary()

0,1,2,3
Dep. Variable:,quality,R-squared:,0.031
Model:,OLS,Adj. R-squared:,0.03
Method:,Least Squares,F-statistic:,50.41
Date:,"Fri, 15 Mar 2019",Prob (F-statistic):,1.87e-12
Time:,18:53:08,Log-Likelihood:,-1901.8
No. Observations:,1599,AIC:,3808.0
Df Residuals:,1597,BIC:,3818.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,80.2385,10.508,7.636,0.000,59.628,100.849
density,-74.8460,10.542,-7.100,0.000,-95.524,-54.168

0,1,2,3
Omnibus:,13.878,Durbin-Watson:,1.702
Prob(Omnibus):,0.001,Jarque-Bera (JB):,15.259
Skew:,0.174,Prob(JB):,0.000486
Kurtosis:,3.329,Cond. No.,1060.0


In [72]:
X = red_wine_data["pH"] ## X usually means our input variables (or independent variables)
y = red_wine_data["quality"] ## Y usually means our output/dependent variable
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model

# Note the difference in argument order
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)
predictions = model.predict(X)

# Print out the statistics
model.summary()

0,1,2,3
Dep. Variable:,quality,R-squared:,0.003
Model:,OLS,Adj. R-squared:,0.003
Method:,Least Squares,F-statistic:,5.34
Date:,"Fri, 15 Mar 2019",Prob (F-statistic):,0.021
Time:,18:53:08,Log-Likelihood:,-1924.0
No. Observations:,1599,AIC:,3852.0
Df Residuals:,1597,BIC:,3863.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,6.6359,0.433,15.320,0.000,5.786,7.486
pH,-0.3020,0.131,-2.311,0.021,-0.558,-0.046

0,1,2,3
Omnibus:,16.478,Durbin-Watson:,1.73
Prob(Omnibus):,0.0,Jarque-Bera (JB):,17.084
Skew:,0.222,Prob(JB):,0.000195
Kurtosis:,3.244,Cond. No.,77.7


In [73]:
X = red_wine_data["sulphates"] ## X usually means our input variables (or independent variables)
y = red_wine_data["quality"] ## Y usually means our output/dependent variable
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model

# Note the difference in argument order
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)
predictions = model.predict(X)

# Print out the statistics
model.summary()

0,1,2,3
Dep. Variable:,quality,R-squared:,0.063
Model:,OLS,Adj. R-squared:,0.063
Method:,Least Squares,F-statistic:,107.7
Date:,"Fri, 15 Mar 2019",Prob (F-statistic):,1.8e-24
Time:,18:53:09,Log-Likelihood:,-1874.4
No. Observations:,1599,AIC:,3753.0
Df Residuals:,1597,BIC:,3764.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,4.8477,0.078,61.818,0.000,4.694,5.002
sulphates,1.1977,0.115,10.380,0.000,0.971,1.424

0,1,2,3
Omnibus:,12.685,Durbin-Watson:,1.712
Prob(Omnibus):,0.002,Jarque-Bera (JB):,17.098
Skew:,0.083,Prob(JB):,0.000194
Kurtosis:,3.479,Cond. No.,8.51


In [74]:
X = red_wine_data["alcohol"] ## X usually means our input variables (or independent variables)
y = red_wine_data["quality"] ## Y usually means our output/dependent variable
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model

# Note the difference in argument order
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)
predictions = model.predict(X)

# Print out the statistics
model.summary()

0,1,2,3
Dep. Variable:,quality,R-squared:,0.227
Model:,OLS,Adj. R-squared:,0.226
Method:,Least Squares,F-statistic:,468.3
Date:,"Fri, 15 Mar 2019",Prob (F-statistic):,2.83e-91
Time:,18:53:09,Log-Likelihood:,-1721.1
No. Observations:,1599,AIC:,3446.0
Df Residuals:,1597,BIC:,3457.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,1.8750,0.175,10.732,0.000,1.532,2.218
alcohol,0.3608,0.017,21.639,0.000,0.328,0.394

0,1,2,3
Omnibus:,38.501,Durbin-Watson:,1.748
Prob(Omnibus):,0.0,Jarque-Bera (JB):,71.758
Skew:,-0.154,Prob(JB):,2.62e-16
Kurtosis:,3.991,Cond. No.,104.0
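With eleven separate summaries it is hard to compare features at a glance. The sketch below collects the R-squared values reported in the red-wine tables above and ranks them; the numbers are copied by hand from this notebook rather than recomputed.

```python
# R-squared values copied from the eleven univariate red-wine summaries above.
red_r_squared = {
    "fixed.acidity": 0.015,
    "volatile.acidity": 0.153,
    "citric.acid": 0.051,
    "residual.sugar": 0.000,
    "chlorides": 0.017,
    "free.sulfur.dioxide": 0.003,
    "total.sulfur.dioxide": 0.034,
    "density": 0.031,
    "pH": 0.003,
    "sulphates": 0.063,
    "alcohol": 0.227,
}

# Sort features from most to least variance explained on their own.
ranked = sorted(red_r_squared.items(), key=lambda item: item[1], reverse=True)
for name, r2 in ranked:
    print(f"{name:<22} {r2:.3f}")
```

Alcohol and volatile acidity stand out. Every other feature explains less than 7% of the variance in quality on its own.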


# Multivariate Linear Regression: Red Wines

In [75]:

X = red_wine_data[["fixed.acidity","volatile.acidity","citric.acid","residual.sugar","chlorides","free.sulfur.dioxide","total.sulfur.dioxide","density","pH","sulphates","alcohol"]] ## X usually means our input variables (or independent variables)
y = red_wine_data["quality"] ## Y usually means our output/dependent variable
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model

# Note the difference in argument order
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)
predictions = model.predict(X)

# Print out the statistics
model.summary()

0,1,2,3
Dep. Variable:,quality,R-squared:,0.361
Model:,OLS,Adj. R-squared:,0.356
Method:,Least Squares,F-statistic:,81.35
Date:,"Fri, 15 Mar 2019",Prob (F-statistic):,1.79e-145
Time:,18:53:12,Log-Likelihood:,-1569.1
No. Observations:,1599,AIC:,3162.0
Df Residuals:,1587,BIC:,3227.0
Df Model:,11,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,21.9652,21.195,1.036,0.300,-19.607,63.538
fixed.acidity,0.0250,0.026,0.963,0.336,-0.026,0.076
volatile.acidity,-1.0836,0.121,-8.948,0.000,-1.321,-0.846
citric.acid,-0.1826,0.147,-1.240,0.215,-0.471,0.106
residual.sugar,0.0163,0.015,1.089,0.276,-0.013,0.046
chlorides,-1.8742,0.419,-4.470,0.000,-2.697,-1.052
free.sulfur.dioxide,0.0044,0.002,2.009,0.045,0.000,0.009
total.sulfur.dioxide,-0.0033,0.001,-4.480,0.000,-0.005,-0.002
density,-17.8812,21.633,-0.827,0.409,-60.314,24.551

0,1,2,3
Omnibus:,27.376,Durbin-Watson:,1.757
Prob(Omnibus):,0.0,Jarque-Bera (JB):,40.965
Skew:,-0.168,Prob(JB):,1.27e-09
Kurtosis:,3.708,Cond. No.,113000.0
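The condition number of 113000 reported above is large, which usually points to inputs on very different scales, strongly correlated inputs, or both. Density, for example, sits in a narrow band near 1 and is therefore almost collinear with the intercept column. The sketch below illustrates the effect with `np.linalg.cond` on a tiny design matrix built from two columns of the `head()` preview, before and after z-scoring. Whether this matches exactly how statsmodels derives its `Cond. No.` is an assumption; the helper names are introduced for this example.

```python
import numpy as np

# density and alcohol from the five red-wine rows shown in the head() preview.
density = np.array([0.9978, 0.9968, 0.997, 0.998, 0.9978])
alcohol = np.array([9.4, 9.8, 9.8, 9.8, 9.4])


def design_matrix(*columns):
    """Stack an intercept column of ones next to the given feature columns."""
    return np.column_stack([np.ones(len(columns[0])), *columns])


def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    return (column - column.mean()) / column.std()


cond_raw = np.linalg.cond(design_matrix(density, alcohol))
cond_std = np.linalg.cond(
    design_matrix(standardize(density), standardize(alcohol))
)
print(f"raw: {cond_raw:.0f}, standardized: {cond_std:.2f}")
```

Standardizing changes the units of the coefficients but not the fitted values or R-squared, so it is a cheap way to make the multivariate fit numerically better behaved.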


# Univariate Linear Regression: White Wines

In [78]:
X = white_wine_data["fixed.acidity"] ## X usually means our input variables (or independent variables)
y = white_wine_data["quality"] ## Y usually means our output/dependent variable
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model

# Note the difference in argument order
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)
predictions = model.predict(X)

# Print out the statistics
model.summary()

0,1,2,3
Dep. Variable:,quality,R-squared:,0.013
Model:,OLS,Adj. R-squared:,0.013
Method:,Least Squares,F-statistic:,64.08
Date:,"Fri, 15 Mar 2019",Prob (F-statistic):,1.48e-15
Time:,19:00:51,Log-Likelihood:,-6322.8
No. Observations:,4898,AIC:,12650.0
Df Residuals:,4896,BIC:,12660.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,6.6956,0.103,65.057,0.000,6.494,6.897
fixed.acidity,-0.1193,0.015,-8.005,0.000,-0.149,-0.090

0,1,2,3
Omnibus:,29.986,Durbin-Watson:,1.657
Prob(Omnibus):,0.0,Jarque-Bera (JB):,31.513
Skew:,0.166,Prob(JB):,1.44e-07
Kurtosis:,3.211,Cond. No.,57.7


In [79]:
X = white_wine_data["volatile.acidity"] ## X usually means our input variables (or independent variables)
y = white_wine_data["quality"] ## Y usually means our output/dependent variable
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model

# Note the difference in argument order
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)
predictions = model.predict(X)

# Print out the statistics
model.summary()

0,1,2,3
Dep. Variable:,quality,R-squared:,0.038
Model:,OLS,Adj. R-squared:,0.038
Method:,Least Squares,F-statistic:,193.0
Date:,"Fri, 15 Mar 2019",Prob (F-statistic):,4.67e-43
Time:,19:00:53,Log-Likelihood:,-6260.0
No. Observations:,4898,AIC:,12520.0
Df Residuals:,4896,BIC:,12540.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,6.3540,0.036,174.320,0.000,6.283,6.425
volatile.acidity,-1.7109,0.123,-13.891,0.000,-1.952,-1.469

0,1,2,3
Omnibus:,76.271,Durbin-Watson:,1.646
Prob(Omnibus):,0.0,Jarque-Bera (JB):,84.657
Skew:,0.268,Prob(JB):,4.14e-19
Kurtosis:,3.357,Cond. No.,10.7


In [80]:
X = white_wine_data["citric.acid"] ## X usually means our input variables (or independent variables)
y = white_wine_data["quality"] ## Y usually means our output/dependent variable
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model

# Note the difference in argument order
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)
predictions = model.predict(X)

# Print out the statistics
model.summary()

0,1,2,3
Dep. Variable:,quality,R-squared:,0.0
Model:,OLS,Adj. R-squared:,-0.0
Method:,Least Squares,F-statistic:,0.4153
Date:,"Fri, 15 Mar 2019",Prob (F-statistic):,0.519
Time:,19:00:54,Log-Likelihood:,-6354.4
No. Observations:,4898,AIC:,12710.0
Df Residuals:,4896,BIC:,12730.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,5.9004,0.037,158.736,0.000,5.828,5.973
citric.acid,-0.0674,0.105,-0.644,0.519,-0.272,0.138

0,1,2,3
Omnibus:,27.428,Durbin-Watson:,1.657
Prob(Omnibus):,0.0,Jarque-Bera (JB):,29.026
Skew:,0.154,Prob(JB):,4.98e-07
Kurtosis:,3.217,Cond. No.,9.2


In [81]:
X = white_wine_data["residual.sugar"] ## X usually means our input variables (or independent variables)
y = white_wine_data["quality"] ## Y usually means our output/dependent variable
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model

# Note the difference in argument order
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)
predictions = model.predict(X)

# Print out the statistics
model.summary()

0,1,2,3
Dep. Variable:,quality,R-squared:,0.01
Model:,OLS,Adj. R-squared:,0.009
Method:,Least Squares,F-statistic:,47.06
Date:,"Fri, 15 Mar 2019",Prob (F-statistic):,7.72e-12
Time:,19:00:54,Log-Likelihood:,-6331.2
No. Observations:,4898,AIC:,12670.0
Df Residuals:,4896,BIC:,12680.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,5.9868,0.020,295.447,0.000,5.947,6.027
residual.sugar,-0.0170,0.002,-6.860,0.000,-0.022,-0.012

0,1,2,3
Omnibus:,25.795,Durbin-Watson:,1.653
Prob(Omnibus):,0.0,Jarque-Bera (JB):,29.066
Skew:,0.126,Prob(JB):,4.88e-07
Kurtosis:,3.281,Cond. No.,13.2


In [82]:
X = white_wine_data["chlorides"] ## X usually means our input variables (or independent variables)
y = white_wine_data["quality"] ## Y usually means our output/dependent variable
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model

# Note the difference in argument order
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)
predictions = model.predict(X)

# Print out the statistics
model.summary()

0,1,2,3
Dep. Variable:,quality,R-squared:,0.044
Model:,OLS,Adj. R-squared:,0.044
Method:,Least Squares,F-statistic:,225.7
Date:,"Fri, 15 Mar 2019",Prob (F-statistic):,6.51e-50
Time:,19:00:55,Log-Likelihood:,-6244.2
No. Observations:,4898,AIC:,12490.0
Df Residuals:,4896,BIC:,12510.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,6.2674,0.029,218.166,0.000,6.211,6.324
chlorides,-8.5100,0.566,-15.024,0.000,-9.620,-7.400

0,1,2,3
Omnibus:,23.609,Durbin-Watson:,1.654
Prob(Omnibus):,0.0,Jarque-Bera (JB):,26.746
Skew:,0.116,Prob(JB):,1.56e-06
Kurtosis:,3.277,Cond. No.,45.9


In [83]:
X = white_wine_data["free.sulfur.dioxide"] ## X usually means our input variables (or independent variables)
y = white_wine_data["quality"] ## Y usually means our output/dependent variable
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model

# Note the difference in argument order
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)
predictions = model.predict(X)

# Print out the statistics
model.summary()

0,1,2,3
Dep. Variable:,quality,R-squared:,0.0
Model:,OLS,Adj. R-squared:,-0.0
Method:,Least Squares,F-statistic:,0.3259
Date:,"Fri, 15 Mar 2019",Prob (F-statistic):,0.568
Time:,19:00:55,Log-Likelihood:,-6354.5
No. Observations:,4898,AIC:,12710.0
Df Residuals:,4896,BIC:,12730.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,5.8629,0.029,201.025,0.000,5.806,5.920
free.sulfur.dioxide,0.0004,0.001,0.571,0.568,-0.001,0.002

0,1,2,3
Omnibus:,27.869,Durbin-Watson:,1.658
Prob(Omnibus):,0.0,Jarque-Bera (JB):,29.413
Skew:,0.157,Prob(JB):,4.1e-07
Kurtosis:,3.214,Cond. No.,90.4


In [84]:
X = white_wine_data["total.sulfur.dioxide"] ## X usually means our input variables (or independent variables)
y = white_wine_data["quality"] ## Y usually means our output/dependent variable
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model

# Note the difference in argument order
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)
predictions = model.predict(X)

# Print out the statistics
model.summary()

0,1,2,3
Dep. Variable:,quality,R-squared:,0.031
Model:,OLS,Adj. R-squared:,0.03
Method:,Least Squares,F-statistic:,154.2
Date:,"Fri, 15 Mar 2019",Prob (F-statistic):,6.99e-35
Time:,19:00:56,Log-Likelihood:,-6278.7
No. Observations:,4898,AIC:,12560.0
Df Residuals:,4896,BIC:,12570.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,6.3817,0.042,150.356,0.000,6.299,6.465
total.sulfur.dioxide,-0.0036,0.000,-12.418,0.000,-0.004,-0.003

0,1,2,3
Omnibus:,27.91,Durbin-Watson:,1.656
Prob(Omnibus):,0.0,Jarque-Bera (JB):,35.245
Skew:,0.094,Prob(JB):,2.22e-08
Kurtosis:,3.371,Cond. No.,493.0


In [85]:
X = white_wine_data["density"] ## X usually means our input variables (or independent variables)
y = white_wine_data["quality"] ## Y usually means our output/dependent variable
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model

# Note the difference in argument order
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)
predictions = model.predict(X)

# Print out the statistics
model.summary()

0,1,2,3
Dep. Variable:,quality,R-squared:,0.094
Model:,OLS,Adj. R-squared:,0.094
Method:,Least Squares,F-statistic:,509.9
Date:,"Fri, 15 Mar 2019",Prob (F-statistic):,1.73e-107
Time:,19:00:57,Log-Likelihood:,-6112.0
No. Observations:,4898,AIC:,12230.0
Df Residuals:,4896,BIC:,12240.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,96.2771,4.003,24.049,0.000,88.429,104.125
density,-90.9424,4.027,-22.581,0.000,-98.838,-83.047

0,1,2,3
Omnibus:,56.162,Durbin-Watson:,1.654
Prob(Omnibus):,0.0,Jarque-Bera (JB):,87.563
Skew:,0.099,Prob(JB):,9.68e-20
Kurtosis:,3.624,Cond. No.,665.0


In [86]:
X = white_wine_data["pH"] ## X usually means our input variables (or independent variables)
y = white_wine_data["quality"] ## Y usually means our output/dependent variable
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model

# Note the difference in argument order
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)
predictions = model.predict(X)

# Print out the statistics
model.summary()

0,1,2,3
Dep. Variable:,quality,R-squared:,0.01
Model:,OLS,Adj. R-squared:,0.01
Method:,Least Squares,F-statistic:,48.88
Date:,"Fri, 15 Mar 2019",Prob (F-statistic):,3.08e-12
Time:,19:00:57,Log-Likelihood:,-6330.3
No. Observations:,4898,AIC:,12660.0
Df Residuals:,4896,BIC:,12680.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,4.0187,0.266,15.095,0.000,3.497,4.541
pH,0.5832,0.083,6.992,0.000,0.420,0.747

0,1,2,3
Omnibus:,26.367,Durbin-Watson:,1.639
Prob(Omnibus):,0.0,Jarque-Bera (JB):,28.569
Skew:,0.142,Prob(JB):,6.26e-07
Kurtosis:,3.244,Cond. No.,74.1


In [87]:
X = white_wine_data["sulphates"] ## X usually means our input variables (or independent variables)
y = white_wine_data["quality"] ## Y usually means our output/dependent variable
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model

# Note the difference in argument order
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)
predictions = model.predict(X)

# Print out the statistics
model.summary()

0,1,2,3
Dep. Variable:,quality,R-squared:,0.003
Model:,OLS,Adj. R-squared:,0.003
Method:,Least Squares,F-statistic:,14.15
Date:,"Fri, 15 Mar 2019",Prob (F-statistic):,0.000171
Time:,19:00:58,Log-Likelihood:,-6347.6
No. Observations:,4898,AIC:,12700.0
Df Residuals:,4896,BIC:,12710.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,5.6739,0.056,101.863,0.000,5.565,5.783
sulphates,0.4166,0.111,3.761,0.000,0.199,0.634

0,1,2,3
Omnibus:,30.264,Durbin-Watson:,1.651
Prob(Omnibus):,0.0,Jarque-Bera (JB):,32.27
Skew:,0.161,Prob(JB):,9.83e-08
Kurtosis:,3.233,Cond. No.,10.9


In [88]:
X = white_wine_data["alcohol"] ## X usually means our input variables (or independent variables)
y = white_wine_data["quality"] ## Y usually means our output/dependent variable
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model

# Note the difference in argument order
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)
predictions = model.predict(X)

# Print out the statistics
model.summary()

0,1,2,3
Dep. Variable:,quality,R-squared:,0.19
Model:,OLS,Adj. R-squared:,0.19
Method:,Least Squares,F-statistic:,1146.0
Date:,"Fri, 15 Mar 2019",Prob (F-statistic):,5.61e-226
Time:,19:00:58,Log-Likelihood:,-5839.4
No. Observations:,4898,AIC:,11680.0
Df Residuals:,4896,BIC:,11700.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,2.5820,0.098,26.345,0.000,2.390,2.774
alcohol,0.3135,0.009,33.858,0.000,0.295,0.332

0,1,2,3
Omnibus:,88.78,Durbin-Watson:,1.637
Prob(Omnibus):,0.0,Jarque-Bera (JB):,180.233
Skew:,0.031,Prob(JB):,7.29e-40
Kurtosis:,3.938,Cond. No.,91.9
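Placing the red and white univariate fits side by side shows that some features point in opposite directions for the two wine types. The sketch below lists features whose slope changes sign while being significant in both fits. The coefficient and p-value pairs are copied by hand from the summaries in this notebook (p-values as displayed, i.e. rounded to three decimals), and the 0.05 threshold is the conventional choice, used here as an assumption.

```python
# (coefficient, p-value) pairs copied from the univariate summaries in this notebook.
red = {
    "fixed.acidity": (0.0575, 0.000),
    "volatile.acidity": (-1.7614, 0.000),
    "citric.acid": (0.9385, 0.000),
    "residual.sugar": (0.0079, 0.583),
    "chlorides": (-2.2118, 0.000),
    "free.sulfur.dioxide": (-0.0039, 0.043),
    "total.sulfur.dioxide": (-0.0045, 0.000),
    "density": (-74.8460, 0.000),
    "pH": (-0.3020, 0.021),
    "sulphates": (1.1977, 0.000),
    "alcohol": (0.3608, 0.000),
}
white = {
    "fixed.acidity": (-0.1193, 0.000),
    "volatile.acidity": (-1.7109, 0.000),
    "citric.acid": (-0.0674, 0.519),
    "residual.sugar": (-0.0170, 0.000),
    "chlorides": (-8.5100, 0.000),
    "free.sulfur.dioxide": (0.0004, 0.568),
    "total.sulfur.dioxide": (-0.0036, 0.000),
    "density": (-90.9424, 0.000),
    "pH": (0.5832, 0.000),
    "sulphates": (0.4166, 0.000),
    "alcohol": (0.3135, 0.000),
}

ALPHA = 0.05  # conventional significance level

# Keep features that are significant in both fits but have opposite slope signs.
sign_flips = [
    name
    for name in red
    if red[name][1] < ALPHA
    and white[name][1] < ALPHA
    and (red[name][0] > 0) != (white[name][0] > 0)
]
print(sign_flips)
```

Differences like these are one reason to model the two datasets separately rather than pooling them.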


# Multivariate Linear Regression: White Wines

In [89]:
X = white_wine_data[["fixed.acidity","volatile.acidity","citric.acid","residual.sugar","chlorides","free.sulfur.dioxide","total.sulfur.dioxide","density","pH","sulphates","alcohol"]] ## X usually means our input variables (or independent variables)
y = white_wine_data["quality"] ## Y usually means our output/dependent variable
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model

# Note the difference in argument order
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)
predictions = model.predict(X)

# Print out the statistics
model.summary()

0,1,2,3
Dep. Variable:,quality,R-squared:,0.282
Model:,OLS,Adj. R-squared:,0.28
Method:,Least Squares,F-statistic:,174.3
Date:,"Fri, 15 Mar 2019",Prob (F-statistic):,0.0
Time:,19:01:03,Log-Likelihood:,-5543.7
No. Observations:,4898,AIC:,11110.0
Df Residuals:,4886,BIC:,11190.0
Df Model:,11,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,150.1928,18.804,7.987,0.000,113.328,187.057
fixed.acidity,0.0655,0.021,3.139,0.002,0.025,0.106
volatile.acidity,-1.8632,0.114,-16.373,0.000,-2.086,-1.640
citric.acid,0.0221,0.096,0.231,0.818,-0.166,0.210
residual.sugar,0.0815,0.008,10.825,0.000,0.067,0.096
chlorides,-0.2473,0.547,-0.452,0.651,-1.319,0.824
free.sulfur.dioxide,0.0037,0.001,4.422,0.000,0.002,0.005
total.sulfur.dioxide,-0.0003,0.000,-0.756,0.450,-0.001,0.000
density,-150.2842,19.075,-7.879,0.000,-187.679,-112.890

0,1,2,3
Omnibus:,114.161,Durbin-Watson:,1.621
Prob(Omnibus):,0.0,Jarque-Bera (JB):,251.637
Skew:,0.073,Prob(JB):,2.28e-55
Kurtosis:,4.101,Cond. No.,374000.0
