# Week 01 Example Statistics in Python 3
This notebook walks through the basic statistical analyses from week 01 (confidence intervals and a one-sample hypothesis test) using Python 3.x.

Jupyter notebooks blend Markdown (rich formatted text) with software code and output to ease the learning process.



Version: 01

Author: Chris Kennedy

In [1]:
import pandas as pd
from scipy import stats

In [2]:
dfWine = pd.read_excel(r'W1 - Wine Quality.xlsx')


In [3]:
dfWine.describe()

| | Wine # | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 |
| mean | 3249.0 | 7.215307 | 0.339666 | 0.318633 | 5.443235 | 0.056034 | 30.525319 | 115.744574 | 0.994697 | 3.218501 | 0.531268 | 10.491801 | 5.818378 |
| std | 1875.666681 | 1.296434 | 0.164636 | 0.145318 | 4.757804 | 0.035034 | 17.7494 | 56.521855 | 0.002999 | 0.160787 | 0.148806 | 1.192712 | 0.873255 |
| min | 1.0 | 3.8 | 0.08 | 0.0 | 0.6 | 0.009 | 1.0 | 6.0 | 0.98711 | 2.72 | 0.22 | 8.0 | 3.0 |
| 25% | 1625.0 | 6.4 | 0.23 | 0.25 | 1.8 | 0.038 | 17.0 | 77.0 | 0.99234 | 3.11 | 0.43 | 9.5 | 5.0 |
| 50% | 3249.0 | 7.0 | 0.29 | 0.31 | 3.0 | 0.047 | 29.0 | 118.0 | 0.99489 | 3.21 | 0.51 | 10.3 | 6.0 |
| 75% | 4873.0 | 7.7 | 0.4 | 0.39 | 8.1 | 0.065 | 41.0 | 156.0 | 0.99699 | 3.32 | 0.6 | 11.3 | 6.0 |
| max | 6497.0 | 15.9 | 1.58 | 1.66 | 65.8 | 0.611 | 289.0 | 440.0 | 1.03898 | 4.01 | 2.0 | 14.9 | 9.0 |


## Practice Question 1
What is the 99% confidence interval for the average alcohol level of a bottle of wine?

In [4]:
n = dfWine['alcohol'].count()
print("# Wines: %d" % n)  # a count is an integer, so format it as one

# Wines: 6497


In [5]:
avg = dfWine['alcohol'].mean()
print("Average Alcohol: %6.4f" % avg)

Average Alcohol: 10.4918


In [6]:
stderr = dfWine['alcohol'].sem()
print("Standard Error: %6.4f" % stderr)

Standard Error: 0.0148


In [7]:
# Two-sided 99% interval: 0.5% in each tail, so take the 0.995 quantile
t_CI = stats.t.ppf(0.995, df=n - 1)
print("Critical t-value: %6.4f" % t_CI)

Critical t-value: 2.5766
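
With 6,496 degrees of freedom the t distribution is almost indistinguishable from the standard normal, so the critical value above sits very close to the familiar z value for 99% confidence. A minimal sketch of that comparison, assuming only the sample size reported earlier (the variable names here are illustrative and not part of the original notebook):

```python
from scipy import stats

n = 6497  # sample size reported by the notebook

# 0.995 quantile of the t distribution with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.995, df=n - 1)
# Same quantile of the standard normal distribution
z_crit = stats.norm.ppf(0.995)

print("t critical: %6.4f" % t_crit)
print("z critical: %6.4f" % z_crit)
print("difference: %8.6f" % (t_crit - z_crit))
```

The t value is always slightly larger than the z value, and the gap shrinks as the sample grows.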


In [8]:
moe = t_CI * stderr
print("Margin of error: %6.4f" % moe)

Margin of error: 0.0381


In [9]:
print("Lower: %6.4f" % (avg - moe))
print("Upper: %6.4f" % (avg + moe))

Lower: 10.4537
Upper: 10.5299


In [10]:
# More concise: compute the interval directly with scipy.stats
stats.t.interval(0.99, loc=avg, scale=stderr, df=n - 1)

(10.453674609436044, 10.529927052869633)
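
The same interval can be reproduced without the Excel file, using only the summary statistics from the `describe()` output and the formula $\bar{x} \pm t_{\alpha/2,\,n-1} \cdot s / \sqrt{n}$. This is a sketch for checking the arithmetic; the variable names are illustrative, and the inputs are the rounded values shown earlier, so the last digits may differ slightly from the full-precision result.

```python
import math
from scipy import stats

# Summary statistics copied from the describe() output (rounded values)
n = 6497
mean_alcohol = 10.491801
std_alcohol = 1.192712  # sample standard deviation (pandas uses ddof=1)

confidence = 0.99
alpha = 1 - confidence

# Standard error of the mean: s / sqrt(n)
std_err = std_alcohol / math.sqrt(n)
# Two-sided critical value puts alpha / 2 in each tail
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
margin = t_crit * std_err

lower, upper = mean_alcohol - margin, mean_alcohol + margin
print("99%% CI: (%6.4f, %6.4f)" % (lower, upper))
```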

## Practice Question 2
What is the 90% confidence interval around the proportion of white wines that are rated very good quality (7 or higher)?


In [11]:
# Keep only the white wines, then count them
filteredDF = dfWine[dfWine['type'] == 'white']
n = filteredDF['type'].count()

In [12]:
# Of the white wines, keep those rated 7 or higher and count them
qfilteredDF = filteredDF[filteredDF['quality'] >= 7]
nq = qfilteredDF['type'].count()

In [13]:
proportion = nq / n
print("Proportion: %5.2f%%" % (proportion*100))

Proportion: 21.64%


In [14]:
stderr = (proportion * (1 - proportion) / n)**0.50
print("Standard error: %5.2f%%" % (stderr*100))

Standard error:  0.59%


In [15]:
z_CI = stats.norm.ppf(0.95)

In [16]:
print("Z for 90%%: %6.4f" % z_CI)

Z for 90%: 1.6449


In [17]:
stats.norm.interval(0.90, loc=proportion*100, scale=stderr*100)

(20.67364423441555, 22.6093284074791)
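
The steps in this question (proportion, standard error, z critical value, interval) can be wrapped in one small function. The sketch below is not part of the original notebook: `proportion_ci` is a hypothetical helper name, and the counts in the usage example are made up for illustration, not taken from the wine data. It uses the same normal approximation as the cells above, which is generally considered reasonable when both the number of successes and the number of failures are not small.

```python
import math
from scipy import stats

def proportion_ci(successes, n, confidence=0.90):
    """Normal-approximation confidence interval for a proportion.

    Hypothetical helper: returns (lower, upper) as fractions, not percentages.
    """
    p_hat = successes / n
    # Standard error of a sample proportion: sqrt(p(1 - p) / n)
    std_err = math.sqrt(p_hat * (1 - p_hat) / n)
    # Two-sided critical value, e.g. 0.95 quantile for 90% confidence
    z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)
    return p_hat - z_crit * std_err, p_hat + z_crit * std_err

# Illustrative counts only (not from the wine dataset)
low, high = proportion_ci(successes=50, n=200, confidence=0.90)
print("90%% CI: (%5.2f%%, %5.2f%%)" % (low * 100, high * 100))
```

With the notebook's data, the call would take `nq` and `n` as its first two arguments.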

## Practice Question 3
Can you conclude (at the 5% significance level) that the average fixed acidity for all wines is above 7.2?

First, create a simple helper function that computes the test statistic:

In [18]:
def getStatistic(x, H0, stderr):
    return (x - H0) / stderr

H$_0$: $\mu \le 7.2$

H$_a$: $\mu > 7.2$


In [19]:
H0 = 7.20
avg = dfWine['fixed acidity'].mean()
stderr = dfWine['fixed acidity'].sem()
tStatistic = getStatistic(avg, H0, stderr)
nWines = dfWine['fixed acidity'].count()
print("T-stat: %7.4f" % tStatistic)
dfWine['fixed acidity'].describe()

T-stat:  0.9517


count    6497.000000
mean        7.215307
std         1.296434
min         3.800000
25%         6.400000
50%         7.000000
75%         7.700000
max        15.900000
Name: fixed acidity, dtype: float64

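Because the alternative hypothesis is $\mu > 7.2$, this is a right-tailed test: the p-value is the area under the t distribution to the right of the test statistic.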

In [20]:
prob_value = 1 - stats.t.cdf(tStatistic, df=nWines-1)
print(prob_value)

0.17064341607837663
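
As a check on the arithmetic, the test statistic and p-value can be recomputed from the summary statistics alone. This sketch assumes the rounded values printed by `describe()` above and uses `stats.t.sf` (the survival function, equal to one minus the CDF) for the right-tail area; the variable names are illustrative.

```python
import math
from scipy import stats

# Rounded summary statistics for 'fixed acidity' from describe()
n = 6497
mean_fa = 7.215307
std_fa = 1.296434  # sample standard deviation (ddof=1)
h0 = 7.20          # hypothesized mean under the null

std_err = std_fa / math.sqrt(n)
t_stat = (mean_fa - h0) / std_err

# Right-tailed p-value: P(T >= t_stat) with n - 1 degrees of freedom
p_value = stats.t.sf(t_stat, df=n - 1)

print("T-stat:  %7.4f" % t_stat)
print("p-value: %7.4f" % p_value)
```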


Reject the null hypothesis if the p-value is less than the significance level (here, 0.05).

In [24]:
print("Reject Null" if prob_value < 0.05 else "Cannot Reject Null")

Cannot Reject Null
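
SciPy can also run the whole one-sample, right-tailed test in a single call. The sketch below is an assumption-laden illustration rather than the notebook's method: it uses synthetic data in place of the wine file, and it relies on the `alternative` argument of `stats.ttest_1samp`, which is available in SciPy 1.6 and later. It then confirms that the result agrees with the manual calculation used in this question.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for dfWine['fixed acidity'] (illustrative only)
rng = np.random.default_rng(0)
sample = rng.normal(loc=7.2, scale=1.3, size=500)

h0 = 7.2

# One-call version: H_a is "mean greater than h0"
result = stats.ttest_1samp(sample, popmean=h0, alternative="greater")

# Manual version, mirroring the steps in the notebook
manual_t = (sample.mean() - h0) / stats.sem(sample)
manual_p = stats.t.sf(manual_t, df=sample.size - 1)

print("scipy : t = %7.4f, p = %6.4f" % (result.statistic, result.pvalue))
print("manual: t = %7.4f, p = %6.4f" % (manual_t, manual_p))
```

On the real data, the first argument would be `dfWine['fixed acidity']` instead of the synthetic sample.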


### End of Notebook!