## Classifying Whisky
In this case study, we will classify scotch whiskies
based on their flavor characteristics.
The dataset we'll be using contains a selection of scotch whiskies
from several distilleries, and we'll attempt to cluster whiskies
into groups that are similar in flavor.
This case study will deepen your understanding of Pandas, NumPy,
and scikit-learn, and perhaps of scotch whisky.
You'll also get to try out Bokeh, an interactive visualization
library for web browsers.

The dataset we'll be using consists of tasting ratings
of one readily available single malt scotch whisky
from almost every active whisky distillery in Scotland.
The resulting dataset has 86 malt whiskies
that are scored between 0 and 4 in 12 different taste categories.
The scores have been aggregated from 10 different tasters.
The taste categories describe whether the whiskies are sweet, smoky,
medicinal, spicy, and so on.

### Getting Started with Pandas

`Pandas` is a Python library that provides data structures and functions
for working with structured data, primarily tabular data.
Pandas is built on top of `NumPy`, and some familiarity with NumPy
makes Pandas easier to use and understand.

Pandas has two data structures whose basics you need to know:
`Series` and `DataFrame` (written Data Frame in the running text).

In short, Series is a one-dimensional array-like object,
and Data Frame is a two-dimensional array-like object.
Both objects also contain additional information about the data,
called `metadata`.

#### Series

In [1]:
import pandas as pd

In [2]:
x=pd.Series([6,3,8,6])
x

0    6
1    3
2    8
3    6
dtype: int64

Here the data array is shown in the right column,
and the left column shows the index, which is an array of data labels.
Because we didn't specify an index explicitly,
Pandas is using the default index, which is
a sequence of integers starting at 0, increasing one
by one for every subsequent row.
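
As a small sketch of the default index described above (the variable names are illustrative, not from the course material), the index can be inspected directly. In current Pandas versions the default index is a `RangeIndex`.

```python
import pandas as pd

# A Series built without an explicit index gets the default integer index.
s = pd.Series([6, 3, 8, 6])

# The default labels count up from 0 in steps of 1.
default_labels = list(s.index)
print(default_labels)            # [0, 1, 2, 3]
print(type(s.index).__name__)    # RangeIndex
```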

Let's now specify an index explicitly.

In [4]:
x=pd.Series([6,3,8,6], index=["q","w","e","r"])
x

q    6
w    3
e    8
r    6
dtype: int64

You can use the index labels to select a single value or a set of values.

In [5]:
x["w"]

3

If we would like to have multiple entries, we construct a list,
and inside the list, we enter the entries we are interested in.

In [6]:
x[["w","r"]]

w    3
r    6
dtype: int64
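
The same label-based selections can also be written with the `.loc` indexer, which makes it explicit that labels rather than positions are meant. This is only a sketch of an equivalent spelling; the names `single` and `subset` are illustrative.

```python
import pandas as pd

x = pd.Series([6, 3, 8, 6], index=["q", "w", "e", "r"])

# One label returns a single value.
single = x.loc["w"]

# A list of labels returns a new Series with those entries.
subset = x.loc[["w", "r"]]

print(single)                # 3
print(list(subset.index))    # ['w', 'r']
```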

There are many ways to construct a Series object in Pandas.
A common way is by passing a dictionary.
The index of the resulting Series consists of the keys of the dictionary,
and the values are the corresponding value objects in the dictionary.
As the output below shows, recent versions of Pandas keep the keys
in insertion order; older versions sorted them.

In [7]:
age={"Tim":29,"Jim":31,"Pam":27,"Sam":35}

In [8]:
x=pd.Series(age)
x

Tim    29
Jim    31
Pam    27
Sam    35
dtype: int64
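
When constructing a Series from a dictionary, an explicit `index` argument can be used to choose and order the keys. The sketch below assumes a made-up label, "Kim", that is not in the dictionary, to show what happens to missing keys.

```python
import pandas as pd

age = {"Tim": 29, "Jim": 31, "Pam": 27, "Sam": 35}

# index= selects and orders the keys. "Kim" is a hypothetical label that is
# not in the dictionary, so its value becomes NaN and the dtype turns float.
by_name = pd.Series(age, index=["Jim", "Kim", "Pam"])
print(by_name)
```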

#### Data Frame
Data Frames represent table-like data, and they have both a row index
and a column index.
Like with Series, there are many ways to construct a Data Frame.
A common way is by passing a dictionary where the value objects are
lists or NumPy arrays of equal length.

In [9]:
data={'name':['Tim','Jim','Pam','Sam'],
     'age':[29,31,27,35],
     'ZIP':['02115','02130','67700','00100']}

In [10]:
x=pd.DataFrame(data,columns=["name","age","ZIP"])
x

  name  age    ZIP
0  Tim   29  02115
1  Jim   31  02130
2  Pam   27  67700
3  Sam   35  00100
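
The dictionary values may also be NumPy arrays, as mentioned above. The following sketch uses a shortened version of the same data and shows that the `columns` argument controls the column order; the choice to put "age" first is purely illustrative.

```python
import numpy as np
import pandas as pd

data = {"name": np.array(["Tim", "Jim", "Pam", "Sam"]),
        "age": np.array([29, 31, 27, 35])}

# columns= fixes the column order; here "age" is placed first.
df = pd.DataFrame(data, columns=["age", "name"])

print(list(df.columns))   # ['age', 'name']
print(list(df.index))     # [0, 1, 2, 3]
```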


We can retrieve a column by using dictionary-like notation,
or we can specify the name of the column as an attribute of the Data Frame.

In [11]:
x["name"]

0    Tim
1    Jim
2    Pam
3    Sam
Name: name, dtype: object

The alternative approach is to use the attribute notation.
Typing `x.name` gives identical output.

In [12]:
x.name

0    Tim
1    Jim
2    Pam
3    Sam
Name: name, dtype: object
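
One caveat worth sketching: attribute notation only works when the column name does not clash with an existing DataFrame attribute or method. The column name "count" below is a hypothetical example chosen because `DataFrame.count` is a real method.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Tim", "Jim"], "count": [2, 5]})

# Bracket notation always returns the column.
col = df["count"]

# Attribute notation finds the DataFrame.count method instead of the column,
# because existing attributes take precedence over column names.
attr = df.count

print(list(col))        # [2, 5]
print(callable(attr))   # True
```

Bracket notation is therefore the safer choice when column names are not known in advance.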

Let's continue with the Series object that we entered previously.

In [13]:
x=pd.Series([6,3,8,6], index=["q","w","e","r"])
x

q    6
w    3
e    8
r    6
dtype: int64

In [14]:
x.index

Index(['q', 'w', 'e', 'r'], dtype='object')

We can take the index and construct a new Python list
that consists of the same letters, now ordered alphabetically.
Passing that list to `reindex` returns a new Series
whose rows follow the sorted order.

In [15]:
sorted(x.index)

['e', 'q', 'r', 'w']

In [16]:
x.reindex(sorted(x.index))

e    8
q    6
r    6
w    3
dtype: int64
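
Two related behaviours can be sketched here. `sort_index` gives the same ordering in one call, and `reindex` inserts NaN for any requested label that the Series does not have. The label "z" is made up for the illustration.

```python
import pandas as pd

x = pd.Series([6, 3, 8, 6], index=["q", "w", "e", "r"])

# sort_index() orders the rows by label, matching reindex(sorted(x.index)).
by_label = x.sort_index()

# reindex() with a label that is absent ("z", hypothetical) yields NaN there.
with_missing = x.reindex(["e", "q", "z"])

print(list(by_label.index))   # ['e', 'q', 'r', 'w']
print(with_missing)
```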

Series and Data Frame objects support arithmetic operations like addition.
If we, for example, add two Series objects together,
the data alignment happens by index.
What that means is that entries in the Series that have the same index
are added together in the same way we might add elements of a NumPy array.
If an index label appears in only one of the two Series, however, Pandas
introduces a NaN (not a number) value at that label in the resulting Series.
This is easy to understand through an example.

In [17]:
x=pd.Series([6,3,8,6], index=["q","w","e","r"])
y=pd.Series([7,3,5,2], index=["e","q","r","t"])

In [18]:
x

q    6
w    3
e    8
r    6
dtype: int64

In [19]:
y

e    7
q    3
r    5
t    2
dtype: int64

In [28]:
x+y

e    15.0
q     9.0
r    11.0
t     NaN
w     NaN
dtype: float64

In this case, both x and y have indices e, q, and r,
but t appears only in y and w appears only in x.
This is why the entries corresponding to indices t and w
appear as NaN in the output.
Because NaN is a floating-point value, the result has dtype float64
rather than int64.
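
If NaN is not the desired result for unmatched labels, the `add` method accepts a `fill_value` that stands in for the missing side. The sketch below treats a missing entry as 0, which is an assumption about what makes sense for the data rather than a general rule.

```python
import pandas as pd

x = pd.Series([6, 3, 8, 6], index=["q", "w", "e", "r"])
y = pd.Series([7, 3, 5, 2], index=["e", "q", "r", "t"])

# Labels present in only one Series are paired with fill_value instead of NaN.
total = x.add(y, fill_value=0)
print(total)
```

Here t and w keep their original values (2 and 3) instead of becoming NaN, while e, q, and r are summed as before.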

If you construct a similar example for a Data Frame,
you'll see that arithmetic operations work the same way for them.
Pandas has many more features:
you can summarize data, compute correlations, handle missing data,
use hierarchical indexing, and much more.
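
The Data Frame case mentioned above can be sketched as follows. The two small frames and their labels are invented for the illustration; alignment happens on both the row index and the column index, so only cells present in both frames receive a sum.

```python
import pandas as pd

a = pd.DataFrame({"u": [1, 2], "v": [3, 4]}, index=["q", "w"])
b = pd.DataFrame({"v": [10, 20], "z": [30, 40]}, index=["w", "e"])

# Only row "w", column "v" exists in both frames: 4 + 10 = 14.
# Every other cell of the 3 x 3 result is NaN.
total = a + b
print(total)
```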

### Loading and Inspecting Data

The data come in two files, whiskies.txt and regions.txt.
The regions file contains the region in which each of the whiskies was produced.
The whiskies file contains all other details about the whiskies.

In [29]:
import numpy as np
import pandas as pd

In [36]:
import os
os.getcwd()

'C:\\Users\\user\\Documents\\jupyter\\HavardX-Using-Python-For-Research\\Week 3 - Case Study'

In [32]:
pd.read_csv("whiskies.txt")


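FileNotFoundError: [Errno 2] File whiskies.txt does not exist: 'whiskies.txt'

The call fails because whiskies.txt is not in the current working directory
reported by `os.getcwd()` above.
Place both files in that directory, or pass the full path to `pd.read_csv`.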

In [None]:
regions=pd.read_csv("regions.txt")
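
One way to make the file location explicit is to build the path from a data directory and check it before reading. The helper `read_table`, its `data_dir` parameter, and the throwaway file with columns "Distillery" and "Smoky" are all hypothetical; the real layout of whiskies.txt and regions.txt is not shown here.

```python
from pathlib import Path
import tempfile

import pandas as pd


def read_table(filename, data_dir="."):
    """Read a CSV file from data_dir, failing with a clear message if absent."""
    path = Path(data_dir) / filename
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found; current directory is {Path.cwd()}")
    return pd.read_csv(path)


# Demonstration with a made-up file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "demo.txt").write_text("Distillery,Smoky\nA,2\nB,0\n")
    demo = read_table("demo.txt", data_dir=tmp)

print(demo.shape)   # (2, 2)
```

With the real files in place, the same helper could be called as `read_table("whiskies.txt", data_dir=...)` with whatever directory holds the data.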