# Week 4 Notes

## 4.1.1: Getting Started with Pandas

In this case study, we will attempt to group different samples of whiskey using their flavor characteristics.

``pandas`` is built on top of NumPy and is a great tool for data analysis.
- ``pandas.Series`` is a 1-dimensional array-like object with a name and index.
- ``pandas.DataFrame`` is a 2-dimensional array-like object with column and row labels.

To create a ``pandas.Series``, we can use the ``pandas.Series()`` function.

In [None]:
import pandas as pd


x = pd.Series([1, 2, 3, 4, 5])
x

In [None]:
# Using explicit indices
y = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
y

In [None]:
# Using a dictionary
z = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5})
z

To create a ``pandas.DataFrame``, we can use the ``pandas.DataFrame()`` function.

In [None]:
# Using a dictionary where the values are lists (can also be 1D numpy arrays)
z = pd.DataFrame({
    "name": ["John", "Mary", "Mark"],
    "age": [30, 40, 50],
    "ZIP": [12345, 23456, 34567]
})
z

We can get the index of a ``pandas.Series`` or ``pandas.DataFrame`` using the ``.index`` attribute. Using ``sorted()``, we can sort the index and create a list of the sorted indices.
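As a minimal sketch of the two steps just described, the example below reads the index of a small series and sorts its labels. The series ``s`` and its values are made up for illustration and are not part of the case study data.

```python
import pandas as pd

# Hypothetical series whose labels are deliberately out of order
s = pd.Series([3, 1, 2], index=["c", "a", "b"])

labels = s.index                 # the Index object holding the labels
sorted_labels = sorted(s.index)  # a plain Python list of labels in sorted order

print(list(labels))
print(sorted_labels)
```

Note that ``sorted()`` returns a new list and leaves the series itself unchanged.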

We can also reorder a ``pandas.Series`` or ``pandas.DataFrame`` using the ``.reindex()`` method.
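A short sketch of ``.reindex()`` on a made-up series (the name ``s`` and its values are illustrative only). It shows reordering by a sorted list of labels, and what happens when a requested label does not exist in the original.

```python
import pandas as pd

# Hypothetical series whose labels are deliberately out of order
s = pd.Series([3, 1, 2], index=["c", "a", "b"])

# Reorder the entries so that the labels appear in sorted order
reordered = s.reindex(sorted(s.index))

# Labels that are absent from the original series are filled with NaN
extended = s.reindex(["a", "b", "c", "d"])

print(reordered)
print(extended)
```

``.reindex()`` returns a new object, so ``s`` keeps its original order unless the result is assigned back to it.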

## 4.1.2: Loading and Inspecting Data

We will now load and inspect the data stored in ``whiskies.txt`` and ``regions.txt``, both of which are CSV-formatted.

In [None]:
import numpy as np
import pandas as pd

whiskies = pd.read_csv('whiskies.txt')
whiskies["Region"] = pd.read_csv('regions.txt')
whiskies

We can use the ``.head()`` method to view the first few rows of the data. We can use the ``.tail()`` method to view the last few rows of the data.

In [None]:
whiskies.head()

In [None]:
whiskies.tail()

We would like to see the specific subset of the ``whiskies`` dataframe that corresponds to the flavors of whiskies. To do this, we can create a new dataframe using the following code:

In [None]:
flavors = whiskies.iloc[:, 2:14]
flavors
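To make the positional slice explicit, here is a small sketch on a made-up dataframe. The column names and values are hypothetical and only mimic the idea of flavor columns sitting between identifier columns. ``.iloc`` selects by integer position and excludes the end of the slice, whereas a label-based ``.loc`` slice includes both ends.

```python
import pandas as pd

# Hypothetical table: two identifier columns, two flavor columns, one trailing column
toy = pd.DataFrame({
    "RowID": [1, 2],
    "Distillery": ["A", "B"],
    "Body": [2, 3],
    "Sweetness": [2, 3],
    "Postcode": ["X1", "Y2"],
})

# Positions 2 and 3 are selected; position 4 is excluded
by_position = toy.iloc[:, 2:4]

# The equivalent label-based slice includes both end labels
by_label = toy.loc[:, "Body":"Sweetness"]

print(by_position)
```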

## 4.1.3: Exploring Correlations

We want to find out if there are any strong linear correlations between the different taste attributes of each whisky. We can use the ``.corr()`` method to find the correlation between each pair of columns, and by default, this method uses the Pearson correlation coefficient.
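As a quick check of what the Pearson coefficient measures, the sketch below computes it by hand for one pair of columns and compares it with the result of ``.corr()``. The dataframe ``toy``, its column names, and the helper ``pearson`` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical flavor scores for five whiskies
toy = pd.DataFrame({
    "Smoky": [0, 1, 2, 3, 4],
    "Sweet": [4, 3, 2, 1, 0],
    "Body": [1, 2, 2, 3, 4],
})

def pearson(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return numerator / denominator

manual = pearson(toy["Smoky"], toy["Body"])
builtin = toy.corr().loc["Smoky", "Body"]

print(manual, builtin)
```

In this toy data, ``Smoky`` and ``Sweet`` move in exactly opposite directions, so their correlation is -1, while every column has a correlation of 1 with itself.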

In [None]:
corr_flavors = flavors.corr()
corr_flavors

The above output corresponds to a correlation matrix. Let us plot this matrix.

In [None]:
import matplotlib.pyplot as plt


plt.figure(figsize=(10, 10))
plt.pcolor(corr_flavors)
plt.colorbar()
plt.show()

We now have a plot where we can see the correlation between each pair of taste attributes. We can also transpose the ``flavors`` dataframe before calling ``.corr()`` to find correlations between the whiskies with respect to their flavors (this can also be interpreted as the correlations between the whisky distilleries in terms of the flavor profiles they produce).

In [None]:
corr_whiskies = flavors.T.corr()
corr_whiskies

In [None]:
plt.figure(figsize=(10, 10))
plt.pcolor(corr_whiskies)
plt.colorbar()
plt.show()
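The effect of the transpose can be seen on a small made-up table (the names and values below are hypothetical). ``.corr()`` always correlates columns, so transposing first turns each whisky into a column and yields a whisky-by-whisky matrix instead of a flavor-by-flavor one.

```python
import pandas as pd

# Hypothetical scores: 3 whiskies (rows) by 4 flavors (columns)
toy = pd.DataFrame(
    {
        "Body": [2, 3, 1],
        "Sweetness": [2, 3, 3],
        "Smoky": [2, 1, 2],
        "Fruity": [0, 2, 3],
    },
    index=["A", "B", "C"],
)

flavor_corr = toy.corr()    # correlations between the 4 flavor columns
whisky_corr = toy.T.corr()  # correlations between the 3 whiskies

print(flavor_corr.shape, whisky_corr.shape)
```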

## 4.1.4: Clustering Whiskies by Flavor Profile

Spectral co-clustering is a method for grouping the rows and columns of a matrix into clusters at the same time. scikit-learn provides it as the ``sklearn.cluster.SpectralCoclustering`` class.

Although this problem is computationally expensive to solve directly, an approximate solution can be found using the eigenvalues and eigenvectors of an adjacency matrix.
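Before applying the method to the whisky data, it may help to see it on a toy matrix where the answer is obvious. The sketch below builds a made-up 6 by 6 matrix with two strong diagonal blocks on a weak background and asks for two clusters. The names ``toy`` and ``toy_model`` and all values are illustrative assumptions, not part of the case study.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

# Hypothetical matrix: rows 0-2 relate strongly to columns 0-2,
# rows 3-5 relate strongly to columns 3-5, everything else is weak
toy = np.full((6, 6), 0.1)
toy[:3, :3] = 1.0
toy[3:, 3:] = 1.0

toy_model = SpectralCoclustering(n_clusters=2, random_state=0)
toy_model.fit(toy)

toy_labels = toy_model.row_labels_            # cluster label of each row
toy_sizes = np.sum(toy_model.rows_, axis=1)   # number of rows in each cluster

print(toy_labels)
print(toy_sizes)
```

The first three rows should share one label and the last three rows the other. Which block receives label 0 and which receives label 1 is arbitrary.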

In [None]:
from sklearn.cluster import SpectralCoclustering


model = SpectralCoclustering(n_clusters=6, random_state=0)  # create a spectral co-clustering model with 6 clusters (representing the 6 regions)

model.fit(corr_whiskies) # fit the model to the whiskies data

model.rows_  # see the clusters as rows and the individual whiskies as columns, with "True" denoting that the whisky belongs to a certain cluster and "False" denoting that it does not

In [None]:
np.sum(model.rows_, axis=1) # the number of whiskies in each cluster

In [None]:
model.row_labels_ # the output denotes the cluster that each whisky belongs to