<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Learning-Objectives---Beer-PCA" data-toc-modified-id="Learning-Objectives---Beer-PCA-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Learning Objectives - Beer PCA</a></span><ul class="toc-item"><li><span><a href="#Exercise-Overview" data-toc-modified-id="Exercise-Overview-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Exercise Overview</a></span></li><li><span><a href="#In-More-Detail" data-toc-modified-id="In-More-Detail-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>In More Detail</a></span></li></ul></li></ul></div>

---
# Learning Objectives - Beer PCA
At the end of [this exercise](./NB3_EXERCISE_Beer_PCA.ipynb) students will be able to:
1. Retrieve process data from multiple assets using a cloud-based platform (OSIsoft Cloud Services or OCS) which stores real industrial sensor data.
2. Perform basic operations using Pandas, a software library widely used in data science applications.
3. Identify problems with data quality.
4. Perform data cleansing operations to ensure that the quality of the data fits the analysis goals.
5. Perform Principal Component Analysis (PCA) to visualize high-dimensional datasets and enhance process analytics.
6. Discuss how PCA works.
7. Apply the k-means clustering algorithm to classify data (BONUS).


---
## Exercise Overview
It is often useful to determine quickly whether any production batches are out-of-spec and, if so, which ones. The ability to trace the origin or cause of a process failure or a bad batch is valuable to industry. In this [exercise](./NB3_EXERCISE_Beer_PCA.ipynb), students will use Principal Component Analysis (PCA) to determine which batches are out-of-spec without having to analyze the time series data for every attribute of each beer batch, of which there are dozens. Students will then use the properties of PCA to identify what may have caused a batch to be out-of-spec.

The exercise notebook also has blocks of missing code which students will have to complete in order for the notebook to run properly and produce results; this ensures that students will interact with the notebook and pay attention to the algorithm and how the code is written.

 
**This [exercise](./NB3_EXERCISE_Beer_PCA.ipynb) has six total sections, including a [bonus section](./NB3_EXERCISE_Beer_PCA.ipynb#section_6).** 

In [Part 1](./NB3_EXERCISE_Beer_PCA.ipynb#section_1), specify parameters of interest such as brands, fermentors, the time period to analyze, time granularity, and attributes.

In [Part 2](./NB3_EXERCISE_Beer_PCA.ipynb#section_2), use OSIsoft Cloud Services (OCS) to obtain process data from Deschutes Brewery.

In [Part 3](./NB3_EXERCISE_Beer_PCA.ipynb#section_3), create utility functions for calculating the time elapsed per batch, identifying fermentation stages without relying on the "Status" labels, removing the bad batches, and summarizing the data.

In [Part 4](./NB3_EXERCISE_Beer_PCA.ipynb#section_4), execute the functions created in Part 3 for each brand in each fermentor vessel. Collect the summary of the data into the dataframe `FeatureDF`.

In [Part 5](./NB3_EXERCISE_Beer_PCA.ipynb#section_5), `FeatureDF` is a high-dimensional dataset that is not easily human-processable. Use Principal Component Analysis (PCA) to obtain new components composed of linear combinations of the most "relevant" input features in `FeatureDF` ([5a](./NB3_EXERCISE_Beer_PCA.ipynb#section_5a)-[5c](./NB3_EXERCISE_Beer_PCA.ipynb#section_5c)); use those components to graphically visualize `FeatureDF` and identify the bad batches ([5d](./NB3_EXERCISE_Beer_PCA.ipynb#section_5d)); then identify the input features which contributed to the out-of-spec characteristics of the outlier batches ([5e](./NB3_EXERCISE_Beer_PCA.ipynb#section_5e)).

In [Part 6 (BONUS)](./NB3_EXERCISE_Beer_PCA.ipynb#section_6), use the k-means clustering algorithm to classify data.
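
The per-batch utilities from Parts 3 and 4 can be pictured with a small sketch. It is not the notebook's code: the toy data and the column names `timestamp`, `temperature`, `elapsed_h`, `temp_mean`, `temp_max` and `duration_h` are assumptions chosen for illustration. It shows one way to compute the time elapsed within each batch and to condense each batch into a single row of summary features, which is the general shape of a table like `FeatureDF`.

```python
import pandas as pd

# Toy stand-in for fermentor data; column names and values are illustrative.
raw = pd.DataFrame(
    {
        "batch": ["A", "A", "A", "B", "B", "B"],
        "timestamp": pd.to_datetime(
            [
                "2020-01-01 00:00", "2020-01-01 06:00", "2020-01-01 12:00",
                "2020-01-05 00:00", "2020-01-05 06:00", "2020-01-05 12:00",
            ]
        ),
        "temperature": [20.0, 21.0, 19.0, 20.0, 24.0, 22.0],
    }
)

# Hours elapsed since the first sample of each batch.
batch_start = raw.groupby("batch")["timestamp"].transform("min")
raw["elapsed_h"] = (raw["timestamp"] - batch_start).dt.total_seconds() / 3600

# One row of summary features per batch.
features = raw.groupby("batch").agg(
    temp_mean=("temperature", "mean"),
    temp_max=("temperature", "max"),
    duration_h=("elapsed_h", "max"),
)
print(features)
```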

**Furthermore, at the end of the exercise, students will be asked to think about the following questions:** 
1. Why do you think there are bad batches?
2. Does it seem like there is equipment failure? Process failure?
3. What would you do to determine the root cause of a process or equipment failure?

---
## In More Detail

**Students will learn how to initialize OSIsoft Cloud Services (OCS) by authenticating with client credentials and directing calls to the right namespace:**

```python
import configparser

config = configparser.ConfigParser()
config.read("config.ini")

hub_client = HubClient(
    config.get("Access", "ApiVersion"),
    config.get("Access", "Tenant"),
    config.get("Access", "Resource"),
    config.get("Credentials", "ClientId"),
    config.get("Credentials", "ClientSecret"),
)

namespace_id = config.get("Configurations", "Namespace")
```
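
For orientation, here is a sketch of what a matching `config.ini` could look like, parsed from a string so the snippet runs on its own. The section and key names are taken from the `config.get(...)` calls above; every value is a placeholder assumption, and the `REQUIRED` check is a hypothetical convenience rather than part of the notebook.

```python
import configparser

# Hypothetical config.ini contents; every value below is a placeholder.
SAMPLE_CONFIG = """
[Access]
ApiVersion = v1
Tenant = your-tenant-id
Resource = https://example.invalid

[Credentials]
ClientId = your-client-id
ClientSecret = your-client-secret

[Configurations]
Namespace = your-namespace-id
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_CONFIG)

# Settings read by the initialization code, grouped by section.
REQUIRED = {
    "Access": ["ApiVersion", "Tenant", "Resource"],
    "Credentials": ["ClientId", "ClientSecret"],
    "Configurations": ["Namespace"],
}

# List any setting that is absent, so a typo in the file is caught early.
missing = [
    f"{section}.{key}"
    for section, keys in REQUIRED.items()
    for key in keys
    if not config.has_option(section, key)
]
print("missing settings:", missing)
```

Keeping the client secret in a local file rather than in the notebook avoids committing it alongside the code.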

**Students will then use OCS to retrieve process data from multiple assets in Deschutes Brewery:**
```python
dv_ids = [DATAVIEW_PREFIX + str(vessel) for vessel in FERMENTORS_OF_INTEREST]

all_brands_df = hub_client.dataviews_interpolated_pd(
    namespace_id, dv_ids, START_INDEX, END_INDEX, INDEX_INTERVAL, chunks=1
)
```

**The filtering expressions students will have to utilize are also more complex. For example, students will have to complete filtering expressions of this type:**
```python
# we only want batches in `valid_batches`
# TODO: Complete the filter expression
# =========== STUDENT BEGIN ==========
offset_df = offset_df[offset_df["batch"].isin(valid_batches)]
# =========== STUDENT END ==========
```

**In previous notebooks, similar filtering was accomplished with loops of more basic filter expressions. Students will also work with more advanced Pandas commands and different strategies for dealing with mislabeled but valid data.**

```python
# relabel batch statuses as either 'pre-cooling' or 'cooling'
for __, row in cooling_starts.iterrows():

    # label for current batch
    current_batch = row["batch"]

    # when cooling starts for current batch
    cooling_start_time = row["tsf"]

    # get booleans for current batch
    boolean = offset_df["batch"] == current_batch

    # relabel Status as either "Precooling" or "Cooling"
    # TODO: Complete the expression
    # HINT: Status should be "Precooling" if time < cooling start time, "Cooling" otherwise
    # =========== STUDENT BEGIN ==========
    offset_df.loc[boolean, "Status"] = offset_df.loc[boolean, "tsf"].apply(
        lambda t: "Precooling" if t < cooling_start_time else "Cooling"
    )
    # =========== STUDENT END ==========
```
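
The two Pandas patterns above can be tried on toy data. The sketch below reuses the column names `batch`, `tsf` and `Status` from the excerpts, but the values, the `cooling_start` mapping and the use of `numpy.where` are assumptions made for illustration. It compares a loop of basic filters with a single `isin()` expression, then relabels `Status` for all batches at once instead of row by row.

```python
import numpy as np
import pandas as pd

# Toy data; column names mirror the excerpts above, values are made up.
offset_df = pd.DataFrame(
    {
        "batch": [1, 1, 2, 2, 3, 3],
        "tsf": [0.0, 10.0, 0.0, 10.0, 0.0, 10.0],
        "Status": ["?"] * 6,
    }
)
valid_batches = [1, 3]

# Loop of basic filter expressions, OR-ed together one batch at a time.
mask = pd.Series(False, index=offset_df.index)
for b in valid_batches:
    mask = mask | (offset_df["batch"] == b)
looped = offset_df[mask]

# The same selection as a single isin() expression.
filtered = offset_df[offset_df["batch"].isin(valid_batches)].copy()

# Vectorized relabel: look up each row's cooling start time by batch,
# then compare it with the row's time value.
cooling_start = {1: 5.0, 3: 20.0}  # hypothetical start time per batch
start_per_row = filtered["batch"].map(cooling_start)
filtered["Status"] = np.where(
    filtered["tsf"] < start_per_row, "Precooling", "Cooling"
)
print(filtered)
```

Both selections keep the same rows; the `isin()` form is shorter and avoids building the mask by hand.
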

**Students will perform principal component analysis using the scikit-learn library in Python, which is widely used in machine learning applications:**

![](https://academicpi.blob.core.windows.net/images/NB3_PCA_Visualization.png)
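
A minimal sketch of this step with scikit-learn follows. The data are synthetic, the feature names `f1` to `f4` are hypothetical, and standardizing the features before PCA is an assumption about the workflow rather than a quote from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic per-batch features: 20 batches, 4 features.
rng = np.random.default_rng(0)
values = rng.normal(size=(20, 4))
values[-1] += 6.0  # shift the last batch away from the others
feature_df = pd.DataFrame(values, columns=["f1", "f2", "f3", "f4"])

# Put every feature on a comparable scale, then project onto 2 components.
scaled = StandardScaler().fit_transform(feature_df)
pca = PCA(n_components=2)
scores = pca.fit_transform(scaled)

print(scores.shape)                   # one 2-D point per batch
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```

Plotting the two columns of `scores` against each other gives a two-dimensional view of the batches, in which points far from the main group are candidates for closer inspection.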

**Besides going through how to use scikit-learn to perform PCA, the notebook also contains high-level discussions of PCA, with visual supplements, to ensure that students get a good grasp of what PCA accomplishes.**

![](https://academicpi.blob.core.windows.net/images/NB3_Coefficient_Schematic.png)

**There are also graphical representations of operations that may not be intuitive:**

![](https://academicpi.blob.core.windows.net/images/NB3_Scaling_Schematic.png)

**Finally, students will apply some of their knowledge of PCA to determine which of the input features were responsible for the out-of-spec qualities of bad batches.**

![](https://academicpi.blob.core.windows.net/images/NB3_ContributionPlot.PNG)
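
One common way to compute such contributions is sketched below; it is an assumption about the approach, not necessarily the notebook's exact method. Each score of a batch is a weighted sum of its centered feature values, so the individual terms of that sum show how much each feature pushes the batch along a component. The data and the feature names are synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 30 batches, 3 hypothetical features.
rng = np.random.default_rng(1)
names = ["f1", "f2", "f3"]
X = rng.normal(size=(30, 3))
X[0, 2] += 8.0  # make feature f3 unusual for batch 0

Z = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(Z)
scores = pca.transform(Z)

# Per-feature terms of batch 0's scores:
# scores[0, k] = sum_j (Z[0, j] - mean_[j]) * components_[k, j]
contrib = (Z[0] - pca.mean_) * pca.components_  # shape (2, 3)

for k, row in enumerate(contrib):
    print(f"PC{k + 1}:", dict(zip(names, row.round(2))))
```

Summing each row of `contrib` recovers the batch's score on that component, which is a useful sanity check when building a contribution plot.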

**Also included in the notebook are some cells describing how to perform k-means clustering using Python. This is a bonus feature.**
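
As a rough illustration of the bonus topic, the sketch below runs scikit-learn's `KMeans` on two synthetic, well-separated groups of points. The data, the choice of two clusters and the parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic groups of 2-D points, far apart from each other.
rng = np.random.default_rng(2)
group_a = rng.normal(loc=0.0, scale=0.3, size=(10, 2))
group_b = rng.normal(loc=5.0, scale=0.3, size=(10, 2))
points = np.vstack([group_a, group_b])

# Fit k-means with two clusters and assign each point to a cluster.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print(labels)
print(kmeans.cluster_centers_)
```

The cluster numbers themselves are arbitrary; what matters is which points share a label.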