# Solutions: Unit 8
-------------------

Complete the problems below in your copy of the Jupyter Notebook.

## Problem 8.1. 

Image analysis can be used to digitize a printed graph. The reference book "Permeability and Other Film Properties" (1995) provides information on gas transmission through polymers. Use the file `eva_otr.png`, a scan of page 224 that shows the oxygen barrier of ethylene-co-vinyl acetate copolymers as a function of the vinyl acetate percentage.

1. Load the image, convert it to black/white, crop it to fit the graph, and display the resulting cropped image
2. Analyze the image to find the $x$, $y$ coordinates of the plotted points
3. Use regression analysis to fit a function to the data
4. Plot the identified points and regression line

It may be helpful to know that a column from a `pandas.DataFrame` can be converted to a `numpy.ndarray` by selecting the column and calling its `to_numpy()` method: `df['colname'].to_numpy()`.
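
As a minimal sketch of that conversion, the example below uses a small, hypothetical `DataFrame` (the values are placeholders, not data from the scanned graph). It also shows the `reshape(-1, 1)` step that `sklearn` typically needs, since its estimators expect a 2D feature matrix.

```python
import numpy as np
import pandas as pd

# hypothetical example data, not taken from the scanned graph
df = pd.DataFrame({'colname': [1.0, 2.0, 3.0]})

# select the column (a pandas.Series) and convert it to a numpy.ndarray
arr = df['colname'].to_numpy()

# sklearn estimators expect a 2D feature matrix, so reshape to a single column
X = arr.reshape(-1, 1)

print(type(arr).__name__, arr.shape, X.shape)
```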

In [None]:
# problem 8.1. solution

from skimage import io, measure
from skimage.filters import threshold_otsu
from sklearn import linear_model, preprocessing
import numpy as np
from scipy import ndimage
import matplotlib.pyplot as plt
import pandas as pd

# load the raw image in grayscale
img = io.imread('../../data/eva_otr.png', as_gray=True)

# convert the image to black and white
# because the points are black (low number) on white (high number), 
# we select points less than the threshold
threshold = threshold_otsu(img)
img_bw = img < threshold

# display image
fig, ax = plt.subplots()
ax.imshow(img_bw)

In [None]:
# crop the image

# you could crop the image in an image editor, or crop using matrix slicing
# this method will require some trial and error to get dialed in
img_crop = img_bw[185:1275, 340:2150]

# display the cropped image
fig, ax = plt.subplots()
ax.imshow(img_crop)

In [None]:
# use object detection to get the points in the plot

# label the plotted points in the image
labels, no_objects = ndimage.label(img_crop)

props_table = measure.regionprops_table(labels, properties=['centroid'])
props_df = pd.DataFrame(props_table)

props_df.head()

In [None]:
# plot the identified objects

# these points are read off of the original graph scales
ticks_x_min = 0
ticks_x_max = 30
ticks_y_min = 2000
ticks_y_max = 6000

# the number of pixels in the graph will be used to set the scale factor
img_x_max = img_crop.shape[1]
img_y_max = img_crop.shape[0]

# calculate the x, y coordinates by scaling the centroid (in pixel units) to the 
# x, y scale of the original plot
props_df['x'] = props_df['centroid-1'] / img_x_max * (ticks_x_max - ticks_x_min) + ticks_x_min
props_df['y'] = (1 - props_df['centroid-0'] / img_y_max) * (ticks_y_max - ticks_y_min) + ticks_y_min

# display the results, in the scale of the original
fig, ax = plt.subplots()
ax.scatter(props_df['x'], props_df['y'])
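
The pixel-to-axis scaling in the cell above can also be sketched as a standalone function. The function name `pixel_to_data` and the 100 x 200 pixel crop are assumptions made for illustration; only the axis limits match the solution.

```python
import numpy as np

def pixel_to_data(centroid_row, centroid_col, img_shape, x_range, y_range):
    """Map pixel centroids (row, col) to the axis scale of the original plot."""
    n_rows, n_cols = img_shape
    x_min, x_max = x_range
    y_min, y_max = y_range
    # columns increase left to right, the same direction as the x axis
    x = centroid_col / n_cols * (x_max - x_min) + x_min
    # rows increase top to bottom, opposite to the y axis, so flip with (1 - ...)
    y = (1 - centroid_row / n_rows) * (y_max - y_min) + y_min
    return x, y

# hypothetical centroids at the top-left, center, and bottom-right of the crop
rows = np.array([0.0, 50.0, 100.0])
cols = np.array([0.0, 100.0, 200.0])
x, y = pixel_to_data(rows, cols, img_shape=(100, 200),
                     x_range=(0, 30), y_range=(2000, 6000))

print(x)
print(y)
```

The top-left pixel maps to the minimum $x$ and maximum $y$, which is why the row coordinate is inverted before scaling.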

In [None]:
# clean up the errors around the edges
# alternatively, these could have been erased in an image editor before loading

# filter out the edges, these were the tick marks in the image
props_df = props_df[props_df['x'] > 2]
props_df = props_df[props_df['x'] < 29]
props_df = props_df[props_df['y'] > 2100]
props_df = props_df[props_df['y'] < 5200]

# display the cleaned data set
fig, ax = plt.subplots()
ax.scatter(props_df['x'], props_df['y'])

ax.set_xlabel('Vinyl Acetate Content (wt%)')
# raw string so that the backslash in \cdot is not read as an escape sequence
ax.set_ylabel(r'O$_2$ Permeability (cm$^3$/m$^2$$\cdot$bar$\cdot$day)')

In [None]:
# apply regression to the extracted points
# this appears to have some curvature, so we will use a quadratic polynomial

# use sklearn to generate the polynomial feature matrix
poly = preprocessing.PolynomialFeatures(2)
X = poly.fit_transform(props_df['x'].to_numpy().reshape(-1, 1))
y = props_df['y'].to_numpy()

# fit a linear regression model to the data
reg = linear_model.LinearRegression(fit_intercept=False)
reg.fit(X, y)

# create a new array to plot the regression line from 0-30% VA
# this must also be transformed into a polynomial feature matrix
x_model = np.linspace(0, 30, 100)
X_model = poly.transform(x_model.reshape(-1, 1))

# use the regression model to calculate the points on the fitted line
y_hat = reg.predict(X_model)

# display the points and the regression line
fig, ax = plt.subplots()
ax.scatter(props_df['x'], props_df['y'])
ax.plot(x_model, y_hat)

ax.set_xlabel('Vinyl Acetate Content (wt%)')
ax.set_ylabel(r'O$_2$ Permeability (cm$^3$/m$^2$$\cdot$bar$\cdot$day)')

# display the equation and R**2 on the plot as text
# the + format flag prints the sign of each coefficient, avoiding output like "+-1.0"
ax.text(0, 4500, f'y={reg.coef_[0]:0.1f}{reg.coef_[1]:+0.1f}$x${reg.coef_[2]:+0.1f}$x^2$')
ax.text(0, 4250, f'$R^2$={reg.score(X, y):0.2f}')

## Problem 8.2.

Sometimes it is helpful to use individual RGB color channels to provide more contrast than is apparent in the full color image. The files `gel1.jpg` and `gel2.jpg` are photographs of "gels" (cross-linked defects) in polymer films. These are especially common in recycled materials. Compare the gel content in these two images. 

1. Load the files `gel1.jpg` and `gel2.jpg`
2. Determine which color channel in `gel1.jpg` is most suitable for image analysis
3. Using this color channel for both images, count and find the area of each defect in the film
   - Manually set two different values for the black/white threshold
   - Produce histograms of the results for each image, at each threshold
4. Compare the results of the two images, and the impact of the different threshold values

In [None]:
# problem 8.2. solution

gel1 = io.imread('../../data/gel1.jpg')
gel2 = io.imread('../../data/gel2.jpg')

# plot the channels individually
fig, ax = plt.subplots(ncols=3)

# red channel
ax[0].imshow(gel1[:, :, 0], cmap='Reds')
ax[0].set_axis_off()
ax[0].set_title(f'min={gel1[:, :, 0].min()}, max={gel1[:, :, 0].max()}', fontsize='small')

# green channel
ax[1].imshow(gel1[:, :, 1], cmap='Greens')
ax[1].set_axis_off()
ax[1].set_title(f'min={gel1[:, :, 1].min()}, max={gel1[:, :, 1].max()}', fontsize='small')

# blue channel
ax[2].imshow(gel1[:, :, 2], cmap='Blues')
ax[2].set_axis_off()
ax[2].set_title(f'min={gel1[:, :, 2].min()}, max={gel1[:, :, 2].max()}', fontsize='small')

The minimum and maximum values in the titles show that the blue channel spans the widest range within the 8-bit [0, 255] space, so it offers the most contrast. We should proceed with the blue channel.
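
The same comparison can be made numerically rather than by reading the plot titles. The sketch below is only an illustration: it builds a small, hypothetical RGB array in place of `gel1.jpg`, and the channel values are invented so that blue has the widest range.

```python
import numpy as np

# hypothetical 4 x 4 RGB image standing in for gel1.jpg
red = np.linspace(100, 150, 16).reshape(4, 4)    # narrow range
green = np.linspace(80, 180, 16).reshape(4, 4)   # medium range
blue = np.linspace(10, 250, 16).reshape(4, 4)    # wide range
img = np.stack([red, green, blue], axis=-1).astype(np.uint8)

# range (max - min) of each channel; cast to int before subtracting
# so the arithmetic is not done in unsigned 8-bit integers
ranges = {}
for i, name in enumerate(['red', 'green', 'blue']):
    channel = img[:, :, i]
    ranges[name] = int(channel.max()) - int(channel.min())

# pick the channel with the widest range
best = max(ranges, key=ranges.get)
print(ranges, best)
```

A wide range is a quick heuristic for contrast, not a guarantee. It is still worth inspecting the channel images before committing to one.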

In [None]:
# helper function to process and plot each image/threshold combination, reducing redundant code
def process_image(img_arr, threshold, ax, ax_col, title):

    # apply threshold to convert to black and white
    img_bw = img_arr > threshold

    # label and count the defects
    labels, no_objects = ndimage.label(img_bw)
    props_table = measure.regionprops_table(labels, properties=['centroid', 
                                                                'area'])
    df = pd.DataFrame(props_table)

    defect_count = len(df)
    defect_area_fraction = df['area'].sum() / (img_arr.shape[0] * img_arr.shape[1])


    # plot the image and histogram
    ax[0, ax_col].imshow(img_bw)
    ax[0, ax_col].set_axis_off()
    ax[0, ax_col].set_title(title, fontsize='x-small')

    ax[1, ax_col].hist(df['area'], bins=np.arange(0, 700, 50))
    ax[1, ax_col].set_xticks(np.arange(0, 700, 300))
    ax[1, ax_col].set_xlabel('Defect Size (px)', fontsize='x-small')

    ax[1, ax_col].text(100, 350, f'Count: {defect_count}', fontsize='small')
    ax[1, ax_col].text(100, 300, f'Area: {defect_area_fraction:0.2%}', fontsize='small')

# create a faceted plot to display the results
fig, ax = plt.subplots(nrows=2, ncols=4, sharex='row', sharey='row', dpi=150)

# calculate and plot each combination
# threshold of 80
process_image(gel1[:, :, 2], 80, ax, 0, 'gel1.jpg\nthreshold=80')
process_image(gel2[:, :, 2], 80, ax, 1, 'gel2.jpg\nthreshold=80')

# threshold of 120
process_image(gel1[:, :, 2], 120, ax, 2, 'gel1.jpg\nthreshold=120')
process_image(gel2[:, :, 2], 120, ax, 3, 'gel2.jpg\nthreshold=120')

With the higher threshold value of 120 (compared to 80), the count of identified defects and the total defect area both decreased. Using the higher threshold meant that some small defects were considered as black and excluded from the analysis. Additionally, the medium-gray edges of the gels were cut off and considered as background. This is not necessarily good or bad, but it needs to be managed for consistent results. At both threshold values, it appears that gel1 has a larger count and total defect area when compared to gel2. Additional measurements from other samples (or knowledge of the test variance) would be required to conduct a statistical t-test to assess the significance of this result.
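
The threshold effect described above can be reproduced on a small synthetic image. This is a minimal sketch under stated assumptions: the 20 x 20 array, its pixel values, and the helper name `count_defects` are all hypothetical and are not derived from `gel1.jpg` or `gel2.jpg`.

```python
import numpy as np
from scipy import ndimage

# hypothetical 20 x 20 single-channel image with a dark background (value 20)
img = np.full((20, 20), 20, dtype=np.uint8)

# a bright defect (value 200) surrounded by a medium-gray edge (value 100)
img[2:9, 2:9] = 100
img[3:8, 3:8] = 200

# a small, dim defect (value 100)
img[14:16, 14:16] = 100

def count_defects(img_arr, threshold):
    """Return the defect count and defect area fraction at a given threshold."""
    img_bw = img_arr > threshold
    labels, n_objects = ndimage.label(img_bw)
    area_fraction = img_bw.sum() / img_bw.size
    return n_objects, area_fraction

# compare the two threshold values used in the solution
results = {t: count_defects(img, t) for t in (80, 120)}
for t, (n_objects, area_fraction) in results.items():
    print(f'threshold={t}: count={n_objects}, area={area_fraction:0.2%}')
```

At the lower threshold both defects are counted and the gray edge adds to the area. At the higher threshold the dim defect disappears and only the bright core remains.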

------------------
## Congratulations

This concludes the Python for Engineering Data Analytics course. You should now be confident exploring engineering data in Python! Feel free to refer back to these lessons and copy code that helps jumpstart your analytics problems. Additionally, look at the [References](../../REFERENCES.md) file for a summary of useful links that were presented in the lessons.