# Solutions: Unit 3
-------------------

Complete the problems below in your copy of the Jupyter Notebook.

## Problem 3.1.

The [American Chemistry Council (ACC)](https://www.americanchemistry.com/chemistry-in-america/data-industry-statistics/statistics-on-the-plastic-resins-industry) tracks the annual production of polymers in the United States. Some of the basic industry statistics are made publicly available for download. The file `polymer_production.csv` contains information from the [PIPS Resin Sales and Production CY Figures, 2021 v 2020](https://www.americanchemistry.com/content/download/10906/file/ACC-PIPS-Resin-Sales-and-Production-CY-Figures-2021vs2020.pdf) report, giving 2021 production of thermoplastics in millions of pounds. Visualize this data, highlighting the top 3 polymers produced in 2021. To make the units more globally relevant, plot the data in metric kilotons (1 kt = 2.2 million pounds).

1. Load the file `polymer_production.csv` and create a horizontal bar chart for the production volumes, with axis labels and title.
   - Plot the bars at 50% transparency
   - Plot the top 3 values in another color to highlight these values
2. Save the plot to the output directory as `problem3-1a.png` at 60 dpi
3. Save the plot to the output directory as `problem3-1b.png` at 600 dpi
4. Write down the differences between these files in terms of file size and quality

BONUS: determine a method to sort the data frame with production volume in descending order *and* with the "Other thermoplastics" category remaining at the bottom of the list. Plot the data using this order, again highlighting the top 3 polymers. Save this to the output directory as `problem3-1c.png`.

In [None]:
# problem 3.1. solution

import pandas as pd
import matplotlib.pyplot as plt

# optional, but I prefer the ggplot style
plt.style.use('ggplot')

# read the raw data, print out the head of the DataFrame to get the column names
polymer_df = pd.read_csv('../../data/polymer_production.csv')
polymer_df.head()

In [None]:
# create a new column for the converted units
polymer_df['Production-kt'] = polymer_df['Production-MillionLbs'] / 2.2

# sort by volume
polymer_df = polymer_df.sort_values('Production-kt', ascending=False)

# select top 3
polymer_df_top3 = polymer_df.head(3)

# plot the initial values
fig, ax = plt.subplots()
ax.barh(polymer_df.index, polymer_df['Production-kt'], tick_label=polymer_df['Polymer'], alpha=0.5, color='gray')
ax.barh(polymer_df_top3.index, polymer_df_top3['Production-kt'])

# flip the axis to match the order in the DataFrame top-to-bottom
ax.invert_yaxis()

# always label your numerical axes, and add the title requested in the problem
ax.set_xlabel('2021 U.S. Polymer Production (kt)')
ax.set_title('Thermoplastic Production by Polymer')

# save files; call plt.tight_layout() first so the long tick labels fit inside the figure
plt.tight_layout()
plt.savefig('../../output/problem3-1a-solution.png', dpi=60)
plt.savefig('../../output/problem3-1b-solution.png', dpi=600)

Looking at the file sizes, we observe that the 60 dpi file is only 15 KB whereas the 600 dpi file is 209 KB. There is a marked difference in image quality, and the 60 dpi image probably isn't useful for anything. The 600 dpi file could be printed at high quality, but may be unnecessarily large for a presentation file.
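
The size gap follows from the pixel count: for a fixed figure size, the number of pixels grows with the square of the dpi. The sketch below is one way to check this. It assumes a throwaway bar chart (the values are placeholders, not the ACC data) and writes to in-memory buffers instead of the output directory.

```python
import io
import matplotlib.pyplot as plt

# placeholder values for illustration only (not the ACC data)
fig, ax = plt.subplots(figsize=(6.4, 4.8))
ax.barh([0, 1, 2], [3, 2, 1], alpha=0.5, color='gray')

results = {}
for dpi in (60, 600):
    # write the PNG into memory instead of onto disk
    buffer = io.BytesIO()
    fig.savefig(buffer, format='png', dpi=dpi)

    # pixel dimensions = figure size in inches * dpi
    width_px, height_px = (fig.get_size_inches() * dpi).round().astype(int)
    size_bytes = buffer.getbuffer().nbytes
    results[dpi] = (int(width_px), int(height_px), size_bytes)
    print(f'{dpi} dpi: {width_px} x {height_px} px, {size_bytes / 1024:.0f} KB')

plt.close(fig)
```

The 600 dpi image has 100 times as many pixels as the 60 dpi image. PNG compression keeps the file-size ratio well below that, which is consistent with the 15 KB and 209 KB figures reported above.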

In [None]:
# BONUS: create a copy of the DataFrame, so that we don't modify the original
bonus_df = polymer_df.copy()

# create a new column with True/False values to flag the "Other thermoplastics" row
bonus_df['IsOther'] = bonus_df['Polymer']=='Other thermoplastics'

# sort the values first on the new column (to push "Other" to the bottom), then Production volume
bonus_df = bonus_df.sort_values(['IsOther', 'Production-kt'], ascending=[True, False])

# reset the index to renumber the rows in the sorted order
# inplace=True means that the existing DataFrame will be modified
bonus_df.reset_index(drop=True, inplace=True)

# from here, it should be the same as before
# select top 3
bonus_df_top3 = bonus_df.head(3)

# plot the initial values
fig, ax = plt.subplots()
ax.barh(bonus_df.index, bonus_df['Production-kt'], tick_label=bonus_df['Polymer'], alpha=0.5, color='gray')
ax.barh(bonus_df_top3.index, bonus_df_top3['Production-kt'])

# flip the axis to match the order in the DataFrame top-to-bottom
ax.invert_yaxis()

# always label your numerical axes
ax.set_xlabel('2021 U.S. Polymer Production (kt)')

# save file; call plt.tight_layout() first so the long tick labels fit inside the figure
plt.tight_layout()
plt.savefig('../../output/problem3-1c-solution.png', dpi=300)
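
The two-key sort works because `False` sorts before `True`, so every row that is not "Other" comes first regardless of its volume. A minimal sketch of just the sorting step, using made-up rows (the polymer names and values are hypothetical, not the ACC figures):

```python
import pandas as pd

# hypothetical rows for illustration only (not the ACC figures)
demo_df = pd.DataFrame({
    'Polymer': ['Polymer A', 'Other thermoplastics', 'Polymer B', 'Polymer C'],
    'Production-kt': [10.0, 50.0, 30.0, 20.0],
})

# False sorts before True, so the "Other" row is pushed to the end
demo_df['IsOther'] = demo_df['Polymer'] == 'Other thermoplastics'
demo_df = demo_df.sort_values(['IsOther', 'Production-kt'], ascending=[True, False])
demo_df = demo_df.reset_index(drop=True)

print(demo_df['Polymer'].tolist())
# ['Polymer B', 'Polymer C', 'Polymer A', 'Other thermoplastics']
```

Note that "Other thermoplastics" lands last even though it has the largest value in this toy data, so `head(3)` on the sorted frame never highlights it.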

## Problem 3.2.

Modify the function that you created for Unit 1 to represent the Gaussian (normal) probability distribution to use the `numpy` math functions in place of `math`.

1. Create a figure, axis and set the y-axis limits to $\left[0,0.5\right]$
2. Plot this function, with $\mu_1=2$ and $\sigma_1=1.5$, as a line on the interval [-5, 10], using 100 points in the range
   - Add dashed vertical lines at $\mu_1 \pm 3\sigma_1$
   - Add text centered on the mean in the $x$ direction and 10% above the maximum value of $y$ to provide the value of the mean
   - Plot the function and lines in red
3. On the same plot add this function, with $\mu_2=10$ and $\sigma_2=1$, as a line on the interval [5, 15], using 10 points in the range
   - Add dashed vertical lines at $\mu_2 \pm 3\sigma_2$
   - Add text centered on the mean in the $x$ direction and 10% above the maximum value of $y$ to provide the value of the mean
   - Plot the function and lines in blue
4. Save the plot to the output directory as `problem3-2.pdf`
5. Open the file and zoom in to 600%
   - What happens to the image quality?
   - What are the differences between the curves for parts 2 and 3?

In [None]:
# problem 3.2. solution

# modified gaussian function
import numpy as np
import matplotlib.pyplot as plt

def gaussian(x, mu, sigma):

    # parts of equation broken out for improved readability (not required)
    scale_factor = 1 / (sigma * np.sqrt(2 * np.pi))
    exponent = -0.5 * (x - mu)**2 / sigma**2

    # complete the equation, using the pieces defined above, and return result
    return scale_factor * np.exp(exponent)


# part 1
fig, ax = plt.subplots()
ax.set_ylim((0, 0.5))


# part 2
x1 = np.linspace(-5, 10, 100)
mu1 = 2
sigma1 = 1.5

# calculate the resulting function, given the values that we've defined
y1 = gaussian(x1, mu1, sigma1)
ax.plot(x1, y1, c='red')

# add the vertical lines and text
ax.axvline(mu1-3*sigma1, ls='--', c='red')
ax.axvline(mu1+3*sigma1, ls='--', c='red')

# raw f-string (rf'...') so that '\m' is not treated as an escape sequence
ax.text(mu1, y1.max()*1.1, rf'$\mu=${mu1}', ha='center')


# part 3
x2 = np.linspace(5, 15, 10)
mu2 = 10
sigma2 = 1

# calculate the resulting function, given the values that we've defined
y2 = gaussian(x2, mu2, sigma2)
ax.plot(x2, y2, c='blue')

# add the vertical lines and text
ax.axvline(mu2-3*sigma2, ls='--', c='blue')
ax.axvline(mu2+3*sigma2, ls='--', c='blue')

# raw f-string (rf'...') so that '\m' is not treated as an escape sequence
ax.text(mu2, y2.max()*1.1, rf'$\mu=${mu2}', ha='center')


# part 4
plt.savefig('../../output/problem3-2-solution.pdf')

Because the plot is saved as a PDF, which stores the lines as vector graphics, there is no loss in image quality when zooming in to 600%. Comparing the two curves, we notice the impact of the number of points in our x arrays. With only 10 points for the second distribution, the line is drawn as a few straight segments instead of a smooth curve. This may confuse the reader if we are trying to explain the shape of the Gaussian distribution.
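
Coarse sampling also distorts the apparent height of the curve, because none of the 10 points falls on the mean. The sketch below compares the sampled maximum with the analytical peak, $1/(\sigma\sqrt{2\pi})$. The 101-point grid is a hypothetical choice, picked so that $x=10$ is sampled exactly.

```python
import numpy as np

def gaussian(x, mu, sigma):
    # same formula as the solution above, written on one line
    return np.exp(-0.5 * (x - mu)**2 / sigma**2) / (sigma * np.sqrt(2 * np.pi))

mu2, sigma2 = 10, 1

x_coarse = np.linspace(5, 15, 10)    # grid used in part 3
x_fine = np.linspace(5, 15, 101)     # hypothetical finer grid, step of 0.1

true_peak = gaussian(mu2, mu2, sigma2)
coarse_peak = gaussian(x_coarse, mu2, sigma2).max()
fine_peak = gaussian(x_fine, mu2, sigma2).max()

print(f'analytical peak: {true_peak:.3f}')
print(f'10-point peak:   {coarse_peak:.3f}')
print(f'101-point peak:  {fine_peak:.3f}')
```

With 10 points the plotted maximum is roughly 0.34 instead of about 0.40, so the text placed at `y2.max()*1.1` also sits lower than it would on a finely sampled curve.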

## Problem 3.3.

Retail food packages are commonly made by welding plastic films under pressure, at temperatures above their melting point. To test the welding behavior of different polymers, a *heat seal curve* is generated, which measures the force required to separate the weld as a function of increasing temperature. For many materials, there is some critical temperature where the failure mode shifts from a peelable seal to a *destruct* seal (where the film fails catastrophically). If you have struggled to open a package that was *supposed* to be peelable, you have experienced this phenomenon firsthand.

The file `seal_curve.csv` contains such a seal curve, with columns for the temperature (°C), replicate, breaking force (N) and failure mode (0=peelable, 1=destruct). 

Repeat these steps to create a separate plot for each of the `default` and `ggplot` plot styles:

1. Plot the peelable data points as an 'x'
2. Plot points where the film broke as squares
3. Add a dashed horizontal line to indicate the force where the failure mode changes from peelable to film destruct. Calculate this as the average between the strongest peelable strength value and weakest destruct strength value.
4. Add text to the plot, indicating the temperature where the failure mode changes from peelable to film destruct. Calculate this as the average between the highest temperature where peelable seals are observed and the lowest temperature where destruct seals are observed.
5. Save the plot to the output directory as `problem3-3-<stylename>.png` at 300 dpi

In [None]:
# problem 3.3. solution
 
# load the dataset
seal_df = pd.read_csv('../../data/seal_curve.csv')
seal_df.head()

In [None]:
# create separate DataFrames, filtered on the failure mode
peelable_df = seal_df[seal_df['failure_mode']==0]
destruct_df = seal_df[seal_df['failure_mode']==1]

# calculate the force where the failure mode changes, as described in the problem statement
destruct_force = (peelable_df['peak_strength'].max() + destruct_df['peak_strength'].min())/2

# calculate the temperature where the failure mode changes
initiation_temp = (peelable_df['temperature'].max() + destruct_df['temperature'].min())/2


# STYLE=default
plt.style.use('default')

fig, ax = plt.subplots()

# plot the peelable points as 'x'
ax.scatter(peelable_df['temperature'], peelable_df['peak_strength'], marker='x')

# plot the destruct points as squares
ax.scatter(destruct_df['temperature'], destruct_df['peak_strength'], marker='s')

ax.set_xlabel('Temperature (°C)')
ax.set_ylabel('Force (N)')

ax.axhline(destruct_force, ls='--', zorder=0, alpha=0.5)
ax.text(80, 65, f'Initiation Temperature: {initiation_temp:0.0f}°C')

plt.savefig('../../output/problem3-3-default-solution.png', dpi=300)


# STYLE=ggplot
plt.style.use('ggplot')

fig, ax = plt.subplots()

ax.scatter(peelable_df['temperature'], peelable_df['peak_strength'], marker='x')
ax.scatter(destruct_df['temperature'], destruct_df['peak_strength'], marker='s')

ax.set_xlabel('Temperature (°C)')
ax.set_ylabel('Force (N)')

ax.axhline(destruct_force, ls='--', zorder=0, alpha=0.5)
ax.text(80, 65, f'Initiation Temperature: {initiation_temp:0.0f}°C')

plt.savefig('../../output/problem3-3-ggplot-solution.png', dpi=300)
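
The two style sections above repeat the same plotting code. One possible alternative, sketched below, wraps the plot in a function and applies each style with `plt.style.context`, which limits the style to the `with` block. The data are made-up values (not `seal_curve.csv`), and `plot_seal_curve` is a hypothetical helper name. The text is positioned in axes coordinates, a choice that differs from the hard-coded data coordinates in the solution.

```python
import matplotlib.pyplot as plt
import pandas as pd

# hypothetical measurements for illustration only (not seal_curve.csv)
demo_df = pd.DataFrame({
    'temperature':   [100, 110, 120, 130, 140],
    'peak_strength': [5.0, 12.0, 20.0, 40.0, 42.0],
    'failure_mode':  [0, 0, 0, 1, 1],
})
peelable = demo_df[demo_df['failure_mode'] == 0]
destruct = demo_df[demo_df['failure_mode'] == 1]

# midpoints between the last peelable and first destruct observations
threshold_force = (peelable['peak_strength'].max() + destruct['peak_strength'].min()) / 2
threshold_temp = (peelable['temperature'].max() + destruct['temperature'].min()) / 2

def plot_seal_curve(style):
    # the style only applies inside this block, so later plots are unaffected
    with plt.style.context(style):
        fig, ax = plt.subplots()
        ax.scatter(peelable['temperature'], peelable['peak_strength'], marker='x')
        ax.scatter(destruct['temperature'], destruct['peak_strength'], marker='s')
        ax.axhline(threshold_force, ls='--', zorder=0, alpha=0.5)
        ax.set_xlabel('Temperature (°C)')
        ax.set_ylabel('Force (N)')
        # axes coordinates: (0, 0) is the lower left corner, (1, 1) the upper right
        ax.text(0.05, 0.9, f'Initiation Temperature: {threshold_temp:0.0f}°C',
                transform=ax.transAxes)
    return fig

figures = {style: plot_seal_curve(style) for style in ('default', 'ggplot')}
```

Each figure in `figures` can then be saved with `fig.savefig(..., dpi=300)`, using the style name to build the file name.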


--------------
## Next Steps:

1. Advance to [Unit 4](../04-pandas-dataframe/unit04-lesson.ipynb) when you're ready for the next step