# Engineering Data Analysis

> **Mohamad M. Hallal, PhD** <br> Teaching Professor, UC Berkeley

[![License](https://img.shields.io/badge/license-CC%20BY--NC--ND%204.0-blue)](https://creativecommons.org/licenses/by-nc-nd/4.0/)
***

# Data Transformation 

In this notebook, we'll explore how linear transformations (of the form $y = a + bx$) affect descriptive statistics such as mean, median, standard deviation, and range. Use the sliders to adjust the parameters `a` and `b` and see how the statistics change.

Let's get started!

Run the cell below to define the plotting function that we will use later. You won't see any output after running this cell because it only defines the function.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import ipywidgets as widgets  # import ipywidgets package for interactive widgets

def plot_horizontal_line(a=0, b=1):
    # Tick positions along the horizontal axis
    x_values = list(range(-16, 17, 2))

    # Create the plot
    plt.figure(figsize=(10, 2), dpi=300)
    
    # Set the x-axis limits
    plt.xlim(-17, 17)
    
    original = np.array([0, 2, 4])
    
    plt.scatter(original, [1]*3, 125, zorder=5, label='x')
    
    # Plot the transformed sample only if it differs from the original
    if a != 0 or b != 1:
        plt.scatter(a + b*original, [-1]*3, 125, zorder=5,
                    label=f"y={b if b!=1 else ''}x{'+' if a>0 else ''}{a if a!=0 else ''}")
    
    # Set the y-axis limits so both rows of points fit inside the plot
    plt.ylim(-2, 2)
    
    # add legend
    plt.legend()

    # Customize the ticks
    plt.xticks(x_values)
    plt.yticks([])  # Remove y-axis ticks

    # Remove the top, right, and left spines (the box around the plot)
    ax = plt.gca()
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['left'].set_visible(False)
    
    # add grid line
    ax.grid(ls=':')

    # Move the bottom spine (x-axis) to y=0
    ax.spines['bottom'].set_position(('data',0))

    # Display the plot
    plt.show()

## Original Sample

We'll start with a simplified sample of three values: $x = 0, 2, 4$.

<div class="alert alert-block alert-info"> <b>YOUR TURN!</b> Run the cell below to plot the sample.</div> 

<div class="alert alert-block alert-warning"> <b>DISCUSS!</b> <b>By looking at the plot, what are the mean, median, standard deviation, and range of the sample?</b></div> 

*Hint: You don't need to calculate these using their formulas. Look at the plot to visually assess the center and spread of the sample.*

In [None]:
# plot the original sample
plot_horizontal_line()   
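
Once you have discussed your visual estimates, one way to check them is to compute the statistics directly. The cell below is a minimal sketch, not part of the original exercise. It assumes the *sample* standard deviation (`ddof=1`), and the variable names are illustrative.

```python
import numpy as np

# Same three-value sample shown in the plot above
x = np.array([0, 2, 4])

mean = np.mean(x)
median = np.median(x)
sample_sd = np.std(x, ddof=1)    # assumption: sample SD (divides by n - 1)
data_range = x.max() - x.min()   # range = largest value minus smallest value

print("mean:              ", mean)
print("median:            ", median)
print("standard deviation:", sample_sd)
print("range:             ", data_range)
```

If your course uses the population standard deviation instead, set `ddof=0`; the value will differ, but the way it responds to the transformations below will not.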

## Part I: Adding a Constant

Next, we'll investigate the effect of adding a constant to (or subtracting a constant from) each value in the sample ($x+a$) on the summary statistics. 

<div class="alert alert-block alert-info"> <b>YOUR TURN!</b> Run the cell below and change the slider to adjust the parameter <code>a</code> and see how the sample values change.</div> 

<div class="alert alert-block alert-warning"> <b>DISCUSS!</b> What do you observe from the plot below? Specifically: <br> <b>What do you notice about the mean of the transformed data? Does it change or remain constant? Why or why not?</b> 
    <br> <b>Does changing $a$ affect the spread of the data, specifically, the standard deviation and the range? Why or why not?</b>
    <br> <b>If the mean or range changes, how is this change related to $a$?</b></div> 

In [None]:
# define a function that takes the value from the slider and plots the data;
# the decorator creates 1 slider for a
@widgets.interact(a=(-10, 10, 1))
def transform(a=1):
    # Call the function to plot
    plot_horizontal_line(a=a)
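
After the discussion, a numerical check can back up what the slider shows. The sketch below uses a hypothetical shift of `a = 3` (any slider value would do) and assumes the sample standard deviation (`ddof=1`). It compares each statistic before and after the shift.

```python
import numpy as np

x = np.array([0, 2, 4])
a = 3          # hypothetical shift; try other values from the slider
y = x + a      # every point moves by the same amount

# Center: compare the mean before and after the shift
print("mean: ", np.mean(x), "->", np.mean(y))

# Spread: compare the standard deviation and the range
print("SD:   ", np.std(x, ddof=1), "->", np.std(y, ddof=1))
print("range:", x.max() - x.min(), "->", y.max() - y.min())
```

Because every point moves together, the distances between points are unchanged, which is why the two spread measures print the same value on both sides of the arrow.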

## Part II: Multiplying by a Constant

Next, we'll investigate the effect of multiplying each value in the sample by a constant ($bx$) on the summary statistics.

<div class="alert alert-block alert-info"> <b>YOUR TURN!</b> Run the cell below and change the slider to adjust the parameter <code>b</code> and see how the sample values change.</div> 

<div class="alert alert-block alert-warning"> <b>DISCUSS!</b> What do you observe from the plot below? Specifically: <br> <b>What do you notice about the mean of the transformed data? Does it change or remain constant? Why or why not?</b> 
    <br> <b>Does changing $b$ affect the spread of the data, specifically, the standard deviation and the range? Why or why not?</b>
    <br> <b>If the mean or range changes, how is this change related to $b$?</b>
    <br> <b>How does a negative constant affect the spread of the data, specifically, the standard deviation and the range? </b></div>

In [None]:
# define a function that takes the value from the slider and plots the data;
# the decorator creates 1 slider for b
@widgets.interact(b=(-4, 4, 1))
def transform(b=2):
    # Call the function to plot
    plot_horizontal_line(b=b)
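
The negative-constant question is easier to settle with numbers in hand. The sketch below uses a hypothetical `b = -3` and assumes the sample standard deviation (`ddof=1`). It prints each statistic of the transformed sample next to two candidate values, one scaled by `b` and one by `abs(b)`, so you can see which one matches.

```python
import numpy as np

x = np.array([0, 2, 4])
b = -3         # hypothetical negative multiplier; try other slider values
y = b * x      # the sample is stretched and flipped around zero

sd_x = np.std(x, ddof=1)
range_x = x.max() - x.min()

# Each row: observed statistic of y, then the b-scaled and |b|-scaled candidates
print("mean: ", np.mean(y), "| b * mean(x) =", b * np.mean(x))
print("SD:   ", np.std(y, ddof=1), "| b * SD(x) =", b * sd_x, "| |b| * SD(x) =", abs(b) * sd_x)
print("range:", y.max() - y.min(), "| b * range(x) =", b * range_x, "| |b| * range(x) =", abs(b) * range_x)
```

Measures of spread are distances, so they cannot be negative. Flipping the sample changes the sign of the center but not the size of the spread.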

## Part III: Putting It All Together

Finally, we'll investigate the combined effect of adding a constant and multiplying by a constant ($bx + a$) on the summary statistics. 

<div class="alert alert-block alert-info"> <b>YOUR TURN!</b> Run the cell below and change the sliders to adjust the parameters <code>a</code> and <code>b</code> and see how the sample values change.</div> 

<div class="alert alert-block alert-warning"> <b>DISCUSS!</b> What do you observe from the plot below? Specifically: <br> <b>What do you notice about the mean of the transformed data? Does it change or remain constant? Why or why not?</b> 
    <br> <b>What do you notice about the standard deviation and the range of the transformed data? Do they change or remain constant? Why or why not?</b>
    <br> <b>If the mean or range changes, how is this change related to $a$ and $b$?</b></div>

In [None]:
# define a function that takes the values from the sliders and plots the data;
# the decorator creates 2 sliders, one for a and one for b
@widgets.interact(a=(-4, 4, 1), b=(-3, 3, 1))
def transform(a=1, b=2):
    # Call the function to plot
    plot_horizontal_line(a, b)
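
To wrap up, the sketch below checks one candidate rule against every slider combination in Part III: the center (mean, median) becomes $a + b \cdot \text{center}$, and the spread (standard deviation, range) becomes $|b| \cdot \text{spread}$. The helper `summarize` is a hypothetical name introduced here, and the sample standard deviation (`ddof=1`) is an assumption.

```python
import numpy as np

def summarize(values):
    """Return (mean, median, sample SD, range) for a 1-D collection of numbers."""
    values = np.asarray(values, dtype=float)
    return (values.mean(), np.median(values),
            values.std(ddof=1), values.max() - values.min())

x = np.array([0, 2, 4])
mean_x, median_x, sd_x, range_x = summarize(x)

all_match = True
for a in range(-4, 5):          # same grid as the a slider
    for b in range(-3, 4):      # same grid as the b slider
        observed = summarize(a + b * x)
        # Candidate rule: center shifts and scales, spread scales by |b| only
        predicted = (a + b * mean_x, a + b * median_x,
                     abs(b) * sd_x, abs(b) * range_x)
        all_match = all_match and np.allclose(observed, predicted)

print("Rule holds for every slider combination:", all_match)
```

A check on one small sample is not a proof, but it is a quick way to catch a rule that is wrong before you try to justify it algebraically.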