import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Floating-Point Numbers 

To recap: when we talk about the floating-point representation of a number on a 64-bit machine, we mean that we turn a 64-dimensional vector of 0's and 1's, say

$$
{\bf x}_{f} = \left(s ~c_{10} c_{9} \cdots c_{0} ~f_{1} f_{2} \cdots f_{52} \right), 
$$

where

$$
s=0 ~\mbox{or}~ 1, ~ c_{l}=0~\mbox{or}~1, ~ \mbox{and} ~f_{j}=0~\mbox{or}~1, 
$$

into the real number $x_{f}$ via the formula 

$$
x_{f} = (-1)^{s}2^{\tilde{c}}(1 + \tilde{f}),
$$

where

$$
\tilde{c} = \sum_{l=0}^{10}c_{l}2^{l} - 1023, ~ \tilde{f} = \sum_{j=1}^{52}\frac{f_{j}}{2^{j}}
$$
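The formula above can be checked directly. The following is a minimal sketch, not a reference implementation: the helper names `decode_bits` and `float_to_bits` are made up for illustration, and the bit string is assumed to be ordered as $s$, then $c_{10}\cdots c_{0}$, then $f_{1}\cdots f_{52}$. It applies the formula exactly as written, so it makes no attempt to handle the bit patterns that real hardware reserves for special values.

```python
import struct

def decode_bits(bits):
    """Evaluate (-1)^s * 2^c~ * (1 + f~) for a 64-character string of 0s and 1s."""
    assert len(bits) == 64 and set(bits) <= {"0", "1"}
    s = int(bits[0])
    # c_10 ... c_0 read as a binary integer, then shifted by 1023
    c_tilde = int(bits[1:12], 2) - 1023
    # f~ = sum over j of f_j / 2^j, with j running from 1 to 52
    f_tilde = sum(int(b) / 2**j for j, b in enumerate(bits[12:], start=1))
    return (-1)**s * 2.0**c_tilde * (1 + f_tilde)

def float_to_bits(x):
    """Return the 64 stored bits of a Python float, sign bit first."""
    (n,) = struct.unpack(">Q", struct.pack(">d", x))
    return format(n, "064b")

bits = float_to_bits(-6.5)
print(bits[0], bits[1:12], bits[12:])
print(decode_bits(bits))   # recovers -6.5
```

Round-tripping a float through `float_to_bits` and `decode_bits` is a quick way to confirm that a hand-built bit vector means what you think it does.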

**Problem**: What are the largest and smallest values the mantissa $\tilde{f}$ can have?  Remember to use our favorite partial geometric series formula

$$
\sum_{j=0}^{n} a^{j} = \frac{a^{n+1}-1}{a-1}.
$$
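If you want to convince yourself that the formula is right before using it, a quick check with exact rational arithmetic is enough. This is only a sanity check for one choice of $a$ and $n$; the function names are illustrative.

```python
from fractions import Fraction

def partial_geometric_sum(a, n):
    """Add up a**0 + a**1 + ... + a**n term by term."""
    return sum(a**j for j in range(n + 1))

def closed_form(a, n):
    """Right-hand side of the formula, valid for a != 1."""
    return (a**(n + 1) - 1) / (a - 1)

a, n = Fraction(1, 2), 5
print(partial_geometric_sum(a, n), closed_form(a, n))   # both print 63/32
```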

**Problem**: If $s=0$, $c_{l}=1$ for all $l$, and $f_{j}=1$ for all $j$, what number $x_{f}$ does that correspond to?  How big is it in base 10?  

**Problem**: What 64-bit vector corresponds to $x_{f}=1$?  Note, we have 

$$
\log_{2}(1) = \tilde{c} + \log_{2}(1+\tilde{f})
$$

and of course $\log_{2}(1)=0$.  

**Problem**: For any positive floating-point number $x_{f}$ (so $s=0$), show that its corresponding characteristic value $\tilde{c}$ is given by 

$$
\tilde{c} = \lfloor \log_{2}(x_{f}) \rfloor
$$

where $\lfloor \cdot \rfloor$ is the _floor function_, which rounds _down_ to the nearest integer.  So $\lfloor 1.2 \rfloor = 1$ and $\lfloor -1.2 \rfloor = -2$.  Show then that 

$$
\tilde{f} = x_{f}2^{-\lfloor \log_{2}(x_{f}) \rfloor} - 1.
$$
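The two formulas can be tried out numerically. The sketch below is illustrative only and assumes a positive, normalized input. One caveat: `math.log2` is itself computed in floating point, so for an $x_{f}$ sitting just below a power of two the computed logarithm can round up and the floor comes out one too large. For that reason the sketch also uses `math.frexp`, which reads the exponent off the stored number directly, as a cross-check.

```python
import math

def split_via_log(x):
    """(c~, f~) from the floor-of-log2 formulas; x must be positive."""
    c_tilde = math.floor(math.log2(x))
    f_tilde = x * 2.0**(-c_tilde) - 1
    return c_tilde, f_tilde

def split_via_frexp(x):
    """(c~, f~) from math.frexp, which returns x = m * 2**e with 0.5 <= m < 1."""
    m, e = math.frexp(x)
    return e - 1, 2 * m - 1

for x in (1.0, 10.0, 0.1):
    print(x, split_via_log(x), split_via_frexp(x))
```

For these inputs the two routines agree, for example $10 = 2^{3}(1 + 0.25)$.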

**Problem**: What 64-bit vector gives the closest approximation to $0.1$?  To solve this, you will need to use the decimal-to-binary string conversion routines we developed in lecture this week.    


## Catastrophic Cancellation

For the most part, you won't spend much of your life thinking about these details, but we do need them if we want to understand the phenomenon of _catastrophic cancellation_.  

To explain this, we have to keep some things in mind:
<ol>
    <li> The ONLY numbers on a computer are floating-point numbers.
    <li> Any arithmetic operation between two floating-point numbers must result in a floating-point number, since those are the only numbers a computer can work with.
</ol>

Now, if the arithmetic operation we are concerned with is addition, then this isn't really much of a problem.  To wit, suppose we have two floating point numbers with the same characteristic, but different mantissas, i.e. 

$$
x_{1} = 2^{\tilde{c}}(1+\tilde{f}_{1}), ~ x_{2} = 2^{\tilde{c}}(1+\tilde{f}_{2})
$$

where

$$
\tilde{f}_{1} = \sum_{j=1}^{52}\frac{f_{j,1}}{2^{j}}, ~ \tilde{f}_{2} = \sum_{j=1}^{52}\frac{f_{j,2}}{2^{j}}
$$

Now, if we add them, we have

\begin{align*}
x_{1} + x_{2} = & 2^{\tilde{c}}(1+\tilde{f}_{1}) + 2^{\tilde{c}}(1+\tilde{f}_{2})\\
= & 2^{\tilde{c}}\left(2 + \tilde{f}_{1} + \tilde{f}_{2} \right)\\
= & 2^{\tilde{c}+1}\left(1 + \frac{\tilde{f}_{1} + \tilde{f}_{2}}{2}\right)
\end{align*}

Note, the last step is necessary because we **HAVE** to return a floating-point number, and the last line has the required form $2^{\tilde{c}+1}(1+\cdot)$ whereas the second-to-last line does not.  
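The renormalization in the last line can be watched in action. Below is a small sketch with two numbers chosen for illustration, $1.25$ and $1.5$, which share the characteristic $\tilde{c}=0$. The helper `split` is hypothetical and uses `math.frexp` to read off the characteristic and mantissa.

```python
import math

def split(x):
    """Return (c~, f~) with x = 2**c~ * (1 + f~) for a positive normalized float."""
    m, e = math.frexp(x)       # x = m * 2**e with 0.5 <= m < 1
    return e - 1, 2 * m - 1

x1, x2 = 1.25, 1.5             # same characteristic, mantissas 0.25 and 0.5
c1, f1 = split(x1)
c2, f2 = split(x2)
c_sum, f_sum = split(x1 + x2)

print(c1, f1, c2, f2)          # 0 0.25 0 0.5
print(c_sum, f_sum)            # 1 0.375
```

The characteristic of the sum goes up by one and its mantissa is the average $(0.25 + 0.5)/2 = 0.375$, exactly as the derivation predicts.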

**Problem**: Suppose $\tilde{f}_{1}=0$ and $\tilde{f}_{2}=2^{-52}$.  What is $x_{1} + x_{2}$?  Note, you MUST return a floating-point number! Explain why this case leads to a _loss_ of information of exactly one bit.

**Problem**: Show that if $\tilde{f}_{1}$ and $\tilde{f}_{2}$ are well-defined mantissas, then $\frac{\tilde{f}_{1} + \tilde{f}_{2}}{2}$ is as well, except for possibly losing one bit.  In other words, you need to show that 

$$
\frac{\tilde{f}_{1} + \tilde{f}_{2}}{2} = \sum_{j=1}^{52}\frac{\bar{f}_{j}}{2^{j}}, ~ \bar{f}_{j} = 0, 1.
$$

The trick here is figuring out what $(f_{j,1}+f_{j,2})/2$ is for each index $j$.  So, if $f_{j,1}=f_{j,2}=0$, then $(f_{j,1}+f_{j,2})/2=0$, and if $f_{j,1}=f_{j,2}=1$, then $(f_{j,1}+f_{j,2})/2=1$.  But what do we do if $f_{j,1}\neq f_{j,2}$?  

So, as we see, at worst we lose one bit of information when we add two floating-point numbers with the same characteristic.  But what if we look at their difference?  For the sake of argument, suppose that $x_{1}>x_{2}$.  Writing $f_{i} = 1 + \tilde{f}_{i}$ for short, so that $f_{1}-f_{2} = \tilde{f}_{1}-\tilde{f}_{2}$, we have 

$$
x_{1} - x_{2} = 2^{\tilde{c}}\left(f_{1} - f_{2}\right)
$$

**Problem**: Suppose that $f_{1}=2-2^{-52}$ and $f_{2}=2-2^{-51}$, that is, $\tilde{f}_{1}=1-2^{-52}$ and $\tilde{f}_{2}=1-2^{-51}$.  What is the floating-point representation of $x_{1}-x_{2}$?  Why would we say that 52 bits of information are lost in this case?  Keep in mind, you started with two mantissas which each had 52 well-defined bits, and you should end with a result that has no well-defined bits.  
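To see the same effect on a different pair of numbers, here is a short sketch that subtracts $1$ from $1 + \delta$ with $\delta = 10^{-15}$, a value chosen purely for illustration. The helper `mantissa_bits` is hypothetical and simply returns the 52 stored mantissa bits. The subtraction itself is exact; the damage was done earlier, when $1+\delta$ was rounded to a number with characteristic $0$, and the subtraction exposes it.

```python
import struct

def mantissa_bits(x):
    """Return the 52 stored mantissa bits f_1 ... f_52 of a float as a string."""
    (n,) = struct.unpack(">Q", struct.pack(">d", x))
    return format(n, "064b")[12:]

delta = 1e-15
x1 = 1.0 + delta       # rounded to the nearest float with characteristic 0
x2 = 1.0
diff = x1 - x2         # exact, but only the few low bits of x1 survive

print(mantissa_bits(delta))   # the mantissa we hoped to recover
print(mantissa_bits(diff))    # what we actually get: almost all zeros
print(abs(diff - delta) / delta)
```

The relative error of `diff` as an approximation to `delta` is about 11%, even though each input carried 52 mantissa bits. The trailing zeros in the mantissa of `diff` are padding, not information.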