# Data manipulations

A matrix transformation multiplies a transformation matrix $M$ by a data matrix $D$ to produce a transformed data matrix $D'$.

We can use matrix multiplications to transform our data (our data points, represented as feature vectors).
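As a minimal sketch of this idea, using made-up numbers rather than the car data, the matrix below swaps the two features of every data point. It assumes one data point per row, so the data is transposed before the multiplication and transposed back afterwards. The names `toyData` and `swapTransform` are illustrative only.

```python
import numpy as np

# three hypothetical data points, one per row, two features each
toyData = np.array([[1, 10],
                    [2, 20],
                    [3, 30]])

# transformation matrix M that swaps feature 0 and feature 1
swapTransform = np.array([[0, 1],
                          [1, 0]])

# D' = M D, applied to points stored as rows
swappedData = (swapTransform @ toyData.T).T
print(swappedData)

# the same result without the two transposes
print(toyData @ swapTransform.T)
```

Both forms give the same matrix because $(M D^T)^T = D M^T$.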

## But first, some review of dot products

What is being done in this cell?

* Element-wise multiply [4,5,6] and [1,2,3] and then sum
* Element-wise multiply [7,8,9] and [1,2,3] and then sum

In [None]:
import numpy as np

v = np.array([1,2,3])
m = np.array([[4,5,6], [7,8,9]])

print(m@v)

And in this cell?
* 32: Element-wise multiply [4,5,6] and [1,2,3] and then sum
* 6540: Element-wise multiply [4,5,6] and [10, 100, 1000] and then sum
* 50: Element-wise multiply [7,8,9] and [1,2,3] and then sum
* 9870: Element-wise multiply [7,8,9] and [10, 100, 1000] and then sum

In [None]:
m2 = np.array([[1,2,3], [10, 100, 1000]])

print(m@m2.T)
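To confirm the descriptions above, here is a small sketch that computes the same results "by hand" with element-wise multiplication and a sum, then compares them with `@`. The names `byHandVector` and `byHandMatrix` are only for illustration.

```python
import numpy as np

v = np.array([1, 2, 3])
m = np.array([[4, 5, 6], [7, 8, 9]])
m2 = np.array([[1, 2, 3], [10, 100, 1000]])

# matrix @ vector: one dot product per row of m
byHandVector = np.array([np.sum(row * v) for row in m])

# matrix @ matrix: entry (i, j) is the dot product of row i of m
# with row j of m2 (which is column j of m2.T)
byHandMatrix = np.array([[np.sum(row * other) for other in m2] for row in m])

print(np.array_equal(byHandVector, m @ v))
print(np.array_equal(byHandMatrix, m @ m2.T))
```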

## Load and look at our data

Let's load the used car data.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

data = np.genfromtxt('data/vehiclesNumeric.csv', delimiter=',', skip_header=1, dtype=int, encoding="utf-8")

# get a pandas dataframe for plotting
df = pd.DataFrame(data, columns=["id", "price", "year", "odometer"])

Let's get some **summary statistics**.

In [None]:
def getSummaryStatistics(data):
    print("min, max, mean, std per variable")
    return pd.DataFrame([data.min(axis=0), data.max(axis=0), data.mean(axis=0), data.std(axis=0)])

def getShapeType(data):
    print("shape")
    return (data.shape, data.dtype)

print(getSummaryStatistics(data))
print(getShapeType(data))

Let's **reduce the data** to two dimensions, just year and price.


In [None]:
# How are we going to get just those two columns?
reducedData = data[:, [1,2]]

print(getSummaryStatistics(reducedData))
print(getShapeType(reducedData))

Let's plot the used car data.

In [None]:
def plot2d(data):
    # the first two columns of data are price and year, in that order
    plotFrame = pd.DataFrame(data[:, [0, 1]], columns=["price", "year"])
    sns.scatterplot(data=plotFrame, x="year", y="price").set(title="Year vs price for Craigslist used car listings")

plot2d(reducedData)

## Translation

Translation is a kind of data transformation where we move data around, but each data point stays the same distance away from every other data point.
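The distance-preserving property can be checked directly. The sketch below uses three made-up points and a hypothetical `offset`, and applies the shift with plain addition rather than a matrix, just to compare pairwise distances before and after. `pairwiseDistances` is a helper written for this example.

```python
import numpy as np

toyPoints = np.array([[0.0, 0.0],
                      [3.0, 4.0],
                      [6.0, 8.0]])
offset = np.array([100.0, -50.0])  # hypothetical shift for each variable

def pairwiseDistances(points):
    # Euclidean distance between every pair of rows
    differences = points[:, np.newaxis, :] - points[np.newaxis, :, :]
    return np.sqrt((differences ** 2).sum(axis=2))

movedPoints = toyPoints + offset

print(pairwiseDistances(toyPoints))
print(np.allclose(pairwiseDistances(toyPoints), pairwiseDistances(movedPoints)))
```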

Translation is a two step process:
* Add homogeneous coordinate
* Do translation as matrix multiplication

### Add homogeneous coordinate

We need to add a dummy column of ones so we can do the matrix multiplication. Why? See https://www.sciencedirect.com/topics/mathematics/homogeneous-coordinate.
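One way to see the need for the extra column: a 2x2 matrix always maps the origin to the origin, so it can never shift every point by the same amount. The sketch below, with arbitrary made-up matrices, shows the origin staying put under a 2x2 matrix and moving once the homogeneous 1 is appended.

```python
import numpy as np

origin = np.array([0, 0])
anyMatrix = np.array([[3, -1],
                      [2, 5]])  # arbitrary 2x2 matrix

# any matrix times the zero vector is the zero vector
print(anyMatrix @ origin)

# with a homogeneous 1 appended, the last column of a 3x3 matrix
# gets added to the point, which is exactly a shift
originHomogeneous = np.array([0, 0, 1])
shift = np.array([[1, 0, 10],
                  [0, 1, -4],
                  [0, 0, 1]])
print(shift @ originHomogeneous)
```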

In [None]:
# How do we append a whole column?
homogenizedData = np.append(reducedData, np.ones((reducedData.shape[0], 1), dtype=int), axis=1)
print("homogenized data")
print(getSummaryStatistics(homogenizedData))
print(getShapeType(homogenizedData))

### Translate

Let's **translate** that year column so that it too starts at 0.

A translation matrix for two-variable data looks like:
$$\begin{pmatrix} 1 & 0 & x \\ 0 & 1 & y \\ 0 & 0 & 1\end{pmatrix}$$
where $x$ and $y$ are the amounts by which you want the 0th and 1st variables translated, respectively.
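Here is a small worked instance of that matrix on made-up (price, year) rows, assuming the homogeneous 1 has already been appended. With $x = 0$ and $y = -2005$, price is untouched and year is shifted so that its minimum becomes 0. All names and values are illustrative.

```python
import numpy as np

# hypothetical (price, year, 1) rows
toyHomogenized = np.array([[5000, 2010, 1],
                           [7500, 2015, 1],
                           [3000, 2005, 1]])

# x = 0 leaves price alone, y = -2005 shifts year
toyTranslate = np.array([[1, 0, 0],
                         [0, 1, -2005],
                         [0, 0, 1]])

toyTranslated = (toyTranslate @ toyHomogenized.T).T
print(toyTranslated)

# the year column matches a plain subtraction of the minimum
print(np.array_equal(toyTranslated[:, 1], toyHomogenized[:, 1] - 2005))
```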



In [None]:
# we need to define a transformation matrix that will allow us to shift the year variable;
# this one will be the identity matrix with the translation specified in an extra last column
translateTransform = np.eye(homogenizedData.shape[1], dtype=int)
translateTransform[1, 2] = -reducedData[:, 1].min()
print("transformMatrix")
print(translateTransform)

print(homogenizedData[0:4])

# now we need to do the translation
transformedData = (translateTransform@homogenizedData.T).T
print("after translation, transformedData")
print(getSummaryStatistics(transformedData))
print(getShapeType(transformedData))
plot2d(transformedData)

Check:
* only the summary statistics for year (min, max and mean) should have changed
* the standard deviation for year should be the same as before the translation

## Scaling

Scaling is a kind of data transformation where we increase or decrease the range of one or more variables.

### Scaling on its own

Let's **scale** that year column so it's months instead of years.

A scaling matrix for two-variable data looks like:
$$\begin{pmatrix} x & 0 \\ 0 & y \end{pmatrix}$$
where $x$ and $y$ are the factors by which you want the 0th and 1st variables scaled, respectively.
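As a quick sketch on made-up data (price and an age in years, both hypothetical), scaling the second variable by 12 multiplies its values and its standard deviation by 12 and leaves the first variable alone.

```python
import numpy as np

# hypothetical (price, age in years) rows
toyData = np.array([[5000.0, 1.0],
                    [7500.0, 2.5],
                    [3000.0, 4.0]])

# x = 1 leaves price alone, y = 12 converts years to months
toyScale = np.array([[1.0, 0.0],
                     [0.0, 12.0]])

toyScaled = (toyScale @ toyData.T).T
print(toyScaled)

# standard deviation per variable, before and after
print(toyData.std(axis=0))
print(toyScaled.std(axis=0))
```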


In [None]:
scaleTransform = np.eye(reducedData.shape[1], dtype=float)
scaleTransform[1, 1] = 12
print("transformMatrix")
print(scaleTransform)

transformedData = (scaleTransform@reducedData.T).T
print("after scaling, transformedData")
print(getSummaryStatistics(transformedData))
print(getShapeType(transformedData))
plot2d(transformedData)

Check:
* only the summary statistics for year should have changed

### Scaling together with other transformations

If you want to translate *and* scale, extend the scaling matrix with an extra row and column for the homogeneous coordinate, so that it has the same shape as the translation matrix:
$$\begin{pmatrix} x & 0 & 0\\ 0 & y & 0 \\ 0 & 0 & 1 \end{pmatrix}$$



In [None]:
scaleTransform = np.eye(homogenizedData.shape[1], dtype=float)
scaleTransform[1, 1] = 12
print("transformMatrix")
print(scaleTransform)

translateTransform = np.eye(homogenizedData.shape[1], dtype=float)
translateTransform[1, 2] = -reducedData[:, 1].min()
print("transformMatrix")
print(translateTransform)

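# compose the two transforms; the rightmost matrix is applied to the data first,
# so this scales year by 12 and then translates it
transformMatrix = translateTransform@scaleTransform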

transformedData = (transformMatrix@homogenizedData.T).T
print("after scaling and translation, transformedData")
print(getSummaryStatistics(transformedData))
print(getShapeType(transformedData))
plot2d(transformedData)

Check:
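* Although we added the homogeneous coordinate, year is scaled by the same factor as before, so its standard deviation should match the scaling-only result
* The min, max and mean for year also include the translation, so they differ from the scaling-only result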
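When transformations are combined, the order of the matrix product matters: the rightmost matrix acts on the data first. The sketch below uses a single made-up homogeneous point and a hypothetical shift of -2000 to show that scale-then-translate and translate-then-scale give different results.

```python
import numpy as np

scale = np.diag([1.0, 12.0, 1.0])  # multiply variable 1 by 12
translate = np.eye(3)
translate[1, 2] = -2000.0          # hypothetical shift of variable 1

point = np.array([5000.0, 2010.0, 1.0])  # made-up (price, year, 1)

scaleThenTranslate = translate @ scale   # scale is applied first
translateThenScale = scale @ translate   # translate is applied first

print(scaleThenTranslate @ point)  # year becomes 2010 * 12 - 2000
print(translateThenScale @ point)  # year becomes (2010 - 2000) * 12
```

If the goal is "months since the earliest year", translating first and scaling second is the order that keeps the shift in the original units.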