Data manipulations
---------------------------------------

A matrix transformation is a matrix multiplication of a transformation matrix M with a data matrix D; the product D' = MD is the transformed data matrix.

We can use matrix multiplications to transform our data (our data points, represented as feature vectors).
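
As a minimal sketch of the idea, here is a transformation applied to a tiny made-up data matrix (the values, and the names `D`, `M`, and `D_prime`, are illustrative only and not part of the car data). It assumes the layout used in the rest of this notebook, where each row is a data point, so the data is transposed before and after the multiplication.

```python
import numpy as np

# Made-up data: 3 data points (rows), 2 features (columns).
D = np.array([[1, 2],
              [3, 4],
              [5, 6]])

# Transformation matrix that doubles the first feature and keeps the second.
M = np.array([[2, 0],
              [0, 1]])

# Rows are data points, so transpose to get column vectors,
# multiply, and transpose back to the rows-are-points layout.
D_prime = (M @ D.T).T
print(D_prime)
```

The same `(M @ D.T).T` pattern shows up in every transformation below; only the contents of the transformation matrix change.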

Let's load and plot the data.
This data comes from https://www.kaggle.com/tolgahancepel/toyota-corolla.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
import pandas as pd
import seaborn as sns

data = np.array(np.genfromtxt('data/ToyotaCorolla.csv', delimiter=',', converters={3: lambda x: 1 if x == 'Diesel' else 0}, skip_header=1, dtype=int, encoding=None))  

# getting a pandas dataframe so I can visualize the data
df = pd.DataFrame(data, columns=["price", "age", "km", "fueltype", "hp", "metcolor", "automatic", "cc", "doors", "weight"])

# a parallel coordinates plot is useful for figuring out if any variables are more predictive of the dependent variable (price) than any others
pd.plotting.parallel_coordinates(df, "price")
plt.show()

I can't see my data!!

In [None]:
# I can't decide if doors or automatic are more important; how might I decide?
sns.scatterplot(x="km", y="price", size="age", hue ="automatic", palette="colorblind", sizes=(40, 200) , alpha=.6, data=df)

I still can't see my data!

In [None]:
sns.pairplot(df, y_vars = ["price"], x_vars = ["age", "km", "fueltype", "hp", "metcolor", "automatic", "cc", "doors", "weight"], kind = "scatter")

Let's get some **summary statistics**.

In [None]:
def getSummaryStatistics(data):
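    # use a float mean so fractional values are kept and the function also works on the float data produced by the normalization steps
    return np.array([data.max(axis=0), data.min(axis=0), data.mean(axis=0)])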

def getShapeType(data):
    return (data.shape, data.dtype)

print(getSummaryStatistics(data))
print(getShapeType(data))

Let's **reduce the data** to two dimensions, just price and age, since age looks like the one with the clearest correlation with price.

In [None]:
# How are we going to get just those two columns?
reducedData = data[:, :2]

# What if we just wanted price and km?
reducedDataSkipCol = data[:, [0, 2]]

print(getSummaryStatistics(reducedDataSkipCol))
print(getShapeType(reducedDataSkipCol))

# You can do projection (down or up) also by matrix multiplication
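
A possible sketch of projection by matrix multiplication, using made-up rows that stand in for price, age, and km (the array `toy` and the names `projectDown` and `projectUp` are hypothetical, chosen for this example). Projecting down uses a wide matrix that picks out the wanted features; projecting up uses a tall matrix that puts them back in place and fills the rest with zeros.

```python
import numpy as np

# Made-up rows with three features standing in for price, age, km.
toy = np.array([[13500, 23, 46986],
                [13750, 23, 72937],
                [4350, 80, 20544]])

# Projecting down: a 2x3 matrix whose rows pick out features 0 and 2.
projectDown = np.array([[1, 0, 0],
                        [0, 0, 1]])
priceKm = (projectDown @ toy.T).T

# Projecting up: a 3x2 matrix puts the two features back in their slots
# and fills the dropped feature with zeros (its original values are lost).
projectUp = projectDown.T
backUp = (projectUp @ priceKm.T).T

print(priceKm)
print(backUp)
```

Projecting down and then up does not recover the original data: the dropped column comes back as zeros.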

We need to add a dummy column of ones so we can do the matrix multiplications for these transformations. Why? See https://www.sciencedirect.com/topics/mathematics/homogeneous-coordinate.
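
One way to see why the column of ones matters: a plain 2x2 matrix always sends the point (0, 0) to (0, 0), so it cannot shift data. With an extra coordinate fixed at 1, the shift amounts can sit in the last column of a 3x3 matrix. The sketch below uses made-up points; `points`, `homogeneous`, and `shift` are names chosen for this example.

```python
import numpy as np

# Made-up 2D points, one per row.
points = np.array([[10, 1],
                   [20, 2],
                   [30, 3]])

# Append a column of ones to get homogeneous coordinates.
ones = np.ones((points.shape[0], 1), dtype=int)
homogeneous = np.hstack([points, ones])

# Identity matrix with the shift in the last column:
# subtract 10 from the first feature, leave the second alone.
shift = np.array([[1, 0, -10],
                  [0, 1, 0],
                  [0, 0, 1]])

shifted = (shift @ homogeneous.T).T
print(shifted)
```

The last column of the result is still all ones, which is what lets several such matrices be chained together.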

In [None]:
# How do we append a whole column?
homogenizedData = np.append(reducedData, np.array([np.ones(reducedData.shape[0], dtype=int)]).T, axis=1)
print("homogenized data")
print(getSummaryStatistics(homogenizedData))
print(getShapeType(homogenizedData))

def plot2d(data):
    sns.scatterplot(x="age", y="price", palette="colorblind", sizes=(40, 200) , alpha=.6, data=pd.DataFrame(data, columns=["price", "age", ""]))
    
plot2d(homogenizedData)

Let's **translate** that price column so that it too starts at 0.

In [None]:
# we need to define a transformation matrix that will allow us to shift the price variable; this one will be the identity matrix with the translation specified in an extra last column
translateTransform = np.eye(3, 3, dtype=int)
translateTransform[0, 2] = -reducedData[:, 0].min()
print("transformMatrix")
print(getShapeType(translateTransform))
print(translateTransform)

# now we need to do the translation
translatePriceData = (translateTransform@homogenizedData.T).T
print("after translation, translatePriceData")
print(getSummaryStatistics(translatePriceData))
print(getShapeType(translatePriceData))
plot2d(translatePriceData)

Let's **scale** that age column so it's months instead of years.

In [None]:
scaleTransform = np.eye(3, 3, dtype=int)
scaleTransform[1, 1] = 12
print("transformMatrix")
print(getShapeType(scaleTransform))
print(scaleTransform)

scaleAgeData = (scaleTransform@translatePriceData.T).T
print("after scaling, scaleAgeData")
print(getSummaryStatistics(scaleAgeData))
print(getShapeType(scaleAgeData))
plot2d(scaleAgeData)

Let's try **global (max-min) normalization**.

Okay, so here is how that works:
1. subtract the global minimum from each datapoint
2. divide by the global range (max - min)

What is the effect on the data?

What does that look like from the perspective of operations we have learned so far?
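
One way to read the two steps is as a translation followed by a scaling, composed into a single matrix. The sketch below uses a small made-up, already homogenized array (`toy`, `translate`, and `scale` are names chosen for this example) and also shows that the order of the two matrices matters: the rightmost matrix is applied to the data first.

```python
import numpy as np

# Made-up homogenized data: two features plus the column of ones.
toy = np.array([[2.0, 10.0, 1.0],
                [4.0, 20.0, 1.0],
                [6.0, 30.0, 1.0]])

# Global minimum and maximum over both features (not the dummy column).
lo = toy[:, :2].min()
hi = toy[:, :2].max()

# Step 1: subtract the global minimum from both features.
translate = np.eye(3)
translate[:2, 2] = -lo

# Step 2: divide both features by the global range; the dummy stays 1.
scale = np.diag([1 / (hi - lo), 1 / (hi - lo), 1])

# Translate first, then scale.
normalized = (scale @ translate @ toy.T).T

# Swapping the order scales first and then shifts, which is a different result.
wrongOrder = (translate @ scale @ toy.T).T

print(normalized)
print(wrongOrder)
```

With the correct order, the smallest value overall maps to 0 and the largest to 1, while each feature keeps its size relative to the other.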

In [None]:
# translate price and age by the global minimum; the bottom-right entry stays 1 so the dummy coordinate is preserved
translateTransform = np.eye(3, 3, dtype=int)
translateTransform[:2, 2] = -reducedData.min()
print("transformMatrix")
print(translateTransform)

# scale price and age by the global range; the dummy coordinate is left unscaled
globalRange = reducedData.max() - reducedData.min()
scaleTransform = np.eye(3, 3)
scaleTransform[0, 0] = 1 / globalRange
scaleTransform[1, 1] = 1 / globalRange
print("transformMatrix")
print(getShapeType(scaleTransform))
print(scaleTransform)

totalTransform = scaleTransform@translateTransform
print("transformMatrix")
print(getShapeType(totalTransform))
print(totalTransform)


In [None]:
globalNormalizedData = (totalTransform@homogenizedData.T).T
print("after global normalization, globalNormalizedData")
print(getSummaryStatistics(globalNormalizedData))
print(getShapeType(globalNormalizedData))
plot2d(globalNormalizedData)

I'm not sure global max-min normalization makes sense for data like this. Instead, let's try **max-min normalization per variable**.

In [None]:
translateTransform = np.eye(homogenizedData.shape[1], dtype=int)
translateTransform[:, 2] = np.array([-homogenizedData[:, 0].min(), -homogenizedData[:, 1].min(), 1], dtype=int)
print(translateTransform)
scaleTransform = np.eye(homogenizedData.shape[1]) * [1 / (homogenizedData[:, 0].max() - homogenizedData[:, 0].min()), 1 / (homogenizedData[:, 1].max() - homogenizedData[:, 1].min()), 1]

print("transformMatrix")
# translate first, then scale: the rightmost matrix is applied to the data first
print(getShapeType(scaleTransform @ translateTransform))
print(scaleTransform @ translateTransform)


localNormalizedData = (scaleTransform @ translateTransform @ homogenizedData.T).T
print("after per variable normalization, localNormalizedData")
print(getSummaryStatistics(localNormalizedData))
print(getShapeType(localNormalizedData))
plot2d(localNormalizedData)

Max-min normalization will move everything to the unit square, but that may not help me see things more clearly. What if I try **z-scoring**: normalizing each feature by its mean and standard deviation instead?

In [None]:


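# z-score each feature: subtract its mean, then divide by its standard deviation
# (assumption: the population standard deviation, which is np.std's default)
translateTransform = np.eye(3)
translateTransform[:2, 2] = -homogenizedData[:, :2].mean(axis=0)
scaleTransform = np.diag(np.append(1 / homogenizedData[:, :2].std(axis=0), 1))
zScoredData = (scaleTransform @ translateTransform @ homogenizedData.T).T
print("after z-scoring, zScoredData")
print(getSummaryStatistics(zScoredData))
print(getShapeType(zScoredData))
plot2d(zScoredData)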

Let's **rotate** the data by 270 degrees, because I like things to go up to the right.

In [None]:
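# counterclockwise rotation by 270 degrees in homogeneous coordinates
theta = np.deg2rad(270)
c, s = np.cos(theta), np.sin(theta)
# 270 is a multiple of 90, so the entries are exactly 0 or +/-1; rounding removes floating-point noise
rotateTransform = np.round(np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])).astype(int)
print("transformMatrix")
print(getShapeType(rotateTransform))
print(rotateTransform)

# assumption: rotate the homogenized price/age data from above
rotatedData = (rotateTransform @ homogenizedData.T).T
print("after rotating, rotatedData")
print(getSummaryStatistics(rotatedData))
print(getShapeType(rotatedData))
# the column labels name the original positions; after rotation they hold age and -price
sns.scatterplot(x="age", y="price", palette="colorblind", sizes=(40, 200), alpha=.6, data=pd.DataFrame(rotatedData, columns=["price", "age", ""]))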

What if I wanted to rotate it *and translate it to be centered on zero*?
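
One possible answer, sketched on made-up points: build a translation that moves the mean of each feature to zero, then apply the rotation, and compose the two into one matrix. Rotation about the origin keeps centered data centered, so centering first is enough. The names `toy`, `centerTransform`, and `rotate` are hypothetical and used only here.

```python
import numpy as np

# Made-up homogenized points: the corners of a square centered on (2, 2).
toy = np.array([[1.0, 1.0, 1.0],
                [3.0, 1.0, 1.0],
                [3.0, 3.0, 1.0],
                [1.0, 3.0, 1.0]])

# Translation that moves the mean of each feature to zero.
center = toy[:, :2].mean(axis=0)
centerTransform = np.eye(3)
centerTransform[:2, 2] = -center

# Counterclockwise rotation by 270 degrees about the origin.
theta = np.deg2rad(270)
c, s = np.cos(theta), np.sin(theta)
rotate = np.array([[c, -s, 0],
                   [s, c, 0],
                   [0, 0, 1]])

# Center first (rightmost matrix), then rotate.
totalTransform = rotate @ centerTransform
rotatedCentered = (totalTransform @ toy.T).T
print(np.round(rotatedCentered, 6))
```

If the translation were applied after the rotation instead, the shift would be applied to the already rotated coordinates and the result would generally not be centered on zero.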

Some resources:
* https://staff.fnwi.uva.nl/r.vandenboomgaard/IPCV20162017/LectureNotes/MATH/homogenous.html
* https://primer-computational-mathematics.github.io/book/d_geosciences/remote_sensing/Image_Transformations_and_Orthorectification.html
* https://www.informit.com/articles/article.aspx?p=2854376&seqNum=8
* https://towardsdatascience.com/normalization-techniques-in-python-using-numpy-b998aa81d754
* https://www.machinecurve.com/index.php/2020/11/19/how-to-normalize-or-standardize-a-dataset-in-python/