Preprocessing and Preparing Data
=========

Topics:
 - Splitting your data into training and testing sets (also k-folds) ???
 - **Principal Component Analysis**
 - **Non-negative Matrix Factorization (NMF)** ???
 - **t-SNE** ???

In [None]:
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 15})
import numpy as np

Normalizing Data
--------

In [None]:
from sklearn.preprocessing import StandardScaler
# Other options are MinMaxScaler, RobustScaler... others are available too!

# Prepare an example dataset:
correlated_part = np.random.normal(0,3,(1000))
x0 = np.random.normal(10,1,(1000)) + correlated_part
x1 = np.random.normal(10,2,(1000)) + correlated_part

# Stack the two features into one array of shape (n_samples, n_features)
x_train = np.swapaxes(np.array((x0,x1)),0,1)
# x_train.shape is (1000, 2)

scaler = StandardScaler()
scaler.fit(x_train)
x_train_scaled = scaler.transform(x_train)

# Can also do x_scaled = scaler.fit(x_train).transform(x_train)
# or even x_scaled = scaler.fit_transform(x_train)
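
The scaler stores the mean and standard deviation it saw during `fit`, so the same statistics can be reused on data it has not seen. The sketch below assumes a hypothetical train/test pair (`demo_train`, `demo_test`, and the seed are illustrative, not part of the dataset above) and checks that `StandardScaler` matches the manual formula `(x - mean) / std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)              # fixed seed, only for reproducibility
demo_train = rng.normal(10, 2, (800, 2))    # hypothetical training sample
demo_test = rng.normal(10, 2, (200, 2))     # hypothetical test sample

# Fit on the training data only, then reuse the fitted statistics everywhere
demo_scaler = StandardScaler().fit(demo_train)
demo_train_scaled = demo_scaler.transform(demo_train)
demo_test_scaled = demo_scaler.transform(demo_test)

# StandardScaler uses the population standard deviation (ddof=0), like np.std
manual = (demo_train - demo_train.mean(axis=0)) / demo_train.std(axis=0)
print(np.allclose(manual, demo_train_scaled))

# Training data is exactly standardized; test data is only approximately so
print(demo_train_scaled.mean(axis=0), demo_train_scaled.std(axis=0))
print(demo_test_scaled.mean(axis=0), demo_test_scaled.std(axis=0))
```

Fitting on the training sample alone keeps information from the test sample out of the preprocessing step.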

In [None]:
fig,(ax1,ax2) = plt.subplots(1,2,figsize=(16, 6))
ax1.scatter(x0,x1)
ax2.scatter(x_train_scaled[:,0],x_train_scaled[:,1])
ax1.set_title('original')
ax2.set_title('scaled');

Principal Component Analysis (PCA)
==========

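Construct an **orthonormal basis** from the data in which the transformed features are **uncorrelated**. The components are ordered by variance: the first component is the direction that **maximizes the variance** of the data, and each later component maximizes the remaining variance while staying orthogonal to the earlier ones.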

The principal components are the **eigenvectors** of the data's **covariance matrix**. They can be computed either by an **eigendecomposition of the covariance matrix** or by a **singular value decomposition** of the centered data matrix.
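
A minimal NumPy sketch of the two routes, assuming a synthetic two-feature dataset built like the one in the normalization example (the names `data`, `centered`, and the seed are illustrative). It is meant to show that both decompositions give the same directions and variances, not to reproduce scikit-learn's internals.

```python
import numpy as np

rng = np.random.default_rng(0)
shared = rng.normal(0, 3, 1000)                       # common part that correlates the features
data = np.column_stack((rng.normal(10, 1, 1000) + shared,
                        rng.normal(10, 2, 1000) + shared))
centered = data - data.mean(axis=0)                   # PCA works on mean-centered data

# Route 1: eigendecomposition of the covariance matrix
cov = np.cov(centered, rowvar=False)                  # shape (2, 2), normalized by n - 1
eigvals, eigvecs = np.linalg.eigh(cov)                # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]                     # reorder to descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]  # columns of eigvecs are the components

# Route 2: singular value decomposition of the centered data matrix
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
svd_variances = singular_values**2 / (len(centered) - 1)

# Both routes agree up to the sign of each component
print(np.allclose(eigvals, svd_variances))
print(np.allclose(np.abs(eigvecs.T), np.abs(vt)))

# In the new basis the features are uncorrelated (off-diagonal covariance near 0)
projected = centered @ eigvecs
print(np.cov(projected, rowvar=False))
```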

In [None]:
from sklearn.decomposition import PCA

In [None]:
pca = PCA(n_components=2)
pca.fit(x_train_scaled)
x_pca = pca.transform(x_train_scaled)

In [None]:
fig,(ax1,ax2) = plt.subplots(1,2,figsize=(16, 6))
ax1.scatter(x0,x1)
ax2.scatter(x_pca[:,0],x_pca[:,1])
ax1.set_title('original')
ax2.set_title('PCA');

In [None]:
# The principal axes (the rotation matrix): one row per component, one column per feature
pca.components_

Exploratory Plotting (simple time series; sns pairplot)
=======

 - Plot vs Time
 - Plot the variables of the dataset and their correlations

In [None]:
import seaborn as sns
import pandas as pd

In [None]:
df = pd.read_csv('ninja_pv_wind_profiles_singleindex.csv')

In [None]:
# Keep the first 1000 rows; copy so that new columns can be added without pandas warnings
df = df[:1000].copy()

In [None]:
fig,ax = plt.subplots(figsize=[20,5])
df['time_dt'] = pd.to_datetime(df['time'])
ax.plot(np.array(df['time_dt'][:200]),np.array(df['AT_pv_national_current'][:200]),label='Austria solar')
ax.plot(np.array(df['time_dt'][:200]),np.array(df['AT_wind_national_current'][:200]),label='Austria wind')
ax.plot(np.array(df['time_dt'][:200]),np.array(df['BE_pv_national_current'][:200]),label='Belgium solar')
ax.plot(np.array(df['time_dt'][:200]),np.array(df['BE_wind_offshore_current'][:200]),label='Belgium wind offshore')
ax.plot(np.array(df['time_dt'][:200]),np.array(df['SE_pv_national_current'][:200]),label='Sweden solar')
ax.legend()
plt.show()

In [None]:
df['AL_pv'] = df['AL_pv_national_current']
df['AT_pv'] = df['AT_pv_national_current']
df['AT_wind'] = df['AT_wind_national_current']
sns.pairplot(df,vars=['AL_pv', 'AT_pv','AT_wind'])

Test-train splitting - Scikit-learn
-------

Scikit-learn provides `train_test_split` for this:

In [None]:
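from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

def split():
    # Assumption: iris_X and iris_y come from scikit-learn's bundled iris dataset
    iris_X, iris_y = load_iris(return_X_y=True)
    # Hold out 33% of the samples for testing; random_state makes the shuffle repeatable.
    # Another option is *stratify*, which balances the split so that each y output class
    # is represented in the same proportion in the training and testing samples.
    x_train, x_test, y_train, y_test = train_test_split(iris_X, iris_y, test_size=0.33, random_state=42)
    return x_train, x_test, y_train, y_test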
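
The *stratify* option and the k-folds mentioned in the topic list can be sketched as follows. The arrays `demo_X` and `demo_y` are hypothetical toy data with a 75%/25% class imbalance, chosen so the effect of stratification is easy to read off.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

demo_X = np.arange(40).reshape(20, 2)      # hypothetical features, 20 samples
demo_y = np.array([0] * 15 + [1] * 5)      # hypothetical labels: 15 of class 0, 5 of class 1

# stratify=demo_y keeps the 75% / 25% class ratio in both parts of the split
X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42, stratify=demo_y)
print(np.bincount(y_tr), np.bincount(y_te))   # [12  4] [3 1]

# k-fold: every sample is used for validation exactly once across the 5 folds
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = []
validated = []
for train_idx, val_idx in kfold.split(demo_X):
    fold_sizes.append(len(val_idx))
    validated.extend(val_idx.tolist())
print(fold_sizes)                             # [4, 4, 4, 4, 4]
```

Scikit-learn also offers `StratifiedKFold` when the folds themselves should preserve the class ratio.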

Preprocessing in TensorFlow - ImageDataGenerator
=========

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator
train_datagen = ImageDataGenerator(rescale=1/255.)

train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=(300,300), # this resizes each image "on-the-fly", not affecting the source data.
    batch_size=128, # number of images per batch
    class_mode='binary')

validation_generator = ... #(same as above, but with the validation directory)
```
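
The `rescale` argument multiplies every pixel value by the given factor, so `1/255.` maps 8-bit intensities from [0, 255] to [0, 1]. A NumPy-only sketch of that arithmetic, using a hypothetical random batch (`fake_batch`) in place of images read from disk:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical batch of 4 RGB images, 300x300, with 8-bit pixel values
fake_batch = rng.integers(0, 256, size=(4, 300, 300, 3)).astype(np.float32)

# Same arithmetic as rescale=1/255.: multiply each pixel by the factor
rescaled = fake_batch * (1 / 255.)

print(fake_batch.min(), fake_batch.max())
print(rescaled.min(), rescaled.max())
```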

Keras Preprocessing img_to_array
===========


In [None]:
from tensorflow.keras.preprocessing import image
import numpy as np
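# Load the image from disk and resize it to 150x150 pixels
img = image.load_img('dog.jpg', target_size=(150, 150))
# Convert the PIL image to a float array of shape (height, width, channels)
x = image.img_to_array(img)
# Add a leading batch axis, since models expect (batch, height, width, channels)
x_exp = np.expand_dims(x, axis=0)
# Stack batches along axis 0; with a single image this leaves the shape unchanged
images = np.vstack([x_exp])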

In [None]:
img

In [None]:
print(x.shape)
print(x_exp.shape)
print(images.shape)
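
The shape bookkeeping can be checked without an image file. The sketch below uses two placeholder arrays (`first`, `second`, hypothetical stand-ins for decoded RGB images) to show what `expand_dims` and `vstack` do once there is more than one image.

```python
import numpy as np

# Stand-ins for two decoded 150x150 RGB images (placeholder data, not real photos)
first = np.zeros((150, 150, 3), dtype=np.float32)
second = np.ones((150, 150, 3), dtype=np.float32)

# Add a leading batch axis to each image: (150, 150, 3) becomes (1, 150, 150, 3)
first_batch = np.expand_dims(first, axis=0)
second_batch = np.expand_dims(second, axis=0)

# vstack concatenates along axis 0, so two single-image batches form one batch of two
both = np.vstack([first_batch, second_batch])

print(first.shape)        # (150, 150, 3)
print(first_batch.shape)  # (1, 150, 150, 3)
print(both.shape)         # (2, 150, 150, 3)
```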

Tokenizing Words with Keras
=========

Notes:
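 - Use the same tokenizer for the test sample and the train sample, so that each word maps to the same index in both
 - Add an "out-of-vocabulary" token as a placeholder for words that were not seen when the tokenizer was fit
 - Pad the sequences to a common length so that they can be batched together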

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!',
    'I think my dog is amazing',
]

tokenizer = Tokenizer(num_words = 100,oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

# 'loves' and 'manatee' were not seen during fitting, so they map to the <OOV> token
sentences.append('my dog loves my manatee')
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

padded = pad_sequences(sequences) # Pads (and truncates) at the front by default; options: padding='post', maxlen, truncating='post'
print(padded)
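
To make the three notes concrete, here is a pure-Python sketch of the same pipeline. It is a simplified imitation, not the Keras implementation: the helpers `build_word_index`, `to_sequences`, and `pad_pre` are hypothetical names, and the regular expression is only a rough stand-in for the Keras default filters.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of letters, digits and apostrophes (simplified filter)
    return re.findall(r"[a-z0-9']+", text.lower())

def build_word_index(texts, oov_token='<OOV>'):
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    # Most frequent words get the lowest indices; ties keep first-seen order
    words = [word for word, _ in counts.most_common()]
    # Index 0 is reserved for padding, index 1 for the out-of-vocabulary token
    return {word: i for i, word in enumerate([oov_token] + words, start=1)}

def to_sequences(texts, word_index, oov_token='<OOV>'):
    oov = word_index[oov_token]
    return [[word_index.get(word, oov) for word in tokenize(text)] for text in texts]

def pad_pre(sequences, value=0):
    # Left-pad every sequence with `value` up to the longest length
    maxlen = max(len(seq) for seq in sequences)
    return [[value] * (maxlen - len(seq)) + seq for seq in sequences]

train_sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!',
    'I think my dog is amazing',
]
word_index = build_word_index(train_sentences)

# Reuse the same word index for a sentence that was not part of the fit
all_sentences = train_sentences + ['my dog loves my manatee']
sequences = to_sequences(all_sentences, word_index)
padded = pad_pre(sequences)

print(word_index)
print(sequences[-1])   # [2, 5, 1, 2, 1]: 'loves' and 'manatee' map to <OOV>
print(padded[0])       # [0, 0, 3, 4, 2, 5]
```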