<a href="https://colab.research.google.com/github/andrewcgaitskell/dmtoolnotes/blob/main/Lists%2C_Arrays%2C_Tensors%2C_Dataframes%2C_and_Datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://colab.research.google.com/github/tensorpig/learning_tensorflow/blob/master/Lists%2C_Arrays%2C_Tensors%2C_Dataframes%2C_and_Datasets.ipynb#scrollTo=0-i0PylHrjWs

In [1]:
import pandas as pd

In [2]:
data = [[1,2,3],[4.0,5.0,6.0],['100','101','102']]
data

[[1, 2, 3], [4.0, 5.0, 6.0], ['100', '101', '102']]

In [8]:
data_df_raw = pd.DataFrame(data=data)
data_df = data_df_raw.T
data_df.columns=['legs','weight','version']
data_df

   legs  weight  version
0     1       4      100
1     2       5      101
2     3       6      102


Let's pretend we have a simple regression-like problem. We start out with 3 features describing a robotic spider we're building, for example: number of legs (feature 1), weight (feature 2), and version number (feature 3). Say we have built three prototype robots so far, so we have 3 values for each feature.

In [14]:
data_dict = {'legs':[1,2,3],
             'weight':[4.0,5.0,6.0],
             'version':['100','101','102']}
data_df_dict = pd.DataFrame(data=data_dict)
data_df_dict

   legs  weight  version
0     1     4.0      100
1     2     5.0      101
2     3     6.0      102


In [4]:
feature1 = [1,2,3]
feature2 = [4.0,5.0,6.0]
feature3 = ['100','101','102']
print(type(feature1))

<class 'list'>


In [10]:
data_df['legs'].tolist()

[1, 2, 3]

In [12]:
data_df.iloc[0]

legs         1
weight       4
version    100
Name: 0, dtype: object

We'll look at the various data structures you will probably run into when doing ML/AI in Python and TensorFlow, combining the features into matrices along the way. We start from basic Python lists and progress up to the TensorFlow Datasets (tf.data.Dataset) that you will typically feed into your neural network.

First up: the basic python LIST

In [None]:
list2d = [feature1, feature2, feature3] 
print(type(list2d))
print(list2d)
print('({},{})'.format(len(list2d),len(list2d[0]))) #nr of rows and cols
print(list2d[0]) #first row
print([row[0] for row in list2d]) #first col
print(list2d[0][0]) # value at 0,0
print([[row[i] for row in list2d] for i in range(len(list2d[0]))]) # transpose to make more like excel sheet

A Python list can hold items of any data type. The items in a list can themselves be lists, and there is no requirement for the items to be of the same type or of the same length.

There is also the Tuple, which has () around the features instead of []. A Tuple works the same way, but once created, it cannot be changed.
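As a minimal sketch of the two points above (the names `mixed` and `frozen` are made up for illustration): a list accepts items of different types and lengths and can be modified in place, while a tuple rejects item assignment.

```python
# A list can mix types and nest lists of different lengths.
mixed = [1, 4.0, '100', [1, 2], [3, 4, 5]]
mixed[0] = 'one'      # lists are mutable: item assignment works
mixed.append(None)    # and they can grow

# A tuple is written with () and cannot be changed after creation.
frozen = (1, 4.0, '100')
try:
    frozen[0] = 'one'
    tuple_error = None
except TypeError as err:
    tuple_error = err  # item assignment on a tuple raises TypeError

print(mixed)
print(type(frozen), tuple_error)
```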

Next up the Numpy ARRAY

In [None]:
import numpy as np
array2d = np.array([feature1, feature2, feature3], dtype=object)
print(type(array2d))
print(array2d)
print(array2d.shape) #nr of rows and cols
print(array2d[0,:]) #first element/row = array, could also be just array2d[0]
print(array2d[:,0]) #first column, or actually first element from each 1d array in the 2d array
print(array2d[0,0]) # value at 0,0
print(array2d.transpose()) #more like excel sheet

A NumPy array expects all items to be of the same type. If dtype=object is not used above, all of the values are converted to strings, since that is the only common type that can hold every value. A NumPy array can also hold features of different lengths, but then it becomes a 1-dimensional array whose elements are of type 'list', so there is no direct row/column indexing like you would expect from a matrix.
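The type behaviour described above can be checked with a short sketch. The variable names are illustrative, and the ragged case is built with `np.empty(..., dtype=object)` as one way to make the intent explicit; it is not the only way to construct such an array.

```python
import numpy as np

feature1 = [1, 2, 3]
feature2 = [4.0, 5.0, 6.0]
feature3 = ['100', '101', '102']

# Without dtype=object NumPy picks one common type, here a Unicode string type.
as_strings = np.array([feature1, feature2, feature3])
print(as_strings.dtype.kind)    # 'U' means Unicode string
print(repr(as_strings[0, 0]))   # the int 1 has become a string

# With dtype=object each cell keeps its original Python type.
as_objects = np.array([feature1, feature2, feature3], dtype=object)
print(type(as_objects[0, 0]), type(as_objects[1, 0]), type(as_objects[2, 0]))

# Features of different length: a 1-D object array whose elements are lists.
ragged = np.empty(2, dtype=object)
ragged[0] = [1, 2, 3]
ragged[1] = [4.0, 5.0]
print(ragged.shape, type(ragged[0]))
```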

Next up the Pandas DATAFRAME

In [None]:
import pandas as pd
dataframe = pd.DataFrame()
dataframe['feature1'] = feature1
dataframe['feature2'] = feature2
dataframe['feature3'] = feature3
print(type(dataframe))
print(dataframe)
print(dataframe.shape)
print(dataframe.iloc[0].tolist()) # first row, without .tolist() it also shows the column headers as row headers. You can also use loc[0], where 0 is now value in the index column (same as row number here)
print(dataframe['feature1'].tolist()) #first column, without .tolist() it also shows the index. You can also use .iloc[:,0]
print(dataframe.iloc[0,0]) #value at 0,0

A Pandas DataFrame is basically an Excel sheet. Each column can have its own data type, but all columns (features) must have the same length.
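A small sketch of both properties, assuming the DataFrame is built from plain Python lists (the names `frame` and `length_error` are illustrative): per-column dtypes are kept, and lists of unequal length are rejected.

```python
import pandas as pd

# Columns of different types are fine: each column keeps its own dtype.
frame = pd.DataFrame({'legs': [1, 2, 3],
                      'weight': [4.0, 5.0, 6.0],
                      'version': ['100', '101', '102']})
print(frame.dtypes)

# Lists of different lengths are rejected when building the DataFrame.
try:
    pd.DataFrame({'legs': [1, 2, 3], 'weight': [4.0, 5.0]})
    length_error = None
except ValueError as err:
    length_error = err
print(length_error)
```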

Next up TENSORs

In [None]:
import tensorflow as tf
feature3int = [int(x) for x in feature3 ] # map string values to numerical representation (in this case the string is a number so easy)
tensorRank2 = tf.constant([feature1, feature2, feature3int], dtype=float)
print(type(tensorRank2))
print(tensorRank2)
print(tensorRank2.shape)
print(tensorRank2[0,:].numpy()) #first row, without .numpy() a tensor object is returned. Could also use just [0]
print(tensorRank2[:,0].numpy()) #first col
print(tensorRank2[0,0].numpy()) # value at 0,0
print(tf.transpose(tensorRank2)) # more like excel sheet

Tensors are n-dimensional generalizations of matrices. Vectors are tensors too, and can be seen as 1-dimensional matrices. All are represented using n-dimensional arrays with a uniform type and features of uniform length. That is why I had to convert the feature3 list to int, although I could also have converted the feature1 and feature2 lists to strings.
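To make the notion of rank and uniform type concrete, here is a minimal sketch with illustrative names. It only uses `tf.constant` and the `shape` and `dtype` attributes of the resulting tensors.

```python
import tensorflow as tf

scalar = tf.constant(5.0)                        # rank 0: a single value
vector = tf.constant([1.0, 2.0, 3.0])            # rank 1: a vector
matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]])   # rank 2: a matrix

for t in (scalar, vector, matrix):
    print(t.shape.rank, t.shape, t.dtype)

# The alternative mentioned above: keep every value as a string instead.
as_text = tf.constant([['1', '2', '3'], ['100', '101', '102']])
print(as_text.dtype)  # a string tensor
```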

Next up DATASETs

In [None]:
feature1f = [float(x) for x in feature1 ] # convert int values to float so all features share one type
feature3f = [float(x) for x in feature3 ] # map string values to numerical representation
dataset = tf.data.Dataset.from_tensor_slices([feature1f, feature2, feature3f])
print(type(dataset))
print(dataset.element_spec)
print(dataset)
print(list(dataset.as_numpy_iterator()))
print(list(dataset.take(1).as_numpy_iterator())[0]) #first "row"
print(list(dataset.take(1).as_numpy_iterator())[0][0]) # value at 0,0


A Dataset is a sequence of elements, each element consisting of one or more components. In this case, the Dataset itself is a TensorSliceDataset, and each of its elements is a tensor of shape (3,) which, when converted to a list, is shown to wrap around an array of 3 floats as expected.

A Dataset is aimed at creating data pipelines, which get data from somewhere, process and transform it (typically in smaller batches), and then output it to a neural network (or somewhere else). A main goal of such a pipeline is to avoid loading all the data into memory, so that large data sets can be handled in smaller pieces. As such, getting values for specific elements in the dataset is not what Datasets are built for (and it shows).
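A minimal sketch of such a pipeline, under the assumption that each row describes one robot (legs, weight, version as floats). The doubling step in `map` is a made-up transformation, used only to show where processing would go.

```python
import tensorflow as tf

# Rows are robots here; columns are legs, weight, version (all as floats).
robots = [[1.0, 4.0, 100.0],
          [2.0, 5.0, 101.0],
          [3.0, 6.0, 102.0]]

pipeline = (tf.data.Dataset.from_tensor_slices(robots)
            .map(lambda row: row * 2.0)   # hypothetical transformation step
            .batch(2))                    # hand out at most 2 rows at a time

batch_shapes = []
for batch in pipeline:
    batch_shapes.append(batch.numpy().shape)
    print(batch.numpy())
print(batch_shapes)
```

With three rows and a batch size of 2, the loop sees one batch of two rows followed by a final batch holding the remaining row.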

In [None]:
datasett = tf.data.Dataset.from_tensor_slices((feature1, feature2, feature3))
print(type(datasett))
print(datasett.element_spec)
print(datasett)
print(list(datasett.as_numpy_iterator()))

If you create a Dataset from a tuple of arrays, instead of an array of arrays, each element is now a tuple of 3 components, each with its own TensorSpec of shape () and its own type. Listing the elements shows that each tuple holds one value from every feature, so the feature values appear transposed.

This shows that from_tensor_slices() "slices" the tensors along the first dimension.
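The difference between the two cases can be mimicked without TensorFlow. The sketch below is a plain NumPy/Python analogy of slicing along the first dimension, not how from_tensor_slices() is implemented, and all names are illustrative.

```python
import numpy as np

feature1 = [1.0, 2.0, 3.0]
feature2 = [4.0, 5.0, 6.0]
feature3 = [100.0, 101.0, 102.0]

# One 2-D array: slicing along axis 0 yields one row (one feature list) per element.
stacked = np.array([feature1, feature2, feature3])
row_elements = [stacked[i] for i in range(stacked.shape[0])]

# A tuple of three 1-D arrays: each component is sliced along its own axis 0,
# so element i holds the i-th value of every feature, much like zip().
tuple_elements = list(zip(feature1, feature2, feature3))

print(row_elements[0])     # all values of feature1
print(tuple_elements[0])   # (1.0, 4.0, 100.0)
```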