2.2. Data Preprocessing

So far, we have been working with synthetic data that arrived in ready-made tensors. However, to apply deep learning in the wild we must extract messy data stored in arbitrary formats, and preprocess it to suit our needs. Fortunately, the pandas library can do much of the heavy lifting. This section, while no substitute for a proper pandas tutorial, will give you a crash course on some of the most common routines.

In [12]:
import os

os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('''NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000''')

In [13]:
import pandas as pd

data = pd.read_csv(data_file)
print(data)

   NumRooms RoofType   Price
0       NaN      NaN  127500
1       2.0      NaN  106000
2       4.0    Slate  178100
3       NaN      NaN  140000


In [14]:
inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]
inputs = pd.get_dummies(inputs, dummy_na=True)
print(inputs)

   NumRooms  RoofType_Slate  RoofType_nan
0       NaN           False          True
1       2.0           False          True
2       4.0            True         False
3       NaN           False          True


In [15]:
inputs = inputs.fillna(inputs.mean())
print(inputs)

   NumRooms  RoofType_Slate  RoofType_nan
0       3.0           False          True
1       2.0           False          True
2       4.0            True         False
3       3.0           False          True


In [16]:
import tensorflow as tf

X = tf.constant(inputs.to_numpy(dtype=float))
y = tf.constant(targets.to_numpy(dtype=float))
X, y

(<tf.Tensor: shape=(4, 3), dtype=float64, numpy=
 array([[3., 0., 1.],
        [2., 0., 1.],
        [4., 1., 0.],
        [3., 0., 1.]])>,
 <tf.Tensor: shape=(4,), dtype=float64, numpy=array([127500., 106000., 178100., 140000.])>)

2.2.5. Exercises

    1. Try loading datasets, e.g., Abalone from the UCI Machine Learning Repository and inspect their properties. What fraction of them has missing values? What fraction of the variables is numerical, categorical, or text?

In [17]:
from ucimlrepo import fetch_ucirepo

# Fetch three datasets from the UCI Machine Learning Repository by ID
abalone = fetch_ucirepo(id=1)
wine_quality = fetch_ucirepo(id=186)
heart_disease = fetch_ucirepo(id=45)


A:
- abalone has no missing values. Of its 8 features, 7 are numerical and 1 (Sex) is categorical. (The target is also categorical.)

- wine_quality has no missing values. All of its 11 features are numerical and none are categorical. (The target is categorical.)

- heart_disease has 6 missing values out of 3939 feature values (303 rows × 13 features, about 0.152%). Of its 13 features, 7 are numerical and 6 are categorical. (The target is also categorical.)
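
The counts above can be reproduced with a few pandas calls. The following is a minimal sketch, not the code used to produce the answer: `summarize` is a hypothetical helper, and the demo frame simply repeats the house_tiny.csv values so that the example runs without a download. Note that a dtype-based count treats integer-coded categories (such as `sex` and `cp` in heart_disease) as numeric, so the dataset's own variable descriptions are still needed for the final tally.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Count missing cells and split columns by dtype (hypothetical helper)."""
    n_missing = int(df.isna().sum().sum())
    n_numeric = df.select_dtypes(include="number").shape[1]
    return {
        "missing_cells": n_missing,
        "missing_fraction": n_missing / df.size,
        "numeric_columns": n_numeric,
        "other_columns": df.shape[1] - n_numeric,
    }


# Same values as house_tiny.csv, built in memory for the demo
demo = pd.DataFrame({
    "NumRooms": [None, 2.0, 4.0, None],
    "RoofType": [None, None, "Slate", None],
    "Price": [127500, 106000, 178100, 140000],
})

summary = summarize(demo)
print(summary)
```

For a dataset fetched with ucimlrepo, the same helper could be applied to `abalone.data.features`.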

    2. Try indexing and selecting data columns by name rather than by column number. The pandas documentation on indexing has further details on how to do this.

In [18]:
abalone.data.features[["Sex", "Length", "Height"]][:5]

  Sex  Length  Height
0   M   0.455   0.095
1   M   0.350   0.090
2   F   0.530   0.135
3   M   0.440   0.125
4   I   0.330   0.080


In [19]:
wine_quality.data.features[["fixed_acidity", "volatile_acidity", "citric_acid"]][:5]

   fixed_acidity  volatile_acidity  citric_acid
0            7.4              0.70         0.00
1            7.8              0.88         0.00
2            7.8              0.76         0.04
3           11.2              0.28         0.56
4            7.4              0.70         0.00


In [20]:
heart_disease.data.features[["age", "sex", "cp"]][:5]

   age  sex  cp
0   63    1   1
1   67    1   4
2   67    1   4
3   37    1   3
4   41    0   2
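
Name-based selection also works with `.loc`, which takes row and column labels together. The sketch below uses a small in-memory copy of the house_tiny.csv values (an assumption made so the example is self-contained) and shows the name-based equivalent of the earlier `iloc` split.

```python
import pandas as pd

# Same values as house_tiny.csv, built in memory for the demo
demo = pd.DataFrame({
    "NumRooms": [None, 2.0, 4.0, None],
    "RoofType": [None, None, "Slate", None],
    "Price": [127500, 106000, 178100, 140000],
})

# Name-based equivalent of data.iloc[:, 0:2] and data.iloc[:, 2]
inputs = demo[["NumRooms", "RoofType"]]
targets = demo["Price"]

# .loc selects rows and columns by label; both slice ends are inclusive
first_two = demo.loc[0:1, "NumRooms":"RoofType"]

# A boolean mask on a named column filters rows
slate = demo.loc[demo["RoofType"] == "Slate", ["RoofType", "Price"]]
print(slate)
```

Selecting by name keeps the code working if columns are later reordered, which positional indexing does not.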


    3. How large a dataset do you think you could load this way? What might be the limitations? Hint: consider the time to read the data, representation, processing, and memory footprint. Try this out on your laptop. What happens if you try it out on a server?

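A: pandas loads an entire DataFrame into RAM, so the practical ceiling is set by available memory rather than by disk space. The in-memory footprint can differ considerably from the file size on disk (text columns in particular can take much more room), and many operations create temporary copies. Read time also grows with file size, because CSV text has to be parsed. A server with more RAM raises the ceiling but does not remove it.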
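
One way to probe these limits is to measure the in-memory size of a loaded frame and, for files that do not fit, to read them in chunks. This is a minimal sketch under stated assumptions: the temporary file, its two columns, and `chunk_rows` are made up for the demo, while `memory_usage(deep=True)` and the `chunksize` argument of `read_csv` are standard pandas features.

```python
import os
import tempfile

import pandas as pd

# Write a small CSV to a temporary file so the sketch runs anywhere
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
pd.DataFrame({"x": range(10), "label": list("abcdefghij")}).to_csv(path, index=False)

# Whole-file load: deep=True also counts the contents of string columns
full = pd.read_csv(path)
bytes_in_memory = int(full.memory_usage(deep=True).sum())
bytes_on_disk = os.path.getsize(path)
print(bytes_on_disk, bytes_in_memory)

# Chunked load: at most chunk_rows rows are held in memory at a time
chunk_rows = 4  # arbitrary value for the demo
total = 0
n_chunks = 0
for chunk in pd.read_csv(path, chunksize=chunk_rows):
    total += int(chunk["x"].sum())
    n_chunks += 1
print(total, n_chunks)
```

Chunking only helps when the computation can be expressed as a running aggregate, as the sum is here.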

    4. How would you deal with data that has a very large number of categories? What if the category labels are all unique? Should you include the latter?

A: One-hot encoding creates one column per category, so with a very large number of categories it helps to reduce them first. You could drop or merge categories that have too few examples to be significant, or find categories that are highly correlated and combine them.
If the category labels are all unique, meaning every value in the column is different (an ID, for example), the column carries no information that generalizes across rows and can be dropped.
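
Both ideas can be sketched in pandas. The frame, the column names, and the `min_count` threshold below are hypothetical, and merging rare categories into a single "Other" bucket is only one of several possible reduction strategies. The uniqueness check is restricted to non-numeric columns, since a numeric column such as a price can be all unique and still be useful.

```python
import pandas as pd

# Hypothetical data: "id" is unique per row, "roof" has one rare category
df = pd.DataFrame({
    "id": ["a1", "b2", "c3", "d4", "e5", "f6"],
    "roof": ["Slate", "Slate", "Tile", "Slate", "Thatch", "Tile"],
})

# Drop non-numeric columns in which every value is unique
label_cols = df.select_dtypes(exclude="number").columns
all_unique = [c for c in label_cols if df[c].nunique() == len(df)]
df = df.drop(columns=all_unique)

# Merge categories seen fewer than min_count times into one "Other" bucket
min_count = 2  # hypothetical threshold
counts = df["roof"].value_counts()
rare = counts[counts < min_count].index
df["roof"] = df["roof"].where(~df["roof"].isin(rare), "Other")

encoded = pd.get_dummies(df)
print(encoded.columns.tolist())
```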

    5. What alternatives to pandas can you think of? How about loading NumPy tensors from a file? Check out Pillow, the Python Imaging Library.

A:
- Pillow (PIL) can open an image file as an Image object.
    - The Image object can be converted to an array with NumPy's asarray() function.
- Dask adds parallelization to Python while keeping the familiar DataFrame and NumPy interfaces.
- PySpark
- Polars
- There are many more. Here is a (probably non-exhaustive) list:
https://github.com/jcmkk3/awesome-dataframes
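
For the NumPy part of the question, arrays can be written to and read from a file directly, with no DataFrame involved. The sketch below saves the preprocessed input matrix from earlier in this section; the temporary path and the file name are assumptions made for the demo.

```python
import os
import tempfile

import numpy as np

# The preprocessed inputs from earlier in this section
X = np.array([[3.0, 0.0, 1.0],
              [2.0, 0.0, 1.0],
              [4.0, 1.0, 0.0],
              [3.0, 0.0, 1.0]])

path = os.path.join(tempfile.mkdtemp(), "inputs.npy")
np.save(path, X)           # the binary .npy format keeps dtype and shape
X_loaded = np.load(path)

# mmap_mode maps the file instead of reading it all into RAM
X_mapped = np.load(path, mmap_mode="r")
print(X_loaded.shape, X_loaded.dtype)
```

Memory mapping is one way to work with arrays that are larger than available RAM, since only the parts that are accessed get read from disk.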