Dataset Source and Preprocessing Steps

All related information can be found here: Dataset Source and Preprocessing Steps.pdf

Preprocessing

preprocessing.py
1. Preprocesses the dataset
2. Derives its characteristics
3. (Optional) Partitions the dataset into train/validation/test
E.g. python3 preprocessing.py -i "ML-100K.txt" -p 1
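
The optional partitioning step (step 3 above) could look roughly like the sketch below. The function name `partition`, the 80/10/10 ratio, the fixed seed, and the toy interaction tuples are all assumptions made for illustration; none of them are taken from preprocessing.py, which may split the data differently.

```python
import random

def partition(rows, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle rows and split them into train/validation/test lists.

    The fractions and seed are illustrative defaults, not the values
    used by preprocessing.py.
    """
    shuffled = list(rows)                   # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)   # seeded shuffle for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # remainder goes to the test set
    return train, val, test

# Hypothetical (user, item, rating) interactions, for demonstration only
interactions = [(u, i, 1.0) for u in range(10) for i in range(10)]
train, val, test = partition(interactions)
print(len(train), len(val), len(test))
```

A random split is only one option; a per-user or time-based split may suit some datasets better.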
You can also preprocess additional datasets and/or apply your own version of the preprocessing.
- You might need to modify the following two functions in util.py:
  - getFileInfo(dataset) (TheDatasetsDilemma/Step 2/util.py, line 27 in 5dfe4b7)
  - readFile(dataset, datasetFilepath, header, sep, names) (TheDatasetsDilemma/Step 2/util.py, line 167 in 5dfe4b7)

characteristics.py
1. Gathers the characteristics across all datasets into a single file
2. Generates two nicely formatted tables (Table 1, Table 2) for easy viewing

euclidean_distance.py
1. Computes the Euclidean distance between every pair of datasets based on their characteristics
2. Generates a simple visualisation
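
As a minimal sketch of the distance computation described above, the snippet below treats each dataset as a vector of characteristics and computes the Euclidean distance for every pair. The dataset names, the two-element vectors, and the function name `pairwise_euclidean` are hypothetical placeholders, not values or code from euclidean_distance.py.

```python
import math
from itertools import combinations

# Hypothetical characteristic vectors (placeholder numbers, not real statistics)
characteristics = {
    "dataset_a": [0.0, 0.0],
    "dataset_b": [3.0, 4.0],
    "dataset_c": [6.0, 8.0],
}

def pairwise_euclidean(features):
    """Return {(name_a, name_b): distance} for every unordered pair of datasets."""
    distances = {}
    for a, b in combinations(sorted(features), 2):
        # math.dist computes the Euclidean distance between two points
        distances[(a, b)] = math.dist(features[a], features[b])
    return distances

distances = pairwise_euclidean(characteristics)
for (a, b), d in distances.items():
    print(f"{a} - {b}: {d:.2f}")
```

When characteristics are on very different scales, the largest one dominates the distance, so scaling each characteristic first may be worth considering. Whether euclidean_distance.py does this is not stated here.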