- All related information can be found here: Dataset Source and Preprocessing Steps.pdf
preprocessing.py
- Preprocesses the dataset
- Derives its characteristics
- (Optional) Partitions the dataset into train/validation/test
- E.g.
python3 preprocessing.py -i "ML-100K.txt" -p 1
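
How the optional partitioning is done is not described here. As a rough illustration only, a random 80/10/10 split over interaction rows could look like the sketch below. The function name `split_interactions`, the ratios, and the seed are assumptions made for this example and are not taken from preprocessing.py.

```python
import random


def split_interactions(rows, train=0.8, valid=0.1, seed=0):
    """Randomly partition rows into train/validation/test lists.

    Hypothetical helper: ratios and seed are illustrative defaults,
    not the values used by preprocessing.py.
    """
    rows = list(rows)                 # copy so the caller's data is not shuffled
    rng = random.Random(seed)         # local RNG keeps the split reproducible
    rng.shuffle(rows)
    n_train = int(len(rows) * train)
    n_valid = int(len(rows) * valid)
    train_rows = rows[:n_train]
    valid_rows = rows[n_train:n_train + n_valid]
    test_rows = rows[n_train + n_valid:]  # whatever remains goes to test
    return train_rows, valid_rows, test_rows
```

Other schemes (e.g. a per-user or time-based split) are equally plausible; check the PDF above for the one actually used.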
- You can preprocess additional datasets and/or apply your own version of preprocessing
- You might need to modify / update the following two functions in util.py:
  - TheDatasetsDilemma/Step 2/util.py, line 27 (commit 5dfe4b7)
  - TheDatasetsDilemma/Step 2/util.py, line 167 (commit 5dfe4b7)
characteristics.py
- Gathers the characteristics across all datasets into a single file
- Generates two nicely formatted tables (Table 1, Table 2) for easy viewing
euclidean_distance.py
- Derives the pairwise Euclidean distance between every pair of datasets based on their characteristics
- Generates a simple visualisation
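
As a minimal sketch of the pairwise distance step, assuming each dataset is represented by a fixed-length vector of numeric characteristics: the function name `pairwise_euclidean` and the dict-based input are hypothetical, and whether euclidean_distance.py normalises the characteristics first is not stated here.

```python
import math
from itertools import combinations


def pairwise_euclidean(characteristics):
    """Return {(name_a, name_b): distance} for every unordered pair of datasets.

    `characteristics` maps a dataset name to its vector of numeric
    characteristics (hypothetical input format for this example).
    """
    distances = {}
    # Sort by name so each pair is always keyed in alphabetical order.
    for (a, vec_a), (b, vec_b) in combinations(sorted(characteristics.items()), 2):
        distances[(a, b)] = math.dist(vec_a, vec_b)  # requires Python 3.8+
    return distances
```

For example, `pairwise_euclidean({"A": [0.0, 0.0], "B": [3.0, 4.0]})` gives a distance of 5.0 for the pair `("A", "B")`.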
clustering.py
- Performs k-means++ clustering using scikit-learn
- Hyperparameters: num_clusters, iterations, random_seed
- E.g.
python3 clustering.py -nc 5 -iter 100 -rs 1337
- Stores the clustering result
  - Version 1 (Simple): Clustering (5 Clusters) (Simple).txt
  - Version 2 (Detailed): Clustering (5 Clusters) (Detailed).txt
- Visualises the datasets as well as the clustering result
- Samples 3 datasets from each cluster (for the experiments in Step 3)
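
The clustering and sampling steps can be sketched roughly as follows with scikit-learn's `KMeans`. This is an illustration under stated assumptions, not the code in clustering.py: the function name `cluster_and_sample`, the mapping of `iterations` to `max_iter`, the `n_init=10` setting, and the use of `random.sample` for picking datasets are all choices made for this example.

```python
import random
from collections import defaultdict

import numpy as np
from sklearn.cluster import KMeans


def cluster_and_sample(names, features, num_clusters=5, iterations=100,
                       random_seed=1337, per_cluster=3):
    """Cluster datasets with k-means++ and sample a few names per cluster.

    Hypothetical helper; `names` are dataset names and `features` their
    characteristic vectors, in the same order.
    """
    model = KMeans(n_clusters=num_clusters, init="k-means++",
                   max_iter=iterations, n_init=10, random_state=random_seed)
    labels = model.fit_predict(np.asarray(features, dtype=float))

    # Group dataset names by their assigned cluster label.
    clusters = defaultdict(list)
    for name, label in zip(names, labels):
        clusters[int(label)].append(name)

    # Sample up to `per_cluster` datasets from each cluster, reproducibly.
    rng = random.Random(random_seed)
    samples = {label: rng.sample(sorted(members), min(per_cluster, len(members)))
               for label, members in clusters.items()}
    return dict(clusters), samples
```

Clusters with fewer than `per_cluster` members simply return all of their datasets in this sketch.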