Set up the environment by installing the libraries listed in the requirements.txt file. Run the following command in your terminal:

<b> pip install -r requirements.txt </b>


<b> Please note that this project requires Python version 3.7 or higher. Make sure to set up your environment accordingly.</b>
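
To confirm that the interpreter meets this requirement before installing anything, a quick version check can help. The snippet below is a minimal sketch; the helper name `meets_minimum_version` is ours and is not part of the repository.

```python
import sys

# Minimum version stated in this README.
MIN_VERSION = (3, 7)


def meets_minimum_version(version_info=None, minimum=MIN_VERSION):
    """Return True if the (major, minor, ...) version is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    if not meets_minimum_version():
        raise SystemExit("Python 3.7 or higher is required.")
    print("Python version OK:", sys.version.split()[0])
```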

<b> Note:</b> To reproduce our experiments, refer to the outputs.ipynb file, which shows how the models are called. Our model automatically splits the data into train and test sets and prints the evaluation results. All you need to do is provide the path of the data as input.

# Summary

The study aimed to evaluate the effectiveness of the graph convolutional network (GCN) and compare it to other machine learning models for the task of disease prediction. The GCN consistently outperformed all state-of-the-art baselines for most values of K, indicating its effectiveness in capturing the latent relationships between symptoms and diseases. The optimal neighborhood size for disease prediction was found to be 2 or 3. Additionally, the GCN was found to offer a unique advantage over traditional neural networks in capturing the complex relationships between variables in clinical decision-making.

Various multilayer perceptron (MLP) architectures were also tested, but the best results were obtained by the GCN. The findings suggest that despite the increased complexity of the GCN, it was still able to capture more information and achieve higher accuracy than the MLP models tested.

Overall, the GCN model described in the paper performed well and outperformed all other models. Although the study successfully replicated the results, more time would have allowed further fine-tuning of the model, which might have raised overall accuracy beyond what the paper reports. The study highlights the importance of exploring different machine learning models and approaches for disease prediction tasks to achieve the best possible results.








# Reproducibility


<b> Dataset:</b> The dataset used in this project is in the data folder of the GitHub repository: https://github.com/aviral38/Dl4healtcare. Keeping it alongside the code makes the project's results easier to reproduce.

To load the dataset and process it for the models, we use the load_dataset function from the Model/load_dataset.py file. This function takes the path of the dataset as a parameter and returns the list of nodes, node attributes, labels, adjacency list, and train/test sets.

Here's an example of how to use the load_dataset function:

list_of_nodes, attributes, labels, adjacency_list, train_set, test_set = <b>load_dataset(
    "./data/graph_data/191210/graph-P-191210-00"
)</b>
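
To give a feel for what the returned train/test sets might look like, here is a minimal sketch of a seeded random split over node indices. It is an illustration only: the function name `split_nodes`, the 80/20 ratio, and the seed are assumptions, and the actual logic in Model/load_dataset.py may differ.

```python
import random


def split_nodes(num_nodes, test_fraction=0.2, seed=0):
    """Shuffle node indices with a fixed seed and split them into train/test lists.

    Hypothetical helper: the ratio and seed are illustrative assumptions.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    indices = list(range(num_nodes))
    random.Random(seed).shuffle(indices)
    num_test = int(round(num_nodes * test_fraction))
    test_set = sorted(indices[:num_test])
    train_set = sorted(indices[num_test:])
    return train_set, test_set


train_set, test_set = split_nodes(10)
print(len(train_set), len(test_set))  # 8 2
```

Fixing the seed keeps the split identical across runs, which matters when comparing models on the same held-out nodes.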

<b> Using Baseline Models </b>

The baseline models used in the project are SVM, Decision Tree, Random Forest, and Multi-layer Perceptron. The code for these models is available in the <b>baseline_models</b> folder of the GitHub repository. The files svm_model.py, DecisionTree.py, randomForest_model.py, and multi_layer_perceptron.py contain the structures of these models. To use these models, you can call the functions run_svm, run_tree, run_forest, and run_mlp, which are demonstrated in the <b>outputs.ipynb</b> notebook. These functions take the path of the dataset as input. Additionally, for the run_tree function, you need to specify the number of trees to be used in the model.

<b> Using Graph Neural Network </b>

To reproduce the GCN model used in this project, refer to the file model_multi.py in the <b>Model</b> folder of the project repository. This file contains the implementation of the GCN model used for disease prediction. The <b>outputs.ipynb</b> notebook shows how the GCN model was trained on the dataset and how its performance was evaluated. Running the code in that notebook reproduces the results obtained in this project.

# Results

![overall.png](attachment:overall.png)

<b> All Disease Prediction Task Results </b>

![rare.png](attachment:rare.png)

<b> Rare Disease Prediction Task Results </b>

# Key Findings

The Graph model consistently outperforms all state-of-the-art baselines by a significant margin for most values of K (K = 2, 3, 4, 5). This confirms the effectiveness of the model in inductively representing a disease by aggregating the learned representations of its neighboring nodes (i.e., diseases).
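
The aggregation idea can be illustrated with a small pure-Python sketch of one round of mean aggregation over an adjacency list. This is a simplified illustration under our own assumptions (mean aggregator, no learned weights, no nonlinearity), not the implementation in Model/model_multi.py; the name `mean_aggregate` is hypothetical.

```python
def mean_aggregate(features, adjacency_list):
    """One round of mean aggregation: each node's new vector is the
    element-wise mean of its own vector and its neighbors' vectors."""
    aggregated = {}
    for node, own in features.items():
        neighborhood = [own] + [features[n] for n in adjacency_list.get(node, [])]
        dim = len(own)
        aggregated[node] = [
            sum(vec[i] for vec in neighborhood) / len(neighborhood)
            for i in range(dim)
        ]
    return aggregated


# Toy graph: node 0 is connected to nodes 1 and 2.
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [2.0, 2.0]}
adjacency_list = {0: [1, 2], 1: [0], 2: [0]}

hidden = features
for _ in range(2):  # two rounds let 2-hop information reach each node
    hidden = mean_aggregate(hidden, adjacency_list)
```

After the second round, node 1's vector already reflects node 2's features even though the two are not directly connected.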

Our model consistently outperforms all baselines in the rare disease prediction task when considering K > 1, similar to the general disease prediction task. These findings suggest that the Graph model is effective in capturing the latent relationships between symptoms and diseases, enabling better distinction among different types of rare diseases.

When analyzing the performance of the model with different values of K, the results show that the model achieves the highest F1 score when K is set to 2 or 3, indicating the optimal neighborhood size for disease prediction.
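
For reference, the F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below computes it for a single positive class from lists of true and predicted labels. The function name is ours, and the repository's evaluation code may average over classes differently.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 1, 0]
precision, recall, f1 = f1_score_binary(y_true, y_pred)
```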

Our experiments involved testing a variety of multilayer perceptron (MLP) architectures with varying numbers of layers and neurons, in an attempt to surpass the performance of the graph convolutional network (GCN) for disease prediction. However, the best results obtained by the MLP, shown in the tables (Figure 2, Figure 3), indicate that the GCN, despite its increased complexity, was still able to capture more information and achieve higher accuracy than the MLP models tested. These findings suggest that the GCN is not unnecessarily complex for the task at hand, and that it offers a unique advantage over traditional neural networks in capturing the complex relationships between variables in clinical decision-making.

In our experience, the GCN model is well described in the paper and outperforms all the other models (traditional approaches). While we were able to replicate the results well, more time would have let us fine-tune the model further, possibly to the point where the overall accuracy exceeds what the paper reports.


# References

1. Sun, Z., Yin, H., Chen, H., Chen, T., Cui, L., & Yang, F. (2021). Disease prediction via graph neural networks. IEEE Journal of Biomedical and Health Informatics, 25(3), 818-826. DOI: 10.1109/JBHI.2020.3004143
<b> (original paper) </b>

 
2. Hamilton, W. L., Ying, R., & Leskovec, J. (2017). Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems (pp. 1024-1034).


3. Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., & Bengio, Y. (2018). Graph Attention Networks. In International Conference on Learning Representations.


4. Köhler, S., Carmody, L., Vasilevsky, N., Jacobsen, J. O. B., Danis, D., Gourdine, J. P & Smedley, D. (2019). Expansion of the Human Phenotype Ontology (HPO) knowledge base and resources. Nucleic acids research, 47(D1), D1018-D1027. DOI: 10.1093/nar/gky1105


5. Kipf, T. N., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations (ICLR).


6. Perozzi, B., Al-Rfou, R., & Skiena, S. (2014). DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 701-710). ACM.


7. Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., & Mei, Q. (2015). LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web (pp. 1067-1077). ACM.


8. Wang, D., Cui, P., & Zhu, W. (2016). Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 1225-1234). ACM.