<a href="https://colab.research.google.com/github/fedhere/DSPS23_final/blob/main/DSPS2023_final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

DSPS 2023 Final Exam
====================

This exam was written by Dr. Federica Bianco and Willow Fox Fortino.

***

## Exam Rules

**MAIN RULE: WORK ALONE!!** While you were encouraged to work with others on all assignments, for this exam you must work alone.

- You may use any other resource to help you (e.g., StackOverflow, lecture slides, etc.) but you may not consult anyone who is not Dr. Bianco or Willow.
- You are encouraged to ask questions in `#final` on Slack.
    - Do _not_ send private messages on Slack to either Dr. Bianco or Willow, instead ask that question on `#final`. We may send you private messages in reply to your question.
    - Do not describe what you are doing in too much detail or share code in your questions, so as not to accidentally reveal information. If you do, we may delete your question and give you directions on how to re-ask, or instruct you to DM us.
- You must copy this notebook to your own Google Drive _before_ you start working.
- You have **72 hours** to work on the exam.

***

## How To Submit Your Exam
- Remember to copy this notebook to your own Google Drive _before_ you start working.
- Share this notebook with Dr. Bianco and Willow to submit.
    - Press the 'Share' button in the top right.
    - Make us editors when you share the notebook.
    - Share the notebook with our emails:
        - fbianco@udel.edu
        - fortino@udel.edu
- Do _not_ push your work to GitHub. This will enable anyone to see your work. If you do this by accident, let Willow or Dr. Bianco know and we can help you remove it permanently from GitHub.

***

## Exam Overview
This exam is an exercise based on the Competition for Predicting Molecular Properties (CHAMPS). Specifically, you are to predict the coupling constant between two atoms given:
- the two atom types (e.g., C and H),
- the coupling type (e.g., 2JHC),
- any features you are able to create from the molecule structure (xyz) files.

You can find the Kaggle page for this data [here](https://www.kaggle.com/competitions/champs-scalar-coupling/overview).





***

## Exam Expectations
- You will be graded on the following tasks. Each task is described in more detail below.
    1. Data Acquisition (10 points, 5 per sub-task)
    2. Data Cleaning, Preparation, and Fusion (this step will likely take the most effort) (20 points, 4 per sub-task)
    3. Data Exploration (20 points overall)
    4. Model Choice and Preprocessing (20 points, 5 per sub-task)
    5. Model Evaluation (15 points; there are 3 sub-tasks, but points will be awarded for the answer holistically, so be detailed and insightful)
    6. Extend Your Analysis (10 points)
    
    - \+ 10 points for clean and neat presentation
    - \- 10 points if notebook does not run (see reproducibility below)
- **All figures** that you submit should conform to every previously established standard for figures. Every figure should have **captions** that explain both _what_ the figure is and _why_ it is relevant. Every figure should have **axis labels**.
- Put explanation or discussion of your work in text cells.
- For each step of your work, justify your choices, and discuss how you were successful and how your work could be improved. For example, choices you might want to justify are:
    - How do you handle missing or redundant values?
    - Why did you choose certain hyperparameters?
    - Why did you choose a particular model type?
- Present your code neatly, deleting cells of code used for testing but leaving all cells needed for the code to work.

***

## Exam Reproducibility
Your code must be reproducible, meaning that someone could select `Restart Kernel` and `Run All`, and they should get the _exact_ same output that you had initially.

To be clear: "running" your code means that each cell of your code should execute from top to bottom with no errors. If someone clicks `Run All` then every cell should run. Leave enough time before you submit to restart your kernel and run your code from the beginning to ensure that it works, fix any issues, and repeat the operation until it runs smoothly top to bottom.
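
One common obstacle to getting the _exact_ same output after `Restart Kernel` and `Run All` is unseeded randomness (for example in data splits or model initialization). The following is a minimal sketch, assuming your notebook only draws random numbers through Python's `random` module and NumPy's global generator. The helper name `set_seed` and the seed value are illustrative, and other libraries (e.g., a neural network framework) would need their own seeding calls.

```python
import random

import numpy as np


def set_seed(seed=42):
    """Seed Python's and NumPy's global random number generators."""
    random.seed(seed)
    np.random.seed(seed)


set_seed(42)
first = np.random.rand(3)

set_seed(42)
second = np.random.rand(3)

# Re-seeding reproduces the same sequence of random numbers.
print(np.allclose(first, second))
```

Calling such a helper once near the top of the notebook, and passing an explicit `random_state` to any function that accepts one, is usually enough for simple workflows.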

***

# 0 Environment setup

In [None]:
# mount your drive here

Mounted at /content/drive


In [None]:
# Put all import statements in this cell.

# YOUR CODE HERE

# 1 Data Acquisition

The Kaggle page for the data is [here](https://www.kaggle.com/competitions/champs-scalar-coupling/overview).

Be sure to agree to the competition rules to be able to download the data. See hints and possible troubleshooting help in https://github.com/fedhere/DSPS23_final
- 1.1 Download the data programmatically in this notebook. There is a line of code at the bottom of [the Kaggle data page](https://www.kaggle.com/competitions/champs-scalar-coupling/data) which will do this.
- 1.2 Make a folder in your Google Drive called `DSPS23_final` and put the data in there.
  -   Do not expose your Kaggle API key by printing it anywhere in this notebook.
  -   Recall that we did this for the Titanic dataset [here](https://github.com/fedhere/DSPS_FBianco/blob/main/CodeDemos/titanictree.ipynb).
  -   The two files that you need are `train.csv` and `structures.csv`. Note that there is also a folder called `structures`, which you do not need and which unfortunately takes a long time to unpack. This may be useful: https://unix.stackexchange.com/questions/14120/extract-only-a-specific-file-from-a-zipped-archive-to-a-given-directory
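
If you prefer to stay in Python rather than use a shell command, unpacking only the two needed files can be sketched with the standard-library `zipfile` module. This is a hedged illustration, not the required approach: the helper name `extract_selected`, the archive name, and the Drive path in the commented usage are assumptions, so adjust them to whatever your download actually produced.

```python
import zipfile
from pathlib import Path


def extract_selected(zip_path, wanted, dest_dir):
    """Extract only the archive members named in `wanted` into `dest_dir`."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        available = set(archive.namelist())
        for name in wanted:
            if name in available:
                # Only this member is unpacked; the rest of the archive is skipped.
                archive.extract(name, path=dest)
                extracted.append(name)
    return extracted


# Hypothetical usage (archive name and Drive path are assumptions):
# extract_selected(
#     "champs-scalar-coupling.zip",
#     ["train.csv", "structures.csv"],
#     "/content/drive/MyDrive/DSPS23_final",
# )
```

The function returns the list of members it actually found and extracted, so you can check that both files were present in the archive.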

***

<mark> If you are stuck on this task and would like to skip it, and forfeit the points from it </mark>, you may access the `train.csv` and `structures.csv` files from here https://fbb.space/classes/dsps2023/train.csv and https://fbb.space/classes/dsps2023/structures.csv (data folder).



***

In [None]:
# YOUR CODE HERE

In [None]:
champs_train.shape

(4659076, 6)

In [None]:
champs_structure.shape

(2358875, 6)

The data shapes should match the ones shown above.

# 2 Data Cleaning, Preparation, and Fusion

- 2.1 Read the files `train.csv` and `structures.csv` into Pandas dataframes.
    - Note: Your target variable is `scalar_coupling_constant`.
    - **In a text cell, answer this question**: Given the type of variable that `scalar_coupling_constant` is, what kind of machine learning task are we performing if we want to predict it?
- 2.2 Check if there are any missing and/or duplicate values in this dataset.
    - If there are missing values, you can fill them in or remove the corresponding row or column.
    - If there are duplicate entries, you should remove them.
- 2.3 Identify the columns containing molecule identifiers (i.e., not properties of the molecules that should be included in the model). NOTE: duplicate data may depend on the variables you decide to use so you may not have the same answer as others on this

- 2.4 Each atom is associated with x-y-z values included in `structures.csv`. Merge the `structures` dataframe into the `train` dataframe. Note that each coupling involves 2 atoms, so see the **HINT** in https://github.com/fedhere/DSPS23_final/edit/main/README.md
   
- 2.5 At least one variable in the dataset is a multi-class categorical variable. One-hot encode this variable. One-hot encoding has come up a few times before; see the **HINT**.
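
The cleaning, merging, and encoding steps above can be sketched on a tiny toy dataset. This is only an illustration under stated assumptions: the column names (`molecule_name`, `atom_index_0`, `atom_index_1`, `type`, `atom_index`, `atom`, `x`, `y`, `z`) are assumed rather than taken from this notebook, the values are made up, and the helper `add_atom` is hypothetical. Check the names against your own dataframes, and note that your choices for handling missing or duplicate values may reasonably differ.

```python
import pandas as pd

# Toy stand-ins for train.csv and structures.csv (column names are assumed).
train = pd.DataFrame({
    "molecule_name": ["mol_a", "mol_a", "mol_a"],
    "atom_index_0": [1, 2, 2],
    "atom_index_1": [0, 1, 1],
    "type": ["1JHC", "2JHH", "2JHH"],
    "scalar_coupling_constant": [84.8, -11.3, -11.3],  # made-up values
})
structures = pd.DataFrame({
    "molecule_name": ["mol_a"] * 3,
    "atom_index": [0, 1, 2],
    "atom": ["C", "H", "H"],
    "x": [0.0, 1.0, 0.0],
    "y": [0.0, 0.0, 1.0],
    "z": [0.0, 0.0, 0.0],
})

# 2.2: count missing values and drop exact duplicate rows.
n_missing = int(train.isna().sum().sum())
train = train.drop_duplicates().reset_index(drop=True)


def add_atom(df, atom_idx):
    """Attach the element and coordinates of atom `atom_idx` (0 or 1) to each pair."""
    renamed = structures.rename(columns={
        "atom_index": f"atom_index_{atom_idx}",
        "atom": f"atom_{atom_idx}",
        "x": f"x_{atom_idx}",
        "y": f"y_{atom_idx}",
        "z": f"z_{atom_idx}",
    })
    return df.merge(renamed, on=["molecule_name", f"atom_index_{atom_idx}"], how="left")


# 2.4: merge once per atom in the pair.
merged = add_atom(add_atom(train, 0), 1)

# 2.5: one-hot encode the multi-class categorical column.
encoded = pd.get_dummies(merged, columns=["type"])
print(encoded.shape)
```

Renaming the `structures` columns before each merge keeps the coordinates of the two atoms in separate, clearly labeled columns instead of relying on pandas' automatic suffixes.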

***


<mark> If you are stuck here, do not waste too much time! You can skip this task and forfeit the points from it </mark>: you may access the file `data.csv` at https://fbb.space/classes/dsps2023/data.csv. This file contains the data after completing all of the cleaning, merging, and preparation steps in Task 2.

It is also acceptable for you to use this file at the beginning of the exam, and then come back later to write your own code.


***

In [None]:
# YOUR CODE HERE


In [None]:
champs_data.shape

(4503143, 18)

Given the choices we have made (and some of them may be different from yours!), this is the dataset we ended up with. Yours does not have to have this shape: your way of dealing with NaNs or duplicates may lead to a dataset of a different size, which is OK.

**QUESTION**: Given that we are predicting `scalar_coupling_constant`, what kind of machine learning task are we performing in this exam?

YOUR ANSWER HERE

In [None]:
# YOUR CODE HERE


# 3 Data Exploration

Visualize the following aspects of the data in at least one figure:
- The distribution of each feature in the data.
- The correlation between features.

Decide what to do in case you find peculiar distributions or strong correlations.

**NB**: Some visualizations may take a lot of time to render. If it is taking too long, you may visualize a subset of the data.
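
Working on a subset can be sketched as follows. The data here is synthetic and the column names are hypothetical; the point is only to show a reproducible random sample and a correlation matrix computed on it. The subset size is an arbitrary choice, not a recommendation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the merged dataset (column names are hypothetical).
n_rows = 100_000
data = pd.DataFrame({
    "x_0": rng.normal(size=n_rows),
    "y_0": rng.normal(size=n_rows),
})
# A feature built to be strongly correlated with x_0.
data["x_1"] = data["x_0"] + rng.normal(scale=0.1, size=n_rows)

# Draw a reproducible random subset so that plots render quickly.
subset = data.sample(n=5_000, random_state=0)

# Pairwise Pearson correlations between the numeric features.
corr = subset.corr()
print(corr.round(2))
```

The `subset` and `corr` objects can then be passed to whichever plotting functions you prefer (for example, histograms of each column and a heatmap of the correlation matrix). Fixing `random_state` keeps the figures identical between runs.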

In [None]:
# YOUR CODE HERE


# 4 Model Choice and Preprocessing

Now it's time to choose your model.
- 4.1 Justify your choice of model based on the nature of the data and the task to be performed.
- 4.2 Prepare (i.e., scale, normalize, or whiten) the data appropriately.
- 4.3 Split the data into a training set and a testing set.
    - If you wish, you may also split the data into three sets: training, validation, and testing.
- 4.4 Run your model and tune the hyperparameters.

There is an important hint regarding this in the README at https://github.com/fedhere/DSPS23_final/
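
As an illustration of the preparation and splitting steps, here is a minimal sketch using scikit-learn on synthetic data. It assumes standardization is the appropriate preparation for your model, which may not be the case (tree-based models, for instance, are generally insensitive to feature scaling). The split fraction and seed are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features and target standing in for the real data.
X = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the scaler on the training set only, then apply it to both sets,
# so that no information from the test set leaks into the preprocessing.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

In this sketch the split happens before the scaling. That ordering is a design choice made to avoid leakage, and it is one of the decisions you should justify in your own text cells.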

In [None]:
# YOUR CODE HERE


# 5 Model Evaluation

- 5.1 Test for convergence and overfitting.
- 5.2 Report on the results and performance of your model.
- 5.3 Visualize the predictions of your model against the true values of the target variable in at least one figure.

Comments are important here! Where is your model performing well, and where is it failing? How could you improve it?
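
One simple way to look for overfitting is to compare the score on the training set with the score on held-out data as model complexity grows. The sketch below does this for a decision tree on synthetic data. It covers only the overfitting part of 5.1 (convergence checks depend on the model, e.g., loss curves for a neural network), and the depths tried are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic 1D regression problem: a sine curve plus noise.
X = rng.uniform(-3, 3, size=(2000, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=2000)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scores = {}
for depth in (2, 5, None):  # None leaves the tree depth unlimited
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    # R^2 on the training data and on the held-out data.
    scores[depth] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(depth, scores[depth])
```

A large gap between the training score and the test score, as you should see for the unlimited-depth tree here, is a typical symptom of overfitting. A small gap with low scores on both sets suggests underfitting instead.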

In [None]:
# YOUR CODE HERE


# 6 Extend Your Analysis

Choose one of the following two tasks.

**Option 1:** Repeat Tasks 4 and 5 with a different model.
- If you used a **CART model** in Task 4...
    - Make a plot of the feature importances of your model. Identify if there is a dominant feature. If there is, remove it and re-fit the data. If the dominant feature is a part of one-hot encoded features, then remove all of those features.
- If you used a **neural network** in Task 4...
    - Try changing the architecture. You could do this by changing the number of layers, adding dropout layers to address overfitting, or changing the optimizer, loss, or activation functions. Be sure to justify your choices based on the data and the task at hand.
- If you used **any other model** in Task 4...
    - Try a CART model.

**Option 2:** For whichever model you chose, use additional variables or create additional features by manipulating and combining variables. For example, the x-y-z data can be turned into distances as seen [here](https://www.kaggle.com/code/artgor/molecular-properties-eda-and-models/notebook). This is called "feature extraction" and is an important part of data science.
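
As a concrete example of the distance feature mentioned in Option 2, the Euclidean distance between the two atoms of a pair is $d = \sqrt{(x_0 - x_1)^2 + (y_0 - y_1)^2 + (z_0 - z_1)^2}$. Here is a minimal sketch on toy rows. The coordinate column names (`x_0`, ..., `z_1`) are hypothetical and depend on how you named things when merging.

```python
import numpy as np
import pandas as pd

# Toy rows of a merged dataset; the coordinate column names are hypothetical.
pairs = pd.DataFrame({
    "x_0": [0.0, 1.0], "y_0": [0.0, 1.0], "z_0": [0.0, 1.0],
    "x_1": [3.0, 1.0], "y_1": [4.0, 1.0], "z_1": [0.0, 3.0],
})

p0 = pairs[["x_0", "y_0", "z_0"]].to_numpy()
p1 = pairs[["x_1", "y_1", "z_1"]].to_numpy()

# Euclidean distance between the two atoms of each pair, computed row by row.
pairs["distance"] = np.linalg.norm(p0 - p1, axis=1)
print(pairs["distance"].tolist())
```

Working on the NumPy arrays rather than looping over rows keeps the computation vectorized, which matters for a dataset with millions of rows.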

In [None]:
# YOUR CODE HERE
