<a href="https://colab.research.google.com/github/ds4geo/ds4geo/blob/master/WS%202020%20Course%20Notes/Session%2010.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Session 10


# **Data Science for Geoscientists - Winter Semester 2020**
# **Session 10 - Scientific code collaboration & Deep learning - 9th December 2020**

This week we will cover some general tips for creating content (e.g. figures and code) for scientific publication, look in detail at how to use Git/GitHub to collaborate on and share code, and explain the basics of the common deep learning library TensorFlow/Keras.

# 10.1 - Creating figures for scientific publication

**Plotting library advice:**
* Seaborn for data exploration
* Plotly/Bokeh/Altair for interactive plots
* Matplotlib for static plots for print

**Other advice:**
* Create high quality vector graphics by saving to a vector format such as PDF (`dpi` only affects elements that are rasterised):

 `plt.savefig("filename.pdf", dpi=300)`

* For reproducibility, especially if sharing your code, do as much as possible directly in Python/Matplotlib (i.e. not with a graphics editor)
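
As a rough illustration of the advice above, the following sketch builds a small figure entirely in Matplotlib and saves it both as a vector PDF and as a raster PNG. The data, file names, figure size and font sizes are hypothetical choices for the example, not requirements of any particular journal.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical example data: a damped oscillation
x = np.linspace(0, 10, 200)
y = np.exp(-x / 5) * np.sin(2 * x)

# Set the figure size in inches to match the intended column width
fig, ax = plt.subplots(figsize=(3.5, 2.5))
ax.plot(x, y, color="k", linewidth=1)
ax.set_xlabel("Time (s)", fontsize=8)
ax.set_ylabel("Amplitude", fontsize=8)
ax.tick_params(labelsize=8)
fig.tight_layout()

# Vector output: lines and text stay sharp at any zoom level
fig.savefig("figure_1.pdf", dpi=300)
# Raster copy, e.g. for slides or web pages
fig.savefig("figure_1.png", dpi=300)
plt.close(fig)
```

Because every styling step is in the script, anyone who clones the code can regenerate exactly the same figure.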

See:
* https://towardsdatascience.com/an-introduction-to-making-scientific-publication-plots-with-python-ea19dfa7f51e


# 10.2 - Sharing and publishing your code

**General**
* Openly share code: reproducibility, long term impact/core of science!
* GitHub is the standard place to share code
  * Create a repo for each project/paper
  * Make a good readme!
  * Make it as easy as possible for someone to press a few buttons to reproduce your work
    * e.g. clone repository and run a script/notebook
* Archive your data, e.g. pangaea.de
* Good GitHub repos of your scientific work are very valuable when applying for data science jobs!

**Code**
* Try to make code clear, readable and well commented
* Split code into different modules (python files) as necessary
 *  e.g. can define functions separately and import them into notebook, see: 
 https://stackoverflow.com/questions/52681405/how-can-i-import-custom-modules-from-a-github-repository-in-google-colab
* Repository code should work when repository is cloned (not just in colab)
* Include testing scripts as appropriate
* Consider following code style guidelines, e.g. https://www.python.org/dev/peps/pep-0008/
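
The idea of keeping functions in a separate module and importing them into a notebook can be sketched as follows. The module name `geo_utils`, the function `celsius_to_kelvin` and the temporary directory standing in for a cloned repository are all hypothetical names chosen for the example.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in for a cloned repository: a temporary directory (hypothetical layout)
repo_dir = Path(tempfile.mkdtemp())

# A module file in the repository, holding a reusable function
(repo_dir / "geo_utils.py").write_text(
    '"""Hypothetical helper module."""\n'
    "\n"
    "def celsius_to_kelvin(temp_c):\n"
    '    """Convert a temperature from degrees Celsius to Kelvin."""\n'
    "    return temp_c + 273.15\n"
)

# Make the repository importable, then import the module as a notebook would
sys.path.insert(0, str(repo_dir))
importlib.invalidate_caches()
geo_utils = importlib.import_module("geo_utils")

# A minimal test, as might live in a separate test script
assert abs(geo_utils.celsius_to_kelvin(0.0) - 273.15) < 1e-9
print(geo_utils.celsius_to_kelvin(25.0))
```

In a real project the module would simply be a file committed to the repository, and the notebook would only need the `sys.path` line and the import.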

See example repo:
https://github.com/oscarbranson/latools

See also:

http://yo-yehudi.com/2020/09/14/on-preparing-your-code-for-publication.html

# 10.3 - Code collaboration using Git

## Git vs GitHub

* Git is the version control system itself. Branches are the core principle.

* Git can be run locally on a machine, and can integrate with hosting services (see below). 

* Other version control systems exist.

* GitHub is a cloud-based hosting service for Git repositories. Hosting repositories in the cloud makes it easy for collaborators to work together via Git over the internet.

* Other hosting services like GitHub exist, e.g. Bitbucket.

* Experience with Git and GitHub is highly valued in industry.

See also:
https://blog.devmountain.com/git-vs-github-whats-the-difference

## Git for code collaboration - branches and pull requests
For the assignments we've all simply been committing new changes directly to the "main" or "master" branch of the repository.
This doesn't cause problems because we all work on different files, and there are no interdependencies between the code in the repository.

However, imagine multiple people want to work on one file, or that the repository contains multiple dependent code files (modules) which might break if changed.
To enable efficient and safe collaboration on code, Git (as implemented in GitHub) has a system of **branches**, **merges** and **pull requests**.

It is possible to use Git individually without branches, but branches and merging help to structure developments and make code testing easier (e.g. does my new development work before I merge it back into the main branch?). Pull requests are mainly valuable for collaboration between multiple contributors.

First the theory, then we will do a practical example.

## Branches
* Branches are copies of a repository which can be independently edited/ developed.
* The "base"/default branch is called "main" (formerly "master").
* Branches can be made at any time and branched off from any existing branch.
* Branches can be valuable both for individual and group development:
 * for individual development: to organise code development topics/activities
 * to permit efficient code contributions from multiple users - see below
* Typical uses:
 * branch for a new feature
 * branch to fix a bug

See also:

https://www.atlassian.com/git/tutorials/using-branches

## Merges
* Merges join separate branches back together.
* They have a source and a target, e.g. merging a source *development* branch back into the target *main* branch.
* Can be done automatically in some situations: e.g. branched *dev* from *main*, edited *dev*, merge *dev* back to *main* without having edited *main*.
* If both branches have been edited since branching, merging may lead to conflicts, which must be resolved before the merge can be completed.
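
The branch-then-merge cycle described above can be tried safely in a throwaway repository. The following is a minimal sketch that drives Git from Python via `subprocess`; it assumes Git is installed and on the path, and the file names, branch name `dev` and the "Demo User" identity are placeholders for the demonstration.

```python
import subprocess
import tempfile
from pathlib import Path

repo = Path(tempfile.mkdtemp())  # throwaway directory for the demonstration


def git(*args):
    """Run a git command inside the demo repository and return its output."""
    # The -c options supply a placeholder identity so commits work on any machine
    cmd = ["git", "-c", "user.name=Demo User", "-c", "user.email=demo@example.com", *args]
    result = subprocess.run(cmd, cwd=repo, check=True, capture_output=True, text=True)
    return result.stdout.strip()


git("init")
git("symbolic-ref", "HEAD", "refs/heads/main")  # name the first branch "main"

# First commit on main
(repo / "analysis.py").write_text("print('hello')\n")
git("add", ".")
git("commit", "-m", "initial commit")

# Branch off, then commit a new file on the dev branch only
git("checkout", "-b", "dev")
(repo / "new.py").write_text("# new feature\n")
git("add", ".")
git("commit", "-m", "added new.py")

# Back on main the new file is absent until dev is merged
git("checkout", "main")
exists_before_merge = (repo / "new.py").exists()
git("merge", "dev")  # main has no new commits, so git can merge automatically
exists_after_merge = (repo / "new.py").exists()

print(exists_before_merge, exists_after_merge)  # False True
print(git("log", "--oneline"))
```

Here *main* was not edited after branching, so the merge is automatic. Had *main* also changed the same lines as *dev*, Git would have stopped and asked for the conflict to be resolved.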

See also:

https://www.atlassian.com/git/tutorials/using-branches/git-merge

## Pull Requests
* When development on a branch is complete (and tested!), a pull request can be made.
* It is a request to "pull" (i.e. merge) the development branch back into the main branch.
* It allows the merge to be reviewed and approved before being performed.
* Allows contribution of code while maintaining quality via peer review or repository owner approval.

See also:

https://www.atlassian.com/git/tutorials/making-a-pull-request



## GitHub Git exercise: branch, pull request, merge

We will together make some edits to a simple repository to demonstrate how branches and pull requests work.

Please open this repository:
https://github.com/ds4geo/ds4geo_ws2020_git


## Overview of local machine Git usage
Git is integrated in various ways with different code editors and development environments, and depends on Git being installed locally. We will therefore not do an interactive exercise, but the following overview of a simple workflow is a useful basis.

These commands can be executed in Git Bash, or on the command line if Git is installed properly. It is also possible to run them on Google Colab virtual machines for the sake of this demonstration (not optimal for development, as you lose the machine if it goes idle).


### Clone a repository
Make a local copy of a repository which you can then work on.

In [None]:
# navigation commands (in linux)
# !cd .. # up one directory level
!ls # list contents of current directory

In [None]:
# Clone the repository into the current directory
!git clone https://github.com/ds4geo/ds4geo_ws2020_git.git

In [None]:
# Change directory into the repo
# Usually:
# !cd ds4geo_ws2020_git
# But due to a quirk in how colab works:
%cd ds4geo_ws2020_git

### Check branches and create a new branch



In [None]:
# See existing branches
!git branch

In [None]:
# Create a new branch
!git branch dev

In [None]:
# See the new branch in the branches list
!git branch

In [None]:
# Switch to the new branch
!git checkout dev

### Do some coding and commit changes

In [None]:
# Do some coding/editing
!touch new.py # just create a new blank python script

In [None]:
# Check it is in the local repository
!ls

In [None]:
# Stage all new and changed files so they are included in the next commit
!git add .

In [None]:
# Commit the staged changes to your local repository
!git commit -m "added new.py"
# Note: this fails unless your git identity (user.name and user.email) is configured - it isn't on Colab by default

In [None]:
# Push your local commits back to the remote (i.e. cloud - github or bitbucket) repository
!git push

In [None]:
# If you need to update your local repository with new changes from the remote one
# Similar to re-cloning, except that it keeps the commits you've made locally
!git pull

# Deep Learning with TensorFlow/Keras

* TensorFlow:
 * Deep learning framework/library developed by Google
 * Exploits GPUs for performance
 * Handles all the underlying maths
 * Dominant in industry (in research PyTorch is more common)
* Keras:
 * Started as an easier way to make TensorFlow models
 * Now integrated in TensorFlow
 * Straightforward but powerful for model building

Together they are much more powerful than e.g. sklearn, and underpin most real-world deep learning applications.



## Image Classification Example - Walkthrough

We will walk through a simple image classification Convolutional Neural Network example from here:
https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/images/cnn.ipynb

Main page: https://www.tensorflow.org/tutorials/images/cnn


**Super short overview of Convolutional Neural Networks:**

![](https://missinglink.ai/wp-content/uploads/2019/07/A-Convolutional-Neural-Network.png)
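
To make the building blocks in the diagram more concrete, here is a minimal NumPy sketch of a single convolution, ReLU activation and max pooling step. The toy image and the hand-written edge-detecting kernel are assumptions for illustration; in a real network the kernel values are learned during training, and Keras performs these operations far more efficiently.

```python
import numpy as np


def conv2d(image, kernel):
    """Slide a kernel over a 2D image (no padding, stride 1, no kernel flip)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Each output pixel is a weighted sum of one image patch
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def max_pool(feature_map, size=2):
    """Downsample by keeping the maximum of each size x size block."""
    h = feature_map.shape[0] // size * size
    w = feature_map.shape[1] // size * size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))


# Toy 6x6 image: dark left half, bright right half
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-written kernel that responds to dark-to-bright vertical edges
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

feature_map = np.maximum(conv2d(image, kernel), 0.0)  # convolution followed by ReLU
pooled = max_pool(feature_map)

print(feature_map.shape, pooled.shape)  # (4, 4) (2, 2)
print(pooled)
```

The feature map is large only where the kernel's pattern (here a vertical edge) occurs, and pooling shrinks it while keeping the strongest responses.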



In [None]:
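# Confusion matrix code to add to above notebook:
import numpy as np
res = model.predict(test_images)  # predicted class probabilities per test image
resx = np.argmax(res, axis=1)  # index of the most probable class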

In [None]:
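from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# normalize="true" scales each row (true class) to sum to 1
cmat = confusion_matrix(test_labels, resx, normalize="true")

fig, ax = plt.subplots(figsize=(12,12))
ConfusionMatrixDisplay(cmat, display_labels=class_names).plot(cmap="BuGn", ax=ax)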
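
To see what the `argmax` step and the row-normalised confusion matrix compute, the following NumPy sketch works through a tiny example by hand. The probabilities and labels are made up for illustration and are unrelated to the image data set.

```python
import numpy as np

# Hypothetical softmax outputs for 5 samples and 3 classes (each row sums to 1)
probs = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.7, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.1, 0.2, 0.7],
                  [0.3, 0.3, 0.4]])
true_labels = np.array([0, 1, 1, 2, 2])

# argmax along axis 1 picks the most probable class for each sample
pred_labels = np.argmax(probs, axis=1)

# Count matrix: rows are true classes, columns are predicted classes
n_classes = probs.shape[1]
counts = np.zeros((n_classes, n_classes))
for t, p in zip(true_labels, pred_labels):
    counts[t, p] += 1

# Row normalisation divides each row by the number of samples of that true class
cmat_manual = counts / counts.sum(axis=1, keepdims=True)
print(pred_labels)  # [0 1 0 2 2]
print(cmat_manual)
```

Each row then gives the fraction of samples of one true class assigned to each predicted class, so the diagonal holds the per-class recall.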

## Recreating Sklearn MLP for rock classification
The MLP Classifier in sklearn handles a lot of the details automatically, but is therefore not very flexible. The following code re-creates the sklearn model we used last week using Keras. 

In [None]:
# K is assumed to be the Keras API bundled with TensorFlow
from tensorflow import keras as K

# Create a linear sequential model (a single stack of layers)
model = K.Sequential(
    # Add the individual layers - Dense = fully connected layers
    [K.layers.Dense(64, activation="relu", name="layer1", kernel_initializer='glorot_uniform'),
     K.layers.Dense(64, activation="relu", name="layer2", kernel_initializer='glorot_uniform'),
     K.layers.Dense(64, activation="relu", name="layer3", kernel_initializer='glorot_uniform'),
     # Last layer has softmax activation = needed when each sample belongs to exactly one class
     # Size of last layer is number of classes in the data
     K.layers.Dense(14, activation="softmax", name="layerout", kernel_initializer='glorot_uniform')
    ])
# Use the Adam optimiser (method for model learning) and set the initial learning rate
opt = K.optimizers.Adam(learning_rate=0.001)
# If labels are integer class indices rather than "one-hot encoded", use this loss
# You can also specify which metrics to use to evaluate the model during training
model.compile(optimizer=opt, loss=K.losses.SparseCategoricalCrossentropy(), metrics=["accuracy"])


In [None]:
# xt, yt: training features and labels; xv, yv: validation features and labels
model.fit(xt, yt, epochs=200, batch_size=600, validation_data=(xv,yv))
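
The softmax output layer and the sparse categorical cross-entropy loss used above can be illustrated with a small NumPy sketch. The raw scores and labels are hypothetical, and this shows only the underlying arithmetic, not how Keras implements it internally. It also shows that the "sparse" loss on integer labels gives the same value as the ordinary categorical loss on one-hot labels.

```python
import numpy as np


def softmax(logits):
    """Convert raw scores to class probabilities that sum to 1 per row."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Hypothetical raw outputs of the last layer for 3 samples and 4 classes
logits = np.array([[2.0, 0.5, 0.1, -1.0],
                   [0.0, 1.5, 0.3, 0.2],
                   [0.1, 0.1, 3.0, 0.0]])
probs = softmax(logits)

# Sparse labels: one integer class index per sample
labels = np.array([0, 1, 2])
sparse_loss = -np.log(probs[np.arange(len(labels)), labels]).mean()

# One-hot labels: one row per sample with a single 1 in the true class column
one_hot = np.eye(logits.shape[1])[labels]
categorical_loss = -(one_hot * np.log(probs)).sum(axis=1).mean()

print(sparse_loss, categorical_loss)
```

The two losses agree, so the sparse version is simply a convenience that avoids building the one-hot matrix.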

# Main Project

**Task**
* Create a repo associated with your own GitHub account
 * If possible make the repo public
 * If the repo is private, please invite ds4geo to it
* Create a Jupyter Notebook with your analysis
 * Your repository can include other files as necessary
 * Include a readme file with an outline of your project
 * All data should be included in the repository or be downloadable within the notebook
 * You can make more than one notebook if it makes sense to do so
* The analysis should be well described and structured, like in the weather data assignments, but greater focus is put on the analysis itself.

**Content and Topic**
* The topic is open - it does not have to be geoscience related
* The project should include machine learning, or an equivalent detailed analysis (by agreement)


**Feedback**
* I am available by email to discuss and confirm ideas for the project.
* Please consult me early to make sure your idea is along the right lines and is likely to be successful



**Submission**
* Your assignments are submitted by committing them to your repo
* The **deadline** is 23:59 on 10th Feb 2021
* This assignment comprises 75% of the assessment for the course.
* Marks are awarded first for the analysis of the data and visualisation of results, and second for the structure and communication of the analysis/story as a whole

