# What can data scientists learn from engineers?

### Or: _Things that I am still no good at_

Ewan Nicolson


# Why engineering skills?

 - Self-determination, more impact
 - Hand overs and teamwork
 - Long term payoff
 - Mastery
 - You may be smarter, and possibly less likely to embarrass yourself


![tweet from josh wills](tweet.png)

https://twitter.com/josh_wills/status/198093512149958656?lang=en

# In the order I was convinced of these values

## Code review

This is brilliant.
Like peer review in science.

Agree the rules beforehand, for example PEP 8.

Peer review with both data scientists and engineers if you can.



https://www.python.org/dev/peps/pep-0008/

http://flake8.pycqa.org/en/latest/

https://www.youtube.com/watch?v=wf-BqAjZb8M&feature=youtu.be

# Testing

I was very reluctant about testing.

> What if it is a process that is non-deterministic?

and

> What if it is complicated, and I need to write a program to find the answer?
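For the first worry, one common approach is to fix the random seed so a run is reproducible, and to assert on a statistical property with a tolerance rather than on an exact value. The sketch below assumes a hypothetical `bootstrap_mean` function, which is not from the talk, and uses only the standard library.

```python
import random


def bootstrap_mean(values, n_resamples=1000, seed=None):
    """Average of resampled means (hypothetical example of a random process)."""
    rng = random.Random(seed)  # a private generator, so the seed is local
    n = len(values)
    means = []
    for _ in range(n_resamples):
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    return sum(means) / n_resamples


data = [1.0, 2.0, 3.0, 4.0, 5.0]

# Same seed, same answer: the test is repeatable.
assert bootstrap_mean(data, seed=42) == bootstrap_mean(data, seed=42)

# Any seed should land close to the true mean of 3.0.
assert abs(bootstrap_mean(data, seed=0) - 3.0) < 0.2
```

The tolerance of 0.2 is an illustrative choice, set well outside the spread expected from 1000 resamples.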

A couple of gateways into testing:

## These are the same tests you do on paper

Simple values



```python
def f1score(y_true, y_pred):
    """Return the F1 score of predicted labels against true labels.

    F1 = 2 * (precision * recall) / (precision + recall)

    Both inputs are treated as sets, so duplicates and order are ignored.
    """
    y_true = set(y_true)
    y_pred = set(y_pred)

    # With no true or no predicted labels there is nothing to score.
    if not y_true or not y_pred:
        return 0.0

    n_correct = len(y_true & y_pred)
    precision = n_correct / len(y_pred)
    recall = n_correct / len(y_true)

    if precision + recall == 0:
        return 0.0
    return (2 * precision * recall) / (precision + recall)
```

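```python
import pytest

# Hand-checked cases; approx() avoids brittle exact float comparisons.
assert f1score([1, 2, 3], [2, 3]) == pytest.approx(0.8)
assert f1score(['None'], [2, 'None']) == pytest.approx(2 / 3)
assert f1score([4, 5, 6, 7], [2, 4, 8, 9]) == pytest.approx(0.25)
assert f1score([1, 2, 3, 4], [1]) == pytest.approx(0.4)
```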
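As a sketch of where an expected value like 0.8 comes from, here is the first case worked through the way you would on paper. `Fraction` is used so the arithmetic stays exact; the variable names are illustrative only.

```python
from fractions import Fraction

y_true = {1, 2, 3}
y_pred = {2, 3}
overlap = y_true & y_pred  # {2, 3}

precision = Fraction(len(overlap), len(y_pred))  # 2/2 = 1
recall = Fraction(len(overlap), len(y_true))     # 2/3

# F1 = 2 * (1 * 2/3) / (1 + 2/3) = (4/3) / (5/3) = 4/5
f1 = 2 * precision * recall / (precision + recall)
assert f1 == Fraction(4, 5)
```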

## Mock anything that has already been tested

For example, don't unit test sklearn
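A minimal sketch of the idea, using `unittest.mock` from the standard library. `train_and_score` is a hypothetical wrapper, not code from the talk; the mock stands in for any estimator with `fit` and `score` methods, so the test checks only our own glue code.

```python
from unittest.mock import Mock


def train_and_score(model, X_train, y_train, X_test, y_test):
    """Our own glue code: fit the model, then report a rounded score."""
    model.fit(X_train, y_train)
    return round(model.score(X_test, y_test), 3)


# The mock replaces the real estimator; the library's own tests cover that.
fake_model = Mock()
fake_model.score.return_value = 0.87654  # placeholder value for the test

result = train_and_score(fake_model, [[0], [1]], [0, 1], [[2]], [1])

# We only check what our code is responsible for.
fake_model.fit.assert_called_once_with([[0], [1]], [0, 1])
fake_model.score.assert_called_once_with([[2]], [1])
assert result == 0.877
```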


https://testingpodcast.com/33-katharine-jarmul-testing-in-data-science/

http://www.tdda.info/

https://www.eecs.tufts.edu/~dsculley/papers/ml_test_score.pdf

https://smile.amazon.co.uk/Testing-Python-Applying-Unit-Acceptance-ebook/dp/B00LJV2GXI/ref=sr_1_1?ie=UTF8&qid=1519759535&sr=8-1&keywords=testing+python+david+sale

# Data engineering

Best way to get high quality data

A very underrated task. I used to complain about the data not being right. **Invest** in this instead.


## Don't get put off by nomenclature

Star, snowflake schemas. Normalised, denormalised.

You have the knowledge to do this.
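To make the vocabulary concrete, here is a small sketch of a star schema using an in-memory SQLite database. The table and column names are invented for illustration. One fact table points at one dimension table (normalised), and a join flattens them into the wide, denormalised shape that analysis usually wants.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table: one row per customer.
    CREATE TABLE dim_customer (
        customer_id INTEGER PRIMARY KEY,
        country TEXT
    );
    -- Fact table: one row per order, referencing the dimension.
    CREATE TABLE fact_orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES dim_customer (customer_id),
        amount REAL
    );
    INSERT INTO dim_customer VALUES (1, 'UK'), (2, 'FR');
    INSERT INTO fact_orders VALUES (10, 1, 20.0), (11, 1, 5.0), (12, 2, 7.5);
""")

# Denormalise: join the fact table to its dimension to get one wide table.
rows = conn.execute("""
    SELECT o.order_id, c.country, o.amount
    FROM fact_orders AS o
    JOIN dim_customer AS c USING (customer_id)
    ORDER BY o.order_id
""").fetchall()
```

A snowflake schema goes one step further and normalises the dimension tables themselves, for example by moving `country` into its own table.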

## Understand how ETL works

Productionise your tasks.

It might be in Python already.
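A toy sketch of the extract, transform, load shape, using only the standard library. The CSV content, the `users` table and the function names are all made up for illustration. A real job would read from and write to real systems, but the three steps look much the same.

```python
import csv
import io
import sqlite3

# Made-up raw input: one row has no user_id, one has inconsistent casing.
RAW_CSV = (
    "user_id,signup_date,plan\n"
    "1,2018-01-03,free\n"
    "2,2018-01-04,PRO\n"
    ",2018-01-05,pro\n"
)


def extract(text):
    """Read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Drop rows without an id and normalise the plan name."""
    return [
        (int(row["user_id"]), row["signup_date"], row["plan"].lower())
        for row in rows
        if row["user_id"]
    ]


def load(rows, conn):
    """Write to the target table; re-running replaces rather than duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users "
        "(user_id INTEGER PRIMARY KEY, signup_date TEXT, plan TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO users VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(text, conn):
    load(transform(extract(text)), conn)


conn = sqlite3.connect(":memory:")
run_etl(RAW_CSV, conn)
run_etl(RAW_CSV, conn)  # safe to run twice
loaded = conn.execute("SELECT * FROM users ORDER BY user_id").fetchall()
```

Making the load step safe to re-run is a design choice worth copying: a failed job can simply be run again.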


https://medium.com/@rchang/a-beginners-guide-to-data-engineering-part-i-4227c5c457d7


https://medium.com/@maximebeauchemin/the-downfall-of-the-data-engineer-5bfb701e5d6b

# AWS

Have some knowledge of the AWS tools available to you.

I'm sure the other cloud providers are very good too. I like BigQuery.

## Data in

Kinesis (plays a similar role to Kafka)

Glue

## Data storage

S3

Parquet, Athena

Redshift, RDS

## Data processing

EC2, EMR

Lambda

SageMaker


https://aws.amazon.com/big-data/

## Deployment

![step1](./step1.png)

_concept drift_

![step2](./step2.png)

_concept drift_

![step3](./step3.png)

_concept drift_

Retraining models shouldn't be difficult. This should be a low cost operation

Retraining shouldn't break anything

This steals an idea from continuous deployment/delivery.
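One way to make "retraining shouldn't break anything" concrete is an automated gate that compares a retrained model against the one currently deployed. The sketch below is an assumption about how such a gate might look. The function name, the metric names, the numbers and the tolerance are all hypothetical.

```python
def should_promote(current_metrics, candidate_metrics, tolerance=0.01):
    """Promote only if no tracked metric drops by more than `tolerance`."""
    for name, current_value in current_metrics.items():
        candidate_value = candidate_metrics.get(name)
        if candidate_value is None:
            return False  # a missing metric counts as a failure
        if candidate_value < current_value - tolerance:
            return False  # a regression on any metric blocks promotion
    return True


# Placeholder numbers, purely for illustration.
current = {"auc": 0.90, "f1": 0.75}
better = {"auc": 0.92, "f1": 0.76}
worse = {"auc": 0.91, "f1": 0.60}

assert should_promote(current, better)
assert not should_promote(current, worse)
```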

## Repeatable, automated pipelines

Whole pipeline should be repeatable

Automate deployment, retraining, evaluation
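A toy sketch of what "repeatable" can mean in practice: every step is a plain function, and everything that varies lives in one config, including the random seed. The data generator and the one-parameter model here are invented for illustration. The point is only that the same config gives the same result.

```python
import random


def make_data(seed, n=200):
    """Synthetic data: y is roughly 2 * x plus noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


def train(xs, ys):
    """Least-squares slope for a line through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def evaluate(slope, xs, ys):
    """Mean squared error on held-out data."""
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def run_pipeline(config):
    xs, ys = make_data(config["seed"])
    split = int(len(xs) * config["train_fraction"])
    slope = train(xs[:split], ys[:split])
    mse = evaluate(slope, xs[split:], ys[split:])
    return {"slope": slope, "mse": mse}


config = {"seed": 7, "train_fraction": 0.8}
first = run_pipeline(config)
second = run_pipeline(config)
assert first == second  # same config, same result
```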

## Logging performance


![grafana dashboard](https://camo.githubusercontent.com/d010ea19c70677a0bfd8a64fc01d2b0948e1ffc1/687474703a2f2f646f63732e67726166616e612e6f72672f6173736574732f696d672f66656174757265732f64617368626f6172645f6578312e706e67)

_developer dashboard_


These dashboards are not just for developers!

My colleague Danny logs model performance metrics like AUC or F1 to Grafana.

A very good talk from [Ravelin](https://www.ravelin.com/) covered logging things like the relative importance of features, to help decide whether a retrained model can be used.
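A rough sketch of the logging side, assuming metrics are emitted as one JSON line per evaluation and some collector forwards them to a dashboard. This is not how Danny or Ravelin do it; the logger name, model name and metric values are placeholders.

```python
import json
import logging
import time

logger = logging.getLogger("model_metrics")


def log_model_metrics(model_name, metrics, timestamp=None):
    """Emit one structured log line per evaluation and return it."""
    record = {
        "model": model_name,
        "timestamp": time.time() if timestamp is None else timestamp,
        **metrics,
    }
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line


# Placeholder values, not real results.
line = log_model_metrics(
    "churn_classifier", {"auc": 0.91, "f1": 0.78}, timestamp=1519800000
)
```

Structured lines like this are easy to parse downstream, which is what makes them plottable over time.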

## Microservices

(I'm going to get this wrong)

Each job is a container.

Each container should do one thing and do it well. Connect together like Lego.

Again from Ravelin, I love this idea.

They have a library of model components.

Ensemble them together to make classifiers.

<img src="./library.svg" alt='library' style='width: 500px'/>
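A toy sketch of the "library of components" idea. This is a guess at the general shape, not Ravelin's design: each component is a small function with the same interface, and an ensemble is just a combination of them. The component names, thresholds and transaction fields are hypothetical.

```python
from statistics import mean


# Each component scores a transaction between 0 and 1 and does one thing.
def large_amount(txn):
    return 1.0 if txn["amount"] > 1000 else 0.0


def new_account(txn):
    return 1.0 if txn["account_age_days"] < 7 else 0.0


def country_mismatch(txn):
    return 1.0 if txn["card_country"] != txn["ip_country"] else 0.0


def make_ensemble(components, threshold=0.5):
    """Build a classifier that averages the chosen components."""
    def classify(txn):
        return mean(component(txn) for component in components) >= threshold
    return classify


classifier = make_ensemble([large_amount, new_account, country_mismatch])

suspicious = {"amount": 5000, "account_age_days": 2,
              "card_country": "UK", "ip_country": "UK"}
ordinary = {"amount": 20, "account_age_days": 400,
            "card_country": "UK", "ip_country": "UK"}

assert classifier(suspicious)      # two of three components fire
assert not classifier(ordinary)    # none fire
```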


## All the experience

Data science is a very young field. We can learn from experienced programmers.

Find an experienced programmer and see what they get grumpy about.

## Code smells

[code smells](https://blog.codinghorror.com/code-smells/)

They've seen these before. These things have names!

 - Long Parameter List
 - Don't repeat yourself!
 - Conditional Complexity
 - Speculative Generality
 - Shotgun Surgery
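As an illustration of the first smell in the list, here is a sketch of a long parameter list and one common fix, which is to group related settings into a single object. The function and field names are invented for this example.

```python
from dataclasses import dataclass


# Smell: six positional arguments are easy to pass in the wrong order.
def describe_run_smelly(data_path, target, test_size, seed, n_trees, max_depth):
    return (f"{data_path}:{target} split={test_size} "
            f"seed={seed} trees={n_trees} depth={max_depth}")


# Fix: one config object with named fields and sensible defaults.
@dataclass(frozen=True)
class TrainingConfig:
    data_path: str
    target: str
    test_size: float = 0.2
    seed: int = 0
    n_trees: int = 100
    max_depth: int = 5


def describe_run(config):
    return (f"{config.data_path}:{config.target} split={config.test_size} "
            f"seed={config.seed} trees={config.n_trees} depth={config.max_depth}")


config = TrainingConfig(data_path="orders.csv", target="churned", n_trees=200)
assert describe_run(config) == describe_run_smelly(
    "orders.csv", "churned", 0.2, 0, 200, 5
)
```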

## Getting away from notebooks

Data scientists love notebooks (Jupyter)

A great experience is watching an engineer hate them

An even better experience is watching an engineer say: _"they are a great tool for collaboration and experimentation, but I wouldn't use them for much else"_
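One small step away from notebooks, sketched under the assumption that a cell's logic can be lifted into plain functions: the calculation becomes importable and testable, and a thin `main` gives it a command line. The names and the toy data are illustrative.

```python
import argparse


def summarise(values, threshold):
    """The logic that used to live in a notebook cell."""
    above = [v for v in values if v > threshold]
    return {"n": len(values), "n_above": len(above)}


def main(argv):
    """Thin command-line wrapper around the importable function."""
    parser = argparse.ArgumentParser(description="Summarise some scores.")
    parser.add_argument("--threshold", type=float, default=0.5)
    args = parser.parse_args(argv)
    return summarise([0.1, 0.4, 0.6, 0.9], args.threshold)


result = main(["--threshold", "0.5"])
assert result == {"n": 4, "n_above": 2}
```

In a real module you would call `main(sys.argv[1:])` under an `if __name__ == "__main__":` guard, and the notebook can still `import` and call `summarise` for exploration.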



https://smile.amazon.co.uk/Pragmatic-Programmer-Andrew-Hunt/dp/020161622X/ref=sr_1_1?s=books&ie=UTF8&qid=1519841835&sr=1-1&keywords=pragmatic+programmer&dpID=41BKx1AxQWL&preST=_SX218_BO1,204,203,200_QL40_&dpSrc=srch


https://github.com/braydie/HowToBeAProgrammer

http://opiateforthemass.es/articles/why-i-dont-like-jupyter-fka-ipython-notebook/

## Agile

I thought that this was just systemised meetings.
It is, but the meetings are useful.
Don't sit there with your laptop, engage with them.
If it genuinely isn't useful then bring it up in the retro.


Make things smaller, be customer centric, iterate

## What can engineers learn from data scientists?

Data awareness

Dealing with uncertainty

Knowing about the domain applications of our work

If a data scientist who knows engineering is awesome, then an engineer who knows data science is too

```python
print("Many thanks")
```

Many thanks


## No time for these, but two more very good things


Version control

Debugger

## Get better at Python

Two of my favourites

[Python tips](https://smile.amazon.co.uk/Python-Tricks-Buffet-Awesome-Features-ebook/dp/B0785Q7GSY/ref=sr_1_1?ie=UTF8&qid=1519844374&sr=8-1&keywords=dan+bader)

[Talk Python to me podcast](https://talkpython.fm/)

