# Industry

Monetisation models:
* Sell raw data
* Provide data analytics
* Develop data platform

Hype is around B2C, money is in B2B.

Start with business problem, then work out what data is useful for that.

Challenge is often feature engineering.

Information modelling is important.

# Experimental design (causal inference)

* Define your objective (in data science language, identify scope of inference)
* Need a hypothesis test group and a control (or at least an alternative hypothesis, e.g. A/B testing)
* Counter confounding effects, e.g. non-representative sampling, differing samples in test/control
* Null hypothesis = status quo
* Stochastic proof by contradiction:
  * Given observed data from each intervention, compute test statistic
  * Re-assign samples to interventions using the same sampling approach, but with data corresponding to the null hypothesis
  * Compute test statistic
  * Repeat until I have a histogram of test statistics under the null hypothesis (under all possible randomisations)
  * Is the observed value unusual relative to this distribution? Use the p-value
* Then, we either have enough evidence to reject the null hypothesis, or we don't
* Avoid bad randomisations: block against confounding factors you know, randomise those you cannot. Then sample (for both the observed and null-hypothesis sets)
* Matched-pair: Create nearly identical pairs, and allocate one intervention to each. (extreme case of blocking)
* Difficulties: many covariates, multiple interventions, randomisation restrictions
  * Many covariates -> at least one will be imbalanced across intervention groups. Before experimenting, decide on a validity test for the randomisation. E.g. covariates must be balanced to some tolerance. Keep randomisations if they score within the valid zone (this extends to multiple covariates, and the valid zone should match the shape of the distribution of randomisations). It will be more important to balance some covariates than others.
  * Many interventions -> need balance here as well. Two-factor interaction (denoted Xa:Xb): how does an intervention affect the outcome, given that another intervention has been administered? Three-factor, four-factor, etc. follow the same pattern. Marginal/main/one-factor effects are probably most important, so balance these first. Then can balance two-factor less, three-factor not at all (for example).
  * Not sure how to extend to multiple covariates **AND** multiple interventions...
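
The randomisation procedure above can be sketched in a few lines of NumPy. This is a minimal illustration, not a definitive implementation: the function names `draw_assignment` and `randomisation_test`, the single covariate `age`, the balance tolerance `tol`, and the difference-in-means test statistic are all assumptions made for the example. The null distribution is built by redrawing assignments with the same balance rule that produced the real assignment, while the outcomes stay fixed (which is what the null hypothesis of "no effect" implies).

```python
import numpy as np


def draw_assignment(covariate, n_test, rng, tol=np.inf, max_tries=10_000):
    """Pick n_test units for the test group at random, redrawing until the
    covariate means of the two groups differ by at most tol."""
    n = len(covariate)
    for _ in range(max_tries):
        is_test = np.zeros(n, dtype=bool)
        is_test[rng.choice(n, size=n_test, replace=False)] = True
        if abs(covariate[is_test].mean() - covariate[~is_test].mean()) <= tol:
            return is_test
    raise RuntimeError("no valid randomisation found; loosen tol")


def randomisation_test(outcome, is_test, covariate, tol=np.inf, n_draws=2000, seed=0):
    """Two-sided randomisation test on the difference in group means."""
    rng = np.random.default_rng(seed)
    observed = outcome[is_test].mean() - outcome[~is_test].mean()
    n_test = int(is_test.sum())
    null_stats = np.empty(n_draws)
    for i in range(n_draws):
        # Same sampling approach as the real experiment, outcomes held fixed.
        fake = draw_assignment(covariate, n_test, rng, tol)
        null_stats[i] = outcome[fake].mean() - outcome[~fake].mean()
    # Share of null statistics at least as extreme as the observed one
    # (the +1 terms count the observed assignment itself).
    p_value = (np.sum(np.abs(null_stats) >= abs(observed)) + 1) / (n_draws + 1)
    return observed, null_stats, p_value


# Hypothetical data: one covariate (age) and a simulated treatment effect.
rng = np.random.default_rng(42)
n = 60
age = rng.normal(40, 10, size=n)
is_test = draw_assignment(age, n // 2, rng, tol=2.0)
outcome = 0.05 * age + rng.normal(0, 1, size=n) + 1.5 * is_test
observed, null_stats, p_value = randomisation_test(outcome, is_test, age, tol=2.0)
print(f"observed difference = {observed:.2f}, p-value = {p_value:.4f}")
```

`null_stats` is the histogram of test statistics under the null hypothesis; a small `p_value` means the observed difference is unusual relative to it.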

# Deep learning

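Autoencoders do not require labelled data. Unsupervised method? This is PCA if done in a naive manner (linear units, squared-error loss: the network learns the same subspace as PCA). Can tie the encoder weights to the decoder weights (the decoder reuses the encoder matrix, transposed)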
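
A rough numerical check of the PCA connection, under stated assumptions: a purely linear autoencoder with tied weights (decoder is the transpose of the encoder), squared-error loss, centred synthetic data, and plain gradient descent with a hand-picked learning rate and step count. None of these choices come from the notes; they are just the simplest setup in which the claim can be tested.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 300, 5, 2

# Synthetic data with two dominant directions, randomly rotated, then centred.
scales = np.array([3.0, 2.0, 0.3, 0.2, 0.1])
rotation, _ = np.linalg.qr(rng.normal(size=(d, d)))
X = (rng.normal(size=(n, d)) * scales) @ rotation.T
X -= X.mean(axis=0)


def reconstruction_error(X, W):
    """Mean squared reconstruction error of encode (X @ W) then decode (@ W.T)."""
    return np.mean(np.sum((X - X @ W @ W.T) ** 2, axis=1))


# PCA baseline: project onto the top-k right singular vectors.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
pca_err = reconstruction_error(X, Vt[:k].T)

# Tied-weight linear autoencoder trained by gradient descent.
W = 0.1 * rng.normal(size=(d, k))
learning_rate = 0.005  # assumed value, small enough for this data scale
for _ in range(3000):
    residual = X - X @ W @ W.T
    grad = -(2.0 / n) * (X.T @ residual @ W + residual.T @ X @ W)
    W -= learning_rate * grad
ae_err = reconstruction_error(X, W)

print(f"PCA error: {pca_err:.4f}, autoencoder error: {ae_err:.4f}")
```

The autoencoder cannot beat PCA here, since PCA gives the best rank-k linear reconstruction; the point is that it ends up matching it.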

Normalisation of input data allows us to initialise the bias vectors to zero. The initial hyperplanes then pass through the origin, which after normalisation is the centre of the training data.
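
A small sketch of why this works, using made-up two-feature data and a single random hyperplane with zero bias (both are assumptions for illustration). After standardising, a hyperplane through the origin cuts through the middle of the data, so the unit starts out doing something useful.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical raw features on very different scales and offsets.
X = rng.normal(loc=[50.0, -3.0], scale=[10.0, 0.5], size=(500, 2))

# Standardise each feature to zero mean and unit variance.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

# One unit with random weights and zero bias: hyperplane w.x + b = 0.
w = rng.normal(size=2)
b = 0.0

frac_raw = np.mean(X @ w + b > 0)        # share of raw points on the positive side
frac_norm = np.mean(X_norm @ w + b > 0)  # same, after normalisation

print(f"positive side, raw data: {frac_raw:.2f}")
print(f"positive side, normalised data: {frac_norm:.2f}")
```

On the normalised data the split is close to half and half, whereas on the raw data the same hyperplane usually leaves nearly every point on one side.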

2006 breakthrough: train deep architectures using layer-wise unsupervised learning. Converges the network to different and typically better solutions! Also faster learning overall! Good for when labelled set is small, but there is a large unlabelled set. If the labelled set is large, this doesn't matter so much.

Dropout helps with overfitting. Slows training. Intuition: simulates an ensemble of models, but only outputs the average. Trains larger networks on smaller data.
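
A minimal sketch of the mechanism, assuming the common "inverted dropout" variant (surviving activations are scaled up during training so that nothing needs rescaling at test time). The function name and the `drop_prob` parameter are illustrative.

```python
import numpy as np


def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero each activation with probability drop_prob
    during training and rescale the rest; pass through unchanged otherwise."""
    if not training or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
hidden = np.ones((1000, 100))

train_out = dropout(hidden, drop_prob=0.5, rng=rng, training=True)
test_out = dropout(hidden, drop_prob=0.5, rng=rng, training=False)

print("mean activation while training:", train_out.mean())
print("mean activation at test time:", test_out.mean())
```

Each training step sees a different random sub-network, and the rescaling keeps the expected activation the same, which is where the "ensemble that outputs the average" intuition comes from.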

Work out a framework for implementing iterative methods (e.g. k-means) in the same manner as that great Julia blog post I saw. "Iterative methods done right" (https://lostella.github.io/2018/07/25/iterative-methods-done-right.html)
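
One possible shape for such a framework in Python, using generators: the algorithm is an endless stream of states, and stopping rules are separate wrappers composed around it. This is a loose adaptation of the idea rather than a port of the Julia code in the post; the names `kmeans_iterations`, `until_converged` and `last` are made up here.

```python
import itertools

import numpy as np


def kmeans_iterations(X, centroids):
    """Endless generator: yields (centroids, labels) after each Lloyd update."""
    centroids = np.array(centroids, dtype=float)
    while True:
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(len(centroids)):
            members = X[labels == j]
            if len(members) > 0:  # leave empty clusters where they are
                new_centroids[j] = members.mean(axis=0)
        centroids = new_centroids
        yield centroids, labels


def until_converged(iterations, tol=1e-8):
    """Pass states through, stopping once centroids move less than tol."""
    previous = None
    for centroids, labels in iterations:
        yield centroids, labels
        if previous is not None and np.linalg.norm(centroids - previous) < tol:
            return
        previous = centroids


def last(iterable):
    """Consume an iterable and return its final item."""
    item = None
    for item in iterable:
        pass
    return item


X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
initial = X[[0, 3]]

# Compose: the algorithm, a convergence rule, and a cap of 100 iterations.
steps = itertools.islice(until_converged(kmeans_iterations(X, initial)), 100)
centroids, labels = last(steps)
print(centroids, labels)
```

The algorithm itself knows nothing about stopping, logging or iteration limits, so the same wrappers can be reused for any other iterative method written as a generator.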

Look into using Julia as a data science language! Turing institute has released WSJ I think...

Clustering: Single linkage prone to chaining (and big clusters grow bigger), complete linkage is sensitive to outliers (but gives balanced cluster sizes)

Video clustering: Treat video as 4D, look at pixel similarity, save a whole hierarchical tree and let the user choose the threshold. Hierarchical clustering is fast.
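
The "save the tree, choose the threshold later" idea can be sketched with SciPy. The six one-dimensional points below are a toy stand-in for pixel features (an assumption; real video data would have many more dimensions), and complete linkage is just one possible choice.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy stand-in for pixel features: two well-separated groups.
points = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])

# Build the full merge tree once; this is the part worth saving.
tree = linkage(points, method="complete")


def cut(tree, threshold):
    """Flat cluster labels for a user-chosen distance threshold."""
    return fcluster(tree, t=threshold, criterion="distance")


labels_fine = cut(tree, 0.5)
labels_mid = cut(tree, 5.0)
labels_coarse = cut(tree, 20.0)

for name, labels in [("fine", labels_fine), ("mid", labels_mid), ("coarse", labels_coarse)]:
    print(name, len(set(labels)), "clusters:", labels)
```

Changing the threshold only re-cuts the stored tree, so the user can explore different granularities without re-running the clustering.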

Scikit-learn can include connectivity in clustering, meaning clusters must be adjacent.
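
A small sketch of a connectivity constraint, using a tiny made-up image with a bright stripe between two dark regions. `grid_to_graph` builds the pixel adjacency graph and `AgglomerativeClustering` only merges clusters that are linked in it; the image, the Ward linkage and the choice of three clusters are assumptions for the example.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.image import grid_to_graph

# Toy 2 x 8 image: dark, bright, dark. The two dark regions have identical
# pixel values but are not adjacent.
image = np.array([
    [0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 0.0, 0.0],
])

# Sparse adjacency matrix linking each pixel to its grid neighbours.
connectivity = grid_to_graph(*image.shape)

model = AgglomerativeClustering(n_clusters=3, linkage="ward", connectivity=connectivity)
labels = model.fit_predict(image.reshape(-1, 1)).reshape(image.shape)
print(labels)
```

Clustering on pixel value alone would lump the two dark regions together; with the connectivity constraint they stay separate because they do not touch.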

Clustering evaluation: stability can be tested by clustering a test set, using the resulting labels for a supervised model, then checking the performance of the supervised method on a validation set.
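
One way this check could look in scikit-learn. The notes do not say what the classifier's validation performance is measured against, so this sketch assumes it is compared with an independent clustering of the validation set, scored with the adjusted Rand index (which ignores label permutations). The synthetic blobs, k-means and k-nearest-neighbours are also illustrative choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: three well-separated blobs, split into test and validation.
centers = [[-5.0, -5.0], [0.0, 5.0], [5.0, -5.0]]
X, _ = make_blobs(n_samples=600, centers=centers, cluster_std=0.6, random_state=0)
X_test, X_val = X[:300], X[300:]

# 1. Cluster the test set and treat the cluster labels as class labels.
test_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_test)

# 2. Train a supervised model on those labels.
classifier = KNeighborsClassifier(n_neighbors=5).fit(X_test, test_labels)

# 3. Predict on the validation set and compare with its own clustering.
val_predicted = classifier.predict(X_val)
val_labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X_val)
stability = adjusted_rand_score(val_labels, val_predicted)

print(f"stability (adjusted Rand index): {stability:.3f}")
```

A score near 1 suggests the cluster structure carries over from one sample to another; a score near 0 suggests the clusters are mostly an artefact of the particular sample.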

# Cheatsheet

## Scraping
## Visualisation
## Pandas and SQLite
## Probability and Distributions
## Regression
## Classification
## Ensembles
## Bayes
## Text and clustering
## Deep neural networks

```python
from sklearn import SOMETHING
```