# Data science assessment   

This assessment has several sections. You have one hour and **NO** googling is allowed, so it is up to you to decide which questions to complete.

Topics:
- Statistics
- Linux
- Spark
- Python
- Mathematics

# Statistics

1.1 At what rate does the expression below converge:
$$ \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} $$

1.2 In the context of machine learning (ensembles) and mean squared error (MSE):

1.2.1 Define the mean squared error 

1.2.2 The method of bagging, e.g. random forest, reduces _________ at the expense of increasing ________

1.2.3 The method of boosting, e.g. gradient boosting, reduces _________ at the expense of increasing ________

1.3 For the following hypothesis test, draw a conclusion given a significance level of $\alpha = 0.05$:

$$ H_{0}: \beta_{1} = 0 \text{ vs. } H_{A}: \beta_{1} \neq 0$$ 

1.3.1 $p$-value = 0.0006754 Conclusion:

1.3.2 $p$-value = 0.5489256 Conclusion:

1.4 Given a binary (0, 1) prediction problem and a dataset, list all the steps you would follow to build a classification model.

1.5 What are the consequences of multicollinearity for a logistic regression model? Describe the problem in terms of matrices and explain why it is an issue.

# Linux

2.1 Explain the difference between a VM and a Docker container.


2.2 What are symbolic links?

2.3 Which command would you use to change file permissions in Linux? And how many permission levels can one set?

2.4 What does the following command do? `du -cksh *`

2.5 Is this a good idea? `sudo rm -rf /`

2.6 systemd vs. other init systems (OpenRC, ...): do you have an opinion?

# Spark

3.1 What is shuffling?

3.2 Why is shuffling an important performance consideration when processing large datasets?

3.3 Suppose we wish to join one small dataframe with one large dataframe. What is typically a good optimisation to consider?

3.4 What are typical causes of excessive CPU time spent on garbage collection, and how would one address the issue?

3.5 What issues arise when one calls the `.toPandas()` method on a very large Spark DataFrame?

3.6 Why does the above issue arise? Put differently, what happens under the hood for this to happen?

3.7 What are the necessary considerations when choosing between UDFs and Spark native functions?

# Python

#### 4.1 Performance

Suppose the `get_resource_identifier` function interrogates some cloud infrastructure to resolve the resource identifier from its name. Note that it takes a long time for the call to finish resolving the name.

Now imagine that we need to resolve the resource by its name multiple times during deployment of infrastructure. How can we speed this up without modifying the body of the `get_resource_identifier` function? Remember, you have no control over how quickly the cloud provider can respond to your API call.
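To judge whether a change actually speeds things up, it helps to measure the elapsed time of a call before and after. The following is a minimal sketch of such a measurement, not a solution to the question: `timed` and `slow_lookup` are hypothetical names introduced only for illustration, and the 0.05 second delay is an arbitrary stand-in for a slow remote call.

```python
import time


def timed(func, *args):
    """Call func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# Hypothetical stand-in for a slow remote call (not the assessment function).
def slow_lookup(name):
    time.sleep(0.05)  # arbitrary delay, chosen only for this illustration
    return name.upper()


result, elapsed = timed(slow_lookup, "foo")
print(f"{result} resolved in {elapsed:.3f}s")
```

Timing the same call pattern before and after a change gives a simple, repeatable comparison.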

In [None]:
# 4.1.1
import time
def get_resource_identifier(name):
    time.sleep(1)  # simulate the delay of the cloud API call
    # compare strings with ==, not `is` (identity only works by accident here)
    if name == 'foo':
        return 'L9UKvnomjq'
    if name == 'bar':
        return '7U9eyOv7M'
    return 'Not found'

for _ in range(100):
    print(get_resource_identifier('foo'))
    print(get_resource_identifier('bar'))
    print(get_resource_identifier('foo'))
    print(get_resource_identifier('zoo'))
    print(get_resource_identifier('bar'))

#### 4.2 Refactor
The section below is an opportunity for you to demonstrate how you refactor code into something simpler and more readable. Refactor the code and write some very simple sanity checks to show that the refactored version is equivalent to the ugly version. You may leave out tests where you think they are not needed.
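Since the snippets in this section only print, one possible way to write a sanity check is to capture what each version prints and compare the two strings. The sketch below assumes that approach; `captured_output`, `ugly_total` and `refactored_total` are hypothetical names used for illustration and are deliberately unrelated to the exercises.

```python
import io
from contextlib import redirect_stdout


def captured_output(func):
    """Run func() and return everything it printed as a single string."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        func()
    return buffer.getvalue()


# Hypothetical pair of functions, unrelated to the exercises in this section.
def ugly_total():
    numbers = [1, 2, 3]
    total = 0
    for i in range(len(numbers)):
        total = total + numbers[i]
    print(total)


def refactored_total():
    print(sum([1, 2, 3]))


# The sanity check: both versions must print exactly the same text.
assert captured_output(ugly_total) == captured_output(refactored_total)
```

Comparing captured output keeps the check independent of how each version is written internally.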

In [None]:
# 4.2.1
# Don't modify this
colours = ['blue','green','yellow','black','orange']
fruits = ['berry','apple','banana','currant']
# All of the rest below you may modify 
# as you please to achieve the desired output

In [None]:
# 4.2.2
#ugly
for i in range(len(colours)-1,-1,-1):
    print(colours[i])

#refactor below

In [None]:
# 4.2.3
#ugly
for i in range(len(colours)):
    print(i,colours[i])
    
#refactor below

In [None]:
# 4.2.4
#ugly
min_length = min(len(colours),len(fruits))
for i in range(min_length):
    print(colours[i],fruits[i])
    
#refactor below

#### 4.3 Implement
This section provides an opportunity to demonstrate how you would write some very simple things in a Pythonic way.

In [None]:
# 4.3.1
#Generate the following string from the colours list defined above:
# 'blue --> green --> yellow --> black --> orange'

In [None]:
# 4.3.2
# Find the elements that exist in the first list but not in the second,
# and the elements that exist in the second but not in the first.
# Put the result into a single list, sorted in ascending order.


first = [2,2,5,6,7,2,1,8,9,9]
second = [2,1,5,6,66,7,77]

# Mathematics

5.1 Determine the positive real number $a$ for which $$\sqrt{\int^{a}_{0} x \, dx} = \int^{a}_{0} \sqrt{x} \, dx$$

5.2 Suppose that the three real numbers $x, y, z$ satisfy the system of equations
$$\begin{array}{rcl} 2^{x}\cdot 4^{y}\cdot 16^{z} & = & 1, \\ 4^{x}\cdot 16^{y}\cdot 2^{z} & = & 2, \\ 16^{x}\cdot 2^{y}\cdot 4^{z} & = & 4. \end{array}$$
What is the value of $y$?

5.3 Determine the limit $$\lim_{x\to 2013} \frac{\sin(\pi x)}{x-2013}$$

5.4 For which value of $a$ does the following system of equations have no solution?
$$\begin{array}{rcl} x + 2y & = & 5, \\ -3x + ay & = & 1. \end{array}$$