# Personal Notes - Introduction to Computational Thinking and Data Science (6.0002)

[MIT Open Course by John Guttag](https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-0002-introduction-to-computational-thinking-and-data-science-fall-2016/index.htm)

## Stats recap

### Distribution types
Uniform, normal and exponential:
![](img/image--000.jpg)

### Central Limit Theorem (CLT)
Consider a population and a set of samples (subsets of the population). [The central limit theorem](https://en.wikipedia.org/wiki/Central_limit_theorem) states:

1. The means of the samples in a set of samples (the sample means) will be approximately normally distributed.
2. This normal distribution will have a mean close to the mean of the population.
3. The variance of the sample means will be close to the variance of the population, divided by the sample size.

<br/>
<div class="image123">
    <div class="imgContainer">
        <img src="img/image--001.jpg" height="500" width="500"/>
    </div>
</div>

Fig 1: roll a die 50 times (blue); roll 50 dice 50 times and take means (red).
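
The three CLT statements can be checked with a small simulation of the dice experiment. This is a minimal sketch, not the course's code: the helper name `sample_means`, the seed and the number of samples (10,000) are assumptions chosen for illustration.

```python
import random
import statistics

def sample_means(num_samples, sample_size, rng):
    """Return the mean of each of num_samples samples of fair die rolls."""
    means = []
    for _ in range(num_samples):
        rolls = [rng.randint(1, 6) for _ in range(sample_size)]
        means.append(statistics.fmean(rolls))
    return means

rng = random.Random(0)  # fixed seed so the run is repeatable
sample_size = 50
means = sample_means(num_samples=10_000, sample_size=sample_size, rng=rng)

pop_mean = 3.5     # mean of one fair six-sided die
pop_var = 35 / 12  # variance of one fair six-sided die

mean_of_means = statistics.fmean(means)
var_of_means = statistics.pvariance(means)
print(f"mean of sample means:     {mean_of_means:.3f} (population mean {pop_mean})")
print(f"variance of sample means: {var_of_means:.4f} "
      f"(population variance / n = {pop_var / sample_size:.4f})")
```

Plotting a histogram of `means` should give a bell shape similar to the red curve in Fig 1.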

## Sampling and Standard Error (8)

Sampling can be done with or without replacement.
In simple random sampling, each member of the population has the same probability of being chosen. This is not always appropriate, for example when the population is split into sub-groups of unequal size (e.g. MIT students are not spread equally across departments).
In that case, samples are taken from each sub-group in proportion to its size, so that every sub-group is represented (stratified sampling).
Temperatures (USA) are not normally distributed.<br/>
<br/>
<div class="image123">
    <div class="imgContainer">
        <img src="img/image--002.jpg" height="500" width="500"/>
    </div>

</div>

Fig 2: population; fig 3: sample of size 100.

The population mean and the sample mean look ‘similar’. To quantify how similar, we draw a number of samples and apply the CLT to their means.
<br/>
<div class="image123">
    <div class="imgContainer">
        <img src="img/image--004.jpg" height="500" width="500"/>
    </div>
</div>

Fig 4: result, fig 5: initial distribution

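The 95% confidence interval is: mean ± (1.96*std) = [14.5, 18.1], where std is the standard deviation of the sample means.

Now we want a tighter bound. Two options:

- Drawing more samples: std does not change, because the spread of the sample means depends on the size of each sample, not on how many samples are drawn.
- Drawing larger samples: std drops (it works).

Note: when two confidence intervals do not overlap, a conclusion can be made about the difference.
As the sample size increases, the confidence interval gets tighter. This does not mean that the sample mean gets closer to the population mean (II vs III pool).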

<div class="image123">
    <div class="imgContainer">
        <img src="img/image--005.jpg" height="500" width="500"/>
    </div>
</div>

Fig 6: Interval @ 95% confidence, population size.
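
A minimal simulation of the two options (more samples vs larger samples). The population here is synthetic (uniform values between 0 and 40), an assumption standing in for the temperature data, and `std_of_sample_means` is a hypothetical helper name.

```python
import random
import statistics

def std_of_sample_means(population, num_samples, sample_size, rng):
    """Draw num_samples random samples and return the std of their means."""
    means = [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(num_samples)
    ]
    return statistics.pstdev(means)

rng = random.Random(1)
# Synthetic stand-in population (assumption), not the course's temperature data.
population = [rng.uniform(0, 40) for _ in range(20_000)]

base = std_of_sample_means(population, num_samples=200, sample_size=100, rng=rng)
more_samples = std_of_sample_means(population, num_samples=2000, sample_size=100, rng=rng)
larger_samples = std_of_sample_means(population, num_samples=200, sample_size=400, rng=rng)

print(f"200 samples of size 100:  std of means = {base:.3f}")
print(f"2000 samples of size 100: std of means = {more_samples:.3f}")
print(f"200 samples of size 400:  std of means = {larger_samples:.3f}")
```

Ten times more samples should leave the std roughly unchanged, while samples four times larger should roughly halve it, which is what tightens the interval mean ± 1.96*std.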

Suppose you have only one sample. Recall the CLT: the sample mean is normally distributed (1) and close to the population mean (2). We now use (3), the variance of the sample means is close to the variance of the population divided by the sample size, to compute the standard error of the mean (SE or SEM):

$SE = \frac{\sigma}{\sqrt{n}}$

where:

- $\sigma$ is the population standard deviation
- $n$ is the sample size

With the SE, one sample is enough to estimate the spread of the sample means (instead of drawing 50 samples and computing the std of their means). However, the SE uses $\sigma$ itself, which is not known. So, we need a workaround.

<div class="image123">
    <div class="imgContainer">
        <img src="img/image--006.jpg" height="500" width="500"/>
    </div>
</div>

Fig 7: SE and std of samples

If the sample is large enough, the standard deviation of the sample is used in place of the population standard deviation (500 in this case). What is a reasonable size? Does it depend on:

- the distribution of the population (uniform, normal, exponential)? YES
- the size of the population (10000, 100000, 10000000)? NO

The accuracy of the sample std (vs the population std) changes with the distribution type, but not with the population size.

<div class="image123">
    <div class="imgContainer">
        <img src="img/image--007.jpg" height="500" width="800"/>
    </div>
</div>

Fig 8: std vs distribution, fig 9: std vs population size

The skew (asymmetry of the distribution) matters; the population size does not. So, the required sample size is f(skewness).
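
A rough way to reproduce this comparison is to measure how far the sample std lands from the population std for different distributions and population sizes. This is a sketch under assumptions: the distribution parameters, the sample size of 100 and the helper name `mean_std_gap` are illustrative choices, not values from the course.

```python
import random
import statistics

def mean_std_gap(make_value, pop_size, sample_size, num_trials, rng):
    """Average relative gap between the sample std and the population std."""
    population = [make_value(rng) for _ in range(pop_size)]
    pop_std = statistics.pstdev(population)
    gaps = []
    for _ in range(num_trials):
        sample = rng.sample(population, sample_size)
        gaps.append(abs(statistics.pstdev(sample) - pop_std) / pop_std)
    return statistics.fmean(gaps)

rng = random.Random(2)
# Illustrative distribution parameters (assumptions).
distributions = {
    "uniform": lambda r: r.uniform(0, 1),
    "normal": lambda r: r.gauss(0, 1),
    "exponential": lambda r: r.expovariate(0.5),
}

results = {}
for name, make_value in distributions.items():
    for pop_size in (10_000, 100_000):
        gap = mean_std_gap(make_value, pop_size, sample_size=100,
                           num_trials=200, rng=rng)
        results[(name, pop_size)] = gap
        print(f"{name:12s} population {pop_size:7d}: mean relative gap = {gap:.3f}")
```

The skewed exponential population should show a clearly larger gap than the uniform one at the same sample size, while changing the population size should make little difference.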

Pseudocode:

1. Choose a sample size n0 based on an estimate of the skew of the population.
2. Choose a random sample of size n0 from the population.
3. Compute the mean and std of that sample.
4. Use the sample std to compute an estimate of the standard error, SE0.
5. Use SE0 to generate confidence intervals around the sample mean (see fig 10).

NOTE (on the last point): the 95% confidence interval around the sample mean is checked against the population mean, which should fall outside it for about 5% of the samples. If it falls outside noticeably more often, the sample size is probably too small.
NOTE: The chosen samples must be independent of each other. If not (e.g. polls - state samples ...), <to clarify>.
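
The pseudocode can be sketched as follows, repeated over many trials to count how often the interval misses the population mean. The population is synthetic (normally distributed, an assumption, not the course's temperature data), and `fraction_outside` is a hypothetical helper name.

```python
import random
import statistics

def fraction_outside(population, sample_size, num_trials, rng):
    """Fraction of samples whose 95% interval misses the population mean."""
    pop_mean = statistics.fmean(population)
    misses = 0
    for _ in range(num_trials):
        sample = rng.sample(population, sample_size)
        sample_mean = statistics.fmean(sample)
        # Estimate the SE with the sample std in place of the unknown population std.
        se = statistics.pstdev(sample) / sample_size ** 0.5
        if abs(pop_mean - sample_mean) > 1.96 * se:
            misses += 1
    return misses / num_trials

rng = random.Random(3)
# Synthetic population (assumption).
population = [rng.gauss(16, 9) for _ in range(50_000)]
frac = fraction_outside(population, sample_size=200, num_trials=2000, rng=rng)
print(f"Fraction outside 95% confidence interval: {frac:.4f}")
```

A result near 0.05 suggests the sample size is adequate for this population; a clearly larger fraction suggests the sample std is not yet a good substitute for the population std.
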
<br/>
<div class="image123">
    <div class="imgContainer">
        <img src="img/image--009.jpg" height="500" width="500"/>
    </div>
</div>
    
Fig. 10: Counting how many samples have a mean outside the 95% confidence interval around the population mean; temps (temperatures from the USA) and sampleSize are given; numTrials is the number of samples.
<br/>
<div class="image123">
    <div class="imgContainer">
        <img src="img/image--010.jpg" height="500" width="500"/>
    </div>
</div>    

## Understanding Experimental Data (9)

### Linear Regression
Relate the dependent variable to the independent variable. The regression is called linear even when the model is a polynomial function:

$y = a_n x^n + \dots + a_2 x^2 + a_1 x + c$

Fitting requires finding the coefficients $a_n, \dots, a_1, c$ (with $a_n \neq 0$ for a polynomial of degree $n$). The model is linear in these coefficients, so the problem is linear. The objective function can minimise either:

- Y, the vertical distance (usually least squares, LS), to predict y values
- P, the perpendicular distance, to predict categories; here, the distance which separates two categories is important

<br/>
<div class="image123">
    <div class="imgContainer">
        <img src="img/image--011.jpg" height="500" width="500"/>
    </div>
</div>  


i.e. minimise variance*no_of_samples (the sum of squared errors) for a polynomial function (e.g. line, parabola, etc.). Now assume we have some data to fit with a polynomial function.
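
A least-squares polynomial fit can be sketched with NumPy's `polyfit` and `polyval`. The data below is a synthetic noisy parabola (an assumption, not the course's data set), and `fit_and_score` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic noisy parabola (assumption).
x = np.linspace(-5, 5, 41)
y = 2.0 * x**2 - 3.0 * x + 1.0 + rng.normal(0, 2.0, size=x.size)

def fit_and_score(x, y, degree):
    """Least-squares fit; return (coefficients, sum of squared residuals)."""
    coeffs = np.polyfit(x, y, degree)  # highest-degree coefficient first
    predicted = np.polyval(coeffs, x)
    return coeffs, float(np.sum((y - predicted) ** 2))

line_coeffs, line_ssr = fit_and_score(x, y, 1)
parab_coeffs, parab_ssr = fit_and_score(x, y, 2)

print("line coefficients:    ", np.round(line_coeffs, 2), "SSR:", round(line_ssr, 1))
print("parabola coefficients:", np.round(parab_coeffs, 2), "SSR:", round(parab_ssr, 1))
```

The parabola should recover coefficients close to the ones used to generate the data and leave a much smaller sum of squared residuals than the line.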

<br/>
<div class="image123">
    <div class="imgContainer">
        <img src="img/image--013.jpg" height="500" width="500"/>
    </div>
</div>  

### Compare models
Among different models, which one is the best:

- (A) relative to each other
- (B) in an absolute sense

For (A), the MSE (mean squared error) can be used to evaluate models (the objective function uses it).
It is not scale-independent (if the dependent variable is rescaled, so is the MSE), so there are no bounds that tell whether a given MSE is good.
For (B), the CD (coefficient of determination), called $R^2$, tells how good the model is in comparison with the variability of the experimental data.

<br/>
<div class="image123">
    <div class="imgContainer">
        <img src="img/image--014.jpg" height="500" width="500"/>
    </div>
</div>  



- It is scale-independent.
- If $R^2 = 1$, the variability is perfectly predicted by the model.
- If $R^2 = 0$, there is no relationship between the values predicted by the model and the actual data.
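
A minimal sketch of the coefficient of determination, computed as 1 minus the ratio of the residual error to the variability of the data around its mean. The function name `r_squared` and the sample values are illustrative assumptions.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - residual error / total variability."""
    mean = sum(observed) / len(observed)
    residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    total = sum((o - mean) ** 2 for o in observed)
    return 1.0 - residual / total

y = [1.0, 2.0, 3.0, 4.0]
guess = [1.1, 1.9, 3.2, 3.8]

perfect = r_squared(y, y)                # predictions equal the data
mean_only = r_squared(y, [2.5] * 4)      # always predicting the mean of y
original = r_squared(y, guess)
rescaled = r_squared([10 * v for v in y], [10 * v for v in guess])

print(perfect, mean_only, original, rescaled)
```

Rescaling both the data and the predictions by the same factor leaves the value unchanged, which is the scale independence that MSE lacks.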

A high CD value does not by itself mean that the model is good: it can come from overfitting (as happens in the case deg=16). Overfitting means fitting the noise.

<br/>
<div class="image123">
    <div class="imgContainer">
        <img src="img/image--015.jpg" height="500" width="500"/>
    </div>
</div>  

## Understanding Experimental Data (10)
    
### Cross validation (simple)

Take two sets of noisy data points from the same parabola. Do:

- Build model 1 on dataset 1 and validate it on dataset 2
- Build model 2 on dataset 2 and validate it on dataset 1

Validation of high-dimensionality (high-degree) models is poor.
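
The two-way validation can be sketched as below. The parabola, noise level, seed and degrees are assumptions for illustration; degree 16 echoes the overfitting case mentioned in these notes.

```python
import numpy as np

def r_squared(observed, predicted):
    """Coefficient of determination of predictions against observations."""
    residual = np.sum((observed - predicted) ** 2)
    total = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - residual / total

def noisy_parabola(x, rng):
    """Synthetic data: a fixed parabola plus Gaussian noise (assumption)."""
    return 3.0 * x**2 - x + 2.0 + rng.normal(0, 0.5, size=x.size)

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 30)
data1, data2 = noisy_parabola(x, rng), noisy_parabola(x, rng)

results = {}
for degree in (1, 2, 16):
    model1 = np.polyfit(x, data1, degree)  # built on dataset 1
    model2 = np.polyfit(x, data2, degree)  # built on dataset 2
    results[degree] = {
        "train": r_squared(data1, np.polyval(model1, x)),
        "model1_on_data2": r_squared(data2, np.polyval(model1, x)),
        "model2_on_data1": r_squared(data1, np.polyval(model2, x)),
    }
    print(degree, {k: round(float(v), 3) for k, v in results[degree].items()})
```

The training score can only go up with the degree, so it is the scores on the other dataset that show whether the extra terms are fitting signal or noise.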

<br/>
<div class="image123">
    <div class="imgContainer">
        <img src="img/image--016.jpg" height="500" width="500"/>
    </div>
</div>  

This is the difference between a quadratic and a linear model fitted to four data points:
set = {(0,0),(1,1),(2,2),(3,3.1)}
where the last point carries about 3% noise.
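
The effect of that single noisy point can be checked directly. This is a small sketch using NumPy's `polyfit`; the extrapolation point x = 20 is an assumption chosen to make the difference visible.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 2.0, 3.1])

linear = np.polyfit(x, y, 1)     # [slope, intercept]
quadratic = np.polyfit(x, y, 2)  # [a, b, c] for a*x^2 + b*x + c

x_far = 20.0  # far outside the fitted range (illustrative choice)
linear_far = float(np.polyval(linear, x_far))
quadratic_far = float(np.polyval(quadratic, x_far))

print("linear coefficients:   ", np.round(linear, 3))
print("quadratic coefficients:", np.round(quadratic, 3))
print(f"prediction at x = {x_far}: linear {linear_far:.2f}, quadratic {quadratic_far:.2f}")
```

On this data the linear fit is close to y = 1.03x - 0.02, while the quadratic picks up a small x² coefficient (about 0.025) from the noise. That is enough to move the prediction at x = 20 from about 20.6 to about 29.1.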

<br/>
<div class="image123">
    <div class="imgContainer">
        <img src="img/image--017.jpg" height="500" width="500"/>
    </div>
</div>  

### Overfit-underfit

Start with a low-degree fit and increase the degree. Then, look at:

- how the model behaves on training and validation data
- the $R^2$ values

### Supporting Theory for models

Take Hooke’s law as an example. Neither a linear nor a quadratic fit suits the whole data set. Here, we have to break the data into two parts and decide where to split between two linear fits (see image). This requires a good theory to guide the data scientist.

<br/>
<div class="image123">
    <div class="imgContainer">
        <img src="img/image--018.jpg" height="500" width="500"/>
    </div>
</div>  

In an ideal world, we would run a controlled experiment (e.g., hang weights from a spring), study the results, and retrospectively formulate a model consistent with those results. We would then run a different prospective experiment (e.g., hang different weights from the same spring) and compare the results of that experiment to what the model predicted.

Unfortunately, in many cases it is impossible to run even one controlled experiment. Imagine, for example, building a model designed to shed light on how interest rates affect stock prices. Very few of us are in a position to set interest rates and see what happens. On the other hand, there is no shortage of relevant historical data.

In such situations, one can simulate a set of experiments by dividing the existing data into a training set and a holdout set. Without looking at the holdout set, we build a model that seems to explain the training set. For example, we find a curve that has a reasonable $R^2$ for the training set. We then test that model on the holdout set.

A related but slightly different way to check a model is to train on many randomly selected subsets of the original data, and see how similar the models are to one another. If they are quite similar, then we can feel pretty good. This approach is known as cross validation.

### Cross Validation (methodologies)

- LOO-CV (leave-one-out cross validation). Suitable for small datasets, it leaves one value out at a time.
- KF-CV (k-fold cross validation), or repeated random sampling: split the data (from 50-50 to 80-20) k times => obtain k coefficients of determination. Report their mean.

Let’s take the temperature example. Mean daily high temperatures are plotted. We want to find a model which describes them (we test dimensionality, i.e. polynomial degree, = 1, 2, 3, 4):

<br/>
<div class="image123">
    <div class="imgContainer">
        <img src="img/image--019.jpg" height="500" width="500"/>
    </div>
</div>  

Using k-fold cross validation (k=10), the coefficients of determination vary from one split to another (as the CLT describes). So, many subsets must be considered to get a reliable picture of the model.
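
A sketch of the repeated random split: fit on the training part, score on the held-out part, repeat k times and report the mean and spread of the scores. The data is a synthetic linear trend plus noise (an assumption, not the temperature data), and the 80-20 split and helper names are illustrative.

```python
import numpy as np

def r_squared(observed, predicted):
    """Coefficient of determination of predictions against observations."""
    residual = np.sum((observed - predicted) ** 2)
    total = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - residual / total

def random_split_scores(x, y, degree, num_trials, test_fraction, rng):
    """R^2 on a random held-out part, once per trial."""
    n_test = int(len(x) * test_fraction)
    scores = []
    for _ in range(num_trials):
        order = rng.permutation(len(x))
        test_idx, train_idx = order[:n_test], order[n_test:]
        model = np.polyfit(x[train_idx], y[train_idx], degree)
        scores.append(r_squared(y[test_idx], np.polyval(model, x[test_idx])))
    return np.array(scores)

rng = np.random.default_rng(2)
# Synthetic data with a linear trend plus noise (assumption).
x = np.arange(60, dtype=float)
y = 15.0 + 0.05 * x + rng.normal(0, 0.3, size=x.size)

summary = {}
for degree in (1, 2, 3, 4):
    scores = random_split_scores(x, y, degree, num_trials=10,
                                 test_fraction=0.2, rng=rng)
    summary[degree] = scores
    print(f"degree {degree}: mean R^2 = {scores.mean():.3f}, std = {scores.std():.3f}")
```

Reporting the standard deviation next to the mean shows how much a single split can mislead, which is the reason for averaging over many subsets.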

<br/>
<div class="image123">
    <div class="imgContainer">
        <img src="img/image--020.jpg" height="500" width="800"/>
    </div>
</div>  