# Data Science Career Guide: Machine Learning and Coding
#### Notebook I am using to follow along with Jose's course and practice LaTeX

# Machine Learning

## 1. Linear Regression

* ### Linear relationship between x and y
* ### Residual error should be normally distributed
* ### L1 and L2 Regularization
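
To make the first two bullets concrete, here is a minimal sketch of a one-feature least-squares fit in plain Python. The function names `fit_line` and `residuals` and the sample data are assumptions made for illustration; they are not from the course.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residuals(xs, ys, intercept, slope):
    """Observed value minus predicted value for each point."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical data that lies exactly on y = 1 + 2x
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
b0, b1 = fit_line(xs, ys)
print(b0, b1)
print(residuals(xs, ys, b0, b1))
```

On real data the residuals would not be zero; plotting them (for example as a histogram) is one way to check the normality assumption.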

## 2. Logistic Regression

* ### $$ \sigma(t) = \frac {1} {1 + e^{-t}} $$
* ### $$ t = \beta_0 + \beta_1x_1 + \beta_2x_2 + \ldots + \beta_nx_n $$
* ### $$ f(x) = \sigma(t) = \frac {1} {1 + e^{-(\beta_0+\beta_1x)}} $$
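
The logistic function can be sketched in a few lines of Python. The names `sigmoid` and `predict_proba` and the coefficient values are assumptions chosen for illustration. Note that `math.exp` overflows for very large negative inputs, so this is not a production implementation.

```python
import math


def sigmoid(t):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


def predict_proba(x, beta_0, beta_1):
    """Probability of class 1 for a single feature x."""
    return sigmoid(beta_0 + beta_1 * x)


print(sigmoid(0))                # 0.5, the decision boundary
print(predict_proba(2, -2, 1))   # t = -2 + 1*2 = 0, so also 0.5
```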

## 3. Decision Tree

* ### Uses information gain with entropy to choose the best feature to split on
* ### Find the feature that best splits the target class into the purest possible children nodes
* ### Entropy is the measure of impurity
* ### Compare entropy before and after the split
* ### $$ Information\:Gain = Entropy_{Before} - Entropy_{After} $$
* ### Gini impurity can be used as the splitting criterion as well
* ### 4. Decision trees are good for nonlinear problems
* ### Easy to interpret and understand
* ### Works with continuous and categorical features
* ### No normalization or scaling needed
* ### Pretty fast
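
The information gain formula above can be sketched as follows. This assumes entropy is measured in bits and that the entropy after the split is the size-weighted average over the child nodes, which is the usual convention; the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy before the split minus the size-weighted entropy after it."""
    total = len(parent)
    after = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - after


parent = [0, 0, 1, 1]
print(information_gain(parent, [[0, 0], [1, 1]]))  # pure children: maximal gain
print(information_gain(parent, [[0, 1], [0, 1]]))  # children as mixed as the parent: no gain
```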

## 5. What is the difference between a random forest versus boosting tree algorithms?

* ### Boosted trees reassign weights to samples based on the results of previous iterations of classifications
* ### Harder-to-classify points get weighted more
* ### Trees are built sequentially: each new tree depends on the trees before it
* ### Random forest trains each tree on a bootstrap sample, and each split can only consider a random subset of features
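
The reweighting idea can be illustrated with one update step in the style of AdaBoost. AdaBoost is one specific boosting algorithm, assumed here only for illustration; gradient boosting works differently. The function name `reweight` and the sample weights are hypothetical.

```python
import math


def reweight(weights, misclassified):
    """One AdaBoost-style weight update.

    weights: current sample weights (summing to 1)
    misclassified: booleans, True where the last tree was wrong
    """
    # Weighted error of the last tree (assumed strictly between 0 and 1)
    err = sum(w for w, m in zip(weights, misclassified) if m)
    alpha = 0.5 * math.log((1 - err) / err)
    # Raise the weights of mistakes, lower the rest, then renormalize
    new = [w * math.exp(alpha if m else -alpha)
           for w, m in zip(weights, misclassified)]
    total = sum(new)
    return [w / total for w in new]


# Four equally weighted samples; only the first was misclassified
new_weights = reweight([0.25, 0.25, 0.25, 0.25], [True, False, False, False])
print(new_weights)
```

After the update the misclassified sample carries half of the total weight, so the next tree is pushed to get it right.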

## 6. Given a data set of features X and labels y, what assumptions are made when using Naive Bayes methods?

* ### Bayes' theorem
* ### $$ P(A \mid B) = \frac {P(A) \times P(B \mid A)} {P(B)} $$
* ### $$ P(A \mid B) = \frac {P(A) \times P(B \mid A)} {P(B \mid A) \times P(A) + P(B \mid \bar A) \times P(\bar A)} $$

* ### Naive assumption: all features are conditionally independent of each other given the class label
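
Written out, the naive assumption lets the likelihood factor into one term per feature. In the sketch below, $y$ is the class label and $x_1, \ldots, x_n$ are the features of one sample; this notation is an assumption and does not appear in the course notes.

$$ P(y \mid x_1, \ldots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y) $$

The predicted class is the $y$ that maximizes the right-hand side, so the denominator $P(x_1, \ldots, x_n)$ never has to be computed.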

## 7. Support Vector Machines

* ### SVMs attempt to find a hyperplane that separates classes by maximizing the margin
* ### They can use different kernels to map features that are not linearly separable into a higher dimension
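
As one example of a kernel, the RBF (Gaussian) kernel scores how similar two points are without ever building the higher-dimensional features explicitly. This is a minimal sketch; the function name and the `gamma` parameter name are assumptions for illustration.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """Similarity of two points under the RBF (Gaussian) kernel."""
    squared_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_dist)


print(rbf_kernel([1, 2], [1, 2]))  # identical points give 1.0
print(rbf_kernel([0, 0], [3, 4]))  # distant points approach 0
```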

## 8. What is overfitting and what causes it? How can we avoid it?

* ### Overfitting is when a model fits the training data too closely (including its noise) and generalizes poorly to new data; the model is often too complex
* ### Use a simpler model or tune hyperparameters to reduce complexity
* ### Increase training data size
* ### Regularization (L1 or L2)
* ### K-Fold Cross Validation
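
K-fold cross validation can be sketched by generating the index splits by hand. This version does not shuffle and the function name `k_fold_indices` is hypothetical; in practice a library helper would normally be used instead.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


for train, val in k_fold_indices(5, 2):
    print(train, val)
```

Every sample is used for validation exactly once, so a large gap between training and validation scores across folds is a sign of overfitting.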

## 9. Describe the differences between accuracy, precision, and recall.

| | Predict 0 | Predict 1 |
| --- | --- | --- |
| **Actual 0** | TN | FP (Type I) |
| **Actual 1** | FN (Type II) | TP |

* ### Accuracy = (TP + TN) / Total
* ### Misclassification Rate = (FP + FN) / Total
* ### Recall aka True Positive Rate aka Sensitivity = TP / Actual Yes = TP / (TP + FN)
* ### Precision = TP / Predicted Yes = TP / (TP + FP)
* ### False Positive Rate = FP / Actual No = FP / (FP + TN)
* ### Specificity = TN / Actual No = TN / (TN + FP)
* ### Prevalence = Actual Yes / Total = (TP + FN) / Total
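
The formulas above translate directly into code. The counts below are hypothetical and chosen only for illustration; the sketch also assumes no denominator is zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the metrics above from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        'accuracy': (tp + tn) / total,
        'recall': tp / (tp + fn),
        'precision': tp / (tp + fp),
        'false_positive_rate': fp / (fp + tn),
        'specificity': tn / (tn + fp),
    }


# Hypothetical counts
metrics = classification_metrics(tp=40, tn=50, fp=5, fn=5)
for name, value in metrics.items():
    print('{} = {:.3f}'.format(name, value))
```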

## 10. Metrics for Regression Tasks

* ### MSE $$ \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat Y_i)^2 $$
* ### RMSE $$ \sqrt {MSE} $$
* ### MAE $$ \frac{1}{n} \sum_{i=1}^{n} |Y_i - \hat Y_i| $$
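
A plain-Python sketch of the three metrics, with hypothetical values for `y_true` and `y_pred`:

```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as y."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3, 5, 7]
y_pred = [2, 5, 10]  # errors of 1, 0 and -3
print(mse(y_true, y_pred), rmse(y_true, y_pred), mae(y_true, y_pred))
```

The single error of 3 dominates MSE and RMSE because it is squared, while MAE treats all errors linearly.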

# Coding

## 1. Given an array of integers (positive and negative) write a program that can find the largest continuous sum. Return the total sum amount.

* ### [7, 8, 9] = 24
* ### [-1, 7, 8, 9, -10] = 24
* ### [2, 3, -10, 9, 2] = 11
* ### [2, 11, -10, 9, 2] = 14
* ### [12, -10, 7, -8, 4, 6] = 12

In [58]:
ex_1 = [7, 8, 9]
ex_2 = [-1, 7, 8, 9, -10]
ex_3 = [2, 3, -10, 9, 2]
ex_4 = [2, 11, -10, 9, 2]
ex_5 = [12, -10, 7, -8, 4, 6]
ex_6 = [10, 11, -2]
ex_7 = [-10,-20,-5,-30]
ex_s = [ex_1, ex_2, ex_3, ex_4, ex_5, ex_6, ex_7]

In [59]:
from itertools import combinations

In [64]:
def largest_cont_sum_1(in_list):
    '''
    Returns the largest continuous sum for an array
    
    Input
    -----
    in_list: list of positive and negative integers
    
    Output
    -----
    result: the largest continuous sum
    combo_result: the combination that resulted in the largest continuous sum    
    
    ex.
    * [7, 8, 9] = 24 with [7, 8, 9]
    * [-1, 7, 8, 9, -10] = 24 with [7, 8, 9]
    * [2, 3, -10, 9, 2] = 11 with [9, 2]
    * [2, 11, -10, 9, 2] = 14 with [2, 11, -10, 9, 2]
    * [12, -10, 7, -8, 4, 6] = 12 with [12]
    * [10, 11, -2] = 21 with [10, 11]
    '''
    # Start below any possible sum so a single-element list still records its slice
    result = float('-inf')
    combo_result = []
    all_combos = [in_list[i:j] for i, j in combinations(range(len(in_list) + 1), 2)]
    for combo in all_combos:
        combo_sum = sum(combo)
        if combo_sum > result:
            result = combo_sum
            combo_result = combo
    return result, combo_result

In [65]:
for ex in ex_s:
    a, b = largest_cont_sum_1(ex)
    print('* {} = {} with {}'.format(ex, a, b))

* [7, 8, 9] = 24 with [7, 8, 9]
* [-1, 7, 8, 9, -10] = 24 with [7, 8, 9]
* [2, 3, -10, 9, 2] = 11 with [9, 2]
* [2, 11, -10, 9, 2] = 14 with [2, 11, -10, 9, 2]
* [12, -10, 7, -8, 4, 6] = 12 with [12]
* [10, 11, -2] = 21 with [10, 11]
* [-10, -20, -5, -30] = -5 with [-5]


In [66]:
def largest_cont_sum_2(arr):
    '''
    Jose's solution for returning the largest continuous sum for an array
    Input
    -----
    arr: array of positive and negative integers
    
    Output
    -----
    result: the largest continuous sum
    '''
    if len(arr) == 0:
        return 0
    max_sum = arr[0]
    current_sum = arr[0]
    for num in arr[1:]:
        current_sum = max(current_sum+num, num)
        max_sum = max(current_sum, max_sum)
    return max_sum

In [68]:
for ex in ex_s:
    a = largest_cont_sum_2(ex)
    print('* {} = {}'.format(ex, a))

* [7, 8, 9] = 24
* [-1, 7, 8, 9, -10] = 24
* [2, 3, -10, 9, 2] = 11
* [2, 11, -10, 9, 2] = 14
* [12, -10, 7, -8, 4, 6] = 12
* [10, 11, -2] = 21
* [-10, -20, -5, -30] = -5
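
The single-pass scan in `largest_cont_sum_2` (commonly known as Kadane's algorithm) can also report which slice produced the sum, like the brute-force version does. The sketch below is one possible way to do that; the name `largest_cont_sum_3` is hypothetical and not from the course.

```python
def largest_cont_sum_3(arr):
    """Return (max_sum, best_slice) in one pass; (0, []) for an empty list."""
    if not arr:
        return 0, []
    max_sum = current_sum = arr[0]
    start = best_start = best_end = 0
    for i in range(1, len(arr)):
        # Either extend the current run or start a new run at i
        if current_sum + arr[i] < arr[i]:
            current_sum = arr[i]
            start = i
        else:
            current_sum += arr[i]
        # Remember the best run seen so far
        if current_sum > max_sum:
            max_sum = current_sum
            best_start, best_end = start, i
    return max_sum, arr[best_start:best_end + 1]


print(largest_cont_sum_3([2, 3, -10, 9, 2]))
print(largest_cont_sum_3([-10, -20, -5, -30]))
```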


## 2. Given a string in the form 'AAAABBBBBCCCCDDEEE' compress it to become 'A4B5C4D2E3'

* ### 'AAABBBBCCCCCDDE' = 'A3B4C5D2E1'
* ### 'AAAaaa' = 'A3a3'

In [109]:
st_ex_1 = 'AAABBBBCCCCCDDE'
st_ex_2 = 'AAAABBCCCDDDDDDDEE'
st_ex_3 = 'AAAaaa'
st_ex_4 = 'CODINGGGGG'
st_exs = [st_ex_1, st_ex_2, st_ex_3, st_ex_4]

In [110]:
def str_compress_1(in_str):
    # Guard against the empty string, which has no first character
    if not in_str:
        return ''
    result = ''
    start_ind = 0
    count_letter = in_str[0]
    for ind, letter in enumerate(in_str):
        if letter != count_letter:
            result += count_letter
            result += str(ind-start_ind)
            count_letter = letter
            start_ind = ind
        if ind == len(in_str)-1:
            result += count_letter
            result += str(ind-start_ind+1)
    return result

In [111]:
for st_ex in st_exs:
    print('{} = {}'.format(st_ex, str_compress_1(st_ex)))

AAABBBBCCCCCDDE = A3B4C5D2E1
AAAABBCCCDDDDDDDEE = A4B2C3D7E2
AAAaaa = A3a3
CODINGGGGG = C1O1D1I1N1G5


In [112]:
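def str_compress_2(in_str):
    # Guard against the empty string; in_str[i-1] below would raise IndexError
    if not in_str:
        return ''
    run = ''
    length = len(in_str)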
    
    cnt = 1
    i = 1
    while i < length:
        if in_str[i] == in_str[i-1]:
            cnt+=1
        else:
            run = run + in_str[i-1] + str(cnt)
            cnt = 1
        i+=1
        
    run = run + in_str[i-1] + str(cnt)
    return run

In [113]:
for st_ex in st_exs:
    print('{} = {}'.format(st_ex, str_compress_2(st_ex)))

AAABBBBCCCCCDDE = A3B4C5D2E1
AAAABBCCCDDDDDDDEE = A4B2C3D7E2
AAAaaa = A3a3
CODINGGGGG = C1O1D1I1N1G5
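
As a possible third variant, the standard library's `itertools.groupby` groups consecutive equal characters, which is exactly the run structure both functions above track by hand. The name `str_compress_3` is hypothetical.

```python
from itertools import groupby


def str_compress_3(in_str):
    """Run-length encode a string, e.g. 'AAAaaa' -> 'A3a3'."""
    # groupby yields (character, iterator over that run of characters)
    return ''.join(char + str(len(list(run))) for char, run in groupby(in_str))


print(str_compress_3('AAABBBBCCCCCDDE'))
print(str_compress_3('CODINGGGGG'))
```

This version also returns an empty string for empty input without a special case.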


## 3. You are given an array of historical stock prices per day. Write an algorithm that figures out what days (index of array) you could buy and sell the stock for maximum profit. You are only allowed to buy the stock once and sell it once. Also no shorting the stock, you have to buy before selling.

* ### [6, 13, 2, 10, 3, 5] = Buy on day 3 and sell on day 4 for an $8 gain (days counted from 1)
* ### [6, 13, 3, 5, 10, 15, 2, 40, 1, 22] = Buy on day 7 and sell on day 8 for a $38 gain

In [114]:
def stock_trade(arr):
    
    min_stock_price = arr[0]
    max_profit = 0
    
    for price in arr:
        min_stock_price = min(min_stock_price, price)
        comparison_profit = price - min_stock_price
        max_profit = max(max_profit, comparison_profit)
    return max_profit

In [169]:
stock_ex_1 = [2, 10, 12, 1, 3]
stock_ex_2 = [2, 1, 3, 4, 20]
stock_ex_3 = [5, 30]
stock_ex_4 = [6, 13, 4, 5, 10, 15, 2, 40, 1, 22, 60]
stock_ex_5 = [20,2,3,4,5,2,4,10]
stock_exs = [stock_ex_1, stock_ex_2, stock_ex_3, stock_ex_4, stock_ex_5]

In [170]:
for stock_ex in stock_exs:
    print(stock_trade(stock_ex))

10
19
25
59
8


In [171]:
def stock_trade_2(stock_prices):
    '''
    Returns the (buy, sell) indices that give the maximum profit
    '''
    min_ind = 0                # index of the lowest price seen so far
    buy_ind, sell_ind = 0, 0   # best (buy, sell) pair found so far
    max_profit = 0

    for ind, price in enumerate(stock_prices):
        # A new low can only serve as a buy day for later prices
        if price < stock_prices[min_ind]:
            min_ind = ind

        # Profit if we bought at the running low and sold today
        profit = price - stock_prices[min_ind]
        if profit > max_profit:
            max_profit = profit
            buy_ind, sell_ind = min_ind, ind

    return buy_ind, sell_ind

In [172]:
for stock_ex in stock_exs:
    print(stock_ex, stock_trade_2(stock_ex))

[2, 10, 12, 1, 3] (0, 2)
[2, 1, 3, 4, 20] (1, 4)
[5, 30] (0, 1)
[6, 13, 4, 5, 10, 15, 2, 40, 1, 22, 60] (8, 10)
[20, 2, 3, 4, 5, 2, 4, 10] (1, 7)


## 4. You are given an array of non-negative integers. A second array is formed by shuffling the elements of the first array and deleting a random element. Given these two arrays, find which element is missing in the second array.

* ### [1, 2, 3, 4, 5] and [1, 3, 4, 5] missing value is 2


In [184]:
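from collections import Counter

def find_missing_value(arr1, arr2):
    # Count occurrences so that duplicate values are handled correctly
    remaining = Counter(arr2)
    for value in arr1:
        if remaining[value] == 0:
            return value
        remaining[value] -= 1
    return None  # nothing is missing (0 could be a real value)

def find_missing_value_2(arr1, arr2):
    # The difference of the sums is the missing element
    return sum(arr1) - sum(arr2)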

In [185]:
mis_ex_1 = ([1,2,3,4,5], [1,3,4,5])
mis_ex_2 = ([1,2,3,4,5], [1,3,4,2])
mis_exs = [mis_ex_1, mis_ex_2]

In [186]:
for mis_ex in mis_exs:
    print(find_missing_value(mis_ex[0], mis_ex[1]))
for mis_ex in mis_exs:
    print(find_missing_value_2(mis_ex[0], mis_ex[1]))

2
5
2
5
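
Another common approach, sketched here as an assumption rather than something from the course, is to XOR every value from both arrays together. Values present in both arrays cancel out, so only the missing one remains. The name `find_missing_value_3` is hypothetical.

```python
from functools import reduce
from itertools import chain
from operator import xor


def find_missing_value_3(arr1, arr2):
    """XOR all values together; paired values cancel, leaving the missing one."""
    return reduce(xor, chain(arr1, arr2), 0)


print(find_missing_value_3([1, 2, 3, 4, 5], [1, 3, 4, 5]))
print(find_missing_value_3([1, 2, 3, 4, 5], [1, 3, 4, 2]))
```

Like the sum-based version it runs in one pass with constant extra memory, and it also works when the arrays contain duplicates.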
