# TDT4173 Machine Learning - Assignment 1

## Theory

___

###### 1. **[0.1 points]** What is concept learning? + explain with an example

> "The problem of searching through a predefined space of potential hypotheses for the hypothesis that best fits the training examples" - Tom Mitchell

When a human being learns something, much of it is based on generalized concepts gained from past experiences. For instance, to identify a certain type of car, we differentiate between cars based on a set of features. This bundle of features can be called a concept.

Similarly, we can provide a machine with a training sample from a given signal or dataset, from which it can learn the concepts needed to identify whether new data or objects belong to a specific category. Such a generalized concept is commonly referred to as a hypothesis.

**An example:** Let's say we want to identify reptiles in a dataset containing all types of animals. We extract a random subset for training the model, in which we have a set of features: *scales, coldBlooded, legs, eggLaying*. We take a random sample from the training set as the starting hypothesis and revise it each time a training example contradicts it. This goes on until the hypothesis remains unchanged, at which point we have the concept that best fits the training data for differentiating reptiles from other animals.
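
The refinement loop described above can be sketched with the Find-S procedure, which is one simple way (not the only one) to evolve a hypothesis. This is a minimal sketch: the four training rows and the helper names `find_s` and `matches` are hypothetical and only serve as illustration.

```python
# Hypothetical toy data: (scales, coldBlooded, legs, eggLaying) -> is it a reptile?
examples = [
    (("yes", "yes", 4, "yes"), True),   # e.g. a lizard
    (("yes", "yes", 0, "yes"), True),   # e.g. a snake
    (("no", "no", 4, "no"), False),     # e.g. a dog
    (("no", "yes", 4, "yes"), False),   # e.g. a frog
]


def find_s(examples):
    """Return the most specific conjunctive hypothesis covering all positives."""
    hypothesis = None  # nothing accepted yet
    for features, is_positive in examples:
        if not is_positive:
            continue  # Find-S ignores negative examples
        if hypothesis is None:
            hypothesis = list(features)  # start from the first positive example
        else:
            # Relax every attribute that disagrees with the new positive example
            hypothesis = [h if h == f else "?" for h, f in zip(hypothesis, features)]
    return hypothesis


def matches(hypothesis, features):
    """True if the hypothesis classifies the feature tuple as positive."""
    return all(h == "?" or h == f for h, f in zip(hypothesis, features))


reptile_concept = find_s(examples)
print(reptile_concept)  # ['yes', 'yes', '?', 'yes']
```

Here `'?'` means that any value is acceptable, so the learned concept only requires scales, cold blood and egg laying, and ignores the number of legs.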
___

###### 2. **[0.1 points]** What is function approximation and why do we need them?

Function approximation is the process of adjusting a given model, or function, so that it represents the unknown true target function as closely as possible. As with the evolving hypothesis described in the previous question, we need function approximation because the target function is only known through training examples, so the relevant parameters and their weights have to be estimated from the data.

###### 3. **[0.4 points]** What is inductive bias in the context of machine learning, and why is it so important? Decision tree learning and the candidate elimination algorithm are two different learning algorithms. What can you say about the inductive bias for each of them?

> "An inductive bias of a learner is the set of additional assumptions sufficient to justify its inductive inferences as deductive inferences" - Tom Mitchell

Inductive bias is the set of assumptions a learner uses to predict the output for an input it has not encountered before. Without this bias, the algorithm would not have learned anything beyond how to handle the exact input-output pairs it was trained on. For instance, if a car that is trained to avoid dogs encounters a cat, it might not know with high enough certainty what to do.

When using a decision tree learning algorithm, the bias is a search bias: the greedy search places the most informative attributes higher up in the tree and prefers trees that are as short as possible. The candidate elimination algorithm, however, has a representational bias, because its hypothesis space cannot represent all hypotheses. So instead of greedily choosing which part of the whole hypothesis space to search, it assumes that the solution to the problem can be expressed as a conjunction of attribute values.

___

###### 4. **[0.3 points]**  What is overfitting, and how does it differ from *underfitting*? Briefly explain what a validation set is. How can cross-validation be used to mitigate overfitting?

Overfitting refers to a model that fits the training data too closely. It occurs when the model learns both the underlying pattern and the noise in the training data; the learned noise is then carried over to new datasets and hurts the model's ability to generalize. Underfitting, on the other hand, refers to a model that has neither learned the training data well nor generalizes to new data.

The validation set typically makes up about 20 percent of the data (with roughly 60 percent used for training). It is held out from training and used to choose the best of the models fitted on the training data and to tune it. Overfitting can be detected during this validation phase, since an overfitted model performs well on the training set but poorly on the validation set.

Cross-validation splits the initial training data into *n* different train-test subsets, which are used to train *n* different hypotheses. This allows us to tune the hyperparameters using only our original training set. Repeating the experiment multiple times in this way gives a more accurate indication of how well the model generalizes to unseen data. Cross-validation does not prevent overfitting in itself, but it may help in identifying a case of overfitting.
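
As an illustration of the splitting described above, here is a minimal sketch of *n*-fold cross-validation used to compare a hyperparameter, in this case a polynomial degree. The data is synthetic, and the function name `k_fold_mse`, the candidate degrees and the number of folds are assumptions made for the example.

```python
import numpy as np


def k_fold_mse(x, y, degree, n_folds=5, seed=0):
    """Average validation MSE of a polynomial of the given degree over n folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))          # shuffle before splitting
    folds = np.array_split(indices, n_folds)   # n roughly equal index groups
    errors = []
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        # Fit on the training part only, evaluate on the held-out fold
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        predictions = np.polyval(coeffs, x[val_idx])
        errors.append(np.mean((predictions - y[val_idx]) ** 2))
    return float(np.mean(errors))


# Synthetic data (assumption): a noisy straight line
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 30)
y = 3.0 * x + rng.normal(scale=0.2, size=x.size)

# Compare candidate degrees using only the training data
scores = {degree: k_fold_mse(x, y, degree) for degree in (1, 3, 6)}
best_degree = min(scores, key=scores.get)
print(scores, best_degree)
```

A degree whose average validation error is clearly higher than that of a simpler model is a sign of overfitting, which is how the procedure helps to identify it.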

___

###### 2. **[0.6 points]** Apply candidate elimination (CE) algorithm on the data given below in Table 1, where {TreatmentSuccessful} is the target attribute. The tabular data given below is based on physiotherapy questionnaire results for patients having pain concerning musculoskeletal disorders and its treatment successfulness. ‘Problem Area’ indicates region of the pain, ‘Activity Level’ describes the current physical activity level of the patient, ‘Sleep Quality’ indicates the level of sleep quality of the patient and ‘Treatment Successful’ indicates whether the treatment was successful in lowering the pain or not. The task is to learn to predict the value of Treatment Successful for arbitrary values of the questionnaires. Describe the version space, specific hypothesis and general hypothesis boundary for this task (represent the version space starting from the initial boundary sets corresponding to the most specific and most generic hypotheses. You must represent the version space when CE algorithm visits a new negative or positive sample/example). The representation for “no value is acceptable” is ‘Ø’, and “any value is acceptable” is ‘?’. Also, the hypothesis space should be restricted to include only conjunctions of the attribute values.

| **Sex** | **Problem Area** | **Activity Level** | **Sleep Quality** | **Treatment Successful**  |
|---------|:----------------:|:------------------:|:-----------------:|:-------------------------:|
| Female  |        Back      |       Medium       |       Medium      |           yes             |
| Female  |        Neck      |       Medium       |        High       |           no              |
| Male    |      Shoulder    |        Low         |        Low        |           yes             |
| Male    |        Neck      |        High        |      Medium       |           yes             |
| Female  |        Back      |       Medium       |        Low        |           yes             |

The candidate-elimination algorithm computes the version space containing all (and only those) hypotheses from H that are consistent with an observed sequence of training examples. For our hypothesis space (*H*), we will start with the sets of maximally general (*G*) and maximally specific (*S*) hypotheses:
```
S0 = {<Ø,Ø,Ø,Ø>}
G0 = {<?,?,?,?>}
H0 = {<Ø,Ø,Ø,Ø>}
```
If we now feed our data from top to bottom of the list, the sequence of boundary sets becomes:
```
D1 = {Female, Back, Medium, Medium} + (positive)
Positive example -> we need to generalize our specific hypothesis:
S1 = {<Female,Back,Medium,Medium>}
G1 = {<?,?,?,?>}
H1 = {<Female,Back,Medium,Medium>}
```
```
D2 = {Female, Neck, Medium, High} - (negative)
Negative example -> minimally specialize G so that it excludes the negative example while staying more general than S:
S2 = {<Female,Back,Medium,Medium>}
G2 = {<?,Back,?,?>, <?,?,?,Medium>}
H2 = {<Female,Back,Medium,Medium>}
```
```
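D3 = {Male, Shoulder, Low, Low} + (positive)
Positive example -> remove the members of G that do not cover it and minimally generalize S:
S3 = {}   (the only generalization, <?,?,?,?>, is not covered by any member of G and also matches the negative example D2)
G3 = {}   (no member of G covers <Male,Shoulder,Low,Low>)
H3 = {}
The version space is empty: no conjunction of attribute values is consistent with D1-D3, so D4 and D5 cannot change it.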
```
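
To check the hand-computed boundary sets, the procedure can be sketched in Python. This is a minimal sketch under two assumptions: each attribute can only take the values that appear in Table 1, and the function names (`covers`, `more_general_or_equal`, `candidate_elimination`) are hypothetical helpers written for this illustration. Because the hypotheses are plain conjunctions, S never holds more than one member, so the sketch does not prune S for minimality.

```python
def covers(h, x):
    """True if hypothesis h classifies example x as positive."""
    return all(a == "?" or a == v for a, v in zip(h, x))


def more_general_or_equal(h1, h2):
    """True if h1 covers every example that h2 covers ('Ø' covers nothing)."""
    return all(a == "?" or a == b or b == "Ø" for a, b in zip(h1, h2))


def candidate_elimination(examples, domains):
    """Return the (S, G) boundary sets after each training example."""
    S = {("Ø",) * len(domains)}
    G = {("?",) * len(domains)}
    trace = []
    for x, positive in examples:
        if positive:
            # Drop general hypotheses that reject the positive example
            G = {g for g in G if covers(g, x)}
            new_S = set()
            for s in S:
                if covers(s, x):
                    new_S.add(s)
                    continue
                # Minimal generalization of s that covers x
                h = tuple(v if a == "Ø" else (a if a == v else "?")
                          for a, v in zip(s, x))
                if any(more_general_or_equal(g, h) for g in G):
                    new_S.add(h)
            S = new_S
        else:
            # Drop specific hypotheses that accept the negative example
            S = {s for s in S if not covers(s, x)}
            new_G = set()
            for g in G:
                if not covers(g, x):
                    new_G.add(g)
                    continue
                # Minimal specializations: fix one '?' to a value that excludes x
                for i, values in enumerate(domains):
                    if g[i] != "?":
                        continue
                    for value in values:
                        if value != x[i]:
                            h = g[:i] + (value,) + g[i + 1:]
                            if any(more_general_or_equal(h, s) for s in S):
                                new_G.add(h)
            # Keep only the maximally general members
            G = {g for g in new_G
                 if not any(o != g and more_general_or_equal(o, g) for o in new_G)}
        trace.append((S, G))
    return trace


# Attribute values as they appear in Table 1 (assumed to be the full domains)
domains = [
    ("Female", "Male"),
    ("Back", "Neck", "Shoulder"),
    ("Low", "Medium", "High"),
    ("Low", "Medium", "High"),
]
examples = [
    (("Female", "Back", "Medium", "Medium"), True),
    (("Female", "Neck", "Medium", "High"), False),
    (("Male", "Shoulder", "Low", "Low"), True),
    (("Male", "Neck", "High", "Medium"), True),
    (("Female", "Back", "Medium", "Low"), True),
]
trace = candidate_elimination(examples, domains)
for step, (S, G) in enumerate(trace, start=1):
    print(f"S{step} = {sorted(S)}  G{step} = {sorted(G)}")
```

Each printed line gives the specific and general boundary after one training example, in the same order as the rows of the table.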
___

```python
# Programming exercise 1

import os

import numpy as np
import pandas as pd


def create_directory_from_path(direc):
    """
    Loads directory structure with CSV files into a dictionary

    Parameters
    ----------
    direc : str
      Path to directory to be converted to dictionary

    Returns
    ----------
    ds : dictionary
      Directory modelled as a dictionary: ds[subdirectory][file name] -> array
    """
    ds = {}
    for i in os.listdir(direc):
        path = os.path.join(direc, i)
        if os.path.isfile(path):
            continue  # skip files in the root directory
        ds[i] = {}
        for j in os.listdir(path):
            name, ext = os.path.splitext(j)
            if ext == '.csv':  # select only files with the given extension
                # pandas default: the first row of each CSV is read as a header
                df = pd.read_csv(os.path.join(path, j))
                # Store as a NumPy array so it can be sliced like [:, 0] below
                ds[i][name] = df.to_numpy()
    return ds


direc = os.path.join(os.getcwd(), "dataset")  # dataset folder in the current working directory
ds = create_directory_from_path(direc)

# use the pseudoinverse, pinv() in numpy, when solving for the weights that
# minimize the MSE; it avoids failing on a singular (non-invertible) matrix

# 1. Implement linear regression with ordinary least squares (OLS) using the
# closed-form solution seen in Equation 9.

# Load variables
#x = ds['regression']['train_1d_reg_data'][:,0]
#y = ds['regression']['train_1d_reg_data'][:,0]

#print(vals[:,0])

# find the weights that give the OLS solution
#w = np.linalg.pinv( x1.transpose().dot(x1) ).dot( x1.transpose().dot(y) )

#print(w)
```
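
The commented-out weight computation above can be turned into a small runnable sketch. It assumes that Equation 9 is the usual closed form `w = pinv(X^T X) X^T y` (which matches the commented-out line) and that a column of ones is prepended for the bias term. The names `fit_ols` and `mse` and the four data points are hypothetical and only used for illustration.

```python
import numpy as np


def fit_ols(X, y):
    """Closed-form OLS weights w = pinv(X^T X) X^T y, bias column prepended."""
    X1 = np.column_stack([np.ones(len(X)), X])  # add the bias column of ones
    return np.linalg.pinv(X1.T @ X1) @ X1.T @ y


def mse(X, y, w):
    """Mean squared error of the linear model with weights w."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return float(np.mean((X1 @ w - y) ** 2))


# Hypothetical noise-free data generated from y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

w = fit_ols(x, y)
print(w, mse(x, y, w))
```

With noise-free data the recovered weights match the generating intercept and slope up to floating-point error, which makes this a convenient sanity check before running on the real training files.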