# Module 11: Introduction to Machine Learning

## Lecture 2: ML Algorithm Examples

CSCI 1360: Foundations for Informatics and Analytics

## Overview and Objectives
There are many machine learning algorithms, and scikit-learn implements a good number of them. In this lecture we will look at two simple ones. By the end of this lecture, you should be able to:

* Understand how some simple ML algorithms work.
* Use scikit-learn to create a basic classifier.

### Part 1: Linear Regression

![image.png](attachment:image.png)

* Recall how we transformed linear equations to Python functions in the Linear Algebra module.

* Now we will use some input data and build a model to learn a function that can predict the output.

**Definition**

* In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables. [Wikipedia](https://en.wikipedia.org/wiki/Linear_regression)
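
For reference, with two explanatory variables the model fitted in this part can be written as below. The symbols $w_0$, $w_1$, and $b$ are notation introduced here for illustration and do not appear in the lecture.

$$\hat{y} = w_0 x_0 + w_1 x_1 + b$$

Ordinary least squares chooses the weights and intercept that minimize the sum of squared residuals over the $n$ training samples:

$$\min_{w_0, w_1, b} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$$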

In [3]:
# import necessary modules
import numpy as np
from sklearn.linear_model import LinearRegression


In [4]:
# prepare input samples
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])

In [5]:
# design a linear function
# y = 1 * x_0 + 2 * x_1 + 3
y = np.dot(X, np.array([1, 2])) + 3
print(y)

[ 6  8  9 11]


In [38]:
# alternatively, hard-code the targets
# (try commenting out the y = np.dot(...) line in the cell above)
y = [6, 8, 9, 11]

In [39]:
# train the model and evaluate training
reg = LinearRegression().fit(X, y)
print(reg.score(X, y))

1.0


In [40]:
# inspect the learned coefficients, then predict the output for a new sample
print(reg.coef_)

print(reg.predict(np.array([[3, 5]])))

[1. 2.]
[16.]
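
To get a feel for what `fit` is doing, the same coefficients can be recovered with a plain least-squares solve in NumPy. This is a minimal sketch under the assumption that an ordinary least-squares fit is wanted; it is not scikit-learn's actual implementation, and the names `X_aug` and `w` are introduced here for illustration.

```python
import numpy as np

# same samples and targets as in the cells above
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.array([6, 8, 9, 11])

# append a column of ones so that the last weight plays the role of the intercept
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# solve min ||X_aug @ w - y||^2 for w
w, residuals, rank, singular_values = np.linalg.lstsq(X_aug, y, rcond=None)

coef, intercept = w[:2], w[2]
print(coef, intercept)  # values close to the 1, 2 and 3 used to design y
```

The first two entries of `w` correspond to `reg.coef_`, and the last one to `reg.intercept_`.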


- Notice that we did not perform cross-validation in the previous training procedure.

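- By evaluating the trained model on held-out test samples, we can estimate how well it generalizes. Note that for regression, `score` reports the coefficient of determination ($R^2$), not a classification accuracy.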
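
The two points above can be sketched as follows. Four samples are too few to split meaningfully, so this example assumes a synthetic dataset drawn from the same linear function with a little noise added; the sample count, noise level, and random seeds are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# synthetic data (an assumption): y = 1 * x_0 + 2 * x_1 + 3 plus small Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(100, 2))
y = X @ np.array([1, 2]) + 3 + rng.normal(0, 0.1, size=100)

# hold out 25% of the samples; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
reg = LinearRegression().fit(X_train, y_train)
test_r2 = reg.score(X_test, y_test)  # R^2 on the unseen samples
print(test_r2)

# 5-fold cross-validation: each fold is used once as the test set
cv_scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(cv_scores.mean())
```

Scoring on `X_test` rather than `X_train` is what makes the number an estimate of performance on new data.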


**What is data leakage?**

![image.png](attachment:image.png)

### Part 2: Decision Tree



* Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression.

* The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

### Some advantages of decision trees:

* Simple to understand and to interpret. Trees can be visualized.

* Requires little data preparation.

* Able to handle multi-output problems.


### Some disadvantages:

* Decision-tree learners can create over-complex trees that fit the training data closely but do not generalize well to new data (overfitting).

* Decision trees can be unstable because small variations in the data might result in a completely different tree being generated.

* Decision tree learners create biased trees if some classes dominate.
  * It is therefore recommended to balance the dataset before fitting the decision tree.

![image.png](attachment:image.png)

credit: datadriveninvestor.com

![image.png](attachment:image.png)

credit: datadriveninvestor.com

In [41]:
# import necessary modules
import numpy as np
from sklearn import tree


In [42]:
# Covid symptoms data samples
# each row: [temperature, throat pain level (0-2)]
X = np.array([
    [97, 0],
    [97, 1],
    [97, 2],
    [99, 0],
    [99, 1],
    [99, 2],
    [100, 0],
    [100, 1],
    [100, 2],
])

In [43]:
# one label per sample; only the last sample, [100, 2], belongs to class 1
y = [0, 0, 0, 0, 0, 0, 0, 0, 1]

In [44]:
# train
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, y)
print(clf.score(X, y))

1.0


In [45]:
# test on samples that were not in the training data
print(clf.predict([[101, 2]]))
print(clf.predict([[101, 1]]))
print(clf.predict([[101, 0]]))
print(clf.predict([[98, 2]]))

[1]
[0]
[0]
[0]


In [46]:
# print the tree's decision rules as text
from sklearn.tree import export_text
r = export_text(clf, feature_names=['temperature', 'pain'])
print(r)

|--- temperature <= 99.50
|   |--- class: 0
|--- temperature >  99.50
|   |--- pain <= 1.50
|   |   |--- class: 0
|   |--- pain >  1.50
|   |   |--- class: 1
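
The printed rules can be read as nested `if` statements. The sketch below rewrites them as a plain Python function and checks that it agrees with the trained classifier on the test samples used earlier. `rule_based_predict` is a hypothetical helper written for this illustration, and `random_state=0` is an added assumption so that repeated runs build the same tree.

```python
import numpy as np
from sklearn import tree

# same symptom samples and labels as above
X = np.array([
    [97, 0], [97, 1], [97, 2],
    [99, 0], [99, 1], [99, 2],
    [100, 0], [100, 1], [100, 2],
])
y = [0, 0, 0, 0, 0, 0, 0, 0, 1]

clf = tree.DecisionTreeClassifier(random_state=0).fit(X, y)


def rule_based_predict(temperature, pain):
    """Hand-written version of the rules printed by export_text."""
    if temperature <= 99.5:
        return 0
    if pain <= 1.5:
        return 0
    return 1


test_samples = [[101, 2], [101, 1], [101, 0], [98, 2]]
by_hand = [rule_based_predict(t, p) for t, p in test_samples]
by_tree = clf.predict(test_samples).tolist()
print(by_hand, by_tree)
```

On this dataset a split on temperature first and a split on pain first are equally good, so a tree built with a different seed may list the two tests in the opposite order while describing the same decision rule.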



### Tree Depth Parameter

![image.png](attachment:image.png)
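
In scikit-learn, tree depth is controlled by the `max_depth` parameter of `DecisionTreeClassifier`. As a small sketch, assuming the same symptom data as above, the code below compares a tree limited to a single split with an unrestricted one. The variable names `stump` and `full` are chosen here for illustration.

```python
import numpy as np
from sklearn import tree

# same nine samples as above, built with a comprehension
X = np.array([[t, p] for t in (97, 99, 100) for p in (0, 1, 2)])
y = [0, 0, 0, 0, 0, 0, 0, 0, 1]

# max_depth=1 allows a single split; max_depth=None lets the tree grow until leaves are pure
stump = tree.DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
full = tree.DecisionTreeClassifier(max_depth=None, random_state=0).fit(X, y)

print(stump.get_depth(), stump.score(X, y))
print(full.get_depth(), full.score(X, y))
```

A single split cannot isolate the one class-1 sample, so the shallow tree misclassifies it, while the unrestricted tree needs two levels to fit the training data exactly. On larger, noisier datasets the trade-off usually runs the other way: limiting depth is a common way to reduce overfitting.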

### Imbalanced data and class weight

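The `class_weight` parameter of `DecisionTreeClassifier` takes weights associated with classes in the form `{class_label: weight}`. If `None` (the default), all classes are given weight one. Giving a rare class a larger weight is one way to compensate for imbalanced data.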
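
As a rough illustration, the sketch below fits two single-split trees on the symptom data, one without weights and one where class 1 is weighted eight times as heavily as class 0. The weight of 8 is an assumption chosen to offset the 8-to-1 class ratio, and `max_depth=1` is used only to make the effect visible, since an unrestricted tree fits this tiny dataset perfectly either way.

```python
import numpy as np
from sklearn import tree

X = np.array([[t, p] for t in (97, 99, 100) for p in (0, 1, 2)])
y = [0, 0, 0, 0, 0, 0, 0, 0, 1]

# unweighted: every sample counts the same
plain = tree.DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

# weighted: the single class-1 sample counts as much as the eight class-0 samples together
weighted = tree.DecisionTreeClassifier(
    max_depth=1, class_weight={0: 1, 1: 8}, random_state=0
).fit(X, y)

sample = [[100, 2]]  # the only class-1 sample in the training data
print(plain.predict(sample))
print(weighted.predict(sample))
```

The unweighted tree labels the sample with the majority class, whereas the weighted tree recovers the rare class at the cost of some false positives on the training data. Passing `class_weight='balanced'` asks scikit-learn to derive such weights from the class frequencies automatically.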

## Review Questions

Some questions to discuss and consider:

1: Is Linear Regression suitable for all datasets? Why?

2: Can a decision tree model handle a simple categorical feature? Why?

3: Is a decision tree model suitable for a dataset with a huge number of features?

## Additional Resources

 1. VanderPlas, Jake. Python Data Science Handbook: Essential Tools for Working with Data (1st ed., 2016). ISBN: 9781491912058.

 2. https://scikit-learn.org/stable/modules/tree.html