# Machine Learning and Statistics **PROJECT**
by Andreia Santos

DATE:  10th December 2023
***

In [2]:
# python libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


## Part 1 - Machine learning 

### __Question: In your notebook, you should first explain what supervised learning is and then explain what classification algorithms are.__

__ANSWER:__

Machine learning is only one part of solving a problem. Although it involves many technical details, it is important to keep the main goals and the big picture in mind.


Machine learning algorithms fall into two broad categories: supervised and unsupervised learning.
In supervised learning, the algorithm receives pairs of inputs (e.g. features) and their correct answers (e.g. a class label). Its goal is to predict the answers for new inputs that it has not seen before. These algorithms are called supervised because they learn with the help of a "teacher" who provides the correct answers. This "teaching" stage is known as the training process. Specifically, the algorithm builds a model whose internal parameters are adjusted to minimize the difference between the predicted outputs and the actual correct answers (1).
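
As a minimal sketch of this idea, the toy example below trains a classifier on feature/label pairs and then predicts labels for inputs it has not seen. The measurements, the fruit labels and the choice of `DecisionTreeClassifier` are illustrative assumptions and are not part of the project's dataset.

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up training pairs: each row of X_train is [weight in g, diameter in cm]
X_train = [[150, 7.0], [170, 7.5], [140, 6.8], [10, 2.0], [12, 2.2], [9, 1.9]]
# The "teacher" supplies the correct answer for every training row
y_train = ["apple", "apple", "apple", "cherry", "cherry", "cherry"]

# Training: the model is fitted to the feature/answer pairs
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Prediction: inputs that were not part of the training data
X_new = [[160, 7.2], [11, 2.1]]
predictions = model.predict(X_new)
print(predictions)
```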

In contrast, in unsupervised learning the algorithm only knows the input data; no output information is given. These algorithms are typically harder to understand and evaluate than supervised methods (1).
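
A small sketch of learning without answers is shown below, assuming k-means clustering as the example method. Only the inputs are passed to the algorithm, and it groups them by itself. The data points and the choice of two clusters are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up inputs only: no labels are provided
X = np.array([
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.8],   # points near (1, 1)
    [8.0, 8.2], [7.9, 8.1], [8.3, 7.8],   # points near (8, 8)
])

# Assumption: we ask for two groups; the algorithm is never told which point belongs where
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)
print(cluster_ids)
```

The cluster numbers themselves are arbitrary: the algorithm can only say which points belong together, not what each group means.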


Supervised machine learning has two main types: classification and regression.
In classification, the goal is to assign each item to one of a set of predefined groups (classes). There are two types of classification: binary classification, which separates items into exactly two classes, and multiclass classification, which sorts them into more than two classes (1).

In regression, the aim is to predict a continuous number, such as estimating a person's income from factors like education and age. The output can take any value in a range, and the size of the difference between the predicted and actual numbers matters. This contrasts with classification: when predicting income through regression the result can be any amount, whereas in classification items are sorted into distinct categories with no in-between values (1).
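
The income example can be sketched as follows, assuming a plain linear regression as the model. All numbers are made up: the incomes were generated from the rule `1000 * education + 500 * age + 2000` purely so that the result is easy to check.

```python
from sklearn.linear_model import LinearRegression

# Made-up training data: each row is [years of education, age]
X_train = [[10, 25], [12, 30], [16, 40], [18, 35], [14, 50]]
# Made-up incomes, generated as 1000 * education + 500 * age + 2000
y_train = [24500, 29000, 38000, 37500, 41000]

model = LinearRegression()
model.fit(X_train, y_train)

# The prediction is a continuous number, not one of a fixed set of categories
predicted_income = model.predict([[15, 45]])[0]
print(round(predicted_income, 2))
```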


A widely used toolkit for machine learning in Python is scikit-learn. This library includes submodules for:
1. Classification - sorting items into groups, using methods such as SVM, nearest neighbors, random forest, and logistic regression.
2. Regression - predicting continuous values, using methods such as Lasso and ridge regression.
3. Clustering - grouping similar items together, using methods such as k-means and spectral clustering.
4. Dimensionality reduction - simplifying large datasets by reducing the number of variables, using PCA, feature selection, and matrix factorization.
5. Model selection - choosing the best model and parameters for the data, using for example grid search and cross-validation.
6. Preprocessing - preparing data for machine learning, using techniques such as feature extraction and normalization (2).
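
Several of the submodules listed above can be combined in a few lines. The sketch below assumes that scikit-learn's bundled copy of the Iris data is acceptable as an example dataset. It chains a preprocessing step and a classifier, then uses grid search with 5-fold cross-validation for model selection. The step names `scale` and `knn` and the candidate values for `n_neighbors` are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled Iris dataset is used as example data
X, y = load_iris(return_X_y=True)

# Preprocessing (normalization) followed by classification
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Model selection: try several values of k with 5-fold cross-validation
param_grid = {"knn__n_neighbors": [1, 3, 5, 7]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```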







(1)	Mueller, A., & Guido, S. (2016). Introduction to Machine Learning with Python (1st ed.). O'Reilly Media.



(2)	McKinney, W. (2017). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (2nd ed.). O'Reilly Media.




## Part 2 - Classification algorithm

### __Describe at least one common classification algorithm and implement it using the scikit-learn Python library__

__ANSWER:__

The k-NN (k-nearest neighbors) algorithm is one of the simplest supervised classification algorithms. Building the model consists only of storing the training dataset. To make a prediction for a new data point, the algorithm finds the closest points in the stored training dataset, which are its nearest neighbors.


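In its simplest form, the k-NN algorithm considers just one nearest neighbor (k=1): the training point closest to the point being predicted. The predicted output is the known output of this closest training point. When more than one neighbor is considered (k > 1), the prediction is the class that is most frequent among the k nearest neighbors.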
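
The k=1 case can be written out by hand in a few lines. This is a minimal sketch and not scikit-learn's implementation. It assumes the Euclidean distance $d(x, x') = \sqrt{\sum_j (x_j - x'_j)^2}$ as the measure of closeness, and the function name `predict_1nn`, the points and the labels are hypothetical.

```python
import math


def predict_1nn(train_points, train_labels, query):
    """Return the label of the training point closest to `query`."""
    # Euclidean distance from the query to every stored training point
    distances = [math.dist(point, query) for point in train_points]
    # Index of the smallest distance, i.e. the nearest neighbor
    nearest_index = distances.index(min(distances))
    return train_labels[nearest_index]


# "Training" is only storing the data (made-up points and labels)
train_points = [(1.0, 1.0), (2.0, 1.5), (6.0, 6.5), (7.0, 6.0)]
train_labels = ["A", "A", "B", "B"]

print(predict_1nn(train_points, train_labels, (1.5, 1.2)))  # prints A
print(predict_1nn(train_points, train_labels, (6.4, 6.1)))  # prints B
```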
 


In [65]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# load the iris dataset
df = pd.read_csv('iris.csv')

# k-NN classifier: 5 neighbors (the scikit-learn default) or 6 gave the best classifier
# performance here, with a reasonable test size of 20% (N=30)
clf = KNeighborsClassifier(n_neighbors=5)

# features (X) and target class (Y)
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
Y = df['class']

# split the data into training (80%) and test (20%) subsets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

# fit the model to the training subset
clf.fit(X_train, Y_train)

# apply the fitted model to the test subset
Y_pred = clf.predict(X_test)

# estimate the fraction of correct classifications (accuracy)
accuracy = clf.score(X_test, Y_test)
num_samples_predicted = X_test.shape[0]

print(f"The percentage of correct predictions is: {accuracy:.2f} considering that {num_samples_predicted} samples have been predicted.")


The percentage of correct predictions is: 0.97 considering that 30 samples have been predicted.
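
The accuracy is a single number. A confusion matrix additionally shows which classes are mistaken for which. The sketch below makes two assumptions that differ from the cell above: it uses scikit-learn's bundled copy of the Iris data instead of `iris.csv`, and it fixes `random_state` and `stratify` so that the split is reproducible and balanced.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled Iris data stands in for 'iris.csv'
X, y = load_iris(return_X_y=True)

# Assumption: random_state and stratify are added for a reproducible, balanced split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

Values on the diagonal count correct predictions; any value off the diagonal is a test sample assigned to the wrong class.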


## Part 3 - Plots

### __Throughout your notebook, use appropriate plots, mathematical notation, and diagrams to explain the relevant concepts.__