<a href="https://colab.research.google.com/github/KhanradCoder/LearnMachineLearning/blob/master/2_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preparing the data

Here we use machine learning to predict breast cancer diagnoses.

In [0]:
import pandas as pd

dataset = pd.read_csv('cancer.csv')
print(len(dataset.columns))

32


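The dataset has many feature columns, so we print the column count (32) to decide which ones to use. In the next cell, column 1 is the label y and columns 2 through 28 are the x features. The slice 2:29 excludes its end index, so columns 29 to 31 are left out.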

In [0]:
x = dataset.iloc[:, 2:29].values
y = dataset.iloc[:, 1].values
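
To see exactly which columns the slices above select, the same slicing can be applied to a plain list of column positions. This is a small sketch that assumes only the column count of 32 printed earlier; it does not read cancer.csv, and the variable names are illustrative.

```python
# Hypothetical stand-in for the 32 column positions reported above.
column_indices = list(range(32))

label_index = column_indices[1]          # column used for y
feature_indices = column_indices[2:29]   # columns used for x (end index excluded)
left_out = column_indices[29:]           # columns the slice does not reach

print(len(feature_indices))  # number of feature columns selected
print(feature_indices[-1])   # last selected column index
print(left_out)              # trailing columns not used as features
```

If all remaining columns are meant to be features, the slice would need to run to the end of the column range instead.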

Now let's split the data into training and testing.

In [0]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)
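
The split above is random, so the confusion matrices later in this notebook can change from run to run. The plain-Python sketch below shows the idea of a shuffled 80/20 split and how a fixed seed makes it repeatable. It is an illustration, not scikit-learn's implementation; the helper shuffled_split and the sample ids are hypothetical. In scikit-learn, train_test_split accepts a random_state argument for the same purpose.

```python
import random

def shuffled_split(items, test_size=0.2, seed=None):
    """Shuffle a copy of items and cut it into (train, test) parts."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

samples = list(range(10))  # hypothetical sample ids

train_a, test_a = shuffled_split(samples, seed=42)
train_b, test_b = shuffled_split(samples, seed=42)

print(len(train_a), len(test_a))  # 8 training ids, 2 test ids
print(test_a == test_b)           # same seed, same split
```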

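Here we perform feature scaling. The features span very different ranges, so we standardize each one to zero mean and unit variance; otherwise, features with large values could dominate the model simply because of their scale. The scaler is fit on the training set only and then applied to the test set, so no information from the test set leaks into training.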

In [0]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
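
To make the fit/transform distinction concrete, here is a minimal sketch of standardization for a single feature in plain Python. The helper names and the numbers are hypothetical. Like StandardScaler, it uses the population standard deviation, and the statistics are learned from the training values only.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    return fmean(column), pstdev(column)

def transform(column, mean, std):
    """Apply a previously learned mean and standard deviation."""
    return [(value - mean) / std for value in column]

train_feature = [10.0, 20.0, 30.0, 40.0]  # hypothetical training values
test_feature = [25.0, 50.0]               # hypothetical test values

mean, std = fit_scaler(train_feature)  # statistics come from training data only
train_scaled = transform(train_feature, mean, std)
test_scaled = transform(test_feature, mean, std)

print(train_scaled)
print(test_scaled)
```

The scaled training values have mean 0 and standard deviation 1. The scaled test values generally do not, because they reuse the training statistics.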

# Logistic Regression
Logistic regression is one of the simplest classification algorithms (despite having regression in its name). ![alt text](https://miro.medium.com/max/1428/1*Vd9ZTC1zWJPtV7iXPMJk1Q.png)
You can see here that for predicting whether y=1 or 0 (a classification problem), linear regression performs poorly in comparison to logistic regression.
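
The S-shaped curve in the figure is the sigmoid function, which squashes a linear score into a value between 0 and 1 that can be read as a probability. The sketch below shows this for a single feature. The weight and bias are hypothetical values picked by hand, not parameters learned from the cancer data.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, weight, bias, threshold=0.5):
    """Classify a single-feature input by thresholding the sigmoid output."""
    probability = sigmoid(weight * x + bias)
    return 1 if probability >= threshold else 0

# Hypothetical parameters, chosen by hand rather than learned from data.
weight, bias = 2.0, -4.0

for x in [0.0, 2.0, 4.0]:
    print(x, round(sigmoid(weight * x + bias), 3), predict(x, weight, bias))
```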

In [0]:
from sklearn.linear_model import LogisticRegression

logistic_classifier = LogisticRegression()
logistic_classifier.fit(x_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [0]:
y_preds = logistic_classifier.predict(x_test)

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_preds))

[[73  0]
 [ 3 38]]


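Here we print the confusion matrix, which shows where the model is right and where it is wrong. In scikit-learn's layout, rows are the true classes and columns are the predicted classes, so the diagonal counts correct predictions. Here, 111 of the 114 test samples are classified correctly.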
![alt text](https://miro.medium.com/max/712/1*Z54JgbS4DUwWSknhDCvNTQ.png)
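
The usual summary metrics can be read straight off the matrix printed above. The sketch below does this by hand. It assumes the second row and column correspond to the positive class, which depends on how the labels in the dataset are ordered.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions.
matrix = [[73, 0],
          [3, 38]]

# Assumption: the second row and column correspond to the positive class.
true_neg, false_pos = matrix[0]
false_neg, true_pos = matrix[1]

total = true_neg + false_pos + false_neg + true_pos
accuracy = (true_pos + true_neg) / total
precision = true_pos / (true_pos + false_pos)  # share of predicted positives that are right
recall = true_pos / (true_pos + false_neg)     # share of actual positives that are found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

For a diagnosis task, recall on the positive class is often the number to watch, since a false negative is a missed case.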

# Support Vector Machines

SVMs are among the most powerful machine learning algorithms. Put simply, the SVM algorithm tries to find the boundary (a line in two dimensions) that keeps the maximum margin between the classes.


![alt text](https://www.aitrends.com/wp-content/uploads/2018/01/1-19SVM-2.jpg)
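
To illustrate what "margin" means, the sketch below measures the distance from the closest point to two candidate separating lines on a hypothetical toy dataset. Both lines separate the classes, but one leaves far more room than the other. This only evaluates given lines; it is not the optimization an SVM performs, and the data and the margin helper are made up for illustration.

```python
import math

# Hypothetical, linearly separable toy data: (point, label) with labels -1 and +1.
data = [((1.0, 1.0), -1), ((2.0, 1.0), -1), ((4.0, 3.0), 1), ((5.0, 4.0), 1)]

def margin(weights, bias):
    """Smallest signed distance from any point to the line w.x + b = 0.

    A negative result means at least one point is on the wrong side.
    """
    norm = math.hypot(*weights)
    return min(
        label * (weights[0] * x + weights[1] * y + bias) / norm
        for (x, y), label in data
    )

margin_a = margin((1.0, 1.0), -5.0)  # candidate line x + y = 5
margin_b = margin((1.0, 0.0), -3.8)  # candidate line x = 3.8

print(round(margin_a, 3), round(margin_b, 3))
```

An SVM prefers the first line, because a wider margin leaves more tolerance for new points that fall near the boundary.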

In [0]:
from sklearn.svm import SVC

svm = SVC(kernel="rbf")
svm.fit(x_train, y_train)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [0]:
y_preds = svm.predict(x_test)

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_preds))

[[72  1]
 [ 2 39]]


# Decision Trees
Decision trees are really common in machine learning. Essentially, the algorithm tries to find a set of rules by which it can classify each datapoint into a given category. See the example below of a decision tree that classifies whether or not a given person is "fit". ![alt text](https://www.tutorialspoint.com/machine_learning_with_python/images/decision_tree_introduction.jpg)

In [0]:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(criterion = 'entropy')
tree.fit(x_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
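
The tree above is built with criterion = 'entropy', which scores a candidate rule by how much it reduces the label entropy of the data it splits. The sketch below computes that reduction (the information gain) in plain Python. The helper names and the "M"/"B" labels are hypothetical and are not taken from cancer.csv.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting parent into left and right."""
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
    return entropy(parent) - weighted

# Hypothetical labels: a perfectly mixed node and two candidate splits of it.
parent = ["M", "M", "B", "B"]
clean_split = (["M", "M"], ["B", "B"])
useless_split = (["M", "B"], ["M", "B"])

print(entropy(parent))
print(information_gain(parent, *clean_split))
print(information_gain(parent, *useless_split))
```

A rule that separates the classes perfectly recovers the full bit of entropy, while a rule that leaves both sides as mixed as before gains nothing.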

In [0]:
y_preds = tree.predict(x_test)

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_preds))

[[70  3]
 [ 3 38]]


# Random Forest

One tree is usually a pretty weak classifier. However, if you train multiple trees and average their predictions, the classifier becomes a lot stronger. Think about examples of collective intelligence in nature: as long as the individual trees make different mistakes, combining more of them should make the classifier more accurate on average.
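
A rough way to see why combining helps is to compute how often a simple majority vote is right when each voter is right with a fixed probability. The sketch below makes two strong simplifying assumptions: the voters err independently, which real trees trained on the same data do not, and they cast hard votes, whereas scikit-learn's forest averages predicted probabilities. The per-tree accuracy of 0.7 is hypothetical.

```python
from math import comb

def majority_accuracy(n_voters, p_correct):
    """Probability that more than half of n independent voters are right."""
    needed = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )

# Hypothetical per-tree accuracy of 0.7; odd voter counts avoid ties.
for n in [1, 11, 101]:
    print(n, round(majority_accuracy(n, 0.7), 4))
```

Under these assumptions the vote gets more reliable as voters are added. In practice the gain is smaller, because the trees' errors are correlated.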

In [0]:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators = 100, criterion = 'entropy')
forest.fit(x_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [0]:
y_preds = forest.predict(x_test)

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_preds))

[[71  2]
 [ 3 38]]


We can see that the difference between one decision tree and 100 decision trees is marginal here: the single tree misclassifies 6 of the 114 test samples and the forest misclassifies 5. However, as you come across larger and more complex datasets, the number of trees will become more important. Remember, though, that decision trees are prone to overfitting, especially on a small dataset like the one we are using.
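
For a side-by-side view, the four confusion matrices printed in this notebook can be summarized in a few lines. The numbers below are copied from the outputs above, and the summarize helper is illustrative. All four results come from a single random split, so gaps of one or two samples may simply reflect that particular split.

```python
# Confusion matrices printed in the sections above (rows: true, columns: predicted).
results = {
    "logistic regression": [[73, 0], [3, 38]],
    "svm": [[72, 1], [2, 39]],
    "decision tree": [[70, 3], [3, 38]],
    "random forest": [[71, 2], [3, 38]],
}

def summarize(matrix):
    """Return (misclassified count, accuracy) for a 2x2 confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = matrix[0][0] + matrix[1][1]
    return total - correct, correct / total

for name, matrix in results.items():
    errors, accuracy = summarize(matrix)
    print(f"{name}: {errors} errors, accuracy {accuracy:.3f}")
```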