<a href="https://colab.research.google.com/github/Wasabibish/Simple-Neural-Net-/blob/main/Sklearn_for_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#Import necessary libraries
import sklearn
from sklearn.datasets import load_breast_cancer
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler


# Import and split

## Import datasets from Sklearn

The sklearn.datasets module includes utilities to load datasets, including methods to load and fetch popular reference datasets. It also features some artificial data generators.

Additional link: https://python-course.eu/machine-learning/available-data-sets-in-sklearn.php
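
As a quick illustration of the two kinds of utilities mentioned above, the sketch below calls one loader and one artificial data generator. The sample count, feature count, and `random_state` passed to `make_classification` are arbitrary values picked for this example, not something the notebook relies on.

```python
from sklearn.datasets import load_breast_cancer, make_classification

# Loader: return the feature matrix and target vector directly as NumPy arrays.
X, y = load_breast_cancer(return_X_y=True)
print(X.shape, y.shape)

# Generator: build a small artificial classification problem.
# n_samples, n_features and random_state are arbitrary illustrative choices.
X_fake, y_fake = make_classification(n_samples=100, n_features=5, random_state=0)
print(X_fake.shape, y_fake.shape)
```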

In [None]:
#Loading a dataset from sklearn (we choose the famous breast cancer dataset)
data = load_breast_cancer()

In [None]:
#Let's have a look at the data
data

{'data': array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
         1.189e-01],
        [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
         8.902e-02],
        [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
         8.758e-02],
        ...,
        [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
         7.820e-02],
        [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
         1.240e-01],
        [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
         7.039e-02]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
        1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
        1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
        1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0

In [None]:
#The dataset is a dictionary-like object, so let's check its keys
data.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [None]:
#For better visualization, we create a DataFrame using pandas
pd.DataFrame(data['data'])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


We notice that we have 30 columns (independent variables).
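
The columns above are labeled 0 to 29 because we passed only the raw array. A small sketch, assuming we want readable column labels, reuses the `feature_names` key that appears in `data.keys()`:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Use the dataset's own feature names as column labels instead of 0..29.
df = pd.DataFrame(data.data, columns=data.feature_names)
print(df.shape)
print(df.columns[:3].tolist())
```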

In [None]:
#Same thing for the target
pd.DataFrame(data['target'])

Unnamed: 0,0
0,0
1,0
2,0
3,0
4,0
...,...
564,0
565,0
566,0
567,0


In [None]:
set(data['target'])

{0, 1}

We notice that we have a single binary target whose unique values are 0 and 1. In this dataset, 0 stands for malignant and 1 for benign (see `data.target_names`).
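
To see what each target value means and how many samples belong to each class, we can combine `target_names` with a count of the target array. This is only a sketch of one way to inspect the labels:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# target_names[i] is the label for class i.
print(list(data.target_names))

# Count how many samples fall in each class.
counts = np.bincount(data.target)
print(dict(zip(data.target_names, counts)))
```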

In [None]:
#Print the full description of the dataset
print(data.DESCR)



## Train Test Split

In [None]:
#The help shows test_size=None by default; when train_size is also None, test_size is set to 0.25
help(train_test_split)

Help on function train_test_split in module sklearn.model_selection._split:

train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)
    Split arrays or matrices into random train and test subsets.
    
    Quick utility that wraps input validation and
    ``next(ShuffleSplit().split(X, y))`` and application to input data
    into a single call for splitting (and optionally subsampling) data in a
    oneliner.
    
    Read more in the :ref:`User Guide <cross_validation>`.
    
    Parameters
    ----------
    *arrays : sequence of indexables with same length / shape[0]
        Allowed inputs are lists, numpy arrays, scipy-sparse
        matrices or pandas dataframes.
    
    test_size : float or int, default=None
        If float, should be between 0.0 and 1.0 and represent the proportion
        of the dataset to include in the test split. If int, represents the
        absolute number of test samples. If None, the value is set to 

In [None]:
#Check how many rows we have
len(data.data)

569

In [None]:
#Of the 569 samples, a common choice is a 7:3 or 8:2 split between training and evaluation.
#When nothing is specified, the split is 75:25 (test_size=0.25).
#We can change it with the test_size parameter.
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target)

In [None]:
#Expected sizes of the train and test sets, before rounding
print(569*0.75, 569*0.25)

426.75 142.25


In [None]:
#Check the actual sizes
len(X_train), len(X_test)

(426, 143)
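
The actual sizes (426, 143) differ from 426.75 and 142.25 because sample counts must be whole numbers. The sketch below reproduces them under the assumption that the test count is rounded up and the training set receives the remaining rows, which matches the output above. It also passes `random_state` and `stratify`, two parameters listed in the help text; the seed value 0 is an arbitrary choice for this example.

```python
import math
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
n_samples = len(X)

# Test count is rounded up; the training set receives the remaining rows.
n_test = math.ceil(0.25 * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)

# random_state=0 is an arbitrary seed chosen so the split is repeatable;
# stratify=y keeps the class proportions similar in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
print(len(X_train), len(X_test))
```

Fixing `random_state` means the same rows land in the test set on every run, which makes the accuracy comparable between runs.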

# ML Algorithm steps

## Creating and instantiating the model
Instantiation is the step where we choose the hyperparameters. Any hyperparameter we do not set keeps its default value. Evaluate the estimator in a cell to display its parameter settings.
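
A minimal sketch of inspecting the hyperparameters, assuming we only set `criterion` as in this notebook. `get_params()` returns every hyperparameter with its current value, including the ones left at their defaults:

```python
from sklearn.tree import DecisionTreeClassifier

# Only criterion is set explicitly; everything else keeps its default.
model = DecisionTreeClassifier(criterion='entropy')

# get_params() returns a dict of all hyperparameters and their current values.
params = model.get_params()
print(params['criterion'])
print(params['max_depth'])
```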

In [None]:
#Instantiate the estimator and set its hyperparameters
model = DecisionTreeClassifier(criterion='entropy')

In [None]:
#Train the instantiated estimator with its fit method.
#A supervised learning algorithm takes the training data and the labels together as arguments.

model.fit(X_train, y_train)

In [None]:
#Once the estimator has been fitted, we can call its predict method.
#predict returns the class labels the model estimates for the input data.

y_pred = model.predict(X_test)
y_pred

array([1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0,
       0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,
       0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1,
       0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
       1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1])

## Calculate the accuracy

In [None]:
#Create two column arrays, predicted values and actual values, to compare them

pred_2d = y_pred.reshape(-1, 1)
y_test_2d = y_test.reshape(len(y_pred), 1)

In [None]:
#Convert both arrays to DataFrames

df1 = pd.DataFrame(pred_2d)
df2 = pd.DataFrame(y_test_2d)

In [None]:
df_concat = pd.concat([df1, df2], axis = 1)
df_concat.columns = ['pred', 'real']

In [None]:
#We can see the results

df_concat.head(15)

Unnamed: 0,pred,real
0,1,1
1,0,0
2,1,1
3,1,1
4,1,1
5,0,0
6,1,1
7,1,1
8,1,1
9,1,1


In [None]:
#Keep only the rows where the prediction matches the actual value

df_concat[df_concat['pred'] == df_concat['real']]

Unnamed: 0,pred,real
0,1,1
1,0,0
2,1,1
3,1,1
4,1,1
...,...,...
135,0,0
136,1,1
138,1,1
140,1,1


In [None]:
true = len(df_concat[df_concat['pred'] == df_concat['real']])

In [None]:
#We compute the fraction of predictions that our model got right.
#This fraction is what we call the accuracy of a model.

true/len(df_concat)

0.9230769230769231
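
The same fraction can be obtained without building a DataFrame. The sketch below compares a manual computation with `accuracy_score` and the estimator's `score` method. It uses its own split with an arbitrary seed (`random_state=0`), so the printed value will not necessarily match the accuracy shown above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
# random_state=0 is an arbitrary seed for repeatability, not the notebook's split.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(criterion='entropy', random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Fraction of predictions that match the true labels.
manual = (y_pred == y_test).mean()
print(manual, accuracy_score(y_test, y_pred), model.score(X_test, y_test))
```

All three values are the same quantity: the number of correct predictions divided by the number of test samples.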

## Increasing Accuracy with Standardization

In [None]:
#Standardization is available as the StandardScaler class in scikit-learn.

scaler = StandardScaler()

In [None]:
#Our data before standardization

pd.DataFrame(X_train).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
0,12.87,16.21,82.38,512.2,0.09425,0.06219,0.039,0.01615,0.201,0.05769,...,13.9,23.64,89.27,597.5,0.1256,0.1808,0.1992,0.0578,0.3604,0.07062
1,17.54,19.32,115.1,951.6,0.08968,0.1198,0.1036,0.07488,0.1506,0.05491,...,20.42,25.84,139.5,1239.0,0.1381,0.342,0.3508,0.1939,0.2928,0.07867
2,12.43,17.0,78.6,477.3,0.07557,0.03454,0.01342,0.01699,0.1472,0.05561,...,12.9,20.21,81.76,515.9,0.08409,0.04712,0.02237,0.02832,0.1901,0.05932
3,11.75,17.56,75.89,422.9,0.1073,0.09713,0.05282,0.0444,0.1598,0.06677,...,13.5,27.98,88.52,552.3,0.1349,0.1854,0.1366,0.101,0.2478,0.07757
4,11.74,14.02,74.24,427.3,0.07813,0.0434,0.02245,0.02763,0.2101,0.06113,...,13.31,18.26,84.7,533.7,0.1036,0.085,0.06735,0.0829,0.3101,0.06688


In [None]:
#Our data after standardization

scaler.fit(X_train)
pd.DataFrame(scaler.transform(X_train)).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
0,-0.38352,-0.765395,-0.418896,-0.428593,-0.089628,-0.790667,-0.616066,-0.837589,0.754743,-0.688645,...,-0.507241,-0.362758,-0.553656,-0.510198,-0.252923,-0.468315,-0.349992,-0.870496,1.153187,-0.711634
1,0.938046,-0.020018,0.926222,0.816967,-0.42468,0.315189,0.219097,0.695239,-1.121119,-1.09187,...,0.836472,0.010621,0.941012,0.610235,0.306899,0.55084,0.378618,1.208696,0.049861,-0.272592
2,-0.508036,-0.576055,-0.574292,-0.527523,-1.459165,-1.321424,-0.946769,-0.815666,-1.247665,-0.990339,...,-0.713333,-0.94489,-0.777127,-0.652719,-2.11198,-1.313481,-1.19986,-1.320861,-1.626347,-1.32793
3,-0.70047,-0.441839,-0.6857,-0.68173,0.867142,-0.119974,-0.437398,-0.100276,-0.778699,0.628366,...,-0.589678,0.373817,-0.575973,-0.589144,0.163585,-0.439233,-0.650856,-0.210532,-0.684602,-0.332585
4,-0.703299,-1.290275,-0.753532,-0.669257,-1.271477,-1.151352,-0.830028,-0.537966,1.09344,-0.189689,...,-0.628835,-1.27584,-0.689643,-0.62163,-1.23821,-1.073992,-0.98368,-0.487045,0.332221,-0.915612


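* Before standardization, the columns are on very different scales (compare column 3 with column 4, for example).

* After standardization, every column has mean 0 and unit variance on the training set, so the values stay within a few units of 0.

* Scale-sensitive models (for example k-nearest neighbors, SVMs, or regularized linear models) often perform better on standardized data. Tree-based models such as the DecisionTreeClassifier used above split on one feature at a time and are largely unaffected by scaling.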
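
The points above can be checked with a short sketch. It fits the scaler on the training data only and reuses the learned mean and standard deviation on the test data, so no information from the test set leaks into preprocessing. The split uses an arbitrary seed (`random_state=0`) chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
# random_state=0 is an arbitrary seed chosen for repeatability.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler()
# Learn the mean and standard deviation from the training data only...
X_train_std = scaler.fit_transform(X_train)
# ...then reuse those same statistics on the test data.
X_test_std = scaler.transform(X_test)

# Per-column mean is ~0 and standard deviation is ~1 on the training set.
print(np.allclose(X_train_std.mean(axis=0), 0))
print(np.allclose(X_train_std.std(axis=0), 1))
```

The test set will generally not have a mean of exactly 0 after the transform, because it is scaled with the training set's statistics rather than its own.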
