## Your First Machine Learning Project in Python, Step by Step


## 📌 Expert Insight


**You Can Do Machine Learning in Python:**

- Work through the tutorial below. It will take you 5 to 10 minutes at most!
<br>

- **You do not need to be a Python programmer.** The syntax of the Python language can be intuitive if you are new to it. Just like other languages, focus on function calls (e.g. function()) and assignments (e.g. a = "b"). This will get you most of the way. You are a developer, so you know how to pick up the basics of a language quickly. Just get started and dive into the details later.
<br>
- **You do not need to understand everything** (at least not right now). Your goal is to run through the tutorial end-to-end and get a result. You do not need to understand everything on the first pass. Write down your questions as you go. Make heavy use of the help("FunctionName") syntax in Python to learn about the functions that you are using.
<br>
- **You do not need to know how the algorithms work.** It is important to know about the limitations and how to configure machine learning algorithms. But learning about algorithms can come later. You need to build up this algorithm knowledge slowly over a long period of time. Today, start off by getting comfortable with the platform.
<br>
- **You do not need to be a machine learning expert.** You can learn about the benefits and limitations of various algorithms later.
<br>

- **Consider this your first project.** Focus on the key steps, namely loading data, looking at the data, evaluating some algorithms and making some predictions. In later tutorials we will focus on other data preparation and result improvement tasks.
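
As a minimal sketch of the two constructs mentioned above (assignments and function calls), plus the built-in documentation lookup, consider the snippet below. The names `greeting`, `length` and `doc_text` are illustrative only, and `len` simply stands in for whichever function you want to read about.

```python
# Assignment: bind the name `greeting` to a string value.
greeting = "hello"

# Function call: pass an argument in parentheses and keep the result.
length = len(greeting)
print(length)

# Every documented function carries a docstring. In an interactive session,
# help(len) shows this same text together with the function signature.
doc_text = len.__doc__
print(doc_text)
```

In a notebook or the Python prompt you would normally just type help(len) and read the output.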

##  📌 Let's go... Step-By-Step... 


### 📍 A machine learning project may not be linear, but it has a number of well known steps

   1. Define the Problem.
   2. Prepare the Data.
   3. Evaluate the Algorithms.
   4. Improve the Results.
   5. Present the Results.
   
   
## 1. Required Libraries

### 📍 There are 5 key libraries that you will need.

- numpy
- pandas
- scipy 
- sklearn
- matplotlib 


## 📍 1.1 Start Python and Check Versions

- It is always advised to make sure your Python environment was installed successfully and is working as expected.

- The script below will help you understand your environment. 

- It imports each library required for this tutorial and prints the version.

In [None]:
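# Check the versions of libraries

# Python version
import sys
print('Python: {}'.format(sys.version))

# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))

# numpy
import numpy as np
print('numpy: {}'.format(np.__version__))

# matplotlib (the version lives on the top-level package, not on pyplot)
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))

# pandas
import pandas as pd
print('pandas: {}'.format(pd.__version__))

# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))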


## 📍 2. Load The Data

- We will use the iris flowers dataset.

- This dataset is famous because it is used as the “hello world” dataset in machine learning and statistics.

- The dataset contains 150 observations of iris flowers. 

- There are four columns of measurements of the flowers in centimeters. 

- The fifth column is the species of the flower observed. 

- All observed flowers belong to one of three species.

## 📍 2.1 Import libraries

- First, import all of the modules, functions and objects needed.

In [None]:
# Load libraries

from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

## 📍 2.2 Load Dataset

- We can load the data directly from the UCI Machine Learning repository.

- We are using pandas to load the data. 

- We will also use pandas next to explore the data both with descriptive statistics and data visualization.

#### Note that we are specifying the names of each column when loading the data. This will help later when we explore the data.

In [None]:
# Load dataset

import pandas as pd

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
df = pd.read_csv('iris.csv', names=names)

a = df  # keep a second reference to the loaded DataFrame

**The dataset should load without incident.**

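Note that the code above reads iris.csv from your working directory, so the file must be present there. If you would rather load the data over the network, pass the dataset URL to read_csv in place of the local file name; if you then run into network problems, download iris.csv into your working directory and load it as shown.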

## 📍 3. Summarize the Dataset

**Now it is time to take a look at the data.**

In this step we are going to take a look at the data in a few different ways:

- Dimensions of the dataset.
- Peek at the data itself.
- Statistical summary of all attributes.
- Breakdown of the data by the class variable.

## 📍 3.1 Dimensions of Dataset

We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the shape property.

In [None]:
# shape 

import pandas as pd

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
df = pd.read_csv('iris.csv', names=names)

a = df

b = a.shape  # shape is a property, so it takes no parentheses
print(b)  # You should see 150 instances and 5 attributes

## 📍 3.2 Peek at the Data

It is always advisable to actually eyeball your data.

In [None]:
# head

import pandas as pd

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
df = pd.read_csv('iris.csv', names=names)

a = df

c = a.head(20)
print(c)  # You should see the first 20 rows of the data

## 📍 3.3 Statistical Summary

Now, let us take a look at a summary of each attribute.

This includes the count, mean, the min and max values as well as some percentiles.

In [None]:
import pandas as pd

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
df = pd.read_csv('iris.csv', names=names)

a = df

# descriptions

d = a.describe().transpose()  # one row per attribute, one column per statistic
print(d)

## 📍 3.4 Class Distribution

Now, let us take a look at the number of instances (rows) that belong to each class. 

We can view this as an absolute count.

In [None]:
import pandas as pd

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
df = pd.read_csv('iris.csv', names=names)

a = df

# class distribution
c = a.groupby('class').size()  # number of rows per species
print(c)

#### We can see that each class has the same number of instances (50, or about 33% of the dataset).
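
To see where the 33% figure comes from, the counts can be turned into percentages. The sketch below uses a hypothetical six-row stand-in (`toy`) rather than the real file, so it reports 2 rows per class instead of 50; the same two calls should work unchanged on the full DataFrame.

```python
import pandas as pd

# Hypothetical miniature stand-in for the iris data: two rows per class.
toy = pd.DataFrame({
    "sepal-length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "class": [
        "Iris-setosa", "Iris-setosa",
        "Iris-versicolor", "Iris-versicolor",
        "Iris-virginica", "Iris-virginica",
    ],
})

# Absolute count of rows per class.
counts = toy.groupby("class").size()
print(counts)

# Share of each class as a percentage of all rows.
percentages = toy["class"].value_counts(normalize=True) * 100
print(percentages.round(1))
```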

## 📍 4. Data Visualization

- We now have a basic idea about the data. We can extend our understanding with some visualizations.

**We are going to use two types of plots:**

- Univariate plots to better understand each attribute.
- Multivariate plots to better understand the relationships between attributes.

## 📍 4.1 Univariate Plots

- These are plots of each individual variable.

- As the input variables are numeric, we can create box and whisker plots of each.

In [None]:
# box and whisker plots

import matplotlib.pyplot as plt
import pandas as pd

# Load dataset

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']

df = pd.read_csv('iris.csv', names=names)

# plot the data

# call plot on the DataFrame (not on plt) so each numeric column gets its own box plot
df.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)

plt.show()

### This output gives us a much clearer idea of the distribution of the input attributes.

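**We can also create a histogram of an input variable to get an idea of its distribution.** The cell below plots sepal-length; repeat it for the other columns, or call df.hist(), to see all four.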

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Load dataset

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
df = pd.read_csv('iris.csv', names=names)

# Create histograms for one of the features (e.g., 'sepal-length')

plt.hist(df['sepal-length'], bins=10, color='blue', edgecolor='black')
plt.xlabel('Sepal Length')
plt.ylabel('Frequency')
plt.title('Histogram of Sepal Length')
plt.show()

**It looks like two of the input variables have a Gaussian distribution.**

This will be useful as we can use algorithms that can exploit this assumption.

## 📍 4.2 Multivariate Plots

- Multivariate plots let us look at the interactions between the variables.

- First, let us look at scatterplots of all pairs of attributes. This will be helpful to spot structured relationships between input variables.

In [None]:
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot as plt

# Load dataset

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
df = read_csv('iris.csv', names=names)  # read_csv is imported directly above

# scatter plot matrix

scatter_matrix(df)
plt.show()

**Note the diagonal grouping of some pairs of attributes. This suggests a high correlation and a predictable relationship.**
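
One way to put a number on that diagonal grouping is the Pearson correlation coefficient. The sketch below is only an illustration under stated assumptions: it uses six made-up rows (`toy`) in the style of the petal measurements rather than the real dataset, and relies on the pandas `DataFrame.corr` method.

```python
import pandas as pd

# Hypothetical petal measurements (illustrative values, not the real file).
toy = pd.DataFrame({
    "petal-length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "petal-width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
})

# Pairwise Pearson correlation between the numeric columns.
corr = toy.corr()
print(corr.round(2))

# A value close to 1 means the two attributes rise and fall together.
petal_corr = corr.loc["petal-length", "petal-width"]
print(petal_corr)
```

On the full dataset the class column is text, so it would need to be left out first, for example with df.drop(columns='class').corr().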

## 📍 5. Evaluate Some Algorithms

**Now it is time to create some models of the data and estimate their accuracy on unseen data.**

**This is what we are going to do in this step:**

- Separate out a validation dataset.
- Set up the test harness to use 10-fold cross validation.
- Build multiple different models to predict species from flower measurements.
- Select the best model.

## 📍 5.1 Create a Validation Dataset

- First, we need to know whether the model we created is good.

- Further down, we will use statistical methods to estimate the accuracy of the models that we create on unseen data.

- We also need a more concrete estimate of the accuracy of the best model on unseen data by evaluating it on actual unseen data.

- We are going to hold back some data that the algorithms will not get to see and we will use this data to get a second and independent idea of how accurate the best model might actually be.

**The method:**

- We will split the loaded dataset into two parts:

   - 80% of which we will use to train, evaluate and select among our models, 
   
   - 20% that we will hold back as a validation dataset.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load dataset

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
df = pd.read_csv('iris.csv', names=names)

# Split-out validation dataset

array = df.values
X = array[:,0:4]
y = array[:,4]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1)

**Now we have training data in X_train and Y_train for preparing models, and the X_validation and Y_validation sets that we can use later.**

Notice that we used a Python slice to select the columns in the NumPy array.

- If this is new to you, you can refer to our NumPy practice material (How to Index, Slice and Reshape NumPy Arrays).
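
A minimal sketch of the slicing used above, on a hypothetical 3-row array (the measurements and the numeric class codes are made up for illustration): `array[:, 0:4]` keeps every row and columns 0 through 3, while `array[:, 4]` keeps every row of column 4 only.

```python
import numpy as np

# Hypothetical data: four measurements plus a numeric class code per row.
array = np.array([
    [5.1, 3.5, 1.4, 0.2, 0],
    [7.0, 3.2, 4.7, 1.4, 1],
    [6.3, 3.3, 6.0, 2.5, 2],
])

# All rows, columns 0..3 (the end index 4 is excluded): the input features.
X = array[:, 0:4]

# All rows, column 4 only: the class to predict.
y = array[:, 4]

print(X.shape)
print(y.shape)
```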

## 📍 5.2 Test Harness

We will use stratified 10-fold cross validation to estimate model accuracy.

This will split our dataset into 10 parts, train on 9 and test on 1 and repeat for all combinations of train-test splits.

**Stratified** means that each fold or split of the dataset will aim to have the same distribution of examples by class as exists in the whole training dataset.

- We set the random seed via the random_state argument to a fixed number to ensure that each algorithm is evaluated on the same splits of the training dataset. The specific random seed does not matter.
- We are using the metric of 'accuracy' to evaluate models. This is the ratio of the number of correctly predicted instances to the total number of instances in the dataset, multiplied by 100 to give a percentage (e.g. 95% accurate).

We will pass scoring='accuracy' to cross_val_score when we build and evaluate each model next.
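
The sketch below illustrates the two ideas in this section under stated assumptions: the labels are hypothetical (20 of one class, 10 of another, with 5 folds instead of 10 to keep the output short) and the feature matrix is a placeholder. It shows that each stratified test fold keeps the 2:1 class ratio, and then computes accuracy by hand on a made-up set of predictions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 20 examples of class 0 and 10 of class 1.
y = np.array([0] * 20 + [1] * 10)
X = np.zeros((len(y), 1))  # placeholder features; only y drives the split

kfold = StratifiedKFold(n_splits=5, random_state=1, shuffle=True)
fold_counts = []
for train_idx, test_idx in kfold.split(X, y):
    # Count how many test examples of each class landed in this fold.
    counts = np.bincount(y[test_idx], minlength=2)
    fold_counts.append(tuple(int(c) for c in counts))
    print(counts)

# Accuracy: number of correct predictions divided by the total number.
y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([0, 1, 1, 1, 1])  # made-up predictions, one mistake
accuracy = (y_true == y_pred).mean()
print(accuracy * 100)
```

Because the seed is fixed, rerunning the loop with the same random_state should reproduce the same folds, which is what lets the models be compared on identical splits.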

## 📍 5.3 Build Models

We do not know which algorithms would be good on this problem or what configurations we need to use.

We get an idea from the plots that some of the classes are partially linearly separable in some dimensions, so we can generally expect good results.

**Let us test 6 different algorithms:**

- Logistic Regression (LR)
- Linear Discriminant Analysis (LDA)
- K-Nearest Neighbors (KNN).
- Classification and Regression Trees (CART).
- Gaussian Naive Bayes (NB).
- Support Vector Machines (SVM).

**This is a good combination of simple linear (LR and LDA) and nonlinear (KNN, CART, NB and SVM) algorithms.**

Let us build and evaluate our models:

In [None]:
import pandas as pd 
import sklearn

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Spot Check Algorithms

models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))

# evaluate each model in turn

results = []
names = []
for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
    

## 📍 5.4 Select Best Model

- We now have 6 models and accuracy estimations for each. 
- We need to compare the models with each other and select the most accurate.

**Running the example above, we got a few raw results:**

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. 

- Consider running the example a few times and compare the average outcome.

- In this case, we can see that it looks like Support Vector Machines (SVM) has the largest estimated accuracy score.

- We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. 

- There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (via 10 fold-cross validation).

**A good way to compare the samples of results for each algorithm is to create a box and whisker plot for each distribution and compare the distributions.**
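
Before plotting, it can help to see what a single algorithm's population of scores looks like in numbers. The sketch below assumes ten hypothetical fold accuracies (they are not results from this tutorial) and computes the mean and standard deviation in the same format as the print statement used earlier, along with the quartiles that a box plot draws as its box and median line.

```python
import numpy as np

# Hypothetical accuracies from 10 folds (not real results from this tutorial).
cv_results = np.array([1.0, 1.0, 0.92, 1.0, 0.92, 1.0, 0.83, 1.0, 1.0, 1.0])

# Mean and spread, formatted like the evaluation loop's output.
mean_acc = cv_results.mean()
std_acc = cv_results.std()
summary = '%s: %f (%f)' % ('MODEL', mean_acc, std_acc)
print(summary)

# Quartiles: the edges of the box (q1, q3) and the line inside it (median).
q1, median, q3 = np.percentile(cv_results, [25, 50, 75])
print(q1, median, q3)
```

With many folds at 100%, the median and upper quartile both sit at the top of the range, which is why such box plots look squashed.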

In [None]:
# Compare Algorithms

import matplotlib.pyplot as plt

plt.boxplot(results, labels=names)
plt.title('Algorithm Comparison')
plt.show()

**We can see that the box and whisker plots are squashed at the top of the range, with many evaluations achieving 100% accuracy, and some pushing down into the high 80% accuracies.**

## 📍 5.5 Complete Example

- For reference, we will group all of the previous code-blocks together into a single script.

- The combined code-block is given below.

In [None]:
# compare algorithms

from pandas import read_csv
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Load dataset

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv('iris.csv', names=names)  # read_csv is imported directly above


# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1, shuffle=True)

# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))

# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
    
# Compare Algorithms
plt.boxplot(results, labels=names)
plt.title('Algorithm Comparison')
plt.show()

## 📍 6. Make Predictions

**We need to choose an algorithm to use to make predictions.**

- The results in the previous section suggest that the SVM was perhaps the most accurate model. 

- We will use SVM as our final model.

Now we need an idea of the accuracy of the model on our validation set.

- This will give us an independent final check on the accuracy of the best model. 

It is always valuable to keep a validation set in case a slip was made during training, such as overfitting to the training set or a data leak. Both of these issues will make your result overly optimistic.

## 📍 6.1 Make Predictions

**We can fit the model on the entire training dataset and make predictions on the validation dataset.**

In [None]:
# Make predictions on validation dataset
model = SVC(gamma='auto')
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)

We might also like to make predictions for single rows of data. For examples on how to do that, see the tutorial:

## 📍 6.2 Evaluate Predictions

We can evaluate the predictions by comparing them to the expected results in the validation set, then calculate classification accuracy, as well as a confusion matrix and a classification report.

In [None]:
# Evaluate predictions

print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))

We can see that the accuracy is 0.966, or about 97%, on the hold-out dataset.

The confusion matrix provides an indication of the errors made.

Lastly, the classification report provides a breakdown of each class by precision, recall, f1-score and support showing excellent results (considering the validation dataset was small).
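
To make the link between the confusion matrix and the accuracy figure concrete, here is a small sketch on hypothetical labels (six made-up expected and predicted values, not output from the model above). Rows of the matrix are the expected classes, columns are the predicted classes, and the diagonal holds the correct predictions, so accuracy is the diagonal sum divided by the total.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

labels = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

# Hypothetical expected and predicted classes with a single mistake:
# one versicolor flower is predicted as virginica.
y_true = ['Iris-setosa', 'Iris-setosa', 'Iris-versicolor',
          'Iris-versicolor', 'Iris-virginica', 'Iris-virginica']
y_pred = ['Iris-setosa', 'Iris-setosa', 'Iris-versicolor',
          'Iris-virginica', 'Iris-virginica', 'Iris-virginica']

# Rows: expected class. Columns: predicted class.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Accuracy from the library, and the same value from the matrix diagonal.
acc = accuracy_score(y_true, y_pred)
acc_from_cm = np.trace(cm) / cm.sum()
print(acc, acc_from_cm)
```

Any off-diagonal entry pinpoints which class was mistaken for which, something the single accuracy number cannot show.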

## 📍 6.3 Complete Example

- For reference, we will group all of the previous code-blocks together into a single script.

- The combined code-block is given below.

In [None]:
# make predictions

from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# Load dataset

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv('iris.csv', names=names)  # read_csv is imported directly above


# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1)

# Make predictions on validation dataset
model = SVC(gamma='auto')
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)

# Evaluate predictions
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))