<h1>KNN Classification</h1>
<img src="images/1line.png" width="100%">
<ul><li>
KNN (K-Nearest Neighbors) is one of many supervised learning algorithms used in data mining and machine learning.</li>
<li>It is a classification algorithm whose predictions are based on how similar a new piece of data (a vector) is to other data with known values.</li>
<li>While it can be used for either regression or classification problems, it is typically used as a classification algorithm, working off the assumption that similar points can be found near one another.</li>
<li>For example: Imagine you are from a small town and you must decide how you want to vote on a particular issue. To do this, you might go to your nearest neighbors and ask how they stand on the particular ballot measure. If the majority of your 'k' nearest neighbors support the measure, then you would most likely vote for the measure as well.</li>
</ul>

<h3>How Does the K-Nearest Neighbors Algorithm Work?</h3>
<ul>
<li>The K-NN algorithm compares a new data entry to the values in a given data set (with different classes or categories). </li>
<li>Based on its closeness or similarity to a given number (<strong>K</strong>) of neighbors, the algorithm assigns the new data to one of the classes or categories in the data set (training data).</li></ul>
<p>The algorithm proceeds as follows:</p>
<ul>
<li><strong>Step 1: </strong>Assign a value to <strong>K</strong> (the number of neighbors to look at).</li>
<li><strong>Step 2: </strong>Calculate the distance between the new data entry and other existing data entries.</li>
<li><strong>Step 3:  </strong>Find the <strong>K</strong> nearest neighbors to the new entry based on the calculated distances.</li>
<li><strong>Step 4: </strong>Assign the new data entry to be the same class (or type) as the majority of its nearest neighbors. </li>
</ul>
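<p>The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the scikit-learn implementation: the function name <code>knn_predict</code>, the use of Euclidean distance, and the six sample points are all assumptions made for this example.</p>

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, new_point, k):
    """Classify new_point by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from the new entry to every existing entry.
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 3: keep the k entries with the smallest distances.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 4: majority vote among the labels of those k neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up (petal length, petal width) values, for illustration only.
points = [(1.4, 0.2), (1.5, 0.3), (1.3, 0.2), (4.5, 1.5), (4.7, 1.4), (4.4, 1.3)]
labels = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "versicolor"]

# Step 1: choose K (3 here), then classify a new flower.
print(knn_predict(points, labels, (4.6, 1.4), k=3))  # versicolor
```

<p>Sorting every distance is fine for a small data set; libraries typically use faster neighbor-search structures for larger ones.</p>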



<h3>K-Nearest Neighbors Example</h3>

<ul><li>We will explain how KNN classification works using a famous dataset called “IRIS” (<a href="data/iris.csv">iris.csv</a>).</li>
<li>The “IRIS” dataset consists of <code>data</code>: the sepal length (cm), sepal width (cm), petal length (cm), and petal width (cm) for three different classes of Iris flowers.</li>
<li>The flower types Iris-Setosa, Iris-Versicolour and Iris-Virginica are coded as 0, 1 and 2 respectively. These codes are stored in the dataset in the column <code>target</code>, which holds the values to be predicted.</li>
<li>You can see the complete analysis of the iris dataset here: <a href="https://colab.research.google.com/github/dgregg/Jupyter/blob/master/Notebooks/iris-knn-classification-scikit.ipynb" target="_blank" rel="noopener"> <img src="https://colab.research.google.com/assets/colab-badge.svg" /></a></li></ul>
<img src="images/iris_types MIT Liscence.png">

<ul><li>The plot below shows the petal length (cm) on the x-axis and petal width (cm) on the y-axis for the three varieties of irises. </li></ul>
<figure ><img src="images/petals_original.png" alt="knn-data-graph original data for irises" ></figure>
<ul><li>Assume we now want to classify a new unknown iris. This is represented by the purple X in the graph below.</li></ul>
<figure ><img src="images/petals_new.png" alt="knn-data-graph with a new point" ></figure>

<ul><li>To classify this iris, we'll assign a value to <strong>K</strong>, which denotes the number of neighbors to consider before classifying the new flower. Let's assume the value of <strong>K</strong> is 7.</li><li>Since the value of <strong>K</strong> is 7, the KNN algorithm will only consider the 7 nearest neighbors to the purple X (new flower).</li>
<li>This is represented by the red circle in the plot below.</li></ul>
<figure class="kg-card kg-image-card"><img src="images/petals_knn7.png" ></figure><ul><li>Out of the 7 nearest neighbors in the diagram above, the majority are blue circles so the new entry will be classified as "blue" or Iris-Versicolour. </li></ul>

<h3>Accuracy</h3>

<ul><li>The Iris dataset is frequently used when learning KNN classification because, when all 4 values are used to classify a new iris in the data set, accuracy is very high.</li>
<li>When I used a value of K = 7 and trained the model using 70% of the data, the KNN classifier built into scikit-learn predicted the iris type correctly for 98% of the remaining test flowers, as the report below shows.</li>
</ul>


<pre>
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       0.95      1.00      0.98        20
           2       1.00      0.93      0.96        14

    accuracy                           0.98        45
   macro avg       0.98      0.98      0.98        45
weighted avg       0.98      0.98      0.98        45
</pre>
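<p>A report in the format shown above can be produced with scikit-learn roughly as follows. This sketch assumes scikit-learn's bundled copy of the Iris data rather than iris.csv, and the <code>random_state=0</code> seed is an arbitrary choice made here, so the exact numbers may differ from the report above.</p>

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the four measurements (X) and the 0/1/2 class codes (y).
X, y = load_iris(return_X_y=True)

# Train on 70% of the 150 flowers and hold out the rest for testing.
# random_state=0 is an assumption for repeatability, not a value from this page.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0
)

# Fit a KNN classifier that votes among the 7 nearest neighbors.
model = KNeighborsClassifier(n_neighbors=7)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Per-class precision, recall, F1-score and support, plus overall accuracy.
print(classification_report(y_test, y_pred))
```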


<img src="images/Precisionrecall.svg.png" style="float: right;">
<h4>F1-score</h4>

<ul><li>Classification performance is measured here using the <strong>F1-score</strong>, which is based on the model's precision and recall: a good model must both capture the positive cases (recall) and be accurate about the cases it does capture (precision).
<ul><li><strong>Recall</strong> is the number of cases that were accurately identified (true positives) divided by the total number of cases that should have been identified (true positives + false negatives).</li>
<li><strong>Precision</strong> is the number of cases that were accurately identified (true positives) divided by the total number of cases that were identified (true positives + false positives).</li>
<li><strong>F1-score</strong> = 2 x (Precision * Recall) / (Precision + Recall)</li></ul>
</li></ul>
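<p>As a worked example, the formulas can be checked against class 2 in the report above. The counts used here are inferred, not taken from the original analysis: a support of 14 with recall 0.93 and precision 1.00 is consistent with 13 true positives, 1 false negative, and 0 false positives. The helper name <code>precision_recall_f1</code> is hypothetical.</p>

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp)  # of the cases identified, how many were right
    recall = tp / (tp + fn)     # of the cases that exist, how many were found
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1


# Inferred counts for class 2: 13 found correctly, 1 missed, 0 wrongly included.
precision, recall, f1 = precision_recall_f1(tp=13, fp=0, fn=1)
print(f"{precision:.2f} {recall:.2f} {f1:.2f}")  # 1.00 0.93 0.96
```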
<h4>Interpreting F1-score</h4>
<ul><li>Over 90% - Very good</li><li>Between 70% and 90% - Good</li><li>Between 60% and 70% - OK</li><li>Below 60% - Poor</li></ul>
<h4>Improving Accuracy</h4>
<ul>
<li>Things that affect classification accuracy include the value of k and the distance metric (the function used to compute the distance between points).</li>
<li>Selecting a value for k can be a balancing act: very small values tend to overfit the training data, while very large values tend to underfit. Frequently, data analysts will vary the value of k to find the best fit for the classification (the highest percentage classified correctly).</li>
<li>In the Iris dataset the performance of the classification is very good for <strong>K</strong> values between 1 and 59 (this is unusually good).</li></ul>
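<p>One way to explore these choices is to loop over several values of k and more than one distance metric and compare the scores. The sketch below is an assumption-laden illustration: it scores each setting with 5-fold cross-validation instead of the single 70/30 split used earlier, and the particular k values and metrics are picked only as examples.</p>

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

results = {}
for metric in ("euclidean", "manhattan"):
    for k in (1, 7, 15, 31, 59):
        model = KNeighborsClassifier(n_neighbors=k, metric=metric)
        # Mean accuracy over 5 cross-validation folds for this (metric, k) pair.
        results[(metric, k)] = cross_val_score(model, X, y, cv=5).mean()

for (metric, k), score in results.items():
    print(f"{metric:>9}  k={k:<2}  accuracy={score:.3f}")

# The setting with the highest mean accuracy among those tried.
best_metric, best_k = max(results, key=results.get)
print("best:", best_metric, best_k)
```

<p>Cross-validation averages over several train/test splits, so its scores are usually less sensitive to one lucky or unlucky split than a single hold-out set.</p>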


<hr><h3>References</h3>
<ul><li>Ihechikara Vincent Abba, KNN Algorithm – K-Nearest Neighbors Classifiers and Model Example, 1-25-2023 <a href="https://www.freecodecamp.org/news/k-nearest-neighbors-algorithm-classifiers-and-model-example/">https://www.freecodecamp.org/news/k-nearest-neighbors-algorithm-classifiers-and-model-example/</a></li>
<li>Mike Yun, Iris KNN Classification (SciKit), <a href="https://www.kaggle.com/code/barcodereader/iris-knn-classification-scikit">https://www.kaggle.com/code/barcodereader/iris-knn-classification-scikit</a>, Apache 2.0 open source license.</li>
<li>skalskip, Iris data visualization and KNN classification, <a href="https://www.kaggle.com/code/skalskip/iris-data-visualization-and-knn-classification">https://www.kaggle.com/code/skalskip/iris-data-visualization-and-knn-classification</a>, Apache 2.0 open source license.</li>
<li>Iris images downloaded from: <a href="https://github.com/andersonpereiradossantos/machine-leaning-knn_Iris_dataset">https://github.com/andersonpereiradossantos/machine-leaning-knn_Iris_dataset</a>, MIT License</li>
<li>Precision and recall file: <a href="https://en.wikipedia.org/wiki/Precision_and_recall#/media/File:Precisionrecall.svg">Precisionrecall.svg</a>, <a href="https://creativecommons.org/licenses/by-sa/4.0" target="_blank">CC BY-SA 4.0</a>, created 22 November 2014, by Walber</li></ul>

