<div style="width: 38.5%;">
    <p><strong>City College of San Francisco</strong></p>
    <hr>
    <p>MATH 108 - Foundations of Data Science</p>
</div>

# Lecture 35: Classification

Associated Textbook Sections: [17.0 - 17.3](https://inferentialthinking.com/chapters/17/Classification.html)

---

## Outline

* [Prediction](#Prediction)
* [Classification](#Classification)

---

## Set Up the Notebook

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

from mpl_toolkits.mplot3d import Axes3D

---

## Prediction

---

### Guessing the Value of an Attribute

* Based on incomplete information
* One way of making predictions: 
    * To predict an outcome for an individual, 
    * find others who are like that individual
    * and whose outcomes you know. 
    * Use those outcomes as the basis of your prediction.
* Two Types of Prediction
    * Classification = Categorical
    * Regression = Numeric
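
The "find others who are like that individual" idea can be sketched in a few lines of plain Python. This is only an illustration: the data points, the `predict` helper, and the choice of `k=3` are hypothetical, and the sketch covers classification (a majority vote over categories) rather than regression.

```python
from collections import Counter
from math import dist

# Hypothetical individuals whose outcomes are known: (attributes, outcome)
known = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((0.9, 1.3), "A"),
    ((4.0, 4.2), "B"),
    ((4.1, 3.9), "B"),
]

def predict(individual, known, k=3):
    """Predict an outcome from the k known individuals most like this one."""
    # Sort the known individuals by their distance to the new individual
    nearest = sorted(known, key=lambda pair: dist(pair[0], individual))[:k]
    # Use the most common outcome among those neighbors as the prediction
    outcomes = Counter(outcome for _, outcome in nearest)
    return outcomes.most_common(1)[0][0]

prediction_near_a = predict((1.1, 1.0), known)
prediction_near_b = predict((3.8, 4.0), known)
```

For a numeric outcome, the same neighbors could be averaged instead of counted.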


---

### Prediction Example: Visual Plant Identification

<a href="https://unsplash.com/photos/purple-flower-in-tilt-shift-lens-OBtrCoiKlZo" title="purple flower in tilt shift lens"><img src="./iris.avif" alt="an iris plant" width=40%></a>

* What type of plant is this?
* How do we train a computer to make this decision?

---

### Machine Learning Algorithm

* A mathematical model
* calculated based on sample data ("training data")
* that makes predictions or decisions without being explicitly programmed to perform the task
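
As a rough sketch of what "calculated based on sample data" can mean, the example below computes a cutoff from a small training set instead of hard-coding it. The training values, the labels, and the midpoint rule are all assumptions made for illustration; real algorithms are usually more involved.

```python
from statistics import mean

# Hypothetical training data: one numeric attribute and a known label
training = [(1.0, "low"), (1.5, "low"), (2.0, "low"),
            (4.0, "high"), (4.5, "high"), (5.0, "high")]

def train(training):
    """Compute a cutoff from the data: the midpoint between the two class means."""
    low_mean = mean(x for x, label in training if label == "low")
    high_mean = mean(x for x, label in training if label == "high")
    return (low_mean + high_mean) / 2

def predict(x, cutoff):
    """Label a new value by comparing it to the learned cutoff."""
    return "high" if x > cutoff else "low"

# The cutoff comes from the training data, not from a rule typed in by hand
cutoff = train(training)
```

Changing the training data changes the cutoff, and therefore the predictions, without editing `predict`.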

---

## Classification

---

### Classification Example: Playlist Sorting

* Two Fall 2021 MATH 108 students (Lil Cabrera and Olga Aronov) analyzed music by exploring song attributes from [Spotify's API](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-several-audio-features).
* Spotify's API response to requesting a track's audio features:

```json
{
  "audio_features": [
    {
      "acousticness": 0.00242,
      "analysis_url": "https://api.spotify.com/v1/audio-analysis/2takcwOaAZWiXQijPHIx7B\n",
      "danceability": 0.585,
      "duration_ms": 237040,
      "energy": 0.842,
      "id": "2takcwOaAZWiXQijPHIx7B",
      "instrumentalness": 0.00686,
      "key": 9,
      "liveness": 0.0866,
      "loudness": -5.883,
      "mode": 0,
      "speechiness": 0.0556,
      "tempo": 118.211,
      "time_signature": 4,
      "track_href": "https://api.spotify.com/v1/tracks/2takcwOaAZWiXQijPHIx7B\n",
      "type": "audio_features",
      "uri": "spotify:track:2takcwOaAZWiXQijPHIx7B",
      "valence": 0.428
    }
  ]
}
```
* They classified each song by assigning it to one of two playlists (Workout or Relax), choosing the playlist whose songs had the most similar attributes.
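
One possible way to decide which playlist a song resembles is to compare its attributes to an average profile for each playlist. The sketch below is not the students' actual method: the playlist averages are made-up numbers, and the choice of four features (all on a 0 to 1 scale, so no rescaling is needed) is an assumption. The song values come from the sample API response above.

```python
from math import dist

FEATURES = ["energy", "danceability", "valence", "acousticness"]

# Hypothetical average attributes for each playlist (made-up numbers)
playlist_profiles = {
    "Workout": {"energy": 0.85, "danceability": 0.70,
                "valence": 0.55, "acousticness": 0.05},
    "Relax": {"energy": 0.30, "danceability": 0.45,
              "valence": 0.35, "acousticness": 0.70},
}

# Attribute values taken from the sample API response
song = {"energy": 0.842, "danceability": 0.585,
        "valence": 0.428, "acousticness": 0.00242}

def as_point(attributes):
    """Turn a dictionary of attributes into a list ordered by FEATURES."""
    return [attributes[name] for name in FEATURES]

def closest_playlist(song, profiles):
    """Return the name of the playlist whose profile is nearest to the song."""
    return min(profiles,
               key=lambda name: dist(as_point(song), as_point(profiles[name])))

assigned = closest_playlist(song, playlist_profiles)
```

Attributes on other scales, such as `tempo` or `loudness`, would need to be rescaled before being included in a distance like this.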

---

### Classification Example: Targeted Advertising

> Andrew Pole had just started working as a statistician for Target in 2002, when two colleagues from the marketing department stopped by his desk to ask an odd question: "If we wanted to figure out if a customer is pregnant, even if she didn't want us to know, can you do that?" - [How Companies Learn Your Secrets (The New York Times Magazine)](https://www.nytimes.com/2012/02/19/magazine/shopping-habits.html)

---

### Classification Example: Sentiment Analysis

<a href="https://unsplash.com/photos/decorative-egg-and-egg-shells-xcSGMHsaLio" title="egg and egg shells"><img width = 50% src="./happy_sad_eggs.jpeg" alt="egg and egg shells with faces drawn on them"></a>

* Sentiment analysis is a type of classification that focuses on extracting subjective information. For example, a statement can be classified as positive, negative, or neutral. 

* The following is an example of using a sentiment analysis model from [text-processing.com](http://text-processing.com/docs/sentiment.html). 

_You don't need to know about the [`requests` library](https://requests.readthedocs.io/en/latest/), JSON files, or how to make HTTP POSTs, but you might want to one day!_

In [None]:
import requests

text_list = ["I love CCSF!", 
             "I hate CCSF!", 
             "I attend classes at CCSF."] # Statements to classify

url = 'http://text-processing.com/api/sentiment/' # The classification app
for text in text_list:
    data = {'text': text} # Form data; requests URL-encodes it
    # Send the text to classify to the web app
    response = requests.post(url, data=data)
    text_label = response.json()['label'] # The returned label for the text
    print(f"'{text}' was labeled as {text_label}.")
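
To make the idea of labeling text concrete without calling a web service, here is a toy classifier that counts words from two short lists. It is not the model behind text-processing.com; the word lists and the `toy_sentiment` function are hypothetical, and a real model learns these associations from training data.

```python
# Hypothetical word lists; a real model learns such associations from data
POSITIVE = {"love", "great", "enjoy"}
NEGATIVE = {"hate", "awful", "dislike"}

def toy_sentiment(text):
    """Label text by comparing counts of positive and negative words."""
    # Lowercase each word and drop surrounding punctuation
    words = [word.strip(".,!?").lower() for word in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    elif score < 0:
        return "negative"
    else:
        return "neutral"

labels = [toy_sentiment(text) for text in ["I love CCSF!",
                                           "I hate CCSF!",
                                           "I attend classes at CCSF."]]
```

A word-count rule like this misses negation ("I do not love...") and sarcasm, which is part of why trained models are used instead.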

---

### Classification through Feature Relationships

* How can data be used to perform classification?
* The relationship between various attributes (features) might reveal patterns! 
* The choice of attributes and the number of attributes can have a big impact on identifying classes.
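
A small sketch of why the choice of attribute matters: below, one hypothetical feature separates the two classes cleanly while another does not. The rows and the `ranges_overlap` helper are invented for illustration, and checking whether ranges overlap is only a crude stand-in for looking at a scatter plot.

```python
# Hypothetical labeled rows: (feature_a, feature_b, class)
rows = [
    (1.0, 5.0, 0), (1.5, 2.0, 0), (2.0, 8.0, 0),
    (6.0, 4.0, 1), (6.5, 7.0, 1), (7.0, 3.0, 1),
]

def ranges_overlap(rows, column):
    """Return True if the two classes' value ranges overlap for this column."""
    class_0 = [row[column] for row in rows if row[2] == 0]
    class_1 = [row[column] for row in rows if row[2] == 1]
    return min(class_0) <= max(class_1) and min(class_1) <= max(class_0)

overlap_a = ranges_overlap(rows, 0)  # feature_a: classes fall in separate ranges
overlap_b = ranges_overlap(rows, 1)  # feature_b: classes are mixed together
```

Here `feature_a` alone would be enough to tell the classes apart, while `feature_b` alone would not.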

---

### Demo: Classifying Banknotes

The `banknote.csv` dataset contains wavelet transform (image processing) measurements of banknotes (bills) that have been used to [classify banknotes](https://www.researchgate.net/publication/220929082_Feature_Extraction_for_Bank_Note_Classification_Using_Wavelet_Transform).

* Notice that the dataset has two classes of banknotes.
* Explore the relationship between `WaveletVar` and `WaveletCurt` to see if they are helpful features for classifying the banknotes.
* Explore the relationship between `WaveletSkew` and `Entropy` to see if they are helpful features for classifying the banknotes.
* Sometimes you need to relate more features to see clear separation in the data! Let's see how all three wavelet features can be used to identify the class visually in a 3D scatter plot, with color showing the class as a fourth dimension.

In [None]:
banknotes = Table.read_table('banknote.csv')
banknotes

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
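fig = plt.figure(figsize=(8,8))
# Create the 3D axes on the figure without constructing Axes3D by hand
ax = fig.add_subplot(projection='3d')
ax.scatter(banknotes.column('WaveletSkew'),
           banknotes.column('WaveletVar'),
           banknotes.column('WaveletCurt'),
           c=banknotes.column('Class'), # Color each point by its class
           cmap='viridis',
           s=50)
# Label the axes so each feature can be identified
ax.set_xlabel('WaveletSkew')
ax.set_ylabel('WaveletVar')
ax.set_zlabel('WaveletCurt');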

---

### Demo: Classifying Patients (Chronic Kidney Disease)

* Load the `ckd.csv` patient data. Group the data by class to see how many patients have been labeled as having chronic kidney disease (`1`) or not (`0`).
* Visualize the relationship between `'White Blood Cell Count'` and `'Glucose'` to see if these features might be helpful to identify CKD. Look for separation in the colored points.
* Visualize the relationship between `'Hemoglobin'` and `'Glucose'` to see if these features might be helpful to identify CKD.
* Explore how the natural boundaries in the scatterplot can be used to classify a patient as having CKD or not. Create a function that predicts a patient's class from their hemoglobin and glucose levels, using the boundaries seen in the visualization.
* Try out the classifier and think about its limitations.

In [None]:
ckd = (Table.read_table('ckd.csv')
            .relabeled('Blood Glucose Random', 'Glucose'))
ckd.show(3)

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
class_0 = ckd.where('Class',are.equal_to(0))
max_glucose_for_0 = ...
min_hemoglobin_for_0 = ...

In [None]:
def classify(hemoglobin, glucose):
    ...
        ...
    else:
        ...

In [None]:
classify(15, 100)

In [None]:
classify(10, 300)
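
For reference, one way a boundary-based classifier could look is sketched below on made-up patients rather than on `ckd.csv`. The patient values, and therefore the two thresholds, are hypothetical; the function is named `classify_sketch` so it does not replace the `classify` function built in the demo.

```python
# Hypothetical patients: (hemoglobin, glucose, class); not taken from ckd.csv
patients = [
    (15.0, 100, 0), (14.2, 120, 0), (13.5, 95, 0),
    (9.8, 300, 1), (11.0, 250, 1), (12.9, 410, 1),
]

# Boundaries of the region occupied by class 0 in this made-up data
class_0 = [p for p in patients if p[2] == 0]
max_glucose_for_0 = max(glucose for _, glucose, _ in class_0)
min_hemoglobin_for_0 = min(hemoglobin for hemoglobin, _, _ in class_0)

def classify_sketch(hemoglobin, glucose):
    """Predict 1 (CKD) if the patient falls outside the class-0 region, else 0."""
    if hemoglobin < min_hemoglobin_for_0 or glucose > max_glucose_for_0:
        return 1
    else:
        return 0

inside_region = classify_sketch(15, 100)
outside_region = classify_sketch(10, 300)
borderline = classify_sketch(13.4, 100)  # just below the hemoglobin boundary
```

The borderline patient shows one limitation: a value barely outside the observed class-0 range flips the prediction, and the boundaries depend entirely on which patients happened to be in the data.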

---

<footer>
    <p>Adapted from UC Berkeley DATA 8 course materials.</p>
    <p>This content is offered under a <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">CC Attribution Non-Commercial Share Alike</a> license.</p>
</footer>