# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint

## Learning Objective

At the end of this experiment, you will be able to:

* Understand the Haberman Survival dataset
* Understand the K-nearest neighbour (KNN) algorithm


## Dataset

### Description

The dataset chosen for this experiment is Haberman's Survival Data. It contains cases from a study conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.


The dataset contains 306 instances and four attributes. 

#### Attribute Information 

* Age of patient at time of operation (numerical) 
* Patient's year of operation (year - 1900, numerical) 
* Number of positive axillary nodes detected (numerical) 
* Survival status (class attribute) 

      * 1 = the patient survived 5 years or longer 
      * 2 = the patient died within 5 years
      
      
#### Datasource

https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival


## AI / ML Technique

### K-nearest neighbour (KNN)

K-nearest neighbour is a supervised learning algorithm in which a new sample is assigned the class held by the majority of its k nearest neighbours in the training set.

For example, given a new sample, we compute its distance to every training sample and pick the k training samples closest to it. The new sample is then classified by majority vote among those k nearest samples.
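
The procedure described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the scikit-learn implementation used later in this notebook: the function name `knn_predict` and the toy data points are made up for the example, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest (distance, label) pairs
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The predicted label is the most frequent label among the neighbours
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data, made up for illustration: (age, year of operation, positive nodes)
train_points = [(30, 64, 1), (31, 59, 2), (34, 66, 9), (38, 69, 21), (42, 60, 1)]
train_labels = [1, 1, 2, 2, 1]

print(knn_predict(train_points, train_labels, query=(33, 65, 8), k=3))
```

With k=3 the single closest point (label 2) is outvoted by the next two neighbours (both label 1), which shows how the choice of k can change the prediction.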

## Keywords

* KNN
* Classification
* Training set
* Testing set
* Haberman Survival dataset

## Expected time to complete the experiment: 30 mins

## Setup Steps

In [0]:
#@title Please enter your registration id to start: (e.g. P181900101) { run: "auto", display-mode: "form" }
Id = "P181902118" #@param {type:"string"}


In [0]:
#@title Please enter your password (normally your phone number) to continue: { run: "auto", display-mode: "form" }
password = "8860303743" #@param {type:"string"}


In [0]:
#@title Run this cell to complete the setup for this Notebook
from IPython import get_ipython

ipython = get_ipython()
  
notebook="BLR_M0W1_Knn_Haberman_Survival" #name of the notebook
def setup():
#  ipython.magic("sx pip3 install torch") 
    ipython.magic("sx wget https://cdn.talentsprint.com/aiml/Experiment_related_data/Haberman_Survival_DataSet.csv")
    from IPython.display import HTML, display
    display(HTML('<script src="https://dashboard.talentsprint.com/aiml/record_ip.html?traineeId={0}&recordId={1}"></script>'.format(getId(),submission_id)))
    print("Setup completed successfully")
    return

def submit_notebook():
    
    ipython.magic("notebook -e "+ notebook + ".ipynb")
    
    import requests, json, base64, datetime

    url = "https://dashboard.talentsprint.com/xp/app/save_notebook_attempts"
    if not submission_id:
      data = {"id" : getId(), "notebook" : notebook, "mobile" : getPassword()}
      r = requests.post(url, data = data)
      r = json.loads(r.text)

      if r["status"] == "Success":
          return r["record_id"]
      elif "err" in r:        
        print(r["err"])
        return None        
      else:
        print ("Something is wrong, the notebook will not be submitted for grading")
        return None

    elif getAnswer() and getComplexity() and getAdditional() and getConcepts():
      f = open(notebook + ".ipynb", "rb")
      file_hash = base64.b64encode(f.read())

      data = {"complexity" : Complexity, "additional" :Additional, 
              "concepts" : Concepts, "record_id" : submission_id, 
              "answer" : Answer, "id" : Id, "file_hash" : file_hash,
              "notebook" : notebook}

      r = requests.post(url, data = data)
      r = json.loads(r.text)
      print("Your submission is successful.")
      print("Ref Id:", submission_id)
      print("Date of submission: ", r["date"])
      print("Time of submission: ", r["time"])
      print("View your submissions: https://iiith-aiml.talentsprint.com/notebook_submissions")
      print("For any queries/discrepancies, please connect with mentors through the chat icon in LMS dashboard.")
      return submission_id
    else:
      return submission_id
    

def getAdditional():
  try:
    if Additional: return Additional      
    else: raise NameError('')
  except NameError:
    print ("Please answer Additional Question")
    return None

def getComplexity():
  try:
    return Complexity
  except NameError:
    print ("Please answer Complexity Question")
    return None
  
def getConcepts():
  try:
    return Concepts
  except NameError:
    print ("Please answer Concepts Question")
    return None

def getAnswer():
  try:
    return Answer
  except NameError:
    print ("Please answer Question")
    return None

def getId():
  try: 
    return Id if Id else None
  except NameError:
    return None

def getPassword():
  try:
    return password if password else None
  except NameError:
    return None

submission_id = None
### Setup 
if getPassword() and getId():
  submission_id = submit_notebook()
  if submission_id:
    setup()
    from IPython.display import HTML
    HTML('<script src="https://dashboard.talentsprint.com/aiml/record_ip.html?traineeId={0}&recordId={1}"></script>'.format(getId(),submission_id))
  
else:
  print ("Please complete Id and Password cells before running setup")



Setup completed successfully


### Loading the necessary packages

In [0]:
import pandas as pd

### Loading the data

In [0]:
data = pd.read_csv("Haberman_Survival_DataSet.csv")

To get a sense of the data, let us print the first five entries of the dataset.

In [0]:
data.head()

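|   | Age | Year of Operation | Number of positive axillary nodes detected | Survival_status |
|---|-----|-------------------|--------------------------------------------|-----------------|
| 0 | 30  | 64                | 1                                          | 1               |
| 1 | 30  | 62                | 3                                          | 1               |
| 2 | 30  | 65                | 0                                          | 1               |
| 3 | 31  | 59                | 2                                          | 1               |
| 4 | 31  | 65                | 4                                          | 1               |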


We can observe from the data that the last column, Survival_status, holds the class labels that we need to predict.

### Splitting the dataset into train and test sets using train_test_split() from the sklearn package

To know more about train_test_split(), refer to the link below:

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [0]:
from sklearn.model_selection import train_test_split

Here we split the data into train and test sets in a roughly 67 : 33 ratio (test_size=0.33).

In [0]:
# Features: the first three columns; labels: the last column (Survival_status)
X_train, X_test, y_train, y_test = train_test_split(data.values[:,:3], data.values[:,3], test_size=0.33)


In [0]:
X_train.shape

(205, 3)
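
The 205 training rows reported above follow from how a fractional `test_size` is turned into a row count. The sketch below reproduces that arithmetic for 306 rows; it assumes that the test count is rounded up and the training set receives the remainder, which matches the shape observed here. The helper name `split_sizes` is hypothetical and is not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    # Assumption: the test count is rounded up, the training set gets the rest
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_train, n_test = split_sizes(306, 0.33)
print(n_train, n_test)           # 205 101
print(round(n_train / 306, 2))   # 0.67
```

Because the rows are shuffled randomly, which samples land in each set changes from run to run unless a fixed `random_state` is passed to train_test_split().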

### Applying KNN

In [0]:
from sklearn.neighbors import KNeighborsClassifier
# Classify by majority vote among the 3 nearest neighbours, using Euclidean distance
neigh = KNeighborsClassifier(n_neighbors=3,metric='euclidean')

In [0]:
neigh.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
           metric_params=None, n_jobs=None, n_neighbors=3, p=2,
           weights='uniform')

In [0]:
predicted_labels = neigh.predict(X_test)

In [0]:
print(predicted_labels)
print(neigh.score(X_test,y_test))

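# Recompute the accuracy by hand as a check on neigh.score():
# mark each prediction as correct (True) or incorrect (False)
result = []
for i in range(len(predicted_labels)):
  if predicted_labels[i] == y_test[i]:
    result.append(True)
  else:
    result.append(False)

# Fraction of correct predictions
result.count(True) / len(y_test)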

[1 1 2 2 2 1 1 2 1 1 1 2 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 2 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 1 1 2 1 1 1
 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2]
0.7524752475247525


0.7524752475247525
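
The two identical values above show that the manual loop and `neigh.score()` compute the same quantity: the fraction of test samples whose predicted label equals the true label. A compact way to write this, together with a simple point of comparison, is sketched below. The labels are hypothetical and are not taken from this notebook's run, and the function names are made up for the example. The majority baseline is worth checking here because most of the predicted labels printed above are 1.

```python
from collections import Counter

def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

def majority_baseline(actual):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(actual).most_common(1)[0][1]
    return most_common_count / len(actual)

# Hypothetical labels, not taken from the notebook's run
actual    = [1, 1, 2, 1, 2, 1, 1, 2]
predicted = [1, 1, 1, 1, 2, 1, 2, 2]

print(accuracy(predicted, actual))    # 0.75
print(majority_baseline(actual))      # 0.625
```

A classifier is only informative to the extent that its accuracy exceeds the majority baseline on the same test labels.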

### Please answer the questions below to complete the experiment:

In [0]:
#@title In a classification task using KNN, where each nearest neighbour votes equally, the predicted label is (You should look up the definitions and think clearly. Assume that the labels are all integers.) { run: "auto", form-width: "500px", display-mode: "form" }
Answer = "mode of nearest neighbours labels" #@param ["mode of nearest neighbours labels","median of nearest neighbours labels","mean of nearest neighbours labels"]


In [0]:
#@title How was the experiment? { run: "auto", form-width: "500px", display-mode: "form" }
Complexity = "Good and Challenging for me" #@param ["Too Simple, I am wasting time", "Good, But Not Challenging for me", "Good and Challenging for me", "Was Tough, but I did it", "Too Difficult for me"]


In [0]:
#@title If it was very easy, what more you would have liked to have been added? If it was very difficult, what would you have liked to have been removed? { run: "auto", display-mode: "form" }
Additional = "good" #@param {type:"string"}

In [0]:
#@title Can you identify the concepts from the lecture which this experiment covered? { run: "auto", vertical-output: true, display-mode: "form" }
Concepts = "Yes" #@param ["Yes", "No"]

In [0]:
#@title Run this cell to submit your notebook for grading { vertical-output: true }
try:
  if submission_id:
      return_id = submit_notebook()
      if return_id : submission_id =return_id
  else:
      print("Please complete the setup first.")
except NameError:
  print ("Please complete the setup first.")

Your submission is successful.
Ref Id: 1671
Date of submission:  15 Mar 2019
Time of submission:  14:00:15
View your submissions: https://iiith-aiml.talentsprint.com/notebook_submissions
For any queries/discrepancies, please connect with mentors through the chat icon in LMS dashboard.
