[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ali-rivera/WiDS24_Coding101/blob/main/2Complete_Notebook.ipynb)

# Coding 101
### WiDS 2024 - UVA

**Author: Ali Rivera** - ali.rivera@virginia.edu


In this notebook, we will use Python to solve a data science problem: can we use machine learning to predict whether breast cells are cancerous, using the cell nuclei measurements in our dataset?

The purpose of this workshop is to get some hands-on experience using Python and feel comfortable enough to explore bigger topics on your own. This is meant to be a starting place/launching pad, not an all-encompassing solution to the problem.

In this notebook, we will:
1. Explore Jupyter notebooks running Python through Google Colab
2. Explore some Python basics through a data cleaning/preprocessing application
3. Implement a common classification model (kNN) and evaluate model accuracy on test data

## Anatomy of a Jupyter Notebook

In [1]:
#this is a code cell!

# you can add comments using the # to indicate plain text
# or, you can add code for the computer to run. Let's run our first piece of code...

print("hello world!")

hello world!


This is a markdown cell! 

You can add **rich text** in these cells.

In a markdown cell, you can use formatting like:
- *italics*, 
- **bold**, 
- `code`, 
- equations $e=mc^2$ and more. 

Double click on this cell to see how this formatting is applied.

Here is a [markdown cheatsheet](https://images.datacamp.com/image/upload/v1697797990/Marketing/Blog/Markdown_Cheat_Sheet.pdf) that you can use if you're interested in getting started using markdown!

## Python basics

In [2]:
# variables and data types

str1 = "This is a string!" 
int1 = 2 
float1 = 3.14 
bool1 = True 
list1 = ["string", 0, 1.23, False] 

In [3]:
# let's use the print function we used earlier to see what is saved in our variables!

print(str1)
print(int1)
print(float1) 
print(bool1)
print(list1)

This is a string!
2
3.14
True
['string', 0, 1.23, False]


In [4]:
# data types have functions that are specific to them
# notice that we put our variable name BEFORE the function here...

str1.upper()

'THIS IS A STRING!'

In [22]:
# .upper() is a string function, so we can't use it on other data types, like ints or floats

int1.upper() # throws an error

AttributeError: 'int' object has no attribute 'upper'
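
If you are not sure which data type a variable holds, Python's built-in `type()` and `isinstance()` functions can tell you. The snippet below is a minimal sketch; it redefines two of the variables from above so that it runs without the earlier cells.

```python
# Redefine two of the variables from above so this snippet is self-contained
str1 = "This is a string!"
int1 = 2

# type() reports the data type of whatever is stored in a variable
print(type(str1))  # <class 'str'>
print(type(int1))  # <class 'int'>

# isinstance() returns True/False, so we can check before calling .upper()
if isinstance(int1, str):
    result = int1.upper()
else:
    # convert the int to a string first, then use the string function
    result = str(int1).upper()

print(result)  # 2
```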

#### Packages 

Packages are extensions that you can add to your Python environment that help you perform specific tasks. 

Packages contain code to perform tasks that you would otherwise have to write yourself. For example, the `math` package contains functions for math operations beyond the basic ones built into Python.

In data science, we frequently use packages for things like data processing (`pandas`), linear algebra (`NumPy`), machine learning (`scikit-learn`), and much, much more! You can also create your own packages and share them with others!
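
As a small illustration of the `math` package mentioned above, here is a sketch that imports it and calls a few of its functions. The `radius` value is made up for the example.

```python
import math  # part of Python's standard library, so no install is needed

radius = 3.0  # hypothetical example value

area = math.pi * radius ** 2   # pi is a constant stored in the package
root = math.sqrt(16)           # square root
rounded_up = math.ceil(area)   # round up to the nearest whole number

print(area)
print(root)        # 4.0
print(rounded_up)  # 29
```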

In [8]:
#import packages 

import pandas as pd #for data processing

from sklearn.preprocessing import MinMaxScaler # to scale our data
from sklearn.model_selection import train_test_split # to split the train/test sets
from sklearn.neighbors import KNeighborsClassifier # our ML model
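
`MinMaxScaler` is imported here "to scale our data". As a rough idea of what min-max scaling computes, the sketch below applies the formula (x - min) / (max - min) to one made-up column in plain Python. With its default settings, `MinMaxScaler` applies this same formula to each column separately.

```python
# Hypothetical measurements for one column (not taken from the workshop data)
values = [10.0, 20.0, 15.0, 30.0]

lowest = min(values)
highest = max(values)

# min-max scaling: (x - min) / (max - min) maps every value into the range 0 to 1
scaled = [(x - lowest) / (highest - lowest) for x in values]

print(scaled)  # [0.0, 0.5, 0.25, 1.0]
```

Scaling can matter for distance-based models like kNN, because a column with large numbers (such as `area_mean`) would otherwise dominate the distance calculation.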

## Read in Data

In [9]:
# read in data

url = "https://raw.githubusercontent.com/ali-rivera/WiDS24_Coding101/main/breast-cancer-wisconsin-data.csv"
df = pd.read_csv(url)

# visualize the first 5 rows of the data
df.head()

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


[`.head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) is a function that shows the first *n* rows in a dataframe. If no `n` is specified, the function uses the default value of 5. 
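
To see the effect of `n`, here is a minimal sketch on a small, made-up dataframe (the `toy` dataframe and its columns are hypothetical, not the workshop dataset).

```python
import pandas as pd

# A small, made-up dataframe (not the workshop dataset)
toy = pd.DataFrame({
    "letter": ["a", "b", "c", "d", "e", "f", "g"],
    "number": [1, 2, 3, 4, 5, 6, 7],
})

first_five = toy.head()   # no n given, so the default of 5 is used
first_two = toy.head(2)   # n=2 returns only the first 2 rows

print(first_five)
print(first_two)
```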

## Look at data

We'll also want to understand how many observations are labeled as benign (B) or malignant (M). A considerable difference between these two counts could affect our model's performance.

For this, we can use the [`.value_counts()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html) function!

In [21]:
# check our outcome distribution
df['diagnosis'].value_counts()

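KeyError: 'diagnosis'

This `KeyError` appears because the cell was last run (as `In [21]`) after the `diagnosis` column had already been dropped in the preprocessing step below. Run this cell before dropping the column to see the benign/malignant counts.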
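
As a sketch of what `.value_counts()` returns, the example below uses a handful of made-up labels rather than the workshop data. The `normalize=True` option reports shares instead of raw counts, which can make an imbalance easier to spot.

```python
import pandas as pd

# Made-up labels for illustration only (not the workshop data)
labels = pd.Series(["B", "B", "M", "B", "M", "B"], name="diagnosis")

counts = labels.value_counts()                # number of rows per label
shares = labels.value_counts(normalize=True)  # fraction of rows per label

print(counts)
print(shares)
```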

## Preprocessing

Before we can put our data into any models, there are a few things we need to do to make it fit nicely.

In the next few cells, we will:
1. Change our outcome variable to a `boolean` (true/false)
2. Drop any duplicative/unnecessary columns for our model
3. Split our data into a training set and a testing set

#### 1. Change outcome variable to a boolean

In [13]:
# turn diagnosis into a boolean (true/false) 
df['cancer'] = df['diagnosis'].replace(['B', 'M'], [False, True])

# Check the new column we added
df.head()

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | cancer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | True |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | True |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | True |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | True |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | True |


#### 2. Drop any duplicative/unnecessary columns

In [14]:
# drop diagnosis column (now duplicated in cancer) and id 

df.drop(columns = ['diagnosis', 'id'], inplace = True)

#### 3. Split data into training/test sets


In [15]:
# split data into training and testing set
## The X_ sets are the predictors, the y_ sets are the outcomes

X_data = df.drop(columns='cancer')
y_data = df['cancer']

X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.33, random_state=42)
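
To build some intuition for what a train/test split does, here is a simplified sketch in plain Python: shuffle the rows with a fixed seed, then slice off a share for testing. The `simple_split` helper is hypothetical and is not how scikit-learn implements `train_test_split` (for example, it rounds the test size differently and returns different rows), but the idea is the same.

```python
import random

def simple_split(rows, test_size=0.33, seed=42):
    """Shuffle rows reproducibly, then slice off a test set (illustrative helper)."""
    rng = random.Random(seed)   # fixed seed -> same shuffle every run
    shuffled = rows[:]          # copy so the original list is untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

rows = list(range(100))         # stand-in for 100 observations
train, test = simple_split(rows)

print(len(train), len(test))    # 67 33
```

Fixing the seed plays the same role as `random_state=42` above: the split comes out the same every time the cell is run.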

## Model building

### kNN (k Nearest Neighbor)

<img src="https://miro.medium.com/v2/resize:fit:810/0*rc5_e6-6AHzqppcr" width = 400>

[image source](https://medium.com/analytics-vidhya/k-nearest-neighbor-the-maths-behind-it-how-it-works-and-an-example-f1de1208546c)


kNN classifies a new observation by finding the k nearest neighbors to the point in n-dimensional space. k is the number of neighbors, and n is the number of features in your data. 

kNN uses the majority vote of the k neighbors to classify the point. The model can be tuned by choosing the k value that produces the best fit. A k that is too small tends to overfit the model, making it too specific to the training data and not generalizable to new data. A k that is too large tends to underfit the model, so it may not capture the important trends in the data. Overfitting and underfitting both result in poor model performance.
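
The description above can be sketched in a few lines of plain Python. This is a minimal illustration of the idea (distance, k nearest, majority vote), not how scikit-learn implements it. The `knn_predict` helper and the data points are made up for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k=3):
    """Classify new_point by majority vote of its k nearest training points."""
    # Euclidean distance from new_point to every training point
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # keep the k closest neighbors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # majority vote among their labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up 2-feature data (n = 2): two clusters labeled False / True
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (6.5, 7.0), (7.0, 6.0)]
labels = [False, False, False, True, True, True]

print(knn_predict(points, labels, (1.8, 1.5), k=3))  # False
print(knn_predict(points, labels, (6.2, 6.4), k=3))  # True
```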

For more information on kNN, check out this [IBM page](https://www.ibm.com/topics/knn#:~:text=The%20k%2Dnearest%20neighbors%20(KNN)%20algorithm%20is%20a%20non,used%20in%20machine%20learning%20today.).

--------------------------------

We'll be using the [`KNeighborsClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier) model from `scikit-learn`. 

After we train the model, we can use the [`.score()`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.score) function to make predictions on our test data and calculate the accuracy score.
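
For a classifier, the accuracy score is simply the number of correct predictions divided by the total number of predictions. Here is a minimal sketch of that calculation, using made-up labels and predictions:

```python
# Made-up true labels and model predictions for six observations
y_true = [True, False, True, True, False, False]
y_pred = [True, False, False, True, False, True]

# count positions where the prediction equals the true label
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

print(correct, "correct out of", len(y_true))  # 4 correct out of 6
print(accuracy)
```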

In [19]:
kNN = KNeighborsClassifier()  # create model with the default k value (n_neighbors=5)
kNN.fit(X_train, y_train)  # train

# get the accuracy score (correct predictions / total predictions)
kNN.score(X_test, y_test)


0.9521276595744681

#### Let's refine the model a bit...

We need to choose the best k value for our data. We can test a few and see which gives us the best result.

To do this, we'll create a `for` loop.

In [20]:
klist = [3, 5, 7, 9, 11, 13, 15]  # a list of k values to test

# loop through each value in our list
# for each k value, train the model, test the model, and print the accuracy
for k in klist:
    kNN = KNeighborsClassifier(n_neighbors=k)  # create model with the current k value
    kNN.fit(X_train, y_train)  # train

    # print the current k value and its accuracy score (correct predictions / total predictions)
    print("k=", k, "\t accuracy=", kNN.score(X_test, y_test))



k= 3 	 accuracy= 0.9414893617021277
k= 5 	 accuracy= 0.9521276595744681
k= 7 	 accuracy= 0.973404255319149
k= 9 	 accuracy= 0.973404255319149
k= 11 	 accuracy= 0.9787234042553191
k= 13 	 accuracy= 0.9627659574468085
k= 15 	 accuracy= 0.9680851063829787


Looks like the k that gives us the best accuracy is k = 11. 
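
Rather than reading the best k off the printed output by eye, we could store the scores and let Python find the maximum. A minimal sketch, with the accuracy values copied from the loop output above into a dictionary (the `scores` and `best_k` names are just for illustration):

```python
# Accuracy scores copied from the loop output above
scores = {
    3: 0.9414893617021277,
    5: 0.9521276595744681,
    7: 0.973404255319149,
    9: 0.973404255319149,
    11: 0.9787234042553191,
    13: 0.9627659574468085,
    15: 0.9680851063829787,
}

# max() with key=scores.get returns the k whose accuracy is largest
best_k = max(scores, key=scores.get)

print(best_k, scores[best_k])  # 11 0.9787234042553191
```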

From here, we could create a final model using that k, and use that to predict whether cells are cancerous or not for new observations.
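
Here is one way that final step might look. This sketch makes an assumption: so that it runs by itself, it loads scikit-learn's bundled copy of the Wisconsin breast cancer data instead of the workshop CSV, and that copy encodes the outcome as 0 = malignant, 1 = benign rather than as a `cancer` boolean. The first three test rows stand in for "new" observations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled copy of the Wisconsin data stands in for
# the workshop CSV. Its target uses 0 = malignant, 1 = benign.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

final_model = KNeighborsClassifier(n_neighbors=11)  # the k chosen above
final_model.fit(X_train, y_train)

# Treat the first three test rows as "new" observations
new_observations = X_test[:3]
predictions = final_model.predict(new_observations)          # predicted class
probabilities = final_model.predict_proba(new_observations)  # share of neighbor votes

print(predictions)
print(probabilities)
```

`.predict()` returns one class per row, and `.predict_proba()` shows what fraction of the 11 neighbors voted for each class, which gives a rough sense of how confident each prediction is.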

### Quick Recap!

In this notebook, we...
- learned the parts of a Jupyter Notebook
- ran Python code using Google Colab
- learned 5 foundational Python data types
- learned what functions are and used 12 functions in our code
- found, read, and used documentation as a part of our coding practice
- trained and tested a kNN model to make predictions about cells being cancerous
- made a for loop to find the best hyperparameter for our model

## Taking things a step further...

If you're looking to continue practicing Python, here are some great resources:
- [Beginner Practice](https://www.w3schools.com/python/exercise.asp) : quick coding prompts for practice with data types and associated functions, with feedback and solutions!
- [Python Cheatsheet](https://images.datacamp.com/image/upload/v1694526244/Marketing/Blog/Python_Basics_Cheat_Sheet-updated.pdf) : quick reference info on data types, operations, common functions, etc.
- [101 Pandas Practice Problems](https://www.machinelearningplus.com/python/101-pandas-exercises-python/) : coding prompts for practice using the `pandas` package, with solutions!
- [Loops Practice Problems](https://pynative.com/python-if-else-and-for-loop-exercise-with-solutions/) : practice writing loops
- [NumPy](https://www.w3resource.com/python-exercises/numpy/index.php) : practice with all things NumPy (another common package!), including data types, syntax, arrays, math operations, and more advanced topics if you're looking for a challenge!


*Remember: an essential part of coding is making mistakes and getting yourself unstuck! Oftentimes the best understanding comes from the biggest struggle. If you're interested in taking another step, give yourself the time and space to make mistakes, troubleshoot, and use documentation to learn along the way. Don't expect everything to work on the first try.*