# Welcome to Thunderbird Hacks! 
## First thing to know  
This notebook has lots of cells. Each cell is either text or code. You can see which cell you are on by the black vertical bar to the left. To select a cell, single click on it. 
## Double click on this cell
This will allow you to edit the text. You'll notice that the content doesn't change but there's some extra stuff (like #s) in the cell. That stuff helps format the text so it is easier to read.
## How to run a cell
You can run a cell by clicking the single triangle, or play button, in the menu above. The keyboard shortcut for this is `ctrl`+`enter`. For a text cell, this will display the formatted text, and for a code cell, it will run the code. Try this now if you haven't already.

## This is the next cell.
This is a text cell, just like the one above. Single click on it to select it, double click to edit it, and `ctrl`+`enter` to run it.

## The cell below is a code cell
Single clicking on it allows you to select and edit it. 

In [None]:
# This is a code cell
# <- that hash means this is a comment
# When you run your code cells, it ignores any comments
# Comments are useful for telling you what the code does 

#Print out a greeting
print("Hello World!")

#try running this cell to see what happens

## Congratulations! You ran your first Python code
You should see `Hello World!` below the cell above, and none of the comments. 

So, let's break down what you just ran:  
`print()` is a function - it does something to whatever is inside the parentheses.  

`"Hello World!"` - this is called a string. Anything in between quotation marks is a string.  

Test it out! Try removing the quotation marks around Hello World in the code cell above, run it, and see what happens.

You may see a red box around some text. This is an error - you broke the code!  
You can fix it by putting the quotation marks back. You can also try editing the text - enter your name or your favorite sandwich inside the quotation marks and run the code to see what happens.

Going forward, functions will always have parentheses after them, often with parameters inside the parentheses.

## Now, let's learn about variables

Look at the code cell below. Can you guess what happens?

The comments should help, and take note of the colors of the code.

Run the cell and see if your guess was correct.

In [None]:
foo = "Hello" #set the variable foo
bar = "World" #set bar

print(foo) #print foo
print("foo") #print the text "foo", not the variable foo
print(foo + bar) #print foo and bar

Did you guess correctly? Let's run through what happened:  
1. `foo = "Hello"` sets the variable foo to be a string with the value Hello
2. `bar = "World"` sets the variable bar to be a string with the value World
3. `print(foo)` prints the *value* of the variable foo
4. `print("foo")` prints foo as a string because of the quotation marks
5. `print(foo + bar)` prints the values of both foo and bar, joined together. Notice there is no space in between the words because there were no spaces in the strings when they were defined.
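
If you want a space between the two words, one option is to add a string that contains a single space in the middle. This is a small sketch that reuses `foo` and `bar` from the cell above; the variable name `greeting` is made up for this example.

```python
foo = "Hello"
bar = "World"

# " " is a string that contains exactly one space
greeting = foo + " " + bar
print(greeting)
```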

Are you confused? Try editing the code above and test out what happens with different changes you make to the code. If you're still confused, ask for help.

Print statements are very useful for debugging code - since code is run top to bottom, they can help you figure out what is happening when you get error messages.

## What about math?

The code below shows a few basics.

In [None]:
a = 5 #sets the variable a to 5 
#notice there are not any quotation marks
#this is because we want 5 to be treated as a number

b = "5" #sets the variable b to a string
print(a) #print the value of a
print(b) # print the value of b

print("a+a=")
print(a+a) #print a+a
print("b+b=")
print(b+b) #print b+b

#Run this cell and observe what happens for each print statement
print("was this surprising?")

There are a lot of math operations that you can use, though many won't be super useful to you today. If you feel like it, you can use the cell below to test out what you've learned above.
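
Here is a short sketch of a few more operations you could try. It reuses `a` and `b` from the cell above; the variable name `total` is made up for this example. It also shows `int()`, which turns the string `"5"` into the number 5 so that `+` adds instead of joining.

```python
a = 5    # a number
b = "5"  # a string

print(a - 2)   # subtraction
print(a * 3)   # multiplication
print(a / 2)   # division, which gives a number with a decimal
print(a ** 2)  # exponent (a squared)

# int() converts the string "5" into the number 5
total = a + int(b)
print(total)
```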

In [None]:
# You can use this cell for testing concepts you learned above.

## A little bit about data types

Every variable you use in Python will have a data type. The first kind we saw were strings - the letters inside quotations. You've also seen integers - numeric values without decimal points. So what else might you see today?
 
`float` - numbers with decimals, ex. `1.5` <br> 
`list` - a sequence type, ex. `[1,2,3]`, `["cat","dog","fish"]`
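
If you are ever unsure what data type a variable has, Python's built-in `type()` function will tell you. The variable names below are made up for this sketch.

```python
name = "cat"                   # string
count = 3                      # integer
price = 1.5                    # float
pets = ["cat", "dog", "fish"]  # list

# type() reports the data type of whatever is inside the parentheses
print(type(name))
print(type(count))
print(type(price))
print(type(pets))
```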

### Arrays will be very important

An array is like a list: it can hold multiple values at once. The easiest way to start with arrays is with lists. So, let's define a list **ar** as:  

`ar = ["cat", "dog", "fish"]`  

If we want to get the first value, `cat`, out of the array, we can just use this:  

`ar[0]`

We have a zero in the square brackets because the *index* for arrays and lists in Python starts at zero.  

We can get the length of a list using:  

`len(ar)`


In [None]:
#lets test this out with some code
ar = ["cat", "dog", "fish"] #setting the array ar
print("The array ar: ")
print(ar)

print("The first value of ar: "+ar[0])
print("The second value of ar: "+ar[1])
print("The last value of ar: "+ar[2])


print("The length of ar: ")
print(len(ar))

## Numpy arrays

A numpy array is a more efficient way of storing arrays: it is more compact in memory and faster to work with than a list. A lot of functions won't take lists, so you have to use numpy arrays.

If you really want to learn all about them, I suggest you read the [numpy documentation](https://numpy.org/doc/stable/user/absolute_beginners.html#whats-the-difference-between-a-python-list-and-a-numpy-array). However, in the interest of time, the explanation in this cell should be enough for today.

Look at the code below, then run it to learn how to create numpy arrays.


In [None]:
a_list = [1,2,3,4] #this is a list
b_list = [5,6,7,8] #this is another list
c_list = [a_list, b_list] #this is a list of lists, or nested list

print("A nested list: {}".format(c_list)) #print out c_list

import numpy as np #import numpy so we can use it, you'll learn more about imports later

#now, we'll turn our lists into numpy arrays using the function np.asarray()
np_alist = np.asarray(a_list) #np.asarray() turns a_list into a numpy array
np_clist = np.asarray(c_list) #turn c_list into a numpy array

print("a_list as a numpy array: {}".format(np_alist)) #print out np_alist

print("a nested list as a numpy array:")
print(np_clist) #print out np_clist


Notice that, when printed, lists have commas separating each value and numpy arrays do not.

In [None]:
# You can also find the shape of an array
shape = np.shape(c_list)
print(shape)

Shape errors are very common errors to run into in machine learning. Understanding all of them takes several classes in linear algebra, so if you do run into them and get stuck, the best course of action is to ask for help.
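
To see where the numbers in a shape come from, you can count them by hand. This sketch uses plain Python only and assumes every inner list has the same length; the names `rows`, `cols`, and `shape_by_hand` are made up for illustration.

```python
c_list = [[1, 2, 3, 4], [5, 6, 7, 8]]  # the nested list from earlier

rows = len(c_list)     # how many inner lists there are
cols = len(c_list[0])  # how many values are in the first inner list

shape_by_hand = (rows, cols)
print(shape_by_hand)
```

When two arrays are supposed to line up (for example, features and labels), comparing their shapes like this is often the first thing to check.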

## An introduction to libraries

Earlier, you may have seen:
`import numpy as np`

This imports, or loads, the numpy library so that you can use all of the functions it contains. Notice the `as np`: this is a nickname we can use for the library.

We used the function `np.asarray()` earlier. The `np` tells the computer to look in the numpy library to find the function `asarray()`.
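
The same pattern works for any library. As a small sketch, here it is with `math`, which comes with Python. The nickname `m` and the variable `root` are chosen just for this example.

```python
import math as m  # "m" is a nickname picked for this example

# m.sqrt tells Python to look inside the math library for the function sqrt()
root = m.sqrt(16)
print(root)
```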

Run the cell below and you're all set to go with the libraries you will need today!

In [None]:
#imports
import numpy as np # linear algebra library
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os #useful library for dealing with filepaths
#sklearn has everything else you will need
from sklearn import model_selection
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder

# Let's get started with the task!

The cell below tells you all the files you can use during this competition. Notice that we are using the `os` library and the `listdir()` function within that library. We are passing a string (the file path) into the function.

Run the cell. You should see `['train.csv', 'test.csv']` as output. If you do not see anything, or get an error, click on `Run` in the menu above, and select factory reset. You will need to re-run the import cell above first, but it should fix your error when you rerun the cell below. If you are still having issues, ask for help.

In [None]:
path = "/kaggle/input/abq-crime-dataset"
os.listdir(path)

## Load the training data

Running the cell below will load the training data into a pandas dataframe called `df_train`.


In [None]:
#load train.csv
train_path = os.path.join(path, "train.csv") #join the folder path and file name
df_train = pd.read_csv(train_path) #read the csv from file into a pandas dataframe
df_train #output the dataframe

Now, we want to prepare our dataset for training. We will do this in several steps:
- Remove the ID column (`OBJECTID`), as it is a random value and not something that will help us learn what features impact crime type. The code below also drops the `ReportDateTime` column.
- Split the data into an `X` and `y` dataset, with `X` being the location and call features and `y` being the incident type
- One hot encode the alphanumeric data. If you looked at the raw data, you'll see that there are a lot of words, or categories. One-hot encoding is a way to turn categories into numbers so that computers can understand them. For example, if you have three colors: (red, green, and blue), one-hot encoding would represent red as (1, 0, 0), green as (0, 1, 0), and blue as (0, 0, 1). This way, each color gets its own unique code without accidentally ranking them.
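
To make the color example concrete, here is a hand-written sketch of one-hot encoding in plain Python. The helper `one_hot` is made up for illustration and is not how `OneHotEncoder` is implemented; it only shows the idea.

```python
colors = ["red", "green", "blue"]  # the known categories

def one_hot(value, categories):
    """Return a list with a 1 in the position of `value` and 0 everywhere else."""
    return [1 if value == category else 0 for category in categories]

for color in colors:
    print(color, one_hot(color, colors))

# a value that is not one of the known categories becomes all zeros
print("purple", one_hot("purple", colors))
```

The all-zeros result for an unknown value is roughly what `handle_unknown='ignore'` asks `OneHotEncoder` to do in the cells below.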

In [None]:
#remove ID Columns
df_train.drop(["OBJECTID", "ReportDateTime"], axis=1, inplace=True) 
#axis=1 means we're dropping a column and inplace means df_train will have that dropped column


#Split data into X and y
X = df_train.drop("IncidentType", axis=1)
y = df_train["IncidentType"]

If we look at a snippet of the data (just by running the cell below), we can see which columns will likely need to be one-hot encoded.

In [None]:
X

We will encode `DayOfWeek`, `PoliceAreaCommand`, and `NeighborhoodAssociation`.

In [None]:
categorical_cols = X[['DayOfWeek', 'PoliceAreaCommand', 'NeighborhoodAssociation']]
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(categorical_cols)
X_one_hot = pd.DataFrame(enc.transform(categorical_cols).toarray(), columns=enc.get_feature_names_out())

#Alternative way to one-hot encode (not used here):
#X_one_hot = pd.get_dummies(X[['DayOfWeek', 'PoliceAreaCommand', 'NeighborhoodAssociation']]) 

X = pd.concat([X['Time'], X_one_hot], axis=1)

### Why do we split data into X and y variables?

In math or science class, you may have learned about independent and dependent variables. Our independent variables are in the X dataset (or what would go on the x-axis), which represent the features or characteristics of the data. The dependent variable is in the y dataset (or what would go on the y-axis), which represents the target or class label we want to predict. We know that our dependent variable depends on our independent variables in some way. 

Oftentimes, information in the X dataset will give us hints as to what the corresponding label in the y dataset is. For example, if we're trying to figure out if an animal is a cat or a dog, having information on the animal's weight, height, fur color and more can be very useful. Then, once we have those values, we can guess if the animal is a cat or dog with confidence!

Now, it's pretty easy for humans to look at an animal and figure out if it's a cat or dog without considering those factors. We just use our eyes! But for a machine learning model, it's not so easy. It has to learn the relationship between the information in the X dataset and the corresponding label in the y dataset. Then, it can make predictions on new data.

In the graph below, let's pretend all green dots are cats and all red dots are dogs. See how they're clustered together? The goal of the machine learning model is to be able to properly categorize a new purple dot as cat or dog given what it's learned.

![graph](https://upload.wikimedia.org/wikipedia/commons/2/2d/K_nearest_neighbour_explain.png)

## Back to coding

Now, we'll split the data up even further, into a training set and a validation set. The training set is still what you will train your model on, and the validation set is what you will test your model on, so you can get an idea of how it will perform before you submit it.
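
To get a feel for what a split does, here is a rough sketch in plain Python: shuffle the rows, then set 30% aside for validation. It is only an illustration, not how scikit-learn implements `train_test_split`, so the exact rows it picks will differ. All names in it are made up for this example.

```python
import random

rows = list(range(10))        # stand-in for 10 rows of data

rng = random.Random(6)        # fixed seed so the shuffle is repeatable
rng.shuffle(rows)             # mix up the order of the rows

n_val = int(len(rows) * 0.3)  # 30% of the rows go to validation
val_rows = rows[:n_val]
train_rows = rows[n_val:]

print("train:", train_rows)
print("validation:", val_rows)
```

Every row ends up in exactly one of the two sets, so the model is never validated on a row it was trained on.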

In [None]:
x_train, x_val, y_train, y_val = model_selection.train_test_split(X, y, test_size=0.3, random_state=6)

Let's train our first model!


We are using a [k-nearest neighbors](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#kneighborsclassifier) algorithm to start you off. 

K-nearest neighbors (KNN) is a simple way to classify things based on how similar they are to other things. Imagine you have a dataset with different features (like height, weight, and age) and a "y" column that tells you what category each item belongs to (like "cat" or "dog").

Here's how KNN works:

1. Choose a number - This is how many neighbors you want to look at. For example, if you choose 3, you'll look at the three closest items to the one you're trying to classify.

2. Find the neighbors - For a new item (like a new animal), you calculate how close it is to all the items in your dataset using their features. You can use a distance formula, like the Euclidean distance, to find out which items are closest.

3. Vote for a category - Once you have the closest items, you see which category they belong to. The category that appears the most among those neighbors is the one you assign to your new item.

So, if most of the three closest animals are cats, you would classify the new animal as a cat! KNN is like asking your friends for their opinions to decide what you should do.
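
The three steps above can be sketched in a few lines of plain Python. This is a simplified illustration with made-up animals (weight in kg, height in cm), not the implementation used by scikit-learn; the function name `knn_predict` is hypothetical.

```python
import math
from collections import Counter

# made-up animals: (weight in kg, height in cm) and their labels
features = [(4.0, 25.0), (5.0, 23.0), (3.5, 24.0),
            (30.0, 60.0), (25.0, 55.0), (35.0, 65.0)]
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]

def knn_predict(new_item, features, labels, k=3):
    # step 2: Euclidean distance from the new item to every known item
    distances = [(math.dist(new_item, item), label)
                 for item, label in zip(features, labels)]
    # keep only the k closest items (step 1 chose k)
    nearest = sorted(distances)[:k]
    # step 3: the label that appears most often among the neighbors wins
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

prediction = knn_predict((4.5, 26.0), features, labels, k=3)
print(prediction)
```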

The function `KNeighborsClassifier` does these three steps for you. To learn more about the algorithm, the [Wikipedia article](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) is pretty good, or ask an AI chatbot!

In [None]:
model = KNeighborsClassifier(n_neighbors=5) #this is the algorithm, we're specifying 5 neighbors
trained_model = model.fit(x_train, y_train) #this fits/trains the algorithm to the data

Congratulations, you've trained your first machine learning model! Now, let's see how your model performs:

In [None]:
train_predict = trained_model.predict(x_train) #make predictions for the training set
val_predict = trained_model.predict(x_val) #make predictions for the validation set

#now we'll compute the accuracy for the train and validation sets
train_acc = accuracy_score(y_train, train_predict)
val_acc = accuracy_score(y_val, val_predict)

#and print out the results:
print("The accuracy for your training set is: ")
print(train_acc)
print("The accuracy for your validation set is: ")
print(val_acc)

The goal is to have the accuracy scores as high as possible for both your train and validation sets. If your training accuracy is a lot higher than your validation accuracy, then you are likely [overfitting to your data](https://www.ibm.com/topics/overfitting).

Accuracy score is a way to measure how well your machine learning model is performing. It tells us the percentage of correct predictions out of the total predictions made. To compute it, we compare the model's predictions to the actual outcomes, count how many predictions were correct, and then divide that number by the total number of predictions. For example, if our model made 10 predictions and got 6 of them right, the accuracy would be 60%. In Python, we can use the `accuracy_score` function from the `sklearn.metrics` module to easily calculate this.


$\text{Accuracy Score} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$
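
The formula can also be worked out by hand. This sketch uses made-up labels where 6 of the 10 predictions are right, matching the example above; the variable names are chosen for illustration.

```python
actual    = ["cat", "dog", "cat", "cat", "dog", "dog", "cat", "dog", "cat", "dog"]
predicted = ["cat", "dog", "dog", "cat", "dog", "cat", "cat", "cat", "cat", "cat"]

# count the positions where the prediction matches the actual label
correct = sum(1 for a, p in zip(actual, predicted) if a == p)

# divide by the total number of predictions
accuracy = correct / len(actual)

print(correct)
print(accuracy)
```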

The validation set is just a small dataset, however. We like to see performance on a small dataset before moving to bigger ones, so we can catch any mistakes or performance issues early. Now that we like our validation accuracy, we can move on to a bigger test set! This test set is in `test.csv`.


Let's make the prediction on test.csv and make a submission to the leaderboard!

In [None]:
####PREPARE SUBMISSION
#load and prepare the file
test_path = os.path.join(path, "test.csv")
df_test = pd.read_csv(test_path)
df_test.drop('ReportDateTime', axis=1, inplace=True)
Xt_one_hot = pd.DataFrame(enc.transform(df_test[['DayOfWeek', 'PoliceAreaCommand', 'NeighborhoodAssociation']]).toarray(), columns=enc.get_feature_names_out())
X_test = pd.concat([df_test['Time'], Xt_one_hot], axis=1)
#make the predictions
final_preds = trained_model.predict(X_test)

#check to make sure the crime incident values look sensible
final_preds

In [None]:
####PREPARE SUBMISSION
#if all above looks good, prepare the submission
my_submission = pd.DataFrame({'OBJECTID':df_test.OBJECTID , 'IncidentType': final_preds})
# you could use any filename. We choose submission here
my_submission.to_csv('submission.csv', index=False)

After running the above cell, go to `Submit to Competition` on the right and click `Submit`. Follow the instructions and it'll submit your `submission.csv` file for you.

Now, it is time to improve your model. We suggest:
* Reading the [K-Nearest Neighbor documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) and seeing if there is anything you can change that might improve your model. 
    * `n_neighbors` is the number of points closest to the point we're trying to classify. If we set this number to be too large, we may end up including points from other classes that aren't actually that "near" -- this could lead to incorrect classifications! We always try to keep `n_neighbors` to be an odd number... can you figure out why?
    * `weights` is a cool way of specifying how "important" each point is to the classification of a point. For example, if we do 5-nearest neighbor, and four points are really close to the point we're trying to figure out, but the fifth is really far away (maybe even in another class!), we might want it to be less influential than the four closer ones. In that case, we might set `weights` to `'distance'`. By default, however, all points are equally weighted (`weights` is `'uniform'`).
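
To see how the two kinds of weights can disagree, here is a small sketch in plain Python. The neighbors and the helper `vote` are made up for illustration, and it assumes no neighbor has a distance of exactly zero. With `"distance"`, each neighbor's vote counts as 1 divided by its distance, so closer neighbors matter more.

```python
from collections import defaultdict

# made-up neighbors: (distance to the new point, label)
neighbors = [(0.5, "cat"), (4.0, "dog"), (5.0, "dog")]

def vote(neighbors, weights="uniform"):
    totals = defaultdict(float)
    for distance, label in neighbors:
        if weights == "uniform":
            totals[label] += 1.0             # every neighbor counts the same
        else:
            totals[label] += 1.0 / distance  # closer neighbors count more
    # the label with the largest total wins
    return max(totals, key=totals.get)

uniform_winner = vote(neighbors, "uniform")
distance_winner = vote(neighbors, "distance")
print(uniform_winner)
print(distance_winner)
```

Here the two far-away dogs outvote the cat under uniform weights, but the one very close cat wins once distance is taken into account.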

> **Help! How do I try different parameters?** Parameters are specific details you can give your code to individualize your functions a bit more. For example, you might want to try using 3 nearest neighbors instead of five! We can try different parameters by putting values in between the parentheses of `KNeighborsClassifier()` and specifying what we want to change. For example, if I want to try 3-NN, I would do `KNeighborsClassifier(n_neighbors = 3)`. You can also have multiple parameters at once! For example: `KNeighborsClassifier(n_neighbors = 3, weights = 'distance')`. Notice the quotation marks around `'distance'`: the value is a string.

In [None]:
# try experimenting with different values of k. what happens if it's very large? very small?
""" Delete this line and the three quotes at the bottom to uncomment code
k = ...     # change me to a number!

model = KNeighborsClassifier(n_neighbors = k) 
trained_model = model.fit(x_train, y_train)

train_predict = trained_model.predict(x_train)
val_predict = trained_model.predict(x_val)

train_acc = accuracy_score(y_train, train_predict)
val_acc = accuracy_score(y_val, val_predict)

print(f"The accuracy for your training set with {k} nearest neighbors is: ")
print(train_acc)
print(f"The accuracy for your validation set with {k} nearest neighbors is: ")
print(val_acc)
"""

In [None]:
""" Delete this line and the three quotes at the bottom to uncomment code

# what about if you change the `weights`?

# uncomment one `weight_type` at a time to see how it works!
weight_type = 'uniform'
# weight_type = 'distance'

model = KNeighborsClassifier(n_neighbors = 3, weights = weight_type) 
trained_model = model.fit(x_train, y_train)

train_predict = trained_model.predict(x_train)
val_predict = trained_model.predict(x_val)

train_acc = accuracy_score(y_train, train_predict)
val_acc = accuracy_score(y_val, val_predict)

print(f"The accuracy for your training set using {weight_type} weights is: ")
print(train_acc)
print(f"The accuracy for your validation set using {weight_type} weights is: ")
print(val_acc)
"""

In [None]:
""" Delete this line and the three quotes at the bottom to uncomment code

# time to experiment!! you can change the number of neighbors, weights, or any other parameter you'd like!

model = KNeighborsClassifier(...)   # fill me out!
trained_model = model.fit(x_train, y_train)

train_predict = trained_model.predict(x_train)
val_predict = trained_model.predict(x_val)

train_acc = accuracy_score(y_train, train_predict)
val_acc = accuracy_score(y_val, val_predict)

print("The accuracy for your training set is: ")
print(train_acc)
print("The accuracy for your validation set is: ")
print(val_acc)
"""

And remember, to make a submission, make sure you run the cells that start with `####PREPARE SUBMISSION`.