# Welcome to Thunderbird Hacks! 
## First thing to know  
This notebook has lots of cells. Each cell is either text or code. You can tell which cell is selected by the black bar to its left. To select a cell, single click on it.
## If you double click on a text cell
This will allow you to edit the text. You'll notice that the content doesn't change, but extra symbols (such as `#`) appear in the cell. Those symbols format the text so it is easier to read. Try double clicking this cell if you haven't already.
## How to run a cell
You can run a cell by clicking the single triangle, or play button, in the menu above. The keyboard shortcut for this is `ctrl`+`enter`. For a text cell, this turns the text back into its formatted, readable form; for a code cell, it runs the code. Try this now if you haven't already.

## This is the next cell.
This is a text cell, just like the one above. Single click on it to select it, double click to edit it, and `ctrl`+`enter` to run it.

## The cell below is a code cell
Single clicking on it allows you to select and edit it. 

In [None]:
# This is a code cell
# <- that hash means this is a comment
# When you run your code cells, it ignores any comments
# Comments are useful for telling you what the code does 

print("Hello World!") #prints Hello World!

#try running this cell to see what happens

## Congratulations! You ran your first Python code
You should see `Hello World!` below the cell and none of the comments. 

So, let's break down what you just ran:  
`print()` is a function - it does something to whatever is inside the parentheses  
`"Hello World!"` - this is called a string. Anything in between quotation marks is a string.  

Test it out! Try removing the quotation marks around Hello World in the code cell above, run it, and see what happens.

You may see a red box around some text. This is an error - you broke the code!  
You can fix it by putting the quotation marks back. You can also try editing the text - enter your name or your favorite sandwich inside the quotation marks and run the code to see what happens.


## Now, let's learn about variables

Look at the code cell below. Can you guess what happens?

The comments should help, and take note of the colors of the code.

Run the cell and see if your guess was correct.

In [None]:
foo = "Hello" #set the variable foo
bar = "World" #set bar

print(foo) #print foo
print("foo") #print the word foo itself, not the variable
print(foo + bar) #print foo and bar

Did you guess correctly? Let's run through what happened:  
1. `foo = "Hello"` sets the variable foo to be a string with the value Hello
2. `bar = "World"` sets the variable bar to be a string with the value World
3. `print(foo)` prints the *value* of the variable foo
4. `print("foo")` prints foo as a string because of the quotation marks
5. `print(foo + bar)` prints the values of both foo and bar. Notice there is no space in between the words because there were no spaces in the strings when they were defined.
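
If you want a space between the two words, one option is to add a string that contains a single space. This is a small sketch using the same `foo` and `bar` as above; the names `no_space` and `with_space` are made up for this example.

```python
foo = "Hello"
bar = "World"

# Joining the two strings directly gives no space between the words
no_space = foo + bar

# Adding a string that contains one space puts a gap between them
with_space = foo + " " + bar

print(no_space)    # HelloWorld
print(with_space)  # Hello World
```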

Are you confused? Try editing the code above and test out what happens with different changes you make to the code. If you're still confused, ask for help.

Print statements are very useful for debugging code - since code is run top to bottom, they can help you figure out what is happening when you get error messages.

## What about math?

The code below shows a few basics.

In [None]:
a = 5 #sets the variable a to 5 
#notice there are not any quotation marks
#this is because we want 5 to be treated as a number

b = "5" #sets the variable b to a string
print(a) #print the value of a
print(b) # print the value of b

print("a+a=")
print(a+a) #print a+a
print("b+b=")
print(b+b) #print b+b

#Run this cell and observe what happens for each print statement
print("was this surprising?")
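
If the `b+b` result surprised you, the sketch below shows the difference side by side. Adding two numbers does math, while adding two strings joins them end to end. The built-in function `int()` turns a string of digits into a number; the variable names on the left are made up for this example.

```python
a = 5    # a number (integer)
b = "5"  # a string that happens to contain the character 5

number_sum = a + a          # adds the numbers
string_sum = b + b          # joins the strings end to end
converted_sum = int(b) + a  # int() turns the string "5" into the number 5

print(number_sum)     # 10
print(string_sum)     # 55
print(converted_sum)  # 10
```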

There are a lot of math operations that you can use, though many won't be super useful to you today. If you feel like it, you can use the cell below to test out what you've learned above.

In [None]:
# You can use this cell for testing concepts you learned above.

## A little bit about data types

Every variable you use in Python will have a data type. The first kind we saw was the string - text inside quotation marks. You've also seen integers - numeric values without decimal points. So what else might you see today?
 
- `float` - numbers with decimals, ex. `1.5`
- `list` - a sequence type, ex. `[1,2,3]`, `["cat","dog","fish"]`
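
If you are ever unsure what data type a variable has, Python's built-in `type()` function will tell you. The variable names below are made up for this example.

```python
word = "cat"                   # a string
count = 5                      # an integer
price = 1.5                    # a float
pets = ["cat", "dog", "fish"]  # a list

print(type(word))   # <class 'str'>
print(type(count))  # <class 'int'>
print(type(price))  # <class 'float'>
print(type(pets))   # <class 'list'>
```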

### Arrays will be very important

An array is like a list: it can hold multiple values at once. The easiest way to start with arrays is with lists. So, let's define a list `ar` as:  

`ar = ["cat", "dog", "fish"]`  

If we want to get the first value, `cat`, out of the array, we can just use this:  

`ar[0]`

We have a zero in the square brackets because the *index* for arrays and lists in Python starts at zero.  

We can get the length of a list using:  

`len(ar)`


In [None]:
#let's test this out with some code
ar = ["cat", "dog", "fish"] #setting the array ar
print("The array ar: ")
print(ar)

print("The first value of ar: "+ar[0])
print("The second value of ar: "+ar[1])
print("The last value of ar: "+ar[2])


print("The length of ar: ")
print(len(ar))

## Numpy arrays

If you really want to learn all about them, I suggest you read the [numpy documentation](https://numpy.org/doc/stable/user/absolute_beginners.html#whats-the-difference-between-a-python-list-and-a-numpy-array). However, in the interest of time, the explanation in this cell should be enough for today.

A numpy array is a more efficient way of storing a sequence of numbers: it uses less memory and is faster to do math with than a list. Many functions also expect numpy arrays rather than lists.

Look at the code below, then run it to learn how to create numpy arrays.


In [None]:
a_list = [1,2,3,4] #this is a list
b_list = [5,6,7,8] #this is another list
c_list = [a_list, b_list] #this is a list of lists, or nested list

print("A nested list: {}".format(c_list)) #print out c_list

import numpy as np #ignore this line for now, you'll learn about it later

#now, we'll turn our lists into numpy arrays using the function np.asarray()
np_alist = np.asarray(a_list) #np.asarray() turns a_list into a numpy array
np_clist = np.asarray(c_list) #turn c_list into a numpy array

print("a_list as a numpy array: {}".format(np_alist)) #print out np_alist

print("a nested list as a numpy array:")
print(np_clist) #print out np_clist


Notice that when they are printed, lists have commas separating each value and numpy arrays do not.
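
Lists and numpy arrays also behave differently when you do math with them, not just when you print them. Here is a small sketch of one difference, using the same `a_list` as above; `doubled_list` and `doubled_array` are names made up for this example.

```python
import numpy as np

a_list = [1, 2, 3, 4]
np_alist = np.asarray(a_list)

doubled_list = a_list * 2     # repeats the whole list twice
doubled_array = np_alist * 2  # multiplies each value by 2

print(doubled_list)   # [1, 2, 3, 4, 1, 2, 3, 4]
print(doubled_array)  # [2 4 6 8]
```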

In [None]:
# You can also find the shape of an array
shape = np.shape(c_list)
print(shape)

Shape errors are very common errors to run into in machine learning. Understanding all of them takes several classes in linear algebra, so if you do run into them and get stuck, the best course of action is to ask for help.
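
To give you an idea of what a shape error looks like, here is a small sketch with made-up arrays. It tries to add an array with 4 columns to an array with 3 values, which numpy cannot line up. The `try` and `except` lines catch the error so that the message can be printed instead of stopping the cell.

```python
import numpy as np

grid = np.asarray([[1, 2, 3, 4], [5, 6, 7, 8]])  # 2 rows, 4 columns
row = np.asarray([10, 20, 30])                   # 3 values

print(np.shape(grid))  # (2, 4)
print(np.shape(row))   # (3,)

try:
    total = grid + row  # 4 columns cannot be matched with 3 values
except ValueError as err:
    error_message = str(err)
    print("Shape error:", error_message)
```

When you see a message like this, printing the shape of each array involved is usually a good first step.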

## An introduction to libraries

Earlier, you were told to ignore this line in the code:
`import numpy as np`

This imports, or loads, the numpy library so that you can use all of the functions it contains. Notice the `as np`: this is like a nickname we can use for the library.

We used the function `np.asarray()` earlier. The `np` tells the computer to look in the numpy library to find the function `asarray()`.
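
The sketch below shows that the nickname and the full name point to the same library, so `np.asarray()` and `numpy.asarray()` do the same thing. The variable names are made up for this example.

```python
import numpy        # load the library under its full name
import numpy as np  # load the same library under the nickname np

full_name_result = numpy.asarray([1, 2, 3])
nickname_result = np.asarray([1, 2, 3])

print(np is numpy)       # True
print(full_name_result)  # [1 2 3]
print(nickname_result)   # [1 2 3]
```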

Run the cell below and you're all set to go with the libraries you will need today!

In [None]:
#imports
import numpy as np # linear algebra library
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os #useful library for dealing with filepaths
#sklearn has everything else you will need
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn import tree

# Let's get started with the task!

The cell below tells you all the files you can use during this competition. Notice that we are using the `os` library and the `listdir()` function within that library. We are passing a string (the file path) into the function.

Run the cell. You should see `['train.csv', 'test.csv']` as output. If you do not see anything, or get an error, click on `Run` in the menu above, and select factory reset. You will need to re-run the import cell above first, but it should fix your error when you rerun the cell below. If you are still having issues, ask for help.

In [None]:
path = "/kaggle/input/thunderbird-hacks-2024"
os.listdir(path)

## Load the training data

Running the cell below will load the training data into a pandas dataframe called `df_train`.


In [None]:
#load train.csv
train_path = os.path.join(path, "train.csv") #join the folder path and file name
df_train = pd.read_csv(train_path) #read the csv from file into a pandas dataframe
df_train #output the dataframe
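
If you are curious how `os.path.join()`, `os.listdir()`, and `pd.read_csv()` fit together outside of this competition, here is a sketch that uses a temporary folder instead of the competition folder. The file `example.csv` and its contents are made up for this example.

```python
import os
import tempfile

import pandas as pd

# A temporary folder stands in for the competition folder
with tempfile.TemporaryDirectory() as folder:
    file_path = os.path.join(folder, "example.csv")  # folder path + file name

    # Write a tiny made-up CSV so there is something to load
    with open(file_path, "w") as f:
        f.write("song_name,song_popularity\nSong A,50\nSong B,70\n")

    files = os.listdir(folder)           # list the files in the folder
    df_example = pd.read_csv(file_path)  # read the CSV into a dataframe

print(files)  # ['example.csv']
print(df_example)
```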

Now, we want to prepare our dataset for training. There are several things the provided code will do for you:  
- remove the `song_name` column, as it is a string and difficult to deal with
- remove the `id` column, as it is a random value and not something that will help us learn what features impact popularity
- split the data into an `X` and `y` dataset, with `X` being the song features and `y` being the song popularity

In [None]:
#remove song_name and id columns
df_train.drop(["song_name","id"], axis=1, inplace=True)

#Split data into X and y
X = df_train.drop("song_popularity", axis=1)
y = df_train["song_popularity"]
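
To see what these steps do, here is the same idea on a tiny made-up dataframe. The `tempo` column and all of the values are invented for this example and are not taken from the real training data.

```python
import pandas as pd

# Made-up data with the same kinds of columns as the training set
df_toy = pd.DataFrame({
    "id": [101, 102, 103],
    "song_name": ["Song A", "Song B", "Song C"],
    "tempo": [120.0, 95.5, 140.2],
    "song_popularity": [50, 70, 35],
})

# Remove the columns the model should not learn from
df_toy = df_toy.drop(["song_name", "id"], axis=1)

# Split into features (X) and the value we want to predict (y)
X_toy = df_toy.drop("song_popularity", axis=1)
y_toy = df_toy["song_popularity"]

print(list(X_toy.columns))  # ['tempo']
print(list(y_toy))          # [50, 70, 35]
```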

Why do we split data into X and y variables?  

In math class you may have learned that the equation for a line is $y=mx+b$, like for the graph below: 

![graph](https://upload.wikimedia.org/wikipedia/commons/thumb/0/0e/Linear_Function_Graph.svg/1200px-Linear_Function_Graph.svg.png)

In math or science class, you may also have learned about independent and dependent variables. Our independent variables here are in the X dataset, or what would go on the x axis in a graph. The dependent variable is in the y dataset, or what would go on the y axis in the graph.   

Your machine learning model tries to learn an equation that predicts y from the features in X.
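
As a small illustration of learning an equation, the sketch below gives a linear regression four made-up points that sit exactly on the line $y=2x+1$ and asks it to find $m$ and $b$. It uses scikit-learn's default settings, and all of the names and numbers are invented for this example.

```python
from sklearn.linear_model import LinearRegression

x_values = [[0], [1], [2], [3]]  # one feature per row
y_values = [1, 3, 5, 7]          # each y is 2 * x + 1

line_model = LinearRegression().fit(x_values, y_values)

m = line_model.coef_[0]    # the learned slope
b = line_model.intercept_  # the learned intercept

print(m, b)  # both should be very close to 2 and 1
```

With real data the points never sit perfectly on a line, so the model finds the line that fits them as closely as it can.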


## Back to coding

Now, we'll split the data up even further, into a training set and a validation set. The training set is still what you will train your model on, and the validation set is what you will test your model on, so you can get an idea of how it will perform before you submit it.

In [None]:
x_train, x_val, y_train, y_val = model_selection.train_test_split(X, y, test_size=0.3, random_state=6)
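
Here is a sketch of what `test_size=0.3` does, using 10 made-up rows instead of the real data. About 30% of the rows go to the validation set and the rest stay in the training set. The names ending in `_demo`, `_tr`, and `_va` are invented for this example.

```python
import numpy as np
from sklearn import model_selection

X_demo = np.arange(20).reshape(10, 2)  # 10 rows, 2 features
y_demo = np.arange(10)                 # 10 target values

x_tr, x_va, y_tr, y_va = model_selection.train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=6
)

print(len(x_tr), len(x_va))  # 7 3
```

Setting `random_state` to a fixed number means you get the same split every time you run the cell.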

Let's train our first model!


We are using a [linear regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) to start you off.

In [None]:
model = LinearRegression(fit_intercept=False) #this is the algorithm
trained_model = model.fit(x_train, y_train) #this fits/trains the algorithm to the data

Congratulations, you've trained your first machine learning model! Now, let's see how your model performs:

In [None]:
train_predict = trained_model.predict(x_train) #make predictions for the training set
val_predict = trained_model.predict(x_val) #make predictions for the validation set

#now we'll compute the mae for the train and validation sets
train_mae = mean_absolute_error(y_train, train_predict) #true values first, then predictions
val_mae = mean_absolute_error(y_val, val_predict)

#and print out the results:
print("The MAE for your training set is: ")
print(train_mae)
print("The MAE for your validation set is: ")
print(val_mae)

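MAE stands for mean absolute error: the average distance between your predictions and the true values. The goal is to have the MAE scores as low as possible. If your training MAE is a lot lower than your validation MAE, then you are likely [overfitting to your data](https://www.ibm.com/topics/overfitting).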
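
To see where an MAE score comes from, the sketch below works one out by hand for three made-up songs and checks it against `mean_absolute_error`. For each song, the error is the distance between the true value and the prediction, and the MAE is the average of those errors: $\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$. All of the numbers are invented for this example.

```python
from sklearn.metrics import mean_absolute_error

actual = [50, 70, 30]     # made-up true popularity values
predicted = [55, 60, 30]  # made-up predictions

error_1 = abs(50 - 55)  # 5
error_2 = abs(70 - 60)  # 10
error_3 = abs(30 - 30)  # 0

mae_by_hand = (error_1 + error_2 + error_3) / 3
mae_sklearn = mean_absolute_error(actual, predicted)

print(mae_by_hand)  # 5.0
print(mae_sklearn)  # 5.0
```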

Now, let's make the prediction on test.csv and make a submission to the leaderboard!

In [None]:
#load and prepare the file
test_path = os.path.join(path, "test.csv")
df_test = pd.read_csv(test_path)
df_test.drop('song_name', axis=1, inplace=True)
X_test = df_test.drop("id", axis=1)

#make the predictions
final_preds = trained_model.predict(X_test)

#check to make sure the song popularity values look sensible
final_preds

In [None]:
#if all above looks good, prepare the submission
my_submission = pd.DataFrame({'id':df_test.id , 'song_popularity': final_preds})
# you could use any filename. We choose submission here
my_submission.to_csv('submission.csv', index=False)

After running the above cell, go to `Submit to Competition` on the right and click `Submit`. Follow the instructions and it'll submit your submission.csv file for you.

Now, it is time to improve your model. We suggest:

- Reading the [linear regression documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) and seeing if there is anything you can change that might improve your model
- Looking through the list of other [sklearn regression models](https://scikit-learn.org/stable/supervised_learning.html) and implementing a few

Note that as long as you name your trained model `trained_model`, all of the cells for looking at scores and preparing your submission will not need to be edited. You will only need to edit the cell that starts with:  
`model = LinearRegression(fit_intercept=False) #this is the algorithm`
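
As one possible starting point, here is a sketch of what swapping in a different model could look like. It uses a decision tree regressor from the `tree` module that was imported earlier. The data is made up so that the example runs on its own, and `max_depth=5` is an arbitrary choice, not a recommended setting.

```python
import numpy as np
from sklearn import tree

# Made-up data standing in for x_train and y_train
rng = np.random.default_rng(0)
x_demo = rng.random((50, 3))  # 50 rows, 3 features
y_demo = x_demo[:, 0] * 10    # target depends on the first feature

model = tree.DecisionTreeRegressor(max_depth=5, random_state=0)  # this is the algorithm
trained_model = model.fit(x_demo, y_demo)  # in the notebook, use x_train and y_train here

demo_preds = trained_model.predict(x_demo)
print(demo_preds.shape)  # (50,)
```

Whether a model like this does better than linear regression on the song data is something to check with your validation MAE.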