# Introduction to Machine Learning  

# Assignment 1: Machine Learning Fundamentals 

You can't learn technical subjects without hands-on practice. The assignments are an important part of the course. To submit this assignment you will need to make sure that you save your Jupyter notebook. 

The two videos linked below explain:

1. [How to save your Jupyter notebook](https://youtu.be/0aoLgBoAUSA), and
2. [How to answer a question in a Jupyter notebook assignment](https://youtu.be/7j0WKhI3W4s).

### Assignment Learning Goals:

By the end of the module, students are expected to:

- Explain the motivation for studying machine learning.
- Differentiate between supervised and unsupervised learning.
- Differentiate between classification and regression problems.
- Explain machine learning terminology such as features, targets, training, and error.
- Use `DummyClassifier`/`DummyRegressor` as a baseline for machine learning problems.
- Explain the `.fit()` and `.predict()` paradigm and use the `.score()` method of ML models.

This assignment covers [Module 1]() of the online course. You should complete this module before attempting this assignment.

Wherever you see `...`, fill in the function, variable, or data needed to complete the code. Replace each `None` with your completed code or answer, then run the cell.

Note that some of the questions in this assignment will have hidden tests. This means that no feedback will be given as to the correctness of your solution. It will be left up to you to decide if your answer is sufficiently correct.

In [None]:
# Import libraries needed for this lab
from hashlib import sha1

import altair as alt
import graphviz
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from IPython.display import HTML
from sklearn import tree
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.tree import DecisionTreeClassifier
import test_assignment1 as t

## Exercise 1: Interpreting the Decision Tree

Below is a toy dataset that we will use to get familiar with scikit-learn.

In [None]:
toy_data = {
    # Features
    'is_sweet': [0, 0, 1, 1, 0, 1, 0],
    'diameter': [3, 3, 1, 1, 3, 1, 4],
    # Target
    'target': ['Apple', 'Apple', 'Grape', 'Grape', 'Lemon', 'Grape', 'Apple']
}

df = pd.DataFrame(toy_data)
df

**Question 1.1** 
<br> {points: 3}

a) How many features are present in the dataset? *Assign your answer to a variable of type `int` called `answer1_1_a`*

b) Express the names of the features as strings in a list, e.g. `["Feature1", "Feature2", "Feature3"]`. *Assign your answer to a variable called `answer1_1_b`.*

c) How many different classes are possible in the dataset? *Assign your answer to a variable of type `int` called `answer1_1_c`.*

In [None]:

answer1_1_a = None
answer1_1_b = None
answer1_1_c = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1_1_a(answer1_1_a)

In [None]:
t.test_1_1_b(answer1_1_b)

In [None]:
t.test_1_1_c(answer1_1_c)

With the data above, we first create a decision tree and then fit (train) it using the data's features and labels. Once the decision tree has been fit, we can use the model's `predict` method to predict the class of each sample.

In [None]:
# create an instance of the DecisionTreeClassifier class
model = tree.DecisionTreeClassifier(random_state = 2) 

# prepare data for model fitting
X = df[['is_sweet','diameter']]
y = df['target']

# fit the model to the data. The semicolon at the end is used to suppress displaying the output of model.fit
model.fit(X, y);

# we can use the .predict method of our model class to make predictions
predictions = model.predict(X)
predictions

Since we know the true fruit identities, we can compare the predictions of our tree model to the true values.

In [None]:
df['prediction'] = predictions
df
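
The same kind of row-by-row comparison works for any model that follows the `.fit()` and `.predict()` pattern. As a minimal sketch, the block below fits a `DummyClassifier` baseline on a small made-up dataset (the `weather` data and all variable names are hypothetical and not part of this assignment), assuming the `most_frequent` strategy. The fraction of matching rows should agree with the accuracy that `.score()` reports.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Hypothetical data, made up for illustration only
weather = pd.DataFrame({
    "humidity": [30, 80, 85, 40, 90],
    "target": ["dry", "rain", "rain", "dry", "rain"],
})
X_demo = weather[["humidity"]]
y_demo = weather["target"]

# A most_frequent baseline ignores the features and always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

# Compare each prediction with the true label
weather["prediction"] = baseline.predict(X_demo)
weather["correct"] = weather["target"] == weather["prediction"]

# Fraction of correct rows; .score() computes the same accuracy
manual_accuracy = weather["correct"].mean()
baseline_accuracy = baseline.score(X_demo, y_demo)
```

A baseline like this gives a reference accuracy that a real model, such as a decision tree, is expected to beat.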

**Question 1.2** 
<br> {points: 3}

a) How many unique labels does the decision tree classify the dataset into? *Assign your answer to a variable of type `int` called `answer1_2_a`.*

b) At what index does the model incorrectly predict the class?  *Assign your answer to a variable of type `int` called `answer1_2_b`.*

c) What class should the model have predicted for this fruit?  *Assign your answer to a variable of type `str` called `answer1_2_c`.*

In [None]:

answer1_2_a = None 
answer1_2_b = None
answer1_2_c = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1_2_a(answer1_2_a)

In [None]:
t.test_1_2_b(answer1_2_b)

In [None]:
t.test_1_2_c(answer1_2_c)

Below is a function we are using to visualize our decision tree.

In [None]:
# define a function to plot our model (hint: you'll use this function later in the lab too)
def save_and_show_decision_tree(model, 
                                class_names,
                                feature_names,
                                save_file_prefix = 'test'):
    """
    Saves the decision tree as a PDF and returns a graph showing how the
    data is split and classified

    Parameters
    ----------
    model : sklearn.tree.DecisionTreeClassifier
        The fitted sklearn decision tree model
    class_names : list
        The names of all the possible classes
    feature_names : list
        The names of all the features
    save_file_prefix : str
        The file name (without extension) under which to save the tree

    Returns
    -------
    graphviz.Source
        The decision tree graph
    """
    dot_data = tree.export_graphviz(model, out_file=None, 
                             feature_names=feature_names,  
                             class_names=class_names,  
                             filled=True, rounded=True,  
                             special_characters=True)  

    graph = graphviz.Source(dot_data) 
    graph.render(save_file_prefix) 
    return graph

graph = save_and_show_decision_tree(model,
                                    class_names = ['Apple', 'Grape', 'Lemon'],
                                    feature_names = ['is_sweet','diameter'])
graph
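
If Graphviz is not available, a fitted tree can also be inspected as plain text. The sketch below is one possible alternative, not part of the assignment: it uses scikit-learn's `export_text` on a made-up one-feature dataset (the `height` feature and the `demo_tree` name are hypothetical).

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical one-feature dataset, made up for illustration only
X_demo = [[0], [1], [2], [3]]
y_demo = ["short", "short", "tall", "tall"]

demo_tree = DecisionTreeClassifier(random_state=2).fit(X_demo, y_demo)

# export_text returns the learned rules as an indented plain-text string
rules = export_text(demo_tree, feature_names=["height"])
print(rules)
```

Each indented line of the printed text corresponds to a split or a leaf in the diagram that the Graphviz version would draw.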

**Question 1.3** 
<br> {points: 10}

a) How deep is the decision tree? *Assign your answer to a variable of type `int` called `answer1_3_a`.*

b) What feature is used in the first split? *Assign your answer to a variable of type `str` called `answer1_3_b`.*

c) What is this split's threshold? *Assign your answer to a variable of type `float` called `answer1_3_c`.*

d) What feature is used in the second split? *Assign your answer to a variable of type `str` called `answer1_3_d`.*

e) What is the second split's threshold? *Assign your answer to a variable of type `float` called `answer1_3_e`.*

f) What fruit is classified with a single split? How many samples does this cover? *Express your answer in the form of a list (e.g. `["Fruit", 1]`) and assign it to a variable called `answer1_3_f`. The first item in the list should be of type `str` and the second of type `int`.*

g) After the first split, how many fruits are in the left and right nodes? *Express your answer in the form of a list (e.g. `[1, 8]`) and assign it to a variable called `answer1_3_g`.*

h) Would you say this split was balanced? *Express your answer as type `bool` (`True` or `False`) and assign it to a variable called `answer1_3_h`.*

i) If you had chosen the second feature and threshold as the first split, how many samples would have been in the left and right nodes? *Express your answer in the form of a list (e.g. `[1, 8]`) and assign it to a variable called `answer1_3_i`.*

j) If you had chosen the second feature and threshold as the first split, how many samples would no longer be ambiguous (that is, would already have been classified)? *Assign your answer to a variable of type `int` called `answer1_3_j`.*

In [None]:

answer1_3_a = None 
answer1_3_b = None 
answer1_3_c = None 
answer1_3_d = None 
answer1_3_e = None 
answer1_3_f = None 
answer1_3_g = None 
answer1_3_h = None 
answer1_3_i = None 
answer1_3_j = None 

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1_3_a(answer1_3_a)

In [None]:
t.test_1_3_b(answer1_3_b)

In [None]:
t.test_1_3_c(answer1_3_c)

In [None]:
t.test_1_3_d(answer1_3_d)

In [None]:
t.test_1_3_e(answer1_3_e)

In [None]:
t.test_1_3_f(answer1_3_f)

In [None]:
t.test_1_3_g(answer1_3_g)

In [None]:
t.test_1_3_h(answer1_3_h)

In [None]:
t.test_1_3_i(answer1_3_i)

In [None]:
t.test_1_3_j(answer1_3_j)

**Question 1.4** 
<br> {points: 1}

What would need to increase for the model to predict the classification lemon? 

a) The number of features used in training

b) The number of samples with a `lemon` label

c) Both of these 

d) Neither of these 

*Express your answer as a single-letter string (e.g. `"H"`) and assign it to a variable called `answer1_4`.*

In [None]:

answer1_4 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1_4(answer1_4)

## Exercise 2: Exploratory Data Analysis <a name="2"></a>
 
 
For the rest of the lab you'll be using Kaggle's [Spotify Song Attributes](https://www.kaggle.com/geomack/spotifyclassification/home) dataset.
The dataset contains a number of features of songs from 2017 and a binary target variable representing whether the user liked the song or not. See the documentation of all the features [here](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/). The question we will focus on is what kinds of songs the user likes.

This dataset is publicly available on Kaggle, but not licensed to be freely distributed. As a result, we do not provide this dataset in your repository, and you will have to download it yourself. Follow the steps below to get the data CSV. 

- If you do not have an account with Kaggle, you will first need to create one (it's free); 
- Login to your account and [download](https://www.kaggle.com/geomack/spotifyclassification/downloads/spotifyclassification.zip/1) the dataset;  
- Read the [terms and conditions](https://www.kaggle.com/terms) before using the data (you should always do this).

**Question 2.1** 
{points: 2}

1. Read in the data CSV and store it as a pandas dataframe named `spotify_df`. The first column of the .csv file should be set as the index.
2. Show some summary statistics of each feature using the `describe` method, and store the results in a dataframe named `spotify_summary`. 
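
As a rough illustration of the two steps above, the sketch below reads a small CSV from an in-memory string rather than a file. The CSV contents, column names, and variable names are made up for this example and are unrelated to the Spotify data.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for a file on disk
csv_text = """,height,weight
0,1.5,10
1,2.5,20
2,3.5,30
"""

# index_col=0 uses the first column as the row index instead of a regular column
demo_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# describe() summarizes each numeric column (count, mean, std, min, quartiles, max)
demo_summary = demo_df.describe()
```

With a real file, a path string would take the place of the `io.StringIO` object.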

In [None]:

spotify_df = None 
spotify_summary = None 

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_2_1_a(spotify_df)

In [None]:
t.test_2_1_b(spotify_summary)

**Question 2.2** 
{points: 5}

a) Which features would need to be one-hot encoded? *Express your answer in the form of a list (e.g. `[1, 8]`) and assign it to a variable called `answer2_2_a`.* 

b) Which categorical features would need to be preprocessed further before they can be used? *Express your answer in the form of a list (e.g. `[1, 8]`) and assign it to a variable called `answer2_2_b`.* 
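
For reference, one-hot encoding replaces a categorical column with one indicator column per category. The sketch below shows the idea with `pd.get_dummies` on a made-up DataFrame; the `color` and `size` columns are hypothetical, and this is only one of several ways to one-hot encode.

```python
import pandas as pd

# Hypothetical data, made up for illustration only
demo_df = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "size": [1, 2, 3],
})

# Replace the categorical "color" column with one indicator column per category
encoded = pd.get_dummies(demo_df, columns=["color"])
print(encoded.columns.tolist())
```

The numeric `size` column passes through unchanged, while `color` is expanded into separate indicator columns.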

In [None]:

answer2_2_a = None
answer2_2_b = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_2_2_a(answer2_2_a)

In [None]:
t.test_2_2_b(answer2_2_b)

**Question 2.3** 
{points: 5}

The code below defines a function that produces histograms for the features (in order) `danceability`, `tempo`, `instrumentalness`, and `valence`, showing the distribution of each feature's values separately for `target` values of `0` and `1`. Fill in the missing code so that the function runs.


Notes: 

- The plots are created with the Python package `altair`.


In [None]:
### Please copy and paste this cell into a new one and fill in the blanks.
def plot_histogram(df,feature):
    """
    Plots a histogram of a feature, colored by target value

    Parameters
    ----------
    df : pandas.DataFrame
        the data containing the feature and a `target` column
    feature : str
        the feature name

    Returns
    -------
    altair.Chart
        an Altair histogram
    """
    histogram = alt.Chart(df).mark_bar(
        opacity=0.7).encode(
        alt.X(feature, bin=alt.Bin(maxbins=50)),
        alt.Y('count()', stack=None),
        alt.Color('target:N')).properties(
        title= str.title(feature))
    return ....

feature_list = ....
figure_dict = dict()
for feature in .... :
    figure_dict.update({feature:plot_histogram(....,feature)})
figure_panel = alt.vconcat(*.... .values())
figure_panel


In [None]:
# your code here
raise NotImplementedError # No Answer - remove if you provide an answer


In [None]:
t.test_2_3(feature_list, figure_dict, figure_panel)

Briefly think about your observations; which features might be useful in differentiating the target classes?