In [1]:
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import math

In [2]:
data = pd.read_csv("nces-ed-attainment.csv", na_values=["---"])
data

Unnamed: 0,Year,Sex,Min degree,Total,White,Black,Hispanic,Asian,Pacific Islander,American Indian/Alaska Native,Two or more races
0,1920,A,high school,,22.0,6.3,,,,,
1,1940,A,high school,38.1,41.2,12.3,,,,,
2,1950,A,high school,52.8,56.3,23.6,,,,,
3,1960,A,high school,60.7,63.7,38.6,,,,,
4,1970,A,high school,75.4,77.8,58.4,,,,,
...,...,...,...,...,...,...,...,...,...,...,...
209,2014,F,master's,9.3,11.1,5.0,3.6,20.8,,,7.5
210,2015,F,master's,10.4,12.0,7.2,4.1,23.2,,,10.2
211,2016,F,master's,11.2,12.3,6.3,6.3,28.8,,,8.2
212,2017,F,master's,10.5,11.8,6.8,5.0,25.8,,,5.4


# scikit-learn
Below is a list of functions and features you will most likely use on this assignment. These are not the only functions you could use, and each of them can be used in more sophisticated ways, but this wordbank focuses on the basics. For these examples, we will use the iris dataset provided by `seaborn`.

To run this document, you must first run the following cell(s).

In [3]:
import pandas as pd
import seaborn as sns
sns.set()

iris = pd.read_csv('/course/lessons/iris.csv')
iris.head()


In [None]:
# Commonly people call the features X and the labels y
X = iris.loc[:, iris.columns != 'species']
y = iris['species']

## `sklearn.model_selection.train_test_split`
This function splits your dataset into train and test sets whose sizes follow the given ratio. It returns a 4-tuple of `(train_data, test_data, train_label, test_label)`.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

In [None]:
print(f'Train size: {len(X_train)} ({len(X_train) * 100 / len(X):0.2f}%)')
print(f'Test size: {len(X_test)} ({len(X_test) * 100 / len(X):0.2f}%)')
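
Because `train_test_split` shuffles the rows before splitting, each run gives a different split unless a seed is fixed. The sketch below passes `random_state` to make the split repeatable. It assumes scikit-learn's bundled copy of the iris data (`load_iris`) rather than the course CSV, and the seed value 42 is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: use scikit-learn's bundled iris data instead of the course CSV.
X_demo, y_demo = load_iris(return_X_y=True)

# Fixing random_state makes the shuffle, and therefore the split, repeatable.
split_a = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
X_train_b, X_test_b, y_train_b, y_test_b = split_b

print('Same test rows both times:', (X_test_a == X_test_b).all())
print('Train size:', len(X_train_a), 'Test size:', len(X_test_a))
```

A fixed seed is mainly useful when you want to compare models on exactly the same train and test rows.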

## `sklearn.tree.DecisionTreeClassifier`
A tree-based model for solving a classification task (e.g. predicting a label like "spam" or "not spam"). See the sections below for functions that can be used on any model.

In [None]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()

## `sklearn.tree.DecisionTreeRegressor`
A tree-based model for solving a regression task (e.g. predicting a numerical quantity). See the sections below for functions that can be used on any model.

In [None]:
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor()

## Model: `fit`
Every model has a `fit` function that takes a dataset (features and labels) and trains the model using that data. Since the iris dataset is about predicting the species of an iris from measurements of its sepals and petals, this example uses a classifier.

In [None]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()

model.fit(X_train, y_train)

## Model: `predict`
Every model has a `predict` function that takes a dataset (features only) and predicts a label for every row of that dataset. You must `fit` the model before you can `predict` with it. As in the `fit` section, this example uses a classifier because the iris dataset is a classification problem.

We assume the previous cell was the last to run.

In [None]:
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
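
The fit-before-predict requirement can be seen on a tiny example. This is a minimal sketch using made-up toy data (`X_toy` and `y_toy` are hypothetical, not part of the assignment); it relies on scikit-learn raising `NotFittedError` when an unfitted estimator is asked to predict.

```python
from sklearn.exceptions import NotFittedError
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one feature, two classes.
X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_toy = ['a', 'a', 'b', 'b']

clf = DecisionTreeClassifier()

try:
    clf.predict(X_toy)  # not fitted yet, so this should raise
    raised = False
except NotFittedError:
    raised = True
print('predict before fit raised NotFittedError:', raised)

# After fitting, predict returns one label per input row.
clf.fit(X_toy, y_toy)
toy_pred = clf.predict(X_toy)
print(toy_pred)
```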

In [None]:
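# Attach the predictions to a copy of the training features to inspect them.
# Assigning the array directly matches rows by position; wrapping it in
# pd.Series first would align on a fresh 0..n-1 index and produce NaNs.
d = X_train.copy()
d['predicted'] = y_train_pred
print(pd.Series(y_train_pred))
y_train_pred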

## `sklearn.metrics.accuracy_score`
If you are solving a classification problem, a common metric for the quality of the model is accuracy: the fraction of examples for which it predicted the correct label. A higher accuracy means the model fits the data more closely.

Mathematically, this is defined as 

$$Accuracy(y_{true}, y_{pred}) = \frac{1}{n}\sum_{i=1}^n \textbf{1}\left(y_{true}(i) = y_{pred}(i)\right)$$

where $y_{true}$ are the true labels, $y_{pred}$ are the predicted labels, $n$ is the number of examples, and $\textbf{1}$ takes the value 1 if the condition inside is true and 0 otherwise.

Alternatively, you could write this in code as

```python
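def accuracy(y_true, y_pred):
  # Compare labels pairwise; zip pairs by position, so this also works
  # when y_true is a pandas Series with a shuffled index.
  correct = 0
  for true_label, pred_label in zip(y_true, y_pred):
    if true_label == pred_label:
      correct += 1
  return correct / len(y_true)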
```

It's much simpler to have scikit-learn compute this for you like below (assumes the cells above have been run):

In [None]:
from sklearn.metrics import accuracy_score

print('Train accuracy:', accuracy_score(y_train, y_train_pred))
print('Test accuracy:', accuracy_score(y_test, y_test_pred))
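
To make accuracy as a fraction of correct predictions concrete, here is a hand-sized check. The labels are made up for illustration only and do not come from the model above.

```python
# Hypothetical labels, made up for illustration only.
y_true_demo = ['setosa', 'setosa', 'virginica', 'versicolor', 'virginica']
y_pred_demo = ['setosa', 'versicolor', 'virginica', 'versicolor', 'setosa']

# Count positions where the prediction matches the true label.
matches = sum(t == p for t, p in zip(y_true_demo, y_pred_demo))
demo_accuracy = matches / len(y_true_demo)

print(matches, 'of', len(y_true_demo), 'correct')
print('Accuracy:', demo_accuracy)
```

Three of the five predictions match, so the accuracy is 3/5 = 0.6.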

## `sklearn.metrics.mean_squared_error`
If you are solving a regression problem, a common metric for the quality of the model is the average of the squared differences between each prediction and the true value. This is called mean squared error, or MSE. A lower MSE means the model fits the data more closely.

Mathematically, this is defined as 

$$MSE(y_{true}, y_{pred})= \frac{1}{n}\sum_{i=1}^n \left( y_{true}(i)-y_{pred}(i)\right)^2$$

where $y_{true}$ are the true values, $y_{pred}$ are the predicted values, and $n$ is the number of examples.

Alternatively, you could write this in code as

```python
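def mse(y_true, y_pred):
  # Sum the squared differences; zip pairs values by position, so this
  # also works when y_true is a pandas Series with a shuffled index.
  total_error = 0
  for true_value, pred_value in zip(y_true, y_pred):
    total_error += (true_value - pred_value) ** 2
  return total_error / len(y_true)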
```

It's much simpler to have scikit-learn compute this for you, as below. **You can't actually run these cells**, since the data and predictions above come from a classification problem; with labels and predictions from a regression problem, the same calls would work.

In [None]:
from sklearn.metrics import mean_squared_error

print('Train MSE:', mean_squared_error(y_train, y_train_pred))
print('Test MSE:', mean_squared_error(y_test, y_test_pred))
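
For a version of these calls that does run, the sketch below fits a `DecisionTreeRegressor` on synthetic data. The data is hypothetical (a noisy linear function generated with NumPy), and the `_reg` and `r`-prefixed variable names are illustrative choices that keep it separate from the iris variables above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data: y is a noisy linear function of one feature.
rng = np.random.default_rng(0)
X_reg = rng.uniform(0, 10, size=(200, 1))
y_reg = 3.0 * X_reg[:, 0] + rng.normal(0, 1, size=200)

Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_reg, y_reg, test_size=0.33, random_state=0)

reg = DecisionTreeRegressor()
reg.fit(Xr_train, yr_train)

# MSE on the data the model saw, and on held-out data it did not see.
train_mse = mean_squared_error(yr_train, reg.predict(Xr_train))
test_mse = mean_squared_error(yr_test, reg.predict(Xr_test))
print('Train MSE:', train_mse)
print('Test MSE:', test_mse)
```

An unrestricted tree can typically memorize its training data, so the train MSE here should be close to zero while the test MSE is noticeably larger.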