# Basic Modeling in scikit-learn
  
Machine learning models are easier to implement now than ever before. Without proper validation, however, the results of running new data through a model might not be as accurate as expected. Model validation allows analysts to confidently answer the question, "How good is your model?" We will answer this question for classification models using the complete set of tic-tac-toe endgame scenarios, and for regression models using fivethirtyeight’s ultimate Halloween candy power ranking dataset. In this course, we will cover the basics of model validation, discuss various validation techniques, and begin to develop tools for creating validated and high-performing models.  
  
Before we can validate models, we need an understanding of how to create and work with them. This chapter provides an introduction to running regression and classification models in scikit-learn. We will use this model building foundation throughout the remaining chapters.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Introduction to model validation
  
**What is model validation?**
  
So what is model validation? Model validation consists of various steps and processes that ensure your model performs as expected on new data. The most common way to do this is to test your model's accuracy on data it has never seen before (called a holdout set). If your model's accuracy is similar for the data it was trained on and the holdout data, you can claim that your model is validated. However, model validation can also consist of choosing the right model, the best parameters, and even the best metric. The ultimate goal of model validation is to end up with the best-performing model possible, one that achieves high accuracy on new data. Before we begin exploring model validation, let's review some basic modeling steps using scikit-learn.
  
**scikit-learn modeling review**
  
Modeling in Python follows a simple procedure, regardless of the type of model you are constructing. Whether you are a seasoned scikit-learn veteran or new to building models with this module, let's take a quick look at these steps. First, we create a model by specifying the model type and its parameters. In this case, we are creating a random forest regression model with `RandomForestRegressor()`. Second, we fit the model using the `.fit()` method. This method has two main arguments: `X`, an array of data used in the model as training data, and `y`, an array of response values with one entry per row of `X`. Because `.fit()` returns the fitted model, an interactive session such as a notebook displays the model and its parameters after the call.
  
<img src='../_images/scikit-learn-basic-modeling-schema.png' text='alt text' width='740'>
  
**Modeling review continued**
  
To assess model accuracy, we generate predictions for data using the `.predict()` method. Lastly, we look at the accuracy metrics. Here we are comparing the model's predictions (the variable `predictions`) and the actual responses, `y_test`. Future lessons and exercises will be devoted to accuracy metrics, as they are a vital component of model validation. For this example, though, we are looking at the mean absolute error (MAE). The `mean_absolute_error()` function takes two arrays as arguments, the true values, `y_true`, and the predicted values, `y_pred`, and returns the MAE between them.
  
<img src='../_images/scikit-learn-basic-modeling-schema1.png' text='alt text' width='740'>
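
The four steps can be sketched end to end as follows. This is only a minimal sketch, not the course's own code: the data is synthetic (generated with NumPy), and the names `X_demo` and `y_demo` are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): 200 samples, 3 features
rng = np.random.default_rng(1111)
X_demo = rng.uniform(0, 1, size=(200, 3))
y_demo = 50 + 30 * X_demo[:, 0] + rng.normal(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1111
)

# 1. Create the model by choosing its type and parameters
model = RandomForestRegressor(n_estimators=50, random_state=1111)

# 2. Fit the model on the training data
model.fit(X_train, y_train)

# 3. Generate predictions for data the model has not seen
predictions = model.predict(X_test)

# 4. Compare the predictions with the actual responses
test_mae = mean_absolute_error(y_true=y_test, y_pred=predictions)
print("MAE on the holdout set: {0:.2f}".format(test_mae))
```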
  
**Mean Absolute Error (MAE)**
  
The MAE is the average of the absolute differences between the true and predicted values: the sum of the absolute errors divided by the sample size. Each error therefore contributes to the MAE in proportion to its absolute value.  
  
*This is in contrast to RMSE which involves squaring the differences, so that a few large differences will increase the RMSE to a greater degree than the MAE*.
  
Formula:  
  
$\Large \mathrm{MAE}=\frac{\sum_{i=1}^{n} {|y_i-\hat{y}_i|}}{n}$
  
where:  
  
$n$ is the number of samples  
$y_i$ is the true value for sample $i$  
$\hat{y}_i$ is the predicted value for sample $i$  
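
As a small worked example of the formula, the sketch below computes the MAE by hand in plain Python and contrasts it with the RMSE. The four true and predicted values are made up for illustration; one deliberately large error (6) shows how squaring pulls the RMSE above the MAE.

```python
import math

# Hypothetical true and predicted values
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 1.0]

# Absolute error for each sample: 1, 0, 2, 6
abs_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]

# MAE: sum of absolute errors divided by the sample size, (1 + 0 + 2 + 6) / 4
mae_value = sum(abs_errors) / len(abs_errors)

# RMSE: square root of the mean squared error, sqrt((1 + 0 + 4 + 36) / 4)
rmse_value = math.sqrt(sum(e ** 2 for e in abs_errors) / len(abs_errors))

print("MAE:  {0:.2f}".format(mae_value))
print("RMSE: {0:.2f}".format(rmse_value))
```

The single error of 6 accounts for most of the squared error, which is why the RMSE ends up noticeably larger than the MAE here.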
  
**Review prerequisites**
  
This process of generating a model, fitting, predicting, and then reviewing model accuracy will be repeated throughout this course. If you are unfamiliar with these steps, you should consider taking the prerequisite courses. They will go into more detail about using Python and performing these modeling steps.
  
**fivethirtyeight's candy dataset**
  
Throughout this course, we will use fivethirtyeight's ultimate Halloween candy power ranking dataset several times. This dataset contains 85 different candies, data on their various characteristics, and a column specifying how often that candy was selected in a head-to-head match-up with other candies. This column is a win-percentage and contains values between 0 and 100.
  
**Seen vs. unseen data**
  
Model validation's main goal is to ensure that a predictive model will perform as expected on new data. Obtaining predictions for training data (or seen data) and testing data (or unseen data) is coded in the same way and uses the `.predict()` method. Generally, models perform a lot better on data they have seen before, as unseen data may have features or characteristics that the model was never exposed to. If your training and testing errors are vastly different, it may be a sign that your model is overfitted. We will use model validation to make sure we get the best testing error possible.
  
<img src='../_images/scikit-learn-basic-modeling-schema2.png' text='alt text' width='740'>
  
**Let's begin!**
  
Let's see why model validation is so important by looking at an example of training and testing accuracies.

### Modeling steps
  
The process of using scikit-learn to create and test models has four steps, and you will use these four steps throughout this course.
  
Which of the following is **NOT** a valid method in the four-step scikit-learn model validation framework?
  
Possible Answers

- [ ] `.predict()`
- [ ] `.fit()`
- [x] `.validate()`
  
Correct! Validation is a technique all its own and is not completed with a `.validate()` method. You need to learn a few tools and techniques before you can validate a model.

### Explaining ML split-sets (in my own words)
  
NOTE: Remember that `X_train` is the "homework" for the model, and `y_train` is its "self-graded quiz". `X_test` is like the "exam questions" for the model, while `y_test` is like a professor's grade book, where the "correct exam answers" are kept. `y_pred`, also known as $\hat y$ (y hat), is acquired by letting the model take its "exam": it holds the model's answers to the exam questions (`X_test`).
  
Continuing, `X_train` holds the processed features that the model uses to learn its task, much as we humans do homework or read a book to prepare for a college class. We judge our performance and preparedness by benchmarking our knowledge against something; in this regard, `y_train` is like a self-graded quiz for the model. That in turn gives us a benchmark of where we stand leading up to exam day.  
  
Now, college classes usually have exams, and these exams are very important, right? They determine whether we pass or fail. In a similar manner, `X_test` is like an exam (or test): it gives the model an opportunity to be evaluated on external data that it has never been directly exposed to. As such, the `y_test` set is like the professor's grade book that contains the correct answers to the exam.
  
Lastly, we would like to see our exam grade, right? I mean, we studied (`X_train`) for the exam, we took self-graded quizzes (`y_train`) to see where we stood, and we took the exam (`X_test`). The only thing left is to compare our exam answers (`y_pred`) to the professor's answer key (`y_test`) to see what grade we made. In the same way, the model wants to see how *it* performed on the exam. The model read the "exam questions" (`X_test`) and produced its own answers to the "questions" we tasked it with; these answers are `y_pred`.
  
Going one step backwards in the process of creating ML split-sets, we can further conclude that the features which are represented by `X` consist of data that will be used for learning/interpreting, and the target represented by `y` is used as the evaluation or "grade".
  
I like to think about it this way because it is a good way to explain what we are trying to do with our model. I often get confused about which split should be used and when, and in those moments I refer back to this analogy. - Alexander Gursky
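
To make the four pieces of the analogy concrete, here is a minimal sketch of how they come out of `train_test_split()`. The arrays `X_all` and `y_all` are hypothetical stand-ins (100 observations, 4 features), built so that each row of features can be traced back to its target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: row i of X_all starts with 4 * i, and its target is i
X_all = np.arange(400).reshape(100, 4)
y_all = np.arange(100)

X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.25, random_state=1111
)

# "Homework" and its answers
print(X_train.shape, y_train.shape)

# "Exam questions" and the professor's answer key
print(X_test.shape, y_test.shape)

# Rows are shuffled, but each feature row stays paired with its own target
print((X_train[:, 0] == 4 * y_train).all())
```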

### Seen vs. unseen data
  
Models tend to have higher accuracy on observations they have seen before. In the candy dataset, predicting the popularity of Skittles will likely have higher accuracy than predicting the popularity of Andes Mints; Skittles is in the dataset, and Andes Mints is not.
  
You've built a model based on 50 candies using the dataset `X_train` and need to report how accurate the model is at predicting the popularity of the 50 candies the model was built on, and the 35 candies (`X_test`) it has never seen. You will use the mean absolute error, `mae()`, as the accuracy metric.
  
1. Using `X_train` and `X_test` as input data, create arrays of predictions using `model.predict()`.
2. Calculate model accuracy on both data the model has seen and data the model has not seen before.
3. Use the print statements to print the error on the seen and unseen data.

In [3]:
# Loading required dataframe
candy = pd.read_csv('../_datasets/candy-data.csv')
candy.head()

Unnamed: 0,competitorname,chocolate,fruity,caramel,peanutyalmondy,nougat,crispedricewafer,hard,bar,pluribus,sugarpercent,pricepercent,winpercent
0,100 Grand,1,0,1,0,0,1,0,1,0,0.732,0.86,66.971725
1,3 Musketeers,1,0,0,0,1,0,0,1,0,0.604,0.511,67.602936
2,One dime,0,0,0,0,0,0,0,0,0,0.011,0.116,32.261086
3,One quarter,0,0,0,0,0,0,0,0,0,0.011,0.511,46.116505
4,Air Heads,0,1,0,0,0,0,0,0,0,0.906,0.511,52.341465


In [7]:
# X/y split
X = candy.drop(['competitorname', 'winpercent'], axis=1)    # Taking all values except the names and win-percentages for X (Features)
y = candy['winpercent']                                     # The win-percentages only for y (Target)

# Displaying the features
X

Unnamed: 0,chocolate,fruity,caramel,peanutyalmondy,nougat,crispedricewafer,hard,bar,pluribus,sugarpercent,pricepercent
0,1,0,1,0,0,1,0,1,0,0.732,0.860
1,1,0,0,0,1,0,0,1,0,0.604,0.511
2,0,0,0,0,0,0,0,0,0,0.011,0.116
3,0,0,0,0,0,0,0,0,0,0.011,0.511
4,0,1,0,0,0,0,0,0,0,0.906,0.511
...,...,...,...,...,...,...,...,...,...,...,...
80,0,1,0,0,0,0,0,0,0,0.220,0.116
81,0,1,0,0,0,0,1,0,0,0.093,0.116
82,0,1,0,0,0,0,0,0,1,0.313,0.313
83,0,0,1,0,0,0,1,0,0,0.186,0.267


In [8]:
# Displaying the target
y

0     66.971725
1     67.602936
2     32.261086
3     46.116505
4     52.341465
        ...    
80    45.466282
81    39.011898
82    44.375519
83    41.904308
84    49.524113
Name: winpercent, Length: 85, dtype: float64

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error as mae
from sklearn.ensemble import RandomForestRegressor


# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)  # Train with 60%, test with 40%

# Model Instantiation
model = RandomForestRegressor(n_estimators=50)  # Number of trees in the forest, 50 trees

In [10]:
# The model is fit using the training sets, X_train and y_train
model.fit(X_train, y_train)

# Create vectors of predictions
train_predictions = model.predict(X_train)
test_predictions = model.predict(X_test)

# Train/Test Errors
train_error = mae(y_true=y_train, y_pred=train_predictions)
test_error = mae(y_true=y_test, y_pred=test_predictions)

# Print the error for seen and unseen data
print("Model error on seen data: {0:.2f}.".format(train_error))
print("Model error on unseen data: {0:.2f}.".format(test_error))

Model error on seen data: 3.66.
Model error on unseen data: 9.89.


Excellent. When models perform differently on training and testing data, you should look to model validation to ensure you have the best performing model. In the next lesson, you will start building models to validate.

## Regression models
  
Welcome to another lesson on model validation. There are two types of predictive models discussed in this course: models built for continuous variables, called regression models, and models built for categorical variables, called classification models. This lesson focuses purely on regression models, and more specifically, random forest regression models using scikit-learn.
  
**Random forests in scikit-learn**
  
Although this is not a machine learning course, it is important to understand the basic principles of the models we will be running and discussing. For that reason, we will stick with random forest models throughout this course, and only run random forest regression or random forest classification models. Both models have similar parameters and are called in the same way when using scikit-learn.
  
<img src='../_images/regression-models-random-forest-decision-trees.png' text='alt text' width='740'>
  
**Decision trees**
  
To understand random forest models, we should review decision trees. Decision trees look at various ways to split data until only a few or even a single observation remains. The splits may be categorically based, "Are you left-handed?", or continuously based, "What is your age?" A new observation will follow the tree based on its own data values until it reaches an end-node (called a leaf). In the given example, Bob, who is left-handed, 18 years old, and likes onions, would be predicted to be in $4,000 of debt if we followed this decision tree. The value in the end-node represents the average of all people in the training data who ended in that leaf.
  
<img src='../_images/regression-models-random-forest-decision-trees1.png' text='alt text' width='740'>
  
**Averaging decision trees**
  
Random forest regression models generate a bunch of different decision trees and use the mean prediction of the decision trees as the final value for a new observation. Here we created five decision trees. Their average prediction for Bob was $4,200 of debt.
  
<img src='../_images/regression-models-random-forest-decision-trees2.png' text='alt text' width='740'>
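
The averaging can be checked directly on a fitted forest. The sketch below assumes synthetic data and made-up names (`X_demo`, `y_demo`, `forest`); it fits five trees, collects each tree's prediction for one observation from the fitted `estimators_` list, and compares their mean with the forest's own prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic "debt" data (an assumption for illustration)
rng = np.random.default_rng(1111)
X_demo = rng.uniform(0, 1, size=(100, 3))
y_demo = 4000 + 1000 * X_demo[:, 0] + rng.normal(0, 50, size=100)

# A small forest of five trees
forest = RandomForestRegressor(n_estimators=5, random_state=1111)
forest.fit(X_demo, y_demo)

# One observation, kept two-dimensional because .predict() expects rows
new_observation = X_demo[:1]

# Prediction from each individual tree
tree_predictions = [tree.predict(new_observation)[0] for tree in forest.estimators_]
tree_average = np.mean(tree_predictions)
forest_prediction = forest.predict(new_observation)[0]

print("Individual trees:", np.round(tree_predictions, 1))
print("Mean of trees:    {0:.1f}".format(tree_average))
print("Forest prediction: {0:.1f}".format(forest_prediction))
```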
  
**Random forest parameters**
  
Although these algorithms have a lot of parameters, we will focus on only three. `n_estimators=` is the number of trees to create for the random forest. `max_depth=` is the maximum depth for these trees, or how many times we can split the data. It is also described as the maximum length from the beginning of a tree to the tree's end nodes. These two parameters alone can make a big impact on model accuracy. Lastly, `random_state=` allows us to create reproducible models. I will always use 1111 as my random state. If you ever see a different number, I promise I did not code that example! There are two ways to set these parameters. They can be set when `RandomForestRegressor()` is instantiated, which is the most common way of setting model parameters. They can also be set later, by assigning a new value to a model's attribute. The second method could be helpful when testing out different sets of parameters.
  
<img src='../_images/regression-models-random-forest-decision-trees3.png' text='alt text' width='740'>
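
Both ways of setting parameters can be sketched side by side. The variable names below are made up for illustration. The third variant, `.set_params()`, is not covered in the text above; it is included only as another option that scikit-learn estimators provide.

```python
from sklearn.ensemble import RandomForestRegressor

# Option 1: set the parameters when the model is instantiated
rfr_at_init = RandomForestRegressor(n_estimators=100, max_depth=6, random_state=1111)

# Option 2: create the model first, then assign new values to its attributes
rfr_by_attribute = RandomForestRegressor()
rfr_by_attribute.n_estimators = 100
rfr_by_attribute.max_depth = 6
rfr_by_attribute.random_state = 1111

# Option 3 (not discussed above): pass several parameters at once to set_params()
rfr_by_set_params = RandomForestRegressor().set_params(
    n_estimators=100, max_depth=6, random_state=1111
)

# All three models end up with the same parameter dictionary
same_params = (
    rfr_at_init.get_params()
    == rfr_by_attribute.get_params()
    == rfr_by_set_params.get_params()
)
print(same_params)
```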
  
**Feature importance**
  
After a model is created, we can assess how important different features (or columns) of the data were in the model by using the `.feature_importances_` attribute. If the data is a `pandas` DataFrame, `X`, we can access the column names and print the importance score quite easily. The larger this number is, the more important that column was in the model. In our example, we loop through the values from `.feature_importances_` and match the score to the column from `X`. The output tells us that `eye_color` is not that useful in our model, but the fact that someone is left-handed is highly important.
  
<img src='../_images/regression-models-random-forest-decision-trees4.png' text='alt text' width='740'>
  
**Let's begin**
  
Let's create a random forest regression model and look at its output.

### Set parameters and fit a model
  
Predictive tasks fall into one of two categories: regression or classification. In the candy dataset, the outcome is a continuous variable describing how often the candy was chosen over another candy in a series of 1-on-1 match-ups. To predict this value (the win-percentage), you will use a regression model.
  
In this exercise, you will specify a few parameters using a random forest regression model `rfr`.
  
1. Add a parameter to `rfr` so that the number of trees built is 100 and the maximum depth of these trees is 6.
2. Make sure the model is reproducible by adding a random state of 1111.
3. Use the `.fit()` method to train the random forest regression model with `X_train` as the input data and `y_train` as the response.

In [11]:
from sklearn.ensemble import RandomForestRegressor


# Model instantiation
rfr = RandomForestRegressor()

In [12]:
# Set the number of trees
rfr.n_estimators = 100

# Add a maximum depth
rfr.max_depth = 6

# Set the random state
rfr.random_state = 1111

# Fit the model
rfr.fit(X_train, y_train)

You have updated parameters after the model was initialized. This approach is helpful when you need to update parameters. Before making predictions, let's see which candy characteristics were most important to the model.

### Feature importances
  
Although some candy attributes, such as chocolate, may be extremely popular, it doesn't mean they will be important to model prediction. After a random forest model has been fit, you can review the model's attribute, `.feature_importances_`, to see which variables had the biggest impact. You can check how important each variable was in the model by looping over the feature importance array using `enumerate()`.
  
If you are unfamiliar with Python's `enumerate()` function, it can loop over a list while also creating an automatic counter.
  
1. Loop through the feature importance output of `rfr`.
2. Print the column names of `X_train` and the importance score for that column.

The `0` inside the curly braces refers to the index of the argument that will be passed to the `.format()` method. The `s` after the colon specifies that the argument at index `0` should be formatted as a string.  
  
The `1` inside the curly braces refers to the index of the argument that will be passed to the `.format()` method. The `.2f` after the colon specifies that the argument at index `1` should be formatted as a floating-point number with 2 decimal places.
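
As a quick illustration of `enumerate()` together with this format string, here is a sketch in plain Python. The column names and scores are hypothetical values chosen for the example, not output from a model.

```python
# Hypothetical column names and importance scores
columns = ["chocolate", "fruity", "sugarpercent"]
scores = [0.468, 0.031, 0.172]

lines = []
for i, item in enumerate(scores):
    # i is the automatic counter, item is the score at that position
    lines.append("{0:s}: {1:.2f}".format(columns[i], item))

print("\n".join(lines))
```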

In [13]:
# Print how important each column is to the model
for i, item in enumerate(rfr.feature_importances_):
    # Use i and item to print out the feature importance of each column
    print("{0:s}: {1:.2f}".format(X_train.columns[i], item))

chocolate: 0.47
fruity: 0.03
caramel: 0.03
peanutyalmondy: 0.03
nougat: 0.01
crispedricewafer: 0.04
hard: 0.02
bar: 0.02
pluribus: 0.03
sugarpercent: 0.17
pricepercent: 0.17


Well done. No surprise here - chocolate is the most important variable. `.feature_importances_` is a great way to see which variables were important to your random forest model.
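
An alternative to the loop is to wrap the scores in a `pandas` Series, which makes sorting easy. The sketch below assumes synthetic data in which one made-up column, `strong_feature`, drives the target far more than the others, so it should come out on top.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(1111)
X_demo = pd.DataFrame(
    rng.uniform(0, 1, size=(200, 3)),
    columns=["strong_feature", "weak_feature", "noise_feature"],
)
y_demo = (
    10 * X_demo["strong_feature"]
    + X_demo["weak_feature"]
    + rng.normal(0, 0.1, size=200)
)

rfr_demo = RandomForestRegressor(n_estimators=50, random_state=1111)
rfr_demo.fit(X_demo, y_demo)

# Pair each score with its column name, then sort from most to least important
importances = pd.Series(rfr_demo.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False).round(2))
```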

## Classification models
  
In this lesson, we switch from regression to classification models.
  
**Classification models**
  
This lesson focuses on reviewing classification models, or models built for when the response variable is categorical. Predicting a newborn's hair color, the winner of a basketball game, or the genre of the next song to come on the radio station are all examples of categorical responses, and we can build a classification model for each of them.
  
**The Tic-Tac-Toe dataset**
  
When looking at classification models during this course, we will primarily use the Tic-Tac-Toe end-game dataset. This dataset contains the complete set of possible configurations at the end of a game of Tic-Tac-Toe. Each of the first nine columns represents one of the nine squares of a Tic-Tac-Toe board. A "b" means the square is blank, an "x" represents player one, and an "o" is for player two. The final column indicates if the first player won (labeled positive) or not (labeled negative).
  
<img src='../_images/tic-tac-toe-classification.png' text='alt text' width='500'>
  
**Why this dataset**
  
The tic_tac_toe dataset is ideal for model validation because we have the complete set of outcomes. We can include as much, or as little, of this data in our models as we want. This allows us to really test how well the model is performing on unseen data. And if you just got an urge to play Tic-Tac-Toe, Google will play against you as long as you would like!
  
**Using `.predict()` for classification**
  
Several methods are shared across all scikit-learn models, but some are unique to the specific type of model. Before, we used the `.predict()` method to predict the value of new observations. scikit-learn's classifier, `RandomForestClassifier()`, also has the method `.predict()`. This time, the predicted class of each new observation is returned. We can also view how many observations were assigned to each class by turning the array of predictions into a `pandas` Series, and then using the method `.value_counts()`.
  
<img src='../_images/tic-tac-toe-classification1.png' text='alt text' width='750'>
  
**Predicting probabilities**
  
Another prediction method is `.predict_proba()`, which returns an array of predicted probabilities for each class. Sometimes in model validation, we want to know the probability values and not just the classification. Each entry of the array contains probabilities that sum to 1. For example, the second entry has values of 0.1 and 0.9. This indicates that for this data point, player one has a 10% chance of losing given the current game board, and a 90% chance of winning.
  
<img src='../_images/tic-tac-toe-classification2.png' text='alt text' width='750'>
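
The relationship between the two prediction methods can be sketched on a small synthetic problem. All names and data below are made up for illustration. Each row of the `.predict_proba()` output sums to 1, its columns follow the order of the fitted `classes_` attribute, and `.predict()` returns the class with the highest probability.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary data (an assumption for illustration)
rng = np.random.default_rng(1111)
X_demo = rng.integers(0, 2, size=(200, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] >= 1).astype(int)

# Train on the first 150 rows, predict on the remaining 50
rfc_demo = RandomForestClassifier(n_estimators=50, random_state=1111)
rfc_demo.fit(X_demo[:150], y_demo[:150])

probabilities = rfc_demo.predict_proba(X_demo[150:])
classes = rfc_demo.predict(X_demo[150:])

print("Column order:", rfc_demo.classes_)
print("First three rows of probabilities:")
print(probabilities[:3])
print("First three class predictions:", classes[:3])

# The predicted class is the one with the largest probability in each row
most_likely = rfc_demo.classes_[probabilities.argmax(axis=1)]
print((classes == most_likely).all())
```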
  
**Methods continued**
  
Finally, we introduce two additional methods. The first method, `.get_params()`, is used to review which parameters went into a scikit-learn model. It returns a dictionary of parameters and their values, allowing us to see exactly which parameters were used. Knowing these parameters is essential when assessing model quality, rerunning models, and even parameter tuning. The final method we will introduce is `.score()`. It is a quick way to look at the overall accuracy of the classification model. Accuracy measures will be discussed more in chapter two, but basically, this method returns a value from 0 to 1: the fraction of observations that were correctly labeled. In this example, almost 90% of games were correctly predicted by our model.
  
<img src='../_images/tic-tac-toe-classification3.png' text='alt text' width='750'>
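
A minimal sketch of both methods, assuming synthetic data and made-up names. For a classifier, `.score()` gives the same number as computing the accuracy of `.predict()` against the true labels, which the last lines check with `accuracy_score()`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary data (an assumption for illustration)
rng = np.random.default_rng(1111)
X_demo = rng.integers(0, 2, size=(300, 6))
y_demo = (X_demo[:, :3].sum(axis=1) >= 2).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1111
)

rfc_demo = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)
rfc_demo.fit(X_train, y_train)

# get_params() returns a dictionary, so individual values can be looked up
params = rfc_demo.get_params()
print(params["n_estimators"], params["max_depth"], params["random_state"])

# score() is the fraction of test observations that were labeled correctly
score = rfc_demo.score(X_test, y_test)
manual_accuracy = accuracy_score(y_test, rfc_demo.predict(X_test))
print("score(): {0:.3f}, accuracy_score(): {1:.3f}".format(score, manual_accuracy))
```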
  
**Let's classify Tic-Tac-Toe end-game scenarios**
  
Now that we have had an introduction to random forest classification models, let's work through a couple of exercises.

### Classification predictions
  
In model validation, it is often important to know more about the predictions than just the final classification. When predicting who will win a game, most people are also interested in how likely it is a team will win.
  
<table>
  <tr>
    <th>Probability</th>
    <th>Prediction</th>
    <th>Meaning</th>
  </tr>
  <tr>
    <td>x &lt; 0.50</td>
    <td>0</td>
    <td>Team Loses</td>
  </tr>
  <tr>
    <td>x &gt; 0.50</td>
    <td>1</td>
    <td>Team Wins</td>
  </tr>
</table>
  
In this exercise, you look at the methods, `.predict()` and `.predict_proba()` using the `tic_tac_toe` dataset. The first method will give a prediction of whether Player One will win the game, and the second method will provide the probability of Player One winning. Use `rfc` as the random forest classification model.
  
1. Create two arrays of predictions. One for the classification values and one for the predicted probabilities.
2. Use the `.value_counts()` method for a `pandas` Series to print the number of observations that were assigned to each class.
3. Print the first observation of `probability_predictions` to see how the probabilities are structured.

In [16]:
# Loading df and viewing
tic_tac_toe = pd.read_csv('../_datasets/tic-tac-toe.csv')
tic_tac_toe.head()

Unnamed: 0,Top-Left,Top-Middle,Top-Right,Middle-Left,Middle-Middle,Middle-Right,Bottom-Left,Bottom-Middle,Bottom-Right,Class
0,x,x,x,x,o,o,x,o,o,positive
1,x,x,x,x,o,o,o,x,o,positive
2,x,x,x,x,o,o,o,o,x,positive
3,x,x,x,x,o,o,o,b,b,positive
4,x,x,x,x,o,o,b,o,b,positive


In [19]:
# X/y split
y = tic_tac_toe['Class'].apply(lambda x: 1 if x == 'positive' else 0)  # Select the target and encode it as binary integers (1 = positive, 0 = negative)
X = tic_tac_toe.drop('Class', axis=1)  # Extract all features except the target
X = pd.get_dummies(X)  # Creating numeric dummy values in place of the elements

# Display
pd.set_option('display.max_columns', 50)  # Display up to 50 columns, overwriting the standard display
print(X.shape)
X.head()

(958, 27)


Unnamed: 0,Top-Left_b,Top-Left_o,Top-Left_x,Top-Middle_b,Top-Middle_o,Top-Middle_x,Top-Right_b,Top-Right_o,Top-Right_x,Middle-Left_b,Middle-Left_o,Middle-Left_x,Middle-Middle_b,Middle-Middle_o,Middle-Middle_x,Middle-Right_b,Middle-Right_o,Middle-Right_x,Bottom-Left_b,Bottom-Left_o,Bottom-Left_x,Bottom-Middle_b,Bottom-Middle_o,Bottom-Middle_x,Bottom-Right_b,Bottom-Right_o,Bottom-Right_x
0,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,0,1,0,0,0,1,0,1,0,0,1,0
1,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,0,1,0,0,1,0,0,0,1,0,1,0
2,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,0,1,0,0,1,0,0,1,0,0,0,1
3,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,0,1,0,0,1,0,1,0,0,1,0,0
4,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,0,1,0,1,0,0,0,1,0,1,0,0
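
The 27 columns come from 9 squares with 3 possible values each. A small sketch of what `pd.get_dummies()` does, using a hypothetical three-row frame with only two of the nine squares: every value that appears in a column becomes its own indicator column, named after the original column and the value.

```python
import pandas as pd

# Hypothetical mini-board with only two squares, for illustration
boards = pd.DataFrame({
    "Top-Left": ["x", "o", "b"],
    "Top-Middle": ["o", "x", "x"],
})

encoded = pd.get_dummies(boards)

# One indicator column per (square, value) pair seen in the data
print(list(encoded.columns))
print(encoded.shape)
```

Note that "Top-Middle" never contains a blank in this toy frame, so no `Top-Middle_b` column is created; an indicator column only exists for values that actually occur.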


In [20]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier


# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8)

# Model instantiation
rfc = RandomForestClassifier()

In [27]:
# Fit the rfc model
rfc.fit(X_train, y_train)

# Create arrays of predictions
classification_predictions = rfc.predict(X_test)
probability_predictions = rfc.predict_proba(X_test)

# Print out count of binary predictions
print(pd.Series(classification_predictions).value_counts())

# Print the first value from probability_predictions
print('The first predicted probabilities are (no,yes): {}'.format(probability_predictions[0]))

1    574
0    193
dtype: int64
The first predicted probabilities are (no,yes): [0.39 0.61]


You can see there were 574 observations where Player One was predicted to win the Tic-Tac-Toe game. Also, note that each row of the `probability_predictions` array contains only two values because you only have two possible responses (win or lose). Remember these two methods, as you will use them a lot throughout this course.

### Reusing model parameters
  
Replicating model performance is vital in model validation. Replication is also important when sharing models with co-workers, reusing models on new data or asking questions on a website such as Stack Overflow. You might use such a site to ask other coders about model errors, output, or performance. The best way to do this is to replicate your work by reusing model parameters.
  
In this exercise, you use various methods to recall which parameters were used in a model.
  
1. Print out the characteristics of the model `rfc` by simply printing the model.
2. Print just the random state of the model.
3. Print the dictionary of model parameters.

In [29]:
# Model instantiation
rfc = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)

# Print the classification model
print(rfc)

# Print the classification model's random state parameter
print('The random state is: {}'.format(rfc.random_state))

# Print all parameters
print('Printing the parameters dictionary: {}'.format(rfc.get_params()))

RandomForestClassifier(max_depth=6, n_estimators=50, random_state=1111)
The random state is: 1111
Printing the parameters dictionary: {'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': 6, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 50, 'n_jobs': None, 'oob_score': False, 'random_state': 1111, 'verbose': 0, 'warm_start': False}


Recalling which parameters were used will be helpful going forward. Model validation and performance rely heavily on which parameters were used, and there is no way to replicate a model without keeping track of the parameters used!
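
One way to put the parameter dictionary to work is to rebuild a model from it. The sketch below is an illustration under stated assumptions, not a prescribed workflow: the data is synthetic, and the names `original`, `replica`, and `saved_params` are made up. Because the random state is part of the saved parameters, the rebuilt model makes the same predictions when fit on the same data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# The model whose parameters we want to keep track of
original = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)
saved_params = original.get_params()

# Rebuild an identical, unfitted model by unpacking the saved dictionary
replica = RandomForestClassifier(**saved_params)

# Synthetic binary data (an assumption for illustration)
rng = np.random.default_rng(1111)
X_demo = rng.integers(0, 2, size=(120, 4))
y_demo = (X_demo.sum(axis=1) >= 2).astype(int)

original.fit(X_demo, y_demo)
replica.fit(X_demo, y_demo)

# Same parameters, same data, same random state: same predictions
same_predictions = np.array_equal(original.predict(X_demo), replica.predict(X_demo))
print(same_predictions)
```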

### Random forest classifier
  
This exercise reviews the four modeling steps discussed throughout this chapter using a random forest classification model. You will: Create a random forest classification model. Fit the model using the `tic_tac_toe` dataset. Make predictions on whether Player One will win (1) or lose (0) the current game. Finally, you will evaluate the overall accuracy of the model.
  
Let's get started!
  
1. Create `rfc` using the scikit-learn implementation of random forest classifiers and set a random state of 1111.
2. Fit `rfc` using `X_train` for the training data and `y_train` for the responses.
3. Predict the class values for `X_test`.
4. Use the method `.score()` to print an accuracy metric for `X_test` given the actual values `y_test`.

In [38]:
from sklearn.ensemble import RandomForestClassifier


# Create a random forest classifier
rfc = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)

# Fit rfc using X_train and y_train
rfc.fit(X_train, y_train)

# Create predictions on X_test
predictions = rfc.predict(X_test)
print(predictions[0:5])  # You can see that you can index the array if you want, this shows the first 5 predictions
print(predictions)

# Print model accuracy using score() and the testing data
print(rfc.score(X_test, y_test))

[1 1 0 1 1]
[1 1 0 1 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 0 1 1 1 0 1 0 1 1 1 1 0 1 1 1 0 1
 0 1 1 1 1 1 0 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 0 1 0 1 1
 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 1 1 0 0 0 1 0 1 1 1 0 1
 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
 0 1 0 0 1 1 1 1 1 0 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 0
 1 1 0 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1
 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1
 1 0 1 0 0 0 1 0 1 0 1 1 1 1 1 0 0 0 0 1 1 1 0 0 0 1 1 0 1 0 0 1 1 1 1 0 1
 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 0 1 0 1 0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1
 0 1 1 1 1 1 1 0 1 0 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 0 1 1 0 0 1 0 1 1
 1 1 1 1 0 1 1 1 1 1 0 1 0 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 0
 1 0 1 0 1 1 1 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 0 0 1 1
 1 1 1 1 0 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 0
 1 1 1 1 1 1 

That's all the steps! In the output above, four of the first five predictions are 1, indicating that Player One is predicted to win those four games and to lose the third. You also see the model accuracy was only 82%. 
  
Let's move on to Chapter 2 and increase our model validation toolbox by learning about splitting datasets, standard accuracy metrics, and the bias-variance tradeoff.