## Pipeline: Fit a basic model

This notebook uses the Titanic dataset from [this](https://www.kaggle.com/c/titanic/overview) Kaggle competition.

In this section, we will fit and evaluate a basic model using 5-fold Cross-Validation.

### Read in data & create train/validation/test set

![Initial Model](img/fit_model.png)

_Welcome back. In this lesson, we're going to fit a single, basic model using Cross-Validation._

_Remember that this phase is still going to take place on **only** the training set. So we will import all of the packages we need: `RandomForestClassifier`, which is the algorithm we'll be using; `cross_val_score`, which will aid us in our Cross-Validation; and the same `train_test_split` method we have used previously. We will import our data and then split it into train, validation, and test sets._

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

titanic = pd.read_csv('../titanic_cleaned.csv')

features = titanic.drop('Survived', axis=1)
labels = titanic['Survived']

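# Hold out 40% of the rows, then split that holdout in half:
# the result is 60% train, 20% validation, 20% test.
X_train, X_holdout, y_train, y_holdout = train_test_split(features, labels, test_size=0.4, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_holdout, y_holdout, test_size=0.5, random_state=42)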
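
As a quick sanity check on the two-step split above, the sketch below repeats the same pattern on synthetic stand-in data and prints the share of rows that lands in each set. The random arrays and the `X_holdout`/`y_holdout` names are assumptions made for illustration; they are not part of the lesson's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1,000 rows, 3 features, binary labels
rng = np.random.default_rng(42)
features = rng.normal(size=(1000, 3))
labels = rng.integers(0, 2, size=1000)

# Step 1: hold out 40% of the rows
X_train, X_holdout, y_train, y_holdout = train_test_split(
    features, labels, test_size=0.4, random_state=42
)
# Step 2: split the holdout in half to get the test and validation sets
X_test, X_val, y_test, y_val = train_test_split(
    X_holdout, y_holdout, test_size=0.5, random_state=42
)

# Report the share of rows in each set
for name, part in [("train", y_train), ("validation", y_val), ("test", y_test)]:
    print(name, round(len(part) / len(labels), 2))
```

With these settings the three sets should come out at roughly 60%, 20%, and 20% of the rows.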

### Fit and evaluate a basic model using 5-fold Cross-Validation

![CV Image](img/CV_image.png)

_As a refresher, review this diagram explaining what Cross-Validation is. 5-fold Cross Validation will split our training set into 5 sections and then it will iterate through the data 5 times, each time fitting a model on 4 sections and evaluating the model on the 5th section. In the end, each section and thus every example will have been used to train a model 4 times and evaluate a model once. Then at the end we will get an array with all of our model scores in it for each of the five folds._
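
To make the diagram concrete, here is a rough sketch of the loop that 5-fold Cross-Validation performs, written out by hand on synthetic data. It is an approximation for illustration, not the library's internal code. It uses `StratifiedKFold` because scikit-learn uses stratified folds when `cross_val_score` is given a classifier and an integer `cv`. The synthetic data, the `n_estimators=25` setting, and names such as `fit_idx` and `times_evaluated` are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for a training set
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

skf = StratifiedKFold(n_splits=5)
fold_scores = []
times_evaluated = np.zeros(len(y), dtype=int)  # how often each row is held out

for fit_idx, eval_idx in skf.split(X, y):
    # Fit a fresh model on 4 sections...
    model = RandomForestClassifier(n_estimators=25, random_state=42)
    model.fit(X[fit_idx], y[fit_idx])
    # ...and score it (accuracy) on the remaining section
    fold_scores.append(model.score(X[eval_idx], y[eval_idx]))
    times_evaluated[eval_idx] += 1

fold_scores = np.array(fold_scores)
print(fold_scores)                    # one accuracy per fold
print((times_evaluated == 1).all())   # every row is evaluated exactly once
```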

_So the first thing we do is instantiate our `RandomForestClassifier` and store it as `rf`. I'll note that we **could** pass in hyperparameters here, but we will choose to leave that empty for now (we'll get to that in the next lesson), so it will just use the default hyperparameter settings._
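
If you are curious which defaults are in effect, one way to look is `get_params()`. The sketch below compares a default model with one that sets two hyperparameters explicitly. The values `n_estimators=50` and `max_depth=10` are arbitrary, chosen only for illustration, and the defaults printed may differ between scikit-learn versions.

```python
from sklearn.ensemble import RandomForestClassifier

rf_default = RandomForestClassifier()
# Illustrative values only, not a recommendation
rf_custom = RandomForestClassifier(n_estimators=50, max_depth=10)

default_params = rf_default.get_params()
custom_params = rf_custom.get_params()

# Compare two of the settings side by side
for name in ("n_estimators", "max_depth"):
    print(f"{name}: default={default_params[name]}, custom={custom_params[name]}")
```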

_Then the only thing we have to do is call `cross_val_score`. What this method expects is for you to pass in a model object (so `rf`), your features (so `X_train`), your labels (so `y_train`), and lastly to tell it how many folds you want. So we will pass in 5, and then store the result as `scores`._

In [2]:
rf = RandomForestClassifier()
scores = cross_val_score(rf, X_train, y_train, cv=5)



_Now, this `cross_val_score` method will handle everything in the diagram above under the hood. As you can see, it's **really** easy to use. Now let's just print out the scores._

_You can see that the average is right around 80% accuracy, but the great thing about this is that it shows you the score when training and evaluating on different sections of the data, so we can check that overfitting to some subset of the data isn't distorting our scores. So we can say around 80% accuracy, but depending on the subset trained on, that could be as low as 77% or as high as 82%, so it gives us a nice range._

_Now, for instance, if we had fit the model on only one training set, maybe we would get an accuracy of 77%, but we would have no idea what a reasonable range might be, and we wouldn't know if it was just overfitting to this one particular subset of data. Perhaps 77% is the average, but it could go as high as 85% or as low as 70%. We really wouldn't know. That is just one of the benefits of Cross-Validation._

In [3]:
scores

array([0.82407407, 0.82242991, 0.78504673, 0.77358491, 0.82075472])
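
Rather than estimating the average and range by eye, the scores can be summarized directly. The sketch below copies the five values printed above into a NumPy array so it runs on its own; in the notebook you could call the same methods on `scores`.

```python
import numpy as np

# The five fold scores printed above
scores = np.array([0.82407407, 0.82242991, 0.78504673, 0.77358491, 0.82075472])

print(f"mean: {scores.mean():.3f}")  # 0.805
print(f"std:  {scores.std():.3f}")   # 0.021
print(f"min:  {scores.min():.3f}")   # 0.774
print(f"max:  {scores.max():.3f}")   # 0.824
```

This matches the narration: roughly 80% accuracy on average, with individual folds ranging from about 77% to 82%.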