# Getting a Supervised Learning Model to Run

This code trains a regression model to predict prices in the Ames housing dataset and evaluates that model against a test set:

In [11]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

ames_df = pd.read_csv('../assets/data/ames_train.csv')

X = ames_df.drop('SalePrice', axis='columns')
X = X.select_dtypes(['int64', 'float64']).dropna(axis='columns')
y = ames_df.loc[:, 'SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

rfc = RandomForestRegressor(n_estimators=200)
rfc.fit(X_train, y_train)
rfc.score(X_test, y_test)

0.8985422257627853

- Retype the code above, and make sure it runs successfully. Do NOT copy and paste -- actually typing the code will help you get used to the syntax and will force you to think more about what is happening. See how much of it you can understand. It's OK if much of this code is mysterious, but if you have the time you might try Googling some parts of it that are unclear.
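
If the feature-selection lines or the final `score` call are among the mysterious parts, the sketch below may help. It repeats the same steps on a small made-up table, so the column names and values are assumptions for illustration only and not taken from the Ames data. It shows which columns survive `select_dtypes` and `dropna`, and that `score` on a regressor returns the R² value that `r2_score` computes.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Toy stand-in for the housing data (all values are made up).
rng = np.random.default_rng(0)
n_rows = 200
toy_df = pd.DataFrame({
    "LotArea": rng.integers(2000, 12000, size=n_rows),  # int64 column
    "Quality": rng.uniform(1, 10, size=n_rows),         # float64 column
    "Street": ["Pave"] * n_rows,                        # text column
    "Frontage": [np.nan] + [60.0] * (n_rows - 1),       # one missing value
})
toy_df["SalePrice"] = 20 * toy_df["LotArea"] + 15000 * toy_df["Quality"]

# Same preparation steps as in the cell above.
X = toy_df.drop("SalePrice", axis="columns")
X = X.select_dtypes(["int64", "float64"])  # keeps numeric columns, drops "Street"
X = X.dropna(axis="columns")               # drops "Frontage" (it has a NaN)
y = toy_df.loc[:, "SalePrice"]

kept_columns = list(X.columns)
print(kept_columns)

# By default, train_test_split holds out 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# For a regressor, score() is the R^2 of the predictions on the given data.
score = model.score(X_test, y_test)
same_as_r2 = r2_score(y_test, model.predict(X_test))
print(score, same_as_r2)
```

Note that `dropna(axis='columns')` discards an entire column if even one value is missing, which keeps this first model simple at the cost of throwing away some information.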

*Tips:*

If you get an error, scroll to the bottom of the error message to see a description of the error and an arrow pointing to the line of code that triggered the error. Sometimes the actual mistake is on the line above the line that triggers the error (e.g. when you fail to close some set of parentheses), so look at the previous line if you can't find the problem.
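
As a rough illustration of the unclosed-parenthesis case, the sketch below compiles a deliberately broken two-line snippet and prints where Python reports the problem. The snippet is hypothetical. Which line gets reported depends on the Python version: recent versions tend to point at the line with the opening parenthesis, while older ones tend to point at the line after it.

```python
import traceback

# Hypothetical snippet with a deliberate mistake: the parenthesis opened on
# line 1 is never closed.
broken_source = "total = (1 + 2\nprint(total)\n"

try:
    compile(broken_source, "<example>", "exec")
except SyntaxError as err:
    reported_line = err.lineno
    # The last line of the formatted error holds the description.
    description = traceback.format_exception_only(type(err), err)[-1].strip()
    print(f"Reported on line {reported_line}: {description}")
```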
 
Options if you are stuck on an error:

- Google the error message.
- Post your code and your error on our discussion channel.
- Fall back to copy-and-pasting the line that is triggering the error.

- Copy and paste the code you just retyped into the cell below, and then make a few changes:
    - Instead of the Ames housing data at '../assets/data/ames_train.csv', load phone churn data at '../assets/data/churn.csv'.
    - Instead of `ames_df`, call the dataset `churn_df`.
    - Instead of SalePrice, predict the variable "Churn?". You will need the line `y = churn_df.loc[:, 'Churn?'].astype(int)`.
    - Instead of a `RandomForestRegressor` model, use a `RandomForestClassifier`.
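
To see what `.astype(int)` does to the target, here is a minimal sketch on a made-up Series. It assumes the "Churn?" column holds booleans, which is an assumption about the file rather than something stated here.

```python
import pandas as pd

# Hypothetical stand-in for the "Churn?" column, assumed to hold booleans.
churn_flags = pd.Series([False, True, False, False], name="Churn?")

# False becomes 0 and True becomes 1.
y = churn_flags.astype(int)

print(y.tolist())
print(y.mean())  # share of rows equal to 1, i.e. the churn rate in this toy Series
```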

**Note:** You should get a score of about 91.5%. This score is the model's accuracy -- that is, the proportion of test cases it predicted correctly (about 0.915 on a 0-to-1 scale). About 85.5% of customers did not churn, so a model that always predicted "no churn" would already score about 85.5%. Anything above 85.5% indicates that your model is successfully learning something about the relationship between the features and the target.
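
The baseline comparison can be sketched as follows. The labels are made up for illustration (17 of 20 customers did not churn), and `DummyClassifier` is used here only as one convenient way to get the "always guess the most common class" score; it is not part of the exercise.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Toy labels (made up): 17 customers stayed (0), 3 churned (1).
y_toy = pd.Series([0] * 17 + [1] * 3)
X_toy = pd.DataFrame({"feature": range(len(y_toy))})

# Share of the most common class = accuracy of always guessing that class.
majority_share = y_toy.value_counts(normalize=True).max()

# DummyClassifier with strategy="most_frequent" ignores the features and
# always predicts the most common label seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)

print(majority_share, baseline_accuracy)
```

A real model is only interesting if its test accuracy clears this baseline.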

In [22]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

churn_df = pd.read_csv('../assets/data/churn.csv')

X = churn_df.drop('Churn?', axis='columns')
X = X.select_dtypes(['int64', 'float64']).dropna(axis='columns')

y = churn_df.loc[:, 'Churn?'].astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

rfc = RandomForestClassifier(n_estimators=200)
rfc.fit(X_train, y_train)
rfc.score(X_test, y_test)

0.919664268585132

**CHALLENGE.** See how far you can get in repeating these steps on a dataset from [Kaggle](https://www.kaggle.com/). If you get stuck, read the error message (bottom to top), check out the Kaggle forums for that dataset, consult Google, and/or ask for help on our discussion channel. It's totally fine if you get stuck!
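
One possible way to organize the repeated steps for a new dataset is to wrap them in a function. The sketch below is only a starting point: `quick_score` is a hypothetical helper name, the toy table is made up, and a real Kaggle dataset will likely need extra cleaning that this does not handle.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def quick_score(df, target, model, random_state=7):
    """Hypothetical helper: fit `model` on the numeric, fully populated
    columns of `df` and return its score on a held-out test set."""
    X = df.drop(target, axis="columns")
    X = X.select_dtypes(["int64", "float64"]).dropna(axis="columns")
    y = df.loc[:, target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=random_state
    )
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


# Toy data (made up): the label is 1 whenever `usage` exceeds 50.
toy_df = pd.DataFrame({"usage": list(range(100)), "plan": ["basic"] * 100})
toy_df["label"] = (toy_df["usage"] > 50).astype(int)

toy_score = quick_score(
    toy_df, "label", RandomForestClassifier(n_estimators=50, random_state=0)
)
print(toy_score)
```

Pass a `RandomForestRegressor` instead when the target is a number to predict rather than a category.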