**Regression** problems involve the prediction of a continuous, numeric value from a set of characteristics.

In this example, we'll build a model to predict house prices from characteristics like the number of rooms and the crime rate at the house location.

## Reading data

We'll be using the **pandas** package to read data.

Pandas is an open-source library that reads formatted data files into tabular structures that Python scripts can process.

In [None]:
# Make sure you have a working installation of pandas by executing this cell
import pandas as pd

In this exercise, we'll use the [Boston Housing dataset](http://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html) to predict house prices from characteristics like the number of rooms and distance to employment centers.

In [None]:
# Read 'datasets/boston.csv' with pandas


Pandas allows reading our data from different file formats and sources. See [this link](http://pandas.pydata.org/pandas-docs/stable/io.html) for a list of supported operations.
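
As a minimal sketch of how reading works, the snippet below parses CSV text held in memory instead of a file on disk. The column names and values are made up for illustration and are not taken from the Boston Housing dataset; `read_csv` accepts a file path or a file-like object in the same way.

```python
import io

import pandas as pd

# Hypothetical CSV content, used only for illustration
csv_text = """rooms,price
5,20.5
6,24.0
7,33.1
"""

# read_csv accepts a path or any file-like object, such as StringIO
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())    # first rows of the table
print(df.dtypes)    # column types inferred by pandas
```

Column types are inferred from the values, so it is worth checking them right after loading.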

In [None]:
# Use the head() method to print the first five entries in the dataset


In [None]:
# Use the info() method to print information about the dataset


[This link](http://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html) describes the meaning of each column in the Boston Housing dataset.

In [None]:
# Use the describe() method to print summary statistics of the dataset


Pandas is a powerful library for manipulating datasets.
[This repository](https://github.com/guipsamora/pandas_exercises) contains several pandas exercises covering topics such as data filtering, grouping and sorting.

## Visualizing data

After reading our data into a pandas DataFrame and getting a broader view of the dataset, we can build charts to visualize the "shape" of the data.

We'll use Python's *Matplotlib* library to create these charts.

### An example

Suppose you're given the following information about four datasets:

In [None]:
datasets = pd.read_csv('datasets/anscombe.csv')

# Select each of the four datasets in turn by its Source value (1 to 4)
for i in range(1, 5):
    dataset = datasets[datasets.Source == i]
    print('Dataset {} (X, Y) mean: {}'.format(i, (dataset.x.mean(), dataset.y.mean())))

print('\n')
for i in range(1, 5):
    dataset = datasets[datasets.Source == i]
    print('Dataset {} (X, Y) std deviation: {}'.format(i, (dataset.x.std(), dataset.y.std())))

print('\n')
for i in range(1, 5):
    dataset = datasets[datasets.Source == i]
    print('Dataset {} correlation between X and Y: {}'.format(i, dataset.x.corr(dataset.y)))

They all have roughly the same means, standard deviations and correlations. How similar are they?

![](https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Anscombe%27s_quartet_3.svg/638px-Anscombe%27s_quartet_3.svg.png)

This dataset is known as [Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe's_quartet), and it illustrates how misleading it can be to rely only on summary statistics to characterize a dataset.

### Now back to our dataset...

In [None]:
import matplotlib.pyplot as plt
# This line makes the graphs appear as cell outputs rather than in a separate window or file.
%matplotlib inline

In [None]:
# Extract the house prices and average number of rooms to two separate variables
prices =
rooms =

# Create a scatterplot of these two properties using plt.scatter()

# Specify labels for the X and Y axis

# Show graph


In [None]:
# Extract the house prices and average number of rooms to two separate variables

# Create a scatterplot of these two properties using plt.scatter()

# Specify labels for the X and Y axis

# Show graph


Matplotlib is one of the most well-known libraries in Python for the creation of plots and charts (although [Seaborn](http://seaborn.pydata.org/) is gaining a lot of traction and is also worth taking a look at). 
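
A minimal scatter plot sketch follows, using synthetic data rather than the Boston Housing columns. The variable names and axis labels are placeholders chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data for illustration only (not the Boston Housing values)
rng = np.random.default_rng(0)
x_demo = rng.uniform(4, 9, size=50)
y_demo = 5 * x_demo + rng.normal(0, 3, size=50)

# Create a figure with one set of axes and draw the points on it
fig, ax = plt.subplots()
ax.scatter(x_demo, y_demo)
ax.set_xlabel("x (synthetic feature)")
ax.set_ylabel("y (synthetic target)")
# In a notebook with %matplotlib inline the figure renders at the end of the cell;
# in a plain script, call plt.show() to display it.
```

The `plt.scatter()`, `plt.xlabel()` and `plt.ylabel()` functions do the same thing on the current axes, which is often more convenient for quick notebook cells.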

## Predicting house prices

The previous graphs show that some features have a roughly linear relationship to the house prices. We'll use [Scikit-Learn's LinearRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) to model this data and predict house prices from other information.

The example below builds a LinearRegression model using the average number of rooms to predict house prices:

In [None]:
from sklearn.linear_model import LinearRegression

x =  # extract the values of the average number of rooms (rm column)
y =  # extract the values of the house prices (column medv)

lr = # fit a LinearRegression model

# Print the predicted price of a house with six rooms


In [None]:
# Show the linear regression line
# (assumes x is a 2-D array of shape (n_samples, 1), as LinearRegression expects)
predicted = lr.predict(x)
plt.scatter(x, y)
plt.plot(x, predicted, color='red')
plt.show()
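
A single-feature fit like the one in this exercise can be tried on synthetic data first. The sketch below assumes noise-free data following y = 3x + 2, so the fitted coefficients are easy to check. All names here are illustrative and unrelated to the housing dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data following y = 3x + 2 (illustration only)
x_demo = np.arange(1, 11, dtype=float).reshape(-1, 1)  # 2-D, shape (10, 1)
y_demo = 3 * x_demo.ravel() + 2

# fit() expects a 2-D feature array, even with a single feature
model = LinearRegression().fit(x_demo, y_demo)

# predict() also expects a 2-D array, even for a single sample
pred = model.predict(np.array([[6.0]]))[0]
print(model.coef_[0], model.intercept_, pred)
```

Scikit-learn estimators expect features with shape `(n_samples, n_features)`, which is why a single column is reshaped with `reshape(-1, 1)`.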

We'll now use all the features in the dataset to predict house prices.

Let's start by splitting our data into a *training* set and a *validation* set. The training set will be used to train our linear model; the validation set, on the other hand, will be used to assess how accurate our model is.
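
The split-train-evaluate workflow can be sketched on synthetic data as follows. The feature matrix, the true coefficients, the 25% validation fraction and the `random_state` values are all assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: three features, a linear target and a little noise
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, size=200)

# Hold out 25% of the rows for validation; random_state makes the split repeatable
X_train, X_val, y_train, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Fit on the training rows only, then score on rows the model has not seen
model = LinearRegression().fit(X_train, y_train)
mse = mean_squared_error(y_val, model.predict(X_val))
print(mse)
```

Scoring on held-out rows gives an estimate of how the model behaves on new data, which the training error alone tends to overstate.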

In [None]:
X =  # Extract all columns except the house price by dropping the 'medv' column from the dataset
y =  # extract the values of the house prices (column medv)

In [None]:
# Use sklearn's train_test_split() function to split our data into two sets.
# See http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
from sklearn.model_selection import train_test_split

Xtr, Xts, ytr, yts = 

In [None]:
# Use the training set to build a LinearRegression model
lr = 

In [None]:
# Use the validation set to assess the model's performance.
# See http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html
from sklearn.metrics import mean_squared_error


What kind of enhancements could be done to get better results?