# Practical 1: Load and display a dataset using Python

In this practical, we will load a dataset in .csv format that describes house prices in Boston. It is a small dataset with only 506 cases, but it is a good illustration of how Python can be used to load a dataset. The data was originally published in Harrison, D. and Rubinfeld, D.L., "Hedonic prices and the demand for clean air", J. Environ. Economics & Management, vol. 5, pp. 81-102, 1978.

## Download data
A copy of the .csv data is already included if you git clone this repository. The .csv format is a plain-text format for tabular data, which means you can open it with a spreadsheet program such as Microsoft Excel or LibreOffice.

## Import libraries
The pandas library is used for loading datasets and for data analysis. The matplotlib library is used for data visualisation. The sklearn (scikit-learn) library is used for linear regression.

Importing the libraries is the first step we will take in the lesson.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

## Load data
The dataset is in .csv format, so we use the read_csv() function from pandas to load it. Then we display the size of the dataset (rows, columns) and its first five rows.

In [None]:
df = pd.read_csv('BostonHousing.csv')
print(df.shape)
print(df.head())
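To see what read_csv() produces without depending on the file, here is a minimal sketch that parses a tiny inline CSV. The three rows and the name `sample` are invented for illustration and are not taken from BostonHousing.csv.

```python
import io

import pandas as pd

# Hypothetical three-row sample; the values are made up for illustration
csv_text = """crim,rm,medv
0.1,6.0,24.0
0.2,6.5,30.0
0.3,5.5,18.0
"""

# read_csv() accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # data type inferred for each column
```

The first line of the CSV text becomes the column names, and each following line becomes one row.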

# Dataset
Each row is one case, with its house price and attributes. There are 506 cases in total. Each column is an attribute. There are 14 attributes:

**crim**: per capita crime rate by town

**zn**: proportion of residential land zoned for lots over 25,000 sq.ft.

**indus**: proportion of non-retail business acres per town

**chas**: Charles River dummy variable (1 if tract bounds river; 0 otherwise)

**nox**: nitric oxides concentration (parts per 10 million)

**rm**: average number of rooms per dwelling

**age**: proportion of owner-occupied units built prior to 1940

**dis**: weighted distances to five Boston employment centres

**rad**: index of accessibility to radial highways

**tax**: full-value property-tax rate per \$10,000

**ptratio**: pupil-teacher ratio by town

**b**: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

**lstat**: percentage of the population with lower status

**medv**: median value of owner-occupied homes in \$1000s
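A quick way to compare a loaded table against a list of attributes like this is to print its column names and count missing values. The sketch below uses a small made-up DataFrame called `sample` as a stand-in, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; the values are invented
sample = pd.DataFrame({
    "crim": [0.1, 0.2, None],
    "chas": [0, 1, 0],
    "medv": [24.0, 30.0, 18.0],
})

print(list(sample.columns))            # attribute names
print(sample.isna().sum())             # number of missing values per column
print(sample["chas"].value_counts())   # counts of each value of the dummy variable
```

The same three calls should work on any pandas DataFrame, including one loaded with read_csv().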

# Simple statistics
Let's first look at the statistics of the house prices, including the mean and standard deviation.

In [None]:
print('Mean = {0}, Std = {1}'.format(df['medv'].mean(), df['medv'].std()))
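One detail worth knowing: the pandas std() method computes the sample standard deviation (dividing by n - 1) by default, whereas numpy's np.std() divides by n. The sketch below shows the difference on four made-up prices; the Series `prices` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Four invented prices, in $1000s
prices = pd.Series([20.0, 22.0, 24.0, 26.0])

print(prices.mean())               # arithmetic mean
print(prices.std())                # sample standard deviation (ddof=1, the pandas default)
print(prices.std(ddof=0))          # population standard deviation
print(np.std(prices.to_numpy()))   # numpy default matches ddof=0
print(prices.describe())           # count, mean, std, min, quartiles, max in one call
```

For a dataset with a few hundred rows the two values are close, but they are not identical.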

# Simple visualisation
Let's look at the distribution of the house prices using matplotlib for visualisation.

In [None]:
plt.hist(df['medv'])
plt.xlabel('Price ($1000\'s)')
plt.ylabel('Count')

Now let's look at the house price against the crime rate.

In [None]:
plt.scatter(df['crim'], df['medv'])
plt.xlabel('Crime rate')
plt.ylabel('Price ($1000\'s)')

# Exercise
Try plotting the house price against some of the other attributes.
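As a starting point, one possible approach is to loop over a list of attribute names and draw one scatter plot per attribute. The sketch below uses a small invented DataFrame called `sample`; the choice of `rm` and `lstat` is only an example.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in data; the values are invented for illustration
sample = pd.DataFrame({
    "rm": [6.0, 6.5, 5.5, 7.0],
    "lstat": [5.0, 4.0, 12.0, 3.0],
    "medv": [24.0, 30.0, 18.0, 35.0],
})

attributes = ["rm", "lstat"]

# One panel per attribute, sharing the price axis
fig, axes = plt.subplots(1, len(attributes), figsize=(8, 3), sharey=True)
for ax, name in zip(axes, attributes):
    ax.scatter(sample[name], sample["medv"])
    ax.set_xlabel(name)
axes[0].set_ylabel("Price ($1000's)")
fig.tight_layout()
```

Replacing `sample` with the loaded dataset and extending the `attributes` list gives the same kind of figure for the real data.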

# Simple analysis

Finally, let's try a simple linear regression model for fitting and predicting the house prices, using the other 13 attributes.

In [None]:
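model = LinearRegression()
# The first 13 columns are the input attributes; the last column (medv) is the price
X = df.iloc[:, :13]
price = df.iloc[:, 13]
model.fit(X, price)
# Predict on the same data that the model was fitted to
predicted = model.predict(X)
# Points close to the diagonal are well predicted
plt.scatter(price, predicted)
plt.xlim([0, 55])
plt.ylim([0, 55])
plt.xlabel('True price ($1000\'s)')
plt.ylabel('Predicted price ($1000\'s)')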
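A fitted LinearRegression model also exposes its learned coefficients and a score() method that returns the R^2 value, which summarises a true-versus-predicted plot in a single number. The sketch below is a minimal illustration on synthetic data; `X_demo`, `y_demo`, and the true weights are assumptions made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y is a known linear function of 3 attributes plus small noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_weights = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_weights + 3.0 + rng.normal(scale=0.1, size=200)

demo_model = LinearRegression()
demo_model.fit(X_demo, y_demo)

print(demo_model.coef_)        # one fitted weight per attribute
print(demo_model.intercept_)   # fitted constant term
print(demo_model.score(X_demo, y_demo))  # R^2 on the data used for fitting
```

Because the score here is computed on the same data used for fitting, it tends to be optimistic about how well the model would predict new cases.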

# Exercise

Check the documentation of sklearn.linear_model.LinearRegression to see what it does.

# Exercise

Check what the dataset that you have downloaded looks like.

If it is in .csv format, load and display it in a similar way. Otherwise, what format is it in? Is it easy to read?

# The end