# Data exploration and editing
We will demonstrate model building on freely available datasets.

Raw data very often contains errors, missing values, or values in the wrong format.

It is always a good idea to understand the data before you start working with it, and to adjust it if necessary.

There are a number of freely available data sources on the internet that you can test your skills on.
- https://archive.ics.uci.edu/ml/index.php
- https://www.kaggle.com/
- https://toolbox.google.com/datasetsearch
- GitHub datasets

## Boston Housing Dataset

The housing dataset is derived from housing information for the Boston, Massachusetts area collected by the U.S. Census Bureau.  

The data were originally published in: Harrison, D. and Rubinfeld, D.L., "Hedonic prices and the demand for clean air", J. Environ. Economics & Management, vol. 5, 81-102, 1978.

The dataset contains information on 506 different homes in Boston.

Dataset features:
* CRIM - per capita crime rate by town
* ZN - proportion of residential land zoned for lots over 25,000 square feet
* INDUS - proportion of non-retail business acres per town
* CHAS - Charles River dummy variable (1 if the tract borders the river; 0 otherwise)
* NOX - nitric oxides concentration (parts per 10 million)
* RM - average number of rooms per dwelling
* AGE - percentage of owner-occupied units built before 1940
* DIS - weighted distances to five Boston employment centers
* RAD - index of accessibility to radial highways
* TAX - full-value property tax rate per $10,000
* PTRATIO - pupil-teacher ratio by town
* B - 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
* LSTAT - percentage of lower-status population
* MEDV - median value of owner-occupied homes in $1000s

## Reading data from CSV file

In [None]:
import pandas as pd 

In [None]:
# Forward slashes in the path work on Windows, Linux and macOS alike.
data = pd.read_csv("../dataset/HousingData.csv")

Let's look at the structure of the file.

In [None]:
data.info()

## Basic data characteristics

It is useful to have an overview of the input data before creating the model.
This can prevent problems later on. For example, some models require data in a specific form, such as numeric inputs without missing values.

Data preview.

In [None]:
data.head(10)

Basic statistics of the columns are displayed using the **describe** function:
- count - number of non-missing records
- mean - arithmetic average
- std - standard deviation
- min - minimum
- 25% - 25th percentile
- 50% - 50th percentile (median)
- 75% - 75th percentile
- max - maximum

For some columns the mean and median differ significantly - CRIM, ZN

For some columns, the mean and median are similar - RM

This will be clearly visible when the distribution of values is displayed.

In [None]:
data.describe()
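The gap between mean and median mentioned above can also be checked numerically. The following is a minimal sketch on made-up values (the `toy` frame and its column names are illustrative assumptions, not data from HousingData.csv); it compares mean, median and sample skewness for a symmetric and a right-skewed column.

```python
import pandas as pd

# Made-up values for illustration only (not taken from HousingData.csv).
toy = pd.DataFrame({
    "symmetric": [4.0, 5.0, 6.0, 7.0, 8.0],      # evenly spread around 6
    "right_skewed": [0.1, 0.2, 0.3, 0.4, 50.0],  # one very large value
})

summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.skew(),  # sample skewness; about 0 for symmetric data
})
print(summary)
```

In the symmetric column the mean and median coincide, while in the skewed column a single large value pulls the mean far above the median. The same comparison could be run on the real dataset with `data.mean()`, `data.median()` and `data.skew()`.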

Some columns contain NULL (missing) values. We need to decide how to handle them:
* Incomplete rows can be removed from the dataset.
* Problematic columns can be left out of the model inputs.
* Missing values can be filled in (imputed), for example with the column mean, the median, or zero.

In [None]:
data.isna().sum()
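The options listed above can be sketched on a small example. The frame below borrows three column names from the dataset, but its values are made up for illustration; which option is appropriate for the real data is a judgment call that this sketch does not settle.

```python
import numpy as np
import pandas as pd

# Illustrative frame with made-up values and two missing entries.
toy = pd.DataFrame({
    "RM": [6.5, np.nan, 5.9, 6.1],
    "LSTAT": [4.9, 9.1, np.nan, 12.4],
    "MEDV": [24.0, 21.6, 34.7, 33.4],
})

# Option 1: drop every row that has at least one missing value.
dropped_rows = toy.dropna()

# Option 2: leave a problematic column out of the model inputs.
without_lstat = toy.drop(columns=["LSTAT"])

# Option 3: fill missing values with a per-column statistic (here the median).
imputed = toy.fillna(toy.median())

print(len(toy), len(dropped_rows))
print(imputed)
```

Dropping rows is the simplest choice but loses data; imputation keeps all rows at the cost of introducing artificial values.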

## Value distribution

Visualizing the distribution of each column can reveal skewed distributions and abnormal values.

At the same time, some statistical methods may not work properly on atypically distributed data.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

The distribution of the data can be well understood from the graphs.

We will create a graph that combines a histogram with an estimate of the probability density function.

We obtain the density estimate from the seaborn library, which can draw a kernel density estimate (KDE) line over the histogram.

From the plots, we can see that some variables have almost normal distributions (RM), while others have almost uniform distributions (NOX). 

For some variables, small values dominate and high values are almost absent from the dataset (CRIM).

For other variables, values close to the maximum are heavily represented (B, TAX).

In [None]:

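fig = plt.figure(figsize=(16, 24))
# One subplot per column, arranged in a 7 x 2 grid (the dataset has 14 columns).
for pos, column in enumerate(data.columns, start=1):
    ax = fig.add_subplot(7, 2, pos)
    # Histogram with a kernel density estimate line drawn on top.
    sns.histplot(data[column], ax=ax, kde=True)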

Similar information can be read from a boxplot. It shows the distribution in less detail, but the chart is compact and makes outliers easy to spot.

That is why boxplots appear so often in technical articles, where data has to be presented in a small space.

In [None]:
data.plot(
    kind='box', 
    subplots=True, 
    sharey=False, 
    figsize=(15, 6)
)
plt.subplots_adjust(wspace=1) 
plt.show()
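The points a boxplot draws as outliers are, with matplotlib's default settings, the values lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. A minimal sketch of that rule on made-up numbers (the `values` series is an illustrative assumption):

```python
import pandas as pd

# Made-up sample with one obviously extreme value.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

q1 = values.quantile(0.25)  # lower quartile (bottom edge of the box)
q3 = values.quantile(0.75)  # upper quartile (top edge of the box)
iqr = q3 - q1               # interquartile range (height of the box)

# Whisker limits under the 1.5 * IQR convention.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())  # [100.0]
```

Here the quartiles are 3 and 7, so the limits are -3 and 13 and only the value 100 is flagged.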

## Relationships between variables
There are many variables in datasets. Often there is a relationship between them. If one variable changes, another variable is likely to change.

These relationships may or may not be causal. Sometimes it can be a coincidence. 

That's why it's a good idea to try to decipher these relationships. 
* Uncovering relationships - see if a change in one variable is related to a change in another (e.g. height and weight).
* Redundancy - strongly correlated variables often carry the same information → it is not necessary to have both when modelling.
* Prediction - if one variable is strongly correlated with another, we can use it to predict (e.g. age ↔ income).
* Hidden relationships - weak or no correlation may mean that the relationship is non-linear or influenced by other factors.

There are a number of methods to detect dependencies.

We start by plotting every pair of features against each other.

The human brain is trained to look for patterns. We may see a relationship at first glance. 

Usually we look for a graph shape that shows a mathematical curve (line, parabola, hyperbola, etc.).

In [None]:
sns.pairplot(data)

The graphs show that there is a visible relationship between RM, LSTAT and MEDV.

* RM - average number of rooms per dwelling (input variable)
* LSTAT - percentage of lower-status population (input variable)
* MEDV - median value of owner-occupied homes in $1000s (output variable)

So far the relationships between variables were estimated by eye. They can also be quantified using correlation.

Correlation shows us how strongly and in which direction two variables are **linearly** dependent. Note that some phenomena are related in a non-linear way. For such relationships the correlation coefficient can be misleading, because it may be close to zero even when the dependence is strong.

Correlation coefficient (Pearson's r):
* Values from -1 to 1
* r ≈ 1 → strong positive linear dependence
* r ≈ -1 → strong negative linear dependence
* r ≈ 0 → no linear dependence (but may be non-linear)

In [None]:
corr = data.corr()
corr
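The caveat that r ≈ 0 does not rule out a non-linear relationship can be illustrated with a small sketch on synthetic data (the `toy` frame is an assumption made for this example and has nothing to do with the housing data). One column depends on `x` linearly, the other quadratically.

```python
import numpy as np
import pandas as pd

# Synthetic data: x is symmetric around zero.
x = np.linspace(-3, 3, 61)
toy = pd.DataFrame({
    "x": x,
    "linear": 2 * x + 1,  # exact linear dependence on x
    "quadratic": x ** 2,  # exact, but non-linear, dependence on x
})

# Pearson correlation of every column with x.
r = toy.corr()["x"]
print(r.round(3))
```

The linear column has r = 1, while the quadratic column has r close to 0 even though it is fully determined by `x`. This is why the scatter plots from the pairplot remain worth inspecting alongside the correlation matrix.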

A strong positive or negative correlation may indicate a relationship between input parameters.
This can help us in choosing the input parameters of the model.

Sometimes it can be useful to show correlations using a heatmap.
Especially if the correlation matrix is large, the colours can help us to orient ourselves.

In [None]:
plt.figure(figsize=(10, 8))
# Absolute values are plotted, so strong negative correlations stand out as well.
sns.heatmap(corr.abs(), annot=True, vmin=0, vmax=1)

For example, the CHAS column (Charles River dummy variable) shows almost no correlation with the other columns.

In contrast, the columns LSTAT, TAX, RAD, NOX and INDUS are noticeably correlated with other columns.

In the next class we will attempt to create a statistical model that estimates the property price MEDV from the input parameters.

We will use linear regression to do this.

Looking at the MEDV row, the RM and LSTAT columns appear to be suitable input parameters.
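Reading one row of the correlation matrix can also be done in code. The sketch below ranks candidate inputs by the absolute value of their correlation with the target. It uses synthetic data: the column names mimic the dataset, but the values, the coefficients 5.0 and -0.8, and the `NOISE` column are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data; the relationship below is invented for the example.
rng = np.random.default_rng(0)
n = 200
rm = rng.normal(6.0, 0.7, n)
lstat = rng.normal(12.0, 5.0, n)
noise_col = rng.normal(0.0, 1.0, n)  # unrelated to the target
medv = 5.0 * rm - 0.8 * lstat + rng.normal(0.0, 1.0, n)

toy = pd.DataFrame({"RM": rm, "LSTAT": lstat, "NOISE": noise_col, "MEDV": medv})

# Take the MEDV column of the correlation matrix and drop the self-correlation.
target_corr = toy.corr()["MEDV"].drop("MEDV")

# Rank by strength of the linear relationship, regardless of sign.
ranking = target_corr.abs().sort_values(ascending=False)
print(ranking)
```

On the real data the equivalent expression would be `corr["MEDV"].drop("MEDV").abs().sort_values(ascending=False)`. A high rank only indicates a strong linear association; it does not by itself establish that the column is a good model input.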

## Data editing and standardization

### Data cleaning

Some columns contain NULL (missing) values. We need to decide how to handle them:
* Incomplete rows can be removed from the dataset.
* Problematic columns can be left out of the model inputs.
* Records with extreme values can be excluded from the dataset, for example because they are measurement errors.

In [None]:
print(data.isnull().sum())

In [None]:
data = data.dropna()

Sometimes it is useful to discard outliers.
We remove rows from the dataset where the median house value (MEDV) is 40 or more, i.e. $40,000 or more.

In [None]:
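# .copy() returns an independent DataFrame, so adding columns later
# does not trigger a SettingWithCopyWarning.
data = data[~(data['MEDV'] >= 40.0)].copy()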

### Data standardization

Each feature has a different mean and standard deviation.

It is a good practice to standardize the data before entering it into a mathematical model.

Reasons:
* it prevents variables with large values from dominating the model
* it can help machine learning models converge more quickly
* it can make the coefficients of a machine learning model easier to compare and interpret

Calculation:
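* x_new = (x - mean) / standard_deviation
* mean = sum(x) / count(x)
* standard_deviation = sqrt( sum( (x - mean)^2 ) / count(x) )
* note: the pandas `std()` method divides by count(x) - 1 by default (sample standard deviation), so its result differs slightly from the formula above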

We can do the standardization manually. We calculate the mean and standard deviation and adjust the data. 

In [None]:
data["AGE"].mean()

In [None]:
data["AGE"].std()

In [None]:
data['AGE_STD'] = (data['AGE'] - data['AGE'].mean()) / data['AGE'].std()
data['LSTAT_STD'] = (data['LSTAT'] - data['LSTAT'].mean()) / data['LSTAT'].std()
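A quick way to check a manual standardization is to confirm that the result has mean 0 and standard deviation 1. The sketch below does this on a made-up series (the `age` values are illustrative, not taken from the dataset) and also shows the difference between the sample standard deviation that pandas uses by default (`ddof=1`) and the population version (`ddof=0`).

```python
import pandas as pd

# Made-up values for illustration only.
age = pd.Series([20.0, 40.0, 60.0, 80.0, 100.0], name="AGE")

# pandas default: sample standard deviation (divides by n - 1).
z_sample = (age - age.mean()) / age.std()

# Population standard deviation (divides by n).
z_population = (age - age.mean()) / age.std(ddof=0)

print(z_sample.mean(), z_sample.std())
print(z_population.mean(), z_population.std(ddof=0))
```

Both variants are centered on 0; each has a standard deviation of 1 when measured with the same `ddof` that was used to compute it. For a dataset with a few hundred rows the two results are nearly identical.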

We can look at the distribution charts to see how the original data has changed to the new data.

The shape of the distribution is identical, but the standardized values are centered on 0 and have a standard deviation of 1.

In [None]:
# Original (left) and standardized (right) AGE distributions side by side.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
sns.histplot(data['AGE'], ax=ax1, kde=True)
sns.histplot(data['AGE_STD'], ax=ax2, kde=True)

The same applies to LSTAT.

In [None]:
# Original (left) and standardized (right) LSTAT distributions side by side.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
sns.histplot(data['LSTAT'], ax=ax1, kde=True)
sns.histplot(data['LSTAT_STD'], ax=ax2, kde=True)