This notebook is divided into the following sections:

- Install packages if missing
- **Load data**
  - Load data from single source
  - Load data from multiple sources
  - Merge multi-source data into single datasource
- Handle **Missing values**
- **Manipulate Data**
- Plotting / **Visualization**
  - Univariate Analysis
  - Bivariate Analysis
- Preparation of **Training & Test Dataset**


# Install packages if missing

Uncomment the following code to install the required packages.

In [None]:
# ! pip install pandas  
# ! pip install numpy 
# ! pip install matplotlib 
# ! pip install seaborn 
# ! pip install scikit-learn

# Load data

We use pandas to load CSV data as a **DataFrame**.

[Boston dataset detail](http://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html)

### Variables
There are 14 attributes in each case of the dataset. They are:

- CRIM - per capita crime rate by town
- ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS - proportion of non-retail business acres per town
- CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- NOX - nitric oxides concentration (parts per 10 million)
- RM - average number of rooms per dwelling
- AGE - proportion of owner-occupied units built prior to 1940
- DIS - weighted distances to five Boston employment centres
- RAD - index of accessibility to radial highways
- TAX - full-value property-tax rate per $10,000
- PTRATIO - pupil-teacher ratio by town
- B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT - % lower status of the population
- MEDV - median value of owner-occupied homes in $1000's

In [None]:
import pandas as pd
import numpy as np

### Load data from single source

In [None]:
df_boston = pd.read_csv("boston.csv")
df_boston.head()

In [None]:
print("df_boston's type =", type(df_boston))

In [None]:
print("df_boston.shape =", df_boston.shape)
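To see what `pd.read_csv` returns without needing `boston.csv` on disk, here is a minimal sketch that reads a small in-memory CSV. The three rows are made-up values (an assumption for illustration only), and the names `csv_text`, `df_demo` and `df_tsv` are hypothetical.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt standing in for boston.csv (values are made up).
csv_text = """CRIM,RM,MEDV
0.1,6.5,24.0
0.2,6.4,21.6
0.3,7.1,34.7
"""

# read_csv accepts a file path or any file-like object.
df_demo = pd.read_csv(io.StringIO(csv_text))

print(type(df_demo))   # the DataFrame class
print(df_demo.shape)   # (number of rows, number of columns)
print(df_demo.dtypes)  # column types inferred from the text

# The same data with tabs instead of commas needs an explicit separator.
tsv_text = csv_text.replace(",", "\t")
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")
```

Both calls should produce the same dataframe; only the `sep` argument differs.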

### Load data from multiple sources (this is a separate dataset, unrelated to the Boston dataset)

We have two data files, which we load and merge into a single dataframe.

In [None]:
df1 = pd.read_csv("rating_final.tsv", sep='\t')
df2 = pd.read_csv("parking.csv")

In [None]:
print("df1.shape =", df1.shape)

In [None]:
print("df2.shape =", df2.shape)

In [None]:
df1.head()

In [None]:
df2.head()

### Merge Data into single datasource

More information can be found in [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/merging.html) 

In [None]:
df_merge = pd.merge(left=df1, right=df2, on="placeID", how="left")

print("df_merge.shape =", df_merge.shape)

df_merge.head(10)
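The effect of `how="left"` is easier to see on tiny tables. The sketch below uses two hypothetical miniature dataframes (only the `placeID` key name comes from this notebook; the other columns and all values are made up) and adds `indicator=True` so each row reports where it came from.

```python
import pandas as pd

# Hypothetical miniature tables; only the "placeID" key is taken from the notebook.
ratings = pd.DataFrame({"placeID": [1, 1, 2, 3], "rating": [2, 1, 0, 2]})
parking = pd.DataFrame({"placeID": [1, 2], "parking_lot": ["public", "none"]})

# A left join keeps every row of the left table.
# Keys missing from the right table get NaN in the right-hand columns.
merged = pd.merge(left=ratings, right=parking, on="placeID", how="left", indicator=True)

print(merged)
print("unmatched rows =", (merged["_merge"] == "left_only").sum())
```

Here `placeID` 3 has no parking record, so its `parking_lot` value is missing after the merge. Note that the row count can also grow if the right table contains duplicate keys.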

# Handle Missing values in boston dataset

In [None]:
print("df_boston.shape =", df_boston.shape)

df_boston.head()

#### Check whether the dataframe contains any missing values

In [None]:
df_boston.isnull().values.any()

#### Define custom missing values

Some files mark missing entries with their own placeholders. As an example scenario, consider "n/a", "na" and "--" as missing values. The blog post below explains how to handle them.

https://towardsdatascience.com/data-cleaning-with-python-and-pandas-detecting-missing-values-3e9c6ebcf78b
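One way to treat such placeholders as missing is the `na_values` argument of `pd.read_csv`. The sketch below is an illustration on made-up in-memory data, not necessarily the approach taken in the blog post; `csv_text`, `raw` and `clean` are hypothetical names.

```python
import io

import pandas as pd

# Made-up data with three different placeholder strings for missing entries.
csv_text = """CRIM,RM,MEDV
0.1,n/a,24.0
na,6.4,21.6
0.3,7.1,--
"""

# Without na_values, a marker such as "--" is read as ordinary text,
# which also turns the whole column into strings.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values, every listed marker becomes NaN and columns stay numeric.
clean = pd.read_csv(io.StringIO(csv_text), na_values=["n/a", "na", "--"])

print(raw.dtypes)
print(clean.dtypes)
print(clean.isnull().sum())
```

After the second call each column should report exactly one missing value.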

In [None]:
print(df_boston.isnull().sum())  # number of missing values per column

In [None]:
null_data = df_boston[df_boston.isnull().any(axis=1)]
print(null_data)

### Drop rows that contain null values

In [None]:
df_boston = df_boston.dropna(axis=0, how='any')   # axis=0 drops rows; axis=1 would drop columns

print("df_boston.shape =", df_boston.shape)
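The `how` argument decides how aggressive the drop is. The following sketch compares the options on a small made-up dataframe (the values and the names `demo`, `rows_any`, `rows_all` and `rows_target` are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Made-up frame: row 0 is complete, rows 1 and 2 are partly missing, row 3 is empty.
demo = pd.DataFrame({
    "CRIM": [0.1, np.nan, 0.3, np.nan],
    "RM":   [6.5, 6.4, np.nan, np.nan],
    "MEDV": [24.0, 21.6, 34.7, np.nan],
})

rows_any = demo.dropna(axis=0, how="any")    # drop rows with at least one NaN
rows_all = demo.dropna(axis=0, how="all")    # drop only rows that are entirely NaN
rows_target = demo.dropna(subset=["MEDV"])   # drop rows where MEDV itself is NaN

print("how='any'     ->", rows_any.shape)
print("how='all'     ->", rows_all.shape)
print("subset=[MEDV] ->", rows_target.shape)
```

`how='any'` is the strictest choice, so it is worth comparing the shape before and after to see how many rows were lost.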

#### Statistical summary of the dataframe

In [None]:
df_boston.describe()

# Manipulate boston dataset

In [None]:
print("df_boston.shape =", df_boston.shape)

df_boston.head(2)

### Select only a few features/columns from the dataframe

In [None]:
our_feature = ['CRIM', 'TAX', 'INDUS']

df_filtered = df_boston[our_feature]

print("df_filtered.shape =", df_filtered.shape)

df_filtered.head(2)

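### Remove a few cases/observations from the dataframe

In [None]:
drop_rows = [0, 2, 3]  # index labels of the rows to remove

# drop() matches index labels, so these labels must still exist in the dataframe
df_removed = df_boston.drop(drop_rows, axis=0)  # axis=0 = rows, axis=1 = columns

print("df_removed.shape =", df_removed.shape)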

df_removed.head(2)

**Select a specific row by position**

In [None]:
df_boston.iloc[4]

**Select multiple rows by position range**

In [None]:
df_boston.iloc[2:4]  # rows at positions 2 and 3; the end position is excluded

**Select a single column for a range of rows**

In [None]:
df_boston.iloc[1:4]["NOX"]

In [None]:
# Selecting several columns requires a list, hence the double brackets
df_boston.iloc[1:4][["NOX", "CRIM", "LSTAT"]]

**Select multiple columns for a range of rows**

In [None]:
columns = ["RM", "AGE"]

df_boston[columns].iloc[1:4]
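Because `iloc` works on positions while `loc` works on index labels, the two can disagree once rows have been dropped and the index is no longer 0, 1, 2, and so on. The sketch below illustrates the difference on a made-up dataframe whose index deliberately starts at 10 (all values and the name `demo` are assumptions for illustration).

```python
import pandas as pd

# Made-up frame with index labels 10..14, so labels and positions differ.
demo = pd.DataFrame(
    {
        "NOX": [0.54, 0.47, 0.47, 0.46, 0.46],
        "CRIM": [0.01, 0.03, 0.03, 0.03, 0.07],
        "LSTAT": [4.98, 9.14, 4.03, 2.94, 5.33],
    },
    index=[10, 11, 12, 13, 14],
)

by_position = demo.iloc[1:4]   # positions 1, 2, 3 (end excluded)
by_label = demo.loc[11:13]     # labels 11, 12, 13 (end included)

one_col = demo.iloc[1:4]["NOX"]                # single name -> Series
many_cols = demo.iloc[1:4][["NOX", "LSTAT"]]   # list of names -> DataFrame
same_cols = demo.iloc[1:4, [0, 2]]             # rows and columns by position in one step

print(by_position.index.tolist())
print(type(one_col).__name__, type(many_cols).__name__)
```

The last form selects rows and columns in a single `iloc` call, which avoids chaining two indexing operations.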

# Plotting / Visualization

In [None]:
import matplotlib.pyplot as plt 
%matplotlib inline

## Univariate Analysis

#### Using Histogram plot with Binning

In [None]:
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

df_boston["AGE"].plot.hist(ax=ax1, bins=5, range=(0,100))
df_boston["AGE"].plot.hist(ax=ax2, bins=100, range=(0,100))

plt.show()
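The two plots differ only in the number of bins. To inspect the numbers behind such a plot, the bin edges and counts can be computed directly with `np.histogram`. The sketch below uses synthetic, uniformly distributed values as a stand-in for the `AGE` column (an assumption; the real column is not uniform).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the AGE column: 500 values in [0, 100).
rng = np.random.default_rng(0)
age = pd.Series(rng.uniform(0, 100, size=500), name="AGE")

# 5 bins of width 20 versus 100 bins of width 1 over the same range.
counts_5, edges_5 = np.histogram(age, bins=5, range=(0, 100))
counts_100, edges_100 = np.histogram(age, bins=100, range=(0, 100))

print("edges with 5 bins :", edges_5)
print("counts with 5 bins:", counts_5)
print("largest count with 100 bins:", counts_100.max())
```

Few bins give a coarse overview; many bins show more detail but each bin holds fewer observations, so the shape becomes noisier.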

In [None]:
df_merge.

## Bivariate Analysis

Using Scatter plot

In [None]:
print(str(df_boston['RM'].iloc[0]) + ",", df_boston['MEDV'].iloc[0])

In [None]:
df_boston.plot.scatter(x='MEDV', y='LSTAT');

# set x & y labels in plot
plt.xlabel("MEDV")
plt.ylabel("LSTAT")

plt.show()

# Preparation of Training & Test Dataset

In [None]:
from sklearn.model_selection import train_test_split

Store the features in the `X` variable and the target (or label) in the `y` variable.

In [None]:
features = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT']
target = ['MEDV']

X = df_boston[features]
y = df_boston[target]

print('df_boston shape =', df_boston.shape)
print('X shape =', X.shape)
print('y shape =', y.shape)

**Split boston dataset into train (80%) and test set (20%)**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

print('X_train shape =', X_train.shape)
print('y_train shape =', y_train.shape)
print('X_test shape =', X_test.shape)
print('y_test shape =', y_test.shape)
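Two properties of the split are worth checking: features and targets stay aligned row by row, and a fixed `random_state` reproduces the same split. The sketch below verifies both on a made-up 10-row dataframe (the values and the `*_demo`, `X_tr`, `X_te` names are assumptions for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up 10-row frame with two features and one target.
demo = pd.DataFrame({
    "RM": np.arange(10, dtype=float),
    "LSTAT": np.arange(10, 20, dtype=float),
    "MEDV": np.arange(20, 30, dtype=float),
})
X_demo = demo[["RM", "LSTAT"]]
y_demo = demo[["MEDV"]]

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=4)

# Repeating the call with the same random_state should give the same rows.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, test_size=0.2, random_state=4)

print("train rows:", sorted(X_tr.index))
print("test rows :", sorted(X_te.index))
print("same split on repeat:", X_te.index.equals(X_te2.index))
```

With 10 rows and `test_size=0.2`, 8 rows go to the training set and 2 to the test set, and no row appears in both.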

In [None]:
print(X_test.head(10))

In [None]:
print(y_test.head(10))