# Recitation 1
## Introduction to Python and some of the widely used libraries for ML

* In this recitation we will work through a simple dataset.
* Your job is to use Python and some of its libraries to prepare the dataset for model training.
* No model training is involved; we will only prepare the data.

In [None]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Data loading and prep

### Boston Housing Dataset
* CRIM: Per capita crime rate by town
* ZN: Proportion of residential land zoned for lots over 25,000 sq. ft
* INDUS: Proportion of non-retail business acres per town
* CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
* NOX: Nitric oxide concentration (parts per 10 million)
* RM: Average number of rooms per dwelling
* AGE: Proportion of owner-occupied units built prior to 1940
* DIS: Weighted distances to five Boston employment centers
* RAD: Index of accessibility to radial highways
* TAX: Full-value property tax rate per \$10,000
* PTRATIO: Pupil-teacher ratio by town
* B: 1000(Bk - 0.63)², where Bk is the proportion of [people of African American descent] by town
* LSTAT: Percentage of lower status of the population
* MEDV: Median value of owner-occupied homes in \$1000s

In [None]:
boston_df = pd.read_csv('boston-data.csv')
boston_df.head(5)

In [None]:
boston_df.tail(5)

In [None]:
boston_df.describe()

In [None]:
boston_df.info()

In [None]:
boston_df.isnull().sum()

# Checking correlation between features and labels

In [None]:
plt.figure(figsize=(20, 5))

features = ['crim', 'rm']
target = boston_df['medv']

for i, col in enumerate(features):
    plt.subplot(1, len(features), i + 1)
    x = boston_df[col]
    y = target
    plt.scatter(x, y, marker='o')
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('medv')
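
The scatter plots give a visual impression of the relationship; a number can complement them. The following is a minimal sketch, assuming a small made-up DataFrame named `toy_df` (hypothetical, not the real dataset) with the same column names, that computes the Pearson correlation of each feature with the target using `DataFrame.corrwith`.

```python
import pandas as pd

# Hypothetical toy data standing in for boston_df (values are made up).
toy_df = pd.DataFrame({
    'crim': [4.0, 3.0, 2.0, 1.0],
    'rm': [5.0, 6.0, 7.0, 8.0],
    'medv': [15.0, 20.0, 25.0, 30.0],
})

# Pearson correlation of each feature column with the target column.
correlations = toy_df[['crim', 'rm']].corrwith(toy_df['medv'])
print(correlations)
```

In this toy data `rm` rises linearly with `medv` and `crim` falls linearly, so the coefficients come out as 1 and -1. On real data the values will lie somewhere in between.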

# Creating Training and Target data

- One way to create the training data with the two features is to build a dictionary and then turn it into a DataFrame.
- Here each column is converted from a pandas Series to a list before it goes into the dictionary (pandas also accepts Series directly as dictionary values).
- The dictionary keys become the column names of the DataFrame.

In [None]:
training_dict = {'crime': boston_df['crim'].tolist(), 'rooms': boston_df['rm'].tolist()}
train_data = pd.DataFrame(training_dict)
train_data.head(2)
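
As a side note, the `.tolist()` conversion is optional. The sketch below, which assumes a made-up frame named `toy_df` (hypothetical values, not the real dataset), builds the same DataFrame once from lists and once from Series and compares the two.

```python
import pandas as pd

# Hypothetical toy data standing in for boston_df (values are made up).
toy_df = pd.DataFrame({'crim': [0.1, 0.2, 0.3], 'rm': [6.0, 6.5, 7.0]})

# Dictionary of lists, as in the cell above.
from_lists = pd.DataFrame({
    'crime': toy_df['crim'].tolist(),
    'rooms': toy_df['rm'].tolist(),
})

# Dictionary of Series; pandas aligns the values on the Series index.
from_series = pd.DataFrame({'crime': toy_df['crim'], 'rooms': toy_df['rm']})

# Compare the two results.
print(from_lists.equals(from_series))
```

One difference to keep in mind: with lists the new DataFrame gets a fresh 0..n-1 index, while with Series it keeps the index of the original columns.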

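- Another way to do the same thing is with NumPy's ```np.c_``` (an indexing object used with square brackets, not a regular function)
    * more info: https://docs.scipy.org/doc/numpy/reference/generated/numpy.c_.html
- It stacks the two 1-D columns side by side into a single 2-D array with one row per example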
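
To see what `np.c_` does in isolation, here is a minimal sketch on two small made-up arrays (the names `a` and `b` are illustrative only). It also shows that `np.column_stack` gives the same result for 1-D inputs.

```python
import numpy as np

# Two made-up 1-D arrays of equal length.
a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])

# np.c_ is indexed with square brackets; each 1-D input becomes a column.
stacked = np.c_[a, b]
print(stacked.shape)  # (3, 2)

# np.column_stack is a function that does the same for 1-D inputs.
same = np.column_stack((a, b))
print(np.array_equal(stacked, same))  # True
```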

In [None]:
train_data = pd.DataFrame(np.c_[boston_df['crim'], boston_df['rm']], columns=['crime', 'rooms'])
train_data.head(2)

- The target data can be kept as a pandas series or converted to a numpy array

In [None]:
target_data = boston_df['medv']
target_data.head(2)

In [None]:
# train_data has shape (number of examples, number of features)
# It uses two features, hence two columns
# target_data is a 1-D Series with shape (number of examples,): one value to predict per example
print(train_data.shape, target_data.shape)
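
Some libraries expect the target as an explicit column of shape (number of examples, 1) rather than a 1-D array. The sketch below uses a made-up Series named `target` (hypothetical values) to show the conversion to a NumPy array and the reshape.

```python
import pandas as pd

# Hypothetical target values (made up), named like the real column.
target = pd.Series([10.0, 20.0, 30.0], name='medv')
print(target.shape)  # (3,)

# Convert the Series to a NumPy array; it is still 1-D.
target_array = target.to_numpy()
print(target_array.shape)  # (3,)

# Reshape into a single column; -1 lets NumPy infer the number of rows.
target_column = target_array.reshape(-1, 1)
print(target_column.shape)  # (3, 1)
```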