# Machine Learning Intro #

Let's work through a regression problem. The main steps we are going to take are:

1. Load the data
2. Discover and visualize the data to gain insights
3. Prepare data for machine learning algorithms
4. Select an algorithm and train a model
5. Validate our model

The dataset we are going to use is the [California Housing Dataset](https://github.com/ageron/handson-ml/tree/master/datasets/housing) which contains data drawn from the 1990 U.S. Census. 

The first task you are asked to perform is to build a model of housing prices in California using the California census data. This data has metrics such as the **population**, **median income**, **median housing price**, and so on for each block group in California. Block groups are the smallest geographical unit for which the US Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). We will just call them “districts” for short.


Your model should learn from this data and be able to **predict the median housing price in any district, given all the other metrics**.

## Create the workspace ##

If you are working *offline* you can create a conda environment now with `conda create --name linear python`, activate it with `conda activate linear`, and install the libraries needed with `pip install matplotlib numpy pandas scipy scikit-learn`.

In [0]:
#Several imports that will be needed - Check them before starting
import numpy as np
import pandas as pd
import scipy
import sklearn
import os

## Data Loading ##

We load the data using pandas. We wrap the call in a small function because we are going to need it again later, and then take a quick look at the data.

In [2]:
DATASET_PATH = "housing.csv"

def load_data_csv(csv_path=DATASET_PATH):
  return pd.read_csv(csv_path)

housing = load_data_csv()
housing.head() # Top 5 rows 
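
If `housing.csv` is not in the working directory, `pd.read_csv` raises a `FileNotFoundError`. The sketch below is one possible way to fetch the file first. The helper name `ensure_dataset` is hypothetical, and the raw-file URL is an assumption derived from the repository linked above, so check it before relying on it.

```python
import os
import urllib.request

# Assumed location of the raw CSV inside the repository linked above.
DATASET_URL = (
    "https://raw.githubusercontent.com/ageron/handson-ml/"
    "master/datasets/housing/housing.csv"
)

def ensure_dataset(csv_path="housing.csv", url=DATASET_URL):
    """Download the CSV only if it is not already present, then return its path."""
    if not os.path.exists(csv_path):
        urllib.request.urlretrieve(url, csv_path)
    return csv_path
```

Calling `ensure_dataset()` before `load_data_csv()` should then make the loading cell work both online and offline, provided the file has been downloaded once.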


## Data discovery ##

In [0]:
housing.info() # Quick description of the data (types, columns, entries etc)

There are 20,640 instances in the dataset. Notice that the total_bedrooms attribute has only 20,433 non-null values, meaning that 207 districts are missing this feature. We will need to take care of this later.
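
The number of missing values per column can also be counted directly instead of being read off the `info()` output. This is a minimal sketch on a made-up stand-in DataFrame (the name `toy` and its values are illustrative only); on the real data the same call would be `housing.isnull().sum()`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing DataFrame (values are made up).
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# isnull() marks each missing entry as True; sum() counts them per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```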

Also, all attributes are numerical except the ocean_proximity field. Its type is object; since we loaded the data from a CSV file, it is a text field. Take a look back at the top 5 rows.

In [0]:
# Let's find out more about this field: how many districts fall into each category
housing["ocean_proximity"].value_counts()

In [0]:
# Some further information about all the fields
housing.describe()

# STD = Standard Deviation, 25% = 25th percentile, 50% = median , 75% = 75th percentile

## Data visualization ##

### General visualization ###

Plotting histograms for each numerical value also helps us understand the data. A histogram shows the number of instances
(on the vertical axis) that have a given value range (on the horizontal axis).

In [0]:
import matplotlib.pyplot as plt # Necessary import
housing.hist(bins=50, figsize=(20,15)) # Called like this, it plots one histogram per
# numerical attribute; bins sets the number of segments each value range is split into.
plt.show()
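
To make the definition of a histogram concrete, the counts behind such a plot can be computed directly with NumPy. This is a small sketch on made-up values, not on the housing data.

```python
import numpy as np

values = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 7.0, 8.0, 9.5])  # made-up values

# Split the range [0, 10] into 5 equal-width bins and count the values in each
counts, bin_edges = np.histogram(values, bins=5, range=(0, 10))
print(counts)     # instances per bin (vertical axis)
print(bin_edges)  # bin boundaries (horizontal axis)
```

Every value lands in exactly one bin, so the counts always add up to the number of instances.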

We notice:
 * Some attributes have been capped (median income at 15, median house age, median house value)
 * Different scales
 * Tail heavy histograms (they extend much farther to the right of the median than to the left)
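
Two of these observations can also be checked numerically. The sketch below uses made-up, right-skewed data clipped at 15 to imitate a capped attribute; the variable names are illustrative, and the same two expressions could be tried on a real column such as `housing["median_income"]`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Made-up, right-skewed "incomes", clipped at 15 to imitate a capped attribute
income = pd.Series(rng.lognormal(mean=1.2, sigma=0.6, size=5000)).clip(upper=15)

# Fraction of rows sitting exactly at the maximum; a visible spike hints at capping
share_at_cap = (income == income.max()).mean()
# Sample skewness; a positive value means the right tail is the longer one
skewness = income.skew()
print(f"share at cap: {share_at_cap:.4f}, skewness: {skewness:.2f}")
```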

### Visualizing Geographical Data ###

Since there is geographical information (latitude and longitude), it is a good idea to
create a scatterplot of all districts to visualize the data. 

Here is a map of California to help us understand our data better.

<img src="http://www.orangesmile.com/common/img_city_maps/california-state-map-3.jpg " alt="california" width="400"/>

In [0]:
housing.plot(kind="scatter", x="longitude", y="latitude") # Scatter plot with longitude on the x axis and latitude on the y axis

This looks like California all right, but other than that it is hard to see any particular pattern. Setting the alpha option to 0.1 makes it much easier to visualize the places where there is a high density of data points.

In [0]:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1) # Same scatter plot; alpha=0.1 makes each point mostly transparent, so dense areas appear darker

We can now see the high-density areas (around the main cities of California).

More generally, our brains are very good at spotting patterns in pictures, but you may need to play around with visualization parameters to make the patterns stand out.

Let's include the housing prices. The radius of each circle represents the district’s population (option s), and the color represents the price (option c). We will use a predefined color map (option cmap) called jet, which ranges from blue (low prices) to red (high prices).


In [0]:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
s=housing["population"]/100, label="population",
c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
)
plt.legend()

## Prepare the data for Machine Learning Algorithms ##

Instead of preparing the data manually, you should write functions for it, for several good reasons:
 
 * Reproduce the transformations on any dataset
 * Build a library of transformation functions
 * Easily try various transformations
 
Let's work with our dataset and separate the features (input values) from the labels (output values, here the median house value).

In [0]:

x = housing.drop("median_house_value", axis=1) # Creates a copy without the specified column
y = housing["median_house_value"].copy() # Copies the column to the specified variable

## Data Cleaning ##

Most Machine Learning algorithms cannot work with missing features, so let’s create a few functions to take care of them. You noticed earlier that the total_bedrooms attribute has some missing values, so let’s fix this. You have three options:

1. Get rid of the corresponding districts.
2. Get rid of the whole attribute.
3. Set the values to some value (zero, the mean, the median, etc.).

When several columns have missing values, you can use the [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer) from the scikit-learn library to fill them all at once.


In [0]:
# housing.dropna(subset=["total_bedrooms"]) # option 1
# housing.drop("total_bedrooms", axis=1) # option 2
median = x["total_bedrooms"].median() # Get the median of this column
x["total_bedrooms"] = x["total_bedrooms"].fillna(median) # option 3: fillna returns a new Series, so assign it back
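
For the case with several incomplete columns, a SimpleImputer can learn one median per column and fill all of them in one call. This is a minimal sketch on a made-up DataFrame; the name `toy` and its values are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data with a missing value in each column
toy = pd.DataFrame({
    "total_bedrooms": [100.0, np.nan, 300.0, 500.0],
    "population": [1000.0, 2000.0, np.nan, 4000.0],
})

imputer = SimpleImputer(strategy="median")
# fit_transform learns the per-column medians and returns a NumPy array with the gaps filled
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(imputer.statistics_)  # the learned median of each column
print(filled)
```

The fitted imputer keeps the medians in `statistics_`, so the same values can later be applied to new data with `imputer.transform(...)`.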

## Handling Text and Categorical Attributes ##

The ocean_proximity field is a text attribute. Most machine learning algorithms prefer to work with numbers, so it is better to convert these text labels to numbers.

In [0]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
housing_cat = x["ocean_proximity"]
housing_cat_encoded = encoder.fit_transform(housing_cat)
print(housing_cat_encoded)
print(encoder.classes_)


Each category is mapped to a value, e.g. <1H OCEAN is mapped to 0, INLAND is mapped to 1, etc.

One issue with this representation is that ML algorithms will assume that two nearby values are more similar than two distant values. Obviously this is not the case (for example, categories 0 and 4 are more similar than categories 0 and 1).

To fix this we create one binary attribute per category. An attribute is equal to 1 when its category matches the ocean proximity of the district, and all the other attributes are 0.


Scikit-Learn provides a [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) to convert integer categorical values into one-hot vectors. Let’s encode the categories as one-hot vectors. Note that fit_transform() expects a 2D array, but housing_cat_encoded is a 1D array, so we need to reshape it.

In [0]:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1,1))
housing_cat_1hot

Notice that the output is a SciPy sparse matrix, instead of a NumPy array. This is very useful when you have categorical attributes with thousands of categories. After one-hot encoding we get a matrix with thousands of columns, and the matrix is full of zeros except for one 1 per row. Using up tons of memory mostly to store zeros would be very wasteful, so instead a sparse matrix only stores the location of the nonzero elements. You can use it mostly like a normal 2D array, but if you really want to convert it to a (dense) NumPy array, just call the toarray() method.

In [0]:
housing_cat_1hot.toarray()
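
As a side note, scikit-learn releases from 0.20 onward let OneHotEncoder take the text column directly, so the intermediate integer encoding can be skipped. The sketch below shows this on a made-up column and also prints how many entries the sparse matrix actually stores.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up sample of the text attribute
toy = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "<1H OCEAN"]})

encoder = OneHotEncoder()
# Double brackets select a one-column DataFrame, i.e. the 2D input the encoder expects
one_hot = encoder.fit_transform(toy[["ocean_proximity"]])

print(encoder.categories_)  # categories found, in sorted order
print(one_hot.nnz)          # stored nonzero entries: one per row
print(one_hot.toarray())    # dense view, for inspection only
```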

## Pipelines ##

You can combine all the transformations into one [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html). The Pipeline constructor takes a list of name/estimator pairs defining a sequence of steps. All but the last estimator must be transformers (i.e., they must have a fit_transform() method). The names can be anything you like.

In [0]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

housing_num = x.drop("ocean_proximity", axis=1) # Get only the numerical values


# Set up the pipeline
num_pipeline = Pipeline([
('imputer', SimpleImputer(strategy="median")),
('normalization', MinMaxScaler()),
])

housing_num_tr = num_pipeline.fit_transform(housing_num)

pd.DataFrame(housing_num_tr).head()
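
To see what the pipeline does under the hood, the sketch below runs the same two steps by hand on a small made-up array and compares the results. The array and variable names are illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Made-up numerical data with one missing value per column
data = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan], [5.0, 40.0]])

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("normalization", MinMaxScaler()),
])
via_pipeline = pipeline.fit_transform(data)

# The same two steps applied by hand, one after the other
imputed = SimpleImputer(strategy="median").fit_transform(data)
by_hand = MinMaxScaler().fit_transform(imputed)

print(np.allclose(via_pipeline, by_hand))
```

Each step's output is fed to the next step, which is why the order of the name/estimator pairs matters.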


Let's join the numerical values with the categorical ones. We do that with the [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html).

In [0]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Specify each transformer and the columns it applies to. A pipeline can also be used as a transformer.
full_pipeline = ColumnTransformer([
    ("num_pipeline", num_pipeline, list(x.columns[:-1])), # every column except the last one (ocean_proximity)
    ("one_hot", OneHotEncoder(), [x.columns[-1]]) # the last column, ocean_proximity
])
housing_prepared = full_pipeline.fit_transform(x)
pd.DataFrame(housing_prepared).head()
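
The layout of the combined output is easier to see on a tiny example. In this sketch (made-up data, illustrative names) two numerical columns are scaled and one text column with two categories is one-hot encoded, so the result has 2 + 2 = 4 columns, in the order in which the transformers are listed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Made-up data: two numerical attributes and one text attribute
toy = pd.DataFrame({
    "median_income": [2.0, 4.0, 6.0],
    "housing_median_age": [10.0, 30.0, 50.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"],
})

transformer = ColumnTransformer([
    ("num", MinMaxScaler(), ["median_income", "housing_median_age"]),
    ("cat", OneHotEncoder(), ["ocean_proximity"]),
])
prepared = transformer.fit_transform(toy)
# The result may be sparse or dense depending on its density; make it dense for display
prepared = prepared.toarray() if hasattr(prepared, "toarray") else np.asarray(prepared)
print(prepared.shape)
print(prepared)
```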

## Select and Train a Model  ##

Following all these steps makes training and evaluating a model much simpler. In this section we will train a regression model using scikit-learn's LinearRegression class.

### Linear Regression (sklearn) ###

We train a linear regression model using the scikit-learn library.

In [0]:
from sklearn.linear_model import LinearRegression


lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, y)
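
LinearRegression fits an ordinary least squares model. As a sanity check of what that means, the sketch below fits made-up data (not the housing set) both with scikit-learn and with NumPy's least-squares solver and prints the two sets of weights side by side.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))                            # made-up features
true_w = np.array([2.0, -1.0, 0.5])                      # weights used to generate the labels
y = X @ true_w + 4.0 + rng.normal(scale=0.1, size=200)   # made-up noisy labels

model = LinearRegression().fit(X, y)

# Ordinary least squares by hand: append a column of ones to model the intercept
X_b = np.hstack([X, np.ones((len(X), 1))])
theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)

print(model.coef_, model.intercept_)  # weights and intercept found by scikit-learn
print(theta)                          # same weights followed by the intercept
```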

In [0]:
# Let's try the model out on a few instances from the training set
some_data = x.iloc[:5] # Choosing some data
some_labels = y.iloc[:5] # Don't forget to also get the labels
some_data_prepared = full_pipeline.transform(some_data) # Transform this data with the already fitted pipeline
print("Predictions:\t", lin_reg.predict(some_data_prepared))
print("Labels:\t\t", list(some_labels))

## Evaluation ##

The predictions don't seem very accurate. Let's measure this regression model’s Root Mean Square Error (RMSE) on the training set.

In [0]:
from sklearn.metrics import mean_squared_error

housing_predictions = lin_reg.predict(housing_prepared) # Taking the predictions on the training set
lin_mse = mean_squared_error(y, housing_predictions) # Computing mean squared error
lin_rmse = np.sqrt(lin_mse) # Taking the root
lin_rmse
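
The RMSE is the square root of the average squared difference between predictions and labels, so it is expressed in the same unit as the labels (here, dollars). The sketch below computes it by hand on four made-up values and compares the result with the scikit-learn route used above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

labels = np.array([200000.0, 150000.0, 300000.0, 250000.0])       # made-up values
predictions = np.array([210000.0, 140000.0, 330000.0, 250000.0])  # made-up values

errors = predictions - labels                   # per-district prediction error
rmse_by_hand = np.sqrt(np.mean(errors ** 2))    # square, average, take the root
rmse_sklearn = np.sqrt(mean_squared_error(labels, predictions))
print(rmse_by_hand, rmse_sklearn)
```

Because the errors are squared before averaging, the single 30,000 miss weighs more than the two 10,000 misses combined.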