#   **Problem Statement** : 
------
<blockquote>
        <p>Build a model that learns from **California census data** and can predict the median housing price in any California district, given all the other metrics. 
        </p>
</blockquote>
 [Housing Dataset](http://lib.stat.cmu.edu/datasets/)

### Machine Learning Project Checklist (**ALWAYS start from here**)
------
<ul>
    <li>Look at the big picture.</li>
    <li>Get the data.</li>
    <li>Discover and visualize the data to gain insights.</li>
    <li>Prepare the data for Machine Learning algorithms.</li>
    <li>Select a model and train it.</li>
    <li>Fine-tune your model.</li>
    <li>Present your solution.</li>
    <li>Launch, monitor, and maintain your system.</li>
</ul>

### Looking at the big picture.
<blockquote>
        <ul>
            <li><h3>While framing the problem, ask the following:</h3>
               <ul><li>
               What is the business objective?</li>
               <li>
                What do the current solutions look like (if any)?
               </li></ul>
               **Answer:**
               <p>With this information you are ready to start designing your system. It is clearly a typical __***supervised learning***__ task, since you are given labeled training examples (each instance comes with the expected output, i.e., the district’s median housing price).</p>
               <p>It is also a typical regression task, since you are asked to predict a value. More specifically, it is a __***multiple regression***__ problem, since the system will use multiple features to make a prediction (the district’s population, the median income, etc.); it is also univariate, since only a single value is predicted per district.</p>
               <p>There is no continuous flow of data coming into the system, there is no particular need to adjust rapidly to changing data, and the data is small enough to fit in memory, so plain __***batch learning***__ should do just fine.</p>
            </li>
            <li><h3>Select the performance measure.</h3>
                <p>A typical performance measure for regression problems is the __***Root Mean Square Error (RMSE)***__. It gives an idea of how much error the system typically makes in its predictions, with a higher weight for large errors. RMSE(X, h) is the cost function measured on the set of examples X using your hypothesis h.
                </p>
                <p>Suppose that there are many __***outlier districts***__. In that case, you may consider using the __***Mean Absolute Error***__ (MAE, also called the Average Absolute Deviation).</p>
               <blockquote>__***The higher the norm index, the more it focuses on large values and neglects small ones. This is why the RMSE (L2 norm) is more sensitive to outliers than the MAE (L1 norm). But when outliers are exponentially rare (as in a bell-shaped curve), the RMSE performs very well and is generally preferred.***__</blockquote>
            </li>
            <li><h3>Check the assumptions.</h3>
            <p>Check whether the downstream system that consumes your predictions needs them as numeric values or as categories ("cheap", "medium", "expensive"). If it needs categories, the task should be framed as classification rather than regression.
            </p>
            </li>
        </ul>
</blockquote>
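
To make the RMSE/MAE comparison above concrete, here is a minimal sketch in plain Python. The prices and predictions are made-up illustration values (not taken from the housing dataset), and the helper names `rmse` and `mae` are just local functions. The only point is that one large error inflates the RMSE much more than the MAE.

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Square Error: errors are squared, so large ones dominate."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean Absolute Error: every error counts in proportion to its size."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical median house values (in $1000s) and predictions.
y_true = [200, 250, 300, 350, 400]
y_pred = [210, 240, 310, 340, 410]            # every prediction is off by 10

# Same predictions, except one district is badly mispredicted (error of 200).
y_pred_outlier = [210, 240, 310, 340, 600]

print("no outlier  :", rmse(y_true, y_pred), mae(y_true, y_pred))
print("with outlier:", rmse(y_true, y_pred_outlier), mae(y_true, y_pred_outlier))
```

Without the outlier both measures agree; with it, the RMSE grows far more than the MAE, which is the sensitivity the note above describes.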

### Download the data & familiarize yourself with the data schema
------

In [2]:

# Automate the data fetching process.
import os
import tarfile
import urllib.request

DOWNLOAD_PATH = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH  = "datasets/housing"
HOUSING_URL   = DOWNLOAD_PATH + HOUSING_PATH + "/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # Create the target directory if it does not exist yet.
    os.makedirs(housing_path, exist_ok=True)
    # No leading slash here: os.path.join would otherwise discard housing_path.
    tgz_path = os.path.join(housing_path, "housing.tgz")

    # Download the archive and extract housing.csv next to it.
    urllib.request.urlretrieve(housing_url, tgz_path)

    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

### Loading the data into pandas & a quick look at the data structure (DataFrame)
------

In [3]:
import numpy as np
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path,'housing.csv')
    return pd.read_csv(csv_path)

In [4]:
%%time
housing = load_housing_data()
print(housing.head())


   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0   
1    -122.22     37.86                21.0       7099.0          1106.0   
2    -122.24     37.85                52.0       1467.0           190.0   
3    -122.25     37.85                52.0       1274.0           235.0   
4    -122.25     37.85                52.0       1627.0           280.0   

   population  households  median_income  median_house_value ocean_proximity  
0       322.0       126.0         8.3252            452600.0        NEAR BAY  
1      2401.0      1138.0         8.3014            358500.0        NEAR BAY  
2       496.0       177.0         7.2574            352100.0        NEAR BAY  
3       558.0       219.0         5.6431            341300.0        NEAR BAY  
4       565.0       259.0         3.8462            342200.0        NEAR BAY  
CPU times: user 37 ms, sys: 6.27 ms, total: 43.3 ms
Wall time: 42.8 ms


In [5]:
print(type(housing),housing.shape,"\n", housing.columns.values,"\n") #print(data.head(n=1))
print('*' *100)
housing.info()
print('*' *100)
# "total_bedrooms" has only 20433 non-null values out of 20640, so 207 districts are missing this feature and we need to handle these missing values.

<class 'pandas.core.frame.DataFrame'> (20640, 10) 
 ['longitude' 'latitude' 'housing_median_age' 'total_rooms' 'total_bedrooms'
 'population' 'households' 'median_income' 'median_house_value'
 'ocean_proximity'] 

****************************************************************************************************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
***************************************************************************

### Feature Distributions
------

In [6]:
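import seaborn as sns
sns.set_style("whitegrid")

# Passing the whole DataFrame as x= raised
# "TypeError: unsupported operand type(s) for +: 'float' and 'str'"
# because 'ocean_proximity' is a text column; plot only the numeric columns.
ax = sns.boxplot(data=housing.select_dtypes(include="number"), orient="h", palette="Set2")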

In [None]:
#MIN. , MAX. and number of unique values in each column.
dum1,dum2,dum3,dum4,dum5 = housing.longitude.unique() , housing.latitude.unique() , housing.housing_median_age.unique() ,housing.ocean_proximity.unique(), housing.population.unique()
print(housing.longitude.min(),"-->",housing.longitude.max(),len(dum1),"\n")
print(housing.latitude.min(),"-->",housing.latitude.max(),len(dum2),"\n")
print(housing.housing_median_age.min(),"-->", housing.housing_median_age.max(),len(dum3),"\n")
print(housing.ocean_proximity.min(),"-->", housing.ocean_proximity.max(),len(dum4),"\n",np.sort(dum4),"\n")
print(housing.population.min(),"-->", housing.population.max(),len(dum5),"\n",np.sort(dum5),"\n")
print('*' *100)
# to find how many categories exist for attribute "ocean_proximity" and associated rows.
# we infer: How many districts for each category.
print(housing["ocean_proximity"].value_counts())

#or 
print(housing.groupby('ocean_proximity').size())
print('*' *100)
print(housing.describe())

# Understanding the describe() output:
# 1. std: the standard deviation, which measures how dispersed the values are.
# 2. percentiles (25%, 50%, 75%), e.g. for "housing_median_age":
#    25% of the districts (rows) have a "housing_median_age" lower than 18 --> 1st quartile
#    50% of the districts (rows) have a "housing_median_age" lower than 29 --> median
#    75% of the districts (rows) have a "housing_median_age" lower than 37 --> 3rd quartile

### Visualizing the data
------

In [None]:
import matplotlib.pyplot as plt

housing.hist(bins=50, figsize=(20,15))
plt.show()

### Feature-Feature Relationships
------

In [None]:
from pandas.plotting import scatter_matrix  # pandas.tools.plotting no longer exists in current pandas

#df = pd.DataFrame(housing, columns=['logitude', 'latitude', 'median_income', 'population'])

sm = scatter_matrix(housing, alpha=0.2, figsize=(15, 15), diagonal='kde')

# Tidy up every subplot of the matrix.
for s in sm.reshape(-1):
    s.xaxis.label.set_rotation(45)              # rotate the x labels
    s.yaxis.label.set_rotation(0)               # keep the y labels horizontal
    s.get_yaxis().set_label_coords(-0.3, 0.5)   # offset the labels so they do not overlap the plot
    s.set_xticks(())                            # hide all ticks
    s.set_yticks(())

#plt.show()

### Create a Test set
------

def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    
    test_set_size = int(len(data)*test_ratio)
    test_indices  = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    
    return data.iloc[train_indices], data.iloc[test_indices]

In [None]:
train_set, test_set = split_train_test(housing, 0.2)
print(len(train_set),"train_set ,",len(test_set),"test_set")
# Check whether train_set and test_set change on every execution: they do.
print(train_set.head(n=2))
print(test_set.head(n=2))


In [2]:
print ("\033[1;35;46m NOTE: if you run the split above again, it will generate a different test_set.\n Over repeated train/test cycles your ML algorithm will get to see almost the whole dataset, so the test set can no longer estimate the generalization error.\n To avoid this, either call np.random.seed(42) before np.random.permutation() so that it always generates the same shuffled indices, or save the test_set on the first run and load it in subsequent runs.")

 NOTE: if you run the split above again, it will generate a different test_set.
 Over repeated train/test cycles your ML algorithm will get to see almost the whole dataset, so the test set can no longer estimate the generalization error.
 To avoid this, either call np.random.seed(42) before np.random.permutation() so that it always generates the same shuffled indices, or save the test_set on the first run and load it in subsequent runs.
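
The seeding advice in the note can be checked with a short sketch. The helper name `shuffled_indices` is hypothetical; it only wraps the two NumPy calls the note mentions.

```python
import numpy as np

def shuffled_indices(n, seed=42):
    # Re-seeding the global generator right before the permutation
    # makes the shuffle reproducible from run to run.
    np.random.seed(seed)
    return np.random.permutation(n)

first_run = shuffled_indices(10)
second_run = shuffled_indices(10)

print(first_run)
print(np.array_equal(first_run, second_run))
```

Both calls return the same ordering, so a split built on these indices stays the same between executions (as long as the dataset itself does not change).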


In [None]:
print(train_set.head(n=2))
print(test_set.head(n=2))

[flexible-apply](http://pandas.pydata.org/pandas-docs/version/0.17.1/groupby.html)

[pandas.DataFrame.loc](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html)   

In [None]:
import hashlib
print ("\033[1;35;46m Both solutions above break when we fetch an updated dataset.\n SOLUTION: use each instance's identifier to decide whether or not it goes into the test_set (assuming that every instance has a unique and immutable identifier).\n We compute a hash of each instance's identifier, keep only the last byte of the hash, and put the instance in the test set if that value is < 256*test_ratio (i.e. <= 51 for test_ratio=0.2).\n This keeps the test set consistent across multiple runs, even if you refresh the dataset. The new test_set will contain about 20% of the new instances, but it will not contain any instance that was previously in the train_set.")
def test_set_check(identifier, test_ratio, hash):
    return hash(np.int64(identifier)).digest()[-1] < test_ratio*256

def split_train_test(data, test_ratio,id_column , hash=hashlib.md5):
    #print(type(data))
    ids=data[id_column]
    #print(type(ids)) #print(ids.head())
    in_test_set=ids.apply(lambda id_ : test_set_check(id_,test_ratio,hash))
    #print(type(in_test_set)) #print(in_test_set.head()) #print(~in_test_set.head()) #print(data.loc[~in_test_set].head()) #print(data.loc[in_test_set].head())
    return data.loc[~in_test_set], data.loc[in_test_set] # rows flagged False form the train set, rows flagged True form the test set

housing_with_ID= housing.reset_index() # add an 'index' column
#print(housing_with_ID.head())
train_set, test_set = split_train_test(housing_with_ID, 0.2, "index")
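
The stability claim can be illustrated without pandas. The sketch below is a simplified variant, not the notebook's exact function: it hashes the string form of an integer identifier (an assumption made to keep it dependency-free), and the helper names `in_test_set` and `split_ids` are hypothetical.

```python
import hashlib

def in_test_set(identifier, test_ratio):
    # Hash the identifier and look only at the last byte of the digest (0..255).
    last_byte = hashlib.md5(str(identifier).encode("utf-8")).digest()[-1]
    return last_byte < 256 * test_ratio

def split_ids(ids, test_ratio=0.2):
    test = [i for i in ids if in_test_set(i, test_ratio)]
    train = [i for i in ids if not in_test_set(i, test_ratio)]
    return train, test

# Hypothetical identifiers: 1000 rows at first, then a refreshed dataset with 1000 more.
old_train, old_test = split_ids(range(1000))
new_train, new_test = split_ids(range(2000))

# Every old instance stays on the same side of the split after the refresh.
print(set(old_test) <= set(new_test))
print(set(old_train) <= set(new_train))
print(len(new_test) / 2000)   # roughly test_ratio
```

Because the decision depends only on the identifier, appending rows never moves an existing instance from the training set to the test set or back.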


In [None]:
print ("\033[1;35;46m NOTE: if you use the row index as the UNIQUE IDENTIFIER, make sure that \n 1. new data always gets appended to the end of the dataset, and \n 2. no row ever gets deleted.\n NOTE: if that is not possible, build a UNIQUE IDENTIFIER from the most stable features instead.\n E.g. a district's longitude and latitude are guaranteed to be stable for a few million years, so we can combine them into an ID.")

In [None]:
housing_with_ID["id"]= housing["longitude"]*1000 + housing["latitude"]

train_set, test_set = split_train_test(housing_with_ID, 0.2, "id")
train_set.shape,    test_set.shape

###### Using scikit-learn for the above task: RANDOM SAMPLING

In [None]:
from sklearn.model_selection import train_test_split
print ("\033[1;32;43m NOTE: you can pass this function multiple datasets with an identical number of rows, and it will split them on the same indices. THIS IS USEFUL IF YOU HAVE A SEPARATE DATAFRAME FOR LABELS. \n")
train_set,test_set = train_test_split(housing, test_size=0.2, random_state=42)
train_set.shape,    test_set.shape
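
As a small illustration of the note about passing several datasets at once, the sketch below splits a feature array and a separate label array together. `X` and `y` are toy arrays invented for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows of features and a separate label array.
X = np.arange(20).reshape(10, 2)   # row i is [2*i, 2*i + 1]
y = np.arange(10)                  # label i belongs to row i

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Rows and labels stay aligned because both arrays were split on the same indices.
print(X_test[:, 0] // 2, y_test)
print(X_train.shape, X_test.shape)
```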


In [None]:
print ("\033[1;35;46m 1. It is important to use a training set that is representative of the cases you want to generalise to. \n If the sample is too small you will have SAMPLING NOISE (i.e. NON-REPRESENTATIVE DATA AS A RESULT OF CHANCE). With a non-representative training set, we will train a model that is unlikely to make accurate predictions.\n")
print ("\033[1;30;46m But even very large samples can be non-representative if the sampling method is flawed: this is SAMPLING BIAS.\n")
print ("\033[1;35;46m So far we have considered purely RANDOM SAMPLING METHODS. This works if the dataset is large enough (especially relative to the number of attributes); if it is not, you risk introducing a SIGNIFICANT SAMPLING BIAS.\n")
print ("\033[1;30;46m E.g. when a survey company decides to call 1000 people to ask them a few questions, they do not just pick 1000 people at random.\n They try to ensure that these 1000 people are representative of the whole population.\n")
print ("\033[1;35;46m Suppose the US population is 51.3% FEMALE and 48.7% MALE. A well-conducted survey in the US would try to maintain this ratio in the SAMPLE: 513 FEMALES and 487 MALES.\n This is called STRATIFIED SAMPLING:\n THE POPULATION IS DIVIDED INTO HOMOGENEOUS SUBGROUPS CALLED STRATA, AND THE RIGHT NUMBER OF INSTANCES IS SAMPLED FROM EACH STRATUM TO GUARANTEE THAT THE testset IS REPRESENTATIVE OF THE WHOLE POPULATION.\n")

print ("\033[1;30;46m IF PURE RANDOM SAMPLING IS USED, THERE IS ABOUT A 12% CHANCE OF SAMPLING A SKEWED testset WITH EITHER FEWER THAN 49% FEMALE OR MORE THAN 54% FEMALE.\n Either way the survey results would be significantly biased.\n")
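
The order of magnitude of that "about 12%" figure can be checked with an exact binomial computation. This is a sketch under the assumption that each of the 1000 respondents is drawn independently with a 51.3% chance of being female; `binom_pmf` is a local helper, not a library function.

```python
import math

def binom_pmf(k, n, p):
    # Work in log space so the large binomial coefficients do not overflow.
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

n, p_female = 1000, 0.513

# "Fewer than 49% female" means at most 489 women; "more than 54%" means at least 541.
p_too_few = sum(binom_pmf(k, n, p_female) for k in range(0, 490))
p_too_many = sum(binom_pmf(k, n, p_female) for k in range(541, n + 1))
p_skewed = p_too_few + p_too_many

print(f"P(skewed sample) = {p_skewed:.3f}")
```

The result lands in the neighbourhood of the quoted figure; the exact number depends on whether the exact binomial or a normal approximation is used.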



In [None]:
# Suppose experts said that "median_income" is a very important attribute for predicting median housing prices.
# You may want to ensure that the test set is representative of the various income categories in the whole dataset.
dum10 = housing.median_income.unique()
print(housing.median_income.min(),"-->", housing.median_income.max(),len(dum10),housing['median_income'].count(),"\n",np.sort(dum10),"\n")
housing.shape

In [None]:
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
housing['median_income'].plot.hist(bins=50, figsize=(6,6))
plt.show()
print ("\033[1;32;43m QUESTION: How to see the histogram more closely? \n Most 'median_income' values are clustered between 2 and 5, but some go far beyond 6.\n")
print ("\033[1;35;46m Therefore, your dataset must contain a sufficient number of instances from each stratum, otherwise the estimate of the stratum's importance may be biased. This means that you should not have too many strata, and each stratum should be large enough.\n")

In [None]:
print("median_income min max",housing.median_income.min(),"-->", housing.median_income.max(),"\n")
print ("\033[1;35;46m 1. divide 'median_income' by 1.5 and take CEIL() to limit the number of 'income_categories' \n 2. then merge all categories above 5 into category 5\n")

#print(housing['median_income'].tail(),"\n")
housing['income_cat']= np.ceil(housing['median_income'] / 1.5)
#print("income_cat min max",housing['income_cat'].min(),"-->", housing['income_cat'].max(),"",housing['income_cat'].unique() )
#print(housing['income_cat'].tail())
# Wherever the condition is False, replace the entry with 5: Series.where(cond, other) keeps the values where cond is True.
# It is equivalent to np.where(cond, series, other).
housing['income_cat'] = housing['income_cat'].where(housing['income_cat'] < 5, 5.0)

#print("income_cat min max",housing['income_cat'].min(),"-->", housing['income_cat'].max(),"",housing['income_cat'].unique() )
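
The same binning can also be written with explicit bin edges. The sketch below uses a handful of made-up income values to show that `pd.cut` with edges at multiples of 1.5 gives the same categories as the ceil-and-cap approach; this is an alternative formulation, not what the notebook itself uses.

```python
import numpy as np
import pandas as pd

# Hypothetical median incomes (in tens of thousands of dollars).
income = pd.Series([0.5, 2.0, 3.2, 5.0, 7.3, 15.0])

# Approach used in this notebook: ceil(income / 1.5), then cap everything at category 5.
cat_ceil = np.ceil(income / 1.5)
cat_ceil = cat_ceil.where(cat_ceil < 5, 5.0)

# Same binning expressed with explicit bin edges.
cat_cut = pd.cut(income,
                 bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                 labels=[1, 2, 3, 4, 5]).astype(float)

print(pd.DataFrame({"income": income, "ceil": cat_ceil, "cut": cat_cut}))
```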


###### Now you can do STRATIFIED SAMPLING

In [None]:


from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train_index, test_index in split.split(housing,housing['income_cat']):
    strat_train_set = housing.loc[train_index]
    strat_test_set  = housing.loc[test_index]
    
strat_train_set.info()    
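
To see what stratification buys, here is a hand-rolled sketch of the idea: draw the same fraction from every stratum so that the test set keeps the overall proportions. It is only an illustration under made-up stratum sizes and is not how `StratifiedShuffleSplit` is implemented; `stratified_test_indices` is a hypothetical helper.

```python
import numpy as np

def stratified_test_indices(strata, test_ratio, seed=42):
    """Pick about test_ratio of the rows from each stratum (illustrative helper)."""
    rng = np.random.default_rng(seed)
    strata = np.asarray(strata)
    test_idx = []
    for value in np.unique(strata):
        members = np.flatnonzero(strata == value)        # row positions in this stratum
        n_test = int(round(len(members) * test_ratio))   # this stratum's share of the test set
        test_idx.extend(rng.choice(members, size=n_test, replace=False))
    return np.sort(np.array(test_idx))

# Hypothetical income categories with unequal sizes: 100, 300, 400, 150, 50 rows.
strata = np.repeat([1, 2, 3, 4, 5], [100, 300, 400, 150, 50])
test_idx = stratified_test_indices(strata, test_ratio=0.2)

overall = np.bincount(strata)[1:] / len(strata)
in_test = np.bincount(strata[test_idx], minlength=6)[1:] / len(test_idx)
print(overall)
print(in_test)
```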

In [None]:
#housing.info() # now has 11 columns
from sklearn.model_selection import train_test_split

new_train_set, new_test_set = train_test_split(housing,test_size=0.2, random_state=42)

RANDOM     =new_test_set['income_cat'].value_counts() / len(new_test_set)    # TEST SET WITH RANDOM SAMPLING
OVERALL    =housing['income_cat'].value_counts() / len(housing)              #---overall DATA SET
STRATIFIED =strat_test_set['income_cat'].value_counts() / len(strat_test_set)# TEST SET WITH STRATIFIED SAMPLING 

err1 = (OVERALL-RANDOM)*100
err2 = (OVERALL-STRATIFIED)*100

d = {'OVERALL':OVERALL,'RANDOM':RANDOM,'STRATIFIED':STRATIFIED,'Rand.%error':err1,'Strat.%error':err2}
compare=pd.DataFrame(data=d,columns=['OVERALL','RANDOM','STRATIFIED','Rand.%error','Strat.%error'] )
table  =compare.sort_index()
print ("\033[1;35;46m Observation: the test_set generated by STRATIFIED SAMPLING has income category proportions almost identical to those in the full dataset,\n whereas the test_set generated by pure RANDOM SAMPLING is quite SKEWED \n")

from IPython.display import display
display(table)
print ("\033[1;32;43m QUESTION: Think of how to display Skewness \n")

In [None]:
# Now remove "income_cat" so the data is back to its original state.

strat_train_set.info()
for set_ in (strat_train_set, strat_test_set):
    set_.drop(['income_cat'], axis=1, inplace=True)
strat_train_set.info()
   

### DISCOVER & VISUALIZE THE DATA TO GAIN INSIGHT
------
Make sure that you have put the test set aside and that you are only exploring the training set.
If the training set is very large, you may want to sample an EXPLORATION SET to make manipulations easy and fast.


In [None]:
# create a copy of training set.

housing=strat_train_set.copy()
#housing.info()       # its training set only, not full set

###### visualizing the geographical Data.

In [None]:
housing.plot(kind='scatter', x='longitude', y='latitude')
plt.show()
print ("\033[1;35;46m scatter plot of all the districts (instances) to visualize the data. \n ")


In [None]:
housing.plot(kind='scatter', x='longitude', y='latitude', alpha=0.1)
plt.show()
print ("\033[1;35;46m you can see the high-density areas of data points by lowering the alpha value.\n ")
print ("\033[1;35;46m The Bay Area and the areas around Los Angeles and San Diego, plus a fairly high density in the Central Valley (particularly around Sacramento & Fresno) \n")

In [None]:
#print(housing['population'].head())
# Lets look at housing prices.
housing.plot(kind='scatter', x='longitude', y='latitude', alpha=0.4,
            s= housing['population'] /100, label='population',
            c= housing['median_house_value'], cmap=plt.get_cmap('jet'), colorbar=True,
            )
plt.legend()
plt.show()
print ("\033[1;35;46m this clearly shows that house prices are very much related to location \n ")
print ("\033[1;32;43m QUESTION: How to use clustering Algo to detect \n 1. the main clusters \n 2. add new features that measure the proximity to the cluster centers.\n")


###### looking at the correlation

In [None]:
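# Compute Pearson's r (the standard correlation coefficient) between every pair of numeric attributes.
# 'ocean_proximity' is text, so restrict the computation to the numeric columns.
corr_matrix=housing.select_dtypes(include=[np.number]).corr()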
corr_matrix

In [None]:
print(corr_matrix['median_house_value'].sort_values(ascending=False))
print ("\033[1;35;46m The correlation coefficient ranges from -1 to 1. \n If it is close to 1, there is a strong positive correlation, e.g. 'median_house_value' tends to go up when 'median_income' goes up (0.687160). \n If it is close to -1, there is a strong negative correlation. Here there is only a small negative correlation with latitude: 'median_house_value' has a slight tendency to go down as we move north. \n A coefficient close to 0 means that there is no LINEAR CORRELATION. \n The coefficient only measures linear relationships (if x goes up, then y generally goes up/down), so it can completely miss NON-LINEAR RELATIONSHIPS. \n ")
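
The warning about non-linear relationships is easy to demonstrate. In the sketch below (synthetic data, not the housing set), `y_quadratic` is completely determined by `x`, yet its Pearson coefficient is essentially zero because the relationship is not linear.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)

y_linear = 3.0 * x + 1.0      # perfectly linear in x
y_quadratic = x ** 2          # fully determined by x, but not linearly

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print(f"linear:    r = {r_linear:.3f}")
print(f"quadratic: r = {r_quadratic:.3f}")
```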

In [None]:
#another way to watch the correlation between attributes.
from pandas.plotting import scatter_matrix  # pandas.tools.plotting no longer exists in current pandas

attributes = ['median_house_value','median_income','total_rooms','housing_median_age' ]

scatter_matrix(housing[attributes], figsize= (12,8))
plt.show()

In [None]:
housing.plot(kind='scatter', x='median_income',y='median_house_value',alpha=0.1)
plt.show()
print ("\033[1;35;46m 1. The correlation is very strong: you can see an upward trend and the points are not too dispersed.\n 2. The price cap is visible as a horizontal line around $500,000.\n\n The plot also reveals other horizontal lines around $450,000, $350,000 and $280,000. We may want to remove the corresponding districts to prevent our algorithm from learning to reproduce these DATA QUIRKS.  \n ")

###### Experimenting with various attributes combinations

In [None]:
print(housing.dtypes.index)
print ("\033[1;35;46m The 'total_rooms' in a district is not very useful on its own; what we really want is the number of 'rooms_per_household'. \n\n Similarly, 'total_bedrooms' is not very useful by itself; we want to compare it to the number of rooms. \n\n Another interesting attribute is 'population_per_household'.  \n ")
#housing['households'].tail()
housing['rooms_per_household']     =housing['total_rooms'] / housing['households']
housing['bedrooms_per_room']       =housing['total_bedrooms'] / housing['total_rooms']
housing['population_per_household']=housing['population'] / housing['households']

corr_matrix=housing.select_dtypes(include=[np.number]).corr()  # numeric columns only
print(corr_matrix['median_house_value'].sort_values(ascending=False))

print ("\033[1;35;46m 'bedrooms_per_room' is more correlated with the median house value than 'total_bedrooms' or 'total_rooms': apparently, houses with a lower bedroom/room ratio tend to be more expensive.\n Also, 'rooms_per_household' is more informative than 'total_rooms' in a district. \n")

## Prepare your data for machine learning algorithms
-------

In [None]:
# the "PREDICTORs"

housing= strat_train_set.drop('median_house_value',axis=1)

# the "Response", "Labels"
housing_labels= strat_train_set['median_house_value'].copy()


###### Data Cleaning

In [None]:
# Most ML algorithms cannot work with missing values.
# We know that some 'total_bedrooms' values are missing.

housing.info()
#Get rid of the corresponding districts.(removing whole row)
#housing.dropna(subset=['total_bedrooms'])

#Get rid of the whole attribute.
#housing.drop('total_bedrooms',axis=1)

#Set the values to some value(mean, 0, median etc.)
print ("\033[1;35;46m  You should compute the median on the training set and use it to fill the missing values in the training set. \n Also save the median value you have computed. It will be needed to\n 1. replace the missing values in the test set, and \n 2. replace the missing values in new data when the system goes live.  \n")
median =housing['total_bedrooms'].median()
nansFilled=housing['total_bedrooms'].fillna(median)
nansFilled.size # All 16512 entries retained instead of 16354.


In [None]:
# Scikit-learn provides a handy class to handle missing values.
# SimpleImputer replaces the Imputer class that older scikit-learn releases exposed in sklearn.preprocessing.
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
# The median can only be computed on numerical attributes, so 'ocean_proximity' has to be dropped.
housing_num = housing.drop('ocean_proximity', axis=1)
# Fit the imputer instance to the training data: it computes the median of each attribute and stores the result in its 'statistics_' instance variable.
print ("\033[1;35;46m  The reason for computing the median of every ATTRIBUTE is that, once the system goes live, we might receive missing values in any attribute, \n so it is safer to fit the imputer on all numerical attributes. \n")
imputer.fit(housing_num)

print(imputer.statistics_)

print(housing_num.median().values)
# Use the trained imputer to transform the training set by replacing the missing values (NaNs) with the learned medians.
X= imputer.transform(housing_num)
print(type(X))
# If required, transform the 'numpy.ndarray' back into a pandas DataFrame.
housing_tr=pd.DataFrame(data=X, columns=housing_num.columns)
#housing_tr.head()

###### NOTE
*  **ESTIMATORS: Any object that can estimate some parameters based on a dataset** _eg. Imputer_.
The estimation is performed by the fit() method, which takes only a **'dataset'** as a parameter (or two for supervised learning: the second dataset contains the labels).

 Any other parameter needed to guide the estimation process is a **HYPERPARAMETER** _eg. Imputer's strategy_, and it **must be set as an instance variable**, generally via a constructor parameter, i.e. always before training.
* **TRANSFORMERS: Some estimators, such as _Imputer_, can also transform datasets.**
 The transformation is performed by the transform() method, with the **'dataset'** to transform as a parameter.

 It returns the transformed dataset. The transformation generally relies on the **LEARNED PARAMETERS** _eg. median_, as in the case of _Imputer_.
* **PREDICTORS: Some estimators are capable of making predictions given a dataset.**
A predictor has a predict() method that takes a **'dataset'** of new instances and returns a dataset of corresponding predictions.

 It also has a score() method that measures the **QUALITY of the predictions** given a TESTSET (and the corresponding labels, in the case of supervised learning algorithms).

* All the **estimator's HYPERPARAMETERS** are accessible directly via **public instance variables** _{eg. imputer.strategy}_, and

* All the **estimator's learned parameters** are also accessible via **public instance variables** with an **underscore suffix** _eg._ imputer.statistics\_
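
The conventions listed above can be sketched with a toy class written in plain NumPy. `MedianFiller` is a hypothetical name invented for this illustration (it is not a scikit-learn class); it only mirrors the pattern: hyperparameters set in the constructor, learned parameters stored with a trailing underscore, `fit()` returning `self`.

```python
import numpy as np

class MedianFiller:
    """Toy estimator + transformer following the fit/transform convention."""

    def __init__(self, fill_missing=True):
        # Hyperparameter: set in the constructor, stored as a public attribute.
        self.fill_missing = fill_missing

    def fit(self, X, y=None):
        # Learned parameter: trailing underscore, computed from the data.
        self.medians_ = np.nanmedian(X, axis=0)
        return self

    def transform(self, X):
        X = np.array(X, dtype=float)   # work on a copy
        if self.fill_missing:
            rows, cols = np.nonzero(np.isnan(X))
            X[rows, cols] = self.medians_[cols]
        return X

X_train = np.array([[1.0, 10.0],
                    [2.0, np.nan],
                    [3.0, 30.0]])

filler = MedianFiller().fit(X_train)
print(filler.fill_missing)   # hyperparameter
print(filler.medians_)       # learned parameters
print(filler.transform(X_train))
```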

###### How to handle text and CATEGORICAL ATTRIBUTES:

In [None]:
# Use a transformer: LabelEncoder.
from sklearn.preprocessing import LabelEncoder

# Create an encoder instance.
encoder = LabelEncoder()
housing_cat= housing['ocean_proximity']
housing_cat_encoded=encoder.fit_transform(housing_cat)
housing_cat_encoded.shape
print ("\033[1;35;46m  Now we can use this numerical data in any ML algorithm.")
print(encoder.classes_)

In [None]:
print ("\033[1;35;46m Issue with the above encoding scheme: ML algorithms will assume that two nearby values are more similar than two distant values.\n This is not the case here (e.g. for 'ocean_proximity', categories 0 and 4 are actually more similar than categories 0 and 1).\n\n SOLUTION: create one BINARY attribute per category: ONE-HOT ENCODING.\n\n 1: HOT, 0: COLD ")
from sklearn.preprocessing import OneHotEncoder # converts INTEGER categorical values to oneHot vectors.
hot_encoder= OneHotEncoder()
housing_cat_1hot=hot_encoder.fit_transform(housing_cat_encoded.reshape(-1,1))# fit_transform() expect a 2D array so we need to reshape our 1D array.
housing_cat_1hot
type(housing_cat_1hot)# scipy sparse matrix : 
print ("\033[1;35;46m The benefit of a SciPy sparse matrix over a dense NumPy array is that the sparse representation only stores the location of the nonzero elements (in our case the 1s), which saves a lot of memory when there are many categories. ")
print ("\033[1;32;43m QUESTION:What is the reason behind that the ML algorithms assume that two nearby values are more similar than two distant values. ")

# to convert it to NumPy array (dense representation.)
housing_cat_1hot.toarray()
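
What the two encoders do can be reproduced by hand on a few values. This NumPy-only sketch assumes the five `ocean_proximity` categories in sorted order (the order `LabelEncoder` would assign); the `samples` array is made up.

```python
import numpy as np

categories = np.array(["<1H OCEAN", "INLAND", "ISLAND", "NEAR BAY", "NEAR OCEAN"])
samples = np.array(["NEAR BAY", "INLAND", "<1H OCEAN", "NEAR BAY"])

# Step 1: text category -> integer code (position in the sorted category list).
codes = np.searchsorted(categories, samples)

# Step 2: integer code -> one-hot row (a single 1 in the column of that code).
one_hot = np.eye(len(categories), dtype=int)[codes]

print(codes)
print(one_hot)
```

Each row of `one_hot` contains exactly one 1, so no category is numerically "closer" to another.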


In [None]:
# TRANSFORMATION: text cat.--> Integer cat.--> One Hot vectors.
from sklearn.preprocessing import LabelBinarizer
encoder_OneHot=LabelBinarizer()# pass sparse_output=True to get a SciPy sparse matrix instead
housing_cat_oneHot= encoder_OneHot.fit_transform(housing_cat) #return a dense NumPy array
housing_cat_oneHot

###### Custom Transformers

In [None]:
# You need to write your own TRANSFORMERS for tasks such as CUSTOM CLEANUP OPERATIONS or COMBINING SPECIFIC ATTRIBUTES.
# Create a class and implement 3 methods: fit(), transform() and fit_transform() (the last one comes for free from TransformerMixin).
from sklearn.base import BaseEstimator, TransformerMixin
# Column positions: 'total_rooms'=3, 'total_bedrooms'=4, 'population'=5, 'households'=6
room_ix, bedrooms_ix, population_ix, household_ix = 3,4,5,6
class combinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedroom_per_room =True):
        self.add_bedroom_per_room = add_bedroom_per_room
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X, y=None):
        room_per_household = X[:, room_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedroom_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, room_ix]
            return np.c_[X, room_per_household, population_per_household, bedrooms_per_room ]
        else:
            return np.c_[X, room_per_household, population_per_household]
        

attr_adder = combinedAttributesAdder(add_bedroom_per_room = False)
housing_extra_attri=attr_adder.transform(housing.values)
housing_extra_attri.shape # 9 columns + the two new ones

###### Feature Scaling
ML algorithms generally **do not perform** well when the input numerical attributes have very different **scales**.
Note: scaling the target values is generally not required.

In [None]:
dumy1, dumy2 = housing.total_rooms.unique(),housing.median_income.unique()
print(housing.total_rooms.min(),"-->", housing.total_rooms.max(),"\n")
print(np.floor(housing.median_income.min()),"-->", np.floor(housing.median_income.max()),"\n")


There are two methods for scaling:
* **Min-Max scaling / NORMALIZATION**: the **Min. value** is **subtracted** from **every value** and the result is **divided** by **Max. - Min.** The values are shifted and rescaled so that they end up ranging from 0 to 1.
* **STANDARDIZATION** : **every value