### Decision Tree & Random Forest for two datasets

#### Decision Tree Regressor

In [1]:
import pandas as pd

#### 1. Data description

The Boston data frame has 506 rows and 14 columns.

This data frame contains the following columns:

crim = per capita crime rate by town.

zn = proportion of residential land zoned for lots over 25,000 sq.ft.

indus = proportion of non-retail business acres per town.

chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

nox = nitrogen oxides concentration (parts per 10 million).

rm = average number of rooms per dwelling.

age = proportion of owner-occupied units built prior to 1940.

dis = weighted mean of distances to five Boston employment centres.

rad = index of accessibility to radial highways.

tax = full-value property-tax rate per \$10,000.

ptratio = pupil-teacher ratio by town.

black = 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

lstat = lower status of the population (percent).

medv = median value of owner-occupied homes in \$1000s.

reference: https://www.kaggle.com/c/boston-housing

In [2]:
df = pd.read_csv("/Users/derickmorales/Desktop/python-ml-UDEMY/datasets/boston/boston.csv")
df.head()

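|  | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2 |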


In [3]:
df.shape

(506, 14)

In [4]:
column_names = df.columns.values.tolist()
predictors = column_names[:13]
target = column_names[13]

X = df[predictors]
Y = df[target]
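
Slicing the column list by position works here because `medv` is the last column, but it would pick the wrong target if the column order ever changed. A minimal sketch of selecting the target by name instead, using a tiny made-up frame (the values and the names `toy`, `X_toy`, `y_toy` are illustrative, not taken from the Boston file):

```python
import pandas as pd

# Toy frame standing in for the Boston data (values are made up for illustration).
toy = pd.DataFrame({
    "crim": [0.1, 0.2, 0.3],
    "rm": [6.5, 6.4, 7.1],
    "medv": [24.0, 21.6, 34.7],
})

target_name = "medv"
X_toy = toy.drop(columns=[target_name])  # every column except the target
y_toy = toy[target_name]                 # the target as a Series

print(X_toy.columns.tolist())  # ['crim', 'rm']
```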

#### Decision Tree Regressor

In [5]:
from sklearn.tree import DecisionTreeRegressor

regr_tree = DecisionTreeRegressor(min_samples_split=30, min_samples_leaf=10, random_state=0)

In [6]:
regr_tree.fit(X,Y) 

DecisionTreeRegressor(min_samples_leaf=10, min_samples_split=30, random_state=0)

In [7]:
prediccion = regr_tree.predict(df[predictors])

In [8]:
df["prediccion"] = prediccion 

# predicted value vs. the original "medv" value in df
df[["prediccion", "medv"]] 

|  | prediccion | medv |
|---|---|---|
| 0 | 22.840000 | 24.0 |
| 1 | 22.840000 | 21.6 |
| 2 | 35.247826 | 34.7 |
| 3 | 35.247826 | 33.4 |
| 4 | 35.247826 | 36.2 |
| ... | ... | ... |
| 501 | 22.840000 | 22.4 |
| 502 | 20.624138 | 20.6 |
| 503 | 28.978261 | 23.9 |
| 504 | 31.170000 | 22.0 |
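
The tree predictions in this section are made on the same rows the tree was trained on, so they tend to look better than the model would do on unseen data. A hedged sketch of comparing the in-sample error with a 5-fold cross-validated error follows. It uses synthetic data from `make_regression` as a stand-in for the Boston file (an assumption made so the block runs on its own), so its numbers are not comparable to the notebook's.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: same number of rows and predictors as the Boston frame).
X_syn, y_syn = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)

tree = DecisionTreeRegressor(min_samples_split=30, min_samples_leaf=10, random_state=0)

# In-sample error: fit and predict on the same rows.
tree.fit(X_syn, y_syn)
mse_in_sample = mean_squared_error(y_syn, tree.predict(X_syn))

# Cross-validated error: each fold is predicted by a tree that never saw it.
neg_mse = cross_val_score(tree, X_syn, y_syn, cv=5, scoring="neg_mean_squared_error")
mse_cv = -neg_mse.mean()

print(f"in-sample MSE: {mse_in_sample:.1f}")
print(f"5-fold CV MSE: {mse_cv:.1f}")
```

The cross-validated figure is usually the more honest estimate of how the tree generalizes.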


#### RandomForestRegressor

In [9]:
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(n_jobs=2, oob_score=True, n_estimators=100)
forest.fit(X,Y)

RandomForestRegressor(n_jobs=2, oob_score=True)
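
Because no `random_state` is set, the forest above will give slightly different out-of-bag results on each run. A small sketch, on synthetic data, of how fixing the seed makes the OOB predictions repeatable (the helper `fit_forest` and the seed value 0 are illustrative choices, not part of the original notebook):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data so the block runs without the Boston CSV.
X_syn, y_syn = make_regression(n_samples=300, n_features=13, noise=10.0, random_state=0)

def fit_forest(seed):
    # Same n_estimators and oob_score as the notebook's forest, plus an explicit seed.
    model = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=seed)
    return model.fit(X_syn, y_syn)

first = fit_forest(0)
second = fit_forest(0)

# Identical seeds give identical out-of-bag predictions.
same = np.allclose(first.oob_prediction_, second.oob_prediction_)
print(same)
```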

In [10]:
df["rforest_predict"]=forest.oob_prediction_
df[["rforest_predict", "medv"]]

|  | rforest_predict | medv |
|---|---|---|
| 0 | 29.566667 | 24.0 |
| 1 | 22.814286 | 21.6 |
| 2 | 34.651515 | 34.7 |
| 3 | 35.402222 | 33.4 |
| 4 | 34.334286 | 36.2 |
| ... | ... | ... |
| 501 | 24.296429 | 22.4 |
| 502 | 17.421053 | 20.6 |
| 503 | 26.554545 | 23.9 |
| 504 | 26.044118 | 22.0 |


In [11]:
# mean squared error of the out-of-bag predictions
df["rforest_error2"] = (df["rforest_predict"]-df["medv"])**2
sum(df["rforest_error2"])/len(df)

11.155116488007828

In [12]:
# coefficient of determination (R^2) computed on the out-of-bag predictions
forest.oob_score_

0.8678609910318378
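
For a regressor, `oob_score_` is the R² of the out-of-bag predictions against the true target, so it can be cross-checked with `r2_score`. The sketch below does that on synthetic data (a stand-in, since the Boston CSV is not bundled with scikit-learn) and also computes the OOB mean squared error with `mean_squared_error` rather than by hand.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in data so the block runs without the Boston CSV.
X_syn, y_syn = make_regression(n_samples=300, n_features=13, noise=10.0, random_state=0)

rf = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X_syn, y_syn)

# Each OOB prediction averages only the trees that did not train on that row.
oob_mse = mean_squared_error(y_syn, rf.oob_prediction_)
oob_r2 = r2_score(y_syn, rf.oob_prediction_)

# The manually computed R^2 should match the attribute set during fit.
print(np.isclose(oob_r2, rf.oob_score_))
```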

#### RandomForestClassifier (Iris data)

#### 2. Data description

The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning Repository.

It includes three iris species with 50 samples each, along with several measured properties of each flower. One species is linearly separable from the other two, but the other two are not linearly separable from each other.

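The columns in the Kaggle version of this dataset are Id, SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, and Species. The CSV file loaded in this notebook names them Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species, and has no Id column.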

reference: https://www.kaggle.com/uciml/iris

In [13]:
df_iris = pd.read_csv("/Users/derickmorales/Desktop/python-ml-UDEMY/datasets/iris/iris.csv")
df_iris.head()

|  | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [14]:
column_names2 = df_iris.columns.values.tolist()
predictors2 = column_names2[:4]
target2 = column_names2[4]

X_iris = df_iris[predictors2]
Y_iris = df_iris[target2]

In [15]:
from sklearn.ensemble import RandomForestClassifier

forest_clas = RandomForestClassifier(n_jobs=2, oob_score=True, n_estimators=100)

In [16]:
forest_clas.fit(X_iris,Y_iris)

RandomForestClassifier(n_jobs=2, oob_score=True)

In [17]:
forest_clas.oob_decision_function_

array([[1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       ...

In [18]:
forest_clas.oob_score_

0.9533333333333334
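
For a classifier, `oob_score_` is the accuracy obtained by taking, for each row, the class with the highest value in `oob_decision_function_`. A minimal sketch of that relationship follows. It assumes scikit-learn's bundled `load_iris` copy instead of the CSV used in this notebook, so the labels are the integers 0, 1, 2 rather than species names, and it fixes `random_state=0` for repeatability, so the score need not equal the value printed in this notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Bundled copy of the Iris data (assumption: used here instead of the notebook's CSV).
iris = load_iris()
X_demo, y_demo = iris.data, iris.target

clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(X_demo, y_demo)

# Each row holds class probabilities averaged over the trees that did not see that row.
oob_proba = clf.oob_decision_function_

# Pick the most probable class per row and map column indices back to labels.
oob_labels = clf.classes_[np.argmax(oob_proba, axis=1)]

oob_accuracy = np.mean(oob_labels == y_demo)
print(np.isclose(oob_accuracy, clf.oob_score_))
```

The columns of `oob_decision_function_` follow the order of `clf.classes_`, which is why the `argmax` index is mapped through that array.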