# Display Data
This part introduces the two core pandas objects: DataFrame and Series.

In [1]:
import pandas as pd

A DataFrame is a table. It contains an array of individual entries, each of which has a value and belongs to one row and one column. The list of row labels is called the index. Here is an example:

In [2]:
pd.DataFrame({"Max's notes": [44, 64], "Average Notes": [56, 99]}, index=("Math","Physics"))

,Max's notes,Average Notes
Math,44,56
Physics,64,99


A Series is a sequence of data values: a single labeled column rather than a table. For example:

In [3]:
pd.Series([41, 54, 55, 57, 58, 59], index=[2002, 2003, 2004, 2005, 2006, 2007], name="Patient's weight over the years")

2002    41
2003    54
2004    55
2005    57
2006    58
2007    59
Name: Patient's weight over the years, dtype: int64
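The two objects are closely related: selecting one column of a DataFrame returns a Series that keeps the same row labels. The short sketch below reuses the toy table from the DataFrame example; the variable names `notes` and `max_notes` are only illustrative.

```python
import pandas as pd

# Same toy table as in the DataFrame example above.
notes = pd.DataFrame(
    {"Max's notes": [44, 64], "Average Notes": [56, 99]},
    index=["Math", "Physics"],
)

# Selecting a single column returns a Series that shares the row labels.
max_notes = notes["Max's notes"]
print(type(max_notes).__name__)  # Series
print(max_notes.loc["Physics"])  # 64
```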

# Selecting Data for Modeling

Next, we load the data from `melb_data.csv` and look at a few ways to select parts of it.


In [4]:
# A plain relative path avoids the invalid "\m" escape sequence and works on every OS.
melb_data = pd.read_csv("melb_data.csv")

**Format 1:** Bracket notation with a list of column labels, `data[[wanted_labels]]`

In [5]:
X = melb_data[['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']]
X.describe()

,Rooms,Bathroom,Landsize,Lattitude,Longtitude
count,13580.0,13580.0,13580.0,13580.0,13580.0
mean,2.937997,1.534242,558.416127,-37.809203,144.995216
std,0.955748,0.691712,3990.669241,0.07926,0.103916
min,1.0,0.0,0.0,-38.18255,144.43181
25%,2.0,1.0,177.0,-37.856822,144.9296
50%,3.0,1.0,440.0,-37.802355,145.0001
75%,3.0,2.0,651.0,-37.7564,145.058305
max,10.0,8.0,433014.0,-37.40853,145.52635


`.head()` shows the first few rows (five by default).

In [6]:
X.head()

,Rooms,Bathroom,Landsize,Lattitude,Longtitude
0,2,1.0,202.0,-37.7996,144.9984
1,2,1.0,156.0,-37.8079,144.9934
2,3,2.0,134.0,-37.8093,144.9944
3,3,2.0,94.0,-37.7969,144.9969
4,4,1.0,120.0,-37.8072,144.9941


**Format 2:** Dot Notation <br>
Dot notation selects a single column as a Series. Here we use it to select the prediction target, `Price`.

In [7]:
y = melb_data.Price
y.describe()

count    1.358000e+04
mean     1.075684e+06
std      6.393107e+05
min      8.500000e+04
25%      6.500000e+05
50%      9.030000e+05
75%      1.330000e+06
max      9.000000e+06
Name: Price, dtype: float64
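The two selection formats can be compared side by side. The sketch below uses a small made-up DataFrame (`homes` is hypothetical, and the `"Land size"` column with a space is invented to show a limitation); it is meant as an illustration, not as part of the notebook's pipeline.

```python
import pandas as pd

# Hypothetical toy data; only some column names mirror melb_data.
homes = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Price": [850000.0, 1035000.0, 1600000.0],
    "Land size": [202.0, 156.0, 120.0],
})

# Dot and bracket notation return the same Series for a simple column name.
same_column = homes.Price.equals(homes["Price"])
print(same_column)  # True

# A list of labels returns a DataFrame, even when the list holds one label.
features = homes[["Rooms"]]
print(type(features).__name__)  # DataFrame

# A name containing a space cannot be written with dot notation,
# so bracket notation is the only option here.
land = homes["Land size"]
```

Dot notation is convenient for quick access, but bracket notation works for every column name, which makes it the safer default.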

To list only the column names, use the `columns` attribute:

In [8]:
melb_data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

**.dropna**  <br> <br> `axis`: {0 or ‘index’, 1 or ‘columns’}, default 0 <br>
Determines whether rows or columns that contain missing values are removed. <br>
<ul><li>0, or ‘index’: Drop rows which contain missing values.</li><li>1, or ‘columns’: Drop columns which contain missing values.</li></ul> <br>
Only a single axis is allowed.

In [9]:
melb_data = melb_data.dropna(axis= 0)
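To see what the `axis` argument changes, here is a small sketch on made-up data (the DataFrame `toy` and its values are hypothetical). One missing value is enough to remove a whole row with `axis=0` or a whole column with `axis=1`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a single missing BuildingArea value.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "BuildingArea": [79.0, np.nan, 142.0],
    "Price": [1035000.0, 1465000.0, 1600000.0],
})

rows_dropped = toy.dropna(axis=0)  # removes the second row
cols_dropped = toy.dropna(axis=1)  # removes the BuildingArea column

print(rows_dropped.shape)  # (2, 3)
print(cols_dropped.shape)  # (3, 2)
```

`dropna` returns a new DataFrame and leaves the original untouched, so objects selected before the call, such as `X` and `y` above, keep all of their rows.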

# Build the Model
The steps to building and using a model are:
<br>

<ul><li><strong>Define:</strong> What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.</li>
<li><strong>Fit:</strong> Capture patterns from the provided data. This is the heart of modeling.</li>
<li><strong>Predict:</strong> Just what it sounds like.</li>
<li><strong>Evaluate:</strong> Determine how accurate the model's predictions are.</li>
</ul>
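The four steps can be sketched end to end on a tiny made-up dataset. Everything here is an assumption for illustration: `X_toy` and `y_toy` are hypothetical stand-ins for the housing data, and `mean_absolute_error` is one possible choice of metric for the Evaluate step, not one prescribed by this notebook.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data standing in for the housing features and prices.
X_toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 5],
    "Landsize": [150.0, 200.0, 400.0, 650.0],
})
y_toy = pd.Series([600000.0, 850000.0, 1200000.0, 1500000.0], name="Price")

model = DecisionTreeRegressor(random_state=5)  # Define
model.fit(X_toy, y_toy)                        # Fit
predictions = model.predict(X_toy)             # Predict
mae = mean_absolute_error(y_toy, predictions)  # Evaluate (in-sample)
print(mae)
```

The model is scored on the same rows it was trained on, and an unrestricted tree can memorize those four rows, so this error is overly optimistic. A more honest estimate would score the model on data it has not seen.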

In [10]:
from sklearn.tree import DecisionTreeRegressor

# random_state seeds the random number generator used while building the tree. Specifying a number for random_state ensures you get the same results in each run.
# It also helps you compare the performance of different models or reproduce your results in the future. The particular value does not meaningfully affect model quality.
melb_model = DecisionTreeRegressor(random_state=5)
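The reproducibility claim can be checked directly. This sketch trains two trees with the same `random_state` on synthetic data (`X_demo` and `y_demo` are generated here purely for illustration) and compares their predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, generated only for this demonstration.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=50)

# Two separate models built with the same seed on the same data.
first = DecisionTreeRegressor(random_state=5).fit(X_demo, y_demo)
second = DecisionTreeRegressor(random_state=5).fit(X_demo, y_demo)

same = np.array_equal(first.predict(X_demo), second.predict(X_demo))
print(same)  # True
```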

The `fit` method trains the model on the training data after the model has been initialized. <br>
<p align="center">
    <img width="500" src="sklearn-fit_syntax-explanation.png" alt="Explanation of the scikit-learn fit syntax">
</p>

In [11]:
melb_model.fit(X, y)
print(X.head())
print(melb_model.predict(X.head()))

   Rooms  Bathroom  Landsize  Lattitude  Longtitude
0      2       1.0     202.0   -37.7996    144.9984
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
3      3       2.0      94.0   -37.7969    144.9969
4      4       1.0     120.0   -37.8072    144.9941
[1480000. 1035000. 1465000.  850000. 1600000.]
