<img src="./images/shouke_logo.png"
     style="float: right; padding-bottom:100px;"
     width="100"/>
<br>
<br>

<table style="float:center;">
    <tr>
        <td>
            <img src='./images/python-logo.png' width=120>
        </td>
        <td>
            <img src='./images/pandas-logo.png' width=150>
        </td>
        <td>
            <img src='./images/scikit_learn_logo.png' width=150>
        </td>
    </tr>
</table>

<h1 style='text-align: center;'>Dividing Data</h1>
<h3 style='text-align: center;'>Shouke Wei, Ph.D. Professor</h3>
<h4 style='text-align: center;'>Email: shouke.wei@gmail.com</h4>

## Objective
- How to divide a dataset into independent variables and a dependent variable
- How to split a dataset for model estimation and testing

In [22]:
# import required package(s)
import pandas as pd
from sklearn.model_selection import train_test_split

# read data
df = pd.read_csv('./data/gdp_china_encoded.csv')

# display the first 5 rows
df.head()

Unnamed: 0,year,gdp,pop,finv,trade,fexpen,uinc,prov_hn,prov_js,prov_sd,prov_zj
0,2000,1.074125,8.65,0.314513,1.408147,0.108032,0.976157,0.0,0.0,0.0,0.0
1,2001,1.203925,8.733,0.348443,1.501391,0.132133,1.041519,0.0,0.0,0.0,0.0
2,2002,1.350242,8.842,0.385078,1.830169,0.152108,1.11372,0.0,0.0,0.0,0.0
3,2003,1.584464,8.963,0.48132,2.346735,0.169563,1.238043,0.0,0.0,0.0,0.0
4,2004,1.886462,9.052298,0.587002,2.955899,0.185295,1.362765,0.0,0.0,0.0,0.0


## 1. Slice data into features X and target y
- `X`: the independent variables, also called predictors, explanatory variables, treatment variables, factors, input variables, x-variables, or right-hand variables (because they appear on the right side of the regression equation)
- `y`: the dependent variable, also called the response, outcome variable, y-variable, or left-hand variable

### (1) DataFrame structure 

In [23]:
# features: every column except the target 'gdp'
X = df.drop(['gdp'], axis=1)
# target: the 'gdp' column
y = df['gdp']
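
As a minimal, self-contained sketch of the same slicing step, the snippet below uses a small made-up DataFrame (the name `toy` and its values are hypothetical, not the course data). It illustrates that `drop` returns a new DataFrame and leaves the original untouched, and that selecting a single column returns a Series.

```python
import pandas as pd

# hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "year": [2000, 2001, 2002],
    "gdp": [1.0, 1.2, 1.4],
    "pop": [8.6, 8.7, 8.8],
})

X_toy = toy.drop(["gdp"], axis=1)  # all columns except the target
y_toy = toy["gdp"]                 # the target column as a Series

print(list(X_toy.columns))    # ['year', 'pop']
print(type(y_toy).__name__)   # Series
print("gdp" in toy.columns)   # True: the original DataFrame is unchanged
```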

In [24]:
X

Unnamed: 0,year,pop,finv,trade,fexpen,uinc,prov_hn,prov_js,prov_sd,prov_zj
0,2000,8.650000,0.314513,1.408147,0.108032,0.976157,0.0,0.0,0.0,0.0
1,2001,8.733000,0.348443,1.501391,0.132133,1.041519,0.0,0.0,0.0,0.0
2,2002,8.842000,0.385078,1.830169,0.152108,1.113720,0.0,0.0,0.0,0.0
3,2003,8.963000,0.481320,2.346735,0.169563,1.238043,0.0,0.0,0.0,0.0
4,2004,9.052298,0.587002,2.955899,0.185295,1.362765,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...
90,2014,9.436000,3.078217,0.399111,0.602869,2.367206,1.0,0.0,0.0,0.0
91,2015,9.480000,3.566035,0.459535,0.679935,2.557561,1.0,0.0,0.0,0.0
92,2016,9.532000,4.041509,0.471385,0.745374,2.723292,1.0,0.0,0.0,0.0
93,2017,9.392000,4.449690,0.474870,0.821552,2.955790,1.0,0.0,0.0,0.0


### (2) NumPy array structure

In [25]:
# .values converts the DataFrame / Series to NumPy arrays (column names are dropped)
X = df.drop(['gdp'], axis=1).values
y = df['gdp'].values
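
The pandas documentation recommends `DataFrame.to_numpy()` over `.values` for new code; for data like this, both give the same array. The sketch below, on a made-up DataFrame (`toy` and its values are hypothetical), also shows why an integer column such as `year` appears as a float in the array: columns of mixed integer and float types are converted to a single common dtype.

```python
import numpy as np
import pandas as pd

# hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "year": [2000, 2001],   # integer column
    "gdp": [1.0, 1.2],
    "pop": [8.6, 8.7],      # float column
})

X_arr = toy.drop(["gdp"], axis=1).to_numpy()
y_arr = toy["gdp"].to_numpy()

print(type(X_arr).__name__, X_arr.shape)  # ndarray (2, 2)
print(y_arr.shape)                        # (2,)
print(X_arr.dtype)                        # float64: int and float columns share one dtype
print(np.array_equal(X_arr, toy.drop(["gdp"], axis=1).values))  # True
```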

In [26]:
X

array([[2.00000000e+03, 8.65000000e+00, 3.14513000e-01, 1.40814657e+00,
        1.08032000e-01, 9.76157000e-01, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00],
       [2.00100000e+03, 8.73300000e+00, 3.48443000e-01, 1.50139086e+00,
        1.32133000e-01, 1.04151900e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00],
       [2.00200000e+03, 8.84200000e+00, 3.85078000e-01, 1.83016893e+00,
        1.52108000e-01, 1.11372000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00],
       [2.00300000e+03, 8.96300000e+00, 4.81320000e-01, 2.34673452e+00,
        1.69563000e-01, 1.23804300e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00],
       [2.00400000e+03, 9.05229828e+00, 5.87002000e-01, 2.95589873e+00,
        1.85295000e-01, 1.36276500e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00],
       [2.00500000e+03, 9.19400000e+00, 6.97793000e-01, 3.50576062e+00,
   

## 2. Split data into training and test sets
- The training set is used for model estimation and the test set for validation.

<img src="./images/training_validation_test_sets.png" 
     align="center" 
     width="400"/>
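
If a separate validation set is needed in addition to a train/test split, one common approach is to call `train_test_split` twice. This is an assumption for illustration, not a step this notebook performs; the arrays, the variable names, and the 60/20/20 proportions below are all hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical data: 100 rows, 2 features
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# step 1: hold out 20% of all rows as the test set
X_rest, X_test_d, y_rest, y_test_d = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1
)

# step 2: take 25% of the remaining 80 rows (20 rows) as the validation set
X_train_d, X_val_d, y_train_d, y_val_d = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1
)

print(len(X_train_d), len(X_val_d), len(X_test_d))  # 60 20 20
```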

In [31]:
# hold out 30% of the rows for testing; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
X_train

array([[2.00900000e+03, 5.27600000e+00, 1.07423200e+00, 1.28238953e+00,
        2.65335000e-01, 2.46108100e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 1.00000000e+00],
       [2.01600000e+03, 9.94700000e+00, 5.33229400e+00, 1.54765718e+00,
        8.75521000e-01, 3.40120800e+00, 0.00000000e+00, 0.00000000e+00,
        1.00000000e+00, 0.00000000e+00],
       [2.01700000e+03, 7.65600000e+00, 5.32770000e+00, 3.99975000e+00,
        1.06210300e+00, 4.36218000e+00, 0.00000000e+00, 1.00000000e+00,
        0.00000000e+00, 0.00000000e+00],
       [2.00700000e+03, 9.36700000e+00, 1.25377000e+00, 9.31295660e-01,
        2.26185000e-01, 1.42647000e+00, 0.00000000e+00, 0.00000000e+00,
        1.00000000e+00, 0.00000000e+00],
       [2.01400000e+03, 9.78900000e+00, 4.24955500e+00, 1.70112192e+00,
        7.17731000e-01, 2.92219400e+00, 0.00000000e+00, 0.00000000e+00,
        1.00000000e+00, 0.00000000e+00],
       [2.00400000e+03, 7.53227636e+00, 6.55705000e-01, 1.41408306e+00,
   

In [32]:
X_train.shape

(66, 10)

In [34]:
X_test.shape

(29, 10)
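
The two shapes above add up to 95 rows. With `test_size=0.30`, `train_test_split` rounds the test set up to a whole number of rows (0.30 × 95 = 28.5, so 29) and the training set gets the remaining 66. The sketch below reproduces this on a made-up array of the same shape (the names `X_demo` and `y_demo` are hypothetical) and also checks that a fixed `random_state` yields the same split on every call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical data with the same shape as the notebook's X: 95 rows, 10 columns
X_demo = np.arange(95 * 10).reshape(95, 10)
y_demo = np.arange(95)

a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.30, random_state=1)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.30, random_state=1)

print(a_train.shape, a_test.shape)       # (66, 10) (29, 10)
print(np.array_equal(a_train, b_train))  # True: same random_state, same split
```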