# Splitting Data



In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

## Supervised Learning

In supervised learning, every sample comes with a known answer (its label), and the model learns from these labeled examples.

## Splitting X and Y

### Using the zip function

In [2]:
X, y = zip(['a', 1], ['b', 2], ['c', 3])
print('X data :',X)
print('y data :',y)

X data : ('a', 'b', 'c')
y data : (1, 2, 3)


In [3]:
sequences = [['a', 1], ['b', 2], ['c', 3]]
X, y = zip(*sequences)
print('X data :',X)
print('y data :',y)

X data : ('a', 'b', 'c')
y data : (1, 2, 3)
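
The `*` in `zip(*sequences)` unpacks the outer list so that each `[feature, label]` pair is passed to `zip` as a separate argument, which is why this cell gives the same result as the previous one. A minimal sketch of the same idea, with the resulting tuples converted to lists; the names `pairs`, `features`, and `labels` are illustrative and do not appear in the notebook:

```python
pairs = [['a', 1], ['b', 2], ['c', 3]]

# zip(*pairs) is the same call as zip(['a', 1], ['b', 2], ['c', 3])
features, labels = zip(*pairs)

# zip yields tuples; convert to lists if mutable sequences are needed
features, labels = list(features), list(labels)

print(features)  # ['a', 'b', 'c']
print(labels)    # [1, 2, 3]
```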


### Splitting using a DataFrame

In [4]:
values = [['Last Benefit for You!', 1],
["Please check if I can see you tomorrow...", 0] ,
['Doyeon, how are you doing? Long time no see...',0] ,
['(Advertising) You can predict stock prices with AI!', 1]]
columns = ['Mail Body', 'Spam Mail Present']

df = pd.DataFrame(values, columns=columns)
df

| | Mail Body | Spam Mail Present |
|---|---|---|
| 0 | Last Benefit for You! | 1 |
| 1 | Please check if I can see you tomorrow... | 0 |
| 2 | Doyeon, how are you doing? Long time no see... | 0 |
| 3 | (Advertising) You can predict stock prices wit... | 1 |


In [5]:
X = df['Mail Body']
y = df['Spam Mail Present']

In [6]:
print('X data :',X.to_list())
print('y data :',y.to_list())

X data : ['Last Benefit for You!', 'Please check if I can see you tomorrow...', 'Doyeon, how are you doing? Long time no see...', '(Advertising) You can predict stock prices with AI!']
y data : [1, 0, 0, 1]


### Splitting using NumPy

In [9]:
np_array=np.arange(0,16).reshape((4,4))
print('the total data:\n',np_array)

the total data:
 [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]


In [12]:
x = np_array[:, :3]
y = np_array[:,3]

print('x data:\n',x)
print('y data:\n',y)

x data:
 [[ 0  1  2]
 [ 4  5  6]
 [ 8  9 10]
 [12 13 14]]
y data:
 [ 3  7 11 15]
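
The slices above hard-code column index 3. When the label is simply the last column, negative indexing expresses the same split without depending on the number of columns. A small sketch under that assumption; the names `data`, `features`, and `labels` are illustrative:

```python
import numpy as np

data = np.arange(0, 16).reshape((4, 4))

features = data[:, :-1]  # every column except the last
labels = data[:, -1]     # the last column only

print(features.shape)  # (4, 3)
print(labels)          # [ 3  7 11 15]
```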


## Splitting the test data

### Splitting using sklearn

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

In [14]:
# producing X and y data 

X, y = np.arange(10).reshape((5, 2)), range(5)

print('X total data :')
print(X)
print('y total data :')
print(list(y))

X total data :
[[0 1]
 [2 3]
 [4 5]
 [6 7]
 [8 9]]
y total data :
[0, 1, 2, 3, 4]


Here we split the data at a ratio of 7:3. By default, train_test_split() shuffles the data and then separates it into training data and test data. If you set random_state to a specific number and use the same number again later, you always get the same training and test data. If you change the value, the data is shuffled in a different order, which produces different training and test data. Let's confirm this in practice. The random_state value of 1234 was chosen arbitrarily. Note that 5 samples cannot be divided exactly 7:3, so the result is 3 training samples and 2 test samples.

In [15]:
# splitting it with 7:3 ratio 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

In [16]:
print('X training :')
print(X_train)
print('X test :')
print(X_test)

X training :
[[2 3]
 [4 5]
 [6 7]]
X test :
[[8 9]
 [0 1]]


In [17]:
print('y training data :')
print(y_train)
print('y test data :')
print(y_test)

y training data :
[1, 2, 3]
y test data :
[4, 0]


In [19]:
# changing the random_state value --> a different split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

print('y training data :')
print(y_train)
print('y test data :')
print(y_test)

print('X training :')
print(X_train)
print('X test :')
print(X_test)


y training data :
[4, 0, 3]
y test data :
[2, 1]
X training :
[[8 9]
 [0 1]
 [6 7]]
X test :
[[4 5]
 [2 3]]
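
To check that a fixed `random_state` really reproduces the split, the sketch below calls `train_test_split` twice with the same seed and compares the results. It also prints the split sizes: with 5 samples and `test_size=0.3`, scikit-learn rounds the test count up to 2. The names `first`, `second`, and `same` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(10).reshape((5, 2)), list(range(5))

# Two independent calls with the same seed
first = train_test_split(X, y, test_size=0.3, random_state=1234)
second = train_test_split(X, y, test_size=0.3, random_state=1234)

# Compare X_train, X_test, y_train, y_test pairwise
same = all(np.array_equal(a, b) for a, b in zip(first, second))
print('identical splits:', same)

X_train, X_test, y_train, y_test = first
print(len(X_train), len(X_test))
```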


### Splitting the data manually

In [20]:
# producing the X, y data
X, y = np.arange(0,24).reshape((12,2)), range(12)

print('X total data :')
print(X)
print('y total data :')
print(list(y))

X total data :
[[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]
 [12 13]
 [14 15]
 [16 17]
 [18 19]
 [20 21]
 [22 23]]
y total data :
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]


In [21]:
num_of_train = int(len(X) * 0.8) 
num_of_test = int(len(X) - num_of_train) 
print('size of the training data :',num_of_train)
print('size of the test data :',num_of_test)

size of the training data : 9
size of the test data : 3


We have not split the training and test data yet; we have only decided how many samples each should contain. Note that num_of_test should not be calculated as int(len(X) * 0.2), because a sample can be lost that way. For example, suppose there are 4,518 samples in total. 80% of 4,518 is 3,614.4, which becomes 3,614 when the decimal part is dropped. 20% of 4,518 is 903.6, which becomes 903. Since 3,614 + 903 = 4,517, one sample is missing. Therefore, calculate one side first and obtain the other by subtracting it from the total.
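
The arithmetic in this example can be checked directly. The sketch below uses the figure 4,518 from the text and compares truncating both shares with computing the test size by subtraction; the variable names are illustrative:

```python
total = 4518

n_train = int(total * 0.8)            # 3614.4 truncated to 3614
n_test_truncated = int(total * 0.2)   # 903.6 truncated to 903
n_test_subtracted = total - n_train   # 904

print(n_train + n_test_truncated)     # 4517, one sample is lost
print(n_train + n_test_subtracted)    # 4518, every sample is kept
```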

In [22]:
X_test = X[num_of_train:] # the remaining rows at the bottom (about 20 percent)
y_test = y[num_of_train:] # the remaining labels at the bottom (about 20 percent)
X_train = X[:num_of_train] # the first num_of_train rows (about 80 percent)
y_train = y[:num_of_train] # the first num_of_train labels (about 80 percent)

print('X test data :')
print(X_test)
print('y test data :')
print(list(y_test))

X test data :
[[18 19]
 [20 21]
 [22 23]]
y test data :
[9, 10, 11]
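
Unlike `train_test_split`, the slicing above keeps the original order, so the test set is always the last rows. If shuffling is wanted in a manual split, one option is to permute the row indices first. This is a sketch using NumPy's `default_rng`, not something the notebook itself does; the seed 1234 and the names `rng`, `indices`, `train_idx`, and `test_idx` are arbitrary choices.

```python
import numpy as np

X, y = np.arange(0, 24).reshape((12, 2)), np.arange(12)

rng = np.random.default_rng(1234)   # fixed seed for a repeatable shuffle
indices = rng.permutation(len(X))   # row indices 0..11 in random order

num_of_train = int(len(X) * 0.8)    # 9 for 12 samples
train_idx, test_idx = indices[:num_of_train], indices[num_of_train:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(X_train.shape, X_test.shape)  # (9, 2) (3, 2)
```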
