# Machine Learning

## 3️⃣ Data Preprocessing

### Why do we need data preprocessing?

- To convert data into the input format that machine learning models expect (generally a numeric matrix)
- To clean data by handling missing values and outliers
- To split data into a set for learning and a set for evaluation

### Preprocessing Categorical Data

#### Preprocessing Nominal Data

- **Numerical mapping method**

Maps each category to a number. A category with two values is generally mapped to 0 and 1 (e.g. male: 0, female: 1).

If a category can take three or more values, map them to equally spaced numbers (e.g. 0, 1, 2, ...). Note that such numbers imply an order that nominal data does not have.

In [1]:
import pandas as pd
import numpy as np

titanic = pd.read_csv('./titanic.csv')
print('Before preprocessing: \n', titanic['Sex'].head())

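# Map only the 'Sex' column; DataFrame.replace would also change matching values in other columns
titanic['Sex'] = titanic['Sex'].map({'male': 0, 'female': 1})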

print('After preprocessing: \n', titanic['Sex'].head())

Before preprocessing: 
 0      male
1    female
2    female
3    female
4      male
Name: Sex, dtype: object
After preprocessing: 
 0    0
1    1
2    1
3    1
4    0
Name: Sex, dtype: int64


- **Dummy method**

Create one new dummy column per category. Each sample gets 1 in the column for its own category and 0 in all the others. This is also known as one-hot encoding.

In [5]:
print('Before preprocessing: \n', titanic['Embarked'].head())

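# dtype=int produces the 0/1 columns shown in the output; recent pandas versions default to True/False
dummies = pd.get_dummies(titanic[['Embarked']], dtype=int)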

print('After preprocessing: \n', dummies.head())

Before preprocessing: 
 0    S
1    C
2    S
3    S
4    S
Name: Embarked, dtype: object
After preprocessing: 
    Embarked_C  Embarked_Q  Embarked_S
0           0           0           1
1           1           0           0
2           0           0           1
3           0           0           1
4           0           0           1


#### Preprocessing Ordinal Data

- **Numerical mapping method**

The gaps between the mapped numbers can be customized to reflect how far apart the levels are (e.g. very much -> 10, much -> 6, none -> 0).
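
The mapping above can be sketched with pandas as follows. This is only an illustration: the `survey` DataFrame, its `satisfaction` column, and the level names are made up for the example and are not part of titanic.csv. It shows both a custom scale and an equally spaced one.

```python
import pandas as pd

# Hypothetical survey answers; the column name and levels are illustrative only.
survey = pd.DataFrame({"satisfaction": ["none", "much", "very much", "much"]})

# Custom spacing: the gap between levels does not have to be equal.
custom_scale = {"none": 0, "much": 6, "very much": 10}
survey["satisfaction_score"] = survey["satisfaction"].map(custom_scale)

# Equal spacing: an ordered categorical assigns the codes 0, 1, 2 in the stated order.
levels = ["none", "much", "very much"]
survey["satisfaction_rank"] = pd.Categorical(
    survey["satisfaction"], categories=levels, ordered=True
).codes

print(survey)
```

Which scale fits better depends on whether the distance between levels carries meaning for the problem at hand.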

### Preprocessing Numerical Data

#### Scaling

If the values in one column are much larger than those in other columns, that column can have a disproportionate influence on the model.    
You can use **scaling** to remove this effect.

- Normalization

$X'$ is the normalized value of variable $X$, rescaled to the range from 0 to 1 (min-max scaling).

$X' = \frac{X - X_{min}}{X_{max} - X_{min}}$

In [13]:
def normal(data):
    data = (data - data.min()) / (data.max() - data.min())
    return data
    
titanic = pd.read_csv('./titanic.csv')
print('Before preprocessing: \n',titanic['Fare'].head())

Fare = normal(titanic['Fare'])

print('After preprocessing: \n', Fare.head())

Before preprocessing: 
 0     7.2500
1    71.2833
2     7.9250
3    53.1000
4     8.0500
Name: Fare, dtype: float64
After preprocessing: 
 0    0.014151
1    0.139136
2    0.015469
3    0.103644
4    0.015713
Name: Fare, dtype: float64
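
As an alternative to the hand-written `normal` function, scikit-learn provides `MinMaxScaler`. The sketch below assumes scikit-learn is installed and uses made-up fare values rather than titanic.csv. It learns the minimum and maximum from the training rows only and reuses them for the evaluation rows, which is one common way to keep evaluation data from influencing preprocessing.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up fares; each row is one sample and the single column is one feature.
train_fares = np.array([[5.0], [10.0], [25.0]])
test_fares = np.array([[15.0], [30.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_fares)  # learns min=5 and max=25
test_scaled = scaler.transform(test_fares)        # reuses the training min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Because the evaluation rows are scaled with the training minimum and maximum, a value larger than the training maximum (30 here) ends up above 1.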


- Standardization

$X'$ is the standardized value of variable $X$, where $\mu$ is the mean and $\sigma$ is the standard deviation of $X$.   
$X' = \frac{X - \mu}{\sigma}$

In [14]:
def standard(data):
    data = (data - data.mean()) / data.std()
    return data

titanic = pd.read_csv('./titanic.csv')
print('Before preprocessing: \n',titanic['Fare'].head())

Fare = standard(titanic['Fare'])

print('After preprocessing: \n', Fare.head())

Before preprocessing: 
 0     7.2500
1    71.2833
2     7.9250
3    53.1000
4     8.0500
Name: Fare, dtype: float64
After preprocessing: 
 0   -0.502163
1    0.786404
2   -0.488580
3    0.420494
4   -0.486064
Name: Fare, dtype: float64
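
One detail of the `standard` function is worth checking: pandas `Series.std()` divides by n - 1 by default (`ddof=1`), while `numpy.std` and scikit-learn's `StandardScaler` divide by n. The sketch below uses made-up numbers, not titanic.csv, to show how the two conventions differ.

```python
import pandas as pd

# Made-up values with mean 5 and population standard deviation 2.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = values.std()            # pandas default, ddof=1 (divide by n - 1)
population_std = values.std(ddof=0)  # divide by n, the numpy.std convention

z_sample = (values - values.mean()) / sample_std
z_population = (values - values.mean()) / population_std

print(sample_std, population_std)
print(z_population.round(3).tolist())
```

For a dataset with hundreds of rows the difference is small, but it explains why results from the two approaches do not match digit for digit.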


#### Categorization

You can use **categorization** when the group a value falls into matters more than the exact value of the variable.   
For example, if **whether a test score is above or below average** matters more than the score itself, use categorization.



In [7]:
def categorize_Age(data):
    # True for ages 20 and over, False for younger ages; missing ages stay missing (NaN)
    return data.apply(lambda age: age >= 20 if not np.isnan(age) else np.nan)
    
titanic = pd.read_csv('./titanic.csv')
print('Before preprocessing: \n',titanic['Age'].head())

Age = categorize_Age(titanic['Age'])

print('After preprocessing: \n', Age.head())
# Each age is categorized by whether the person is an adult (20 or older).

Before preprocessing: 
 0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64
After preprocessing: 
 0    True
1    True
2    True
3    True
4    True
Name: Age, dtype: object
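
When more than two groups are needed, `pd.cut` is one way to categorize a numeric column. The following is a minimal sketch on made-up ages. The threshold of 20 follows the example above, while the upper edge of 65 and the label names are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up ages, including one missing value.
ages = pd.Series([5.0, 17.0, 20.0, 42.0, np.nan, 70.0])

# right=False makes each bin include its left edge, so age 20 counts as an adult.
age_group = pd.cut(
    ages,
    bins=[0, 20, 65, np.inf],
    labels=["minor", "adult", "senior"],
    right=False,
)

print(age_group)
```

Missing ages remain missing in the result, so they can still be handled later together with the other missing values.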


### Processing Missing Data

#### Deleting columns that have many missing values

In [3]:
titanic = pd.read_csv('./titanic.csv')

# info() prints the summary itself and returns None, hence the 'None' in the output
print(titanic.info(), '\n')
# The 'Cabin' column has only 204 non-null values out of 891 rows.

# Delete the 'Cabin' column
titanic_1 = titanic.drop(columns=["Cabin"])

print(titanic_1.info(), '\n')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   S

#### Deleting samples that have missing data

In [4]:
# Delete samples that have missing data using dropna()

titanic_2 = titanic_1.dropna()

print(titanic_2.info(), '\n')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  712 non-null    int64  
 1   Survived     712 non-null    int64  
 2   Pclass       712 non-null    int64  
 3   Name         712 non-null    object 
 4   Sex          712 non-null    object 
 5   Age          712 non-null    float64
 6   SibSp        712 non-null    int64  
 7   Parch        712 non-null    int64  
 8   Ticket       712 non-null    object 
 9   Fare         712 non-null    float64
 10  Embarked     712 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 66.8+ KB
None 
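
The two deletion steps above can also be driven by the data instead of a hard-coded column name. The sketch below is an illustration on a small made-up DataFrame. The 50% threshold is an arbitrary assumption, and the column names only mimic titanic.csv.

```python
import numpy as np
import pandas as pd

# Small made-up table with missing values.
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "cabin": [np.nan, "C85", np.nan, np.nan],
    "embarked": ["S", "C", np.nan, "S"],
})

# Fraction of missing values in each column.
missing_ratio = df.isna().mean()

# Keep only the columns where at most half of the values are missing.
mostly_present = df.loc[:, missing_ratio <= 0.5]

# Drop only the rows that lack a value in the listed columns.
complete_age = mostly_present.dropna(subset=["age"])

print(missing_ratio)
print(complete_age)
```

Restricting `dropna` with `subset` keeps rows that are missing values only in columns the model does not use.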



### Processing Outliers

#### Deleting outliers using boolean indexing

In titanic.csv, some samples have a fractional 'Age' value (e.g. 28.5 or 0.83). This example treats those samples as outliers and deletes them using **boolean indexing**.

In [9]:
outlier = titanic_2[titanic_2['Age']-np.floor(titanic_2['Age']) > 0 ]['Age']
print(outlier)

print('Number of samples before deleting outlier: %d' %(len(titanic_2)))
print('Number of outliers: %d' %(len(outlier)))

# Delete outliers using boolean indexing

titanic_3 = titanic_2[titanic_2['Age'] - np.floor(titanic_2['Age']) == 0]
print('Number of samples after deleting outlier: %d' %(len(titanic_3)))

57     28.50
78      0.83
111    14.50
116    70.50
122    32.50
123    32.50
148    36.50
152    55.50
153    40.50
203    45.50
227    20.50
296    23.50
305     0.92
331    45.50
469     0.75
525    40.50
644     0.75
676    24.50
735    28.50
755     0.67
767    30.50
803     0.42
814    30.50
831     0.83
843    34.50
Name: Age, dtype: float64
Number of samples before deleting outlier: 712
Number of outliers: 25
Number of samples after deleting outlier: 687
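
The same idea can be written with a named boolean mask, which avoids repeating the condition. This is a minimal sketch on made-up rows, and the `people` DataFrame is hypothetical. Note that a missing age would also be flagged by this mask, so it assumes missing values were removed first, as in the steps above.

```python
import pandas as pd

# Made-up rows; two of the ages have a fractional part.
people = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "age": [22.0, 28.5, 0.83, 35.0],
})

# True where the age has a fractional part.
is_fractional = people["age"] % 1 != 0

outliers = people[is_fractional]   # rows that match the condition
cleaned = people[~is_fractional]   # ~ inverts the mask, keeping the other rows

print(outliers)
print(cleaned)
```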


### Splitting Data

#### Feature Data & Label Data

- Feature data : Input values used to predict the label. (e.g. Study time)
- Label data : Data to be predicted. (e.g. Test score)

Use **train_test_split** from the sklearn library to split the data into a set for learning and a set for evaluation.

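```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    features, labels,   # feature data and label data (placeholders for your own data)
    test_size=0.3,      # fraction of samples held out for evaluation, between 0 and 1
    random_state=42,    # random seed; fixing it makes the split reproducible
)
```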

In [11]:
from sklearn.model_selection import train_test_split

Feature = titanic_3.drop(columns = ['Survived'])
Label = titanic_3['Survived']

# Splitting into data for learning & evaluation
X_train, X_test, y_train, y_test = train_test_split(Feature, Label, test_size = 0.3, random_state = 42)

print('Number of training samples: %d' %(len(X_train)))
print('Number of test samples: %d' %(len(X_test)))

Number of training samples: 480
Number of test samples: 207
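
To see what `test_size` and `random_state` do without loading titanic.csv, here is a small sketch on synthetic arrays. It assumes scikit-learn is installed, and the data is generated only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 samples with 2 features each, plus one label per sample.
X = np.arange(40).reshape(20, 2)
y = np.arange(20) % 2

first = train_test_split(X, y, test_size=0.25, random_state=42)
second = train_test_split(X, y, test_size=0.25, random_state=42)

X_train, X_test, y_train, y_test = first
print(X_train.shape, X_test.shape)  # 15 training rows and 5 test rows

# The same seed gives the same split on every call.
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)
```

Fixing the seed is mainly useful for making experiments repeatable. Leaving `random_state` unset gives a different split on each run.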
