# **Data Preparation in Machine Learning**


Data preparation is a fundamental step in any machine learning pipeline. It ensures the quality and reliability of the data being fed into the models. This process involves data collection, cleaning, transformation, and splitting.

---

## 1. **Data Collection**


### Sources of Data:


- **Structured Data**: Databases, CSV files, etc.
- **Unstructured Data**: Text, images, videos, etc.
- **APIs**: Open datasets via REST APIs (e.g., Kaggle, Open Data Portal).

### Tools for Data Collection:


- Python libraries like `requests`, `BeautifulSoup` (for web scraping).
- Tools like Postman for API testing.
- Databases such as MySQL, MongoDB.
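
As a rough sketch of loading data from a database, the example below uses Python's built-in `sqlite3` module as a stand-in for a server such as MySQL, so that it runs without any setup. The `students` table and its rows are hypothetical and exist only for illustration.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database, used here as a stand-in for a real database server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, score REAL)")  # hypothetical table
conn.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [("Ana", 91.5), ("Ben", 78.0), ("Cy", 84.25)],  # made-up rows
)
conn.commit()

# Load the result of a SQL query straight into a DataFrame.
students = pd.read_sql_query(
    "SELECT name, score FROM students ORDER BY name", conn
)
conn.close()

print(students)
```

With a real database, only the connection object would change; the `pd.read_sql_query` call stays the same in spirit.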

### Example Code:


```python
import pandas as pd

data = pd.read_csv('data.csv')  # Load data from a CSV file
print(data.head())
```

---

## 2. **Data Cleaning**


Data cleaning involves removing or correcting problems in the dataset, such as missing values, duplicate rows, and outliers.


### Steps in Data Cleaning:


- Handling **missing values**:
  - Replace with mean/median/mode.
  - Drop rows/columns.
- Removing **duplicates**.
- Handling **outliers**: [Outliers Detection Methods](5.2%20-%20data_cleaning_process.ipynb)
  - Use Z-score or IQR methods.
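
The two outlier rules named above can be sketched as follows. The sample values are made up, with `95` planted as the outlier. The Z-score threshold of `2.0` is an assumption chosen because the sample is tiny; `3.0` is a common choice for larger samples.

```python
import pandas as pd

# Illustrative values; 95 is the planted outlier.
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 95])

# IQR rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Z-score rule: flag points more than `threshold` standard deviations from the mean.
threshold = 2.0  # assumption for this small sample
z_scores = (values - values.mean()) / values.std(ddof=0)
z_mask = z_scores.abs() > threshold

print(values[iqr_mask].tolist())  # values flagged by the IQR rule
print(values[z_mask].tolist())    # values flagged by the Z-score rule
```

Both rules flag only the planted value here. The IQR rule is usually the more robust of the two, because the mean and standard deviation used by the Z-score are themselves pulled by extreme values.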

### Example Code:


```python
# Handling missing values
data.fillna(data.mean(numeric_only=True), inplace=True)  # Replace NaN in numeric columns with the column mean

# Removing duplicates
data.drop_duplicates(inplace=True)
```
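
The mean is not the only option listed above. The sketch below, on a small made-up DataFrame (the `age` and `city` columns are hypothetical), fills a numeric column with its median and a categorical column with its mode, and shows the alternative of dropping incomplete rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "city": ["Oslo", "Rome", None, "Rome"],
})

# Alternative 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Alternative 2: impute. Median for the numeric column, mode for the categorical one.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])

print(dropped)
print(filled)
```

Dropping rows is simple but discards data; imputing keeps every row at the cost of introducing estimated values.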

- [Pandas Examples 1](../../Python%20Libraries/pandas/pandas_ex1.ipynb)
- [Pandas Examples 2](../../Python%20Libraries/pandas/pandas_ex2.ipynb)

---

## 3. **Data Transformation**



Data transformation converts features into a form that machine learning algorithms can use effectively, for example by rescaling numeric features and encoding categorical ones.

### **Normalization and Standardization**


#### Normalization:


- Scales the data to a fixed range, typically [0, 1].
- Useful when features have different units or scales. <br><br>

![Normalization.png](../images/normalization.png)

##### Example Code:


```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
data_normalized = scaler.fit_transform(data)  # Assumes every column in `data` is numeric
print(data_normalized[:5])
```
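
To make the scaling rule concrete, the following sketch applies the min-max formula `(x - min) / (max - min)` column by column with NumPy and checks that it agrees with `MinMaxScaler`. The array `X` holds illustrative values only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features on very different scales (illustrative values).
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 800.0]])

# Min-max formula applied per column: (x - min) / (max - min)
col_min = X.min(axis=0)
col_max = X.max(axis=0)
manual = (X - col_min) / (col_max - col_min)

# The same transformation via scikit-learn.
scaled = MinMaxScaler().fit_transform(X)

print(manual)
print(np.allclose(manual, scaled))
```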

#### Standardization:


- Centers the data around zero with a standard deviation of 1.
- Useful when the data follows a Gaussian distribution. <br><br>

![Standardization.png](../images/standardization.png)

##### Example Code:


```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data_standardized = scaler.fit_transform(data)  # Assumes every column in `data` is numeric
print(data_standardized[:5])
```
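
As a small check of what standardization computes, this sketch applies `(x - mean) / std` per column and compares the result with `StandardScaler`. Note that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default. The array `X` is illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative values only.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 800.0]])

# Standardization formula per column: (x - mean) / std, with ddof=0.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same transformation via scikit-learn.
standardized = StandardScaler().fit_transform(X)

print(manual)
print(np.allclose(manual, standardized))
```

After the transformation, each column has a mean of zero and a standard deviation of one.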

### **Encoding Categorical Variables**


#### One-Hot Encoding:


- Converts categorical variables into a binary matrix.
- Avoids ordinal relationships between categories.



![One-hot encoding.png](../images/one-hot%20encording_example.png)

##### Example Code:


```python
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Sample data
data = pd.DataFrame({'Color': ['Red', 'Blue', 'Green']})

encoder = OneHotEncoder()
data_encoded = encoder.fit_transform(data).toarray()
print(data_encoded)
```

##### Python Implementation:

- ##### **Using OneHotEncoder Class**

```python
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Sample data
data = pd.DataFrame({'Result': ['Pass', 'Fail', 'Pass', 'Pass', 'Absent', 'Fail', 'Fail', 'Pass', 'Pass', 'Absent', 'Pass']})

encoder = OneHotEncoder()
data_encoded = encoder.fit_transform(data).toarray()
print(data_encoded)
```

Output:

```text
[[0. 0. 1.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 0. 1.]]
```

- **encoder.categories_**: Returns the categories in the order of the encoded columns when using the **OneHotEncoder** class.

```python
encoder.categories_
```

Output:

```text
[array(['Absent', 'Fail', 'Pass'], dtype=object)]
```

- ##### **Using LabelBinarizer Class**

```python
from sklearn.preprocessing import LabelBinarizer
import pandas as pd

# Sample data
data = pd.DataFrame({'Result': ['Pass', 'Fail', 'Pass', 'Pass', 'Absent', 'Fail', 'Fail', 'Pass', 'Pass', 'Absent', 'Pass']})

encoder = LabelBinarizer()
data_encoded = encoder.fit_transform(data)
print(data_encoded)
```

Output:

```text
[[0 0 1]
 [0 1 0]
 [0 0 1]
 [0 0 1]
 [1 0 0]
 [0 1 0]
 [0 1 0]
 [0 0 1]
 [0 0 1]
 [1 0 0]
 [0 0 1]]
```

- **encoder.classes_**: Returns the classes in the order of the encoded columns when using the **LabelBinarizer** class.

```python
encoder.classes_
```

Output:

```text
array(['Absent', 'Fail', 'Pass'], dtype='<U6')
```

[See more about *Data Transformation Techniques*](./5.1%20-%20advance_data_transformation.ipynb)

---

#### Label Encoding:


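- Assigns an integer value to each category.
- Suitable for ordinal variables only when the integer order matches the category order; `LabelEncoder` assigns integers in sorted (alphabetical) order and is intended mainly for target labels.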



![Label Encoding.png](../images/label-encording_example.png)

##### Example Code:


```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
data['Color_encoded'] = encoder.fit_transform(data['Color'])
print(data)
```

##### Python Implementation:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
data['Result_encoded'] = encoder.fit_transform(data['Result'])
print(data)
```

Output:

```text
    Result  Result_encoded
0     Pass               2
1     Fail               1
2     Pass               2
3     Pass               2
4   Absent               0
5     Fail               1
6     Fail               1
7     Pass               2
8     Pass               2
9   Absent               0
10    Pass               2
```
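
In the output above, the integers follow alphabetical order (`Absent` = 0, `Fail` = 1, `Pass` = 2), which may not match a meaningful ranking. When a feature has a real order, one option is scikit-learn's `OrdinalEncoder` with an explicit category list. The sketch below assumes a hypothetical `Size` feature whose order is Small < Medium < Large.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature; the order Small < Medium < Large is an assumption.
sizes = pd.DataFrame({"Size": ["Medium", "Small", "Large", "Small"]})

# Passing `categories` fixes the integer assigned to each level:
# Small -> 0, Medium -> 1, Large -> 2.
encoder = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
sizes["Size_encoded"] = encoder.fit_transform(sizes[["Size"]]).ravel()

print(sizes)
```

`OrdinalEncoder` expects a 2D input, hence the double brackets in `sizes[["Size"]]`, and returns floats by default.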


---


## 4. **Data Splitting**



Data is split into training, validation, and test sets so that the model is evaluated on data it did not see during training.

### Purpose of Splits:


- **Training Set**: Used to fit the model's parameters.
- **Validation Set**: Used to tune hyperparameters and to detect overfitting during development.
- **Test Set**: Used once at the end to evaluate the final model.

### Common Splits:


- Training: 70%-80% of the data.
- Validation: 10%-15%.
- Testing: 10%-15%.

### Example Code:


```python
from sklearn.model_selection import train_test_split

# Assumes `data` is a DataFrame that contains a 'target' column
X = data.drop('target', axis=1)  # Features
y = data['target']  # Target variable

# Split off 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Split the remaining 80% into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)  # 0.25 x 0.8 = 0.2 of the full data
```
```
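
The two-step split can be tried on synthetic data to confirm the resulting 60/20/20 proportions. The sketch below uses made-up arrays and additionally passes `stratify`, an optional `train_test_split` parameter, so that each subset keeps the class balance of the full dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 50 samples, 2 features, balanced binary target.
X = np.arange(100).reshape(50, 2)
y = np.arange(50) % 2

# Step 1: hold out 20% as the test set, keeping class proportions.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 2: take 25% of the remainder (20% of the total) as the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest
)

print(len(X_train), len(X_val), len(X_test))
```

Stratifying matters most for small or imbalanced datasets, where a purely random split can leave a class under-represented in one subset.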

### **Actual Python Code**

```python
import numpy as np
from sklearn.model_selection import train_test_split
```

```python
x_vals = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])
y_vals = np.array([10, 12, 8, 7.5, 4.6, 2.5, 15.8, 19.7, 1.2, 3.8, 6.3, 14.1, 11.3, 13.7, 16.1, 17, 18.6, 20, 5, 9.1])
```

```python
x_train, x_test, y_train, y_test = train_test_split(x_vals, y_vals)
print(x_train)  # default: 75% of x values
print(x_test)  # default: 25% of x values
print(y_train)  # default: 75% of y values
print(y_test)  # default: 25% of y values
```

Output (no `random_state` is set, so the split differs on each run):

```text
[13 15 11  7  6  5 18  1 19  8  4 12  2  3  9]
[16 14 17 20 10]
[11.3 16.1  6.3 15.8  2.5  4.6 20.  10.   5.  19.7  7.5 14.1 12.   8.
  1.2]
[17.  13.7 18.6  9.1  3.8]
```

```python
# Set the test-set size explicitly.
# `test_size` is either a float in the range (0.0, 1.0), read as a fraction,
# or an int, read as an absolute number of samples.
x_train, x_test, y_train, y_test = train_test_split(x_vals, y_vals, test_size=0.2)
print(x_train)  # 80% of x values
print(x_test)  # 20% of x values
print(y_train)  # 80% of y values
print(y_test)  # 20% of y values
```

Output:

```text
[15 18 10 13 14  1  4  2  3  8 19  6 17 12  9 16]
[11 20  7  5]
[16.1 20.   3.8 11.3 13.7 10.   7.5 12.   8.  19.7  5.   2.5 18.6 14.1
  1.2 17. ]
[ 6.3  9.1 15.8  4.6]
```
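
When splitting is combined with a scaler from the transformation step, a common practice is to fit the scaler on the training set only and reuse its statistics on the test set, so that no information from the test data leaks into preprocessing. The following is a minimal sketch on randomly generated data; the array shapes and the seed are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Randomly generated features and binary target (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = rng.integers(0, 2, size=100)

# Fixing random_state makes the split reproducible across runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from training rows only
X_test_scaled = scaler.transform(X_test)        # same statistics reused, no refitting

print(X_train_scaled.mean(axis=0).round(6))  # approximately zero by construction
print(X_test_scaled.mean(axis=0).round(3))   # close to zero, but not exactly
```

The test-set means are not exactly zero because the scaler never saw those rows, which is the intended behavior.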


---

## Summary



| Step               | Description                                        | Tools/Libraries    |
|--------------------|----------------------------------------------------|--------------------|
| Data Collection    | Gather data from various sources.                 | Pandas, APIs       |
| Data Cleaning      | Remove inaccuracies, handle missing values.       | Pandas, NumPy      |
| Data Transformation| Normalize, standardize, encode categorical data.  | Scikit-learn       |
| Data Splitting     | Split data into training, validation, test sets.  | Scikit-learn       |

---

## References


- [Scikit-learn Documentation](https://scikit-learn.org/stable/documentation.html)
- [Pandas User Guide](https://pandas.pydata.org/docs/)


[Data Preparation Example](5.1%20-%20advance_data_transformation.ipynb#data-preprocessing-with-example-datasets)