# Data Preprocessing Tools

Use these tools when a dataset contains missing values or categorical data that must be handled before modeling.

## Importing the libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [None]:
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

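**iloc** stands for integer-location indexing.

**:** is the colon, used here for slicing.

**(' ')** are parentheses containing a quoted string, here the file name passed to `read_csv`.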

**X = dataset.iloc[:, :-1].values**

**.iloc** is used **to select rows and columns by index position** in a DataFrame.

**[:, :-1]** means:

**:** selects all rows.

**:-1** selects all columns except the last one (the slice stops just before index -1, which refers to the last column).

**.values** converts **the selected portion into a NumPy array**.

**y = dataset.iloc[:, -1].values**

**[:, -1]** selects:

**:** selects all rows.

**-1** selects only the last column (in this case, Purchased).

**.values** converts the selected portion into a NumPy array.
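
The slicing described above can be tried on a small table. This is only a sketch: the DataFrame below is a hypothetical stand-in for `Data.csv`, and the names `df`, `X_demo`, and `y_demo` are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for Data.csv with the same column layout.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

X_demo = df.iloc[:, :-1].values  # all rows, every column except the last
y_demo = df.iloc[:, -1].values   # all rows, only the last column

print(X_demo.shape)  # (3, 3): three rows, three feature columns
print(y_demo.shape)  # (3,): one label per row
```

The feature matrix is two-dimensional while the target is one-dimensional, which matches what the later scikit-learn calls in this notebook expect.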

In [None]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [None]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Taking care of missing data

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

**SimpleImputer** is a class from the sklearn library used to handle missing values in a dataset.

It replaces missing values using a specified strategy (e.g., the **mean, median, most frequent value, or a constant**).

**Parameters:**

**missing_values=np.nan**: Specifies that the missing values in the dataset are represented as NaN (Not a Number).

**strategy='mean'**: Specifies the `imputation strategy` to replace missing values:

In this case, missing values will be replaced with the mean of the respective column.

**fit()** calculates the necessary statistics (e.g., mean) from the data.

**transform()** replaces the missing values in the selected columns with the statistics (mean in this case) calculated during the **fit()** step.
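
To see the two steps separately, here is a minimal sketch on a hypothetical four-row array (the values and the names `demo`, `imputer_demo`, and `filled` are invented for illustration). After `fit()`, the learned column means are stored in the `statistics_` attribute.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns (age, salary) with one gap in each.
demo = np.array([
    [40.0, 60000.0],
    [np.nan, 50000.0],
    [30.0, np.nan],
    [50.0, 70000.0],
])

imputer_demo = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer_demo.fit(demo)            # learns one mean per column, ignoring NaN
print(imputer_demo.statistics_)   # column means: 40.0 and 60000.0

filled = imputer_demo.transform(demo)  # NaN entries replaced by those means
print(filled)
```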

What is X?

**X** is **a two-dimensional NumPy array** (created by `.values`), not a Python list of lists. Each row of the array represents one record of the dataset, and each column has a specific meaning.

**Structure of X**:

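The dataset X has 3 columns:

Column 1 (Country, index 0): The country name ('France', 'Spain', 'Germany', etc.).

Column 2 (Age, index 1): The age of an individual (44.0, 27.0, etc.). One value is missing (nan).

Column 3 (Salary, index 2): The salary of the individual (72000.0, 48000.0, etc.). One value is missing (nan).

This is why the imputer is applied to `X[:, 1:3]`: the slice covers indexes 1 and 2, the two numeric columns.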



## Python's Zero-Based Indexing
Python uses zero-based indexing, meaning the **first element** in a sequence is accessed **using the index 0,** the second element with index 1, and so on.

This applies to **tuples, lists, strings, arrays**, and other data structures.

It provides consistency with many programming languages like C, Java, and JavaScript.
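
A short sketch of how zero-based indexing relates to the `1:3` slice used with the imputer. The array `row` is a hypothetical single record, not taken from the dataset file.

```python
import numpy as np

# Hypothetical single record: country, age, salary.
row = np.array(["France", 44.0, 72000.0], dtype=object)

first = row[0]           # index 0 is the first element (the country)
numeric_part = row[1:3]  # indexes 1 and 2; the stop index 3 is excluded
last = row[-1]           # negative indexes count from the end

print(first, numeric_part, last)
```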

In [None]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


## Encoding categorical data

### Encoding the Independent Variable

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

The **ColumnTransformer** is used to apply transformations to specific columns of a dataset.

In this case, it applies OneHotEncoder to the "Country" column (index 0).

Parameters of ColumnTransformer:

transformers: **A list of tuples** specifying transformations. Each tuple contains:

A name for the transformer (here: '**encoder**').

The transformation method (here: **OneHotEncoder()**).

The column(s) to apply the transformation to (here: **[0]** for the "Country" column).

remainder='**passthrough**': Columns not listed in transformers (here "Age" and "Salary") are kept unchanged and placed after the encoded columns. With the default remainder='drop' they would be discarded.

OneHotEncoder

Converts the "Country" column into a **one-hot encoded format**:

If the column has three categories (France, Germany, Spain), they will be transformed into **three new binary columns**, ordered alphabetically by category:

[1, 0, 0] for France

[0, 1, 0] for Germany

[0, 0, 1] for Spain

**X = np.array(ct.fit_transform(X))**

**ct.fit_transform(X):**
**Fits the ColumnTransformer to the dataset X** (containing the features) and applies the transformations.

The "Country" column is replaced with its one-hot encoded representation.

**np.array(...):**
**Converts the result into a NumPy array**, which is often required for compatibility with other Scikit-learn functions.
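
The column order of the encoding can be checked directly. The following is a sketch on a hypothetical three-row array (the names `X_demo`, `ct_demo`, `encoded`, and `categories` are illustrative); it reads the fitted categories back from the transformer through `named_transformers_`.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical feature rows: country and age.
X_demo = np.array([
    ["France", 44.0],
    ["Spain", 27.0],
    ["Germany", 30.0],
], dtype=object)

ct_demo = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)
encoded = np.array(ct_demo.fit_transform(X_demo))

# The encoder sorts the categories, which fixes the order of the new columns.
categories = ct_demo.named_transformers_["encoder"].categories_[0]
print(categories)
print(encoded)
```

Each row now starts with three indicator columns, followed by the untouched age column.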

In [None]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


### Encoding the Dependent Variable

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

**le = LabelEncoder()**:

Creates an instance of the LabelEncoder.

**le.fit_transform(y)**:

**fit(y)**: Identifies the unique categories in the target variable y.

**transform(y)**: Converts each category into its corresponding numerical representation.

**fit_transform(y)** combines both steps (fit and transform) in one call for convenience.

The transformed y is a NumPy array of integers, here 0 for 'No' and 1 for 'Yes'.


**Note**:

**LabelEncoder** should only be used for the target variable (y), not for input features. For input features, use encoders like OneHotEncoder.

**When to use it:**

When the target variable is categorical and needs to be encoded numerically for machine learning.
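
A minimal sketch of the mapping that LabelEncoder learns, using a hypothetical label array (`y_demo`, `le_demo`, `y_encoded`, and `restored` are illustrative names). The `classes_` attribute lists the labels in sorted order, and the position of each label is its integer code.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical target labels.
y_demo = np.array(["No", "Yes", "No", "Yes"])

le_demo = LabelEncoder()
y_encoded = le_demo.fit_transform(y_demo)

print(le_demo.classes_)  # sorted labels; index in this array = integer code
print(y_encoded)

# inverse_transform maps the integer codes back to the original labels.
restored = le_demo.inverse_transform(y_encoded)
print(restored)
```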


In [None]:
print(y)

[0 1 0 0 1 1 0 1 0 1]


## Splitting the dataset into the Training set and Test set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

`train_test_split` returns four arrays: `X_train`, `X_test`, `y_train`, and `y_test`.

It is a **function** from `sklearn.model_selection` that splits the dataset into `training` and `testing` **subsets**.

`X`:

The features (independent variables) of your dataset. This is typically a 2D array or DataFrame containing input data such as numeric or categorical features.

`y`:

The target variable (dependent variable) you want to predict. This can be a 1D array or Series containing labels or numerical outputs (e.g., classification labels or regression values).

`test_size=0.2`:

The proportion of the dataset to include in the test split. In this case, 20% of the data will be allocated to the test set, and the remaining 80% will be allocated to the training set.

`random_state=1`:

This is a seed value used to ensure the random splitting is reproducible. Setting the `random_state` **ensures that every time you run the code, the train-test split will produce the same result**.
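
The effect of `test_size` and `random_state` can be checked with a small sketch. The arrays below are hypothetical placeholders (ten rows, like the dataset in this notebook), and the variable names are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, and 10 targets.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same seed.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
print(X_train_a.shape, X_test_a.shape)  # (8, 2) (2, 2): an 80/20 split

# The same seed gives identical splits on every run.
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)  # True
```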


In [None]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [None]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [None]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [None]:
print(y_test)

[0 1]


## Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])

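**Feature scaling** is always applied column by column (per feature).

Two techniques:

Normalization: rescales values to the range **[0, 1]**.

Standardization: centers each feature at 0 with unit standard deviation, so most values fall roughly within **[-3, +3]**.

After scaling, the values of every feature lie in a comparable range.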
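
The two techniques can be compared on one column. This is a sketch with hypothetical ages; `MinMaxScaler` is scikit-learn's normalization class and is not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature column.
ages = np.array([[20.0], [30.0], [40.0], [50.0], [60.0]])

# Normalization: (x - min) / (max - min), result lies in [0, 1].
normalized = MinMaxScaler().fit_transform(ages)

# Standardization: (x - mean) / std, result has mean 0 and std 1.
standardized = StandardScaler().fit_transform(ages)

print(normalized.ravel())    # 0, 0.25, 0.5, 0.75, 1
print(standardized.ravel())
```

Note that standardized values are not strictly bounded; an extreme outlier can land outside the usual range of about -3 to +3.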

We perform feature scaling after splitting the data into training and test sets, so that the scaler is fitted on the training set only. This prevents data leakage from the test set and preserves the integrity of the model's evaluation.
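
A sketch of what fitting on the training set only looks like in practice, with hypothetical values (`train`, `test`, and `sc_demo` are illustrative names). The scaler stores the training mean and standard deviation in `mean_` and `scale_` and reuses them for the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical one-column training and test sets.
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[40.0]])

sc_demo = StandardScaler()
train_scaled = sc_demo.fit_transform(train)  # mean and std come from train only
test_scaled = sc_demo.transform(test)        # reuses the training statistics

print(sc_demo.mean_, sc_demo.scale_)  # statistics learned from the training set
print(test_scaled)                    # (40 - training mean) / training std
```

Calling `fit_transform` on the test set instead would compute new statistics from test data, which is the leakage the split is meant to avoid.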

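In this code, `StandardScaler()` is created without arguments (`sc = StandardScaler()`); the defaults are sufficient.

Feature scaling means bringing **all the values into the same range**.

**Only apply feature scaling to numerical columns**, not to dummy variables, which already take only the values `0.0` and `1.0`. This is why the code scales `X_train[:, 3:]` and leaves the first three columns untouched.

Feature scaling is not needed for simple linear regression, multiple linear regression, or polynomial regression. Why? When these models are fitted by ordinary least squares, each coefficient compensates for the scale of its feature, so rescaling a feature changes its coefficient but not the predictions.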
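
The point about linear regression can be illustrated with a small experiment. This is only a sketch on synthetic data: the features, the target formula, and all variable names are made up, and `LinearRegression` is not otherwise used in this notebook. It fits the same model on raw and on standardized features and compares the predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: an age-like and a salary-like feature on very different scales.
rng = np.random.default_rng(0)
X_raw = np.column_stack([
    rng.uniform(20, 60, 50),
    rng.uniform(30000, 90000, 50),
])
y_target = 3.0 * X_raw[:, 0] + 0.001 * X_raw[:, 1] + rng.normal(0, 1, 50)

X_scaled = StandardScaler().fit_transform(X_raw)

# Fit ordinary least squares once on raw and once on standardized features.
pred_raw = LinearRegression().fit(X_raw, y_target).predict(X_raw)
pred_scaled = LinearRegression().fit(X_scaled, y_target).predict(X_scaled)

# The coefficients differ, but the predictions agree up to rounding error.
print(np.allclose(pred_raw, pred_scaled))
```

This holds for plain least squares; models that penalize coefficient size or rely on distances between samples are generally sensitive to feature scale.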

In [None]:
print(X_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [None]:
print(X_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
