## *1. Data pre-processing*
Also known as data cleaning or data wrangling. <br>
*The process of converting or mapping data from its initial "raw" form into another format, in order to prepare the data for further analysis.*

### **Learning Objectives**
- Identify and handle missing values
- Data formatting
- Data normalization (centering/scaling)
- Data binning
- Turning categorical data into numeric variables

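Operations are typically applied to whole columns. For example, `df["symboling"] = df["symboling"] + 1` adds 1 to every value in the `symboling` column.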

## *2. Missing values*
What is a missing value?<br>
*A missing value occurs when no data value is stored for a variable (feature) in an observation.* <br>
*It may be represented as "?", "N/A", 0, or just a blank cell.* <br>
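
Before missing values can be handled they have to be found. The sketch below is one possible way to do that, assuming a small made-up DataFrame (the values and the `raw`/`cleaned` names are illustrative, not from the course data): placeholder strings are converted to `NaN`, then counted per column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: "?" and "N/A" stand in for missing prices.
raw = pd.DataFrame({
    "highway-mpg": [20, 22, 29, 25],
    "price": ["23875", "?", "16430", "N/A"],
})

# Convert the placeholder strings to NaN so pandas treats them as missing.
cleaned = raw.replace(["?", "N/A"], np.nan)

# Count the missing values in each column.
missing_counts = cleaned.isnull().sum()
print(missing_counts)
```

A value of 0 used as a missing-data marker cannot be detected this way without knowing the column, since 0 may also be a legitimate measurement.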

### **How to deal with missing data?**
- **Check with the data collection source**
- **Drop the missing values**
  - Drop the variable
  - Drop the data entry
- **Replace the missing values**
  - Replace it with an average
  - Replace it by frequency
  - Replace it based on other functions
- **Leave it as missing data**
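
Replacing "by frequency" usually applies to categorical columns, where an average is not defined. A minimal sketch, assuming a made-up `num-of-doors` column and using `fillna` (a pandas method not mentioned in these notes) to fill the gap with the most frequent value:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with one missing entry.
df = pd.DataFrame({"num-of-doors": ["four", "two", "four", np.nan, "four"]})

# The most frequent value (the mode) stands in for the missing category.
most_common = df["num-of-doors"].mode()[0]

# Fill the missing entry with that value.
df["num-of-doors"] = df["num-of-doors"].fillna(most_common)
print(df)
```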

### **How to drop missing values**
Use `df.dropna()`:

| highway-mpg | price |
|-------------|-------|
| ... | ... |
| 20 | 23875 |
| 22 | NaN |
| 29 | 16430 |
| ... | ... |

**axis = 0** drops the entire row <br>
**axis = 1** drops the entire column

After dropping the row with the missing price:

| highway-mpg | price |
|-------------|-------|
| ... | ... |
| 20 | 23875 |
| 29 | 16430 |
| ... | ... |


`df.dropna(subset=['price'], axis=0, inplace=True)` <br>
Here `inplace=True` means the call modifies the dataset directly. The same result can be obtained by reassigning: <br>
`df = df.dropna(subset=['price'], axis=0)` <br>
By contrast, `df.dropna(subset=['price'], axis=0)` on its own returns a new DataFrame and does not change `df`.
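
The difference between the two forms can be checked on the three rows from the table above. This is a small sketch with a hand-built DataFrame; the name `dropped` is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "highway-mpg": [20, 22, 29],
    "price": [23875, np.nan, 16430],
})

# Without inplace=True, dropna returns a new DataFrame and leaves df as it was.
dropped = df.dropna(subset=["price"], axis=0)

print(len(df))       # original still has the NaN row
print(len(dropped))  # the row with the missing price is gone
```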

### **How to replace missing values in Python**
Use `df.replace(missing_value, new_value)` <br>
`mean = df['normalized_losses'].mean()` <br>
`df['normalized_losses'] = df['normalized_losses'].replace(np.nan, mean)`
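
A runnable version of the same idea, assuming three made-up values for `normalized_losses`. Note that `replace` returns a new Series, so the result is assigned back to the column.

```python
import numpy as np
import pandas as pd

# Hypothetical values; the middle one is missing.
df = pd.DataFrame({"normalized_losses": [100.0, np.nan, 140.0]})

# mean() skips NaN entries by default.
mean = df["normalized_losses"].mean()

# Replace the missing entry with the column mean and store the result.
df["normalized_losses"] = df["normalized_losses"].replace(np.nan, mean)
print(df)
```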

### **Data Formatting**
- Data is usually collected from different places and stored in different formats.
- Bringing data into a common standard of expression allows users to make meaningful comparisons.

**Not formatted:**
- confusing
- hard to aggregate
- hard to compare

| city |
|------|
| N.Y. |
| NY |
| Ny |
| New York |

**Formatted:**
- clearer
- easy to aggregate
- easy to compare

| city |
|------|
| New York |
| New York |
| New York |
| New York |
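
One way to get from the first table to the second is to map each known variant to a single spelling. The sketch below assumes the variants are known in advance; the `variants` dictionary is a hypothetical lookup, not part of the notes.

```python
import pandas as pd

df = pd.DataFrame({"city": ["N.Y.", "NY", "Ny", "New York"]})

# Hypothetical lookup of known spellings and the standard form to use.
variants = {"N.Y.": "New York", "NY": "New York", "Ny": "New York"}

# Values not listed in the lookup (here "New York") are left unchanged.
df["city"] = df["city"].replace(variants)
print(df["city"].unique())
```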

### **Applying calculations to an entire column**
*Suppose the data records city fuel consumption in miles per gallon ("mpg"), but we need it in liters per 100 km (L/100km). The conversion is L/100km = 235 / mpg, applied to the whole column:*
<br>
`df["city_mpg"] = 235 / df["city_mpg"]` <br>
Now let's change the column name as well: <br>
`df.rename(columns={"city_mpg": "city_L/100km"}, inplace=True)`
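
The two steps can be tried on a few made-up mpg values. The numbers below are illustrative only.

```python
import pandas as pd

# Hypothetical city fuel economy values in miles per gallon.
df = pd.DataFrame({"city_mpg": [23.5, 47.0, 10.0]})

# Convert every value in the column from mpg to L/100km in one operation.
df["city_mpg"] = 235 / df["city_mpg"]

# Rename the column so the name matches the new unit.
df.rename(columns={"city_mpg": "city_L/100km"}, inplace=True)
print(df)
```

Higher mpg means lower L/100km, so the ordering of the values is reversed after the conversion.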

### **Correcting data types**
To identify data types:
- `df.dtypes` <br>

To convert data types:
- `df.astype()` <br>

For example, to convert the `price` column to integers: <br>
`df['price'] = df['price'].astype('int')`
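
A short sketch of checking and converting a type, assuming a `price` column that was read in as text (the values are made up):

```python
import pandas as pd

# Hypothetical prices stored as strings, as often happens after loading a file.
df = pd.DataFrame({"price": ["23875", "16430", "13495"]})
print(df.dtypes)  # price is a text column at this point

# Convert the column to integers so arithmetic works on it.
df["price"] = df["price"].astype("int")
print(df.dtypes)
print(df["price"].sum())
```

This conversion fails if the column still contains missing values, so those need to be dropped or replaced first.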

### **Data Normalization**
- Bring feature values with different ranges onto a uniform scale.

| scale | [150, 250] | [50, 100] | [50, 100] |
| ------ | ----------| -------- | ----------- |
| impact | large | small | small |

**Not normalized**
- "age" and "income" are in different ranges.
- hard to compare
- "income" will inherently influence the result more

| age | income | 
| ----- | ---- |
| 20 | 100000 |
| 30 | 20000 |
| 40 | 500000 |

**Normalized**

| age | income |
| ----| ------ |
| 0.2 | 0.2 |
| 0.3 | 0.04 |
| 0.4 | 1 |

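### **Methods of normalizing data**
Several approaches to normalization:
1. Simple feature scaling
<br>

$$x_{\text{new}} = \frac{x_{\text{old}}}{x_{\text{max}}}$$

2. Min-max
<br>

$$x_{\text{new}} = \frac{x_{\text{old}} - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}}$$

3. Z-score
<br>

$$x_{\text{new}} = \frac{x_{\text{old}} - \mu}{\sigma}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation of the feature.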

**Examples**
#### 1. Simple feature scaling

`df["length"] = df["length"] / df["length"].max()`
<br>
<br>

#### 2. Min-max

`df['length'] = (df['length'] - df['length'].min()) / (df['length'].max() - df['length'].min())`
<br>
<br>

#### 3. Z-score

`df['length'] = (df['length'] - df['length'].mean()) / df['length'].std()`
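
To compare the three methods side by side, the sketch below applies each to the same made-up `length` values and keeps the results in separate variables (`simple`, `min_max` and `z_score` are illustrative names).

```python
import pandas as pd

# Hypothetical lengths within the [150, 250] range.
length = pd.Series([150.0, 200.0, 250.0])

# 1. Simple feature scaling: the largest value becomes 1.
simple = length / length.max()

# 2. Min-max: values are mapped onto the interval [0, 1].
min_max = (length - length.min()) / (length.max() - length.min())

# 3. Z-score: values are centered on 0 and measured in standard deviations.
z_score = (length - length.mean()) / length.std()

print(pd.DataFrame({"simple": simple, "min_max": min_max, "z_score": z_score}))
```

Note that pandas `std()` computes the sample standard deviation by default, which differs slightly from the population value on small datasets.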
<br>

### **Binning**
- Binning: grouping of values into "bins"
- Converts numeric variables into categorical variables
- Groups a set of numerical values into a set of "bins"
- "price" is a feature ranging from 5000 to 45000

| Price | 5000 - 10000 | 11000 - 30000 | 31000 - 45000 |
| ----- | ------------ | -------------- | ---------- |
| bins: | Low | Mid | High |

`bins = np.linspace(min(df["price"]), max(df["price"]), 4)`<br><br>
`group_names = ["Low", "Medium", "High"]` <br><br>
`df["price-binned"] = pd.cut(df["price"], bins, labels=group_names, include_lowest=True)` <br><br>
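
The same three lines can be run on a handful of made-up prices. `np.linspace` with 4 returns four equally spaced edges, which define three bins of equal width, so the boundaries here will not match the hand-picked ranges in the table.

```python
import numpy as np
import pandas as pd

# Hypothetical prices spanning the 5000 to 45000 range.
df = pd.DataFrame({"price": [5000, 12000, 20000, 31000, 45000]})

# Four equally spaced edges between the minimum and maximum price.
bins = np.linspace(df["price"].min(), df["price"].max(), 4)
group_names = ["Low", "Medium", "High"]

# include_lowest=True keeps the minimum price inside the first bin.
df["price-binned"] = pd.cut(df["price"], bins, labels=group_names, include_lowest=True)
print(df)
```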

### **Turning Categorical Variables into Quantitative Variables in Python**
Problem:
- Most statistical models cannot take objects/strings as input

| Car | Fuel | ... |
| ----|------|-----|
| A | gas | ... |
| B | diesel | ... |
| C | gas | ... |
| D | gas | ... |

Assign 0 or 1 in each category:

| Car | Fuel | ... | gas | diesel |
| ----|------|-----| ---- | ---- |
| A | gas | ... | 1 | 0 |
| B | diesel | ... | 0 | 1 |
| C | gas | ... | 1 | 0 |
| D | gas | ... | 1 | 0 |

`pd.get_dummies(df['fuel'])`

In [3]:
import pandas as pd
Data = { "Fuel": ["gas", "diesel", "gas", "gas"]}
df = pd.DataFrame(Data)
df.head()

|   | Fuel |
|---|------|
| 0 | gas |
| 1 | diesel |
| 2 | gas |
| 3 | gas |


In [4]:
# Now let's create its dummies, converting the categorical data into numeric values
pd.get_dummies(df['Fuel'])

|   | diesel | gas |
|---|--------|-----|
| 0 | False | True |
| 1 | True | False |
| 2 | False | True |
| 3 | False | True |
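
The output above holds booleans in a separate frame, while the earlier table shows 0/1 columns next to the original data. A possible way to get that layout, assuming a made-up `Car` column, is to request integer dummies with the `dtype` argument and join them back with `pd.concat`:

```python
import pandas as pd

df = pd.DataFrame({
    "Car": ["A", "B", "C", "D"],
    "Fuel": ["gas", "diesel", "gas", "gas"],
})

# dtype=int yields 0/1 columns instead of False/True.
dummies = pd.get_dummies(df["Fuel"], dtype=int)

# Attach the dummy columns to the original DataFrame, side by side.
df = pd.concat([df, dummies], axis=1)
print(df)
```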
