Data refers to information or facts that are collected, stored, and used for various purposes, such as analysis, decision-making, or research. In the context of machine learning and statistics, data typically refers to observations or measurements of variables.

There are several types of data:

1. **Quantitative Data**:
   - Numerical data that represents quantities or amounts.
   - Examples: Height, weight, temperature.
   - Subtypes: Discrete (countable) and continuous (measurable).

2. **Qualitative Data**:
   - Categorical or non-numerical data that represents qualities or characteristics.
   - Examples: Gender, color, type of fruit.
   - Subtypes: Nominal (no inherent order) and ordinal (ordered).

3. **Binary Data**:
   - Data with only two possible values.
   - Examples: True/False, Yes/No, 0/1.

4. **Time Series Data**:
   - Data recorded in time order, often at regular intervals.
   - Examples: Stock prices, weather measurements, sales data.

5. **Spatial Data**:
   - Data related to geographical locations or spatial coordinates.
   - Examples: GPS coordinates, maps, satellite imagery.


**Categorical Data**

1. **Nominal Data**:
   - Nominal data represents categories with no inherent order or hierarchy.
   - Examples include gender, color, or type of fruit.

2. **Ordinal Data**:
   - Ordinal data represents categories with a natural order or rank.
   - Examples include education level (e.g., elementary, high school, college) or satisfaction level (e.g., low, medium, high).
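
As a small illustration of the nominal/ordinal distinction, the sketch below uses the pandas `Categorical` type. The variable names and the colour values are made up for illustration; only the satisfaction levels come from the example above.

```python
import pandas as pd

# Nominal: categories are just labels, with no order among them.
color = pd.Categorical(["red", "green", "red"])

# Ordinal: the same container, but with an explicit low < medium < high order.
satisfaction = pd.Categorical(
    ["low", "high", "medium", "low"],
    categories=["low", "medium", "high"],
    ordered=True,
)

print(color.ordered)                  # False
print(satisfaction.ordered)           # True
print(satisfaction.codes.tolist())    # [0, 2, 1, 0], positions in the declared order
print(pd.Series(satisfaction).max())  # high
```

Because the order is declared, comparisons such as a maximum are meaningful for `satisfaction`, while they would not be for `color`.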
   
   
**ML techniques for ordinal data**

1. **Ordinal Encoding for Input**:
   - Ordinal Encoding preserves the order or hierarchy of categorical variables, making it suitable for features with inherent ordinal relationships.
   - It assigns unique integers to categories based on their order, ensuring that the encoded values reflect the natural sequence of the data.

2. **Label Encoding for Output Class**:
   - Label Encoding assigns unique integers to categorical labels without considering order, making it suitable for encoding the output class or target variable.
   - Many machine learning algorithms require a numerical target variable, so this step makes categorical class labels usable for training.
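
The two encodings above can be sketched with scikit-learn's `OrdinalEncoder` and `LabelEncoder`. This is a minimal sketch, not part of the cars notebook: the education values follow the earlier example, while the `Yes`/`No` target is a hypothetical output class.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Input feature: pass the categories in their natural order,
# otherwise the encoder falls back to alphabetical order.
education = np.array([["high school"], ["elementary"], ["college"], ["high school"]])
ordinal = OrdinalEncoder(categories=[["elementary", "high school", "college"]])
education_codes = ordinal.fit_transform(education)
print(education_codes.ravel().tolist())  # [1.0, 0.0, 2.0, 1.0]

# Output class (hypothetical labels): integers are assigned by sorted label, not by rank.
target = ["Yes", "No", "Yes", "No"]
label = LabelEncoder()
target_codes = label.fit_transform(target)
print(list(label.classes_))   # ['No', 'Yes']
print(target_codes.tolist())  # [1, 0, 1, 0]
```

`OrdinalEncoder` expects a 2D feature matrix, whereas `LabelEncoder` expects a 1D array of labels, which matches the input/output split described above.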
   
   
**ML techniques for nominal data**

Techniques for nominal data in machine learning primarily involve encoding categorical variables into a numerical format. Two are commonly used:

1. **One-Hot Encoding**:
   - Creates binary columns for each category, where each column indicates the presence or absence of a category.
   - Suitable for nominal categorical variables where there's no inherent order among categories.

2. **Dummy Encoding**:
   - Similar to one-hot encoding but omits one of the binary columns to avoid multicollinearity.
   - Typically used in regression analysis to represent categorical variables with multiple categories.
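
To show why dropping one column matters, the sketch below builds both encodings for a small, made-up `fuel` series and checks the rank of the design matrix once an intercept column is added. The helper `with_intercept` is a hypothetical name introduced for this illustration.

```python
import numpy as np
import pandas as pd

# Toy sample of the four fuel types (values chosen for illustration).
fuel = pd.Series(["Diesel", "Petrol", "CNG", "Diesel", "LPG", "Petrol"], name="fuel")

one_hot = pd.get_dummies(fuel, dtype=int)                 # k = 4 columns
dummy = pd.get_dummies(fuel, drop_first=True, dtype=int)  # k - 1 = 3 columns

def with_intercept(frame):
    """Prepend a column of ones, as a linear regression with an intercept would."""
    return np.column_stack([np.ones(len(frame)), frame.to_numpy()])

# The one-hot columns always sum to 1, so they duplicate the intercept column:
# 5 columns, but only rank 4 (perfect multicollinearity).
print(with_intercept(one_hot).shape[1], np.linalg.matrix_rank(with_intercept(one_hot)))  # 5 4

# Dropping one category removes the redundancy: 4 columns, rank 4.
print(with_intercept(dummy).shape[1], np.linalg.matrix_rank(with_intercept(dummy)))      # 4 4
```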



**Today we work on nominal categorical data** 

In [20]:
import numpy as np
import pandas as pd

In [21]:
df = pd.read_csv('/kaggle/input/one-hot-encoding/cars.csv')

In [22]:
df.head()

Unnamed: 0,brand,km_driven,fuel,owner,selling_price
0,Maruti,145500,Diesel,First Owner,450000
1,Skoda,120000,Diesel,Second Owner,370000
2,Honda,140000,Petrol,Third Owner,158000
3,Hyundai,127000,Diesel,First Owner,225000
4,Maruti,120000,Petrol,First Owner,130000


In [23]:
df['owner'].value_counts()

owner
First Owner             5289
Second Owner            2105
Third Owner              555
Fourth & Above Owner     174
Test Drive Car             5
Name: count, dtype: int64

In [25]:
df['fuel'].value_counts()

fuel
Diesel    4402
Petrol    3631
CNG         57
LPG         38
Name: count, dtype: int64

In [26]:
df['brand'].value_counts()

brand
Maruti           2448
Hyundai          1415
Mahindra          772
Tata              734
Toyota            488
Honda             467
Ford              397
Chevrolet         230
Renault           228
Volkswagen        186
BMW               120
Skoda             105
Nissan             81
Jaguar             71
Volvo              67
Datsun             65
Mercedes-Benz      54
Fiat               47
Audi               40
Lexus              34
Jeep               31
Mitsubishi         14
Force               6
Land                6
Isuzu               5
Kia                 4
Ambassador          4
Daewoo              3
MG                  3
Ashok               1
Opel                1
Peugeot             1
Name: count, dtype: int64

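**1. OneHotEncoding using Pandas**

Using pandas for this is not ideal in an ML project, because `pd.get_dummies` does not remember which columns it produced: encoding new data later can yield a different number of columns, or the same columns in different positions.

To solve this problem, we use scikit-learn's `OneHotEncoder` (section 3).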

In [24]:
pd.get_dummies(df,columns=['fuel','owner'])

Unnamed: 0,brand,km_driven,selling_price,fuel_CNG,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_First Owner,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,False,True,False,False,True,False,False,False,False
1,Skoda,120000,370000,False,True,False,False,False,False,True,False,False
2,Honda,140000,158000,False,False,False,True,False,False,False,False,True
3,Hyundai,127000,225000,False,True,False,False,True,False,False,False,False
4,Maruti,120000,130000,False,False,False,True,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,False,False,False,True,True,False,False,False,False
8124,Hyundai,119000,135000,False,True,False,False,False,True,False,False,False
8125,Maruti,120000,382000,False,True,False,False,True,False,False,False,False
8126,Tata,25000,290000,False,True,False,False,True,False,False,False,False
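
To make the column-position concern concrete, here is a minimal sketch on two tiny, made-up frames (not the cars data). It assumes a test set that happens to contain no `CNG` rows, and compares `pd.get_dummies` with an `OneHotEncoder` fitted on the training frame.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"fuel": ["Diesel", "Petrol", "CNG"]})
test = pd.DataFrame({"fuel": ["Petrol", "Diesel"]})  # no CNG in this split

# get_dummies only sees the frame it is given, so the layouts differ.
train_cols = list(pd.get_dummies(train, columns=["fuel"]).columns)
test_cols = list(pd.get_dummies(test, columns=["fuel"]).columns)
print(train_cols)  # ['fuel_CNG', 'fuel_Diesel', 'fuel_Petrol']
print(test_cols)   # ['fuel_Diesel', 'fuel_Petrol']

# A fitted encoder stores the training categories and reuses them.
encoder = OneHotEncoder().fit(train[["fuel"]])
test_encoded = encoder.transform(test[["fuel"]]).toarray()
print(test_encoded.shape)     # (2, 3)
print(test_encoded.tolist())  # [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
```

The encoder's output keeps three columns in the training order even though the test frame never mentions `CNG`.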


**2. K-1 OneHotEncoding**

In [7]:
pd.get_dummies(df,columns=['fuel','owner'],drop_first=True)

Unnamed: 0,brand,km_driven,selling_price,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,True,False,False,False,False,False,False
1,Skoda,120000,370000,True,False,False,False,True,False,False
2,Honda,140000,158000,False,False,True,False,False,False,True
3,Hyundai,127000,225000,True,False,False,False,False,False,False
4,Maruti,120000,130000,False,False,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,False,False,True,False,False,False,False
8124,Hyundai,119000,135000,True,False,False,True,False,False,False
8125,Maruti,120000,382000,True,False,False,False,False,False,False
8126,Tata,25000,290000,True,False,False,False,False,False,False


In [8]:
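from sklearn.model_selection import train_test_split
# Features: the first four columns (brand, km_driven, fuel, owner); target: the last column (selling_price)
X_train,X_test,y_train,y_test = train_test_split(df.iloc[:,0:4],df.iloc[:,-1],test_size=0.2,random_state=2)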

In [9]:

X_train.head()

Unnamed: 0,brand,km_driven,fuel,owner
5571,Hyundai,35000,Diesel,First Owner
2038,Jeep,60000,Diesel,First Owner
2957,Hyundai,25000,Petrol,First Owner
7618,Mahindra,130000,Diesel,Second Owner
6684,Hyundai,155000,Diesel,First Owner


**3. OneHotEncoding using Sklearn**

In [10]:

from sklearn.preprocessing import OneHotEncoder

In [11]:
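# `sparse_output` replaced the older `sparse` argument in scikit-learn 1.2 (use `sparse=False` on earlier versions)
ohe = OneHotEncoder(drop='first',sparse_output=False,dtype=np.int32)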

In [12]:

# Learn the categories from the training data and encode it
X_train_new = ohe.fit_transform(X_train[['fuel','owner']])



In [13]:

# Encode the test data with the categories learned from the training data
X_test_new = ohe.transform(X_test[['fuel','owner']])

In [14]:
X_train_new.shape

(6502, 7)

In [15]:
# Stack the untouched columns next to the encoded ones (mixed types give an object array)
np.hstack((X_train[['brand','km_driven']].values,X_train_new))

array([['Hyundai', 35000, 1, ..., 0, 0, 0],
       ['Jeep', 60000, 1, ..., 0, 0, 0],
       ['Hyundai', 25000, 0, ..., 0, 0, 0],
       ...,
       ['Tata', 15000, 0, ..., 0, 0, 0],
       ['Maruti', 32500, 1, ..., 1, 0, 0],
       ['Isuzu', 121000, 1, ..., 0, 0, 0]], dtype=object)

**4. OneHotEncoding with Top Categories**

In [16]:
counts = df['brand'].value_counts()

In [17]:

df['brand'].nunique()  # 32 distinct brands
threshold = 100  # brands with at most this many rows will be grouped together

In [18]:
# Brands that appear `threshold` times or fewer
repl = counts[counts <= threshold].index

In [27]:
repl

Index(['Nissan', 'Jaguar', 'Volvo', 'Datsun', 'Mercedes-Benz', 'Fiat', 'Audi',
       'Lexus', 'Jeep', 'Mitsubishi', 'Force', 'Land', 'Isuzu', 'Kia',
       'Ambassador', 'Daewoo', 'MG', 'Ashok', 'Opel', 'Peugeot'],
      dtype='object', name='brand')

In [19]:
# Collapse the rare brands into a single 'uncommon' category, then one-hot encode
pd.get_dummies(df['brand'].replace(repl, 'uncommon')).sample(5)

Unnamed: 0,BMW,Chevrolet,Ford,Honda,Hyundai,Mahindra,Maruti,Renault,Skoda,Tata,Toyota,Volkswagen,uncommon
6253,False,False,False,False,True,False,False,False,False,False,False,False,False
4648,False,False,False,False,False,False,False,False,False,True,False,False,False
5500,False,False,True,False,False,False,False,False,False,False,False,False,False
36,False,False,False,False,False,False,True,False,False,False,False,False,False
2770,False,False,False,False,False,False,False,False,False,True,False,False,False
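
The grouping step above can also be wrapped in two small helpers, so that the rare brands are determined once on the training data and then reused elsewhere. This is only a sketch: `find_rare` and `group_rare` are hypothetical names, and the brand counts below are made up rather than taken from the cars data.

```python
import pandas as pd

def find_rare(series, threshold):
    """Return the categories whose count is at or below `threshold`."""
    counts = series.value_counts()
    return counts[counts <= threshold].index

def group_rare(series, rare, label="uncommon"):
    """Replace every value listed in `rare` with a single placeholder label."""
    return series.where(~series.isin(rare), label)

# Made-up training and test columns.
train_brand = pd.Series(["Maruti"] * 5 + ["Hyundai"] * 4 + ["Opel", "Kia"])
test_brand = pd.Series(["Maruti", "Kia", "Hyundai"])

rare = find_rare(train_brand, threshold=1)
encoded_train = pd.get_dummies(group_rare(train_brand, rare))
encoded_test = pd.get_dummies(group_rare(test_brand, rare))

print(sorted(rare))                 # ['Kia', 'Opel']
print(list(encoded_train.columns))  # ['Hyundai', 'Maruti', 'uncommon']
print(list(encoded_test.columns))   # ['Hyundai', 'Maruti', 'uncommon']
```

A brand that never appears in the training data is not covered by this sketch and would still produce a new column.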
