### What is a categorical feature?

- A categorical feature consists of a set of distinct string categories.
- A categorical feature takes its values from a fixed set of classes [Eg: Gender: Male, Female, Not_Specified]
- Most machine learning algorithms cannot work with string categories directly. We have to convert the string categories to numerical values using Feature Engineering.

### Step 1: Label Encoding:

- A common first step for a categorical feature is "label encoding"
- Label encoding means that each label in the categorical feature is assigned a numerical value
- Eg: In the Gender column, Male = 0, Female = 1, Not Specified = 2
- Label Encoding is best suited for Binary Categorical Features (only 2 categories, like Male and Female).
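
The mapping above can be sketched with pandas. This is only an illustration on a made-up `gender` column (not part of the startup dataset used later); the category order is passed explicitly so that the codes match the example.

```python
import pandas as pd

# Hypothetical toy column, used only for illustration
gender = pd.Series(["Male", "Female", "Not Specified", "Female", "Male"])

# Fix the category order so that Male = 0, Female = 1, Not Specified = 2
categories = ["Male", "Female", "Not Specified"]
codes = pd.Categorical(gender, categories=categories).codes

print(codes.tolist())  # [0, 1, 2, 1, 0]
```

Without an explicit `categories` list, pandas would sort the labels alphabetically, so the numbers assigned to each label would differ from the example.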
    
### Disadvantage of Label Encoding:

- If there are more than two categories in a single column, label encoding still assigns one number to each category.
- Eg: Male = 0, Female = 1, Not Specified = 2
- Many machine learning algorithms will then treat these numbers as ordered, i.e. Not Specified (2) > Female (1) > Male (0). This is incorrect, because the categories have no natural order.
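
A small sketch of why this matters for a linear model, using made-up data: the target `y` below is completely determined by the category, but not in the order of the codes. A linear regression on the label-encoded column can only fit a straight line through 0, 1, 2, so it fails to capture the pattern.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: "Female" has a higher target than the other two categories
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Not Specified"] * 4,
    "y": [10.0, 30.0, 10.0] * 4,
})

# Label encoding: Male = 0, Female = 1, Not Specified = 2
codes = df["Gender"].map({"Male": 0, "Female": 1, "Not Specified": 2}).to_frame("code")

# Fit on the codes and score on the same data (R2)
label_r2 = LinearRegression().fit(codes, df["y"]).score(codes, df["y"])
print(round(label_r2, 6))
```

The R2 score comes out at essentially zero here, even though knowing the category tells us the target exactly.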

### Step 2: One Hot Encoding

- One hot encoding applies a simple mechanism: it splits the feature into as many columns as there are categories
- Eg: The Gender feature will be split into 3 columns: Male, Female, Not Specified
- Whenever the category is Male, the Male column is set to 1 and the Female and Not Specified columns are set to 0
- The same happens for each of the 3 categories.
- Now the ML algorithm can clearly distinguish between the categories.
- After doing One Hot Encoding, we drop the first column (drop_first = True) to avoid the Dummy Variable Trap. With all n columns present, any one column can be derived from the others, so we only use n-1 columns.
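
The points above can be sketched on a made-up `gender` column (an assumption for illustration, not the tutorial's data). `dtype=int` is passed so the output shows 0/1; recent pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical toy column
gender = pd.Series(["Male", "Female", "Not Specified", "Female"], name="Gender")

# Full one hot encoding: one column per category (sorted alphabetically)
full = pd.get_dummies(gender, dtype=int)

# Each row contains exactly one 1, so the columns always sum to 1.
# This redundancy is the dummy variable trap.
row_sums = full.sum(axis=1)

# drop_first=True removes the first category ("Female"), leaving n - 1 = 2 columns.
# A row of all zeros now stands for "Female".
reduced = pd.get_dummies(gender, drop_first=True, dtype=int)

print(full)
print(reduced)
```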


#### We will use pd.get_dummies to perform One Hot Encoding

In [15]:
import pandas as pd
import numpy as np

In [16]:
data = pd.read_csv("C:/Users/Ashish/Desktop/Python Tutorials/CSV files/50_Startups.txt")

In [17]:
data.head()

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


#### This is a regression problem: we have to predict Profit from R&D Spend, Administration, Marketing Spend and State.

#### State is a categorical feature. All the other features are continuous numerical features.

#### We have to convert the State feature to a numerical feature.

In [18]:
data["State"].value_counts()

New York      17
California    17
Florida       16
Name: State, dtype: int64

- The state feature has 3 categories, [New York, California, Florida]

In [19]:
state = pd.get_dummies(data["State"], drop_first = True)

In [20]:
state.head()

| | Florida | New York |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 0 |
| 2 | 1 | 0 |
| 3 | 0 | 1 |
| 4 | 1 | 0 |


In [21]:
# concat state with original dataset 

data = pd.concat([state, data], axis = 1)

In [22]:
data.head()

| | Florida | New York | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 0 | 0 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 1 | 0 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 0 | 1 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 1 | 0 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


In [23]:
# drop the state categorical column

data = data.drop("State", axis = 1)

In [24]:
data.head()

| | Florida | New York | R&D Spend | Administration | Marketing Spend | Profit |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 165349.2 | 136897.8 | 471784.1 | 192261.83 |
| 1 | 0 | 0 | 162597.7 | 151377.59 | 443898.53 | 191792.06 |
| 2 | 1 | 0 | 153441.51 | 101145.55 | 407934.54 | 191050.39 |
| 3 | 0 | 1 | 144372.41 | 118671.85 | 383199.62 | 182901.99 |
| 4 | 1 | 0 | 142107.34 | 91391.77 | 366168.42 | 166187.94 |
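
The three steps above (encode, concat, drop) can also be done in a single call by passing the whole DataFrame to `pd.get_dummies` with the `columns` argument. The sketch below uses a small stand-in frame named `sample`, built from the first three rows shown earlier, rather than the full file.

```python
import pandas as pd

# Small stand-in for the dataset (first three rows, fewer columns)
sample = pd.DataFrame({
    "R&D Spend": [165349.2, 162597.7, 153441.51],
    "State": ["New York", "California", "Florida"],
    "Profit": [192261.83, 191792.06, 191050.39],
})

# Encode "State", drop the original column and the first dummy in one step
encoded = pd.get_dummies(sample, columns=["State"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
# ['R&D Spend', 'Profit', 'State_Florida', 'State_New York']
```

Two differences from the step-by-step version are worth noting: the new columns get a `State_` prefix, and they are appended at the end, so `Profit` is no longer the last column. California is the dropped category in both versions because it comes first alphabetically.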


### Profit feature is the DEPENDENT FEATURE
### All the remaining features are INDEPENDENT FEATURES

In [29]:
# Separate independent and dependent features

X = data.iloc[:, :5]
y = data.iloc[:, [5]]

In [30]:
X.head()

| | Florida | New York | R&D Spend | Administration | Marketing Spend |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 165349.2 | 136897.8 | 471784.1 |
| 1 | 0 | 0 | 162597.7 | 151377.59 | 443898.53 |
| 2 | 1 | 0 | 153441.51 | 101145.55 | 407934.54 |
| 3 | 0 | 1 | 144372.41 | 118671.85 | 383199.62 |
| 4 | 1 | 0 | 142107.34 | 91391.77 | 366168.42 |


In [31]:
y.head()

| | Profit |
|---|---|
| 0 | 192261.83 |
| 1 | 191792.06 |
| 2 | 191050.39 |
| 3 | 182901.99 |
| 4 | 166187.94 |
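
Selecting by position with `iloc` works as long as the column order never changes. A possible alternative, sketched here on a small stand-in frame named `toy` (an assumption, not the full dataset), is to select by column name instead.

```python
import pandas as pd

# Small stand-in for the encoded dataset
toy = pd.DataFrame({
    "Florida": [0, 0, 1],
    "New York": [1, 0, 0],
    "R&D Spend": [165349.2, 162597.7, 153441.51],
    "Profit": [192261.83, 191792.06, 191050.39],
})

# Everything except the target becomes X; the target column becomes y
X_named = toy.drop(columns=["Profit"])
y_named = toy["Profit"]

print(X_named.columns.tolist())
```

This keeps working even if the dummy columns end up in a different position.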


In [32]:
# split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 40)

In [33]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()

In [34]:
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)

In [35]:
from sklearn.metrics import r2_score
score = r2_score(y_test, y_pred)

In [36]:
print(score)

0.9281444810414338
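
For reference, `r2_score` computes 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. A minimal check on made-up numbers (not the tutorial's predictions):

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_hat) ** 2)          # squared prediction errors
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared deviations from the mean
manual_r2 = 1 - ss_res / ss_tot

print(manual_r2, r2_score(y_true, y_hat))
```

Both values agree, which shows that R2 measures how much of the variation in the target the predictions account for.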


### The R2 score is very close to 1, so our model explains most of the variation in Profit on the test set.

## Converting a categorical feature with around 100 categories to a numerical feature is also possible