# Support Vector Regression Implementation

## Create dataset - we use Tips dataset from seaborn

- Predict the total bill on the basis of the given independent features.
- We also have categorical features such as sex (Male/Female), smoker (Yes/No), etc.

Most of the categorical independent features are binary (only 2 categories); day has 4 categories. All of them need 
    FEATURE ENCODING before a model can use them.

### Feature Encoding
-  We apply 2 types of encoding here:
    1. Label Encoding - used when a feature has only 2 categories, like sex (Male/Female), smoker (Yes/No) or time (Lunch/Dinner), which become 0 and 1.
        Label Encoding assigns a single, unique integer to each category.

    2. One Hot Encoding - used when a feature has more than 2 categories, e.g. day [Thur, Fri, Sat, Sun].
        One-Hot Encoding creates a new column for every category, using only 0s and 1s, so the model does not read an artificial order into the category numbers.
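
The difference between the two encodings can be seen on a tiny hand-made table. This is only a sketch: the `toy` frame and its values are made up for illustration and are not the real tips data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy data (not the real tips dataset)
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Female", "Male"],
    "day": ["Sat", "Sun", "Thur", "Fri"],
})

# Label encoding: one integer per category.
# Classes are sorted alphabetically, so Female -> 0 and Male -> 1.
sex_codes = LabelEncoder().fit_transform(toy["sex"])
print(sex_codes)  # [1 0 0 1]

# One-hot encoding: one 0/1 column per category (Fri, Sat, Sun, Thur).
# OneHotEncoder returns a sparse matrix by default, so convert it to dense to print it.
ohe = OneHotEncoder()
day_onehot = ohe.fit_transform(toy[["day"]]).toarray()
print(ohe.categories_[0])  # ['Fri' 'Sat' 'Sun' 'Thur']
print(day_onehot)          # each row has exactly one 1
```

Each row of `day_onehot` contains a single 1 in the column of its own day, which is why no ordering between the days is implied.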



In [38]:
# Tips dataset
import seaborn as sns
import numpy as np
import sys
import pandas as pd
df = sns.load_dataset('tips')

In [12]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB


In [14]:
# Binary categorical feature - label encoding is used for features like this
df['sex'].value_counts()

sex
Male      157
Female     87
Name: count, dtype: int64

In [15]:
# Categorical feature with more than 2 categories - one hot encoding is used for features like this
df['day'].value_counts()

day
Sat     87
Sun     76
Thur    62
Fri     19
Name: count, dtype: int64

Split the independent and dependent features from the dataset

In [16]:
df.columns

Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size'], dtype='object')

In [18]:
# X = df[['tip', 'sex', 'smoker', ...]] (Double Brackets [[]])

# In machine learning, the input features X are typically expected to be a 2D matrix (a table of features) where every row is a sample and every column is a feature. 
# The train_test_split function and most scikit-learn models require this 2D structure.
# X is a 2-dimensional table (DataFrame)
#     tip     sex smoker  day    time  size
# 0  1.01  Female     No  Sun  Dinner     2
# 1  1.66    Male     No  Sun  Dinner     3
# 2  3.50    Male     No  Sun  Dinner     3

# -------------------------------------------

# y = df['total_bill'] (Single Brackets [])

# In machine learning, the target variable y is typically expected to be a 1D vector (a single list of outputs we are trying to predict).
# y is a 1-dimensional Series of values
# 0    16.99
# 1    10.34
# 2    21.01
# Name: total_bill, dtype: float64
X = df[['tip', 'sex', 'smoker', 'day', 'time', 'size']]
y = df['total_bill']
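
The 2D/1D distinction described in the comments can be checked directly. A minimal sketch, assuming a made-up three-row frame (`toy`, `X_demo` and `y_demo` are illustrative names only):

```python
import pandas as pd

# Hypothetical three-row frame that reuses some column names of the tips data
toy = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "tip": [1.01, 1.66, 3.50],
    "size": [2, 3, 3],
})

X_demo = toy[["tip", "size"]]  # double brackets -> DataFrame (2D)
y_demo = toy["total_bill"]     # single brackets -> Series (1D)

print(type(X_demo).__name__, X_demo.shape)  # DataFrame (3, 2)
print(type(y_demo).__name__, y_demo.shape)  # Series (3,)
```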

## Train test split 

In [19]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=10)
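
To see what `test_size=0.25` and `random_state=10` do, the split can be tried on stand-in arrays with the same number of rows as the tips data (244). The arrays below are placeholders, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with 244 rows, the same row count as the tips data
X_demo = np.arange(244).reshape(-1, 1)
y_demo = np.arange(244)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=10
)
print(len(X_tr), len(X_te))  # 183 61

# The same random_state puts the same rows into the test set on every run
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=10
)
print(np.array_equal(X_te, X_te_again))  # True
```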

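We do the train-test split before Feature Encoding (Label Encoding & One Hot Encoding) to avoid the problem of DATA LEAKAGE.

### DATA LEAKAGE
    The model should have no information about the test data. Encoders are therefore fitted on the training data only; the test data is only transformed with those already fitted encoders.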
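
A practical reason to reuse the encoder fitted on the training data: an encoder that is fitted again on the test data can assign different numbers to the same category. The sketch below uses made-up values in which the test split happens to contain only one category.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values: the test split contains only 'Male'
train_sex = ["Male", "Female", "Male"]
test_sex = ["Male", "Male"]

le = LabelEncoder()
le.fit(train_sex)                    # learns Female -> 0, Male -> 1
consistent = le.transform(test_sex)  # reuses the training mapping

# Fitting a fresh encoder on the test data learns a new mapping from it
refit = LabelEncoder().fit_transform(test_sex)

print(consistent)  # [1 1]
print(refit)       # [0 0]  'Male' no longer has the code it had in training
```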

In [23]:
# apply label encoding to all binary categorical independent features
# create as many encoders as there are binary independent features
# in our example we have 3 binary categories - sex, smoker, time ---> so 3 label encoders
# each input column gets its own label encoder

from sklearn.preprocessing import LabelEncoder
le1 = LabelEncoder()
le2 = LabelEncoder()
le3 = LabelEncoder()

# fit & transform each binary independent feature of the training data
X_train['sex'] = le1.fit_transform(X_train['sex'])
X_train['smoker'] = le2.fit_transform(X_train['smoker'])
X_train['time'] = le3.fit_transform(X_train['time'])


In [24]:
X_train.head()

Unnamed: 0,tip,sex,smoker,day,time,size
58,1.76,1,1,Sat,0,2
1,1.66,1,0,Sun,0,3
2,3.5,1,0,Sun,0,3
68,2.01,1,0,Sat,0,2
184,3.0,1,1,Sun,0,2


In [25]:
# apply the same transformation to the test data

# transform only (no fit) each binary independent feature of the test data,
# reusing the encoders that were fitted on the training data
X_test['sex'] = le1.transform(X_test['sex'])
X_test['smoker'] = le2.transform(X_test['smoker'])
X_test['time'] = le3.transform(X_test['time'])

X_test.head()

Unnamed: 0,tip,sex,smoker,day,time,size
162,2.0,0,0,Sun,0,3
60,3.21,1,1,Sat,0,2
61,2.0,1,1,Sat,0,2
63,3.76,1,1,Sat,0,4
69,2.09,1,1,Sat,0,2
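
As an aside, the scikit-learn documentation describes LabelEncoder as intended for the target y rather than for input features. A possible alternative for the binary columns is OrdinalEncoder, which handles several columns with one encoder. This is a sketch on made-up rows, not the approach used in this notebook; the frames `train_demo` and `test_demo` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical train/test rows with the three binary columns
train_demo = pd.DataFrame({
    "sex": ["Male", "Female", "Male"],
    "smoker": ["No", "Yes", "No"],
    "time": ["Dinner", "Lunch", "Dinner"],
})
test_demo = pd.DataFrame({
    "sex": ["Female", "Male"],
    "smoker": ["No", "No"],
    "time": ["Dinner", "Dinner"],
})

binary_cols = ["sex", "smoker", "time"]
enc = OrdinalEncoder()

# Fit on the training rows only, then reuse the fitted encoder on the test rows
train_enc = pd.DataFrame(
    enc.fit_transform(train_demo[binary_cols]),
    columns=binary_cols, index=train_demo.index,
)
test_enc = pd.DataFrame(
    enc.transform(test_demo[binary_cols]),
    columns=binary_cols, index=test_demo.index,
)
print(test_enc)
```

Categories are sorted within each column, so Female, No and Dinner map to 0.0 and Male, Yes and Lunch map to 1.0.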


### One Hot Encoding
- applied together with ColumnTransformer 

ColumnTransformer - 
    1. transformers - a list of tuples, each of the form (name, transformer, columns).
        - 'onehot' is just a name we choose for the step; it appears as a prefix in the output column names.
        - OneHotEncoder(...) is the transformer that is applied.
    2. drop (a parameter of OneHotEncoder, not of ColumnTransformer) 
        - prevents the "Dummy Variable Trap".
        - The "Dummy Variable Trap": day [Thur, Fri, Sat, Sun] would give 4 new columns. 
            If you know the values of 3 of those columns (e.g., Sat=0, Sun=0, Thur=0), you automatically know the fourth one must be Fri=1. The fourth column is redundant information.
        - Setting drop='first' removes the column of the first category (here Fri), which simplifies the model without losing any information.
    3. columns (e.g. [3]) 
        - the columns to which One Hot Encoding is applied.
        - in our dataframe, [3] means "apply one-hot encoding only to the column at index 3" (which is the 'day' feature). 
    4. remainder 
        - This parameter belongs to the ColumnTransformer object itself and controls how it handles all columns not listed in the transformer tuples.
        - remainder='passthrough':
            Meaning: "Keep all columns not mentioned in the transformer list (like 'tip', 'size', etc.) exactly as they are and include them in the final dataset." This is what we want here.
        - remainder='drop' (the default):
            Meaning: "Drop every column that is not explicitly listed in one of the transformers." The final dataset then contains only the transformed columns.


In [27]:
# apply one hot encoding to all categorical features with more than 2 categories
# in our example dataset - the day feature, which has 4 categories

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('onehot', OneHotEncoder(drop='first'), [3])], remainder='passthrough')

# start transformation
X_train = ct.fit_transform(X_train)

# transform the test data
X_test = ct.transform(X_test)

In [None]:
X_train

array([[1., 0., 0., ..., 1., 0., 2.],
       [0., 1., 0., ..., 0., 0., 3.],
       [0., 1., 0., ..., 0., 0., 3.],
       ...,
       [1., 0., 0., ..., 0., 0., 2.],
       [0., 0., 1., ..., 0., 1., 6.],
       [0., 1., 0., ..., 0., 0., 2.]])

In [36]:
# to see full training data after transformation

np.set_printoptions(threshold=sys.maxsize)

X_train

array([[ 1.  ,  0.  ,  0.  ,  1.76,  1.  ,  1.  ,  0.  ,  2.  ],
       [ 0.  ,  1.  ,  0.  ,  1.66,  1.  ,  0.  ,  0.  ,  3.  ],
       [ 0.  ,  1.  ,  0.  ,  3.5 ,  1.  ,  0.  ,  0.  ,  3.  ],
       [ 1.  ,  0.  ,  0.  ,  2.01,  1.  ,  0.  ,  0.  ,  2.  ],
       [ 0.  ,  1.  ,  0.  ,  3.  ,  1.  ,  1.  ,  0.  ,  2.  ],
       [ 0.  ,  1.  ,  0.  ,  4.  ,  0.  ,  1.  ,  0.  ,  2.  ],
       [ 0.  ,  1.  ,  0.  ,  5.2 ,  0.  ,  0.  ,  0.  ,  4.  ],
       [ 0.  ,  0.  ,  1.  ,  4.  ,  1.  ,  1.  ,  1.  ,  4.  ],
       [ 0.  ,  0.  ,  1.  ,  5.  ,  1.  ,  0.  ,  1.  ,  5.  ],
       [ 1.  ,  0.  ,  0.  ,  3.  ,  0.  ,  0.  ,  0.  ,  2.  ],
       [ 1.  ,  0.  ,  0.  ,  1.75,  1.  ,  0.  ,  0.  ,  2.  ],
       [ 0.  ,  1.  ,  0.  ,  3.12,  1.  ,  0.  ,  0.  ,  4.  ],
       [ 0.  ,  0.  ,  1.  ,  3.  ,  0.  ,  0.  ,  0.  ,  2.  ],
       [ 1.  ,  0.  ,  0.  ,  5.  ,  1.  ,  0.  ,  0.  ,  3.  ],
       [ 0.  ,  0.  ,  1.  ,  4.  ,  1.  ,  0.  ,  1.  ,  2.  ],
       [ 1.  ,  0.  ,  0.

- Converting the resulting NumPy array back into a Pandas DataFrame is the best way to understand the results, as you can re-attach meaningful column names.
The main challenge is that the NumPy array returned by scikit-learn carries no column names. The fitted ColumnTransformer can regenerate them with get_feature_names_out(), so they do not have to be typed by hand.
The array is converted back into a DataFrame with readable column names as follows:

In [None]:
# X_train is the NumPy array returned by ct.fit_transform(X_train) above

# 1. Get the new column names from the fitted transformer object (ct)
# This generates names like 'onehot__day_Sat', 'remainder__tip', etc.
new_column_names = ct.get_feature_names_out()

# 2. Convert the NumPy array back into a Pandas DataFrame
X_Train_processed_df = pd.DataFrame(X_train, columns=new_column_names)

# 3. View the first few rows of the clean DataFrame
X_Train_processed_df.head()

Unnamed: 0,onehot__day_Sat,onehot__day_Sun,onehot__day_Thur,remainder__tip,remainder__sex,remainder__smoker,remainder__time,remainder__size
0,1.0,0.0,0.0,1.76,1.0,1.0,0.0,2.0
1,0.0,1.0,0.0,1.66,1.0,0.0,0.0,3.0
2,0.0,1.0,0.0,3.5,1.0,0.0,0.0,3.0
3,1.0,0.0,0.0,2.01,1.0,0.0,0.0,2.0
4,0.0,1.0,0.0,3.0,1.0,1.0,0.0,2.0


In [29]:
X_test

array([[0.  , 1.  , 0.  , 2.  , 0.  , 0.  , 0.  , 3.  ],
       [1.  , 0.  , 0.  , 3.21, 1.  , 1.  , 0.  , 2.  ],
       [1.  , 0.  , 0.  , 2.  , 1.  , 1.  , 0.  , 2.  ],
       [1.  , 0.  , 0.  , 3.76, 1.  , 1.  , 0.  , 4.  ],
       [1.  , 0.  , 0.  , 2.09, 1.  , 1.  , 0.  , 2.  ],
       [0.  , 0.  , 1.  , 5.  , 1.  , 1.  , 1.  , 2.  ],
       [0.  , 1.  , 0.  , 3.51, 1.  , 0.  , 0.  , 2.  ],
       [1.  , 0.  , 0.  , 5.16, 1.  , 1.  , 0.  , 4.  ],
       [0.  , 1.  , 0.  , 5.  , 1.  , 0.  , 0.  , 2.  ],
       [1.  , 0.  , 0.  , 3.6 , 1.  , 0.  , 0.  , 3.  ],
       [0.  , 1.  , 0.  , 5.65, 1.  , 1.  , 0.  , 2.  ],
       [1.  , 0.  , 0.  , 2.5 , 0.  , 1.  , 0.  , 3.  ],
       [0.  , 0.  , 1.  , 1.44, 1.  , 0.  , 1.  , 2.  ],
       [1.  , 0.  , 0.  , 3.09, 0.  , 1.  , 0.  , 4.  ],
       [0.  , 1.  , 0.  , 2.  , 1.  , 0.  , 0.  , 4.  ],
       [0.  , 0.  , 1.  , 1.36, 0.  , 0.  , 1.  , 3.  ],
       [0.  , 0.  , 1.  , 2.  , 0.  , 0.  , 1.  , 2.  ],
       [0.  , 0.  , 1.  , 1.68,

### Apply the SVR (Support Vector Regressor) ML algorithm to predict the output and evaluate it

In [41]:
from sklearn.svm import SVR
svr = SVR()

# fit the training data
svr.fit(X_train, y_train)

# predict on test data
y_pred = svr.predict(X_test)

In [42]:
# Evaluate the predictions: R2 score (higher is better) and MAE (lower is better)

from sklearn.metrics import r2_score, mean_absolute_error
print(r2_score(y_test, y_pred))
print(mean_absolute_error(y_test, y_pred))

0.46028114561159283
4.1486423210190235
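
For reference, both metrics can be reproduced by hand: MAE is the mean of the absolute errors, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares. The numbers below are made up for illustration and are not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error

# Hypothetical true and predicted bills
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 37.0])

# MAE: mean of |error|
mae_manual = np.mean(np.abs(y_true - y_hat))

# R2: 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_hat) ** 2)          # 26.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 500.0
r2_manual = 1 - ss_res / ss_tot

print(mae_manual, mean_absolute_error(y_true, y_hat))  # both 2.5
print(r2_manual, r2_score(y_true, y_hat))              # both approximately 0.948
```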


### Hyperparameter Tuning using GridSearchCV

In [43]:
from sklearn.model_selection import GridSearchCV

# define parameter range
param_grid = {
              'C':[0.1, 1, 10, 100, 1000],
              'gamma':[1,0.1,0.01, 0.001, 0.0001],
              'kernel':['rbf']
              }

# apply GridSearchCV
grid = GridSearchCV(SVR(), param_grid, refit=True, verbose=3)

# fit the training data
grid.fit(X_train,y_train)

Fitting 5 folds for each of 25 candidates, totalling 125 fits
[CV 1/5] END .......C=0.1, gamma=1, kernel=rbf;, score=-0.067 total time=   0.0s
[CV 2/5] END .......C=0.1, gamma=1, kernel=rbf;, score=-0.058 total time=   0.0s
[CV 3/5] END .......C=0.1, gamma=1, kernel=rbf;, score=-0.145 total time=   0.0s
[CV 4/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.025 total time=   0.0s
[CV 5/5] END .......C=0.1, gamma=1, kernel=rbf;, score=-0.089 total time=   0.0s
[CV 1/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.013 total time=   0.0s
[CV 2/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.021 total time=   0.0s
[CV 3/5] END .....C=0.1, gamma=0.1, kernel=rbf;, score=-0.010 total time=   0.0s
[CV 4/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.124 total time=   0.0s
[CV 5/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.050 total time=   0.0s
[CV 1/5] END ....C=0.1, gamma=0.01, kernel=rbf;, score=-0.053 total time=   0.0s
[CV 2/5] END ....C=0.1, gamma=0.01, kernel=rbf;

In [44]:
grid.best_params_

{'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'}
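
The "25 candidates, totalling 125 fits" line in the log follows from the grid itself: 5 values of C times 5 values of gamma times 1 kernel, each fitted on 5 folds. A small sketch with ParameterGrid makes the count explicit; `param_grid_demo` is a copy of the grid used above, and 5 folds is the GridSearchCV default that the log reports.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the one passed to GridSearchCV
param_grid_demo = {
    'C': [0.1, 1, 10, 100, 1000],
    'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
    'kernel': ['rbf'],
}

candidates = list(ParameterGrid(param_grid_demo))
n_folds = 5  # default number of folds, as reported in the log

print(len(candidates))            # 25
print(len(candidates) * n_folds)  # 125
print(candidates[0])              # first combination that is tried
```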

In [45]:
# predict on test data
grid_predicted  = grid.predict(X_test)

In [46]:
# Evaluate the tuned model: R2 score (higher is better) and MAE (lower is better)

from sklearn.metrics import r2_score, mean_absolute_error
print(r2_score(y_test, grid_predicted))
print(mean_absolute_error(y_test, grid_predicted))

0.5081599655420066
3.8685177092539527


Observation after GridSearchCV 
- R2 score increased after GridSearchCV / hyperparameter tuning (0.460 to 0.508) - the fit improved
- MAE decreased after GridSearchCV / hyperparameter tuning (4.149 to 3.869) - errors got reduced