# Regression
Regression models are used to predict continuous data.
### Some types of regression models:
- Linear Regression
- Support Vector Regression (SVR)

## Linear Regression
Linear regression is the most common regression algorithm. It assumes a linear relationship between the dependent variable and the independent variables.
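
To make the linearity assumption concrete, here is a minimal sketch on synthetic data. The slope 3.0, intercept 2.0, and noise level are made-up values for illustration and are not taken from the dataset used below. If the relationship really is linear, the fitted coefficients should land close to the values used to generate the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data: y = 3.0 * x + 2.0 plus a little Gaussian noise.
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, size=200)

# scikit-learn expects a 2-D feature matrix, hence the reshape.
toy_model = LinearRegression().fit(x.reshape(-1, 1), y)

slope = toy_model.coef_[0]
intercept = toy_model.intercept_
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```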


### Let's work through an example:

In [4]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

#### Clean Data

In [5]:
df = pd.read_csv('Electronic_sales_Sep2023-Sep2024.csv')

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Customer ID        20000 non-null  int64  
 1   Age                20000 non-null  int64  
 2   Gender             19999 non-null  object 
 3   Loyalty Member     20000 non-null  object 
 4   Product Type       20000 non-null  object 
 5   SKU                20000 non-null  object 
 6   Rating             20000 non-null  int64  
 7   Order Status       20000 non-null  object 
 8   Payment Method     20000 non-null  object 
 9   Total Price        20000 non-null  float64
 10  Unit Price         20000 non-null  float64
 11  Quantity           20000 non-null  int64  
 12  Purchase Date      20000 non-null  object 
 13  Shipping Type      20000 non-null  object 
 14  Add-ons Purchased  15132 non-null  object 
 15  Add-on Total       20000 non-null  float64
dtypes: float64(3), int64(4), object(9)

In [7]:
df.isna().any().sum()

2

In [8]:
df.dropna(inplace=True)
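
Note that `df.isna().any().sum()` counts the columns that contain at least one missing value, not the missing cells themselves. A small sketch on a made-up frame (the column names mirror the dataset, the values are invented) shows the difference and what `dropna` removes:

```python
import pandas as pd

# Invented toy data with missing values in two of the three columns.
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "Add-ons Purchased": ["Accessory", None, None, "Impulse Item"],
    "Quantity": [1, 2, 3, 4],
})

cols_with_na = toy.isna().any().sum()  # number of columns with any missing value
na_per_column = toy.isna().sum()       # missing-value count for each column
rows_kept = len(toy.dropna())          # rows with no missing value in any column

print(cols_with_na)
print(na_per_column)
print(rows_kept)
```

In the real data, `dropna` therefore also discards every order without add-ons (`Add-ons Purchased` has 15132 non-null values out of 20000), which may or may not be what you want.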

#### Drop useless columns

In [9]:
# In this example we try to predict the total price. If we gave the model the unit price
# and the add-on price, the predictions would be nearly perfect, so let's make it harder.
df.drop(columns=['Customer ID', 'Gender', 'SKU', 'Unit Price', 'Purchase Date', 'Payment Method', 'Order Status'], inplace=True)

In [10]:
df.head()

| | Age | Loyalty Member | Product Type | Rating | Total Price | Quantity | Shipping Type | Add-ons Purchased | Add-on Total |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 53 | No | Smartphone | 2 | 5538.33 | 7 | Standard | Accessory,Accessory,Accessory | 40.21 |
| 1 | 53 | No | Tablet | 3 | 741.09 | 3 | Overnight | Impulse Item | 26.09 |
| 3 | 41 | Yes | Smartphone | 2 | 3164.76 | 4 | Overnight | Impulse Item,Impulse Item | 60.16 |
| 4 | 75 | Yes | Smartphone | 5 | 41.5 | 2 | Express | Accessory | 35.56 |
| 5 | 41 | No | Smartphone | 5 | 83.0 | 4 | Standard | Impulse Item,Accessory | 65.78 |


#### Preparing to predict 

In [11]:
from sklearn.model_selection import train_test_split

In [12]:
X = df.drop(columns=['Total Price'])
y = df['Total Price']

X_encoded = pd.get_dummies(X, columns=['Loyalty Member', 'Product Type', 'Shipping Type', 'Add-ons Purchased'], dtype=int)

X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2)
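
`pd.get_dummies` turns each categorical column into one 0/1 column per distinct value. The sketch below uses a made-up four-row frame to show the resulting columns. It also passes `random_state` to `train_test_split`, an optional argument not used in the cell above, which makes the split reproducible between runs.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented toy data with one numeric and one categorical column.
toy = pd.DataFrame({
    "Quantity": [1, 2, 3, 4],
    "Shipping Type": ["Standard", "Express", "Standard", "Overnight"],
})

# One indicator column is created per distinct shipping type.
encoded = pd.get_dummies(toy, columns=["Shipping Type"], dtype=int)
print(list(encoded.columns))

# The same random_state yields the same split every time.
a_train, a_test = train_test_split(encoded, test_size=0.25, random_state=42)
b_train, b_test = train_test_split(encoded, test_size=0.25, random_state=42)
print(a_test.index.equals(b_test.index))
```

Without `random_state`, each run produces a different split, so the scores reported in this section will vary slightly from run to run.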


In [13]:
Model = LinearRegression()

In [14]:
Model.fit(X_train, y_train)

#### Predict


In [15]:
y_pred = Model.predict(X_test)

In [16]:
from sklearn.metrics import mean_absolute_error

In [17]:
mean_absolute_error(y_test, y_pred)

1254.4493844462338

In [18]:
Model.score(X_test, y_test)

0.5766131149620801

In [19]:
Model.score(X_train, y_train)

0.6054669697116969
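
For a regressor, `score` returns the coefficient of determination R², and `mean_absolute_error` is the average absolute gap between prediction and truth. The following minimal sketch uses invented numbers to compute both by hand and compare them with scikit-learn's own functions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Invented values, chosen only to keep the arithmetic easy to follow.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 320.0, 380.0])

# MAE: mean of the absolute residuals.
mae_manual = np.mean(np.abs(y_true - y_hat))

# R^2: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mae_manual, mean_absolute_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```

Read this way, a test score of about 0.58 means the model explains a bit more than half of the variance in `Total Price`. The train score (about 0.61) is close to it, which suggests the model is not overfitting much.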

In [20]:
results = pd.DataFrame({'y_pred': y_pred, 'y_test': y_test})

In [21]:
results

| | y_pred | y_test |
|---|---|---|
| 3906 | 5797.316406 | 7120.71 |
| 18102 | 2876.453125 | 2279.36 |
| 15846 | 6927.843750 | 10257.12 |
| 8338 | 3682.339844 | 3955.95 |
| 18775 | 2576.199219 | 1838.00 |
| ... | ... | ... |
| 14092 | 2898.519531 | 1805.90 |
| 4112 | 6241.699219 | 7911.90 |
| 5656 | 2966.671875 | 2783.76 |
| 8073 | 3205.765625 | 4224.15 |


## Support Vector Regression (SVR)

If you arrive here after the classification module, where we covered Support Vector Machines (SVM), this model will look familiar: it is the same idea applied to continuous data. While the goal of SVM is to find the optimal hyperplane that separates classes, the goal of SVR is to find a function that predicts continuous values while keeping errors within a small margin. Errors smaller than the margin (`epsilon`) are ignored, and only points that fall outside it are penalized, with `C` controlling how strongly.
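
The idea of an error margin can be made concrete with the epsilon-insensitive loss that SVR is built around: residuals smaller than `epsilon` cost nothing, and larger ones cost only the excess. Here is a minimal NumPy sketch. The helper name `epsilon_insensitive_loss` and the sample values are illustrative and are not part of scikit-learn.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.2):
    """Per-sample loss: zero inside the epsilon margin, linear outside it."""
    residual = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return np.maximum(0.0, residual - epsilon)

# First residual (0.1) is inside the margin, second (0.5) exceeds it by 0.3,
# third is an exact prediction.
losses = epsilon_insensitive_loss([1.0, 2.0, 3.0], [1.1, 2.5, 3.0], epsilon=0.2)
print(losses)
```

A larger `epsilon` tolerates more error before any penalty applies, which tends to give a flatter, simpler fit.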

### Let's try it with an example:
(This dataset could also be used for classification, but here we use it for regression.)

In [22]:
import pandas as pd
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [23]:
df = pd.read_csv('extinct_species_dataset.csv')

In [24]:
df = df.sample(100000)

In [25]:
print(df['Species Name'].unique())
print('\n')
print(df['Extinction Reason'].unique())

['Sabertooth Tiger' 'Steller’s Sea Cow' 'Trilobite' 'Megalodon' 'Smilodon'
 'Dodo' 'Quagga' 'Plesiosaur' 'Woolly Mammoth' 'Tyrannosaurus Rex']


['Human Impact' 'Natural Disaster' 'Mass Extinction' 'Habitat Loss'
 'Asteroid Impact' 'Predation' 'Climate Change']


#### Clean data


In [26]:
df.isna().any()

Species Name                   False
Years Lived (Million Years)    False
Extinction Reason              False
dtype: bool

#### Prepare data

In [27]:
X = df.drop(columns=['Years Lived (Million Years)'])
y = df['Years Lived (Million Years)']

X_encoded = pd.get_dummies(X, columns=['Species Name', 'Extinction Reason'], dtype=int)

X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.5)


#### Training

In [28]:
# Note: gamma only affects the 'rbf', 'poly' and 'sigmoid' kernels; the linear kernel ignores it.
clf = SVR(kernel='linear', gamma='scale', C=1.0, epsilon=0.2)
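
With `kernel='linear'`, the `gamma` argument has no effect, because the linear kernel is a plain dot product. This can be checked with a small sketch on synthetic data (all values below are generated for illustration): two linear SVR models that differ only in `gamma` should give the same predictions.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic regression problem with three features.
X_demo = rng.normal(size=(60, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, size=60)

# Identical settings except for gamma.
svr_scale = SVR(kernel="linear", gamma="scale", C=1.0, epsilon=0.2).fit(X_demo, y_demo)
svr_large = SVR(kernel="linear", gamma=10.0, C=1.0, epsilon=0.2).fit(X_demo, y_demo)

same_predictions = np.allclose(svr_scale.predict(X_demo), svr_large.predict(X_demo))
print(same_predictions)
```

For a linear kernel, the parameters that matter are `C` and `epsilon`.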

In [29]:
clf.fit(X_train, y_train)

In [30]:
y_pred = clf.predict(X_test)

In [31]:
mean_squared_error(y_test, y_pred)

20769.455457864762
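
An MSE on its own is hard to judge. One common sanity check, sketched here on synthetic data rather than on the extinct-species file, is to compare the model against a baseline that always predicts the training mean (`DummyRegressor`). All data below is generated for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic data in which the feature really does explain the target.
X_toy = rng.uniform(0, 10, size=(300, 1))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(0, 1.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.5, random_state=0)

# Baseline: ignores the features and always predicts the mean of y_tr.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
svr = SVR(kernel="linear", C=1.0, epsilon=0.2).fit(X_tr, y_tr)

mse_baseline = mean_squared_error(y_te, baseline.predict(X_te))
mse_svr = mean_squared_error(y_te, svr.predict(X_te))
rmse_svr = np.sqrt(mse_svr)  # same units as the target

print(mse_baseline, mse_svr, rmse_svr)
```

If a model's MSE is close to the baseline's, the features carry little information about the target. Taking the square root also puts the error back in the target's units: the MSE of about 20769 above corresponds to an RMSE of roughly 144 million years.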

In [32]:
results = pd.DataFrame({'y_pred': y_pred, 'y_test': y_test})

In [33]:
results

| | y_pred | y_test |
|---|---|---|
| 231086 | 252.751359 | 391.50 |
| 477813 | 251.011585 | 16.17 |
| 72690 | 254.040362 | 209.03 |
| 538088 | 248.661658 | 103.17 |
| 419942 | 256.291185 | 447.49 |
| ... | ... | ... |
| 271385 | 253.771605 | 430.70 |
| 680740 | 249.888451 | 171.59 |
| 784533 | 254.040362 | 92.08 |
| 892297 | 251.011585 | 235.40 |
