# What Is the Purpose of a Supervised Learning Regression Model?

**Supervised learning is a type of machine learning in which the model is trained on labeled data (the model sees each input together with its corresponding output) and then predicts outputs for unseen inputs.** </br>

You can picture it this way: the model learns from data of the form X --> Y, where the arrow is the mapping from X to Y.
The model learns this mapping, that is, which operations (adding, subtracting, and so on) to apply to each input. When it later receives a new input X, it applies the learned mapping (the arrow -->) to predict Y. </br>

**There are two types of supervised learning: regression and classification. Today we will look at the first one, regression.
Regression algorithms predict a number from infinitely many possible outputs.** </br>

For example, picture a graph of housing prices with the house size on the x-axis and the house price on the y-axis. The model learns the mapping, that is, which operations to apply to any new input to get an output. When it receives a new house size, it returns an estimated price (an estimate, not an exact value). </br>
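The house example can be sketched in a few lines. This is a minimal illustration, not the notebook's dataset: the sizes and prices are made-up values that follow `price = 1000 * size + 20000` exactly, and the names `sizes`, `prices`, and `house_model` are assumptions introduced here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: house size in square meters -> price in USD.
# The prices follow price = 1000 * size + 20000, so the fit is easy to check.
sizes = np.array([[50], [80], [100], [120], [150]])
prices = np.array([70_000, 100_000, 120_000, 140_000, 170_000])

# Learn the mapping size -> price from the labeled examples.
house_model = LinearRegression()
house_model.fit(sizes, prices)

# Apply the learned mapping to a size the model has not seen.
new_size = np.array([[110]])
predicted_price = house_model.predict(new_size)[0]

print(f"slope: {house_model.coef_[0]:.1f}, intercept: {house_model.intercept_:.1f}")
print(f"predicted price for 110 m^2: {predicted_price:.0f}")
```

Because the toy data is perfectly linear, the model recovers the slope and intercept almost exactly. Real data is noisy, so real predictions are estimates.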

### Real World Use Cases of Supervised Learning Regression Algorithms
**The algorithm determines what the arrow should do with the input.**

House ---> Price, based on the properties of the house </br>
Factory ---> Demand for input products, based on production and sales </br>
Self-driving car ---> The speed the car should drive, based on external factors </br>

# Importing Libraries

In [9]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt

# Data Preprocessing

In [17]:
# load the data
df = pd.read_csv('Car_Prices.csv')

# Get information about data
print(df)
print(df.columns)
print(df.shape)

     car brand   model  generation_name  year  mileage  vol_engine      fuel  \
0         opel   combo       gen-d-2011  2015   139568        1248    Diesel   
1         opel   combo       gen-d-2011  2018    31991        1499    Diesel   
2         opel   combo       gen-d-2011  2015   278437        1598    Diesel   
3         opel   combo       gen-d-2011  2016    47600        1248    Diesel   
4         opel   combo       gen-d-2011  2014   103000        1400       CNG   
...        ...     ...              ...   ...      ...         ...       ...   
3099      opel  vectra  gen-c-2002-2008  2007   248000        1796  Gasoline   
3100      opel  vectra  gen-c-2002-2008  2003   263000        1796       LPG   
3101      opel  vectra  gen-c-2002-2008  2008   200000        1796  Gasoline   
3102      opel  vectra  gen-c-2002-2008  2005   148266        1910    Diesel   
3103      opel  vectra  gen-c-2002-2008  2007   182000        1796  Gasoline   

                 city       province  p
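Before any preprocessing, it usually helps to check column types and missing values in addition to printing the frame. The sketch below uses a small hypothetical DataFrame (`sample_df`) that only mimics a few columns of the car dataset. Its values, including the missing mileage, are invented for illustration.

```python
import pandas as pd

# Hypothetical rows that mimic a few columns of the car dataset.
sample_df = pd.DataFrame({
    "model": ["combo", "combo", "vectra", "astra"],
    "year": [2015, 2018, 2007, 2010],
    "mileage": [139568, 31991, 248000, None],
    "fuel": ["Diesel", "Diesel", "Gasoline", "LPG"],
})

# Column types: text columns are not numeric and need encoding later.
print(sample_df.dtypes)

# Missing values per column: gaps must be filled or dropped before training.
missing_per_column = sample_df.isna().sum()
print(missing_per_column)

# Summary statistics for the numeric columns.
print(sample_df.describe())
```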

## Splitting the Dataset

In [18]:
# TODO: explain when and why to use train_test_split, and apply it

# Decide what the inputs and the output are: which data the model needs for training and which operations the data requires
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

print(X)
print(y)

# y is a pandas Series (a single labeled column), not a DataFrame, so some operations differ (for example, it is renamed through .name)
# Scale the price by 0.26 to express it in USD
y = y * 0.26
y.name = "price (USD)"
print(y)

     car brand   model  generation_name  year  mileage  vol_engine      fuel  \
0         opel   combo       gen-d-2011  2015   139568        1248    Diesel   
1         opel   combo       gen-d-2011  2018    31991        1499    Diesel   
2         opel   combo       gen-d-2011  2015   278437        1598    Diesel   
3         opel   combo       gen-d-2011  2016    47600        1248    Diesel   
4         opel   combo       gen-d-2011  2014   103000        1400       CNG   
...        ...     ...              ...   ...      ...         ...       ...   
3099      opel  vectra  gen-c-2002-2008  2007   248000        1796  Gasoline   
3100      opel  vectra  gen-c-2002-2008  2003   263000        1796       LPG   
3101      opel  vectra  gen-c-2002-2008  2008   200000        1796  Gasoline   
3102      opel  vectra  gen-c-2002-2008  2005   148266        1910    Diesel   
3103      opel  vectra  gen-c-2002-2008  2007   182000        1796  Gasoline   

                 city       province  
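The difference between a Series and a DataFrame comes down to how the column is selected. The following sketch uses a hypothetical two-column frame (`toy`) and reuses the notebook's 0.26 factor purely as an example value.

```python
import pandas as pd

# Hypothetical frame with a feature column and a price column.
toy = pd.DataFrame({"mileage": [1000, 2000], "price": [10000, 20000]})

as_series = toy.iloc[:, -1]    # a single column index returns a Series
as_frame = toy.iloc[:, [-1]]   # a list of column indices returns a DataFrame

# A Series is renamed through its name attribute...
as_series = as_series * 0.26
as_series.name = "price (USD)"

# ...while a DataFrame renames its columns with rename().
as_frame = (as_frame * 0.26).rename(columns={"price": "price (USD)"})

print(type(as_series).__name__, as_series.name)
print(type(as_frame).__name__, list(as_frame.columns))
```

scikit-learn estimators accept either form as the target, but a one-dimensional Series avoids shape warnings.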


## Encoding Categorical Data

In [19]:
models = {}

# Iterate through the rows of the DataFrame
for index, row in df.iterrows():
    model = row['model']
    generation_name = row['generation_name']
    
    # Add generation name to the model's set
    if model not in models:
        models[model] = set()
    models[model].add(generation_name)

# Check models with more than one generation name
for model, generations in models.items():
    if len(generations) > 1:
        print(f"Model '{model}' has multiple generations: {generations}")

print(models)

# Each model has exactly one generation name, so generation_name adds no information
# Drop car brand, city, province and generation_name
# Also drop fuel, so that only the model column needs one-hot encoding
X = X.iloc[:, [1, 3, 4, 5]]
print(X)

{'combo': {'gen-d-2011'}, 'vectra': {'gen-c-2002-2008'}, 'agila': {'gen-b-2008'}, 'astra': {'gen-h-2004-2013'}, 'insignia': {'gen-a-2008-2017'}}
       model  year  mileage  vol_engine
0      combo  2015   139568        1248
1      combo  2018    31991        1499
2      combo  2015   278437        1598
3      combo  2016    47600        1248
4      combo  2014   103000        1400
...      ...   ...      ...         ...
3099  vectra  2007   248000        1796
3100  vectra  2003   263000        1796
3101  vectra  2008   200000        1796
3102  vectra  2005   148266        1910
3103  vectra  2007   182000        1796

[3104 rows x 4 columns]
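The same redundancy check can be written without an explicit loop. This is a sketch of an alternative, not the notebook's method. It uses `groupby` with `nunique` on a small hypothetical frame (`cars`) that shares the dataset's column names.

```python
import pandas as pd

# Hypothetical rows with the same column names as the car dataset.
cars = pd.DataFrame({
    "model": ["combo", "combo", "vectra", "vectra", "astra"],
    "generation_name": [
        "gen-d-2011", "gen-d-2011",
        "gen-c-2002-2008", "gen-c-2002-2008",
        "gen-h-2004-2013",
    ],
})

# Count the distinct generation names per model.
generations_per_model = cars.groupby("model")["generation_name"].nunique()
print(generations_per_model)

# If every model has exactly one generation, generation_name adds no information.
generation_is_redundant = bool((generations_per_model == 1).all())
print(generation_is_redundant)
```

On large frames this is typically much faster than `iterrows`, which builds a Series object for every row.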


In [20]:
# Caution: run this cell only once; running it again would one-hot encode the already encoded data
print(X)

# One-hot encode the model column (index 0); fuel was dropped in the previous cell
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [0])], 
    remainder='passthrough'  # Keep the rest of the columns as is
)

X = np.array(ct.fit_transform(X))
X = X.astype(int)

print(X)
print(X[0])
print(X[80])
print(X[200])

       model  year  mileage  vol_engine
0      combo  2015   139568        1248
1      combo  2018    31991        1499
2      combo  2015   278437        1598
3      combo  2016    47600        1248
4      combo  2014   103000        1400
...      ...   ...      ...         ...
3099  vectra  2007   248000        1796
3100  vectra  2003   263000        1796
3101  vectra  2008   200000        1796
3102  vectra  2005   148266        1910
3103  vectra  2007   182000        1796

[3104 rows x 4 columns]
[[     0      0      1 ...   2015 139568   1248]
 [     0      0      1 ...   2018  31991   1499]
 [     0      0      1 ...   2015 278437   1598]
 ...
 [     0      0      0 ...   2008 200000   1796]
 [     0      0      0 ...   2005 148266   1910]
 [     0      0      0 ...   2007 182000   1796]]
[     0      0      1      0      0   2015 139568   1248]
[     0      0      0      0      1   2005 149000   1796]
[     1      0      0      0      0   2008 142353   1248]


In [21]:
# Hold out 25% of the rows as a test set; random_state makes the split reproducible
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
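The effect of `test_size=0.25` and `random_state=42` can be checked on dummy data. The sketch assumes arrays with the same number of rows (3104) and feature columns (8) as in this notebook. The names `dummy_X` and `dummy_y` are placeholders, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data shaped like the notebook's encoded features and target.
dummy_X = np.zeros((3104, 8))
dummy_y = np.arange(3104)

dX_train, dX_test, dy_train, dy_test = train_test_split(
    dummy_X, dummy_y, test_size=0.25, random_state=42
)
print(dX_train.shape, dX_test.shape)  # (2328, 8) (776, 8)

# The same random_state reproduces the same split on every run.
_, _, _, dy_test_again = train_test_split(
    dummy_X, dummy_y, test_size=0.25, random_state=42
)
print(np.array_equal(dy_test, dy_test_again))  # True
```

A quarter of 3104 rows is 776, which matches the test-set shapes printed later in the notebook.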

# Training the Multiple Linear Regression Model

In [22]:
# TODO: add a video about how linear regression models work

from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X_train, y_train)
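A fitted multiple linear regression is an intercept plus one weight per feature, so a prediction is the intercept plus the weighted sum of the inputs. The sketch below shows this on synthetic data generated from `y = 3*x1 - 2*x2 + 5`. The data and the names `features`, `target`, and `toy_reg` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data from a known linear rule: y = 3*x1 - 2*x2 + 5.
rng = np.random.default_rng(0)
features = rng.uniform(0, 10, size=(50, 2))
target = 3.0 * features[:, 0] - 2.0 * features[:, 1] + 5.0

toy_reg = LinearRegression().fit(features, target)
print(toy_reg.coef_, toy_reg.intercept_)

# A prediction is the intercept plus the weighted sum of the features.
sample = np.array([[4.0, 1.0]])
manual_prediction = toy_reg.intercept_ + sample @ toy_reg.coef_
model_prediction = toy_reg.predict(sample)
print(manual_prediction, model_prediction)
```

In the notebook, the equivalent attributes `reg.coef_` and `reg.intercept_` hold the learned weights for the eight encoded columns.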

# Testing the Model with Predictions

In [28]:
# Predict prices for the test set
y_pred = reg.predict(X_test)
print(y_pred.shape)
print(X_test.shape)
print(y_test.shape)

# Print floats with two decimal places
np.set_printoptions(precision=2)
print(y_pred)

# TODO: this display should probably be done differently
# Show predicted and actual prices side by side with np.concatenate
y_test = np.array(y_test)
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1))

(776,)
(776, 8)
(776,)
[10104.97  6610.8   3931.07 12694.94  3750.06  1826.45 12000.05  4744.46
  2492.67  3232.08  6983.88  2681.49  9931.24  2518.48  2949.6   4878.54
  9620.26  6970.09 13397.    6901.67  2349.72  5675.28  4144.4   4060.04
  7598.1   8950.32 -1923.24  4894.07 11741.24  1923.44  9151.77  4339.64
  5364.54  4190.32  7555.66  5622.39  7327.97  3851.6   2822.93  4808.95
 11159.68  4634.04  4634.04  6598.45  1950.7  12393.33  6800.17  3997.43
  7235.83  3531.58  7868.23  2730.62  3380.05  9391.46  1950.7   8537.53
  5356.9   2457.52  2325.61  4412.83  3163.25  4499.13  1997.96  9526.28
  1871.23  2102.8   9248.12  2876.12   549.41   268.95 11619.07  4504.37
  3950.97  2339.89  4715.99  5174.36   467.91  3889.96  3267.71  4394.26
 11054.73 11275.45  3196.85 10871.81 10827.61  4388.8   7702.09  9693.06
  7195.11  3914.66 11520.31 10236.77 11728.04  5364.54  4512.62  2263.6
   874.59  5981.23  7170.08 10187.56  2259.95  3049.63  9531.56  5702.11
  3837.03  3623.27  6064.09  
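One possible alternative to the `np.concatenate` display is a small DataFrame with labeled columns and a per-row error. The values below are hypothetical stand-ins for `y_pred` and `y_test`, not results from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical predicted and actual prices (stand-ins for y_pred and y_test).
predicted = np.array([10000.0, 6500.0, 4000.0])
actual = np.array([9500.0, 7000.0, 3800.0])

# Labeled columns are easier to read than a bare two-column array.
comparison = pd.DataFrame({"predicted": predicted, "actual": actual})
comparison["error"] = comparison["predicted"] - comparison["actual"]

print(comparison)
```

A signed error column also makes problem rows easy to spot, such as a negative predicted price, which a linear model can produce.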

# Evaluation of the Model

In [30]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

0.862195570062051
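The R² score compares the model's squared errors with those of a baseline that always predicts the mean: $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$. The sketch below computes it by hand on hypothetical prices and checks the result against `r2_score`. The mean absolute error is included as a complementary metric in the same units as the price.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical actual and predicted prices in USD.
actual_prices = np.array([3000.0, 5000.0, 7000.0, 9000.0])
predicted_prices = np.array([3500.0, 4500.0, 7500.0, 8500.0])

# Residual sum of squares: errors of the model.
ss_res = np.sum((actual_prices - predicted_prices) ** 2)
# Total sum of squares: errors of always predicting the mean.
ss_tot = np.sum((actual_prices - actual_prices.mean()) ** 2)

manual_r2 = 1 - ss_res / ss_tot
library_r2 = r2_score(actual_prices, predicted_prices)
mae = mean_absolute_error(actual_prices, predicted_prices)

print(manual_r2, library_r2, mae)
```

An R² of about 0.86, as obtained above, means the model explains roughly 86% of the variance in the test-set prices.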