<h1>Price Prediction Model</h1>

In [14]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

<h2>Organizing the Data</h2>

In [15]:
df = pd.read_csv('ford.csv')
df.head(10)

,model,year,price,transmission,mileage,fuelType,tax,mpg,engineSize
0,Fiesta,2017,12000,Automatic,15944,Petrol,150,57.7,1.0
1,Focus,2018,14000,Manual,9083,Petrol,150,57.7,1.0
2,Focus,2017,13000,Manual,12456,Petrol,150,57.7,1.0
3,Fiesta,2019,17500,Manual,10460,Petrol,145,40.3,1.5
4,Fiesta,2019,16500,Automatic,1482,Petrol,145,48.7,1.0
5,Fiesta,2015,10500,Manual,35432,Petrol,145,47.9,1.6
6,Puma,2019,22500,Manual,2029,Petrol,145,50.4,1.0
7,Fiesta,2017,9000,Manual,13054,Petrol,145,54.3,1.2
8,Kuga,2019,25500,Automatic,6894,Diesel,145,42.2,2.0
9,Focus,2018,10000,Manual,48141,Petrol,145,61.4,1.0


In [16]:
df['price'].max()

54995

In [17]:
df.shape

(17966, 9)

<h3>Variable descriptions</h3>
<p>model - Car model<br>
year - Year of manufacture<br>
price - Price of the car<br>
transmission - Transmission type (Automatic, Manual, Semi-Auto)<br>
mileage - Miles driven<br>
fuelType - Fuel type (Petrol, Diesel, Hybrid, Electric, Other)<br>
tax - Tax payable for CO2 emissions<br>
mpg - Miles per gallon of fuel<br>
engineSize - Engine size</p>

<h3>Drop unnecessary columns</h3>

In [18]:
cols_drop = ['model']
df = df.drop(cols_drop, axis=1)

<h3>Check for NaN values</h3>

In [19]:
df.isna().sum()

year            0
price           0
transmission    0
mileage         0
fuelType        0
tax             0
mpg             0
engineSize      0
dtype: int64
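
<p>The dataset here has no missing values, so nothing needs to be cleaned. As a minimal sketch of what the same check reports when values are missing, the toy frame below uses hypothetical data (not rows from ford.csv) and drops incomplete rows, which is only one of several possible strategies.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing entries, for illustration only.
toy = pd.DataFrame({
    "mpg": [57.7, np.nan, 40.3],
    "tax": [150, 145, np.nan],
})

# Count missing values per column, as in the cell above.
missing = toy.isna().sum()
print(missing)

# One simple option: keep only the rows without any missing value.
cleaned = toy.dropna()
print(cleaned)
```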

<h3>Split the data into train and test sets</h3>

In [20]:
X = df.drop('price', axis=1)
y = df['price']
y.max()

54995

<h3>Convert string variables to numeric</h3>
<p>As we can see, the "transmission" and "fuelType" columns contain categorical variables.</p>

In [21]:
print('Variables transmission: ',X['transmission'].unique(),'\nVariables fuelType:',X['fuelType'].unique())

Variables transmission:  ['Automatic' 'Manual' 'Semi-Auto'] 
Variables fuelType: ['Petrol' 'Diesel' 'Hybrid' 'Electric' 'Other']


<p>To convert these variables we will use one-hot encoding.</p>

In [22]:
X_ohe = pd.get_dummies(X[['transmission','fuelType']])
X = X.join(X_ohe)
X = X.drop(['transmission','fuelType'], axis=1)
X.head()

,year,mileage,tax,mpg,engineSize,transmission_Automatic,transmission_Manual,transmission_Semi-Auto,fuelType_Diesel,fuelType_Electric,fuelType_Hybrid,fuelType_Other,fuelType_Petrol
0,2017,15944,150,57.7,1.0,1,0,0,0,0,0,0,1
1,2018,9083,150,57.7,1.0,0,1,0,0,0,0,0,1
2,2017,12456,150,57.7,1.0,0,1,0,0,0,0,0,1
3,2019,10460,145,40.3,1.5,0,1,0,0,0,0,0,1
4,2019,1482,145,48.7,1.0,1,0,0,0,0,0,0,1
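
<p>As a minimal sketch of what one-hot encoding does, the toy frame below (hypothetical data, not rows from ford.csv) is encoded in a single call by passing <code>columns=</code> to <code>pd.get_dummies</code>. From pandas 2.0 onward <code>get_dummies</code> returns boolean columns by default, so <code>dtype=int</code> is passed here to get the 0/1 integers shown in the output above.</p>

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the encoding.
toy = pd.DataFrame({
    "transmission": ["Automatic", "Manual", "Manual"],
    "fuelType": ["Petrol", "Diesel", "Petrol"],
    "mileage": [15944, 9083, 12456],
})

# Each category becomes its own 0/1 column; the original columns are dropped.
# dtype=int keeps integer output regardless of the pandas version.
encoded = pd.get_dummies(toy, columns=["transmission", "fuelType"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

<p>Only categories present in the data get a column, so a frame encoded separately at prediction time may need its columns aligned with the training columns.</p>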


In [23]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
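
<p>To illustrate what <code>test_size=0.25</code> and <code>random_state=1</code> do, here is a small sketch on hypothetical stand-in arrays (the names <code>X_demo</code> and <code>y_demo</code> are not part of the notebook). A quarter of the rows go to the test set, and the fixed seed makes the split reproducible.</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 3 features.
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)
print(X_tr.shape, X_te.shape)  # 75 training rows and 25 test rows

# Repeating the call with the same seed yields the same test rows.
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)
print(np.array_equal(X_te, X_te_again))
```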

<h2>Price Prediction Model</h2>

In [24]:
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor(random_state=1)
model.fit(X_train, y_train)

DecisionTreeRegressor(random_state=1)

In [25]:
y_pred = model.predict(X_test)
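
<p>A decision tree with no depth limit can memorize its training set, so it can be worth comparing training and test scores. The sketch below does this on synthetic data (hypothetical, generated only for illustration) and also fits a depth-limited tree. The choice of <code>max_depth=4</code> is an arbitrary assumption, and whether limiting depth helps on the Ford data is not something this notebook measures.</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a linear signal plus noise, for illustration only.
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(400, 2))
y_demo = 3 * X_demo[:, 0] + rng.normal(0, 2, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)

# An unconstrained tree versus one limited to depth 4 (arbitrary choice).
full = DecisionTreeRegressor(random_state=1).fit(X_tr, y_tr)
shallow = DecisionTreeRegressor(max_depth=4, random_state=1).fit(X_tr, y_tr)

# score() returns R^2; a large train/test gap suggests overfitting.
for name, m in [("full", full), ("max_depth=4", shallow)]:
    print(name,
          "train R2:", round(m.score(X_tr, y_tr), 3),
          "test R2:", round(m.score(X_te, y_te), 3))
```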

In [26]:
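# Import only the metrics that are used instead of a wildcard import.
from sklearn.metrics import explained_variance_score, mean_absolute_error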

print('Explained_variance_score: ',explained_variance_score(y_test, y_pred))
print('mean_absolute_error:', mean_absolute_error(y_test, y_pred))

Explained_variance_score:  0.8624471875571122
mean_absolute_error: 1190.5389210448204
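
<p>To make the two metrics concrete: the mean absolute error is the average of |y - ŷ|, expressed in the same units as <code>price</code>, and the explained variance score is 1 - Var(y - ŷ) / Var(y). The sketch below recomputes both by hand on hypothetical prices (not values from the test set) and compares them with the scikit-learn results.</p>

```python
import numpy as np
from sklearn.metrics import explained_variance_score, mean_absolute_error

# Hypothetical true and predicted prices, for illustration only.
y_true = np.array([12000.0, 14000.0, 13000.0, 17500.0])
y_hat = np.array([11500.0, 14500.0, 13000.0, 16000.0])

residuals = y_true - y_hat

# Mean absolute error: average size of the errors.
mae_manual = np.mean(np.abs(residuals))

# Explained variance: share of the target variance not left in the residuals.
evs_manual = 1 - np.var(residuals) / np.var(y_true)

print(mae_manual, mean_absolute_error(y_true, y_hat))
print(evs_manual, explained_variance_score(y_true, y_hat))
```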


<h2>Export the trained model, version 1.0</h2>

In [28]:
import pickle

# Use a context manager so the file is closed once the model is written.
with open('modelo_prediccion_precio_1.0v.pkl', 'wb') as f:
    pickle.dump(model, f)
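
<p>A saved model is only useful if it can be loaded again. The following sketch shows a save-and-load round trip with a tiny stand-in model written to a temporary file (<code>demo_model</code> and <code>demo_model.pkl</code> are hypothetical names, not part of the notebook). Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.</p>

```python
import os
import pickle
import tempfile

from sklearn.tree import DecisionTreeRegressor

# Hypothetical tiny model standing in for the trained one.
demo_model = DecisionTreeRegressor(random_state=1).fit(
    [[0], [1], [2]], [10.0, 20.0, 30.0]
)

# Save to a temporary location so nothing in the project is overwritten.
path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")
with open(path, "wb") as f:
    pickle.dump(demo_model, f)

# Load the model back and use it as before.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored.predict([[1]]))
```

<p>When the real model is loaded, the input rows must have the same columns, in the same order, as the frame used for training.</p>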