# Predicting Car Prices
In this project, we will use the CRISP-DM process to predict a car's market price using its attributes. The data set we will be working with contains information on various cars. For each car we have information about the technical aspects of the vehicle such as the motor's displacement, the weight of the car, the miles per gallon, how fast the car accelerates, and more. 

## Business Understanding
The CRISP-DM process starts with understanding the business problem. Imagine, for example, a used-car dealer who needs to estimate what a used car could sell for. The dealer could be interested in predicting the price of a car based on its attributes. In this project, we try to answer the following three business questions:
* Is the price of a car related to the horsepower?
* Is the price of a car related to the number of doors it has?
* Can the price of a car be predicted from its attributes with reasonable accuracy?


## Data Understanding 

The data is taken from the UCI Machine Learning Repository and can be downloaded <a href = "https://archive.ics.uci.edu/ml/datasets/automobile"> here </a>. Let us read in the data and take a first look at it:

In [18]:
import pandas as pd
import numpy as np

In [7]:
data = pd.read_csv(r"C:\Repositories\Predicting-Car-Proces\data\imports-85.data")

In [8]:
data.head()

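|  | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.60 | ... | 130 | mpfi | 3.47 | 2.68 | 9.00 | 111 | 5000 | 21 | 27 | 13495 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 1 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 2 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.4 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.4 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
| 4 | 2 | ? | audi | gas | std | two | sedan | fwd | front | 99.8 | ... | 136 | mpfi | 3.19 | 3.4 | 8.5 | 110 | 5500 | 19 | 25 | 15250 |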


We see that the data is stored in CSV format without column names, so pandas has used the first record as the header, which is wrong. The column names can be found at the same link as above. We store them in a Python list and load the data again:

In [9]:
cols = ['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style', 
        'drive-wheels', 'engine-location', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type', 
        'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-rate', 'horsepower',
        'peak-rpm', 'city-mpg', 'highway-mpg', 'price']

data = pd.read_csv(r"C:\Repositories\Predicting-Car-Proces\data\imports-85.data", names=cols)

In [10]:
data.head()

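|  | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-rate | horsepower | peak-rpm | city-mpg | highway-mpg | price |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.4 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.4 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |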
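
The difference between the two `read_csv` calls can be reproduced on a small in-memory sample. This is only a sketch: the two-record CSV text and the three column names are made up for illustration and are not read from the dataset file.

```python
import io
import pandas as pd

# Hypothetical header-less CSV text (two records, three fields each).
sample_csv = "3,alfa-romero,13495\n1,audi,16500\n"
sample_cols = ["symboling", "make", "price"]  # illustrative subset of names

# Without `names`, pandas treats the first record as the header row.
wrong = pd.read_csv(io.StringIO(sample_csv))

# With `names`, every record is kept as data.
right = pd.read_csv(io.StringIO(sample_csv), names=sample_cols)

print(wrong.shape)  # one data row: the first record became the header
print(right.shape)  # two data rows
```

Comparing the two shapes is a quick way to notice that a record was silently consumed as a header.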


Now the data has the correct column names. A detailed description of the dataset, including all of its attributes, can be found <a href ="https://archive.ics.uci.edu/ml/datasets/automobile"> here </a>. Among other things, the documentation states whether each attribute is continuous. We will restrict our analysis to the continuous attributes, although it could be extended to the non-continuous ones. According to the documentation, the following attributes are continuous:

In [15]:
continuous_values_cols = ['normalized-losses', 'wheel-base', 'length', 'width', 'height',
                          'curb-weight', 'bore', 'stroke', 'compression-rate', 'horsepower',
                          'peak-rpm', 'city-mpg', 'highway-mpg', 'price']

data = data[continuous_values_cols]
data.head()

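|  | normalized-losses | wheel-base | length | width | height | curb-weight | bore | stroke | compression-rate | horsepower | peak-rpm | city-mpg | highway-mpg | price |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | ? | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | ? | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | ? | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 164 | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | 3.19 | 3.4 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 164 | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | 3.19 | 3.4 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |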


Now the data contains only the continuous variables. Our aim is to predict the 'price' attribute.

## Data Preparation

If we want to use the data for machine learning, one of the first preparation steps consists of cleaning the data:

### Data Cleaning

We usually can't have any missing values in the data if we want to use it for predictive modeling. Based on the data set preview above, we can see that the normalized-losses column contains missing values represented by "?". Let's replace these values with NaN and look for missing values in the other numeric columns.

In [19]:
data = data.replace('?', np.nan)
data.head()

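|  | normalized-losses | wheel-base | length | width | height | curb-weight | bore | stroke | compression-rate | horsepower | peak-rpm | city-mpg | highway-mpg | price |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | NaN | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | NaN | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | NaN | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 164.0 | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | 3.19 | 3.4 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 164.0 | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | 3.19 | 3.4 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |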
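
The "?" markers can also be handled while parsing, through the `na_values` argument of `pd.read_csv`. The sketch below is an alternative to the approach used in this notebook, not a description of it, and it runs on a made-up three-record sample rather than on the real file.

```python
import io
import pandas as pd

# Hypothetical header-less sample in which "?" marks a missing value.
sample_csv = "?,111,13495\n164,?,16500\n158,102,?\n"
sample_cols = ["normalized-losses", "horsepower", "price"]

# Plain read: columns that contain "?" are parsed as text, not numbers.
raw = pd.read_csv(io.StringIO(sample_csv), names=sample_cols)

# Declaring "?" as a missing-value marker lets pandas parse numbers directly.
parsed = pd.read_csv(io.StringIO(sample_csv), names=sample_cols, na_values="?")

print(raw.dtypes)
print(parsed.dtypes)
print(parsed.isnull().sum())
```

With this variant, the separate replacement step and the numeric cast would no longer be needed for columns whose only non-numeric entries are "?".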


Because ? is a string, columns containing this value were cast to the pandas object data type (instead of a numeric type like int or float). After replacing the ? values, let us cast the columns to float and check whether missing values exist in other columns:

In [20]:
data = data.astype('float')
data.isnull().sum()

normalized-losses    41
wheel-base            0
length                0
width                 0
height                0
curb-weight           0
bore                  4
stroke                4
compression-rate      0
horsepower            2
peak-rpm              2
city-mpg              0
highway-mpg           0
price                 4
dtype: int64

We see that the columns normalized-losses, bore, stroke, horsepower, peak-rpm and price contain missing values.  
Because price is the column we want to predict, let's remove any rows with missing price values:

In [21]:
data = data.dropna(subset=['price'])
data.isnull().sum()

normalized-losses    37
wheel-base            0
length                0
width                 0
height                0
curb-weight           0
bore                  4
stroke                4
compression-rate      0
horsepower            2
peak-rpm              2
city-mpg              0
highway-mpg           0
price                 0
dtype: int64
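
The `subset` argument restricts the check to the listed columns, so rows with gaps elsewhere are kept. A minimal sketch on a made-up three-row frame (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 lacks horsepower, row 2 lacks price.
toy = pd.DataFrame({
    "horsepower": [111.0, np.nan, 102.0],
    "price": [13495.0, 16500.0, np.nan],
})

# Only rows with a missing price are dropped; row 1 stays despite its NaN.
kept = toy.dropna(subset=["price"])

print(kept)
print(kept.isnull().sum())
```

This is why the missing-value counts of the other columns can stay above zero after the call.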

We need a strategy for the remaining missing values in the other columns. We chose to replace them with the mean of their column:

In [23]:
data = data.fillna(data.mean())

Let us check that there are indeed no missing values anymore:

In [24]:
data.isnull().sum()

normalized-losses    0
wheel-base           0
length               0
width                0
height               0
curb-weight          0
bore                 0
stroke               0
compression-rate     0
horsepower           0
peak-rpm             0
city-mpg             0
highway-mpg          0
price                0
dtype: int64
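
To see what the mean imputation does to individual cells, here is a small sketch on a made-up frame. The values are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing horsepower value.
toy = pd.DataFrame({
    "horsepower": [100.0, np.nan, 120.0],
    "price": [10000.0, 12000.0, 14000.0],
})

# DataFrame.mean() skips NaN by default, so the mean uses observed values only.
col_means = toy.mean()

# fillna with a Series fills each column with the entry matching its name.
filled = toy.fillna(col_means)

print(filled)
```

Each gap receives the mean of its own column, so complete columns such as `price` in this toy frame are left unchanged.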