# **My First Data Science Project!**

### Goal:
The goal is to create a machine learning model that predicts the price of a used car from a number of characteristics and determines whether a listing is overpriced or underpriced.

My methodology and process for creating this model are far from perfect, but I hope that in due time I can come back to this and improve my model.

*Drum Rolls* 3-2-1-Go!

**Initial Setup**

First I imported all of the libraries needed to create and execute my machine learning model.

In [37]:
### Import Libraries ### 
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn import preprocessing
file = pd.read_csv("../input/craigslist-carstrucks-data/vehicles.csv")

**Data Exploration**
Next I wanted to explore my data. By executing the code below, I was able to determine that there were 509577 rows and 25 columns. Of the 25 variables, I decided to start with only a few features and incrementally add more to avoid overfitting.
'price' was my prediction target (y), and 'year', 'manufacturer', 'model', and 'odometer' were my initial candidate features.


In [39]:
### Explore Data ### 
print(file.columns)   # Shows columns headers
print(len(file.index)) # Shows number of rows
print(len(file.columns)) # Shows number of columns

Index(['id', 'url', 'region', 'region_url', 'price', 'year', 'manufacturer',
       'model', 'condition', 'cylinders', 'fuel', 'odometer', 'title_status',
       'transmission', 'vin', 'drive', 'size', 'type', 'paint_color',
       'image_url', 'description', 'county', 'state', 'lat', 'long'],
      dtype='object')
509577
25


In [40]:
pd.set_option('display.max_columns', 30)
file.head(1)

Unnamed: 0,id,url,region,region_url,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,vin,drive,size,type,paint_color,image_url,description,county,state,lat,long
0,7034441763,https://saltlakecity.craigslist.org/cto/d/salt...,salt lake city,https://saltlakecity.craigslist.org,17899,2012.0,volkswagen,golf r,excellent,4 cylinders,gas,63500.0,clean,manual,WVWPF7AJ6CW316713,4wd,compact,hatchback,black,https://images.craigslist.org/00G0G_fTLDWM5Xyv...,PRICE REDUCED! -Garage kept -Low Miles (63K)...,,ut,40.7372,-111.858


In [27]:
file.price.describe()

count    5.095770e+05
mean     5.479684e+04
std      9.575025e+06
min      0.000000e+00
25%      3.995000e+03
50%      9.377000e+03
75%      1.795500e+04
max      3.600029e+09
Name: price, dtype: float64

In [29]:
# Most common price points, ranked 2nd to 10th (iloc[1:10] skips the single most common value)
file.price.value_counts().iloc[1:10]

6995    4651
7995    4542
3500    4525
8995    4362
2500    4353
4500    4350
5995    4319
4995    4152
9995    4000
Name: price, dtype: int64

In [54]:
file[file["price"]>1000000].count()

id              66
url             66
region          66
region_url      66
price           66
year            66
manufacturer    56
model           62
condition       47
cylinders       43
fuel            66
odometer        34
title_status    66
transmission    66
vin             23
drive           39
size            28
type            35
paint_color     39
image_url       66
description     66
county           0
state           66
lat             66
long            66
dtype: int64

In [55]:
file[file["price"]>200000].count()

id              135
url             135
region          135
region_url      135
price           135
year            135
manufacturer     94
model           127
condition        77
cylinders        78
fuel            135
odometer         88
title_status    135
transmission    133
vin              65
drive            69
size             46
type             86
paint_color      81
image_url       135
description     135
county            0
state           135
lat             134
long            134
dtype: int64

### Cleaning Data ###
On further analysis of the price column, I found 66 cars listed for more than 1 million. I did not check whether these were genuine super-luxury cars or data-entry errors.
I had to make some hard decisions at this point. I set a threshold of 200k for 'price', which removes 135 listings.
Similarly, I set the odometer threshold to 1 million miles and the earliest year to 1960.
In addition, I removed rows with a missing manufacturer (22764 entries).
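
The thresholds described above can be sketched as a single boolean mask. This is a minimal sketch on a tiny made-up frame, not the notebook's actual cells: the constant names and the `clean_listings` helper are hypothetical, and keeping rows whose year or odometer is missing is an assumption.

```python
import pandas as pd

# Hypothetical constants mirroring the thresholds described in the text.
MAX_PRICE = 200_000
MAX_ODOMETER = 1_000_000
MIN_YEAR = 1960


def clean_listings(df: pd.DataFrame) -> pd.DataFrame:
    """Return a filtered copy of df (illustrative helper, not from the notebook)."""
    keep = (
        (df["price"] <= MAX_PRICE)
        # Assumption: rows with a missing odometer or year are kept.
        & (df["odometer"].isna() | (df["odometer"] <= MAX_ODOMETER))
        & (df["year"].isna() | (df["year"] >= MIN_YEAR))
        & df["manufacturer"].notna()
    )
    return df[keep].copy()


# Tiny made-up sample, not taken from the Craigslist data.
sample = pd.DataFrame(
    {
        "price": [17899, 3_000_000, 5000, 8000, 9000],
        "odometer": [63500.0, 10.0, 2_500_000.0, None, 120000.0],
        "year": [2012.0, 2015.0, 2001.0, 1955.0, 2008.0],
        "manufacturer": ["volkswagen", "ford", "honda", "chevrolet", None],
    }
)
cleaned = clean_listings(sample)
print(cleaned.shape)
```

Building one mask and filtering once keeps every rule visible in a single place, which makes it easier to adjust a threshold later and rerun the cleaning.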

In [41]:
file.isnull().sum()

id                   0
url                  0
region               0
region_url           0
price                0
year              1527
manufacturer     22764
model             7989
condition       231934
cylinders       199683
fuel              3985
odometer         92324
title_status      3062
transmission      3719
vin             207425
drive           144143
size            342003
type            141531
paint_color     164706
image_url           14
description         16
county          509577
state                0
lat              10292
long             10292
dtype: int64

In [56]:
file.shape

(509577, 25)

In [57]:
file.drop(file[file.price > 200000].index, inplace = True)
file.shape

In [59]:
file.drop(file[file.year < 1960].index, inplace = True)
file.shape

In [61]:
file = file.dropna(axis=0, subset=['manufacturer'])
file.shape

**Data Modelling**

We can now model the data. I decided to use a random forest model because random forests are very versatile: they can capture nonlinear relationships and do not require feature scaling.

In [65]:
### Data Modelling ###
# Setting the prediction target
y = file.price

# Feature Selection
used_car_features = ['year', 'odometer', 'manufacturer']
X = file[used_car_features]

# Convert strings to integer codes (note: apply() label-encodes every column, including the numeric 'year' and 'odometer')
le = preprocessing.LabelEncoder()
X = X.apply(le.fit_transform)

# Split training and testing data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# Define model
used_car_model = RandomForestRegressor(random_state=1)

# Fit model
used_car_model.fit(train_X, train_y)



RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=1, verbose=0,
                      warm_start=False)
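
Because `X.apply(le.fit_transform)` runs the encoder over every column, the numeric 'year' and 'odometer' values are replaced by integer codes as well. A minimal sketch of encoding only the text column, on made-up rows, might look like this. Filling missing odometer readings with the median is an assumption for this sketch, not something done in the cells above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows for illustration only.
X = pd.DataFrame(
    {
        "year": [2012.0, 2005.0, 2018.0],
        "odometer": [63500.0, None, 21000.0],
        "manufacturer": ["volkswagen", "ford", "volkswagen"],
    }
)

X_encoded = X.copy()

# Encode only the string column; classes are sorted alphabetically,
# so "ford" becomes 0 and "volkswagen" becomes 1.
le = LabelEncoder()
X_encoded["manufacturer"] = le.fit_transform(X_encoded["manufacturer"])

# Assumption: fill missing odometer readings with the column median.
X_encoded["odometer"] = X_encoded["odometer"].fillna(X_encoded["odometer"].median())

print(X_encoded)
```

Leaving 'year' and 'odometer' as real numbers preserves the distances between values, which the integer codes from a label encoder would flatten into ranks.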

**Data Evaluation**
The mean absolute error (MAE) is about 4333, which is roughly 35% of the mean price of 12326 in the cleaned data. I hope to reduce the MAE over time by adding more features.

In [66]:
### Data Evaluation ###
val_predictions = used_car_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
print(file['price'].mean())
print(len(file.index)) # Shows number of rows

4333.330103197482
12326.25860313036
485056
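
To put the error in context, the MAE can be compared with the mean price and with a naive baseline. The first part below uses the two numbers printed above. The second part is a sketch on made-up prices, and the median baseline is an assumption added for comparison, not something computed in this notebook.

```python
import numpy as np

# Values copied from the output above.
mae = 4333.330103197482
mean_price = 12326.25860313036
print(f"MAE as a share of the mean price: {mae / mean_price:.1%}")

# Made-up validation prices and predictions, for illustration only.
val_y = np.array([5000.0, 12000.0, 8000.0, 30000.0])
val_predictions = np.array([6500.0, 10000.0, 8500.0, 24000.0])

# MAE is the average absolute difference between prediction and truth.
model_mae = np.mean(np.abs(val_y - val_predictions))

# Naive baseline: always predict the median price.
# (Computed on the same made-up values for brevity; in practice the
# median would come from the training split.)
baseline_mae = np.mean(np.abs(val_y - np.median(val_y)))

print(model_mae, baseline_mae)
```

If a model's MAE is not clearly below such a baseline, the features are adding little, so this comparison is a cheap sanity check before adding more variables.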


In [None]:
### Adding Variables to Improve the Model ###
I tried to add the "model" variable, but it did not work, so let us look into that. To be continued.
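
The error message is not shown, so the cause is unknown. One possibility (an assumption) is that 'model' contains missing values, which `file.isnull().sum()` reported as 7989 before cleaning, and that the label encoder may reject a column mixing strings and NaN. A sketch of filling the gaps with a placeholder before encoding, on made-up values:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up values for illustration; None stands in for a missing model name.
models = pd.Series(["golf r", None, "f-150", "golf r"], name="model")

# Assumption: treat missing model names as their own "unknown" category.
models_filled = models.fillna("unknown")

le = LabelEncoder()
model_codes = le.fit_transform(models_filled)

print(list(le.classes_))
print(model_codes.tolist())
```

The "unknown" placeholder is a hypothetical label chosen for this sketch. Dropping rows with a missing model would be another option, at the cost of losing those listings.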


In [68]:
file.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 485056 entries, 0 to 509576
Data columns (total 25 columns):
id              485056 non-null int64
url             485056 non-null object
region          485056 non-null object
region_url      485056 non-null object
price           485056 non-null int64
year            485053 non-null float64
manufacturer    485056 non-null object
model           477403 non-null object
condition       264384 non-null object
cylinders       295800 non-null object
fuel            481796 non-null object
odometer        401908 non-null float64
title_status    482468 non-null object
transmission    481472 non-null object
vin             291446 non-null object
drive           351897 non-null object
size            160054 non-null object
type            353817 non-null object
paint_color     330758 non-null object
image_url       485056 non-null object
description     485055 non-null object
county          0 non-null float64
state           485056 non-null obj

In [74]:
# Note to self:
# Find the percentage of missing values per column. If it is 50% or more, drop the column.
missing_pct = file.isnull().mean() * 100
more_than_50 = missing_pct[missing_pct >= 50].index
file = file.drop(columns=more_than_50)
file.shape

(485056, 23)