In [1]:
import sys
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Custom KMeans class has been created, and can now be used for clustering
src_path = os.path.abspath(os.path.join('..', 'src'))
sys.path.append(src_path)
from kmeans import CustomKMeans, plot_clusters

# Import the cleaned data for use in the project
cleaned_data_path = os.path.abspath(os.path.join('..', 'data', 'cleaned', 'car_prices_cleaned.csv'))
car_df = pd.read_csv(cleaned_data_path)
car_df.head(5)

Unnamed: 0,year,make,model,trim,body,transmission,vin,state,condition,odometer,color,interior,seller,mmr,sellingprice,saledate
0,2015,Kia,Sorento,LX,SUV,automatic,5xyktca69fg566472,ca,5.0,16639.0,white,black,"kia motors america, inc",20500,21500,2014-12-16
1,2015,Kia,Sorento,LX,SUV,automatic,5xyktca69fg561319,ca,5.0,9393.0,white,beige,"kia motors america, inc",20800,21500,2014-12-16
2,2014,BMW,3 Series,328i SULEV,Sedan,automatic,wba3c1c51ek116351,ca,4.5,1331.0,gray,black,financial services remarketing (lease),31900,30000,2015-01-14
3,2015,Volvo,S60,T5,Sedan,automatic,yv1612tb4f1310987,ca,4.1,14282.0,white,black,volvo na rep/world omni,27500,27750,2015-01-28
4,2014,BMW,6 Series Gran Coupe,650i,Sedan,automatic,wba6b2c57ed129731,ca,4.3,2641.0,gray,black,financial services remarketing (lease),66000,67000,2014-12-18


In [None]:
# Use anomaly detection for condition, odometer, and selling price
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import LocalOutlierFactor

# Might be needed for encoding
oe = OrdinalEncoder()
# The features are selected
features = car_df[['condition', 'odometer', 'sellingprice']]

# Scale the features to get more accurate results
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# Isolation Forest is used to detect anomalies; n_estimators may need to be reduced to shorten execution time
iforest = IsolationForest(n_estimators=1000, contamination=0.001, random_state=42)
# Fit and predict on the scaled features, consistent with the LOF step below
iforest.fit(features_scaled)

labels = iforest.predict(features_scaled)
car_df['anomaly_label'] = labels  # -1 means anomaly, 1 means normal
# Separate anomalies from normal data
anomalies = car_df[car_df['anomaly_label'] == -1]
normal_data = car_df[car_df['anomaly_label'] == 1]

# Create local outlier factor 
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.001)

# Fit the model and predict anomalies
lof_labels = lof.fit_predict(features_scaled)

# returns -1 for outliers and 1 for inliers
car_df['lof_anomaly_label'] = lof_labels

# Separate the normal data from the anomalies
lof_anomalies = car_df[car_df['lof_anomaly_label'] == -1]
lof_normal_data = car_df[car_df['lof_anomaly_label'] == 1]




In [None]:
# Print out number of anomalies, and display a few of them
print("Total anomalies detected:", len(anomalies))
print("Example anomalies:\n", anomalies[['condition', 'odometer', 'sellingprice']].head(10))
# Print amount of normal data
print("Non anomalies:", len(normal_data))

print("\n")

# Print out the number of anomalies found by LOF 
print("Total anomalies detected by LOF:", len(lof_anomalies))
print("Example anomalies detected by LOF:\n", lof_anomalies[['condition', 'odometer', 'sellingprice']].head(10))
# print the normal data points found 
print("Non-anomalies detected by LOF:", len(lof_normal_data))


Total anomalies detected: 473
Example anomalies:
       condition  odometer  sellingprice
258         1.0  999999.0          2500
421         4.9      71.0         64000
2285        1.0  291087.0          3600
2645        5.0   43506.0         66500
4095        1.0  311164.0           700
4241        1.0  288484.0           800
4271        1.0  287704.0           400
4304        5.0    5357.0         73000
4342        4.9     183.0         75000
4362        4.9    4225.0         83500
Non anomalies: 471863


Total anomalies detected by LOF: 473
Example anomalies detected by LOF:
        condition  odometer  sellingprice
16           1.7   13441.0         17000
2737         1.8   88389.0          6800
3588         1.8  133727.0          1350
3590         1.7   87958.0         14700
3664         1.8  119294.0          3500
3689         1.8  205256.0          3300
3893         1.8  197843.0          3300
4181         2.0       1.0           200
6384         1.0  154633.0         19600
102

### Anomalies Found by Isolation Forest
Isolation Forest flagged 473 records. An initial contamination value of 0.1 flagged far too many records; reducing it to 0.001 and scaling the data produced better results, leaving only the more extreme outliers.
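
The effect of the contamination setting can be sketched on synthetic data. This is only an illustration: the array `X` and the helper `flagged_count` are hypothetical stand-ins, not the car data or code from this notebook, and `n_estimators` is lowered to keep the run short.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for three scaled features (assumption: not the car data)
rng = np.random.default_rng(42)
X = rng.normal(size=(5000, 3))

def flagged_count(contamination):
    """Fit an Isolation Forest and return how many rows it labels -1."""
    model = IsolationForest(
        n_estimators=100, contamination=contamination, random_state=42
    )
    labels = model.fit_predict(X)
    return int((labels == -1).sum())

count_loose = flagged_count(0.1)     # roughly 10% of the rows
count_strict = flagged_count(0.001)  # roughly 0.1% of the rows
print("contamination=0.1:", count_loose)
print("contamination=0.001:", count_strict)
```

The contamination value sets the score threshold, so it largely decides how many rows are flagged rather than the data doing so.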

One notable finding is a car with a condition of 1.0 and an odometer reading of 999999.0 miles that sold for $2,500. This is likely a placeholder value or a data-entry error, since it is extremely improbable for a car in terrible condition with this many miles to sell for $2,500.
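
A quick way to look for such placeholder readings is to filter on the suspect value. The sketch below uses three rows copied from the example output above; treating 999999 as a sentinel is an assumption, and `ODOMETER_SENTINEL` is a hypothetical name.

```python
import pandas as pd

# Three rows copied from the Isolation Forest example output
sample = pd.DataFrame(
    {
        "condition": [1.0, 4.9, 1.0],
        "odometer": [999999.0, 71.0, 291087.0],
        "sellingprice": [2500, 64000, 3600],
    },
    index=[258, 421, 2285],
)

# Assumption: 999999 is a placeholder rather than a real reading
ODOMETER_SENTINEL = 999999.0
suspect = sample[sample["odometer"] == ODOMETER_SENTINEL]
print(suspect)
```

Running the same filter on `car_df` would show whether this value is a one-off or a recurring placeholder.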

The other vehicles are flagged because they combine low mileage with high prices, or high mileage with low prices. Such vehicles are uncommon in the market and therefore show up as anomalies, so these results are expected. Some vehicles with higher mileage that still sell for high prices are also flagged; these may be luxury vehicles, which would explain the higher prices.
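
The two groups described above can be separated with simple boolean masks. The rows below are copied from the example output; the cutoff values are hypothetical and chosen only for illustration, not derived from the data.

```python
import pandas as pd

# Rows copied from the Isolation Forest example output
flagged = pd.DataFrame(
    {
        "odometer": [71.0, 311164.0, 287704.0, 5357.0],
        "sellingprice": [64000, 700, 400, 73000],
    },
    index=[421, 4095, 4271, 4304],
)

# Hypothetical cutoffs (assumption: picked by eye for this sketch)
LOW_MILES, HIGH_MILES = 10_000, 250_000
LOW_PRICE, HIGH_PRICE = 1_000, 60_000

low_miles_high_price = flagged[
    (flagged["odometer"] < LOW_MILES) & (flagged["sellingprice"] > HIGH_PRICE)
]
high_miles_low_price = flagged[
    (flagged["odometer"] > HIGH_MILES) & (flagged["sellingprice"] < LOW_PRICE)
]
print(low_miles_high_price.index.tolist())
print(high_miles_low_price.index.tolist())
```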

Overall, some outliers stem from incorrect data, while others are simply extreme but plausible cases in the car market.

### Anomalies Found by Local Outlier Factor
Only 473 anomalies were detected against 471,863 non-anomalies, a rate of about 0.1%, which matches the contamination setting of 0.001. The flagged condition values tend to be very low, around 1 to 2, which could indicate unusually poor condition. It is somewhat surprising that these are considered anomalies, since used cars might be expected to skew toward poorer condition anyway.
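
The rate quoted above can be checked with a few lines of arithmetic, using the counts stated in this section.

```python
# Counts taken from the text above
n_anomalies = 473
n_normal = 471_863

total = n_anomalies + n_normal
rate = n_anomalies / total
print(f"{total} records, anomaly rate = {rate:.4%}")
```

The rate sits almost exactly at the contamination value of 0.001, which suggests the count reflects the chosen threshold more than a property of the data.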

The flagged mileages include both extremely high and extremely low values, which indicates that the algorithm picks out mileage extremes in either direction.

The flagged selling prices span a wide range, from very low to moderate. The algorithm may flag them because they do not align with typical prices for cars of that mileage or condition.

LOF scores each record by its local density relative to its neighbors, measured across all three features together. The flagged records therefore tend to indicate unusual combinations of condition, odometer, and selling price.
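
The local-density idea can be illustrated on synthetic data. In this sketch, which does not use the car data, a single point sits well inside the overall range of values but far from its dense neighboring cluster, so LOF ranks it as the strongest outlier. The cluster shapes and the `stray` point are assumptions made for the illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
tight = rng.normal(loc=0.0, scale=0.1, size=(200, 2))   # dense cluster
loose = rng.normal(loc=10.0, scale=2.0, size=(200, 2))  # sparse cluster
stray = np.array([[1.0, 1.0]])  # unremarkable globally, isolated locally
X = np.vstack([tight, loose, stray])

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(X)  # -1 for outliers, 1 for inliers

# More negative negative_outlier_factor_ means more anomalous
most_outlying = int(np.argmin(lof.negative_outlier_factor_))
stray_index = len(X) - 1
print("most outlying row:", most_outlying)
print("stray label:", labels[stray_index])
```

On `car_df`, sorting by `lof.negative_outlier_factor_` in the same way would rank the flagged records by how strongly they deviate from their neighbors.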