# Car Data Exploration
## Author: Aaron Walber
### Email: awalber94@gmail.com
#### Start by Importing all relevant python packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Examine the data
To view the data, it is first loaded into pandas. Initial examination shows that the data is semicolon-delimited,
with an additional row after the column names indicating the data types.

Please note that this notebook assumes the data is in the same folder as the notebook.

In [None]:
df = pd.read_csv("cars.csv",delimiter=";",skiprows=[1])
# drop rows which contain a NaN value; dropna returns a new frame, so assign it back
df = df.dropna(axis=0).reset_index(drop=True)
df.head(10)
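
The loading step can be tried without the real file. The sketch below is only an illustration: the three sample rows and the type names in the second line are invented to mimic the layout described above, and `demo` and `cleaned` are hypothetical names. It shows that `skiprows=[1]` removes the type row while keeping the header, and that `dropna` returns a new frame rather than modifying the original.

```python
import io

import pandas as pd

# Invented sample mimicking the described layout:
# a header row, a row of type names, then the data rows.
sample = io.StringIO(
    "Car;MPG;Cylinders\n"
    "STRING;DOUBLE;INT\n"
    "Example Car A;18.0;8\n"
    "Example Car B;;4\n"
    "Example Car C;27.0;4\n"
)

# skiprows=[1] skips only the second line of the file (the type row).
demo = pd.read_csv(sample, delimiter=";", skiprows=[1])

# dropna returns a new frame; demo itself still contains the row with the NaN.
cleaned = demo.dropna(axis=0)

print(len(demo), len(cleaned))
```

Because `dropna` does not modify the frame in place by default, its result has to be assigned to a name for the rows to actually be dropped.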

## Pick out some important features of the data

Finding the car with the highest MPG only requires selecting the MPG column and searching for its maximum. Finding a maximum in an unsorted list always takes O(n) operations. In this case, we want the position of the maximum so that the full row can be pulled out of the dataframe. Positional indexing with `iloc` is generally cheaper than filtering the dataframe by a lookup value.

In [None]:
# create a numpy array out of the MPG column
mpg = df["MPG"].values
max_mpg = np.argmax(mpg)
max_mpg # the maximum mpg index
# finally, utilize this index to grab the entire row for this vehicle
df.iloc[max_mpg]
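
pandas can also return the index label of the maximum directly through `idxmax`. The sketch below uses a made-up frame (the car names, values, and index labels are hypothetical) to show that the positional route used above and the label route select the same row, and that the two only differ in whether `iloc` or `loc` is the right accessor.

```python
import numpy as np
import pandas as pd

# Made-up frame with a non-default index so position and label differ.
toy = pd.DataFrame(
    {"Car": ["Car A", "Car B", "Car C"], "MPG": [18.0, 46.6, 27.0]},
    index=[10, 20, 30],
)

pos = int(np.argmax(toy["MPG"].values))  # position of the maximum -> use iloc
label = toy["MPG"].idxmax()              # index label of the maximum -> use loc

row_by_position = toy.iloc[pos]
row_by_label = toy.loc[label]
print(row_by_position["Car"], row_by_label["Car"])
```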

From running the snippet above, we can see that the car with the maximum MPG is a Mazda GLC, which is made in Japan. We can also examine some other aspects of this vehicle which might prove useful in predicting the MPG. For example, the weight of the vehicle is most likely an important factor, as are the displacement, number of cylinders, acceleration, and horsepower.


## Find the average MPG per cylinder count:

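Here this is interpreted as the MPG of each car divided by its number of cylinders, averaged over all cars. It can be computed with a vectorized division between two arrays.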

In [None]:
# create numpy arrays from these two columns
cylinders = df["Cylinders"].values
mpgs = df["MPG"].values
# perform a vectorized divide on each item in mpg with each item from cylinders
mpg_per_cylinder = mpgs/cylinders
# the average MPG/cylinder for the entire CSV will be a mean of the variable mpg_per_cylinder
print("The average MPG per cylinder is {}".format(np.mean(mpg_per_cylinder)))
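
The heading can also be read as the mean MPG within each group of cars that share a cylinder count. The sketch below uses invented values (not the real data) and the hypothetical names `toy`, `mean_ratio`, and `mpg_by_cyl` to put both readings side by side.

```python
import pandas as pd

# Invented values for illustration only.
toy = pd.DataFrame(
    {"Cylinders": [4, 4, 8, 8, 6], "MPG": [30.0, 34.0, 14.0, 16.0, 20.0]}
)

# Reading 1: mean of the per-car ratio MPG / Cylinders (a single number).
mean_ratio = (toy["MPG"] / toy["Cylinders"]).mean()

# Reading 2: mean MPG for each distinct cylinder count (one number per group).
mpg_by_cyl = toy.groupby("Cylinders")["MPG"].mean()

print(mean_ratio)
print(mpg_by_cyl)
```

The two results answer different questions, so it is worth stating which one is intended when reporting the number.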

## Find the average MPG for each Manufacturer

This could be important information when trying to decide which car best suits your needs. Obviously if one manufacturer has a consistently higher MPG than their competition, this could weigh in to a decision made by a consumer.

Our earlier examination of the data showed that the manufacturer isn't a direct column. This means we will need to extract it from the "Car" column, and it is helpful to save it as a separate "Make" column in case it is needed later.

The per-manufacturer averages will then be stored in a dictionary, where each key is a car manufacturer and each value is the average MPG of that manufacturer's cars in this CSV.

In [None]:
# the make is the first space-separated token of the "Car" column
df["Make"] = df["Car"].str.split(" ").str[0]
df.head(5)

We can see that the make has been added to the dataframe. Now all that's left is to group the dataframe by each unique "Make" value and calculate the average MPG. This information can be stored in a dictionary.

In [None]:
avg_mpg = {}
for make,_df in df.groupby("Make"):
    avg_mpg[make] = np.mean(_df["MPG"].values)
avg_mpg
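
The same aggregation can be written without an explicit loop, and adding a count per make shows at a glance which averages rest on a single car. This is a sketch on invented values; the make names and the `summary` variable are hypothetical.

```python
import pandas as pd

# Invented values for illustration only.
toy = pd.DataFrame(
    {"Make": ["A", "A", "B", "C"], "MPG": [20.0, 30.0, 36.0, 0.0]}
)

# Mean MPG and number of cars per make, highest average first.
summary = (
    toy.groupby("Make")["MPG"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

Calling `summary["mean"].to_dict()` would give a dictionary in the same shape as `avg_mpg`.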

Examining the dictionary shows that Nissan has the highest average MPG at 36 and Citroen has the lowest at 0 MPG. The Citroen value is probably a mistake in the data, because examining the only car for this make yields the following results:

In [None]:
df[df["Make"]=="Citroen"]

## Build a Model to Predict MPG

As stated previously, this will probably be best performed by looking at the following factors:
* Cylinders
* Displacement
* Horsepower
* Weight
* Acceleration

MPG is the dependent variable when examining these other variables. Values like Make seem to be indicative, but without a larger sample size for certain makes, a model using them would likely overfit and generalize poorly to new data.

A decent approach to deciding whether these variables should be used is to plot each one against the MPG value and see whether they are correlated. A strong correlation makes a variable a good candidate predictor. Plotting also helps identify the type of relationship, because it could be linear, nonlinear, or absent altogether.

In [None]:
mpg = df["MPG"].values
independent_vars = {name:df[name].values for name in ["Cylinders","Displacement","Horsepower","Weight","Acceleration"]}

## Plot the relationships between MPG and the other identified variables

Here, MPG will be plotted on the y-axis and the other variable will be plotted on the x-axis.

In [None]:

for name,arr in independent_vars.items():
    plt.scatter(arr,mpg)
    plt.xlabel(name)
    plt.ylabel("MPG")
    plt.show()
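
A numeric summary can complement the scatter plots. The sketch below is an illustration on made-up data, not the car data: it compares the Pearson coefficient, which measures linear association, with the Spearman coefficient, computed here as the Pearson coefficient of the ranks, which measures monotonic association. A large gap between the two is one hint that a relationship is monotonic but nonlinear.

```python
import numpy as np
import pandas as pd

# Made-up data: y falls monotonically but nonlinearly as x rises.
x = np.linspace(1.0, 10.0, 50)
toy = pd.DataFrame({"x": x, "y": 100.0 / x})

# Pearson: strength of the linear relationship.
pearson = toy["x"].corr(toy["y"])

# Spearman: Pearson on the ranks, i.e. strength of the monotonic relationship.
spearman = toy["x"].rank().corr(toy["y"].rank())

print("Pearson: {:.3f}, Spearman: {:.3f}".format(pearson, spearman))
```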

## Insights from Plotting the Data

Plotting the data immediately reveals a lot of important information. First, there are some erroneous values for MPG that should be removed. The MPG of a vehicle cannot physically be 0, and training on this value would only mislead the model. The plots also suggest that Cylinders and Acceleration would not be very good predictors of MPG, since they appear only weakly related to it under either a linear or a nonlinear model. On the other hand, Displacement, Horsepower, and Weight all seem to be somewhat correlated with MPG, with some obvious outliers. The relationship looks clearly nonlinear for Horsepower, while Displacement and Weight could arguably have either a linear or a nonlinear relationship with MPG. To find out, let's fit some simple regressions to this data.

In [None]:
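# keep only the rows with a physically plausible (non-zero) MPG
non_zero_inds = np.where(mpg>0)
mpg = mpg[non_zero_inds]
# drop Cylinders and Acceleration, and turn each remaining variable into a column vector
independent_vars = {name:np.atleast_2d(value[non_zero_inds]).T for name,value in independent_vars.items() if name not in ["Cylinders","Acceleration"]}
# make mpg a column vector as well so the shapes match
mpg = np.atleast_2d(mpg).T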

In [None]:

from sklearn.linear_model import LinearRegression
for name,value in independent_vars.items():
    linreg = LinearRegression()
    reg = linreg.fit(value,mpg)
    print("Linear regression r^2 value: {}, variable: {}".format(linreg.score(value,mpg),name))

## Test out nonlinear regression scores

The scores achieved by the linear fits were lower than anticipated, so we'll test all variables against a simple nonlinear model. A favorite nonlinear regressor of mine is Support Vector Regression (SVR).

In [None]:
# change mpg back to a 1d array
mpg = mpg.reshape(mpg.shape[0])
mpg.shape

In [None]:
from sklearn.svm import SVR
reg_params = [0.1,1.0,2.0,10.0,20]
for c in reg_params:
    for name,value in independent_vars.items():
        reg_model = SVR(C=c)
        reg = reg_model.fit(value,mpg)
        print("SVR regression C value: {} r^2 value: {}, variable: {}".format(c,reg_model.score(value,mpg),name))


The code above tests each variable using a radial basis function (RBF) kernel while varying the regularization parameter C. Note that in scikit-learn the strength of the regularization is inversely proportional to C, so a larger C means weaker regularization. Each of these fits performs better than the linear regression model, so it makes sense to continue with the SVR model. Setting C to 10 seems to perform better than lower values, so we'll keep this. However, there are some other parameters that can be tuned, like epsilon and tolerance.

In [None]:
# use the chosen C=10 explicitly; the loop variable c from the previous cell still holds its last value (20)
# a higher value for epsilon could help to allow for the natural noise that appears in the data
for eps in [1, 1e-1, 1e-2]:
    for name,value in independent_vars.items():
        reg_model = SVR(C=10,epsilon=eps)
        reg = reg_model.fit(value,mpg)
        print("SVR regression epsilon value: {} r^2 value: {}, variable: {}".format(eps,reg_model.score(value,mpg),name))
print()
for tol in [1e-3, 1e-4, 1e-5]:
    for name,value in independent_vars.items():
        reg_model = SVR(C=10,tol=tol,epsilon=1)
        reg = reg_model.fit(value,mpg)
        print("SVR regression tolerance value: {} r^2 value: {}, variable: {}".format(tol,reg_model.score(value,mpg),name))


Tuning epsilon yields slightly better results, most likely because errors smaller than epsilon are not penalized, so the model does not chase the small amount of noise in the data. Adjusting the tolerance doesn't produce a noticeably better result, so we'll keep the defaults other than C and epsilon, which should now be set to 10 and 1 respectively.
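
The sweeps above vary one parameter at a time and score each fit on the data it was trained on. One alternative is a cross-validated grid search over C and epsilon together. The sketch below is only an illustration on synthetic data: `X_toy`, `y_toy`, and the grid values are assumptions made for the example, not the notebook's variables or results.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic one-feature data with a nonlinear trend plus noise (not the car data).
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 10.0, size=120)
X_toy = x.reshape(-1, 1)
y_toy = 40.0 / x + rng.normal(scale=0.5, size=120)

# Every combination of C and epsilon is scored by 5-fold cross-validation.
param_grid = {"C": [0.1, 1.0, 10.0], "epsilon": [0.01, 0.1, 1.0]}
search = GridSearchCV(SVR(), param_grid, cv=5, scoring="r2")
search.fit(X_toy, y_toy)

print("Best parameters:", search.best_params_)
print("Mean cross-validated r^2:", search.best_score_)
```

Because each score comes from folds the model did not see during fitting, the selected parameters are less likely to simply reflect a closer fit to the training data.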

## Train the Model with the Adjusted Hyper-Parameters

In [None]:
# use the tuned values for both C and epsilon
model = SVR(C=10,epsilon=1)
X = np.concatenate([arr for arr in independent_vars.values()],axis=1)
y = mpg
model.fit(X,y)
pred = model.predict(X)
print("The r^2 value of this model is {}".format(model.score(X,y)))
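
The combined model stacks variables that live on very different scales, and an RBF kernel is based on distances between samples, so the variable with the largest numeric range can dominate. A common precaution is to standardize the features before fitting. The sketch below is a hedged illustration on synthetic data: the two features, their ranges, and the target formula are invented, and `scaled_svr` is a hypothetical name.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-ins for two features on very different scales (not the real data).
rng = np.random.default_rng(0)
big_feature = rng.uniform(1500.0, 5000.0, size=200)
small_feature = rng.uniform(8.0, 25.0, size=200)
X_toy = np.column_stack([big_feature, small_feature])
y_toy = 60000.0 / big_feature + 0.5 * small_feature + rng.normal(scale=1.0, size=200)

# Standardize each feature to zero mean and unit variance before the kernel sees it.
scaled_svr = make_pipeline(StandardScaler(), SVR(C=10))
scaled_svr.fit(X_toy, y_toy)
scaled_r2 = scaled_svr.score(X_toy, y_toy)

print("Training r^2 with scaling: {:.3f}".format(scaled_r2))
```

Wrapping the scaler and the regressor in one pipeline means the same scaling is applied automatically whenever the model predicts on new rows.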

## Visualize the Predictions

In [None]:
resids = abs(pred - mpg)
plt.scatter(range(len(resids)),resids)
plt.xlabel("Index of Car in Data")
plt.ylabel("Absolute Residual Value")
plt.title("Residuals for each Car's MPG and the Corresponding Prediction")
print("The average of the absolute residuals for the predictions is {}".format(np.mean(resids)))

## Other Error Metrics:

In [None]:
from sklearn.metrics import max_error, mean_squared_error

print("The maximum Error in the Predictions is {}".format(max_error(mpg,pred)))
print("The MSE of the Predictions is {}".format(mean_squared_error(mpg,pred)))
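
The mean absolute error (MAE) and the mean squared error (MSE) are different metrics with different units: MAE is in MPG, MSE is in MPG squared, and the square root of the MSE (RMSE) is back in MPG. The sketch below uses three made-up values, not the notebook's predictions, to show how each one is computed with scikit-learn.

```python
import numpy as np
from sklearn.metrics import max_error, mean_absolute_error, mean_squared_error

# Made-up ground truth and predictions for illustration only.
y_true = np.array([20.0, 30.0, 40.0])
y_hat = np.array([22.0, 27.0, 40.0])

mae = mean_absolute_error(y_true, y_hat)  # mean of |error|
mse = mean_squared_error(y_true, y_hat)   # mean of error squared
rmse = np.sqrt(mse)                       # same units as the target
worst = max_error(y_true, y_hat)          # largest single |error|

print(mae, mse, rmse, worst)
```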

## Conclusion

Inspecting the data revealed a slightly noisy but very manageable dataset. The obvious problem in predicting the MPG of each car was that some of the MPG values were 0, which indicates a bad data point since this value should never be 0. The model was fit and scored without those data points to avoid unnecessary error in predictions. No prediction was made for these data points because there is no valid MPG value to check a prediction against.

The error in the model is acceptable, except there is no testing data to confirm these predictions. The Support Vector Regression model was fit and scored on the same data, without a cross-validation or held-out set, and achieved a low average absolute residual between the ground truth MPG and the predicted value. When comparing this error against the variability of the data (excluding the trend) the model was trained on, the model appears to have reached the threshold where any further error could be considered irreducible. Essentially, when regressing Weight, Displacement, or Horsepower individually against MPG, even a perfect fit would still have an absolute error between 5 and 20 by visual inspection. Thus, a regression model that combines all 3 independent variables improved performance by a significant amount, although the absence of overfitting can only be confirmed on held-out data.
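
One way to obtain the missing test data is to hold out part of the rows before fitting. The sketch below is an illustration on synthetic data, with hypothetical names such as `model_toy` and an assumed 75/25 split; it is not a result for the car data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic one-feature data with a nonlinear trend plus noise (not the car data).
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 10.0, size=200)
X_toy = x.reshape(-1, 1)
y_toy = 40.0 / x + rng.normal(scale=0.5, size=200)

# Hold out 25% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

model_toy = SVR(C=10, epsilon=1).fit(X_train, y_train)
train_r2 = model_toy.score(X_train, y_train)
test_r2 = model_toy.score(X_test, y_test)

print("Train r^2: {:.3f}, test r^2: {:.3f}".format(train_r2, test_r2))
```

A test score far below the training score would point to overfitting, while similar scores would support the claim that the model generalizes.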