# Predicting Car Fuel Efficiency Using Machine Learning

## Intro

What if you could reliably estimate a car’s mpg from a few known specifications of the vehicle? You could then bring a vehicle to market that is both more desirable and more efficient than a competitor’s, while reducing wasted R&D costs and gaining market share.

The dataset used in this notebook contains the following columns: **mpg**, **cylinders**, **horsepower**, **weight**, **acceleration**, etc., which should all be self-explanatory.

**Displacement** is the volume of the car’s engine, usually expressed in liters or cubic centimeters.

**Origin** is a discrete value from 1 to 3. The dataset does not describe it beyond that, but for this notebook we assume that 1 is an American-origin vehicle, 2 is European-origin, and 3 is Asian-origin or elsewhere.

**Model year** is given as a number representing the last two digits of the 4-digit year (e.g., model year 70 is 1970).


## Important Note:

According to others using this dataset, some of the listed mpg values are incorrect. Some of our predictions may therefore differ from the listed value by a large amount, and a large error does not always mean that the model is wrong.

There are also unknown values in the dataset, marked with a ‘?’. In the cells below they show up in the horsepower column, and we will need to replace them before modeling.



## Data Preprocessing

The purpose of the data preprocessing stage is to minimize potential errors in the model as much as possible. Generally, a model is only as good as the data passed into it, and the preprocessing we do gives the model as accurate a dataset as possible. While we cannot perfectly clean the dataset, we can at least follow some basic steps to give it the best possible chance of producing a good model.



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
cars = pd.read_csv('auto-mpg.csv')
cars.head()

In [None]:
print("presence of null values: ", cars.isnull().values.any())

In [None]:
for col in cars.columns:
    try:
        # .str only works on string (object) columns; numeric columns raise AttributeError
        print(cars[~cars[col].str.contains('[0-9]')])
        print('This DOES contain strings:', col)
    except AttributeError:
        print('This DOES NOT contain strings:', col)

In [None]:
# Coerce to numeric: '?' becomes NaN and the remaining values become numbers rather than strings
cars['horsepower_updated'] = pd.to_numeric(cars['horsepower'], errors='coerce')
cars[cars['horsepower']=='?']

In [None]:
cars.columns

In [None]:
cars['horsepower_updated'] = cars['horsepower_updated'].fillna(cars['horsepower_updated'].mode()[0])
cars[cars['horsepower']=='?']

In [None]:
cars.pop('horsepower')

The missing values in this dataset are marked with a ‘?’ rather than stored as nulls, which is why the null check above does not flag them, so we have to handle them ourselves.

There were 6 ‘?’ rows in the horsepower column. The cells above replace each of them with the most commonly occurring horsepower value and then drop the original column.
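As an alternative sketch, pandas can treat ‘?’ as missing at load time through the `na_values` parameter of `read_csv`, which keeps the horsepower column numeric from the start. The small inline CSV below is made-up illustration data, not rows from `auto-mpg.csv`, and the variable names are hypothetical.

```python
import io
import pandas as pd

# Made-up sample in the same shape as the dataset (illustration only).
sample_csv = io.StringIO(
    "mpg,cylinders,horsepower\n"
    "18.0,8,130\n"
    "25.0,4,?\n"
    "22.0,6,100\n"
    "30.0,4,100\n"
)

# na_values='?' tells pandas to parse '?' as NaN, so the column stays numeric.
sample = pd.read_csv(sample_csv, na_values='?')
n_missing = int(sample['horsepower'].isnull().sum())

# Fill the gaps with the most common horsepower value, as in the cells above.
sample['horsepower'] = sample['horsepower'].fillna(sample['horsepower'].mode()[0])
print(n_missing, sample['horsepower'].tolist())
```

With this approach, `isnull()` reports the missing rows directly, so no separate string check is needed.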

### Feature Engineering

The next step of preprocessing is to turn the ‘car name’ column into something the model can use. We can create a new binary feature based on whether or not ‘diesel’ appears in the name of the vehicle.

In [None]:
" ".join(cars['car name'].to_list())

In [None]:
cars['type'] = np.where(cars['car name'].str.contains('diesel'), 1, 0)
cars['type'].value_counts().plot(kind='bar')

In [None]:
cars.pop('car name')

In [None]:
print("Presence of any null values:", cars.isnull().values.any())

## EDA

The purpose of EDA is to enhance our understanding of trends in the dataset without involving complicated machine learning models. Oftentimes, we can see obvious traits using graphs and charts just from plotting columns of the dataset against each other.

We’ve completed the necessary preprocessing steps, so let’s create a correlation map to see the relations between different features.

In [None]:
corr = cars.corr()
sns.heatmap(corr)
plt.show()

### Notes:

- There are strong correlations between several of the columns. Cylinders, displacement, horsepower, and weight are all negatively correlated with mpg, which makes sense: as any of these features rises, the car burns more fuel.
- The correlations of model year and origin with mpg also make sense, since cars from outside the US may follow more fuel-efficient standards due to different fuel prices in those areas.
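A heatmap can be hard to read precisely, so one option is to rank the features by their correlation with mpg. The sketch below uses a made-up toy frame (the values are not taken from the dataset) only to show the pattern; on the real data the same expression would be applied to `cars`.

```python
import pandas as pd

# Made-up values for illustration; not rows from the dataset.
toy = pd.DataFrame({
    'mpg':        [18.0, 15.0, 24.0, 30.0, 36.0],
    'weight':     [3500, 4000, 2600, 2200, 1900],
    'cylinders':  [8, 8, 4, 4, 4],
    'model year': [70, 71, 75, 79, 82],
})

# Correlation of every feature with mpg, sorted from most negative to most positive.
mpg_corr = toy.corr()['mpg'].drop('mpg').sort_values()
print(mpg_corr)
```

Sorting the single `mpg` column of the correlation matrix puts the strongest negative relationships at the top and the strongest positive ones at the bottom.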

Next, we can plot the number of cars based on their origin (US = 1, Europe = 2, Asia = 3). This is important to us because we’re assuming that different regions have different fuel efficiency priorities, so our model will be skewed towards the region with the most cars in the dataset.

In [None]:
cars['origin'].value_counts().plot(kind='bar')

With 1 corresponding to American cars, we can see that the US accounts for the majority of the cars in the dataset. **This may be a problem for our model. If our accuracy is too low, we can rebalance the share of each region in the dataset to get predictions that aren’t skewed towards American car mpg**.
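One simple way to rebalance the regions, should it be needed, is to downsample every origin group to the size of the smallest one. This is only a sketch of that idea on made-up rows; it is not a step the notebook performs, and downsampling discards data, so it is a trade-off rather than a free fix.

```python
import pandas as pd

# Made-up rows for illustration: origin 1 is over-represented, as in the bar chart.
toy = pd.DataFrame({
    'origin': [1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3],
    'mpg':    [15, 18, 16, 20, 14, 17, 26, 28, 31, 33, 30],
})

# Size of the smallest origin group.
smallest = int(toy['origin'].value_counts().min())

# Draw the same number of rows from every origin group (downsampling).
balanced = toy.groupby('origin').sample(n=smallest, random_state=0)
print(balanced['origin'].value_counts().sort_index())
```

`random_state` is fixed here only so that the sample is reproducible.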

We can also view the distributions of different cylinder counts among our dataset.

In [None]:
cars['cylinders'].value_counts().plot(kind='bar')

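Notice how many cars have 4 cylinders compared with 6 or 8, and how rare 3 and 5 are. Three- and five-cylinder engines were uncommon in production cars. Volkswagen’s VR5 was only introduced in the late 1990s, so it is unlikely to explain the 5-cylinder entries here; they more plausibly come from inline-five models, and the 3-cylinder entries appear to be rotary-engined cars. Both can be checked against the ‘car name’ column before it is dropped.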

Since the dataset uses pre-2000s cars, it makes sense that 4-cylinder cars are extremely popular. As time went on, the popularity of the SUV led to more cars having 6–8 cylinders in their engines.

A boxplot will help us better visualize what is happening with the data. Using seaborn’s built-in boxplot function, we plot car origin against the mpg of the individual cars:

In [None]:
sns.boxplot(x='origin', y='mpg', data=cars)

Before we discuss the box plot, it seems that outliers are affecting our averages, especially for European cars. We can use Python and pandas to see what outliers are present.

In [None]:
american_cars = cars[cars['origin'] == 1]
japanese_cars = cars[cars['origin'] == 3]
european_cars = cars[cars['origin'] == 2]

quantile_usa = american_cars['mpg'].quantile(0.90)
quantile_jp = japanese_cars['mpg'].quantile(0.90)
quantile_eu = european_cars['mpg'].quantile(0.90)

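# Keep only the rows below each region's 90th percentile (results must be assigned to take effect)
american_cars = american_cars[american_cars['mpg'] < quantile_usa]
european_cars = european_cars[european_cars['mpg'] < quantile_eu]
japanese_cars = japanese_cars[japanese_cars['mpg'] < quantile_jp]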

frames = [american_cars, european_cars, japanese_cars] 
df = pd.concat(frames)

sns.boxplot(x = 'origin', y = 'mpg', data = df)

**Note:** USA = 1, Europe = 2, and Asia = 3
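To list the outliers themselves rather than trimming at a percentile, one option is the 1.5 × IQR rule, which is the default whisker rule in seaborn’s boxplot. This is a different technique from the 90th-percentile cutoff used above; the helper name `iqr_outliers` and the toy rows below are hypothetical and only illustrate the idea.

```python
import pandas as pd

def iqr_outliers(df, group_col='origin', value_col='mpg', k=1.5):
    """Return rows whose value lies outside [Q1 - k*IQR, Q3 + k*IQR] within their group."""
    grouped = df.groupby(group_col)[value_col]
    # Broadcast each group's quartiles back onto its rows.
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    mask = (df[value_col] < q1 - k * iqr) | (df[value_col] > q3 + k * iqr)
    return df[mask]

# Made-up rows for illustration; not taken from the dataset.
toy = pd.DataFrame({
    'origin': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'mpg':    [15.0, 16.0, 17.0, 18.0, 40.0, 25.0, 26.0, 27.0, 28.0, 29.0],
})
outliers = iqr_outliers(toy)
print(outliers)
```

On the real data, the same helper could be called as `iqr_outliers(cars)` to see which rows the boxplot draws as individual points.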

In [None]:
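# The original 'horsepower' column was dropped earlier, so check the cleaned column for leftover gaps
cars[cars['horsepower_updated'].isnull()]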