# Predicting Fuel Efficiency of Vehicles

In this series, we go from data collection to deploying a machine learning model:

1. **Data Collection** - We use the classic Auto MPG dataset from the UCI ML Repository.
2. **Define Problem Statement** - We'll frame the problem based on the dataset description and initial exploration.
3. **EDA** - Carry out exploratory analysis to identify the important features and create new feature combinations.
4. **Data Preparation** - Using the findings from step 3, create a pipeline of tasks that transforms the data for loading into our ML models.
5. **Selecting and Training ML Models** - Train a few models and evaluate their predictions using cross-validation.
6. **Hyperparameter Tuning** - Fine-tune the hyperparameters of the models that showed promising results.
7. **Deploy the Model Using a Web Service** - Use the Flask web framework to deploy our trained model on Heroku.

# Imports

In [11]:
# importing a few general use case libraries
import pandas as pd
import numpy  as np

import matplotlib.pyplot as plt
import seaborn           as sns

# Step 1: Collecting data from the UCI ML Repository

In [12]:
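# column names for the raw file, which has no header row
cols = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
        'Acceleration', 'Model Year', 'Origin']

# '?' marks missing values; comment='\t' drops the tab-separated car name at the end of each row
df_raw = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data",
                     names=cols, na_values='?', comment='\t',
                     sep=' ', skipinitialspace=True)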

df_raw.head()

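| | MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model Year | Origin |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 |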


In [13]:
# work on a copy so the raw data stays untouched
df1 = df_raw.copy()

# Step 2: Problem Statement

The dataset contains the MPG (miles per gallon) variable, a continuous value that measures the fuel efficiency of vehicles from the 1970s and 1980s.

Our aim is to predict a vehicle's MPG from its other attributes, which makes this a regression problem.
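To make the framing concrete, here is a minimal sketch that separates the target (`MPG`) from the remaining attributes. The rows are the five shown in the `df_raw.head()` output above; the helper name `split_features_target` is hypothetical and only used for illustration.

```python
import pandas as pd

# First five rows of the dataset, copied from the df_raw.head() output
sample = pd.DataFrame({
    "MPG": [18.0, 15.0, 18.0, 16.0, 17.0],
    "Cylinders": [8, 8, 8, 8, 8],
    "Displacement": [307.0, 350.0, 318.0, 304.0, 302.0],
    "Horsepower": [130.0, 165.0, 150.0, 150.0, 140.0],
    "Weight": [3504.0, 3693.0, 3436.0, 3433.0, 3449.0],
    "Acceleration": [12.0, 11.5, 11.0, 12.0, 10.5],
    "Model Year": [70, 70, 70, 70, 70],
    "Origin": [1, 1, 1, 1, 1],
})

def split_features_target(df, target="MPG"):
    """Hypothetical helper: return (features, target) for a regression task."""
    X = df.drop(columns=[target])  # every attribute except the target
    y = df[target]                 # the continuous value we want to predict
    return X, y

X, y = split_features_target(sample)
print(X.columns.tolist())
print(y.tolist())
```

With the full dataset, the same call on `df1` would yield seven feature columns and one continuous target.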

# Step 3: Exploratory Data Analysis

1. Check the data types of the columns
2. Check for null values
3. Check for outliers
4. Look at the category distribution in categorical columns
5. Plot the correlations
6. Look for new variables
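Items 4 to 6 of the checklist can be sketched roughly as follows. The rows in `toy` are made-up values chosen only so the snippet runs on its own, and the combined feature `Displacement per Cylinder` is a hypothetical example of a new variable, not necessarily one used later in the series.

```python
import pandas as pd

# Toy rows with made-up values (NOT taken from the real dataset)
toy = pd.DataFrame({
    "MPG": [18.0, 15.0, 30.0, 26.0, 34.0, 22.0],
    "Cylinders": [8, 8, 4, 4, 4, 6],
    "Displacement": [307.0, 350.0, 98.0, 120.0, 90.0, 200.0],
    "Weight": [3504.0, 3693.0, 2100.0, 2400.0, 1900.0, 3000.0],
    "Origin": [1, 1, 3, 2, 3, 1],
})

# 4. Category distribution: share of rows per Origin code
origin_share = toy["Origin"].value_counts(normalize=True)

# 6. A candidate new variable (hypothetical combination of two columns)
toy["Displacement per Cylinder"] = toy["Displacement"] / toy["Cylinders"]

# 5. Correlation of every column with the target, sorted from most negative
corr_with_mpg = toy.corr()["MPG"].drop("MPG").sort_values()

print(origin_share)
print(corr_with_mpg)
```

On the real data, replacing `toy` with `df1` gives the same three views; a heatmap of `df1.corr()` via seaborn is one way to turn step 5 into a plot.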

In [15]:
# checking the data info
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   MPG           398 non-null    float64
 1   Cylinders     398 non-null    int64  
 2   Displacement  398 non-null    float64
 3   Horsepower    392 non-null    float64
 4   Weight        398 non-null    float64
 5   Acceleration  398 non-null    float64
 6   Model Year    398 non-null    int64  
 7   Origin        398 non-null    int64  
dtypes: float64(5), int64(3)
memory usage: 25.0 KB


In [17]:
# checking for all the null values
df1.isnull().sum()

MPG             0
Cylinders       0
Displacement    0
Horsepower      6
Weight          0
Acceleration    0
Model Year      0
Origin          0
dtype: int64
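The six missing `Horsepower` values have to be handled before modeling. One common option, shown here as a sketch rather than as the approach this series necessarily takes, is to fill them with the column median. The values in `hp` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy Horsepower column with one missing entry (illustrative values only)
hp = pd.Series([130.0, 165.0, np.nan, 150.0, 140.0], name="Horsepower")

# median() skips NaN by default, so it is computed from the observed values
median_hp = hp.median()

# fillna returns a new Series; the original is left unchanged
hp_filled = hp.fillna(median_hp)

print(median_hp)
print(hp_filled.isnull().sum())
```

The median is less sensitive to extreme values than the mean, which matters if the column turns out to contain outliers.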

In [18]:
# summary statistics of quantitative variables
df1.describe()

| | MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model Year | Origin |
|---|---|---|---|---|---|---|---|---|
| count | 398.0 | 398.0 | 398.0 | 392.0 | 398.0 | 398.0 | 398.0 | 398.0 |
| mean | 23.514573 | 5.454774 | 193.425879 | 104.469388 | 2970.424623 | 15.56809 | 76.01005 | 1.572864 |
| std | 7.815984 | 1.701004 | 104.269838 | 38.49116 | 846.841774 | 2.757689 | 3.697627 | 0.802055 |
| min | 9.0 | 3.0 | 68.0 | 46.0 | 1613.0 | 8.0 | 70.0 | 1.0 |
| 25% | 17.5 | 4.0 | 104.25 | 75.0 | 2223.75 | 13.825 | 73.0 | 1.0 |
| 50% | 23.0 | 4.0 | 148.5 | 93.5 | 2803.5 | 15.5 | 76.0 | 1.0 |
| 75% | 29.0 | 8.0 | 262.0 | 126.0 | 3608.0 | 17.175 | 79.0 | 2.0 |
| max | 46.6 | 8.0 | 455.0 | 230.0 | 5140.0 | 24.8 | 82.0 | 3.0 |
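The summary statistics already hint at outliers. As a rough check, the sketch below applies the conventional 1.5 × IQR rule (Tukey's fences) to the `Horsepower` quartiles printed above. The helper `tukey_fences` is a hypothetical name, and the 1.5 multiplier is a convention rather than something the dataset prescribes.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) bounds; values outside them are flagged as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Horsepower statistics copied from the describe() output
hp_min, hp_q1, hp_q3, hp_max = 46.0, 75.0, 126.0, 230.0

lower, upper = tukey_fences(hp_q1, hp_q3)

# The column has outliers on a side if its extreme value lies beyond the fence
has_low_outliers = hp_min < lower
has_high_outliers = hp_max > upper

print(lower, upper)
print(has_low_outliers, has_high_outliers)
```

Here the IQR is 126 - 75 = 51, so the upper fence is 126 + 76.5 = 202.5. The maximum of 230 lies above it, so at least one `Horsepower` value would be flagged on the high side, while the minimum of 46 is well within the lower fence of -1.5.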
