## Auto MPG (miles per gallon)

Link: https://archive.ics.uci.edu/ml/datasets/auto+mpg
> Date: July 7, 1993

- Instances: 398
- Attributes: 9, including the class attribute
- `horsepower` has 6 missing values

In [1]:
import pandas as pd

import numpy as np

from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)

import plotly.express as px

In [2]:
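# The file is whitespace-separated and has no header row.
# A raw string keeps '\s' from being treated as an invalid escape sequence.
df = pd.read_csv("data/auto-mpg.data", sep=r'\s+', header=None)
df.columns = ["mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "model_year", "origin", "car_name"]
df.head()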

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | car_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 | ford torino |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    398 non-null    object 
 4   weight        398 non-null    float64
 5   acceleration  398 non-null    float64
 6   model_year    398 non-null    int64  
 7   origin        398 non-null    int64  
 8   car_name      398 non-null    object 
dtypes: float64(4), int64(3), object(2)
memory usage: 28.1+ KB


In [4]:
df.describe()

| | mpg | cylinders | displacement | weight | acceleration | model_year | origin |
|---|---|---|---|---|---|---|---|
| count | 398.0 | 398.0 | 398.0 | 398.0 | 398.0 | 398.0 | 398.0 |
| mean | 23.514573 | 5.454774 | 193.425879 | 2970.424623 | 15.56809 | 76.01005 | 1.572864 |
| std | 7.815984 | 1.701004 | 104.269838 | 846.841774 | 2.757689 | 3.697627 | 0.802055 |
| min | 9.0 | 3.0 | 68.0 | 1613.0 | 8.0 | 70.0 | 1.0 |
| 25% | 17.5 | 4.0 | 104.25 | 2223.75 | 13.825 | 73.0 | 1.0 |
| 50% | 23.0 | 4.0 | 148.5 | 2803.5 | 15.5 | 76.0 | 1.0 |
| 75% | 29.0 | 8.0 | 262.0 | 3608.0 | 17.175 | 79.0 | 2.0 |
| max | 46.6 | 8.0 | 455.0 | 5140.0 | 24.8 | 82.0 | 3.0 |


The **horsepower** attribute contains '?' placeholders for its missing values, which is why pandas reads it as an object column and leaves it out of the `describe()` output above.
To fix this, we first replace '?' with **NaN** and then cast the column to a floating-point type.

In [5]:
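# Replace the '?' placeholder with NaN. Assign the result back instead of
# calling replace(..., inplace=True) on the attribute, which is deprecated in
# recent pandas; np.nan is used because the np.NaN alias was removed in NumPy 2.0.
df["horsepower"] = df["horsepower"].replace('?', np.nan)
df["horsepower"] = df["horsepower"].astype('float32')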
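
As an alternative to the replace-then-cast steps above, the '?' placeholder can be turned into NaN while the file is being read, using the `na_values` parameter of `pd.read_csv`. The sketch below is not what this notebook does; it uses a three-line in-memory sample (`sample`, `sample_df` and `columns` are names introduced here for illustration) laid out like `auto-mpg.data`, so it runs without the data file.

```python
import io

import pandas as pd

# Small in-memory sample in the same whitespace-separated layout as
# auto-mpg.data; the middle row has '?' in the horsepower field.
sample = io.StringIO(
    '18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"\n'
    '25.0 4 98.00 ? 2046. 19.0 71 1 "ford pinto"\n'
    '24.0 4 113.0 95.00 2372. 15.0 70 3 "toyota corona mark ii"\n'
)

columns = ["mpg", "cylinders", "displacement", "horsepower", "weight",
           "acceleration", "model_year", "origin", "car_name"]

# na_values="?" tells the parser to treat '?' as missing, so horsepower
# is parsed as a float column directly instead of as object.
sample_df = pd.read_csv(sample, sep=r"\s+", header=None, names=columns,
                        na_values="?")

print(sample_df["horsepower"].dtype)
print(sample_df["horsepower"].isna().sum())
```

With this approach the column comes back as `float64` rather than `float32`; an explicit `astype('float32')` would still be needed to match the cell above.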

In [6]:
df.describe()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin |
|---|---|---|---|---|---|---|---|---|
| count | 398.0 | 398.0 | 398.0 | 392.0 | 398.0 | 398.0 | 398.0 | 398.0 |
| mean | 23.514573 | 5.454774 | 193.425879 | 104.469391 | 2970.424623 | 15.56809 | 76.01005 | 1.572864 |
| std | 7.815984 | 1.701004 | 104.269838 | 38.491158 | 846.841774 | 2.757689 | 3.697627 | 0.802055 |
| min | 9.0 | 3.0 | 68.0 | 46.0 | 1613.0 | 8.0 | 70.0 | 1.0 |
| 25% | 17.5 | 4.0 | 104.25 | 75.0 | 2223.75 | 13.825 | 73.0 | 1.0 |
| 50% | 23.0 | 4.0 | 148.5 | 93.5 | 2803.5 | 15.5 | 76.0 | 1.0 |
| 75% | 29.0 | 8.0 | 262.0 | 126.0 | 3608.0 | 17.175 | 79.0 | 2.0 |
| max | 46.6 | 8.0 | 455.0 | 230.0 | 5140.0 | 24.8 | 82.0 | 3.0 |
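
The `horsepower` count of 392 in this output matches the 6 missing values noted at the top (398 - 6 = 392), since `describe()` skips NaN. A minimal sketch of how missing values could be counted and the affected rows listed is shown below; `toy` is a hypothetical four-row frame built only for illustration, not a slice of the notebook's `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing horsepower value.
toy = pd.DataFrame({
    "mpg": [18.0, 25.0, 24.0, 21.0],
    "horsepower": [130.0, np.nan, 95.0, 85.0],
})

# Number of missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows where horsepower is missing, selected with a boolean mask.
missing_rows = toy[toy["horsepower"].isna()]
print(missing_rows)

# count() ignores NaN, which is why describe() reports fewer rows
# for a column that has missing values.
print(toy["horsepower"].count())
```

Running the same `isna()` calls on `df` after the conversion cell should report 6 for `horsepower`, given the dataset description above.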


In [7]:
px.histogram(df, x='mpg')

In [8]:
px.scatter(df, x='weight', y='model_year', color='origin')

In [43]:
px.scatter(df, x='origin', y='weight')

In [46]:
# https://www.kaggle.com/code/devanshbesain/exploration-and-analysis-auto-mpg/notebook

In [53]:
px.pie(df, names='origin')

In [51]:
px.box(df, x='origin', y='mpg')
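
The plots above show `origin` as the numeric codes 1, 2 and 3. A possible way to make them easier to read is to map the codes to region names first. The mapping used below (1 = USA, 2 = Europe, 3 = Japan) is an assumption based on how this dataset is commonly described; it is not stated in this notebook. `ORIGIN_LABELS`, `demo` and `origin_name` are names introduced here, and the `demo` values are illustrative only.

```python
import pandas as pd

# Assumed meaning of the origin codes (not given in this notebook).
ORIGIN_LABELS = {1: "USA", 2: "Europe", 3: "Japan"}

# Hypothetical demo frame; the real df has the same 'origin' and 'mpg' columns.
demo = pd.DataFrame({
    "origin": [1, 1, 3, 2],
    "mpg": [18.0, 15.0, 24.0, 26.0],
})

# Add a readable label column next to the numeric code.
demo["origin_name"] = demo["origin"].map(ORIGIN_LABELS)

# Median mpg per region, the same statistic a box plot marks with its centre line.
median_mpg = demo.groupby("origin_name")["mpg"].median()
print(median_mpg)
```

Applied to `df`, the new column could be passed to the plots in place of the code, for example `px.box(df, x='origin_name', y='mpg')`.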