## Sample

For this assignment, I have chosen the [Auto MPG Data Set](https://archive.ics.uci.edu/ml/datasets/auto+mpg) from the UCI Machine Learning Repository. This dataset was originally taken from the StatLib library, which is maintained at Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association Exposition.

The analysis is done at the level of the individual car: each record reports the attributes of a single automobile, such as MPG and number of cylinders. Back in the 1970s and 80s, cars tended to be big and heavy and used a lot of gas. The Auto MPG data set is a collection of 398 automobile records from 1970 to 1982. It contains attributes such as car name, MPG, number of cylinders, horsepower and weight. The dataset is widely used and is also available as a built-in dataset in R. It can be used to quickly explore and predict the relationships between MPG, horsepower and weight for cars of that era. Here, I will use it to practice some useful analysis techniques and visualizations for observing relationships between variables.

## Procedure

This study simply reports various attributes of automobiles, so it is an observational study in which no explanatory variables are manipulated. All the variables are recorded as they are and then analyzed. The original purpose of this dataset was to study the effect of different variables on the mileage (MPG) of an automobile. The data were collected by observing the automobiles, for example the number of cylinders a car has or its engine displacement. The records cover cars from 1970 to 1982. The dataset was originally collected by the StatLib library, which is maintained at Carnegie Mellon University.

## Measures

**Explanatory Variables**

<ol><li>**cylinders**: Number of Cylinders the automobile has</li>
<li>**displacement**: Engine Displacement</li>
<li>**horsepower**: Engine power measured in horsepower</li>
<li>**weight**: Weight of the automobile</li>
<li>**acceleration**: Acceleration of the automobile</li>
<li>**model year**: Year when the automobile was built</li>
<li>**origin**: Automobile origin</li>
<li>**car name**: Name of the automobile</li></ol>

**Response Variable**

<ol><li>**mpg**: Miles per gallon of the automobile</li></ol>

In [1]:
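import pandas as pd
import numpy as np

# Column names in the order they appear in the raw file (it has no header row).
column_names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
                'acceleration', 'model year', 'origin', 'car name']

# Read the whitespace-delimited file. sep=r"\s+" replaces delim_whitespace=True,
# which is deprecated in recent pandas versions. Discrete columns are stored as categories.
data = pd.read_csv("auto-mpg.data", sep=r"\s+", header=None, names=column_names,
                   dtype={'mpg': np.float64, 'cylinders': 'category', 'model year': 'category',
                          'origin': 'category', 'car name': 'category'})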

In [2]:
data.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin | car name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 | ford torino |


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
mpg             398 non-null float64
cylinders       398 non-null category
displacement    398 non-null float64
horsepower      398 non-null object
weight          398 non-null float64
acceleration    398 non-null float64
model year      398 non-null category
origin          398 non-null category
car name        398 non-null category
dtypes: category(4), float64(4), object(1)
memory usage: 30.3+ KB
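
The `data.info()` output above lists **horsepower** as `object` rather than `float64`, which usually means that some entries could not be parsed as numbers. The following is a minimal sketch of one way to convert such a column, assuming the non-numeric entries are `?` placeholders for missing values. The three-row `sample` and the names `columns`, `df` and `n_missing` are made up for illustration and are not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same whitespace-delimited layout as
# auto-mpg.data; the "?" in the second row stands in for a missing horsepower.
sample = io.StringIO(
    '18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"\n'
    '25.0 4 98.00 ? 2046. 19.0 71 1 "ford pinto"\n'
    '15.0 8 350.0 165.0 3693. 11.5 70 1 "buick skylark 320"\n'
)
columns = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
           'acceleration', 'model year', 'origin', 'car name']
df = pd.read_csv(sample, sep=r"\s+", header=None, names=columns)

# "?" cannot be parsed as a number, so the whole column is read as text.
# errors="coerce" turns every unparseable entry into NaN.
df["horsepower"] = pd.to_numeric(df["horsepower"], errors="coerce")

# Count how many entries ended up missing after the conversion.
n_missing = int(df["horsepower"].isna().sum())
print(df["horsepower"].dtype, n_missing)
```

On the full dataset, the same `pd.to_numeric(..., errors="coerce")` call could be applied to `data['horsepower']`, after which the column would also appear in `data.describe()`.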


**Scales for explanatory and response variables**

In [4]:
data.describe()

| | mpg | displacement | weight | acceleration |
|---|---|---|---|---|
| count | 398.0 | 398.0 | 398.0 | 398.0 |
| mean | 23.514573 | 193.425879 | 2970.424623 | 15.56809 |
| std | 7.815984 | 104.269838 | 846.841774 | 2.757689 |
| min | 9.0 | 68.0 | 1613.0 | 8.0 |
| 25% | 17.5 | 104.25 | 2223.75 | 13.825 |
| 50% | 23.0 | 148.5 | 2803.5 | 15.5 |
| 75% | 29.0 | 262.0 | 3608.0 | 17.175 |
| max | 46.6 | 455.0 | 5140.0 | 24.8 |


By default, the **describe()** method only summarizes the numeric (continuous) variables; **horsepower** is also absent here because it was read as `object` type. For categorical variables, we need to call the **value_counts()** method explicitly on each variable.

In [5]:
data['cylinders'].value_counts()

4    204
8    103
6     84
3      4
5      3
Name: cylinders, dtype: int64

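The output above shows the distribution of the variable **cylinders**; the other categorical variables can be checked in the same way. Apart from **horsepower**, which was read as `object` type and would need converting to a numeric type before analysis, the explanatory and response variables do not need any data management step.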
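
Checking the remaining categorical variables one at a time can be shortened with a loop. The sketch below is one possible way to do it, run on a small made-up frame `demo` that stands in for `data`. The values in `demo` and the names `categorical_columns` and `counts` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for `data`: one numeric and two categorical columns.
demo = pd.DataFrame({
    "mpg": [18.0, 15.0, 24.0, 27.0],
    "cylinders": pd.Categorical([8, 8, 4, 4]),
    "origin": pd.Categorical([1, 1, 3, 3]),
})

# Pick out every column stored with the category dtype.
categorical_columns = demo.select_dtypes(include="category").columns

# Build one frequency table per categorical column.
counts = {name: demo[name].value_counts() for name in categorical_columns}

for name, table in counts.items():
    print(table, end="\n\n")
```

Replacing `demo` with `data` would tabulate **cylinders**, **model year**, **origin** and **car name** in one pass, since those are the columns loaded with the `category` dtype.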