<h2>Table of Contents</h2>

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ol>
    <li><a href="#Data Acquisition">Data Acquisition</a></li>
    <li><a href="#Basic Insights from the Data set">Basic Insights from the Data set</a></li>
</ol>


</div>
<hr>


# Data Acquisition

The Pandas library is a useful tool for reading various datasets into a data frame. Our Jupyter notebook platforms come with the <b>Pandas library</b> built in, so we only need to import it; no installation is required.


In [1]:
# import pandas library
import pandas as pd
import numpy as np

<h2>Read Data</h2>


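Use the Pandas function `read_csv()` to load the data into a data frame. The file is read with `header=None` because the column names are assigned separately in a later step.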


In [None]:
filepath = "AutoMobile-Dataset.csv"
df = pd.read_csv(filepath, header=None)

In [None]:
# show the first 5 rows using dataframe.head() method
print("The first 5 rows of the dataframe") 
df.head(5)

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question #1: </h1>
<b>Check the bottom 10 rows of data frame "df".</b>
</div>


In [1]:
print("The bottom 10 rows of the dataframe") 
df.tail(10)

The bottom 10 rows of the dataframe


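| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 195 | -1 | 74 | volvo | gas | std | four | wagon | rwd | front | 104.3 | ... | 141 | mpfi | 3.78 | 3.15 | 9.5 | 114 | 5400 | 23 | 28 | 13415 |
| 196 | -2 | 103 | volvo | gas | std | four | sedan | rwd | front | 104.3 | ... | 141 | mpfi | 3.78 | 3.15 | 9.5 | 114 | 5400 | 24 | 28 | 15985 |
| 197 | -1 | 74 | volvo | gas | std | four | wagon | rwd | front | 104.3 | ... | 141 | mpfi | 3.78 | 3.15 | 9.5 | 114 | 5400 | 24 | 28 | 16515 |
| 198 | -2 | 103 | volvo | gas | turbo | four | sedan | rwd | front | 104.3 | ... | 130 | mpfi | 3.62 | 3.15 | 7.5 | 162 | 5100 | 17 | 22 | 18420 |
| 199 | -1 | 74 | volvo | gas | turbo | four | wagon | rwd | front | 104.3 | ... | 130 | mpfi | 3.62 | 3.15 | 7.5 | 162 | 5100 | 17 | 22 | 18950 |
| 200 | -1 | 95 | volvo | gas | std | four | sedan | rwd | front | 109.1 | ... | 141 | mpfi | 3.78 | 3.15 | 9.5 | 114 | 5400 | 23 | 28 | 16845 |
| 201 | -1 | 95 | volvo | gas | turbo | four | sedan | rwd | front | 109.1 | ... | 141 | mpfi | 3.78 | 3.15 | 8.7 | 160 | 5300 | 19 | 25 | 19045 |
| 202 | -1 | 95 | volvo | gas | std | four | sedan | rwd | front | 109.1 | ... | 173 | mpfi | 3.58 | 2.87 | 8.8 | 134 | 5500 | 18 | 23 | 21485 |
| 203 | -1 | 95 | volvo | diesel | turbo | four | sedan | rwd | front | 109.1 | ... | 145 | idi | 3.01 | 3.4 | 23.0 | 106 | 4800 | 26 | 27 | 22470 |
| 204 | -1 | 95 | volvo | gas | turbo | four | sedan | rwd | front | 109.1 | ... | 141 | mpfi | 3.78 | 3.15 | 9.5 | 114 | 5400 | 19 | 25 | 22625 |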


<h3>Add Headers</h3>
<p>
First, create a list "headers" that includes all column names in order.
Then, use <code>dataframe.columns = headers</code> to replace the default integer column labels with the list you created.
</p>


In [None]:
# create headers list
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]
print("headers\n", headers)

Replace the headers and recheck the data frame:


In [None]:
df.columns = headers
df.columns

You can also see the first 10 entries of the updated data frame and note that the headers are updated.


In [None]:
df.head(10)

Next, replace the "?" placeholder with NaN so that `dropna()` can remove the missing values:


In [None]:
# np.nan is the supported spelling; the np.NaN alias was removed in NumPy 2.0
df1 = df.replace('?', np.nan)


You can drop missing values along the column "price" as follows:


In [None]:
df=df1.dropna(subset=["price"], axis=0)
df.head(20)

Here, `axis=0` means that an entire row is dropped wherever its 'price' value is NaN.
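
The effect of these two steps can be checked on a small frame. The sketch below is only an illustration: `toy` and its three rows are made-up values, not part of the automobile dataset. It shows that only rows with a missing 'price' are removed, while a missing value in another column is kept.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "?" marks a missing value, as in the automobile file.
toy = pd.DataFrame({
    "make": ["audi", "bmw", "volvo"],
    "horsepower": ["102", "?", "114"],
    "price": ["13950", "16430", "?"],
})

# Replace the "?" placeholder with NaN so pandas treats it as missing.
toy_nan = toy.replace("?", np.nan)

# Drop only the rows whose "price" is missing; NaNs in other columns stay.
toy_clean = toy_nan.dropna(subset=["price"], axis=0)

print(toy_clean)
```

Note that the surviving rows keep their original index labels, so the index may have gaps after rows are dropped.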


 <div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question #2: </h1>
<b>Find the names of the columns of the data frame.</b>
</div>


In [3]:
df.columns

Index(['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
       'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
       'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
       'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
       'highway-mpg', 'price'],
      dtype='object')
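
As an alternative to assigning the column names and replacing "?" after loading, `read_csv()` accepts the `names` and `na_values` parameters, which do both at load time. The sketch below is a possible variant, not the approach used in this notebook; the two-column `raw` sample and its values are hypothetical stand-ins for the real file.

```python
import io
import pandas as pd

# Hypothetical two-column sample standing in for the headerless CSV file.
raw = io.StringIO("audi,13950\nbmw,?\nvolvo,16845\n")

# names= assigns column labels at load time; na_values= marks "?" as missing.
sample = pd.read_csv(raw, header=None, names=["make", "price"], na_values="?")

print(sample.columns.tolist())
print(sample["price"].isna().sum())
```

Because "?" is treated as missing while parsing, the "price" column in this sketch is read as a numeric column rather than as text.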

# Basic Insights from the Data set


<h2>Data Types</h2>



The `dtypes` attribute returns a Series with the data type of each column.


In [None]:
# check the data type of each column of "df" with .dtypes
print(df.dtypes)
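
Checking the data types matters because a column that contained the "?" placeholder is read as text rather than as numbers. The sketch below illustrates this on a made-up one-column frame (`toy` is hypothetical). `pd.to_numeric` is not used elsewhere in this notebook; it is shown here only as one possible way to convert such a column.

```python
import pandas as pd

# Hypothetical toy column: the "?" placeholder forces a non-numeric dtype.
toy = pd.DataFrame({"price": ["13950", "?", "16845"]})
print(toy.dtypes)

# errors="coerce" turns unparseable entries such as "?" into NaN.
toy["price"] = pd.to_numeric(toy["price"], errors="coerce")
print(toy.dtypes)
```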

<h2>Describe</h2>
To get a statistical summary of each column, such as the count, mean, and standard deviation, use the <code>describe()</code> method.


By default, this method summarizes only the numeric columns, and it excludes <code>NaN</code> (Not a Number) values from the statistics.


In [None]:
df.describe()

To summarize every column, including those of type object, pass the argument <code>include="all"</code> to <code>describe()</code>:

In [None]:
# describe all the columns in "df" 
df.describe(include = "all")

<h1> Question #3: </h1>

Apply the <code>describe()</code> method to the columns 'length' and 'compression-ratio'.



In [4]:
df[['length','compression-ratio']].describe()

| | length | compression-ratio |
|---|---|---|
| count | 205.0 | 205.0 |
| mean | 174.049268 | 10.142537 |
| std | 12.337289 | 3.97204 |
| min | 141.1 | 7.0 |
| 25% | 166.3 | 8.6 |
| 50% | 173.2 | 9.0 |
| 75% | 183.1 | 9.4 |
| max | 208.1 | 23.0 |
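
To see what `include="all"` adds to the default summary, the two calls can be compared on a small frame. This is a minimal sketch on made-up data; `toy` and its values are hypothetical and are not taken from the automobile dataset.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one text column.
toy = pd.DataFrame({
    "length": [168.8, 171.2, 176.6],
    "make": ["audi", "audi", "bmw"],
})

# Default: only numeric columns are summarized.
numeric_summary = toy.describe()

# include="all": text columns are added, with count, unique, top and freq.
full_summary = toy.describe(include="all")

print(numeric_summary.columns.tolist())
print(full_summary.columns.tolist())
print(full_summary.loc["top", "make"])
```

In the full summary, statistics that do not apply to a column (for example the mean of a text column) appear as NaN.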


<h2>Info</h2>


The `info()` method prints a concise summary of a data frame, including the index dtype, the column names and dtypes, the non-null counts, and the memory usage.


In [None]:
# look at the info of "df"
df.info()