# Multiple Linear Regression

## Predicting Car Prices

Notebook by Anthony Rodriguez

In [9]:
import os, math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

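# Project-local helper module; supplies fetch_file, num_lines_in_file,
# and print_file_contents, which are used in the cells below.
from utils_common import *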

# Introduction

This notebook walks through exploratory data analysis (EDA) and multiple linear regression on a [Kaggle dataset for predicting car prices with multiple linear regression](https://www.kaggle.com/datasets/hellbuoy/car-price-prediction). The dataset is used here purely as practice for basic EDA, simple data cleaning, and linear regression with multiple variables.

# Exploratory Data Analysis (EDA)

In [10]:
filename = "CarPrice_Assignment.csv"

In [11]:
data_file = fetch_file(filename)
data_file

WindowsPath('data/CarPrice_Assignment.csv')
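The `fetch_file` helper comes from `utils_common`, which is not shown in this notebook. Judging only from the output above, it returns a `pathlib` path under a `data/` folder. A minimal stand-in might look like the sketch below; the name `fetch_file_sketch`, the `data_dir` parameter, and the missing-file check are all assumptions, not the actual helper.

```python
from pathlib import Path


def fetch_file_sketch(filename, data_dir="data"):
    """Hypothetical stand-in for utils_common.fetch_file (assumed behavior).

    Builds the path <data_dir>/<filename> and fails early if the file
    is not there, instead of failing later inside pd.read_csv.
    """
    path = Path(data_dir) / filename
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected {path} to exist; download the CSV from Kaggle first."
        )
    return path
```

Calling `fetch_file_sketch("CarPrice_Assignment.csv")` on Windows would give a `WindowsPath` like the one shown above, as long as the file has been placed in `data/`.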

#### Let's look at the size of the file.

In [12]:
print(f'data_file is {os.path.getsize(data_file) / 1e6} MB')

data_file is 0.026717 MB
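A file this small reads more naturally in kilobytes than in megabytes. The sketch below is one way to pick the unit automatically; `format_size` is a hypothetical helper, not part of `utils_common`, and it uses decimal (SI) units to match the `/ 1e6` conversion above.

```python
def format_size(num_bytes):
    """Format a byte count with decimal (SI) units, one digit after the point."""
    size = float(num_bytes)
    for unit in ("B", "kB", "MB", "GB"):
        # Stop at the first unit that keeps the number below 1000,
        # or at the largest unit this sketch knows about.
        if size < 1000 or unit == "GB":
            return f"{size:.1f} {unit}"
        size /= 1000


# 0.026717 MB from the cell above corresponds to 26,717 bytes.
print(format_size(26717))
```

With `os.path.getsize(data_file)` passed in, this prints the size in whichever unit fits.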


#### Let's look at the number of lines in the data.

In [13]:
num_lines = num_lines_in_file(data_file)
print(f'The training set file has {num_lines} lines of data.')

The training set file has 206 lines of data.
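`num_lines_in_file` is another `utils_common` helper whose source is not shown. If it counts every physical line, the 206 reported above would include the header row, leaving 205 rows of actual data. The sketch below shows one way such a count could be done; `count_lines_sketch` is a hypothetical name and the demo file is made up for illustration.

```python
import tempfile
from pathlib import Path


def count_lines_sketch(path):
    """Count physical lines in a text file without loading it all into memory."""
    with open(path, "r", encoding="utf-8") as handle:
        return sum(1 for _ in handle)


# Demo on a throwaway file: one header line plus two data rows.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_text("car_ID,price\n1,13495\n2,16500\n", encoding="utf-8")
    total_lines = count_lines_sketch(demo_path)
    data_rows = total_lines - 1  # subtract the header line

print(total_lines, data_rows)
```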


#### Let's check the contents of the file to make sure it is a CSV file.

In [14]:
print_file_contents(data_file)

0	'car_ID,symboling,CarName,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,wheelbase,carlength,carwidth,carheight,curbweight,enginetype,cylindernumber,enginesize,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price\n'
1	'1,3,alfa-romero giulia,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9,111,5000,21,27,13495\n'
2	'2,3,alfa-romero stelvio,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9,111,5000,21,27,16500\n'
3	'3,1,alfa-romero Quadrifoglio,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9,154,5000,19,26,16500\n'
4	'4,2,audi 100 ls,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.4,10,102,5500,24,30,13950\n'
5	'5,2,audi 100ls,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,five,136,mpfi,3.19,3.4,8,115,5500,18,22,17450\n'
6	'6,2,audi fox,gas,std,two,sedan,fwd,front,99.

#### The file looks like a standard CSV file: a header row followed by comma-separated columns holding a mix of numbers and strings.
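The raw preview above came from `print_file_contents`, whose implementation lives in `utils_common` and is not shown. A rough equivalent, assuming it prints each line's index and `repr`, could look like this; `preview_lines_sketch` and `max_lines` are hypothetical names.

```python
import tempfile
from itertools import islice
from pathlib import Path


def preview_lines_sketch(path, max_lines=5):
    """Print the first few raw lines of a text file with their index.

    repr() makes commas, quotes, and trailing newlines visible, which
    helps confirm the delimiter before handing the file to pandas.
    """
    lines = []
    # newline="" keeps the original line endings untranslated.
    with open(path, "r", encoding="utf-8", newline="") as handle:
        for index, line in enumerate(islice(handle, max_lines)):
            print(f"{index}\t{line!r}")
            lines.append(line)
    return lines


# Demo on a throwaway two-line file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    with open(demo_path, "w", encoding="utf-8", newline="") as out:
        out.write("car_ID,price\n1,13495\n")
    preview = preview_lines_sketch(demo_path, max_lines=2)
```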

#### Let's create a data frame.

In [16]:
df = pd.read_csv(data_file)
df.head()

| | car_ID | symboling | CarName | fueltype | aspiration | doornumber | carbody | drivewheel | enginelocation | wheelbase | ... | enginesize | fuelsystem | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | alfa-romero giulia | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495.0 |
| 1 | 2 | 3 | alfa-romero stelvio | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500.0 |
| 2 | 3 | 1 | alfa-romero Quadrifoglio | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500.0 |
| 3 | 4 | 2 | audi 100 ls | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.4 | 10.0 | 102 | 5500 | 24 | 30 | 13950.0 |
| 4 | 5 | 2 | audi 100ls | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.4 | 8.0 | 115 | 5500 | 18 | 22 | 17450.0 |
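The `...` column in the `df.head()` output is pandas truncating the display, not missing data. The sketch below rebuilds a tiny frame from the header and first two rows shown in the raw file preview, checks its shape, and prints it with every column visible. The in-memory `sample_csv` string is only a stand-in for the real file.

```python
import io

import pandas as pd

# Header plus the first two data rows, copied from the raw file preview.
sample_csv = (
    "car_ID,symboling,CarName,fueltype,aspiration,doornumber,carbody,"
    "drivewheel,enginelocation,wheelbase,carlength,carwidth,carheight,"
    "curbweight,enginetype,cylindernumber,enginesize,fuelsystem,boreratio,"
    "stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price\n"
    "1,3,alfa-romero giulia,gas,std,two,convertible,rwd,front,88.6,168.8,"
    "64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9,111,5000,21,27,13495\n"
    "2,3,alfa-romero stelvio,gas,std,two,convertible,rwd,front,88.6,168.8,"
    "64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9,111,5000,21,27,16500\n"
)

sample = pd.read_csv(io.StringIO(sample_csv))

# read_csv consumes the header line, so only the data rows are counted.
print(sample.shape)

# Temporarily lift the column limit so no columns are hidden behind "...".
with pd.option_context("display.max_columns", None):
    print(sample.head())
```

Running `df.shape` on the full frame is a quick cross-check against the line count taken earlier.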


#### Let's check the column types in the data frame.

In [17]:
df.dtypes

car_ID                int64
symboling             int64
CarName              object
fueltype             object
aspiration           object
doornumber           object
carbody              object
drivewheel           object
enginelocation       object
wheelbase           float64
carlength           float64
carwidth            float64
carheight           float64
curbweight            int64
enginetype           object
cylindernumber       object
enginesize            int64
fuelsystem           object
boreratio           float64
stroke              float64
compressionratio    float64
horsepower            int64
peakrpm               int64
citympg               int64
highwaympg            int64
price               float64
dtype: object
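The listing above mixes numeric columns (`int64`, `float64`) with string columns (`object`), and a linear regression can only use the string columns after some encoding. One way to split the two groups is `select_dtypes`, sketched below on a small hand-built frame whose values are taken from the rows shown earlier. The variable names are illustrative only.

```python
import pandas as pd

# A few columns and values copied from the first rows of the dataset.
sample = pd.DataFrame(
    {
        "CarName": ["alfa-romero giulia", "audi 100 ls"],
        "fueltype": ["gas", "gas"],
        "doornumber": ["two", "four"],  # number words stored as strings
        "enginesize": [130, 109],
        "horsepower": [111, 102],
        "price": [13495.0, 13950.0],
    }
)

# Numeric columns can feed a regression directly.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Everything else needs encoding (or dropping) before modelling.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

Note that columns such as `doornumber` and `cylindernumber` hold number words ("two", "four"), so they land in the non-numeric group even though they describe counts.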

# Training a Multiple Linear Regression Model