# Simple Linear Regression

## Overview

This notebook provides an in-depth exploration of simple linear regression.

**Simple Linear Regression**: A foundational machine learning method for predicting a numerical value from a single input variable.

This notebook covers the theory behind simple linear regression, its implementation in `Python` using libraries such as `NumPy` and `scikit-learn`, and methods for assessing model performance.

## Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

## Read Dataset

The dataset is stored in a comma-separated values (CSV) file and is loaded with `pandas.read_csv()`. pandas is a library for data manipulation and analysis. Its core data structure, the `DataFrame`, holds data in tabular form, which makes exploration, analysis, and visualization straightforward.

**Note:**
- Modify the path if the dataset file is located elsewhere.
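One way to make the path easy to change is to keep it in a single place and fail early with a clear message when the file is missing. The following is a minimal sketch, not part of the original notebook; the helper name `load_csv` is hypothetical.

```python
from pathlib import Path

import pandas as pd


def load_csv(csv_path):
    """Read a CSV file into a DataFrame, failing early if the file is missing.

    `load_csv` is a hypothetical helper introduced for illustration only.
    """
    path = Path(csv_path)
    if not path.is_file():
        # Show the absolute path so a wrong working directory is easy to spot
        raise FileNotFoundError(f"CSV file not found: {path.resolve()}")
    return pd.read_csv(path)
```

With such a helper, the cell below could be written as `dataset = load_csv('dataset/auto-mpg.csv')`.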

In [2]:
dataset = pd.read_csv('dataset/auto-mpg.csv')
dataset.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino


In [3]:
# Keep the first seven columns (mpg through model year); drop origin and car name
data = dataset.iloc[:, :7]
data.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year
0,18.0,8,307.0,130,3504,12.0,70
1,15.0,8,350.0,165,3693,11.5,70
2,18.0,8,318.0,150,3436,11.0,70
3,16.0,8,304.0,150,3433,12.0,70
4,17.0,8,302.0,140,3449,10.5,70


## Exploratory Data Analysis (EDA)

This section examines the loaded dataset with the `pandas` and `seaborn` libraries to understand its characteristics.

1. **Data Shape and Data Types:**
   - The `data.shape` attribute is used to retrieve the dimensions (number of rows and columns) of the `DataFrame`. This provides a quick overview of the data's size.
   - The `data.dtypes` attribute returns a Series displaying the data type of each column. This helps identify potential data type mismatches or areas requiring type conversion.

In [4]:
data.shape

(398, 7)

In [5]:
data.dtypes

mpg             float64
cylinders         int64
displacement    float64
horsepower       object
weight            int64
acceleration    float64
model year        int64
dtype: object
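The `object` dtype reported for `horsepower` suggests that at least one entry in that column is not numeric. The sketch below uses a made-up three-row CSV (the `toy` DataFrame is hypothetical, not the Auto MPG data) to show how a single non-numeric token changes the dtype that `read_csv` infers, and one way to list such columns.

```python
import io

import pandas as pd

# Hypothetical toy CSV: one '?' in an otherwise numeric column
csv_text = "horsepower,weight\n130,3504\n?,2046\n150,3436\n"
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)   # (rows, columns)
print(toy.dtypes)  # 'horsepower' is read as text, 'weight' as an integer type

# Columns that were not inferred as numeric are candidates for cleaning
non_numeric = [
    col for col in toy.columns
    if not pd.api.types.is_numeric_dtype(toy[col])
]
print(non_numeric)
```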

<hr>

### Handling Missing Values in *horsepower* column

Missing values in the `horsepower` column are recorded as the character `?`, which is why the column was read as `object` rather than as a numeric type. The steps below locate these entries, mark them as missing, and impute them.

**Identification of Non-Numeric Values:**
The expression `np.nonzero(~data.horsepower.str.isdigit())[0]` returns the positional indices of the rows whose `horsepower` entry is not made up entirely of digits. `str.isdigit()` tests each string, the tilde `~` negates the resulting boolean mask, and `np.nonzero` returns the positions where the negated mask is `True`.

In [13]:
horsepower_nulls = np.nonzero(~data.horsepower.str.isdigit())[0]
horsepower_nulls

array([ 32, 126, 330, 336, 354, 374])

In [11]:
data.iloc[horsepower_nulls]

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year
32,25.0,4,98.0,?,2046,19.0,71
126,21.0,6,200.0,?,2875,17.0,74
330,40.9,4,85.0,?,1835,17.3,80
336,23.6,4,140.0,?,2905,14.3,80
354,34.5,4,100.0,?,2320,15.8,81
374,23.0,4,151.0,?,3035,20.5,82
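The `str.isdigit()` check works here because every valid `horsepower` entry is a string of digits. It would also flag a decimal string such as `97.5`. As a hedged alternative, not the approach used in this notebook, `pd.to_numeric` with `errors="coerce"` flags only the entries that cannot be parsed as numbers. The sketch below compares both on a made-up Series named `hp`.

```python
import pandas as pd

# Hypothetical toy column; the values are made up for illustration
hp = pd.Series(["130", "?", "150", "97.5", "?"])

# Approach used above: positions of entries that are not all digits
digit_based = (~hp.str.isdigit()).to_numpy().nonzero()[0]

# Alternative: anything that cannot be parsed as a number becomes NaN
parsed = pd.to_numeric(hp, errors="coerce")
parse_based = parsed.isna().to_numpy().nonzero()[0]

print(digit_based)  # includes the position of "97.5"
print(parse_based)  # only the positions of "?"
```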


The statement `data = data.replace('?', np.nan)` uses the DataFrame's `replace` method to substitute every cell equal to `?` with `np.nan` (Not a Number), the marker pandas uses for missing values.

In [14]:
data = data.replace('?', np.nan) # Replace '?' with NaN

**Verification of Missing Values (Optional):**
The expression `data.isnull().sum()` reports the number of missing values in each column after the replacement step.


In [17]:
data.isnull().sum()

mpg             0
cylinders       0
displacement    0
horsepower      6
weight          0
acceleration    0
model year      0
dtype: int64
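The replace-then-count pattern can be tried on a small made-up DataFrame (`toy` and its values are hypothetical). The sketch also illustrates that `replace` with a plain string matches whole cell values, so a cell that merely contains a `?` is left unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: '?' appears as a full cell and inside a longer string
toy = pd.DataFrame({
    "horsepower": ["130", "?", "150"],
    "car name": ["what?", "ford", "?"],
})

# Only cells that are exactly '?' become NaN
cleaned = toy.replace("?", np.nan)

# Count missing values per column
missing_per_column = cleaned.isnull().sum()
print(cleaned)
print(missing_per_column)
```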

**Imputation with Median:**
The imputation itself is done by `data['horsepower'] = data['horsepower'].fillna(data['horsepower'].astype('float64').median())`. The `fillna()` method fills the missing values (`NaN`) in the `horsepower` column with the column's median, which is computed as `data['horsepower'].astype('float64').median()`. Converting the column to a numeric data type (`float64`) ensures that the median is calculated on numbers rather than on strings. The missing values are thus replaced with a measure of central tendency of the existing numerical data in the column, and the median is less sensitive to extreme values than the mean.

In [21]:
data['horsepower'] = data['horsepower'].fillna(data['horsepower'].astype('float64').median())

In [23]:
data.iloc[horsepower_nulls]

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year
32,25.0,4,98.0,93.5,2046,19.0,71
126,21.0,6,200.0,93.5,2875,17.0,74
330,40.9,4,85.0,93.5,1835,17.3,80
336,23.6,4,140.0,93.5,2905,14.3,80
354,34.5,4,100.0,93.5,2320,15.8,81
374,23.0,4,151.0,93.5,3035,20.5,82
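In the cell above, the fill value is computed from a `float64` copy of the column, but the column itself is not converted by that assignment, so its other entries may still be stored as strings. A hedged variant, shown on a made-up Series (`hp` is hypothetical), converts the column first so that the result has a numeric dtype throughout.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for horsepower after '?' was replaced by NaN
hp = pd.Series(["130", np.nan, "150", "90", np.nan], name="horsepower")

# Convert first so the column holds floats rather than strings
hp_numeric = pd.to_numeric(hp, errors="coerce")

# median() skips NaN entries by default
median_hp = hp_numeric.median()
hp_filled = hp_numeric.fillna(median_hp)

print(median_hp)
print(hp_filled.dtype)
print(hp_filled.tolist())
```

Applied to the notebook's data, the same two steps would be `data['horsepower'] = pd.to_numeric(data['horsepower'], errors='coerce')` followed by `fillna` with the median.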
