
# Multivariable Linear Regression
Using multivariable linear regression to predict house prices. The code below runs on the California housing dataset; the Boston dataset calls are kept only as commented-out lines.
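
As a reference point, a multivariable linear regression model predicts the target as a weighted sum of the features. The notation below ($\hat{y}$, $\theta_j$, $x_j$) is assumed here for illustration and does not come from the notebook:

$$
\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n
$$

Here $\hat{y}$ is the predicted price, $x_1, \dots, x_n$ are the feature values, $\theta_0$ is the intercept, and $\theta_1, \dots, \theta_n$ are the coefficients estimated from the data.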

### Imports and packages

In [18]:
import pandas as pd

# Boston Data (load_boston was removed in scikit-learn 1.2, so this stays commented out)
#from sklearn.datasets import load_boston

# California Data
from sklearn.datasets import fetch_california_housing

### Gather data

[Source: California Data](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html)

In [10]:
# Boston Data
#dataset = load_boston()
#print(dataset.data.shape)

# California Data
dataset = fetch_california_housing()

# Dataset type
type(dataset)

<class 'sklearn.utils.Bunch'>

In [11]:
# To find any object's attributes
dir(dataset)

['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']

In [12]:
# Description of the Dataset
print(dataset.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

### Data points and Features

In [13]:
type(dataset.data)

numpy.ndarray

In [15]:
# dataset dimensions
dataset.data.shape

(20640, 8)

In [16]:
# dataset features
dataset.feature_names

['MedInc',
 'HouseAge',
 'AveRooms',
 'AveBedrms',
 'Population',
 'AveOccup',
 'Latitude',
 'Longitude']

In [17]:
# target variable: median house value per district, in units of $100,000
dataset.target

array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894])
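
Because the target is expressed in units of $100,000, converting it to dollars is a single multiplication. The sketch below does this for the first three values shown in the output above; the names `sample_targets` and `sample_dollars` are illustrative and not part of the notebook. The same scaling should work element-wise on the full array as `dataset.target * 100_000`.

```python
# First three target values from the output above, in units of $100,000
sample_targets = [4.526, 3.585, 3.521]

# Scale to dollars; round() removes floating-point noise from the multiplication
sample_dollars = [round(value * 100_000) for value in sample_targets]

print(sample_dollars)  # [452600, 358500, 352100]
```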

### Data exploration with Pandas dataframe

In [19]:
# convert data to a dataframe
data = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)

# add target variable to the dataframe
data['PRICE'] = dataset.target
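
The two steps above (wrap the feature array in a dataframe, then append the target as a column) can be bundled into a small helper. This is only a sketch: `bunch_to_frame` is a hypothetical name, and the toy object below is a stand-in for the scikit-learn dataset so that the example runs without downloading anything. Its values are copied from the first rows of the California data.

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd


def bunch_to_frame(bunch, target_name="PRICE"):
    """Build a DataFrame from an object exposing data, feature_names and target."""
    frame = pd.DataFrame(data=bunch.data, columns=bunch.feature_names)
    frame[target_name] = bunch.target
    return frame


# Hypothetical stand-in for the dataset object, limited to two features
toy = SimpleNamespace(
    data=np.array([[8.3252, 41.0], [8.3014, 21.0], [7.2574, 52.0]]),
    feature_names=["MedInc", "HouseAge"],
    target=np.array([4.526, 3.585, 3.521]),
)

toy_frame = bunch_to_frame(toy)
print(toy_frame.shape)          # (3, 3)
print(list(toy_frame.columns))  # ['MedInc', 'HouseAge', 'PRICE']
```

With the real data this would be called as `bunch_to_frame(dataset)`. As an alternative, `fetch_california_housing` accepts `as_frame=True`, which fills the `frame` attribute listed by `dir(dataset)`; in that case the target column keeps its own name rather than `PRICE`.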

In [22]:
# final dataset
data.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | PRICE |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


In [23]:
data.count()

MedInc        20640
HouseAge      20640
AveRooms      20640
AveBedrms     20640
Population    20640
AveOccup      20640
Latitude      20640
Longitude     20640
PRICE         20640
dtype: int64

### Cleaning data - check for missing values

In [25]:
# check for missing values in each column using pandas
pd.isnull(data).any()

MedInc        False
HouseAge      False
AveRooms      False
AveBedrms     False
Population    False
AveOccup      False
Latitude      False
Longitude     False
PRICE         False
dtype: bool
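
Every column above reports `False`, so it may help to see what the same check looks like when a value really is missing. The sketch below uses a small made-up frame (`toy`, with one `NaN` inserted on purpose) and also shows `isnull().sum()`, which counts the missing values per column instead of only flagging them.

```python
import numpy as np
import pandas as pd

# Made-up frame with one deliberately missing value in MedInc
toy = pd.DataFrame({
    "MedInc": [8.3252, np.nan, 7.2574],
    "HouseAge": [41.0, 21.0, 52.0],
})

has_missing = toy.isnull().any()     # True for columns with at least one NaN
missing_counts = toy.isnull().sum()  # number of NaN values per column

print(bool(has_missing["MedInc"]), bool(has_missing["HouseAge"]))  # True False
print(int(missing_counts["MedInc"]))                               # 1
```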

In [26]:
# check for missing values in each column via the non-null counts from DataFrame.info()
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   MedInc      20640 non-null  float64
 1   HouseAge    20640 non-null  float64
 2   AveRooms    20640 non-null  float64
 3   AveBedrms   20640 non-null  float64
 4   Population  20640 non-null  float64
 5   AveOccup    20640 non-null  float64
 6   Latitude    20640 non-null  float64
 7   Longitude   20640 non-null  float64
 8   PRICE       20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB
