# Linear Regression Notebook
***Dataset info***

**Dataset name**: Housing<br>
**Type**: numerical, with one categorical column (`ocean_proximity`)<br>
**Number of columns**: 10

| Column | Data type |
|--------|-----------|
| longitude | float64 |
| latitude | float64 |
| housing_median_age | float64 |
| total_rooms | float64 |
| total_bedrooms | float64 |
| population | float64 |
| households | float64 |
| median_income | float64 |
| median_house_value | float64 |
| ocean_proximity | object (string) |

The target variable is **median_house_value**.

# Import Needed Libraries

In [19]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import plotly.express as px
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score


# Load Data

In [32]:
df = pd.read_csv('./data/housing.csv')
df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,78100.0,INLAND
20636,-121.21,39.49,18.0,697.0,150.0,356.0,114.0,2.5568,77100.0,INLAND
20637,-121.22,39.43,17.0,2254.0,485.0,1007.0,433.0,1.7000,92300.0,INLAND
20638,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,84700.0,INLAND


In [63]:
fig = px.scatter_mapbox(
    df,
    lat="latitude",
    lon="longitude",
    color="ocean_proximity",
    height=600
)
fig.update_layout(mapbox_style="open-street-map")
fig.show()

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed

# Preprocessing
## Handling null values

In [33]:
df.isnull().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

**There are 207 null values in the `total_bedrooms` column.**<br>
We will replace them with the column mean.

In [34]:
df['total_bedrooms'] = df['total_bedrooms'].fillna(df['total_bedrooms'].mean())
df.isnull().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64

**Now there are no null values left.**
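
To make the mean imputation above concrete, here is a minimal sketch on a tiny made-up column (the `toy` values are hypothetical and not taken from `housing.csv`). It also computes the median as one possible alternative, on the assumption that a fill value less sensitive to outliers may sometimes be preferable.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: one missing value and one large outlier
toy = pd.DataFrame({"total_bedrooms": [100.0, 200.0, np.nan, 300.0, 5000.0]})

# Both statistics skip NaN by default
mean_value = toy["total_bedrooms"].mean()
median_value = toy["total_bedrooms"].median()

# Fill the missing entry with each candidate value
filled_mean = toy["total_bedrooms"].fillna(mean_value)
filled_median = toy["total_bedrooms"].fillna(median_value)

print(f"mean fill value:   {mean_value}")
print(f"median fill value: {median_value}")
print(f"nulls left after filling: {filled_mean.isnull().sum()}")
```

In this toy column the single outlier pulls the mean far above the typical values, while the median stays close to them.
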
## Encode `ocean_proximity` 

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20640 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


**The `ocean_proximity` values are not *numeric*.**<br>
We need to encode them so that the model can use them.

In [36]:
le = LabelEncoder()
df['ocean_proximity'] = le.fit_transform(df['ocean_proximity'])
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,3
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,3
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,3
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,3
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,3
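
`LabelEncoder` assigns integer codes in sorted label order, and a linear model will treat those codes as an ordered numeric scale even though the categories are nominal. One common alternative is one-hot encoding. The sketch below is an illustration under that assumption, not what this notebook does, and the `toy` frame holds a few hypothetical sample rows rather than the real data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of category values
toy = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "INLAND", "NEAR OCEAN", "INLAND"]})

# Label encoding: one integer per category, assigned in sorted order
le_demo = LabelEncoder()
codes = le_demo.fit_transform(toy["ocean_proximity"])
mapping = {label: int(code) for code, label in enumerate(le_demo.classes_)}
print(mapping)

# One-hot encoding: one 0/1 column per category, no implied order
one_hot = pd.get_dummies(toy, columns=["ocean_proximity"], dtype=int)
print(one_hot)
```

With one-hot columns the model fits a separate coefficient per category instead of a single slope across arbitrary integer codes.
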


In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20640 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  int32  
dtypes: float64(9), int32(1)
memory usage: 1.5 MB


**All values are now numeric and ready to use.**

## Separating Features and Target

In [43]:
x = df.drop(['median_house_value'], axis=1)
y = df['median_house_value']

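## Scaling / Normalizing the Data
`StandardScaler` rescales each feature to zero mean and unit variance, so that **no single feature dominates** in algorithms that are sensitive to feature magnitude, such as distance-based or gradient-based methods. For ordinary least squares with an intercept the predictions do not depend on feature scale, but standardized features make the fitted coefficients comparable.<br>
Note that the scaler here is fitted on the full dataset before the train/test split, which lets test-set statistics leak into preprocessing; a stricter workflow fits the scaler on the training split only.<br>
[Read more](https://medium.com/codex/why-scaling-your-data-is-important-1aff95ca97a2)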

In [47]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
x = ss.fit_transform(x)

# Show the standardized data
pd.DataFrame(x).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,-1.327835,1.052548,0.982143,-0.804819,-0.975228,-0.974429,-0.977033,2.344766,1.291089
1,-1.322844,1.043185,-0.607019,2.04589,1.355088,0.861439,1.669961,2.332238,1.291089
2,-1.332827,1.038503,1.856182,-0.535746,-0.829732,-0.820777,-0.843637,1.782699,1.291089
3,-1.337818,1.038503,1.856182,-0.624215,-0.722399,-0.766028,-0.733781,0.932968,1.291089
4,-1.337818,1.038503,1.856182,-0.462404,-0.615066,-0.759847,-0.629157,-0.012881,1.291089
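
The cell above fits the scaler on every row before the data is split. A minimal sketch of the split-first ordering is shown here on synthetic data; the names `X_demo`, `y_demo`, and `scaler_demo` are hypothetical and the numbers are random, so this illustrates the ordering only and is not a replacement for the notebook's results.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.normal(size=200)

# 1) Split first
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# 2) Learn mean and standard deviation from the training rows only
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)

# 3) Apply the training statistics to the test rows
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(6))  # approximately 0 per feature
print(X_tr_scaled.std(axis=0).round(6))   # approximately 1 per feature
```

The test rows will generally not have exactly zero mean or unit variance after this transform, which is expected because they are scaled with the training statistics.
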


_Now we split the data into training (80%) and testing (20%) sets._

In [59]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
print(f"The number of samples used in training : {len(x_train)}")
print(f"The number of samples used in testing  : {len(x_test)}")

The number of samples used in training : 16512
The number of samples used in testing  : 4128
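
Because `train_test_split` shuffles randomly, the split above (and therefore the metrics later on) can change from run to run. A minimal sketch of one way to make the split repeatable is to pass `random_state`; the seed value 42 is an arbitrary assumption, and the split is demonstrated on row indices rather than on the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Row indices standing in for the 20640 rows of the dataset
indices = np.arange(20640)

# Two independent calls with the same seed
_, test_a = train_test_split(indices, test_size=0.2, random_state=42)
_, test_b = train_test_split(indices, test_size=0.2, random_state=42)

print(len(test_a))                     # size of the test split
print(np.array_equal(test_a, test_b))  # identical rows in both calls
```
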


# Models
**We have completed the data preprocessing steps, and the dataset is now prepared for use in building the models.**

## Linear regression model

In [60]:
lr = LinearRegression()
lr.fit(x_train, y_train)
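
After fitting, a `LinearRegression` object exposes the learned weights as `coef_` and the bias as `intercept_`. The sketch below illustrates this on synthetic, noise-free data whose true relationship is chosen by hand (an assumption for illustration only), so the recovered values can be checked against known ones.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from y = 3*x0 - 2*x1 + 5 (assumption)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 5.0

model_demo = LinearRegression()
model_demo.fit(X_demo, y_demo)

# One coefficient per feature, plus a single intercept
print(model_demo.coef_.round(6))
print(round(float(model_demo.intercept_), 6))
```

On the housing data the same attributes of `lr` give one coefficient per input column, in the order of the columns of `x`.
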

# Evaluate Model

In [61]:
y_pred = lr.predict(x_test)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)


In [62]:
print(f"Mean Squared Error: {mse}")
print(f"Mean Absolute Error: {mae}")
print(f"R-Squared: {r2}\n")

Mean Squared Error: 4664226474.394224
Mean Absolute Error: 50316.91458639086
R-Squared: 0.6514648086442427
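
To show what these three numbers measure, the sketch below recomputes them by hand on a tiny hypothetical example and compares them with the scikit-learn functions. It also takes the square root of the MSE reported above to get the root mean squared error (RMSE), which is in the same units as `median_house_value`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical true values and predictions
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 320.0, 380.0])

errors = y_true - y_hat

# MSE: mean of squared errors; MAE: mean of absolute errors
mse_manual = np.mean(errors ** 2)
mae_manual = np.mean(np.abs(errors))

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))
print(mae_manual, mean_absolute_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))

# RMSE for the MSE value printed by the notebook above
rmse_notebook = np.sqrt(4664226474.394224)
print(round(float(rmse_notebook), 2))
```

For the run shown above this gives an RMSE of roughly 68,295, compared with a mean absolute error of about 50,317; both are in the units of the target.
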

