### Problem Statement

The goal is to predict house prices for a town or suburb from the features of the locality provided in the dataset. Along the way, we need to identify the most important features, apply data preprocessing techniques, and build a linear regression model that predicts the prices.

### Data Information

Each record in the dataset describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. Detailed attribute information is given below.

Attribute Information (in order):
- CRIM: per capita crime rate by town. It describes the crime rate in different towns or neighborhoods within the Boston area; the value indicates the number of crimes per person in a given town.
- ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
- INDUS: proportion of non-retail business acres per town. `Higher INDUS` values suggest a greater proportion of industrial or commercial land, while `lower values` indicate more residential or retail-focused areas.
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). Researchers and analysts use this variable to explore how proximity to the river affects housing prices, air quality, and other neighborhood characteristics.
- NX: nitric oxides concentration (parts per 10 million). Nitric oxides are produced during combustion processes, such as those in vehicles, industrial facilities, and power plants. The NX value for each town indicates the concentration of nitric oxides in the air; higher values imply greater pollution levels.
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centers. `Higher DIS` values indicate that a town is farther away from employment centers; `lower DIS` values suggest closer proximity to employment opportunities.
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per 10,000 dollars
- PTRATIO: pupil-teacher ratio by town. `Higher PTRATIO` values suggest larger class sizes and potentially less individual attention for students; `lower PTRATIO` values indicate smaller class sizes and potentially more personalized instruction.
- LSTAT: percentage of the population of lower status
- MEDV: median value of owner-occupied homes in 1000 dollars

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Load the Boston house pricing dataset

In [2]:
boston=pd.read_csv('8boston.csv')

In [None]:
boston.keys()

In [None]:
print(boston.MEDV)

In [None]:
boston.head()

In [None]:
boston.info()

In [None]:
#summarizing the stats of the data 
boston.describe()

In [None]:
# Boolean mask of missing values (True where a value is missing)
boston.isnull()

In [None]:
# number of missing values per column
boston.isnull().sum()

In [None]:
### Exploratory data analysis
## correlation
boston.corr()

There are two types of correlation to check: the correlation among the independent features, and the correlation between the dependent feature and each independent feature.<br>If two independent features are highly correlated (say, around 0.9 or -0.9), you can remove one of them, because that redundancy is what causes multicollinearity.
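
The check for highly correlated independent features can be automated. The following is a minimal sketch, not part of the original analysis: the helper name `high_corr_pairs`, the 0.9 threshold, and the synthetic `demo` data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.9):
    """List column pairs whose absolute Pearson correlation is >= threshold."""
    corr = df.corr().abs()
    cols = corr.columns
    pairs = []
    # Walk the upper triangle only, so each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] >= threshold:
                pairs.append((cols[i], cols[j], float(corr.iloc[i, j])))
    return pairs

# Synthetic stand-in data (hypothetical, not the Boston dataset)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "A": a,
    "B": 2 * a + rng.normal(scale=0.01, size=200),  # nearly a rescaled copy of A
    "C": rng.normal(size=200),                      # unrelated noise
})

pairs = high_corr_pairs(demo, threshold=0.9)
print(pairs)
```

On the notebook's data, the same helper could be called as `high_corr_pairs(boston.drop(columns="MEDV"))` so that only the independent features are compared.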

In [None]:
import seaborn as sns
# MEDV is continuous, so it is not passed as `hue` (hue works best with a few categories)
sns.pairplot(boston)

In [None]:
plt.scatter(boston['CRIM'], boston['MEDV'])
plt.xlabel('crime rate')
plt.ylabel('price')

In [None]:
plt.scatter(boston['RM'], boston['MEDV'])
plt.xlabel('ROOM')
plt.ylabel('price')

In [None]:
import seaborn as sns
sns.regplot(x='RM',y='MEDV',data=boston)

In [None]:
plt.scatter(boston['LSTAT'], boston['MEDV'])
plt.xlabel('LSTAT')
plt.ylabel('price')

In [None]:
import seaborn as sns
sns.regplot(x='LSTAT',y='MEDV',data=boston)

In [None]:
import seaborn as sns
sns.regplot(x='CHAS',y='MEDV',data=boston)

CHAS is a binary dummy variable, and the plot above shows only a weak linear relationship between CHAS and MEDV, so this feature is likely to contribute little to the linear regression.

In [None]:
import seaborn as sns
sns.regplot(x='PTRATIO',y='MEDV',data=boston)

In [19]:
# INDEPENDENT AND DEPENDENT FEATURES
X=boston.iloc[:,:-1]
y=boston.iloc[:,-1]

In [20]:
X = boston.drop(["MEDV"], axis=1)
y = boston["MEDV"]

In [None]:
X.head()

In [None]:
y.head()

In [23]:
# train test split
from sklearn.model_selection import train_test_split
X_train,X_test, y_train,y_test=train_test_split(X,y, test_size=0.3, random_state=42 )

In [None]:
X_train

In [None]:
X_test

In [26]:
## Standardise the dataset
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()

In [27]:
X_train=scaler.fit_transform(X_train)

In [28]:
# Only transform the test data, using the scaler fitted on the training data.
# Fitting on the test data would leak information about it into the model.
X_test=scaler.transform(X_test)

In [29]:
import pickle
pickle.dump(scaler,open('scalling.pkl','wb'))

In [None]:
X_train

In [None]:
X_test

# Model Training

In [32]:
from sklearn.linear_model import LinearRegression

In [33]:
regression=LinearRegression()

In [None]:
regression.fit(X_train,y_train)

In [None]:
#print the coefficient 
print (regression.coef_)

In [None]:
#print the intercept
print(regression.intercept_)

In [None]:
# hyperparameters the model was created with (not the learned coefficients)
regression.get_params()

In [38]:
#prediction with test data
reg_pred=regression.predict(X_test)

In [None]:
reg_pred

# Assumptions
If the data points are widely scattered instead of following a roughly linear trend, a linear regression model will not be able to predict them well.

In [None]:
#plot a scatter plot for the prediction
plt.scatter(y_test, reg_pred)

The points fall roughly along a straight line, which suggests that the predictions track the actual values and that the model performs reasonably well.

In [41]:
#error
residuals=y_test -reg_pred

In [None]:
residuals

In [None]:
#plot the residuals
sns.displot(residuals,kind='kde')

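The residuals mostly lie between -10 and +10, with some points between 10 and 30. Because the distribution is approximately normal, we can still assume that the model is performing reasonably well. Since the residuals are computed as actual minus predicted, the tail between 10 and 30 corresponds to a few houses whose prices the model underestimates.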
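
The visual reading of the density plot can be backed up with a few numbers. This is a minimal sketch under stated assumptions: the helper `summarize_residuals`, the band of 10, and the example values are hypothetical and are not taken from the notebook's output.

```python
import numpy as np

def summarize_residuals(residuals, band=10.0):
    """Return simple numeric checks for a set of regression residuals."""
    r = np.asarray(residuals, dtype=float)
    mean = r.mean()
    std = r.std()  # population standard deviation
    # Skewness: third standardized moment, 0 for a perfectly symmetric sample
    skew = np.mean(((r - mean) / std) ** 3)
    # Share of residuals whose absolute value lies within the chosen band
    share_in_band = np.mean(np.abs(r) <= band)
    return {"mean": mean, "std": std, "skew": skew, "share_in_band": share_in_band}

# Hypothetical residuals, in the same units as MEDV (thousands of dollars)
example_residuals = [-6.0, -3.0, -1.0, 0.0, 1.0, 2.0, 4.0, 27.0]
summary = summarize_residuals(example_residuals)
print(summary)
```

A mean near zero and a skewness near zero are consistent with a roughly normal shape; a clearly positive skewness points to a long right tail like the one described above. On the notebook's data the call would be `summarize_residuals(residuals)`.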

In [None]:
## scatter plot of the residuals against the predictions
## the points should be spread uniformly around zero, with no clear pattern
plt.scatter(reg_pred, residuals)

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

print(mean_absolute_error(y_test,reg_pred))
print(mean_squared_error(y_test,reg_pred))
print(np.sqrt(mean_squared_error(y_test,reg_pred)))

# R square and adjusted R square

### R^2 = 1 - SSR/SST
R^2 = coefficient of determination; SSR = sum of squares of residuals; SST = total sum of squares

In [None]:
from sklearn.metrics import r2_score
score=r2_score(y_test,reg_pred)
print(score)

The closer the score is to 1, the better the model fits the data.
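
The formula R^2 = 1 - SSR/SST can also be computed by hand, which makes it clear what the score measures. The sketch below is illustrative only: the function name `r_squared` and the four sample values are assumptions, not results from the model.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SSR/SST."""
    mean_y = sum(y_true) / len(y_true)
    # SSR: squared differences between actual and predicted values
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # SST: squared differences between actual values and their mean
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical actual and predicted values
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.5, 5.5, 7.0, 9.5]
r2_demo = r_squared(y_true_demo, y_pred_demo)
print(r2_demo)
```

Here SSR is 0.75 and SST is 20, so the score is 1 - 0.75/20. A model that always predicts the mean of `y_true` gives SSR = SST and therefore a score of 0.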

### Adjusted R2 = 1 - [(1 - R2)*(n - 1)/(n - k - 1)]
where:<br>
R2: the R2 of the model; n: the number of observations; k: the number of predictor variables
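
The adjusted R2 formula can be wrapped in a small function so that it is easy to reuse. This is a sketch: the name `adjusted_r2` and the input values in the example (an R2 of 0.70, 152 observations, 12 predictors) are hypothetical and chosen only to show the arithmetic.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("n must be greater than k + 1")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical inputs: R2 = 0.70, n = 152 observations, k = 12 predictors
adj_demo = adjusted_r2(0.70, n=152, k=12)
print(adj_demo)
```

Because (n - 1)/(n - k - 1) is greater than 1 whenever k is at least 1, the adjusted value is never larger than the plain R2, and the gap widens as more predictors are added.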

In [None]:
#display adjusted R-squared
1-(1-score)*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1)

# New Data Prediction

We want to be able to take new input data and see what the prediction will be. In practice, we will usually receive a single data point at a time and have to provide a prediction for it.
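
A single data point arrives as a one-dimensional array, while scikit-learn estimators expect two-dimensional input of shape `(n_samples, n_features)`. The sketch below uses a hypothetical placeholder row of 12 values to show what `reshape(1, -1)` does.

```python
import numpy as np

# Hypothetical single observation with 12 feature values (placeholder numbers)
single_row = np.arange(12, dtype=float)
print(single_row.shape)  # (12,)

# One sample (row) with as many columns as there are values
as_batch = single_row.reshape(1, -1)
print(as_batch.shape)    # (1, 12)
```

The `-1` tells NumPy to infer the number of columns from the length of the array, so the same call works for any number of features.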

In [None]:
print(boston.columns)

In [None]:
boston.iloc[0]

In [None]:
boston.iloc[0].shape

In [None]:
boston.iloc[0].values.reshape(1,-1)

In [None]:
boston.iloc[0].values.reshape(1,-1).shape

In [53]:
# Reshape your input data
X_input = boston.iloc[0].values.reshape(1, -1)

# Remove the last feature
X_input = X_input[:, :-1]

# Make the prediction
prediction = regression.predict(X_input)


In [None]:
prediction

In [55]:
# Suppose 'training_columns' are the columns used for training the model
training_columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'LSTAT']  # Replace with actual column names

# Align the prediction data with the training columns
prediction_data = boston[training_columns].iloc[0].values.reshape(1, -1)

# Make predictions
predictions = regression.predict(prediction_data)


In [None]:
predictions

We get a negative value because we did not standardise the new data. Below is how to fix it.

In [None]:
# transformation of new data.
scaler.transform(prediction_data)

In [None]:
regression.predict(scaler.transform(prediction_data))


Now we get the correct output. Every preprocessing step applied to the training data must also be applied to new data; this is how to make predictions on new data.
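
One way to avoid forgetting the scaling step is to bundle the scaler and the model into a scikit-learn pipeline. This is a sketch of an alternative, not the approach used in this notebook: the synthetic data and the names `demo_scaler`, `demo_model`, and `demo_pipeline` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the Boston dataset)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_demo = X_demo @ np.array([0.5, -0.2, 0.1]) + 4.0 + rng.normal(scale=0.1, size=100)

# Manual route, as in this notebook: fit the scaler, then fit the model on scaled data
demo_scaler = StandardScaler().fit(X_demo)
demo_model = LinearRegression().fit(demo_scaler.transform(X_demo), y_demo)

new_row = X_demo[0].reshape(1, -1)
wrong = demo_model.predict(new_row)                         # raw values: scaling skipped
right = demo_model.predict(demo_scaler.transform(new_row))  # scaled like the training data

# Pipeline route: the scaler is applied automatically inside predict()
demo_pipeline = make_pipeline(StandardScaler(), LinearRegression()).fit(X_demo, y_demo)
piped = demo_pipeline.predict(new_row)

print(wrong, right, piped)
```

The pipeline prediction matches the manually scaled one, while the unscaled prediction is far off. A pipeline can also be pickled as a single object, so the scaler and the model cannot get out of sync.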

In [None]:
predictions

In [60]:
# Suppose 'training_columns' are the columns used for training the model
training_columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'LSTAT']  # Replace with actual column names

# Align the prediction data with the training columns
# Make predictions
predictions = regression.predict(boston[training_columns].iloc[5].values.reshape(1, -1))


In [None]:
predictions

In [None]:
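# transformation of new data.
# prediction_data still holds row 0 at this point, so rebuild it for row 5 first
prediction_data = boston[training_columns].iloc[5].values.reshape(1, -1)
regression.predict(scaler.transform(prediction_data))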

# Pickling The Model file For Deployment

Now we are going to pickle our model so that we can use it for deployment.

In [63]:
import pickle

In [64]:
# Convert the model into a pickle file.
# Pickle is a serialised format, so the file can be loaded wherever a compatible
# Python and scikit-learn installation is available.
pickle.dump(regression,open('regmodel.pkl','wb'))

`regression` is the name of the model that I created.<br>`regmodel.pkl` is the file where the model is stored.<br>`wb` opens the file for writing in binary mode: if the file does not exist in the folder it is created, and if it already exists it is overwritten.
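
The file modes can be tried out without a trained model. In the sketch below a plain dictionary stands in for the fitted model (an assumption for illustration), and the file name `demo_model.pkl` is hypothetical. The `with` blocks close each file automatically once the block ends.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model
artifact = {"coef": [0.5, -0.2], "intercept": 4.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")

    # "wb": write in binary mode; creates the file or overwrites an existing one
    with open(path, "wb") as f:
        pickle.dump(artifact, f)

    # "rb": read in binary mode
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == artifact)  # True
```

The same pattern applies to the real model: `with open('regmodel.pkl', 'wb') as f: pickle.dump(regression, f)`.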

To see the file that you have created, go to `File`, then `Open`.

In [65]:
#this pickle file can also be loaded with the pickle library
pickled_model=pickle.load(open('regmodel.pkl','rb'))

In [None]:
pickled_model.predict(scaler.transform(prediction_data))

We can see that we get the same output as before.

So far we have created our model for the Boston dataset, tested it on new data, and pickled the model file for deployment.<br>Next we will see how to convert this project into an end-to-end project.<br>An end-to-end project should follow industry standards, which means using GitHub, a CI/CD pipeline, and a cloud platform where we can deploy the application. We will also create a simple front-end application.

Go to the Git website, download the setup file, and install it on your computer.<br>To check that the installation was successful, open a command prompt and type `git`. If you see a response, the installation was successful.<br>Create a GitHub account.

Go to your GitHub profile, open Repositories, and create a new repository: enter the name, add the README file, choose Python in the .gitignore option because that is the language used here, and pick any type of licence you want. Then go ahead and create the repository.<br>After the repository is created, you have to clone it to your local machine so that you can make commits to it.<br>In order to clone it, open a command prompt, go to the folder where you wish to save your work, and type `git clone **the link from your repository**`

If you go back and open your folder, you will see your repository. The next thing to do is to copy your pickle file and save it in the same folder as your repository.

The next step is to download Visual Studio Code, where we will do the entire end-to-end implementation.