# Project details - regression

**Background**: You are working as an analyst for a real estate company. Your company wants to build a machine learning model to predict the selling prices of houses based on a variety of features on which the value of the house is evaluated.

**Objective**: The task is to build a model that predicts the price of a house from the features provided in the dataset. Senior management also wants to explore the characteristics of the houses using a business intelligence tool. One of their goals is to understand which factors are responsible for higher property values (\$650K and above).
The questions are provided later in the document; you can use Tableau to answer them.

**Data**: The data set consists of information on some 22,000 properties. It contains historical data on houses sold between May 2014 and May 2015.
These are the definitions of the data points provided:
(Note: No definition is provided for variables that are self-explanatory.)


- id - Unique ID for each home sold
- date - Date of the home sale
- price - Price of each home sold
- bedrooms - Number of bedrooms
- bathrooms - Number of bathrooms, where .5 accounts for a room with a toilet but no shower
- sqft_living - Square footage of the apartment's interior living space
- sqft_lot - Square footage of the land space
- floors - Number of floors
- waterfront - A dummy variable for whether the apartment was overlooking the waterfront or not
- view - An index from 0 to 4 of how good the view of the property was
- condition - An index from 1 to 5 on the condition of the apartment
- grade - An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design.
- sqft_above - The square footage of the interior housing space that is above ground level
- sqft_basement - The square footage of the interior housing space that is below ground level
- yr_built - The year the house was initially built
- yr_renovated - The year of the house’s last renovation
- zipcode - What zipcode area the house is in
- lat - Latitude
- long - Longitude
- sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors
- sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors

### Exploring the data

We encourage you to thoroughly understand your data and take the necessary steps to prepare it for modeling before building exploratory or predictive models. Since this is a regression task, you can use linear regression to build a model. You are also encouraged to use other models in your project if necessary.
To explore the data, you can use the techniques that have been discussed in class. These include the describe method, checking for null values, and building visualizations with _matplotlib_ and _seaborn_.
The data has a number of categorical and numerical variables. Explore the nature of these variables before you start the data cleaning process and then the data pre-processing (scaling numerical variables and encoding categorical variables).
You can also use Tableau to explore the data visually.
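
The checks mentioned above can be sketched as follows. This is a minimal illustration rather than a prescribed workflow: it assumes a small hand-built frame, hypothetically named `sample`, that copies a few columns from the first five rows shown in the notebook output further down.

```python
import pandas as pd

# Hypothetical miniature sample: a few columns of the first five rows
# displayed later in the notebook output.
sample = pd.DataFrame({
    "price": [221900, 538000, 180000, 604000, 510000],
    "bedrooms": [3, 3, 2, 4, 3],
    "sqft_living": [1180, 2570, 770, 1960, 1680],
    "condition": [3, 3, 3, 5, 3],
    "grade": [7, 7, 6, 7, 8],
})

summary = sample.describe()          # count, mean, std, min, quartiles, max
null_counts = sample.isnull().sum()  # missing values per column
unique_counts = sample.nunique()     # few distinct values often hints at a categorical column
# Pearson correlation of every other column with the target.
price_corr = sample.corr()["price"].drop("price")

print(summary.loc[["mean", "min", "max"]])
print(null_counts)
print(unique_counts)
print(price_corr)
```

On the full dataset the same calls apply to `df`; columns with only a handful of distinct values (such as `condition` or `grade`) are candidates for categorical treatment.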

### Model

Build a regression model that best fits your data. You can use the accuracy measures that have been discussed in class.
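
The brief does not list which measures were covered in class. As an assumption, the sketch below computes three that are commonly used for regression (R², RMSE and MAE) with plain NumPy; the function name `regression_metrics` and the example prices are hypothetical.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE and MAE for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return {
        "r2": float(1.0 - ss_res / ss_tot),
        "rmse": float(np.sqrt(np.mean(residuals ** 2))),
        "mae": float(np.mean(np.abs(residuals))),
    }

# Hypothetical prices and predictions.
y_true = [200000, 400000, 600000]
y_pred = [250000, 400000, 550000]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

RMSE and MAE are expressed in the same unit as the target (dollars here), which makes them easier to communicate than R² alone.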

## First Steps 

- Check the columns
- Identify the column types (numerical/categorical, discrete/continuous, string/float/date), check the unique values, check the outliers, check the null values and decide whether to replace or drop them
- Importing Libraries
- Load the Dataset
- Locate Missing Data
- Check for Duplicates
- Detect Outliers 
- Normalize Casing 
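
Several of these steps can be sketched in a few lines of pandas. The frame `houses` and the helper `iqr_outliers` are hypothetical, and the 1.5 × IQR rule is only one possible outlier criterion, chosen here as an assumption; the 33-bedroom row mirrors the record examined later in the notebook.

```python
import pandas as pd

# Hypothetical frame; the last row mirrors the 33-bedroom record examined later.
houses = pd.DataFrame({
    "bedrooms": [3, 3, 2, 4, 3, 33],
    "floors": [1.0, 2.0, 1.0, 1.0, 1.0, 1.0],
    "zipcode": [98178, 98125, 98028, 98136, 98074, 98103],
})

print(houses.dtypes)              # storage type of each column
print(houses.nunique())           # number of distinct values per column
print(houses.isnull().sum())      # null count per column
print(houses.duplicated().sum())  # number of fully duplicated rows

def iqr_outliers(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

flagged = houses[iqr_outliers(houses["bedrooms"])]
print(flagged)
```

A flagged row is a candidate for inspection, not automatically an error; whether to drop or keep it remains a judgment call.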

In [1]:
#Importing libraries

import pandas as pd
import numpy as np
import openpyxl

In [2]:
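# Loading dataset
df = pd.read_excel("Data/Data_MidTerm_Project_Real_State_Regression.xls")
# Show every column when a DataFrame is displayed (the full option key avoids ambiguous matches)
pd.set_option('display.max_columns', None)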

In [3]:
# Dropping the id and date columns, they won't be needed for the analysis

df.drop(['date','id'], axis=1, inplace=True)

In [4]:
# Checking for null values

df.isnull().sum()

bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
grade            0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated     0
zipcode          0
lat              0
long             0
sqft_living15    0
sqft_lot15       0
price            0
dtype: int64

In [5]:
# Checking for duplicates; the result is only displayed, not assigned, so df itself is unchanged

df.drop_duplicates()

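| | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 1.00 | 1180 | 5650 | 1.0 | 0 | 0 | 3 | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 | 221900 |
| 1 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | 3 | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.7210 | -122.319 | 1690 | 7639 | 538000 |
| 2 | 2 | 1.00 | 770 | 10000 | 1.0 | 0 | 0 | 3 | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 | 180000 |
| 3 | 4 | 3.00 | 1960 | 5000 | 1.0 | 0 | 0 | 5 | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 | 604000 |
| 4 | 3 | 2.00 | 1680 | 8080 | 1.0 | 0 | 0 | 3 | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 | 510000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21592 | 3 | 2.50 | 1530 | 1131 | 3.0 | 0 | 0 | 3 | 8 | 1530 | 0 | 2009 | 0 | 98103 | 47.6993 | -122.346 | 1530 | 1509 | 360000 |
| 21593 | 4 | 2.50 | 2310 | 5813 | 2.0 | 0 | 0 | 3 | 8 | 2310 | 0 | 2014 | 0 | 98146 | 47.5107 | -122.362 | 1830 | 7200 | 400000 |
| 21594 | 2 | 0.75 | 1020 | 1350 | 2.0 | 0 | 0 | 3 | 7 | 1020 | 0 | 2009 | 0 | 98144 | 47.5944 | -122.299 | 1020 | 2007 | 402101 |
| 21595 | 3 | 2.50 | 1600 | 2388 | 2.0 | 0 | 0 | 3 | 8 | 1600 | 0 | 2004 | 0 | 98027 | 47.5345 | -122.069 | 1410 | 1287 | 400000 |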
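
Because `drop_duplicates()` returns a new DataFrame rather than modifying the one it is called on, it can help to count duplicates explicitly before deciding what to keep. A minimal sketch on a hypothetical three-row frame named `demo`:

```python
import pandas as pd

# Hypothetical frame with one repeated row.
demo = pd.DataFrame({
    "bedrooms": [3, 3, 2],
    "price": [221900, 221900, 180000],
})

n_duplicates = int(demo.duplicated().sum())  # rows identical to an earlier row
deduped = demo.drop_duplicates()             # new frame; demo keeps all of its rows

print(n_duplicates)
print(len(demo), len(deduped))
```

To make the removal permanent, the result has to be assigned back, for example `demo = demo.drop_duplicates()`.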


In [6]:
# We assume that a house has been renovated when yr_renovated is nonzero and we create a new boolean column

df["renovated"] = df["yr_renovated"] != 0

In [7]:
# We assume that a house has a basement when the basement sqft value is nonzero and we create a new boolean column

df["basement"] = df["sqft_basement"] != 0

In [8]:
# Grouping construction years into four bins of roughly 30 years each (stored in the 'decade' column) to categorize the houses more easily

bins = [1899,1929,1959,1989,2015]

labels =["Category A","Category B","Category C","Category D"]

df['decade'] = pd.cut(df['yr_built'], bins,labels=labels)
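
With explicit bin edges, `pd.cut` builds right-closed intervals by default, so a year equal to an edge falls into the bin that ends there and a value equal to the lowest edge is left unassigned. A small sketch using the same edges and labels on a hypothetical series `years`:

```python
import pandas as pd

years = pd.Series([1900, 1929, 1930, 1989, 1990, 2015])  # hypothetical build years
bins = [1899, 1929, 1959, 1989, 2015]
labels = ["Category A", "Category B", "Category C", "Category D"]

# Intervals are (1899, 1929], (1929, 1959], (1959, 1989], (1989, 2015].
categories = pd.cut(years, bins, labels=labels)
print(categories.tolist())

# A value on the lowest edge lies outside the first interval and becomes NaN.
outside = pd.cut(pd.Series([1899]), bins, labels=labels)
print(outside.isna().tolist())
```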

In [9]:
# Grouping houses geographically into 3 zones: south, centre and north

labels =["south","centre","north"]

df['geo1'] = pd.cut(df['lat'],3,labels=labels)
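
Passing an integer to `pd.cut` splits the observed range into intervals of equal width, so the zones need not contain similar numbers of houses. The sketch below contrasts this with `pd.qcut`, which splits by quantiles; the latitudes are hypothetical and `qcut` is shown only as a possible alternative, not as the notebook's method.

```python
import pandas as pd

# Hypothetical latitudes spanning 47.0 to 48.0.
lat = pd.Series([47.0, 47.1, 47.2, 47.5, 47.9, 48.0])
zones = ["south", "centre", "north"]

equal_width = pd.cut(lat, 3, labels=zones)   # three intervals of equal width
equal_count = pd.qcut(lat, 3, labels=zones)  # three intervals with similar row counts

width_counts = equal_width.value_counts().to_dict()
count_counts = equal_count.value_counts().to_dict()
print(width_counts)
print(count_counts)
```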

In [10]:
# Grouping houses geographically into 3 zones: west, centre and east

labels =["west","centre","east"]
bins = [-123,-122.230,-122,-121]

df['geo2'] = pd.cut(df['long'],bins,labels=labels)

In [11]:
# Checking the bins

df["geo2"].value_counts()

west      10822
centre     9306
east       1469
Name: geo2, dtype: int64

In [12]:
# Computing the average price per zipcode and attaching it to every row

df['avg_price_by_zipcode'] = round(df.groupby(['zipcode'])['price'].transform('mean'),2)
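
`transform('mean')` differs from a plain aggregation in that it returns one value per original row, which is what allows the result to be stored as a new column. A minimal sketch on a hypothetical frame named `sales`:

```python
import pandas as pd

# Hypothetical sales in two zipcodes.
sales = pd.DataFrame({
    "zipcode": [98178, 98178, 98103, 98103, 98103],
    "price": [200000, 300000, 500000, 600000, 700000],
})

# Aggregation: one value per zipcode.
per_zip = sales.groupby("zipcode")["price"].mean()
# Transform: the group mean repeated on every row of the group.
per_row = sales.groupby("zipcode")["price"].transform("mean").round(2)

sales["avg_price_by_zipcode"] = per_row
print(per_zip)
print(sales)
```

Since this column is derived from the target, one option when it is used as a model feature is to compute the group means on the training rows only and map them onto the test rows.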

In [13]:
# Checking an outlier: house with 33 bedrooms 

df.loc[df["bedrooms"] == 33]

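| | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 | price | renovated | basement | decade | geo1 | geo2 | avg_price_by_zipcode |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15856 | 33 | 1.75 | 1620 | 6000 | 1.0 | 0 | 0 | 5 | 7 | 1040 | 580 | 1947 | 0 | 98103 | 47.6878 | -122.331 | 1330 | 4700 | 640000 | False | True | Category B | north | west | 585048.78 |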


In [14]:
# The numbers don't make sense, it's a faulty record; removing the row

df.drop(df[df.bedrooms == 33].index, inplace=True)

In [15]:
# Saving cleaned data for modelling

df.to_excel("Data/midterm_project_cleaned.xlsx")

In [16]:
import statsmodels.api as sm

# X_train_scaled_ready, X_test_scaled and y_train are expected to come from the
# modelling steps (train/test split and scaling), which are not shown in this section.
X_train_const_scaled = sm.add_constant(X_train_scaled_ready) # adding a constant

model = sm.OLS(y_train, X_train_const_scaled).fit()
predictions_train = model.predict(X_train_const_scaled)

X_test_const_scaled = sm.add_constant(X_test_scaled) # adding a constant
predictions_test = model.predict(X_test_const_scaled)
print_model = model.summary()
print(print_model)
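
The cell above depends on training and test matrices that are not defined in this section. The sketch below shows, under stated assumptions, what the surrounding steps could look like: it uses synthetic data, a simple positional split, and NumPy's least-squares solver in place of statsmodels, so every name, number and the helper `add_constant` defined here is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 rows, 2 features, target = 300 + 50*x1 - 20*x2 + noise.
X = rng.normal(size=(200, 2))
y = 300 + 50 * X[:, 0] - 20 * X[:, 1] + rng.normal(scale=1.0, size=200)

# Hold out the last 25% of the rows as a test set.
split = 150
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Standardize with statistics taken from the training rows only.
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

def add_constant(features):
    """Prepend a column of ones so the model can fit an intercept."""
    return np.column_stack([np.ones(len(features)), features])

# Ordinary least squares: coef[0] is the intercept, coef[1:] are the slopes.
coef, *_ = np.linalg.lstsq(add_constant(X_train_scaled), y_train, rcond=None)
predictions_test = add_constant(X_test_scaled) @ coef

# R^2 on the held-out rows.
ss_res = np.sum((y_test - predictions_test) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_test = 1 - ss_res / ss_tot
print(coef, r2_test)
```

Scaling the test rows with the training mean and standard deviation keeps information from the test set out of the fitted model.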