# Project details - regression

**Background**: You are working as an analyst for a real estate company. Your company wants to build a machine learning model to predict the selling prices of houses based on a variety of features on which the value of the house is evaluated.

**Objective**: The task is to build a model that will predict the price of a house based on the features provided in the dataset. Senior management also wants to explore the characteristics of the houses using a business intelligence tool. One of their goals is to understand which factors are responsible for higher property values (\$650K and above).
The questions are provided later in the document; you can use Tableau to answer them.

**Data**: The data set consists of information on some 22,000 properties. It contains historic data on houses sold between May 2014 and May 2015.
These are the definitions of the data points provided:
(Note: No definition is provided for variables that are self-explanatory.)


id - Unique ID for each home sold

date - Date of the home sale

price - Price of each home sold

bedrooms - Number of bedrooms

bathrooms - Number of bathrooms, where .5 accounts for a room with a toilet but no shower

sqft_living - Square footage of the apartment's interior living space

sqft_lot - Square footage of the land space

floors - Number of floors

waterfront - A dummy variable for whether the apartment was overlooking the waterfront or not

view - An index from 0 to 4 of how good the view of the property was

condition - An index from 1 to 5 on the condition of the apartment

grade - An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design.

sqft_above - The square footage of the interior housing space that is above ground level

sqft_basement - The square footage of the interior housing space that is below ground level

yr_built - The year the house was initially built

yr_renovated - The year of the house’s last renovation

zipcode - What zipcode area the house is in

lat - Latitude

long - Longitude

sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors

sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors

### Exploring the data

We encourage you to thoroughly understand your data and take the necessary steps to prepare it for modeling before building exploratory or predictive models. Since this is a regression task, you can use linear regression to build a model. You are also encouraged to use other models in your project if necessary.
To explore the data, you can use the techniques that have been discussed in class. These include using the `describe` method, checking for null values, and using _matplotlib_ and _seaborn_ to develop visualizations.
The data has a number of categorical and numerical variables. Explore the nature of the data for these variables before you start the data cleaning process and then the data pre-processing (scaling numerical variables and encoding categorical variables).
You can also use Tableau to visually explore the data further.
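The exploration and pre-processing steps described above could look roughly like the sketch below. It runs on a tiny made-up sample rather than the real file, and the choice of which columns count as categorical (`condition`, `waterfront`) is an assumption made only for illustration.

```python
import pandas as pd

# Tiny made-up sample standing in for the real dataset (values are illustrative only).
sample = pd.DataFrame({
    "price": [221900, 538000, 180000, 604000],
    "sqft_living": [1180, 2570, 770, 1960],
    "condition": [3, 3, 3, 5],
    "waterfront": [0, 0, 0, 1],
})

# Summary statistics and null counts per column.
summary = sample.describe()
null_counts = sample.isna().sum()
print(summary)
print(null_counts)

# Assumption: these two columns are treated as categorical, sqft_living as numerical.
categorical_cols = ["condition", "waterfront"]
numerical_cols = ["sqft_living"]

# Standardize the numerical columns (zero mean, unit variance); the target is left alone.
num = sample[numerical_cols]
scaled = (num - num.mean()) / num.std(ddof=0)

# One-hot encode the categorical columns.
encoded = pd.get_dummies(sample[categorical_cols].astype("category"))

features = pd.concat([scaled, encoded], axis=1)
print(features)
```

Scaling is done by hand here to keep the example dependency-free; a library scaler would serve the same purpose.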

### Model

Build a regression model that best fits your data. You can use the accuracy measures that have been discussed in class.
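As a minimal sketch of fitting and scoring a linear regression, the example below uses synthetic data and plain NumPy. The brief does not list which accuracy measures were covered in class, so MAE, RMSE and R² are assumed here as common choices; the data, the 75/25 split and all variable names are hypothetical.

```python
import numpy as np

# Made-up data: price generated from living area plus noise (illustrative only).
rng = np.random.default_rng(0)
n_rows = 200
sqft_living = rng.uniform(500, 4000, size=n_rows)
price = 50_000 + 250 * sqft_living + rng.normal(0, 20_000, size=n_rows)

# Hold out the last 25% of rows as a test set.
split = 150
X_train = np.column_stack([np.ones(split), sqft_living[:split]])
X_test = np.column_stack([np.ones(n_rows - split), sqft_living[split:]])
y_train, y_test = price[:split], price[split:]

# Ordinary least squares fit (intercept and slope).
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
pred = X_test @ coef

# Error measures computed on the held-out rows only.
mae = np.mean(np.abs(y_test - pred))
rmse = np.sqrt(np.mean((y_test - pred) ** 2))
r2 = 1 - np.sum((y_test - pred) ** 2) / np.sum((y_test - y_test.mean()) ** 2)
print(f"MAE={mae:.0f}  RMSE={rmse:.0f}  R2={r2:.3f}")
```

Scoring on rows the model has not seen gives a more honest picture of how it would behave on new houses than scoring on the training rows.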

## First Steps 

- Check the columns
- Identify the column types (numerical/categorical, discrete/continuous, string/float/date), check the unique values, check the outliers, check the null values and decide whether to replace or drop them
- Importing Libraries
- Input Dataset
- Locate Missing Data
- Check for Duplicates
- Detect Outliers 
- Normalize Casing 
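
Several of the checks in this list can be sketched on a small made-up frame, as below. The `City` column and all values are hypothetical and do not exist in the project dataset; they are only there to show the casing and duplicate checks.

```python
import pandas as pd

# Made-up rows: one missing value, one exact duplicate, inconsistent casing.
raw = pd.DataFrame({
    "Bedrooms": [3, 3, 2, None],
    "City": ["Seattle", "Seattle", "BELLEVUE", "kent"],
})

# Column types and number of distinct values per column.
print(raw.dtypes)
print(raw.nunique())

# Locate missing data and count fully duplicated rows.
missing_per_column = raw.isna().sum()
duplicate_rows = raw.duplicated().sum()

# Normalize the casing of column names and of string values.
clean = raw.rename(columns=str.lower)
clean["city"] = clean["city"].str.lower()

# Dropping is only one option; replacing missing values is the alternative.
clean = clean.drop_duplicates().dropna(subset=["bedrooms"])
print(clean)
```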

In [1]:
#Importing libraries

import pandas as pd
import numpy as np
import openpyxl

In [2]:
#Input dataset
df = pd.read_excel("Data/Data_MidTerm_Project_Real_State_Regression.xls")
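#Use the full option name: the short form 'max_columns' is ambiguous in recent pandas versions
pd.set_option('display.max_columns', None)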

In [3]:
#We are dropping the columns that won't be needed for the analysis
df.drop(['date','id'], axis=1, inplace=True)

In [4]:
#We assume that a house has been renovated when yr_renovated is not 0 and we create a new column
df["renovated"] = df["yr_renovated"] != 0

In [5]:
#We assume that a house has a basement when the basement sqft value is not 0 and we create a new column
df["basement"] = df["sqft_basement"] != 0

In [6]:
#We will group the construction years into four bins of about 30 years each to categorize the houses more easily
bins = [1899,1929,1959,1989,2015]

labels =["Category A","Category B","Category C","Category D"]

df['decade'] = pd.cut(df['yr_built'], bins,labels=labels)

print(df)

       bedrooms  bathrooms  sqft_living  sqft_lot  floors  waterfront  view  \
0             3       1.00         1180      5650     1.0           0     0   
1             3       2.25         2570      7242     2.0           0     0   
2             2       1.00          770     10000     1.0           0     0   
3             4       3.00         1960      5000     1.0           0     0   
4             3       2.00         1680      8080     1.0           0     0   
...         ...        ...          ...       ...     ...         ...   ...   
21592         3       2.50         1530      1131     3.0           0     0   
21593         4       2.50         2310      5813     2.0           0     0   
21594         2       0.75         1020      1350     2.0           0     0   
21595         3       2.50         1600      2388     2.0           0     0   
21596         2       0.75         1020      1076     2.0           0     0   

       condition  grade  sqft_above  sqft_basement 
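
To see how `pd.cut` treats values that sit exactly on a bin edge, the short check below reuses the bin edges and labels from the cell above on a handful of hand-picked years. The sample years are made up for illustration.

```python
import pandas as pd

# A few construction years, including values sitting exactly on bin edges.
years = pd.Series([1900, 1929, 1930, 1989, 2015])
bins = [1899, 1929, 1959, 1989, 2015]
labels = ["Category A", "Category B", "Category C", "Category D"]

# With the default right=True, each bin is (left, right]:
# 1929 lands in Category A and 1930 in Category B.
binned = pd.cut(years, bins, labels=labels)
print(binned.tolist())

# Values outside the outer edges become NaN, so it is worth counting them.
outside = pd.cut(pd.Series([1899, 2016]), bins, labels=labels)
print(outside.isna().sum())
```

Running `df['decade'].isna().sum()` after binning is a quick way to confirm that no house fell outside the chosen edges.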

In [7]:
#We will group the latitudes into three equal-width bins (the lowest latitude is the southernmost)

labels =["south","centre","north"]

df['geo1'] = pd.cut(df['lat'],3,labels=labels)

print(df)

       bedrooms  bathrooms  sqft_living  sqft_lot  floors  waterfront  view  \
0             3       1.00         1180      5650     1.0           0     0   
1             3       2.25         2570      7242     2.0           0     0   
2             2       1.00          770     10000     1.0           0     0   
3             4       3.00         1960      5000     1.0           0     0   
4             3       2.00         1680      8080     1.0           0     0   
...         ...        ...          ...       ...     ...         ...   ...   
21592         3       2.50         1530      1131     3.0           0     0   
21593         4       2.50         2310      5813     2.0           0     0   
21594         2       0.75         1020      1350     2.0           0     0   
21595         3       2.50         1600      2388     2.0           0     0   
21596         2       0.75         1020      1076     2.0           0     0   

       condition  grade  sqft_above  sqft_basement 

In [8]:
#We will group the longitudes into three bins with manually chosen edges (the lowest longitude is the westernmost)

labels =["west","centre","east"]
bins = [-123,-122.230,-122,-121]

df['geo2'] = pd.cut(df['long'],bins,labels=labels)

print(df)

       bedrooms  bathrooms  sqft_living  sqft_lot  floors  waterfront  view  \
0             3       1.00         1180      5650     1.0           0     0   
1             3       2.25         2570      7242     2.0           0     0   
2             2       1.00          770     10000     1.0           0     0   
3             4       3.00         1960      5000     1.0           0     0   
4             3       2.00         1680      8080     1.0           0     0   
...         ...        ...          ...       ...     ...         ...   ...   
21592         3       2.50         1530      1131     3.0           0     0   
21593         4       2.50         2310      5813     2.0           0     0   
21594         2       0.75         1020      1350     2.0           0     0   
21595         3       2.50         1600      2388     2.0           0     0   
21596         2       0.75         1020      1076     2.0           0     0   

       condition  grade  sqft_above  sqft_basement 

In [9]:
#We check the created bins

df["geo2"].value_counts()

west      10822
centre     9306
east       1469
Name: geo2, dtype: int64

In [10]:
#We remove the outlier row with 33 bedrooms
df = df.loc[df["bedrooms"] != 33 ]

print(df)

       bedrooms  bathrooms  sqft_living  sqft_lot  floors  waterfront  view  \
0             3       1.00         1180      5650     1.0           0     0   
1             3       2.25         2570      7242     2.0           0     0   
2             2       1.00          770     10000     1.0           0     0   
3             4       3.00         1960      5000     1.0           0     0   
4             3       2.00         1680      8080     1.0           0     0   
...         ...        ...          ...       ...     ...         ...   ...   
21592         3       2.50         1530      1131     3.0           0     0   
21593         4       2.50         2310      5813     2.0           0     0   
21594         2       0.75         1020      1350     2.0           0     0   
21595         3       2.50         1600      2388     2.0           0     0   
21596         2       0.75         1020      1076     2.0           0     0   

       condition  grade  sqft_above  sqft_basement 

In [11]:
#We save the clean data
df.to_excel("Data/midterm_project_cleaned.xlsx")

# Notes


- Id: unique values, not needed for the analysis or the regression, to drop
- Bedrooms: outliers (11, 33); we could keep only the values inside a certain quartile range or z value and drop the rest
- Bathrooms: check the outliers
- SQFT values: maybe we can categorize the values in bins?
- Condition, grade: to categorize?
- yr_renovated: a lot of zero values; maybe we could create a "renovated" column with yes/no values instead and categorize it
- zipcode: maybe needed for the map, otherwise could be dropped?
- No duplicates
- Why are bathrooms and floors floats and not integers? Because of the bathroom counting system
- What is the difference between sqft_living and sqft_living15?
- Should we convert the years of construction to datetime?
- Do we filter the price starting from 650k?
- We could plot, according to the tableau_regression file, the price per bedrooms, per bathrooms, etc.
- Keep yr_built (do not drop it)
- Group latitude by north, centre, south (the lowest latitude is the southernmost) and group longitude by west, centre, east (the lowest longitude is the westernmost)
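
The quartile and z value ideas from the Bedrooms note can be compared on a made-up series, as in the sketch below. The bedroom counts are invented, and the cut-offs (1.5 times the IQR, 3 standard deviations) are conventional defaults assumed here, not values taken from the project.

```python
import pandas as pd

# Made-up bedroom counts with two extreme values (illustrative only).
bedrooms = pd.Series([3] * 40 + [2] * 20 + [4] * 30 + [5] * 8 + [11, 33])

# IQR rule: keep values within 1.5 * IQR of the quartiles.
q1, q3 = bedrooms.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_mask = bedrooms.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# z-score rule: keep values within 3 standard deviations of the mean.
z = (bedrooms - bedrooms.mean()) / bedrooms.std()
z_mask = z.abs() <= 3

print("kept by IQR rule:", int(iqr_mask.sum()), "max:", bedrooms[iqr_mask].max())
print("kept by z rule:  ", int(z_mask.sum()), "max:", bedrooms[z_mask].max())
```

The two rules need not agree: on this sample the z-score rule keeps the 11-bedroom entry while the IQR rule drops it, so the choice of rule changes which rows survive.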