# Lab Assignment 3: Extending Logistic Regression

Gabs DiLiegro, London Kasper, Carys LeKander

# 1. Preparation and Overview


Dataset: https://www.kaggle.com/datasets/dansbecker/melbourne-housing-snapshot

## Define and Prepare Class Variables

In [6]:
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

df = pd.read_csv('melb_data.csv')

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         13580 non-null  object 
 1   Address        13580 non-null  object 
 2   Rooms          13580 non-null  int64  
 3   Type           13580 non-null  object 
 4   Price          13580 non-null  float64
 5   Method         13580 non-null  object 
 6   SellerG        13580 non-null  object 
 7   Date           13580 non-null  object 
 8   Distance       13580 non-null  float64
 9   Postcode       13580 non-null  float64
 10  Bedroom2       13580 non-null  float64
 11  Bathroom       13580 non-null  float64
 12  Car            13518 non-null  float64
 13  Landsize       13580 non-null  float64
 14  BuildingArea   7130 non-null   float64
 15  YearBuilt      8205 non-null   float64
 16  CouncilArea    12211 non-null  object 
 17  Lattitude      13580 non-null  float64
 18  Longti

In Assignment 1, we explained why we decided to remove certain attributes. The attributes we kept and the ones we removed are described below.

### Attributes used:
- Suburb: the name of the suburb that each property is in. (String object)
- Rooms: the total number of rooms for each property. (int64)
- Type: the type of property (String object; h = house/cottage/villa/semi/terrace, u = unit/duplex, t = townhouse)
- Price: listing price in Australian dollars (float64)
- Distance: the distance from the property to the Melbourne central business district (CBD) (float64)
- Postcode: the postal code the property falls within (float64)
- Bathroom: the number of bathrooms (float64)
- Car: the number of parking spots (float64)
- Landsize: the size of the land in square meters (float64)
- Regionname: general region of the property (String object)

### Attributes removed and why:
- Address: the address of the property (String object)
- Method: the way the property was listed (String object)
- SellerG: name of the real estate agent listing the property (String object)
- Date: sale date in dd/mm/yyyy format (String object)
    - We are only considering the features of the house itself, not data about how or when it was sold, so we excluded Method, SellerG, and Date.
- Bedroom2: the number of bedrooms (float64: scraped from a different source)
    - Rooms and Bedroom2 contain the same information collected from different sources. We have seen that these features are very strongly correlated and have very similar data, so we are excluding Bedroom2 for simplicity.
- Propertycount: number of properties in the same suburb (float64)
- BuildingArea: area of the building in square meters (float64)
    - About 47% of the rows are missing this value, so we decided to remove it.
- YearBuilt: year the house was built (float64)
    - Although we think YearBuilt could carry useful information for our model, about 40% of the rows are missing it, so we decided to remove it.
- CouncilArea: the governing council for the area (String object)
    - The CouncilArea stopped being recorded after a certain date. Therefore we decided to remove this attribute as we did not want it to skew our set
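- Lattitude: latitude of the property (float64; the column name is spelled this way in the dataset)
- Longtitude: longitude of the property (float64; the column name is spelled this way in the dataset)
    - Neither Longtitude nor Lattitude gives us more useful information than the other location attributes we already keep (Suburb, Postcode, Regionname, Distance).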
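
The missing-data percentages and the Rooms/Bedroom2 correlation cited in the list above can be checked with a few pandas calls. The sketch below is only an illustration: `toy` is a small made-up frame, not the Melbourne data, and its values are assumptions chosen to keep the example short. On the real data, the same calls would be run on `df` before any columns are dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "Rooms":        [2, 3, 4, 3, 2],
    "Bedroom2":     [2.0, 3.0, 4.0, 3.0, 2.0],
    "BuildingArea": [79.0, np.nan, 150.0, np.nan, np.nan],
})

# Share of missing values per column, as a percentage.
missing_pct = toy.isna().mean() * 100

# Pearson correlation between the two room counts.
rooms_corr = toy["Rooms"].corr(toy["Bedroom2"])

print(missing_pct.round(1))
print(f"Rooms vs Bedroom2 correlation: {rooms_corr:.2f}")
```

A column with a high missing percentage is a candidate for removal, and a correlation close to 1 suggests that one of the two columns is redundant.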

In [8]:
# Drop the columns we decided not to model (see the list above)
cols_to_drop = ['Address', 'Method', 'SellerG', 'Date', 'Bedroom2', 'Propertycount',
                'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude', 'Longtitude']
df = df.drop(columns=cols_to_drop)

The only remaining column with missing data is Car (parking spots): 62 of its 13,580 values are missing.

In [9]:
df.Car.describe()

count    13518.000000
mean         1.610075
std          0.962634
min          0.000000
25%          1.000000
50%          2.000000
75%          2.000000
max         10.000000
Name: Car, dtype: float64

Since there are so few missing points and the interquartile range is small, we used the median (2 spots) to fill the missing values. 

In [10]:
# Fill the missing Car values with the column median
df.Car = df.Car.fillna(df.Car.median())

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Suburb      13580 non-null  object 
 1   Rooms       13580 non-null  int64  
 2   Type        13580 non-null  object 
 3   Price       13580 non-null  float64
 4   Distance    13580 non-null  float64
 5   Postcode    13580 non-null  float64
 6   Bathroom    13580 non-null  float64
 7   Car         13580 non-null  float64
 8   Landsize    13580 non-null  float64
 9   Regionname  13580 non-null  object 
dtypes: float64(6), int64(1), object(3)
memory usage: 1.0+ MB


Next, we normalize our numeric features to the range [0, 1] with min-max scaling. Price is left unscaled.

In [11]:
from sklearn.preprocessing import MinMaxScaler

cols_to_norm = ['Rooms','Distance','Bathroom','Car','Landsize']
df[cols_to_norm] = MinMaxScaler().fit_transform(df[cols_to_norm])

In [12]:
df.head()

|   | Suburb | Rooms | Type | Price | Distance | Postcode | Bathroom | Car | Landsize | Regionname |
|---|--------|-------|------|-------|----------|----------|----------|-----|----------|------------|
| 0 | Abbotsford | 0.111111 | h | 1480000.0 | 0.051975 | 3067.0 | 0.125 | 0.1 | 0.000466 | Northern Metropolitan |
| 1 | Abbotsford | 0.111111 | h | 1035000.0 | 0.051975 | 3067.0 | 0.125 | 0.0 | 0.00036 | Northern Metropolitan |
| 2 | Abbotsford | 0.222222 | h | 1465000.0 | 0.051975 | 3067.0 | 0.25 | 0.0 | 0.000309 | Northern Metropolitan |
| 3 | Abbotsford | 0.222222 | h | 850000.0 | 0.051975 | 3067.0 | 0.25 | 0.1 | 0.000217 | Northern Metropolitan |
| 4 | Abbotsford | 0.333333 | h | 1600000.0 | 0.051975 | 3067.0 | 0.125 | 0.2 | 0.000277 | Northern Metropolitan |
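
`MinMaxScaler` rescales each column independently with `(x - min) / (max - min)`. For Car, whose minimum is 0 and maximum is 10 according to the `describe()` output earlier, a raw value of 1 would map to 0.1 and a raw value of 2 to 0.2, which is consistent with the Car values shown above. The sketch below checks the formula against scikit-learn on a made-up array; the `car` values are assumptions for illustration, not rows from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up parking-spot counts spanning 0 to 10 (illustrative values only).
car = np.array([[0.0], [1.0], [2.0], [10.0]])

# Min-max scaling by hand: (x - min) / (max - min).
manual = (car - car.min()) / (car.max() - car.min())

# The same transformation via scikit-learn.
scaled = MinMaxScaler().fit_transform(car)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

Because the scaling depends on the column's minimum and maximum, a single extreme value compresses the rest of the column toward 0, which may explain why the Landsize values above are so small.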


Then, we one-hot encode our categorical data.

Type has 3 unique values ('h' for house, 'u' for unit, and 't' for townhouse) and Regionname has 8 unique regions. We one-hot encode both below and show the new variables we created.
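
As a small illustration of what one-hot encoding produces, the sketch below applies `pd.get_dummies` to a made-up sample of the Type column. The frame `toy` is hypothetical and exists only for this example.

```python
import pandas as pd

# Hypothetical mini-sample of the Type column (h = house, u = unit, t = townhouse).
toy = pd.DataFrame({"Type": ["h", "u", "t", "h"]})

# One indicator column per category, named with the given prefix.
dummies = pd.get_dummies(toy["Type"], prefix="Type")

print(list(dummies.columns))
print(dummies.astype(int))
```

Each row has exactly one indicator set, and the new columns are named `Type_h`, `Type_t`, and `Type_u`.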

In [13]:
df = pd.concat([df,pd.get_dummies(df['Type'], prefix='Type')],axis=1)
df.drop(['Type'],axis=1, inplace=True)

df = pd.concat([df,pd.get_dummies(df['Regionname'], prefix='Regionname')],axis=1)
df.drop(['Regionname'],axis=1, inplace=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Data columns (total 19 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   Suburb                                 13580 non-null  object 
 1   Rooms                                  13580 non-null  float64
 2   Price                                  13580 non-null  float64
 3   Distance                               13580 non-null  float64
 4   Postcode                               13580 non-null  float64
 5   Bathroom                               13580 non-null  float64
 6   Car                                    13580 non-null  float64
 7   Landsize                               13580 non-null  float64
 8   Type_h                                 13580 non-null  uint8  
 9   Type_t                                 13580 non-null  uint8  
 10  Type_u                                 13580 non-null  uint8  
 11  Re

Postcode and Suburb have far more unique values than Type and Regionname, so encoding them adds many more columns. We one-hot encode them below.

In [14]:
df = pd.concat([df,pd.get_dummies(df['Postcode'], prefix='Postcode')],axis=1)
df.drop(['Postcode'],axis=1, inplace=True)

df = pd.concat([df,pd.get_dummies(df['Suburb'], prefix='Suburb')],axis=1)
df.drop(['Suburb'],axis=1, inplace=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Columns: 529 entries, Rooms to Suburb_Yarraville
dtypes: float64(6), uint8(523)
memory usage: 7.4 MB
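
The frame grew from 19 columns to 529: removing Postcode and Suburb leaves 17 columns, so the two encodings added 512 indicator columns. The number of columns an encoding will add can be estimated beforehand from the number of distinct values in each column. The sketch below shows this on a made-up frame; `toy`, `cat_cols`, and the sample rows are assumptions for illustration.

```python
import pandas as pd

# Hypothetical mini-sample; the suburb/postcode pairs are for illustration only.
toy = pd.DataFrame({
    "Suburb":   ["Abbotsford", "Abbotsford", "Richmond", "Yarraville"],
    "Postcode": [3067.0, 3067.0, 3121.0, 3013.0],
    "Rooms":    [2, 3, 2, 4],
})

cat_cols = ["Postcode", "Suburb"]

# Each categorical column contributes one indicator column per distinct value.
expected_new = sum(toy[c].nunique() for c in cat_cols)

# Encode both columns in one call; the original columns are replaced.
encoded = pd.get_dummies(toy, columns=cat_cols)
actual_new = encoded.shape[1] - (toy.shape[1] - len(cat_cols))

print(expected_new, actual_new)
```

Running `nunique()` on the real `df` before encoding would give the same kind of estimate for Postcode and Suburb.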


# 2. Modeling

# 3. Deployment

# 4. Exceptional Work