<a href="https://colab.research.google.com/github/coughlinjennie/data71200/blob/main/projects/DATA71200_Project2_Coughlin.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Supervised Learning
Because the field I want to use for labels is categorical — the property type — I'm using classifier models for this portion of the project. SVM, Gaussian naive Bayes, decision tree and KNN are the ones I'm considering.  


In [29]:
#Import the libraries and install scikit-learn
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import requests
import io

!pip install -U scikit-learn==1.4



#Step 1: Import, split and clean the data
This is brought over from Project 1, with a fix to stratify when I split the data and a data pipeline for cleaning the data now that I know what needs to be done.

In [30]:
#Import the data, sourced from Kaggle and stored in my GitHub
url = "https://raw.githubusercontent.com/coughlinjennie/data71200/main/projects/nyhousing.csv" # Make sure the url is the raw version of the file on GitHub
download = requests.get(url).content
#Load the data

housing_master = pd.read_csv(io.StringIO(download.decode('utf-8')))

In [31]:
housing_master["TYPE"].value_counts()

TYPE
Co-op for sale                1450
House for sale                1012
Condo for sale                 891
Multi-family home for sale     727
Townhouse for sale             299
Pending                        243
Contingent                      88
Land for sale                   49
For sale                        20
Foreclosure                     14
Condop for sale                  5
Coming Soon                      2
Mobile house for sale            1
Name: count, dtype: int64

I need to stratify the data when I split it, and a stratified split fails when a class has too few rows. The two values in this field that would interfere are ones I was going to drop anyway because they're not relevant for this model. (The TYPE field mixes listing status with property type, so I'm keeping only the labels that indicate the property type and dropping the status labels, plus a couple of property types that aren't especially relevant in New York.) We're not supposed to clean data until after we split it, but I can't figure out how to stratify the split without doing this one step first, so I'm going to do it anyway.

In [32]:
# Delete all rows where column 'TYPE' has one of the values listed below
drop_types = ["For sale", "Contingent", "Land for sale", "Foreclosure",
              "Pending", "Coming Soon", "Mobile house for sale"]
indexType = housing_master[housing_master["TYPE"].isin(drop_types)].index
housing_master.drop(indexType, inplace=True)

In [33]:
housing_master["TYPE"].value_counts()

TYPE
Co-op for sale                1450
House for sale                1012
Condo for sale                 891
Multi-family home for sale     727
Townhouse for sale             299
Condop for sale                  5
Name: count, dtype: int64
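
To illustrate why the rare labels get in the way of stratifying, here is a minimal sketch on made-up data (the label names, the `min_count` threshold, and the toy feature column are assumptions, not part of the housing dataset). It shows `train_test_split` raising an error when a class has a single member, then succeeding once the rare class is filtered out by count.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels (made up for illustration): one class has a single member
labels = pd.Series(["Co-op"] * 6 + ["House"] * 6 + ["Mobile"] * 1, name="TYPE")
features = pd.DataFrame({"x": range(len(labels))})

# A stratified split raises ValueError when a class has fewer than two members
try:
    train_test_split(features, labels, test_size=0.3, stratify=labels, random_state=42)
    split_failed = False
except ValueError:
    split_failed = True

# Keep only the classes with at least `min_count` rows (the threshold is an assumption)
min_count = 2
counts = labels.value_counts()
keep = labels.isin(counts[counts >= min_count].index)

X_train, X_test, y_train, y_test = train_test_split(
    features[keep], labels[keep], test_size=0.3, stratify=labels[keep], random_state=42
)
print(split_failed, sorted(y_test.unique()))
```

Filtering by count is only one option; dropping labels by name, as done above, is more explicit when the labels are being removed for reasons other than their size.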

In [34]:
#Set the labels on TYPE

housing_label = housing_master["TYPE"]

#Set the data
housing = housing_master.drop("TYPE", axis=1)
print(housing)

                                            BROKERTITLE      PRICE  BEDS  \
0           Brokered by Douglas Elliman  -111 Fifth Ave     315000     2   
1                                   Brokered by Serhant  195000000     7   
2                                Brokered by Sowae Corp     260000     4   
3                                   Brokered by COMPASS      69000     3   
4     Brokered by Sotheby's International Realty - E...   55000000     7   
...                                                 ...        ...   ...   
4796                                Brokered by COMPASS     599000     1   
4797                    Brokered by Mjr Real Estate Llc     245000     1   
4798      Brokered by Douglas Elliman - 575 Madison Ave    1275000     1   
4799            Brokered by E Realty International Corp     598125     2   
4800                 Brokered by Nyc Realty Brokers Llc     349000     1   

           BATH  PROPERTYSQFT  \
0      2.000000   1400.000000   
1     10.000000  1754

In [35]:
#Divide the data into training and testing sets
from sklearn.model_selection import train_test_split

housing_train, housing_test, housing_label_train, housing_label_test = train_test_split(housing, housing_label, test_size=0.3, stratify=housing_label, random_state=42)


In [36]:
#Create a column with the ZIP code of the property
housing["ZIP"] = housing.MAIN_ADDRESS.str[-5:]

In [56]:
# Create a list of redundant column names to drop
to_drop = ["LONGITUDE", "LATITUDE", "ADDRESS", "ADMINISTRATIVE_AREA_LEVEL_2", "LOCALITY", "SUBLOCALITY", "FORMATTED_ADDRESS", "MAIN_ADDRESS", "STATE", "STREET_NAME","LONG_NAME"]

# Drop those columns from the dataset
housing_subset = housing.drop(to_drop, axis = 1)


In [57]:
#Drop all properties priced above $100M

housing_clean = housing_subset[housing_subset['PRICE'] <= 100000000]

In [58]:
housing_clean.head()

| | BROKERTITLE | PRICE | BEDS | BATH | PROPERTYSQFT | ZIP |
|---|---|---|---|---|---|---|
| 0 | Brokered by Douglas Elliman -111 Fifth Ave | 315000 | 2 | 2.0 | 1400.0 | 10022 |
| 2 | Brokered by Sowae Corp | 260000 | 4 | 2.0 | 2015.0 | 10312 |
| 3 | Brokered by COMPASS | 69000 | 3 | 1.0 | 445.0 | 10022 |
| 4 | Brokered by Sotheby's International Realty - E... | 55000000 | 7 | 2.373861 | 14175.0 | 10065 |
| 5 | Brokered by Sowae Corp | 690000 | 5 | 2.0 | 4004.0 | 11238 |
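
The cleaning cells above run on the full `housing` frame. If the goal is to clean only after splitting, one possible approach is to wrap the same steps in a function and apply it to each split separately, then realign the labels with the rows that survive the price filter. The sketch below uses made-up rows and a hypothetical helper name, `clean_housing`; it drops only `MAIN_ADDRESS` for brevity, where the notebook drops a longer list of columns.

```python
import pandas as pd

def clean_housing(df, max_price=100_000_000):
    """Hypothetical helper: apply the cleaning steps to one split of the data."""
    out = df.copy()
    # ZIP code is taken as the last five characters of MAIN_ADDRESS
    out["ZIP"] = out["MAIN_ADDRESS"].str[-5:]
    # Drop the address column once ZIP is extracted
    out = out.drop(columns=["MAIN_ADDRESS"])
    # Remove extreme price outliers
    return out[out["PRICE"] <= max_price]

# Made-up rows standing in for housing_train and housing_label_train
toy_train = pd.DataFrame(
    {
        "MAIN_ADDRESS": ["1 A St New York, NY 10022", "2 B St New York, NY 10019", "3 C St Staten Island, NY 10312"],
        "PRICE": [315000, 195000000, 260000],
    },
    index=[0, 1, 2],
)
toy_labels = pd.Series(["Condo for sale", "Townhouse for sale", "House for sale"], index=[0, 1, 2])

toy_clean = clean_housing(toy_train)
# Keep only the labels whose rows survived the price filter
toy_labels_clean = toy_labels.loc[toy_clean.index]
print(toy_clean)
print(toy_labels_clean.tolist())
```

With the real data, the same call would be made once on `housing_train` and once on `housing_test`, each followed by the matching `.loc` realignment of its labels.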


#Pre-Process the Data
Once the data is cleaned, I need to process it so I can run various supervised models on it.

In [59]:
#Organize columns by dtype

num_housing = ["BEDS", "BATH", "PROPERTYSQFT"]
cat_housing = ["BROKERTITLE", "ZIP"]

In [60]:
#Import pipeline
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer

#Set up numeric and categorical pipelines
num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("transform", FunctionTransformer(np.log, inverse_func=np.exp))
])

cat_pipeline = Pipeline([
    ("impute_c", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore"))
])

preprocessing = ColumnTransformer([
    ("num", num_pipeline, num_housing),
    ("cat", cat_pipeline, cat_housing),
])
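
The numeric pipeline applies `np.log` directly. If any of the numeric columns ever holds a zero, that produces `-inf` along with a runtime warning, which may be what the stray `return func(X, ...)` line in the next cell's output comes from (I haven't confirmed that). A minimal sketch of one common alternative, assuming a made-up column that contains a zero, is to swap in `np.log1p` with `np.expm1` as its inverse:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

values = np.array([[0.0], [1.0], [1400.0]])  # made-up column with a zero in it

# np.log maps 0 to -inf (the warning is silenced here only to keep the demo quiet)
with np.errstate(divide="ignore"):
    plain_log = FunctionTransformer(np.log, inverse_func=np.exp).fit_transform(values)

# np.log1p computes log(1 + x), so 0 maps to 0 and every value stays finite
safe_log = FunctionTransformer(np.log1p, inverse_func=np.expm1).fit_transform(values)

print(np.isinf(plain_log).any(), np.isfinite(safe_log).all())
```

This changes the transformed values slightly for small inputs, so it is a design choice rather than a drop-in fix.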

In [61]:
#Prepare the data
housing_prepared = preprocessing.fit_transform(housing_clean)
housing_prepared.shape

  return func(X, **(kw_args if kw_args else {}))


(4382, 1156)

The first few times I ran this, there were a lot more columns. I ended up dropping some of the categorical columns that gave location information to streamline things, which took it from a bit over 4,000 columns down to 1,156. That's still a ton, but one-hot encoding creates one column per distinct value, so with ZIP codes and broker names in the mix there were always going to be a bunch.

#Classifier Models

Since I'm trying to predict a categorical label — the type of property — I'm only assessing classifier models for this project.

In [None]:
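from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42)
#Fit on the prepared features, with the labels aligned to the rows kept in housing_clean
sgd_clf.fit(housing_prepared, housing_label.loc[housing_clean.index])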
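
For comparing the classifiers mentioned at the top (SVM, Gaussian naive Bayes, decision tree and KNN), a rough sketch with cross-validated accuracy might look like the following. It runs on synthetic data from `make_classification` as a stand-in for the prepared housing features, and all hyperparameters are left at their defaults, so the scores say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features and labels
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)

models = {
    "SVM": SVC(random_state=42),
    "Gaussian naive Bayes": GaussianNB(),
    "Decision tree": DecisionTreeClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
}

scores = {}
for name, model in models.items():
    # Mean accuracy over 5 cross-validation folds
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {scores[name]:.3f}")
```

One caveat when moving this to the housing data: `GaussianNB` expects a dense array, so a sparse matrix coming out of the one-hot encoder would need `.toarray()` first.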