
#Supervised Learning
Because the field I want to use for labels, the property type, is categorical, I'm using classifier models for this portion of the project. The ones I'm considering are SVM, Gaussian naive Bayes, decision tree and KNN.


In [1]:
#Import the libraries and install scikit-learn
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import requests
import io
import matplotlib.pyplot as plt


!pip install -U scikit-learn==1.4

Collecting scikit-learn==1.4
  Downloading scikit_learn-1.4.0-1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.1/12.1 MB 20.0 MB/s eta 0:00:00
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.2.2
    Uninstalling scikit-learn-1.2.2:
      Successfully uninstalled scikit-learn-1.2.2
Successfully installed scikit-learn-1.4.0


#Step 1: Import, split and clean the data
This is brought over from Project 1, with a fix to stratify when I split the data and a data pipeline for cleaning the data now that I know what needs to be done.

In [2]:
#Import the data, sourced from Kaggle and stored in my GitHub
url = "https://raw.githubusercontent.com/coughlinjennie/data71200/main/projects/nyhousing.csv" # Make sure the url is the raw version of the file on GitHub
download = requests.get(url).content
#Load the data

housing_master = pd.read_csv(io.StringIO(download.decode('utf-8')))

In [3]:
housing_master["TYPE"].value_counts()

TYPE
Co-op for sale                1450
House for sale                1012
Condo for sale                 891
Multi-family home for sale     727
Townhouse for sale             299
Pending                        243
Contingent                      88
Land for sale                   49
For sale                        20
Foreclosure                     14
Condop for sale                  5
Coming Soon                      2
Mobile house for sale            1
Name: count, dtype: int64

I need to stratify the data when I split it, and the two rarest values in this field, which would interfere with that, are ones I was going to drop anyway because they're not relevant for this model. (The TYPE field mixes listing statuses such as "Pending" with property types. I'm keeping only the labels that indicate a property type and dropping the statuses, plus a couple of property types that aren't very relevant in New York.) We're not supposed to clean data until after we split it, but I can't figure out how to stratify the split without doing this one step first, so I'm going to do it anyway.
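The stratification problem can be reproduced on a toy label column. This is only a minimal sketch with made-up labels, not the housing data: `train_test_split` raises a `ValueError` when a class has a single member, and the same call works once that class is filtered out.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels (hypothetical): two common classes plus one class with a single row
labels = pd.Series(["Co-op"] * 10 + ["House"] * 10 + ["Mobile house"])
features = pd.DataFrame({"PRICE": range(len(labels))})

# Stratifying fails because "Mobile house" cannot appear in both splits
try:
    train_test_split(features, labels, test_size=0.3, stratify=labels, random_state=42)
    stratify_failed = False
except ValueError as err:
    stratify_failed = True
    print("Stratified split failed:", err)

# Filter out the singleton class, then the same call works
keep = labels != "Mobile house"
X_train, X_test, y_train, y_test = train_test_split(
    features[keep], labels[keep], test_size=0.3, stratify=labels[keep], random_state=42
)
print(y_test.value_counts())
```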

In [4]:
# Delete all rows where column 'TYPE' has certain values
drop_types = ["For sale", "Contingent", "Land for sale", "Foreclosure", "Pending", "Coming Soon", "Mobile house for sale"]
indexType = housing_master[housing_master["TYPE"].isin(drop_types)].index
housing_master.drop(indexType, inplace=True)

In [5]:
housing_master["TYPE"].value_counts()

TYPE
Co-op for sale                1450
House for sale                1012
Condo for sale                 891
Multi-family home for sale     727
Townhouse for sale             299
Condop for sale                  5
Name: count, dtype: int64

In [6]:
#Set the labels on TYPE

housing_label = housing_master["TYPE"]

#Set the data
housing = housing_master.drop("TYPE", axis=1)
print(housing)

                                            BROKERTITLE      PRICE  BEDS  \
0           Brokered by Douglas Elliman  -111 Fifth Ave     315000     2   
1                                   Brokered by Serhant  195000000     7   
2                                Brokered by Sowae Corp     260000     4   
3                                   Brokered by COMPASS      69000     3   
4     Brokered by Sotheby's International Realty - E...   55000000     7   
...                                                 ...        ...   ...   
4796                                Brokered by COMPASS     599000     1   
4797                    Brokered by Mjr Real Estate Llc     245000     1   
4798      Brokered by Douglas Elliman - 575 Madison Ave    1275000     1   
4799            Brokered by E Realty International Corp     598125     2   
4800                 Brokered by Nyc Realty Brokers Llc     349000     1   

           BATH  PROPERTYSQFT  \
0      2.000000   1400.000000   
1     10.000000  1754

In [7]:
#Divide the data into training and testing sets
from sklearn.model_selection import train_test_split

housing_train, housing_test, housing_label_train, housing_label_test = train_test_split(housing, housing_label, test_size=0.3, stratify=housing_label, random_state=42)
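A quick way to check that `stratify` did its job is to compare class shares before and after the split. The sketch below uses made-up label counts rather than the housing data, so the numbers are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (hypothetical counts, not the housing data)
labels = pd.Series(["Co-op"] * 60 + ["House"] * 30 + ["Townhouse"] * 10, name="TYPE")
features = pd.DataFrame({"PRICE": range(len(labels))})

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, stratify=labels, random_state=42
)

# Compare class shares in the full data, the training set and the test set
shares = pd.DataFrame({
    "all": labels.value_counts(normalize=True),
    "train": y_train.value_counts(normalize=True),
    "test": y_test.value_counts(normalize=True),
})
print(shares)
```

If the three columns are close to each other, the split kept the class balance.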


The ZIP code field was giving me fits when I tried to one-hot encode it, so I'm trying this without that field to see if the model is useful enough without it.
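For reference, here is a minimal sketch of one way a ZIP code column could be one-hot encoded. The column name `ZIPCODE` and the values are assumptions for illustration. Setting `handle_unknown="ignore"` is one common way to deal with ZIP codes that show up in the test data but not in the training data, which may or may not be what caused the trouble here.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical ZIP code column; the name "ZIPCODE" is an assumption
train = pd.DataFrame({"ZIPCODE": ["10001", "11201", "10001", "10463"]})
test = pd.DataFrame({"ZIPCODE": ["11201", "10314"]})  # 10314 never appears in train

# handle_unknown="ignore" encodes unseen ZIP codes as an all-zero row instead of raising
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(train[["ZIPCODE"]]).toarray()
test_encoded = encoder.transform(test[["ZIPCODE"]]).toarray()

print(encoder.categories_[0])
print(test_encoded)
```

Storing ZIP codes as strings keeps them from being treated as numbers with a meaningful order.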

In [47]:
# Create a list of redundant column names to drop from the training and testing data
to_drop = ["LONGITUDE", "LATITUDE", "ADDRESS", "ADMINISTRATIVE_AREA_LEVEL_2", "LOCALITY", "SUBLOCALITY", "FORMATTED_ADDRESS", "MAIN_ADDRESS", "STATE", "STREET_NAME","LONG_NAME","BROKERTITLE"]

# Drop those columns from the dataset
housing_subset = housing_train.drop(to_drop, axis = 1)
h_test_subset = housing_test.drop(to_drop, axis = 1)



The dropped columns are redundant. They're dropped from both the training and testing data so the two sets keep the same columns.

In [27]:
#Drop all properties priced at more than $100M from training data only

housing_clean = housing_subset[housing_subset['PRICE'] <= 100000000]
label_train_clean = housing_label_train[housing_subset['PRICE'] <= 100000000]


In [28]:
housing_clean.info()
label_train_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3067 entries, 2589 to 2311
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PRICE         3067 non-null   int64  
 1   BEDS          3067 non-null   int64  
 2   BATH          3067 non-null   float64
 3   PROPERTYSQFT  3067 non-null   float64
dtypes: float64(2), int64(2)
memory usage: 119.8 KB
<class 'pandas.core.series.Series'>
Index: 3067 entries, 2589 to 2311
Series name: TYPE
Non-Null Count  Dtype 
--------------  ----- 
3067 non-null   object
dtypes: object(1)
memory usage: 47.9+ KB


In [29]:
housing_clean.describe()

            PRICE      BEDS      BATH  PROPERTYSQFT
count      3067.0    3067.0    3067.0        3067.0
mean    1955427.0  3.401043  2.392915   2202.658046
std     4342835.0   2.81077  2.058869   2420.874714
min       49500.0       1.0       0.0         250.0
25%      499000.0       2.0       1.0        1166.5
50%      845000.0       3.0       2.0   2184.207862
75%     1499000.0       4.0       3.0   2184.207862
max    65000000.0      50.0      50.0       65535.0


That property with no bathrooms (the BATH minimum is 0) would cause a problem under a log transform, because the log of 0 is undefined, so I'm going to remove it, too.

In [30]:
housing_clean.shape

(3067, 4)
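Removing the zero-bathroom rows could look something like the sketch below. It uses toy stand-ins for `housing_clean` and `label_train_clean` with made-up values, and it assumes BATH would be log-transformed. The point is that one boolean mask is applied to both the features and the labels so they stay aligned.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for housing_clean and label_train_clean (values are made up)
features = pd.DataFrame(
    {"PRICE": [500000, 750000, 300000], "BATH": [1.0, 0.0, 2.0]},
    index=[2589, 17, 2311],
)
labels = pd.Series(["Co-op for sale", "House for sale", "Condo for sale"],
                   index=features.index, name="TYPE")

# log(0) is -inf, which a scaler or model cannot handle
with np.errstate(divide="ignore"):
    print(np.log(features["BATH"]).tolist())

# One boolean mask applied to both objects keeps rows and labels in step
has_bath = features["BATH"] > 0
features_kept = features[has_bath]
labels_kept = labels[has_bath]
print(features_kept.shape, labels_kept.shape)
```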

#Step 2: Prepare the Data
Once the data is cleaned, I need to process it so I can run various supervised models on it.

In [35]:
#Import pipeline
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer

#Impute, log-transform and standardize BEDS and PROPERTYSQFT
log_pipeline = make_pipeline(SimpleImputer(missing_values=np.nan, strategy="median"),
                             FunctionTransformer(np.log, feature_names_out="one-to-one"),
                             StandardScaler())
#Impute only; PRICE and BATH are not scaled here
num_pipeline = make_pipeline(SimpleImputer(strategy="median"))

preprocessing = ColumnTransformer([
    ("log", log_pipeline, ["BEDS","PROPERTYSQFT"]),
    ("std", num_pipeline, ["PRICE", "BATH"])
])

In [38]:
#Prepare the data
housing_prepared = preprocessing.fit_transform(housing_clean)
housing_prepared.shape

(3067, 4)

#Step 3: Examine the Target Attribute
The TYPE field is my target attribute.

In [44]:
import seaborn as sb

#Examine the distribution of the categories
#countplot accepts the label Series directly and counts the rows in each category
sb.countplot(x=label_train_clean)

#Step 4: Select Classifier Models

Since I'm trying to predict a categorical label, the type of property, I'm only assessing classifier models for this project. I'm going to try the K-Nearest Neighbors and Decision Tree classifiers.

In [45]:
#import models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import model_selection



#Step 5: Run and Assess the Models
Three components here:

1.   Run with the defaults
2.   Run again and use cross-validation
3.   Adjust parameters for the model(s) using grid search
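The three components can be sketched end to end as follows. This uses synthetic data from `make_classification` as a stand-in for the housing set, and the grid of `n_neighbors` values is arbitrary, so treat it as an outline of the workflow rather than the final analysis.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the housing set): 4 features, 3 classes
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# 1. Run with the defaults
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
default_score = knn.score(X_test, y_test)

# 2. Run again with 5-fold cross-validation on the training data
cv_scores = cross_val_score(KNeighborsClassifier(), X_train, y_train, cv=5)

# 3. Tune n_neighbors with a grid search (the grid values are arbitrary)
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)

print(default_score, cv_scores.mean(), grid.best_params_)
```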




In [54]:
#Use the prepared training data as the feature matrix
X_train = housing_prepared


In [52]:
#Run K-Neighbors with n_neighbors=3 (the scikit-learn default is 5)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, label_train_clean)
#Apply the preprocessing fitted on the training data to the test data before predicting
X_test = preprocessing.transform(h_test_subset)
housing_predict = knn.predict(X_test)

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

#Per-class precision and recall for the KNN predictions on the test labels
print(precision_score(housing_label_test, housing_predict, average=None))
print(recall_score(housing_label_test, housing_predict, average=None))

# Calculate metrics globally by counting the total true positives, false negatives and false positives.
print(f1_score(housing_label_test, housing_predict, average='micro'))
# Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
print(f1_score(housing_label_test, housing_predict, average='macro'))
# Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters 'macro' to account for label imbalance; it can result in an F-score that is not between precision and recall.
print(f1_score(housing_label_test, housing_predict, average='weighted'))

# Class-wise, no averaging
print(f1_score(housing_label_test, housing_predict, average=None))
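The difference between the averaging options is easier to see on a tiny example. The labels below are made up, with one rare class that the predictions miss entirely. Depending on the scikit-learn version, a warning about the class with no predictions may be printed.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up true and predicted labels with one rare class
y_true = ["Co-op", "Co-op", "Co-op", "Co-op", "House", "House", "Condop"]
y_pred = ["Co-op", "Co-op", "Co-op", "House", "House", "House", "Co-op"]

micro = f1_score(y_true, y_pred, average="micro")
macro = f1_score(y_true, y_pred, average="macro")
weighted = f1_score(y_true, y_pred, average="weighted")

# For single-label multiclass data, micro-averaged F1 equals accuracy
print(micro, accuracy_score(y_true, y_pred))
# Macro is pulled down by the missed rare class; weighted less so
print(macro, weighted)
```

With imbalanced classes like the property types here, the macro score is the one that reveals a rare class being missed.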

In [None]:
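#Run DecisionTree with defaults
tree = DecisionTreeClassifier()
tree.fit(housing_prepared, label_train_clean)
#Transform the test data with the fitted preprocessing before predicting
tree_predict = tree.predict(preprocessing.transform(h_test_subset))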
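One way to avoid forgetting to transform the test data is to chain the preprocessing and the model into a single pipeline. The sketch below uses toy data with made-up values, and a plain `StandardScaler` stands in for the preprocessing step. It is an illustration of the pattern, not this project's final model.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Toy data with made-up values; a StandardScaler stands in for the preprocessing step
data = pd.DataFrame({
    "BEDS": [1, 1, 2, 2, 3, 4, 5, 6, 1, 2, 4, 5],
    "PROPERTYSQFT": [450, 500, 800, 900, 1500, 2000, 2600, 3000, 480, 850, 2100, 2500],
})
labels = pd.Series(["Co-op"] * 4 + ["House"] * 4 + ["Co-op"] * 2 + ["House"] * 2)

X_train, X_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.25, stratify=labels, random_state=42)

# fit() fits the scaler and the tree; predict() re-applies the fitted scaler
model = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=42))
model.fit(X_train, y_train)
tree_predictions = model.predict(X_test)
print(tree_predictions, model.score(X_test, y_test))
```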