# Logistic Regression Model to Predict Strong Housing Markets

This model will assess past characteristics of strong markets to predict future performance in 400+ US metropolitan areas.  

## About the model:


1. A logistic regression (supervised learning) model is used to identify markets at or above the 90th percentile of home price index growth.

2. Employment data from the BLS was used to compute year-over-year (y/y) growth for each of the 400+ markets.

3. Home price index (HPI) data was drawn from the FHFA to compute y/y growth in home prices in each of the 400+ markets.

4. Both data sets were indexed by MSA name; the names were manually adjusted so that the two data sets could be merged for analysis.

5. Lag variables were added as features for HPI and employment.

6. Employment and HPI data were grouped into five-year periods spanning 1991 to 2021.
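
The feature construction described in the list above can be sketched roughly as follows. This is a minimal illustration on a made-up table, not the preprocessing actually used for `project_2_data_2.csv`: the column names `msa`, `period`, and `hpi_growth`, the toy values, and the choice to compute the percentile cutoff within each period are all assumptions.

```python
import pandas as pd

# Hypothetical long-format table: one row per market per five-year period.
toy = pd.DataFrame({
    "msa": ["A", "A", "A", "B", "B", "B"],
    "period": ["1991-1995", "1996-2000", "2001-2005"] * 2,
    "hpi_growth": [0.02, 0.10, 0.05, 0.01, 0.03, 0.20],
})

# Lag feature: the previous period's growth for the same market.
toy = toy.sort_values(["msa", "period"])
toy["hpi_growth_lag"] = toy.groupby("msa")["hpi_growth"].shift(1)

# Binary target: 1 if growth is at or above the 90th percentile
# of all markets in the same period (assumed definition).
cutoff = toy.groupby("period")["hpi_growth"].transform(lambda s: s.quantile(0.90))
toy["rank"] = (toy["hpi_growth"] >= cutoff).astype(int)

print(toy)
```

The first period of each market has no lag value, which is one reason rows with missing values may need to be dropped before modeling.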



In [186]:
# Import the required modules
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from imblearn.metrics import classification_report_imbalanced
from imblearn.over_sampling import RandomOverSampler
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler,OneHotEncoder



# Prepare the Data

### Step 1: Load the `project_2_data_2.csv` file from the `Resources` folder into a Pandas DataFrame. Set the “MSA” column as the index.

In [187]:
# Read the project_2_data_2.csv file into a Pandas DataFrame.
housing_data = pd.read_csv(
    Path('./Resources/project_2_data_2.csv'), 
    index_col='MSA'
)

# Review the DataFrame
housing_data.head()


| MSA | %hpi | rank | %hpi_lag | Employment | Employment_ Lag | Period | Population |
|---|---|---|---|---|---|---|---|
| Abilene, TX | 0.149704 | 1 | -0.009124 | 0.68 | 2.26 | 1996-2000 | 0.028429 |
| Albany, GA | 0.126767 | 1 | -0.007567 | 0.04 | 2.76 | 1996-2000 | 0.036046 |
| Albany-Schenectady-Troy, NY | 0.107624 | 1 | -0.014653 | 1.66 | -0.06 | 1996-2000 | 0.008185 |
| Albuquerque, NM | 0.106139 | 1 | -0.006384 | 2.22 | 3.84 | 1996-2000 | 0.099915 |
| Alexandria, LA | 0.100375 | 1 | 0.000399 | 1.64 | 2.0 | 1996-2000 | -0.020036 |


### Step 2: Assess the structure of the target to determine balance. 


In [188]:
# Run the value_counts() function to determine the distribution of the "0" and "1" values. 

housing_data['rank'].value_counts()

0    1221
1     125
Name: rank, dtype: int64

# Split the data into training and testing sets

### Step 1: Using the DataFrame loaded from `project_2_data_2.csv`, define the `target` (the “rank” column) and the `features` (all columns except “rank”, “Period ”, and “%hpi”) before separating the data into training and testing sets.

In [189]:
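# Drop rows with missing feature values first so that X and y stay aligned.
feature_columns = housing_data.columns.drop(['rank','Period ','%hpi'])
clean_data = housing_data.dropna(subset=feature_columns)

# The target column should be the binary `rank` column.
y = clean_data['rank']

# The features are all of the remaining columns.
X = clean_data[feature_columns]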



### Step 2: Split the features and target data into training and testing sets (`X_train`, `X_test`, `y_train`, and `y_test`) by using the `train_test_split` function.

In [190]:
# Split the dataset using the train_test_split function
X_train, X_test, y_train, y_test = train_test_split(X, y)
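
Because the target is heavily imbalanced, a plain random split can leave the two sets with slightly different shares of strong markets, and without a `random_state` the split changes on every run. The sketch below shows one possible variation using the `stratify` and `random_state` arguments of `train_test_split`. The toy data and the names `X_toy`, `y_toy`, `X_tr`, `X_te`, `y_tr`, and `y_te` are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with roughly the same imbalance as the housing data (about 9% positives).
X_toy = pd.DataFrame({"feature": range(1000)})
y_toy = pd.Series([1] * 90 + [0] * 910, name="rank")

# stratify keeps the class ratio the same in both splits;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=1
)

# Share of positives in each split.
print(y_tr.mean(), y_te.mean())
```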

### Step 3: Use scikit-learn's `StandardScaler` to scale the features data.

In [191]:
# Create a StandardScaler instance
scaler = StandardScaler()

# Fit the scaler to the features training dataset
X_scaler = scaler.fit(X_train)

# Scale the training and testing features with the fitted scaler
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
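
The scaler is fitted on the training features only and then reused for the test features, so no information from the test set leaks into the scaling statistics. A small sketch on random numbers illustrates the effect. The arrays `train` and `test` and the name `demo_scaler` are made up for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
test = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

# The mean and standard deviation come from the training rows only.
demo_scaler = StandardScaler().fit(train)
train_scaled = demo_scaler.transform(train)

# The test rows reuse the training statistics, so their mean is close to,
# but not exactly, zero.
test_scaled = demo_scaler.transform(test)

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(test_scaled.mean(axis=0).round(3))
```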


### Step 4: Oversample the training data to reduce imbalance

In [192]:
# Resample the data

random_oversampler = RandomOverSampler(random_state=1)
X_resampled, y_resampled = random_oversampler.fit_resample(X_train, y_train)

In [193]:
# Count the distinct values
y_resampled.value_counts()

1    913
0    913
Name: rank, dtype: int64
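
Random oversampling duplicates rows of the minority class until both classes have the same count, which is why each class ends up with 913 rows above. The sketch below illustrates the same idea with `sklearn.utils.resample` rather than imblearn's `RandomOverSampler`. The toy frame and variable names are assumptions.

```python
import pandas as pd
from sklearn.utils import resample

# Toy data: 8 rows of class 0 and 2 rows of class 1.
toy = pd.DataFrame({"x": range(10), "rank": [0] * 8 + [1] * 2})
majority = toy[toy["rank"] == 0]
minority = toy[toy["rank"] == 1]

# Draw minority rows with replacement until both classes have the same count.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=1
)
balanced = pd.concat([majority, minority_upsampled])

print(balanced["rank"].value_counts())
```

Only the training rows should be resampled; the test set keeps its original class balance so that the evaluation reflects real conditions.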

# Model and Fit the Data to a Logistic Regression

### Step 1: Declare a `LogisticRegression` model.

In [194]:
# Declare a logistic regression model.
# Apply a random_state of 7 to the model
logistic_regression_model = LogisticRegression(random_state=7)

### Step 2: Fit the training data to the model, and save the model.

In [195]:
# Fit and save the logistic regression model using the training data.
# Note: this fits on the original X_train/y_train, not on X_train_scaled or X_resampled.
lr_model = logistic_regression_model.fit(X_train, y_train)
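
One way to make sure that scaling and class balancing actually reach the model is to bundle them into a single estimator. The following is a sketch under stated assumptions, not the notebook's method: it runs on synthetic data from `make_classification`, and it swaps random oversampling for `class_weight="balanced"`, which reweights the classes instead of duplicating rows. The names `X_demo`, `y_demo`, `pipe`, and `demo_predictions` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features (about 10% positives).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=7,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=7
)

# Scaling happens inside the pipeline, so the model always sees scaled inputs.
# class_weight="balanced" gives the rare class more weight during fitting.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", random_state=7),
)
pipe.fit(X_tr, y_tr)
demo_predictions = pipe.predict(X_te)

print(demo_predictions[:20])
```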

# Predict the Testing Labels

### Step 1: Make predictions about strong markets by using the testing dataset, and save those predictions.

In [196]:
# Make and save testing predictions with the saved logistic regression model using the test data
testing_predections = lr_model.predict(X_test)

# Review the predictions
testing_predections

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

# Calculate the Performance Metrics

### Step 1: Calculate the accuracy score by comparing the testing targets (`y_test`) with the testing predictions.

In [197]:
# Display the accuracy score for the test dataset.
accuracy_score(y_test, testing_predections)

0.9109792284866469
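
An accuracy of about 0.91 should be read against the class balance. With 1221 markets labeled 0 and 125 labeled 1 in the full dataset, a model that never predicts a strong market would already score about 0.907. The sketch below works through that arithmetic. It uses the full-dataset counts rather than the actual test split, so the comparison is approximate.

```python
from sklearn.metrics import accuracy_score, recall_score

# Class counts reported by value_counts() on the full dataset.
y_true = [0] * 1221 + [1] * 125

# A "model" that never predicts a strong market.
y_all_zero = [0] * len(y_true)

baseline_accuracy = accuracy_score(y_true, y_all_zero)
baseline_recall = recall_score(y_true, y_all_zero)

print(round(baseline_accuracy, 4), baseline_recall)
```

Per-class precision and recall, for example from the `classification_report` function imported at the top, would likely give a clearer picture of how well the strong markets are identified.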

# Predict the strongest housing market

In [198]:
# Read the predicting_data.csv file into a Pandas DataFrame.
predicting_data = pd.read_csv(
    Path("./Resources/predicting_data.csv"), 
    index_col="MSA"
)

# Review the DataFrame
predicting_data.head()


| MSA | %hpi | rank | %hpi_lag | Employment | Employment_ Lag | Period | Population |
|---|---|---|---|---|---|---|---|
| Pine Bluff, AR | 0.04 | 0 | 0.0 | -0.97 | -2.02 | 2016-2021 | -0.013959 |
| Danville, IL | 0.04 | 0 | 0.01 | -1.47 | -0.08 | 2016-2021 | -0.013204 |
| Chico, CA | 0.09 | 0 | 0.03 | -0.43 | 2.18 | 2016-2021 | -0.012199 |
| Beckley, WV | 0.04 | 0 | 0.01 | -0.6 | -1.12 | 2016-2021 | -0.011769 |
| Goldsboro, NC | 0.06 | 0 | -0.01 | -0.38 | 0.12 | 2016-2021 | -0.010171 |


In [199]:
X_new = predicting_data.drop(columns=['rank','Period ','%hpi']).dropna()

In [200]:
X_new

| MSA | %hpi_lag | Employment | Employment_ Lag | Population |
|---|---|---|---|---|
| Pine Bluff, AR | 0.00 | -0.97 | -2.02 | -0.013959 |
| Danville, IL | 0.01 | -1.47 | -0.08 | -0.013204 |
| Chico, CA | 0.03 | -0.43 | 2.18 | -0.012199 |
| Beckley, WV | 0.01 | -0.60 | -1.12 | -0.011769 |
| Goldsboro, NC | -0.01 | -0.38 | 0.12 | -0.010171 |
| ... | ... | ... | ... | ... |
| Myrtle Beach-Conway-North Myrtle Beach, SC-NC | -0.01 | 2.17 | 2.16 | 0.028453 |
| Greeley, CO | 0.07 | 1.38 | 4.54 | 0.029383 |
| Provo-Orem, UT | 0.04 | 4.30 | 5.28 | 0.030201 |
| Coeur d'Alene, ID | 0.03 | 3.02 | 2.50 | 0.031405 |
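
The prediction cell that follows prints a feature-name message showing that the training file has a `Population ` header with a trailing space while the prediction file has `Population`. One possible way to avoid such mismatches is to strip whitespace from the column names of both frames right after loading. The frames `train_like` and `predict_like` below are toy stand-ins, not the real data.

```python
import pandas as pd

# Toy frames reproducing the mismatch: one header has a trailing space.
train_like = pd.DataFrame({"Employment": [1.0], "Population ": [0.02]})
predict_like = pd.DataFrame({"Employment": [2.0], "Population": [0.03]})

# Strip surrounding whitespace from every column name in both frames.
train_like.columns = train_like.columns.str.strip()
predict_like.columns = predict_like.columns.str.strip()

print(list(train_like.columns) == list(predict_like.columns))
```

If this were applied to the notebook's DataFrames, later references to names with trailing spaces, such as `'Period '` in the `drop` calls, would need to be updated to match.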


In [209]:
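# Make and save predictions with the saved logistic regression model.
# A market is labeled strong (1) when its predicted probability of class 1
# is at least `mythreshold`, instead of relying on the default predict() cutoff.

mythreshold = 0.15
#y_pred = lr_model.predict(X_new)  
y_pred =  (lr_model.predict_proba(X_new)[:,1]>=mythreshold).astype(int)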

# Review the predictions
y_pred

Feature names unseen at fit time:
- Population
Feature names seen at fit time, yet now missing:
- Population 



array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
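
Lowering the threshold trades precision for recall: more markets are flagged as strong, at the cost of more false positives. The effect of the cutoff can be seen on a handful of made-up probabilities. The values in `proba_strong` are hypothetical and stand in for the second column of `predict_proba`.

```python
import numpy as np

# Hypothetical positive-class probabilities, as predict_proba(...)[:, 1] would return.
proba_strong = np.array([0.03, 0.12, 0.15, 0.40, 0.62])

# A 0.5 cutoff is roughly what predict() applies for a binary logistic regression.
default_labels = (proba_strong >= 0.50).astype(int)

# The lower cutoff used in the cell above flags more markets.
lowered_labels = (proba_strong >= 0.15).astype(int)

print(default_labels, lowered_labels)
```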

In [210]:
df_pred = pd.DataFrame(y_pred,
                       index = X_new.index,
                       columns = ['rank']
                      )

df_pred
#df_pred.reindex
#df_pred.set_index(X_new.index)

| MSA | rank |
|---|---|
| Pine Bluff, AR | 0 |
| Danville, IL | 0 |
| Chico, CA | 0 |
| Beckley, WV | 0 |
| Goldsboro, NC | 0 |
| ... | ... |
| Myrtle Beach-Conway-North Myrtle Beach, SC-NC | 0 |
| Greeley, CO | 0 |
| Provo-Orem, UT | 0 |
| Coeur d'Alene, ID | 0 |


In [211]:
#count the number of strong markets predicted. 
df_pred.value_counts()

rank
0       386
1         8
dtype: int64
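
The counts show how many markets were flagged, but not which ones. Since the predictions are indexed by MSA name, the flagged markets can be listed by filtering on the `rank` column. The sketch below uses a toy frame, `toy_pred`, with placeholder market names in place of `df_pred`.

```python
import pandas as pd

# Toy stand-in for df_pred: MSA names as the index, 0/1 labels in `rank`.
toy_pred = pd.DataFrame(
    {"rank": [0, 1, 0, 1]},
    index=pd.Index(["Market A", "Market B", "Market C", "Market D"], name="MSA"),
)

# Keep only the index labels of rows predicted as strong markets.
strong_markets = toy_pred.index[toy_pred["rank"] == 1].tolist()
print(strong_markets)
```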

In [212]:
# Output the DataFrame to a CSV file.

df_pred.to_csv('strong_markets_2025.csv')