# Logistic Regression Model to Predict Strong Housing Markets

This model will assess past characteristics of strong markets to predict future performance in 400+ US metropolitan areas.  

## About the model:


1. A logistic regression model (supervised learning) is used to identify the top-decile markets, meaning those at or above the 90th percentile of home price index growth.

2. Employment data from the BLS was used to compute year-over-year (y/y) growth for each of the 400+ markets.

3. Home price index (HPI) data from the FHFA was used to compute y/y growth in home prices for each of the 400+ markets.

4. Both data sets are indexed by MSA name; the names were manually adjusted so that the two data sets could be merged for analysis.

5. Lagged values of HPI growth and employment growth were added as features.

6. Employment and HPI data were grouped into five-year periods covering 1991 to 2021.
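
The feature-engineering steps listed above can be sketched in pandas roughly as follows. This is a minimal illustration on made-up numbers, not the notebook's actual preprocessing: the column names (`msa`, `year`, `hpi`, `hpi_growth`, `hpi_growth_lag`) are assumptions, and the label is computed per year for brevity rather than per five-year period.

```python
import pandas as pd

# Hypothetical long-format input: one row per MSA and year (made-up values).
raw = pd.DataFrame({
    "msa": ["A"] * 4 + ["B"] * 4,
    "year": [2000, 2001, 2002, 2003] * 2,
    "hpi": [100.0, 110.0, 121.0, 133.1, 100.0, 102.0, 104.04, 106.1208],
})
raw = raw.sort_values(["msa", "year"])

# Year-over-year growth, computed separately within each MSA.
raw["hpi_growth"] = raw.groupby("msa")["hpi"].pct_change()

# Lag feature: the previous year's growth for the same MSA.
raw["hpi_growth_lag"] = raw.groupby("msa")["hpi_growth"].shift(1)

# Label: 1 if growth is at or above that year's 90th percentile, else 0.
threshold = raw.groupby("year")["hpi_growth"].transform(lambda s: s.quantile(0.9))
raw["rank"] = (raw["hpi_growth"] >= threshold).astype(int)

print(raw)
```

Rows without a prior year have no growth value, so they receive a label of 0 here; in practice such rows would likely be dropped before modeling.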



In [51]:
# Import the required modules
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from imblearn.metrics import classification_report_imbalanced
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression


# Prepare the Data

### Step 1: Load the `project_2_data_2.csv` file from the `Resources` folder into a pandas DataFrame. Set the `MSA` column as the index.

In [52]:
# Read the project_2_data_2.csv file into a pandas DataFrame.
housing_data = pd.read_csv(
    Path('./Resources/project_2_data_2.csv'), 
    index_col='MSA'
)

# Review the DataFrame
housing_data.head()


| MSA | %hpi | rank | %hpi_lag | Employment | Employment_ Lag | Period |
| --- | --- | --- | --- | --- | --- | --- |
| Abilene, TX | 0.149704 | 1 | -0.009124 | 0.68 | 2.26 | 1996-2000 |
| Akron, OH | 0.129769 | 1 | -0.016294 | 0.94 | 2.04 | 1996-2000 |
| Albany, GA | 0.126767 | 1 | -0.007567 | 0.04 | 2.76 | 1996-2000 |
| Albany-Lebanon, OR | 0.108796 | 1 | 0.002857 | 0.64 | 3.48 | 1996-2000 |
| Albany-Schenectady-Troy, NY | 0.107624 | 1 | -0.014653 | 1.66 | -0.06 | 1996-2000 |


### Step 2: Answer the following question:

Note that you want to predict the `rank` variable. Using `value_counts`, how many strong-market observations (`rank` = 1) exist in this dataset?

In [53]:
# The `rank` column is the target to predict.
# Class 0 indicates an ordinary market and class 1 indicates a strong (top-decile) market.
# Using value_counts, how many strong-market observations are in this dataset?
housing_data['rank'].value_counts()

0    1444
1     151
Name: rank, dtype: int64
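
The counts above show a strongly imbalanced target, which matters when reading any accuracy figure. As a small worked calculation using only those two counts, the share of strong markets and the accuracy of a trivial "always predict 0" rule are:

```python
# Class counts taken from the value_counts output above.
n_negative = 1444  # rank == 0
n_positive = 151   # rank == 1
total = n_negative + n_positive

positive_share = n_positive / total
majority_baseline = n_negative / total

print(f"Share of strong markets: {positive_share:.3f}")
print(f"Accuracy of always predicting 0: {majority_baseline:.3f}")
```

A classifier therefore needs to exceed roughly 90.5% accuracy on the full data before accuracy alone suggests it has learned anything about the positive class.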

# Split the data into training and testing sets

### Step 1: Using the `housing_data` DataFrame, separate the data into training and testing data. Start by defining the target (the `rank` column) and the features (all the columns except `rank` and the period label).

In [54]:
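# Drop incomplete rows first so that the features and the target stay aligned.
model_data = housing_data.dropna()

# The target column is the binary `rank` column.
y = model_data['rank']

# The features are all remaining columns except `rank` and the period label.
X = model_data.drop(columns=['rank','Period '])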



### Step 2: Split the features and target data into `X_training`, `X_testing`, `y_training`, and `y_testing` datasets by using the `train_test_split` function.

In [55]:
# Split the dataset using the train_test_split function
X_training, X_testing, y_training, y_testing = train_test_split(X, y)
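
Called with its defaults, `train_test_split` holds out 25% of the rows and shuffles without a fixed seed, so the split (and any score derived from it) changes between runs. A possible variant, shown here on synthetic stand-in data rather than the housing data, fixes the seed and stratifies on the target so that both sets keep the same class ratio. The names `X_demo` and `y_demo` and the seed value are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with a 10% positive class (not the housing data).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(1000, 4)), columns=["a", "b", "c", "d"])
y_demo = pd.Series([1] * 100 + [0] * 900)

# stratify keeps the class ratio equal in both sets; random_state makes it repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=1
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```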

# Model and Fit the Data to a Logistic Regression

### Step 1: Declare a `LogisticRegression` model.

In [56]:
# Declare a logistic regression model.
# Apply a random_state of 7 to the model
logistic_regression_model = LogisticRegression(random_state=7)

### Step 2: Fit the training data to the model, and save the model.

In [57]:
# Fit and save the logistic regression model using the training data
lr_model = logistic_regression_model.fit(X_training, y_training)

# Predict the Testing Labels

### Step 1: Make predictions about market strength by using the testing dataset, and save those predictions.

In [58]:
# Make and save testing predictions with the saved logistic regression model using the test data
testing_predections = lr_model.predict(X_testing)

# Review the predictions
testing_predections

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

# Calculate the Performance Metrics

### Step 1: Calculate the accuracy score by evaluating `y_testing` vs. `testing_predections`.

In [59]:
# Display the accuracy score for the test dataset.
accuracy_score(y_testing, testing_predections)

0.899749373433584
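
This accuracy is what a model would score if it predicted class 0 for every test row. The reconstruction below is inferred, not notebook output: it assumes the default 25% hold-out of the 1,595 rows (399 test rows), of which 40 are positive, and all-zero predictions. Under those assumptions accuracy is 359/399, while recall for the strong-market class is zero.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Assumed reconstruction: 399 test rows, 40 positives, every prediction equal to 0.
y_true = np.array([0] * 359 + [1] * 40)
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)     # share of true positives that were found
cm = confusion_matrix(y_true, y_pred)  # rows: actual class, columns: predicted class

print(acc)
print(rec)
print(cm)
```

Checking recall or the confusion matrix alongside accuracy makes this kind of result visible; the `classification_report` function imported at the top of the notebook reports the same per-class figures.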

# Predict the strongest housing market

In [61]:
# Read the predicting_data.csv file into a pandas DataFrame.
predicting_data = pd.read_csv(
    Path("./Resources/predicting_data.csv"), 
    index_col="MSA"
)

# Review the DataFrame
predicting_data.head()


| MSA | %hpi | rank | %hpi_lag | Employment | Employment_ Lag | Period |
| --- | --- | --- | --- | --- | --- | --- |
| Abilene, TX | 0.07 | 0 | 0.03 | 1.15 | 0.78 | 2016-2021 |
| Akron, OH | 0.06 | 0 | 0.01 | -0.87 | 1.18 | 2016-2021 |
| Albany, GA | 0.06 | 0 | -0.02 | 0.02 | 0.1 | 2016-2021 |
| Albany-Lebanon, OR | 0.12 | 1 | 0.02 | 1.33 | 1.92 | 2016-2021 |
| Albany-Schenectady-Troy, NY | 0.05 | 0 | 0.0 | -0.32 | 1.2 | 2016-2021 |


In [64]:
# Build the feature matrix with the same columns that were used for training
X_new = predicting_data.drop(columns=['rank','Period ']).dropna()

In [66]:
X_new.head()

| MSA | %hpi | %hpi_lag | Employment | Employment_ Lag |
| --- | --- | --- | --- | --- |
| Abilene, TX | 0.07 | 0.03 | 1.15 | 0.78 |
| Akron, OH | 0.06 | 0.01 | -0.87 | 1.18 |
| Albany, GA | 0.06 | -0.02 | 0.02 | 0.1 |
| Albany-Lebanon, OR | 0.12 | 0.02 | 1.33 | 1.92 |
| Albany-Schenectady-Troy, NY | 0.05 | 0.0 | -0.32 | 1.2 |


In [67]:
# Make and save predictions with the saved logistic regression model

predections = lr_model.predict(X_new)

# Review the predictions
predections

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
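
Because hard 0/1 labels give no ordering among markets, one way to look for the strongest candidates is to rank them by predicted probability instead. The sketch below illustrates the idea on synthetic data with a separately fitted model; `demo_model`, the feature names, the market names, and the output column names are all hypothetical, and nothing here reproduces the notebook's results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic training data in which higher growth makes class 1 more likely.
rng = np.random.default_rng(7)
X_fit = pd.DataFrame(rng.normal(size=(200, 2)), columns=["growth", "growth_lag"])
y_fit = (X_fit["growth"] + 0.5 * X_fit["growth_lag"] > 1.0).astype(int)

demo_model = LogisticRegression(random_state=7).fit(X_fit, y_fit)

# Hypothetical markets to score, indexed by name like the MSA index in the notebook.
X_score = pd.DataFrame(
    {"growth": [2.0, 0.0, -1.0], "growth_lag": [1.0, 0.0, -1.0]},
    index=pd.Index(["Market A", "Market B", "Market C"], name="MSA"),
)

# Pair each market with its predicted label and its class-1 probability, then sort.
scores = pd.DataFrame(
    {
        "predicted_rank": demo_model.predict(X_score),
        "prob_strong": demo_model.predict_proba(X_score)[:, 1],
    },
    index=X_score.index,
).sort_values("prob_strong", ascending=False)

print(scores)
```

Applied to the notebook's objects, the same pattern would use `lr_model` and `X_new` in place of the demo model and demo frame.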

In [None]:
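# Collect the predictions in a DataFrame indexed by MSA (the column name is illustrative)
df_pred = pd.DataFrame({'predicted_rank': predections}, index=X_new.index)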