# 🏡 **Unlocking the Secrets of Mortgages: A Journey into Homeownership and Beyond!**

![Home Image](https://i.ibb.co/DkfZnfD/image.png)

Welcome to the fascinating world of Mortgage Data, where numbers and trends come to life, telling the captivating story of Default Prediction, Customer Segmentation, and Property Purchase Trends. 🚀


## Necessary Libraries

In [31]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as srn
from scipy.stats import zscore

## Loading Dataset & Head Entries

In [2]:
data = pd.read_csv("dataset.csv", encoding="ISO-8859-1", low_memory=False)

In [3]:
data['Property Value Range'] = data['Property Value Range'].str.replace('$', '', regex=False)

In [4]:
data.head(5)

| | First_Name | Last_Name | Address | City | County | State | Zip | Property Type | Phone | Gender | ... | Property Purchased Year | Property Built | Property Value Range | Mortgage Amount In Thousands | Lender Name | Interest Type | Loan Type | Loan To Value | Home Value Mortgage File | Email |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RHONDA | HALFPOP | 731 S Mississippi Ave | Mason City | Cerro Gordo | IA | 50401 | S | 6414230703 | F | ... | 0 | 1976 | 150,000 - 174,999 | 79 | BANK OF AMERICA | U | Conventional | 51 | 155335 | |
| 1 | ROBERT | BIRGE | 503 Student Dr | Ogallala | Keith | NE | 69153 | S | 3082849975 | M | ... | 2012 | 1959 | 125,000 - 149,999 | 143 | US BK NATIONAL ASSN | U | Conventional | 0 | 140094 | |
| 2 | HILARY | DOLAN | 429 Moore Dr | Mount Holly | Burlington | NJ | 8060 | S | 6095187632 | F | ... | 2013 | 1926 | 200,000 - 224,999 | 40 | COMMERCE BK | U | Conventional | 105 | 212437 | |
| 3 | DEBORAH | EMOND | 46 Junction Rd | Malone | Franklin | NY | 12953 | S | 5184830320 | F | ... | 0 | 1920 | 75,000 - 99,999 | 45 | NORTH FRANKLIN FCU *OTHER | U | Conventional | 52 | 87210 | |
| 4 | MARY | EDWARDS | 11129 131st St | S Ozone Park | Queens | NY | 11420 | S | 3475316801 | F | ... | 1997 | 0 | 350,000 - 399,999 | 100 | NASSAU EDUCATORS FCU | U | Conventional | 27 | 374555 | |


## Data Description

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 21 columns):
 #   Column                        Non-Null Count    Dtype 
---  ------                        --------------    ----- 
 0   First_Name                    999996 non-null   object
 1   Last_Name                     999975 non-null   object
 2   Address                       1000000 non-null  object
 3   City                          1000000 non-null  object
 4   County                        1000000 non-null  object
 5   State                         1000000 non-null  object
 6   Zip                           1000000 non-null  int64 
 7   Property Type                 1000000 non-null  object
 8   Phone                         1000000 non-null  object
 9   Gender                        1000000 non-null  object
 10  Age                           1000000 non-null  int64 
 11  Property Purchased Year       1000000 non-null  int64 
 12  Property Built                1000000 non-n

In [6]:
data.describe(include="all")

| | First_Name | Last_Name | Address | City | County | State | Zip | Property Type | Phone | Gender | ... | Property Purchased Year | Property Built | Property Value Range | Mortgage Amount In Thousands | Lender Name | Interest Type | Loan Type | Loan To Value | Home Value Mortgage File | Email |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 999996 | 999975 | 1000000 | 1000000 | 1000000 | 1000000 | 1000000.0 | 1000000 | 1000000.0 | 1000000 | ... | 1000000.0 | 1000000.0 | 1000000 | 1000000.0 | 1000000 | 1000000 | 1000000 | 1000000.0 | 1000000.0 | 90905 |
| unique | 55928 | 204150 | 980375 | 11583 | 1530 | 51 | | 3 | 1000000.0 | 3 | ... | | | 20 | | 41924 | 4 | 8 | | | 90848 |
| top | MICHAEL | SMITH | PO Box 82 | Houston | Los Angeles | CA | | S | 6414230703.0 | M | ... | | | 500,000 - 749,999 | | WELLS FARGO BK NA | U | Conventional | | | claramarine@aol.com |
| freq | 17822 | 7510 | 16 | 8580 | 28601 | 121430 | | 951281 | 1.0 | 466009 | ... | | | 86355 | | 64958 | 727689 | 842229 | | | 2 |
| mean | | | | | | | 50353.348007 | | | | ... | 1716.699152 | 1784.453285 | | 168.309944 | | | | 66.59678 | 317226.8 | |
| std | | | | | | | 30018.322936 | | | | ... | 701.213214 | 583.796886 | | 209.675384 | | | | 38.756272 | 313188.3 | |
| min | | | | | | | 1001.0 | | | | ... | 0.0 | 0.0 | | 0.0 | | | | 0.0 | 0.0 | |
| 25% | | | | | | | 27107.0 | | | | ... | 1993.0 | 1952.0 | | 70.0 | | | | 42.0 | 148136.0 | |
| 50% | | | | | | | 46256.0 | | | | ... | 2003.0 | 1977.0 | | 128.0 | | | | 68.0 | 234582.0 | |
| 75% | | | | | | | 78247.0 | | | | ... | 2009.0 | 1996.0 | | 220.0 | | | | 89.0 | 379902.2 | |


## Dropping Unnecessary Data Columns (Name, Phone, Emails)

In [7]:
# Keep only the location, demographic, property, and loan columns.
columns_to_keep = [
    'City',
    'County',
    'State',
    'Property Type',
    'Gender',
    'Age',
    'Property Purchased Year',
    'Property Built',
    'Property Value Range',
    'Mortgage Amount In Thousands',
    'Lender Name',
    'Interest Type',
    'Loan Type',
    'Loan To Value'
]

data = data[columns_to_keep]

In [8]:
memory_usage = data.memory_usage(deep=True).sum()

print(f"Memory usage of the DataFrame: {memory_usage / (1024**2):.2f} MB")

Memory usage of the DataFrame: 591.19 MB


## Updating Column Naming Conventions

In [9]:
data.columns = data.columns.str.strip().str.replace(' ', '_').str.lower()
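
To make the effect of this renaming concrete, here is a minimal sketch on a toy frame. The frame `demo` and its column names are hypothetical, chosen only to resemble the raw headers; the method chain is the same one used in the cell above.

```python
import pandas as pd

# Hypothetical toy frame; the column names only imitate the raw dataset's style.
demo = pd.DataFrame(columns=[' Property Type', 'Loan To Value ', 'Age'])

# Same chain as the cell above: trim whitespace, swap spaces for underscores, lowercase.
demo.columns = demo.columns.str.strip().str.replace(' ', '_').str.lower()

normalized = list(demo.columns)
print(normalized)
```

Snake-case names like these avoid quoting problems and keep later column lookups consistent.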

## Data Cleaning

A large share of the essential data is missing, either as NaN/NULL or as placeholder values (0 in numeric fields, 'U' for unknown categories). For categorical data this is a minor problem: most categorical and ordinal values are present, and the gaps are easy to fill without significantly affecting the model. The challenge lies in numerical fields such as years and values, where 20%-30% of the entries are missing and cannot be recovered from other columns. Other fields, such as age and binary gender, can be imputed. That leaves two options: drop roughly 20% of the rows, or keep them and fill the gaps in a way that maintains the overall data distribution.
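
Before choosing between dropping and imputing, it can help to measure how much of each column is effectively missing. The sketch below assumes that unknown values are encoded as 0 in numeric columns and 'U' in categorical ones, which is how the cleaning cells that follow treat them. The frame `toy` and its values are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy rows; the values are illustrative only.
toy = pd.DataFrame({
    'gender': ['F', 'M', 'U', 'F'],
    'property_purchased_year': [0, 2012, 2013, 0],
    'property_built': [1976, 1959, 0, 1920],
})

# Assumed placeholder per column: 'U' for unknown categories, 0 for unknown numbers.
placeholders = {'gender': 'U', 'property_purchased_year': 0, 'property_built': 0}

# Fraction of rows in each column that hold the placeholder value.
missing_share = pd.Series(
    {col: (toy[col] == value).mean() for col, value in placeholders.items()}
)
print(missing_share)
```

Running the same check on the real frame would show whether the share of placeholder values is small enough to drop or large enough that imputation is the safer choice.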

In [10]:
data = data.drop_duplicates()

In [11]:
data = data.dropna(subset=['city', 'county', 'state'])

In [12]:
# Treat ages below 18 as invalid and replace them with the mean of the valid ages.
avg_age = data.loc[data['age'] >= 18, 'age'].mean()
data['age'] = data['age'].mask(data['age'] < 18, avg_age).astype(int)

In [13]:
mode_gender = data['gender'].mode().iloc[0]
data['gender'] = data['gender'].replace('U', mode_gender)

In [14]:
data = data[data['property_type'] != 'U']

In [22]:
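# Zeros are placeholders for unknown years, so exclude them when computing the medians.
median_year = data.loc[data['property_purchased_year'] != 0, 'property_purchased_year'].median()
median_built = data.loc[data['property_built'] != 0, 'property_built'].median()

data['property_purchased_year'] = data['property_purchased_year'].replace(0, median_year)
data['property_built'] = data['property_built'].replace(0, median_built)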

In [27]:
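# A mortgage amount of 0 is treated as missing; the median is taken over non-zero amounts only.
median_mortgage = data.loc[data['mortgage_amount_in_thousands'] != 0, 'mortgage_amount_in_thousands'].median()
data['mortgage_amount_in_thousands'] = data['mortgage_amount_in_thousands'].replace(0, median_mortgage)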

In [23]:
data = data.reset_index(drop=True)

## Removing Outliers

In [35]:
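data_cleaned = data.copy()

# For every numeric column, flag values more than 3 standard deviations from the mean
# and replace them with the column median. Rows are kept; only the values change.
for column_name in data_cleaned.select_dtypes(include=['number']).columns:
    z_scores = zscore(data_cleaned[column_name])
    is_outlier = abs(z_scores) > 3
    column_median = data_cleaned[column_name].median()
    data_cleaned[column_name] = data_cleaned[column_name].where(~is_outlier, column_median)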

997218
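
The z-score rule above can be illustrated on a small, hypothetical series of mortgage amounts. This is a sketch, not the notebook's own cell: the z-scores are computed by hand with `ddof=0`, which matches the default of `scipy.stats.zscore`, and the values in `amounts` are invented for illustration.

```python
import pandas as pd

# Hypothetical mortgage amounts (in thousands) with one extreme value at the end.
amounts = pd.Series([70, 128, 220, 95, 150, 110, 180, 60,
                     140, 200, 130, 90, 160, 120, 5000])

# Population z-score (ddof=0), the same convention scipy.stats.zscore uses by default.
z_scores = (amounts - amounts.mean()) / amounts.std(ddof=0)
is_outlier = z_scores.abs() > 3

# Keep non-outliers as they are; overwrite outliers with the column median.
cleaned = amounts.where(~is_outlier, amounts.median())

print(is_outlier.sum(), cleaned.iloc[-1], len(cleaned))
```

Because the outlier is overwritten rather than dropped, the series keeps its original length. Note that very small samples can never exceed a z-score of 3, so this rule only has an effect on reasonably large columns.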