<div align="center"> <span style="color:black"> <h1> <b> An Evaluation of the House Price Market </b> </h1> </span> </div>

<div align="center"> <span style="color:black"> <h3> Applying Data Science Life Cycle on Housing Prices </h3> </span> </div>

<br> 

<div align="center"> <img src="house.jpg" style="width:200px;height:200px;" alt="House"> </div>

<br>

<b> Brianna Giang, Christopher Giang, Tony Vu </b> 

<h3>Table of Contents</h3>

<ul>
    <li><a href="#introduction">Introduction</a>
    <br></li>
</ul>
<ul>
    <li><a href="#part1">Part 1: Data Collection</a></li>
</ul>

<ul>
    <li><a href="#part2">Part 2: Data Cleaning</a></li>
</ul>
    
<ul>
    <li><a href="#part3">Part 3: Exploratory Data Analysis</a></li>
</ul>

<ul>
    <li><a href="#part4">Part 4: Full Model Implementation</a></li>
</ul>

<ul>
    <li><a href="#part5">Part 5: Visualizations</a><br></li>
</ul>

<ul>
    <li><a href="#part6">Part 6: Conclusions</a></li>
</ul> 

<p><a id='introduction'></a></p>
<h2 id="Introduction">Introduction</h2><a class="anchor-link" href="#Introduction"></a>
<p> ... *write the introduction here* *add html links that fit* 
<br></p>

<br>

<p> We seek to identify the factors that contribute to house market prices by examining the 2023 Zillow house listings for the USA. We will determine which factors are correlated with changes in price. This study covers the entire data science lifecycle, from data collection through to conclusive findings. </p>

<p><a id='part1'></a></p>
<h2 id="Data-Collection">Part 1: Data Collection<a class="anchor-link" href="#Data-Collection"></a></h2>

<p>Our analysis is at the level of the entire country. Before loading the data, we import the Python libraries used throughout this study.</p>


In [None]:
# Standard Python libraries for data analysis and machine learning.
import pandas as pd
import numpy as np 

# These libraries are used for plots and visualizations. 
import matplotlib.pyplot as plt
import seaborn as sns

Our dataset comes from <a href="https://www.kaggle.com/datasets/febinphilips/us-house-listings-2023/data">Kaggle</a> and is sourced from Zillow, a leading online platform for real estate transactions that provides information on approximately 100 million homes. It is a comprehensive dataset covering housing attributes across states, cities, and neighborhoods of the United States. Each entry corresponds to an individual property listing and contains information such as location, property details, market valuations, and additional specifications.

Download the dataset from Kaggle so that it can be imported.


In [None]:
df = pd.read_csv("zillow_housing_prices.csv")

print(df.head())
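
Before cleaning, it can help to confirm what was actually loaded. The following is a minimal sketch of such a check. It assumes a small stand-in frame named `sample` with made-up values (hypothetical, not taken from the dataset) so that it runs without the Kaggle download; the same calls could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Zillow CSV; all values are made up for illustration.
sample = pd.DataFrame({
    "State": ["AL", "AL", "CA", "WY"],
    "Bedroom": [4.0, 3.0, np.nan, 5.0],
    "Bathroom": [2.0, 2.0, 3.0, 7.0],
    "Area": [1614.0, 1474.0, 2100.0, 7984.0],
    "MarketEstimate": [240600.0, 186700.0, np.nan, 3500000.0],
})

# Number of rows and columns that were loaded.
n_rows, n_cols = sample.shape
print(f"{n_rows} rows, {n_cols} columns")

# Data type of each column; columns meant to be numeric should be int or float.
print(sample.dtypes)

# Names of the numeric columns, which are the candidates for modeling.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

A column that should be numeric but is not listed in `numeric_cols` usually points to stray text in the raw file and is worth inspecting before modeling.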

First, we clean the data so that it is usable. We drop City and Street because they are too specific for our purposes. We also drop Latitude and Longitude because they are not relevant to estimating the market price as long as we keep State. The cell below additionally drops Zipcode, LotArea, LotUnit, RentEstimate, and Price.

*fix description to talk about why we dropped these specific columns*

In [39]:
# Drop the columns that are not used in the analysis.
df.drop(columns=['City', 'Street', 'Zipcode', 'Latitude', 'Longitude', 'LotArea', 'LotUnit', 'RentEstimate', 'Price'], inplace=True)
print(df.columns)


Index(['State', 'Bedroom', 'Bathroom', 'Area', 'PPSq', 'ConvertedLot',
       'MarketEstimate'],
      dtype='object')
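
An alternative way to express the same step is to name the columns to keep rather than the columns to drop, which leaves the original frame untouched. The sketch below is only an illustration: it assumes a hypothetical two-row frame `sample` with made-up values and only the column names that appear in this notebook.

```python
import pandas as pd

# Hypothetical two-row frame; column names follow this notebook, values are made up.
sample = pd.DataFrame({
    "State": ["AL", "CA"], "City": ["X", "Y"], "Street": ["1 A St", "2 B Ave"],
    "Zipcode": [11111, 22222], "Bedroom": [4.0, 3.0], "Bathroom": [2.0, 2.0],
    "Area": [1614.0, 1474.0], "PPSq": [148.6, 310.2],
    "LotArea": [0.38, 6000.0], "LotUnit": ["acres", "sqft"],
    "ConvertedLot": [0.38, 0.14], "MarketEstimate": [240600.0, 455000.0],
    "RentEstimate": [1500.0, 2400.0], "Latitude": [30.0, 36.0],
    "Longitude": [-88.0, -119.0], "Price": [239900.0, 457000.0],
})

# Columns retained for the analysis (the same seven printed above).
keep = ["State", "Bedroom", "Bathroom", "Area", "PPSq", "ConvertedLot", "MarketEstimate"]

# Select a copy so that later edits do not modify `sample`.
trimmed = sample[keep].copy()

# Everything that was left out, for a quick sanity check.
dropped = sorted(set(sample.columns) - set(keep))
print(trimmed.columns.tolist())
print(dropped)
```

A keep-list fails loudly with a `KeyError` if an expected column is missing, which can be easier to debug than a silent mismatch later on.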


* add more paragraphs about missing data, mention MCAR, NAR, etc.* 
* mention the possibility of missing data imputation* -> by using estimate 

The next concern we have about cleaning our data is missing data....


In [40]:
# Dropping all rows with missing data 
df.dropna(inplace = True)
print(df.head())

  State  Bedroom  Bathroom    Area        PPSq  ConvertedLot  MarketEstimate
0    AL      4.0       2.0  1614.0  148.636927       0.38050        240600.0
1    AL      3.0       2.0  1474.0    0.000678       0.67034        186700.0
4    AL      3.0       3.0  2224.0  150.629496       0.26000        336200.0
6    AL      3.0       2.0  1564.0   96.547315       0.20000        150500.0
7    AL      3.0       2.0  1717.0  139.196273       0.38000        238400.0
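
Dropping every incomplete row is the approach used in this notebook. As a hedged sketch of how the alternatives could be compared, the example below counts missing values per column and then contrasts a full drop with a simple median imputation of the predictors. The frame `sample`, its values, and the choice of the median are assumptions made for illustration, not the authors' method.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few gaps; values are made up for illustration.
sample = pd.DataFrame({
    "State": ["AL", "AL", "AL", "CA", "CA"],
    "Bedroom": [4.0, 3.0, np.nan, 3.0, 2.0],
    "Area": [1614.0, 1474.0, 2000.0, np.nan, 1100.0],
    "MarketEstimate": [240600.0, 186700.0, 300000.0, 650000.0, np.nan],
})

# How many values are missing in each column.
missing_counts = sample.isna().sum()
print(missing_counts)

# Option 1: drop every row that has at least one missing value.
dropped = sample.dropna()

# Option 2: drop rows only when the target is missing, then fill gaps
# in the predictors with each column's median.
imputed = sample.dropna(subset=["MarketEstimate"]).copy()
for col in ["Bedroom", "Area"]:
    imputed[col] = imputed[col].fillna(imputed[col].median())

print(len(dropped), len(imputed))
```

In this toy frame, imputation keeps four rows where a full drop keeps two. Whether that trade-off is acceptable depends on why the values are missing in the first place.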


In [41]:
df.sort_values(by=['MarketEstimate'], inplace = True)
print(df.head())
print(df.tail())

     State  Bedroom  Bathroom    Area        PPSq  ConvertedLot  \
9164    MD      4.0       3.0  2300.0   56.478261         0.270   
471     AL      4.0       2.0  1089.0  117.079890         0.150   
1900    AR      3.0       1.0  1524.0   16.076115         0.330   
8228    LA      4.0       2.0  2707.0    9.235316         0.379   
7725    KY      2.0       2.0  1105.0   24.886878         0.090   

      MarketEstimate  
9164         15700.0  
471          21400.0  
1900         22500.0  
8228         24531.0  
7725         25900.0  
      State  Bedroom  Bathroom     Area         PPSq  ConvertedLot  \
9949     MA      9.0      11.0  23374.0  1197.912210        3.1012   
24095    WY      5.0       7.0   7984.0  4101.953908       37.0600   
2422     CA      7.0       9.0  14000.0  2063.428571        1.3082   
24268    WY      4.0       7.0   9696.0  3867.574257        4.5300   
20183    TN      5.0      10.0  19811.0  2019.080309       49.7200   

       MarketEstimate  
9949       277

*fix this later, find sources to talk about average* 

As shown above, the range of market estimates is very large. Because we want predictions that are accurate for typical homes, we restrict the data to market estimates between $100,000 and $2,000,000.

In [42]:
df = df[(df['MarketEstimate'] >= 100000) & (df['MarketEstimate'] <= 2000000)]
print(df.head())

      State  Bedroom  Bathroom    Area        PPSq  ConvertedLot  \
7512     KS      2.0       1.0   704.0  140.625000          0.14   
23846    WI      4.0       2.0  2440.0   40.942623          0.42   
10408    MI      1.0       1.0   528.0  189.204545          1.74   
8567     ME      3.0       1.0  1008.0  106.150794          1.00   
18149    PA      3.0       2.0  1145.0   87.248908          0.06   

       MarketEstimate  
7512         100200.0  
23846        100200.0  
10408        100600.0  
8567         100600.0  
18149        100800.0  
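
The same range filter can also be written with `Series.between`, which is inclusive on both ends by default. This makes it easy to check how much data the cutoff removes. The sketch below assumes a hypothetical five-row frame `estimates` with made-up values; the bounds are the ones used above.

```python
import pandas as pd

# Hypothetical market estimates; values are made up for illustration.
estimates = pd.DataFrame(
    {"MarketEstimate": [15700.0, 100000.0, 240600.0, 2000000.0, 5000000.0]}
)

low, high = 100_000, 2_000_000

# Boolean mask: True where low <= MarketEstimate <= high (both ends inclusive).
in_range = estimates["MarketEstimate"].between(low, high)
filtered = estimates[in_range]

# Share of rows that survive the filter.
kept_share = in_range.mean()
print(f"kept {in_range.sum()} of {len(estimates)} rows ({kept_share:.0%})")
```

Reporting the share of rows kept alongside the filter makes it clear how much of the market the later model actually covers.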
