---
title: "Data Gathering"
format: html
---

# Census Data

I will attempt to gather all the data I need to answer my 10 questions. Most of the data I acquire will come from [census.gov](https://www.census.gov/). I will use data tables from the U.S. Census Bureau's [American Community Survey](https://www.census.gov/data/developers/data-sets/ACS-supplemental-data.html) (ACS), a nationwide survey that collects and produces information on social, economic, housing, and demographic characteristics about our nation's population each year.


I chose census data and specifically the ACS tables because, from experience and knowledge I have accumulated in my life, I inferred that the variables in these tables could be useful in predicting home prices. For example, one variable I will examine is the number of people with bachelor's degrees. Usually, higher education is linked to higher incomes, and therefore, higher home prices. The variables I choose to include from the tables may prove valuable or they may not; part of my analysis is determining which variables are the most important. Also, a lot of my variables are heavily correlated with each other, like total households and total married households. This can become an issue in the models I create as multicollinearity increases variance and makes it hard to know the impact of an individual variable in the model. I am aware of this and will address it later in my project.

I will import the DP02-DP05 tables from 2017-2022, excluding 2020 because there was not accurate data that year due to COVID. Ultimately, I will not need all the columns in the tables, but it will be nice to have easy access to them in my future analysis. I only went back to 2017 for now, but if I wanted to improve my models, I could retrieve more data. The same methodology will apply no matter how much data I acquire.

Here are what the ACS tables contain:<br>
DP02: Selected Social Characteristics in the United States <br>
DP03: Selected Economic Characteristics in the United States <br>
DP04: Selected Housing Characteristics <br>
DP05: ACS Demographic and Housing Estimates

Using the API, you can query different geography hierarchies, like if you want data on regions in the U.S. or, in my analysis, on U.S. states.


I will start by retreiving one table to see what we are working with.

In [42]:
import requests
import json
import re
import pandas as pd
DP02_URL_2017="https://api.census.gov/data/2017/acs/acs1/profile?get=group(DP02)&for=state:*"
DP02_2017= requests.get(DP02_URL_2017)
DP02_2017 = DP02_2017.json()
DP02_2017=pd.DataFrame(DP02_2017)
DP02_2017.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1209,1210,1211,1212,1213,1214,1215,1216,1217,1218
0,DP02_0001E,DP02_0001EA,DP02_0001M,DP02_0001MA,DP02_0001PE,DP02_0001PEA,DP02_0001PM,DP02_0001PMA,DP02_0002E,DP02_0002EA,...,DP02_0152EA,DP02_0152M,DP02_0152MA,DP02_0152PE,DP02_0152PEA,DP02_0152PM,DP02_0152PMA,GEO_ID,NAME,state
1,1091980,,9693,,1091980,,-888888888,(X),716451,,...,,9854,,73.4,,0.6,,0400000US28,Mississippi,28
2,2385135,,13054,,2385135,,-888888888,(X),1527260,,...,,15052,,81.3,,0.4,,0400000US29,Missouri,29
3,423091,,4068,,423091,,-888888888,(X),262726,,...,,4698,,81.3,,0.9,,0400000US30,Montana,30
4,754490,,4583,,754490,,-888888888,(X),484989,,...,,6206,,84.4,,0.5,,0400000US31,Nebraska,31


 The column names do not tell us much right now because they are codes that match to different variable labels in the census data. For example, DP02_0001E maps to total household counts. In the data cleaning section, we will be sure to give each column a proper label that tells us what the column represents. Also in the data cleaning section, we will only keep a select few columns out of the over 1000 columns in the tables.

I will use the following loop to get each table I want and turn them into csv files for easy retrieval for the rest of my analysis. If you would like to see my csv files, please refer to this [GitHub Repository](https://github.com/anly501/dsan-5000-project-NolanPenoyer/tree/main/Nolan%20Penoyer%20Website).<br>
*Nolan Penoyer Website/data* (Every file that starts with a year)

In [81]:
string='https://api.census.gov/data/2016/acs/acs1/profile?get=group(DP05)&for=state'
list1=['2016','2017','2018','2019','2021']
list2=['2017','2018','2019','2021','2022']
list3=['DP01','DP02','DP03','DP04']
list4=['DP02','DP03','DP04','DP05']
for i in range(5):
    string=string.replace('DP05',list3[0])
    for w in range(4):
        file='./data/'
        csv='.csv'
        name=file+list2[i]+list4[w]+csv
        csvname=file+list2[i]+list3[w]
        string=string.replace(list1[i],list2[i])
        string=string.replace(list3[w],list4[w])
        response=requests.get(string)
        response = response.json()
        df=pd.DataFrame(response)
        df.to_csv(name, index=False)


# Home Value Data

My home value data comes from [Zillow](https://www.zillow.com/research/data/). Specifically, I will use Zillow's Home Value Index (ZHVI), which reflects the typical value of homes in the 35th to 65th percentile, a home that will be in my price range. The ZHVI was created to accurately capture the seasonally adjusted value of a typical property across the nation as opposed to just the homes that sold. If you would like to learn more about how Zillow calculates this value and why it is a more accurate measure of home value as opposed to other alternatives like median home price, you can visit the link I provided.

Zillow has an API, but since it was one table I needed, I downloaded the csv file. Like my census data, I calculated this data on the state level for years 2017-2022, excluding 2020. If you would like to see my csv file, please refer to this [GitHub Repository](https://github.com/anly501/dsan-5000-project-NolanPenoyer/tree/main/Nolan%20Penoyer%20Website).<br>
*Nolan Penoyer Website/data* (Zillow.csv))

# News Data

Although not a focus of my project, news headline text data could potentially provide us with some insight on the sentiment of the real estate market and therefore what direction the general population thinks home prices are heading. The overall sentiment can provide us with signals that people are overvaluing or undervaluing homes, or that a rise or drop in home prices may be pending. Knowing if the general public believes buying a home, especially in a certain state, is a smart decision, can tell us which direction prices are headed, depending if you agree with the general public or not. Another potential data science project I would like to study in the future is if market sentiment matches market performance, and if there is a lag between the two.

I will use the [news API](https://newsapi.org/) to search and retreive live articles from across the web. I will only search for articles that contain "Real Estate" in the headline. If you would like to see my csv file, please refer to this [GitHub Repository](https://github.com/anly501/dsan-5000-project-NolanPenoyer/tree/main/Nolan%20Penoyer%20Website).<br>
*Nolan Penoyer Website/data* (news.csv))

In [10]:
import requests
import json
import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
baseURL = "https://newsapi.org/v2/everything?"
total_requests=2
verbose=True
API_KEY='6a2b4aca491e4bedb90a4c7275f2ddf6'
TOPIC='Real Estate'
URLpost = {'apiKey': API_KEY,
            'q': '+'+TOPIC,
            'sortBy': 'relevancy',
            'totalRequests': 1}

#print(baseURL)
response = requests.get(baseURL, URLpost)
response = response.json()
def string_cleaner(input_string):
    try: 
        out=re.sub(r"""
                    [,.;@#?!&$-]+  # Accept one or more copies of punctuation
                    \ *           # plus zero or more copies of a space,
                    """,
                    " ",          # and replace it with a single space
                    input_string, flags=re.VERBOSE)

        #REPLACE SELECT CHARACTERS WITH NOTHING
        out = re.sub('[’.]+', '', input_string)

        #ELIMINATE DUPLICATE WHITESPACES USING WILDCARDS
        out = re.sub(r'\s+', ' ', out)

        #CONVERT TO LOWER CASE
        out=out.lower()
    except:
        #print("ERROR")
        out=''
    return out

article_list=response['articles']   #list of dictionaries for each article
article_keys=article_list[0].keys()
#print("AVAILABLE KEYS:")
#print(article_keys)
index=0
cleaned_data=[];  
for article in article_list:
    tmp=[]
    for key in article_keys:
        if(key=='source'):
            src=string_cleaner(article[key]['name'])
            tmp.append(src) 

        if(key=='author'):
            author=string_cleaner(article[key])
            #ERROR CHECK (SOMETIMES AUTHOR IS SAME AS PUBLICATION)
            if(src in author): 
                tmp.append(author)

        if(key=='title'):
            tmp.append(string_cleaner(article[key]))

        # if(key=='description'):
        #     tmp.append(string_cleaner(article[key]))

        # if(key=='content'):
        #     tmp.append(string_cleaner(article[key]))

        if(key=='publishedAt'):
            #DEFINE DATA PATERN FOR RE TO CHECK  .* --> wildcard
            ref = re.compile('.*-.*-.*T.*:.*:.*Z')
            date=article[key]
            if(not ref.match(date)):
                print(" DATE ERROR:",date); date="NA"
            tmp.append(date)

    cleaned_data.append(tmp)
    index+=1

df = pd.DataFrame(cleaned_data)
df.head()
df.to_csv('./data/news.csv', index=False) #,index_label=['title','src','author','date','description'])


In [6]:
df.head()

Unnamed: 0,0,1,2,3
0,lifehackercom,elizabeth yuko,what to do if your home inspector missed a maj...,2023-11-12T18:00:00Z
1,slashdotorg,msmash,almost no one pays a 6% real-estate commission...,2023-11-17T18:40:00Z
2,business insider,george glover,real estate made china rich now it's looking m...,2023-11-16T10:11:14Z
3,business insider,cork gaines,baby boomers got rich off real estate and they...,2023-11-19T10:32:01Z
4,business insider,theron mohamed,"stocks may crash 30%, a recession looks immine...",2023-11-19T11:30:01Z


Here I will use an API in R to retreive text data that can give me some insight into some real estate trends and what states people are talking about.

# Reddit Data

I can also retrieve Reddit text data for the same reason. Reddit comments are a more accurate measure of what the public believes than news data as there are no biases and incentives from corporations to report news a certain way. I also can break up the text data by state, which I attempted to do below. I need to find a work around as Reddit limits how many requests you are allowed in a period of time. I will not use Reddit data for the rest of this project.

In [13]:
library(RedditExtractoR)
library(dplyr)
states <- c(
  "Alabama", "Alaska", "Arizona", "Arkansas", "California",
  "Colorado", "Connecticut", "Delaware", "Florida", "Georgia",
  "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa",
  "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland",
  "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri",
  "Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey",
  "New Mexico", "New York", "North Carolina", "North Dakota", "Ohio",
  "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island", "South Carolina",
  "South Dakota", "Tennessee", "Texas", "Utah", "Vermont",
  "Virginia", "Washington", "West Virginia", "Wisconsin", "Wyoming"
)
text<-data.frame()
subreddit="RealEstate"
for (state in states){
    state_df<- find_thread_urls(keywords = state,subreddit=subreddit, sort_by="top", period = 'year')
    state_df<-state_df|>
    mutate(State=state)
    text<-rbind(text,state_df)
}