---
title: "Data Gathering"
format: html
---

In [11]:
import requests
import json
import re
import pandas as pd

I will attempt to gather all the data I need to answer my 10 questions. Most of the data I acquire will come from census.gov. I will use data tables from the U.S. Census Bureau's American Community Survey (ACS), a nationwide survey that collects and produces information on social, economic, housing, and demographic characteristics about our nation's population each year.

I do not know how exactly I will be doing my analysis yet or what variables I think will be most useful, but to cover all my bases I will import the DP02-DP05 tables from 2017-2022, excluding 2020 because there was not accurate data that year due to COVID. I may not need all these tables or columns in the tables but it will be nice to have easy access to them in my future analysis.

Here are what the ACS tables contain:<br>
DP02: Selected Social Characteristics in the United States <br>
DP03: Selected Economic Characteristics in the United States <br>
DP04: Selected Housing Characteristics <br>
DP05: ACS Demographic and Housing Estimates

Here is a link to the webpage: https://www.census.gov/data/developers/data-sets/ACS-supplemental-data.html

I decided that I just want to focus on the real estate growth potential in every state. I can start broad by going by state, then in the future I can apply this study to different geographical levels, like if I wanted to focus on one state and compare different counties or cities. The methodology will be the same. So, a state in the United States will be my observational unit.

# Python API

I will start by using an API in python to get one table to see what we are working with.

In [42]:
DP02_URL_2017="https://api.census.gov/data/2017/acs/acs1/profile?get=group(DP02)&for=state:*"
DP02_2017= requests.get(DP02_URL_2017)
DP02_2017 = DP02_2017.json()
DP02_2017=pd.DataFrame(DP02_2017)
DP02_2017.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1209,1210,1211,1212,1213,1214,1215,1216,1217,1218
0,DP02_0001E,DP02_0001EA,DP02_0001M,DP02_0001MA,DP02_0001PE,DP02_0001PEA,DP02_0001PM,DP02_0001PMA,DP02_0002E,DP02_0002EA,...,DP02_0152EA,DP02_0152M,DP02_0152MA,DP02_0152PE,DP02_0152PEA,DP02_0152PM,DP02_0152PMA,GEO_ID,NAME,state
1,1091980,,9693,,1091980,,-888888888,(X),716451,,...,,9854,,73.4,,0.6,,0400000US28,Mississippi,28
2,2385135,,13054,,2385135,,-888888888,(X),1527260,,...,,15052,,81.3,,0.4,,0400000US29,Missouri,29
3,423091,,4068,,423091,,-888888888,(X),262726,,...,,4698,,81.3,,0.9,,0400000US30,Montana,30
4,754490,,4583,,754490,,-888888888,(X),484989,,...,,6206,,84.4,,0.5,,0400000US31,Nebraska,31


As you can see the column names do not tell us much right now because they are codes that match to different variable labels in the census data. For example, DP02_0001E maps to total household counts. In the data cleaning section, we will be sure to give each column a proper label that tells us what the column represents. Furthermore, we will only keep a select few columns out of the large number of variables in the data cleaning section.

I will use the following code to get each table I want and turn them into csvs for easy retrieval for the rest of my analysis.

In [81]:
string='https://api.census.gov/data/2016/acs/acs1/profile?get=group(DP05)&for=state'
list1=['2016','2017','2018','2019','2021']
list2=['2017','2018','2019','2021','2022']
list3=['DP01','DP02','DP03','DP04']
list4=['DP02','DP03','DP04','DP05']
for i in range(5):
    string=string.replace('DP05',list3[0])
    for w in range(4):
        file='./data/'
        csv='.csv'
        name=file+list2[i]+list4[w]+csv
        csvname=file+list2[i]+list3[w]
        string=string.replace(list1[i],list2[i])
        string=string.replace(list3[w],list4[w])
        response=requests.get(string)
        response = response.json()
        df=pd.DataFrame(response)
        df.to_csv(name, index=False)


# R API

Here I will use an API in R to retreive text data that can give me some insight into some real estate trends and what states people are talking about.

In [12]:
library(RedditExtractoR)
library(dplyr)

In [13]:
states <- c(
  "Alabama", "Alaska", "Arizona", "Arkansas", "California",
  "Colorado", "Connecticut", "Delaware", "Florida", "Georgia",
  "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa",
  "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland",
  "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri",
  "Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey",
  "New Mexico", "New York", "North Carolina", "North Dakota", "Ohio",
  "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island", "South Carolina",
  "South Dakota", "Tennessee", "Texas", "Utah", "Vermont",
  "Virginia", "Washington", "West Virginia", "Wisconsin", "Wyoming"
)

In [14]:
text<-data.frame()
subreddit="RealEstate"

Attemped to run this loop but it seems Reddit would not let me execute this.

In [None]:
for (state in states){
    state_df<- find_thread_urls(keywords = state,subreddit=subreddit, sort_by="top", period = 'year')
    state_df<-state_df|>
    mutate(State=state)
    text<-rbind(text,state_df)
}

Start with just general real estate and attempt to run states another time. 

In [15]:
RealEstate <- find_thread_urls(keywords = "Buy",subreddit=subreddit, sort_by="top", period = 'year')

parsing URLs on page 1...


"cannot open URL 'https://www.reddit.com/r/RealEstate/search.json?restrict_sr=on&q=Buy&sort=top&t=year&limit=100': HTTP status was '429 Unknown Error'"


ERROR: Error in h(simpleError(msg, call)): error in evaluating the argument 'content' in selecting a method for function 'fromJSON': cannot open the connection to 'https://www.reddit.com/r/RealEstate/search.json?restrict_sr=on&q=Buy&sort=top&t=year&limit=100'


In [None]:
find_thread_urls(keywords = "Buy" ,subreddit=subreddit, sort_by="top", period = 'year')

# News API

In [1]:
import requests
import json
import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
baseURL = "https://newsapi.org/v2/everything?"
total_requests=2
verbose=True

# THIS CODE WILL NOT WORK UNLESS YOU INSERT YOUR API KEY IN THE NEXT LINE
API_KEY='6a2b4aca491e4bedb90a4c7275f2ddf6'
TOPIC='Real Estate'

In [3]:
URLpost = {'apiKey': API_KEY,
            'q': '+'+TOPIC,
            'sortBy': 'relevancy',
            'totalRequests': 1}

print(baseURL)
# print(URLpost)

#GET DATA FROM API
response = requests.get(baseURL, URLpost) #request data from the server
# print(response.url);  
response = response.json() #extract txt data from request into json

# PRETTY PRINT
# https://www.digitalocean.com/community/tutorials/python-pretty-print-json

#print(json.dumps(response, indent=2))

# #GET TIMESTAMP FOR PULL REQUEST
#from datetime import datetime
#timestamp = datetime.now().strftime("%Y-%m-%d-H%H-M%M-S%S")

# SAVE TO FILE 
#with open(timestamp+'-newapi-raw-data.json', 'w') as outfile:
    #json.dump(response, outfile, indent=4)

https://newsapi.org/v2/everything?


In [13]:
def string_cleaner(input_string):
    try: 
        out=re.sub(r"""
                    [,.;@#?!&$-]+  # Accept one or more copies of punctuation
                    \ *           # plus zero or more copies of a space,
                    """,
                    " ",          # and replace it with a single space
                    input_string, flags=re.VERBOSE)

        #REPLACE SELECT CHARACTERS WITH NOTHING
        out = re.sub('[’.]+', '', input_string)

        #ELIMINATE DUPLICATE WHITESPACES USING WILDCARDS
        out = re.sub(r'\s+', ' ', out)

        #CONVERT TO LOWER CASE
        out=out.lower()
    except:
        print("ERROR")
        out=''
    return out

In [14]:
article_list=response['articles']   #list of dictionaries for each article
article_keys=article_list[0].keys()
print("AVAILABLE KEYS:")
print(article_keys)
index=0
cleaned_data=[];  
for article in article_list:
    tmp=[]
    for key in article_keys:
        if(key=='source'):
            src=string_cleaner(article[key]['name'])
            tmp.append(src) 

        if(key=='author'):
            author=string_cleaner(article[key])
            #ERROR CHECK (SOMETIMES AUTHOR IS SAME AS PUBLICATION)
            if(src in author): 
                print(" AUTHOR ERROR:",author);author='NA'
            tmp.append(author)

        if(key=='title'):
            tmp.append(string_cleaner(article[key]))

        # if(key=='description'):
        #     tmp.append(string_cleaner(article[key]))

        # if(key=='content'):
        #     tmp.append(string_cleaner(article[key]))

        if(key=='publishedAt'):
            #DEFINE DATA PATERN FOR RE TO CHECK  .* --> wildcard
            ref = re.compile('.*-.*-.*T.*:.*:.*Z')
            date=article[key]
            if(not ref.match(date)):
                print(" DATE ERROR:",date); date="NA"
            tmp.append(date)

    cleaned_data.append(tmp)
    index+=1

AVAILABLE KEYS:
dict_keys(['source', 'author', 'title', 'description', 'url', 'urlToImage', 'publishedAt', 'content'])
 AUTHOR ERROR: grit daily
 AUTHOR ERROR: grit daily
 AUTHOR ERROR: abdo riani, senior contributor, abdo riani, senior contributor https://wwwforbescom/sites/abdoriani/
 AUTHOR ERROR: christopher marquis, contributor, christopher marquis, contributor https://wwwforbescom/sites/christophermarquis/
 AUTHOR ERROR: tmz staff


In [15]:
cleaned_data

[['business insider',
  'jennifer sor',
  'distressed commercial real estate debt hit $80 billion last quarter, the highest amount in a decade',
  '2023-10-19T15:09:14Z'],
 ['grit daily',
  'NA',
  'ranka vucetic on 3 common real estate problems solved by bespoke service',
  '2023-10-31T11:00:08Z'],
 ['business insider',
  'phil rosen',
  "china's country garden reportedly misses final deadline for a $154 million bond payment",
  '2023-10-18T16:25:20Z'],
 ['business insider',
  'theron mohamed',
  '2023-10-25T12:55:06Z'],
 ['business insider',
  'jennifer ortakales dawkins',
  'tjmaxx is quietly closing stores in new york and chicago here are the confirmed locations',
  '2023-11-02T18:54:22Z'],
 ['business insider',
  'katherine tangalakis-lippert',
  "who's funding hamas?",
  '2023-10-22T00:32:03Z'],
 ['npr',
  'jennifer ludden',
  'to tackle homelessness faster, la has a kind of real estate agency for the unhoused',
  '2023-10-24T13:45:55Z'],
 ['business insider',
  'haley tenore',
 

In [16]:
df = pd.DataFrame(cleaned_data)
print(df)
df.to_csv('./data/news.csv', index=False) #,index_label=['title','src','author','date','description'])

                   0                           1  \
0   business insider                jennifer sor   
1         grit daily                          NA   
2   business insider                  phil rosen   
3   business insider              theron mohamed   
4   business insider  jennifer ortakales dawkins   
..               ...                         ...   
95       theonioncom                   the onion   
96               tmz                          NA   
97  business insider             kelsey neubauer   
98        gizmodocom                  dua rashid   
99  business insider            lakshmi varanasi   

                                                    2                     3  
0   distressed commercial real estate debt hit $80...  2023-10-19T15:09:14Z  
1   ranka vucetic on 3 common real estate problems...  2023-10-31T11:00:08Z  
2   china's country garden reportedly misses final...  2023-10-18T16:25:20Z  
4   tjmaxx is quietly closing stores in new york a...  2023-11-

# Download

https://www.zillow.com/research/data/