# COGS 108 - Data Checkpoint

# Names

- Anh Vuong
- Anh Bach
- Anh Pham
- Huy Nguyen

<a id='research_question'></a>
# Research Question

* How have housing costs (rent and owner-occupied) in California changed over time, and how does this compare to other states?
* What percentage of household income is typically spent on housing in California, and how does this compare to other states?
* What has California's population growth rate been over time, and how has it changed from year to year?
* Does an increase in housing costs affect California's population?
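
The year-to-year change asked about in the third question can be expressed as a percent change from one year to the next. The following is a minimal sketch of that calculation, assuming a table with one row per year; the column names (`year`, `population`, `yoy_pct`) and the numbers are placeholders for illustration, not real Census figures.

```python
import pandas as pd

# Placeholder values for illustration only; these are NOT real Census figures.
pop_df = pd.DataFrame({
    "year": [2019, 2020, 2021, 2022],
    "population": [1000, 1010, 1005, 995],
})

# Percent change relative to the previous year.
# The first row has no previous year, so its value is NaN.
pop_df["yoy_pct"] = pop_df["population"].pct_change() * 100

print(pop_df)
```

A negative value in `yoy_pct` would indicate a year in which the population declined.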

# Background and Prior Work

California is the most populous state in the United States, with nearly 40 million residents, but only the third largest by area. As a result, demand for housing is high relative to the available land, which has contributed to rising housing prices in recent years. Housing costs and population trends, in turn, significantly influence the socioeconomic makeup of a region. This project examines historical housing cost increases in California, compares them to those in other states, and explores their possible effects on the state's population.

All members of our group are California residents who rent off-campus housing, so we feel this pressure directly when paying rent. After the emergence of the COVID-19 pandemic in particular, housing prices in California trended upward, and some news outlets reported that more Californians were moving out of state because of the housing situation. Hence, we are interested in determining the relationship between the increasing cost of housing and California’s population.

To dig into the question, we researched the housing situation in California following the emergence of COVID-19. According to a report by John Duca and Anthony Murphy, in the wake of the short but steep COVID-19 recession, house prices rose at record rates, hitting a peak increase of 19.3 percent in July 2021. These double-digit increases were a stark departure from the period before the pandemic (early 2013 to early 2020), when house prices rose at a moderate annual rate of about 5 percent and exceeded the rate of increase in rents (1). Next, we compared the housing market in California to that of other states in the U.S. An analysis by Jack Caporal and Lyle Daly found that the typical home price in California is $728,000, which is 218% of the typical U.S. price. California has the second highest typical home value in the United States and the second lowest income-to-home-value ratio, even though its residents make 22% more than the median U.S. income (2). Lastly, we looked at California’s population trend in recent years. A CalMatters report stated that, according to the latest population estimates from the U.S. Census Bureau, California’s total population declined by more than 500,000 between April 2020 and July 2022. Put another way, 1 out of 100 people living in California at the beginning of the COVID-19 pandemic had, two years later, left the state, whether by U-Haul or by hearse (3).

References (include links):

1. https://www.dallasfed.org/research/economics/2021/1228

2. https://www.fool.com/the-ascent/research/average-house-price-state/

3. https://calmatters.org/newsletters/whatmatters/2023/02/california-population-exodus-housing/


# Hypothesis

Our hypothesis is that there is a relationship between the decline in California's population and the rising cost of housing (rent and owner-occupied): California residents are likely to move out of state because they cannot afford housing.

H<sub>0</sub>: There is no relationship between the increase in the cost of housing in California and California's population.

H<sub>1</sub>: There is a relationship between the increase in the cost of housing in California and California's population.
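
One possible first look at these hypotheses is the Pearson correlation between a housing cost measure and population across years. The sketch below is only an illustration under stated assumptions: the column names (`median_rent`, `population`) are hypothetical, and the values are placeholders, not real data.

```python
import pandas as pd

# Placeholder values for illustration only; these are NOT real data.
df = pd.DataFrame({
    "year": [2018, 2019, 2020, 2021, 2022],
    "median_rent": [100, 110, 120, 130, 140],
    "population": [500, 498, 497, 493, 490],
})

# Pearson correlation coefficient between housing cost and population.
# Values near -1 or 1 suggest a strong linear relationship; values near 0 suggest none.
r = df["median_rent"].corr(df["population"])

print(f"Pearson r = {r:.3f}")
```

The coefficient alone does not give a p-value, so deciding between H<sub>0</sub> and H<sub>1</sub> would still require a significance test (for example, `scipy.stats.pearsonr`). A significant correlation would also not by itself show that housing costs cause population change.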

# Ethics & Privacy

To address ethics and privacy concerns, our group uses only datasets that are publicly available online, and our project is for academic purposes. Our datasets come from government websites and contain no sensitive or personally identifying information. We believe our datasets are unbiased and ethical to use because our analysis is limited to California’s population and housing costs, compared against other states, to see whether there is a relationship between the two. We reduce the risk of bias by collecting publicly available data from trusted websites. To our knowledge, our datasets do not exclude any particular population and are unlikely to reflect human biases in a way that could be a problem. Our datasets also do not target or single out any particular group, whether defined by sex, age, ethnicity, or other characteristics.

To detect any biases before our analysis, we will examine the source and methodology of the data collection, and check if there are any gaps or inconsistencies in the data. We will also review the literature on the topic and compare our data with other relevant studies. During our analysis, we will use appropriate statistical methods and visualizations to explore the data and identify any outliers, trends, or patterns that may indicate bias. We will also test our hypotheses and assumptions using inferential statistics and hypothesis testing. After our analysis, we will evaluate our results and conclusions in light of the data limitations and ethical implications. We will also seek feedback from our peers and instructors on our project report and presentation, and address any questions or concerns they may have. 
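
As an example of the outlier check mentioned above, one common approach is the interquartile range (IQR) rule, which flags values more than 1.5 × IQR beyond the first or third quartile. This is a sketch of one option, not a method we have committed to; the series name `monthly_cost` and the values are placeholders, not real data.

```python
import pandas as pd

# Placeholder values for illustration only; these are NOT real data.
costs = pd.Series([900, 950, 1000, 1050, 1100, 5000], name="monthly_cost")

# Interquartile range: the spread of the middle 50% of the values.
q1 = costs.quantile(0.25)
q3 = costs.quantile(0.75)
iqr = q3 - q1

# Flag values more than 1.5 * IQR below Q1 or above Q3.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = costs[(costs < lower) | (costs > upper)]

print(outliers)
```

Flagged values would be inspected rather than dropped automatically, since an extreme housing cost may be a real observation.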

To handle any issues we identify, we will document them clearly and transparently in our project report and presentation, and acknowledge the limitations and uncertainties of our analysis. We will also suggest ways to improve data quality and reliability in future research, and discuss potential implications and recommendations for policy and practice based on our findings.



# Dataset(s)

*Fill in your dataset information here*

https://data.census.gov/table?q=state+housing+cost&tid=ACSST1Y2021.S2503

https://data.census.gov/table?q=state+housing+cost&tid=ACSDP1Y2021.DP04

https://www2.census.gov/programs-surveys/popest/datasets/2020-2022/state/totals/

(Copy this information for each dataset)
- Dataset Name:
- Link to the dataset:
- Number of observations:

1-2 sentences describing each dataset. 

If you plan to use multiple datasets, add 1-2 sentences about how you plan to combine these datasets.
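
One way the housing and population tables could be combined is a join on state and year. The sketch below is an assumption about how that might look, not a description of the final pipeline: the DataFrame names, the column names (`state`, `year`, `median_gross_rent`, `population`), and the values are all hypothetical placeholders.

```python
import pandas as pd

# Placeholder tables for illustration only; names and values are hypothetical.
housing_df = pd.DataFrame({
    "state": ["California", "Texas"],
    "year": [2021, 2021],
    "median_gross_rent": [1, 2],
})
population_df = pd.DataFrame({
    "state": ["California", "Texas", "Oregon"],
    "year": [2021, 2021, 2021],
    "population": [10, 20, 30],
})

# Inner join keeps only state-year pairs present in both tables.
# validate="one_to_one" raises an error if a state-year pair is duplicated.
merged_df = housing_df.merge(
    population_df,
    on=["state", "year"],
    how="inner",
    validate="one_to_one",
)

print(merged_df)
```

With an inner join, a state that appears in only one table (Oregon in this toy example) is dropped, so the row counts before and after merging are worth checking.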

# Setup

In [4]:
# Import libraries for working with data
import pandas as pd

In [5]:
survey_2020_df = pd.read_csv('https://raw.githubusercontent.com/COGS-108/Group_Sp23_AAA-H/main/data/osmi-2020-mental-health-in-tech-survey-results.csv')
survey_2020_df.head()

Unnamed: 0,#,*Are you self-employed?*,How many employees does your company or organization have?,Is your employer primarily a tech company/organization?,Is your primary role within your company related to tech/IT?,Does your employer provide mental health benefits as part of healthcare coverage?,Do you know the options for mental health care available under your employer-provided health coverage?,"Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?",Does your employer offer resources to learn more about mental health disorders and options for seeking help?,Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?,...,"If there is anything else you would like to tell us that has not been covered by the survey questions, please use this space to do so.",Would you be willing to talk to one of us more extensively about your experiences with mental health issues in the tech industry? (Note that all interview responses would be used _anonymously_ and only with your permission.),What is your age?,What is your gender?,What country do you *live* in?,What US state or territory do you *live* in?,What is your race?,Other.3,What country do you *work* in?,What US state or territory do you *work* in?
0,zwrffw6ykfo82ft1twvzwrffw6c6wsfv,1,,,,,,,,,...,,0,45,Male,United States of America,Connecticut,White,,United States of America,Connecticut
1,zhdmhaa8r0125c4zmoi7qzhdmtjrakhm,1,,,,,,,,,...,,1,24,female,Russia,,,,Russia,
2,x4itwa9hnlw7qke4y5xibx4itwa9yzl5,1,,,,,,,,,...,mental health should be a law by government.,1,46,Male,India,,,,India,
3,x3v3oimu5pn0043n8x3v3oizaybhwwto,1,,,,,,,,,...,,1,25,Female,Canada,,,,Canada,
4,uyp6re7bhnyx6gez09uyp6re72z0e4e4,1,,,,,,,,,...,no,1,25,F,Canada,,,,Canada,


In [6]:
survey_2019_df = pd.read_csv('https://raw.githubusercontent.com/COGS-108/Group_Sp23_AAA-H/main/data/osmi-mental-health-in-tech-survey-2019.csv')
survey_2019_df.head()

Unnamed: 0,Age,Gender,Country,State,Is_Self_Employed,Is_Family_History_of_Mental_Illness,Is_Treatment_Sought,Interference_With_Work,Number_of_Employees_in_Organization,Is_Employer_Tech_Company,Mental_Health_Benefits,Mental_Healthcare_Options,Mental_Health_Employee_Wellness_Program,Mental_Health_Employee_Resources,Employee_Anonymity_Protected,Mental_Health_Leave,Willingness_to_Discuss_with_Coworkers,Willingness_to_Discuss_with_Supervisors
0,25.0,Male,United States of America,Nebraska,False,False,False,Not applicable to me,26-100,True,I don't know,No,Yes,Yes,I don't know,Very easy,Physical health,Yes
1,51.0,Male,United States of America,Nebraska,False,True,False,Sometimes,26-100,True,Yes,No,No,Yes,Yes,I don't know,Physical health,Maybe
2,27.0,Male,United States of America,Illinois,False,,False,Not applicable to me,26-100,True,I don't know,No,No,I don't know,I don't know,Somewhat difficult,Same level of comfort for each,No
3,37.0,Male,United States of America,Nebraska,False,True,False,Not applicable to me,100-500,True,I don't know,No,Yes,Yes,Yes,Very easy,Physical health,Yes
4,46.0,Male,United States of America,Nebraska,False,False,False,Not applicable to me,26-100,True,I don't know,No,I don't know,I don't know,I don't know,I don't know,Physical health,No


# Data Cleaning

Describe your data cleaning steps here.

In [7]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION