# DS 3001: Foundations of Machine Learning
## Project 1: An Analysis on Attitudes towards Health-Related Internet Usage

---
**Group 2: brc4ega, zzb2rf, eqk9vb, akl5mjz, wvu9cs, ryp6vw, brr2tu**

**Authors: Bryant Chow, Elaine Zhang, Cheryl Bai, Ashley Luk, Eric Nguyen, Adam Chow, Hieu Vu**

- *Any questions regarding this report should be directed to these authors.*
---

## Research Question
**Selected Question**
*************

*How do attitudes towards health-related internet usage vary among different demographic groups (age, location, education, race, etc.)?*

*********

**Question Ideas:**

*Team members have each submitted two (2) project proposal topics. After discussing the implications of each question based on relevant societal events, a research question was selected (bolded).*

[Data Explorer](https://gssdataexplorer.norc.org/variables/vfilter)

- What affects happiness in marriage? Salary, household income, type of job, anything related to husband status
- How does parental status (job status, salary, marriage/divorce) affect the degree a person receives and how far they go in education?
- What role does geographic location play in shaping attitudes and behaviors related to climate change?
- How does the perception of work-life balance vary among different professions, and what factors contribute to this variation?
- **How do attitudes towards health-related internet usage vary among different demographic groups (age, location, education, race, etc.)?**
- Are there any trends between attitudes in health (physical and mental) and attitudes towards children?
- Are there any trends between zodiac signs and their attitudes towards the government?
- How did economic downturn affect the individuals of the survey? For example, during 2008's housing crisis, were there more divorces shortly after? Are there changes to mental health? What was the overall job status change? This could also be applied to 2020.
- What are the most common professional routes to self-employment? How did these individuals fare over time?
- How does the state of the economy/social events affect familial trends?
- How has education curriculum changed in response to societal events?

## Initial Data Setup

---

### 1. Variable Definitions
*In this section, variables have been meticulously selected in order to support the research of the project. Variables have been included below with a brief description of what they measure and why they were selected.*

*If a variable's name was changed during the cleaning process, the updated name was included in this section for reader convenience.*

---



- **year**: GSS year for this respondent
  - Year was selected in order to track other variables over a period of time. It is an excellent variable to leverage when it comes time to create data visualizations.
- **race**: race of respondent
  - Race was selected in order to get a better understanding of the respondents' characteristics.
- **age**: age of respondent
  - Age was selected in order to get a better understanding of the respondents' characteristics.
- **sex**: sex of respondent
  - Sex was selected in order to get a better understanding of the respondents' characteristics.
- **degree**: highest degree earned by respondent
  - Degree was selected in order to get a better understanding of the respondents' characteristics.
- **rincome**: respondent's income
- **income**: total family income
- **rweight**: how much does the respondent weigh
- **relig**: respondent's religious preferences
- **health**: condition of health
- **marital**: marital status
- **wwwhr**: not counting email, hours per week spent on the internet
- **compuse**: do you ever use a computer at work or home?
- **webmob**: do you access internet through mobile device
- **health30** (2000-2004): In the past 30 days, how often have you visited a website for health and fitness?
- **health12** (2000-2002): Have you used the web for health information in the past 12 months?
- **hlthwblif** (2022): During the past 12 months how often have you used the internet to look for info on healthy lifestyle?
- **hlthwbanx** (2022): How often in the past 12 months have you searched on the internet information related to anxiety/stress?
- **evmhp** (1996): Has the respondent ever had a mental health problem (yes or no)?
- **hlthwww** (2000-2004): sought health information on the internet
- **hlthweb** (2022): During the past 12 months, how often did you use the internet on any device to look for health or medical information for yourself or someone else?
- **webhltbeh** (2022): To what extent do you agree or disagree with the following statements? During the past 12 months, information on the internet affected my health behavior in a positive way
- **webdocexp** (2022): During the past 12 months, information on the internet helped me understand what a doctor tried to explain to me
- **websympt** (2022): To what extent do you agree or disagree with the following statements? The internet is useful to help people decide if their symptoms are serious enough to go to the doctor
- **webdradv** (2022): To what extent do you agree or disagree with the following statements? The internet is useful to check that the doctor is giving people appropriate advice
- **webrely** (2022): To what extent do you agree or disagree with the following statements? It is not easy to distinguish between reliable and unreliable health information on the internet


### Cleaning Data

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
var_list = ['year', 'race', 'age', 'sex', 'degree', 'rincome', 'income', 'relig', 'rweight', 'health', 'wwwhr', 'compuse', 'webmob', 'health30', 'health12',
            'hlthwblif', 'hlthwbanx', 'evmhp', 'hlthwww', 'hlthweb', 'hlthwbvax', 'webhltbeh', 'webdocexp', 'websympt',
            'webdradv', 'webrely'] # List of variables to save
output_file = 'raw_gss_data_url.csv' # Name of the file to save the data to
#
modes = ['w','a'] # Has write mode and append mode
phase = 0 # Starts in write mode; after one iteration of loop, switches to append mode
#
for k in range(37): # For each chunk of the data
    url = 'https://github.com/DS3001/gss_zip/raw/main/gss_' + str(1+k) + '.csv' # Create url to the chunk to be processed
    # print(url) # Check the url is correct
    df = pd.read_csv(url,low_memory=False) # Download this chunk of data
    # print(df.head()) # Visually inspect the first few rows
    df.loc[:,var_list].to_csv(output_file, # specifies target file to save the chunk to
                              mode=modes[phase], # control write versus append
                              header=var_list, # variable names; written with every chunk, so repeated header rows are dropped later
                              index=False) # no row index saved
    phase = 1 # Switch from write mode to append mode
    # The for loop advances k on its own, so no manual increment is needed
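
The loop above writes the column names with every chunk, which is why header rows have to be removed later in the cleaning step. As a minimal sketch of one alternative, the header could be written only for the first chunk. `write_chunks` is a hypothetical helper that is not part of the original notebook, and the toy frames below stand in for the downloaded GSS chunks.

```python
import os
import tempfile

import pandas as pd


def write_chunks(chunks, output_file, var_list):
    """Append each chunk to output_file, writing the header only once.

    Hypothetical helper: `chunks` is any iterable of DataFrames that
    contain the columns named in `var_list`.
    """
    for i, chunk in enumerate(chunks):
        chunk.loc[:, var_list].to_csv(
            output_file,
            mode='w' if i == 0 else 'a',  # overwrite first, then append
            header=(i == 0),              # column names only for the first chunk
            index=False,
        )


# Toy demonstration with two small in-memory chunks (not GSS data)
toy_chunks = [
    pd.DataFrame({'year': [1972, 1972], 'age': [23, 70], 'extra': [0, 1]}),
    pd.DataFrame({'year': [2022], 'age': [57], 'extra': [2]}),
]
toy_path = os.path.join(tempfile.mkdtemp(), 'toy_gss.csv')
write_chunks(toy_chunks, toy_path, ['year', 'age'])
toy_df = pd.read_csv(toy_path)
print(toy_df.dtypes)
```

With a single header row, pandas can usually infer numeric types for columns such as `year` directly when the file is read back.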

In [None]:
df = pd.read_csv('raw_gss_data_url.csv')
print(df.shape)
df.head(5)

(72426, 26)


Unnamed: 0,year,race,age,sex,degree,rincome,income,relig,rweight,health,...,hlthwbanx,evmhp,hlthwww,hlthweb,hlthwbvax,webhltbeh,webdocexp,websympt,webdradv,webrely
0,1972,white,23.0,female,bachelor's,,,jewish,,good,...,,,,,,,,,,
1,1972,white,70.0,male,less than high school,,,catholic,,fair,...,,,,,,,,,,
2,1972,white,48.0,female,high school,,,protestant,,excellent,...,,,,,,,,,,
3,1972,white,27.0,female,bachelor's,,,other,,good,...,,,,,,,,,,
4,1972,white,61.0,female,high school,,,protestant,,good,...,,,,,,,,,,


In [None]:
# Getting some summary statistics
df.describe()

Unnamed: 0,year,race,age,sex,degree,rincome,income,relig,rweight,health,...,hlthwbanx,evmhp,hlthwww,hlthweb,hlthwbvax,webhltbeh,webdocexp,websympt,webdradv,webrely
count,72426,72319,71657.0,72314,72230,42369,63475,68506,4739,55190,...,1131,1089,1673,1174,1133,1094,1096,1135,1112,1122
unique,35,4,73.0,3,6,13,13,14,6,5,...,6,3,5,8,6,6,6,6,6,6
top,2006,white,30.0,female,high school,"$25,000 or more","$25,000 or more",protestant,about the right weight,good,...,never,no,not at all,several times a year,sometimes,neither agree nor disagree,agree,agree,agree,agree
freq,4510,57657,1571.0,40301,36446,18249,34785,38707,2805,25651,...,357,977,632,320,389,449,503,574,431,476


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72426 entries, 0 to 72425
Data columns (total 26 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   year       72426 non-null  object
 1   race       72319 non-null  object
 2   age        71657 non-null  object
 3   sex        72314 non-null  object
 4   degree     72230 non-null  object
 5   rincome    42369 non-null  object
 6   income     63475 non-null  object
 7   relig      68506 non-null  object
 8   rweight    4739 non-null   object
 9   health     55190 non-null  object
 10  wwwhr      17222 non-null  object
 11  compuse    18973 non-null  object
 12  webmob     1793 non-null   object
 13  health30   2371 non-null   object
 14  health12   1691 non-null   object
 15  hlthwblif  1133 non-null   object
 16  hlthwbanx  1131 non-null   object
 17  evmhp      1089 non-null   object
 18  hlthwww    1673 non-null   object
 19  hlthweb    1174 non-null   object
 20  hlthwbvax  1133 non-null   o

In [None]:
# The download loop wrote the column names with every chunk, so the file contains
# rows that just repeat the variable names; remove those rows here
rows_to_drop = df[df['health30'] == 'health30'].index

df = df.drop(rows_to_drop)
df['health30'].unique()

array([nan, 'never', '1-2 times', 'more than 5 times', '3-5 times'],
      dtype=object)
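
Because header rows were mixed into the data, `df.info()` reports every column as `object`, including `year` and `age`. The following is a minimal sketch of converting such columns once the header rows are gone. The choice of `year` and `age` as the numeric columns is an assumption, and the toy frame only mimics the shape of the raw file.

```python
import pandas as pd

# Toy frame mimicking the raw file: strings everywhere plus a stray header row
toy = pd.DataFrame({
    'year': ['1972', '2022', 'year'],
    'age': ['23.0', '57.0', 'age'],
    'health30': [None, 'never', 'health30'],
})

# Drop repeated header rows, as in the cell above
toy = toy[toy['health30'] != 'health30'].copy()

# Convert columns assumed to be numeric; entries that cannot be parsed become NaN
for col in ['year', 'age']:
    toy[col] = pd.to_numeric(toy[col], errors='coerce')

print(toy.dtypes)
```

Using `errors='coerce'` is a design choice: any non-numeric label, if present, turns into `NaN` instead of raising an error, so it is worth checking how many values were lost after the conversion.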

In [None]:
# checking to see if any values need to be switched to NaN
df['health30'].unique()
df['health12'].unique()
df['hlthwblif'].unique()
df['evmhp'].unique()
df['hlthwww'].unique()
df['hlthweb'].unique()
df['webhltbeh'].unique()
df['webdocexp'].unique()
df['websympt'].unique()
df['webdradv'].unique()
df['webrely'].unique()

array([nan, 'agree', 'strongly agree', 'neither agree nor disagree',
       'disagree', 'strongly disagree'], dtype=object)
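
In a notebook, only the last expression of a cell is displayed, so the output above lists the values of `webrely` alone. A short sketch of collecting the distinct values for several columns at once follows. `unique_values` is a hypothetical helper, and the toy frame stands in for the GSS data.

```python
import pandas as pd


def unique_values(frame, columns):
    """Return a dict mapping each column name to its distinct values,
    missing values included (hypothetical helper)."""
    return {col: frame[col].unique().tolist() for col in columns}


# Toy frame standing in for the GSS data
toy = pd.DataFrame({
    'webrely': ['agree', None, 'disagree', 'agree'],
    'evmhp': ['no', 'yes', None, 'no'],
})

result = unique_values(toy, ['webrely', 'evmhp'])
for col, values in result.items():
    print(col, values)
```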

In [None]:
# Renaming variables for clarity


In [None]:
# Select the rows in which every listed column has a value (no NaN in those columns)
def create_new_df(columns):
    new_df = df.dropna(subset=columns, how='any')
    # Guard against leftover header rows in case the earlier cleanup cell was skipped
    new_df = new_df[new_df['health30'] != 'health30']
    return new_df

# example usage of creating a new df where rows have a value in columns age, webrely, and webdradv
new_df = create_new_df(['age', 'webrely', 'webdradv'])
new_df.head()

Unnamed: 0,year,race,age,sex,degree,rincome,income,relig,rweight,health,...,hlthwbanx,evmhp,hlthwww,hlthweb,hlthwbvax,webhltbeh,webdocexp,websympt,webdradv,webrely
68880,2022,white,72.0,female,bachelor's,"$25,000 or more","$25,000 or more",,,good,...,often,,,several times a month,sometimes,neither agree nor disagree,neither agree nor disagree,agree,agree,agree
68882,2022,white,57.0,female,high school,"$25,000 or more","$25,000 or more",,,good,...,never,,,several times a year,seldom,disagree,disagree,disagree,disagree,strongly agree
68884,2022,white,62.0,male,high school,,"$25,000 or more",,,fair,...,never,,,several times a month,never,neither agree nor disagree,neither agree nor disagree,neither agree nor disagree,neither agree nor disagree,agree
68885,2022,white,27.0,male,high school,"$25,000 or more","$25,000 or more",,,excellent,...,very often,,,several times a day,very often,strongly disagree,neither agree nor disagree,neither agree nor disagree,neither agree nor disagree,neither agree nor disagree
68886,2022,other,20.0,female,high school,"$25,000 or more","$25,000 or more",,,good,...,often,,,several times a year,often,neither agree nor disagree,agree,disagree,disagree,neither agree nor disagree


In [None]:
# Select only the listed columns, keeping rows in which none of them is NaN
def select_columns(columns):
    new_df = df.dropna(subset=columns, how='any')
    # Guard against leftover header rows in case the earlier cleanup cell was skipped
    new_df = new_df[new_df['health30'] != 'health30']
    return new_df[columns]

# example of calling this function
df_2 = select_columns(['age','webrely', 'webdradv'])
df_2.head()

Unnamed: 0,age,webrely,webdradv
68880,72.0,agree,agree
68882,57.0,strongly agree,disagree
68884,62.0,agree,neither agree nor disagree
68885,27.0,neither agree nor disagree,neither agree nor disagree
68886,20.0,neither agree nor disagree,disagree
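
A subset such as `df_2` could feed a comparison across demographic groups. The sketch below is one possible approach, assuming `age` has already been converted to a numeric type. The age bands are an arbitrary illustrative choice, and the toy rows are not GSS results.

```python
import pandas as pd

# Toy rows shaped like df_2 (illustrative values, not GSS results)
toy = pd.DataFrame({
    'age': [72.0, 57.0, 62.0, 27.0, 20.0, 35.0],
    'webrely': ['agree', 'strongly agree', 'agree',
                'neither agree nor disagree', 'disagree', 'agree'],
})

# Order the agreement scale so tables and plots sort sensibly
likert = ['strongly disagree', 'disagree', 'neither agree nor disagree',
          'agree', 'strongly agree']
toy['webrely'] = pd.Categorical(toy['webrely'], categories=likert, ordered=True)

# Hypothetical age bands; the cut points are an assumption
toy['age_group'] = pd.cut(toy['age'], bins=[17, 29, 49, 64, 120],
                          labels=['18-29', '30-49', '50-64', '65+'])

# Share of each response within every age group (each row sums to 1)
shares = pd.crosstab(toy['age_group'], toy['webrely'], normalize='index')
print(shares.round(2))
```

Normalizing by row makes groups of different sizes comparable, which matters here because the 2022 health items have far fewer responses than the core demographic variables.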
