**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update the relevant sections where you lost those points, and respond to your TA/IA on GitHub Issues to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - Data Checkpoint

# Names

- Alexandro Merida Silva
- Adam Rolander
- Alyssa Le
- Enrique Aranda
- Hikari Gregersen

# Research Question

-  Include a specific, clear data science question.
-  Make sure the variables you are measuring to answer the question are clear.

What is your research question? Include the specific question you're setting out to answer. This question should be specific, answerable with data, and clear. A general question with specific subquestions is permitted. (1-2 sentences)



## Background and Prior Work


- Include a general introduction to your topic
- Include explanation of what work has been done previously
- Include citations or links to previous work

This section will present the background and context of your topic and question in a few paragraphs. Include a general introduction to your topic and then describe what information you currently know about the topic after doing your initial research. Include references to other projects that have asked similar questions or approached similar problems. Explain what others have learned in their projects.

Find some relevant prior work, and reference those sources, summarizing what each did and what they learned. Even if you think you have a totally novel question, find the most similar prior work that you can and discuss how it relates to your project.

References can be research publications, but they need not be. Blogs, GitHub repositories, company websites, etc., are all viable references if they are relevant to your project. It must be clear which information comes from which references. (2-3 paragraphs, including at least 2 references)

 **Use inline citation through HTML footnotes to specify which references support which statements** 

For example: After government genocide in the 20th century, real birds were replaced with surveillance drones designed to look just like birds.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Use a minimum of 2 or 3 citations, but we prefer more.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) You need enough to fully explain and back up important facts. 

Note that if you click a footnote number in the paragraph above it will transport you to the proper entry in the footnotes list below.  And if you click the ^ in the footnote entry, it will return you to the place in the main text where the footnote is made.

To understand the HTML here, `<a name="..."></a>` is a tag that lets you produce a named reference for a given location. Markdown has the construction `[text with hyperlink](#named reference)`, which produces a clickable link that takes you to the named reference.

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Lorenz, T. (9 Dec 2021) Birds Aren’t Real, or Are They? Inside a Gen Z Conspiracy Theory. *The New York Times*. https://www.nytimes.com/2021/12/09/technology/birds-arent-real-gen-z-misinformation.html 
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Also refs should be important to the background, not some randomly chosen vaguely related stuff. Include a web link if possible in refs as above.


# Hypothesis



- Include your team's hypothesis
- Ensure that this hypothesis is clear to readers
- Explain why you think this will be the outcome (what was your thinking?)

What is your main hypothesis/predictions about what the answer to your question is? Briefly explain your thinking. (2-3 sentences)

# Data

## Data overview

For each dataset include the following information
- Dataset #1
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
- Dataset #2 (if you have more than one!)
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
- etc

Now write 2 - 5 sentences describing each dataset here. Include a short description of the important variables in the dataset; what the metrics and datatypes are, what concepts they may be proxies for. Include information about how you would need to wrangle/clean/preprocess the dataset

If you plan to use multiple datasets, add a few sentences about how you plan to combine these datasets.

## Dataset #1 (use name instead of number here)

In [102]:
# setup
import numpy as np
import pandas as pd
import seaborn as sns

import matplotlib as mpl
import matplotlib.pyplot as plt

import statsmodels
import statsmodels.formula.api as smf

In [None]:
df_set1 = pd.read_csv("ICPSR_21600/DS0001/21600-0001-Data.tsv", sep="\t")

df_set22 = pd.read_csv("ICPSR_21600/DS0022/21600-0022-Data.tsv", sep="\t")

In [113]:
# Merge on AID (participant identifier); inner join so non-respondents aren't included
merged_df = pd.merge(df_set1, df_set22, on="AID", how="inner", suffixes=("_wave1", "_wave4"))

print("Merged shape:", merged_df.shape)
merged_df.head()

Merged shape: (5114, 3713)


Unnamed: 0,AID,IMONTH,IDAY,IYEAR,SCH_YR,BIO_SEX,VERSION,SMP01,SMP03,H1GI1M,...,H4EO5C,H4EO5D,H4EO5E,H4EO5F,H4EO5G,H4EO5H,H4EO5I,H4EO5J,H4EO6,H4EO7
0,57101310,5,5,95,1,2,1,1,0,11,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0
1,57103869,7,14,95,0,1,4,1,0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,1.0
2,57109625,6,7,95,1,1,3,1,0,3,...,,,,,,,,,,
3,57111071,8,3,95,0,1,5,1,0,6,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0
4,57113943,5,20,95,1,1,3,1,0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,3.0
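
Before relying on the inner join above, it may be worth confirming that `AID` is unique in both tables and seeing how many participants fall outside the overlap. The sketch below uses two small made-up frames (`wave1` and `wave4` are hypothetical stand-ins, not the ICPSR files) together with the `validate` and `indicator` options of `pd.merge`.

```python
import pandas as pd

# Toy stand-ins for the two wave tables; all values are made up for illustration.
wave1 = pd.DataFrame({"AID": [1, 2, 3], "BIO_SEX": [1, 2, 1]})
wave4 = pd.DataFrame({"AID": [2, 3, 4], "H4EO6": [2.0, 4.0, 1.0]})

# validate="one_to_one" raises an error if AID repeats in either table.
# indicator=True adds a _merge column saying where each row came from.
check = pd.merge(wave1, wave4, on="AID", how="outer",
                 validate="one_to_one", indicator=True)

# Counts of rows found in both tables, only the left one, or only the right one.
merge_counts = check["_merge"].value_counts()
print(merge_counts)

# Keeping only the "both" rows gives the same participants as an inner join.
inner = check[check["_merge"] == "both"].drop(columns="_merge")
print(inner.shape)
```

Running the same check on the real tables would show how many Wave I participants are dropped by the inner join, which is useful to report alongside the merged shape.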


In [114]:
# Get a subset of the variables we want to study (out of 3713 variables)
sub_df = merged_df[['H1PR3', 'H1DS7', 'PC20', 'H1DA2', 'H1DA7', 'H1PR1', 'H1PR5', 'H1FV1', 'H1FV6', 'H4EC7', 
                    'H4ID5H', 'H1HS3', 'H1FS6', 'H1PF5', 'H1SU1', 'H4SE1', 'H4SE2', 'H4ED9', 'H4LM28', 
                    'H4ED8', 'H4MA1', 'H4GH7', 'H1ED11', 'H1ED12', 'H1ED13', 'H1ED14', 'H4CJ1', 'H4WP38', 
                    'H1PF1', 'H4WP24']]
sub_df.head()

Unnamed: 0,H1PR3,H1DS7,PC20,H1DA2,H1DA7,H1PR1,H1PR5,H1FV1,H1FV6,H4EC7,...,H4MA1,H4GH7,H1ED11,H1ED12,H1ED13,H1ED14,H4CJ1,H4WP38,H1PF1,H4WP24
0,5,0,,0,3,5,4,0,0,4,...,6,5,2,5,3,3,0,7,1,7
1,5,0,7.0,1,2,5,1,0,1,1,...,1,2,3,4,3,2,1,5,2,5
2,5,0,7.0,0,3,5,3,0,0,3,...,6,3,3,3,4,4,1,4,2,4
3,5,0,7.0,1,2,5,4,0,0,3,...,6,4,2,3,2,3,0,5,1,5
4,5,0,1.0,2,2,5,4,0,0,1,...,6,5,3,3,2,4,1,5,1,5


In [115]:
new_col_names = {'H1PR3':'parents_care', 'H1DS7':'run_home', 'PC20':'breastfed_duration', 'H1DA2':'hobby_time',
                 'H1DA7':'friend_time', 'H1PR1':'mother_care', 'H1PR5':'fam_understanding',
                 'H1FV1':'witness_violence', 'H1FV6':'jumped', 'H4EC7':'assets_value', 'H4ID5H':'depression',
                 'H1HS3':'counseling', 'H1ED11':'English', 'H1ED12':'Math', 'H1ED13':'History', 'H1ED14':'Science',
                 'H1FS6':'depress_feel', 'H1PF5':'mother_satisfied', 'H1SU1':'consider_suicide_youth', 
                 'H4SE1':'consider_suicide_adult', 'H4SE2':'attempt_suicide_adult', 'H4ED9':'edu_exp',
                 'H4LM28':'respon_interfer', 'H4CJ1':'arrest_i', 'H4ED8':'desired_edu', 'H4MA1':'feels_18th',
                 'H4GH7':'feel_weight', 'H4WP38':'father_close', 'H4WP24':'mother_close', 'H1PF1':'mom_warmth',}
sub_df = sub_df.rename(new_col_names, axis='columns')
sub_df.head()

Unnamed: 0,parents_care,run_home,breastfed_duration,hobby_time,friend_time,mother_care,fam_understanding,witness_violence,jumped,assets_value,...,feels_18th,feel_weight,English,Math,History,Science,arrest_i,father_close,mom_warmth,mother_close
0,5,0,,0,3,5,4,0,0,4,...,6,5,2,5,3,3,0,7,1,7
1,5,0,7.0,1,2,5,1,0,1,1,...,1,2,3,4,3,2,1,5,2,5
2,5,0,7.0,0,3,5,3,0,0,3,...,6,3,3,3,4,4,1,4,2,4
3,5,0,7.0,1,2,5,4,0,0,3,...,6,4,2,3,2,3,0,5,1,5
4,5,0,1.0,2,2,5,4,0,0,1,...,6,5,3,3,2,4,1,5,1,5


In [107]:
# isnull() finds no NaN values, but some codes mean "refused" or "legitimate skip", so we'll have to handle those
rows_nan = sub_df.isnull().any(axis=1).sum()

print(rows_nan)
print(sub_df.isnull().sum().sum())

0
0
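
A zero from `isnull()` only says that no cell holds NaN; blank strings and reserve codes still count as present. As a rough sketch, the coded missing values can be counted per column with `isin`. The frame and the `reserve_codes` lists below are illustrative assumptions, and the real code lists should be taken from the codebook.

```python
import pandas as pd

# Toy frame; values are made up. The blank string mimics an empty field in the file.
toy = pd.DataFrame({
    "breastfed_duration": [" ", "7", "1", "96"],
    "run_home": [0, 0, 6, 1],
})

# Hypothetical reserve codes per column (check the codebook for the real ones).
reserve_codes = {
    "breastfed_duration": [" ", "96", "98"],
    "run_home": [6, 8, 9],
}

# isnull() sees nothing, because none of these cells is NaN.
nan_total = int(toy.isnull().sum().sum())

# Counting the reserve codes shows how much data is effectively missing.
coded_missing = {col: int(toy[col].isin(codes).sum())
                 for col, codes in reserve_codes.items()}

print(nan_total)
print(coded_missing)
```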


In [123]:
print('Data types in sub_df:')
sub_df.dtypes

Data types in sub_df:


parents_care               int64
run_home                   int64
breastfed_duration        object
hobby_time                 int64
friend_time                int64
mother_care                int64
fam_understanding          int64
witness_violence           int64
jumped                     int64
assets_value               int64
depression                 int64
counseling                 int64
depress_feel               int64
mother_satisfied           int64
consider_suicide_youth     int64
consider_suicide_adult    object
attempt_suicide_adult     object
edu_exp                    int64
respon_interfer           object
desired_edu                int64
feels_18th                 int64
feel_weight                int64
English                    int64
Math                       int64
History                    int64
Science                    int64
arrest_i                   int64
father_close               int64
mom_warmth                 int64
mother_close               int64
dtype: object

In [124]:
sub_df.loc[sub_df['consider_suicide_adult'] == '7']

Unnamed: 0,parents_care,run_home,breastfed_duration,hobby_time,friend_time,mother_care,fam_understanding,witness_violence,jumped,assets_value,...,feels_18th,feel_weight,English,Math,History,Science,arrest_i,father_close,mom_warmth,mother_close
1054,5,0,7,2,2,5,4,0,2,1,...,6,4,1,4,4,1,7,3,2,4
1231,5,1,7,1,1,4,4,0,0,1,...,6,2,2,2,4,4,7,3,2,5
1980,5,1,7,2,3,3,3,0,0,1,...,6,1,3,2,2,2,7,3,2,4
5038,5,0,7,1,3,3,2,1,0,1,...,5,3,2,4,4,4,7,1,2,5


In [125]:
sub_df.loc[sub_df['consider_suicide_adult'] == 7]

Unnamed: 0,parents_care,run_home,breastfed_duration,hobby_time,friend_time,mother_care,fam_understanding,witness_violence,jumped,assets_value,...,feels_18th,feel_weight,English,Math,History,Science,arrest_i,father_close,mom_warmth,mother_close
963,5,0,7,0,2,4,4,0,0,1,...,6,4,4,4,3,2,7,5,1,5
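
The two queries above return different rows because the `object` columns hold a mix of strings and integers, so `'7'` and `7` are treated as different values. One possible fix, sketched here on a made-up column, is to coerce such columns to numeric before filtering; blank entries then become NaN.

```python
import pandas as pd

# Toy column mixing strings, integers, and a blank, as in the object-dtype columns.
toy = pd.DataFrame({"consider_suicide_adult": ["0", "7", 7, " ", "1"]})
col = toy["consider_suicide_adult"]

# Each comparison only matches values of its own type.
n_str = int((col == "7").sum())
n_int = int((col == 7).sum())

# Cast to string, strip whitespace, then coerce: unparseable entries become NaN.
cleaned = pd.to_numeric(col.astype(str).str.strip(), errors="coerce")

# After coercion a single numeric comparison finds every 7.
n_all = int((cleaned == 7).sum())
n_blank = int(cleaned.isna().sum())

print(n_str, n_int, n_all, n_blank)
```

Applying the same coercion to every `object` column in `sub_df` before the filtering step would let the integer code lists match string-valued entries as well.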


In [None]:
sub_df = sub_df[~sub_df.parents_care.isin([6, 96, 98])]
sub_df = sub_df[~sub_df.run_home.isin([6, 8, 9])]
sub_df = sub_df[~sub_df.breastfed_duration.isin([' ', 96, 98])]
sub_df = sub_df[~sub_df.mom_warmth.isin([6, 7, 8])]
sub_df = sub_df[~sub_df.hobby_time.isin([6, 8])]
sub_df = sub_df[~sub_df.friend_time.isin([6, 8])]
sub_df = sub_df[~sub_df.fam_understanding.isin([6, 96, 98])]
sub_df = sub_df[~sub_df.witness_violence.isin([9])]
sub_df = sub_df[~sub_df.jumped.isin([9])]
sub_df = sub_df[~sub_df.assets_value.isin([96, 98])]
sub_df = sub_df[~sub_df.mother_close.isin([7])]
sub_df = sub_df[~sub_df.father_close.isin([7, 8])]
sub_df = sub_df[~sub_df.depression.isin([6])]
sub_df = sub_df[~sub_df.English.isin([5, 6, 96, 97, 98])]
sub_df = sub_df[~sub_df.Math.isin([5, 6, 96, 97, 98])]
sub_df = sub_df[~sub_df.History.isin([5, 6, 96, 97, 98])]
sub_df = sub_df[~sub_df.Science.isin([5, 6, 96, 97, 98])]
sub_df = sub_df[~sub_df.counseling.isin([6, 8])]
sub_df = sub_df[~sub_df.mother_satisfied.isin([6, 7, 8, 9])]
sub_df = sub_df[~sub_df.consider_suicide_youth.isin([6, 8, 9])]
sub_df = sub_df[~sub_df.edu_exp.isin([96, 97, 98])]
sub_df = sub_df[~sub_df.respon_interfer.isin([6, 96, 98])]
sub_df = sub_df[~sub_df.desired_edu.isin([6, 8])]
sub_df = sub_df[~sub_df.feels_18th.isin([96, 98])]
sub_df = sub_df[~sub_df.feel_weight.isin([6, 8])]

sub_df

Unnamed: 0,parents_care,run_home,breastfed_duration,hobby_time,friend_time,mother_care,fam_understanding,witness_violence,jumped,assets_value,...,feels_18th,feel_weight,English,Math,History,Science,arrest_i,father_close,mom_warmth,mother_close
1,5,0,7,1,2,5,1,0,1,1,...,1,2,3,4,3,2,1,5,2,5
2,5,0,7,0,3,5,3,0,0,3,...,6,3,3,3,4,4,1,4,2,4
4,5,0,1,2,2,5,4,0,0,1,...,6,5,3,3,2,4,1,5,1,5
6,5,0,2,0,1,5,4,0,0,4,...,6,3,1,1,2,1,0,4,1,5
12,5,0,3,1,1,5,3,0,0,3,...,3,4,2,3,1,3,1,3,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5097,5,0,4,2,3,4,5,0,0,7,...,6,3,3,2,1,3,0,5,2,5
5099,4,0,2,2,3,5,3,0,0,4,...,4,4,2,3,2,2,0,2,3,4
5100,5,0,2,3,1,4,4,0,0,2,...,5,3,3,2,3,3,1,3,2,4
5105,5,0,2,2,2,5,5,0,0,1,...,6,4,2,1,2,2,0,4,1,5
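
The chained filters above drop a participant as soon as any one variable holds a reserve code. An alternative, sketched below under the assumption that per-variable missingness is acceptable for the analysis, is to keep one dictionary of codes and replace them with NaN, so rows are only dropped for the variables a given model actually uses. The frame is made up; the code lists for these three columns are copied from the filters above.

```python
import pandas as pd

# Toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "parents_care": [5, 6, 4, 96],
    "run_home": [0, 1, 9, 0],
    "English": [2, 3, 97, 1],
})

# Reserve codes per column, taken from the filtering cell above.
missing_codes = {
    "parents_care": [6, 96, 98],
    "run_home": [6, 8, 9],
    "English": [5, 6, 96, 97, 98],
}

# mask() replaces values where the condition is True with NaN.
cleaned = toy.copy()
for col, codes in missing_codes.items():
    cleaned[col] = cleaned[col].mask(cleaned[col].isin(codes))

# Missingness per column, and how many rows survive a drop-everything approach.
missing_per_column = cleaned.isna().sum()
rows_fully_observed = len(cleaned.dropna())

print(missing_per_column)
print(rows_fully_observed)
```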


# Ethics & Privacy

Our research question raises several ethical and privacy issues at every stage of the data process, including data collection, analysis, and post-analysis communication.

1. **Biases in Data Collection**

    The data we are planning to analyze for this project may disproportionately represent certain demographic communities. For example, studies that rely on self-reported parental involvement or medical records might exclude populations with limited access to healthcare or different cultural parenting practices. To lessen these concerns, we will:
   - Use diverse, representative datasets that encompass an extensive range of socioeconomic, cultural, and racial backgrounds
   - Conduct a thorough review of the datasets to eliminate any biases
   - Perform statistical analysis to check for sampling biases

2. **Privacy Concerns**

    The data include sensitive information about parental practices and health outcomes, which poses notable risks to privacy. To address this issue, we will:
   - Prioritize the use of anonymized data to ensure no individual can be identified
   - Adhere to strict data privacy practices, including encryption, restricted access, and compliance with data privacy regulations
   - Communicate privacy protection protocols to all group members

3. **Biases in Analysis**

    Framing parental involvement as a changeable variable could unintentionally reflect cultural or societal biases about gender roles. Also, our analysis could overemphasize some correlations while ignoring confounding variables. To mitigate this, we will:
   - Perform subgroup analyses to understand impacts across gender, race, and SES
   - Control for confounders such as maternal involvement, SES, and healthcare access
   - Apply ethical and statistical techniques to reduce model bias

4. **Post-Analysis Communication**

    Misinterpretation of results could reinforce stereotypes or social inequalities. To prevent this, we will:
   - Transparently report limitations and biases
   - Include insights from underrepresented groups to ensure balanced conclusions
   - Distribute findings in an accessible and culturally respectful manner

By identifying and addressing these issues, we aim to ensure that our analysis maintains integrity and promotes equitable data use.
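
One simple way to act on the sampling-bias check described above is to compare the makeup of the sample before and after cleaning. The sketch below is illustrative only: the data are made up, `BIO_SEX` stands in for any demographic variable, and `kept` is a hypothetical flag marking rows that survive the filtering step.

```python
import pandas as pd

# Toy data; "kept" is a hypothetical flag for rows that survive cleaning.
full = pd.DataFrame({
    "BIO_SEX": [1, 1, 1, 2, 2, 2, 2, 2],
    "kept": [True, True, True, True, False, False, False, True],
})

# Share of each group in the full sample and in the retained sample.
share_full = full["BIO_SEX"].value_counts(normalize=True).sort_index()
share_kept = full.loc[full["kept"], "BIO_SEX"].value_counts(normalize=True).sort_index()

# A large shift suggests the cleaning step removes one group more than another.
shift = (share_kept - share_full).round(3)
print(shift)
```

A formal test of the difference could follow, but even this descriptive comparison makes it clear whether the cleaned sample still resembles the original one.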

# Team Expectations 


Read over the [COGS108 Team Policies](https://github.com/COGS108/Projects/blob/master/COGS108_TeamPolicies.md) individually. Then, include your group’s expectations of one another for successful completion of your COGS108 project below. Discuss and agree on what all of your expectations are. Discuss how your team will communicate throughout the quarter and consider how you will communicate respectfully should conflicts arise. By including each member’s name above and by adding their name to the submission, you are indicating that you have read the COGS108 Team Policies, accept your team’s expectations below, and have every intention to fulfill them. These expectations are for your team’s use and benefit — they won’t be graded for their details.

* *Team Expectation 1*
* *Team Expectation 2*
* *Team Expectation 3*
* ...

# Project Timeline Proposal

Specify your team's project timeline. An example timeline has been provided. Change the dates, times, names, and details to fit your group's plan.

If you think you will need any special resources or training outside what we have covered in COGS 108 to solve your problem, then your proposal should state these clearly. For example, if you have selected a problem that involves implementing multiple neural networks, please state this so we can make sure you know what you’re doing and so we can point you to resources you will need to implement your project. Note that you are not required to use outside methods.



| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/20  |  1 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 1/26  |  10 AM |  Do background research on topic | Discuss ideal dataset(s) and ethics; draft project proposal | 
| 2/1  | 10 AM  | Edit, finalize, and submit proposal; Search for datasets  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/14  | 6 PM  | Import & Wrangle Data (Ant Man); EDA (Hulk) | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/23  | 12 PM  | Finalize wrangling/EDA; Begin Analysis (Iron Man; Thor) | Discuss/edit Analysis; Complete project check-in |
| 3/13  | 12 PM  | Complete analysis; Draft results/conclusion/discussion (Wasp)| Discuss/edit full project |
| 3/20  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |