# 9C - Theories about Selfies

In [None]:
# Load the CourseKata library
suppressPackageStartupMessages({
    library(coursekata)
})

## 1.0 - The Kind of Guy that Posts a Lot of Selfies...

If we told you about an adult male who posts a lot of selfies on social media, what do you think he would be like? (Do you know someone like this? What are they like?)

Well, lucky for us, we have some data we can take a look at to maybe get some answers.

1.1 - Write some code to take a quick look at the `selfies` data frame.

In [None]:
selfies <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQJiwaSVpvLXAXndwL7-NRNJrW7Ncz7tnKLwROpkWwr739zU5yJKY7QHHhmCSxwiClMq11P8SAdN6M7/pub?output=csv", header  = TRUE)

The data in `selfies` come from an online survey of a nationally representative sample of U.S. men aged 18–40 (Fox & Rooney, 2015). Researchers were interested in visual self-presentation (e.g., posting “selfies” or editing pictures of oneself) as well as psychological measures such as the Dark Triad (i.e., narcissism, Machiavellianism, and psychopathy) and self-objectification (treating oneself as an object whose worth is based on outward attractiveness).

- `subjectid` The male participant’s ID (there are no females included in this sample)
- `selfobject` Trait self-objectification score. Scores can range from −25 to 25; higher scores indicate higher self-objectification
- `narcissism` Score on four items from Narcissistic Personality Inventory; higher scores indicate higher narcissism
- `machiavelli` Score on four items from the Mach IV scale; higher scores indicate higher tendencies toward manipulation
- `psychotic` Score on four items from the Psychopathy Scale; higher scores indicate higher levels of subclinical psychopathy
- `age` Age of the participant
- `agerange` Whether the participant is < 25 years old, 25 - 30 years old, 31 - 35 years old, or over 35 years old
- `timeonsocialmedia` Self-reported estimate of the total time spent each day, in minutes, on Facebook, Twitter, Instagram, Tumblr, and Pinterest combined
- `numselfies` Self-reported estimate of how many pictures of themselves participants had taken and posted on social networking sites in the past week
- `numphotos` Self-reported estimate of how many pictures participants had posted in the past week that were not of themselves
- `photoediting` Self-judgment of how often participants edit their photos before posting them (e.g., applying filters, cropping parts of themselves out), on a 5-point scale (1 = Never; 5 = Often)


1.2 - **Get to know the data set:** What are the cases? Do you have questions about any of the variables?

1.3 - What are some research questions or hypotheses you can explore with this data? Try to write as many questions or research hypotheses as you can.  

## 2.0 - Explore Variation

2.1 - Pick one research question/hypothesis you can explore with this data. Write that idea as a word equation and make a prediction about the pattern you might see.

2.2 - Create a visualization to explore your word equation. What do you notice?

## 3.0 - Model Variation

3.1 - Based on what you see in your visualization, do you think your $b_1$ estimate will be a positive value, negative value, or somewhat close to 0?

3.2 - Find the best fit for your model and put the best fitting estimates into GLM notation.

$$Y_i = b_0 + b_1 X_i + e_i$$
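As a rough sketch of what fitting a model can look like, the block below runs `lm()` on a small made-up data frame. The data frame `toy` and its values are invented for illustration (they are not the real `selfies` data); with the real data you would swap in `data = selfies` and your own outcome and explanatory variables.

```r
# Made-up illustration data (NOT the real selfies data)
toy <- data.frame(
  narcissism   = c(2, 4, 5, 7, 8, 10, 11, 13),
  photoediting = c(1, 2, 2, 3, 2, 4, 4, 5)
)

# Fit the model: photoediting = narcissism + error
toy_model <- lm(photoediting ~ narcissism, data = toy)

# Pull out the best-fitting estimates: b0 (intercept) and b1 (slope)
b0 <- coef(toy_model)[["(Intercept)"]]
b1 <- coef(toy_model)[["narcissism"]]
print(c(b0 = b0, b1 = b1))
```

The two printed numbers are the values that would replace $b_0$ and $b_1$ in the GLM notation above.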

3.3 - Re-create your visualization here and put the best fitting model in your visualization.


3.4 - What do your parameter estimates mean? Interpret those numbers. 

3.5 - How much better is your model than the empty model? 
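One way to put a number on "how much better" is PRE (proportional reduction in error). The sketch below computes it by hand on a small made-up data frame (`toy` is invented for illustration, not the real `selfies` data). In CourseKata, `supernova()` reports the same sums of squares and PRE in a single table.

```r
# Made-up illustration data (NOT the real selfies data)
toy <- data.frame(
  narcissism   = c(2, 4, 5, 7, 8, 10, 11, 13),
  photoediting = c(1, 2, 2, 3, 2, 4, 4, 5)
)

# Empty model (just the mean) and the model with an explanatory variable
empty_model <- lm(photoediting ~ 1, data = toy)
toy_model   <- lm(photoediting ~ narcissism, data = toy)

# Sum of squared residuals left over from each model
ss_total <- sum(resid(empty_model)^2)
ss_error <- sum(resid(toy_model)^2)

# PRE: the proportion of the empty model's error that the new model removes
pre <- (ss_total - ss_error) / ss_total
print(pre)
```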

3.6 - Could a model ever be worse than the empty model? Why or why not?

3.7 - Re-create your visualization here and put both the best fitting model **and** the empty model in your visualization. What is the slope ($b_1$) of the empty model? 


3.8 - What is the correlation coefficient (also called *Pearson's r* -- not to be confused with the programming language R!) between your outcome and explanatory variables? Try using the function `cor()` to find it. What does it mean? 
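If `cor()` is new to you, here is a minimal sketch of how it can be called, using a small made-up data frame (`toy` is invented for illustration, not the real `selfies` data). Replace the two columns with your own outcome and explanatory variables from `selfies`.

```r
# Made-up illustration data (NOT the real selfies data)
toy <- data.frame(
  narcissism   = c(2, 4, 5, 7, 8, 10, 11, 13),
  photoediting = c(1, 2, 2, 3, 2, 4, 4, 5)
)

# Pearson's r between the outcome and the explanatory variable
r <- cor(toy$photoediting, toy$narcissism)

# The order of the two variables does not change the result
r_swapped <- cor(toy$narcissism, toy$photoediting)

print(r)
```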

3.9 - What is the relationship between the correlation coefficient and PRE? What is the relationship between the correlation coefficient and $b_1$?

## 4.0 - Evaluating Models

4.1 - When we look at a visualization and think – yeah, this looks like there is a relationship between the explanatory and outcome variables – which parameter estimate best represents the “relationship”: $b_0$ or $b_1$?

4.2 - If we were **using our data** to estimate the DGP, what would be the best $\beta_1$ to represent the Data Generating Process? (Bonus: Why are we using $\beta_1$ here instead of $b_1$?)


4.3 - So why do we call $b_1$ the "best estimate"? (What are we trying to estimate? Why is it "the best"?)

4.4 - If a skeptic came along and claimed there is **no** relationship between your outcome and explanatory variables in the real world, what would be the best $\beta_1$ to represent this idea? 

4.5 - What other model (that we looked at today) has the very same value for slope?

4.6 - Notice a connection here! The empty model is the same as a model with **no relationship**. The empty model is also the same as a **random DGP** (such as `shuffle`). Why is shuffling the same as having **no relationship** between variables?

4.7 - Is it *possible* that a DGP where there is *no relationship* (that is, $\beta_1 = 0$) produced your $b_1$? Is it *likely*?

4.8 - Is it possible to generate some data that looks like this sample data with the `shuffle()` function? Is it likely?

Here is some example code that looks at **shuffle(photoediting) = narcissism + error** in a jitter plot. Plug in your outcome and explanatory variables instead!

In [None]:
# If you also want to look at the real (unshuffled) sample data as a jitter plot
#gf_jitter(photoediting ~ narcissism, data = selfies) %>%
#gf_lm(color = "orange") %>%
#gf_labs(title = "SAMPLE DATA")

gf_jitter(shuffle(photoediting) ~ narcissism, data = selfies, color = "navyblue") %>%
    gf_lm(color = "magenta") %>%
    gf_labs(title = "SHUFFLED DATA")
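A single shuffle shows only one possible outcome of a random DGP. To get a feel for what is *likely*, one option is to shuffle many times and save the slope each time. The sketch below does this on a small made-up data frame (`toy` is invented for illustration, not the real `selfies` data) and uses base R's `sample()` as a stand-in for `shuffle()`; the seed and the choice of 1000 shuffles are arbitrary.

```r
set.seed(9)  # arbitrary seed so the shuffles are reproducible

# Made-up illustration data (NOT the real selfies data)
toy <- data.frame(
  narcissism   = c(2, 4, 5, 7, 8, 10, 11, 13),
  photoediting = c(1, 2, 2, 3, 2, 4, 4, 5)
)

# Slope from the unshuffled (sample) data
sample_b1 <- coef(lm(photoediting ~ narcissism, data = toy))[["narcissism"]]

# Shuffle the outcome 1000 times, refit the model, and save b1 each time
shuffled_b1 <- replicate(1000, {
  shuffled_y <- sample(toy$photoediting)  # random reordering of the outcome
  coef(lm(shuffled_y ~ toy$narcissism))[[2]]
})

# Where do the shuffled slopes center, and how spread out are they?
print(summary(shuffled_b1))

# Proportion of shuffled slopes at least as far from 0 as the sample slope
prop_extreme <- mean(abs(shuffled_b1) >= abs(sample_b1))
print(prop_extreme)
```

Comparing `sample_b1` with the spread of `shuffled_b1` is one way to think about questions 4.7 and 4.8.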

## 5.0 - Do Good Models Imply Causation?

5.1 - If you found a pretty good model today, can you say that your explanatory variable **causes** variation in your outcome variable? Why or why not?