In [1]:
##Clear the environment
rm(list=ls())

##Turn off scientific notation for numbers
options(scipen = 999)

##Set locale
Sys.setlocale("LC_ALL", "English") 

##Set seed for reproducibility
set.seed(2345)

# Turn off warnings
options(warn = -1)

# Summary rates from a 2x2 confusion matrix.
# Assumes rows = actual class and columns = predicted class,
# with the negative class first (cm[2,2] = true positives).
getstats <- function(cm){
  # Sensitivity, a.k.a. TPR, and false positive rate
  tpr <- cm[2,2]/(cm[2,2]+cm[2,1])
  fpr <- cm[1,2]/(cm[1,2]+cm[1,1])
  
  # Specificity, a.k.a. TNR, and false negative rate
  tnr <- cm[1,1]/(cm[1,1]+cm[1,2])
  fnr <- cm[2,1]/(cm[2,1]+cm[2,2])
  
  # Accuracy and error rate
  acc <- (cm[2,2]+cm[1,1])/sum(cm)
  err <- (cm[1,2]+cm[2,1])/sum(cm)
  
  # Precision, a.k.a. positive predictive value
  ppv <- cm[2,2]/(cm[2,2]+cm[1,2])
  
  # Negative predictive value
  npv <- cm[1,1]/(cm[1,1]+cm[2,1])
  
  rbind(TruePos_Sensitivity=tpr, FalsePos=fpr, TrueNeg_Specificity=tnr, FalseNeg=fnr,
        PositivePredictiveValue=ppv, NegativePredictiveValue=npv, Accuracy = acc, Error = err)
}

# Clean the column names: lowercase them and strip ( ) . _ - and ,
# Use: df <- cleanit(df)
cleanit <-function(df){
  names(df) <-tolower(names(df))
  names(df) <- gsub("\\(","",names(df))
  names(df) <- gsub("\\)","",names(df))
  names(df) <- gsub("\\.","",names(df))
  names(df) <- gsub("_","",names(df))
  names(df) <- gsub("-","",names(df))
  names(df) <- gsub(",","",names(df))
  return(df)
}
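
The rates returned by `getstats()` depend on how the confusion matrix is laid out. The sketch below spells out the layout the formulas imply (rows = actual class, columns = predicted class, negative class first) and recomputes three of the rates by name. The counts are invented for illustration and do not come from any dataset in this notebook.

```r
# Hypothetical confusion matrix (counts are made up for illustration).
# Rows = actual class, columns = predicted class, negative class first.
cm <- matrix(c(50, 10,
                5, 35),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("neg", "pos"),
                             predicted = c("neg", "pos")))

# Same formulas as getstats(), indexed by name instead of position
sensitivity <- cm["pos", "pos"] / sum(cm["pos", ])   # TP / (TP + FN)
specificity <- cm["neg", "neg"] / sum(cm["neg", ])   # TN / (TN + FP)
accuracy    <- sum(diag(cm)) / sum(cm)               # (TP + TN) / total

round(c(sensitivity = sensitivity,
        specificity = specificity,
        accuracy    = accuracy), 3)
```

A matrix built with `table(actual, predicted)` has this orientation. If the arguments are swapped, sensitivity and precision trade places.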


In [2]:
acl <- read.csv("D:/Data/AustinCityLimits.csv")

In [3]:
str(acl)

'data.frame':	116 obs. of  14 variables:
 $ Artist       : Factor w/ 116 levels "Aimee Mann","Alabama Shakes",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Year         : int  2008 2013 2009 2009 2007 2009 2010 2009 2003 2008 ...
 $ Month        : Factor w/ 6 levels "December","February",..: 4 2 3 5 4 4 3 4 3 5 ...
 $ Season       : Factor w/ 2 levels "fall","winter": 1 2 2 1 1 1 2 1 2 1 ...
 $ Gender       : Factor w/ 2 levels "F","M": 1 1 2 2 1 2 2 2 2 1 ...
 $ Age          : int  52 24 75 39 33 62 37 35 43 67 ...
 $ Age.Group    : Factor w/ 4 levels "Fifties or Older",..: 1 4 1 3 3 1 3 3 2 1 ...
 $ Grammy       : Factor w/ 2 levels "N","Y": 2 1 1 1 2 2 1 1 2 1 ...
 $ Genre        : Factor w/ 4 levels "Country","Jazz/Blues",..: 4 3 2 3 3 1 3 3 3 2 ...
 $ BB.wk.top10  : int  0 1 NA 1 1 0 1 NA 1 0 ...
 $ Twitter      : int  101870 73313 308634 56343 404439 3326 125758 8197 158647 690 ...
 $ Twitter.100k : int  1 0 1 0 1 0 1 0 1 0 ...
 $ Facebook     : int  113576 298278 10721 318313 1711685 27321 56

In [11]:
acl[acl$Artist=="Allen Toussaint",]

Unnamed: 0,Artist,Year,Month,Season,Gender,Age,Age.Group,Grammy,Genre,BB.wk.top10,Twitter,Twitter.100k,Facebook,Facebook.100k
3,Allen Toussaint,2009,January,winter,M,75,Fifties or Older,N,Jazz/Blues,,308634,1,10721,0


In [10]:
head(acl)

Unnamed: 0,Artist,Year,Month,Season,Gender,Age,Age.Group,Grammy,Genre,BB.wk.top10,Twitter,Twitter.100k,Facebook,Facebook.100k
1,Aimee Mann,2008,November,fall,F,52,Fifties or Older,Y,Singer-Songwriter,0.0,101870,1,113576,1
2,Alabama Shakes,2013,February,winter,F,24,Twenties,N,Rock/Folk/Indie,1.0,73313,0,298278,1
3,Allen Toussaint,2009,January,winter,M,75,Fifties or Older,N,Jazz/Blues,,308634,1,10721,0
4,Andrew Bird,2009,October,fall,M,39,Thirties,N,Rock/Folk/Indie,1.0,56343,0,318313,1
5,Arcade Fire,2007,November,fall,F,33,Thirties,Y,Rock/Folk/Indie,1.0,404439,1,1711685,1
6,Asleep at the Wheel,2009,November,fall,M,62,Fifties or Older,Y,Country,0.0,3326,0,27321,0


## Primary Research Questions
1. Are there an equal number of male and female performers on Austin City Limits?
2. Are male performers just as likely to have had a Top 10 hit as female performers?

In [13]:
# Create a table of counts for Gender
gender_tab <-table(acl$Gender)
gender_tab


 F  M 
35 81 

In [14]:
# Create vector of expected proportions
ExpGender <- c(.50, .50)

In [15]:
# Check expected counts assumption
chisq.test(gender_tab, p=ExpGender)$expected

In [16]:
# Run goodness of fit
chisq.test(gender_tab, p=ExpGender)


	Chi-squared test for given probabilities

data:  gender_tab
X-squared = 18.2414, df = 1, p-value = 0.00001946
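
The goodness-of-fit statistic can be reproduced by hand from the gender counts shown earlier (35 F, 81 M). This is a minimal sketch of the arithmetic, not a replacement for `chisq.test()`. The variable names are illustrative.

```r
# Observed counts taken from the gender table
observed <- c(F = 35, M = 81)

# Under H0 (equal proportions) each group expects half of the 116 performers
expected <- sum(observed) * c(0.5, 0.5)

# Chi-square statistic: sum of (observed - expected)^2 / expected
x2 <- sum((observed - expected)^2 / expected)

# Upper-tail p-value with (number of categories - 1) degrees of freedom
p_value <- pchisq(x2, df = length(observed) - 1, lower.tail = FALSE)

round(x2, 4)   # 18.2414, matching the chisq.test() output
p_value
```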


In [18]:
# Create two-way table
gender_top10 <-table(acl$Gender, acl$BB.wk.top10)
gender_top10

   
     0  1
  F 15 18
  M 38 32

In [19]:
# Generate expected counts
chisq.test(gender_top10, correct=FALSE)$expected

           0        1
  F 16.98058 16.01942
  M 36.01942 33.98058


In [20]:

# Run test of independence
chisq.test(gender_top10, correct=FALSE)


	Pearson's Chi-squared test

data:  gender_top10
X-squared = 0.7002, df = 1, p-value = 0.4027
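
The expected counts and the test statistic for the two-way table follow from the row and column totals. The sketch below rebuilds them from the observed counts printed earlier. The object names are illustrative.

```r
# Observed counts from the gender by Top 10 table
observed <- matrix(c(15, 18,
                     38, 32),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("F", "M"), c("0", "1")))

# Expected count per cell = row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square statistic without continuity correction
x2 <- sum((observed - expected)^2 / expected)

# Degrees of freedom = (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(x2, df = df, lower.tail = FALSE)

round(expected, 5)
round(c(x2 = x2, df = df, p_value = p_value), 4)
```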


In [21]:
table(acl$Season)


  fall winter 
    52     64 

In [None]:
### Lab Question 1

# Make a vector of happiness scores for each sample
underclass_happy <- post$happy[post$classification=='Freshman'|post$classification=='Sophomore']
upperclass_happy <- post$happy[post$classification=='Junior'|post$classification=='Senior']

In [None]:
# Check the normality assumption
hist(underclass_happy, xlab='Underclassman Happiness', main='Percent of Time Happy')
hist(upperclass_happy, xlab='Upperclassman Happiness', main='Percent of Time Happy')

In [None]:
# Run independent t-test
t.test(underclass_happy, upperclass_happy)

In [None]:
## Lab Question 2

# Make a vector of difference scores
post$diff_happy <- post$happy - post$post_happy

In [None]:
# Check the normality assumption
hist(post$diff_happy, xlab= 'Difference in Happiness over the Semester', main = 'Happy-Post Happy')

In [None]:
# Run dependent t-test
t.test(post$happy, post$post_happy, paired=T)

In [None]:
#Suppose we wanted to test the happiness scores of those who live on campus against those who live off campus. What has caused the error below?

on_campus <- post[post$live_campus == 'yes',]
off_campus <- post[post$live_campus == 'no',]
on_campus_happy <- on_campus$happy
off_campus_happy <- off_campus$happy
t.test(on_campus_happy, off_campus_happy)
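
The `post` data are not shown here, so the cause of the error cannot be confirmed from this notebook. One common cause is a filter value that does not match how the column is coded. The sketch below uses a hypothetical stand-in data frame (`post_demo`, with invented values and an assumed coding of `live_campus`) to show how to check for that.

```r
# Hypothetical stand-in for `post`; the real coding of live_campus is not shown
post_demo <- data.frame(
  live_campus = c("On campus", "Off campus", "On campus", "Off campus"),
  happy       = c(80, 65, 90, 70)
)

# Step 1: list the values that actually occur in the grouping column
table(post_demo$live_campus, useNA = "ifany")

# Step 2: filtering on a value that never occurs yields an empty vector
on_campus_happy <- post_demo$happy[post_demo$live_campus == "yes"]
length(on_campus_happy)

# Step 3: t.test() stops when a sample is empty; capture the message
msg <- tryCatch(
  t.test(on_campus_happy, post_demo$happy),
  error = function(e) conditionMessage(e)
)
msg
```

Running `table()` on the grouping column before subsetting shows the exact spelling and capitalization to filter on.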


## Lab

Let’s break this question down into the different statistics that you will need to construct your answer.  Be sure that your R output includes all of the following components. 

For each hypothesis test, 

1. Create vectors of the scores that you wish to analyze.
2. Check the assumption of normality by generating a histogram for each variable of interest. 
3. Find the t-statistic and p-value.
4. Interpret the results of each test. 


NOTE: If you are running a directional hypothesis test, remember to modify the code to reflect the direction.
A one-sided test looks like this:

t.test(Variable1, Variable2, alternative = 'less'), when you expect Mean1 < Mean2

t.test(Variable1, Variable2, alternative = 'greater'), when you expect Mean1 > Mean2
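
As a rough illustration of what the `alternative` argument changes, the sketch below runs a two-sided and a one-sided test on the same simulated data. The scores are generated with `rnorm()` and are not course data. `group_a` and `group_b` are placeholder names.

```r
set.seed(1)  # arbitrary seed so the simulated numbers are repeatable

# Simulated scores: group_a is drawn from the population with the lower mean
group_a <- rnorm(30, mean = 5, sd = 2)
group_b <- rnorm(30, mean = 7, sd = 2)

two_sided <- t.test(group_a, group_b)
one_sided <- t.test(group_a, group_b, alternative = "less")

# The t statistic is identical; only the p-value calculation changes
c(t_two_sided = unname(two_sided$statistic),
  t_one_sided = unname(one_sided$statistic))
c(p_two_sided = two_sided$p.value,
  p_one_sided = one_sided$p.value)
```

When the observed difference falls in the hypothesized direction, the one-sided p-value is half the two-sided one. When it falls in the opposite direction, the one-sided p-value is larger than 0.5.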

In [None]:
hs_hours <- post$hw_hours_HS 
col_hours<- post$hw_hours_college

In [None]:
hist(hs_hours)

In [None]:
hist(col_hours)

On average, how many more hours per week did students spend on homework in college than in high school? (Round to 2 decimal places.)

In [None]:
diff_hours <- col_hours - hs_hours

In [None]:
hist(diff_hours)

In [None]:
t.test(diff_hours,alternative = 'greater')

In [None]:
t.test(col_hours,hs_hours,paired=TRUE,alternative = 'greater')
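
A paired t-test is a one-sample t-test on the difference scores, which is why the two calls above give the same result. The sketch below checks this on simulated data. `hs_demo` and `col_demo` are invented stand-ins for the homework-hours columns.

```r
set.seed(2)  # arbitrary seed so the simulated numbers are repeatable

# Simulated weekly homework hours for 25 hypothetical students
hs_demo  <- round(runif(25, min = 2, max = 15))
col_demo <- hs_demo + round(rnorm(25, mean = 3, sd = 4))

paired_test <- t.test(col_demo, hs_demo, paired = TRUE, alternative = "greater")
diff_test   <- t.test(col_demo - hs_demo, alternative = "greater")

# Both calls report the same statistic and p-value
all.equal(unname(paired_test$statistic), unname(diff_test$statistic))
all.equal(paired_test$p.value, diff_test$p.value)

# The reported estimate is the mean of the difference scores
mean(col_demo - hs_demo)
```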

On average, how many fewer hours do Greek students sleep on Saturday nights than non-Greek students? (Report to 1 decimal place.)

In [None]:
greek <- post[post$greek == 'yes',]
no_greek <- post[post$greek == 'no',]
greek_sleep <- greek$sleep_Sat
no_greek_sleep <- no_greek$sleep_Sat

In [None]:
hist(greek_sleep)

In [None]:
hist(no_greek_sleep)

In [None]:
t.test(greek_sleep, no_greek_sleep, alternative='less')

In [None]:
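# Difference in group means. The two groups are independent samples,
# so subtract the means rather than the vectors element by element.
mean(greek_sleep, na.rm = TRUE) - mean(no_greek_sleep, na.rm = TRUE)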
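
When two independent samples differ in size, subtracting one vector from the other recycles the shorter vector, and the mean of that result is generally not the difference in group means. R normally warns about this, but warnings are switched off at the top of this notebook. A small sketch with invented numbers:

```r
# Invented sleep hours for two groups of different sizes
group_small <- c(4, 7, 5)
group_large <- c(8, 7, 9, 8)

# Element-wise subtraction recycles the shorter vector (4, 7, 5, 4);
# suppressWarnings() mimics the notebook's options(warn = -1)
recycled <- suppressWarnings(mean(group_small - group_large))

# Difference between the two group means
mean_diff <- mean(group_small) - mean(group_large)

c(recycled = recycled, mean_diff = mean_diff)
```

The second value is the one that matches the group means reported by `t.test()`.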

## Problem Set

Is the increase in time spent studying from high school to college the same for nursing majors and biology majors?  

1. Create a new variable that equals the difference in hours spent studying per week in college versus high school for each student. 

2. Create two vectors of those differences, one for nursing majors and one for biology majors.

In [None]:
post$diff_hours <- post$hw_hours_college - post$hw_hours_HS 
bio_major <- post[post$major == 'Biology',]
nurse_major <- post[post$major == 'Nursing',]
bio_diff<-bio_major$diff_hours
nurse_diff<-nurse_major$diff_hours

In [None]:
hist(bio_diff)

In [None]:
hist(nurse_diff)

In [None]:
t.test(bio_diff,nurse_diff)