
# Sprint 2 Challenge

### Dataset description: 

Anyone who is a fan of detective TV shows has watched a scene where human remains are discovered and some sort of expert is called in to determine when the person died. But is this science fiction or science fact? Is it possible to use evidence from skeletal remains to determine how long a body has been buried (a decent approximation of how long the person has been dead)? 

Researchers sampled long bone material from bodies exhumed from coffin burials in two cemeteries in England. In each case, the dates of death and burial (and therefore the interment time) were known. This data is given in the Longbones.csv dataset. 

You can find Longbones.csv here: https://raw.githubusercontent.com/Chelsea-Myers/Lambda-Intro/master/Longbones.csv

**What can we learn about the bodies that were buried in the cemetery?**

The variable names are:

Site = Site ID, either Site 1 or Site 2

Time = Interment time in years

Depth = Burial depth in ft.

Lime = Burial with Quicklime (0 = No, 1 = Yes)

Age = Age at time of death in years

Nitro = Nitrogen composition of the long bones in g per 100g of bone.

Oil = Oil contamination of the grave site (0 = No contamination, 1 = Oil contamination)

Source: D.R. Jarvis (1997). "Nitrogen Levels in Long Bones from Coffin Burials Interred for Periods of 26-90 Years," Forensic Science International, Vol. 85, pp. 199-208.

###1) Import the data 

Import the Longbones.csv file and print the first 5 rows.

In [1]:
#Import the dataset

import pandas as pd
import numpy as np


data_url = 'https://raw.githubusercontent.com/Chelsea-Myers/Lambda-Intro/master/Longbones.csv'

df = pd.read_csv(data_url, skipinitialspace=True, header=0)



In [2]:
### YOUR CODE HERE ###
df.head(5)

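|   | Site | Time | Depth | Lime | Age | Nitro | Oil |
|---|------|------|-------|------|-----|-------|-----|
| 0 | 1 | 88.5 | 7.0 | 1 | NaN | 3.88 | 1 |
| 1 | 1 | 88.5 | NaN | 1 | NaN | 4.00 | 1 |
| 2 | 1 | 85.2 | 7.0 | 1 | NaN | 3.69 | 1 |
| 3 | 1 | 71.8 | 7.6 | 1 | 65.0 | 3.88 | 0 |
| 4 | 1 | 70.6 | 7.5 | 1 | 42.0 | 3.53 | 0 |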


###2) Check for missing data.

Is there any missing data in the dataset?  If so, in what variable(s)?  

In [3]:
### YOUR CODE HERE ###
df.isnull().sum()


Site     0
Time     0
Depth    1
Lime     0
Age      7
Nitro    0
Oil      0
dtype: int64

There is one null value in the Depth variable and 7 null values in the Age variable.
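To report only the variables that actually contain missing values, the counts can be filtered. The following is a minimal sketch, assuming a small stand-in DataFrame named `sample` built from the five rows shown by `df.head(5)` above (so its counts are for those five rows only, not the full dataset):

```python
import numpy as np
import pandas as pd

# Stand-in data: the five rows displayed by df.head(5) above
sample = pd.DataFrame({
    "Site": [1, 1, 1, 1, 1],
    "Time": [88.5, 88.5, 85.2, 71.8, 70.6],
    "Depth": [7.0, np.nan, 7.0, 7.6, 7.5],
    "Lime": [1, 1, 1, 1, 1],
    "Age": [np.nan, np.nan, np.nan, 65.0, 42.0],
    "Nitro": [3.88, 4.00, 3.69, 3.88, 3.53],
    "Oil": [1, 1, 1, 0, 0],
})

# Count missing values per column, then keep only the columns with at least one
missing_counts = sample.isnull().sum()
columns_with_missing = missing_counts[missing_counts > 0]
print(columns_with_missing)
```

Applied to the full `df`, the same filter would list just Depth and Age.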

### 3) Remove any rows with missing data from the dataset.  If there is no missing data, write "No missing data" in the answer section below.

In [6]:
### YOUR CODE HERE ###
df = df.dropna(how='any', axis=0)  # drop every row that has at least one missing value
df.isnull().sum()

Site     0
Time     0
Depth    0
Lime     0
Age      0
Nitro    0
Oil      0
dtype: int64

All rows containing null values have been removed; every column now reports zero missing values.
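It can also help to confirm how many rows a drop removes. This is a rough sketch on a stand-in DataFrame (three columns of the five rows shown by `df.head(5)` earlier); the name `sample` and the row counts are for illustration only and do not describe the full dataset:

```python
import numpy as np
import pandas as pd

# Stand-in data: three columns from the five rows displayed by df.head(5)
sample = pd.DataFrame({
    "Depth": [7.0, np.nan, 7.0, 7.6, 7.5],
    "Age": [np.nan, np.nan, np.nan, 65.0, 42.0],
    "Nitro": [3.88, 4.00, 3.69, 3.88, 3.53],
})

rows_before = len(sample)
cleaned = sample.dropna(how="any", axis=0)  # keep only fully observed rows
rows_after = len(cleaned)

print(f"dropped {rows_before - rows_after} of {rows_before} rows")
print(cleaned.isnull().sum())
```

Note that `dropna` returns a new DataFrame, so the result has to be assigned back (or to a new name) for the removal to take effect.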

#Use the following information to answer questions 4) - 7) 

The mean nitrogen composition in living individuals is 4.3g per 100g of bone.  

We wish to use the Longbones sample to test the null hypothesis that the mean nitrogen composition per 100g of bone in the deceased is 4.3g (equal to that of living humans) vs the alternative hypothesis that the mean nitrogen composition per 100g of bone in the deceased is not 4.3g (not equal to that of living humans). 



###4) Using symbols and statistical language, write the null and alternative hypotheses outlined above.

$H_0: \mu = 4.3\text{ g}$

The mean nitrogen composition per 100g of bone in the deceased is 4.3g.

$H_a: \mu \neq 4.3\text{ g}$

The mean nitrogen composition per 100g of bone in the deceased is not 4.3g.

###5) What is the appropriate test for these hypotheses?  A t-test or a chi-square test?  Explain your answer in a sentence or two.

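A one-sample t-test. Nitrogen composition is a numerical variable, and we are comparing the mean of a single sample to a known reference value (4.3g) without knowing the population standard deviation; a chi-square test applies to categorical variables.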

###6) Use a built-in Python function to conduct the statistical test you identified in 5).  Report your p-value.  Write the conclusion to your hypothesis test at the alpha = 0.05 significance level.

In [8]:
### YOUR CODE HERE ###
import scipy.stats as st

# One-sample t-test of the Nitro column against the reference mean of 4.3
t_stat, pval = st.ttest_1samp(df['Nitro'], 4.3)
print(t_stat)
print(pval)

-16.525765821830365
8.097649978903554e-18


Because the p-value (about 8.1e-18) is far below the 0.05 significance level, we reject the null hypothesis. The data provide strong evidence that the mean nitrogen composition of long bones in the deceased differs from the 4.3g per 100g found in living individuals.
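As a cross-check, the test statistic can be rebuilt by hand from $t = (\bar{x} - \mu_0) / (s / \sqrt{n})$. The sketch below assumes the sample mean, sample standard deviation, and sample size that this notebook prints for the Nitro column in question 7; the variable names are illustrative:

```python
import math
from scipy import stats

# Summary statistics of the Nitro column, as printed in question 7
x_bar = 3.796               # sample mean
s = 0.18042759668925043     # sample standard deviation
n = 35                      # sample size
mu_0 = 4.3                  # mean under the null hypothesis

standard_error = s / math.sqrt(n)
t_stat = (x_bar - mu_0) / standard_error

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

print(round(t_stat, 4))
print(p_value)
```

The hand-computed statistic should agree with the value returned by `ttest_1samp` to several decimal places.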

###7) Create a 95% confidence interval for the mean nitrogen composition in the longbones of a deceased individual.  Interpret your confidence interval in a sentence or two.

In [13]:
### YOUR CODE HERE ###
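from scipy.stats import t

s_mean = df['Nitro'].mean()
s_std = df['Nitro'].std()            # sample standard deviation (ddof=1)
s_n = df['Nitro'].count()
s_error = s_std / s_n ** (1/2)       # standard error = s / sqrt(n)
print(s_mean)
print(s_std)
print(s_n)
print(round(s_error, 4))
t_star = t.ppf(0.975, df=s_n - 1)    # critical value for a two-sided 95% interval
print(t_star)

# Confidence level is passed positionally because its keyword name differs across SciPy versions
ci_low, ci_high = t.interval(0.95, df=s_n - 1, loc=s_mean, scale=s_error)
print(round(ci_low, 3), round(ci_high, 3))

3.7960000000000003
0.18042759668925043
35
0.0305
2.032244509317718
3.734 3.858

We are 95% confident that the population mean nitrogen composition in the longbones of a deceased individual is between about 3.73g and 3.86g per 100g of bone. The interval does not contain 4.3g, which agrees with rejecting the null hypothesis in question 6.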

#Use the following information to answer questions 8) - 12) 


The researchers also want to learn more about burial practices in the parts of England where the two cemeteries in the study were located.  They wish to determine if burials with Quicklime are associated with the burial region.  

Their null hypothesis is that there is no association between cemetery site and burial with Quicklime.  The alternative hypothesis is that there is an association between cemetery site and burial with Quicklime.



###8) Calculate the joint distribution of burial with Quicklime by burial site.

In [15]:
### YOUR CODE HERE ###
joint = pd.crosstab(df['Site'], df['Lime'], margins= True)
print(joint)

Lime   0   1  All
Site             
1     14   5   19
2      9   7   16
All   23  12   35
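The table above reports counts. A joint distribution is often expressed as proportions of the grand total instead; the following sketch assumes the four cell counts printed above and rebuilds them in a stand-in DataFrame named `counts`:

```python
import pandas as pd

# Cell counts copied from the crosstab output above (rows: Site, columns: Lime)
counts = pd.DataFrame(
    [[14, 5], [9, 7]],
    index=pd.Index([1, 2], name="Site"),
    columns=pd.Index([0, 1], name="Lime"),
)

# Divide every cell by the grand total so the four cells sum to 1
grand_total = counts.to_numpy().sum()
joint_proportions = counts / grand_total
print(joint_proportions.round(3))
```

On the original data, `pd.crosstab(df['Site'], df['Lime'], normalize='all')` should give the same proportions directly.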


###9) Calculate the conditional distribution of burial with Quicklime by burial site.

In [19]:
### YOUR CODE HERE ###
# Row percentages: the distribution of Lime within each Site
conditional = pd.crosstab(df['Site'], df['Lime'], normalize='index')*100
print(conditional)

Lime          0          1
Site                      
1     73.684211  26.315789
2     56.250000  43.750000


###10) What is the appropriate test for the hypotheses listed above?  A t-test or a chi-square test?  Explain your answer in a sentence or two.

A chi-square test of independence. Cemetery site and burial with Quicklime are both categorical variables, and the chi-square test assesses whether two categorical variables are associated.

###11) Conduct your hypothesis test and report your conclusion at the alpha = 0.05 significance level.

In [20]:
### YOUR CODE HERE ###
from scipy.stats import chi2_contingency
g, p, dof, expctd = chi2_contingency(pd.crosstab(df['Site'], df['Lime']))
print(p)

0.4684181967877057


The p-value (about 0.468) is larger than the 0.05 significance level, so we fail to reject the null hypothesis: there is not enough evidence to conclude that cemetery site and burial with Quicklime are associated.
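To see what the test compares, the expected counts under independence can be computed by hand as (row total × column total) / grand total. The sketch below assumes the cell counts from the crosstab in question 8; it also runs the test without Yates' continuity correction, which SciPy applies by default to 2x2 tables, purely for comparison:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the crosstab in question 8 (rows: Site 1, 2; columns: Lime 0, 1)
observed = np.array([[14, 5], [9, 7]])

# Expected counts if Site and Lime were independent
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_by_hand = row_totals * col_totals / observed.sum()

# Default call: Yates' continuity correction is applied because the table has 1 degree of freedom
chi2_stat, p_value, dof, expected = chi2_contingency(observed)

# Same test without the correction, for comparison only
chi2_raw, p_uncorrected, _, _ = chi2_contingency(observed, correction=False)

print(np.round(expected_by_hand, 2))
print(dof, round(p_value, 4), round(p_uncorrected, 4))
```

Either way the p-value stays above 0.05, so the conclusion does not depend on the correction.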

###12) Conduct your hypothesis test and report your conclusion at the alpha = 0.05 significance level.

In [None]:
### YOUR CODE HERE ###

Duplicate question. 

###13) In a few sentences, describe the difference between Bayesian and Frequentist statistics.

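In Bayesian statistics, probability expresses a degree of belief, so a hypothesis or parameter can itself be assigned a probability: a prior distribution is updated with the observed data to give a posterior distribution. In frequentist statistics, parameters are treated as fixed but unknown, and probability describes the long-run frequency of outcomes over hypothetical repeated samples, so a hypothesis is tested (for example with a p-value) without being assigned a probability. Bayesian inference therefore conditions only on the data actually observed, while frequentist procedures are judged by how they would behave across data sets that could have been observed but were not.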
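The contrast can be illustrated on the proportion of burials with Quicklime (12 of 35 in the crosstab from question 8). This is only a sketch: the uniform Beta(1, 1) prior is an assumption chosen for simplicity, not something specified in the challenge, and the variable names are illustrative.

```python
from scipy import stats

# Burials with Quicklime out of all burials, from the crosstab in question 8
successes, n = 12, 35

# Frequentist view: the proportion is a fixed unknown; report a point estimate
p_hat = successes / n

# Bayesian view: start from a uniform Beta(1, 1) prior (an assumption) and
# update it with the data to get a Beta posterior over the proportion
posterior = stats.beta(1 + successes, 1 + (n - successes))
posterior_mean = posterior.mean()
cred_low, cred_high = posterior.interval(0.95)   # 95% credible interval

# A direct probability statement about the parameter, which only the
# Bayesian framing provides: P(proportion < 0.5 | data)
prob_below_half = posterior.cdf(0.5)

print(round(p_hat, 3), round(posterior_mean, 3))
print(round(cred_low, 3), round(cred_high, 3), round(prob_below_half, 3))
```

The point estimates are close, but only the posterior supports a statement such as "the probability that the proportion is below one half is ...".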