
# Sprint 2 Challenge

### Dataset description: 

Anyone who is a fan of detective TV shows has watched a scene where human remains are discovered and some sort of expert is called in to determine when the person died. But is this science fiction or science fact? Is it possible to use evidence from skeletal remains to determine how long a body has been buried (a decent approximation of how long the person has been dead)? 

Researchers sampled long bone material from bodies exhumed from coffin burials in two cemeteries in England. In each case, date of death and burial (and therefore interment time) was known. This data is given in the Longbones.csv dataset. 

You can find Longbones.csv here: https://raw.githubusercontent.com/Chelsea-Myers/Lambda-Intro/master/Longbones.csv

**What can we learn about the bodies that were buried in the cemetery?**

The variable names are:

Site = Site ID, either Site 1 or Site 2

Time = Interment time in years

Depth = Burial depth in ft.

Lime = Burial with Quicklime (0 = No, 1 = Yes)

Age = Age at time of death in years

Nitro = Nitrogen composition of the long bones in g per 100g of bone.

Oil = Oil contamination of the grave site (0 = No contamination, 1 = Oil contamination)

Source: D.R. Jarvis (1997). "Nitrogen Levels in Long Bones from Coffin Burials Interred for Periods of 26-90 Years," Forensic Science International, Vol. 85, pp. 199-208.

###1) Import the data 

Import the Longbones.csv file and print the first 5 rows.

In [1]:
#Import the dataset

import pandas as pd
import numpy as np


data_url = 'https://raw.githubusercontent.com/Chelsea-Myers/Lambda-Intro/master/Longbones.csv'

df = pd.read_csv(data_url, skipinitialspace=True, header=0)



In [2]:
### YOUR CODE HERE ###
df.head()

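|   | Site | Time | Depth | Lime | Age | Nitro | Oil |
|---|------|------|-------|------|-----|-------|-----|
| 0 | 1 | 88.5 | 7.0 | 1 | NaN | 3.88 | 1 |
| 1 | 1 | 88.5 | NaN | 1 | NaN | 4.0 | 1 |
| 2 | 1 | 85.2 | 7.0 | 1 | NaN | 3.69 | 1 |
| 3 | 1 | 71.8 | 7.6 | 1 | 65.0 | 3.88 | 0 |
| 4 | 1 | 70.6 | 7.5 | 1 | 42.0 | 3.53 | 0 |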


###2) Check for missing data.

Is there any missing data in the dataset?  If so, in what variable(s)?  

In [5]:
### YOUR CODE HERE ###
print(df.shape)
df.info()

(42, 7)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42 entries, 0 to 41
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Site    42 non-null     int64  
 1   Time    42 non-null     float64
 2   Depth   41 non-null     float64
 3   Lime    42 non-null     int64  
 4   Age     35 non-null     float64
 5   Nitro   42 non-null     float64
 6   Oil     42 non-null     int64  
dtypes: float64(4), int64(3)
memory usage: 2.4 KB


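Yes, there is missing data: `Depth` has 1 null value (41 of 42 non-null) and `Age` has 7 null values (35 of 42 non-null). All other variables are complete.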
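
A per-column count of missing values can be easier to read than the `df.info()` summary. The sketch below uses a small hypothetical frame (`toy`) rather than the Longbones data, so its values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "Depth": [7.0, np.nan, 7.6],
    "Age": [np.nan, np.nan, 65.0],
    "Nitro": [3.88, 4.00, 3.88],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Names of the columns that contain at least one missing value.
columns_with_missing = missing_per_column[missing_per_column > 0].index.tolist()
print(columns_with_missing)
```

The same two calls applied to `df` would list the affected variables directly.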

### 3) Remove any rows with missing data from the dataset.  If there is no missing data, write "No missing data" in the answer section below.

In [9]:
### YOUR CODE HERE ###
df.dropna(inplace=True)
df.shape

(35, 7)

The dataset did contain missing values, so the 7 affected rows were dropped, leaving 35 of the original 42 rows.

#Use the following information to answer questions 4) - 7) 

The mean nitrogen composition in living individuals is 4.3g per 100g of bone.  

We wish to use the Longbones sample to test the null hypothesis that the mean nitrogen composition per 100g of bone in the deceased is 4.3g (equal to that of living humans) vs the alternative hypothesis that the mean nitrogen composition per 100g of bone in the deceased is not 4.3g (not equal to that of living humans). 



###4) Using symbols and statistical language, write the null and alternative hypotheses outlined above.

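Let $\mu$ be the mean nitrogen composition (in g per 100g of bone) in the long bones of the deceased.

$H_0: \mu = 4.3$ (the mean nitrogen composition in the deceased equals that of living humans)

$H_a: \mu \neq 4.3$ (the mean nitrogen composition in the deceased differs from that of living humans)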

###5) What is the appropriate test for these hypotheses?  A t-test or a chi-square test?  Explain your answer in a sentence or two.

A one-sample t-test. Nitrogen composition is a quantitative, continuous variable, and we are comparing its sample mean to a single hypothesized value (4.3g). Chi-square tests are for associations between categorical variables.

###6) Use a built-in Python function to conduct the statistical test you identified in 5).  Report your p-value.  Write the conclusion to your hypothesis test at the alpha = 0.05 significance level.

In [10]:
### YOUR CODE HERE ###
from scipy import stats
# Call ttest_1samp from scipy.stats directly; the nested scipy.stats.stats namespace is deprecated.
stats.ttest_1samp(df['Nitro'], 4.3)

Ttest_1sampResult(statistic=-16.525765821830365, pvalue=8.097649978903554e-18)

The p-value is approximately $8.1 \times 10^{-18}$, far below the 0.05 significance level, so I reject the null hypothesis that the mean nitrogen composition of the deceased is 4.3g per 100g of bone. The sample provides strong evidence that the mean differs from that of living humans.
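
To make the test statistic less of a black box, here is a minimal sketch that computes the one-sample t statistic and two-sided p-value by hand and compares them with `scipy.stats.ttest_1samp`. The `sample` values are hypothetical and are not taken from the Longbones data.

```python
import numpy as np
from scipy import stats

# Illustrative sample (hypothetical values, not the Longbones data).
sample = np.array([3.9, 3.7, 4.0, 3.8, 3.6, 3.9, 3.8])
mu_0 = 4.3  # hypothesized mean under H0

n = sample.size
# Standard error of the mean, using the sample standard deviation (ddof=1).
standard_error = sample.std(ddof=1) / np.sqrt(n)
t_manual = (sample.mean() - mu_0) / standard_error

# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# The built-in function should agree with the manual calculation.
t_scipy, p_scipy = stats.ttest_1samp(sample, mu_0)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```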

###7) Create a 95% confidence interval for the mean nitrogen composition in the longbones of a deceased individual.  Interpret your confidence interval in a sentence or two.

In [18]:
### YOUR CODE HERE ###
from scipy.stats import t
nitro_mean = df['Nitro'].mean()
nitro_std = df['Nitro'].std()
nitro_n = df['Nitro'].count()
nitro_se = nitro_std / (nitro_n**(1/2))

In [19]:
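# Use n - 1 degrees of freedom for a one-sample t interval. The confidence level
# is passed positionally because newer SciPy renamed the `alpha` keyword to `confidence`.
np.round(t.interval(0.95, df=nitro_n - 1, loc=nitro_mean, scale=nitro_se), 3)

array([3.734, 3.858])

We are 95% confident that the mean nitrogen composition in the long bones of the deceased is between 3.73g and 3.86g per 100g of bone. The interval describes the population mean, not the nitrogen content of any individual bone sample. Because 4.3g lies outside the interval, this agrees with rejecting the null hypothesis in question 6.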
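
The interval can also be built step by step as mean ± critical value × standard error, which shows where each piece comes from. This is a sketch on hypothetical `sample` values (not the Longbones data), using n − 1 degrees of freedom, the usual choice for a one-sample t interval.

```python
import numpy as np
from scipy import stats

# Illustrative sample (hypothetical values, not the Longbones data).
sample = np.array([3.9, 3.7, 4.0, 3.8, 3.6, 3.9, 3.8])

n = sample.size
mean = sample.mean()
se = stats.sem(sample)  # sample std (ddof=1) divided by sqrt(n)

# Critical value that leaves 2.5% in each tail of the t distribution.
t_crit = stats.t.ppf(0.975, df=n - 1)
lower, upper = mean - t_crit * se, mean + t_crit * se

# The same interval from the built-in helper (confidence level passed positionally).
lower_builtin, upper_builtin = stats.t.interval(0.95, df=n - 1, loc=mean, scale=se)
print(lower, upper)
print(lower_builtin, upper_builtin)
```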



#Use the following information to answer questions 8) - 12) 


The researchers also want to learn more about burial practices in the parts of England where the two cemeteries in the study were located.  They wish to determine if burials with Quicklime are associated with the burial region.  

Their null hypothesis is that there is no association between cemetery site and burial with Quicklime.  The alternative hypothesis is that there is an association between cemetery site and burial with Quicklime.



###8) Calculate the joint distribution of burial with Quicklime by burial site.

In [23]:
### YOUR CODE HERE ###
joint_df = pd.crosstab(df['Lime'],df['Site'],margins=True)

###9) Calculate the conditional distribution of burial with Quicklime by burial site.

In [22]:
### YOUR CODE HERE ###
pd.crosstab(df['Lime'],df['Site'],normalize='index')

| Lime \ Site | 1 | 2 |
|---|---|---|
| 0 | 0.608696 | 0.391304 |
| 1 | 0.416667 | 0.583333 |
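
The difference between a joint distribution and the two possible conditional distributions is easier to see side by side. In the sketch below the counts are an assumption: they are reconstructed from the proportions above (14/23, 9/23, 5/12, 7/12) and the 35 rows left after dropping missing data, not read from the file.

```python
import pandas as pd

# Assumed counts, reconstructed from the conditional proportions (not read from the CSV).
counts = pd.DataFrame(
    [[14, 9], [5, 7]],
    index=pd.Index([0, 1], name="Lime"),
    columns=pd.Index([1, 2], name="Site"),
)

# Joint distribution: every cell divided by the grand total, so all cells sum to 1.
joint = counts / counts.to_numpy().sum()

# Conditional on Lime: each row sums to 1 (what normalize='index' produces).
given_lime = counts.div(counts.sum(axis=1), axis=0)

# Conditional on Site: each column sums to 1 (what normalize='columns' produces).
given_site = counts.div(counts.sum(axis=0), axis=1)

print(joint)
print(given_lime)
print(given_site)
```

Which conditional table to report depends on the question: `given_site` answers "within each site, what share of burials used Quicklime?", while `given_lime` answers "among burials with or without Quicklime, what share came from each site?".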


###10) What is the appropriate test for the hypotheses listed above?  A t-test or a chi-square test?  Explain your answer in a sentence or two.

A chi-square test of independence. Both variables (cemetery site and burial with Quicklime) are categorical, and the hypotheses ask whether the two are associated.

###11) Conduct your hypothesis test and report your conclusion at the alpha = 0.05 significance level.

In [24]:
### YOUR CODE HERE ###
from scipy.stats import chi2_contingency

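# Test the 2x2 table of counts only; the margin totals must not be passed to the test.
observed = pd.crosstab(df['Lime'], df['Site'])
g, p_val, dof, expectd = chi2_contingency(observed)
print(round(p_val, 2))

0.47

Our p-value is about 0.47, which is greater than our significance level of 0.05, so I fail to reject the null hypothesis that there is no association between cemetery site and burial with Quicklime. The data do not provide evidence of an association.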
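
The chi-square test compares the observed counts with the counts expected if site and Quicklime use were independent. The sketch below computes those expected counts by hand and checks them against `chi2_contingency`. The `observed` counts are an assumption, reconstructed from the conditional proportions in question 9 and the 35 remaining rows. Only the interior cells go into the test, so a 2x2 table has (2 − 1)(2 − 1) = 1 degree of freedom.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Assumed counts (rows: Lime 0/1, columns: Site 1/2), without margin totals.
observed = np.array([[14, 9], [5, 7]])

# Expected count per cell under independence: row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
expected_manual = row_totals @ col_totals / observed.sum()

chi2_stat, p_value, dof, expected = chi2_contingency(observed)
print(dof)
print(np.round(expected, 2))
```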

###12) Conduct your hypothesis test and report your conclusion at the alpha = 0.05 significance level.

In [None]:
### YOUR CODE HERE ###

Your answer here.

###13) In a few sentences, describe the difference between Bayesian and Frequentist statistics.

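Frequentist statistics treats probability as the long-run frequency of an outcome when the same scenario is repeated many, many times; the parameter being estimated is considered fixed but unknown, and conclusions are based on how the data would behave across those repetitions. Bayesian statistics treats probability as a degree of belief: you start with a prior belief, reassess it each time new data arrive, and use the updated (posterior) belief as the starting point for the next calculation.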
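
A small coin-flip sketch illustrates the contrast. The `flips` data and the uniform Beta(1, 1) prior are hypothetical choices made for the example, not something given in the challenge.

```python
from scipy import stats

flips = [1, 0, 1, 1, 0, 1, 1, 1]  # hypothetical coin flips, 1 = heads

# Frequentist point estimate: the sample proportion of heads, standing in for
# the long-run frequency over many repetitions.
freq_estimate = sum(flips) / len(flips)

# Bayesian: start from a uniform Beta(1, 1) prior and update the belief after each flip.
a, b = 1, 1
for flip in flips:
    a += flip        # one more observed head
    b += 1 - flip    # one more observed tail

posterior = stats.beta(a, b)
posterior_mean = posterior.mean()
credible_interval = posterior.interval(0.95)

print(freq_estimate)
print(posterior_mean, credible_interval)
```

The frequentist estimate uses only the observed data, while the Bayesian posterior mean is pulled slightly toward the prior; with more flips the two estimates move closer together.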