In [1]:
%matplotlib inline
import pandas as pd

## Outline

We have two purposes today. First, we'll show you how easy it is to calculate various statistics for your datasets. With `pandas` you wield a great deal of power to manipulate variables and instances into whatever form and shape you wish. 

Then we'll show you how easy it is to mislead with numbers. Not that we suspect you would want to do that on purpose. Unfortunately, the easiest person to fool is yourself.

While making technically correct calculations is easier than ever with `pandas`, the hard part is ensuring that the quantities you choose to highlight are actually meaningful and appropriate for the data.

### Making comparisons

Arguably, the aim of modern science is to make comparisons. Scientists are, for instance, interested in

- which of two competing drugs works better;
- whether a drug is more effective than a placebo;
- whether a food supplement has any of the claimed health benefits; or
- whether their newly developed sentiment analysis algorithm is actually better than a baseline.

### Grouping

### Merging operations

In [74]:
prize_winners_age = pd.read_csv("nobel_prize_winners_age.csv", sep=";")
# Names in this file are formatted "Last, First", so the last name is the part before the comma
prize_winners_age['last_name'] = [name.strip().split(", ")[0] for name in prize_winners_age['name']]
prize_winners_age.head()

Unnamed: 0,name,field,year_birth,year_prize,year_research_mid,year_death,TheoryOrTheoryAndEmpirical,age_highdegree,last_name
0,"Van'T Hoff, Jacobus Henricus",Chemistry,1852,1901,1885,1911,1,22,Van'T Hoff
1,"Fischer, Hermann Emil",Chemistry,1852,1902,1895,1919,0,22,Fischer
2,"Arrhenius, Svante August",Chemistry,1859,1903,1884,1927,1,25,Arrhenius
3,"Ramsay, Sir William",Chemistry,1852,1904,1894,1916,0,20,Ramsay
4,"Von Baeyer, Johann",Chemistry,1835,1905,1873,1917,0,23,Von Baeyer


In [75]:
prize_winners_gender = pd.read_csv("nobel_prize_winners.tsv", sep="\t")
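# Names in this file are formatted "First ... Last", so the last name is the final space-separated token
prize_winners_gender['last_name'] = [name.strip().split(" ")[-1] for name in prize_winners_gender['Name']]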
prize_winners_gender.head()

Unnamed: 0,Year,Domain,Name,Nationality,Gender,last_name
0,2009,Physics,Charles K. Kao,China,Male,Kao
1,2009,Physics,Willard S. Boyle,USA,Male,Boyle
2,2009,Physics,George E. Smith,USA,Male,Smith
3,2009,Chemistry,Venkatraman Ramakrishnan,UK,Male,Ramakrishnan
4,2009,Chemistry,Thomas A. Steitz,USA,Male,Steitz


In [76]:
# Inner join: only rows that match on year, field, and last name in both tables are kept
prize_winners = prize_winners_age.merge(right=prize_winners_gender, how='inner',
                                        left_on=['year_prize', 'field', 'last_name'],
                                        right_on=['Year', 'Domain', 'last_name'])
prize_winners.head()

Unnamed: 0,name,field,year_birth,year_prize,year_research_mid,year_death,TheoryOrTheoryAndEmpirical,age_highdegree,last_name,Year,Domain,Name,Nationality,Gender
0,"Fischer, Hermann Emil",Chemistry,1852,1902,1895,1919,0,22,Fischer,1902,Chemistry,Emil Fischer,Germany,Male
1,"Arrhenius, Svante August",Chemistry,1859,1903,1884,1927,1,25,Arrhenius,1903,Chemistry,Svante Arrhenius,Sweden,Male
2,"Ramsay, Sir William",Chemistry,1852,1904,1894,1916,0,20,Ramsay,1904,Chemistry,Sir William Ramsay,UK,Male
3,"Moissan, Henri",Chemistry,1852,1906,1898,1907,0,28,Moissan,1906,Chemistry,Henri Moissan,France,Male
4,"Buchner, Eduard",Chemistry,1860,1907,1897,1917,0,28,Buchner,1907,Chemistry,Eduard Buchner,Germany,Male
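
An inner join drops rows that fail to match on any key, without any warning. In the output above, the 1901 Chemistry row for Van'T Hoff, which was the first row of `prize_winners_age`, is no longer at the top of the merged table. One way to audit such losses is an outer join with `indicator=True`. The sketch below uses two tiny stand-in frames; `left`, `right`, and the spelling `"Hoff"` on the right-hand side are illustrative assumptions, not values read from the real files.

```python
import pandas as pd

# Tiny stand-in frames; the values mimic the two tables above but are
# illustrative assumptions, not a copy of the real files.
left = pd.DataFrame({
    "year_prize": [1901, 1902],
    "field": ["Chemistry", "Chemistry"],
    "last_name": ["Van'T Hoff", "Fischer"],
})
right = pd.DataFrame({
    "Year": [1901, 1902],
    "Domain": ["Chemistry", "Chemistry"],
    "last_name": ["Hoff", "Fischer"],  # assumed result of splitting on spaces
})

# An outer join keeps every row; indicator=True adds a `_merge` column
# saying whether a row came from the left frame, the right frame, or both.
audit = left.merge(
    right,
    how="outer",
    left_on=["year_prize", "field", "last_name"],
    right_on=["Year", "Domain", "last_name"],
    indicator=True,
)

# Rows that an inner join would have dropped.
unmatched = audit[audit["_merge"] != "both"]
print(unmatched[["year_prize", "Year", "last_name", "_merge"]])
```

Inspecting `unmatched` before switching back to `how='inner'` shows whether the last-name heuristic loses more rows than expected.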


In [77]:
prize_winners = prize_winners.drop(['name', 'Domain', 'Year', 'last_name', 'TheoryOrTheoryAndEmpirical'], axis=1)
prize_winners.columns = [name.lower() for name in prize_winners.columns]
prize_winners.head()

Unnamed: 0,field,year_birth,year_prize,year_research_mid,year_death,age_highdegree,name,nationality,gender
0,Chemistry,1852,1902,1895,1919,22,Emil Fischer,Germany,Male
1,Chemistry,1859,1903,1884,1927,25,Svante Arrhenius,Sweden,Male
2,Chemistry,1852,1904,1894,1916,20,Sir William Ramsay,UK,Male
3,Chemistry,1852,1906,1898,1907,28,Henri Moissan,France,Male
4,Chemistry,1860,1907,1897,1917,28,Eduard Buchner,Germany,Male


In [78]:
prize_winners['age_death'] = prize_winners['year_death'] - prize_winners['year_birth']
prize_winners['age_prize'] = prize_winners['year_prize'] - prize_winners['year_birth']
prize_winners['age_research_mid'] = prize_winners['year_research_mid'] - prize_winners['year_birth']
prize_winners['degree_to_prize'] = prize_winners['age_prize'] - prize_winners['age_highdegree']
prize_winners['research_mid_to_prize'] = prize_winners['age_prize'] - prize_winners['age_research_mid']
prize_winners

Unnamed: 0,field,year_birth,year_prize,year_research_mid,year_death,age_highdegree,name,nationality,gender,age_death,age_prize,age_research_mid,degree_to_prize,research_mid_to_prize
0,Chemistry,1852,1902,1895,1919,22,Emil Fischer,Germany,Male,67,50,43,28,7
1,Chemistry,1859,1903,1884,1927,25,Svante Arrhenius,Sweden,Male,68,44,25,19,19
2,Chemistry,1852,1904,1894,1916,20,Sir William Ramsay,UK,Male,64,52,42,32,10
3,Chemistry,1852,1906,1898,1907,28,Henri Moissan,France,Male,55,54,46,26,8
4,Chemistry,1860,1907,1897,1917,28,Eduard Buchner,Germany,Male,57,47,37,19,10
5,Chemistry,1871,1908,1902,1937,23,Ernest Rutherford,"UK, New Zealand",Male,66,37,31,14,6
6,Chemistry,1853,1909,1894,1932,25,Wilhelm Ostwald,Germany,Male,79,56,41,31,15
7,Chemistry,1847,1910,1884,1931,22,Otto Wallach,Germany,Male,84,63,37,41,26
8,Chemistry,1867,1911,1910,1934,36,Marie Curie,France,Female,67,44,43,8,1
9,Chemistry,1871,1912,1900,1935,30,Victor Grignard,France,Male,64,41,29,11,12


- Fundamentals of grouping 
  - group operations
  - hierarchical indexing
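
As a rough sketch of these two ideas, the block below groups a handful of rows copied from the `prize_winners` output above. The frame name `winners` and the choice of columns are assumptions made for illustration. Grouping by one key gives one value per group; grouping by two keys gives a result with a hierarchical index (a `MultiIndex`).

```python
import pandas as pd

# A few rows copied from the prize_winners output above.
winners = pd.DataFrame({
    "name": ["Emil Fischer", "Svante Arrhenius", "Henri Moissan",
             "Marie Curie", "Victor Grignard"],
    "nationality": ["Germany", "Sweden", "France", "France", "France"],
    "gender": ["Male", "Male", "Male", "Female", "Male"],
    "age_prize": [50, 44, 54, 44, 41],
})

# Single-key grouping: one summary value per group.
mean_by_gender = winners.groupby("gender")["age_prize"].mean()

# Two-key grouping: the result is indexed by a MultiIndex
# with the levels (nationality, gender).
by_country_gender = (
    winners.groupby(["nationality", "gender"])["age_prize"]
    .agg(["count", "mean"])
)

# Select a single entry with a tuple, or a whole outer level with one label.
france_male_mean = by_country_gender.loc[("France", "Male"), "mean"]
france_rows = by_country_gender.loc["France"]

print(mean_by_gender)
print(by_country_gender)
```

With only five rows the group means say little about laureates in general; the point is the shape of the result, not the numbers.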
  
  
  
### Stats

- Mean (and why it's misleading)
- Median (and why it's misleading) 
- Quantile
- Histogram

- Standard deviation
- Correlation
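
To make the first bullets concrete, here is a minimal sketch of how a single extreme value affects the mean but barely moves the median. The numbers in `values` are made up for illustration and do not come from any of the datasets in this notebook.

```python
import pandas as pd

# Made-up values for illustration: five similar numbers and one extreme one.
values = pd.Series([30, 32, 35, 38, 40, 1000])

mean = values.mean()        # pulled far upward by the single extreme value
median = values.median()    # midpoint of the sorted values
q1, q3 = values.quantile([0.25, 0.75])  # first and third quartiles
std = values.std()          # sample standard deviation (ddof=1)

print(f"mean={mean:.1f}, median={median}, quartiles=({q1}, {q3}), std={std:.1f}")

# Five of the six values lie below the mean, so the mean describes
# almost none of the observations well.
n_below_mean = int((values < mean).sum())
print(n_below_mean, "of", len(values), "values are below the mean")
```

A histogram (`values.hist()`) would show the same thing visually: one isolated bar far to the right of all the others.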

### Statistical tests

- What is the issue with just comparing the numbers?
- What a statistical significance test isn't
- p and $\alpha$
- Power calculation (optional)
- Significance testing by simulation
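
One common form of significance testing by simulation is a permutation test, sketched below under some assumptions: the function name `permutation_test`, the two-sided difference-in-means statistic, and the scores in `baseline` and `new_system` are all hypothetical choices made for illustration. The idea is to shuffle the group labels many times and count how often a difference at least as large as the observed one arises by chance.

```python
import numpy as np

def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])

    count = 0
    for _ in range(n_permutations):
        # Shuffling the pooled values is equivalent to reassigning group labels.
        shuffled = rng.permutation(pooled)
        diff = shuffled[:len(a)].mean() - shuffled[len(a):].mean()
        if abs(diff) >= abs(observed):
            count += 1

    # Add one to numerator and denominator so the estimate is never exactly 0.
    p_value = (count + 1) / (n_permutations + 1)
    return observed, p_value

# Made-up accuracy scores for a baseline and a new system.
baseline = [0.71, 0.69, 0.73, 0.70, 0.72, 0.68]
new_system = [0.74, 0.76, 0.73, 0.77, 0.75, 0.78]

observed, p_value = permutation_test(new_system, baseline)
print(f"observed difference = {observed:.3f}, p = {p_value:.4f}")
```

The resulting `p_value` is then compared against a significance level $\alpha$ chosen before looking at the data.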


In [2]:
police = pd.read_csv("police_killings.csv")

In [47]:
police.cause.value_counts()

Gunshot              411
Taser                 27
Death in custody      14
Struck by vehicle     12
Unknown                3
dtype: int64

In [41]:
police.groupby('raceethnicity')['name'].count()

raceethnicity
Asian/Pacific Islander     10
Black                     135
Hispanic/Latino            67
Native American             4
Unknown                    15
White                     236
Name: name, dtype: int64
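
The counts above can also be read as shares of all recorded cases. The sketch below rebuilds the counts by hand from the printed output, so `counts` and `shares` are illustrative names rather than variables from the notebook; on the full frame, `police['raceethnicity'].value_counts(normalize=True)` should give comparable shares.

```python
import pandas as pd

# Counts copied from the groupby output above.
counts = pd.Series(
    {
        "Asian/Pacific Islander": 10,
        "Black": 135,
        "Hispanic/Latino": 67,
        "Native American": 4,
        "Unknown": 15,
        "White": 236,
    },
    name="name",
)

# Share of each group among all recorded cases, rounded for display.
shares = (counts / counts.sum()).round(3)
print(shares.sort_values(ascending=False))
```

These shares describe only the composition of this dataset; they are not rates relative to the size of each group in the population.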

In [7]:
france = pd.read_csv("CausesOfDeath_France_2001-2008.csv")

In [8]:
france

Unnamed: 0,TIME,GEO,UNIT,SEX,AGE,ICD10,Value,Flag and Footnotes
0,2001,France,Number,Males,Total,All causes of death (A00-Y89) excluding S00-T98,277 858,
1,2001,France,Number,Males,Total,Certain infectious and parasitic diseases (A00...,5 347,
2,2001,France,Number,Males,Total,Tuberculosis,545,
3,2001,France,Number,Males,Total,Meningococcal infection,30,
4,2001,France,Number,Males,Total,Viral hepatitis,471,
5,2001,France,Number,Males,Total,Human immunodeficiency virus [HIV] disease,892,
6,2001,France,Number,Males,Total,Neoplasms,91 737,
7,2001,France,Number,Males,Total,Malignant neoplasms (C00-C97),88 481,
8,2001,France,Number,Males,Total,"Malignant neoplasm of lip, oral cavity, pharynx",3 755,
9,2001,France,Number,Males,Total,Malignant neoplasm of oesophagus,3 442,
