### Statistics in Python

#### Data representation and interaction

##### Data as a table

The setting that we consider for statistical analysis is that of multiple observations or samples described
by a set of different attributes or features. The data can then be seen as a 2D table, or matrix, with
columns giving the different attributes of the data, and rows the observations. The file
data/brain_size.csv, used below, contains data of this kind.

##### The pandas data-frame

Creating dataframes: reading data files or converting arrays

Reading from a CSV file: The CSV file above gives observations of brain size, weight, and IQ
(Willerman et al. 1991). The data are a mixture of numerical and categorical values:

In [2]:
import pandas

# Fields are separated by ';' and missing values are written as '.'
dados = pandas.read_csv('data/brain_size.csv',
                        sep=';',
                        na_values=".")

dados

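|   | Unnamed: 0 | Gender | FSIQ | VIQ | PIQ | Weight | Height | MRI_Count |
|---|------------|--------|------|-----|-----|--------|--------|-----------|
| 0 | 1 | Female | 133 | 132 | 124 | 118.0 | 64.5 | 816932 |
| 1 | 2 | Male | 140 | 150 | 124 | NaN | 72.5 | 1001121 |
| 2 | 3 | Male | 139 | 123 | 150 | 143.0 | 73.3 | 1038437 |
| 3 | 4 | Male | 133 | 129 | 128 | 172.0 | 68.8 | 965353 |
| 4 | 5 | Female | 137 | 132 | 134 | 147.0 | 65.0 | 951545 |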
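
To see what `sep` and `na_values` do without needing the file on disk, here is a minimal sketch that parses a small inline sample. The sample text and the names `csv_text` and `sample` are illustrative assumptions modelled on the rows shown above. They are not the actual contents of data/brain_size.csv.

```python
import io

import pandas

# Hypothetical inline sample: fields separated by ';' and a missing
# value written as '.', as in the read_csv call above
csv_text = (
    "Gender;FSIQ;Weight\n"
    "Female;133;118\n"
    "Male;140;.\n"
    "Male;139;143\n"
)

# sep=';' splits fields on semicolons; na_values='.' turns '.' into NaN
sample = pandas.read_csv(io.StringIO(csv_text), sep=';', na_values='.')

print(sample.dtypes)                  # Weight is float64 because NaN is a float
print(sample['Weight'].isna().sum())  # number of missing weights
```

Without `na_values='.'`, the `Weight` column would likely be read as text, which would prevent numerical operations such as taking its mean.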


Creating from arrays: A pandas.DataFrame can also be seen as a dictionary of 1D ‘series’, e.g. arrays
or lists. If we have 3 numpy arrays:

In [3]:
import numpy as np

t = np.linspace(-6, 6, 20)
sin_t = np.sin(t)
cos_t = np.cos(t)

We can expose them as a pandas.DataFrame:

In [4]:
pandas.DataFrame({'t': t, 
                  'sin': sin_t,
                  'cos': cos_t})

|   | t | sin | cos |
|---|---|-----|-----|
| 0 | -6.0 | 0.279415 | 0.96017 |
| 1 | -5.368421 | 0.792419 | 0.609977 |
| 2 | -4.736842 | 0.999701 | 0.024451 |
| 3 | -4.105263 | 0.821291 | -0.570509 |
| 4 | -3.473684 | 0.326021 | -0.945363 |
| 5 | -2.842105 | -0.29503 | -0.955488 |
| 6 | -2.210526 | -0.802257 | -0.596979 |
| 7 | -1.578947 | -0.999967 | -0.008151 |
| 8 | -0.947368 | -0.811882 | 0.583822 |
| 9 | -0.315789 | -0.310567 | 0.950551 |
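
The dictionary-of-series view can be checked directly. The following is a minimal sketch that rebuilds the same frame and inspects it; the variable name `frame` is introduced here only for illustration.

```python
import numpy as np
import pandas

t = np.linspace(-6, 6, 20)
frame = pandas.DataFrame({'t': t, 'sin': np.sin(t), 'cos': np.cos(t)})

print(frame.shape)         # (20, 3): one row per sample, one column per array
print(type(frame['sin']))  # each column is a pandas.Series
print(frame.head(3))       # first three rows
```

The dictionary keys become the column names, and every array must have the same length so that each row holds one value per column.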


Manipulating data

dados is a pandas.DataFrame, which resembles R’s data frame:

In [5]:
dados.shape # 5 rows and 8 columns

(5, 8)

In [6]:
dados.columns # It has columns

Index(['Unnamed: 0', 'Gender', 'FSIQ', 'VIQ', 'PIQ', 'Weight', 'Height',
       'MRI_Count'],
      dtype='object')

In [7]:

print(dados['Gender']) # Columns can be addressed by name

0    Female
1      Male
2      Male
3      Male
4    Female
Name: Gender, dtype: object


In [8]:
# Simpler selector: keep the rows where Gender is 'Female', then average their VIQ
dados[dados['Gender'] == 'Female']['VIQ'].mean()

132.0
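
The selector above works in two steps: a comparison builds a boolean mask, and the mask then picks rows. Here is a minimal sketch of those steps on a small stand-in frame. The frame `df` is a hypothetical copy of the Gender and VIQ values shown above, and using `.loc` is one possible way to write the selection, not the only one.

```python
import pandas

# Small hypothetical frame holding the Gender and VIQ values shown above
df = pandas.DataFrame({
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female'],
    'VIQ': [132, 150, 123, 129, 132],
})

is_female = df['Gender'] == 'Female'   # boolean Series, one entry per row
female_viq = df.loc[is_female, 'VIQ']  # rows where the mask is True, column VIQ

print(is_female.tolist())  # [True, False, False, False, True]
print(female_viq.mean())   # 132.0
```

Selecting rows and column in a single `.loc` call avoids chaining two indexing operations, which matters mostly when assigning values rather than reading them.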

groupby: splitting a dataframe on values of categorical variables:

In [9]:
groupby_gender = dados.groupby('Gender')
for gender, value in groupby_gender['VIQ']:
    print((gender, value.mean()))

('Female', 132.0)
('Male', 134.0)
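
The explicit loop is useful for seeing what each group contains, but the same means can be computed by aggregating the groupby object directly. A minimal sketch, again on a hypothetical stand-in frame `df` with the Gender and VIQ values shown above:

```python
import pandas

# Small hypothetical frame holding the Gender and VIQ values shown above
df = pandas.DataFrame({
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female'],
    'VIQ': [132, 150, 123, 129, 132],
})

# Split by Gender, select VIQ, then reduce each group to its mean
viq_means = df.groupby('Gender')['VIQ'].mean()
print(viq_means)

# Several summaries at once with agg
summary = df.groupby('Gender')['VIQ'].agg(['mean', 'count'])
print(summary)
```

The result of `mean()` is a Series indexed by the group labels, so `viq_means['Male']` gives the mean VIQ for that group.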
