In [2]:
require 'daru'

false

# Daru
Daru provides data structures that are widely used in data science, with functionality for manipulating, subsetting, combining, and aggregating data. Its two primary structures are vectors and data frames.

## Vectors
Conceptually, a vector is very similar to a Ruby `Array`. However, daru's vectors are typed, meaning that each element of a vector must be of the same logical type (Integer, String, Boolean, etc.).

In addition to many of the methods you would expect on an `Array`, vectors add statistical methods, aggregation, plotting, and more.

### Making a new vector

In [3]:
# Build a vector of 100 random integers in the range 0...100
numbers = Daru::Vector.new(Array.new(100) { rand(100) })

0,1
0,57
1,59
2,98
3,82
4,51
5,69
6,68
7,53
8,97
9,37


### Aggregation
Vectors support common aggregations: sum, count, average, and standard deviation.

In [15]:
numbers.std

27.60247939383106
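
To make the four aggregations concrete, here is a minimal plain-Ruby sketch of what each one computes on a small fixed array. It does not call daru: the helpers `mean` and `sample_std` are illustrative names, and the sketch assumes the sample (n - 1) form of the standard deviation, so check daru's documentation for the variant that `numbers.std` uses.

```ruby
# Plain-Ruby sketch of sum, count, average, and standard deviation.
# `mean` and `sample_std` are illustrative helpers, not daru's API.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Arithmetic mean: the sum divided by the count.
def mean(xs)
  xs.sum.to_f / xs.size
end

# Sample standard deviation (divides by n - 1); this variant is an assumption.
def sample_std(xs)
  m = mean(xs)
  squared_diffs = xs.sum { |x| (x - m)**2 }
  Math.sqrt(squared_diffs / (xs.size - 1))
end

total   = values.sum
count   = values.size
average = mean(values)
spread  = sample_std(values)

puts "sum=#{total} count=#{count} mean=#{average} std=#{spread.round(3)}"
```

Using a fixed array rather than random numbers keeps the results reproducible, which makes it easier to check each aggregation by hand.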

## Data frames
Data frames are tabular data structures for storing observational data. Conceptually, a data frame is a collection of vectors that all have the same length but need not share the same type.
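
The "collection of equal-length vectors" idea can be sketched in plain Ruby by turning row hashes into a hash of columns. This is only a conceptual model, and `rows_to_columns` is a hypothetical helper: it is not part of daru and says nothing about how daru stores data internally.

```ruby
# Conceptual model only: a data frame as a hash of equal-length columns.
# `rows_to_columns` is a hypothetical helper, not part of daru.
def rows_to_columns(rows)
  keys = rows.first.keys
  keys.to_h { |key| [key, rows.map { |row| row[key] }] }
end

rows = [
  { age: 32, gender: 'f' },
  { age: 21, gender: 'm' },
  { age: 25, gender: 'f' }
]

columns = rows_to_columns(rows)

# Every column has the same length, but columns may hold different types.
lengths = columns.values.map(&:size).uniq

puts columns.inspect
puts "distinct column lengths: #{lengths.inspect}"
```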

In [20]:
data = [
    { age: 32, gender: 'f' },
    { age: 21, gender: 'm' },
    { age: 25, gender: 'f' },
    { age: 46, gender: 'f' },
    { age: 54, gender: 'm' },
    { age: 53, gender: 'm' },
]

[{:age=>32, :gender=>"f"}, {:age=>21, :gender=>"m"}, {:age=>25, :gender=>"f"}, {:age=>46, :gender=>"f"}, {:age=>54, :gender=>"m"}, {:age=>53, :gender=>"m"}]

In [21]:
dataframe = Daru::DataFrame.new(data)

Unnamed: 0,age,gender
0,32,f
1,21,m
2,25,f
3,46,f
4,54,m
5,53,m
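
As a rough illustration of the kind of aggregation a data frame makes convenient, the sketch below computes the mean age per gender for the same six rows in plain Ruby. It deliberately avoids daru calls; `mean_by` is a hypothetical helper introduced only for this example.

```ruby
# Plain-Ruby sketch of a grouped aggregation.
# `mean_by` is a hypothetical helper, not daru's API.
def mean_by(rows, group_key, value_key)
  rows.group_by { |row| row[group_key] }
      .transform_values { |group| group.sum { |row| row[value_key] }.to_f / group.size }
end

rows = [
  { age: 32, gender: 'f' },
  { age: 21, gender: 'm' },
  { age: 25, gender: 'f' },
  { age: 46, gender: 'f' },
  { age: 54, gender: 'm' },
  { age: 53, gender: 'm' }
]

# Maps each gender to the mean age of its group.
mean_age_by_gender = mean_by(rows, :gender, :age)
puts mean_age_by_gender.inspect
```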


### NHANES
NHANES (the National Health and Nutrition Examination Survey) is a data set collected and released by the CDC. It contains data about the health and nutrition of people across the United States.

Let's explore NHANES a bit.

In [32]:
nhanes_url = 'https://raw.githubusercontent.com/ProjectMOSAIC/NHANES/master/data-raw/NHANES.csv'
nhanes = Daru::DataFrame.from_csv(nhanes_url); nil

p "Loaded #{nhanes.size} rows"

"Loaded 10000 rows"

In [30]:
nhanes

Unnamed: 0,Year,LocationAbbr,LocationDesc,DataSource,PriorityArea1,PriorityArea2,PriorityArea3,PriorityArea4,Category,Topic,Indicator,Data_Value_Type,Data_Value_Unit,Data_Value,Data_Value_Alt,Data_Value_Footnote_Symbol,Data_Value_Footnote,Confidence_limit_Low,Confidence_limit_High,Break_Out_Category,Break_Out,CategoryId,TopicId,IndicatorID,Data_Value_TypeID,BreakOutCategoryId,BreakOutId,LocationID,GeoLocation
0,1999-2000,US,United States,NHANES,,,,,Cardiovascular Diseases,Major Cardiovascular Disease,Prevalence of major cardiovascular disease among US adults (20+); NHANES,Age-Standardized,Percent (%),7.0,7.0,,,6.0,8.1,Overall,Overall,C1,T1,NH001,AgeStdz,BOC01,OVR01,59,
1,1999-2000,US,United States,NHANES,,,,,Cardiovascular Diseases,Major Cardiovascular Disease,Prevalence of major cardiovascular disease among US adults (20+); NHANES,Crude,Percent (%),6.4,6.4,,,5.6,7.4,Overall,Overall,C1,T1,NH001,Crude,BOC01,OVR01,59,
2,1999-2000,US,United States,NHANES,,,,,Cardiovascular Diseases,Major Cardiovascular Disease,Prevalence of major cardiovascular disease among US adults (20+); NHANES,Crude,Percent (%),8.0,8.0,,,6.6,9.7,Gender,Male,C1,T1,NH001,Crude,BOC02,GEN01,59,
3,1999-2000,US,United States,NHANES,,,,,Cardiovascular Diseases,Major Cardiovascular Disease,Prevalence of major cardiovascular disease among US adults (20+); NHANES,Age-Standardized,Percent (%),9.1,9.1,,,7.5,11.0,Gender,Male,C1,T1,NH001,AgeStdz,BOC02,GEN01,59,
4,1999-2000,US,United States,NHANES,,,,,Cardiovascular Diseases,Major Cardiovascular Disease,Prevalence of major cardiovascular disease among US adults (20+); NHANES,Crude,Percent (%),5.0,5.0,,,4.1,6.1,Gender,Female,C1,T1,NH001,Crude,BOC02,GEN02,59,
5,1999-2000,US,United States,NHANES,,,,,Cardiovascular Diseases,Major Cardiovascular Disease,Prevalence of major cardiovascular disease among US adults (20+); NHANES,Age-Standardized,Percent (%),5.2,5.2,,,4.2,6.3,Gender,Female,C1,T1,NH001,AgeStdz,BOC02,GEN02,59,
6,1999-2000,US,United States,NHANES,,,,,Cardiovascular Diseases,Major Cardiovascular Disease,Prevalence of major cardiovascular disease among US adults (20+); NHANES,Crude,Percent (%),,-2.0,~,Statistically unstable estimates not presented [unstable by NCHS standards: (standard error/estimate>0.30)],,,Age,20-24,C1,T1,NH001,Crude,BOC03,AGE02,59,
7,1999-2000,US,United States,NHANES,,,,,Cardiovascular Diseases,Major Cardiovascular Disease,Prevalence of major cardiovascular disease among US adults (20+); NHANES,Crude,Percent (%),1.1,1.1,,,0.6,1.9,Age,25-44,C1,T1,NH001,Crude,BOC03,AGE04,59,
8,1999-2000,US,United States,NHANES,,,,,Cardiovascular Diseases,Major Cardiovascular Disease,Prevalence of major cardiovascular disease among US adults (20+); NHANES,Crude,Percent (%),8.6,8.6,,,7.1,10.3,Age,45-64,C1,T1,NH001,Crude,BOC03,AGE05,59,
9,1999-2000,US,United States,NHANES,,,,,Cardiovascular Diseases,Major Cardiovascular Disease,Prevalence of major cardiovascular disease among US adults (20+); NHANES,Crude,Percent (%),21.4,21.4,,,18.5,24.7,Age,65+,C1,T1,NH001,Crude,BOC03,AGE06,59,
