# Exploratory Data Analysis with Deedle

This notebook explores the exploratory data analysis (EDA) capabilities of Deedle in **.NET Interactive Notebooks**. The tasks are mainly inspired by the R community's [ModernDive book](https://moderndive.com/index.html).

## Load NuGet Dependencies

The main dependency for this task is [Deedle](https://fslab.org/Deedle/), an open-source .NET implementation of the data frame concept known from the R programming language and `pandas`.
In .NET notebooks you can load external dependencies directly from NuGet.

In [1]:
#!fsharp
#r "nuget: Deedle"

Installed package Deedle version 2.3.0
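
For a reproducible notebook it can help to pin the package version instead of taking whatever is latest. A minimal sketch, assuming the 2.3.0 release reported in the output above is the one you want to stay on:

```fsharp
// Pin Deedle to a specific version (2.3.0 is the version reported above)
#r "nuget: Deedle, 2.3.0"
```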

## Deedle Formatter

To inspect the contents of data frames and series (roughly: the columns of a data frame), we need to render them properly. The following code combines [a similar
formatter for Microsoft's DataFrame](https://github.com/dotnet/interactive/blob/main/samples/notebooks/fsharp/Samples/HousingML.ipynb) with [IfSharp's Deedle formatter](https://github.com/mndrake/IfSharpLab/blob/master/src/DeedleFormat.fs).
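
To make the relationship between series and frames concrete, here is a small sketch. The keys and values are made up for illustration and are not read from any dataset; `series`, `frame` and `=>` are Deedle's F# helpers.

```fsharp
#r "nuget: Deedle"
open Deedle

// Two hypothetical series that share the same row keys
let delays = series [ "a" => 2.0; "b" => 4.0; "c" => -1.0 ]
let distances = series [ "a" => 1400.0; "b" => 1416.0; "c" => 1576.0 ]

// A frame is a set of named series aligned on their keys
let sample = frame [ "dep_delay" => delays; "distance" => distances ]

// Pulling a column back out of the frame yields a series again
let delayColumn : Series<string, float> = sample.GetColumn<float> "dep_delay"
```

This is why the formatter below needs two branches: one for a single series and one for a whole frame.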


**TODO**
- Externalize this formatter using the [Dotnet Interactive Extensions Mechanism](https://github.com/dotnet/interactive/blob/main/docs/extending-dotnet-interactive.md)

In [1]:
#!fsharp
module FrameFormatter =

    open Deedle
    open Deedle.Internal
    open Html

    let maxRows = 25

    let (|SeriesValues|_|) (value : obj) = 
        let iser = value.GetType().GetInterface("ISeries`1")
        if iser <> null then 
            let keys = 
                value.GetType().GetProperty("Keys").GetValue(value) :?> System.Collections.IEnumerable
            let vector = value.GetType().GetProperty("Vector").GetValue(value) :?> IVector
            Some(Seq.zip (Seq.cast<obj> keys) vector.ObjectSequence)
        else None

    // TODO: make this configurable for floats and such
    let formatValue (def: string) = function
        | Some v -> v.ToString()
        | None -> def

    Formatter.Register<IFsiFormattable>(Func<FormatContext, IFsiFormattable, TextWriter, bool>(fun (context: FormatContext) (formattable: IFsiFormattable) (writer: TextWriter) ->
        if context.ContentThreshold < 1.0 then false else

        context.ReduceContent(0.2)
        |> ignore

        let html = 
            match formattable with
            | SeriesValues s ->
                table [] [
                    thead [] [
                        tr [] [
                            th [] [ str "Keys" ]
                            yield! s
                            |> Seq.map (fun kvp -> th [] [ str (fst kvp |> string) ])
                        ]
                    ]
                    tbody [] [
                        tr [] [
                            td [] [ str "Values" ]
                            yield! s
                            |> Seq.map (fun kvp -> td [] [ str (snd kvp |> string) ])
                        ]
                    ]
                ]
                |> Some
            | :? IFrame as df ->
                {
                    new IFrameOperation<_> with
                        member x.Invoke(df: Frame<_, _>) =
                            table [] [
                                thead [] [
                                    th [] [ str "Index" ]
                                    yield! df.ColumnKeys
                                    |> Seq.map (fun ck -> th [] [ str (ck.ToString()) ])
                                ]
                                tbody [] [
                                    yield! df
                                    |> Frame.take (min maxRows df.RowCount)
                                    |> Frame.rows
                                    |> Series.observationsAll
                                    |> Seq.map (fun item ->
                                        let def, k, data =
                                            match item with
                                            | k, Some d -> "N/A", k.ToString(), Series.observationsAll d |> Seq.map snd
                                            | k, _ -> "N/A", k.ToString(), df.ColumnKeys |> Seq.map (fun _ -> None)
                                        let row =
                                            data
                                            |> Seq.map (formatValue def)
                                            |> Seq.map (fun v ->
                                                td [] [ embed context v ])
                                        tr [] [
                                            td [] [ embed context k ]
                                            yield! row
                                        ])
                                ]
                            ]
                            |> Some
                }
                |> df.Apply
            | _ -> None

        match html with
        | Some v -> writer.Write v
        | None -> writer.Write ""

        true
    ), mimeType = "text/html")
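
The `TODO` on `formatValue` asks for configurable float formatting. One possible shape is sketched below. `formatValueWith` and its `floatFormat` parameter are hypothetical names, not part of the formatter above or of Deedle, and the sketch assumes cell values arrive as `obj option`.

```fsharp
open System.Globalization

// Hypothetical variant of formatValue: floats are rendered with a
// caller-supplied .NET format string, everything else falls back to ToString()
let formatValueWith (floatFormat: string) (def: string) (value: obj option) =
    match value with
    | Some (:? float as f) -> f.ToString(floatFormat, CultureInfo.InvariantCulture)
    | Some v -> v.ToString()
    | None -> def

// Example: two decimal places, "N/A" for missing values
let formatCell = formatValueWith "F2" "N/A"
```

Partially applying the format and the default keeps the call site identical to the existing `Seq.map (formatValue def)` step.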

In [1]:
#!fsharp
open Deedle

In [1]:
#!fsharp
let flights = Frame.ReadCsv "/home/gregor/source/repos/FSharpForDataScience/datasets/nycflights13/flights.csv"
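
`Frame.ReadCsv` also accepts optional arguments, which is handy for a quick look at a large file. The sketch below writes a tiny, made-up CSV to a temporary file so that it runs on its own; the file name and its contents are assumptions for illustration, not the real `flights.csv`.

```fsharp
#r "nuget: Deedle"
open System.IO
open Deedle

// Hypothetical miniature CSV, written to a temp file to keep the sketch self-contained
let samplePath = Path.Combine(Path.GetTempPath(), "flights_sample.csv")
File.WriteAllLines(samplePath,
    [ "year,month,day,dep_delay,carrier"
      "2013,1,1,2,UA"
      "2013,1,1,NA,AA"
      "2013,1,1,-1,B6" ])

// Read only the first two data rows and let Deedle infer the column types
let sampleFlights =
    Frame.ReadCsv(samplePath, hasHeaders = true, inferTypes = true, maxRows = 2)
```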

In [1]:
#!fsharp
flights

Index,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,2013,1,1,517,515,2,830,819,11,UA,1545,N14228,EWR,IAH,227,1400,5,15,01/01/2013 05:00:00
1,2013,1,1,533,529,4,850,830,20,UA,1714,N24211,LGA,IAH,227,1416,5,29,01/01/2013 05:00:00
2,2013,1,1,542,540,2,923,850,33,AA,1141,N619AA,JFK,MIA,160,1089,5,40,01/01/2013 05:00:00
3,2013,1,1,544,545,-1,1004,1022,-18,B6,725,N804JB,JFK,BQN,183,1576,5,45,01/01/2013 05:00:00
4,2013,1,1,554,600,-6,812,837,-25,DL,461,N668DN,LGA,ATL,116,762,6,0,01/01/2013 06:00:00
5,2013,1,1,554,558,-4,740,728,12,UA,1696,N39463,EWR,ORD,150,719,5,58,01/01/2013 05:00:00
6,2013,1,1,555,600,-5,913,854,19,B6,507,N516JB,EWR,FLL,158,1065,6,0,01/01/2013 06:00:00
7,2013,1,1,557,600,-3,709,723,-14,EV,5708,N829AS,LGA,IAD,53,229,6,0,01/01/2013 06:00:00
8,2013,1,1,557,600,-3,838,846,-8,B6,79,N593JB,JFK,MCO,140,944,6,0,01/01/2013 06:00:00
9,2013,1,1,558,600,-2,753,745,8,AA,301,N3ALAA,LGA,ORD,138,733,6,0,01/01/2013 06:00:00


In [1]:
#!fsharp
// Don't run this yet: the formatter renders a series with one HTML column per key,
// so a full column of flights would overwhelm the browser/VS Code. Limit the number of keys first.
//flights?dep_time
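
One way to preview a column safely is to truncate the series before it reaches the formatter. A minimal sketch on a made-up frame (the `demo` data is hypothetical; `Series.take` keeps the first n observations):

```fsharp
#r "nuget: Deedle"
open Deedle

// Hypothetical stand-in for flights: one float column with 1000 rows
let demo = frame [ "dep_time" => series [ for i in 0 .. 999 -> i => float (500 + i) ] ]

// Keep only the first 10 observations so the rendered table stays small
let preview = demo?dep_time |> Series.take 10
```

The same pattern should apply to the real data as `flights?dep_time |> Series.take 10`.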