# Exploratory Data Analysis with Deedle

This notebook investigates the exploratory data analysis (EDA) capabilities of Deedle in **Dotnet Interactive Notebooks**. The tasks are mainly inspired by the R community's [ModernDive Book](https://moderndive.com/index.html).

## Load NuGet Dependencies

The main dependency for this task is [Deedle](https://fslab.org/Deedle/), an open-source .NET implementation of the data frame concept known from the R programming language and `pandas`.
In .NET notebooks you can load external dependencies directly from NuGet.

In [1]:
#!fsharp
#r "nuget: Deedle"

## Deedle Formatter

To inspect the contents of data frames and series (roughly: the columns of a data frame), we need to register an HTML formatter for them. The following code combines [a similar formatter for Microsoft's DataFrame](https://github.com/dotnet/interactive/blob/main/samples/notebooks/fsharp/Samples/HousingML.ipynb) with [IFSharp's Deedle formatter](https://github.com/mndrake/IfSharpLab/blob/master/src/DeedleFormat.fs).


**TODO**
- Externalize this formatter using the [Dotnet Interactive Extensions Mechanism](https://github.com/dotnet/interactive/blob/main/docs/extending-dotnet-interactive.md)

In [1]:
#!fsharp
module FrameFormatter =

    open Deedle
    open Deedle.Internal
    open Html

    let maxRows = 20
    let maxCols = 15

    /// Matches any Deedle series via reflection and yields its (key, optional value) pairs
    let (|SeriesValues|_|) (value : obj) = 
        let iser = value.GetType().GetInterface("ISeries`1")
        if iser <> null then 
            let keys = 
                value.GetType().GetProperty("Keys").GetValue(value) :?> System.Collections.IEnumerable
            let vector = value.GetType().GetProperty("Vector").GetValue(value) :?> IVector
            Some(Seq.zip (Seq.cast<obj> keys) vector.ObjectSequence)
        else None

    // TODO: make this configurable for floats and such
    let formatValue (def: string) = function
        | Some v -> v.ToString()
        | None -> def

    /// Super opinionated because I know what the pattern looks like
    /// Don't trust me on this
    let peekAtSeriesTypes (s: ((obj * OptionalValue<obj>) seq)) =
        if Seq.isEmpty s then
            None
        else
            let (k, v) = Seq.head s
            // A missing first value has no runtime type to report
            if v.HasValue then Some (k.GetType(), v.Value.GetType()) else None

    Formatter.Register<IFsiFormattable>(Func<FormatContext, IFsiFormattable, TextWriter, bool>(fun (context: FormatContext) (formattable: IFsiFormattable) (writer: TextWriter) ->
        if context.ContentThreshold < 1.0 then false else

        context.ReduceContent(0.2)
        |> ignore

        let html = 
            match formattable with
            | SeriesValues s ->
                let typeInfo =
                    peekAtSeriesTypes s |> function
                    | None -> String.Empty
                    | Some (keyType, valueType) ->
                        sprintf "Key type: %A Value type: %A" keyType valueType

                let entries = Seq.length s
                let toBeShown = Seq.take (min maxCols entries) s

                div [] [
                    table [] [
                        caption [] [ sprintf "A series: %i values. %s" entries typeInfo |> str ]
                        thead [] [
                            tr [] [
                                th [] [ str "Keys" ]
                                yield! toBeShown
                                |> Seq.map (fun kvp -> th [] [ str (fst kvp |> string) ])
                                if entries > maxCols then th [] [ str "..." ]
                            ]
                        ]
                        tbody [] [
                            tr [] [
                                td [] [ str "Values" ]
                                yield! toBeShown
                                |> Seq.map (fun kvp -> td [] [ str (snd kvp |> string) ])
                                if entries > maxCols then td [] [ str "..." ]
                            ]
                        ]
                    ]
                ]
                |> Some
            | :? IFrame as df ->
                {
                    new IFrameOperation<_> with
                        member x.Invoke(df: Frame<_, _>) =
                            let keyRepresentations =
                                df.ColumnKeys
                                |> Seq.map string

                            let typeRepresentations =
                                df.ColumnTypes
                                |> Seq.map string

                            let keysAndTypes =
                                (keyRepresentations, typeRepresentations)
                                ||> Seq.zip

                            let rowCount = df.RowCount
                            let columnCount = keyRepresentations |> Seq.length

                            let notShownRows = rowCount - maxRows |> max 0
                            let notShownColumns = columnCount - maxCols |> max 0

                            let rowSummary =
                                if notShownRows < 1 then None else
                                sprintf "%i rows" notShownRows |> Some

                            let columnSummary =
                                if notShownColumns < 1 then None else
                                keysAndTypes
                                |> Seq.skip maxCols
                                |> Seq.map (fun (k, v) ->
                                    span [ ] [
                                        str " "
                                        b [] [ str k ]
                                        small [] [ sprintf " <%s>" v |> str ]
                                        str " "
                                    ])
                                |> Some

                            let summary =
                                match (rowSummary, columnSummary) with
                                | None, None -> None
                                | Some rs, None ->
                                    span [] [ sprintf "...with %s additional rows" rs |> str ]
                                    |> Some
                                | None, Some cs ->
                                    span [] [
                                        sprintf "...with %i additional variables: " notShownColumns |> str
                                        br [] []
                                        yield! cs
                                    ]
                                    |> Some
                                | Some rs, Some cs ->
                                    span [] [
                                        sprintf "...with %s additional rows and %i additional variables: " rs notShownColumns |> str
                                        br [] []
                                        yield! cs
                                    ]
                                    |> Some

                            div [] [
                                table [] [
                                    caption [] [ sprintf "A frame: %i x %i" rowCount columnCount |> str ]
                                    thead [] [
                                        tr [] [
                                            th [] []
                                            yield! df.ColumnKeys
                                            |> Seq.take (min maxCols columnCount)
                                            |> Seq.map (fun ck -> th [] [ str (ck.ToString()) ])
                                            if maxCols < columnCount then th [] [ str "..." ]
                                        ]
                                        tr [] [
                                            th [] []
                                            yield! df.ColumnTypes
                                            |> Seq.take (min maxCols columnCount)
                                            |> Seq.map (fun ct -> th [] [ ct |> string |> str ])
                                            if maxCols < columnCount then th [] [ str "..." ]
                                        ]
                                    ]
                                    tbody [] [
                                        yield! df
                                        |> Frame.sliceCols (df.ColumnKeys |> Seq.take (min columnCount maxCols))
                                        |> Frame.take (min maxRows rowCount)
                                        |> Frame.rows
                                        |> Series.observationsAll
                                        |> Seq.map (fun item ->
                                            let def, k, data =
                                                match item with
                                                | k, Some d -> "N/A", k.ToString(), Series.observationsAll d |> Seq.map snd
                                                | k, _ -> "N/A", k.ToString(), df.ColumnKeys |> Seq.map (fun _ -> None)
                                            let row =
                                                data
                                                |> Seq.map (formatValue def)
                                                |> Seq.map (fun v ->
                                                    td [] [ embed context v ])
                                            tr [] [
                                                td [] [ embed context k ]
                                                yield! row
                                                if columnCount > maxCols then td [] [ str "..." ]
                                            ])
                                        if rowCount > maxRows then tr [] [
                                            yield! fun _ -> td [] [ str "..." ]
                                            |> Seq.init ((min columnCount maxCols) + 2)
                                        ]
                                    ]
                                ]
                                match summary with
                                | Some s ->
                                    div [] [
                                        p [] [ s ]
                                    ]
                                | None -> ()
                            ]
                            |> Some
                }
                |> df.Apply
            | _ -> None

        match html with
        | Some v -> writer.Write v
        | None -> writer.Write ""

        true
    ), mimeType = "text/html")
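
To check the formatter without the full CSV file, it can help to evaluate a small hand-built frame first. The following is a minimal sketch and not part of the original notebook: the `SampleFlight` record and its values are hypothetical, and it assumes that `Frame.ofRecords` is available in the loaded Deedle version.

```fsharp
#r "nuget: Deedle"
open Deedle

// Hypothetical record type; the field names become the column keys
type SampleFlight = { dep_delay: float; carrier: string }

// Hypothetical sample rows, not taken from the flights dataset
let sample =
    Frame.ofRecords [
        { dep_delay = 2.0; carrier = "UA" }
        { dep_delay = 4.0; carrier = "UA" }
        { dep_delay = -1.0; carrier = "B6" }
    ]

// A single column of the frame is a series
let sampleDelays = sample.GetColumn<float>("dep_delay")
```

Evaluating `sample` or `sampleDelays` as the last expression of a cell should then render through the registered formatter rather than as plain text.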

In [1]:
#!fsharp
open Deedle

In [1]:
#!fsharp
let flights = Frame.ReadCsv "/home/gregor/source/repos/FSharpForDataScience/datasets/nycflights13/flights.csv"

In [1]:
#!fsharp
flights

| | year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | ... |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | System.Int32 | System.Int32 | System.Int32 | System.Int32 | System.Int32 | System.Int32 | System.Int32 | System.Int32 | System.Int32 | System.String | System.Int32 | System.String | System.String | System.String | System.Int32 | ... |
| 0 | 2013 | 1 | 1 | 517 | 515 | 2 | 830 | 819 | 11 | UA | 1545 | N14228 | EWR | IAH | 227 | ... |
| 1 | 2013 | 1 | 1 | 533 | 529 | 4 | 850 | 830 | 20 | UA | 1714 | N24211 | LGA | IAH | 227 | ... |
| 2 | 2013 | 1 | 1 | 542 | 540 | 2 | 923 | 850 | 33 | AA | 1141 | N619AA | JFK | MIA | 160 | ... |
| 3 | 2013 | 1 | 1 | 544 | 545 | -1 | 1004 | 1022 | -18 | B6 | 725 | N804JB | JFK | BQN | 183 | ... |
| 4 | 2013 | 1 | 1 | 554 | 600 | -6 | 812 | 837 | -25 | DL | 461 | N668DN | LGA | ATL | 116 | ... |
| 5 | 2013 | 1 | 1 | 554 | 558 | -4 | 740 | 728 | 12 | UA | 1696 | N39463 | EWR | ORD | 150 | ... |
| 6 | 2013 | 1 | 1 | 555 | 600 | -5 | 913 | 854 | 19 | B6 | 507 | N516JB | EWR | FLL | 158 | ... |
| 7 | 2013 | 1 | 1 | 557 | 600 | -3 | 709 | 723 | -14 | EV | 5708 | N829AS | LGA | IAD | 53 | ... |
| 8 | 2013 | 1 | 1 | 557 | 600 | -3 | 838 | 846 | -8 | B6 | 79 | N593JB | JFK | MCO | 140 | ... |
| 9 | 2013 | 1 | 1 | 558 | 600 | -2 | 753 | 745 | 8 | AA | 301 | N3ALAA | LGA | ORD | 138 | ... |


In [1]:
#!fsharp
flights?dep_time

Keys,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,...


In [1]:
#!fsharp
flights.GetColumn<string>("origin")

Keys,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,...
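
The two cells above access columns in two different ways. The `?` operator returns the column as a series of `float` values, so it suits numeric columns such as `dep_time`. `GetColumn<'T>` lets you name the value type, which is needed for string columns such as `origin`. The sketch below illustrates the difference on a hypothetical three-row stand-in for the flights frame; the `MiniFlight` record is an assumption made for illustration only.

```fsharp
#r "nuget: Deedle"
open Deedle

// Hypothetical record mirroring two columns of the flights data
type MiniFlight = { dep_time: int; origin: string }

let mini =
    Frame.ofRecords [
        { dep_time = 517; origin = "EWR" }
        { dep_time = 533; origin = "LGA" }
        { dep_time = 542; origin = "JFK" }
    ]

// `?` converts the integer column to float on access
let depTime : Series<int, float> = mini?dep_time

// GetColumn<'T> returns the column with the requested value type
let depTimeInt = mini.GetColumn<int>("dep_time")
let origin = mini.GetColumn<string>("origin")
```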


In [1]:
#!fsharp
flights?dep_delay
|> Stats.mean
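
Deedle's `Stats` functions skip missing values, which matters when a delay column has gaps. The following sketch shows this on a hypothetical in-memory series; whether `dep_delay` in the loaded file actually contains missing values is an assumption and is not shown above.

```fsharp
#r "nuget: Deedle"
open Deedle

// Hypothetical delays in minutes; the second observation is missing
let delays =
    Series.ofOptionalObservations [ 0 => Some 2.0; 1 => None; 2 => Some 4.0 ]

// The mean is taken over the two present values only
let meanDelay = Stats.mean delays

// Number of observations that actually carry a value
let presentCount = Series.countValues delays
```

Comparing `presentCount` with the number of keys is a quick way to see how many observations a statistic was actually based on.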