## Type providers

Continuing with reading structured data files, sometimes parsing the data by hand is not feasible (or is simply a lot of work). Let's look at an example where this happens:

In [3]:
open System.IO

// Read every line of the file into a string array.
let readFile (fileName: string) =
    File.ReadAllLines(fileName)

In this case, we take the data from a file of Beatles songs:

In [4]:
let beatlesFile = "../data/The Beatles songs dataset.csv"

let songs = readFile(beatlesFile)

In [5]:
songs.GetType()

In [6]:
printfn "%A" songs[0]
printfn "%A" songs[1]

"Title,Year,Album.debut,Duration,Other.releases,Genre,Songwriter,Lead.vocal,Top.50.Billboard"
"12-Bar Original,1965,Anthology 2,174,0,Blues,"Lennon, McCartney, Harrison and Starkey",,-1"


In [7]:
songs[0..20]
|> Seq.iteri  (fun i s->  printfn $"{i}: {s}")

0: Title,Year,Album.debut,Duration,Other.releases,Genre,Songwriter,Lead.vocal,Top.50.Billboard
1: 12-Bar Original,1965,Anthology 2,174,0,Blues,"Lennon, McCartney, Harrison and Starkey",,-1
2: A Day in the Life,1967,Sgt. Pepper's Lonely Hearts Club Band,335,12,"Psychedelic Rock, Art Rock, Pop/Rock",Lennon and McCartney,Lennon and McCartney,-1
3: A Hard Day's Night,1964,UK: A Hard Day's Night US: 1962-1966,152,35,"Rock, Electronic, Pop/Rock",Lennon,"Lennon, with McCartney",8
4: A Shot of Rhythm and Blues,1963,Live at the BBC,104,0,"R&B, Pop/Rock",Thompson,Lennon,-1
5: A Taste of Honey,1963,UK: Please Please Me US: The Early Beatles,163,29,"Pop/Rock, Jazz, Stage&Screen","Scott, Marlow",McCartney,-1
6: Across the Universe,1968,Let It Be,230,19,"Psychedelic folk, Pop/Rock",Lennon,Lennon,-1
7: Act Naturally,1965,UK: Help! US: Yesterday and Today,139,14,"Country, Pop/Rock","Russell, Morrison",Starkey,50
8: Ain't She Sweet,1961,Anthology 1,150,9,Pop/Rock,"Yellen, Ager",Lennon,41
9: All I've Go

In [8]:
let song =   songs[2].Split(',')
printfn "%A" songs[2]
printfn "%A" song

"A Day in the Life,1967,Sgt. Pepper's Lonely Hearts Club Band,335,12,"Psychedelic Rock, Art Rock, Pop/Rock",Lennon and McCartney,Lennon and McCartney,-1"
[|"A Day in the Life"; "1967"; "Sgt. Pepper's Lonely Hearts Club Band"; "335";
  "12"; ""Psychedelic Rock"; " Art Rock"; " Pop/Rock""; "Lennon and McCartney";
  "Lennon and McCartney"; "-1"|]


In [9]:
song.Length
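
The array above has 11 elements even though the header declares 9 columns, because `Split(',')` also cuts at the commas inside the quoted `Genre` field. As a rough illustration of what a hand-written fix would involve, here is a minimal sketch of a quote-aware splitter. The `splitCsvLine` helper is hypothetical (it is not part of any library), and it ignores escaped quotes and multi-line fields, both of which real CSV files may contain.

```fsharp
// Hypothetical helper: split one CSV line on the commas that lie outside double quotes.
// Limitations: no support for escaped quotes ("") or fields that span several lines.
let splitCsvLine (line: string) =
    let fields = System.Collections.Generic.List<string>()
    let current = System.Text.StringBuilder()
    let mutable inQuotes = false
    for c in line do
        match c with
        | '"' -> inQuotes <- not inQuotes        // toggle the quoted state and drop the quote
        | ',' when not inQuotes ->               // a separator only counts outside quotes
            fields.Add(current.ToString())
            current.Clear() |> ignore
        | _ -> current.Append(c) |> ignore
    fields.Add(current.ToString())               // last field has no trailing comma
    fields.ToArray()

let line =
    "A Day in the Life,1967,Sgt. Pepper's Lonely Hearts Club Band,335,12,\"Psychedelic Rock, Art Rock, Pop/Rock\",Lennon and McCartney,Lennon and McCartney,-1"

let naive = line.Split(',')
let quoteAware = splitCsvLine line

printfn "naive: %d fields, quote-aware: %d fields" naive.Length quoteAware.Length
```

Even this small version needs mutable state and still leaves edge cases open, which is the motivation for delegating the work to a library.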

You can look at other parsers, of which there are plenty (see, for example, [this page](https://www.joelverhagen.com/blog/2020/12/fastest-net-csv-parsers)), but thanks to static typing there is a library that solves the problem for us through _type providers_.

### Type Providers

A _type provider_ is a library that lets us work with a particular kind of data. Some examples:

- [CSV type provider](http://fsprojects.github.io/FSharp.Data/library/CsvProvider.html).
- [Html type provider](https://fsprojects.github.io/FSharp.Data/library/HtmlProvider.html).
- [Json type provider](https://fsprojects.github.io/FSharp.Data/library/JsonProvider.html).

We will use the `FSharp.Data` library to learn how to read these kinds of data. In a notebook it is imported as follows:

In [1]:
#r "nuget: FSharp.Data"

open FSharp.Data

A _type provider_ generates a data type from the information it reads from a file. This happens at compile time. At run time, the generated type can be used to process the data.

In [11]:
type SongsTypeProvider = FSharp.Data.CsvProvider<"../data/The Beatles songs dataset.csv", HasHeaders=true>

The compiler (together with the `FSharp.Data` library) builds the provided type using the file "../data/The Beatles songs dataset.csv" as a template, inferring the structure of the data from it.

The data itself can be obtained with:

In [12]:
let songs = SongsTypeProvider.GetSample()

This way, we use the same file both to create the type and to obtain the data. However, two different files could be used, one as the template and another holding the data. In that case, we call the `.Load` method:

```fsharp
type SongsTypeProvider = FSharp.Data.CsvProvider<"myTemplateDataFile.csv", HasHeaders=true>
let songs = SongsTypeProvider.Load("myRealDataFile.csv")
```
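
The separation between template and data can be tried without any files. The following is a minimal sketch, assuming an inline CSV string as the template and the provider's `Parse` method for the data. The `Tiny` type and its rows are invented for illustration and are not part of the Beatles dataset.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Template: a hypothetical inline sample, used only to infer column names and types.
type Tiny = CsvProvider<"Title,Year\nHey Jude,1968">

// Data: different rows with the same shape as the template.
let tiny = Tiny.Parse("Title,Year\nYesterday,1965\nLet It Be,1970")

for row in tiny.Rows do
    // Year was inferred as an integer column, so it can be used in arithmetic directly.
    printfn "%s (%d), %d years before 2000" row.Title row.Year (2000 - row.Year)
```

Because the column types are fixed when the type is created, a typo such as `row.Yaer` would be rejected by the compiler rather than failing while the program runs.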


When it creates the type, the _type provider_ also generates the members used to access the information, for example:

In [13]:
songs.Headers 

Value: [ Title, Year, Album.debut, Duration, Other.releases, Genre, Songwriter, Lead.vocal, Top.50.Billboard ]


This gives us the header of each column. The rows themselves are found in the `.Rows` property:

In [14]:
songs.Rows
|> Seq.take 20 
|> Seq.iteri  (fun i s ->  printfn $"{i}: {s}")

0: (12-Bar Original, 1965, Anthology 2, 174, 0, Blues, Lennon, McCartney, Harrison and Starkey, , -1)
1: (A Day in the Life, 1967, Sgt. Pepper's Lonely Hearts Club Band, 335, 12, Psychedelic Rock, Art Rock, Pop/Rock, Lennon and McCartney, Lennon and McCartney, -1)
2: (A Hard Day's Night, 1964, UK: A Hard Day's Night US: 1962-1966, 152, 35, Rock, Electronic, Pop/Rock, Lennon, Lennon, with McCartney, 8)
3: (A Shot of Rhythm and Blues, 1963, Live at the BBC, 104, 0, R&B, Pop/Rock, Thompson, Lennon, -1)
4: (A Taste of Honey, 1963, UK: Please Please Me US: The Early Beatles, 163, 29, Pop/Rock, Jazz, Stage&Screen, Scott, Marlow, McCartney, -1)
5: (Across the Universe, 1968, Let It Be, 230, 19, Psychedelic folk, Pop/Rock, Lennon, Lennon, -1)
6: (Act Naturally, 1965, UK: Help! US: Yesterday and Today, 139, 14, Country, Pop/Rock, Russell, Morrison, Starkey, 50)
7: (Ain't She Sweet, 1961, Anthology 1, 150, 9, Pop/Rock, Yellen, Ager, Lennon, 41)
8: (All I've Got to Do, 1963, UK: With the Beatles 

From the column headers, the _type provider_ builds a corresponding property for each field:

In [16]:
songs.Rows
|> Seq.take 10 
|> Seq.iteri  (fun i s ->  printfn $"{i}: {s.Title} by {s.Songwriter} ({s.Year})")

0: 12-Bar Original by Lennon, McCartney, Harrison and Starkey (1965)
1: A Day in the Life by Lennon and McCartney (1967)
2: A Hard Day's Night by Lennon (1964)
3: A Shot of Rhythm and Blues by Thompson (1963)
4: A Taste of Honey by Scott, Marlow (1963)
5: Across the Universe by Lennon (1968)
6: Act Naturally by Russell, Morrison (1965)
7: Ain't She Sweet by Yellen, Ager (1961)
8: All I've Got to Do by Lennon (1963)
9: All My Loving by McCartney (1963)


In [17]:
songs.Rows
|> Seq.take 20 
|> Seq.iteri  (fun i s ->  printfn $"{i}: {s.Title} by >{s.``Lead.vocal``}<")

0: 12-Bar Original by ><
1: A Day in the Life by >Lennon and McCartney<
2: A Hard Day's Night by >Lennon, with McCartney<
3: A Shot of Rhythm and Blues by >Lennon<
4: A Taste of Honey by >McCartney<
5: Across the Universe by >Lennon<
6: Act Naturally by >Starkey<
7: Ain't She Sweet by >Lennon<
8: All I've Got to Do by >Lennon<
9: All My Loving by >McCartney<
10: All Things Must Pass by >Harrison<
11: All Together Now by >McCartney, with Lennon<
12: All You Need Is Love by >Lennon<
13: And I Love Her by >McCartney<
14: And Your Bird Can Sing by >Lennon<
15: Anna (Go to Him) by >Lennon<
16: Another Girl by >McCartney<
17: Any Time at All by >Lennon, with McCartney<
18: Ask Me Why by >Lennon<
19: Baby It's You by >Lennon<


In [19]:
songs.Rows 
|> Seq.filter (fun r -> r.``Top.50.Billboard``=1)
|> Seq.iter (fun r -> printfn $"Name:{r.Title} position {r.``Top.50.Billboard``}")


Name:Hey Jude position 1


In [25]:
let s10 = songs.Rows |> Seq.item 10 
s10.Title

All Things Must Pass

#### Choosing separators

The separators can be specified when the type is created:
```fsharp
// AirQuality is only an illustrative name for the provided type.
type AirQuality = CsvProvider<"../data/AirQuality.csv", Separators=";,">
```
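
As a file-free illustration, the sketch below uses a small inline sample separated by semicolons. The `SemiColonCsv` type, its columns, and its rows are made up for this example.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Hypothetical inline sample that uses ';' instead of ',' between columns.
type SemiColonCsv = CsvProvider<"Name;Score\nAna;7\nLuis;9", Separators=";">

let scores = SemiColonCsv.GetSample()

// Score is inferred as an integer column once the separator is known.
let total = scores.Rows |> Seq.sumBy (fun r -> r.Score)
printfn "Total score: %d" total
```

Without the `Separators` argument, each line would presumably be read as a single column, since the default separator is a comma.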

### Missing data

The _type provider_ has [certain rules for dealing with missing data](https://fsprojects.github.io/FSharp.Data/library/CsvProvider.html#Controlling-the-column-types). For example, if a column is expected to hold a number but the file contains `NaN`, the _type provider_ reports the value as `Double.NaN`.

On the other hand, [we can specify which strings should be converted to `NaN`](https://fsprojects.github.io/FSharp.Data/library/CsvProvider.html#Missing-values):

```fsharp 
CsvProvider<"X,Y,Z\nthis,that,1.0", MissingValues="this,that">
    .GetSample()
    .Rows
```

Finally, if we prefer not to rely on the _type provider_'s rules, we can set `PreferOptionals=true` so that it generates `option` types for columns with missing data:


In [20]:
type SongsTypeProviderOpt = FSharp.Data.CsvProvider<"../data/The Beatles songs dataset.csv", HasHeaders=true, PreferOptionals=true>

let songsWithOpt = SongsTypeProviderOpt.GetSample()

songsWithOpt.Rows
|> Seq.take 5
|> Seq.iteri  (fun i s ->  printfn $"{i}: {s}")

0: (12-Bar Original, 1965, Some(Anthology 2), 174, 0, Some(Blues), Lennon, McCartney, Harrison and Starkey, , -1)
1: (A Day in the Life, 1967, Some(Sgt. Pepper's Lonely Hearts Club Band), 335, 12, Some(Psychedelic Rock, Art Rock, Pop/Rock), Lennon and McCartney, Some(Lennon and McCartney), -1)
2: (A Hard Day's Night, 1964, Some(UK: A Hard Day's Night US: 1962-1966), 152, 35, Some(Rock, Electronic, Pop/Rock), Lennon, Some(Lennon, with McCartney), 8)
3: (A Shot of Rhythm and Blues, 1963, Some(Live at the BBC), 104, 0, Some(R&B, Pop/Rock), Thompson, Some(Lennon), -1)
4: (A Taste of Honey, 1963, Some(UK: Please Please Me US: The Early Beatles), 163, 29, Some(Pop/Rock, Jazz, Stage&Screen), Scott, Marlow, Some(McCartney), -1)
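
To see the effect in isolation, the following sketch uses a small inline sample instead of the Beatles file. The `Scores` type and its two rows are invented for illustration. The assumption is that an empty cell in an otherwise numeric column is inferred as `int option` when `PreferOptionals=true`.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Hypothetical inline sample: the second row has no value in the Score column.
type Scores = CsvProvider<"Name,Score\nAna,7\nLuis,", PreferOptionals=true>

let describe (r: Scores.Row) =
    // The missing cell shows up as None, so both cases must be handled explicitly.
    match r.Score with
    | Some s -> sprintf "%s scored %d" r.Name s
    | None -> sprintf "%s has no score" r.Name

Scores.GetSample().Rows
|> Seq.iter (describe >> printfn "%s")
```

Pattern matching on the `option` forces the missing case to be dealt with at each use, instead of silently propagating a placeholder value.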


### Html Provider

We can also obtain data from a web page using an HTML provider:

In [29]:
[<Literal>]
let url = """https://en.wikipedia.org/wiki/List_of_songs_recorded_by_the_Beatles"""

type WebSongsTypeProvider = FSharp.Data.HtmlProvider<url>

let songs = WebSongsTypeProvider.Load(url)

In [30]:
songs.Tables

In [32]:
songs.Tables.``Main songsedit 3``.Rows
|> Seq.map (fun r -> r.Song, r.Year)
|> Seq.filter (fun (s,y) -> y=1968)
|> Seq.iter (fun (s,y) -> printfn $"Name:%s{s} Year:{y}")


Name:"Back in the U.S.S.R." Year:1968
Name:"Birthday" Year:1968
Name:"Blackbird" Year:1968
Name:"The Continuing Story of Bungalow Bill" Year:1968
Name:"Cry Baby Cry" Year:1968
Name:"Dear Prudence" Year:1968
Name:"Don't Pass Me By" Year:1968
Name:"Everybody's Got Something to Hide Except Me and My Monkey" Year:1968
Name:"Glass Onion" Year:1968
Name:"Good Night" Year:1968
Name:"Happiness Is a Warm Gun" Year:1968
Name:"Helter Skelter" Year:1968
Name:"Hey Jude" # Year:1968
Name:"Honey Pie" Year:1968
Name:"I Will" Year:1968
Name:"I'm So Tired" Year:1968
Name:"The Inner Light" # Year:1968
Name:"Julia" Year:1968
Name:"Lady Madonna" # Year:1968
Name:"Long, Long, Long" Year:1968
Name:"Martha My Dear" Year:1968
Name:"Mother Nature's Son" Year:1968
Name:"Ob-La-Di, Ob-La-Da" Year:1968
Name:"Piggies" Year:1968
Name:"Revolution"[m] # Year:1968
Name:"Revolution 1"[n] Year:1968
Name:"Revolution 9"[o] Year:1968
Name:"Rocky Raccoon" Year:1968
Name:"Savoy Truffle" Year:1968
Name:"Sexy Sadie" Year:1968
Na

### Json Provider

Finally, there is a _type provider_ for reading data in JSON (JavaScript Object Notation), a standard format for transmitting information on the internet.

In [2]:
[<Literal>]
let tvUrl = "https://raw.githubusercontent.com/mganitombalak/training/master/DATA/tv-shows.json"

In [3]:
type TvListing = JsonProvider<tvUrl>
let tvListing = TvListing.GetSamples()                                   


In [4]:
tvListing.Length

In [5]:
tvListing
|> Seq.map (fun t -> (t.Name,t.Rating.Average))
|> Seq.sortByDescending (fun (n,a) -> a)
|> Seq.take 20
|> Seq.iter (fun (n,a) -> printfn $"{n}: {a}")

"Game of Thrones": Some(9.4)
"Rick and Morty": Some(9.4)
"Breaking Bad": Some(9.3)
"The Wire": Some(9.3)
"Firefly": Some(9.3)
"Stargate SG-1": Some(9.3)
"Berserk": Some(9.2)
"Person of Interest": Some(9)
"Fargo": Some(9)
"House": Some(9)
"Banshee": Some(9)
"The Newsroom": Some(8.9)
"Fringe": Some(8.9)
"Battlestar Galactica": Some(8.9)
"Stargate Atlantis": Some(8.9)
"Vikings": Some(8.8)
"Boardwalk Empire": Some(8.8)
"Justified": Some(8.8)
"Bob's Burgers": Some(8.8)
"CSI: Crime Scene Investigation": Some(8.8)


In [6]:
tvListing
|> Seq.map (fun t -> (t.Name,t.Rating.Average))
|> Seq.choose (fun (n,a) -> a)
|> Seq.length
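
The `Some(...)` wrappers in the ratings come from the provider inferring `Average` as an optional value, most likely because some entries in the sample have no rating. The sketch below reproduces this with an inline sample. The `ShowList` type and the two entries are invented for illustration, and it assumes that a `null` in one element makes the inferred property an `option`.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Hypothetical inline sample: one show has a rating, the other has null.
type ShowList = JsonProvider<"""[
    {"name": "Show A", "rating": {"average": 9.4}},
    {"name": "Show B", "rating": {"average": null}}
]""">

let shows =
    ShowList.Parse("""[
        {"name": "Show C", "rating": {"average": 8.1}},
        {"name": "Show D", "rating": {"average": null}}
    ]""")

// Keep only the shows that actually have a rating, unwrapping the option.
let rated =
    shows
    |> Seq.choose (fun s -> s.Rating.Average |> Option.map (fun a -> s.Name, a))
    |> Seq.toList

printfn "%A" rated
```

Using `Seq.choose` with `Option.map` keeps the show name next to its unwrapped rating, rather than discarding the name as a plain `Seq.choose` on the option alone would.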

In [9]:
tvListing
|> Seq.map (fun t -> (t.Name,t.Rating.Average))
|> Seq.filter (fun (n,a) -> 
                match a with 
                | Some a -> false
                | None -> true
)
|> Seq.iter (fun (n,a) -> printfn $"{n}")
// |> Seq.length

"The Biggest Loser"
"Mulaney"
"Utopia"
"The Chair"
"Happyland"
"The Great Fire"
"Town of the Living Dead"
"Long Shadow"
