# Data analysis in Haskell
  
In this notebook, we take some data from Sharelock Holmes in CSV format, and filter it for shares whose RS6m (relative strength over 6 months) is in the top decile and whose Piotroski score is at least 8. We also show a way of converting from `String`s to `Float`s in the presence of possible parse errors.

We will need quite a few modules, some of which will need to be installed via `cabal`. These are clearly noted below:

In [1]:
import Data.Either
import Data.List
import Data.Maybe
import Data.Tuple.Select -- cabal install tuple
import Data.Vector (fromList)
import GHC.Float
import Network.HTTP
import Statistics.Quantile -- cabal install statistics
import Text.CSV -- cabal install csv
import Text.ParserCombinators.Parsec

Some data has been prepped, and is available on pastebin. It is in CSV format. The fields are called:
* `epic` - a code for the company name
* `rs6mb` - relative strength of the share for the last 6 months
* `pio` - Piotroski score.

Download the data into `rawData`:

In [2]:
getUrl :: String -> IO String
getUrl url = do
    resp <- simpleHTTP $ getRequest url
    getResponseBody resp
    
rawData <- getUrl "http://pastebin.com/raw.php?i=2whDuzjA"



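Parse the CSV into a matrix (a list of rows, each row a list of strings). The first argument to `parseCSV` is a file name that is used only in error messages, so its value is arbitrary here. If parsing fails, we just fall back to an empty list: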

In [3]:
let mat = either (const []) id $ parseCSV "/tmp/oops.txt" rawData
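
Falling back to `[]` hides the reason a parse failed. As a minimal sketch, the following shows the same `either` fallback next to a variant that keeps the error message. The names `rowsOrEmpty` and `describe` are hypothetical, and a plain `String` stands in for the parser's error type, so no CSV library is needed to run it:

```haskell
-- Rows of a table, as a list of lists of strings.
type Rows = [[String]]

-- Same fallback as in the cell above: any error becomes an empty table.
rowsOrEmpty :: Either String Rows -> Rows
rowsOrEmpty = either (const []) id

-- Hypothetical alternative: report the error instead of discarding it.
describe :: Either String Rows -> String
describe = either ("parse failed: " ++) (\rows -> show (length rows) ++ " rows")

main :: IO ()
main = do
  print (rowsOrEmpty (Right [["epic", "pio"], ["888", "8"]]))
  print (rowsOrEmpty (Left "unexpected end of input"))
  putStrLn (describe (Left "unexpected end of input"))
```

For a quick exploration the silent fallback is convenient; for anything longer-lived, surfacing the `Left` value is usually the safer choice.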

The first ten records are as follows:

In [4]:
take 10 mat

[["epic","rs6mb","pio"],["3IN","5.86","5"],["888","6.38","8"],["AA.","16.2","7"],["AAL","-29.11","4"],["ABBY","-4.85","5"],["ABC","12.39","6"],["ABF","3.79","7"],["ACL","-0.6","5"],["ACSO","3.08","4"]]

In general, there may be many columns of data. We will normally only be interested in a small subset. We create a lookup table, with a key corresponding to the header (`epic`, etc.) and values corresponding to the values of that column:

In [5]:
mkLookup mat = map (\(h:t) -> (h, t)) $ transpose mat

lookups = mkLookup mat
getField f = fromJust $ lookup f lookups
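
To see what the lookup table contains, here is a small self-contained sketch on an in-memory table. The names `mkLookupSafe`, `getFieldSafe` and `sample` are ours, for illustration only. The one deliberate difference from the cell above is that a missing header yields `Nothing` rather than a runtime error from `fromJust`:

```haskell
import Data.List (transpose)

-- Turn a row-major table (header row first) into (header, column) pairs.
mkLookupSafe :: [[String]] -> [(String, [String])]
mkLookupSafe rows = [ (h, t) | (h:t) <- transpose rows ]

-- Look up a column by its header; Nothing if the header is absent.
getFieldSafe :: [(String, [String])] -> String -> Maybe [String]
getFieldSafe table name = lookup name table

-- A tiny sample in the same shape as the parsed CSV.
sample :: [[String]]
sample =
  [ ["epic", "rs6mb", "pio"]
  , ["3IN",  "5.86",  "5"]
  , ["888",  "6.38",  "8"]
  ]

main :: IO ()
main = do
  let table = mkLookupSafe sample
  print (getFieldSafe table "epic")  -- Just ["3IN","888"]
  print (getFieldSafe table "nope")  -- Nothing
```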

So far, the values are in the form of `String`s. We will often want them as `Float`s. Sometimes, the string is empty, implying missing data. So the conversion won't always work. Let us return a `Just Float` if we can convert the `String`, or `Nothing` if the conversion fails. **This is a robust method for performing conversion between strings and floats**:

In [6]:
readFloat :: String -> Maybe Float
readFloat str = 
    case reads str :: [(Float, String)] of
        [(x, "")] -> Just x
        _ -> Nothing
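
An alternative worth knowing is `readMaybe` from `Text.Read` in `base`, which packages up much the same idea. The sketch below uses the hypothetical name `readFloat'` to avoid clashing with the definition above. One difference to be aware of, as far as we know, is that `readMaybe` tolerates trailing whitespace, whereas the `reads` version above rejects it:

```haskell
import Text.Read (readMaybe)

-- Parse a whole string as a Float, or give Nothing if it does not parse.
readFloat' :: String -> Maybe Float
readFloat' = readMaybe

main :: IO ()
main = mapM_ (print . readFloat') ["5.86", "-29.11", "", "n/a", "8"]
```

Empty strings and non-numeric text both give `Nothing`, which is the behaviour we want for missing data.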

Whilst `getField` returns a list of strings, it would be convenient if we could also get a list of floats (remember that conversion may not always be possible):

In [7]:
getFloats :: Field -> [Maybe Float]
getFloats f = map readFloat $ getField f

We define 3 values, `epics`, `rs6s`, `pios`, being lists of either strings (in the first case), or `Maybe Float`s in the other cases:

In [8]:
epics = getField "epic"
rs6s = getFloats "rs6mb"
pios = getFloats "pio"

Let's print out the first 10 items in the data. You will notice that for company `ADM` a pio wasn't available:

In [9]:
take 10 $ zip3 epics rs6s pios

[("3IN",Just 5.86,Just 5.0),("888",Just 6.38,Just 8.0),("AA.",Just 16.2,Just 7.0),("AAL",Just (-29.11),Just 4.0),("ABBY",Just (-4.85),Just 5.0),("ABC",Just 12.39,Just 6.0),("ABF",Just 3.79,Just 7.0),("ACL",Just (-0.6),Just 5.0),("ACSO",Just 3.08,Just 4.0),("ADM",Just 15.75,Nothing)]

Depending on the analysis, we may want either to keep or to discard records that contain a `Nothing`. In our case, let us throw out any records that we cannot process:

In [10]:
valids = filter (\(e,r,p) -> isJust r && isJust p) $ zip3 epics rs6s pios

All of the data in `valids` is meaningful, so we should strip out the `Just`s, to give us actual floats:

In [11]:
erps = map (\(e, r, p) -> (e, fromJust r, fromJust p)) valids
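
The filter-then-`fromJust` steps can also be done in one pass. In the sketch below, a list comprehension pattern-matches on `Just`, so incomplete rows are skipped and the values are unwrapped together. The names `Record` and `complete`, and the sample values, are for illustration only:

```haskell
-- An (epic, rs6m, pio) record with no missing values.
type Record = (String, Float, Float)

-- Keep only rows where both numbers are present, unwrapping them as we go.
-- Rows that fail the (e, Just r, Just p) pattern are silently dropped.
complete :: [String] -> [Maybe Float] -> [Maybe Float] -> [Record]
complete es rs ps = [ (e, r, p) | (e, Just r, Just p) <- zip3 es rs ps ]

main :: IO ()
main = print (complete ["3IN", "ADM", "ADN"]
                       [Just 5.86, Just 15.75, Just 14.4]
                       [Just 5, Nothing, Just 4])
-- prints [("3IN",5.86,5.0),("ADN",14.4,4.0)]
```

This avoids `fromJust` altogether, so there is no partial function left to fail if the filter and the unwrapping ever get out of step.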

`erps` is a list of `(epic, rs6m, pio)` tuples. Here are its first 10 values:

In [12]:
take 10 erps

[("3IN",5.86,5.0),("888",6.38,8.0),("AA.",16.2,7.0),("AAL",-29.11,4.0),("ABBY",-4.85,5.0),("ABC",12.39,6.0),("ABF",3.79,7.0),("ACL",-0.6,5.0),("ACSO",3.08,4.0),("ADN",14.4,4.0)]

The 90th percentile of relative strength over the last 6 months is as follows:

In [13]:
rvec = fromList $ Prelude.map (float2Double . sel2) erps
r90 = double2Float $ weightedAvg 9 10 rvec
r90

29.592001

In other words, 1 in 10 shares had an RS6m of at least 29.6%.
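
To make the quantile calculation less of a black box, here is a sketch of one common definition: sort the sample, find the fractional position `(n - 1) * k / q`, and interpolate linearly between the two neighbouring values. We believe `weightedAvg` follows a similar scheme, but `quantileSketch` is our own hypothetical helper, not the library's implementation, and it may differ in edge cases:

```haskell
import Data.List (sort)

-- Estimate the k-th q-quantile of a sample by linear interpolation
-- between neighbouring order statistics. Nothing for invalid input.
quantileSketch :: Int -> Int -> [Double] -> Maybe Double
quantileSketch k q xs
  | null xs || q <= 0 || k < 0 || k > q = Nothing
  | j + 1 >= n = Just (sorted !! (n - 1))      -- at or beyond the last element
  | otherwise  = Just (xj + g * (xj1 - xj))    -- interpolate between neighbours
  where
    sorted = sort xs
    n      = length xs
    idx    = fromIntegral (n - 1) * fromIntegral k / fromIntegral q :: Double
    j      = floor idx :: Int                  -- index of the lower neighbour
    g      = idx - fromIntegral j              -- fractional part of the position
    xj     = sorted !! j
    xj1    = sorted !! (j + 1)

main :: IO ()
main = do
  print (quantileSketch 9 10 [1 .. 11])   -- Just 10.0
  print (quantileSketch 1 2 [4, 1, 3, 2]) -- Just 2.5
```

With `k = 9` and `q = 10` this gives the 90th percentile, matching the arguments passed to `weightedAvg` above.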

We are interested in filtering for shares where the relative strength is in the top decile, and the Piotroski score is at least 8:

In [14]:
passes = filter (\(e,r,p) -> r >= r90 && p >= 8.0) erps

In [15]:
passes

[("AERL",56.67,8.0),("DNO",51.68,9.0),("GRG",66.27,8.0),("IAG",62.3,8.0),("JLF",30.68,8.0),("MSLH",32.55,8.0),("PETS",43.29,8.0),("PLUS",56.24,9.0),("RMV",38.22,8.0),("SYNT",43.83,8.0),("SYR",52.83,8.0),("TW.",30.92,8.0)]

This completes our analysis.

## About this document

    Author:  Mark Carter
    Created: 07-Apr-2015