module Y2017.M10.D30.Exercise where
{--
Hello! Welcome back! Happy Monday!
Of course, when you're working in a start-up, the days can run together in a
blur, I've found.
So, we've been working with the NYT archive for more than a month now. Did you
notice anything about the data? I have:
--}
import Control.Monad
import Data.Map (Map)
-- below imports available via 1HaskellADay git repository
import Y2017.M10.D23.Exercise -- for Article structure
import Y2017.M10.D24.Exercise (articlesFromFile)
sampleText :: String
sampleText = "GALVESTON, Tex. \226\128\148 Adolfo Guerra, a landscaper"
-- The thing is, if you look at this in a text editor (worth its salt) you see:
-- GALVESTON, Tex. — Adolfo Guerra, a landscaper
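-- Why the editor shows a dash: the three escapes are the bytes 0xE2 0x80 0x94,
-- which are the UTF-8 encoding of U+2014 (the em dash), stored here one byte
-- per Char. A minimal sketch, assuming a well-formed three-byte UTF-8
-- sequence; decodeThreeByteSketch is a hypothetical helper for illustration
-- only, not part of the exercise. It uses nothing beyond the Prelude.

decodeThreeByteSketch :: Char -> Char -> Char -> Char
decodeThreeByteSketch a b c =
   -- lead byte carries the top 4 bits, each continuation byte carries 6 bits
   toEnum ((fromEnum a - 224) * 4096
         + (fromEnum b - 128) * 64
         + (fromEnum c - 128))

-- For the bytes above, decodeThreeByteSketch '\226' '\128' '\148' yields
-- '\8212', that is, U+2014.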
{--
So, we need:
1. to identify the documents that have special characters
2. the context of where these special characters occur
3. once marked, replace the special characters with a simple ASCII equivalent
Let's do it.
--}
data Context = Ctx { ctxid :: Integer, spcCharCtx :: SpecialCharacterContext }
   deriving (Eq, Show)
type SpecialCharacterContext = Map SpecialChars [Line]
type SpecialChars = String
type Line = String
identify :: MonadPlus m => Article -> m Context
identify art = undefined
-- identifies the places in the text where there are special characters and
-- shows these characters in context ... now, since the full text is one
-- continuous line, getting a line is rather fun, isn't it? How would you
-- form the words around the special characters? Also, how do you extract
-- the special characters from the full text body?
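
-- One possible shape for the MonadPlus part of the question above, as a hedged
-- sketch only. The Article structure lives in another module, so this takes a
-- hypothetical (id, body) pair instead, and it ASSUMES that "special" means
-- any Char above '\127' (non-ASCII). identifySketch is an illustrative name,
-- not the intended solution. It fails with mzero when the body has nothing
-- special; otherwise it returns the id and the words containing special chars.

identifySketch :: MonadPlus m => Integer -> String -> m (Integer, [String])
identifySketch ix body =
   let hits = filter (any (> '\127')) (words body)
   in  if null hits then mzero else return (ix, hits)

-- With m as Maybe, a plain-ASCII body gives Nothing, while sampleText gives
-- Just the id paired with the one word made of special characters.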
type BodyContext = [String]
extractSpecialChars :: BodyContext -> [(SpecialChars, Line)]
extractSpecialChars ctx = undefined
-- checks whether we're at a run of special characters, then accumulates a
-- context around those characters. N.b.: special characters can occur as their
-- own word, or at the beginning, in the middle, or at the end of a word,
-- e.g. ("Don't..." he said): if the quotes, apostrophe, and ellipsis are
-- special characters, then you have special characters all around the word
-- 'don't', including the (') within the word.
{--
So you should get the following:
>>> extractSpecialChars (words sampleText)
[("\226\128\148","GALVESTON, Tex. \226\128\148 Adolfo Guerra,")]
--}
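
-- The expected output above suggests a context of two words on either side of
-- the special characters. A hedged sketch of that windowing idea follows; all
-- three names are hypothetical, and it ASSUMES "special" means any Char above
-- '\127'. It handles special characters at the start, middle, or end of a
-- word by collecting each maximal run separately. Prelude only.

isSpecialCharSketch :: Char -> Bool
isSpecialCharSketch c = c > '\127'

-- every maximal run of special characters inside a single word
specialRunsSketch :: String -> [String]
specialRunsSketch w =
   case dropWhile (not . isSpecialCharSketch) w of
      []   -> []
      rest -> let (run, more) = span isSpecialCharSketch rest
              in  run : specialRunsSketch more

-- pairs each run with up to n words of context on either side of its word
contextSketch :: Int -> [String] -> [(String, String)]
contextSketch n ws =
   [ (run, unwords (take (i + n + 1 - lo) (drop lo ws)))
   | (i, w) <- zip [0 ..] ws
   , let lo = max 0 (i - n)        -- clamp the window at the start of the text
   , run <- specialRunsSketch w ]

-- contextSketch 2 (words sampleText) reproduces the single pair shown above.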
-- With the set of articles from Y2017/M10/D24/hurricanes.json.gz:
-- what are the special characters, in which articles, and in what contexts?
-- Tomorrow we'll look at building a replacement dictionary.