I’m learning how to parse big CSV files in Haskell. This is my third attempt. I’ll be trying things that are (hopefully) almost directly translatable to my work, such as parsing addresses out of free text. Good luck to me!
The data we’ll be analysing are the public records of Australian patents. You can get the CSV files from data.gov.au. No, I will not include 785 MB of data (compressed) in this repository. However, you can find the head of that file in the `./data` directory, obtained with
$ head -n 20 IPGOD.IPGOD122B_PAT_ABSTRACTS.csv > pat_abstracts.csv
- Read a CSV file as a stream, so I don’t need to load the entire thing to work on it (sketched below).
- Be able to inspect the stream using something like `take` or `show` with indexing. I assume I would be doing this in GHCi.
- Extract structured information from unstructured text, such as addresses (toy parser sketch below). That’s a big part of what I do for work, and the main motivation for looking beyond Python. I want to move away from regular expressions and do it fast.
- After parsing we must be able to subset the stream according to Boolean constraints. These must be composable (see the predicate sketch below).
- Reshape results into tabular form as a prelude to exporting to CSV or a database.
- At some point we will want to analyse results for things like counts (counting sketch below).
- Encode results back into an output file, or send them to a database (also sketched below).
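To make the streaming goal concrete, here is a minimal sketch using cassava’s `Data.Csv.Streaming` interface over a lazily read file. The record type, the column names (`australian_appl_no`, `abstract_text`) and the file path are guesses for illustration and would need to be checked against the real header of the abstracts file.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Lazy as BL
import           Data.Csv             (FromNamedRecord (..), (.:))
import qualified Data.Csv.Streaming   as S
import           Data.Foldable        (toList)
import           Data.Text            (Text)

-- Hypothetical record type; the field names below are guesses and
-- must be checked against the header of the abstracts CSV.
data Abstract = Abstract
  { applicationNo :: !Text
  , abstractText  :: !Text
  } deriving Show

instance FromNamedRecord Abstract where
  parseNamedRecord r =
    Abstract <$> r .: "australian_appl_no"
             <*> r .: "abstract_text"

main :: IO ()
main = do
  bytes <- BL.readFile "data/pat_abstracts.csv"  -- read lazily
  case S.decodeByName bytes of
    Left err           -> putStrLn err
    Right (_hdr, recs) ->
      -- recs is produced incrementally, so `take 5` only forces
      -- the first few records of the file.
      mapM_ print (take 5 (toList recs))
```

The same calls work interactively in GHCi, which covers poking at the first few records with `take` or `show`.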
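For pulling structure out of free text, parser combinators are the usual Haskell alternative to regular expressions. The following is only a toy sketch with attoparsec, not a real address grammar: it scans a line for the first run of four digits and treats it as an Australian postcode.

```haskell
import           Control.Applicative  ((<|>))
import           Data.Attoparsec.Text
import           Data.Text            (Text)
import qualified Data.Text            as T

-- A four-digit run, naively treated as a postcode.  A real address
-- parser would need far more context than this.
postcode :: Parser Text
postcode = T.pack <$> count 4 digit

-- Scan the input, dropping one character at a time until the
-- postcode parser succeeds (attoparsec backtracks on failure).
findPostcode :: Text -> Maybe Text
findPostcode = either (const Nothing) Just . parseOnly go
  where
    go = postcode <|> (anyChar *> go)

-- ghci> findPostcode (T.pack "1 Example St, Fitzroy VIC 3065")
-- Just "3065"
```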
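Boolean subsetting composes for free, because a constraint is just a predicate. The two combinators below are hypothetical helpers, not from any library; any field projection (like the `abstractText` guess above) turns them into record-level filters.

```haskell
import           Data.Text (Text)
import qualified Data.Text as T

-- Constraints are ordinary predicates; these combinators combine
-- them pointwise so filters stay composable.
(.&&.) :: (a -> Bool) -> (a -> Bool) -> (a -> Bool)
p .&&. q = \x -> p x && q x

(.||.) :: (a -> Bool) -> (a -> Bool) -> (a -> Bool)
p .||. q = \x -> p x || q x

-- A simple constraint on raw abstract text.
mentions :: Text -> Text -> Bool
mentions = T.isInfixOf

-- Subsetting the parsed stream is then e.g.
--   filter (mentions "laser" .&&. mentions "fibre") abstractTexts
-- where abstractTexts :: [Text] comes from the decoded records.
```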
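For simple counts and for writing results back out, a strict `Map` for tallying plus cassava’s `encode` is enough for a first pass; each `(key, count)` pair becomes one CSV row. The key function and the output path are placeholders.

```haskell
import qualified Data.ByteString.Lazy as BL
import           Data.Csv             (encode)
import qualified Data.Map.Strict      as M
import           Data.Text            (Text)

-- Tally items by some key, e.g. countBy yearOfFiling abstracts,
-- where the key function (hypothetical here) is whatever we group on.
countBy :: Ord k => (a -> k) -> [a] -> M.Map k Int
countBy key = M.fromListWith (+) . map (\x -> (key x, 1))

-- Reshape the tallies into rows and encode them: cassava has a
-- ToRecord instance for pairs, so each entry becomes one CSV line.
writeCounts :: FilePath -> M.Map Text Int -> IO ()
writeCounts path = BL.writeFile path . encode . M.toList
```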
Finally, I eagerly welcome help to move this forward. Get in touch!