- Stratify the data better (by LOB in particular) to better relect realistic scenarios
- Example - for commercial property setting the value of the individual site, use external data
- Not worthwhile spending time here Going between rpy2 (using Python in R, seperate notebooks for now) - Easier in Data bricks %R, %sql, %Python magic commands
- Include some outliers for testing (so can create a control table to flag or drop these fields)
Project sits in the master branch
Exploring ways to share code & demos via new code space feature Codespace is pretty basic - about 5gb RAM Linux based Can handle some of the smaller demos Instance shuts down automatically after a period of inactivity - quite slow to boot up
Coding is primarly (at this point in time) written in both R & Python Both R & Python scripts will have a dedicated Engine to power the behind the scenes stuff - install (if not installed) packages The engine also contains user defined functions that are used regularly & can likely to be required to run a particular project
01_Create_Data Generate random data - EDA it, model it, map it. Continual work in progress
Some of the Scripts in this can be run in the new CodeSpace feature
Random data is consists of these fields - will likely evolve to enrich the data Geographic data - for example Latidues & Longitudes are randomly generated
- Then reverse geocoded to get the postcode
- The open source version is new, very slow - many ways to break this down for example:
- Break down the data set into seperate samples and run on different machines
- Run time about 5 hours (average) for 10,000 random rows of lats & longs)
- Currently just random, so good luck finding a postcode in the Adriatic
- Run time about 5 hours (average) for 10,000 random rows of lats & longs)
- Break down the data set into seperate samples and run on different machines
- The open source version is new, very slow - many ways to break this down for example:
Another appproach for generating REAL Postcodes is via python random addresses US
- Currently only generates random postcodes for a few states in the US
Randomly Generated fields below
The link below creates random data based on criteria to generate random data, sample size is optional, simple logic for example you can't have a claim count where there was no conversion (sale) to begin with
The below is a generic random dataset for multiple lines of business
- work to be done here on the data sample - a claim for £200 for a phone is not the same as £2,000,000 for comercial property
- Example creating a random data set purely for the motor market - so more sensible inputs in terms of premium claims etc (second hyperlink)
"Customer_ID": Cryptographically generated random identifiers "Purchase_Date": Dates are random within a given range, Purchase date must always be earlier or equal to - Cover_Start_Date "Cover_Start_Date": Date Cover Starts - Random "LOB": Line Of Business "Sale_Flag": Binary - 0/1 "Purchase_Price": Randomly generated - needs to be taylored for each LOB "Claims_Count": Number of claims in the customers history - currently just between 0/1 at random - only generated for sales (work to be done here) "Convictions Count": Number of historical convictions (regardless of sale 0/1 here), bound between 0 -> 5 at random "Period_of_Cover": 12, 24, 36, 48 months randomly sampled "Premium": Is Premium, random, needs logic to keep it sensible by LOB etc "Age": Between 18 & 80 random "Broker": Random between -> 'london_ins', 'some_syndicate', 'some_mga' # Could add in weights for balance
Variable | Definition |
---|---|
Customer_ID: | Cryptographically generated random identifiers |
Purchase_Date: | Dates are random within a given range, Purchase date must always be earlier or equal to - Cover_Start_Date |
Cover_Start_Date: | Date Cover Starts - Random |
LOB: | Line Of Business |