In [1]:
import requests
from IPython.display import Markdown

import re
import numpy as np
import pandas as pd
from datetime import datetime

**Warning!** This is a solution. If you would like to attempt these
[Agile Scientific](https://agilescientific.com/blog/2020/4/16/geoscientist-challenge-thyself)
challenges on your own, please visit this
[Jupyter Notebook](https://colab.research.google.com/drive/1eP68NTV-GA3R-BYUh-CUxcgYDQ5IuetS)
to get started.


## Functions for URL requests
First a few functions to use along the way...

In [2]:
def get_data(url, key):
    # Request the dataset generated for the given key
    params = {'key': key}
    r = requests.get(url, params)
    return r.text

def get_question(url):
    r = requests.get(url)
    return r.text

def check_answer(questionNum,answer):
    # Relies on the module-level `url` and `my_key` defined in later cells
    params = {'key':my_key,
              'question':questionNum,
              'answer':answer
             }
    result = requests.get(url, params)
    return Markdown(result.text)
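
To make the request sent by `check_answer` more concrete, here is a minimal sketch of how the query string is assembled. It uses only the standard library rather than `requests`, and the key `my-key` and the answer `1234` are placeholders for illustration, not real values.

```python
from urllib.parse import urlencode

# Hypothetical values for illustration only
base_url = 'https://kata.geosci.ai/challenge/sample-names'
params = {'key': 'my-key', 'question': 1, 'answer': 1234}

# requests builds an equivalent URL when given the same params dict
full_url = f'{base_url}?{urlencode(params)}'
print(full_url)
```

Passing the parameters as a dictionary, as the helper functions do, avoids formatting the URL by hand.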

## Request Challenge Description

In [3]:
url = 'https://kata.geosci.ai/challenge/sample-names' 
r = get_question(url)

Markdown(r)

# Sample names

You have a set of sample names. They look like this:

    001235_Ainsa_Sobrarbe_C_2016-04-20_PCx
    ^^^^^^ ^^^^^ ^^^^^^^^ ^ ^^^^^^^^^^ ^^^
      1      2      3     4      5      6

A **valid name** consists of 6 parts separated by underscores. The parts are underlined, above. Note that the parts might not be correct or consistent. Having 6 parts, whether they are correct or not, is enough to be called 'valid'. There may be other problems, for example with the spelling or formatting of individual parts, but we will still call it 'valid'.

The 6 parts are:

- **Unique identifier** consisting of 6 characters.
- **Basin name.** Note that spellings are not guaranteed to be correct.
- **Unit or Formation name.** Note that spellings are not guaranteed to be correct.
- **Specimen type**, either H or C (hand or core).
- **Date**, which must be in ISO 8601 YYYY-MM-DD format to be considered correct.
- **Preparation codes** of at least one character.

We need to extract some information from this dataset.

1. How many valid sample names are there?
2. How many valid samples were taken in the Ainsa basin? Include records with misspelt basin names.
3. What's the longest period of days with no valid samples taken in Ainsa?

If looking for misspellings, we'll assume that any word starting and ending in the same letters, but with the middle letters scrambled, is the same word. So 'Anisa' is a misspelling of 'Ainsa', but 'Aimsa' is not. We'll also assume that the spelling with the most occurrences is the correct spelling.


## Example

Here's some sample data:

    001235_Ainsa_Sobrarbe_C_2016-04-20_PCx
    001236_Ainsa_Sobrarbe_H_2016-04-21_P
    001237_Anisa_Sobrarbe_H_2016-04-29_TCx
    001238_Sorbas_Gochar_2017-06-03_PxM
    001238_Sorbas_Gochar_C_2017-06-03_PxM
    001240_SORBAS_Gochar_C_2017-06-03_PxM

Let's answer the 3 questions for this sample dataset:

- There are **5** valid names (and 1 invalid one, with no specimen type).
- The Ainsa Basin appears in **3** sample names (including 1 misspelling).
- There is a **7** day period with no samples taken, between 21 April and 29 April.


## Hints

It's likely that the `datetime` library will be useful in answering question 3. In particular, this code is useful:

    from datetime import datetime
    datetime.fromisoformat('2016-07-03')
    
If that command fails on a date, then you should consider the date format incorrect and ignore that record.


## A quick reminder how this works

You can retrieve your data by choosing any Python string as a **`<KEY>`** and substituting here:
    
    https://kata.geosci.ai/challenge/sample-names?key=<KEY>
                                                      ^^^^^
                                                      use your own string here

To answer question 1, make a request like:

    https://kata.geosci.ai/challenge/sample-names?key=<KEY>&question=1&answer=1234
                                                      ^^^^^          ^        ^^^^
                                                      your key       Q        your answer

[Complete instructions at kata.geosci.ai](https://kata.geosci.ai/challenge)

----

© 2020 Agile Scientific, licensed CC-BY

## My solution

Let's enter a seed phrase and get the data.

In [4]:
my_key = 'armstrys'

## Input
r = get_data(url, my_key)

r[:1000]

'000067_Ainsa_Sobrarbe_H_2000-01-01_P\n000068_Ainsa_Sobrarbe_H_2000-01-02_TC\n000069_Sorbas_Gochar_H_2000-01-02_PM\n000070_Sorbas_Gochar_H_2000-01-02\n000071_Sorbas_Gochar_H_2000-01-02_TC\n000072_Sorbas_Gochar_H_2000-01-02_TM\n000075_Sorbas_Gochar_H_2000-01-04_PTC\n000077_Sorbas_Gohcar_C_2000-01-05_Cx\n000078_Sorbas_GOCHAR_H_05-01-00_PTM\n000079_Sorbas_Zorreras_H_2000-01-07_C\n000080_Sorbas_Zorreras_H_2000-01-07_TCxM\n000081_sorbas_Zorreras_H_2000-01-08_C\n000082_Tremp_Tremp_H_2000-01-08_PTC\n000083_Tremp_Pasraela_C_2000-01-08_PxM\n000085_Tremp_Pasarela_C_2000-01-08_PTCx\n000086_Tremp_Pasarela_H_2000-01-08_TCM\n000090_Tremp_Pasarela_H_2000-01-08_xM\n000092_Tremp_Pasarela_H_2000-01-09_C\n000093_Tremp_Pasarela_H_2000-01-09_TCxM\n000094_Asana_Lleida_H_2000-01-10_TCx\n000095_Asana_Lleida_H_2000-01-11_C\n000096_asana_Lleida_2000-01-11_PM\n000097_asana_Lleida_H_2000-01-12_CM\n000098_asana_MADRID_C_2000-01-12_C\n000100_asana_Barcelona_C_2000-01-12_TCM\n000101_asana_Bacrelona_2000-01-14_Cx\n00

## Processing into Pandas dataframe

In [5]:
samples = pd.Series(r.split('\n')).str.title()
samples_df = samples.str.split('_',expand=True).dropna()
samples_df.columns = ['ID','Basin_Name','FmName','SpecType','Date','PrepCode']
samples_df['PrepCode'] = samples_df['PrepCode'].str.upper()

print('First 10 rows of the dataframe:')
samples_df.head(10)


First 10 rows of the dataframe:


| | ID | Basin_Name | FmName | SpecType | Date | PrepCode |
|---|---|---|---|---|---|---|
| 0 | 67 | Ainsa | Sobrarbe | H | 2000-01-01 | P |
| 1 | 68 | Ainsa | Sobrarbe | H | 2000-01-02 | TC |
| 2 | 69 | Sorbas | Gochar | H | 2000-01-02 | PM |
| 4 | 71 | Sorbas | Gochar | H | 2000-01-02 | TC |
| 5 | 72 | Sorbas | Gochar | H | 2000-01-02 | TM |
| 6 | 75 | Sorbas | Gochar | H | 2000-01-04 | PTC |
| 7 | 77 | Sorbas | Gohcar | C | 2000-01-05 | CX |
| 8 | 78 | Sorbas | Gochar | H | 05-01-00 | PTM |
| 9 | 79 | Sorbas | Zorreras | H | 2000-01-07 | C |
| 10 | 80 | Sorbas | Zorreras | H | 2000-01-07 | TCXM |


## Question 1
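Rows with fewer than six parts were dropped when the dataframe was built, so the number of valid samples is simply the number of rows that remain.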

In [7]:
answer1 = len(samples_df)

Markdown(f'There are **{answer1}** valid samples.\n')


There are **9125** valid samples.


In [8]:
## Check
questionNum = 1
check_answer(questionNum,answer1)

Correct
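
As a sanity check, the validity rule can be applied to the six example names from the challenge description, where the expected count is 5. This is a plain-Python sketch rather than the pandas approach used above, and `is_valid` is a hypothetical helper introduced only for this illustration.

```python
# The six example names from the challenge description
example = """001235_Ainsa_Sobrarbe_C_2016-04-20_PCx
001236_Ainsa_Sobrarbe_H_2016-04-21_P
001237_Anisa_Sobrarbe_H_2016-04-29_TCx
001238_Sorbas_Gochar_2017-06-03_PxM
001238_Sorbas_Gochar_C_2017-06-03_PxM
001240_SORBAS_Gochar_C_2017-06-03_PxM"""

def is_valid(name):
    # Hypothetical helper: a name is 'valid' when it has exactly 6 parts
    return len(name.split('_')) == 6

valid_names = [name for name in example.splitlines() if is_valid(name)]
print(len(valid_names))
```

The one rejected name is the fourth, which has no specimen type and therefore only five parts.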

## Question 2
To find all of the Ainsa Basin samples, we write a function that returns every row whose value in a given column matches a word, including the misspellings described in the instructions.

In [9]:
def df_match(samples_df,colName,match):
    '''
    Subset dataframe to rows with a column matching a given string. The item
    can be misspelled as long as the first and last characters are correct and
    the middle characters are drawn from the middle letters of the target.

    Args:
        samples_df (dataframe): Input dataframe.
        colName (str): Name of the column to search for matches.
        match (str): String to match in the column.

    Returns:
        df (dataframe): Dataframe reduced to rows whose column matches.
    '''

    ## Build a forgiving regular expression to match. Note that the character
    ## class accepts repeated middle letters, so this is slightly more
    ## permissive than a strict scramble of the target word.

    firstChar, midChars, lastChar = match[0], match[1:-1], match[-1]

    regExp = re.compile(f'({firstChar}'+ # match first char
                        f'[{midChars}]{{{ len(midChars) }}}'+ # match the right number of mid chars, in any order
                        f'{lastChar})', re.IGNORECASE) # match last char

    df = (samples_df.loc[samples_df[colName]
                        .str.extract(regExp).dropna().index,:])
    return df

answer2 = len(df_match(samples_df,'Basin_Name','ainsa'))

Markdown(f'There are **{answer2}** valid samples in the Ainsa Basin.\n')



There are **1507** valid samples in the Ainsa Basin.


In [10]:
## Check
questionNum = 2
check_answer(questionNum,answer2)

Correct
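
The regular expression in `df_match` was enough to get the right answer here, but it would also accept a string such as 'Aiiia', because a character class allows letters to repeat. A stricter check, sketched below under the challenge's definition of a misspelling, compares the sorted middle letters instead. `is_scramble` is a hypothetical helper and not part of the solution above.

```python
def is_scramble(word, target):
    # Hypothetical helper: same first and last letters, and the middle
    # letters are a rearrangement of the target's middle letters
    w, t = word.lower(), target.lower()
    return (len(w) == len(t)
            and w[0] == t[0]
            and w[-1] == t[-1]
            and sorted(w[1:-1]) == sorted(t[1:-1]))

# Basin names from the five valid example records in the challenge description
basins = ['Ainsa', 'Ainsa', 'Anisa', 'Sorbas', 'SORBAS']
ainsa_count = sum(is_scramble(b, 'Ainsa') for b in basins)
print(ainsa_count)

print(is_scramble('Aimsa', 'Ainsa'))  # different letter, so not a misspelling
print(is_scramble('Aiiia', 'Ainsa'))  # repeated letters are rejected
```

Sorting the middle letters makes the comparison independent of their order while still requiring every letter to appear the right number of times.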

## Question 3

In [11]:
def isodate(x):
    ''' Convert an ISO 8601 date string to a datetime, or NaT if it cannot be converted.
    '''
    try:
        return datetime.fromisoformat(x)
    except ValueError:
        return pd.NaT

def longest_gap(samples_df):
    '''
    Find the longest gap, in whole days with no samples, between
    consecutive samples in the dataframe.
    '''

    samples_df = samples_df.copy()  # avoid modifying the caller's dataframe
    samples_df['Date'] = samples_df['Date'].apply(isodate)
    # Comparisons with NaT are False, so this also drops unparseable dates
    samples_df = samples_df.loc[samples_df['Date']>datetime(1900,1,1)]

    samples_df = samples_df.sort_values(by=['Date'], ascending=True).reset_index(drop=True)
    samples_df['Days_SincePrev'] = samples_df['Date'].diff(1)
    # Subtract 1 so only the sample-free days between two sampling dates are counted
    gap = max(samples_df['Days_SincePrev'].dropna()).days - 1
    return gap

df_ainsa = df_match(samples_df,'Basin_Name','ainsa')
answer3 = int(longest_gap(df_ainsa))

Markdown(f'Longest gap between Ainsa Basin valid samples was **{answer3}** days.\n')



Longest gap between Ainsa Basin valid samples was **187** days.


In [12]:
## Check
questionNum = 3
check_answer(questionNum,answer3)

Correct! The next challenge is: https://kata.geosci.ai/challenge/prospecting - good luck!
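
The gap calculation can also be checked against the example in the challenge description, where the expected answer is 7 days. The sketch below uses only the standard library, and `parse_iso` is a hypothetical stand-in for the `isodate` function above that returns `None` instead of `NaT`. The extra date '05-01-00' is one of the badly formatted dates seen in the data.

```python
from datetime import datetime

def parse_iso(text):
    # Hypothetical helper: return a datetime, or None if the text is not ISO 8601
    try:
        return datetime.fromisoformat(text)
    except ValueError:
        return None

# Ainsa dates from the example, plus one badly formatted date seen in the data
raw_dates = ['2016-04-20', '2016-04-21', '2016-04-29', '05-01-00']
dates = sorted(d for d in (parse_iso(s) for s in raw_dates) if d is not None)

# Days strictly between consecutive sampling dates
gaps = [(later - earlier).days - 1 for earlier, later in zip(dates, dates[1:])]
longest = max(gaps)
print(longest)
```

The difference between 21 April and 29 April is 8 days, and subtracting 1 leaves the 7 days on which no sample was taken.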