# Transform data to feed into ML model

This script is run once all of the data (weather data and bloom/flowering data) has been collected and merged.

If not interested in the previous data preparation steps, simply grab `df_with_weather_data.csv` and start here, or skip to the next notebook.

For each location, we have 14 types of weather data by month.

Idea: if we make a prediction on a certain date, e.g. Feb 1, we won't know the March 1st conditions.
We could keep those variables and set them to NA.
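
The masking idea above could be sketched roughly as follows. This is only an illustration: the function name `mask_future_months` and the current-year column naming `<Mon>_<variable>` (e.g. `Mar_1`) are assumptions, not something defined in this notebook, whose weather columns are lagged (`Jan_1_lag_1`, ...) and are therefore left untouched here.

```python
import numpy as np
import pandas as pd

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def mask_future_months(df, prediction_month):
    """Return a copy in which current-year weather columns for months
    on or after `prediction_month` (1-12) are set to NaN.

    Hypothetical naming: current-year columns look like 'Mar_1';
    lagged columns (containing '_lag_') are already known and are kept.
    """
    unknown = set(MONTHS[prediction_month - 1:])
    cols = [c for c in df.columns
            if c.split("_")[0] in unknown and "_lag_" not in c]
    out = df.copy()
    out.loc[:, cols] = np.nan
    return out

# Illustrative data: predicting on Feb 1, so only January is known
demo = pd.DataFrame({"Jan_1": [3.1], "Feb_1": [4.0], "Mar_1": [7.5],
                     "Mar_1_lag_1": [6.8]})
masked = mask_future_months(demo, prediction_month=2)
```

Keeping the masked columns (rather than dropping them) leaves the frame with the same shape for every prediction date, which is the motivation given above.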

Notes:
- this is to predict date of **full bloom**
- idea is to predict a whole year's batch in advance

In [1]:
import pandas as pd
import numpy as np

In [4]:
outputdf = pd.read_csv('../data/df_with_weather_data.csv',low_memory=False,index_col=0)
#outputdf.rename({'blm_day':'day'},inplace=True)

### Data checks

In [5]:
# check whether any flowers bloomed the year prior, i.e. 1214 or so, using approx check based on length
outputdf.loc[((outputdf.day.str[:2]=='12') & (outputdf.day.str.len() > 3))]

Unnamed: 0,l_code,l_name,year,day,rm,l_name_romaji,Jan_1_lag_1,Feb_1_lag_1,Mar_1_lag_1,Apr_1_lag_1,...,Apr_14_lag_3,May_14_lag_3,Jun_14_lag_3,Jul_14_lag_3,Aug_14_lag_3,Sep_14_lag_3,Oct_14_lag_3,Nov_14_lag_3,Dec_14_lag_3,Annual_14_lag_3


### Change blooming date format, add days since year start, add blooming date lags

In [6]:
outputdf['year'] = outputdf.year.astype('int',errors='ignore')
#outputdf['day'] = outputdf.day.astype('int',errors='ignore')

In [7]:
outputdf.year.isnull().sum() # no years missing

0

In [8]:
(outputdf.day == '-').sum() # # of rows with missing value for bloom date, missing is denoted as - in the data

966

In [9]:
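from datetime import datetime  # pd.datetime was removed in pandas 2.0

def create_date(row):
    """Build the bloom date from `year` and `day` (month digit followed by day, e.g. '208' for Feb 8)."""
    try:
        return datetime(row.year, int(row.day[0]), int(row.day[1:]))
    except Exception:
        if row.day[0] == '-':  # '-' marks a missing bloom date
            return np.nan
        print('unknown error')
        return np.nan

def create_days_since_yr_start(row):
    """Days elapsed between Jan 1 and the bloom date; NaN if the date is missing."""
    try:
        return (row.blooming_date - datetime(row.year, 1, 1)).days
    except Exception:
        return np.nan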

In [10]:
outputdf.dtypes[0:5]

l_code     int64
l_name    object
year       int64
day       object
rm        object
dtype: object

In [11]:
from datetime import datetime  # pd.datetime was removed in pandas 2.0
datetime(outputdf.year[0],int(outputdf.day[0][0]),int(outputdf.day[0][1:]))

datetime.datetime(1953, 5, 30, 0, 0)

In [12]:
# create date column
outputdf['blooming_date'] = outputdf.apply(lambda row: create_date(row),axis=1)

In [13]:
outputdf.blooming_date.isnull().sum() # lines up with missing data count above

966

In [14]:
type((outputdf.blooming_date[0] - outputdf.blooming_date[1]).days)

int

In [15]:
outputdf['days_since_yr_start'] = outputdf.apply(lambda row: create_days_since_yr_start(row),axis = 1)

In [16]:
outputdf.days_since_yr_start.isnull().sum() # lines up with missing data count above

966

In [17]:
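# add one to leap years
# NOTE: the date subtraction above already counts Feb 29, so this shifts every leap-year row by an extra day
outputdf.loc[outputdf.year % 4 == 0,'days_since_yr_start'] += 1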
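
As a quick check on the leap-year adjustment, here is a minimal sketch using only the standard library. The helper name `days_since_year_start` and the dates are illustrative; the point is simply to see how plain date subtraction treats Feb 29.

```python
from datetime import datetime

def days_since_year_start(d):
    """Whole days elapsed between Jan 1 of d's year and d."""
    return (d - datetime(d.year, 1, 1)).days

# Mar 1 in a leap year vs. a non-leap year
leap = days_since_year_start(datetime(2016, 3, 1))           # 31 + 29
non_leap = days_since_year_start(datetime(2017, 3, 1))       # 31 + 28
# A date before Feb 29 in a leap year is unaffected by the leap day
before_feb29 = days_since_year_start(datetime(2016, 2, 14))  # 31 + 13

print(leap, non_leap, before_feb29)  # 60 59 44
```

Under this sketch the leap day is already reflected in the subtraction, so an additional `+= 1` would count it twice and would also move leap-year blooms that fall before Feb 29.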

In [18]:
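# issue here if consecutive rows are missing, but they aren't so it's okay
# NOTE: with rows in ascending year order, shift(-lag) pulls the value from `lag` years later;
# shift(lag) would pull from earlier years instead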
max_lag = 3
for lag in range(1,max_lag + 1):
    outputdf[f'days_lag_minus_{lag}'] = outputdf.groupby(['l_code'])['days_since_yr_start'].transform(lambda x: x.shift(-lag))

In [19]:
# sanity check on null values for blooming date lags
for lag in range(1,max_lag + 1):
    print(outputdf[f'days_lag_minus_{lag}'].isnull().sum())

1041
1120
1199
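
To make the direction of the shift concrete, here is a small sketch on made-up data (the values and the column names `next_year` / `previous_year` are illustrative, not from the dataset). It assumes rows are sorted by ascending year within each `l_code`.

```python
import pandas as pd

demo = pd.DataFrame({
    "l_code": [1, 1, 1, 2, 2, 2],
    "year": [2016, 2017, 2018, 2016, 2017, 2018],
    "days_since_yr_start": [61.0, 44.0, 28.0, 90.0, 95.0, 88.0],
})

# Group per location so values never leak across locations
grouped = demo.groupby("l_code")["days_since_yr_start"]

demo["next_year"] = grouped.shift(-1)     # value from the following year
demo["previous_year"] = grouped.shift(1)  # value from the preceding year

print(demo)
```

With ascending years, a negative shift gives each row the following year's value, while a positive shift gives the preceding year's value; the last (or first) row of each group becomes NaN.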


### Reminder of Goal

What we want our data frame to look like

In [20]:
pd.DataFrame([],columns=['bloom_days','location','year','last_yr_{WEATHER}_month','last_yr_bloom_days','2_yr_bloom_days'])

Unnamed: 0,bloom_days,location,year,last_yr_{WEATHER}_month,last_yr_bloom_days,2_yr_bloom_days


### Discard Missing Values (after creating the lags(!))

In [21]:
outputdf = outputdf.loc[outputdf.day != "-"] # remove rows with missing bloom date
# must do this after we set the lags so we don't eliminate years and mess up lag calculation

In [22]:
outputdf.tail()

Unnamed: 0,l_code,l_name,year,day,rm,l_name_romaji,Jan_1_lag_1,Feb_1_lag_1,Mar_1_lag_1,Apr_1_lag_1,...,Sep_14_lag_3,Oct_14_lag_3,Nov_14_lag_3,Dec_14_lag_3,Annual_14_lag_3,blooming_date,days_since_yr_start,days_lag_minus_1,days_lag_minus_2,days_lag_minus_3
6724,936,那覇,2017,208,7,NAHA,17.4,16.9,18.7,23.0,...,--,--,--,--,--,2017-02-08,38.0,29.0,,
6725,936,那覇,2018,130,7,NAHA,18.4,17.1,18.3,21.6,...,--,--,--,--,--,2018-01-30,29.0,,,
6729,945,南大東島,2016,301,7,MINAMI DAITOJIMA,,,,,...,,,,,,2016-03-01,61.0,44.0,28.0,
6730,945,南大東島,2017,214,7,MINAMI DAITOJIMA,,,,,...,,,,,,2017-02-14,44.0,28.0,,
6731,945,南大東島,2018,129,7,MINAMI DAITOJIMA,,,,,...,,,,,,2018-01-29,28.0,,,


### Recall TO DO's:

- We want to exclude any locations where the flowers bloom before the start of the year, to make things simpler (otherwise we would need to remove the previous years' lags to deal with those cases).
  - DONE: as checked at the beginning, this wasn't an issue.
- We want to check what data is missing and which locations should be removed due to lack of data.
  - Removed rows with NA blooming date. Everything else can be flagged with a `var_nameISNA` column.
- We want to add a flowering day lag indicator.
  - DONE
- We want to recode flowering date as days from Jan 1st (will need to add one every leap year, i.e. 2000 ± 4).
  - DONE
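
The `var_nameISNA` flagging mentioned above is not implemented in this notebook; one possible sketch is shown below. The helper name `add_isna_flags` and the demo values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def add_isna_flags(df, cols):
    """Return a copy with a boolean '<col>ISNA' column for each column in `cols`."""
    out = df.copy()
    for col in cols:
        out[f"{col}ISNA"] = out[col].isna()
    return out

# Illustrative data using the notebook's column naming
demo = pd.DataFrame({"Jan_1_lag_1": [17.4, np.nan],
                     "Feb_1_lag_1": [16.9, 17.1]})
flagged = add_isna_flags(demo, ["Jan_1_lag_1", "Feb_1_lag_1"])
```

The flag columns let a model distinguish a genuinely missing value from whatever placeholder is later substituted for it.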

### Methodology

- one batch prediction for the whole year (like you were planning a vacation far in advance and wanted to get an idea of the flower blossom forecast in January)
- model with neural nets for tabular data, like the fastai model for the Rossmann competition
- could also try decision tree based models

### More data checks

In [23]:
import matplotlib.pyplot as plt

In [102]:
outputdf['rownullsums'] = outputdf.isnull().sum(axis=1)

In [108]:
problem_locs = outputdf.groupby('l_code').rownullsums.apply(lambda x: x.mean())

Several problematic locations appear to have had no weather data available.

In [110]:
problem_locs[problem_locs>500] 

l_code
433    546.215686
435    546.115385
597    546.133333
612    546.109091
631    546.113208
648    546.090909
663    546.109091
740    546.107143
744    546.109091
747    546.109091
776    546.183673
778    546.180000
800    546.105263
837    546.820513
909    546.300000
917    546.625000
945    546.452381
Name: rownullsums, dtype: float64

These locations could be removed from further analysis; I will do so here.

In [114]:
l_codes_missing_weather = problem_locs[problem_locs>500].index.values

In [118]:
rows_to_keep_remove_missing_weather = ~outputdf.l_code.isin(l_codes_missing_weather)

In [120]:
outputdf = outputdf.loc[rows_to_keep_remove_missing_weather,:]

### Reformat weather data using quality notes from JP website

(https://www.data.jma.go.jp/obd/stats/data/en/smp/index.html)

| Format | Example | Validity | Notes |
| --- | --- | --- | --- |
| Value) | 11.5) | Quasi-Reliable | Only slight problems were found in the automatic quality control, or the value was computed from the dataset with a few missing data. |
| Value] | 11.5] | Incomplete | The value was computed from a dataset with excessive missing data. |
| - | - | No phenomenon | No phenomenon was observed within the period. |
| X | X | Missing | No value is available due to problems with observation instruments, etc. |
| Blank | &nbsp; | Out of observation | No observation was conducted. |
| * | 31* | Most recent extreme values | The value is the most recently observed of those two or more identical daily extreme values in the period. |
| # | # | Suspicious | A serious quality problem was found in the value, treated as omitted from the statistics. |
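
As a rough illustration of how the markers in the table could be separated from the numeric value, here is a minimal sketch. The function name `parse_jma_value`, the validity labels, and the `"reliable"` / `"unrecognized"` fallbacks are assumptions made for this example; this is not the approach the notebook takes below.

```python
import re

# Hypothetical labels for the trailing markers listed in the table
FLAG_NAMES = {")": "quasi_reliable", "]": "incomplete", "*": "most_recent_extreme"}
# A number, optionally followed by one trailing marker
PATTERN = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*([\)\]\*]?)\s*$")

def parse_jma_value(raw):
    """Return (value, validity) for a single cell; value is None when no number is present."""
    text = "" if raw is None else str(raw).strip()
    if text == "":
        return None, "out_of_observation"
    if text == "-":
        return None, "no_phenomenon"
    if text in ("X", "×"):
        return None, "missing"
    if text == "#":
        return None, "suspicious"
    match = PATTERN.match(text)
    if match is None:
        return None, "unrecognized"
    value, flag = match.groups()
    return float(value), FLAG_NAMES.get(flag, "reliable")

print(parse_jma_value("11.5)"))  # (11.5, 'quasi_reliable')
print(parse_jma_value("-"))      # (None, 'no_phenomenon')
```

Returning the validity alongside the value keeps the quality information available in case it is wanted as a feature later.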

#### Understand weather data validity

In [128]:
weatherdataissues = dict(zip(['quasireliable','incomplete','no_phenom', 'missing', 'no_obs','extreme', 'suspicious'],
                         [[0,')'],
                          [0,']'],
                          [0,'-'],
                          [0,'X'],
                          [0,''],
                          [0,'*'],
                          [0,'#']]))

In [129]:
import calendar

In [130]:
month_abrevs = list(calendar.month_abbr[1:]) + ['Annual']

In [131]:
weather_cols = [i for i in outputdf.columns if i.split("_")[0] in month_abrevs]

In [132]:
outputdf.dtypes[outputdf.dtypes=='int64']

l_code         int64
year           int64
day            int64
rm             int64
rownullsums    int64
dtype: object

In [133]:
for col in weather_cols:
    for key in weatherdataissues.keys():
        # first element of list of values is count of that problem, second is the identifying punct
        # add to count the number observed in that column
        try:
            weatherdataissues[key][0] += outputdf[col].str.endswith(weatherdataissues[key][1]).sum()
        except AttributeError:
            # if the col is a float then there are no missing issues here so we can continue
            print(f'{outputdf[col].isnull().sum()} null vals for col {col}')
            break

22 null vals for col Aug_9_lag_2
22 null vals for col Sep_9_lag_2
22 null vals for col Oct_9_lag_2
41 null vals for col Aug_9_lag_3
41 null vals for col Sep_9_lag_3
41 null vals for col Oct_9_lag_3


In [134]:
weatherdataissues

{'quasireliable': [20295, ')'],
 'incomplete': [1029, ']'],
 'no_phenom': [108425, '-'],
 'missing': [0, 'X'],
 'no_obs': [2578816, ''],
 'extreme': [0, '*'],
 'suspicious': [0, '#']}

There don't appear to be major issues with data validity. (The `no_obs` count is not meaningful, because every string ends with the empty string.) I'm not going to add separate columns for each data issue to the main data frame for now; instead I will strip the punctuation.
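
One caveat when reading the counts above: `str.endswith` is a suffix test, so some markers match more values than intended. A small sketch on made-up values (the Series contents are illustrative only):

```python
import pandas as pd

values = pd.Series(["11.5)", "3.2]", "-", "--", "7.0", ""])

# Every string ends with the empty string, so this counts all string cells
ends_with_empty = values.str.endswith("").sum()
# Only genuinely blank cells
truly_empty = (values.str.strip() == "").sum()

# A suffix test for '-' also matches '--'
ends_with_dash = values.str.endswith("-").sum()
exactly_dash = (values == "-").sum()

print(ends_with_empty, truly_empty, ends_with_dash, exactly_dash)  # 6 1 2 1
```

Exact comparisons (or an anchored regular expression) give counts that correspond to a single marker.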

#### Remove weather data quality markings, convert cols to float

We need to iterate over each column:

- use a Series method to remove any special characters
- if the value is '-', we need to do something special to indicate no phenomenon; try -1000 for now
- if the value is 'X', we need to replace it with np.nan

In [143]:
for col in weather_cols:
    # coerce to numeric indicator for no phenomenon:
    # a trailing '-' becomes '1000', so '-' -> '1000' and '--' -> '-1000'
    x = outputdf[col].replace(r'-$','1000',regex=True)
    x = x.replace('×',np.nan) # missing
    x = x.replace(r' [\)\]]$','',regex=True) # remove quality indicator
    x = x.replace(r'^\s','1000',regex=True) # coerce to numeric indicator for no obs conducted
    # astype returns a new Series, so assign it back (otherwise the conversion is discarded)
    outputdf[col] = x.astype('float64')

In [145]:
outputdf.to_csv('../data/data_clean.csv')