In [0]:
import os
import requests
from urllib.parse import urljoin

try:
  import forecastcards
except ImportError:
  # Install from GitHub only when the package is not already available.
  ! pip install --upgrade git+https://github.com/e-lo/forecastcards#egg=forecastcards
  import forecastcards

# Data Validation and Preparation

1. Find and map where the data is
2. Validate data conforms to schema
3. Combine data
4. Clean and format data


## 1 - Map the Cards

Finds all the relevant cards and assigns each one a card type so that it can be checked against the matching data schema.

Returns a dictionary of card locations by card type.

**`map_cards`**`(  
    repo_loc = default_repo_api,   
    subdirs  = default_subdirs 
  ):`


  - **`repo_loc`** - API tree URL for data to be used
  - **`subdirs`**  - list of subdirs to search through
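
As a rough illustration of the mapping step, the sketch below groups file paths from a GitHub tree listing by card type. It is not the `forecastcards` implementation: the helper `map_tree_to_cards`, the `CARD_PREFIXES` table, and the example file names are all assumptions made for this example, and the sample tree is hand-written so the cell runs without network access.

```python
from collections import defaultdict
from posixpath import basename

# Hypothetical filename prefixes for each card type (an assumption for
# illustration, not taken from the forecastcards source).
CARD_PREFIXES = {
    "poi": "poi",
    "scenario": "scenario",
    "project": "project",
    "observations": "observations",
    "forecast": "forecast",
}


def map_tree_to_cards(tree, subdirs):
    """Group CSV paths from a GitHub tree listing by card type."""
    card_locs = defaultdict(list)
    for entry in tree:
        path = entry["path"]
        # Skip directories ("tree" entries) and anything that is not a CSV file.
        if entry.get("type") != "blob" or not path.endswith(".csv"):
            continue
        # Keep only files that live under one of the requested subdirectories.
        if not any(path.startswith(d.rstrip("/") + "/") for d in subdirs):
            continue
        name = basename(path)
        for card_type, prefix in CARD_PREFIXES.items():
            if name.startswith(prefix):
                card_locs[card_type].append(path)
                break
    return dict(card_locs)


# A hand-written stand-in for the "tree" list in a GitHub API response.
sample_tree = [
    {"path": "examples/ecdot-lu123/project-lu123.csv", "type": "blob"},
    {"path": "examples/ecdot-lu123/poi-lu123.csv", "type": "blob"},
    {"path": "examples/ecdot-lu123/README.md", "type": "blob"},
    {"path": "spec/en/poi-schema.json", "type": "blob"},
    {"path": "examples", "type": "tree"},
]

sample_card_locs = map_tree_to_cards(sample_tree, ["examples"])
```

The result has the same shape as the dictionary described above: one key per card type, each holding a list of file locations.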




In [0]:
repo_api = "https://api.github.com/repos/e-lo/forecast-cards/git/trees/f35185168b238429157adcbf5ba09d09ae7d0172?recursive=1"

subdirs = ["examples"]

card_locs = forecastcards.map_cards(repo_loc=repo_api,subdirs=subdirs)
#print(card_locs)

## 2 - Validate Forecast Card Data

Uses [Frictionless Good Tables](https://github.com/frictionlessdata/goodtables-py) to validate that the data matches the schemas.

Returns a dictionary of reports by card type.

**`validate_cards`**`(  
    card_locs,
    schemas_loc 
   ):`


  - **`card_locs`** - dictionary of `card type`: list of files
  - **`schemas_loc`** - dictionary of `card type` : schema locations

**TIP:** If data doesn't validate, try to resolve the errors with the GUI at https://try.goodtables.io
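
To make the idea of schema validation concrete, here is a minimal pure-Python sketch that checks rows against a Table Schema style dictionary. It is a stand-in for illustration only and does not use or reproduce Good Tables; the function `check_rows`, the field names, and the sample rows are assumptions.

```python
def check_rows(rows, schema):
    """Return a list of (row_number, field_name, problem) tuples."""
    # Only a few field types are handled; unknown types are treated as strings.
    casters = {"integer": int, "number": float, "string": str}
    problems = []
    for row_number, row in enumerate(rows, start=1):
        for field in schema["fields"]:
            name = field["name"]
            value = row.get(name, "")
            required = field.get("constraints", {}).get("required", False)
            if value == "":
                if required:
                    problems.append((row_number, name, "missing required value"))
                continue
            try:
                casters.get(field.get("type", "string"), str)(value)
            except ValueError:
                problems.append((row_number, name, "wrong type"))
    return problems


# Hypothetical schema and rows, written by hand for this example.
sample_schema = {
    "fields": [
        {"name": "poi_id", "type": "string", "constraints": {"required": True}},
        {"name": "forecast_value", "type": "number"},
    ]
}
sample_rows = [
    {"poi_id": "a1", "forecast_value": "12000"},
    {"poi_id": "", "forecast_value": "n/a"},
]

sample_problems = check_rows(sample_rows, sample_schema)
```

Here the first row passes and the second row produces two problems: a missing required `poi_id` and a `forecast_value` that cannot be read as a number.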

In [0]:
schema_locs = { "poi"         : "https://raw.github.com/e-lo/forecast-cards/master/spec/en/poi-schema.json",
                "scenario"    : "https://raw.github.com/e-lo/forecast-cards/master/spec/en/scenario-schema.json",
                "project"     : "https://raw.github.com/e-lo/forecast-cards/master/spec/en/project-schema.json",
                "observations": "https://raw.github.com/e-lo/forecast-cards/master/spec/en/observations-schema.json",
                "forecast"    : "https://raw.github.com/e-lo/forecast-cards/master/spec/en/forecast-schema.json",
}

data_reports = forecastcards.validate_cards(card_locs,schema_locs)
    

In [0]:
data_reports['poi']

## 3 - Combine Data

Combines the data from all of the mapped cards into a single dataframe.

Returns a dataframe.

**`combine_data`**`(  
    card_locs
   ):`


  - **`card_locs`** - dictionary of `card type`: list of files
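
Combining card tables into one dataframe can be pictured as a series of joins. The sketch below uses plain pandas merges on small hand-made tables; the key columns `project_id` and `poi_id` and the sample values are assumptions for illustration, and the joins used inside `forecastcards` may differ.

```python
import pandas as pd

# Hand-made stand-ins for three card types.
forecasts = pd.DataFrame({
    "project_id": ["p1", "p1", "p2"],
    "poi_id": ["a", "b", "c"],
    "forecast_value": [1000.0, 2500.0, 400.0],
})
observations = pd.DataFrame({
    "poi_id": ["a", "b"],
    "obs_value": [900.0, 2700.0],
})
projects = pd.DataFrame({
    "project_id": ["p1", "p2"],
    "project_type": ["widening", "new road"],
})

# Left joins keep every forecast row, even when no observation exists yet.
combined = (
    forecasts
    .merge(observations, on="poi_id", how="left")
    .merge(projects, on="project_id", how="left")
)
```

Forecasts without a matching observation end up with `NaN` in `obs_value`, which is the kind of gap the cleaning step later has to deal with.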

In [0]:
all_df = forecastcards.combine_data(card_locs)

In [0]:
all_df.dtypes

## 4 - Clean and Recode

- Fix missing values
- Code categorical variables
- Scale for estimation

Note that this entire process can be executed by calling `default_data_clean(df)`.

### Fix Missing Values

Returns a dataframe in which missing values of selected variables are recoded to 'missing' and records that lack a value for any required variable are dropped.

**`fix_missing_values`**`(  
    dataframe,
    recode_na_vars = default_recode_na_vars,
    no_na_vars     = default_no_na_vars
   ):`


  - **`recode_na_vars`** - list of variables to recode NA to "missing"
  - **`no_na_vars`** - list of variables where having an NA isn't acceptable
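
The two behaviours described above can be approximated with standard pandas calls. This is a sketch under the assumption that recoding uses `fillna` and dropping uses `dropna`; the helper name `fix_missing_sketch` and the sample data are made up for this example and are not the library's code.

```python
import pandas as pd


def fix_missing_sketch(df, recode_na_vars, no_na_vars):
    """Recode NA to 'missing' in some columns, drop rows with NA in others."""
    out = df.copy()
    # Categorical descriptors: keep the row, but label the gap explicitly.
    out[recode_na_vars] = out[recode_na_vars].fillna("missing")
    # Essential values: a row without them cannot be used for estimation.
    out = out.dropna(subset=no_na_vars)
    return out


sample = pd.DataFrame({
    "area_type": ["urban", None, "rural"],
    "forecast_value": [100.0, 200.0, None],
    "obs_value": [90.0, 210.0, 50.0],
})

cleaned = fix_missing_sketch(
    sample,
    recode_na_vars=["area_type"],
    no_na_vars=["forecast_value", "obs_value"],
)
```

In this sample the second row keeps its place with `area_type` set to `"missing"`, while the third row is dropped because it has no `forecast_value`.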

In [0]:
recode_na_vars = ['forecast_system_type', 'area_type', 'forecaster_type', 'state', 'agency', 'functional_class','facility_type','project_type']
no_na_vars     = ['scenario_date','forecast_creation_date','forecast_value','obs_value']

select_df = forecastcards.fix_missing_values(all_df,
                                             recode_na_vars=recode_na_vars,
                                             no_na_vars=no_na_vars)

### Create Categorical Variables

1. Adds categorical variables for project size (cutoff: 30k), scenario decade, and forecast decade.

   Returns a dataframe.

   **`create_default_categorical_vars`**`(  
    dataframe
   )`
   
2. Recodes categorical variables to dummy variables.

    Returns a dataframe.

    **`categorical_to_dummy`**`(  
    dataframe,
    categorical_cols_list=default_categorical_cols,
    required_vars = default_required_vars
   ):`
   
 
 - **`categorical_cols_list`** - list of columns that will be recoded  
 - **`required_vars`** - list of variables that will be kept
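
The two steps above can be sketched with pandas. The example assumes that project size is measured by `forecast_value`, that values above 30,000 count as large, and that decades are derived from a date column; these choices, along with the column `large_project` and the sample data, are illustrative guesses rather than the library's definitions.

```python
import pandas as pd

sample = pd.DataFrame({
    "forecast_value": [12000.0, 45000.0, 30000.0],
    "forecast_creation_date": ["1998-05-01", "2004-07-15", "2011-01-20"],
    "area_type": ["urban", "rural", "urban"],
})

# Step 1: derive categorical variables.
# Size flag using the 30k cutoff (which side the cutoff falls on is assumed).
sample["large_project"] = sample["forecast_value"] > 30000
# Decade label such as "1990s", taken from the year of the date column.
years = pd.to_datetime(sample["forecast_creation_date"]).dt.year
sample["forecast_decade"] = (years // 10 * 10).astype(str) + "s"

# Step 2: expand the categorical columns into one dummy column per category.
dummies = pd.get_dummies(sample, columns=["area_type", "forecast_decade"])
```

After the second step each category has its own indicator column, for example `area_type_urban` and `forecast_decade_1990s`.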

In [0]:
select_df = forecastcards.create_default_categorical_vars(select_df)
estimate_df = forecastcards.categorical_to_dummy(select_df)

In [0]:
estimate_df.dtypes

### Scale 

Returns a dataframe that has the dummy variables scaled to the forecast value so that the estimation isn't biased.

**`scale_dummies_by_forecast_value`**`(  
    dataframe,
    no_scale_cols=default_no_scale_cols
   ):`


  - **`no_scale_cols`** - list of variables that won't be scaled
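
One plausible reading of this scaling is that each dummy column is multiplied by the forecast value in the same row, so that a dummy's coefficient acts as an adjustment proportional to the forecast. The sketch below implements that reading only; the helper `scale_dummies_sketch`, its `scale_by` parameter, and the sample data are assumptions and may not match what the library does.

```python
import pandas as pd


def scale_dummies_sketch(df, no_scale_cols, scale_by="forecast_value"):
    """Multiply every column not listed in no_scale_cols by the scale_by column."""
    out = df.copy()
    scale_cols = [c for c in out.columns if c not in no_scale_cols]
    # Row-wise multiplication: each dummy becomes 0 or the row's forecast value.
    out[scale_cols] = out[scale_cols].astype(float).mul(out[scale_by], axis=0)
    return out


sample = pd.DataFrame({
    "forecast_value": [1000.0, 2500.0],
    "obs_value": [900.0, 2700.0],
    "area_type_urban": [1, 0],
    "area_type_rural": [0, 1],
})

scaled = scale_dummies_sketch(sample, no_scale_cols=["forecast_value", "obs_value"])
```

The columns named in `no_scale_cols` pass through unchanged, which is why the forecast and observed values themselves belong in that list.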

In [0]:
scaled_df = forecastcards.scale_dummies_by_forecast_value(estimate_df)

# Dataset ready for Estimation

In [0]:
scaled_df.describe()
scaled_df

In [0]:
estimate_df.plot.scatter(y='forecast_value',
                         x='obs_value')
