In [39]:
import importlib
import os
import pandas as pd
import forecastcards
importlib.reload(forecastcards)

<module 'forecastcards' from '//anaconda/envs/forecastcards/lib/python3.6/site-packages/forecastcards/__init__.py'>

# Data Validation and Preparation

There are two main classes for forecast card data:  
 - **`Cardset`**: organizes forecastcards csv data 
 - **`Dataset`**: [pandas](https://pandas.pydata.org/) data created from a validated `Cardset` and cleaned for estimation

This notebook walks through two steps:

1. Find and validate data as a `Cardset`
2. Combine and format data for estimation as a `Dataset`



## 1 - Get and validate cards

Instantiating a `Cardset` finds all the relevant cards and assigns each one a type so that it can be validated against the matching data schema.

The card locations are stored in a dictionary keyed by card type (`card_locs_by_type`).
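As a rough illustration of that dictionary, the sketch below groups csv paths by card type using filename prefixes. The function `group_cards_by_type`, the prefix set, and the `README.md` entry are assumptions made for this example (the prefixes mirror file names such as `poi-21927.csv` and `forecast-21927.csv`); the library's own matching rules may differ.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical set of known card-name prefixes; the real library may match differently.
KNOWN_PREFIXES = {"poi", "scenarios", "forecast"}


def group_cards_by_type(paths):
    """Return {card_type: [paths]} for csv files whose name starts with a known prefix."""
    grouped = defaultdict(list)
    for path in paths:
        name = PurePosixPath(path).name
        if not name.endswith(".csv"):
            continue  # only csv files are treated as cards
        prefix = name.split("-", 1)[0]
        if prefix in KNOWN_PREFIXES:
            grouped[prefix].append(path)
    return dict(grouped)


example_paths = [
    "OhioDOT21927ROS-35-14-40/poi-21927.csv",
    "OhioDOT21927ROS-35-14-40/scenarios-21927.csv",
    "OhioDOT21927ROS-35-14-40/forecasts/forecast-21927.csv",
    "OhioDOT21927ROS-35-14-40/README.md",  # hypothetical non-card file, skipped
]
card_locs = group_cards_by_type(example_paths)
```

Files that are not csv or that carry an unknown prefix are skipped rather than reported, which keeps the sketch short; a real validator would probably want to flag them.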



The `Cardset` constructor arguments, shown with their defaults:

 - `data_loc = local_example_data_loc`
 - `select_projects = []`
 - `exclude_projects = []`
 - `schemas = {}`
 - `schema_locs = github_master_schema_loc`


In [0]:
#example data repository
github_example_data = {'username':'e-lo','repository':'forecastcards','branch':'master'}

#instantiate a cardset with all the defaults
my_cards = forecastcards.Cardset(data_loc = github_example_data)

print("-- Card Locations by Type:",my_cards.card_locs_by_type)
print("-- Data schema Locations by Type:",my_cards.schema_locs)
print("-- Valid Projects",my_cards.validated_projects)
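The `data_loc` dictionary above identifies a GitHub repository by username, repository, and branch. As a sketch of how such a dictionary can be turned into a file location, the helper below builds a `raw.githubusercontent.com` URL. The function `raw_github_url` and the `README.md` path are hypothetical; whether forecastcards resolves files through raw URLs or through the GitHub API is not stated here.

```python
def raw_github_url(loc, path):
    """Build a raw.githubusercontent.com URL for `path` in the repository described by `loc`."""
    parts = [loc["username"], loc["repository"], loc["branch"]]
    subdir = loc.get("subdir", "").strip("/")  # optional key, as in a schema location
    if subdir:
        parts.append(subdir)
    parts.append(path.lstrip("/"))
    return "https://raw.githubusercontent.com/" + "/".join(parts)


github_example_data = {"username": "e-lo", "repository": "forecastcards", "branch": "master"}
url = raw_github_url(github_example_data, "README.md")  # hypothetical file path
```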

You can also get data from a local repository by passing in a local directory:

**NOTE** this will likely not work if you are running this notebook in the cloud, unless you modify the paths to point at data the cloud environment can reach.

In [40]:
file=r"/Users/elizabeth/Documents/urbanlabs/NCHRP 08-110/working/temp/forecastcardsdata/OhioDOT21927ROS-35-14-40/forecasts/forecast-21927.csv"
df=pd.read_csv(file,dtype={'obs_value':float},usecols=["start_time", "end_time"])

In [41]:
df

Unnamed: 0,start_time,end_time
0,00:00,24:00
1,00:00,24:00
2,00:00,24:00
3,00:00,24:00
4,00:00,24:00
5,00:00,24:00
6,00:00,24:00
7,00:00,24:00
8,00:00,24:00
9,00:00,24:00


In [48]:
# '24:00' is not a valid time of day, so recode the end-of-day marker as '23:59:59'
df['end_time']=df['end_time'].apply(lambda x: '23:59:59' if x in ['24:00:00','24:00'] else x)
df

Unnamed: 0,start_time,end_time,c
0,00:00,23:59:59,23:59:59
1,00:00,23:59:59,23:59:59
2,00:00,23:59:59,23:59:59
3,00:00,23:59:59,23:59:59
4,00:00,23:59:59,23:59:59
5,00:00,23:59:59,23:59:59
6,00:00,23:59:59,23:59:59
7,00:00,23:59:59,23:59:59
8,00:00,23:59:59,23:59:59
9,00:00,23:59:59,23:59:59
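The recoding above matters because `24:00` is not a valid time of day for standard parsers. The standalone sketch below shows this with the standard library only: `datetime.strptime` rejects `24:00`, while the recoded value parses cleanly. The helper names `normalize_end_time` and `parse_time` are made up for this example.

```python
from datetime import datetime


def normalize_end_time(value):
    """Map the end-of-day marker '24:00' (or '24:00:00') to '23:59:59'; pass other values through."""
    return "23:59:59" if value in ("24:00", "24:00:00") else value


def parse_time(value):
    """Parse 'HH:MM:SS' or 'HH:MM' into a datetime.time; raise ValueError if neither format fits."""
    for fmt in ("%H:%M:%S", "%H:%M"):
        try:
            return datetime.strptime(value, fmt).time()
        except ValueError:
            continue
    raise ValueError(f"unrecognized time: {value!r}")


# The raw value cannot be parsed because hours must be in the range 0-23.
try:
    parse_time("24:00")
    raw_is_parseable = True
except ValueError:
    raw_is_parseable = False

end = parse_time(normalize_end_time("24:00"))
```

Recoding to `23:59:59` loses one second of the day; that is usually acceptable for daily counts, but it is a modeling choice rather than a lossless conversion.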


In [36]:
# absolute path to a local copy of the forecast card data; replace it with the path on your own machine

local_example_data = r"/Users/elizabeth/Documents/urbanlabs/NCHRP 08-110/working/temp/forecastcardsdata"


my_cards = forecastcards.Cardset(data_loc = local_example_data)

print(my_cards.validated_projects)


Validation Error: /Users/elizabeth/Documents/urbanlabs/NCHRP 08-110/working/temp/forecastcardsdata/OhioDOT21927ROS-35-14-40/poi-21927.csv
Validation Error: /Users/elizabeth/Documents/urbanlabs/NCHRP 08-110/working/temp/forecastcardsdata/OhioDOT21927ROS-35-14-40/scenarios-21927.csv
Validation Error: /Users/elizabeth/Documents/urbanlabs/NCHRP 08-110/working/temp/forecastcardsdata/OhioDOT21927ROS-35-14-40/forecasts/forecast-21927.csv


TypeError: Cannot change data-type for object array.

### Validating using a modified schema

If you have modified the schema for any reason (e.g. to expand the allowed project types), you can validate the data against an alternate schema location, which is easy to set up using the **`Card_schema`** class.

Using the class is recommended because it also validates the schema itself.
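To make the idea of a modified schema concrete, the sketch below checks rows against a small descriptor in which a field may declare `required` and `enum` constraints, so expanding the allowed project types means editing the `enum` list. The descriptor layout, the field name `project_type`, the type values, and the function `validate_rows` are illustrative assumptions, not the forecastcards schema or the `Card_schema` implementation.

```python
def validate_rows(rows, schema):
    """Return a list of (row_index, field, message) for every constraint violation."""
    errors = []
    for i, row in enumerate(rows):
        for field in schema["fields"]:
            name = field["name"]
            constraints = field.get("constraints", {})
            value = row.get(name)
            if value in (None, ""):
                if constraints.get("required"):
                    errors.append((i, name, "missing required value"))
                continue  # nothing more to check on an empty value
            allowed = constraints.get("enum")
            if allowed is not None and value not in allowed:
                errors.append((i, name, f"{value!r} not in {allowed}"))
    return errors


# Hypothetical base schema with two allowed project types.
base_schema = {"fields": [
    {"name": "project_id", "constraints": {"required": True}},
    {"name": "project_type", "constraints": {"enum": ["road", "transit"]}},
]}
# A modified schema that expands the allowed project types.
expanded_schema = {"fields": [
    base_schema["fields"][0],
    {"name": "project_type", "constraints": {"enum": ["road", "transit", "bike"]}},
]}

rows = [
    {"project_id": "lu123", "project_type": "bike"},
    {"project_id": "", "project_type": "road"},
]
base_errors = validate_rows(rows, base_schema)
expanded_errors = validate_rows(rows, expanded_schema)
```

Under the base schema the first row fails on its project type; under the expanded schema only the missing `project_id` in the second row is reported.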

In [0]:
#example data repository
github_example_data = {'username':'e-lo','repository':'forecastcards','branch':'master'}

#alternate schema repository location
my_new_schema_loc = {'username':'e-lo',
                     'repository':'forecastcards',
                     'branch':'master',
                     'subdir':'spec/en/'}

my_schemas = forecastcards.Card_schema(schema_dir = my_new_schema_loc)


#instantiate a cardset that validates against the alternate schema
my_cards = forecastcards.Cardset(data_loc = github_example_data, schema_locs = my_schemas.schema_locs)

## 2 - Combine and Format Data for Estimation

The **`Dataset`** class takes a cardset and turns it into a merged pandas dataframe.

Unless told otherwise, it completes all the merging plus a default set of cleaning and variable coding on instantiation:

 1. fix missing values and clean records that lack required values
 2. code categorical variables as dummies
 3. scale the estimation dataset by the forecast value

The cleaning is controlled by the following lists:

 - **`recode_na_vars`** - list of variables to recode NA to "missing"
 - **`no_na_vars`** - list of variables where having an NA isn't acceptable
 - **`categorical_cols_list`** - list of columns that will be recoded as dummies
 - **`required_vars`** - list of variables that will be kept
 - **`no_scale_cols`** - list of variables that won't be scaled
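The three default steps can be sketched in plain pandas as follows. This is a minimal illustration under stated assumptions, not the `Dataset` implementation: the function `prepare_for_estimation`, the `area_type` and `year` columns, and the reading of "scale" as dividing numeric columns by `forecast_value` are all assumptions made for this example.

```python
import pandas as pd


def prepare_for_estimation(df, recode_na_vars, no_na_vars, categorical_cols,
                           no_scale_cols, scale_by="forecast_value"):
    """Apply the three cleaning steps to a copy of `df` and return the result."""
    out = df.copy()
    # 1. Recode NAs to "missing" where allowed, then drop rows lacking required values.
    out[recode_na_vars] = out[recode_na_vars].fillna("missing")
    out = out.dropna(subset=no_na_vars)
    # Choose the numeric columns to scale before the dummies add new columns.
    numeric_cols = out.select_dtypes(include="number").columns
    scale_cols = [c for c in numeric_cols if c not in no_scale_cols and c != scale_by]
    # 2. Code categorical variables as dummy (one-hot) columns.
    out = pd.get_dummies(out, columns=categorical_cols)
    # 3. Scale the selected columns by the forecast value, row by row (assumed meaning).
    out[scale_cols] = out[scale_cols].div(out[scale_by], axis=0)
    return out


# Hypothetical data; only forecast_value and obs_value are column names used in this notebook.
raw = pd.DataFrame({
    "forecast_value": [1000.0, 2000.0, None],
    "obs_value": [1100.0, 1800.0, 500.0],
    "area_type": ["urban", None, "rural"],
    "year": [2010, 2012, 2015],
})
clean = prepare_for_estimation(
    raw,
    recode_na_vars=["area_type"],
    no_na_vars=["forecast_value", "obs_value"],
    categorical_cols=["area_type"],
    no_scale_cols=["year"],
)
```

The row without a forecast value is dropped, the missing `area_type` becomes its own dummy column, and `obs_value` ends up expressed as a ratio to the forecast while `year` is left unscaled.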
 

In [0]:
cardset = forecastcards.Cardset(data_loc = github_example_data)

dataset = forecastcards.Dataset(card_locs_by_type = cardset.card_locs_by_type, 
                                file_to_project_id = cardset.file_to_project_id
                               )
dataset.df

# Combine data from multiple sources

You can combine data from several GitHub repositories, and you can mix local and GitHub data. This is useful if you are working from a shared, public data repository and want to add your own data to it.

In [0]:
combined_gh_repo_cardset = forecastcards.Cardset(data_loc = github_example_data, exclude_projects=['lu123'])

combined_gh_repo_cardset.add_projects(data_loc=github_example_data, select_projects=['lu123'])

combined_gh_repo_cardset.validated_projects

In [0]:
combined_gh_local_repo_cardset = forecastcards.Cardset(data_loc = github_example_data, exclude_projects=['lu123'])

combined_gh_local_repo_cardset.add_projects(data_loc=local_example_data, select_projects=['lu123'])

combined_gh_local_repo_cardset.card_locs_by_type
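Conceptually, the cells above filter each source with `select_projects` / `exclude_projects` and then merge the results. The sketch below mimics that with plain dictionaries keyed by project id. It is a simplification for illustration: the functions `filter_projects` and `combine`, the file paths, and the `abc456` project id are hypothetical, and the real `Cardset` organizes cards by type rather than by project.

```python
def filter_projects(cards_by_project, select_projects=(), exclude_projects=()):
    """Keep the selected projects (all of them if none are selected) and drop excluded ones."""
    return {
        project: cards
        for project, cards in cards_by_project.items()
        if (not select_projects or project in select_projects)
        and project not in exclude_projects
    }


def combine(*sources):
    """Merge several {project: [card paths]} dictionaries, rejecting duplicate project ids."""
    combined = {}
    for source in sources:
        for project, cards in source.items():
            if project in combined:
                raise ValueError(f"project {project!r} appears in more than one source")
            combined[project] = list(cards)
    return combined


# Hypothetical card listings for a shared repository and a local directory.
shared_repo = {
    "lu123": ["shared/lu123/project-lu123.csv"],
    "abc456": ["shared/abc456/project-abc456.csv"],
}
local_dir = {"lu123": ["local/lu123/project-lu123.csv"]}

combined = combine(
    filter_projects(shared_repo, exclude_projects=["lu123"]),
    filter_projects(local_dir, select_projects=["lu123"]),
)
```

Raising on a duplicate project id is one possible policy; it makes the exclude-then-add pattern explicit instead of silently letting one source overwrite another.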


# Explore Estimation Dataset


In [0]:
dataset.df.dtypes

In [0]:
# a notebook cell only displays its last expression, so print the summary explicitly
print(dataset.df.describe())
dataset.df

In [0]:
dataset.df.plot.scatter(y='forecast_value',
                         x='obs_value')
