In [None]:
# Quick cell to make the Jupyter notebook use the full screen width
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
# Basic utility functions
import logging
from src.log import logger
from src import paths
from src.utils import list_dir
from functools import partial

# data functions
from src.data import DataSource, Dataset
from src import workflow

In [None]:
import pandas as pd

In [None]:
logger.setLevel(logging.DEBUG)

## Load up the Wine Reviews DataSource

In [None]:
dsrc = DataSource.from_catalog('wine_reviews')

In [None]:
dsrc.unpack()

## Explore the data
Since we'll want to look at some analyses that involve the Tasters, we'll need to use the 130k dataset instead of the 150k dataset.
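
One quick way to confirm which file carries the taster columns is to read only the header row of each CSV and compare the column sets. The sketch below uses two tiny in-memory CSVs with made-up headers as stand-ins for the real files, so the names `csv_with_tasters` and `csv_without_tasters` are purely illustrative.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the two CSV files (headers and rows are made up).
csv_with_tasters = "id,country,points,taster_name\n0,US,87,Taster A\n"
csv_without_tasters = "id,country,points\n0,US,87\n"

# nrows=0 parses just the header, so this stays cheap even for large files.
cols_a = pd.read_csv(io.StringIO(csv_with_tasters), index_col=0, nrows=0).columns
cols_b = pd.read_csv(io.StringIO(csv_without_tasters), index_col=0, nrows=0).columns

# Columns present in the first file but missing from the second.
only_in_a = sorted(set(cols_a) - set(cols_b))
print(only_in_a)
```

With the real files, the same comparison would take the two paths under the unpack directory in place of the `io.StringIO` objects.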

In [None]:
wine_reviews_path = dsrc.unpack_path_ / 'winemag-data-130k-v2.csv'

In [None]:
df = pd.read_csv(wine_reviews_path, index_col=0)

In [None]:
df.head()

Let's do some basic exploration of the various columns.

First, let's see where we have missing data.


In [None]:
# percentage of NAs per column
print('Percentage of NAs per column')
for column in df.columns:
    print(f'\t{column}: \t {round(sum(df[column].isna()) / len(df[column]) * 100)}')
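
The per-column loop above can also be written as a single vectorized expression: `isna()` yields a boolean frame, and the column-wise mean of booleans is the fraction of missing values. Here is a minimal sketch on a toy frame (the values are invented, not taken from the wine reviews).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the wine reviews; values are made up.
toy = pd.DataFrame({
    "country": ["US", "France", None, "Italy"],
    "price": [15.0, np.nan, np.nan, 30.0],
    "points": [87, 90, 88, 91],
})

# Fraction of missing values per column, expressed as a rounded percentage.
na_pct = (toy.isna().mean() * 100).round().astype(int)

# Sort so the columns with the most missing data come first.
print(na_pct.sort_values(ascending=False))
```

Sorting the result makes it easier to spot the sparsest columns at a glance than reading the loop's output in column order.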

Every review has a description, points, title, and winery.

Almost all reviews have country, province and varietal.

We have more taster names than Twitter handles, so when we want to identify tasters, we'll use the taster name.


In [None]:
df.taster_name.value_counts(dropna=False)

Wow, only 19 recorded tasters in this dataset, and most of the reviews don't include a `taster_name`.
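
To put numbers on an observation like this, it can help to compute the distinct taster count and the missing share directly rather than reading them off the `value_counts` output. A small sketch on a toy Series (the taster names are placeholders):

```python
import pandas as pd

# Toy taster column; names are placeholders, None marks a missing taster.
tasters = pd.Series(
    ["Taster A", None, "Taster A", "Taster B", None, None],
    name="taster_name",
)

# dropna=False keeps the missing values as their own row in the counts.
counts = tasters.value_counts(dropna=False)

# nunique() ignores missing values by default, so this counts real tasters only.
n_tasters = tasters.nunique()

# Share of rows with no taster recorded.
missing_share = tasters.isna().mean()

print(counts)
print(f"{n_tasters} tasters, {missing_share:.0%} of rows missing a taster")
```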


## Create a Dataset

Now we'll create a Dataset object that we can easily load from the catalog once the instructions for building it have been recorded.


### Step 1: Make a process function

This is a nice dataset, so there's not really any cleanup to do here; we're just fixing things up to match the API. A dataset consists of a tuple: `(data, target, metadata)`.

In [None]:
!ls -la $dsrc.unpack_path_

In [None]:
import pandas as pd
import pathlib

def process_wine_reviews(*, kind='130k', extract_dir='wine_reviews',
                         metadata=None, unpack_dir=None):
    """
    Process wine reviews into (data, target, metadata) format. Since we plan to use Pandas
    for further processing, data will be a pandas dataframe.

    Parameters
    ----------
    kind: {'130k', '150k'}
        Which version of the dataset to load: the 130k version or the 150k version.
        This is an unsupervised learning example. There are no labels, so we only
        work with the whole dataset.
    extract_dir:
        Name of the directory of the unpacked zip file containing the raw data files.
    metadata:
        Optional dict of additional metadata to pass through. Defaults to an empty dict.
    unpack_dir:
        The directory the reviews have been unpacked into.
        Defaults to paths['interim_data_path'].

    
    Returns
    -------
    A tuple:
        (data, target, additional_metadata)
        
    """
    if metadata is None:
        metadata = {}
    
    if unpack_dir is None:
        unpack_dir = paths['interim_data_path']
    else:
        unpack_dir = pathlib.Path(unpack_dir)
    data_dir = unpack_dir / extract_dir
    if kind == '130k':
        data = pd.read_csv(data_dir/"winemag-data-130k-v2.csv", index_col=0)
    elif kind == '150k':
        data = pd.read_csv(data_dir/"winemag-data_first150k.csv", index_col=0)
    else:
        raise ValueError(f'kind: {kind} must be one of "130k" or "150k"')
    
    target = None
    
    return data, target, metadata

In [None]:
data, target, metadata = process_wine_reviews(kind='130k')

In [None]:
data.head()

In [None]:
data.shape



Looks good. Now test this as a process function for our data.


In [None]:
dsrc.process_function = partial(process_wine_reviews, kind='130k')
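
For readers less familiar with `functools.partial`: it wraps a function together with some pre-filled arguments, so the resulting object can later be called with no arguments at all. The sketch below uses a hypothetical `toy_process` function in place of `process_wine_reviews` to show what gets stored on the partial object.

```python
from functools import partial


def toy_process(*, kind="130k", unpack_dir=None):
    """Hypothetical stand-in for a process function; echoes its arguments."""
    return {"kind": kind, "unpack_dir": unpack_dir}


# Bind kind='150k' up front; the partial can then be called with no arguments.
process_150k = partial(toy_process, kind="150k")

# The wrapped function and the bound keywords are stored on the partial itself.
print(process_150k.func.__name__)
print(process_150k.keywords)

default_call = process_150k()
print(default_call)
```

Because the bound keywords are plain data on the partial object, a function plus its options can be recorded and replayed later, which is presumably why the process function is supplied in this form.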

In [None]:
%%time
ds = dsrc.process()

In [None]:
ds.metadata

In [None]:
ds.data.shape

In [None]:
print(ds)

In [None]:
type(ds)

### Now that things seem to work, we need to move the process function to the src module
Custom processing functions belong in `src/data/process_functions.py`.

Now load it from the `src` module.

In [None]:
from src.data.process_functions import process_wine_reviews

In [None]:
help(process_wine_reviews)

Change the process function from the notebook-defined function to the one from the `src` module.

In [None]:
dsrc.process_function = partial(process_wine_reviews, kind='130k')

Check that everything works as expected.

In [None]:
dsrc.fetch()
dsrc.unpack()
ds = dsrc.process()

In [None]:
ds.data.head()

## Save the Datasource processing

In [None]:
workflow.add_datasource(dsrc)

In [None]:
workflow.datasource_catalog(keys_only=False)

In [None]:
dsrc = DataSource.from_catalog('wine_reviews')

In [None]:
ds = Dataset.from_datasource('wine_reviews')

In [None]:
ds.data.shape

In [None]:
workflow.datasource_catalog(keys_only=True)

### Create a Dataset from a DataSource

In [None]:
from src.data import TransformerGraph

In [None]:
dag = TransformerGraph()

In [None]:
dag.add_source(output_dataset='wine_reviews_130k', datasource_name='wine_reviews', force=True)

We can also add the 150k dataset by changing the `kind` that we pass into the process function.

In [None]:
dag.add_source(output_dataset='wine_reviews_150k', datasource_name='wine_reviews', datasource_opts={'kind':'150k'}, force=True)
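
One way to picture what `datasource_opts={'kind':'150k'}` asks for: keyword arguments supplied at call time take precedence over those bound by `functools.partial`. Whether `TransformerGraph` forwards the options in exactly this way is an assumption; the sketch only demonstrates the plain-Python override behavior, using a hypothetical `toy_process` function.

```python
from functools import partial


def toy_process(*, kind="130k"):
    """Hypothetical stand-in that reports which dataset version it would load."""
    return f"reviews-{kind}"


# The stored process function has kind='130k' bound, as in the cells above.
bound = partial(toy_process, kind="130k")

# Options given later override the bound keyword when passed at call time.
datasource_opts = {"kind": "150k"}

result_default = bound()
result_150k = bound(**datasource_opts)

print(result_default)
print(result_150k)
```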

In [None]:
workflow.dataset_catalog(keys_only=True)

In [None]:
ds = Dataset.from_catalog('wine_reviews_130k')

In [None]:
ds.data.shape

In [None]:
ds = Dataset.from_catalog('wine_reviews_150k')

In [None]:
ds.data.shape

Now we're ready to work with the dataset and analyze wine reviews! See:
* [01-Varietal-by-Sets-of-Reviewers.ipynb](01-Varietal-by-Sets-of-Reviewers.ipynb)
* [02-Winery-by-Varietal-Review-Counts.ipynb](02-Winery-by-Varietal-Review-Counts.ipynb)