## Let's begin: prepare your data

Start by importing the `mammographic_masses.data.txt` file into a pandas DataFrame (hint: use `read_csv`) and take a look at it.
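
For readers working without the project's `src` helpers, the load step could be sketched with plain pandas as below. This is a minimal sketch under stated assumptions: the column names come from the UCI dataset description rather than from the file (which has no header row), missing values are assumed to be encoded as `?`, and the rows shown are illustrative stand-ins for the real file.

```python
import io

import pandas as pd

# Assumption: the file has no header row; these names follow the UCI description.
COLUMN_NAMES = ["BI-RADS", "age", "shape", "margin", "density", "severity"]

# Illustrative rows in the file's format. Replace `sample_csv` with the path
# to mammographic_masses.data.txt when working with the real data.
sample_csv = io.StringIO(
    "5,67,3,5,3,1\n"
    "4,43,1,1,?,1\n"
    "5,58,4,5,3,1\n"
    "4,28,1,1,3,0\n"
)

# na_values=["?"] turns the '?' placeholders into NaN so pandas treats them as missing.
masses = pd.read_csv(sample_csv, header=None, names=COLUMN_NAMES, na_values=["?"])

print(masses.head())
print(masses.isna().sum())  # number of missing values per column
```

Passing `na_values` at load time keeps the numeric columns numeric; without it, any column containing `?` would be read as strings.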

In [1]:
import pandas as pd

%load_ext autoreload
%autoreload 2


In [2]:
import logging
from src.logging import logger
logger.setLevel(logging.DEBUG)

In [4]:
dataset_name = 'mammographic'

In [5]:
from src.paths import raw_data_path, interim_data_path, processed_data_path


In [6]:
from src.data import RawDataset
mammo_data = RawDataset(dataset_name)
mammo_data.add_url(url="https://archive.ics.uci.edu/ml/machine-learning-databases/mammographic-masses/mammographic_masses.data")

In [7]:
mammo_data.add_url(url='https://archive.ics.uci.edu/ml/machine-learning-databases/mammographic-masses/mammographic_masses.names',
                   file_name=f'{dataset_name}.readme',
                   name='DESCR')

In [8]:
from src.data.localdata import process_csv
help(process_csv)

ImportError: cannot import name 'process_csv' from 'src.data.localdata' (/Users/mei/Documents/courses/bbconf/mammogram/src/data/localdata.py)

If the missing data seems randomly distributed, go ahead and drop rows with missing data. Hint: use `dropna()`.
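
One way to eyeball whether the gaps look random before dropping them is sketched below. The small DataFrame and its values are made up for illustration; only `isna()`, `any()` and `dropna()` are the real pandas calls being demonstrated.

```python
import numpy as np
import pandas as pd

# Illustrative data only: a few rows with gaps in different columns.
masses = pd.DataFrame({
    "age": [67, 43, 58, np.nan, 74],
    "shape": [3, 1, 4, 1, 1],
    "margin": [5, 1, 5, 1, np.nan],
    "density": [3, np.nan, 3, 3, 3],
    "severity": [1, 1, 1, 0, 1],
})

# How many values are missing in each column?
print(masses.isna().sum())

# Inspect the incomplete rows: if they cluster on one class or one age range,
# dropping them could bias the data.
incomplete = masses[masses.isna().any(axis=1)]
print(incomplete)

# Keep only the rows with no missing values.
cleaned = masses.dropna()
print(f"Kept {len(cleaned)} of {len(masses)} rows")
```

Comparing summary statistics of `incomplete` against the full frame (for example with `describe()`) is a quick sanity check that the dropped rows are not systematically different.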

Next you'll need to convert the pandas DataFrame into NumPy arrays that can be used by scikit-learn. Create an array that extracts only the feature data we want to work with (age, shape, margin, and density) and another array that contains the classes (severity). You'll also need an array of the feature name labels.
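
The conversion described above can be sketched in plain pandas as follows. The DataFrame contents are illustrative, and the variable names (`features`, `classes`, `feature_names`) are hypothetical choices rather than names used by the notebook's `src` helpers.

```python
import pandas as pd

# Illustrative, already-cleaned data (no missing values).
cleaned = pd.DataFrame({
    "BI-RADS": [5, 5, 4],
    "age": [67, 58, 28],
    "shape": [3, 4, 1],
    "margin": [5, 5, 1],
    "density": [3, 3, 3],
    "severity": [1, 1, 0],
})

# The feature name labels, in the column order used for the feature array.
feature_names = ["age", "shape", "margin", "density"]

# 2-D array of shape (n_samples, n_features) holding only the chosen features.
features = cleaned[feature_names].to_numpy()

# 1-D array of shape (n_samples,) holding the class labels.
classes = cleaned["severity"].to_numpy()

print(features.shape, classes.shape)
```

Selecting columns by an explicit list keeps the array columns aligned with `feature_names`, which matters later when labelling model output.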

In [8]:
from src import workflow
from src.data.localdata import process_csv
mammo_data.load_function = process_csv

In [9]:
from src.data import Dataset
workflow.add_raw_dataset(mammo_data)
mammo_df = Dataset.from_raw(dataset_name, force=True)
print(str(mammo_df))

2018-11-11 13:15:13,132 - fetch - DEBUG - No file_name specified. Inferring mammographic_masses.data from URL
2018-11-11 13:15:13,134 - fetch - DEBUG - mammographic_masses.data exists, but no hash to check. Setting to sha1:5cfd64b52520391fb1f2d2d5d115d10c8c862046
2018-11-11 13:15:13,135 - fetch - DEBUG - mammographic.readme exists, but no hash to check. Setting to sha1:d8f3a7c205397d619eadfecf990dd84380115325
2018-11-11 13:15:13,136 - fetch - DEBUG - Copying mammographic_masses.data
2018-11-11 13:15:13,138 - fetch - DEBUG - Copying mammographic.readme
2018-11-11 13:15:13,141 - localdata - DEBUG - load_csv()-->loading csv file=/Users/mei/Documents/courses/bbconf/mammogram/data/interim/mammographic/mammographic_masses.data ...
2018-11-11 13:15:13,152 - datasets - DEBUG - Wrote Dataset Metadata: 252508f3d18124f88fc1204c8d0cb9c439dd032f.metadata
2018-11-11 13:15:13,155 - datasets - DEBUG - Wrote Dataset: 252508f3d18124f88fc1204c8d0cb9c439dd032f.dataset


<Dataset: mammographic, data.shape=(830, 4), target.shape=(830,), metadata=['descr', 'dataset_name', 'hash_type', 'data_hash', 'target_hash']>


In [10]:
workflow.available_datasets()
workflow.get_transformer_list()

[]

In [11]:
workflow.available_transformers()

['index_to_date_time', 'pivot', 'train_test_split']

In [12]:
transform_pipeline = [("train_test_split", {'random_state': 1, 'test_size': 0.25})]
workflow.add_transformer(from_raw=dataset_name,
                         suppress_output=True,
                         transformations=transform_pipeline)
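
Outside this project's `workflow` helper, a comparable split could be sketched with scikit-learn directly. This assumes the `train_test_split` transformer behaves like scikit-learn's function of the same name, which the notebook does not confirm; the arrays below are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 8 samples with 4 features each, and alternating class labels.
features = np.arange(32, dtype=float).reshape(8, 4)
classes = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# Hold out 25% of the samples for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    features, classes, test_size=0.25, random_state=1
)

print(X_train.shape, X_test.shape)
```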

In [13]:

workflow.make_data()
logger.setLevel(logging.DEBUG)

2018-11-11 13:15:13,388 - transform_data - DEBUG - Creating Dataset from Raw: mammographic with opts {}
2018-11-11 13:15:13,390 - datasets - DEBUG - process() called before unpack()
2018-11-11 13:15:13,391 - datasets - DEBUG - unpack() called before fetch()
2018-11-11 13:15:13,391 - fetch - DEBUG - No file_name specified. Inferring mammographic_masses.data from URL
2018-11-11 13:15:13,396 - fetch - DEBUG - mammographic_masses.data exists, but no hash to check. Setting to sha1:5cfd64b52520391fb1f2d2d5d115d10c8c862046
2018-11-11 13:15:13,398 - fetch - DEBUG - mammographic.readme exists, but no hash to check. Setting to sha1:d8f3a7c205397d619eadfecf990dd84380115325
2018-11-11 13:15:13,399 - fetch - DEBUG - Copying mammographic_masses.data
2018-11-11 13:15:13,402 - fetch - DEBUG - Copying mammographic.readme
2018-11-11 13:15:13,405 - datasets - DEBUG - Found cached Dataset for mammographic: 252508f3d18124f88fc1204c8d0cb9c439dd032f
2018-11-11 13:15:13,406 - transform_data - DEBUG - Applying

Some of our models require the input data to be normalized, so go ahead and normalize the attribute data. Hint: use `preprocessing.StandardScaler()`.
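
A minimal sketch of the normalization step with scikit-learn's `StandardScaler` follows. The feature values are illustrative, not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative feature rows: age, shape, margin, density.
features = np.array([
    [67.0, 3.0, 5.0, 3.0],
    [58.0, 4.0, 5.0, 3.0],
    [28.0, 1.0, 1.0, 2.0],
    [57.0, 1.0, 4.0, 3.0],
])

# StandardScaler centers each column on 0 and scales it to unit variance.
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

print(features_scaled.mean(axis=0))  # approximately 0 for every column
print(features_scaled.std(axis=0))   # approximately 1 for every column
```

When a train/test split is involved, a common practice is to call `fit` on the training portion only and then `transform` both portions, so that no information from the test set leaks into the scaling.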

In [14]:
mammo_df.data

array([[ 0.7650629 ,  0.17563638,  1.39618483,  0.24046607],
       [ 0.15127063,  0.98104077,  1.39618483,  0.24046607],
       [-1.89470363, -1.43517241, -1.157718  ,  0.24046607],
       ...,
       [ 0.56046548,  0.98104077,  1.39618483,  0.24046607],
       [ 0.69686376,  0.98104077,  1.39618483,  0.24046607],
       [ 0.42406719,  0.17563638,  0.11923341,  0.24046607]])

In [15]:
mammo_df.target


array([1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,
       0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0,
       1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1,
       1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
       1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0,
       1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1,
       0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0,
       1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1,
       0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,