# fredhutch.io -- Intermediate Python: Machine Learning
### Fred Hutchinson Cancer Research Center

# Week 1 -- Basic Data Understanding and Preparation: Data Import, Cleaning, Visualization, and Feature Engineering

## We have access to a set of data describing a variety of features about people's commutes. We're hoping to use these data to predict how long a new person's commute will be given some information about them.

## Can machine learning help us achieve our goal?

## What do we need to do with the data first in order to decide where to go and how to use machine learning thoughtfully?

### EDA, Data import, cleaning, visualization

In [None]:
# Import libraries

# Data exploration
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt

# The % prefix marks an IPython "magic" command. This one renders each figure in the notebook, right below the
# code that produced it.
%matplotlib inline

In [None]:
# What versions are you running?
print("Pandas version:",pd.__version__)
print("Numpy version:",np.__version__)

In [None]:
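# Read in the data. These relative paths assume the data folder sits one level above your working directory;
# pass in the full file path if your data live somewhere else.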
training_dataset_path = '../data/commute-times-train.csv'
testing_dataset_path = '../data/commute-times-test.csv'
train_data_raw = pd.read_csv(training_dataset_path, index_col=[0], parse_dates=['time_of_day_ts'])
test_data_raw = pd.read_csv(testing_dataset_path, index_col=[0], parse_dates=['time_of_day_ts'])

### Quick primer to viewing data

In [None]:
# Prefix a name with ? to quickly view its docstring.
?pd.DataFrame.head

In [None]:
# View the first five rows of data.
train_data_raw.head()

In [None]:
# Other options include using .tail() and .sample(). You can specify how many rows of data to see. The default is 5
# for .head() and .tail(), and 1 for .sample().
train_data_raw.sample(3)

In [None]:
# Other methods to use to get some information about your dataset. By default, using .describe() provides some summary
# statistics of what kind of data?
train_data_raw.describe()

In [None]:
# In case it is of interest, there are kwargs that will provide additional information.
train_data_raw.describe(include='all')

In [None]:
# There is a quick way to learn what size dataset you are working with. .shape will return (rows, columns).
train_data_raw.shape

In [None]:
# It's important to know what data types are in your dataset! It's not usually a good idea to make assumptions about 
# the data you are working with. 
train_data_raw.dtypes

### np.nan != [None, 0]
### NaN is a float, which lets you keep using vectorized operations in numpy. NaN, None, and 0 each have a different data type. Think about this when you consider how to impute or otherwise handle missing data.
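
To make the difference concrete, here is a minimal sketch using a small made-up Series (the values are toy data, not taken from the commute dataset). It shows the type of each value and how replacing NaN with 0 changes a summary statistic.

```python
import numpy as np
import pandas as pd

# Each "empty-looking" value has a different type.
print(type(np.nan))  # a Python float
print(type(None))    # NoneType
print(type(0))       # int

# NaN is not equal to anything, including itself, so use .isna() to find it.
nan_equals_itself = (np.nan == np.nan)

# Toy data, invented for illustration only.
values = pd.Series([1.0, np.nan, 3.0])
mean_skipping_nan = values.mean()          # pandas skips NaN by default
mean_with_zero = values.fillna(0).mean()   # treating missing as 0 pulls the mean down

# In a numeric Series, pandas converts None to NaN.
with_none = pd.Series([1.0, None, 3.0])

print(nan_equals_itself, mean_skipping_nan, mean_with_zero)
print(with_none.isna().sum())
```

Filling missing values with 0 is itself a modeling decision: it only makes sense when 0 is a plausible value for that column.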

In [None]:
# Are there any missing data in this dataset?
train_data_raw.isnull().values.any()

In [None]:
# If there are NaNs, how many are present in each column?
train_data_raw.isnull().sum()

### Let's continue learning about the data through visualizations

In [None]:
# Can we discern anything from the latitude and longitude?
train_data_raw.plot.scatter(x='source_latitude', y='source_longitude', marker='.')
plt.title('Source of Commute');

In [None]:
train_data_raw.plot.scatter(x='destination_latitude', y='destination_longitude', marker='.')
plt.title('Destination of Commute');

In [None]:
# This will show the distribution of commute times.
plt.figure(figsize=(15,5))
train_data_raw['commute_time'].hist(bins=50)
plt.title('Histogram of Commute Times')
plt.xlabel('Minutes')
plt.ylabel('Counts');

In [None]:
# This is a nice summary table, but let's graph this out somehow.
train_data_raw.groupby('commute_type').size()

In [None]:
plt.figure(figsize=(15,5))
train_data_raw.groupby('commute_type').size().plot(kind='bar')
plt.title('Count of Commute Types')
plt.ylabel('Number of Commutes');

### What can you say about the data based on these graphs?

In [None]:
# The datetime dtype of time_of_day_ts is awkward to graph, so it is a good idea to convert it to a number.
# This function converts a timestamp into a decimal number of hours between zero and twenty-four.

def timestamp_to_decimal(ts):
    """Convert a timestamp datum into a decimal between zero and twenty-four.
    
    Parameters
    ----------
    ts: pd.Series of datetime.

    Returns
    -------
    pd.Series of float: decimal hours since midnight. Seconds are ignored.
    """
    return ts.dt.hour + (1/60)*ts.dt.minute
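
As a quick sanity check, the sketch below repeats the function so it runs on its own and applies it to two made-up timestamps (toy values, not from the dataset). We would expect 8:30 to map to roughly 8.5 and 17:45 to roughly 17.75.

```python
import pandas as pd

def timestamp_to_decimal(ts):
    """Convert a Series of datetimes into decimal hours (seconds are ignored)."""
    return ts.dt.hour + (1/60)*ts.dt.minute

# Toy timestamps, invented for illustration only.
toy_timestamps = pd.Series(pd.to_datetime(['2020-01-01 08:30:00',
                                           '2020-01-01 17:45:00']))
decimal_hours = timestamp_to_decimal(toy_timestamps)
print(decimal_hours)
```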

In [None]:
# Let's use our new function on the data and create a new column. Notice that the function is being used to transform 
# both the training AND testing datasets.
train_data_raw['time_of_day'] = timestamp_to_decimal(
    train_data_raw['time_of_day_ts'])
test_data_raw['time_of_day'] = timestamp_to_decimal(
    test_data_raw['time_of_day_ts'])

In [None]:
# Now that time of day is represented by a number from 0 to 24, let's see how many trips are being made throughout 
# the day.
plt.figure(figsize=(15,5))
train_data_raw['time_of_day'].hist(bins=50)
plt.title('Volume of Commutes by Time of Day')
plt.xlabel('Time of Day')
plt.ylabel('Volume of Commutes');

### We have visualized the variables included in the dataset. Sometimes variables need to be transformed before they can be graphed in a meaningful way. So far we've looked at locations, commute types, commute times, and time of day. The dataset also contains information we can use to generate new features. Let's revisit the location data: with a source and a destination, we can calculate distance.



![dist_image](https://slideplayer.com/slide/4829376/15/images/8/Some+Euclidean+Distances.jpg "Euclidean")

In [None]:
# Euclidean distance is also known as L2.
def euclidean_distance(source_x, source_y, target_x, target_y):
    return np.sqrt((source_x - target_x)**2 + (source_y - target_y)**2)

# Manhattan (or taxicab) distance is also known as L1.
def manhattan_distance(source_x, source_y, target_x, target_y):
    return np.abs(source_x - target_x) + np.abs(source_y - target_y)
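
A small worked example helps show how the two measures differ. The sketch below repeats both functions so it runs on its own and uses a 3-4-5 right triangle: the straight-line (L2) distance is the hypotenuse, while the taxicab (L1) distance is the sum of the two legs. Note that when the inputs are latitude and longitude, both results are in degrees, so they are only rough stand-ins for distance on the ground.

```python
import numpy as np

def euclidean_distance(source_x, source_y, target_x, target_y):
    return np.sqrt((source_x - target_x)**2 + (source_y - target_y)**2)

def manhattan_distance(source_x, source_y, target_x, target_y):
    return np.abs(source_x - target_x) + np.abs(source_y - target_y)

# From (0, 0) to (3, 4): the legs are 3 and 4, the hypotenuse is 5.
l2 = euclidean_distance(0, 0, 3, 4)
l1 = manhattan_distance(0, 0, 3, 4)
print(l2, l1)  # 5.0 7

# Both functions are vectorized, so they also accept arrays or pandas columns.
xs = np.array([0.0, 1.0])
ys = np.array([0.0, 1.0])
print(euclidean_distance(xs, ys, 3.0, 4.0))
```

The Manhattan distance is never smaller than the Euclidean distance between the same two points.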

In [None]:
train_data_raw['euclidean_distance'] = euclidean_distance(
    train_data_raw['source_latitude'], train_data_raw['source_longitude'],
    train_data_raw['destination_latitude'], train_data_raw['destination_longitude'])
test_data_raw['euclidean_distance'] = euclidean_distance(
    test_data_raw['source_latitude'], test_data_raw['source_longitude'],
    test_data_raw['destination_latitude'], test_data_raw['destination_longitude'])

train_data_raw['manhattan_distance'] = manhattan_distance(
    train_data_raw['source_latitude'], train_data_raw['source_longitude'],
    train_data_raw['destination_latitude'], train_data_raw['destination_longitude'])
test_data_raw['manhattan_distance'] = manhattan_distance(
    test_data_raw['source_latitude'], test_data_raw['source_longitude'],
    test_data_raw['destination_latitude'], test_data_raw['destination_longitude'])

In [None]:
plt.figure(figsize=(15,5))
train_data_raw['euclidean_distance'].hist(bins=50)
plt.title('Euclidean Commute Distance')
plt.xlabel('Euclidean Distance')
plt.ylabel('Number of Commuters');

In [None]:
plt.figure(figsize=(15,5))
train_data_raw['manhattan_distance'].hist(bins=50)
plt.title('Manhattan Commute Distance')
plt.xlabel('Manhattan Distance')
plt.ylabel('Number of Commuters');

#### Now that we have some information about distance, do you have a guess for which distance measure is more strongly related to commute time?
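
One way to check a guess like this is to compare correlations. The sketch below is only an illustration: the DataFrame is made-up toy data that borrows the column names used above, so the numbers it prints say nothing about the real commute data. On the actual data you could run the same `.corr()` call on `train_data_raw`.

```python
import pandas as pd

# Toy data, invented for illustration only; not from the commute dataset.
toy = pd.DataFrame({
    'euclidean_distance': [1.0, 2.0, 3.0, 4.0],
    'manhattan_distance': [1.4, 2.5, 4.4, 5.0],
    'commute_time': [10.0, 19.0, 33.0, 41.0],
})

# Pearson correlation of each distance measure with commute time.
correlations = toy.corr()['commute_time'].drop('commute_time')
print(correlations)
```

Correlation only captures linear relationships, so a scatter plot of each distance against commute time is a useful companion check.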

## MOAR features
### At this point, we have done some feature engineering with numerical (continuous) data. What do you do with categorical data?

In [None]:
# Most models can't use strings such as 'BIKE' or 'CAR' directly, but they can use columns of 0s and 1s.
# Each distinct value of a categorical feature is called a level, and each level gets its own indicator column.
# Why does it make sense to leave one (level) out? By process of elimination, if a datapoint isn't one of the
# n-1 levels, it must be the nth.

def create_indicator_features(feature, leave_one_out=True):
    # Sort the levels so we always get the same ordering of new features.
    levels = list(sorted(np.unique(feature)))
    # If we need to leave one out to avoid identifiability issues, we will 
    # leave out the *last* level, in sorted order.
    if leave_one_out:
        levels = levels[:-1]
    indicator_features = []
    for level in levels:
        indicator_feature = (feature == level)
        indicator_feature_name = "is_{0}".format(level)
        indicator_features.append(
            pd.Series(indicator_feature, 
                      name=indicator_feature_name, 
                      index=feature.index,
                      dtype=int))
    return pd.concat(indicator_features, axis=1)

In [None]:
commute_type_features_train = create_indicator_features(train_data_raw['commute_type'])
commute_type_features_test = create_indicator_features(test_data_raw['commute_type'])

commute_type_features_train.head()
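
One thing to watch for: because the indicator features are built separately for the training and testing data, the two sets of columns will only match if both datasets contain the same levels. The sketch below is one possible safeguard, not the approach this notebook uses. It swaps in `pd.get_dummies` for brevity and works on made-up toy Series in which the test data is missing a level.

```python
import pandas as pd

# Toy data, invented for illustration only; the test set has no 'BUS' rows.
train_types = pd.Series(['BIKE', 'CAR', 'BUS', 'CAR'], name='commute_type')
test_types = pd.Series(['CAR', 'BIKE'], name='commute_type')

# get_dummies builds one 0/1 column per level, ordered by sorted level name.
train_indicators = pd.get_dummies(train_types, prefix='is', dtype=int)
test_indicators = pd.get_dummies(test_types, prefix='is', dtype=int)

# Mirror leave_one_out=True by dropping the last level of the training data.
train_indicators = train_indicators.iloc[:, :-1]

# Give the test set the same columns, in the same order, as the training set.
# Levels missing from the test data become columns of zeros.
test_indicators = test_indicators.reindex(
    columns=train_indicators.columns, fill_value=0)

print(train_indicators.columns.tolist())
print(test_indicators)
```

Taking the column list from the training data means a model fit on the training features always sees the same features at prediction time.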

In [None]:
# Drop the original columns that we have already transformed into new features.
train_data_raw = train_data_raw.drop(['time_of_day_ts', 'commute_type'], axis=1)
test_data_raw = test_data_raw.drop(['time_of_day_ts', 'commute_type'], axis=1)

In [None]:
# Append our new indicator variables to the original dataframes.
train_data_processed = pd.concat([train_data_raw, commute_type_features_train], axis=1)
test_data_processed = pd.concat([test_data_raw, commute_type_features_test], axis=1)

In [None]:
# Save the processed datasets so they can be reloaded later.
train_data_processed.to_csv('../data/train_data_processed.csv')
test_data_processed.to_csv('../data/test_data_processed.csv')

## A bit more practice with some Python ideas:
### Defining functions, using for loops, simplifying your life