Lambda School Data Science, Unit 2: Predictive Modeling

# Applied Modeling, Module 1

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [X] Choose your target. Which column in your tabular dataset will you predict?
- [X] Choose which observations you will use to train, validate, and test your model, and which observations, if any, to exclude.
- [X] Determine whether your problem is regression or classification.
- [X] Choose your evaluation metric.
- [X] Begin with baselines: majority class baseline for classification, or mean baseline for regression, with your metric of choice.
- [X] Begin to clean and explore your data.
- [X] Choose which features, if any, to exclude. Would some features "leak" information from the future?

## Reading
- [Attacking discrimination with smarter machine learning](https://research.google.com/bigpicture/attacking-discrimination-in-ml/), by Google Research, with interactive visualizations. _"A threshold classifier essentially makes a yes/no decision, putting things in one category or another. We look at how these classifiers work, ways they can potentially be unfair, and how you might turn an unfair classifier into a fairer one. As an illustrative example, we focus on loan granting scenarios where a bank may grant or deny a loan based on a single, automatically computed number such as a credit score."_
- [How Shopify Capital Uses Quantile Regression To Help Merchants Succeed](https://engineering.shopify.com/blogs/engineering/how-shopify-uses-machine-learning-to-help-our-merchants-grow-their-business)
- [Maximizing Scarce Maintenance Resources with Data: Applying predictive modeling, precision at k, and clustering to optimize impact](https://towardsdatascience.com/maximizing-scarce-maintenance-resources-with-data-8f3491133050), **by Lambda DS3 student** Michael Brady. His blog post extends the Tanzania Waterpumps scenario far beyond what's in the lecture notebook.
- [Notebook about how to calculate expected value from a confusion matrix by treating it as a cost-benefit matrix](https://github.com/podopie/DAT18NYC/blob/master/classes/13-expected_value_cost_benefit_analysis.ipynb)
- [Simple guide to confusion matrix terminology](https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/) by Kevin Markham, with video
- [Visualizing Machine Learning Thresholds to Make Better Business Decisions](https://blog.insightdatascience.com/visualizing-machine-learning-thresholds-to-make-better-business-decisions-4ab07f823415)

In [8]:
# Load the data so we can use it
import pandas as pd
bitcoin = pd.read_csv('Historical data for Bitcoin.csv')
bitcoin.head(10)

Unnamed: 0,Date,Open*,High,Low,Close**,Volume,Market Cap
0,8/12/2019,11528.19,11528.19,11320.95,11382.62,13647200000.0,203441000000.0
1,8/11/2019,11349.74,11523.58,11248.29,11523.58,15774370000.0,205942000000.0
2,8/10/2019,11861.56,11915.66,11323.9,11354.02,18125360000.0,202890000000.0
3,8/9/2019,11953.47,11970.46,11709.75,11862.94,18339990000.0,211961000000.0
4,8/8/2019,11954.04,11979.42,11556.17,11966.41,19481590000.0,213788000000.0
5,8/7/2019,11476.19,12036.99,11433.7,11941.97,22194990000.0,213330000000.0
6,8/6/2019,11811.55,12273.82,11290.73,11478.17,23635110000.0,205023000000.0
7,8/5/2019,10960.74,11895.09,10960.74,11805.65,23875990000.0,210849000000.0
8,8/4/2019,10821.63,11009.21,10620.28,10970.18,16530890000.0,195908000000.0
9,8/3/2019,10519.28,10946.78,10503.5,10821.73,15352690000.0,193234000000.0


In [9]:
bitcoin.dtypes

Date           object
Open*         float64
High          float64
Low           float64
Close**       float64
Volume        float64
Market Cap    float64
dtype: object
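
The `Open*` and `Close**` headers carry footnote asterisks from the data source, and `Date` loads as `object` rather than a datetime. A minimal sketch of one possible cleanup, using a two-row sample copied from the output above in place of the CSV. The snake_case column names are an assumption for illustration; the cells that follow keep the original headers.

```python
import pandas as pd

# Two-row sample with the same headers as the CSV (values from the head() output)
sample = pd.DataFrame({
    'Date': ['8/12/2019', '8/11/2019'],
    'Open*': [11528.19, 11349.74],
    'Close**': [11382.62, 11523.58],
    'Market Cap': [203441000000.0, 205942000000.0],
})

# Strip footnote asterisks, then lowercase and replace spaces with underscores
sample.columns = (
    sample.columns
    .str.replace('*', '', regex=False)
    .str.strip()
    .str.lower()
    .str.replace(' ', '_', regex=False)
)

# Parse the month/day/year strings into real timestamps
sample['date'] = pd.to_datetime(sample['date'], format='%m/%d/%Y')

print(sample.dtypes)
```

Cleaning the names once, right after loading, avoids quoting special characters in every later column reference.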

In [24]:
# A little bit of feature engineering
bitcoin['avg_daily_price'] = (bitcoin['Open*'] + bitcoin['High'] + bitcoin['Low'] + bitcoin['Close**']) / 4

bitcoin['Date'] = pd.to_datetime(bitcoin['Date'])
bitcoin['Year'] = bitcoin['Date'].dt.year

# Rows are sorted newest first, so the row below each row is the previous day.
# Flag whether the previous day's low was higher than this day's low.
# shift(-1) aligns each row with the row below it; the oldest row has no
# previous day, so its comparison with NaN evaluates to False.
bitcoin['Previous higher?'] = bitcoin['Low'].shift(-1) > bitcoin['Low']
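
The `Previous higher?` flag compares each row's `Low` with the `Low` of the row below it, which is the previous calendar day because the data is sorted newest first. A minimal sketch on made-up numbers (the `toy` frame is hypothetical), checking that a row-by-row loop and a vectorized `shift(-1)` comparison give the same flags:

```python
import pandas as pd

# Made-up lows, sorted newest first like the notebook's data
toy = pd.DataFrame({'Low': [10.0, 12.0, 11.0, 11.5]})

# Row-by-row version: compare each row with the row below it (the previous day).
# The last row has no previous day, so it keeps the default of False.
looped = [False] * len(toy)
for i in range(len(toy) - 1):
    looped[i] = bool(toy.loc[i + 1, 'Low'] > toy.loc[i, 'Low'])

# Vectorized version: shift(-1) pulls the next row's value up by one position.
# For the last row the shifted value is NaN, and NaN > x evaluates to False.
toy['Previous higher?'] = toy['Low'].shift(-1) > toy['Low']

print(toy)
print(looped)
```

Assigning a whole column at once also avoids the chained indexing (`df[col][i] = ...`) that triggers pandas' `SettingWithCopyWarning`.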





In [28]:
bitcoin.Year.value_counts()

2016    366
2017    365
2015    365
2018    365
2014    365
2013    248
2019    224
Name: Year, dtype: int64

In [29]:
# Time-based split: train on 2013-2015, validate on 2016-2017, test on 2018-2019
train = bitcoin[bitcoin.Year < 2016]
val = bitcoin[(bitcoin.Year > 2015) & (bitcoin.Year < 2018)]
test = bitcoin[bitcoin.Year > 2017]
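
Splitting by year keeps the validation and test sets strictly later in time than the training set. A minimal sketch of a sanity check for this kind of split, run on made-up daily dates rather than the real data (the `toy` frame and its date range are assumptions for illustration):

```python
import pandas as pd

# Made-up daily dates spanning the split boundaries (not the real data)
toy = pd.DataFrame({'Date': pd.date_range('2015-12-30', '2018-01-02', freq='D')})
toy['Year'] = toy['Date'].dt.year

train = toy[toy.Year < 2016]
val = toy[(toy.Year > 2015) & (toy.Year < 2018)]
test = toy[toy.Year > 2017]

# Every row lands in exactly one split
assert len(train) + len(val) + len(test) == len(toy)

# The splits are ordered in time with no overlap
assert train['Date'].max() < val['Date'].min()
assert val['Date'].max() < test['Date'].min()

print(len(train), len(val), len(test))
```

The same three assertions can be run against the real `train`, `val`, and `test` frames.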

In [33]:
# Get the X and y parts for train, val, and test
target = 'Previous higher?'
X_train = train.drop(columns=target)
y_train = train[target]
X_val = val.drop(columns=target)
y_val = val[target]
X_test = test.drop(columns=target)
y_test = test[target]

This is a binary classification problem: the target, `Previous higher?`, is either True or False. The evaluation metric is accuracy.

In [31]:
from sklearn.metrics import accuracy_score

In [41]:
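# Majority class baseline: always predict the most frequent class in the training set.
# Its accuracy equals that class's share of the training targets.
baseline_accuracy = y_train.value_counts(normalize=True).max()
print('Baseline accuracy is', baseline_accuracy)

Baseline accuracy is 0.5715746421267893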
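
The figure above is the baseline's accuracy on the training set. The majority class in the training years is not guaranteed to be the majority class in later years, so it can help to score the same baseline on the validation set too. A minimal sketch using scikit-learn's `DummyClassifier` on made-up targets; the `*_toy` names and values are hypothetical stand-ins for `y_train` and `y_val`:

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up boolean targets standing in for y_train and y_val
y_train_toy = pd.Series([False, False, False, True, True])
y_val_toy = pd.Series([True, False, True, True])

# DummyClassifier ignores the features, so a placeholder column is enough
X_train_toy = pd.DataFrame({'placeholder': [0] * len(y_train_toy)})
X_val_toy = pd.DataFrame({'placeholder': [0] * len(y_val_toy)})

# Learn the most frequent class from the training targets only
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train_toy, y_train_toy)

train_acc = accuracy_score(y_train_toy, baseline.predict(X_train_toy))
val_acc = accuracy_score(y_val_toy, baseline.predict(X_val_toy))
print('Train baseline accuracy:', train_acc)
print('Validation baseline accuracy:', val_acc)
```

In this toy case the training majority class is `False` while the validation set is mostly `True`, so the validation score falls below the training score.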
