Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70%? If so, accuracy alone is a reasonable choice. Outside that range, accuracy can be misleading. Which evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?
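
The distribution and metric questions in the checklist above can be explored with a few lines of pandas and NumPy. What follows is a minimal sketch on made-up data: `y_class` and `y_reg` are hypothetical targets, not columns from any particular dataset, and the mean baseline is just one possible reference point.

```python
import numpy as np
import pandas as pd

# Hypothetical classification target (made-up labels for illustration).
y_class = pd.Series(['up', 'up', 'down', 'up', 'flat', 'up', 'down', 'up'])

# Relative frequency of each class. The largest value is the accuracy
# that always guessing the majority class would achieve.
class_freq = y_class.value_counts(normalize=True)
majority_freq = class_freq.max()
print(class_freq)
print(f'Majority class frequency: {majority_freq:.3f}')

# Hypothetical right-skewed regression target (made-up values).
y_reg = pd.Series([1.0, 2.0, 2.5, 3.0, 4.0, 5.0, 50.0, 400.0])

# Positive skewness suggests a right-skewed target.
print(f'Skew before log transform: {y_reg.skew():.2f}')

# log1p computes log(1 + y); expm1 inverts it when converting
# predictions back to the original units.
y_log = np.log1p(y_reg)
y_back = np.expm1(y_log)
print(f'Skew after log transform: {y_log.skew():.2f}')

# Regression metrics for a baseline that always predicts the mean.
baseline = np.full(len(y_reg), y_reg.mean())
errors = y_reg.to_numpy() - baseline
mae = np.abs(errors).mean()
rmse = np.sqrt((errors ** 2).mean())
r2 = 1 - (errors ** 2).sum() / ((y_reg - y_reg.mean()) ** 2).sum()
print(f'Baseline MAE: {mae:.2f}, RMSE: {rmse:.2f}, R^2: {r2:.2f}')
```

A model is only worth keeping if it beats these baselines on the metric you chose. Note that the mean baseline scores an R^2 of zero by construction.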

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Some students worry, ***what if my model isn't “good”?*** Then, [produce a detailed tribute to your wrongness. That is science!](https://twitter.com/nathanwpyle/status/1176860147223867393)

In [3]:
import numpy as np
import pandas as pd

In [5]:
url = 'https://raw.githubusercontent.com/laguz/stock_csv/master/AAPL.csv'
df = pd.read_csv(url)
df.head()

| | Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|---|
| 0 | 1980-12-08 | 0.128348 | 0.128906 | 0.128348 | 0.128348 | 0.101261 | 469033600 |
| 1 | 1980-12-15 | 0.12221 | 0.126674 | 0.112723 | 0.126116 | 0.0995 | 490134400 |
| 2 | 1980-12-22 | 0.132254 | 0.15904 | 0.132254 | 0.158482 | 0.125035 | 187891200 |
| 3 | 1980-12-29 | 0.160714 | 0.161272 | 0.152344 | 0.154018 | 0.121513 | 219452800 |
| 4 | 1981-01-05 | 0.151228 | 0.151228 | 0.135045 | 0.142299 | 0.112268 | 197904000 |
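
`pd.read_csv` leaves the `Date` column as plain strings unless told otherwise, so one early cleaning step worth considering is converting it to datetimes and checking the spacing between rows. The sketch below uses a few rows copied from the output above so that it runs without downloading the CSV. The name `sample` is hypothetical.

```python
import pandas as pd

# A few rows copied from the df.head() output above.
sample = pd.DataFrame({
    'Date': ['1980-12-08', '1980-12-15', '1980-12-22', '1980-12-29'],
    'Close': [0.128348, 0.126116, 0.158482, 0.154018],
})

# Convert the date strings to datetime64 values.
sample['Date'] = pd.to_datetime(sample['Date'])

# Gap between consecutive rows. A constant 7 days suggests weekly data.
gaps = sample['Date'].diff().dropna()
print(gaps.unique())

# Sorting by date and using it as the index makes time-based
# slicing and shifting less error-prone.
sample = sample.sort_values('Date').set_index('Date')
print(sample.index.min(), sample.index.max())
```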


In [20]:
df.tail()

| | Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|---|
| 2073 | 2020-08-31 | 127.580002 | 137.979996 | 110.889999 | 120.959999 | 120.959999 | 1168498600 |
| 2074 | 2020-09-07 | 113.949997 | 120.5 | 110.0 | 112.0 | 112.0 | 771441800 |
| 2075 | 2020-09-14 | 114.720001 | 118.830002 | 106.089996 | 106.839996 | 106.839996 | 944587000 |
| 2076 | 2020-09-21 | 104.540001 | 112.860001 | 103.099998 | 112.279999 | 112.279999 | 847212600 |
| 2077 | 2020-09-28 | 115.010002 | 117.720001 | 112.220001 | 113.019997 | 113.019997 | 640562200 |
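
The rows run in date order from 1980 to 2020, so a time-based split is a natural candidate for the train/validate/test question in the assignment. Here is a minimal sketch, assuming a synthetic weekly frame (`weekly`, with made-up prices) in place of the real data. The cutoff dates are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic weekly frame standing in for the real data (prices are made up).
dates = pd.date_range('1980-12-08', '2020-09-28', freq='7D')
weekly = pd.DataFrame({
    'Date': dates,
    'Close': np.linspace(0.1, 113.0, len(dates)),
})

# Arbitrary cutoffs: everything before 2015 trains the model,
# 2015 through 2017 validates it, 2018 onward is held out for testing.
val_start = pd.Timestamp('2015-01-01')
test_start = pd.Timestamp('2018-01-01')

train = weekly[weekly['Date'] < val_start]
val = weekly[(weekly['Date'] >= val_start) & (weekly['Date'] < test_start)]
test = weekly[weekly['Date'] >= test_start]

print(len(train), len(val), len(test))
```

Unlike a random split, this keeps every validation and test row later in time than every training row, which is closer to how a forecasting model would be used.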


In [33]:
# Week-over-week percent change in the closing price (the first row is NaN)
df['Percentage Change'] = (df['Close'] / df['Close'].shift(1) - 1) * 100
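
The notebook does not state a target yet. If the goal were to predict the following week's percentage change (an assumption, not something the cells above confirm), the target would need to be shifted so that each row pairs this week's features with next week's outcome. Otherwise the same-week `Percentage Change` would leak the answer. A minimal sketch, using closing prices copied from the `df.tail()` output, with `prices` and `Target Next Week` as hypothetical names:

```python
import pandas as pd

# Closing prices copied from the df.tail() output above.
prices = pd.DataFrame({
    'Close': [120.959999, 112.0, 106.839996, 112.279999, 113.019997],
})

# Same formula as the cell above: percent change from the previous row.
prices['Percentage Change'] = (prices['Close'] / prices['Close'].shift(1) - 1) * 100

# Hypothetical target: next week's percent change. shift(-1) pulls the
# following row's value up, so each row's features come from week t
# and its target comes from week t + 1.
prices['Target Next Week'] = prices['Percentage Change'].shift(-1)

# The first row has no previous close and the last row has no
# following week, so both contain NaN and are dropped.
model_rows = prices.dropna()
print(model_rows)
```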

In [34]:
df.tail(10)

| | Date | Open | High | Low | Close | Adj Close | Volume | Change | Change Volume | Percentage Change |
|---|---|---|---|---|---|---|---|---|---|---|
| 2068 | 2020-07-27 | 93.709999 | 106.415001 | 93.247498 | 106.260002 | 106.068756 | 847635600 | 13.645004 | 182226800.0 | 14.733039 |
| 2069 | 2020-08-03 | 108.199997 | 114.412498 | 107.892502 | 111.112503 | 110.912529 | 1003689200 | 4.852501 | 156053600.0 | 4.56663 |
| 2070 | 2020-08-10 | 112.599998 | 116.042503 | 109.107498 | 114.907501 | 114.907501 | 941898000 | 3.794998 | -61791200.0 | 3.415455 |
| 2071 | 2020-08-17 | 116.0625 | 124.8675 | 113.962502 | 124.370003 | 124.370003 | 835695200 | 9.462502 | -106202800.0 | 8.234886 |
| 2072 | 2020-08-24 | 128.697495 | 128.785004 | 123.052498 | 124.807503 | 124.807503 | 1063638000 | 0.4375 | 227942800.0 | 0.351773 |
| 2073 | 2020-08-31 | 127.580002 | 137.979996 | 110.889999 | 120.959999 | 120.959999 | 1168498600 | -3.847504 | 104860600.0 | -3.082751 |
| 2074 | 2020-09-07 | 113.949997 | 120.5 | 110.0 | 112.0 | 112.0 | 771441800 | -8.959999 | -397056800.0 | -7.407407 |
| 2075 | 2020-09-14 | 114.720001 | 118.830002 | 106.089996 | 106.839996 | 106.839996 | 944587000 | -5.160004 | 173145200.0 | -4.607146 |
| 2076 | 2020-09-21 | 104.540001 | 112.860001 | 103.099998 | 112.279999 | 112.279999 | 847212600 | 5.440003 | -97374400.0 | 5.091729 |
| 2077 | 2020-09-28 | 115.010002 | 117.720001 | 112.220001 | 113.019997 | 113.019997 | 640562200 | 0.739998 | -206650400.0 | 0.659065 |
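
The `Change` and `Change Volume` columns appear in this output, but the cells that created them are not shown. Their values are consistent with row-to-row differences of `Close` and `Volume`, so the sketch below assumes that definition. It uses three rows copied from the output, and `rows` is a hypothetical name.

```python
import pandas as pd

# Rows 2068 to 2070 copied from the output above.
rows = pd.DataFrame({
    'Close': [106.260002, 111.112503, 114.907501],
    'Volume': [847635600, 1003689200, 941898000],
}, index=[2068, 2069, 2070])

# Assumed definitions (not confirmed by the notebook): the difference
# between each row and the row before it. The first row becomes NaN.
rows['Change'] = rows['Close'].diff()
rows['Change Volume'] = rows['Volume'].diff()
print(rows)
```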


In [35]:
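# Date holds strings, so restrict the correlation to numeric columns.
# numeric_only is accepted from pandas 1.5; from 2.0 on, df.corr() raises without it.
df.corr(numeric_only=True)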

| | Open | High | Low | Close | Adj Close | Volume | Change | Change Volume | Percentage Change |
|---|---|---|---|---|---|---|---|---|---|
| Open | 1.0 | 0.999546 | 0.999468 | 0.999051 | 0.998109 | -0.160985 | 0.156154 | 0.000302 | 0.008559 |
| High | 0.999546 | 1.0 | 0.999089 | 0.999447 | 0.998811 | -0.158693 | 0.171202 | 0.001226 | 0.014392 |
| Low | 0.999468 | 0.999089 | 1.0 | 0.999478 | 0.998466 | -0.16491 | 0.175995 | -0.002033 | 0.015424 |
| Close | 0.999051 | 0.999447 | 0.999478 | 1.0 | 0.999223 | -0.162191 | 0.193966 | -0.000605 | 0.021548 |
| Adj Close | 0.998109 | 0.998811 | 0.998466 | 0.999223 | 1.0 | -0.169754 | 0.196525 | -0.000533 | 0.021661 |
| Volume | -0.160985 | -0.158693 | -0.16491 | -0.162191 | -0.169754 | 1.0 | -0.051942 | 0.316755 | 0.00621 |
| Change | 0.156154 | 0.171202 | 0.175995 | 0.193966 | 0.196525 | -0.051942 | 1.0 | -0.025201 | 0.310754 |
| Change Volume | 0.000302 | 0.001226 | -0.002033 | -0.000605 | -0.000533 | 0.316755 | -0.025201 | 1.0 | 0.035409 |
| Percentage Change | 0.008559 | 0.014392 | 0.015424 | 0.021548 | 0.021661 | 0.00621 | 0.310754 | 0.035409 | 1.0 |
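
In the matrix above, the price columns (`Open`, `High`, `Low`, `Close`, `Adj Close`) correlate above 0.99 with one another, so they carry nearly the same information, while `Percentage Change` is almost uncorrelated with the same-week price levels. One way to surface such redundant pairs is sketched below. It retypes part of the matrix so it runs on its own, and the 0.95 threshold is an arbitrary choice for illustration.

```python
import pandas as pd

# Part of the correlation matrix above, retyped so the sketch is self-contained.
cols = ['Open', 'Close', 'Volume', 'Percentage Change']
corr = pd.DataFrame(
    [[1.0, 0.999051, -0.160985, 0.008559],
     [0.999051, 1.0, -0.162191, 0.021548],
     [-0.160985, -0.162191, 1.0, 0.00621],
     [0.008559, 0.021548, 0.00621, 1.0]],
    index=cols, columns=cols,
)

# Visit each unordered pair of distinct columns once
# (the matrix is symmetric and its diagonal is always 1).
pairs = {
    (a, b): corr.loc[a, b]
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
}

# Keep the pairs whose absolute correlation exceeds the threshold.
threshold = 0.95
redundant = {pair: r for pair, r in pairs.items() if abs(r) > threshold}
print(redundant)
```

Pairs flagged this way are candidates for dropping all but one column when choosing features. The decision still depends on the target and the model.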
