Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70% ? If so, you can just use accuracy if you want. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?
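
For the regression case in the checklist, here is a minimal sketch of checking skew and log-transforming a target. The data is synthetic and the name `target` is hypothetical; swap in your own column.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed target (hypothetical stand-in for a real column)
rng = np.random.default_rng(0)
target = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000))

# A large positive skew suggests a long right tail
skew_before = target.skew()

# log1p handles zeros safely; expm1 is its inverse
target_log = np.log1p(target)
skew_after = target_log.skew()

# Predictions made on the log scale need to be converted back
target_roundtrip = np.expm1(target_log)

print('Skew before:', skew_before)
print('Skew after:', skew_after)
```

If you fit a model on the transformed target, remember to apply `np.expm1` to its predictions before computing an error in the original units.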

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Some students worry, ***what if my model isn't “good”?*** Then, [produce a detailed tribute to your wrongness. That is science!](https://twitter.com/nathanwpyle/status/1176860147223867393)

In [3]:
# Install and import the Yahoo Finance Python package
!pip install yfinance
import yfinance as yf



In [4]:
import numpy as np
import pandas as pd

In [5]:
# Load the data from github
url = 'https://raw.githubusercontent.com/laguz/stock_csv/master/AAPL.csv'
df = pd.read_csv(url)

In [6]:
# Take a look at the data
df.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,1980-12-12,0.128348,0.128906,0.128348,0.128348,0.101261,469033600.0
1,1980-12-15,0.12221,0.12221,0.121652,0.121652,0.095978,175884800.0
2,1980-12-16,0.113281,0.113281,0.112723,0.112723,0.088934,105728000.0
3,1980-12-17,0.115513,0.116071,0.115513,0.115513,0.091135,86441600.0
4,1980-12-18,0.118862,0.11942,0.118862,0.118862,0.093777,73449600.0


In [7]:
# Add the price change feature
df['Price_Change'] = df['Close'] - df['Close'].shift(1)

In [8]:
# Add the volume change feature
df['Volume_Change'] = df['Volume'] - df['Volume'].shift(1)

In [9]:
# Add the percentage change feature
df['Percentage_Change'] = (((df['Close'] / df['Close'].shift(1))-1)*100)

In [10]:
# Drop the NaN rows created by the three features above
df.dropna(how='any', inplace=True)
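
The three change features above can also be written with pandas' built-in `diff` and `pct_change`. A small sketch on made-up prices (the `toy` frame is hypothetical, not the AAPL data):

```python
import pandas as pd

# Made-up closing prices and volumes, for illustration only
toy = pd.DataFrame({
    'Close': [10.0, 11.0, 9.9, 9.9],
    'Volume': [100.0, 150.0, 120.0, 130.0],
})

# diff() is the same as x - x.shift(1)
toy['Price_Change'] = toy['Close'].diff()
toy['Volume_Change'] = toy['Volume'].diff()

# pct_change() is x / x.shift(1) - 1, so multiply by 100 for percent
toy['Percentage_Change'] = toy['Close'].pct_change() * 100

# The first row has no previous day, so it is NaN and gets dropped
toy = toy.dropna(how='any')
print(toy)
```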

In [36]:
# Create the target to predict.
# I want to know if the market went up or down.
df.loc[df['Price_Change'] >= 0, 'Up_Down'] = 1
df.loc[df['Price_Change'] < 0, 'Up_Down'] = 0
df['Up_Down'] = df['Up_Down'].astype('int')
df['Up_Down'].value_counts(normalize=True)

1    0.532284
0    0.467716
Name: Up_Down, dtype: float64

In [12]:
# Choose your target. Which column in your tabular dataset will you predict?
# I will predict Up_Down

In [13]:
# Is your problem regression or classification?
# It will be a classification (Up_Down is either 0 or 1)

In [14]:
# How is your target distributed?
# Two classes (1 = up, 0 = down). The majority class frequency is 53%, so the classes are roughly balanced.

In [15]:
# My evaluation metrics.
# Accuracy. (Mean Absolute Error is a regression metric; on 0/1 labels it is just 1 - accuracy.)
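
A quick sketch on the metric choice: with 0/1 labels and 0/1 predictions, mean absolute error is the error rate, so it carries the same information as accuracy. The labels below are made up for illustration.

```python
from sklearn.metrics import accuracy_score, mean_absolute_error

# Made-up labels and predictions, for illustration only
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)

# On binary 0/1 values the two always sum to 1
print('Accuracy:', acc)
print('MAE:', mae)
```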

In [16]:
# I will train from 1980 to 2004, validate from 2005 to 2014, and test from 2015 to 2019
# I will do a time-based split.
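
A minimal sketch of how that split could look, assuming a `Date` column that has already been converted to datetime. The `toy` frame is synthetic and stands in for the AAPL data; the year boundaries are the ones stated above.

```python
import numpy as np
import pandas as pd

# Synthetic business-day frame standing in for the AAPL data (values are made up)
dates = pd.date_range('1980-12-15', '2019-12-31', freq='B')
toy = pd.DataFrame({'Date': dates, 'Close': np.arange(len(dates), dtype=float)})

# Split on the calendar year so no future rows end up in training
year = toy['Date'].dt.year
train = toy[year <= 2004]
val = toy[(year >= 2005) & (year <= 2014)]
test = toy[(year >= 2015) & (year <= 2019)]

print(train.shape, val.shape, test.shape)
```

Splitting on the year keeps every training row strictly earlier than every validation row, which a shuffled random split would not guarantee.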

In [17]:
####################### Finish retro 1 ####################

In [18]:
####################### Start retro 2 ####################

In [37]:
# Take a look at the data
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10036 entries, 1 to 10038
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   Date               10036 non-null  datetime64[ns]
 1   Open               10036 non-null  float64       
 2   High               10036 non-null  float64       
 3   Low                10036 non-null  float64       
 4   Close              10036 non-null  float64       
 5   Adj Close          10036 non-null  float64       
 6   Volume             10036 non-null  float64       
 7   Price_Change       10036 non-null  float64       
 8   Volume_Change      10036 non-null  float64       
 9   Percentage_Change  10036 non-null  float64       
 10  Up_Down            10036 non-null  int64         
dtypes: datetime64[ns](1), float64(9), int64(1)
memory usage: 940.9 KB


In [20]:
# Convert the Date column to the datetime data type
df['Date'] = pd.to_datetime(df['Date'])

In [21]:
# Take a look at the data again
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10036 entries, 1 to 10038
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   Date               10036 non-null  datetime64[ns]
 1   Open               10036 non-null  float64       
 2   High               10036 non-null  float64       
 3   Low                10036 non-null  float64       
 4   Close              10036 non-null  float64       
 5   Adj Close          10036 non-null  float64       
 6   Volume             10036 non-null  float64       
 7   Price_Change       10036 non-null  float64       
 8   Volume_Change      10036 non-null  float64       
 9   Percentage_Change  10036 non-null  float64       
 10  Up_Down            10036 non-null  float64       
dtypes: datetime64[ns](1), float64(10)
memory usage: 940.9 KB


In [22]:
####################### Finish retro 2 ####################

In [23]:
####################### Start retro 3 ####################

In [27]:
# Baseline
baseline = df['Up_Down'].value_counts(normalize=True)
print('Majority Baseline:', baseline[1])

Majority Baseline: 0.5322837783977681


In [41]:
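# Numeric feature columns, excluding the target so it is not used as an input
numeric = df.select_dtypes(include='number').columns.drop('Up_Down').tolist()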

In [42]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier


numeric_features = numeric

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features)])


clf = Pipeline(steps=[('preprocessor', preprocessor),
                  ('classifier', DecisionTreeClassifier())])

In [None]:
# Create the feature matrix 
X = df.drop('Up_Down', axis=1)

# Create the target array (Up_Down is already 0/1, so no encoding is needed)
y = df['Up_Down']

# Import the train_test_split utility
from sklearn.model_selection import train_test_split

# Create the training and test sets
# (random split for now; the plan above calls for a time-based split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit the model and score it on the held-out test set
clf.fit(X_train, y_train)
print('Test Accuracy', clf.score(X_test, y_test))
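
One caveat on the features: `Price_Change` and `Percentage_Change` come from the same day's `Close` that defines `Up_Down`, so a model given them can read the answer off directly. Below is a hedged sketch of one possible way around this, shifting the change column by one day so each row only sees the previous day's value. The frame is synthetic and the `_lag1` column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic random-walk prices, for illustration only
rng = np.random.default_rng(42)
toy = pd.DataFrame({'Close': 100 + rng.normal(0, 1, 500).cumsum()})

toy['Price_Change'] = toy['Close'].diff()
toy = toy.dropna().copy()
toy['Up_Down'] = (toy['Price_Change'] >= 0).astype(int)

# The same-day change determines the target exactly: this is leakage
leaky_match = ((toy['Price_Change'] >= 0).astype(int) == toy['Up_Down']).mean()

# Lag the feature so row t only holds information from day t-1
toy['Price_Change_lag1'] = toy['Price_Change'].shift(1)
toy = toy.dropna().copy()
lagged_match = ((toy['Price_Change_lag1'] >= 0).astype(int) == toy['Up_Down']).mean()

print('Same-day feature agrees with target:', leaky_match)
print('Lagged feature agrees with target:', lagged_match)
```

A near-perfect accuracy from the pipeline above would be a sign of this leak rather than of a good model.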