# Regression

Regression is a form of supervised machine learning: the machine is repeatedly given contexts, questions, and the correct answers until it can start to choose correctly on its own.

## 01 Intro and Data

For the purposes of this exercise, we will start with some Microsoft (MSFT) stock price data from Quandl.

In [4]:
import pprint
import pandas as pd

import quandl

df = quandl.get("WIKI/MSFT")

pprint.pprint(df.head())

             Open   High   Low  Close     Volume  Ex-Dividend  Split Ratio  \
Date                                                                         
1986-03-13  25.50  29.25  25.5  28.00  3582600.0          0.0          1.0   
1986-03-14  28.00  29.50  28.0  29.00  1070000.0          0.0          1.0   
1986-03-17  29.00  29.75  29.0  29.50   462400.0          0.0          1.0   
1986-03-18  29.50  29.75  28.5  28.75   235300.0          0.0          1.0   
1986-03-19  28.75  29.00  28.0  28.25   166300.0          0.0          1.0   

            Adj. Open  Adj. High  Adj. Low  Adj. Close   Adj. Volume  
Date                                                                  
1986-03-13   0.058941   0.067609  0.058941    0.064720  1.031789e+09  
1986-03-14   0.064720   0.068187  0.064720    0.067031  3.081600e+08  
1986-03-17   0.067031   0.068765  0.067031    0.068187  1.331712e+08  
1986-03-18   0.068187   0.068765  0.065876    0.066454  6.776640e+07  
1986-03-19   0.066454   0.067031  0.064720    0.065298  4.789440e+07  
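`quandl.get` needs network access, so a small stand-in frame can help when following the later steps offline. The sketch below rebuilds by hand the five adjusted rows shown in this section's printouts. The name `sample` is an assumption made for illustration, and the values are the rounded ones from the printouts, not the full-precision data.

```python
import pandas as pd

# Values copied from the printed head of the MSFT data. They are rounded
# for display, so figures derived from them may differ in the last digits.
sample = pd.DataFrame(
    {
        "Adj. Open": [0.058941, 0.064720, 0.067031, 0.068187, 0.066454],
        "Adj. High": [0.067609, 0.068187, 0.068765, 0.068765, 0.067031],
        "Adj. Low": [0.058941, 0.064720, 0.067031, 0.065876, 0.064720],
        "Adj. Close": [0.064720, 0.067031, 0.068187, 0.066454, 0.065298],
        "Adj. Volume": [1.031789e09, 3.081600e08, 1.331712e08, 6.776640e07, 4.789440e07],
    },
    index=pd.to_datetime(
        ["1986-03-13", "1986-03-14", "1986-03-17", "1986-03-18", "1986-03-19"]
    ),
)
sample.index.name = "Date"

print(sample.shape)
print(sample.dtypes)
```

Any later cell that expects `df` can be tried against `sample` instead, as long as it only uses the adjusted columns.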

### Scrubbing Data

The initial response includes some columns we're not interested in, along with some that are repetitive or (mostly) static.

We can drop them by selecting only the columns we want, which gives us a new, narrower dataframe.

We keep the `Adj.` columns because they have been adjusted for stock splits, so prices stay comparable across the whole history.

In [7]:
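# Keep only the adjusted columns; .copy() gives an independent dataframe,
# so columns added later do not trigger pandas' SettingWithCopyWarning.
df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']].copy()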
pprint.pprint(df.head())

            Adj. Open  Adj. High  Adj. Low  Adj. Close   Adj. Volume
Date                                                                
1986-03-13   0.058941   0.067609  0.058941    0.064720  1.031789e+09
1986-03-14   0.064720   0.068187  0.064720    0.067031  3.081600e+08
1986-03-17   0.067031   0.068765  0.067031    0.068187  1.331712e+08
1986-03-18   0.068187   0.068765  0.065876    0.066454  6.776640e+07
1986-03-19   0.066454   0.067031  0.064720    0.065298  4.789440e+07
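One way to check which columns are static before dropping them is to count their distinct values. The sketch below does that on a hand-built frame holding a few columns from the first five rows of the original printout; `raw`, `static_cols`, and `kept` are hypothetical names, and five rows say nothing about the rest of the history.

```python
import pandas as pd

# A few columns from the first five rows of the original printout.
raw = pd.DataFrame(
    {
        "Close": [28.00, 29.00, 29.50, 28.75, 28.25],
        "Ex-Dividend": [0.0, 0.0, 0.0, 0.0, 0.0],
        "Split Ratio": [1.0, 1.0, 1.0, 1.0, 1.0],
        "Adj. Close": [0.064720, 0.067031, 0.068187, 0.066454, 0.065298],
    },
    index=pd.to_datetime(
        ["1986-03-13", "1986-03-14", "1986-03-17", "1986-03-18", "1986-03-19"]
    ),
)

# Columns with a single distinct value carry no information in this sample.
static_cols = [col for col in raw.columns if raw[col].nunique() == 1]
print(static_cols)

# Selecting with a list of names returns a new dataframe; .copy() makes the
# independence explicit, so adding a column here leaves `raw` untouched.
kept = raw[["Adj. Close"]].copy()
kept["Doubled"] = kept["Adj. Close"] * 2
print(list(raw.columns))
```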


### Transforming Data

#### High / low percent

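Now we calculate the high-low percent: the day's high minus the day's low, divided by the day's close and multiplied by 100 to express it as a percentage. It gives a rough measure of how much the price moved during the day.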

In [8]:
df['HL_PCT'] = (df['Adj. High'] - df['Adj. Low']) / df['Adj. Close'] * 100.0
pprint.pprint(df.head())

            Adj. Open  Adj. High  Adj. Low  Adj. Close   Adj. Volume  \
Date                                                                   
1986-03-13   0.058941   0.067609  0.058941    0.064720  1.031789e+09   
1986-03-14   0.064720   0.068187  0.064720    0.067031  3.081600e+08   
1986-03-17   0.067031   0.068765  0.067031    0.068187  1.331712e+08   
1986-03-18   0.068187   0.068765  0.065876    0.066454  6.776640e+07   
1986-03-19   0.066454   0.067031  0.064720    0.065298  4.789440e+07   

               HL_PCT  
Date                   
1986-03-13  13.392857  
1986-03-14   5.172414  
1986-03-17   2.542373  
1986-03-18   4.347826  
1986-03-19   3.539823  
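As a sanity check, the first `HL_PCT` value can be reproduced by hand. The sketch below works through the 1986-03-13 row twice: once with the raw prices from the original printout and once with the adjusted prices. In this row the adjusted prices appear to be the raw prices scaled by a common factor, so the ratio comes out the same up to display rounding.

```python
# First row (1986-03-13), raw prices from the original printout.
high, low, close = 29.25, 25.50, 28.00
hl_pct_raw = (high - low) / close * 100.0

# Same row, adjusted prices (rounded to six decimals in the printout).
adj_high, adj_low, adj_close = 0.067609, 0.058941, 0.064720
hl_pct_adj = (adj_high - adj_low) / adj_close * 100.0

print(round(hl_pct_raw, 6))
print(round(hl_pct_adj, 2))
```

The raw-price result matches the 13.392857 shown in the `HL_PCT` column; the adjusted-price result differs only in the later decimals because its inputs were rounded for display.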


#### Percent change

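Next we calculate the percent change for the day: the day's close minus the day's open, divided by the day's open and multiplied by 100.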

In [3]:
df['PCT_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0

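NameError: name 'df' is not defined

This `NameError` means the cell was run in a session where `df` did not exist yet. The low execution count (`In [3]`) suggests the kernel had been restarted. Run the cells above first, in order, and then re-run this one.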
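Since the cell above has no output to inspect, here is a minimal sketch of the same formula applied to the five adjusted rows printed earlier. The frame `prices` and the series `pct_change` are hypothetical stand-ins for the real `df`, and the inputs are the rounded printed values.

```python
import pandas as pd

# Adjusted open and close for the first five trading days in the printout.
prices = pd.DataFrame(
    {
        "Adj. Open": [0.058941, 0.064720, 0.067031, 0.068187, 0.066454],
        "Adj. Close": [0.064720, 0.067031, 0.068187, 0.066454, 0.065298],
    },
    index=pd.to_datetime(
        ["1986-03-13", "1986-03-14", "1986-03-17", "1986-03-18", "1986-03-19"]
    ),
)

# Open-to-close move as a percentage of the open, computed row by row.
pct_change = (prices["Adj. Close"] - prices["Adj. Open"]) / prices["Adj. Open"] * 100.0
print(pct_change.round(2))

# Positive values mark days that closed above their open.
up_days = (pct_change > 0).tolist()
print(up_days)
```

Note that this is an intraday figure (open to close on the same row), which is a different quantity from the row-over-row change that pandas' `Series.pct_change()` computes.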

## Create a new dataframe

The new dataframe keeps the adjusted close and adjusted volume from the previous dataframe, along with the two values we just calculated.

In [2]:
df = df[['Adj. Close', 'HL_PCT', 'PCT_change', 'Adj. Volume']]
print(df.head())
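
NameError: name 'df' is not defined

This is the same `NameError` as in the previous cell: `df` was not defined in the session when the cell ran. Once the earlier cells have been run in order, this cell prints the first five rows of the four selected columns.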
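One way to avoid out-of-order `NameError`s is to gather the steps from this section into a single function that always runs top to bottom. The sketch below does that; `build_features` is a hypothetical helper name, and the two-row input is taken from the rounded printouts purely to exercise it.

```python
import pandas as pd


def build_features(prices: pd.DataFrame) -> pd.DataFrame:
    """Apply the scrubbing and transforming steps from this section in order."""
    # Keep only the adjusted columns, as an independent copy.
    adj = prices[
        ["Adj. Open", "Adj. High", "Adj. Low", "Adj. Close", "Adj. Volume"]
    ].copy()
    # High-low spread as a percentage of the close.
    adj["HL_PCT"] = (adj["Adj. High"] - adj["Adj. Low"]) / adj["Adj. Close"] * 100.0
    # Open-to-close move as a percentage of the open.
    adj["PCT_change"] = (adj["Adj. Close"] - adj["Adj. Open"]) / adj["Adj. Open"] * 100.0
    # Final column selection used for the new dataframe.
    return adj[["Adj. Close", "HL_PCT", "PCT_change", "Adj. Volume"]]


# Two rows from the printouts, enough to check the shape of the result.
example = pd.DataFrame(
    {
        "Adj. Open": [0.058941, 0.064720],
        "Adj. High": [0.067609, 0.068187],
        "Adj. Low": [0.058941, 0.064720],
        "Adj. Close": [0.064720, 0.067031],
        "Adj. Volume": [1.031789e09, 3.081600e08],
    },
    index=pd.to_datetime(["1986-03-13", "1986-03-14"]),
)
features = build_features(example)
print(features.head())
```

With the real data, the call would be `build_features(quandl.get("WIKI/MSFT"))`, assuming the download succeeds.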