# Supervised Learning to find value in stocks in the stock market

### Notebook by Catarina Fernandes, Diogo Almeida, Pedro Queirós
#### Supported by [Luís Paulo Reis](https://web.fe.up.pt/~lpreis/)
#### [Faculdade de Engenharia da Universidade do Porto](https://sigarra.up.pt/feup/pt/web_page.inicial)


**It is recommended to [view this notebook in nbviewer](http://nbviewer.ipython.org/github/rhiever/Data-Analysis-and-Machine-Learning-Projects/blob/master/example-data-science-notebook/Example%20Machine%20Learning%20Notebook.ipynb) for the best viewing experience.**

**You can also [execute the code in this notebook on Binder](https://mybinder.org/v2/gh/rhiever/Data-Analysis-and-Machine-Learning-Projects/master?filepath=example-data-science-notebook%2FExample%20Machine%20Learning%20Notebook.ipynb) - no local installation required.**

## Table of contents

1. [Introduction](#Introduction)

2. [Required libraries and Models](#Required-libraries-and-Models)
    - [Libraries](#Libraries)   
    - [Models](#Models)

3. [The problem domain](#The-problem-domain)

4. [Step 1: Data Analysis](#Step-1:-Data-Analysis)


## Introduction

[[ go back to the top ]](#Table-of-contents)



## Libraries

[[ go back to the top ]](#Table-of-contents)

If you don't have Python on your computer, you can use the [Anaconda Python distribution](http://continuum.io/downloads) to install most of the Python packages you need. Anaconda provides a simple double-click installer for your convenience.

This notebook uses several Python packages that come standard with the Anaconda Python distribution. The primary libraries that we'll be using are:

* **NumPy**: Provides a fast numerical array structure and helper functions.
* **pandas**: Provides a DataFrame structure to store data in memory and work with it easily and efficiently.
* **scikit-learn**: The essential Machine Learning package in Python.
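* **matplotlib**: Basic plotting library in Python; most other Python plotting libraries are built on top of it.
* **Seaborn**: Advanced statistical plotting library.
* **watermark**: A Jupyter Notebook extension for printing timestamps, version numbers, and hardware information.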

To make sure you have all of the packages you need, install them with `conda`:

    conda install numpy pandas scikit-learn matplotlib seaborn
    
    conda install -c conda-forge watermark

`conda` may ask you to update some of them if you don't have the most recent version. Allow it to do so.
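
To confirm that the packages above are visible to the interpreter running this notebook, a small check along these lines can help. This is only a sketch: the helper name `installed_versions` is ours, and it relies on the standard-library `importlib.metadata` module (Python 3.8+) rather than on anything specific to Anaconda.

```python
from importlib import metadata

# Distribution names as they appear in the install commands above.
REQUIRED = ["numpy", "pandas", "scikit-learn", "matplotlib", "seaborn", "watermark"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version if version is not None else 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` can then be added with the `conda install` commands shown above.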

**Note:** We will not be providing support for people trying to run this notebook outside of the Anaconda Python distribution.

## The problem domain

[[ go back to the top ]](#Table-of-contents)


## Step 1: Answering the question

[[ go back to the top ]](#Table-of-contents)

The first step to any data analysis project is to define the question or problem we're looking to solve, and to define a measure (or set of measures) for our success at solving that task. The data analysis checklist has us answer a handful of questions to accomplish that, so let's work through those questions.

>Did you specify the type of data analytic question (e.g. exploration, association, causality) before touching the data?

We're trying to design a predictive model capable of evaluating and finding stocks that are worth investing in, based on the 10-K filings of each company.

>Did you define the metric for success before beginning?

Let's do that now. Since we're performing classification, we can use [accuracy](https://en.wikipedia.org/wiki/Accuracy_and_precision), the fraction of correctly classified stocks, to quantify how well our model is performing. To ensure that investors make a profit, we should achieve an accuracy of 50% or greater.
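
As a minimal sketch of the metric, accuracy can be computed by hand as shown here. The labels are made up for illustration, and we assume that `1` marks a stock worth investing in and `0` one that is not; the notebook does not define the `class` column explicitly. The second helper, `majority_baseline_accuracy` (a name we chose), gives the accuracy of always predicting the most frequent label, which is a useful reference point for any target accuracy.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Made-up labels for illustration only (assumed: 1 = worth investing in, 0 = not).
y_true = [1, 1, 0, 1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1, 1, 1, 0]

print(accuracy(y_true, y_pred))            # 0.75
print(majority_baseline_accuracy(y_true))  # 0.625
```

scikit-learn's `sklearn.metrics.accuracy_score` computes the same fraction. In the `describe()` output in the Data Analysis step, the mean of `class` is 0.706897; if `class` is a 0/1 label, always predicting `1` would already reach roughly 70.7% accuracy on this data, so it may be worth comparing the model against that baseline as well.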

>Did you understand the context for the question and the scientific or business application?

TODO

>Did you record the experimental design?

TODO

>Did you consider whether the question could be answered with the available data?

TODO

<hr />


## Step 1: Data Analysis

[[ go back to the top ]](#Table-of-contents)

Before we start working with the data, we should analyze it and check for errors, such as null or undefined values.

In [12]:

    import pandas as pd

    # Load the 10-K filings data and preview the first five rows
    ten_k_fillings_data = pd.read_csv('10_k_fillings.csv')
    ten_k_fillings_data.head()


| Unnamed: 0.1 | Unnamed: 0 | Revenue | Revenue Growth | Cost of Revenue | Gross Profit | SG&A Expense | Operating Expenses | Operating Income | Earnings before Tax | Net Income | ... | EPS Diluted Growth | Weighted Average Shares Growth | Weighted Average Shares Diluted Growth | Operating Cash Flow growth | Free Cash Flow growth | Receivables growth | Asset Growth | Book Value per Share Growth | SG&A Expenses Growth | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | INTC | 70848000000.0 | 0.1289 | 27111000000.0 | 43737000000.0 | 6750000000.0 | 20421000000.0 | 23316000000.0 | 23317000000.0 | 21053000000.0 | ... | 1.2513 | -0.0191 | -0.0277 | 0.3312 | 0.3793 | 0.1989 | 0.0382 | 0.1014 | -0.0942 | 1 |
| 1 | MU | 30391000000.0 | 0.4955 | 12500000000.0 | 17891000000.0 | 813000000.0 | 2897000000.0 | 14994000000.0 | 14303000000.0 | 14135000000.0 | ... | 1.61 | 0.0579 | 0.065 | 1.1342 | 1.4922 | 0.4573 | 0.2275 | 0.6395 | 0.0942 | 1 |
| 2 | AAPL | 265595000000.0 | 0.1586 | 163756000000.0 | 101839000000.0 | 16705000000.0 | 30941000000.0 | 70898000000.0 | 72903000000.0 | 59531000000.0 | ... | 0.2932 | -0.0502 | -0.0479 | 0.2057 | 0.2385 | 0.3734 | -0.0256 | -0.1584 | 0.0946 | 1 |
| 3 | MSFT | 110360000000.0 | 0.1428 | 38353000000.0 | 72007000000.0 | 22223000000.0 | 36949000000.0 | 35058000000.0 | 36474000000.0 | 16571000000.0 | ... | -0.3446 | -0.0059 | -0.0049 | 0.1108 | 0.0279 | 0.1806 | 0.0341 | -0.0512 | 0.1144 | 1 |
| 4 | HPQ | 58472000000.0 | 0.1233 | 47803000000.0 | 10669000000.0 | 4859000000.0 | 6605000000.0 | 4064000000.0 | 3013000000.0 | 5327000000.0 | ... | 1.2027 | -0.0432 | -0.04 | 0.2314 | 0.2422 | 0.1584 | 0.0519 | 0.8039 | 0.1104 | 1 |
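
In the `head()` output above, the column holding the ticker symbols is labeled `Unnamed: 0`, which is the name pandas assigns to a CSV column whose header cell is empty. The sketch below shows one way to give such a column a descriptive name and use it as the index. It runs on a two-row stand-in built from values in the output above, not on the real file, and the name `ticker` is our own choice.

```python
import io

import pandas as pd

# Two rows copied from the head() output above; the first column has no header.
csv_text = """,Revenue,class
INTC,70848000000.0,1
MU,30391000000.0,1
"""

raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))  # ['Unnamed: 0', 'Revenue', 'class']

# Rename the header-less column and use it as the row index.
tidy = raw.rename(columns={"Unnamed: 0": "ticker"}).set_index("ticker")
print(tidy.loc["INTC", "Revenue"])  # 70848000000.0
```

With the tickers in the index, rows can be looked up by symbol and the column no longer gets in the way of numeric operations.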


In [11]:

    # Summary statistics for every numeric column
    ten_k_fillings_data.describe()

|  | Revenue | Revenue Growth | Cost of Revenue | Gross Profit | SG&A Expense | Operating Expenses | Operating Income | Earnings before Tax | Net Income | Net Income Com | ... | EPS Diluted Growth | Weighted Average Shares Growth | Weighted Average Shares Diluted Growth | Operating Cash Flow growth | Free Cash Flow growth | Receivables growth | Asset Growth | Book Value per Share Growth | SG&A Expenses Growth | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 638.0 | 638.0 | 638.0 | 638.0 | 638.0 | 638.0 | 638.0 | 638.0 | 638.0 | 638.0 | ... | 638.0 | 638.0 | 638.0 | 638.0 | 638.0 | 638.0 | 638.0 | 638.0 | 638.0 | 638.0 |
| mean | 4313656000.0 | 0.200593 | 2418867000.0 | 1893674000.0 | 683889500.0 | 1209286000.0 | 684985200.0 | 687694000.0 | 538703500.0 | 536502000.0 | ... | 0.367957 | 0.327554 | 0.354594 | 0.024223 | 0.342144 | 0.28203 | 0.160931 | 0.134593 | 0.112316 | 0.706897 |
| std | 18320100000.0 | 1.236156 | 10601820000.0 | 8412074000.0 | 2367724000.0 | 5087444000.0 | 4005216000.0 | 4214649000.0 | 3416454000.0 | 3416642000.0 | ... | 9.258839 | 4.700736 | 4.719772 | 10.419881 | 6.941349 | 1.567491 | 0.453783 | 1.487802 | 0.23132 | 0.455543 |
| min | 0.0 | -0.9593 | 0.0 | -297081000.0 | 792600.0 | -619000.0 | -1268450000.0 | -1326000000.0 | -4964000000.0 | -4964000000.0 | ... | -136.0 | -0.1587 | -0.2298 | -197.0278 | -31.9362 | -1.0 | -0.8645 | -17.0909 | -0.6419 | 0.0 |
| 25% | 102730000.0 | 0.0086 | 45014500.0 | 37232500.0 | 28614500.0 | 45838500.0 | -9972250.0 | -13226250.0 | -17064380.0 | -17678750.0 | ... | -0.259475 | -0.000375 | -0.002425 | -0.21725 | -0.316125 | -0.03705 | -0.03235 | -0.1069 | -0.0133 | 0.0 |
| 50% | 453977500.0 | 0.1008 | 195013500.0 | 209774000.0 | 129198500.0 | 193400000.0 | 8819500.0 | 6248500.0 | 4420444.0 | 4163274.0 | ... | 0.26425 | 0.01435 | 0.01625 | 0.1182 | 0.12625 | 0.0994 | 0.0542 | 0.03515 | 0.0799 | 1.0 |
| 75% | 1896714000.0 | 0.231325 | 905880800.0 | 840554000.0 | 425222800.0 | 643955800.0 | 166081000.0 | 137877000.0 | 110032200.0 | 109034300.0 | ... | 0.791325 | 0.06005 | 0.078225 | 0.49725 | 0.58915 | 0.282983 | 0.211925 | 0.17485 | 0.1888 | 1.0 |
| max | 265595000000.0 | 29.9804 | 163756000000.0 | 101839000000.0 | 24459000000.0 | 81310000000.0 | 70898000000.0 | 72903000000.0 | 59531000000.0 | 59531000000.0 | ... | 100.0 | 117.9042 | 117.9042 | 73.7835 | 148.8191 | 26.9808 | 3.6539 | 22.5526 | 1.9449 | 1.0 |
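
The check for null values mentioned at the start of this step can be sketched as follows. The toy table and the helper name `null_counts` are ours and the numbers are invented; on the real data the call would be `null_counts(ten_k_fillings_data)`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the filings table; the values are invented.
toy = pd.DataFrame(
    {
        "Revenue": [100.0, np.nan, 250.0, 80.0],
        "Net Income": [10.0, 5.0, 40.0, -3.0],
        "class": [1, 0, 1, 1],
    }
)


def null_counts(df):
    """Return the number of null (NaN/None) entries per column, for columns that have any."""
    counts = df.isnull().sum()
    return counts[counts > 0]


print(null_counts(toy))
```

An empty result means no column contains nulls. In the `describe()` output above every displayed column has a count of 638, which suggests that at least those columns have no missing entries; the columns hidden behind `...` would still need to be checked.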


In [13]:

    %matplotlib inline

    import matplotlib.pyplot as plt
    import seaborn as sb

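In [None]:

    # Plot every pair of columns against each other;
    # the trailing semicolon suppresses the text output of the cell
    sb.pairplot(ten_k_fillings_data);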
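
A pair plot draws one panel for every pair of columns, so the number of panels grows quadratically with the number of columns, and the table here has more columns than the 20 displayed above (note the `...` in the outputs). The sketch below restricts the plot to a handful of numeric columns. The column names are taken from the outputs above, but the selection itself is arbitrary, and the helper names `select_plot_columns` and `plot_pairs` are hypothetical.

```python
import pandas as pd

# Arbitrary subset of columns seen in the outputs above.
COLUMNS_TO_PLOT = ["Revenue Growth", "Asset Growth", "Receivables growth", "class"]


def select_plot_columns(df, wanted):
    """Keep the wanted columns that exist in df and are numeric, dropping rows with nulls."""
    present = [c for c in wanted if c in df.columns]
    numeric = df[present].select_dtypes(include="number")
    return numeric.dropna()


def plot_pairs(df, wanted=COLUMNS_TO_PLOT, hue="class"):
    """Draw a seaborn pair plot of the selected columns, colored by `hue` when available."""
    import seaborn as sb  # imported here so the selection helper works without seaborn

    subset = select_plot_columns(df, wanted)
    return sb.pairplot(subset, hue=hue if hue in subset.columns else None)


# In the notebook: plot_pairs(ten_k_fillings_data);
```

Coloring by `class` makes it easier to see whether any pair of features separates the two classes.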