Planning for this cycle

- Clean the dataset
- Rank the best stocks and select them
- Build a first solution quickly

# 1.0 - Packages

## 1.1 - Libraries

In [1]:
# data manipulation
import pandas as pd
from datetime import datetime
from IPython.display import clear_output

# data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# jupyter viewer (IPython.core.display is deprecated for these imports)
from IPython.display import display, HTML

## 1.2 - Helper Functions

In [2]:
def jupyter_settings():
    
    # plot style and size for matplotlib
    plt.style.use('bmh')
    plt.rcParams['figure.figsize'] = [25, 12]
    plt.rcParams['font.size'] = 20
    # show all columns and rows, and format floats with 3 decimals
    pd.options.display.max_columns = None
    pd.options.display.max_rows = None
    pd.options.display.float_format = '{:.3f}'.format
    pd.set_option('display.expand_frame_repr', False)
    # expand the notebook container to 100% of the screen width
    display(HTML('<style>.container{width:100% !important;}</style>'))

    # apply the same settings to seaborn
    sns.set()

jupyter_settings()

# path where the raw dataset is stored
raw_path = '/home/diegopmayer/Documents/projects/ideias/analise-acoes/notebooks/dataset/raw'

# 2.0 - Data Extraction

In [7]:
# load the data saved in step 2.2 of cycle 01
dataset = pd.read_feather(f'{raw_path}/dataset.ftr')

# 3.0 - Data Cleaning

### 2.3.1 - Dimensions and NaN

In [18]:
print(f'{dataset.shape[0]} lines\n     {dataset.shape[1]} Columns')
print(f'{dataset.isna().sum()}')

959655 lines
     8 Columns
Date        0
Open        0
High        0
Low         0
Close       0
Volume      0
Currency    0
symbol      0
dtype: int64


- Conclusion
    - The dataset has almost 1 million rows but only 8 columns, so it is easy to process
    - There are no null values, so no missing-value cleaning is needed

### 2.3.2 - Data Type

In [20]:
dataset.dtypes

Date        datetime64[ns]
Open               float64
High               float64
Low                float64
Close              float64
Volume               int64
Currency            object
symbol              object
dtype: object

In [21]:
dataset.tail()

| | Date | Open | High | Low | Close | Volume | Currency | symbol |
|---|---|---|---|---|---|---|---|---|
| 959650 | 2021-07-07 | 4.73 | 4.84 | 4.69 | 4.83 | 1399700 | BRL | BMGB4 |
| 959651 | 2021-07-08 | 4.76 | 4.76 | 4.68 | 4.7 | 949100 | BRL | BMGB4 |
| 959652 | 2021-07-12 | 4.74 | 4.86 | 4.71 | 4.84 | 1955700 | BRL | BMGB4 |
| 959653 | 2021-07-13 | 4.84 | 4.9 | 4.75 | 4.89 | 1491600 | BRL | BMGB4 |
| 959654 | 2021-07-14 | 4.89 | 4.91 | 4.83 | 4.86 | 333100 | BRL | BMGB4 |


- Conclusion
    - All data types are correct

### 2.3.3 - Descriptive Statistics

In [45]:
dataset.describe(include="all", datetime_is_numeric=True).T

| | count | unique | top | freq | mean | min | 25% | 50% | 75% | max | std |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | 959655.0 | | | | 2013-05-15 07:21:12.123419648 | 1995-01-02 00:00:00 | 2009-06-10 00:00:00 | 2014-07-25 00:00:00 | 2018-05-18 00:00:00 | 2021-07-14 00:00:00 | |
| Open | 959655.0 | | | | 29.709 | 0.000 | 3.900 | 9.120 | 18.320 | 84410.360 | 346.521 |
| High | 959655.0 | | | | 30.156 | 0.000 | 3.990 | 9.280 | 18.640 | 20200.390 | 342.037 |
| Low | 959655.0 | | | | 28.990 | 0.000 | 3.820 | 8.950 | 17.990 | 19468.930 | 326.774 |
| Close | 959655.0 | | | | 29.573 | 0.000 | 3.900 | 9.110 | 18.310 | 20157.870 | 334.334 |
| Volume | 959655.0 | | | | 12633708.334 | 0.000 | 11900.000 | 203418.000 | 1477400.000 | 77548822528.000 | 347732917.884 |
| Currency | 959655.0 | 1.0 | BRL | 959655.0 | | | | | | | |
| symbol | 959655.0 | 374.0 | BBDC3 | 13040.0 | | | | | | | |


Observations

- Some prices are equal to zero. Do those rows fall on weekends?
- Some maximum prices are very high: about 20 thousand for High, Low, and Close, and over 84 thousand for Open. Are they correct?
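
One possible way to follow up on the maximum prices is sketched below. The describe output shows a maximum Open of 84410.360 against a maximum High of 20200.390, so at least one row must have an Open above its own High. The helper names `max_price_per_symbol` and `inconsistent_ohlc` and the toy frame are hypothetical and exist only for illustration; on the real data the functions would be called with `dataset`.

```python
import pandas as pd

PRICE_COLUMNS = ["Open", "High", "Low", "Close"]


def max_price_per_symbol(df: pd.DataFrame, column: str = "Close", n: int = 5) -> pd.Series:
    """Return the n symbols with the highest maximum value in `column`."""
    return df.groupby("symbol")[column].max().nlargest(n)


def inconsistent_ohlc(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows where High is not the largest price or Low is not the smallest."""
    prices = df[PRICE_COLUMNS]
    bad_high = df["High"] < prices.max(axis=1)
    bad_low = df["Low"] > prices.min(axis=1)
    return df[bad_high | bad_low]


# toy data with hypothetical symbols and values, not taken from the dataset
toy = pd.DataFrame({
    "symbol": ["AAA3", "AAA3", "BBB4"],
    "Open": [10.0, 500.0, 4.7],
    "High": [10.5, 20.0, 4.9],
    "Low": [9.8, 19.0, 4.6],
    "Close": [10.2, 19.5, 4.8],
})

print(max_price_per_symbol(toy, column="Open", n=2))
print(inconsistent_ohlc(toy))
```

Listing the symbols behind the extreme values should show whether they belong to a few tickers (for example, prices not adjusted for splits) or are spread across the dataset; the sketch itself makes no claim about which is the case.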

In [69]:
# zero quotes
# group by day of the week and check the mean price
dataset.loc[:, 'day_of_week'] = dataset['Date'].dt.day_of_week
# numeric_only=True keeps only the numeric columns in the mean
dataset.groupby(by=['day_of_week']).mean(numeric_only=True).head(10)

| day_of_week | Open | High | Low | Close | Volume |
|---|---|---|---|---|---|
| 0 | 29.866 | 29.935 | 28.805 | 29.348 | 10599464.921 |
| 1 | 29.532 | 30.056 | 28.914 | 29.471 | 13356059.721 |
| 2 | 30.204 | 30.759 | 29.569 | 30.163 | 13463955.597 |
| 3 | 29.372 | 29.905 | 28.714 | 29.324 | 13531028.473 |
| 4 | 29.557 | 30.105 | 28.931 | 29.542 | 12179134.315 |


In [70]:
# check the minimum value for each day of the week
dataset.groupby(by=['day_of_week']).min().head(10)

| day_of_week | Date | Open | High | Low | Close | Volume | Currency | symbol |
|---|---|---|---|---|---|---|---|---|
| 0 | 1995-01-02 | 0.01 | 0.01 | 0.01 | 0.01 | 0 | BRL | AALR3 |
| 1 | 1995-01-03 | 0.01 | 0.01 | 0.01 | 0.01 | 0 | BRL | AALR3 |
| 2 | 1995-01-04 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | BRL | AALR3 |
| 3 | 1995-01-05 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | BRL | AALR3 |
| 4 | 1995-01-06 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | BRL | AALR3 |


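Only days 0 to 4 (Monday to Friday) appear, so the zeros are not weekend records. Zero prices show up on days 2 to 4 (Wednesday to Friday), and zero volume shows up on every weekday. Why?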
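
A minimal sketch of one way to dig into that question: isolate the rows with a zero price and count them per symbol, with the first and last date on which they occur. The helper names `zero_price_rows` and `zero_price_summary` and the toy frame are hypothetical; on the real data the functions would be called with `dataset`.

```python
import pandas as pd

PRICE_COLUMNS = ["Open", "High", "Low", "Close"]


def zero_price_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows where at least one price column is exactly zero."""
    mask = (df[PRICE_COLUMNS] == 0).any(axis=1)
    return df[mask]


def zero_price_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count zero-price rows per symbol, with the first and last date they occur."""
    zeros = zero_price_rows(df)
    return (
        zeros.groupby("symbol")["Date"]
        .agg(rows="count", first_date="min", last_date="max")
        .sort_values("rows", ascending=False)
    )


# toy data with hypothetical symbols and values, not taken from the dataset
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-02"]),
    "symbol": ["AAA3", "AAA3", "AAA3", "BBB4"],
    "Open": [0.0, 0.0, 5.0, 4.7],
    "High": [0.0, 0.0, 5.2, 4.9],
    "Low": [0.0, 0.0, 4.9, 4.6],
    "Close": [0.0, 0.0, 5.1, 4.8],
    "Volume": [0, 0, 1000, 2000],
})

summary = zero_price_summary(toy)
print(summary)
```

If the zero-price rows also have zero volume and cluster in a few symbols or date ranges, they may be days without trades rather than real quotes, but that is an assumption to verify against the data before dropping anything.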