![Insper](https://github.com/danielscarvalho/Insper-DS-Dicas/blob/master/Insper-Logo.png?raw=true)

# Insper Pós-Graduação
## Programa Avançado em Data Science e Decisão [»](https://www.insper.edu.br/pos-graduacao/programas-avancados/programa-avancado-em-data-science-e-decisao/)


# Integrative Activity
## Setup

### Dependencies

In [1]:
import pandas as pd
import numpy as np
from dfply import *
import altair as alt
import missingno as msno

### Loading the Data

In [2]:
data = pd.read_csv("cs_bisnode_panel.csv")
dicionario_de_dados_0 = pd.read_excel("bisnode_variable_names.xls", header=4)

In [3]:
data.head()

Unnamed: 0,comp_id,begin,end,COGS,amort,curr_assets,curr_liab,extra_exp,extra_inc,extra_profit_loss,...,gender,origin,nace_main,ind2,ind,urban_m,region_m,founded_date,exit_date,labor_avg
0,1001034.0,2005-01-01,2005-12-31,,692.59259,7266.666504,7574.074219,0.0,0.0,0.0,...,mix,Domestic,5630.0,56.0,3.0,1,Central,1990-11-19,,
1,1001034.0,2006-01-01,2006-12-31,,603.703674,13122.222656,12211.111328,0.0,0.0,0.0,...,mix,Domestic,5630.0,56.0,3.0,1,Central,1990-11-19,,
2,1001034.0,2007-01-01,2007-12-31,,425.925934,8196.295898,7800.0,0.0,0.0,0.0,...,mix,Domestic,5630.0,56.0,3.0,1,Central,1990-11-19,,
3,1001034.0,2008-01-01,2008-12-31,,300.0,8485.185547,7781.481445,0.0,0.0,0.0,...,mix,Domestic,5630.0,56.0,3.0,1,Central,1990-11-19,,
4,1001034.0,2009-01-01,2009-12-31,,207.40741,5137.037109,15300.0,0.0,0.0,0.0,...,mix,Domestic,5630.0,56.0,3.0,1,Central,1990-11-19,,0.083333


## Data Cleaning

### Data Dictionary - Cleaning

When the data dictionary is loaded, the first column takes its name from the fifth row of the sheet (the `header=4` argument above is a zero-based row index). The other three columns are named below.

In [4]:
dicionario_de_dados_1 = dicionario_de_dados_0.rename({'Unnamed: 1': 'description',
                                                      'Unnamed: 2': 'type',
                                                      'Unnamed: 3': 'footnote'},
                                                     axis=1)

In [5]:
dicionario_de_dados_1.columns

Index(['variable', 'description', 'type', 'footnote'], dtype='object')
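
As a minimal sketch of how the `header` argument behaves, the example below reads an in-memory CSV instead of the Excel file (an assumption made so the snippet runs on its own). The rows are invented; only the layout mimics the dictionary, with a header row in which just the first cell is filled in.

```python
import io
import pandas as pd

# Hypothetical stand-in for the spreadsheet: four preamble rows, then a
# header row where only the first cell has a name.
raw = "\n".join([
    "Data dictionary,,,",
    "source: illustrative sample,,,",
    "preamble row 3,,,",
    "preamble row 4,,,",
    "variable,,,",
    "sales,total sales,numeric,(euros)",
    "year,year,date,",
])

# header=4 is zero-based, so the fifth row supplies the column names.
sample = pd.read_csv(io.StringIO(raw), header=4)
print(list(sample.columns))

# Empty header cells are auto-named "Unnamed: <position>"; rename them.
named = sample.rename(columns={"Unnamed: 1": "description",
                               "Unnamed: 2": "type",
                               "Unnamed: 3": "footnote"})
print(list(named.columns))
```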

We then remove the rows that are not relevant to the analysis, including fully blank rows and one row holding the dataset's version information:
 - `v 0.92. 2021-02-04`

In [6]:
dicionario_de_dados = dicionario_de_dados_1\
                       .drop(index=54)\
                       .dropna(how="all")\
                       .reset_index(drop=True)

In [7]:
dicionario_de_dados.sample(5)

Unnamed: 0,variable,description,type,footnote
6,curr_liab,current liabilities,numeric,(euros)
11,fixed_assets,fixed assets,numeric,(euros)
19,personnel_exp,personnel expenditure,numeric,(euros)
20,profit_loss_year,"Profit/loss, annuak",numeric,(euros)
29,year,year,date,
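
The hard-coded `index=54` only works while the version stamp stays on that exact row. One possible alternative, sketched here on an invented miniature of the dictionary, is to drop the row by matching its text. The prefix `"v 0."` is an assumption based on the stamp quoted above, and `toy`, `is_version_row` and `cleaned` are names made up for this example.

```python
import numpy as np
import pandas as pd

# Invented miniature of the dictionary: one blank row, one version-stamp row.
toy = pd.DataFrame({
    "variable": ["sales", np.nan, "year", "v 0.92. 2021-02-04"],
    "description": ["total sales", np.nan, "year", np.nan],
    "type": ["numeric", np.nan, "date", np.nan],
    "footnote": ["(euros)", np.nan, np.nan, np.nan],
})

# Flag rows whose first cell looks like a version stamp (assumed prefix "v 0.").
is_version_row = toy["variable"].str.startswith("v 0.", na=False)

cleaned = (
    toy[~is_version_row]
    .dropna(how="all")        # drop rows that are entirely blank
    .reset_index(drop=True)   # renumber without keeping the old index
)
print(cleaned)
```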


### Removing specific columns:

In [22]:
columns_to_remove = ['COGS', 'finished_prod', 'net_dom_sales', 'net_exp_sales', 'wages', 'D']

data.drop(columns=columns_to_remove, inplace=True)

data.columns

Index(['comp_id', 'begin', 'end', 'amort', 'curr_assets', 'curr_liab',
       'extra_exp', 'extra_inc', 'extra_profit_loss', 'fixed_assets',
       'inc_bef_tax', 'intang_assets', 'inventories', 'liq_assets',
       'material_exp', 'personnel_exp', 'profit_loss_year', 'sales',
       'share_eq', 'subscribed_cap', 'tang_assets', 'balsheet_flag',
       'balsheet_length', 'balsheet_notfullyear', 'year', 'founded_year',
       'exit_year', 'ceo_count', 'foreign', 'female', 'birth_year',
       'inoffice_days', 'gender', 'origin', 'nace_main', 'ind2', 'ind',
       'urban_m', 'region_m', 'founded_date', 'exit_date', 'labor_avg'],
      dtype='object')
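
Because `drop(..., inplace=True)` raises a `KeyError` if the cell is run a second time (the columns are already gone), a re-runnable variant could pass `errors="ignore"`. The sketch below uses a toy frame with invented values.

```python
import pandas as pd

toy = pd.DataFrame({"comp_id": [1, 2], "COGS": [None, None], "sales": [10.0, 0.0]})
columns_to_remove = ["COGS", "wages"]  # "wages" is absent from this toy frame

# errors="ignore" skips labels that are not present instead of raising KeyError
toy = toy.drop(columns=columns_to_remove, errors="ignore")
toy = toy.drop(columns=columns_to_remove, errors="ignore")  # safe to repeat
print(list(toy.columns))
```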

### Removing data from 2016:

Records that began or ended (that is, took place) in 2016 are removed from the dataset.

In [23]:
colunas_data = ["begin", "end", "founded_date", "exit_date"]

data.dtypes[colunas_data]

begin           datetime64[ns]
end             datetime64[ns]
founded_date    datetime64[ns]
exit_date       datetime64[ns]
dtype: object

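The data in these columns must be in datetime format to be handled correctly, so we convert them. (The output above already shows `datetime64[ns]` because that cell was executed after the conversion, as its execution counter indicates; a fresh `read_csv` without date parsing loads these columns as plain strings.)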

In [10]:
for column in colunas_data:
    data[column] = pd.to_datetime(data[column], format="%Y-%m-%d")

data[colunas_data].dtypes

begin           datetime64[ns]
end             datetime64[ns]
founded_date    datetime64[ns]
exit_date       datetime64[ns]
dtype: object
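
If a date column might contain malformed entries, `pd.to_datetime` accepts `errors="coerce"`, which turns unparseable values into `NaT` instead of raising. A small sketch with invented values:

```python
import pandas as pd

toy = pd.DataFrame({
    "begin": ["2005-01-01", "2006-01-01"],
    "exit_date": [None, "not a date"],
})

for column in ["begin", "exit_date"]:
    # Missing or unparseable entries become NaT rather than raising an error
    toy[column] = pd.to_datetime(toy[column], format="%Y-%m-%d", errors="coerce")

print(toy.dtypes)
```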

Removing records from 2016:

In [11]:
data = data[data["year"] != 2016].copy()
data["year"].unique()

array([2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015],
      dtype=int64)
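
The filter above relies on the `year` column. Since the description speaks of records that begin or end in 2016, a check on the date columns themselves could look like the sketch below. It uses toy data and assumes `begin` and `end` are already datetimes; `touches_2016` and `kept` are hypothetical names. The `.copy()` makes the result an independent frame, which avoids `SettingWithCopyWarning` when columns are added later.

```python
import pandas as pd

toy = pd.DataFrame({
    "comp_id": [1, 1, 2],
    "begin": pd.to_datetime(["2015-01-01", "2016-01-01", "2015-06-01"]),
    "end": pd.to_datetime(["2015-12-31", "2016-12-31", "2016-05-31"]),
})

# True for records that begin or end in 2016
touches_2016 = (toy["begin"].dt.year == 2016) | (toy["end"].dt.year == 2016)

kept = toy[~touches_2016].copy()
print(kept)
```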

### Creating the column for the "dependent" variable:
---

- We work with the notion of "activity": a company with no sales in the two following years is considered "inactive".

Note: active = 0, inactive = 1

In [55]:
# Sort the DataFrame by company and year
data.sort_values(by=['comp_id', 'year'], inplace=True)

# Shifted columns holding the sales of the next record (x1) and the one after (x2).
# Caveat: this shift runs over the whole frame, so the last two rows of each
# company read sales values from the next company.
data['sales_x1'] = data['sales'].shift(-1)
data['sales_x2'] = data['sales'].shift(-2)

# Flag records that are followed by two consecutive records with zero sales
condition = (data['sales_x1'] == 0) & (data['sales_x2'] == 0)

# Create a new 'dependente' column with 1 for ceased companies and 0 otherwise
data['dependente'] = condition.astype(int)
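
A plain `shift` runs over the whole frame, so the last rows of one company pick up the first sales values of the next one. A possible per-company variant, sketched on invented data, shifts within each `comp_id` group instead. Rows without enough future records then get `NaN`, which compares unequal to 0 and so yields `dependente = 0`; whether such rows should instead be excluded is a modelling decision left open here. The `naive` column is included only to show the difference.

```python
import pandas as pd

toy = pd.DataFrame({
    "comp_id": [1, 1, 1, 1, 2, 2, 2],
    "year":    [2012, 2013, 2014, 2015, 2013, 2014, 2015],
    "sales":   [50.0, 0.0, 0.0, 0.0, 0.0, 0.0, 7.0],
})
toy = toy.sort_values(["comp_id", "year"]).reset_index(drop=True)

# Whole-frame shift: values leak across the boundary between companies
toy["naive"] = ((toy["sales"].shift(-1) == 0) & (toy["sales"].shift(-2) == 0)).astype(int)

# Shift inside each company so values never cross comp_id boundaries
sales_by_company = toy.groupby("comp_id")["sales"]
toy["sales_x1"] = sales_by_company.shift(-1)
toy["sales_x2"] = sales_by_company.shift(-2)

# 1 when the two following records of the same company both report zero sales
toy["dependente"] = ((toy["sales_x1"] == 0) & (toy["sales_x2"] == 0)).astype(int)
print(toy)
```

In this toy frame the whole-frame version flags the last two records of company 1 only because company 2 happens to start with zero sales; the grouped version does not.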


Let's look at the first two companies in the DataFrame to check their "dependente" status:

In [56]:
# list of columns to inspect:
check_list=["comp_id", "year", "sales", "sales_x1", "sales_x2", "dependente"]

data[check_list].head(18)

Unnamed: 0,comp_id,year,sales,sales_x1,sales_x2,dependente
0,1001034.0,2005,62751.851562,64625.925781,65100.0,0
1,1001034.0,2006,64625.925781,65100.0,78085.1875,0
2,1001034.0,2007,65100.0,78085.1875,45388.890625,0
3,1001034.0,2008,78085.1875,45388.890625,9929.629883,0
4,1001034.0,2009,45388.890625,9929.629883,0.0,0
5,1001034.0,2010,9929.629883,0.0,0.0,1
6,1001034.0,2011,0.0,0.0,0.0,1
7,1001034.0,2012,0.0,0.0,0.0,1
8,1001034.0,2013,0.0,0.0,0.0,1
9,1001034.0,2014,0.0,0.0,925.925903,0


Now let's check the years in which the companies were active and inactive:

In [94]:
data_grouped = data.groupby('comp_id')

# Count the years each company is tracked
comp_years = data_grouped['year'].count()

# Count "active" years (sales > 0)
active_years = data_grouped['sales'].apply(lambda sales: (sales > 0).sum())

# Count "inactive" years (sales == 0)
inactive_years = data_grouped['sales'].apply(lambda sales: (sales == 0).sum())

# Count "dependent" years (dependente == 1)
depend_years = data_grouped['dependente'].apply(lambda flag: (flag == 1).sum())

In [96]:
activity_df = pd.DataFrame({"Total years":comp_years, "Active years":active_years, 
                            "Inactive years":inactive_years, "Dependent years":depend_years})

activity_df

comp_id,Total years,Active years,Inactive years,Dependent years
1.001034e+06,11,6,5,4
1.001541e+06,7,5,2,1
1.002029e+06,9,9,0,0
1.003200e+06,8,8,0,0
1.004012e+06,8,8,0,0
...,...,...,...,...
4.640446e+11,8,7,1,0
4.640568e+11,3,3,0,0
4.640793e+11,1,0,1,0
4.641050e+11,11,11,0,0
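
The four counts can also be computed in a single pass with named aggregation, which avoids one `apply` call per metric. This is a minimal sketch on invented data; the boolean helper columns `active` and `inactive` are hypothetical names introduced for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "comp_id":    [1, 1, 1, 2, 2],
    "year":       [2013, 2014, 2015, 2014, 2015],
    "sales":      [10.0, 0.0, 0.0, 5.0, 8.0],
    "dependente": [1, 0, 0, 0, 0],
})

summary = (
    # Helper boolean columns so each count becomes a simple sum
    toy.assign(active=toy["sales"] > 0, inactive=toy["sales"] == 0)
       .groupby("comp_id")
       # Dict unpacking allows output column names that contain spaces
       .agg(**{
           "Total years": ("year", "count"),
           "Active years": ("active", "sum"),
           "Inactive years": ("inactive", "sum"),
           "Dependent years": ("dependente", "sum"),
       })
)
print(summary)
```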
