# Brazilian Congress proposals analysis

## About me
<table cellspacing="0" cellpadding="0">
  <tr>
    <td>
        Diego Alves <br />
        Software Engineer<br />
        <link rel="stylesheet" href="//maxcdn.bootstrapcdn.com/font-awesome/4.3.0/css/font-awesome.min.css">
        <i class="fa fa-google"></i> &nbsp; <a href="mailto:diegocardalves@gmail.com">Email</a><br />
        <i class="fa fa-linkedin"></i> &nbsp; <a href="https://www.linkedin.com/in/diegocardosoalves">LinkedIn</a><br />
        <i class="fa fa-github fa-lg"></i> &nbsp; <a href="https://github.com/diegoca80/datascience">GitHub</a><br />
    </td>
    <td>
        <a href="https://www.linkedin.com/in/diegocardosoalves" target="_blank"><img src="http://i67.tinypic.com/1jn605.png" border="0" alt="Diego Alves"></a>
    </td>
  </tr>
</table>

### Imports and configurations

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Use the full option names; short names such as "max_rows" can be ambiguous in newer pandas
pd.set_option("display.max_rows", 10)
pd.set_option("display.max_columns", 100)
sns.set_style("darkgrid")

### Exploring data

In [None]:
data = pd.read_csv("allProp.csv", na_values=["\n"])
data.head()

In [None]:
# Check for null values and inspect the column dtypes
data.info()

In [None]:
data.describe()

#### As shown above, the mean of qtdAutores is around 2, so each proposal has about two authors on average.
#### We can also plot the top 10 authors for each year to identify the main proposal authors.

In [None]:
%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999;

In [None]:
# One horizontal bar chart per year, showing the 10 authors with the most proposals
fig, axes = plt.subplots(data["ano"].nunique(), 1, figsize=(8, 100))
for (year, group), ax in zip(data.groupby("ano"), axes.flatten()):
    group.groupby("autor1.txtNomeAutor").size().nlargest(10).plot(kind="barh", ax=ax, title=str(year))


### We can see that "Comissão de Ciência e Tecnologia, Comunicação e Informática", "Comissão de Relações Exteriores e de Defesa Nacional" and "Poder Executivo" frequently create new proposals over the years. However, these authors are groups of members rather than individuals, so we could treat them as outliers or simply ignore them.
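
One way to set these collective authors aside before ranking is sketched below. The toy DataFrame and the `is_collective` helper are illustrative assumptions; only the column name `autor1.txtNomeAutor` comes from this notebook, and the matching rule (names starting with "Comissão", plus "Poder Executivo") is a guess that should be checked against the real data.

```python
import pandas as pd

# Toy data standing in for allProp.csv (values are made up for illustration)
toy = pd.DataFrame({
    "autor1.txtNomeAutor": [
        "Poder Executivo",
        "Comissão de Relações Exteriores e de Defesa Nacional",
        "Deputado A",
        "Deputado A",
        "Deputado B",
    ]
})

def is_collective(names: pd.Series) -> pd.Series:
    """Flag authors that look like committees or institutions (assumed rule)."""
    cleaned = names.str.strip()
    return cleaned.str.startswith("Comissão", na=False) | cleaned.eq("Poder Executivo")

# Keep only individual authors, then rank them by number of proposals
individuals = toy[~is_collective(toy["autor1.txtNomeAutor"])]
top_individuals = individuals.groupby("autor1.txtNomeAutor").size().nlargest(10)
print(top_individuals)
```

Applying the same mask to `data` before the per-year plots would leave only individual authors in each chart.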

## Let's repeat the analysis by political party, since Brazilian voters usually choose candidates based on their party representation.

In [None]:
# One horizontal bar chart per year, showing the 10 parties with the most proposals
fig, axes = plt.subplots(data["ano"].nunique(), 1, figsize=(8, 100))
for (year, group), ax in zip(data.groupby("ano"), axes.flatten()):
    group.groupby("autor1.txtSiglaPartido").size().nlargest(10).plot(kind="barh", ax=ax, title=str(year))

## Let's check the top 10 parties by number of proposals over all years, using the entire dataset.
### Note: a blank party value is not null. It belongs to the organizations mentioned previously.
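
The difference between a blank and a null value can be checked directly. The snippet below is a minimal sketch on a made-up Series; the `"(no party)"` label is a hypothetical name chosen here for illustration, not something used elsewhere in the notebook.

```python
import numpy as np
import pandas as pd

# Toy party column: acronyms, blank strings, and one truly missing value
party = pd.Series(["PT", " ", "", np.nan, "PMDB "], name="autor1.txtSiglaPartido")

# Only NaN counts as null
n_null = int(party.isna().sum())

# Empty or whitespace-only strings are blank but not null
n_blank = int(party.str.strip().eq("").sum())

# Optional: give blanks an explicit label so they are easy to spot in plots
labelled = party.str.strip().replace("", "(no party)")

print(n_null, n_blank)
print(labelled.tolist())
```

This is why `data.info()` does not report the blank entries as missing.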

In [None]:
data.groupby("autor1.txtSiglaPartido").size().nlargest(10).plot(kind="barh", figsize=(8,8))

## Since this information is organized by year but covers many different political parties, let's plot a time series of the number of proposals for the 3 most representative parties.

In [None]:
# Strip extra whitespace from the party acronyms before grouping, so variants are counted together
data["autor1.txtSiglaPartido"] = data["autor1.txtSiglaPartido"].str.strip()
# Keep only the three parties of interest, without overwriting the full dataset
top3 = data[data["autor1.txtSiglaPartido"].isin(["PT", "PSDB", "PMDB"])]
# Count proposals per (year, party) and plot one line per party
counts = top3.groupby(["ano", "autor1.txtSiglaPartido"]).size().unstack("autor1.txtSiglaPartido")
counts.plot(figsize=(15, 7), xticks=counts.index)

## Simple observations:
- > PMDB and PT created the majority of proposals in the historical data, but we can't conclude anything yet since we don't know how many seats each party held over the years.
- > The proposal counts of both parties move together in almost all years (when one increases, so does the other; when one decreases, so does the other).
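
The claim that the parties move together can be quantified. The sketch below uses made-up yearly counts (not values from `allProp.csv`) and two simple measures chosen here as assumptions: the share of years in which both series change in the same direction, and the Pearson correlation of the year-over-year changes.

```python
import pandas as pd

# Made-up yearly proposal counts, for illustration only
counts = pd.DataFrame(
    {"PT": [100, 150, 120, 90], "PMDB": [80, 130, 110, 70]},
    index=pd.Index([2011, 2012, 2013, 2014], name="ano"),
)

# Year-over-year change per party; the first year has no previous value
changes = counts.diff().dropna()

# Share of years in which both parties moved in the same direction
same_direction = ((changes["PT"] * changes["PMDB"]) > 0).mean()

# Pearson correlation of the yearly changes
corr = changes["PT"].corr(changes["PMDB"])

print(same_direction, round(corr, 3))
```

With the real data, the same two lines could be applied to the year-by-party table used for the time series plot.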

## Interesting observations:
- > Congressional elections in Brazil take place every four years, starting in 1990. In election years the graph shows a small number of proposals, perhaps because politicians are busy campaigning for reelection or doubt that their proposals would be approved during the government's last year.
- > The highest number of proposals occurs one year after each election and then decreases each year until the next election.
- > In 2015, we can notice a large gap between the political parties that didn't exist in the past. This might be explained by the crisis in Brazil and the recurrent protests by citizens and other politicians (see <a href="https://en.wikipedia.org/wiki/2015%E2%80%9316_protests_in_Brazil">link</a>). Since the president was Dilma Rousseff of PT, the opposition may have tried to push new proposals for improvement.
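
The election-cycle pattern described above can be checked by grouping years by their position in the four-year cycle. This is a minimal sketch on made-up yearly totals; the `n_proposals` and `cycle_year` column names are hypothetical, and 1990 is taken as an election year as stated in the text.

```python
import pandas as pd

# Made-up yearly totals, for illustration only (not taken from allProp.csv)
yearly = pd.DataFrame({
    "ano": [2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013],
    "n_proposals": [50, 200, 150, 120, 60, 220, 160, 110],
})

# Position in the four-year cycle, counting from the 1990 election:
# 0 = election year, 1 = first year after an election, and so on
yearly["cycle_year"] = (yearly["ano"] - 1990) % 4

# Average number of proposals at each position in the cycle
by_cycle = yearly.groupby("cycle_year")["n_proposals"].mean()
print(by_cycle)
```

If the observation holds in the real data, the average should peak at `cycle_year == 1` and be lowest at `cycle_year == 0`.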


# Next steps:
## 1) Check other features not covered in this notebook.
## 2) Compare approved proposals with non-approved ones.
## 3) Text analysis with word clouds and clustering algorithms.