## Expenses on car rental

This analysis investigates expenses on car rental during the current term. A previous analysis I did in Excel showed that 1) some politicians systematically spend above the monthly limit of R$ 10K, and 2) some congresspersons rent more than one vehicle every month. The second point raises suspicion: considering they work in DF, are the cars rented outside DF being used by someone else?

~~**First step:** get a list of congresspersons, the amount reimbursed by them since Jan. 2015 and the dates of those reimbursements. Then we cross these data with the list of companies that rented those vehicles so we can get information on where those rentals occurred.~~ *Done!*

**Second step:** get datasets (sessions, speeches) that may show whether a congressperson was or was not in DF during specific periods: the periods when the vehicles were rented. The result would be a list of months in which the congressperson spent most of his/her time in DF but paid a full month's rent somewhere else. *I need some help here.*
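
As a starting point for this step, here is a minimal sketch of the cross-check. It assumes an attendance table with one row per congressperson per day present in DF. The `attendance` and `rentals` frames, their toy values, and both cutoffs are hypothetical; no such attendance dataset is loaded in this notebook.

```python
import pandas as pd

# Hypothetical attendance records: one row per day a congressperson
# was registered as present at a session in DF (toy values).
attendance = pd.DataFrame({
    'congressperson_name': ['EXAMPLE A'] * 3 + ['EXAMPLE B'],
    'date': pd.to_datetime(['2016-03-01', '2016-03-02',
                            '2016-03-08', '2016-03-01']),
})

# Hypothetical monthly rentals outside DF; column names mirror
# the reimbursements data.
rentals = pd.DataFrame({
    'congressperson_name': ['EXAMPLE A', 'EXAMPLE B'],
    'year': [2016, 2016],
    'month': [3, 3],
    'net_values': [9900.0, 2500.0],
})

# Count distinct days present in DF per congressperson and month.
attendance['year'] = attendance['date'].dt.year
attendance['month'] = attendance['date'].dt.month
days_in_df = (attendance
              .groupby(['congressperson_name', 'year', 'month'])['date']
              .nunique()
              .reset_index(name='days_in_df'))

# Attach attendance to each rental month; months without attendance get 0.
crossed = rentals.merge(days_in_df, how='left',
                        on=['congressperson_name', 'year', 'month'])
crossed['days_in_df'] = crossed['days_in_df'].fillna(0).astype(int)

# Flag months with a large rental bill and several days spent in DF.
# Both cutoffs are arbitrary placeholders to be tuned.
suspect = crossed[(crossed['net_values'] >= 9000) & (crossed['days_in_df'] >= 3)]
print(suspect)
```

The real cutoffs would depend on how many session days a month has and on what counts as a "full-month" rent, which this sketch does not try to settle.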

-- Rodolfo Viana

In [1]:
import pandas as pd
import numpy as np

# Read identifier columns as strings so leading zeros are preserved
data = pd.read_csv('../data/2017-06-04-reimbursements.xz',
                   dtype={'applicant_id': str,
                          'cnpj_cpf': str,
                          'congressperson_id': str,
                          'congressperson_name': str,
                          'subquota_number': str,
                          'issue_date': str,
                          'document_id': str},
                   low_memory=False)

#### Selecting term, subquota

There are 14,263 car rental reimbursements since Jan. 2015. They add up to R$ 60,373,960.80.

In [2]:
data = data[data['year'] >= 2015]
data = data[data['subquota_description'] == 'Automotive vehicle renting or charter'].copy()
# Strip punctuation from CNPJ/CPF; regex=True treats the pattern as a character class
data['cnpj_cpf'] = data['cnpj_cpf'].str.replace(r'[./-]', '', regex=True)
data.subquota_description.value_counts()

Automotive vehicle renting or charter    14263
Name: subquota_description, dtype: int64
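
To illustrate what the `str.replace` call in this cell does, here is a small sketch on made-up identifiers (placeholders, not real CNPJs or CPFs). Stripping `.`, `/` and `-` leaves only digits, so formatted and unformatted versions of the same identifier become equal.

```python
import pandas as pd

# Placeholder identifiers in formatted and unformatted styles.
raw = pd.Series(['12.345.678/0001-90', '12345678000190', '123.456.789-09'])

# regex=True makes the pattern a character class instead of a literal string.
clean = raw.str.replace(r'[./-]', '', regex=True)

print(clean.tolist())
```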

In [3]:
data.net_values.sum()

60373960.80000005

In [4]:
congressperson_list = data[['congressperson_name', 
                            'congressperson_id', 
                            'net_values', 
                            'month', 
                            'year', 
                            'issue_date', 
                            'document_id',
                            'cnpj_cpf']]

In [5]:
congressperson_expenses = congressperson_list.groupby(['congressperson_name', 
                                                       'year', 
                                                       'month', 
                                                       'issue_date', 
                                                       'document_id']).agg({'net_values': 'sum'})

congressperson_expenses.head(20)

congressperson_name,year,month,issue_date,document_id,net_values
ABEL MESQUITA JR.,2015,2,2015-02-02 00:00:00.0,5601321,900.0
ABEL MESQUITA JR.,2015,9,2015-10-26 00:00:00.0,5830987,9307.0
ABEL MESQUITA JR.,2015,11,2015-11-30 00:00:00.0,5862624,9693.9
ABEL MESQUITA JR.,2015,12,2015-12-23 00:00:00.0,5886420,9440.0
ABEL MESQUITA JR.,2016,2,2016-02-29T00:00:00,5929072,7080.0
ABEL MESQUITA JR.,2016,3,2016-03-30T00:00:00,5959422,9900.0
ABEL MESQUITA JR.,2016,4,2016-04-28T00:00:00,5986290,9100.0
ABEL MESQUITA JR.,2016,5,2016-05-31T00:00:00,6011416,9500.0
ABEL MESQUITA JR.,2016,6,2016-06-30T00:00:00,6041818,9600.0
ABEL MESQUITA JR.,2016,7,2016-07-29T00:00:00,6069097,7800.0
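
The `issue_date` column above mixes two text formats (`2015-02-02 00:00:00.0` and `2016-02-29T00:00:00`). If the dates are needed for the second step, one way to normalize them is to keep only the first ten characters before parsing. The sketch below uses two values copied from the table and assumes every value starts with `YYYY-MM-DD`, which the rows shown suggest but do not guarantee.

```python
import pandas as pd

# The two formats that appear in the output above.
issue_date = pd.Series(['2015-02-02 00:00:00.0', '2016-02-29T00:00:00'])

# Keep the date part only, then parse with an explicit format.
parsed = pd.to_datetime(issue_date.str[:10], format='%Y-%m-%d')

print(parsed.dt.strftime('%Y-%m-%d').tolist())
```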


#### Getting companies dataset, excluding those from DF, merging with reimbursements dataset

There are 10,945 receipts from companies outside DF. They add up to R$ 49,156,001.56.

In [6]:
companies = pd.read_csv('../data/2017-05-21-companies-no-geolocation.xz', low_memory=False)
companies = companies[companies['state'] != 'DF'].copy()
# Strip punctuation so CNPJs match the format used in the reimbursements
companies['cnpj'] = companies['cnpj'].str.replace(r'[./-]', '', regex=True)

In [7]:
dataset = pd.merge(data, companies, how='inner',
                   left_on='cnpj_cpf', right_on='cnpj')
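
Because `how='inner'` silently drops every reimbursement whose `cnpj_cpf` has no match in `companies` (which includes all suppliers located in DF, since they were filtered out), it may be worth checking how many rows fall out of the merge. A minimal sketch with toy frames, using the `indicator` option of `pd.merge`; the values are invented for illustration:

```python
import pandas as pd

# Toy stand-ins for `data` and `companies` (invented values).
data_toy = pd.DataFrame({'cnpj_cpf': ['111', '222', '333'],
                         'net_values': [100.0, 200.0, 300.0]})
companies_toy = pd.DataFrame({'cnpj': ['111', '333'],
                              'state': ['SP', 'AL']})

# A left merge with indicator=True labels each row 'both' or 'left_only'.
checked = pd.merge(data_toy, companies_toy, how='left',
                   left_on='cnpj_cpf', right_on='cnpj', indicator=True)

# Rows an inner merge would have dropped, and the value they represent.
dropped = checked[checked['_merge'] == 'left_only']
print(len(dropped), dropped['net_values'].sum())
```

The same check would also reveal duplicated rows if a CNPJ happened to appear more than once in `companies`, since the merged frame would then be longer than `data`.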

In [8]:
congressperson_expenses_dataset = dataset.groupby(['congressperson_name', 
                                                    'year', 
                                                    'month', 
                                                    'issue_date',
                                                    'cnpj',
                                                    'name',
                                                    'city',
                                                    'state_y',
                                                    'document_id']).agg({'net_values': 'sum'})
full_report = congressperson_expenses_dataset.reset_index()

In [9]:
full_report.name.count()

10945

In [10]:
full_report.net_values.sum()

49156001.55999989
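
The two patterns mentioned in the introduction (monthly totals above the limit, and more than one rental in the same month) could be checked by aggregating per congressperson and month. A minimal sketch on toy rows with the same column names as `full_report`: the R$ 10,000 limit comes from the introduction, the names and values are invented, and counting distinct suppliers is only a proxy for counting vehicles.

```python
import pandas as pd

MONTHLY_LIMIT = 10000.0  # limit cited in the introduction

# Toy rows shaped like `full_report` (invented values).
report_toy = pd.DataFrame({
    'congressperson_name': ['EXAMPLE A', 'EXAMPLE A', 'EXAMPLE B'],
    'year': [2016, 2016, 2016],
    'month': [3, 3, 3],
    'cnpj': ['111', '222', '333'],
    'net_values': [6000.0, 5500.0, 4000.0],
})

# Total reimbursed and number of distinct suppliers per person and month.
monthly = (report_toy
           .groupby(['congressperson_name', 'year', 'month'])
           .agg(total=('net_values', 'sum'), suppliers=('cnpj', 'nunique'))
           .reset_index())

over_limit = monthly[monthly['total'] > MONTHLY_LIMIT]
multi_supplier = monthly[monthly['suppliers'] > 1]
print(over_limit)
print(multi_supplier)
```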

#### Getting outliers

Although Rosie flags outliers at the mean plus three times the standard deviation, here we use the mean plus twice the standard deviation. The reason is the subquota category: unlike other categories (e.g. taxi, food, hotel), car rental involves large amounts, and a 3 x std threshold would let suspicious receipts pass.

In [11]:
full_report.net_values.describe()

count    10945.000000
mean      4491.183331
std       2694.178468
min          3.230000
25%       2492.100000
50%       3900.000000
75%       6200.000000
max      10900.000000
Name: net_values, dtype: float64
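
Plugging the summary statistics above into both rules shows the effect of the choice. This is plain arithmetic on the printed `mean`, `std` and `max`, not a rerun of the notebook:

```python
# Values copied from the describe() output above.
mean, std, maximum = 4491.183331, 2694.178468, 10900.0

threshold_2std = mean + 2 * std  # about 9879.54
threshold_3std = mean + 3 * std  # about 12573.72

# With 3 x std the cutoff is above the largest receipt, so nothing is flagged.
print(round(threshold_2std, 2), round(threshold_3std, 2), threshold_3std > maximum)
```

With these numbers, the 3 x std cutoff (about R$ 12,573.72) is higher than the largest receipt (R$ 10,900.00), so it would flag no receipts at all, while the 2 x std cutoff flags receipts of roughly R$ 9,879.54 or more.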

In [12]:
# Flag receipts at or above mean + 2 standard deviations
threshold = full_report.net_values.mean() + 2 * full_report.net_values.std()
outliers = (full_report[full_report['net_values'] >= threshold]
            .sort_values('net_values', ascending=False))

In [13]:
outliers.congressperson_name.value_counts().head(20)

GIVALDO CARIMBÃO        28
PEDRO FERNANDES         28
REMÍDIO MONAI           27
JOSI NUNES              27
FÁBIO MITIDIERI         26
LUIZ LAURO FILHO        26
JUSCELINO FILHO         26
ADALBERTO CAVALCANTI    25
ASSIS DO COUTO          25
JHONATAN DE JESUS       25
DELEGADO ÉDER MAURO     25
ASSIS CARVALHO          24
JONY MARCOS             24
ZECA DO PT              23
NILTON CAPIXABA         23
RICARDO TEOBALDO        22
NEWTON CARDOSO JR       21
CACÁ LEÃO               21
NELSON MEURER           19
HÉLIO LEITE             18
Name: congressperson_name, dtype: int64

In [14]:
outliers.net_values.sum()

8712789.409999996

### Conclusion (so far)

In the current term (since Jan. 2015), congresspersons were reimbursed R$ 49,156,001.56 for car rentals outside DF. These include vehicles rented for a few days, which is normal, and possibly cars rented for the whole month, which is unusual considering that congresspersons work in DF.

As I am a newbie at statistics, I used the mean plus twice the standard deviation as the outlier threshold. Should I consider another rule? The outliers add up to R$ 8,712,789.41.
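
On the question of other rules: one common alternative is Tukey's fences, which flag values above Q3 + 1.5 x IQR. A quick check using the quartiles printed by `describe()` earlier (again arithmetic on printed values, offered as a sketch rather than a recommendation):

```python
# Quartiles and maximum copied from the describe() output.
q1, q3, maximum = 2492.10, 6200.00, 10900.00

iqr = q3 - q1                 # interquartile range
upper_fence = q3 + 1.5 * iqr  # Tukey's upper fence

print(round(upper_fence, 2), upper_fence > maximum)
```

Here the upper fence (about R$ 11,761.85) is above the largest receipt, so this rule would flag nothing. That suggests receipt values are bounded from above, possibly by the reimbursement limit, in which case comparing monthly totals per congressperson against the limit may be more informative than a distribution-based cutoff on single receipts.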

Now I need help to go on with the second step and some review of the first step, so I can figure out how to improve this analysis. 

This analysis will be updated soon.