# Exploratory analysis for federal contractors

This is a simple exploratory analysis to get familiar with Python, Docker containers, and Jupyter notebooks. I've downloaded datasets from the [Federal Contractors Database](https://www.usaspending.gov/#/download_center/custom_award_data). For variable types, see the [data dictionary](http://fedspendingtransparency.github.io/dictionary-v1.1/).

[Track progress of the project on my trello board](https://trello.com/b/lZYSGp4M/federal-contractors-python)


While the goal is simply to get used to Python and some other technologies, I am also interested in a specific question: ***which factors are the best predictors of a company being minority owned?***

## Init

Read in the data and load packages.

In [1]:
import pandas as pd
import numpy as np
import pandas_profiling as pp

dat = pd.read_csv('data/2017.csv', low_memory=False)

## Profiling

In [2]:
dat.shape

(72367, 225)

Since the data has 225 columns and over 70k rows, I'm only going to run the profile report on a subset of the rows (the first 10,001, because `.loc` slicing includes both endpoints). I'm also going to save the report as an HTML file outside of this analysis.

In [3]:
profile = pp.ProfileReport(dat.loc[0:10000])
profile.to_file(outputfile = "profiling/profile.html")
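
Taking the first rows only gives a fair picture if the file is not ordered in some meaningful way. As a minimal sketch of an alternative, the block below compares a first-n slice with a reproducible random sample. The toy frame stands in for `dat`, and the ordering and seed are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `dat`: 1,000 rows whose values grow with row order
# (an invented ordering, to show how a first-n slice can be skewed).
toy = pd.DataFrame({
    "row_id": np.arange(1000),
    "dollarsobligated": np.linspace(0.0, 999.0, 1000),
})

# First-n slice: .loc is label-based and inclusive, so 0:99 gives 100 rows.
head_subset = toy.loc[0:99]

# Random sample of the same size; random_state makes it reproducible.
random_subset = toy.sample(n=100, random_state=0)

print(head_subset["dollarsobligated"].mean())    # mean of the first 100 rows only
print(random_subset["dollarsobligated"].mean())  # much closer to the overall mean
```

If the real file turns out to be unordered, the original slice is fine and cheaper to reason about.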

In [4]:
import matplotlib.pyplot as plt
cmt = dat.corr()

I'm mostly interested in the dollars the companies receive and the size of the company, so I'm going to write a function that keeps only the variables whose correlation with a selected variable exceeds a threshold in absolute value. The following cells look at:

* Dollars Obligated
* Number of Employees
* Minority Owned Flag

In [5]:
def corMat(dd, corlv, var):
    # Keep the variables whose absolute correlation with `var` exceeds `corlv`.
    ind = dd[var].abs() > corlv
    return dd.loc[ind, ind]
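
To see what `corMat` does in isolation, here is a small sketch on synthetic data. The columns `a`, `b`, and `c` are made up for illustration; `b` is built from `a`, while `c` is independent noise, so only `a` and `b` should survive a 0.5 threshold.

```python
import numpy as np
import pandas as pd

def corMat(dd, corlv, var):
    # Boolean mask over variables whose |correlation| with `var` exceeds corlv.
    ind = dd[var].abs() > corlv
    # Apply the same mask to rows and columns of the correlation matrix.
    return dd.loc[ind, ind]

rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=500),  # strongly tied to a
    "c": rng.normal(size=500),                     # independent noise
})

sub = corMat(toy.corr(), 0.5, "a")
print(sub.columns.tolist())
```

The result is a square sub-matrix, which is why the cells below select a single column from it.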

In [6]:
corMat(cmt,0.05, "dollarsobligated").dollarsobligated

dollarsobligated                         1.000000
baseandexercisedoptionsvalue             0.967203
baseandalloptionsvalue                   0.861076
progsourcesubacct                       -0.076980
prime_awardee_executive1_compensation   -0.084980
prime_awardee_executive2_compensation   -0.072067
prime_awardee_executive3_compensation   -0.071098
prime_awardee_executive4_compensation   -0.071617
prime_awardee_executive5_compensation   -0.065659
Name: dollarsobligated, dtype: float64

In [7]:
corMat(cmt,0.05, "numberofemployees").numberofemployees

progsourceagency                             -0.068632
progsourcesubacct                            -0.112453
ccrexception                                 -0.357217
vendor_cd                                     0.093742
congressionaldistrict                         0.093742
placeofperformancezipcode                    -0.145658
transactionnumber                             0.100897
numberofemployees                             1.000000
veteranownedflag                             -0.077613
receivescontracts                             0.077046
issubchapterscorporation                     -0.124280
islimitedliabilitycorporation                 0.051856
ispartnershiporlimitedliabilitypartnership    0.057622
prime_awardee_executive1_compensation         0.975024
prime_awardee_executive2_compensation         0.980690
prime_awardee_executive3_compensation         0.972878
prime_awardee_executive4_compensation         0.963208
prime_awardee_executive5_compensation         0.973902
Name: numb

In [8]:
corMat(cmt,0.25, "minorityownedbusinessflag").minorityownedbusinessflag

progsourcesubacct                               -0.288274
placeofperformancezipcode                        0.255004
firm8aflag                                       0.316287
minorityownedbusinessflag                        1.000000
apaobflag                                        0.614026
baobflag                                         0.329557
naobflag                                         0.263247
haobflag                                         0.269490
isdotcertifieddisadvantagedbusinessenterprise    0.270052
prime_awardee_executive1_compensation           -0.337572
prime_awardee_executive2_compensation           -0.375064
prime_awardee_executive3_compensation           -0.398767
prime_awardee_executive4_compensation           -0.446047
prime_awardee_executive5_compensation           -0.400471
Name: minorityownedbusinessflag, dtype: float64

The minority owned business flag shows several interesting correlations:

1. apaobflag, baobflag, naobflag, and haobflag are just subtypes of minority flags: Asian Pacific American, Black American, Native American, and Hispanic American, respectively. (Thus they're not particularly interesting.)
2. firm8aflag is for 8(a) Program Participant Organizations. The 8(a) program is for small, disadvantaged companies.
3. DOT certified disadvantaged companies have a slightly smaller correlation.
4. All of the executive compensation variables have a negative correlation, meaning that higher executive compensation is associated with a lower likelihood of being minority owned.
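
A correlation between a True/False flag and a dollar amount is easier to read as a difference between two groups. The sketch below uses invented numbers (only the column names come from the dataset) to show that a negative correlation with the flag corresponds to a lower group mean for flagged rows.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 400
flag = rng.random(n) < 0.3  # invented share of minority-owned rows
# Invented compensation: lower on average when the flag is True.
comp = np.where(flag, 150_000.0, 300_000.0) + rng.normal(scale=50_000.0, size=n)

toy = pd.DataFrame({
    "minorityownedbusinessflag": flag,
    "prime_awardee_executive1_compensation": comp,
})

# Pearson correlation with the boolean column cast to 0/1.
r = toy["minorityownedbusinessflag"].astype(float).corr(
    toy["prime_awardee_executive1_compensation"]
)

# Group means, computed with boolean masks.
is_minority = toy["minorityownedbusinessflag"]
mean_true = toy.loc[is_minority, "prime_awardee_executive1_compensation"].mean()
mean_false = toy.loc[~is_minority, "prime_awardee_executive1_compensation"].mean()

print(r, mean_true - mean_false)  # both negative for this toy data
```

On the real data, comparing medians as well as means would guard against a few very large compensation values driving the result.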

In [9]:
# dat.prime_awardee_executive1_compensation.describe()

## Aggregation

I need to aggregate the awards by company. Some companies have a lot of awards, so award-level rows over-weight them and repeat company-level values such as executive compensation. I'm also interested in adding a column with the count of awards per company, since that might be a good predictor of minority owned businesses.

My hunch is that minority owned businesses are smaller than average, and have lower executive compensation/annual revenue than average. 

In [41]:
grp = ['vendorname', 'minorityownedbusinessflag']
cls = ['dollarsobligated', 'prime_awardee_executive1_compensation']
aggs = dat.groupby(grp)[cls].agg(['sum', 'count'])
#aggs[0:4]
aggs.sort_values([('prime_awardee_executive1_compensation', 'sum')], ascending=False)[0:9]

| vendorname | minorityownedbusinessflag | dollarsobligated (sum) | dollarsobligated (count) | prime_awardee_executive1_compensation (sum) | prime_awardee_executive1_compensation (count) |
|---|---|---|---|---|---|
| METRO MACHINE CORPORATION | False | 32297950.0 | 81 | 150574944.0 | 78 |
| GENERAL DYNAMICS OTS (AEROSPACE), INC. | False | 5572170.0 | 24 | 52153420.0 | 23 |
| MISSION SUPPORT ALLIANCE, LLC | False | 302504100.0 | 83 | 22534411.0 | 80 |
| SAFE BOATS INTERNATIONAL LLC | False | 19259400.0 | 43 | 17094888.0 | 39 |
| RORE, INC. | True | 24820390.0 | 170 | 11400000.0 | 57 |
| SAFE BOATS INTERNATIONAL L.L.C. | False | 7267223.0 | 20 | 8691640.0 | 20 |
| HPM CORPORATION | True | 659526.8 | 12 | 4105031.0 | 12 |
| TRITON MARINE CONSTRUCTION CORP. | False | 5503661.0 | 6 | 3840000.0 | 6 |
| SAFE BOATS INTERNATIONAL LIMITED LIABILITY COMPANY | False | 561208.1 | 5 | 2197275.0 | 5 |


**Note: the code above doesn't give the result I want, because 1) prime awardee compensation should be averaged, not summed: the same amount is repeated on every contract, so companies with more contracts get inflated totals; and 2) once that is fixed, I need to get rid of the multi-indexed sorting. I'm keeping the cell because multi-index sorting is confusing to get used to.**
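
One possible way to address both points is pandas named aggregation, which lets each column get its own function and produces flat column names. This is a sketch on invented rows, not the final cell; the output names `dollars_total`, `award_count`, and `exec1_comp_mean` are hypothetical, and named aggregation assumes pandas 0.25 or later.

```python
import pandas as pd

toy = pd.DataFrame({
    "vendorname": ["ACME", "ACME", "ACME", "BOLT", "BOLT"],
    "minorityownedbusinessflag": [False, False, False, True, True],
    "dollarsobligated": [100.0, 200.0, 300.0, 50.0, 150.0],
    # Compensation is a company-level value repeated on every award row.
    "prime_awardee_executive1_compensation": [9000.0, 9000.0, 9000.0, 4000.0, 4000.0],
})

per_company = (
    toy.groupby(["vendorname", "minorityownedbusinessflag"])
    .agg(
        dollars_total=("dollarsobligated", "sum"),      # total dollars per company
        award_count=("dollarsobligated", "count"),      # number of award rows
        exec1_comp_mean=("prime_awardee_executive1_compensation", "mean"),
    )
    .reset_index()                                      # flat, single-level columns
    .sort_values("exec1_comp_mean", ascending=False)    # plain column name, no tuple
)
print(per_company)
```

Because the columns are flat, sorting takes an ordinary column name instead of a `(column, function)` tuple.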

Now I need to get the other company attributes: 

* average number of employees (there are outliers that seem like mistakes here)
* sum of prime awardee compensation
* geographic region (see below)

In [27]:
# add code to get average number of employees 
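
As a placeholder for that cell, here is a minimal sketch of a per-company employee figure. Because the outliers look like data-entry mistakes, it reports the median next to the mean; the rows and the outlier value are invented, and the output names `employees_mean` and `employees_median` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "vendorname": ["ACME", "ACME", "ACME", "BOLT", "BOLT"],
    # 9_999_999 is an invented data-entry outlier.
    "numberofemployees": [120, 120, 9_999_999, 15, 17],
})

employees = toy.groupby("vendorname")["numberofemployees"].agg(
    employees_mean="mean",      # pulled far upward by the outlier
    employees_median="median",  # stays at the typical reported value
)
print(employees)
```

Whether the median is the right summary depends on how the real outliers arise, which still needs checking against the data.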

## Add Rural/Urban flag

This dataset includes ZIP codes and addresses, but neither works well as a categorical variable because there are too many distinct values. I'd like to create a rural/urban flag, but I still have to figure out the best way to do it. On other projects I've had to look up population statistics from the US Census, then join those in based on FIPS codes. We'll see if that's necessary here.
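
Whatever source ends up providing the flag, the join will probably need a clean 5-digit ZIP key first. The sketch below is one way that could look, under stated assumptions: `zip5` is a hypothetical helper, `rural_lookup` is an invented stand-in for an external table, and its `is_rural` values are made up.

```python
import pandas as pd

def zip5(value):
    """Return a 5-digit ZIP string from a ZIP or ZIP+4 stored as a number or string."""
    if pd.isna(value):
        return None
    # Drop a trailing ".0" from floats, then keep digits only.
    digits = "".join(ch for ch in str(value).split(".")[0] if ch.isdigit())
    if not digits:
        return None
    # Numbers lose leading zeros: pad to 9 digits for ZIP+4, else to 5.
    width = 9 if len(digits) > 5 else 5
    return digits.zfill(width)[:5]

# Hypothetical lookup table keyed by 5-digit ZIP; the flags are invented.
rural_lookup = pd.DataFrame({"zip5": ["02139", "59301"], "is_rural": [False, True]})

toy = pd.DataFrame({"placeofperformancezipcode": [21391234, "59301", None]})
toy["zip5"] = toy["placeofperformancezipcode"].map(zip5)

# Left join keeps every award row, with a missing flag where no match exists.
toy = toy.merge(rural_lookup, on="zip5", how="left")
print(toy[["zip5", "is_rural"]])
```

A left join is used so that awards with missing or unmatched ZIP codes stay in the data and can be counted.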

## Logistic Regression

I'd like to fit a logistic regression on a few of the variables above to see whether they predict minority owned businesses.
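
As a rough sketch of what that could look like, the block below fits a logistic regression with scikit-learn on synthetic company-level data. Everything here is an assumption: the feature names `log_employees` and `award_count` are hypothetical, the relationship between size and the flag is invented, and scikit-learn is just one reasonable library choice.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n = 600

# Invented company-level features.
X = pd.DataFrame({
    "log_employees": rng.normal(loc=4.0, scale=1.0, size=n),
    "award_count": rng.poisson(lam=5, size=n).astype(float),
})

# Invented relationship: smaller companies are more likely to be flagged.
p = 1.0 / (1.0 + np.exp(1.5 * (X["log_employees"].to_numpy() - 4.0)))
y = rng.random(n) < p

# Standardize features so the coefficients are on a comparable scale.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

coefs = pd.Series(
    model.named_steps["logisticregression"].coef_[0], index=X.columns
)
print(coefs)
```

With the real data, the model should be fit on the aggregated per-company table and evaluated on held-out companies before reading anything into the coefficients.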