In [1]:
import pandas as pd
import numpy as np

# HW3 - Payment Classification

The goal is to predict whether a payment made by a company to a medical doctor or facility was part of a research project. All relevant data can be found [here](https://www.cms.gov/OpenPayments/Explore-the-Data/Dataset-Downloads.html).

### Data Description

Physicians may be identified as covered recipients of records or as principal investigators associated with research-related payment records. Teaching hospitals may also be identified as covered recipients. Teaching hospitals are defined as any hospital receiving payments for GME, IPPS or IME.


Each record in the General Payment, Research Payment, and Ownership/Investment files includes a Change Type indicator field. 
- NEW: the record is newly reported by the reporting entity since the last publication and is being published for the first time.
- ADD: the record is not new in the system but, due to the record not being eligible for publication until the current publication cycle, is being published for the first time.
- CHANGED: record was previously published but has been modified since its last publication. A record whose only change since the last publication is a change to its dispute status is categorized as a changed record.
- UNCHANGED: record was published during the last publication cycle and is being republished without change in the current publication. 

Each record in the Removed and Deleted Records includes a Change Type indicator field as well:
- DELETED: the previously published record was deleted from the Open Payments system by the reporting entity.
- REMOVED: the previously published record was removed from the current publication as a result of the reporting entity making updates to the record which made the record ineligible for publication. 



## Task 1: Identify Features

First, let's load the data and assemble the dataset. The data comes from two different CSV files: `OP_DTL_GNRL_PGYR2017_P01182019.csv` (general payments) and `OP_DTL_RSRCH_PGYR2017_P01182019.csv` (research payments). We load a subsample of each file, add the target feature `research_payment` (0 for rows from the first file, 1 for rows from the second), and concatenate the two.

### How do we balance classes? 

The data is naturally imbalanced, as there are many more records of general payments than of research payments. Here are the row counts for both files:

In [15]:
def count_rows(path):
    # Count the data rows in a CSV file; the -1 excludes the header line
    with open(path) as f:
        return sum(1 for _ in f) - 1

n_gen = count_rows('data/OP_DTL_GNRL_PGYR2017_P01182019.csv')
n_res = count_rows('data/OP_DTL_RSRCH_PGYR2017_P01182019.csv')

print("General payments: " + str(n_gen) + " lines")
print("Research payments: " + str(n_res) + " lines")

General payments: 10663834 lines
Research payments: 602531 lines


Since the general payments CSV is too large to load in full, we select a subsample of rows from each CSV to form the dataset. We are free to choose how many rows to take from each file, and that choice determines the class balance. We have two options:
* Select an equal number of rows for both classes: this completely removes class imbalance and the problems it might cause. However, we lose the "real-world setting" with imbalanced classes.

* Select a number of rows in each file that's proportionate to their total number of rows: this will cause class imbalance problems since there is an approximate 95% / 5% class distribution, but will reflect the whole problem better.

We ended up choosing the first option: the loading code below draws the same number of rows (10,000) from each file.
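To make the trade-off between the two options concrete, here is a minimal sketch that turns the row counts printed above into class shares (roughly 94.7% / 5.3%) and into per-file sample sizes. The helper `sample_sizes` and the total budget of 20,000 rows are assumptions made for illustration only.

```python
# Row counts printed above (header lines excluded).
n_gen = 10_663_834
n_res = 602_531

n_total = n_gen + n_res
share_gen = n_gen / n_total
share_res = n_res / n_total
print(f"General: {share_gen:.1%}, research: {share_res:.1%}")


def sample_sizes(total_samples, proportional):
    """Return (general, research) sample sizes for a hypothetical row budget."""
    if proportional:
        # Keep the natural class distribution of the two files.
        n_res_sampled = round(total_samples * share_res)
        return total_samples - n_res_sampled, n_res_sampled
    # Balanced option: split the budget evenly between the two classes.
    half = total_samples // 2
    return half, total_samples - half


balanced = sample_sizes(20_000, proportional=False)
proportional = sample_sizes(20_000, proportional=True)
print("Balanced:", balanced)
print("Proportional:", proportional)
```

With a proportional split, only about one row in nineteen would be a research payment, which is why the balanced option is simpler to model even though it no longer mirrors the real distribution.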

### Loading and joining the datasets

We first load the separate datasets and add the target feature. To load the data, we do a random subsampling.

In [23]:
import pandas as pd 
import numpy as np

# Number of desired samples for each file
nsamples_gen = 10000
nsamples_res = 10000

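# Randomly choose which data rows to skip (row 0 is the header and is always kept),
# so that exactly nsamples_gen / nsamples_res rows remain in each file
skiprows_gen = np.sort(np.random.choice(range(1, n_gen+1), replace = False, size = n_gen - nsamples_gen))
skiprows_res = np.sort(np.random.choice(range(1, n_res+1), replace = False, size = n_res - nsamples_res))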

gen = pd.read_csv("data/OP_DTL_GNRL_PGYR2017_P01182019.csv", skiprows = skiprows_gen)
res = pd.read_csv("data/OP_DTL_RSRCH_PGYR2017_P01182019.csv", skiprows = skiprows_res)

gen['research_payment'] = 0
res['research_payment'] = 1
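The `skiprows` trick used above can be checked on a small scale. The following is a sketch on a hypothetical in-memory CSV (the `id` and `value` columns and the fixed seed are assumptions, not part of the real data): skipping all but `n_keep` randomly chosen data lines leaves the header intact and returns exactly `n_keep` rows.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical toy CSV: one header line followed by 100 data rows.
n_rows = 100
csv_text = "id,value\n" + "\n".join(f"{i},{i * 10}" for i in range(n_rows))

n_keep = 5
rng = np.random.default_rng(0)  # fixed seed, only to make the sketch reproducible

# File line 0 is the header, so the data rows occupy line indices 1..n_rows.
# Skipping n_rows - n_keep of them leaves exactly n_keep rows.
skip = np.sort(rng.choice(np.arange(1, n_rows + 1), size=n_rows - n_keep, replace=False))

sample = pd.read_csv(io.StringIO(csv_text), skiprows=skip.tolist())
print(sample.shape)
```

Because the skipped indices start at 1, the header line is never dropped, which is the same reason the cell above samples from `range(1, n_gen+1)`.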



The next problem is to concatenate the data. This raises an issue: our two data files have different columns, although they also share many columns. The "baseline" choice here would be to simply use pandas' `concat` function, which would give us a concatenated dataset whose columns are the union of the columns of the two separate datasets, with missing values filled with NaN.
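As an illustration of this baseline behaviour, here is a small sketch on two hypothetical toy frames (the column names `amount`, `gen_only`, and `res_only` are made up and do not come from the real schema). `pd.concat` aligns the frames on column names and fills the gaps with NaN.

```python
import pandas as pd

# Hypothetical toy frames standing in for the general and research samples.
gen_toy = pd.DataFrame(
    {"amount": [10.0, 20.0], "gen_only": ["a", "b"], "research_payment": 0}
)
res_toy = pd.DataFrame(
    {"amount": [500.0], "res_only": ["x"], "research_payment": 1}
)

# Row-wise concatenation: the result has the union of both column sets,
# and cells that do not exist in the source frame become NaN.
both = pd.concat([gen_toy, res_toy], ignore_index=True, sort=False)
print(both)
```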

However, this creates a problem: the two separate datasets are also the two classes, so if a feature is non-missing in only one of them, its missingness pattern indirectly reveals the class to the model in an unwanted way (data leakage).
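The leakage can be made visible on toy data. In the sketch below (hypothetical frames and column names, not the real schema), the missingness of a column that exists in only one file reproduces the target exactly. Restricting both frames to their shared columns is shown as one possible mitigation, under the assumption that dropping class-specific columns is acceptable; it is not necessarily the approach taken in this notebook.

```python
import pandas as pd

# Hypothetical toy frames standing in for the general and research samples.
gen_toy = pd.DataFrame(
    {"amount": [10.0, 20.0], "gen_only": ["a", "b"], "research_payment": 0}
)
res_toy = pd.DataFrame(
    {"amount": [500.0], "res_only": ["x"], "research_payment": 1}
)
both = pd.concat([gen_toy, res_toy], ignore_index=True, sort=False)

# "Is res_only present?" is a perfect predictor of the label: this is the leak.
leaky_feature = both["res_only"].notna().astype(int)
print((leaky_feature == both["research_payment"]).all())

# One possible mitigation: keep only the columns that appear in both files.
common = gen_toy.columns.intersection(res_toy.columns)
safe = pd.concat([gen_toy[common], res_toy[common]], ignore_index=True)
print(list(safe.columns))
```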