# Data Preprocessing

The preprocessing phase takes three steps:

1. Setup - import packages and set up paths
2. Import Data - bring in the data
3. Cleaning Data - aggregate data as necessary and make transformations as required

## Setup

---

#### Imports

In [1]:
import os
import pandas as pd

#### Paths

In [2]:
path_proj = 'U:/projects/donor_pred/preprocessing'
path_data = 'U:/data'

#### Globals

In [15]:
FY_START = 11
FY_END = 19

## Import Data

---

#### Files Needed

In [3]:
d_files = 'donors_fy08-fy19.csv'
t_files_fys = range(11, 20)
t_files_series = ['Chamber', 'Clx', 'Connections', 'Family', 'Organ', 'Pops', 'Specials', 'Summer']

In [4]:
t_files = [series + str(fy) + '.csv' for series in t_files_series for fy in t_files_fys]
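
As a quick sanity check, the comprehension above can be run on its own to see which file names it produces. This sketch only restates the series names and fiscal years from the cells above; it does not check whether those files actually exist on disk.

```python
# Series names and fiscal years copied from the cells above.
t_files_fys = range(11, 20)  # FY11 through FY19 inclusive
t_files_series = ['Chamber', 'Clx', 'Connections', 'Family',
                  'Organ', 'Pops', 'Specials', 'Summer']

# One file name per (series, fiscal year) pair, grouped by series.
t_files = [f"{series}{fy}.csv" for series in t_files_series for fy in t_files_fys]

print(len(t_files))   # 8 series * 9 fiscal years = 72
print(t_files[:3])    # ['Chamber11.csv', 'Chamber12.csv', 'Chamber13.csv']
print(t_files[-1])    # Summer19.csv
```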

### Donor Data

In [5]:
def donor_data_import(path_to_data_files, file):
    donor_raw = pd.read_csv(path_to_data_files + '/donors/' + file, encoding='ISO-8859-1')
    return donor_raw

In [6]:
d_data = donor_data_import(path_data, d_files)

In [8]:
donor_fy = d_data.campaign.str[3:8] # take characters 3-7 (zero-based) of the campaign column, e.g. '13-14'
d_data['fy'] = donor_fy # append that to a column named 'fy'
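
The slice `[3:8]` takes the characters at positions 3 through 7 (zero-based), not the tail of the string. The sketch below illustrates this with plain Python strings. The real layout of the `campaign` column is not shown in this notebook, so the `'FY 13-14 ...'` labels are purely an assumption made for illustration.

```python
# Hypothetical campaign labels; the real column format is an assumption.
campaigns = ['FY 13-14 Annual Fund', 'FY 10-11 Gala', 'Misc']

for campaign in campaigns:
    fy = campaign[3:8]  # same slice as d_data.campaign.str[3:8]
    print(repr(campaign), '->', repr(fy))

# Labels that do not follow the assumed layout produce junk such as 'c',
# which is why the non-numeric values are filtered out afterwards.
fy_labels = [c[3:8] for c in campaigns]
```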

Identify which `fy` values do not start with a digit; these can be removed. Also identify the fiscal years that fall outside the range of interest (`FY_START` through `FY_END`).

In [16]:
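filter_fys = {fy: False for fy in d_data.fy.drop_duplicates()}
for fy in filter_fys:
    try:
        int(fy[0])  # raises if the label does not start with a digit
        # keep fiscal years whose ending year is in [FY_START, FY_END], inclusive
        if FY_START <= int(fy[3:]) <= FY_END:
            filter_fys[fy] = True
    except (TypeError, ValueError, IndexError):
        continue  # NaN, empty, or non-numeric label: leave as False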
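
The same check can be written as a small predicate, which makes the rule easier to test in isolation. This is a sketch rather than the notebook's own code: `fy_in_range` is a hypothetical helper, and it assumes that labels look like `'13-14'` with the ending year in the last two characters and that both bounds are inclusive.

```python
FY_START = 11
FY_END = 19

def fy_in_range(fy, start=FY_START, end=FY_END):
    """Return True if fy looks like 'YY-YY' and its ending year is in [start, end]."""
    try:
        int(fy[0])              # label must start with a digit
        end_year = int(fy[3:])  # e.g. '13-14' -> 14
    except (TypeError, ValueError, IndexError):
        return False            # None/NaN, empty, or non-numeric labels
    return start <= end_year <= end

# Made-up labels covering the in-range, out-of-range, and malformed cases.
labels = ['13-14', '07-08', '19-20', 'al Fu', None]
filter_fys = {fy: fy_in_range(fy) for fy in labels}
print(filter_fys)
```

Only `'13-14'` is kept here: `'07-08'` and `'19-20'` end outside the range, and the last two labels are not numeric.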

Map the `fy` column onto this lookup to flag the rows to keep, then drop the rest.

In [17]:
d_data['keep'] = d_data['fy'].map(filter_fys)
donor_df = d_data.loc[d_data.keep].reset_index(drop=True) # drop=True avoids adding the old index as a column
donor_df.drop(columns=['keep'], inplace=True)
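
A toy frame shows how the mapping step behaves. The rows below are invented for illustration; only the `map` plus boolean `.loc` pattern mirrors the cell above.

```python
import pandas as pd

# Invented rows; only the pattern matches the notebook.
toy = pd.DataFrame({'customer_no': [1, 2, 3, 4],
                    'fy': ['13-14', 'al Fu', '07-08', '13-14']})
filter_fys = {'13-14': True, 'al Fu': False, '07-08': False}

toy['keep'] = toy['fy'].map(filter_fys)   # True/False per row
kept = (toy.loc[toy['keep']]              # boolean mask selects the rows to keep
           .drop(columns=['keep'])
           .reset_index(drop=True))       # renumber rows from 0
print(kept)
```

Passing `drop=True` to `reset_index` prevents the old row labels from being carried along as an extra `index` column.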

Remove unnecessary columns

In [21]:
donor_cols = ['summary_cust_id', 'customer_no', 'gift_plus_pledge', 'cont_dt', 'fy']
donor_df = donor_df[donor_cols]

In [22]:
donor_df.head()

|   | summary_cust_id | customer_no | gift_plus_pledge | cont_dt | fy |
|---|---|---|---|---|---|
| 0 | 179338 | 179338 | 1000.0 | 10/14/2013 00:00:00 | 13-14 |
| 1 | 179338 | 179338 | -1000.0 | 10/14/2013 00:00:00 | 13-14 |
| 2 | 2441428 | 2441428 | -1000.0 | 10/15/2013 00:00:00 | 13-14 |
| 3 | 2441428 | 2441428 | 1000.0 | 10/15/2013 00:00:00 | 13-14 |
| 4 | 2579172 | 2579172 | -1000.0 | 10/14/2013 00:00:00 | 13-14 |


Strip `fy` down to the last two digits of the fiscal year (for example, `'13-14'` becomes `'14'`).

In [23]:
donor_df['fy'] = donor_df['fy'].str[3:]

### Ticketing Data

In [48]:
dtypes = {
    'section': str,
    'summary_cust_name': str
}

In [49]:
def ticketing_data_import(path_to_data_files, filename, dtypes=dtypes):
    # use the function argument rather than the global path_data
    df = pd.read_csv(path_to_data_files + '/ticketing/' + filename, skiprows=3, dtype=dtypes)
    return df
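
In the import above, `skiprows=3` tells `read_csv` to discard the first three lines of each file before reading the header, and the `dtype` mapping forces the listed columns to be read as strings. The sketch below demonstrates both on an in-memory file. The preamble lines, the `qty` column, and all values are invented, since the real ticketing exports are not shown in this notebook.

```python
import io
import pandas as pd

# Invented file contents: three preamble lines, then the header and data rows.
raw = io.StringIO(
    "Ticketing export\n"
    "Season summary\n"
    "Illustrative preamble\n"
    "section,summary_cust_name,qty\n"
    "001,Smith,2\n"
    "A12,Jones,4\n"
)

df = pd.read_csv(raw, skiprows=3,
                 dtype={'section': str, 'summary_cust_name': str})
print(df)
print(df['section'].tolist())  # '001' keeps its leading zeros because it is read as str
```

Without the `dtype` mapping, a column made up entirely of values like `001` would be parsed as integers and lose its leading zeros.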

In [50]:
t_data = pd.concat([ticketing_data_import(path_data, file) for file in t_files])