# Introduction

Client data for 3 seasons (2015, 2016, 2017):
- Geographic data = Region, District, Site
- Grouping/Ops data = Site, GroupName, Facilitator
- Transaction data = TXsize, % repaid, LastRepayDate

## Objective
- Produce summary stats and graphs in Python
    - Sample structure
    - Key outcomes summary stats = Group/Site size, Transaction size, % repaid, % group complete, % individual complete, Outstanding amounts
    - Distribution graphs
    - Correlation matrix for key outcomes
    - Analysis levels = individual, facilitator, group, site, district, region
    
    
- Try clustering/classification/segmentation
    - Facilitator?
    - Group size
    - TXsize
    - % repaid
    - Outstanding amount, Overpayment amount
    
    
- Account time series
    - 3 seasons
    - last repayment date
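
The summary-statistics and correlation-matrix steps listed above could be sketched roughly as follows. This is a minimal illustration on a made-up four-row frame, not the real client data; the frame `toy` and its values are assumptions, and only the outcome column names are taken from this notebook.

```python
import pandas as pd

# Toy stand-in for client-level outcomes; every value here is made up.
toy = pd.DataFrame({
    "TXSize":         [100.0, 200.0, 150.0, 300.0],
    "PctRepaid":      [1.0, 0.5, 0.8, 1.0],
    "OutstandingAmt": [0.0, 100.0, 30.0, 0.0],
    "OverpaymentAmt": [0.0, 0.0, 0.0, 20.0],
})

# Summary statistics (count, mean, std, min, quartiles, max) per outcome.
summary = toy.describe()

# Pairwise Pearson correlations between the key outcomes.
corr = toy.corr()

print(summary)
print(corr)
```

With seaborn loaded, `sns.heatmap(corr, annot=True)` is one way to turn the matrix into a graph.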

In [164]:
### Setup

## 1. Load Libraries
import pandas as pd
import numpy as np
import seaborn as sns

%matplotlib inline 

## 2. Define helper functions and the columns to keep
df_name_prefix = "df_sc"
df_cols = ["xClientID", "xSiteID", "xPodID", "xGroupID",
           "Year",
           "RegionName", "DistrictName", 
          "Facilitator", 
           "TotalCredit", "TotalRepaid_IncludingOverpayments", "LastRepaymentDate"]

def load_csv(yr, fprefix, dsdir):
    """Load one season's CSV and add year-prefixed ID columns."""
    fpath = "%s/%s_%d.csv" % (dsdir, fprefix, yr)
    print(" WORKING on PATH : %s" % fpath)
    dtmp = pd.read_csv(fpath)
    dtmp["Year"] = yr
    # Prefix each ID with the season so the same client/site/pod/group stays distinct across years
    dtmp["xClientID"] = str(yr) + "_" + dtmp.GlobalClientId
    dtmp["xSiteID"] = str(yr) + "_" + dtmp.SiteName
    dtmp["xPodID"] = str(yr) + "_" + dtmp.SectorName
    dtmp["xGroupID"] = str(yr) + "_" + dtmp.GroupName
    dtmp.index = dtmp.xClientID
    print("LOADED df ")
    return dtmp[df_cols]
    

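def computed_fields(df):
    """Add derived outcome columns and rename the raw amount columns (modifies df in place)."""
    # Share of credit repaid, capped at 1; rows with zero credit and zero repaid give NaN (0/0)
    df["PctRepaid"] = np.minimum(1, df.TotalRepaid_IncludingOverpayments / df.TotalCredit)
    # Amount repaid beyond the credit taken (0 if none)
    df["OverpaymentAmt"] = np.maximum(0, df.TotalRepaid_IncludingOverpayments - df.TotalCredit)
    # Amount of credit still unpaid (0 if fully repaid)
    df["OutstandingAmt"] = np.maximum(0, df.TotalCredit - df.TotalRepaid_IncludingOverpayments)
    df.rename(columns={"TotalCredit": "TXSize", "TotalRepaid_IncludingOverpayments": "TotalRepaid"}, inplace=True)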

## Agg level outcomes = group size, [sum, mean](totalcredit, totalrepaid, outstanding, overpayment), % repaid
def group_aggz( grplvl ):
    # Placeholder: aggregation by grouping level is not implemented yet
    print("hello there!!")
    
    
## 3. Define key outcomes 
outcomes = ["TXSize", "PctRepaid", "OutstandingAmt", "OverpaymentAmt"]


## 4. Load dataset 
data_dir = "D:/xAppzor/datasets/oaf"
data_file_prefix = "oaf_sc"
seasons = [2015, 2016, 2017]


# Load each season and stack the frames into one DataFrame
all_data = pd.concat(
    [load_csv(yr, data_file_prefix, data_dir) for yr in seasons]
)



## 5. Computed Fields
computed_fields( all_data )

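# Note: all_data.info() prints its report directly and returns None,
# so the report appears first and "None" is printed after "all data =".
print("\n\n all data = \n" , all_data.info() , "\nIndecies\n", all_data.index.names,
     "\n\n Sample", all_data[ all_data.TotalRepaid != all_data.TXSize][ outcomes ].sample(7),
      "\n\n CFO\n", all_data.groupby(["xSiteID"]).size()
     )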



## 6. Aggregates = group size, [sum,mean](totalcredit, totalrepaid, outstanding, overpayment), %repaid
# grouping levels = site, pod, group
grp_levels = ["xSiteID", "xPodID", "xGroupID"]

all_data.groupby(["xSiteID"]).agg({
    "xClientID" : ["size"],
    "PctRepaid" : ["mean"],
    "TXSize" : ["mean", "sum"],
    "TotalRepaid" : ["mean", "sum"],
    "OutstandingAmt" : ["mean", "sum"],
    "OverpaymentAmt" : ["mean", "sum"],
}).tail()
#site_data.head()

 WORKING on PATH : D:/xAppzor/datasets/oaf/oaf_sc_2015.csv
LOADED df 
 WORKING on PATH : D:/xAppzor/datasets/oaf/oaf_sc_2016.csv
LOADED df 
 WORKING on PATH : D:/xAppzor/datasets/oaf/oaf_sc_2017.csv
LOADED df 




<class 'pandas.core.frame.DataFrame'>
Index: 578840 entries, 2015_60608205-6fba-4932-a022-511006e668c5 to 2017_24205200-f191-e611-9e9a-2c600c86e73a
Data columns (total 14 columns):
xClientID            578840 non-null object
xSiteID              578840 non-null object
xPodID               578840 non-null object
xGroupID             578840 non-null object
Year                 578840 non-null int64
RegionName           578840 non-null object
DistrictName         578840 non-null object
Facilitator          578840 non-null bool
TXSize               578840 non-null float64
TotalRepaid          578840 non-null float64
LastRepaymentDate    578837 non-null object
PctRepaid            578838 non-null float64
OverpaymentAmt       578840 non-null float64
OutstandingAmt       578840 non-null float64
dtypes: bool(1), float64(5), int64(1), object(7)
memory usage: 62.4+ MB


 all data = 
 None 
Indecies
 ['xClientID'] 

 Sample                                             TXSize  PctRepaid  Outstandin

| xSiteID | xClientID size | PctRepaid mean | TXSize mean | TXSize sum | TotalRepaid mean | TotalRepaid sum | OutstandingAmt mean | OutstandingAmt sum | OverpaymentAmt mean | OverpaymentAmt sum |
|---|---|---|---|---|---|---|---|---|---|---|
| 2017_Yiro West | 322 | 0.869182 | 9974.953416 | 3211935.0 | 8702.298137 | 2802140.0 | 1299.142857 | 418324.0 | 26.487578 | 8529.0 |
| 2017_kingandole | 216 | 0.904384 | 10052.453704 | 2171330.0 | 9065.768519 | 1958206.0 | 1006.027778 | 217302.0 | 19.342593 | 4178.0 |
| 2017_lusiola | 172 | 0.99293 | 10256.976744 | 1764200.0 | 10396.325581 | 1788168.0 | 41.453488 | 7130.0 | 180.802326 | 31098.0 |
| 2017_mabonde | 121 | 0.988977 | 12073.801653 | 1460930.0 | 11984.545455 | 1450130.0 | 124.46281 | 15060.0 | 35.206612 | 4260.0 |
| 2017_masara | 107 | 0.932714 | 8167.757009 | 873950.0 | 7674.981308 | 821223.0 | 518.429907 | 55472.0 | 25.654206 | 2745.0 |
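
The site-level aggregation above could be wrapped into the `group_aggz` helper so it can be reused for each grouping level (site, pod, group). The following is only a sketch: the two-argument signature `group_aggz(df, grplvl)`, the flat output column names, and the toy data are assumptions, and named aggregation needs pandas 0.25 or later.

```python
import pandas as pd

def group_aggz(df, grplvl):
    """Aggregate client-level outcomes at one grouping level, e.g. "xSiteID"."""
    return df.groupby(grplvl).agg(
        GroupSize=("TXSize", "size"),
        PctRepaid_mean=("PctRepaid", "mean"),
        TXSize_mean=("TXSize", "mean"),
        TXSize_sum=("TXSize", "sum"),
        TotalRepaid_mean=("TotalRepaid", "mean"),
        TotalRepaid_sum=("TotalRepaid", "sum"),
        OutstandingAmt_mean=("OutstandingAmt", "mean"),
        OutstandingAmt_sum=("OutstandingAmt", "sum"),
        OverpaymentAmt_mean=("OverpaymentAmt", "mean"),
        OverpaymentAmt_sum=("OverpaymentAmt", "sum"),
    )

# Made-up rows, only to show the shape of the result.
toy = pd.DataFrame({
    "xSiteID":        ["2017_A", "2017_A", "2017_B"],
    "xGroupID":       ["2017_g1", "2017_g2", "2017_g3"],
    "TXSize":         [100.0, 200.0, 300.0],
    "TotalRepaid":    [100.0, 100.0, 320.0],
    "PctRepaid":      [1.0, 0.5, 1.0],
    "OutstandingAmt": [0.0, 100.0, 0.0],
    "OverpaymentAmt": [0.0, 0.0, 20.0],
})

# One aggregate table per grouping level.
by_level = {lvl: group_aggz(toy, lvl) for lvl in ["xSiteID", "xGroupID"]}
site = by_level["xSiteID"]
print(site)
```

Flat column names avoid the two-level header that the dict-style `agg` call produces, which may make the tables easier to export or plot.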


# Individual Level Analysis

## Data Structure
- GlobalID = index (unique)
- TXsize, Total repaid including overpayment
- % repaid, OutstandingAmt, OverpaymentAmt
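
As a small worked example of this structure, the derived fields can be computed by hand on three made-up clients using the same formulas as `computed_fields`. The index labels and amounts below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Three hypothetical clients: fully repaid, partly repaid, and overpaid.
toy = pd.DataFrame(
    {"TXSize": [100.0, 200.0, 300.0], "TotalRepaid": [100.0, 150.0, 330.0]},
    index=["2017_a", "2017_b", "2017_c"],
)

# The client ID is expected to be a unique index.
assert toy.index.is_unique

# Same formulas as computed_fields, applied to the renamed columns.
toy["PctRepaid"] = np.minimum(1, toy.TotalRepaid / toy.TXSize)
toy["OverpaymentAmt"] = np.maximum(0, toy.TotalRepaid - toy.TXSize)
toy["OutstandingAmt"] = np.maximum(0, toy.TXSize - toy.TotalRepaid)

print(toy)
```

The second client has repaid 150 of 200, so `PctRepaid` is 0.75 and `OutstandingAmt` is 50; the third has overpaid by 30, so `PctRepaid` is capped at 1.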