# Final Project - Part 3 - Exploratory Data Analysis (Ellein Cheng)


### PROMPT
Exploratory data analysis is a crucial and informative step in the data process. It helps confirm or deny your initial hypotheses and helps visualize the relationships among your data. Your exploratory analysis also informs the kinds of data transformations that you'll need to optimize for machine learning models.

In this assignment, you will explore and visualize your initial analysis in order to effectively tell your data's story. You'll create an iPython notebook that explores your data mathematically, using a python visualization package.

Goal: Confirm your data and create an exploratory analysis notebook with stat analysis and visualization.

### DELIVERABLES
Exploratory Analysis Writeup

Requirements:

- Review the data set and project with an EIR during office hours.
- Practice importing (potentially unformatted) data into clean matrices|data frames, and if necessary, export into a form that makes sense (text files or a database, for example).
- Explore the mathematical properties and visualize data through a python visualization tool (matplotlib and seaborn).
- Provide insight about the data set and any impact on a hypothesis.

Detailed Breakdown:

- A well organized iPython notebook with code and output.
- At least one visual for each independent variable and, if possible, its relationship to your dependent variable.
  - It's just as important to show what's not correlated as it is to show any actual correlations found.
- Visuals should be well labeled and intuitive based on the data types.
  - For example, if your X variable is temperature and Y is "did it rain," a reasonable visual would be two histograms of temperature, one where it rained, and one where it didn't.
- Tables are a perfectly valid visualization tool! Interweave them into your work.

Bonus:

- Surface and share your analysis online. Jupyter makes this very simple and the setup should not take long.
- Try experimenting with other visualization languages; python/pandas-highcharts, shiny/r, or for a real challenge, d3 on its own. Interactive data analysis opens the doors for others to easily interpret your work and explore the data themselves!

In [878]:
import pandas as pd
from matplotlib import pyplot as plt  
%matplotlib inline
import seaborn as sb
import numpy as np

# I) Understanding the Original Dataset
This section provides basic summary descriptions of the original dataset.

## A) Read Data Dictionary (Raw Dataset)

In [879]:
#Read Data Dictionary
GD_DD = pd.read_csv('assets/DATA_DICTIONARY_version_28_11_2017_0_MOD.csv')
GD_DD.head(n=10)

Unnamed: 0,Variable name,Data type,Values/Categories,Notes
0,yearOfRegistration,Numeric,"Range: [2002, 2017]",The year in which the individual was registere...
1,Datasource,Alphanumeric,| Case management [Individual received social ...,"Data collection method, which generally reflec..."
2,gender,Alpha,| -99 [missing data] | Male | Female | Transge...,Designates the individual's expression or cond...
3,ageBroad,Alphanumeric,| -99 [missing data] | 0-8 | 9-17 | 18-20 | 21...,The individual's age at the time the individua...
4,majorityStatus,Alpha,| -99 [missing data] | Minor [Any person under...,Indicates whether the individual was under the...
5,majorityStatusAtExploit,Alpha,| -99 [missing data] | Minor [Any person under...,The individual's age at the time the exploitat...
6,MajorityEntry,Alphanumeric,| -99 [missing data] | Minor [Any person under...,Indicates the age of an individual at the time...
7,citizenship,Alpha,| -99 [missing data] | ZZ [unknown] | Values b...,The set of rights and duties that a person has...
8,meansOfControlDebtBondage,Bool,| -99 [missing data] | 0 | 1,Indicates whether the individual is forced to ...
9,meansOfControlTakesEarnings,Bool,| -99 [missing data] | 0 | 1,Indicates whether the individual has experienc...


In [880]:
#Number of Variables 
print "Number of Variables: ", len(GD_DD)  

Number of Variables:  62


## B) Read the original global dataset

In [881]:
# Import Global Dataset
GD_data = pd.read_csv('assets/TheGlobalDataset27Nov2017_0.csv')
GD_data.head()

Unnamed: 0,yearOfRegistration,Datasource,gender,ageBroad,majorityStatus,majorityStatusAtExploit,majorityEntry,citizenship,meansOfControlDebtBondage,meansOfControlTakesEarnings,...,typeOfSexRemoteInteractiveServices,typeOfSexPrivateSexualServices,isAbduction,RecruiterRelationship,CountryOfExploitation,recruiterRelationIntimatePartner,recruiterRelationFriend,recruiterRelationFamily,recruiterRelationOther,recruiterRelationUnknown
0,2010,Case Management,Female,18--20,Adult,Adult,Adult,KZ,-99,-99,...,-99,-99,-99,Unknown,KZ,0,0,0,0,1
1,2011,Case Management,Female,18--20,Adult,-99,-99,KZ,-99,-99,...,-99,-99,-99,Unknown,KZ,0,0,0,0,1
2,2004,Case Management,Female,18--20,Adult,-99,-99,MD,-99,-99,...,-99,-99,0,Unknown,MD,0,0,0,0,1
3,2011,Case Management,Female,18--20,Adult,-99,-99,KZ,-99,-99,...,-99,-99,-99,Unknown,KZ,0,0,0,0,1
4,2011,Case Management,Female,18--20,Adult,-99,-99,KZ,-99,-99,...,-99,-99,-99,Unknown,KZ,0,0,0,0,1


In [882]:
#original length of dataset
print "Number of Entries:", len(GD_data)  # Number of entries in Global Dataset

Number of Entries: 47102


In [883]:
print "Column Names:", GD_data.columns   # Field names of Global Dataset

Column Names: Index([u'yearOfRegistration', u'Datasource', u'gender', u'ageBroad',
       u'majorityStatus', u'majorityStatusAtExploit', u'majorityEntry',
       u'citizenship', u'meansOfControlDebtBondage',
       u'meansOfControlTakesEarnings',
       u'meansOfControlRestrictsFinancialAccess', u'meansOfControlThreats',
       u'meansOfControlPsychologicalAbuse', u'meansOfControlPhysicalAbuse',
       u'meansOfControlSexualAbuse', u'meansOfControlFalsePromises',
       u'meansOfControlPsychoactiveSubstances',
       u'meansOfControlRestrictsMovement',
       u'meansOfControlRestrictsMedicalCare',
       u'meansOfControlExcessiveWorkingHours', u'meansOfControlUsesChildren',
       u'meansOfControlThreatOfLawEnforcement',
       u'meansOfControlWithholdsNecessities',
       u'meansOfControlWithholdsDocuments', u'meansOfControlOther',
       u'meansOfControlNotSpecified', u'isForcedLabour', u'isSexualExploit',
       u'isOtherExploit', u'isSexAndLabour', u'isForcedMarriage',
       u'isF

# II) Data Cleaning
One of the biggest challenges with this dataset is the large number of unknown values (coded as `-99` or `Unknown`). Each field therefore needs an explicit rule for handling its missing data.
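
Before choosing a rule per field, it can help to measure how much of each column is unknown. The following is a minimal sketch on a made-up toy frame, not on the real dataset; it assumes the only missing-value codes are `-99` (stored as a number or a string) and the label `Unknown`.

```python
import pandas as pd

# Toy stand-in for the global dataset (values are invented for illustration).
toy = pd.DataFrame({
    'gender': ['Female', 'Male', 'Unknown', 'Female'],
    'ageBroad': ['18--20', '-99', '9--17', '-99'],
    'isAbduction': [-99, 0, -99, -99],
})

# Assumption: these two strings are the only missing-value codes.
# Casting to str lets numeric -99 and string '-99' be matched the same way.
MISSING_CODES = ['-99', 'Unknown']

# Fraction of rows per column that hold a missing-value code.
missing_share = toy.astype(str).isin(MISSING_CODES).mean().sort_values(ascending=False)
print(missing_share)
```

Columns at the top of this ranking are the ones where dropping rows is most costly, so they are candidates for imputation or exclusion instead.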

In [884]:
## 1) Unknown Gender
GD_data[['gender','Datasource']].groupby('gender').count()

Unnamed: 0_level_0,Datasource
gender,Unnamed: 1_level_1
Female,33850
Male,12997
Unknown,255


In [885]:
### There are 255 rows of Unknown gender.  Remove them from dataset.
GD_data = GD_data.loc[(GD_data["gender"]!="Unknown")]

In [886]:
len(GD_data)

46847

In [887]:
## 2) Unknown ageBroad
GD_data[['ageBroad','Datasource']].groupby('ageBroad').count()

Unnamed: 0_level_0,Datasource
ageBroad,Unnamed: 1_level_1
-99,11819
0--8,1581
18--20,3536
21--23,3504
24--26,2948
27--29,2210
30--38,6012
39--47,2434
48+,1374
9--17,6083


In [888]:
### Remove ageBroad=-99
GD_data = GD_data.loc[(GD_data["ageBroad"]!="-99")]
len(GD_data)

35028

In [889]:
## 3) Remove rows where ageBroad & majorityStatus & majorityStatusAtExploit & majorityEntry are all unknown
GD_data = GD_data.loc[(GD_data["ageBroad"]!="Unknown") | (GD_data["majorityStatus"]!="Unknown")|(GD_data["majorityStatusAtExploit"]!="Unknown")|(GD_data["majorityEntry"]!="Unknown")]
len(GD_data)

33669
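
The filter above compares against the label `Unknown`, while the `head()` output earlier shows `-99` in the majority-status columns. As a hedged sketch (toy data, and assuming both codes should count as missing), an "all four fields unknown" test can be written so that it covers either code:

```python
import pandas as pd

# Toy rows; the real columns may use only one of the two codes.
toy = pd.DataFrame({
    'ageBroad': ['18--20', 'Unknown', '-99', '9--17'],
    'majorityStatus': ['Adult', 'Unknown', '-99', 'Unknown'],
    'majorityStatusAtExploit': ['-99', 'Unknown', '-99', 'Minor'],
    'majorityEntry': ['-99', '-99', 'Unknown', 'Unknown'],
})

AGE_COLS = ['ageBroad', 'majorityStatus', 'majorityStatusAtExploit', 'majorityEntry']
UNKNOWN_CODES = ['-99', 'Unknown']  # assumption: both mean "missing"

# True only where every age-related field is missing.
all_unknown = toy[AGE_COLS].isin(UNKNOWN_CODES).all(axis=1)
kept = toy.loc[~all_unknown]
print(len(kept))
```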

In [890]:
## 4) Combine all available data between majorityStatus, majorityStatusAtExploit, majorityEntry
def combine_nulls(x):
    if x['majorityStatus'] != 'Unknown':
        return x['majorityStatus']
    elif x['majorityStatusAtExploit'] != 'Unknown':
        return x['majorityStatusAtExploit']
    elif x['majorityEntry'] != 'Unknown':
        return x['majorityEntry']
    return 'Unknown'  # all three fields are unknown

GD_data["majorityStatus"] = GD_data.apply(combine_nulls, axis=1)

In [891]:
GD_data[['majorityStatus','Datasource']].groupby('majorityStatus').count() #double check that it worked

Unnamed: 0_level_0,Datasource
majorityStatus,Unnamed: 1_level_1
Adult,25240
Minor,8429


In [892]:
## 5) Create column called ageBroad_mid with values that are midpoint of the ageBroad groups
def ageBroad_ave(x):
    if x['ageBroad'] == '0--8':
        return 4
    elif x['ageBroad'] == '18--20':
        return 19
    elif x['ageBroad'] == '21--23':
        return 22
    elif x['ageBroad'] == '24--26':
        return 25
    elif x['ageBroad'] == '27--29':
        return 28
    elif x['ageBroad'] == '30--38':
        return 34
    elif x['ageBroad'] == '39--47':
        return 43
    elif x['ageBroad'] == '48+':
        return 55
    elif x['ageBroad'] == '9--17':
        return 13

GD_data['ageBroad_mid'] = GD_data.apply(ageBroad_ave,axis=1)
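
A lookup table is a compact alternative to the `if`/`elif` chain. The sketch below reuses the notebook's midpoints (including its choice of 55 for the open-ended `48+` group) and applies them with `Series.map`; any label missing from the dictionary becomes `NaN`.

```python
import pandas as pd

# Midpoints copied from the function above; 55 for '48+' is the notebook's own choice.
AGE_MIDPOINTS = {
    '0--8': 4, '9--17': 13, '18--20': 19, '21--23': 22, '24--26': 25,
    '27--29': 28, '30--38': 34, '39--47': 43, '48+': 55,
}

toy_age = pd.Series(['0--8', '30--38', '48+', 'Unknown'], name='ageBroad')

# Labels not in the dictionary (here 'Unknown') map to NaN.
toy_mid = toy_age.map(AGE_MIDPOINTS)
print(toy_mid)
```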


In [893]:
## 6) Imputing missing ageBroad_mid with average for Adult and Minor majorityStatuses respectively
### Calculate Average for Adult
GD_data[['majorityStatus','ageBroad_mid']].groupby('majorityStatus').mean()

Unnamed: 0_level_0,ageBroad_mid
majorityStatus,Unnamed: 1_level_1
Adult,30.161917
Minor,11.149228


In [894]:
### Fill in unknown ageBroad_mid
def imputeAge(x):
    if x['ageBroad'] != 'Unknown':
        return x['ageBroad_mid']
    elif (x['ageBroad'] == 'Unknown' and x['majorityStatus'] == 'Adult'):
        return 30
    elif (x['ageBroad'] == 'Unknown' and x['majorityStatus'] == 'Minor'):
        return 11
    
GD_data['ageBroad_mid'] = GD_data.apply(imputeAge,axis=1)

In [895]:
GD_data[['ageBroad_mid','Datasource']].groupby('ageBroad_mid').count()

Unnamed: 0_level_0,Datasource
ageBroad_mid,Unnamed: 1_level_1
4.0,1581
11.0,783
13.0,6083
19.0,3536
22.0,3504
25.0,2948
28.0,2210
30.0,3204
34.0,6012
43.0,2434
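
The imputation above hard-codes 30 and 11, taken from the group means computed earlier. A sketch of the same idea that derives the fill value from the data itself, using toy values and rounding to the nearest whole year (the rounding is an assumption here):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'majorityStatus': ['Adult', 'Adult', 'Adult', 'Minor', 'Minor', 'Minor', 'Minor'],
    'ageBroad_mid': [19.0, 43.0, np.nan, 4.0, 13.0, 13.0, np.nan],
})

# Mean age per majorityStatus group, broadcast back onto every row.
group_mean = toy.groupby('majorityStatus')['ageBroad_mid'].transform('mean')

# Fill only the missing ages with the rounded group mean.
toy['ageBroad_mid'] = toy['ageBroad_mid'].fillna(group_mean.round())
print(toy)
```

This keeps the fill values in sync if earlier cleaning steps change the group means.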


In [896]:
## 7) citizenship - replace -99 with ZZ for consistency
def unk_citizenship(x):
    if x['citizenship'] == '-99':
        return 'ZZ'
    else:
        return x['citizenship']

GD_data['citizenship'] = GD_data.apply(unk_citizenship,axis=1)

In [897]:
GD_data[['citizenship','Datasource']].groupby('citizenship').count()

Unnamed: 0_level_0,Datasource
citizenship,Unnamed: 1_level_1
AF,239
AL,45
BD,11
BF,33
BG,342
BO,18
BY,1513
CD,53
CI,31
CN,84


In [898]:
## 8) Purpose for which a victim is trafficked - remove rows with all unknowns
GD_data = GD_data.loc[(GD_data['isForcedLabour']>=0) | (GD_data['isSexualExploit']>=0)
            | (GD_data['isOtherExploit']>=0) | (GD_data['isSexAndLabour']>=0)
            | (GD_data['isForcedMarriage']>=0) | (GD_data['isForcedMilitary']>=0) | (GD_data['isOrganRemoval']>=0)]

len(GD_data)


20132

In [899]:
### Replace all -99 with 0, since each remaining row has at least one known (0 or 1) value
exploit_cols = ['isForcedLabour', 'isSexualExploit', 'isOtherExploit', 'isSexAndLabour',
                'isForcedMarriage', 'isForcedMilitary', 'isOrganRemoval']
GD_data[exploit_cols] = GD_data[exploit_cols].replace(-99, 0)

In [900]:
## 9) Clean Type of Labor / Type of Sex Exploitation data
GD_data['sumExploit'] = GD_data['isForcedLabour'] + GD_data['isSexualExploit'] + GD_data['isOtherExploit'] + GD_data['isSexAndLabour'] + GD_data['isForcedMarriage'] + GD_data['isForcedMilitary'] + GD_data['isOrganRemoval']
GD_data[['sumExploit','Datasource']].groupby('sumExploit').count()

Unnamed: 0_level_0,Datasource
sumExploit,Unnamed: 1_level_1
0,1696
1,18416
3,20


In [901]:
### For rows with no exploitation type flagged (sumExploit == 0), set isForcedLabour to 1 if any type of labour is flagged
labour_type_cols = ['typeOfLabourAgriculture', 'typeOfLabourAquafarming', 'typeOfLabourBegging',
                    'typeOfLabourConstruction', 'typeOfLabourDomesticWork', 'typeOfLabourHospitality',
                    'typeOfLabourIllicitActivities', 'typeOfLabourManufacturing',
                    'typeOfLabourMiningOrDrilling', 'typeOfLabourPeddling',
                    'typeOfLabourTransportation', 'typeOfLabourOther', 'typeOfLabourNotSpecified']

def verifyForcedLabor(x):
    if x['sumExploit'] == 0:
        # No exploitation flag is set: infer forced labour from the labour-type flags
        return int(any(x[col] == 1 for col in labour_type_cols))
    else:
        return x['isForcedLabour']
            
GD_data['isForcedLabour'] = GD_data.apply(verifyForcedLabor,axis=1)
GD_data[['isForcedLabour','Datasource']].groupby('isForcedLabour').count()

Unnamed: 0_level_0,Datasource
isForcedLabour,Unnamed: 1_level_1
0,13380
1,6752


In [902]:
### For rows with no exploitation type flagged (sumExploit == 0), set isSexualExploit to 1 if any type of sexual exploitation is flagged
sex_type_cols = ['typeOfSexProstitution', 'typeOfSexPornography',
                 'typeOfSexRemoteInteractiveServices', 'typeOfSexPrivateSexualServices']

def verifySexualExploit(x):
    if x['sumExploit'] == 0:
        # No exploitation flag is set: infer sexual exploitation from the type flags
        return int(any(x[col] == 1 for col in sex_type_cols))
    else:
        return x['isSexualExploit']
            
GD_data['isSexualExploit'] = GD_data.apply(verifySexualExploit,axis=1)
GD_data[['isSexualExploit','Datasource']].groupby('isSexualExploit').count()

Unnamed: 0_level_0,Datasource
isSexualExploit,Unnamed: 1_level_1
0,8787
1,11345


In [903]:
len(GD_data)

20132

In [904]:
## 10) Remove all the rows where there are no known means of control
GD_data = GD_data.loc[(GD_data['meansOfControlDebtBondage']>=0) | (GD_data['meansOfControlTakesEarnings']>=0) 
                      | (GD_data['meansOfControlRestrictsFinancialAccess']>=0) | (GD_data['meansOfControlThreats']>=0) 
                      | (GD_data['meansOfControlPsychologicalAbuse']>=0) | (GD_data['meansOfControlPhysicalAbuse']>=0) 
                      | (GD_data['meansOfControlSexualAbuse']>=0) | (GD_data['meansOfControlFalsePromises']>=0) 
                      | (GD_data['meansOfControlPsychoactiveSubstances']>=0) | (GD_data['meansOfControlRestrictsMovement']>=0)
                      | (GD_data['meansOfControlRestrictsMedicalCare']>=0) | (GD_data['meansOfControlExcessiveWorkingHours']>=0)
                      | (GD_data['meansOfControlUsesChildren']>=0) | (GD_data['meansOfControlThreatOfLawEnforcement']>=0)
                      | (GD_data['meansOfControlWithholdsNecessities']>=0) | (GD_data['meansOfControlWithholdsDocuments']>=0)
                      | (GD_data['meansOfControlOther']>=0) | (GD_data['meansOfControlNotSpecified']>=0)]
 
len(GD_data)

20126

In [905]:
### Fill remaining -99 with 0 because there is at least one known value for that row
moc_cols = ['meansOfControlDebtBondage', 'meansOfControlTakesEarnings',
            'meansOfControlRestrictsFinancialAccess', 'meansOfControlThreats',
            'meansOfControlPsychologicalAbuse', 'meansOfControlPhysicalAbuse',
            'meansOfControlSexualAbuse', 'meansOfControlFalsePromises',
            'meansOfControlPsychoactiveSubstances', 'meansOfControlRestrictsMovement',
            'meansOfControlRestrictsMedicalCare', 'meansOfControlExcessiveWorkingHours',
            'meansOfControlUsesChildren', 'meansOfControlThreatOfLawEnforcement',
            'meansOfControlWithholdsNecessities', 'meansOfControlWithholdsDocuments',
            'meansOfControlOther', 'meansOfControlNotSpecified']
GD_data[moc_cols] = GD_data[moc_cols].replace(-99, 0)

In [906]:
## 11) Country Of Exploitation - replace -99 with ZZ for consistency
def unk_countryExploit(x):
    if x['CountryOfExploitation'] == '-99':
        return 'ZZ'
    else:
        return x['CountryOfExploitation']

GD_data['CountryOfExploitation'] = GD_data.apply(unk_countryExploit,axis=1)
GD_data[['CountryOfExploitation','Datasource']].groupby('CountryOfExploitation').count()

Unnamed: 0_level_0,Datasource
CountryOfExploitation,Unnamed: 1_level_1
AE,502
AF,84
AL,45
AT,24
BA,127
BG,130
BH,22
BY,22
CN,50
CY,12


In [907]:
## 12) Explore whether isAbduction is worth using
GD_data[['isAbduction','Datasource']].groupby('isAbduction').count()
#len(GD_data)

Unnamed: 0_level_0,Datasource
isAbduction,Unnamed: 1_level_1
-99,7248
0,12828
1,50


Only 50 of the 20,126 remaining rows are confirmed abductions, and 7,248 rows are missing (-99). The field is too sparse to be reliable, so it is not used.
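
To put numbers on that judgment, the counts from the table above can be turned into shares. This sketch re-enters the three counts by hand rather than reading them from `GD_data`:

```python
import pandas as pd

# Counts copied from the isAbduction table above.
counts = pd.Series({-99: 7248, 0: 12828, 1: 50}, name='isAbduction')

# Share of rows in each category.
shares = counts / counts.sum()
print(shares.round(4))
```

On the real frame, `GD_data['isAbduction'].value_counts(normalize=True)` should give the same shares directly.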

In [908]:
## 13) Remove all the rows that have no known type of Exploitation
### First recalc sumExploit
GD_data['sumExploit'] = GD_data['isForcedLabour'] + GD_data['isSexualExploit'] + GD_data['isOtherExploit'] + GD_data['isSexAndLabour'] + GD_data['isForcedMarriage'] + GD_data['isForcedMilitary'] + GD_data['isOrganRemoval']
GD_data = GD_data.loc[(GD_data['sumExploit']!=0)] 
len(GD_data)

18695

In [909]:
## 14) Replace all the -99 in Type of Labor and Sex Exploitations with 0
### Fill remaining -99 with 0 because there is at least one known value for that row
type_cols = ['typeOfLabourAgriculture', 'typeOfLabourAquafarming', 'typeOfLabourBegging',
             'typeOfLabourConstruction', 'typeOfLabourDomesticWork', 'typeOfLabourHospitality',
             'typeOfLabourIllicitActivities', 'typeOfLabourManufacturing',
             'typeOfLabourMiningOrDrilling', 'typeOfLabourPeddling',
             'typeOfLabourTransportation', 'typeOfLabourOther', 'typeOfLabourNotSpecified',
             'typeOfSexProstitution', 'typeOfSexPornography',
             'typeOfSexRemoteInteractiveServices', 'typeOfSexPrivateSexualServices']
GD_data[type_cols] = GD_data[type_cols].replace(-99, 0)

In [910]:
## 15) If no specific type of labour is recorded when isForcedLabour=1,
### then set typeOfLabourNotSpecified to 1
def unk_LaborType(x):
    if x['isForcedLabour']==1:
        if (x['typeOfLabourAgriculture'] + x['typeOfLabourAquafarming'] 
           + x['typeOfLabourBegging'] + x['typeOfLabourConstruction']
           + x['typeOfLabourDomesticWork'] + x['typeOfLabourHospitality']
           + x['typeOfLabourIllicitActivities'] + x['typeOfLabourManufacturing']
           + x['typeOfLabourMiningOrDrilling'] + x['typeOfLabourPeddling']
           + x['typeOfLabourTransportation'] + x['typeOfLabourOther'] == 0):
            return 1
        else:
            return 0
    else:
        return 0
    
GD_data['typeOfLabourNotSpecified'] = GD_data.apply(unk_LaborType,axis=1)
GD_data[['typeOfLabourNotSpecified','Datasource']].groupby(GD_data['typeOfLabourNotSpecified']).count()

Unnamed: 0_level_0,typeOfLabourNotSpecified,Datasource
typeOfLabourNotSpecified,Unnamed: 1_level_1,Unnamed: 2_level_1
0,14340,14340
1,4355,4355


In [911]:
## 16) If sexual exploitation is confirmed but no specific type is known,
### set a new column, typeOfSexNotSpecified, to 1
def unk_SexExploitType(x):
    if x['isSexualExploit']==1:
        if (x['typeOfSexProstitution'] + x['typeOfSexPornography'] 
            + x['typeOfSexRemoteInteractiveServices'] 
            + x['typeOfSexPrivateSexualServices'] == 0):
            return 1
        else:
            return 0
    else:
        return 0
    
GD_data['typeOfSexNotSpecified'] = GD_data.apply(unk_SexExploitType,axis=1)
GD_data[['typeOfSexNotSpecified','Datasource']].groupby(GD_data['typeOfSexNotSpecified']).count()


Unnamed: 0_level_0,typeOfSexNotSpecified,Datasource
typeOfSexNotSpecified,Unnamed: 1_level_1,Unnamed: 2_level_1
0,11708,11708
1,6987,6987


In [912]:
## 17) Change all Recruiter relationship -99 to 0
rr_cols = ['recruiterRelationIntimatePartner', 'recruiterRelationFriend',
           'recruiterRelationFamily', 'recruiterRelationOther', 'recruiterRelationUnknown']
GD_data[rr_cols] = GD_data[rr_cols].replace(-99, 0)

In [914]:
### If all recruiter relations are 0, then set recruiterRelationUnknown to 1
def unk_recruiterRelation(x):
    if (x['recruiterRelationIntimatePartner'] 
        + x['recruiterRelationFriend'] 
        + x['recruiterRelationFamily'] 
        + x['recruiterRelationOther']) == 0:
        return 1
    else:
        return 0
    
GD_data['recruiterRelationUnknown'] = GD_data.apply(unk_recruiterRelation,axis=1)
GD_data[['recruiterRelationUnknown','Datasource']].groupby(GD_data['recruiterRelationUnknown']).count()


Unnamed: 0_level_0,recruiterRelationUnknown,Datasource
recruiterRelationUnknown,Unnamed: 1_level_1,Unnamed: 2_level_1
0,3993,3993
1,14702,14702


In [915]:
## 18) Bucket Citizenship and countryOfExploitation into Geo-categories
### import mapping table
geo_map = pd.read_csv('assets/code_country_mapping.csv')
geo_map.head(3)

Unnamed: 0,Code,Country,Region,Continent,Geo_Categories
0,AE,UAE,Middle_East,Asia,Middle_East
1,AF,Afghanistan,Middle_East,Asia,Middle_East
2,AL,Albania,Southeastern_Europe,Europe,Europe


In [916]:
### Remove rows with CountryOfExploitation==Y1
GD_data = GD_data.loc[(GD_data['CountryOfExploitation']!='Y1')]
### (CountryOfExploitation==-99 was already replaced with 'ZZ' in step 11)
len(GD_data)

18484

In [917]:
### merge two dataframes on citizenship, then rename new geo columns
GD_data2 = pd.merge(GD_data, geo_map, how='left',
        left_on='citizenship', right_on='Code')
len(GD_data2)

18484
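
An unchanged row count shows that the left merge did not duplicate rows, but it does not show whether every code found a match. As a sketch with toy frames (the code `XX` is a made-up unmatched value), the `validate` and `indicator` options of `pd.merge` can check both:

```python
import pandas as pd

data = pd.DataFrame({'citizenship': ['KZ', 'MD', 'ZZ', 'XX']})
geo_map_toy = pd.DataFrame({
    'Code': ['KZ', 'MD', 'ZZ'],
    'Geo_Categories': ['Asia', 'Europe', 'Unknown'],
})

# validate raises if a code appears more than once in the mapping table;
# indicator adds a '_merge' column telling where each row came from.
merged = pd.merge(data, geo_map_toy, how='left',
                  left_on='citizenship', right_on='Code',
                  validate='many_to_one', indicator=True)

# Codes present in the data but absent from the mapping table.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'citizenship'].unique()
print(unmatched)
```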

In [920]:
GD_data2.head(3)

Unnamed: 0,yearOfRegistration,Datasource,gender,ageBroad,majorityStatus,majorityStatusAtExploit,majorityEntry,citizenship,meansOfControlDebtBondage,meansOfControlTakesEarnings,...,recruiterRelationOther,recruiterRelationUnknown,ageBroad_mid,sumExploit,typeOfSexNotSpecified,Code,Country,Region,Continent,Geo_Categories
0,2010,Case Management,Female,18--20,Adult,Adult,Adult,KZ,0,0,...,0,1,19.0,1,1,KZ,Kazakhstan,Central_Asia,Asia,Asia
1,2004,Case Management,Female,18--20,Adult,-99,-99,MD,0,0,...,0,1,19.0,1,1,MD,Moldova,Eastern_Europe,Europe,Europe
2,2010,Case Management,Female,18--20,Adult,Adult,Adult,KZ,1,0,...,0,1,19.0,1,1,KZ,Kazakhstan,Central_Asia,Asia,Asia


In [921]:
GD_data2 = GD_data2.rename(columns={'Country': 'citizenshipCountry', 
                                    'Region': 'citizenshipRegion',
                                    'Continent':'citizenshipContinent',
                                    'Geo_Categories':'citizenshipGeoCategory'})

In [937]:
#GD_data2.head(100)

In [924]:
len(GD_data2)

18484

In [925]:
### do the same for countryOfExploitation
GD_data3 = pd.merge(GD_data2, geo_map, how='left',
        left_on='CountryOfExploitation', right_on='Code')
GD_data3.head()

Unnamed: 0,yearOfRegistration,Datasource,gender,ageBroad,majorityStatus,majorityStatusAtExploit,majorityEntry,citizenship,meansOfControlDebtBondage,meansOfControlTakesEarnings,...,Code_x,citizenshipCountry,citizenshipRegion,citizenshipContinent,citizenshipGeoCategory,Code_y,Country,Region,Continent,Geo_Categories
0,2010,Case Management,Female,18--20,Adult,Adult,Adult,KZ,0,0,...,KZ,Kazakhstan,Central_Asia,Asia,Asia,KZ,Kazakhstan,Central_Asia,Asia,Asia
1,2004,Case Management,Female,18--20,Adult,-99,-99,MD,0,0,...,MD,Moldova,Eastern_Europe,Europe,Europe,MD,Moldova,Eastern_Europe,Europe,Europe
2,2010,Case Management,Female,18--20,Adult,Adult,Adult,KZ,1,0,...,KZ,Kazakhstan,Central_Asia,Asia,Asia,KZ,Kazakhstan,Central_Asia,Asia,Asia
3,2010,Case Management,Female,18--20,Adult,Adult,Adult,KZ,1,0,...,KZ,Kazakhstan,Central_Asia,Asia,Asia,KZ,Kazakhstan,Central_Asia,Asia,Asia
4,2012,Case Management,Female,9--17,Minor,-99,Minor,HT,0,0,...,HT,Haiti,North_America,North_America,North_America,HT,Haiti,North_America,North_America,North_America


In [926]:
### renaming columns to indicate country of exploitation
GD_data3 = GD_data3.rename(columns={'Country': 'exploitationCountry', 
                                    'Region': 'exploitationRegion',
                                    'Continent':'exploitationContinent',
                                    'Geo_Categories':'exploitationGeoCategory'})

In [932]:
GD_data3.head(3)

Unnamed: 0,yearOfRegistration,Datasource,gender,ageBroad,majorityStatus,majorityStatusAtExploit,majorityEntry,citizenship,meansOfControlDebtBondage,meansOfControlTakesEarnings,...,Code_x,citizenshipCountry,citizenshipRegion,citizenshipContinent,citizenshipGeoCategory,Code_y,exploitationCountry,exploitationRegion,exploitationContinent,exploitationGeoCategory
0,2010,Case Management,Female,18--20,Adult,Adult,Adult,KZ,0,0,...,KZ,Kazakhstan,Central_Asia,Asia,Asia,KZ,Kazakhstan,Central_Asia,Asia,Asia
1,2004,Case Management,Female,18--20,Adult,-99,-99,MD,0,0,...,MD,Moldova,Eastern_Europe,Europe,Europe,MD,Moldova,Eastern_Europe,Europe,Europe
2,2010,Case Management,Female,18--20,Adult,Adult,Adult,KZ,1,0,...,KZ,Kazakhstan,Central_Asia,Asia,Asia,KZ,Kazakhstan,Central_Asia,Asia,Asia


In [934]:
GD_data3[['citizenship','CountryOfExploitation','citizenshipGeoCategory','exploitationGeoCategory']].tail()

Unnamed: 0,citizenship,CountryOfExploitation,citizenshipGeoCategory,exploitationGeoCategory
18479,KH,CN,Asia,Asia
18480,KH,CN,Asia,Asia
18481,KH,MY,Asia,Asia
18482,ZZ,KH,Unknown,Asia
18483,ZZ,KH,Unknown,Asia


In [935]:
### Reset data name
GD_data = GD_data3

In [964]:
## 19) Drop unnecessary columns 
### Drop Datasource, majorityStatusAtExploit, majorityEntry, citizenship, isAbduction, CountryOfExploitation, ageBroad,
### sumExploit, Code_x, citizenshipCountry, citizenshipContinent, Code_y, exploitationCountry, exploitationContinent, RecruiterRelationship
GD_data_new = GD_data.drop(['Datasource','majorityStatusAtExploit','majorityEntry','citizenship','isAbduction',
                           'CountryOfExploitation','ageBroad','sumExploit','Code_x','citizenshipCountry',
                            'citizenshipContinent','Code_y','exploitationCountry','exploitationContinent','RecruiterRelationship'],axis=1)

In [965]:
len(GD_data_new.columns)

56

In [968]:
## 20) Final touches 
### Re-order columns
GD_data_new2 = GD_data_new[['yearOfRegistration','gender','majorityStatus','ageBroad_mid','citizenshipRegion','citizenshipGeoCategory',
             'exploitationRegion','exploitationGeoCategory','meansOfControlDebtBondage','meansOfControlTakesEarnings',
             'meansOfControlRestrictsFinancialAccess','meansOfControlThreats','meansOfControlPsychologicalAbuse',
             'meansOfControlPhysicalAbuse','meansOfControlSexualAbuse','meansOfControlFalsePromises',
             'meansOfControlPsychoactiveSubstances','meansOfControlRestrictsMovement','meansOfControlRestrictsMedicalCare',
             'meansOfControlExcessiveWorkingHours','meansOfControlUsesChildren','meansOfControlThreatOfLawEnforcement',
             'meansOfControlWithholdsNecessities','meansOfControlWithholdsDocuments','meansOfControlOther',
             'meansOfControlNotSpecified','isForcedLabour','isSexualExploit','isOtherExploit','isSexAndLabour',
             'isForcedMarriage','isForcedMilitary','isOrganRemoval','typeOfLabourAgriculture','typeOfLabourAquafarming',
             'typeOfLabourBegging','typeOfLabourConstruction','typeOfLabourDomesticWork','typeOfLabourHospitality',
             'typeOfLabourIllicitActivities','typeOfLabourManufacturing','typeOfLabourMiningOrDrilling','typeOfLabourPeddling',
             'typeOfLabourTransportation','typeOfLabourOther','typeOfLabourNotSpecified','typeOfSexProstitution',
             'typeOfSexPornography','typeOfSexRemoteInteractiveServices','typeOfSexPrivateSexualServices',
             'typeOfSexNotSpecified','recruiterRelationIntimatePartner','recruiterRelationFriend',
             'recruiterRelationFamily','recruiterRelationOther','recruiterRelationUnknown']]
len(GD_data_new2.columns)


56

In [1052]:
### Shorten some of the columns names
GD_data_new3 = GD_data_new2.rename(columns={'meansOfControlDebtBondage': 'mocDebtBondage', 
                                    'meansOfControlTakesEarnings': 'mocTakesEarnings',
                                    'meansOfControlRestrictsFinancialAccess':'mocRestrictsFinancialAccess',
                                    'meansOfControlThreats':'mocThreats',
                                    'meansOfControlPsychologicalAbuse':'mocPsychologicalAbuse',
                                    'meansOfControlPhysicalAbuse':'mocPhysicalAbuse',
                                    'meansOfControlSexualAbuse':'mocSexualAbuse',
                                    'meansOfControlFalsePromises':'mocFalsePromises',
                                    'meansOfControlPsychoactiveSubstances':'mocPsychoactiveSubstances',
                                    'meansOfControlRestrictsMovement':'mocRestrictsMovement',
                                    'meansOfControlRestrictsMedicalCare':'mocRestrictsMedicalCare',
                                    'meansOfControlExcessiveWorkingHours':'mocExcessiveWorkingHours',
                                    'meansOfControlUsesChildren':'mocUsesChildren',
                                    'meansOfControlThreatOfLawEnforcement':'mocThreatOfLawEnforcement',
                                    'meansOfControlWithholdsNecessities':'mocWithholdsNecessities',
                                    'meansOfControlWithholdsDocuments':'mocWithholdsDocuments',
                                    'meansOfControlOther':'mocOther',
                                    'meansOfControlNotSpecified':'mocNotSpecified',
                                    'typeOfLabourAgriculture':'tolAgriculture',
                                    'typeOfLabourAquafarming':'tolAquafarming',
                                    'typeOfLabourBegging':'tolBegging',
                                    'typeOfLabourConstruction':'tolConstruction',
                                    'typeOfLabourDomesticWork':'tolDomesticWork',
                                    'typeOfLabourHospitality':'tolHospitality',
                                    'typeOfLabourIllicitActivities':'tolIllicitActivities',
                                    'typeOfLabourManufacturing':'tolManufacturing',
                                    'typeOfLabourMiningOrDrilling':'tolMiningOrDrilling',
                                    'typeOfLabourPeddling':'tolPeddling',
                                    'typeOfLabourTransportation':'tolTransportation',
                                    'typeOfLabourOther':'tolOther',
                                    'typeOfLabourNotSpecified':'tolNotSpecified',
                                    'typeOfSexProstitution':'tosProstitution',
                                    'typeOfSexPornography':'tosPornography',
                                    'typeOfSexRemoteInteractiveServices':'tosRemoteInteractiveServices',
                                    'typeOfSexPrivateSexualServices':'tosPrivateSexualServices',
                                    'typeOfSexNotSpecified':'tosNotSpecified',
                                    'recruiterRelationIntimatePartner':'rrIntimatePartner',
                                    'recruiterRelationFriend':'rrFriend',
                                    'recruiterRelationFamily':'rrFamily',
                                    'recruiterRelationOther':'rrOther',
                                    'recruiterRelationUnknown':'rrUnknown'})

GD_data_new3.columns

Index([u'yearOfRegistration', u'gender', u'majorityStatus', u'ageBroad_mid',
       u'citizenshipRegion', u'citizenshipGeoCategory', u'exploitationRegion',
       u'exploitationGeoCategory', u'mocDebtBondage', u'mocTakesEarnings',
       u'mocRestrictsFinancialAccess', u'mocThreats', u'mocPsychologicalAbuse',
       u'mocPhysicalAbuse', u'mocSexualAbuse', u'mocFalsePromises',
       u'mocPsychoactiveSubstances', u'mocRestrictsMovement',
       u'mocRestrictsMedicalCare', u'mocExcessiveWorkingHours',
       u'mocUsesChildren', u'mocThreatOfLawEnforcement',
       u'mocWithholdsNecessities', u'mocWithholdsDocuments', u'mocOther',
       u'mocNotSpecified', u'isForcedLabour', u'isSexualExploit',
       u'isOtherExploit', u'isSexAndLabour', u'isForcedMarriage',
       u'isForcedMilitary', u'isOrganRemoval', u'tolAgriculture',
       u'tolAquafarming', u'tolBegging', u'tolConstruction',
       u'tolDomesticWork', u'tolHospitality', u'tolIllicitActivities',
       u'tolManufacturing', u'tolM

In [1053]:
len(GD_data_new3.columns)

56

In [1054]:
GD_data_final = GD_data_new3

# III) Exploratory Analysis

## A) Demographic Data (Input Variable)
Use demographic data to predict experiences of trafficked victims

In [1055]:
### Create Dummies for gender, majorityStatus, citizenshipRegion, citizenshipGeoCategory, exploitationRegion, exploitationGeoCategory
#### Split up genders
dummy_gender = pd.get_dummies(GD_data_final['gender'], prefix='g')
print(dummy_gender.head())

   g_Female  g_Male
0         1       0
1         1       0
2         1       0
3         1       0
4         1       0


In [1056]:
dummy_gender.sum()

g_Female    14054
g_Male       4430
dtype: int64

In [1057]:
#### split up adult vs minor
dummy_majorityStatus = pd.get_dummies(GD_data_final['majorityStatus'], prefix='ms')
print(dummy_majorityStatus.head())

   ms_Adult  ms_Minor
0         1         0
1         1         0
2         1         0
3         1         0
4         0         1


In [1058]:
dummy_majorityStatus.sum()

ms_Adult    14204
ms_Minor     4280
dtype: int64

In [1059]:
#### Citizenship Region
#### bucket countries into regions - less granular but fewer features (possibly more efficient)
dummy_citizenshipRegion = pd.get_dummies(GD_data_final['citizenshipRegion'], prefix='cr')
print(dummy_citizenshipRegion.head())

   cr_Central_Asia  cr_East_Africa  cr_East_Asia  cr_Eastern_Europe  \
0                1               0             0                  0   
1                0               0             0                  1   
2                1               0             0                  0   
3                1               0             0                  0   
4                0               0             0                  0   

   cr_Middle_East  cr_North_America  cr_Northeast_Africa  cr_South_America  \
0               0                 0                    0                 0   
1               0                 0                    0                 0   
2               0                 0                    0                 0   
3               0                 0                    0                 0   
4               0                 1                    0                 0   

   cr_South_Asia  cr_Southeast_Asia  cr_Southeastern_Europe  cr_Unknown  \
0              0             

In [1060]:
dummy_citizenshipRegion.sum()

cr_Central_Asia            310
cr_East_Africa              70
cr_East_Asia                96
cr_Eastern_Europe         4030
cr_Middle_East              98
cr_North_America          3346
cr_Northeast_Africa         13
cr_South_America            43
cr_South_Asia              112
cr_Southeast_Asia         3016
cr_Southeastern_Europe     541
cr_Unknown                6129
cr_West_Africa             680
dtype: int64

In [1061]:
#### Citizenship Geo Category
#### bucket countries into even less granular geo categories (mixture of continents and regions)
dummy_citizenshipGeoCategory = pd.get_dummies(GD_data_final['citizenshipGeoCategory'], prefix='cg')
print(dummy_citizenshipGeoCategory.head())

   cg_Africa  cg_Asia  cg_Europe  cg_Middle_East  cg_North_America  \
0          0        1          0               0                 0   
1          0        0          1               0                 0   
2          0        1          0               0                 0   
3          0        1          0               0                 0   
4          0        0          0               0                 1   

   cg_South_America  cg_Unknown  
0                 0           0  
1                 0           0  
2                 0           0  
3                 0           0  
4                 0           0  


In [1062]:
dummy_citizenshipGeoCategory.sum()

cg_Africa            763
cg_Asia             3534
cg_Europe           4571
cg_Middle_East        98
cg_North_America    3346
cg_South_America      43
cg_Unknown          6129
dtype: int64
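
The four dummy-encoding cells above repeat the same pattern, so they could be collapsed into a single call. The following is a minimal sketch on a toy frame (the rows are illustrative, not real records); the `prefixes` mapping and the `dtype=int` argument are assumptions, and `dtype` needs a reasonably recent pandas.

```python
import pandas as pd

# Toy stand-in for GD_data_final; values are illustrative only.
toy = pd.DataFrame({
    'gender': ['Female', 'Female', 'Male'],
    'majorityStatus': ['Adult', 'Minor', 'Adult'],
    'citizenshipGeoCategory': ['Asia', 'Europe', 'Asia'],
})

# Map each categorical column to the prefix used in this notebook.
prefixes = {'gender': 'g', 'majorityStatus': 'ms', 'citizenshipGeoCategory': 'cg'}

# One call encodes every listed column; dtype=int keeps 0/1 values
# (newer pandas versions default to booleans).
toy_dummies = pd.get_dummies(toy, columns=list(prefixes), prefix=prefixes, dtype=int)

# Column sums give per-category counts, like the .sum() cells above.
toy_counts = toy_dummies.sum()
```

Keeping one frame per variable, as the notebook does, makes it easier to join only selected columns later; the single call is mainly a convenience.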

In [1063]:
#### Create Demographic Dataframe including dummies
demo_cols_keep = ['yearOfRegistration', 'ageBroad_mid']
demo_handCalc = (GD_data_final[demo_cols_keep]
                 .join(dummy_gender.loc[:, 'g_Female':])
                 .join(dummy_majorityStatus.loc[:, 'ms_Adult'])
                 .join(dummy_citizenshipRegion)
                 .join(dummy_citizenshipGeoCategory))
print(demo_handCalc.head())

   yearOfRegistration  ageBroad_mid  g_Female  g_Male  ms_Adult  \
0                2010          19.0         1       0         1   
1                2004          19.0         1       0         1   
2                2010          19.0         1       0         1   
3                2010          19.0         1       0         1   
4                2012          13.0         1       0         0   

   cr_Central_Asia  cr_East_Africa  cr_East_Asia  cr_Eastern_Europe  \
0                1               0             0                  0   
1                0               0             0                  1   
2                1               0             0                  0   
3                1               0             0                  0   
4                0               0             0                  0   

   cr_Middle_East     ...      cr_Southeastern_Europe  cr_Unknown  \
0               0     ...                           0           0   
1               0     ...       



In [1064]:
#### Basic Demographic Stats and Exploratory Analysis
demo_handCalc.describe()

Unnamed: 0,yearOfRegistration,ageBroad_mid,g_Female,g_Male,ms_Adult,cr_Central_Asia,cr_East_Africa,cr_East_Asia,cr_Eastern_Europe,cr_Middle_East,...,cr_Southeastern_Europe,cr_Unknown,cr_West_Africa,cg_Africa,cg_Asia,cg_Europe,cg_Middle_East,cg_North_America,cg_South_America,cg_Unknown
count,18484.0,18484.0,18484.0,18484.0,18484.0,18484.0,18484.0,18484.0,18484.0,18484.0,...,18484.0,18484.0,18484.0,18484.0,18484.0,18484.0,18484.0,18484.0,18484.0,18484.0
mean,2013.033218,25.278187,0.760333,0.239667,0.768448,0.016771,0.003787,0.005194,0.218026,0.005302,...,0.029269,0.331584,0.036789,0.041279,0.191192,0.247295,0.005302,0.181021,0.002326,0.331584
std,4.432433,10.083946,0.426892,0.426892,0.421835,0.128417,0.061424,0.071882,0.412917,0.072623,...,0.168563,0.470795,0.188247,0.19894,0.393251,0.431451,0.072623,0.385046,0.048177,0.470795
min,2002.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2012.0,19.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,2015.0,25.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,2016.0,30.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
max,2017.0,55.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
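
In the summary above, the `g_Female` and `g_Male` means sum to 1 (0.760333 + 0.239667) and their standard deviations are identical, because each column is the complement of the other. The sketch below, on toy values, illustrates why their correlations with any other variable will be exact mirror images; `drop_first=True` is shown only as one possible way to keep a single column, not as what this notebook does.

```python
import pandas as pd

# Toy gender column; values are illustrative only.
toy_gender = pd.Series(['Female', 'Female', 'Male', 'Female'], name='gender')

# Encoding both levels gives two columns that always sum to 1 per row.
both = pd.get_dummies(toy_gender, prefix='g', dtype=int)
row_totals = both.sum(axis=1)

# Complementary 0/1 columns are perfectly negatively correlated.
mirror = both['g_Female'].corr(both['g_Male'])

# drop_first=True drops the first level (Female) and keeps only g_Male.
single = pd.get_dummies(toy_gender, prefix='g', drop_first=True, dtype=int)
```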


## B) Exploitation Geo Data (Output Variables) 
Use input variables to predict Exploitation Location

In [1065]:
#### Exploitation Region
#### bucket countries into regions - less granular but fewer features (possibly more efficient)
dummy_exploitationRegion = pd.get_dummies(GD_data_final['exploitationRegion'], prefix='er')
print(dummy_exploitationRegion.head())

   er_Central_Asia  er_Central_Europe  er_East_Africa  er_East_Asia  \
0                1                  0               0             0   
1                0                  0               0             0   
2                1                  0               0             0   
3                1                  0               0             0   
4                0                  0               0             0   

   er_Eastern_Europe  er_Eurasia  er_Mediterranean  er_Middle_East  \
0                  0           0                 0               0   
1                  1           0                 0               0   
2                  0           0                 0               0   
3                  0           0                 0               0   
4                  0           0                 0               0   

   er_North_America  er_Northwestern_Europe  er_South_Africa  \
0                 0                       0                0   
1                 0     

In [1066]:
#### Exploitation Geo Category
#### bucket countries into even less granular geo categories (mixture of continents and regions)
dummy_exploitationGeoCategory = pd.get_dummies(GD_data_final['exploitationGeoCategory'], prefix='eg')
print(dummy_exploitationGeoCategory.head())

   eg_Africa  eg_Asia  eg_Eurasia  eg_Europe  eg_Middle_East  \
0          0        1           0          0               0   
1          0        0           0          1               0   
2          0        1           0          0               0   
3          0        1           0          0               0   
4          0        0           0          0               0   

   eg_North_America  eg_South_America  eg_Unknown  
0                 0                 0           0  
1                 0                 0           0  
2                 0                 0           0  
3                 0                 0           0  
4                 1                 0           0  


In [1067]:
#### Create Exploitation Geo Dataframe including dummies
expgeo_cols_keep = []
expgeo_handCalc = GD_data_final[expgeo_cols_keep].join(dummy_exploitationRegion).join(dummy_exploitationGeoCategory)
print(expgeo_handCalc.head())

   er_Central_Asia  er_Central_Europe  er_East_Africa  er_East_Asia  \
0                1                  0               0             0   
1                0                  0               0             0   
2                1                  0               0             0   
3                1                  0               0             0   
4                0                  0               0             0   

   er_Eastern_Europe  er_Eurasia  er_Mediterranean  er_Middle_East  \
0                  0           0                 0               0   
1                  1           0                 0               0   
2                  0           0                 0               0   
3                  0           0                 0               0   
4                  0           0                 0               0   

   er_North_America  er_Northwestern_Europe     ...      er_Unknown  \
0                 0                       0     ...               0   
1         

In [1068]:
expgeo_handCalc.sum()

er_Central_Asia            135
er_Central_Europe          429
er_East_Africa              70
er_East_Asia                83
er_Eastern_Europe          187
er_Eurasia                2846
er_Mediterranean            53
er_Middle_East            1255
er_North_America          9488
er_Northwestern_Europe      17
er_South_Africa             36
er_South_America            85
er_Southeast_Asia         1984
er_Southeastern_Europe     577
er_Unknown                 667
er_West_Africa             572
eg_Africa                  678
eg_Asia                   2202
eg_Eurasia                2846
eg_Europe                 1263
eg_Middle_East            1255
eg_North_America          9488
eg_South_America            85
eg_Unknown                 667
dtype: int64
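
The regional counts above appear to roll up into the wider categories: East Africa (70) + South Africa (36) + West Africa (572) equals the 678 under `eg_Africa`. A cross-tabulation is one way to check this kind of roll-up directly; the sketch below uses a toy frame with the same two column names, and the rows are illustrative only.

```python
import pandas as pd

# Toy stand-in for the two exploitation geography columns.
toy_geo = pd.DataFrame({
    'exploitationRegion': ['West_Africa', 'East_Africa', 'Southeast_Asia',
                           'Central_Asia', 'West_Africa'],
    'exploitationGeoCategory': ['Africa', 'Africa', 'Asia', 'Asia', 'Africa'],
})

# Rows are regions, columns are geo categories, cells are record counts.
rollup = pd.crosstab(toy_geo['exploitationRegion'],
                     toy_geo['exploitationGeoCategory'])

# If the bucketing is consistent, each region lands in exactly one category.
categories_per_region = (rollup > 0).sum(axis=1)
```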

In [1069]:
#### How does Demographic Data relate with Exploitation Geo Data?
demo_expgeo_calc = demo_handCalc.join(expgeo_handCalc)

#### Correlation (look at exploitation Region)
demo_expgeo_calc.corr().iloc[:,25:41].head(18)

Unnamed: 0,er_Central_Asia,er_Central_Europe,er_East_Africa,er_East_Asia,er_Eastern_Europe,er_Eurasia,er_Mediterranean,er_Middle_East,er_North_America,er_Northwestern_Europe,er_South_Africa,er_South_America,er_Southeast_Asia,er_Southeastern_Europe,er_Unknown,er_West_Africa
yearOfRegistration,-0.031606,-0.180313,0.013448,0.042034,-0.147002,-0.249129,-0.065004,-0.005274,0.609951,0.013463,0.009636,0.014826,0.097048,-0.385561,-0.425219,-0.316198
ageBroad_mid,0.076648,0.05358,-0.112023,0.000555,-0.032062,0.3,-0.037601,0.151419,-0.297835,0.053323,0.011924,0.010972,0.188859,-0.070019,-0.03097,-0.250773
g_Female,-0.133428,0.073917,-0.019031,0.037707,0.056759,-0.356029,-0.000706,0.102158,0.453356,-0.054041,-0.078682,-0.121063,-0.406361,0.100781,0.108629,-0.168993
g_Male,0.133428,-0.073917,0.019031,-0.037707,-0.056759,0.356029,0.000706,-0.102158,-0.453356,0.054041,0.078682,0.121063,0.406361,-0.100781,-0.108629,0.168993
ms_Adult,0.047084,0.084615,-0.11232,0.015766,0.027299,0.213211,-0.001746,0.09819,-0.273289,0.016655,0.024249,0.03731,0.139383,0.068298,0.10002,-0.325544
cr_Central_Asia,0.572644,-0.020132,-0.008052,-0.008772,-0.013203,0.168405,-0.007004,-0.035249,-0.134128,-0.003963,-0.005769,-0.008877,-0.045288,-0.023444,-0.02527,-0.023339
cr_East_Africa,-0.005289,-0.009504,1.0,-0.004141,-0.006233,-0.026303,-0.003306,-0.016641,-0.06332,-0.001871,-0.002724,-0.004191,-0.02138,-0.011068,-0.011929,-0.011018
cr_East_Asia,-0.006198,-0.011138,-0.004455,-0.004853,-0.007305,-0.030824,-0.003875,-0.019501,0.070357,-0.002192,-0.003192,-0.004911,-0.025055,-0.01297,-0.01398,-0.012912
cr_Eastern_Europe,-0.019133,0.291925,-0.032556,-0.035463,0.191457,0.699011,0.001089,-0.050845,-0.542277,-0.016021,-0.023326,-0.03589,-0.1831,-0.033755,0.306711,-0.094359
cr_Middle_East,-0.006262,-0.011254,-0.004501,-0.004903,-0.007381,-0.031146,-0.003915,0.270506,-0.074978,-0.002215,-0.003225,-0.004962,-0.025316,-0.013105,-0.014126,-0.013047


### Observations: (Demographic vs Regional Geo Data)
#### Central Europe:
- Citizenship: Eastern Europe: 0.29

#### Eurasia:
- Citizenship: Eastern Europe: 0.70
- Age: 0.3
- Male: 0.36

#### Mediterranean:
- Citizenship: Northeast Africa: 0.49

#### Middle East:
- Citizenship: Middle East: 0.27
- Citizenship: South Asia: 0.29
- Citizenship: Southeast Asia: 0.39

#### North America:
- Citizenship: North America: 0.46
- Citizenship: Southeast Asia: -0.45
- Year of Registration: 0.6
- Female: 0.45

#### Southeastern Europe:
- Citizenship: Southeastern Europe: 0.88
- Year of Registration: -0.39

#### West Africa:
- Citizenship: West Africa: 0.91
- Year of Registration: -0.32
- Age: -0.25
- Adult: -0.33

#### Southeast Asia:
- Male: 0.4
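
The observations above were read off the correlation table by hand. A small helper could list the stronger pairs automatically; the sketch below is one possible approach, where the function name `strong_pairs` and the 0.2 cutoff are assumptions chosen for illustration, and the toy data is not from the dataset.

```python
import pandas as pd

def strong_pairs(frame, input_cols, output_cols, threshold=0.2):
    """Return (input, output, r) rows whose |r| meets the threshold, strongest first."""
    corr = frame.corr().loc[input_cols, output_cols]
    rows = [(i, o, corr.at[i, o]) for i in input_cols for o in output_cols]
    pairs = pd.DataFrame(rows, columns=['input', 'output', 'r'])
    pairs['abs_r'] = pairs['r'].abs()
    # Rows with NaN correlations fail the comparison and are dropped here too.
    pairs = pairs[pairs['abs_r'] >= threshold].sort_values('abs_r', ascending=False)
    return pairs.drop(columns='abs_r').reset_index(drop=True)

# Toy frame with one input dummy and two output dummies (illustrative values).
toy = pd.DataFrame({
    'g_Male':     [1, 1, 0, 0, 1, 0],
    'er_Eurasia': [1, 1, 0, 0, 1, 0],
    'er_Unknown': [0, 1, 0, 1, 0, 1],
})
strong = strong_pairs(toy, ['g_Male'], ['er_Eurasia', 'er_Unknown'], threshold=0.2)
very_strong = strong_pairs(toy, ['g_Male'], ['er_Eurasia', 'er_Unknown'], threshold=0.5)
```

Applied to `demo_expgeo_calc`, the input list would be the demographic columns and the output list the `er_` columns.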


In [1070]:
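#### Correlation (look at exploitation wider Geo Categories)
#### Note: eg_Africa sits at column 41, so starting the slice at 42 leaves it out of the table below
demo_expgeo_calc.corr().iloc[:25,42:]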

Unnamed: 0,eg_Asia,eg_Eurasia,eg_Europe,eg_Middle_East,eg_North_America,eg_South_America,eg_Unknown
yearOfRegistration,0.093101,-0.249129,-0.443808,-0.005274,0.609951,0.014826,-0.425219
ageBroad_mid,0.200725,0.3,-0.030564,0.151419,-0.297835,0.010972,-0.03097
g_Female,-0.415585,-0.356029,0.129442,0.102158,0.453356,-0.121063,0.108629
g_Male,0.415585,0.356029,-0.129442,-0.102158,-0.453356,0.121063,-0.108629
ms_Adult,0.148817,0.213211,0.110026,0.09819,-0.273289,0.03731,0.10002
cr_Central_Asia,0.105435,0.168405,-0.035369,-0.035249,-0.134128,-0.008877,-0.02527
cr_East_Africa,-0.022674,-0.026303,-0.016697,-0.016641,-0.06332,-0.004191,-0.011929
cr_East_Asia,-0.026572,-0.030824,-0.019568,-0.019501,0.070357,-0.004911,-0.01398
cr_Eastern_Europe,-0.187308,0.699011,0.225186,-0.050845,-0.542277,-0.03589,0.306711
cr_Middle_East,-0.026849,-0.031146,-0.019772,0.270506,-0.074978,-0.004962,-0.014126


### Observations: (Demographics vs Wider Geo Category)

#### Output: Asia:
- Male: 0.42
- Citizenship: Southeast Asia: 0.75
- Citizenship: Asia: 0.73
- Citizenship: Europe: -0.20

#### Output: Eurasia
- Year: -0.25
- Age: 0.3
- Male: 0.36
- Citizenship: Eastern Europe: 0.70
- Citizenship: Europe: 0.64

#### Output: Europe
- Year: -0.44
- Citizenship: Southeastern Europe: 0.64
- Citizenship: Europe: 0.47

#### Output: Middle East
- Citizenship: Asia: 0.41
- Citizenship: Middle East: 0.27

#### Output: North America
- Year: 0.61
- Age: -0.30
- Female: 0.45
- Citizenship: Eastern Europe: -0.54
- Citizenship: North America: 0.46
- Citizenship: Southeast Asia: -0.45
- Citizenship: Asia: -0.47
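
Positional slices such as `.iloc[:25,42:]` depend on the exact column order, which makes it easy to skip a column by one position. Selecting by label prefix is a possible alternative; the following sketch assumes the `eg_` naming used above and runs on toy values only.

```python
import pandas as pd

# Toy frame mixing input columns with er_ and eg_ output dummies.
toy = pd.DataFrame({
    'ageBroad_mid': [19, 25, 30, 13, 40, 22],
    'g_Male':       [0, 1, 1, 0, 1, 0],
    'er_Eurasia':   [0, 1, 1, 0, 1, 0],
    'eg_Africa':    [1, 0, 0, 1, 0, 0],
    'eg_Asia':      [0, 0, 0, 0, 0, 1],
    'eg_Eurasia':   [0, 1, 1, 0, 1, 0],
})

input_cols = ['ageBroad_mid', 'g_Male']

# Pick every geo-category dummy by its prefix instead of by position.
eg_cols = [c for c in toy.columns if c.startswith('eg_')]

# Label-based slice: rows are inputs, columns are all eg_ outputs.
eg_corr = toy.corr().loc[input_cols, eg_cols]
```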
    

## C) Means of Control Data

In [1072]:
### Carve out Means of Control Data from the clean dataset
moc_handCalc = GD_data_final[['mocDebtBondage','mocTakesEarnings','mocRestrictsFinancialAccess','mocThreats',
                              'mocPsychologicalAbuse','mocPhysicalAbuse','mocSexualAbuse','mocFalsePromises','mocPsychoactiveSubstances',
                              'mocRestrictsMovement','mocRestrictsMedicalCare','mocExcessiveWorkingHours','mocUsesChildren',
                              'mocThreatOfLawEnforcement','mocWithholdsNecessities','mocWithholdsDocuments','mocOther','mocNotSpecified']]
moc_handCalc.head()

Unnamed: 0,mocDebtBondage,mocTakesEarnings,mocRestrictsFinancialAccess,mocThreats,mocPsychologicalAbuse,mocPhysicalAbuse,mocSexualAbuse,mocFalsePromises,mocPsychoactiveSubstances,mocRestrictsMovement,mocRestrictsMedicalCare,mocExcessiveWorkingHours,mocUsesChildren,mocThreatOfLawEnforcement,mocWithholdsNecessities,mocWithholdsDocuments,mocOther,mocNotSpecified
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,1,0,1,0,0,0,1,0,0,0,0,0,0


In [1073]:
moc_handCalc.describe()

Unnamed: 0,mocDebtBondage,mocTakesEarnings,mocRestrictsFinancialAccess,mocThreats,mocPsychologicalAbuse,mocPhysicalAbuse,mocSexualAbuse,mocFalsePromises,mocPsychoactiveSubstances,mocRestrictsMovement,mocRestrictsMedicalCare,mocExcessiveWorkingHours,mocUsesChildren,mocThreatOfLawEnforcement,mocWithholdsNecessities,mocWithholdsDocuments,mocOther,mocNotSpecified
count,18484.0,18484.0,18484.0,18484.0,18484.0,18484.0,18484.0,18484.0,18484.0,18484.0,18484.0,18484.0,18484.0,18484.0,18484.0,18484.0,18484.0,18484.0
mean,0.048042,0.118102,0.005626,0.138877,0.174042,0.129734,0.046473,0.111989,0.067842,0.152294,0.049989,0.084235,0.005464,0.031324,0.057076,0.019314,0.069141,0.514878
std,0.21386,0.322738,0.074801,0.345827,0.379156,0.336019,0.210512,0.315361,0.251482,0.359315,0.217929,0.277747,0.07372,0.174198,0.231995,0.13763,0.2537,0.499792
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [1074]:
moc_handCalc.sum()

mocDebtBondage                  888
mocTakesEarnings               2183
mocRestrictsFinancialAccess     104
mocThreats                     2567
mocPsychologicalAbuse          3217
mocPhysicalAbuse               2398
mocSexualAbuse                  859
mocFalsePromises               2070
mocPsychoactiveSubstances      1254
mocRestrictsMovement           2815
mocRestrictsMedicalCare         924
mocExcessiveWorkingHours       1557
mocUsesChildren                 101
mocThreatOfLawEnforcement       579
mocWithholdsNecessities        1055
mocWithholdsDocuments           357
mocOther                       1278
mocNotSpecified                9517
dtype: int64
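
Because every means-of-control column is coded 0/1, the `mean` row of `describe()` is the share of records with that flag, and it equals the sum divided by the row count (for example 9517 / 18484 ≈ 0.5149 for `mocNotSpecified`). A short sketch on toy values, with hypothetical variable names:

```python
import pandas as pd

# Toy 0/1 flags; values are illustrative only.
toy_moc = pd.DataFrame({
    'mocThreats':      [1, 0, 0, 1],
    'mocDebtBondage':  [0, 0, 0, 1],
    'mocNotSpecified': [0, 1, 1, 0],
})

# Count of records with each flag set, and the same as a share of all rows.
counts = toy_moc.sum()
share = counts / len(toy_moc)

# Ranking by share shows which flags dominate.
ranked = share.sort_values(ascending=False)
```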

In [1075]:
### Join Demo data with Means of Control Data
demo_moc_calc = demo_handCalc.join(moc_handCalc)
demo_moc_calc.head()

Unnamed: 0,yearOfRegistration,ageBroad_mid,g_Female,g_Male,ms_Adult,cr_Central_Asia,cr_East_Africa,cr_East_Asia,cr_Eastern_Europe,cr_Middle_East,...,mocPsychoactiveSubstances,mocRestrictsMovement,mocRestrictsMedicalCare,mocExcessiveWorkingHours,mocUsesChildren,mocThreatOfLawEnforcement,mocWithholdsNecessities,mocWithholdsDocuments,mocOther,mocNotSpecified
0,2010,19.0,1,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2004,19.0,1,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1
2,2010,19.0,1,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,2010,19.0,1,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2012,13.0,1,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


In [1076]:
### Demo data vs Means of Control correlation
demo_moc_calc.corr().iloc[0:25,25:]

Unnamed: 0,mocDebtBondage,mocTakesEarnings,mocRestrictsFinancialAccess,mocThreats,mocPsychologicalAbuse,mocPhysicalAbuse,mocSexualAbuse,mocFalsePromises,mocPsychoactiveSubstances,mocRestrictsMovement,mocRestrictsMedicalCare,mocExcessiveWorkingHours,mocUsesChildren,mocThreatOfLawEnforcement,mocWithholdsNecessities,mocWithholdsDocuments,mocOther,mocNotSpecified
yearOfRegistration,0.103565,0.169722,0.04627,0.196695,0.233793,0.179755,0.140406,0.136564,0.168394,0.209483,0.070983,0.123769,0.045972,0.073489,0.095283,0.084002,0.171887,-0.525998
ageBroad_mid,0.146363,0.281713,0.008971,0.165616,0.158631,0.095004,-0.057906,0.269374,-0.040789,0.186605,0.259023,0.331927,0.026121,0.167151,0.1508,0.033163,-0.003712,0.142182
g_Female,-0.117444,-0.263818,0.033761,-0.097035,-0.100277,0.00857,0.108293,-0.315837,0.137856,-0.097824,-0.225965,-0.335771,0.041615,-0.190062,-0.167797,-0.046448,0.100059,-0.116925
g_Male,0.117444,0.263818,-0.033761,0.097035,0.100277,-0.00857,-0.108293,0.315837,-0.137856,0.097824,0.225965,0.335771,-0.041615,0.190062,0.167797,0.046448,-0.100059,0.116925
ms_Adult,0.09033,0.153588,0.029289,0.09546,0.057811,0.046285,-0.062204,0.114817,-0.030414,0.120228,0.112382,0.143394,0.038948,0.086931,0.079216,0.055601,0.039394,0.295033
cr_Central_Asia,0.122354,0.258983,-0.009824,0.193643,0.111171,0.100034,-0.028833,0.243526,-0.031883,0.147493,0.287096,0.178822,-0.009681,0.092607,0.341973,-0.018328,-0.035594,-0.10926
cr_East_Africa,-0.013851,0.004729,-0.004638,0.023633,0.020483,0.049592,-0.013612,0.089827,-0.016633,-0.023682,-0.010101,-0.018699,-0.00457,-0.011087,-0.015169,0.029747,0.004028,-0.056469
cr_East_Asia,0.029522,0.024866,-0.005435,-0.002899,-0.011331,0.003462,-0.012376,0.019688,-0.0165,0.040596,0.004148,-0.011074,-0.005356,0.021573,-0.0048,0.017204,0.081179,-0.030765
cr_Eastern_Europe,0.147897,0.204233,-0.039719,0.082341,0.091443,-0.013191,-0.1035,0.249161,-0.134114,0.09454,0.221585,0.30925,-0.039139,0.138223,0.018063,-0.071246,-0.142358,0.202141
cr_Middle_East,-0.00595,-0.024409,-0.005492,-0.029319,-0.029584,-0.025971,-0.016118,0.078017,-0.019696,-0.030945,-0.016747,-0.01946,-0.005412,-0.013129,-0.017962,-0.010246,-0.019897,-0.005155


### Observations (Demo vs Means of Control)

#### Output: Takes Earnings
- Age: 0.28
- Male: 0.26
- Citizenship: Central Asia: 0.26

#### Output: Psychological Abuse
- Year: 0.23

#### Output: False Promises
- Age: 0.27
- Male: 0.32
- Citizenship: Central Asia: 0.24
- Citizenship: Eastern Europe: 0.25

#### Output: Restricts Medical Care
- Age: 0.26
- Male: 0.23
- Citizenship: Central Asia: 0.29
- Citizenship: Eastern Europe: 0.22

#### Output: Excessive Working Hours
- Age: 0.33
- Male: 0.34
- Citizenship: Eastern Europe: 0.31

#### Output: Withholds Necessities
- Citizenship: Central Asia: 0.34

## D) Type of Exploitation Data

In [1078]:
### Carve out Exploitation Type from the clean dataset
exptype_handCalc = GD_data_final[['isForcedLabour','isSexualExploit','isOtherExploit','isSexAndLabour',
                              'isForcedMarriage','isForcedMilitary','isOrganRemoval']]
exptype_handCalc.head()

Unnamed: 0,isForcedLabour,isSexualExploit,isOtherExploit,isSexAndLabour,isForcedMarriage,isForcedMilitary,isOrganRemoval
0,0,1,0,0,0,0,0
1,0,1,0,0,0,0,0
2,0,1,0,0,0,0,0
3,0,1,0,0,0,0,0
4,1,0,0,0,0,0,0


In [1079]:
exptype_handCalc.describe()

Unnamed: 0,isForcedLabour,isSexualExploit,isOtherExploit,isSexAndLabour,isForcedMarriage,isForcedMilitary,isOrganRemoval
count,18484.0,18484.0,18484.0,18484.0,18484.0,18484.0,18484.0
mean,0.365181,0.602359,0.030837,0.001082,0.002705,0.0,0.0
std,0.481494,0.489424,0.172882,0.032877,0.051941,0.0,0.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,1.0,0.0,0.0,0.0,0.0,0.0
75%,1.0,1.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,0.0,0.0


In [1080]:
exptype_handCalc.sum()

isForcedLabour       6750
isSexualExploit     11134
isOtherExploit        570
isSexAndLabour         20
isForcedMarriage       50
isForcedMilitary        0
isOrganRemoval          0
dtype: int64

In [1081]:
### Join Demo data with Exploitation Type Data
demo_exptype_calc = demo_handCalc.join(exptype_handCalc)
demo_exptype_calc.head()

Unnamed: 0,yearOfRegistration,ageBroad_mid,g_Female,g_Male,ms_Adult,cr_Central_Asia,cr_East_Africa,cr_East_Asia,cr_Eastern_Europe,cr_Middle_East,...,cg_North_America,cg_South_America,cg_Unknown,isForcedLabour,isSexualExploit,isOtherExploit,isSexAndLabour,isForcedMarriage,isForcedMilitary,isOrganRemoval
0,2010,19.0,1,0,1,1,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,2004,19.0,1,0,1,0,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0
2,2010,19.0,1,0,1,1,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,2010,19.0,1,0,1,1,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4,2012,13.0,1,0,0,0,0,0,0,0,...,1,0,0,1,0,0,0,0,0,0


In [1083]:
### Demo data vs Exploitation Type correlation
demo_exptype_calc.corr().iloc[0:25,25:]

Unnamed: 0,isForcedLabour,isSexualExploit,isOtherExploit,isSexAndLabour,isForcedMarriage,isForcedMilitary,isOrganRemoval
yearOfRegistration,-0.071521,0.031877,0.098428,0.018317,0.04661,,
ageBroad_mid,0.378389,-0.357482,-0.034187,0.006109,-0.02158,,
g_Female,-0.700495,0.664337,0.06496,0.018478,0.02924,,
g_Male,0.700495,-0.664337,-0.06496,-0.018478,-0.02924,,
ms_Adult,0.114801,-0.096151,-0.04675,0.006363,0.001426,,
cr_Central_Asia,0.160822,-0.149554,-0.023297,-0.004298,-0.006802,,
cr_East_Africa,0.081292,-0.075885,-0.010998,-0.002029,-0.003211,,
cr_East_Asia,-0.054802,0.058707,-0.012889,-0.002378,-0.003763,,
cr_Eastern_Europe,0.139418,-0.102137,-0.094189,-0.017378,-0.0275,,
cr_Middle_East,0.096259,-0.089857,-0.013023,-0.002403,-0.003802,,
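
The empty `isForcedMilitary` and `isOrganRemoval` columns above are what `corr()` produces for a column with no variation (both sums are 0 in this dataset), since the correlation is undefined when the standard deviation is zero. One possible way to screen such columns out first is sketched below on toy data; the variable names are hypothetical.

```python
import pandas as pd

# Toy frame with one constant column (illustrative values).
toy = pd.DataFrame({
    'g_Male':         [1, 0, 1, 0],
    'isForcedLabour': [1, 0, 1, 1],
    'isOrganRemoval': [0, 0, 0, 0],
})

# A column with a single distinct value has zero variance.
constant_cols = [c for c in toy.columns if toy[c].nunique() <= 1]

# With the constant column included, its correlations are all NaN.
corr_all = toy.corr()

# Dropping it first leaves a table without missing values.
corr_clean = toy.drop(columns=constant_cols).corr()
```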


### Observations - Correlations between Demographics and Types of Exploitation:

#### Output: isForcedLabour:
- Age: 0.38
- Male: 0.7
- North America: -0.23
- Southeast Asia: 0.56
- Africa: 0.21
- Asia: 0.59

#### Output: isSexualExploit:
- Age: -0.36
- Female: 0.66
- North America: 0.24
- Southeast Asia: -0.54
- Asia: -0.57
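
Most of the pairs above involve two 0/1 columns. For such pairs the Pearson correlation returned by `corr()` is the same quantity as the phi coefficient of the 2x2 contingency table, which can make values like 0.7 for Male vs isForcedLabour easier to interpret. The sketch below checks the equivalence on toy values (not the real data).

```python
import math
import pandas as pd

# Toy pair of binary columns (illustrative values).
toy = pd.DataFrame({
    'g_Male':         [1, 1, 1, 0, 0, 0, 0, 1],
    'isForcedLabour': [1, 1, 0, 0, 0, 0, 1, 1],
})

# 2x2 table of counts: rows are g_Male, columns are isForcedLabour.
table = pd.crosstab(toy['g_Male'], toy['isForcedLabour'])
n11, n10 = table.loc[1, 1], table.loc[1, 0]
n01, n00 = table.loc[0, 1], table.loc[0, 0]

# Phi coefficient from the four cell counts.
phi = (n11 * n00 - n10 * n01) / math.sqrt(
    (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))

# Pearson correlation of the same two columns.
pearson = toy['g_Male'].corr(toy['isForcedLabour'])
```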


## E) Type of Labour Data

In [1093]:
# Carve out type of labour data from GD_data_final (isForcedLabour = 1)
labortype_handCalc = GD_data_final[['tolAgriculture','tolAquafarming','tolBegging','tolConstruction','tolDomesticWork',
                                      'tolHospitality','tolIllicitActivities','tolManufacturing','tolMiningOrDrilling',
                                      'tolPeddling','tolTransportation','tolOther','tolNotSpecified']][GD_data_final['isForcedLabour']==1]

labortype_handCalc.head()

Unnamed: 0,tolAgriculture,tolAquafarming,tolBegging,tolConstruction,tolDomesticWork,tolHospitality,tolIllicitActivities,tolManufacturing,tolMiningOrDrilling,tolPeddling,tolTransportation,tolOther,tolNotSpecified
4,0,0,0,0,1,0,0,0,0,0,0,0,0
5,0,0,0,0,1,0,0,0,0,0,0,0,0
6,0,0,0,0,1,0,0,0,0,0,0,0,0
7,0,0,0,0,1,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,0,0,1


In [1094]:
len(labortype_handCalc)

6750

In [1095]:
labortype_handCalc.describe()

Unnamed: 0,tolAgriculture,tolAquafarming,tolBegging,tolConstruction,tolDomesticWork,tolHospitality,tolIllicitActivities,tolManufacturing,tolMiningOrDrilling,tolPeddling,tolTransportation,tolOther,tolNotSpecified
count,6750.0,6750.0,6750.0,6750.0,6750.0,6750.0,6750.0,6750.0,6750.0,6750.0,6750.0,6750.0,6750.0
mean,0.031407,0.013778,0.02237,0.154519,0.055556,0.001926,0.0,0.044296,0.0,0.003704,0.0,0.028444,0.645185
std,0.174429,0.116576,0.147896,0.361472,0.229078,0.043846,0.0,0.205768,0.0,0.06075,0.0,0.166251,0.478493
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0


In [1096]:
labortype_handCalc.sum()

tolAgriculture           212
tolAquafarming            93
tolBegging               151
tolConstruction         1043
tolDomesticWork          375
tolHospitality            13
tolIllicitActivities       0
tolManufacturing         299
tolMiningOrDrilling        0
tolPeddling               25
tolTransportation          0
tolOther                 192
tolNotSpecified         4355
dtype: int64

In [1102]:
### Join Demo data with Type of Labor Data
demo_handCalc_labor = (GD_data_final[GD_data_final['isForcedLabour']==1][demo_cols_keep]
                       .join(dummy_gender.loc[:, 'g_Female':])
                       .join(dummy_majorityStatus.loc[:, 'ms_Adult'])
                       .join(dummy_citizenshipRegion)
                       .join(dummy_citizenshipGeoCategory))
len(demo_handCalc_labor)
demo_labortype_calc = demo_handCalc_labor.join(labortype_handCalc)
demo_labortype_calc.head()
#len(demo_labortype_calc)

Unnamed: 0,yearOfRegistration,ageBroad_mid,g_Female,g_Male,ms_Adult,cr_Central_Asia,cr_East_Africa,cr_East_Asia,cr_Eastern_Europe,cr_Middle_East,...,tolConstruction,tolDomesticWork,tolHospitality,tolIllicitActivities,tolManufacturing,tolMiningOrDrilling,tolPeddling,tolTransportation,tolOther,tolNotSpecified
4,2012,13.0,1,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
5,2012,13.0,0,1,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
6,2012,13.0,1,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
7,2012,13.0,1,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
8,2012,13.0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
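
The join above works even though the dummy frames still cover all records, because `join` is a left join on the index: only the rows present in the forced-labour subset are kept, which is why the output index starts at 4. A minimal sketch of that alignment on toy data, with hypothetical names:

```python
import pandas as pd

# Toy records; values are illustrative only.
toy = pd.DataFrame({
    'ageBroad_mid':    [19, 13, 30, 25],
    'isForcedLabour':  [0, 1, 1, 0],
    'tolConstruction': [0, 0, 1, 0],
    'tolDomesticWork': [0, 1, 0, 0],
})

# One mask reused for both subsets keeps their indexes identical.
labour_mask = toy['isForcedLabour'] == 1
demo_part = toy.loc[labour_mask, ['ageBroad_mid']]
labour_part = toy.loc[labour_mask, ['tolConstruction', 'tolDomesticWork']]

# Full-length dummy frame, built from all four records.
full_dummies = pd.get_dummies(
    pd.Series(['Female', 'Male', 'Male', 'Female']), prefix='g', dtype=int)

# Left joins align on the index, so only the subset rows (1 and 2) survive.
joined = demo_part.join(full_dummies).join(labour_part)
```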


In [1106]:
## Demo data vs Type of Labor correlation
demo_labortype_calc.corr().iloc[0:25,25:]

Unnamed: 0,tolAgriculture,tolAquafarming,tolBegging,tolConstruction,tolDomesticWork,tolHospitality,tolIllicitActivities,tolManufacturing,tolMiningOrDrilling,tolPeddling,tolTransportation,tolOther,tolNotSpecified
yearOfRegistration,0.086741,0.09215,0.012997,0.101201,-0.113532,0.016409,,0.147432,,0.04646,,-0.02287,-0.142619
ageBroad_mid,0.046769,0.024838,-0.238946,0.222099,-0.073447,-0.06567,,0.171245,,-0.112003,,-0.078225,-0.10762
g_Female,0.023614,-0.089819,0.026458,-0.315506,0.257403,-0.033381,,0.10843,,-0.046333,,0.045719,0.068613
g_Male,-0.023614,0.089819,-0.026458,0.315506,-0.257403,0.033381,,-0.10843,,0.046333,,-0.045719,-0.068613
ms_Adult,0.080831,0.053056,-0.291406,0.168859,-0.088481,-0.09786,,0.096639,,-0.135829,,-0.052003,-0.03369
cr_Central_Asia,-0.038632,-0.025357,-0.032452,0.184075,-0.052032,-0.009424,,-0.046187,,-0.01308,,-0.036708,-0.049247
cr_East_Africa,-0.018433,-0.012099,0.676724,-0.043762,-0.024828,-0.004497,,-0.022039,,-0.006241,,-0.017516,-0.138039
cr_East_Asia,,,,,,,,,,,,,
cr_Eastern_Europe,0.053489,-0.07626,-0.097599,0.519533,-0.156484,-0.028342,,0.333679,,-0.039338,,-0.071273,-0.397649
cr_Middle_East,-0.021857,-0.014346,0.18264,0.02007,-0.029438,0.361911,,-0.026131,,0.278047,,-0.020768,-0.096369


### Observations - Correlations between Demographics and Types of Labor:
#### Output: tolAgriculture
- North America: 0.33

#### Output: tolBegging
- Age: -0.24
- Adult: -0.29
- East Africa: 0.68
- Africa: 0.3

#### Output: tolConstruction
- Age: 0.22
- Male: 0.32
- Eastern Europe: 0.52
- Southeast Asia: -0.38
- Asia: -0.31
- Europe: 0.52

#### Output: tolDomesticWork
- Female: 0.26
- North America: 0.21
- South Asia: 0.40

#### Output: tolHospitality
- Middle East: 0.36

#### Output: tolManufacturing
- Eastern Europe: 0.33

#### Output: tolPeddling
- Middle East: 0.28
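
The lists above were read off the correlation table by hand. A minimal sketch of doing the same programmatically follows. The helper name `strong_correlations` and the 0.2 cutoff are assumptions made for illustration, not part of the original notebook, and the toy frame reuses only a few values from the table above.

```python
import pandas as pd


def strong_correlations(corr_block, threshold=0.2):
    """List (feature, output, r) entries with |r| >= threshold, strongest first.

    Hypothetical helper; `threshold=0.2` is an assumed cutoff.
    """
    # Flatten the matrix into one row per (feature, output) pair.
    rows = [
        {"feature": feature, "output": output, "r": corr_block.loc[feature, output]}
        for output in corr_block.columns
        for feature in corr_block.index
    ]
    # Drop NaN correlations, then keep only the strong ones.
    long_form = pd.DataFrame(rows).dropna(subset=["r"])
    strong = long_form[long_form["r"].abs() >= threshold]
    # Sort by absolute value so negative correlations rank alongside positive ones.
    order = strong["r"].abs().sort_values(ascending=False).index
    return strong.loc[order].round({"r": 2}).reset_index(drop=True)


# Toy slice using a few values copied from the table above.
corr_block = pd.DataFrame(
    {
        "tolBegging": [-0.238946, 0.676724, float("nan")],
        "tolConstruction": [0.222099, -0.043762, float("nan")],
    },
    index=["ageBroad_mid", "cr_East_Africa", "cr_East_Asia"],
)
print(strong_correlations(corr_block))
```

In the notebook, the same call could presumably be applied to `demo_labortype_calc.corr().iloc[0:25,25:]` directly.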

## F) Type of Sex Exploitation Data

In [1108]:
# Carve out type of Sex Exploitation data from GD_data_final (isSexualExploit = 1)
sexexp_handCalc = GD_data_final[['tosProstitution','tosPornography','tosRemoteInteractiveServices',
                                 'tosPrivateSexualServices','tosNotSpecified']][GD_data_final['isSexualExploit']==1]

sexexp_handCalc.head()

Unnamed: 0,tosProstitution,tosPornography,tosRemoteInteractiveServices,tosPrivateSexualServices,tosNotSpecified
0,0,0,0,0,1
1,0,0,0,0,1
2,0,0,0,0,1
3,0,0,0,0,1
33,0,0,0,0,1


In [1109]:
len(sexexp_handCalc)

11134

In [1110]:
sexexp_handCalc.describe()

Unnamed: 0,tosProstitution,tosPornography,tosRemoteInteractiveServices,tosPrivateSexualServices,tosNotSpecified
count,11134.0,11134.0,11134.0,11134.0,11134.0
mean,0.386474,0.004131,0.0,0.000808,0.608586
std,0.486963,0.064147,0.0,0.028421,0.488089
min,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,1.0
75%,1.0,0.0,0.0,0.0,1.0
max,1.0,1.0,0.0,1.0,1.0


In [1111]:
sexexp_handCalc.sum()

tosProstitution                 4303
tosPornography                    46
tosRemoteInteractiveServices       0
tosPrivateSexualServices           9
tosNotSpecified                 6776
dtype: int64
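
Because every column is a 0/1 flag, the column means reported by `describe()` are the same numbers as these sums divided by the record count. The sketch below recomputes the shares from the counts printed above. The variable names are illustrative and the counts are typed in by hand rather than read from `GD_data_final`.

```python
import pandas as pd

# Counts copied from the sexexp_handCalc.sum() output above.
counts = pd.Series(
    {
        "tosProstitution": 4303,
        "tosPornography": 46,
        "tosRemoteInteractiveServices": 0,
        "tosPrivateSexualServices": 9,
        "tosNotSpecified": 6776,
    }
)
n_records = 11134  # len(sexexp_handCalc)

# Share of records per type, in percent.
shares = (counts / n_records * 100).round(1)
print(shares)

# The counts add up to the record count, which is consistent with
# (but does not prove) each record carrying exactly one type flag.
print(counts.sum() == n_records)
```

About 61% of the sexual exploitation records have no specified type, which limits how much the per-type correlations can say.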

In [1115]:
### Join Demo data with Type of Sex Exploitation Data
# .loc selects dummy columns by label (the deprecated .ix indexer is no longer available in current pandas)
demo_handCalc_sex = (GD_data_final[GD_data_final['isSexualExploit']==1][demo_cols_keep]
                     .join(dummy_gender.loc[:, 'g_Female':])
                     .join(dummy_majorityStatus.loc[:, 'ms_Adult'])
                     .join(dummy_citizenshipRegion)
                     .join(dummy_citizenshipGeoCategory))
#len(demo_handCalc_sex)
demo_sexExptype_calc = demo_handCalc_sex.join(sexexp_handCalc)
demo_sexExptype_calc.head()
#len(demo_sexExptype_calc)

Unnamed: 0,yearOfRegistration,ageBroad_mid,g_Female,g_Male,ms_Adult,cr_Central_Asia,cr_East_Africa,cr_East_Asia,cr_Eastern_Europe,cr_Middle_East,...,cg_Europe,cg_Middle_East,cg_North_America,cg_South_America,cg_Unknown,tosProstitution,tosPornography,tosRemoteInteractiveServices,tosPrivateSexualServices,tosNotSpecified
0,2010,19.0,1,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,2004,19.0,1,0,1,0,0,0,1,0,...,1,0,0,0,0,0,0,0,0,1
2,2010,19.0,1,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,2010,19.0,1,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
33,2002,34.0,1,0,1,0,0,0,1,0,...,1,0,0,0,0,0,0,0,0,1


In [1116]:
## Demo data vs Type of Sex Exploitation correlation
demo_sexExptype_calc.corr().iloc[0:25,25:]

Unnamed: 0,tosProstitution,tosPornography,tosRemoteInteractiveServices,tosPrivateSexualServices,tosNotSpecified
yearOfRegistration,0.429826,0.038087,,0.016692,-0.434813
ageBroad_mid,-0.093338,-0.047674,,0.028488,0.097729
g_Female,0.003481,0.006224,,0.002748,-0.004451
g_Male,-0.003481,-0.006224,,-0.002748,0.004451
ms_Adult,-0.223658,-0.06931,,0.017057,0.231258
cr_Central_Asia,-0.027136,-0.002202,,-0.000972,0.027419
cr_East_Africa,,,,,
cr_East_Asia,0.117502,-0.006007,,-0.002653,-0.116288
cr_Eastern_Europe,-0.376584,-0.030561,,-0.013496,0.380518
cr_Middle_East,,,,,
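
The empty cells in the table above are NaN values. A correlation is undefined when one of the two columns has zero variance, which is the case for `tosRemoteInteractiveServices` (sum 0, std 0.0) and most likely for region dummies such as `cr_East_Africa` that never equal 1 in this subset. The sketch below reproduces the effect on a toy frame and shows one way to drop constant columns before calling `corr()`. The helper `drop_constant_columns` and the toy values are assumptions for illustration.

```python
import pandas as pd


def drop_constant_columns(df):
    """Return df without columns that hold a single distinct value (hypothetical helper)."""
    varying = df.nunique(dropna=False) > 1
    return df.loc[:, varying]


# Toy data: the last column is all zeros, like tosRemoteInteractiveServices.
toy = pd.DataFrame(
    {
        "yearOfRegistration": [2010, 2004, 2010, 2002],
        "tosProstitution": [0, 1, 0, 1],
        "tosRemoteInteractiveServices": [0, 0, 0, 0],
    }
)

# Every correlation involving the constant column is NaN.
print(toy.corr()["tosRemoteInteractiveServices"].isna().all())

# After dropping constant columns, the correlation matrix has no NaN entries.
cleaned = drop_constant_columns(toy)
print(cleaned.corr())
```

Dropping columns shifts column positions, so a positional slice such as `.iloc[0:25,25:]` would need to be replaced by label-based selection if this step were added to the notebook.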


### Observations - Correlations between Demographics and Types of Sex Exploitation:

#### Output: tosProstitution
- Year: 0.43
- Eastern Europe: -0.38
- Europe: -0.43
    

In [1118]:
### I know I'm missing visualizations here. Will include in Project 4.