## Question
If the victim of human trafficking is being forced to perform labor, can we predict the type of labor given the inputs available from our dataset?

### Data
The base dataset includes features for:

* Year
* Data Source
* Gender
* Citizenship
* Victim Forced to Labor
* Type of Control Exerted Upon Victim
* Recruiter Category
* Exploitation Type
* Labor Type
* Age Categories

*Not all of these features will be used in the model we create to predict the answers to our question.*

### Model
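Given a set of inputs, we want to predict an outcome. Our base dataset contains a known outcome: the type of labor forced upon the victim, stored in the `Labor_Type` feature. Because the outcome is known and has several possible values, this is a supervised, multi-class classification problem. Can a supervised machine learning model using multi-class logistic regression predict the specific type of labor being forced upon the victim? Note that the `SGDClassifier` used below defaults to hinge loss, which fits a linear SVM; it only performs logistic regression when its `loss` parameter is set to the logistic loss.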

### Output
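We need the model to predict one forced labor category per victim. Our data and ETL process define 12 forced labor categories; 9 of them appear in the forced-labor subset below, and 8 remain after rows with unknown citizenship are dropped.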

#### Table: Forced Labor Types Categorized

In [1]:
# Import Dependencies

import os
import pandas as pd
import numpy as np

from sqlalchemy import create_engine
from config import conn

from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [2]:
data_df = pd.read_sql_table("Trafficking_Cleaned", conn)
data_df

Unnamed: 0,yearOfRegistration,Datasource,gender,citizenship,isForcedLabour,ControlCategory,RecruiterCategory,ExploitType,Labor_Type,ageCategories
0,2012,Case Management,Female,LK,1,Threats,Other,Forced Labor,Domestic Work,Age 30-38
1,2012,Case Management,Female,LK,1,Financial,Other,Forced Labor,Domestic Work,Age 30-38
2,2012,Case Management,Female,LK,1,Threats,Other,Forced Labor,Domestic Work,Age 30-38
3,2012,Case Management,Female,LK,1,Financial,Other,Forced Labor,Domestic Work,Age 30-38
4,2012,Case Management,Female,LK,1,Financial,Other,Forced Labor,Domestic Work,Age 30-38
...,...,...,...,...,...,...,...,...,...,...
14294,2018,Hotline,Male,US,0,Threats,Family/Relative,Sexual Exploitation,Unknown,Age 9-17
14295,2018,Hotline,Male,US,0,Threats,Family/Relative,Sexual Exploitation,Unknown,Age 9-17
14296,2018,Hotline,Male,US,0,Threats,Family/Relative,Sexual Exploitation,Unknown,Age 9-17
14297,2018,Hotline,Male,US,0,Other,Family/Relative,Sexual Exploitation,Unknown,Age 9-17


In [3]:
# Subset our data to only the instances of forced labor
# .copy() makes the subset independent of data_df, so later in-place edits are safe
forced_labor_df = data_df[data_df['isForcedLabour'] == 1].copy()
forced_labor_df

Unnamed: 0,yearOfRegistration,Datasource,gender,citizenship,isForcedLabour,ControlCategory,RecruiterCategory,ExploitType,Labor_Type,ageCategories
0,2012,Case Management,Female,LK,1,Threats,Other,Forced Labor,Domestic Work,Age 30-38
1,2012,Case Management,Female,LK,1,Financial,Other,Forced Labor,Domestic Work,Age 30-38
2,2012,Case Management,Female,LK,1,Threats,Other,Forced Labor,Domestic Work,Age 30-38
3,2012,Case Management,Female,LK,1,Financial,Other,Forced Labor,Domestic Work,Age 30-38
4,2012,Case Management,Female,LK,1,Financial,Other,Forced Labor,Domestic Work,Age 30-38
...,...,...,...,...,...,...,...,...,...,...
14279,2018,Hotline,Male,0,1,Other,Not Specified,Forced Labor,Peddling,Age 9-17
14280,2018,Hotline,Male,0,1,Other,Not Specified,Forced Labor,Peddling,Age 9-17
14281,2018,Hotline,Male,0,1,Financial,Not Specified,Forced Labor,Peddling,Age 9-17
14282,2018,Hotline,Male,0,1,Financial,Not Specified,Forced Labor,Peddling,Age 9-17
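
The filter above produces a subset of `data_df` that later cells modify in place. A minimal sketch of taking an explicit copy of a filtered frame so that later column assignments act on an independent object; `toy_df` and `subset_df` are made-up names and data for illustration, not the project data:

```python
import pandas as pd

# Hypothetical toy data standing in for data_df
toy_df = pd.DataFrame({
    "isForcedLabour": [1, 0, 1],
    "gender": ["Female", "Male", "Male"],
})

# Filter, then copy so the subset owns its data
subset_df = toy_df[toy_df["isForcedLabour"] == 1].copy()

# Assigning to the copy does not touch the original frame
subset_df["gender"] = subset_df["gender"].map({"Male": 0, "Female": 1})

print(subset_df)
print(toy_df["gender"].tolist())
```

The original frame keeps its string labels, while the copy holds the encoded values.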


In [4]:
# Quick EDA: how many records fall under each labor type?
forced_labor_df['Labor_Type'].value_counts()

Construction     810
Unknown          563
Manufacturing    304
Domestic Work    229
Other            103
Aquafarming       91
Begging           88
Agriculture       69
Peddling          58
Name: Labor_Type, dtype: int64
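
The counts above are uneven, which matters when judging accuracy later. A small sketch that turns them into shares of the forced-labor subset; the numbers are copied from the output above and describe the data before rows with unknown citizenship are dropped:

```python
# Counts copied from the value_counts() output above
labor_counts = {
    "Construction": 810, "Unknown": 563, "Manufacturing": 304,
    "Domestic Work": 229, "Other": 103, "Aquafarming": 91,
    "Begging": 88, "Agriculture": 69, "Peddling": 58,
}

total = sum(labor_counts.values())

# Share of each labor type in the subset
shares = {label: count / total for label, count in labor_counts.items()}

for label, share in shares.items():
    print(f"{label:<15}{share:6.1%}")
```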

In [5]:
# Drop the columns we don't want in the initial model
forced_labor_df.drop(["yearOfRegistration", "ExploitType", "Datasource", "isForcedLabour"], axis=1, inplace=True)
forced_labor_df



Unnamed: 0,gender,citizenship,ControlCategory,RecruiterCategory,Labor_Type,ageCategories
0,Female,LK,Threats,Other,Domestic Work,Age 30-38
1,Female,LK,Financial,Other,Domestic Work,Age 30-38
2,Female,LK,Threats,Other,Domestic Work,Age 30-38
3,Female,LK,Financial,Other,Domestic Work,Age 30-38
4,Female,LK,Financial,Other,Domestic Work,Age 30-38
...,...,...,...,...,...,...
14279,Male,0,Other,Not Specified,Peddling,Age 9-17
14280,Male,0,Other,Not Specified,Peddling,Age 9-17
14281,Male,0,Financial,Not Specified,Peddling,Age 9-17
14282,Male,0,Financial,Not Specified,Peddling,Age 9-17


In [6]:
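# Replace the '0' placeholder in citizenship with NaN so those rows can be dropped
# Assign the result back rather than using inplace=True on a single column
forced_labor_df['citizenship'] = forced_labor_df['citizenship'].replace({'0': np.nan})
forced_labor_df['citizenship'].value_counts()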



UA    1016
BY     222
MM     186
KG     157
HT      89
LK      73
ID      69
UG      66
KH      64
TH      53
PH      38
KE      29
NP      27
UZ      25
MX      19
AF       4
Name: citizenship, dtype: int64

In [7]:
# Drop rows with null values
forced_labor_df.dropna(inplace=True)
forced_labor_df



Unnamed: 0,gender,citizenship,ControlCategory,RecruiterCategory,Labor_Type,ageCategories
0,Female,LK,Threats,Other,Domestic Work,Age 30-38
1,Female,LK,Financial,Other,Domestic Work,Age 30-38
2,Female,LK,Threats,Other,Domestic Work,Age 30-38
3,Female,LK,Financial,Other,Domestic Work,Age 30-38
4,Female,LK,Financial,Other,Domestic Work,Age 30-38
...,...,...,...,...,...,...
10380,Female,MM,Threats,Friend/Acquaintance,Unknown,Age 9-17
10381,Female,MM,Threats,Friend/Acquaintance,Unknown,Age 9-17
10382,Female,MM,Threats,Other,Unknown,Age 9-17
10383,Female,MM,Threats,Friend/Acquaintance,Unknown,Age 9-17
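
The two cells above replace a placeholder value with a null and then drop the affected rows. A minimal sketch of the same idea on made-up data; `toy_df` and `clean_df` are hypothetical, and restricting `dropna` to one column with `subset` is an optional variation, not what the notebook does:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with '0' standing in for an unknown citizenship
toy_df = pd.DataFrame({
    "citizenship": ["LK", "0", "UA", "0"],
    "Labor_Type": ["Domestic Work", "Peddling", "Construction", "Peddling"],
})

# Assign the result back instead of calling replace(..., inplace=True) on a column
toy_df["citizenship"] = toy_df["citizenship"].replace({"0": np.nan})

# Drop only the rows where the cleaned column is missing
clean_df = toy_df.dropna(subset=["citizenship"])
print(clean_df)
```

In this toy frame every `Peddling` row has an unknown citizenship, so the label disappears once those rows are dropped.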


In [8]:
# Gender transformation
gender_transform = {
    "Male": 0, 
    "Female": 1
}

# Encode labeled data so the model can interpret it correctly
forced_labor_df['gender'] = forced_labor_df['gender'].apply(lambda x: gender_transform[x])
forced_labor_df



Unnamed: 0,gender,citizenship,ControlCategory,RecruiterCategory,Labor_Type,ageCategories
0,1,LK,Threats,Other,Domestic Work,Age 30-38
1,1,LK,Financial,Other,Domestic Work,Age 30-38
2,1,LK,Threats,Other,Domestic Work,Age 30-38
3,1,LK,Financial,Other,Domestic Work,Age 30-38
4,1,LK,Financial,Other,Domestic Work,Age 30-38
...,...,...,...,...,...,...
10380,1,MM,Threats,Friend/Acquaintance,Unknown,Age 9-17
10381,1,MM,Threats,Friend/Acquaintance,Unknown,Age 9-17
10382,1,MM,Threats,Other,Unknown,Age 9-17
10383,1,MM,Threats,Friend/Acquaintance,Unknown,Age 9-17


In [9]:
# Country transformation: ISO 3166-1 alpha-2 codes to ISO 3166-1 numeric codes
country_transform = {
    "US": 840, 
    "UA": 804,
    "BY": 112, 
    "MM": 104, 
    "KG": 417, 
    "KH": 116, 
    "NG": 566,
    "HT": 332,  
    "LK": 144, 
    "ID": 360, 
    "UG": 800, 
    "TH": 764, 
    "PH": 608,
    "KE": 404, 
    "NP": 524, 
    "UZ": 860, 
    "CN": 156, 
    "MX": 484, 
    "KR": 410, 
    "AF": 4, 
    "ER": 232
}

# Encode labeled data so the model can interpret it correctly
forced_labor_df['citizenship'] = forced_labor_df['citizenship'].apply(lambda x: country_transform[x])
forced_labor_df



Unnamed: 0,gender,citizenship,ControlCategory,RecruiterCategory,Labor_Type,ageCategories
0,1,144,Threats,Other,Domestic Work,Age 30-38
1,1,144,Financial,Other,Domestic Work,Age 30-38
2,1,144,Threats,Other,Domestic Work,Age 30-38
3,1,144,Financial,Other,Domestic Work,Age 30-38
4,1,144,Financial,Other,Domestic Work,Age 30-38
...,...,...,...,...,...,...
10380,1,104,Threats,Friend/Acquaintance,Unknown,Age 9-17
10381,1,104,Threats,Friend/Acquaintance,Unknown,Age 9-17
10382,1,104,Threats,Other,Unknown,Age 9-17
10383,1,104,Threats,Friend/Acquaintance,Unknown,Age 9-17
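
Integer codes like the ones above give a linear model a numeric order among countries that has no real meaning. As a possible alternative, and not what this notebook does, nominal columns can be one-hot encoded. A minimal sketch with `pd.get_dummies` on made-up rows (`toy_df` and `encoded_df` are hypothetical):

```python
import pandas as pd

# Hypothetical toy rows using the same column names as the notebook
toy_df = pd.DataFrame({
    "citizenship": ["LK", "UA", "MM", "UA"],
    "ControlCategory": ["Threats", "Financial", "Threats", "Other"],
})

# One indicator column per category, so no numeric order is implied
encoded_df = pd.get_dummies(
    toy_df, columns=["citizenship", "ControlCategory"], dtype=int
)
print(encoded_df.columns.tolist())
```

The trade-off is a wider feature matrix: one column per distinct category rather than one column per feature.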


In [10]:
# Transform the ControlCategory
control_transform = {
    "Financial": 1, 
    "Threats": 2, 
    "Survival": 3, 
    "Physical": 4, 
    "Other": 5
}

# Encode data so the model can interpret it correctly
forced_labor_df['ControlCategory'] = forced_labor_df['ControlCategory'].apply(lambda x: control_transform[x])
forced_labor_df



Unnamed: 0,gender,citizenship,ControlCategory,RecruiterCategory,Labor_Type,ageCategories
0,1,144,2,Other,Domestic Work,Age 30-38
1,1,144,1,Other,Domestic Work,Age 30-38
2,1,144,2,Other,Domestic Work,Age 30-38
3,1,144,1,Other,Domestic Work,Age 30-38
4,1,144,1,Other,Domestic Work,Age 30-38
...,...,...,...,...,...,...
10380,1,104,2,Friend/Acquaintance,Unknown,Age 9-17
10381,1,104,2,Friend/Acquaintance,Unknown,Age 9-17
10382,1,104,2,Other,Unknown,Age 9-17
10383,1,104,2,Friend/Acquaintance,Unknown,Age 9-17


In [12]:
# Transform the Recruiter Category
recruiter_transform = {
    "Not Specified": 1, 
    "Other": 2, 
    "Friend/Acquaintance": 3, 
    "Family/Relative": 4, 
    "Intimate Partner": 5
}

# Encode data so the model can interpret it correctly
forced_labor_df['RecruiterCategory'] = forced_labor_df['RecruiterCategory'].apply(lambda x: recruiter_transform[x])
forced_labor_df



Unnamed: 0,gender,citizenship,ControlCategory,RecruiterCategory,Labor_Type,ageCategories
0,1,144,2,2,Domestic Work,Age 30-38
1,1,144,1,2,Domestic Work,Age 30-38
2,1,144,2,2,Domestic Work,Age 30-38
3,1,144,1,2,Domestic Work,Age 30-38
4,1,144,1,2,Domestic Work,Age 30-38
...,...,...,...,...,...,...
10380,1,104,2,3,Unknown,Age 9-17
10381,1,104,2,3,Unknown,Age 9-17
10382,1,104,2,2,Unknown,Age 9-17
10383,1,104,2,3,Unknown,Age 9-17


In [13]:
# Transform the Labor Type
labor_transform = {
    "Domestic Work": 1, 
    "Other": 2, 
    "Unknown": 3, 
    "Agriculture": 4, 
    "Manufacturing": 5, 
    "Construction": 6, 
    "Begging": 7, 
    "Aquafarming": 8
}

# Encode data so the model can interpret it correctly
forced_labor_df['Labor_Type'] = forced_labor_df['Labor_Type'].apply(lambda x: labor_transform[x])
forced_labor_df



Unnamed: 0,gender,citizenship,ControlCategory,RecruiterCategory,Labor_Type,ageCategories
0,1,144,2,2,1,Age 30-38
1,1,144,1,2,1,Age 30-38
2,1,144,2,2,1,Age 30-38
3,1,144,1,2,1,Age 30-38
4,1,144,1,2,1,Age 30-38
...,...,...,...,...,...,...
10380,1,104,2,3,3,Age 9-17
10381,1,104,2,3,3,Age 9-17
10382,1,104,2,2,3,Age 9-17
10383,1,104,2,3,3,Age 9-17


In [15]:
# Transform Age Categories
age_transform = {
    "Age 0-8": 1, 
    "Age 9-17": 2, 
    "Age 18-20": 3, 
    "Age 21-23": 4, 
    "Age 24-26": 5, 
    "Age 27-29": 6, 
    "Age 30-38": 7, 
    "Age 39-47": 8, 
    "Age 48+": 9
}

# Encode data so the model can interpret it correctly
forced_labor_df['ageCategories'] = forced_labor_df['ageCategories'].apply(lambda x: age_transform[x])
forced_labor_df



Unnamed: 0,gender,citizenship,ControlCategory,RecruiterCategory,Labor_Type,ageCategories
0,1,144,2,2,1,7
1,1,144,1,2,1,7
2,1,144,2,2,1,7
3,1,144,1,2,1,7
4,1,144,1,2,1,7
...,...,...,...,...,...,...
10380,1,104,2,3,3,2
10381,1,104,2,3,3,2
10382,1,104,2,2,3,2
10383,1,104,2,3,3,2


In [16]:
# Separate the features (X) from the outcome we want to predict (y)
y = forced_labor_df['Labor_Type']
X = forced_labor_df.drop(['Labor_Type'], axis=1)

# Split the data into train and test sets (default test_size=0.25),
# stratified so each labor type keeps its share in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)
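
The split above passes `stratify=y`, which keeps class proportions the same in the train and test sets. A minimal sketch on made-up, imbalanced labels (`X_toy` and `y_toy` are hypothetical) showing the effect:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 of class "a", 20 of class "b"
X_toy = [[i] for i in range(100)]
y_toy = ["a"] * 80 + ["b"] * 20

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=1, stratify=y_toy
)

# Both splits keep the 80/20 ratio of the full data
print(Counter(y_tr), Counter(y_te))
```

Without `stratify`, a rare class can end up with very few or no rows in the test set, which makes its per-class metrics unreliable.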

In [17]:
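# Create a pipeline that standardizes the inputs, then fits a linear classifier with SGD
# Note: SGDClassifier is left at its default loss ("hinge")
labor_model = make_pipeline(StandardScaler(), SGDClassifier(max_iter=500, tol=1e-3))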
labor_model.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('sgdclassifier', SGDClassifier(max_iter=500))])

In [18]:
# Compare our model's predictions to our known outcomes
y_pred = labor_model.predict(X_test)

# Create a DataFrame if we need to take a deeper dive
# results = pd.DataFrame({"Prediction": y_pred, "Actual": y_test}).reset_index(drop=True)
# results

In [19]:
# Check accuracy
accuracy_score(y_test, y_pred)

0.6747663551401869

In [20]:
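# Check performance
# Note: the confusion matrix is computed but not shown, because a cell only echoes its last expression
confusion_matrix(y_test, y_pred)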

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.68      0.96      0.80        57
           2       0.00      0.00      0.00         1
           3       0.86      0.35      0.49       141
           4       0.00      0.00      0.00        17
           5       0.71      0.54      0.61        76
           6       0.63      0.97      0.76       203
           7       1.00      1.00      1.00        17
           8       0.33      0.09      0.14        23

    accuracy                           0.67       535
   macro avg       0.53      0.49      0.48       535
weighted avg       0.68      0.67      0.63       535
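
The summary rows of the report can be reproduced from the per-class rows. The sketch below copies precision, recall, and support from the report above, labels each class using `labor_transform`, and recomputes the macro and weighted recall. It also computes the accuracy of always predicting the most frequent test class as a point of comparison; that baseline is an addition for context, not something the notebook computes:

```python
# Rows copied from the classification report above: (precision, recall, support)
# Class names follow labor_transform (1 = Domestic Work, ..., 8 = Aquafarming)
report = {
    "Domestic Work": (0.68, 0.96, 57),
    "Other":         (0.00, 0.00, 1),
    "Unknown":       (0.86, 0.35, 141),
    "Agriculture":   (0.00, 0.00, 17),
    "Manufacturing": (0.71, 0.54, 76),
    "Construction":  (0.63, 0.97, 203),
    "Begging":       (1.00, 1.00, 17),
    "Aquafarming":   (0.33, 0.09, 23),
}

total = sum(s for _, _, s in report.values())

# Macro average: every class counts equally
macro_recall = sum(r for _, r, _ in report.values()) / len(report)

# Weighted average: classes count in proportion to their support
weighted_recall = sum(r * s for _, r, s in report.values()) / total

# Baseline: always predict the most frequent class in the test set
majority_baseline = max(s for _, _, s in report.values()) / total

print(f"support total:     {total}")
print(f"macro recall:      {macro_recall:.2f}")
print(f"weighted recall:   {weighted_recall:.2f}")
print(f"majority baseline: {majority_baseline:.2f}")
```

Support-weighted recall is the same quantity as overall accuracy, which is why both round to 0.67. The gap between the macro average (0.49) and the weighted average reflects the classes the model never predicts correctly, such as Agriculture.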



