### Question
If the victim of human trafficking is being forced to perform labor, can we predict the type of labor given the inputs available from our dataset?

#### Data
The base dataset includes features for:
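* Year (`yearOfRegistration`)
* Data Source (`Datasource`)
* Gender (`gender`)
* Citizenship (`citizenship`)
* Victim Forced to Labor (`isForcedLabour`)
* Type of Control Exerted Upon Victim (`ControlCategory`)
* Recruiter Category (`RecruiterCategory`)
* Exploit Type (`ExploitType`)
* Labor Type (`Labor_Type`)
* Age Categories (`ageCategories`)

Not all of these features are used in the model we create to answer our question.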

#### Model
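Given a set of input features, we want to predict an outcome. Our base set contains a known outcome: the type of labor forced upon the victim, stored in the `Labor_Type` feature. Can we use a supervised machine learning model with multi-class logistic regression to predict the specific type of labor being forced upon the victim? Note that the `SGDClassifier` used below keeps its default hinge loss, which fits a linear SVM rather than a logistic regression; both are linear multi-class classifiers trained with stochastic gradient descent.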
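For reference, a plain multi-class logistic regression can be sketched as below. This is a minimal illustration on synthetic data, not the notebook's model: it swaps in scikit-learn's `LogisticRegression` in place of the `SGDClassifier` used in the notebook (which, with its default hinge loss, behaves as a linear SVM), and the `make_classification` settings are arbitrary assumptions chosen only so the snippet runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 300 rows, 5 features, 3 classes (all assumed values).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Scale the inputs, then fit a multi-class logistic regression.
logreg_demo = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg_demo.fit(X_demo, y_demo)

# Logistic regression exposes class probabilities for each sample.
proba_demo = logreg_demo.predict_proba(X_demo[:5])
print(proba_demo.shape)         # one row per sample, one column per class
print(proba_demo.sum(axis=1))   # each row sums to 1
```

Having class probabilities available is the main practical difference from a hinge-loss model, which only returns decision scores.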

#### Output
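We need the model to predict one of 12 forced labor categories derived from our data and ETL process. The 13 source columns in the table below map to 12 codes because `typeOfLabourOther` and `typeOfLabourNotSpecified` share code 12.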

#### Table: Forced Labor Types Categorized
| Original Data | Data Code |
| ----- | ----- |
| typeOfLabourAgriculture | 1 |
| typeOfLabourAquafarming | 2 |
| typeOfLabourBegging | 3 |
| typeOfLabourConstruction | 4 |
| typeOfLabourDomesticWork | 5 |
| typeOfLabourHospitality | 6 |
| typeOfLabourIllicitActivities | 7 |
| typeOfLabourManufacturing | 8 |
| typeOfLabourMiningOrDrilling | 9 |
| typeOfLabourPeddling | 10 |
| typeOfLabourTransportation | 11 |
| typeOfLabourOther | 12 |
| typeOfLabourNotSpecified | 12 |
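
The table can be read as a lookup from source column to code. The ETL step itself is not shown here, so the sketch below is only one plausible way to collapse the source flags into a single code. `LABOR_CODES` and `labor_code` are hypothetical names, and the assumption that each record carries 0/1 flags under these column names is ours.

```python
# Lookup taken from the table above; "Other" and "NotSpecified" share code 12.
LABOR_CODES = {
    "typeOfLabourAgriculture": 1,
    "typeOfLabourAquafarming": 2,
    "typeOfLabourBegging": 3,
    "typeOfLabourConstruction": 4,
    "typeOfLabourDomesticWork": 5,
    "typeOfLabourHospitality": 6,
    "typeOfLabourIllicitActivities": 7,
    "typeOfLabourManufacturing": 8,
    "typeOfLabourMiningOrDrilling": 9,
    "typeOfLabourPeddling": 10,
    "typeOfLabourTransportation": 11,
    "typeOfLabourOther": 12,
    "typeOfLabourNotSpecified": 12,
}


def labor_code(record):
    """Return the code of the first labor-type flag set to 1, or None if none is set.

    `record` is assumed to be a dict of 0/1 flags keyed by source column name.
    """
    for column, code in LABOR_CODES.items():
        if record.get(column) == 1:
            return code
    return None


print(labor_code({"typeOfLabourConstruction": 1}))  # 4
print(len(set(LABOR_CODES.values())))               # 12 distinct codes
```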

In [7]:
# Import Dependencies

import os
import pandas as pd
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [8]:
# Import our data

data_df = pd.read_csv(os.path.join("../../Exports", "eda5.csv"))
data_df.head()

Unnamed: 0,yearOfRegistration,Datasource,gender,citizenship,isForcedLabour,ControlCategory,RecruiterCategory,ExploitType,Labor_Type,ageCategories
0,2012,1,1,27,1,2,2,2,5,7
1,2012,1,1,27,1,1,2,2,5,7
2,2012,1,1,27,1,2,2,2,5,7
3,2012,1,1,27,1,1,2,2,5,7
4,2012,1,1,27,1,1,2,2,5,7


In [9]:
# Subset our data looking at only instances of forced labor

forced_labor_df = data_df[data_df['isForcedLabour'] == 1]
forced_labor_df.head()

Unnamed: 0,yearOfRegistration,Datasource,gender,citizenship,isForcedLabour,ControlCategory,RecruiterCategory,ExploitType,Labor_Type,ageCategories
0,2012,1,1,27,1,2,2,2,5,7
1,2012,1,1,27,1,1,2,2,5,7
2,2012,1,1,27,1,2,2,2,5,7
3,2012,1,1,27,1,1,2,2,5,7
4,2012,1,1,27,1,1,2,2,5,7


In [10]:
# EDA: confirm the subset contains only forced-labor records
forced_labor_df['isForcedLabour'].value_counts()

1    2137
Name: isForcedLabour, dtype: int64

In [11]:
# EDA: class distribution of the target (Labor_Type)

forced_labor_df['Labor_Type'].value_counts()

4     810
12    568
8     304
5     229
2      91
1      69
3      66
Name: Labor_Type, dtype: int64
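
The counts above show a pronounced class imbalance, and only 7 of the 12 codes occur in this subset. One way to put later accuracy figures in context is the majority-class baseline: the accuracy obtained by always predicting the most frequent label. The sketch below recomputes it from the counts printed above; the variable names are illustrative.

```python
# Counts copied from the value_counts() output above.
labor_counts = {4: 810, 12: 568, 8: 304, 5: 229, 2: 91, 1: 69, 3: 66}

total = sum(labor_counts.values())
shares = {code: count / total for code, count in labor_counts.items()}

# Always predicting the most frequent class would give this accuracy.
majority_code = max(labor_counts, key=labor_counts.get)
baseline_accuracy = shares[majority_code]

print(total)                        # 2137
print(majority_code)                # 4
print(round(baseline_accuracy, 3))  # 0.379
```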

In [13]:
# Separate our features (X) from our outcomes (y)

y = forced_labor_df['Labor_Type']
# Drop the target, the constant forced-labor flag, and the features not used by the model
X = forced_labor_df.drop(['isForcedLabour', 'yearOfRegistration', 'ExploitType', 'Datasource', 'Labor_Type'], axis=1)

# Split data into train and test sets (default 25% test), stratified by labor type
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)
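
Because `train_test_split` is called without `test_size`, it holds out the default 25% of rows, and `stratify=y` keeps each labor type's share roughly equal in both sets. The sketch below checks this on labels rebuilt from the class counts reported earlier. It splits the labels only (no features), so it illustrates the mechanism rather than reproducing the notebook's exact split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Rebuild a label vector from the class counts reported by value_counts().
labor_counts = {4: 810, 12: 568, 8: 304, 5: 229, 2: 91, 1: 69, 3: 66}
labels = np.repeat(list(labor_counts.keys()), list(labor_counts.values()))

# Same settings as the notebook cell: default test_size, random_state=1, stratified.
train_labels, test_labels = train_test_split(labels, random_state=1, stratify=labels)

print(len(train_labels), len(test_labels))
for code, count in labor_counts.items():
    held_out = int((test_labels == code).sum())
    # The held-out share of every class stays close to 0.25.
    print(code, held_out, round(held_out / count, 3))
```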

In [14]:
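# Create a pipeline that scales the inputs, then fits a linear classifier with SGD
# (SGDClassifier's default loss is hinge, i.e. a linear SVM)
trafficking_model = make_pipeline(StandardScaler(), SGDClassifier(max_iter=500, tol=1e-3))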
trafficking_model.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('sgdclassifier', SGDClassifier(max_iter=500))])
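
Within the pipeline, `StandardScaler` learns each column's mean and standard deviation during `fit` and reuses them whenever the pipeline predicts, so test rows are scaled with training statistics. The toy example below, with made-up numbers, shows that its output matches a manual z-score.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test values for two features.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
test = np.array([[4.0, 30.0]])

scaler = StandardScaler().fit(train)

# Manual z-score using the training mean and (population) standard deviation.
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(scaler.transform(test), manual))  # True
```

The notebook's inputs appear to be integer category codes, so scaling treats them as ordered quantities; that is worth keeping in mind when interpreting the model.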

In [15]:
# Compare our model's predictions to our known outcomes
y_pred = trafficking_model.predict(X_test)
results = pd.DataFrame({"Prediction": y_pred, "Actual": y_test}).reset_index(drop=True)
results

Unnamed: 0,Prediction,Actual
0,12,12
1,1,5
2,4,12
3,4,12
4,12,12
...,...,...
530,12,2
531,12,5
532,12,12
533,12,5
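
A confusion matrix summarizes a table like this by counting each (actual, predicted) pair. As a small illustration, the sketch below builds one from only the nine rows visible in the truncated output above, so its counts do not describe the full test set.

```python
from sklearn.metrics import confusion_matrix

# The nine (Prediction, Actual) pairs visible in the truncated output.
visible_pred = [12, 1, 4, 4, 12, 12, 12, 12, 12]
visible_actual = [12, 5, 12, 12, 12, 2, 5, 12, 5]

labels = [1, 2, 4, 5, 12]
# Rows are actual classes, columns are predicted classes, both in `labels` order.
cm = confusion_matrix(visible_actual, visible_pred, labels=labels)
print(cm)
```

In this small sample, the last row (actual class 12) holds three correct predictions and two rows misclassified as class 4.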


In [16]:
# Check accuracy
accuracy_score(y_test, y_pred)

0.7140186915887851

In [17]:
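# Check performance
# The confusion matrix is computed here but not displayed: a notebook cell only
# echoes its last expression, so wrap this call in print() to see it
confusion_matrix(y_test, y_pred)

print(classification_report(y_test, y_pred))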

              precision    recall  f1-score   support

           1       0.00      0.00      0.00        17
           2       0.00      0.00      0.00        23
           3       1.00      1.00      1.00        17
           4       0.80      1.00      0.89       203
           5       0.00      0.00      0.00        57
           8       0.74      0.59      0.66        76
          12       0.60      0.83      0.70       142

    accuracy                           0.71       535
   macro avg       0.45      0.49      0.46       535
weighted avg       0.60      0.71      0.65       535
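
The summary rows can be recomputed from the per-class rows, which clarifies what each average means: the macro average weights every class equally, the weighted average weights each class by its support, and accuracy equals the support-weighted recall. The sketch below uses the rounded values printed above, so the results agree with the report only to about two decimals.

```python
# (precision, recall, f1, support) per class, copied from the report above.
report_rows = {
    1: (0.00, 0.00, 0.00, 17),
    2: (0.00, 0.00, 0.00, 23),
    3: (1.00, 1.00, 1.00, 17),
    4: (0.80, 1.00, 0.89, 203),
    5: (0.00, 0.00, 0.00, 57),
    8: (0.74, 0.59, 0.66, 76),
    12: (0.60, 0.83, 0.70, 142),
}

rows = list(report_rows.values())
total_support = sum(r[3] for r in rows)

# Macro average: unweighted mean over the classes.
macro_precision = sum(r[0] for r in rows) / len(rows)
macro_recall = sum(r[1] for r in rows) / len(rows)

# Weighted average: mean weighted by each class's support.
weighted_precision = sum(r[0] * r[3] for r in rows) / total_support

# Accuracy is the support-weighted recall.
accuracy_from_recall = sum(r[1] * r[3] for r in rows) / total_support

print(total_support)                   # 535
print(round(macro_precision, 2))       # 0.45
print(round(macro_recall, 2))          # 0.49
print(round(weighted_precision, 2))    # 0.6
print(round(accuracy_from_recall, 3))  # 0.715, vs. 0.714 reported (rounded inputs)
```

The gap between the macro averages (about 0.45 to 0.49) and the accuracy (0.71) reflects that the model does well on the large classes 4 and 12 while missing classes 1, 2 and 5 entirely.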



  _warn_prf(average, modifier, msg_start, len(result))
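
The `_warn_prf` line above appears to be what remains of a scikit-learn warning about undefined metrics: precision is undefined for a class the model never predicts, so the report substitutes 0.0, which is likely how some of the all-zero rows arise. The toy example below reproduces the situation with invented labels and uses the `zero_division` argument of `classification_report` to state the substituted value explicitly.

```python
from sklearn.metrics import classification_report

# Invented labels: class 3 occurs in y_true_demo but is never predicted.
y_true_demo = [1, 1, 2, 2, 3, 3]
y_pred_demo = [1, 1, 2, 2, 1, 2]

# zero_division=0 makes the fallback explicit and silences the warning.
report = classification_report(
    y_true_demo, y_pred_demo, zero_division=0, output_dict=True
)

print(report["3"]["precision"])  # 0.0, since class 3 is never predicted
print(report["3"]["recall"])     # 0.0, since no class-3 sample is recovered
print(report["1"]["precision"])  # 2 correct out of 3 predictions of class 1
```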
