### Question
Do the victim's age, gender, relationship to the trafficker, and citizenship predict the purpose for which the victim was trafficked (the type of exploitation)?

#### Model
Given a set of input features, we want to predict an outcome. Our base data set contains a known outcome: the victim's purpose (the type of exploitation), which is recorded for most cases. Can a supervised machine learning model predict the type of exploitation from those features?

#### Output
The model should predict one of the five exploit categories derived from our data and ETL process. Unknown values (-99 in the original data) are treated as missing.

### Table: Exploit Types Categorized
| Original Data | Data Code |
| ----- | ----- |
| Sexual exploitation | 1 |
| Forced labour | 2 |
| Other/Multiple | 3 |
| Slavery and similar practices | 4 |
| Forced marriage | 5 |
| -99 (Unknown) | NaN |
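
The mapping in the table can be sketched in pandas as follows. This is a minimal illustration rather than the project's actual ETL code: the frame `raw` and the column name `exploit_raw` are hypothetical, and only the label-to-code pairs come from the table.

```python
import pandas as pd

# Codes taken from the table above; "-99" marks an unknown value in the raw data.
EXPLOIT_CODES = {
    "Sexual exploitation": 1,
    "Forced labour": 2,
    "Other/Multiple": 3,
    "Slavery and similar practices": 4,
    "Forced marriage": 5,
}

# Hypothetical raw column, for illustration only.
raw = pd.DataFrame(
    {"exploit_raw": ["Forced labour", "-99", "Sexual exploitation", "Forced marriage"]}
)

# Labels that are not in the dictionary (including "-99") become NaN.
raw["ExploitType"] = raw["exploit_raw"].map(EXPLOIT_CODES)

# Rows with an unknown target cannot be used for supervised training, so drop them.
labelled = raw.dropna(subset=["ExploitType"]).astype({"ExploitType": int})
print(labelled)
```

Dropping the rows with an unknown target is one possible choice; whether the original ETL dropped or kept them is not stated here.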

In [27]:
# Import Dependencies

import os
import pandas as pd
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


In [28]:
# Get our cleaned data from the ETL process

data_df = pd.read_csv(os.path.join("../../Exports", "eda5.csv"))
data_df.head()

Unnamed: 0,yearOfRegistration,Datasource,gender,citizenship,isForcedLabour,ControlCategory,RecruiterCategory,ExploitType,Labor_Type,ageCategories
0,2012,1,1,27,1,2,2,2,5,7
1,2012,1,1,27,1,1,2,2,5,7
2,2012,1,1,27,1,2,2,2,5,7
3,2012,1,1,27,1,1,2,2,5,7
4,2012,1,1,27,1,1,2,2,5,7


In [29]:
# Further reduce the number of features, keeping only those relevant to the question asked.
model_df = data_df.drop(["yearOfRegistration", "Datasource", "isForcedLabour", "ControlCategory"], axis=1)
model_df.head()

Unnamed: 0,gender,citizenship,RecruiterCategory,ExploitType,Labor_Type,ageCategories
0,1,27,2,2,5,7
1,1,27,2,2,5,7
2,1,27,2,2,5,7
3,1,27,2,2,5,7
4,1,27,2,2,5,7


In [30]:
# Scale the input (disabled: scaling is done inside the pipeline further down)
# scaler = StandardScaler()

# Fit the scaler
# model_scaled = scaler.fit_transform(model_df)
# model_scaled[:10]

In [31]:
# Place scaled data back into a dataframe (disabled along with the scaling cell above it)
# scaled_model_df = pd.DataFrame(model_scaled, columns=['Gender', 'Citizenship', 'RecruiterCategory', 'ExploitType', 'LaborType', 'AgeCategories'])
# scaled_model_df.head()

In [32]:
# Create our train and test splits
y = model_df['ExploitType']
X = model_df.drop(['ExploitType'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)
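
What `stratify=y` does can be seen in a small sketch on synthetic labels. The data here is made up for illustration (`X_demo` and `y_demo` are not the trafficking data); the point is that, with no `test_size` given, `train_test_split` holds out 25% of the rows and stratification keeps the class shares roughly equal across the splits.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (illustrative only).
y_demo = [1] * 600 + [2] * 350 + [5] * 50
X_demo = [[i] for i in range(len(y_demo))]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

# With the default test_size, a quarter of the rows go to the test set.
print(len(y_tr), len(y_te))

# Class shares in each split should closely match those of the full data.
for name, labels in [("all", y_demo), ("train", y_tr), ("test", y_te)]:
    counts = Counter(labels)
    print(name, {k: round(v / len(labels), 3) for k, v in sorted(counts.items())})
```

Stratifying matters most for rare classes, which an unstratified split could leave out of the test set entirely.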

In [33]:
# Create a pipeline and scale the input
trafficking_model = make_pipeline(StandardScaler(), SGDClassifier(max_iter=500, tol=1e-3))
trafficking_model.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('sgdclassifier', SGDClassifier(max_iter=500))])
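
The features in this notebook are integer category codes (for example, `citizenship` is 27 in the rows shown), and `StandardScaler` treats them as ordinary numbers. One alternative worth trying is to one-hot encode them instead. The sketch below is an assumption about how that could look, not the pipeline used in this notebook. The tiny `demo` frame is made up, reuses a subset of the notebook's column names, and `onehot_model` is a hypothetical name.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up frame using some of the notebook's column names; values are illustrative.
demo = pd.DataFrame({
    "gender": [0, 1, 0, 1, 0, 1, 0, 1],
    "citizenship": [27, 27, 3, 3, 12, 12, 27, 3],
    "RecruiterCategory": [2, 2, 1, 1, 2, 1, 2, 1],
    "ageCategories": [7, 7, 3, 3, 5, 5, 7, 3],
    "ExploitType": [2, 2, 1, 1, 2, 1, 2, 1],
})
categorical = ["gender", "citizenship", "RecruiterCategory", "ageCategories"]

# One binary column per category value; unseen values at predict time are ignored.
encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)]
)

onehot_model = make_pipeline(
    encoder, SGDClassifier(max_iter=500, tol=1e-3, random_state=1)
)
onehot_model.fit(demo[categorical], demo["ExploitType"])
preds = onehot_model.predict(demo[categorical])
print(preds)
```

Whether this improves on the scaled codes would need to be checked on the real data with the same train and test split.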

In [34]:
# Compare our model's predictions to our known outcomes
y_pred = trafficking_model.predict(X_test)
results = pd.DataFrame({"Prediction": y_pred, "Actual": y_test}).reset_index(drop=True)
results

Unnamed: 0,Prediction,Actual
0,1,1
1,1,1
2,1,1
3,2,2
4,1,1
...,...,...
1482,2,2
1483,2,2
1484,2,2
1485,2,2


In [35]:
# Check accuracy
accuracy_score(y_test, y_pred)

0.9757901815736382
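
An accuracy figure is easier to judge against a simple baseline. The sketch below computes the accuracy of always predicting the most frequent class, using only the test-set class counts from the `support` column of the classification report at the end of this notebook. The variable names are illustrative.

```python
from collections import Counter

# Test-set class counts, read from the "support" column of the classification report.
support = {1: 934, 2: 527, 3: 7, 5: 19}
y_true = [label for label, n in support.items() for _ in range(n)]

# A baseline that always predicts the most frequent class.
majority_class, majority_count = Counter(y_true).most_common(1)[0]
baseline_accuracy = majority_count / len(y_true)

print(majority_class, round(baseline_accuracy, 3))
```

Always predicting class 1 would be right for roughly 63% of the test rows, so the model's accuracy of about 0.976 is well above this baseline.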

In [36]:
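# Check performance
# Note: a notebook cell only displays its last expression, so this confusion
# matrix is computed but not shown; wrap the call in print() to display it.
confusion_matrix(y_test, y_pred)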

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.97      1.00      0.99       934
           2       0.99      0.95      0.97       527
           3       1.00      1.00      1.00         7
           5       0.75      0.63      0.69        19

    accuracy                           0.98      1487
   macro avg       0.93      0.89      0.91      1487
weighted avg       0.98      0.98      0.98      1487
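
The two average rows in the report can be reproduced by hand, which shows why they differ. The macro average gives every class equal weight, so the weaker scores for the 19 class-5 rows pull it down. The weighted average weights each class by its support, so classes 1 and 2 dominate. The sketch below uses the rounded per-class values printed above, so its results may differ from the report in the last digit.

```python
# Per-class values copied from the classification report:
# (precision, recall, f1-score, support).
report = {
    1: (0.97, 1.00, 0.99, 934),
    2: (0.99, 0.95, 0.97, 527),
    3: (1.00, 1.00, 1.00, 7),
    5: (0.75, 0.63, 0.69, 19),
}

total = sum(row[3] for row in report.values())


def macro(idx):
    """Unweighted mean over classes: every class counts equally."""
    return sum(row[idx] for row in report.values()) / len(report)


def weighted(idx):
    """Mean over classes weighted by support (number of true rows per class)."""
    return sum(row[idx] * row[3] for row in report.values()) / total


for name, idx in [("precision", 0), ("recall", 1), ("f1-score", 2)]:
    print(f"{name:9s} macro={macro(idx):.3f} weighted={weighted(idx):.3f}")
```

Class 4 (slavery and similar practices) does not appear in the report, which means it occurs in neither `y_test` nor `y_pred`. The model is therefore evaluated on four of the five categories.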

