The 2012 US Army Anthropometric Survey (ANSUR II) was conducted by the Natick Soldier Research, Development and Engineering Center (NSRDEC) from October 2010 to April 2012. Its participants represent the total US Army force, including the US Army Active Duty, Reserves, and National Guard. In addition to the anthropometric and demographic data described below, the ANSUR II database also contains 3D whole-body, foot, and head scans of Soldier participants. These 3D data are not publicly available out of respect for the privacy of ANSUR II participants. The data from this survey are used for a wide range of equipment design, sizing, and tariffing applications within the military and have many potential commercial, industrial, and academic applications.

The ANSUR II working databases contain 93 anthropometric measurements which were directly measured, and 15 demographic/administrative variables explained below. The ANSUR II Male working database contains a total sample of 4,082 subjects. The ANSUR II Female working database contains a total sample of 1,986 subjects.


Data dictionary:
https://data.world/datamil/ansur-ii-data-dictionary/workspace/file?filename=ANSUR+II+Databases+Overview.pdf


Hint for the metric: our mission is to classify soldiers' races from their body measurements. We want a balanced score for our predictions.
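To illustrate why a balanced score matters, here is a minimal sketch on made-up labels (not ANSUR II data). It compares plain accuracy with balanced accuracy, that is, the mean of the per-class recalls. `balanced_accuracy` is a hypothetical helper written only for this illustration.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (hypothetical helper, toy illustration)."""
    hits = defaultdict(int)    # correct predictions per true class
    totals = defaultdict(int)  # number of samples per true class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Made-up labels: 8 samples of class 1, 2 samples of class 3.
y_true = [1] * 8 + [3] * 2
# A model that always predicts the majority class.
y_pred = [1] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
print(accuracy, balanced)  # 0.8 0.5
```

The always-majority model looks acceptable on accuracy but scores only 0.5 once every class is weighted equally.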

# Import libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats

import cufflinks as cf
import plotly.offline

cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)



pd.set_option("display.max_columns", None)
import warnings
warnings.filterwarnings('ignore')

# Ingest the data from links below and make a dataframe
- Soldiers Male : https://query.data.world/s/h3pbhckz5ck4rc7qmt2wlknlnn7esr
- Soldiers Female : https://query.data.world/s/sq27zz4hawg32yfxksqwijxmpwmynq

In [2]:
female = pd.read_csv("ANSUR II FEMALE Public.csv")
male = pd.read_csv("ANSUR II MALE Public.csv")

In [3]:
female = female.rename(columns={"SubjectId": "subjectid"})

In [4]:
df = pd.concat([female, male], axis=0).reset_index(drop=True)
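The rename in the previous cell matters because `pd.concat` aligns columns by name. A minimal sketch with made-up toy frames (the values are invented; only the two spellings of the ID column are taken from the notebook):

```python
import pandas as pd

# Toy frames (made-up values) mimicking the differing ID column names.
female = pd.DataFrame({"SubjectId": [1, 2], "stature": [1600, 1650]})
male = pd.DataFrame({"subjectid": [3], "stature": [1750]})

# Without the rename, the two spellings become separate, half-empty columns.
misaligned = pd.concat([female, male], axis=0).reset_index(drop=True)

# With the rename, the ID columns line up and no NaNs are introduced.
aligned = pd.concat(
    [female.rename(columns={"SubjectId": "subjectid"}), male], axis=0
).reset_index(drop=True)

print(misaligned.isnull().sum().sum())  # NaNs created by the mismatch
print(aligned.isnull().sum().sum())     # no NaNs after the rename
```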

In [5]:
df.isnull().sum().sort_values(ascending=False)

Ethnicity                         4647
subjectid                            0
radialestylionlength                 0
thighcircumference                   0
tenthribheight                       0
                                  ... 
earprotrusion                        0
earlength                            0
earbreadth                           0
crotchlengthposterioromphalion       0
WritingPreference                    0
Length: 108, dtype: int64

In [6]:
df.shape

(6068, 108)

# EDA
Tips :
- Drop unnecessary columns
- Drop any DODRace class whose value count is below 500 (we assume the model cannot learn a class with fewer than 500 samples)
- Find the unusual values in Weightlbs

In [7]:
drop_cols = ["subjectid", "Date"]

In [8]:
df = df.drop(columns=drop_cols)

In [9]:
drop_values = df.DODRace.value_counts()[df.DODRace.value_counts()<500].keys().to_list()
drop_values

[4, 6, 5, 8]

In [10]:
df = df[~df["DODRace"].isin(drop_values)].reset_index(drop=True)

In [11]:
df[df["SubjectNumericRace"]==df["DODRace"]].shape

(5101, 106)

In [12]:
df.shape

(5769, 106)

- In 5,101 of the 5,769 rows, SubjectNumericRace is identical to DODRace. This feature would leak the target into our model, so it is better to drop it.
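The size of the overlap can be expressed as an agreement rate. The sketch below only redoes the arithmetic from the two shape outputs above; on the notebook's DataFrame the same figure would presumably come from `(df["SubjectNumericRace"] == df["DODRace"]).mean()`.

```python
# Counts copied from the two shape outputs above.
matching_rows = 5101
total_rows = 5769

# Share of rows where the candidate feature equals the target.
agreement = matching_rows / total_rows
print(f"{agreement:.1%}")  # 88.4%
```

A feature that reproduces the label in roughly 88% of rows would dominate any model trained on it.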

In [13]:
df = df.drop(columns="SubjectNumericRace")

In [14]:
cat_cols = df.select_dtypes("object").columns.to_list()
num_cols = df.drop(columns="DODRace").select_dtypes("number").columns.to_list()

### Categorical Columns

In [15]:
df[cat_cols].nunique()

Gender                     2
Installation              12
Component                  3
Branch                     3
PrimaryMOS               281
SubjectsBirthLocation    136
Ethnicity                157
WritingPreference          3
dtype: int64

In [16]:
df[cat_cols].sample(5)

| | Gender | Installation | Component | Branch | PrimaryMOS | SubjectsBirthLocation | Ethnicity | WritingPreference |
|---|---|---|---|---|---|---|---|---|
| 2545 | Male | Fort Bliss | Regular Army | Combat Service Support | 91W | Pennsylvania | NaN | Right hand |
| 1359 | Female | Fort Gordon | Regular Army | Combat Support | 25U | Georgia | NaN | Right hand |
| 3364 | Male | Fort Drum | Regular Army | Combat Service Support | 88M | New York | NaN | Right hand |
| 5114 | Male | Camp Shelby | Army National Guard | Combat Service Support | 25S | New York | NaN | Left hand |
| 3563 | Male | Fort Drum | Regular Army | Combat Arms | 13B | Louisiana | Arab or Middle Eastern | Right hand |


In [17]:
df = df.drop(columns=["PrimaryMOS", "SubjectsBirthLocation", "Ethnicity"])

In [18]:
df.Weightlbs.iplot(kind="box", boxpoints="outliers")

In [19]:
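# weightkg is stored in tenths of a kilogram; convert to pounds (1 kg ≈ 2.2046 lb) before filling Weightlbs
df.loc[df["Weightlbs"]==0, "Weightlbs"] = round(df.loc[df["Weightlbs"]==0, "weightkg"] / 10 * 2.2046)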

In [20]:
((df.Weightlbs * 0.454) / (df.weightkg / 10)).mean()

0.9966750419689239

In [21]:
((df.Heightin * 2.54) / (df.stature / 10)).mean()

1.0111900427511724
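The two ratios above are sanity checks on units. The divisions by 10 imply that `weightkg` is stored in tenths of a kilogram and `stature` in millimetres. A worked example on one made-up person (all four values are hypothetical) shows why a ratio near 1 means the reported and measured columns agree:

```python
LB_TO_KG = 0.45359237  # exact definition of the avoirdupois pound
IN_TO_CM = 2.54        # exact definition of the inch

# Made-up measured values in the units the notebook's divisions imply.
weightkg = 815   # tenths of a kilogram, i.e. 81.5 kg
stature = 1780   # millimetres, i.e. 178.0 cm

# Hypothetical reported values for the same person.
weight_lbs = 180
height_in = 70

# Convert the reported values to metric and compare with the measurements.
weight_ratio = (weight_lbs * LB_TO_KG) / (weightkg / 10)
height_ratio = (height_in * IN_TO_CM) / (stature / 10)
print(round(weight_ratio, 3), round(height_ratio, 3))
```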

# DATA Preprocessing

In [22]:
X = df.drop(columns="DODRace")
y = df.DODRace

# Model Implementation
- You can use a pipeline (optional)
- You can research over/undersampling methods and, after selecting the best model, check whether they yield better scores. (https://imbalanced-learn.org/stable/introduction.html)
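As a rough idea of what oversampling does, here is a naive random-oversampling sketch in plain NumPy on made-up data. `random_oversample` is a hypothetical helper, not an imbalanced-learn function; the library linked above offers more careful implementations.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Keep every original row and add random duplicates (with replacement)
    until each class matches the largest one. Hypothetical helper."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts = []
    for c, n in zip(classes, counts):
        members = np.flatnonzero(y == c)
        extra = rng.choice(members, size=target - n, replace=True)
        parts.append(np.concatenate([members, extra]))
    idx = np.concatenate(parts)
    return X[idx], y[idx]

# Made-up imbalanced data: 6 / 3 / 1 samples per class.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([1] * 6 + [2] * 3 + [3] * 1)

X_res, y_res = random_oversample(X_toy, y_toy)
print(np.unique(y_res, return_counts=True))
```

Resampling should only ever be applied to the training portion; duplicating rows before the split would let copies of the same soldier appear in both train and test data.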

In [39]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier


from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.preprocessing import StandardScaler


from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
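The split above does not pass `stratify`, so class proportions in the test set can drift from those in the full data. A minimal sketch on made-up labels showing how `stratify=y` preserves the ratio (an optional variation, not what the notebook ran):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 samples of class 1, 20 of class 2.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 80 + [2] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify the 80/20 class ratio is kept in both parts.
print(np.bincount(y_tr)[1:], np.bincount(y_te)[1:])
```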

In [25]:
models = []

models.append(("LR", LogisticRegression(max_iter=10000000)))
models.append(("SVC", SVC()))
models.append(("RF", RandomForestClassifier()))
models.append(("ADA", AdaBoostClassifier()))
models.append(("GB", GradientBoostingClassifier()))
models.append(("XGB", XGBClassifier(verbosity=0)))  # `silent` is deprecated; verbosity=0 already mutes output

In [26]:
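# Note: remainder defaults to "drop", so this transformer keeps only the
# one-hot encoded categorical columns and discards the body measurements.
trans = ColumnTransformer([("ohe", OneHotEncoder(handle_unknown="ignore"), ["Gender", "Installation", 
                                                                            "Component", "Branch", "WritingPreference"])])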

In [27]:
scores = []
names = []


for name, model in models:
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1845)
    pipe = Pipeline([("preprocessing", trans), ("model", model)])
    cv_results = cross_val_score(pipe, X_train, y_train, cv=kfold, scoring="accuracy")
    
    
    scores.append(cv_results)
    names.append(name)
    
    print(f"{name}: {cv_results.mean()}  ({cv_results.std()})")

LR: 0.6559023767266717  (0.007771930857505315)
SVC: 0.6587190466800011  (0.00892758395775248)
RF: 0.6533016874665465  (0.010382286583307886)
ADA: 0.656119296466368  (0.007146077541413673)
GB: 0.6559023767266717  (0.012071285819637342)
XGB: 0.6559023767266717  (0.01117978925295043)
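All six models land at about 0.65 here. One likely reason is that `ColumnTransformer` drops every column it is not told about unless `remainder="passthrough"` is set, so the models above may only be seeing the categorical columns. A minimal sketch on a made-up frame showing the difference in output width:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame (made-up values): one categorical and two numeric columns.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "stature": [1750, 1620, 1800, 1650],
    "weightkg": [800, 600, 850, 620],
})

ohe = ("ohe", OneHotEncoder(handle_unknown="ignore"), ["Gender"])

# Default remainder="drop": only the one-hot columns survive.
dropped = ColumnTransformer([ohe]).fit_transform(toy)

# remainder="passthrough": the numeric columns are appended unchanged.
kept = ColumnTransformer([ohe], remainder="passthrough").fit_transform(toy)

print(dropped.shape, kept.shape)  # (4, 2) (4, 4)
```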


In [28]:
pd.DataFrame(scores, index=names, columns=[i for i in range(1,11)]).T

| Fold | LR | SVC | RF | ADA | GB | XGB |
|---|---|---|---|---|---|---|
| 1 | 0.664502 | 0.67316 | 0.65368 | 0.664502 | 0.666667 | 0.67316 |
| 2 | 0.666667 | 0.670996 | 0.677489 | 0.666667 | 0.677489 | 0.677489 |
| 3 | 0.647186 | 0.647186 | 0.645022 | 0.647186 | 0.634199 | 0.642857 |
| 4 | 0.651515 | 0.65368 | 0.645022 | 0.651515 | 0.655844 | 0.642857 |
| 5 | 0.660173 | 0.660173 | 0.658009 | 0.660173 | 0.655844 | 0.65368 |
| 6 | 0.659436 | 0.657267 | 0.652928 | 0.657267 | 0.663774 | 0.655098 |
| 7 | 0.663774 | 0.665944 | 0.64859 | 0.663774 | 0.661605 | 0.64859 |
| 8 | 0.644252 | 0.659436 | 0.663774 | 0.64859 | 0.64859 | 0.661605 |
| 9 | 0.655098 | 0.655098 | 0.639913 | 0.655098 | 0.655098 | 0.655098 |
| 10 | 0.646421 | 0.644252 | 0.64859 | 0.646421 | 0.639913 | 0.64859 |


In [29]:
pd.DataFrame(scores, index=names, columns=[i for i in range(1,11)]).T.iplot(kind="box", boxpoints="all")

# Standard Scaler

In [30]:
ss_cols = X_train.drop(columns=["Gender", "Installation", "Component", "Branch", "WritingPreference"]).columns.to_list()

In [55]:
df.Branch.value_counts()

Combat Service Support    3021
Combat Arms               1508
Combat Support            1240
Name: Branch, dtype: int64

In [31]:
trans = ColumnTransformer([("ohe", OneHotEncoder(handle_unknown="ignore"),
                            ["Gender", "Installation", "Component", "Branch", "WritingPreference"]),
                           ("scaler", StandardScaler(), ss_cols)])
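`StandardScaler` replaces each numeric column with its z-score, which puts measurements recorded on very different scales onto a comparable footing for scale-sensitive models such as logistic regression and SVC. A worked example on three made-up stature values:

```python
import numpy as np

# Made-up stature values in millimetres.
stature = np.array([1600.0, 1700.0, 1800.0])

# StandardScaler centres on the mean and divides by the
# population standard deviation (ddof=0).
z = (stature - stature.mean()) / stature.std(ddof=0)
print(z.round(3))
```

After the transform the column has mean 0 and standard deviation 1, whatever its original unit.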

In [60]:
from sklearn.metrics import classification_report, roc_curve, auc  # auc is used by plot_multiclass_roc

In [61]:
def plot_multiclass_roc(clf, X_test, y_test, n_classes, figsize=(5,5)):
    y_score = clf.decision_function(X_test)

    # structures
    fpr = dict()
    tpr = dict()
    roc_auc = dict()

    # calculate dummies once
    y_test_dummies = pd.get_dummies(y_test, drop_first=False).values
    for i in range(n_classes):
        fpr[i], tpr[i], _ = roc_curve(y_test_dummies[:, i], y_score[:, i])
        roc_auc[i] = auc(fpr[i], tpr[i])

    # roc for each class
    fig, ax = plt.subplots(figsize=figsize)
    ax.plot([0, 1], [0, 1], 'k--')
    ax.set_xlim([0.0, 1.0])
    ax.set_ylim([0.0, 1.05])
    ax.set_xlabel('False Positive Rate')
    ax.set_ylabel('True Positive Rate')
    ax.set_title('Receiver operating characteristic example')
    for i in range(n_classes):
        ax.plot(fpr[i], tpr[i], label='ROC curve (area = %0.2f) for label %i' % (roc_auc[i], i))
    ax.legend(loc="best")
    ax.grid(alpha=.4)
    sns.despine()
    plt.show()
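`plot_multiclass_roc` calls `decision_function`, which some of the models in the list (for example `RandomForestClassifier`) do not provide. One possible workaround is a small fallback to `predict_proba`. `get_class_scores` and the two stand-in classes below are hypothetical and exist only to make the sketch runnable without fitting anything.

```python
def get_class_scores(clf, X):
    """Return per-class scores, preferring decision_function and falling
    back to predict_proba. Hypothetical helper, not part of the notebook."""
    if hasattr(clf, "decision_function"):
        return clf.decision_function(X)
    if hasattr(clf, "predict_proba"):
        return clf.predict_proba(X)
    raise AttributeError(f"{type(clf).__name__} exposes no score method")


# Minimal stand-ins for fitted estimators, for illustration only.
class MarginModel:
    def decision_function(self, X):
        return [[0.2, -0.1, -0.3] for _ in X]

class ProbaModel:
    def predict_proba(self, X):
        return [[0.7, 0.2, 0.1] for _ in X]

rows = [[0.0], [1.0]]
print(get_class_scores(MarginModel(), rows))
print(get_class_scores(ProbaModel(), rows))
```

Inside the plotting function, the first line could then read `y_score = get_class_scores(clf, X_test)`.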

In [63]:
scores = []
names = []


for name, model in models:
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1845)
    pipe = Pipeline([("preprocessing", trans), ("model", model)])
    cv_results = cross_val_score(pipe, X_train, y_train, cv=kfold, scoring="accuracy")
    
    
    scores.append(cv_results)
    names.append(name)
    
    print(f"{name}: {cv_results.mean()}  ({cv_results.std()})")
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    print(classification_report(y_test, y_pred))
    
    print()

LR: 0.8715013475317163  (0.016678566698318544)
              precision    recall  f1-score   support

           1       0.91      0.96      0.94       798
           2       0.91      0.91      0.91       231
           3       0.65      0.43      0.52       125

    accuracy                           0.89      1154
   macro avg       0.82      0.77      0.79      1154
weighted avg       0.88      0.89      0.88      1154


SVC: 0.8613173883238959  (0.011639350479632899)
              precision    recall  f1-score   support

           1       0.89      0.98      0.93       798
           2       0.90      0.91      0.91       231
           3       0.78      0.23      0.36       125

    accuracy                           0.89      1154
   macro avg       0.86      0.71      0.73      1154
weighted avg       0.88      0.89      0.86      1154


RF: 0.8143030866458197  (0.00872789270576301)
              precision    recall  f1-score   support

           1       0.83      0.98      0

In [33]:
pd.DataFrame(scores, index=names, columns=[i for i in range(1,11)]).T

| Fold | LR | SVC | RF | ADA | GB | XGB |
|---|---|---|---|---|---|---|
| 1 | 0.861472 | 0.861472 | 0.824675 | 0.822511 | 0.839827 | 0.854978 |
| 2 | 0.863636 | 0.84632 | 0.800866 | 0.807359 | 0.82684 | 0.850649 |
| 3 | 0.863636 | 0.87013 | 0.816017 | 0.800866 | 0.84632 | 0.844156 |
| 4 | 0.904762 | 0.878788 | 0.809524 | 0.82684 | 0.848485 | 0.867965 |
| 5 | 0.885281 | 0.87013 | 0.82684 | 0.824675 | 0.863636 | 0.872294 |
| 6 | 0.872017 | 0.867679 | 0.81128 | 0.813449 | 0.843818 | 0.845987 |
| 7 | 0.863341 | 0.845987 | 0.804772 | 0.824295 | 0.83731 | 0.845987 |
| 8 | 0.859002 | 0.845987 | 0.819957 | 0.806941 | 0.832972 | 0.848156 |
| 9 | 0.848156 | 0.872017 | 0.824295 | 0.817787 | 0.850325 | 0.863341 |
| 10 | 0.893709 | 0.854664 | 0.822126 | 0.804772 | 0.856833 | 0.850325 |
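The comparison above ranks the models by accuracy. Given the hint about wanting a balanced score, it may be worth collecting class-balanced metrics in the same cross-validation run. A minimal sketch with `cross_validate` on synthetic data (the dataset and model settings here are stand-ins, not the notebook's):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced 3-class data standing in for the real features.
X_toy, y_toy = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3,
    weights=[0.7, 0.2, 0.1], random_state=0,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(
    LogisticRegression(max_iter=1000), X_toy, y_toy, cv=cv,
    scoring=["accuracy", "f1_macro", "balanced_accuracy"],
)

# One array of per-fold scores per metric, keyed as "test_<metric>".
for metric in ("accuracy", "f1_macro", "balanced_accuracy"):
    print(metric, results[f"test_{metric}"].mean().round(3))
```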


In [34]:
pd.DataFrame(scores, index=names, columns=[i for i in range(1,11)]).T.iplot(kind="box", boxpoints="all")

# Choose the best model based on the metric you choose and make a random prediction

In [36]:
model = LogisticRegression(max_iter=10000000)

pipe = Pipeline([("preprocessing", trans), ("model", model)])

pipe.fit(X_train, y_train)

Pipeline(steps=[('preprocessing',
                 ColumnTransformer(transformers=[('ohe',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['Gender', 'Installation',
                                                   'Component', 'Branch',
                                                   'WritingPreference']),
                                                 ('scaler', StandardScaler(),
                                                  ['abdominalextensiondepthsitting',
                                                   'acromialheight',
                                                   'acromionradialelength',
                                                   'anklecircumference',
                                                   'axillaheight',
                                                   'balloffootcircumference',
                                                   'ballof...
      

In [41]:
y_pred = pipe.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.91      0.96      0.94       798
           2       0.91      0.91      0.91       231
           3       0.65      0.43      0.52       125

    accuracy                           0.89      1154
   macro avg       0.82      0.77      0.79      1154
weighted avg       0.88      0.89      0.88      1154
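The gap between the macro and weighted averages in the report comes from how the classes are weighted. The sketch below redoes the arithmetic for recall, using only the per-class values printed above:

```python
# Per-class values copied from the classification report above.
recall = {1: 0.96, 2: 0.91, 3: 0.43}
support = {1: 798, 2: 231, 3: 125}

# Macro average: every class counts equally.
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: classes count in proportion to their support.
weighted_recall = (
    sum(recall[c] * support[c] for c in recall) / sum(support.values())
)

print(round(macro_recall, 2), round(weighted_recall, 2))  # 0.77 0.89
```

The weak recall on class 3 pulls the macro average down, while the weighted average is dominated by class 1 and stays close to overall accuracy.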



---
---