# Dataset | Problem

The 2012 US Army Anthropometric Survey (ANSUR II) was executed by the Natick Soldier Research, Development and Engineering Center (NSRDEC) from October 2010 to April 2012 and is comprised of personnel representing the total US Army force to include the US Army Active Duty, Reserves, and National Guard. In addition to the anthropometric and demographic data described below, the ANSUR II database also consists of 3D whole body, foot, and head scans of Soldier participants. These 3D data are not publicly available out of respect for the privacy of ANSUR II participants. The data from this survey are used for a wide range of equipment design, sizing, and tariffing applications within the military and have many potential commercial, industrial, and academic applications.

The ANSUR II working databases contain 93 anthropometric measurements which were directly measured, and 15 demographic/administrative variables explained below. The ANSUR II Male working database contains a total sample of 4,082 subjects. The ANSUR II Female working database contains a total sample of 1,986 subjects.


Data dictionary:
https://data.world/datamil/ansur-ii-data-dictionary/workspace/file?filename=ANSUR+II+Databases+Overview.pdf

Hint for the metric: Our mission is to classify soldiers' races from their body measurements. We want a balanced score for our predictions.
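
A minimal sketch of what a "balanced" score can mean here, assuming it refers to metrics that weight every class equally, such as balanced accuracy or macro-averaged F1 (the notebook does not name the metric at this point). The labels below are toy values made up for illustration, not taken from the data.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Toy labels (made up): the "White" class dominates.
y_true = ["White"] * 8 + ["Black"] * 2
# A lazy model that always predicts the majority class.
y_pred = ["White"] * 10

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy          : {acc:.3f}")
print(f"balanced accuracy : {bal_acc:.3f}")
print(f"macro F1          : {macro_f1:.3f}")
```

Plain accuracy rewards the majority-class guess, while the two class-averaged scores penalize the class that is never predicted.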

# Ingest the data from links below and make a dataframe
- Soldiers Male : https://query.data.world/s/h3pbhckz5ck4rc7qmt2wlknlnn7esr
- Soldiers Female : https://query.data.world/s/sq27zz4hawg32yfxksqwijxmpwmynq
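
One way to do this, sketched under assumptions: `load_ansur` is a hypothetical helper name, and the in-memory CSV strings stand in for the real files. `pd.read_csv` also accepts URLs, so the two links above could be passed in directly. Whether the real files need an extra argument such as `encoding="latin-1"`, and whether their headers differ in casing, are assumptions to verify against the actual downloads.

```python
import io
import pandas as pd

def load_ansur(male_src, female_src, **read_csv_kwargs):
    """Read both ANSUR II files and stack them into one DataFrame.

    Each source may be a local path, a URL, or a file-like object.
    """
    frames = []
    for src in (male_src, female_src):
        part = pd.read_csv(src, **read_csv_kwargs)
        # Lowercase the headers so differently cased names line up.
        part.columns = part.columns.str.lower()
        frames.append(part)
    return pd.concat(frames, axis=0, ignore_index=True)

# Tiny stand-ins for the two CSV files (values are made up).
male_csv = io.StringIO("subjectid,stature,Gender,DODRace\n1,1776,Male,1\n2,1702,Male,2\n")
female_csv = io.StringIO("SubjectId,stature,Gender,DODRace\n3,1560,Female,1\n")

df = load_ansur(male_csv, female_csv)
print(df.shape)
print(df.columns.tolist())
```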

# EDA
Tips:
- Drop unnecessary columns.
- Drop any DODRace class whose value count is below 500 (we assume the model cannot learn a class from fewer than 500 samples).
- Find the unusual value in Weightlbs.
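
The last two tips can be sketched as follows. The helper names `drop_rare_classes` and `flag_weight_mismatch` are hypothetical, the 40 lb tolerance is an arbitrary assumption, and the toy frame is made up (its last-but-one row carries a deliberately implausible self-reported weight). The check relies on `weightkg` being stored as kg*10, as described in the data dictionary.

```python
import pandas as pd

def drop_rare_classes(df, target="DODRace", min_count=500):
    """Keep only rows whose target class appears at least min_count times."""
    counts = df[target].value_counts()
    keep = counts[counts >= min_count].index
    return df[df[target].isin(keep)].copy()

def flag_weight_mismatch(df, tol_lbs=40):
    """Flag rows where self-reported Weightlbs disagrees with measured weightkg."""
    measured_lbs = df["weightkg"] / 10 * 2.20462  # weightkg is kg*10
    return (df["Weightlbs"] - measured_lbs).abs() > tol_lbs

toy = pd.DataFrame({
    "DODRace":   [1, 1, 1, 2, 2, 3],
    "weightkg":  [815, 726, 929, 794, 946, 700],
    "Weightlbs": [180, 160, 205, 175, 0, 154],
})

kept = drop_rare_classes(toy, min_count=2)  # use min_count=500 on the real data
suspect = flag_weight_mismatch(toy)
print(kept["DODRace"].value_counts())
print(toy[suspect])
```

Comparing the self-reported weight with the measured one is only one possible way to surface an unusual value; a simple `describe()` or sorted view of `Weightlbs` would also reveal extreme entries.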

# Context:

### SUBJECT: 2012 US Army Anthropometric Working Databases

#### Background
    1. This memorandum outlines the contents of the ANSUR II Working Databases and provides a
    brief explanation of each variable contained in the databases. These databases and this
    memorandum have been reviewed and cleared for UNLIMITED PUBLIC RELEASE.
    2. The 2012 US Army Anthropometric Survey (ANSUR II) was executed by the Natick Soldier
    Research, Development and Engineering Center (NSRDEC) from October 2010 to April 2012
    and is comprised of personnel representing the total US Army force to include the US Army
    Active Duty, Reserves, and National Guard. In addition to the anthropometric and demographic
    data described below, the ANSUR II database also consists of 3D whole body, foot, and head
    scans of Soldier participants. These 3D data are not publicly available out of respect for the
    privacy of ANSUR II participants. The data from this survey are used for a wide range of
    equipment design, sizing, and tariffing applications within the military and have many potential
    commercial, industrial, and academic applications.
    3. The ANSUR II working databases contain 93 anthropometric measurements which were
    directly measured, and 15 demographic/administrative variables explained below. The
    ANSUR II Male working database contains a total sample of 4,082 subjects. The ANSUR II
    Female working database contains a total sample of 1,986 subjects. The databases are reported in
    the associated spreadsheet files:
        a. “ANSUR II MALE Public.csv”
        b. “ANSUR II FEMALE Public.csv”. 

#### Data Content
    4. Demographic/Administrative Data: The following variables are included in the ANSUR II
    working databases for each subject and were assigned to or collected from subjects at the time of
    their participation.
        • subjectid – A unique number for each participant measured in the anthropometric survey,
          ranging from 10027 to 920103, not inclusive
        • SubjectsBirthLocation – Subject Birth Location; a U.S. state or foreign country
        • SubjectNumericRace – Subject Numeric Race; a single or multi-digit code
          indicating a subject’s self-reported race or races (verified through interview).
          Where 
                1 = White, 
                2 = Black, 
                3 = Hispanic, 
                4 = Asian, 
                5 = Native American,
                6 = Pacific Islander, 
                8 = Other
        • Ethnicity – self-reported ethnicity (verified through interview); e.g. “Mexican”,
          “Vietnamese”
        • DODRace – Department of Defense Race; a single digit indicating a subject’s
          self-reported preferred single race where selecting multiple races is not an option.
          This variable is intended to be comparable to the Defense Manpower Data Center
          demographic data. 
          Where 
                1 = White, 
                2 = Black, 
                3 = Hispanic, 
                4 = Asian,
                5 = Native American, 
                6 = Pacific Islander, 
                8 = Other
        • Gender – “Male” or “Female”
        • Age – Participant’s age in years
        • Heightin – Height in Inches; self-reported, comparable to measured “stature”
        • Weightlbs – Weight in Pounds; self-reported, comparable to measured “weightkg”
        • WritingPreference – Writing Preference; “Right hand”, “Left hand”, or
          “Either hand (No preference)”
        • Date – Date the participant was measured, ranging from “04-Oct-10” to “05-Apr-12”
        • Installation – U.S. Army installation where the measurement occurred;
          e.g. “Fort Hood”, “Camp Shelby”
        • Component – “Army National Guard”, “Army Reserve”, or “Regular Army”
        • Branch – “Combat Arms”, “Combat Support”, or “Combat Service Support”
        • PrimaryMOS – Primary Military Occupational Specialty
    5. Anthropometric Data: the following variables are included in the ANSUR II working
    databases for each subject and were directly-measured dimensions of the participant’s body. All
    measurements are recorded in millimeters with the exception of the variable “weightkg”.
        • abdominalextensiondepthsitting – Abdominal Extension Depth, Sitting
        • acromialheight – Acromial Height
        • acromionradialelength – Acromion-Radiale Length
        • anklecircumference – Ankle Circumference
        • axillaheight – Axilla Height
        • balloffootcircumference – Ball of Foot Circumference
        • balloffootlength – Ball of Foot Length
        • biacromialbreadth – Biacromial Breadth
        • bicepscircumferenceflexed – Biceps Circumference, Flexed
        • bicristalbreadth – Bicristal Breadth
        • bideltoidbreadth – Bideltoid Breadth
        • bimalleolarbreadth – Bimalleolar Breadth
        • bitragionchinarc – Bitragion Chin Arc
        • bitragionsubmandibulararc – Bitragion Submandibular Arc
        • bizygomaticbreadth – Bizygomatic Breadth
        • buttockcircumference – Buttock Circumference
        • buttockdepth – Buttock Depth
        • buttockheight – Buttock Height
        • buttockkneelength – Buttock-Knee Length
        • buttockpopliteallength – Buttock-Popliteal Length
        • calfcircumference – Calf Circumference
        • cervicaleheight – Cervical Height
        • chestbreadth – Chest Breadth
        • chestcircumference – Chest Circumference
        • chestdepth – Chest Depth
        • chestheight – Chest Height
        • crotchheight – Crotch Height
        • crotchlengthomphalion – Crotch Length (Omphalion)
        • crotchlengthposterioromphalion – Crotch Length, Posterior (Omphalion)
        • earbreadth – Ear Breadth
        • earlength – Ear Length
        • earprotrusion – Ear Protrusion
        • elbowrestheight – Elbow Rest Height
        • eyeheightsitting – Eye Height, Sitting
        • footbreadthhorizontal – Foot Breadth, Horizontal
        • footlength – Foot Length
        • forearmcenterofgriplength – Forearm-Center of Grip Length
        • forearmcircumferenceflexed – Forearm Circumference, Flexed
        • forearmforearmbreadth – Forearm-Forearm Breadth
        • forearmhandlength – Forearm-Hand Length
        • functionalleglength – Functional Leg Length
        • handbreadth – Hand Breadth
        • handcircumference – Hand Circumference
        • handlength – Hand Length
        • headbreadth – Head Breadth
        • headcircumference – Head Circumference
        • headlength – Head Length
        • heelanklecircumference – Heel-Ankle Circumference
        • heelbreadth – Heel Breadth
        • hipbreadth – Hip Breadth
        • hipbreadthsitting – Hip Breadth, Sitting
        • iliocristaleheight – Iliocristale Height
        • interpupillarybreadth – Interpupillary Breadth
        • interscyei – Interscye I
        • interscyeii – Interscye II
        • kneeheightmidpatella – Knee Height, Midpatella
        • kneeheightsitting – Knee Height, Sitting
        • lateralfemoralepicondyleheight – Lateral Femoral Epicondyle Height
        • lateralmalleolusheight – Lateral Malleolus Height
        • lowerthighcircumference – Lower Thigh Circumference
        • mentonsellionlength – Menton-Sellion Length
        • neckcircumference – Neck Circumference
        • neckcircumferencebase – Neck Circumference, Base
        • overheadfingertipreachsitting – Overhead Fingertip Reach, Sitting
        • palmlength – Palm Length
        • poplitealheight – Popliteal Height
        • radialestylionlength – Radiale-Stylion Length
        • shouldercircumference – Shoulder Circumference
        • shoulderelbowlength – Shoulder-Elbow Length
        • shoulderlength – Shoulder Length
        • sittingheight – Sitting Height
        • sleevelengthspinewrist – Sleeve Length: Spine-Wrist
        • sleeveoutseam – Sleeve Outseam
        • span – Span
        • stature – Stature
        • suprasternaleheight – Suprasternale Height
        • tenthribheight – Tenth Rib Height
        • thighcircumference – Thigh Circumference
        • thighclearance – Thigh Clearance
        • thumbtipreach – Thumbtip Reach
        • tibialheight – Tibiale Height
        • tragiontopofhead – Tragion-Top of Head
        • trochanterionheight – Trochanterion Height
        • verticaltrunkcircumferenceusa – Vertical Trunk Circumference (USA)
        • waistbacklength – Waist Back Length (Omphalion)
        • waistbreadth – Waist Breadth
        • waistcircumference – Waist Circumference (Omphalion)
        • waistdepth – Waist Depth
        • waistfrontlengthsitting – Waist Front Length, Sitting
        • waistheightomphalion – Waist Height (Omphalion)
        • weightkg – Weight (in kg*10)
        • wristcircumference – Wrist Circumference
        • wristheight – Wrist Height
        
#### Recommendations:
    6. The ANSUR II working databases are a representative sample of the US Army at the time of
    data collection and may or may not be representative of other populations of interest, to include
    later instances of the US Army. Other US military services maintain anthropometric databases of
    their service members which are distinct from the US Army’s anthropometric databases
    (ANSUR II). The US Army also maintains separate anthropometric databases representing Male
    and Female US Army pilots which are distinct from ANSUR II.
    7. The ANSUR II working databases are presented as two separate databases – one Female, one
    Male. In almost all cases, these databases should be treated and analyzed separately.
    Combination of the databases will result in a sample that is not representative of any real
    population and could easily lead to erroneous conclusions.
    8. Much more information about the data collection methodology and content of the ANSUR II
    Working Databases may be found in the following Technical Reports, available from the
    Defense Technical Information Center (www.dtic.mil) through the hyperlinks provided:
        a. 2010-2012 Anthropometric Survey of U.S. Army Personnel: Methods and Summary
        Statistics. (NATICK/TR-15/007)
        b. Measurer’s Handbook: US Army and Marine Corps Anthropometric Surveys,
        2010-2011 (NATICK/TR-11/017)
    9. The primary POC for the ANSUR II working databases is Joseph L Parham, Research
    Anthropologist, Email: joseph.l.parham2.civ@mail.mil.

    Steven P Paquette
    Anthropometry Team Leader
    Natick RD&E Center, Natick, MA

    Joseph L Parham
    Research Anthropologist
    Natick RD&E Center, Natick, MA
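
The coding conventions described in the memorandum (numeric race codes, millimeter measurements, and `weightkg` stored as kg*10) can be applied as in the sketch below. The toy frame and the derived column names (`DODRace_label`, `stature_cm`, `weight_kg`) are assumptions made for illustration.

```python
import pandas as pd

# Code-to-label mapping copied from the DODRace description in the memorandum.
DODRACE_LABELS = {1: "White", 2: "Black", 3: "Hispanic", 4: "Asian",
                  5: "Native American", 6: "Pacific Islander", 8: "Other"}

toy = pd.DataFrame({
    "DODRace":  [1, 2, 3, 8],
    "stature":  [1776, 1702, 1735, 1655],  # millimeters
    "weightkg": [815, 726, 929, 794],      # kilograms * 10
})

toy["DODRace_label"] = toy["DODRace"].map(DODRACE_LABELS)
toy["stature_cm"] = toy["stature"] / 10   # mm -> cm
toy["weight_kg"] = toy["weightkg"] / 10   # kg*10 -> kg
print(toy)
```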

# Import Libraries

In [7]:
# !pip install pyforest

In [13]:
import lightgbm as lgb

In [9]:
# !pip install catboost

In [15]:
# 1-Import Libraries

import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
%matplotlib inline
import statsmodels.api as sm
import statsmodels.formula.api as smf

from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer

#Model Selection
from sklearn import model_selection
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.model_selection import KFold, cross_val_predict

#Feature Selection
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif, f_regression, mutual_info_regression

#Models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.linear_model import LogisticRegression

from sklearn import neighbors
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB

from sklearn.svm import SVC
from sklearn.svm import SVR

from sklearn import tree
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier

from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestClassifier 
from sklearn.ensemble import ExtraTreesRegressor

from xgboost import XGBClassifier
from xgboost import plot_importance

from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

from sklearn.neural_network import MLPRegressor

#Scaling
from sklearn.preprocessing import scale 
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures 
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PowerTransformer 
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler


#Metrics
from sklearn import metrics
from sklearn.metrics import roc_auc_score, auc, roc_curve, precision_recall_curve
from sklearn.metrics import accuracy_score, recall_score, average_precision_score
from sklearn.metrics import confusion_matrix, classification_report
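try:  # these plotting helpers exist only in scikit-learn < 1.2
    from sklearn.metrics import plot_confusion_matrix
    from sklearn.metrics import plot_roc_curve, plot_precision_recall_curve
except ImportError:  # newer versions provide the Display classes instead
    from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay, PrecisionRecallDisplay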
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score 


#Importing plotly and cufflinks in offline mode
import cufflinks as cf
import plotly.offline
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

#Ignore Warnings
import warnings
warnings.filterwarnings("ignore")
warnings.warn("this will not show")

#Figure&Display options
plt.rcParams["figure.figsize"] = (10,6)
pd.set_option('max_colwidth',200)
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 200)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

## Useful Functions

In [16]:
## Useful Functions

###############################################################################

def first_looking(column):
    print("column name    : ", column) 
    print("--------------------------------")
    print("per_of_nulls   : ", "%", round(df[column].isnull().sum()/df.shape[0]*100, 2))
    print("num_of_nulls   : ", df[column].isnull().sum())
    print("num_of_uniques : ", df[column].nunique())
    print("value_counts : ", df[column].value_counts(dropna = False).head())
    
# for i in df.columns:
#     first_looking(i)

###############################################################################

def missing (df):
    missing_number = df.isnull().sum().sort_values(ascending=False)
    missing_percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
    missing_values = pd.concat([missing_number, missing_percent], axis=1, keys=['Missing_Number', 'Missing_Percent'])
    return missing_values

###############################################################################

def perc_nans(serial):  # Ex:perc_nans(df['kW'])
    # display percentage of nans in a Series
    return serial.isnull().sum()/serial.shape[0]*100

def perc_nans_byLimitless(df):
    return df.isnull().sum()/df.shape[0]*100

def perc_nans_byLimit(df, limit):
    missing = df.isnull().sum()*100/df.shape[0]
    return missing.loc[lambda x : x >= limit]

# perc_nans_byLimit(df, 90)

###############################################################################

def fill_median(df, group_col, col_name):
    '''Fills the missing values with the group median of the relevant column according to single-stage grouping; falls back to the overall median when a group has no valid values'''
    for group in list(df[group_col].unique()):
        cond = df[group_col]==group
        median = df.loc[cond, col_name].median()  # scalar; NaN if the group has no valid values
        if pd.notnull(median):
            df.loc[cond, col_name] = df.loc[cond, col_name].fillna(median)
        else:
            df.loc[cond, col_name] = df.loc[cond, col_name].fillna(df[col_name].median())
    print("Number of NaN : ",df[col_name].isnull().sum())
    print("------------------")
    print(df[col_name].value_counts(dropna=False))
    
###############################################################################

def fill_most(df, group_col, col_name):
    '''Fills the missing values with the most existing value (mode) in the relevant column according to single-stage grouping'''
    for group in list(df[group_col].unique()):
        cond = df[group_col]==group
        mode = list(df[cond][col_name].mode())
        if mode != []:
            df.loc[cond, col_name] = df.loc[cond, col_name].fillna(df[cond][col_name].mode()[0])
        else:
            df.loc[cond, col_name] = df.loc[cond, col_name].fillna(df[col_name].mode()[0])
    print("Number of NaN : ",df[col_name].isnull().sum())
    print("------------------")
    print(df[col_name].value_counts(dropna=False))
    
###############################################################################

def fill_prop(df, group_col, col_name):
    '''Fills the missing values by forward fill, then backward fill, within each group; any values still missing are filled across the whole column'''
    for group in list(df[group_col].unique()):
        cond = df[group_col]==group
        df.loc[cond, col_name] = df.loc[cond, col_name].ffill().bfill()
    df[col_name] = df[col_name].ffill().bfill()
    print("Number of NaN : ",df[col_name].isnull().sum())
    print("------------------")
    print(df[col_name].value_counts(dropna=False))
    
###############################################################################

def fill(df, group_col1, group_col2, col_name, method): # method can be "mode" or "median" or "ffill"
    if method == "mode":
        for group1 in list(df[group_col1].unique()):
            for group2 in list(df[group_col2].unique()):
                cond1 = df[group_col1]==group1
                cond2 = (df[group_col1]==group1) & (df[group_col2]==group2)
                mode1 = list(df[cond1][col_name].mode())
                mode2 = list(df[cond2][col_name].mode())
                if mode2 != []:
                    df.loc[cond2, col_name] = df.loc[cond2, col_name].fillna(df[cond2][col_name].mode()[0])
                elif mode1 != []:
                    df.loc[cond2, col_name] = df.loc[cond2, col_name].fillna(df[cond1][col_name].mode()[0])
                else:
                    df.loc[cond2, col_name] = df.loc[cond2, col_name].fillna(df[col_name].mode()[0])
                
    elif method == "median":
        for group1 in list(df[group_col1].unique()):
            for group2 in list(df[group_col2].unique()):
                cond1 = df[group_col1]==group1
                cond2 = (df[group_col1]==group1) & (df[group_col2]==group2)
                df.loc[cond2, col_name] = df.loc[cond2, col_name].fillna(df[cond2][col_name].median()).fillna(df[cond1][col_name].median()).fillna(df[col_name].median())
                
    elif method == "ffill":           
        # fill within the (group1, group2) cells first, then within group1, then over the whole column
        for group1 in list(df[group_col1].unique()):
            for group2 in list(df[group_col2].unique()):
                cond2 = (df[group_col1]==group1) & (df[group_col2]==group2)
                df.loc[cond2, col_name] = df.loc[cond2, col_name].ffill().bfill()
                
        for group1 in list(df[group_col1].unique()):
            cond1 = df[group_col1]==group1
            df.loc[cond1, col_name] = df.loc[cond1, col_name].ffill().bfill()
           
        df[col_name] = df[col_name].ffill().bfill()
    
    print("Number of NaN : ",df[col_name].isnull().sum())
    print("------------------")
    print(df[col_name].value_counts(dropna=False))
    
###############################################################################

def model_validation(y_train, y_train_pred, y_test, y_test_pred, model_name):
    
    scores =  {f"{model_name}_train": {"R2" : r2_score(y_train, y_train_pred),
    "rmse" : np.sqrt(mean_squared_error(y_train, y_train_pred)),
    "mse" : mean_squared_error(y_train, y_train_pred), 
    "mae" : mean_absolute_error(y_train, y_train_pred)},
    
    f"{model_name}_test": {"R2" : r2_score(y_test, y_test_pred),
    "rmse" : np.sqrt(mean_squared_error(y_test, y_test_pred)),
    "mse" : mean_squared_error(y_test, y_test_pred),
    "mae" : mean_absolute_error(y_test, y_test_pred)}}
     
    return pd.DataFrame(scores)

# lm = model_validation(y_train, y_train_pred, y_test, y_test_pred, 'lm')

# pd.concat([lm, rs, rcvs, lss, lcvs, es, ecvs], axis = 1)

###############################################################################

def get_classification_report(y_test, y_test_pred):
    from sklearn import metrics
    report = metrics.classification_report(y_test, y_test_pred, output_dict=True)
    df_classification_report = pd.DataFrame(report).transpose()
    #df_classification_report = df_classification_report.sort_values(by=['f1-score'], ascending=False)
    return df_classification_report

###############################################################################

def shape_control():
    print('df.shape:', df.shape)
    print('X.shape:', X.shape)
    print('y.shape:', y.shape)
    print('X_train.shape:', X_train.shape)
    print('y_train.shape:', y_train.shape)
    print('X_test.shape:', X_test.shape)
    print('y_test.shape:', y_test.shape)
    try:
        print('y_test_pred.shape:', y_test_pred.shape)
    except NameError:  # predictions have not been computed yet
        print()
        
###############################################################################

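def calc_predict():
    # DODRace is a multiclass target, so recall needs an explicit averaging mode; "macro" weights every class equally
    return accuracy_score(y_test, y_test_pred), recall_score(y_test, y_test_pred, average="macro")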
    
def get_report():
    '''Prints train and test metrics for the global `model`, using the global X_train_scaled, X_test_scaled, y_train and y_test'''
    pd.set_option('display.float_format', lambda x: '%.3f' % x)
    print('Model:', model.get_params(), '\n')
    if hasattr(model, 'best_params_'):
        print('model.best_params_:', model.best_params_, '\n')
    for name, X_part, y_part in [("Train", X_train_scaled, y_train), ("Test", X_test_scaled, y_test)]:
        # predictions and curves are computed per split, so train and test results cannot be mixed up
        y_pred = model.predict(X_part)
        print(f"{name}:")
        print('rmse:', np.sqrt(mean_squared_error(y_part, y_pred)))
        print('accuracy:', accuracy_score(y_part, y_pred))
        # the probability-based scores below only apply to binary targets and to models with predict_proba
        try:
            y_pred_proba = model.predict_proba(X_part)
            print('roc_auc_score:', roc_auc_score(y_part, y_pred_proba[:,1]))
        except (AttributeError, ValueError):
            print()
        try:
            precision, recall, _ = precision_recall_curve(y_part, y_pred_proba[:,1])
            print('roc_auc_recall_precision_score:', auc(recall, precision), '\n')
        except (NameError, ValueError):
            print()
        print('confusion_matrix:\n\n', confusion_matrix(y_part, y_pred), '\n')
        print('classification_report:\n\n', classification_report(y_part, y_pred), '\n')

def train_control_table():
    y_train_pred = model.predict(X_train_scaled)
    y_train_pred = pd.DataFrame(y_train_pred)
    y_train_pred.rename(columns = {0: 'y_train_pred'}, inplace = True)
    return pd.concat([X_train, y_train, y_train_pred.set_index(y_train.index)], axis=1)

def test_control_table():
    y_test_pred = model.predict(X_test_scaled)
    y_test_pred = pd.DataFrame(y_test_pred)
    y_test_pred.rename(columns = {0: 'y_test_pred'}, inplace = True)
    return pd.concat([X_test, y_test, y_test_pred.set_index(y_test.index)], axis=1)

###############################################################################

def feature_importances():
    df_fi = pd.DataFrame(index=X.columns, 
                         data=model.feature_importances_, 
                         columns=["Feature Importance"]).sort_values("Feature Importance")

    return df_fi.sort_values(by="Feature Importance", ascending=False).T

def feature_importances_bar():
    df_fi = pd.DataFrame(index=X.columns, 
                         data=model.feature_importances_, 
                         columns=["Feature Importance"]).sort_values("Feature Importance")
    sns.barplot(data = df_fi, 
                x = df_fi.index, 
                y = 'Feature Importance', 
                order=df_fi.sort_values('Feature Importance', ascending=False).reset_index()['index'])
    plt.xticks(rotation = 90)
    plt.tight_layout()
    plt.show();
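
The grouped fill helpers above loop over the groups explicitly. As an alternative sketch (not the notebook's own method), the same single-stage median fill can be written with `groupby(...).transform("median")`, which broadcasts each group's median back onto the original rows. The toy frame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "weightlbs": [180.0, np.nan, 200.0, 130.0, 150.0, np.nan],
})

# Median of each gender group, aligned with the original row order.
group_median = toy.groupby("gender")["weightlbs"].transform("median")

# Fill with the group median first, then with the overall median as a fallback
# for groups that contain no valid values at all.
overall_median = toy["weightlbs"].median()
toy["weightlbs"] = toy["weightlbs"].fillna(group_median).fillna(overall_median)
print(toy)
```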

In [17]:
def outlier_zscore(df, col, min_z=1, max_z = 5, step = 0.1, print_list = False):
    z_scores = stats.zscore(df[col].dropna())
    threshold_list = []
    for threshold in np.arange(min_z, max_z, step):
        threshold_list.append((threshold, len(np.where(z_scores > threshold)[0])))
    # build the summary table once, after all thresholds have been collected
    df_outlier = pd.DataFrame(threshold_list, columns = ['threshold', 'outlier_count'])
    df_outlier['pct'] = (df_outlier.outlier_count - df_outlier.outlier_count.shift(-1))/df_outlier.outlier_count*100
    plt.plot(df_outlier.threshold, df_outlier.outlier_count)
    best_treshold = round(df_outlier.iloc[df_outlier.pct.argmax(), 0],2)
    outlier_limit = int(df[col].dropna().mean() + (df[col].dropna().std()) * df_outlier.iloc[df_outlier.pct.argmax(), 0])
    percentile_threshold = stats.percentileofscore(df[col].dropna(), outlier_limit)
    plt.vlines(best_treshold, 0, df_outlier.outlier_count.max(), 
               colors="r", ls = ":"
              )
    plt.annotate("Zscore : {}\nValue : {}\nPercentile : {}".format(best_treshold, outlier_limit, 
                                                                   (np.round(percentile_threshold, 3), 
                                                                    np.round(100-percentile_threshold, 3))), 
                 (best_treshold, df_outlier.outlier_count.max()/2))
    #plt.show()
    if print_list:
        print(df_outlier)
    return (plt, df_outlier, best_treshold, outlier_limit, percentile_threshold)

def outlier_inspect(df, col, min_z=1, max_z = 5, step = 0.5, max_hist = None, bins = 50):
    fig = plt.figure(figsize=(20, 6))
    fig.suptitle(col, fontsize=16)
    plt.subplot(1,3,1)
    if max_hist is None:
        sns.histplot(df[col], kde=False, bins = bins)
    else :
        sns.histplot(df[df[col]<=max_hist][col], kde=False, bins = bins)
    plt.subplot(1,3,2)
    sns.boxplot(x=df[col])
    plt.subplot(1,3,3)
    z_score_inspect = outlier_zscore(df, col, min_z=min_z, max_z = max_z, step = step)
    plt.show()
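
A small numeric sketch of the z-score idea behind `outlier_zscore`, on made-up values with two planted outliers. Note that the helper counts only positive z-scores (`z_scores > threshold`), so it looks at the upper tail; the sketch also shows the two-sided variant with `np.abs`. The threshold of 3 and the series name are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# 51 plausible weights (150..200) plus two planted outliers: 0 and 321.
weights = pd.Series(list(range(150, 201)) + [0, 321], name="weightlbs")

z = stats.zscore(weights.to_numpy())

upper_only = int((z > 3).sum())          # what a one-sided check would count
both_tails = int((np.abs(z) > 3).sum())  # also catches implausibly low values
flagged = weights[np.abs(z) > 3]

print("upper tail only :", upper_only)
print("both tails      :", both_tails)
print(flagged.tolist())
```

A one-sided check would miss an implausibly low entry such as a weight of 0, which is why the two-sided count is larger here.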

# Load | Read Data

In [18]:
# 2-Load|Read Data
csv_path = "ANSUR II MALE Public.csv"
df0 = pd.read_csv(csv_path)
df_male = df0.copy() 
# drop_columns = "id"
# df.head()
# df.shape
# df.columns= df.columns.str.lower().str.replace('&', '_').str.replace(' ', '_')
# df.nunique()
# df.info()
# df.shape
# df.isnull().sum()
# missing(df)
# df.drop(drop_columns, axis=1, inplace=True)
# df.shape
# df.describe().T
# df.columns

In [19]:
df_male.head()

Unnamed: 0,subjectid,abdominalextensiondepthsitting,acromialheight,acromionradialelength,anklecircumference,axillaheight,balloffootcircumference,balloffootlength,biacromialbreadth,bicepscircumferenceflexed,bicristalbreadth,bideltoidbreadth,bimalleolarbreadth,bitragionchinarc,bitragionsubmandibulararc,bizygomaticbreadth,buttockcircumference,buttockdepth,buttockheight,buttockkneelength,buttockpopliteallength,calfcircumference,cervicaleheight,chestbreadth,chestcircumference,chestdepth,chestheight,crotchheight,crotchlengthomphalion,crotchlengthposterioromphalion,earbreadth,earlength,earprotrusion,elbowrestheight,eyeheightsitting,footbreadthhorizontal,footlength,forearmcenterofgriplength,forearmcircumferenceflexed,forearmforearmbreadth,forearmhandlength,functionalleglength,handbreadth,handcircumference,handlength,headbreadth,headcircumference,headlength,heelanklecircumference,heelbreadth,hipbreadth,hipbreadthsitting,iliocristaleheight,interpupillarybreadth,interscyei,interscyeii,kneeheightmidpatella,kneeheightsitting,lateralfemoralepicondyleheight,lateralmalleolusheight,lowerthighcircumference,mentonsellionlength,neckcircumference,neckcircumferencebase,overheadfingertipreachsitting,palmlength,poplitealheight,radialestylionlength,shouldercircumference,shoulderelbowlength,shoulderlength,sittingheight,sleevelengthspinewrist,sleeveoutseam,span,stature,suprasternaleheight,tenthribheight,thighcircumference,thighclearance,thumbtipreach,tibialheight,tragiontopofhead,trochanterionheight,verticaltrunkcircumferenceusa,waistbacklength,waistbreadth,waistcircumference,waistdepth,waistfrontlengthsitting,waistheightomphalion,weightkg,wristcircumference,wristheight,Gender,Date,Installation,Component,Branch,PrimaryMOS,SubjectsBirthLocation,SubjectNumericRace,Ethnicity,DODRace,Age,Heightin,Weightlbs,WritingPreference
0,10027,266,1467,337,222,1347,253,202,401,369,274,493,71,319,291,142,979,240,882,619,509,373,1535,291,1074,259,1292,877,607,351,36,71,19,247,802,101,273,349,299,575,477,1136,90,214,193,150,583,206,326,70,332,366,1071,685,422,441,502,560,500,77,391,118,400,436,1447,113,437,273,1151,368,145,928,883,600,1782,1776,1449,1092,610,164,786,491,140,919,1700,501,329,933,240,440,1054,815,175,853,Male,4-Oct-10,Fort Hood,Regular Army,Combat Arms,19D,North Dakota,1,,1,41,71,180,Right hand
1,10032,233,1395,326,220,1293,245,193,394,338,257,479,67,344,320,135,944,232,870,584,468,357,1471,269,1021,253,1244,851,615,376,33,62,18,232,781,98,263,348,289,523,476,1096,86,203,195,146,568,201,334,72,312,356,1046,620,441,447,490,540,488,73,371,131,380,420,1380,118,417,254,1119,353,141,884,868,564,1745,1702,1387,1076,572,169,822,476,120,918,1627,432,316,870,225,371,1054,726,167,815,Male,4-Oct-10,Fort Hood,Regular Army,Combat Support,68W,New York,1,,1,35,68,160,Left hand
2,10033,287,1430,341,230,1327,256,196,427,408,261,544,75,345,330,135,1054,258,901,623,506,412,1501,288,1120,267,1288,854,636,359,40,61,23,237,810,103,270,355,357,575,491,1115,93,220,203,148,573,202,356,70,349,393,1053,665,462,475,496,556,482,72,409,123,403,434,1447,121,431,268,1276,367,167,917,910,604,1867,1735,1438,1105,685,198,807,477,125,918,1678,472,329,964,255,411,1041,929,180,831,Male,4-Oct-10,Fort Hood,Regular Army,Combat Support,68W,New York,2,,2,42,68,205,Left hand
3,10092,234,1347,310,230,1239,262,199,401,359,262,518,73,328,309,143,991,242,821,560,437,395,1423,296,1114,262,1205,769,590,341,39,66,25,272,794,106,267,352,318,593,467,1034,91,217,194,158,576,199,341,68,338,367,986,640,458,461,460,511,452,76,393,106,407,446,1357,118,393,249,1155,330,148,903,848,550,1708,1655,1346,1021,604,180,803,445,127,847,1625,461,315,857,205,399,968,794,176,793,Male,12-Oct-10,Fort Hood,Regular Army,Combat Service Support,88M,Wisconsin,1,,1,31,66,175,Right hand
4,10093,250,1585,372,247,1478,267,224,435,356,263,524,80,340,310,138,1029,275,1080,706,567,425,1684,304,1048,232,1452,1014,682,382,32,56,19,188,814,111,305,399,324,605,550,1279,94,222,218,153,566,197,374,69,332,372,1251,675,481,505,612,666,585,85,458,135,398,430,1572,132,523,302,1231,400,180,919,995,641,2035,1914,1596,1292,672,194,962,584,122,1090,1679,467,303,868,214,379,1245,946,188,954,Male,12-Oct-10,Fort Hood,Regular Army,Combat Service Support,92G,North Carolina,2,,2,21,77,213,Right hand


In [20]:
df_male.shape

(4082, 108)

In [21]:
df_male.columns= df_male.columns.str.lower().str.replace('&', '_').str.replace(' ', '_')

In [22]:
df_male.nunique()

subjectid                         4082
abdominalextensiondepthsitting     206
acromialheight                     348
acromionradialelength              111
anklecircumference                  97
axillaheight                       333
balloffootcircumference             90
balloffootlength                    71
biacromialbreadth                  129
bicepscircumferenceflexed          209
bicristalbreadth                   112
bideltoidbreadth                   198
bimalleolarbreadth                  33
bitragionchinarc                    95
bitragionsubmandibulararc          107
bizygomaticbreadth                  46
buttockcircumference               407
buttockdepth                       153
buttockheight                      281
buttockkneelength                  182
buttockpopliteallength             161
calfcircumference                  179
cervicaleheight                    353
chestbreadth                       114
chestcircumference                 451
chestdepth               

In [23]:
df_male.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4082 entries, 0 to 4081
Columns: 108 entries, subjectid to writingpreference
dtypes: int64(99), object(9)
memory usage: 3.4+ MB


In [24]:
df_male.isnull().sum()

subjectid                            0
abdominalextensiondepthsitting       0
acromialheight                       0
acromionradialelength                0
anklecircumference                   0
axillaheight                         0
balloffootcircumference              0
balloffootlength                     0
biacromialbreadth                    0
bicepscircumferenceflexed            0
bicristalbreadth                     0
bideltoidbreadth                     0
bimalleolarbreadth                   0
bitragionchinarc                     0
bitragionsubmandibulararc            0
bizygomaticbreadth                   0
buttockcircumference                 0
buttockdepth                         0
buttockheight                        0
buttockkneelength                    0
buttockpopliteallength               0
calfcircumference                    0
cervicaleheight                      0
chestbreadth                         0
chestcircumference                   0
chestdepth               

In [25]:
missing(df_male)

Unnamed: 0,Missing_Number,Missing_Percent
ethnicity,3180,0.779
subjectid,0,0.0
radialestylionlength,0,0.0
thighcircumference,0,0.0
tenthribheight,0,0.0
suprasternaleheight,0,0.0
stature,0,0.0
span,0,0.0
sleeveoutseam,0,0.0
sleevelengthspinewrist,0,0.0


In [26]:
df_male.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
subjectid,4082.0,20003.371,6568.435,10027.0,14270.25,17236.5,27315.75,29452.0
abdominalextensiondepthsitting,4082.0,254.651,37.327,163.0,227.0,251.0,279.0,451.0
acromialheight,4082.0,1440.737,63.287,1194.0,1398.0,1439.0,1481.0,1683.0
acromionradialelength,4082.0,335.244,17.483,270.0,324.0,335.0,346.0,393.0
anklecircumference,4082.0,229.344,14.649,156.0,219.25,228.0,239.0,293.0
axillaheight,4082.0,1329.082,59.516,1106.0,1289.0,1328.0,1367.0,1553.0
balloffootcircumference,4082.0,252.017,12.936,186.0,243.0,252.0,261.0,306.0
balloffootlength,4082.0,200.935,10.471,156.0,194.0,201.0,208.0,245.0
biacromialbreadth,4082.0,415.676,19.162,337.0,403.0,415.0,428.0,489.0
bicepscircumferenceflexed,4082.0,358.136,34.618,246.0,335.0,357.0,380.0,490.0


In [27]:
df_male.columns

Index(['subjectid', 'abdominalextensiondepthsitting', 'acromialheight',
       'acromionradialelength', 'anklecircumference', 'axillaheight',
       'balloffootcircumference', 'balloffootlength', 'biacromialbreadth',
       'bicepscircumferenceflexed',
       ...
       'branch', 'primarymos', 'subjectsbirthlocation', 'subjectnumericrace',
       'ethnicity', 'dodrace', 'age', 'heightin', 'weightlbs',
       'writingpreference'],
      dtype='object', length=108)

In [28]:
# 2-Load|Read Data
csv_path = "ANSUR II FEMALE Public.csv"
df1 = pd.read_csv(csv_path)
df_female = df1.copy() 
# drop_columns = "id"
# df.head()
# df.shape
# df.columns= df.columns.str.lower().str.replace('&', '_').str.replace(' ', '_')
# df.nunique()
# df.info()
# df.shape
# df.isnull().sum()
# missing(df)
# df.drop(drop_columns, axis=1, inplace=True)
# df.shape
# df.describe().T
# df.columns

In [29]:
df_female.head()

Unnamed: 0,SubjectId,abdominalextensiondepthsitting,acromialheight,acromionradialelength,anklecircumference,axillaheight,balloffootcircumference,balloffootlength,biacromialbreadth,bicepscircumferenceflexed,bicristalbreadth,bideltoidbreadth,bimalleolarbreadth,bitragionchinarc,bitragionsubmandibulararc,bizygomaticbreadth,buttockcircumference,buttockdepth,buttockheight,buttockkneelength,buttockpopliteallength,calfcircumference,cervicaleheight,chestbreadth,chestcircumference,chestdepth,chestheight,crotchheight,crotchlengthomphalion,crotchlengthposterioromphalion,earbreadth,earlength,earprotrusion,elbowrestheight,eyeheightsitting,footbreadthhorizontal,footlength,forearmcenterofgriplength,forearmcircumferenceflexed,forearmforearmbreadth,forearmhandlength,functionalleglength,handbreadth,handcircumference,handlength,headbreadth,headcircumference,headlength,heelanklecircumference,heelbreadth,hipbreadth,hipbreadthsitting,iliocristaleheight,interpupillarybreadth,interscyei,interscyeii,kneeheightmidpatella,kneeheightsitting,lateralfemoralepicondyleheight,lateralmalleolusheight,lowerthighcircumference,mentonsellionlength,neckcircumference,neckcircumferencebase,overheadfingertipreachsitting,palmlength,poplitealheight,radialestylionlength,shouldercircumference,shoulderelbowlength,shoulderlength,sittingheight,sleevelengthspinewrist,sleeveoutseam,span,stature,suprasternaleheight,tenthribheight,thighcircumference,thighclearance,thumbtipreach,tibialheight,tragiontopofhead,trochanterionheight,verticaltrunkcircumferenceusa,waistbacklength,waistbreadth,waistcircumference,waistdepth,waistfrontlengthsitting,waistheightomphalion,weightkg,wristcircumference,wristheight,Gender,Date,Installation,Component,Branch,PrimaryMOS,SubjectsBirthLocation,SubjectNumericRace,Ethnicity,DODRace,Age,Heightin,Weightlbs,WritingPreference
0,10037,231,1282,301,204,1180,222,177,373,315,263,466,65,338,301,141,1011,223,836,587,476,360,1336,274,922,245,1095,759,557,310,35,65,16,220,713,91,246,316,265,517,432,1028,75,182,184,141,548,191,314,69,345,388,966,645,363,399,435,496,447,55,404,118,335,368,1268,113,362,235,1062,327,148,803,809,513,1647,1560,1280,1013,622,174,736,430,110,844,1488,406,295,850,217,345,942,657,152,756,Female,5-Oct-10,Fort Hood,Regular Army,Combat Support,92Y,Germany,2,,2,26,61,142,Right hand
1,10038,194,1379,320,207,1292,225,178,372,272,250,430,64,294,270,126,893,186,900,583,483,350,1440,261,839,206,1234,835,549,329,32,60,23,208,726,91,249,341,247,468,463,1117,78,187,189,138,535,180,307,60,315,335,1048,595,340,375,483,532,492,69,334,115,302,345,1389,110,426,259,1014,346,142,835,810,575,1751,1665,1372,1107,524,152,771,475,125,901,1470,422,254,708,168,329,1032,534,155,815,Female,5-Oct-10,Fort Hood,Regular Army,Combat Service Support,25U,California,3,Mexican,3,21,64,120,Right hand
2,10042,183,1369,329,233,1271,237,196,397,300,276,450,69,309,270,128,987,204,861,583,466,384,1451,287,874,223,1226,821,643,374,36,65,26,204,790,100,265,343,262,488,469,1060,84,198,195,146,588,207,331,70,356,399,1043,655,345,399,470,530,469,64,401,135,325,369,1414,122,398,258,1049,362,164,904,855,568,1779,1711,1383,1089,577,164,814,458,129,882,1542,419,269,727,159,367,1035,663,162,799,Female,5-Oct-10,Fort Hood,Regular Army,Combat Service Support,35D,Texas,1,,1,23,68,147,Right hand
3,10043,261,1356,306,214,1250,240,188,384,364,276,484,68,340,294,144,1012,253,897,599,471,372,1430,269,1008,285,1170,804,640,351,38,62,22,244,775,97,265,331,309,529,455,1069,80,192,186,153,593,206,332,68,337,402,1029,655,392,435,469,520,478,67,402,118,357,386,1329,115,394,250,1121,333,157,875,815,536,1708,1660,1358,1065,679,187,736,463,125,866,1627,451,302,923,235,371,999,782,173,818,Female,5-Oct-10,Fort Hood,Regular Army,Combat Service Support,25U,District of Columbia,8,Caribbean Islander,2,22,66,175,Right hand
4,10051,309,1303,308,214,1210,217,182,378,320,336,525,67,300,295,135,1281,284,811,607,467,433,1362,305,1089,290,1112,726,686,356,34,65,18,233,732,88,247,339,260,596,447,1039,78,183,187,140,522,181,308,63,448,499,964,635,428,435,440,491,441,63,479,114,340,358,1350,116,345,242,1151,329,156,824,810,559,1702,1572,1292,1030,766,197,766,429,116,800,1698,452,405,1163,300,380,911,886,152,762,Female,5-Oct-10,Fort Hood,Regular Army,Combat Arms,42A,Texas,1,,1,45,63,195,Right hand


In [30]:
df_female.shape

(1986, 108)

In [31]:
df_female.columns= df_female.columns.str.lower().str.replace('&', '_').str.replace(' ', '_')

In [32]:
df_female.nunique()

subjectid                         1986
abdominalextensiondepthsitting     167
acromialheight                     292
acromionradialelength              103
anklecircumference                  92
axillaheight                       278
balloffootcircumference             72
balloffootlength                    59
biacromialbreadth                  108
bicepscircumferenceflexed          166
bicristalbreadth                   130
bideltoidbreadth                   161
bimalleolarbreadth                  27
bitragionchinarc                    82
bitragionsubmandibulararc           90
bizygomaticbreadth                  36
buttockcircumference               355
buttockdepth                       139
buttockheight                      234
buttockkneelength                  182
buttockpopliteallength             163
calfcircumference                  161
cervicaleheight                    303
chestbreadth                       106
chestcircumference                 377
chestdepth               

In [33]:
df_female.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1986 entries, 0 to 1985
Columns: 108 entries, subjectid to writingpreference
dtypes: int64(99), object(9)
memory usage: 1.6+ MB


In [34]:
df_female.isnull().sum()

subjectid                            0
abdominalextensiondepthsitting       0
acromialheight                       0
acromionradialelength                0
anklecircumference                   0
axillaheight                         0
balloffootcircumference              0
balloffootlength                     0
biacromialbreadth                    0
bicepscircumferenceflexed            0
bicristalbreadth                     0
bideltoidbreadth                     0
bimalleolarbreadth                   0
bitragionchinarc                     0
bitragionsubmandibulararc            0
bizygomaticbreadth                   0
buttockcircumference                 0
buttockdepth                         0
buttockheight                        0
buttockkneelength                    0
buttockpopliteallength               0
calfcircumference                    0
cervicaleheight                      0
chestbreadth                         0
chestcircumference                   0
chestdepth               

In [35]:
missing(df_female)

Unnamed: 0,Missing_Number,Missing_Percent
ethnicity,1467,0.739
subjectid,0,0.0
radialestylionlength,0,0.0
thighcircumference,0,0.0
tenthribheight,0,0.0
suprasternaleheight,0,0.0
stature,0,0.0
span,0,0.0
sleeveoutseam,0,0.0
sleevelengthspinewrist,0,0.0


In [35]:
df_female.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
subjectid,1986.0,22306.606,20904.73,10037.0,17667.0,22096.5,26089.75,920103.0
abdominalextensiondepthsitting,1986.0,229.651,31.465,155.0,207.0,227.0,249.0,358.0
acromialheight,1986.0,1335.095,58.08,1115.0,1298.0,1332.0,1374.0,1536.0
acromionradialelength,1986.0,311.198,17.165,249.0,300.0,311.0,323.0,371.0
anklecircumference,1986.0,215.74,14.892,170.0,205.0,215.0,225.0,275.0
axillaheight,1986.0,1239.03,55.802,1038.0,1202.0,1236.0,1277.0,1419.0
balloffootcircumference,1986.0,228.11,11.771,194.0,220.0,227.0,236.0,270.0
balloffootlength,1986.0,182.051,9.642,151.0,175.0,182.0,188.0,216.0
biacromialbreadth,1986.0,365.349,18.299,283.0,353.0,365.0,378.0,422.0
bicepscircumferenceflexed,1986.0,305.579,30.757,216.0,285.0,304.0,324.0,435.0


In [36]:
df_female.columns

Index(['subjectid', 'abdominalextensiondepthsitting', 'acromialheight',
       'acromionradialelength', 'anklecircumference', 'axillaheight',
       'balloffootcircumference', 'balloffootlength', 'biacromialbreadth',
       'bicepscircumferenceflexed',
       ...
       'branch', 'primarymos', 'subjectsbirthlocation', 'subjectnumericrace',
       'ethnicity', 'dodrace', 'age', 'heightin', 'weightlbs',
       'writingpreference'],
      dtype='object', length=108)

In [53]:
import missingno as msno 

In [55]:
# msno.bar(df_male)

In [56]:
# msno.matrix(df_male)

In [57]:
# msno.bar(df_female)

In [59]:
# msno.matrix(df_female)

In [60]:
df_all = pd.concat([df_male, df_female], axis=0, ignore_index=True, sort=False)

In [61]:
df_all.shape

(6068, 108)
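
The male file names its ID column `subjectid` while the female file uses `SubjectId`, so the lower-casing step has to run on both frames before they are concatenated. A minimal sketch on toy frames shows what would happen otherwise. The `normalize_columns` helper and the three-column toy frames are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for the two survey files (the real frames have 108 columns).
male = pd.DataFrame({"subjectid": [10027, 10032], "Gender": ["Male", "Male"], "weightkg": [815, 726]})
female = pd.DataFrame({"SubjectId": [10037], "Gender": ["Female"], "weightkg": [657]})

def normalize_columns(frame):
    """Hypothetical helper: return a copy with lower-cased column names."""
    out = frame.copy()
    out.columns = out.columns.str.lower()
    return out

# Without normalization the differently-cased ID columns stay separate.
raw = pd.concat([male, female], axis=0, ignore_index=True)

male_n, female_n = normalize_columns(male), normalize_columns(female)

# Columns present in only one frame would be NaN-filled after concat.
mismatched = set(male_n.columns) ^ set(female_n.columns)

combined = pd.concat([male_n, female_n], axis=0, ignore_index=True)
print(raw.shape, combined.shape, mismatched)
```

Checking that `mismatched` is empty before concatenating is a cheap guard against silently introducing NaN-filled columns.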

In [62]:
df = df_all.copy()

In [63]:
df.head()

Unnamed: 0,subjectid,abdominalextensiondepthsitting,acromialheight,acromionradialelength,anklecircumference,axillaheight,balloffootcircumference,balloffootlength,biacromialbreadth,bicepscircumferenceflexed,bicristalbreadth,bideltoidbreadth,bimalleolarbreadth,bitragionchinarc,bitragionsubmandibulararc,bizygomaticbreadth,buttockcircumference,buttockdepth,buttockheight,buttockkneelength,buttockpopliteallength,calfcircumference,cervicaleheight,chestbreadth,chestcircumference,chestdepth,chestheight,crotchheight,crotchlengthomphalion,crotchlengthposterioromphalion,earbreadth,earlength,earprotrusion,elbowrestheight,eyeheightsitting,footbreadthhorizontal,footlength,forearmcenterofgriplength,forearmcircumferenceflexed,forearmforearmbreadth,forearmhandlength,functionalleglength,handbreadth,handcircumference,handlength,headbreadth,headcircumference,headlength,heelanklecircumference,heelbreadth,hipbreadth,hipbreadthsitting,iliocristaleheight,interpupillarybreadth,interscyei,interscyeii,kneeheightmidpatella,kneeheightsitting,lateralfemoralepicondyleheight,lateralmalleolusheight,lowerthighcircumference,mentonsellionlength,neckcircumference,neckcircumferencebase,overheadfingertipreachsitting,palmlength,poplitealheight,radialestylionlength,shouldercircumference,shoulderelbowlength,shoulderlength,sittingheight,sleevelengthspinewrist,sleeveoutseam,span,stature,suprasternaleheight,tenthribheight,thighcircumference,thighclearance,thumbtipreach,tibialheight,tragiontopofhead,trochanterionheight,verticaltrunkcircumferenceusa,waistbacklength,waistbreadth,waistcircumference,waistdepth,waistfrontlengthsitting,waistheightomphalion,weightkg,wristcircumference,wristheight,gender,date,installation,component,branch,primarymos,subjectsbirthlocation,subjectnumericrace,ethnicity,dodrace,age,heightin,weightlbs,writingpreference
0,10027,266,1467,337,222,1347,253,202,401,369,274,493,71,319,291,142,979,240,882,619,509,373,1535,291,1074,259,1292,877,607,351,36,71,19,247,802,101,273,349,299,575,477,1136,90,214,193,150,583,206,326,70,332,366,1071,685,422,441,502,560,500,77,391,118,400,436,1447,113,437,273,1151,368,145,928,883,600,1782,1776,1449,1092,610,164,786,491,140,919,1700,501,329,933,240,440,1054,815,175,853,Male,4-Oct-10,Fort Hood,Regular Army,Combat Arms,19D,North Dakota,1,,1,41,71,180,Right hand
1,10032,233,1395,326,220,1293,245,193,394,338,257,479,67,344,320,135,944,232,870,584,468,357,1471,269,1021,253,1244,851,615,376,33,62,18,232,781,98,263,348,289,523,476,1096,86,203,195,146,568,201,334,72,312,356,1046,620,441,447,490,540,488,73,371,131,380,420,1380,118,417,254,1119,353,141,884,868,564,1745,1702,1387,1076,572,169,822,476,120,918,1627,432,316,870,225,371,1054,726,167,815,Male,4-Oct-10,Fort Hood,Regular Army,Combat Support,68W,New York,1,,1,35,68,160,Left hand
2,10033,287,1430,341,230,1327,256,196,427,408,261,544,75,345,330,135,1054,258,901,623,506,412,1501,288,1120,267,1288,854,636,359,40,61,23,237,810,103,270,355,357,575,491,1115,93,220,203,148,573,202,356,70,349,393,1053,665,462,475,496,556,482,72,409,123,403,434,1447,121,431,268,1276,367,167,917,910,604,1867,1735,1438,1105,685,198,807,477,125,918,1678,472,329,964,255,411,1041,929,180,831,Male,4-Oct-10,Fort Hood,Regular Army,Combat Support,68W,New York,2,,2,42,68,205,Left hand
3,10092,234,1347,310,230,1239,262,199,401,359,262,518,73,328,309,143,991,242,821,560,437,395,1423,296,1114,262,1205,769,590,341,39,66,25,272,794,106,267,352,318,593,467,1034,91,217,194,158,576,199,341,68,338,367,986,640,458,461,460,511,452,76,393,106,407,446,1357,118,393,249,1155,330,148,903,848,550,1708,1655,1346,1021,604,180,803,445,127,847,1625,461,315,857,205,399,968,794,176,793,Male,12-Oct-10,Fort Hood,Regular Army,Combat Service Support,88M,Wisconsin,1,,1,31,66,175,Right hand
4,10093,250,1585,372,247,1478,267,224,435,356,263,524,80,340,310,138,1029,275,1080,706,567,425,1684,304,1048,232,1452,1014,682,382,32,56,19,188,814,111,305,399,324,605,550,1279,94,222,218,153,566,197,374,69,332,372,1251,675,481,505,612,666,585,85,458,135,398,430,1572,132,523,302,1231,400,180,919,995,641,2035,1914,1596,1292,672,194,962,584,122,1090,1679,467,303,868,214,379,1245,946,188,954,Male,12-Oct-10,Fort Hood,Regular Army,Combat Service Support,92G,North Carolina,2,,2,21,77,213,Right hand


In [64]:
drop_columns = []
drop_columns.append("subjectid")

In [65]:
drop_columns

['subjectid']

In [66]:
df.head()

Unnamed: 0,subjectid,abdominalextensiondepthsitting,acromialheight,acromionradialelength,anklecircumference,axillaheight,balloffootcircumference,balloffootlength,biacromialbreadth,bicepscircumferenceflexed,bicristalbreadth,bideltoidbreadth,bimalleolarbreadth,bitragionchinarc,bitragionsubmandibulararc,bizygomaticbreadth,buttockcircumference,buttockdepth,buttockheight,buttockkneelength,buttockpopliteallength,calfcircumference,cervicaleheight,chestbreadth,chestcircumference,chestdepth,chestheight,crotchheight,crotchlengthomphalion,crotchlengthposterioromphalion,earbreadth,earlength,earprotrusion,elbowrestheight,eyeheightsitting,footbreadthhorizontal,footlength,forearmcenterofgriplength,forearmcircumferenceflexed,forearmforearmbreadth,forearmhandlength,functionalleglength,handbreadth,handcircumference,handlength,headbreadth,headcircumference,headlength,heelanklecircumference,heelbreadth,hipbreadth,hipbreadthsitting,iliocristaleheight,interpupillarybreadth,interscyei,interscyeii,kneeheightmidpatella,kneeheightsitting,lateralfemoralepicondyleheight,lateralmalleolusheight,lowerthighcircumference,mentonsellionlength,neckcircumference,neckcircumferencebase,overheadfingertipreachsitting,palmlength,poplitealheight,radialestylionlength,shouldercircumference,shoulderelbowlength,shoulderlength,sittingheight,sleevelengthspinewrist,sleeveoutseam,span,stature,suprasternaleheight,tenthribheight,thighcircumference,thighclearance,thumbtipreach,tibialheight,tragiontopofhead,trochanterionheight,verticaltrunkcircumferenceusa,waistbacklength,waistbreadth,waistcircumference,waistdepth,waistfrontlengthsitting,waistheightomphalion,weightkg,wristcircumference,wristheight,gender,date,installation,component,branch,primarymos,subjectsbirthlocation,subjectnumericrace,ethnicity,dodrace,age,heightin,weightlbs,writingpreference
0,10027,266,1467,337,222,1347,253,202,401,369,274,493,71,319,291,142,979,240,882,619,509,373,1535,291,1074,259,1292,877,607,351,36,71,19,247,802,101,273,349,299,575,477,1136,90,214,193,150,583,206,326,70,332,366,1071,685,422,441,502,560,500,77,391,118,400,436,1447,113,437,273,1151,368,145,928,883,600,1782,1776,1449,1092,610,164,786,491,140,919,1700,501,329,933,240,440,1054,815,175,853,Male,4-Oct-10,Fort Hood,Regular Army,Combat Arms,19D,North Dakota,1,,1,41,71,180,Right hand
1,10032,233,1395,326,220,1293,245,193,394,338,257,479,67,344,320,135,944,232,870,584,468,357,1471,269,1021,253,1244,851,615,376,33,62,18,232,781,98,263,348,289,523,476,1096,86,203,195,146,568,201,334,72,312,356,1046,620,441,447,490,540,488,73,371,131,380,420,1380,118,417,254,1119,353,141,884,868,564,1745,1702,1387,1076,572,169,822,476,120,918,1627,432,316,870,225,371,1054,726,167,815,Male,4-Oct-10,Fort Hood,Regular Army,Combat Support,68W,New York,1,,1,35,68,160,Left hand
2,10033,287,1430,341,230,1327,256,196,427,408,261,544,75,345,330,135,1054,258,901,623,506,412,1501,288,1120,267,1288,854,636,359,40,61,23,237,810,103,270,355,357,575,491,1115,93,220,203,148,573,202,356,70,349,393,1053,665,462,475,496,556,482,72,409,123,403,434,1447,121,431,268,1276,367,167,917,910,604,1867,1735,1438,1105,685,198,807,477,125,918,1678,472,329,964,255,411,1041,929,180,831,Male,4-Oct-10,Fort Hood,Regular Army,Combat Support,68W,New York,2,,2,42,68,205,Left hand
3,10092,234,1347,310,230,1239,262,199,401,359,262,518,73,328,309,143,991,242,821,560,437,395,1423,296,1114,262,1205,769,590,341,39,66,25,272,794,106,267,352,318,593,467,1034,91,217,194,158,576,199,341,68,338,367,986,640,458,461,460,511,452,76,393,106,407,446,1357,118,393,249,1155,330,148,903,848,550,1708,1655,1346,1021,604,180,803,445,127,847,1625,461,315,857,205,399,968,794,176,793,Male,12-Oct-10,Fort Hood,Regular Army,Combat Service Support,88M,Wisconsin,1,,1,31,66,175,Right hand
4,10093,250,1585,372,247,1478,267,224,435,356,263,524,80,340,310,138,1029,275,1080,706,567,425,1684,304,1048,232,1452,1014,682,382,32,56,19,188,814,111,305,399,324,605,550,1279,94,222,218,153,566,197,374,69,332,372,1251,675,481,505,612,666,585,85,458,135,398,430,1572,132,523,302,1231,400,180,919,995,641,2035,1914,1596,1292,672,194,962,584,122,1090,1679,467,303,868,214,379,1245,946,188,954,Male,12-Oct-10,Fort Hood,Regular Army,Combat Service Support,92G,North Carolina,2,,2,21,77,213,Right hand


In [67]:
df.shape

(6068, 108)

In [68]:
df.nunique()

subjectid                         6068
abdominalextensiondepthsitting     218
acromialheight                     432
acromionradialelength              133
anklecircumference                 112
axillaheight                       402
balloffootcircumference            107
balloffootlength                    86
biacromialbreadth                  169
bicepscircumferenceflexed          237
bicristalbreadth                   132
bideltoidbreadth                   244
bimalleolarbreadth                  37
bitragionchinarc                   107
bitragionsubmandibulararc          125
bizygomaticbreadth                  50
buttockcircumference               429
buttockdepth                       161
buttockheight                      322
buttockkneelength                  209
buttockpopliteallength             185
calfcircumference                  196
cervicaleheight                    452
chestbreadth                       131
chestcircumference                 521
chestdepth               

In [69]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6068 entries, 0 to 6067
Columns: 108 entries, subjectid to writingpreference
dtypes: int64(99), object(9)
memory usage: 5.0+ MB


In [70]:
df.isnull().sum()

subjectid                            0
abdominalextensiondepthsitting       0
acromialheight                       0
acromionradialelength                0
anklecircumference                   0
axillaheight                         0
balloffootcircumference              0
balloffootlength                     0
biacromialbreadth                    0
bicepscircumferenceflexed            0
bicristalbreadth                     0
bideltoidbreadth                     0
bimalleolarbreadth                   0
bitragionchinarc                     0
bitragionsubmandibulararc            0
bizygomaticbreadth                   0
buttockcircumference                 0
buttockdepth                         0
buttockheight                        0
buttockkneelength                    0
buttockpopliteallength               0
calfcircumference                    0
cervicaleheight                      0
chestbreadth                         0
chestcircumference                   0
chestdepth               

In [71]:
missing(df)

Unnamed: 0,Missing_Number,Missing_Percent
ethnicity,4647,0.766
subjectid,0,0.0
radialestylionlength,0,0.0
thighcircumference,0,0.0
tenthribheight,0,0.0
suprasternaleheight,0,0.0
stature,0,0.0
span,0,0.0
sleeveoutseam,0,0.0
sleevelengthspinewrist,0,0.0


In [72]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
subjectid,6068.0,20757.198,13159.391,10027.0,14841.75,20063.5,27234.5,920103.0
abdominalextensiondepthsitting,6068.0,246.469,37.4,155.0,219.0,242.0,271.0,451.0
acromialheight,6068.0,1406.161,79.091,1115.0,1350.0,1410.0,1462.0,1683.0
acromionradialelength,6068.0,327.374,20.72,249.0,313.0,328.0,341.25,393.0
anklecircumference,6068.0,224.891,16.052,156.0,214.0,225.0,235.0,293.0
axillaheight,6068.0,1299.609,72.022,1038.0,1249.0,1302.0,1349.0,1553.0
balloffootcircumference,6068.0,244.193,16.845,186.0,232.0,245.0,256.0,306.0
balloffootlength,6068.0,194.755,13.516,151.0,185.0,195.0,204.0,245.0
biacromialbreadth,6068.0,399.204,30.237,283.0,376.0,404.0,421.0,489.0
bicepscircumferenceflexed,6068.0,340.934,41.52,216.0,311.0,341.0,370.0,490.0


In [73]:
df.describe(include=object).T

Unnamed: 0,count,unique,top,freq
gender,6068,2,Male,4082
date,6068,253,27-Feb-12,45
installation,6068,12,Camp Shelby,1160
component,6068,3,Regular Army,3140
branch,6068,3,Combat Service Support,3174
primarymos,6068,285,11B,671
subjectsbirthlocation,6068,152,California,446
ethnicity,1421,209,Mexican,357
writingpreference,6068,3,Right hand,5350


In [74]:
unneccesseray_data = list(df.describe(include=object).columns.drop('gender'))
unneccesseray_data 

['date',
 'installation',
 'component',
 'branch',
 'primarymos',
 'subjectsbirthlocation',
 'ethnicity',
 'writingpreference']

In [75]:
drop_columns.extend(unneccesseray_data )

In [76]:
drop_columns

['subjectid',
 'date',
 'installation',
 'component',
 'branch',
 'primarymos',
 'subjectsbirthlocation',
 'ethnicity',
 'writingpreference']

In [77]:
drop_columns.append('age')

In [78]:
drop_columns

['subjectid',
 'date',
 'installation',
 'component',
 'branch',
 'primarymos',
 'subjectsbirthlocation',
 'ethnicity',
 'writingpreference',
 'age']

In [79]:
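# Intended check: does subjectnumericrace simply repeat dodrace?
# NOTE: the == comparison yields booleans, and a boolean is never > 8,
# so this expression always returns 0 regardless of the data.
((df.subjectnumericrace==df.dodrace)>8).sum()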

0
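
To see how often the two race columns actually differ, the disagreements can be counted directly. The sketch below uses the (`subjectnumericrace`, `dodrace`) pairs from the five female rows shown in `df_female.head()`. It is an illustration of the check, not a cell from the original notebook.

```python
import pandas as pd

# (subjectnumericrace, dodrace) pairs taken from the df_female.head() output.
toy = pd.DataFrame({
    "subjectnumericrace": [2, 3, 1, 8, 1],
    "dodrace":            [2, 3, 1, 2, 1],
})

agree = toy["subjectnumericrace"] == toy["dodrace"]

# Number of rows where the two columns disagree.
n_disagree = int((~agree).sum())

# Comparing the boolean Series with 8 is False everywhere, so the sum is 0.
always_zero = int((agree > 8).sum())
print(n_disagree, always_zero)
```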

In [80]:
drop_columns.append('subjectnumericrace')
drop_columns

['subjectid',
 'date',
 'installation',
 'component',
 'branch',
 'primarymos',
 'subjectsbirthlocation',
 'ethnicity',
 'writingpreference',
 'age',
 'subjectnumericrace']

In [81]:
missing(df)

Unnamed: 0,Missing_Number,Missing_Percent
ethnicity,4647,0.766
subjectid,0,0.0
radialestylionlength,0,0.0
thighcircumference,0,0.0
tenthribheight,0,0.0
suprasternaleheight,0,0.0
stature,0,0.0
span,0,0.0
sleeveoutseam,0,0.0
sleevelengthspinewrist,0,0.0


In [82]:
len(drop_columns)

11

In [None]:
drop_columns

In [83]:
df.drop(drop_columns, axis=1, inplace=True)

In [84]:
df.shape

(6068, 97)

In [85]:
missing(df)

Unnamed: 0,Missing_Number,Missing_Percent
abdominalextensiondepthsitting,0,0.0
hipbreadth,0,0.0
sleevelengthspinewrist,0,0.0
sittingheight,0,0.0
shoulderlength,0,0.0
shoulderelbowlength,0,0.0
shouldercircumference,0,0.0
radialestylionlength,0,0.0
poplitealheight,0,0.0
palmlength,0,0.0


# Exploratory Data Analysis and Visualization

### Dodrace | Weight

#### Dodrace

In [86]:
df['dodrace'].value_counts()

1    3792
2    1298
3     679
4     188
6      59
5      49
8       3
Name: dodrace, dtype: int64

In [87]:
df['dodrace'].value_counts()<500

1    False
2    False
3    False
4     True
6     True
5     True
8     True
Name: dodrace, dtype: bool

In [88]:
# Drop every class with fewer than 500 observations (classes 4, 5, 6 and 8)
df.drop(df[df['dodrace'].isin([4, 5, 6, 8])].index, inplace=True)
df.shape

(5769, 97)

In [89]:
df.shape

(5769, 97)
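
The rare classes can also be derived from the counts instead of being listed by hand. The following is a sketch under stated assumptions: `drop_rare_classes` is a hypothetical helper, and the toy frame uses a threshold of 3 where the notebook uses 500.

```python
import pandas as pd

def drop_rare_classes(frame, target, min_count):
    """Hypothetical helper: keep only rows whose class has at least min_count rows."""
    counts = frame[target].value_counts()
    keep = counts[counts >= min_count].index
    return frame[frame[target].isin(keep)].copy()

# Toy data: class 1 has 5 rows, class 2 has 3 rows, class 4 has 1 row.
toy = pd.DataFrame({"dodrace": [1] * 5 + [2] * 3 + [4], "stature": range(9)})

filtered = drop_rare_classes(toy, "dodrace", min_count=3)
print(filtered["dodrace"].value_counts())
```

On the notebook's data the equivalent call would be `drop_rare_classes(df, "dodrace", min_count=500)`, which keeps classes 1, 2 and 3 according to the counts shown above.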

#### Weight

In [90]:
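# weightkg is recorded in tenths of a kilogram (kg * 10);
# dividing by 4.54 (about 10 * 0.4536 kg per lb) converts it to pounds
df['weightkg']*(1/4.54)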

0      179.515
1      159.912
2      204.626
3      174.890
4      208.370
         ...  
6063   183.260
6064   157.930
6065   167.841
6066   139.207
6067   134.361
Name: weightkg, Length: 5769, dtype: float64

In [91]:
df['weightkg']

0       815
1       726
2       929
3       794
4       946
       ... 
6063    832
6064    717
6065    762
6066    632
6067    610
Name: weightkg, Length: 5769, dtype: int64

In [92]:
df['weightlbs'].describe()

count   5769.000
mean     175.578
std       33.600
min        0.000
25%      150.000
50%      175.000
75%      197.000
max      321.000
Name: weightlbs, dtype: float64

In [93]:
df['weightlbs']

0       180
1       160
2       205
3       175
4       213
       ... 
6063    180
6064    150
6065    168
6066    133
6067    132
Name: weightlbs, Length: 5769, dtype: int64
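
One way to locate the unusual `weightlbs` values is to convert the measured `weightkg` to pounds and flag rows where the two disagree strongly. This is a sketch, not the notebook's method. The first three rows are taken from the output above, the fourth row (with `weightlbs` of 0) is invented for illustration, and the 30 lb tolerance is an arbitrary assumption.

```python
import pandas as pd

LB_PER_KG = 2.20462  # pounds per kilogram

toy = pd.DataFrame({
    "weightkg":  [815, 726, 929, 700],   # tenths of a kilogram
    "weightlbs": [180, 160, 205, 0],     # last row is an invented outlier
})

# Convert the measured weight from tenths of a kilogram to pounds.
measured_lbs = toy["weightkg"] / 10 * LB_PER_KG

# Flag rows where the recorded pounds differ from the measurement by more than 30 lb.
diff = (toy["weightlbs"] - measured_lbs).abs()
suspicious = toy[diff > 30]
print(suspicious)
```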

#### weightkg and weightlbs describe the same quantity; weightlbs contains implausible values such as 0, so we drop the weightlbs column

In [94]:
df.shape

(5769, 97)

In [95]:
df.drop('weightlbs', axis=1, inplace=True)

In [96]:
df.shape

(5769, 96)

## Features | Target

In [97]:
df.duplicated(subset=None, keep='first').sum()

0

In [98]:
# 3-Target Examination
target = "dodrace"

# df.duplicated(subset=None, keep='first').sum()
df.drop_duplicates(keep = 'first', inplace = True)

# df = df.dropna()

X_columns = df.drop(target, axis=1).columns
X_categorical = df.drop(target, axis=1).select_dtypes('object')
X_numerical = df.drop(target, axis=1).select_dtypes('number').astype('float64')

# df[target].value_counts()
# X_columns
# X_numerical.columns
# X_categorical.columns
# X_numerical.columns.values

In [99]:
df[target].value_counts()

1    3792
2    1298
3     679
Name: dodrace, dtype: int64

## Numerical Features

In [None]:
# index = 0
# plt.figure(figsize=(20,20))
# for feature in X_numerical.columns:
#     if feature != target:
#         index += 1
#         plt.subplot(5,5,index)
#         sns.boxplot(x=target,y=feature,data=df);

In [None]:
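# gender is still an object column, so restrict the correlation to numeric columns
df.select_dtypes('number').corr().style.background_gradient(cmap='RdPu')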

In [None]:
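def correlation(dataset, threshold):
    """Delete, in place, each column whose correlation with an earlier kept column is >= threshold."""
    col_corr = set() # Set of all the names of deleted columns
    # Only numeric columns can be correlated; object columns such as gender are skipped.
    # The numeric target column takes part in the comparison as well.
    corr_matrix = dataset.select_dtypes('number').corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if (corr_matrix.iloc[i, j] >= threshold) and (corr_matrix.columns[j] not in col_corr):
                colname = corr_matrix.columns[i] # getting the name of column
                col_corr.add(colname)
                if colname in dataset.columns:
                    del dataset[colname] # deleting the column from the dataset

    print(dataset)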

In [None]:
correlation(df, 0.9)

In [None]:
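# gender is still an object column, so restrict the correlation to numeric columns
df.select_dtypes('number').corr().style.background_gradient(cmap='RdPu')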

In [None]:
df.shape
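
A non-mutating variant of the correlation filter can make the step easier to inspect, because it returns the candidate columns instead of deleting them. The sketch below is a simpler alternative, not the notebook's function: it uses absolute correlations and the upper triangle of the matrix, and it lets the target be excluded. `highly_correlated_columns` and its `exclude` parameter are hypothetical names. The toy values come from the five male rows shown in `df.head()`.

```python
import numpy as np
import pandas as pd

def highly_correlated_columns(frame, threshold, exclude=()):
    """Hypothetical helper: list numeric columns correlated >= threshold with an earlier column."""
    numeric = frame.select_dtypes("number").drop(columns=list(exclude), errors="ignore")
    corr = numeric.corr().abs()
    # Keep only entries above the diagonal so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] >= threshold).any()]

toy = pd.DataFrame({
    "stature":    [1776, 1702, 1735, 1655, 1914],
    "span":       [1782, 1745, 1867, 1708, 2035],
    "earbreadth": [36, 33, 40, 39, 32],
    "gender":     ["Male"] * 5,
    "dodrace":    [1, 1, 2, 1, 2],
})

to_drop = highly_correlated_columns(toy, threshold=0.9, exclude=("dodrace",))
reduced = toy.drop(columns=to_drop)
print(to_drop)
```

Returning the list first leaves the original frame untouched, so the choice of threshold can be reviewed before any column is removed.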

# Model Selection

## Train | Test Split & Scaling

In [None]:
# 10-Train|Test Split, Dummy 

# # Before dummy: 
# make_dtype_object = df[['categorical1','categorical2']].astype('object')

X_columns_ = df.drop(target, axis=1).columns
X_categorical_ = df.drop(target, axis=1).select_dtypes('object')
X_numerical_ = df.drop(target, axis=1).select_dtypes('number').astype('float64')

###############################################################################

if (df.dtypes==object).any():
    dummied = pd.get_dummies(X_categorical_, drop_first=True)
    X = pd.concat([X_numerical_, dummied[dummied.columns]], axis=1)
    
else:
    X = df.drop(target, axis=1).astype('float64')
# dodrace is a multiclass target, so it is used as-is (no dummy encoding)
y = df[target]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.20, 
                                                    random_state=42)

###############################################################################

# # 11-MinMax Scaling
# from sklearn.preprocessing import MinMaxScaler
# scaler = MinMaxScaler()
# X_train_scaled = scaler.fit_transform(X_train)
# X_test_scaled = scaler.transform(X_test)

# 11-Standard Scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

###############################################################################
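
The class counts shown earlier (3792 / 1298 / 679) are imbalanced, and the split above does not stratify. As a hedged illustration, the sketch below shows how `stratify` keeps class proportions nearly identical in train and test. It uses a synthetic target built from those counts; `X_demo` and `y_demo` are placeholder names, not the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic target with the same class counts as df[target].value_counts() above
y_demo = pd.Series(np.repeat([1, 2, 3], [3792, 1298, 679]), name="dodrace")
X_demo = pd.DataFrame({"row_id": np.arange(len(y_demo))})  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.20,
    random_state=42,
    stratify=y_demo,  # preserve class proportions in both parts
)

train_prop = y_tr.value_counts(normalize=True).sort_index()
test_prop = y_te.value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_prop, "test": test_prop}).round(3))
```

Whether to stratify the real split is a design choice; it mainly helps the smallest class appear in the test set at its true rate.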

In [None]:
X

In [None]:
df[target]

In [None]:
shape_control()

## Implement DT and Evaluate

In [None]:
## Cross Validation
from sklearn.model_selection import cross_validate
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score
from sklearn.metrics import make_scorer
from sklearn.tree import DecisionTreeClassifier

cv_model = DecisionTreeClassifier(random_state=42)
scores = cross_validate(cv_model, X_train, y_train, scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"], cv = 10)
df_scores = pd.DataFrame(scores, index = range(1, 11))

df_scores.mean()[2:]

In [None]:
# Simple Classifier
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=42).fit(X_train_scaled, y_train)
y_test_pred = model.predict(X_test_scaled)
dt_acc, dt_recall = calc_predict()
get_report()

In [None]:
feature_importances()

In [None]:
feature_importances_bar()

## Implement Logistic Regression and Evaluate

In [None]:
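## Model Evaluate

# 1-Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# 'saga' is the only scikit-learn solver that supports l1, l2 and elasticnet;
# l1_ratio only has an effect when penalty="elasticnet"
params = {"penalty" : ["l1", "l2", "elasticnet"],
          "l1_ratio" : np.linspace(0, 1, 20),
          "C" : np.logspace(0, 10, 20)}
model = GridSearchCV(LogisticRegression(solver="saga", max_iter=5000, random_state=42), 
                     params, 
                     cv=10).fit(X_train_scaled, y_train)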

y_test_pred = model.predict(X_test_scaled)
log_acc, log_recall = calc_predict()
get_report()
# train_control_table()
# test_control_table()
# feature_importances()
# feature_importances_bar()
log_acc = accuracy_score(y_test, y_test_pred)
# The target has three classes, so recall needs an explicit averaging mode
log_recall = recall_score(y_test, y_test_pred, average="macro")

# # Model tuning
# tuned_model = LogisticRegression(penalty = penalty, 
#                                C = C, 
#                                l1_ratio = l1_ratio, 
#                                solver='saga', 
#                                max_iter=5000).fit(X_train_scaled, y_train)
# y_test_pred = tuned_model.predict(X_test_scaled)

## ROC (Receiver Operating Characteristic) Curve and AUC (Area Under the Curve)

In [None]:
#print(model)
from sklearn.metrics import plot_roc_curve, plot_precision_recall_curve
plot_roc_curve(model, X_train_scaled, y_train);
plot_precision_recall_curve(model, X_train_scaled, y_train);

plot_roc_curve(model, X_test_scaled, y_test);
plot_precision_recall_curve(model, X_test_scaled, y_test);
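
`plot_roc_curve` and `plot_precision_recall_curve` only handle binary targets and were removed in scikit-learn 1.2, while `dodrace` has three classes. One possible workaround is a one-vs-rest ROC per class. The sketch below assumes a classifier that exposes `predict_proba` and runs on synthetic data; `X_demo`, `y_demo`, and `clf` are illustrative names, not the notebook's variables.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, label_binarize

# Synthetic three-class problem standing in for the scaled ANSUR features
X_demo, y_demo = make_classification(n_samples=600, n_features=10, n_informative=5,
                                     n_classes=3, n_clusters_per_class=1,
                                     class_sep=1.5, random_state=42)
y_demo = y_demo + 1  # use labels 1, 2, 3 like dodrace

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=42, stratify=y_demo)
scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X_tr), y_tr)

# One column of probabilities and one binary indicator column per class
proba = clf.predict_proba(scaler.transform(X_te))
y_bin = label_binarize(y_te, classes=clf.classes_)

per_class_auc = {}
for k, cls in enumerate(clf.classes_):
    # Treat class `cls` as positive and all other classes as negative
    fpr, tpr, _ = roc_curve(y_bin[:, k], proba[:, k])
    per_class_auc[int(cls)] = auc(fpr, tpr)

print(per_class_auc)
```

Each `fpr`/`tpr` pair can be passed to `plt.plot` to draw one curve per class. Models without `predict_proba`, such as `SVC` with default settings, would need `decision_function` scores instead.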

## Implement KNN and Evaluate

In [None]:
## Model Evaluate

# 1-KNN Classification
params = {"n_neighbors": np.arange(1, 30)}
model = GridSearchCV(KNeighborsClassifier(), 
                    params, 
                    cv=10).fit(X_train_scaled, y_train)
y_test_pred = model.predict(X_test_scaled)
# knn_acc, knn_recall = calc_predict()
get_report()

In [None]:
#print(model)
from sklearn.metrics import plot_roc_curve, plot_precision_recall_curve
plot_roc_curve(model, X_train_scaled, y_train);
plot_precision_recall_curve(model, X_train_scaled, y_train);

plot_roc_curve(model, X_test_scaled, y_test);
plot_precision_recall_curve(model, X_test_scaled, y_test);

In [None]:
# KNN Classification
params = {"n_neighbors": np.arange(1, 30), 
          "p": [1, 2]}
model = GridSearchCV(KNeighborsClassifier(), 
                     params, 
                     cv=10).fit(X_train_scaled, y_train)
y_test_pred = model.predict(X_test_scaled)
# knn_acc, knn_recall = calc_predict()
get_report()

In [None]:
#print(model)
from sklearn.metrics import plot_roc_curve, plot_precision_recall_curve
plot_roc_curve(model, X_train_scaled, y_train);
plot_precision_recall_curve(model, X_train_scaled, y_train);

plot_roc_curve(model, X_test_scaled, y_test);
plot_precision_recall_curve(model, X_test_scaled, y_test);

In [None]:
# KNN Classification
params = {"n_neighbors": np.arange(1, 50), 
          "p": [1,2], 
          "weights": ['uniform', "distance"]}
model = GridSearchCV(KNeighborsClassifier(), 
                     params, 
                     cv=10).fit(X_train_scaled, y_train)
y_test_pred = model.predict(X_test_scaled)
knn_acc, knn_recall = calc_predict()
get_report()

In [None]:
#print(model)
from sklearn.metrics import plot_roc_curve, plot_precision_recall_curve
plot_roc_curve(model, X_train_scaled, y_train);
plot_precision_recall_curve(model, X_train_scaled, y_train);

plot_roc_curve(model, X_test_scaled, y_test);
plot_precision_recall_curve(model, X_test_scaled, y_test);

## Implement SVM and Evaluate

In [None]:
# SVM Classification
from sklearn.model_selection import GridSearchCV
params = {'C': [0.1,1, 10, 100, 1000],
          'gamma': ["scale", "auto", 1,0.1,0.01,0.001,0.0001],
          'kernel': ['rbf', 'linear', 'poly']}
model = GridSearchCV(SVC(random_state=42), 
                     params, 
                     verbose=3, 
                     refit=True, 
                     cv=10).fit(X_train_scaled, y_train)
y_test_pred = model.predict(X_test_scaled)
svm_acc, svm_recall = calc_predict()
get_report()

In [None]:
#print(model)
from sklearn.metrics import plot_roc_curve, plot_precision_recall_curve
plot_roc_curve(model, X_train_scaled, y_train);
plot_precision_recall_curve(model, X_train_scaled, y_train);

plot_roc_curve(model, X_test_scaled, y_test);
plot_precision_recall_curve(model, X_test_scaled, y_test);

## Implement XGBoost and Evaluate

In [None]:
from xgboost import XGBClassifier
params = {"n_estimators":[100, 300],
          "max_depth":[3,5,6], 
          "learning_rate": [0.1, 0.3],
          "subsample":[0.5, 1],
          "colsample_bytree":[0.5, 1]}
model = GridSearchCV(XGBClassifier(random_state=42), 
                     params, 
                     scoring="f1_macro",  # plain "f1" is binary-only; the target has three classes
                     verbose=2, 
                     n_jobs=-1,
                     cv=10).fit(X_train_scaled, y_train)
y_test_pred = model.predict(X_test_scaled)
xg_acc, xg_recall = calc_predict()
get_report()

In [None]:
#print(model)
from sklearn.metrics import plot_roc_curve, plot_precision_recall_curve
plot_roc_curve(model, X_train_scaled, y_train);
plot_precision_recall_curve(model, X_train_scaled, y_train);

plot_roc_curve(model, X_test_scaled, y_test);
plot_precision_recall_curve(model, X_test_scaled, y_test);

## Implement RandomForest and Evaluate

In [None]:
# RandomForest Classification
params = {"max_depth": [2,5,8,10],
          "max_features": [2,5,8],
          "n_estimators": [10,500,1000],
          "min_samples_split": [2,5,10]}
model = GridSearchCV(RandomForestClassifier(random_state=42), 
                     params, 
                     n_jobs=-1, 
                     verbose=2, 
                     refit=True,
                     cv=10).fit(X_train_scaled, y_train)
y_test_pred = model.predict(X_test_scaled)
rf_acc, rf_recall = calc_predict()
get_report()

In [None]:
from sklearn.metrics import plot_roc_curve, plot_precision_recall_curve
# `model` is the fitted RandomForest grid search from the previous cell
plot_roc_curve(model, X_train_scaled, y_train);
plot_precision_recall_curve(model, X_train_scaled, y_train);

plot_roc_curve(model, X_test_scaled, y_test);
plot_precision_recall_curve(model, X_test_scaled, y_test);

# Data Preprocessing

# Visually compare models based on your chosen metric

# Choose best model and make a random prediction

In [None]:
# Model names must line up one-to-one with the score lists below
compare = pd.DataFrame({"Model": ["DT", "LR", "KNN", "SVM", "XGB", "RF"],
                        "Accuracy": [dt_acc, log_acc, knn_acc, svm_acc, xg_acc, rf_acc],
                        "Recall": [dt_recall, log_recall, knn_recall, svm_recall, xg_recall, rf_recall]})

def labels(ax):
    for p in ax.patches:
        width = p.get_width()                        # bar length (the bars are horizontal)
        ax.text(width,                               # x position: the right end of the bar
                p.get_y() + p.get_height() / 2,      # y position: vertical center of the bar
                '{:1.2f}'.format(width),             # label text: the value with 2 decimals
                ha='left',                         # horizontal alignment
                va='center')                       # vertical alignment
    
plt.figure(figsize=(14,10))
plt.subplot(211)
compare = compare.sort_values(by="Accuracy", ascending=False)
ax=sns.barplot(x="Accuracy", y="Model", data=compare, palette="Blues_d")
labels(ax)

plt.subplot(212)
compare = compare.sort_values(by="Recall", ascending=False)
ax=sns.barplot(x="Recall", y="Model", data=compare, palette="Blues_d")
labels(ax)
plt.show()
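
The heading above also asks for a random prediction, which the comparison cell does not cover. A minimal sketch of that step follows, assuming the chosen model and the fitted scaler are both available. It runs on synthetic data, and `best_model`, `X_demo`, and the `feature_*` column names are placeholders, not the notebook's objects.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the ANSUR feature matrix and the dodrace target
X_arr, y_arr = make_classification(n_samples=500, n_features=8, n_informative=4,
                                   n_classes=3, n_clusters_per_class=1,
                                   random_state=42)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])
y_demo = pd.Series(y_arr + 1, name="dodrace")  # labels 1, 2, 3

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=42, stratify=y_demo)

# Fit the scaler on the training part only, then train the (placeholder) best model
scaler = StandardScaler().fit(X_tr)
best_model = RandomForestClassifier(n_estimators=50, random_state=42)
best_model.fit(scaler.transform(X_tr), y_tr)

# Draw one random row from the held-out data and push it through the same scaler
sample = X_te.sample(n=1, random_state=0)
pred = best_model.predict(scaler.transform(sample))[0]
actual = y_te.loc[sample.index[0]]

print(f"predicted class: {pred}, actual class: {actual}")
```

The important detail is that the sample goes through the scaler fitted on the training data and is never used to refit it.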