# Working with a Data Dictionary: a simpler, faster and more accurate way to understand the data

This notebook is not for complete beginners; it is aimed at entry- and mid-level Data Scientists. I want to share a simple way to make your job easier and your results more accurate.

What you will NOT find here:
1. EDA
2. Plots
3. Long explanations, which in my country are called "text water"

What you will find:
1. Data Dictionary
2. Data Types explanation
3. My choice of feature engineering (you may not expect some of the features)
4. Model selection
5. Model tuning (and feature selection)
6. Modeling

Let's get started.

# Imports and styling.

In [1]:
import numpy as np
import pandas as pd
import sklearn 

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

import platform
print('Versions:')
print('  python', platform.python_version())
n = ('numpy', 'pandas', 'sklearn', 'matplotlib', 'seaborn')
nn = (np, pd, sklearn, mpl, sns)
for a, b in zip(n, nn):
    print('  --', str(a), b.__version__)

Versions:
  python 3.6.8
  -- numpy 1.17.3
  -- pandas 0.25.2
  -- sklearn 0.21.3
  -- matplotlib 3.1.1
  -- seaborn 0.9.0


In [2]:
#pandas styling
pd.set_option('colheader_justify', 'left')
pd.set_option('display.precision', 0)
pd.options.display.float_format = '{:,.2f}'.format
pd.set_option('display.max_colwidth', -1)

In [3]:
#pd.reset_option('all')

In [4]:
#seaborn styling
sns.set_style('whitegrid', { 'axes.axisbelow': True, 'axes.edgecolor': 'black', 'axes.facecolor': 'white',
        'axes.grid': True, 'axes.labelcolor': 'black', 'axes.spines.bottom': True, 'axes.spines.left': True,
        'axes.spines.right': False, 'axes.spines.top': False, 'figure.facecolor': 'white', 
        #'font.family': ['sans-serif'], 'font.sans-serif': ['Arial', 'DejaVu Sans', 'Liberation Sans', 'Bitstream Vera Sans', 'sans-serif'],
        'grid.color': 'grey', 'grid.linestyle': ':', 'image.cmap': 'rocket', 'lines.solid_capstyle': 'round',
        'patch.edgecolor': 'w', 'patch.force_edgecolor': True, 'text.color': 'black', 
        'xtick.top': False, 'xtick.bottom': True, 'xtick.color': 'navy', 'xtick.direction': 'out', 
        'ytick.right': False,    'ytick.left': True, 'ytick.color': 'navy', 'ytick.direction': 'out'})

In [5]:
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process 
from sklearn import feature_selection, model_selection, metrics
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from xgboost import XGBClassifier

## Concatenating data sets

In [6]:
train_raw = pd.read_csv('...TitanicKaggleTrain.csv')
test_raw = pd.read_csv('...TitanicKaggleTest.csv')
len(train_raw) + len(test_raw)

1309

In [7]:
df = pd.concat(objs=[train_raw, test_raw], axis=0)
df.shape

(1309, 12)

## Open a Data Dictionary

In [8]:
ddict = pd.read_excel('...TitanicDDict.xlsx', index_col=0)
ddict

Unnamed: 0,Definition,Description,#Unique,TopValue,%UsedTop,%Missing,Unit,Type,Dtype
age,Age in years,123,98,24,3.59,20.09,years,cont,float64
fare,Passenger fare,-,281,8.05,4.58,0.08,usd,cont,float64
pclass,Ticket class,"1 = 1st, 2 = 2nd, 3 = 3rd",3,3,54.16,0.0,-,ordinal,int64
survived,Survival,"0 = No, 1 = Yes",2,0,61.8,0.0,-,binary,int64
sibsp,# of siblings / spouses aboard the Titanic,123,7,0,68.07,0.0,person,discr,int64
parch,# of parents / children aboard the Titanic,123,8,0,76.55,0.0,person,discr,int64
name,passengers names,-,1307,"Kelly, Mr. James",0.15,0.0,-,nominal,object
sex,Sex,"f,m",2,male,64.4,0.0,-,binary,object
ticket,Ticket number,-,939,CA. 2343,0.84,0.0,-,nominal,object
cabin,Cabin number,-,186,C23 C25 C27,0.46,77.46,-,nominal,object


*(If you find it difficult to create this kind of table, you can see exactly how I built this dictionary [on my GitHub](https://github.com/datalanas/Jupyter_notebooks_to_share/blob/master/Titanic_What_is_DataDictionary.ipynb).)*

The Data Dictionary uses the feature names of the main data set as its index and has the following columns:
- 'Definition' - the meaning of the feature
- 'Description' - the meaning of the feature's values
- '#Unique' - the number of unique values in the column, with NaN counted as a value
- 'TopValue' - the most frequent value
- '%UsedTop' - the share of rows (in %) that hold the top value
- '%Missing' - the share of missing values (in %)
- 'Unit' - the unit of measurement
- 'Type' - the measurement scale
- 'Dtype' - the pandas data type of the column
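
The computed columns of such a table can be sketched with plain pandas. This is only a minimal illustration that follows the column descriptions above: the helper name `build_ddict` and the toy frame are my assumptions, and 'Definition', 'Description', 'Unit' and 'Type' are left out because they have to be filled in by a human.

```python
import numpy as np
import pandas as pd

def build_ddict(data: pd.DataFrame) -> pd.DataFrame:
    """Summarise every column of `data` in one row (hypothetical helper)."""
    rows = {}
    for col in data.columns:
        s = data[col]
        counts = s.value_counts(dropna=True)   # sorted, most frequent value first
        has_values = len(counts) > 0
        rows[col.lower()] = {
            '#Unique': s.nunique(dropna=False),                    # NaN counted as a value
            'TopValue': counts.index[0] if has_values else np.nan,
            '%UsedTop': round(100 * counts.iloc[0] / len(s), 2) if has_values else np.nan,
            '%Missing': round(100 * s.isna().mean(), 2),
            'Dtype': str(s.dtype),
        }
    return pd.DataFrame.from_dict(rows, orient='index')

# toy data, not the Titanic set
toy = pd.DataFrame({
    'Age': [22.0, 38.0, np.nan, 22.0],
    'Sex': ['male', 'female', 'female', 'female'],
})
toy_ddict = build_ddict(toy)
print(toy_ddict)
```

The manual columns can then be added in a spreadsheet, which is why I keep the dictionary as an Excel file.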

Column "Type" may take the following values:
1. Useless (useless for machine learning algorithms)
2. Nominal (truly categorical, labels or groups without order)
3. Binary (dichotomous - a type of nominal scale that contains only two categories)
4. Ordinal (groups with order)
5. Discrete (count data, the number of occurrences)
6. Cont (continuous with an absolute zero, but without a temporal component)
7. Interval (continuous without an absolute zero and without a temporal component)
8. Time (cyclical numbers with a temporal component; continuous)
9. Text
10. Image
11. Audio
12. Video

Creating this kind of table does not take long, but it saves much more time later and gives me the feeling of TRULY understanding the data.
I hope you feel the same.

# Data engineering

#### useless features

We call a feature "useless" if it:
- has a single unique value
- brings no information to the algorithms, like "PassengerId"
- has too many missing values, like "Cabin"
- is highly correlated with another feature
- has zero importance in a tree-based model
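
Three of these checks are easy to automate. The sketch below is one possible way to do it, not the procedure used in this notebook: the helper name `find_useless`, the thresholds `max_missing` and `max_corr`, and the toy frame are all assumptions. Identifier-like columns and zero-importance features still need a manual look or a fitted model.

```python
import numpy as np
import pandas as pd

def find_useless(data: pd.DataFrame, max_missing: float = 0.5, max_corr: float = 0.95) -> dict:
    """Return candidate columns to drop, keyed by reason (hypothetical helper)."""
    single_value = [c for c in data.columns if data[c].nunique(dropna=False) <= 1]
    too_missing = [c for c in data.columns if data[c].isna().mean() > max_missing]

    # absolute pairwise correlations of numeric columns; keep the upper triangle
    # so that only the later column of each correlated pair is reported
    corr = data.select_dtypes(include='number').corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    correlated = [c for c in upper.columns if (upper[c] > max_corr).any()]

    return {'single_value': single_value, 'too_missing': too_missing, 'correlated': correlated}

# toy data, not the Titanic set
toy = pd.DataFrame({
    'const': [1, 1, 1, 1, 1],
    'cabin': [np.nan, np.nan, np.nan, 'C85', np.nan],
    'fare': [7.25, 71.28, 7.92, 53.10, 8.05],
    'fare_x2': [14.50, 142.56, 15.84, 106.20, 16.10],
})
report = find_useless(toy)
print(report)
```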

In [9]:
col1 = test_raw['PassengerId'] # will need for a submission
df.drop(['PassengerId', 'Cabin'], axis=1, inplace=True)

#### missing values

In [10]:
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Fare'] = df['Fare'].fillna(df['Fare'].mean())
df['Embarked'] = df['Embarked'].fillna('S')
df.isnull().sum().to_frame().T

Unnamed: 0,Age,Embarked,Fare,Name,Parch,Pclass,Sex,SibSp,Survived,Ticket
0,0,0,0,0,0,0,0,0,418,0


#### feature engineering

In [11]:
#family sizes
df['Fsize'] =  df['Parch'] + df['SibSp'] + 1

In [12]:
#titles
df['Surname'], df['Name'] = zip(*df['Name'].apply(lambda x: x.split(',')))
df['Title'], df['Name'] = zip(*df['Name'].apply(lambda x: x.split('.')))

titles = (df['Title'].value_counts() < 10)
df['Title'] = df['Title'].apply(lambda x: ' Misc' if titles.loc[x] == True else x)
df['Title'].value_counts().to_frame().T

Unnamed: 0,Mr,Miss,Mrs,Master,Misc
Title,757,260,197,61,34
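
The same split can also be written with a regular expression, which may be easier to read. This is an alternative sketch on a few sample names, not the code that produced the table above; `min_count` is an illustrative threshold, and the titles come out without the leading space that `split(',')` leaves behind.

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Allen, Mr. William Henry',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
])

# Surname = text before the first comma
# Title   = text between that comma and the first period
parts = names.str.extract(r'^(?P<Surname>[^,]+),\s*(?P<Title>[^.]+)\.')
raw_titles = parts['Title'].copy()

# group rare titles under one label (threshold chosen for this toy sample)
min_count = 2
counts = parts['Title'].value_counts()
rare = counts[counts < min_count].index
parts['Title'] = parts['Title'].where(~parts['Title'].isin(rare), 'Misc')
print(parts)
```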


In [13]:
# ticket set (how many people travel on one ticket)
# if one ticket has family members only (a single surname), it's "monotonic"; if not - "mixed"
df['Tname'] = df['Ticket']
df['Tset'] = ''

for t in df['Tname'].unique():
    on_ticket = (df['Tname'] == t)
    # assign through df.loc to avoid chained assignment
    if df.loc[on_ticket, 'Surname'].nunique() != 1:
        df.loc[on_ticket, 'Tset'] = 'mixed'
    else:
        df.loc[on_ticket, 'Tset'] = 'monotonic'
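
The loop over tickets can be replaced by a single `groupby`, which is usually faster on larger data. Below is a sketch on toy data; the ticket numbers and surnames are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Ticket':  ['113803', '113803', 'PC 17599', 'PC 17599', '347082'],
    'Surname': ['Futrelle', 'Futrelle', 'Cumings', 'Smith', 'Andersson'],
})

# number of distinct surnames sharing each ticket, broadcast back to every row
surnames_per_ticket = toy.groupby('Ticket')['Surname'].transform('nunique')
toy['Tset'] = np.where(surnames_per_ticket == 1, 'monotonic', 'mixed')
print(toy)
```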

In [14]:
#price per person: fare divided by the number of passengers on the same ticket
for t in df['Ticket'].unique():
    df['Ticket'].loc[(df['Ticket']==t)] = len(df.loc[(df['Ticket']==t)]) 
df['Price'] = df['Fare'] / df['Ticket']
#renaming "Ticket"
df.rename(columns={'Ticket':'Tgroup'}, inplace=True)
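
The group size and the per-person price can also be computed without a loop. This sketch uses toy values and assumes, as the cell above does, that 'Fare' is the amount paid for the whole ticket.

```python
import pandas as pd

toy = pd.DataFrame({
    'Ticket': ['113803', '113803', 'PC 17599', '347082'],
    'Fare':   [53.10, 53.10, 71.28, 7.92],
})

# how many passengers share each ticket, looked up for every row
toy['Tgroup'] = toy['Ticket'].map(toy['Ticket'].value_counts())
# split the ticket fare evenly between the passengers on it
toy['Price'] = toy['Fare'] / toy['Tgroup']
print(toy)
```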

In [15]:
#dropping the features that are no longer needed
df.drop(['Parch', 'SibSp', 'Name', 'Surname', 'Tname', 'Fare'], axis=1, inplace=True)

In [16]:
df.head(2)

Unnamed: 0,Age,Embarked,Pclass,Sex,Survived,Tgroup,Fsize,Title,Tset,Price
0,22.0,S,3,male,0.0,1,2,Mr,monotonic,7.25
1,38.0,C,1,female,1.0,2,2,Mrs,monotonic,35.64


#### Transforming

In [17]:
df.dtypes.to_frame().sort_values([0]).T

Unnamed: 0,Pclass,Tgroup,Fsize,Age,Survived,Price,Embarked,Sex,Title,Tset
0,int64,int64,int64,float64,float64,float64,object,object,object,object


In [18]:
#encode categorical data
label = LabelEncoder()
cols = df.dtypes[df.dtypes == 'object'].index.tolist()
for col in cols:
    df[col] = label.fit_transform(df[col])

In [19]:
#binning
df['Price'] = pd.qcut(df['Price'], 4)
df['Age'] = pd.cut(df['Age'].astype(int), 5)

In [20]:
#encode the binned data
df['Age'] = label.fit_transform(df['Age'])
df['Price'] = label.fit_transform(df['Price'])
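
The two binning functions behave differently: `pd.cut` makes bins of equal width, while `pd.qcut` puts the bin edges at quantiles so that each bin holds roughly the same number of rows. A small sketch on made-up ages shows the difference. With `labels=False` both return integer bin codes directly, which could stand in for the separate `LabelEncoder` step.

```python
import pandas as pd

ages = pd.Series([1, 5, 20, 22, 24, 30, 35, 40, 60, 80])

# equal-width bins over the value range; row counts per bin can differ a lot
equal_width = pd.cut(ages, 4, labels=False)
# quantile-based bins; row counts per bin are roughly balanced
equal_count = pd.qcut(ages, 4, labels=False)

width_counts = equal_width.value_counts().sort_index().tolist()
count_counts = equal_count.value_counts().sort_index().tolist()
print('cut :', width_counts)
print('qcut:', count_counts)
```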

In [21]:
df.head()

Unnamed: 0,Age,Embarked,Pclass,Sex,Survived,Tgroup,Fsize,Title,Tset,Price
0,1,2,3,1,0.0,1,2,3,1,0
1,2,0,1,0,1.0,2,2,4,1,3
2,1,2,3,0,1.0,1,1,2,1,1
3,2,2,1,0,1.0,2,2,4,1,3
4,2,2,3,1,0.0,1,1,3,1,1


## Splitting back to train and test

In [22]:
a = len(train_raw)
# .copy() so that later in-place changes do not act on slices of df
train = df[:a].copy()
test = df[a:].copy()

In [23]:
train_raw.shape[0] == train.shape[0]

True

In [24]:
test.drop(['Survived'], axis=1, inplace=True)
test_raw.shape[0] == test.shape[0]

True

# Modeling

In [25]:
X = train.drop(['Survived'], axis=1).columns.to_list()
y = ['Survived']

In [26]:
#Machine Learning Algorithm initialization
MLA = [ #Ensemble Methods
        ensemble.AdaBoostClassifier(), ensemble.BaggingClassifier(), ensemble.ExtraTreesClassifier(),
        ensemble.GradientBoostingClassifier(), ensemble.RandomForestClassifier(),
        #Gaussian Processes
        gaussian_process.GaussianProcessClassifier(),
        #GLM
        linear_model.LogisticRegressionCV(), linear_model.PassiveAggressiveClassifier(),
        linear_model.RidgeClassifierCV(), linear_model.SGDClassifier(), linear_model.Perceptron(),
        #Naive Bayes
        naive_bayes.BernoulliNB(), naive_bayes.GaussianNB(),
        #Nearest Neighbor
        neighbors.KNeighborsClassifier(),
        #SVM
        svm.SVC(probability=True), svm.NuSVC(probability=True), svm.LinearSVC(),
        #Trees    
        tree.DecisionTreeClassifier(), tree.ExtraTreeClassifier(),
        #Discriminant Analysis
        discriminant_analysis.LinearDiscriminantAnalysis(), discriminant_analysis.QuadraticDiscriminantAnalysis(),
        #xgboost
        XGBClassifier() ]

In [27]:
cv_split = model_selection.ShuffleSplit(n_splits = 10, test_size = .3, train_size = .6, random_state = 0)

mla = pd.DataFrame(columns=['Name','TestScore','ScoreTime','FitTime','Parameters'])
prediction = train[y].copy()  # copy, because prediction columns are added to it below

i = 0
for alg in MLA:
    name = alg.__class__.__name__
    mla.loc[i, 'Name'] = name
    mla.loc[i, 'Parameters'] = str(alg.get_params())
    
    cv_results = model_selection.cross_validate(alg, train[X], train[y], cv=cv_split)
        
    mla.loc[i, 'FitTime'] = cv_results['fit_time'].mean()
    mla.loc[i, 'ScoreTime'] = cv_results['score_time'].mean()
    mla.loc[i, 'TestScore'] = cv_results['test_score'].mean()

    alg.fit(train[X], train[y])
    prediction[name] = alg.predict(train[X])    
    i += 1

mla = mla.sort_values('TestScore', ascending=False).reset_index(drop=True)
mla

Unnamed: 0,Name,TestScore,ScoreTime,FitTime,Parameters
0,GradientBoostingClassifier,0.83,0.0,0.05,"{'criterion': 'friedman_mse', 'init': None, 'learning_rate': 0.1, 'loss': 'deviance', 'max_depth': 3, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_iter_no_change': None, 'presort': 'auto', 'random_state': None, 'subsample': 1.0, 'tol': 0.0001, 'validation_fraction': 0.1, 'verbose': 0, 'warm_start': False}"
1,SVC,0.83,0.0,0.03,"{'C': 1.0, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 'auto_deprecated', 'kernel': 'rbf', 'max_iter': -1, 'probability': True, 'random_state': None, 'shrinking': True, 'tol': 0.001, 'verbose': False}"
2,XGBClassifier,0.83,0.0,0.02,"{'base_score': 0.5, 'booster': 'gbtree', 'colsample_bylevel': 1, 'colsample_bynode': 1, 'colsample_bytree': 1, 'gamma': 0, 'learning_rate': 0.1, 'max_delta_step': 0, 'max_depth': 3, 'min_child_weight': 1, 'missing': None, 'n_estimators': 100, 'n_jobs': 1, 'nthread': None, 'objective': 'binary:logistic', 'random_state': 0, 'reg_alpha': 0, 'reg_lambda': 1, 'scale_pos_weight': 1, 'seed': None, 'silent': None, 'subsample': 1, 'verbosity': 1}"
3,NuSVC,0.82,0.0,0.04,"{'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 'auto_deprecated', 'kernel': 'rbf', 'max_iter': -1, 'nu': 0.5, 'probability': True, 'random_state': None, 'shrinking': True, 'tol': 0.001, 'verbose': False}"
4,BaggingClassifier,0.82,0.0,0.01,"{'base_estimator': None, 'bootstrap': True, 'bootstrap_features': False, 'max_features': 1.0, 'max_samples': 1.0, 'n_estimators': 10, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}"
5,AdaBoostClassifier,0.82,0.01,0.06,"{'algorithm': 'SAMME.R', 'base_estimator': None, 'learning_rate': 1.0, 'n_estimators': 50, 'random_state': None}"
6,QuadraticDiscriminantAnalysis,0.82,0.0,0.0,"{'priors': None, 'reg_param': 0.0, 'store_covariance': False, 'tol': 0.0001}"
7,RandomForestClassifier,0.81,0.0,0.01,"{'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 'warn', 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}"
8,DecisionTreeClassifier,0.81,0.0,0.0,"{'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'presort': False, 'random_state': None, 'splitter': 'best'}"
9,ExtraTreesClassifier,0.81,0.0,0.01,"{'bootstrap': False, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 'warn', 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}"
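
A note on the validation scheme behind these scores: with `test_size=.3` and `train_size=.6`, every one of the 10 shuffles leaves about 10% of the rows unused. The sketch below only counts rows for a data set of the same length as the training set; `cv_demo` and `rows` are illustrative names.

```python
import numpy as np
from sklearn import model_selection

cv_demo = model_selection.ShuffleSplit(n_splits=10, test_size=.3, train_size=.6, random_state=0)
rows = np.arange(891)  # same number of rows as the training set

# (train rows, test rows) for every shuffle
sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in cv_demo.split(rows)]
unused = len(rows) - sum(sizes[0])
print(sizes[0], 'unused rows per split:', unused)
```

Because the test parts of different shuffles can overlap, the mean 'TestScore' is an average over repeated random hold-outs, not a classic k-fold estimate.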


#### parameters tuning

In [28]:
# assign parameters
param_grid = {'criterion': ['gini', 'entropy'],  #default is gini
              #'splitter': ['best', 'random'], #default is best
              'max_depth': [2,4,6,8,10,None], #default is none
              #'min_samples_split': [2,5,10,.03,.05], #default is 2
              #'min_samples_leaf': [1,5,10,.03,.05], #default is 1
              #'max_features': [None, 'auto'], #default none or all
              'random_state': [0]}

#choose best model with grid_search
tune_model = model_selection.GridSearchCV(tree.DecisionTreeClassifier(), param_grid=param_grid, scoring='roc_auc', cv=cv_split)
tune_model.fit(train[X], train[y])
print('Parameters: ', tune_model.best_params_)

Parameters:  {'criterion': 'entropy', 'max_depth': 4, 'random_state': 0}


#### feature selection

In [29]:
clf = tree.DecisionTreeClassifier()
results = model_selection.cross_validate(clf, train[X], train[y], cv=cv_split)
clf.fit(train[X], train[y])
results['test_score'].mean()*100

#feature selection
fs = feature_selection.RFECV(clf, step=1, scoring='accuracy', cv=cv_split)
fs.fit(train[X], train[y])

#transform x and y to fit a new model
X = train[X].columns.values[fs.get_support()]
results = model_selection.cross_validate(clf, train[X], train[y], cv=cv_split)

print('Shape New: ', train[X].shape) 
print('Features to use: ', X)

Shape New:  (891, 6)
Features to use:  ['Pclass' 'Sex' 'Tgroup' 'Fsize' 'Title' 'Price']


In [30]:
#tune parameters
tuned = model_selection.GridSearchCV(tree.DecisionTreeClassifier(), param_grid=param_grid, scoring = 'roc_auc', cv=cv_split)
tuned.fit(train[X], train[y])

param_grid = tuned.best_params_
param_grid

{'criterion': 'gini', 'max_depth': 4, 'random_state': 0}

In [31]:
clf = ensemble.GradientBoostingClassifier()
results = model_selection.cross_validate(clf, train[X], train[y], cv=cv_split)
clf.fit(train[X], train[y])
results['test_score'].mean()*100

83.35820895522389

## Prediction

In [32]:
test['Survived'] = clf.predict(test[X])

## Submission

In [33]:
submit = pd.DataFrame({ 'PassengerId' : col1, 'Survived': test['Survived'] }).set_index('PassengerId')
submit['Survived'] = submit['Survived'].astype('int')
submit.to_csv('...TitanicSubm.csv')

This simple method gives me a score of 0.799 on Kaggle, which is in the top 12% as of today.

I will be glad to see any questions or suggestions from you. Criticism is welcome too.  
Many thanks for your time.  
Best,  
Lana  