# Basic Overview
This is an attempt to see whether Oscar's ideas, listed in the comments [here](https://www.kaggle.com/aashita/xgboost-model-with-minimalistic-features) (about analysing names with a machine learning algorithm), would work.

The idea is to fit a simple count vectorizer separately on the names of passengers who survived and on the names of those who did not, and then to classify a name as survived or not depending on how much it has in common with the survived and not-survived vocabularies.
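
A toy illustration of this idea, assuming a handful of hand-picked names rather than the real training split. The helper `overlap` and the variable names are hypothetical and not part of the notebook:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Small illustrative groups of names (an assumption, not the real split).
survived_names = ["Heikkinen, Miss. Laina", "Bonnell, Miss. Elizabeth"]
not_survived_names = ["Braund, Mr. Owen Harris", "Allen, Mr. William Henry"]

# One vocabulary per group.
survived_vocab = CountVectorizer().fit(survived_names)
not_survived_vocab = CountVectorizer().fit(not_survived_names)

def overlap(name, vectorizer):
    # transform() ignores tokens outside the fitted vocabulary, so the sum
    # counts only the tokens of `name` that the vocabulary already contains.
    return int(vectorizer.transform([name]).sum())

new_name = "Moran, Mr. James"
survived_overlap = overlap(new_name, survived_vocab)
not_survived_overlap = overlap(new_name, not_survived_vocab)
print(survived_overlap, not_survived_overlap)
```

Here the only shared token is `mr`, which appears in the not-survived vocabulary alone, so the name has more in common with that group.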

Comments, criticisms and appreciation are all welcome. Do not be shy; send me an email at babinu@gmail.com!

Source of data : https://www.kaggle.com/c/titanic/data

In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
# Workaround: let the process continue when more than one copy of the OpenMP runtime is loaded.
import os
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"

In [4]:
train_data = pd.read_csv("../input/train.csv")
test_data = pd.read_csv("../input/test_data_processed_correct.csv")

In [5]:
test_data.Survived.unique()

array([0., 1.])

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

In [7]:
train_data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [8]:
train_data['Name'].values

array(['Braund, Mr. Owen Harris',
       'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
       'Heikkinen, Miss. Laina',
       'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
       'Allen, Mr. William Henry', 'Moran, Mr. James',
       'McCarthy, Mr. Timothy J', 'Palsson, Master. Gosta Leonard',
       'Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)',
       'Nasser, Mrs. Nicholas (Adele Achem)',
       'Sandstrom, Miss. Marguerite Rut', 'Bonnell, Miss. Elizabeth',
       'Saundercock, Mr. William Henry', 'Andersson, Mr. Anders Johan',
       'Vestrom, Miss. Hulda Amanda Adolfina',
       'Hewlett, Mrs. (Mary D Kingcome) ', 'Rice, Master. Eugene',
       'Williams, Mr. Charles Eugene',
       'Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)',
       'Masselmani, Mrs. Fatima', 'Fynney, Mr. Joseph J',
       'Beesley, Mr. Lawrence', 'McGowan, Miss. Anna "Annie"',
       'Sloper, Mr. William Thompson', 'Palsson, Miss. Torborg Danira',
       'Asplund, Mrs. Carl Oscar 

In [9]:
# Gets a vocabulary using all entries from the 'Name' column from the given dataframe.
def get_names_count_vector(given_data, count_vectorizer=True):
    if count_vectorizer:
        vectorizer = CountVectorizer()
    else:
        vectorizer = TfidfVectorizer()
    vectorizer.fit(given_data['Name'].values)
    return vectorizer

In [10]:
# Compares a string with a master vocabulary that is passed on and evaluates a comparison score.
def get_relevant_score(given_text, vector_obj):
    fit_cur_val = vector_obj.transform([given_text])
    score = np.sum(1.0/(1.0 + fit_cur_val.todense()))    
    #score = np.sum(fit_cur_val)
    #print(score)
    return score
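
It may help to see how this score behaves on a toy case. Every vocabulary word that is absent from the text contributes 1/(1+0) = 1 and every matching word contributes less than 1, so the score stays close to the vocabulary size and goes down as more words match. The sketch below uses made-up vocabularies and a hypothetical helper `inverse_count_score` that mirrors `get_relevant_score`:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Two toy vocabularies of different sizes (illustrative names only).
small_vocab = CountVectorizer().fit(["Heikkinen, Miss. Laina"])        # 3 tokens
large_vocab = CountVectorizer().fit(["Braund, Mr. Owen Harris",
                                     "Allen, Mr. William Henry"])      # 7 tokens

def inverse_count_score(text, vectorizer):
    # Same formula as get_relevant_score, using a plain ndarray.
    counts = vectorizer.transform([text]).toarray()
    return float(np.sum(1.0 / (1.0 + counts)))

name = "Heikkinen, Miss. Laina"
small_score = inverse_count_score(name, small_vocab)   # three matches, 0.5 each
large_score = inverse_count_score(name, large_vocab)   # no matches, 1.0 per vocabulary word
print(small_score, large_score)
```

In this toy case the name scores higher against the vocabulary it shares nothing with, simply because that vocabulary is larger. Whether the same effect dominates on the real data depends on the sizes of the two fitted vocabularies.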

In [11]:
# Generates vocabularies for survived and not survived cases separately and 
# generates predictions on test data accordingly.
def update_predictions_out_of_sample(train_data, test_data, count_vectorizer=True):
    survived_vector = get_names_count_vector(train_data[train_data['Survived'] == 1], count_vectorizer)
    not_survived_vector = get_names_count_vector(train_data[train_data['Survived'] == 0], count_vectorizer)        

    test_data['Survived_score'] = test_data['Name'].apply(lambda x : get_relevant_score(x, survived_vector)) 
    test_data['Not_survived_score'] = test_data['Name'].apply(lambda x : get_relevant_score(x, not_survived_vector))
    test_data['Predictions'] = test_data['Survived_score'] > test_data['Not_survived_score']        
    

In [12]:
# The bread and butter routine for cross validation.
def get_cross_val_output_name_analysis(input_df, nfolds=5, count_vectorizer=True):
    partition_indices = np.array_split(np.arange(len(input_df)), nfolds)
    cross_validated_data = pd.DataFrame(columns=input_df.columns)
    for i in range(nfolds):
        cross_validated_set = input_df[partition_indices[i][0]:partition_indices[i][-1] + 1].copy()
        rel_training_data = pd.DataFrame(columns=input_df.columns)
        for j in range(nfolds):
            if j != i:
                training_set = input_df[partition_indices[j][0]:partition_indices[j][-1] + 1]
                rel_training_data = pd.concat([rel_training_data, training_set])

        # Make sure that a copy is made, so that the original dataset is untouched.
        rel_training_data_v1 = rel_training_data.copy()
        update_predictions_out_of_sample(rel_training_data_v1, cross_validated_set, count_vectorizer)
        
        cross_validated_data = pd.concat([cross_validated_data, cross_validated_set])

    return cross_validated_data
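
As a cross-check on the manual partitioning, the contiguous folds produced by `np.array_split` appear to coincide with those of scikit-learn's `KFold` when shuffling is left off. A minimal sketch, assuming 891 rows (the usual size of the Titanic `train.csv`):

```python
import numpy as np
from sklearn.model_selection import KFold

n_rows, nfolds = 891, 5   # row count is an assumption about train.csv

# Folds as built in get_cross_val_output_name_analysis.
manual_folds = np.array_split(np.arange(n_rows), nfolds)

# Held-out indices from KFold without shuffling.
kfold_folds = [test_idx for _, test_idx in KFold(n_splits=nfolds).split(np.arange(n_rows))]

same_partition = all(np.array_equal(a, b) for a, b in zip(manual_folds, kfold_folds))
print(same_partition, [len(fold) for fold in manual_folds])
```

Using `KFold` would also hand back the training indices for each fold, which removes the need for the inner loop that concatenates the other partitions.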
    

### In sample performance.
First we fit the vocabularies on the full training data and score that same data. This should flatter the method, since the vocabularies are built from the very names being scored.

In [13]:
update_predictions_out_of_sample(train_data, train_data)
(train_data['Predictions'] == train_data['Survived']).sum()*100/(len(train_data))

61.61616161616162

### Cross validation.
This time, let us do a 5-fold cross validation and see how things look.

In [14]:
cross_validated_data = get_cross_val_output_name_analysis(train_data)
(cross_validated_data['Predictions'] == cross_validated_data['Survived']).sum()/len(cross_validated_data)

0.6161616161616161

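Comments : The cross-validated accuracy (about 61.6%) is the same as the in-sample figure rather than lower. It also equals the share of non-survivors in the training data (549 of 891), which suggests that every passenger may be getting classified as not survived.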
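
One quick sanity check for accuracy figures like these is to compare them with the majority-class baseline and to look at whether the predictions vary at all. A minimal sketch on a made-up frame (the helper `majority_class_share` and the toy data are assumptions, not part of the notebook):

```python
import pandas as pd

def majority_class_share(labels):
    """Accuracy obtained by always predicting the most common label."""
    return labels.value_counts(normalize=True).max()

# Toy frame: three non-survivors, two survivors, and a constant prediction.
toy = pd.DataFrame({"Survived": [0, 0, 0, 1, 1],
                    "Predictions": [False] * 5})

accuracy = (toy["Predictions"] == toy["Survived"]).mean()
baseline = majority_class_share(toy["Survived"])
constant_predictions = toy["Predictions"].nunique() == 1
print(accuracy, baseline, constant_predictions)
```

When the accuracy matches the baseline and the predictions are constant, the classifier has learned nothing beyond the class balance.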

### Sidetrack : A quick look at the vectorizers


#### TfidfVectorizer

In [15]:

text = ["The quick brown fox jumped over the lazy dog.",
        "The dog.",
        "The fox"]

In [16]:
# create the transform
vectorizer_tfid = TfidfVectorizer(use_idf=False)
# tokenize and build vocab
vectorizer_tfid.fit(text)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=False,
        vocabulary=None)

In [17]:
vectorizer_tfid.vocabulary_

{'brown': 0,
 'dog': 1,
 'fox': 2,
 'jumped': 3,
 'lazy': 4,
 'over': 5,
 'quick': 6,
 'the': 7}

In [18]:
vector_tfid_transform = vectorizer_tfid.transform([text[1]]) 

In [19]:
vector_tfid_transform.todense()

matrix([[0.        , 0.70710678, 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.70710678]])
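
The two values of 0.70710678 come from the L2 normalisation that `TfidfVectorizer` applies by default (`norm='l2'`). With `use_idf=False`, the row of raw counts is simply divided by its Euclidean length, here √2. A small check that reuses the same three sentences:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

text = ["The quick brown fox jumped over the lazy dog.",
        "The dog.",
        "The fox"]

# Raw counts for "The dog." and their L2-normalised version.
counts = CountVectorizer().fit(text).transform([text[1]]).toarray().astype(float)
manual = counts / np.linalg.norm(counts, axis=1, keepdims=True)

# The same row as produced by TfidfVectorizer without idf weighting.
tfidf = TfidfVectorizer(use_idf=False).fit(text).transform([text[1]]).toarray()

print(np.allclose(manual, tfidf))
```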

#### CountVectorizer

In [20]:
vectorizer_count = CountVectorizer()
vectorizer_count.fit(text)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [21]:
vectorizer_count.vocabulary_

{'brown': 0,
 'dog': 1,
 'fox': 2,
 'jumped': 3,
 'lazy': 4,
 'over': 5,
 'quick': 6,
 'the': 7}

In [22]:
vector_count_transform = vectorizer_count.transform([text[1]]) 

In [23]:
vector_count_transform.todense()

matrix([[0, 1, 0, 0, 0, 0, 0, 1]])

In [24]:
np.sum(vector_count_transform)

2

### Repeat the same analysis using TfidfVectorizer

In [25]:
update_predictions_out_of_sample(train_data, train_data, count_vectorizer=False)
(train_data['Predictions'] == train_data['Survived']).sum()*100/(len(train_data))

61.61616161616162

In [26]:
cross_validated_data = get_cross_val_output_name_analysis(train_data, count_vectorizer=False)
(cross_validated_data['Predictions'] == cross_validated_data['Survived']).sum() * 100/len(cross_validated_data)

61.61616161616162

### Comment

The first method performs reasonably well, and it takes into account just the presence of the word in the training data. (NOTE : The index, however, is linked to the frequency of the word's occurrence in the document, as detailed [here](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html).)

The second method does not perform very well; it takes into account the frequency of the word and inverts it.

This suggests that we should look for a method where the weights increase with the frequency of occurrence of the word (rather than being inverted, as in the second method).
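
One way such a frequency-weighted score could look is sketched below. This is only an assumption about what "increasing the weights" might mean: each token of a name is weighted by how often it occurs per name in a group. The helpers `fit_token_frequencies` and `frequency_score` and the toy names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def fit_token_frequencies(names):
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(names)
    # Occurrences of each vocabulary token per name in the group.
    freqs = np.asarray(counts.sum(axis=0)).ravel() / len(names)
    return vectorizer, freqs

def frequency_score(name, vectorizer, freqs):
    # Tokens outside the vocabulary are ignored by transform().
    row = vectorizer.transform([name])
    return float(row.dot(freqs)[0])

# Toy groups (illustrative only).
survived_names = ["Heikkinen, Miss. Laina", "Bonnell, Miss. Elizabeth",
                  "Cumings, Mrs. John Bradley"]
not_survived_names = ["Braund, Mr. Owen Harris", "Allen, Mr. William Henry",
                      "Moran, Mr. James"]

s_vec, s_freqs = fit_token_frequencies(survived_names)
ns_vec, ns_freqs = fit_token_frequencies(not_survived_names)

name = "Kelly, Mr. James"
s_score = frequency_score(name, s_vec, s_freqs)
ns_score = frequency_score(name, ns_vec, ns_freqs)
print(s_score, ns_score)
```

Dividing by the number of names keeps the two scores comparable when the groups differ in size. Naive Bayes, used in the next section, formalises a similar idea with smoothed per-class token probabilities.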

### Using Naive Bayes algorithm 

In [27]:
# We use a pipeline to make things easier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
clf = Pipeline([('vect', TfidfVectorizer()),
                ('transformer', TfidfTransformer()),
                ('classify', MultinomialNB())])

In [28]:
train_data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Survived_score',
       'Not_survived_score', 'Predictions'],
      dtype='object')

In [29]:
clf.fit(train_data['Name'][:700], train_data['Survived'][:700])

Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
  ..._tf=False, use_idf=True)), ('classify', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [30]:
predictions = clf.predict(train_data['Name'][700:]) 

In [31]:
survived_or_not = train_data['Survived'][700:]

In [32]:
np.mean(predictions == survived_or_not)

0.7696335078534031
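
To get a feel for which tokens a Naive Bayes model leans on, the per-class token log-probabilities can be compared. The sketch below is run on made-up names, not on the Titanic data, and uses plain counts instead of the tf-idf pipeline above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up names in the "Surname, Title. Given name" style (not real data).
toy_names = ["Smith, Mr. John", "Brown, Mr. Peter", "Jones, Mr. Alan", "Hill, Mr. Karl",
             "Smith, Miss. Anna", "Brown, Miss. Mary", "Jones, Mrs. Ellen", "Hill, Mrs. Ruth"]
toy_survived = [0, 0, 0, 0, 1, 1, 1, 1]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_names)
toy_nb = MultinomialNB().fit(toy_counts, toy_survived)

# feature_log_prob_[c] holds log P(token | class c); rows follow toy_nb.classes_.
log_ratio = toy_nb.feature_log_prob_[1] - toy_nb.feature_log_prob_[0]
token_scores = {token: float(log_ratio[index])
                for token, index in toy_vectorizer.vocabulary_.items()}

# Positive values lean towards "survived", negative towards "not survived".
for token, score in sorted(token_scores.items(), key=lambda item: item[1]):
    print(f"{token:>6s} {score:+.2f}")
```

In this toy set the titles carry the signal while the surnames, which appear in both classes, score zero. Whether titles play the same role on the real names would need to be checked on `train_data`.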

#### Optimize parameters using GridSearch

In [33]:
from sklearn.model_selection import GridSearchCV

In [34]:
parameters = {'vect__ngram_range' : [(1, 1), (2, 2), (3 , 3)],
              'transformer__use_idf' : (True, False),
              }

In [35]:
gs_clf = GridSearchCV(clf, parameters, n_jobs=-1)

In [36]:
gs_clf.fit(train_data['Name'], train_data['Survived'])

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
  ..._tf=False, use_idf=True)), ('classify', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'vect__ngram_range': [(1, 1), (2, 2), (3, 3)], 'transformer__use_idf': (True, False)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [37]:
# Make sure that unnecessary deprecation warnings are avoided.
# Thanks to https://stackoverflow.com/questions/49545947/sklearn-deprecationwarning-truth-value-of-an-array
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning)


In [38]:
cv_result = pd.DataFrame(gs_clf.cv_results_)

In [39]:
cv_result

| | mean_fit_time | mean_score_time | mean_test_score | mean_train_score | param_transformer__use_idf | param_vect__ngram_range | params | rank_test_score | split0_test_score | split0_train_score | split1_test_score | split1_train_score | split2_test_score | split2_train_score | std_fit_time | std_score_time | std_test_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.017531 | 0.005901 | 0.739618 | 0.95174 | True | (1, 1) | `{'transformer__use_idf': True, 'vect__ngram_ra...` | 2 | 0.734007 | 0.962963 | 0.737374 | 0.949495 | 0.747475 | 0.942761 | 0.002759 | 0.000394 | 0.005723 | 0.008399 |
| 1 | 0.025163 | 0.007973 | 0.679012 | 0.983726 | True | (2, 2) | `{'transformer__use_idf': True, 'vect__ngram_ra...` | 4 | 0.690236 | 0.983165 | 0.676768 | 0.988215 | 0.670034 | 0.979798 | 0.001941 | 0.001807 | 0.008399 | 0.003459 |
| 2 | 0.021897 | 0.011019 | 0.618406 | 0.996633 | True | (3, 3) | `{'transformer__use_idf': True, 'vect__ngram_ra...` | 5 | 0.622896 | 0.998316 | 0.616162 | 0.996633 | 0.616162 | 0.994949 | 0.002507 | 0.000424 | 0.003174 | 0.001375 |
| 3 | 0.017906 | 0.006593 | 0.758698 | 0.934905 | False | (1, 1) | `{'transformer__use_idf': False, 'vect__ngram_r...` | 1 | 0.754209 | 0.936027 | 0.750842 | 0.930976 | 0.771044 | 0.93771 | 0.001741 | 0.001414 | 0.008837 | 0.002861 |
| 4 | 0.022219 | 0.006543 | 0.680135 | 0.973625 | False | (2, 2) | `{'transformer__use_idf': False, 'vect__ngram_r...` | 3 | 0.700337 | 0.978114 | 0.673401 | 0.974747 | 0.666667 | 0.968013 | 0.001536 | 0.000244 | 0.014547 | 0.004199 |
| 5 | 0.014181 | 0.005105 | 0.618406 | 0.996633 | False | (3, 3) | `{'transformer__use_idf': False, 'vect__ngram_r...` | 5 | 0.622896 | 0.998316 | 0.616162 | 0.996633 | 0.616162 | 0.994949 | 0.003801 | 0.001078 | 0.003174 | 0.001375 |


In [40]:
gs_clf.best_score_

0.7586980920314254
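
Because `GridSearchCV` refits the best configuration on the full data by default (`refit=True`, as shown in the output above), the fitted search object can also be used for prediction directly, and the refitted pipeline is available as `best_estimator_`. A minimal sketch on made-up names; it uses `CountVectorizer` as the first step, which differs from the `TfidfVectorizer` used in the pipeline above:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up names and labels (not real data).
names = ["Smith, Mr. John", "Brown, Mr. Peter", "Jones, Mr. Alan", "Hill, Mr. Karl",
         "Smith, Miss. Anna", "Brown, Miss. Mary", "Jones, Mrs. Ellen", "Hill, Mrs. Ruth"]
survived = [0, 0, 0, 0, 1, 1, 1, 1]

toy_clf = Pipeline([("vect", CountVectorizer()),
                    ("transformer", TfidfTransformer()),
                    ("classify", MultinomialNB())])

# cv=2 only because the toy set is tiny.
toy_search = GridSearchCV(toy_clf, {"transformer__use_idf": (True, False)}, cv=2)
toy_search.fit(names, survived)

best_pipeline = toy_search.best_estimator_   # already refitted on all rows
toy_predictions = toy_search.predict(["Gray, Mr. Paul", "Gray, Miss. Jane"])
print(toy_search.best_params_, toy_predictions)
```

This avoids retyping the winning parameters by hand when building the final model.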

#### Generate the best model using Naive Bayes.


In [41]:
best_model_pipeline = Pipeline([('vect', TfidfVectorizer()),
                                ('transform', TfidfTransformer(use_idf=False)),
                                ('classify', MultinomialNB())])

In [42]:
best_model_pipeline.fit(train_data['Name'], train_data['Survived'])

Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
  ...
         use_idf=False)), ('classify', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [43]:
# Generate out of sample predictions
predictions = best_model_pipeline.predict(test_data['Name'])

In [44]:
test_data['Predictions'] = predictions

In [45]:
kaggle_data = test_data[['PassengerId', 'Predictions']].copy()
kaggle_data.rename(columns={'Predictions' : 'Survived'}, inplace=True)
kaggle_data.sort_values(by=['PassengerId']).to_csv('kaggle_out_names_naive_bayes.csv', index=False)
print("Out of sample score is {0}".format(np.sum(test_data['Predictions'] == test_data['Survived'])/len(test_data)))

Out of sample score is 0.7392344497607656


### Using SVC

In [46]:
# We use a pipeline to make things easier.
# SGDClassifier with hinge loss fits a linear SVM by stochastic gradient descent.
from sklearn.linear_model import SGDClassifier
svc_clf = Pipeline([('vect', TfidfVectorizer()),
                    ('transformer', TfidfTransformer()),
                    ('classify', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, 
                                               random_state=0, max_iter=5, tol=None))])

In [47]:
svc_clf.fit(train_data['Name'][:700], train_data['Survived'][:700])

Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
  ...lty='l2', power_t=0.5, random_state=0, shuffle=True,
       tol=None, verbose=0, warm_start=False))])

In [48]:
predictions = svc_clf.predict(train_data['Name'][700:]) 

In [49]:
survived_or_not = train_data['Survived'][700:]

In [50]:
np.mean(predictions == survived_or_not)

0.8010471204188482

### Finding optimal parameters using GridSearch

In [51]:
parameters = {'vect__ngram_range' : [(1, 1), (2, 2), (3 , 3)],
              'transformer__use_idf' : (True, False),
              'classify__alpha' : (1e-2, 1e-3),
              }

In [52]:
gs_clf = GridSearchCV(svc_clf, parameters, n_jobs=-1)

In [53]:
gs_clf.fit(train_data['Name'], train_data['Survived'])

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
  ...lty='l2', power_t=0.5, random_state=0, shuffle=True,
       tol=None, verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'vect__ngram_range': [(1, 1), (2, 2), (3, 3)], 'transformer__use_idf': (True, False), 'classify__alpha': (0.01, 0.001)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [54]:
cv_result = pd.DataFrame(gs_clf.cv_results_)

In [55]:
cv_result

| | mean_fit_time | mean_score_time | mean_test_score | mean_train_score | param_classify__alpha | param_transformer__use_idf | param_vect__ngram_range | params | rank_test_score | split0_test_score | split0_train_score | split1_test_score | split1_train_score | split2_test_score | split2_train_score | std_fit_time | std_score_time | std_test_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.018975 | 0.006868 | 0.637486 | 0.616162 | 0.01 | True | (1, 1) | `{'classify__alpha': 0.01, 'transformer__use_id...` | 6 | 0.626263 | 0.616162 | 0.632997 | 0.616162 | 0.653199 | 0.616162 | 0.002588 | 0.001415 | 0.011446 | 0.0 |
| 1 | 0.027952 | 0.008416 | 0.616162 | 0.616162 | 0.01 | True | (2, 2) | `{'classify__alpha': 0.01, 'transformer__use_id...` | 7 | 0.616162 | 0.616162 | 0.616162 | 0.616162 | 0.616162 | 0.616162 | 0.00248 | 0.002168 | 0.0 | 0.0 |
| 2 | 0.025147 | 0.008129 | 0.616162 | 0.616162 | 0.01 | True | (3, 3) | `{'classify__alpha': 0.01, 'transformer__use_id...` | 7 | 0.616162 | 0.616162 | 0.616162 | 0.616162 | 0.616162 | 0.616162 | 0.002497 | 0.002554 | 0.0 | 0.0 |
| 3 | 0.01574 | 0.004674 | 0.781145 | 0.84119 | 0.01 | False | (1, 1) | `{'classify__alpha': 0.01, 'transformer__use_id...` | 2 | 0.791246 | 0.848485 | 0.784512 | 0.821549 | 0.767677 | 0.853535 | 0.002803 | 0.000336 | 0.009912 | 0.01404 |
| 4 | 0.01626 | 0.006522 | 0.616162 | 0.616162 | 0.01 | False | (2, 2) | `{'classify__alpha': 0.01, 'transformer__use_id...` | 7 | 0.616162 | 0.616162 | 0.616162 | 0.616162 | 0.616162 | 0.616162 | 0.001064 | 0.000444 | 0.0 | 0.0 |
| 5 | 0.012839 | 0.004891 | 0.616162 | 0.616162 | 0.01 | False | (3, 3) | `{'classify__alpha': 0.01, 'transformer__use_id...` | 7 | 0.616162 | 0.616162 | 0.616162 | 0.616162 | 0.616162 | 0.616162 | 0.002766 | 0.001502 | 0.0 | 0.0 |
| 6 | 0.007987 | 0.002674 | 0.768799 | 0.993827 | 0.001 | True | (1, 1) | `{'classify__alpha': 0.001, 'transformer__use_i...` | 3 | 0.750842 | 0.994949 | 0.767677 | 0.991582 | 0.787879 | 0.994949 | 0.00025 | 6.3e-05 | 0.015141 | 0.001587 |
| 7 | 0.012981 | 0.004038 | 0.69697 | 0.998878 | 0.001 | True | (2, 2) | `{'classify__alpha': 0.001, 'transformer__use_i...` | 5 | 0.710438 | 0.998316 | 0.700337 | 1.0 | 0.680135 | 0.998316 | 0.004573 | 0.000948 | 0.012598 | 0.000794 |
| 8 | 0.008631 | 0.002867 | 0.616162 | 0.999439 | 0.001 | True | (3, 3) | `{'classify__alpha': 0.001, 'transformer__use_i...` | 7 | 0.622896 | 0.998316 | 0.609428 | 1.0 | 0.616162 | 1.0 | 0.000182 | 9.3e-05 | 0.005498 | 0.000794 |
| 9 | 0.009929 | 0.003433 | 0.799102 | 0.979237 | 0.001 | False | (1, 1) | `{'classify__alpha': 0.001, 'transformer__use_i...` | 1 | 0.791246 | 0.983165 | 0.811448 | 0.976431 | 0.794613 | 0.978114 | 0.002996 | 0.001268 | 0.008837 | 0.002861 |


In [56]:
gs_clf.best_params_

{'classify__alpha': 0.001,
 'transformer__use_idf': False,
 'vect__ngram_range': (1, 1)}
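
A dictionary like this can be applied to a fresh copy of the pipeline with `set_params`, instead of retyping each value. A minimal sketch, assuming a pipeline shaped like `svc_clf` above; the names `base_pipeline` and `configured` are hypothetical, and the starting `alpha` is deliberately different so the change is visible:

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

base_pipeline = Pipeline([("vect", TfidfVectorizer()),
                          ("transformer", TfidfTransformer()),
                          ("classify", SGDClassifier(loss="hinge", penalty="l2", alpha=1e-2,
                                                     random_state=0, max_iter=5, tol=None))])

# Values copied from the best_params_ output above.
best_params = {"classify__alpha": 0.001,
               "transformer__use_idf": False,
               "vect__ngram_range": (1, 1)}

# clone() gives an unfitted copy; set_params routes each "step__param" key to its step.
configured = clone(base_pipeline).set_params(**best_params)
print(configured.named_steps["classify"].alpha,
      configured.named_steps["transformer"].use_idf)
```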

### Find best model using SVC

In [57]:
# We use a pipeline to make things easier
from sklearn.linear_model import SGDClassifier
best_model_svc = Pipeline([('vect', TfidfVectorizer()),
                           ('transformer', TfidfTransformer(use_idf=False)),
                           ('classify', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, 
                                                      random_state=0, max_iter=5, tol=None))])

In [58]:
best_model_svc.fit(train_data['Name'], train_data['Survived'])

Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
  ...lty='l2', power_t=0.5, random_state=0, shuffle=True,
       tol=None, verbose=0, warm_start=False))])

In [59]:
# Generate out of sample predictions
predictions = best_model_svc.predict(test_data['Name'])

In [60]:
test_data['Predictions'] = predictions

In [61]:
kaggle_data = test_data[['PassengerId', 'Predictions']].copy()
kaggle_data.rename(columns={'Predictions' : 'Survived'}, inplace=True)
kaggle_data.sort_values(by=['PassengerId']).to_csv('kaggle_out_names_svc.csv', index=False)
print("Out of sample score is {0}".format(np.sum(test_data['Predictions'] == test_data['Survived'])/len(test_data)))

Out of sample score is 0.7870813397129187


### Use SVC with the Ticket field as well.

Since the SVC results on names are encouraging, let us apply the same pipeline to the `Ticket` field.

In [63]:
best_model_svc.fit(train_data['Ticket'], train_data['Survived'])

# Generate out of sample predictions
predictions = best_model_svc.predict(test_data['Ticket'])
test_data['Predictions'] = predictions

In [64]:
kaggle_data = test_data[['PassengerId', 'Predictions']].copy()
kaggle_data.rename(columns={'Predictions' : 'Survived'}, inplace=True)
kaggle_data.sort_values(by=['PassengerId']).to_csv('kaggle_out_tickets_svc.csv', index=False)
print("Out of sample score is {0}".format(np.sum(test_data['Predictions'] == test_data['Survived'])/len(test_data)))

Out of sample score is 0.69377990430622
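
It can be useful to look at what the default tokenizer makes of ticket strings. The default `token_pattern` (shown in the outputs above) keeps only runs of two or more word characters, so single-character prefixes are dropped and what remains is largely the ticket number itself. The strings below are examples in the style of the `Ticket` column, assumed here for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The analyzer is what the vectorizer uses to lowercase and split each string.
analyzer = TfidfVectorizer().build_analyzer()

example_tickets = ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803"]
ticket_tokens = {ticket: analyzer(ticket) for ticket in example_tickets}

for ticket, tokens in ticket_tokens.items():
    print(f"{ticket!r:>20} -> {tokens}")
```

If most tickets reduce to a number that rarely repeats, the model has few shared tokens to learn from, which is one possible reason for the lower score here. A character-level analyzer would be one alternative to try.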
