# Logistic Regression w/ categorical features

I read in an answer to [this](https://www.quora.com/What-are-the-advantages-of-different-classification-algorithms) Quora question that:  
```
Tree Ensembles have different advantages over LR. One main advantage is that they do not expect linear features or even features that interact linearly. Something I did not mention in LR is that it can hardly handle categorical (binary) features. Tree Ensembles, because they are nothing more than a bunch of Decision Trees combined, can handle this very well.
```

This intrigued me because I couldn't see why, from a mathematical point of view, logistic regression would struggle with categorical features. So I tried it out on the Titanic dataset to see how it behaves compared to a tree ensemble (a random forest).


In [34]:
import pandas as pd
df = pd.read_csv("/Users/edouardcuny/Downloads/train.csv")

In [35]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [36]:
# delete the non-categorical columns: Name, Fare, Age, Ticket
del df['Name']
del df['Fare']
del df['Age']
del df['Ticket']

In [37]:
# drop every row that contains at least one missing value
df = df.dropna()
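
A blanket `dropna()` discards every row that has a missing value in any remaining column, so here it also removes passengers with no recorded `Cabin`. Counting missing values per column first shows how many rows such a call will cost. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Cabin": ["C85", np.nan, np.nan, "E46", np.nan],
    "Embarked": ["S", "C", np.nan, "S", "Q"],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Compare the row count before and after a blanket dropna()
rows_before = len(toy)
rows_after = len(toy.dropna())
print(f"{rows_before} rows before, {rows_after} rows after dropna()")
```

Running the same two checks on `df` before the `dropna()` cell would show how much of the dataset the rest of the notebook actually uses.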

In [38]:
# dummy encode other variables
df = pd.get_dummies(df, columns=['Pclass','Sex','SibSp','Parch','Cabin','Embarked'])
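
To make the encoding step concrete, here is a minimal sketch of what `pd.get_dummies` produces on a tiny made-up frame (`toy` is hypothetical). It also shows the `drop_first=True` option, which drops one level per variable; this is a common choice for unregularised linear models, since the full set of dummy columns for a variable always sums to 1.

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns
toy = pd.DataFrame({"Sex": ["male", "female", "female"], "Pclass": [3, 1, 3]})

# One indicator column per observed level of each listed variable
encoded = pd.get_dummies(toy, columns=["Sex", "Pclass"])
print(encoded.columns.tolist())

# Same encoding, but without the first level of each variable
reduced = pd.get_dummies(toy, columns=["Sex", "Pclass"], drop_first=True)
print(reduced.columns.tolist())
```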

In [39]:
# use PassengerId as the index (set_index removes it from the columns)
df.set_index('PassengerId', inplace=True)

In [40]:
# train & test set
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size = 0.3)

In [59]:
x_train = train.drop('Survived', axis=1)
y_train = train['Survived']
x_test = test.drop('Survived', axis=1)
y_test = test['Survived']

# full feature matrix and target, used for the cross-validation
x = df.drop('Survived', axis=1)
y = df['Survived']

In [69]:
# score with cross validation
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

clf = LogisticRegression()
print(np.mean(cross_val_score(clf, x, y, scoring='accuracy')))

clf = RandomForestClassifier()
print(np.mean(cross_val_score(clf, x, y, scoring='accuracy')))



0.76678550208
0.717320261438
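
The random forest above is not seeded, so its score will shift from run to run. One way to judge whether a gap between two models is meaningful is to fix the seeds and look at the spread across folds as well as the mean. The sketch below does this on synthetic binary features; `X_demo`, `y_demo` and the coefficients used to generate the labels are assumptions for illustration, not the Titanic data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data: 200 rows, 5 binary (0/1) features
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(200, 5))

# Labels depend on the first two features, plus sampling noise
logits = 1.5 * X_demo[:, 0] - 1.0 * X_demo[:, 1]
y_demo = (rng.random(200) < 1 / (1 + np.exp(-logits))).astype(int)

# Seeded, shuffled folds so both models see exactly the same splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = [
    ("logistic regression", LogisticRegression()),
    ("random forest", RandomForestClassifier(random_state=0)),
]

results = {}
for name, model in models:
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    results[name] = scores
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Passing the same `cv` object and `random_state` to the cell above with `x` and `y` would make the two printed accuracies reproducible.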


# Conclusion

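The models were not optimised in any way, but logistic regression works just fine on these purely categorical features: it reaches a cross-validated accuracy of about 0.77, against about 0.72 for the random forest.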
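
On the mathematical side, a dummy-encoded feature simply gives logistic regression one coefficient per level, which shifts the log-odds for that level. With a single binary feature, the intercept and the one coefficient are enough to match the positive rate of both groups. The sketch below checks this on made-up data (`x_bin`, `y_bin` and the group rates are hypothetical); it assumes a large `C` so that scikit-learn's default L2 penalty is negligible.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 100 rows with x=0 (20 positives) and 100 rows with x=1 (70 positives)
x_bin = np.array([0] * 100 + [1] * 100).reshape(-1, 1)
y_bin = np.array([1] * 20 + [0] * 80 + [1] * 70 + [0] * 30)

# A large C means very weak regularisation, close to plain maximum likelihood
model = LogisticRegression(C=1e6, max_iter=1000)
model.fit(x_bin, y_bin)

# Fitted probability of the positive class for each level of the feature
fitted = model.predict_proba(np.array([[0], [1]]))[:, 1]

# Observed positive rate within each level
empirical = np.array([y_bin[:100].mean(), y_bin[100:].mean()])

print("fitted:   ", fitted.round(3))
print("empirical:", empirical.round(3))
```

What a linear model cannot do without extra columns is represent interactions between categories (for example, an effect of `Sex` that differs by `Pclass`), which trees can pick up on their own.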

In [72]:
x.head()

| PassengerId | Pclass_1 | Pclass_2 | Pclass_3 | Sex_female | Sex_male | SibSp_0 | SibSp_1 | SibSp_2 | SibSp_3 | Parch_0 | ... | Cabin_F G73 | Cabin_F2 | Cabin_F33 | Cabin_F38 | Cabin_F4 | Cabin_G6 | Cabin_T | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 7 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 11 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 12 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
