### Titanic Dataset

In [1]:
import pandas as pd
from sklearn import naive_bayes
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

Read in the actual dataset of 887 Titanic passengers. 

In [2]:
df = pd.read_csv('data/titanic.csv')
print(df.shape)
df.head()

(887, 8)


Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05


### Question 1:
Create an input matrix containing the explanatory variables (including a one-hot matrix of the Sex column) and use it to predict the response variable, Survived, using the Naive-Bayes algorithm.

In [3]:
# one-hot encode the Sex column and swap it in for the original text column
one_hot = pd.get_dummies(df['Sex'])
df = df.drop(columns=['Sex'])
df = df.join(one_hot)
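
To make the encoding step concrete, here is a minimal sketch on a three-row toy frame (the `toy` and `encoded` names are illustrative, and the rows mirror the first three passengers shown above). Passing `dtype=int` is an optional choice made here so the indicator columns print as 0/1 instead of booleans.

```python
import pandas as pd

# Toy frame mirroring the Sex and Age values of the first three passengers
toy = pd.DataFrame({"Sex": ["male", "female", "female"],
                    "Age": [22.0, 38.0, 26.0]})

# One indicator column per category; columns come out in sorted order
one_hot = pd.get_dummies(toy["Sex"], dtype=int)

# Drop the text column and attach the indicator columns in its place
encoded = toy.drop(columns=["Sex"]).join(one_hot)
print(encoded)
```

The resulting columns are `Age`, `female`, `male`, which is why the indicator columns sit at the end of the feature matrix in the cells that follow.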

In [4]:
X = df.drop(columns=['Survived', 'Name'])
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

model = naive_bayes.GaussianNB()
model.fit(X_train,y_train)

print('train accuracy', model.score(X_train, y_train))
print('test accuracy', model.score(X_test, y_test))

train accuracy 0.7909774436090226
test accuracy 0.8153153153153153
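
Because `train_test_split` shuffles the rows randomly, the accuracies above will differ from run to run. The sketch below, on made-up data (`X_demo` and `y_demo` are hypothetical), shows how passing `random_state` makes a split repeatable. It is one possible way to get reproducible numbers, not something the notebook itself does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 rows, 2 features, alternating labels
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0, 1] * 10)

# Two splits with the same seed
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Compare the four returned arrays pairwise
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
print(split_a[1].shape)  # shape of the held-out feature matrix
```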


### Question 2:
How many dead passengers were incorrectly predicted to survive? How many survivors were incorrectly predicted to be deceased?

In [5]:
model.classes_  # 0 is dead 1 is survived

array([0, 1])

In [6]:
c = confusion_matrix(y_test, model.predict(X_test))
c

array([[112,  17],
       [ 24,  69]])

In [7]:
print(f"Out of the {c[0].sum()} who died, {c[0][1]} were incorrectly predicted to survive. Out of the {c[1].sum()} who survived, {c[1][0]} were incorrectly predicted to die.")

Out of the 129 who died, 17 were incorrectly predicted to survive. Out of the 93 who survived, 24 were incorrectly predicted to die.
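
As a cross-check on how the matrix is read, the sketch below rebuilds it from the printed values and unpacks the four cells by name. It assumes scikit-learn's layout (rows are true labels, columns are predicted labels) and treats `1` (survived) as the positive class.

```python
import numpy as np

# Confusion matrix copied from the output above
c = np.array([[112, 17],
              [24, 69]])

# Row-major order: true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = c.ravel()

print("died, predicted to survive:", fp)
print("survived, predicted to die:", fn)

# Correct predictions sit on the diagonal
accuracy = (tn + tp) / c.sum()
print("accuracy:", accuracy)
```

The accuracy computed this way should agree with the test accuracy reported by `model.score` earlier, since both are taken over the same test split.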


### Question 3:
Would you predict survival or death for a 3rd class, 18-year-old male passenger who had no family aboard and paid $1?

Would you predict survival or death for a 1st class, 18-year-old female passenger who had no family aboard and paid $50?

For each of these questions, also print the probability estimates using predict_proba.

In [8]:
df.columns

Index(['Survived', 'Pclass', 'Name', 'Age', 'Siblings/Spouses Aboard',
       'Parents/Children Aboard', 'Fare', 'female', 'male'],
      dtype='object')

In [9]:
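# feature order: Pclass, Age, Siblings/Spouses Aboard, Parents/Children Aboard, Fare, female, male
# each row returned by predict_proba is [P(died), P(survived)], following model.classes_
print('3rd class male [P(died), P(survived)]: ', model.predict_proba([[3,18,0,0,1,0,1]]))
print('1st class female [P(died), P(survived)]: ', model.predict_proba([[1,18,0,0,50,1,0]]))
print('the 3rd class male is predicted to die (about 98%), and the 1st class female is predicted to survive (about 99.6%)')

3rd class male [P(died), P(survived)]:  [[0.97977669 0.02022331]]
1st class female [P(died), P(survived)]:  [[0.004077 0.995923]]
the 3rd class male is predicted to die (about 98%), and the 1st class female is predicted to survive (about 99.6%)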
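
Positional lists are easy to get wrong when there are seven features. One alternative, sketched here, is to build the query rows as a named DataFrame so each value is visibly tied to its column. The `feature_names` list is copied from the `df.columns` output above (minus `Survived` and `Name`); the `queries` name and the row labels are illustrative.

```python
import pandas as pd

# Same order as the columns of X used to fit the model
feature_names = ['Pclass', 'Age', 'Siblings/Spouses Aboard',
                 'Parents/Children Aboard', 'Fare', 'female', 'male']

queries = pd.DataFrame(
    [[3, 18, 0, 0, 1, 0, 1],    # 3rd class, 18, no family, $1, male
     [1, 18, 0, 0, 50, 1, 0]],  # 1st class, 18, no family, $50, female
    columns=feature_names,
    index=['3rd class male', '1st class female'],
)
print(queries)

# With the fitted model from In [4], the call would be:
# model.predict_proba(queries)
```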


### Question 4: 
Return to the golf example from yesterday. Write a function called NaiveBayes that takes in an outlook, temp, humidity, and wind, and returns whether we predict that we will play golf or not. Within the function, print the probabilities of both yes and no. If you want to be fancy, you can make this function more general, but it's okay to make it very specific to the golf example.

Here is an example of correct output:
```python
NaiveBayes('Rainy', 'Hot', 'Normal', True)
```

```
Prob Yes: 0.6067961165048542
Prob No: 0.3932038834951457
```



In [10]:
# load the golf dataset
df = pd.read_csv("data/golf.csv")
df.head()

Unnamed: 0,Outlook,Temp,Humidity,Windy,Play Golf
0,Rainy,Hot,High,False,No
1,Sunny,Hot,High,True,No
2,Overcast,Hot,High,False,Yes
3,Rainy,Mild,High,False,Yes
4,Sunny,Cool,Normal,False,Yes


In [11]:
#  worked on this with the group


def NaiveBayes(outlook, temp, humidity, windy, df):
    """
    inputs: outlook, temperature, humidity, wind, and the dataframe
    outputs: prints the class probabilities and returns the predicted
             label ('Yes' or 'No')
    """
    y = df['Play Golf']
    X = df.drop(columns=['Play Golf'])
    counts = y.value_counts()
    #  start from the class priors P(Yes) and P(No)
    yes = counts['Yes'] / y.count()
    no = counts['No'] / y.count()
    v = [outlook, temp, humidity, windy]

    #  multiply in P(value|class) for each feature; zip pairs every value
    #  with its own column (looking it up with v.index breaks on repeated values)
    for col, v_ in zip(X.columns, v):
        y_ = (X.loc[y == 'Yes', col] == v_).sum() / counts['Yes']
        n_ = (X.loc[y == 'No', col] == v_).sum() / counts['No']
        yes *= y_
        no *= n_
        print(f"P({v_}|Yes): {y_}")
        print(f"P({v_}|No): {n_}")

    #  normalizing values so the two probabilities sum to 1
    yes_percent = yes / (yes + no)
    no_percent = no / (yes + no)
    print()
    print('Today\'s:')
    print(f"P(Yes|Today): {yes_percent}")
    print(f"P(No|Today): {no_percent}")
    return 'Yes' if yes_percent > no_percent else 'No'


prediction = NaiveBayes('Rainy', 'Hot', 'Normal', True, df)

P(Rainy|Yes): 0.3333333333333333
P(Rainy|No): 0.4
P(Hot|Yes): 0.2222222222222222
P(Hot|No): 0.4
P(Normal|Yes): 0.6666666666666666
P(Normal|No): 0.2
P(True|Yes): 0.3333333333333333
P(True|No): 0.6

Today's:
P(Yes|Today): 0.6067961165048542
P(No|Today): 0.3932038834951457
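
The final numbers can be checked by hand with exact fractions. The sketch below uses the conditional probabilities printed above. The class priors of 9/14 and 5/14 are an assumption (the usual 14-row version of the golf data); they are not printed by the function, but they do reproduce the result shown.

```python
from fractions import Fraction as F

# ASSUMPTION: priors from the classic 14-row golf data (9 Yes, 5 No)
prior_yes, prior_no = F(9, 14), F(5, 14)

# Conditional probabilities as printed above: Rainy, Hot, Normal, True
like_yes = [F(1, 3), F(2, 9), F(2, 3), F(1, 3)]
like_no = [F(2, 5), F(2, 5), F(1, 5), F(3, 5)]

# Naive Bayes score: prior times the product of the likelihoods
score_yes, score_no = prior_yes, prior_no
for ly, ln in zip(like_yes, like_no):
    score_yes *= ly
    score_no *= ln

# Normalize so the two posteriors sum to 1
p_yes = score_yes / (score_yes + score_no)
p_no = score_no / (score_yes + score_no)
print(p_yes, float(p_yes))
print(p_no, float(p_no))
```

Under that assumption the posterior for Yes works out to exactly 125/206, which agrees with the floating-point value printed by the function.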
