# Introduction

Hello y'all, this notebook describes my approach to the [Titanic Competition](https://www.kaggle.com/competitions/titanic/), a competition for beginners on Kaggle.

I am a beginner! This is my first notebook, so please share feedback, better ways of doing things, or anything else that could be helpful! I am just getting started with all of this, and I am here to learn.

# The Titanic Dataset

The Titanic is the famous ship that hit an iceberg. The first thing I did was skim through the [Wikipedia page](https://en.wikipedia.org/wiki/Titanic), which I'd recommend to anyone who wants a refresher on the story. I was mainly looking for information about survivors and how it might relate to our features. The important line I read was that the lifeboats were filled with "women and children first." So right off the bat, sex and age are going to be important.

The competition's data dictionary describes these 10 variables (the training file also has PassengerId and Name columns):

    survival: 0 = No, 1 = Yes
    pclass: Ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd
    sex: Sex
    Age: Age in years
    sibsp: # of siblings / spouses aboard the Titanic
    parch: # of parents / children aboard the Titanic
    ticket: Ticket number
    fare: Passenger fare
    cabin: Cabin number
    embarked: Port of Embarkation: C = Cherbourg, Q = Queenstown, S = Southampton

### First off, importing libraries and a look at the data:


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier

data = pd.read_csv("/kaggle/input/titanic/train.csv")
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")

data.head()

Another look:

In [None]:
data.describe()

Right away, we can see a couple important things about the data:

1. Age and Cabin have missing values
2. PassengerId, Name, Ticket, and Cabin aren't going to be very useful for our model<a name="cite_ref-1"></a>[<sup>[1]</sup>](#cite_note-1)
3. Sex is binary (duh)
4. Age and Fare have pretty big ranges
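
A quick way to confirm the missing-value point is to count NaNs per column. Here is a minimal sketch on a tiny made-up frame (the `toy` values are hypothetical stand-ins, not rows from the real data); on the real data the same call would be `data.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# isna() marks missing cells as True; sum() counts them per column
missing_counts = toy.isna().sum()
print(missing_counts)
```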


But before any data cleaning, let's see if we can find any trends.

# Data Exploration

We already know women were prioritized in the lifeboats, but let's look at sex:

In [None]:
men = data.loc[data.Sex == 'male']["Survived"]
women = data.loc[data.Sex == 'female']['Survived']

print("Men:   " + str(round(100*sum(men)/len(men))) + "% survival")
print("Women: " + str(round(100*sum(women)/len(women))) + "% survival")

There is a large discrepancy, so this is going to be an important feature.

Next let's take a look at Passenger Class: 

In [None]:
first = data.loc[data.Pclass == 1]["Survived"]
second = data.loc[data.Pclass == 2]["Survived"]
third = data.loc[data.Pclass == 3]["Survived"]

print("survival, by passenger class:\n\n"
      +"First: "+str(round(100*sum(first)/(len(first))))+"%\n"
      +"Second: " + str(round(100*sum(second)/(len(second))))+"%\n"
      +"Third: "+str(round(100*sum(third)/(len(third)))) + "%\n")

Not looking too good for third class. I guess the wealthy always come out on top. 
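
The survival percentages for sex and passenger class above can probably be written more compactly with `groupby`. This is just a sketch on a small made-up frame (the `toy` rows are hypothetical), relying on the fact that the mean of a 0/1 column is the survival rate.

```python
import pandas as pd

# Toy frame standing in for the training data (values are made up)
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "male", "female"],
    "Pclass": [3, 1, 3, 2, 1, 3],
    "Survived": [0, 1, 0, 1, 1, 0],
})

# The mean of a 0/1 column is the fraction that survived
rate_by_sex = toy.groupby("Sex")["Survived"].mean().mul(100).round()
rate_by_class = toy.groupby("Pclass")["Survived"].mean().mul(100).round()

print(rate_by_sex)
print(rate_by_class)
```

On the real data, swapping `toy` for `data` should give the same numbers as the print statements above.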

Next up: Age

In [None]:
ageS = data.loc[data.Survived == 1]["Age"]
ageD = data.loc[data.Survived == 0]["Age"]

plt.hist(ageS, alpha=0.5, bins=15, label='Survived')
plt.hist(ageD, alpha=0.5, bins=15, label='Died')
plt.legend(loc='upper right')
plt.title("Age distribution by survival")
plt.show()

This is interesting! Children are the only age group that was more likely to survive than not, but people in their 30s did a lot better than those aged 18-25. I'm guessing this is because of parents? If families were prioritized, which I think they were, this would make sense.

I wonder what this would look like split up by sex?

In [None]:
ageSmen = data.loc[(data.Survived == 1) & (data.Sex == 'male')]["Age"]
ageDmen = data.loc[(data.Survived == 0) & (data.Sex == 'male')]["Age"]

ageSwomen = data.loc[(data.Survived == 1) & (data.Sex == 'female')]["Age"]
ageDwomen = data.loc[(data.Survived == 0) & (data.Sex == 'female')]["Age"]

plt.hist(ageSmen, alpha=0.5, bins=15, label='Male Survived')
plt.hist(ageDmen, alpha=0.5, bins=15, label='Male Died')
plt.hist(ageSwomen, alpha=0.5, bins=15, label='Female Survived')
plt.hist(ageDwomen, alpha=0.5, bins=15, label='Female Died')
plt.legend(loc='upper right')
plt.title("Age distribution by survival and sex")
plt.show()

Gist of the story: males mostly survived only if they were children or fathers (I'm just guessing).


Finally let's look at Fare:

In [None]:
fareS = data.loc[(data.Survived == 1)]["Fare"]
fareD = data.loc[(data.Survived == 0)]["Fare"]

plt.figure(2)
plt.hist(fareS, alpha=0.5, bins=15, label='Survived')
plt.hist(fareD, alpha=0.5, bins=15, label='Died')

plt.legend(loc='upper right')
plt.title("Fare distribution by survival")
plt.show()

This isn't that helpful, but it reinforces the idea that the people who paid more had a higher chance of survival. I'm guessing fare correlates pretty strongly with passenger class.

# Feature Selection

I decided to use Pclass, Sex, Age, and Fare. It's probably more helpful to talk about the features I *didn't* use.

Both SibSp and Parch were slightly correlated with survival, but they weren't useful in my experiments. I saw someone do some feature engineering and create a FamilySize feature, which they did by combining SibSp, Parch, and last names. This is cool, but outside of my current scope. It might be interesting to play around with different ways of combining these features to improve the model.
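
For anyone curious, here is a minimal sketch of one simple way a family-size feature could be built. It is an assumption on my part, not the approach I saw: it only adds SibSp and Parch (plus one for the passenger) and ignores last names. The `IsAlone` column is a hypothetical extra.

```python
import pandas as pd

# Toy frame standing in for the training data (values are made up)
toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Siblings/spouses + parents/children + the passenger themselves
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# Hypothetical extra flag: 1 if the passenger travelled alone
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)

print(toy)
```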



# Feature Cleaning and Engineering

So we have our features, we can just put them into a model right??

Nope. I did this on the first run and the model was no better than random. The data needs to be cleaned: NaNs need to be filled, 'male' and 'female' need to be encoded, and bins need to be created for age and fare.

Binning is a helpful technique for transforming a continuous variable into categorical data. With ordinal binning, the order of the bins is maintained. It can reduce the chance of overfitting and often improves models.
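
To make the two kinds of binning concrete, here is a small sketch on made-up numbers: `pd.cut` uses fixed edges that I choose, while `pd.qcut` picks edges so each bin holds roughly the same number of rows. Passing `labels=False` returns the ordinal bin codes directly. The values and the choice of 4 quantile bins are just for illustration.

```python
import pandas as pd

ages = pd.Series([2, 15, 24, 38, 47, 71])
fares = pd.Series([7.25, 7.90, 8.05, 13.0, 26.0, 31.0, 71.3, 512.3])

# Fixed-width bins with an open-ended last bin for 60+
age_edges = [0, 10, 20, 30, 40, 50, 60, float("inf")]
age_codes = pd.cut(ages, bins=age_edges, labels=False)

# Equal-frequency bins: each of the 4 bins gets about the same number of fares
fare_codes = pd.qcut(fares, q=4, labels=False)

print(age_codes.tolist())
print(fare_codes.tolist())
```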

Encoding is assigning numerical values to categorical data. Binary encoding assigns 'male' as 0 and 'female' as 1. Ordinal encoding assigns ordered integers to ordered categorical data, such as the bins we create for age. One-hot encoding is for nominal data: a column is created for each category and filled with 0s and 1s accordingly. This would be helpful if "Embarked" was used as a feature.
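
Here is a rough sketch of binary and one-hot encoding on a made-up frame. I don't actually use "Embarked" in my model, so the one-hot part is only an illustration of what it could look like with `pd.get_dummies`.

```python
import pandas as pd

# Toy frame standing in for the training data (values are made up)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Binary encoding: male = 0, female = 1
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})

# One-hot encoding: one 0/1 column per port of embarkation
encoded = pd.get_dummies(toy, columns=["Embarked"], dtype=int)

print(encoded)
```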

For age, the most important bin seemed to be children, so I decided to create the bins every 10 years, i.e. 0-10, 10-20, etc. I stopped at 60+, which was arbitrary. 

For fares, I used 8 equal-frequency bins. Given the distribution of fares, I think frequency-based binning is appropriate. 8 is an arbitrary number.
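
One thing that might be worth trying (I haven't done it in this notebook) is learning the fare bin edges on the training data and reusing them on the test data, so both sets share the same bins. A minimal sketch, with made-up fares and 4 bins instead of 8 to keep it small:

```python
import pandas as pd

# Made-up fares standing in for the train and test sets
train_fares = pd.Series([7.25, 7.90, 8.05, 13.0, 26.0, 31.0, 71.3, 512.3])
test_fares = pd.Series([7.75, 30.0, 80.0])

# retbins=True also returns the bin edges that qcut computed
train_codes, edges = pd.qcut(train_fares, q=4, labels=False, retbins=True)

# Widen the outer edges so unseen, more extreme fares still land in a bin
edges[0], edges[-1] = -float("inf"), float("inf")

# Apply the training edges to the test fares
test_codes = pd.cut(test_fares, bins=edges, labels=False)

print(test_codes.tolist())
```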

The age and fare bins are then encoded as 0,1,2,3,etc. 

Age NaNs are filled with the mean age. 

Finally, I drop the columns that aren't going to be used.


In [None]:
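def preprocessing(data):
    # Work on a copy so the caller's DataFrame is left untouched
    data = data.copy()

    # Fill missing ages with the mean age
    meanAge = data.Age.mean()
    data.Age = data.Age.fillna(meanAge)

    # Fill missing fares with the median (test.csv has one missing Fare)
    data.Fare = data.Fare.fillna(data.Fare.median())

    # Create and encode Age bins: 0-10, 10-20, ..., 50-60, and 60+
    # (the open-ended last edge keeps ages above 60 from becoming NaN)
    Abins = [0, 10, 20, 30, 40, 50, 60, float("inf")]
    Abin_labels = [0, 1, 2, 3, 4, 5, 6]
    data['Age_bins'] = pd.cut(data['Age'], Abins, labels=Abin_labels)

    # Create and encode Fare bins (equal frequency)
    Fbin_labels = [0, 1, 2, 3, 4, 5, 6, 7]
    data['Fare_bins'] = pd.qcut(data['Fare'], q=8, labels=Fbin_labels)  # 8 is arbitrary, unsure what ideal would be

    # Encode Sex: male = 0, female = 1
    data.Sex = data.Sex.map({"male": 0, "female": 1})

    # Drop columns that won't be used as features
    data = data.drop(['PassengerId', 'Name', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], axis=1)
    if "Survived" in data.columns:
        data = data.drop(['Survived'], axis=1)

    return data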


Next, apply the preprocessing function to the data. Grab the truth values from the training data and the passenger IDs from the test data first, before they're dropped in the preprocessing.

In [None]:
y = data["Survived"]  # truth values, selected by name rather than column position
pID = test_data.PassengerId

train_data = preprocessing(data)
test_data = preprocessing(test_data)

X = train_data

# Creating the Model

Finally, I get to create the model. I chose Extra Trees over Random Forest. Because Extra Trees picks its split thresholds at random, it usually lowers variance a bit at the cost of slightly more bias, so which one wins comes down to the data. I tried out a bunch of different models, as well as various combinations of the best ones. Extra Trees tied for the best performance I got: a combination of Gradient Boosting and AdaBoost performed just as well, and throwing in KNN gave about the same results. Since I couldn't find any single classifier or combination that beat Extra Trees, I went with that.
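
For comparing models like this, cross-validation is one reasonable option. Below is a sketch, not the exact procedure I followed: it uses synthetic data from `make_classification` as a stand-in for the preprocessed features, and the model settings (`n_estimators=50`, `random_state=0`) are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed features and labels
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0, random_state=0
)

candidates = {
    "ExtraTrees": ExtraTreesClassifier(n_estimators=50, random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# 5-fold cross-validated accuracy for each candidate model
cv_scores = {}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X_demo, y_demo, cv=5)
    cv_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

On the real data, `X_demo` and `y_demo` would be replaced by `X` and `y`.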

In [None]:
model = ExtraTreesClassifier()
model.fit(X,y)
predictions = model.predict(test_data)

output = pd.DataFrame({'PassengerId': pID, 'Survived': predictions})

output.to_csv('submission.csv', index=False)

# Conclusion

This model scores 0.77272, which is solid. Looking at the leaderboards, this seems close to the best scores people get without using the extended data or creating a FamilySize feature, so I am happy with it. Like I mentioned at the beginning, **please give any feedback!**

### Ways to Improve
 
There are a bunch of specific hyperparameters that people have figured out that will improve the model. I haven't really looked at what these specifically do.

Scaling the data? Might help for distance-based models like KNN, but tree-based models like Extra Trees shouldn't care about feature scale.

Trying different bin sizes and distributions: the number of age and fare bins was arbitrary, so this could likely be improved.

Feature engineering: look up the ways people create a FamilySize feature and some other things; these can improve the model.

Extended data set: can be found online, imo this is cheating for this challenge though.
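
On the hyperparameter idea above, a grid search is one way I could explore settings myself instead of copying other people's values. This is only a sketch: the data is synthetic, and the parameter grid is a guess at reasonable values to try, not the settings people have found to work on Titanic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed features and labels
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0, random_state=0
)

# Hypothetical grid: these values are guesses, not tuned recommendations
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 3],
}

# Try every combination with 3-fold cross-validation
search = GridSearchCV(ExtraTreesClassifier(random_state=0), param_grid, cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```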



<a name="cite_note-1"></a>1. [^](#cite_ref-1) People have found ways to use these, but I'm not going to.