**Titanic Dataset ~ 77% accuracy**

First, we import the libraries used in this notebook:

In [1]:
import pandas as pd
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression

We read and store both the **train** and the **test** datasets.

In [2]:
train = pd.read_csv("/Users/zolta/Desktop/Python_projects/Sources/titanic/train.csv")
test = pd.read_csv("/Users/zolta/Desktop/Python_projects/Sources/titanic/test.csv")
ids = test["PassengerId"]

Let's take a look at the first 5 records of the **train** dataset!

In [3]:
train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


Let's begin by dropping the columns we won't use: **'PassengerId'**, **'Name'**, **'Ticket'** and **'Cabin'**.
The **'Name'**, **'Ticket'** and **'Cabin'** attributes likely carry some signal about the **'Survived'** label, but since this is just a simple example, we leave them out.

In [4]:
train.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis = 1, inplace = True)

In [5]:
train.head()

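| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S |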


In [6]:
test.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis = 1, inplace = True)

In [7]:
test.head()

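| | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|
| 0 | 3 | male | 34.5 | 0 | 0 | 7.8292 | Q |
| 1 | 3 | female | 47.0 | 1 | 0 | 7.0 | S |
| 2 | 2 | male | 62.0 | 0 | 0 | 9.6875 | Q |
| 3 | 3 | male | 27.0 | 0 | 0 | 8.6625 | S |
| 4 | 3 | female | 22.0 | 1 | 1 | 12.2875 | S |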


At the end of the process we are going to use Logistic Regression to predict which passengers survived. Logistic Regression needs numerical features, so we have to convert the values of the **'Sex'** and **'Embarked'** columns to numerical ones.

In [8]:
label_encoder = preprocessing.LabelEncoder()

columns = ['Sex', 'Embarked']

for col in columns:
    train[col] = label_encoder.fit_transform(train[col])
    test[col] = label_encoder.transform(test[col])

In [9]:
test.head()

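| | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|
| 0 | 3 | 1 | 34.5 | 0 | 0 | 7.8292 | 1 |
| 1 | 3 | 0 | 47.0 | 1 | 0 | 7.0 | 2 |
| 2 | 2 | 1 | 62.0 | 0 | 0 | 9.6875 | 1 |
| 3 | 3 | 1 | 27.0 | 0 | 0 | 8.6625 | 2 |
| 4 | 3 | 0 | 22.0 | 1 | 1 | 12.2875 | 2 |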
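
The encoding loop reuses a single `LabelEncoder`, so after it finishes only the mapping of the last column can still be inspected. Here is a minimal sketch of one way to keep the mappings around, assuming a separate encoder per column. The toy frames and the `encoders` and `mapping` names are illustrative and not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for the real train/test frames (illustrative values only)
toy_train = pd.DataFrame({"Sex": ["male", "female", "female"],
                          "Embarked": ["S", "C", "Q"]})
toy_test = pd.DataFrame({"Sex": ["female", "male"],
                         "Embarked": ["Q", "S"]})

encoders = {}  # hypothetical helper: one fitted encoder per column

for col in ["Sex", "Embarked"]:
    encoder = LabelEncoder()
    # Fit on the train column only, then reuse the same mapping for test
    toy_train[col] = encoder.fit_transform(toy_train[col])
    toy_test[col] = encoder.transform(toy_test[col])
    encoders[col] = encoder

# classes_ holds the sorted original labels; a label's position is its code
mapping = {col: {str(label): code for code, label in enumerate(enc.classes_)}
           for col, enc in encoders.items()}
print(mapping)
```

Because the classes are sorted, `female` comes before `male` and `C` before `Q` before `S`, which agrees with the encoded values in the table above.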


Let's check which columns contain *NaN* values! Every *NaN* value has to be replaced before we can fit the model.

Let's create a function which returns the columns containing *NaN* values.

In [10]:
def check_for_nan(dataset):
    contains_nan = []
    
    for col in dataset.columns:
        if dataset[col].isnull().values.any():
            contains_nan.append(col)
            
    return contains_nan

The **train** dataset contains these columns with *NaN* values:

In [11]:
check_for_nan(train)

['Age']

Regarding the **test** dataset:

In [12]:
check_for_nan(test)

['Age', 'Fare']

As we can see, the **'Age'** column contains *NaN* values in both datasets, while **'Fare'** contains them only in the ***test*** dataset.

Let's create a function which replaces the *NaN* values with the respective column's median value.
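
Knowing how many values are missing can be as useful as knowing which columns are affected. A small sketch using `isnull().sum()` on a toy DataFrame (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing values (illustrative, not the Titanic data)
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan],
                    "Fare": [7.25, 71.28, np.nan, 8.05],
                    "Pclass": [3, 1, 3, 1]})

# Count NaN values per column and keep only the columns that have any
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```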

In [13]:
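def replace_nan(dataset):
    columns = check_for_nan(dataset)
    
    for col in columns:
        # Assign the result back instead of calling fillna(inplace=True) on a
        # column selection, which newer pandas versions warn about
        dataset[col] = dataset[col].fillna(dataset[col].median())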

In [14]:
replace_nan(train)
replace_nan(test)
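
Note that `replace_nan` fills each dataset with its own medians. A common alternative, which this notebook does not use, is to learn the medians on the train set and apply them to both sets. The sketch below does this with scikit-learn's `SimpleImputer` on toy data; the frames and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames (illustrative values only)
toy_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0],
                          "Fare": [10.0, 20.0, 30.0, 40.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 50.0],
                         "Fare": [5.0, np.nan]})

# Learn the per-column medians from the train data only
imputer = SimpleImputer(strategy="median")
imputer.fit(toy_train)

# Apply the same train medians to both frames
train_filled = pd.DataFrame(imputer.transform(toy_train), columns=toy_train.columns)
test_filled = pd.DataFrame(imputer.transform(toy_test), columns=toy_test.columns)
print(test_filled)
```

Filling the test set from the train set's statistics means the model sees the same preprocessing at prediction time as it did during fitting.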

Now we store the **'Survived'** labels in a separate Series, **y**. We also create a DataFrame, **X**, from the original one without the **'Survived'** column.

In [15]:
y = train["Survived"]
X = train.drop("Survived", axis=1)

We create a Logistic Regression model and fit the data.

In [16]:
model = LogisticRegression(random_state = 0, max_iter = 1000).fit(X, y)
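
The model is fitted on the whole train set, so the notebook itself never computes a local accuracy estimate. One way to get such an estimate before submitting is k-fold cross-validation. The sketch below runs `cross_val_score` on synthetic data generated with `make_classification`, since it does not assume access to the Titanic files. The names `X_demo`, `y_demo` and `model_demo` are illustrative; with the real data you would pass `X` and `y` instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X and y (7 features, like the prepared train set)
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)

model_demo = LogisticRegression(random_state=0, max_iter=1000)

# 5-fold cross-validation: fit on 4 folds, score accuracy on the held-out fold
scores = cross_val_score(model_demo, X_demo, y_demo, cv=5, scoring="accuracy")
print(scores.mean())
```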

We create the final predictions based on the **test** dataset.

In [17]:
submission_preds = model.predict(test)

We create a new DataFrame holding the predictions and save it as **submission.csv**. This is the file we will submit on Kaggle.

In [18]:
df = pd.DataFrame({"PassengerId":ids.values,
                   "Survived":submission_preds
                  })

In [19]:
df.to_csv("submission.csv", index=False)
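
Before uploading, it can help to read the file back and check its shape. A minimal sketch, assuming a two-column layout of `PassengerId` and `Survived` as built above; the demo rows and the temporary path are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy submission written to a temporary folder (illustrative values only)
demo = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
path = os.path.join(tempfile.mkdtemp(), "submission_demo.csv")
demo.to_csv(path, index=False)

# Read the file back and run a few basic checks
check = pd.read_csv(path)
has_expected_columns = list(check.columns) == ["PassengerId", "Survived"]
only_binary_labels = bool(check["Survived"].isin([0, 1]).all())
print(has_expected_columns, only_binary_labels, len(check))
```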