___________
# Summary
#### The aim of this notebook is to make a first submission and get a baseline score from which improvements can be made. To do that, I only fill the null values and correct the datatypes, which includes one-hot encoding the categorical columns. Both steps are needed because most ML models expect numerical input with no null values.
#### There is no additional EDA, feature engineering, or model optimization, since the aim is only a first submission.

<a id='content-table'></a>
## Table of Contents
1. [Loading data](#load)
2. [Combine Train and Test data](#tag2)
3. [Filling missing values](#tag3)
4. [Remove unnecessary columns](#tag4)
5. [Change datatypes if required](#tag5)
6. [Splitting into train/test set](#tag6)
7. [Training a simple model](#tag7)
8. [Making predictions on Test set](#tag8)
9. [Making your first submission](#tag9)

In [1]:
import numpy as np 
import pandas as pd 

<a id='load'></a>
## [Step - 1 : Loading data](#content-table)

In [2]:
import pandas as pd
train = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2021/train.csv')
test = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2021/test.csv')
submission = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2021/sample_submission.csv')

print(train.shape, test.shape, submission.shape)
print(train.columns)                             #printing the column names
print(set(train.columns)-set(test.columns))      #printing the target column

(100000, 12) (100000, 11) (100000, 2)
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
{'Survived'}


## Print first 5 rows

In [3]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,1,1,"Oconnor, Frankie",male,,2,0,209245,27.14,C12239,S
1,1,0,3,"Bryan, Drew",male,,0,0,27323,13.35,,S
2,2,0,3,"Owens, Kenneth",male,0.33,1,2,CA 457703,71.29,,S
3,3,0,3,"Kramer, James",male,19.0,0,0,A. 10866,13.04,,S
4,4,1,3,"Bond, Michael",male,25.0,0,0,427635,7.76,,S


## Check the % of null values in Train and Test data

In [4]:
_1 = train.isnull().sum()/len(train)*100
_2 = test.isnull().sum()/len(test)*100   # divide by the test set's own length

df = pd.concat([_1,_2], axis = 1)
df.columns = ['train', 'test']
df

Unnamed: 0,train,test
PassengerId,0.0,0.0
Survived,0.0,
Pclass,0.0,0.0
Name,0.0,0.0
Sex,0.0,0.0
Age,3.292,3.487
SibSp,0.0,0.0
Parch,0.0,0.0
Ticket,4.623,5.181
Fare,0.134,0.133


The columns with null values are the same in both datasets, and the percentage of missing values in each column is roughly the same.
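As a small aside, the same comparison can be written with `isnull().mean()`, which divides each frame by its own length automatically. This is only a sketch on toy data; `toy_train` and `toy_test` are made-up frames, not the competition files.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for train/test (hypothetical values)
toy_train = pd.DataFrame({"Age": [22.0, np.nan, 30.0, np.nan],
                          "Fare": [7.25, 8.05, np.nan, 13.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 41.0],
                         "Fare": [9.5, 10.0]})

# isnull().mean() is the fraction of missing values per column,
# so each frame is normalised by its own number of rows
null_pct = pd.concat(
    [toy_train.isnull().mean() * 100, toy_test.isnull().mean() * 100],
    axis=1,
    keys=["train", "test"],
)
print(null_pct)
```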

<a id='tag2'></a>
## [Step - 2 : Combine Train and Test data](#content-table)

In [5]:
test['Survived'] = -1
all_data = pd.concat([train, test])
print(all_data.head())
all_data.tail()

   PassengerId  Survived  Pclass              Name   Sex    Age  SibSp  Parch  \
0            0         1       1  Oconnor, Frankie  male    NaN      2      0   
1            1         0       3       Bryan, Drew  male    NaN      0      0   
2            2         0       3    Owens, Kenneth  male   0.33      1      2   
3            3         0       3     Kramer, James  male  19.00      0      0   
4            4         1       3     Bond, Michael  male  25.00      0      0   

      Ticket   Fare   Cabin Embarked  
0     209245  27.14  C12239        S  
1      27323  13.35     NaN        S  
2  CA 457703  71.29     NaN        S  
3   A. 10866  13.04     NaN        S  
4     427635   7.76     NaN        S  


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
99995,199995,-1,3,"Cash, Cheryle",female,27.0,0,0,7686,10.12,,Q
99996,199996,-1,1,"Brown, Howard",male,59.0,1,0,13004,68.31,,S
99997,199997,-1,3,"Lightfoot, Cameron",male,47.0,0,0,4383317,10.87,,S
99998,199998,-1,1,"Jacobsen, Margaret",female,49.0,1,2,PC 26988,29.68,B20828,C
99999,199999,-1,1,"Fishback, Joanna",female,41.0,0,2,PC 41824,195.41,E13345,C
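Note that the tail above still shows index labels 99995 to 99999 for the test rows: `pd.concat` keeps each frame's original index, so labels repeat in the combined frame. The toy sketch below (made-up frames, not the real data) shows this, along with `ignore_index=True` as one way to get a fresh index if label-based lookups are needed. Positional slicing with `.iloc`, as used later in this notebook, works either way.

```python
import pandas as pd

# Toy frames standing in for train/test (hypothetical values)
toy_train = pd.DataFrame({"PassengerId": [0, 1], "Survived": [1, 0]})
toy_test = pd.DataFrame({"PassengerId": [2, 3]})
toy_test["Survived"] = -1  # placeholder label, same trick as above

# Default concat keeps the original index labels: 0, 1, 0, 1
stacked = pd.concat([toy_train, toy_test])

# ignore_index=True builds a new 0..n-1 index: 0, 1, 2, 3
stacked_reset = pd.concat([toy_train, toy_test], ignore_index=True)

print(stacked.index.tolist(), stacked_reset.index.tolist())
```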


<a id='tag3'></a>
## [Step - 3 : Filling missing values](#content-table)

### Fill 'Age' and 'Fare' with their mean values

In [6]:
for col in ['Age', 'Fare']:
    all_data[col] = all_data[col].fillna(all_data[col].mean())
    print(all_data[col].isnull().sum())

0
0
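The cell above takes the mean over the combined train and test rows. A common alternative, shown here only as a sketch and not what this notebook does, is to learn the mean from the training rows and reuse it on the test rows. The example assumes scikit-learn's `SimpleImputer` and uses made-up toy frames.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for train/test (hypothetical values)
toy_train = pd.DataFrame({"Age": [20.0, np.nan, 40.0],
                          "Fare": [10.0, 20.0, np.nan]})
toy_test = pd.DataFrame({"Age": [np.nan, 50.0],
                         "Fare": [np.nan, 5.0]})

imputer = SimpleImputer(strategy="mean")

# fit_transform learns the column means from the training rows only
train_filled = pd.DataFrame(imputer.fit_transform(toy_train),
                            columns=toy_train.columns)

# transform reuses those training means on the test rows
test_filled = pd.DataFrame(imputer.transform(toy_test),
                           columns=toy_test.columns)

print(test_filled)
```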


### Fill 'Embarked' and 'Ticket' with their mode values

In [7]:
for col in ['Embarked', 'Ticket']:
    all_data[col] = all_data[col].fillna(all_data[col].mode()[0]) 
    print(all_data[col].isnull().sum())

0
0


### Filling Cabin values
About 67% of the values here are missing. So instead of imputing, I replace the column with an indicator: 1 if a value is present and 0 if it is missing.

In [8]:
col = 'Cabin'
all_data[col] = all_data[col].notnull().astype(int)
print(all_data[col].isnull().sum())

0
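To make the transformation concrete, here is the same `notnull().astype(int)` step on a tiny made-up Series (the cabin codes are only illustrative):

```python
import numpy as np
import pandas as pd

# Toy 'Cabin' column: two known cabins, two missing
cabin = pd.Series(["C12239", np.nan, np.nan, "B20828"], name="Cabin")

# notnull() gives True/False; astype(int) turns that into 1/0
has_cabin = cabin.notnull().astype(int)

print(has_cabin.tolist())  # [1, 0, 0, 1]
```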


### Verify there are no null values

In [9]:
all_data.isnull().sum()/len(all_data)*100   # all_data holds train + test rows

PassengerId    0.0
Survived       0.0
Pclass         0.0
Name           0.0
Sex            0.0
Age            0.0
SibSp          0.0
Parch          0.0
Ticket         0.0
Fare           0.0
Cabin          0.0
Embarked       0.0
dtype: float64

<a id='tag4'></a>
## [Step - 4 : Remove unnecessary columns](#content-table)

We will check the percentage of unique values in each categorical column.

In [10]:
# Taking only categorical columns
cols = [col for col in all_data.columns if all_data[col].dtype == 'object']
cols

for col in cols:
    print(f"{col} : {all_data[col].nunique()/len(all_data)*100}")

Name : 87.42699999999999
Sex : 0.001
Ticket : 66.3065
Embarked : 0.0015


The `'Name'` and `'Ticket'` columns have more than 87% and 66% unique values respectively. Used as is, they are unlikely to give the model useful information. EDA or feature engineering might extract some insight from them, but we are not doing that here.

In [11]:
all_data.drop(['Name', 'Ticket'], axis = 1, inplace = True)
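If you want to pick such columns automatically instead of by eye, one possible sketch is to drop every non-numeric column whose unique ratio exceeds a cutoff. The `threshold` value and the toy frame below are assumptions for illustration, not something tuned for this dataset.

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "Name": ["a", "b", "c", "d"],               # every value is unique
    "Sex": ["male", "female", "male", "male"],  # only two distinct values
    "Fare": [1.0, 2.0, 3.0, 4.0],
})

threshold = 0.5  # assumed cutoff: drop if more than 50% of values are unique

# Only look at the non-numeric (categorical/text) columns
text_cols = toy.select_dtypes(exclude="number").columns
unique_ratio = toy[text_cols].nunique() / len(toy)

to_drop = unique_ratio[unique_ratio > threshold].index.tolist()
reduced = toy.drop(columns=to_drop)

print(to_drop, reduced.columns.tolist())
```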

<a id='tag5'></a>
## [Step - 5 : Change datatypes if required](#content-table)

### Compare each column's datatype with a sample value

In [12]:
df = pd.concat([all_data.iloc[0], all_data.dtypes], axis = 1)
df.columns = ['sample', 'dtype']
df

Unnamed: 0,sample,dtype
PassengerId,0,int64
Survived,1,int64
Pclass,1,int64
Sex,male,object
Age,34.464565,float64
SibSp,2,int64
Parch,0,int64
Fare,27.14,float64
Cabin,1,int64
Embarked,S,object


The datatype of each column matches the type of its sample value, so no column needs a datatype change.

### One-hot-encode categorical columns

In [13]:
# Check which categorical columns are left
cols = [col for col in all_data.columns if all_data[col].dtype == 'object']
cols

['Sex', 'Embarked']

In [14]:
all_data = pd.get_dummies(all_data, drop_first = True)
all_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,Cabin,Sex_male,Embarked_Q,Embarked_S
0,0,1,1,34.464565,2,0,27.14,1,1,0,1
1,1,0,3,34.464565,0,0,13.35,0,1,0,1
2,2,0,3,0.33,1,2,71.29,0,1,0,1
3,3,0,3,19.0,0,0,13.04,0,1,0,1
4,4,1,3,25.0,0,0,7.76,0,1,0,1
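For readers new to `drop_first = True`: it drops the first category of each encoded column (here `female` and `C`), since that category is implied when all the remaining dummy columns are 0. A toy sketch with made-up rows follows. Depending on the pandas version, the dummy columns may come back as booleans rather than 0/1 integers, which is why the example casts them with `astype(int)`.

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "Age": [30.0, 40.0, 50.0],
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", "C", "Q"],
})

# Encode only the two categorical columns and drop the first level of each
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True)

print(encoded.columns.tolist())
print(encoded["Sex_male"].astype(int).tolist())
```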


The data is now ready to be fed into a model. So we will split it back into train and test sets, hold out part of the training data for validation, and train a basic model.

<a id='tag6'></a>
## [Step - 6 : Splitting into train/test set](#content-table)

### Split into train-test set

In [15]:
n_train = len(train)
train_modified = all_data.iloc[:n_train].copy()   # .copy() gives an independent DataFrame and avoids SettingWithCopyWarning on later in-place edits
test_modified = all_data.iloc[n_train:].copy()

print(len(train_modified), len(test_modified))

100000 100000


In [16]:
train_modified.head()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,Cabin,Sex_male,Embarked_Q,Embarked_S
0,0,1,1,34.464565,2,0,27.14,1,1,0,1
1,1,0,3,34.464565,0,0,13.35,0,1,0,1
2,2,0,3,0.33,1,2,71.29,0,1,0,1
3,3,0,3,19.0,0,0,13.04,0,1,0,1
4,4,1,3,25.0,0,0,7.76,0,1,0,1


In [17]:
# Removing 'PassengerId' column
train_modified.drop('PassengerId', axis = 1, inplace = True)

In [18]:
test_modified.head()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,Cabin,Sex_male,Embarked_Q,Embarked_S
0,100000,-1,3,19.0,0,0,63.01,0,1,0,1
1,100001,-1,3,53.0,0,0,5.81,0,0,0,1
2,100002,-1,1,19.0,0,0,38.91,1,0,0,0
3,100003,-1,2,25.0,0,0,12.93,0,1,0,1
4,100004,-1,1,17.0,0,2,26.89,1,0,0,0


In [19]:
# Remove 'Survived' column from test data
test_modified.drop('Survived', axis = 1,inplace = True)

### Create a train-test split on training data

In [20]:
from sklearn.model_selection import train_test_split

X = train_modified.drop('Survived', axis = 1)
y = train_modified['Survived'].copy()

x_train, x_test, y_train, y_test = train_test_split(X, y.values, test_size = 0.25, random_state = 42)

print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

(75000, 9) (25000, 9) (75000,) (25000,)
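The split above is a plain random split. If the two classes are imbalanced, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both parts. This is an optional variation, sketched here on made-up arrays rather than the competition data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 12 of class 0 and 8 of class 1 (hypothetical values)
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 12 + [1] * 8)

# stratify=y_toy keeps the 60/40 class ratio in both train and validation parts
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

print(len(y_tr), len(y_val), y_tr.mean(), y_val.mean())
```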


In [21]:
x_test

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,Cabin,Sex_male,Embarked_Q,Embarked_S
75721,2,57.0,0,0,8.32,0,0,0,0
80184,3,26.0,0,0,6.17,0,1,0,1
19864,3,31.0,0,0,7.47,0,0,0,1
76699,2,41.0,0,0,8.16,0,0,0,1
92991,2,26.0,0,0,61.50,0,0,0,1
...,...,...,...,...,...,...,...,...,...
21271,1,36.0,0,0,9.55,1,0,0,0
34014,3,31.0,0,0,10.02,0,0,0,1
81355,2,10.0,0,0,24.57,0,0,0,1
65720,3,60.0,0,0,26.11,0,0,0,0


<a id='tag7'></a>
## [Step - 7 : Training a simple model](#content-table)

In [22]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(solver='liblinear', random_state = 42)

classifier.fit(x_train.values, y_train)

LogisticRegression(random_state=42, solver='liblinear')

In [23]:
y_pred = classifier.predict(x_test.values)
accuracy = (y_pred == y_test).astype(int).sum()/len(y_test)*100
print(f"Model accuracy is : {accuracy: .3f} %")

Model accuracy is :  76.612 %


An accuracy of 76.612% on the held-out 25% of the training data is a good starting point. From here on we can improve.
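To judge whether a score is actually good, it helps to compare it with what you get by always predicting the most frequent class. The sketch below uses scikit-learn's `DummyClassifier` and `accuracy_score` on made-up labels; the numbers are illustrative only and say nothing about this dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels: three 0s and two 1s (hypothetical values)
y_true = np.array([0, 0, 0, 1, 1])
X_dummy = np.zeros((5, 1))  # features are ignored by DummyClassifier

# Always predicts the most frequent class seen during fit (here: 0)
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
pred = baseline.predict(X_dummy)

# accuracy_score gives the same number as the manual formula used above
baseline_acc = accuracy_score(y_true, pred) * 100
manual_acc = (pred == y_true).mean() * 100

print(f"Baseline accuracy is : {baseline_acc: .3f} %")
```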

<a id='tag8'></a>
## [Step - 8 : Making predictions on Test set](#content-table)

In [24]:
# Saving 'PassengerId' of test data and deleting it
test_idx = test_modified['PassengerId'].copy()

test_modified.drop('PassengerId', axis = 1, inplace = True)

print(test_modified.shape)

(100000, 9)


In [25]:
y_pred = classifier.predict(test_modified.values)
submission.loc[:, 'Survived'] = y_pred

In [26]:
submission

Unnamed: 0,PassengerId,Survived
0,100000,0
1,100001,1
2,100002,1
3,100003,0
4,100004,1
...,...,...
99995,199995,1
99996,199996,0
99997,199997,0
99998,199998,1


<a id='tag9'></a>
## [Step - 9 : Making your first submission](#content-table)

In [27]:
submission.to_csv('submission.csv', index = False)   # index = False keeps the DataFrame index out of the file, so only the two expected columns are written

In [28]:
# Recheck if the file is in correct format
pd.read_csv("submission.csv")

Unnamed: 0,PassengerId,Survived
0,100000,0
1,100001,1
2,100002,1
3,100003,0
4,100004,1
...,...,...
99995,199995,1
99996,199996,0
99997,199997,0
99998,199998,1
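Besides eyeballing the file, a few automatic checks can catch format mistakes before uploading. `check_submission` below is a hypothetical helper written for this sketch, not part of pandas or Kaggle, and the frames are toy stand-ins for `submission` and the sample file.

```python
import pandas as pd

def check_submission(sub, sample):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if list(sub.columns) != list(sample.columns):
        problems.append("column names or order differ")
        return problems  # later checks rely on the expected columns
    if len(sub) != len(sample):
        problems.append("row count differs")
    elif not (sub["PassengerId"].values == sample["PassengerId"].values).all():
        problems.append("PassengerId values differ")
    if not sub["Survived"].isin([0, 1]).all():
        problems.append("Survived has values other than 0/1")
    return problems

# Toy stand-ins (hypothetical values)
sample = pd.DataFrame({"PassengerId": [100000, 100001], "Survived": [1, 1]})
good = pd.DataFrame({"PassengerId": [100000, 100001], "Survived": [0, 1]})
bad = pd.DataFrame({"PassengerId": [100000, 100001], "Survived": [0, -1]})

print(check_submission(good, sample))
print(check_submission(bad, sample))
```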


**You have now made your first submission. From here on you can do many things to improve your accuracy. You can do EDA to get better insight into your data. Further, you can also do feature engineering, hyperparameter optimization, and ensembling of models.**

_______________