# WELCOME
Hello, thanks for visiting my notebook.<br>
I am new to both Python and Kaggle, and this is an exploratory challenge for me.<br>
Please give me any ***genuine*** comments, advice or impressions! Thank you!!!!!

# Agenda:

- 0. Premise
- 1. Data Exploration
- 2. Data Cleansing
- 3. Training with KNN

# 0. Premise
- The setup shared by the whole notebook is below: imports, feature columns, and data loading.

In [None]:
# Premise
import numpy as np
np.random.seed(1)  # call the function; assigning to it would overwrite it without seeding anything
import pandas as pd
from IPython import embed
from sklearn.neighbors import KNeighborsClassifier


train_array = ['Sex', 'Pclass',
'Age', 'SibSp', 'Parch',
'Fare', 'Embarked']

target_index = np.arange(start=892, stop=1310, step=1)

x_train = pd.read_csv("/kaggle/input/titanic/train.csv", usecols = train_array)
t_train = pd.read_csv("/kaggle/input/titanic/train.csv", usecols = ['Survived'])
x_test = pd.read_csv("/kaggle/input/titanic/test.csv", usecols = train_array)

# 1. Data Exploration


I first explored the data to detect incomplete or wrong entries in advance (e.g. NaN values, or input errors made by those who compiled the original records).

In [None]:
# NaN detection
print("x_train:\n", x_train.isnull().any())
print("\nt_train:\n", t_train.isnull().any())
print("\nx_test:\n", x_test.isnull().any())

### Implications:
- In x_train, the Age and Embarked columns contain NaN values, which must be handled.
- The training labels (t_train) have no missing values: every passenger's survival outcome is recorded.
- In x_test, the Age and Fare columns contain NaN values;<br>the latter should be handled with great care,<br>
because no Fare value is missing in the training data (x_train), so this case cannot be anticipated from it.
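
`isnull().any()` only says whether a column has a gap, not how many. Below is a minimal sketch of counting the gaps per column with `isnull().sum()`. The frame `toy` and its values are invented for illustration; they are not taken from the Titanic files.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for x_test; all values are made up for this sketch.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 8.05, np.nan, 13.0],
    "Pclass": [3, 3, 1, 2],
})

# Number of missing entries in each column.
nan_counts = toy.isnull().sum()
print(nan_counts)
```

Applied to the real data, the same call would show directly how many rows each column is missing.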



In [None]:
# a. Looking further into the sex and embarked columns
print(np.unique(x_test.Sex))
u, counts = np.unique(x_test.Embarked, return_counts=True)
print("Embarked:\n", u,"\nCounts:\n", counts)

### Implications:
- The Sex column can be encoded as an integer from 0 to 2: 0 for NaN, 1 for male and 2 for female.
- The Embarked column can be encoded as an integer from 0 to 3: 0 for NaN, and 1, 2, 3 for Q, C, S in ascending order of frequency:
    - This places NaN closest to Q, the least frequent of the three letters.
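
The Embarked encoding can be sketched with a dictionary and `Series.map`, which is one possible way to write it. The Series `embarked` below is a made-up stand-in, not the real column.

```python
import numpy as np
import pandas as pd

# Toy column standing in for Embarked; values are invented for this sketch.
embarked = pd.Series(["S", "C", np.nan, "Q", "S"])

# Q=1, C=2, S=3 as described in the implications; unmapped NaN becomes 0.
codes = embarked.map({"Q": 1, "C": 2, "S": 3}).fillna(0).astype(int)
print(codes.tolist())
```

Any value missing from the dictionary becomes NaN after `map`, so the final `fillna(0)` also catches unexpected letters, which may or may not be what you want.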

In [None]:
# b. Further investigation into the Fare column of x_test:
u, counts = np.unique(x_test.Fare, return_counts = True)
print("Unique values:\n", u, "\nCounts:\n", counts)

### Implications and consideration
- As there is only one NaN in the Fare column, it is reasonable to replace it with 0 for simplicity,<br> which is done later in this notebook.
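
As a small sketch of what this replacement does, here is `fillna(0)` on an invented Series. The median fill next to it is only my assumption of a common alternative, shown for comparison; it is not used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy fares with one gap; values are invented for this sketch.
fare = pd.Series([7.25, np.nan, 26.0, 8.05])

# The choice made in this notebook: the missing fare becomes 0.
filled_zero = fare.fillna(0)

# Hypothetical alternative, for comparison only: use the median of the known fares.
filled_median = fare.fillna(fare.median())

print(filled_zero.tolist())
print(filled_median.tolist())
```

With KNN, a fare of 0 puts that passenger near the cheapest tickets in distance terms, so the choice of fill value may affect that one prediction.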

In [None]:
print("x_train:\n", x_train.dtypes)
print("\n\nx_test:\n", x_test.dtypes)

### Implications:
- The Sex and Embarked columns must be converted into numeric codes.
- It is preferable to convert all columns to the float64 dtype.
- The same conversions must be applied to x_test as well.

In [None]:
print(x_test.describe())
print(x_train.describe())

### Implications
- Some ages appear to be recorded with a fractional part while others are not; this needs to be investigated.

In [None]:
# Further investigation into Age column with decimal points
print(x_test[(x_test['Age'] != x_test['Age'].round()) & ~(x_test.Age.isnull())])

### Implications and consideration:
- A significant number of ages are recorded with a fractional part, and most of the other columns in the same rows look properly recorded, so these are not considered input errors.
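
The check for fractional ages can also be written with the modulo operator. This is a minimal sketch on an invented Series `age`, not the real column.

```python
import numpy as np
import pandas as pd

# Toy ages; values are invented for this sketch.
age = pd.Series([22.0, 0.83, np.nan, 34.5, 40.0])

# Keep non-missing ages whose remainder after dividing by 1 is not zero.
has_fraction = age.notnull() & (age % 1 != 0)
print(age[has_fraction])
```

The `notnull()` part is needed because `NaN % 1` is NaN, and `NaN != 0` evaluates to True.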

# 2. Data Cleansing

In [None]:
# Cleansing Process
# Whole columns are reassigned with .map()/.fillna() instead of chained
# indexing (x.Sex[mask] = 1), which raises SettingWithCopyWarning and may not
# modify the original DataFrame.

## Sex column: adding dummy numbers
sex_codes = {'male': 1, 'female': 2}
x_train['Sex'] = x_train['Sex'].map(sex_codes)
x_test['Sex'] = x_test['Sex'].map(sex_codes)

## Embarked column: eliminating NaNs for the x_train, adding categorical numbers for both x_train and x_test
embarked_codes = {'Q': 1, 'C': 2, 'S': 3}
x_train['Embarked'] = x_train['Embarked'].map(embarked_codes).fillna(0)
x_test['Embarked'] = x_test['Embarked'].map(embarked_codes)


## Age column: replacing NaNs with 0
x_test['Age'] = x_test['Age'].fillna(0)
x_train['Age'] = x_train['Age'].fillna(0)


## Fare column: turning NaN into 0 for simplicity for the x_test
x_test['Fare'] = x_test['Fare'].fillna(0)

## Data types conversion into float64
for key in train_array:
    x_train[key] = x_train[key].astype('float64')
    x_test[key] = x_test[key].astype('float64')


In [None]:
# Make sure that all the data have been arranged properly

print("[NaN:]\nx_train:\n", x_train.isnull().any(), "\n\nx_test:\n", x_test.isnull().any(), \
     "\n\n[Data types:]\nx_train:\n", x_train.dtypes, "\n\nx_test:\n", x_test.dtypes)

# 3. Training

For exploratory purposes, I decided to apply the KNN (k-nearest neighbors) method here, using the sklearn library.
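
To show what KNN does, here is a minimal sketch on invented points. The arrays `X` and `y` and the choice of `n_neighbors=3` are assumptions for illustration only. Each new point gets the majority label of its nearest training points.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Invented 2-feature points: class 0 clusters near (0, 0), class 1 near (5, 5).
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# Fit on the toy points; labels are passed as a 1-D array.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# Each query point takes the majority label among its 3 nearest neighbours.
pred = knn.predict(np.array([[0.5, 0.5], [5.5, 5.5]]))
print(pred)
```

Because the method works on distances, columns with large ranges (such as Fare) can dominate columns with small ranges (such as Sex) unless the features are rescaled.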

In [None]:
#  KNN method

model = KNeighborsClassifier(n_neighbors=2)
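# t_train is a one-column DataFrame; flatten it to 1-D to avoid sklearn's DataConversionWarning
model.fit(x_train, t_train.values.ravel())
predicted = model.predict(x_test)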


output = pd.DataFrame({'PassengerId': target_index, 'Survived': predicted})
output.to_csv('Submission_yosher_mar27_v3.csv', index=False)
print("Your submission was successfully saved!")