# **Spaceship Titanic - Logistic Regression Prediction**

## **Introduction**

*Inspired by this notebook: https://www.kaggle.com/code/faeghehgh/3-eda-methods-lgbm/notebook*

I fill missing values by imputation: the median for continuous columns and the most frequent value for categorical columns. After dropping the text columns and converting the categorical columns to numeric labels, I fit a logistic regression model and use it to predict the submission.

## **Import Libraries and Data**

### Import Libraries

In [None]:
import numpy as np
import pandas as pd
import os
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold, train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

from lightgbm import LGBMClassifier

import time
import warnings
warnings.filterwarnings('ignore')

### Import Data

In [234]:
for dirname, _, filenames in os.walk('/content/Kaggle/Input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/content/Kaggle/Input/sample_submission.csv
/content/Kaggle/Input/train.csv
/content/Kaggle/Input/test.csv


In [235]:
train = pd.read_csv("/content/Kaggle/Input/train.csv")
test = pd.read_csv("/content/Kaggle/Input/test.csv")
submission = pd.read_csv("/content/Kaggle/Input/sample_submission.csv")

## **Data Preview**

### 1. Train Data:

In [None]:
train.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object

In [None]:
train.isna().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

In [None]:
train.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


**Train data features:**
*   121,702 values in total (8,693 rows × 14 columns), of which 2,324 are missing.
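
The totals above can be read straight off a DataFrame: `DataFrame.size` counts cells (rows × columns), and `isna().sum().sum()` counts the missing ones. A minimal sketch on a small made-up frame (the `demo` data is hypothetical, not the competition data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `train`
demo = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0],
    "VIP": [False, False, None],
    "Spa": [0.0, 549.0, np.nan],
})

n_cells = demo.size                        # rows * columns
n_missing = int(demo.isna().sum().sum())   # missing cells across all columns
print(n_cells, n_missing)
```

Running the same two expressions on `train` should reproduce the counts reported for the training data.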

**Variables Meanings:**
* PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
* HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
* CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
* Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
* Destination - The planet the passenger will be debarking to.
* Age - The age of the passenger.
* VIP - Whether the passenger has paid for special VIP service during the voyage.
* RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
* Name - The first and last names of the passenger.
* Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
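
The `gggg_pp` and `deck/num/side` formats described above are easy to split into separate columns. This notebook drops `Cabin` and `PassengerId` instead, so the following is only an illustrative sketch on hypothetical rows; the new column names (`Group`, `NumInGroup`, `Deck`, `CabinNum`, `Side`, `GroupSize`) are my own and not part of the dataset.

```python
import pandas as pd

# Hypothetical rows that follow the documented formats
demo = pd.DataFrame({
    "PassengerId": ["0003_01", "0003_02", "0004_01"],
    "Cabin": ["A/0/S", "A/0/S", None],
})

# gggg_pp -> group id and number within the group
demo[["Group", "NumInGroup"]] = demo["PassengerId"].str.split("_", expand=True)

# deck/num/side -> three columns; a missing cabin stays missing in all three
demo[["Deck", "CabinNum", "Side"]] = demo["Cabin"].str.split("/", expand=True)

# Number of passengers sharing the same group id
demo["GroupSize"] = demo.groupby("Group")["PassengerId"].transform("count")
print(demo)
```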




### 2. Test Data

In [None]:
test.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
dtype: object

In [None]:
test.isna().sum()

PassengerId       0
HomePlanet       87
CryoSleep        93
Cabin           100
Destination      92
Age              91
VIP              93
RoomService      82
FoodCourt       106
ShoppingMall     98
Spa             101
VRDeck           80
Name             94
dtype: int64

In [None]:
test.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0013_01 | Earth | True | G/3/S | TRAPPIST-1e | 27.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Nelly Carsoning |
| 1 | 0018_01 | Earth | False | F/4/S | TRAPPIST-1e | 19.0 | False | 0.0 | 9.0 | 0.0 | 2823.0 | 0.0 | Lerome Peckers |
| 2 | 0019_01 | Europa | True | C/0/S | 55 Cancri e | 31.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Sabih Unhearfus |
| 3 | 0021_01 | Europa | False | C/1/S | TRAPPIST-1e | 38.0 | False | 0.0 | 6652.0 | 0.0 | 181.0 | 585.0 | Meratz Caltilter |
| 4 | 0023_01 | Earth | False | F/5/S | TRAPPIST-1e | 20.0 | False | 10.0 | 0.0 | 635.0 | 0.0 | 0.0 | Brence Harperez |


**Test data features:**
* 55,601 values in total (4,277 rows × 13 columns), of which 1,117 are missing.

### 3. Submission Data

In [None]:
submission.describe()

| | PassengerId | Transported |
|---|---|---|
| count | 4277 | 4277 |
| unique | 4277 | 1 |
| top | 0013_01 | False |
| freq | 1 | 4277 |


In [None]:
submission.isna().sum()

PassengerId    0
Transported    0
dtype: int64

**Submission data features:**
* 4,277 rows, one per test passenger, with no missing values.

## Processing - Imputation with Median Strategy



In [None]:
# Copy Dataset for imputation
train_imputation = train.copy()
test_imputation = test.copy()

In [None]:
# Categorize columns according to their data types
categorical_col = ['HomePlanet','CryoSleep','Destination','VIP']
text_col = ['Cabin','Name']
continous_col = ['Age','RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']

In [None]:
# Fill missing values in "continous_col" with the median learned from the training data
imputer = SimpleImputer(strategy='median')
imputer.fit(train_imputation[continous_col])
train_imputation[continous_col] = imputer.transform(train_imputation[continous_col])
test_imputation[continous_col] = imputer.transform(test_imputation[continous_col])
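
The point of fitting on the training data only is that the test set is filled with the training median, not its own. A small sketch on hypothetical toy values (`train_demo` and `test_demo` are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, not the competition data
train_demo = pd.DataFrame({"Age": [10.0, 20.0, np.nan, 60.0]})
test_demo = pd.DataFrame({"Age": [np.nan, 100.0]})

imputer = SimpleImputer(strategy="median")
imputer.fit(train_demo)                       # median of the observed values 10, 20, 60

train_filled = imputer.transform(train_demo)
test_filled = imputer.transform(test_demo)    # missing test value gets the training median

print(imputer.statistics_)    # learned median per column
print(test_filled.ravel())
```

The learned medians are stored in `imputer.statistics_`, which is a convenient place to check what value each column was filled with.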

In [None]:
# Fill missing values in the categorical columns with the most frequent value from the training data
imputer = SimpleImputer(strategy='most_frequent')
imputer.fit(train_imputation[categorical_col])
train_imputation[categorical_col] = imputer.transform(train_imputation[categorical_col])
test_imputation[categorical_col] = imputer.transform(test_imputation[categorical_col])

In [None]:
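# Convert string categories into integer labels.
# Fit each encoder on the training data only and reuse it for the test data,
# so the same category always maps to the same integer.
# Note: transform() raises a ValueError if the test data has a label not seen in training.
for col in categorical_col:
    train_imputation[col] = train_imputation[col].astype(str)
    test_imputation[col] = test_imputation[col].astype(str)
    encoder = LabelEncoder().fit(train_imputation[col])
    train_imputation[col] = encoder.transform(train_imputation[col])
    test_imputation[col] = encoder.transform(test_imputation[col])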
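
`LabelEncoder` assigns integer codes by the sorted order of the values it was fit on, so the codes only line up between train and test when both share one fitted encoder (or happen to contain exactly the same set of values). A minimal sketch with hypothetical values, where one category is absent from the test data:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values; "Earth" happens to be absent from the test data
train_vals = ["Earth", "Europa", "Mars"]
test_vals = ["Europa", "Mars"]

# Fitting a fresh encoder on the test data: Europa -> 0, Mars -> 1
separate = LabelEncoder().fit_transform(test_vals)

# Reusing the encoder fitted on the training data: Europa -> 1, Mars -> 2
encoder = LabelEncoder().fit(train_vals)
shared = encoder.transform(test_vals)

print(list(encoder.classes_), separate, shared)
```

With separately fitted encoders, the model would see `Europa` as `1` during training and `0` at prediction time in this toy case.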

In [None]:
# Drop the free-text and identifier columns (Name, Cabin, PassengerId), which are not used as features
train_imputation.drop(["Name", "Cabin", "PassengerId"] , axis = 1 ,inplace = True)
test_imputation.drop(["Name", "Cabin", "PassengerId"] , axis = 1 ,inplace = True)

### After filling

In [None]:
# Check the test data after filling
test_imputation.isna().sum()

HomePlanet      0
CryoSleep       0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
dtype: int64

In [None]:
# Check the train data after filling
train_imputation.isna().sum()

HomePlanet      0
CryoSleep       0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Transported     0
dtype: int64

### Data Validation

In [None]:
train_imputation.head()

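| | HomePlanet | CryoSleep | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 2 | 39.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | False |
| 1 | 0 | 0 | 2 | 24.0 | 0 | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | True |
| 2 | 1 | 0 | 2 | 58.0 | 1 | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | False |
| 3 | 1 | 0 | 2 | 33.0 | 0 | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | False |
| 4 | 0 | 0 | 2 | 16.0 | 0 | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | True |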


In [None]:
test_imputation.head()

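| | HomePlanet | CryoSleep | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 27.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0 | 0 | 2 | 19.0 | 0 | 0.0 | 9.0 | 0.0 | 2823.0 | 0.0 |
| 2 | 1 | 1 | 0 | 31.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1 | 0 | 2 | 38.0 | 0 | 0.0 | 6652.0 | 0.0 | 181.0 | 585.0 |
| 4 | 0 | 0 | 2 | 20.0 | 0 | 10.0 | 0.0 | 635.0 | 0.0 | 0.0 |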


After imputing `continous_col` and `categorical_col`, label-encoding the categorical columns, and dropping `text_col` and `PassengerId`, every remaining feature is numeric.

## Modeling

In this section, I fit a logistic regression model on the imputed training data, estimate its accuracy with 10-fold cross-validation, and then predict on the test data.

In [None]:
result_train = train_imputation['Transported']
predictors_train = train_imputation.drop(columns=['Transported'])
# Logistic Regression Model "clf"
clf = LogisticRegression(random_state=0).fit(predictors_train, result_train)

result_pred_train = clf.predict(predictors_train)

In [None]:
# Cross-Validation
scores = cross_val_score(clf, predictors_train, result_train, cv=10)
print('Cross-Validation Accuracy Scores', scores)

Cross-Validation Accuracy Scores [0.78965517 0.77241379 0.77356322 0.77790564 0.78135788 0.77905639
 0.79631761 0.78596087 0.81933257 0.77790564]


In [None]:
scores = pd.Series(scores)
scores.min(), scores.mean(), scores.max()

(0.7724137931034483, 0.7853468777694006, 0.8193325661680092)
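
For a classifier, passing an integer such as `cv=10` to `cross_val_score` uses stratified folds without shuffling, and each fold is scored on a fresh clone of the estimator, so the already fitted `clf` is left untouched. The sketch below makes the splitter explicit and reports the spread alongside the mean. It runs on synthetic data from `make_classification`, and `max_iter=1000` is my own choice, so the numbers it prints say nothing about this competition.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the imputed training data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

model = LogisticRegression(random_state=0, max_iter=1000)

# Explicit stratified 10-fold splitter; shuffling with a fixed seed is reproducible
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f} over {len(scores)} folds")
```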

In [None]:
# Use test data
result_pred_test = clf.predict(test_imputation)

## Output Result

In [None]:
submission['Transported'] = result_pred_test
submission.to_csv("/content/Kaggle/Output/submission_final.csv",index=False)
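
Before writing the file, it can be worth checking that the predictions line up with the sample submission row for row. A small sketch on hypothetical data (`sample` and `preds` are made up here and stand in for `submission` and `result_pred_test`), writing to an in-memory buffer instead of disk:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-ins for `submission` and `result_pred_test`
sample = pd.DataFrame({
    "PassengerId": ["0013_01", "0018_01"],
    "Transported": [False, False],
})
preds = np.array([True, False])

# One prediction per row, in the same order as the sample submission
assert len(preds) == len(sample)
sample["Transported"] = preds

# Write without the index so the file has exactly two columns
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

Note that `to_csv` does not create missing directories, so the output folder has to exist before the file is written.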