- WorkFlow
- Feature Engineering
- Data Pre-processing
- Modeling
This workflow picture shows the steps of my project.
Variable | Definition | Key |
---|---|---|
survival | Survival | 0 = No, 1 = Yes |
pclass | Ticket class | 1= 1st, 2=2nd, 3=3rd |
sex | Sex | |
sibsp | # of siblings/spouses aboard the Titanic | |
parch | # of parents/children aboard the Titanic | |
ticket | Ticket number | |
fare | Passenger fare | |
cabin | Cabin number | |
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
```python
import numpy as np
import pandas as pd

data_train = pd.read_csv('train.csv')
```
Feature engineering is an important process. I learned from other contributors' kernels and summarize the main features as follows:
```python
import matplotlib.pyplot as plt
import seaborn as sns

# Plot survival counts against each candidate feature.
f, ax = plt.subplots(2, 3, figsize=(16, 10))
sns.countplot(x='Pclass', hue='Survived', data=data_train, ax=ax[0, 0])
sns.countplot(x='Sex', hue='Survived', data=data_train, ax=ax[0, 1])
sns.countplot(x='Embarked', hue='Survived', data=data_train, ax=ax[0, 2])
sns.countplot(x='SibSp', hue='Survived', data=data_train, ax=ax[1, 0])
sns.countplot(x='Parch', hue='Survived', data=data_train, ax=ax[1, 1])
ax[1, 2].set_visible(False)  # only five features are plotted
```
From the plots, some guesses can be made:
- the Pclass plot shows that first-class passengers had a higher survival rate
- the Sex plot shows that females were much more likely to survive than males
- the Embarked plot shows that passengers who boarded at Cherbourg had a higher survival rate, while most of those who boarded at Southampton died
- the SibSp and Parch plots suggest that traveling with family may affect the survival rate
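The guesses above can also be checked numerically: since `Survived` is 0/1, a groupby mean gives the survival rate per group. A minimal sketch on a toy frame (the real kernel would use `data_train`):

```python
import pandas as pd

# Toy stand-in for data_train, just to show the technique.
df = pd.DataFrame({
    'Pclass':   [1, 1, 3, 3, 3, 2],
    'Sex':      ['female', 'male', 'male', 'male', 'female', 'female'],
    'Survived': [1, 1, 0, 0, 1, 1],
})

# Mean of a 0/1 column per group = survival rate for that group.
rate_by_class = df.groupby('Pclass')['Survived'].mean()
rate_by_sex = df.groupby('Sex')['Survived'].mean()
print(rate_by_class)
print(rate_by_sex)
```

On the real data these rates back up the visual impressions from the count plots.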
```python
from sklearn.ensemble import RandomForestRegressor
import sklearn.preprocessing as preprocessing
from sklearn import linear_model
```
Non-null counts per column:

- Total: 891
- Age: 714
- Cabin: 204

Cabin is missing in more than 3/4 of the rows, so I decided to drop this attribute. Age has 177 missing values, so the next step is to predict them.
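These non-null counts come straight from pandas; a small sketch on a toy frame standing in for `data_train` (which has 891 rows in the real data):

```python
import pandas as pd
import numpy as np

# Toy frame with some missing Age and Cabin values.
df = pd.DataFrame({'Age': [22.0, np.nan, 26.0],
                   'Cabin': [np.nan, 'C85', np.nan]})

# Non-null count per column, as listed above.
print(df.count())
# Complementary view: missing values per column.
print(df.isnull().sum())
```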
```python
def set_missing_values(data):
    """Fill missing Age values by predicting them from other numeric features."""
    age_df = data[['Age', 'Fare', 'Parch', 'SibSp', 'Pclass']]
    known_age = age_df[data.Age.notnull()].values
    unknown_age = age_df[data.Age.isnull()].values
    X = known_age[:, 1:]  # Fare, Parch, SibSp, Pclass
    y = known_age[:, 0]   # Age
    rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
    rfr.fit(X, y)
    predicted_ages = rfr.predict(unknown_age[:, 1:])
    data.loc[data.Age.isnull(), 'Age'] = predicted_ages
    # Cabin is mostly missing, so drop it entirely.
    data.drop(['Cabin'], axis=1, inplace=True)
    return data, rfr

data_train, rfr = set_missing_values(data_train)
```
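The idea behind this imputation in miniature: fit a regressor on the rows where Age is known, then predict it where it is missing. A toy sketch with a tiny forest (column choice and sizes are illustrative only):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy frame: one row has a missing Age.
df = pd.DataFrame({'Age': [22.0, 38.0, np.nan, 35.0],
                   'Fare': [7.25, 71.3, 8.05, 53.1],
                   'Pclass': [3, 1, 3, 1]})

known = df[df.Age.notnull()]
rfr = RandomForestRegressor(random_state=0, n_estimators=10)
rfr.fit(known[['Fare', 'Pclass']], known['Age'])

# Fill the gap with the model's prediction.
df.loc[df.Age.isnull(), 'Age'] = rfr.predict(df.loc[df.Age.isnull(), ['Fare', 'Pclass']])
print(df['Age'].tolist())
```

Since a random forest averages training targets, the filled value always lands inside the range of the observed ages.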
```python
def attribute_to_number(data):
    """One-hot encode the categorical columns and drop the originals."""
    dummies_Pclass = pd.get_dummies(data['Pclass'], prefix='Pclass')
    dummies_Sex = pd.get_dummies(data['Sex'], prefix='Sex')
    dummies_Embarked = pd.get_dummies(data['Embarked'], prefix='Embarked')
    data = pd.concat([data, dummies_Pclass, dummies_Sex, dummies_Embarked], axis=1)
    data.drop(['Pclass', 'Sex', 'Embarked'], axis=1, inplace=True)
    return data

data_train_number = attribute_to_number(data_train)
```
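What `pd.get_dummies` does, in isolation: it turns one categorical column into one 0/1 indicator column per level. A minimal example:

```python
import pandas as pd

# One categorical column becomes one indicator column per distinct value.
s = pd.Series(['male', 'female', 'female'])
dummies = pd.get_dummies(s, prefix='Sex')
print(dummies.columns.tolist())
print(dummies)
```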
Age and Fare have much larger value ranges than the other features, so I will standardize them; StandardScaler gives zero mean and unit variance, so most values fall roughly within (-1, 1).
```python
def Scales(data):
    """Standardize Age and Fare and replace them with scaled columns."""
    scaler = preprocessing.StandardScaler()
    data['Age_scaled'] = scaler.fit_transform(data['Age'].values.reshape(-1, 1))
    data['Fare_scaled'] = scaler.fit_transform(data['Fare'].values.reshape(-1, 1))
    data.drop(['Fare', 'Age'], axis=1, inplace=True)
    return data

data_train_number_scales = Scales(data_train_number)
```
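As a quick sanity check on what StandardScaler actually produces (zero mean and unit variance, not values strictly bounded to (-1, 1)):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy ages; fit_transform centers and rescales them.
ages = np.array([[22.0], [38.0], [26.0], [35.0]])
scaled = StandardScaler().fit_transform(ages)
print(scaled.mean(), scaled.std())
```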
```python
# Drop identifier columns that carry no predictive signal.
data_train_number_scales.drop(['PassengerId', 'Name', 'Ticket'], axis=1, inplace=True)

# Split the frame into target vector y and feature matrix X.
y = data_train_number_scales['Survived'].values
data_train_number_scales.drop(['Survived'], axis=1, inplace=True)
X = np.array(data_train_number_scales)

# L1-regularized logistic regression; the liblinear solver supports the l1 penalty.
clf = linear_model.LogisticRegression(C=1.0, penalty='l1', solver='liblinear', tol=1e-6)
clf.fit(X, y)
clf
```
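After fitting, the learned weights can be paired with the column names to see which features drive the prediction; with the l1 penalty, uninformative features tend to get zero weight. A sketch on toy data (feature names `f0`–`f2` are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Toy data: only the first feature actually determines the label.
rng = np.random.RandomState(0)
X_toy = rng.randn(100, 3)
y_toy = (X_toy[:, 0] + 0.1 * rng.randn(100) > 0).astype(int)

clf = linear_model.LogisticRegression(C=1.0, penalty='l1',
                                      solver='liblinear', tol=1e-6)
clf.fit(X_toy, y_toy)

# Pair each coefficient with its feature name.
coefs = pd.Series(clf.coef_[0], index=['f0', 'f1', 'f2'])
print(coefs)
```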
```python
data_test = pd.read_csv('test.csv')
# One Fare value is missing in the test set; fill it with 0.
data_test.loc[data_test.Fare.isnull(), 'Fare'] = 0

# Apply the same preprocessing steps as on the training set.
data_test, _ = set_missing_values(data_test)        # fill missing Age, drop Cabin
data_test_number = attribute_to_number(data_test)   # one-hot encode categoricals
data_test_number_scales = Scales(data_test_number)  # standardize Age and Fare

df_test = data_test_number_scales
df_test.drop(['PassengerId', 'Name', 'Ticket'], axis=1, inplace=True)
test = np.array(df_test)
predictions = clf.predict(test)
result = pd.DataFrame({'PassengerId': data_test['PassengerId'].values,
                       'Survived': predictions.astype(np.int32)})
result.to_csv('logistic_regression_predictions.csv', index=None)
```
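Kaggle expects a submission with exactly the two columns `PassengerId` and `Survived`; a small sketch verifying that the frame round-trips through CSV in that shape (using an in-memory buffer instead of a file, with made-up ids and predictions):

```python
import io
import numpy as np
import pandas as pd

# Tiny result frame built the same way as above.
predictions = np.array([0, 1, 1])
result = pd.DataFrame({'PassengerId': [892, 893, 894],
                       'Survived': predictions.astype(np.int32)})

# Round-trip through CSV (index=None keeps the row index out of the file).
buf = io.StringIO()
result.to_csv(buf, index=None)
buf.seek(0)
check = pd.read_csv(buf)
print(check.columns.tolist())
```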