# Titanic Dataset

### Read In CSV

**+ Where did the data come from?**
- Harland & Wolff Heavy Industries, builders of the Titanic

**+ What information does this data represent?**
- The **rows** represent: Passengers who were aboard the Titanic.
- The **columns/features** represent: Attributes of each passenger.

**+ What information do the feature labels represent?**
- **Survived**: Outcome of survival (0 = No; 1 = Yes)
- **Pclass**: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
- **Name**: Name of passenger
- **Sex**: Sex of the passenger
- **Age**: Age of the passenger (Some entries contain `NaN`)
- **SibSp**: Number of siblings and spouses of the passenger aboard
- **Parch**: Number of parents and children of the passenger aboard
- **Ticket**: Ticket number of the passenger
- **Fare**: Fare paid by the passenger
- **Cabin**: Cabin number of the passenger (Some entries contain `NaN`)
- **Embarked**: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)
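The `NaN` entries mentioned above can be counted per column before deciding how to handle them. The following is a minimal sketch on a hand-made frame; the name `toy` and all of its values are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
    "Fare": [7.25, 71.2833, 7.925],
})

# isna() flags missing cells; sum() then counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Columns with a nonzero count are the ones that a later `dropna` call would act on.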

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/dylanjorgensen/data/master/titanic/titanic.csv")

df.head()

Unnamed: 0,Id,Name,Class,Sex,Age,Fare,Survived
0,1,"Braund, Mr. Owen Harris",3,male,22.0,7.25,0
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,female,38.0,71.2833,1
2,3,"Heikkinen, Miss. Laina",3,female,26.0,7.925,1
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,female,35.0,53.1,1
4,5,"Allen, Mr. William Henry",3,male,35.0,8.05,0


### Machine Learning

In [3]:
train_all = pd.read_csv("https://raw.githubusercontent.com/dylanjorgensen/data/master/titanic/titanic_train_300.csv")
train_x = train_all.loc[:, "Class" : "Fare"]
train_y = train_all.loc[:, "Survived"]

train_x.head()

Unnamed: 0,Class,Sex,Age,Fare
0,3,male,22.0,7.25
1,1,female,38.0,71.2833
2,3,female,26.0,7.925
3,1,female,35.0,53.1
4,3,male,35.0,8.05


In [4]:
test_all = pd.read_csv("https://raw.githubusercontent.com/dylanjorgensen/data/master/titanic/titanic_test_200.csv")
test_x = test_all.loc[:, "Class" : "Fare"]
test_y = test_all.loc[:, "Survived"]

In [5]:
validate_all = pd.read_csv("https://raw.githubusercontent.com/dylanjorgensen/data/master/titanic/titanic_validate_200.csv")
validate_x = validate_all.loc[:, "Class" : "Fare"]
validate_y = validate_all.loc[:, "Survived"]
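The three cells above split each file into features and target with a label slice. As a rough illustration, the sketch below uses a one-row frame whose column order is assumed from the outputs in this notebook; `toy`, `features`, and `target` are hypothetical names.

```python
import pandas as pd

# Hypothetical one-row frame; column order assumed from the notebook outputs.
toy = pd.DataFrame(
    [[0, 3, 1, 22.0, 7.25, 0]],
    columns=["Unnamed: 0", "Class", "Sex", "Age", "Fare", "Survived"],
)

# Label slices with .loc include BOTH endpoints, so "Fare" is kept
# and the leading index column is left out.
features = toy.loc[:, "Class":"Fare"]
target = toy.loc[:, "Survived"]

print(list(features.columns))
```

Unlike positional slicing, a `.loc` label slice is inclusive at the end, which is why `"Survived"` has to be selected separately.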

### Regression Preprocessing

In [6]:
# drop unwanted columns
df = df.drop(['Id', 'Name'], axis=1)

# If 'any' values are missing in a row, then drop that row
df = df.dropna(how='any', axis=0)

# Encode Sex as an integer: female -> 0, male -> 1
df = df.replace(['female','male'], [0,1])

print(df.shape)
df.head()

(714, 5)


Unnamed: 0,Class,Sex,Age,Fare,Survived
0,3,1,22,7.25,0
1,1,0,38,71.2833,1
2,3,0,26,7.925,1
3,1,0,35,53.1,1
4,3,1,35,8.05,0
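The preprocessing steps above can also be written so that the encoding touches only the `Sex` column. This is a sketch of one alternative, not the notebook's own method: it swaps `df.replace` for `Series.map`, and the frame `toy` with its values is a made-up assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same columns as the loaded CSV.
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "Name": ["A", "B", "C"],
    "Class": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, np.nan, 26.0],
    "Fare": [7.25, 71.2833, 7.925],
    "Survived": [0, 1, 1],
})

# Drop identifier columns, then drop any row with a missing value.
clean = toy.drop(columns=["Id", "Name"]).dropna(how="any", axis=0)

# Encode only the Sex column: female -> 0, male -> 1.
clean["Sex"] = clean["Sex"].map({"female": 0, "male": 1})

print(clean.shape)
```

Scoping the mapping to one column avoids accidentally rewriting matching strings elsewhere in the frame.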


In [7]:
# 300 training examples
train_all = df.iloc[0:300, :]
export = pd.DataFrame(train_all)
export.to_csv('titanic_train_300.csv')

print(export.shape)
export.head()

(300, 5)


Unnamed: 0,Class,Sex,Age,Fare,Survived
0,3,1,22,7.25,0
1,1,0,38,71.2833,1
2,3,0,26,7.925,1
3,1,0,35,53.1,1
4,3,1,35,8.05,0


In [8]:
# 200 test examples
test_all = df.iloc[300:500, :]
export = pd.DataFrame(test_all)
export.to_csv('titanic_test_200.csv')

print(export.shape)
export.head()

(200, 5)


Unnamed: 0,Class,Sex,Age,Fare,Survived
376,3,0,22,7.25,1
377,1,1,27,211.5,0
378,3,1,20,4.0125,0
379,3,1,19,7.775,0
380,1,0,42,227.525,1
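The row labels in the test split start at 376 rather than 300. One likely explanation, sketched below on a toy frame with made-up values, is that `dropna` removes rows but keeps the original index labels, while `iloc` slices by position.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; rows 1 and 3 have a missing Age.
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan, 35.0]})

# dropna keeps the surviving rows' original labels: 0, 2, 4.
kept = toy.dropna()

# iloc counts positions, not labels: position 1 holds label 2.
second_onward = kept.iloc[1:3]
print(second_onward.index.tolist())
```

If contiguous labels are wanted after dropping rows, `reset_index(drop=True)` renumbers them from zero.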


In [9]:
# 200 validation examples
validate_all = df.iloc[500:700, :]
export = pd.DataFrame(validate_all)
export.to_csv('titanic_validate_200.csv')

print(export.shape)
export.head()

(200, 5)


Unnamed: 0,Class,Sex,Age,Fare,Survived
632,1,1,32,30.5,1
634,3,0,9,27.9,0
635,2,0,28,13.0,1
636,3,1,32,7.925,0
637,2,1,31,26.25,0
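The export cells call `to_csv` with its defaults, which also write the row index as an unnamed first column. The sketch below shows the round trip on a toy frame, using an in-memory buffer instead of a file; `toy`, `with_index`, and `without_index` are hypothetical names.

```python
import io

import pandas as pd

# Hypothetical toy frame; values are made up for illustration only.
toy = pd.DataFrame({"Class": [3, 1], "Survived": [0, 1]})

# Default export: the index is written as a header-less first column.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer)

# Export with index=False: only the data columns are written.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
without_index = pd.read_csv(buffer)

print(list(with_index.columns))
print(list(without_index.columns))
```

Passing `index=False` is one way to keep an extra `Unnamed: 0` column from appearing when the file is read back.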
