# Introduction to Data Science

<br/>
<br/>


#  ***1. Titanic Survival Prediction***

<br/>
<br/>
<br/>

<hr>


## Follow the steps

<br/>

### 0. Setup

### 1. Dataset

### 2. Exploration

### 3. Clean

### 4. Classification




<br/>

<hr>

# 0. Setup

<br>

### Libraries:

- pandas (https://pandas.pydata.org/docs/getting_started/index.html#getting-started)


- numpy (https://numpy.org/doc/1.15/user/quickstart.html)


- scikit-learn (https://scikit-learn.org/stable/index.html)


- matplotlib (https://matplotlib.org/gallery/index.html)



In [None]:
import pandas as pd
import numpy as np
import sklearn # the library name is scikit-learn
import matplotlib.pyplot as plt

### Installation
<br>
(Replace `name_of_library` with the name of the library that you want to install.)

If the import code above runs without an error such as `ImportError: No module named name_of_library`, the library is already installed.

<br>

**For Anaconda** 

Your environment probably already has these libraries. 

To install one, go to: Anaconda Navigator -> Environments -> set the dropdown to "All" -> search for "name_of_library" -> check "name_of_library" and click "Apply". 

<br>

**For Google Colab** 

Run this line in a code cell to install it: `!pip install name_of_library`.

<br>

**For Python environment** 

Run in the terminal: `pip install name_of_library`. If your environment has both Python 2 and Python 3, run `pip3 install name_of_library` instead. You can list the installed libraries by running `pip freeze` in the terminal (again, replace `pip` with `pip3` if you have both Python 2 and 3).
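
You can also check from inside the notebook whether a library can be imported. The snippet below is a minimal sketch; `is_installed` is a hypothetical helper name chosen for this example, not part of any library.

```python
import importlib.util

def is_installed(module_name):
    """Return True if `module_name` can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None

# note: scikit-learn is imported under the name "sklearn"
for name in ["pandas", "numpy", "sklearn", "matplotlib"]:
    print(name, "installed" if is_installed(name) else "missing")
```

Any library reported as missing can then be installed with one of the methods above.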


<br>
<hr>

# 1. Dataset

We will use the famous Titanic dataset. The data covers passengers only, not the crew.


The dataset is the `"titanic_dataset.csv"` file. A `.csv` (comma-separated values) file can be seen as a table: the first line is the header and the following lines are the rows. Columns are usually separated by commas, but they can also be separated by semicolons.
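
To illustrate the separator, here is a small sketch that reads semicolon-separated text with `pandas`. The content of `raw` is made up for this example and is not taken from the Titanic file.

```python
import io
import pandas as pd

# hypothetical miniature file content, separated by semicolons
raw = "name;age;fare\nAlice;29;7.25\nBob;35;71.28\n"

# sep=";" tells pandas which character separates the columns
small_df = pd.read_csv(io.StringIO(raw), sep=";")
print(small_df)
```

With a comma-separated file, the default `sep=","` applies and the argument can be omitted.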

<br>

### Data Dictionary


- `pclass` - ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)


- `survived` - survival (0 = No, 1 = Yes)


- `name` - name	


- `sex` - sex	


- `age` - age in years, can be approximated


- `sibsp` - number of siblings / spouses aboard the Titanic	


- `parch` - number of parents / children aboard the Titanic	


- `ticket` - ticket number	


- `fare` - passenger fare (British pounds)


- `cabin` - cabin number (A = top deck, G = bottom deck)


- `embarked` - port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)


- `boat` - lifeboat (if survived)


- `body` - body number (if did not survive and body was recovered)


- `home.dest` - home/destination

<br>
<br>

## 1.1 Read into a `dataframe`

We will use the `pandas` library to read our dataset into a `dataframe`. A `dataframe` can be seen as a table, but with useful methods to interact with our data (https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html#min-tut-01-tableoriented).

In [None]:
df = pd.read_csv("titanic_dataset.csv")

<br>
<br>

## 1.2 What is the size of our Dataframe?

In [None]:
# your code here


In [None]:
df.shape

We have 1310 rows and 14 columns 

<br>

<hr>


# 2. Exploration




<br>

## 2.1 Display the First 5 Rows



In [None]:
# your code here


In [None]:
df.head() # default is 5, but you can specify with .head(n_of_rows)  

<br>
<br>

## 2.2 How many passengers do we have?

In [None]:
# your code here

In [None]:
len(df)

<br>
<br>

## 2.3 What type of features do we have?

In [None]:
df.info()

<br><br>

## 2.4 Column description

In [None]:
df['age'].describe() # numeric description

In [None]:
df['sex'].describe() # categorical description

<br>

### What is the mean `fare`?

In [None]:
# your code here

In [None]:
df['fare'].describe()

<br>

### What is the most common `pclass`?

You need to convert `pclass` to a string with the following code:

`df["pclass"] = df["pclass"].astype(str)`

In [None]:
# your code here

In [None]:
df["pclass"] = df["pclass"].astype(str)
df['pclass'].describe()

<br><br>

## 2.5 Value counts

In [None]:
df["pclass"].value_counts()

<br>

### With normalization

In [None]:
df["pclass"].value_counts(normalize=True)
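
As a small illustration of what `normalize=True` does, the sketch below uses a made-up column (not the Titanic data) and compares the normalized result with dividing the counts by their total.

```python
import pandas as pd

# toy column, not the Titanic data
classes = pd.Series(["3", "1", "3", "2", "3"])

counts = classes.value_counts()
shares = classes.value_counts(normalize=True)

# normalize=True divides each count by the number of non-missing values
print(counts / counts.sum())
print(shares)
```

Multiplying the shares by 100 gives percentages.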

<br>

### What is the survival percentage?

In [None]:
# your code here

In [None]:
df["survived"].value_counts(normalize=True)

<br>
<hr>

# 3. Clean

Prepare the dataset so that we can perform classification afterwards.

<br>

## 3.1 Drop columns 

In [None]:
df = df.drop(['ticket'], axis=1)

<br>

### Select other columns that you may find relevant to drop

In [None]:
# your code here

In [None]:
# `name` is free text that the classifiers in section 4 cannot convert to numbers
df = df.drop(['name', 'cabin', 'boat', 'body', 'home.dest'], axis=1)

<br>

## 3.2 Convert categorical features to numeric 

In [None]:
genders = {"male": 0, "female": 1}
df['sex'] = df['sex'].map(genders)
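
One thing to keep in mind with `map` and a dictionary: values that are not keys of the dictionary become missing values. The sketch below shows this on a made-up column.

```python
import pandas as pd

# toy column with an unexpected spelling and a missing entry
toy_sex = pd.Series(["male", "female", "Female", None])
codes = toy_sex.map({"male": 0, "female": 1})

print(codes)
# "Female" and None are not keys of the dictionary, so they become NaN
print(codes.isnull().sum())
```

Checking for missing values after a mapping is a quick way to spot unexpected categories.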

<br>

## 3.3 One-hot encoding

In [None]:
embarked_df = pd.get_dummies(df["embarked"],prefix='embarked') # returns a dataframe
embarked_df.head()

In [None]:
# concatenate the dataframe with the embarked dataframe
df = pd.concat([df,embarked_df],axis=1)

# now drop the original 'embarked' column (you don't need it anymore)
df.drop(['embarked'], axis=1, inplace=True)

df.head()
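
To see what one-hot encoding produces, here is a minimal sketch on a made-up column. It passes `dtype=int` so the new columns hold 0/1 instead of booleans; that argument is an optional choice for this example.

```python
import pandas as pd

# toy data, not the Titanic file
toy_ports = pd.DataFrame({"embarked": ["S", "C", "Q", "S"]})

# one new column per distinct value; exactly one of them is 1 in each row
dummies = pd.get_dummies(toy_ports["embarked"], prefix="embarked", dtype=int)
print(dummies)
```

Each row has a single 1, located in the column of its original category.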

<br>

## 3.5 How many relatives

Create a new feature called `relatives` that represents the number of relatives.

In [None]:
# your code here

In [None]:
df["relatives"] = df["sibsp"] + df["parch"]
df["relatives"].value_counts()

<br>

## 3.6 Was the passenger alone?

Create a new feature called `alone` that indicates whether a passenger was travelling without relatives. 

You need to convert `alone` to int with the following code:

`df["alone"] = df["alone"].astype(int)`

In [None]:
# your code here
df["alone"] = df["relatives"] == 0
df["alone"] = df["alone"].astype(int)
df["alone"].value_counts(normalize=True)

In [None]:
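# Optional: replace `age` with an ordinal age-group code (0-6).
# This overwrites `age` in place, while sections 3.7.1 and 3.8 assume ages
# in years, so skip this cell if you plan to follow those sections.
# Rows with a missing age are left untouched.
df.loc[ df['age'] <= 11, 'age'] = 0
df.loc[(df['age'] > 11) & (df['age'] <= 18), 'age'] = 1
df.loc[(df['age'] > 18) & (df['age'] <= 22), 'age'] = 2
df.loc[(df['age'] > 22) & (df['age'] <= 27), 'age'] = 3
df.loc[(df['age'] > 27) & (df['age'] <= 33), 'age'] = 4
df.loc[(df['age'] > 33) & (df['age'] <= 40), 'age'] = 5
df.loc[(df['age'] > 40) & (df['age'] <= 66), 'age'] = 6
# ages above 66 share group 6 with the 40-66 range
df.loc[ df['age'] > 66, 'age'] = 6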


<br>

## 3.7 Missing values

### Number of missing values

In [None]:
df.isnull().sum()

### Percentage of missing values

In [None]:
df.isnull().sum()/len(df)
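
The same fraction can be obtained with `isnull().mean()`, since the mean of a True/False column is the share of True values. The sketch below checks this on made-up data.

```python
import numpy as np
import pandas as pd

# toy data with some missing entries
toy_missing = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "fare": [7.25, 8.05, np.nan, 53.1],
})

missing_share = toy_missing.isnull().mean()  # same as .isnull().sum() / len(toy_missing)
print(missing_share * 100)  # as percentages
```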

<br>

### 3.7.1 Set default value
 

In [None]:
df["age"] = df["age"].fillna(20)

<br>

### 3.7.2 Remove missing values
 

In [None]:
# drop all rows with at least one missing value
df = df.dropna()

<br>

## 3.8 Age into categories

Add the categories `age_<10`, `age_10-25`, `age_25-40`, `age_40-65`, `age_>65`.
 

In [None]:
# your code here

In [None]:
df['age'] = df['age'].astype(int)

# half-open ranges, so every age falls into exactly one column
df["age_<10"] = df["age"] < 10
df["age_10-25"] = (df["age"] >= 10) & (df["age"] < 25)
df["age_25-40"] = (df["age"] >= 25) & (df["age"] < 40)
df["age_40-65"] = (df["age"] >= 40) & (df["age"] < 65)
df["age_>65"] = df["age"] >= 65
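
An alternative way to build such ranges is `pd.cut`. The sketch below is one possible approach on made-up ages, assuming half-open ranges such as 10 <= age < 25 (that is what `right=False` selects); the bin edges and labels are choices made for this example.

```python
import numpy as np
import pandas as pd

# toy ages chosen to sit on and around the range boundaries
ages = pd.Series([4, 10, 24, 25, 39, 40, 64, 65, 80])

bins = [0, 10, 25, 40, 65, np.inf]
labels = ["age_<10", "age_10-25", "age_25-40", "age_40-65", "age_>65"]

# right=False makes each bin include its left edge and exclude its right edge
groups = pd.cut(ages, bins=bins, labels=labels, right=False)

# one column per age range, as in the one-hot encoding section
age_columns = pd.get_dummies(groups)
print(age_columns)
```

Defining the edges in one list makes it harder to leave a gap or an overlap between two ranges.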


<br>

## 3.9 Clean your way

In [None]:
# your code here

In [None]:
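# Example: keep only the deck letter of `cabin` in a new column `c`.
# `cabin` was dropped in section 3.1, so this only runs if you kept that column.
if "cabin" in df.columns:
    df["c"] = df["cabin"].astype(str).str[0]

    # print every cabin value to inspect the raw data
    for c in df["cabin"]:
        print(c)

df.head()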

<br>

## 3.10 Checkpoint

Save the cleaned dataset.


In [None]:
# index=False avoids saving the row index as an extra column
df.to_csv('titanic_cleaned.csv', index=False)
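
By default, `to_csv` also writes the row index, which comes back as an extra unnamed column when the file is read again. The sketch below shows the difference on made-up data, writing to in-memory buffers instead of files.

```python
import io
import pandas as pd

toy_clean = pd.DataFrame({"survived": [1, 0], "fare": [71.28, 7.25]})

# default: the index is written as the first column
with_index = io.StringIO()
toy_clean.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
toy_clean.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```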

<br>
<hr>

# 4. Classification

We want to predict if a passenger survived. 

We will use accuracy as the metric to choose the best model.

List of Metrics: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics


<br>

## 4.1 Split data between train and test

In [None]:
from sklearn.model_selection import train_test_split

target_name = "survived"

# save feature names
feature_names = list(df.columns)
feature_names.remove(target_name)

# split between train and test, you may update the test_size percentage
x_train, x_test, y_train, y_test = train_test_split(df.drop(columns=[target_name]),df[target_name], test_size=0.2) 
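
Two optional arguments of `train_test_split` are often useful: `random_state` makes the split reproducible, and `stratify` keeps the proportion of each class the same in both parts. The sketch below uses synthetic data; the variable names and the value 42 are arbitrary choices for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic data: 100 rows, 2 features, 30% positive labels
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 30 + [0] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)    # sizes of the two parts
print(y_tr.mean(), y_te.mean())  # share of positive labels in each part
```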


<br>

## 4.2 Logistic regression

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

<br>

### 4.2.1 Train

(you may get some warnings)

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(x_train, y_train)

### 4.2.2 Test

In [None]:
from sklearn.metrics import accuracy_score

# predict
y_pred = model.predict(x_test)

# evaluate
accuracy = accuracy_score(y_test, y_pred) * 100
print(accuracy)
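
Accuracy is the fraction of predictions that match the true labels. The sketch below computes it by hand on made-up labels and compares the result with `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# toy labels: 3 of the 5 predictions are correct
y_true_demo = np.array([1, 0, 1, 1, 0])
y_pred_demo = np.array([1, 0, 0, 1, 1])

manual_accuracy = (y_true_demo == y_pred_demo).mean()
library_accuracy = accuracy_score(y_true_demo, y_pred_demo)

print(manual_accuracy, library_accuracy)
```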

<br>

## 4.3 Decision Tree

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

<br>

### 4.3.1 Train

(you may get some warnings)

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# set model
model = DecisionTreeClassifier()

# you can set a limit of depth and/or leaf nodes
# model = DecisionTreeClassifier(max_depth=4, max_leaf_nodes=10)

# train
model.fit(x_train, y_train)

# preview of tree
tree.plot_tree(model)
plt.show()

<br>

### 4.3.2 Test

In [None]:
from sklearn.metrics import accuracy_score

# predict
y_pred = model.predict(x_test)

# evaluate
accuracy = accuracy_score(y_test, y_pred) * 100
print(accuracy)
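
To see the effect of limiting the depth, you can compare the accuracy on the training data with the accuracy on the test data. The sketch below does this on synthetic data from `make_classification` rather than on the Titanic features; the sizes, the depth of 3 and the seeds are arbitrary choices for this example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic data standing in for the Titanic features
X_syn, y_syn = make_classification(n_samples=400, n_features=8, random_state=0)
X_syn_tr, X_syn_te, y_syn_tr, y_syn_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

results = {}
for depth in [None, 3]:  # None means no depth limit
    demo_tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    demo_tree.fit(X_syn_tr, y_syn_tr)
    train_acc = accuracy_score(y_syn_tr, demo_tree.predict(X_syn_tr))
    test_acc = accuracy_score(y_syn_te, demo_tree.predict(X_syn_te))
    results[depth] = (train_acc, test_acc)
    print(depth, train_acc, test_acc)
```

A large gap between the two numbers suggests that the tree has memorized the training data.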

<br>

### 4.3.3 Visualize Tree

To visualize the tree you need to install the `pydotplus` library, which in turn needs the Graphviz program to be installed on your system to write the image. 

The tree will be saved with the name `model_decision_tree.png`.

In [None]:
from sklearn.tree import export_graphviz
import pydotplus
import collections

def save_tree_graph(model, feature_names):
    dot_data = export_graphviz(model,
                               feature_names=feature_names,
                               out_file=None,
                               filled=True,
                               rounded=True)

    graph = pydotplus.graph_from_dot_data(dot_data)

    colors = ('turquoise', 'orange')
    edges = collections.defaultdict(list)

    for edge in graph.get_edge_list():
        edges[edge.get_source()].append(int(edge.get_destination()))

    for edge in edges:
        edges[edge].sort()
        for i in range(2):
            dest = graph.get_node(str(edges[edge][i]))[0]
            dest.set_fillcolor(colors[i])

    # save
    graph.write_png('model_decision_tree.png')

save_tree_graph(model,feature_names)