<a href="https://colab.research.google.com/github/angel539/Python-Notebooks/blob/main/Titanic_Challenge_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Ángel Mora Segura** / *Data Scientist*
https://www.linkedin.com/in/angelmoras/

> Data set extracted from **Kaggle's challenge** about the Titanic disaster: https://www.kaggle.com/c/titanic/

**About the data set:**

The sinking of the Titanic is one of the most infamous shipwrecks in history. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others. In this notebook, we will explore different ways to predict *“what sorts of people were more likely to survive?”* using passenger data (e.g. *name*, *age*, *gender*, *socio-economic class*).

# 0. Installations and `imports`.

> Let's first import the required libraries:

In [None]:
# Library for data analysis and manipulation
import pandas as pd
# Library for math operations with matrices and vectors
import numpy as np

# For graphical representation
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import matplotlib.ticker as ticker
import seaborn as sns

from scipy import stats

In [None]:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import StackingClassifier

from sklearn import metrics
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss

# 1. Exploratory Data Analysis.

## 1.1. Loading the dataset and checking its content.

First, we will load the training CSV into a pandas `DataFrame` (`df_titanic`). Then, we will check the size of the data set (`shape -> (rows, columns)`) and its number of dimensions (`ndim`).

In [None]:
# `error_bad_lines` was removed in pandas 2.0; `on_bad_lines="skip"` (pandas >= 1.3) replaces it.
df_titanic = pd.read_csv('titanic/train.csv', on_bad_lines="skip", engine="python", sep=",")
df_titanic.set_index("PassengerId", inplace = True)
print(df_titanic.shape)
print(df_titanic.ndim)

(891, 11)
2


> In this case, `df_titanic` contains 891 rows and 11 columns (`PassengerId` became the index, so it is not counted as a column). The data frame has two dimensions, i.e. rows and columns.
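As a small illustration of `shape` and `ndim`, the sketch below uses a toy data frame with made-up values (the name `toy` is only for this example):

```python
import pandas as pd

# Toy data frame (made-up values) with 3 rows and 2 columns.
toy = pd.DataFrame({"Age": [22.0, 38.0, 26.0], "Fare": [7.25, 71.28, 7.92]})

n_rows, n_cols = toy.shape   # shape is a (rows, columns) tuple
print(n_rows, n_cols)        # 3 2
print(toy.ndim)              # 2: a data frame has two axes
print(toy["Age"].ndim)       # 1: a single column is a one-dimensional Series
```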

To check the dataframe structure (columns), we will print the first 10 rows using `iloc` instead of `head(10)`.

In [None]:
# There are three main methods of selecting columns in pandas:
#   1. using a dot notation, e.g. data.column_name,
#   2. using square braces and the name of the column as a string, e.g. data['column_name'] or
#   3. using numeric indexing and the iloc selector data.iloc[:, <column_number>]

# For selecting rows:
#   1. numeric row selection using the iloc selector, e.g. data.iloc[0:10, :] – select the first 10 rows.
#   2. label-based row selection using the loc selector
#              (labels come from the dataframe index, here PassengerId), e.g. data.loc[44, :]
#   3. logical-based row selection using evaluated statements, e.g. data[data["Area"] == "Ireland"]
#               – select the rows where Area value is ‘Ireland’.
df_titanic.iloc[0:10, :]

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
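The three row-selection methods listed in the comments above can be compared on a toy frame. This is only a sketch with made-up rows; the names `toy`, `first_two`, `passenger_3` and `women` are illustrative.

```python
import pandas as pd

# Made-up rows, indexed by a PassengerId-like label.
toy = pd.DataFrame(
    {"Sex": ["male", "female", "female"], "Age": [22.0, 38.0, 26.0]},
    index=pd.Index([1, 2, 3], name="PassengerId"),
)

first_two = toy.iloc[0:2, :]          # position-based: rows at positions 0 and 1
passenger_3 = toy.loc[3, :]           # label-based: the row whose index label is 3
women = toy[toy["Sex"] == "female"]   # boolean mask: rows where the condition holds

print(first_two)
print(passenger_3)
print(women)
```

Note that `iloc` counts positions from 0, while `loc` uses the index labels, so `toy.iloc[2]` and `toy.loc[3]` return the same row here.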


## 1.2 Dealing with missing values and categories.

### 1.2.1 Looking for NaN(s) and possible strong correlations.

We will check whether there are missing values that may affect our study using the `isna()` function. Then, we will check the type of each column through the `dtypes` attribute.

In [None]:
print(df_titanic.isna().sum())

In [None]:
print(df_titanic.dtypes)

Let's check whether correlations exist among the numerical values. For that purpose, we will first select only the numerical columns:

In [None]:
numerical_columns = df_titanic.select_dtypes(include=['float64','int64']).columns

In [None]:
sns.heatmap(df_titanic[numerical_columns].corr(), annot = True, cmap = "Blues")

At this point, it seems that:
*   There is a **small positive correlation** between the `Fare` and the `Survived` values. We will come back to this graph after changing `Sex` from categorical to numerical.


From this part of the study, we know that we do not need to convert any column to a different data type. However, we will need to fill in the missing values and prepare the data for training the machine learning models. For example, we can follow strategies such as:

*   Filling the missing values with the `mean()`, `median()` or `mode()` of the values present in the column.
*   Changing categorical values into number-based categories.
*   Reassigning groups based on the rest of the information present in the dataframe.
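A minimal sketch of the first strategy, using two made-up columns (`ages` and `ports` are hypothetical names, not columns of `df_titanic`):

```python
import numpy as np
import pandas as pd

ages = pd.Series([20.0, 30.0, np.nan, 70.0])   # made-up numeric column
ports = pd.Series(["S", "C", "S", np.nan])      # made-up categorical column

ages_mean = ages.fillna(ages.mean())        # NaN -> 40.0
ages_median = ages.fillna(ages.median())    # NaN -> 30.0
ports_mode = ports.fillna(ports.mode()[0])  # NaN -> "S", the most frequent value
```

The mean is pulled towards extreme values (70 in this toy column), while the median is not. The mode is the usual choice for categorical columns, where a mean does not exist.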



### 1.2.2 Filling out the missing values.

From the previous section, we know that the `Age`, `Cabin` and `Embarked` columns have NaNs or missing values. We will handle each column with a different approach.

**a. Fixing the `Age`.**

In this case, to fill in the age, we will use a `split-apply-combine` strategy to:

1.   **Create groups** depending on the `Sex` and the `Pclass` of the passengers (`groupby`).
2.   **Apply the changes in each group**. In this case, we will fill out the missing values with the mean value for each group (`apply`).
3.   **Combine each group again** in the main dataframe (saving the dataframe in the variable `df_titanic`).



In [None]:
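def f(group):
    # Work on a copy so the fill does not rely on chained in-place assignment.
    group = group.copy()
    # Replace the missing ages with the mean age of this (Sex, Pclass) group.
    group['Age'] = group['Age'].fillna(group['Age'].mean())
    return group

# group_keys=False keeps PassengerId as the only index level after the apply.
df_titanic = df_titanic.groupby(['Sex','Pclass'], group_keys=False).apply(f)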
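As an alternative sketch (not the approach used in this notebook), the same group-wise fill can be written with `transform`, which returns one value per original row. The toy frame below uses made-up values.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Sex":    ["male", "male", "male", "female", "female"],
    "Pclass": [3, 3, 3, 1, 1],
    "Age":    [20.0, 30.0, np.nan, 40.0, np.nan],
})

# Mean age of each (Sex, Pclass) group, broadcast back to every row of that group.
group_mean = toy.groupby(["Sex", "Pclass"])["Age"].transform("mean")
toy["Age"] = toy["Age"].fillna(group_mean)
print(toy["Age"].tolist())   # [20.0, 30.0, 25.0, 40.0, 40.0]
```

Because `transform` keeps the original index and row order, no recombination step is needed.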

**b. Fixing the `Embarked` information.**

In this case, the Embarked column has values that are divided into categories. Then, we will substitute the missing values with the `mode()`.

In [None]:
# Docu: https://stackoverflow.com/a/42789818/5486382
# Assign the result back instead of using inplace=True on a selected column.
df_titanic['Embarked'] = df_titanic['Embarked'].fillna(df_titanic['Embarked'].mode()[0])

**c. Fixing the `Cabin` missing information.**

> In the case of the `Cabin` and due to the percentage of missing values, I decided to drop that column from the study.

In [None]:
# Delete a column
# Docu with examples: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
df_titanic.drop(['Cabin'], axis=1, inplace = True)
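The share of missing values per column, which motivates dropping `Cabin`, can be computed as sketched below. The toy frame and its values are made up; the real percentages come from running the same expression on `df_titanic` before the column is dropped.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# isna() yields booleans, so the column mean is the fraction of missing cells.
missing_pct = toy.isna().mean() * 100
print(missing_pct)   # Age 25.0, Cabin 75.0
```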

### 1.2.3 Changing categories into numerical values.

In this case, we have one categorical column called `Sex` that refers to the gender of the passenger. We will transform each category in this column into a number.

In [None]:
print(df_titanic['Sex'].unique())

Let's now map the gender to numerical values so that we can check whether a correlation exists between the gender and the chance to survive.

In [None]:
dict_gender_map = {
  'male': 0,
  'female': 1
}

def gender_to_numeric(gender):
  return dict_gender_map[gender]

df_titanic['Sex'] = df_titanic['Sex'].apply(gender_to_numeric)
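An equivalent sketch uses `Series.map` with the dictionary directly. The series below is made up, and `"unknown"` is a hypothetical label added only to show how unmapped values behave.

```python
import pandas as pd

dict_gender_map = {"male": 0, "female": 1}
sex = pd.Series(["male", "female", "female", "unknown"])  # "unknown" is a made-up label

# map() looks each value up in the dictionary; labels that are missing
# from the dictionary become NaN instead of raising a KeyError.
encoded = sex.map(dict_gender_map)
print(encoded)
```

With `apply(gender_to_numeric)`, an unexpected label raises a `KeyError`, which can be preferable when silent NaNs would go unnoticed.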

## 1.3 Making groups based on the distribution of the dataset.

Let's now check whether it is worth making groups in the dataset based on the distribution of its values. For that purpose, we can use some visualization techniques.

### 1.3.1 Pairing `Sex` and `Survived`.

In [None]:
# Recent seaborn versions require x and y to be passed as keyword arguments.
sns.jointplot(x=df_titanic['Sex'], y=df_titanic['Survived'], kind="hex", gridsize=20)
plt.show()

From this graph, we can basically conclude that:
*   **Most of those who died were men**.

### 1.3.2 Pairing `Age` and `Survived`.

Let's check this hypothesis in comparison with the age.

In [None]:
sns.jointplot(x=df_titanic['Age'], y=df_titanic['Survived'], kind="hex", gridsize=20)
plt.show()

Now, we also know that:
*   **Most of those who died were men, and most of the deaths fall between 26 and 30 years of age.**

Let's see now the `Age` distribution in detail:

In [None]:
fig, ax = plt.subplots(figsize = (6, 4))
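# distplot is deprecated since seaborn 0.11; histplot with kde=True gives the same kind of plot.
sns.histplot(df_titanic[df_titanic["Survived"] == 1].Age, kde=True, stat="density", color="b", label="Survived", line_kws={"lw": 2})
sns.histplot(df_titanic[df_titanic["Survived"] == 0].Age, kde=True, stat="density", color="r", label="Died", line_kws={"lw": 2})
plt.legend()
plt.show()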

In this plot, blue represents the survivors, so **most of the babies survived**. Probably, this is because they were in the lifeboats with their mothers.

> Based on this study, it may be worth classifying the passengers by age range, depending on the observations made in the plots.

To do that, we can use the `apply()` function.

In [None]:
def age_to_groups(age):
    if age > 65:
        return 8 # Groups with low population density
    elif age > 50:
        return 7 # Groups with low population density
    elif age > 40:
        return 6
    elif age > 34:
        return 5
    elif age > 26:
        return 4
    elif age > 18:
        return 3
    elif age > 12:
        return 2  # Teens
    elif age > 5:
        return 1  # Children
    else:
        return 0  # Babies and little children

df_titanic["Age"] = df_titanic["Age"].apply(age_to_groups)
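As an alternative sketch, the same binning can be expressed with `pd.cut`. The bin edges below mirror the thresholds in `age_to_groups`; the series `ages` holds made-up values.

```python
import numpy as np
import pandas as pd

# Upper edges mirror the thresholds used in age_to_groups above.
age_bins = [-np.inf, 5, 12, 18, 26, 34, 40, 50, 65, np.inf]
ages = pd.Series([0.8, 5.0, 12.5, 26.0, 27.0, 66.0])   # made-up ages

# right=True (the default) builds intervals like (5, 12], matching the "age > 5" tests;
# labels=False returns the integer position of each bin, i.e. the group number.
age_groups = pd.cut(ages, bins=age_bins, labels=False)
print(age_groups.tolist())   # [0, 0, 2, 3, 4, 8]
```

One difference to keep in mind: for a missing age, `pd.cut` returns NaN, whereas `age_to_groups` falls through to the `else` branch and returns 0.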

Let's check again the distribution.

In [None]:
fig, ax = plt.subplots(figsize = (6, 4))
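# distplot is deprecated since seaborn 0.11; histplot with kde=True gives the same kind of plot.
sns.histplot(df_titanic[df_titanic["Survived"] == 1].Age, kde=True, stat="density", bins=8, color="b", label="Survived", line_kws={"lw": 2})
sns.histplot(df_titanic[df_titanic["Survived"] == 0].Age, kde=True, stat="density", bins=8, color="r", label="Died", line_kws={"lw": 2})
plt.legend()
plt.show()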

In the previous graph, we can adjust the ranges based on convenience, for example to make each group equally important in terms of survival chance. We are doing this to predict whether a passenger survived depending on the rest of the columns.

> For example, from this distribution we know that **most of the kids survived**. Therefore, **they can be treated as outliers in the study**.

Let's now count how many kids (group `Age = 0`) were on the Titanic.

In [None]:
# Counting the members of category 0.
kids = df_titanic[df_titanic["Age"] == 0]
print("There were", len(kids), "kids and babies")
print(kids)

> From this table, we know that **most of the kids who died** were in `Pclass = 3` and had more than 2 siblings (`SibSp`).

Let's drop the kids from the rest of the study. We will assign their survival values directly in the `test` subset.

In [None]:
index_babies_rows = df_titanic[df_titanic['Age'] == 0].index
df_titanic.drop(index_babies_rows, inplace = True)

### 1.3.3 Pairing `Fare` and `Survived`.

Based on the dataset, it seems that there were only three classes on the ship, but we would like a finer-grained view of how the passengers were distributed. Let's therefore replace `Pclass` with `Fare`.

In [None]:
# Distribution of the passengers based on the Fare.
fig, ax = plt.subplots(figsize = (6, 4))
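# distplot is deprecated since seaborn 0.11; histplot with kde=True gives the same kind of plot.
sns.histplot(df_titanic["Fare"], bins=40, kde=True, stat="density")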
plt.show()

In [None]:
fig, ax = plt.subplots(figsize = (6, 4))
sns.histplot(df_titanic[df_titanic["Survived"] == 1].Fare, kde=True, stat="density", color="b", label="Survived", line_kws={"lw": 2})
sns.histplot(df_titanic[df_titanic["Survived"] == 0].Fare, kde=True, stat="density", color="r", label="Died", line_kws={"lw": 2})
plt.legend()
plt.show()

> Let's look in detail at those who paid less than 100.

In [None]:
fig, ax = plt.subplots(figsize = (6, 4))
sns.histplot(df_titanic[(df_titanic["Survived"] == 1) & (df_titanic["Fare"] < 100)].Fare, kde=True, stat="density", color="b", label="Survived", line_kws={"lw": 2})
sns.histplot(df_titanic[(df_titanic["Survived"] == 0) & (df_titanic["Fare"] < 100)].Fare, kde=True, stat="density", color="r", label="Died", line_kws={"lw": 2})
plt.legend()
plt.show()

> It seems that most of those who died paid less than 20.

Let's now see how many different fares there were in the "less than 20" group.

In [None]:
print(df_titanic[df_titanic["Fare"] < 20].Fare.unique())

In [None]:
print(df_titanic[df_titanic["Fare"] > 20].Fare.unique())

Let's pair these values with the survival chance.

In [None]:
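# Filter both axes with the same mask so that x and y have the same length.
low_fare = df_titanic[df_titanic["Fare"] < 20]
sns.jointplot(x=low_fare["Fare"], y=low_fare["Survived"], kind="hex", gridsize=20)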
plt.show()

> From the previous graph, we know that most of those who died paid less than 8.

Now, we will group our passengers by fare, adjusting the boundaries of the groups to these observations.

In [None]:
def fare_to_groups(fare):
    if fare > 120:
        return 7
    elif fare > 50:
        return 6
    elif fare > 40:
        return 5
    elif fare > 24:
        return 4  
    elif fare > 16:
        return 3
    elif fare > 12:
        return 2 
    elif fare > 8:
        return 1 
    else:
        return 0  # Groups with highest population density

df_titanic["Fare"] = df_titanic["Fare"].apply(fare_to_groups)
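When adjusting the group boundaries, it can help to look at how many passengers fall into each group and what share of them survived. The sketch below does this on made-up rows; on the real data the same expression would be applied to `df_titanic`.

```python
import pandas as pd

# Made-up rows: Fare already holds the group number, Survived is 0/1.
toy = pd.DataFrame({
    "Fare":     [0, 0, 0, 0, 3, 3, 6, 6],
    "Survived": [0, 0, 0, 1, 0, 1, 1, 1],
})

# For each fare group: passenger count and share of survivors (mean of a 0/1 column).
summary = toy.groupby("Fare")["Survived"].agg(["count", "mean"])
print(summary)
```

Groups with very few passengers give unstable survival rates, which is one reason to merge or move thresholds.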

Let's check how many passengers paid more than 120 (fare group 7).

In [None]:
# Counting the members of category 7 (fare above 120).
rich_passengers = df_titanic[df_titanic["Fare"] == 7]
print("There were", len(rich_passengers), "rich people")
print(rich_passengers)

> We will also drop them from the study, because most of the men who paid more than 120 died and most of the women who paid more than 120 survived.

In [None]:
index_rich_rows = df_titanic[df_titanic['Fare'] == 7].index
df_titanic.drop(index_rich_rows, inplace = True)

Once the outliers have been dropped, we will check again the number of rows that we have to train our models.

In [None]:
print(df_titanic.shape)

In [None]:
print(df_titanic.isna().sum())

# 2. Regression, clustering and decision trees.

In [None]:
# To use scikit-learn library, we have to convert the Pandas data frame to a Numpy array:
X = df_titanic.drop(['Survived', 'Pclass', 'Name', 'SibSp', 'Ticket', 'Embarked'], axis=1).values
Y = df_titanic['Survived']

In [None]:
print(X[0:30])
print(Y[0:30])

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.20, random_state=10)
print ('Train set:', X_train.shape,  Y_train.shape)
print ('Test set:', X_test.shape,  Y_test.shape)

In [None]:
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train, Y_train)
Y_LR_predicted = LR.predict(X_test)
Y_LR_predicted_prob = LR.predict_proba(X_test)
Y_LR_predicted[0:5]

In [None]:
print("Train set Accuracy: ", metrics.accuracy_score(Y_train, LR.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(Y_test, Y_LR_predicted))

In [None]:
def bestK(max_number_of_Ks):
  mean_acc = np.zeros((max_number_of_Ks-1))
  std_acc = np.zeros((max_number_of_Ks-1))

  for n in range(1, max_number_of_Ks):
      neigh         = KNeighborsClassifier(n_neighbors = n).fit(X_train, Y_train)
      Y_predicted   = neigh.predict(X_test)
      mean_acc[n-1] = metrics.accuracy_score(Y_test, Y_predicted)
      std_acc[n-1]  = np.std(Y_predicted == Y_test)/np.sqrt(Y_predicted.shape[0])
  
  return (mean_acc.argmax() + 1, mean_acc.max())

In [None]:
k = bestK(25)[0]
print("k =", k)
# Train Model and Predict with the best K
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train, Y_train)
Y_bestK_predicted = neigh.predict(X_test)
Y_bestK_predicted[0:5]

In [None]:
print("Train set Accuracy: ", metrics.accuracy_score(Y_train, neigh.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(Y_test, Y_bestK_predicted))

In [None]:
def best_tree(max_number_of_max_depth, max_min_samples_split):
  max_acc = 0
  pos_max_acc = (1, 2)

  for n in range(1, max_number_of_max_depth):
    for s in range(2, max_min_samples_split):
      tree          = DecisionTreeClassifier(criterion="entropy", max_depth = n, min_samples_split = s).fit(X_train, Y_train)
      Y_predicted   = tree.predict(X_test)

      acc           = metrics.accuracy_score(Y_test, Y_predicted)  
      if (acc > max_acc):
          max_acc   = acc
          pos_max_acc = (n, s)

  return pos_max_acc

In [None]:
max_acc = best_tree(8, 4)
depth = max_acc[0]
samples_split = max_acc[1]
tree = DecisionTreeClassifier(criterion="entropy", max_depth = depth, min_samples_split = samples_split).fit(X_train, Y_train)
Y_bestMax_Depth_predicted = tree.predict(X_test)
Y_bestMax_Depth_predicted[0:5]

In [None]:
print("Train set Accuracy: ", metrics.accuracy_score(Y_train, tree.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(Y_test, Y_bestMax_Depth_predicted))

In [None]:
svm_model = svm.SVC().fit(X_train, Y_train)
Y_svm_predicted = svm_model.predict(X_test)

In [None]:
print("Train set Accuracy: ", metrics.accuracy_score(Y_train, svm_model.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(Y_test, Y_svm_predicted))

In [None]:
RandomForest = RandomForestClassifier(n_estimators=500,
                                      min_samples_split=2,
                                      min_samples_leaf=1,
                                      random_state=81)

random_forest = RandomForest.fit(X_train, Y_train)
Y_rf_predicted = random_forest.predict(X_test)

In [None]:
print("Train set Accuracy: ", metrics.accuracy_score(Y_train, random_forest.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(Y_test, Y_rf_predicted))

In [None]:
print("LogReg Accuracy train set:  %.4f" % metrics.accuracy_score(Y_train, LR.predict(X_train)))
print("LogReg Accuracy test set:   %.4f" % metrics.accuracy_score(Y_test, Y_LR_predicted))
print("LogReg F1-score test set:   %.4f" % f1_score(Y_test, Y_LR_predicted, average='weighted'))
print("LogReg LogLoss:             %.4f" % log_loss(Y_test, Y_LR_predicted_prob))

print("KNN Accuracy train set:  %.4f" % metrics.accuracy_score(Y_train, neigh.predict(X_train)))
print("KNN Accuracy test set:   %.4f" % metrics.accuracy_score(Y_test, Y_bestK_predicted))
print("KNN F1-score test set:   %.4f" % f1_score(Y_test, Y_bestK_predicted, average='weighted'))

print("DT Accuracy train set:   %.4f" % metrics.accuracy_score(Y_train, tree.predict(X_train)))
print("DT Accuracy test set:    %.4f" % metrics.accuracy_score(Y_test, Y_bestMax_Depth_predicted))
print("DT F1-score test set:    %.4f" % f1_score(Y_test, Y_bestMax_Depth_predicted, average='weighted'))

print("SVM Accuracy train set:  %.4f" % metrics.accuracy_score(Y_train, svm_model.predict(X_train)))
print("SVM Accuracy test set:   %.4f" % metrics.accuracy_score(Y_test, Y_svm_predicted))
print("SVM F1-score test set:   %.4f" % f1_score(Y_test, Y_svm_predicted, average='weighted'))

print("RF Accuracy train set:  %.4f" % metrics.accuracy_score(Y_train, random_forest.predict(X_train)))
print("RF Accuracy test set:   %.4f" % metrics.accuracy_score(Y_test, Y_rf_predicted))
print("RF F1-score test set:   %.4f" % f1_score(Y_test, Y_rf_predicted, average='weighted'))

In [None]:
estimators = [('KNN', neigh),
              ('DecisionTree', tree),
              ('SVM', svm_model),
              ('LogReg',LR)]

stack = StackingClassifier(estimators = estimators)
stack.fit(X_train, Y_train)

Y_stack_predicted = stack.predict(X_test)
stack_train_accuracy = metrics.accuracy_score(Y_train, stack.predict(X_train))
stack_accuracy = metrics.accuracy_score(Y_test, Y_stack_predicted)
print("Accuracy:", stack_train_accuracy, " ", stack_accuracy)

In [None]:
# `error_bad_lines` was removed in pandas 2.0; `on_bad_lines="skip"` (pandas >= 1.3) replaces it.
df_titanic_prediction = pd.read_csv('titanic/test.csv', on_bad_lines="skip", engine="python", sep=",")
df_titanic_prediction.set_index("PassengerId", inplace = True)
print(df_titanic_prediction.shape)
print(df_titanic_prediction.ndim)

In [None]:
print(df_titanic_prediction.isna().sum())

> **Dropping the cabin:**

In [None]:
df_titanic_prediction.drop(['Cabin'], axis=1, inplace = True)

> **Transforming the age:**

In [None]:
# group_keys=False keeps PassengerId as the only index level after the apply.
df_titanic_prediction = df_titanic_prediction.groupby(['Sex','Pclass'], group_keys=False).apply(f)

In [None]:
df_titanic_prediction["Age"] = df_titanic_prediction["Age"].apply(age_to_groups)

> **Transforming the sex:**

In [None]:
df_titanic_prediction['Sex'] = df_titanic_prediction['Sex'].apply(gender_to_numeric)

> **Dealing with the one that has the missing value in `Fare`:**

In [None]:
# Docu: https://stackoverflow.com/a/42789818/5486382
# Assign the result back instead of using inplace=True on a selected column.
df_titanic_prediction['Fare'] = df_titanic_prediction['Fare'].fillna(df_titanic_prediction['Fare'].median())

In [None]:
df_titanic_prediction["Fare"] = df_titanic_prediction["Fare"].apply(fare_to_groups)

**Indexing babies and rich people to assign the survived value directly:**

In [None]:
index_babies_test_rows = df_titanic_prediction[df_titanic_prediction['Age'] == 0].index
index_rich_test_rows = df_titanic_prediction[(df_titanic_prediction['Fare'] == 7) & (df_titanic_prediction['Age'] > 0)].index

**Locating those specific rows:**

In [None]:
df_titanic_test_babies = df_titanic_prediction.loc[index_babies_test_rows] # Those that were likely to survive.

In [None]:
df_titanic_test_rich = df_titanic_prediction.loc[index_rich_test_rows] # Those that were likely to survive were only women.

**Indexing the rest of the values for prediction:**

In [None]:
index_rest_test_rows = df_titanic_prediction[(df_titanic_prediction['Fare'] < 7) & (df_titanic_prediction['Age'] > 0)].index

In [None]:
df_titanic_test_ML = df_titanic_prediction.loc[index_rest_test_rows]

In [None]:
# To use scikit-learn library, we have to convert the Pandas data frame to a Numpy array:
X_evaluation = df_titanic_test_ML.drop(['Pclass', 'Name', 'SibSp', 'Ticket', 'Embarked'], axis=1).values

In [None]:
X_evaluation = scaler.transform(X_evaluation)

In [None]:
Y_bestMax_Depth_predicted_evaluation = stack.predict(X_evaluation)
Y_bestMax_Depth_predicted_evaluation[0:25]

In [None]:
df_ML = pd.DataFrame({'Survived': Y_bestMax_Depth_predicted_evaluation},
                     index = df_titanic_test_ML.index)

**Filling the rest based on the heuristics:**

In [None]:
# Heuristic from section 1.3.2: kids with more than 2 siblings are assumed
# to have died (0); the rest are assumed to have survived (1).
# DataFrame.append was removed in pandas 2.0, so the frame is built in one step;
# the PassengerId index is inherited from df_titanic_test_babies.
df_babies = (df_titanic_test_babies['SibSp'] <= 2).astype(int).to_frame('Survived')

In [None]:
# Heuristic from section 1.3.3: among the rich passengers, women (Sex = 1)
# are assumed to have survived and men (Sex = 0) to have died.
# The PassengerId index is inherited from df_titanic_test_rich.
df_rich = df_titanic_test_rich['Sex'].astype(int).to_frame('Survived')

In [None]:
frames = [df_ML, df_babies, df_rich]
submission = pd.concat(frames)
submission = submission.sort_index(ascending=True)
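Since the submission is assembled from three separately built frames, a quick consistency check can catch passengers that are missing or counted twice. The sketch below uses made-up partial results; the names `part_ml`, `part_babies`, `part_rich` and `combined` are hypothetical stand-ins for the frames above.

```python
import pandas as pd

# Three made-up partial results, each indexed by a PassengerId-like label.
part_ml = pd.DataFrame({"Survived": [0, 1]}, index=pd.Index([892, 895], name="PassengerId"))
part_babies = pd.DataFrame({"Survived": [1]}, index=pd.Index([893], name="PassengerId"))
part_rich = pd.DataFrame({"Survived": [1]}, index=pd.Index([894], name="PassengerId"))

combined = pd.concat([part_ml, part_babies, part_rich]).sort_index()

# Every passenger should appear exactly once, and every prediction should be 0 or 1.
assert combined.index.is_unique
assert combined["Survived"].isin([0, 1]).all()
print(combined.index.tolist())   # [892, 893, 894, 895]
```

On the real data, comparing `len(submission)` with `len(df_titanic_prediction)` would additionally confirm that no passenger was left out of the three subsets.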

In [None]:
submission.to_csv('submission.csv')
submission.iloc[0:100, :]