This is [30 Days Of Kaggle](https://www.kaggle.com/alexisbcook/getting-started-with-kaggle?utm_medium=email&utm_source=gamma&utm_campaign=thirty-days-of-ml&utm_content=day-1) - [Day 1](https://www.kaggle.com/alexisbcook/titanic-tutorial).

Let's predict who died when the Titanic went down.  It's the "Hello, World" or flat plate deflection analysis of data science.

Kaggle provides test and training data sets of passengers aboard the Titanic.  The goal is to predict who lived and died when the ship went down.

The model could be as simple as "women and children first; therefore women and children survived and men drowned". A subtler model that includes more variables can do better.  There are models that claim accuracy of 99% or greater.  Those sound more like fitting the known answers than predicting unknown ones, but that's okay.  Applying the same model to any other case, like extrapolating to the Lusitania, would not be a good idea.

First step is to download the data sets.  I put them in datasets/kaggle/titanic.

Start on the code:


In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('../datasets/kaggle/titanic'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


../datasets/kaggle/titanic/train.csv
../datasets/kaggle/titanic/test.csv
../datasets/kaggle/titanic/gender_submission.csv
../datasets/kaggle/titanic/duffymo-random-forest-submission.csv


Begin by reading in the training data.


In [2]:
train_data = pd.read_csv("../datasets/kaggle/titanic/train.csv")
train_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Then read in the test data:


In [3]:
test_data = pd.read_csv("../datasets/kaggle/titanic/test.csv")
test_data.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


The typical motto is "Women and children first".  Let's see what fraction of the women and of the men survived:

In [4]:
women = train_data.loc[train_data.Sex == 'female']["Survived"]
rate_women = sum(women)/len(women)
print("% of women who survived: ", rate_women*100)

% of women who survived:  74.20382165605095


In [5]:
men = train_data.loc[train_data.Sex == 'male']["Survived"]
rate_men = sum(men)/len(men)
print("% of men who survived: ", rate_men*100)

% of men who survived:  18.890814558058924
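
The two rates above suggest the simplest possible model: predict that every woman survived and every man did not. Here is a minimal sketch of that baseline, scored against the five rows shown by `train_data.head()`. The `gender_baseline` function and the `sample` list are names I made up for illustration; they are not part of the Kaggle tutorial.

```python
# (Sex, Survived) pairs copied from the five rows of train_data.head() above
sample = [("male", 0), ("female", 1), ("female", 1), ("female", 1), ("male", 0)]

def gender_baseline(sex):
    """Hypothetical baseline: predict 1 (survived) for women, 0 for men."""
    return 1 if sex == "female" else 0

hits = sum(gender_baseline(sex) == survived for sex, survived in sample)
baseline_accuracy = hits / len(sample)
print("baseline accuracy on the five sample rows:", baseline_accuracy)
```

Five rows prove nothing, of course; the point is only that any fancier model has to beat this rule to be worth the trouble.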


In [6]:
c = train_data.columns.to_list()
c

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

In [7]:
from sklearn.ensemble import RandomForestClassifier

y = train_data["Survived"]

features = ["Pclass", "Sex", "SibSp", "Parch"]
x = pd.get_dummies(train_data[features])
x_test = pd.get_dummies(test_data[features])
model = RandomForestClassifier(n_estimators = 100, max_depth = 5, random_state = 1)
model.fit(x, y)
predictions = model.predict(x_test)

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('../datasets/kaggle/titanic/duffymo-random-forest-submission.csv', index = False)
print("Submission saved")

Submission saved


Submitted my entry.  I've done better in the past following along with other tutorials.  I checked the leaderboard: there are lots of entries with a 100% success rate.  How is that anything other than a fitting exercise?
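
One way to see why a perfect score can be a fitting exercise: a "model" that simply memorizes its training labels scores 100% on data it has seen and roughly coin-flip accuracy on data it hasn't. The sketch below uses made-up passengers with random labels. Everything in it (the data, `memory`, `predict`, `accuracy`) is hypothetical and has nothing to do with the real leaderboard entries.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical data: 200 passenger IDs, each with a label that is pure noise
passengers = [(pid, rng.randint(0, 1)) for pid in range(200)]
train, holdout = passengers[:150], passengers[150:]

# The "model" memorizes every training label by passenger ID
memory = dict(train)

def predict(pid):
    # Passengers it has never seen get a default prediction of 0
    return memory.get(pid, 0)

def accuracy(rows):
    return sum(predict(pid) == label for pid, label in rows) / len(rows)

train_accuracy = accuracy(train)
holdout_accuracy = accuracy(holdout)
print("accuracy on training rows:", train_accuracy)
print("accuracy on held-out rows:", holdout_accuracy)
```

Scoring on rows the model never saw is the only number that says anything about prediction.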

You should do more with this problem.  There's another link [Getting Started With Titanic](https://www.kaggle.com/debajyoti1/getting-started-with-titanic) that would be good to go through.  Get back to it once you're all caught up.

On to 30 Days of Kaggle - [Day 2](https://www.kaggle.com/colinmorris/hello-python?utm_medium=email&utm_source=gamma&utm_campaign=thirty-days-of-ml&utm_content=day-2).  It's time for easy Python.  I've got 7 days' worth to complete today so I can catch up.  Fortunately I already know a little Python.  It won't be difficult.  This is just variables and operators - simple stuff.

Here's the [Day 2 exercise](https://www.kaggle.com/duffymo/exercise-syntax-variables-and-numbers/edit) they offer.


In [8]:
print(6/5)
print(6//5)
print(6%5)
print(6**5)

1.2
1
1
7776


In [9]:
print("You've successfully run some Python code")
print("Congratulations!")
print("Michael Duffy - you're a goddamned genius!")

You've successfully run some Python code
Congratulations!
Michael Duffy - you're a goddamned genius!


Genius, indeed.

The exercise includes a ```learntools``` import.  Why do they do that?  Better to tell people how to use publicly available packages.


In [10]:
import math
diameter = 3
radius = diameter / 2.0
area = math.pi*radius*radius

print("area = ", area)

area =  7.0685834705770345


Look at that!  Python calculates the area of circles correctly!

Let's swap references:



In [11]:
a = [1, 2, 3]
b = [3, 2, 1]

print("a = ", a)
print("b = ", b)

temp = a
a = b
b = temp

print("a = ", a)
print("b = ", b)

a =  [1, 2, 3]
b =  [3, 2, 1]
a =  [3, 2, 1]
b =  [1, 2, 3]
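
Python can also swap two references without a temporary variable by using tuple unpacking. A small sketch with the same two lists; `original_a` is an extra name I added only to show that the lists themselves are not copied:

```python
a = [1, 2, 3]
b = [3, 2, 1]
original_a = a  # extra handle on the first list, for the identity check below

# The right-hand side is evaluated first, then unpacked into the two names
a, b = b, a

print("a = ", a)
print("b = ", b)
print("b is the original a:", b is original_a)
```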


Another problem.  These are simple.

In [12]:
(5 - 3) // 2

1

In [13]:
8 - (3 * 2) - (1 + 1)

0

A last problem:

Alice, Bob and Carol have agreed to pool their Halloween candy and split it evenly among themselves. For the sake of their friendship, any candies left over will be smashed. For example, if they collectively bring home 91 candies, they'll take 30 each and smash 1.
Write an arithmetic expression below to calculate how many candies they must smash for a given haul.

In [14]:
alice = 121
bob = 77
carol = 109
smash = (alice + bob + carol) % 3
print("smash = ", smash)

smash =  1
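
If you want each person's share as well as the leftovers, the built-in `divmod` returns the quotient and the remainder in one call. A quick sketch using the same haul, plus the 91-candy example from the problem statement:

```python
alice = 121
bob = 77
carol = 109

# divmod(x, y) returns the pair (x // y, x % y)
each, smash = divmod(alice + bob + carol, 3)
print("each = ", each, " smash = ", smash)

# The example from the problem: 91 candies means 30 each with 1 to smash
example_each, example_smash = divmod(91, 3)
print("each = ", example_each, " smash = ", example_smash)
```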


I'm onto [Day 3](https://www.kaggle.com/colinmorris/functions-and-getting-help).

In [15]:
help(round)

Help on built-in function round in module builtins:

round(number, ndigits=None)
    Round a number to a given precision in decimal digits.
    
    The return value is an integer if ndigits is omitted or None.  Otherwise
    the return value has the same type as the number.  ndigits may be negative.
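
A few calls that exercise what the help text says, as a quick sketch. The variable names are mine:

```python
# ndigits may be negative: -2 rounds to the nearest hundred
nearest_hundred = round(1234.5678, -2)

# With ndigits omitted the result is an int
no_digits = round(2.7)

# With ndigits given the result keeps the type of the input
one_digit = round(2.75, 1)

# Exact halves are rounded to the nearest even integer
halves = (round(2.5), round(3.5))

print(nearest_hundred, type(no_digits).__name__, type(one_digit).__name__, halves)
```

The even-rounding of halves surprises people who expect 2.5 to round up to 3.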



Defining functions - it feels so much like Kotlin.  Functions in Python are easy.

In [16]:
def least_difference(a, b, c):
    """Return the smallest difference between any two numbers among a, b, and c.

    >>> least_difference(1, 5, -5)
    4
    """
    return min(abs(a-b), abs(b-c), abs(c-a))

print(least_difference(1, 10, 100),
      least_difference(1, 10, 10),
      least_difference(5, 6, 7))

help(least_difference)

9 0 1
Help on function least_difference in module __main__:

least_difference(a, b, c)
    Return the smallest difference between any two numbers among a, b, and c.
    
    >>> least_difference(1, 5, -5)
    4
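
The `>>>` line in that docstring is more than documentation: the standard-library `doctest` module can run it and compare the result with the expected `4`. A minimal sketch, assuming you want to check just this one function rather than a whole module:

```python
import doctest

def least_difference(a, b, c):
    """Return the smallest difference between any two numbers among a, b, and c.

    >>> least_difference(1, 5, -5)
    4
    """
    return min(abs(a - b), abs(b - c), abs(c - a))

# Collect the examples embedded in the docstring and run them
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
tests = finder.find(
    least_difference,
    name="least_difference",
    module=False,  # don't try to locate the defining module
    globs={"least_difference": least_difference},
)
for test in tests:
    runner.run(test)

print("examples attempted:", runner.tries, "failed:", runner.failures)
```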



Python is a hybrid of object-oriented and functional programming.  Functions are ordinary values, so you can pass them as arguments to other (higher-order) functions, like this:

In [17]:
def mult_by_five(x):
    return 5 * x

def call(fn, arg):
    """Call fn on arg"""
    return fn(arg)

def squared_call(fn, arg):
    return fn(fn(arg))

print(
    call(mult_by_five, 1),
    squared_call(mult_by_five, 1),
    sep='\n',
)

def mod_5(x):
    """Return the remainder of x after dividing by 5"""
    return x % 5

print(
    'Which number is biggest?',
    max(100, 51, 14),
    'Which number is biggest modulo 5?',
    max(100, 51, 14, key=mod_5),
    sep='\n'
)

5
25
Which number is biggest?
100
Which number is biggest modulo 5?
14
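
The same `key` argument works with `min` and `sorted`, and a `lambda` saves defining a named function for a one-off. A short sketch reusing `mod_5` and the same three numbers:

```python
def mod_5(x):
    """Return the remainder of x after dividing by 5"""
    return x % 5

numbers = [100, 51, 14]

# Remainders are 0, 1 and 4 respectively
smallest_mod_5 = min(numbers, key=mod_5)
ascending = sorted(numbers, key=mod_5)
descending = sorted(numbers, key=lambda n: n % 5, reverse=True)

print(smallest_mod_5, ascending, descending, sep='\n')
```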


Now it's time for exercises.

In [18]:
import math
x = math.e
y = 100 * x - 0.5 + 1
z = int(y)
print("rounded: ", z / 100.0)

def round_to_two_place(x = 0.0):
    return my_round(x, 2)

def my_round(x = 0.0, ndigits = 2):
    c = math.pow(10, ndigits)
    return (int(c * x - 0.5 + 1)) / c

print("expected: ", round(x, 2))
print("my version of round: ", round_to_two_place(x))


rounded:  2.72
expected:  2.72
my version of round:  2.72
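
One caveat with my version: `int()` truncates toward zero, so the add-a-half trick only works for positive numbers. Here's a sketch of the problem and one possible fix using `math.floor`. The name `my_round_floor` is hypothetical, and neither version reproduces the built-in's round-half-to-even behaviour.

```python
import math

def my_round(x=0.0, ndigits=2):
    # Same arithmetic as the cell above: int() truncates toward zero
    c = math.pow(10, ndigits)
    return int(c * x - 0.5 + 1) / c

def my_round_floor(x=0.0, ndigits=2):
    # floor() always rounds down, so adding 0.5 first gives the nearest
    # step for negative inputs as well as positive ones
    c = math.pow(10, ndigits)
    return math.floor(c * x + 0.5) / c

x = -math.e
print("built-in:      ", round(x, 2))
print("int version:   ", my_round(x))
print("floor version: ", my_round_floor(x))
```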


In [19]:
def to_smash(candies, num_friends = 3):
    """
    Return the number of leftover candies that must be smashed
    after distributing candy evenly between friends

    >>> to_smash(91)
    1
    """
    if candies == 1:
        print("Splitting", candies, "candy")
    else:
        print("Splitting", candies, "candies")

    return candies % num_friends

print(to_smash(91))

Splitting 91 candies
1


I'm already up to [Day 4](https://www.kaggle.com/colinmorris/booleans-and-conditionals?utm_medium=email&utm_source=gamma&utm_campaign=thirty-days-of-ml&utm_content=day-4)!  Booleans and conditionals.  Nothing new here.

In [20]:
def can_run_for_president(age, is_natural_born_citizen = True):
    """Can someone of the given age run for US president?"""
    return is_natural_born_citizen and age >= 35

def is_citizen(citizen = True):
    if citizen:
        return "citizen"
    else:
        return "non-citizen"

a = 65
c = True
print("Can a ", a, "-year-old ", is_citizen(c), " run for president? ", can_run_for_president(a, c), sep="")


Can a 65-year-old citizen run for president? True


In [21]:
def is_odd(n):
    return (n % 2) == 1

m = -1
print("Is ", m, " odd? ", is_odd(m))

Is  -1  odd?  True
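
That `True` depends on a Python detail: `%` returns a result with the sign of the divisor, so `-1 % 2` is `1`. In C or Java the remainder takes the sign of the dividend, and the same test would fail for negative odd numbers. A sketch of the difference, using `math.fmod` as a stand-in for the C-style remainder; `is_odd_c_style` is a hypothetical port, not something from the exercise:

```python
import math

def is_odd(n):
    return (n % 2) == 1

def is_odd_c_style(n):
    # math.fmod keeps the sign of the dividend, like % in C or Java
    return math.fmod(n, 2) == 1

python_remainder = -1 % 2
c_style_remainder = math.fmod(-1, 2)

print("-1 % 2 =", python_remainder, " fmod(-1, 2) =", c_style_remainder)
print("is_odd(-1) =", is_odd(-1), " is_odd_c_style(-1) =", is_odd_c_style(-1))
```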


Time for exercises: seven in all.  I'll have completed 4 out of 7 today.  Not bad, but you need to keep going.

In [22]:
def sign(xx = 0):
    if xx < 0:
        return -1
    elif xx > 0:
        return 1
    else:
        return 0

[sign(k) for k in range(-5, 5)]

[-1, -1, -1, -1, -1, 0, 1, 1, 1, 1]

In [23]:
# Day 4, exercise 3
# I think this needed parens to make sure it worked in all cases.  Easy to add.

def prepared_for_weather(have_umbrella, rain_level, have_hood, is_workday):
    return have_umbrella or (rain_level < 5 and have_hood) or (rain_level > 0 and is_workday)

have_umbrella = False
rain_level = 5.0
have_hood = True
is_workday = False
print(prepared_for_weather(have_umbrella, rain_level, have_hood, is_workday))

False


In [24]:
# Day 4, exercise 4
def is_negative(number):
    return number < 0

v = 5
print(is_negative(v))

False


In [25]:
# Day 4, exercise 5
def onionless(ketchup, mustard, onion):
    return not onion

def everything(ketchup, mustard, onion):
    return ketchup and mustard and onion

def plain(ketchup, mustard, onion):
    """Think DeMorgan's Theorem"""
    return not (ketchup or mustard or onion)

def exactly_one_topping(ketchup, mustard, onion):
    return (ketchup + mustard + onion) == 1
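
Two things make those one-liners work. Adding booleans is legal because `bool` is a subclass of `int`, and `plain` relies on De Morgan's law. A small sketch that checks both, brute-forcing all eight topping combinations with `itertools.product`:

```python
from itertools import product

# True counts as 1 and False as 0 in arithmetic
two_toppings = True + True + False
bool_is_int = isinstance(True, int)

# De Morgan: not (k or m or o) is the same as (not k) and (not m) and (not o)
de_morgan_holds = all(
    (not (k or m or o)) == ((not k) and (not m) and (not o))
    for k, m, o in product([False, True], repeat=3)
)

print(two_toppings, bool_is_int, de_morgan_holds)
```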

In [26]:
# Day 4, Exercise 7
# Blackjack

def should_hit(dealer_total, player_total, player_low_aces, player_high_aces):
    return False

It's Day 5: Lists

In [27]:
# Day 5 Exercise 1

def select_second(list1):
    if (len(list1) > 1):
        return list1[1]
    else:
        return None


# Day 5 Exercise 2

def losing_team_captain(teams):
    if (len(teams) > 0):
        return teams[-1][1]
    else:
        return None

# Day 5 Exercise 3

def purple_shell(racers):
    first = racers[0]
    racers[0] = racers[-1]
    racers[-1] = first

# Day 5 Exercise 5

def fashionably_late(arrivals, name):
    if len(arrivals) % 2 == 0:
        mid = len(arrivals)//2
    else:
        mid = 1 + len(arrivals)//2

    return name in arrivals[mid:-1]

Day 6: Loops and List Comprehensions.

4 exercises to get through.  I'm catching up!

In [None]:
def has_lucky_number(nums):
    for num in nums:
        if num % 7 == 0:
            return True
    return False
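
The loop-and-return pattern can also be written with the built-in `any` and a generator expression, which stops at the first match just like the early `return True`. A sketch of that alternative:

```python
def has_lucky_number(nums):
    """Return True if at least one number in nums is divisible by 7."""
    return any(num % 7 == 0 for num in nums)

print(has_lucky_number([1, 2, 14]))
print(has_lucky_number([1, 2, 3]))
print(has_lucky_number([]))
```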

In [None]:
def elementwise_greater_than(list2, th):
    return [g > th for g in list2]

In [None]:
# Exercise 3

def menu_is_boring(meals):
    """Given a list of meals served over some period of time,
    return True if the same meal has ever been served two days in a row,
    False otherwise
    :param meals: list of meals
    :return: True if the same meal has ever been served two days in a row,
    False otherwise
    """
    for i in range(len(meals)-1):
        if meals[i] == meals[i+1]:
            return True
    return False
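
Another way to compare neighbours without index arithmetic is to `zip` the list with itself shifted by one. A sketch of that variant; the meal names in the calls are made up:

```python
def menu_is_boring(meals):
    """Return True if the same meal is served two days in a row."""
    # zip stops at the shorter input, so the last meal is never paired
    # with a missing "tomorrow"
    return any(today == tomorrow for today, tomorrow in zip(meals, meals[1:]))

print(menu_is_boring(['Egg', 'Spam', 'Spam']))
print(menu_is_boring(['Egg', 'Spam', 'Egg']))
print(menu_is_boring([]))
```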

In [None]:
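def estimate_average_slot_payout(n_runs):
    # play_slot_machine() is not defined in this notebook; it is assumed to be
    # supplied by the Kaggle exercise environment.
    total_payout = 0.0
    for _ in range(n_runs):
        total_payout += play_slot_machine()
    return total_payout / n_runs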
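
To try the estimator outside the exercise you need some `play_slot_machine` to call. Here's a sketch with a made-up machine whose true average payout is known, so you can watch the estimate converge. The payout rule (5.0 one time in five) and the seed are my assumptions, not Kaggle's.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

def play_slot_machine():
    # Hypothetical stand-in: pays 5.0 one time in five, otherwise nothing,
    # so the true average payout is 0.2 * 5.0 = 1.0
    return 5.0 if rng.random() < 0.2 else 0.0

def estimate_average_slot_payout(n_runs):
    total_payout = 0.0
    for _ in range(n_runs):
        total_payout += play_slot_machine()
    return total_payout / n_runs

estimate = estimate_average_slot_payout(100_000)
print("estimated average payout:", estimate)
```

The estimate's error shrinks roughly with the square root of `n_runs`, so ten times the accuracy costs a hundred times the pulls.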

Day 7 of 30 Days of Kaggle: [Strings and Dictionaries](https://www.kaggle.com/colinmorris/strings-and-dictionaries).  There are 3 exercises to complete.

In [None]:
import re

def is_valid_zip(zip_code):
    return re.match(r"^\d{5}$", zip_code) is not None
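
A subtlety with that pattern: `$` also matches just before a trailing newline, so `"12345\n"` passes. If that matters, `re.fullmatch` is stricter. A sketch comparing the two; the function names are mine:

```python
import re

def is_valid_zip_match(zip_code):
    # Same pattern as above
    return re.match(r"^\d{5}$", zip_code) is not None

def is_valid_zip_fullmatch(zip_code):
    # fullmatch requires the whole string to match, newline included
    return re.fullmatch(r"\d{5}", zip_code) is not None

for candidate in ["12345", "1234", "12345\n"]:
    print(repr(candidate), is_valid_zip_match(candidate), is_valid_zip_fullmatch(candidate))
```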

In [None]:
import re

def word_search(doc_list, keyword):
    indices = []
    for i, doc in enumerate(doc_list):
        # Strip punctuation so "Casino." still matches "casino"
        without_punctuation = re.sub(r'[^\w\s]', '', doc)
        tokens = without_punctuation.lower().split()
        if keyword.lower() in tokens:
            indices.append(i)
    return indices


In [None]:
import re

def multi_word_search(doc_list, keywords):
    keyword_indices = {}
    for keyword in keywords:
        indices = []
        for i, doc in enumerate(doc_list):
            # Strip punctuation so "Casino." still matches "casino"
            without_punctuation = re.sub(r'[^\w\s]', '', doc)
            tokens = without_punctuation.lower().split()
            if keyword.lower() in tokens:
                indices.append(i)
        keyword_indices[keyword] = indices
    return keyword_indices
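
Since the inner loop is exactly `word_search`, the multi-keyword version can just call it once per keyword. A sketch of that refactoring with a dictionary comprehension; the three documents are sample data I made up for the demonstration:

```python
import re

def word_search(doc_list, keyword):
    """Return the indices of the documents that contain keyword as a whole word."""
    indices = []
    for i, doc in enumerate(doc_list):
        tokens = re.sub(r'[^\w\s]', '', doc).lower().split()
        if keyword.lower() in tokens:
            indices.append(i)
    return indices

def multi_word_search(doc_list, keywords):
    """Map each keyword to the indices of the documents that contain it."""
    return {keyword: word_search(doc_list, keyword) for keyword in keywords}

docs = ["The Learn Python Challenge Casino.", "They bought a car", "Casinoville"]
result = multi_word_search(docs, ['casino', 'they'])
print(result)
```

Note that "Casinoville" doesn't count as a hit for "casino" because the match is on whole tokens.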

Day 8 of 30 Days of Kaggle: Working With External Libraries


In [29]:
import math

print("It's math!  It has type {}".format(type(math)))
print(dir(math))

It's math!  It has type <class 'module'>
['__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'acos', 'acosh', 'asin', 'asinh', 'atan', 'atan2', 'atanh', 'ceil', 'comb', 'copysign', 'cos', 'cosh', 'degrees', 'dist', 'e', 'erf', 'erfc', 'exp', 'expm1', 'fabs', 'factorial', 'floor', 'fmod', 'frexp', 'fsum', 'gamma', 'gcd', 'hypot', 'inf', 'isclose', 'isfinite', 'isinf', 'isnan', 'isqrt', 'lcm', 'ldexp', 'lgamma', 'log', 'log10', 'log1p', 'log2', 'modf', 'nan', 'nextafter', 'perm', 'pi', 'pow', 'prod', 'radians', 'remainder', 'sin', 'sinh', 'sqrt', 'tan', 'tanh', 'tau', 'trunc', 'ulp']


In [40]:
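def evaluate_blackjack_hand(hand):
    """Return the best blackjack total for hand, counting each ace as 1 or 11."""
    total = 0
    hand = [card.upper() for card in hand] # convert each card to upper case
    num_aces = hand.count('A')
    for card in hand:
        if card.isdigit():
            total += int(card)   # number cards count at face value
        elif card in ['K', 'Q', 'J']:
            total += 10          # face cards are worth 10
        else:
            total += 1           # anything else (an ace) starts out as 1
    # Promote aces from 1 to 11 while the total stays at or under 21
    for _ in range(num_aces):
        if total <= 11:
            total += 10
    return total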


In [43]:
print(evaluate_blackjack_hand(['K', 'K', '2']))

22


In [54]:
def blackjack_hand_greater_than(hand_1, hand_2):
    """
    Return True if hand_1 beats hand_2, and False otherwise.

    In order for hand_1 to beat hand_2 the following must be true:
    - The total of hand_1 must not exceed 21
    - The total of hand_1 must exceed the total of hand_2 OR hand_2's total must exceed 21

    Hands are represented as a list of cards. Each card is represented by a string.

    When adding up a hand's total, cards with numbers count for that many points. Face
    cards ('J', 'Q', and 'K') are worth 10 points. 'A' can count for 1 or 11.

    When determining a hand's total, you should try to count aces in the way that
    maximizes the hand's total without going over 21. e.g. the total of ['A', 'A', '9'] is 21,
    the total of ['A', 'A', '9', '3'] is 14.

    Examples:
    >>> blackjack_hand_greater_than(['K'], ['3', '4'])
    True
    >>> blackjack_hand_greater_than(['K'], ['10'])
    False
    >>> blackjack_hand_greater_than(['K', 'K', '2'], ['3'])
    False
    """
    t1 = evaluate_blackjack_hand(hand_1)
    t2 = evaluate_blackjack_hand(hand_2)
    return t1 <= 21 and (t1 > t2 or t2 > 21)

In [55]:
hand1 = ['2', '10', '5', 'A', '9', '9']
hand2 = ['5', '7', '5', 'Q', '5']
print("hand1 = ", evaluate_blackjack_hand(hand1))
print("hand2 = ", evaluate_blackjack_hand(hand2))
print(blackjack_hand_greater_than(hand1, hand2))

hand1 =  36
hand2 =  32
False
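
The docstring makes several concrete claims (the two ace totals and the three `>>>` examples), so it's worth checking them against the code. A sketch that does so with a condensed copy of the two functions; the `ace_totals` and `examples` names are mine:

```python
def evaluate_blackjack_hand(hand):
    # Same logic as the cell above, condensed
    hand = [card.upper() for card in hand]
    total = 0
    for card in hand:
        if card.isdigit():
            total += int(card)
        elif card in ('K', 'Q', 'J'):
            total += 10
        else:
            total += 1
    for _ in range(hand.count('A')):
        if total <= 11:
            total += 10
    return total

def blackjack_hand_greater_than(hand_1, hand_2):
    t1 = evaluate_blackjack_hand(hand_1)
    t2 = evaluate_blackjack_hand(hand_2)
    return t1 <= 21 and (t1 > t2 or t2 > 21)

# Totals quoted in the docstring
ace_totals = (
    evaluate_blackjack_hand(['A', 'A', '9']),
    evaluate_blackjack_hand(['A', 'A', '9', '3']),
)

# The three docstring examples, in order
examples = (
    blackjack_hand_greater_than(['K'], ['3', '4']),
    blackjack_hand_greater_than(['K'], ['10']),
    blackjack_hand_greater_than(['K', 'K', '2'], ['3']),
)

print(ace_totals, examples)
```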
