In [5]:
%%html
<style>
.h1_cell, .just_text {
    box-sizing: border-box;
    padding-top:5px;
    padding-bottom:5px;
    font-family: "Times New Roman", Georgia, Serif;
    font-size: 125%;
    line-height: 22px; /* 5px +12px + 5px */
    text-indent: 25px;
    background-color: #fbfbea;
    padding: 10px;
    border-style: groove;
}

hr { 
    display: block;
    margin-top: 0.5em;
    margin-bottom: 0.5em;
    margin-left: auto;
    margin-right: auto;
    border-style: inset;
    border-width: 2px;
}
</style>

<h1>
<center>
Module 1: First look at text-based classification
</center>
</h1>
<div class=h1_cell>

<p>The first problem I'd like to look at in the course is classifying tweets as carrying fake-news (or not). But before getting to that in later modules, we need to pick up skills in what is called data wrangling and feature engineering. We will do that in this module. I am going to use a standard tutorial-type data set for machine learning: the passenger record of the Titanic steamship. The Titanic sank on its maiden voyage. We have the record of the passengers. We will do a practice problem of predicting who survived and who perished based solely on their name. Will this be effective? It seems kind of like reading Tarot cards. But let's keep an open mind. Maybe it will work.
<p>
Many text-based machine-learning problems contain their data in spreadsheet form. Python has a powerful library for dealing with spreadsheets called pandas. In this module we will use a handful of features from the *`pandas`* library. I'll go through some basic clean-up steps using pandas. Common wisdom is that the clean-up process can take up to 70% of your entire effort. Life is messy. Text data comes to us in unstructured forms. We have to deal with it.
</div>

<hr>
<h1>
Read in spreadsheet
</h1>
<div class=h1_cell>
<p>
For the first part of the course, we will be working on a problem called classification. The data we will be using to make classifications will be in spreadsheet form (I'll also call this *table* form).
<p>
We could read in the data to our own custom Python data-structure. Instead we will use the pandas library to store our data and modify it.
<p>
I am going to use something called comma-separated values, or csv, as my raw file format. I like csv because you can use it to pass data around easily between things like Excel and Google Sheets. And pandas knows how to read raw csv format and produce its own version called a DataFrame. Our week 2 goal is to read a table of tweets, in csv form, and classify them as fake-news or not.
<p>
Caveat: I said we are interested in classification (e.g., fake-news or not) but I'll use the term `prediction` for the Titanic. You can treat the two terms as interchangeable for now. I could say I am trying to `predict` who will survive or I could say I am trying to `classify` passengers into survivors and non-survivors. We will use the same methods for each.
<p>
I have the Titanic data stored on Google Sheets. I used Sheets to give me a URL to the csv version of the file. Once I have that URL, I can hand it to pandas and suck it in. Pretty dang cool.
<p>
BTW: it is convention to alias pandas as `pd`. It is also convention to use `df` as an abstract name for a DataFrame - you will see this in docs and StackOverflow. I am using `titanic_table` in place of `df` to give it more meaning.

</div>

In [7]:
import pandas as pd
url = 'https://docs.google.com/spreadsheets/d/1z1ycUZjJpmMWB4gXbhwRQ9B_qa42CwzAQkf82mLibxI/pub?output=csv'
titanic_table = pd.read_csv(url)

In [8]:
titanic_table.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
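
<div class=h1_cell>
<p>
A minimal sketch of what `read_csv` is doing, for anyone who wants to try it without a network connection. It feeds pandas a few rows of raw csv text (copied from the output above, with fewer columns) through `io.StringIO` instead of a URL. The names `raw_csv` and `demo_table` are made up for this illustration.
</div>

```python
import io
import pandas as pd

# Three rows taken from the head() output above, written as raw csv text.
# Only the first six columns are kept to make the example short.
raw_csv = '''PassengerId,Survived,Pclass,Name,Sex,Age
1,0,3,"Braund, Mr. Owen Harris",male,22.0
3,1,3,"Heikkinen, Miss. Laina",female,26.0
5,0,3,"Allen, Mr. William Henry",male,35.0
'''

# read_csv accepts a URL, a file path, or a file-like object such as StringIO.
demo_table = pd.read_csv(io.StringIO(raw_csv))

print(demo_table.shape)          # (number of rows, number of columns)
print(list(demo_table.columns))  # the header row becomes the column names
print(demo_table['Name'][0])     # the comma inside the quotes stays part of the name
```

<div class=h1_cell>
<p>
Notice that the names contain commas. The double quotes in the csv text are what keep pandas from splitting a name into two columns.
</div>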


<hr>
<h2>
Importing libraries
</h2>
<div class=h1_cell>
<p>
We will be writing our own functions during the quarter. I'll keep track of them and place them in python files we can import as libraries. So if we write a function for module i that I think will be also useful in module i+1, I'll store that function in a file that I can import in module i+1. I'll do the work of keeping the libraries up to date. You need to do the work of bringing them in so you can use them.
<p>
Here's how I am organizing things. I am storing useful functions in files named week1.py, week2.py, etc. I am placing them in a Dropbox directory `/Dropbox/cis399_ds2_w18/code_libraries/datascience_2`. I am not making this directory accessible to you. Instead, I am pushing my directory contents to `https://github.com/fickas/datascience_2`, which is accessible to you.
<p>
Here is the cleanest way for you to go forward: set yourself up so you can pull from my GitHub repository to your own local directory. I've seen students in past classes who never quite get this set up correctly and end up copying the source from my GitHub repository into the top of each notebook they create. Yuck.
<p>
First clone my repository, then set yourself up to pull without a password. Now you are ready to execute the code below: (1) Pull the latest versions of the code from my repository. I do make changes to code when I find bugs. So you will want the latest version. (2) Use normal `import` to bring in the latest library. (3) Use the magic command `%who function` to make sure you have the functions you need.
<p>
I set up two test libraries, `week0` and `week1`, so you can test things out. Notice my use of `!` to execute commands in the shell. Kind of cool.
</div>

In [2]:
import os

home_path =  os.path.expanduser('~')

<div class=h1_cell>
<p>
You will need to replace my file path below with your own.
</div>

In [4]:
os.chdir(home_path + '/Documents/CIS/UpperDiv/CIS_399/datascience_2')  #fill in your own
!git pull

Already up to date.


In [5]:
# load this week's library

import sys
sys.path.append(home_path + '/Documents/CIS/UpperDiv/CIS_399/datascience_2')
from week1 import *
%who function

do_nothing	 foo	 python_version	 


<div class=h1_cell>
<p>
Don't panic if it takes you a while to get your git repository set up. The week1 and week0 libraries have nothing of use. I just put them there so you can test your set-up. In week 2, we will start to use them for real.
</div>
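
<div class=h1_cell>
<p>
If you want to check the import mechanics before your git set-up works, here is a small sketch that does not need the repository at all. It writes a stand-in library file into a temporary folder, adds that folder to `sys.path`, and imports it. The module name `week_demo` and the folder are hypothetical. The last step uses `inspect` as a rough plain-Python counterpart to `%who function`: it lists the functions defined in the module rather than everything in the notebook's namespace.
</div>

```python
import importlib
import inspect
import os
import sys
import tempfile

# Stand-in for the cloned repository: a temporary folder holding one library file.
# The module name week_demo and its contents are hypothetical.
repo_dir = tempfile.mkdtemp()
with open(os.path.join(repo_dir, 'week_demo.py'), 'w') as f:
    f.write("def do_nothing():\n    return None\n")

# Tell Python to search this folder as well when importing.
if repo_dir not in sys.path:
    sys.path.append(repo_dir)
importlib.invalidate_caches()  # make sure the freshly written file is noticed

week_demo = importlib.import_module('week_demo')

# List the functions the module defines.
function_names = [name for name, obj in inspect.getmembers(week_demo, inspect.isfunction)]
print(function_names)
```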

In [9]:
#I am setting the option to see all the columns of our table as we build it, i.e., it has no max.
pd.set_option('display.max_columns', None)

<hr>
<h1>
Explore
</h1>
<p>
<div class=h1_cell>
We now have the 891 passengers in 891 rows of a table. We can use pandas methods to look a little more deeply at the data.
<p>
<ul>
<li>Use *`head()`* to get the general layout.</li>
<p>
<li>Find which columns have *`NaN`*s (empties) and how many.</li>
<p>
<li>Use the *`describe`* method to see whether any columns look odd, e.g., more than 2 unique values for a binary column.</li>
</ul>
</div>

<div class=just_text>
<p>
There is a mixture of column types. Some have discrete values (e.g., `Pclass`, `Sex`, `Embarked`), some have continuous values (e.g., `Age`, `Fare`), and some are in between (e.g., `SibSp`, `Parch`). The `Name` column has text values. The `Ticket` and `Cabin` columns are a bit of a hodgepodge and will take further wrangling to make them useful.
<p>
Let's next see how many empties there are in each column.
</div>

In [10]:
titanic_table.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
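
<div class=just_text>
<p>
How does `isnull().sum()` produce those counts? A minimal sketch on a tiny made-up table (the values and the name `demo_table` are for illustration only): `isnull()` gives a table of True/False with the same shape, and `sum()` adds up each column, counting True as 1.
</div>

```python
import numpy as np
import pandas as pd

# Tiny made-up table; np.nan and None both count as empty.
demo_table = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0, np.nan],
    'Cabin': [None, 'C85', None, None],
    'Fare':  [7.25, 71.2833, 7.925, 8.05],
})

empties_mask = demo_table.isnull()        # True wherever a cell is empty
empties_per_column = empties_mask.sum()   # column sums; each True counts as 1

print(empties_mask)
print(empties_per_column)
```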

<div class=just_text>
<p>
<ul>
<li>The `Age` column is a bit worrisome. It looks like a column that can be useful in prediction but has 177 empty values.
<p>
<li>The `Cabin` column has a lot of empties. I am dubious that the column as a whole will be useful. However, it might make sense to use the empty/non-empty question. For instance, maybe passengers with non-empty cabins were more likely to survive.
<p>
<li>The `Embarked` column has only 2 empties and that seems like something we can fill in.
<p>
</ul>
Let's complete our initial exploration by using the `describe` method. I am passing `include='all'` to get the discrete columns along with the numeric (continuous) columns.
</div>

In [16]:
titanic_table.describe(include='all')

Unnamed: 0,Survived,Name,Length
count,891.0,891,891.0
unique,,891,
top,,"Graham, Mr. George Edward",
freq,,1,
mean,0.383838,,26.965208
std,0.486592,,9.281607
min,0.0,,12.0
25%,0.0,,20.0
50%,0.0,,25.0
75%,1.0,,30.0


<hr>
<h1>Drop unneeded columns</h1>
<p>
<div class=h1_cell>
<p>

I am really only interested in the `Name` column and the `Survived` column. Since we are trying to predict Survived values, it is known as the target column or just plain y. The other columns are called features or xi. I am saying that we will only be interested in Name so it is the sole feature (for now).
<p>
As you can see below, I'll first use the columns attribute to obtain all the columns. I turn this into a list to make it print more cleanly. I am doing this in preparation for dropping most of them. I am being lazy - I just want to copy and paste the output into the drop method.
<p>
Note in the drop method I am using `axis=1` to say I am dropping columns and not rows (`axis=0`). 

</div>

In [14]:
list(titanic_table.columns)

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

In [15]:
titanic_table = titanic_table.drop(['PassengerId',
                                    'Pclass',
                                    'Sex',
                                    'Age',
                                    'SibSp',
                                    'Parch',
                                    'Ticket',
                                    'Fare',
                                    'Cabin',
                                    'Embarked'], axis=1)

ValueError: labels ['PassengerId' 'Pclass' 'Sex' 'Age' 'SibSp' 'Parch' 'Ticket' 'Fare'
 'Cabin' 'Embarked'] not contained in axis
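
<div class=h1_cell>
<p>
The `ValueError` shown above is what `drop` raises when the labels it is given are not in the table. One way to hit it is to run the cell a second time, after the columns are already gone. Here is a sketch of two ways to make the step safe to re-run, on a small made-up table (`demo_table` and the other variable names are for illustration): select the columns you want to keep, or pass `errors='ignore'` to `drop`.
</div>

```python
import pandas as pd

demo_table = pd.DataFrame({
    'PassengerId': [1, 3],
    'Survived': [0, 1],
    'Name': ['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina'],
    'Sex': ['male', 'female'],
})

# Option 1: say what to keep rather than what to drop.
kept = demo_table[['Survived', 'Name']]
kept_again = kept[['Survived', 'Name']]   # running it twice is harmless

# Option 2: drop, but ignore labels that are already gone.
dropped = demo_table.drop(columns=['PassengerId', 'Sex'], errors='ignore')
dropped_again = dropped.drop(columns=['PassengerId', 'Sex'], errors='ignore')

print(list(kept_again.columns))
print(list(dropped_again.columns))
```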

<div class=h1_cell>
<p>

Most pandas operations return a new table rather than changing the original in place. This is true above: the drop method gives me a new table, and I am reassigning `titanic_table` to it. I suppose I could keep a lot of variables around like `titanic_table_1`, `titanic_table_2`, etc., and never overwrite a variable. But I find trying to manage such a name space clumsy. It is true my way does not allow you to roll back to a prior version of the table. But you can "roll forward" by just restarting the kernel and executing all of the cells from the top of the notebook to get to a specific state.
</div>
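
<div class=h1_cell>
<p>
A quick sketch to convince yourself that `drop` gives back a new table and leaves the one you started with alone. The table and the names `original` and `trimmed` are made up for this check.
</div>

```python
import pandas as pd

original = pd.DataFrame({
    'Survived': [0, 1],
    'Name': ['A', 'B'],
    'Fare': [7.25, 53.1],
})

# drop returns a new table; nothing is assigned back to `original` here.
trimmed = original.drop(['Fare'], axis=1)

print(list(original.columns))  # Fare is still there
print(list(trimmed.columns))   # Fare is gone
print(trimmed is original)     # two different objects
```

<div class=h1_cell>
<p>
This is why the cell above reassigns the result to `titanic_table`. Without the reassignment, the trimmed table would be thrown away.
</div>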

In [17]:
titanic_table.head()  # Should see drop of columns

Unnamed: 0,Survived,Name,Length
0,0,"Braund, Mr. Owen Harris",23
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",51
2,1,"Heikkinen, Miss. Laina",22
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",44
4,0,"Allen, Mr. William Henry",24


<hr>
<h2>That's what I'm talkin about</h2>
<p>
<div class=h1_cell>
<p>
We trimmed down to the two columns we need. But as a warm up for word-vectorization in later modules, I am going to add a new column that is based on the Name column.
</div>

<h2>Numerology</h2>
<p>
<div class=h1_cell>
<p>
I have a theory that the length of your full name gives a clue to your future. I'm going to add a new column, `Length`, so I can test this out a little later. You can see below that pandas makes this pretty easy to do.
<p>
On the right-hand side, pandas `apply` (with `axis=1`) visits every row in turn and passes that row to my lambda expression. The value returned by the lambda expression goes into the new column `Length`. If you like list comprehensions better, you can use this:
<p>
<code>
titanic_table['Length'] = [len(row['Name']) for index,row in titanic_table.iterrows()]
</code>
<p>
The `iterrows` method hands you the same rows one at a time, along with each row's index (which we are not using).
</div>

In [18]:
titanic_table['Length'] = titanic_table.apply(lambda row: len(row['Name']), axis=1)
titanic_table.head()

Unnamed: 0,Survived,Name,Length
0,0,"Braund, Mr. Owen Harris",23
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",51
2,1,"Heikkinen, Miss. Laina",22
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",44
4,0,"Allen, Mr. William Henry",24
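
<div class=h1_cell>
<p>
A sketch comparing the two approaches described above (`apply` and the `iterrows` list comprehension) on four of the names shown in the output. It also adds a third option that is not in the original notebook: the pandas string accessor `str.len()`, which works on the whole column at once. The names `demo_table`, `via_apply`, `via_iterrows` and `via_str_len` are made up for this illustration.
</div>

```python
import pandas as pd

# Four untruncated names from the output above.
demo_table = pd.DataFrame({'Name': [
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
    'Allen, Mr. William Henry',
]})

# 1. apply with axis=1: the lambda receives one row at a time.
via_apply = demo_table.apply(lambda row: len(row['Name']), axis=1)

# 2. List comprehension over iterrows: same rows, plus an index we ignore.
via_iterrows = [len(row['Name']) for index, row in demo_table.iterrows()]

# 3. String accessor on the column: no explicit loop over rows.
via_str_len = demo_table['Name'].str.len()

demo_table['Length'] = via_str_len
print(demo_table)
```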


<div class=just_text>
If you squint, you can almost believe that those who perished had shorter names.
</div>
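
<div class=just_text>
<p>
Instead of squinting, you can ask pandas for the average name length in each group. This sketch uses only the five `Survived` and `Length` values shown above, so it says nothing reliable about the full table; the names `first_five` and `mean_length` are made up. The same `groupby` line should work on `titanic_table` itself.
</div>

```python
import pandas as pd

# Survived and Length values copied from the five rows shown above.
first_five = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0],
    'Length':   [23, 51, 22, 44, 24],
})

# Average name length for those who perished (0) and those who survived (1).
mean_length = first_five.groupby('Survived')['Length'].mean()
print(mean_length)
```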

<hr>
<h2>Write it out</h2>
<p>
<div class=h1_cell>
<p>
I am going to write out the Titanic table because we have now changed it - we added a column and dropped columns. I set up a special folder just for holding different versions of tables during the quarter.
<p>
When I move over to the assignment notebook, I can read my table in (instead of loading it from Google Sheets).
</div>

In [20]:
week = '1a' # change this each version

home_path =  os.path.expanduser('~')

file_path = '/Documents/CIS/UpperDiv/CIS_399/week1/'  #use your own path

file_name = 'titanic_table_w'+week+'.csv'

titanic_table.to_csv(home_path + file_path + file_name, index=False)
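
<div class=h1_cell>
<p>
Why `index=False`? A small sketch of a write-then-read round trip, using a temporary folder and a made-up table (the file names and variable names are for illustration). Without `index=False`, pandas writes the row index as an extra first column, and that column comes back as `Unnamed: 0` when the file is read in again.
</div>

```python
import os
import tempfile
import pandas as pd

demo_table = pd.DataFrame({
    'Survived': [0, 1],
    'Name': ['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina'],
    'Length': [23, 22],
})

out_dir = tempfile.mkdtemp()
with_index_path = os.path.join(out_dir, 'with_index.csv')
no_index_path = os.path.join(out_dir, 'no_index.csv')

demo_table.to_csv(with_index_path)             # default: row index is written too
demo_table.to_csv(no_index_path, index=False)  # only the real columns are written

reread_with_index = pd.read_csv(with_index_path)
reread_no_index = pd.read_csv(no_index_path)

print(list(reread_with_index.columns))
print(list(reread_no_index.columns))
```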

<hr>
<h1>
Use K-NN for classification
</h1>
<div class=h1_cell>
<p>
I said I was interested in predicting the Survived value for any passenger. This is a machine-learning problem. There are many machine-learning methods I might employ. But I am not ready to get into a comparison or survey at this point. I am just going to choose K-Nearest Neighbors (K-NN) because it will get us going the fastest. It has its issues, but it is straightforward to build. And guess what, I am going to ask you to build it. It will look nice on your resume: "I built K-NN from scratch."
<p>
I'll meet you over on the assignment notebook. I'll ask you to add more columns (features) to our table and then we can get K-NN built and then see if Numerology is legit. [spoiler alert] You might be surprised.
</div>