# An introduction to Jupyter notebooks

A Jupyter notebook (formerly IPython notebook) is an interactive environment for Python, and it's probably the best way of using Python for data manipulation.  You may ask: "I can just run Python interactively from the terminal, why do I need Jupyter?"  Well, that's a fair question, and the answer will hopefully become clear as we work through this notebook.

Jupyter notebooks are broken down into **cells**.  We're in the topmost cell of this notebook at the moment.  Cells come in three flavors:

* **Markdown cells** allow you to edit text in Markdown ("so *that's* why Dr. Z made us learn Markdown...").  These cells are used for exposition, discussion, and general formatting.  Think of them as extended comments that can be formatted beautifully, and can contain [links](http://www.jupyter.org), bulleted lists, etc.  Anything that Markdown can do!
* **Code cells** contain code (for us, Python code).  These cells can contain code as short as one line, or as long as you'd like!  (Actually, I have no idea what the maximum length is.  I've had cells well over 100 lines long, though.)  They have some basic text editor support, so they'll help you with indentation, tab completion, etc., but they won't be able to do some of the magic that true editors like Atom or Sublime can handle.  They're also interactive in the same way that the Python interpreter in interactive mode is.  Type `15*4 %3` and it returns the answer; there's no need to print everything out.

There's one more, but it's not used as often:

* **Raw cells** are used when you want to hack the notebook to make it fancier.  We won't be using them tons, but it's good to know they exist.
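To make the `15*4 %3` example from the code-cell bullet concrete, here is a small sketch of how Python groups that expression.  The variable names are mine, purely for illustration.  In a code cell you could type the bare expression on the last line and the value would be echoed without `print`.

```python
# `*` and `%` have equal precedence and group left to right.
expression_value = 15 * 4 % 3      # evaluated as (15 * 4) % 3
regrouped_value = 15 * (4 % 3)     # parentheses change the grouping

print(expression_value)
print(regrouped_value)
```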

How about a little demonstration?

In [None]:
def is_prime(n):
    """Determine whether n is prime."""
    if n < 2:
        return False  # 0, 1, and negative numbers are not prime
    k = 2
    while k*k <= n:
        if n % k == 0:
            return False
        k += 1
    return True

print(*[x for x in range(2,401) if is_prime(x)])

## Editing a notebook: Command mode and Edit mode

While working with a notebook, you are always in one of two modes.

1. In **Edit mode** you can edit the content of a cell.  It acts like a text file inside a text editor, and has some helpful syntax highlighting.  If you're editing a Markdown cell, it will look significantly different.  If you're editing a code cell, it will look mostly the same.  To *run* the cell, you have a few options:
 * press `command + enter` or `ctrl + enter` to run the cell and exit edit mode.  Running a Markdown cell will render it, and running a code cell executes it as you'd expect.
 * press `shift + enter` to run the cell and move to the cell below (a new one is created if you're at the bottom of the notebook).  This is the standard command when you're building the notebook.
1. In **Command mode** you have access to your cells in a larger-scale way.  You can press `up` or `down` to move between cells, and press `enter` to enter edit mode on the currently selected cell.  You can also cut, copy, paste, and delete cells with appropriate keyboard commands.  Open the *Command Palette* (the keyboard icon in the top center of the toolbar) to see all the commands you can use in Command mode.

## Linearity of code: the kernel

A notebook has a **kernel** attached to it.  Think of it as the interactive python running behind the scenes, executing your commands when you send them.  There are two forms of linearity going on here, and it can be a bit confusing to new Jupyter users:

* **Kernel Linearity**: After you execute a code cell, it gives you its output and places a number next to the top of the cell.  This number records the *order of cell execution* in the kernel, that is, the order in which the kernel received your cells.  This means you can run cells, tweak them and run them again, run something "below" the cell in the notebook, then come back and run the upper cell, *etc.*, and the kernel will keep track of this in terms of the order in which you ran them **chronologically**.  This is the order you want to keep in mind.  It's really useful!  You can start out with a junky-looking notebook, figure out your data analysis, realize you want to change stuff "in the past", and just go back and change it.  Once you get used to this, you'll love it.
* **Cell Linearity**: There is an obvious order to the cells: the top ones "go first", and the lower ones "go next".  The kernel doesn't enforce this order, though.  Reading linearly from top to bottom is definitely the goal of the *final product*, but programming, and especially data analysis, isn't like writing an essay.  Very often, you'll need to go back and change things, then rerun all the cells that come after the one you just edited.  You may type one line in a cell, hit `shift + enter` to see the output and move on to the next cell, then do that two more times.  You then realize that you'd prefer to have done all that at once, and you can merge those three cells together.  It's a workflow that I hope you'll learn to love.
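The difference between the two orders can be sketched in plain Python.  This is only a toy model, not how Jupyter is implemented: the "kernel" here is a dictionary acting as a shared namespace, and the cell labels, cell contents, and `run_cell` helper are all made up for illustration.

```python
# A toy stand-in for the kernel: one shared namespace that outlives every cell.
kernel_namespace = {}

# Two hypothetical cells, listed in the order they appear on the page.
cells = {
    "upper cell": "greeting = 'Hello, ' + student_name",
    "lower cell": "student_name = 'Ada'",
}

def run_cell(label):
    """Execute one cell's source inside the shared namespace."""
    exec(cells[label], kernel_namespace)

# Running top to bottom fails: the upper cell needs a name that doesn't exist yet.
try:
    run_cell("upper cell")
    first_attempt = "ran fine"
except NameError as error:
    first_attempt = f"NameError: {error}"

# Run the lower cell, then go back up.  Chronological order is what counts.
run_cell("lower cell")
run_cell("upper cell")

print(first_attempt)
print(kernel_namespace["greeting"])
```

The upper cell only succeeds once the lower cell has already run, even though it sits higher on the page.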

Play around with it now: use an uninitialized variable `my_hat` in a cell, then hit `esc` to leave the cell without running it (or hit `shift + enter` to see the error).  Below that, create a cell in which you give the variable a value, then run that cell, followed by the original cell:

In [None]:
# Push shift + enter to see an error, or esc to leave without running the cell:

print(my_hat)

In [None]:
my_hat = "Oh, now it works!"

# run this cell, then run the above cell!

You'll get the hang of it in time.  Now let's move on to something more interesting.

# An analysis of desired datasets

At the beginning of the term, you gave me some data.  You told me all kinds of things about you personally, and all kinds of things about the types of datasets you'd like to work with.  I've sneakily typed all that up and saved it into a CSV file, so now we're going to play around with it.

In [None]:
# standard import statements.  We'll understand these soon enough!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Jupyter "magics": lines beginning with a `%` are how you talk to Jupyter.  Here, I'm telling 
#   Jupyter to display matplotlib plots as inline, as opposed to the default of having them pop
#   up in their own window, buried behind everything else.
# Another option is `%matplotlib notebook`, which gives the plot a few more features, 
#   but those features are usually not that necessary unless you're needing to manipulate the 
#   graphic, like in a 3d plot, or save the individual plot to a .png file.
%matplotlib inline

with open("Fall_2016_Survey_Values.csv", 'r') as f:
    print(f.read()[:1200])

What a mess!  Let's turn that into something a bit more manageable.  Enter pandas:

In [None]:
# `df` is short for dataframe, the standard object in "pandas", the python dataset manipulation 
#    tool.  Think of it as excel, but awesome.
df = pd.read_csv("Fall_2016_Survey_Values.csv")
df.head()

Actually, that was pretty cool.  Those bolded lines are the column headers and row indices, respectively.  That's awesome that pandas just knew the top row was headers.  Let's see what we've got for columns:

In [None]:
df.columns
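On the header row mentioned above: pandas didn't really "know" anything.  By default, `pd.read_csv` treats the first line of the file as column names.  Here is a minimal sketch on a made-up two-row CSV (the column names are borrowed from the survey, the values are invented), comparing the default with `header=None`.

```python
import io
import pandas as pd

# A tiny, made-up CSV held in memory instead of on disk.
raw_text = "Period,Dream Job\n1,Developer\n2,CEO\n"

# Default behaviour: the first line becomes the column names.
with_header = pd.read_csv(io.StringIO(raw_text))

# header=None: the first line is treated as data and columns are numbered.
without_header = pd.read_csv(io.StringIO(raw_text), header=None)

print(list(with_header.columns), with_header.shape)
print(list(without_header.columns), without_header.shape)
```

If a file has no header line, passing `header=None` (optionally with `names=[...]`) keeps the first record from being swallowed as column names.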

Okay, so my dataset comes from your dream job, your clubs and courses, and your rating of all the datasets.  Cool.  One problem though: people didn't call all the clubs and classes the same thing.  Time to get our hands dirty:

### Cleaning up clubs:

In [None]:
# We can access individual columns like this.  `.head(n)` shows the first n rows (5 by default)
df[["Club 1", "Club 2", "Club 3"]].head(15)

In [None]:
clubs = []

for _, club in df[["Club 1", "Club 2", "Club 3"]].iterrows():
    # each row is a Series labeled by column name; drop the blanks and keep the rest
    clubs.extend(club.dropna().tolist())

In [None]:
print(*sorted(clubs), sep = ', ')

Okay, there are a few ways to deal with this.  Here are the two extremes:

* **Group the ones obviously referring to the same clubs, then create indicator columns for each club**.  An *indicator column* is just a 1 or 0 based on whether a given student is in the given club.  This has quite an obvious, important benefit:  You keep all of the *signal* (information) contained in the dataset, and so that might be helpful for us in our analysis.  However, it has the effect of greatly increasing the "width" of your dataset (the number of "columns", or "predictors", or "variables"), which may make the dataset intractable.  With a dataset of this tiny size, I'm not too concerned about that.  The other potential issue is that you're not taking into account that some of these clubs are very similar to each other, and some are very different.  By making these indicators weighted the same, you may be capturing a great deal of *noise*, a word meaning the random fluctuations that come from real data.
* **Create two classes of clubs, based on something you may think is relevant.**  For example, I may make "techy" clubs and "non-techy" clubs.  This has the benefit of perhaps not having as much noise, but it also loses some signal. 

You may think I'm about to say "You have to choose one and stick with it".  However, that's simply not the case.  It's really a matter of how much time you have to go through all the ways of modeling your data.  You could have the motto of "always try everything", and that may work out for you in terms of getting the best analysis.  Or you may just want a "good enough" result, and roll with it.  There's really no right or wrong answer; it's really a matter of how much time you want to spend with this particular dataset.
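As a sketch of the first extreme, here is one way indicator columns could be built.  The three-row frame `toy` is made up and stands in for the survey's club columns; this is an illustration, not the approach used in the rest of this notebook.

```python
import pandas as pd

# Toy stand-in for the survey's club columns (made-up rows).
toy = pd.DataFrame({
    "Club 1": ["chess", "band", "chess"],
    "Club 2": ["robotics", None, "band"],
    "Club 3": [None, None, None],
})
club_columns = ["Club 1", "Club 2", "Club 3"]

# Every distinct club name, ignoring missing entries.
all_clubs = sorted(
    {name for name in toy[club_columns].to_numpy().ravel() if pd.notnull(name)}
)

# One 0/1 column per club: 1 if the club appears in any of the student's slots.
indicators = pd.DataFrame({
    club: toy[club_columns].eq(club).any(axis=1).astype(int)
    for club in all_clubs
})

print(indicators)
```

Three students already produce three new columns here, which is the "width" problem described above: with dozens of distinct club names, the frame grows quickly.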

For now, I'll take a middle ground, and I'll hard code these into a few groups.

In [None]:
club_groups = {'philo/poli/econ': ['model UN','philo/debate', 'investment club','Democracy project',
                                   'economics society','mock trial','debate','filo'],
               'social': ['out of the blue', 'ecoaction','GSA','WOFO','SLAM/dance',
                          'quiz bowl','band','movie makers'],
               'PA duties' : ['tour guide','student activities','phillipian','peer tutor'],
               'international': ['german', 'indopak','international club',],
               'math/sci/tech': ['gaming', 'astronomy club','math','blue moon (sci)',
                                 'cs club','techmasters','makers','astronomy','aviation club',
                                 'chess','robotics'],
               'entreprenuerial': ['big ideas club', 'tang', 'entrpreneurs club']}

for group in club_groups.keys():
    df[group] = df[['Club 1','Club 2','Club 3']].isin(club_groups[group]).sum(axis=1)

In [None]:
df.head() # Scroll way to the right!
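To see what the `.isin(...).sum(axis=1)` line does, here is a small sketch on made-up data.  The frame `toy` and the group `tech_clubs` are hypothetical.  One thing worth noticing is that `isin` matches exactly, so a capitalization or spelling that isn't listed in the group is silently counted as 0.

```python
import pandas as pd

# Made-up club entries; note the capitalized "Chess" in the second row.
toy = pd.DataFrame({
    "Club 1": ["chess", "Chess", "band"],
    "Club 2": ["robotics", None, None],
})
tech_clubs = ["chess", "robotics"]  # hypothetical group

membership = toy.isin(tech_clubs)      # True/False for every entry
tech_count = membership.sum(axis=1)    # True counts as 1, giving a count per row

# Lower-casing first makes the match forgiving about capitalization.
normalized = toy.apply(lambda column: column.str.lower())
tech_count_normalized = normalized.isin(tech_clubs).sum(axis=1)

print(tech_count.tolist())
print(tech_count_normalized.tolist())
```

Because the result is a count rather than a 0/1 flag, a student in two clubs from the same group gets a 2 in that group's column.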

### Cleaning up Courses

We'll do the same thing for courses, and also for dream jobs.  We remove the course number, because we don't have it for everyone.
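Stripping the course number can be sketched on a few made-up entries.  The helper name `strip_course_number` and the sample values are mine, for illustration only; the regular expression is the same `[0-9]+` pattern used in the cells that follow.

```python
import re
import pandas as pd

numbers = re.compile("[0-9]+")

def strip_course_number(value):
    """Remove digit runs from a course name; leave missing values alone."""
    if pd.isnull(value):
        return value
    return numbers.sub("", value).strip()

# Made-up entries: with a space, without a space, no number, and missing.
toy_courses = pd.Series(["math 590", "eng300", "physics", None])
cleaned = toy_courses.apply(strip_course_number)

print(cleaned.tolist())
```

The `.strip()` matters: removing the digits from `"math 590"` leaves a trailing space, which would otherwise stop it from matching `"math"`.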

In [None]:
course_columns = ["Course 1", "Course 2", "Course 3", "Course 4", "Course 5"]

courses = []

for _, course in df[course_columns].iterrows():
    # each row is a Series labeled by column name; drop the blanks and keep the rest
    courses.extend(course.dropna().tolist())

In [None]:
course_groups = {'math/cs':['math','mth','bc calculus','multivariate calc','axiom of choice','cs ip','math ip','statistics',],
                 'arts': ['art'],
                 'science': ['physics','physics c','astronomy','animal behavior','biology','org chemisty', 'astro ip'],
                 'history': ['history', 'history ip','nonconformity in the renaisance', 'us history'],
                 'rel/phil': ['law and morality','law & morality'],
                 'language': ['span','french','japanese iw'],
                 'english': ['eng','medieval lit', 'gothic lit','english senior lecture','shakespeare english lecture',],
                 'other': ['econ', 'media studies', 'architecture',"women's lit"]}

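# Remove numbers from the courses, leaving missing entries as NaN
# (calling str() on a missing entry would turn it into the string "nan")
import re
numbers = re.compile("[0-9]+")
for course in course_columns:
    df[course] = df[course].apply(lambda x: numbers.sub("", str(x)).strip() if pd.notnull(x) else x)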

In [None]:
for group in course_groups:
    df[group] = df[course_columns].isin(course_groups[group]).sum(axis=1)

df.head() # Scroll way right again!

### Cleaning up Dream jobs

In [None]:
dream_jobs = []

for job in df['Dream Job']:
    if pd.notnull(job): 
        dream_jobs.append(job)

In [None]:
job_groups = {'jobs_tech':['Google R&D', 'tech entrepreneur','Developer'],
              'jobs_science': ['biomedical engineer','math-related','research scientist', 
                          'JPL operations engineer','neuroscientist'],
              'jobs_business': ['CEO', 'product manager','venture capital',],
              'jobs_fantastic': ['rollerskating waitress']}

In [None]:
for group in job_groups:
    df[group] = df['Dream Job'].isin(job_groups[group]).astype(int)  # True/False -> 1/0

df.head(15) #scroll way right one more time!

### Final Cleanup: create data matrix

The last step is to remove all string columns, creating just a matrix:

In [None]:
print(*list(df.columns), sep=', ')

In [None]:
numerical_columns = ['Period','Tweets','LHC','VLA','MRI','Seis','Water Samples','Emails', 'Social Network',
                     'animal image', 'instagram', 'handwritten', 'martian landscape', 'sport', 'HTML',
                     'interview', 'web-ad', 'psychological', 'books', 'delivery logs', 'county vote',
                     'temperatures', 'wireless str', 'math/sci/tech', 'philo/poli/econ', 'international',
                     'social', 'PA duties', 'entreprenuerial', 'arts', 'science', 'english', 'math/cs',
                     'history', 'other', 'language', 'rel/phil', 'jobs_fantastic', 'jobs_science',
                     'jobs_tech', 'jobs_business']

datasets = ['Tweets','LHC','VLA','MRI','Seis','Water Samples','Emails', 'Social Network',
            'animal image', 'instagram', 'handwritten', 'martian landscape', 'sport', 'HTML',
            'interview', 'web-ad', 'psychological', 'books', 'delivery logs', 'county vote',
            'temperatures', 'wireless str']

# Take an explicit copy so that later assignments into X are unambiguous
X = df[numerical_columns].copy()

# When you're done with your analysis, you should save the dataset as a new csv:
X.to_csv("dataset_preferences.csv")

# If you plug your dataset into a model, it's typically best to make it a matrix:
# X = np.array(X)

In [None]:
X
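Listing the numeric columns by hand gives full control over what goes into the matrix.  As an alternative sketch, pandas can pick numeric columns by dtype with `select_dtypes`.  The frame `toy` below is made up; note that on the real survey this would keep *every* numeric column, including any that were deliberately left out of the hand-written list.

```python
import pandas as pd

# Made-up frame with one string column and two numeric ones.
toy = pd.DataFrame({
    "Dream Job": ["CEO", "Developer"],
    "Period": [1, 2],
    "Tweets": [3.0, 5.0],
})

numeric_only = toy.select_dtypes(include="number")  # keeps int and float columns
data_matrix = numeric_only.to_numpy()               # plain 2-D NumPy array

print(list(numeric_only.columns))
print(data_matrix.shape)
```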

## Analysis

The first thing we want to do is plot some histograms to view our sample distribution.  However, we have some missing data. (Thanks, NEEL! (Kidding, of course.  Learning about missing data is important.  So really, thanks.))

In [None]:
X[X.isnull().any(axis=1)] # Which rows have NaN?

In [None]:
X.columns[pd.isnull(X).any()].tolist() # Which columns have NaN?

The topic of what to do with missing data is a big one, and not something we can cover in a single cell.  The simplest solution is to *impute* the missing values, replacing each with some kind of average or otherwise representative value.  Here, I'll take the median of the column.  Side note: setting individual values in a dataframe can be finicky.  pandas may raise a `SettingWithCopyWarning` when you assign into a dataframe that was itself selected from another dataframe (as `X` was from `df`), because it can't tell whether you meant to change the original as well.  Taking an explicit `.copy()` when you build the subset avoids the warning; either way, the assignment below still takes effect on `X`.

In [None]:
X.at[2,"Water Samples"] = X["Water Samples"].median()
X.at[2,"Social Network"] = X["Social Network"].median()
X.at[2,"martian landscape"] = X["martian landscape"].median()
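The same find-and-impute steps can be sketched without hard-coding the row and column names.  The frame `toy` is made up (only the column names are borrowed from the survey), and `fillna` with the column medians is one common option, not the only reasonable one.

```python
import numpy as np
import pandas as pd

# Made-up ratings with a single missing value.
toy = pd.DataFrame({
    "Water Samples": [1.0, 4.0, np.nan, 3.0],
    "Emails": [5.0, 2.0, 4.0, 1.0],
})

rows_with_gaps = toy[toy.isnull().any(axis=1)]                # rows containing NaN
columns_with_gaps = toy.columns[toy.isnull().any()].tolist()  # columns containing NaN

# median() skips NaN by default; fillna matches each median to its own column.
imputed = toy.fillna(toy.median())

print(columns_with_gaps)
print(imputed)
```

This returns a new frame and leaves `toy` untouched, so the original missing-value pattern is still available if you want to try a different imputation later.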

Okay, let's take a look.

In [None]:
datasets = ['Tweets','LHC','VLA','MRI','Seis','Water Samples','Emails', 'Social Network',
            'animal image', 'instagram', 'handwritten', 'martian landscape', 'sport', 'HTML',
            'interview', 'web-ad', 'psychological', 'books', 'delivery logs', 'county vote',
            'temperatures', 'wireless str']

X_datasets = X[datasets]

In [None]:
totals = X_datasets.sum(axis=0).sort_values()

In [None]:
totals.plot(kind='barh')
plt.show()

Interesting!  I wasn't expecting emails to win.  Or really any of those top 5 or so.  The last thing we'll do with this elementary analysis is to look at basic statistics of the dataset.

In [None]:
X.describe()

There's so, so, so much more we can do with this dataset, now that we have it in a form we want!  We'll do that in the future.