# Writing Python code within Jupyter notebooks

## First, counting csv items with traditional loop structures

Before starting, I want to confirm where I am working so I can copy my csv file there. The exclamation point lets me run shell commands from a notebook cell; on Windows, "cd" without an argument prints the current working directory.

In [1]:
!cd

C:\Users\mikes_000\Desktop
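
Note that `!cd` only prints the directory on Windows; on macOS or Linux, `cd` without an argument prints nothing (the equivalent there is `!pwd`). A minimal sketch of a portable alternative, using only the standard library, might look like this. The file name `class.csv` is the one used later in this notebook; whether it exists in your directory is of course an assumption.

```python
import os
from pathlib import Path

# Ask Python itself for the working directory; this works on any OS.
cwd = os.getcwd()
print(cwd)

# pathlib gives the same location as a Path object, handy for building file paths.
csv_path = Path.cwd() / "class.csv"
print(csv_path.exists())  # True only if class.csv has already been copied here
```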


Python can read a csv file, and the csv module from its standard library makes the job nice and clean; it just has to be imported first. Most of the csv files I have hanging around belong to the MIND Center and cannot be uploaded to a public site like this, but I also have some fake, randomly generated data representing a bunch of student ids and their preferences for class section: A, B, or C.

In [2]:
import csv  # load the library to handle delimited text files

# Forward slashes work on Windows too and avoid the invalid '\c' escape sequence
with open('./class.csv', newline='') as preferencefile:
    # Instantiate the csv.reader object based on the file just opened
    myreader = csv.reader(preferencefile, delimiter=',')
    # Skip the header row; I only want the data
    next(myreader, None)
    # I don't NEED enumerate, but it will allow me to know which row I'm on without adding my own counter.
    # More simply, "for student in myreader" would allow the same loop but without counting lines.
    for i, student in enumerate(myreader):
        # I randomly created 1000 students, so don't want to print them all. Every 200 will do.
        if i % 200 == 0:
            # This is a fairly complicated print statement. See if you can figure it out...
            # The i refers to the enumerate() counter.
            # "student" ends up being a list of everything else on one row of the file
            # You can name your variable list: {num} {id}; or order it: {0} {1}, etc.
            print("[{num}] Student {id} wants section {fave}, but not {hate}".format(
                num=i, id=student[0], fave=student[1], hate=student[3]))

[0] Student 5128 wants section C, but not B
[200] Student 1700 wants section C, but not B
[400] Student 553 wants section B, but not C
[600] Student 6446 wants section B, but not C
[800] Student 7817 wants section B, but not C
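
To unpack the print statement above, here is a small sketch of the named and positional placeholder styles side by side, plus the f-string form available in Python 3.6 and later. The `student` row is hypothetical: it assumes each line of the file holds an id followed by first, second, and third choices, which matches how the indexes are used above but is not confirmed by the file itself.

```python
# Hypothetical row, shaped like one line of class.csv: id, first, second, third choice
student = ["5128", "C", "A", "B"]

# Named placeholders: the order of the keyword arguments does not matter.
named = "Student {id} wants section {fave}, but not {hate}".format(
    id=student[0], fave=student[1], hate=student[3])

# Positional placeholders: {0}, {1}, {2} refer to the arguments by position.
positional = "Student {0} wants section {1}, but not {2}".format(
    student[0], student[1], student[3])

# f-string: the expressions are written directly inside the braces.
fstring = f"Student {student[0]} wants section {student[1]}, but not {student[3]}"

print(named)
```

All three strings come out identical, so the choice is mostly about readability.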


The following code is the same as above, with comments only on the new lines. It adds a dictionary to count the breakdown in demand for class sections.

In [3]:
import csv
from collections import Counter  # we want to use the Counter to count preference items

with open('./class.csv', newline='') as preferencefile:
    # Create a Counter dictionary to hold counts of each section preference
    c = Counter()
    myreader = csv.reader(preferencefile, delimiter=',')
    next(myreader, None)
    for i, student in enumerate(myreader):
        # For EVERY row, count the student's first choice section
        c[student[1]] += 1
        if i % 200 == 0:
            print("    [{num}] Student {id} wants section {fave}, but not {hate}".format(
                num=i, id=student[0], fave=student[1], hate=student[3]))
# Count up a summary of the students' preferences and report them.
# Total the Counter rather than reusing i, which is undefined if the file has no data rows.
total = sum(c.values())
print("Students' first choices (n={0}):".format(total))
for pref, count in c.items():
    # The format spec {2:.1f} prints argument 2 (the third one) as a number with one decimal place.
    print("    {0} ({2:.1f}%) students want section {1}.".format(count, pref, 100 * count / total))

    [0] Student 5128 wants section C, but not B
    [200] Student 1700 wants section C, but not B
    [400] Student 553 wants section B, but not C
    [600] Student 6446 wants section B, but not C
    [800] Student 7817 wants section B, but not C
Students' first choices (n=1000):
    285 (28.5%) students want section B.
    614 (61.4%) students want section A.
    101 (10.1%) students want section C.
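
The same counting idea can be sketched a little more compactly. This version is only an illustration: it reads from an in-memory stand-in for the file, the column names (`id`, `first`, `second`, `third`) are assumptions, and it uses `csv.DictReader` and `Counter.most_common()` so the sections print from most to least popular.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for class.csv; the column names and rows are made up.
fake_file = io.StringIO(
    "id,first,second,third\n"
    "1,A,B,C\n"
    "2,A,C,B\n"
    "3,B,A,C\n"
    "4,C,A,B\n"
)

# DictReader uses the header row as keys, so columns can be read by name.
reader = csv.DictReader(fake_file)
counts = Counter(row["first"] for row in reader)
total = sum(counts.values())

# most_common() returns (item, count) pairs sorted by count, highest first.
for section, count in counts.most_common():
    print("{0} ({1:.1f}%) students want section {2}.".format(
        count, 100 * count / total, section))
```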


This is a very simple example of how to pull summary data out of a csv file in Python.

## Second, counting the same items with numpy

Languages like R were designed to perform one operation on every item in a collection without the need for a loop. Python has been extended to do the same type of operation with NumPy, which provides matrix-like arrays so it can compete with MATLAB. SciPy builds on top of NumPy by providing signal processing algorithms like FFTs that make analyzing sleep data possible. pandas is also built on top of NumPy; it lets you treat a data set like a database table and do statistical analyses with reasonable treatment of missing data. You could do all of these things on your own in Python, but it's nice that somebody else spent their time writing all of the code for us.

The following does almost the same thing we did above, but using the numpy stack.

In [4]:
# First, grab ahold of the code
import numpy as np   # not called directly below; pandas is built on top of it
import pandas as pd  # "pd" is the conventional alias for pandas

# Next, read the csv and save it all to memory in one step.
mydata = pd.read_csv('./class.csv')
# Finally, summarize the frequency table using the pandas value_counts method
print(mydata['first'].value_counts())

A    614
B    285
C    101
Name: first, dtype: int64
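
The loop version also reported percentages. As far as I can tell, the closest pandas equivalent is `value_counts(normalize=True)`, which returns fractions instead of counts. The sketch below uses a tiny made-up data set read from memory instead of the real file, so the numbers are illustrative only.

```python
import io
import pandas as pd

# Hypothetical stand-in for class.csv; only the 'first' column name comes from the cell above.
fake_file = io.StringIO(
    "id,first,second,third\n"
    "1,A,B,C\n"
    "2,A,C,B\n"
    "3,B,A,C\n"
    "4,C,A,B\n"
)
mydata = pd.read_csv(fake_file)

# normalize=True gives each value's share of the total; multiply by 100 for percentages.
shares = mydata["first"].value_counts(normalize=True) * 100
print(shares.round(1))
```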


Since I haven't used any of the NumPy stack before, I can't do much with it yet, but it works a lot like R. If you learn either pandas or R, you could probably switch back and forth fairly easily. That's what I'm working on now. Take a look at http://pandas.pydata.org/pandas-docs/stable/10min.html when you have time to think and experiment. The "10 minutes" bit is probably true for someone who is already an expert in both Python and R and just needs to see how pandas is implemented.