# Python Basics

Sign in here: https://goo.gl/forms/SFsCpO4Vy7PDa2Rg2

## Syntax
Python is an object-oriented scripting language and does not require a specific first or last line (such as <code>public static void main</code> in Java or a final <code>return</code> in C's <code>main</code>).

There are no curly braces {} to define code blocks, and no semicolons ; are needed to end a line.  Instead of braces, indentation is strictly enforced to define a block of code.

Arbitrary indentation can be used within a code block, as long as the indentation is consistent.
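
A minimal sketch of indentation-defined blocks. The variable names and values are illustrative and not taken from the workshop notebook.

```python
# The body of each branch is whatever is indented beneath it.
temperature = 75  # illustrative value

if temperature > 70:
    message = "warm"   # indented: inside the if block
else:
    message = "cool"   # indented: inside the else block

print(message)  # back at the left margin: outside both blocks
```

Four spaces per level is the common convention, but any width works as long as every line in the same block uses it.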

## Variables and Types

Variables can be given alphanumeric names beginning with an underscore or letter.  Variable types do not have to be declared and are inferred at run time.

In [1]:
# Int

In [2]:
# Float

Strings can be declared with either single or double quotes.

In [3]:
# Strings
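
As a rough sketch of the three types mentioned so far (the names and values below are made up for illustration):

```python
count = 42             # int
ratio = 3.14           # float
single = 'Go Gators'   # str, single quotes
double = "Go Gators"   # str, double quotes

print(type(count).__name__)
print(type(ratio).__name__)
print(single == double)   # the quote style does not change the value

count = "forty-two"       # a name can be rebound to a value of another type
print(type(count).__name__)
```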

A variable's scope is the function, class, or file (module) in which it is defined, in increasing order of breadth.  Global variables can also be declared with the <code>global</code> keyword.

In [4]:
# First function
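
One way to see function scope in action is the sketch below. The names <code>team</code> and <code>rename</code> are hypothetical, chosen only for this illustration.

```python
team = "Gators"            # defined at file (module) level

def rename():
    team = "Bulldogs"      # assignment creates a new name local to rename()
    return team

print(rename())  # the local value
print(team)      # the outer variable is untouched
# To rebind the outer name from inside the function, rename() would
# need a `global team` statement before the assignment.
```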

## Modules and Import
Files with a .py extension are known as Modules in Python.  Modules are used to store functions, variables, and class definitions.  

Modules, including those in the standard Python library such as <code>math</code>, are made available to your program using the <code>import</code> statement.

In [5]:
# To use Math, we must import it

Whoops.  Importing the <code>math</code> module allows us access to all of its functions, but we must call them in this way

In [6]:
# Whole.part

Alternatively, you can use the <code>from</code> keyword

In [7]:
# From with pi

Using the <code>from</code> statement we can import everything from the math module.  

Disclaimer: many Pythonistas discourage doing this because it clutters the namespace and hides where each name came from.  Just import what you need

In [8]:
# From ... *
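
The import forms discussed above can be sketched as follows. This is only an illustration using the standard <code>math</code> module; the particular functions called are my choice, not necessarily those used in the workshop.

```python
import math                  # whole module; names are reached as math.<name>
print(math.sqrt(16))

from math import pi, floor   # selected names, usable without the prefix
print(round(pi, 2))
print(floor(2.7))

# `from math import *` would pull in every public name from math at once,
# which makes it harder to tell where a name such as `floor` came from.
```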

## Strings
As you may expect, Python has a powerful, full-featured string type.  

### Substrings
Substrings can be extracted from Python strings using bracket (slice) syntax

In [9]:
# Print 1

Python uses 0-based indexing.  Generally, whenever forming a range of values in Python, the first argument is inclusive whereas the second is not, i.e. <code>mystring[11:25]</code> returns the characters at indices 11 through 24.

You can omit the first or second argument

In [10]:
# Print first 4

In [11]:
# Characters before 9th

In [12]:
# Characters after 27th

In [13]:
# Omitting start and end

Using negative values, you can count positions backwards

In [14]:
# Print almost last 4 characters
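
A short sketch of the slicing rules above. The contents of <code>mystring</code> are an assumption, since the workshop's actual string is not shown here.

```python
mystring = "Welcome to the Python workshop"  # illustrative text

print(mystring[0])       # a single character at index 0
print(mystring[:4])      # start omitted: indices 0 through 3
print(mystring[11:14])   # indices 11, 12 and 13
print(mystring[22:])     # end omitted: index 22 to the end
print(mystring[-8:])     # negative start: the last eight characters
print(mystring[:])       # both omitted: a copy of the whole string
```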

### String Functions
Here are some more useful string functions
#### find

In [15]:
# Find "Gators"

Looks like nothing was found.  <code>find</code> returns -1 when the substring is not present.
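
For comparison, here is a sketch of both outcomes of <code>find</code>, using a made-up sentence rather than the workshop's string:

```python
sentence = "It's great to be a Florida Gator"  # illustrative text

found_at = sentence.find("Florida")   # index where the match starts
missing = sentence.find("Gators")     # no such substring in the sentence

print(found_at)
print(missing)   # -1
```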

#### lower and upper

#### replace

Notice that replace returned a new string.  Nothing was modified in place
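
Because strings are immutable, <code>lower</code>, <code>upper</code>, and <code>replace</code> all return new strings. A small sketch with an illustrative value:

```python
original = "Go Gators"   # illustrative text

print(original.lower())
print(original.upper())

replaced = original.replace("Gators", "Team")
print(replaced)
print(original)   # unchanged by any of the calls above
```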

#### split

#### join

The <code>join</code> method is useful for building strings from lists or other iterables.  Call <code>join</code> on the desired separator and pass it the iterable of strings

In [73]:
# Join with spaces
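
<code>split</code> and <code>join</code> are roughly inverses of each other. A minimal sketch with made-up input:

```python
words = "one,two,three".split(",")   # split on a separator into a list
print(words)

spaced = " ".join(words)             # join is called on the separator
print(spaced)

# join needs strings, so other types are converted first
dashed = "-".join(str(n) for n in range(3))
print(dashed)
```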

For more information on string functions:

https://docs.python.org/2/library/stdtypes.html#string-methods

## Data Structures
### Lists
The Python standard library does not have traditional C-style fixed-memory fixed-type arrays.  Instead, lists are used and can contain a mix of any type.

Lists are created with square brackets []

In [17]:
# mylist list of 5

In [18]:
# append 6

In [19]:
# extend the list with the contents of another list


In [20]:
# insert the number 7 at index 6


In [21]:
# removes the first matching occurrence


In [22]:
# by default, the last item in the list is removed and returned


In [23]:
# pops the item at a given index


In [24]:
# len()

In [25]:
# the range function returns a list from -3 inclusive to 0 non inclusive


In [26]:
# default list sorting. When more complex objects are in the list, arguments can be used to customize how to sort


In [27]:
# reverse the list
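
The list operations above can be sketched in one pass. The values are illustrative; the comments track the list after each step. In Python 3, <code>range</code> is wrapped in <code>list()</code> to get an actual list.

```python
mylist = [1, 2, 3, 4, 5]
mylist.append(6)          # [1, 2, 3, 4, 5, 6]
mylist.extend([8, 9])     # [1, 2, 3, 4, 5, 6, 8, 9]
mylist.insert(6, 7)       # [1, 2, 3, 4, 5, 6, 7, 8, 9]
mylist.remove(3)          # [1, 2, 4, 5, 6, 7, 8, 9]
last = mylist.pop()       # removes and returns 9
first = mylist.pop(0)     # removes and returns 1
print(len(mylist), last, first)

negatives = list(range(-3, 0))   # -3 inclusive to 0 exclusive

unsorted = [3, 1, 2]
unsorted.sort()           # sorts in place: [1, 2, 3]
unsorted.reverse()        # reverses in place: [3, 2, 1]
print(unsorted, negatives)
```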


For more information on Lists:

https://docs.python.org/2/tutorial/datastructures.html#more-on-lists

### Tuples

Python supports n-tuple sequences.  These are immutable

In [28]:
# mytuple

In [29]:
# access an item

In [30]:
# results in error
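
A small sketch of tuple access and immutability, with illustrative contents:

```python
mytuple = (1, "two", 3.0)
print(mytuple[1])          # items are read by index, like a list

try:
    mytuple[0] = 99        # tuples do not support item assignment
except TypeError as err:
    print("cannot modify a tuple:", err)
```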

### Sets
Python includes the set data structure which is an unordered collection with no duplicates

In [31]:
# in

In [32]:
# more in

In [33]:
# set arithmetic

In [34]:
# set and

In [35]:
# OR

In [36]:
# XOR
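
The membership tests and set operators can be sketched as follows; the two sets are made up for illustration.

```python
a = set([1, 2, 3, 3])   # the duplicate 3 is stored only once
b = {3, 4}

print(2 in a, 5 in a)   # membership tests

difference = a - b      # in a but not in b
both = a & b            # in a AND in b
either = a | b          # in a OR in b
exclusive = a ^ b       # in exactly one of the two (XOR)
print(difference, both, either, exclusive)
```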

### Dictionaries
Python supports dictionaries, which can be thought of as an unordered collection of key, value pairs.  Keys can be any hashable (typically immutable) type and are usually integers or strings.  Values can be any object, even dictionaries.

Dictionaries are created with curly braces {}

In [37]:
# mydict

In [38]:
# Florida

In [39]:
# delete

In [40]:
# assignment

In [41]:
# appending

In [42]:
# keys
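
A sketch of the dictionary operations above. The keys and values are hypothetical and not the workshop's actual data.

```python
mydict = {"Florida": "Gators", "Georgia": "Bulldogs"}   # illustrative pairs

print(mydict["Florida"])              # look up a value by key
del mydict["Georgia"]                 # delete a key and its value
mydict["Florida"] = "Gators!"         # assigning to an existing key overwrites
mydict["Alabama"] = "Crimson Tide"    # assigning to a new key adds a pair
print(sorted(mydict.keys()))
```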

## Conditionals
Python supports the standard if-else-if conditional statement, written with the keywords <code>if</code>, <code>elif</code>, and <code>else</code>
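
A minimal sketch of the conditional syntax; the function <code>describe</code> is a hypothetical example, not part of the workshop.

```python
def describe(n):
    """Classify a number using if / elif / else."""
    if n < 0:
        return "negative"
    elif n == 0:
        return "zero"
    else:
        return "positive"

print(describe(-5), describe(0), describe(12))
```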

## Loops
Python supports for, foreach, and while loops
### For (counting)
Traditional counting loops are accomplished in Python with a combination of the <code>for</code> keyword and the <code>range</code> function

In [43]:
#with one argument, range produces integers from 0 to 9
for x in range(10):
    print(x)

In [44]:
# with two arguments, range produces integers from 5 to 11


In [45]:
# with three arguments, range starts at 1 and goes in steps of 3 until greater than 12

In [46]:
# can use a negative step size as well


In [47]:
# with a positive step, all values are less than 1. No integers are produced


In [48]:
# same goes for a negative step as all values are less than 2
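
The behaviour of <code>range</code> with one, two, and three arguments can be sketched as below. The exact arguments are my guesses at values matching the cell comments, and <code>list()</code> is used so the results print as lists in Python 3.

```python
one_arg = list(range(10))          # 0 through 9
two_args = list(range(5, 12))      # 5 through 11
stepped = list(range(1, 13, 3))    # 1, 4, 7, 10
downward = list(range(10, 0, -2))  # 10, 8, 6, 4, 2

# When the step points away from the stop value, nothing is produced.
empty_up = list(range(2, 1))        # positive step, start already past stop
empty_down = list(range(1, 2, -1))  # negative step, start already below stop

print(one_arg, two_args, stepped, downward, empty_up, empty_down)
```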


### Foreach
As it turns out, counting loops are just foreach loops in Python.  The <code>range</code> function returns a list of integers (in Python 3, a lazy range object) over which <code>for in</code> iterates.  This can be extended to any other iterable type

In [49]:
# iterate over a list of strings
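
A sketch of iterating directly over a list, with made-up names. <code>enumerate</code> is shown as one common way to get the position alongside each item.

```python
names = ["ada", "grace", "linus"]   # illustrative list of strings

shouted = []
for name in names:                  # no index variable is needed
    shouted.append(name.upper())
print(shouted)

for index, name in enumerate(names):
    print(index, name)
```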

### While
Python supports standard <code>while</code> loops

Python does not have a construct for a do-while loop, though it can be accomplished using the <code>break</code> statement
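
Both forms can be sketched as follows. The variables are illustrative; the second loop emulates do-while by running the body first and testing the exit condition at the end.

```python
count = 3
while count > 0:          # condition is checked before each pass
    print(count)
    count -= 1

values = [4, 8, 15]
total = 0
while True:               # body always runs at least once
    total += values.pop()
    if not values or total > 20:
        break             # exit condition checked after the body
print(total, values)
```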

## Functions
Functions in Python do not have a distinction between those that do and do not return a value.  If a value is returned, the type is not declared.

Functions can be declared in any module without any distinction between static and non-static.  Functions can even be declared within other functions

The syntax is the following

In [50]:
# define function

In [51]:
# define player, name, number

Functions can have optional arguments if a default value is provided in the function signature

In [52]:

    
# no team argument supplied

In [53]:
# supplying all three arguments

Python functions can be called using named arguments, instead of positional
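
Positional, default, and named arguments can be sketched together. The function <code>describe_player</code> and its default team are hypothetical stand-ins for the workshop's own example.

```python
def describe_player(name, number, team="Gators"):
    """Return a one-line description; `team` is optional."""
    return "{0} #{1} ({2})".format(name, number, team)

print(describe_player("Alex", 7))               # default team is used
print(describe_player("Alex", 7, "Bulldogs"))   # all three supplied
print(describe_player(number=7, name="Alex"))   # named, in any order
```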

### \*args and \**kwargs
In Python, there is a special unpacking syntax that allows for defining and calling functions with argument lists or dictionaries.

#### \*args

In [54]:
# calling with *args

Argument lists can also be used in defining a function as such

In [55]:
# define foo, *args, print *args
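
A minimal sketch of <code>*args</code> on both sides of a call. The function <code>total</code> is an illustrative name, not the workshop's <code>foo</code>.

```python
def total(*args):
    """Accept any number of positional arguments; args is a tuple."""
    return sum(args)

numbers = [1, 2, 3]
print(total(*numbers))   # the list is unpacked into three arguments
print(total(4, 5))       # ordinary positional call
```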

#### \**kwargs
Similarly, we can define a dictionary of named parameters

In [56]:
# calling with kwargs

Just as before, we can define a function taking an arbitrary dictionary
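
The same idea with dictionaries can be sketched as below; <code>make_player</code>, <code>describe</code>, and <code>settings</code> are hypothetical names used only for illustration.

```python
def make_player(name, number):
    return (name, number)

def describe(**kwargs):
    """Accept any named arguments; kwargs is a dictionary."""
    return ", ".join("{0}={1}".format(k, kwargs[k]) for k in sorted(kwargs))

settings = {"name": "Alex", "number": 7}
print(make_player(**settings))   # keys are matched to parameter names
print(describe(**settings))
print(describe(b=2, a=1))
```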

### return
In Python functions, an arbitrary number of values can be returned; multiple values are packed together and returned as a tuple

In [57]:
# def sum, return a + b
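
A sketch of returning more than one value; <code>min_and_max</code> is an illustrative function, not the workshop's.

```python
def min_and_max(values):
    """Return two results at once; Python packs them into a tuple."""
    return min(values), max(values)

result = min_and_max([3, 1, 4])
print(result)                      # the tuple itself

low, high = min_and_max([3, 1, 4])  # or unpack into separate names
print(low, high)
```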

# Data Science Tutorial

Now that we've covered some Python basics, we will begin a tutorial going through many tasks a data scientist may perform.  We will obtain real world data and go through the process of auditing, analyzing, visualizing, and building classifiers from the data.

We will use a database of selected disease statistics of various countries from the Global Health Observatory. The data is organized by country and year, with the number of specific incidents of each disease listed. The attributes and domain of each entry are described by the table below:

| Attribute                     | Domain                          |
|-------------------------------|---------------------------------|
| 1. Country                    | String                          |
| 2. Year                       | Year (2009-2014)                |
| 3. T.b. gambiense             | Integer                         |
| 4. T.b. rhodesiense           | Integer                         |
| 5. Cholera                    | Integer                         |
| 6. Meningitis (suspected)     | Integer                         |
| 7. Congenital Rubella         | Integer                         |
| 8. Diphtheria                 | Integer                         |
| 9. Japanese encephalitis      | Integer                         |
| 10. Leprosy                   | Integer                         |
| 11. Malaria                   | Integer                         |
| 12. Measles                   | Integer                         |
| 13. Mumps                     | Integer                         |
| 14. Neonatal Tetanus          | Integer                         |
| 15. Pertussis                 | Integer                         |
| 16. Plague                    | Integer                         |
| 17. Poliomyelitis             | Integer                         |
| 18. Rubella                   | Integer                         |
| 19. Total Tetanus             | Integer                         |
| 20. Tuberculosis              | Integer                         |
| 21. Yellow Fever              | Integer                         |
| 22. Cutaneous Leishmaniasis   | Integer                         |
| 23. Visceral Leishmaniasis    | Integer                         |

For more information on this data set:
http://apps.who.int/gho/data/node.home

## Obtaining the Data
Let's begin by programmatically obtaining the data.  Here I'll define a function we can use to make HTTP requests and download the data

In [58]:
#define download_file function

Now we'll specify the url of the file and the file name we will save to

In [59]:
#specify url and filename

And make a call to <code>download_file</code>

In [60]:
#call our download_file function
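
The workshop's own <code>download_file</code> is not shown here, so the following is only a sketch of what such a function might look like. It assumes Python 3 and uses the standard library's <code>urllib.request</code>, which may differ from the HTTP library used in the workshop. The URL and file name in the usage comment are placeholders.

```python
import shutil
from urllib.request import urlopen   # Python 3 standard library

def download_file(url, filename):
    """Stream the response body at `url` into `filename`; return the path."""
    with urlopen(url) as response, open(filename, "wb") as out_file:
        shutil.copyfileobj(response, out_file)   # copies in chunks
    return filename

# Placeholder values, not the workshop's real URL or file name:
# download_file("https://example.com/health_data.csv", "health_data.csv")
```

Streaming with <code>copyfileobj</code> avoids holding the whole response in memory, which matters little for a small csv file but helps with larger downloads.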

**Note:**  If you see an InsecurePlatformWarning message, ignore it. More info can be found here: https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning

Now this might seem like overkill for downloading a single, small csv file, but we can use this same function to access countless APIs available on the World Wide Web by building an API request in the url.

## Wrangling the Data
Now that we have some data, let's get it into a useful form.  For this task we will use a package called pandas. pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for Python.  The most fundamental data structure in pandas is the dataframe, which is similar to the data.frame data structure found in the R statistical programming language.

For more information: http://pandas.pydata.org

pandas dataframes are 2-dimensional labeled data structures with columns of potentially different types.  Dataframes can be thought of as similar to a spreadsheet or SQL table.

There are numerous ways to build a dataframe with pandas.  Since we have already obtained a csv file, we can use a parser built into pandas called <code>read_csv</code> which will read the contents of a csv file directly into a dataframe.

For more information: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.io.parsers.read_csv.html

In [61]:
#import pandas and load dataset into a frame

Let's take a look at some simple statistics for the **Cholera** column

In [62]:
#describe cholera column

Referring to the documentation, the data contains 1164 entries. However, the "count" row shows only 245 entries. This is because the original data contains many empty fields, which pandas automatically converts to NumPy's <code>nan</code> ("Not a Number") value, and <code>describe</code> counts only the non-missing values.
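
The gap between the number of rows and the reported count can be reproduced on a tiny, made-up csv. The rows below are invented for illustration and are not from the real data set.

```python
import io
import pandas as pd

# Three invented rows; the second has an empty Cholera field.
csv_text = "Country,Year,Cholera\nA,2009,10\nB,2009,\nC,2010,30\n"
frame = pd.read_csv(io.StringIO(csv_text))

print(len(frame))                    # number of rows read
print(frame["Cholera"].count())      # number of non-missing values
print(frame["Cholera"].describe())   # "count" here matches the line above
```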

Let's take a look at another column, this time **Pertussis**

In [63]:
#describe pertussis column

Well at least the name is correct.  We were expecting a mean and standard deviation, and now the data type is an object.  

What's up with our data?

We have arrived at arguably the most important part of performing data science: dealing with messy data.  One of the most important tools in a data scientist's toolbox is the ability to audit, clean, and reshape data.  The real world is full of messy data and your sources may not always have data in the exact format you desire.

In this case we are working with csv data, which is a relatively straightforward format, but this will not always be the case when performing real world data science.  Data comes in all varieties from csv all the way to something as unstructured as a collection of emails or documents.  A data scientist must be versed in a wide variety of technologies and methodologies in order to be successful.

Now, let's do a little bit of digging into why we are not getting a numeric pandas column

In [64]:
#find unique values

Using <code>unique</code> we can see that '0 0', '5 5', and '2 2' all appear as distinct values in this series. Because of the space between the numbers, pandas has read these as *strings* rather than *integers*. Indeed, it's not immediately obvious that these were meant to be legitimate entries in the first place.

Let's see what we can do with these unrecognized values. 

In [65]:
#convert column to numeric values

Here we have attempted to convert the **Pertussis** series to a numeric type.  Let's see what the unique values are now.

In [66]:
#find new unique values

The decimal point after each number means that an integer value is being represented by a floating point number.  Now instead of our pesky *strings* we have <code>nan</code> (not a number).  <code>nan</code> is the value pandas uses to represent a missing entry.  It is a special floating point value that pandas takes from NumPy, a package it uses internally and which is not part of the standard Python library.
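
The workshop's conversion cell is not shown, so the sketch below uses <code>pd.to_numeric</code> with <code>errors="coerce"</code> as one way to get this behaviour; it may not be the call the workshop used. The series contents are made up to mimic the malformed entries described above.

```python
import pandas as pd

# Invented values imitating the malformed Pertussis entries.
raw = pd.Series(["12", "0 0", "7", "5 5"], name="Pertussis")

# errors="coerce" turns anything that cannot be parsed into nan.
numeric = pd.to_numeric(raw, errors="coerce")

print(numeric.dtype)      # a floating point dtype, since nan is a float
print(numeric.unique())
```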

Now that we have <code>nan</code> values in place of strings, we can use some nice features in pandas to deal with these missing values.

What we are about to do is what is called "imputing" or providing a replacement for missing values so the data set becomes easier to work with.  There are a number of strategies for imputing missing values, all with their own pitfalls.  In general, imputation introduces some degree of bias to the data, so the imputation strategy taken should be in an attempt to minimize that bias.

Here, we will simply ignore all of the <code>nan</code> values; however, other strategies, such as replacing each <code>nan</code> with the mean of the data, are also commonly used.

In [67]:
#convert whole data frame

<code>health_data.mean().round()</code> will take the mean of each column (this computation ignores the currently present nan values), then round, and return a Series indexed by the column names of the original dataframe.

This function can be used to replace all missing values with the mean of each column. In this tutorial however, we will not use this method, because the large number of missing values would greatly skew our standard deviations.

In [68]:
#find mean values for imputing
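
As a sketch of mean imputation on a tiny invented dataframe (the column names are borrowed from the table above, but the numbers are made up):

```python
import pandas as pd

frame = pd.DataFrame({
    "Cholera": [10.0, None, 30.0],   # None becomes nan
    "Measles": [1.0, 3.0, None],
})

column_means = frame.mean().round()   # nan values are skipped by default
filled = frame.fillna(column_means)   # each nan gets its own column's mean

print(column_means)
print(filled)
```

Passing the Series of means to <code>fillna</code> matches values to columns by name, so every column is filled with its own mean.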

Now that we have figured out how to impute these missing values, let's start over and quickly apply this technique to the entire dataframe.

In [69]:
#quickly load and convert data

In [70]:
#check unique values

Structurally, pandas dataframes are a collection of Series objects sharing a common index.  In general, the Series object and DataFrame object share a large number of functions with some behavioral differences.  In other words, whatever computation you can do on a single column can generally be applied to the entire dataframe.

Now we can use the dataframe version of <code>describe</code> to get an overview of all of our data

In [71]:
#overview description of data frame

## Visualizing the Data
Another important tool in the data scientist's toolbox is the ability to create visualizations from data.  Visualizing data is often the most logical place to start getting a deeper intuition of the data.  This intuition will shape and drive your analysis.

Even more important than visualizing data for your own personal benefit, it is often the job of the data scientist to use the data to tell a story.  Creating illustrative visuals that succinctly convey an idea is the best way to tell that story, especially to stakeholders with less technical skillsets.

Here we will be using a Python package called ggplot (https://ggplot.yhathq.com).  The ggplot package is an attempt to bring visuals following the guidelines laid out in the grammar of graphics (http://vita.had.co.nz/papers/layered-grammar.html) to Python.  It is based on, and intended to mimic the features of, the ggplot2 library found in R.  Additionally, ggplot is designed to work with pandas dataframes, making things nice and simple. 

We'll start by doing a bit of setup

In [72]:
# The following line is NOT Python code, but a special syntax for enabling inline plotting in IPython
%matplotlib inline 

from ggplot import *

import warnings

# ggplot usage of pandas throws a future warning
warnings.filterwarnings('ignore')

So we enabled plotting in IPython and imported everything from the ggplot package.  Now we'll create a plot and then break down the components

In [None]:
#create our first plot

A plot begins with the <code>ggplot</code> function.  Here, we pass in the health_data pandas dataframe and a special function called <code>aes</code> (short for aesthetic).  The values provided to <code>aes</code> change depending on which type of plot is being used.  Here we are going to make a histogram from the **human African trypanosomiasis (T.b. rhodesiense)** column in health_data, so that column name needs to be passed as the x parameter to <code>aes</code>.

The grammar of graphics is based on a concept of "geoms" (short for geometric objects).  These geoms provide granular control of the plot and are progressively added to the base call to <code>ggplot</code> with + syntax.


Let's say we wanted to show the mean number of cases on this plot.  We could do something like the following

In [None]:
#plot with vline

As you can see, each geom has its own set of parameters specific to the appearance of that geom (also called aesthetics).

Let's try a scatter plot to get some multi-variable action

In [None]:
#scatter plot

With a simple aesthetic addition, we can see how these two variables have changed over the past six years.

In [None]:
#color scatter plot

By adding <code>color = 'Year'</code> as a parameter to the aes function, we now give a color to each unique value found in that column and automatically get a legend.

We can also do things such as add a title or change the axis labeling with geoms

In [None]:
#plot with custom title and axes

I highly encourage you to check out https://ggplot.yhathq.com/docs/index.html to see all of the available geoms.  The best way to learn is to play with and visualize the data with many different plots and aesthetics.

There doesn't seem to be much correlation between these two variables as a function of the year; however, it is difficult to say with such simple statistics. If we wanted to analyze relationships between all of the variables in our data set, we would need to analyze a very high-dimensional space. Using some more complex data analysis and machine learning techniques, we may be able to extract higher order correlations in this high-dimensional data space. This will be the topic of our next workshop, Python II.

## Summary and Python II

In the first half of our two-part Python series, we've learned about variables, data structures, functions, and graphing. While we have introduced these topics in the context of data science with Python, they are central to programming in any language and in any context. We have also laid the foundation for programmatically obtaining, cleaning, and visualizing data sets.

Now that we have an understanding of how to obtain and visualize some simple statistical information contained in a dataset, we've set the stage for machine learning and data analysis. These topics will be covered in depth in our next workshop, Python II.

Post workshop survey: https://goo.gl/forms/gxGfc9LzDIev11R53