# An Introduction to Python

The following Jupyter notebook serves as a basic introduction to Python for use in analytics (statistics, machine learning, data science, data analytics, business analytics, etc.) applications. Python is a very general language, so this introduction barely scratches the surface of the language's capabilities. We introduce Python datatypes, defining objects, subsetting and slicing, and then introduce `pandas` and `numpy` as libraries which allow us to conveniently interface with and operate on data frames.

We'll typically make use of these libraries as well as a few others -- in QSO370/QSO570 the following Python libraries will be used quite often:
+ `pandas` for reading and manipulating data frames
+ `numpy` for numerical operations on vectors/columns
+ `matplotlib` and `seaborn` for data visualization (I'll provide a short primer)
+ `sklearn` for predictive modeling

### Python Basics: Objects

We define objects/variables with the `=` operator in Python. Python infers an object's type from the value stored in it. We can run a block of code in a Jupyter notebook using `Shift+Enter`.

In [1]:
w = True
x = 2
y = 2.0
z = "Hello"

In [4]:
type(w)
type(x)
type(y)
type(z)

str

By default, a notebook cell only displays the result of its last statement, which is why only `str` appears above. We use the `print()` function to force Python to print additional items. Try wrapping each call to `type()` in the block above in `print()` and re-executing the cell. What happens?
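
A minimal sketch of the display behavior described above, re-creating the objects `w`, `x`, `y`, and `z` from the earlier cell. The name `types_seen` is introduced here only for illustration.

```python
# Re-create the objects from the earlier cell
w = True
x = 2
y = 2.0
z = "Hello"

# print() forces each result to be displayed, not just the last one
for obj in [w, x, y, z]:
    print(type(obj))

# Collect the type names so they can be inspected together
types_seen = [type(obj).__name__ for obj in [w, x, y, z]]
print(types_seen)
```

Note that `2` and `2.0` are stored as different types even though they represent the same number.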

### Special Objects: Lists and Dictionaries

Beyond the basic boolean (true/false), numeric, and string datatypes Python also has lists (ordered containers of objects), dictionaries, arrays, and data frames. We will most often be working with data frames (tables) and lists, but it is worth seeing those other datatypes too since they can be useful.

In [5]:
#A list
x = [1, 2, 4, 99]
#Another (more complicated) list
y = ["a", True, 45, 22, [1, 9, 3]]
print(type(x))
print(type(y))

<class 'list'>
<class 'list'>


Note that both `x` and `y` above are lists, but they are quite different from one another. We can easily see that `x` is a list of four numeric values but `y` is a list of disparate objects. The first entry is a string, the second is a boolean, the third and fourth are numerics, and the fifth is a list itself! We can use subsetting to confirm this. There are a few things to keep in mind when referencing indices in Python.

+ Indexing in Python begins from 0 -- that is, the first item in these lists occupies slot 0, the second item occupies slot 1, etc.
+ We can reference a range of indices using the colon (`:`) operator, for example `1:4`, but the range excludes its right endpoint. That is, if we request `x[1:4]` we will get the objects in slots 1, 2, and 3 of the list `x`. Remember, these are the second, third, and fourth items. This is confusing at first, but it will feel natural soon.

In [8]:
y

['a', True, 45, 22, [1, 9, 3]]

In [7]:
x[3]

99

In [9]:
y[1:4]

[True, 45, 22]

In [10]:
print(y[1:4])
print(type(y[0]))
print(type(y[4]))

[True, 45, 22]
<class 'str'>
<class 'list'>


In [11]:
y

['a', True, 45, 22, [1, 9, 3]]

In [14]:
y[4][2]

3

Notice that the last item of the list `y` is indeed a list! Now, what would happen if we requested `y[4][1]`? Would this work if we replaced the `4` by a `3`? Try it!

In [None]:
#Use this cell to explore...


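We can also leave out one endpoint of a range: `y[:n]` starts at the beginning of the list and stops just before slot `n`, while `y[n:]` starts at slot `n` and runs through the end of the list.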

In [15]:
y[2:]

[45, 22, [1, 9, 3]]
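
As a quick sketch of open-ended slices, using the same list `y` as above: leaving out the left endpoint starts at the beginning, and leaving out the right endpoint runs to the end. Negative indices, which count from the end of the list, are an addition not covered in the text above.

```python
y = ["a", True, 45, 22, [1, 9, 3]]

first_two = y[:2]    # slots 0 and 1
from_two = y[2:]     # slot 2 through the end
last_item = y[-1]    # negative indices count from the end

print(first_two)
print(from_two)
print(last_item)
```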

Adding lists does not necessarily work in the way we would expect: `+` joins (concatenates) the lists rather than adding their entries.

In [16]:
[1, 2, 4] + [5, 9, 10]

[1, 2, 4, 5, 9, 10]

In [None]:
z = [1, 2, 4] + [5, 9, 10, 18, 20]
sorted(z)

[1, 2, 4, 5, 9, 10, 18, 20]
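
If entry-by-entry addition is what we actually want, plain lists need a little help. The sketch below contrasts the two behaviors; the names `a`, `b`, `combined`, and `elementwise` are chosen here for illustration.

```python
a = [1, 2, 4]
b = [5, 9, 10]

# + concatenates the two lists
combined = a + b

# Entry-by-entry addition needs an explicit comprehension (or a loop)
elementwise = [s + t for s, t in zip(a, b)]

print(combined)
print(elementwise)
```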

Dictionaries are similar to lists in that they are containers for objects, but their entries are looked up by name (key) rather than by position. They get us closer to data frames because each object has a name with which we can reference it!

In [17]:
#A simple dictionary
colleges = {"State" : "NH",
            "School" : ["Plymouth", "UNH", "Keene", "SNHU", "Franklin Pierce", "St. Anselm's", "Dartmouth", "Colby-Sawyer", "New England College", "Rivier"],
            "Accredited" : 10*[True]}
print(colleges)

{'State': 'NH', 'School': ['Plymouth', 'UNH', 'Keene', 'SNHU', 'Franklin Pierce', "St. Anselm's", 'Dartmouth', 'Colby-Sawyer', 'New England College', 'Rivier'], 'Accredited': [True, True, True, True, True, True, True, True, True, True]}


The code above creates a dictionary with slots for `State`, `School` name, and whether or not the school is `Accredited`. Note that dictionaries can have slots storing objects of different lengths (which is not allowed for data frames), and we can define what goes into those slots in a variety of ways. A nice shortcut was used to create the `Accredited` slot though -- multiplying a list by a number extends that list by copying it that number of times!
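
A short sketch of how the slots of a dictionary are referenced by name, assuming a shortened three-school version of the `colleges` dictionary (the shortening is only to keep the example small).

```python
colleges = {"State": "NH",
            "School": ["Plymouth", "UNH", "Keene"],
            "Accredited": 3 * [True]}

# Look up a slot by its name (key) rather than by position
schools = colleges["School"]
print(schools[1])

# keys() lists the slot names
slot_names = list(colleges.keys())
print(slot_names)

# Multiplying a list by a number repeats it that many times
repeated = 2 * ["a", "b"]
print(repeated)
```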

### Data Frames and `pandas`

This section begins our discussion of Python for analytics. You'll have a fourth notebook which re-covers some of this material and takes a more purposeful look at troubleshooting common errors we'll encounter.

The `pandas` library is really useful for working with and creating data frames. It will be one of the libraries that we use most often in QSO370. In order to make use of a library in Python we must first `import` it. Furthermore, anytime we wish to use a feature of that library, we must tell Python the library name. Since this is the case, we often provide a shortened reference name with `as`.

In [18]:
#import the pandas library with shortened reference name pd
import pandas as pd

Using `pd` in reference to the `pandas` library is standard in the Python community. Now let's use `pandas` to create a data frame from our `colleges` dictionary!

In [19]:
colleges_df = pd.DataFrame(colleges)
colleges_df

Unnamed: 0,State,School,Accredited
0,NH,Plymouth,True
1,NH,UNH,True
2,NH,Keene,True
3,NH,SNHU,True
4,NH,Franklin Pierce,True
5,NH,St. Anselm's,True
6,NH,Dartmouth,True
7,NH,Colby-Sawyer,True
8,NH,New England College,True
9,NH,Rivier,True


Note that `pandas` automatically replicated "NH" down the `State` column when creating the data frame. This behavior seems convenient, but what if in our original dictionary we had set `"State" : ["NH", "VT"]`? Go back and try it! `pandas` also added row numbers to our data frame -- these are more than just a superficial artifact of printing so that we know row 6 belongs to Dartmouth. These row numbers are called `indices` and are a referenceable part of our data frame.
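
The sketch below illustrates both points on a shortened, three-school version of the dictionary (an assumption made to keep the output small): a single value is repeated down a column, while lists of unequal length are rejected. The names `ok_df` and `mismatch_error` are illustrative only.

```python
import pandas as pd

# A single value is repeated down the column
ok_df = pd.DataFrame({"State": "NH", "School": ["Plymouth", "UNH", "Keene"]})
print(ok_df)

# Lists of different lengths cannot be lined up into rows
try:
    pd.DataFrame({"State": ["NH", "VT"], "School": ["Plymouth", "UNH", "Keene"]})
    mismatch_error = None
except ValueError as err:
    mismatch_error = err
    print("ValueError:", err)

# The row labels (the index) are part of the data frame
print(list(ok_df.index))
```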

### Importing a Data Frame

It will often be the case that we wish to import a data frame from a `*.csv` file stored on our local machine or out on the web. With `pandas` we have some simple functionality that allows for this. Below you can see how to read in both types of files (local and remote), but know that the local file exists on my machine. The line to read in the housing data **will not** run for you unless you download the `housing.csv` file from BrightSpace and adjust the path to point to the file's location on your machine. In addition, if you are running Python through Google Colab there are a few additional steps to reading local data into a Python session.

In [20]:
#Read in the housing.csv file which exists in the "Grad Datasets" folder on my desktop.
#homes = pd.read_csv("C:/Users/agilb/Desktop/Grad Datasets/housing.csv")

#Read in the heart_transplant.csv file which is hosted at openintro.org
heart_transplant_df = pd.read_csv("http://www.openintro.org/data/csv/heart_transplant.csv")

#print(type(homes))
print(type(heart_transplant_df))

#print(homes.shape)
print(heart_transplant_df.shape)

<class 'pandas.core.frame.DataFrame'>
(103, 8)


We did a bit more than just load the heart transplant data in that last chunk. We asked Python to print the object type of `heart_transplant_df`, verifying that it is a `pandas` data frame. We also asked for the shape of the data frame -- this returns a tuple of length two, of the format (*number of rows*, *number of columns*). Now that we have the data, we can view it. Calling `print()` on the name of the data frame will print the entire data frame to our notebook -- this is more than we need. In order to get a feel for the dataset we likely need only see a few rows. There are many ways to achieve this -- we can use subsetting to request a specific set of rows or also use the `.head()` or `.tail()` methods to get the first five and last five rows respectively. I'll show a few examples and then ask you to write and run code to pull specific rows.

In [22]:
#Print the first four rows, then the first ten rows, of the data frame.
print(heart_transplant_df[0:4])
print(heart_transplant_df[:10])

   id  acceptyear  age survived  survtime prior transplant  wait
0  15          68   53     dead         1    no    control   NaN
1  43          70   43     dead         2    no    control   NaN
2  61          71   52     dead         2    no    control   NaN
3  75          72   52     dead         2    no    control   NaN
   id  acceptyear  age survived  survtime prior transplant  wait
0  15          68   53     dead         1    no    control   NaN
1  43          70   43     dead         2    no    control   NaN
2  61          71   52     dead         2    no    control   NaN
3  75          72   52     dead         2    no    control   NaN
4   6          68   54     dead         3    no    control   NaN
5  42          70   36     dead         3    no    control   NaN
6  54          71   47     dead         3    no    control   NaN
7  38          70   41     dead         5    no  treatment   5.0
8  85          73   47     dead         5    no    control   NaN
9   2          68   51     dead         6    no    control   NaN

In [23]:
#Use .head() to get the first five rows
heart_transplant_df.head()

Unnamed: 0,id,acceptyear,age,survived,survtime,prior,transplant,wait
0,15,68,53,dead,1,no,control,
1,43,70,43,dead,2,no,control,
2,61,71,52,dead,2,no,control,
3,75,72,52,dead,2,no,control,
4,6,68,54,dead,3,no,control,


In [25]:
#Use .tail() to get the last rows (five by default; n = 25 requests the last 25)
heart_transplant_df.tail(n = 25)

Unnamed: 0,id,acceptyear,age,survived,survtime,prior,transplant,wait
78,81,73,52,alive,445,no,treatment,6.0
79,80,72,46,alive,482,yes,treatment,26.0
80,78,72,48,alive,515,no,treatment,210.0
81,76,72,52,alive,545,yes,treatment,46.0
82,64,72,48,dead,583,yes,treatment,32.0
83,72,72,26,alive,596,no,treatment,4.0
84,71,72,47,alive,630,no,treatment,31.0
85,69,72,47,alive,670,no,treatment,10.0
86,7,68,50,dead,675,no,treatment,51.0
87,23,69,58,dead,733,no,treatment,3.0
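
The same reading and previewing steps can be tried without a file on disk or a network connection. This is a minimal sketch, assuming a small made-up CSV held in a string and passed to `pd.read_csv()` through `io.StringIO`; the data and the name `toy_df` are hypothetical.

```python
import io
import pandas as pd

# A tiny made-up CSV held in memory instead of on disk
csv_text = """id,age,survived
1,53,dead
2,43,dead
3,52,alive
4,30,alive
"""

toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.shape)        # (number of rows, number of columns)
print(toy_df.head(n=2))    # first two rows
print(toy_df.tail(n=1))    # last row
```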


In [28]:
#Use this block to print rows 10 through 23 (inclusive).
#How does this change if you want to print the rows corresponding to indices 10 through 23 (inclusive)?
heart_transplant_df[9:23]

Unnamed: 0,id,acceptyear,age,survived,survtime,prior,transplant,wait
9,2,68,51,dead,6,no,control,
10,103,67,39,dead,6,no,control,
11,12,68,53,dead,8,no,control,
12,48,71,56,dead,9,no,control,
13,102,74,40,alive,11,no,control,
14,35,70,43,dead,12,no,control,
15,95,73,40,dead,16,no,treatment,2.0
16,31,69,54,dead,16,no,control,
17,3,68,54,dead,16,no,treatment,1.0
18,74,72,29,dead,17,no,treatment,5.0


In [None]:
#Use this block to edit your call to the .head() method so that python prints the first 8 rows instead of the first 5.


In addition to subsetting a data frame by rows (observations) we can also subset by columns (variables).

In [31]:
heart_transplant_df[["age", "survived", "prior"]]
#print(heart_transplant_df["age"].head())
#print(heart_transplant_df[["id", "age", "survtime"]].head())

Unnamed: 0,age,survived,prior
0,53,dead,no
1,43,dead,no
2,52,dead,no
3,52,dead,no
4,54,dead,no
...,...,...,...
98,30,alive,no
99,48,alive,yes
100,40,alive,no
101,48,alive,no


### Subsetting with `.loc[]` and `.iloc[]`

With `pandas` we have two indexers that are very useful for subsetting. They are much more convenient and flexible than the approaches we have used above. For example, it is quite easy to subset by both *rows* and *columns* with them. The difference between `.loc[]` and `.iloc[]` is that `.iloc[]` subsets by row and column **indices** (integer positions) while `.loc[]` subsets by **name or condition**. We will most often use `.loc[]`.

In [32]:
#Get rows indexed 1, 2, and 3 and columns indexed 2, 3, 4, 5, and 6.
heart_transplant_df.iloc[1:4, 2:7]

Unnamed: 0,age,survived,survtime,prior,transplant
1,43,dead,2,no,control
2,52,dead,2,no,control
3,52,dead,2,no,control


In [33]:
heart_transplant_df[["age", "survived", "survtime", "prior", "transplant"]][1:4]

Unnamed: 0,age,survived,survtime,prior,transplant
1,43,dead,2,no,control
2,52,dead,2,no,control
3,52,dead,2,no,control
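
The difference between position and name is easier to see when the row labels are not simply 0, 1, 2, and so on. A sketch, assuming a made-up frame `toy_df` whose rows are labeled with letters:

```python
import pandas as pd

toy_df = pd.DataFrame(
    {"age": [53, 43, 52, 30], "prior": ["no", "no", "yes", "no"]},
    index=["a", "b", "c", "d"],
)

# .iloc[] uses integer positions; the right endpoint is excluded
by_position = toy_df.iloc[1:3, 0:1]

# .loc[] uses labels; both endpoints are included
by_label = toy_df.loc["b":"c", ["age"]]

print(by_position)
print(by_label)
```

Both requests return the `age` column for the rows labeled `b` and `c`.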


In [34]:
#Get the "age", "survived", and "prior" columns, and print only the last 7 rows of the dataset.
heart_transplant_df.loc[(len(heart_transplant_df) - 7):  , ["age", "survived", "prior"]]

Unnamed: 0,age,survived,prior
96,45,alive,yes
97,53,dead,no
98,30,alive,no
99,48,alive,yes
100,40,alive,no
101,48,alive,no
102,33,alive,no


In [35]:
heart_transplant_df[["age", "survived", "prior"]].tail(n = 7)

Unnamed: 0,age,survived,prior
96,45,alive,yes
97,53,dead,no
98,30,alive,no
99,48,alive,yes
100,40,alive,no
101,48,alive,no
102,33,alive,no


Note that in the `.loc[]` code chunk above we requested rows from "7 less than the length of the data frame, all the way to the end of the data frame". Also, since we wanted multiple columns returned, we needed to use the list notation. If we wanted every column between `acceptyear` and `survtime`, we could use `"acceptyear":"survtime"` in place of the list; unlike ordinary slicing, `.loc[]` ranges include both endpoints.
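
A sketch of the column-range notation, using a small stand-in frame `toy_df` built from the first three rows of the heart transplant data printed earlier (with the last two columns left out to keep it short):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "id": [15, 43, 61],
    "acceptyear": [68, 70, 71],
    "age": [53, 43, 52],
    "survived": ["dead", "dead", "dead"],
    "survtime": [1, 2, 2],
    "prior": ["no", "no", "no"],
})

# All rows, every column from "acceptyear" through "survtime" (inclusive)
middle_cols = toy_df.loc[:, "acceptyear":"survtime"]
print(middle_cols)
```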

In addition to this prescribed subsetting we can subset based on a condition, either with `.loc[]` or with plain square brackets as in the cell below. For example, we can obtain all of those transplant recipients who had a prior transplant by requesting rows for which `heart_transplant_df["prior"] == "yes"` (see below, and note that this condition is case sensitive).

In [36]:
heart_transplant_df[heart_transplant_df["prior"] == "yes"]

Unnamed: 0,id,acceptyear,age,survived,survtime,prior,transplant,wait
30,100,74,35,alive,39,yes,treatment,38.0
60,94,73,43,dead,165,yes,treatment,4.0
62,90,73,52,dead,186,yes,treatment,160.0
74,58,71,47,dead,342,yes,treatment,21.0
79,80,72,46,alive,482,yes,treatment,26.0
81,76,72,52,alive,545,yes,treatment,46.0
82,64,72,48,dead,583,yes,treatment,32.0
92,50,71,45,dead,979,yes,treatment,83.0
93,46,71,48,dead,995,yes,treatment,2.0
95,49,71,36,alive,1141,yes,treatment,36.0


There are many boolean (T/F) operators which can be used for conditioning (subsetting or flow control). Some of the most common are below:
+ `a == b` a **test** for exact equality between `a` and `b`
+ `a > b`, `a < b`, `a >= b`, and `a <= b` behave as expected
+ `a in list` or `a in ["red", "green", "blue", "yellow"]` tests whether the value stored in the object `a` is contained in the corresponding list. To test every entry of a data frame column at once, use the `.isin()` method instead.
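
The operators above can be sketched on a made-up frame (the columns `color` and `size` and all result names are hypothetical). The `.isin()` method is the column-wise counterpart of `in`, and `&` / `|` combine conditions row by row.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "color": ["red", "purple", "blue", "green", "orange"],
    "size": [3, 8, 5, 1, 9],
})

# Plain Python: `in` tests a single value against a list
print("red" in ["red", "green", "blue", "yellow"])

# .isin() returns one True/False per row of the column
listed = toy_df[toy_df["color"].isin(["red", "green", "blue", "yellow"])]

# Combine conditions with | (or) and & (and); wrap each condition in parentheses
big_or_red = toy_df[(toy_df["size"] >= 8) | (toy_df["color"] == "red")]
small_and_blue = toy_df[(toy_df["size"] < 8) & (toy_df["color"] == "blue")]

print(listed)
print(big_or_red)
print(small_and_blue)
```

The parentheses matter: `&` and `|` bind more tightly than comparisons such as `>=`, so leaving them out typically raises an error.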

Use the code block below to subset the `heart_transplant_df` data frame so that you achieve each of the following (individually).
1. You obtain information on all recipients over 50 years old.
2. You obtain `age`, `survtime`, `prior`, and `wait` on all recipients who are still alive.
3. You obtain information on all recipients who are dead or (`|`) are at least 57 years old.
4. You obtain information on all recipients who are under 55 years old and (`&`) who had a prior heart transplant.

In [49]:
# Write and execute your code here.
heart_transplant_df.head()
heart_transplant_df[heart_transplant_df["age"] > 50]
heart_transplant_df[["age", "survtime", "prior", "wait", "survived"]][heart_transplant_df["survived"] == "alive"]
heart_transplant_df[(heart_transplant_df["survived"] == "dead") | (heart_transplant_df["age"] >= 57)]
heart_transplant_df[(heart_transplant_df["age"] < 55) & (heart_transplant_df["prior"] == "yes")]

Unnamed: 0,id,acceptyear,age,survived,survtime,prior,transplant,wait
30,100,74,35,alive,39,yes,treatment,38.0
60,94,73,43,dead,165,yes,treatment,4.0
62,90,73,52,dead,186,yes,treatment,160.0
74,58,71,47,dead,342,yes,treatment,21.0
79,80,72,46,alive,482,yes,treatment,26.0
81,76,72,52,alive,545,yes,treatment,46.0
82,64,72,48,dead,583,yes,treatment,32.0
92,50,71,45,dead,979,yes,treatment,83.0
93,46,71,48,dead,995,yes,treatment,2.0
95,49,71,36,alive,1141,yes,treatment,36.0


### Missing Values

We've covered a lot in this crash course so far, but we've left out a significant chunk -- how do we identify missing values? We can do that quite easily with the `.isna()` method. We'll combine `.isna()` with `.sum()` to identify where `NA` (missing) values occur in our dataset. There are lots of questions to consider when dealing with missing data.
+ Why is the data missing?
+ Does the missing data tell us something (i.e., is the missing data actually data itself)?
+ Does the missing data corrupt (make useless) the variable or possibly even dataset?
+ How should we treat the missing data? Leave it as missing? Remove rows containing missing data? Try to fill it in somehow?

We won't address any of those items here, but you will certainly encounter missing data and need to answer these questions throughout our course. Learning how to deal with missing data is critical, especially since most `sklearn` models will raise an error if asked to fit on variables with missing values!

In [51]:
heart_transplant_df.isna().sum()

id             0
acceptyear     0
age            0
survived       0
survtime       0
prior          0
transplant     0
wait          34
dtype: int64
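
To see why `.isna().sum()` counts missing values, here is a sketch on a made-up frame `toy_df`. The calls to `.dropna()` and `.fillna()` correspond to two of the options listed above and are shown only to illustrate the mechanics, not as a recommendation; the fill value `0` is an arbitrary placeholder.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "age": [53, 43, 52, 30],
    "wait": [5.0, float("nan"), 2.0, float("nan")],
})

# .isna() marks each cell True/False; .sum() counts the True values per column
missing_counts = toy_df.isna().sum()
print(missing_counts)

# Remove rows containing missing values
dropped = toy_df.dropna()

# Fill missing waits with a placeholder value
filled = toy_df.fillna({"wait": 0})

print(dropped)
print(filled)
```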

Now that we know *the basics of* how to work with data frames in python, we should learn how to gain insight from the information in these data frames.

## Analyzing Data

The `pandas` library comes with methods for computing useful information on data frames:
+ `.info()` provides information on the column structure of a data frame
+ `.describe()` computes summary statistics for numerical columns of a data frame

and also methods for computing summary statistics on individual columns:
+ `.mean()` for computing the arithmetic mean
+ `.median()` for computing the median (a better -- more robust -- measure of center in the case of outliers)
+ `.std()` for computing standard deviation
+ `.min()` and `.max()` for identifying the minimum and maximum values in a column
+ `.quantile([0.1, 0.25, 0.5, 0.75, 0.9])` computes quantile boundaries -- for example this call would compute the boundaries for the 10th, 25th, 50th, 75th, and 90th percentiles
+ `.value_counts()` for constructing a frequency table for categorical columns.

See the `pandas` cheatsheet [here](http://datacamp-community-prod.s3.amazonaws.com/dbed353d-2757-4617-8206-8767ab379ab3) for more.
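
A few of these methods in action, as a sketch on a small made-up frame `toy_df` (the ages match the first five rows printed earlier; the `prior` values are invented so that the frequency table has two categories):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "age": [53, 43, 52, 52, 54],
    "prior": ["no", "no", "yes", "no", "no"],
})

mean_age = toy_df["age"].mean()
median_age = toy_df["age"].median()
age_quartiles = toy_df["age"].quantile([0.25, 0.5, 0.75])
prior_counts = toy_df["prior"].value_counts()

print(mean_age, median_age)
print(age_quartiles)
print(prior_counts)

# Summary statistics for every numerical column at once
print(toy_df.describe())
```

Here the single low age of 43 pulls the mean below the median, which is the robustness to outliers mentioned above.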

In [56]:
#Compute the 5th, 25th, 50th, 75th, and 95th percentiles of age (not displayed, since it is not the last statement).
#Then compute the proportion of recipients falling in each category of the prior column.
heart_transplant_df["age"].quantile([0.05, 0.25, 0.5, 0.75, 0.95])
heart_transplant_df["prior"].value_counts()/len(heart_transplant_df)

no     0.883495
yes    0.116505
Name: prior, dtype: float64

In [None]:
#Compute general summary statistics for all numerical columns.


Use the code block below to find:
1. The average survival time across all recipients
2. The average survival time for those recipients who have died
3. The minimum and maximum wait times as well as the cutoffs for the 10th and 90th percentiles of wait times.
4. A frequency table for the survived and transplant columns.
5. Execute `pd.crosstab(heart_transplant_df["survived"], heart_transplant_df["transplant"])` and explain its output.

In [None]:
#Experiment here


## Summary

Okay -- that's plenty for this notebook. You were exposed to some Python basics, including atomic data types (booleans, strings, numerics, etc.), lists, arrays, dictionaries, and data frames. In analytics we'll most often be working with data frames, lists, and series but it is useful to know about these other structures since dealing with them is sometimes required. The most important items you learned in this notebook were how to subset both rows and columns of data frames -- in particular, subsetting based on conditions. Your next notebook will give you more opportunity to work with data frames and will discuss some of this material in more detail.