# **Python for Data Science**

# Part -3: This is a Jupyter Notebook

Jupyter Notebooks are a popular alternative to a traditional Integrated Development Environment (IDE). Although they have their [detractors](https://twitter.com/joelgrus/status/1033035196428378113), they're generally a very good tool for teaching, displaying code and analysis, and interacting with a running Python process.

In [None]:
# They allow you to write code in one cell:
2 + 2 == False

And write text in the "Markdown" text formatting language in the next cell. 
```python
this looks like code, but it isn't!
```
[this is a hyperlink](https://xkcd.com/353/)
###### This is a small heading

### Some important things:
- Make sure your code cells are actually code cells, not Markdown cells.
- Figure out the difference between Edit Mode (green) and Command Mode (blue)
- Pay attention to code cell execution numbers and execution order
- Learn how to Restart & Run All

Handy shortcuts:
- Shift + enter = run cell and move to next
- Control + enter = run cell and stay at that cell

Command Mode shortcuts:
- b = Add a new cell
- Shift-click = Select multiple cells
- Shift-M = Merge cells
- Esc-y = Make cell code cell
- Esc-m = Make cell markdown cell
- Control-Shift-minus = Split cell at cursor

Edit Mode / Code shortcuts:
- Tab = Autocomplete
- Shift-tab = Bring up documentation (can press multiple times for *more* documentation)


# Part -2: Python 101 Review

You should be familiar with the basics of Python, including:
- How to call a function
- How to declare variables and assign function results to variables
- How to access elements of a list and a dictionary
- The difference between functions, methods, and attributes (??)

You may also know how to:
- Define a function
- Implement control flow structures like `if`/`else` statements
- Make use of iteration in the form of `for` loops or list comprehensions

But we won't really be making use of those concepts today.

In [None]:
x = [3, 1, 2, 4]

In [None]:
# Take the sum of x

# Assign the result to y

In [None]:
y = {'Cohen':True, 'Flynn':True, 'Manafort':True, 'Giuliani':False}

In [None]:
# Access the value for 'Flynn'

# Change the value for 'Giuliani' to True


# Part -1: Some Background on Numpy
#### Methods and Attributes

This is adapted from [a notebook](Numpy - Pandas Ontology.ipynb) I wrote introducing Numpy and Pandas from a more fundamental perspective.

Everything is an *object* in Python. Objects may have attributes and methods:
- Attributes are accessed without parentheses, because they take no arguments. You can think of them as looking up a pre-defined characteristic of that object, so generally no computation is required.
- Methods are functions that belong to an object. When you call a method on an object, you are applying that function to the object, possibly with some additional argument(s). Thus, they are called with parentheses.
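
As a small sketch of the distinction, here are an attribute and two methods on built-in Python objects. The variable names (`z`, `letters`, and so on) are made up for illustration:

```python
# A complex number is an ordinary Python object with both attributes and methods.
z = complex(3, 4)

real_part = z.real          # attribute: no parentheses, just looks up a stored value
conjugate = z.conjugate()   # method: parentheses, runs a function on the object

# Lists have methods too; 'count' takes one additional argument.
letters = ['a', 'b', 'a']
n_a = letters.count('a')

print(real_part, conjugate, n_a)
```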

Place your cursor to the right of `x.` below and press `Tab`. You should see a list of available attributes and methods appear.

In [None]:
x.

In [None]:
# Apply the 'sort' method to x:


#### Numpy

Numpy stands for 'numerical python,' and it was one of the first tools to turn Python into a data science programming language.

In [None]:
# Import the library:
import numpy as np

Numpy's main contribution to the Python world is the *array*. Arrays are multi-dimensional collections of elements of the same type. Almost always the objects in an array are numbers. You can create an array from a list of Python lists:

In [None]:
A = np.array([[1,2,3],[4,5,6]])
A

Or you can create an array using a Numpy function, such as np.full:

In [None]:
# The first argument specifies the dimensions, the second tells Numpy what to fill the array with:
# This array is two-dimensional. Most arrays you see will be two dimensional, but remember that an
# array can contain data in any number of dimensions.
np.full((3,3), 8)

In [None]:
# Note that A is an array:
type(A)

In [None]:
# But the dtype of A is numeric:
A.dtype

# This is because the 'dtype' attribute of an array tells us the data type of the elements *inside* that array.
# Moving on, it will be helpful to remember the difference between the type of an object and the data type
# of the elements it contains.

Arrays have their own attributes and methods. You've already seen the dtype attribute.

In [None]:
# The mean method returns the average of the numbers inside the array:
A.mean()

In [None]:
# The sum method returns the sum:
A.sum()

In [None]:
B = np.array([[0,1,0],[2,2,2]])
B

In [None]:
# Arrays can be added:
A + B

In [None]:
# And multiplied:
A * B

These operations are *element-wise*. That means that when we multiply two arrays, we are just multiplying all the corresponding elements from each array.

You can't do this with Python lists!

In [None]:
[1, 2, 3] + [1, 2, 3] # Not what we wanted.

Arrays have a certain number of dimensions, or *axes*:

In [None]:
A.ndim # A is two-dimensional. It has rows and columns.

In [None]:
A.shape # A has 2 rows and 3 columns.

Just like Python lists, arrays can be indexed:

In [None]:
x[0] # the first element of the list x (1, once x has been sorted above)

In [None]:
# The first element of A is just a smaller array representing the first row of A:
A[0] 

In [None]:
# We index the array twice (if it is two-dimensional) to get to a particular element
A[0][0]
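
Numpy also accepts a row and a column inside a single pair of square brackets, which is the more common idiom. A minimal sketch, reusing the same `A` as above (the other variable names are illustrative):

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])

first = A[0][0]        # index the row, then the element within that row
same = A[0, 0]         # one indexing step with (row, column)
second_col = A[:, 1]   # every row, column 1

print(first, same, second_col)
```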

In addition to arrays, Numpy has almost all statistical functions you may have heard of:

In [None]:
np.mean([1,2,3])

In [None]:
np.std([1,3,7])

Indeed, Numpy is the library we use for most of these simple functions, even if we are operating on a list:

In [None]:
sum(x)/len(x)

In [None]:
np.mean(x)

#### Pandas

Pandas is a data manipulation library built on top of Numpy. Almost all of the time, you will use Pandas to store and interact with your data.

While Numpy gives us arrays to work with, Pandas provides Dataframes. A Dataframe is the basic tabular data structure you'll use in this course. Since they have rows and columns, they are always two-dimensional. Typically you will read in data from some source, like a CSV. But we can also construct them from scratch:

In [None]:
import pandas as pd

df = pd.DataFrame([[1,2,'A'],[3,4,'B'],[5,6,'C']])
df

In [None]:
# Compare the dataframe above with an array constructed in the same way:
np.array([[1,2,'A'],[3,4,'B'],[5,6,'C']])

There are a few differences you should notice right away. First, Jupyter Notebook displays Dataframes really nicely, whereas arrays are just shown as a list of lists.

Second, dataframes can contain elements of more than one type. Notice that Numpy converts all of the elements of the array to strings, while a Dataframe can hold a different data type in each column.

Lastly, the Dataframe is *labelled*. Each row and column has a label, or index. **Dataframes can be thought of as labeled two-dimensional Numpy arrays.**
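
To make the "labeled array" idea concrete, here is a minimal sketch that wraps an array in a Dataframe and reads the same element by position and by label. The labels (`'a'`, `'left'`, etc.) and variable names are invented for this example, and `.loc` is the Pandas accessor for label-based lookup:

```python
import numpy as np
import pandas as pd

values = np.array([[1, 2], [3, 4], [5, 6]])

# Same numbers, now with a label on every row and column.
labeled = pd.DataFrame(values, index=['a', 'b', 'c'], columns=['left', 'right'])

by_position = values[1, 0]              # second row, first column of the array
by_label = labeled.loc['b', 'left']     # row 'b', column 'left' of the dataframe

print(by_position, by_label)
```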

Let's play with some of these features of dataframes.

In [None]:
# Because Pandas is based on Numpy, many of the methods are the same or similar.
# In Numpy .dtype returns the type of the elements in the array.
# In Pandas, .dtypes returns the data type of each column:
df.dtypes
# This will be one of the Pandas commands you use most often

In [None]:
# You can examine the shape of a dataframe, just like you would with an array:
df.shape

In [None]:
# Since dataframes are just labeled arrays, we can return the array
# that a dataframe is built on top of:
df.values

In [None]:
# The column labels of a dataframe can be accessed:
df.columns

In [None]:
# And changed:
df.columns = ['Column One','Column Two','Column Three']
df

In [None]:
# The row labels of a dataframe are the index:
df.index
# Typically the index just counts up from 0, unless you've rearranged your data.

In [None]:
# You can assign a new index to the data:
df.index = [3,4,5]

In [None]:
# Or you can choose a pre-existing column to be the index:
df.set_index('Column One')

# Side note about operations occurring in place:
# Notice that this operation hasn't occurred 'in place.' Some Pandas
# operations occur in place, and others don't. You'll just have to remember
# which is which. 
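
A small sketch of what "not in place" means in practice, using a throwaway dataframe called `demo` (a made-up name, so the `df` above is left alone). Calling `set_index` returns a new dataframe; to keep the change you assign the result back:

```python
import pandas as pd

demo = pd.DataFrame({'Column One': [1, 3, 5], 'Column Two': [2, 4, 6]})

demo.set_index('Column One')           # returns a new dataframe; demo is unchanged
unchanged_cols = list(demo.columns)

demo = demo.set_index('Column One')    # assign the result back to keep the change
changed_cols = list(demo.columns)

print(unchanged_cols, changed_cols)
```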

In [None]:
# To select a certain column, just pass the name of the column to the square brackets which we 
# usually use for subsetting and indexing. If you want more than one column, pass a list of column names.

# This should look like accessing a value from a dictionary. In a way, DataFrames are dictionaries
# where the key is the column name and the value is the Series of data in the column.
df['Column One']

A Pandas series is just a one-dimensional collection of data. When you're working with dataframes, always remember that your columns are Pandas series:

In [None]:
type(df['Column One'])

As you work with Pandas, remember that certain methods apply to series, and some to dataframes. For example, the methods dealing with setting the index operate on the entire dataframe.

In [None]:
# Take the mean of this column
df['Column One'].mean()

Pandas knows there are a lot of functions we might want to apply to all the columns, though, so it lets us just apply them directly to the dataframe:

In [None]:
df.max()

In [None]:
df.mean(numeric_only=True)   # skip the text column, which has no mean

Other times we might want to create new columns from our existing columns. We can work with the columns directly as series:

In [None]:
df['Column One'] * df['Column Two']

# Part 0: Python for Data Science

We'll be working with a few key datasets throughout this session.

- MovieLens 100k movie rating data:
    - main page: http://grouplens.org/datasets/movielens/
    - data dictionary: http://files.grouplens.org/datasets/movielens/ml-100k-README.txt
- WHO alcohol consumption data:
    - article: http://fivethirtyeight.com/datalab/dear-mona-followup-where-do-people-drink-the-most-beer-wine-and-spirits/
    - original data: https://github.com/fivethirtyeight/data/tree/master/alcohol-consumption
    - original data from WHO: http://apps.who.int/gho/data/node.gisah.A1039?lang=en&showonly=GISAH
- National UFO Reporting Center data:
    - main page: http://www.nuforc.org/webreports.html


In [None]:
# the pandas library
import pandas as pd

## Reading Files, Selecting Columns, and Summarizing


In [None]:
# read in directly from the file
users = pd.read_table('u.user')

In [None]:
users

In [None]:
# read 'u.user' into 'users'
users = pd.read_table('https://raw.githubusercontent.com/josephofiowa/DAT8/master/data/u.user', sep='|', index_col='user_id')

In [None]:
# examine the users data
users                   # display the DataFrame (long output is truncated to the first and last rows)

In [None]:
type(users)             # DataFrame

In [None]:
users.head()            # print the first 5 rows

In [None]:
users.head(10)          # print the first 10 rows

In [None]:
users.tail()            # print the last 5 rows

In [None]:
users.index             # "the index" (aka "the labels")

In [None]:
users.columns           # column names (which is "an index")

In [None]:
users.dtypes            # data types of each column

In [None]:
users.shape             # number of rows and columns

In [None]:
users.values            # underlying numpy array

#### Select a column

What's the difference between these two ways of accessing a single column?

In [None]:
users['gender']
users.gender            # select one column using the DataFrame attribute
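
One way to see the difference, sketched on a made-up `people` dataframe (the data and column names are illustrative, not taken from `u.user`): both forms return the same Series for a simple column name, but only the bracket form works when the name contains a space or clashes with an existing DataFrame method.

```python
import pandas as pd

people = pd.DataFrame({
    'gender': ['M', 'F'],
    'zip code': ['85711', '94043'],
    'count': [1, 2],
})

same = people['gender'].equals(people.gender)   # identical Series for a simple name
zips = people['zip code']                       # a name with a space needs brackets
counts = people['count']                        # the column named 'count'
method = people.count                           # the DataFrame.count method, not the column

print(same, callable(method))
```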

#### Summarize (describe) the DataFrame

In [None]:
users.describe()                    # describe all numeric columns

In [None]:
users.describe(include=['object'])  # describe all object columns

In [None]:
users.describe(include='all')       # describe all columns

#### Summarize a Series

In [None]:
users.gender.describe()             # describe a single column

In [None]:
users.age.mean()                    # only calculate the mean

### Exercise One

#### Read drinks.csv into a DataFrame called 'drinks'
Data: https://raw.githubusercontent.com/josephofiowa/GA-DSI/master/example-lessons/plotting-with-pandas/drinks.csv

In [None]:
drinks = 

#### Print the head and the tail


#### Examine the default index, data types, and shape



#### Print the 'beer_servings' Series


#### Calculate the mean 'beer_servings' for the entire dataset

#### Count the number of unique occurrences of each 'continent' value and see if it looks correct

In [None]:
drinks.continent.unique()

In [None]:
drinks.continent.value_counts()

#### BONUS: display only the number of rows of the 'users' DataFrame

#### BONUS: display the 3 most frequent occupations in 'users'


## Filtering and Sorting

#### Boolean filtering: only show users with age < 20

In [None]:
young_bool = users.age < 20         # create a Series of booleans...

In [None]:
young_bool

In [None]:
type(young_bool)

In [None]:
users[young_bool]                   # ...and use that Series to filter rows

In [None]:
users[users.age < 20]               # or, combine into a single step

In [None]:
users[users.age < 20].occupation    # select one column from the filtered results

In [None]:
users[users.age < 20].occupation.value_counts()     # value_counts of resulting Series

#### Boolean filtering with multiple conditions

In [None]:
users[(users.age < 20) & (users.gender=='M')]       # ampersand for AND condition

In [None]:
users[(users.age < 20) | (users.age > 60)]          # pipe for OR condition
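
The parentheses around each condition matter: `&` and `|` bind more tightly than comparisons like `<`, so leaving them out changes the meaning and usually raises an error. A minimal sketch on a made-up `people` dataframe (illustrative data, not `users`), which also shows `~` for NOT:

```python
import pandas as pd

people = pd.DataFrame({'age': [17, 25, 64, 19], 'gender': ['M', 'F', 'M', 'F']})

young_men = people[(people.age < 20) & (people.gender == 'M')]   # AND
young_or_old = people[(people.age < 20) | (people.age > 60)]     # OR
not_young = people[~(people.age < 20)]                           # NOT

print(young_men.index.tolist(), young_or_old.index.tolist(), not_young.index.tolist())
```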

#### Sorting

In [None]:
users.age.sort_values()             # sort a column

In [None]:
users.sort_values(by='age')                   # sort a DataFrame by a single column

In [None]:
users.sort_values(by='age', ascending=False)  # use descending order instead

### Exercise Two

#### Filter 'drinks' to only include European countries

#### Filter 'drinks' to only include European countries with wine_servings > 300


#### Calculate the mean 'beer_servings' for all of Europe

#### Determine which 10 countries have the highest total_litres_of_pure_alcohol


#### BONUS: sort 'users' by 'occupation' and then by 'age' (in a single command)

#### BONUS: filter 'users' to only include doctors and lawyers without using a |

#### Hint: read the pandas.Series.isin documentation

## Renaming, Adding, and Removing Columns

#### Rename one or more columns

In [None]:
drinks.rename(columns={'beer_servings':'beer', 'wine_servings':'wine'})

In [None]:
drinks.head()

In [None]:
drinks.rename(columns={'beer_servings':'beer', 'wine_servings':'wine'}, inplace=True)

#### Replace all column names

In [None]:
drink_cols = ['country', 'beer', 'spirit', 'wine', 'liters', 'continent']

In [None]:
drinks.columns = drink_cols

In [None]:
drinks.head()

#### Replace all column names when reading the file

In [None]:
drinks = pd.read_csv('drinks.csv', header=0, names=drink_cols)

#### Add a new column as a function of existing columns

In [None]:
drinks['servings'] = drinks.beer + drinks.spirit + drinks.wine

In [None]:
drinks['mL'] = drinks.liters * 1000

In [None]:
drinks.head()

#### Removing columns

In [None]:
drinks.drop('mL', axis=1)                               # axis=0 for rows, 1 for columns

In [None]:
drinks.drop(['mL', 'servings'], axis=1, inplace=True)   # drop multiple columns

## Handling Missing Values

#### Missing values are usually excluded in methods by default

In [None]:
drinks.continent.value_counts()              # excludes missing values

In [None]:
drinks.continent.value_counts(dropna=False)  # includes missing values

#### Find missing values in a Series

In [None]:
drinks = pd.read_csv('https://raw.githubusercontent.com/josephofiowa/GA-DSI/master/example-lessons/plotting-with-pandas/drinks.csv')

In [None]:
drinks.continent.notnull()          # True if not missing

In [None]:
drinks.continent.isnull()          # True if missing

#### Use a boolean Series to filter DataFrame rows

In [None]:
drinks[drinks.continent.isnull()]   # only show rows where continent is missing

In [None]:
drinks[drinks.continent.notnull()]  # only show rows where continent is not missing

#### Side note: understanding axes

In [None]:
drinks.sum(numeric_only=True)            # sums "down" the 0 axis (rows)

In [None]:
drinks.sum(axis=0, numeric_only=True)    # equivalent (since axis=0 is the default)

In [None]:
drinks.sum(axis=1, numeric_only=True)    # sums "across" the 1 axis (columns)
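
The two axes are easier to see on a table small enough to add up by hand. A minimal sketch with a made-up two-row dataframe called `small`:

```python
import pandas as pd

small = pd.DataFrame({'beer': [10, 20], 'wine': [1, 2]})

col_totals = small.sum(axis=0)   # collapses the rows: one total per column
row_totals = small.sum(axis=1)   # collapses the columns: one total per row

print(col_totals.to_dict(), row_totals.tolist())
```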

#### Side note: adding booleans

In [None]:
pd.Series([True, False, True])          # create a boolean Series

In [None]:
pd.Series([True, False, True]).sum()    # converts False to 0 and True to 1

#### Find missing values in a DataFrame

In [None]:
drinks.isnull()             # DataFrame of booleans

In [None]:
drinks.isnull().sum()       # count the missing values in each column

#### Drop missing values

In [None]:
drinks.dropna()             # drop a row if ANY values are missing

In [None]:
drinks.dropna(how='all')    # drop a row only if ALL values are missing
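
To see the difference between the two settings, here is a sketch on a made-up three-row dataframe called `toy`, where one row is complete, one is partly missing, and one is entirely missing:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan],
    'b': ['x', 'y', np.nan],
})

any_dropped = toy.dropna()            # keeps only the fully populated first row
all_dropped = toy.dropna(how='all')   # drops only the last row, where everything is missing

print(len(any_dropped), len(all_dropped))
```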

#### Fill in missing values

In [None]:
drinks['continent'] = drinks.continent.fillna(value='NA')   # fill in missing values with 'NA' and assign the result back

In [None]:
drinks.continent.value_counts()

#### Turn off the missing value filter

In [None]:
drinks = pd.read_csv('https://raw.githubusercontent.com/josephofiowa/GA-DSI/master/example-lessons/plotting-with-pandas/drinks.csv', header=0, names=drink_cols, na_filter=False)

### Exercise Three

#### Read ufo.csv into a DataFrame called 'ufo'
https://raw.githubusercontent.com/josephofiowa/DAT8/master/data/ufo.csv


#### Check the shape of the DataFrame

#### Calculate the most frequent value for each of the columns (in a single command)

#### What are the four most frequent colors reported?

#### For reports in VA, what's the most frequent city?

#### Show only the UFO reports from Arlington, VA

#### Count the number of missing values in each column

#### Show only the UFO reports in which the City is missing

#### How many rows remain if you drop all rows with any missing values?

#### Replace any spaces in the column names with an underscore

#### BONUS: redo the task above, writing generic code to replace spaces with underscores
In other words, your code should not reference the specific column names

#### BONUS: create a new column called 'Location' that includes both City and State
For example, the 'Location' for the first row would be 'Ithaca, NY'

## Split-Apply-Combine
Diagram: http://i.imgur.com/yjNkiwL.png
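
As a rough sketch of what the three steps mean, the loop below does the split, apply, and combine by hand on a made-up `toy` dataframe and then checks that `groupby` gives the same answer in one line. The data and variable names are illustrative only:

```python
import pandas as pd

toy = pd.DataFrame({
    'continent': ['EU', 'EU', 'AS', 'AS'],
    'beer': [100, 200, 10, 30],
})

by_hand = {}
for continent, group in toy.groupby('continent'):   # split: one sub-dataframe per continent
    by_hand[continent] = group.beer.mean()          # apply: compute the mean for that group
combined = pd.Series(by_hand)                       # combine: collect the results

grouped = toy.groupby('continent').beer.mean()      # the same three steps in one line

print(combined.to_dict(), grouped.to_dict())
```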

#### For each continent, calculate the mean beer servings

In [None]:
drinks.groupby('continent').beer.mean()

#### For each continent, count the number of occurrences

In [None]:
drinks.continent.value_counts()

#### For each continent, describe beer servings

In [None]:
drinks.groupby('continent').beer.describe()

#### Similar, but outputs a DataFrame and can be customized

In [None]:
drinks.groupby('continent').beer.agg(['count', 'mean', 'min', 'max'])

In [None]:
drinks.groupby('continent').beer.agg(['count', 'mean', 'min', 'max']).sort_values(by='mean')

#### If you don't specify a column to which the aggregation function should be applied, it will be applied to all numeric columns

In [None]:
drinks.groupby('continent').mean(numeric_only=True)   # numeric_only skips text columns such as 'country'

In [None]:
drinks.groupby('continent').describe()

### Exercise Four

#### For each occupation in 'users', count the number of occurrences

#### For each occupation, calculate the mean age

#### BONUS: for each occupation, calculate the minimum and maximum ages

#### BONUS: for each combination of occupation and gender, calculate the mean age