# Python Tutorial - Getting Started with Data Analytics

* Mark Andersen
* Elizavet Fakou

July 2019

## Goals

* Introduce use of pandas with python
* Get comfortable with Jupyter notebooks
* Hands-on workshop that does not require prior Python training


This particular session "dives right in".  Another session (Python Data Structures with Pandas) gives a more gradual introduction to functions, lists and other parts of the Python language.


## Prerequisites: 
*  Anaconda for Python 3.7 (https://bit.ly/2IaPVPb)
*  Data: download the file imdb_titles_reduced.csv (http://bit.ly/2Df11i6) into the same directory as this .ipynb file.



## Why Jupyter Notebooks?

### Environment Options for Python Programming

* Command line
* Integrated Development Environment (IDE) - Spyder
* Notebooks - Jupyter Notebook, Jupyter Lab.  Use whichever you find more comfortable:
  * Jupyter Notebook has been around for some time and uses one browser tab per notebook.
  * Jupyter Lab is the newer interface and keeps all of your notebooks as tabs inside a single browser tab.

Data scientists primarily use notebooks because the code and its output are kept together.  Others can review your notebook and its output without re-running your code, which matters because reviewers may not have the time or the data to re-run it.  Keeping code and output side by side speeds up both learning and code review.

### How does Jupyter Notebook / Jupyter Lab work?

When you start the Jupyter environment it can run multiple interactive Python notebooks (.ipynb).  Each notebook, once started, has its own Python session running behind it.  You restart a notebook's Python session by "restarting the kernel".

#### Notebooks are more like SQL Management Studio than other Development Environments

With a notebook, the user should be aware of the order in which the cells (groups of commands) have been run:
* Notebooks build up state (variables) as you run cells.
* Notebooks have no direct sense of the order of cells.  The state (variables) depends on the order in which *you* run the cells in the notebook.
* Jupyter Notebook lets you run cells in any order.  Running them from top to bottom is up to you.
* If you do not run a cell, then the code has not executed yet!

Therefore:
* If you run cells out of order you might get different results
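
The effect of run order can be sketched in plain Python. This is only an illustration, not how Jupyter is implemented: each hypothetical function below stands in for one notebook cell, and the `state` dictionary plays the role of the kernel's variables.

```python
# Each function stands in for one notebook cell (names are hypothetical).
def cell_1(state):
    state["total"] = 10

def cell_2(state):
    state["total"] = state["total"] * 2

def cell_3(state):
    state["total"] = state["total"] + 1

# Run the "cells" top to bottom: 10 -> 20 -> 21
in_order = {}
for cell in (cell_1, cell_2, cell_3):
    cell(in_order)

# Run the same "cells" in a different order: 10 -> 11 -> 22
out_of_order = {}
for cell in (cell_1, cell_3, cell_2):
    cell(out_of_order)

print(in_order["total"])      # 21
print(out_of_order["total"])  # 22
```

The same three cells give two different answers, which is why a kernel restart followed by a top-to-bottom run is the safest way to check a notebook.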

If you are confused about python state as you work through a Notebook, you have a remedy:
* Use the menu option *Kernel - Restart* to restart the entire python process, wiping out all variables, and giving you a fresh start.

#### Markdown Cells

There are multiple cell types, and you will edit code cells in this tutorial.

The comments you encounter between code cells are *markdown* cells (see drop down top center for type).  Markdown cells can also be highlighted to "run".  When you run a markdown cell it interprets the markdown, replacing it with the resulting output.  You can double click on a markdown cell to edit it, and then hit the Run button to convert back to markdown output.

### Exercise: Double click on this cell -- this text -- right now to see the markdown source and then click the "Run" button to put it back as markdown output. 

<hr>

### python: import statements

Typically the top of every module has import statements for the modules it depends on:

* import os: import the os module (operating system functions)
* import pandas as pd: import the pandas module but call it "pd" (conventional shorthand for pandas) for readable code.

In [1]:
#
# Typical first lines of a python program are import statements:
#
import os
import pandas as pd

# Click "Run" on this cell to do the imports

### pandas: read_csv (or read_excel)

We are diving right into pandas, the most important module for data analysts in python.

Functions in python are invoked with ().  Parameters are passed in the parentheses.

read_csv (or read_excel) are pandas functions which create a DataFrame from a comma-separated values (csv) file (or Excel file).
* pd.read_csv(filename): calls read_csv in the module pandas (pd) and passes in a single parameter specifying the filename

A DataFrame is the core data structure found in pandas:

* A *DataFrame* is like a SQL table, with varying types for its columns.
* Each column of a pandas DataFrame is a pandas *Series*.
* The DataFrame *info()* command provides names, types and counts for columns. 
  * Note about counts: the count adds up non-NULL records, so when a column has NULLs the count will be reduced.

***info()*** is the most common command for getting information about a data frame.
* imdb_df.info(): calls the info() function with no parameters (empty parentheses) on the object imdb_df. 
  * Note about notation: A function may be called on an object, in this case the imdb_df DataFrame object.  
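
To make the DataFrame, Series and non-NULL count ideas concrete, here is a minimal sketch on a tiny hand-made table. The data and the names `films_df`, `title`, `year` and `genre` are made up for illustration and are not part of the IMDb file.

```python
import pandas as pd

# Hypothetical toy data, not the IMDb file used in this tutorial.
films_df = pd.DataFrame({
    "title": ["Alpha", "Beta", "Gamma"],
    "year": [1999, 2005, 2012],
    "genre": ["Drama", None, "Comedy"],  # one missing (NULL) value
})

# Selecting a single column gives back a pandas Series.
genre_column = films_df["genre"]
print(type(genre_column))

# info() prints column names, types and non-null counts.
films_df.info()

# count() returns the same non-null counts as values you can use in code.
non_null_counts = films_df.count()
print(non_null_counts)
```

The `genre` column reports one fewer non-null value than the other columns because of the missing entry.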


In [2]:
# Run this code
#
# This code assumes you have placed the file in the same directory that this
# .ipynb file is running in. If you place it elsewhere you will need to include the path
# and on Windows you need to double the backslashes in the path.
#
# i.e.  e:\\temp\\imdb_titles_reduced.csv 
#

#
# filename is a *variable* we create here and assign a string representing the filename:
filename = "imdb_titles_reduced.csv"

# imdb_df is another variable.  It is assigned the result of making the function call on the right side
# The pd prefix refers to the pandas module imported above.
# The read_csv function exists in the pandas module and requires a filename for the csv file to read:
imdb_df = pd.read_csv(filename)  
imdb_df.info()


# 
# While this cell runs, which may take about 5 seconds, you will see an asterisk on the left side of Jupyter in the brackets.
# 
# When completed the asterisk is replaced by a number, for example [3] if it is the third cell you have run
#

# Jupyter displays the value of the last expression in a cell.
# info() is a special case: it prints its summary itself and returns None, so no print() is needed here.
# If you need output from several lines in a cell, be sure to wrap print() around the lines you want to print.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4312405 entries, 0 to 4312404
Data columns (total 4 columns):
primaryTitle    object
isAdult         int64
startYear       int64
genres          object
dtypes: int64(2), object(2)
memory usage: 131.6+ MB
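
The comment about Windows paths in the cell above can be checked with a short sketch. The `e:\temp` folder is just the example location from that comment; nothing is read from disk here.

```python
# Three ways to write the same Windows path as a Python string.
doubled = "e:\\temp\\imdb_titles_reduced.csv"   # backslashes doubled
raw = r"e:\temp\imdb_titles_reduced.csv"        # raw string: backslashes taken literally
forward = "e:/temp/imdb_titles_reduced.csv"     # forward slashes are also accepted on Windows

print(doubled)          # e:\temp\imdb_titles_reduced.csv
print(doubled == raw)   # True
```

Without doubling or the `r` prefix, Python would read `\t` in `"e:\temp"` as a tab character, which is why single backslashes in an ordinary string are risky.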


### describe()

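describe() can be called on a particular column of a DataFrame (a Series) to get summary statistics: min, max, median, etc. for a numeric column, or the count, number of unique values and most frequent value for a text column such as genres.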
* imdb_df.genres.describe(): calls the describe() function on the genres column of the imdb_df dataframe

In [None]:
# Run this:
imdb_df.genres.describe()
# Again, Jupyter will print results because this is the last command in the cell.  Otherwise wrap in print(xxx)
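
As a small sketch of what describe() reports, the example below calls it on one text column and one numeric column. The table `toy_df` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy_df = pd.DataFrame({
    "genre": ["Drama", "Comedy", "Drama", "Drama"],
    "year": [1999, 2005, 2012, 2012],
})

# Text column: count, unique, top (most frequent value) and freq.
text_summary = toy_df.genre.describe()
print(text_summary)

# Numeric column: count, mean, std, min, quartiles (50% is the median) and max.
number_summary = toy_df.year.describe()
print(number_summary)
```

The result of describe() is itself a Series, so individual statistics can be looked up by name, for example `number_summary["min"]`.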

### value_counts()

value_counts() can be called on a Series to find out the frequency distribution, especially useful for categorical or binary variables.
* imdb_df.genres.value_counts(): calls the value_counts() function on the genres column of the imdb_df dataframe

We use ***value_counts()*** frequently and in exercises you will be asked to specify it.


Beware, it must be specified on a particular column (Series).

In [None]:
# Run this:
imdb_df.genres.value_counts()
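
For a quick feel of the output, here is value_counts() on a small made-up Series (the values are illustrative, not taken from the IMDb data).

```python
import pandas as pd

# Hypothetical toy Series for illustration only.
genres = pd.Series(["Drama", "Comedy", "Drama", "Horror", "Drama", "Comedy"])

# Counts per distinct value, most frequent first.
counts = genres.value_counts()
print(counts)

# normalize=True returns proportions instead of counts.
shares = genres.value_counts(normalize=True)
print(shares)
```

The result is a Series indexed by the distinct values, so `counts["Drama"]` gives the count for one category.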

### Pandas slicing is used to select a portion of a dataframe.

Slicing is analogous to a where clause in SQL.

Use slices to obtain part of the original dataframe.

Unlike a SQL view, though, the result is not a live window onto the original: selecting rows with a condition returns a new dataframe holding only the matching rows, and the original dataframe is left unchanged.

Note the syntax:
*  Put the "where" restriction in square brackets
*  You must refer to the dataframe name again in the brackets (imdb_df.genres).
*  We are using the | operator as "or" since we want films which are either Comedy or Drama.  Use & for "and".
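
The syntax rules above can be tried on a tiny made-up table before applying them to the IMDb data. The names `toy_df`, `genre` and `rating` are hypothetical and only serve this sketch.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy_df = pd.DataFrame({
    "genre": ["Drama", "Comedy", "Horror", "Drama"],
    "rating": [7.1, 6.4, 5.0, 8.3],
})

# A comparison on a column produces a Series of True/False values, one per row.
is_drama = toy_df.genre == "Drama"
print(is_drama)

# "or": rows where either condition holds.
drama_or_horror = toy_df[(toy_df.genre == "Drama") | (toy_df.genre == "Horror")]

# "and": rows where both conditions hold.
good_drama = toy_df[(toy_df.genre == "Drama") & (toy_df.rating > 8)]

print(len(toy_df), len(drama_or_horror), len(good_drama))  # 4 3 1
```

The parentheses around each condition are needed because `|` and `&` bind more tightly than `==` and `>` in Python.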

### Exercise: Use info() to see how the new dataframe is smaller than the original



In [None]:
# Your first slice:
dramacom_df = imdb_df[(imdb_df.genres == 'Comedy') | (imdb_df.genres == 'Drama')]

# Add your code here


### Exercise: get the count of how many records there are per start year

Hint: Reuse the important command which was introduced earlier.

In [None]:
# Please edit this following line:
dramacom_df.startYear.your_code_here()

### Exercise: Use slicing to limit dramacom_df to films starting in the year 2000 or later.

Hint: Look at how we restricted to comedy and drama above

In [None]:
dramacom_df = dramacom_df[ # your code here ]

### Exercise: Check the value_counts() for the resulting slice

In [None]:
# Your code here

### Exercise: Use dataframe slicing to identify title and genre of the 2115 film

**We are puzzled by the 2115 film.  Let us see how to inspect the data for one row.**

Hint: we are going to slice again.  Feel free to look at how we selected drama and comedy again.

Since we are not assigning to a variable on the left side, jupyter will simply print the result.

In [None]:
dramacom_df[# your code here]

### Exercise: Limit dramacom_df to films starting before 2020, then check the counts

Hint: slicing yet again

Note: Before 2020, not including 2020.

In [None]:
# You are re-assigning the result to dramacom_df.
#
# So if you mess this line up you will need to run cells again from the first time it is assigned, above.
#
dramacom_df = dramacom_df[# your code here]



### crosstab()

The pandas crosstab function generates a dataframe (i.e. table) that counts how many rows fall into each combination of values from two columns.

In [None]:
# Run this:
pd.crosstab(dramacom_df.genres, dramacom_df.startYear)
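
To see what the cells of a crosstab mean, here is a minimal sketch on five made-up rows (the table `toy_df` and its values are illustrative only).

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy_df = pd.DataFrame({
    "genre": ["Drama", "Comedy", "Drama", "Comedy", "Drama"],
    "year": [2001, 2001, 2002, 2002, 2002],
})

# First argument becomes the rows, second argument becomes the columns.
table = pd.crosstab(toy_df.genre, toy_df.year)
print(table)

# Each cell is the number of rows with that genre/year combination.
print(table.loc["Drama", 2002])  # 2
```

The cell counts add up to the number of rows in the original table, here 5.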

### Crosstab: genre v. isAdult

Try a crosstab yourself here

In [None]:
# Your code here

## Matplotlib Module

Matplotlib is an important module for plotting.

Run the Jupyter magic command "%matplotlib inline" before plotting so that charts are displayed inside the notebook, directly below the cell that creates them.



In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
# Run this

### Exercise: copy crosstab to a dataframe and get the info() for the resulting dataframe

We want to copy a crosstab about genres and startYear to a dataframe for possible plotting.

We have wrapped the type() call in print() so it will still print after you add another line of code.
* print(x(y)): call the function x(y) and get its return value and then print what was returned from the x(y) function.

In [None]:
pdata_df = pd.crosstab(dramacom_df.genres, dramacom_df.startYear)
print(type(pdata_df))

# Get the types of columns in this dataframe pdata_df
# Your code here:


### Exercise: Rewrite the crosstab above to swap rows and columns.  Put the revised code below

Looking at the above columns, it seems that the rows and columns are swapped from the ideal.

In [None]:
# your code here


## plot.line()

plot.line() or plot.barh() or plot.bar() are among many functions available to plot pandas dataframes directly.

These functions are all built on top of the matplotlib plotting library.

In [None]:
# Run this
pdata_df.plot.line()

### --- End of Core Tutorial ---

## Further exercises

### Exercise: Save the plot to file

In [None]:
# Supply a filename (end in .png) to savefig and run this cell.  
# Be sure to put the filename in single or double quotes and beware if using \ characters anywhere
_ = pdata_df.plot.line()
plt.savefig(...)
plt.show()

### Exercise: Pickle the dataframe for later reading

Pickling in Python refers to serializing a full object, here the dataframe, and saving it to disk.

Later you can read in this file quite quickly.
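
Reading the file back is done with pd.read_pickle. The round trip can be sketched on a small made-up dataframe; the file name `toy.pkl` and the temporary directory are assumptions for this example so that nothing is left behind on disk.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data for illustration only.
toy_df = pd.DataFrame({"genre": ["Drama", "Comedy"], "n_films": [3, 2]})

# Write the dataframe to a pickle file in a throwaway directory, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pkl")
    toy_df.to_pickle(path)
    restored_df = pd.read_pickle(path)

# The restored dataframe has the same values, column names and types.
print(restored_df.equals(toy_df))  # True
```

Only read pickle files that you created yourself or that come from a source you trust, since loading a pickle can execute code.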


In [None]:
# Supply a filename (end in .pkl) and run this cell:
pdata_df.to_pickle(...)

### Exercise: Export the dataframe to Excel for sharing

In [None]:
# Supply a filename (end in .xlsx) and run this cell:
pdata_df.to_excel(...)

### Exercise: Make the chart larger

You will want to review the documentation at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html for parameters related to these steps.

The documentation link is for plot() not plot.line(), but I point you here since it has all the parameters.

plot.line() already takes care of x, y, and kind so you do not have to pass those in.

In the documentation there are many arguments.  You do **not** pass them all into the function.  You simply pass in the ones you want to modify from the default value in the documentation.

Python lets you name the argument with syntax like: 
    parameter_name = parameter_value
    
The above line simply overwrites the default value for the specified parameter.
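
Named arguments work the same way for any Python function. The sketch below uses a made-up function, `draw_chart`, which is not part of pandas or matplotlib, purely to show how one default is overridden while the rest are left alone.

```python
def draw_chart(data, title="Untitled", width=6, height=4):
    """Hypothetical stand-in for a plotting function with default values."""
    return f"{title}: {len(data)} points, {width}x{height}"

# No named arguments: every default is used.
default_call = draw_chart([1, 2, 3])
print(default_call)   # Untitled: 3 points, 6x4

# Name only the parameter you want to change; the others keep their defaults.
taller_call = draw_chart([1, 2, 3], height=8)
print(taller_call)    # Untitled: 3 points, 6x8
```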

In [None]:
pdata_df.plot.line( # your code here )

### Exercise: Add title to chart 

Hint: You will find something in the documentation of plot() about this.

In [None]:
# your code here


### Exercise: Add x-axis and y-axis labels

This will be a bit tricky.  It appears you have to get the axes returned from the plot and call another function on it.

Google it...

You may run across the "axes" concept in matplotlib.  plot() returns a matplotlib Axes object, which is often assigned to a variable like ax:
    ax = pdata_df.plot.line()
    
There are functions in matplotlib which are invoked on the axes, like ax.func_name(func_arguments)

If you make modifications to the axes but the plot does not display, try plt.show().
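
The pattern of calling functions on the returned axes can be sketched as follows. This example uses made-up data and deliberately shows two Axes methods other than the ones this exercise asks for (grid and set_ylim); it assumes matplotlib is installed alongside pandas.

```python
import pandas as pd

# Hypothetical toy data for illustration only: one column per line on the chart.
toy_df = pd.DataFrame(
    {"Comedy": [1, 3, 2], "Drama": [2, 2, 4]},
    index=[2001, 2002, 2003],
)

# plot.line() returns the matplotlib Axes the chart was drawn on.
ax = toy_df.plot.line()
print(type(ax))

# Functions invoked on the Axes change the existing chart.
ax.grid(True)        # draw grid lines
ax.set_ylim(0, 5)    # fix the range of the y-axis
print(ax.get_ylim())
```

The label-setting functions you need for this exercise are called on `ax` in the same way.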

In [None]:
# your code here


### Exercise: Save the updated plot to file

In [None]:
# your code here