# Intro: Python and Pandas

This notebook provides a short introduction to basic Python commands, and specifically to the basic commands of the pandas package, which will help you investigate and prepare data for subsequent analyses. If you are new to Python and need more resources, you can take a free online intro class through [DataCamp](https://www.datacamp.com/courses/intro-to-python-for-data-science).

After today's class you will be able to read in data from flat files and execute basic data manipulations using pandas.

## Table of Contents
1. [General Remarks](#General-Remarks)
    1. [Python Setup](#Python-Setup)
1. [Data Analysis in Pandas](#Data-Analysis-in-Pandas)
    1. [Loading Data](#Loading-Data)
    1. [Displaying Data](#Displaying-Data)
    1. [Columns, rows, data selection](#Columns,-rows,-data-selection)
    1. [Subsetting Data](#Subsetting-Data)
    1. [Statistics](#Statistics)
    1. [Adding and Updating Data](#Adding-and-Updating-Data)
    1. [Grouping and Aggregating Data](#Grouping-and-Aggregating-Data)
    1. [Merging Dataframes](#Merging-Dataframes)
    1. [Saving a CSV](#Saving-a-CSV)

## General Remarks
---

Python: 
* Is a high-level, interpreted, general-purpose programming language named after the British comedy troupe Monty Python
* Was created by Guido van Rossum, and is maintained by an open-source community
* Is consistently ranked among the most popular programming languages
* Is an object-oriented language
* Is widely used in data science because it is powerful, quick to write, and compatible with other languages
* Runs on all major platforms, is easy to learn, highly readable, and open source, and has a fast development time compared to many other languages
* Comes with a growing and always-improving list of open-source libraries for scientific programming, data manipulation, and data analysis (e.g., NumPy, SciPy, pandas, scikit-learn, statsmodels, Matplotlib, seaborn, PyTables, etc.)

IPython/Jupyter:
* Is an enhanced, interactive Python interpreter that started as a grad school project by Fernando Pérez
* Evolved into the IPython notebook, which lets users archive their code, figures, and analysis in a single document, making reproducible research and sharing that research much easier
* Later added support for other languages, including but not limited to Julia, Python, and R, which led to a rebranding as Project Jupyter

### Python Setup

- In Python, we `import` packages. The `import` command allows us to use libraries created by others in our own work by "importing" them. You can think of importing a library as opening up a toolbox and pulling out a specific tool. 
- NumPy is short for Numerical Python. NumPy is a lynchpin in Python's scientific computing stack. Its strengths include a powerful *N*-dimensional array object and a large suite of functions for doing numerical computing.
- Pandas is a Python library for data analysis built around the DataFrame object, a concept borrowed from R. A DataFrame is similar to a spreadsheet, but it lets you do your analysis programmatically rather than with the point-and-click of Excel. It is a lynchpin of the PyData stack.
- Matplotlib is the standard plotting library in Python.
- `%matplotlib inline` is a so-called "magic" function of Jupyter that enables plots to be displayed inline with the code and text of a notebook.

#### This is what the start of a notebook might look like

In [None]:
# remember to put this line in your notebook, otherwise the plots won't be displayed inline
%matplotlib inline
# import the packages
# numpy for array and matrix computation
import numpy as np

# pandas for data analysis
import pandas as pd

# matplotlib and seaborn are the data visualization packages
import matplotlib.pyplot as plt
import seaborn as sns

# configure pandas display: set the maximum number of columns displayed to 25
pd.options.display.max_columns = 25

In practice we typically load libraries like `numpy` and `pandas` with shortened aliases, e.g., `import numpy as np`. This is like saying, "`import numpy`, and wherever you see `np`, read it as `numpy`." Similarly, you'll often see `import pandas as pd` or `import matplotlib.pyplot as plt`. These aliases exist because it's faster to type `plt.plot()` than `matplotlib.pyplot.plot()`, and even programmers don't like to type more than they have to.

Another shortcut is `%pylab inline`. This command enables inline plots and also runs `import numpy as np` and `import matplotlib.pyplot as plt` for you. It additionally imports many NumPy and Matplotlib names directly into the notebook's namespace, which is why the IPython documentation discourages it in favor of `%matplotlib inline` combined with explicit imports.

In documentation and in examples, you will frequently see `numpy` commands starting with the alias `np` rather than `numpy` (e.g., `np.array()` or `np.argsort()`) and `pandas` commands starting with `pd` (e.g., `pd.DataFrame()` or `pd.concat()`).

In [None]:
# This is how you make comments
import pandas as pd

In object-oriented programming languages like Python, an object is an entity that contains data along with associated metadata and/or functionality. In Python everything is an object, which means every entity has some metadata (called attributes) and associated functionality (called methods). These attributes and methods are accessed via the dot syntax. Even the attributes and methods of objects are themselves objects with their own type information.
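
As a minimal sketch of the dot syntax, the cell below looks at one attribute and one method of an ordinary integer. The variable names `x` and `words` are only illustrative.

```python
# Every value is an object, so even a plain integer has attributes and methods
x = 5
print(type(x))          # the type of the object x points to
print(x.real)           # an attribute: the real part of the number
print(x.bit_length())   # a method: the number of bits needed to represent x

# Methods are objects too, with a type of their own
words = ['a', 'b']
print(type(words.append))
```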

In [None]:
# Python is an object based language, and these objects come with types
x = 1         # x is an integer
x = 'hello'   # x is a string
x = [1, 2, 3] # x is a list

Think of variables as pointers to objects, rather than as buckets that contain data.

In [None]:
# And variables can point to the same objects
x = [1, 2, 3]
y = x
print(x)
print(y)


In [None]:
# When you manipulate one the other changes as well
x.append(4) # append 4 to the list pointed to by x
print(y) # y's list is modified as well!
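
If you want an independent list instead of a second name for the same one, one option is to copy it. The sketch below contrasts plain assignment with the list's `copy()` method; the name `z` is introduced here only for illustration.

```python
x = [1, 2, 3]
y = x           # y points to the same list object as x
z = x.copy()    # z points to a new list with the same contents

x.append(4)

print(y)        # follows the change, because x and y are one object
print(z)        # unchanged, because z is an independent copy
print(y is x)   # True: same object
print(z is x)   # False: different objects
```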

In [None]:
## A simple operation: say we have a list of numbers and want to separate them into two lists
# 1. set the halfway point: we create a variable and assign the value 4 to it
half = 4

# 2. create the two empty lists (; lets us put both statements on one line)
lower = []; upper = []

# 3. split the numbers into lower and upper, and append each to the matching list
## In Python whitespace is meaningful! A block of code is a set of statements that should be treated
## as a unit, and this is indicated by the indentation. Indented code blocks are always preceded by
## a colon (:) on the previous line. However, whitespace within lines does not matter.
for i in range(8):
    if (i < half):
        lower.append(i)
    else:
        upper.append(i)
        
print("lower:", lower)
print("upper:", upper)

In [None]:
# You can print anything you want
## print() is a function in Python 3; in Python 2 it was a statement without parentheses
print("My favorite numbers are", lower)

In [None]:
# line breaks: a backslash continues the statement (enclosing it in () works too)
x = 1 + 2 + 3 + 4 + \
    5 + 6 + 7 + 8
    
print(x)

## Data Analysis in Pandas

When we work with pandas, we think in terms of dataframes. A dataframe is the pandas representation of a spreadsheet, a SQL table, or a Stata, R, or SAS dataset. It contains column names, row indices (starting from 0 by default), and the actual data. Dataframes are the basic objects on which we will perform our data analysis.
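
To make the three ingredients concrete, here is a small sketch that builds a dataframe by hand. The table `toy` and its values are made up for illustration and are not part of the course data.

```python
import pandas as pd

# A small, made-up table: each dictionary key becomes a column
toy = pd.DataFrame({
    'agency': ['NIH', 'NSF', 'NIH'],
    'fy_total_cost': [250000.0, 80000.0, 1200000.0],
})

print(toy.columns.tolist())  # column names
print(toy.index.tolist())    # row labels, starting from 0 by default
print(toy.shape)             # (number of rows, number of columns)
```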

### Loading Data

Before we can start analysing the data, we have to load it into memory. Pandas can read many different data formats: it lets the user read data from a local CSV or Excel file, or pull data from a relational database. Here we use the pandas function `pandas.read_csv` (https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.read_csv.html).

In [None]:
# Let's check where we are first and switch into the correct directory
%pwd
%cd ~
%cd "Yandex.Disk/BigDataPubPol/data/projects"
%ls

In [None]:
df = pd.read_csv("FedRePORTER_PRJ_C_FY2010.csv", low_memory=False)

After the dataset is loaded how do we find out what is in the data? 

### Displaying Data

#### The shape and the columns of the dataframe
When we get the data, we want to know not only the column names but also how many rows and columns it contains. We can find the number of rows and columns by accessing the `shape` attribute with the dot operator.

In [None]:
# shape of a dataframe (row number, column number)
df.shape
# We can see how many columns and rows are in the dataframe

In [None]:
# See the list of variables in the data, with the number of non-missing values in each
df.count()

In [None]:
# We can print the column names as a list
print(list(df.columns.values))

In [None]:
# Or we can also save this list in an object in case we want to use it later
colnames = list(df.columns.values)
print(colnames)

**Is there anything you notice here?**

In [None]:
# Reload the data, skipping any spaces that directly follow a delimiter
df = (pd.read_csv('FedRePORTER_PRJ_C_FY2010.csv',
                  skipinitialspace=True, encoding='utf-8'))

If you have a different file format, such as .txt or .tsv (tab-delimited), you can also use `pandas.read_csv`, but you need to specify the delimiter, e.g., `delimiter='\t'` (or equivalently `sep='\t'`).
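
A minimal sketch of reading tab-delimited data is shown below. To keep it self-contained, it reads from a made-up string through `io.StringIO` rather than from a real file; with a file you would pass the file name instead.

```python
import io
import pandas as pd

# Made-up tab-delimited text standing in for the contents of a .tsv file
tsv_text = "project_id\tagency\tfy_total_cost\n1\tNIH\t250000\n2\tNSF\t80000\n"

# io.StringIO lets read_csv treat the string like an open file
tsv_df = pd.read_csv(io.StringIO(tsv_text), delimiter='\t')
print(tsv_df)
```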

**Data Types**

Pandas stores the data in different types, depending on what information the column contains.

Pandas type | Usage
---|---
object | text
int64 | integer numbers
float64 | floating point numbers
bool | True/False values
datetime64 | date and time values
In [None]:
# It is always good to know what type your variables are
df.dtypes

In [None]:
# Change our date variables to the correct type
df['PROJECT_START_DATE'] = pd.to_datetime(df['PROJECT_START_DATE'])
df['PROJECT_END_DATE'] = pd.to_datetime(df['PROJECT_END_DATE'])
df['BUDGET_START_DATE'] = pd.to_datetime(df['BUDGET_START_DATE'])
df['BUDGET_END_DATE'] = pd.to_datetime(df['BUDGET_END_DATE'])

In [None]:
# You can also apply that function to several variables in your data at once
# To break down long statements we enclose the statement in ()
df[['PROJECT_START_DATE', 'PROJECT_END_DATE', 'BUDGET_START_DATE', 'BUDGET_END_DATE']] = (df[['PROJECT_START_DATE', 
    'PROJECT_END_DATE', 'BUDGET_START_DATE', 'BUDGET_END_DATE']].apply(pd.to_datetime))

Besides `pandas.to_datetime`, there are other ways to change the types of variables, such as `pandas.to_numeric` or the `astype` method (e.g., `astype(str)` to convert a column to text).
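
As a sketch of type conversion on made-up data, the cell below uses `pd.to_numeric` to turn text into numbers and the `astype` method to turn numbers into text. The dataframe `raw` and its columns are hypothetical.

```python
import pandas as pd

# Made-up columns: cost was read in as text, project_id as integers
raw = pd.DataFrame({
    'cost': ['100', '250', 'n/a'],
    'project_id': [101, 102, 103],
})

# errors='coerce' turns values that cannot be parsed into NaN
raw['cost'] = pd.to_numeric(raw['cost'], errors='coerce')

# astype(str) converts the numeric IDs to text
raw['project_id'] = raw['project_id'].astype(str)

print(raw.dtypes)
```

Using `errors='coerce'` is a design choice: it keeps the conversion from failing on unparseable entries, at the cost of silently producing missing values that you should count afterwards.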

In [None]:
# Practice: Are all the other variables formatted correctly? Correct the type if not.


#### The head and tail of the dataframe
It is also helpful to look at the first or last few rows of the data for a first impression, as well as a sanity check. We can call the `head()`/`tail()` methods and specify in the parentheses how many rows we would like to see. Here we choose to display 10. If not specified, the first 5 rows are returned by default.

In [None]:
# Display the first few rows of the dataframe
df.head(10)

In [None]:
# last few rows of the dataframe
# the syntax is similar to head
df.tail(10)

In [None]:
# We can sort the values (by one or multiple variables); here we sort the last five rows
(df[['PROJECT_ID', 'DEPARTMENT','ORGANIZATION_NAME','PROJECT_START_DATE', 
     'FY_TOTAL_COST']].tail().sort_values(['PROJECT_START_DATE','PROJECT_ID'], ascending=[True, True]))

### Columns, rows, data selection

#### Single column selection
If we want to select a specific column, we can use the following syntax:

In [None]:
# select a single column: write the dataframe variable name, followed by square brackets containing
# the column name in quotes (either single or double).
df['AGENCY'].head()

In [None]:
# the same can be done with dot notation (when the column name is a valid Python name)
df.AGENCY.head()

In [None]:
# It is more comfortable having column names in lowercase
df.columns = df.columns.str.lower()
df.head()

In [None]:
# When you want to check the values of a variable
df.agency.value_counts()

#### Multiple-column selection
To select multiple columns, wrap the column names in a Python list, then put the list between the brackets after the dataframe.

In [None]:
# here we select the columns and assign them to a new dataframe, df2
df2 = (df[['agency', 'project_title', 'fy_total_cost',
           'project_start_date', 'project_end_date']])
df2.head()

#### Single/multiple cell selection
Use the `loc` indexer for cell selection. Pass the row and column labels in the _square brackets_ after `loc`: specify the row index first, then the column name, separated by a comma. Note that when you select a range with a colon, both the start and the end label are included.

In [None]:
# single cell selection
# select the cell in row 3 and column project_start_date
cell = df2.loc[3, 'project_start_date']
cell

In [None]:
# multiple cells selection
# option 1: use a python list to explicitly list the rows/columns
cell = df2.loc[[0, 2, 4], 'project_start_date']
cell

In [None]:
# option 2: use colon to indicate contiguous selection
cell = df2.loc[0:4, 'project_start_date']
cell

In [None]:
# if we want to select all columns of row 5, we can use a colon (:) in place of the column name
row5 = df2.loc[5, :]
row5
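
Because `loc` includes both ends of a range, it behaves differently from ordinary Python slicing. The sketch below shows the difference on a made-up dataframe `toy`.

```python
import pandas as pd

toy = pd.DataFrame({'project_start_date': ['2010-01-01', '2010-02-01',
                                           '2010-03-01', '2010-04-01']})

# loc slices by label and includes the end label: rows 0, 1 and 2
by_label = toy.loc[0:2, 'project_start_date']
print(len(by_label))

# ordinary Python slicing excludes the end position: elements 0 and 1
as_list = toy['project_start_date'].tolist()
print(len(as_list[0:2]))
```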

### Subsetting Data
#### Subsetting numerical data
Similar to the `where` clause in SQL, we can select only the data that meet a certain condition. Depending on whether the data is numerical or string, we use a different syntax. For example, if we would like to select the rows whose total cost is at least 50,000, we can use a greater-than-or-equal-to condition to subset.

In [None]:
# conditional subsetting: put the conditional statement within the square brackets 
# the conditional statement here is that we want the cost to be greater than or equal to 50,000.
df3 = df2[df2['fy_total_cost'] >= 50000]
df3.head()

#### Subsetting string/categorical data
When a column contains string or categorical data, the comparison operators are often not the right choice for data selection. Instead, we can compare each value in the column to a target list to see whether it is included in the list. This is done by calling the `isin` method.

In [None]:
# select specific agencies
# we specify the target list within the parentheses of the `isin` method
df4 = df2[df2['agency'].isin(['NIH', 'NSF'])]
df4.head()

In [None]:
# Let's check
df4.agency.value_counts()

#### Subsetting with multiple conditions
If we want to subset the data with more than one condition, we can specify all the conditions and combine them with the `&` operator (element-wise "and"). Remember to put every single condition within a pair of parentheses, because `&` binds more tightly than comparison operators such as `>=`.

In [None]:
# combine a cost threshold (here 1,000,000) with the agency selection from above
df5 = df2[(df2['fy_total_cost'] >= 1000000) & (df2['agency'].isin(['NIH', 'NSF']))]
df5.head()

In [None]:
# Let's check again
df5.agency.value_counts()
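
It can help to store each condition in its own variable before combining them. The sketch below does this on made-up data and also shows the `|` operator (element-wise "or") next to `&`. The names `toy`, `expensive` and `in_list` are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    'agency': ['NIH', 'NSF', 'USDA', 'NIH', 'USDA'],
    'fy_total_cost': [2000000, 50000, 1500000, 30000, 10000],
})

# each condition is a column of True/False values, one per row
expensive = toy['fy_total_cost'] >= 1000000
in_list = toy['agency'].isin(['NIH', 'NSF'])

both = toy[expensive & in_list]     # and: both conditions hold
either = toy[expensive | in_list]   # or: at least one condition holds

print(len(both), len(either))
```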

### Statistics
#### Descriptive stats
Pandas has integrated some very useful tools to help us understand the distribution of the data. The `describe` method computes the most commonly used descriptive statistics, such as count, mean, standard deviation and quantiles for a dataframe. 

In [None]:
# see the descriptive statistics of the variables
df.describe()

#### Value counts and unique values
For categorical values, it is often helpful to find out what the unique values of a given column are, and how often each value occurs. Let's go back to the full dataframe `df`.

In [None]:
# list the different agencies that appear in the data
df['agency'].unique()

In [None]:
# to count how many observations for each agency appeared in the data
df['agency'].value_counts()

In [None]:
# We can count the number of different agencies by taking the length of the unique values
len(df['agency'].unique())

### Adding and Updating Data
#### Creating columns
We sometimes need to create a new column, either to store the result of a calculation on other columns or to add new information to the dataframe. The syntax is given below:

`dataframe['column_name'] = value`

where:
* `dataframe` is the dataframe in which the new column is created,
* `column_name` is the name of the new column, given as a string,
* `value` is the value of each cell (a single value, or one value per row).

In [None]:
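# df5 is a subset of df2, so we first make an explicit copy; otherwise adding a column
# may trigger a SettingWithCopyWarning
df5 = df5.copy()
# we can then calculate the monthly cost by dividing the project cost column by 12,
# and assign the result to the new monthly column
df5['monthly'] = df5['fy_total_cost']/12
df5.head().round(1)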
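
The same syntax works with a single value or with a calculation on an existing column. The sketch below shows both on a made-up dataframe; `toy` and the column `fiscal_year` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({'fy_total_cost': [120000.0, 60000.0, 240000.0]})

# a single value is repeated in every row
toy['fiscal_year'] = 2010

# a calculation on an existing column is carried out row by row
toy['monthly'] = toy['fy_total_cost'] / 12

print(toy)
```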

### Grouping and Aggregating Data
#### Group by and aggregation functions
It is possible to group the dataframe by a column, apply an aggregation function to each group, and sort the result.

In [None]:
# calculate how many grants each agency funded
# step 1: in the groupby method, we pass the column we want to group by; we can also select which columns
# we want to carry out the operation on
# step 2: use the count method to count the number of cases
# step 3: sort the values in descending order (set the ascending parameter to False)
df_group = df.groupby('agency')['project_id'].count().sort_values(ascending=False)
df_group.head()

Other useful aggregation functions are:
* `sum()`: sum
* `mean()`: average
* `agg()`: use a Python dictionary to specify an aggregation function for each column

In [None]:
# Note that the aggregation didn't return a dataframe but a Series. So we have to convert it into a dataframe
# if we want to process it further
df_group = df_group.to_frame().reset_index()
df_group.head()

In [None]:
# Let's correct the column name: this shouldn't be project_id but the number of funded projects
df_group.rename(columns={'project_id':'number of funded projects'}, inplace=True)
df_group.head()
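
When you need a different aggregation for each column, `agg()` accepts a dictionary. The sketch below is one way this might look on made-up data; the values in `toy` are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'agency': ['NIH', 'NIH', 'NSF'],
    'project_id': [1, 2, 3],
    'fy_total_cost': [100.0, 300.0, 50.0],
})

# the dictionary maps each column to the aggregation applied to it
summary = toy.groupby('agency').agg({'project_id': 'count',
                                     'fy_total_cost': 'mean'})
print(summary)
```

Unlike the single-column count above, this call returns a dataframe, with the grouping column as its index.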

### Merging Dataframes
Pandas can merge (join) two datasets. You can store the result in a new dataframe. There are different ways of merging data: left, right, outer, inner (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html).

In [None]:
merge_df = pd.merge(df5, df_group, on=["agency"], how="inner")
merge_df.head()

In [None]:
merge_df.shape
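
The `how` argument decides which keys survive the merge. The sketch below compares three of the options on two made-up tables whose agencies only partly overlap; `projects` and `counts` are hypothetical names.

```python
import pandas as pd

projects = pd.DataFrame({'agency': ['NIH', 'NSF', 'DOD'],
                         'fy_total_cost': [100, 200, 300]})
counts = pd.DataFrame({'agency': ['NIH', 'NSF', 'USDA'],
                       'number of funded projects': [10, 20, 30]})

inner = pd.merge(projects, counts, on='agency', how='inner')  # keys found in both tables
left = pd.merge(projects, counts, on='agency', how='left')    # all keys from projects
outer = pd.merge(projects, counts, on='agency', how='outer')  # keys from either table

print(inner.shape, left.shape, outer.shape)
```

Checking the shape after a merge, as in the cell above, is a quick way to notice rows that were dropped or duplicated unexpectedly.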

### Saving a CSV
You can save a copy of your dataframe as a .csv file.

In [None]:
merge_df.to_csv("~/Yandex.Disk/example_data.csv", encoding='utf8')
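
By default `to_csv` also writes the row index as an extra first column. If you do not need it, one option is `index=False`. The sketch below writes to an in-memory buffer instead of a file, and the dataframe `toy` is made up.

```python
import io
import pandas as pd

toy = pd.DataFrame({'agency': ['NIH', 'NSF'], 'fy_total_cost': [100, 200]})

# Write to an in-memory buffer instead of a file so the sketch leaves nothing on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # index=False leaves out the row labels

# Read the text back to check that the columns survived the round trip
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.columns.tolist())
```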

In [None]:
%cd ~
%pwd