sources:

https://pandas.pydata.org/pandas-docs/stable/10min.html

https://cloudxlab.com/blog/numpy-pandas-introduction/

https://docs.scipy.org/doc/numpy/user/quickstart.html

# Python Libraries
Python libraries are collections of functions written by other people and open-sourced for general use. We do not need to worry about the code inside these functions; we just need to know how to use them. Libraries provide documentation (i.e., the *library spec*) that describes what each function does and what arguments it takes.

Today, we will learn about **Pandas** - a library which allows us to read in data in a spreadsheet format and manipulate it easily.

# What is Pandas?
Pandas is one of the most widely used Python libraries in data science. It provides high-performance, easy-to-use data structures and data analysis tools. The main object in Pandas is called a DataFrame. A DataFrame is like a spreadsheet with column names and row labels.

However, a DataFrame is better than a spreadsheet because Pandas provides many functions that act on a DataFrame and add functionality. Examples include creating pivot tables (tables that summarize the data from another table), creating new columns based on other columns, calculating statistics on the values in the table, and plotting graphs.

Pandas is usually imported into Python under the *pd* nickname:

In [2]:
import pandas as pd

The nickname pd allows us to call Pandas functions using the notation pd.FunctionName.

Pandas is similar to Excel and displays arrays of data as nicely formatted tables.
The best way to learn a new library is to go through tutorials offered online, such as these:

https://pandas.pydata.org/pandas-docs/stable/10min.html

https://pandas.pydata.org/pandas-docs/stable/tutorials.html

You can also find the names of the functions available in Pandas in the online Pandas API reference:

https://pandas.pydata.org/pandas-docs/stable/api.html

Additionally, [Stack Overflow](https://www.stackoverflow.com) is a fantastic resource with answers to many common questions about Python and its libraries. Searching Google for a Python question usually yields links to Stack Overflow.


#### Today, we will learn the basic ways to manipulate Pandas DataFrames and use them to look at the Fragile Families data. 

To create a DataFrame, we use the Pandas DataFrame function and pass a list as input to the function. In a notebook, writing the name of a DataFrame on the last line of a cell displays it as a table:

In [3]:
s = pd.DataFrame([1,3,5,6,8])
s

,0
0,1
1,3
2,5
3,6
4,8


The DataFrame above has one column labelled "0" (this is the default name) with values 1,3,5,6,8. The left column in bold is the index column with the row labels (the labels start at "0" by default). 

We can also create a DataFrame with multiple columns by passing in a list of lists into the DataFrame function like this:

In [19]:
# Let's create a Python 6x4 (6 rows, 4 columns) list: the list will have 6 elements, each of which is a list with
# 4 elements.
# This is a "list of lists", or a "2D list"
a = [[2,3,4,5],[4,677,774,3],[402,3034,202,22],[3.4,67.8,3,8],[5,4,22,5.],[1,2,3,4]]
print("list:\n",a)

# Now, let's create a DataFrame from that list
df = pd.DataFrame(a)
# Since we have 6 rows, the row labels for the DataFrame will go from 0 to 5 and since we have 4 columns, the column
# labels will go from 0 to 3
df

list:
 [[2, 3, 4, 5], [4, 677, 774, 3], [402, 3034, 202, 22], [3.4, 67.8, 3, 8], [5, 4, 22, 5.0], [1, 2, 3, 4]]


,0,1,2,3
0,2.0,3.0,4,5.0
1,4.0,677.0,774,3.0
2,402.0,3034.0,202,22.0
3,3.4,67.8,3,8.0
4,5.0,4.0,22,5.0
5,1.0,2.0,3,4.0


We can also label rows and columns with distinct names, which helps clarify what the different rows and columns mean. To name rows we use DataFrame.index while to name columns we use DataFrame.columns:

In [28]:
df.index = range(5,11) 
# range(a,b) generates the integers from a to b-1, so range(5,11) gives the six labels 5 through 10.
df.columns = ['A','B','C','D']
df

,A,B,C,D
5,2.0,3.0,4,5.0
6,4.0,677.0,774,3.0
7,402.0,3034.0,202,22.0
8,3.4,67.8,3,8.0
9,5.0,4.0,22,5.0
10,1.0,2.0,3,4.0


In [7]:
# We can also define the index and column names when we create the DataFrame
df1 = pd.DataFrame(a, index=range(5,11), columns=['A','B','C','D'])
df1

,A,B,C,D
5,2.0,3.0,4,5.0
6,4.0,677.0,774,3.0
7,402.0,3034.0,202,22.0
8,3.4,67.8,3,8.0
9,5.0,4.0,22,5.0
10,1.0,2.0,3,4.0
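
Another common way to name the columns at creation time is to pass a dictionary that maps each column name to its list of values. The cell below is only a sketch: it reuses the values of columns A and B from the table above, and the variable names `columns` and `df2` are made up for this example.

```python
import pandas as pd

# Each key becomes a column name; each list holds that column's values.
# The numbers are copied from columns A and B of the table above.
columns = {
    "A": [2, 4, 402, 3.4, 5, 1],
    "B": [3, 677, 3034, 67.8, 4, 2],
}

# The index argument works the same way as with a list of lists
df2 = pd.DataFrame(columns, index=range(5, 11))
print(df2)
```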


Now that we know how to create DataFrames, let's look at some of the things we can do with them.

**Selecting a particular entry in a DataFrame**

In [9]:
# To select a particular entry of the DataFrame, we can use loc, which takes the index label and
# the column label inside square brackets.
print(df.loc[6,'A'])
print(df.loc[10,'B'])

4.0
2.0


In [18]:
# Alternatively, we can use the iloc function which takes in the index position starting from 0 and the column position
# starting from 0. So if we want to get the same values as above, we want to pick the second row (position 1) and the 
# first column (position 0)
print(df.iloc[1,0])
# and for the second value we need to pick the sixth row (position 5) and the second column (position 1)
print(df.iloc[5,1])

4.0
2.0
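
One difference between the two is worth seeing in a small example: when selecting a range, loc works with labels and includes the end label, while iloc works with positions and excludes the end position. The sketch below rebuilds the same table so that it runs on its own; the names `by_label` and `by_position` are made up for this example.

```python
import pandas as pd

a = [[2,3,4,5],[4,677,774,3],[402,3034,202,22],[3.4,67.8,3,8],[5,4,22,5.],[1,2,3,4]]
df = pd.DataFrame(a, index=range(5, 11), columns=['A','B','C','D'])

# loc slices by label and includes both ends: rows labelled 5, 6, 7 and columns A, B
by_label = df.loc[5:7, 'A':'B']

# iloc slices by position and excludes the end: row positions 0, 1, 2 and column positions 0, 1
by_position = df.iloc[0:3, 0:2]

# Both selections pick out the same block of the table
print(by_label.equals(by_position))
```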


**Selecting an entire column or row in a DataFrame**

In [33]:
# To select an entire column, we can use square brackets:
df["C"]

5       4
6     774
7     202
8       3
9      22
10      3
Name: C, dtype: int64

In [34]:
# Or we can also use loc or iloc with a colon ":" which means "all entries"
# So, for example "df.loc[:,'C']" says "select all rows in column C"
df.loc[:,'C']

5       4
6     774
7     202
8       3
9      22
10      3
Name: C, dtype: int64

In [35]:
# and "df.iloc[:,2]" says "select all rows in the third column"
df.iloc[:,2]

5       4
6     774
7     202
8       3
9      22
10      3
Name: C, dtype: int64

In [37]:
# To select an entire row, we can do the same thing, but this time we pass the colon as the second argument.
# So now "df.loc[6,:]" says "select all columns in the row labelled 6" (that is, the second row)
df.loc[6,:]

A      4.0
B    677.0
C    774.0
D      3.0
Name: 6, dtype: float64

In [38]:
# and "df.iloc[1,:]" says "select all columns in the second row"
df.iloc[1,:]

A      4.0
B    677.0
C    774.0
D      3.0
Name: 6, dtype: float64
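
The same notation extends to selecting several columns or rows at once by passing a list of labels. This is a minimal sketch on the same table, rebuilt here so that the cell runs on its own; `two_columns` and `two_rows` are names made up for this example.

```python
import pandas as pd

a = [[2,3,4,5],[4,677,774,3],[402,3034,202,22],[3.4,67.8,3,8],[5,4,22,5.],[1,2,3,4]]
df = pd.DataFrame(a, index=range(5, 11), columns=['A','B','C','D'])

# A list of column labels inside the square brackets returns a smaller DataFrame
two_columns = df[['A', 'C']]

# A list of row labels in loc returns just those rows (all columns)
two_rows = df.loc[[5, 10], :]

print(two_columns)
print(two_rows)
```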

**Checking the number of rows and columns in a DataFrame**

In [9]:
# We can check the shape of the DataFrame using the shape attribute (note that there are no parentheses):
df.shape

(6, 4)
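
Because shape is a pair of numbers (rows first, then columns), we can store each number in its own variable. A small sketch, rebuilding the same table so that the cell runs on its own; `n_rows` and `n_cols` are names chosen for this example.

```python
import pandas as pd

a = [[2,3,4,5],[4,677,774,3],[402,3034,202,22],[3.4,67.8,3,8],[5,4,22,5.],[1,2,3,4]]
df = pd.DataFrame(a, index=range(5, 11), columns=['A','B','C','D'])

# Unpack the (rows, columns) pair into two variables
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# len() gives the same counts separately
print(len(df))          # number of rows
print(len(df.columns))  # number of columns
```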

**Creating a summary DataFrame**

A summary DataFrame provides summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for each column. To create it, we use the describe function like this:

In [39]:
df.describe()

,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,69.566667,631.3,168.0,7.833333
std,162.864627,1206.38905,306.937779,7.139094
min,1.0,2.0,3.0,3.0
25%,2.35,3.25,3.25,4.25
50%,3.7,35.9,13.0,5.0
75%,4.75,524.7,157.0,7.25
max,402.0,3034.0,774.0,22.0


We could have calculated these values ourselves. For example, for column A:

In [41]:
print(df['A'].count())
print(df['A'].mean())
print(df['A'].std())
print(df['A'].min())
print(df['A'].quantile(0.25))
print(df['A'].quantile(0.5))
print(df['A'].quantile(0.75))
print(df['A'].max())

6
69.56666666666666
162.86462681216773
1.0
2.35
3.7
4.75
402.0


But using the function describe is certainly easier!
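
The introduction mentioned creating new columns from existing ones and building pivot tables. The following is only a small sketch of both on made-up data: the table and its column names (`city`, `year`, `sales`) are invented for illustration and have nothing to do with the Fragile Families data.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration only
sales = pd.DataFrame({
    "city": ["Austin", "Austin", "Boston", "Boston"],
    "year": [2020, 2021, 2020, 2021],
    "sales": [10, 20, 30, 50],
})

# Create a new column based on an existing column
sales["sales_doubled"] = sales["sales"] * 2

# Pivot table: one row per city, one column per year, mean sales in each cell
pivot = sales.pivot_table(index="city", columns="year", values="sales", aggfunc="mean")
print(pivot)
```

Here the pivot table has two rows (one per city) and two columns (one per year), so it summarizes the four original rows in a more compact layout.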

# Reading CSV files into a Pandas DataFrame

Another way to create a DataFrame is from a CSV file. CSV (comma-separated values) files are a very common format for saving tabular (Excel-like) data. To create a DataFrame from a CSV file, we use the Pandas read_csv function.
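
To see what read_csv does without needing a file on disk, here is a minimal sketch that reads a tiny CSV written directly in the cell. The column names and values (`family_id`, `income`, `children`) are made up for illustration; `io.StringIO` is a standard Python tool that makes a string behave like a file.

```python
import io
import pandas as pd

# A tiny made-up CSV: the first line holds the column names,
# and each following line is one row with values separated by commas
csv_text = """family_id,income,children
1,52000,2
2,31000,1
3,47000,3
"""

# read_csv accepts a file path or a file-like object such as StringIO
small = pd.read_csv(io.StringIO(csv_text))
print(small)
print(small.shape)
```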

Reminder: 
* ".." means "go up a directory"
* "." means "this directory"
* "~" means "my home directory" (the folder named after my username, one level above Desktop, Documents, etc.)
* "pwd" means "display the current directory"
* "ls" means "list contents of directory"
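
The same ideas are available from inside Python through the standard `os` module. This is only a sketch to connect the shortcuts above to Python code; the variable names are made up, and the printed paths will differ on every computer.

```python
import os

# Like "pwd": the current directory
here = os.getcwd()
print(here)

# Like "ls": the contents of this directory (".")
contents = os.listdir(".")
print(contents)

# ".." resolved to a full path: the directory one level up
parent = os.path.abspath("..")
print(parent)

# "~" resolved to a full path: the home directory
home = os.path.expanduser("~")
print(home)
```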

In [42]:
pwd

'/Users/Renato/Documents/GitHub/ai4all/Lectures'

In [45]:
background = "../../ai4all_data/background.csv"
data_frame = pd.read_csv(background, low_memory=False)

Now that we have created a DataFrame from the CSV file, we can see how many rows and columns there are in the DataFrame using the shape attribute as we did before:

In [47]:
data_frame.shape

(4242, 12943)

Our DataFrame has 4,242 rows where each row corresponds to a different family that took part in the study and 12,943 columns where each column represents one piece of information (a "variable" or a "feature") of the family.

Because this dataset is huge, it doesn't make sense to try to look at it all at once. Instead, we can take a quick peek at the DataFrame using the "data_frame.head()" function, which returns only the first five rows. Because there are so many columns, the notebook displays only the first and last few of them:

In [48]:
data_frame.head()

,challengeID,m1intmon,m1intyr,m1lenhr,m1lenmin,cm1twoc,cm1fint,cm1tdiff,cm1natsm,m1natwt,...,m4d9,m4e23,f4d6,f4d7,f4d9,m5c6,m5d20,m5k10,f5c6,k5f1
0,1,-3,,-3,40,,0,,,,...,-3.0,-3.0,-3.0,-3.0,-3.0,-3.0,-3.0,-3,-3.0,-3.0
1,2,-3,,0,40,,1,,,,...,-3.0,8.473318,-3.0,-3.0,-3.0,-3.0,9.845074,-3,-3.0,9.723551
2,3,-3,,0,35,,1,,,,...,-3.0,-3.0,9.097495,10.071504,-3.0,-3.0,-3.0,-3,-3.0,-3.0
3,4,-3,,0,30,,1,,,,...,-3.0,-3.0,9.512706,10.286578,-3.0,10.677285,-3.0,-3,8.522331,10.608137
4,5,-3,,0,25,,1,,,,...,-3.0,-3.0,11.076016,9.615958,-3.0,9.731979,-3.0,-3,10.115313,9.646466
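
The head function also accepts the number of rows to show, and tail does the same for the end of the table. A minimal sketch on a made-up table (the names `wide`, `id`, `x`, and `y` are invented for this example):

```python
import pandas as pd

# Hypothetical table with 100 rows and 3 columns
wide = pd.DataFrame({
    "id": list(range(1, 101)),
    "x": list(range(100, 200)),
    "y": list(range(200, 300)),
})

first_three = wide.head(3)  # first 3 rows instead of the default 5
last_two = wide.tail(2)     # last 2 rows

print(first_three)
print(last_two)

# The column names are available as a list-like object
print(list(wide.columns))
```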


# Understanding Fragile Families Variables

Each row in the data frame represents a single family enrolled in the Fragile Families study. Each column represents a different *variable* - information collected about that family. The row labels are the default numbering of rows starting at 0, while the column labels are the names of the variables. However, these names don't make much sense!

The first column, labelled "challengeID", is a unique identifier for each family. So challengeID=5 stands for the family identified as Family 5.

For all other variables, the names are not very helpful! Luckily for us, the Fragile Families project has a fantastic website that provides information on what each variable name represents, what type of variable it is, and what the values of the variable mean:

[http://metadata.fragilefamilies.princeton.edu/variables](http://metadata.fragilefamilies.princeton.edu/variables)


First, let's use the challengeID as the labels for the rows. This way, we can refer to Family 5 by referring to variables in data_frame for the row with label "5". To do this, we can use the "data_frame.set_index('row label')" function where we set the row label to 'challengeID':  

In [49]:
data_frame = data_frame.set_index('challengeID')

In [50]:
data_frame.head()

challengeID,m1intmon,m1intyr,m1lenhr,m1lenmin,cm1twoc,cm1fint,cm1tdiff,cm1natsm,m1natwt,cm1natsmx,...,m4d9,m4e23,f4d6,f4d7,f4d9,m5c6,m5d20,m5k10,f5c6,k5f1
1,-3,,-3,40,,0,,,,,...,-3.0,-3.0,-3.0,-3.0,-3.0,-3.0,-3.0,-3,-3.0,-3.0
2,-3,,0,40,,1,,,,,...,-3.0,8.473318,-3.0,-3.0,-3.0,-3.0,9.845074,-3,-3.0,9.723551
3,-3,,0,35,,1,,,,,...,-3.0,-3.0,9.097495,10.071504,-3.0,-3.0,-3.0,-3,-3.0,-3.0
4,-3,,0,30,,1,,,,,...,-3.0,-3.0,9.512706,10.286578,-3.0,10.677285,-3.0,-3,8.522331,10.608137
5,-3,,0,25,,1,,,,,...,-3.0,-3.0,11.076016,9.615958,-3.0,9.731979,-3.0,-3,10.115313,9.646466
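
With challengeID as the index, loc can look a family up directly by its ID. The sketch below uses a made-up miniature table named `toy` with a single variable; the three m1lenmin values are copied from the first three rows shown above.

```python
import pandas as pd

# Miniature stand-in for the real table, which has thousands of columns
toy = pd.DataFrame({
    "challengeID": [1, 2, 3],
    "m1lenmin": [40, 40, 35],
})

# Use challengeID as the row labels
toy = toy.set_index("challengeID")

# Look up the value of m1lenmin for Family 3
print(toy.loc[3, "m1lenmin"])
```

Note that set_index returns a new DataFrame rather than changing the existing one, which is why the result is assigned back to the same variable.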


# Other libraries

Some of the libraries popular in AI research are:

For data manipulation:
* **numpy** - for efficient math and dealing with operations on tables and matrices
* **pandas** - built on top of numpy, with labelled rows and columns that make tables more readable; its DataFrame is similar to the data frames of the R programming language
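
The two libraries work well together. As a small sketch (the table and the name `arr` are made up for this example), a DataFrame can be converted to a numpy array, which drops the row and column labels and keeps only the values:

```python
import numpy as np
import pandas as pd

table = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])

# Convert the DataFrame to a plain numpy array (labels are dropped)
arr = table.to_numpy()
print(arr.shape)

# numpy operates on the whole array at once: here, the sum of each column
column_sums = arr.sum(axis=0)
print(column_sums)

# numpy functions can also be applied directly to a DataFrame column
print(np.sqrt(table["A"]))
```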

For visualization:
* **matplotlib** - plotting

We will not be using the following libraries, but you will probably use them in the future:

For Machine Learning calculations:
* TensorFlow - offering efficient implementations of Machine Learning algorithms, such as Neural Networks
* PyTorch
* Caffe