sources:

https://pandas.pydata.org/pandas-docs/stable/10min.html

https://cloudxlab.com/blog/numpy-pandas-introduction/

https://docs.scipy.org/doc/numpy/user/quickstart.html

# Python Libraries
Python libraries are collections of functions written by other people and released as open source for general use. You do not need to worry about how a function is implemented - you can just call it according to the *library spec*. Today, we will learn **Pandas** - a library that lets us read in spreadsheet-style data and manipulate it easily.

# What is Pandas?
Pandas is one of the most widely used Python libraries in data science. It provides high-performance, easy-to-use data structures and data analysis tools. Its central object is the DataFrame, an in-memory 2D table that works like a spreadsheet with column names and row labels.

On top of these 2D tables, Pandas provides many additional features, such as creating pivot tables, computing columns based on other columns, and plotting graphs. Pandas can be imported into Python using:

In [2]:
import pandas as pd

Pandas is similar to Excel and displays tables in a readable, spreadsheet-like layout.
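As a quick taste of the features mentioned above, here is a minimal sketch of a computed column and a pivot table. The `sales` table and all of its values are made up for illustration; they are not part of the Fragile Families data.

```python
import pandas as pd

# Hypothetical sales data, invented purely for illustration
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "product": ["apples", "pears", "apples", "pears"],
    "price": [2.0, 3.0, 2.5, 3.5],
    "quantity": [10, 4, 8, 6],
})

# Compute a new column from two existing columns
sales["revenue"] = sales["price"] * sales["quantity"]

# Build a pivot table: one row per region, one column per product
pivot = sales.pivot_table(index="region", columns="product",
                          values="revenue", aggfunc="sum")
print(pivot)
```

The new `revenue` column is computed for every row at once, without writing a loop.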

The best way to learn a new library is to go through tutorials offered online, such as these:

https://pandas.pydata.org/pandas-docs/stable/10min.html

https://pandas.pydata.org/pandas-docs/stable/tutorials.html

You can find functions available through the Pandas library online:

https://pandas.pydata.org/pandas-docs/stable/api.html

Or simply by searching on Google.


#### Today, we will learn the basic ways to manipulate Pandas DataFrames and use them to look at the Fragile Families data. 

You initialize tables (called DataFrames in Pandas) similarly to numpy arrays:

In [3]:
s = pd.DataFrame([1,3,5,6,8])
s

,0
0,1
1,3
2,5
3,6
4,8


You can also create a 2D DataFrame, with multiple columns and rows:

In [24]:
# Let's create a 6x4 (6 rows, 4 columns) Python list: it has 6 elements, each of which is a length-4 list
a = [[2,3,4,5],[4,677,774,3],[402,3034,202,22],[3.4,67.8,3,8],[5,4,22,5.],[1,2,3,4]]
print("list:\n",a)

# Now, let's create a DataFrame from that list (rows and columns get default labels 0, 1, 2, ...)
df = pd.DataFrame(a)
print("\n pandas dataframe:\n", df)

list:
 [[2, 3, 4, 5], [4, 677, 774, 3], [402, 3034, 202, 22], [3.4, 67.8, 3, 8], [5, 4, 22, 5.0], [1, 2, 3, 4]]

 pandas dataframe:
        0       1    2     3
0    2.0     3.0    4   5.0
1    4.0   677.0  774   3.0
2  402.0  3034.0  202  22.0
3    3.4    67.8    3   8.0
4    5.0     4.0   22   5.0
5    1.0     2.0    3   4.0


You can also give your rows and columns distinct names, which helps with data manipulation:

In [25]:
df.index = range(5,11)
df.columns = ['A','B','C','D']
print("\n pandas dataframe with updated column, row names:\n", df)

# We could also create the DataFrame and name its rows and columns in a single step
df1 = pd.DataFrame(a, index=range(5,11), columns=['A','B','C','D'])
print("\n pandas dataframe, created in one step\n",df1)


 pandas dataframe with updated column, row names:
         A       B    C     D
5     2.0     3.0    4   5.0
6     4.0   677.0  774   3.0
7   402.0  3034.0  202  22.0
8     3.4    67.8    3   8.0
9     5.0     4.0   22   5.0
10    1.0     2.0    3   4.0

 pandas dataframe, created in one step
         A       B    C     D
5     2.0     3.0    4   5.0
6     4.0   677.0  774   3.0
7   402.0  3034.0  202  22.0
8     3.4    67.8    3   8.0
9     5.0     4.0   22   5.0
10    1.0     2.0    3   4.0


In [5]:
df

,A,B,C,D
5,2.0,3.0,4,5.0
6,4.0,677.0,774,3.0
7,402.0,3034.0,202,22.0
8,3.4,67.8,3,8.0
9,5.0,4.0,22,5.0
10,1.0,2.0,3,4.0


In [6]:
# To select a particular entry of the table, use the iloc indexer: df.iloc[row_position, column_position], counting from 0
print(df.iloc[0,0])
print(df.iloc[4,3])

2.0
5.0
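Since our rows are labelled 5 through 10, it helps to see position-based and label-based selection side by side. The sketch below rebuilds the same `df` so it runs on its own; `iloc` counts positions from 0, while `loc` uses the row and column labels.

```python
import pandas as pd

a = [[2, 3, 4, 5], [4, 677, 774, 3], [402, 3034, 202, 22],
     [3.4, 67.8, 3, 8], [5, 4, 22, 5.0], [1, 2, 3, 4]]
df = pd.DataFrame(a, index=range(5, 11), columns=["A", "B", "C", "D"])

by_position = df.iloc[0, 0]   # first row, first column (positions start at 0)
by_label = df.loc[5, "A"]     # row labelled 5, column labelled "A"
print(by_position, by_label)

# The same entry reached both ways: position (4, 3) is row label 9, column "D"
print(df.iloc[4, 3], df.loc[9, "D"])
```

Note that `df.loc[0, "A"]` would raise a `KeyError` here, because no row carries the label 0.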


In [7]:
# You can subselect a column using its name
df["C"]

5       4
6     774
7     202
8       3
9      22
10      3
Name: C, dtype: int64

In [8]:
# You can also use positional indexing.
# "[:,0]" says "select all entries in the 0th (vertical) dimension, and the 0th entry in the 1st (horizontal) dimension"
df.iloc[:,0]

5       2.0
6       4.0
7     402.0
8       3.4
9       5.0
10      1.0
Name: A, dtype: float64
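The same ideas extend to selecting several columns or rows at once. This is a small sketch on the same `df` (rebuilt here so the cell is self-contained); the threshold of 100 is an arbitrary value chosen for illustration.

```python
import pandas as pd

a = [[2, 3, 4, 5], [4, 677, 774, 3], [402, 3034, 202, 22],
     [3.4, 67.8, 3, 8], [5, 4, 22, 5.0], [1, 2, 3, 4]]
df = pd.DataFrame(a, index=range(5, 11), columns=["A", "B", "C", "D"])

subset = df[["A", "C"]]        # a list of names gives a DataFrame with those columns
first_rows = df.iloc[:2, :]    # first two rows, all columns
large_b = df[df["B"] > 100]    # keep only rows where column B is greater than 100

print(subset.shape)
print(first_rows)
print(large_b)
```

The expression `df["B"] > 100` produces a column of True/False values, and indexing with it keeps the rows marked True.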

In [9]:
# You can check the shape of your table (rows, columns) with the shape attribute (no parentheses needed):
df.shape

(6, 4)

Pandas also gives you a summary of the distribution of values in each column:

In [10]:
df.describe()

,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,69.566667,631.3,168.0,7.833333
std,162.864627,1206.38905,306.937779,7.139094
min,1.0,2.0,3.0,3.0
25%,2.35,3.25,3.25,4.25
50%,3.7,35.9,13.0,5.0
75%,4.75,524.7,157.0,7.25
max,402.0,3034.0,774.0,22.0


You can calculate those values yourself by typing:

In [45]:
print(df['A'].mean())
print(df['B'].min())
print(df['C'].std())
print(df['D'].max())

69.56666666666666
2.0
306.9377787109303
22.0
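The percentile rows of `describe()` can be reproduced by hand too. The sketch below recomputes the count, 25% and 50% entries for column `A` on the same `df` (rebuilt so it runs on its own). Note that pandas computes `std()` as the sample standard deviation (dividing by n - 1).

```python
import pandas as pd

a = [[2, 3, 4, 5], [4, 677, 774, 3], [402, 3034, 202, 22],
     [3.4, 67.8, 3, 8], [5, 4, 22, 5.0], [1, 2, 3, 4]]
df = pd.DataFrame(a, index=range(5, 11), columns=["A", "B", "C", "D"])

summary = df.describe()
col = df["A"]

n_values = col.count()        # the "count" row
q25 = col.quantile(0.25)      # the "25%" row
median_a = col.median()       # the "50%" row is the median

print(n_values, q25, median_a)
print(summary.loc["50%", "A"])
```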


# Reading CSV files into pandas.
CSV stands for "comma-separated values" - it is the format most often used to save tabular (Excel-like) data.

Reminder: 
* ".." means "go up a directory"
* "." means "this directory"
* "~" means "my home directory" (the folder named after your username, a level above Desktop, Documents, etc.)
* "pwd" means "display the current directory"
* "ls" means "list contents of directory"
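If you want to do the same navigation from inside Python code rather than with shell commands, the standard `os` module offers rough equivalents. This is only a sketch; the paths it prints depend on where you run it.

```python
import os

here = os.getcwd()                  # like "pwd": the current directory
parent = os.path.abspath("..")      # ".." resolved to a full path
home = os.path.expanduser("~")      # "~" resolved to your home directory
contents = os.listdir(".")          # like "ls": names in the current directory

print("current directory:", here)
print("parent directory: ", parent)
print("home directory:   ", home)
print("number of entries here:", len(contents))
```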

In [11]:
pwd

'/Users/agataforyciarz/ai4all/Lectures'

In [12]:
background = "../../ai4all_data/background.csv"
data_frame = pd.read_csv(background, low_memory=False)

You can display the table by just typing its name. Alternatively, you can call the ".head()" method to show only the first five rows:

In [13]:
data_frame.head()

,challengeID,m1intmon,m1intyr,m1lenhr,m1lenmin,cm1twoc,cm1fint,cm1tdiff,cm1natsm,m1natwt,...,m4d9,m4e23,f4d6,f4d7,f4d9,m5c6,m5d20,m5k10,f5c6,k5f1
0,1,-3,,-3,40,,0,,,,...,-3.0,-3.0,-3.0,-3.0,-3.0,-3.0,-3.0,-3,-3.0,-3.0
1,2,-3,,0,40,,1,,,,...,-3.0,8.473318,-3.0,-3.0,-3.0,-3.0,9.845074,-3,-3.0,9.723551
2,3,-3,,0,35,,1,,,,...,-3.0,-3.0,9.097495,10.071504,-3.0,-3.0,-3.0,-3,-3.0,-3.0
3,4,-3,,0,30,,1,,,,...,-3.0,-3.0,9.512706,10.286578,-3.0,10.677285,-3.0,-3,8.522331,10.608137
4,5,-3,,0,25,,1,,,,...,-3.0,-3.0,11.076016,9.615958,-3.0,9.731979,-3.0,-3,10.115313,9.646466
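If you do not have the data file at hand, you can still try `read_csv` on a small piece of text. In this sketch, `csv_text` and its column names are made up for illustration; `io.StringIO` simply lets pandas treat a string as if it were a file.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = "id,age,score\n1,9,3.5\n2,10,-3\n3,9,4.0\n"

small = pd.read_csv(io.StringIO(csv_text))

print(small.head(2))    # head(n) shows the first n rows; the default is 5
print(small.dtypes)     # pandas infers a type for each column
```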


# Understanding Fragile Families Variables

Each row in the data frame represents a single individual enrolled in the Fragile Families study. Each column represents a different *variable* - information collected about that individual.

You can learn more about each variable by calling three ff functions:

**`select(varName, fieldName)`**: Returns metadata for variable varName. (Optionally, returns only the field specified by fieldName.)

**`filter(*fieldNames)`**: Returns a list of variables where fieldName matches the provided value.

**`search(query, fieldName)`**: Returns a list of variables where query is found in fieldName.

In [57]:
# ff is a library written by the Fragile Families team and stored in a file ff.py a level up from our directory. 
# To import it, we need to append the directory above ours ('..') to our path and import the ff package (file ff.py)
import sys
sys.path.append('..')
import ff

# Let's look up the variable description
var_description = ff.select('t5c13a')
print(var_description, "\n\n")

# Let's examine what individual responses mean
print(var_description['responses'])

# Let's filter for all the variables where mother was the respondent:
mother_variables = ff.filter(respondent='m')
print(len(mother_variables))



{'-8': 'Out of range', '-5': 'Not asked', '1': 'far below average', '3': 'average', '-1': 'Refuse', '4': 'above average', '-4': 'Multiple ans', '5': 'far above average', '-3': 'Missing', '-2': "Don't know", '-6': 'Skip', '-9': 'Not in wave', '2': 'below average', '-7': 'N/A'}
4692
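Once you know what the codes mean, you can translate raw answers into readable labels. The sketch below uses a subset of the response dictionary printed above; the `answers` values are invented for illustration, and treating negative codes as missing is an assumption about how you might want to analyze the data, not a rule from the ff library.

```python
import pandas as pd

# Subset of the response codes printed above (note that the keys are strings)
responses = {'-3': 'Missing', '-2': "Don't know",
             '1': 'far below average', '2': 'below average', '3': 'average',
             '4': 'above average', '5': 'far above average'}

# Hypothetical answers from five respondents, invented for illustration
answers = pd.Series([3, -3, 5, 1, -2])

# Convert the numbers to strings so they match the dictionary keys, then look them up
labels = answers.astype(str).map(responses)
print(labels.tolist())

# Assumption: negative codes carry no real answer, so replace them with NaN
valid = answers.where(answers > 0)
print(valid.mean())   # the mean skips NaN values
```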


In [61]:
# For the first five of these variables, list their labels:
mother_labels = [ff.select(el)['label'] for el in mother_variables[:5]]

In [None]:
mother_labels

Choose a column to be the row index:

In [16]:
data_frame = data_frame.set_index('challengeID')

In [17]:
data_frame

challengeID,m1intmon,m1intyr,m1lenhr,m1lenmin,cm1twoc,cm1fint,cm1tdiff,cm1natsm,m1natwt,cm1natsmx,...,m4d9,m4e23,f4d6,f4d7,f4d9,m5c6,m5d20,m5k10,f5c6,k5f1
1,-3,,-3,40,,0,,,,,...,-3.000000,-3.000000,-3.000000,-3.000000,-3.000000,-3.000000,-3.000000,-3,-3.000000,-3.000000
2,-3,,0,40,,1,,,,,...,-3.000000,8.473318,-3.000000,-3.000000,-3.000000,-3.000000,9.845074,-3,-3.000000,9.723551
3,-3,,0,35,,1,,,,,...,-3.000000,-3.000000,9.097495,10.071504,-3.000000,-3.000000,-3.000000,-3,-3.000000,-3.000000
4,-3,,0,30,,1,,,,,...,-3.000000,-3.000000,9.512706,10.286578,-3.000000,10.677285,-3.000000,-3,8.522331,10.608137
5,-3,,0,25,,1,,,,,...,-3.000000,-3.000000,11.076016,9.615958,-3.000000,9.731979,-3.000000,-3,10.115313,9.646466
6,-3,,0,25,,1,,,,,...,8.515700,10.558813,-3.000000,-3.000000,7.022328,-3.000000,10.564085,-3,-3.000000,10.255825
7,-3,,0,35,,1,,,,,...,-3.000000,-3.000000,9.660643,9.861125,-3.000000,10.991854,-3.000000,-3,10.972726,10.859800
8,-3,,1,10,,1,,,,,...,-3.000000,10.558813,-3.000000,-3.000000,-3.000000,-3.000000,-3.000000,-3,-3.000000,-3.000000
9,-3,,0,30,,1,,,,,...,-3.000000,-3.000000,11.689877,9.373199,-3.000000,8.194868,-3.000000,-3,9.842380,9.566678
10,-3,,0,33,,1,,,,,...,-3.000000,-3.000000,-3.000000,-3.000000,-3.000000,-3.000000,10.564085,-3,-3.000000,10.105870
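With `challengeID` as the index, you can look up a family directly by its ID. The miniature table below is a stand-in built for illustration, using the first three `m1lenmin` values shown above so that it runs without the data file.

```python
import pandas as pd

# Miniature stand-in for the real table, built by hand for illustration
mini = pd.DataFrame({"challengeID": [1, 2, 3], "m1lenmin": [40, 40, 35]})

# set_index returns a new DataFrame; the original is left unchanged
indexed = mini.set_index("challengeID")

row = indexed.loc[3]               # look up the row whose challengeID is 3
print(row["m1lenmin"])

# reset_index turns the index back into an ordinary column
restored = indexed.reset_index()
print(list(restored.columns))
```

Because `set_index` returns a new table, the notebook assigns the result back to `data_frame`.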


# Other libraries

Some of the libraries popular in AI research are:

For data manipulation:
* **numpy** - for efficient math and dealing with operations on tables and matrices
* **pandas** - built on top of numpy; adds labelled rows and columns, making tables more readable and similar to data frames in the R programming language
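To see how the two libraries relate, here is a small sketch that moves the same numbers between a numpy array and a pandas DataFrame. The array values and the column names `x` and `y` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

arr = np.array([[1.0, 2.0], [3.0, 4.0]])

# Wrap the numpy array in a DataFrame to get named columns
frame = pd.DataFrame(arr, columns=["x", "y"])

col_means_np = arr.mean(axis=0)   # numpy: columns are addressed by position
col_means_pd = frame.mean()       # pandas: result is labelled by column name

back = frame.to_numpy()           # convert back to a plain numpy array

print(col_means_np)
print(col_means_pd)
```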

For visualization:
* **matplotlib** - plotting

We will not be using the following libraries, but you will probably use them in the future:

For Machine Learning calculations:
* TensorFlow - offering efficient implementations of Machine Learning algorithms, such as Neural Networks
* PyTorch
* Caffe