sources:

https://pandas.pydata.org/pandas-docs/stable/10min.html

https://cloudxlab.com/blog/numpy-pandas-introduction/

https://docs.scipy.org/doc/numpy/user/quickstart.html

# Python Libraries
Python libraries are collections of functions written by other people and open sourced for general use. You do not need to worry about how a function is implemented - you can just call it according to the *library spec*. Some of the libraries popular in AI research are:

For data manipulation:
* **numpy** - for efficient math and operations on arrays and matrices
* **pandas** - labeled tables built on top of numpy; more readable, and similar to data frames in the R programming language

For visualization:
* **matplotlib** - plotting

We will not be using the following libraries, but you will probably use them in the future:

For Machine Learning calculations:
* TensorFlow - offers efficient implementations of Machine Learning algorithms, such as Neural Networks
* PyTorch
* Caffe

## Importing a library

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

What we need to learn from pandas:
* how to make 2D arrays
* how indexing works
* how to add two rows
* find mean of a row
* display the mean, median of a row, histogram it
* how to flip axes
* how to access row/column with a specific name
* how to convert a column into indices

# Numpy
NumPy stands for ‘Numerical Python’ or ‘Numeric Python’. It is an open source Python module which provides fast mathematical computation on arrays and matrices. Since arrays and matrices are an essential part of the Machine Learning ecosystem, NumPy, together with modules like Scikit-learn, Pandas, Matplotlib and TensorFlow, makes up the core of the Python Machine Learning ecosystem.

NumPy provides essential multi-dimensional array functionality designed for high-level mathematical functions and scientific computation. NumPy can be imported into the notebook using:

In [1]:
import numpy as np

### Creating arrays
The most basic way to create an array is to pass a list (or a list of lists for a 2D array) to the np.array() function:

In [23]:
a = np.array([1, 2, 3])
print("a:\n", a)
b = np.array([[1, 2, 3],[4, 5, 6]])
print("b:\n", b)

a:
 [1 2 3]
b:
 [[1 2 3]
 [4 5 6]]


### Generating pre-filled arrays: fill all values with 0, 1 or a predefined value

You can look up the documentation (input arguments) for different numpy functions by googling their names. For example, np.ones is defined here: https://docs.scipy.org/doc/numpy/reference/generated/numpy.ones.html

```
numpy.ones(shape, dtype=None, order='C')
```

The arguments which have default values in the definition (such as dtype=None) are optional, so you can leave them out. The only thing you need to provide is the shape argument.

The definition states:
```
shape : int or sequence of ints

Shape of the new array, e.g., (2, 3) or 2.
```

In [24]:
ones_array = np.ones((3,4))
print(ones_array)

[[ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]]


In [25]:
zeros_array = np.zeros((2,4))
print(zeros_array)

[[ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]]
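
The heading above also mentions filling an array with a predefined value, which is not shown in the cells so far. A minimal sketch, assuming you want every entry to be 7 (the value and the variable names are only for illustration), uses np.full:

```python
import numpy as np

# np.full(shape, fill_value) builds an array where every entry equals fill_value
sevens = np.full((2, 3), 7)
print(sevens)

# the optional dtype argument overrides the default element type (float for np.ones)
ones_int = np.ones((2, 2), dtype=int)
print(ones_int.dtype)
```

Like np.ones and np.zeros, np.full takes the shape as its first argument.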


Some of the important attributes and methods of a NumPy array are:

* shape: a tuple of integers giving the size of the array along each dimension
* size: the total number of elements in the array
* dtype: the type of the elements in the array, e.g. int64 or float64
* reshape(): a method that returns the same data arranged in a new shape

In [17]:
print(zeros_array.shape)
print(zeros_array.size)
print(zeros_array.dtype)

(2, 4)
8
float64


In [26]:
# resize changes the shape in place; reshape(8) would return a new array instead
zeros_array.resize(8)
zeros_array

array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])
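
The cell above uses resize, which modifies the array itself. The reshape method from the list above behaves differently: it leaves the original array untouched. A small sketch of that difference, using a fresh array z so it does not interfere with zeros_array:

```python
import numpy as np

z = np.zeros((2, 4))

# reshape returns the same values in a new shape and leaves z untouched
flat = z.reshape(8)
print(flat.shape)   # (8,)
print(z.shape)      # (2, 4)

# -1 asks numpy to infer that dimension from the total number of elements
tall = z.reshape(4, -1)
print(tall.shape)   # (4, 2)
```

The total number of elements has to stay the same, so an array of size 8 can become (8,), (4, 2) or (2, 4), but not (3, 3).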

## Indexing in numpy
NumPy array elements can be accessed using indexing and slicing. Indices in NumPy arrays start from 0. Below are some useful examples:
* A[2:5] selects items 2 to 4 (the end index is excluded)
* A[2::2] selects every second item from item 2 to the end
* A[::-1] returns the array in reverse order
* A[1:] selects everything from row 1 to the end

In [36]:
A = np.arange(30).reshape(5, 6)

In [37]:
A

array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29]])

Select element

In [49]:
A[0,0]

0

Select row

In [43]:
A[1]

array([ 6,  7,  8,  9, 10, 11])

In [44]:
A[-1]

array([24, 25, 26, 27, 28, 29])

In [48]:
A[2:4]

array([[12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23]])

Select column

In [46]:
A[:,0]

array([ 0,  6, 12, 18, 24])

In [47]:
A[:,2:4]

array([[ 2,  3],
       [ 8,  9],
       [14, 15],
       [20, 21],
       [26, 27]])

In [39]:
A[2::2]

array([[12, 13, 14, 15, 16, 17],
       [24, 25, 26, 27, 28, 29]])

In [40]:
A[::-1]

array([[24, 25, 26, 27, 28, 29],
       [18, 19, 20, 21, 22, 23],
       [12, 13, 14, 15, 16, 17],
       [ 6,  7,  8,  9, 10, 11],
       [ 0,  1,  2,  3,  4,  5]])

In [42]:
A[1:]

array([[ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29]])

Select the entries of the matrix with values above 20

In [51]:
A[A>20]

array([21, 22, 23, 24, 25, 26, 27, 28, 29])
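
The expression A>20 used above is itself an array of True/False values (a "mask"), and indexing with it keeps only the entries where the mask is True. The sketch below rebuilds A so it can run on its own, and shows one way to count matches and combine conditions:

```python
import numpy as np

A = np.arange(30).reshape(5, 6)

mask = A > 20            # boolean array with the same shape as A
print(mask.sum())        # number of entries above 20 -> 9
print(A[mask])           # the selected entries, flattened to 1D

# conditions can be combined element-wise with & (and) and | (or)
print(A[(A > 20) & (A % 2 == 0)])   # [22 24 26 28]
```

Note the parentheses around each condition: they are needed because & binds more tightly than the comparison operators.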

Updating a specific value in an array

In [56]:
A[4,5] = -3
A

array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, -3]])

### Flipping axes
You can flip ("transpose") an array using the .T attribute

In [59]:
A.T

array([[ 0,  6, 12, 18, 24],
       [ 1,  7, 13, 19, 25],
       [ 2,  8, 14, 20, 26],
       [ 3,  9, 15, 21, 27],
       [ 4, 10, 16, 22, 28],
       [ 5, 11, 17, 23, -3]])

In [103]:
# Take average etc.
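
The cell above is left as a placeholder. One possible way to fill it in, shown on a small example array B (not part of the lecture data) so the numbers are easy to check by hand:

```python
import numpy as np

B = np.array([[1, 2, 3],
              [4, 5, 6]])

print(B.mean())          # mean of all entries: 3.5
print(B.mean(axis=0))    # mean of each column: [2.5 3.5 4.5]
print(B.mean(axis=1))    # mean of each row: [2. 5.]
print(np.median(B[0]))   # median of the first row: 2.0
print(B[0] + B[1])       # element-wise sum of two rows: [5 7 9]
```

The axis argument says which dimension is collapsed: axis=0 averages down the rows (one result per column), axis=1 averages across the columns (one result per row).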

# Why NumPy and Pandas over regular Python lists?
In Python, a vector can be represented in many ways, the simplest being a regular Python list of numbers. Since Machine Learning requires lots of scientific calculations, it is much better to use NumPy’s ndarray, which provides a lot of convenient and optimized implementations of essential mathematical operations on vectors.

Vectorized operations run much faster than the same operations written as loops in Python. For example, when multiplying two 100 x 100 matrices, NumPy's vectorized implementation can be around two orders of magnitude faster than nested Python loops.
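
You can check the speed claim on your own machine with a rough timing sketch like the one below. The helper matmul_loops is written here only for illustration, and the exact timings will vary from computer to computer:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 100))
Y = rng.random((100, 100))

def matmul_loops(X, Y):
    """Multiply two matrices with explicit Python loops."""
    n, m, p = X.shape[0], X.shape[1], Y.shape[1]
    out = np.zeros((n, p))
    for i in range(n):
        for j in range(p):
            total = 0.0
            for k in range(m):
                total += X[i, k] * Y[k, j]
            out[i, j] = total
    return out

start = time.perf_counter()
slow = matmul_loops(X, Y)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = X @ Y            # vectorized matrix multiplication
numpy_seconds = time.perf_counter() - start

print("loops:", loop_seconds, "numpy:", numpy_seconds)
print("same result:", np.allclose(slow, fast))
```

Both versions compute the same matrix; only the time they take differs.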

# What is Pandas?
Like NumPy, Pandas is one of the most widely used Python libraries in data science. It provides high-performance, easy-to-use data structures and data analysis tools. Unlike the NumPy library, which provides objects for multi-dimensional arrays, Pandas provides an in-memory 2D table object called a DataFrame. It is like a spreadsheet with column names and row labels.

With 2D tables, Pandas can provide many additional features, such as creating pivot tables, computing columns based on other columns, and plotting graphs. Pandas can be imported into Python using:

In [6]:
import pandas as pd
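
As a quick preview of two of the features mentioned above (computed columns and pivot tables), here is a minimal sketch. The sales table and its column names are made up for illustration and are not part of the lecture data:

```python
import pandas as pd

# hypothetical toy data, not from the lecture dataset
sales = pd.DataFrame({
    "city": ["A", "A", "B", "B"],
    "year": [2016, 2017, 2016, 2017],
    "price": [2.0, 3.0, 4.0, 5.0],
    "units": [10, 20, 30, 40],
})

# a new column computed from two existing columns
sales["revenue"] = sales["price"] * sales["units"]

# pivot table: one row per city, one column per year
table = sales.pivot_table(index="city", columns="year", values="revenue", aggfunc="sum")
print(table)
```

How DataFrames are created and indexed is explained step by step in the cells that follow.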

Pandas is much more similar to Excel than numpy is and displays tables in a nicely formatted way.

You initialize tables (called DataFrames in Pandas) similarly to numpy arrays:

In [78]:
s = pd.DataFrame([1,3,5,6,8])
s

,0
0,1
1,3
2,5
3,6
4,8


You can also give your rows and columns distinct names, which helps with data manipulation:

In [93]:
df = pd.DataFrame(np.random.randn(6,4), index=range(5,11), columns=['A','B','C','D'])

In [94]:
df

,A,B,C,D
5,0.470829,0.035857,0.101347,-0.585304
6,-0.442721,0.144462,-0.289245,-0.330571
7,-1.096155,-1.114756,0.565268,1.64422
8,0.420934,0.58205,0.415613,-0.077742
9,0.042786,2.05601,-1.617703,0.372624
10,-2.64611,0.78581,0.952019,1.656516


In [95]:
df["A"]

5     0.470829
6    -0.442721
7    -1.096155
8     0.420934
9     0.042786
10   -2.646110
Name: A, dtype: float64

In [96]:
df.shape

(6, 4)
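
Besides selecting a column with df["A"], you can select rows and single values by name. A small sketch, assuming a toy DataFrame called small with fixed values (so the results are predictable, unlike the random df above):

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(np.arange(12).reshape(3, 4),
                     index=[5, 6, 7], columns=["A", "B", "C", "D"])

print(small["B"])          # column named "B"
print(small.loc[6])        # row whose label is 6
print(small.iloc[0])       # first row by position (its label is 5)
print(small.loc[6, "C"])   # single value at row label 6, column "C" -> 6
print(small.loc[5] + small.loc[6])   # add two rows element-wise
```

The difference between the two is that .loc looks things up by label, while .iloc looks them up by position, counting from 0 as in numpy.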

Pandas also gives you a summary of the distribution of values in each column:

In [97]:
df.describe()

,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,-0.54174,0.414906,0.021216,0.446624
std,1.186529,1.040713,0.906289,0.984624
min,-2.64611,-1.114756,-1.617703,-0.585304
25%,-0.932797,0.063008,-0.191597,-0.267364
50%,-0.199968,0.363256,0.25848,0.147441
75%,0.326397,0.73487,0.527854,1.326321
max,0.470829,2.05601,0.952019,1.656516


If you want the description by rows, transpose the DataFrame first using the .T attribute:

In [99]:
df.T.describe()

,5,6,7,8,9,10
count,4.0,4.0,4.0,4.0,4.0,4.0
mean,0.005682,-0.229519,-0.000356,0.335214,0.213429,0.187059
std,0.438059,0.257615,1.349966,0.285933,1.505834,1.92612
min,-0.585304,-0.442721,-1.114756,-0.077742,-1.617703,-2.64611
25%,-0.119433,-0.358608,-1.100805,0.292274,-0.372337,-0.07217
50%,0.068602,-0.309908,-0.265443,0.418274,0.207705,0.868914
75%,0.193717,-0.180818,0.835006,0.461213,0.79347,1.128143
max,0.470829,0.144462,1.64422,0.58205,2.05601,1.656516
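
If you only need one statistic per row rather than the full description, transposing is not required. A sketch on a toy table called scores (made up for illustration) uses the axis argument instead:

```python
import pandas as pd

scores = pd.DataFrame({"A": [1.0, 4.0], "B": [2.0, 5.0], "C": [6.0, 9.0]},
                      index=["r1", "r2"])

print(scores.mean(axis=0))     # mean of each column
print(scores.mean(axis=1))     # mean of each row: r1 -> 3.0, r2 -> 6.0
print(scores.median(axis=1))   # median of each row: r1 -> 2.0, r2 -> 5.0
```

As in numpy, axis=0 gives one result per column and axis=1 gives one result per row.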


# Reading CSV files into pandas
CSV stands for "comma-separated values" - it is the format most often used to save tabular (Excel-like) data.

Reminder:
* ".." means "go up a directory"
* "." means "this directory"
* "~" means "my home directory" (the one that contains Desktop, Documents, etc.)
* "pwd" means "print the current working directory"
* "ls" means "list contents of directory"

In [65]:
pwd

'/Users/agataforyciarz/ai4all/Lectures'

In [7]:
background = "../../ai4all_data/background.csv"
data_frame = pd.read_csv(background, low_memory=False)

You can display the entire table by just typing its name. Alternatively, you can call the ".head()" method to show only the first five rows:

In [8]:
data_frame.head()

,challengeID,m1intmon,m1intyr,m1lenhr,m1lenmin,cm1twoc,cm1fint,cm1tdiff,cm1natsm,m1natwt,...,m4d9,m4e23,f4d6,f4d7,f4d9,m5c6,m5d20,m5k10,f5c6,k5f1
0,1,-3,,-3,40,,0,,,,...,-3.0,-3.0,-3.0,-3.0,-3.0,-3.0,-3.0,-3,-3.0,-3.0
1,2,-3,,0,40,,1,,,,...,-3.0,8.473318,-3.0,-3.0,-3.0,-3.0,9.845074,-3,-3.0,9.723551
2,3,-3,,0,35,,1,,,,...,-3.0,-3.0,9.097495,10.071504,-3.0,-3.0,-3.0,-3,-3.0,-3.0
3,4,-3,,0,30,,1,,,,...,-3.0,-3.0,9.512706,10.286578,-3.0,10.677285,-3.0,-3,8.522331,10.608137
4,5,-3,,0,25,,1,,,,...,-3.0,-3.0,11.076016,9.615958,-3.0,9.731979,-3.0,-3,10.115313,9.646466
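
If you do not have the data file at hand, you can try read_csv and head on a tiny made-up table. In this sketch the CSV text lives in a string (csv_text is a stand-in for a file on disk) and io.StringIO makes it look like a file to pandas:

```python
import io
import pandas as pd

# stand-in for a file on disk; pd.read_csv also accepts a file path
csv_text = "challengeID,x,y\n1,-3,\n2,0,8.5\n3,1,9.1\n"
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.head(2))    # first two rows instead of the default five
print(toy.shape)      # (3, 3)
print(toy.dtypes)     # one type per column
```

The empty field in the first data row is read as NaN, which is how missing values end up in the table shown above.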


You can switch between numpy arrays and pandas DataFrames easily. You may want to do this because:
* some functions are only available in numpy or pandas
* some libraries require the numpy format
* pandas is easier to look at


In [74]:
df_as_numpy = data_frame.values

In [75]:
df_as_numpy

array([[1, -3, nan, ..., -3, -3.0, -3.0],
       [2, -3, nan, ..., -3, -3.0, 9.723551464684661],
       [3, -3, nan, ..., -3, -3.0, -3.0],
       ..., 
       [4240, -3, nan, ..., -3, -3.0, -3.0],
       [4241, -3, nan, ..., -3, 10.6119807592936, 10.255825479727],
       [4242, -3, nan, ..., -3, 10.9727256002543, 9.566677741865739]], dtype=object)
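
The conversion also works in the other direction. The sketch below uses a small made-up table called frame, and the to_numpy() method, which recent pandas versions offer as an alternative to the .values attribute used above:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]}, index=["x", "y"])

as_numpy = frame.to_numpy()   # same data as frame.values; row and column labels are dropped
print(type(as_numpy), as_numpy.dtype)

# going back: pass the array plus the labels you want to pd.DataFrame
back = pd.DataFrame(as_numpy, index=frame.index, columns=frame.columns)
print(back)
```

A numpy array holds a single element type, so columns of different types are converted to a common one. That is why the large array above shows dtype=object.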

Choose a column to be the row index:

In [9]:
data_frame = data_frame.set_index('challengeID')

In [10]:
data_frame

challengeID,m1intmon,m1intyr,m1lenhr,m1lenmin,cm1twoc,cm1fint,cm1tdiff,cm1natsm,m1natwt,cm1natsmx,...,m4d9,m4e23,f4d6,f4d7,f4d9,m5c6,m5d20,m5k10,f5c6,k5f1
1,-3,,-3,40,,0,,,,,...,-3.000000,-3.000000,-3.000000,-3.000000,-3.000000,-3.000000,-3.000000,-3,-3.000000,-3.000000
2,-3,,0,40,,1,,,,,...,-3.000000,8.473318,-3.000000,-3.000000,-3.000000,-3.000000,9.845074,-3,-3.000000,9.723551
3,-3,,0,35,,1,,,,,...,-3.000000,-3.000000,9.097495,10.071504,-3.000000,-3.000000,-3.000000,-3,-3.000000,-3.000000
4,-3,,0,30,,1,,,,,...,-3.000000,-3.000000,9.512706,10.286578,-3.000000,10.677285,-3.000000,-3,8.522331,10.608137
5,-3,,0,25,,1,,,,,...,-3.000000,-3.000000,11.076016,9.615958,-3.000000,9.731979,-3.000000,-3,10.115313,9.646466
6,-3,,0,25,,1,,,,,...,8.515700,10.558813,-3.000000,-3.000000,7.022328,-3.000000,10.564085,-3,-3.000000,10.255825
7,-3,,0,35,,1,,,,,...,-3.000000,-3.000000,9.660643,9.861125,-3.000000,10.991854,-3.000000,-3,10.972726,10.859800
8,-3,,1,10,,1,,,,,...,-3.000000,10.558813,-3.000000,-3.000000,-3.000000,-3.000000,-3.000000,-3,-3.000000,-3.000000
9,-3,,0,30,,1,,,,,...,-3.000000,-3.000000,11.689877,9.373199,-3.000000,8.194868,-3.000000,-3,9.842380,9.566678
10,-3,,0,33,,1,,,,,...,-3.000000,-3.000000,-3.000000,-3.000000,-3.000000,-3.000000,10.564085,-3,-3.000000,10.105870
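
Once a column has become the index, you can look rows up by its values. The sketch below uses a toy table named people with made-up values to show set_index, a lookup, and reset_index, which undoes the change:

```python
import pandas as pd

# toy values for illustration only
people = pd.DataFrame({"challengeID": [1, 2, 3], "m1lenmin": [40, 40, 35]})

indexed = people.set_index("challengeID")   # returns a new frame; people is unchanged
print(indexed.loc[3, "m1lenmin"])           # look up by challengeID -> 35

restored = indexed.reset_index()            # turn the index back into a regular column
print(list(restored.columns))
```

Because set_index returns a new DataFrame, the cell above assigns the result back to data_frame.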


# What are the NaN values?
NaN (Not a Number) entries appear in real-world datasets very often, usually signifying missing data. NaNs are also produced by undefined operations, such as dividing zero by zero, or when a non-numerical value is coerced to a number.
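
A few lines are enough to see these behaviours for yourself. This is only a sketch: the variable names are arbitrary, and pd.to_numeric with errors="coerce" is one of several ways a non-numerical value can turn into NaN:

```python
import numpy as np
import pandas as pd

print(np.nan == np.nan)          # False: NaN is not equal to anything, even itself
print(np.isnan(np.nan))          # True: use isnan to test for NaN

# 0/0 is undefined, so numpy returns NaN (errstate silences the warning)
with np.errstate(invalid="ignore"):
    ratio = np.array([0.0]) / np.array([0.0])
print(ratio)

# values that cannot be parsed as numbers become NaN when errors="coerce"
parsed = pd.to_numeric(pd.Series(["1.5", "abc"]), errors="coerce")
print(parsed)
```

Because NaN never equals itself, comparisons like x == np.nan cannot be used to find missing values.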

# What should we do about missing values?
Some researchers simply discard data samples where NaN values are present. This is problematic, because in relatively small datasets, this means getting rid of a large portion of the data.

The alternative solution is to *impute* - or fill in - missing data points. However, correct imputation requires advanced statistical knowledge. Sometimes, the average of a given column is used to replace NaN values. Other times, values are copied from other rows which have similar entries in the non-missing columns (the K Nearest Neighbors algorithm).

During this project, we will use three ways of dealing with missing data:
* removing columns that contain NaNs (simplest)
* filling in average values of the column (potentially making a simplifying assumption)
* using the *K Nearest Neighbors (KNN)* algorithm described above
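
To make the second option concrete, here is a minimal sketch of mean imputation on a toy table (the values are made up). It relies on the fact that mean() skips NaNs by default:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

column_means = toy.mean()           # one mean per column, ignoring NaNs
filled = toy.fillna(column_means)   # each NaN is replaced by the mean of its own column
print(filled)
```

The missing entry in column "a" becomes 2.0, the average of 1.0 and 3.0. The simplifying assumption is that a missing value looks like a typical value of that column.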

## 1. Removing NaN values
We can either remove all the columns (*features*) that contain NaNs, or all the rows (*data points*) that contain NaNs. Let's first identify how many columns and rows do.

In [11]:
# DataFrames have no isnan() method; isna() returns a True/False table marking missing entries
data_frame.isna()

In [None]:
#married[married==True].index.values
#array[~np.isnan(bg.cm1age)]
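
The counting and removal steps described in this section can be sketched on a toy table as follows. The table and variable names are made up for illustration; the same calls should work on data_frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                    "b": [4.0, 5.0, 6.0],
                    "c": [np.nan, np.nan, 9.0]})

nan_mask = toy.isna()                       # True wherever a value is missing
print(nan_mask.sum())                       # NaN count per column
print(nan_mask.any(axis=0).sum())           # number of columns with any NaN -> 2
print(nan_mask.any(axis=1).sum())           # number of rows with any NaN -> 2

rows_kept = toy.dropna(axis=0)              # drop rows containing NaNs
cols_kept = toy.dropna(axis=1)              # drop columns containing NaNs
print(rows_kept.shape, cols_kept.shape)     # (1, 3) (3, 1)
```

Comparing the two shapes shows how much data each choice throws away, which is worth checking before deciding between them.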