# Libraries and Data management in Python with Jupyter
There is an abundance of libraries that extend the language.

Useful libraries for statistics and data science:
- NumPy: provides array objects and fast mathematical operations on them
- pandas: provides new data structures, notably in tabular form (the pandas DataFrame), and analysis tools
- SciPy: provides tools for numerical and scientific computing
- Matplotlib: provides visualization tools
- seaborn: builds on Matplotlib's visualization tools
- statsmodels: implements many statistical techniques

Documentation for all these libraries is readily available, and Jupyter makes it easy to access: type ? before or after a name (an IPython help feature) to display its docstring.
Documentation for the Python language itself is available at www.python.org.

In [3]:
# When importing libraries, it is common practice to use conventional aliases
import numpy as np
import pandas as pd
?np

Type:        module
String form: <module 'numpy' from '/home/bdiethelmv/.local/lib/python3.8/site-packages/numpy/__init__.py'>
File:        ~/.local/lib/python3.8/site-packages/numpy/__init__.py
Docstring:
NumPy
=====

Provides
  1. An array object of arbitrary homogeneous items
  2. Fast mathematical operations over arrays
  3. Linear Algebra, Fourier Transforms, Random Number Generation

How to use the documentation
----------------------------
Documentation is available in two forms: docstrings provided
with the code, and a loose standing reference guide, available from
`the NumPy homepage <https://www.scipy.org>`_.

We recommend exploring the docstrings using
`IPython <https://ipython.org>`_, an advanced Python shell with
TAB-completion and introspection capabilities.  See below for further
instructions.

The docstring examples assume that `numpy` has been imported as `np`::

  >>> import numpy as np

Code snippets are indicated by thre
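
The ? syntax only works inside IPython or Jupyter. As a rough sketch of how the same docstrings can be reached from plain Python, the built-in `__doc__` attribute and `help()` can be used. The variable names below are illustrative only.

```python
import numpy as np

# Every documented object carries its docstring in the __doc__ attribute.
doc = np.mean.__doc__

# Show only the first non-empty line, which is the one-sentence summary.
summary = doc.strip().splitlines()[0]
print(summary)

# help(np.mean) would print the full docstring to the console,
# which is close to what ?np.mean shows in Jupyter.
```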

## Using libraries
Prefix the function name with the library alias and a dot, for example np.mean

In [5]:
a = np.array([1,2,3,4,5])
np.mean(a)

3.0
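
As a small sketch of the same pattern, other NumPy functions can be called through the np alias in the same way. The array is the one from the cell above; the result variable names are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

mean_a = np.mean(a)      # arithmetic mean
median_a = np.median(a)  # middle value of the sorted data
total_a = np.sum(a)      # sum of all elements
std_a = np.std(a)        # population standard deviation (ddof=0 by default)

print(mean_a, median_a, total_a, std_a)

# Many functions are also available as array methods.
print(a.mean() == np.mean(a))
```

Note that np.std divides by n by default; passing ddof=1 gives the sample standard deviation instead.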

## Data management
The pandas library has become the standard way of importing tabular data. Once imported, it can be visualized and transformed with the built-in tools

In a pandas dataframe, rows are cases and columns are variables

Note: in the source file, missing values may appear as empty cells, NA, NULL, or NaN. By default, read_csv converts all of these to NaN.

Pandas' "read_xxx" functions allow reading a variety of file formats. The most common is:

    read_csv()
    
Let's import a dataset

In [6]:
# Import the data
df = pd.read_csv('cartwheeldata.csv')

# View it. Use the head method to view only the first 5 rows
df.head()

Unnamed: 0,ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score
0,1,56,F,1,Y,1,62.0,61.0,79,Y,1,7
1,2,26,F,1,Y,1,62.0,60.0,70,Y,1,8
2,3,33,F,1,Y,1,66.0,64.0,85,Y,1,7
3,4,39,F,1,N,0,64.0,63.0,87,Y,1,10
4,5,27,M,2,N,0,73.0,75.0,72,N,0,4
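
To see how read_csv treats the missing-value markers mentioned earlier, here is a minimal sketch that reads a small in-memory CSV instead of a file. The three rows are hypothetical and exist only to show the markers; they are not taken from cartwheeldata.csv.

```python
import io
import pandas as pd

# Hypothetical sample: an empty cell, NA, NULL, and NaN all mark missing data.
csv_text = """ID,Age,Height
1,56,62.0
2,,NA
3,NULL,NaN
"""

# read_csv accepts any file-like object, so StringIO stands in for a file.
demo = pd.read_csv(io.StringIO(csv_text))

# isna() flags missing cells; summing counts them per column.
missing = demo.isna().sum()
print(demo)
print(missing)
```

Because a column with missing values cannot stay an integer column by default, Age is read as float64 here.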


In [7]:
# Entire dataset
df

Unnamed: 0,ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score
0,1,56,F,1,Y,1,62.0,61.0,79,Y,1,7
1,2,26,F,1,Y,1,62.0,60.0,70,Y,1,8
2,3,33,F,1,Y,1,66.0,64.0,85,Y,1,7
3,4,39,F,1,N,0,64.0,63.0,87,Y,1,10
4,5,27,M,2,N,0,73.0,75.0,72,N,0,4
5,6,24,M,2,N,0,75.0,71.0,81,N,0,3
6,7,28,M,2,N,0,75.0,76.0,107,Y,1,10
7,8,22,F,1,N,0,65.0,62.0,98,Y,1,9
8,9,29,M,2,Y,1,74.0,73.0,106,N,0,5
9,10,33,F,1,Y,1,63.0,60.0,65,Y,1,8


# Dataframe housekeeping

In [8]:
# Use the columns attribute to list the variable names
df.columns

Index(['ID', 'Age', 'Gender', 'GenderGroup', 'Glasses', 'GlassesGroup',
       'Height', 'Wingspan', 'CWDistance', 'Complete', 'CompleteGroup',
       'Score'],
      dtype='object')

In [16]:
# Use dtypes to see the data types of your variables
df.dtypes

ID                 int64
Age                int64
Gender            object
GenderGroup        int64
Glasses           object
GlassesGroup       int64
Height           float64
Wingspan         float64
CWDistance         int64
Complete          object
CompleteGroup      int64
Score              int64
dtype: object

In [19]:
# Use shape to see the number of rows and columns
df.shape

(25, 12)

In [17]:
# Retrieve unique values for a column
df.Complete.unique()

array(['Y', 'N'], dtype=object)
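
When the frequency of each value matters as well as the set of values, value_counts is a natural companion to unique. The sketch below rebuilds the Complete column from the ten rows displayed earlier, so the counts cover only those rows, not the full 25-row dataset.

```python
import pandas as pd

# Complete column for the ten rows shown in the table output above.
demo = pd.DataFrame(
    {"Complete": ["Y", "Y", "Y", "Y", "N", "N", "Y", "Y", "N", "Y"]}
)

values = demo["Complete"].unique()        # distinct values, in order of appearance
counts = demo["Complete"].value_counts()  # how often each value occurs

print(values)
print(counts)
```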

In [18]:
# Compare variables with groupby, especially to check if certain categories match
df.groupby(['Gender', 'GenderGroup']).size()

Gender  GenderGroup
F       1              12
M       2              13
dtype: int64

The output above confirms that the two columns portray essentially the same information

# Dataframe indexing
Select rows and columns with loc and iloc. (The older ix indexer was deprecated in pandas 0.20 and removed in pandas 1.0.)

## loc

In [9]:
# .loc selects by label (string or numeric). The first argument selects rows, the second selects columns.
# Slices can be used, and with .loc both endpoints are included. Use a bare colon for all rows or columns
df.loc[:, ['GenderGroup']]

Unnamed: 0,GenderGroup
0,1
1,1
2,1
3,1
4,2
5,2
6,2
7,1
8,2
9,1


In [10]:
# Only first row
df.loc[0, :]

ID                  1
Age                56
Gender              F
GenderGroup         1
Glasses             Y
GlassesGroup        1
Height           62.0
Wingspan         61.0
CWDistance         79
Complete            Y
CompleteGroup       1
Score               7
Name: 0, dtype: object

In [12]:
# Grab a selection: rows up to and including label 6, and two chosen columns
df.loc[:6, ['GenderGroup', 'Height']]

Unnamed: 0,GenderGroup,Height
0,1,62.0
1,1,62.0
2,1,66.0
3,1,64.0
4,2,73.0
5,2,75.0
6,2,75.0


In [20]:
# Use loc to grab all values of the Gender column and store them in a new variable
a = df.loc[:, 'Gender']
a

0     F
1     F
2     F
3     F
4     M
5     M
6     M
7     F
8     M
9     F
10    M
11    F
12    F
13    F
14    M
15    M
16    F
17    M
18    M
19    F
20    M
21    M
22    M
23    M
24    F
Name: Gender, dtype: object

In [21]:
# An equivalent, simpler method is to access the column directly. Pandas assumes all observations are wanted
b = df['Gender']
b

0     F
1     F
2     F
3     F
4     M
5     M
6     M
7     F
8     M
9     F
10    M
11    F
12    F
13    F
14    M
15    M
16    F
17    M
18    M
19    F
20    M
21    M
22    M
23    M
24    F
Name: Gender, dtype: object

In [23]:
# In both cases, we can apply math and functions to the extracted column (df['GenderGroup'].max() is the equivalent pandas method)
max(df['GenderGroup'])

2
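
An extracted column is a pandas Series, which has its own methods and supports element-wise arithmetic and comparisons. A minimal sketch, using the Height values of the first five rows shown earlier (variable names are illustrative):

```python
import pandas as pd

# Height values from the first five rows of the dataset.
height = pd.Series([62.0, 62.0, 66.0, 64.0, 73.0], name="Height")

tallest = height.max()    # Series method, equivalent to max(height)
average = height.mean()   # arithmetic mean of the column

# Comparisons are applied element by element and return a boolean Series.
above_average = height > average

print(tallest, average)
print(above_average.sum())  # number of rows above the mean
```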

## iloc

In [13]:
# .iloc selects by integer position, NOT by label

In [24]:
# Slice only the first row (starts at index 0!)
df.iloc[0,:]

ID                  1
Age                56
Gender              F
GenderGroup         1
Glasses             Y
GlassesGroup        1
Height           62.0
Wingspan         61.0
CWDistance         79
Complete            Y
CompleteGroup       1
Score               7
Name: 0, dtype: object

In [14]:
# Indices from start to 6, with 6 not included
df.iloc[:6, :]

Unnamed: 0,ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score
0,1,56,F,1,Y,1,62.0,61.0,79,Y,1,7
1,2,26,F,1,Y,1,62.0,60.0,70,Y,1,8
2,3,33,F,1,Y,1,66.0,64.0,85,Y,1,7
3,4,39,F,1,N,0,64.0,63.0,87,Y,1,10
4,5,27,M,2,N,0,73.0,75.0,72,N,0,4
5,6,24,M,2,N,0,75.0,71.0,81,N,0,3


In [15]:
# .iloc does not accept string labels, only integer positions
df.iloc[1:6, ['GenderGroup']]

IndexError: .iloc requires numeric indexers, got ['GenderGroup']
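
One way around this error, sketched below, is to translate the column name into its integer position with columns.get_loc before calling iloc. The example also contrasts the slicing rules of the two indexers. The small frame reuses the GenderGroup and Height values of the first seven rows; the variable names are illustrative.

```python
import pandas as pd

demo = pd.DataFrame({
    "GenderGroup": [1, 1, 1, 1, 2, 2, 2],
    "Height": [62.0, 62.0, 66.0, 64.0, 73.0, 75.0, 75.0],
})

# .loc slices by label and includes the end label: rows 0, 1, 2, 3.
by_label = demo.loc[:3, ["GenderGroup"]]

# .iloc slices by position and excludes the end position: rows 0, 1, 2.
by_position = demo.iloc[:3, [0]]

# Convert the column name to its integer position, then use it with .iloc.
col_pos = demo.columns.get_loc("GenderGroup")
subset = demo.iloc[1:6, [col_pos]]

print(len(by_label), len(by_position))
print(subset)
```

With the default integer index, labels and positions coincide, which is why the difference between loc and iloc only shows up in how the slice endpoint is treated.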