# NumPy and Pandas

- Our next lecture will go over the most commonly used packages for handling data.
- Parts of this lecture were adapted from *VanderPlas, Jake. Python Data Science Handbook: Essential Tools for Working with Data. O'Reilly Media, Inc., 2016.*
- **If you have any questions over the course of this lecture, please post them to the 'Day 2 Lecture Questions' assignment on the Canvas course page.**

## What is NumPy: an alternative to lists

- NumPy objects are arrays: multidimensional objects that are analogous to the Python list.
- NumPy stands for Numerical Python and, as it sounds, is used for computation on array objects.
- There are many advantages to using NumPy over lists:
    - NumPy can read and write array data.
    - You can do quick math functions without all the for loops.
    - You can do linear algebra and create random numbers.


In [None]:
import numpy as np

an_array = np.arange(1000) # this is NumPy's version of range

print(an_array)

In [None]:
a_list = list(range(1000))

print(a_list)

# we can compare how quickly each object performs tasks

## Speed test: Lists vs. NumPy

In [None]:
%%time
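for _ in an_array:  # repeat the whole-array multiplication 1000 times
    an_array2 = an_array * 2

# 6.98 milliseconds to complete when this was recorded (timings vary by machine)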

In [None]:
%%time
for _ in a_list:  # repeat the list comprehension 1000 times
    a_list2 = [x * 2 for x in a_list]

# 118 milliseconds to complete when this was recorded (timings vary by machine)
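
The `%%time` cells above only work inside a notebook. As a rough sketch of the same comparison in plain Python, we can use the standard-library `timeit` module. The repetition count of 1000 is an assumption chosen to mirror the loops above, and the numbers you get will depend on your machine.

```python
import timeit

import numpy as np

n = 1000
an_array = np.arange(n)
a_list = list(range(n))

# Time 1000 repetitions of doubling every element, once per container type.
array_seconds = timeit.timeit(lambda: an_array * 2, number=1000)
list_seconds = timeit.timeit(lambda: [x * 2 for x in a_list], number=1000)

print(f"NumPy array: {array_seconds:.4f} s")
print(f"Python list: {list_seconds:.4f} s")

# Both approaches produce the same values.
doubled_array = an_array * 2
doubled_list = [x * 2 for x in a_list]
```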

In [None]:
a_nump = np.array([[1,2,3],[4,5,6]])
a_nump

In [None]:
a_nother = np.array([['one','two','three'],['four','five','six']])
a_nother

## NumPy descriptives


- NumPy arrays can have any number of dimensions. A 2-dimensional array is like a list of lists (just as we can have a dictionary of dictionaries).

- NumPy offers a lot of functionality.
- Here are some attributes of a NumPy array (called `npdata` here) that describe your data:

    - **npdata.ndim**: the number of axes (dimensions) of the array.

    - **npdata.shape**: the dimensions of the array. This is a tuple of integers indicating the size of the array in each dimension. For a matrix with n rows and m columns, shape will be (n,m). The length of the shape tuple is therefore the number of axes, ndim.

    - **npdata.size**: the total number of elements of the array. This is equal to the product of the elements of shape.

    - **npdata.dtype**: an object describing the type of the elements in the array. One can create or specify dtypes using standard Python types. Additionally, NumPy provides types of its own; numpy.int32, numpy.int16, and numpy.float64 are some examples.
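
To make the four attributes above concrete, here is a small sketch. The array `npdata` and its values are made up purely for illustration.

```python
import numpy as np

# npdata is a hypothetical 2x3 array of floats used only for illustration.
npdata = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

print(npdata.ndim)   # 2 axes: rows and columns
print(npdata.shape)  # (2, 3): 2 rows, 3 columns
print(npdata.size)   # 6 elements in total (2 * 3)
print(npdata.dtype)  # float64
```

Note that none of these take parentheses: they are attributes, not methods.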



## NumPy attributes and methods

- Here is a list of useful functions:


| Operation |      Function call|           Description |
|-----------|-------------------|------------------------:|
| +         |   np.add          |   Addition               |
| -         |   np.subtract     |   Subtraction            |
| -         |   np.negative     |   Negation               |
| *         |   np.multiply     |   Multiplication         |
| /         |   np.divide       |   Division               |
| //        |   np.floor_divide |   Floor division         |
| `**`      |   np.power        |   Exponentiation         |
| %         |   np.mod          |   Modulus/remainder      |
| [0-x]     |   np.arange       |   Create a range of numbers |
| \|x\|     |   np.absolute     |   Absolute value         |
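
As a quick check of the table, each operator gives the same result as its function call. This is only a sketch; the arrays `a` and `b` are example values chosen for illustration.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[0, 4, 1], [7, 2, 12]])

# Each comparison is True when the operator and the function agree.
same_add = np.array_equal(a + b, np.add(a, b))
same_floor = np.array_equal(a // 2, np.floor_divide(a, 2))
same_power = np.array_equal(a ** 2, np.power(a, 2))
same_mod = np.array_equal(a % 2, np.mod(a, 2))
same_abs = np.array_equal(abs(-a), np.absolute(-a))

print(same_add, same_floor, same_power, same_mod, same_abs)
```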

In [None]:
#np.<TAB>

In [None]:
import numpy as np
arr = np.array([[ 1, 2, 3], [4, 5, 6]]) # this is a 2x3
arr

In [None]:
arr.shape

In [None]:
# NumPy add

1 + arr

In [None]:
arr

In [None]:
# NumPy subtract

3 - arr

In [None]:
# NumPy Negate

- arr

In [None]:
# NumPy Multiply

20* arr

In [None]:
# NumPy Divide

2/arr

## More NumPy features

- There are lots of convenient features of NumPy:

    - Compare arrays.
       
    - Indexing.

    - Built-in mean function.

    - Random number generator.

    - Sort function 

    - Read/write NumPy files.

    - Linear algebra.
- Like Python types, NumPy comes with a wide variety of attributes.

    - You can view these by typing `np.<TAB>`.

In [None]:
#make a new array
arr2 = np.array([[ 0, 4, 1], [7, 2, 12]])
arr2

In [None]:
arr

In [None]:
arr2

In [None]:
# add two arrays element-wise with np.add

np.add(arr,arr2)

In [None]:
# add two arrays with the + operator
# element-wise, same result as np.add
arr+arr2

## NumPy indexing

In [None]:
arr

In [None]:
#index array

arr[0]

In [None]:
#index array

arr[0][0]

In [None]:
# a 3-dimensional array

threeD = np.random.randint(10, size=(3, 4, 5))  #3x4x5

threeD


In [None]:
threeD.ndim

In [None]:
# indexing a 3d array
threeD[0]

In [None]:
threeD[0][0]

In [None]:
threeD[0][0][0]
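
The chained brackets above can also be written as a single index with commas, which additionally lets us slice along any axis. A minimal sketch, reusing the same 2x3 values as `arr`:

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

first_chained = arr[0][0]   # index the row, then the element
first_tuple = arr[0, 0]     # one indexing operation, same element

second_column = arr[:, 1]   # every row, column 1 -> array([2, 5])
last_row_tail = arr[1, 1:]  # row 1, columns 1 onward -> array([5, 6])

print(first_chained, first_tuple)
print(second_column)
print(last_row_tail)
```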

## More NumPy features

In [None]:
print(arr)
print(arr2)

In [None]:
# compare numpy arrays

arr > arr2

In [None]:
type(threeD)

In [None]:
# find the mean

threeD[0].mean()


In [None]:
# the mean for all levels

threeD.mean()

In [None]:
# create random numbers

arr3 = np.random.randint(low=1, high=100, size=4) # you can also choose the distribution of random numbers np.random.normal(size=x)
print(arr3)

In [None]:
# built-in sort method that sorts the array in place

arr3.sort()
print(arr3)

In [None]:
# Save a NumPy object
np.save('some_array', arr)


In [None]:
# Load a NumPy object

np.load('data/dem_load.npy') # forward slashes work on every operating system
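
To see saving and loading work together, here is a small round-trip sketch. It writes to a temporary folder so nothing is left behind; the folder and the variable names are assumptions made for illustration only.

```python
import tempfile
from pathlib import Path

import numpy as np

original = np.array([[1, 2, 3], [4, 5, 6]])

with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "some_array"
    np.save(target, original)                 # NumPy appends the .npy extension
    saved_path = target.with_suffix(".npy")
    file_was_written = saved_path.exists()
    restored = np.load(saved_path)            # read the array back

print(file_was_written)
print(restored)
```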

In [None]:
# linear algebra
x = np.array([[ 1, 2, 3], [4, 5, 6]])
y = np.array([[ 6, 23], [-1, 7], [8, 9]])

x.dot(y)
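
The dot product above multiplies a 2x3 array by a 3x2 array, so the result is 2x2. As a sketch, the `@` operator performs the same matrix multiplication:

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])        # shape (2, 3)
y = np.array([[6, 23], [-1, 7], [8, 9]])    # shape (3, 2)

product = x @ y                             # same as x.dot(y)

print(product.shape)  # (2, 2)
print(product)
# [[ 28  64]
#  [ 67 181]]
```

The inner dimensions must match (3 and 3 here), otherwise NumPy raises an error.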



## Pandas dataframes

- NumPy arrays are good to know, but the most useful package for data scientists will be pandas.
- If you had only one column of data, it would be known as a pandas Series.
- However, most often we have a DataFrame with multiple columns.
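
A quick sketch of how the two relate: pulling a single column out of a DataFrame gives back a Series. The column names and values here are made up for illustration.

```python
import pandas as pd

# A hypothetical two-column DataFrame.
frame = pd.DataFrame({"height": [150, 160, 170], "weight": [50, 60, 70]})

one_column = frame["height"]

print(type(frame).__name__)       # DataFrame
print(type(one_column).__name__)  # Series
```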

### Creating Data

In [None]:
# a series
import pandas as pd

a_series = pd.Series([ 4, 7, -5, 3])
print(a_series)


In [None]:
# a list of lists
names = [
    ['Dominique','Lockett','M'],
    ['Jordan','Mroz','M'],
    ['Jesse','Woollems','J']
    ]

print(names)

In [None]:
# turn it into a pandas DataFrame
pd.DataFrame(names)

In [None]:
# add column names and assign the data frame to a variable

data = pd.DataFrame(names, columns=['First','Last','Middle'])
data

In [None]:
data.size # no parentheses here

In [None]:
# another dataframe

# When you format your data like a dictionary, pandas automatically knows the column names
allergies = {
    'Jasmihn' : {'bananas': ['itching','angioedema'], 'peanuts': ['anaphylaxis','angioedema','hives']},
    'Joe':{'pollen': ['itching','sneezing'], 'milk':['hives','vomiting','indigestion']},
    'Sally':{'soy': ['stomach cramps'], 'shellfish':['anaphylaxis','angioedema','hives']}}

pnda_all = pd.DataFrame(allergies) # we can convert our dictionaries into neat pandas DataFrames
print(pnda_all)

### Basic Functions

In [None]:
# Concatenate
my_data1 = pd.DataFrame({'key1': ['green', 'green', 'red'], 'key2': [' one', 'two', 'one'], 'data1': [1, 2, 3], 'data2': [9,10,11]}) # the second data column is named data2; a repeated 'data1' key would overwrite the first

my_data2 = pd.DataFrame({'key1': ['green', 'green', 'red', 'red'],  'key2': [' one', 'one', 'one', 'two'], 'data1': [14, 25, 16, 17]})

my_data3 = pd.DataFrame({'key1': ['red', 'green', 'red', 'red'],  'key2': [' two', 'one', 'one', 'two'], 'data1': [24, 53, 26, 73], 'data2': [42, 52, 62, 27]})



In [None]:
print(my_data1)

print(my_data3)

In [None]:
pd.concat([my_data1, my_data2, my_data3]) # notice though the indexing is not ideal

In [None]:
pd.concat([my_data1, my_data2, my_data3], ignore_index = True) 
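
Two things are worth noticing when concatenating: the row labels repeat unless we pass `ignore_index=True`, and a column missing from one DataFrame is filled with `NaN`. A minimal sketch with two small made-up DataFrames (`left` and `right` are hypothetical names):

```python
import pandas as pd

left = pd.DataFrame({"key": ["green", "red"], "data1": [1, 2]})
right = pd.DataFrame({"key": ["red"], "data1": [3], "data2": [30]})

stacked = pd.concat([left, right])
renumbered = pd.concat([left, right], ignore_index=True)

print(list(stacked.index))                  # [0, 1, 0]
print(list(renumbered.index))               # [0, 1, 2]
print(renumbered["data2"].isna().tolist())  # [True, True, False]
```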

In [None]:
# create random number and view the first few values

long_series = pd.Series(np.random.randn(100)) # randn = random normal distribution
long_series.head()

In [None]:

long_series.tail()

In [None]:
# more attributes and methods can be explored using the normal <TAB> completion
#long_series.
long_series.abs() # call the method to get the absolute values

In [None]:
list('ABCD')

In [None]:
# more on making your own pandas object

df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD')) # np.random.randint(low, high, size=(number of rows, number of columns)); high is excluded
df

In [None]:
#Transpose your data
pnda_all.T

# the construction of our dictionary wasn't the most reasonable way to make the desired dataframe

### Sort data 

In [None]:
# sort 

# We can change the order of the columns

df.sort_index(axis=1, ascending =False)

In [None]:
# or the rows
df.sort_index(axis=0, ascending = False)

In [None]:
# we can sort by a certain column; see the row order changes for all
df.sort_values(by='B')
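
Since `df` holds random numbers, the effect of sorting can be hard to see. Here is a sketch on a tiny made-up DataFrame (`people` is a hypothetical name) where the result is easy to check by eye:

```python
import pandas as pd

people = pd.DataFrame({"B": [3, 1, 2], "A": ["x", "y", "z"]})

by_value = people.sort_values(by="B")      # rows reordered by column B
by_columns = people.sort_index(axis=1)     # columns reordered alphabetically

print(list(by_value.index))     # [1, 2, 0]: whole rows move together
print(list(by_columns.columns)) # ['A', 'B']
print(list(people.index))       # [0, 1, 2]: the original is unchanged
```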

If grouping is confusing with just numbers, look at [this example](https://www.geeksforgeeks.org/python-pandas-dataframe-groupby/), which I drew from and which uses basketball teams to explain.

## Other functions

In [4]:
# copy
df2 = df.copy() # make a new copy that doesn't impact the original

In [None]:
# new columns
New = [1]*33 + [2]*33 + [3]*33 + [4]
df2['NewCol'] = New

In [None]:
df2


In [None]:
df

In [None]:
df2 = df2.replace(42,0.00)
df2

In [None]:
#where allows us to keep only the values that meet certain conditions (the rest become NaN)
df2 = df2.where(df2<70)
df2

In [None]:
#dropna can be used on all of the data
df2.dropna()

In [None]:
#or we can drop na's depending on a certain column

df2.dropna(subset=["A"])
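
Here is a small sketch of how `where` and `dropna` work together. The DataFrame `scores` and its values are made up for illustration.

```python
import pandas as pd

scores = pd.DataFrame({"A": [10, 80, 30], "B": [75, 20, 40]})

masked = scores.where(scores < 70)          # values of 70 or more become NaN
complete_rows = masked.dropna()             # keep rows with no NaN at all
rows_with_a = masked.dropna(subset=["A"])   # only require a value in column A

print(masked)
print(list(complete_rows.index))  # [2]
print(list(rows_with_a.index))    # [0, 2]
```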

### Indexing

In [None]:
df

In [None]:
data

In [None]:
# use indexing like we have learned
data['First']

In [None]:
pnda_all

In [None]:
pnda_all['Joe']

### .loc attribute

In [None]:
data[0:3] # we can call on a range of rows

In [None]:

pnda_all['soy':'shellfish'] # call on a range of rows



In [None]:
#data[1] # but cannot call on  a single row

In [None]:
#pnda_all['soy'] # cannot just call on a single row

In [None]:
data.loc[0] # unless we add.loc

In [None]:
pnda_all.loc['soy']

In [None]:
# More indexing

df[0:3]


In [None]:
# cannot double index without loc
#df[:50,"A"]

In [None]:
#df[2,"C"]

In [None]:
df.loc[:50,"A"]

In [None]:
#df[:50,["A"]]

In [None]:
df.loc[:50, ['A','D']]

df.loc[2:3, ['C',"D"]]

In [None]:
df.loc[[2,13,12],'C']

In [None]:
df.loc[2,'C']
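
One detail that is easy to miss: a plain slice such as `df[0:3]` stops before the end position, while a `.loc` slice includes the end label. A minimal sketch on a small made-up DataFrame (`small` is a hypothetical name):

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(np.arange(40).reshape(10, 4), columns=list("ABCD"))

plain = small[0:3]          # positions 0, 1, 2 (end excluded)
labelled = small.loc[0:3]   # labels 0, 1, 2, 3 (end included)
cell = small.loc[2, "C"]    # row label 2, column C -> 10

print(len(plain))     # 3
print(len(labelled))  # 4
print(cell)           # 10
```

This is why `df.loc[:50, "A"]` above returns 51 values rather than 50.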

## Attributes of columns and rows (index)

In [None]:
# Calling columns and rows.
# Let's say we want to change a column name. There are a couple of ways

df.A

In [None]:
# columns have their own set of attributes

#df.A.<TAB>

df.A.isin(range(0,50)) # we can see if the values of a column are between 0 and 49

In [None]:
# we can call on the columns of a dataset and change all their names

df.columns = ['This one', 'That one', 'And','The other']
df

In [None]:
# but now there are spaces in the column names so we can't call on them with a period

#df.This one

In [None]:
# instead we will have to go back to brackets

df['This one']

In [None]:
# and we can still use the column attributes
df['This one'].isin(range(20,30))

In [None]:
# change them all back
df.columns = ['A', 'B', 'C','D']


In [None]:
# we can also change individual columns and rows (indexes)
df.rename(columns={'A': 'a'}, index={1:'one'})

In [None]:
# or we can change multiple

# BUT REMEMBER: if you are not assigning this to a new variable the changes will not be saved
#notice how 1 is back to a number not the word

df.rename(columns={'B': 'b', 'C':'c'}, index={2:'two', 3:'three'})
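
To make the reminder above concrete, here is a sketch showing that `rename` returns a new DataFrame and leaves the original alone unless we assign the result. The DataFrame `table` is made up for illustration.

```python
import pandas as pd

table = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

table.rename(columns={"A": "a"})             # result is discarded
unchanged = list(table.columns)              # ['A', 'B']

renamed = table.rename(columns={"A": "a"})   # result is kept
kept = list(renamed.columns)                 # ['a', 'B']

print(unchanged)
print(kept)
```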

## Subsetting

In [None]:
# Boolean

df[df>20] # returns the normal dataframe but puts NaN where the condition is not true

In [None]:
# remove rows with missing values

df2 = df[df>20].dropna()

df2 # you have to assign this to a new variable to keep it

In [None]:
df2

In [None]:
# Keep only the rows where column A is greater than 50

df2 = df2[df2.A>50]
df2

In [None]:
df

In [None]:
# Keep only the rows where column B is less than 50

df[df.B < 50]

In [None]:
df.A.where(df.A >30)

In [None]:
new_df = df.copy().where(df.A >30).dropna()

In [None]:
new_df

In [None]:
# more cell wise selection

df[df.A != 86][0:10] # keep only the rows where column A is NOT 86, then show the first 10

In [None]:
df

In [None]:
# Combining NumPy and pandas

df['E'] = np.where(df['A']>=50, 'yes', 'no') # create a new column E that is 'yes' where column A is greater than or equal to 50 and 'no' otherwise
df
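
Because `df` is random, here is the same `np.where` idea on a tiny made-up DataFrame (`marks` is a hypothetical name) where we can check each row by hand:

```python
import numpy as np
import pandas as pd

marks = pd.DataFrame({"A": [12, 50, 87]})

# np.where(condition, value_if_true, value_if_false) works element by element.
marks["E"] = np.where(marks["A"] >= 50, "yes", "no")

print(marks["E"].tolist())  # ['no', 'yes', 'yes']
```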

### Create a new file

In [None]:
df.to_csv('File.csv')
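
By default `to_csv` also writes the row labels as an extra first column. If you do not want that, one option is `index=False`. The sketch below writes to a temporary folder and reads the file back with `pd.read_csv`; the file name and the DataFrame `out` are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

out = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "example.csv"
    out.to_csv(path, index=False)   # leave the row labels out of the file
    back = pd.read_csv(path)        # read the file back into a DataFrame

same = back.equals(out)
print(list(back.columns))
print(same)
```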