# Getting started with NumPy and Pandas  

In this notebook, we're going to go through some very common use cases for NumPy and Pandas libraries. These libraries are probably the most important when it comes to doing Data Science in Python, so it's important that we're all at least somewhat familiar with them!  

Pandas has two main data structures: `pd.Series` and `pd.DataFrame`. A `pd.Series` is one-dimensional, while a `pd.DataFrame` is two-dimensional with rows and columns (each column of a `pd.DataFrame` is a `pd.Series`).  
NumPy arrays are created with `np.array`, and the resulting object is actually an `np.ndarray` ("n-dimensional array"), as these can be multi-dimensional! If they're two-dimensional, you can think of them as regular matrices, but remember that an `np.ndarray` can be extended beyond 2 dimensions.
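
To make these relationships concrete, here is a minimal sketch. The names `toy_df` and `cube` and their values are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration
toy_df = pd.DataFrame({"height": [1.6, 1.8], "weight": [55, 80]})

# Selecting a single column of a DataFrame gives back a Series
col = toy_df["height"]
print(type(col).__name__)      # Series
print(toy_df.ndim, col.ndim)   # 2 1

# np.array() is the function; the object it returns is an np.ndarray
cube = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print(type(cube).__name__)     # ndarray
print(cube.ndim, cube.shape)   # 3 (2, 2, 2)
```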



In [3]:
import pandas as pd 
import numpy as np

## Let's first work with Pandas!  



In [4]:
pd.__version__ # what version of pandas do we have?

'2.0.3'

In [5]:
## let's create a series!

x = [3,6,9,12,15,18,21]
ser = pd.Series(x)
ser

0     3
1     6
2     9
3    12
4    15
5    18
6    21
dtype: int64

If nothing else is specified, the values are labeled with their index number (0, 1, 2, ...). The labels can be used to access a specific value.

In [7]:
ser[2]

9

In [8]:
x = [3,6,9,12,15,18,21]
ser = pd.Series(x,index = ["A","B","C","D","E","F","G"]) # but you can also give custom labels!
ser

A     3
B     6
C     9
D    12
E    15
F    18
G    21
dtype: int64

In [9]:
## and access the values using these new labels!

ser["A"]

3
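
Square brackets can mean "by label" or "by position" depending on the index, which can get confusing. As a small sketch (the `shifted` series is a made-up example), `.loc` always selects by label and `.iloc` always selects by position:

```python
import pandas as pd

ser = pd.Series([3, 6, 9], index=["A", "B", "C"])

print(ser.loc["B"])    # label-based access -> 6
print(ser.iloc[1])     # position-based access -> 6

# With integer labels that do not start at 0, the difference matters
shifted = pd.Series([3, 6, 9], index=[10, 20, 30])
print(shifted.loc[10])   # the value labeled 10 -> 3
print(shifted.iloc[0])   # the value at the first position -> 3
```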

In [10]:
# we can also create a series using a dictionary!

cal = {"day1": 420, "day2": 380, "day3": 390,"day4":410,"day5":320}
cal_ser = pd.Series(cal)
cal_ser

day1    420
day2    380
day3    390
day4    410
day5    320
dtype: int64

In [10]:
type(cal_ser)

pandas.core.series.Series

Now that we've seen how `pd.Series` works in pandas, let's get started with `pd.DataFrame`!

In [11]:
## let's first create a DataFrame using a dictionary!

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40]}

df = pd.DataFrame(data)
print(df)


      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3    David   40


In [12]:
type(df)

pandas.core.frame.DataFrame

In [14]:
## if you have a file with rows and columns, you can also read it in as a pd.DataFrame

df = pd.read_csv('data.txt', sep='\t') # this is a file we have saved; its columns are separated by a tab (\t), so we specify that

# default for this is comma-separated

In [15]:
## to get a quick overview of the data

df.head() # returns the first 5 rows

   Duration          Date  Pulse  Maxpulse  Calories
0        60  '2020/12/01'    110       130     409.1
1        60  '2020/12/02'    117       145     479.0
2        60  '2020/12/03'    103       135     340.0
3        45  '2020/12/04'    109       175     282.4
4        45  '2020/12/05'    117       148     406.0


In [18]:
list(df.columns) # returns the names of the columns in your DataFrame

['Duration', 'Date', 'Pulse', 'Maxpulse', 'Calories']

In [16]:
df.tail(2) # returns the last 2 rows (with no argument, the default is 5)

    Duration          Date  Pulse  Maxpulse  Calories
30        60  '2020/12/30'    102       129     380.3
31        60  '2020/12/31'     92       115     243.0


In [22]:
df.shape # tells us the size of our dataframe i.e. the number of rows and columns

(32, 5)

We can see that `df` has 32 rows and 5 columns - Duration, Date, Pulse, Maxpulse and Calories. Let's see what we can do with this!

In [23]:
## first we can get a quick summary of the data in our DataFrame

df.describe() # returns statistics for numerical columns only!

        Duration     Pulse   Maxpulse   Calories
count       32.0      32.0       32.0       30.0
mean     68.4375     103.5      128.5     304.68
std    70.039591  7.832933  12.998759  66.003779
min         30.0      90.0      101.0      195.1
25%         60.0     100.0      120.0      250.7
50%         60.0     102.5      127.5      291.2
75%         60.0     106.5     132.25    343.975
max        450.0     130.0      175.0      479.0
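
Notice that the `count` for `Calories` (30) is lower than for the other columns (32): `describe()` only counts non-missing entries, so this hints at two missing values. A small sketch of how one might check for missing values, using a made-up frame called `toy` rather than the real file:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value, for illustration only
toy = pd.DataFrame({"Pulse": [110, 117, 103],
                    "Calories": [409.1, np.nan, 340.0]})

# Number of missing values per column
missing = toy.isna().sum()
print(missing)

# describe() counts only the non-missing entries
counts = toy.describe().loc["count"]
print(counts)
```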


In [19]:
## we can select columns!

# Using square brackets
dates = df['Date']

# Using dot notation
durations = df.Duration


In [21]:
dates

0     '2020/12/01'
1     '2020/12/02'
2     '2020/12/03'
3     '2020/12/04'
4     '2020/12/05'
5     '2020/12/06'
6     '2020/12/07'
7     '2020/12/08'
8     '2020/12/09'
9     '2020/12/10'
10    '2020/12/11'
11    '2020/12/12'
12    '2020/12/12'
13    '2020/12/13'
14    '2020/12/14'
15    '2020/12/15'
16    '2020/12/16'
17    '2020/12/17'
18    '2020/12/18'
19    '2020/12/19'
20    '2020/12/20'
21    '2020/12/21'
22             NaN
23    '2020/12/23'
24    '2020/12/24'
25    '2020/12/25'
26        20201226
27    '2020/12/27'
28    '2020/12/28'
29    '2020/12/29'
30    '2020/12/30'
31    '2020/12/31'
Name: Date, dtype: object

In [22]:
# we can filter the data

# Selecting people that have a pulse greater than 125
pulse_more_than_125 = df[df['Pulse'] > 125]

# Combining conditions
specific_people = df[(df['Pulse'] > 100) & (df['Calories'] < 250)] # the & means 'and', for 'or' use |


In [23]:
pulse_more_than_125

    Duration          Date  Pulse  Maxpulse  Calories
23        60  '2020/12/23'    130       101     300.0


In [24]:
specific_people

    Duration          Date  Pulse  Maxpulse  Calories
8         30  '2020/12/09'    109       133     195.1
24        45  '2020/12/24'    105       132     246.0
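
A note on combining conditions: each comparison needs its own parentheses, because `&` and `|` bind more tightly than `>` and `<` in Python. Here is a short sketch on a made-up frame (`toy` and its values are assumptions for illustration) that also shows `~` for "not":

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({"Pulse": [98, 109, 130],
                    "Calories": [300.0, 195.1, 400.0]})

# & = and, | = or, ~ = not; each comparison is wrapped in parentheses
both = toy[(toy["Pulse"] > 100) & (toy["Calories"] < 250)]
either = toy[(toy["Pulse"] > 100) | (toy["Calories"] < 250)]
neither = toy[~((toy["Pulse"] > 100) | (toy["Calories"] < 250))]

print(len(both), len(either), len(neither))  # 1 2 1
```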


The above is only one way to filter the data in pandas! You can also use `df.loc`, `df.iloc` and `df.query()`. Let's see them all below!  
With `df.loc` and `df.iloc` you can select specific rows and columns at the same time!

In [25]:
## df.loc lets you select based on labels and conditions

df.loc[
    (df['Duration'] > 45) | (df['Pulse'] < 105), 
    ['Duration', 'Pulse', 'Calories'] ]

    Duration  Pulse  Calories
0         60    110     409.1
1         60    117     479.0
2         60    103     340.0
5         60    102     300.0
6         60    110     374.0
7        450    104     253.3
9         60     98     269.0
10        60    103     329.3
11        60    100     250.7
12        60    100     250.7


In [28]:
## df.iloc lets you select based on integer positions

df.iloc[5:10, 2:4]

   Pulse  Maxpulse
5    102       127
6    110       136
7    104       134
8    109       133
9     98       124
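
One difference that is easy to trip over: slices with `df.iloc` exclude the end position (like Python lists), while slices with `df.loc` include the end label. A minimal sketch on a made-up frame (`toy` is an assumption, not the data above):

```python
import pandas as pd

# Hypothetical toy data with the default 0..9 index
toy = pd.DataFrame({"a": range(10), "b": range(10, 20)})

by_label = toy.loc[5:9, "a"]      # labels 5 to 9, end label INCLUDED -> 5 rows
by_position = toy.iloc[5:9, 0]    # positions 5 to 8, end position excluded -> 4 rows

print(len(by_label), len(by_position))  # 5 4
```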


In [47]:
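## df.query() filters rows with a condition written as a string, similar to the boolean filtering we had before
## (the output below includes the Sex column and the updated Pulse values because this cell was run after the next one)

df.query("Maxpulse > 101") # combine conditions inside the string with & (and) or | (or)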

   Duration          Date  Pulse  Maxpulse  Calories  Sex
0        60  '2020/12/01'    111       130     409.1    M
1        60  '2020/12/02'    118       145     479.0    M
2        60  '2020/12/03'    104       135     340.0    F
3        45  '2020/12/04'    110       175     282.4    F
4        45  '2020/12/05'    118       148     406.0    F
5        60  '2020/12/06'    103       127     300.0    M
6        60  '2020/12/07'    111       136     374.0    F
7       450  '2020/12/08'    105       134     253.3    M
8        30  '2020/12/09'    110       133     195.1    F
9        60  '2020/12/10'     99       124     269.0    M


Now that we know what we can do with data that exists within a DataFrame, can we add more to it? Yes!

In [29]:
## we can add a new column or make changes to an existing column

# Adding a new column
df['Sex'] = ['M', 'M', 'F', 'F', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M', 'F', 'F', 'F', 'F',
            'M', 'F', 'M', 'F', 'M', 'M', 'F', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M']

# Modifying an existing column
df['Pulse'] = df['Pulse'] + 1


In [30]:
df

   Duration          Date  Pulse  Maxpulse  Calories  Sex
0        60  '2020/12/01'    111       130     409.1    M
1        60  '2020/12/02'    118       145     479.0    M
2        60  '2020/12/03'    104       135     340.0    F
3        45  '2020/12/04'    110       175     282.4    F
4        45  '2020/12/05'    118       148     406.0    F
5        60  '2020/12/06'    103       127     300.0    M
6        60  '2020/12/07'    111       136     374.0    F
7       450  '2020/12/08'    105       134     253.3    M
8        30  '2020/12/09'    110       133     195.1    F
9        60  '2020/12/10'     99       124     269.0    M


In [31]:
# for categorical columns, we can count how many of each value we have

df['Sex'].value_counts()

Sex
M    16
F    16
Name: count, dtype: int64

In [35]:
len(df['Sex'].unique())

2

Below we'll see what makes `pd.DataFrame` so useful! Sometimes we want to bypass using `for loops` and `if statements` - we could use them but pandas makes it so easy! For example, we want to separate our data based on `Sex` and find the mean `Pulse` for each category. We could do this using a for loop by iterating over each row, checking if the sex is `M` or `F` and then add their associated values to separate lists and find the mean of that list. Again, we could do this, but see how easy pandas makes this below:

In [43]:
## we can also group categories using df.groupby() and perform operations like sum() and mean() etc.

x = df.groupby('Sex')['Pulse'].mean() # just one line of code!

In [45]:
x.iloc[0] # position 0 is the 'F' group (groups are sorted alphabetically); x['F'] gives the same value

106.1875
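
To see what `groupby` saves us, here is a sketch comparing the manual loop described above with the one-liner. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({"Sex": ["M", "F", "F", "M"],
                    "Pulse": [100, 110, 120, 90]})

# The manual way: loop over the rows and collect the values per category
values = {}
for sex, pulse in zip(toy["Sex"], toy["Pulse"]):
    values.setdefault(sex, []).append(pulse)
manual_means = {sex: sum(v) / len(v) for sex, v in values.items()}

# The pandas way
grouped_means = toy.groupby("Sex")["Pulse"].mean()

print(manual_means)                            # {'M': 95.0, 'F': 115.0}
print(grouped_means["F"], grouped_means["M"])  # 115.0 95.0
```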

Let's say we've completed our analysis and our final results are in the form of a `pd.DataFrame`. We don't want to rerun the whole analysis every time we need those results, so it would be nice to save the `pd.DataFrame` to a file that we can simply read back in for further analysis (or plotting!). Thankfully, we can easily do this!  
`df.to_csv("<path/to/file_name>", sep="\t")` (use `sep=","` for a comma-separated file, which is also the default)

In [48]:
# make sure you give your file an extension - I'm creating a tsv file, so I'm also making sure I tell Python to separate the columns by a tab

df.to_csv("our_final_df.tsv", sep='\t')
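
By default, `to_csv` also writes the row index as an extra first column, which shows up as `Unnamed: 0` when the file is read back in. A minimal sketch of a round trip that avoids this with `index=False`; it uses a made-up frame and an in-memory buffer instead of a real file so that it is self-contained:

```python
import io
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({"Duration": [60, 45], "Pulse": [110, 109]})

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, sep="\t", index=False)  # index=False: do not write the row labels

# Rewind the buffer and read the data back in
buffer.seek(0)
restored = pd.read_csv(buffer, sep="\t")

print(list(restored.columns))   # ['Duration', 'Pulse']
print(restored.equals(toy))     # True
```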

## Let's talk about NumPy!  
NumPy is a fundamental Python library for numerical computing. It provides support for large, multi-dimensional arrays and matrices, as well as a wide range of mathematical functions to operate on these arrays. 

In [4]:
## Creating a NumPy array
## We will see some ways to create NumPy arrays below!

my_list = [1, 2, 3, 4, 5] # by a list
my_array = np.array(my_list)

print(my_list)
print(my_array)
print(type(my_array))

[1, 2, 3, 4, 5]
[1 2 3 4 5]
<class 'numpy.ndarray'>


In [29]:
## We can create some standard arrays using numpy's builtin functions

# Create an array of zeros
zeros_array = np.zeros([2,3], dtype=int)

# Create an array of ones
ones_array = np.ones([3,2])

# Create an array with a range of values
range_array = np.arange(0, 10, 2) # recall start, stop, step!


print(zeros_array)
print(ones_array)
print(range_array)

[[0 0 0]
 [0 0 0]]
[[1. 1.]
 [1. 1.]
 [1. 1.]]
[0 2 4 6 8]


In [30]:
zeros_array[0,1] = 2

zeros_array

array([[0, 2, 0],
       [0, 0, 0]])

In [53]:
# we can access attributes of our arrays using the following

arr = np.array([[1, 2, 3, 4, 5], [4, 5, 6, 7, 8]])
print(arr)
print("Shape:", arr.shape)
print("Size:", arr.size)
print("Data Type:", arr.dtype)


[[1 2 3 4 5]
 [4 5 6 7 8]]
Shape: (2, 5)
Size: 10
Data Type: int64


In [7]:
## we can access elements in our numpy arrays the same way we access them in lists

arr = np.array([1, 2, 3, 4, 5])

# Accessing individual elements
print(arr[0])  # Prints the first element (1)

# Slicing
print(arr[1:4])  # Prints elements at index 1, 2, and 3


1
[2 3 4]
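
For arrays with more than one dimension, indexing goes a step beyond lists: you can give one index (or slice) per dimension, separated by commas. A small sketch with a made-up 2-D array called `grid`:

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(grid[1, 2])     # row 1, column 2 -> 6
print(grid[:, 0])     # every row, first column -> [1 4]
print(grid[0, 1:])    # row 0, columns 1 onward -> [2 3]
```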


In [8]:
## NumPy lets you do element-wise operations

# Addition
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
result = a + b

print(result)

# Multiplication
p_result = a * 2  # Scalar multiplication

print(p_result)


[5 7 9]
[2 4 6]


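With NumPy arrays, `*` (and `np.multiply`) multiplies element by element, and NumPy will stretch ("broadcast") a smaller array across a larger one where the shapes allow it. Many times this is the desired result and it's very nifty! But sometimes you want to do actual matrix multiplication, and you need to use a special function for that: `np.matmul`.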

In [17]:
## a note on multiplication

c = np.array([[2,3], [4,5]])
d = np.array([[1,2], [6,7]])
e = np.array([1,2])

# print(c*d)

#print(np.multiply(c,d))

print(c)
print(e)

# try:
#     print(c*e)
    
# except:
#     print('c*e not possible!')
    

    
# try:
#     print(np.multiply(c,e))
    
# except:
#     print('np.multiply(c,e) not possible!')
    
    
print(np.matmul(c,e)) # special function for matrix multiplication!

[[2 3]
 [4 5]]
[1 2]
[ 8 14]
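
As a side-by-side sketch of the two kinds of multiplication, reusing the same `c`, `d` and `e` values as in the cell above:

```python
import numpy as np

c = np.array([[2, 3], [4, 5]])
d = np.array([[1, 2], [6, 7]])
e = np.array([1, 2])

elementwise = c * d          # multiplies matching entries
matrix_product = c @ d       # @ is the operator form of np.matmul
broadcast = c * e            # e is applied to every row of c

print(elementwise)      # [[ 2  6] [24 35]]
print(matrix_product)   # [[20 25] [34 43]]
print(broadcast)        # [[ 2  6] [ 4 10]]
```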


In [19]:
## some more functions with numpy arrays

arr = np.array([1, 2, 3, 4, 5])

# Mean, median, and standard deviation
mean = np.mean(arr)
print(mean)
median = np.median(arr)
print(median)
std_dev = np.std(arr)
print(std_dev)

# Element-wise functions
squared = np.square(arr)
print(squared)
square_root = np.sqrt(arr)
print(square_root)


3.0
3.0
1.4142135623730951
[ 1  4  9 16 25]
[1.         1.41421356 1.73205081 2.         2.23606798]


In [8]:
## we can change the shape of an array

arr = np.arange(12)
print(arr)
reshaped_arr = arr.reshape(3, 4)

new_arr = arr.reshape(-1,3)


print(arr.shape)
print(reshaped_arr.shape)
print(new_arr.shape)

print(reshaped_arr)

[ 0  1  2  3  4  5  6  7  8  9 10 11]
(12,)
(3, 4)
(4, 3)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]


In [18]:
## we can stack and split an array

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Stack vertically
stacked_vertical = np.vstack((a, b))

# Stack horizontally
stacked_horizontal = np.hstack((a, b))

# # Split array
split_array = np.split(stacked_horizontal, 2)


print(stacked_vertical)

print(stacked_horizontal)

print(split_array)

[[1 2 3]
 [4 5 6]]
[1 2 3 4 5 6]
[array([1, 2, 3]), array([4, 5, 6])]


In [27]:
## another way of joining two arrays together

d = np.concatenate([a,b]) # default is axis=0; a and b are 1-D, so they are joined end to end into one longer 1-D array
print(a)
print(b)
print(d)


## to join along a different axis, that axis must exist first (a and b are 1-D, so they only have axis 0)
reshape_a = a.reshape(-1,1)

reshape_b = b.reshape(-1,1)

print(reshape_a)
print(reshape_b)

e = np.concatenate([reshape_a, reshape_b], axis=1) # now axis=1 exists: the arrays are joined side by side, as columns of a 2-D array

print(e)

[1 2 3]
[4 5 6]
[1 2 3 4 5 6]
[[1]
 [2]
 [3]]
[[4]
 [5]
 [6]]
[[1 4]
 [2 5]
 [3 6]]


The `a.reshape(-1,1)` is saying: I want my array to have 1 column, and NumPy should figure out a suitable number of rows based on how many elements I have! Only one dimension can be given as `-1`.
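
A few more cases of `-1` in `reshape`, as a minimal sketch on a 12-element array. The last case shows what happens when no whole number fits:

```python
import numpy as np

arr = np.arange(12)

print(arr.reshape(-1, 1).shape)  # (12, 1)
print(arr.reshape(2, -1).shape)  # (2, 6)
print(arr.reshape(-1, 4).shape)  # (3, 4)

# 12 elements cannot be split into rows of 5, so NumPy raises a ValueError
try:
    arr.reshape(-1, 5)
    failed = False
except ValueError as err:
    failed = True
    print("cannot reshape:", err)
```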

We can also use numpy to generate random numbers!

In [28]:
# Generate 5 random integers between 1 and 10 (the upper bound, 11, is excluded)
random_integers = np.random.randint(1, 11, size=5)

# Generate random values from a normal distribution
random_normal = np.random.normal(0, 1, size=(3, 3))


print(random_integers)

print(random_normal)

[5 4 9 4 8]
[[ 1.09280455  1.72928304 -0.84794139]
 [ 0.01284034  0.01807955 -2.28326701]
 [-0.3923979  -0.18064689 -0.34252391]]
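
Random numbers change on every run, which can make results hard to reproduce. One way to get repeatable numbers is a seeded generator from `np.random.default_rng`; the sketch below assumes a seed of 42, which is an arbitrary choice:

```python
import numpy as np

# Two generators created with the same seed produce the same sequence
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

ints_a = rng_a.integers(1, 11, size=5)   # integers from 1 to 10, upper bound excluded
ints_b = rng_b.integers(1, 11, size=5)

print(np.array_equal(ints_a, ints_b))    # True

# Values from a normal distribution with mean 0 and standard deviation 1
normal = rng_a.normal(loc=0, scale=1, size=(3, 3))
print(normal.shape)                      # (3, 3)
```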
