# Intro to `numpy` and `pandas`
`numpy` and `pandas` are two essential libraries for data manipulation and analysis in Python.

`numpy` provides support for large, multi-dimensional arrays and matrices, along with a 
collection of mathematical functions to operate on these arrays in an efficient way.

`pandas` builds on top of `numpy` and adds higher-level, more user-friendly tools. Its main data structure, the `DataFrame`, is more flexible than a `numpy` array and can hold a different data type in each column (similar to a spreadsheet or SQL table).
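
To make the point about heterogeneous data concrete, here is a minimal sketch. The names `mixed_array` and `mixed_df` and their values are made up for illustration; the idea is only to compare how each library stores mixed numbers and text.

```python
import numpy as np
import pandas as pd

# A numpy array holds a single dtype, so mixing numbers and text
# makes numpy store every element as a string.
mixed_array = np.array([1, 2.5, "three"])
print(mixed_array.dtype)

# A DataFrame keeps a separate dtype for each column.
mixed_df = pd.DataFrame({
    "count": [1, 2, 3],
    "ratio": [0.5, 1.5, 2.5],
    "label": ["a", "b", "c"],
})
print(mixed_df.dtypes)
```

The exact dtype names printed can vary between library versions, but the array should report one string dtype while the DataFrame reports one dtype per column.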

In [1]:
import pandas as pd
import numpy as np

# Checking the versions of numpy and pandas that we have installed
print(f"numpy version: {np.__version__}")
print(f"pandas version: {pd.__version__}")

numpy version: 2.1.0
pandas version: 2.2.2


## Matrices and Arrays with Numpy
The core of `numpy` is the `ndarray` class, a flexible n-dimensional array that can represent vectors, matrices, or higher-dimensional arrays. We can create an object of this class by calling the `array` function and passing it a list or tuple of values. In the example below, we pass a "list of lists" to create a 2D array (matrix).

In [7]:
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

print("2D Array (Matrix):")
print(matrix)
print("\n")

# Use the "shape" attribute to get the dimensions of the matrix
print("Shape:", matrix.shape)

# Use the "ndim" attribute to get the number of dimensions of the matrix
print("Dimensions:", matrix.ndim)

2D Array (Matrix):
[[1 2 3]
 [4 5 6]
 [7 8 9]]


Shape: (3, 3)
Dimensions: 2
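
The same `shape` and `ndim` attributes work for vectors and higher-dimensional arrays too. The sketch below uses two illustrative arrays, `vector` and `cube`, that are not part of the notebook above.

```python
import numpy as np

# A flat list gives a 1D array (a vector)
vector = np.array([1, 2, 3])
print("Vector shape:", vector.shape, "ndim:", vector.ndim)

# arange(24) produces the integers 0..23; reshape arranges them
# into 2 blocks of 3 rows and 4 columns (a 3D array)
cube = np.arange(24).reshape(2, 3, 4)
print("Cube shape:", cube.shape, "ndim:", cube.ndim)
```

Note that the shape of a 1D array is a one-element tuple, `(3,)`, rather than `(3, 1)` or `(1, 3)`.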


### Basic Operations with Numpy
Numpy provides a wide range of functions for performing operations on arrays. Here are some of the most common ones:

In [8]:
print("Sum of all elements:", np.sum(matrix))
print("Mean of all elements:", np.mean(matrix))

# Matrix multiplication
matrix_b = np.array([[2, 0, 1],
                     [1, 2, 1],
                     [1, 1, 0]])

result = np.dot(matrix, matrix_b)
print("\n")
print("Matrix multiplication result:")
print(result)

Sum of all elements: 45
Mean of all elements: 5.0


Matrix multiplication result:
[[ 7  7  3]
 [19 16  9]
 [31 25 15]]
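
A few related operations are worth knowing alongside the ones above. This sketch rebuilds the same two matrices so it runs on its own, then shows reductions along one axis, the `@` operator as an alternative spelling of matrix multiplication for 2D arrays, and element-wise multiplication with `*` (which is not the same thing).

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
matrix_b = np.array([[2, 0, 1],
                     [1, 2, 1],
                     [1, 1, 0]])

# axis=0 collapses the rows, giving one value per column
column_sums = matrix.sum(axis=0)
print("Column sums:", column_sums)

# axis=1 collapses the columns, giving one value per row
row_means = matrix.mean(axis=1)
print("Row means:", row_means)

# For 2D arrays, "@" gives the same result as np.dot
matmul = matrix @ matrix_b
print("Same as np.dot:", np.array_equal(matmul, np.dot(matrix, matrix_b)))

# "*" multiplies matching entries; it is NOT matrix multiplication
elementwise = matrix * matrix_b
print(elementwise)
```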


## `DataFrames` with Pandas
`DataFrames` are the primary data structure in `pandas`. They are similar to tables in SQL or spreadsheets in Excel, but with some additional functionality. We can create a `DataFrame` by passing a dictionary of lists to the `DataFrame` constructor. Each key in the dictionary represents a column name, and the corresponding list contains the values for that column.

In [11]:
data = {
    'Name': ['Dave', 'Anna', 'Charlie'],
    'Age': [25, 40, 35],
    'City': ['East Lansing', 'Grand Rapids', 'Detroit']
}

df = pd.DataFrame(data)
print("DataFrame:")
print(df)

# You can also neatly print a DataFrame using the "to_markdown" method. You'll need to install the "tabulate" package to use this method.
print("\n")
print(df.to_markdown())

DataFrame:
      Name  Age          City
0     Dave   25  East Lansing
1     Anna   40  Grand Rapids
2  Charlie   35       Detroit


|    | Name    |   Age | City         |
|---:|:--------|------:|:-------------|
|  0 | Dave    |    25 | East Lansing |
|  1 | Anna    |    40 | Grand Rapids |
|  2 | Charlie |    35 | Detroit      |
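
A dictionary of lists is not the only input the `DataFrame` constructor accepts. As a small sketch, the same table can be built row by row from a list of dictionaries; the variable names `records` and `records_df` are just for this example.

```python
import pandas as pd

# Each dictionary is one row; the keys become the column names
records = [
    {"Name": "Dave", "Age": 25, "City": "East Lansing"},
    {"Name": "Anna", "Age": 40, "City": "Grand Rapids"},
    {"Name": "Charlie", "Age": 35, "City": "Detroit"},
]

records_df = pd.DataFrame(records)
print(records_df)

# "shape" works like it does in numpy: (number of rows, number of columns)
print("Shape:", records_df.shape)
print("Columns:", list(records_df.columns))
```

Which form is more convenient usually depends on how the raw data arrives: column-oriented data suits a dictionary of lists, while row-oriented data (for example, parsed JSON records) suits a list of dictionaries.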


### Some Basic Operations on `DataFrames`

In [12]:
# Basic DataFrame operations
print("\nDataFrame info:")
df.info()

print("\nDescriptive statistics:")
print(df.describe())

# Selecting data
print("\nSelecting 'Name' column:")
print(df['Name'])

print("\nSelecting rows where Age > 28:")
print(df[df['Age'] > 28])


DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    3 non-null      object
 1   Age     3 non-null      int64 
 2   City    3 non-null      object
dtypes: int64(1), object(2)
memory usage: 204.0+ bytes

Descriptive statistics:
             Age
count   3.000000
mean   33.333333
std     7.637626
min    25.000000
25%    30.000000
50%    35.000000
75%    37.500000
max    40.000000

Selecting 'Name' column:
0       Dave
1       Anna
2    Charlie
Name: Name, dtype: object

Selecting rows where Age > 28:
      Name  Age          City
1     Anna   40  Grand Rapids
2  Charlie   35       Detroit
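
It can help to see what the filter in the last step is doing under the hood. In this sketch (using a small stand-in frame called `people`), the comparison produces a boolean `Series`, and passing that `Series` inside the brackets keeps only the rows marked `True`.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Dave", "Anna", "Charlie"],
    "Age": [25, 40, 35],
})

# The comparison is evaluated for every row and returns True/False values
mask = people["Age"] > 28
print(mask)

# Indexing with the mask keeps only the rows where it is True
older = people[mask]
print(older)

# Single brackets return a Series; double brackets return a DataFrame
print(type(people["Name"]))
print(type(people[["Name"]]))
```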


### Indexing with Pandas
One of the things you'll do most often with your data is index it (select certain rows or columns based on particular filters). Pandas provides several ways to index into `DataFrames`. We can use the `iloc` indexer to access rows and columns by their integer position, and the `loc` indexer to access them by their labels. Note that both are used with square brackets, not parentheses.

Before we get going, let's add some additional rows and columns to the `DataFrame` to make it more interesting.

In [17]:
# Adding 12 additional rows and 2 additional columns to the "df" dataframe

# New data to be added
additional_data = {
    'Name': ['Eve', 'Frank', 'Grace', 'Hannah', 'Ivy', 'Jack', 'Kathy', 'Leo', 'Mona', 'Nina', 'Oscar', 'Paul'],
    'Age': [22, 28, 32, 26, 24, 29, 31, 27, 33, 30, 34, 36],
    'City': ['Ann Arbor', 'Flint', 'Kalamazoo', 'Lansing', 'Muskegon', 'Pontiac', 'Saginaw', 'Traverse City', 'Warren', 'Wyoming', 'Ypsilanti', 'Sterling Heights'],
    'Occupation': ['Engineer', 'Doctor', 'Artist', 'Lawyer', 'Nurse', 'Teacher', 'Scientist', 'Chef', 'Designer', 'Architect', 'Pilot', 'Musician'],
    'Salary': [70000, 120000, 50000, 90000, 60000, 55000, 80000, 45000, 75000, 85000, 95000, 65000]
}

# Creating a DataFrame from the additional data
additional_df = pd.DataFrame(additional_data)

# Concatenating the original df with the additional_df using the "concat" function
df = pd.concat([df, additional_df], ignore_index=True)

print("Updated DataFrame:")
print(df)

Updated DataFrame:
       Name  Age              City Occupation    Salary
0      Dave   25      East Lansing        NaN       NaN
1      Anna   40      Grand Rapids        NaN       NaN
2   Charlie   35           Detroit        NaN       NaN
3       Eve   22         Ann Arbor   Engineer   70000.0
4     Frank   28             Flint     Doctor  120000.0
5     Grace   32         Kalamazoo     Artist   50000.0
6    Hannah   26           Lansing     Lawyer   90000.0
7       Ivy   24          Muskegon      Nurse   60000.0
8      Jack   29           Pontiac    Teacher   55000.0
9     Kathy   31           Saginaw  Scientist   80000.0
10      Leo   27     Traverse City       Chef   45000.0
11     Mona   33            Warren   Designer   75000.0
12     Nina   30           Wyoming  Architect   85000.0
13    Oscar   34         Ypsilanti      Pilot   95000.0
14     Paul   36  Sterling Heights   Musician   65000.0
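
Two details in the output above are worth a closer look: the rows that had no `Occupation` or `Salary` were filled with `NaN`, and the integer salaries are now shown as floats. The sketch below reproduces this on a tiny scale with two hypothetical frames, `left` and `right`.

```python
import pandas as pd

left = pd.DataFrame({"Name": ["Dave"], "Age": [25]})
right = pd.DataFrame({"Name": ["Eve"], "Age": [22], "Salary": [70000]})

# "Salary" is an integer column before concatenation
print(right["Salary"].dtype)

combined = pd.concat([left, right], ignore_index=True)
print(combined)

# NaN is a floating-point value, so the column is converted to float
print(combined["Salary"].dtype)

# isna() flags the missing entries
print(combined["Salary"].isna())
```

Passing `ignore_index=True` is what gives the combined frame a fresh 0, 1, 2, ... index instead of repeating each input's own row labels.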


In [18]:
# Indexing using loc
# Returns the value in the row labeled 0 (here, also the first row) in the 'Age' column
print(df.loc[0, 'Age']) 

# Indexing using iloc
# Returns the value at row position 0 and column position 0 (the "Name" column)
print(df.iloc[0, 0])

25
Dave
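
In the DataFrame above, the row labels happen to equal the row positions, so `loc` and `iloc` look interchangeable. The difference shows up once the index is not 0, 1, 2, and so on. Here is a minimal sketch with a made-up frame called `scores` whose rows are labeled with letters.

```python
import pandas as pd

scores = pd.DataFrame({"score": [90, 85, 70, 60]},
                      index=["a", "b", "c", "d"])

# loc looks rows up by label, iloc by integer position
print(scores.loc["b", "score"])
print(scores.iloc[1, 0])

# Label slices with loc include the end label ("a", "b", and "c")
by_label = scores.loc["a":"c"]
print(by_label)

# Position slices with iloc exclude the end position (rows 0 and 1 only)
by_position = scores.iloc[0:2]
print(by_position)
```

The inclusive end point for `loc` slices is a common source of off-by-one surprises when switching between the two indexers.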


You might have noticed that the DataFrame is missing the Occupation and Salary for the first three rows. Let's add that information to the DataFrame using the indexing methods we just learned.

In [19]:
# Adding "Occupation" and "Salary" information to the first two rows using the .loc indexer
df.loc[0, ['Occupation', 'Salary']] = ['Data Scientist', 110000]
df.loc[1, ['Occupation', 'Salary']] = ['Manager', 95000]

print("Updated DataFrame with Occupation and Salary for the first two rows:")
print(df)

Updated DataFrame with Occupation and Salary for the first two rows:
       Name  Age              City      Occupation    Salary
0      Dave   25      East Lansing  Data Scientist  110000.0
1      Anna   40      Grand Rapids         Manager   95000.0
2   Charlie   35           Detroit             NaN       NaN
3       Eve   22         Ann Arbor        Engineer   70000.0
4     Frank   28             Flint          Doctor  120000.0
5     Grace   32         Kalamazoo          Artist   50000.0
6    Hannah   26           Lansing          Lawyer   90000.0
7       Ivy   24          Muskegon           Nurse   60000.0
8      Jack   29           Pontiac         Teacher   55000.0
9     Kathy   31           Saginaw       Scientist   80000.0
10      Leo   27     Traverse City            Chef   45000.0
11     Mona   33            Warren        Designer   75000.0
12     Nina   30           Wyoming       Architect   85000.0
13    Oscar   34         Ypsilanti           Pilot   95000.0
14     Paul   36  Sterling Heights        Musician   65000.0

Now, let's update Charlie's information in the DataFrame. Instead of a row label, we pass `loc` a boolean condition (`df['Name'] == 'Charlie'`), which selects every row where the condition is true.

In [20]:
# Updating Charlie's information in the DataFrame
df.loc[df['Name'] == 'Charlie', ['Occupation', 'Salary']] = ['Analyst', 85000]

print("Updated DataFrame with Charlie's information:")
print(df)

Updated DataFrame with Charlie's information:
       Name  Age              City      Occupation    Salary
0      Dave   25      East Lansing  Data Scientist  110000.0
1      Anna   40      Grand Rapids         Manager   95000.0
2   Charlie   35           Detroit         Analyst   85000.0
3       Eve   22         Ann Arbor        Engineer   70000.0
4     Frank   28             Flint          Doctor  120000.0
5     Grace   32         Kalamazoo          Artist   50000.0
6    Hannah   26           Lansing          Lawyer   90000.0
7       Ivy   24          Muskegon           Nurse   60000.0
8      Jack   29           Pontiac         Teacher   55000.0
9     Kathy   31           Saginaw       Scientist   80000.0
10      Leo   27     Traverse City            Chef   45000.0
11     Mona   33            Warren        Designer   75000.0
12     Nina   30           Wyoming       Architect   85000.0
13    Oscar   34         Ypsilanti           Pilot   95000.0
14     Paul   36  Sterling Heights        Musician   65000.0

In [22]:
# Example 1: Indexing using a condition
# Selecting rows where the 'Salary' is greater than 80000
high_salary_df = df[df['Salary'] > 80000]
print("Rows where Salary is greater than 80000:")
print(high_salary_df)

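# Example 2: Indexing using two conditions combined with "&"
# Selecting rows where the 'Age' is between 30 and 37 (inclusive)
# Each condition needs its own parentheses because "&" binds more tightly than ">=" and "<="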
age_between_30_and_37_df = df[(df['Age'] >= 30) & (df['Age'] <= 37)]

print("\n")
print("Rows where Age is between 30 and 37:")
print(age_between_30_and_37_df)

Rows where Salary is greater than 80000:
       Name  Age          City      Occupation    Salary
0      Dave   25  East Lansing  Data Scientist  110000.0
1      Anna   40  Grand Rapids         Manager   95000.0
2   Charlie   35       Detroit         Analyst   85000.0
4     Frank   28         Flint          Doctor  120000.0
6    Hannah   26       Lansing          Lawyer   90000.0
12     Nina   30       Wyoming       Architect   85000.0
13    Oscar   34     Ypsilanti           Pilot   95000.0


Rows where Age is between 30 and 37:
       Name  Age              City Occupation   Salary
2   Charlie   35           Detroit    Analyst  85000.0
5     Grace   32         Kalamazoo     Artist  50000.0
9     Kathy   31           Saginaw  Scientist  80000.0
11     Mona   33            Warren   Designer  75000.0
12     Nina   30           Wyoming  Architect  85000.0
13    Oscar   34         Ypsilanti      Pilot  95000.0
14     Paul   36  Sterling Heights   Musician  65000.0
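
Conditions can be combined in a few other ways as well. The sketch below uses a small stand-in frame called `staff` (not the full DataFrame above) to show the `between` and `isin` methods along with the `|` (or) and `~` (not) operators.

```python
import pandas as pd

staff = pd.DataFrame({
    "Name": ["Dave", "Anna", "Grace", "Leo"],
    "Age": [25, 40, 32, 27],
    "City": ["East Lansing", "Grand Rapids", "Kalamazoo", "Traverse City"],
})

# between() is inclusive on both ends by default,
# so this matches (Age >= 30) & (Age <= 37)
in_range = staff[staff["Age"].between(30, 37)]
print(in_range)

# "|" keeps rows where at least one condition is true
either = staff[(staff["Age"] < 26) | (staff["City"] == "Kalamazoo")]
print(either)

# isin() tests membership in a list; "~" negates the resulting mask
not_listed = staff[~staff["City"].isin(["East Lansing", "Grand Rapids"])]
print(not_listed)
```

Python's `and`, `or`, and `not` keywords do not work element by element on a `Series`, which is why the `&`, `|`, and `~` operators are used here.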


## Comparing Pandas to R Packages
We won't have time to get into the details of how `pandas` compares to R packages like `dplyr` and `tidyr`, but if you're familiar with those packages, you'll find that `pandas` provides similar functionality for data manipulation and analysis in Python. For more information, see the [pandas comparison with R guide](https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_r.html).
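
As a rough illustration of that overlap, the sketch below chains a few `pandas` operations, with a comment on each line naming the `dplyr` verb that plays a similar role. The frame `people` and the new column `SalaryK` are invented for this example, and the pairing of verbs is approximate rather than an exact equivalence.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Dave", "Anna", "Charlie", "Eve"],
    "Age": [25, 40, 35, 22],
    "Salary": [110000, 95000, 85000, 70000],
})

summary = (
    people[people["Age"] > 24]                        # dplyr: filter(Age > 24)
    .assign(SalaryK=lambda d: d["Salary"] / 1000)     # dplyr: mutate(SalaryK = Salary / 1000)
    .sort_values("Salary", ascending=False)           # dplyr: arrange(desc(Salary))
    .loc[:, ["Name", "SalaryK"]]                      # dplyr: select(Name, SalaryK)
)
print(summary)
```

Wrapping the chain in parentheses lets each step sit on its own line, which reads similarly to a pipeline written with R's pipe operator.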