<h2>Topic 1: Introduction to Pandas</h2>

---

In [29]:
%%capture
!pip install pandas
!pip install numpy


In [30]:
import pandas
import numpy

---
<h3>Problem 1: How do we add two lists together like this:</h3>

<img src="./images/list.jpg" width="400" height="300">

---

In [31]:
L1 = [5, 3, 2, 7]
L2 = [2, 3, 4, 1]
L3 = L1 + L2
print(L3)

[5, 3, 2, 7, 2, 3, 4, 1]


---
<h4>Plain Python lists support concatenation: the + operator joins two lists end to end. To add the lists element by element without writing a for loop, we use a NumPy array:</h4>

---

In [32]:
L1 = numpy.array(L1)
L2 = numpy.array(L2)
L3 = L1 + L2
print(L3)

# If we still want to concatenate we can simply do:

L3 = numpy.concatenate((L1, L2))
print(L3)

[7 6 6 8]
[5 3 2 7 2 3 4 1]
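
For comparison, here is a minimal sketch of the same element-wise addition written with plain lists. The names `a`, `b`, `loop_sum`, and `array_sum` are illustrative and not part of the workshop code.

```python
import numpy

a = [5, 3, 2, 7]
b = [2, 3, 4, 1]

# Plain Python: pair up the elements with zip and add each pair
loop_sum = [x + y for x, y in zip(a, b)]

# NumPy: the + operator already works element by element
array_sum = numpy.array(a) + numpy.array(b)

print(loop_sum)   # [7, 6, 6, 8]
print(array_sum)  # [7 6 6 8]
```

Both give the same numbers, but the NumPy version needs no explicit loop and is usually much faster on large arrays.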


---
<h4>The real power shows with higher-dimensional data (e.g. matrices), which NumPy fully supports:</h4>

---

In [33]:
M1 = [[1, 2], [3, 4]]
M1 = numpy.array(M1)

print(M1 + 5, "\n") # Adds 5 to ALL entries

print(M1 * 5, "\n") # Multiplies ALL entries by 5

print(M1 + M1 ** 2, "\n") # Squares each entry and adds it to the original entry (note that this is NOT matrix multiplication)

print(M1 @ M1) # Matrix Multiplication (we won't get into this)

[[6 7]
 [8 9]] 

[[ 5 10]
 [15 20]] 

[[ 2  6]
 [12 20]] 

[[ 7 10]
 [15 22]]


---

<h4>Problem 2: How do we represent missing data?</h4>

---

In [34]:
# We can use numpy.nan to represent missing data; this is also called null or NaN

MyList = [1, 5, 3, 5, numpy.nan]
L1 = numpy.array(MyList)

# NaN propagates: operations leave missing positions as NaN and compute the rest as usual

L1 + 7

array([ 8., 12., 10., 12., nan])
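
Because NaN propagates, a plain sum over the array is also NaN. The sketch below shows one way to detect and skip missing entries using `numpy.isnan`, `numpy.nansum`, and `numpy.nanmean`. The variable names are illustrative only.

```python
import numpy

values = numpy.array([1, 5, 3, 5, numpy.nan])

# NaN propagates: any arithmetic that touches it gives NaN back
print(values + 7)          # [ 8. 12. 10. 12. nan]
print(values.sum())        # nan

# numpy.isnan marks the missing positions with True
missing = numpy.isnan(values)
print(missing)             # [False False False False  True]

# The nan-aware functions skip missing entries
total = numpy.nansum(values)
average = numpy.nanmean(values)
print(total, average)      # 14.0 3.5
```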

---

<h4>Finally, an important feature of NumPy arrays is boolean indexing (or masking):</h4>

<img src="./images/boolean.jpg" height="300" width="500"/>

---

In [35]:
L1 = numpy.array([1, 2, 5, 6, 3, 4])

# Step 1
L2 = L1 > 3 # This creates a boolean mask
print(L2)

# Step 2
L3 = L1[L2] # This applies the boolean mask
print(L3)

# But it's easier to just do:
L3 = L1[L1 > 3]
print(L3)

[False False  True  True False  True]
[5 6 4]
[5 6 4]
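
Masks can also be combined. The following is a small sketch, assuming the same example values as above, that uses `&`, `|`, and `~` on boolean masks. The names `scores`, `between`, `extremes`, and `not_large` are made up for illustration.

```python
import numpy

scores = numpy.array([1, 2, 5, 6, 3, 4])

# Use & for "and", | for "or", ~ for "not"; wrap each condition in parentheses
between = scores[(scores > 2) & (scores < 6)]
extremes = scores[(scores < 2) | (scores > 5)]
not_large = scores[~(scores > 3)]

print(between)    # [5 3 4]
print(extremes)   # [1 6]
print(not_large)  # [1 2 3]
```

The parentheses matter: `&` and `|` bind more tightly than `>` and `<`, so leaving them out raises an error or gives the wrong mask.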


---

<h4> Now let's get into Pandas data structures, which are built on NumPy arrays! Pandas has two main data structures:</h4>

#### 1. Series
#### 2. DataFrame

---

<h4> Pandas Series: </h4>

<img src="./images/series.jpg" height="300" width="400">

---

In [36]:
# Pandas Series

S1 = pandas.Series([9, 9, 3], index=[1, 2, 3])
S1

1    9
2    9
3    3
dtype: int64

---

<h4>Note that operations between Pandas Series are aligned by index label, not by position! Values with the same label are combined, and a label that appears in only one Series produces NaN.</h4>

<img src="./images/seriesadd.jpg" height=300 width=500>

---

In [37]:
# for example

S1 = pandas.Series([20, 23, 34], index=["A", 2, 3])
S2 = pandas.Series([66, 65, 55, 43], index=[1, "A", 3, "Z"])
S1 + S2

1     NaN
2     NaN
3    89.0
A    85.0
Z     NaN
dtype: float64
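
If the NaN entries are not wanted, one option is the `add` method with `fill_value`, which treats a label missing from one Series as the given value. This is a sketch with made-up Series `s1` and `s2`, not the workshop's data.

```python
import pandas

s1 = pandas.Series([20, 23, 34], index=["A", "B", "C"])
s2 = pandas.Series([66, 65, 55], index=["B", "C", "D"])

# + matches labels; labels found in only one Series become NaN
print(s1 + s2)

# add() with fill_value=0 treats a missing label as 0 instead
filled = s1.add(s2, fill_value=0)
print(filled)
```

Here "B" and "C" exist in both Series and are summed, while "A" and "D" keep their single value in `filled`.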

---
<h4>Otherwise, a Pandas Series behaves much like a NumPy array: arithmetic and boolean masking work element by element and keep the index. The .values attribute gives the underlying NumPy array, which no longer carries the index.</h4>

---
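
A short sketch of the difference between working on a Series directly and working on its underlying array. The Series `s` and the use of `.to_numpy()` (a newer equivalent of `.values`) are assumptions added for illustration.

```python
import pandas

s = pandas.Series([9, 9, 3], index=["x", "y", "z"])

# Arithmetic and masking on the Series itself keep the index labels
print(s * 2)
print(s[s > 5])

# .to_numpy() (or the older .values) returns a plain NumPy array, without the index
raw = s.to_numpy()
print(raw)        # [9 9 3]
print(type(raw))  # <class 'numpy.ndarray'>

# Wrap a NumPy result back into a Series to reattach the labels
squared = pandas.Series(raw ** 2, index=s.index)
print(squared)
```
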

---
<h4>We can now move on to DataFrames, the main structure that data scientists use! A DataFrame is a table of rows and columns, much like a sheet in Excel or Google Sheets.</h4>

<h4>A Pandas DataFrame is essentially a collection of Pandas Series that share the same index, and each Series is called a column.</h4>

<img src="./images/dataframe.jpg" height="300" width="400">

---

In [38]:
age = pandas.Series([19, 20, 30], index=["A", "B", "C"])
grade = pandas.Series([65, 66, 78], index=["A", "B", "C"])

D1 = pandas.DataFrame({"age": age, "grade": grade})
D1

   age  grade
A   19     65
B   20     66
C   30     78


In [28]:
# check the name of our columns (individual series)
print(D1.columns, "\n")

# select a series from the dataframe (for example, age)
print(D1["age"], "\n")

# select the index of the dataframe (collective index from individual series)
print(D1.index, "\n")

Index(['age', 'grade'], dtype='object') 

A    19
B    20
C    30
Name: age, dtype: int64 

Index(['A', 'B', 'C'], dtype='object') 



In [49]:
# You can also make a dataframe like so:

D1 = pandas.DataFrame({"age": [19, 20, 30], "grade": [65, 66, 78]}, index=["A", "B", "C"])
D1

   age  grade
A   19     65
B   20     66
C   30     78


---
<h4>Operations on DataFrames also work element by element, just like NumPy arrays! This workshop focuses mostly on DataFrames, and the next topics cover the essentials for working with them fluently.</h4>
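
As a rough illustration of element-wise DataFrame operations, the sketch below rebuilds the small age/grade table. The name `df` and the `curved` column are hypothetical additions for this example.

```python
import pandas

df = pandas.DataFrame(
    {"age": [19, 20, 30], "grade": [65, 66, 78]},
    index=["A", "B", "C"],
)

# A scalar is applied to every cell, like a NumPy array
print(df + 1)

# Arithmetic on one column returns a Series with the same index
curved = df["grade"] + 5
print(curved)

# Assigning a Series to a new name adds it as a column
df["curved"] = curved

# Reductions such as mean() run column by column
print(df.mean())
```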

---

<h2>Topic 1 Practice</h2>

<h4>Scenario 1: Alice is a school administrator, and she is tasked with creating a table system for a primary school.

The data is as follows:</h4>

<table border="1" style="border-collapse: collapse; width: 60%;">
  <thead>
    <tr>
      <th>student_id</th>
      <th>name</th>
      <th>age</th>
      <th>grade</th>
      <th>subject</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1221</td>
      <td>jamie</td>
      <td>7</td>
      <td>70</td>
      <td>Biology</td>
    </tr>
    <tr>
      <td>2557</td>
      <td>charles</td>
      <td>8</td>
      <td>65</td>
      <td>Chemistry</td>
    </tr>
    <tr>
      <td>1882</td>
      <td>jenny</td>
      <td>7</td>
      <td>54</td>
      <td>Chemistry</td>
    </tr>
  </tbody>
</table>






<h4>Task 1: Create a DataFrame with student_id as the index and name, age, grade, subject as columns</h4>

<h4>Task 2: Find all rows where age is 7 (hint: treat the column as an ndarray and look at boolean indexing above!)</h4>

<h4>Task 3: Add 10 to jamie's grade (hint: how do you usually update values in a list? What if the list indices were DataFrame index labels?)</h4>

<h2>Topic 2: Data Querying</h2>

---

<h2>Topic 3: Data Manipulation</h2>

---

<h2>Topic 4: Working with Multiple Tables</h2>

---