<h2>Topic 1: Introduction to Pandas</h2>

---

In [227]:
%%capture
!pip install pandas
!pip install numpy


In [228]:
import pandas
import numpy

---
<h3>Problem 1: How do we add two lists together like this:</h3>

<img src="./images/list.jpg" width="400" height="300">

---

In [229]:
L1 = [5, 3, 2, 7]
L2 = [2, 3, 4, 1]
L3 = L1 + L2
print(L3)

[5, 3, 2, 7, 2, 3, 4, 1]


---
<h4>Normal Python lists support concatenation: the + operator joins two lists end to end. To add the elements pairwise without writing a for loop, we use a NumPy array:</h4>

---

In [230]:
L1 = numpy.array(L1)
L2 = numpy.array(L2)
L3 = L1 + L2
print(L3)

# If we still want to concatenate we can simply do:

L3 = numpy.concatenate((L1, L2))
print(L3)

[7 6 6 8]
[5 3 2 7 2 3 4 1]


---
<h4>NumPy's real power shows with higher-dimensional data (e.g. matrices), which it fully supports:</h4>

---

In [231]:
M1 = [[1, 2], [3, 4]]
M1 = numpy.array(M1)

print(M1 + 5, "\n") # Adds 5 to ALL entries

print(M1 * 5, "\n") # Multiplies ALL entries by 5

print(M1 + M1 ** 2, "\n") # Adds the element-wise square of M1 to M1 (note that this is NOT matrix multiplication)

print(M1 @ M1) # Matrix Multiplication (we won't get into this)

[[6 7]
 [8 9]] 

[[ 5 10]
 [15 20]] 

[[ 2  6]
 [12 20]] 

[[ 7 10]
 [15 22]]


---

<h4>Problem 2: How do we represent missing data?</h4>

---

In [232]:
# We can use numpy.nan to represent missing data; this value is also called null or NaN

MyList = [1, 5, 3, 5, numpy.nan]
L1 = numpy.array(MyList)

# Operations leave null positions as NaN: the result is missing wherever the input was missing

L1 + 7

array([ 8., 12., 10., 12., nan])
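<p>To make the NaN behaviour concrete, here is a minimal sketch (the name <code>values</code> is only illustrative). It shows that arithmetic keeps NaN in place, that a plain sum also becomes NaN, and that NumPy's NaN-aware helpers can skip or locate the missing entries.</p>

```python
import numpy

values = numpy.array([1, 5, 3, 5, numpy.nan])

# Element-wise arithmetic keeps NaN where it was
shifted = values + 7

# A plain sum is affected by the NaN and is itself NaN
plain_total = values.sum()

# numpy.nansum skips the NaN entries when summing
clean_total = numpy.nansum(values)

# numpy.isnan builds a boolean array marking the missing positions
missing = numpy.isnan(values)

print(shifted)
print(plain_total)
print(clean_total)
print(missing)
```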

---

<h4>Finally, an important feature of NumPy arrays is boolean indexing (or masking):</h4>

<img src="./images/boolean.jpg" height=300 width=500>

---

In [233]:
L1 = numpy.array([1, 2, 5, 6, 3, 4])

# Step 1
L2 = L1 > 3 # This creates a boolean mask
print(L2)

# Step 2
L3 = L1[L2] # This applies the boolean mask
print(L3)

# But it's easier to just do:
L3 = L1[L1 > 3]
print(L3)

[False False  True  True False  True]
[5 6 4]
[5 6 4]
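<p>Masks can also be combined. The sketch below (with an illustrative array called <code>scores</code>) uses <code>&</code> for "and", <code>|</code> for "or" and <code>~</code> for "not". Each condition needs its own parentheses because these operators bind more tightly than comparisons.</p>

```python
import numpy

scores = numpy.array([1, 2, 5, 6, 3, 4])

# & means "and": keep values greater than 2 AND less than 6
middle = scores[(scores > 2) & (scores < 6)]

# | means "or": keep values below 2 OR above 5
extremes = scores[(scores < 2) | (scores > 5)]

# ~ negates a mask: keep values that are NOT greater than 3
small = scores[~(scores > 3)]

print(middle)
print(extremes)
print(small)
```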


---

<h4>Now let's get into Pandas data structures, which are built on NumPy arrays! Pandas has two main data structures:</h4>

#### 1. Series
#### 2. DataFrame

---

<h4> Pandas Series: </h4>

<img src="./images/series.jpg" height=300 width=400>

---

In [234]:
# Pandas Series

S1 = pandas.Series([9, 9, 3], index=[1, 2, 3])
S1

1    9
2    9
3    3
dtype: int64

---

<h4>Note that operations between Pandas Series are aligned by index label, not by position! Labels that appear in only one Series produce NaN, and the result's index is the union of both indices.</h4>

<img src="./images/seriesadd.jpg" height=300 width=500>

---

In [235]:
# for example

S1 = pandas.Series([20, 23, 34], index=["A", 2, 3])
S2 = pandas.Series([66, 65, 55, 43], index=[1, "A", 3, "Z"])
S1 + S2

1     NaN
2     NaN
3    89.0
A    85.0
Z     NaN
dtype: float64

---
<h4>Most other operations on Pandas Series work like NumPy arrays and keep the index. If we need the raw NumPy array, .values returns the data without the index.</h4>
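<p>As a small sketch of this (the Series <code>prices</code> is made up for illustration): arithmetic and boolean masks applied directly to a Series keep the index, while the array returned by <code>.values</code> or <code>.to_numpy()</code> carries only the data.</p>

```python
import pandas

prices = pandas.Series([9, 9, 3], index=["a", "b", "c"])

# Arithmetic works element-wise on the Series and keeps the index
doubled = prices * 2

# Boolean masking works too, and the surviving rows keep their labels
large = prices[prices > 5]

# .to_numpy() (like .values) returns the underlying NumPy array, without the index
raw = prices.to_numpy()

print(doubled)
print(large)
print(type(raw), raw)
```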

---

---
<h4>We can now move on to DataFrames, the main structure that data scientists use! A DataFrame is a table of rows and columns, much like a sheet in Excel or Google Sheets.</h4>

<h4>A Pandas DataFrame is literally just a collection of Pandas Series that share the same index! And each series is called a column.</h4>

<img src="./images/dataframe.jpg" height=300 width=400>

---

In [236]:
age = pandas.Series([19, 20, 30], index=["A", "B", "C"])
grade = pandas.Series([65, 66, 78], index=["A", "B", "C"])

D1 = pandas.DataFrame({"age": age, "grade": grade})
D1

Unnamed: 0,age,grade
A,19,65
B,20,66
C,30,78


In [237]:
# check the name of our columns (individual series)
print(D1.columns, "\n")

# select series from dataframe (for ex age)
print(D1["age"], "\n")

# select the index of the dataframe (collective index from individual series)
print(D1.index, "\n")

Index(['age', 'grade'], dtype='object') 

A    19
B    20
C    30
Name: age, dtype: int64 

Index(['A', 'B', 'C'], dtype='object') 



In [238]:
# You can also make a dataframe like so:

D1 = pandas.DataFrame({"age": [19, 20, 30], "grade": [65, 66, 78]}, index=["A", "B", "C"])
D1

Unnamed: 0,age,grade
A,19,65
B,20,66
C,30,78


---
<h4>Arithmetic on DataFrames is also element-wise, just like NumPy arrays! This workshop will focus more on DataFrames. The next topic covers querying data and working with multiple DataFrames!</h4>
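<p>A minimal sketch of element-wise DataFrame operations, reusing the age and grade values from above under the illustrative name <code>table</code>:</p>

```python
import pandas

table = pandas.DataFrame(
    {"age": [19, 20, 30], "grade": [65, 66, 78]},
    index=["A", "B", "C"],
)

# Adding a number applies it to every entry in every column
shifted = table + 5

# Arithmetic on one column is element-wise and returns a Series
fraction = table["grade"] / 100

# Aggregations such as mean() work column by column
column_means = table.mean()

print(shifted)
print(fraction)
print(column_means)
```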

---

<h2>Topic 1 Practice</h2>

<h4>Scenario 1: Bob is a supermarket manager who is tasked with managing the price of each item in two supermarkets.

The data is as follows:</h4>

<table border="1" style="border-collapse: collapse; width: 20%;">
    <caption>Supermarket 1</caption>
  <thead>
    <tr>
      <th>item_name</th>
        <th>price</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>eggs</td>
      <td>2.5</td>
    </tr>
    <tr>
      <td>chicken</td>
      <td>5.5</td>
    </tr>
    <tr>
      <td>spinach</td>
      <td>2.3</td>
    </tr>
    <tr>
      <td>banana</td>
      <td>1.7</td>
    </tr>
    <tr>
        <td>lamb</td>
        <td>8.7</td>
      </tr>
     <tr>
         <td>milk</td>
         <td>3.5</td>
      </tr>
  </tbody>
</table>

<table border="1" style="border-collapse: collapse; width: 20%;">
   <caption>Supermarket 2</caption>
  <thead>
    <tr>
      <th>item_name</th>
        <th>price</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>eggs</td>
      <td>2.2</td>
    </tr>
    <tr>
      <td>beef</td>
      <td>7.6</td>
    </tr>
    <tr>
      <td>spinach</td>
      <td>3.6</td>
    </tr>
    <tr>
      <td>banana</td>
      <td>2</td>
    </tr>
    <tr>
        <td>lamb</td>
        <td>9.3</td>
      </tr>
     <tr>
         <td>milk</td>
         <td>3.2</td>
      </tr>
      <tr>
          <td>cheese</td>
          <td>2.4</td>
      </tr>
  </tbody>
</table>

<h4>Task 1: Create two series, one containing the data for supermarket 1, call it S1, and the other for supermarket 2, call it S2. The index for these two should be item_name </h4>

<h4>Task 2: Add the two series together and call the new series S3. How many null values are there? Replace each null value with the price from whichever supermarket has it (e.g. null + 5 = 5).</h4>

<h4>Task 3: Add the prices for beef and cheese to supermarket 1 (6.5 and 2.5 respectively), and add the price for chicken to supermarket 2 (4.7). Hint: series[x] = y creates a new entry with index x and value y!</h4>

<h4>Task 4: Add S1 and S2 together again, then divide S1 by the sum. What can you say about the prices in supermarket 1 compared with supermarket 2?</h4>

<h4>Scenario 2: Alice is a school administrator who is tasked with creating a table system for a primary school.

The data is as follows:</h4>

<table border="1" style="border-collapse: collapse; width: 30%;">
  <thead>
    <tr>
      <th>student_id</th>
        <th>name</th>
      <th>age</th>
      <th>grade</th>
      <th>subject</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1221</td>
      <td>jamie</td>
      <td>7</td>
        <td>70</td>
        <td>Biology</td>
    </tr>
    <tr>
      <td>2557</td>
      <td>charles</td>
      <td>8</td>
        <td>65</td>
        <td>Chemistry</td>
    </tr>
    <tr>
      <td>1882</td>
      <td>jenny</td>
      <td>7</td>
        <td>54</td>
        <td>Chemistry</td>
    </tr>
  </tbody>
</table>






<h4>Task 1: Create a DataFrame with student_id as the index and name, age, grade, subject as columns</h4>

<h4>Task 2: Find all rows where age is 7 (hint: treat the column as an ndarray and look at boolean indexing above!)</h4>

<h4>Task 3: Add 10 to jamie's grade (hint: how do you usually update values in lists? What if list indexes are DataFrame indexes?)</h4>

<h4>Task 4: Add a new column to the table and call it "grade_age_ratio". This column should be the result of grade / age (hint: how did you add a new entry to a series? What if each entry in a dataframe is a column?)</h4>

<h2>Topic 2: Data Querying, Manipulating and Working with Multiple Data</h2>


<h4>We have seen how to create DataFrames and do basic querying and updating. But we have not yet seen how to inspect data, insert or delete rows and columns, or do more advanced operations.</h4>


<h4>Scenario 0: You're analyzing user activity on an online learning platform.</h4>

In [239]:
df = pandas.DataFrame({
    "country": ["US", "CA", "US", "IN", "CA", None],
    "course": ["Python", "Python", "Data Science", "Python", "Data Science", "Python"],
    "hours_watched": [10, 5, numpy.nan, 8, 12, 3],
    "completed": [True, False, False, True, True, False],
    "rating": [5, 4, numpy.nan, 5, 4, 3],
    "signup_date": pandas.to_datetime([
        "2023-01-10", "2023-01-12", "2023-02-01",
        "2023-02-10", "2023-02-15", "2023-03-01"
    ])
}, index=[101, 102, 103, 104, 105, 106])

df

Unnamed: 0,country,course,hours_watched,completed,rating,signup_date
101,US,Python,10.0,True,5.0,2023-01-10
102,CA,Python,5.0,False,4.0,2023-01-12
103,US,Data Science,,False,,2023-02-01
104,IN,Python,8.0,True,5.0,2023-02-10
105,CA,Data Science,12.0,True,4.0,2023-02-15
106,,Python,3.0,False,3.0,2023-03-01


<h4>Inspecting the Data</h4>

In [240]:
df.head(2) # Selects top 2 rows

Unnamed: 0,country,course,hours_watched,completed,rating,signup_date
101,US,Python,10.0,True,5.0,2023-01-10
102,CA,Python,5.0,False,4.0,2023-01-12


In [241]:
df.tail(2) # Selects bottom 2 rows

Unnamed: 0,country,course,hours_watched,completed,rating,signup_date
105,CA,Data Science,12.0,True,4.0,2023-02-15
106,,Python,3.0,False,3.0,2023-03-01


In [242]:
df.info() # The information of the data (entries, how many nulls, data type, memory)

<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, 101 to 106
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   country        5 non-null      object        
 1   course         6 non-null      object        
 2   hours_watched  5 non-null      float64       
 3   completed      6 non-null      bool          
 4   rating         5 non-null      float64       
 5   signup_date    6 non-null      datetime64[ns]
dtypes: bool(1), datetime64[ns](1), float64(2), object(2)
memory usage: 294.0+ bytes


In [243]:
df.describe() # the basic statistical properties of the data (count, mean, percentiles, etc)

Unnamed: 0,hours_watched,rating,signup_date
count,5.0,5.0,6
mean,7.6,4.2,2023-02-02 12:00:00
min,3.0,3.0,2023-01-10 00:00:00
25%,5.0,4.0,2023-01-17 00:00:00
50%,8.0,4.0,2023-02-05 12:00:00
75%,10.0,5.0,2023-02-13 18:00:00
max,12.0,5.0,2023-03-01 00:00:00
std,3.646917,0.83666,


<h4>Data Querying and Manipulation</h4>

In [244]:
df

Unnamed: 0,country,course,hours_watched,completed,rating,signup_date
101,US,Python,10.0,True,5.0,2023-01-10
102,CA,Python,5.0,False,4.0,2023-01-12
103,US,Data Science,,False,,2023-02-01
104,IN,Python,8.0,True,5.0,2023-02-10
105,CA,Data Science,12.0,True,4.0,2023-02-15
106,,Python,3.0,False,3.0,2023-03-01


<h4>Task 1: Select the "course" column</h4>

<h4>Task 2: Select the row with index 101</h4>

<h4>Task 3: Select the rows where hours_watched is more than 8</h4>

<h4>Task 4: Select the rows where hours_watched is more than 8 and completed is True</h4>

<h4>Task 5: Remove the rating column (hint: axis=0 is for indices, and axis=1 is for columns)</h4>

<h4>Task 6: Remove the row with index 105</h4>

<h4>Task 7: Add a new row with these details:</h4>
    
<h4>1. id is 108, country is UK, course is Python, hours_watched is 7.5, completed is False, signup_date is null</h4>

<h4>Task 8: Fill in null values for hours_watched (7.4), completed (False), and rating (5.4) (hint: df.fillna({x:val, y: val2}))</h4>
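<p>The tasks above rely on a handful of DataFrame operations. The sketch below demonstrates them on a separate, made-up table called <code>pets</code> (its columns and values are purely illustrative), so the same calls still need to be adapted to <code>df</code>.</p>

```python
import numpy
import pandas

pets = pandas.DataFrame(
    {
        "species": ["cat", "dog", "cat"],
        "weight": [4.0, numpy.nan, 5.5],
        "vaccinated": [True, False, True],
    },
    index=[1, 2, 3],
)

# .loc selects a row by its index label
row = pets.loc[1]

# Combine conditions with & and wrap each comparison in parentheses
heavy_vaccinated = pets[(pets["weight"] > 4.5) & pets["vaccinated"]]

# drop returns a new DataFrame: axis=1 removes a column, axis=0 removes a row
without_column = pets.drop("vaccinated", axis=1)
without_row = pets.drop(2, axis=0)

# Assigning to a new index label with .loc appends a row
pets.loc[4] = ["dog", 12.0, False]

# fillna with a dict fills each named column with its own replacement value
filled = pets.fillna({"weight": 0.0})

print(heavy_vaccinated)
print(filled)
```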

<h2>Working with Multiple Data</h2>

<h4>You received new data for the online learning platform</h4>

In [246]:
df_new = pandas.DataFrame({
    "country": ["ID", "UK", "PH", "FR", "FR", "CN"],
    "course": ["Data Science", "Python", "Data Science", "Python", "Python", "Python"],
    "hours_watched": [9, 6, 7, 2, 11, 5],
    "completed": [False, False, True, False, True, False],
    "signup_date": pandas.to_datetime([
        "2023-02-12", "2023-03-9", "2023-04-10",
        "2023-02-15", "2023-03-15", "2023-02-11"
    ])
}, index=[108, 109, 110, 111, 112, 113])

df_new

Unnamed: 0,country,course,hours_watched,completed,signup_date
108,ID,Data Science,9,False,2023-02-12
109,UK,Python,6,False,2023-03-09
110,PH,Data Science,7,True,2023-04-10
111,FR,Python,2,False,2023-02-15
112,FR,Python,11,True,2023-03-15
113,CN,Python,5,False,2023-02-11


<h4>You have also received new features for some users</h4>

In [247]:
df_info = pandas.DataFrame({
    "review": [9.7, 9.6, 7.6, 5.5, 6.4, 7, 6, 9, 5],
    "failures": [3, 4, 2, 2, 6, 10, 12, 0, 7],
}, index=[101, 102, 104, 107, 110, 111, 112, 114, 120])

df_info

Unnamed: 0,review,failures
101,9.7,3
102,9.6,4
104,7.6,2
107,5.5,2
110,6.4,6
111,7.0,10
112,6.0,12
114,9.0,0
120,5.0,7


<h4>Task 1: Add the new users to our table (hint: pandas.concat([df1, df2]))</h4>

<h4>Task 2: Make a new table called original by adding the new features to the original table, keeping all the users in the original table (hint: use left.merge(right, left_index=True, right_index=True, how=x))</h4>

<img src="images/leftjoin.jpg" height=200 width=400>

<h4>Task 3: Make a table called big_df and include all users from both tables</h4>

<img src="./images/outerjoin.jpg" height=300 width=400>

<h4>Task 4: Finally, make a new table called df_exc and do the same, but this time include only users that appear in both tables (i.e. if a user doesn't exist in the other table, don't include them)</h4>

<img src="./images/innerjoin.jpg" height=200 width=400>
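<p>As a rough illustration of the row-stacking and join types pictured above, here is a sketch on three tiny made-up tables (<code>left</code>, <code>right</code> and <code>more_rows</code> are hypothetical names, not part of the workshop data). It stacks rows with <code>pandas.concat</code> and compares the <code>how</code> options of <code>merge</code> when joining on the index.</p>

```python
import pandas

left = pandas.DataFrame({"hours": [10, 5, 8]}, index=[1, 2, 3])
right = pandas.DataFrame({"review": [9.0, 7.5]}, index=[2, 4])
more_rows = pandas.DataFrame({"hours": [3]}, index=[5])

# concat stacks the rows of several DataFrames on top of each other
stacked = pandas.concat([left, more_rows])

# how="left": keep every row of `left`; rows with no match get NaN
left_join = left.merge(right, left_index=True, right_index=True, how="left")

# how="outer": keep every index label that appears in either table
outer_join = left.merge(right, left_index=True, right_index=True, how="outer")

# how="inner": keep only index labels that appear in both tables
inner_join = left.merge(right, left_index=True, right_index=True, how="inner")

print(stacked)
print(left_join)
print(outer_join)
print(inner_join)
```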

<h2>Topic 2 Practice</h2>

<h4>Scenario 1: Suspicious Engagement Analysis</h4>


<p>Your platform suspects that some users may be gaming course completions to get certificates without actually watching the content</p>

In [252]:

new_users = pandas.DataFrame({
    "country": ["IN", "US", "VN", "TH"],
    "course": ["Python", "Data Science", "Data Science", "Python"],
    "hours_watched": [1, 7, 2, 13],
    "completed": [True, True, True, True],
    "signup_date": pandas.to_datetime(["2023-01-10", "2023-01-23", "2023-02-07", "2023-01-08"]),
    "review": [10, 4.5, 9, 6.5],
    "failures": [0, 9, 1, 5]
}, index = [114, 115, 116, 117])


<h4>Task 1: Add these new users to our database of users (use df_exc as our new database)</h4>

<h4>Task 2: Add a new boolean column called suspicious_user. A user is suspicious if both of the following are true:

1. User completed the course
2. User's hours_watched is below the 25th percentile (hint: use the x.quantile function)</h4>

<h4>Task 3: Add a new column called "expected_hours" that holds the median hours_watched for the user's course</h4>
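<p>For reference, here is a sketch of the two building blocks these tasks lean on, shown on an unrelated, made-up <code>orders</code> table: <code>quantile</code> for a percentile cutoff, and <code>groupby</code> with <code>transform</code> for a per-group statistic that lines up with the original rows. The column names are illustrative assumptions.</p>

```python
import pandas

orders = pandas.DataFrame(
    {
        "shop": ["north", "north", "south", "south", "south"],
        "amount": [10, 30, 5, 15, 40],
    }
)

# quantile(0.25) returns the 25th percentile of the column
cutoff = orders["amount"].quantile(0.25)

# Comparing against the cutoff gives a boolean column
orders["is_small"] = orders["amount"] < cutoff

# groupby + transform computes one value per group and
# repeats it for every row belonging to that group
orders["shop_median"] = orders.groupby("shop")["amount"].transform("median")

print(cutoff)
print(orders)
```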

<h4>Scenario 2: Mentor Program</h4>

<p>Your platform decides to integrate a peer-to-peer mentoring program</p>

In [256]:
mentors = pandas.DataFrame(
    {"mentor": [102, 110, 111, 114, 116, 101, 117, 115, 104, 112],
    "mentor_review": [10, 7, 7, 5, 7, 5, 9, 6, 12, 10]}, index=[101, 102, 104, 110, 111, 112, 114, 115, 116, 117]
)

<h4>Task 1: Merge the mentors DataFrame into our original DataFrame</h4>

<h4>Task 2: For each user, find the course_to_mentor ratio, which is the user's course review divided by their mentor review. What can you say about each course based on this information?</h4>

<h4>Task 3: Find the courses that produce the best mentors on average (look up the pandas.merge documentation for reference)</h4>