---

# Introduction to Pandas

Pandas is a library (a collection of modules) that data scientists use to manage and work with data, most often in a tabular format.

In the last session we briefly covered modules and how to import them when they live in our project directory (the same folder). So how do we import modules written by other people? We use what is called a package manager.

Package managers help our local environment install and manage external modules and libraries. Run the code below to see how we use pip to install a library:

---

In [851]:
!pip install pandas


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.1.1[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m


---
As you can see above, we use the "!pip install" command to install modules or libraries (the leading "!" tells the notebook to run a shell command). We can now import the module below:

---

In [852]:
import pandas

---
Remember the clinical data system from last session? We tried building a class system where we can store data in a tabular format. A far more efficient and general way to create tables like that is through pandas!

Before that, we need to see what data classes / systems pandas offers. Pandas primarily has two classes, as shown below:

---

In [853]:
# 1D data system
pandas.Series

pandas.core.series.Series

In [854]:
# 2D data system
pandas.DataFrame

pandas.core.frame.DataFrame

---
Let's start with Series!

A pandas Series is a 1-dimensional data system. Think of it as a single column with labels. Unlike Python lists, where indices are always numbers, the labels here can be any hashable value, similar to dictionary keys.

Pandas Series offer far more functionality than Python lists / dictionaries, including:

1. vectorised (element-wise) operations: adding a Series of [1, 2, 3] to a Series of [4, 5, 6] gives [5, 7, 9]
2. efficient indexing by label or position
3. built-in attributes that give you more information about your data (index labels, series values, data type, etc.)
4. built-in methods that let you efficiently do calculations / operations on the data without writing them from scratch (describe(), unique(), value_counts())
5. And many more!
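
As a quick illustration of point 1, here is a minimal sketch comparing the + operator on Python lists and on pandas Series. The variable names are our own and are not part of pandas:

```python
import pandas

# With plain Python lists, + joins the two lists together
list_result = [1, 2, 3] + [4, 5, 6]

# With pandas Series, + adds the values element by element
series_result = pandas.Series([1, 2, 3]) + pandas.Series([4, 5, 6])

print(list_result)             # [1, 2, 3, 4, 5, 6]
print(series_result.tolist())  # [5, 7, 9]
```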

This session won't cover all of pandas' functionality, just the parts that are widely used in data science. You can learn more about pandas through their documentation here: https://pandas.pydata.org/docs/

---

The code below shows how to initialise a pandas Series:

---

In [855]:
MySeries = pandas.Series([1, 2, 3], index=["A", "B", "C"])
MySeries

A    1
B    2
C    3
dtype: int64

---
Notice that the left side of the output shows the labels (also called the index, but in this session we'll call them labels) and the right side shows the values. Every label must be hashable, and labels are normally unique, although pandas does not strictly enforce uniqueness. See the example below:

---

In [856]:
Series1 = pandas.Series(["A", "B"], index=[(1, 2), (3, 4)])
Series2 = pandas.Series(["A", "B"], index=[[1, 2], [3, 4]])

Series1[(1, 2)]

'A'
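
A small side note on uniqueness: labels are usually unique, but pandas does not force them to be. The sketch below uses made-up data (the name repeated is our own) to show what happens when a label appears twice:

```python
import pandas

# "A" is used twice as a label
repeated = pandas.Series([1, 2, 3], index=["A", "A", "B"])

print(repeated.index.is_unique)  # False
print(repeated["A"].tolist())    # [1, 2] (both elements labelled "A")
print(repeated["B"])             # 3
```

Because a repeated label returns several elements instead of one, it is usually easier to keep labels unique.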

---
Now try accessing "A" in Series2 using its labels. Does it give an error?

---

In [857]:
# Your code here

---
# Basic Series Manipulation

## Accessing Values in Series

You can access values by position (numerical indexing, as in Python lists) or by label (like the example shown before).

You can also access values using boolean filters.

---

Examples

1. Find the first element
2. Find the element with label "D"
3. Find all elements whose values are less than 27

In [858]:
NewSeries = pandas.Series([10, 20, 30, 25, 60], index=["A", "B", "C", "D", "E"])

NewSeries.iloc[0] # first element by position (NewSeries[0] is deprecated when the labels are not integers)
NewSeries["D"] # gets the element with label "D"
NewSeries[NewSeries < 27] # keeps ALL elements (label + value pairs) where the value is less than 27
NewSeries[(NewSeries < 27) & (NewSeries % 2 == 1)] # keeps ALL elements where the value is less than 27 AND is odd
NewSeries[(NewSeries < 27) | ~(NewSeries % 2 == 1)] # keeps ALL elements where the value is less than 27 OR is NOT odd (an even number)


A    10
B    20
C    30
D    25
E    60
dtype: int64
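
Positions and labels are easy to mix up when the labels are themselves numbers. In the minimal sketch below (an example series of our own, not part of the session data), .loc always looks up by label and .iloc always looks up by position, which avoids any ambiguity:

```python
import pandas

# The labels are integers, but they are not in positional order
numbered = pandas.Series([10, 20, 30], index=[2, 0, 1])

by_label = numbered.loc[0]      # the element whose label is 0
by_position = numbered.iloc[0]  # the element in the first position

print(by_label)     # 20
print(by_position)  # 10
```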

---
Task

1. Find the last element, call it X
2. Find the element with label "C", call it Y
3. Find all elements whose values are greater than or equal to 30, add their values together and call the result Z
4. Output the sum of X, Y and Z

---

In [859]:
# Your code here

---
You can also get all the labels or all the values at once through the index and values attributes:

---

In [860]:
NewSeries.index # gets the labels as a pandas Index object

Index(['A', 'B', 'C', 'D', 'E'], dtype='object')

In [861]:
NewSeries.values # gets the values as a NumPy array

array([10, 20, 30, 25, 60])

---
## Adding / Removing Elements in Series

You can add / remove elements in a Series with the following operations:

1. To add an element, assign a value to a new label, very much like adding a key to a dictionary

2. To remove an element, pass its label to the .drop() method, which returns a new Series without that label

---

In [862]:
NewSeries["F"] = 20
NewSeries

A    10
B    20
C    30
D    25
E    60
F    20
dtype: int64

In [863]:
NewSeries = NewSeries.drop("F")
NewSeries

A    10
B    20
C    30
D    25
E    60
dtype: int64
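
Note that .drop() does not change the Series it is called on. It returns a new one, which is why the cell above assigns the result back to NewSeries. A minimal sketch with a separate example series (the names scores, without_b and without_a_c are our own):

```python
import pandas

scores = pandas.Series([10, 20, 30], index=["A", "B", "C"])

without_b = scores.drop("B")            # drop a single label
without_a_c = scores.drop(["A", "C"])   # drop several labels at once

print(list(scores.index))       # ['A', 'B', 'C'] (the original is unchanged)
print(list(without_b.index))    # ['A', 'C']
print(list(without_a_c.index))  # ['B']
```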

---
Task:

1. Add a new label called "F2" to NewSeries with the value 50
2. Remove the label "C" from NewSeries
---

In [864]:
# Your code here

---
## Modifying Elements in a Series

Modifying an element in a Series is super simple. It's pretty much the same as how you would do it in a dictionary!

---

In [865]:
NewSeries["A"] = 70
NewSeries

A    70
B    20
C    30
D    25
E    60
dtype: int64

---
The reason we use pandas instead of Python lists and dicts is that we often need tools to do common routines quickly, like calculating averages. Instead of writing a mean function from scratch every time we need one, we can just use pandas' built-in mean() method! The same goes for every other built-in method that pandas Series offer!

In addition, these operations are optimised for efficiency, so on large datasets pandas' mean() is typically much faster than a version written with plain Python loops!
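
To make this concrete, here is a minimal sketch that compares a hand-written mean with the built-in one. The function manual_mean is a hypothetical helper written for this illustration, not something pandas provides:

```python
import pandas

values = [15, 324, 2, 3, 32, 4, 3, 56, 2, 15, 15]

def manual_mean(numbers):
    """Mean written from scratch with plain Python."""
    return sum(numbers) / len(numbers)

series = pandas.Series(values)

print(manual_mean(values))  # mean computed by hand
print(series.mean())        # the same result from the built-in method
print(series.median())      # 15.0
print(series.max())         # 324
```

Both approaches give the same mean, but the built-in methods save us from writing and testing a helper for every statistic.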

---

There are dozens of built-in methods. We won't cover them in much detail, as the workshop mainly tries to build intuition, but these are some of our favourite examples:

---

In [866]:
TestSeries = pandas.Series([15, 324, 2, 3, 32, 4, 3, 56, 2, 15, 15])

TestSeries = TestSeries.sort_values()

x1 = TestSeries.describe()

x2 = TestSeries.value_counts()

In [867]:
TestSeries # Sorted Series!

2       2
8       2
6       3
3       3
5       4
0      15
9      15
10     15
4      32
7      56
1     324
dtype: int64

In [868]:
x1 # summary statistics

count     11.000000
mean      42.818182
std       94.702501
min        2.000000
25%        3.000000
50%       15.000000
75%       23.500000
max      324.000000
dtype: float64

In [869]:
x2 # frequency of unique values

15     3
3      2
2      2
4      1
32     1
56     1
324    1
Name: count, dtype: int64
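
The unique() method mentioned earlier works in a similar way. A short sketch using the same numbers as TestSeries, rebuilt here under the name numbers (our own) so the block runs by itself:

```python
import pandas

numbers = pandas.Series([15, 324, 2, 3, 32, 4, 3, 56, 2, 15, 15])

distinct_values = numbers.unique()   # distinct values, in order of first appearance
distinct_count = numbers.nunique()   # how many distinct values there are

print(distinct_values.tolist())  # [15, 324, 2, 3, 32, 4, 56]
print(distinct_count)            # 7
```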

---
## Working with Multiple Series in Python

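With pandas Series, you can use vectorised operations across several Series: the calculation is applied element by element, matching values that share the same label. Look at the example below: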

---

In [870]:
S1 = pandas.Series([10, 20, 25, 10, 20])
S2 = pandas.Series([35, 40, 15, 25, 26])

S3 = S1 + (S2 ** 0.5) - (S1 / (S1+S2))
S3

0    15.693858
1    25.991222
2    28.247983
3    14.714286
4    24.664237
dtype: float64
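
One detail worth knowing is that pandas matches values by label, not by position, when it combines two Series. The sketch below uses small made-up Series (first, second and partial are our own names) to show this, including what happens when a label is missing from one side:

```python
import pandas

first = pandas.Series([1, 2, 3], index=["x", "y", "z"])
second = pandas.Series([30, 10, 20], index=["z", "x", "y"])  # same labels, different order

combined = first + second  # values are matched by label, not by position

print(combined["x"])  # 11
print(combined["y"])  # 22
print(combined["z"])  # 33

# A label that exists in only one Series produces NaN (a missing value)
partial = first + pandas.Series([100], index=["x"])
print(partial["x"])          # 101.0
print(partial.isna().sum())  # 2
```

This means the order of the labels does not matter, as long as both Series use the same labels.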

---
## Problem 3: GB Cycling Summer Olympics Analysis

You are a sports data analyst for GB’s Olympic cycling team. You are given three pandas Series whose labels are all years:

1. total_medals = the total number of medals won
2. gold_medals = the total number of gold medals won
3. silver_medals = the total number of silver medals won

Problem: You don’t have access to the bronze medal counts, and the years in each series are shuffled (not sorted)

Goal:

1. Create a pandas Series with the total number of bronze medals won (note that total = bronze + silver + gold)
2. Find the percentage of medals that were gold for each year. Which year has the highest percentage of gold medals?
3. Find the year with the highest gold medal percentage among the years where the total number of medals is strictly between 3 and 13 (exclusive)

---

In [871]:
years = [1896, 1900, 1904, 1908, 1912, 1920, 1924, 1928, 1932, 1936, 1948, 1952, 1956, 1960, 1964, 1968, 1972, 1976, 1980, 1984, 1988, 1992, 1996, 2000, 2004, 2008, 2012, 2016, 2020, 2024]
total = [1, 2, 0, 9, 2, 2, 0, 3, 2, 0, 2, 1, 1, 0, 1, 0, 0, 1, 1, 2, 0, 1, 2, 4, 3, 14, 12, 12, 12, 3]
gold = [0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 8, 8, 6, 3, 1]
silver = [0, 1, 0, 3, 1, 1, 0, 2, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 4, 2, 4, 1, 2]

total_medals = pandas.Series(total, index=years)
gold_medals = pandas.Series(gold, index=years)
silver_medals = pandas.Series(silver, index=years)

In [872]:
# TASK 1

# Your code here

In [873]:
# TASK 2 

# Your code here

In [874]:
# TASK 3 

# Your code here

---
## DataFrames
Now let's move on to DataFrames. A DataFrame object in pandas is a 2-dimensional tabular data system with rows and columns (like a spreadsheet).

Each column in a DataFrame is a pandas Series, and all of these Series share the same labels (the row index). A DataFrame also has a set of column names, stored as a separate object just like the labels.

There are multiple ways to initialise a dataframe, but the one below is the clearest way:

---

In [875]:
rows = [
    [20, "Data Science", True], # row 1
    [22, "Computer Science", True], # row 2
    [19, "Mathematics", False], # row 3
]

df = pandas.DataFrame(rows, columns=["age", "degree", "is_first"], index=["1001", "2335", "5633"])
df

      age            degree  is_first
1001   20      Data Science      True
2335   22  Computer Science      True
5633   19       Mathematics     False
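
As one of the other ways to initialise a DataFrame, you can pass a list of dictionaries, where each dictionary is one row. This is a minimal sketch of that approach using the same student data (the names records and students are our own):

```python
import pandas

# Each dictionary is one row; the keys become the column names
records = [
    {"age": 20, "degree": "Data Science", "is_first": True},
    {"age": 22, "degree": "Computer Science", "is_first": True},
    {"age": 19, "degree": "Mathematics", "is_first": False},
]

students = pandas.DataFrame(records, index=["1001", "2335", "5633"])

print(list(students.columns))  # ['age', 'degree', 'is_first']
print(students.shape)          # (3, 3)
```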


---
As you can see below, each column is a Series!

---

In [876]:
type(df["age"])

pandas.core.series.Series

---
Unlike a Series, a DataFrame has values, labels and columns, each stored as a separate object. Look at the example below:

---

In [877]:
df.index

Index(['1001', '2335', '5633'], dtype='object')

In [878]:
df.columns

Index(['age', 'degree', 'is_first'], dtype='object')

In [879]:
df.values # This is a 2D NumPy array (a matrix), similar to a list of lists!

array([[20, 'Data Science', True],
       [22, 'Computer Science', True],
       [19, 'Mathematics', False]], dtype=object)
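
Two more attributes are handy for a quick overview of a DataFrame: shape and dtypes. The sketch below rebuilds the same table under the name table (our own) so that it runs by itself:

```python
import pandas

table = pandas.DataFrame(
    [[20, "Data Science", True], [22, "Computer Science", True], [19, "Mathematics", False]],
    columns=["age", "degree", "is_first"],
    index=["1001", "2335", "5633"],
)

print(table.shape)   # (3, 3): 3 rows and 3 columns
print(len(table))    # 3 (the number of rows)
print(table.dtypes)  # the data type of each column
```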

---
Accessing values is similar to Series, but it's column first (remember that in a Series it's label first). Modifying values works exactly as with a Series / dictionary: just add an equals sign and assign a new value. Look at the example below:

---

In [880]:
df["age"] # to access a column (pandas series)

1001    20
2335    22
5633    19
Name: age, dtype: int64

In [881]:
# change the ages of the age column
df["age"] = [21, 23, 20]
df

      age            degree  is_first
1001   21      Data Science      True
2335   23  Computer Science      True
5633   20       Mathematics     False


In [882]:
# Here we use the .loc indexer to access a specific value
# The first argument is the row label and the second is the column name
df.loc["1001", "age"]

np.int64(21)

In [883]:
# Change the age of 1001
df.loc["1001", "age"] = 19
df

      age            degree  is_first
1001   19      Data Science      True
2335   23  Computer Science      True
5633   20       Mathematics     False


---
Note that df["1001"] DOESN'T WORK!! Square brackets on a DataFrame look for a column called "1001", so this raises a KeyError.

To access values by label (returning all columns) we do this instead:

---

In [884]:
# This returns a Series where the labels are the column names and the values are that row's values!
df.loc["1001"] 

age                   19
degree      Data Science
is_first            True
Name: 1001, dtype: object
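
To see the difference between column lookup and row lookup for yourself, here is a minimal sketch on a rebuilt copy of the table. The names table, found_column and row are our own:

```python
import pandas

table = pandas.DataFrame(
    {"age": [19, 23, 20], "degree": ["Data Science", "Computer Science", "Mathematics"]},
    index=["1001", "2335", "5633"],
)

try:
    table["1001"]  # looks for a COLUMN called "1001"
    found_column = True
except KeyError:
    found_column = False  # no such column, so pandas raises a KeyError

row = table.loc["1001"]  # looks for a ROW labelled "1001"

print(found_column)  # False
print(row["age"])    # 19
```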

In [885]:
# changing the values for 1001
df.loc["1001"] = [20, "Data Science", True]
df


      age            degree  is_first
1001   20      Data Science      True
2335   23  Computer Science      True
5633   20       Mathematics     False


In [886]:
df.iloc[0] # This accesses a row by numerical position instead! (a Series has the same indexer: series.iloc[0])

age                   20
degree      Data Science
is_first            True
Name: 1001, dtype: object

---
Boolean filtering is very similar to Series too! We can filter rows by the values in a column!

(label based filtering is usually not recommended)

---

In [887]:
df[df["age"] == 20]

      age        degree  is_first
1001   20  Data Science      True
5633   20   Mathematics     False
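
Conditions on different columns can be combined with & (and) and | (or), just like with a Series. This is a minimal sketch on a rebuilt copy of the table (the names table, both and either are our own). Each comparison needs its own pair of brackets:

```python
import pandas

table = pandas.DataFrame(
    {
        "age": [20, 23, 20],
        "degree": ["Data Science", "Computer Science", "Mathematics"],
        "is_first": [True, True, False],
    },
    index=["1001", "2335", "5633"],
)

# Rows where age is 20 AND is_first is True
both = table[(table["age"] == 20) & table["is_first"]]

# Rows where age is above 20 OR the degree is Mathematics
either = table[(table["age"] > 20) | (table["degree"] == "Mathematics")]

print(list(both.index))    # ['1001']
print(list(either.index))  # ['2335', '5633']
```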


---
## Adding / Removing Rows and Columns

Adding columns and rows uses the same syntax as accessing them: assign values to a column name or row label that doesn't exist yet.

---

In [888]:
# Adding a column
df["grade"] = [75, 82, 65]
df

      age            degree  is_first  grade
1001   20      Data Science      True     75
2335   23  Computer Science      True     82
5633   20       Mathematics     False     65


In [889]:
# Adding a row
df.loc["7766"] = [20, "Physics", False, 57]
df

      age            degree  is_first  grade
1001   20      Data Science      True     75
2335   23  Computer Science      True     82
5633   20       Mathematics     False     65
7766   20           Physics     False     57


---

Removing rows / columns is a bit trickier, as it requires the axis argument in drop():

1. axis=0: drop rows
2. axis=1: drop columns

---

In [890]:
df = df.drop("2335", axis=0) # axis=0 means we look for the row with label 2335 to drop
df

      age        degree  is_first  grade
1001   20  Data Science      True     75
5633   20   Mathematics     False     65
7766   20       Physics     False     57


In [891]:
df = df.drop("is_first", axis=1) # axis=1 means we look for the column named is_first to drop
df

      age        degree  grade
1001   20  Data Science     75
5633   20   Mathematics     65
7766   20       Physics     57
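
If the axis numbers are hard to remember, drop() also accepts the index and columns keyword arguments, which say the same thing by name. A minimal sketch on a small example table (the names table, fewer_rows and fewer_columns are our own):

```python
import pandas

table = pandas.DataFrame(
    {"age": [20, 23, 20], "grade": [75, 82, 65], "is_first": [True, True, False]},
    index=["1001", "2335", "5633"],
)

fewer_rows = table.drop(index="2335")           # same as table.drop("2335", axis=0)
fewer_columns = table.drop(columns="is_first")  # same as table.drop("is_first", axis=1)

print(list(fewer_rows.index))       # ['1001', '5633']
print(list(fewer_columns.columns))  # ['age', 'grade']
```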


---
A very convenient method is .replace(current, new), which swaps every occurrence of one value for another. Look at the example below!

---

In [892]:
df = df.replace(20, 21) # dataframe wide replace
df

      age        degree  grade
1001   21  Data Science     75
5633   21   Mathematics     65
7766   21       Physics     57


In [893]:
df["grade"] = df["grade"].replace(65, 70) # column wide replace
df

      age        degree  grade
1001   21  Data Science     75
5633   21   Mathematics     70
7766   21       Physics     57
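
replace() can also take a dictionary that maps several old values to new ones in a single call. A minimal sketch with a made-up grades series (the names grades and updated are our own):

```python
import pandas

grades = pandas.Series([75, 65, 57], index=["1001", "5633", "7766"])

# Keys are the current values, dictionary values are their replacements
updated = grades.replace({65: 70, 57: 60})

print(updated.tolist())  # [75, 70, 60]
print(grades.tolist())   # [75, 65, 57] (the original is unchanged)
```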


---
Another convenient method is shift()! It moves all the values in your column by a given number of positions (1 in the example below) and fills the gap with NaN:

---

In [894]:
prev_grades = df["grade"].shift(1) # shifts all the values down by one row; -1 would shift them up
prev_grades

1001     NaN
5633    75.0
7766    70.0
Name: grade, dtype: float64
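
A common use of shift() is comparing each value with the one before it. The sketch below uses a made-up grades series (all variable names are our own) to compute the change from the previous row by hand, and checks it against the pct_change() method that pandas provides for the same purpose:

```python
import pandas

grades = pandas.Series([50, 75, 60], index=["1001", "5633", "7766"])

previous = grades.shift(1)    # each row now holds the value from the row above
following = grades.shift(-1)  # each row now holds the value from the row below

change = grades - previous                 # absolute change from the previous row
relative = (grades - previous) / previous  # change as a fraction of the previous value
builtin = grades.pct_change()              # pandas' own version of the line above

print(change.tolist()[1:])     # [25.0, -15.0]
print(relative.tolist()[1:])   # fractional changes for rows 2 and 3
print(following.tolist()[:2])  # [75.0, 60.0]
```

The first row has nothing above it, so its change is NaN, which is why the printed lists skip it.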

---
## Problem 4: GB General Summer Olympics Analyst

Context: You are the same analyst working on a bigger problem. You are now working with a pandas DataFrame where each column is the total number of medals won in a specific sport at the Olympics (not just cycling, as in the previous problem). The labels are years.

Goal:

1. Identify the top 2 best performing sports and the worst performing sports for the UK, based on historical averages.
2. Add 1 to every entry, then identify the year with the best improvement (highest average percentage change across all sports)
3. Identify the best performing cycling-biased year by performing the following operations:

    a. Divide all medal counts by that sport’s historical maximum. The new dataset should be called “domination_index”

    b. Find the year with the highest weighted sum of domination indices across all sports, giving cycling a weight of 0.5 and splitting the remaining weight equally among the other sports

---

In [895]:
years = [1896, 1900, 1904, 1908, 1912, 1920, 1924, 1928, 1932, 1936, 1948, 1952, 1956, 1960, 1964, 1968, 1972, 1976, 1980, 1984, 1988, 1992, 1996, 2000, 2004, 2008, 2012, 2016, 2020, 2024]
cycling = [1, 2, 0, 9, 2, 2, 0, 3, 2, 0, 2, 1, 1, 0, 1, 0, 0, 1, 1, 2, 0, 1, 2, 4, 3, 14, 12, 12, 12, 3]
athletics = [3, 2, 0, 18, 11, 8, 3, 2, 3, 4, 3, 4, 6, 2, 2, 2, 3, 1, 4, 7, 3, 1, 0, 1, 1, 4, 6, 7, 7, 3]
swimming = [1, 2, 0, 15, 7, 7, 4, 4, 3, 0, 1, 1, 3, 1, 0, 0, 1, 0, 2, 2, 1, 1, 1, 2, 2, 6, 3, 3, 8, 8]
rowing = [0, 1, 0, 8, 4, 2, 2, 4, 2, 2, 3, 0, 0, 0, 1, 0, 0, 2, 3, 1, 2, 2, 2, 3, 4, 6, 9, 5, 2, 4]
sailing = [0, 4, 0, 7, 2, 3, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 3, 5, 4, 5, 3, 5, 2]

# Another way to initialise a DataFrame: pass a dictionary that maps column names to column values
df = pandas.DataFrame({"cycling": cycling, "athletics": athletics, 
                       "swimming": swimming, "rowing": rowing,
                       "sailing": sailing}, index=years)

df.head(10) # Can you guess what this method does?? Look it up online!

      cycling  athletics  swimming  rowing  sailing
1896        1          3         1       0        0
1900        2          2         2       1        4
1904        0          0         0       0        0
1908        9         18        15       8        7
1912        2         11         7       4        2
1920        2          8         7       2        3
1924        0          3         4       2        0
1928        3          2         4       4        0
1932        2          3         3       2        0
1936        0          4         0       2        0


In [896]:
# TASK 1

# Your code here

In [899]:
# TASK 2

# Your code here

In [898]:
# TASK 3

# Your code here